Why
Operational stability still depends heavily on manual intervention when workers degrade or pools saturate. We need a formal operations phase to align conda/systemd runtime contracts and add controlled self-healing with guardrails, so recovery is faster without triggering restart storms.
What Changes
- Standardize conda-based runtime paths across app service, watchdog, and operational scripts from a single source of truth.
- Introduce a guarded worker self-healing policy (cooldown, churn windows, bounded retries, manual override).
- Add alert thresholds and machine-readable recovery signals for pool pressure, circuit-open persistence, and restart churn.
- Harden runbook documentation and scripts for deterministic restart, rollback, and incident triage.
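
The guardrails named above (cooldown, churn window, bounded retries, manual override) can be pictured with a minimal sketch. Everything here is hypothetical: the `RestartGuard` class, its field names, and the default values are illustrative assumptions, not the watchdog's actual interface. Manual override is interpreted as an operator switch that pauses automated restarts, which is one possible reading.

```python
from collections import deque
from dataclasses import dataclass, field
from typing import Deque, Tuple


@dataclass
class RestartGuard:
    """Hypothetical anti-storm guard for automated worker restarts."""

    cooldown_s: float = 60.0        # assumed minimum gap between two restarts
    churn_window_s: float = 600.0   # assumed sliding window for counting restarts
    max_restarts: int = 3           # bounded retries allowed inside the window
    manual_override: bool = False   # assumed operator switch that pauses self-healing
    _history: Deque[float] = field(default_factory=deque)

    def allow(self, now: float) -> Tuple[bool, str]:
        """Return (allowed, reason) for a restart requested at time `now`."""
        if self.manual_override:
            return False, "manual_override"
        # Forget restarts that have aged out of the churn window.
        while self._history and now - self._history[0] >= self.churn_window_s:
            self._history.popleft()
        if self._history and now - self._history[-1] < self.cooldown_s:
            return False, "cooldown"
        if len(self._history) >= self.max_restarts:
            return False, "churn_limit"
        return True, "ok"

    def record(self, now: float) -> None:
        """Remember that a restart was performed at time `now`."""
        self._history.append(now)
```

Returning a reason string alongside the decision keeps each blocked restart explainable, which fits the proposal's emphasis on auditable recovery actions.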
Capabilities
New Capabilities
- worker-self-healing-governance: Define safe autonomous recovery behavior with anti-storm guardrails.
Modified Capabilities
- conda-systemd-runtime-alignment: Extend runtime consistency requirements with startup validation and drift detection.
- runtime-resilience-recovery: Add auditable recovery-action requirements for automated and operator-triggered restart flows.
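
As an illustration of startup validation and drift detection, the sketch below compares the running interpreter's prefix with the conda environment path the deployment expects. The function name `detect_runtime_drift` and the idea of passing the expected prefix in from a shared config are assumptions made for this example; the proposal does not specify how the check is implemented.

```python
import sys
from pathlib import Path
from typing import List


def detect_runtime_drift(expected_prefix: str, actual_prefix: str = sys.prefix) -> List[str]:
    """Return drift findings; an empty list means the runtime matches.

    `expected_prefix` is assumed to come from the single source of truth
    for the conda environment path (hypothetical in this sketch).
    """
    findings: List[str] = []
    expected = Path(expected_prefix).resolve()
    actual = Path(actual_prefix).resolve()
    if not expected.is_dir():
        findings.append(f"expected conda env missing: {expected}")
    if actual != expected:
        findings.append(f"interpreter prefix {actual} != expected {expected}")
    return findings
```

A service could run such a check at startup and refuse to start, or only log a warning, when the list is non-empty; which of the two is appropriate is a policy decision left open here.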
Impact
- Affected code:
  - deploy/systemd/*.service
  - scripts/worker_watchdog.py
  - src/mes_dashboard/routes/admin_routes.py
  - src/mes_dashboard/routes/health_routes.py
  - src/mes_dashboard/core/database.py
  - src/mes_dashboard/core/circuit_breaker.py
  - tests/README.md, README.mdj, runbook docs
- APIs:
  - /health
  - /health/deep
  - /admin/api/system-status
  - /admin/api/worker/status
  - /admin/api/worker/restart
- Operational behavior:
  - Preserve the single-port bind model.
  - Add a controlled self-healing policy and clearer alert thresholds.
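
To make the alert thresholds and machine-readable recovery signals more concrete, here is a minimal sketch that evaluates pool pressure, circuit-open persistence, and restart churn, then emits a JSON payload. The threshold values, field names, and the `build_recovery_signal` helper are hypothetical; the actual payload of the health and admin endpoints is not defined in this proposal.

```python
import json
from dataclasses import dataclass


@dataclass(frozen=True)
class AlertThresholds:
    """Assumed example thresholds; real values would be tuned per deployment."""

    pool_saturation: float = 0.9    # fraction of pool connections in use
    circuit_open_s: float = 120.0   # seconds the circuit may stay open
    restarts_in_window: int = 3     # restarts tolerated per churn window


def build_recovery_signal(pool_in_use: int, pool_size: int,
                          circuit_open_s: float, restarts_in_window: int,
                          thresholds: AlertThresholds = AlertThresholds()) -> str:
    """Return a JSON string listing which alert conditions are active."""
    # Treat an empty or unknown pool as fully saturated to fail safe.
    saturation = pool_in_use / pool_size if pool_size else 1.0
    alerts = []
    if saturation >= thresholds.pool_saturation:
        alerts.append("pool_pressure")
    if circuit_open_s >= thresholds.circuit_open_s:
        alerts.append("circuit_open_persistent")
    if restarts_in_window >= thresholds.restarts_in_window:
        alerts.append("restart_churn")
    payload = {
        "status": "degraded" if alerts else "ok",
        "alerts": alerts,
        "pool_saturation": round(saturation, 3),
    }
    return json.dumps(payload, sort_keys=True)
```

Keeping the alert names as stable string codes lets monitoring tools match on them without parsing free-form messages.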