DashBoard/openspec/changes/archive/2026-02-08-p2-ops-self-healing-runbook/proposal.md

## Why
Operations stability still depends heavily on manual intervention when workers degrade or pools saturate. We need a formal operations phase to align conda/systemd runtime contracts and add controlled self-healing with guardrails, so recovery is faster without triggering restart storms.
## What Changes
- Standardize conda-based runtime paths across app service, watchdog, and operational scripts from a single source of truth.
- Introduce a guarded worker self-healing policy (cooldown, churn windows, bounded retries, manual override).
- Add alert thresholds and machine-readable recovery signals for pool pressure, circuit-open persistence, and restart churn.
- Harden runbook documentation and scripts for deterministic restart, rollback, and incident triage.
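
The guardrails listed above (cooldown, churn window, bounded retries, manual override) could be combined roughly as follows. This is a minimal sketch, not the watchdog's actual logic: `RestartGuard`, its field names, and the default values are all hypothetical, and it assumes that the manual override *blocks* automated restarts rather than forcing one.

```python
from collections import deque
from dataclasses import dataclass, field
from typing import Deque, Tuple


@dataclass
class RestartGuard:
    """Hypothetical guard that decides whether an automated restart may run."""

    cooldown_s: float = 60.0       # assumed minimum gap between automated restarts
    churn_window_s: float = 600.0  # assumed sliding window for counting restarts
    max_restarts: int = 3          # assumed retry bound inside the churn window
    manual_override: bool = False  # assumed operator switch that blocks automation
    _history: Deque[float] = field(default_factory=deque)

    def decide(self, now: float) -> Tuple[bool, str]:
        """Return (allowed, reason); record the restart only when allowed."""
        # Forget restarts that have aged out of the churn window.
        while self._history and now - self._history[0] >= self.churn_window_s:
            self._history.popleft()
        if self.manual_override:
            return False, "manual_override"
        if self._history and now - self._history[-1] < self.cooldown_s:
            return False, "cooldown"
        if len(self._history) >= self.max_restarts:
            return False, "churn_limit"
        self._history.append(now)
        return True, "allowed"


guard = RestartGuard()
print(guard.decide(0.0))   # first restart is allowed
print(guard.decide(30.0))  # blocked: still inside the cooldown
```

Returning a reason string alongside the decision is one way to make each refusal machine-readable, so that a blocked restart can be surfaced as an alert instead of failing silently.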
## Capabilities
### New Capabilities
- `worker-self-healing-governance`: Define safe autonomous recovery behavior with anti-storm guardrails.
### Modified Capabilities
- `conda-systemd-runtime-alignment`: Extend runtime consistency requirements with startup validation and drift detection.
- `runtime-resilience-recovery`: Add auditable recovery-action requirements for automated and operator-triggered restart flows.
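
As an illustration of startup validation and drift detection, the sketch below compares an expected runtime contract against the values a process actually observes. The function name `detect_drift`, the contract keys, and the example path are hypothetical; the proposal does not define the contract's shape.

```python
from typing import Dict, List


def detect_drift(expected: Dict[str, str], observed: Dict[str, str]) -> List[str]:
    """Return one message per contract key that is missing or differs."""
    problems: List[str] = []
    for key, want in sorted(expected.items()):
        got = observed.get(key)
        if got is None:
            problems.append(f"{key}: missing (expected {want})")
        elif got != want:
            problems.append(f"{key}: {got} != {want}")
    return problems


# Hypothetical contract; in practice it would come from the single source of truth.
contract = {"conda_env": "app", "python_bin": "/opt/conda/envs/app/bin/python"}
for message in detect_drift(contract, {"conda_env": "base"}):
    print(message)
```

A service could run such a check at startup and refuse to start, or report a degraded status, when the returned list is non-empty.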
## Impact
- Affected code:
  - `deploy/systemd/*.service`
  - `scripts/worker_watchdog.py`
  - `src/mes_dashboard/routes/admin_routes.py`
  - `src/mes_dashboard/routes/health_routes.py`
  - `src/mes_dashboard/core/database.py`
  - `src/mes_dashboard/core/circuit_breaker.py`
  - `tests/`
  - `README.md`, `README.mdj`, runbook docs
- APIs:
  - `/health`
  - `/health/deep`
  - `/admin/api/system-status`
  - `/admin/api/worker/status`
  - `/admin/api/worker/restart`
- Operational behavior:
  - Preserve single-port bind model.
  - Add controlled self-healing policy and clearer alert thresholds.
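
One possible shape for threshold-based, machine-readable recovery signals covering pool pressure, circuit-open persistence, and restart churn is sketched below. The metric names, threshold values, and payload fields are assumptions chosen for illustration; the proposal does not specify them, and this is not the actual response format of the health endpoints.

```python
from typing import Dict

# Hypothetical thresholds; real values would be tuned per deployment.
THRESHOLDS: Dict[str, float] = {
    "pool_usage_ratio": 0.9,        # fraction of pool connections in use
    "circuit_open_seconds": 120.0,  # how long the circuit has stayed open
    "restarts_in_window": 3,        # automated restarts inside the churn window
}


def evaluate_alerts(metrics: Dict[str, float]) -> Dict[str, object]:
    """Return a status plus the names of every metric at or above its threshold."""
    breached = [
        name for name, limit in THRESHOLDS.items()
        if metrics.get(name, 0) >= limit
    ]
    return {"status": "degraded" if breached else "ok", "alerts": breached}


print(evaluate_alerts({"pool_usage_ratio": 0.95, "restarts_in_window": 1}))
```

Keeping the thresholds in a single mapping makes them easy to document in the runbook and to assert against in `tests/`.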