Files
DashBoard/openspec/changes/archive/2026-02-08-p2-ops-self-healing-runbook/tasks.md

24 lines
1.2 KiB
Markdown

## 1. Conda/Systemd Contract Alignment
- [x] 1.1 Centralize runtime path configuration consumed by service units, watchdog, and scripts.
- [x] 1.2 Add startup validation that fails fast on conda path drift.
- [x] 1.3 Update systemd/watchdog integration tests for consistent runtime contract.
## 2. Worker Self-Healing Policy
- [x] 2.1 Implement bounded auto-restart policy (cooldown, retry budget, churn window).
- [x] 2.2 Add guarded mode behavior when churn threshold is exceeded.
- [x] 2.3 Implement authenticated manual override flow with explicit logging context.
## 3. Alerting and Operational Signals
- [x] 3.1 Expose policy-state fields in health/admin payloads (`allowed`, `cooldown`, `blocked`).
- [x] 3.2 Add structured audit events for restart decisions and override actions.
- [x] 3.3 Define alert thresholds and wire monitoring-friendly fields for pool/circuit/churn conditions.
## 4. Validation and Runbook Delivery
- [x] 4.1 Add tests for policy transitions, guarded mode, and override behavior.
- [x] 4.2 Validate single-port continuity during controlled recovery and hot reload paths.
- [x] 4.3 Update README/README.mdj and deployment runbook with verified operational procedures.