DashBoard/openspec/changes/archive/2026-02-08-p2-ops-self-healing-runbook/tasks.md

1. Conda/Systemd Contract Alignment

  • 1.1 Centralize runtime path configuration consumed by service units, watchdog, and scripts.
  • 1.2 Add startup validation that fails fast on conda path drift.
  • 1.3 Update systemd/watchdog integration tests to assert a consistent runtime contract.

2. Worker Self-Healing Policy

  • 2.1 Implement bounded auto-restart policy (cooldown, retry budget, churn window).
  • 2.2 Add guarded mode behavior when churn threshold is exceeded.
  • 2.3 Implement authenticated manual override flow with explicit logging context.
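
The policy described in 2.1 through 2.3 could be sketched roughly as follows. This is a minimal model under assumed semantics, not the project's implementation: the class name, default values, and decision strings are hypothetical, guarded mode is assumed to persist until a manual override, and authentication of the override caller is assumed to happen elsewhere.

```python
from collections import deque
from dataclasses import dataclass, field


@dataclass
class RestartPolicy:
    cooldown_s: float = 30.0        # minimum gap between two restarts
    retry_budget: int = 3           # restarts allowed inside one churn window
    churn_window_s: float = 300.0   # length of the sliding churn window
    guarded: bool = False           # set once the churn threshold is exceeded
    _history: deque = field(default_factory=deque)  # timestamps of allowed restarts

    def decide(self, now: float) -> str:
        """Return 'allowed', 'cooldown', or 'blocked' for a restart request at `now`."""
        # Forget restarts that have aged out of the churn window.
        while self._history and now - self._history[0] >= self.churn_window_s:
            self._history.popleft()
        if self.guarded:
            return "blocked"
        if len(self._history) >= self.retry_budget:
            self.guarded = True  # budget exhausted: enter guarded mode
            return "blocked"
        if self._history and now - self._history[-1] < self.cooldown_s:
            return "cooldown"
        self._history.append(now)
        return "allowed"

    def manual_override(self, now: float, actor: str) -> dict:
        """Leave guarded mode; returns a record the caller is expected to log."""
        self.guarded = False
        self._history.clear()
        return {"event": "restart_override", "actor": actor, "at": now}
```

Passing `now` explicitly rather than reading the clock inside the class keeps the transitions deterministic, which makes the policy easy to exercise in tests.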

3. Alerting and Operational Signals

  • 3.1 Expose policy-state fields in health/admin payloads (allowed, cooldown, blocked).
  • 3.2 Add structured audit events for restart decisions and override actions.
  • 3.3 Define alert thresholds and wire monitoring-friendly fields for pool/circuit/churn conditions.
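
For 3.1 and 3.2, one way to shape the payload fields and audit events is shown below. Only the three state names (allowed, cooldown, blocked) come from the task list; the logger name, function names, and remaining field names are assumptions made for illustration.

```python
import json
import logging
import time

audit_log = logging.getLogger("worker.audit")  # hypothetical logger name


def build_policy_state(decision: str, cooldown_remaining_s: float) -> dict:
    """Flatten a policy decision into fields a health/admin payload could expose."""
    return {
        "allowed": decision == "allowed",
        "cooldown": decision == "cooldown",
        "blocked": decision == "blocked",
        "cooldown_remaining_s": max(0.0, cooldown_remaining_s),
    }


def emit_audit_event(action: str, decision: str, actor: str = "system", **context) -> str:
    """Log one restart or override decision as a single JSON line and return it."""
    event = {
        "ts": time.time(),
        "action": action,
        "decision": decision,
        "actor": actor,
        **context,
    }
    line = json.dumps(event, sort_keys=True)
    audit_log.info(line)
    return line
```

Emitting one JSON object per line keeps the events easy to filter on `action`, `decision`, or `actor` when defining the alert thresholds in 3.3.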

4. Validation and Runbook Delivery

  • 4.1 Add tests for policy transitions, guarded mode, and override behavior.
  • 4.2 Validate single-port continuity during controlled recovery and hot reload paths.
  • 4.3 Update README/README.mdj and deployment runbook with verified operational procedures.