## Context The project already provides watchdog-assisted restart and resilience diagnostics, but policy boundaries for automated recovery are not yet formalized end-to-end. In practice, this can lead to either under-recovery (manual delays) or over-recovery (restart churn). We also need stronger conda/systemd path consistency checks to prevent runtime drift across deploy scripts and services. ## Goals / Non-Goals **Goals:** - Make conda/systemd runtime path contracts explicit, validated, and drift-detectable. - Implement safe self-healing policy with cooldown and churn limits. - Expose clear alert signals and recommended actions in health/admin payloads. - Keep operator manual override available for incident control. **Non-Goals:** - Migrating from systemd to another orchestrator. - Changing database vendor or introducing full autoscaling infrastructure. - Removing existing admin restart endpoints. ## Decisions 1. **Single source runtime contract** - Decision: centralize conda runtime path configuration consumed by systemd units, watchdog, and scripts. - Rationale: prevents mismatched interpreter/path drift. 2. **Guarded self-healing state machine** - Decision: implement bounded restart policy (cooldown + max retries per time window + circuit-open gating). - Rationale: recovers quickly while preventing restart storms. 3. **Explicit recovery observability contract** - Decision: enrich health/admin payloads with churn counters, cooldown state, and recommended operator action. - Rationale: enables deterministic triage and alert automation. 4. **Auditability requirement** - Decision: emit structured logs/events for auto-restart decision, manual override, and blocked restart attempts. - Rationale: supports incident retrospectives and policy tuning. 5. **Runbook-first rollout** - Decision: deploy policy changes behind documentation and validation gates, including rollback steps. - Rationale: operational safety for production adoption. ## Risks / Trade-offs - **[Risk] Overly strict policy delays recovery** → **Mitigation:** configurable thresholds and emergency manual override. - **[Risk] Aggressive policy causes churn loops** → **Mitigation:** hard stop on churn threshold breach and explicit cool-off windows. - **[Risk] Added operational complexity** → **Mitigation:** concise runbook with decision tables and tested scripts. - **[Risk] Drift detection false positives** → **Mitigation:** normalize path resolution and clearly defined comparison sources.