Context
The project already provides watchdog-assisted restart and resilience diagnostics, but policy boundaries for automated recovery are not yet formalized end-to-end. Without those boundaries, recovery can err in either direction: under-recovery (services wait on manual intervention) or over-recovery (restart churn). We also need stronger conda/systemd path consistency checks to prevent runtime drift across deploy scripts and services.
Goals / Non-Goals
Goals:
- Make conda/systemd runtime path contracts explicit, validated, and drift-detectable.
- Implement safe self-healing policy with cooldown and churn limits.
- Expose clear alert signals and recommended actions in health/admin payloads.
- Keep operator manual override available for incident control.
Non-Goals:
- Migrating from systemd to another orchestrator.
- Changing database vendor or introducing full autoscaling infrastructure.
- Removing existing admin restart endpoints.
Decisions
- Single source runtime contract
  - Decision: centralize conda runtime path configuration consumed by systemd units, watchdog, and scripts.
  - Rationale: prevents mismatched interpreter/path drift.
- Guarded self-healing state machine
  - Decision: implement bounded restart policy (cooldown + max retries per time window + circuit-open gating).
  - Rationale: recovers quickly while preventing restart storms.
- Explicit recovery observability contract
  - Decision: enrich health/admin payloads with churn counters, cooldown state, and recommended operator action.
  - Rationale: enables deterministic triage and alert automation.
- Auditability requirement
  - Decision: emit structured logs/events for auto-restart decision, manual override, and blocked restart attempts.
  - Rationale: supports incident retrospectives and policy tuning.
- Runbook-first rollout
  - Decision: deploy policy changes behind documentation and validation gates, including rollback steps.
  - Rationale: operational safety for production adoption.
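The guarded self-healing, observability, and auditability decisions above could fit together roughly as follows. This is a minimal sketch, not the project's implementation: the class name `RestartPolicy`, the helper `audit_event`, every field name, the payload keys, and the default thresholds are all assumptions chosen for illustration.

```python
import json
from collections import deque
from dataclasses import dataclass, field
from typing import Deque, Dict


@dataclass
class RestartPolicy:
    """Hypothetical bounded restart policy; names and defaults are illustrative."""
    cooldown_s: float = 60.0    # minimum gap between restarts
    max_restarts: int = 3       # churn limit inside the sliding window
    window_s: float = 600.0     # sliding window used to count restarts
    circuit_open: bool = False  # once open, only a manual override restarts
    _history: Deque[float] = field(default_factory=deque)

    def _prune(self, now: float) -> None:
        # Forget restarts that have aged out of the sliding window.
        while self._history and now - self._history[0] >= self.window_s:
            self._history.popleft()

    def decide(self, now: float, manual_override: bool = False) -> Dict[str, object]:
        """Decide whether a restart may proceed and describe why."""
        self._prune(now)
        if manual_override:
            # Operator control always wins and closes the circuit.
            self.circuit_open = False
            action, reason = "restart", "manual_override"
        elif self.circuit_open:
            action, reason = "blocked", "circuit_open"
        elif len(self._history) >= self.max_restarts:
            # Churn threshold breached: hard stop until an operator steps in.
            self.circuit_open = True
            action, reason = "blocked", "churn_limit"
        elif self._history and now - self._history[-1] < self.cooldown_s:
            action, reason = "blocked", "cooldown"
        else:
            action, reason = "restart", "auto"
        if action == "restart":
            self._history.append(now)
        return {
            "action": action,
            "reason": reason,
            "restarts_in_window": len(self._history),
            "circuit_open": self.circuit_open,
            "recommended_action": (
                "investigate_then_manual_restart" if self.circuit_open else "none"
            ),
        }


def audit_event(decision: Dict[str, object], now: float) -> str:
    """Serialize a decision as one structured log line (assumed JSON format)."""
    return json.dumps({"event": "restart_decision", "ts": now, **decision}, sort_keys=True)
```

In this sketch the dictionary returned by `decide` doubles as the health/admin payload fragment, and `audit_event` turns the same dictionary into a log line, so automatic restarts, manual overrides, and blocked attempts all leave the same record shape. The circuit stays open until an operator overrides it, which is one possible reading of "hard stop"; a timed half-open state would be an alternative.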
Risks / Trade-offs
- [Risk] Overly strict policy delays recovery → Mitigation: configurable thresholds and emergency manual override.
- [Risk] Aggressive policy causes churn loops → Mitigation: hard stop on churn threshold breach and explicit cool-off windows.
- [Risk] Added operational complexity → Mitigation: concise runbook with decision tables and tested scripts.
- [Risk] Drift detection false positives → Mitigation: normalize path resolution and clearly define comparison sources.
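To illustrate the last mitigation, the sketch below normalizes paths before comparing them, so that symlinks, `..` segments, and `~` do not register as drift. The function names `normalize` and `detect_drift` and the source labels in the usage example are hypothetical; how each source's interpreter path is actually collected is left out.

```python
import os
from typing import Dict, List


def normalize(path: str) -> str:
    """Expand ~ and environment variables, then resolve symlinks and '..' segments."""
    expanded = os.path.expandvars(os.path.expanduser(path))
    return os.path.normcase(os.path.realpath(expanded))


def detect_drift(expected: str, observed: Dict[str, str]) -> List[str]:
    """Return the names of sources whose normalized path differs from the expected one."""
    want = normalize(expected)
    return sorted(name for name, path in observed.items() if normalize(path) != want)
```

A call might look like `detect_drift(contract_path, {"systemd_unit": unit_path, "watchdog": watchdog_path, "deploy_script": script_path})`, where an empty list means every source agrees with the contract. Naming each comparison source explicitly in the input keeps the report traceable to the unit or script that drifted.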