Context

The project already provides watchdog-assisted restart and resilience diagnostics, but policy boundaries for automated recovery are not yet formalized end-to-end. In practice, this can lead to either under-recovery (manual delays) or over-recovery (restart churn). We also need stronger conda/systemd path consistency checks to prevent runtime drift across deploy scripts and services.

Goals / Non-Goals

Goals:

  • Make conda/systemd runtime path contracts explicit, validated, and drift-detectable.
  • Implement a safe self-healing policy with cooldown and churn limits.
  • Expose clear alert signals and recommended actions in health/admin payloads.
  • Keep operator manual override available for incident control.

Non-Goals:

  • Migrating from systemd to another orchestrator.
  • Changing database vendor or introducing full autoscaling infrastructure.
  • Removing existing admin restart endpoints.

Decisions

  1. Single source runtime contract

    • Decision: centralize conda runtime path configuration consumed by systemd units, watchdog, and scripts.
    • Rationale: prevents mismatched interpreter/path drift.
  2. Guarded self-healing state machine

    • Decision: implement a bounded restart policy (cooldown + max retries per time window + circuit-open gating).
    • Rationale: recovers quickly while preventing restart storms.
  3. Explicit recovery observability contract

    • Decision: enrich health/admin payloads with churn counters, cooldown state, and recommended operator action.
    • Rationale: enables deterministic triage and alert automation.
  4. Auditability requirement

    • Decision: emit structured logs/events for auto-restart decisions, manual overrides, and blocked restart attempts.
    • Rationale: supports incident retrospectives and policy tuning.
  5. Runbook-first rollout

    • Decision: deploy policy changes behind documentation and validation gates, including rollback steps.
    • Rationale: operational safety for production adoption.
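
The guarded self-healing policy from decision 2 and the payload fields from decision 3 can be sketched as follows. This is a minimal illustration, not the project's implementation: the class name `RestartPolicy`, its parameters, the default thresholds, the decision strings, and the payload keys are all assumptions made for the example. It also assumes that an open circuit stays open until an operator resets it.

```python
from collections import deque
from dataclasses import dataclass, field
from typing import Deque, Dict


@dataclass
class RestartPolicy:
    """Hypothetical bounded restart policy; all names and defaults are illustrative."""
    cooldown_s: float = 60.0    # minimum gap between automated restarts
    max_restarts: int = 3       # automated restarts allowed inside window_s
    window_s: float = 600.0     # sliding window used for churn counting
    _history: Deque[float] = field(default_factory=deque, init=False, repr=False)
    _circuit_open: bool = field(default=False, init=False)

    def _prune(self, now: float) -> None:
        # Forget restarts that have aged out of the churn window.
        while self._history and now - self._history[0] >= self.window_s:
            self._history.popleft()

    def decide(self, now: float) -> str:
        """Return 'restart', 'blocked_cooldown', or 'blocked_circuit_open'."""
        self._prune(now)
        if self._circuit_open:
            return "blocked_circuit_open"
        if len(self._history) >= self.max_restarts:
            # Churn threshold breached: hard stop until an operator steps in.
            self._circuit_open = True
            return "blocked_circuit_open"
        if self._history and now - self._history[-1] < self.cooldown_s:
            return "blocked_cooldown"
        self._history.append(now)
        return "restart"

    def operator_reset(self) -> None:
        # Manual override path: close the circuit and clear the churn history.
        self._circuit_open = False
        self._history.clear()

    def snapshot(self, now: float) -> Dict[str, object]:
        """Fields a health/admin payload could expose (key names are assumed)."""
        self._prune(now)
        cooldown_left = 0.0
        if self._history:
            cooldown_left = max(0.0, self.cooldown_s - (now - self._history[-1]))
        if self._circuit_open:
            action = "investigate, then reset or restart manually"
        elif cooldown_left > 0:
            action = "wait for cooldown"
        else:
            action = "none"
        return {
            "restarts_in_window": len(self._history),
            "cooldown_remaining_s": cooldown_left,
            "circuit_open": self._circuit_open,
            "recommended_action": action,
        }
```

A caller would pass a monotonic timestamp (for example `time.monotonic()`) to `decide` and log the returned string, which gives the audit trail in decision 4 one event per allowed or blocked attempt.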

Risks / Trade-offs

  • [Risk] Overly strict policy delays recovery. Mitigation: configurable thresholds and emergency manual override.
  • [Risk] Aggressive policy causes churn loops. Mitigation: hard stop on churn threshold breach and explicit cool-off windows.
  • [Risk] Added operational complexity. Mitigation: concise runbook with decision tables and tested scripts.
  • [Risk] Drift detection false positives. Mitigation: normalized path resolution and clearly defined comparison sources.
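
To illustrate the drift check and the path-normalization mitigation, the sketch below compares the interpreter path each consumer reports against the single contract path after resolving symlinks, `..` segments, and `~`. The function names `normalize` and `find_drift` and the consumer labels are hypothetical. How each consumer's path is actually collected (unit files, watchdog config, deploy scripts) is left out.

```python
import os
from typing import Dict, List


def normalize(path: str) -> str:
    # Resolve ~, symlinks, and redundant segments so that equivalent
    # spellings of the same location compare equal.
    return os.path.normcase(os.path.realpath(os.path.expanduser(path)))


def find_drift(contract_path: str, observed: Dict[str, str]) -> List[str]:
    """Return the consumers whose resolved path differs from the contract.

    `observed` maps a consumer label (assumed, e.g. "systemd") to the
    interpreter path that consumer is configured with.
    """
    expected = normalize(contract_path)
    return sorted(
        name for name, path in observed.items() if normalize(path) != expected
    )
```

Comparing only normalized paths means a unit file that reaches the interpreter through a symlink or a `bin/../bin` spelling is not reported as drift, while a path into a different environment is.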