DashBoard/design.md at main

beabigegg c8e225101e chore: finalize vite migration hardening and archive openspec changes

2026-02-08 20:03:36 +08:00

Context

The project already provides watchdog-assisted restart and resilience diagnostics, but policy boundaries for automated recovery are not yet formalized end-to-end. In practice, this can lead to either under-recovery (manual delays) or over-recovery (restart churn). We also need stronger conda/systemd path consistency checks to prevent runtime drift across deploy scripts and services.

Goals / Non-Goals

Goals:

- Make conda/systemd runtime path contracts explicit, validated, and drift-detectable.
- Implement a safe self-healing policy with cooldown and churn limits.
- Expose clear alert signals and recommended actions in health/admin payloads.
- Keep a manual operator override available for incident control.

Non-Goals:

- Migrating from systemd to another orchestrator.
- Changing database vendor or introducing full autoscaling infrastructure.
- Removing existing admin restart endpoints.

Decisions

Single source runtime contract
- Decision: centralize the conda runtime path configuration so that systemd units, the watchdog, and scripts all consume it from one place.
- Rationale: prevents interpreter/path mismatches and drift between components.
Guarded self-healing state machine
- Decision: implement a bounded restart policy (cooldown + max retries per time window + circuit-open gating).
- Rationale: recovers quickly while preventing restart storms.
Explicit recovery observability contract
- Decision: enrich health/admin payloads with churn counters, cooldown state, and recommended operator action.
- Rationale: enables deterministic triage and alert automation.
Auditability requirement
- Decision: emit structured logs/events for auto-restart decisions, manual overrides, and blocked restart attempts.
- Rationale: supports incident retrospectives and policy tuning.
Runbook-first rollout
- Decision: deploy policy changes behind documentation and validation gates, including rollback steps.
- Rationale: keeps production adoption operationally safe.
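
To make the guarded self-healing, observability, and auditability decisions concrete, here is a minimal Python sketch of a bounded restart policy. It is an illustration under stated assumptions, not the project's implementation: the class name `RestartPolicy`, its field names, the default thresholds, the payload keys, and the logger name are all hypothetical. It also assumes one particular reading of "circuit-open gating": once the restart count in the window reaches the limit, the policy latches into a blocked state that only a manual override clears.

```python
import json
import logging
import time
from collections import deque
from dataclasses import dataclass, field
from typing import Callable, Deque

log = logging.getLogger("self_healing")  # hypothetical logger name


@dataclass
class RestartPolicy:
    """Cooldown + max restarts per sliding window + latched block (assumed semantics)."""
    cooldown_s: float = 60.0      # hypothetical default
    window_s: float = 600.0       # hypothetical default
    max_restarts: int = 3         # hypothetical default
    clock: Callable[[], float] = time.monotonic  # injectable for tests
    blocked: bool = False
    history: Deque[float] = field(default_factory=deque)

    def _cooldown_left(self) -> float:
        now = self.clock()
        # Forget restarts that have aged out of the sliding window.
        while self.history and now - self.history[0] >= self.window_s:
            self.history.popleft()
        if not self.history:
            return 0.0
        return max(0.0, self.cooldown_s - (now - self.history[-1]))

    def status(self) -> dict:
        """Snapshot that could be embedded in a health/admin payload."""
        left = self._cooldown_left()
        if self.blocked:
            action = "manual_intervention_required"
        elif left > 0:
            action = "wait_for_cooldown"
        else:
            action = "none"
        return {
            "restarts_in_window": len(self.history),
            "max_restarts": self.max_restarts,
            "cooldown_remaining_s": left,
            "auto_restart_blocked": self.blocked,
            "recommended_action": action,
        }

    def request_restart(self, reason: str) -> bool:
        """Return True if an automated restart may proceed right now."""
        left = self._cooldown_left()
        if len(self.history) >= self.max_restarts:
            self.blocked = True  # churn threshold breached: hard stop
        allowed = not self.blocked and left == 0.0
        if allowed:
            self.history.append(self.clock())
        # Structured audit event for both allowed and blocked attempts.
        log.info(json.dumps({"event": "auto_restart_decision",
                             "allowed": allowed, "reason": reason,
                             **self.status()}))
        return allowed

    def manual_override(self, operator: str) -> None:
        """Operator clears the block and the churn history."""
        self.blocked = False
        self.history.clear()
        log.info(json.dumps({"event": "manual_override", "operator": operator}))
```

In this sketch the watchdog would call `request_restart(...)` before acting and the health endpoint would merge `status()` into its payload. Injecting `clock` keeps the policy testable without sleeping.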

Risks / Trade-offs

- [Risk] Overly strict policy delays recovery → Mitigation: configurable thresholds and emergency manual override.
- [Risk] Aggressive policy causes churn loops → Mitigation: hard stop on churn threshold breach and explicit cool-off windows.
- [Risk] Added operational complexity → Mitigation: concise runbook with decision tables and tested scripts.
- [Risk] Drift detection false positives → Mitigation: normalized path resolution and clearly defined comparison sources.
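
The drift-detection mitigation can be sketched as a comparison of normalized paths. This is a hedged illustration only: the function names `normalize` and `detect_drift`, the finding keys, and the source labels are hypothetical. How each component's configured path is extracted (for example from a unit file or a deploy script) is out of scope here.

```python
import os
from pathlib import Path
from typing import Dict, List


def normalize(path: str) -> str:
    """Expand ~ and environment variables, then resolve symlinks and '..' segments."""
    expanded = os.path.expandvars(os.path.expanduser(path))
    return str(Path(expanded).resolve())


def detect_drift(contract_path: str, observed: Dict[str, str]) -> List[dict]:
    """Compare each source's path with the contract after normalization.

    `observed` maps a source label (hypothetical, e.g. "systemd_unit")
    to the raw path string that source is configured with.
    """
    expected = normalize(contract_path)
    findings = []
    for source, raw in sorted(observed.items()):
        actual = normalize(raw)
        if actual != expected:
            findings.append({"source": source, "expected": expected,
                             "actual": actual, "raw": raw})
    return findings
```

Normalizing before comparing means that a trailing slash, a `..` segment, or a symlinked prefix is not reported as drift, while a genuinely different environment path still is.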
