DashBoard/openspec/changes/archive/2026-02-08-p0-runtime-stability-hardening/design.md

Context

The current architecture already supports a single-port Gunicorn runtime, circuit-breaker-aware degraded responses, and watchdog-assisted recovery. However, critical security and lifecycle controls are uneven: production startup can still fall back to a weak secret key, CSRF is not enforced globally, and background resources are not all registered in a single shutdown lifecycle. These gaps become operational risks under pool pressure or restart churn.

Goals / Non-Goals

Goals:

  • Make production startup fail fast when required security secrets are missing.
  • Enforce CSRF validation for all state-changing endpoints without breaking existing frontend flows.
  • Make worker/app shutdown deterministic by stopping all background workers and shared clients.
  • Keep degraded responses for pool exhaustion and circuit-open states stable and retry-aware.
  • Isolate health probe connectivity from main request pool contention.

Non-Goals:

  • Replacing LDAP provider or redesigning the full authentication architecture.
  • Full CSP rollout across all templates in this change.
  • Changing URL structure, page IA, or single-port deployment topology.

Decisions

  1. Production secret-key guard at startup

    • Decision: enforce SECRET_KEY presence/strength in non-development modes and abort startup when invalid.
    • Rationale: prevents silent insecure deployment.
  2. Unified CSRF contract across form + JSON flows

    • Decision: issue the CSRF token from the server session; validate a hidden form field for HTML forms and the X-CSRF-Token header for JSON POST/PUT/PATCH/DELETE requests.
    • Rationale: maintains current frontend behavior while covering non-form APIs.
  3. Centralized shutdown registry

    • Decision: register explicit shutdown hooks that call cache updater stop, realtime sync stop, Redis close, and DB dispose in a fixed order, each with a bounded wait.
    • Rationale: avoids thread/client leaks during worker recycle and controlled reload.
  4. Health probe pool isolation

    • Decision: use a dedicated lightweight DB health engine/pool for /health checks.
    • Rationale: prevents health endpoint from being blocked by request-pool exhaustion, improving observability fidelity.
  5. Template-safe JS serialization

    • Decision: replace HTML-escaped interpolation in JS string contexts with tojson serialization.
    • Rationale: avoids context-mismatch injection edge cases.
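
As an illustration of Decision 1, the startup guard might look roughly like the sketch below. The APP_ENV variable, the 32-character minimum, the weak-value list, and the names InsecureConfigError and resolve_secret_key are assumptions made for this example; none of them come from the codebase.

```python
# Hypothetical names and thresholds throughout; adjust to the real config layer.
MIN_SECRET_LENGTH = 32
WEAK_SECRETS = {"dev", "secret", "changeme", "dev-secret-key"}


class InsecureConfigError(RuntimeError):
    """Raised when a non-development start would use a missing or weak secret."""


def resolve_secret_key(env):
    """Return a usable SECRET_KEY, or raise before the app starts serving.

    `env` is any mapping, for example os.environ.
    """
    mode = env.get("APP_ENV", "production").lower()  # assumed variable name
    key = env.get("SECRET_KEY", "")

    if mode == "development":
        # Only development may fall back to a throwaway key.
        return key or "dev-only-not-for-production"

    if not key:
        raise InsecureConfigError("SECRET_KEY is not set; refusing to start.")
    if key.lower() in WEAK_SECRETS or len(key) < MIN_SECRET_LENGTH:
        raise InsecureConfigError(
            f"SECRET_KEY is too weak (need at least {MIN_SECRET_LENGTH} characters)."
        )
    return key
```

Calling resolve_secret_key(os.environ) during app creation and letting the exception propagate would abort startup with a clear diagnostic instead of silently deploying with a weak key.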

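The centralized shutdown registry from Decision 3 could be sketched as follows. This is a minimal illustration, not the project's implementation: the ShutdownRegistry class, its method names, and the default timeout are hypothetical, and each hook is run in a helper thread only so that the wait on it can be bounded.

```python
import threading


class ShutdownRegistry:
    """Runs registered stop hooks once, in registration order (hypothetical sketch)."""

    def __init__(self):
        self._hooks = []
        self._lock = threading.Lock()
        self._done = False

    def register(self, name, stop, timeout=2.0):
        """Register a zero-argument stop callable with a bounded wait in seconds."""
        self._hooks.append((name, stop, timeout))

    def run(self):
        """Call every hook; return {name: status}. Repeat calls are no-ops."""
        with self._lock:
            if self._done:
                return {}
            self._done = True

        results = {}
        for name, stop, timeout in self._hooks:
            outcome = {}

            def call(stop=stop, outcome=outcome):
                try:
                    stop()
                    outcome["status"] = "stopped"
                except Exception as exc:  # one failing hook must not block the rest
                    outcome["status"] = f"error: {exc}"

            worker = threading.Thread(target=call, daemon=True)
            worker.start()
            worker.join(timeout)  # wait at most `timeout` seconds for this hook
            results[name] = outcome.get("status", "timeout")
        return results
```

In this sketch, the cache updater stop, realtime sync stop, Redis close, and DB dispose would each be registered once at app creation, and run() would be invoked from whichever worker-exit hook the deployment uses.
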
Risks / Trade-offs

  • [Risk] CSRF rollout may break undocumented clients → Mitigation: provide opt-in transition flag and explicit error messaging during rollout.
  • [Risk] Strict startup secret validation can block misconfigured environments → Mitigation: provide clear startup diagnostics and .env.example updates.
  • [Risk] Additional shutdown hooks can prolong worker exit → Mitigation: bounded timeouts and idempotent stop handlers.
  • [Risk] Dedicated health pool introduces extra DB connections → Mitigation: fixed minimal size and short timeout.
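
The CSRF contract from Decision 2, combined with the transition flag mentioned in the first mitigation, might be sketched like this. It is framework-agnostic on purpose: the session, headers, and form arguments are plain dicts, and the csrf_token field name, the enforce parameter, and all function names are assumptions for illustration.

```python
import hmac
import secrets

UNSAFE_METHODS = {"POST", "PUT", "PATCH", "DELETE"}


def issue_csrf_token(session):
    """Create the token once per session and reuse it afterwards."""
    if "csrf_token" not in session:
        session["csrf_token"] = secrets.token_urlsafe(32)
    return session["csrf_token"]


def extract_token(headers, form, is_json):
    """JSON requests send X-CSRF-Token; HTML forms send a hidden field."""
    if is_json:
        return headers.get("X-CSRF-Token", "")
    return form.get("csrf_token", "")  # assumed hidden field name


def check_csrf(method, session, supplied, enforce=True):
    """Return (allowed, message) for a single request."""
    if method.upper() not in UNSAFE_METHODS:
        return True, "safe method"

    expected = session.get("csrf_token", "")
    # Constant-time comparison; encode so non-ASCII input cannot raise.
    valid = bool(expected) and bool(supplied) and hmac.compare_digest(
        supplied.encode("utf-8"), expected.encode("utf-8")
    )
    if valid:
        return True, "token valid"
    if not enforce:
        # Transition mode: allow the request but report the failure.
        return True, "CSRF token missing or invalid (not enforced)"
    return False, "CSRF token missing or invalid"
```

With enforce=False the check only reports failures, which would let undocumented clients be found from logs before rejection is switched on.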