DashBoard/openspec/specs/runtime-resilience-recovery/spec.md

# runtime-resilience-recovery Specification
## Purpose
TBD - created by archiving change stability-and-frontend-compute-shift. Update Purpose after archive.
## Requirements
### Requirement: Database Pool Runtime Configuration SHALL Be Enforced
The system SHALL apply database pool and timeout parameters from runtime configuration to the active SQLAlchemy engine used by request handling.
#### Scenario: Runtime pool configuration takes effect
- **WHEN** operators set pool and timeout values via environment configuration and start the service
- **THEN** the active engine MUST use those values for pool size, max overflow, pool wait timeout, and per-query timeout
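As a minimal sketch of this requirement (assuming a SQLAlchemy engine and a PostgreSQL driver; the environment variable names and defaults below are illustrative, not mandated by this spec), the runtime values might be mapped to keyword arguments and passed to `create_engine(url, **kwargs)`:

```python
import os

def engine_kwargs_from_env(env=os.environ):
    """Build SQLAlchemy create_engine() keyword arguments from runtime config.

    Variable names (DB_POOL_SIZE, DB_MAX_OVERFLOW, DB_POOL_TIMEOUT,
    DB_STATEMENT_TIMEOUT_MS) are hypothetical examples.
    """
    return {
        "pool_size": int(env.get("DB_POOL_SIZE", 5)),
        "max_overflow": int(env.get("DB_MAX_OVERFLOW", 10)),
        "pool_timeout": float(env.get("DB_POOL_TIMEOUT", 30)),
        # Per-query timeout is driver-specific; for PostgreSQL it can be set
        # as a server option at connect time (value in milliseconds):
        "connect_args": {
            "options": "-c statement_timeout=%s"
            % env.get("DB_STATEMENT_TIMEOUT_MS", "5000")
        },
    }
```

Reading the values at engine construction (rather than hard-coding them) is what lets operators tune the pool without a code change.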
### Requirement: Pool Exhaustion MUST Return Retry-Aware Degraded Responses
The system MUST return explicit degraded responses for connection pool exhaustion and include machine-readable metadata for retry/backoff behavior.
#### Scenario: Pool exhausted under load
- **WHEN** concurrent requests exceed available database connections and pool wait timeout is reached
- **THEN** the API MUST return a dedicated error code and retry guidance instead of a generic 500 failure
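The response contract can be sketched as follows (the error code, field names, and defaults are illustrative; the spec fixes only the behavior). A handler catching the pool's wait-timeout error would return a tuple in the usual Flask `(body, status, headers)` shape:

```python
def pool_exhausted_response(retry_after_s=2, backoff="exponential"):
    """Degraded-response body for pool exhaustion.

    Illustrative schema: the spec requires only a dedicated error code plus
    machine-readable retry/backoff metadata instead of a generic 500.
    """
    return (
        {
            "error": {
                "code": "DB_POOL_EXHAUSTED",  # dedicated, machine-readable code
                "retryable": True,
                "retry_after_seconds": retry_after_s,
                "backoff": backoff,
            }
        },
        503,
        {"Retry-After": str(retry_after_s)},  # standard HTTP retry hint
    )
```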
### Requirement: Runtime Degradation MUST Integrate Circuit Breaker State
Database-facing API behavior SHALL distinguish circuit-breaker-open degradation from transient query failures.
#### Scenario: Circuit breaker is open
- **WHEN** the circuit breaker transitions to OPEN state
- **THEN** database-backed endpoints MUST fail fast with a stable degradation response contract
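The fail-fast behavior can be sketched with a deliberately simplified breaker (a real implementation also needs locking and single-probe HALF_OPEN handling; all names here are hypothetical). The key point is that an OPEN breaker raises immediately, without touching the database, which is what distinguishes it from a transient query failure:

```python
import time

class CircuitOpenError(Exception):
    """Raised to fail fast while the breaker is OPEN."""

class SimpleBreaker:
    def __init__(self, failure_threshold=5, reset_timeout=30.0, clock=time.monotonic):
        self.failure_threshold = failure_threshold
        self.reset_timeout = reset_timeout
        self.clock = clock
        self.failures = 0
        self.opened_at = None  # None => CLOSED

    def allow(self):
        if self.opened_at is None:
            return True
        if self.clock() - self.opened_at >= self.reset_timeout:
            return True  # simplified HALF_OPEN: let a probe through
        raise CircuitOpenError("breaker OPEN: fail fast, skip the DB call")

    def record_failure(self):
        self.failures += 1
        if self.failures >= self.failure_threshold:
            self.opened_at = self.clock()

    def record_success(self):
        self.failures = 0
        self.opened_at = None
```

An endpoint would call `allow()` before querying and translate `CircuitOpenError` into the stable degradation response contract.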
### Requirement: Worker Recovery SHALL Support Hot Reload and Watchdog-Assisted Recovery
The runtime MUST support graceful worker hot reload and watchdog-triggered recovery without requiring a port change or full system reboot.
#### Scenario: Worker restart requested
- **WHEN** an authorized operator requests worker restart during degraded operation
- **THEN** the service MUST trigger graceful reload and preserve single-port availability
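Assuming a gunicorn-style master/worker runtime (the pidfile path is an assumption; deployments differ), the single-port graceful reload can be sketched by signalling the master, which cycles workers while keeping the listening socket:

```python
import os
import signal

def request_graceful_reload(pidfile="gunicorn.pid"):
    """Ask the gunicorn master to gracefully restart its workers.

    SIGHUP makes gunicorn reload configuration and cycle workers while the
    master retains the listening socket, so the service never changes port.
    """
    with open(pidfile) as f:
        master_pid = int(f.read().strip())
    os.kill(master_pid, signal.SIGHUP)
    return master_pid
```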
### Requirement: Report Frontend API Access SHALL Honor Degraded Retry Contracts
Report pages SHALL use retry-aware API access paths for JSON endpoints so degraded backend responses propagate retry metadata to UI behavior.
#### Scenario: Pool exhaustion or circuit-open response
- **WHEN** report API endpoints return degraded error codes with retry hints
- **THEN** frontend calls MUST flow through the MesApi-compatible access path and MUST avoid aggressive, uncontrolled retry loops
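The client-side retry policy can be sketched language-neutrally (shown here in Python for consistency with the other examples; the real report pages would implement the same policy in their MesApi layer, whose API this sketch does not assume). Retries honor the server's `retry_after_seconds` hint with capped, jittered exponential backoff:

```python
import random
import time

def retry_with_hints(call, max_attempts=3, cap_s=10.0,
                     sleep=time.sleep, rng=random.random):
    """Call `call()` -> (body, status), retrying only on retryable degradation.

    Degraded bodies are assumed to carry error.retryable and
    error.retry_after_seconds, per the degraded-response contract.
    """
    for attempt in range(1, max_attempts + 1):
        body, status = call()
        err = body.get("error") if isinstance(body, dict) else None
        if status < 500 or not (err and err.get("retryable")):
            return body, status  # success or non-retryable failure
        if attempt == max_attempts:
            return body, status  # surface degradation to the UI
        hint = float(err.get("retry_after_seconds", 1))
        # capped exponential backoff with jitter: never hammer a degraded backend
        sleep(min(cap_s, hint * (2 ** (attempt - 1))) * (0.5 + rng() / 2))
```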
### Requirement: Runtime Resilience Diagnostics MUST Expose Actionable Signals
The system MUST expose machine-readable resilience thresholds, restart-churn indicators, and operator action recommendations so degraded states can be triaged consistently.
#### Scenario: Health payload includes resilience diagnostics
- **WHEN** clients call `/health` or `/health/deep`
- **THEN** responses MUST include resilience thresholds and a recommendation field describing whether to observe, throttle, or trigger controlled worker recovery
#### Scenario: Admin status includes restart churn summary
- **WHEN** operators call `/admin/api/system-status` or `/admin/api/worker/status`
- **THEN** responses MUST include a bounded restart-history summary for a configured time window and indicate whether the churn threshold is exceeded
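One possible shape for these diagnostics (field names and thresholds are illustrative; the spec fixes intent, not schema) bundles the thresholds, the bounded churn summary, the policy state, and a recommendation into one payload:

```python
import time

def resilience_diagnostics(restarts, window_s=600.0, churn_limit=3,
                           cooldown_s=120.0, now=None):
    """Summarize restart churn and recommend an operator action.

    `restarts` is a list of restart timestamps in seconds; all field names
    and the observe/throttle/restart policy mapping are illustrative.
    """
    now = time.time() if now is None else now
    recent = [t for t in restarts if now - t <= window_s]  # bounded window
    churn_exceeded = len(recent) > churn_limit
    cooling = bool(recent) and (now - max(recent)) < cooldown_s
    if churn_exceeded:
        recommendation, policy = "observe", "blocked_by_churn"
    elif cooling:
        recommendation, policy = "throttle", "cooldown"
    else:
        recommendation, policy = "trigger_controlled_recovery", "auto_recovery_allowed"
    return {
        "thresholds": {"window_seconds": window_s, "churn_limit": churn_limit},
        "restart_churn": {"recent_restarts": len(recent), "exceeded": churn_exceeded},
        "recovery_policy": {
            "state": policy,
            "cooldown_remaining_seconds":
                max(0.0, cooldown_s - (now - max(recent))) if recent else 0.0,
        },
        "recommendation": recommendation,
    }
```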
### Requirement: Recovery Recommendations SHALL Reflect Self-Healing Policy State
Health and admin resilience payloads MUST expose whether automated recovery is allowed, cooling down, or blocked by churn policy.
#### Scenario: Operator inspects degraded state
- **WHEN** `/health` or `/admin/api/worker/status` is requested during degradation
- **THEN** the response MUST include policy state, remaining cooldown time, and the next recommended action
### Requirement: Manual Recovery Override SHALL Be Explicit and Controlled
Manual restart actions MUST bypass automatic block only through authenticated operator pathways with explicit acknowledgement.
#### Scenario: Churn-blocked state with manual override request
- **WHEN** an authorized admin requests a manual restart while auto-recovery is blocked
- **THEN** system MUST execute controlled restart path and log the override context for auditability
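A sketch of the override gate, assuming authentication happened upstream (parameter names are hypothetical; `restart` stands in for the real controlled-restart path). The explicit acknowledgement flag and the audit log line are the two properties this requirement demands:

```python
import logging

logger = logging.getLogger("resilience.audit")

def manual_restart(operator, reason, acknowledge_block=False,
                   auto_blocked=True, restart=lambda: None):
    """Execute a controlled restart only with explicit override acknowledgement."""
    if auto_blocked and not acknowledge_block:
        # bypassing the churn block must be a deliberate, stated choice
        raise PermissionError(
            "auto-recovery is blocked; override requires explicit acknowledgement")
    # audit trail: who overrode, why, and whether the block was active
    logger.warning("manual restart override: operator=%s reason=%s auto_blocked=%s",
                   operator, reason, auto_blocked)
    restart()
    return {"restarted": True, "override": auto_blocked}
```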
### Requirement: Circuit Breaker State Transitions SHALL Avoid Lock-Held Logging
Circuit breaker state transitions MUST avoid executing logger I/O while internal state locks are held.
#### Scenario: State transition occurs
- **WHEN** circuit breaker transitions between CLOSED, OPEN, or HALF_OPEN
- **THEN** lock-protected section MUST complete state mutation before emitting transition log output
#### Scenario: Slow log handler under load
- **WHEN** logger handlers are slow or blocked
- **THEN** circuit breaker lock contention MUST remain bounded and MUST NOT serialize unrelated request paths behind logging latency
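The pattern this requirement mandates can be sketched directly (class and field names are illustrative): capture the transition under the lock, then emit the log only after the lock is released, so a slow handler can never extend lock hold time:

```python
import logging
import threading

logger = logging.getLogger("breaker")

class BreakerState:
    """Mutate under the lock, log outside it."""
    def __init__(self):
        self._lock = threading.Lock()
        self.state = "CLOSED"

    def transition(self, new_state):
        with self._lock:
            old, self.state = self.state, new_state  # lock covers mutation only
        # logger I/O happens strictly after the lock is released
        logger.info("circuit breaker %s -> %s", old, new_state)
        return old, new_state
```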
### Requirement: Health Endpoints SHALL Use Short Internal Memoization
Health and deep-health computation SHALL use a short-lived internal cache to prevent probe storms from amplifying backend load.
#### Scenario: Frequent monitor scrapes
- **WHEN** health endpoints are called repeatedly within a small window
- **THEN** service SHALL return memoized payload for up to 5 seconds in non-testing environments
#### Scenario: Testing mode
- **WHEN** app is running in testing mode
- **THEN** health endpoint memoization MUST be bypassed to preserve deterministic tests
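Both scenarios can be sketched as a small TTL wrapper around the payload computation (function and flag names are illustrative); testing mode simply skips the cache check:

```python
import time

def memoize_health(compute, ttl_s=5.0, testing=False, clock=time.monotonic):
    """Wrap a health-payload function with a short TTL cache.

    In testing mode every call recomputes, matching the bypass scenario.
    """
    cache = {"at": None, "value": None}
    def wrapped():
        now = clock()
        if not testing and cache["at"] is not None and now - cache["at"] < ttl_s:
            return cache["value"]  # probe storm hits the cache, not the backend
        cache["value"] = compute()
        cache["at"] = now
        return cache["value"]
    return wrapped
```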
### Requirement: Logs MUST Redact Connection Secrets
Runtime logs MUST avoid exposing DB connection credentials.
#### Scenario: Connection string appears in log message
- **WHEN** a log message contains DB URL credentials
- **THEN** logger output MUST redact password and sensitive userinfo before emission
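A minimal redaction sketch: strip the userinfo section of any URL in a message before it is emitted. A real deployment would install this in a `logging.Filter` so every handler sees only redacted text; the regex is a sketch and may need hardening for exotic URL shapes:

```python
import re

# Matches the userinfo section of a URL: scheme://user:password@host
_USERINFO = re.compile(r"(\w+?://)([^/\s:@]+):([^/\s@]+)@")

def redact_db_url(message):
    """Replace URL credentials with placeholders."""
    return _USERINFO.sub(r"\1***:***@", message)
```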