95 lines
5.5 KiB
Markdown
95 lines
5.5 KiB
Markdown
# runtime-resilience-recovery Specification
|
|
|
|
## Purpose
|
|
TBD - created by archiving change stability-and-frontend-compute-shift. Update Purpose after archive.
|
|
## Requirements
|
|
### Requirement: Database Pool Runtime Configuration SHALL Be Enforced
|
|
The system SHALL apply database pool and timeout parameters from runtime configuration to the active SQLAlchemy engine used by request handling.
|
|
|
|
#### Scenario: Runtime pool configuration takes effect
|
|
- **WHEN** operators set pool and timeout values via environment configuration and start the service
|
|
- **THEN** the active engine MUST use those values for pool size, overflow, wait timeout, and query call timeout
|
|
|
|
### Requirement: Pool Exhaustion MUST Return Retry-Aware Degraded Responses
|
|
The system MUST return explicit degraded responses for connection pool exhaustion and include machine-readable metadata for retry/backoff behavior.
|
|
|
|
#### Scenario: Pool exhausted under load
|
|
- **WHEN** concurrent requests exceed available database connections and pool wait timeout is reached
|
|
- **THEN** the API MUST return a dedicated error code and retry guidance instead of a generic 500 failure
|
|
|
|
### Requirement: Runtime Degradation MUST Integrate Circuit Breaker State
|
|
Database-facing API behavior SHALL distinguish circuit-breaker-open degradation from transient query failures.
|
|
|
|
#### Scenario: Circuit breaker is open
|
|
- **WHEN** the circuit breaker transitions to OPEN state
|
|
- **THEN** database-backed endpoints MUST fail fast with a stable degradation response contract
|
|
|
|
### Requirement: Worker Recovery SHALL Support Hot Reload and Watchdog-Assisted Recovery
|
|
The runtime MUST support graceful worker hot reload and watchdog-triggered recovery without requiring a port change or full system reboot.
|
|
|
|
#### Scenario: Worker restart requested
|
|
- **WHEN** an authorized operator requests worker restart during degraded operation
|
|
- **THEN** the service MUST trigger graceful reload and preserve single-port availability
|
|
|
|
### Requirement: Report Frontend API Access SHALL Honor Degraded Retry Contracts
|
|
Report pages SHALL use retry-aware API access paths for JSON endpoints so degraded backend responses propagate retry metadata to UI behavior.
|
|
|
|
#### Scenario: Pool exhaustion or circuit-open response
|
|
- **WHEN** report API endpoints return degraded error codes with retry hints
|
|
- **THEN** frontend calls MUST flow through MesApi-compatible behavior and avoid aggressive uncontrolled retry loops
|
|
|
|
### Requirement: Runtime Resilience Diagnostics MUST Expose Actionable Signals
|
|
The system MUST expose machine-readable resilience thresholds, restart-churn indicators, and operator action recommendations so degraded states can be triaged consistently.
|
|
|
|
#### Scenario: Health payload includes resilience diagnostics
|
|
- **WHEN** clients call `/health` or `/health/deep`
|
|
- **THEN** responses MUST include resilience thresholds and a recommendation field describing whether to observe, throttle, or trigger controlled worker recovery
|
|
|
|
#### Scenario: Admin status includes restart churn summary
|
|
- **WHEN** operators call `/admin/api/system-status` or `/admin/api/worker/status`
|
|
- **THEN** responses MUST include bounded restart history summary within a configured time window and indicate whether churn threshold is exceeded
|
|
|
|
### Requirement: Recovery Recommendations SHALL Reflect Self-Healing Policy State
|
|
Health and admin resilience payloads MUST expose whether automated recovery is allowed, cooling down, or blocked by churn policy.
|
|
|
|
#### Scenario: Operator inspects degraded state
|
|
- **WHEN** `/health` or `/admin/api/worker/status` is requested during degradation
|
|
- **THEN** response MUST include policy state, cooldown remaining time, and next recommended action
|
|
|
|
### Requirement: Manual Recovery Override SHALL Be Explicit and Controlled
|
|
Manual restart actions MUST bypass automatic block only through authenticated operator pathways with explicit acknowledgement.
|
|
|
|
#### Scenario: Churn-blocked state with manual override request
|
|
- **WHEN** authorized admin requests manual restart while auto-recovery is blocked
|
|
- **THEN** system MUST execute controlled restart path and log the override context for auditability
|
|
|
|
### Requirement: Circuit Breaker State Transitions SHALL Avoid Lock-Held Logging
|
|
Circuit breaker state transitions MUST avoid executing logger I/O while internal state locks are held.
|
|
|
|
#### Scenario: State transition occurs
|
|
- **WHEN** circuit breaker transitions between CLOSED, OPEN, or HALF_OPEN
|
|
- **THEN** lock-protected section MUST complete state mutation before emitting transition log output
|
|
|
|
#### Scenario: Slow log handler under load
|
|
- **WHEN** logger handlers are slow or blocked
|
|
- **THEN** circuit breaker lock contention MUST remain bounded and MUST NOT serialize unrelated request paths behind logging latency
|
|
|
|
### Requirement: Health Endpoints SHALL Use Short Internal Memoization
|
|
Health and deep-health computation SHALL use a short-lived internal cache to prevent probe storms from amplifying backend load.
|
|
|
|
#### Scenario: Frequent monitor scrapes
|
|
- **WHEN** health endpoints are called repeatedly within a small window
|
|
- **THEN** service SHALL return memoized payload for up to 5 seconds in non-testing environments
|
|
|
|
#### Scenario: Testing mode
|
|
- **WHEN** app is running in testing mode
|
|
- **THEN** health endpoint memoization MUST be bypassed to preserve deterministic tests
|
|
|
|
### Requirement: Logs MUST Redact Connection Secrets
|
|
Runtime logs MUST avoid exposing DB connection credentials.
|
|
|
|
#### Scenario: Connection string appears in log message
|
|
- **WHEN** a log message contains DB URL credentials
|
|
- **THEN** logger output MUST redact password and sensitive userinfo before emission
|
|
|