# runtime-resilience-recovery Specification

## Purpose

TBD - created by archiving change stability-and-frontend-compute-shift. Update Purpose after archive.

## Requirements
### Requirement: Database Pool Runtime Configuration SHALL Be Enforced

The system SHALL apply database pool and timeout parameters from runtime configuration to the active SQLAlchemy engine used by request handling.

#### Scenario: Runtime pool configuration takes effect

- WHEN operators set pool and timeout values via environment configuration and start the service
- THEN the active engine MUST use those values for pool size, overflow, wait timeout, and query call timeout

#### Scenario: Slow query semaphore capacity

- WHEN the service starts in production or staging configuration
- THEN `DB_SLOW_MAX_CONCURRENT` SHALL default to 5 (env: `DB_SLOW_MAX_CONCURRENT`)
- WHEN the service starts in development configuration
- THEN `DB_SLOW_MAX_CONCURRENT` SHALL default to 3
- WHEN the service starts in testing configuration
- THEN `DB_SLOW_MAX_CONCURRENT` SHALL remain at 1
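A minimal sketch of how these runtime values could flow into the engine. Only `DB_SLOW_MAX_CONCURRENT` comes from this spec; the other environment variable names and defaults are illustrative assumptions, not the project's actual settings.

```python
import os


def engine_kwargs_from_env(env=os.environ):
    """Collect pool settings so the active engine reflects runtime config.

    The resulting dict is meant to be splatted into SQLAlchemy's
    create_engine(url, **kwargs). Variable names here are assumptions.
    """
    return {
        "pool_size": int(env.get("DB_POOL_SIZE", "5")),
        "max_overflow": int(env.get("DB_MAX_OVERFLOW", "10")),
        "pool_timeout": float(env.get("DB_POOL_TIMEOUT", "30")),  # seconds to wait for a free connection
        "pool_pre_ping": True,  # detect dead connections before handing them out
    }


def slow_query_limit(env=os.environ, config_name="production"):
    """Semaphore capacity for slow queries, with the defaults this spec mandates."""
    defaults = {"production": 5, "staging": 5, "development": 3, "testing": 1}
    return int(env.get("DB_SLOW_MAX_CONCURRENT", defaults[config_name]))
```

Reading the values through one helper keeps the env-to-engine mapping in a single place, so a restarted service provably picks up operator overrides.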
### Requirement: Pool Exhaustion MUST Return Retry-Aware Degraded Responses

The system MUST return explicit degraded responses for connection pool exhaustion and include machine-readable metadata for retry/backoff behavior.

#### Scenario: Pool exhausted under load

- WHEN concurrent requests exceed available database connections and the pool wait timeout is reached
- THEN the API MUST return a dedicated error code and retry guidance instead of a generic 500 failure
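One possible shape for that degraded response, assuming the handler catches the pool-wait timeout (e.g. SQLAlchemy's `sqlalchemy.exc.TimeoutError`). The error code and field names below are illustrative assumptions, not the project's actual contract.

```python
def degraded_pool_response(retry_after_seconds: int = 5):
    """Build a retry-aware 503 payload for pool exhaustion.

    Returned instead of a generic 500 so clients can back off deliberately.
    """
    body = {
        "error_code": "DB_POOL_EXHAUSTED",       # stable, machine-readable code (assumed name)
        "degraded": True,
        "retry_after_seconds": retry_after_seconds,
        "backoff": "exponential",                # hint for the client retry policy
    }
    headers = {"Retry-After": str(retry_after_seconds)}  # standard HTTP retry hint
    return body, 503, headers
```

Carrying the retry hint both in the JSON body and the standard `Retry-After` header lets browser code and generic HTTP clients honor it.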
### Requirement: Runtime Degradation MUST Integrate Circuit Breaker State

Database-facing API behavior SHALL distinguish circuit-breaker-open degradation from transient query failures.

#### Scenario: Circuit breaker is open

- WHEN the circuit breaker transitions to OPEN state
- THEN database-backed endpoints MUST fail fast with a stable degradation response contract
### Requirement: Worker Recovery SHALL Support Hot Reload and Watchdog-Assisted Recovery

The runtime MUST support graceful worker hot reload and watchdog-triggered recovery without requiring a port change or full system reboot.

#### Scenario: Worker restart requested

- WHEN an authorized operator requests worker restart during degraded operation
- THEN the service MUST trigger graceful reload and preserve single-port availability
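A sketch of the single-port reload path, assuming a Gunicorn-style master process where `SIGHUP` recycles workers gracefully while the listening socket stays bound. The function name and dry-run flag are illustrative assumptions.

```python
import os
import signal


def request_worker_reload(master_pid: int, dry_run: bool = False) -> str:
    """Ask the process manager to recycle workers in place.

    SIGHUP (under a Gunicorn-style master) drains old workers and spawns new
    ones without rebinding the port, so no port change or reboot is needed.
    """
    if dry_run:
        return f"would send SIGHUP to {master_pid}"
    os.kill(master_pid, signal.SIGHUP)  # graceful reload: port stays bound throughout
    return f"sent SIGHUP to {master_pid}"
```

The watchdog path would call the same function, so operator-triggered and automated recovery share one audited code path.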
### Requirement: Report Frontend API Access SHALL Honor Degraded Retry Contracts

Report pages SHALL use retry-aware API access paths for JSON endpoints so degraded backend responses propagate retry metadata to UI behavior.

#### Scenario: Pool exhaustion or circuit-open response

- WHEN report API endpoints return degraded error codes with retry hints
- THEN frontend calls MUST flow through MesApi-compatible behavior and avoid aggressive uncontrolled retry loops
### Requirement: Runtime Resilience Diagnostics MUST Expose Actionable Signals

The system MUST expose machine-readable resilience thresholds, restart-churn indicators, and operator action recommendations so degraded states can be triaged consistently.

#### Scenario: Health payload includes resilience diagnostics

- WHEN clients call `/health` or `/health/deep`
- THEN responses MUST include resilience thresholds and a recommendation field describing whether to observe, throttle, or trigger controlled worker recovery

#### Scenario: Admin status includes restart churn summary

- WHEN operators call `/admin/api/system-status` or `/admin/api/worker/status`
- THEN responses MUST include a bounded restart history summary within a configured time window and indicate whether the churn threshold is exceeded
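A minimal sketch of the bounded restart history behind that summary. The window size, threshold, and field names are illustrative assumptions; the spec only requires that the history be bounded and windowed.

```python
import time
from collections import deque


class RestartHistory:
    """Bounded record of worker restarts with a sliding-window churn check."""

    def __init__(self, window_seconds=3600, churn_threshold=3, max_entries=50):
        self.window = window_seconds
        self.threshold = churn_threshold
        self.events = deque(maxlen=max_entries)  # bounded: oldest entries fall off

    def record(self, ts=None):
        self.events.append(ts if ts is not None else time.time())

    def summary(self, now=None):
        """Machine-readable churn summary for the admin status payload."""
        now = now if now is not None else time.time()
        recent = [t for t in self.events if now - t <= self.window]
        return {
            "restarts_in_window": len(recent),
            "window_seconds": self.window,
            "churn_threshold": self.threshold,
            "churn_exceeded": len(recent) >= self.threshold,
        }
```

Because both the deque length and the time window are bounded, the status endpoint's cost stays constant no matter how long the service has been churning.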
### Requirement: Recovery Recommendations SHALL Reflect Self-Healing Policy State

Health and admin resilience payloads MUST expose whether automated recovery is allowed, cooling down, or blocked by churn policy.

#### Scenario: Operator inspects degraded state

- WHEN `/health` or `/admin/api/worker/status` is requested during degradation
- THEN the response MUST include policy state, remaining cooldown time, and the next recommended action
### Requirement: Manual Recovery Override SHALL Be Explicit and Controlled

Manual restart actions MUST bypass the automatic block only through authenticated operator pathways with explicit acknowledgement.

#### Scenario: Churn-blocked state with manual override request

- WHEN an authorized admin requests manual restart while auto-recovery is blocked
- THEN the system MUST execute the controlled restart path and log the override context for auditability
### Requirement: Circuit Breaker State Transitions SHALL Avoid Lock-Held Logging

Circuit breaker state transitions MUST avoid executing logger I/O while internal state locks are held.

#### Scenario: State transition occurs

- WHEN the circuit breaker transitions between CLOSED, OPEN, or HALF_OPEN
- THEN the lock-protected section MUST complete state mutation before emitting transition log output

#### Scenario: Slow log handler under load

- WHEN logger handlers are slow or blocked
- THEN circuit breaker lock contention MUST remain bounded and MUST NOT serialize unrelated request paths behind logging latency
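The requirement above reduces to a simple ordering rule: mutate state inside the lock, log outside it. A minimal sketch (class and logger names are illustrative assumptions):

```python
import logging
import threading

log = logging.getLogger("circuit")


class Breaker:
    def __init__(self):
        self._lock = threading.Lock()
        self.state = "CLOSED"

    def transition(self, new_state: str):
        with self._lock:  # critical section: state change only, no I/O
            old, self.state = self.state, new_state
        # Lock is released before logging, so a slow or blocked log handler
        # cannot serialize other threads waiting on the breaker.
        log.info("circuit breaker %s -> %s", old, new_state)
        return old, new_state
```

With logging moved outside the critical section, lock hold time is bounded by the state assignment alone, regardless of handler latency.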
### Requirement: Health Endpoints SHALL Use Short Internal Memoization

Health and deep-health computation SHALL use a short-lived internal cache to prevent probe storms from amplifying backend load.

#### Scenario: Frequent monitor scrapes

- WHEN health endpoints are called repeatedly within a small window
- THEN the service SHALL return the memoized payload for up to 5 seconds in non-testing environments

#### Scenario: Testing mode

- WHEN the app is running in testing mode
- THEN health endpoint memoization MUST be bypassed to preserve deterministic tests
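A sketch of the 5-second memoization with the testing bypass. The factory shape and parameter names are illustrative assumptions; only the TTL and the bypass rule come from this spec.

```python
import time


def make_health_cache(compute, ttl=5.0, testing=False, clock=time.monotonic):
    """Wrap a health-payload function with a short TTL cache.

    In testing mode every call recomputes, keeping tests deterministic.
    """
    cached = {"at": None, "value": None}

    def get():
        if testing:
            return compute()  # bypass: no memoization in testing mode
        now = clock()
        if cached["at"] is None or now - cached["at"] > ttl:
            cached["value"] = compute()  # refresh at most once per TTL window
            cached["at"] = now
        return cached["value"]

    return get
```

Injecting the clock makes the TTL behavior itself testable without sleeping.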
### Requirement: Logs MUST Redact Connection Secrets

Runtime logs MUST avoid exposing DB connection credentials.

#### Scenario: Connection string appears in log message

- WHEN a log message contains DB URL credentials
- THEN logger output MUST redact the password and sensitive userinfo before emission
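One way to enforce this is a `logging.Filter` attached to the handlers, masking the password portion of `scheme://user:password@host` URLs before emission. The regex below is a minimal sketch and an assumption, not the project's actual redaction rule.

```python
import logging
import re

# Matches the ":password@" segment of a "scheme://user:password@host" URL.
_USERINFO = re.compile(r"(://[^/\s:@]+):[^@\s]+@")


class RedactSecrets(logging.Filter):
    """Mask DB URL passwords in log records before they reach any handler."""

    def filter(self, record: logging.LogRecord) -> bool:
        record.msg = _USERINFO.sub(r"\1:***@", record.getMessage())
        record.args = ()  # message is already fully formatted; drop args
        return True       # never suppress the record, only sanitize it
```

Attaching the filter at the handler level (rather than per-logger) covers messages from third-party libraries that log connection strings.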