
runtime-resilience-recovery Specification

Purpose

TBD - created by archiving change stability-and-frontend-compute-shift. Update Purpose after archive.

Requirements

Requirement: Database Pool Runtime Configuration SHALL Be Enforced

The system SHALL apply database pool and timeout parameters from runtime configuration to the active SQLAlchemy engine used by request handling.

Scenario: Runtime pool configuration takes effect

  • WHEN operators set pool and timeout values via environment configuration and start the service
  • THEN the active engine MUST use those values for pool size, overflow, wait timeout, and query call timeout
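
As a minimal sketch, runtime configuration could be translated into `create_engine()` keyword arguments like this. The environment variable names here are hypothetical; the requirement mandates only the mapping from runtime configuration to the active engine's parameters:

```python
import os

def pool_kwargs_from_env(env=None):
    """Build SQLAlchemy create_engine() pool kwargs from runtime configuration.

    Variable names (DB_POOL_SIZE etc.) are illustrative, not mandated.
    """
    env = os.environ if env is None else env
    return {
        "pool_size": int(env.get("DB_POOL_SIZE", 5)),        # persistent connections
        "max_overflow": int(env.get("DB_MAX_OVERFLOW", 10)), # burst headroom
        "pool_timeout": int(env.get("DB_POOL_TIMEOUT", 30)), # seconds to wait for a free connection
        "pool_pre_ping": True,                               # drop dead connections before use
    }

# engine = create_engine(db_url, **pool_kwargs_from_env())
```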

Scenario: Slow query semaphore capacity

  • WHEN the service starts in production or staging configuration
  • THEN DB_SLOW_MAX_CONCURRENT SHALL default to 5
  • WHEN the service starts in development configuration
  • THEN DB_SLOW_MAX_CONCURRENT SHALL default to 3
  • WHEN the service starts in testing configuration
  • THEN DB_SLOW_MAX_CONCURRENT SHALL remain at 1
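
The per-environment defaults above could be resolved like this. Only the variable name DB_SLOW_MAX_CONCURRENT and the per-configuration defaults come from the spec; the function names are illustrative:

```python
import threading

# Defaults per configuration name, as specified in the scenarios above.
_SLOW_DEFAULTS = {"production": 5, "staging": 5, "development": 3, "testing": 1}

def slow_query_limit(config_name, env=None):
    """Resolve the slow-query concurrency cap; the env var overrides the default."""
    default = _SLOW_DEFAULTS.get(config_name, 3)
    return int((env or {}).get("DB_SLOW_MAX_CONCURRENT", default))

def make_slow_query_gate(config_name, env=None):
    """Semaphore bounding concurrent non-pooled slow queries."""
    return threading.BoundedSemaphore(slow_query_limit(config_name, env))
```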

Requirement: Pool Exhaustion MUST Return Retry-Aware Degraded Responses

The system MUST return explicit degraded responses for connection pool exhaustion and include machine-readable metadata for retry/backoff behavior.

Scenario: Pool exhausted under load

  • WHEN concurrent requests exceed available database connections and pool wait timeout is reached
  • THEN the API MUST return a dedicated error code and retry guidance instead of a generic 500 failure
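
One possible shape for such a response. The error code and field names are illustrative; the spec mandates only a dedicated code plus machine-readable retry metadata:

```python
import json

def pool_exhausted_response(pool_timeout_s=30):
    """Build a degraded 503 response for pool-wait timeout instead of a generic 500."""
    retry_after = max(1, pool_timeout_s // 2)  # heuristic backoff hint
    body = {
        "error_code": "DB_POOL_EXHAUSTED",  # dedicated, machine-readable code
        "degraded": True,
        "retry": {"after_seconds": retry_after, "strategy": "exponential_backoff"},
    }
    headers = {"Retry-After": str(retry_after), "Content-Type": "application/json"}
    return 503, headers, json.dumps(body)
```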

Requirement: Runtime Degradation MUST Integrate Circuit Breaker State

Database-facing API behavior SHALL distinguish circuit-breaker-open degradation from transient query failures.

Scenario: Circuit breaker is open

  • WHEN the circuit breaker transitions to OPEN state
  • THEN database-backed endpoints MUST fail fast with a stable degradation response contract
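
A minimal breaker sketch showing the fail-fast distinction; the class and exception names are hypothetical, not the project's actual implementation:

```python
import time

class BreakerOpen(Exception):
    """Fail-fast signal; handlers map this to the stable degradation contract,
    distinct from a transient query failure."""

class CircuitBreaker:
    def __init__(self, reset_after_s=30.0):
        self.state = "CLOSED"
        self._opened_at = 0.0
        self._reset_after_s = reset_after_s

    def before_call(self):
        """Raise immediately while OPEN instead of attempting the database call."""
        if self.state == "OPEN":
            if time.monotonic() - self._opened_at >= self._reset_after_s:
                self.state = "HALF_OPEN"  # admit a single probe request
            else:
                raise BreakerOpen("circuit open; failing fast")

    def record_failure(self):
        self.state = "OPEN"
        self._opened_at = time.monotonic()
```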

Requirement: Worker Recovery SHALL Support Hot Reload and Watchdog-Assisted Recovery

The runtime MUST support graceful worker hot reload and watchdog-triggered recovery without requiring a port change or full system reboot.

Scenario: Worker restart requested

  • WHEN an authorized operator requests worker restart during degraded operation
  • THEN the service MUST trigger graceful reload and preserve single-port availability
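
Under a gunicorn-style master/worker runtime, graceful reload without a port change is typically a SIGHUP to the master, which respawns workers while the listening socket stays open. A sketch, assuming the operator-authorization check happens before this is called:

```python
import os
import signal

def request_graceful_reload(master_pid):
    """Ask a gunicorn-style master process to gracefully replace its workers.

    SIGHUP makes the master start fresh workers and retire old ones while the
    listening socket stays open, so the service keeps its single port.
    """
    os.kill(master_pid, signal.SIGHUP)
```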

Requirement: Report Frontend API Access SHALL Honor Degraded Retry Contracts

Report pages SHALL use retry-aware API access paths for JSON endpoints so that degraded backend responses propagate retry metadata to the UI.

Scenario: Pool exhaustion or circuit-open response

  • WHEN report API endpoints return degraded error codes with retry hints
  • THEN frontend calls MUST go through the MesApi-compatible access path and MUST avoid aggressive, uncontrolled retry loops

Requirement: Runtime Resilience Diagnostics MUST Expose Actionable Signals

The system MUST expose machine-readable resilience thresholds, restart-churn indicators, and operator action recommendations so degraded states can be triaged consistently.

Scenario: Health payload includes resilience diagnostics

  • WHEN clients call /health or /health/deep
  • THEN responses MUST include resilience thresholds and a recommendation field describing whether to observe, throttle, or trigger controlled worker recovery

Scenario: Admin status includes restart churn summary

  • WHEN operators call /admin/api/system-status or /admin/api/worker/status
  • THEN responses MUST include a bounded restart-history summary for a configured time window and indicate whether the churn threshold is exceeded
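
The restart-churn summary could be tracked with a bounded history like this; the names and the history bound are illustrative:

```python
from collections import deque

class RestartChurnTracker:
    """Bounded restart history with a windowed churn summary."""

    def __init__(self, window_s=3600, churn_threshold=3, max_events=50):
        self._window_s = window_s
        self._threshold = churn_threshold
        self._events = deque(maxlen=max_events)  # bounded even under pathological churn

    def record_restart(self, at_s):
        self._events.append(at_s)

    def summary(self, now_s):
        recent = [t for t in self._events if now_s - t <= self._window_s]
        return {
            "restarts_in_window": len(recent),
            "window_seconds": self._window_s,
            "churn_exceeded": len(recent) > self._threshold,
        }
```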

Requirement: Recovery Recommendations SHALL Reflect Self-Healing Policy State

Health and admin resilience payloads MUST expose whether automated recovery is allowed, cooling down, or blocked by churn policy.

Scenario: Operator inspects degraded state

  • WHEN /health or /admin/api/worker/status is requested during degradation
  • THEN the response MUST include the policy state, the remaining cooldown time, and the next recommended action
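
The policy state and recommendation could be derived as a pure mapping; the state and action names here are illustrative, not mandated:

```python
def recovery_policy_payload(churn_exceeded, cooldown_remaining_s):
    """Map self-healing policy inputs to the fields the scenario requires."""
    if churn_exceeded:
        state, action = "blocked", "manual_override_required"
    elif cooldown_remaining_s > 0:
        state, action = "cooling_down", "observe"
    else:
        state, action = "allowed", "trigger_controlled_recovery"
    return {
        "policy_state": state,
        "cooldown_remaining_seconds": max(0, cooldown_remaining_s),
        "next_recommended_action": action,
    }
```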

Requirement: Manual Recovery Override SHALL Be Explicit and Controlled

Manual restart actions MUST bypass the automatic recovery block only through authenticated operator pathways with explicit acknowledgement.

Scenario: Churn-blocked state with manual override request

  • WHEN an authorized admin requests a manual restart while auto-recovery is blocked
  • THEN system MUST execute controlled restart path and log the override context for auditability
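
A sketch of the audited override path. The authentication check is assumed to have happened upstream, and the logger name and log fields are illustrative:

```python
import json
import logging

def manual_restart_override(operator, reason, execute_restart,
                            logger=logging.getLogger("recovery.audit")):
    """Log the override context before running the controlled restart path."""
    logger.warning("manual restart override: %s",
                   json.dumps({"operator": operator,
                               "reason": reason,
                               "auto_recovery": "blocked"}))
    return execute_restart()
```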

Requirement: Circuit Breaker State Transitions SHALL Avoid Lock-Held Logging

Circuit breaker state transitions MUST avoid executing logger I/O while internal state locks are held.

Scenario: State transition occurs

  • WHEN circuit breaker transitions between CLOSED, OPEN, or HALF_OPEN
  • THEN the lock-protected section MUST complete the state mutation before the transition log output is emitted

Scenario: Slow log handler under load

  • WHEN logger handlers are slow or blocked
  • THEN circuit breaker lock contention MUST remain bounded and MUST NOT serialize unrelated request paths behind logging latency
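
The pattern above can be sketched as: mutate under the lock, log after releasing it. A minimal illustration, not the project's actual breaker:

```python
import logging
import threading

class LoggedBreaker:
    """Circuit breaker whose transition logging happens outside the state lock."""

    def __init__(self):
        self._lock = threading.Lock()
        self.state = "CLOSED"
        self._log = logging.getLogger("breaker")

    def transition(self, new_state):
        with self._lock:
            old, self.state = self.state, new_state  # only state mutation under lock
        # Logger I/O runs after the lock is released, so a slow or blocked
        # handler cannot serialize other threads waiting on breaker state.
        self._log.info("circuit breaker %s -> %s", old, new_state)
        return old, new_state
```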

Requirement: Health Endpoints SHALL Use Short Internal Memoization

Health and deep-health computation SHALL use a short-lived internal cache to prevent probe storms from amplifying backend load.

Scenario: Frequent monitor scrapes

  • WHEN health endpoints are called repeatedly within a small window
  • THEN the service SHALL return the memoized payload for up to 5 seconds in non-testing environments

Scenario: Testing mode

  • WHEN the app is running in testing mode
  • THEN health endpoint memoization MUST be bypassed to preserve deterministic tests
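
A short-TTL memo with a testing bypass might look like this. This is a sketch; the injectable clock exists only to make the example testable:

```python
import time

def memoized_health(compute, ttl_s=5.0, testing=False, clock=time.monotonic):
    """Wrap a health-computation callable with a short-lived cache.

    In testing mode the cache is bypassed so tests stay deterministic.
    """
    cache = {"at": None, "value": None}

    def get():
        if testing:
            return compute()
        now = clock()
        if cache["at"] is None or now - cache["at"] >= ttl_s:
            cache["value"] = compute()
            cache["at"] = now
        return cache["value"]

    return get
```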

Requirement: Logs MUST Redact Connection Secrets

Runtime logs MUST avoid exposing DB connection credentials.

Scenario: Connection string appears in log message

  • WHEN a log message contains DB URL credentials
  • THEN logger output MUST redact the password and sensitive userinfo before emission
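
A sketch of userinfo redaction for URL-style strings. The regex is illustrative and assumes credentials shaped like `scheme://user:password@host`:

```python
import re

# Match the userinfo part (user:password@) that follows "scheme://".
_USERINFO = re.compile(r"(?<=://)[^/@\s]+@")

def redact_db_url(message):
    """Replace user:password@ in any embedded DB URL before the log is emitted."""
    return _USERINFO.sub("***@", message)
```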