Purpose
Define stable requirements for cache-observability-hardening.
Requirements
Requirement: Layered Cache SHALL Expose Operational State
The route cache implementation SHALL expose layered cache operational state, including mode, freshness, and degradation status.
Scenario: Redis unavailable degradation state
- WHEN Redis is unavailable
- THEN health endpoints MUST indicate degraded cache mode while keeping L1 memory cache active
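The degradation behavior above can be sketched minimally. `RouteCache`, `FlakyRedis`, and the mode strings here are illustrative assumptions, not the real implementation:

```python
class FlakyRedis:
    """Stand-in Redis client that can be toggled unavailable."""
    def __init__(self):
        self.available = True
        self.store = {}

    def get(self, key):
        if not self.available:
            raise ConnectionError("redis down")
        return self.store.get(key)


class RouteCache:
    """Two-layer cache: Redis (L2) backed by an in-process L1 dict."""
    def __init__(self, redis):
        self.redis = redis
        self.l1 = {}  # L1 memory cache stays active when Redis is down

    def get(self, key):
        try:
            value = self.redis.get(key)
            if value is not None:
                self.l1[key] = value  # keep L1 warm from L2 hits
            return value if value is not None else self.l1.get(key)
        except ConnectionError:
            return self.l1.get(key)  # degrade: serve from L1 only

    def health(self):
        """Health-endpoint view of layered cache operational state."""
        try:
            self.redis.get("__ping__")
            return {"cache_mode": "layered", "degraded": False}
        except ConnectionError:
            return {"cache_mode": "l1_only", "degraded": True}
```

With Redis toggled off, reads previously cached in L1 still succeed while `health()` reports degraded mode, matching the scenario.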
Requirement: Cache Telemetry MUST be Queryable for Operations
The system MUST provide cache telemetry suitable for operational diagnostics.
Scenario: Telemetry inspection
- WHEN operators request deep health status
- THEN cache-related metrics/state SHALL be present and interpretable for troubleshooting
Requirement: Health Endpoints SHALL Expose Pool Saturation and Degradation Reason Codes
Operational health endpoints MUST report connection pool saturation indicators and explicit degradation reason codes.
Scenario: Pool saturation observed
- WHEN checked-out connections and overflow approach configured limits
- THEN deep health output MUST expose saturation metrics and degraded reason classification
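One way to derive the saturation metric and reason code is sketched below; the field names, the 0.8 warning ratio, and the reason-code strings are assumptions for illustration:

```python
def pool_health(checked_out, overflow, pool_size, max_overflow,
                warn_ratio=0.8):
    """Compute a pool saturation indicator and a degradation reason code.

    checked_out: connections currently in use from the base pool
    overflow: connections opened beyond pool_size (0 if none)
    """
    capacity = pool_size + max_overflow
    in_use = checked_out + max(overflow, 0)
    saturation = in_use / capacity
    if in_use >= capacity:
        reason = "POOL_EXHAUSTED"
    elif saturation >= warn_ratio:
        reason = "POOL_NEAR_LIMIT"
    else:
        reason = None
    return {"pool_saturation": round(saturation, 2),
            "degraded_reason": reason}
```

Deep health output would embed this dict so operators see both the numeric indicator and an explicit classification.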
Requirement: Degraded Responses MUST Be Correlatable Across API and Health Telemetry
Error responses for degraded states SHALL include stable codes that can be mapped to health telemetry and operational dashboards.
Scenario: Degraded API response correlation
- WHEN an API request fails due to circuit-open or pool-exhausted conditions
- THEN operators MUST be able to match the response code to current health telemetry state
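Correlation works when the API error body and the health endpoint emit the same stable code from one shared table. The codes below are hypothetical placeholders, not the service's actual code set:

```python
# Single source of truth for degradation codes, shared by both surfaces.
DEGRADATION_CODES = {
    "circuit_open": "CACHE_CIRCUIT_OPEN",
    "pool_exhausted": "DB_POOL_EXHAUSTED",
}


def degraded_api_response(reason):
    """Error body returned to API callers during a degraded state."""
    return {"error": {"code": DEGRADATION_CODES[reason], "retryable": True}}


def health_snapshot(reason):
    """Health telemetry carrying the same stable code for dashboards."""
    return {"status": "degraded", "reason_code": DEGRADATION_CODES[reason]}
```

Because both functions read the same table, an operator can match a failing response's code directly to the current health state.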
Requirement: Operational Alert Thresholds SHALL Be Explicitly Defined
The system MUST define alert thresholds for sustained degraded state, repeated worker recovery, and abnormal retry pressure.
Scenario: Sustained degradation threshold exceeded
- WHEN degraded status persists beyond configured duration
- THEN the monitoring contract MUST classify the service as alert-worthy with actionable context
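The three threshold families named above could be captured in one explicit config object. The default values and field names here are illustrative assumptions:

```python
from dataclasses import dataclass


@dataclass
class AlertThresholds:
    sustained_degraded_s: float = 300.0   # degraded longer than 5 min
    worker_recoveries_per_h: int = 3      # repeated worker recovery
    retries_per_min: float = 50.0         # abnormal retry pressure


def classify(t, degraded_for_s, recoveries_last_h, retries_per_min):
    """Return whether current signals are alert-worthy, with context."""
    reasons = []
    if degraded_for_s > t.sustained_degraded_s:
        reasons.append("sustained_degradation")
    if recoveries_last_h > t.worker_recoveries_per_h:
        reasons.append("repeated_worker_recovery")
    if retries_per_min > t.retries_per_min:
        reasons.append("retry_pressure")
    return {"alert_worthy": bool(reasons), "reasons": reasons}
```

The `reasons` list is the "actionable context" the scenario asks for: it tells the on-call operator which threshold fired.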
Requirement: Cache Telemetry SHALL Include Memory Amplification Signals
Operational telemetry MUST expose cache-domain memory usage indicators and representation amplification factors, and MUST differentiate between authoritative data payload and derived/index helper structures.
Scenario: Deep health telemetry request after representation normalization
- WHEN operators inspect cache telemetry for resource or WIP domains
- THEN telemetry MUST include per-domain memory footprint, amplification indicators, and enough structure detail to verify that full-record duplication is not reintroduced
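A per-domain telemetry entry satisfying this requirement might look like the following sketch; the field names and the duplication heuristic are assumptions, not the shipped schema:

```python
def domain_memory_signals(domain, payload_bytes, derived_bytes):
    """Per-domain cache memory telemetry.

    payload_bytes: authoritative data payload footprint
    derived_bytes: index/lookup helper structures derived from the payload
    """
    total = payload_bytes + derived_bytes
    return {
        "domain": domain,
        "payload_bytes": payload_bytes,
        "derived_bytes": derived_bytes,
        "amplification_factor": round(total / payload_bytes, 2),
        # A derived footprint rivaling the payload suggests full-record
        # duplication has been reintroduced.
        "duplication_suspected": derived_bytes >= payload_bytes,
    }
```

Separating payload from derived bytes is what lets an operator verify that helper indexes hold keys, not duplicated records.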
Requirement: Efficiency Benchmarks SHALL Gate Cache Refactor Rollout
Cache/query efficiency changes MUST be validated against baseline latency and memory benchmarks before rollout.
Scenario: Pre-release validation
- WHEN cache refactor changes are prepared for deployment
- THEN benchmark results MUST demonstrate no regression beyond configured thresholds for P95 latency and memory usage
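A minimal gate comparing candidate benchmarks against baseline could look like this; the metric names and the 5% default regression budget are assumptions:

```python
def benchmark_gate(baseline, candidate, max_regression=0.05):
    """Fail the rollout if any tracked metric regresses past the budget."""
    regressions = {}
    for metric in ("p95_latency_ms", "memory_mb"):
        allowed = baseline[metric] * (1 + max_regression)
        if candidate[metric] > allowed:
            regressions[metric] = {
                "baseline": baseline[metric],
                "candidate": candidate[metric],
                "allowed": allowed,
            }
    return {"pass": not regressions, "regressions": regressions}
```

Wiring this into CI makes the gate enforceable rather than advisory: deployment proceeds only when `pass` is true.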
Requirement: Process-Level Cache SHALL Use Bounded Capacity with Deterministic Eviction
Process-level parsed-data caches MUST enforce a configurable maximum key capacity and use deterministic eviction behavior when capacity is exceeded.
Scenario: Cache capacity reached
- WHEN a new cache entry is inserted and key capacity is at limit
- THEN the cache MUST evict entries according to the defined policy before storing the new key
Scenario: Repeated access updates recency
- WHEN an existing cache key is read or overwritten
- THEN eviction order MUST reflect recency semantics so hot keys are retained preferentially
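Both scenarios above fall out of a bounded LRU structure; a minimal sketch using the standard library's `OrderedDict` (class name is illustrative):

```python
from collections import OrderedDict


class BoundedLRU:
    """Process-level cache with a configurable key cap and LRU eviction."""

    def __init__(self, max_keys):
        self.max_keys = max_keys
        self._data = OrderedDict()

    def get(self, key, default=None):
        if key in self._data:
            self._data.move_to_end(key)  # a read refreshes recency
            return self._data[key]
        return default

    def put(self, key, value):
        if key in self._data:
            self._data.move_to_end(key)  # an overwrite refreshes recency
        self._data[key] = value
        while len(self._data) > self.max_keys:
            self._data.popitem(last=False)  # evict least recently used
```

Because both reads and overwrites call `move_to_end`, hot keys sink to the recent end and survive eviction preferentially, as the second scenario requires.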
Requirement: Cache Publish MUST Preserve Previous Readable Snapshot on Failure
When refreshing full-table cache payloads, the system MUST avoid exposing partially published states to readers.
Scenario: Publish fails after payload serialization
- WHEN a cache refresh has prepared new payload but publish operation fails
- THEN previously published cache keys MUST remain readable and metadata MUST remain consistent with the old snapshot
Scenario: Publish succeeds
- WHEN publish operation completes successfully
- THEN data payload and metadata keys MUST be visible as one coherent new snapshot
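Both scenarios can be satisfied by staging the new snapshot fully and making it visible through a single pointer swap. This sketch keeps snapshots in-process for illustration; in Redis the equivalent move is writing to versioned keys and flipping one pointer key last:

```python
class SnapshotStore:
    """Readers only ever see a complete snapshot: payload and metadata
    become visible together via one pointer assignment, so a failed
    publish leaves the previous snapshot fully readable."""

    def __init__(self):
        self._snapshots = {}
        self._current = None

    def publish(self, version, build_snapshot):
        staged = build_snapshot()     # may raise; pointer is untouched
        self._snapshots[version] = staged
        self._current = version       # atomic flip to the new snapshot

    def read(self):
        if self._current is None:
            return None
        return self._snapshots[self._current]
```

If `build_snapshot` raises after serialization work has started, `_current` still names the old version, so reads stay coherent; on success, payload and metadata appear as one snapshot.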
Requirement: Process-Level Cache Slow Path SHALL Minimize Lock Hold Time
Large payload parsing MUST NOT happen inside long-held process cache locks.
Scenario: Cache miss under concurrent requests
- WHEN multiple requests hit process cache miss
- THEN parsing work SHALL happen outside lock-protected mutation section, and lock scope SHALL be limited to consistency check + commit
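The scenario above is the classic check/parse-outside/recheck-and-commit pattern; a minimal sketch, with hypothetical names and a version token standing in for the consistency check:

```python
import threading


class ProcessCache:
    """Parsed-data cache that keeps expensive parsing out of the lock."""

    def __init__(self):
        self._lock = threading.Lock()
        self._data = {}
        self._versions = {}

    def get_or_parse(self, key, raw, version, parse):
        # Fast path: short lock to check for a current entry.
        with self._lock:
            if key in self._data and self._versions.get(key) == version:
                return self._data[key]
        # Slow path: expensive parsing runs with no lock held.
        parsed = parse(raw)
        # Lock scope limited to consistency check + commit.
        with self._lock:
            if self._versions.get(key) != version:
                self._data[key] = parsed
                self._versions[key] = version
            return self._data[key]
```

Concurrent misses may parse redundantly, but the recheck before commit keeps the cache consistent, and no request ever blocks on another's parse.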
Requirement: Process-Level Cache Policies MUST Stay Consistent Across Services
All service-local process caches MUST support bounded capacity with deterministic eviction.
Scenario: Realtime equipment cache growth
- WHEN realtime equipment process cache reaches configured capacity
- THEN entries MUST be evicted according to deterministic LRU behavior