Chunk failures in BatchQueryEngine were silently discarded: `has_partial_failure` was tracked in Redis but never surfaced to the API response or frontend, so users could see incomplete data without any warning. This commit closes the gap end-to-end.

Backend:
- Track failed chunk time ranges (`failed_ranges`) in batch engine progress metadata
- Add a single retry for transient Oracle errors (timeout, connection) in `_execute_single_chunk`
- Read `get_batch_progress()` after merge but before `redis_clear_batch()` cleanup
- Inject `has_partial_failure`, `failed_chunk_count`, and `failed_ranges` into the API response meta
- Persist the partial failure flag to an independent Redis key with a TTL aligned to the data storage layer
- Add a shared container-resolution policy module with wildcard/expansion guardrails
- Refactor the reason filter from single-value to multi-select (`reason` → `reasons`)

Frontend:
- Add client-side date range validation (730-day limit) before API submission
- Display an amber warning banner on partial failure listing the specific failed date ranges
- Support a generic fallback message for container-mode queries without date ranges
- Update FilterPanel to support multi-select reason chips

Specs & tests:
- Create the batch-query-resilience spec; update the reject-history-api and reject-history-page specs
- Add 7 new tests covering retry, memory guard, failed ranges, partial failure propagation, and TTL
- Cross-service regression verified (hold, resource, job, msd: 411 tests pass)

Co-Authored-By: Claude Opus 4.6 <noreply@anthropic.com>
## ADDED Requirements
### Requirement: BatchQueryEngine SHALL track failed chunk time ranges in progress metadata
The engine SHALL record the time ranges of failed chunks in Redis progress metadata so consumers can report which date intervals have missing data.
#### Scenario: Failed chunk range recorded in sequential path
- WHEN a chunk with `chunk_start` and `chunk_end` keys fails during sequential execution
- THEN `_update_progress()` SHALL store a `failed_ranges` field in the Redis HSET metadata
- THEN `failed_ranges` SHALL be a JSON array of objects, each with `start` and `end` string keys
- THEN the array SHALL contain one entry per failed chunk
#### Scenario: Failed chunk range recorded in parallel path
- WHEN a chunk with `chunk_start` and `chunk_end` keys fails during parallel execution
- THEN the failed chunk's time range SHALL be appended to `failed_ranges` in the same format as the sequential path
#### Scenario: No failed ranges when all chunks succeed
- WHEN all chunks complete successfully
- THEN the `failed_ranges` field SHALL NOT be present in Redis metadata
#### Scenario: ID-batch chunks produce no failed_ranges entries
- WHEN a chunk created by `decompose_by_ids()` (containing only an `ids` key, no `chunk_start`/`chunk_end`) fails
- THEN no entry SHALL be appended to `failed_ranges` for that chunk
- THEN `has_partial_failure` SHALL still be set to `True`
- THEN the `failed` count SHALL still be incremented
#### Scenario: get_batch_progress returns failed_ranges
- WHEN `get_batch_progress()` is called after execution with failed chunks
- THEN the returned dict SHALL include `failed_ranges` as a JSON string parseable to a list of `{start, end}` objects
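A minimal sketch of the stored shape these scenarios require. The helper name `record_failed_range` is hypothetical; the spec only constrains the data: `failed_ranges` is kept in the Redis HSET as a JSON string encoding a list of `{start, end}` objects, and ID-only chunks contribute no entry.

```python
import json

def record_failed_range(metadata: dict, chunk: dict) -> None:
    """Append a failed chunk's time range to the failed_ranges JSON array.

    ID-only chunks (no chunk_start/chunk_end keys) are skipped by design.
    """
    if "chunk_start" not in chunk or "chunk_end" not in chunk:
        return
    ranges = json.loads(metadata.get("failed_ranges", "[]"))
    ranges.append({"start": str(chunk["chunk_start"]),
                   "end": str(chunk["chunk_end"])})
    # Stored as a JSON string, matching the HSET field format above.
    metadata["failed_ranges"] = json.dumps(ranges)

# Example: two failed time chunks and one failed ID-only chunk.
meta = {}
record_failed_range(meta, {"chunk_start": "2024-01-01", "chunk_end": "2024-01-31"})
record_failed_range(meta, {"ids": [1, 2, 3]})  # no entry appended
record_failed_range(meta, {"chunk_start": "2024-02-01", "chunk_end": "2024-02-29"})
parsed = json.loads(meta["failed_ranges"])
```

After the three calls, `parsed` holds exactly two `{start, end}` entries, mirroring the "one entry per failed chunk" and ID-batch scenarios.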
### Requirement: BatchQueryEngine SHALL retry transient chunk failures once
The engine SHALL retry chunk execution once for transient errors (Oracle timeout, connection errors) but SHALL NOT retry deterministic failures (memory guard, Redis store).
#### Scenario: Oracle timeout retried once
- WHEN `_execute_single_chunk()` raises an exception matching Oracle timeout patterns (`DPY-4024`, `ORA-01013`)
- THEN the chunk SHALL be retried exactly once
- WHEN the retry succeeds
- THEN the chunk SHALL be marked as successful
#### Scenario: Connection error retried once
- WHEN `_execute_single_chunk()` raises `TimeoutError`, `ConnectionError`, or `OSError`
- THEN the chunk SHALL be retried exactly once
#### Scenario: Retry exhausted marks chunk as failed
- WHEN a chunk fails on both the initial attempt and the retry
- THEN the chunk SHALL be marked as failed
- THEN `has_partial_failure` SHALL be set to `True`
#### Scenario: Memory guard failure NOT retried
- WHEN a chunk's DataFrame exceeds `BATCH_CHUNK_MAX_MEMORY_MB`
- THEN the chunk SHALL return `False` immediately without retry
- THEN the query function SHALL have been called exactly once for that chunk
#### Scenario: Redis store failure NOT retried
- WHEN `redis_store_chunk()` returns `False`
- THEN the chunk SHALL return `False` immediately without retry
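The retry scenarios above imply a transient/deterministic split. A hedged sketch of that classification (helper names and the exact pattern list are illustrative, not the engine's actual code): exceptions matching the named Oracle codes or the listed built-in error types get exactly one retry, while a `False` return (memory guard, Redis store) is deterministic and never retried.

```python
import re

# Illustrative transient-error patterns taken from the scenarios above.
_ORACLE_TRANSIENT = re.compile(r"DPY-4024|ORA-01013")

def is_transient(exc: BaseException) -> bool:
    """Classify an exception as transient (retryable) per the spec scenarios."""
    if isinstance(exc, (TimeoutError, ConnectionError, OSError)):
        return True
    return bool(_ORACLE_TRANSIENT.search(str(exc)))

def run_chunk_with_retry(execute) -> bool:
    """Run a chunk, retrying exactly once on transient exceptions.

    Deterministic failures are signalled by `execute` returning False
    (memory guard, redis_store_chunk) and are never retried.
    """
    for attempt in range(2):
        try:
            return execute()          # False => deterministic failure, no retry
        except Exception as exc:
            if attempt == 0 and is_transient(exc):
                continue              # the single permitted retry
            return False              # retry exhausted or non-transient
    return False

# Usage: a chunk that times out once, then succeeds on the retry.
calls = {"n": 0}
def flaky_chunk():
    calls["n"] += 1
    if calls["n"] == 1:
        raise TimeoutError("transient")
    return True
result = run_chunk_with_retry(flaky_chunk)
```

With this shape, the memory-guard scenario's "called exactly once" property falls out naturally: a `False` return exits the loop on the first iteration.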
### Requirement: reject_dataset_cache SHALL propagate partial failure metadata to API response
The cache service SHALL read batch execution metadata and include partial failure information in the API response meta field.
#### Scenario: Partial failure metadata included in response
- WHEN `execute_primary_query()` uses the batch engine path and `get_batch_progress()` returns `has_partial_failure=True`
- THEN the response `meta` dict SHALL include `has_partial_failure: true`
- THEN the response `meta` dict SHALL include `failed_chunk_count` as an integer
- THEN if `failed_ranges` is present, the response `meta` dict SHALL include `failed_ranges` as a list of `{start, end}` objects
#### Scenario: Metadata read before redis_clear_batch
- WHEN `execute_primary_query()` calls `get_batch_progress()`
- THEN the call SHALL occur after `merge_chunks()` and before `redis_clear_batch()`
#### Scenario: No partial failure on successful query
- WHEN all chunks complete successfully
- THEN the response `meta` dict SHALL NOT include `has_partial_failure`
#### Scenario: Cache-hit path restores partial failure flag
- WHEN a cached DataFrame is returned (cache hit) and a partial failure flag was stored during the original query
- THEN the response `meta` dict SHALL include the same `has_partial_failure`, `failed_chunk_count`, and `failed_ranges` as the original response
#### Scenario: Partial failure flag TTL matches data storage layer
- WHEN partial failure is detected and the query result is spilled to parquet spool
- THEN the partial failure flag SHALL be stored with a TTL equal to `_REJECT_ENGINE_SPOOL_TTL_SECONDS` (default 21600 seconds)
- WHEN partial failure is detected and the query result is stored in the L1/L2 Redis cache
- THEN the partial failure flag SHALL be stored with a TTL equal to `_CACHE_TTL` (default 900 seconds)
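A small sketch of the TTL-alignment rule, assuming the default values the spec names (the function name and the key format in the comment are hypothetical): the independent partial-failure flag key must expire together with whichever layer actually holds the query result, so the cache-hit restore scenario never finds a flag for data that has already expired, and never misses a flag for data that is still servable.

```python
# Defaults taken from the scenario text above.
_REJECT_ENGINE_SPOOL_TTL_SECONDS = 21600  # parquet spool: 6 hours
_CACHE_TTL = 900                          # L1/L2 Redis cache: 15 minutes

def partial_failure_ttl(spilled_to_spool: bool) -> int:
    """TTL for the independent partial-failure flag key.

    The flag must live exactly as long as the stored result it describes,
    e.g. SETEX on a key derived from the query's cache key.
    """
    if spilled_to_spool:
        return _REJECT_ENGINE_SPOOL_TTL_SECONDS
    return _CACHE_TTL
```

Keeping the flag on its own key (rather than inside the batch progress HSET) is what lets it survive `redis_clear_batch()` cleanup and be re-read on the cache-hit path.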