batch-query-resilience Specification
Purpose
Batch query engine resilience features: failed chunk range tracking, transient error retry, and partial failure metadata propagation to API consumers.
Requirements
Requirement: BatchQueryEngine SHALL track failed chunk time ranges in progress metadata
The engine SHALL record the time ranges of failed chunks in Redis progress metadata so consumers can report which date intervals have missing data.
Scenario: Failed chunk range recorded in sequential path
- WHEN a chunk with `chunk_start` and `chunk_end` keys fails during sequential execution
- THEN `_update_progress()` SHALL store a `failed_ranges` field in the Redis HSET metadata
- THEN `failed_ranges` SHALL be a JSON array of objects, each with `start` and `end` string keys
- THEN the array SHALL contain one entry per failed chunk
Scenario: Failed chunk range recorded in parallel path
- WHEN a chunk with `chunk_start` and `chunk_end` keys fails during parallel execution
- THEN the failed chunk's time range SHALL be appended to `failed_ranges` in the same format as the sequential path
Scenario: No failed ranges when all chunks succeed
- WHEN all chunks complete successfully
- THEN the `failed_ranges` field SHALL NOT be present in Redis metadata
Scenario: ID-batch chunks produce no failed_ranges entries
- WHEN a chunk created by `decompose_by_ids()` (containing only an `ids` key, no `chunk_start`/`chunk_end`) fails
- THEN no entry SHALL be appended to `failed_ranges` for that chunk
- THEN `has_partial_failure` SHALL still be set to `True`
- THEN `failed` count SHALL still be incremented
Scenario: get_batch_progress returns failed_ranges
- WHEN `get_batch_progress()` is called after execution with failed chunks
- THEN the returned dict SHALL include `failed_ranges` as a JSON string parseable to a list of `{start, end}` objects
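The range-tracking scenarios above can be sketched as follows. This is a minimal illustration, not the engine's actual code: `record_chunk_failure` is a hypothetical helper standing in for `_update_progress()`, the plain `progress` dict stands in for the Redis hash, and the string encodings of `failed` and `has_partial_failure` are assumptions.

```python
import json


def record_chunk_failure(progress: dict, chunk: dict) -> dict:
    """Update a progress mapping after one chunk has failed.

    `progress` stands in for the Redis hash, so every value is kept as a
    string (assumed encoding, not taken from the engine).
    """
    progress["failed"] = str(int(progress.get("failed", "0")) + 1)
    progress["has_partial_failure"] = "True"

    # Only time-range chunks carry chunk_start/chunk_end. ID-batch chunks
    # (just an "ids" key) still count as failed but add no range entry.
    if "chunk_start" in chunk and "chunk_end" in chunk:
        ranges = json.loads(progress.get("failed_ranges", "[]"))
        ranges.append(
            {"start": str(chunk["chunk_start"]), "end": str(chunk["chunk_end"])}
        )
        progress["failed_ranges"] = json.dumps(ranges)
    return progress
```

Because `failed_ranges` is only written when a time-range chunk fails, a fully successful run never creates the field, which matches the "No failed ranges when all chunks succeed" scenario.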
Requirement: BatchQueryEngine SHALL retry transient chunk failures once
The engine SHALL retry chunk execution once for transient errors (Oracle timeout, connection errors) but SHALL NOT retry deterministic failures (memory guard, Redis store).
Scenario: Oracle timeout retried once
- WHEN `_execute_single_chunk()` raises an exception matching Oracle timeout patterns (`DPY-4024`, `ORA-01013`)
- THEN the chunk SHALL be retried exactly once
- WHEN the retry succeeds
- THEN the chunk SHALL be marked as successful
Scenario: Connection error retried once
- WHEN `_execute_single_chunk()` raises `TimeoutError`, `ConnectionError`, or `OSError`
- THEN the chunk SHALL be retried exactly once
Scenario: Retry exhausted marks chunk as failed
- WHEN a chunk fails on both the initial attempt and the retry
- THEN the chunk SHALL be marked as failed
- THEN `has_partial_failure` SHALL be set to `True`
Scenario: Memory guard failure NOT retried
- WHEN a chunk's DataFrame exceeds `BATCH_CHUNK_MAX_MEMORY_MB`
- THEN the chunk SHALL return `False` immediately without retry
- THEN the query function SHALL have been called exactly once for that chunk
Scenario: Redis store failure NOT retried
- WHEN `redis_store_chunk()` returns `False`
- THEN the chunk SHALL return `False` immediately without retry
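One way to satisfy the retry scenarios above is to separate "raised a transient exception" from "returned `False`". The sketch below is illustrative only: `is_transient` and `run_chunk_with_retry` are hypothetical names, the regular expression is built from the two error codes this spec mentions, and treating a non-transient exception as an immediate failure is an assumption the spec does not state.

```python
import re

# Assumed pattern, built only from the error codes named in this spec.
_ORACLE_TIMEOUT_RE = re.compile(r"DPY-4024|ORA-01013")


def is_transient(exc: BaseException) -> bool:
    """Return True for errors this spec treats as retryable."""
    # In Python 3, TimeoutError and ConnectionError are OSError subclasses,
    # so one isinstance check covers all three.
    if isinstance(exc, OSError):
        return True
    return bool(_ORACLE_TIMEOUT_RE.search(str(exc)))


def run_chunk_with_retry(execute_chunk, chunk) -> bool:
    """Call execute_chunk(chunk), retrying once on a transient exception.

    A False return value (memory guard, Redis store failure) is a
    deterministic failure and is passed through without a retry.
    """
    for attempt in range(2):  # initial attempt plus exactly one retry
        try:
            return bool(execute_chunk(chunk))
        except Exception as exc:
            # Assumption: non-transient exceptions fail the chunk at once.
            if attempt == 1 or not is_transient(exc):
                return False
    return False
```

Keeping the deterministic failures on the return-value path means the retry loop never sees them, so the query function runs exactly once for a memory-guard or store failure.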
Requirement: reject_dataset_cache SHALL propagate partial failure metadata to API response
The cache service SHALL read batch execution metadata and include partial failure information in the API response meta field.
Scenario: Partial failure metadata included in response
- WHEN `execute_primary_query()` uses the batch engine path and `get_batch_progress()` returns `has_partial_failure=True`
- THEN the response `meta` dict SHALL include `has_partial_failure: true`
- THEN the response `meta` dict SHALL include `failed_chunk_count` as an integer
- THEN if `failed_ranges` is present, the response `meta` dict SHALL include `failed_ranges` as a list of `{start, end}` objects
Scenario: Metadata read before redis_clear_batch
- WHEN `execute_primary_query()` calls `get_batch_progress()`
- THEN the call SHALL occur after `merge_chunks()` and before `redis_clear_batch()`
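The ordering constraint in this scenario can be illustrated with a small wrapper. The real signatures of `merge_chunks()`, `get_batch_progress()`, and `redis_clear_batch()` are not given in this spec, so the sketch takes them as injected callables that accept a single hypothetical `batch_key` argument; `finish_batch` itself is also a hypothetical name.

```python
def finish_batch(batch_key, merge_chunks, get_batch_progress, redis_clear_batch):
    """Merge, read progress metadata, then clear, in that order.

    The three callables are injected because their real signatures are
    not specified here; a single batch_key argument is an assumption.
    """
    merged = merge_chunks(batch_key)
    # Read progress while the metadata still exists in Redis.
    progress = get_batch_progress(batch_key)
    # Clearing last guarantees the metadata read above cannot come back empty.
    redis_clear_batch(batch_key)
    return merged, progress
```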
Scenario: No partial failure on successful query
- WHEN all chunks complete successfully
- THEN the response `meta` dict SHALL NOT include `has_partial_failure`
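The mapping from batch progress to response `meta` described in the scenarios above might look like the following. This is a sketch under assumptions: `build_partial_failure_meta` is a hypothetical helper, progress values are assumed to arrive as strings from Redis, and deriving `failed_chunk_count` from the `failed` progress field is a guess at the intended source.

```python
import json


def build_partial_failure_meta(progress: dict) -> dict:
    """Translate batch progress fields into response meta entries."""
    # Assumed: Redis hash values come back as strings, so compare textually.
    if str(progress.get("has_partial_failure", "")).lower() != "true":
        return {}  # success path: no partial-failure keys at all

    meta = {
        "has_partial_failure": True,
        # Assumption: the count is taken from the `failed` progress field.
        "failed_chunk_count": int(progress.get("failed", 0)),
    }
    raw_ranges = progress.get("failed_ranges")
    if raw_ranges:
        # Stored as a JSON string, exposed to API consumers as a list.
        meta["failed_ranges"] = json.loads(raw_ranges)
    return meta
```

A caller would merge the result into the response with something like `meta.update(build_partial_failure_meta(progress))`, which adds nothing on the success path.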
Scenario: Cache-hit path restores partial failure flag
- WHEN a cached DataFrame is returned (cache hit) and a partial failure flag was stored during the original query
- THEN the response `meta` dict SHALL include the same `has_partial_failure`, `failed_chunk_count`, and `failed_ranges` as the original response
Scenario: Partial failure flag TTL matches data storage layer
- WHEN partial failure is detected and the query result is spilled to parquet spool
- THEN the partial failure flag SHALL be stored with TTL equal to `_REJECT_ENGINE_SPOOL_TTL_SECONDS` (default 21600 seconds)
- WHEN partial failure is detected and the query result is stored in L1/L2 Redis cache
- THEN the partial failure flag SHALL be stored with TTL equal to `_CACHE_TTL` (default 900 seconds)
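The TTL rule above reduces to a single choice based on where the result data lives. In this sketch the two constants reuse the names and default values quoted in the scenario, while `partial_failure_flag_ttl` and its `spilled_to_spool` parameter are hypothetical.

```python
# Default values quoted from the scenario above; the real constants are
# defined in the cache service and may be configured differently.
_REJECT_ENGINE_SPOOL_TTL_SECONDS = 21600
_CACHE_TTL = 900


def partial_failure_flag_ttl(spilled_to_spool: bool) -> int:
    """Pick the flag TTL that matches the storage layer holding the data."""
    if spilled_to_spool:
        return _REJECT_ENGINE_SPOOL_TTL_SECONDS
    return _CACHE_TTL
```

Matching the flag's lifetime to the data's lifetime presumably keeps a cache hit from returning data whose partial-failure flag has already expired.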
Requirement: reject_dataset_cache batch primary execution SHALL avoid paginated replay loops
Batch chunk execution for reject-history primary query SHALL avoid page-by-page replay against paginated list SQL semantics.
Scenario: Chunk execution avoids offset iteration
- WHEN batch engine executes a reject-history chunk in `execute_primary_query()`
- THEN chunk execution SHALL NOT iterate through `offset` pages to assemble full chunk data
- THEN chunk execution SHALL retrieve chunk data via the dedicated primary SQL path
Scenario: Chunk bind contract excludes pagination parameters
- WHEN chunk query parameters are prepared for batch execution
- THEN `offset` and `limit` SHALL NOT be required bind variables for normal chunk retrieval
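As an illustration of the bind contract above, chunk parameters can be assembled so that pagination keys never reach the primary SQL. `build_chunk_binds` and `PAGINATION_BINDS` are hypothetical names, and copying the chunk's own keys directly into the bind dict is an assumption about how chunk bounds are passed.

```python
# Bind names that belong to the paginated list query only.
PAGINATION_BINDS = frozenset({"offset", "limit"})


def build_chunk_binds(base_params: dict, chunk: dict) -> dict:
    """Combine shared filters with one chunk's bounds, minus pagination."""
    # Drop offset/limit even if the caller's params still carry them.
    binds = {k: v for k, v in base_params.items() if k not in PAGINATION_BINDS}
    # Assumption: chunk keys (chunk_start/chunk_end or ids) map 1:1 to binds.
    binds.update(chunk)
    return binds
```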
Requirement: Partial-failure resilience SHALL remain intact after source decoupling
Decoupling from paginated list SQL SHALL NOT regress partial-failure metadata behavior.
Scenario: Failed chunks still produce partial-failure metadata
- WHEN one or more reject-history chunks fail during batch execution
- THEN the response `meta` SHALL still report partial-failure indicators according to the existing resilience contract
Scenario: Successful chunks still merge and continue
- WHEN some chunks succeed and others fail
- THEN the system SHALL continue to merge successful chunks and return partial results
- THEN progress metadata SHALL remain available for diagnostics
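The "merge and continue" behavior can be sketched with plain Python lists standing in for DataFrames. `merge_successful_chunks` and the `(ok, rows)` result shape are hypothetical and chosen only to keep the example dependency-free.

```python
def merge_successful_chunks(chunk_results):
    """Merge rows from successful chunks and count the failed ones.

    chunk_results is an iterable of (ok, rows) pairs in chunk order; this
    shape is an assumption made for the example.
    """
    merged = []
    failed_count = 0
    for ok, rows in chunk_results:
        if ok:
            merged.extend(rows)
        else:
            # A failed chunk is skipped, not fatal: the run continues.
            failed_count += 1
    return merged, failed_count
```

A non-zero `failed_count` is what would drive the partial-failure indicators in the response, while the merged rows are still returned to the caller.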