batch-query-resilience Specification

Purpose

Defines resilience features for the batch query engine: failed-chunk time-range tracking, single retry of transient errors, and propagation of partial-failure metadata to API consumers.

Requirements

Requirement: BatchQueryEngine SHALL track failed chunk time ranges in progress metadata

The engine SHALL record the time ranges of failed chunks in Redis progress metadata so consumers can report which date intervals have missing data.

Scenario: Failed chunk range recorded in sequential path

  • WHEN a chunk with chunk_start and chunk_end keys fails during sequential execution
  • THEN _update_progress() SHALL store a failed_ranges field in the Redis HSET metadata
  • THEN failed_ranges SHALL be a JSON array of objects, each with start and end string keys
  • THEN the array SHALL contain one entry per failed chunk
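The recording step above can be sketched as follows. This is a minimal illustration, not the real `_update_progress()` implementation: the helper name `append_failed_range` and the fake Redis interface are assumptions, while the `chunk_start`/`chunk_end` keys and the `failed_ranges` JSON format come from this spec.

```python
import json

def append_failed_range(redis_client, progress_key: str, chunk: dict) -> None:
    """Append a failed chunk's time range to the failed_ranges metadata field.

    Hypothetical helper; in the real engine this logic lives inside
    _update_progress(). Redis hash values are stored as JSON strings.
    """
    # ID-batch chunks carry only an "ids" key and produce no range entry.
    if "chunk_start" not in chunk or "chunk_end" not in chunk:
        return
    raw = redis_client.hget(progress_key, "failed_ranges")
    ranges = json.loads(raw) if raw else []
    ranges.append({"start": str(chunk["chunk_start"]),
                   "end": str(chunk["chunk_end"])})
    redis_client.hset(progress_key, "failed_ranges", json.dumps(ranges))
```

Because the field is only written on failure, an all-success run leaves `failed_ranges` absent, matching the "no failed ranges" scenario below.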

Scenario: Failed chunk range recorded in parallel path

  • WHEN a chunk with chunk_start and chunk_end keys fails during parallel execution
  • THEN the failed chunk's time range SHALL be appended to failed_ranges in the same format as the sequential path

Scenario: No failed ranges when all chunks succeed

  • WHEN all chunks complete successfully
  • THEN the failed_ranges field SHALL NOT be present in Redis metadata

Scenario: ID-batch chunks produce no failed_ranges entries

  • WHEN a chunk created by decompose_by_ids() (containing only an ids key, no chunk_start/chunk_end) fails
  • THEN no entry SHALL be appended to failed_ranges for that chunk
  • THEN has_partial_failure SHALL still be set to True
  • THEN failed count SHALL still be incremented

Scenario: get_batch_progress returns failed_ranges

  • WHEN get_batch_progress() is called after execution with failed chunks
  • THEN the returned dict SHALL include failed_ranges as a JSON string parseable to a list of {start, end} objects
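Since Redis hash values come back as strings, a consumer of `get_batch_progress()` has to parse `failed_ranges` itself. A hypothetical consumer-side helper (the function name is an assumption; the field name and `{start, end}` shape come from this spec):

```python
import json

def parse_failed_ranges(progress: dict) -> list[dict]:
    """Parse the failed_ranges JSON string from a get_batch_progress() result.

    Returns [] when the field is absent, i.e. when all chunks succeeded.
    """
    raw = progress.get("failed_ranges")
    if not raw:
        return []
    # Each entry is an object with string "start" and "end" keys.
    return [{"start": r["start"], "end": r["end"]} for r in json.loads(raw)]
```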

Requirement: BatchQueryEngine SHALL retry transient chunk failures once

The engine SHALL retry chunk execution exactly once for transient errors (Oracle timeouts, connection errors) but SHALL NOT retry deterministic failures (memory-guard rejections, Redis store failures).
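The classification and retry-once behavior can be sketched as below. The helper names are assumptions; the error patterns (DPY-4024, ORA-01013) and exception types come from the scenarios in this requirement. Deterministic failures signal by returning False rather than raising, so they naturally bypass the retry.

```python
# Transient Oracle error codes named in this spec.
_TRANSIENT_PATTERNS = ("DPY-4024", "ORA-01013")

def is_transient(exc: Exception) -> bool:
    """Classify an exception as transient (retry once) or deterministic."""
    if isinstance(exc, (TimeoutError, ConnectionError, OSError)):
        return True
    return any(p in str(exc) for p in _TRANSIENT_PATTERNS)

def execute_with_retry(execute_chunk, chunk) -> bool:
    """Run a chunk, retrying exactly once on a transient error.

    Hypothetical wrapper around _execute_single_chunk(). A False return
    (memory guard, Redis store failure) is deterministic and never retried.
    """
    for attempt in (1, 2):
        try:
            return execute_chunk(chunk)  # False => deterministic failure
        except Exception as exc:
            if attempt == 2 or not is_transient(exc):
                return False  # retry exhausted, or non-transient error
    return False
```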

Scenario: Oracle timeout retried once

  • WHEN _execute_single_chunk() raises an exception matching Oracle timeout patterns (DPY-4024, ORA-01013)
  • THEN the chunk SHALL be retried exactly once
  • WHEN the retry succeeds
  • THEN the chunk SHALL be marked as successful

Scenario: Connection error retried once

  • WHEN _execute_single_chunk() raises TimeoutError, ConnectionError, or OSError
  • THEN the chunk SHALL be retried exactly once

Scenario: Retry exhausted marks chunk as failed

  • WHEN a chunk fails on both the initial attempt and the retry
  • THEN the chunk SHALL be marked as failed
  • THEN has_partial_failure SHALL be set to True

Scenario: Memory guard failure NOT retried

  • WHEN a chunk's DataFrame exceeds BATCH_CHUNK_MAX_MEMORY_MB
  • THEN the chunk SHALL return False immediately without retry
  • THEN the query function SHALL have been called exactly once for that chunk

Scenario: Redis store failure NOT retried

  • WHEN redis_store_chunk() returns False
  • THEN the chunk SHALL return False immediately without retry

Requirement: reject_dataset_cache SHALL propagate partial failure metadata to API response

The cache service SHALL read batch execution metadata and include partial failure information in the API response meta field.

Scenario: Partial failure metadata included in response

  • WHEN execute_primary_query() uses the batch engine path and get_batch_progress() returns has_partial_failure=True
  • THEN the response meta dict SHALL include has_partial_failure: true
  • THEN the response meta dict SHALL include failed_chunk_count as an integer
  • THEN if failed_ranges is present, the response meta dict SHALL include failed_ranges as a list of {start, end} objects
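The mapping from progress metadata to response meta might look like the sketch below. The function name and the `"failed"` counter key are assumptions; `has_partial_failure`, `failed_chunk_count`, and `failed_ranges` are the names this spec requires.

```python
import json

def build_partial_failure_meta(progress: dict) -> dict:
    """Derive response-meta fields from a get_batch_progress() result.

    Hypothetical helper. Returns {} when the query completed fully, so
    successful responses carry no partial-failure keys at all.
    """
    if progress.get("has_partial_failure") not in ("True", "true", True):
        return {}
    meta = {
        "has_partial_failure": True,
        # "failed" counter field name is an assumption.
        "failed_chunk_count": int(progress.get("failed", 0)),
    }
    raw = progress.get("failed_ranges")
    if raw:
        meta["failed_ranges"] = json.loads(raw)
    return meta
```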

Scenario: Metadata read before redis_clear_batch

  • WHEN execute_primary_query() calls get_batch_progress()
  • THEN the call SHALL occur after merge_chunks() and before redis_clear_batch()

Scenario: No partial failure on successful query

  • WHEN all chunks complete successfully
  • THEN the response meta dict SHALL NOT include has_partial_failure

Scenario: Cache-hit path restores partial failure flag

  • WHEN a cached DataFrame is returned (cache hit) and a partial failure flag was stored during the original query
  • THEN the response meta dict SHALL include the same has_partial_failure, failed_chunk_count, and failed_ranges as the original response

Scenario: Partial failure flag TTL matches data storage layer

  • WHEN partial failure is detected and the query result is spilled to parquet spool
  • THEN the partial failure flag SHALL be stored with TTL equal to _REJECT_ENGINE_SPOOL_TTL_SECONDS (default 21600 seconds)
  • WHEN partial failure is detected and the query result is stored in L1/L2 Redis cache
  • THEN the partial failure flag SHALL be stored with TTL equal to _CACHE_TTL (default 900 seconds)
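The TTL selection reduces to choosing the constant that matches where the data landed, so flag and data expire together. A minimal sketch (the function name is an assumption; the constants and their defaults are from this spec):

```python
# Defaults named in this spec; actual values come from configuration.
_REJECT_ENGINE_SPOOL_TTL_SECONDS = 21600  # 6 h parquet spool retention
_CACHE_TTL = 900                          # 15 min L1/L2 Redis cache

def partial_failure_flag_ttl(spilled_to_spool: bool) -> int:
    """Pick the TTL for the stored partial-failure flag so it expires
    together with the data it describes (hypothetical helper)."""
    return _REJECT_ENGINE_SPOOL_TTL_SECONDS if spilled_to_spool else _CACHE_TTL
```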

Requirement: reject_dataset_cache batch primary execution SHALL avoid paginated replay loops

Batch chunk execution for the reject-history primary query SHALL NOT assemble chunk data by replaying the paginated list SQL page by page.

Scenario: Chunk execution avoids offset iteration

  • WHEN batch engine executes a reject-history chunk in execute_primary_query()
  • THEN chunk execution SHALL NOT iterate through offset pages to assemble full chunk data
  • THEN chunk execution SHALL retrieve chunk data via the dedicated primary SQL path

Scenario: Chunk bind contract excludes pagination parameters

  • WHEN chunk query parameters are prepared for batch execution
  • THEN offset and limit SHALL NOT be required bind variables for normal chunk retrieval
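The bind contract can be illustrated with a hypothetical bind-preparation helper: only the chunk's time range is bound, and no `offset`/`limit` pagination parameters appear.

```python
def build_chunk_binds(chunk: dict) -> dict:
    """Build bind variables for a reject-history chunk query.

    Hypothetical helper illustrating the contract: the dedicated primary
    SQL path binds the chunk time range directly, with no pagination binds.
    """
    return {
        "chunk_start": chunk["chunk_start"],
        "chunk_end": chunk["chunk_end"],
    }
```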

Requirement: Partial-failure resilience SHALL remain intact after source decoupling

Decoupling from paginated list SQL SHALL NOT regress partial-failure metadata behavior.

Scenario: Failed chunks still produce partial-failure metadata

  • WHEN one or more reject-history chunks fail during batch execution
  • THEN response meta SHALL still report partial-failure indicators according to existing resilience contract

Scenario: Successful chunks still merge and continue

  • WHEN some chunks succeed and others fail
  • THEN the system SHALL continue to merge successful chunks and return partial results
  • THEN progress metadata SHALL remain available for diagnostics