Chunk failures in BatchQueryEngine were silently discarded — `has_partial_failure` was tracked in Redis but never surfaced to the API response or frontend. Users could see incomplete data without any warning. This commit closes the gap end-to-end:

Backend:
- Track failed chunk time ranges (`failed_ranges`) in batch engine progress metadata
- Add single retry for transient Oracle errors (timeout, connection) in `_execute_single_chunk`
- Read `get_batch_progress()` after merge but before `redis_clear_batch()` cleanup
- Inject `has_partial_failure`, `failed_chunk_count`, `failed_ranges` into API response meta
- Persist partial failure flag to independent Redis key with TTL aligned to data storage layer
- Add shared container-resolution policy module with wildcard/expansion guardrails
- Refactor reason filter from single-value to multi-select (`reason` → `reasons`)

Frontend:
- Add client-side date range validation (730-day limit) before API submission
- Display amber warning banner on partial failure with specific failed date ranges
- Support generic fallback message for container-mode queries without date ranges
- Update FilterPanel to support multi-select reason chips

Specs & tests:
- Create batch-query-resilience spec; update reject-history-api and reject-history-page specs
- Add 7 new tests for retry, memory guard, failed ranges, partial failure propagation, TTL
- Cross-service regression verified (hold, resource, job, msd — 411 tests pass)

Co-Authored-By: Claude Opus 4.6 <noreply@anthropic.com>
# batch-query-resilience Specification

## Purpose

Batch query engine resilience features: failed chunk range tracking, transient error retry, and partial failure metadata propagation to API consumers.

## Requirements
### Requirement: BatchQueryEngine SHALL track failed chunk time ranges in progress metadata

The engine SHALL record the time ranges of failed chunks in Redis progress metadata so consumers can report which date intervals have missing data.

#### Scenario: Failed chunk range recorded in sequential path

- **WHEN** a chunk with `chunk_start` and `chunk_end` keys fails during sequential execution
- **THEN** `_update_progress()` SHALL store a `failed_ranges` field in the Redis HSET metadata
- **THEN** `failed_ranges` SHALL be a JSON array of objects, each with `start` and `end` string keys
- **THEN** the array SHALL contain one entry per failed chunk

#### Scenario: Failed chunk range recorded in parallel path

- **WHEN** a chunk with `chunk_start` and `chunk_end` keys fails during parallel execution
- **THEN** the failed chunk's time range SHALL be appended to `failed_ranges` in the same format as the sequential path

#### Scenario: No failed ranges when all chunks succeed

- **WHEN** all chunks complete successfully
- **THEN** the `failed_ranges` field SHALL NOT be present in Redis metadata

#### Scenario: ID-batch chunks produce no failed_ranges entries

- **WHEN** a chunk created by `decompose_by_ids()` (containing only an `ids` key, no `chunk_start`/`chunk_end`) fails
- **THEN** no entry SHALL be appended to `failed_ranges` for that chunk
- **THEN** `has_partial_failure` SHALL still be set to `True`
- **THEN** the `failed` count SHALL still be incremented

#### Scenario: get_batch_progress returns failed_ranges

- **WHEN** `get_batch_progress()` is called after execution with failed chunks
- **THEN** the returned dict SHALL include `failed_ranges` as a JSON string parseable to a list of `{start, end}` objects
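The bookkeeping above can be sketched as follows. This is a minimal illustration, not the engine's actual code: the Redis HSET is modeled as a plain dict, and the helper name `record_chunk_failure` is hypothetical. It shows the two behaviors the scenarios require — time-range chunks append to `failed_ranges`, while ID-batch chunks only bump the failure counters.

```python
import json

def record_chunk_failure(progress: dict, chunk: dict) -> None:
    """Mark a chunk failed; append its time range when one exists.

    `progress` stands in for the Redis HSET metadata, so every value is
    stored as a string (counters and the JSON-encoded range list alike).
    """
    progress["failed"] = str(int(progress.get("failed", "0")) + 1)
    progress["has_partial_failure"] = "1"

    # ID-batch chunks (only an `ids` key) carry no time range: they increment
    # the failure count but never touch failed_ranges.
    if "chunk_start" not in chunk or "chunk_end" not in chunk:
        return

    ranges = json.loads(progress.get("failed_ranges", "[]"))
    ranges.append({"start": str(chunk["chunk_start"]),
                   "end": str(chunk["chunk_end"])})
    progress["failed_ranges"] = json.dumps(ranges)
```

Because the hash field stays a JSON string, a consumer of `get_batch_progress()` parses it back with `json.loads` to recover the list of `{start, end}` objects.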
### Requirement: BatchQueryEngine SHALL retry transient chunk failures once

The engine SHALL retry chunk execution once for transient errors (Oracle timeout, connection errors) but SHALL NOT retry deterministic failures (memory guard, Redis store).

#### Scenario: Oracle timeout retried once

- **WHEN** `_execute_single_chunk()` raises an exception matching Oracle timeout patterns (`DPY-4024`, `ORA-01013`)
- **THEN** the chunk SHALL be retried exactly once
- **WHEN** the retry succeeds
- **THEN** the chunk SHALL be marked as successful

#### Scenario: Connection error retried once

- **WHEN** `_execute_single_chunk()` raises `TimeoutError`, `ConnectionError`, or `OSError`
- **THEN** the chunk SHALL be retried exactly once

#### Scenario: Retry exhausted marks chunk as failed

- **WHEN** a chunk fails on both the initial attempt and the retry
- **THEN** the chunk SHALL be marked as failed
- **THEN** `has_partial_failure` SHALL be set to `True`

#### Scenario: Memory guard failure NOT retried

- **WHEN** a chunk's DataFrame exceeds `BATCH_CHUNK_MAX_MEMORY_MB`
- **THEN** the chunk SHALL return `False` immediately without retry
- **THEN** the query function SHALL have been called exactly once for that chunk

#### Scenario: Redis store failure NOT retried

- **WHEN** `redis_store_chunk()` returns `False`
- **THEN** the chunk SHALL return `False` immediately without retry
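A sketch of the retry policy these scenarios describe. The error codes (`DPY-4024`, `ORA-01013`) and the exception classes are taken from the scenarios; everything else — the function names and the exact shape of the chunk callable — is illustrative. Deterministic failures (memory guard, Redis store) surface as a `False` return rather than an exception, so they bypass the retry loop by construction.

```python
import re

# Transient Oracle errors per the spec: call timeout and user-cancel/timeout.
_TRANSIENT_ORACLE = re.compile(r"DPY-4024|ORA-01013")

def is_transient(exc: Exception) -> bool:
    """Connection-level exceptions and known Oracle timeout codes are transient."""
    if isinstance(exc, (TimeoutError, ConnectionError, OSError)):
        return True
    return bool(_TRANSIENT_ORACLE.search(str(exc)))

def execute_with_retry(run_chunk) -> bool:
    """Run a chunk, retrying exactly once and only on a transient error.

    `run_chunk` returns True on success, returns False for deterministic
    failures (memory guard, Redis store) — which are never retried — and
    raises on query errors.
    """
    for attempt in (1, 2):
        try:
            return run_chunk()
        except Exception as exc:
            if attempt == 2 or not is_transient(exc):
                return False  # retry exhausted, or not a transient error
    return False
```

With this shape, a chunk that times out once and then succeeds is marked successful, while a non-transient error (say, a SQL syntax error) fails on the first attempt with no second query against Oracle.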
### Requirement: reject_dataset_cache SHALL propagate partial failure metadata to API response

The cache service SHALL read batch execution metadata and include partial failure information in the API response `meta` field.

#### Scenario: Partial failure metadata included in response

- **WHEN** `execute_primary_query()` uses the batch engine path and `get_batch_progress()` returns `has_partial_failure=True`
- **THEN** the response `meta` dict SHALL include `has_partial_failure: true`
- **THEN** the response `meta` dict SHALL include `failed_chunk_count` as an integer
- **THEN** if `failed_ranges` is present, the response `meta` dict SHALL include `failed_ranges` as a list of `{start, end}` objects

#### Scenario: Metadata read before redis_clear_batch

- **WHEN** `execute_primary_query()` calls `get_batch_progress()`
- **THEN** the call SHALL occur after `merge_chunks()` and before `redis_clear_batch()`
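The ordering constraint is the load-bearing detail here: clearing the batch keys destroys the progress hash, so the metadata read has to sit between merge and cleanup. A stand-in sketch, assuming a hypothetical `finish_batch` helper and a fake engine object (the real call site is `execute_primary_query()`):

```python
import json

def finish_batch(engine, batch_id: str):
    """Merge chunks, then capture partial-failure metadata BEFORE cleanup."""
    df = engine.merge_chunks(batch_id)
    progress = engine.get_batch_progress(batch_id)  # must precede cleanup
    engine.redis_clear_batch(batch_id)              # destroys the progress hash

    meta = {}
    if progress.get("has_partial_failure"):
        meta["has_partial_failure"] = True
        meta["failed_chunk_count"] = int(progress.get("failed", 0))
        if progress.get("failed_ranges"):
            meta["failed_ranges"] = json.loads(progress["failed_ranges"])
    return df, meta

class FakeEngine:
    """Records call order; stands in for the real batch engine."""
    def __init__(self):
        self.calls = []
    def merge_chunks(self, batch_id):
        self.calls.append("merge")
        return "merged-df"
    def get_batch_progress(self, batch_id):
        self.calls.append("progress")
        return {"has_partial_failure": "1", "failed": "2"}
    def redis_clear_batch(self, batch_id):
        self.calls.append("clear")
```

Note that a fully successful run yields an empty `meta`, matching the "No partial failure on successful query" scenario below.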
#### Scenario: No partial failure on successful query

- **WHEN** all chunks complete successfully
- **THEN** the response `meta` dict SHALL NOT include `has_partial_failure`

#### Scenario: Cache-hit path restores partial failure flag

- **WHEN** a cached DataFrame is returned (cache hit) and a partial failure flag was stored during the original query
- **THEN** the response `meta` dict SHALL include the same `has_partial_failure`, `failed_chunk_count`, and `failed_ranges` as the original response

#### Scenario: Partial failure flag TTL matches data storage layer

- **WHEN** partial failure is detected and the query result is spilled to parquet spool
- **THEN** the partial failure flag SHALL be stored with TTL equal to `_REJECT_ENGINE_SPOOL_TTL_SECONDS` (default 21600 seconds)
- **WHEN** partial failure is detected and the query result is stored in L1/L2 Redis cache
- **THEN** the partial failure flag SHALL be stored with TTL equal to `_CACHE_TTL` (default 900 seconds)
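The TTL-alignment rule can be stated in a few lines. The constant names and defaults come from the scenarios above; the `flag_ttl` helper is illustrative. The invariant it encodes: the flag must expire together with the data it describes — if the flag outlived the data it would warn about nothing, and if it expired first a cache hit would silently lose the warning.

```python
_REJECT_ENGINE_SPOOL_TTL_SECONDS = 21600  # parquet spool retention (6 h)
_CACHE_TTL = 900                          # L1/L2 Redis cache retention (15 min)

def flag_ttl(storage_layer: str) -> int:
    """TTL for the partial-failure flag key, aligned to where the data lives."""
    if storage_layer == "spool":
        return _REJECT_ENGINE_SPOOL_TTL_SECONDS
    return _CACHE_TTL
```

In the real service this value would be passed to the Redis `SET ... EX` (or `EXPIRE`) call that persists the independent partial-failure key.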