Chunk failures in BatchQueryEngine were silently discarded — `has_partial_failure` was tracked in Redis but never surfaced to the API response or frontend. Users could see incomplete data without any warning. This commit closes the gap end-to-end:

Backend:
- Track failed chunk time ranges (`failed_ranges`) in batch engine progress metadata
- Add single retry for transient Oracle errors (timeout, connection) in `_execute_single_chunk`
- Read `get_batch_progress()` after merge but before `redis_clear_batch()` cleanup
- Inject `has_partial_failure`, `failed_chunk_count`, `failed_ranges` into API response meta
- Persist partial failure flag to independent Redis key with TTL aligned to data storage layer
- Add shared container-resolution policy module with wildcard/expansion guardrails
- Refactor reason filter from single-value to multi-select (`reason` → `reasons`)

Frontend:
- Add client-side date range validation (730-day limit) before API submission
- Display amber warning banner on partial failure with specific failed date ranges
- Support generic fallback message for container-mode queries without date ranges
- Update FilterPanel to support multi-select reason chips

Specs & tests:
- Create batch-query-resilience spec; update reject-history-api and reject-history-page specs
- Add 7 new tests for retry, memory guard, failed ranges, partial failure propagation, TTL
- Cross-service regression verified (hold, resource, job, msd — 411 tests pass)

Co-Authored-By: Claude Opus 4.6 <noreply@anthropic.com>
# batch-query-resilience Specification

## Purpose

Batch query engine resilience features: failed chunk range tracking, transient error retry, and partial failure metadata propagation to API consumers.

## Requirements
### Requirement: BatchQueryEngine SHALL track failed chunk time ranges in progress metadata

The engine SHALL record the time ranges of failed chunks in Redis progress metadata so consumers can report which date intervals have missing data.

#### Scenario: Failed chunk range recorded in sequential path

- **WHEN** a chunk with `chunk_start` and `chunk_end` keys fails during sequential execution
- **THEN** `_update_progress()` SHALL store a `failed_ranges` field in the Redis HSET metadata
- **THEN** `failed_ranges` SHALL be a JSON array of objects, each with `start` and `end` string keys
- **THEN** the array SHALL contain one entry per failed chunk

#### Scenario: Failed chunk range recorded in parallel path

- **WHEN** a chunk with `chunk_start` and `chunk_end` keys fails during parallel execution
- **THEN** the failed chunk's time range SHALL be appended to `failed_ranges` in the same format as the sequential path

#### Scenario: No failed ranges when all chunks succeed

- **WHEN** all chunks complete successfully
- **THEN** the `failed_ranges` field SHALL NOT be present in Redis metadata

#### Scenario: ID-batch chunks produce no failed_ranges entries

- **WHEN** a chunk created by `decompose_by_ids()` (containing only an `ids` key, no `chunk_start`/`chunk_end`) fails
- **THEN** no entry SHALL be appended to `failed_ranges` for that chunk
- **THEN** `has_partial_failure` SHALL still be set to `True`
- **THEN** the `failed` count SHALL still be incremented

#### Scenario: get_batch_progress returns failed_ranges

- **WHEN** `get_batch_progress()` is called after execution with failed chunks
- **THEN** the returned dict SHALL include `failed_ranges` as a JSON string parseable to a list of `{start, end}` objects
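The bookkeeping above can be sketched as follows. This is a minimal illustration, not the engine's actual code: the Redis HSET is modeled as a plain dict, and the helper name `record_chunk_failure` is hypothetical. It shows the two behaviors the scenarios require — time-range chunks append to `failed_ranges`, while ID-batch chunks only bump the failure counters.

```python
import json

def record_chunk_failure(progress: dict, chunk: dict) -> None:
    """Mark a chunk failed; append its time range when one exists.

    `progress` stands in for the Redis HSET metadata, so every value is
    stored as a string (counters and the JSON-encoded range list alike).
    """
    progress["failed"] = str(int(progress.get("failed", "0")) + 1)
    progress["has_partial_failure"] = "1"

    # ID-batch chunks (only an `ids` key) carry no time range: they increment
    # the failure count but never touch failed_ranges.
    if "chunk_start" not in chunk or "chunk_end" not in chunk:
        return

    ranges = json.loads(progress.get("failed_ranges", "[]"))
    ranges.append({"start": str(chunk["chunk_start"]),
                   "end": str(chunk["chunk_end"])})
    progress["failed_ranges"] = json.dumps(ranges)
```

Because the hash field stays a JSON string, a consumer of `get_batch_progress()` parses it back with `json.loads` to recover the list of `{start, end}` objects.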
### Requirement: BatchQueryEngine SHALL retry transient chunk failures once

The engine SHALL retry chunk execution once for transient errors (Oracle timeout, connection errors) but SHALL NOT retry deterministic failures (memory guard, Redis store).

#### Scenario: Oracle timeout retried once

- **WHEN** `_execute_single_chunk()` raises an exception matching Oracle timeout patterns (`DPY-4024`, `ORA-01013`)
- **THEN** the chunk SHALL be retried exactly once
- **WHEN** the retry succeeds
- **THEN** the chunk SHALL be marked as successful

#### Scenario: Connection error retried once

- **WHEN** `_execute_single_chunk()` raises `TimeoutError`, `ConnectionError`, or `OSError`
- **THEN** the chunk SHALL be retried exactly once

#### Scenario: Retry exhausted marks chunk as failed

- **WHEN** a chunk fails on both the initial attempt and the retry
- **THEN** the chunk SHALL be marked as failed
- **THEN** `has_partial_failure` SHALL be set to `True`

#### Scenario: Memory guard failure NOT retried

- **WHEN** a chunk's DataFrame exceeds `BATCH_CHUNK_MAX_MEMORY_MB`
- **THEN** the chunk SHALL return `False` immediately without retry
- **THEN** the query function SHALL have been called exactly once for that chunk

#### Scenario: Redis store failure NOT retried

- **WHEN** `redis_store_chunk()` returns `False`
- **THEN** the chunk SHALL return `False` immediately without retry
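A sketch of the retry policy these scenarios describe. The error codes (`DPY-4024`, `ORA-01013`) and the exception classes are taken from the scenarios; everything else — the function names and the exact shape of the chunk callable — is illustrative. Deterministic failures (memory guard, Redis store) surface as a `False` return rather than an exception, so they bypass the retry loop by construction.

```python
import re

# Transient Oracle errors per the spec: call timeout and user-cancel/timeout.
_TRANSIENT_ORACLE = re.compile(r"DPY-4024|ORA-01013")

def is_transient(exc: Exception) -> bool:
    """Connection-level exceptions and known Oracle timeout codes are transient."""
    if isinstance(exc, (TimeoutError, ConnectionError, OSError)):
        return True
    return bool(_TRANSIENT_ORACLE.search(str(exc)))

def execute_with_retry(run_chunk) -> bool:
    """Run a chunk, retrying exactly once and only on a transient error.

    `run_chunk` returns True on success, returns False for deterministic
    failures (memory guard, Redis store) — which are never retried — and
    raises on query errors.
    """
    for attempt in (1, 2):
        try:
            return run_chunk()
        except Exception as exc:
            if attempt == 2 or not is_transient(exc):
                return False  # retry exhausted, or not a transient error
    return False
```

With this shape, a chunk that times out once and then succeeds is marked successful, while a non-transient error (say, a SQL syntax error) fails on the first attempt with no second query against Oracle.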
### Requirement: reject_dataset_cache SHALL propagate partial failure metadata to API response

The cache service SHALL read batch execution metadata and include partial failure information in the API response `meta` field.

#### Scenario: Partial failure metadata included in response

- **WHEN** `execute_primary_query()` uses the batch engine path and `get_batch_progress()` returns `has_partial_failure=True`
- **THEN** the response `meta` dict SHALL include `has_partial_failure: true`
- **THEN** the response `meta` dict SHALL include `failed_chunk_count` as an integer
- **THEN** if `failed_ranges` is present, the response `meta` dict SHALL include `failed_ranges` as a list of `{start, end}` objects

#### Scenario: Metadata read before redis_clear_batch

- **WHEN** `execute_primary_query()` calls `get_batch_progress()`
- **THEN** the call SHALL occur after `merge_chunks()` and before `redis_clear_batch()`
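The ordering constraint is the load-bearing detail here: clearing the batch keys destroys the progress hash, so the metadata read has to sit between merge and cleanup. A stand-in sketch, assuming a hypothetical `finish_batch` helper and a fake engine object (the real call site is `execute_primary_query()`):

```python
import json

def finish_batch(engine, batch_id: str):
    """Merge chunks, then capture partial-failure metadata BEFORE cleanup."""
    df = engine.merge_chunks(batch_id)
    progress = engine.get_batch_progress(batch_id)  # must precede cleanup
    engine.redis_clear_batch(batch_id)              # destroys the progress hash

    meta = {}
    if progress.get("has_partial_failure"):
        meta["has_partial_failure"] = True
        meta["failed_chunk_count"] = int(progress.get("failed", 0))
        if progress.get("failed_ranges"):
            meta["failed_ranges"] = json.loads(progress["failed_ranges"])
    return df, meta

class FakeEngine:
    """Records call order; stands in for the real batch engine."""
    def __init__(self):
        self.calls = []
    def merge_chunks(self, batch_id):
        self.calls.append("merge")
        return "merged-df"
    def get_batch_progress(self, batch_id):
        self.calls.append("progress")
        return {"has_partial_failure": "1", "failed": "2"}
    def redis_clear_batch(self, batch_id):
        self.calls.append("clear")
```

Note that a fully successful run yields an empty `meta`, matching the "No partial failure on successful query" scenario below.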
#### Scenario: No partial failure on successful query

- **WHEN** all chunks complete successfully
- **THEN** the response `meta` dict SHALL NOT include `has_partial_failure`

#### Scenario: Cache-hit path restores partial failure flag

- **WHEN** a cached DataFrame is returned (cache hit) and a partial failure flag was stored during the original query
- **THEN** the response `meta` dict SHALL include the same `has_partial_failure`, `failed_chunk_count`, and `failed_ranges` as the original response

#### Scenario: Partial failure flag TTL matches data storage layer

- **WHEN** partial failure is detected and the query result is spilled to parquet spool
- **THEN** the partial failure flag SHALL be stored with TTL equal to `_REJECT_ENGINE_SPOOL_TTL_SECONDS` (default 21600 seconds)
- **WHEN** partial failure is detected and the query result is stored in L1/L2 Redis cache
- **THEN** the partial failure flag SHALL be stored with TTL equal to `_CACHE_TTL` (default 900 seconds)
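The TTL-alignment rule can be stated in a few lines. The constant names and defaults come from the scenarios above; the `flag_ttl` helper is illustrative. The invariant it encodes: the flag must expire together with the data it describes — if the flag outlived the data it would warn about nothing, and if it expired first a cache hit would silently lose the warning.

```python
_REJECT_ENGINE_SPOOL_TTL_SECONDS = 21600  # parquet spool retention (6 h)
_CACHE_TTL = 900                          # L1/L2 Redis cache retention (15 min)

def flag_ttl(storage_layer: str) -> int:
    """TTL for the partial-failure flag key, aligned to where the data lives."""
    if storage_layer == "spool":
        return _REJECT_ENGINE_SPOOL_TTL_SECONDS
    return _CACHE_TTL
```

In the real service this value would be passed to the Redis `SET ... EX` (or `EXPIRE`) call that persists the independent partial-failure key.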