feat(reject-history): fix silent data loss by propagating partial failure metadata to frontend

Chunk failures in BatchQueryEngine were silently discarded — `has_partial_failure` was tracked
in Redis but never surfaced to the API response or frontend. Users could see incomplete data
without any warning. This commit closes the gap end-to-end:

Backend:
- Track failed chunk time ranges (`failed_ranges`) in batch engine progress metadata
- Add single retry for transient Oracle errors (timeout, connection) in `_execute_single_chunk`
- Read `get_batch_progress()` after merge but before `redis_clear_batch()` cleanup
- Inject `has_partial_failure`, `failed_chunk_count`, `failed_ranges` into API response meta
- Persist partial failure flag to independent Redis key with TTL aligned to data storage layer
- Add shared container-resolution policy module with wildcard/expansion guardrails
- Refactor reason filter from single-value to multi-select (`reason` → `reasons`)

Frontend:
- Add client-side date range validation (730-day limit) before API submission
- Display amber warning banner on partial failure with specific failed date ranges
- Support generic fallback message for container-mode queries without date ranges
- Update FilterPanel to support multi-select reason chips

Specs & tests:
- Create batch-query-resilience spec; update reject-history-api and reject-history-page specs
- Add 7 new tests for retry, memory guard, failed ranges, partial failure propagation, TTL
- Cross-service regression verified (hold, resource, job, msd — 411 tests pass)

Co-Authored-By: Claude Opus 4.6 <noreply@anthropic.com>
Author: egg
Date: 2026-03-03 14:00:07 +08:00
Parent: f1506787fb
Commit: a275c30c0e
35 changed files with 3028 additions and 1460 deletions


@@ -0,0 +1,86 @@
# batch-query-resilience Specification
## Purpose
Batch query engine resilience features: failed chunk range tracking, transient error retry, and partial failure metadata propagation to API consumers.
## Requirements
### Requirement: BatchQueryEngine SHALL track failed chunk time ranges in progress metadata
The engine SHALL record the time ranges of failed chunks in Redis progress metadata so consumers can report which date intervals have missing data.
#### Scenario: Failed chunk range recorded in sequential path
- **WHEN** a chunk with `chunk_start` and `chunk_end` keys fails during sequential execution
- **THEN** `_update_progress()` SHALL store a `failed_ranges` field in the Redis HSET metadata
- **THEN** `failed_ranges` SHALL be a JSON array of objects, each with `start` and `end` string keys
- **THEN** the array SHALL contain one entry per failed chunk
#### Scenario: Failed chunk range recorded in parallel path
- **WHEN** a chunk with `chunk_start` and `chunk_end` keys fails during parallel execution
- **THEN** the failed chunk's time range SHALL be appended to `failed_ranges` in the same format as the sequential path
#### Scenario: No failed ranges when all chunks succeed
- **WHEN** all chunks complete successfully
- **THEN** the `failed_ranges` field SHALL NOT be present in Redis metadata
#### Scenario: ID-batch chunks produce no failed_ranges entries
- **WHEN** a chunk created by `decompose_by_ids()` (containing only an `ids` key, no `chunk_start`/`chunk_end`) fails
- **THEN** no entry SHALL be appended to `failed_ranges` for that chunk
- **THEN** `has_partial_failure` SHALL still be set to `True`
- **THEN** `failed` count SHALL still be incremented
#### Scenario: get_batch_progress returns failed_ranges
- **WHEN** `get_batch_progress()` is called after execution with failed chunks
- **THEN** the returned dict SHALL include `failed_ranges` as a JSON string parseable to a list of `{start, end}` objects
### Requirement: BatchQueryEngine SHALL retry transient chunk failures once
The engine SHALL retry chunk execution once for transient errors (Oracle timeout, connection errors) but SHALL NOT retry deterministic failures (memory guard, Redis store).
#### Scenario: Oracle timeout retried once
- **WHEN** `_execute_single_chunk()` raises an exception matching Oracle timeout patterns (`DPY-4024`, `ORA-01013`)
- **THEN** the chunk SHALL be retried exactly once
- **WHEN** the retry succeeds
- **THEN** the chunk SHALL be marked as successful
#### Scenario: Connection error retried once
- **WHEN** `_execute_single_chunk()` raises `TimeoutError`, `ConnectionError`, or `OSError`
- **THEN** the chunk SHALL be retried exactly once
#### Scenario: Retry exhausted marks chunk as failed
- **WHEN** a chunk fails on both the initial attempt and the retry
- **THEN** the chunk SHALL be marked as failed
- **THEN** `has_partial_failure` SHALL be set to `True`
#### Scenario: Memory guard failure NOT retried
- **WHEN** a chunk's DataFrame exceeds `BATCH_CHUNK_MAX_MEMORY_MB`
- **THEN** the chunk SHALL return `False` immediately without retry
- **THEN** the query function SHALL have been called exactly once for that chunk
#### Scenario: Redis store failure NOT retried
- **WHEN** `redis_store_chunk()` returns `False`
- **THEN** the chunk SHALL return `False` immediately without retry
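The retry policy above can be sketched as a thin wrapper around chunk execution. The error patterns come from the scenarios; `run_chunk_with_retry` and `execute` are hypothetical names, and deterministic failures (memory guard, Redis store) are modeled as a `False` return so they bypass the retry:

```python
import re

# Transient Oracle error codes from the scenarios above; the wrapper
# itself is a hypothetical sketch, not the engine's actual code.
_TRANSIENT_ORACLE = re.compile(r"DPY-4024|ORA-01013")

def _is_transient(exc: BaseException) -> bool:
    if isinstance(exc, (TimeoutError, ConnectionError, OSError)):
        return True
    return bool(_TRANSIENT_ORACLE.search(str(exc)))

def run_chunk_with_retry(execute, chunk: dict) -> bool:
    """Try a chunk once; retry exactly once on transient errors only.

    Deterministic failures (memory guard, Redis store) are signalled by
    `execute` returning False and are never retried.
    """
    for attempt in (1, 2):
        try:
            return execute(chunk)  # False => deterministic failure, no retry
        except Exception as exc:
            if attempt == 1 and _is_transient(exc):
                continue  # the single allowed retry
            return False
    return False
```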
### Requirement: reject_dataset_cache SHALL propagate partial failure metadata to API response
The cache service SHALL read batch execution metadata and include partial failure information in the API response `meta` field.
#### Scenario: Partial failure metadata included in response
- **WHEN** `execute_primary_query()` uses the batch engine path and `get_batch_progress()` returns `has_partial_failure=True`
- **THEN** the response `meta` dict SHALL include `has_partial_failure: true`
- **THEN** the response `meta` dict SHALL include `failed_chunk_count` as an integer
- **THEN** if `failed_ranges` is present, the response `meta` dict SHALL include `failed_ranges` as a list of `{start, end}` objects
#### Scenario: Metadata read before redis_clear_batch
- **WHEN** `execute_primary_query()` calls `get_batch_progress()`
- **THEN** the call SHALL occur after `merge_chunks()` and before `redis_clear_batch()`
#### Scenario: No partial failure on successful query
- **WHEN** all chunks complete successfully
- **THEN** the response `meta` dict SHALL NOT include `has_partial_failure`
#### Scenario: Cache-hit path restores partial failure flag
- **WHEN** a cached DataFrame is returned (cache hit) and a partial failure flag was stored during the original query
- **THEN** the response `meta` dict SHALL include the same `has_partial_failure`, `failed_chunk_count`, and `failed_ranges` as the original response
#### Scenario: Partial failure flag TTL matches data storage layer
- **WHEN** partial failure is detected and the query result is spilled to parquet spool
- **THEN** the partial failure flag SHALL be stored with TTL equal to `_REJECT_ENGINE_SPOOL_TTL_SECONDS` (default 21600 seconds)
- **WHEN** partial failure is detected and the query result is stored in L1/L2 Redis cache
- **THEN** the partial failure flag SHALL be stored with TTL equal to `_CACHE_TTL` (default 900 seconds)
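A sketch of the meta-building step, assuming `get_batch_progress()` returns the raw HSET string values (consistent with the JSON-string `failed_ranges` scenario); `build_response_meta` is a hypothetical helper, and the read-before-clear ordering constraint is noted in the comment:

```python
import json

def build_response_meta(progress: dict) -> dict:
    """Map get_batch_progress() output onto the response `meta` fields.

    NOTE: the progress read must happen after merge_chunks() and before
    redis_clear_batch(), or the metadata is already gone. `progress`
    holds HSET string values; an absent flag yields no meta keys at all,
    so the full-success path stays clean. (Hypothetical helper name;
    field names come from the spec.)
    """
    meta = {}
    if progress.get("has_partial_failure") in ("1", "true", True):
        meta["has_partial_failure"] = True
        meta["failed_chunk_count"] = int(progress.get("failed", 0))
        raw = progress.get("failed_ranges")
        if raw:
            meta["failed_ranges"] = json.loads(raw)
    return meta
```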


@@ -14,6 +14,28 @@ The API SHALL validate date parameters and basic paging bounds before executing
- **WHEN** `end_date` is earlier than `start_date`
- **THEN** the API SHALL return HTTP 400 and SHALL NOT run SQL queries
#### Scenario: Date range exceeds maximum
- **WHEN** the date range between `start_date` and `end_date` exceeds 730 days
- **THEN** the API SHALL return HTTP 400 with error message "日期範圍不可超過 730 天" ("date range must not exceed 730 days")
### Requirement: Reject History API primary query response SHALL include partial failure metadata
The primary query endpoint SHALL include batch execution completeness information in the response `meta` field when chunks fail during batch query execution.
#### Scenario: Partial failure metadata in response
- **WHEN** `POST /api/reject-history/query` completes with some chunks failing
- **THEN** the response SHALL include `meta.has_partial_failure: true`
- **THEN** the response SHALL include `meta.failed_chunk_count` as a positive integer
- **THEN** the response SHALL include `meta.failed_ranges` as an array of `{start, end}` date strings (if available)
- **THEN** the HTTP status SHALL still be 200 (data is partially available)
#### Scenario: No partial failure metadata on full success
- **WHEN** `POST /api/reject-history/query` completes with all chunks succeeding
- **THEN** the response `meta` SHALL NOT include `has_partial_failure`, `failed_chunk_count`, or `failed_ranges`
#### Scenario: Partial failure metadata preserved on cache hit
- **WHEN** `POST /api/reject-history/query` returns cached data that originally had partial failures
- **THEN** the response SHALL include the same `meta.has_partial_failure`, `meta.failed_chunk_count`, and `meta.failed_ranges` as the original response
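Putting the scenarios together, a partial-failure response body might look like this (values hypothetical; field names and the HTTP-200-despite-failure behavior come from the scenarios above):

```python
# Illustrative response shape only; the real endpoint returns JSON with
# these meta fields while the HTTP status stays 200.
partial_failure_response = {
    "data": ["..."],  # the rows that were retrieved successfully
    "meta": {
        "has_partial_failure": True,
        "failed_chunk_count": 2,
        "failed_ranges": [
            {"start": "2025-01-01", "end": "2025-01-10"},
            {"start": "2025-03-01", "end": "2025-03-05"},
        ],
    },
}
```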
### Requirement: Reject History API SHALL provide summary metrics endpoint
The API SHALL provide aggregated summary metrics for the selected filter context.


@@ -236,6 +236,63 @@ The page template SHALL delegate sections to focused sub-components, following t
- **THEN** `App.vue` SHALL hold all reactive state and API logic
- **THEN** sub-components SHALL receive data via props and communicate via events
### Requirement: Reject History page SHALL display partial failure warning banner
The page SHALL display an amber warning banner when the query result contains partial failures, informing users that displayed data may be incomplete.
#### Scenario: Warning banner displayed on partial failure
- **WHEN** the primary query response includes `meta.has_partial_failure: true`
- **THEN** an amber warning banner SHALL be displayed below the error banner position
- **THEN** the warning message SHALL be in Traditional Chinese
#### Scenario: Warning banner shows failed date ranges
- **WHEN** `meta.failed_ranges` contains date range objects
- **THEN** the warning banner SHALL display the specific failed date ranges (e.g., "以下日期區間的資料擷取失敗:2025-01-01 ~ 2025-01-10", i.e. "data retrieval failed for the following date range: 2025-01-01 ~ 2025-01-10")
#### Scenario: Warning banner shows generic message without ranges (container mode or missing range data)
- **WHEN** `meta.has_partial_failure` is true but `meta.failed_ranges` is empty or absent (e.g., container-id batch query)
- **THEN** the warning banner SHALL display a generic message with the failed chunk count (e.g., "3 個查詢批次的資料擷取失敗", i.e. "data retrieval failed for 3 query batches")
#### Scenario: Warning banner cleared on new query
- **WHEN** user initiates a new primary query
- **THEN** the warning banner SHALL be cleared before the new query executes
- **THEN** if the new query also has partial failures, the warning SHALL update with new failure information
#### Scenario: Warning banner coexists with error banner
- **WHEN** both an error message and a partial failure warning exist
- **THEN** the error banner SHALL appear first, followed by the warning banner
#### Scenario: Warning banner visual style
- **WHEN** the warning banner is rendered
- **THEN** it SHALL use amber/orange color scheme (background `#fffbeb`, text `#b45309`)
- **THEN** the style SHALL be consistent with the existing `.resolution-warn` color pattern
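The banner's message selection reduces to one branch on `failed_ranges`. A hypothetical Python sketch of that logic (the real implementation lives in the Vue component; the strings are the spec's Traditional Chinese text, and the "、" joiner for multiple ranges is an assumption):

```python
def format_partial_failure_warning(meta: dict) -> str:
    """Pick the range-specific or generic warning message (sketch only)."""
    ranges = meta.get("failed_ranges") or []
    if ranges:
        # Specific failed intervals are available (date-range mode).
        spans = "、".join(f"{r['start']} ~ {r['end']}" for r in ranges)
        return f"以下日期區間的資料擷取失敗:{spans}"
    # Container mode or missing range data: fall back to a chunk count.
    count = meta.get("failed_chunk_count", 0)
    return f"{count} 個查詢批次的資料擷取失敗"
```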
### Requirement: Reject History page SHALL validate date range before query submission
The page SHALL validate the date range on the client side before sending the API request, providing immediate feedback for invalid ranges.
#### Scenario: Date range exceeds 730-day limit
- **WHEN** user selects a date range exceeding 730 days and clicks "查詢" (Query)
- **THEN** the page SHALL display an error message "查詢範圍不可超過 730 天(約兩年)" ("query range must not exceed 730 days, about two years")
- **THEN** the API request SHALL NOT be sent
#### Scenario: Missing start or end date
- **WHEN** user clicks "查詢" without setting both start_date and end_date (in date_range mode)
- **THEN** the page SHALL display an error message "請先設定開始與結束日期"
- **THEN** the API request SHALL NOT be sent
#### Scenario: End date before start date
- **WHEN** user selects an end_date earlier than start_date
- **THEN** the page SHALL display an error message "結束日期必須大於起始日期"
- **THEN** the API request SHALL NOT be sent
#### Scenario: Valid date range proceeds normally
- **WHEN** user selects a valid date range within 730 days and clicks "查詢"
- **THEN** no validation error SHALL be shown
- **THEN** the API request SHALL proceed normally
#### Scenario: Container mode skips date validation
- **WHEN** query mode is "container" (not "date_range")
- **THEN** date range validation SHALL be skipped
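The validation scenarios above can be expressed as a single ordered rule check. A Python port for illustration only (the actual check runs client-side in the Vue app before any API request; `validate_date_range` is a hypothetical name):

```python
from datetime import date
from typing import Optional

MAX_RANGE_DAYS = 730  # mirrors the server-side 730-day limit

def validate_date_range(
    mode: str, start: Optional[date], end: Optional[date]
) -> Optional[str]:
    """Return the error message to display, or None if the query may proceed."""
    if mode != "date_range":
        return None  # container mode skips date validation entirely
    if start is None or end is None:
        return "請先設定開始與結束日期"  # set both dates first
    if end < start:
        return "結束日期必須大於起始日期"  # end must be later than start
    if (end - start).days > MAX_RANGE_DAYS:
        return "查詢範圍不可超過 730 天(約兩年)"  # range over 730 days
    return None
```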
### Requirement: Frontend API timeout
The reject-history page SHALL use a 360-second API timeout (up from 60 seconds) for all Oracle-backed API calls.