feat(reject-history): fix silent data loss by propagating partial failure metadata to frontend

Chunk failures in BatchQueryEngine were silently discarded — `has_partial_failure` was tracked
in Redis but never surfaced to the API response or frontend. Users could see incomplete data
without any warning. This commit closes the gap end-to-end:

Backend:
- Track failed chunk time ranges (`failed_ranges`) in batch engine progress metadata
- Add single retry for transient Oracle errors (timeout, connection) in `_execute_single_chunk`
- Read `get_batch_progress()` after merge but before `redis_clear_batch()` cleanup
- Inject `has_partial_failure`, `failed_chunk_count`, `failed_ranges` into API response meta
- Persist partial failure flag to independent Redis key with TTL aligned to data storage layer
- Add shared container-resolution policy module with wildcard/expansion guardrails
- Refactor reason filter from single-value to multi-select (`reason` → `reasons`)

Frontend:
- Add client-side date range validation (730-day limit) before API submission
- Display amber warning banner on partial failure with specific failed date ranges
- Support generic fallback message for container-mode queries without date ranges
- Update FilterPanel to support multi-select reason chips

Specs & tests:
- Create batch-query-resilience spec; update reject-history-api and reject-history-page specs
- Add 7 new tests for retry, memory guard, failed ranges, partial failure propagation, TTL
- Cross-service regression verified (hold, resource, job, msd — 411 tests pass)

Co-Authored-By: Claude Opus 4.6 <noreply@anthropic.com>
Author: egg
Date: 2026-03-03 14:00:07 +08:00
Parent: f1506787fb
Commit: a275c30c0e
35 changed files with 3028 additions and 1460 deletions


@@ -0,0 +1,86 @@
# batch-query-resilience Specification
## Purpose
Batch query engine resilience features: failed chunk range tracking, transient error retry, and partial failure metadata propagation to API consumers.
## Requirements
### Requirement: BatchQueryEngine SHALL track failed chunk time ranges in progress metadata
The engine SHALL record the time ranges of failed chunks in Redis progress metadata so consumers can report which date intervals have missing data.
#### Scenario: Failed chunk range recorded in sequential path
- **WHEN** a chunk with `chunk_start` and `chunk_end` keys fails during sequential execution
- **THEN** `_update_progress()` SHALL store a `failed_ranges` field in the Redis HSET metadata
- **THEN** `failed_ranges` SHALL be a JSON array of objects, each with `start` and `end` string keys
- **THEN** the array SHALL contain one entry per failed chunk
#### Scenario: Failed chunk range recorded in parallel path
- **WHEN** a chunk with `chunk_start` and `chunk_end` keys fails during parallel execution
- **THEN** the failed chunk's time range SHALL be appended to `failed_ranges` in the same format as the sequential path
#### Scenario: No failed ranges when all chunks succeed
- **WHEN** all chunks complete successfully
- **THEN** the `failed_ranges` field SHALL NOT be present in Redis metadata
#### Scenario: ID-batch chunks produce no failed_ranges entries
- **WHEN** a chunk created by `decompose_by_ids()` (containing only an `ids` key, no `chunk_start`/`chunk_end`) fails
- **THEN** no entry SHALL be appended to `failed_ranges` for that chunk
- **THEN** `has_partial_failure` SHALL still be set to `True`
- **THEN** `failed` count SHALL still be incremented
#### Scenario: get_batch_progress returns failed_ranges
- **WHEN** `get_batch_progress()` is called after execution with failed chunks
- **THEN** the returned dict SHALL include `failed_ranges` as a JSON string parseable to a list of `{start, end}` objects
### Requirement: BatchQueryEngine SHALL retry transient chunk failures once
The engine SHALL retry chunk execution once for transient errors (Oracle timeout, connection errors) but SHALL NOT retry deterministic failures (memory guard, Redis store).
#### Scenario: Oracle timeout retried once
- **WHEN** `_execute_single_chunk()` raises an exception matching Oracle timeout patterns (`DPY-4024`, `ORA-01013`)
- **THEN** the chunk SHALL be retried exactly once
- **WHEN** the retry succeeds
- **THEN** the chunk SHALL be marked as successful
#### Scenario: Connection error retried once
- **WHEN** `_execute_single_chunk()` raises `TimeoutError`, `ConnectionError`, or `OSError`
- **THEN** the chunk SHALL be retried exactly once
#### Scenario: Retry exhausted marks chunk as failed
- **WHEN** a chunk fails on both the initial attempt and the retry
- **THEN** the chunk SHALL be marked as failed
- **THEN** `has_partial_failure` SHALL be set to `True`
#### Scenario: Memory guard failure NOT retried
- **WHEN** a chunk's DataFrame exceeds `BATCH_CHUNK_MAX_MEMORY_MB`
- **THEN** the chunk SHALL return `False` immediately without retry
- **THEN** the query function SHALL have been called exactly once for that chunk
#### Scenario: Redis store failure NOT retried
- **WHEN** `redis_store_chunk()` returns `False`
- **THEN** the chunk SHALL return `False` immediately without retry
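The retry policy above can be sketched as a thin wrapper around chunk execution. The error patterns come from the scenarios; `run_chunk_with_retry` and `execute` are hypothetical names, and deterministic failures (memory guard, Redis store) are modeled as a `False` return so they bypass the retry:

```python
import re

# Transient Oracle error codes from the scenarios above; the wrapper
# itself is a hypothetical sketch, not the engine's actual code.
_TRANSIENT_ORACLE = re.compile(r"DPY-4024|ORA-01013")

def _is_transient(exc: BaseException) -> bool:
    if isinstance(exc, (TimeoutError, ConnectionError, OSError)):
        return True
    return bool(_TRANSIENT_ORACLE.search(str(exc)))

def run_chunk_with_retry(execute, chunk: dict) -> bool:
    """Try a chunk once; retry exactly once on transient errors only.

    Deterministic failures (memory guard, Redis store) are signalled by
    `execute` returning False and are never retried.
    """
    for attempt in (1, 2):
        try:
            return execute(chunk)  # False => deterministic failure, no retry
        except Exception as exc:
            if attempt == 1 and _is_transient(exc):
                continue  # the single allowed retry
            return False
    return False
```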
### Requirement: reject_dataset_cache SHALL propagate partial failure metadata to API response
The cache service SHALL read batch execution metadata and include partial failure information in the API response `meta` field.
#### Scenario: Partial failure metadata included in response
- **WHEN** `execute_primary_query()` uses the batch engine path and `get_batch_progress()` returns `has_partial_failure=True`
- **THEN** the response `meta` dict SHALL include `has_partial_failure: true`
- **THEN** the response `meta` dict SHALL include `failed_chunk_count` as an integer
- **THEN** if `failed_ranges` is present, the response `meta` dict SHALL include `failed_ranges` as a list of `{start, end}` objects
#### Scenario: Metadata read before redis_clear_batch
- **WHEN** `execute_primary_query()` calls `get_batch_progress()`
- **THEN** the call SHALL occur after `merge_chunks()` and before `redis_clear_batch()`
#### Scenario: No partial failure on successful query
- **WHEN** all chunks complete successfully
- **THEN** the response `meta` dict SHALL NOT include `has_partial_failure`
#### Scenario: Cache-hit path restores partial failure flag
- **WHEN** a cached DataFrame is returned (cache hit) and a partial failure flag was stored during the original query
- **THEN** the response `meta` dict SHALL include the same `has_partial_failure`, `failed_chunk_count`, and `failed_ranges` as the original response
#### Scenario: Partial failure flag TTL matches data storage layer
- **WHEN** partial failure is detected and the query result is spilled to parquet spool
- **THEN** the partial failure flag SHALL be stored with TTL equal to `_REJECT_ENGINE_SPOOL_TTL_SECONDS` (default 21600 seconds)
- **WHEN** partial failure is detected and the query result is stored in L1/L2 Redis cache
- **THEN** the partial failure flag SHALL be stored with TTL equal to `_CACHE_TTL` (default 900 seconds)
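A sketch of the meta-building step, assuming `get_batch_progress()` returns the raw HSET string values (consistent with the JSON-string `failed_ranges` scenario); `build_response_meta` is a hypothetical helper, and the read-before-clear ordering constraint is noted in the comment:

```python
import json

def build_response_meta(progress: dict) -> dict:
    """Map get_batch_progress() output onto the response `meta` fields.

    NOTE: the progress read must happen after merge_chunks() and before
    redis_clear_batch(), or the metadata is already gone. `progress`
    holds HSET string values; an absent flag yields no meta keys at all,
    so the full-success path stays clean. (Hypothetical helper name;
    field names come from the spec.)
    """
    meta = {}
    if progress.get("has_partial_failure") in ("1", "true", True):
        meta["has_partial_failure"] = True
        meta["failed_chunk_count"] = int(progress.get("failed", 0))
        raw = progress.get("failed_ranges")
        if raw:
            meta["failed_ranges"] = json.loads(raw)
    return meta
```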


@@ -14,6 +14,28 @@ The API SHALL validate date parameters and basic paging bounds before executing
- **WHEN** `end_date` is earlier than `start_date`
- **THEN** the API SHALL return HTTP 400 and SHALL NOT run SQL queries
#### Scenario: Date range exceeds maximum
- **WHEN** the date range between `start_date` and `end_date` exceeds 730 days
- **THEN** the API SHALL return HTTP 400 with error message "日期範圍不可超過 730 天" ("date range must not exceed 730 days")
### Requirement: Reject History API primary query response SHALL include partial failure metadata
The primary query endpoint SHALL include batch execution completeness information in the response `meta` field when chunks fail during batch query execution.
#### Scenario: Partial failure metadata in response
- **WHEN** `POST /api/reject-history/query` completes with some chunks failing
- **THEN** the response SHALL include `meta.has_partial_failure: true`
- **THEN** the response SHALL include `meta.failed_chunk_count` as a positive integer
- **THEN** the response SHALL include `meta.failed_ranges` as an array of `{start, end}` date strings (if available)
- **THEN** the HTTP status SHALL still be 200 (data is partially available)
#### Scenario: No partial failure metadata on full success
- **WHEN** `POST /api/reject-history/query` completes with all chunks succeeding
- **THEN** the response `meta` SHALL NOT include `has_partial_failure`, `failed_chunk_count`, or `failed_ranges`
#### Scenario: Partial failure metadata preserved on cache hit
- **WHEN** `POST /api/reject-history/query` returns cached data that originally had partial failures
- **THEN** the response SHALL include the same `meta.has_partial_failure`, `meta.failed_chunk_count`, and `meta.failed_ranges` as the original response
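Putting the scenarios together, a partial-failure response body might look like this (values hypothetical; field names and the HTTP-200-despite-failure behavior come from the scenarios above):

```python
# Illustrative response shape only; the real endpoint returns JSON with
# these meta fields while the HTTP status stays 200.
partial_failure_response = {
    "data": ["..."],  # the rows that were retrieved successfully
    "meta": {
        "has_partial_failure": True,
        "failed_chunk_count": 2,
        "failed_ranges": [
            {"start": "2025-01-01", "end": "2025-01-10"},
            {"start": "2025-03-01", "end": "2025-03-05"},
        ],
    },
}
```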
### Requirement: Reject History API SHALL provide summary metrics endpoint
The API SHALL provide aggregated summary metrics for the selected filter context.


@@ -236,6 +236,63 @@ The page template SHALL delegate sections to focused sub-components, following t
- **THEN** `App.vue` SHALL hold all reactive state and API logic
- **THEN** sub-components SHALL receive data via props and communicate via events
### Requirement: Reject History page SHALL display partial failure warning banner
The page SHALL display an amber warning banner when the query result contains partial failures, informing users that displayed data may be incomplete.
#### Scenario: Warning banner displayed on partial failure
- **WHEN** the primary query response includes `meta.has_partial_failure: true`
- **THEN** an amber warning banner SHALL be displayed below the error banner position
- **THEN** the warning message SHALL be in Traditional Chinese
#### Scenario: Warning banner shows failed date ranges
- **WHEN** `meta.failed_ranges` contains date range objects
- **THEN** the warning banner SHALL display the specific failed date ranges (e.g., "以下日期區間的資料擷取失敗:2025-01-01 ~ 2025-01-10", i.e. "data retrieval failed for the following date range: 2025-01-01 ~ 2025-01-10")
#### Scenario: Warning banner shows generic message without ranges (container mode or missing range data)
- **WHEN** `meta.has_partial_failure` is true but `meta.failed_ranges` is empty or absent (e.g., container-id batch query)
- **THEN** the warning banner SHALL display a generic message with the failed chunk count (e.g., "3 個查詢批次的資料擷取失敗", i.e. "data retrieval failed for 3 query batches")
#### Scenario: Warning banner cleared on new query
- **WHEN** user initiates a new primary query
- **THEN** the warning banner SHALL be cleared before the new query executes
- **THEN** if the new query also has partial failures, the warning SHALL update with new failure information
#### Scenario: Warning banner coexists with error banner
- **WHEN** both an error message and a partial failure warning exist
- **THEN** the error banner SHALL appear first, followed by the warning banner
#### Scenario: Warning banner visual style
- **WHEN** the warning banner is rendered
- **THEN** it SHALL use amber/orange color scheme (background `#fffbeb`, text `#b45309`)
- **THEN** the style SHALL be consistent with the existing `.resolution-warn` color pattern
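The banner's message selection reduces to one branch on `failed_ranges`. A hypothetical Python sketch of that logic (the real implementation lives in the Vue component; the strings are the spec's Traditional Chinese text, and the "、" joiner for multiple ranges is an assumption):

```python
def format_partial_failure_warning(meta: dict) -> str:
    """Pick the range-specific or generic warning message (sketch only)."""
    ranges = meta.get("failed_ranges") or []
    if ranges:
        # Specific failed intervals are available (date-range mode).
        spans = "、".join(f"{r['start']} ~ {r['end']}" for r in ranges)
        return f"以下日期區間的資料擷取失敗:{spans}"
    # Container mode or missing range data: fall back to a chunk count.
    count = meta.get("failed_chunk_count", 0)
    return f"{count} 個查詢批次的資料擷取失敗"
```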
### Requirement: Reject History page SHALL validate date range before query submission
The page SHALL validate the date range on the client side before sending the API request, providing immediate feedback for invalid ranges.
#### Scenario: Date range exceeds 730-day limit
- **WHEN** user selects a date range exceeding 730 days and clicks "查詢" (Query)
- **THEN** the page SHALL display an error message "查詢範圍不可超過 730 天(約兩年)" ("query range must not exceed 730 days, about two years")
- **THEN** the API request SHALL NOT be sent
#### Scenario: Missing start or end date
- **WHEN** user clicks "查詢" without setting both start_date and end_date (in date_range mode)
- **THEN** the page SHALL display an error message "請先設定開始與結束日期"
- **THEN** the API request SHALL NOT be sent
#### Scenario: End date before start date
- **WHEN** user selects an end_date earlier than start_date
- **THEN** the page SHALL display an error message "結束日期必須大於起始日期"
- **THEN** the API request SHALL NOT be sent
#### Scenario: Valid date range proceeds normally
- **WHEN** user selects a valid date range within 730 days and clicks "查詢"
- **THEN** no validation error SHALL be shown
- **THEN** the API request SHALL proceed normally
#### Scenario: Container mode skips date validation
- **WHEN** query mode is "container" (not "date_range")
- **THEN** date range validation SHALL be skipped
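The validation scenarios above can be expressed as a single ordered rule check. A Python port for illustration only (the actual check runs client-side in the Vue app before any API request; `validate_date_range` is a hypothetical name):

```python
from datetime import date
from typing import Optional

MAX_RANGE_DAYS = 730  # mirrors the server-side 730-day limit

def validate_date_range(
    mode: str, start: Optional[date], end: Optional[date]
) -> Optional[str]:
    """Return the error message to display, or None if the query may proceed."""
    if mode != "date_range":
        return None  # container mode skips date validation entirely
    if start is None or end is None:
        return "請先設定開始與結束日期"  # set both dates first
    if end < start:
        return "結束日期必須大於起始日期"  # end must be later than start
    if (end - start).days > MAX_RANGE_DAYS:
        return "查詢範圍不可超過 730 天(約兩年)"  # range over 730 days
    return None
```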
### Requirement: Frontend API timeout
The reject-history page SHALL use a 360-second API timeout (up from 60 seconds) for all Oracle-backed API calls.