feat(reject-history): fix silent data loss by propagating partial failure metadata to frontend

Chunk failures in BatchQueryEngine were silently discarded — `has_partial_failure` was tracked
in Redis but never surfaced to the API response or frontend. Users could see incomplete data
without any warning. This commit closes the gap end-to-end:

Backend:
- Track failed chunk time ranges (`failed_ranges`) in batch engine progress metadata
- Add single retry for transient Oracle errors (timeout, connection) in `_execute_single_chunk`
- Read `get_batch_progress()` after merge but before `redis_clear_batch()` cleanup
- Inject `has_partial_failure`, `failed_chunk_count`, `failed_ranges` into API response meta
- Persist partial failure flag to independent Redis key with TTL aligned to data storage layer
- Add shared container-resolution policy module with wildcard/expansion guardrails
- Refactor reason filter from single-value to multi-select (`reason` → `reasons`)

Frontend:
- Add client-side date range validation (730-day limit) before API submission
- Display amber warning banner on partial failure with specific failed date ranges
- Support generic fallback message for container-mode queries without date ranges
- Update FilterPanel to support multi-select reason chips

Specs & tests:
- Create batch-query-resilience spec; update reject-history-api and reject-history-page specs
- Add 7 new tests for retry, memory guard, failed ranges, partial failure propagation, TTL
- Cross-service regression verified (hold, resource, job, msd — 411 tests pass)

Co-Authored-By: Claude Opus 4.6 <noreply@anthropic.com>
Author: egg
Date: 2026-03-03 14:00:07 +08:00
Commit: a275c30c0e (parent f1506787fb)
35 changed files with 3028 additions and 1460 deletions


@@ -0,0 +1,2 @@
schema: spec-driven
created: 2026-03-03


@@ -0,0 +1,80 @@
## Context
The reject-history query uses `BatchQueryEngine` to split long date ranges into 10-day chunks queried against Oracle in parallel. Each chunk is guarded by a memory cap (256 MB) and a timeout (300s). When a chunk fails, the `has_partial_failure` flag is written to a Redis HSET (key: `batch:reject:{hash}:meta`), but this information **is lost at three break points**:
1. `execute_primary_query()` in `reject_dataset_cache.py` never reads the batch progress metadata
2. The API route calls `jsonify({"success": True, **result})` directly, so the partial-chunk-failure path still returns HTTP 200 + `success: true`, with no distinction between complete and incomplete results
3. The frontend `App.vue` has no partial-failure handling logic at all
A further issue: the cleanup phase of `execute_primary_query()` calls `redis_clear_batch()`, which deletes the metadata key, so the read must happen before cleanup.
Also, the 730-day date-range cap is validated only in the backend `_validate_range()`; the frontend gives no immediate feedback.
## Goals / Non-Goals
**Goals:**
- Propagate `has_partial_failure` from Redis metadata to the API response `meta` field
- Track the time ranges of failed chunks so the frontend can show the specific missing intervals
- Show an amber warning banner on the frontend informing users the data may be incomplete
- Add immediate date-range validation on the frontend to avoid invalid API requests
- Add a single retry for transient errors (Oracle timeout, connection failure) to reduce avoidable partial failures
- Persist the partial failure flag to an independent Redis key so the cache-hit path can also restore the warning state
**Non-Goals:**
- No change to the existing chunk decomposition strategy or memory cap values
- No automatic re-query/retry mechanism on the frontend
- No change to the behavior of `EVENT_FETCHER_ALLOW_PARTIAL_RESULTS` (the default is already a safe false)
- No progress bar / live progress-tracking UI
## Decisions
### D1: Read metadata before `redis_clear_batch`
**Decision**: In `execute_primary_query()`, call `get_batch_progress("reject", engine_hash)` to read the partial failure state after `merge_chunks()` and before `redis_clear_batch()`.
**Rationale**: `redis_clear_batch` deletes the key holding the metadata, after which it is unreadable. At that point the chunk data has already been merged, so it is the last moment the metadata can still be read.
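A minimal sketch of this ordering constraint (names are simplified, and `FakeEngine` is an in-memory stand-in for the Redis-backed engine, not the real API):

```python
class FakeEngine:
    """In-memory stand-in for the Redis-backed batch engine (illustrative)."""
    def __init__(self):
        self._meta = {"has_partial_failure": "1", "failed": "2"}

    def merge_chunks(self, service, engine_hash):
        return ["row1", "row2"]            # stands in for the merged DataFrame

    def get_batch_progress(self, service, engine_hash):
        return dict(self._meta)            # empty once cleared

    def redis_clear_batch(self, service, engine_hash):
        self._meta = {}                    # deletes the metadata HSET

def finish_batch(engine, service, engine_hash):
    df = engine.merge_chunks(service, engine_hash)
    # Read progress BEFORE cleanup: redis_clear_batch() deletes the HSET
    # holding has_partial_failure / failed_ranges, after which the partial
    # failure state is unrecoverable.
    progress = engine.get_batch_progress(service, engine_hash)
    engine.redis_clear_batch(service, engine_hash)
    return df, progress
```

Swapping the last two calls in `finish_batch` is exactly the bug D1 guards against: the merged data survives but the warning state is gone.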
### D2: Persist the partial failure flag to an independent Redis key, with TTL aligned to the actual data layer
**Decision**: After `_store_query_result()`, store the partial failure information in a `reject_dataset:{query_id}:partial_failure` Redis HSET. **The TTL must match the layer where the data actually lives**: if the data spills to the parquet spool (`_REJECT_ENGINE_SPOOL_TTL_SECONDS = 21600s`), the flag's TTL is also 21600s; if the data stays in L1/L2 (`_CACHE_TTL = 900s`), the flag's TTL is 900s. Implementation: `_store_partial_failure_flag()` takes a `ttl` parameter, and the caller passes `_REJECT_ENGINE_SPOOL_TTL_SECONDS` or `_CACHE_TTL` based on `should_spill`. The cache-hit path restores the flag via `_load_partial_failure_flag(query_id)`.
**Alternative A**: Embed the flag in the DataFrame's attrs, or pickle it separately.
**Why rejected**: DataFrame attrs are lost during parquet serialization, and pickle adds deserialization risk.
**Alternative B**: A fixed TTL of 900s.
**Why rejected**: Large queries spill to the parquet spool (21600s TTL); the data would stay readable for 6 hours while the partial failure flag expires after 15 minutes, producing "data still readable, warning gone".
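The TTL selection can be sketched as follows. The constants mirror the values quoted in this design, but the storage helper is illustrative, with a plain dict playing the role of a Redis HSET plus EXPIRE:

```python
import json

# Constants quoted in this design (D2).
_REJECT_ENGINE_SPOOL_TTL_SECONDS = 21600  # parquet spool lifetime (6h)
_CACHE_TTL = 900                          # L1/L2 cache lifetime (15min)

def partial_failure_ttl(should_spill: bool) -> int:
    # The flag must live exactly as long as the layer holding the data;
    # a fixed 900s (alternative B) lets spooled data outlive its warning.
    return _REJECT_ENGINE_SPOOL_TTL_SECONDS if should_spill else _CACHE_TTL

def store_partial_failure_flag(store, query_id, failed_count, failed_ranges, ttl):
    """Illustrative stand-in for _store_partial_failure_flag(); `store`
    is a dict standing in for Redis (HSET fields + TTL)."""
    key = f"reject_dataset:{query_id}:partial_failure"
    store[key] = {
        "failed_chunk_count": str(failed_count),
        "failed_ranges": json.dumps(failed_ranges),
        "ttl": ttl,
    }
```

The cache-hit path would read the same key back (the real `_load_partial_failure_flag()`), which works only while the TTL chosen here has not expired.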
### D3: Track failed_ranges in `_update_progress` (time-range chunks only)
**Decision**: Extend `_update_progress()` with a `failed_ranges: Optional[List[Dict]]` parameter, stored as a JSON string in the Redis HSET. Both the sequential and parallel paths extract `chunk_start` / `chunk_end` from the failed chunk descriptor. **A range is recorded only when the chunk descriptor contains `chunk_start`/`chunk_end`** (i.e., the time-range chunks produced by `decompose_by_time_range`).
**Container-id batching case**: reject's container mode uses `decompose_by_ids()`, where the chunk shape is `{"ids": [...]}` with no date range. `failed_ranges` is then an empty list, and the frontend falls back to a generic warning keyed off `failed_chunk_count > 0` ("data retrieval failed for N query batches"), with no date intervals.
**Rationale**: The chunk descriptor's shape is determined by the decompose function; the engine layer should not assume every chunk carries a time range.
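The descriptor check reduces to a guard like the one below (a sketch: the field names match this design, the function name is hypothetical):

```python
def append_failed_range(failed_ranges, chunk):
    """Record the failed chunk's interval only for time-range chunks;
    id-batch chunks ({"ids": [...]}) contribute no range (D3)."""
    if "chunk_start" in chunk and "chunk_end" in chunk:
        failed_ranges.append(
            {"start": str(chunk["chunk_start"]), "end": str(chunk["chunk_end"])}
        )
```

An id-batch chunk falls through the guard silently, which is why `has_partial_failure` and the `failed` count must be tracked independently of `failed_ranges`.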
### D4: No retry on memory guard failures
**Decision**: Add `max_retries=1` to `_execute_single_chunk()`, but retry only exceptions for which `_is_retryable_error()` returns true. Memory guard failures (memory cap exceeded) and Redis store failures return False immediately, with no retry.
**Rationale**: A memory guard hit means that time window genuinely holds too much data, so a retry would end the same way. Oracle timeouts and connection errors, in contrast, may be transient.
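A sketch of the retry policy. `_RETRYABLE_PATTERNS` here lists only the two Oracle codes named in the spec, and `run_chunk` abstracts the real `_execute_single_chunk` body, which returns False (rather than raising) for memory-guard and Redis-store failures:

```python
_RETRYABLE_PATTERNS = ("DPY-4024", "ORA-01013")  # Oracle timeout codes from the spec

def is_retryable_error(exc: Exception) -> bool:
    """True for transient errors worth one retry (D4)."""
    if isinstance(exc, (TimeoutError, ConnectionError, OSError)):
        return True
    return any(pattern in str(exc) for pattern in _RETRYABLE_PATTERNS)

def execute_with_retry(run_chunk, max_retries: int = 1) -> bool:
    for attempt in range(max_retries + 1):
        try:
            # A False return (memory guard, Redis store failure) is final:
            # it propagates immediately and is never retried.
            return run_chunk()
        except Exception as exc:
            if attempt < max_retries and is_retryable_error(exc):
                continue  # single retry for transient errors
            return False
    return False
```

Because the loop wraps the existing try/except additively, a chunk that succeeds on the first attempt takes exactly the same path as before the change.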
### D5: Frontend warning banner reuses the existing amber palette
**Decision**: Add a `.warning-banner` CSS class using `background: #fffbeb; color: #b45309`, consistent with the existing `.resolution-warn` amber palette. Placed after `.error-banner`.
**Alternative**: A toast/notification component.
**Why rejected**: The project has no toast system; an amber banner stays consistent with the red error-banner pattern.
### D6: Frontend date validation function lives in the shared filters module
**Decision**: Add `validateDateRange()` to `frontend/src/core/reject-history-filters.js`, reusing the validation pattern from `resource-history/App.vue:231-248`.
**Rationale**: reject-history-filters.js is already this page's filter utility module, and validateDateRange is filter-validation logic.
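The validation contract (empty string means pass) can be sketched like this. Note this is Python for illustration only: the real implementation is the JavaScript `validateDateRange()` in reject-history-filters.js, and the messages below are English placeholders for the Traditional Chinese strings the page actually shows:

```python
from datetime import date
from typing import Optional

MAX_QUERY_DAYS = 730  # limit quoted in this design

def validate_date_range(start: Optional[date], end: Optional[date]) -> str:
    """Return '' when the range is valid, else an error message
    (mirrors the empty-string-means-pass contract in task 1.1)."""
    if start is None or end is None:
        return "please set both start and end dates"
    if end < start:
        return "end date must be after start date"
    if (end - start).days > MAX_QUERY_DAYS:
        return f"date range must not exceed {MAX_QUERY_DAYS} days"
    return ""
```

Running this before the API call is what turns the backend-only 730-day check into immediate feedback.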
## Risks / Trade-offs
- **[Medium] The retry logic affects every execute_plan caller**: `_execute_single_chunk()` is a shared function used by five services (reject / hold / resource / job / msd). The retry logic is additive (a retry loop wrapped around the existing try/except), so the success path is unchanged. → Smoke-test the other 4 services; passing the existing tests is sufficient. For more conservatism, a `max_retries` parameter (default 1) lets callers opt out, but the current judgment is that a uniform single retry benefits all services.
- **[Low] Retries add Oracle load**: a single retry at most doubles the volume of failing queries. → `_is_retryable_error()` filters strictly, retrying only transient errors, and the parallel path runs at most 3 workers, keeping the impact bounded.
- **[Low] failed_ranges JSON size**: in theory all 73 chunks (730/10) could fail, producing 73 ranges at under 5 KB of JSON. → Far below Redis HSET field limits.


@@ -0,0 +1,34 @@
## Why
The reject-history query's blast-radius guards (time chunking + 256 MB memory cap + 300s Oracle timeout) drop a chunk's data when it fails, and the `has_partial_failure` flag is written only to Redis metadata, **never reaching the API response or the frontend**. Users see incomplete data without knowing it, undermining the decisions made on it. In addition, the 730-day date cap is validated only on the backend, with no immediate frontend hint, causing needless waiting.
## What Changes
- Backend `reject_dataset_cache` reads the batch progress metadata after `execute_plan()` and injects `has_partial_failure`, the failed chunk count, and the failed time ranges into the API response `meta` field
- Backend `batch_query_engine` tracks the time-interval descriptors of failed chunks, writing them to the `failed_ranges` field of the Redis metadata
- Backend `_execute_single_chunk()` retries transient errors (Oracle timeout / connection errors) once; memory guard failures are not retried
- Frontend adds an amber warning banner: when `meta.has_partial_failure` is true, it shows an incomplete-data warning with the failed date intervals
- Frontend adds immediate date-range validation (730-day cap), blocking invalid ranges before the API call
## Capabilities
### New Capabilities
- `batch-query-resilience`: failed-range tracking in the batch query engine, partial failure metadata propagation, and a single-retry mechanism for transient errors
### Modified Capabilities
- `reject-history-api`: API response `meta` gains `has_partial_failure`, `failed_chunk_count`, and `failed_ranges` fields so the frontend knows whether results are complete
- `reject-history-page`: adds an amber warning banner for partial failures, plus immediate frontend date-range validation (730-day cap)
## Impact
- **Backend service (batch_query_engine.py, a shared module affecting every service that uses execute_plan)**:
  - failed_ranges tracking + retry logic modify `_execute_single_chunk()`, which is shared by the **reject / hold / resource / job / msd** dataset cache services
  - The retry logic is additive (a new retry loop) and does not change the existing success path; backward compatible for the other services
  - `failed_ranges` is recorded only when the chunk descriptor contains `chunk_start`/`chunk_end`; container-id batching (used only by reject's container mode) is unaffected
  - Regression smoke tests required for hold / resource / job / msd
- **Backend service (reject_dataset_cache.py)**: reads the metadata + injects it into the response + persists the partial failure flag
- **Frontend**: `App.vue` (warning banner + date validation), `reject-history-filters.js` (validateDateRange function), `style.css` (.warning-banner styles)
- **API contract**: response `meta` gains optional fields (backward compatible; existing frontends unaffected)
- **Tests**: `test_batch_query_engine.py` and `test_reject_dataset_cache.py` need new cases; hold / resource / job / msd need regression verification


@@ -0,0 +1,82 @@
## ADDED Requirements
### Requirement: BatchQueryEngine SHALL track failed chunk time ranges in progress metadata
The engine SHALL record the time ranges of failed chunks in Redis progress metadata so consumers can report which date intervals have missing data.
#### Scenario: Failed chunk range recorded in sequential path
- **WHEN** a chunk with `chunk_start` and `chunk_end` keys fails during sequential execution
- **THEN** `_update_progress()` SHALL store a `failed_ranges` field in the Redis HSET metadata
- **THEN** `failed_ranges` SHALL be a JSON array of objects, each with `start` and `end` string keys
- **THEN** the array SHALL contain one entry per failed chunk
#### Scenario: Failed chunk range recorded in parallel path
- **WHEN** a chunk with `chunk_start` and `chunk_end` keys fails during parallel execution
- **THEN** the failed chunk's time range SHALL be appended to `failed_ranges` in the same format as the sequential path
#### Scenario: No failed ranges when all chunks succeed
- **WHEN** all chunks complete successfully
- **THEN** the `failed_ranges` field SHALL NOT be present in Redis metadata
#### Scenario: ID-batch chunks produce no failed_ranges entries
- **WHEN** a chunk created by `decompose_by_ids()` (containing only an `ids` key, no `chunk_start`/`chunk_end`) fails
- **THEN** no entry SHALL be appended to `failed_ranges` for that chunk
- **THEN** `has_partial_failure` SHALL still be set to `True`
- **THEN** `failed` count SHALL still be incremented
#### Scenario: get_batch_progress returns failed_ranges
- **WHEN** `get_batch_progress()` is called after execution with failed chunks
- **THEN** the returned dict SHALL include `failed_ranges` as a JSON string parseable to a list of `{start, end}` objects
### Requirement: BatchQueryEngine SHALL retry transient chunk failures once
The engine SHALL retry chunk execution once for transient errors (Oracle timeout, connection errors) but SHALL NOT retry deterministic failures (memory guard, Redis store).
#### Scenario: Oracle timeout retried once
- **WHEN** `_execute_single_chunk()` raises an exception matching Oracle timeout patterns (`DPY-4024`, `ORA-01013`)
- **THEN** the chunk SHALL be retried exactly once
- **WHEN** the retry succeeds
- **THEN** the chunk SHALL be marked as successful
#### Scenario: Connection error retried once
- **WHEN** `_execute_single_chunk()` raises `TimeoutError`, `ConnectionError`, or `OSError`
- **THEN** the chunk SHALL be retried exactly once
#### Scenario: Retry exhausted marks chunk as failed
- **WHEN** a chunk fails on both the initial attempt and the retry
- **THEN** the chunk SHALL be marked as failed
- **THEN** `has_partial_failure` SHALL be set to `True`
#### Scenario: Memory guard failure NOT retried
- **WHEN** a chunk's DataFrame exceeds `BATCH_CHUNK_MAX_MEMORY_MB`
- **THEN** the chunk SHALL return `False` immediately without retry
- **THEN** the query function SHALL have been called exactly once for that chunk
#### Scenario: Redis store failure NOT retried
- **WHEN** `redis_store_chunk()` returns `False`
- **THEN** the chunk SHALL return `False` immediately without retry
### Requirement: reject_dataset_cache SHALL propagate partial failure metadata to API response
The cache service SHALL read batch execution metadata and include partial failure information in the API response `meta` field.
#### Scenario: Partial failure metadata included in response
- **WHEN** `execute_primary_query()` uses the batch engine path and `get_batch_progress()` returns `has_partial_failure=True`
- **THEN** the response `meta` dict SHALL include `has_partial_failure: true`
- **THEN** the response `meta` dict SHALL include `failed_chunk_count` as an integer
- **THEN** if `failed_ranges` is present, the response `meta` dict SHALL include `failed_ranges` as a list of `{start, end}` objects
#### Scenario: Metadata read before redis_clear_batch
- **WHEN** `execute_primary_query()` calls `get_batch_progress()`
- **THEN** the call SHALL occur after `merge_chunks()` and before `redis_clear_batch()`
#### Scenario: No partial failure on successful query
- **WHEN** all chunks complete successfully
- **THEN** the response `meta` dict SHALL NOT include `has_partial_failure`
#### Scenario: Cache-hit path restores partial failure flag
- **WHEN** a cached DataFrame is returned (cache hit) and a partial failure flag was stored during the original query
- **THEN** the response `meta` dict SHALL include the same `has_partial_failure`, `failed_chunk_count`, and `failed_ranges` as the original response
#### Scenario: Partial failure flag TTL matches data storage layer
- **WHEN** partial failure is detected and the query result is spilled to parquet spool
- **THEN** the partial failure flag SHALL be stored with TTL equal to `_REJECT_ENGINE_SPOOL_TTL_SECONDS` (default 21600 seconds)
- **WHEN** partial failure is detected and the query result is stored in L1/L2 Redis cache
- **THEN** the partial failure flag SHALL be stored with TTL equal to `_CACHE_TTL` (default 900 seconds)


@@ -0,0 +1,36 @@
## MODIFIED Requirements
### Requirement: Reject History API SHALL validate required query parameters
The API SHALL validate date parameters and basic paging bounds before executing database work.
#### Scenario: Missing required dates
- **WHEN** a reject-history endpoint requiring date range is called without `start_date` or `end_date`
- **THEN** the API SHALL return HTTP 400 with a descriptive validation error
#### Scenario: Invalid date order
- **WHEN** `end_date` is earlier than `start_date`
- **THEN** the API SHALL return HTTP 400 and SHALL NOT run SQL queries
#### Scenario: Date range exceeds maximum
- **WHEN** the date range between `start_date` and `end_date` exceeds 730 days
- **THEN** the API SHALL return HTTP 400 with error message "日期範圍不可超過 730 天"
## ADDED Requirements
### Requirement: Reject History API primary query response SHALL include partial failure metadata
The primary query endpoint SHALL include batch execution completeness information in the response `meta` field when chunks fail during batch query execution.
#### Scenario: Partial failure metadata in response
- **WHEN** `POST /api/reject-history/query` completes with some chunks failing
- **THEN** the response SHALL include `meta.has_partial_failure: true`
- **THEN** the response SHALL include `meta.failed_chunk_count` as a positive integer
- **THEN** the response SHALL include `meta.failed_ranges` as an array of `{start, end}` objects containing date strings (if available)
- **THEN** the HTTP status SHALL still be 200 (data is partially available)
#### Scenario: No partial failure metadata on full success
- **WHEN** `POST /api/reject-history/query` completes with all chunks succeeding
- **THEN** the response `meta` SHALL NOT include `has_partial_failure`, `failed_chunk_count`, or `failed_ranges`
#### Scenario: Partial failure metadata preserved on cache hit
- **WHEN** `POST /api/reject-history/query` returns cached data that originally had partial failures
- **THEN** the response SHALL include the same `meta.has_partial_failure`, `meta.failed_chunk_count`, and `meta.failed_ranges` as the original response


@@ -0,0 +1,58 @@
## ADDED Requirements
### Requirement: Reject History page SHALL display partial failure warning banner
The page SHALL display an amber warning banner when the query result contains partial failures, informing users that displayed data may be incomplete.
#### Scenario: Warning banner displayed on partial failure
- **WHEN** the primary query response includes `meta.has_partial_failure: true`
- **THEN** an amber warning banner SHALL be displayed below the error banner position
- **THEN** the warning message SHALL be in Traditional Chinese
#### Scenario: Warning banner shows failed date ranges
- **WHEN** `meta.failed_ranges` contains date range objects
- **THEN** the warning banner SHALL display the specific failed date ranges (e.g., "以下日期區間的資料擷取失敗2025-01-01 ~ 2025-01-10")
#### Scenario: Warning banner shows generic message without ranges (container mode or missing range data)
- **WHEN** `meta.has_partial_failure` is true but `meta.failed_ranges` is empty or absent (e.g., container-id batch query)
- **THEN** the warning banner SHALL display a generic message with the failed chunk count (e.g., "3 個查詢批次的資料擷取失敗")
#### Scenario: Warning banner cleared on new query
- **WHEN** user initiates a new primary query
- **THEN** the warning banner SHALL be cleared before the new query executes
- **THEN** if the new query also has partial failures, the warning SHALL update with new failure information
#### Scenario: Warning banner coexists with error banner
- **WHEN** both an error message and a partial failure warning exist
- **THEN** the error banner SHALL appear first, followed by the warning banner
#### Scenario: Warning banner visual style
- **WHEN** the warning banner is rendered
- **THEN** it SHALL use amber/orange color scheme (background `#fffbeb`, text `#b45309`)
- **THEN** the style SHALL be consistent with the existing `.resolution-warn` color pattern
### Requirement: Reject History page SHALL validate date range before query submission
The page SHALL validate the date range on the client side before sending the API request, providing immediate feedback for invalid ranges.
#### Scenario: Date range exceeds 730-day limit
- **WHEN** user selects a date range exceeding 730 days and clicks "查詢"
- **THEN** the page SHALL display an error message "查詢範圍不可超過 730 天(約兩年)"
- **THEN** the API request SHALL NOT be sent
#### Scenario: Missing start or end date
- **WHEN** user clicks "查詢" without setting both start_date and end_date (in date_range mode)
- **THEN** the page SHALL display an error message "請先設定開始與結束日期"
- **THEN** the API request SHALL NOT be sent
#### Scenario: End date before start date
- **WHEN** user selects an end_date earlier than start_date
- **THEN** the page SHALL display an error message "結束日期必須大於起始日期"
- **THEN** the API request SHALL NOT be sent
#### Scenario: Valid date range proceeds normally
- **WHEN** user selects a valid date range within 730 days and clicks "查詢"
- **THEN** no validation error SHALL be shown
- **THEN** the API request SHALL proceed normally
#### Scenario: Container mode skips date validation
- **WHEN** query mode is "container" (not "date_range")
- **THEN** date range validation SHALL be skipped


@@ -0,0 +1,46 @@
## 1. Frontend immediate date-range validation
- [x] 1.1 Add a `validateDateRange(startDate, endDate)` function (MAX_QUERY_DAYS=730) at the end of `frontend/src/core/reject-history-filters.js`; an empty string return means the range passes, a non-empty string is the error message
- [x] 1.2 In `frontend/src/reject-history/App.vue`, import `validateDateRange` and add date_range-mode validation in `executePrimaryQuery()` before the API call (after the `errorMessage.value = ''` reset); on failure, set `errorMessage` and return
## 2. Backend: track failed chunk time ranges
- [x] 2.1 In `batch_query_engine.py`, add a `failed_ranges: Optional[List] = None` parameter to the `_update_progress()` signature and conditionally add a `json.dumps(failed_ranges)` field to the mapping dict
- [x] 2.2 In the sequential path of `execute_plan()` (the `for idx, chunk in enumerate(chunks)` loop section), add `failed_range_list = []`; on chunk failure, conditionally extract `chunk_start`/`chunk_end` from the chunk descriptor and append to the list (only time-range chunks have them), passing it into every `_update_progress()` call
- [x] 2.3 In `_execute_parallel()`, change the `futures` dict to `futures[future] = (idx, chunk)` to retain the chunk descriptor; add `failed_range_list`, conditionally appending a range on failure; change the return value to the 4-tuple `(completed, failed, has_partial_failure, failed_range_list)`; update the destructuring of the `_execute_parallel()` call in `execute_plan()` to match the 4-tuple
## 3. Backend: single retry for chunk failures
- [x] 3.1 In `batch_query_engine.py`, add a `_RETRYABLE_PATTERNS` constant and an `_is_retryable_error(exc)` function that recognizes Oracle timeouts / connection errors
- [x] 3.2 Change `_execute_single_chunk()` to take a `max_retries: int = 1` parameter and wrap the try/except in a retry loop; memory guard and Redis store failures return False directly without retry; inside the exception handler, if `_is_retryable_error()` is True, log a warning and continue
## 4. Backend: propagate partial failure to the API response
- [x] 4.1 In `reject_dataset_cache.py`, add `get_batch_progress` to the batch_query_engine local-import block inside `execute_primary_query()`
- [x] 4.2 In `execute_primary_query()`, after the `merge_chunks()` call and before the `redis_clear_batch()` call, call `get_batch_progress("reject", engine_hash)` to read `has_partial_failure`, `failed`, and `failed_ranges`
- [x] 4.3 After `redis_clear_batch()` and before `_apply_policy_filters()`, conditionally inject the partial failure information into the `meta` dict (`has_partial_failure`, `failed_chunk_count`, `failed_ranges`)
- [x] 4.4 Add two helpers, `_store_partial_failure_flag(query_id, failed_count, failed_ranges, ttl)` and `_load_partial_failure_flag(query_id)`, that access the `reject_dataset:{query_id}:partial_failure` Redis HSET; `ttl` is passed in by the caller
- [x] 4.5 Call `_store_partial_failure_flag()` after the `_store_query_result()` call, choosing the TTL from the `should_spill` decision inside `_store_query_result()` (spilled to spool: `_REJECT_ENGINE_SPOOL_TTL_SECONDS`, 21600s; otherwise `_CACHE_TTL`, 900s); in the `_get_cached_df()` cache-hit path, call `_load_partial_failure_flag()` and `meta.update()`
## 5. Frontend partial-failure warning banner
- [x] 5.1 In `frontend/src/reject-history/App.vue`, add a `partialFailureWarning` ref, reset it at the start of `executePrimaryQuery()`, and after reading the result set the warning message from `result.meta.has_partial_failure` (with date-interval text from failed_ranges; a generic failed_chunk_count message when there are no ranges)
- [x] 5.2 In the App.vue template, after the error-banner `<div>`, add `<div v-if="partialFailureWarning" class="warning-banner">{{ partialFailureWarning }}</div>`
- [x] 5.3 In `frontend/src/reject-history/style.css`, after the `.error-banner` rule, add the `.warning-banner` styles (background: #fffbeb, color: #b45309)
## 6. Tests
- [x] 6.1 Add `test_transient_failure_retried_once` to `tests/test_batch_query_engine.py`: mock query_fn raises TimeoutError on the first call and succeeds on the second; assert the chunk ultimately succeeds and query_fn is called twice
- [x] 6.2 Add `test_memory_guard_not_retried` to `tests/test_batch_query_engine.py`: mock query_fn returns an oversized DataFrame; assert query_fn is called only once
- [x] 6.3 Add `test_failed_ranges_tracked` to `tests/test_batch_query_engine.py`: 3 chunks with 1 failing; assert the Redis metadata contains the `failed_ranges` JSON
- [x] 6.4 Add `test_partial_failure_in_response_meta` to `tests/test_reject_dataset_cache.py`: mock `get_batch_progress` to return `has_partial_failure=True`; assert the response `meta` contains the flag and `failed_ranges`
- [x] 6.5 Add `test_cache_hit_restores_partial_failure` to `tests/test_reject_dataset_cache.py`: store the partial failure flag first, then assert meta carries the flag on cache hit
- [x] 6.6 Add `test_partial_failure_ttl_matches_spool` to `tests/test_reject_dataset_cache.py`: assert the flag TTL is `_REJECT_ENGINE_SPOOL_TTL_SECONDS` when should_spill=True, otherwise `_CACHE_TTL`
- [x] 6.7 Add `test_id_batch_chunk_no_failed_ranges` to `tests/test_batch_query_engine.py`: when a container-id batch chunk fails, assert `failed_ranges` is an empty list but `has_partial_failure=True`
## 7. Cross-service regression verification
- [x] 7.1 Run `pytest tests/test_batch_query_engine.py tests/test_reject_dataset_cache.py -v` to confirm all tests for this change pass
- [x] 7.2 Run the hold_dataset_cache tests to confirm the retry logic does not affect hold: `pytest tests/ -k "hold" -v`
- [x] 7.3 Run the resource / job / msd tests to confirm no regression: `pytest tests/ -k "resource or job or mid_section" -v`
- [x] 7.4 If any cross-service test fails, check whether the `_execute_single_chunk` signature change (the `max_retries` parameter) is the cause, and confirm the keyword-only default does not affect existing calls