feat(reject-history): fix silent data loss by propagating partial failure metadata to frontend

Chunk failures in BatchQueryEngine were silently discarded: `has_partial_failure` was tracked in Redis but never surfaced to the API response or frontend, so users could see incomplete data without any warning. This commit closes the gap end-to-end.

Backend:
- Track failed chunk time ranges (`failed_ranges`) in batch engine progress metadata
- Add a single retry for transient Oracle errors (timeout, connection) in `_execute_single_chunk`
- Read `get_batch_progress()` after merge but before `redis_clear_batch()` cleanup
- Inject `has_partial_failure`, `failed_chunk_count`, and `failed_ranges` into the API response meta
- Persist the partial failure flag to an independent Redis key with TTL aligned to the data storage layer
- Add a shared container-resolution policy module with wildcard/expansion guardrails
- Refactor the reason filter from single-value to multi-select (`reason` → `reasons`)

Frontend:
- Add client-side date range validation (730-day limit) before API submission
- Display an amber warning banner on partial failure with the specific failed date ranges
- Support a generic fallback message for container-mode queries without date ranges
- Update FilterPanel to support multi-select reason chips

Specs & tests:
- Create the batch-query-resilience spec; update the reject-history-api and reject-history-page specs
- Add 7 new tests covering retry, memory guard, failed ranges, partial failure propagation, and TTL
- Cross-service regression verified (hold, resource, job, msd; 411 tests pass)

Co-Authored-By: Claude Opus 4.6 <noreply@anthropic.com>
schema: spec-driven
created: 2026-03-03
## Context

The reject-history query uses `BatchQueryEngine` to split long date ranges into 10-day chunks that are queried against Oracle in parallel. Each chunk is guarded by a memory cap (256 MB) and a timeout (300 s). When a chunk fails, the `has_partial_failure` flag is written to a Redis HSET (key: `batch:reject:{hash}:meta`), but this information is **lost at three break points**:

1. `execute_primary_query()` in `reject_dataset_cache.py` never reads the batch progress metadata
2. The API route returns `jsonify({"success": True, **result})` directly, so the partial-chunk-failure path still returns HTTP 200 with `success: true` and does not distinguish complete from incomplete results
3. The frontend `App.vue` has no partial-failure handling logic at all

A further problem: `redis_clear_batch()` deletes the metadata key during the cleanup phase of `execute_primary_query()`, so the metadata must be read before cleanup.

In addition, the 730-day date-range limit is validated only in the backend `_validate_range()`; the frontend provides no immediate feedback.
## Goals / Non-Goals

**Goals:**

- Propagate `has_partial_failure` from Redis metadata to the API response `meta` field
- Track the time ranges of failed chunks so the frontend can show the specific missing intervals
- Show an amber warning banner in the frontend informing users that data may be incomplete
- Add immediate date-range validation in the frontend to avoid invalid API requests
- Add a single retry for transient errors (Oracle timeout, connection failure) to reduce unnecessary partial failures
- Persist the partial failure flag to an independent Redis key so the cache-hit path can also restore the warning state

**Non-Goals:**

- No change to the existing chunking strategy or memory-cap values
- No automatic re-query/retry mechanism in the frontend
- No change to the behavior of `EVENT_FETCHER_ALLOW_PARTIAL_RESULTS` (the default is already the safe `false`)
- No progress bar / live progress-tracking UI
## Decisions

### D1: Read metadata before `redis_clear_batch`

**Decision**: In `execute_primary_query()`, call `get_batch_progress("reject", engine_hash)` after `merge_chunks()` and before `redis_clear_batch()` to read the partial failure state.

**Rationale**: `redis_clear_batch` deletes the key that holds the metadata, after which it can no longer be read. At that point the chunk data has already been merged, so this is the last moment the metadata is readable.
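
A minimal sketch of this ordering (the helper names follow the modules referenced above, but their exact signatures here are assumptions, and error handling is elided):

```python
# Sketch only: illustrates the read-before-cleanup ordering from D1.
# merge_chunks / get_batch_progress / redis_clear_batch stand in for the
# real batch_query_engine helpers.

def execute_primary_query_sketch(engine, engine_hash, meta):
    df = engine.merge_chunks(engine_hash)  # 1. merge completed chunks

    # 2. read metadata while the batch key still exists
    progress = engine.get_batch_progress("reject", engine_hash)
    if progress.get("has_partial_failure"):
        meta["has_partial_failure"] = True
        meta["failed_chunk_count"] = int(progress.get("failed", 0))

    engine.redis_clear_batch("reject", engine_hash)  # 3. cleanup deletes the key
    return df, meta
```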
### D2: Persist the partial failure flag in an independent Redis key, with TTL aligned to the actual data layer

**Decision**: After `_store_query_result()`, store the partial failure information in a `reject_dataset:{query_id}:partial_failure` Redis HSET. **The TTL must match the layer where the data actually lives**: if the data spills to the parquet spool (`_REJECT_ENGINE_SPOOL_TTL_SECONDS = 21600s`), the flag TTL is also 21600s; if the data lives in L1/L2 (`_CACHE_TTL = 900s`), the flag TTL is 900s. Implementation: `_store_partial_failure_flag()` accepts a `ttl` parameter, and the caller passes `_REJECT_ENGINE_SPOOL_TTL_SECONDS` or `_CACHE_TTL` based on `should_spill`. The cache-hit path restores the state via `_load_partial_failure_flag(query_id)`.

**Alternative A**: Embed the flag in the DataFrame's attrs, or pickle it separately.
**Why rejected**: DataFrame attrs are lost on parquet serialization; pickle adds deserialization risk.

**Alternative B**: A fixed TTL of 900s.
**Why rejected**: Large queries spill to the parquet spool (21600s TTL), so the data remains readable for 6 hours while the partial failure flag expires after 15 minutes, producing "data is still readable but the warning has disappeared".
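
A sketch of the TTL selection (the key format and constant names mirror the decision above; the helper signatures and the redis-py-style client calls are assumptions):

```python
# Sketch only: TTL-aligned flag storage from D2.
import json

_REJECT_ENGINE_SPOOL_TTL_SECONDS = 21600   # parquet spool lifetime
_CACHE_TTL = 900                           # L1/L2 cache lifetime

def _store_partial_failure_flag(redis_client, query_id, failed_count, failed_ranges, ttl):
    key = f"reject_dataset:{query_id}:partial_failure"
    redis_client.hset(key, mapping={
        "failed_chunk_count": failed_count,
        "failed_ranges": json.dumps(failed_ranges),
    })
    redis_client.expire(key, ttl)  # TTL aligned with where the data lives

def store_flag_for(redis_client, query_id, failed_count, failed_ranges, should_spill):
    # Caller decides the TTL based on the storage layer (D2).
    ttl = _REJECT_ENGINE_SPOOL_TTL_SECONDS if should_spill else _CACHE_TTL
    _store_partial_failure_flag(redis_client, query_id, failed_count, failed_ranges, ttl)
```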
### D3: Track failed_ranges in `_update_progress` (time-range chunks only)

**Decision**: Extend `_update_progress()` with a `failed_ranges: Optional[List[Dict]]` parameter, stored in the Redis HSET as a JSON string. Both the sequential and the parallel path extract `chunk_start` / `chunk_end` from the failed chunk descriptor. **A range is recorded only when the chunk descriptor contains `chunk_start`/`chunk_end`** (i.e. a time-range chunk produced by `decompose_by_time_range`).

**The container-id chunking case**: reject's container mode uses `decompose_by_ids()`, whose chunk structure is `{"ids": [...]}` with no date range. In that case `failed_ranges` is an empty list, and the frontend uses `failed_chunk_count > 0` to show a generic warning ("data retrieval failed for N query batches") without date intervals.

**Rationale**: The chunk descriptor's structure is determined by the decompose function; the engine layer should not assume every chunk has a time range.
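
The conditional extraction can be sketched as follows (`extract_failed_range` is a hypothetical helper for illustration; the real logic lives inline in the engine paths):

```python
# Sketch only: conditional range extraction from D3.

def extract_failed_range(chunk):
    """Return a {start, end} entry for time-range chunks, None for id-batch chunks."""
    if "chunk_start" in chunk and "chunk_end" in chunk:
        return {"start": str(chunk["chunk_start"]), "end": str(chunk["chunk_end"])}
    return None  # e.g. {"ids": [...]} from decompose_by_ids()

def collect_failed_ranges(failed_chunks):
    ranges = []
    for chunk in failed_chunks:
        entry = extract_failed_range(chunk)
        if entry is not None:
            ranges.append(entry)
    return ranges
```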
### D4: No retry on memory-guard failures

**Decision**: Add `max_retries=1` to `_execute_single_chunk()`, retrying only exceptions for which `_is_retryable_error()` returns true. Memory-guard failures (memory cap exceeded) and Redis store failures return False immediately, with no retry.

**Rationale**: A memory-guard failure means the data volume for that time window really is too large, so a retry would produce the same result; Oracle timeouts and connection errors may be transient.
### D5: Frontend warning banner reuses the existing amber palette

**Decision**: Add a `.warning-banner` CSS class using `background: #fffbeb; color: #b45309`, matching the amber palette of the existing `.resolution-warn`. Place it after `.error-banner`.

**Alternative**: Use a toast/notification component.
**Why rejected**: The project has no toast system, and an amber banner stays consistent with the red error-banner pattern.
### D6: Frontend date-validation function lives in the shared filters module

**Decision**: Add `validateDateRange()` to `frontend/src/core/reject-history-filters.js`, reusing the validation pattern from `resource-history/App.vue:231-248`.

**Rationale**: reject-history-filters.js is already this page's filter utility module, and validateDateRange is filter-validation logic.
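
The rule set `validateDateRange()` enforces can be transcribed as follows (illustration only, in Python; the real implementation is the JavaScript function above; the messages are the literal strings from the reject-history-page spec, and an empty return means valid):

```python
# Illustration only: the validation rules, not the JS implementation.
from datetime import date

MAX_QUERY_DAYS = 730

def validate_date_range(start_date, end_date):
    if start_date is None or end_date is None:
        return "請先設定開始與結束日期"          # missing start or end date
    if end_date < start_date:                    # spec scenario: end earlier than start
        return "結束日期必須大於起始日期"
    if (end_date - start_date).days > MAX_QUERY_DAYS:
        return "查詢範圍不可超過 730 天(約兩年)"  # 730-day cap
    return ""  # valid: the query may proceed
```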
## Risks / Trade-offs

- **[Medium] The retry logic affects every execute_plan caller**: `_execute_single_chunk()` is a shared function used by five services (reject / hold / resource / job / msd). The retry logic is additive (a retry loop wrapped around the existing try/except), and the success path is unchanged. → Run smoke tests for the other 4 services (passing existing tests is sufficient). For a more conservative rollout, a `max_retries` parameter could let callers opt out (default 1), but the current judgment is that a uniform retry benefits all services.
- **[Low] Retries add Oracle load**: a single retry at most doubles the volume of failing queries. → `_is_retryable_error()` filters strictly so only transient errors retry, and the parallel path runs at most 3 workers, so the impact is bounded.
- **[Low] failed_ranges JSON size**: in theory all 73 chunks (730/10) could fail, producing 73 ranges, JSON < 5 KB. → Far below Redis HSET field limits.
## Why

The reject-history query's blast-protection mechanisms (time chunking + 256 MB memory cap + 300 s Oracle timeout) drop a chunk's data when the chunk fails, and the `has_partial_failure` flag is written only to Redis metadata, **never propagated to the API response or the frontend**. Users see incomplete data without knowing it, which undermines decision-making. In addition, the 730-day limit is validated only on the backend, so the frontend gives no immediate feedback and users wait needlessly.
## What Changes

- Backend `reject_dataset_cache` reads the batch progress metadata after `execute_plan()` and injects `has_partial_failure`, the failed chunk count, and the failed time ranges into the API response `meta` field
- Backend `batch_query_engine` tracks the time-interval descriptors of failed chunks and writes them to the `failed_ranges` field of the Redis metadata
- Backend `_execute_single_chunk()` retries transient errors (Oracle timeout / connection error) once; memory-guard failures are not retried
- Frontend adds an amber warning banner that shows an incomplete-data warning and the failed date intervals when `meta.has_partial_failure` is true
- Frontend adds immediate date-range validation (730-day limit) that blocks invalid ranges before the API call
## Capabilities

### New Capabilities

- `batch-query-resilience`: failed-range tracking, partial failure metadata propagation, and single-retry handling of transient errors in the batch query engine

### Modified Capabilities

- `reject-history-api`: the API response `meta` gains `has_partial_failure`, `failed_chunk_count`, and `failed_ranges` fields so the frontend can tell whether the result is complete
- `reject-history-page`: adds an amber warning banner for partial failure warnings, plus immediate frontend date-range validation (730-day limit)
## Impact

- **Backend service `batch_query_engine.py` (shared module; affects every service using execute_plan)**:
  - failed_ranges tracking and the retry logic modify `_execute_single_chunk()`, which is shared by the **reject / hold / resource / job / msd** dataset cache services
  - The retry logic is additive (a new retry loop) and does not change the existing success path, so it is backward compatible for the other services
  - `failed_ranges` is recorded only when the chunk descriptor contains `chunk_start`/`chunk_end`; container-id chunking (used only by reject's container mode) is unaffected
  - Regression smoke tests are required for hold / resource / job / msd
- **Backend service `reject_dataset_cache.py`**: reads the metadata, injects it into the response, and persists the partial failure flag
- **Frontend**: `App.vue` (warning banner + date validation), `reject-history-filters.js` (validateDateRange function), `style.css` (`.warning-banner` style)
- **API contract**: response `meta` gains optional fields (backward compatible; existing frontends are unaffected)
- **Tests**: `test_batch_query_engine.py` and `test_reject_dataset_cache.py` need corresponding new test cases; hold / resource / job / msd need regression verification
## ADDED Requirements

### Requirement: BatchQueryEngine SHALL track failed chunk time ranges in progress metadata

The engine SHALL record the time ranges of failed chunks in Redis progress metadata so consumers can report which date intervals have missing data.

#### Scenario: Failed chunk range recorded in sequential path

- **WHEN** a chunk with `chunk_start` and `chunk_end` keys fails during sequential execution
- **THEN** `_update_progress()` SHALL store a `failed_ranges` field in the Redis HSET metadata
- **THEN** `failed_ranges` SHALL be a JSON array of objects, each with `start` and `end` string keys
- **THEN** the array SHALL contain one entry per failed chunk

#### Scenario: Failed chunk range recorded in parallel path

- **WHEN** a chunk with `chunk_start` and `chunk_end` keys fails during parallel execution
- **THEN** the failed chunk's time range SHALL be appended to `failed_ranges` in the same format as the sequential path

#### Scenario: No failed ranges when all chunks succeed

- **WHEN** all chunks complete successfully
- **THEN** the `failed_ranges` field SHALL NOT be present in Redis metadata

#### Scenario: ID-batch chunks produce no failed_ranges entries

- **WHEN** a chunk created by `decompose_by_ids()` (containing only an `ids` key, no `chunk_start`/`chunk_end`) fails
- **THEN** no entry SHALL be appended to `failed_ranges` for that chunk
- **THEN** `has_partial_failure` SHALL still be set to `True`
- **THEN** the `failed` count SHALL still be incremented

#### Scenario: get_batch_progress returns failed_ranges

- **WHEN** `get_batch_progress()` is called after execution with failed chunks
- **THEN** the returned dict SHALL include `failed_ranges` as a JSON string parseable to a list of `{start, end}` objects
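
The last scenario implies consumer-side parsing along these lines (a sketch; the progress-dict shape follows the scenarios above, and Redis HSET values arrive as strings):

```python
# Sketch only: parsing failed_ranges out of a get_batch_progress() result.
import json

def parse_failed_ranges(progress):
    raw = progress.get("failed_ranges")
    if not raw:
        return []  # field absent when all chunks succeeded
    return json.loads(raw)  # list of {"start": ..., "end": ...} objects
```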
### Requirement: BatchQueryEngine SHALL retry transient chunk failures once

The engine SHALL retry chunk execution once for transient errors (Oracle timeout, connection errors) but SHALL NOT retry deterministic failures (memory guard, Redis store).

#### Scenario: Oracle timeout retried once

- **WHEN** `_execute_single_chunk()` raises an exception matching Oracle timeout patterns (`DPY-4024`, `ORA-01013`)
- **THEN** the chunk SHALL be retried exactly once
- **WHEN** the retry succeeds
- **THEN** the chunk SHALL be marked as successful

#### Scenario: Connection error retried once

- **WHEN** `_execute_single_chunk()` raises `TimeoutError`, `ConnectionError`, or `OSError`
- **THEN** the chunk SHALL be retried exactly once

#### Scenario: Retry exhausted marks chunk as failed

- **WHEN** a chunk fails on both the initial attempt and the retry
- **THEN** the chunk SHALL be marked as failed
- **THEN** `has_partial_failure` SHALL be set to `True`

#### Scenario: Memory guard failure NOT retried

- **WHEN** a chunk's DataFrame exceeds `BATCH_CHUNK_MAX_MEMORY_MB`
- **THEN** the chunk SHALL return `False` immediately without retry
- **THEN** the query function SHALL have been called exactly once for that chunk

#### Scenario: Redis store failure NOT retried

- **WHEN** `redis_store_chunk()` returns `False`
- **THEN** the chunk SHALL return `False` immediately without retry
### Requirement: reject_dataset_cache SHALL propagate partial failure metadata to API response

The cache service SHALL read batch execution metadata and include partial failure information in the API response `meta` field.

#### Scenario: Partial failure metadata included in response

- **WHEN** `execute_primary_query()` uses the batch engine path and `get_batch_progress()` returns `has_partial_failure=True`
- **THEN** the response `meta` dict SHALL include `has_partial_failure: true`
- **THEN** the response `meta` dict SHALL include `failed_chunk_count` as an integer
- **THEN** if `failed_ranges` is present, the response `meta` dict SHALL include `failed_ranges` as a list of `{start, end}` objects

#### Scenario: Metadata read before redis_clear_batch

- **WHEN** `execute_primary_query()` calls `get_batch_progress()`
- **THEN** the call SHALL occur after `merge_chunks()` and before `redis_clear_batch()`

#### Scenario: No partial failure on successful query

- **WHEN** all chunks complete successfully
- **THEN** the response `meta` dict SHALL NOT include `has_partial_failure`

#### Scenario: Cache-hit path restores partial failure flag

- **WHEN** a cached DataFrame is returned (cache hit) and a partial failure flag was stored during the original query
- **THEN** the response `meta` dict SHALL include the same `has_partial_failure`, `failed_chunk_count`, and `failed_ranges` as the original response

#### Scenario: Partial failure flag TTL matches data storage layer

- **WHEN** partial failure is detected and the query result is spilled to the parquet spool
- **THEN** the partial failure flag SHALL be stored with a TTL equal to `_REJECT_ENGINE_SPOOL_TTL_SECONDS` (default 21600 seconds)
- **WHEN** partial failure is detected and the query result is stored in the L1/L2 Redis cache
- **THEN** the partial failure flag SHALL be stored with a TTL equal to `_CACHE_TTL` (default 900 seconds)
## MODIFIED Requirements

### Requirement: Reject History API SHALL validate required query parameters

The API SHALL validate date parameters and basic paging bounds before executing database work.

#### Scenario: Missing required dates

- **WHEN** a reject-history endpoint requiring a date range is called without `start_date` or `end_date`
- **THEN** the API SHALL return HTTP 400 with a descriptive validation error

#### Scenario: Invalid date order

- **WHEN** `end_date` is earlier than `start_date`
- **THEN** the API SHALL return HTTP 400 and SHALL NOT run SQL queries

#### Scenario: Date range exceeds maximum

- **WHEN** the date range between `start_date` and `end_date` exceeds 730 days
- **THEN** the API SHALL return HTTP 400 with the error message "日期範圍不可超過 730 天" ("the date range may not exceed 730 days")
## ADDED Requirements

### Requirement: Reject History API primary query response SHALL include partial failure metadata

The primary query endpoint SHALL include batch execution completeness information in the response `meta` field when chunks fail during batch query execution.

#### Scenario: Partial failure metadata in response

- **WHEN** `POST /api/reject-history/query` completes with some chunks failing
- **THEN** the response SHALL include `meta.has_partial_failure: true`
- **THEN** the response SHALL include `meta.failed_chunk_count` as a positive integer
- **THEN** the response SHALL include `meta.failed_ranges` as an array of `{start, end}` date strings (if available)
- **THEN** the HTTP status SHALL still be 200 (data is partially available)

#### Scenario: No partial failure metadata on full success

- **WHEN** `POST /api/reject-history/query` completes with all chunks succeeding
- **THEN** the response `meta` SHALL NOT include `has_partial_failure`, `failed_chunk_count`, or `failed_ranges`

#### Scenario: Partial failure metadata preserved on cache hit

- **WHEN** `POST /api/reject-history/query` returns cached data that originally had partial failures
- **THEN** the response SHALL include the same `meta.has_partial_failure`, `meta.failed_chunk_count`, and `meta.failed_ranges` as the original response
## ADDED Requirements

### Requirement: Reject History page SHALL display partial failure warning banner

The page SHALL display an amber warning banner when the query result contains partial failures, informing users that the displayed data may be incomplete.

#### Scenario: Warning banner displayed on partial failure

- **WHEN** the primary query response includes `meta.has_partial_failure: true`
- **THEN** an amber warning banner SHALL be displayed below the error banner position
- **THEN** the warning message SHALL be in Traditional Chinese

#### Scenario: Warning banner shows failed date ranges

- **WHEN** `meta.failed_ranges` contains date range objects
- **THEN** the warning banner SHALL display the specific failed date ranges (e.g., "以下日期區間的資料擷取失敗:2025-01-01 ~ 2025-01-10", i.e. "data retrieval failed for the following date ranges: 2025-01-01 ~ 2025-01-10")

#### Scenario: Warning banner shows generic message without ranges (container mode or missing range data)

- **WHEN** `meta.has_partial_failure` is true but `meta.failed_ranges` is empty or absent (e.g., a container-id batch query)
- **THEN** the warning banner SHALL display a generic message with the failed chunk count (e.g., "3 個查詢批次的資料擷取失敗", i.e. "data retrieval failed for 3 query batches")

#### Scenario: Warning banner cleared on new query

- **WHEN** the user initiates a new primary query
- **THEN** the warning banner SHALL be cleared before the new query executes
- **THEN** if the new query also has partial failures, the warning SHALL update with the new failure information

#### Scenario: Warning banner coexists with error banner

- **WHEN** both an error message and a partial failure warning exist
- **THEN** the error banner SHALL appear first, followed by the warning banner

#### Scenario: Warning banner visual style

- **WHEN** the warning banner is rendered
- **THEN** it SHALL use the amber/orange color scheme (background `#fffbeb`, text `#b45309`)
- **THEN** the style SHALL be consistent with the existing `.resolution-warn` color pattern
### Requirement: Reject History page SHALL validate date range before query submission

The page SHALL validate the date range on the client side before sending the API request, providing immediate feedback for invalid ranges.

#### Scenario: Date range exceeds 730-day limit

- **WHEN** the user selects a date range exceeding 730 days and clicks "查詢" (Query)
- **THEN** the page SHALL display the error message "查詢範圍不可超過 730 天(約兩年)" ("the query range may not exceed 730 days (about two years)")
- **THEN** the API request SHALL NOT be sent

#### Scenario: Missing start or end date

- **WHEN** the user clicks "查詢" (Query) without setting both start_date and end_date (in date_range mode)
- **THEN** the page SHALL display the error message "請先設定開始與結束日期" ("please set the start and end dates first")
- **THEN** the API request SHALL NOT be sent

#### Scenario: End date before start date

- **WHEN** the user selects an end_date earlier than start_date
- **THEN** the page SHALL display the error message "結束日期必須大於起始日期" ("the end date must be after the start date")
- **THEN** the API request SHALL NOT be sent

#### Scenario: Valid date range proceeds normally

- **WHEN** the user selects a valid date range within 730 days and clicks "查詢" (Query)
- **THEN** no validation error SHALL be shown
- **THEN** the API request SHALL proceed normally

#### Scenario: Container mode skips date validation

- **WHEN** the query mode is "container" (not "date_range")
- **THEN** date range validation SHALL be skipped
## 1. Frontend immediate date-range validation

- [x] 1.1 Add a `validateDateRange(startDate, endDate)` function (MAX_QUERY_DAYS=730) at the end of `frontend/src/core/reject-history-filters.js`; it returns an empty string on success and a non-empty error message otherwise
- [x] 1.2 Import `validateDateRange` in `frontend/src/reject-history/App.vue` and add date_range-mode validation in `executePrimaryQuery()` before the API call (after the `errorMessage.value = ''` reset); on failure, set `errorMessage` and return
## 2. Backend tracking of failed chunk time ranges

- [x] 2.1 Add a `failed_ranges: Optional[List] = None` parameter to the `_update_progress()` signature in `batch_query_engine.py`, conditionally adding a `json.dumps(failed_ranges)` field to the mapping dict
- [x] 2.2 In the sequential path of `execute_plan()` (the `for idx, chunk in enumerate(chunks)` loop), add `failed_range_list = []`; on chunk failure, conditionally extract `chunk_start`/`chunk_end` from the chunk descriptor (only time-range chunks have them) and append to the list; pass the list into every `_update_progress()` call
- [x] 2.3 In `_execute_parallel()`, change the `futures` dict to `futures[future] = (idx, chunk)` to retain the chunk descriptor; add `failed_range_list`, conditionally append ranges on failure, and change the return value to a 4-tuple `(completed, failed, has_partial_failure, failed_range_list)`; update the destructuring of the `_execute_parallel()` call in `execute_plan()` to a 4-tuple accordingly
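
Task 2.3's futures mapping can be sketched as follows (a minimal illustration using the standard library; `run_chunk` stands in for the real chunk executor):

```python
# Sketch only: keeping the chunk descriptor alongside each future (task 2.3)
# so chunk_start/chunk_end remain available when a chunk fails.
from concurrent.futures import ThreadPoolExecutor, as_completed

def run_chunks_parallel(chunks, run_chunk, max_workers=3):
    completed, failed, failed_range_list = 0, 0, []
    with ThreadPoolExecutor(max_workers=max_workers) as pool:
        futures = {pool.submit(run_chunk, chunk): (idx, chunk)
                   for idx, chunk in enumerate(chunks)}
        for future in as_completed(futures):
            _idx, chunk = futures[future]
            if future.result():
                completed += 1
            else:
                failed += 1
                # Only time-range chunks carry a date interval.
                if "chunk_start" in chunk and "chunk_end" in chunk:
                    failed_range_list.append(
                        {"start": chunk["chunk_start"], "end": chunk["chunk_end"]})
    return completed, failed, failed > 0, failed_range_list
```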
## 3. Backend single retry of chunk failures

- [x] 3.1 Add a `_RETRYABLE_PATTERNS` constant and an `_is_retryable_error(exc)` function to `batch_query_engine.py` to recognize Oracle timeouts / connection errors
- [x] 3.2 Modify `_execute_single_chunk()` to take a `max_retries: int = 1` parameter and wrap the try/except in a retry loop: memory-guard and Redis store failures return False immediately with no retry; on exception, if `_is_retryable_error()` is True, log a warning and continue
## 4. Backend propagation of partial failure to the API response

- [x] 4.1 Add `get_batch_progress` to the batch_query_engine local-import block inside `execute_primary_query()` in `reject_dataset_cache.py`
- [x] 4.2 In `execute_primary_query()`, after the `merge_chunks()` call and before the `redis_clear_batch()` call, call `get_batch_progress("reject", engine_hash)` to read `has_partial_failure`, `failed`, and `failed_ranges`
- [x] 4.3 After `redis_clear_batch()` and before `_apply_policy_filters()`, conditionally inject the partial failure information into the `meta` dict (`has_partial_failure`, `failed_chunk_count`, `failed_ranges`)
- [x] 4.4 Add two helpers, `_store_partial_failure_flag(query_id, failed_count, failed_ranges, ttl)` and `_load_partial_failure_flag(query_id)`, that use a Redis HSET at `reject_dataset:{query_id}:partial_failure`; `ttl` is passed in by the caller
- [x] 4.5 Call `_store_partial_failure_flag()` after the `_store_query_result()` call, with the TTL determined by the `should_spill` decision inside `_store_query_result()`: `_REJECT_ENGINE_SPOOL_TTL_SECONDS` (21600s) when spilling to the spool, otherwise `_CACHE_TTL` (900s); on the cache-hit path in `_get_cached_df()`, call `_load_partial_failure_flag()` and `meta.update()`
## 5. Frontend partial failure warning banner

- [x] 5.1 Add a `partialFailureWarning` ref in `frontend/src/reject-history/App.vue`, reset it at the start of `executePrimaryQuery()`, and after reading the result set the warning message based on `result.meta.has_partial_failure` (with date-interval text from failed_ranges; without ranges, a generic message built from failed_chunk_count)
- [x] 5.2 Add `<div v-if="partialFailureWarning" class="warning-banner">{{ partialFailureWarning }}</div>` after the error-banner `<div>` in the App.vue template
- [x] 5.3 Add the `.warning-banner` style (background: #fffbeb, color: #b45309) after the `.error-banner` rule in `frontend/src/reject-history/style.css`
## 6. Tests

- [x] 6.1 Add `test_transient_failure_retried_once` to `tests/test_batch_query_engine.py`: the mocked query_fn raises TimeoutError on the first call and succeeds on the second; assert the chunk ultimately succeeds and query_fn is called twice
- [x] 6.2 Add `test_memory_guard_not_retried` to `tests/test_batch_query_engine.py`: the mocked query_fn returns an oversized DataFrame; assert query_fn is called only once
- [x] 6.3 Add `test_failed_ranges_tracked` to `tests/test_batch_query_engine.py`: 3 chunks with 1 failing; assert the Redis metadata contains the `failed_ranges` JSON
- [x] 6.4 Add `test_partial_failure_in_response_meta` to `tests/test_reject_dataset_cache.py`: mock `get_batch_progress` to return `has_partial_failure=True`; assert the response `meta` contains the flag and `failed_ranges`
- [x] 6.5 Add `test_cache_hit_restores_partial_failure` to `tests/test_reject_dataset_cache.py`: write the partial failure flag first; on cache hit, assert meta contains the flag
- [x] 6.6 Add `test_partial_failure_ttl_matches_spool` to `tests/test_reject_dataset_cache.py`: assert the flag TTL is `_REJECT_ENGINE_SPOOL_TTL_SECONDS` when should_spill=True, otherwise `_CACHE_TTL`
- [x] 6.7 Add `test_id_batch_chunk_no_failed_ranges` to `tests/test_batch_query_engine.py`: when a container-id batch chunk fails, assert `failed_ranges` is an empty list but `has_partial_failure=True`
## 7. Cross-service regression verification

- [x] 7.1 Run `pytest tests/test_batch_query_engine.py tests/test_reject_dataset_cache.py -v` and confirm all tests for this change pass
- [x] 7.2 Run the hold_dataset_cache tests to confirm the retry logic does not affect hold: `pytest tests/ -k "hold" -v`
- [x] 7.3 Run the resource / job / msd tests to confirm no regression: `pytest tests/ -k "resource or job or mid_section" -v`
- [x] 7.4 If any cross-service test fails, check whether it stems from the `_execute_single_chunk` signature change (the `max_retries` parameter) and confirm the keyword-only default does not affect existing callers