Chunk failures in BatchQueryEngine were silently discarded: `has_partial_failure` was tracked in Redis but never surfaced to the API response or frontend. Users could see incomplete data without any warning. This commit closes the gap end-to-end:

Backend:
- Track failed chunk time ranges (`failed_ranges`) in batch engine progress metadata
- Add single retry for transient Oracle errors (timeout, connection) in `_execute_single_chunk`
- Read `get_batch_progress()` after merge but before `redis_clear_batch()` cleanup
- Inject `has_partial_failure`, `failed_chunk_count`, `failed_ranges` into API response meta
- Persist partial failure flag to independent Redis key with TTL aligned to data storage layer
- Add shared container-resolution policy module with wildcard/expansion guardrails
- Refactor reason filter from single-value to multi-select (`reason` → `reasons`)

Frontend:
- Add client-side date range validation (730-day limit) before API submission
- Display amber warning banner on partial failure with specific failed date ranges
- Support generic fallback message for container-mode queries without date ranges
- Update FilterPanel to support multi-select reason chips

Specs & tests:
- Create batch-query-resilience spec; update reject-history-api and reject-history-page specs
- Add 7 new tests for retry, memory guard, failed ranges, partial failure propagation, TTL
- Cross-service regression verified (hold, resource, job, msd; 411 tests pass)

Co-Authored-By: Claude Opus 4.6 <noreply@anthropic.com>
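1. Real-time date range validation in the frontend
- 1.1 At the end of `frontend/src/core/reject-history-filters.js`, add a `validateDateRange(startDate, endDate)` function (`MAX_QUERY_DAYS=730`) that returns an empty string when validation passes and a non-empty error message otherwise
- 1.2 In `frontend/src/reject-history/App.vue`, import `validateDateRange`; in `executePrimaryQuery()`, before the API call (after the `errorMessage.value = ''` reset), add validation for date_range mode that sets `errorMessage` and returns when validation fails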
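2. Track failed chunk time ranges in the backend
- 2.1 In `batch_query_engine.py`, add a `failed_ranges: Optional[List] = None` parameter to the `_update_progress()` signature, and conditionally add a `json.dumps(failed_ranges)` field to the mapping dict
- 2.2 In the sequential path of `execute_plan()` (the `for idx, chunk in enumerate(chunks)` loop), add `failed_range_list = []`; when a chunk fails, conditionally extract `chunk_start`/`chunk_end` from the chunk descriptor and append them to the list (only time-range chunks have them), then pass the list to every `_update_progress()` call
- 2.3 In `_execute_parallel()`, change the `futures` dict to `futures[future] = (idx, chunk)` to keep the chunk descriptor, add a `failed_range_list`, conditionally append the range on failure, and change the return value to the 4-tuple `(completed, failed, has_partial_failure, failed_range_list)`; update the unpacking at the `_execute_parallel()` call site in `execute_plan()` to a 4-tuple as well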
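
The range tracking described in 2.1 and 2.2 could look roughly like the sketch below. It is illustrative only: the helper names `collect_failed_range`, `run_chunks_sequentially`, and `build_progress_mapping`, the `{"start", "end"}` shape of each range entry, and the `completed` field are assumptions, not the actual contents of `batch_query_engine.py`.

```python
import json
from typing import Any, Callable, Dict, List, Optional, Tuple

Chunk = Dict[str, Any]
FailedRange = Dict[str, str]


def collect_failed_range(chunk: Chunk) -> Optional[FailedRange]:
    """Return the chunk's time range, or None when the chunk has none (e.g. an ID batch)."""
    start = chunk.get("chunk_start")
    end = chunk.get("chunk_end")
    if start is None or end is None:
        return None
    # Assumed entry shape; the real engine may store ranges differently.
    return {"start": str(start), "end": str(end)}


def run_chunks_sequentially(
    chunks: List[Chunk],
    run_chunk: Callable[[int, Chunk], bool],
) -> Tuple[int, int, bool, List[FailedRange]]:
    """Run chunks in order and return (completed, failed, has_partial_failure, failed_range_list)."""
    completed = 0
    failed = 0
    failed_range_list: List[FailedRange] = []
    for idx, chunk in enumerate(chunks):
        if run_chunk(idx, chunk):
            completed += 1
            continue
        failed += 1
        failed_range = collect_failed_range(chunk)
        if failed_range is not None:  # only time-range chunks contribute a range
            failed_range_list.append(failed_range)
    return completed, failed, failed > 0, failed_range_list


def build_progress_mapping(
    completed: int,
    failed: int,
    failed_ranges: Optional[List[FailedRange]] = None,
) -> Dict[str, str]:
    """Build the hash fields for a progress update; failed_ranges is added only when given."""
    mapping = {
        "completed": str(completed),  # assumed field name
        "failed": str(failed),
        "has_partial_failure": "1" if failed > 0 else "0",
    }
    if failed_ranges is not None:
        mapping["failed_ranges"] = json.dumps(failed_ranges)
    return mapping
```

Returning the same 4-tuple from the sequential sketch as 2.3 specifies for `_execute_parallel()` keeps both paths interchangeable for the caller.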
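3. Single retry for failed chunks in the backend
- 3.1 In `batch_query_engine.py`, add the `_RETRYABLE_PATTERNS` constant and an `_is_retryable_error(exc)` function that recognizes Oracle timeout and connection errors
- 3.2 Modify `_execute_single_chunk()` to accept a `max_retries: int = 1` parameter and wrap the try/except in a retry loop: memory guard and Redis store failures return False immediately without retrying; if `_is_retryable_error()` is True for the exception, log a warning and continue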
4. Propagate partial failure to the API response in the backend
- 4.1 In `reject_dataset_cache.py`, add `get_batch_progress` to the local batch_query_engine import block inside `execute_primary_query()`
- 4.2 In `execute_primary_query()`, after the `merge_chunks()` call and before the `redis_clear_batch()` call, call `get_batch_progress("reject", engine_hash)` to read `has_partial_failure`, `failed`, and `failed_ranges`
- 4.3 After `redis_clear_batch()` and before `_apply_policy_filters()`, conditionally inject the partial failure information into the `meta` dict (`has_partial_failure`, `failed_chunk_count`, `failed_ranges`)
- 4.4 Add two helpers, `_store_partial_failure_flag(query_id, failed_count, failed_ranges, ttl)` and `_load_partial_failure_flag(query_id)`, that use Redis HSET to store and load `reject_dataset:{query_id}:partial_failure`; `ttl` is passed in by the caller
- 4.5 Call `_store_partial_failure_flag()` after the `_store_query_result()` call, choosing the TTL from the `should_spill` decision inside `_store_query_result()`: use `_REJECT_ENGINE_SPOOL_TTL_SECONDS` (21600s) when the result spills to the spool, otherwise `_CACHE_TTL` (900s); on the `_get_cached_df()` cache-hit path, call `_load_partial_failure_flag()` and `meta.update()`
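
The flag helpers in 4.4 and 4.5 might be sketched as follows. This is an assumption-laden illustration: the functions take an explicit `client` argument (any object exposing redis-py style `hset(name, mapping=...)`, `expire(name, time)`, and `hgetall(name)`), which the real helpers do not have in their signatures, and `pick_flag_ttl` plus the hash field layout are hypothetical. Only the key format and the two TTL values come from the task list.

```python
import json
from typing import Any, Dict, List

_CACHE_TTL = 900  # seconds, per 4.5
_REJECT_ENGINE_SPOOL_TTL_SECONDS = 21600  # seconds, per 4.5


def _partial_failure_key(query_id: str) -> str:
    return f"reject_dataset:{query_id}:partial_failure"


def pick_flag_ttl(should_spill: bool) -> int:
    """Align the flag's lifetime with wherever the data itself was stored."""
    return _REJECT_ENGINE_SPOOL_TTL_SECONDS if should_spill else _CACHE_TTL


def _to_str(value: Any) -> str:
    # redis-py returns bytes unless the client was created with decode_responses=True.
    return value.decode() if isinstance(value, bytes) else str(value)


def store_partial_failure_flag(
    client: Any,
    query_id: str,
    failed_count: int,
    failed_ranges: List[Dict[str, str]],
    ttl: int,
) -> None:
    """Write the flag as a hash and give it the caller-chosen TTL."""
    key = _partial_failure_key(query_id)
    client.hset(
        key,
        mapping={
            "has_partial_failure": "1",
            "failed_chunk_count": str(failed_count),
            "failed_ranges": json.dumps(failed_ranges),
        },
    )
    client.expire(key, ttl)


def load_partial_failure_flag(client: Any, query_id: str) -> Dict[str, Any]:
    """Return meta fields for a stored flag, or an empty dict when none exists."""
    raw = client.hgetall(_partial_failure_key(query_id))
    if not raw:
        return {}
    fields = {_to_str(k): _to_str(v) for k, v in raw.items()}
    return {
        "has_partial_failure": fields.get("has_partial_failure") == "1",
        "failed_chunk_count": int(fields.get("failed_chunk_count", "0")),
        "failed_ranges": json.loads(fields.get("failed_ranges", "[]")),
    }
```

Because the loader returns an empty dict when no flag exists, the cache-hit path can call `meta.update(load_partial_failure_flag(client, query_id))` unconditionally without adding keys to fully successful results.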
5. Partial failure warning banner in the frontend
- 5.1 In `frontend/src/reject-history/App.vue`, add a `partialFailureWarning` ref; reset it at the start of `executePrimaryQuery()`, and after reading the result set the warning message based on `result.meta.has_partial_failure` (include the date ranges from failed_ranges as text; when there are no ranges, use a generic message based on failed_chunk_count)
- 5.2 In the App.vue template, after the error-banner `<div>`, add `<div v-if="partialFailureWarning" class="warning-banner">{{ partialFailureWarning }}</div>`
- 5.3 In `frontend/src/reject-history/style.css`, after the `.error-banner` rule, add the `.warning-banner` style (background: #fffbeb, color: #b45309)
6. Tests
- 6.1 In `tests/test_batch_query_engine.py`, add `test_transient_failure_retried_once`: mock query_fn to raise TimeoutError on the first call and succeed on the second, then assert that the chunk ultimately succeeds and query_fn was called 2 times
- 6.2 In `tests/test_batch_query_engine.py`, add `test_memory_guard_not_retried`: mock query_fn to return an oversized DataFrame, then assert that query_fn was called only once
- 6.3 In `tests/test_batch_query_engine.py`, add `test_failed_ranges_tracked`: run 3 chunks with 1 failing, then assert that the Redis metadata contains the `failed_ranges` JSON
- 6.4 In `tests/test_reject_dataset_cache.py`, add `test_partial_failure_in_response_meta`: mock `get_batch_progress` to return `has_partial_failure=True`, then assert that the response `meta` contains the flag and `failed_ranges`
- 6.5 In `tests/test_reject_dataset_cache.py`, add `test_cache_hit_restores_partial_failure`: write a partial failure flag first, then assert that meta carries the flag on a cache hit
- 6.6 In `tests/test_reject_dataset_cache.py`, add `test_partial_failure_ttl_matches_spool`: assert that the flag TTL is `_REJECT_ENGINE_SPOOL_TTL_SECONDS` when should_spill=True and `_CACHE_TTL` otherwise
- 6.7 In `tests/test_batch_query_engine.py`, add `test_id_batch_chunk_no_failed_ranges`: when a container-id batch chunk fails, assert that `failed_ranges` is an empty list while `has_partial_failure=True`
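7. Cross-service regression verification
- 7.1 Run `pytest tests/test_batch_query_engine.py tests/test_reject_dataset_cache.py -v` to confirm that all tests for this change pass
- 7.2 Run the hold_dataset_cache tests to confirm that the retry logic does not affect hold: `pytest tests/ -k "hold" -v`
- 7.3 Run the resource / job / msd tests to confirm there are no regressions: `pytest tests/ -k "resource or job or mid_section" -v`
- 7.4 If any cross-service test fails, check whether the `_execute_single_chunk` signature change (the `max_retries` parameter) is the cause, and confirm that the keyword-only default does not affect existing calls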