DashBoard/proposal.md at a275c30c0ee33a1d7239406ae98969d58d49bb76

egg a275c30c0e feat(reject-history): fix silent data loss by propagating partial failure metadata to frontend

Chunk failures in BatchQueryEngine were silently discarded — `has_partial_failure` was tracked
in Redis but never surfaced to the API response or frontend. Users could see incomplete data
without any warning. This commit closes the gap end-to-end:

Backend:
- Track failed chunk time ranges (`failed_ranges`) in batch engine progress metadata
- Add single retry for transient Oracle errors (timeout, connection) in `_execute_single_chunk`
- Read `get_batch_progress()` after merge but before `redis_clear_batch()` cleanup
- Inject `has_partial_failure`, `failed_chunk_count`, `failed_ranges` into API response meta
- Persist partial failure flag to independent Redis key with TTL aligned to data storage layer
- Add shared container-resolution policy module with wildcard/expansion guardrails
- Refactor reason filter from single-value to multi-select (`reason` → `reasons`)

Frontend:
- Add client-side date range validation (730-day limit) before API submission
- Display amber warning banner on partial failure with specific failed date ranges
- Support generic fallback message for container-mode queries without date ranges
- Update FilterPanel to support multi-select reason chips
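The 730-day check is simple enough to sketch. The version below is written in Python only to show the logic; the real check lives in the frontend, and the function name, message wording, and the choice to measure the span as `end - start` in days (rather than counting both endpoints) are assumptions.

```python
from datetime import date
from typing import Optional

MAX_RANGE_DAYS = 730  # limit stated in the commit message


def validate_date_range(start: date, end: date) -> Optional[str]:
    """Return an error message for an invalid range, or None when it is valid."""
    if end < start:
        return "End date must not be earlier than start date."
    span_days = (end - start).days
    if span_days > MAX_RANGE_DAYS:
        return f"Date range is {span_days} days; the limit is {MAX_RANGE_DAYS}."
    return None
```

Running the same rule on the client before submission means an over-long range is rejected immediately instead of after a backend round trip.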

Specs & tests:
- Create batch-query-resilience spec; update reject-history-api and reject-history-page specs
- Add 7 new tests for retry, memory guard, failed ranges, partial failure propagation, TTL
- Cross-service regression verified (hold, resource, job, msd — 411 tests pass)

Co-Authored-By: Claude Opus 4.6 <noreply@anthropic.com>

2026-03-03 14:00:07 +08:00

Why

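The reject-history query's overload safeguards (time-based chunking + a 256 MB memory limit + a 300 s Oracle timeout) discard a chunk's data when that chunk fails. The has_partial_failure flag is written only to Redis metadata and never reaches the API response or the frontend, so users see incomplete data without knowing it, which undermines the correctness of their decisions. In addition, the 730-day date limit is validated only on the backend; the frontend gives no immediate feedback, which causes unnecessary waiting.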

What Changes

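Backend reject_dataset_cache reads the batch progress metadata after execute_plan() and injects has_partial_failure, the failed chunk count, and the failed time ranges into the API response meta field
Backend batch_query_engine tracks a time-range description for each failed chunk and writes it to the failed_ranges field of the Redis metadata
Backend _execute_single_chunk() retries once on transient errors (Oracle timeout / connection error); memory guard failures are not retried
Frontend adds an amber warning banner that, when meta.has_partial_failure is true, shows an incomplete-data warning and the failed date ranges
Frontend adds immediate date range validation (730-day limit) that blocks invalid ranges before the API request is sent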
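The single-retry rule for _execute_single_chunk() can be sketched as below. This is an illustration under assumptions, not the engine's code: the two exception classes and the `execute_chunk_with_retry` wrapper are hypothetical names, and how the real engine distinguishes an Oracle timeout or connection error from a memory guard failure is not shown in this proposal.

```python
from typing import Any, Callable


class TransientQueryError(Exception):
    """Hypothetical marker for Oracle timeouts and connection errors."""


class MemoryGuardError(Exception):
    """Hypothetical marker for a chunk that exceeded the memory limit."""


def execute_chunk_with_retry(run_chunk: Callable[[], Any], max_retries: int = 1) -> Any:
    """Run a chunk, retrying only transient errors up to max_retries times."""
    attempt = 0
    while True:
        try:
            return run_chunk()
        except TransientQueryError:
            if attempt >= max_retries:
                raise  # retry budget spent: the caller records the failure
            attempt += 1
        # MemoryGuardError (and any other exception) is not caught here,
        # so it propagates on the first attempt without a retry.
```

Retrying a memory guard failure would re-run a query that is expected to exceed the same limit again, which is why only transient errors get a second attempt.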

Capabilities

New Capabilities

batch-query-resilience: failed-range tracking, partial failure metadata propagation, and a single retry for transient errors in the batch query engine

Modified Capabilities

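reject-history-api: the API response meta gains has_partial_failure, failed_chunk_count, and failed_ranges fields so the frontend can tell whether the query result is complete
reject-history-page: adds an amber warning banner for partial failure warnings; adds immediate frontend date range validation (730-day limit)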
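How a client might turn the new meta fields into banner text can be sketched as below, in Python rather than the page's Vue code. The function name, the message wording, and the assumption that each failed_ranges entry is a dict with `start` and `end` strings are illustrative only. The generic fallback branch follows the commit message's note about container-mode queries, which have no date ranges.

```python
from typing import Any, Optional


def partial_failure_message(meta: dict[str, Any]) -> Optional[str]:
    """Return banner text for a partial failure, or None when the result is complete."""
    if not meta.get("has_partial_failure"):
        return None  # also covers older responses that lack the optional fields
    ranges = meta.get("failed_ranges") or []
    if not ranges:
        # No date ranges available (e.g. container-mode queries): generic fallback.
        return "Some data could not be loaded; results may be incomplete."
    listed = ", ".join(f"{r['start']} to {r['end']}" for r in ranges)
    return f"Results are incomplete. Failed date ranges: {listed}."
```

Because every field is read with a default, a response without the new meta fields produces no banner, which matches the backward-compatible contract.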

Impact

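Backend service (batch_query_engine.py, a shared module that affects every service using execute_plan):
- Failed-range tracking and the retry logic both modify _execute_single_chunk(), which is shared by the five dataset cache services: reject / hold / resource / job / msd
- The retry logic is additive (a new retry loop) and leaves the existing success path unchanged, so it is backward compatible for the other services
- failed_ranges is recorded only when the chunk descriptor contains chunk_start/chunk_end; container-id chunking (used only by reject container mode) is unaffected
- hold / resource / job / msd need a regression smoke test
Backend service (reject_dataset_cache.py): read metadata + inject into response + persist the partial failure flag
Frontend: App.vue (warning banner + date validation), reject-history-filters.js (validateDateRange function), style.css (.warning-banner styles)
API contract: response meta gains optional fields (backward compatible; existing frontends are unaffected)
Tests: test_batch_query_engine.py and test_reject_dataset_cache.py need new test cases for these changes; hold / resource / job / msd need regression verification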
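The conditional failed_ranges tracking described under Impact can be sketched as follows. The `record_chunk_failure` helper, the plain-dict progress structure, and the `start`/`end` keys of each recorded range are assumptions. So is the choice to count a container-id chunk failure in failed_chunk_count while skipping its range.

```python
from typing import Any


def record_chunk_failure(progress: dict[str, Any], chunk: dict[str, Any]) -> None:
    """Mark the batch as partially failed and record the range when one exists."""
    progress["has_partial_failure"] = True
    progress["failed_chunk_count"] = progress.get("failed_chunk_count", 0) + 1
    # Only time-based chunks carry chunk_start/chunk_end; container-id chunks
    # have no date range, so nothing is appended for them.
    if "chunk_start" in chunk and "chunk_end" in chunk:
        progress.setdefault("failed_ranges", []).append(
            {"start": chunk["chunk_start"], "end": chunk["chunk_end"]}
        )
```

Under these assumptions a failed container-id chunk still sets the flag and increments the count, but leaves failed_ranges untouched.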
