DashBoard/openspec/changes/archive/2026-03-03-fix-silent-data-loss-reject-history/proposal.md
egg a275c30c0e feat(reject-history): fix silent data loss by propagating partial failure metadata to frontend
Chunk failures in BatchQueryEngine were silently discarded — `has_partial_failure` was tracked
in Redis but never surfaced to the API response or frontend. Users could see incomplete data
without any warning. This commit closes the gap end-to-end:

Backend:
- Track failed chunk time ranges (`failed_ranges`) in batch engine progress metadata
- Add single retry for transient Oracle errors (timeout, connection) in `_execute_single_chunk`
- Read `get_batch_progress()` after merge but before `redis_clear_batch()` cleanup
- Inject `has_partial_failure`, `failed_chunk_count`, `failed_ranges` into API response meta
- Persist partial failure flag to independent Redis key with TTL aligned to data storage layer
- Add shared container-resolution policy module with wildcard/expansion guardrails
- Refactor reason filter from single-value to multi-select (`reason` → `reasons`)

Frontend:
- Add client-side date range validation (730-day limit) before API submission
- Display amber warning banner on partial failure with specific failed date ranges
- Support generic fallback message for container-mode queries without date ranges
- Update FilterPanel to support multi-select reason chips

Specs & tests:
- Create batch-query-resilience spec; update reject-history-api and reject-history-page specs
- Add 7 new tests for retry, memory guard, failed ranges, partial failure propagation, TTL
- Cross-service regression verified (hold, resource, job, msd — 411 tests pass)

Co-Authored-By: Claude Opus 4.6 <noreply@anthropic.com>
2026-03-03 14:00:07 +08:00


## Why
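The safeguards on reject-history queries (time chunking + 256 MB memory cap + 300s Oracle timeout) discard a chunk's data when that chunk fails. The `has_partial_failure` flag is written only to Redis metadata and is **never passed to the API response or the frontend**. Users receive incomplete data without knowing it, which undermines the correctness of their decisions. In addition, the 730-day date limit is validated only on the backend; the frontend gives no immediate feedback, which causes unnecessary waiting.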
## What Changes
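- Backend `reject_dataset_cache`: after `execute_plan()`, read the batch progress metadata and inject `has_partial_failure`, the failed chunk count, and the failed time ranges into the API response `meta` field
- Backend `batch_query_engine`: track a time-range description for each failed chunk and write it to the `failed_ranges` field of the Redis metadata
- Backend `_execute_single_chunk()`: add a single retry for transient errors (Oracle timeout / connection errors); memory guard failures are not retried
- Frontend: add an amber warning banner that, when `meta.has_partial_failure` is true, shows an incomplete-data warning together with the failed date ranges
- Frontend: add immediate date-range validation (730-day limit) that blocks invalid ranges before the API request is sent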
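
The single-retry rule for `_execute_single_chunk()` could look roughly like the sketch below. It is an illustration only, not the actual implementation: `TransientQueryError`, `MemoryGuardError`, `execute_chunk_with_retry`, and the `run_chunk` callable are hypothetical names standing in for whatever the engine really uses to detect Oracle timeouts, connection errors, and memory guard trips.

```python
class TransientQueryError(Exception):
    """Hypothetical stand-in for an Oracle timeout or connection error."""


class MemoryGuardError(Exception):
    """Hypothetical stand-in for a chunk that exceeded the memory cap."""


def execute_chunk_with_retry(run_chunk, chunk, max_retries=1):
    """Run one chunk, retrying transient errors at most `max_retries` times.

    Returns (rows, None) on success, or (None, error) when the chunk is
    abandoned so the caller can record it as a partial failure.
    """
    retries_used = 0
    while True:
        try:
            return run_chunk(chunk), None
        except MemoryGuardError as exc:
            # A retry would hit the same memory cap, so give up immediately.
            return None, exc
        except TransientQueryError as exc:
            if retries_used >= max_retries:
                return None, exc
            retries_used += 1
```

Because the loop only adds work after a failure, a chunk that succeeds on its first attempt follows the same path as before, which is the backward-compatibility property the proposal relies on.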
## Capabilities
### New Capabilities
- `batch-query-resilience`: failed-range tracking, partial failure metadata propagation, and a single retry for transient errors in the batch query engine
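
A minimal sketch of failed-range tracking follows. It assumes the progress metadata can be treated as a plain dict and that date-range chunk descriptors carry `chunk_start` and `chunk_end` keys. The helper names, the `{"start", "end"}` shape of each entry, and keeping `failed_chunk_count` in the same dict are assumptions made for illustration, not the engine's actual layout.

```python
def describe_failed_range(chunk):
    """Return a range dict for a failed chunk, or None if it has no dates."""
    start = chunk.get("chunk_start")
    end = chunk.get("chunk_end")
    if start is None or end is None:
        # Chunks without a date range (e.g. container-id chunks) get no entry.
        return None
    return {"start": str(start), "end": str(end)}


def record_failure(progress, chunk):
    """Mark the batch as partially failed and remember the failed range."""
    progress["has_partial_failure"] = True
    progress["failed_chunk_count"] = progress.get("failed_chunk_count", 0) + 1
    failed_range = describe_failed_range(chunk)
    if failed_range is not None:
        progress.setdefault("failed_ranges", []).append(failed_range)
    return progress
```

With this shape, a failed container-id chunk still raises the failure flag and the count, but contributes no date range, so the frontend would fall back to a generic message.
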
### Modified Capabilities
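- `reject-history-api`: the API response `meta` gains the `has_partial_failure`, `failed_chunk_count`, and `failed_ranges` fields so the frontend can tell whether the query result is complete
- `reject-history-page`: add an amber warning banner for partial failure warnings; add immediate frontend date-range validation (730-day limit)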
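
To illustrate how the new `meta` fields might be derived, the sketch below maps a progress dict onto optional response fields. The function name `build_partial_failure_meta`, the sample `progress` values, and the surrounding response shape are hypothetical; only the three field names come from this proposal.

```python
def build_partial_failure_meta(progress):
    """Map batch progress metadata to the optional API `meta` fields."""
    if not progress or not progress.get("has_partial_failure"):
        # Fully successful queries add nothing, keeping the response unchanged.
        return {}
    return {
        "has_partial_failure": True,
        "failed_chunk_count": int(progress.get("failed_chunk_count", 0)),
        "failed_ranges": list(progress.get("failed_ranges", [])),
    }


# Example: one failed date-range chunk (illustrative values only).
progress = {
    "has_partial_failure": True,
    "failed_chunk_count": 1,
    "failed_ranges": [{"start": "2025-01-01", "end": "2025-01-31"}],
}
response = {"data": [], "meta": {"total": 0}}
response["meta"].update(build_partial_failure_meta(progress))
```

Returning an empty dict on success means clients that ignore the new keys see the same `meta` as before, which is what makes the fields optional.
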
## Impact
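- **Backend service (batch_query_engine.py, a shared module that affects every service using execute_plan)**:
  - The failed_ranges tracking and retry logic modify `_execute_single_chunk()`, which is shared by five dataset cache services: **reject / hold / resource / job / msd**
  - The retry logic is additive (a new retry loop) and does not change the existing success path, so it is backward compatible for the other services
  - `failed_ranges` is recorded only when the chunk descriptor contains `chunk_start`/`chunk_end`; container-id chunking (used only by reject container mode) is unaffected
  - Regression smoke tests are required for hold / resource / job / msd
- **Backend service (reject_dataset_cache.py)**: read metadata + inject into the response + persist the partial failure flag
- **Frontend**: `App.vue` (warning banner + date validation), `reject-history-filters.js` (validateDateRange function), `style.css` (.warning-banner styles)
- **API contract**: the response `meta` gains optional fields (backward compatible; existing frontends are unaffected)
- **Tests**: `test_batch_query_engine.py` and `test_reject_dataset_cache.py` need corresponding new test cases; hold / resource / job / msd need regression verification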
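
The real `validateDateRange` lives in the frontend JavaScript; the Python sketch below only illustrates the 730-day rule it is meant to enforce. The function name, the error messages, and the choice to treat a span of exactly 730 days as valid are assumptions, since the proposal does not say whether the limit is inclusive.

```python
from datetime import date

MAX_RANGE_DAYS = 730  # limit stated in the proposal


def validate_date_range(start, end, max_days=MAX_RANGE_DAYS):
    """Return an error message for an invalid range, or None if it is valid."""
    if end < start:
        return "End date must not be earlier than start date."
    if (end - start).days > max_days:
        return f"Date range must not exceed {max_days} days."
    return None


# A two-year span that crosses a leap year is 731 days and is rejected.
too_long = validate_date_range(date(2024, 1, 1), date(2026, 1, 1))
within_limit = validate_date_range(date(2024, 1, 1), date(2025, 12, 31))
```

Running the check before the request is sent gives the user an immediate message instead of a round trip that the backend would reject anyway.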