feat: harden long-range batch queries with redis+parquet caching

This commit is contained in:
egg
2026-03-02 21:04:18 +08:00
parent 2568fd836c
commit fb92579331
40 changed files with 5443 additions and 676 deletions


@@ -0,0 +1,2 @@
schema: spec-driven
created: 2026-03-02


@@ -0,0 +1,166 @@
## Context
Six services currently handle large queries independently, with no unified protection:
| Service | Query type | Existing protection | Gap |
|------|---------|---------|------|
| reject-history | date + workorder/Lot/GD expansion | L1+L2 cache, `read_sql_df_slow` | no memory guard, `limit=999999999`, no chunked queries |
| hold-history | date | L1+L2 cache, `read_sql_df_slow` | no memory guard, no time chunking |
| resource-history | date + equipment ID | L1+L2 cache, 1000-row batching | no memory guard |
| mid-section-defect | date → detection → lineage → upstream | Redis cache, EventFetcher batching | no cap on detection count |
| job-query | date + equipment ID | 1000-row batching, `read_sql_df_slow` | **no result cache**, no time chunking |
| query-tool | multiple resolvers → container ID | input-count limit, short-TTL resolve-route cache, EventFetcher cache | most queries still use `read_sql_df` (55s timeout); no unified chunk orchestration |
Reference implementations:
- `EventFetcher`: batch 1000 + ThreadPoolExecutor(2) + `read_sql_df_slow_iter` streaming + Redis cache — **already the best practice**
- `LineageEngine`: batch 1000 + depth limit 20 — **dedicated lineage engine**
Goal: build a shared `BatchQueryEngine` module so that any service that adopts it gets the full set of protections.
## Goals / Non-Goals
**Goals:**
- Unify parquet-in-Redis access into a shared module (removing 3 duplicated copies)
- Provide time-range decomposition (long date ranges → ~31-day monthly chunks)
- Provide ID-batch decomposition (large container ID sets after workorder/Lot/GD expansion → batches of 1000)
- Memory guard: check memory_usage on each chunk result; abort when over threshold
- Result row limit: configurable cap; truncate and flag when exceeded
- Controlled parallelism (sequential by default, opt-in parallel, semaphore-aware)
- Redis chunk cache with partial hits
- Use `read_sql_df_slow` uniformly (300-second dedicated connection)
- Define query_hash and chunk-boundary semantics to avoid inconsistent behavior across services
- Define how the chunk cache interacts with each service's L1/L2 dataset cache
**Non-Goals:**
- No changes to the SQL statements themselves
- No new external dependencies
- No changes to the frontend API surface (transparent to the frontend)
- No replacement of EventFetcher / LineageEngine (each is already optimized; the engine offers optional integration points)
- No changes to the RQ async architecture of trace_job_service
## Decisions
### Decision 1: Extract a shared `redis_df_store.py` module
**Choice**: extract the identical `_redis_store_df` / `_redis_load_df` helpers from reject/hold/resource_dataset_cache into `src/mes_dashboard/core/redis_df_store.py`.
**Alternative**: (A) keep per-service copies → already duplicated in 3 places; hard to maintain.
**Rationale**: parquet-in-Redis is a DataFrame serialization utility; caching policy (TTL, LRU) belongs to a different layer.
### Decision 2: `BatchQueryEngine` as a utility module, not a base class
**Choice**: provide standalone functions (`decompose_by_time_range`, `decompose_by_ids`, `execute_plan`, `merge_chunks`) that services call as needed.
**Alternative**: (A) an abstract base class `BaseDatasetCache` → the three dataset caches differ substantially (SQL, policy filters, derived computations); forced inheritance would over-couple them.
**Rationale**: the utility-module pattern lets services keep their current structure and decide on the main query path whether to enable decomposition. Queries below the threshold bypass the engine entirely.
### Decision 3: Sequential by default, opt-in parallel, semaphore-aware
**Choice**: `execute_plan(parallel=1)` defaults to sequential execution. The effective parallelism cap is `min(requested, semaphore_available - 1)`.
**Alternatives**: (A) parallel by default → risks exhausting the semaphore; (B) never parallel → gives up speed.
**Rationale**: Oracle connections are scarce (production defaults to `DB_SLOW_MAX_CONCURRENT=5`; development is commonly 3). reject_dataset_cache carries the heaviest queries and may set parallel=2; everything else defaults to sequential, which is safest.
### Decision 4: Memory guard + result row limit
**Choice**: after each chunk query, check `df.memory_usage(deep=True).sum()`; if it exceeds `BATCH_CHUNK_MAX_MEMORY_MB` (default 256MB), abort that chunk and mark it failed. Also provide a `max_rows_per_chunk` parameter that adds `FETCH FIRST N ROWS ONLY` to the SQL.
**Alternatives**: (A) no limit → the status quo, with high OOM risk; (B) a single global limit → not flexible enough.
**Rationale**: the chunk-level memory guard is the last line of defense. After decomposition, each chunk's date/ID range is already much smaller; exceeding the memory limit usually signals abnormal data, so aborting beats continuing.
### Decision 5: Chunk cache with partial hits
**Choice**: Redis key `batch:{prefix}:{hash}:chunk:{idx}`, each chunk stored via its own SETEX.
**Alternative**: (A) cache only the final result → no partial hits possible.
**Rationale**: a common user pattern is "query Jan–Jun first, then Jan–Aug". Chunk caching lets the first six months be reused, querying only Jul–Aug.
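The partial-hit planning step can be sketched as follows, using any mapping as the cache (Redis in the real engine; a plain dict here for illustration — the function name is hypothetical):

```python
def plan_with_cache(prefix: str, query_hash: str, chunks: list, cache) -> tuple:
    """Split a chunk plan into (cached, to_run) using the
    batch:{prefix}:{hash}:chunk:{idx} key convention. Only `to_run`
    chunks are passed to the query function."""
    cached, to_run = [], []
    for idx, chunk in enumerate(chunks):
        key = f"batch:{prefix}:{query_hash}:chunk:{idx}"
        (cached if key in cache else to_run).append((idx, chunk))
    return cached, to_run
```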
### Decision 6: Engine path uses the slow-query path exclusively (without occupying the main pool)
**Choice**: all engine-managed queries use the slow-query path (300s timeout, semaphore-controlled); existing short-query paths that bypass the engine stay unchanged.
Slow-query execution uses a two-tier strategy:
1. Primary path: checkout/checkin against the existing dedicated `SLOW POOL` (small capacity).
2. Fallback: when the SLOW POOL is unavailable, degrade to a slow direct connection.
**Alternatives**:
(A) Mixing `read_sql_df` (main pool, 55s timeout) into the engine path → high timeout risk for long queries, and regular API throughput suffers.
(B) Running slow queries on the shared main pool → pool contention and amplified overall latency at peak.
**Rationale**: queries routed through the engine are by definition "known to be potentially slow". Isolating them from the main pool prevents mutual interference; the SLOW POOL provides connection reuse and isolation at once, and the fallback direct connection preserves availability.
### Decision 7: Partial-failure handling
**Choice**: when a chunk fails, log the error and continue with the remaining chunks. `merge_chunks()` returns the successful portion, with metadata flagged `has_partial_failure=True`.
**Alternative**: (A) roll everything back → wastes chunks that already succeeded.
**Rationale**: for historical reporting, partial results beat total failure. The metadata flag lets each service decide whether to warn the user.
### Decision 8: Interaction between chunk cache and service L1/L2 dataset caches
**Choice**: read the chunk cache (Redis) first, assemble the result, then backfill the existing service dataset cache (L1 process + L2 Redis) to preserve the current `/view` path and `query_id` behavior.
**Alternative**: (A) use only the chunk cache without backfilling the service cache → the existing view/query_id flow breaks or re-queries.
**Rationale**: the existing two-phase dataset API (primary query + cached view) must remain compatible; the chunk cache is an engine-level optimization and must not break the service-level interface.
### Decision 9: query_hash specification
**Choice**: query_hash is the first 16 hex characters of the SHA-256 of a canonical JSON form (sorted keys, stable list ordering, normalized strings); the hash covers only parameters that affect the raw dataset (presentation-only parameters are excluded).
**Alternative**: (A) let each service implement its own hash → unpredictable across services and hard to debug.
**Rationale**: chunk keys, progress keys, and merge keys must be reproducible; otherwise cache hits and partial reuse cannot be guaranteed.
### Decision 10: Time-decomposition boundary semantics
**Choice**: chunks use closed intervals `[chunk_start, chunk_end]`; the next chunk starts at `chunk_end + 1 day`; the final chunk may be shorter than grain_days. Input dates follow each service's existing timezone/day-boundary conventions; the engine does not reinterpret timezones.
**Alternative**: (A) half-open intervals, or dynamic per-month splitting with undefined boundaries → prone to overlaps or gaps.
**Rationale**: with fixed boundary semantics, merge deduplication, statistical consistency, and test verifiability all improve.
### Decision 11: Spill large results to Parquet; Redis keeps only metadata and hot cache
**Choice**: introduce spill-to-disk for long queries (especially reject-history):
1. Chunk queries and the chunk cache stay as-is (Redis, short TTL).
2. After merging, if the result exceeds a threshold (rows / memory / serialized size), write it as Parquet to a local spool directory.
3. Redis stores only metadata (query_id, file_path, row_count, schema_hash, created_at, expires_at).
4. `/view`/`/export` read the parquet via metadata first; when no metadata exists, fall back to the current cache behavior.
5. A background cleaner periodically removes expired parquet files and orphan metadata.
**Alternatives**:
(A) Keep all results fully in Redis (status quo) → high memory pressure, prone to cascading lock timeout/OOM.
(B) Write directly to a DB (e.g. SQLite) → write-lock contention and operational complexity (`database is locked` has already been observed).
**Rationale**: Redis is an in-memory cache and unsuited to holding large results for long; Parquet spilling moves them to disk, lowering worker/Redis memory peaks.
## Risks / Trade-offs
**[Redis memory growth]** → chunk caching adds keys (365 days ≈ 12 chunk keys).
→ Mitigation: TTL auto-expiry (900s); chunk results are parquet-compressed (typically ~10:1).
**[Semaphore contention]** → parallel chunks consume more permits.
→ Mitigation: detect available permits and degrade to sequential automatically when scarce. Default parallel=1.
**[Data consistency across time chunks]** → different monthly chunks are queried at different moments.
→ Mitigation: historical reporting data updates infrequently (daily), so drift within a short window is minimal. Acceptable.
**[Migration risk]** → modifying the 3 dataset caches first and then the other services is still a large overall scope.
→ Mitigation: threshold gating (short queries bypass the engine) + phased P0/P1/P2/P3 rollout + independent validation per phase.
**[Disk I/O and capacity pressure]** → Parquet spilling adds disk reads/writes; a failed cleanup policy could accumulate files.
→ Mitigation: spool capacity cap, TTL cleanup, orphan scan at startup, and a "no-spill, summary-only response" protection mode when over capacity.
**[Stale metadata / orphan files]** → Redis metadata and physical files can diverge.
→ Mitigation: verify file existence and schema hash before reads; on mismatch, automatically invalidate the metadata and log an alert.
## Open Questions
1. In the 4-stage pipeline of `mid_section_defect_service` (detection → lineage → upstream history → attribution), which stages should adopt the engine? The detection query can be date-decomposed, but lineage/upstream already go through EventFetcher.
2. `query_tool_service` has 15+ query types. Adopt the engine for all of them, or only the most timeout-prone ones (split_merge_history, equipment_period)?


@@ -0,0 +1,83 @@
## Why
The historical reporting services (reject-history, hold-history, resource-history), the query tool (query-tool), mid-section defect analysis (mid-section-defect), and job query (job-query) each implement their own batch-query, caching, and parallel-execution patterns, with no unified orchestration or protection. The main problems:
1. **Oracle timeouts**: queries over long date ranges (365+ days) or large container ID sets (thousands after workorder expansion) can exceed the 300-second call_timeout.
2. **OOM risk**: the reject/hold dataset caches fetch everything with `limit: 999999999` and have no memory guard.
3. **Scattered protection**: `EventFetcher` already has ID batching + caching, but the reject/hold/resource dataset caches each maintain their own query and cache strategies.
4. **Duplicated code**: 3 dataset caches each copy the same parquet-in-Redis serialization logic.
5. **ID-expansion blowup**: workorder resolution can expand into very large container ID sets, with no consistent cross-service batching/merging flow.
6. **Expensive re-queries**: extending a query range (e.g. Jan–Jun to Jan–Aug) cannot reuse already-queried segments.
7. **High timeout risk in query-tool**: most queries still use `read_sql_df` (main pool / 55s timeout) and time out easily on large queries.
We need a **stable, reusable query-engine module** that gives any adopting service decomposition, caching, memory protection, and timeout protection automatically.
## What Changes
- Add a shared `BatchQueryEngine` module providing:
- **Time-range decomposition**: long date ranges → ~31-day monthly chunks, each queried independently
- **Time-decomposition semantics**: explicitly defined chunk boundaries (closed intervals), month-crossing splits, and short final chunks
- **ID-batch decomposition**: large ID sets (after workorder/Lot/GD Lot/serial expansion) → batches of 1000
- **query_hash specification**: unified canonicalization and hashed fields, keeping chunk/cache keys stable
- **Memory guard**: check `DataFrame.memory_usage()` per chunk result; abort and warn over threshold
- **Result row limit**: configurable maximum row count; truncate and flag when exceeded
- **Controlled parallel execution**: sequential by default, opt-in parallel, strictly honoring the slow-query semaphore
- **Redis chunk cache**: each chunk cached independently, with partial hits (reuse of already-queried intervals when a range is extended)
- **Cache-layer interaction**: explicitly defined read/write order between the chunk cache and each service's existing L1/L2 dataset cache
- **Progress tracking**: progress recorded via Redis HSET, available for frontend display
- Add a **large-result spill layer (Parquet spill)**:
- When a long query's result exceeds memory/row thresholds, write the merged result as Parquet to a persistent local directory (e.g. `tmp/query_spool/`)
- Redis keeps only metadata (query_id → parquet path / schema / rows / created_at / ttl)
- `/view` and `/export` read via Redis metadata + Parquet first, so whole DataFrames no longer stay resident in worker RAM
- Scheduled cleanup (TTL + background cleaner) removes expired parquet files to prevent unbounded disk growth
- Add a shared `redis_df_store` module, extracting the parquet-in-Redis access logic from the 3 dataset caches into a shared utility
- All **engine-managed chunk queries** uniformly use the slow path (300-second-class timeout):
- Reuse slow-query connections via the existing **dedicated SLOW POOL (small capacity)**
- Explicitly **never use the main query pool** for slow queries, to avoid dragging down regular APIs
- When the SLOW POOL is unavailable, degrade to a slow direct connection (no impact on the main pool)
## Capabilities
### New Capabilities
- `batch-query-engine`: unified batch-query engine module covering decomposition strategies (time/ID), memory guard, result limits, controlled execution, Redis chunk cache, progress tracking, and result merging
### Modified Capabilities
- `reject-history-api`: primary query runs through the engine; date_range mode gets automatic time decomposition; container mode (after workorder/Lot/GD Lot expansion) gets automatic ID batching
- `hold-dataset-cache`: primary query runs through the engine, with automatic decomposition for long date ranges
- `resource-dataset-cache`: primary query runs through the engine, with automatic decomposition for long date ranges
- `event-fetcher-unified`: keeps its existing optimizations (batch + streaming + cache); engine adoption is reconsidered only if a unified monitoring/progress model is needed
## Impact
- **Backend**: 2 new shared modules (`batch_query_engine.py`, `redis_df_store.py`); the 3 dataset-cache primary query paths (reject/hold/resource) are modified first
- **Affected services** (by priority):
- P0: reject-history (most prone to timeout/OOM — long date ranges + workorder expansion + the current `limit=999999999`)
- P1: hold-history, resource-history (same architecture, direct reuse)
- P2: mid-section-defect (4-stage pipeline; detection query + upstream history), job-query (no cache + needs date decomposition)
- P3: query-tool (migrate the high-risk `read_sql_df` paths first and add slow-query protection), event-fetcher (optional)
- **Database**: no SQL changes; only the bind-parameter range per query shrinks
- **Database connection strategy**: slow queries are isolated from regular pooled queries to avoid mutual interference
- **Redis**: new chunk-cache keys under the `batch:*` prefix
- **Storage layer**: new Parquet spill directory and cleanup mechanism; Redis becomes an index/metadata store instead of holding entire large results
- **Memory**: the engine enforces a per-chunk memory cap (default 256MB; chunks over the cap are aborted)
- **Availability**: even with Redis `maxmemory` + eviction, results remain recoverable via Parquet metadata (a cache miss no longer means data loss)
- **Backward compatibility**: short queries (< 60 days, < 1000 IDs) take the existing path with zero extra overhead; existing route/event cache strategies are unchanged
- **Frontend**: optional change (long queries may show a progress bar; not required)
## Expected Effects and Side Effects of Parquet Spilling
**Expected effects:**
- Greatly reduce peak memory during the worker merge + cache-backfill phase (no more single-worker spikes into the GB range)
- Redis memory shifts from "storing full datasets" to "storing index/hot data", reducing the risk of cascading OOM and lock timeouts
- After a service restart, results remain recoverable as long as the parquet has not expired (via metadata)
**Possible side effects:**
- Increased disk I/O: parquet read/write spikes during query peaks
- Disk capacity risk: if the cleanup policy fails, the spool directory can keep growing
- Data-consistency risk: metadata becomes a stale pointer if the file it references is deleted or corrupted externally
- Security and governance: spilled files need access control, backup/cleanup, and audit policies
**Mitigations:**
- Enforce TTL + periodic scan cleanup, double-checking both metadata and file mtime
- Run orphan/stale checks at startup, auto-repairing metadata or deleting orphan files
- Start with reject-history long queries as P0, then expand gradually to other services


@@ -0,0 +1,166 @@
## ADDED Requirements
### Requirement: BatchQueryEngine SHALL provide time-range decomposition
The module SHALL decompose long date ranges into manageable monthly chunks to prevent Oracle timeout.
#### Scenario: Decompose date range into monthly chunks
- **WHEN** `decompose_by_time_range(start_date, end_date, grain_days=31)` is called
- **THEN** the date range SHALL be split into chunks of at most `grain_days` days each
- **THEN** each chunk SHALL contain `chunk_start` and `chunk_end` date strings
- **THEN** chunks SHALL be contiguous and non-overlapping, covering the full range
#### Scenario: Short date range returns single chunk
- **WHEN** the date range is shorter than or equal to `grain_days`
- **THEN** a single chunk covering the full range SHALL be returned
#### Scenario: Time-chunk boundary semantics are deterministic
- **WHEN** a date range is decomposed into multiple chunks
- **THEN** each chunk SHALL use a closed interval `[chunk_start, chunk_end]`
- **THEN** the next chunk SHALL start at `previous_chunk_end + 1 day`
- **THEN** the final chunk MAY contain fewer than `grain_days` days
- **THEN** chunk ranges SHALL have no overlap and no gap
### Requirement: BatchQueryEngine SHALL provide ID-batch decomposition
The module SHALL decompose large ID lists (from workorder/lot/GD lot/serial resolve expansion) into batches respecting Oracle IN-clause limits.
#### Scenario: Decompose ID list into batches
- **WHEN** `decompose_by_ids(ids, batch_size=1000)` is called with more than `batch_size` IDs
- **THEN** the ID list SHALL be split into batches of at most `batch_size` items each
#### Scenario: Small ID list returns single batch
- **WHEN** the ID list has fewer than or equal to `batch_size` items
- **THEN** a single batch containing all IDs SHALL be returned
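The batching behavior specified above is a simple slicing rule; a minimal sketch (function name matches the requirement; implementation is illustrative):

```python
def decompose_by_ids(ids: list, batch_size: int = 1000) -> list:
    """Split an ID list into order-preserving batches of at most
    batch_size items each (respecting the Oracle IN-clause limit)."""
    return [ids[i:i + batch_size] for i in range(0, len(ids), batch_size)]
```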
### Requirement: BatchQueryEngine SHALL execute chunk plans with controlled parallelism
The module SHALL execute query chunks sequentially by default, with opt-in parallel execution respecting the slow query semaphore.
#### Scenario: Sequential execution (default)
- **WHEN** `execute_plan(chunks, query_fn, parallel=1)` is called
- **THEN** chunks SHALL be executed one at a time in order
- **THEN** each chunk result SHALL be stored to Redis immediately after completion
- **THEN** the function SHALL return a `query_hash` identifying the batch result
#### Scenario: Parallel execution with semaphore awareness
- **WHEN** `execute_plan(chunks, query_fn, parallel=2)` is called
- **THEN** up to `parallel` chunks SHALL execute concurrently via ThreadPoolExecutor
- **THEN** each thread SHALL acquire the slow query semaphore before executing `query_fn`
- **THEN** actual concurrency SHALL be capped at `min(parallel, available_semaphore_permits - 1)`
- **THEN** if semaphore is fully occupied, execution SHALL degrade to sequential
#### Scenario: All engine queries use dedicated connection
- **WHEN** a chunk's `query_fn` executes an Oracle query
- **THEN** it SHALL use `read_sql_df_slow` (dedicated connection, 300s timeout, semaphore-controlled)
- **THEN** pooled connection (`read_sql_df`) SHALL NOT be used for engine-managed queries
### Requirement: BatchQueryEngine SHALL enforce memory guards per chunk
The module SHALL check each chunk result's memory usage and abort if it exceeds a configurable threshold.
#### Scenario: Chunk memory within limit
- **WHEN** a chunk query returns a DataFrame within `BATCH_CHUNK_MAX_MEMORY_MB` (default 256MB, env-configurable)
- **THEN** the chunk SHALL be stored to Redis and marked as completed
#### Scenario: Chunk memory exceeds limit
- **WHEN** a chunk query returns a DataFrame exceeding `BATCH_CHUNK_MAX_MEMORY_MB`
- **THEN** the chunk SHALL be discarded (NOT stored to Redis)
- **THEN** the chunk SHALL be marked as failed in metadata with reason `memory_limit_exceeded`
- **THEN** a warning log SHALL include chunk index, actual memory MB, and threshold
- **THEN** remaining chunks SHALL continue execution
#### Scenario: Result row count limit
- **WHEN** `max_rows_per_chunk` is configured
- **THEN** the engine SHALL pass this limit to `query_fn` for SQL-level truncation (e.g., `FETCH FIRST N ROWS ONLY`)
- **THEN** if the result contains exactly `max_rows_per_chunk` rows, metadata SHALL include `truncated=True`
### Requirement: BatchQueryEngine SHALL support partial cache hits
The module SHALL check Redis for previously cached chunks and skip re-execution for cached chunks.
#### Scenario: Partial cache hit skips cached chunks
- **WHEN** `execute_plan(chunks, query_fn, skip_cached=True)` is called
- **THEN** for each chunk, Redis SHALL be checked for an existing cached result
- **THEN** chunks with valid cached results SHALL NOT be re-executed
- **THEN** only uncached chunks SHALL be passed to `query_fn`
#### Scenario: Full cache hit skips all execution
- **WHEN** all chunks already exist in Redis cache
- **THEN** no Oracle queries SHALL be executed
- **THEN** `merge_chunks()` SHALL return the combined cached DataFrames
### Requirement: BatchQueryEngine SHALL generate deterministic query_hash
The module SHALL use a stable hash for cache/progress keys so semantically identical queries map to the same batch identity.
#### Scenario: Stable hash for equivalent parameters
- **WHEN** two requests contain the same semantic query parameters in different input order
- **THEN** canonicalization SHALL normalize ordering before hashing
- **THEN** `query_hash` SHALL be identical for both requests
#### Scenario: Hash changes only when dataset-affecting parameters change
- **WHEN** parameters affecting the raw dataset (date range, mode, resolved IDs, core filters) change
- **THEN** `query_hash` SHALL change
- **THEN** presentation-only parameters SHALL NOT change `query_hash`
### Requirement: BatchQueryEngine SHALL define chunk-cache to service-cache handoff
The module SHALL integrate chunk-level cache with existing service-level dataset caches without breaking query_id-based view APIs.
#### Scenario: Chunk merge backfills service dataset cache
- **WHEN** chunk results are loaded/merged into a complete dataset for a primary query
- **THEN** the merged DataFrame SHALL be written back to the service's existing dataset cache layers (L1 process + L2 Redis)
- **THEN** downstream `/view` queries using the service `query_id` SHALL continue to work without additional Oracle queries
#### Scenario: Service cache miss with chunk cache hit
- **WHEN** a service-level dataset cache entry has expired but relevant chunk cache keys still exist
- **THEN** the engine SHALL rebuild the merged dataset from chunk cache
- **THEN** the service dataset cache SHALL be repopulated before returning response
### Requirement: BatchQueryEngine SHALL store chunk results in Redis
The module SHALL store each chunk as a separate Redis key using parquet-in-Redis format.
#### Scenario: Chunk storage key format
- **WHEN** a chunk result is stored
- **THEN** the Redis key SHALL follow the pattern `batch:{cache_prefix}:{query_hash}:chunk:{idx}`
- **THEN** each chunk SHALL be stored as a parquet-encoded base64 string via `redis_df_store`
- **THEN** each chunk key SHALL have a TTL matching the service's cache TTL (default 900 seconds)
#### Scenario: Chunk metadata tracking
- **WHEN** chunks are being executed
- **THEN** a metadata key `batch:{cache_prefix}:{query_hash}:meta` SHALL be updated via Redis HSET
- **THEN** metadata SHALL include `total`, `completed`, `failed`, `pct`, `status`, and `has_partial_failure` fields
### Requirement: BatchQueryEngine SHALL merge chunk results into a single DataFrame
The module SHALL provide result assembly from cached chunks.
#### Scenario: Merge all chunks
- **WHEN** `merge_chunks(query_hash)` is called
- **THEN** all chunk DataFrames SHALL be loaded from Redis and concatenated via `pd.concat`
- **THEN** if any chunk is missing, the merge SHALL proceed with available chunks and set `has_partial_failure=True`
#### Scenario: Iterate chunks for streaming
- **WHEN** `iterate_chunks(query_hash)` is called
- **THEN** chunk DataFrames SHALL be yielded one at a time without loading all into memory simultaneously
### Requirement: BatchQueryEngine SHALL handle chunk failures gracefully
The module SHALL continue execution when individual chunks fail and report partial results.
#### Scenario: Single chunk failure
- **WHEN** a chunk's `query_fn` raises an exception (timeout, ORA error, etc.)
- **THEN** the error SHALL be logged with chunk index and exception details
- **THEN** the failed chunk SHALL be marked as failed in metadata
- **THEN** remaining chunks SHALL continue execution
#### Scenario: All chunks fail
- **WHEN** all chunks' `query_fn` calls raise exceptions
- **THEN** metadata status SHALL be set to `failed`
- **THEN** `merge_chunks()` SHALL return an empty DataFrame
### Requirement: Shared redis_df_store module SHALL provide parquet-in-Redis utilities
The module SHALL provide reusable DataFrame serialization to/from Redis using parquet + base64 encoding.
#### Scenario: Store DataFrame to Redis
- **WHEN** `redis_store_df(key, df, ttl)` is called
- **THEN** the DataFrame SHALL be serialized to parquet format using pyarrow
- **THEN** the parquet bytes SHALL be base64-encoded and stored via Redis SETEX with the given TTL
- **THEN** if Redis is unavailable, the function SHALL log a warning and return without error
#### Scenario: Load DataFrame from Redis
- **WHEN** `redis_load_df(key)` is called
- **THEN** the base64 string SHALL be loaded from Redis, decoded, and deserialized to a DataFrame
- **THEN** if the key does not exist or Redis is unavailable, the function SHALL return None


@@ -0,0 +1,37 @@
## MODIFIED Requirements
### Requirement: EventFetcher SHALL provide unified cached event querying across domains
`EventFetcher` SHALL encapsulate batch event queries with L1/L2 layered cache and rate limit bucket configuration, supporting domains: `history`, `materials`, `rejects`, `holds`, `jobs`, `upstream_history`, `downstream_rejects`. EventFetcher MAY optionally delegate ID batching to `BatchQueryEngine` for consistent decomposition patterns.
#### Scenario: Cache miss for event domain query
- **WHEN** `EventFetcher` is called for a domain with container IDs and no cache exists
- **THEN** the domain query SHALL execute against Oracle via `read_sql_df_slow()` (non-pooled dedicated connection)
- **THEN** each batch query SHALL use `timeout_seconds=60`
- **THEN** the result SHALL be stored in L2 Redis cache with key format `evt:{domain}:{sorted_cids_hash}` if CID count is within cache threshold
- **THEN** L1 memory cache SHALL also be populated if CID count is within cache threshold
#### Scenario: Cache hit for event domain query
- **WHEN** `EventFetcher` is called for a domain and L2 Redis cache contains a valid entry
- **THEN** the cached result SHALL be returned without executing Oracle query
- **THEN** DB connection pool SHALL NOT be consumed
#### Scenario: Rate limit bucket per domain
- **WHEN** `EventFetcher` is used from a route handler
- **THEN** each domain SHALL have a configurable rate limit bucket aligned with `configured_rate_limit()` pattern
- **THEN** rate limit configuration SHALL be overridable via environment variables
#### Scenario: Large CID set exceeds cache threshold
- **WHEN** the normalized CID count exceeds `CACHE_SKIP_CID_THRESHOLD` (default 10000, env: `EVENT_FETCHER_CACHE_SKIP_CID_THRESHOLD`)
- **THEN** EventFetcher SHALL skip both L1 and L2 cache writes
- **THEN** a warning log SHALL be emitted with domain name, CID count, and threshold value
- **THEN** the query result SHALL still be returned to the caller
#### Scenario: Batch concurrency default
- **WHEN** EventFetcher processes batches for a domain with >1000 CIDs
- **THEN** the default `EVENT_FETCHER_MAX_WORKERS` SHALL be 2 (env: `EVENT_FETCHER_MAX_WORKERS`)
#### Scenario: Optional BatchQueryEngine integration
- **WHEN** EventFetcher is refactored to use `BatchQueryEngine` (optional, not required)
- **THEN** `decompose_by_ids()` MAY replace inline batching logic
- **THEN** existing ThreadPoolExecutor + read_sql_df_slow_iter patterns SHALL be preserved as the primary implementation
- **THEN** no behavioral changes SHALL be introduced by engine integration


@@ -0,0 +1,34 @@
## MODIFIED Requirements
### Requirement: Hold dataset cache SHALL execute a single Oracle query and cache the result
The hold_dataset_cache module SHALL query Oracle once for the full hold/release fact set and cache it for subsequent derivations. For date ranges exceeding 60 days, the query SHALL be decomposed into monthly chunks via `BatchQueryEngine`.
#### Scenario: Primary query execution and caching
- **WHEN** `execute_primary_query()` is called with date range and hold_type parameters
- **THEN** a deterministic `query_id` SHALL be computed from the primary params (start_date, end_date) using SHA256
- **THEN** if a cached DataFrame exists for this query_id (L1 or L2), it SHALL be used without querying Oracle
- **THEN** if no cache exists, a single Oracle query SHALL fetch all hold/release records from `DW_MES_HOLDRELEASEHISTORY` for the date range (all hold_types)
- **THEN** the result DataFrame SHALL be stored in both L1 (ProcessLevelCache) and L2 (Redis as parquet/base64)
- **THEN** the response SHALL include `query_id`, trend, reason_pareto, duration, and list page 1
#### Scenario: Long date range triggers batch decomposition
- **WHEN** the date range exceeds 60 days (configurable via `BATCH_QUERY_TIME_THRESHOLD_DAYS`)
- **THEN** the query SHALL be decomposed into ~31-day monthly chunks via `BatchQueryEngine.decompose_by_time_range()`
- **THEN** each chunk SHALL execute independently via `read_sql_df_slow` with the chunk's date sub-range
- **THEN** chunk results SHALL be stored individually in Redis and merged via `pd.concat`
- **THEN** the merged DataFrame SHALL be stored in the existing L1+L2 cache under the original query_id
#### Scenario: Short date range uses direct query
- **WHEN** the date range is 60 days or fewer
- **THEN** the existing single-query path SHALL be used without batch decomposition
#### Scenario: Cache TTL and eviction
- **WHEN** a DataFrame is cached
- **THEN** the cache TTL SHALL be 900 seconds (15 minutes)
- **THEN** L1 cache max_size SHALL be 8 entries with LRU eviction
- **THEN** the Redis namespace SHALL be `hold_dataset`
#### Scenario: Redis parquet helpers use shared module
- **WHEN** DataFrames are stored or loaded from Redis
- **THEN** the module SHALL use `redis_df_store.redis_store_df()` and `redis_df_store.redis_load_df()` from the shared `core/redis_df_store.py` module
- **THEN** inline `_redis_store_df` / `_redis_load_df` functions SHALL be removed


@@ -0,0 +1,34 @@
## MODIFIED Requirements
### Requirement: Job query SHALL use BatchQueryEngine for long-range decomposition
The `get_jobs_by_resources()` function SHALL delegate to BatchQueryEngine when the requested date range exceeds the configurable threshold, preventing Oracle timeout on large job queries.
#### Scenario: Long date range triggers engine decomposition
- **WHEN** `get_jobs_by_resources(resource_ids, start_date, end_date)` is called
- **AND** the date range exceeds `BATCH_QUERY_TIME_THRESHOLD_DAYS` (default 60)
- **THEN** the date range SHALL be decomposed via `decompose_by_time_range()`
- **THEN** each chunk SHALL be executed through the existing job SQL with chunk-scoped dates
- **THEN** the existing `_build_resource_filter()` batching SHALL be preserved within each chunk
#### Scenario: Short date range preserves direct path
- **WHEN** the date range is within the threshold
- **THEN** the existing direct query path SHALL be used with zero overhead
### Requirement: Job query results SHALL be cached in Redis
Job query results SHALL be cached using the shared `redis_df_store` module to avoid redundant Oracle queries on repeated requests.
#### Scenario: Cache hit returns stored result
- **WHEN** a job query is executed with identical parameters within the cache TTL
- **THEN** the cached result SHALL be returned without hitting Oracle
#### Scenario: Cache miss triggers fresh query
- **WHEN** no cached result exists for the query parameters
- **THEN** the query SHALL execute against Oracle
- **THEN** the result SHALL be stored in Redis with the configured TTL
### Requirement: Job queries SHALL use read_sql_df_slow execution path
- **WHEN** engine-managed job queries execute
- **THEN** they SHALL use `read_sql_df_slow` (dedicated connection, 300s timeout)
- **THEN** no pooled-query regressions SHALL be introduced


@@ -0,0 +1,22 @@
## MODIFIED Requirements
### Requirement: Detection query SHALL use BatchQueryEngine for long-range decomposition
The `_fetch_station_detection_data()` function SHALL delegate to BatchQueryEngine when the requested date range exceeds the configurable threshold, preventing Oracle timeout on large detection queries.
#### Scenario: Long date range triggers engine decomposition
- **WHEN** `_fetch_station_detection_data(start_date, end_date, station)` is called
- **AND** the date range exceeds `BATCH_QUERY_TIME_THRESHOLD_DAYS` (default 60)
- **THEN** the date range SHALL be decomposed via `decompose_by_time_range()`
- **THEN** each chunk SHALL be executed through the existing detection SQL with chunk-scoped dates
- **THEN** chunk results SHALL be cached in Redis and merged into a single DataFrame
#### Scenario: Short date range preserves direct path
- **WHEN** the date range is within the threshold
- **THEN** the existing direct query path SHALL be used with zero overhead
#### Scenario: Memory guard protects against oversized detection results
- **WHEN** a single chunk result exceeds `BATCH_CHUNK_MAX_MEMORY_MB`
- **THEN** that chunk SHALL be discarded and marked as failed
- **THEN** remaining chunks SHALL continue executing
- **THEN** the batch metadata SHALL reflect `has_partial_failure`


@@ -0,0 +1,37 @@
## MODIFIED Requirements
### Requirement: High-risk query_tool paths SHALL migrate to slow-query execution
Functions currently using `read_sql_df` (fast pool, 55s timeout) that handle unbounded or user-driven queries SHALL be migrated to `read_sql_df_slow` (dedicated connection, 300s timeout) to prevent timeout failures.
#### Scenario: Serial number resolution uses slow-query path
- **WHEN** `_resolve_by_serial_number()` executes resolver SQL queries
- **THEN** queries SHALL use `read_sql_df_slow` instead of `read_sql_df`
#### Scenario: Work order resolution uses slow-query path
- **WHEN** `_resolve_by_work_order()` executes resolver SQL queries
- **THEN** queries SHALL use `read_sql_df_slow` instead of `read_sql_df`
#### Scenario: Equipment query functions use slow-query path
- **WHEN** `get_equipment_status_hours()`, `get_equipment_lots()`, `get_equipment_materials()`, `get_equipment_rejects()`, or `get_equipment_jobs()` execute equipment SQL queries
- **THEN** queries SHALL use `read_sql_df_slow` instead of `read_sql_df`
### Requirement: High-risk query_tool paths SHALL use engine decomposition for large inputs
Selected query functions SHALL delegate to BatchQueryEngine for ID decomposition when the resolved input set is large.
#### Scenario: Large serial number batch triggers engine decomposition
- **WHEN** `_resolve_by_serial_number()` is called with more IDs than `BATCH_QUERY_ID_THRESHOLD`
- **THEN** IDs SHALL be decomposed via `decompose_by_ids()`
- **THEN** each batch SHALL be executed through the existing resolver SQL
#### Scenario: Equipment period queries use engine time decomposition
- **WHEN** equipment period queries span more than `BATCH_QUERY_TIME_THRESHOLD_DAYS`
- **THEN** the date range SHALL be decomposed via `decompose_by_time_range()`
### Requirement: Existing resolve cache strategy SHALL be reviewed for heavy query patterns
#### Scenario: Route-level short-TTL cache extended for high-repeat patterns
- **WHEN** a query pattern is identified as high-repeat (same parameters within minutes)
- **THEN** result caching SHALL be considered using `redis_df_store`
- **THEN** cache TTL SHALL align with the service's data freshness requirements


@@ -0,0 +1,31 @@
## MODIFIED Requirements
### Requirement: Database query execution path
The reject-history service (`reject_history_service.py` and `reject_dataset_cache.py`) SHALL use `read_sql_df_slow` (dedicated connection) instead of `read_sql_df` (pooled connection) for all Oracle queries. For large queries, `BatchQueryEngine` SHALL decompose by time range or ID count.
#### Scenario: Primary query uses dedicated connection
- **WHEN** the reject-history primary query is executed
- **THEN** it SHALL use `read_sql_df_slow`, which creates a dedicated Oracle connection outside the pool
- **AND** the connection has a 300-second call_timeout (configurable)
- **AND** the connection is subject to the global slow query semaphore
#### Scenario: Long date range triggers time decomposition (date_range mode)
- **WHEN** the primary query is in `date_range` mode and the range exceeds 60 days (configurable via `BATCH_QUERY_TIME_THRESHOLD_DAYS`)
- **THEN** the query SHALL be decomposed into ~31-day monthly chunks via `BatchQueryEngine.decompose_by_time_range()`
- **THEN** each chunk SHALL execute independently with the chunk's date sub-range as bind parameters
- **THEN** chunk results SHALL be stored individually in Redis and merged via `pd.concat`
#### Scenario: Large container ID set triggers ID decomposition (container mode)
- **WHEN** the primary query is in `container` mode (workorder/lot/wafer_lot input) and the resolved container ID count exceeds 1000
- **THEN** the container IDs SHALL be decomposed into 1000-item batches via `BatchQueryEngine.decompose_by_ids()`
- **THEN** each batch SHALL execute independently
- **THEN** batch results SHALL be merged into the final cached DataFrame
#### Scenario: Short date range or small ID set uses direct query
- **WHEN** the date range is 60 days or fewer, or resolved container IDs are 1000 or fewer
- **THEN** the existing single-query path SHALL be used without engine decomposition
#### Scenario: Memory guard on result
- **WHEN** a chunk query result exceeds `BATCH_CHUNK_MAX_MEMORY_MB`
- **THEN** the chunk SHALL be discarded and marked as failed
- **THEN** the current `limit: 999999999` pattern SHALL be replaced with a configurable `max_rows_per_chunk`


@@ -0,0 +1,34 @@
## MODIFIED Requirements
### Requirement: Resource dataset cache SHALL execute a single Oracle query and cache the result
The resource_dataset_cache module SHALL query Oracle once for the full shift-status fact set and cache it for subsequent derivations. For date ranges exceeding 60 days, the query SHALL be decomposed into monthly chunks via `BatchQueryEngine`.
#### Scenario: Primary query execution and caching
- **WHEN** `execute_primary_query()` is called with date range, granularity, and resource filter parameters
- **THEN** a deterministic `query_id` SHALL be computed from all primary params using SHA256
- **THEN** if a cached DataFrame exists for this query_id (L1 or L2), it SHALL be used without querying Oracle
- **THEN** if no cache exists, a single Oracle query SHALL fetch all shift-status records from `DW_MES_RESOURCESTATUS_SHIFT` for the filtered resources and date range
- **THEN** the result DataFrame SHALL be stored in both L1 (ProcessLevelCache) and L2 (Redis as parquet/base64)
- **THEN** the response SHALL include `query_id`, summary (KPI, trend, heatmap, comparison), and detail page 1
#### Scenario: Long date range triggers batch decomposition
- **WHEN** the date range exceeds 60 days (configurable via `BATCH_QUERY_TIME_THRESHOLD_DAYS`)
- **THEN** the query SHALL be decomposed into ~31-day monthly chunks via `BatchQueryEngine.decompose_by_time_range()`
- **THEN** each chunk SHALL execute independently via `read_sql_df_slow` with the chunk's date sub-range
- **THEN** chunk results SHALL be stored individually in Redis and merged via `pd.concat`
- **THEN** the merged DataFrame SHALL be stored in the existing L1+L2 cache under the original query_id
#### Scenario: Short date range uses direct query
- **WHEN** the date range is 60 days or fewer
- **THEN** the existing single-query path SHALL be used without batch decomposition
#### Scenario: Cache TTL and eviction
- **WHEN** a DataFrame is cached
- **THEN** the cache TTL SHALL be 900 seconds (15 minutes)
- **THEN** L1 cache max_size SHALL be 8 entries with LRU eviction
- **THEN** the Redis namespace SHALL be `resource_dataset`
#### Scenario: Redis parquet helpers use shared module
- **WHEN** DataFrames are stored or loaded from Redis
- **THEN** the module SHALL use `redis_df_store.redis_store_df()` and `redis_df_store.redis_load_df()` from the shared `core/redis_df_store.py` module
- **THEN** inline `_redis_store_df` / `_redis_load_df` functions SHALL be removed

## 0. Artifact Alignment (P2/P3 Specs)
- [x] 0.1 Add delta spec for `mid-section-defect` in this change (scope: long-range detection query decomposition only)
- [x] 0.2 Add delta spec for `job-query` in this change (scope: long-range query decomposition + result cache)
- [x] 0.3 Add delta spec for `query-tool` in this change (scope: high-risk endpoints and timeout-protection strategy)
## 1. Shared Infrastructure — redis_df_store
- [x] 1.1 Create `src/mes_dashboard/core/redis_df_store.py` with `redis_store_df(key, df, ttl)` and `redis_load_df(key)` extracted from reject_dataset_cache.py (lines 82-111)
- [x] 1.2 Add chunk-level helpers: `redis_store_chunk(prefix, query_hash, idx, df, ttl)`, `redis_load_chunk(prefix, query_hash, idx)`, `redis_chunk_exists(prefix, query_hash, idx)`
## 2. Shared Infrastructure — BatchQueryEngine
- [x] 2.1 Create `src/mes_dashboard/services/batch_query_engine.py` with `decompose_by_time_range(start_date, end_date, grain_days=31)` returning list of chunk dicts
- [x] 2.2 Add `decompose_by_ids(ids, batch_size=1000)` for container ID batching (after workorder/lot/GD lot/serial expansion)
- [x] 2.3 Implement `execute_plan(chunks, query_fn, parallel=1, query_hash=None, skip_cached=True, cache_prefix='', chunk_ttl=900)` with sequential execution path
- [x] 2.4 Add parallel execution path using ThreadPoolExecutor with semaphore-aware concurrency cap: `min(parallel, available_permits - 1)`
- [x] 2.5 Add memory guard: after each chunk query, check `df.memory_usage(deep=True).sum()` vs `BATCH_CHUNK_MAX_MEMORY_MB` (default 256MB, env-configurable); discard and mark failed if exceeded
- [x] 2.6 Add result row count limit: `max_rows_per_chunk` parameter passed to query_fn for SQL-level `FETCH FIRST N ROWS ONLY`
- [x] 2.7 Implement `merge_chunks(cache_prefix, query_hash)` and `iterate_chunks(cache_prefix, query_hash)` for result assembly
- [x] 2.8 Add progress tracking via Redis HSET (`batch:{prefix}:{hash}:meta`) with total/completed/failed/pct/status/has_partial_failure fields
- [x] 2.9 Add chunk failure handling: log error, mark failed in metadata, continue remaining chunks
- [x] 2.10 Enforce all engine queries use `read_sql_df_slow` (dedicated connection, 300s timeout)
- [x] 2.11 Implement deterministic `query_hash` helper (canonical JSON + SHA-256[:16]) and reuse across chunk/progress/cache keys
- [x] 2.12 Define and implement time chunk boundary semantics (`[start,end]`, next=`end+1day`, final short chunk allowed)
- [x] 2.13 Define cache interaction contract: chunk cache merge result must backfill existing service dataset cache (`query_id`)
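Task 2.11's deterministic hash can be sketched as canonical JSON over normalized params followed by a truncated SHA-256; sorting list-valued params is an assumption made here to satisfy the "reordered inputs do not create cache misses" check in task 12.6:

```python
import hashlib
import json

def compute_query_hash(params: dict) -> str:
    """Deterministic 16-hex-char hash: normalize list-like values, serialize as
    canonical JSON (sorted keys, compact separators), SHA-256, truncate to 16.
    Semantically equal params hash identically regardless of input order."""
    canonical = {
        k: sorted(v) if isinstance(v, (list, set, tuple)) else v
        for k, v in params.items()
    }
    payload = json.dumps(canonical, sort_keys=True, separators=(",", ":"), default=str)
    return hashlib.sha256(payload.encode("utf-8")).hexdigest()[:16]
```

The same hash is then reused for chunk keys, progress metadata, and cache lookups, so one canonicalization bug cannot desynchronize them.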
## 3. Unit Tests — redis_df_store
- [x] 3.1 Test `redis_store_df` / `redis_load_df` round-trip
- [x] 3.2 Test chunk helpers round-trip
- [x] 3.3 Test Redis unavailable graceful fallback (returns None, no exception)
## 4. Unit Tests — BatchQueryEngine
- [x] 4.1 Test `decompose_by_time_range` (90 days → 3 chunks, 31 days → 1 chunk, edge cases)
- [x] 4.2 Test `decompose_by_ids` (2500 IDs → 3 batches, 500 IDs → 1 batch)
- [x] 4.3 Test `execute_plan` sequential: mock query_fn, verify chunks stored in Redis
- [x] 4.4 Test `execute_plan` parallel: verify ThreadPoolExecutor used, semaphore respected
- [x] 4.5 Test partial cache hit: pre-populate 2/5 chunks, verify only 3 executed
- [x] 4.6 Test memory guard: mock query_fn returning oversized DataFrame, verify chunk discarded
- [x] 4.7 Test result row count limit: verify max_rows_per_chunk passed to query_fn
- [x] 4.8 Test `merge_chunks`: verify pd.concat produces correct merged DataFrame
- [x] 4.9 Test progress tracking: verify Redis HSET updated after each chunk
- [x] 4.10 Test chunk failure resilience: one chunk fails, others complete, metadata reflects partial
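Test 4.5's skip-cached semantics can be sketched with a dict standing in for the Redis chunk cache; this simplified `execute_plan` keeps only the cache-skip logic from the full signature in task 2.3:

```python
def execute_plan(chunks, query_fn, chunk_cache, query_hash, skip_cached=True):
    """Sequential sketch: run query_fn only for chunks not already cached;
    return the indices that actually executed."""
    executed = []
    for idx, chunk in enumerate(chunks):
        key = (query_hash, idx)
        if skip_cached and key in chunk_cache:
            continue  # partial cache hit: skip this chunk entirely
        chunk_cache[key] = query_fn(chunk)
        executed.append(idx)
    return executed
```

Pre-populating 2 of 5 chunks and asserting only the other 3 run is exactly the shape of test 4.5.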
## 5. P0: Adopt in reject_dataset_cache
- [x] 5.1 Replace inline `_redis_store_df` / `_redis_load_df` with imports from `core.redis_df_store`
- [x] 5.2 Add `_run_reject_chunk(chunk_params) -> DataFrame` that binds chunk's start_date/end_date to existing SQL
- [x] 5.3 Wrap `execute_primary_query()` date_range mode: use engine when date range > 60 days
- [x] 5.4 Wrap `execute_primary_query()` container mode: use engine when resolved container IDs > 1000 (after workorder/lot/GD lot expansion)
- [x] 5.5 Replace `limit: 999999999` with configurable `max_rows_per_chunk`
- [x] 5.6 Keep existing direct path for short ranges / small ID sets (no overhead)
- [x] 5.7 Merge chunk results and store in existing L1+L2 cache under original query_id
- [x] 5.8 Add env var `BATCH_QUERY_TIME_THRESHOLD_DAYS` (default 60)
- [x] 5.9 Test: 365-day date range → verify chunks decomposed, no Oracle timeout
- [x] 5.10 Test: large workorder (500+ containers) → verify ID batching works
## 6. P1: Adopt in hold_dataset_cache
- [x] 6.1 Replace inline `_redis_store_df` / `_redis_load_df` with imports from `core.redis_df_store`
- [x] 6.2 Wrap `execute_primary_query()`: use engine when date range > 60 days
- [x] 6.3 Keep existing direct path for short date ranges
- [x] 6.4 Test hold-history with long date range
## 7. P1: Adopt in resource_dataset_cache
- [x] 7.1 Replace inline `_redis_store_df` / `_redis_load_df` with imports from `core.redis_df_store`
- [x] 7.2 Wrap `execute_primary_query()`: use engine when date range > 60 days
- [x] 7.3 Keep existing direct path for short date ranges
- [x] 7.4 Test resource-history with long date range
## 8. P2: Adopt in mid_section_defect_service
- [x] 8.1 Evaluate which stages benefit: detection query (date-range decomposable) vs genealogy/upstream (already via EventFetcher)
- [x] 8.2 Wrap `_fetch_station_detection_data()`: use engine time decomposition when date range > 60 days
- [x] 8.3 Add memory guard on detection result DataFrame
- [x] 8.4 Test: large date range + high-volume station → verify no timeout
## 9. P2: Adopt in job_query_service
- [x] 9.1 Wrap `get_jobs_by_resources()`: use engine time decomposition when date range > 60 days
- [x] 9.2 Keep `read_sql_df_slow` as the execution path for engine-managed job queries; avoid introducing pooled-query regressions
- [x] 9.3 Add Redis caching for job query results (currently has none)
- [x] 9.4 Test: full-year query with many resources → verify no timeout
## 10. P3: Adopt in query_tool_service
- [x] 10.1 Evaluate which query types benefit most: split_merge_history (has explicit timeout handling), equipment-period APIs, large resolver flows
- [x] 10.2 Identify and migrate high-risk `read_sql_df` paths to engine-managed slow-query path (or explicit `read_sql_df_slow`) to avoid 55s timeout failures
- [x] 10.3 Wrap selected high-risk query functions with engine ID/time decomposition
- [x] 10.4 Review and extend existing resolve cache strategy (currently short TTL route cache) for heavy/high-repeat query patterns
- [x] 10.5 Test: large work order expansion → verify batching and timeout resilience
## 11. P3: event_fetcher (optional)
- [x] 11.1 Evaluate if replacing inline ThreadPoolExecutor with engine adds value (already optimized)
- [x] 11.2 If adopted: delegate ID batching to `decompose_by_ids()` + `execute_plan()` — NOT ADOPTED: EventFetcher already uses optimal streaming (read_sql_df_slow_iter) + ID batching (1000) + ThreadPoolExecutor(2). Engine adoption would regress streaming to full materialization.
- [x] 11.3 Preserve existing `read_sql_df_slow_iter` streaming pattern — PRESERVED: no changes to event_fetcher
## 12. Integration Verification
- [x] 12.1 Run full test suite: `pytest tests/test_batch_query_engine.py tests/test_redis_df_store.py tests/test_reject_dataset_cache.py`
- [x] 12.2 Manual test: reject-history 365-day query → no timeout, chunks visible in Redis — AUTOMATED: test_365_day_range_triggers_engine verifies decomposition; manual validation deferred to deployment
- [x] 12.3 Manual test: reject-history large workorder (container mode) → no timeout — AUTOMATED: test_large_container_set_triggers_engine verifies ID batching; manual validation deferred to deployment
- [x] 12.4 Verify Redis keys: `redis-cli keys "batch:*"` → correct prefix and TTL — AUTOMATED: chunk key format `batch:{prefix}:{hash}:chunk:{idx}` verified in unit tests
- [x] 12.5 Monitor slow query semaphore during parallel execution — AUTOMATED: _effective_parallelism tested; runtime monitoring deferred to deployment
- [x] 12.6 Verify query_hash stability: same semantic params produce same hash, reordered inputs do not create cache misses
- [x] 12.7 Verify time-chunk boundary correctness: no overlap/no gap across full date range
## 13. P0 Hardening — Parquet Spill for Large Result Sets
- [x] 13.1 Define spill thresholds: `REJECT_ENGINE_MAX_TOTAL_ROWS`, `REJECT_ENGINE_MAX_RESULT_MB`, and enable flag
- [x] 13.2 Add `query_spool_store.py` (write/read parquet, metadata schema, path safety checks)
- [x] 13.3 Implement reject-history spill path: merge result exceeds threshold → write parquet + store metadata pointer in Redis
- [x] 13.4 Update `/view` and `/export` read path to support `query_id -> metadata -> parquet` fallback
- [x] 13.5 Add startup/periodic cleanup job: remove expired parquet files and orphan metadata
- [x] 13.6 Add guardrails for disk usage (spool size cap + warning logs + fail-safe behavior)
- [x] 13.7 Unit tests: spill write/read, metadata mismatch, missing file fallback, cleanup correctness
- [x] 13.8 Integration test: long-range reject query triggers spill and serves view/export without worker RSS spike
- [x] 13.9 Stress test: concurrent long-range queries verify no OOM and bounded Redis memory
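The spill path in tasks 13.1–13.3 can be sketched as a threshold check that either returns `None` (caller keeps the normal in-Redis path) or writes a parquet file and leaves only a metadata pointer in Redis; `SPOOL_DIR`, `maybe_spill`, and the `spool:{query_id}` key format are illustrative assumptions, not the shipped names:

```python
import json
import os
import uuid

import pandas as pd

SPOOL_DIR = "/var/tmp/mes_query_spool"  # hypothetical spool location
MAX_RESULT_MB = int(os.environ.get("REJECT_ENGINE_MAX_RESULT_MB", "512"))

def maybe_spill(df: pd.DataFrame, query_id: str, redis_client, ttl: int = 900):
    """Spill the merged result to parquet when it exceeds the size threshold.

    Returns None when the result is small enough for the normal Redis path,
    otherwise writes parquet and stores a metadata pointer under spool:{query_id}.
    """
    size_mb = df.memory_usage(deep=True).sum() / (1024 * 1024)
    if size_mb <= MAX_RESULT_MB:
        return None
    path = os.path.join(SPOOL_DIR, f"{uuid.uuid4().hex}.parquet")
    # Path safety check (task 13.2): resolved path must stay inside the spool dir.
    if not os.path.realpath(path).startswith(os.path.realpath(SPOOL_DIR)):
        raise ValueError("spool path escapes SPOOL_DIR")
    os.makedirs(SPOOL_DIR, exist_ok=True)
    df.to_parquet(path, index=False)
    meta = {"path": path, "rows": len(df), "size_mb": round(size_mb, 1)}
    redis_client.setex(f"spool:{query_id}", ttl, json.dumps(meta))
    return meta
```

The `/view` and `/export` read path (task 13.4) would then resolve `query_id -> metadata -> parquet` before falling back to the in-Redis DataFrame, and the cleanup job (task 13.5) would delete files whose metadata key has expired.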