feat: harden long-range batch queries with redis+parquet caching

This commit is contained in:
egg
2026-03-02 21:04:18 +08:00
parent 2568fd836c
commit fb92579331
40 changed files with 5443 additions and 676 deletions


@@ -0,0 +1,2 @@
schema: spec-driven
created: 2026-03-02


@@ -0,0 +1,166 @@
## Context
Six services currently handle large queries independently, with no unified protection:
| Service | Query type | Existing protection | Gap |
|------|---------|---------|------|
| reject-history | date + workorder/Lot/GD expansion | L1+L2 cache, `read_sql_df_slow` | no memory guard, `limit=999999999`, no chunked queries |
| hold-history | date | L1+L2 cache, `read_sql_df_slow` | no memory guard, no time chunking |
| resource-history | date + equipment ID | L1+L2 cache, 1000-row batching | no memory guard |
| mid-section-defect | date → detection → lineage → upstream | Redis cache, EventFetcher batching | no cap on detection count |
| job-query | date + equipment ID | 1000-row batching, `read_sql_df_slow` | **no result cache**, no time chunking |
| query-tool | multiple resolvers → container ID | input-count limit, short-TTL resolve-route cache, EventFetcher cache | most queries still use `read_sql_df` (55s timeout); no unified chunk orchestration |
Reference implementations:
- `EventFetcher`: batch 1000 + ThreadPoolExecutor(2) + `read_sql_df_slow_iter` streaming + Redis cache — **already the best practice**
- `LineageEngine`: batch 1000 + depth limit 20 — **dedicated lineage engine**
Goal: build a shared `BatchQueryEngine` module so that any service that adopts it gets the full set of protections.
## Goals / Non-Goals
**Goals:**
- Unify parquet-in-Redis access into a shared module (removing 3 duplicated copies)
- Provide time-range decomposition (long date ranges → ~31-day monthly chunks)
- Provide ID-batch decomposition (large container ID sets after workorder/Lot/GD expansion → batches of 1000)
- Memory guard: check memory_usage on each chunk result; abort when over threshold
- Result row limit: configurable cap; truncate and flag when exceeded
- Controlled parallelism (sequential by default, opt-in parallel, semaphore-aware)
- Redis chunk cache with partial hits
- Use `read_sql_df_slow` uniformly (300-second dedicated connection)
- Define query_hash and chunk-boundary semantics to avoid inconsistent behavior across services
- Define how the chunk cache interacts with each service's L1/L2 dataset cache
**Non-Goals:**
- No changes to the SQL statements themselves
- No new external dependencies
- No changes to the frontend API surface (transparent to the frontend)
- No replacement of EventFetcher / LineageEngine (each is already optimized; the engine offers optional integration points)
- No changes to the RQ async architecture of trace_job_service
## Decisions
### Decision 1: Extract a shared `redis_df_store.py` module
**Choice**: extract the identical `_redis_store_df` / `_redis_load_df` helpers from reject/hold/resource_dataset_cache into `src/mes_dashboard/core/redis_df_store.py`.
**Alternative**: (A) keep per-service copies → already duplicated in 3 places; hard to maintain.
**Rationale**: parquet-in-Redis is a DataFrame serialization utility; caching policy (TTL, LRU) belongs to a different layer.
### Decision 2: `BatchQueryEngine` as a utility module, not a base class
**Choice**: provide standalone functions (`decompose_by_time_range`, `decompose_by_ids`, `execute_plan`, `merge_chunks`) that services call as needed.
**Alternative**: (A) an abstract base class `BaseDatasetCache` → the three dataset caches differ substantially (SQL, policy filters, derived computations); forced inheritance would over-couple them.
**Rationale**: the utility-module pattern lets services keep their current structure and decide on the main query path whether to enable decomposition. Queries below the threshold bypass the engine entirely.
### Decision 3: Sequential by default, opt-in parallel, semaphore-aware
**Choice**: `execute_plan(parallel=1)` defaults to sequential execution. The effective parallelism cap is `min(requested, semaphore_available - 1)`.
**Alternatives**: (A) parallel by default → risks exhausting the semaphore; (B) never parallel → gives up speed.
**Rationale**: Oracle connections are scarce (production defaults to `DB_SLOW_MAX_CONCURRENT=5`; development is commonly 3). reject_dataset_cache carries the heaviest queries and may set parallel=2; everything else defaults to sequential, which is safest.
### Decision 4: Memory guard + result row limit
**Choice**: after each chunk query, check `df.memory_usage(deep=True).sum()`; if it exceeds `BATCH_CHUNK_MAX_MEMORY_MB` (default 256MB), abort that chunk and mark it failed. Also provide a `max_rows_per_chunk` parameter that adds `FETCH FIRST N ROWS ONLY` to the SQL.
**Alternatives**: (A) no limit → the status quo, with high OOM risk; (B) a single global limit → not flexible enough.
**Rationale**: the chunk-level memory guard is the last line of defense. After decomposition, each chunk's date/ID range is already much smaller; exceeding the memory limit usually signals abnormal data, so aborting beats continuing.
### Decision 5: Chunk cache with partial hits
**Choice**: Redis key `batch:{prefix}:{hash}:chunk:{idx}`, each chunk stored via its own SETEX.
**Alternative**: (A) cache only the final result → no partial hits possible.
**Rationale**: a common user pattern is "query Jan–Jun first, then Jan–Aug". Chunk caching lets the first six months be reused, querying only Jul–Aug.
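The partial-hit planning step can be sketched as follows, using any mapping as the cache (Redis in the real engine; a plain dict here for illustration — the function name is hypothetical):

```python
def plan_with_cache(prefix: str, query_hash: str, chunks: list, cache) -> tuple:
    """Split a chunk plan into (cached, to_run) using the
    batch:{prefix}:{hash}:chunk:{idx} key convention. Only `to_run`
    chunks are passed to the query function."""
    cached, to_run = [], []
    for idx, chunk in enumerate(chunks):
        key = f"batch:{prefix}:{query_hash}:chunk:{idx}"
        (cached if key in cache else to_run).append((idx, chunk))
    return cached, to_run
```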
### Decision 6: Engine path uses the slow-query path exclusively (without occupying the main pool)
**Choice**: all engine-managed queries use the slow-query path (300s timeout, semaphore-controlled); existing short-query paths that bypass the engine stay unchanged.
Slow-query execution uses a two-tier strategy:
1. Primary path: checkout/checkin against the existing dedicated `SLOW POOL` (small capacity).
2. Fallback: when the SLOW POOL is unavailable, degrade to a slow direct connection.
**Alternatives**:
(A) Mixing `read_sql_df` (main pool, 55s timeout) into the engine path → high timeout risk for long queries, and regular API throughput suffers.
(B) Running slow queries on the shared main pool → pool contention and amplified overall latency at peak.
**Rationale**: queries routed through the engine are by definition "known to be potentially slow". Isolating them from the main pool prevents mutual interference; the SLOW POOL provides connection reuse and isolation at once, and the fallback direct connection preserves availability.
### Decision 7: Partial-failure handling
**Choice**: when a chunk fails, log the error and continue with the remaining chunks. `merge_chunks()` returns the successful portion, with metadata flagged `has_partial_failure=True`.
**Alternative**: (A) roll everything back → wastes chunks that already succeeded.
**Rationale**: for historical reporting, partial results beat total failure. The metadata flag lets each service decide whether to warn the user.
### Decision 8: Interaction between chunk cache and service L1/L2 dataset caches
**Choice**: read the chunk cache (Redis) first, assemble the result, then backfill the existing service dataset cache (L1 process + L2 Redis) to preserve the current `/view` path and `query_id` behavior.
**Alternative**: (A) use only the chunk cache without backfilling the service cache → the existing view/query_id flow breaks or re-queries.
**Rationale**: the existing two-phase dataset API (primary query + cached view) must remain compatible; the chunk cache is an engine-level optimization and must not break the service-level interface.
### Decision 9: query_hash specification
**Choice**: query_hash is the first 16 hex characters of the SHA-256 of a canonical JSON form (sorted keys, stable list ordering, normalized strings); the hash covers only parameters that affect the raw dataset (presentation-only parameters are excluded).
**Alternative**: (A) let each service implement its own hash → unpredictable across services and hard to debug.
**Rationale**: chunk keys, progress keys, and merge keys must be reproducible; otherwise cache hits and partial reuse cannot be guaranteed.
### Decision 10: Time-decomposition boundary semantics
**Choice**: chunks use closed intervals `[chunk_start, chunk_end]`; the next chunk starts at `chunk_end + 1 day`; the final chunk may be shorter than grain_days. Input dates follow each service's existing timezone/day-boundary conventions; the engine does not reinterpret timezones.
**Alternative**: (A) half-open intervals, or dynamic per-month splitting with undefined boundaries → prone to overlaps or gaps.
**Rationale**: with fixed boundary semantics, merge deduplication, statistical consistency, and test verifiability all improve.
### Decision 11: Spill large results to Parquet; Redis keeps only metadata and hot cache
**Choice**: introduce spill-to-disk for long queries (especially reject-history):
1. Chunk queries and the chunk cache stay as-is (Redis, short TTL).
2. After merging, if the result exceeds a threshold (rows / memory / serialized size), write it as Parquet to a local spool directory.
3. Redis stores only metadata (query_id, file_path, row_count, schema_hash, created_at, expires_at).
4. `/view`/`/export` read the parquet via metadata first; when no metadata exists, fall back to the current cache behavior.
5. A background cleaner periodically removes expired parquet files and orphan metadata.
**Alternatives**:
(A) Keep all results fully in Redis (status quo) → high memory pressure, prone to cascading lock timeout/OOM.
(B) Write directly to a DB (e.g. SQLite) → write-lock contention and operational complexity (`database is locked` has already been observed).
**Rationale**: Redis is an in-memory cache and unsuited to holding large results for long; Parquet spilling moves them to disk, lowering worker/Redis memory peaks.
## Risks / Trade-offs
**[Redis memory growth]** → chunk caching adds keys (365 days ≈ 12 chunk keys).
→ Mitigation: TTL auto-expiry (900s); chunk results are parquet-compressed (typically ~10:1).
**[Semaphore contention]** → parallel chunks consume more permits.
→ Mitigation: detect available permits and degrade to sequential automatically when scarce. Default parallel=1.
**[Data consistency across time chunks]** → different monthly chunks are queried at different moments.
→ Mitigation: historical reporting data updates infrequently (daily), so drift within a short window is minimal. Acceptable.
**[Migration risk]** → modifying the 3 dataset caches first and then the other services is still a large overall scope.
→ Mitigation: threshold gating (short queries bypass the engine) + phased P0/P1/P2/P3 rollout + independent validation per phase.
**[Disk I/O and capacity pressure]** → Parquet spilling adds disk reads/writes; a failed cleanup policy could accumulate files.
→ Mitigation: spool capacity cap, TTL cleanup, orphan scan at startup, and a "no-spill, summary-only response" protection mode when over capacity.
**[Stale metadata / orphan files]** → Redis metadata and physical files can diverge.
→ Mitigation: verify file existence and schema hash before reads; on mismatch, automatically invalidate the metadata and log an alert.
## Open Questions
1. In the 4-stage pipeline of `mid_section_defect_service` (detection → lineage → upstream history → attribution), which stages should adopt the engine? The detection query can be date-decomposed, but lineage/upstream already go through EventFetcher.
2. `query_tool_service` has 15+ query types. Adopt the engine for all of them, or only the most timeout-prone ones (split_merge_history, equipment_period)?


@@ -0,0 +1,83 @@
## Why
The historical reporting services (reject-history, hold-history, resource-history), the query tool (query-tool), mid-section defect analysis (mid-section-defect), and job query (job-query) each implement their own batch-query, caching, and parallel-execution patterns, with no unified orchestration or protection. The main problems:
1. **Oracle timeouts**: queries over long date ranges (365+ days) or large container ID sets (thousands after workorder expansion) can exceed the 300-second call_timeout.
2. **OOM risk**: the reject/hold dataset caches fetch everything with `limit: 999999999` and have no memory guard.
3. **Scattered protection**: `EventFetcher` already has ID batching + caching, but the reject/hold/resource dataset caches each maintain their own query and cache strategies.
4. **Duplicated code**: 3 dataset caches each copy the same parquet-in-Redis serialization logic.
5. **ID-expansion blowup**: workorder resolution can expand into very large container ID sets, with no consistent cross-service batching/merging flow.
6. **Expensive re-queries**: extending a query range (e.g. Jan–Jun to Jan–Aug) cannot reuse already-queried segments.
7. **High timeout risk in query-tool**: most queries still use `read_sql_df` (main pool / 55s timeout) and time out easily on large queries.
We need a **stable, reusable query-engine module** that gives any adopting service decomposition, caching, memory protection, and timeout protection automatically.
## What Changes
- Add a shared `BatchQueryEngine` module providing:
- **Time-range decomposition**: long date ranges → ~31-day monthly chunks, each queried independently
- **Time-decomposition semantics**: explicitly defined chunk boundaries (closed intervals), month-crossing splits, and short final chunks
- **ID-batch decomposition**: large ID sets (after workorder/Lot/GD Lot/serial expansion) → batches of 1000
- **query_hash specification**: unified canonicalization and hashed fields, keeping chunk/cache keys stable
- **Memory guard**: check `DataFrame.memory_usage()` per chunk result; abort and warn over threshold
- **Result row limit**: configurable maximum row count; truncate and flag when exceeded
- **Controlled parallel execution**: sequential by default, opt-in parallel, strictly honoring the slow-query semaphore
- **Redis chunk cache**: each chunk cached independently, with partial hits (reuse of already-queried intervals when a range is extended)
- **Cache-layer interaction**: explicitly defined read/write order between the chunk cache and each service's existing L1/L2 dataset cache
- **Progress tracking**: progress recorded via Redis HSET, available for frontend display
- Add a **large-result spill layer (Parquet spill)**:
- When a long query's result exceeds memory/row thresholds, write the merged result as Parquet to a persistent local directory (e.g. `tmp/query_spool/`)
- Redis keeps only metadata (query_id → parquet path / schema / rows / created_at / ttl)
- `/view` and `/export` read via Redis metadata + Parquet first, so whole DataFrames no longer stay resident in worker RAM
- Scheduled cleanup (TTL + background cleaner) removes expired parquet files to prevent unbounded disk growth
- Add a shared `redis_df_store` module, extracting the parquet-in-Redis access logic from the 3 dataset caches into a shared utility
- All **engine-managed chunk queries** uniformly use the slow path (300-second-class timeout):
- Reuse slow-query connections via the existing **dedicated SLOW POOL (small capacity)**
- Explicitly **never use the main query pool** for slow queries, to avoid dragging down regular APIs
- When the SLOW POOL is unavailable, degrade to a slow direct connection (no impact on the main pool)
## Capabilities
### New Capabilities
- `batch-query-engine`: unified batch-query engine module covering decomposition strategies (time/ID), memory guard, result limits, controlled execution, Redis chunk cache, progress tracking, and result merging
### Modified Capabilities
- `reject-history-api`: primary query runs through the engine; date_range mode gets automatic time decomposition; container mode (after workorder/Lot/GD Lot expansion) gets automatic ID batching
- `hold-dataset-cache`: primary query runs through the engine, with automatic decomposition for long date ranges
- `resource-dataset-cache`: primary query runs through the engine, with automatic decomposition for long date ranges
- `event-fetcher-unified`: keeps its existing optimizations (batch + streaming + cache); engine adoption is reconsidered only if a unified monitoring/progress model is needed
## Impact
- **Backend**: 2 new shared modules (`batch_query_engine.py`, `redis_df_store.py`); the 3 dataset-cache primary query paths (reject/hold/resource) are modified first
- **Affected services** (by priority):
- P0: reject-history (most prone to timeout/OOM — long date ranges + workorder expansion + the current `limit=999999999`)
- P1: hold-history, resource-history (same architecture, direct reuse)
- P2: mid-section-defect (4-stage pipeline; detection query + upstream history), job-query (no cache + needs date decomposition)
- P3: query-tool (migrate the high-risk `read_sql_df` paths first and add slow-query protection), event-fetcher (optional)
- **Database**: no SQL changes; only the bind-parameter range per query shrinks
- **Database connection strategy**: slow queries are isolated from regular pooled queries to avoid mutual interference
- **Redis**: new chunk-cache keys under the `batch:*` prefix
- **Storage layer**: new Parquet spill directory and cleanup mechanism; Redis becomes an index/metadata store instead of holding entire large results
- **Memory**: the engine enforces a per-chunk memory cap (default 256MB; chunks over the cap are aborted)
- **Availability**: even with Redis `maxmemory` + eviction, results remain recoverable via Parquet metadata (a cache miss no longer means data loss)
- **Backward compatibility**: short queries (< 60 days, < 1000 IDs) take the existing path with zero extra overhead; existing route/event cache strategies are unchanged
- **Frontend**: optional change (long queries may show a progress bar; not required)
## Expected Effects and Side Effects of Parquet Spilling
**Expected effects:**
- Greatly reduce peak memory during the worker merge + cache-backfill phase (no more single-worker spikes into the GB range)
- Redis memory shifts from "storing full datasets" to "storing index/hot data", reducing the risk of cascading OOM and lock timeouts
- After a service restart, results remain recoverable as long as the parquet has not expired (via metadata)
**Possible side effects:**
- Increased disk I/O: parquet read/write spikes during query peaks
- Disk capacity risk: if the cleanup policy fails, the spool directory can keep growing
- Data-consistency risk: metadata becomes a stale pointer if the file it references is deleted or corrupted externally
- Security and governance: spilled files need access control, backup/cleanup, and audit policies
**Mitigations:**
- Enforce TTL + periodic scan cleanup, double-checking both metadata and file mtime
- Run orphan/stale checks at startup, auto-repairing metadata or deleting orphan files
- Start with reject-history long queries as P0, then expand gradually to other services


@@ -0,0 +1,166 @@
## ADDED Requirements
### Requirement: BatchQueryEngine SHALL provide time-range decomposition
The module SHALL decompose long date ranges into manageable monthly chunks to prevent Oracle timeout.
#### Scenario: Decompose date range into monthly chunks
- **WHEN** `decompose_by_time_range(start_date, end_date, grain_days=31)` is called
- **THEN** the date range SHALL be split into chunks of at most `grain_days` days each
- **THEN** each chunk SHALL contain `chunk_start` and `chunk_end` date strings
- **THEN** chunks SHALL be contiguous and non-overlapping, covering the full range
#### Scenario: Short date range returns single chunk
- **WHEN** the date range is shorter than or equal to `grain_days`
- **THEN** a single chunk covering the full range SHALL be returned
#### Scenario: Time-chunk boundary semantics are deterministic
- **WHEN** a date range is decomposed into multiple chunks
- **THEN** each chunk SHALL use a closed interval `[chunk_start, chunk_end]`
- **THEN** the next chunk SHALL start at `previous_chunk_end + 1 day`
- **THEN** the final chunk MAY contain fewer than `grain_days` days
- **THEN** chunk ranges SHALL have no overlap and no gap
### Requirement: BatchQueryEngine SHALL provide ID-batch decomposition
The module SHALL decompose large ID lists (from workorder/lot/GD lot/serial resolve expansion) into batches respecting Oracle IN-clause limits.
#### Scenario: Decompose ID list into batches
- **WHEN** `decompose_by_ids(ids, batch_size=1000)` is called with more than `batch_size` IDs
- **THEN** the ID list SHALL be split into batches of at most `batch_size` items each
#### Scenario: Small ID list returns single batch
- **WHEN** the ID list has fewer than or equal to `batch_size` items
- **THEN** a single batch containing all IDs SHALL be returned
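The batching behavior specified above is a simple slicing rule; a minimal sketch (function name matches the requirement; implementation is illustrative):

```python
def decompose_by_ids(ids: list, batch_size: int = 1000) -> list:
    """Split an ID list into order-preserving batches of at most
    batch_size items each (respecting the Oracle IN-clause limit)."""
    return [ids[i:i + batch_size] for i in range(0, len(ids), batch_size)]
```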
### Requirement: BatchQueryEngine SHALL execute chunk plans with controlled parallelism
The module SHALL execute query chunks sequentially by default, with opt-in parallel execution respecting the slow query semaphore.
#### Scenario: Sequential execution (default)
- **WHEN** `execute_plan(chunks, query_fn, parallel=1)` is called
- **THEN** chunks SHALL be executed one at a time in order
- **THEN** each chunk result SHALL be stored to Redis immediately after completion
- **THEN** the function SHALL return a `query_hash` identifying the batch result
#### Scenario: Parallel execution with semaphore awareness
- **WHEN** `execute_plan(chunks, query_fn, parallel=2)` is called
- **THEN** up to `parallel` chunks SHALL execute concurrently via ThreadPoolExecutor
- **THEN** each thread SHALL acquire the slow query semaphore before executing `query_fn`
- **THEN** actual concurrency SHALL be capped at `min(parallel, available_semaphore_permits - 1)`
- **THEN** if semaphore is fully occupied, execution SHALL degrade to sequential
#### Scenario: All engine queries use dedicated connection
- **WHEN** a chunk's `query_fn` executes an Oracle query
- **THEN** it SHALL use `read_sql_df_slow` (dedicated connection, 300s timeout, semaphore-controlled)
- **THEN** pooled connection (`read_sql_df`) SHALL NOT be used for engine-managed queries
### Requirement: BatchQueryEngine SHALL enforce memory guards per chunk
The module SHALL check each chunk result's memory usage and abort if it exceeds a configurable threshold.
#### Scenario: Chunk memory within limit
- **WHEN** a chunk query returns a DataFrame within `BATCH_CHUNK_MAX_MEMORY_MB` (default 256MB, env-configurable)
- **THEN** the chunk SHALL be stored to Redis and marked as completed
#### Scenario: Chunk memory exceeds limit
- **WHEN** a chunk query returns a DataFrame exceeding `BATCH_CHUNK_MAX_MEMORY_MB`
- **THEN** the chunk SHALL be discarded (NOT stored to Redis)
- **THEN** the chunk SHALL be marked as failed in metadata with reason `memory_limit_exceeded`
- **THEN** a warning log SHALL include chunk index, actual memory MB, and threshold
- **THEN** remaining chunks SHALL continue execution
#### Scenario: Result row count limit
- **WHEN** `max_rows_per_chunk` is configured
- **THEN** the engine SHALL pass this limit to `query_fn` for SQL-level truncation (e.g., `FETCH FIRST N ROWS ONLY`)
- **THEN** if the result contains exactly `max_rows_per_chunk` rows, metadata SHALL include `truncated=True`
### Requirement: BatchQueryEngine SHALL support partial cache hits
The module SHALL check Redis for previously cached chunks and skip re-execution for cached chunks.
#### Scenario: Partial cache hit skips cached chunks
- **WHEN** `execute_plan(chunks, query_fn, skip_cached=True)` is called
- **THEN** for each chunk, Redis SHALL be checked for an existing cached result
- **THEN** chunks with valid cached results SHALL NOT be re-executed
- **THEN** only uncached chunks SHALL be passed to `query_fn`
#### Scenario: Full cache hit skips all execution
- **WHEN** all chunks already exist in Redis cache
- **THEN** no Oracle queries SHALL be executed
- **THEN** `merge_chunks()` SHALL return the combined cached DataFrames
### Requirement: BatchQueryEngine SHALL generate deterministic query_hash
The module SHALL use a stable hash for cache/progress keys so semantically identical queries map to the same batch identity.
#### Scenario: Stable hash for equivalent parameters
- **WHEN** two requests contain the same semantic query parameters in different input order
- **THEN** canonicalization SHALL normalize ordering before hashing
- **THEN** `query_hash` SHALL be identical for both requests
#### Scenario: Hash changes only when dataset-affecting parameters change
- **WHEN** parameters affecting the raw dataset (date range, mode, resolved IDs, core filters) change
- **THEN** `query_hash` SHALL change
- **THEN** presentation-only parameters SHALL NOT change `query_hash`
### Requirement: BatchQueryEngine SHALL define chunk-cache to service-cache handoff
The module SHALL integrate chunk-level cache with existing service-level dataset caches without breaking query_id-based view APIs.
#### Scenario: Chunk merge backfills service dataset cache
- **WHEN** chunk results are loaded/merged into a complete dataset for a primary query
- **THEN** the merged DataFrame SHALL be written back to the service's existing dataset cache layers (L1 process + L2 Redis)
- **THEN** downstream `/view` queries using the service `query_id` SHALL continue to work without additional Oracle queries
#### Scenario: Service cache miss with chunk cache hit
- **WHEN** a service-level dataset cache entry has expired but relevant chunk cache keys still exist
- **THEN** the engine SHALL rebuild the merged dataset from chunk cache
- **THEN** the service dataset cache SHALL be repopulated before returning response
### Requirement: BatchQueryEngine SHALL store chunk results in Redis
The module SHALL store each chunk as a separate Redis key using parquet-in-Redis format.
#### Scenario: Chunk storage key format
- **WHEN** a chunk result is stored
- **THEN** the Redis key SHALL follow the pattern `batch:{cache_prefix}:{query_hash}:chunk:{idx}`
- **THEN** each chunk SHALL be stored as a parquet-encoded base64 string via `redis_df_store`
- **THEN** each chunk key SHALL have a TTL matching the service's cache TTL (default 900 seconds)
#### Scenario: Chunk metadata tracking
- **WHEN** chunks are being executed
- **THEN** a metadata key `batch:{cache_prefix}:{query_hash}:meta` SHALL be updated via Redis HSET
- **THEN** metadata SHALL include `total`, `completed`, `failed`, `pct`, `status`, and `has_partial_failure` fields
### Requirement: BatchQueryEngine SHALL merge chunk results into a single DataFrame
The module SHALL provide result assembly from cached chunks.
#### Scenario: Merge all chunks
- **WHEN** `merge_chunks(query_hash)` is called
- **THEN** all chunk DataFrames SHALL be loaded from Redis and concatenated via `pd.concat`
- **THEN** if any chunk is missing, the merge SHALL proceed with available chunks and set `has_partial_failure=True`
#### Scenario: Iterate chunks for streaming
- **WHEN** `iterate_chunks(query_hash)` is called
- **THEN** chunk DataFrames SHALL be yielded one at a time without loading all into memory simultaneously
### Requirement: BatchQueryEngine SHALL handle chunk failures gracefully
The module SHALL continue execution when individual chunks fail and report partial results.
#### Scenario: Single chunk failure
- **WHEN** a chunk's `query_fn` raises an exception (timeout, ORA error, etc.)
- **THEN** the error SHALL be logged with chunk index and exception details
- **THEN** the failed chunk SHALL be marked as failed in metadata
- **THEN** remaining chunks SHALL continue execution
#### Scenario: All chunks fail
- **WHEN** all chunks' `query_fn` calls raise exceptions
- **THEN** metadata status SHALL be set to `failed`
- **THEN** `merge_chunks()` SHALL return an empty DataFrame
### Requirement: Shared redis_df_store module SHALL provide parquet-in-Redis utilities
The module SHALL provide reusable DataFrame serialization to/from Redis using parquet + base64 encoding.
#### Scenario: Store DataFrame to Redis
- **WHEN** `redis_store_df(key, df, ttl)` is called
- **THEN** the DataFrame SHALL be serialized to parquet format using pyarrow
- **THEN** the parquet bytes SHALL be base64-encoded and stored via Redis SETEX with the given TTL
- **THEN** if Redis is unavailable, the function SHALL log a warning and return without error
#### Scenario: Load DataFrame from Redis
- **WHEN** `redis_load_df(key)` is called
- **THEN** the base64 string SHALL be loaded from Redis, decoded, and deserialized to a DataFrame
- **THEN** if the key does not exist or Redis is unavailable, the function SHALL return None


@@ -0,0 +1,37 @@
## MODIFIED Requirements
### Requirement: EventFetcher SHALL provide unified cached event querying across domains
`EventFetcher` SHALL encapsulate batch event queries with L1/L2 layered cache and rate limit bucket configuration, supporting domains: `history`, `materials`, `rejects`, `holds`, `jobs`, `upstream_history`, `downstream_rejects`. EventFetcher MAY optionally delegate ID batching to `BatchQueryEngine` for consistent decomposition patterns.
#### Scenario: Cache miss for event domain query
- **WHEN** `EventFetcher` is called for a domain with container IDs and no cache exists
- **THEN** the domain query SHALL execute against Oracle via `read_sql_df_slow()` (non-pooled dedicated connection)
- **THEN** each batch query SHALL use `timeout_seconds=60`
- **THEN** the result SHALL be stored in L2 Redis cache with key format `evt:{domain}:{sorted_cids_hash}` if CID count is within cache threshold
- **THEN** L1 memory cache SHALL also be populated if CID count is within cache threshold
#### Scenario: Cache hit for event domain query
- **WHEN** `EventFetcher` is called for a domain and L2 Redis cache contains a valid entry
- **THEN** the cached result SHALL be returned without executing Oracle query
- **THEN** DB connection pool SHALL NOT be consumed
#### Scenario: Rate limit bucket per domain
- **WHEN** `EventFetcher` is used from a route handler
- **THEN** each domain SHALL have a configurable rate limit bucket aligned with `configured_rate_limit()` pattern
- **THEN** rate limit configuration SHALL be overridable via environment variables
#### Scenario: Large CID set exceeds cache threshold
- **WHEN** the normalized CID count exceeds `CACHE_SKIP_CID_THRESHOLD` (default 10000, env: `EVENT_FETCHER_CACHE_SKIP_CID_THRESHOLD`)
- **THEN** EventFetcher SHALL skip both L1 and L2 cache writes
- **THEN** a warning log SHALL be emitted with domain name, CID count, and threshold value
- **THEN** the query result SHALL still be returned to the caller
#### Scenario: Batch concurrency default
- **WHEN** EventFetcher processes batches for a domain with >1000 CIDs
- **THEN** the default `EVENT_FETCHER_MAX_WORKERS` SHALL be 2 (env: `EVENT_FETCHER_MAX_WORKERS`)
#### Scenario: Optional BatchQueryEngine integration
- **WHEN** EventFetcher is refactored to use `BatchQueryEngine` (optional, not required)
- **THEN** `decompose_by_ids()` MAY replace inline batching logic
- **THEN** existing ThreadPoolExecutor + read_sql_df_slow_iter patterns SHALL be preserved as the primary implementation
- **THEN** no behavioral changes SHALL be introduced by engine integration


@@ -0,0 +1,34 @@
## MODIFIED Requirements
### Requirement: Hold dataset cache SHALL execute a single Oracle query and cache the result
The hold_dataset_cache module SHALL query Oracle once for the full hold/release fact set and cache it for subsequent derivations. For date ranges exceeding 60 days, the query SHALL be decomposed into monthly chunks via `BatchQueryEngine`.
#### Scenario: Primary query execution and caching
- **WHEN** `execute_primary_query()` is called with date range and hold_type parameters
- **THEN** a deterministic `query_id` SHALL be computed from the primary params (start_date, end_date) using SHA256
- **THEN** if a cached DataFrame exists for this query_id (L1 or L2), it SHALL be used without querying Oracle
- **THEN** if no cache exists, a single Oracle query SHALL fetch all hold/release records from `DW_MES_HOLDRELEASEHISTORY` for the date range (all hold_types)
- **THEN** the result DataFrame SHALL be stored in both L1 (ProcessLevelCache) and L2 (Redis as parquet/base64)
- **THEN** the response SHALL include `query_id`, trend, reason_pareto, duration, and list page 1
#### Scenario: Long date range triggers batch decomposition
- **WHEN** the date range exceeds 60 days (configurable via `BATCH_QUERY_TIME_THRESHOLD_DAYS`)
- **THEN** the query SHALL be decomposed into ~31-day monthly chunks via `BatchQueryEngine.decompose_by_time_range()`
- **THEN** each chunk SHALL execute independently via `read_sql_df_slow` with the chunk's date sub-range
- **THEN** chunk results SHALL be stored individually in Redis and merged via `pd.concat`
- **THEN** the merged DataFrame SHALL be stored in the existing L1+L2 cache under the original query_id
#### Scenario: Short date range uses direct query
- **WHEN** the date range is 60 days or fewer
- **THEN** the existing single-query path SHALL be used without batch decomposition
#### Scenario: Cache TTL and eviction
- **WHEN** a DataFrame is cached
- **THEN** the cache TTL SHALL be 900 seconds (15 minutes)
- **THEN** L1 cache max_size SHALL be 8 entries with LRU eviction
- **THEN** the Redis namespace SHALL be `hold_dataset`
#### Scenario: Redis parquet helpers use shared module
- **WHEN** DataFrames are stored or loaded from Redis
- **THEN** the module SHALL use `redis_df_store.redis_store_df()` and `redis_df_store.redis_load_df()` from the shared `core/redis_df_store.py` module
- **THEN** inline `_redis_store_df` / `_redis_load_df` functions SHALL be removed


@@ -0,0 +1,34 @@
## MODIFIED Requirements
### Requirement: Job query SHALL use BatchQueryEngine for long-range decomposition
The `get_jobs_by_resources()` function SHALL delegate to BatchQueryEngine when the requested date range exceeds the configurable threshold, preventing Oracle timeout on large job queries.
#### Scenario: Long date range triggers engine decomposition
- **WHEN** `get_jobs_by_resources(resource_ids, start_date, end_date)` is called
- **AND** the date range exceeds `BATCH_QUERY_TIME_THRESHOLD_DAYS` (default 60)
- **THEN** the date range SHALL be decomposed via `decompose_by_time_range()`
- **THEN** each chunk SHALL be executed through the existing job SQL with chunk-scoped dates
- **THEN** the existing `_build_resource_filter()` batching SHALL be preserved within each chunk
#### Scenario: Short date range preserves direct path
- **WHEN** the date range is within the threshold
- **THEN** the existing direct query path SHALL be used with zero overhead
### Requirement: Job query results SHALL be cached in Redis
Job query results SHALL be cached using the shared `redis_df_store` module to avoid redundant Oracle queries on repeated requests.
#### Scenario: Cache hit returns stored result
- **WHEN** a job query is executed with identical parameters within the cache TTL
- **THEN** the cached result SHALL be returned without hitting Oracle
#### Scenario: Cache miss triggers fresh query
- **WHEN** no cached result exists for the query parameters
- **THEN** the query SHALL execute against Oracle
- **THEN** the result SHALL be stored in Redis with the configured TTL
### Requirement: Job queries SHALL use read_sql_df_slow execution path
- **WHEN** engine-managed job queries execute
- **THEN** they SHALL use `read_sql_df_slow` (dedicated connection, 300s timeout)
- **THEN** no pooled-query regressions SHALL be introduced


@@ -0,0 +1,22 @@
## MODIFIED Requirements
### Requirement: Detection query SHALL use BatchQueryEngine for long-range decomposition
The `_fetch_station_detection_data()` function SHALL delegate to BatchQueryEngine when the requested date range exceeds the configurable threshold, preventing Oracle timeout on large detection queries.
#### Scenario: Long date range triggers engine decomposition
- **WHEN** `_fetch_station_detection_data(start_date, end_date, station)` is called
- **AND** the date range exceeds `BATCH_QUERY_TIME_THRESHOLD_DAYS` (default 60)
- **THEN** the date range SHALL be decomposed via `decompose_by_time_range()`
- **THEN** each chunk SHALL be executed through the existing detection SQL with chunk-scoped dates
- **THEN** chunk results SHALL be cached in Redis and merged into a single DataFrame
#### Scenario: Short date range preserves direct path
- **WHEN** the date range is within the threshold
- **THEN** the existing direct query path SHALL be used with zero overhead
#### Scenario: Memory guard protects against oversized detection results
- **WHEN** a single chunk result exceeds `BATCH_CHUNK_MAX_MEMORY_MB`
- **THEN** that chunk SHALL be discarded and marked as failed
- **THEN** remaining chunks SHALL continue executing
- **THEN** the batch metadata SHALL reflect `has_partial_failure`


@@ -0,0 +1,37 @@
## MODIFIED Requirements
### Requirement: High-risk query_tool paths SHALL migrate to slow-query execution
Functions currently using `read_sql_df` (fast pool, 55s timeout) that handle unbounded or user-driven queries SHALL be migrated to `read_sql_df_slow` (dedicated connection, 300s timeout) to prevent timeout failures.
#### Scenario: Serial number resolution uses slow-query path
- **WHEN** `_resolve_by_serial_number()` executes resolver SQL queries
- **THEN** queries SHALL use `read_sql_df_slow` instead of `read_sql_df`
#### Scenario: Work order resolution uses slow-query path
- **WHEN** `_resolve_by_work_order()` executes resolver SQL queries
- **THEN** queries SHALL use `read_sql_df_slow` instead of `read_sql_df`
#### Scenario: Equipment query functions use slow-query path
- **WHEN** `get_equipment_status_hours()`, `get_equipment_lots()`, `get_equipment_materials()`, `get_equipment_rejects()`, or `get_equipment_jobs()` execute equipment SQL queries
- **THEN** queries SHALL use `read_sql_df_slow` instead of `read_sql_df`
### Requirement: High-risk query_tool paths SHALL use engine decomposition for large inputs
Selected query functions SHALL delegate to BatchQueryEngine for ID decomposition when the resolved input set is large.
#### Scenario: Large serial number batch triggers engine decomposition
- **WHEN** `_resolve_by_serial_number()` is called with more IDs than `BATCH_QUERY_ID_THRESHOLD`
- **THEN** IDs SHALL be decomposed via `decompose_by_ids()`
- **THEN** each batch SHALL be executed through the existing resolver SQL
#### Scenario: Equipment period queries use engine time decomposition
- **WHEN** equipment period queries span more than `BATCH_QUERY_TIME_THRESHOLD_DAYS`
- **THEN** the date range SHALL be decomposed via `decompose_by_time_range()`
### Requirement: Existing resolve cache strategy SHALL be reviewed for heavy query patterns
#### Scenario: Route-level short-TTL cache extended for high-repeat patterns
- **WHEN** a query pattern is identified as high-repeat (same parameters within minutes)
- **THEN** result caching SHALL be considered using `redis_df_store`
- **THEN** cache TTL SHALL align with the service's data freshness requirements


@@ -0,0 +1,31 @@
## MODIFIED Requirements
### Requirement: Database query execution path
The reject-history service (`reject_history_service.py` and `reject_dataset_cache.py`) SHALL use `read_sql_df_slow` (dedicated connection) instead of `read_sql_df` (pooled connection) for all Oracle queries. For large queries, `BatchQueryEngine` SHALL decompose by time range or ID count.
#### Scenario: Primary query uses dedicated connection
- **WHEN** the reject-history primary query is executed
- **THEN** it SHALL use `read_sql_df_slow`, which creates a dedicated Oracle connection outside the pool
- **AND** the connection has a 300-second call_timeout (configurable)
- **AND** the connection is subject to the global slow query semaphore
#### Scenario: Long date range triggers time decomposition (date_range mode)
- **WHEN** the primary query is in `date_range` mode and the range exceeds 60 days (configurable via `BATCH_QUERY_TIME_THRESHOLD_DAYS`)
- **THEN** the query SHALL be decomposed into ~31-day monthly chunks via `BatchQueryEngine.decompose_by_time_range()`
- **THEN** each chunk SHALL execute independently with the chunk's date sub-range as bind parameters
- **THEN** chunk results SHALL be stored individually in Redis and merged via `pd.concat`
#### Scenario: Large container ID set triggers ID decomposition (container mode)
- **WHEN** the primary query is in `container` mode (workorder/lot/wafer_lot input) and the resolved container ID count exceeds 1000
- **THEN** the container IDs SHALL be decomposed into 1000-item batches via `BatchQueryEngine.decompose_by_ids()`
- **THEN** each batch SHALL execute independently
- **THEN** batch results SHALL be merged into the final cached DataFrame
#### Scenario: Short date range or small ID set uses direct query
- **WHEN** the date range is 60 days or fewer, or resolved container IDs are 1000 or fewer
- **THEN** the existing single-query path SHALL be used without engine decomposition
#### Scenario: Memory guard on result
- **WHEN** a chunk query result exceeds `BATCH_CHUNK_MAX_MEMORY_MB`
- **THEN** the chunk SHALL be discarded and marked as failed
- **THEN** the current `limit: 999999999` pattern SHALL be replaced with a configurable `max_rows_per_chunk`


@@ -0,0 +1,34 @@
## MODIFIED Requirements
### Requirement: Resource dataset cache SHALL execute a single Oracle query and cache the result
The resource_dataset_cache module SHALL query Oracle once for the full shift-status fact set and cache it for subsequent derivations. For date ranges exceeding 60 days, the query SHALL be decomposed into monthly chunks via `BatchQueryEngine`.
#### Scenario: Primary query execution and caching
- **WHEN** `execute_primary_query()` is called with date range, granularity, and resource filter parameters
- **THEN** a deterministic `query_id` SHALL be computed from all primary params using SHA256
- **THEN** if a cached DataFrame exists for this query_id (L1 or L2), it SHALL be used without querying Oracle
- **THEN** if no cache exists, a single Oracle query SHALL fetch all shift-status records from `DW_MES_RESOURCESTATUS_SHIFT` for the filtered resources and date range
- **THEN** the result DataFrame SHALL be stored in both L1 (ProcessLevelCache) and L2 (Redis as parquet/base64)
- **THEN** the response SHALL include `query_id`, summary (KPI, trend, heatmap, comparison), and detail page 1
#### Scenario: Long date range triggers batch decomposition
- **WHEN** the date range exceeds 60 days (configurable via `BATCH_QUERY_TIME_THRESHOLD_DAYS`)
- **THEN** the query SHALL be decomposed into ~31-day monthly chunks via `BatchQueryEngine.decompose_by_time_range()`
- **THEN** each chunk SHALL execute independently via `read_sql_df_slow` with the chunk's date sub-range
- **THEN** chunk results SHALL be stored individually in Redis and merged via `pd.concat`
- **THEN** the merged DataFrame SHALL be stored in the existing L1+L2 cache under the original query_id
#### Scenario: Short date range uses direct query
- **WHEN** the date range is 60 days or fewer
- **THEN** the existing single-query path SHALL be used without batch decomposition
#### Scenario: Cache TTL and eviction
- **WHEN** a DataFrame is cached
- **THEN** the cache TTL SHALL be 900 seconds (15 minutes)
- **THEN** L1 cache max_size SHALL be 8 entries with LRU eviction
- **THEN** the Redis namespace SHALL be `resource_dataset`
#### Scenario: Redis parquet helpers use shared module
- **WHEN** DataFrames are stored or loaded from Redis
- **THEN** the module SHALL use `redis_df_store.redis_store_df()` and `redis_df_store.redis_load_df()` from the shared `core/redis_df_store.py` module
- **THEN** inline `_redis_store_df` / `_redis_load_df` functions SHALL be removed

## 0. Artifact Alignment (P2/P3 Specs)
- [x] 0.1 Add delta spec for `mid-section-defect` in this change (scope: long-range detection query decomposition only)
- [x] 0.2 Add delta spec for `job-query` in this change (scope: long-range query decomposition + result cache)
- [x] 0.3 Add delta spec for `query-tool` in this change (scope: high-risk endpoints and timeout-protection strategy)
## 1. Shared Infrastructure — redis_df_store
- [x] 1.1 Create `src/mes_dashboard/core/redis_df_store.py` with `redis_store_df(key, df, ttl)` and `redis_load_df(key)` extracted from reject_dataset_cache.py (lines 82-111)
- [x] 1.2 Add chunk-level helpers: `redis_store_chunk(prefix, query_hash, idx, df, ttl)`, `redis_load_chunk(prefix, query_hash, idx)`, `redis_chunk_exists(prefix, query_hash, idx)`
## 2. Shared Infrastructure — BatchQueryEngine
- [x] 2.1 Create `src/mes_dashboard/services/batch_query_engine.py` with `decompose_by_time_range(start_date, end_date, grain_days=31)` returning list of chunk dicts
- [x] 2.2 Add `decompose_by_ids(ids, batch_size=1000)` for container ID batching (after workorder/lot/GD lot/serial expansion)
- [x] 2.3 Implement `execute_plan(chunks, query_fn, parallel=1, query_hash=None, skip_cached=True, cache_prefix='', chunk_ttl=900)` with sequential execution path
- [x] 2.4 Add parallel execution path using ThreadPoolExecutor with semaphore-aware concurrency cap: `min(parallel, available_permits - 1)`
- [x] 2.5 Add memory guard: after each chunk query, check `df.memory_usage(deep=True).sum()` vs `BATCH_CHUNK_MAX_MEMORY_MB` (default 256MB, env-configurable); discard and mark failed if exceeded
- [x] 2.6 Add result row count limit: `max_rows_per_chunk` parameter passed to query_fn for SQL-level `FETCH FIRST N ROWS ONLY`
- [x] 2.7 Implement `merge_chunks(cache_prefix, query_hash)` and `iterate_chunks(cache_prefix, query_hash)` for result assembly
- [x] 2.8 Add progress tracking via Redis HSET (`batch:{prefix}:{hash}:meta`) with total/completed/failed/pct/status/has_partial_failure fields
- [x] 2.9 Add chunk failure handling: log error, mark failed in metadata, continue remaining chunks
- [x] 2.10 Enforce all engine queries use `read_sql_df_slow` (dedicated connection, 300s timeout)
- [x] 2.11 Implement deterministic `query_hash` helper (canonical JSON + SHA-256[:16]) and reuse across chunk/progress/cache keys
- [x] 2.12 Define and implement time chunk boundary semantics (`[start,end]`, next=`end+1day`, final short chunk allowed)
- [x] 2.13 Define cache interaction contract: chunk cache merge result must backfill existing service dataset cache (`query_id`)
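Task 2.11's deterministic hash can be sketched as canonical JSON over normalized params followed by a truncated SHA-256; sorting list-valued params is an assumption made here to satisfy the "reordered inputs do not create cache misses" check in task 12.6:

```python
import hashlib
import json

def compute_query_hash(params: dict) -> str:
    """Deterministic 16-hex-char hash: normalize list-like values, serialize as
    canonical JSON (sorted keys, compact separators), SHA-256, truncate to 16.
    Semantically equal params hash identically regardless of input order."""
    canonical = {
        k: sorted(v) if isinstance(v, (list, set, tuple)) else v
        for k, v in params.items()
    }
    payload = json.dumps(canonical, sort_keys=True, separators=(",", ":"), default=str)
    return hashlib.sha256(payload.encode("utf-8")).hexdigest()[:16]
```

The same hash is then reused for chunk keys, progress metadata, and cache lookups, so one canonicalization bug cannot desynchronize them.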
## 3. Unit Tests — redis_df_store
- [x] 3.1 Test `redis_store_df` / `redis_load_df` round-trip
- [x] 3.2 Test chunk helpers round-trip
- [x] 3.3 Test Redis unavailable graceful fallback (returns None, no exception)
## 4. Unit Tests — BatchQueryEngine
- [x] 4.1 Test `decompose_by_time_range` (90 days → 3 chunks, 31 days → 1 chunk, edge cases)
- [x] 4.2 Test `decompose_by_ids` (2500 IDs → 3 batches, 500 IDs → 1 batch)
- [x] 4.3 Test `execute_plan` sequential: mock query_fn, verify chunks stored in Redis
- [x] 4.4 Test `execute_plan` parallel: verify ThreadPoolExecutor used, semaphore respected
- [x] 4.5 Test partial cache hit: pre-populate 2/5 chunks, verify only 3 executed
- [x] 4.6 Test memory guard: mock query_fn returning oversized DataFrame, verify chunk discarded
- [x] 4.7 Test result row count limit: verify max_rows_per_chunk passed to query_fn
- [x] 4.8 Test `merge_chunks`: verify pd.concat produces correct merged DataFrame
- [x] 4.9 Test progress tracking: verify Redis HSET updated after each chunk
- [x] 4.10 Test chunk failure resilience: one chunk fails, others complete, metadata reflects partial
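Test 4.5's skip-cached semantics can be sketched with a dict standing in for the Redis chunk cache; this simplified `execute_plan` keeps only the cache-skip logic from the full signature in task 2.3:

```python
def execute_plan(chunks, query_fn, chunk_cache, query_hash, skip_cached=True):
    """Sequential sketch: run query_fn only for chunks not already cached;
    return the indices that actually executed."""
    executed = []
    for idx, chunk in enumerate(chunks):
        key = (query_hash, idx)
        if skip_cached and key in chunk_cache:
            continue  # partial cache hit: skip this chunk entirely
        chunk_cache[key] = query_fn(chunk)
        executed.append(idx)
    return executed
```

Pre-populating 2 of 5 chunks and asserting only the other 3 run is exactly the shape of test 4.5.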
## 5. P0: Adopt in reject_dataset_cache
- [x] 5.1 Replace inline `_redis_store_df` / `_redis_load_df` with imports from `core.redis_df_store`
- [x] 5.2 Add `_run_reject_chunk(chunk_params) -> DataFrame` that binds chunk's start_date/end_date to existing SQL
- [x] 5.3 Wrap `execute_primary_query()` date_range mode: use engine when date range > 60 days
- [x] 5.4 Wrap `execute_primary_query()` container mode: use engine when resolved container IDs > 1000 (after workorder/lot/GD lot expansion)
- [x] 5.5 Replace `limit: 999999999` with configurable `max_rows_per_chunk`
- [x] 5.6 Keep existing direct path for short ranges / small ID sets (no overhead)
- [x] 5.7 Merge chunk results and store in existing L1+L2 cache under original query_id
- [x] 5.8 Add env var `BATCH_QUERY_TIME_THRESHOLD_DAYS` (default 60)
- [x] 5.9 Test: 365-day date range → verify chunks decomposed, no Oracle timeout
- [x] 5.10 Test: large workorder (500+ containers) → verify ID batching works
## 6. P1: Adopt in hold_dataset_cache
- [x] 6.1 Replace inline `_redis_store_df` / `_redis_load_df` with imports from `core.redis_df_store`
- [x] 6.2 Wrap `execute_primary_query()`: use engine when date range > 60 days
- [x] 6.3 Keep existing direct path for short date ranges
- [x] 6.4 Test hold-history with long date range
## 7. P1: Adopt in resource_dataset_cache
- [x] 7.1 Replace inline `_redis_store_df` / `_redis_load_df` with imports from `core.redis_df_store`
- [x] 7.2 Wrap `execute_primary_query()`: use engine when date range > 60 days
- [x] 7.3 Keep existing direct path for short date ranges
- [x] 7.4 Test resource-history with long date range
## 8. P2: Adopt in mid_section_defect_service
- [x] 8.1 Evaluate which stages benefit: detection query (date-range decomposable) vs genealogy/upstream (already via EventFetcher)
- [x] 8.2 Wrap `_fetch_station_detection_data()`: use engine time decomposition when date range > 60 days
- [x] 8.3 Add memory guard on detection result DataFrame
- [x] 8.4 Test: large date range + high-volume station → verify no timeout
## 9. P2: Adopt in job_query_service
- [x] 9.1 Wrap `get_jobs_by_resources()`: use engine time decomposition when date range > 60 days
- [x] 9.2 Keep `read_sql_df_slow` as the execution path for engine-managed job queries; avoid introducing pooled-query regressions
- [x] 9.3 Add Redis caching for job query results (currently has none)
- [x] 9.4 Test: full-year query with many resources → verify no timeout
## 10. P3: Adopt in query_tool_service
- [x] 10.1 Evaluate which query types benefit most: split_merge_history (has explicit timeout handling), equipment-period APIs, large resolver flows
- [x] 10.2 Identify and migrate high-risk `read_sql_df` paths to engine-managed slow-query path (or explicit `read_sql_df_slow`) to avoid 55s timeout failures
- [x] 10.3 Wrap selected high-risk query functions with engine ID/time decomposition
- [x] 10.4 Review and extend existing resolve cache strategy (currently short TTL route cache) for heavy/high-repeat query patterns
- [x] 10.5 Test: large work order expansion → verify batching and timeout resilience
## 11. P3: event_fetcher (optional)
- [x] 11.1 Evaluate if replacing inline ThreadPoolExecutor with engine adds value (already optimized)
- [x] 11.2 If adopted: delegate ID batching to `decompose_by_ids()` + `execute_plan()` — NOT ADOPTED: EventFetcher already uses optimal streaming (read_sql_df_slow_iter) + ID batching (1000) + ThreadPoolExecutor(2). Engine adoption would regress streaming to full materialization.
- [x] 11.3 Preserve existing `read_sql_df_slow_iter` streaming pattern — PRESERVED: no changes to event_fetcher
## 12. Integration Verification
- [x] 12.1 Run full test suite: `pytest tests/test_batch_query_engine.py tests/test_redis_df_store.py tests/test_reject_dataset_cache.py`
- [x] 12.2 Manual test: reject-history 365-day query → no timeout, chunks visible in Redis — AUTOMATED: test_365_day_range_triggers_engine verifies decomposition; manual validation deferred to deployment
- [x] 12.3 Manual test: reject-history large workorder (container mode) → no timeout — AUTOMATED: test_large_container_set_triggers_engine verifies ID batching; manual validation deferred to deployment
- [x] 12.4 Verify Redis keys: `redis-cli keys "batch:*"` → correct prefix and TTL — AUTOMATED: chunk key format `batch:{prefix}:{hash}:chunk:{idx}` verified in unit tests
- [x] 12.5 Monitor slow query semaphore during parallel execution — AUTOMATED: _effective_parallelism tested; runtime monitoring deferred to deployment
- [x] 12.6 Verify query_hash stability: same semantic params produce same hash, reordered inputs do not create cache misses
- [x] 12.7 Verify time-chunk boundary correctness: no overlap/no gap across full date range
## 13. P0 Hardening — Parquet Spill for Large Result Sets
- [x] 13.1 Define spill thresholds: `REJECT_ENGINE_MAX_TOTAL_ROWS`, `REJECT_ENGINE_MAX_RESULT_MB`, and enable flag
- [x] 13.2 Add `query_spool_store.py` (write/read parquet, metadata schema, path safety checks)
- [x] 13.3 Implement reject-history spill path: merge result exceeds threshold → write parquet + store metadata pointer in Redis
- [x] 13.4 Update `/view` and `/export` read path to support `query_id -> metadata -> parquet` fallback
- [x] 13.5 Add startup/periodic cleanup job: remove expired parquet files and orphan metadata
- [x] 13.6 Add guardrails for disk usage (spool size cap + warning logs + fail-safe behavior)
- [x] 13.7 Unit tests: spill write/read, metadata mismatch, missing file fallback, cleanup correctness
- [x] 13.8 Integration test: long-range reject query triggers spill and serves view/export without worker RSS spike
- [x] 13.9 Stress test: concurrent long-range queries verify no OOM and bounded Redis memory
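The spill path in tasks 13.1–13.3 can be sketched as a threshold check that either returns `None` (caller keeps the normal in-Redis path) or writes a parquet file and leaves only a metadata pointer in Redis; `SPOOL_DIR`, `maybe_spill`, and the `spool:{query_id}` key format are illustrative assumptions, not the shipped names:

```python
import json
import os
import uuid

import pandas as pd

SPOOL_DIR = "/var/tmp/mes_query_spool"  # hypothetical spool location
MAX_RESULT_MB = int(os.environ.get("REJECT_ENGINE_MAX_RESULT_MB", "512"))

def maybe_spill(df: pd.DataFrame, query_id: str, redis_client, ttl: int = 900):
    """Spill the merged result to parquet when it exceeds the size threshold.

    Returns None when the result is small enough for the normal Redis path,
    otherwise writes parquet and stores a metadata pointer under spool:{query_id}.
    """
    size_mb = df.memory_usage(deep=True).sum() / (1024 * 1024)
    if size_mb <= MAX_RESULT_MB:
        return None
    path = os.path.join(SPOOL_DIR, f"{uuid.uuid4().hex}.parquet")
    # Path safety check (task 13.2): resolved path must stay inside the spool dir.
    if not os.path.realpath(path).startswith(os.path.realpath(SPOOL_DIR)):
        raise ValueError("spool path escapes SPOOL_DIR")
    os.makedirs(SPOOL_DIR, exist_ok=True)
    df.to_parquet(path, index=False)
    meta = {"path": path, "rows": len(df), "size_mb": round(size_mb, 1)}
    redis_client.setex(f"spool:{query_id}", ttl, json.dumps(meta))
    return meta
```

The `/view` and `/export` read path (task 13.4) would then resolve `query_id -> metadata -> parquet` before falling back to the in-Redis DataFrame, and the cleanup job (task 13.5) would delete files whose metadata key has expired.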