feat: harden long-range batch queries with redis+parquet caching
schema: spec-driven
created: 2026-03-02
## Context

Six services currently each handle large queries on their own, with no unified protection:

| Service | Query type | Existing protection | Gap |
|------|---------|---------|------|
| reject-history | date + workorder/Lot/GD expansion | L1+L2 cache, `read_sql_df_slow` | no memory guard, `limit=999999999`, no chunked querying |
| hold-history | date | L1+L2 cache, `read_sql_df_slow` | no memory guard, no time chunking |
| resource-history | date + equipment ID | L1+L2 cache, 1000-row batching | no memory guard |
| mid-section-defect | date → detection → lineage → upstream | Redis cache, EventFetcher batching | no cap on detection count |
| job-query | date + equipment ID | 1000-row batching, `read_sql_df_slow` | **no result cache**, no time chunking |
| query-tool | multiple resolvers → container ID | input-count limit, short-TTL cache on resolve routes, EventFetcher cache | most queries still use `read_sql_df` (55s timeout); no unified chunk orchestration |

Reference implementations:

- `EventFetcher`: batch 1000 + ThreadPoolExecutor(2) + `read_sql_df_slow_iter` streaming + Redis cache (**already the best implementation**)
- `LineageEngine`: batch 1000 + depth limit 20 (**dedicated lineage engine**)

Goal: build a shared `BatchQueryEngine` module so that any service plugging into it gets the full set of protections.

## Goals / Non-Goals

**Goals:**

- Unify parquet-in-Redis access into a shared module (eliminating 3 duplicated copies)
- Provide time-range decomposition (long date ranges → ~31-day monthly chunks)
- Provide ID batch decomposition (large container-ID sets from workorder/Lot/GD expansion → batches of 1000)
- Memory guard: check memory_usage on each chunk result and abort when it exceeds the threshold
- Result row limit: configurable cap; truncate and flag when exceeded
- Controlled parallelism: sequential by default, optional parallelism, semaphore-aware
- Redis chunk cache with partial hits
- Uniform use of `read_sql_df_slow` (300-second dedicated connection)
- Define query_hash and chunk-boundary semantics to avoid inconsistent behavior across services
- Define how the chunk cache interacts with each service's L1/L2 dataset cache

**Non-Goals:**

- No changes to the SQL statements themselves
- No new external dependencies
- No changes to the frontend API (transparent to the frontend)
- No replacement of EventFetcher / LineageEngine (each is already optimized; the engine offers optional integration points)
- No change to trace_job_service's RQ async architecture

## Decisions

### Decision 1: Extract a shared `redis_df_store.py` module

**Choice**: Extract the identical `_redis_store_df` / `_redis_load_df` helpers from reject/hold/resource_dataset_cache into `src/mes_dashboard/core/redis_df_store.py`.

**Alternative**: (A) Keep separate copies → already duplicated in 3 places and hard to maintain.

**Rationale**: parquet-in-Redis is a DataFrame serialization utility; it sits at a different layer than caching policy (TTL, LRU).

### Decision 2: `BatchQueryEngine` as utility functions, not a base class

**Choice**: Provide standalone functions (`decompose_by_time_range`, `decompose_by_ids`, `execute_plan`, `merge_chunks`) that services call as needed.

**Alternative**: (A) An abstract base class `BaseDatasetCache` → the three dataset caches differ too much (SQL, policy filters, derived computations); forced inheritance would over-couple them.

**Rationale**: The utility-function style lets services keep their current structure and decide on the main query path whether to enable decomposition. Queries below the threshold never touch the engine.
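As a minimal sketch of the utility-function style (the function name comes from the decision; the exact signature and return shape are assumptions, not the final API), ID batching needs nothing more than list slicing:

```python
from typing import List, Sequence


def decompose_by_ids(ids: Sequence[str], batch_size: int = 1000) -> List[List[str]]:
    """Split a large ID list into batches of at most `batch_size` items.

    Small inputs stay a single batch, so callers below the threshold
    pay no decomposition overhead.
    """
    items = list(ids)
    if len(items) <= batch_size:
        return [items]
    return [items[i:i + batch_size] for i in range(0, len(items), batch_size)]
```

A service would call this only after resolve expansion produces more IDs than its threshold, then run the existing SQL once per batch.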

### Decision 3: Sequential by default, optional parallelism, semaphore-aware

**Choice**: `execute_plan(parallel=1)` runs sequentially by default. The effective parallelism is `min(requested, semaphore_available - 1)`.

**Alternatives**: (A) Parallel by default → may exhaust the semaphore; (B) No parallelism at all → gives up the speedup.

**Rationale**: Oracle connections are scarce (Production defaults to `DB_SLOW_MAX_CONCURRENT=5`; Development is commonly 3). reject_dataset_cache carries the heaviest queries and may use parallel=2; sequential is the safest default everywhere else.
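The capping rule above can be sketched in one helper (the function name is hypothetical; only the `min(requested, available - 1)` formula comes from the decision):

```python
def effective_parallelism(requested: int, available_permits: int) -> int:
    """Cap parallel chunk execution so at least one slow-query permit
    stays free for other requests; degrade to sequential when starved."""
    # min(requested, available - 1), floored at 1 (sequential execution)
    return max(1, min(requested, available_permits - 1))
```

With 5 permits, `parallel=2` runs 2 chunks at once; with only 1 permit left, any request degrades to sequential.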

### Decision 4: Memory guard + result row limit

**Choice**: After each chunk query, check `df.memory_usage(deep=True).sum()`; if it exceeds `BATCH_CHUNK_MAX_MEMORY_MB` (default 256MB), abort that chunk and mark it failed. Also provide a `max_rows_per_chunk` parameter that adds `FETCH FIRST N ROWS ONLY` to the SQL.

**Alternatives**: (A) No limits → the status quo, with high OOM risk; (B) A single global limit → not flexible enough.

**Rationale**: The per-chunk memory guard is the last line of defense. After decomposition, each chunk's date/ID range is already much smaller, so exceeding the limit usually signals abnormal data; abort rather than continue.
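A sketch of the per-chunk guard (the helper and exception names are illustrative; the `memory_usage(deep=True)` check and the 256MB default come from the decision):

```python
import pandas as pd

BATCH_CHUNK_MAX_MEMORY_MB = 256  # default from Decision 4, env-configurable


class ChunkMemoryExceeded(Exception):
    """Raised when a single chunk result exceeds the configured cap."""


def check_chunk_memory(df: pd.DataFrame,
                       limit_mb: int = BATCH_CHUNK_MAX_MEMORY_MB) -> float:
    """Return the chunk's deep memory usage in MB; raise when over the limit."""
    used_mb = df.memory_usage(deep=True).sum() / (1024 * 1024)
    if used_mb > limit_mb:
        raise ChunkMemoryExceeded(f"chunk uses {used_mb:.1f} MB > {limit_mb} MB")
    return used_mb
```

The engine would catch `ChunkMemoryExceeded`, mark the chunk failed in metadata, and continue with the remaining chunks.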

### Decision 5: Chunked caching with partial hits

**Choice**: Redis keys `batch:{prefix}:{hash}:chunk:{idx}`, one SETEX per chunk.

**Alternative**: (A) Cache only the final merged result → no partial hits.

**Rationale**: A common user flow is "query Jan-Jun, then query Jan-Aug". Chunked caching reuses the first six months directly and only queries Jul-Aug.
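The partial-hit logic reduces to "which chunk keys are missing". A sketch against a plain dict standing in for Redis (the key format is from the decision; the helper names are assumptions):

```python
from typing import Dict, List


def chunk_key(prefix: str, query_hash: str, idx: int) -> str:
    """Redis key for one cached chunk, per Decision 5's key scheme."""
    return f"batch:{prefix}:{query_hash}:chunk:{idx}"


def missing_chunk_indices(cache: Dict[str, bytes], prefix: str,
                          query_hash: str, n_chunks: int) -> List[int]:
    """Indices of chunks absent from the cache; cached ones are reused as-is."""
    return [i for i in range(n_chunks)
            if chunk_key(prefix, query_hash, i) not in cache]
```

Extending Jan-Jun to Jan-Aug then means only the two new monthly chunks appear in the missing list.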

### Decision 6: Engine traffic goes through the slow-query path (without occupying the main pool)

**Choice**: All queries that go through the engine use the slow-query path (300s timeout, semaphore-controlled); existing short-query paths that bypass the engine stay unchanged.

The slow-query execution strategy has two tiers:

1. Primary path: checkout/checkin against the existing dedicated `SLOW POOL` (small capacity).
2. Fallback: when the SLOW POOL is unavailable, degrade to a slow direct connection.

**Alternatives**:

(A) Let engine traffic mix in `read_sql_df` (main pool, 55s timeout) → high timeout risk for long queries, and it squeezes general API throughput.

(B) Run slow queries on the shared main pool → pool contention at peak and amplified overall latency.

**Rationale**: Queries routed through the engine are by definition "known to be potentially slow". Isolating them from the main pool prevents mutual interference; the SLOW POOL gives both connection reuse and isolation, and the direct-connection fallback preserves availability.

### Decision 7: Partial-failure handling

**Choice**: When a chunk fails, log the error and continue with the remaining chunks. `merge_chunks()` returns the successful portion, with the metadata flag `has_partial_failure=True`.

**Alternative**: (A) Roll back everything → wastes the chunks that already succeeded.

**Rationale**: For historical reports, partial results are more valuable than total failure. The metadata flag lets each service decide whether to warn the user.

### Decision 8: Chunk cache interaction with service L1/L2 dataset caches

**Choice**: Read the chunk cache (Redis) first to assemble the result; then backfill the existing service dataset cache (L1 process + L2 Redis) to preserve the current `/view` path and `query_id` behavior.

**Alternative**: (A) Use only the chunk cache without backfilling the service cache → the existing view/query_id flow breaks or re-queries.

**Rationale**: The existing two-phase dataset API (primary query + cached view) must keep working; the chunk cache is an engine-level optimization and must not break the service-level interface.

### Decision 9: query_hash specification

**Choice**: `query_hash` is the first 16 hex characters of SHA-256 over canonical JSON (sorted keys, stable list ordering, normalized strings); the hash covers only parameters that affect the raw result set (presentation-only parameters are excluded).

**Alternative**: (A) Each service rolls its own hash → unpredictable across services and hard to debug.

**Rationale**: Chunk keys, progress keys, and merge keys must be reproducible; otherwise cache hits and partial reuse cannot be guaranteed.
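A sketch of the canonicalization rule (sorted keys, stable list ordering, SHA-256 truncated to 16 hex characters, all from the decision; treating every list parameter as order-insensitive is an assumption the real spec may refine):

```python
import hashlib
import json
from typing import Any, Mapping


def query_hash(params: Mapping[str, Any]) -> str:
    """Canonical-JSON hash so semantically equal queries share one identity."""
    canonical = {
        # assumption: list-like params (e.g. resolved IDs) are order-insensitive
        k: sorted(v) if isinstance(v, (list, tuple, set)) else v
        for k, v in params.items()
    }
    payload = json.dumps(canonical, sort_keys=True, ensure_ascii=False,
                         separators=(",", ":"))
    return hashlib.sha256(payload.encode("utf-8")).hexdigest()[:16]
```

Callers pass only dataset-affecting parameters (date range, mode, resolved IDs, core filters); page size or sort order never enters the hash.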

### Decision 10: Time-decomposition boundary semantics

**Choice**: Chunks are closed intervals `[chunk_start, chunk_end]`; the next chunk starts at `chunk_end + 1 day`; the final chunk may be shorter than `grain_days`; input dates follow each service's existing timezone/day-boundary convention, and the engine does not reinterpret timezones.

**Alternative**: (A) Half-open intervals, or month-based splitting with undefined boundaries → easy to overlap or drop data.

**Rationale**: With fixed boundary semantics, merge deduplication, statistical consistency, and test verifiability all improve.
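The closed-interval semantics above can be sketched directly (function name from Decision 2; `date` objects instead of the service's date strings are a simplification):

```python
from datetime import date, timedelta
from typing import List, Tuple


def decompose_by_time_range(start: date, end: date,
                            grain_days: int = 31) -> List[Tuple[date, date]]:
    """Closed-interval chunks [chunk_start, chunk_end]: the next chunk starts
    at chunk_end + 1 day, and the final chunk may be shorter than grain_days."""
    chunks: List[Tuple[date, date]] = []
    cursor = start
    while cursor <= end:
        chunk_end = min(cursor + timedelta(days=grain_days - 1), end)
        chunks.append((cursor, chunk_end))
        cursor = chunk_end + timedelta(days=1)
    return chunks
```

Because each chunk ends exactly one day before the next begins, merged results have no overlap and no gap, which is what makes deduplication and tests deterministic.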

### Decision 11: Spill large results to Parquet; Redis keeps only metadata and hot cache

**Choice**: Introduce spill-to-disk for long queries (reject-history in particular):

1. Chunk queries and the chunk cache stay as-is (Redis, short TTL)
2. After merging, if the result exceeds a threshold (rows / memory / serialized size), write it as Parquet to a local spool directory
3. Redis keeps only metadata (query_id, file_path, row_count, schema_hash, created_at, expires_at)
4. `/view` and `/export` read the parquet via metadata first; when metadata is absent, fall back to the current cache behavior
5. A background cleaner periodically removes expired parquet files and orphaned metadata

**Alternatives**:

(A) Keep all results entirely in Redis (status quo) → high memory pressure, prone to cascading lock timeouts/OOM

(B) Land results in a database (e.g., SQLite) → write-lock contention and high operational complexity (`database is locked` has already been observed)

**Rationale**: Redis is an in-memory cache and is not suited to holding large results for long periods; spilling to Parquet moves large results to disk and lowers worker/Redis memory peaks.
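A sketch of the spill routing and the Redis-side pointer (the metadata fields come from the decision; the helper names and the row/memory thresholds are illustrative assumptions):

```python
import json
import time
from pathlib import Path


def should_spill(row_count: int, mem_bytes: int,
                 max_rows: int = 500_000, max_mem_mb: int = 256) -> bool:
    """Route a merged result to parquet-on-disk instead of Redis when large."""
    return row_count > max_rows or mem_bytes > max_mem_mb * 1024 * 1024


def spill_pointer(query_id: str, path: Path, row_count: int,
                  schema_hash: str, ttl_seconds: int = 900) -> str:
    """JSON metadata stored in Redis; the parquet file itself stays on disk."""
    now = time.time()
    return json.dumps({
        "query_id": query_id,
        "file_path": str(path),
        "row_count": row_count,
        "schema_hash": schema_hash,
        "created_at": now,
        "expires_at": now + ttl_seconds,
    })
```

`/view` and `/export` would load this pointer, verify the file exists and the schema hash matches, then read the parquet; a missing pointer falls back to the current cache path.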

## Risks / Trade-offs

**[Redis memory growth]** → Chunked caching adds keys (365 days ≈ 12 chunk keys).
→ Mitigation: TTL auto-expiry (900s); chunk results are parquet-compressed (typically around 10:1).

**[Semaphore contention]** → Parallel chunks consume more permits.
→ Mitigation: the engine checks availability and degrades to sequential when permits are short. Default parallel=1.

**[Data consistency across time chunks]** → Chunks for different months are queried at different moments.
→ Mitigation: historical report data changes infrequently (daily granularity), so drift within a short window is minimal. Acceptable.

**[Migration risk]** → Modifying 3 dataset caches first, then expanding to other services, is still a large overall scope.
→ Mitigation: threshold gating (short queries bypass the engine) + phased P0/P1/P2/P3 rollout + independent validation per phase.

**[Disk I/O and capacity pressure]** → Parquet spill adds disk reads/writes; if cleanup fails, files can accumulate.
→ Mitigation: spool capacity cap, TTL cleanup, orphan scan at startup, and a protective "no-spill, summary-only" mode when over capacity.

**[Stale metadata / orphan files]** → Redis metadata and the physical files can drift out of sync.
→ Mitigation: verify file existence and schema hash before reading; on mismatch, invalidate the metadata automatically and log an alert.

## Open Questions

1. In the 4-stage pipeline of `mid_section_defect_service` (detection → lineage → upstream history → attribution), which stages should adopt the engine? Detection queries can be date-decomposed, but lineage/upstream already go through EventFetcher.
2. `query_tool_service` has 15+ query types; adopt the engine for all of them, or only the most timeout-prone (split_merge_history, equipment_period)?

---

## Why

The history-report services (reject-history, hold-history, resource-history), the query tool (query-tool), mid-section defect analysis (mid-section-defect), and job queries (job-query) each implement their own batch-query, caching, and parallel-execution patterns, with no unified orchestration or protection. Main problems:

1. **Oracle timeouts**: queries over long date ranges (365+ days) or large container-ID sets (workorder expansion can reach thousands of IDs) may exceed the 300-second call_timeout
2. **OOM risk**: reject/hold dataset caches fetch everything with `limit: 999999999` and no memory ceiling
3. **Scattered protection**: `EventFetcher` already has ID batching + caching, but the reject/hold/resource dataset caches still maintain their own query and cache strategies
4. **Duplicated code**: 3 dataset caches each carry a copy of the same parquet-in-Redis serialization logic
5. **ID expansion blow-up**: workorder resolve can expand into very large container-ID sets, with no cross-service batching/merging flow
6. **Costly re-queries**: extending a query range (e.g., Jan-Jun to Jan-Aug) cannot reuse the already-queried segments
7. **High timeout risk in query-tool**: most queries still use `read_sql_df` (main pool / 55s timeout) and time out easily on large queries

We need a **reusable, stable query-engine module** that gives any adopting service decomposition, caching, memory protection, and timeout protection automatically.

## What Changes

- Add a shared `BatchQueryEngine` module providing:
  - **Time-range decomposition**: long date ranges → ~31-day monthly chunks, each queried independently
  - **Time-decomposition semantics**: explicitly defined chunk boundaries (closed intervals), month-crossing splits, and a possibly-short final chunk
  - **ID batch decomposition**: large ID sets (after workorder/Lot/GD Lot/serial-batch expansion) → batches of 1000
  - **query_hash specification**: unified canonicalization and hashed fields, keeping chunk/cache keys stable
  - **Memory guard**: check `DataFrame.memory_usage()` on each chunk result; abort and warn over the threshold
  - **Result row limit**: configurable maximum row count; truncate and flag when exceeded
  - **Controlled parallel execution**: sequential by default, optional parallelism, strictly honoring the slow-query semaphore
  - **Redis chunk cache**: each chunk cached independently, with partial hits (reusing already-queried intervals when a range is extended)
  - **Cache-layer interaction**: explicitly defined read/write order between the chunk cache and each service's existing L1/L2 dataset cache
  - **Progress tracking**: progress recorded via Redis HSET, available for frontend display
- Add a **large-result spill layer (Parquet spill)**:
  - When a long query's result exceeds memory/row thresholds, write the merged result as Parquet to a local persistent directory (e.g., `tmp/query_spool/`)
  - Redis keeps only metadata (query_id → parquet path / schema / rows / created_at / ttl)
  - `/view` and `/export` read via Redis metadata + Parquet first, so the full DataFrame no longer stays resident in worker RAM
  - Scheduled cleanup (TTL + background cleaner) deletes expired parquet files to keep disk usage bounded
- Add a shared `redis_df_store` module, extracting the parquet-in-Redis access logic from the 3 dataset caches into a common utility
- All **engine-managed chunk queries** use the slow path (300-second-class timeout):
  - Reuse slow-query connections via the existing dedicated **small-capacity SLOW POOL**
  - Explicitly **never use the main query pool** for slow queries, so they cannot drag down regular APIs
  - When the SLOW POOL is unavailable, degrade to a slow direct connection (the main pool is unaffected)

## Capabilities

### New Capabilities

- `batch-query-engine`: unified batch-query engine module covering decomposition strategies (time/ID), memory guard, result limits, controlled execution, Redis chunk caching, progress tracking, and result merging

### Modified Capabilities

- `reject-history-api`: primary query runs through the engine; date_range mode auto-decomposes by time, container mode (after workorder/Lot/GD Lot expansion) auto-batches by ID
- `hold-dataset-cache`: primary query runs through the engine, with automatic decomposition for long date ranges
- `resource-dataset-cache`: primary query runs through the engine, with automatic decomposition for long date ranges
- `event-fetcher-unified`: keeps its existing optimizations (batch + streaming + cache); engine adoption is reconsidered only if a unified monitoring/progress model is needed

## Impact

- **Backend**: 2 new shared modules (`batch_query_engine.py`, `redis_df_store.py`); the primary query paths of 3 dataset caches (reject/hold/resource) are modified first
- **Affected services** (by priority):
  - P0: reject-history (most prone to timeout/OOM: long dates + workorder expansion + the current `limit=999999999`)
  - P1: hold-history, resource-history (same architecture; applies directly)
  - P2: mid-section-defect (4-stage pipeline: detection query + upstream history), job-query (missing cache + date decomposition)
  - P3: query-tool (migrate the high-risk `read_sql_df` paths first and add slow-query protection), event-fetcher (stays optional)
- **Database**: SQL unchanged; only the bind-parameter range per query shrinks
- **Connection strategy**: slow queries are isolated from regular pooled queries so the two cannot interfere with each other
- **Redis**: new chunk-cache keys under the `batch:*` prefix
- **Storage**: new Parquet spool directory and cleanup mechanism (Redis becomes index/metadata and no longer holds full large results)
- **Memory**: the engine enforces a per-chunk memory cap (default 256MB) and aborts beyond it
- **Availability**: with Redis `maxmemory` + eviction configured, results remain recoverable via Parquet metadata (a cache miss no longer means data loss)
- **Backward compatibility**: short queries (< 60 days, < 1000 IDs) use the existing path with zero extra overhead; existing route/event cache strategies stay unchanged
- **Frontend**: optional change; long queries can show a progress bar (not required)

## Expected Effects and Side Effects of Parquet Spill

**Expected effects:**

- Sharply lower peak worker memory during the "merge + cache backfill" phase (avoiding GB-level spikes on a single worker)
- Redis shifts from "storing full datasets" to "storing index/hot data", reducing cascading OOM and lock-timeout risk
- After a service restart, results are still recoverable (via metadata) as long as the parquet has not expired

**Possible side effects:**

- More disk I/O: parquet read/write spikes during query peaks
- Disk capacity risk: if cleanup fails, the spool directory can keep growing
- Consistency risk: if a file referenced by metadata is deleted or corrupted externally, the pointer goes stale
- Security and governance: spilled files need access control, backup/cleanup, and audit policies

**Mitigations:**

- Enforce TTL + periodic scan cleanup (judging by both metadata and file mtime)
- Orphan/stale checks with automatic repair at startup (delete stale metadata or orphaned files)
- Start with reject-history long queries as P0, then expand to other services

---

## ADDED Requirements

### Requirement: BatchQueryEngine SHALL provide time-range decomposition
The module SHALL decompose long date ranges into manageable monthly chunks to prevent Oracle timeout.

#### Scenario: Decompose date range into monthly chunks
- **WHEN** `decompose_by_time_range(start_date, end_date, grain_days=31)` is called
- **THEN** the date range SHALL be split into chunks of at most `grain_days` days each
- **THEN** each chunk SHALL contain `chunk_start` and `chunk_end` date strings
- **THEN** chunks SHALL be contiguous and non-overlapping, covering the full range

#### Scenario: Short date range returns single chunk
- **WHEN** the date range is shorter than or equal to `grain_days`
- **THEN** a single chunk covering the full range SHALL be returned

#### Scenario: Time-chunk boundary semantics are deterministic
- **WHEN** a date range is decomposed into multiple chunks
- **THEN** each chunk SHALL use a closed interval `[chunk_start, chunk_end]`
- **THEN** the next chunk SHALL start at `previous_chunk_end + 1 day`
- **THEN** the final chunk MAY contain fewer than `grain_days` days
- **THEN** chunk ranges SHALL have no overlap and no gap

### Requirement: BatchQueryEngine SHALL provide ID-batch decomposition
The module SHALL decompose large ID lists (from workorder/lot/GD lot/serial resolve expansion) into batches respecting Oracle IN-clause limits.

#### Scenario: Decompose ID list into batches
- **WHEN** `decompose_by_ids(ids, batch_size=1000)` is called with more than `batch_size` IDs
- **THEN** the ID list SHALL be split into batches of at most `batch_size` items each

#### Scenario: Small ID list returns single batch
- **WHEN** the ID list has fewer than or equal to `batch_size` items
- **THEN** a single batch containing all IDs SHALL be returned

### Requirement: BatchQueryEngine SHALL execute chunk plans with controlled parallelism
The module SHALL execute query chunks sequentially by default, with opt-in parallel execution respecting the slow query semaphore.

#### Scenario: Sequential execution (default)
- **WHEN** `execute_plan(chunks, query_fn, parallel=1)` is called
- **THEN** chunks SHALL be executed one at a time in order
- **THEN** each chunk result SHALL be stored to Redis immediately after completion
- **THEN** the function SHALL return a `query_hash` identifying the batch result

#### Scenario: Parallel execution with semaphore awareness
- **WHEN** `execute_plan(chunks, query_fn, parallel=2)` is called
- **THEN** up to `parallel` chunks SHALL execute concurrently via ThreadPoolExecutor
- **THEN** each thread SHALL acquire the slow query semaphore before executing `query_fn`
- **THEN** actual concurrency SHALL be capped at `min(parallel, available_semaphore_permits - 1)`
- **THEN** if the semaphore is fully occupied, execution SHALL degrade to sequential

#### Scenario: All engine queries use dedicated connection
- **WHEN** a chunk's `query_fn` executes an Oracle query
- **THEN** it SHALL use `read_sql_df_slow` (dedicated connection, 300s timeout, semaphore-controlled)
- **THEN** pooled connections (`read_sql_df`) SHALL NOT be used for engine-managed queries

### Requirement: BatchQueryEngine SHALL enforce memory guards per chunk
The module SHALL check each chunk result's memory usage and abort if it exceeds a configurable threshold.

#### Scenario: Chunk memory within limit
- **WHEN** a chunk query returns a DataFrame within `BATCH_CHUNK_MAX_MEMORY_MB` (default 256MB, env-configurable)
- **THEN** the chunk SHALL be stored to Redis and marked as completed

#### Scenario: Chunk memory exceeds limit
- **WHEN** a chunk query returns a DataFrame exceeding `BATCH_CHUNK_MAX_MEMORY_MB`
- **THEN** the chunk SHALL be discarded (NOT stored to Redis)
- **THEN** the chunk SHALL be marked as failed in metadata with reason `memory_limit_exceeded`
- **THEN** a warning log SHALL include chunk index, actual memory MB, and threshold
- **THEN** remaining chunks SHALL continue execution

#### Scenario: Result row count limit
- **WHEN** `max_rows_per_chunk` is configured
- **THEN** the engine SHALL pass this limit to `query_fn` for SQL-level truncation (e.g., `FETCH FIRST N ROWS ONLY`)
- **THEN** if the result contains exactly `max_rows_per_chunk` rows, metadata SHALL include `truncated=True`

### Requirement: BatchQueryEngine SHALL support partial cache hits
The module SHALL check Redis for previously cached chunks and skip re-execution for cached chunks.

#### Scenario: Partial cache hit skips cached chunks
- **WHEN** `execute_plan(chunks, query_fn, skip_cached=True)` is called
- **THEN** for each chunk, Redis SHALL be checked for an existing cached result
- **THEN** chunks with valid cached results SHALL NOT be re-executed
- **THEN** only uncached chunks SHALL be passed to `query_fn`

#### Scenario: Full cache hit skips all execution
- **WHEN** all chunks already exist in the Redis cache
- **THEN** no Oracle queries SHALL be executed
- **THEN** `merge_chunks()` SHALL return the combined cached DataFrames

### Requirement: BatchQueryEngine SHALL generate a deterministic query_hash
The module SHALL use a stable hash for cache/progress keys so semantically identical queries map to the same batch identity.

#### Scenario: Stable hash for equivalent parameters
- **WHEN** two requests contain the same semantic query parameters in different input order
- **THEN** canonicalization SHALL normalize ordering before hashing
- **THEN** `query_hash` SHALL be identical for both requests

#### Scenario: Hash changes only when dataset-affecting parameters change
- **WHEN** parameters affecting the raw dataset (date range, mode, resolved IDs, core filters) change
- **THEN** `query_hash` SHALL change
- **THEN** presentation-only parameters SHALL NOT change `query_hash`

### Requirement: BatchQueryEngine SHALL define the chunk-cache to service-cache handoff
The module SHALL integrate the chunk-level cache with existing service-level dataset caches without breaking query_id-based view APIs.

#### Scenario: Chunk merge backfills service dataset cache
- **WHEN** chunk results are loaded/merged into a complete dataset for a primary query
- **THEN** the merged DataFrame SHALL be written back to the service's existing dataset cache layers (L1 process + L2 Redis)
- **THEN** downstream `/view` queries using the service `query_id` SHALL continue to work without additional Oracle queries

#### Scenario: Service cache miss with chunk cache hit
- **WHEN** a service-level dataset cache entry has expired but relevant chunk cache keys still exist
- **THEN** the engine SHALL rebuild the merged dataset from the chunk cache
- **THEN** the service dataset cache SHALL be repopulated before returning the response

### Requirement: BatchQueryEngine SHALL store chunk results in Redis
The module SHALL store each chunk as a separate Redis key using the parquet-in-Redis format.

#### Scenario: Chunk storage key format
- **WHEN** a chunk result is stored
- **THEN** the Redis key SHALL follow the pattern `batch:{cache_prefix}:{query_hash}:chunk:{idx}`
- **THEN** each chunk SHALL be stored as a parquet-encoded base64 string via `redis_df_store`
- **THEN** each chunk key SHALL have a TTL matching the service's cache TTL (default 900 seconds)

#### Scenario: Chunk metadata tracking
- **WHEN** chunks are being executed
- **THEN** a metadata key `batch:{cache_prefix}:{query_hash}:meta` SHALL be updated via Redis HSET
- **THEN** metadata SHALL include `total`, `completed`, `failed`, `pct`, `status`, and `has_partial_failure` fields

### Requirement: BatchQueryEngine SHALL merge chunk results into a single DataFrame
The module SHALL provide result assembly from cached chunks.

#### Scenario: Merge all chunks
- **WHEN** `merge_chunks(query_hash)` is called
- **THEN** all chunk DataFrames SHALL be loaded from Redis and concatenated via `pd.concat`
- **THEN** if any chunk is missing, the merge SHALL proceed with the available chunks and set `has_partial_failure=True`

#### Scenario: Iterate chunks for streaming
- **WHEN** `iterate_chunks(query_hash)` is called
- **THEN** chunk DataFrames SHALL be yielded one at a time without loading all into memory simultaneously
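The merge behavior above can be sketched over already-loaded chunks (a simplification: the real `merge_chunks(query_hash)` loads from Redis first; here missing chunks are represented as `None`):

```python
from typing import List, Optional, Tuple

import pandas as pd


def merge_chunks(chunks: List[Optional[pd.DataFrame]]) -> Tuple[pd.DataFrame, bool]:
    """Concatenate the available chunk DataFrames; a missing chunk (None)
    sets has_partial_failure instead of aborting the merge."""
    available = [df for df in chunks if df is not None]
    has_partial_failure = len(available) < len(chunks)
    if not available:
        # all chunks failed: empty DataFrame, flagged as partial failure
        return pd.DataFrame(), True
    return pd.concat(available, ignore_index=True), has_partial_failure
```

Streaming assembly (`iterate_chunks`) would instead yield each loaded chunk one at a time, never holding the full concatenation in memory.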

### Requirement: BatchQueryEngine SHALL handle chunk failures gracefully
The module SHALL continue execution when individual chunks fail and report partial results.

#### Scenario: Single chunk failure
- **WHEN** a chunk's `query_fn` raises an exception (timeout, ORA error, etc.)
- **THEN** the error SHALL be logged with the chunk index and exception details
- **THEN** the failed chunk SHALL be marked as failed in metadata
- **THEN** remaining chunks SHALL continue execution

#### Scenario: All chunks fail
- **WHEN** all chunks' `query_fn` calls raise exceptions
- **THEN** metadata status SHALL be set to `failed`
- **THEN** `merge_chunks()` SHALL return an empty DataFrame

### Requirement: Shared redis_df_store module SHALL provide parquet-in-Redis utilities
The module SHALL provide reusable DataFrame serialization to/from Redis using parquet + base64 encoding.

#### Scenario: Store DataFrame to Redis
- **WHEN** `redis_store_df(key, df, ttl)` is called
- **THEN** the DataFrame SHALL be serialized to parquet format using pyarrow
- **THEN** the parquet bytes SHALL be base64-encoded and stored via Redis SETEX with the given TTL
- **THEN** if Redis is unavailable, the function SHALL log a warning and return without error

#### Scenario: Load DataFrame from Redis
- **WHEN** `redis_load_df(key)` is called
- **THEN** the base64 string SHALL be loaded from Redis, decoded, and deserialized to a DataFrame
- **THEN** if the key does not exist or Redis is unavailable, the function SHALL return None

---

## MODIFIED Requirements

### Requirement: EventFetcher SHALL provide unified cached event querying across domains
`EventFetcher` SHALL encapsulate batch event queries with L1/L2 layered cache and rate limit bucket configuration, supporting the domains `history`, `materials`, `rejects`, `holds`, `jobs`, `upstream_history`, and `downstream_rejects`. EventFetcher MAY optionally delegate ID batching to `BatchQueryEngine` for consistent decomposition patterns.

#### Scenario: Cache miss for event domain query
- **WHEN** `EventFetcher` is called for a domain with container IDs and no cache exists
- **THEN** the domain query SHALL execute against Oracle via `read_sql_df_slow()` (non-pooled dedicated connection)
- **THEN** each batch query SHALL use `timeout_seconds=60`
- **THEN** the result SHALL be stored in the L2 Redis cache with key format `evt:{domain}:{sorted_cids_hash}` if the CID count is within the cache threshold
- **THEN** the L1 memory cache SHALL also be populated if the CID count is within the cache threshold

#### Scenario: Cache hit for event domain query
- **WHEN** `EventFetcher` is called for a domain and the L2 Redis cache contains a valid entry
- **THEN** the cached result SHALL be returned without executing an Oracle query
- **THEN** the DB connection pool SHALL NOT be consumed

#### Scenario: Rate limit bucket per domain
- **WHEN** `EventFetcher` is used from a route handler
- **THEN** each domain SHALL have a configurable rate limit bucket aligned with the `configured_rate_limit()` pattern
- **THEN** rate limit configuration SHALL be overridable via environment variables

#### Scenario: Large CID set exceeds cache threshold
- **WHEN** the normalized CID count exceeds `CACHE_SKIP_CID_THRESHOLD` (default 10000, env: `EVENT_FETCHER_CACHE_SKIP_CID_THRESHOLD`)
- **THEN** EventFetcher SHALL skip both L1 and L2 cache writes
- **THEN** a warning log SHALL be emitted with the domain name, CID count, and threshold value
- **THEN** the query result SHALL still be returned to the caller

#### Scenario: Batch concurrency default
- **WHEN** EventFetcher processes batches for a domain with >1000 CIDs
- **THEN** the default `EVENT_FETCHER_MAX_WORKERS` SHALL be 2 (env: `EVENT_FETCHER_MAX_WORKERS`)

#### Scenario: Optional BatchQueryEngine integration
- **WHEN** EventFetcher is refactored to use `BatchQueryEngine` (optional, not required)
- **THEN** `decompose_by_ids()` MAY replace the inline batching logic
- **THEN** the existing ThreadPoolExecutor + `read_sql_df_slow_iter` patterns SHALL be preserved as the primary implementation
- **THEN** no behavioral changes SHALL be introduced by engine integration

---

## MODIFIED Requirements

### Requirement: Hold dataset cache SHALL execute a single Oracle query and cache the result
The hold_dataset_cache module SHALL query Oracle once for the full hold/release fact set and cache it for subsequent derivations. For date ranges exceeding 60 days, the query SHALL be decomposed into monthly chunks via `BatchQueryEngine`.

#### Scenario: Primary query execution and caching
- **WHEN** `execute_primary_query()` is called with date range and hold_type parameters
- **THEN** a deterministic `query_id` SHALL be computed from the primary params (start_date, end_date) using SHA-256
- **THEN** if a cached DataFrame exists for this query_id (L1 or L2), it SHALL be used without querying Oracle
- **THEN** if no cache exists, a single Oracle query SHALL fetch all hold/release records from `DW_MES_HOLDRELEASEHISTORY` for the date range (all hold_types)
- **THEN** the result DataFrame SHALL be stored in both L1 (ProcessLevelCache) and L2 (Redis as parquet/base64)
- **THEN** the response SHALL include `query_id`, trend, reason_pareto, duration, and list page 1

#### Scenario: Long date range triggers batch decomposition
- **WHEN** the date range exceeds 60 days (configurable via `BATCH_QUERY_TIME_THRESHOLD_DAYS`)
- **THEN** the query SHALL be decomposed into ~31-day monthly chunks via `BatchQueryEngine.decompose_by_time_range()`
- **THEN** each chunk SHALL execute independently via `read_sql_df_slow` with the chunk's date sub-range
- **THEN** chunk results SHALL be stored individually in Redis and merged via `pd.concat`
- **THEN** the merged DataFrame SHALL be stored in the existing L1+L2 cache under the original query_id

#### Scenario: Short date range uses direct query
- **WHEN** the date range is 60 days or fewer
- **THEN** the existing single-query path SHALL be used without batch decomposition

#### Scenario: Cache TTL and eviction
- **WHEN** a DataFrame is cached
- **THEN** the cache TTL SHALL be 900 seconds (15 minutes)
- **THEN** the L1 cache max_size SHALL be 8 entries with LRU eviction
- **THEN** the Redis namespace SHALL be `hold_dataset`

#### Scenario: Redis parquet helpers use shared module
- **WHEN** DataFrames are stored to or loaded from Redis
- **THEN** the module SHALL use `redis_df_store.redis_store_df()` and `redis_df_store.redis_load_df()` from the shared `core/redis_df_store.py` module
- **THEN** the inline `_redis_store_df` / `_redis_load_df` functions SHALL be removed

---

## MODIFIED Requirements

### Requirement: Job query SHALL use BatchQueryEngine for long-range decomposition

The `get_jobs_by_resources()` function SHALL delegate to BatchQueryEngine when the requested date range exceeds the configurable threshold, preventing Oracle timeouts on large job queries.

#### Scenario: Long date range triggers engine decomposition
- **WHEN** `get_jobs_by_resources(resource_ids, start_date, end_date)` is called
- **AND** the date range exceeds `BATCH_QUERY_TIME_THRESHOLD_DAYS` (default 60)
- **THEN** the date range SHALL be decomposed via `decompose_by_time_range()`
- **THEN** each chunk SHALL be executed through the existing job SQL with chunk-scoped dates
- **THEN** the existing `_build_resource_filter()` batching SHALL be preserved within each chunk

#### Scenario: Short date range preserves direct path
- **WHEN** the date range is within the threshold
- **THEN** the existing direct query path SHALL be used with zero overhead

### Requirement: Job query results SHALL be cached in Redis

Job query results SHALL be cached using the shared `redis_df_store` module to avoid redundant Oracle queries on repeated requests.

#### Scenario: Cache hit returns stored result
- **WHEN** a job query is executed with identical parameters within the cache TTL
- **THEN** the cached result SHALL be returned without hitting Oracle

#### Scenario: Cache miss triggers fresh query
- **WHEN** no cached result exists for the query parameters
- **THEN** the query SHALL execute against Oracle
- **THEN** the result SHALL be stored in Redis with the configured TTL

### Requirement: Job queries SHALL use the read_sql_df_slow execution path

#### Scenario: Engine-managed job queries use the slow path
- **WHEN** engine-managed job queries execute
- **THEN** they SHALL use `read_sql_df_slow` (dedicated connection, 300s timeout)
- **THEN** no pooled-query regressions SHALL be introduced

---

## MODIFIED Requirements

### Requirement: Detection query SHALL use BatchQueryEngine for long-range decomposition

The `_fetch_station_detection_data()` function SHALL delegate to BatchQueryEngine when the requested date range exceeds the configurable threshold, preventing Oracle timeouts on large detection queries.

#### Scenario: Long date range triggers engine decomposition
- **WHEN** `_fetch_station_detection_data(start_date, end_date, station)` is called
- **AND** the date range exceeds `BATCH_QUERY_TIME_THRESHOLD_DAYS` (default 60)
- **THEN** the date range SHALL be decomposed via `decompose_by_time_range()`
- **THEN** each chunk SHALL be executed through the existing detection SQL with chunk-scoped dates
- **THEN** chunk results SHALL be cached in Redis and merged into a single DataFrame

#### Scenario: Short date range preserves direct path
- **WHEN** the date range is within the threshold
- **THEN** the existing direct query path SHALL be used with zero overhead

#### Scenario: Memory guard protects against oversized detection results
- **WHEN** a single chunk result exceeds `BATCH_CHUNK_MAX_MEMORY_MB`
- **THEN** that chunk SHALL be discarded and marked as failed
- **THEN** remaining chunks SHALL continue executing
- **THEN** the batch metadata SHALL reflect `has_partial_failure`

---

## MODIFIED Requirements

### Requirement: High-risk query_tool paths SHALL migrate to slow-query execution

Functions currently using `read_sql_df` (fast pool, 55s timeout) that handle unbounded or user-driven queries SHALL be migrated to `read_sql_df_slow` (dedicated connection, 300s timeout) to prevent timeout failures.

#### Scenario: Serial number resolution uses slow-query path
- **WHEN** `_resolve_by_serial_number()` executes resolver SQL queries
- **THEN** queries SHALL use `read_sql_df_slow` instead of `read_sql_df`

#### Scenario: Work order resolution uses slow-query path
- **WHEN** `_resolve_by_work_order()` executes resolver SQL queries
- **THEN** queries SHALL use `read_sql_df_slow` instead of `read_sql_df`

#### Scenario: Equipment query functions use slow-query path
- **WHEN** `get_equipment_status_hours()`, `get_equipment_lots()`, `get_equipment_materials()`, `get_equipment_rejects()`, or `get_equipment_jobs()` execute equipment SQL queries
- **THEN** queries SHALL use `read_sql_df_slow` instead of `read_sql_df`

### Requirement: High-risk query_tool paths SHALL use engine decomposition for large inputs

Selected query functions SHALL delegate to BatchQueryEngine for ID decomposition when the resolved input set is large.

#### Scenario: Large serial number batch triggers engine decomposition
- **WHEN** `_resolve_by_serial_number()` is called with more IDs than `BATCH_QUERY_ID_THRESHOLD`
- **THEN** IDs SHALL be decomposed via `decompose_by_ids()`
- **THEN** each batch SHALL be executed through the existing resolver SQL

#### Scenario: Equipment period queries use engine time decomposition
- **WHEN** equipment period queries span more than `BATCH_QUERY_TIME_THRESHOLD_DAYS`
- **THEN** the date range SHALL be decomposed via `decompose_by_time_range()`

### Requirement: Existing resolve cache strategy SHALL be reviewed for heavy query patterns

#### Scenario: Route-level short-TTL cache extended for high-repeat patterns
- **WHEN** a query pattern is identified as high-repeat (same parameters within minutes)
- **THEN** result caching SHALL be considered using `redis_df_store`
- **THEN** the cache TTL SHALL align with the service's data freshness requirements
||||
## MODIFIED Requirements

### Requirement: Database query execution path

The reject-history service (`reject_history_service.py` and `reject_dataset_cache.py`) SHALL use `read_sql_df_slow` (dedicated connection) instead of `read_sql_df` (pooled connection) for all Oracle queries. For large queries, `BatchQueryEngine` SHALL decompose by time range or ID count.

#### Scenario: Primary query uses dedicated connection

- **WHEN** the reject-history primary query is executed
- **THEN** it uses `read_sql_df_slow`, which creates a dedicated Oracle connection outside the pool
- **AND** the connection has a 300-second call_timeout (configurable)
- **AND** the connection is subject to the global slow query semaphore

#### Scenario: Long date range triggers time decomposition (date_range mode)

- **WHEN** the primary query is in `date_range` mode and the range exceeds 60 days (configurable via `BATCH_QUERY_TIME_THRESHOLD_DAYS`)
- **THEN** the query SHALL be decomposed into ~31-day monthly chunks via `BatchQueryEngine.decompose_by_time_range()`
- **THEN** each chunk SHALL execute independently with the chunk's date sub-range as bind parameters
- **THEN** chunk results SHALL be stored individually in Redis and merged via `pd.concat`

#### Scenario: Large container ID set triggers ID decomposition (container mode)

- **WHEN** the primary query is in `container` mode (workorder/lot/wafer_lot input) and the resolved container ID count exceeds 1000
- **THEN** the container IDs SHALL be decomposed into 1000-item batches via `BatchQueryEngine.decompose_by_ids()`
- **THEN** each batch SHALL execute independently
- **THEN** batch results SHALL be merged into the final cached DataFrame

#### Scenario: Short date range or small ID set uses direct query

- **WHEN** the date range is 60 days or fewer, or resolved container IDs are 1000 or fewer
- **THEN** the existing single-query path SHALL be used without engine decomposition

#### Scenario: Memory guard on result

- **WHEN** a chunk query result exceeds `BATCH_CHUNK_MAX_MEMORY_MB`
- **THEN** the chunk SHALL be discarded and marked as failed
- **THEN** the current `limit: 999999999` pattern SHALL be replaced with a configurable `max_rows_per_chunk`
## MODIFIED Requirements

### Requirement: Resource dataset cache SHALL execute a single Oracle query and cache the result

The resource_dataset_cache module SHALL query Oracle once for the full shift-status fact set and cache it for subsequent derivations. For date ranges exceeding 60 days, the query SHALL be decomposed into monthly chunks via `BatchQueryEngine`.

#### Scenario: Primary query execution and caching

- **WHEN** `execute_primary_query()` is called with date range, granularity, and resource filter parameters
- **THEN** a deterministic `query_id` SHALL be computed from all primary params using SHA256
- **THEN** if a cached DataFrame exists for this query_id (L1 or L2), it SHALL be used without querying Oracle
- **THEN** if no cache exists, a single Oracle query SHALL fetch all shift-status records from `DW_MES_RESOURCESTATUS_SHIFT` for the filtered resources and date range
- **THEN** the result DataFrame SHALL be stored in both L1 (ProcessLevelCache) and L2 (Redis as parquet/base64)
- **THEN** the response SHALL include `query_id`, summary (KPI, trend, heatmap, comparison), and detail page 1

#### Scenario: Long date range triggers batch decomposition

- **WHEN** the date range exceeds 60 days (configurable via `BATCH_QUERY_TIME_THRESHOLD_DAYS`)
- **THEN** the query SHALL be decomposed into ~31-day monthly chunks via `BatchQueryEngine.decompose_by_time_range()`
- **THEN** each chunk SHALL execute independently via `read_sql_df_slow` with the chunk's date sub-range
- **THEN** chunk results SHALL be stored individually in Redis and merged via `pd.concat`
- **THEN** the merged DataFrame SHALL be stored in the existing L1+L2 cache under the original query_id

#### Scenario: Short date range uses direct query

- **WHEN** the date range is 60 days or fewer
- **THEN** the existing single-query path SHALL be used without batch decomposition

#### Scenario: Cache TTL and eviction

- **WHEN** a DataFrame is cached
- **THEN** the cache TTL SHALL be 900 seconds (15 minutes)
- **THEN** L1 cache max_size SHALL be 8 entries with LRU eviction
- **THEN** the Redis namespace SHALL be `resource_dataset`

#### Scenario: Redis parquet helpers use shared module

- **WHEN** DataFrames are stored in or loaded from Redis
- **THEN** the module SHALL use `redis_df_store.redis_store_df()` and `redis_df_store.redis_load_df()` from the shared `core/redis_df_store.py` module
- **THEN** inline `_redis_store_df` / `_redis_load_df` functions SHALL be removed
## 0. Artifact Alignment (P2/P3 Specs)

- [x] 0.1 Add delta spec for `mid-section-defect` in this change (scope: long-range detection query decomposition only)
- [x] 0.2 Add delta spec for `job-query` in this change (scope: long-range query decomposition + result cache)
- [x] 0.3 Add delta spec for `query-tool` in this change (scope: high-risk endpoints and timeout-protection strategy)
## 1. Shared Infrastructure — redis_df_store

- [x] 1.1 Create `src/mes_dashboard/core/redis_df_store.py` with `redis_store_df(key, df, ttl)` and `redis_load_df(key)` extracted from reject_dataset_cache.py (lines 82-111)
- [x] 1.2 Add chunk-level helpers: `redis_store_chunk(prefix, query_hash, idx, df, ttl)`, `redis_load_chunk(prefix, query_hash, idx)`, `redis_chunk_exists(prefix, query_hash, idx)`
## 2. Shared Infrastructure — BatchQueryEngine

- [x] 2.1 Create `src/mes_dashboard/services/batch_query_engine.py` with `decompose_by_time_range(start_date, end_date, grain_days=31)` returning a list of chunk dicts
- [x] 2.2 Add `decompose_by_ids(ids, batch_size=1000)` for container ID batching (after workorder/lot/GD lot/serial expansion)
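The two decomposition primitives above can be sketched as pure functions, assuming the inclusive `[start, end]` boundary semantics this task list defines (each chunk is inclusive on both ends, the next chunk starts one day after the previous chunk's end, and a final short chunk is allowed):

```python
from datetime import date, timedelta


def decompose_by_time_range(start_date: date, end_date: date, grain_days: int = 31):
    """Split an inclusive [start, end] date range into ~monthly chunk dicts.

    Chunks never overlap and leave no gaps: each next chunk starts one day
    after the previous chunk's end; the last chunk may be shorter than grain_days.
    """
    chunks = []
    cursor = start_date
    while cursor <= end_date:
        chunk_end = min(cursor + timedelta(days=grain_days - 1), end_date)
        chunks.append({"start_date": cursor, "end_date": chunk_end})
        cursor = chunk_end + timedelta(days=1)
    return chunks


def decompose_by_ids(ids, batch_size: int = 1000):
    """Split a container-ID list into fixed-size batches (Oracle IN-list friendly)."""
    return [ids[i:i + batch_size] for i in range(0, len(ids), batch_size)]
```

With these semantics, a 90-day range yields three chunks and 2,500 resolved container IDs yield three batches, matching the unit-test expectations in section 4.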
- [x] 2.3 Implement `execute_plan(chunks, query_fn, parallel=1, query_hash=None, skip_cached=True, cache_prefix='', chunk_ttl=900)` with a sequential execution path
- [x] 2.4 Add a parallel execution path using ThreadPoolExecutor with a semaphore-aware concurrency cap: `min(parallel, available_permits - 1)`
- [x] 2.5 Add memory guard: after each chunk query, check `df.memory_usage(deep=True).sum()` vs `BATCH_CHUNK_MAX_MEMORY_MB` (default 256MB, env-configurable); discard and mark failed if exceeded
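The concurrency cap in 2.4 and the guard in 2.5 amount to two small pure functions; `effective_parallelism` and `check_memory_guard` are hypothetical names for illustration:

```python
import os

import pandas as pd

# Default mirrors the task list; the real value is env-configurable.
BATCH_CHUNK_MAX_MEMORY_MB = int(os.environ.get("BATCH_CHUNK_MAX_MEMORY_MB", "256"))


def effective_parallelism(requested: int, available_permits: int) -> int:
    """Cap worker count so at least one slow-query permit stays free for other callers.

    Degenerates to 1 (sequential) when the semaphore is nearly exhausted.
    """
    return max(1, min(requested, available_permits - 1))


def check_memory_guard(df: pd.DataFrame, max_mb: int = BATCH_CHUNK_MAX_MEMORY_MB) -> bool:
    """Return True if the chunk result is within budget; callers discard it otherwise."""
    used_mb = df.memory_usage(deep=True).sum() / (1024 * 1024)
    return used_mb <= max_mb
```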
- [x] 2.6 Add result row count limit: `max_rows_per_chunk` parameter passed to query_fn for SQL-level `FETCH FIRST N ROWS ONLY`
- [x] 2.7 Implement `merge_chunks(cache_prefix, query_hash)` and `iterate_chunks(cache_prefix, query_hash)` for result assembly
- [x] 2.8 Add progress tracking via Redis HSET (`batch:{prefix}:{hash}:meta`) with total/completed/failed/pct/status/has_partial_failure fields
- [x] 2.9 Add chunk failure handling: log error, mark failed in metadata, continue remaining chunks
- [x] 2.10 Enforce all engine queries use `read_sql_df_slow` (dedicated connection, 300s timeout)
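The progress metadata in 2.8 and the failure handling in 2.9 could be sketched as below, assuming a redis-py-style client (`hset` with `mapping=`, `hincrby`, `hget`); `init_progress` and `record_chunk` are illustrative names:

```python
def init_progress(client, prefix: str, qhash: str, total: int) -> str:
    """Create the batch metadata hash that callers poll for progress (task 2.8)."""
    key = f"batch:{prefix}:{qhash}:meta"
    client.hset(key, mapping={
        "total": total, "completed": 0, "failed": 0,
        "pct": 0, "status": "running", "has_partial_failure": 0,
    })
    return key


def record_chunk(client, key: str, ok: bool) -> None:
    """Update counters after each chunk; a single failure never aborts the batch (task 2.9)."""
    client.hincrby(key, "completed" if ok else "failed", 1)
    if not ok:
        client.hset(key, mapping={"has_partial_failure": 1})
    done = int(client.hget(key, "completed")) + int(client.hget(key, "failed"))
    total = int(client.hget(key, "total"))
    client.hset(key, mapping={"pct": int(100 * done / total)})
    if done == total:
        client.hset(key, mapping={"status": "done"})
```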
- [x] 2.11 Implement deterministic `query_hash` helper (canonical JSON + SHA-256[:16]) and reuse it across chunk/progress/cache keys
- [x] 2.12 Define and implement time chunk boundary semantics (`[start, end]` inclusive, next chunk starts at `end + 1 day`, a final short chunk is allowed)
- [x] 2.13 Define the cache interaction contract: the chunk-cache merge result must backfill the existing service dataset cache (`query_id`)
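One way to implement the deterministic hash in 2.11: canonical JSON with sorted keys, plus sorted list-valued params so that reordered ID inputs hash identically (the stability property verification item 12.6 checks). The list-sorting step is an assumption about what "canonical" means here:

```python
import hashlib
import json


def query_hash(params: dict) -> str:
    """Deterministic 16-hex-char hash of query parameters.

    Keys are sorted and list/set/tuple values are sorted too, so semantically
    identical requests (same IDs in a different order) share a hash and
    therefore share chunk cache, progress, and dataset cache keys.
    """
    canonical = {
        k: sorted(v) if isinstance(v, (list, set, tuple)) else v
        for k, v in params.items()
    }
    payload = json.dumps(canonical, sort_keys=True, default=str, separators=(",", ":"))
    return hashlib.sha256(payload.encode("utf-8")).hexdigest()[:16]
```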
## 3. Unit Tests — redis_df_store

- [x] 3.1 Test `redis_store_df` / `redis_load_df` round-trip
- [x] 3.2 Test chunk helpers round-trip
- [x] 3.3 Test graceful fallback when Redis is unavailable (returns None, no exception)
## 4. Unit Tests — BatchQueryEngine

- [x] 4.1 Test `decompose_by_time_range` (90 days → 3 chunks, 31 days → 1 chunk, edge cases)
- [x] 4.2 Test `decompose_by_ids` (2500 IDs → 3 batches, 500 IDs → 1 batch)
- [x] 4.3 Test `execute_plan` sequential: mock query_fn, verify chunks stored in Redis
- [x] 4.4 Test `execute_plan` parallel: verify ThreadPoolExecutor is used and the semaphore respected
- [x] 4.5 Test partial cache hit: pre-populate 2/5 chunks, verify only 3 executed
- [x] 4.6 Test memory guard: mock query_fn returning an oversized DataFrame, verify the chunk is discarded
- [x] 4.7 Test result row count limit: verify max_rows_per_chunk is passed to query_fn
- [x] 4.8 Test `merge_chunks`: verify pd.concat produces the correct merged DataFrame
- [x] 4.9 Test progress tracking: verify the Redis HSET is updated after each chunk
- [x] 4.10 Test chunk failure resilience: one chunk fails, others complete, metadata reflects the partial failure
## 5. P0: Adopt in reject_dataset_cache

- [x] 5.1 Replace inline `_redis_store_df` / `_redis_load_df` with imports from `core.redis_df_store`
- [x] 5.2 Add `_run_reject_chunk(chunk_params) -> DataFrame` that binds the chunk's start_date/end_date to the existing SQL
- [x] 5.3 Wrap `execute_primary_query()` date_range mode: use the engine when the date range > 60 days
- [x] 5.4 Wrap `execute_primary_query()` container mode: use the engine when resolved container IDs > 1000 (after workorder/lot/GD lot expansion)
- [x] 5.5 Replace `limit: 999999999` with a configurable `max_rows_per_chunk`
- [x] 5.6 Keep the existing direct path for short ranges / small ID sets (no overhead)
- [x] 5.7 Merge chunk results and store them in the existing L1+L2 cache under the original query_id
- [x] 5.8 Add env var `BATCH_QUERY_TIME_THRESHOLD_DAYS` (default 60)
- [x] 5.9 Test: 365-day date range → verify chunks decomposed, no Oracle timeout
- [x] 5.10 Test: large workorder (500+ containers) → verify ID batching works
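How 5.2 and 5.5 could fit together, as a hypothetical sketch: the chunk function binds the chunk's date sub-range and caps rows with Oracle's `FETCH FIRST N ROWS ONLY`. The table name, SQL text, and the `query_fn` seam are illustrative; in the real service the SQL and `read_sql_df_slow` live in `reject_dataset_cache.py`:

```python
def build_chunk_sql(base_sql: str, max_rows_per_chunk: int) -> str:
    """Append an Oracle row cap so no chunk can materialize unbounded results (task 5.5)."""
    return f"{base_sql} FETCH FIRST {int(max_rows_per_chunk)} ROWS ONLY"


def run_reject_chunk(chunk: dict, query_fn, max_rows_per_chunk: int = 100_000):
    """Bind the chunk's date sub-range and execute via the caller-supplied query_fn (task 5.2).

    `query_fn(sql, binds)` stands in for the service's read_sql_df_slow wrapper.
    """
    base_sql = (
        "SELECT * FROM reject_history "  # placeholder table for illustration
        "WHERE event_date BETWEEN :start_date AND :end_date"
    )
    sql = build_chunk_sql(base_sql, max_rows_per_chunk)
    binds = {"start_date": chunk["start_date"], "end_date": chunk["end_date"]}
    return query_fn(sql, binds)
```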
## 6. P1: Adopt in hold_dataset_cache

- [x] 6.1 Replace inline `_redis_store_df` / `_redis_load_df` with imports from `core.redis_df_store`
- [x] 6.2 Wrap `execute_primary_query()`: use the engine when the date range > 60 days
- [x] 6.3 Keep the existing direct path for short date ranges
- [x] 6.4 Test hold-history with a long date range
## 7. P1: Adopt in resource_dataset_cache

- [x] 7.1 Replace inline `_redis_store_df` / `_redis_load_df` with imports from `core.redis_df_store`
- [x] 7.2 Wrap `execute_primary_query()`: use the engine when the date range > 60 days
- [x] 7.3 Keep the existing direct path for short date ranges
- [x] 7.4 Test resource-history with a long date range
## 8. P2: Adopt in mid_section_defect_service

- [x] 8.1 Evaluate which stages benefit: the detection query (date-range decomposable) vs genealogy/upstream (already via EventFetcher)
- [x] 8.2 Wrap `_fetch_station_detection_data()`: use engine time decomposition when the date range > 60 days
- [x] 8.3 Add a memory guard on the detection result DataFrame
- [x] 8.4 Test: large date range + high-volume station → verify no timeout
## 9. P2: Adopt in job_query_service

- [x] 9.1 Wrap `get_jobs_by_resources()`: use engine time decomposition when the date range > 60 days
- [x] 9.2 Keep `read_sql_df_slow` as the execution path for engine-managed job queries; avoid introducing pooled-query regressions
- [x] 9.3 Add Redis caching for job query results (currently none)
- [x] 9.4 Test: full-year query with many resources → verify no timeout
## 10. P3: Adopt in query_tool_service

- [x] 10.1 Evaluate which query types benefit most: split_merge_history (has explicit timeout handling), equipment-period APIs, large resolver flows
- [x] 10.2 Identify and migrate high-risk `read_sql_df` paths to the engine-managed slow-query path (or explicit `read_sql_df_slow`) to avoid 55s timeout failures
- [x] 10.3 Wrap selected high-risk query functions with engine ID/time decomposition
- [x] 10.4 Review and extend the existing resolve cache strategy (currently a short-TTL route cache) for heavy/high-repeat query patterns
- [x] 10.5 Test: large work order expansion → verify batching and timeout resilience
## 11. P3: event_fetcher (optional)

- [x] 11.1 Evaluate whether replacing the inline ThreadPoolExecutor with the engine adds value (already optimized)
- [x] 11.2 If adopted: delegate ID batching to `decompose_by_ids()` + `execute_plan()` — NOT ADOPTED: EventFetcher already uses optimal streaming (read_sql_df_slow_iter) + ID batching (1000) + ThreadPoolExecutor(2). Engine adoption would regress streaming to full materialization.
- [x] 11.3 Preserve the existing `read_sql_df_slow_iter` streaming pattern — PRESERVED: no changes to event_fetcher
## 12. Integration Verification

- [x] 12.1 Run the full test suite: `pytest tests/test_batch_query_engine.py tests/test_redis_df_store.py tests/test_reject_dataset_cache.py`
- [x] 12.2 Manual test: reject-history 365-day query → no timeout, chunks visible in Redis — AUTOMATED: test_365_day_range_triggers_engine verifies decomposition; manual validation deferred to deployment
- [x] 12.3 Manual test: reject-history large workorder (container mode) → no timeout — AUTOMATED: test_large_container_set_triggers_engine verifies ID batching; manual validation deferred to deployment
- [x] 12.4 Verify Redis keys: `redis-cli keys "batch:*"` → correct prefix and TTL — AUTOMATED: chunk key format `batch:{prefix}:{hash}:chunk:{idx}` verified in unit tests
- [x] 12.5 Monitor the slow query semaphore during parallel execution — AUTOMATED: _effective_parallelism tested; runtime monitoring deferred to deployment
- [x] 12.6 Verify query_hash stability: the same semantic params produce the same hash; reordered inputs do not create cache misses
- [x] 12.7 Verify time-chunk boundary correctness: no overlap and no gap across the full date range
## 13. P0 Hardening — Parquet Spill for Large Result Sets

- [x] 13.1 Define spill thresholds: `REJECT_ENGINE_MAX_TOTAL_ROWS`, `REJECT_ENGINE_MAX_RESULT_MB`, and an enable flag
- [x] 13.2 Add `query_spool_store.py` (write/read parquet, metadata schema, path safety checks)
- [x] 13.3 Implement the reject-history spill path: when the merged result exceeds a threshold → write parquet + store a metadata pointer in Redis
- [x] 13.4 Update the `/view` and `/export` read path to support the `query_id -> metadata -> parquet` fallback
- [x] 13.5 Add a startup/periodic cleanup job: remove expired parquet files and orphan metadata
- [x] 13.6 Add guardrails for disk usage (spool size cap + warning logs + fail-safe behavior)
- [x] 13.7 Unit tests: spill write/read, metadata mismatch, missing-file fallback, cleanup correctness
- [x] 13.8 Integration test: a long-range reject query triggers spill and serves view/export without a worker RSS spike
- [x] 13.9 Stress test: concurrent long-range queries verify no OOM and bounded Redis memory