DashBoard/openspec/changes/archive/2026-03-02-unified-batch-query-redis-cache/tasks.md

## 0. Artifact Alignment (P2/P3 Specs)
- [x] 0.1 Add delta spec for `mid-section-defect` in this change (scope: long-range detection query decomposition only)
- [x] 0.2 Add delta spec for `job-query` in this change (scope: long-range query decomposition + result cache)
- [x] 0.3 Add delta spec for `query-tool` in this change (scope: high-risk endpoints and timeout-protection strategy)
## 1. Shared Infrastructure — redis_df_store
- [x] 1.1 Create `src/mes_dashboard/core/redis_df_store.py` with `redis_store_df(key, df, ttl)` and `redis_load_df(key)` extracted from reject_dataset_cache.py (lines 82-111)
- [x] 1.2 Add chunk-level helpers: `redis_store_chunk(prefix, query_hash, idx, df, ttl)`, `redis_load_chunk(prefix, query_hash, idx)`, `redis_chunk_exists(prefix, query_hash, idx)`
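The helpers above can be sketched as follows. This is a minimal illustration, not the actual implementation: the function names follow tasks 1.1/1.2 and the key layout follows the `batch:{prefix}:{hash}:chunk:{idx}` format verified in 12.4, but the pickle serialization and the injected client are assumptions (a tiny in-memory fake stands in for Redis so the example is self-contained).

```python
# Hedged sketch of the redis_df_store helpers (tasks 1.1/1.2).
# Serialization strategy and injected client are illustrative assumptions.
import pickle

import pandas as pd


def redis_store_df(client, key: str, df: pd.DataFrame, ttl: int) -> bool:
    """Serialize a DataFrame and store it with a TTL; caching is best-effort."""
    try:
        client.setex(key, ttl, pickle.dumps(df))
        return True
    except Exception:
        return False  # graceful fallback: never let cache errors break the query


def redis_load_df(client, key: str):
    """Load a cached DataFrame, or None on miss/error (the task 3.3 fallback)."""
    try:
        raw = client.get(key)
        return pickle.loads(raw) if raw is not None else None
    except Exception:
        return None


def chunk_key(prefix: str, query_hash: str, idx: int) -> str:
    # Key layout matching the format checked in task 12.4
    return f"batch:{prefix}:{query_hash}:chunk:{idx}"


class FakeRedis:
    """Minimal in-memory stand-in for demonstration (TTL ignored)."""

    def __init__(self):
        self._data = {}

    def setex(self, key, ttl, value):
        self._data[key] = value

    def get(self, key):
        return self._data.get(key)


client = FakeRedis()
df = pd.DataFrame({"serial": ["A1", "A2"], "qty": [3, 5]})
redis_store_df(client, chunk_key("reject", "ab12", 0), df, ttl=900)
restored = redis_load_df(client, chunk_key("reject", "ab12", 0))
```

In production the client would be a real `redis.Redis` connection; injecting it keeps the round-trip testable without a server.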
## 2. Shared Infrastructure — BatchQueryEngine
- [x] 2.1 Create `src/mes_dashboard/services/batch_query_engine.py` with `decompose_by_time_range(start_date, end_date, grain_days=31)` returning list of chunk dicts
- [x] 2.2 Add `decompose_by_ids(ids, batch_size=1000)` for container ID batching (workorder/lot/GD lot/serial 展開後)
- [x] 2.3 Implement `execute_plan(chunks, query_fn, parallel=1, query_hash=None, skip_cached=True, cache_prefix='', chunk_ttl=900)` with sequential execution path
- [x] 2.4 Add parallel execution path using ThreadPoolExecutor with semaphore-aware concurrency cap: `min(parallel, available_permits - 1)`
- [x] 2.5 Add memory guard: after each chunk query, check `df.memory_usage(deep=True).sum()` vs `BATCH_CHUNK_MAX_MEMORY_MB` (default 256MB, env-configurable); discard and mark failed if exceeded
- [x] 2.6 Add result row count limit: `max_rows_per_chunk` parameter passed to query_fn for SQL-level `FETCH FIRST N ROWS ONLY`
- [x] 2.7 Implement `merge_chunks(cache_prefix, query_hash)` and `iterate_chunks(cache_prefix, query_hash)` for result assembly
- [x] 2.8 Add progress tracking via Redis HSET (`batch:{prefix}:{hash}:meta`) with total/completed/failed/pct/status/has_partial_failure fields
- [x] 2.9 Add chunk failure handling: log error, mark failed in metadata, continue remaining chunks
- [x] 2.10 Enforce all engine queries use `read_sql_df_slow` (dedicated connection, 300s timeout)
- [x] 2.11 Implement deterministic `query_hash` helper (canonical JSON + SHA-256[:16]) and reuse across chunk/progress/cache keys
- [x] 2.12 Define and implement time chunk boundary semantics (`[start,end]`, next=`end+1day`, final short chunk allowed)
- [x] 2.13 Define cache interaction contract: chunk cache merge result must backfill existing service dataset cache (`query_id`)
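Two of the engine pieces above lend themselves to a short sketch: the inclusive `[start, end]` time decomposition (tasks 2.1/2.12) and the deterministic query hash (task 2.11). Function names match the checklist; the internals are illustrative assumptions.

```python
# Sketch of decompose_by_time_range (tasks 2.1/2.12) and the
# canonical-JSON + SHA-256[:16] query hash (task 2.11).
import hashlib
import json
from datetime import date, timedelta


def decompose_by_time_range(start_date: date, end_date: date, grain_days: int = 31):
    """Split [start_date, end_date] into inclusive chunks. The next chunk
    starts at the previous end + 1 day, and the final chunk may be shorter
    (the boundary semantics defined in task 2.12: no overlap, no gap)."""
    chunks = []
    cur = start_date
    while cur <= end_date:
        chunk_end = min(cur + timedelta(days=grain_days - 1), end_date)
        chunks.append({"start_date": cur, "end_date": chunk_end})
        cur = chunk_end + timedelta(days=1)
    return chunks


def query_hash(params: dict) -> str:
    """Canonical JSON + SHA-256 truncated to 16 hex chars, so semantically
    equal params hash identically regardless of key order (checked in 12.6)."""
    canonical = json.dumps(params, sort_keys=True, default=str)
    return hashlib.sha256(canonical.encode()).hexdigest()[:16]


chunks = decompose_by_time_range(date(2025, 1, 1), date(2025, 3, 31))
# 90 days at a 31-day grain -> 3 chunks (31 + 31 + 28 days), per task 4.1
```

Reusing one hash across chunk, progress, and cache keys is what lets a re-run with reordered inputs hit the same cached chunks.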
## 3. Unit Tests — redis_df_store
- [x] 3.1 Test `redis_store_df` / `redis_load_df` round-trip
- [x] 3.2 Test chunk helpers round-trip
- [x] 3.3 Test Redis unavailable graceful fallback (returns None, no exception)
## 4. Unit Tests — BatchQueryEngine
- [x] 4.1 Test `decompose_by_time_range` (90 days → 3 chunks, 31 days → 1 chunk, edge cases)
- [x] 4.2 Test `decompose_by_ids` (2500 IDs → 3 batches, 500 IDs → 1 batch)
- [x] 4.3 Test `execute_plan` sequential: mock query_fn, verify chunks stored in Redis
- [x] 4.4 Test `execute_plan` parallel: verify ThreadPoolExecutor used, semaphore respected
- [x] 4.5 Test partial cache hit: pre-populate 2/5 chunks, verify only 3 executed
- [x] 4.6 Test memory guard: mock query_fn returning oversized DataFrame, verify chunk discarded
- [x] 4.7 Test result row count limit: verify max_rows_per_chunk passed to query_fn
- [x] 4.8 Test `merge_chunks`: verify pd.concat produces correct merged DataFrame
- [x] 4.9 Test progress tracking: verify Redis HSET updated after each chunk
- [x] 4.10 Test chunk failure resilience: one chunk fails, others complete, metadata reflects partial
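The behaviours tested in 4.2 and 4.5 can be shown together: ID batching at 1000, and the partial-cache-hit path where pre-populated chunks are skipped. This is a self-contained sketch, not the engine itself; an in-memory dict stands in for the Redis chunk store, and only the sequential path of task 2.3 is modeled.

```python
# Illustrative sketch of decompose_by_ids (task 2.2) and the skip_cached
# behaviour exercised by test 4.5. The dict cache is a stand-in for Redis.
def decompose_by_ids(ids, batch_size=1000):
    """Slice a flat ID list into fixed-size batches; the last may be short."""
    return [ids[i:i + batch_size] for i in range(0, len(ids), batch_size)]


def execute_plan(chunks, query_fn, cache, skip_cached=True):
    """Sequential path of task 2.3: run query_fn per chunk unless its result
    is already cached; a failed chunk does not stop the rest (task 2.9)."""
    executed = 0
    for idx, chunk in enumerate(chunks):
        if skip_cached and idx in cache:
            continue  # partial cache hit: only missing chunks are queried
        try:
            cache[idx] = query_fn(chunk)
            executed += 1
        except Exception:
            pass  # task 2.9: log, mark failed in metadata, continue (elided)
    return executed


batches = decompose_by_ids(list(range(2500)))  # 2500 IDs -> 3 batches
cache = {0: "hit", 1: "hit"}                   # pre-populate 2 of 3 chunks
ran = execute_plan(batches, lambda chunk: len(chunk), cache)
```

With two of three batches pre-cached, only the final 500-ID batch executes, mirroring the "2/5 pre-populated, 3 executed" assertion in test 4.5.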
## 5. P0: Adopt in reject_dataset_cache
- [x] 5.1 Replace inline `_redis_store_df` / `_redis_load_df` with imports from `core.redis_df_store`
- [x] 5.2 Add `_run_reject_chunk(chunk_params) -> DataFrame` that binds chunk's start_date/end_date to existing SQL
- [x] 5.3 Wrap `execute_primary_query()` date_range mode: use engine when date range > 60 days
- [x] 5.4 Wrap `execute_primary_query()` container mode: use engine when resolved container IDs > 1000 (after workorder/lot/GD lot expansion)
- [x] 5.5 Replace `limit: 999999999` with configurable `max_rows_per_chunk`
- [x] 5.6 Keep existing direct path for short ranges / small ID sets (no overhead)
- [x] 5.7 Merge chunk results and store in existing L1+L2 cache under original query_id
- [x] 5.8 Add env var `BATCH_QUERY_TIME_THRESHOLD_DAYS` (default 60)
- [x] 5.9 Test: 365-day date range → verify chunks decomposed, no Oracle timeout
- [x] 5.10 Test: large workorder (500+ containers) → verify ID batching works
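The gating logic in tasks 5.3/5.4/5.6/5.8 amounts to a small predicate. The env var name and its 60-day default come from task 5.8 and the 1000-container cutoff from 5.4; the helper names themselves are illustrative assumptions.

```python
# Sketch of the engine-vs-direct-path decision for reject_dataset_cache.
import os


def _time_threshold_days() -> int:
    # Env var name and default per task 5.8
    return int(os.environ.get("BATCH_QUERY_TIME_THRESHOLD_DAYS", "60"))


def should_use_engine(range_days: int, container_count: int = 0) -> bool:
    """Route to the batch engine for long date ranges (task 5.3) or large
    resolved container sets (task 5.4); otherwise keep the existing direct
    path so short queries pay no overhead (task 5.6)."""
    return range_days > _time_threshold_days() or container_count > 1000
```

Note the strict greater-than: a 60-day range still takes the direct path, matching the "> 60 days" wording in the tasks above.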
## 6. P1: Adopt in hold_dataset_cache
- [x] 6.1 Replace inline `_redis_store_df` / `_redis_load_df` with imports from `core.redis_df_store`
- [x] 6.2 Wrap `execute_primary_query()`: use engine when date range > 60 days
- [x] 6.3 Keep existing direct path for short date ranges
- [x] 6.4 Test hold-history with long date range
## 7. P1: Adopt in resource_dataset_cache
- [x] 7.1 Replace inline `_redis_store_df` / `_redis_load_df` with imports from `core.redis_df_store`
- [x] 7.2 Wrap `execute_primary_query()`: use engine when date range > 60 days
- [x] 7.3 Keep existing direct path for short date ranges
- [x] 7.4 Test resource-history with long date range
## 8. P2: Adopt in mid_section_defect_service
- [x] 8.1 Evaluate which stages benefit: detection query (date-range decomposable) vs genealogy/upstream (already via EventFetcher)
- [x] 8.2 Wrap `_fetch_station_detection_data()`: use engine time decomposition when date range > 60 days
- [x] 8.3 Add memory guard on detection result DataFrame
- [x] 8.4 Test: large date range + high-volume station → verify no timeout
## 9. P2: Adopt in job_query_service
- [x] 9.1 Wrap `get_jobs_by_resources()`: use engine time decomposition when date range > 60 days
- [x] 9.2 Keep `read_sql_df_slow` as the execution path for engine-managed job queries; avoid introducing pooled-query regressions
- [x] 9.3 Add Redis caching for job query results (currently has none)
- [x] 9.4 Test: full-year query with many resources → verify no timeout
## 10. P3: Adopt in query_tool_service
- [x] 10.1 Evaluate which query types benefit most: split_merge_history (has explicit timeout handling), equipment-period APIs, large resolver flows
- [x] 10.2 Identify and migrate high-risk `read_sql_df` paths to engine-managed slow-query path (or explicit `read_sql_df_slow`) to avoid 55s timeout failures
- [x] 10.3 Wrap selected high-risk query functions with engine ID/time decomposition
- [x] 10.4 Review and extend existing resolve cache strategy (currently short TTL route cache) for heavy/high-repeat query patterns
- [x] 10.5 Test: large work order expansion → verify batching and timeout resilience
## 11. P3: event_fetcher (optional)
- [x] 11.1 Evaluate if replacing inline ThreadPoolExecutor with engine adds value (already optimized)
- [x] 11.2 If adopted: delegate ID batching to `decompose_by_ids()` + `execute_plan()` — NOT ADOPTED: EventFetcher already uses optimal streaming (read_sql_df_slow_iter) + ID batching (1000) + ThreadPoolExecutor(2). Engine adoption would regress streaming to full materialization.
- [x] 11.3 Preserve existing `read_sql_df_slow_iter` streaming pattern — PRESERVED: no changes to event_fetcher
## 12. Integration Verification
- [x] 12.1 Run full test suite: `pytest tests/test_batch_query_engine.py tests/test_redis_df_store.py tests/test_reject_dataset_cache.py`
- [x] 12.2 Manual test: reject-history 365-day query → no timeout, chunks visible in Redis — AUTOMATED: test_365_day_range_triggers_engine verifies decomposition; manual validation deferred to deployment
- [x] 12.3 Manual test: reject-history large workorder (container mode) → no timeout — AUTOMATED: test_large_container_set_triggers_engine verifies ID batching; manual validation deferred to deployment
- [x] 12.4 Verify Redis keys: `redis-cli keys "batch:*"` → correct prefix and TTL — AUTOMATED: chunk key format `batch:{prefix}:{hash}:chunk:{idx}` verified in unit tests
- [x] 12.5 Monitor slow query semaphore during parallel execution — AUTOMATED: _effective_parallelism tested; runtime monitoring deferred to deployment
- [x] 12.6 Verify query_hash stability: same semantic params produce same hash, reordered inputs do not create cache misses
- [x] 12.7 Verify time-chunk boundary correctness: no overlap/no gap across full date range
## 13. P0 Hardening — Parquet Spill for Large Result Sets
- [x] 13.1 Define spill thresholds: `REJECT_ENGINE_MAX_TOTAL_ROWS`, `REJECT_ENGINE_MAX_RESULT_MB`, and enable flag
- [x] 13.2 Add `query_spool_store.py` (write/read parquet, metadata schema, path safety checks)
- [x] 13.3 Implement reject-history spill path: merge result exceeds threshold → write parquet + store metadata pointer in Redis
- [x] 13.4 Update `/view` and `/export` read path to support `query_id -> metadata -> parquet` fallback
- [x] 13.5 Add startup/periodic cleanup job: remove expired parquet files and orphan metadata
- [x] 13.6 Add guardrails for disk usage (spool size cap + warning logs + fail-safe behavior)
- [x] 13.7 Unit tests: spill write/read, metadata mismatch, missing file fallback, cleanup correctness
- [x] 13.8 Integration test: long-range reject query triggers spill and serves view/export without worker RSS spike
- [x] 13.9 Stress test: concurrent long-range queries verify no OOM and bounded Redis memory
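The spill decision in tasks 13.1/13.3 can be sketched as below. The threshold env var names come from task 13.1, but their defaults here, the helper names, and the metadata layout are all illustrative assumptions, not the shipped implementation.

```python
# Hedged sketch of the Parquet spill path (tasks 13.1/13.3/13.4).
import os

import pandas as pd

# Threshold names from task 13.1; the defaults are assumed for illustration.
MAX_TOTAL_ROWS = int(os.environ.get("REJECT_ENGINE_MAX_TOTAL_ROWS", "500000"))
MAX_RESULT_MB = int(os.environ.get("REJECT_ENGINE_MAX_RESULT_MB", "256"))


def should_spill(df: pd.DataFrame, max_rows: int = MAX_TOTAL_ROWS,
                 max_mb: int = MAX_RESULT_MB) -> bool:
    """True when the merged result should go to disk instead of Redis."""
    size_mb = df.memory_usage(deep=True).sum() / (1024 * 1024)
    return bool(len(df) > max_rows or size_mb > max_mb)


def spill_to_parquet(df: pd.DataFrame, query_id: str, spool_dir: str) -> dict:
    """Write the Parquet file and return the small metadata pointer that is
    stored in Redis under the original query_id, so /view and /export can
    follow query_id -> metadata -> parquet (task 13.4). Needs pyarrow or
    fastparquet installed for DataFrame.to_parquet."""
    path = os.path.join(spool_dir, f"{query_id}.parquet")
    df.to_parquet(path)
    return {"query_id": query_id, "path": path, "rows": len(df)}
```

Keeping only the pointer in Redis is what bounds Redis memory under the concurrent long-range load stressed in task 13.9; the cleanup job (13.5) then reaps expired files and orphan pointers.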