DashBoard/openspec/changes/archive/2026-03-02-unified-batch-query-redis-cache/tasks.md

0. Artifact Alignment (P2/P3 Specs)

  • 0.1 Add delta spec for mid-section-defect in this change (scope: long-range detection query decomposition only)
  • 0.2 Add delta spec for job-query in this change (scope: long-range query decomposition + result cache)
  • 0.3 Add delta spec for query-tool in this change (scope: high-risk endpoints and timeout-protection strategy)

1. Shared Infrastructure — redis_df_store

  • 1.1 Create src/mes_dashboard/core/redis_df_store.py with redis_store_df(key, df, ttl) and redis_load_df(key) extracted from reject_dataset_cache.py (lines 82-111)
  • 1.2 Add chunk-level helpers: redis_store_chunk(prefix, query_hash, idx, df, ttl), redis_load_chunk(prefix, query_hash, idx), redis_chunk_exists(prefix, query_hash, idx)
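
A minimal sketch of the 1.1/1.2 surface. The serialization format (Parquet via pyarrow) and the module-level client are assumptions here, not the code extracted from reject_dataset_cache.py; the chunk key format follows the convention verified in 12.4.

```python
# Hypothetical sketch of core/redis_df_store.py; serialization and
# client wiring are assumptions, not the extracted implementation.
import io
from typing import Optional

import pandas as pd
import redis

_client = redis.Redis()  # assumed module-level client


def redis_store_df(key: str, df: pd.DataFrame, ttl: int) -> bool:
    try:
        buf = io.BytesIO()
        df.to_parquet(buf)  # requires pyarrow
        _client.set(key, buf.getvalue(), ex=ttl)
        return True
    except redis.RedisError:
        return False  # Redis down: degrade silently, never raise


def redis_load_df(key: str) -> Optional[pd.DataFrame]:
    try:
        raw = _client.get(key)
    except redis.RedisError:
        return None  # graceful fallback (task 3.3)
    if raw is None:
        return None
    return pd.read_parquet(io.BytesIO(raw))


def _chunk_key(prefix: str, query_hash: str, idx: int) -> str:
    # Key format batch:{prefix}:{hash}:chunk:{idx} (task 12.4)
    return f"batch:{prefix}:{query_hash}:chunk:{idx}"


def redis_store_chunk(prefix, query_hash, idx, df, ttl):
    return redis_store_df(_chunk_key(prefix, query_hash, idx), df, ttl)


def redis_load_chunk(prefix, query_hash, idx):
    return redis_load_df(_chunk_key(prefix, query_hash, idx))


def redis_chunk_exists(prefix, query_hash, idx) -> bool:
    try:
        return bool(_client.exists(_chunk_key(prefix, query_hash, idx)))
    except redis.RedisError:
        return False
```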

2. Shared Infrastructure — BatchQueryEngine

  • 2.1 Create src/mes_dashboard/services/batch_query_engine.py with decompose_by_time_range(start_date, end_date, grain_days=31) returning a list of chunk dicts (decomposition sketched after this list)
  • 2.2 Add decompose_by_ids(ids, batch_size=1000) for container ID batching (after workorder/lot/GD lot/serial expansion)
  • 2.3 Implement execute_plan(chunks, query_fn, parallel=1, query_hash=None, skip_cached=True, cache_prefix='', chunk_ttl=900) with sequential execution path
  • 2.4 Add parallel execution path using ThreadPoolExecutor with semaphore-aware concurrency cap: min(parallel, available_permits - 1)
  • 2.5 Add memory guard: after each chunk query, check df.memory_usage(deep=True).sum() vs BATCH_CHUNK_MAX_MEMORY_MB (default 256MB, env-configurable); discard and mark failed if exceeded
  • 2.6 Add result row count limit: max_rows_per_chunk parameter passed to query_fn for SQL-level FETCH FIRST N ROWS ONLY
  • 2.7 Implement merge_chunks(cache_prefix, query_hash) and iterate_chunks(cache_prefix, query_hash) for result assembly
  • 2.8 Add progress tracking via Redis HSET (batch:{prefix}:{hash}:meta) with total/completed/failed/pct/status/has_partial_failure fields
  • 2.9 Add chunk failure handling: log error, mark failed in metadata, continue remaining chunks
  • 2.10 Enforce all engine queries use read_sql_df_slow (dedicated connection, 300s timeout)
  • 2.11 Implement deterministic query_hash helper (canonical JSON + SHA-256[:16]) and reuse across chunk/progress/cache keys
  • 2.12 Define and implement time chunk boundary semantics: chunks are inclusive [start, end], the next chunk starts at end + 1 day, and a final short chunk is allowed
  • 2.13 Define cache interaction contract: chunk cache merge result must backfill existing service dataset cache (query_id)
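
A sketch of the decomposition and concurrency-cap pieces (2.1, 2.2, 2.4, 2.12). Signatures follow the tasks above; internals are illustrative.

```python
# Illustrative sketch of batch_query_engine decomposition; signatures
# follow tasks 2.1/2.2/2.4, internals are assumptions.
from datetime import date, timedelta
from typing import Any, Dict, List, Sequence


def decompose_by_time_range(start_date: date, end_date: date,
                            grain_days: int = 31) -> List[Dict[str, Any]]:
    """Inclusive [start, end] chunks; next chunk starts at end + 1 day,
    final short chunk allowed (task 2.12)."""
    chunks, cur = [], start_date
    while cur <= end_date:
        chunk_end = min(cur + timedelta(days=grain_days - 1), end_date)
        chunks.append({"start_date": cur, "end_date": chunk_end})
        cur = chunk_end + timedelta(days=1)  # no overlap, no gap (12.7)
    return chunks


def decompose_by_ids(ids: Sequence[str],
                     batch_size: int = 1000) -> List[Dict[str, Any]]:
    return [{"ids": list(ids[i:i + batch_size])}
            for i in range(0, len(ids), batch_size)]


def _effective_parallelism(parallel: int, available_permits: int) -> int:
    # Task 2.4 cap: never consume the last slow-query semaphore permit.
    # The max(1, ...) floor (fall back to sequential) is an assumption.
    return max(1, min(parallel, available_permits - 1))
```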

3. Unit Tests — redis_df_store

  • 3.1 Test redis_store_df / redis_load_df round-trip
  • 3.2 Test chunk helpers round-trip
  • 3.3 Test Redis unavailable graceful fallback (returns None, no exception)
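
For 3.3, a pytest sketch, assuming the module exposes a patchable client handle (the _client attribute name is hypothetical):

```python
# Hypothetical test for task 3.3; assumes redis_df_store exposes a
# module-level client attribute (_client) that tests can stub out.
import pandas as pd
import redis

from mes_dashboard.core import redis_df_store


class _DownRedis:
    """Stub behaving like a Redis client whose server is down."""
    def get(self, key):
        raise redis.RedisError("connection refused")

    def set(self, *args, **kwargs):
        raise redis.RedisError("connection refused")


def test_redis_unavailable_graceful_fallback(monkeypatch):
    monkeypatch.setattr(redis_df_store, "_client", _DownRedis())
    df = pd.DataFrame({"a": [1]})
    redis_df_store.redis_store_df("k", df, ttl=60)       # must not raise
    assert redis_df_store.redis_load_df("k") is None     # task 3.3
```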

4. Unit Tests — BatchQueryEngine

  • 4.1 Test decompose_by_time_range (90 days → 3 chunks, 31 days → 1 chunk, edge cases; see the sketch after this list)
  • 4.2 Test decompose_by_ids (2500 IDs → 3 batches, 500 IDs → 1 batch)
  • 4.3 Test execute_plan sequential: mock query_fn, verify chunks stored in Redis
  • 4.4 Test execute_plan parallel: verify ThreadPoolExecutor used, semaphore respected
  • 4.5 Test partial cache hit: pre-populate 2/5 chunks, verify only 3 executed
  • 4.6 Test memory guard: mock query_fn returning oversized DataFrame, verify chunk discarded
  • 4.7 Test result row count limit: verify max_rows_per_chunk passed to query_fn
  • 4.8 Test merge_chunks: verify pd.concat produces correct merged DataFrame
  • 4.9 Test progress tracking: verify Redis HSET updated after each chunk
  • 4.10 Test chunk failure resilience: one chunk fails, others complete, metadata reflects partial
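
The 4.1/4.2 expectations pin down the decomposition arithmetic; a sketch of those assertions (dates chosen so the range is exactly 90 days):

```python
from datetime import date, timedelta

from mes_dashboard.services.batch_query_engine import (
    decompose_by_ids, decompose_by_time_range)


def test_90_days_yields_3_chunks_no_overlap_no_gap():
    # 2026-01-01 .. 2026-03-31 inclusive is 90 days -> 31 + 31 + 28
    chunks = decompose_by_time_range(date(2026, 1, 1), date(2026, 3, 31))
    assert len(chunks) == 3
    for prev, nxt in zip(chunks, chunks[1:]):
        # boundary semantics from tasks 2.12 / 12.7
        assert nxt["start_date"] == prev["end_date"] + timedelta(days=1)


def test_id_batching():
    assert len(decompose_by_ids([str(i) for i in range(2500)])) == 3
    assert len(decompose_by_ids([str(i) for i in range(500)])) == 1
```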

5. P0: Adopt in reject_dataset_cache

  • 5.1 Replace inline _redis_store_df / _redis_load_df with imports from core.redis_df_store
  • 5.2 Add _run_reject_chunk(chunk_params) -> DataFrame that binds chunk's start_date/end_date to existing SQL
  • 5.3 Wrap execute_primary_query() date_range mode: use engine when date range > 60 days (gating sketched after this list)
  • 5.4 Wrap execute_primary_query() container mode: use engine when resolved container IDs > 1000 (after workorder/lot/GD lot expansion)
  • 5.5 Replace limit: 999999999 with configurable max_rows_per_chunk
  • 5.6 Keep existing direct path for short ranges / small ID sets (no overhead)
  • 5.7 Merge chunk results and store in existing L1+L2 cache under original query_id
  • 5.8 Add env var BATCH_QUERY_TIME_THRESHOLD_DAYS (default 60)
  • 5.9 Test: 365-day date range → verify chunks decomposed, no Oracle timeout
  • 5.10 Test: large workorder (500+ containers) → verify ID batching works
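
How 5.3/5.6/5.7/5.8 could compose, assuming _run_reject_chunk from 5.2 and the engine functions from section 2; the call shape is illustrative, and store_in_dataset_cache is a hypothetical stand-in for the existing L1+L2 cache write.

```python
import os

# Task 5.8: threshold is env-configurable, default 60 days
BATCH_QUERY_TIME_THRESHOLD_DAYS = int(
    os.environ.get("BATCH_QUERY_TIME_THRESHOLD_DAYS", "60"))


def _should_use_engine(start_date, end_date) -> bool:
    return (end_date - start_date).days > BATCH_QUERY_TIME_THRESHOLD_DAYS


# Inside execute_primary_query(), date_range mode (tasks 5.3/5.6/5.7):
#
#     if _should_use_engine(start_date, end_date):
#         chunks = decompose_by_time_range(start_date, end_date)
#         execute_plan(chunks, _run_reject_chunk, query_hash=qhash,
#                      cache_prefix="reject")
#         df = merge_chunks("reject", qhash)
#         store_in_dataset_cache(query_id, df)  # hypothetical L1+L2 helper
#     else:
#         df = ...  # existing direct path, unchanged (task 5.6)
```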

6. P1: Adopt in hold_dataset_cache

  • 6.1 Replace inline _redis_store_df / _redis_load_df with imports from core.redis_df_store
  • 6.2 Wrap execute_primary_query(): use engine when date range > 60 days
  • 6.3 Keep existing direct path for short date ranges
  • 6.4 Test hold-history with long date range

7. P1: Adopt in resource_dataset_cache

  • 7.1 Replace inline _redis_store_df / _redis_load_df with imports from core.redis_df_store
  • 7.2 Wrap execute_primary_query(): use engine when date range > 60 days
  • 7.3 Keep existing direct path for short date ranges
  • 7.4 Test resource-history with long date range

8. P2: Adopt in mid_section_defect_service

  • 8.1 Evaluate which stages benefit: detection query (date-range decomposable) vs genealogy/upstream (already via EventFetcher)
  • 8.2 Wrap _fetch_station_detection_data(): use engine time decomposition when date range > 60 days
  • 8.3 Add memory guard on detection result DataFrame (guard sketched after this list)
  • 8.4 Test: large date range + high-volume station → verify no timeout
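
8.3 can reuse the 2.5 check verbatim; a sketch, with the env lookup assumed:

```python
import os

import pandas as pd

# Task 2.5: default 256 MB, env-configurable
BATCH_CHUNK_MAX_MEMORY_MB = int(
    os.environ.get("BATCH_CHUNK_MAX_MEMORY_MB", "256"))


def exceeds_memory_guard(df: pd.DataFrame) -> bool:
    size_mb = df.memory_usage(deep=True).sum() / (1024 * 1024)
    return size_mb > BATCH_CHUNK_MAX_MEMORY_MB
```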

9. P2: Adopt in job_query_service

  • 9.1 Wrap get_jobs_by_resources(): use engine time decomposition when date range > 60 days
  • 9.2 Keep read_sql_df_slow as the execution path for engine-managed job queries; avoid introducing pooled-query regressions
  • 9.3 Add Redis caching for job query results (currently has none)
  • 9.4 Test: full-year query with many resources → verify no timeout

10. P3: Adopt in query_tool_service

  • 10.1 Evaluate which query types benefit most: split_merge_history (has explicit timeout handling), equipment-period APIs, large resolver flows
  • 10.2 Identify and migrate high-risk read_sql_df paths to engine-managed slow-query path (or explicit read_sql_df_slow) to avoid 55s timeout failures
  • 10.3 Wrap selected high-risk query functions with engine ID/time decomposition
  • 10.4 Review and extend existing resolve cache strategy (currently short TTL route cache) for heavy/high-repeat query patterns
  • 10.5 Test: large work order expansion → verify batching and timeout resilience

11. P3: event_fetcher (optional)

  • 11.1 Evaluate if replacing inline ThreadPoolExecutor with engine adds value (already optimized)
  • 11.2 If adopted: delegate ID batching to decompose_by_ids() + execute_plan(). NOT ADOPTED: EventFetcher already uses optimal streaming (read_sql_df_slow_iter), ID batching (1000), and ThreadPoolExecutor(2); engine adoption would regress streaming to full materialization.
  • 11.3 Preserve existing read_sql_df_slow_iter streaming pattern — PRESERVED: no changes to event_fetcher

12. Integration Verification

  • 12.1 Run full test suite: pytest tests/test_batch_query_engine.py tests/test_redis_df_store.py tests/test_reject_dataset_cache.py
  • 12.2 Manual test: reject-history 365-day query → no timeout, chunks visible in Redis — AUTOMATED: test_365_day_range_triggers_engine verifies decomposition; manual validation deferred to deployment
  • 12.3 Manual test: reject-history large workorder (container mode) → no timeout — AUTOMATED: test_large_container_set_triggers_engine verifies ID batching; manual validation deferred to deployment
  • 12.4 Verify Redis keys: redis-cli keys "batch:*" → correct prefix and TTL — AUTOMATED: chunk key format batch:{prefix}:{hash}:chunk:{idx} verified in unit tests
  • 12.5 Monitor slow query semaphore during parallel execution — AUTOMATED: _effective_parallelism tested; runtime monitoring deferred to deployment
  • 12.6 Verify query_hash stability: same semantic params produce same hash, and reordered inputs do not create cache misses (see the sketch after this list)
  • 12.7 Verify time-chunk boundary correctness: no overlap/no gap across full date range
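
12.6 holds by construction if the 2.11 helper canonicalizes before hashing; a sketch (list-valued params such as ID sets are assumed pre-sorted by the caller):

```python
import hashlib
import json


def query_hash(params: dict) -> str:
    # Task 2.11: canonical JSON (sorted keys, fixed separators)
    # hashed with SHA-256, truncated to 16 hex chars.
    canonical = json.dumps(params, sort_keys=True,
                           separators=(",", ":"), default=str)
    return hashlib.sha256(canonical.encode("utf-8")).hexdigest()[:16]


# Task 12.6: key reordering must not change the hash
assert query_hash({"a": 1, "b": 2}) == query_hash({"b": 2, "a": 1})
```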

13. P0 Hardening — Parquet Spill for Large Result Sets

  • 13.1 Define spill thresholds: REJECT_ENGINE_MAX_TOTAL_ROWS, REJECT_ENGINE_MAX_RESULT_MB, and enable flag
  • 13.2 Add query_spool_store.py (write/read parquet, metadata schema, path safety checks; sketched after this list)
  • 13.3 Implement reject-history spill path: merge result exceeds threshold → write parquet + store metadata pointer in Redis
  • 13.4 Update /view and /export read path to support query_id -> metadata -> parquet fallback
  • 13.5 Add startup/periodic cleanup job: remove expired parquet files and orphan metadata
  • 13.6 Add guardrails for disk usage (spool size cap + warning logs + fail-safe behavior)
  • 13.7 Unit tests: spill write/read, metadata mismatch, missing file fallback, cleanup correctness
  • 13.8 Integration test: long-range reject query triggers spill and serves view/export without worker RSS spike
  • 13.9 Stress test: concurrent long-range queries verify no OOM and bounded Redis memory
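
A sketch of the 13.2 write/read path. The spool root env var name (QUERY_SPOOL_DIR) and the query_id-derived filename are assumptions; the resolve-and-contain check is the path-safety piece.

```python
import os
from pathlib import Path
from typing import Optional

import pandas as pd

SPOOL_ROOT = Path(os.environ.get("QUERY_SPOOL_DIR",
                                 "/var/spool/mes_dashboard"))


def _spool_path(query_id: str) -> Path:
    # Path safety (task 13.2): resolve and confirm the target stays
    # under the spool root, rejecting traversal in query_id.
    path = (SPOOL_ROOT / f"{query_id}.parquet").resolve()
    if not path.is_relative_to(SPOOL_ROOT.resolve()):  # Python 3.9+
        raise ValueError(f"unsafe spool path for query_id={query_id!r}")
    return path


def spool_write(query_id: str, df: pd.DataFrame) -> Path:
    path = _spool_path(query_id)
    path.parent.mkdir(parents=True, exist_ok=True)
    df.to_parquet(path)  # requires pyarrow
    return path


def spool_read(query_id: str) -> Optional[pd.DataFrame]:
    path = _spool_path(query_id)
    return pd.read_parquet(path) if path.exists() else None
```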