0. Artifact Alignment (P2/P3 Specs)
- 0.1 Add delta spec for `mid-section-defect` in this change (scope: long-range detection query decomposition only)
- 0.2 Add delta spec for `job-query` in this change (scope: long-range query decomposition + result cache)
- 0.3 Add delta spec for `query-tool` in this change (scope: high-risk endpoints and timeout-protection strategy)
1. Shared Infrastructure — redis_df_store
- 1.1 Create `src/mes_dashboard/core/redis_df_store.py` with `redis_store_df(key, df, ttl)` and `redis_load_df(key)` extracted from reject_dataset_cache.py (lines 82-111)
- 1.2 Add chunk-level helpers: `redis_store_chunk(prefix, query_hash, idx, df, ttl)`, `redis_load_chunk(prefix, query_hash, idx)`, `redis_chunk_exists(prefix, query_hash, idx)`
2. Shared Infrastructure — BatchQueryEngine
- 2.1 Create `src/mes_dashboard/services/batch_query_engine.py` with `decompose_by_time_range(start_date, end_date, grain_days=31)` returning a list of chunk dicts
- 2.2 Add `decompose_by_ids(ids, batch_size=1000)` for container ID batching (after workorder/lot/GD lot/serial expansion)
- 2.3 Implement `execute_plan(chunks, query_fn, parallel=1, query_hash=None, skip_cached=True, cache_prefix='', chunk_ttl=900)` with a sequential execution path
- 2.4 Add a parallel execution path using ThreadPoolExecutor with a semaphore-aware concurrency cap: `min(parallel, available_permits - 1)`
- 2.5 Add a memory guard: after each chunk query, check `df.memory_usage(deep=True).sum()` vs `BATCH_CHUNK_MAX_MEMORY_MB` (default 256 MB, env-configurable); discard the chunk and mark it failed if exceeded
- 2.6 Add a result row count limit: `max_rows_per_chunk` parameter passed to query_fn for SQL-level `FETCH FIRST N ROWS ONLY`
- 2.7 Implement `merge_chunks(cache_prefix, query_hash)` and `iterate_chunks(cache_prefix, query_hash)` for result assembly
- 2.8 Add progress tracking via Redis HSET (`batch:{prefix}:{hash}:meta`) with total/completed/failed/pct/status/has_partial_failure fields
- 2.9 Add chunk failure handling: log the error, mark the chunk failed in metadata, continue with the remaining chunks
- 2.10 Enforce that all engine queries use `read_sql_df_slow` (dedicated connection, 300s timeout)
- 2.11 Implement a deterministic `query_hash` helper (canonical JSON + SHA-256[:16]) and reuse it across chunk/progress/cache keys
- 2.12 Define and implement time chunk boundary semantics (inclusive `[start, end]`, next chunk starts at end + 1 day, final short chunk allowed)
- 2.13 Define the cache interaction contract: the chunk cache merge result must backfill the existing service dataset cache (`query_id`)
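The boundary semantics of item 2.12 and the hash contract of item 2.11 can be sketched as follows. The chunk dict fields and the helper name `compute_query_hash` are assumptions for illustration; the inclusive-bounds and sorted-keys behavior is exactly what the items specify.

```python
# Sketch: inclusive [start, end] time chunks (next = end + 1 day, no overlap,
# no gap, short final chunk allowed) and a deterministic query hash.
import hashlib
import json
from datetime import date, timedelta


def decompose_by_time_range(start_date: date, end_date: date, grain_days: int = 31) -> list:
    """Split the inclusive [start_date, end_date] range into non-overlapping chunks."""
    chunks = []
    cur = start_date
    while cur <= end_date:
        # grain_days - 1 because both endpoints are included in the chunk
        chunk_end = min(cur + timedelta(days=grain_days - 1), end_date)
        chunks.append({"idx": len(chunks), "start_date": cur, "end_date": chunk_end})
        cur = chunk_end + timedelta(days=1)  # next chunk starts the following day
    return chunks


def compute_query_hash(params: dict) -> str:
    """Canonical JSON (sorted keys, compact separators) + SHA-256 truncated to 16 hex chars."""
    canonical = json.dumps(params, sort_keys=True, default=str, separators=(",", ":"))
    return hashlib.sha256(canonical.encode()).hexdigest()[:16]
```

Sorting the keys is what makes reordered inputs hash identically, which is the cache-miss property item 12.6 later verifies.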
3. Unit Tests — redis_df_store
- 3.1 Test `redis_store_df`/`redis_load_df` round-trip
- 3.2 Test chunk helpers round-trip
- 3.3 Test Redis unavailable graceful fallback (returns None, no exception)
4. Unit Tests — BatchQueryEngine
- 4.1 Test `decompose_by_time_range` (90 days → 3 chunks, 31 days → 1 chunk, edge cases)
- 4.2 Test `decompose_by_ids` (2500 IDs → 3 batches, 500 IDs → 1 batch)
- 4.3 Test `execute_plan` sequential: mock query_fn, verify chunks stored in Redis
- 4.4 Test `execute_plan` parallel: verify ThreadPoolExecutor used, semaphore respected
- 4.5 Test partial cache hit: pre-populate 2/5 chunks, verify only 3 executed
- 4.6 Test memory guard: mock query_fn returning oversized DataFrame, verify chunk discarded
- 4.7 Test result row count limit: verify max_rows_per_chunk passed to query_fn
- 4.8 Test `merge_chunks`: verify pd.concat produces the correct merged DataFrame
- 4.9 Test progress tracking: verify Redis HSET updated after each chunk
- 4.10 Test chunk failure resilience: one chunk fails, others complete, metadata reflects partial
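The 4.2 cases can be exercised against a standalone `decompose_by_ids`; the real tests would import it from `services.batch_query_engine`, but this inline version mirrors the plan's `batch_size=1000` contract and makes the expected batch shapes concrete.

```python
# Standalone sketch of decompose_by_ids and the 4.2 test cases.
def decompose_by_ids(ids: list, batch_size: int = 1000) -> list:
    """Split an ID list into order-preserving batches of at most batch_size."""
    return [ids[i:i + batch_size] for i in range(0, len(ids), batch_size)]


# 2500 IDs -> 3 batches (1000 / 1000 / 500); 500 IDs -> 1 batch; empty -> no batches
assert [len(b) for b in decompose_by_ids(list(range(2500)))] == [1000, 1000, 500]
assert decompose_by_ids(list(range(500))) == [list(range(500))]
assert decompose_by_ids([]) == []
```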
5. P0: Adopt in reject_dataset_cache
- 5.1 Replace inline `_redis_store_df`/`_redis_load_df` with imports from `core.redis_df_store`
- 5.2 Add `_run_reject_chunk(chunk_params) -> DataFrame` that binds the chunk's start_date/end_date to the existing SQL
- 5.3 Wrap `execute_primary_query()` date_range mode: use the engine when the date range > 60 days
- 5.4 Wrap `execute_primary_query()` container mode: use the engine when resolved container IDs > 1000 (after workorder/lot/GD lot expansion)
- 5.5 Replace `limit: 999999999` with a configurable `max_rows_per_chunk`
limit: 999999999with configurablemax_rows_per_chunk - 5.6 Keep existing direct path for short ranges / small ID sets (no overhead)
- 5.7 Merge chunk results and store in existing L1+L2 cache under original query_id
- 5.8 Add env var `BATCH_QUERY_TIME_THRESHOLD_DAYS` (default 60)
- 5.9 Test: 365-day date range → verify chunks decomposed, no Oracle timeout
- 5.10 Test: large workorder (500+ containers) → verify ID batching works
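The gating decision in items 5.3, 5.6, and 5.8 amounts to a small predicate; `should_use_engine` is a hypothetical name for it, sketched here under the assumption that both endpoints of the date range are inclusive (consistent with the engine's chunk semantics).

```python
# Sketch of the date-range gate: direct path for short ranges,
# engine path only past BATCH_QUERY_TIME_THRESHOLD_DAYS (default 60).
import os
from datetime import date


def _time_threshold_days() -> int:
    """Read the env-configurable threshold, defaulting to 60 days."""
    return int(os.environ.get("BATCH_QUERY_TIME_THRESHOLD_DAYS", "60"))


def should_use_engine(start_date: date, end_date: date) -> bool:
    """True when the inclusive date range exceeds the configured threshold."""
    span_days = (end_date - start_date).days + 1
    return span_days > _time_threshold_days()
```

Keeping the predicate separate from `execute_primary_query()` makes item 5.6's "no overhead" promise easy to honor: short ranges never touch the engine, the chunk cache, or the progress metadata.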
6. P1: Adopt in hold_dataset_cache
- 6.1 Replace inline `_redis_store_df`/`_redis_load_df` with imports from `core.redis_df_store`
- 6.2 Wrap `execute_primary_query()`: use the engine when the date range > 60 days
- 6.3 Keep the existing direct path for short date ranges
- 6.4 Test hold-history with long date range
7. P1: Adopt in resource_dataset_cache
- 7.1 Replace inline `_redis_store_df`/`_redis_load_df` with imports from `core.redis_df_store`
- 7.2 Wrap `execute_primary_query()`: use the engine when the date range > 60 days
- 7.3 Keep the existing direct path for short date ranges
- 7.4 Test resource-history with long date range
8. P2: Adopt in mid_section_defect_service
- 8.1 Evaluate which stages benefit: detection query (date-range decomposable) vs genealogy/upstream (already via EventFetcher)
- 8.2 Wrap `_fetch_station_detection_data()`: use engine time decomposition when the date range > 60 days
- 8.3 Add a memory guard on the detection result DataFrame
- 8.4 Test: large date range + high-volume station → verify no timeout
9. P2: Adopt in job_query_service
- 9.1 Wrap `get_jobs_by_resources()`: use engine time decomposition when the date range > 60 days
- 9.2 Keep `read_sql_df_slow` as the execution path for engine-managed job queries; avoid introducing pooled-query regressions
- 9.3 Add Redis caching for job query results (currently there is none)
- 9.4 Test: full-year query with many resources → verify no timeout
10. P3: Adopt in query_tool_service
- 10.1 Evaluate which query types benefit most: split_merge_history (has explicit timeout handling), equipment-period APIs, large resolver flows
- 10.2 Identify and migrate high-risk `read_sql_df` paths to the engine-managed slow-query path (or explicit `read_sql_df_slow`) to avoid 55s timeout failures
- 10.3 Wrap selected high-risk query functions with engine ID/time decomposition
- 10.4 Review and extend existing resolve cache strategy (currently short TTL route cache) for heavy/high-repeat query patterns
- 10.5 Test: large work order expansion → verify batching and timeout resilience
11. P3: event_fetcher (optional)
- 11.1 Evaluate if replacing inline ThreadPoolExecutor with engine adds value (already optimized)
- 11.2 If adopted: delegate ID batching to `decompose_by_ids()` + `execute_plan()` — NOT ADOPTED: EventFetcher already uses optimal streaming (`read_sql_df_slow_iter`) + ID batching (1000) + ThreadPoolExecutor(2); engine adoption would regress streaming to full materialization
- 11.3 Preserve the existing `read_sql_df_slow_iter` streaming pattern — PRESERVED: no changes to event_fetcher
12. Integration Verification
- 12.1 Run full test suite: `pytest tests/test_batch_query_engine.py tests/test_redis_df_store.py tests/test_reject_dataset_cache.py`
- 12.2 Manual test: reject-history 365-day query → no timeout, chunks visible in Redis — AUTOMATED: test_365_day_range_triggers_engine verifies decomposition; manual validation deferred to deployment
- 12.3 Manual test: reject-history large workorder (container mode) → no timeout — AUTOMATED: test_large_container_set_triggers_engine verifies ID batching; manual validation deferred to deployment
- 12.4 Verify Redis keys: `redis-cli keys "batch:*"` → correct prefix and TTL — AUTOMATED: chunk key format `batch:{prefix}:{hash}:chunk:{idx}` verified in unit tests
- 12.5 Monitor the slow-query semaphore during parallel execution — AUTOMATED: _effective_parallelism tested; runtime monitoring deferred to deployment
- 12.6 Verify query_hash stability: same semantic params produce same hash, reordered inputs do not create cache misses
- 12.7 Verify time-chunk boundary correctness: no overlap/no gap across full date range
13. P0 Hardening — Parquet Spill for Large Result Sets
- 13.1 Define spill thresholds: `REJECT_ENGINE_MAX_TOTAL_ROWS`, `REJECT_ENGINE_MAX_RESULT_MB`, and an enable flag
- 13.2 Add `query_spool_store.py` (write/read parquet, metadata schema, path safety checks)
- 13.3 Implement reject-history spill path: when the merge result exceeds a threshold, write parquet and store a metadata pointer in Redis
- 13.4 Update the `/view` and `/export` read path to support `query_id -> metadata -> parquet` fallback
- 13.5 Add a startup/periodic cleanup job: remove expired parquet files and orphan metadata
- 13.6 Add guardrails for disk usage (spool size cap + warning logs + fail-safe behavior)
- 13.7 Unit tests: spill write/read, metadata mismatch, missing file fallback, cleanup correctness
- 13.8 Integration test: long-range reject query triggers spill and serves view/export without worker RSS spike
- 13.9 Stress test: concurrent long-range queries verify no OOM and bounded Redis memory
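The spill contract in items 13.2-13.4 could be sketched as below. Function names, the metadata fields, and the spool directory are assumptions; and this sketch serializes with pickle purely to stay free of a pyarrow dependency, where the real `query_spool_store.py` would use `df.to_parquet(path)` / `pd.read_parquet(path)`.

```python
# Sketch of the spool store: oversized merge results go to disk, Redis keeps
# only a small metadata pointer; a missing file triggers the re-query fallback.
import os
import tempfile
from typing import Optional

import pandas as pd

SPOOL_DIR = tempfile.gettempdir()  # stand-in; real module uses a configured spool dir


def spool_write(query_id: str, df: pd.DataFrame) -> dict:
    """Write the result to disk and return the metadata pointer cached in Redis."""
    # Path safety check (13.2): only filename-safe characters from the query id
    safe_id = "".join(c for c in query_id if c.isalnum() or c in "-_")
    path = os.path.join(SPOOL_DIR, f"spool_{safe_id}.spool")
    df.to_pickle(path)  # real module: df.to_parquet(path)
    return {"query_id": query_id, "path": path,
            "rows": len(df), "bytes": os.path.getsize(path)}


def spool_read(meta: dict) -> Optional[pd.DataFrame]:
    """Read a spilled result back; None when the file is gone (caller re-queries)."""
    if not os.path.exists(meta["path"]):
        return None
    return pd.read_pickle(meta["path"])  # real module: pd.read_parquet(...)
```

Returning `None` on a missing file (rather than raising) gives `/view` and `/export` the same graceful degradation the Redis helpers have: the request falls back to re-running the query instead of failing.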