Three proposals addressing the 2026-02-25 trace pipeline OOM crash (114K CIDs):

1. trace-events-memory-triage: fetchmany iterator (read_sql_df_slow_iter), admission control (50K CID limit for non-MSD), cache skip for large queries, early memory release with gc.collect()
2. trace-async-job-queue: RQ-based async jobs for queries >20K CIDs, separate worker process with isolated memory, frontend polling via useTraceProgress composable, systemd service + deploy scripts
3. trace-streaming-response: chunked Redis storage (TRACE_STREAM_BATCH_SIZE=5000), NDJSON stream endpoint (GET /api/trace/job/<id>/stream), frontend ReadableStream consumer for progressive rendering, backward-compatible with legacy single-key storage

All three proposals archived. 1101 tests pass, frontend builds clean.

Co-Authored-By: Claude Opus 4.6 <noreply@anthropic.com>
Why
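Even with async jobs (Proposal 2) handling large queries, materializing the result is still the memory bottleneck:
- Full job result JSON: the result JSON for 114K CIDs × 2 domains can reach several hundred MB; with Redis storage, Redis reads, and Flask jsonify serialization on top, peak memory stays high
- One-shot parsing in the frontend: a browser parsing several hundred MB of JSON freezes the UI
- Redis single-key limit: large values degrade Redis performance (they block other operations)
Streaming the response (NDJSON / pagination) lets the server produce data batch by batch and the frontend consume it batch by batch. Memory usage is decoupled from the total CID count and scales only with the batch size.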
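The batch-at-a-time idea above can be sketched in Python roughly as follows. This is a minimal sketch under stated assumptions, not the project's implementation: the helper names `batched_rows`, `ndjson_lines`, and `fake_events`, and the `batch`/`rows` record layout, are hypothetical. Only the NDJSON framing (one JSON document per line) comes from the proposal.

```python
import json
from itertools import islice
from typing import Any, Dict, Iterable, Iterator, List

Row = Dict[str, Any]


def batched_rows(rows: Iterable[Row], batch_size: int) -> Iterator[List[Row]]:
    """Yield lists of at most batch_size rows; only one batch is held at a time."""
    if batch_size < 1:
        raise ValueError("batch_size must be >= 1")
    it = iter(rows)
    while True:
        batch = list(islice(it, batch_size))
        if not batch:
            return
        yield batch


def ndjson_lines(rows: Iterable[Row], batch_size: int) -> Iterator[str]:
    """Encode each batch as one NDJSON line (a JSON document plus a newline)."""
    for index, batch in enumerate(batched_rows(rows, batch_size)):
        # Hypothetical record layout: a batch index and the rows of that batch.
        yield json.dumps({"batch": index, "rows": batch}, ensure_ascii=False) + "\n"


def fake_events(total: int) -> Iterator[Row]:
    """Hypothetical event source; a generator, so rows are produced lazily."""
    for i in range(total):
        yield {"cid": f"CID{i:06d}", "domain": "history"}


# 12 rows with a batch size of 5 are emitted as batches of 5, 5, and 2 rows.
lines = list(ndjson_lines(fake_events(12), batch_size=5))
```

Because both the source and the encoder are generators, the largest object alive at any moment is a single batch, which is the property the proposal relies on.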
What Changes
- EventFetcher supports an iterator mode: fetch_events_iter() yields each batch of results instead of accumulating them all
- New GET /api/trace/job/{job_id}/stream: returns the job result as an NDJSON stream
- Frontend useTraceProgress consumes the stream: uses fetch() + ReadableStream to parse NDJSON line by line
- Result pagination API: GET /api/trace/job/{job_id}/result?domain=history&offset=0&limit=5000
- Update .env.example: TRACE_STREAM_BATCH_SIZE
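As an illustration of the iterator mode listed above, here is a minimal sketch of a batch-yielding fetch built on the DB-API `cursor.fetchmany` call. It is an assumption-laden stand-in, not the real EventFetcher: it uses an in-memory SQLite database, and the `events` table, its columns, and the second domain name are hypothetical. Only the name `fetch_events_iter` and the default batch size of 5000 are taken from the proposal.

```python
import sqlite3
from typing import Iterator, List, Sequence, Tuple

Row = Tuple[str, str, str]


def fetch_events_iter(
    conn: sqlite3.Connection, cids: Sequence[str], batch_size: int = 5000
) -> Iterator[List[Row]]:
    """Yield event rows in batches via fetchmany instead of one fetchall."""
    if not cids:
        return
    # Demo only: a real query over ~100K CIDs would also need to split the
    # CID list, since databases limit the number of bound parameters.
    placeholders = ",".join("?" for _ in cids)
    cursor = conn.execute(
        "SELECT cid, domain, payload FROM events "
        f"WHERE cid IN ({placeholders}) ORDER BY cid, domain",
        list(cids),
    )
    try:
        while True:
            batch = cursor.fetchmany(batch_size)
            if not batch:
                break
            yield batch
    finally:
        cursor.close()


# Hypothetical schema and data, used only to exercise the generator.
conn = sqlite3.connect(":memory:")
conn.execute("CREATE TABLE events (cid TEXT, domain TEXT, payload TEXT)")
conn.executemany(
    "INSERT INTO events VALUES (?, ?, ?)",
    [(f"CID{i}", d, "{}") for i in range(4) for d in ("history", "other")],
)
wanted = ["CID0", "CID1", "CID2", "CID3"]
batch_sizes = [len(b) for b in fetch_events_iter(conn, wanted, batch_size=3)]
```

A caller can process and release each batch before requesting the next one, so the full result set never has to exist in memory at once.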
Capabilities
New Capabilities
- trace-streaming-response: NDJSON streaming response + result pagination
Modified Capabilities
- event-fetcher-unified: adds an iterator mode (fetch_events_iter)
- trace-staged-api: job result streaming endpoint
- progressive-trace-ux: frontend stream consumption + batch-by-batch rendering
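The consumer side of an NDJSON stream has one subtlety: network chunks do not line up with record boundaries, so a partial line must be buffered until its newline arrives. The proposal does this in the browser with ReadableStream; the following is only a Python analogue of that logic, with a hypothetical function name `iter_ndjson` and made-up sample records.

```python
import json
from typing import Any, Iterable, Iterator


def iter_ndjson(chunks: Iterable[bytes]) -> Iterator[Any]:
    """Parse NDJSON from byte chunks whose boundaries need not align with lines."""
    buffer = b""
    for chunk in chunks:
        buffer += chunk
        # Everything before the last newline is complete; keep the tail buffered.
        *complete, buffer = buffer.split(b"\n")
        for line in complete:
            if line.strip():
                yield json.loads(line)
    if buffer.strip():
        # Final record that arrived without a trailing newline.
        yield json.loads(buffer)


# Two records split across three chunks at arbitrary positions.
sample_chunks = [
    b'{"batch": 0, "ro',
    b'ws": [1, 2]}\n{"batch"',
    b': 1, "rows": [3]}\n',
]
records = list(iter_ndjson(sample_chunks))
```

Each record can be rendered as soon as it is yielded, which is what allows progressive rendering instead of waiting for the whole response.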
Impact
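- Backend core: event_fetcher.py (iterator mode), trace_routes.py (stream endpoint)
- Frontend changes: useTraceProgress.js (ReadableStream consumption)
- Deployment config: .env.example (TRACE_STREAM_BATCH_SIZE)
- Not affected: the synchronous path (queries with a CID count below the threshold still follow the existing flow), other services, the real-time monitoring page
- Prerequisite: trace-async-job-queue (Proposal 2)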
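A minimal sketch of how the deployment setting and the unchanged synchronous path might fit together is shown below. The function names, the fallback behaviour for invalid values, and the path labels are hypothetical. The default of 5000 and the 20K-CID async threshold are the values quoted in the commit message, and the exact handling at the boundary is an assumption.

```python
import os
from typing import Mapping

DEFAULT_STREAM_BATCH_SIZE = 5000  # value quoted in the commit message
ASYNC_CID_THRESHOLD = 20_000      # "queries >20K CIDs" go to the async job queue


def stream_batch_size(env: Mapping[str, str] = os.environ) -> int:
    """Read TRACE_STREAM_BATCH_SIZE, falling back to the default if unusable."""
    raw = env.get("TRACE_STREAM_BATCH_SIZE", "").strip()
    if not raw:
        return DEFAULT_STREAM_BATCH_SIZE
    try:
        value = int(raw)
    except ValueError:
        return DEFAULT_STREAM_BATCH_SIZE
    # Assumption: non-positive sizes are treated as misconfiguration.
    return value if value > 0 else DEFAULT_STREAM_BATCH_SIZE


def choose_path(cid_count: int, threshold: int = ASYNC_CID_THRESHOLD) -> str:
    """Small queries keep the existing sync flow; larger ones use job + stream."""
    # Assumption: a count exactly at the threshold stays on the sync path.
    return "async-stream" if cid_count > threshold else "sync"


configured = stream_batch_size({"TRACE_STREAM_BATCH_SIZE": "2000"})
path_for_incident = choose_path(114_000)
```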