egg dbe0da057c feat(trace-pipeline): memory triage, async job queue, and NDJSON streaming
Three proposals addressing the 2026-02-25 trace pipeline OOM crash (114K CIDs):

1. trace-events-memory-triage: fetchmany iterator (read_sql_df_slow_iter),
   admission control (50K CID limit for non-MSD), cache skip for large queries,
   early memory release with gc.collect()

2. trace-async-job-queue: RQ-based async jobs for queries >20K CIDs,
   separate worker process with isolated memory, frontend polling via
   useTraceProgress composable, systemd service + deploy scripts

3. trace-streaming-response: chunked Redis storage (TRACE_STREAM_BATCH_SIZE=5000),
   NDJSON stream endpoint (GET /api/trace/job/<id>/stream), frontend
   ReadableStream consumer for progressive rendering, backward-compatible
   with legacy single-key storage
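The NDJSON streaming idea in proposal 3 can be sketched as a generator that emits one JSON line per record, batch by batch, so the client can render progressively. This is a minimal sketch under assumed names (`ndjson_stream`, `load_batches` are illustrative, not the repo's actual identifiers):

```python
import json

def ndjson_stream(batches):
    """Yield one NDJSON line per record, batch by batch.

    A hypothetical helper: the route layer would wrap this generator in a
    streaming Response so the frontend ReadableStream consumer can parse
    each line as it arrives instead of waiting for the full payload.
    """
    for batch in batches:
        for record in batch:
            yield json.dumps(record, ensure_ascii=False) + "\n"

# A Flask-style endpoint might use it like:
#   return Response(ndjson_stream(load_batches(job_id)),
#                   mimetype="application/x-ndjson")
```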

All three proposals archived. 1101 tests pass, frontend builds clean.

Co-Authored-By: Claude Opus 4.6 <noreply@anthropic.com>
2026-02-25 21:01:27 +08:00

Why

On 2026-02-25, the production trace pipeline was processing 114K CIDs (TMTT site + a 5-month date range) when the worker was OOM SIGKILLed (7GB VM, no swap). Pool isolation was already in place, so connections no longer contend with each other, but memory usage in the events stage is the real bottleneck:

  1. cursor.fetchall() loads all rows at once (hundreds of thousands)
  2. pd.DataFrame(rows) makes a copy
  3. df.iterrows() + row.to_dict() makes another
  4. grouped[cid].append(record) accumulates into the final dict
  5. raw_domain_results[domain] and results[domain]["data"] are held as two simultaneous copies in trace_routes

With 114K CIDs × 2 domains, 3-4 full copies of the data exist simultaneously at peak, each hundreds of MB → 2-4 GB for a single domain. A 7GB VM (4 workers) cannot absorb this at all.
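The copy chain above could look roughly like this (a minimal sketch; the function name and structure are illustrative, not the repo's actual code):

```python
import pandas as pd

def trace_events_old(cursor):
    """Old shape of the events stage: several full copies alive at once."""
    # (1) fetchall materializes every row at once          -> copy #1
    rows = cursor.fetchall()
    # (2) DataFrame construction duplicates the data       -> copy #2
    df = pd.DataFrame(rows)
    grouped = {}
    # (3) iterrows + to_dict builds one dict per row,
    #     accumulating a third full copy in `grouped`      -> copy #3
    for _, row in df.iterrows():
        record = row.to_dict()
        grouped.setdefault(record["cid"], []).append(record)
    # (4) downstream, trace_routes then holds raw_domain_results AND
    #     results[domain]["data"] at the same time -> 3-4 copies at peak.
    return grouped
```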

What Changes

  • Admission control: the trace events endpoint checks the CID count and returns HTTP 413 above the threshold
  • Batched processing: read_sql_df_slow uses cursor.fetchmany() instead of fetchall(), and no longer builds a DataFrame
  • EventFetcher groups per batch: each fetched batch is grouped into the result dict immediately, releasing batch memory
  • trace_routes avoids double holding: raw_domain_results and results are merged into a single data structure
  • Gunicorn workers reduced to 2: lowers memory contention on a single machine
  • systemd MemoryMax: adds cgroup memory protection so the OOM killer does not take down the whole VM
  • Updated .env.example: documents new env vars such as TRACE_EVENTS_CID_LIMIT and DB_SLOW_FETCHMANY_SIZE
  • Updated deploy/mes-dashboard.service: adds MemoryHigh and MemoryMax
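The first three changes can be sketched together: a fetchmany-based batch iterator, per-batch grouping, and the admission-control check. This is a hedged sketch, not the repo's implementation; the batch size, limit values, and the `admit`/`group_events` helpers are illustrative (only `read_sql_df_slow_iter`, `TRACE_EVENTS_CID_LIMIT`, and `DB_SLOW_FETCHMANY_SIZE` are names from this commit):

```python
from typing import Iterator

TRACE_EVENTS_CID_LIMIT = 50_000   # env-driven in the real config
DB_SLOW_FETCHMANY_SIZE = 10_000   # value here is illustrative

def read_sql_df_slow_iter(cursor, batch_size: int = DB_SLOW_FETCHMANY_SIZE) -> Iterator[list]:
    """Yield rows in fixed-size batches instead of one fetchall()."""
    while True:
        batch = cursor.fetchmany(batch_size)
        if not batch:
            break
        yield batch

def group_events(cursor) -> dict:
    """Group each batch as it arrives; only one batch is alive at a time."""
    grouped = {}
    for batch in read_sql_df_slow_iter(cursor):
        for record in batch:
            grouped.setdefault(record["cid"], []).append(record)
        # the batch goes out of scope here, so its memory can be reclaimed
    return grouped

def admit(cid_count: int, is_msd: bool) -> bool:
    """Admission control: oversized non-MSD queries are rejected (HTTP 413)."""
    if not is_msd and cid_count > TRACE_EVENTS_CID_LIMIT:
        return False  # route layer responds 413 Payload Too Large
    return True
```

The key design point is that the generator keeps peak residency at one batch plus the growing result dict, rather than rows + DataFrame + per-row dicts all at once.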

Capabilities

Modified Capabilities

  • trace-staged-api: events endpoint gains admission control (CID cap)
  • event-fetcher-unified: per-batch grouping memory optimization; DataFrame intermediate layer removed

Impact

  • Backend core: database.py (fetchmany), event_fetcher.py (per-batch grouping), trace_routes.py (admission control + memory management)
  • Deployment config: gunicorn.conf.py, .env.example, deploy/mes-dashboard.service
  • Unaffected: frontend, live monitoring page, other services (reject_history, hold_history, etc.)
  • Prerequisite: trace-pipeline-pool-isolation (already completed)