DashBoard/openspec/changes/archive/trace-events-memory-triage/proposal.md
egg dbe0da057c feat(trace-pipeline): memory triage, async job queue, and NDJSON streaming
Three proposals addressing the 2026-02-25 trace pipeline OOM crash (114K CIDs):

1. trace-events-memory-triage: fetchmany iterator (read_sql_df_slow_iter),
   admission control (50K CID limit for non-MSD), cache skip for large queries,
   early memory release with gc.collect()

2. trace-async-job-queue: RQ-based async jobs for queries >20K CIDs,
   separate worker process with isolated memory, frontend polling via
   useTraceProgress composable, systemd service + deploy scripts

3. trace-streaming-response: chunked Redis storage (TRACE_STREAM_BATCH_SIZE=5000),
   NDJSON stream endpoint (GET /api/trace/job/<id>/stream), frontend
   ReadableStream consumer for progressive rendering, backward-compatible
   with legacy single-key storage

All three proposals archived. 1101 tests pass, frontend builds clean.

Co-Authored-By: Claude Opus 4.6 <noreply@anthropic.com>
2026-02-25 21:01:27 +08:00


## Why
On 2026-02-25 the production trace pipeline was OOM-SIGKILLed (7 GB VM, no swap) while processing 114K CIDs (TMTT site + a 5-month date range). Pool isolation is already in place, so connections no longer contend with each other, but memory usage in the events stage is the real bottleneck:
1. `cursor.fetchall()` loads every row (hundreds of thousands) into memory at once
2. `pd.DataFrame(rows)` makes another full copy
3. `df.iterrows()` + `row.to_dict()` makes yet another
4. `grouped[cid].append(record)` accumulates everything into the final dict
5. `raw_domain_results[domain]` + `results[domain]["data"]` hold two copies simultaneously in trace_routes

At 114K CIDs × 2 domains, 3-4 full copies of the data coexist at peak, each hundreds of MB → 2-4 GB for a single domain. A 7 GB VM (4 workers) cannot absorb that.
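The copy-multiplying pattern described in steps 1-4 can be sketched as follows (hypothetical function and column names; the real code lives in database.py and event_fetcher.py):

```python
import pandas as pd

def fetch_events_eager(cursor):
    """Anti-pattern: every step below materializes another full copy."""
    rows = cursor.fetchall()        # copy 1: all rows resident at once
    df = pd.DataFrame(rows)         # copy 2: full DataFrame duplicate
    grouped = {}
    for _, row in df.iterrows():    # copy 3: each row re-materialized...
        record = row.to_dict()      # ...as a per-row dict
        grouped.setdefault(record["cid"], []).append(record)  # copy 4
    return grouped                  # caller in trace_routes then holds it twice
```

Each stage keeps its predecessor alive until the function returns, which is why peak memory is a multiple of the raw result-set size.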
## What Changes
- **Admission control**: add a CID-count cap to the trace events endpoint; requests above the threshold get HTTP 413
- **Batched processing**: switch `read_sql_df_slow` from `fetchall()` to `cursor.fetchmany()`, with no intermediate DataFrame
- **EventFetcher per-batch grouping**: group each batch into the result dict as soon as it is fetched, releasing the batch's memory
- **No double-holding in trace_routes**: merge `raw_domain_results` and `results` into a single data structure
- **Reduce Gunicorn workers to 2**: lower memory contention on the single host
- **systemd MemoryMax**: add cgroup memory protection so an OOM kills only the service, not the whole VM
- **Update .env.example**: document the new env vars such as `TRACE_EVENTS_CID_LIMIT` and `DB_SLOW_FETCHMANY_SIZE`
- **Update deploy/mes-dashboard.service**: add `MemoryHigh` and `MemoryMax`
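The batched-fetch plus per-batch grouping items above can be sketched like this (a minimal illustration, not the actual `read_sql_df_slow` implementation; the 5000 default is an assumption tied to `DB_SLOW_FETCHMANY_SIZE`):

```python
import gc

def group_events_batched(cursor, batch_size=5000):
    """Group rows per CID batch by batch: no fetchall(), no DataFrame layer."""
    grouped = {}
    while True:
        batch = cursor.fetchmany(batch_size)
        if not batch:                 # DB-API: empty sequence when exhausted
            break
        for record in batch:
            grouped.setdefault(record["cid"], []).append(record)
        del batch                     # drop the reference so each batch can be freed
    gc.collect()                      # early release, as the triage proposal does
    return grouped
```

Only one batch is resident at a time, so peak memory is bounded by `batch_size` plus the grouped result itself, instead of several full copies of the result set.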
## Capabilities
### Modified Capabilities
- `trace-staged-api`: events endpoint gains admission control (CID cap)
- `event-fetcher-unified`: per-batch grouping memory optimization; the DataFrame intermediate layer is removed
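The admission-control behavior could be factored as a small check like the following (a sketch: the function name and handler wiring are hypothetical, while `TRACE_EVENTS_CID_LIMIT` and the 50K non-MSD default come from this change):

```python
import os

# Threshold read from the env var this change documents in .env.example;
# 50000 matches the non-MSD limit quoted in the commit message.
CID_LIMIT = int(os.environ.get("TRACE_EVENTS_CID_LIMIT", "50000"))

def admit(cids, limit=CID_LIMIT):
    """Return (accepted, http_status); oversized requests get HTTP 413."""
    if len(cids) > limit:
        return False, 413
    return True, 200
```

Rejecting up front means an oversized request never reaches the fetch stage, so the worker's memory use stays bounded regardless of what the client asks for.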
## Impact
- **Backend core**: database.py (fetchmany), event_fetcher.py (per-batch grouping), trace_routes.py (admission control + memory management)
- **Deployment config**: gunicorn.conf.py, .env.example, deploy/mes-dashboard.service
- **Not affected**: frontend, live monitoring page, other services (reject_history, hold_history, etc.)
- **Prerequisite**: trace-pipeline-pool-isolation (completed)
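The `MemoryHigh`/`MemoryMax` addition to deploy/mes-dashboard.service could look like the fragment below; the concrete values are illustrative assumptions, not taken from the proposal:

```ini
[Service]
# Soft ceiling: above this, the kernel throttles the service and reclaims memory
MemoryHigh=4G
# Hard ceiling: the cgroup OOM-kills this service only, not the whole VM
MemoryMax=5G
```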