DashBoard/openspec/changes/archive/trace-async-job-queue/proposal.md
egg dbe0da057c feat(trace-pipeline): memory triage, async job queue, and NDJSON streaming
Three proposals addressing the 2026-02-25 trace pipeline OOM crash (114K CIDs):

1. trace-events-memory-triage: fetchmany iterator (read_sql_df_slow_iter),
   admission control (50K CID limit for non-MSD), cache skip for large queries,
   early memory release with gc.collect()

2. trace-async-job-queue: RQ-based async jobs for queries >20K CIDs,
   separate worker process with isolated memory, frontend polling via
   useTraceProgress composable, systemd service + deploy scripts

3. trace-streaming-response: chunked Redis storage (TRACE_STREAM_BATCH_SIZE=5000),
   NDJSON stream endpoint (GET /api/trace/job/<id>/stream), frontend
   ReadableStream consumer for progressive rendering, backward-compatible
   with legacy single-key storage

All three proposals archived. 1101 tests pass, frontend builds clean.

Co-Authored-By: Claude Opus 4.6 <noreply@anthropic.com>
2026-02-25 21:01:27 +08:00


## Why
Even with the batched-processing optimizations from proposal 1, the trace pipeline still faces fundamental problems when handling large CID sets (> 20K):
1. **Synchronous request-response model**: gunicorn's 360s timeout is a hard limit, and lineage + events combined can exceed 300s
2. **A worker thread is tied up**: one gunicorn thread is fully occupied for the duration of a large query, degrading service capacity for the real-time pages
3. **No progress feedback on the frontend**: users stare at a loading spinner for 5-6 minutes with no indication that anything is actually running
4. **Failures require a complete re-run**: after a mid-flight timeout/OOM, the already-completed seed-resolve and lineage results are all wasted

The industry-standard approach is to push long-running tasks into an async queue (RQ/Dramatiq): the API immediately returns 202 + job_id, a background worker processes the task independently, and the frontend polls (or uses SSE) to retrieve the result.
## What Changes
- **Introduce RQ (Redis Queue)**: reuses the existing Redis infrastructure, minimizing new dependencies
- **Add a trace job worker**: separate process (systemd unit) that does not consume gunicorn worker resources
- **Add `POST /api/trace/events-async`**: returns 202 + job_id when the CID count exceeds the threshold
- **Add `GET /api/trace/job/{job_id}`**: poll job status (queued/running/completed/failed)
- **Add `GET /api/trace/job/{job_id}/result`**: fetch the completed result (paginated)
- **Frontend useTraceProgress integration**: automatically selects the sync/async path and displays job progress
- **Job TTL + automatic cleanup**: results expire automatically after 1 hour
- **Add a systemd unit**: `mes-dashboard-trace-worker.service`
- **Update .env.example**: `TRACE_ASYNC_CID_THRESHOLD`, `TRACE_JOB_TTL_SECONDS`, `TRACE_WORKER_COUNT`
## Capabilities
### New Capabilities
- `trace-async-job`: async trace job queue (RQ + Redis)
### Modified Capabilities
- `trace-staged-api`: events endpoint integrates async job routing
- `progressive-trace-ux`: frontend integrates job polling + progress display
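The polling loop that `useTraceProgress` performs against the status endpoint can be sketched as follows, expressed in Python for illustration (the real code is the JavaScript composable). `poll_until_done` and the backoff parameters are assumptions, not the composable's actual API:

```python
import time


def poll_until_done(fetch_status, initial_delay=0.5, max_delay=5.0,
                    sleep=time.sleep):
    """Poll until a terminal state, with capped exponential backoff.

    fetch_status() stands in for GET /api/trace/job/{job_id}; `sleep`
    is injectable so tests can run without real delays.
    """
    delay = initial_delay
    while True:
        status = fetch_status()
        if status in ("completed", "failed"):
            return status
        sleep(delay)
        delay = min(delay * 2, max_delay)  # back off between polls
```

Backoff keeps a 5-6 minute job from hammering the API while still surfacing progress transitions quickly at the start.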
## Impact
- **New dependency**: `rq>=1.16.0,<2.0.0` (requirements.txt, environment.yml)
- **Backend additions**: trace_job_service.py, trace_routes.py (async endpoints)
- **Frontend changes**: useTraceProgress.js (async integration)
- **Deployment additions**: deploy/mes-dashboard-trace-worker.service, scripts/start_server.sh (worker management)
- **Deployment config**: .env.example (new env vars)
- **Unaffected**: other services, the real-time monitoring pages, the admin pages
- **Prerequisite**: trace-events-memory-triage (proposal 1)