DashBoard/openspec/changes/archive/trace-streaming-response/design.md
egg dbe0da057c feat(trace-pipeline): memory triage, async job queue, and NDJSON streaming
Three proposals addressing the 2026-02-25 trace pipeline OOM crash (114K CIDs):

1. trace-events-memory-triage: fetchmany iterator (read_sql_df_slow_iter),
   admission control (50K CID limit for non-MSD), cache skip for large queries,
   early memory release with gc.collect()

2. trace-async-job-queue: RQ-based async jobs for queries >20K CIDs,
   separate worker process with isolated memory, frontend polling via
   useTraceProgress composable, systemd service + deploy scripts

3. trace-streaming-response: chunked Redis storage (TRACE_STREAM_BATCH_SIZE=5000),
   NDJSON stream endpoint (GET /api/trace/job/<id>/stream), frontend
   ReadableStream consumer for progressive rendering, backward-compatible
   with legacy single-key storage

All three proposals archived. 1101 tests pass, frontend builds clean.

Co-Authored-By: Claude Opus 4.6 <noreply@anthropic.com>
2026-02-25 21:01:27 +08:00


Context

Proposal 2 (trace-async-job-queue) runs large queries in a separate worker, but the result is still fully materialized into Redis (as the job result) and into frontend memory.

For 114K CIDs × 2 domains, the result JSON can reach 200-500MB:

  • Worker memory: grouped dict ~500MB + JSON serialization ~500MB = ~1GB peak
  • Redis: SETEX of a 500MB key takes 5-10s and blocks other operations
  • Frontend: parsing 500MB of JSON freezes the browser UI for tens of seconds

A streamed response lets the server produce, and the frontend consume, one batch at a time, so memory usage scales only with the batch size.

Goals / Non-Goals

Goals:

  • Return job results as an NDJSON stream, avoiding full materialization
  • EventFetcher supports an iterator mode, yielding results batch by batch
  • The frontend parses line by line via ReadableStream and renders batch by batch
  • Results are also available through a pagination API (for consumers that do not support streaming)

Non-Goals:

  • No change to the synchronous path (queries below the CID threshold still use the existing JSON response)
  • No WebSocket (NDJSON over HTTP is simpler and more universal)
  • No Server-Sent Events (SSE only supports text/event-stream and is unsuited to large payloads)
  • No change to MSD aggregation (aggregation needs the full dataset, but its result is small)

Decisions

D1: NDJSON format

Decision: use Newline Delimited JSON (NDJSON) as the streaming format.

Content-Type: application/x-ndjson

{"type":"meta","job_id":"abc123","domains":["history","materials"],"cid_count":114892}
{"type":"domain_start","domain":"history","batch":1,"total_batches":23}
{"type":"records","domain":"history","batch":1,"data":[...5000 records...]}
{"type":"records","domain":"history","batch":2,"data":[...5000 records...]}
...
{"type":"domain_end","domain":"history","total_records":115000}
{"type":"domain_start","domain":"materials","batch":1,"total_batches":12}
...
{"type":"aggregation","data":{...}}
{"type":"complete","elapsed_seconds":285}

Env var: TRACE_STREAM_BATCH_SIZE (default: 5000 records/batch)

Rationale:

  • NDJSON is the industry-standard streaming JSON format (used by Elasticsearch, BigQuery, and the GitHub API)
  • Each line is an independent JSON document, so the frontend can parse line by line without waiting for the whole response
  • 5000 records/batch ≈ 2-5MB, which the browser can render in real time
  • Pairs naturally with HTTP/1.1 chunked transfer encoding
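As an illustration of the protocol above, the server side can emit these lines from a plain generator. This is a sketch only; the names `ndjson_stream` and `batches_by_domain` are hypothetical, not the actual implementation:

```python
import json

def ndjson_stream(job_id, domains, batches_by_domain, cid_count):
    # Illustrative only: emit the event protocol described above,
    # one independent JSON document per line.
    yield json.dumps({"type": "meta", "job_id": job_id,
                      "domains": domains, "cid_count": cid_count}) + "\n"
    for domain in domains:
        batches = batches_by_domain[domain]
        yield json.dumps({"type": "domain_start", "domain": domain,
                          "batch": 1, "total_batches": len(batches)}) + "\n"
        total = 0
        for i, batch in enumerate(batches, start=1):
            total += len(batch)
            yield json.dumps({"type": "records", "domain": domain,
                              "batch": i, "data": batch}) + "\n"
        yield json.dumps({"type": "domain_end", "domain": domain,
                          "total_records": total}) + "\n"
    yield json.dumps({"type": "complete"}) + "\n"
```

In Flask, returning such a generator wrapped in `Response(..., mimetype="application/x-ndjson")` streams it to the client via chunked transfer.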

D2: EventFetcher iterator mode

Decision: add a fetch_events_iter() method that yields each batch of grouped records.

from collections import defaultdict  # module-level import

@staticmethod
def fetch_events_iter(container_ids, domain, batch_size=5000):
    """Yield dicts of {cid: [records]} in batches."""
    # ... same SQL building logic ...
    for oracle_batch_ids in batches:
        for columns, rows in read_sql_df_slow_iter(sql, params):
            batch_grouped = defaultdict(list)
            for row in rows:
                record = dict(zip(columns, row))
                cid = record.get("CONTAINERID")
                if cid:
                    batch_grouped[cid].append(record)
            yield dict(batch_grouped)

Rationale:

  • Coexists with fetch_events(); the synchronous path is unaffected
  • Each yield holds only the grouped result of a single fetchmany batch
  • The worker serializes and writes out each yielded batch immediately, without accumulating
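A sketch of that worker-side loop, assuming a chunked key layout like `trace:job:{id}:{domain}:{n}` (the key naming is an assumption, not the actual schema). Each yielded batch is serialized and written out immediately:

```python
import json

def store_batches(redis_client, job_id, domain, batch_iter, ttl=3600):
    # Sketch only: write each yielded batch to its own Redis key so the
    # worker never holds the full result in memory at once.
    count = 0
    for grouped in batch_iter:
        key = f"trace:job:{job_id}:{domain}:{count}"
        redis_client.setex(key, ttl, json.dumps(grouped))
        count += 1
    # Record how many chunk keys exist so readers know the range.
    redis_client.setex(f"trace:job:{job_id}:{domain}:batches", ttl, str(count))
    return count
```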

D3: Result pagination API

Decision: provide a REST pagination API as an alternative to NDJSON.

GET /api/trace/job/{job_id}/result?domain=history&offset=0&limit=5000

Response format:

{
  "domain": "history",
  "offset": 0,
  "limit": 5000,
  "total": 115000,
  "data": [... 5000 records ...]
}

Rationale:

  • Some consumers (e.g. external systems) do not support NDJSON streaming
  • A pagination API is the standard REST pattern
  • Results are still stored in Redis, but split into per-domain keys, each holding 5000 records ≈ 5MB
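Serving an offset/limit request from chunked storage reduces to mapping the requested window onto chunk slices. A minimal sketch (the function name is hypothetical):

```python
def page_to_chunks(offset, limit, chunk_size=5000):
    # Map [offset, offset + limit) onto the chunked keys:
    # return (chunk_index, start_in_chunk, end_in_chunk) slices.
    slices = []
    pos = offset
    end = offset + limit
    while pos < end:
        chunk = pos // chunk_size
        start_in_chunk = pos % chunk_size
        take = min(chunk_size - start_in_chunk, end - pos)
        slices.append((chunk, start_in_chunk, start_in_chunk + take))
        pos += take
    return slices
```

A request aligned with the chunk size reads exactly one key; a misaligned one (e.g. offset=7500) touches two adjacent chunks.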

D4: Frontend ReadableStream consumption

async function consumeNDJSON(url, onChunk) {
  const response = await fetch(url)
  const reader = response.body.getReader()
  const decoder = new TextDecoder()
  let buffer = ''

  while (true) {
    const { done, value } = await reader.read()
    if (done) break
    buffer += decoder.decode(value, { stream: true })
    const lines = buffer.split('\n')
    buffer = lines.pop()  // keep the incomplete last line
    for (const line of lines) {
      if (line.trim()) onChunk(JSON.parse(line))
    }
  }
}

Rationale:

  • ReadableStream is a native browser API; no extra dependency needed
  • Line-by-line parsing keeps memory usage bounded (proportional only to batch_size)
  • Rendering while receiving gives a much better user experience
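The same buffering logic, sketched here in Python so it can be tested against the protocol (`iter_ndjson` is illustrative): a stateful UTF-8 decoder (the counterpart of TextDecoder with `stream: true`, needed because a multi-byte character may be split across chunks) plus a carry buffer for the incomplete last line.

```python
import codecs
import json

def iter_ndjson(chunks):
    # Stateful decode: bytes of one UTF-8 character may arrive in
    # different chunks, so we cannot decode each chunk independently.
    decoder = codecs.getincrementaldecoder("utf-8")()
    buffer = ""
    for chunk in chunks:
        buffer += decoder.decode(chunk)
        *lines, buffer = buffer.split("\n")  # keep the incomplete last line
        for line in lines:
            if line.strip():
                yield json.loads(line)
    if buffer.strip():  # stream ended without a trailing newline
        yield json.loads(buffer)
```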

Risks / Trade-offs

| Risk | Mitigation |
| --- | --- |
| NDJSON does not get HTTP compression by default | Flask can add a gzip middleware; batches of 5000 records compress well |
| A dropped connection means restarting the stream | The pagination API can resume from the break point; NDJSON is for one-shot consumption |
| The frontend must render partial results | Switch the table component to virtual scrolling (vue-virtual-scroller already in use) |
| MSD aggregation still needs the full dataset | Aggregation completes inside the worker; only the (small) final result is streamed |
| Per-domain result keys increase the Redis key count | TTL cleanup + key-prefix isolation |
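On the compression risk: streamed gzip is workable if each line is flushed with Z_SYNC_FLUSH, so the client can decompress chunks as they arrive instead of waiting for the full body. A sketch using only the standard library (wrapping a generator like this is an assumption about how the middleware would be wired):

```python
import zlib

def gzip_stream(ndjson_lines):
    # Sketch: wrap an NDJSON line generator in a streamed gzip body.
    # wbits=31 selects the gzip container; Z_SYNC_FLUSH makes each
    # line decodable as soon as it arrives, at some compression cost.
    comp = zlib.compressobj(wbits=31)
    for line in ndjson_lines:
        chunk = comp.compress(line.encode("utf-8")) + comp.flush(zlib.Z_SYNC_FLUSH)
        if chunk:
            yield chunk
    yield comp.flush()  # final gzip trailer
```

The streamed response would then carry `Content-Encoding: gzip` alongside the NDJSON content type.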