DashBoard/openspec/changes/archive/trace-streaming-response/design.md
egg dbe0da057c feat(trace-pipeline): memory triage, async job queue, and NDJSON streaming
Three proposals addressing the 2026-02-25 trace pipeline OOM crash (114K CIDs):

1. trace-events-memory-triage: fetchmany iterator (read_sql_df_slow_iter),
   admission control (50K CID limit for non-MSD), cache skip for large queries,
   early memory release with gc.collect()

2. trace-async-job-queue: RQ-based async jobs for queries >20K CIDs,
   separate worker process with isolated memory, frontend polling via
   useTraceProgress composable, systemd service + deploy scripts

3. trace-streaming-response: chunked Redis storage (TRACE_STREAM_BATCH_SIZE=5000),
   NDJSON stream endpoint (GET /api/trace/job/<id>/stream), frontend
   ReadableStream consumer for progressive rendering, backward-compatible
   with legacy single-key storage

All three proposals archived. 1101 tests pass, frontend builds clean.

Co-Authored-By: Claude Opus 4.6 <noreply@anthropic.com>
2026-02-25 21:01:27 +08:00


## Context
Proposal 2 (trace-async-job-queue) moves large queries into a separate worker process, but the result is still fully materialized into Redis (as the job result) and into frontend memory. For 114K CIDs × 2 domains, the result JSON can reach 200-500MB:
- Worker memory: grouped dict ~500MB + JSON serialization ~500MB = ~1GB peak
- Redis: SETEX of a 500MB key takes 5-10s and blocks other operations
- Frontend: the browser freezes for tens of seconds parsing 500MB of JSON
Streaming lets the server produce batches and the frontend consume them as they arrive, so memory usage is proportional only to the batch size.
## Goals / Non-Goals
**Goals:**
- Stream job results as NDJSON instead of materializing them in full
- Add an iterator mode to EventFetcher that yields results batch by batch
- Parse the stream line by line in the frontend via ReadableStream and render incrementally
- Also offer a paginated result API (for consumers that cannot stream)
**Non-Goals:**
- No change to the synchronous path (queries below the CID threshold keep the existing JSON response)
- No WebSocket (NDJSON over HTTP is simpler and more broadly supported)
- No Server-Sent Events (SSE only supports text/event-stream and is a poor fit for large payloads)
- No change to MSD aggregation (aggregation needs the full dataset, but its result is small)
## Decisions
### D1: NDJSON format
**Decision:** use Newline Delimited JSON (NDJSON) as the streaming format:
```
Content-Type: application/x-ndjson
{"type":"meta","job_id":"abc123","domains":["history","materials"],"cid_count":114892}
{"type":"domain_start","domain":"history","batch":1,"total_batches":23}
{"type":"records","domain":"history","batch":1,"data":[...5000 records...]}
{"type":"records","domain":"history","batch":2,"data":[...5000 records...]}
...
{"type":"domain_end","domain":"history","total_records":115000}
{"type":"domain_start","domain":"materials","batch":1,"total_batches":12}
...
{"type":"aggregation","data":{...}}
{"type":"complete","elapsed_seconds":285}
```
**env var:** `TRACE_STREAM_BATCH_SIZE` (default: 5000 records/batch)
**Rationale:**
- NDJSON is the industry-standard streaming JSON format (used by Elasticsearch, BigQuery, and the GitHub API)
- Each line is an independent JSON document, so the frontend can parse line by line without waiting for the whole response
- 5000 records/batch (2-5MB) is small enough for the browser to render as it arrives
- Pairs naturally with HTTP/1.1 chunked transfer encoding
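The line protocol above can be sketched as a plain Python generator; Flask can wrap such a generator in a `Response(..., mimetype="application/x-ndjson")` to get chunked transfer for free. The function name and the `grouped_batches` shape below are illustrative assumptions, not the actual implementation:

```python
import json

def ndjson_events(job_id, grouped_batches, batch_size=5000):
    """Yield NDJSON lines for one job: a meta line, record batches, a complete line.

    grouped_batches is an iterable of (domain, records) pairs -- an assumed
    stand-in for the real worker output.
    """
    yield json.dumps({"type": "meta", "job_id": job_id}) + "\n"
    for domain, records in grouped_batches:
        # Split each domain's records into fixed-size batches, one JSON line per batch.
        for i in range(0, len(records), batch_size):
            yield json.dumps({
                "type": "records",
                "domain": domain,
                "batch": i // batch_size + 1,
                "data": records[i:i + batch_size],
            }) + "\n"
    yield json.dumps({"type": "complete"}) + "\n"
```

Because each `yield` is one complete line, peak memory on the server is one batch, never the full result.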
### D2: EventFetcher iterator mode
**Decision:** add a `fetch_events_iter()` method that yields grouped records batch by batch:
```python
from collections import defaultdict

@staticmethod
def fetch_events_iter(container_ids, domain, batch_size=5000):
    """Yield dicts of {cid: [records]} in batches."""
    # ... same SQL building logic ...
    for oracle_batch_ids in batches:
        for columns, rows in read_sql_df_slow_iter(sql, params):
            batch_grouped = defaultdict(list)
            for row in rows:
                record = dict(zip(columns, row))
                cid = record.get("CONTAINERID")
                if cid:
                    batch_grouped[cid].append(record)
            yield dict(batch_grouped)
```
**Rationale:**
- Coexists with `fetch_events()`, so the synchronous path is unaffected
- Each yield holds only the grouped result of a single fetchmany batch
- The worker serializes and writes out each yielded batch immediately instead of accumulating
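The worker side of this decision can be sketched as: write each yielded batch to its own chunked Redis key as soon as it arrives. The key layout and the `redis_set` callable (standing in for `redis.setex`) are assumptions for illustration:

```python
import json

def store_batches(job_id, domain, batch_iter, redis_set, ttl=3600):
    """Persist each yielded batch under its own chunked key, never the full result.

    redis_set(key, ttl, value) is an assumed SETEX-style callable, not the
    real client; the trace:result:* key layout is likewise hypothetical.
    """
    count = 0
    for batch_no, grouped in enumerate(batch_iter, start=1):
        key = f"trace:result:{job_id}:{domain}:{batch_no}"
        # Serialize and write out immediately -- only one batch is in memory.
        redis_set(key, ttl, json.dumps(grouped))
        count = batch_no
    # Index key tells readers how many chunks exist for this domain.
    redis_set(f"trace:result:{job_id}:{domain}:batches", ttl, str(count))
    return count
```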
### D3: Paginated result API
**Decision:** provide a paginated REST API as an alternative to NDJSON streaming:
```
GET /api/trace/job/{job_id}/result?domain=history&offset=0&limit=5000
```
**Response format:**
```json
{
  "domain": "history",
  "offset": 0,
  "limit": 5000,
  "total": 115000,
  "data": [... 5000 records ...]
}
```
**Rationale:**
- Some consumers (e.g. external systems) cannot consume an NDJSON stream
- Pagination is a standard REST pattern
- Results stay in Redis, but keyed per domain with 5000 records (~5MB) per key
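Chunked per-domain keys still support offset/limit paging, because only the chunks that overlap the requested window need to be loaded. A minimal sketch, where `load_chunk` is an assumed stand-in for a Redis GET on one per-batch key:

```python
def page_result(load_chunk, offset, limit, chunk_size=5000):
    """Return records [offset, offset+limit) from chunked storage.

    load_chunk(i) returns the list of records in 0-based chunk i -- an
    assumption standing in for fetching trace:result:{job}:{domain}:{i+1}.
    """
    # Only the chunks covering the requested window are fetched.
    first = offset // chunk_size
    last = (offset + limit - 1) // chunk_size
    records = []
    for i in range(first, last + 1):
        records.extend(load_chunk(i))
    # Trim to the exact window within the concatenated chunks.
    start = offset - first * chunk_size
    return records[start:start + limit]
```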
### D4: Frontend ReadableStream consumer
```javascript
async function consumeNDJSON(url, onChunk) {
  const response = await fetch(url)
  const reader = response.body.getReader()
  const decoder = new TextDecoder()
  let buffer = ''
  while (true) {
    const { done, value } = await reader.read()
    if (done) break
    buffer += decoder.decode(value, { stream: true })
    const lines = buffer.split('\n')
    buffer = lines.pop() // keep the trailing partial line
    for (const line of lines) {
      if (line.trim()) onChunk(JSON.parse(line))
    }
  }
  // Flush a final line that arrived without a trailing newline
  if (buffer.trim()) onChunk(JSON.parse(buffer))
}
```
**Rationale:**
- ReadableStream is a native browser API; no extra dependency
- Line-by-line parsing keeps memory usage constant, proportional only to batch_size
- Rendering while receiving gives a much better user experience
## Risks / Trade-offs
| Risk | Mitigation |
|------|------------|
| NDJSON responses are not compressed by default | Flask can add gzip middleware; 5000-record lines compress well |
| A dropped connection means restarting the stream | The paginated API can resume from the last offset; NDJSON is for one-shot consumption |
| Frontend must render partial results | Table components switch to virtual scroll (vue-virtual-scroller is already in use) |
| MSD aggregation still needs the full dataset | Aggregation completes inside the worker; only the (small) final result is streamed |
| Per-domain chunk keys increase the Redis key count | TTL cleanup + key-prefix isolation |