feat(trace-pipeline): memory triage, async job queue, and NDJSON streaming
Three proposals addressing the 2026-02-25 trace pipeline OOM crash (114K CIDs):

1. trace-events-memory-triage: fetchmany iterator (read_sql_df_slow_iter), admission control (50K CID limit for non-MSD), cache skip for large queries, early memory release with gc.collect()
2. trace-async-job-queue: RQ-based async jobs for queries >20K CIDs, separate worker process with isolated memory, frontend polling via useTraceProgress composable, systemd service + deploy scripts
3. trace-streaming-response: chunked Redis storage (TRACE_STREAM_BATCH_SIZE=5000), NDJSON stream endpoint (GET /api/trace/job/<id>/stream), frontend ReadableStream consumer for progressive rendering, backward-compatible with legacy single-key storage

All three proposals archived. 1101 tests pass, frontend builds clean.

Co-Authored-By: Claude Opus 4.6 <noreply@anthropic.com>
@@ -0,0 +1,2 @@

schema: spec-driven
created: 2026-02-25
openspec/changes/archive/trace-async-job-queue/design.md (new file, 149 lines)
## Context

Proposal 1 (trace-events-memory-triage) solved the peak-memory problem and added admission control, but queries with more than 50K CIDs are now rejected outright (HTTP 413). Users still have legitimate reasons to query large ranges (e.g. the TMTT station over 5 months = 114K CIDs).

The codebase currently has no asynchronous task infrastructure at all (no Celery, RQ, or Dramatiq). Every operation is synchronous request-response, hard-capped by the gunicorn 360s timeout.

We need a lightweight job queue so that large queries run in a separate worker process: they hold no gunicorn thread, are not bound by the 360s timeout, and can be re-run after failure.
## Goals / Non-Goals

**Goals:**
- Route trace events queries above the CID threshold to an async job (API returns 202 + job_id)
- Separate worker process (systemd unit) that consumes no gunicorn resources
- Queryable job status (queued/running/completed/failed)
- Results expire via TTL so Redis memory is not held long-term
- Frontend picks the sync/async path automatically and shows job progress
- Minimal new dependencies (reuse the existing Redis)

**Non-Goals:**
- No general-purpose task queue (trace events only)
- No automatic job retries (retrying a large query is expensive; users re-trigger manually after failure)
- No job cancellation (an Oracle query is hard to cancel once issued)
- No job persistence in the DB (Redis TTL is sufficient)
- No changes to the lineage stage (still synchronous, usually < 120s)
## Decisions

### D1: RQ (Redis Queue) over Celery/Dramatiq

**Decision**: use RQ as the job queue.

**Rationale**:
- The project already runs Redis, so zero extra infrastructure
- RQ is roughly 10x lighter than Celery (no broker layer, no beat scheduler, no flower)
- An RQ worker is a separate Python process with isolated memory
- Simple API: `queue.enqueue(func, args, job_timeout=600, result_ttl=3600)`
- Active community and good Flask ecosystem integration

**Alternatives**:
- Celery: too heavy; the project does not need beat, chord, chain, etc. → rejected
- Dramatiq: even lighter, but a smaller community, and its Redis broker integration is less mature than RQ's → rejected
- Hand-rolled threading: ruled out in earlier discussion (worker lifecycle, memory contention) → rejected
### D2: Sync/async threshold

**Decision**:

| CID count | Behavior |
|-----------|----------|
| ≤ 20,000 | Synchronous (existing events endpoint) |
| 20,001 – 50,000 | Async job (202 + job_id) |
| > 50,000 | Async job (202 + job_id); the worker processes the set in segments |

**env var**: `TRACE_ASYNC_CID_THRESHOLD` (default 20000)

**Rationale**:
- Events queries with ≤ 20K CIDs usually finish within 60s; synchronous is fine
- 20K–50K takes 2–5 minutes, beyond user patience, and ties up a gunicorn thread
- > 50K is the admission-control cap from proposal 1 and must go async

**Proposal 1's HTTP 413 becomes HTTP 202**: once proposal 2 lands, the `TRACE_EVENTS_CID_LIMIT` check from proposal 1 falls back to an async job automatically instead of rejecting the request.
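With these thresholds, the routing decision in the events endpoint reduces to a small pure function. A minimal sketch (the function name and return shape are illustrative, not the project's actual code):

```python
def route_events_request(cid_count, async_threshold=20000, chunk_threshold=50000):
    """Map a CID count onto a row of the threshold table above.

    async_threshold mirrors TRACE_ASYNC_CID_THRESHOLD; above chunk_threshold
    the worker additionally processes the CID set in segments.
    """
    if cid_count <= async_threshold:
        return {"mode": "sync"}
    return {"mode": "async", "worker_chunked": cid_count > chunk_threshold}
```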
### D3: Job status and result storage

**Decision**: use RQ's built-in job status tracking (stored in Redis).

```
Job lifecycle:
  queued → started → finished / failed

Redis keys:
  rq:job:{job_id}           # RQ's built-in job metadata
  trace:job:{job_id}:meta   # custom metadata (profile, cid_count, domains, progress)
  trace:job:{job_id}:result # result after completion (JSON, with TTL)
```

**env vars**:
- `TRACE_JOB_TTL_SECONDS`: result retention time (default 3600 = 1 hour)
- `TRACE_JOB_TIMEOUT_SECONDS`: maximum runtime per job (default 1800 = 30 minutes)
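The key layout above can be captured in a small storage helper. This is only a sketch, not the actual trace_job_service code: the function name and the extra meta fields written here are assumptions, and `redis_client` is anything exposing redis-py's `setex()`:

```python
import json
import time

def store_job_result(redis_client, job_id, result, meta, ttl_seconds=3600):
    """Persist a finished job's result and custom metadata under the
    trace:job:{job_id}:* keys, both expiring after ttl_seconds
    (TRACE_JOB_TTL_SECONDS, default 3600)."""
    meta = dict(meta, status="finished", completed_at=time.time())
    redis_client.setex(f"trace:job:{job_id}:meta", ttl_seconds, json.dumps(meta))
    redis_client.setex(f"trace:job:{job_id}:result", ttl_seconds, json.dumps(result))
```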
### D4: API design

```
POST /api/trace/events               ← existing; synchronous when CID ≤ threshold
POST /api/trace/events               ← returns 202 + job_id when CID > threshold (same endpoint)
GET  /api/trace/job/{job_id}         ← query job status
GET  /api/trace/job/{job_id}/result  ← fetch the full result
GET  /api/trace/job/{job_id}/result?domain=history&offset=0&limit=5000  ← paginated result
```

**202 response format**:
```json
{
  "stage": "events",
  "async": true,
  "job_id": "trace-evt-abc123",
  "status_url": "/api/trace/job/trace-evt-abc123",
  "estimated_seconds": 300
}
```
### D5: Worker deployment architecture

```
systemd (mes-dashboard-trace-worker.service)
  → conda run -n mes-dashboard rq worker trace-events --with-scheduler
  → separate process, separate memory space
  → MemoryMax=4G (cgroup protection)
```

**env vars**:
- `TRACE_WORKER_COUNT`: number of worker processes (default 1)
- `TRACE_WORKER_QUEUE`: queue name (default `trace-events`)
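The deployment above maps onto a systemd unit along these lines. This is only a sketch using values stated in this change (MemoryHigh/MemoryMax come from the tasks list); the Description, paths, and conda location are placeholders, not the actual deploy file:

```ini
[Unit]
Description=MES Dashboard trace events RQ worker
After=network.target redis.service

[Service]
# Placeholder paths; the real unit is deploy/mes-dashboard-trace-worker.service
WorkingDirectory=/opt/mes-dashboard
ExecStart=/opt/conda/bin/conda run -n mes-dashboard rq worker trace-events --with-scheduler
Restart=always
MemoryHigh=3G
MemoryMax=4G

[Install]
WantedBy=multi-user.target
```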
### D6: Frontend integration

Changes to `useTraceProgress.js`:

```javascript
// events stage
const eventsResp = await fetchStage('events', payload)
if (eventsResp.status === 202) {
  // async path
  const { job_id, status_url } = eventsResp.data
  return await pollJobUntilComplete(status_url, {
    onProgress: (status) => updateProgress('events', status.progress),
    pollInterval: 3000,
    maxPollTime: 1800000, // 30 minutes
  })
}
// sync path (existing)
return eventsResp.data
```
## Risks / Trade-offs

| Risk | Mitigation |
|------|-----------|
| RQ adds a dependency to maintain | RQ is stable with a simple API; only core features are used |
| The worker process adds memory usage | Separate cgroup with MemoryMax=4G; nearly no memory footprint when idle |
| Storing large results in Redis hurts performance | Results auto-expire after TTL=1h; proposal 3's streaming replaces full-result storage |
| A worker crash loses in-flight jobs | RQ's built-in failed-job registry; users can re-trigger manually |
| Frontend polling adds API load | pollInterval=3s; only active jobs are polled |
openspec/changes/archive/trace-async-job-queue/proposal.md (new file, 45 lines)
## Why

When the trace pipeline handles large CID sets (> 20K), even with the batching optimizations from proposal 1, these fundamental problems remain:

1. **Synchronous request-response model**: the gunicorn 360s timeout is a hard cap, and lineage + events together can exceed 300s
2. **A worker thread is held**: one gunicorn thread is fully occupied for the duration of a large query, reducing capacity for the realtime pages
3. **No progress feedback**: users stare at a loading spinner for 5–6 minutes with no signal that anything is working
4. **Failures require a full rerun**: after a mid-flight timeout/OOM, the completed seed-resolve and lineage results are all wasted

The industry-standard approach is to push long-running work onto an async queue (RQ/Dramatiq): the API returns 202 + job_id immediately, a background worker does the processing, and the frontend polls or uses SSE to fetch the result.
## What Changes

- **Introduce RQ (Redis Queue)**: reuse the existing Redis infrastructure, minimizing new dependencies
- **New trace job worker**: separate process (systemd unit) that consumes no gunicorn worker resources
- **New `POST /api/trace/events-async`**: returns 202 + job_id when CID count exceeds the threshold
- **New `GET /api/trace/job/{job_id}`**: poll job status (queued/running/completed/failed)
- **New `GET /api/trace/job/{job_id}/result`**: fetch the completed result (paginated)
- **Frontend useTraceProgress integration**: picks the sync/async path automatically and shows job progress
- **Job TTL + automatic cleanup**: results expire after 1 hour
- **New systemd unit**: `mes-dashboard-trace-worker.service`
- **Update .env.example**: `TRACE_ASYNC_CID_THRESHOLD`, `TRACE_JOB_TTL_SECONDS`, `TRACE_WORKER_COUNT`
## Capabilities

### New Capabilities

- `trace-async-job`: asynchronous trace job queue (RQ + Redis)

### Modified Capabilities

- `trace-staged-api`: events endpoint integrates async job routing
- `progressive-trace-ux`: frontend integrates job polling + progress display
## Impact

- **New dependency**: `rq>=1.16.0,<2.0.0` (requirements.txt, environment.yml)
- **Backend additions**: trace_job_service.py, trace_routes.py (async endpoints)
- **Frontend changes**: useTraceProgress.js (async integration)
- **Deployment additions**: deploy/mes-dashboard-trace-worker.service, scripts/start_server.sh (worker management)
- **Deployment config**: .env.example (new env vars)
- **Not affected**: other services, realtime monitoring pages, admin pages
- **Prerequisite**: trace-events-memory-triage (proposal 1)
@@ -0,0 +1,47 @@

## ADDED Requirements

### Requirement: Trace events endpoint SHALL support asynchronous job execution
The `/api/trace/events` endpoint SHALL automatically route large CID requests to an async job queue.

#### Scenario: CID count exceeds async threshold
- **WHEN** the events endpoint receives a request with `container_ids` count exceeding `TRACE_ASYNC_CID_THRESHOLD` (env: `TRACE_ASYNC_CID_THRESHOLD`, default: 20000)
- **THEN** the endpoint SHALL enqueue the request to the `trace-events` RQ queue
- **THEN** the endpoint SHALL return HTTP 202 with `{ "stage": "events", "async": true, "job_id": "...", "status_url": "/api/trace/job/{job_id}" }`

#### Scenario: CID count within sync threshold
- **WHEN** the events endpoint receives a request with `container_ids` count ≤ `TRACE_ASYNC_CID_THRESHOLD`
- **THEN** the endpoint SHALL process synchronously as before
### Requirement: Trace API SHALL expose job status endpoint
`GET /api/trace/job/{job_id}` SHALL return the current status of an async trace job.

#### Scenario: Job status query
- **WHEN** a client queries job status with a valid job_id
- **THEN** the endpoint SHALL return `{ "job_id": "...", "status": "queued|started|finished|failed", "progress": {...}, "created_at": "...", "elapsed_seconds": N }`

#### Scenario: Job not found
- **WHEN** a client queries job status with an unknown or expired job_id
- **THEN** the endpoint SHALL return HTTP 404 with `{ "error": "...", "code": "JOB_NOT_FOUND" }`
### Requirement: Trace API SHALL expose job result endpoint
`GET /api/trace/job/{job_id}/result` SHALL return the result of a completed async trace job.

#### Scenario: Completed job result
- **WHEN** a client requests result for a completed job
- **THEN** the endpoint SHALL return the same response format as the synchronous events endpoint
- **THEN** optional query params `domain`, `offset`, `limit` SHALL support pagination

#### Scenario: Job not yet completed
- **WHEN** a client requests result for a non-completed job
- **THEN** the endpoint SHALL return HTTP 409 with `{ "error": "...", "code": "JOB_NOT_COMPLETE", "status": "queued|started" }`
### Requirement: Async trace jobs SHALL have TTL and timeout
Job results SHALL expire after a configurable TTL, and execution SHALL be bounded by a timeout.

#### Scenario: Job result TTL
- **WHEN** a trace job completes (success or failure)
- **THEN** the result SHALL be stored in Redis with TTL = `TRACE_JOB_TTL_SECONDS` (env, default: 3600)

#### Scenario: Job execution timeout
- **WHEN** a trace job exceeds `TRACE_JOB_TIMEOUT_SECONDS` (env, default: 1800)
- **THEN** RQ SHALL terminate the job and mark it as failed
openspec/changes/archive/trace-async-job-queue/tasks.md (new file, 39 lines)
## 1. Dependencies

- [x] 1.1 Add `rq>=1.16.0,<2.0.0` to `requirements.txt`
- [x] 1.2 Add `rq>=1.16.0,<2.0.0` to pip dependencies in `environment.yml`

## 2. Trace Job Service

- [x] 2.1 Create `src/mes_dashboard/services/trace_job_service.py` with `enqueue_trace_events_job()`, `get_job_status()`, `get_job_result()`
- [x] 2.2 Implement `execute_trace_events_job()` function (RQ worker entry point): runs EventFetcher + optional MSD aggregation, stores result in Redis with TTL
- [x] 2.3 Add job metadata tracking: `trace:job:{job_id}:meta` Redis key with `{profile, cid_count, domains, status, progress, created_at, completed_at}`
- [x] 2.4 Add unit tests for trace_job_service (13 tests: enqueue, status, result, worker execution, flatten)

## 3. Async API Endpoints

- [x] 3.1 Modify `events()` in `trace_routes.py`: when `len(container_ids) > TRACE_ASYNC_CID_THRESHOLD` and async is available, call `enqueue_trace_events_job()` and return HTTP 202
- [x] 3.2 Add `GET /api/trace/job/<job_id>` endpoint: return job status from `get_job_status()`
- [x] 3.3 Add `GET /api/trace/job/<job_id>/result` endpoint: return job result from `get_job_result()` with optional `domain`, `offset`, `limit` query params
- [x] 3.4 Add rate limiting to job status/result endpoints (60 req/60s)
- [x] 3.5 Add unit tests for async endpoints (8 tests: async routing, sync fallback, 413 fallback, job status/result)

## 4. Deployment

- [x] 4.1 Create `deploy/mes-dashboard-trace-worker.service` systemd unit (MemoryHigh=3G, MemoryMax=4G)
- [x] 4.2 Update `scripts/start_server.sh`: add `start_rq_worker`/`stop_rq_worker`/`rq_worker_status` functions
- [x] 4.3 Update `scripts/deploy.sh`: add trace worker systemd install instructions
- [x] 4.4 Update `.env.example`: uncomment and add `TRACE_WORKER_ENABLED`, `TRACE_ASYNC_CID_THRESHOLD`, `TRACE_JOB_TTL_SECONDS`, `TRACE_JOB_TIMEOUT_SECONDS`, `TRACE_WORKER_COUNT`, `TRACE_WORKER_QUEUE`

## 5. Frontend Integration

- [x] 5.1 Modify `useTraceProgress.js`: detect async response (`eventsPayload.async === true`), switch to job polling mode
- [x] 5.2 Add `pollJobUntilComplete()` helper: poll `GET /api/trace/job/{job_id}` every 3s, max 30 minutes
- [x] 5.3 Add `job_progress` reactive state for UI: `{ active, job_id, status, elapsed_seconds, progress }`
- [x] 5.4 Add error handling: job failed (`JOB_FAILED`), polling timeout (`JOB_POLL_TIMEOUT`), abort support

## 6. Verification

- [x] 6.1 Run `python -m pytest tests/ -v` — 1090 passed, 152 skipped
- [x] 6.2 Run `cd frontend && npm run build` — frontend builds successfully
- [x] 6.3 Verify rq installed: `python -c "import rq; print(rq.VERSION)"` → 1.16.2
openspec/changes/archive/trace-events-memory-triage/design.md (new file, 174 lines)
## Context

Timeline of the 2026-02-25 production OOM crash:

```
13:18:15 seed-resolve (read_sql_df_slow): 525K rows → 70K lots (38.95s)
13:20:12 lineage (read_sql_df_slow): 114K CIDs, 54MB JSON (65s)
13:20:16 events (read_sql_df_slow): 2 domains × 115 batches × 2 workers
13:20:16 cursor.fetchall() starts accumulating rows → DataFrame → dict → grouped
         each domain holds ~3 full copies of the data at once
         peak memory: (fetchall rows + DataFrame + grouped dict) × 2 domains ≈ 4-6 GB
13:37:47 OOM SIGKILL — 7GB VM, 0 swap
```

Pool isolation (the previous change) fixed connection contention, but memory usage in the events stage is the root cause of the OOM.

`read_sql_df_slow` currently loads the entire result into a Python list with `cursor.fetchall()`, builds a `pd.DataFrame`, then converts to a dict list via `iterrows()` + `to_dict()`. The upstream_history domain for 114K CIDs can return 1M+ rows; each copy is hundreds of MB, and 3–4 simultaneous copies exceed the VM's memory.
## Goals / Non-Goals

**Goals:**
- Prevent large queries from OOM-killing the whole VM (admission control)
- Cut peak memory in the events stage by 60–70% (fetchmany + skipping the DataFrame)
- Protect host OS stability (systemd MemoryMax + workers reduced to 2)
- Keep the existing API response format unchanged (transparent to the frontend)
- Update deployment docs and env settings

**Non-Goals:**
- No async task queue (proposal 2's scope)
- No changes to the lineage stage (54MB is acceptable)
- No frontend changes (proposal 3's scope)
- No restrictions on user query scope (date/station remain the user's choice)
## Decisions

### D1: Admission control threshold and behavior

**Decision**: add a CID-count cap to the trace events endpoint, **differentiated by profile**.

| Profile | CID count | Behavior |
|---------|-----------|----------|
| `query_tool` / `query_tool_reverse` | ≤ 50,000 | Normal synchronous processing |
| `query_tool` / `query_tool_reverse` | > 50,000 | HTTP 413 (should not occur in practice) |
| `mid_section_defect` | any | **No hard limit**; normal processing + warning log when CID > 50K |

**env var**: `TRACE_EVENTS_CID_LIMIT` (default 50000; applies only to non-MSD profiles)

**Why MSD has no hard limit**:
- MSD scrap tracing produces aggregate statistics (pareto/tables), not a trace graph, so CID count does not affect readability
- Dropping CIDs would skew the scrap counts; data completeness is critical
- 114K CIDs is a real business scenario (TMTT station over 5 months) and cannot be rejected
- OOM risk to the host OS is contained by systemd `MemoryMax=6G` (the service may be killed but the VM survives and auto-restarts)
- Once proposal 2 lands, large MSD queries go through async jobs automatically, solving the memory problem at the root

**Why query_tool gets a 50K cap**:
- A trace graph beyond a few thousand nodes is already unreadable; 50K is a very generous safety valve
- In practice query_tool seeds are 1–50 lots → a few hundred to a few thousand CIDs after lineage

**Alternatives**:
- One uniform cap for all profiles → blocks MSD and leaves scrap statistics incomplete → rejected
- No cap + fetchmany only → MSD accepts this risk (with MemoryMax protection) → adopted for MSD
- A much lower cap (e.g. 10K) → interferes with normal MSD queries (typically 5K–30K CIDs) → rejected
### D2: fetchmany instead of fetchall

**Decision**: add a `fetchmany` mode to `read_sql_df_slow` that skips the DataFrame and returns an iterator.

```python
def read_sql_df_slow_iter(sql, params=None, timeout_seconds=None, batch_size=5000):
    """Yield batches of (columns, rows) without building a DataFrame."""
    # ... connect, execute ...
    columns = [desc[0].upper() for desc in cursor.description]
    while True:
        rows = cursor.fetchmany(batch_size)
        if not rows:
            break
        yield columns, rows
    # ... cleanup in finally ...
```

**env var**: `DB_SLOW_FETCHMANY_SIZE` (default 5000)

**Rationale**:
- `fetchall()` forces full materialization
- `fetchmany(5000)` holds only 5000 rows in memory at a time
- Skipping the DataFrame avoids pandas overhead (index, dtype inference, NaN handling)
- EventFetcher can group each yielded batch into the result dict and release the batch

**Trade-off**:
- `read_sql_df_slow` (returning a DataFrame) stays untouched; `read_sql_df_slow_iter` is added alongside it
- Only EventFetcher uses the iterator version; other services keep the DataFrame version
- No existing consumer is affected
### D3: EventFetcher per-batch grouping strategy

**Decision**: switch `_fetch_batch` to `read_sql_df_slow_iter` and group each fetchmany batch into the `grouped` dict immediately.

```python
def _fetch_batch(batch_ids):
    builder = QueryBuilder()
    builder.add_in_condition(filter_column, batch_ids)
    sql = EventFetcher._build_domain_sql(domain, builder.get_conditions_sql())

    for columns, rows in read_sql_df_slow_iter(sql, builder.params, timeout_seconds=60):
        for row in rows:
            record = dict(zip(columns, row))
            # sanitize NaN
            cid = record.get("CONTAINERID")
            if cid:
                grouped[cid].append(record)
        # rows is garbage-collected once it leaves scope
```

**Estimated memory improvement**:

| Item | Before | After |
|------|--------|-------|
| cursor buffer | full result (100K+ rows) | 5000 rows |
| DataFrame | full result | none |
| grouped dict | full result (final output) | full result (final output) |
| **Peak** | ~3x full result | ~1.05x full result |

The grouped dict is still the full result, but the fetchall list and DataFrame copies are gone. For 50K CIDs × 10 events = 500K records, peak memory drops from ~1.5GB to ~500MB.
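The fetchmany-plus-incremental-grouping flow above can be exercised end to end with stdlib sqlite3. This is a shape-level sketch only (the real code targets Oracle via `read_sql_df_slow_iter`); the table, column names, and data are invented for the demo:

```python
import sqlite3
from collections import defaultdict

def iter_rows(conn, sql, params=(), batch_size=2):
    """Yield (columns, rows) batches via fetchmany, never fetchall."""
    cursor = conn.execute(sql, params)
    columns = [desc[0].upper() for desc in cursor.description]
    while True:
        rows = cursor.fetchmany(batch_size)
        if not rows:
            break
        yield columns, rows

conn = sqlite3.connect(":memory:")
conn.execute("CREATE TABLE events (containerid TEXT, step TEXT)")
conn.executemany("INSERT INTO events VALUES (?, ?)",
                 [("C1", "ETCH"), ("C1", "CLEAN"), ("C2", "ETCH")])

# Group each small batch immediately; only one batch of rows is alive at a time.
grouped = defaultdict(list)
for columns, rows in iter_rows(conn, "SELECT containerid, step FROM events ORDER BY containerid"):
    for row in rows:
        record = dict(zip(columns, row))
        grouped[record["CONTAINERID"]].append(record)
```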
### D4: Avoiding double retention in trace_routes

**Decision**: in the events endpoint, reuse `raw_domain_results` as the source for `results`, and `del events_by_cid` as soon as `_flatten_domain_records` has built the flat list.

The current problem:
```python
raw_domain_results[domain] = events_by_cid  # holds a reference
rows = _flatten_domain_records(events_by_cid)  # builds a new list
results[domain] = {"data": rows, "count": len(rows)}
# → events_by_cid and rows now coexist
```

After the change:
```python
events_by_cid = future.result()
rows = _flatten_domain_records(events_by_cid)
results[domain] = {"data": rows, "count": len(rows)}
if is_msd:
    raw_domain_results[domain] = events_by_cid  # MSD needs the group-by-CID structure
else:
    del events_by_cid  # release immediately for non-MSD
```
### D5: Gunicorn workers down to 2 + systemd MemoryMax

**Decision**:
- Change the `GUNICORN_WORKERS` default in `.env.example` to 2
- Add `MemoryHigh=5G` and `MemoryMax=6G` to `deploy/mes-dashboard.service`

**Rationale**:
- 4 workers × large queries = severe memory contention
- 2 workers × 4 threads = 8 request threads, enough for concurrent traffic
- `MemoryHigh=5G`: above this the kernel starts reclaiming but does not kill the process
- `MemoryMax=6G`: hard cap; exceeding it OOM-kills the service (protecting the host OS)
- Leaves 1GB for the OS + Redis + other services
## Risks / Trade-offs

| Risk | Mitigation |
|------|-----------|
| The 50K CID cap may block legitimate queries | The env var is tunable; once proposal 2 lands these go async |
| The fetchmany iterator holds the cursor open longer | timeout_seconds=60 bounds it; a semaphore limits concurrency |
| The grouped dict is still the full result | That is the API contract (all results must be returned); only proposal 3's streaming solves this fully |
| workers=2 reduces concurrency | History-page queries are already semaphore-limited; fewer workers mainly affects realtime-page throughput, which is lightweight |
| A MemoryMax kill interrupts all online users | systemd Restart=always auto-restarts; far better than a host OS crash |
@@ -0,0 +1,39 @@

## Why

On 2026-02-25 the production trace pipeline was OOM SIGKILLed (7GB VM, no swap) while processing 114K CIDs (TMTT station + a 5-month date range). Pool isolation is done and connections no longer contend, but memory usage in the events stage is the real bottleneck:

1. `cursor.fetchall()` loads all rows at once (hundreds of thousands)
2. `pd.DataFrame(rows)` makes another copy
3. `df.iterrows()` + `row.to_dict()` makes yet another
4. `grouped[cid].append(record)` accumulates into the final dict
5. `raw_domain_results[domain]` + `results[domain]["data"]` are held simultaneously in trace_routes

With 114K CIDs × 2 domains, 3–4 full copies of the data coexist at peak, each hundreds of MB → 2–4 GB for a single domain. A 7GB VM running 4 workers cannot absorb that.
## What Changes

- **Admission control**: add a CID-count cap to the trace events endpoint; return HTTP 413 above the threshold
- **Batched fetching**: switch `read_sql_df_slow` to `cursor.fetchmany()` instead of `fetchall()`, skipping the DataFrame
- **EventFetcher per-batch grouping**: group each fetched batch into the result dict immediately, releasing the batch memory
- **Avoid double retention in trace_routes**: merge `raw_domain_results` and `results` into a single data structure
- **Gunicorn workers down to 2**: reduce per-host memory contention
- **systemd MemoryMax**: add cgroup memory protection so an OOM cannot kill the whole VM
- **Update .env.example**: document new env vars such as `TRACE_EVENTS_CID_LIMIT` and `DB_SLOW_FETCHMANY_SIZE`
- **Update deploy/mes-dashboard.service**: add `MemoryHigh` and `MemoryMax`
## Capabilities

### Modified Capabilities

- `trace-staged-api`: events endpoint gains admission control (CID cap)
- `event-fetcher-unified`: per-batch grouping memory optimization, removing the DataFrame intermediate layer
## Impact

- **Backend core**: database.py (fetchmany), event_fetcher.py (per-batch grouping), trace_routes.py (admission control + memory management)
- **Deployment config**: gunicorn.conf.py, .env.example, deploy/mes-dashboard.service
- **Not affected**: frontend, realtime monitoring pages, other services (reject_history, hold_history, etc.)
- **Prerequisite**: trace-pipeline-pool-isolation (completed)
@@ -0,0 +1,15 @@

## MODIFIED Requirements

### Requirement: EventFetcher SHALL use streaming fetch for batch queries
`EventFetcher._fetch_batch` SHALL use `read_sql_df_slow_iter` (fetchmany-based iterator) instead of `read_sql_df` (fetchall + DataFrame) to reduce peak memory usage.

#### Scenario: Batch query memory optimization
- **WHEN** EventFetcher executes a batch query for a domain
- **THEN** the query SHALL use `cursor.fetchmany(batch_size)` (env: `DB_SLOW_FETCHMANY_SIZE`, default: 5000) instead of `cursor.fetchall()`
- **THEN** rows SHALL be converted directly to dicts via `dict(zip(columns, row))` without building a DataFrame
- **THEN** each fetchmany batch SHALL be grouped into the result dict immediately, allowing the batch rows to be garbage collected

#### Scenario: Existing API contract preserved
- **WHEN** EventFetcher.fetch_events() returns results
- **THEN** the return type SHALL remain `Dict[str, List[Dict[str, Any]]]` (grouped by CONTAINERID)
- **THEN** the result SHALL be identical to the previous DataFrame-based implementation
@@ -0,0 +1,19 @@

## MODIFIED Requirements

### Requirement: Trace events endpoint SHALL manage memory for large queries
The events endpoint SHALL proactively release memory after processing large CID sets.

#### Scenario: Admission control for non-MSD profiles
- **WHEN** the events endpoint receives a non-MSD profile request with `container_ids` count exceeding `TRACE_EVENTS_CID_LIMIT` (env: `TRACE_EVENTS_CID_LIMIT`, default: 50000)
- **THEN** the endpoint SHALL return HTTP 413 with `{ "error": "...", "code": "CID_LIMIT_EXCEEDED", "cid_count": N, "limit": M }`
- **THEN** Oracle DB connection pool SHALL NOT be consumed

#### Scenario: MSD profile bypasses CID hard limit
- **WHEN** the events endpoint receives a `mid_section_defect` profile request regardless of CID count
- **THEN** the endpoint SHALL proceed with normal processing (no CID hard limit)
- **THEN** if CID count exceeds 50000, the endpoint SHALL log a warning with `cid_count` for monitoring

#### Scenario: Non-MSD profile avoids double memory retention
- **WHEN** a non-MSD events request completes domain fetching
- **THEN** the `events_by_cid` reference SHALL be deleted immediately after `_flatten_domain_records`
- **THEN** only the flattened `results` dict SHALL remain in memory
openspec/changes/archive/trace-events-memory-triage/tasks.md (new file, 37 lines)
## 1. Admission Control (profile-aware)

- [x] 1.1 Add `TRACE_EVENTS_CID_LIMIT` env var (default 50000) to `trace_routes.py`
- [x] 1.2 Add CID count check in `events()` endpoint: for non-MSD profiles, if `len(container_ids) > TRACE_EVENTS_CID_LIMIT`, return HTTP 413 with `{ "code": "CID_LIMIT_EXCEEDED", "cid_count": N, "limit": M }`
- [x] 1.3 For MSD profile: bypass CID hard limit, log warning when CID count > 50000
- [x] 1.4 Add unit tests: non-MSD CID > limit → 413; MSD CID > limit → proceeds normally

## 2. Batch Fetch (fetchmany) in database.py

- [x] 2.1 Add `read_sql_df_slow_iter(sql, params, timeout_seconds, batch_size)` generator function to `database.py` that yields `(columns, rows)` tuples using `cursor.fetchmany(batch_size)`
- [x] 2.2 Add `DB_SLOW_FETCHMANY_SIZE` to `get_db_runtime_config()` (default 5000)
- [x] 2.3 Add unit test for `read_sql_df_slow_iter` (mock cursor, verify fetchmany calls and yields)

## 3. EventFetcher Memory Optimization

- [x] 3.1 Modify `_fetch_batch` in `event_fetcher.py` to use `read_sql_df_slow_iter` instead of `read_sql_df` — iterate rows directly, skip DataFrame, group to `grouped` dict immediately
- [x] 3.2 Update `_sanitize_record` to work with `dict(zip(columns, row))` instead of `row.to_dict()`
- [x] 3.3 Add unit test verifying EventFetcher uses `read_sql_df_slow_iter` import
- [x] 3.4 Update existing EventFetcher tests (mock `read_sql_df_slow_iter` instead of `read_sql_df`)

## 4. trace_routes Memory Optimization

- [x] 4.1 Modify events endpoint: only keep `raw_domain_results[domain]` for MSD profile; for non-MSD, `del events_by_cid` after flattening
- [x] 4.2 Verify existing `del raw_domain_results` and `gc.collect()` logic still correct after refactor

## 5. Deployment Configuration

- [x] 5.1 Update `.env.example`: add `TRACE_EVENTS_CID_LIMIT`, `DB_SLOW_FETCHMANY_SIZE` with descriptions
- [x] 5.2 Update `.env.example`: change `GUNICORN_WORKERS` default comment to recommend 2 for ≤ 8GB RAM
- [x] 5.3 Update `.env.example`: change `TRACE_EVENTS_MAX_WORKERS` and `EVENT_FETCHER_MAX_WORKERS` default to 2
- [x] 5.4 Update `deploy/mes-dashboard.service`: add `MemoryHigh=5G` and `MemoryMax=6G`
- [x] 5.5 Update `deploy/mes-dashboard.service`: add comment explaining memory limits

## 6. Verification

- [x] 6.1 Run `python -m pytest tests/ -v` — all existing tests pass (1069 passed, 152 skipped)
- [x] 6.2 Verify `.env.example` env var documentation is consistent with code defaults
@@ -0,0 +1,2 @@

schema: spec-driven
created: 2026-02-25
openspec/changes/archive/trace-streaming-response/design.md (new file, 140 lines)
## Context

Proposal 2 (trace-async-job-queue) moves large queries into a separate worker, but the result is still fully materialized into Redis (the job result) and into frontend memory.

The result JSON for 114K CIDs × 2 domains can reach 200–500MB:
- Worker memory: grouped dict ~500MB + JSON serialization ~500MB = ~1GB peak
- Redis: a SETEX of a 500MB key takes 5–10s and blocks other operations
- Frontend: parsing 500MB of JSON freezes the browser UI for tens of seconds

Streaming lets the server produce and the frontend consume batch by batch, so memory use scales only with the batch size.
## Goals / Non-Goals

**Goals:**
- Stream job results as NDJSON, avoiding full materialization
- EventFetcher iterator mode that yields results batch by batch
- Frontend parses line by line with ReadableStream and renders incrementally
- Results also available via a pagination API (for consumers that cannot stream)

**Non-Goals:**
- No change to the synchronous path (CID < threshold still returns plain JSON)
- No WebSocket (NDJSON over HTTP is simpler and more universal)
- No Server-Sent Events (SSE only supports text/event-stream, a poor fit for large payloads)
- No change to MSD aggregation (aggregation needs the full dataset, but its output is small)
## Decisions

### D1: NDJSON format

**Decision**: use Newline Delimited JSON (NDJSON) as the streaming format.

```
Content-Type: application/x-ndjson

{"type":"meta","job_id":"abc123","domains":["history","materials"],"cid_count":114892}
{"type":"domain_start","domain":"history","batch":1,"total_batches":23}
{"type":"records","domain":"history","batch":1,"data":[...5000 records...]}
{"type":"records","domain":"history","batch":2,"data":[...5000 records...]}
...
{"type":"domain_end","domain":"history","total_records":115000}
{"type":"domain_start","domain":"materials","batch":1,"total_batches":12}
...
{"type":"aggregation","data":{...}}
{"type":"complete","elapsed_seconds":285}
```

**env var**: `TRACE_STREAM_BATCH_SIZE` (default 5000 records/batch)

**Rationale**:
- NDJSON is the de facto standard for streaming JSON (used by Elasticsearch, BigQuery, and the GitHub API)
- Each line is an independent JSON document, so the frontend can parse line by line without waiting for the whole response
- 5000 records/batch ≈ 2–5MB, which the browser can render in real time
- Pairs naturally with HTTP/1.1 chunked transfer encoding
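The framing above boils down to one compact JSON document per line. A minimal, framework-agnostic sketch of the producer side (the function name is illustrative; in Flask this generator would be wrapped in a streaming Response with mimetype `application/x-ndjson`):

```python
import json

def ndjson_lines(messages):
    """Yield one compact JSON document per line, matching the framing above."""
    for message in messages:
        yield json.dumps(message, separators=(",", ":")) + "\n"

lines = list(ndjson_lines([
    {"type": "meta", "job_id": "abc123", "cid_count": 114892},
    {"type": "complete", "elapsed_seconds": 285},
]))
```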
### D2: EventFetcher iterator mode

**Decision**: add a `fetch_events_iter()` method that yields each batch of grouped records.

```python
@staticmethod
def fetch_events_iter(container_ids, domain, batch_size=5000):
    """Yield dicts of {cid: [records]} in batches."""
    # ... same SQL building logic ...
    for oracle_batch_ids in batches:
        for columns, rows in read_sql_df_slow_iter(sql, params):
            batch_grouped = defaultdict(list)
            for row in rows:
                record = dict(zip(columns, row))
                cid = record.get("CONTAINERID")
                if cid:
                    batch_grouped[cid].append(record)
            yield dict(batch_grouped)
```

**Rationale**:
- Coexists with `fetch_events()`, leaving the synchronous path untouched
- Each yield holds only one fetchmany batch's grouped result
- The worker serializes and writes out each yield immediately instead of accumulating
### D3: Result pagination API

**Decision**: provide a REST pagination API as an alternative to NDJSON.

```
GET /api/trace/job/{job_id}/result?domain=history&offset=0&limit=5000
```

**Response format**:
```json
{
  "domain": "history",
  "offset": 0,
  "limit": 5000,
  "total": 115000,
  "data": [... 5000 records ...]
}
```

**Rationale**:
- Some consumers (e.g. external systems) cannot consume an NDJSON stream
- A pagination API is the standard REST pattern
- Results still live in Redis, but keyed per domain; each key holds 5000 records ≈ 5MB
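The pagination contract above amounts to a slice over a stored domain's record list. A hypothetical helper, not the actual endpoint code:

```python
def paginate(records, domain="history", offset=0, limit=5000):
    """Build the pagination response body shown above from a full record list."""
    return {
        "domain": domain,
        "offset": offset,
        "limit": limit,
        "total": len(records),
        "data": records[offset:offset + limit],
    }
```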
### D4: 前端 ReadableStream 消費
|
||||
|
||||
```javascript
async function consumeNDJSON(url, onChunk) {
  const response = await fetch(url)
  const reader = response.body.getReader()
  const decoder = new TextDecoder()
  let buffer = ''

  while (true) {
    const { done, value } = await reader.read()
    if (done) break
    buffer += decoder.decode(value, { stream: true })
    const lines = buffer.split('\n')
    buffer = lines.pop() // keep the trailing incomplete line
    for (const line of lines) {
      if (line.trim()) onChunk(JSON.parse(line))
    }
  }
}
```

**Rationale**:
- ReadableStream is a native browser API; no extra dependency is needed
- Line-by-line parsing keeps memory usage constant (proportional only to batch_size)
- Rendering while receiving gives a responsive user experience
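The server-side counterpart of this consumer is a generator emitting the ordered NDJSON lines; in the real endpoint it would be wrapped in Flask's `Response(generate(), mimetype='application/x-ndjson')`. A pure-Python sketch with an invented `ndjson_stream` helper:

```python
import json

def ndjson_stream(job_meta, domain_batches):
    """Emit the stream protocol in order:
    meta -> domain_start -> records batches -> domain_end -> complete."""
    yield json.dumps({"type": "meta", **job_meta}) + "\n"
    for domain, batches in domain_batches.items():
        yield json.dumps({"type": "domain_start", "domain": domain}) + "\n"
        for batch in batches:
            # Each records line carries at most TRACE_STREAM_BATCH_SIZE records.
            yield json.dumps({"type": "records", "domain": domain, "data": batch}) + "\n"
        yield json.dumps({"type": "domain_end", "domain": domain}) + "\n"
    yield json.dumps({"type": "complete"}) + "\n"

lines = list(ndjson_stream({"job_id": "j1"}, {"history": [[{"CONTAINERID": "C1"}]]}))
print([json.loads(l)["type"] for l in lines])
# ['meta', 'domain_start', 'records', 'domain_end', 'complete']
```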
## Risks / Trade-offs

| Risk | Mitigation |
|------|-----------|
| Streaming NDJSON is not HTTP-compressed out of the box | Flask can be configured with gzip middleware; lines of 5000 records compress well |
| A dropped connection restarts the stream from scratch | The pagination API can resume from the last offset; NDJSON is for one-shot consumption |
| The frontend must render partial results | Switch the table component to virtual scrolling (existing vue-virtual-scroller) |
| MSD aggregation still needs the full dataset | Aggregation completes inside the worker; only the (smaller) final result is streamed |
| Per-domain result keys increase the Redis key count | TTL cleanup plus key-prefix isolation |
@@ -0,0 +1,39 @@

## Why

Even with the async job queue (proposal 2) handling large queries, materializing the result is still a memory bottleneck:

1. **Full job-result JSON**: the result for 114K CIDs × 2 domains can reach hundreds of MB; storing it in Redis, reading it back, and serializing it with Flask jsonify still produces a high memory peak
2. **One-shot frontend parsing**: parsing hundreds of MB of JSON freezes the browser UI
3. **Redis single-key limit**: large values hurt Redis performance (blocking other operations)

Streaming the response (NDJSON or pagination) lets the server produce data batch by batch and the frontend consume it batch by batch, decoupling memory usage from the total CID count: it grows only with the batch size.
## What Changes

- **EventFetcher iterator mode**: `fetch_events_iter()` yields each batch instead of accumulating the full result
- **New `GET /api/trace/job/{job_id}/stream`**: streams job results as NDJSON
- **Streaming consumption in useTraceProgress**: parse NDJSON line by line with `fetch()` + `ReadableStream`
- **Result pagination API**: `GET /api/trace/job/{job_id}/result?domain=history&offset=0&limit=5000`
- **Update .env.example**: add `TRACE_STREAM_BATCH_SIZE`

## Capabilities

### New Capabilities

- `trace-streaming-response`: NDJSON streaming response plus result pagination

### Modified Capabilities

- `event-fetcher-unified`: add iterator mode (`fetch_events_iter`)
- `trace-staged-api`: job-result stream endpoint
- `progressive-trace-ux`: frontend streaming consumption with batch-by-batch rendering

## Impact

- **Backend core**: event_fetcher.py (iterator mode), trace_routes.py (stream endpoint)
- **Frontend changes**: useTraceProgress.js (ReadableStream consumption)
- **Deployment config**: .env.example (`TRACE_STREAM_BATCH_SIZE`)
- **Unaffected**: the synchronous path (CID counts below the threshold keep the existing flow), other services, the realtime monitoring page
- **Prerequisite**: trace-async-job-queue (proposal 2)
@@ -0,0 +1,14 @@

## ADDED Requirements

### Requirement: EventFetcher SHALL support iterator mode for streaming
`EventFetcher.fetch_events_iter()` SHALL yield batched results for streaming consumption.

#### Scenario: Iterator mode yields batches
- **WHEN** `fetch_events_iter(container_ids, domain, batch_size)` is called
- **THEN** it SHALL yield `Dict[str, List[Dict]]` batches (grouped by CONTAINERID)
- **THEN** each yielded batch SHALL contain results from one `cursor.fetchmany()` call
- **THEN** memory usage SHALL be proportional to `batch_size`, not total result count

#### Scenario: Iterator mode cache behavior
- **WHEN** `fetch_events_iter` is used for large CID sets (> CACHE_SKIP_CID_THRESHOLD)
- **THEN** per-domain cache SHALL be skipped (consistent with `fetch_events` behavior)

@@ -0,0 +1,22 @@

## ADDED Requirements

### Requirement: Trace API SHALL expose NDJSON stream endpoint for job results
`GET /api/trace/job/{job_id}/stream` SHALL return job results as an NDJSON (Newline Delimited JSON) stream.

#### Scenario: Stream completed job result
- **WHEN** a client requests the stream for a completed job
- **THEN** the endpoint SHALL return `Content-Type: application/x-ndjson`
- **THEN** the response SHALL contain ordered NDJSON lines: `meta` → `domain_start` → `records` batches → `domain_end` → `aggregation` (if applicable) → `complete`
- **THEN** each `records` line SHALL contain at most `TRACE_STREAM_BATCH_SIZE` (env, default: 5000) records

#### Scenario: Stream for non-completed job
- **WHEN** a client requests the stream for a non-completed job
- **THEN** the endpoint SHALL return HTTP 409 with `{ "error": "...", "code": "JOB_NOT_COMPLETE" }`

### Requirement: Job result pagination SHALL support domain-level offset/limit
`GET /api/trace/job/{job_id}/result` SHALL support fine-grained pagination per domain.

#### Scenario: Paginated domain result
- **WHEN** a client requests `?domain=history&offset=0&limit=5000`
- **THEN** the endpoint SHALL return only the specified slice of records for that domain
- **THEN** the response SHALL include the `total` count for the domain
35
openspec/changes/archive/trace-streaming-response/tasks.md
Normal file
@@ -0,0 +1,35 @@

## 1. EventFetcher Iterator Mode

- [ ] 1.1 Add `fetch_events_iter(container_ids, domain, batch_size)` static method to `EventFetcher` class: yields `Dict[str, List[Dict]]` batches using `read_sql_df_slow_iter`
- [ ] 1.2 Add unit tests for `fetch_events_iter` (mock `read_sql_df_slow_iter`, verify batch yields)

## 2. NDJSON Stream Endpoint

- [x] 2.1 Add `GET /api/trace/job/<job_id>/stream` endpoint: returns `Content-Type: application/x-ndjson` with Flask `Response(generate(), mimetype='application/x-ndjson')`
- [x] 2.2 Implement NDJSON generator: yield `meta` → `domain_start` → `records` batches → `domain_end` → `aggregation` → `complete` lines
- [x] 2.3 Add `TRACE_STREAM_BATCH_SIZE` env var (default 5000)
- [x] 2.4 Modify `execute_trace_events_job()` to store results in chunked Redis keys: `trace:job:{job_id}:result:{domain}:{chunk_idx}`
- [x] 2.5 Add unit tests for the NDJSON stream endpoint

## 3. Result Pagination API

- [x] 3.1 Enhance `GET /api/trace/job/<job_id>/result` with `domain`, `offset`, `limit` query params
- [x] 3.2 Implement pagination over chunked Redis keys
- [x] 3.3 Add unit tests for pagination (offset/limit boundary cases)

## 4. Frontend Streaming Consumer

- [x] 4.1 Add `consumeNDJSONStream(url, onChunk)` utility using `ReadableStream`
- [x] 4.2 Modify `useTraceProgress.js`: for async jobs, prefer the stream endpoint over the full result endpoint
- [x] 4.3 Add progressive rendering: update table data as each NDJSON batch arrives
- [x] 4.4 Add error handling: stream interruption, malformed NDJSON lines

## 5. Deployment

- [x] 5.1 Update `.env.example`: add `TRACE_STREAM_BATCH_SIZE` with description

## 6. Verification

- [x] 6.1 Run `python -m pytest tests/ -v`: all existing tests pass
- [x] 6.2 Run `cd frontend && npm run build`: frontend builds successfully
- [ ] 6.3 Manual test: verify NDJSON stream produces valid output for multi-domain query