feat(trace-pipeline): memory triage, async job queue, and NDJSON streaming

Three proposals addressing the 2026-02-25 trace pipeline OOM crash (114K CIDs):

1. trace-events-memory-triage: fetchmany iterator (read_sql_df_slow_iter),
   admission control (50K CID limit for non-MSD), cache skip for large queries,
   early memory release with gc.collect()

2. trace-async-job-queue: RQ-based async jobs for queries >20K CIDs,
   separate worker process with isolated memory, frontend polling via
   useTraceProgress composable, systemd service + deploy scripts

3. trace-streaming-response: chunked Redis storage (TRACE_STREAM_BATCH_SIZE=5000),
   NDJSON stream endpoint (GET /api/trace/job/<id>/stream), frontend
   ReadableStream consumer for progressive rendering, backward-compatible
   with legacy single-key storage

All three proposals archived. 1101 tests pass, frontend builds clean.

Co-Authored-By: Claude Opus 4.6 <noreply@anthropic.com>
egg
2026-02-25 21:01:27 +08:00
parent cbb943dfe5
commit dbe0da057c
32 changed files with 3140 additions and 87 deletions


@@ -0,0 +1,2 @@
schema: spec-driven
created: 2026-02-25


@@ -0,0 +1,149 @@
## Context
Proposal 1 (trace-events-memory-triage) solved the peak-memory problem and added admission control,
but queries with CID > 50K are now rejected outright (HTTP 413).
Users still have legitimate needs to query large ranges (e.g. TMTT station over 5 months = 114K CIDs).
The codebase currently has no async task infrastructure (no Celery, RQ, or Dramatiq);
every operation is synchronous request-response, hard-limited by the gunicorn 360s timeout.
We need a lightweight job queue so large queries run in a separate worker process,
holding no gunicorn thread, free of the 360s timeout, and re-triggerable on failure.
## Goals / Non-Goals
**Goals:**
- Trace events queries above the CID threshold run as async jobs (API returns 202 + job_id)
- Dedicated worker process (systemd unit) that does not consume gunicorn resources
- Queryable job status (queued/running/completed/failed)
- Results expire via TTL so they do not occupy Redis memory long-term
- Frontend automatically picks the sync/async path and shows job progress
- Minimal new dependencies (reuse the existing Redis)
**Non-Goals:**
- Not a general-purpose task queue; trace events only
- No automatic job retry (retrying a large query is expensive; users re-trigger manually after failure)
- No job cancellation (an Oracle query is hard to cancel once issued)
- No job persistence to the DB (Redis TTL is sufficient)
- No change to the lineage stage (still synchronous, usually < 120s)
## Decisions
### D1: RQ (Redis Queue) rather than Celery/Dramatiq
**Decision**: use RQ as the job queue.
**Rationale**:
- The project already runs Redis: zero extra infrastructure
- RQ is roughly 10x lighter than Celery: no broker middle layer, no beat scheduler, no flower
- Each RQ worker is a separate Python process, so memory is isolated
- Simple API: `queue.enqueue(func, args, job_timeout=600, result_ttl=3600)`
- Active community; integrates well with the Flask ecosystem
**Alternatives**:
- Celery: too heavyweight; the project does not need beat, chord, chain, etc. → rejected
- Dramatiq: even lighter, but smaller community and its Redis broker integration is less mature than RQ's → rejected
- Hand-rolled threading: already ruled out in earlier discussion (worker lifecycle, memory contention) → rejected
### D2: Sync/async boundary threshold
**Decision**:
| CID count | Behavior |
|-----------|----------|
| ≤ 20,000 | Synchronous (existing events endpoint) |
| 20,001 ~ 50,000 | Async job, returns 202 + job_id |
| > 50,000 | Async job, returns 202 + job_id; the worker splits the work internally |
**env var**: `TRACE_ASYNC_CID_THRESHOLD` (default 20000)
**Rationale**:
- Events queries with ≤ 20K CIDs usually finish within 60s; synchronous is fine
- 20K-50K takes 2-5 minutes, beyond user patience, and holds a gunicorn thread the whole time
- > 50K is proposal 1's admission-control cap and must go async
**Proposal 1's HTTP 413 becomes HTTP 202**:
once proposal 2 lands, the `TRACE_EVENTS_CID_LIMIT` check from proposal 1 falls back to an async job automatically
instead of rejecting the request.
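The tiers above can be sketched as a small routing helper. This is illustrative only: `classify_events_request` and the `"async-segmented"` label are not names from the codebase, and the defaults mirror `TRACE_ASYNC_CID_THRESHOLD` (20000) and `TRACE_EVENTS_CID_LIMIT` (50000), which would come from env vars in practice.

```python
def classify_events_request(cid_count: int,
                            async_threshold: int = 20_000,
                            hard_limit: int = 50_000) -> str:
    """Map a CID count onto the D2 tiers (sketch)."""
    if cid_count <= async_threshold:
        return "sync"              # existing synchronous events endpoint
    if cid_count <= hard_limit:
        return "async"             # enqueue, return 202 + job_id
    return "async-segmented"       # also 202 + job_id; worker splits internally
```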
### D3: Job status and result storage
**Decision**: use RQ's built-in job status tracking (stored in Redis).
```
Job lifecycle:
  queued → started → finished / failed
Redis keys:
  rq:job:{job_id}            # RQ built-in job metadata
  trace:job:{job_id}:meta    # custom metadata: profile, cid_count, domains, progress
  trace:job:{job_id}:result  # result JSON after completion, with TTL
```
**env vars**:
- `TRACE_JOB_TTL_SECONDS`: result retention time (default 3600 = 1 hour)
- `TRACE_JOB_TIMEOUT_SECONDS`: max execution time per job (default 1800 = 30 minutes)
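A minimal sketch of the write side of this key layout. The key names follow D3; `store_job_result` is illustrative, and an in-memory `FakeRedis` stand-in keeps the sketch runnable without a Redis server:

```python
import json
import time

class FakeRedis:
    """In-memory stand-in for redis.Redis, enough for setex()."""
    def __init__(self):
        self.store = {}   # key -> (value, ttl_seconds)

    def setex(self, key, ttl, value):
        self.store[key] = (value, ttl)

def store_job_result(r, job_id, result, ttl_seconds=3600):
    """Write the custom meta/result keys with the TRACE_JOB_TTL_SECONDS TTL."""
    r.setex(f"trace:job:{job_id}:result", ttl_seconds, json.dumps(result))
    meta = {"status": "finished", "completed_at": time.time()}
    r.setex(f"trace:job:{job_id}:meta", ttl_seconds, json.dumps(meta))

r = FakeRedis()
store_job_result(r, "trace-evt-abc123", {"history": {"data": [], "count": 0}})
```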
### D4: API design
```
POST /api/trace/events                ← existing; synchronous when CID ≤ threshold
POST /api/trace/events                ← returns 202 + job_id when CID > threshold (same endpoint)
GET  /api/trace/job/{job_id}          ← query job status
GET  /api/trace/job/{job_id}/result   ← fetch the full result
GET  /api/trace/job/{job_id}/result?domain=history&offset=0&limit=5000  ← paginated result
```
**202 response format**:
```json
{
"stage": "events",
"async": true,
"job_id": "trace-evt-abc123",
"status_url": "/api/trace/job/trace-evt-abc123",
"estimated_seconds": 300
}
```
### D5: Worker deployment architecture
```
systemd (mes-dashboard-trace-worker.service)
  → conda run -n mes-dashboard rq worker trace-events --with-scheduler
  → separate process, separate memory space
  → MemoryMax=4G (cgroup protection)
```
**env vars**:
- `TRACE_WORKER_COUNT`: number of worker processes (default 1)
- `TRACE_WORKER_QUEUE`: queue name (default `trace-events`)
### D6: Frontend integration
Changes to `useTraceProgress.js`:
```javascript
// events stage
const eventsResp = await fetchStage('events', payload)
if (eventsResp.status === 202) {
  // async path
  const { job_id, status_url } = eventsResp.data
  return await pollJobUntilComplete(status_url, {
    onProgress: (status) => updateProgress('events', status.progress),
    pollInterval: 3000,
    maxPollTime: 1800000, // 30 minutes
  })
}
// sync path (existing)
return eventsResp.data
```
## Risks / Trade-offs
| Risk | Mitigation |
|------|-----------|
| RQ adds a dependency to maintain | RQ is stable, has a simple API, and we use only core features |
| Worker process adds memory usage | Separate cgroup with MemoryMax=4G; nearly no memory when idle |
| Storing large results in Redis hurts performance | Results auto-expire with TTL=1h; proposal 3 replaces full materialization with streaming |
| Worker crash loses in-flight jobs | RQ's built-in failed-job registry; users can re-trigger manually |
| Frontend polling adds API load | pollInterval=3s; only active jobs are polled |


@@ -0,0 +1,45 @@
## Why
When the trace pipeline handles large CID sets (> 20K), even after the batching optimizations of proposal 1,
the following fundamental problems remain:
1. **Synchronous request-response model**: the gunicorn 360s timeout is a hard limit; lineage + events combined can exceed 300s
2. **Worker thread held hostage**: one gunicorn thread is fully occupied for the duration of a large query, degrading realtime-page service capacity
3. **No progress feedback**: users stare at a loading spinner for 5-6 minutes with no signal that anything is working
4. **Failures force a full restart**: after a mid-run timeout/OOM, the completed seed-resolve and lineage results are all wasted
The industry-standard approach is to push long-running work onto an async queue (RQ/Dramatiq): the API immediately returns 202 + job_id,
a background worker does the processing, and the frontend polls (or uses SSE) for the result.
## What Changes
- **Introduce RQ (Redis Queue)**: reuse the existing Redis infrastructure to minimize new dependencies
- **New trace job worker**: separate process (systemd unit) that does not consume gunicorn worker resources
- **`POST /api/trace/events` returns 202 + job_id when CID > threshold** (same endpoint, per D4)
- **New `GET /api/trace/job/{job_id}`**: poll job status (queued/running/completed/failed)
- **New `GET /api/trace/job/{job_id}/result`**: fetch the completed result (paginated)
- **Frontend useTraceProgress integration**: automatically picks the sync/async path and shows job progress
- **Job TTL + auto cleanup**: results expire 1 hour after completion
- **New systemd unit**: `mes-dashboard-trace-worker.service`
- **Update .env.example**: `TRACE_ASYNC_CID_THRESHOLD`, `TRACE_JOB_TTL_SECONDS`, `TRACE_WORKER_COUNT`
## Capabilities
### New Capabilities
- `trace-async-job`: async trace job queue (RQ + Redis)
### Modified Capabilities
- `trace-staged-api`: events endpoint integrates async job routing
- `progressive-trace-ux`: frontend job polling + progress display
## Impact
- **New dependency**: `rq>=1.16.0,<2.0.0` (requirements.txt, environment.yml)
- **Backend additions**: trace_job_service.py, trace_routes.py (async endpoints)
- **Frontend changes**: useTraceProgress.js (async integration)
- **Deployment additions**: deploy/mes-dashboard-trace-worker.service, scripts/start_server.sh (worker management)
- **Deployment config**: .env.example (new env vars)
- **Not affected**: other services, realtime monitoring pages, admin pages
- **Prerequisite**: trace-events-memory-triage (proposal 1)


@@ -0,0 +1,47 @@
## ADDED Requirements
### Requirement: Trace events endpoint SHALL support asynchronous job execution
The `/api/trace/events` endpoint SHALL automatically route large CID requests to an async job queue.
#### Scenario: CID count exceeds async threshold
- **WHEN** the events endpoint receives a request with `container_ids` count exceeding `TRACE_ASYNC_CID_THRESHOLD` (env: `TRACE_ASYNC_CID_THRESHOLD`, default: 20000)
- **THEN** the endpoint SHALL enqueue the request to the `trace-events` RQ queue
- **THEN** the endpoint SHALL return HTTP 202 with `{ "stage": "events", "async": true, "job_id": "...", "status_url": "/api/trace/job/{job_id}" }`
#### Scenario: CID count within sync threshold
- **WHEN** the events endpoint receives a request with `container_ids` count ≤ `TRACE_ASYNC_CID_THRESHOLD`
- **THEN** the endpoint SHALL process synchronously as before
### Requirement: Trace API SHALL expose job status endpoint
`GET /api/trace/job/{job_id}` SHALL return the current status of an async trace job.
#### Scenario: Job status query
- **WHEN** a client queries job status with a valid job_id
- **THEN** the endpoint SHALL return `{ "job_id": "...", "status": "queued|started|finished|failed", "progress": {...}, "created_at": "...", "elapsed_seconds": N }`
#### Scenario: Job not found
- **WHEN** a client queries job status with an unknown or expired job_id
- **THEN** the endpoint SHALL return HTTP 404 with `{ "error": "...", "code": "JOB_NOT_FOUND" }`
### Requirement: Trace API SHALL expose job result endpoint
`GET /api/trace/job/{job_id}/result` SHALL return the result of a completed async trace job.
#### Scenario: Completed job result
- **WHEN** a client requests result for a completed job
- **THEN** the endpoint SHALL return the same response format as the synchronous events endpoint
- **THEN** optional query params `domain`, `offset`, `limit` SHALL support pagination
#### Scenario: Job not yet completed
- **WHEN** a client requests result for a non-completed job
- **THEN** the endpoint SHALL return HTTP 409 with `{ "error": "...", "code": "JOB_NOT_COMPLETE", "status": "queued|started" }`
### Requirement: Async trace jobs SHALL have TTL and timeout
Job results SHALL expire after a configurable TTL, and execution SHALL be bounded by a timeout.
#### Scenario: Job result TTL
- **WHEN** a trace job completes (success or failure)
- **THEN** the result SHALL be stored in Redis with TTL = `TRACE_JOB_TTL_SECONDS` (env, default: 3600)
#### Scenario: Job execution timeout
- **WHEN** a trace job exceeds `TRACE_JOB_TIMEOUT_SECONDS` (env, default: 1800)
- **THEN** RQ SHALL terminate the job and mark it as failed


@@ -0,0 +1,39 @@
## 1. Dependencies
- [x] 1.1 Add `rq>=1.16.0,<2.0.0` to `requirements.txt`
- [x] 1.2 Add `rq>=1.16.0,<2.0.0` to pip dependencies in `environment.yml`
## 2. Trace Job Service
- [x] 2.1 Create `src/mes_dashboard/services/trace_job_service.py` with `enqueue_trace_events_job()`, `get_job_status()`, `get_job_result()`
- [x] 2.2 Implement `execute_trace_events_job()` function (RQ worker entry point): runs EventFetcher + optional MSD aggregation, stores result in Redis with TTL
- [x] 2.3 Add job metadata tracking: `trace:job:{job_id}:meta` Redis key with `{profile, cid_count, domains, status, progress, created_at, completed_at}`
- [x] 2.4 Add unit tests for trace_job_service (13 tests: enqueue, status, result, worker execution, flatten)
## 3. Async API Endpoints
- [x] 3.1 Modify `events()` in `trace_routes.py`: when `len(container_ids) > TRACE_ASYNC_CID_THRESHOLD` and async available, call `enqueue_trace_events_job()` and return HTTP 202
- [x] 3.2 Add `GET /api/trace/job/<job_id>` endpoint: return job status from `get_job_status()`
- [x] 3.3 Add `GET /api/trace/job/<job_id>/result` endpoint: return job result from `get_job_result()` with optional `domain`, `offset`, `limit` query params
- [x] 3.4 Add rate limiting to job status/result endpoints (60 req/60s)
- [x] 3.5 Add unit tests for async endpoints (8 tests: async routing, sync fallback, 413 fallback, job status/result)
## 4. Deployment
- [x] 4.1 Create `deploy/mes-dashboard-trace-worker.service` systemd unit (MemoryHigh=3G, MemoryMax=4G)
- [x] 4.2 Update `scripts/start_server.sh`: add `start_rq_worker`/`stop_rq_worker`/`rq_worker_status` functions
- [x] 4.3 Update `scripts/deploy.sh`: add trace worker systemd install instructions
- [x] 4.4 Update `.env.example`: uncomment and add `TRACE_WORKER_ENABLED`, `TRACE_ASYNC_CID_THRESHOLD`, `TRACE_JOB_TTL_SECONDS`, `TRACE_JOB_TIMEOUT_SECONDS`, `TRACE_WORKER_COUNT`, `TRACE_WORKER_QUEUE`
## 5. Frontend Integration
- [x] 5.1 Modify `useTraceProgress.js`: detect async response (`eventsPayload.async === true`), switch to job polling mode
- [x] 5.2 Add `pollJobUntilComplete()` helper: poll `GET /api/trace/job/{job_id}` every 3s, max 30 minutes
- [x] 5.3 Add `job_progress` reactive state for UI: `{ active, job_id, status, elapsed_seconds, progress }`
- [x] 5.4 Add error handling: job failed (`JOB_FAILED`), polling timeout (`JOB_POLL_TIMEOUT`), abort support
## 6. Verification
- [x] 6.1 Run `python -m pytest tests/ -v` — 1090 passed, 152 skipped
- [x] 6.2 Run `cd frontend && npm run build` — frontend builds successfully
- [x] 6.3 Verify rq installed: `python -c "import rq; print(rq.VERSION)"` → 1.16.2


@@ -0,0 +1,174 @@
## Context
Timeline of the 2026-02-25 production OOM crash:
```
13:18:15 seed-resolve (read_sql_df_slow): 525K rows → 70K lots (38.95s)
13:20:12 lineage (read_sql_df_slow): 114K CIDs, 54MB JSON (65s)
13:20:16 events (read_sql_df_slow): 2 domains × 115 batches × 2 workers
13:20:16 cursor.fetchall() starts accumulating rows → DataFrame → dict → grouped
         each domain simultaneously holds ~3 full copies of the data
         peak memory: (fetchall rows + DataFrame + grouped dict) × 2 domains ≈ 4-6 GB
13:37:47 OOM SIGKILL — 7GB VM, 0 swap
```
Pool isolation (the previous change) stopped connections from fighting over each other, but the events stage's memory usage is the root cause of the OOM.
`read_sql_df_slow` currently uses `cursor.fetchall()` to load the entire result into a Python list,
builds a `pd.DataFrame`, then converts it to a dict list via `iterrows()` + `to_dict()`.
With 114K CIDs the upstream_history domain can return 1M+ rows;
each copy is hundreds of MB, and 3-4 coexisting copies exceed the VM's memory.
## Goals / Non-Goals
**Goals:**
- Prevent large queries from OOM-killing the whole VM (admission control)
- Cut events-stage peak memory by 60-70% (fetchmany + skip the DataFrame)
- Protect host OS stability (systemd MemoryMax + workers reduced to 2)
- Keep the existing API response format (transparent to the frontend)
- Update deployment docs and env settings
**Non-Goals:**
- No async task queue (proposal 2's scope)
- No change to the lineage stage (54MB is acceptable)
- No frontend changes (proposal 3's scope)
- No restriction on user query scope (date range/station remain user-chosen)
## Decisions
### D1: Admission-control threshold and behavior
**Decision**: add a CID-count cap check to the trace events endpoint, **differentiated by profile**.
| Profile | CID count | Behavior |
|---------|-----------|----------|
| `query_tool` / `query_tool_reverse` | ≤ 50,000 | Normal synchronous processing |
| `query_tool` / `query_tool_reverse` | > 50,000 | HTTP 413 (should not happen in practice) |
| `mid_section_defect` | any | **No hard limit**; process normally + log a warning when CID > 50K |
**env var**: `TRACE_EVENTS_CID_LIMIT` (default 50000; applies only to non-MSD profiles)
**Why MSD gets no hard limit**:
- MSD scrap tracing is aggregate statistics (pareto/tables, no trace graph rendered), so CID count does not affect readability
- Dropping CIDs would skew the scrap counts; data completeness is critical
- 114K CIDs is a real business scenario (TMTT station over 5 months) and cannot be rejected
- OOM risk is contained by systemd `MemoryMax=6G` protecting the host OS (the service gets killed but the VM survives and auto-restarts)
- Once proposal 2 lands, large MSD queries automatically go through async jobs, fixing the memory problem at the root
**Why query_tool gets a 50K cap**:
- A trace graph with more than a few thousand nodes is already unreadable; 50K is an extremely generous safety valve
- In practice a query_tool seed is usually 1-50 lots → a few hundred to a few thousand CIDs after lineage
**Alternatives**:
- A uniform cap across all profiles → blocks MSD, scrap statistics incomplete → rejected
- No cap, rely on fetchmany alone → MSD accepts this risk (with MemoryMax protection) → adopted for MSD
- A lower cap (e.g. 10K) → would block normal MSD queries (typically 5K-30K CIDs) → rejected
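The profile-aware check can be sketched as follows (illustrative names only; the real check lives in the events endpoint in `trace_routes.py`):

```python
import logging

logger = logging.getLogger("trace")

def check_admission(profile: str, cid_count: int, limit: int = 50_000):
    """Return (allowed, http_status) per D1 (sketch).

    `limit` mirrors TRACE_EVENTS_CID_LIMIT; MSD bypasses the hard cap but
    logs a warning above it so large queries stay visible in monitoring.
    """
    if profile == "mid_section_defect":
        if cid_count > limit:
            logger.warning("MSD query with %d CIDs exceeds soft limit %d",
                           cid_count, limit)
        return True, 200
    if cid_count > limit:
        return False, 413   # CID_LIMIT_EXCEEDED
    return True, 200
```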
### D2: fetchmany instead of fetchall
**Decision**: add a `fetchmany` mode alongside `read_sql_df_slow` that skips the DataFrame and returns an iterator directly.
```python
def read_sql_df_slow_iter(sql, params=None, timeout_seconds=None, batch_size=5000):
    """Yield batches of (columns, rows) without building DataFrame."""
    # ... connect, execute ...
    columns = [desc[0].upper() for desc in cursor.description]
    while True:
        rows = cursor.fetchmany(batch_size)
        if not rows:
            break
        yield columns, rows
    # ... cleanup in finally ...
```
**env var**: `DB_SLOW_FETCHMANY_SIZE` (default 5000)
**Rationale**:
- `fetchall()` forces full materialization
- `fetchmany(5000)` holds only 5000 rows in memory at a time
- Skipping the DataFrame avoids pandas overhead (index, dtype inference, NaN handling)
- EventFetcher can group each yielded batch into the result dict and release the batch
**Trade-off**:
- `read_sql_df_slow` (returning a DataFrame) stays untouched; `read_sql_df_slow_iter` is added alongside it
- Only EventFetcher uses the iter version; other services keep using the DataFrame version
- This leaves every existing consumer unaffected
### D3: EventFetcher per-batch grouping
**Decision**: `_fetch_batch` switches to `read_sql_df_slow_iter`, grouping each fetchmany batch into the `grouped` dict immediately.
```python
def _fetch_batch(batch_ids):
    builder = QueryBuilder()
    builder.add_in_condition(filter_column, batch_ids)
    sql = EventFetcher._build_domain_sql(domain, builder.get_conditions_sql())
    for columns, rows in read_sql_df_slow_iter(sql, builder.params, timeout_seconds=60):
        for row in rows:
            record = dict(zip(columns, row))
            # sanitize NaN
            cid = record.get("CONTAINERID")
            if cid:
                grouped[cid].append(record)
        # rows are garbage-collected as soon as they leave scope
```
**Estimated memory improvement**:
| Item | Before | After |
|------|--------|-------|
| cursor buffer | full result (100K+ rows) | 5000 rows |
| DataFrame | full copy | none |
| grouped dict | full (final result) | full (final result) |
| **Peak** | ~3x full size | ~1.05x full size |
The grouped dict is still the full result, but the fetchall list and DataFrame copies are gone.
For 50K CIDs × 10 events = 500K records, that is roughly ~1.5GB down to ~500MB.
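The 500K-record figure works out as follows, assuming roughly 1 KB per event dict (the per-record size is an assumption, not a measured number):

```python
records = 500_000
bytes_per_record = 1_000          # assumption: ~1 KB per event dict
full_copy_mb = records * bytes_per_record / 1e6   # one full copy ≈ 500 MB

# Before: fetchall list + DataFrame + grouped dict ≈ 3 full copies at peak
peak_before_mb = 3 * full_copy_mb
# After: grouped dict + a single in-flight fetchmany batch of 5000 rows
batch_mb = 5_000 * bytes_per_record / 1e6
peak_after_mb = full_copy_mb + batch_mb
```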
### D4: trace_routes avoids double retention
**Decision**: in the events endpoint, `raw_domain_results` directly reuses the data behind `results`,
and `events_by_cid` is `del`eted as soon as `_flatten_domain_records` has built the flat list.
The current problem:
```python
raw_domain_results[domain] = events_by_cid    # holds a reference
rows = _flatten_domain_records(events_by_cid) # builds a new list
results[domain] = {"data": rows, "count": len(rows)}
# → events_by_cid and rows now coexist
```
After the change:
```python
events_by_cid = future.result()
rows = _flatten_domain_records(events_by_cid)
results[domain] = {"data": rows, "count": len(rows)}
if is_msd:
    raw_domain_results[domain] = events_by_cid  # MSD needs the group-by-CID structure
else:
    del events_by_cid  # non-MSD: release immediately
```
### D5: Reduce gunicorn workers to 2 + systemd MemoryMax
**Decision**:
- `.env.example`: change the `GUNICORN_WORKERS` default to 2
- `deploy/mes-dashboard.service`: add `MemoryHigh=5G` and `MemoryMax=6G`
**Rationale**:
- 4 workers × large queries = severe memory contention
- 2 workers × 4 threads = 8 request threads, enough for concurrent requests
- `MemoryHigh=5G`: above this the kernel starts reclaiming, without killing the process
- `MemoryMax=6G`: hard cap; exceeding it OOM-kills the service, protecting the host OS
- Leaves 1GB for the OS + Redis + other services
## Risks / Trade-offs
| Risk | Mitigation |
|------|-----------|
| The 50K CID cap may block legitimate queries | env var is tunable; once proposal 2 lands these go async |
| The fetchmany iterator holds the cursor longer | bounded by timeout_seconds=60; a semaphore limits concurrency |
| The grouped dict is still the full result | That is the API contract (all results must be returned); only proposal 3's streaming fixes this at the root |
| workers=2 reduces concurrency | History-page queries are semaphore-limited; fewer workers mainly affects realtime-page throughput, and those pages are lightweight |
| A MemoryMax kill interrupts all online users | systemd Restart=always auto-restarts; far better than a host OS crash |


@@ -0,0 +1,39 @@
## Why
On 2026-02-25 the production trace pipeline, processing 114K CIDs (TMTT station + a 5-month date range),
had its worker OOM SIGKILLed (7GB VM, no swap). Pool isolation is already in place, so connections no longer fight each other,
but the events stage's memory usage is the real bottleneck:
1. `cursor.fetchall()` loads all rows at once (hundreds of thousands)
2. `pd.DataFrame(rows)` copies them again
3. `df.iterrows()` + `row.to_dict()` makes yet another copy
4. `grouped[cid].append(record)` accumulates into the final dict
5. `raw_domain_results[domain]` + `results[domain]["data"]` hold two copies simultaneously in trace_routes
With 114K CIDs × 2 domains, 3-4 full copies coexist at peak, each hundreds of MB → 2-4 GB for a single domain.
A 7GB VM with 4 workers cannot absorb that.
## What Changes
- **Admission control**: the trace events endpoint checks the CID count and returns HTTP 413 above the threshold
- **Batched fetching**: `read_sql_df_slow` gains a `cursor.fetchmany()` path replacing `fetchall()`, with no DataFrame
- **EventFetcher per-batch grouping**: each batch is grouped into the result dict as soon as it is fetched, releasing batch memory
- **trace_routes avoids double retention**: `raw_domain_results` and `results` are merged into a single data structure
- **Gunicorn workers reduced to 2**: less memory contention on a single machine
- **systemd MemoryMax**: cgroup memory protection so an OOM cannot take down the whole VM
- **Update .env.example**: document new env vars such as `TRACE_EVENTS_CID_LIMIT` and `DB_SLOW_FETCHMANY_SIZE`
- **Update deploy/mes-dashboard.service**: add `MemoryHigh` and `MemoryMax`
## Capabilities
### Modified Capabilities
- `trace-staged-api`: events endpoint gains admission control (CID cap)
- `event-fetcher-unified`: per-batch grouping memory optimization, DataFrame middle layer removed
## Impact
- **Backend core**: database.py (fetchmany), event_fetcher.py (per-batch grouping), trace_routes.py (admission control + memory management)
- **Deployment config**: gunicorn.conf.py, .env.example, deploy/mes-dashboard.service
- **Not affected**: frontend, realtime monitoring pages, other services (reject_history, hold_history, etc.)
- **Prerequisite**: trace-pipeline-pool-isolation (completed)


@@ -0,0 +1,15 @@
## MODIFIED Requirements
### Requirement: EventFetcher SHALL use streaming fetch for batch queries
`EventFetcher._fetch_batch` SHALL use `read_sql_df_slow_iter` (fetchmany-based iterator) instead of `read_sql_df_slow` (fetchall + DataFrame) to reduce peak memory usage.
#### Scenario: Batch query memory optimization
- **WHEN** EventFetcher executes a batch query for a domain
- **THEN** the query SHALL use `cursor.fetchmany(batch_size)` (env: `DB_SLOW_FETCHMANY_SIZE`, default: 5000) instead of `cursor.fetchall()`
- **THEN** rows SHALL be converted directly to dicts via `dict(zip(columns, row))` without building a DataFrame
- **THEN** each fetchmany batch SHALL be grouped into the result dict immediately, allowing the batch rows to be garbage collected
#### Scenario: Existing API contract preserved
- **WHEN** EventFetcher.fetch_events() returns results
- **THEN** the return type SHALL remain `Dict[str, List[Dict[str, Any]]]` (grouped by CONTAINERID)
- **THEN** the result SHALL be identical to the previous DataFrame-based implementation


@@ -0,0 +1,19 @@
## MODIFIED Requirements
### Requirement: Trace events endpoint SHALL manage memory for large queries
The events endpoint SHALL proactively release memory after processing large CID sets.
#### Scenario: Admission control for non-MSD profiles
- **WHEN** the events endpoint receives a non-MSD profile request with `container_ids` count exceeding `TRACE_EVENTS_CID_LIMIT` (env: `TRACE_EVENTS_CID_LIMIT`, default: 50000)
- **THEN** the endpoint SHALL return HTTP 413 with `{ "error": "...", "code": "CID_LIMIT_EXCEEDED", "cid_count": N, "limit": M }`
- **THEN** Oracle DB connection pool SHALL NOT be consumed
#### Scenario: MSD profile bypasses CID hard limit
- **WHEN** the events endpoint receives a `mid_section_defect` profile request regardless of CID count
- **THEN** the endpoint SHALL proceed with normal processing (no CID hard limit)
- **THEN** if CID count exceeds 50000, the endpoint SHALL log a warning with `cid_count` for monitoring
#### Scenario: Non-MSD profile avoids double memory retention
- **WHEN** a non-MSD events request completes domain fetching
- **THEN** the `events_by_cid` reference SHALL be deleted immediately after `_flatten_domain_records`
- **THEN** only the flattened `results` dict SHALL remain in memory


@@ -0,0 +1,37 @@
## 1. Admission Control (profile-aware)
- [x] 1.1 Add `TRACE_EVENTS_CID_LIMIT` env var (default 50000) to `trace_routes.py`
- [x] 1.2 Add CID count check in `events()` endpoint: for non-MSD profiles, if `len(container_ids) > TRACE_EVENTS_CID_LIMIT`, return HTTP 413 with `{ "code": "CID_LIMIT_EXCEEDED", "cid_count": N, "limit": M }`
- [x] 1.3 For MSD profile: bypass CID hard limit, log warning when CID count > 50000
- [x] 1.4 Add unit tests: non-MSD CID > limit → 413; MSD CID > limit → proceeds normally
## 2. Batch Fetch (fetchmany) in database.py
- [x] 2.1 Add `read_sql_df_slow_iter(sql, params, timeout_seconds, batch_size)` generator function to `database.py` that yields `(columns, rows)` tuples using `cursor.fetchmany(batch_size)`
- [x] 2.2 Add `DB_SLOW_FETCHMANY_SIZE` to `get_db_runtime_config()` (default 5000)
- [x] 2.3 Add unit test for `read_sql_df_slow_iter` (mock cursor, verify fetchmany calls and yields)
## 3. EventFetcher Memory Optimization
- [x] 3.1 Modify `_fetch_batch` in `event_fetcher.py` to use `read_sql_df_slow_iter` instead of `read_sql_df_slow` — iterate rows directly, skip DataFrame, group to `grouped` dict immediately
- [x] 3.2 Update `_sanitize_record` to work with `dict(zip(columns, row))` instead of `row.to_dict()`
- [x] 3.3 Add unit test verifying that EventFetcher imports `read_sql_df_slow_iter`
- [x] 3.4 Update existing EventFetcher tests (mock `read_sql_df_slow_iter` instead of `read_sql_df_slow`)
## 4. trace_routes Memory Optimization
- [x] 4.1 Modify events endpoint: only keep `raw_domain_results[domain]` for MSD profile; for non-MSD, `del events_by_cid` after flattening
- [x] 4.2 Verify existing `del raw_domain_results` and `gc.collect()` logic still correct after refactor
## 5. Deployment Configuration
- [x] 5.1 Update `.env.example`: add `TRACE_EVENTS_CID_LIMIT`, `DB_SLOW_FETCHMANY_SIZE` with descriptions
- [x] 5.2 Update `.env.example`: change `GUNICORN_WORKERS` default comment to recommend 2 for ≤ 8GB RAM
- [x] 5.3 Update `.env.example`: change `TRACE_EVENTS_MAX_WORKERS` and `EVENT_FETCHER_MAX_WORKERS` default to 2
- [x] 5.4 Update `deploy/mes-dashboard.service`: add `MemoryHigh=5G` and `MemoryMax=6G`
- [x] 5.5 Update `deploy/mes-dashboard.service`: add comment explaining memory limits
## 6. Verification
- [x] 6.1 Run `python -m pytest tests/ -v` — all existing tests pass (1069 passed, 152 skipped)
- [x] 6.2 Verify `.env.example` env var documentation is consistent with code defaults


@@ -0,0 +1,2 @@
schema: spec-driven
created: 2026-02-25


@@ -0,0 +1,140 @@
## Context
Proposal 2 (trace-async-job-queue) moves large queries into a dedicated worker,
but the result is still fully materialized into Redis (the job result) and into frontend memory.
The result JSON for 114K CIDs × 2 domains can reach 200-500MB:
- Worker memory: grouped dict ~500MB + JSON serialization ~500MB = ~1GB peak
- Redis: SETEX of a 500MB key takes 5-10s and blocks other operations
- Frontend: parsing 500MB of JSON freezes the browser UI for tens of seconds
Streaming lets the server produce batches and the frontend consume them incrementally; memory use scales only with the batch size.
## Goals / Non-Goals
**Goals:**
- Stream job results as NDJSON to avoid full materialization
- EventFetcher supports an iterator mode that yields results batch by batch
- Frontend parses line by line via ReadableStream and renders incrementally
- Results also support a pagination API (for consumers that cannot stream)
**Non-Goals:**
- No change to the sync path (CID < threshold still returns plain JSON)
- No WebSocket (NDJSON over HTTP is simpler and more universal)
- No Server-Sent Events (SSE only supports text/event-stream, a poor fit for large payloads)
- No change to MSD aggregation (aggregation needs the full data set, but its output is small)
## Decisions
### D1: NDJSON format
**Decision**: use Newline Delimited JSON (NDJSON) as the streaming format.
```
Content-Type: application/x-ndjson
{"type":"meta","job_id":"abc123","domains":["history","materials"],"cid_count":114892}
{"type":"domain_start","domain":"history","batch":1,"total_batches":23}
{"type":"records","domain":"history","batch":1,"data":[...5000 records...]}
{"type":"records","domain":"history","batch":2,"data":[...5000 records...]}
...
{"type":"domain_end","domain":"history","total_records":115000}
{"type":"domain_start","domain":"materials","batch":1,"total_batches":12}
...
{"type":"aggregation","data":{...}}
{"type":"complete","elapsed_seconds":285}
```
**env var**: `TRACE_STREAM_BATCH_SIZE` (default 5000 records/batch)
**Rationale**:
- NDJSON is the industry-standard streaming JSON format (Elasticsearch, BigQuery, and the GitHub API all use it)
- Each line is independent JSON, so the frontend can parse line by line without waiting for the whole response
- 5000 records/batch is roughly 2-5MB, which the browser can render in real time
- Pairs perfectly with HTTP/1.1 chunked transfer
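The line ordering above can be produced by a plain generator; a Flask endpoint would wrap it in `Response(gen, mimetype='application/x-ndjson')`. This is a sketch fed pre-batched data, and `ndjson_lines` is an illustrative name:

```python
import json

def ndjson_lines(job_id, cid_count, domain_batches, elapsed_seconds):
    """Yield NDJSON lines in the D1 ordering: meta -> domain_start ->
    records -> domain_end -> complete. `domain_batches` maps a domain name
    to a list of record batches (each batch a list of dicts)."""
    yield json.dumps({"type": "meta", "job_id": job_id,
                      "domains": list(domain_batches),
                      "cid_count": cid_count}) + "\n"
    for domain, batches in domain_batches.items():
        yield json.dumps({"type": "domain_start", "domain": domain,
                          "total_batches": len(batches)}) + "\n"
        total = 0
        for i, batch in enumerate(batches, 1):
            total += len(batch)
            yield json.dumps({"type": "records", "domain": domain,
                              "batch": i, "data": batch}) + "\n"
        yield json.dumps({"type": "domain_end", "domain": domain,
                          "total_records": total}) + "\n"
    yield json.dumps({"type": "complete",
                      "elapsed_seconds": elapsed_seconds}) + "\n"

lines = list(ndjson_lines("abc123", 3, {"history": [[{"CONTAINERID": "C1"}]]}, 1.2))
```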
### D2: EventFetcher iterator mode
**Decision**: add a `fetch_events_iter()` method that yields each batch of grouped records.
```python
@staticmethod
def fetch_events_iter(container_ids, domain, batch_size=5000):
    """Yield dicts of {cid: [records]} in batches."""
    # ... same SQL building logic ...
    for oracle_batch_ids in batches:
        for columns, rows in read_sql_df_slow_iter(sql, params):
            batch_grouped = defaultdict(list)
            for row in rows:
                record = dict(zip(columns, row))
                cid = record.get("CONTAINERID")
                if cid:
                    batch_grouped[cid].append(record)
            yield dict(batch_grouped)
```
**Rationale**:
- Coexists with `fetch_events()`, leaving the sync path untouched
- Each yield holds only one fetchmany batch's grouped result
- The worker serializes and writes out each yield immediately, never accumulating
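On the worker side, each yielded batch can be written straight to its own Redis chunk key, following the `trace:job:{job_id}:result:{domain}:{chunk_idx}` layout from the task list. A sketch with a dict-backed stand-in so it runs without Redis (`store_chunked_result` and `DictStore` are illustrative):

```python
import json

class DictStore:
    """Minimal stand-in for redis.Redis, enough for setex()."""
    def __init__(self):
        self.data = {}

    def setex(self, key, ttl, value):
        self.data[key] = value

def store_chunked_result(store, job_id, domain, batch_iter, ttl=3600):
    """Write each yielded batch to its own chunk key; return the chunk count."""
    idx = 0
    for batch in batch_iter:
        key = f"trace:job:{job_id}:result:{domain}:{idx}"
        store.setex(key, ttl, json.dumps(batch))
        idx += 1
    return idx

store = DictStore()
n = store_chunked_result(store, "j1", "history",
                         iter([[{"CONTAINERID": "C1"}], [{"CONTAINERID": "C2"}]]))
```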
### D3: Result pagination API
**Decision**: provide a REST pagination API as an alternative to NDJSON:
```
GET /api/trace/job/{job_id}/result?domain=history&offset=0&limit=5000
```
**Response format**:
```json
{
"domain": "history",
"offset": 0,
"limit": 5000,
"total": 115000,
"data": [... 5000 records ...]
}
```
**Rationale**:
- Some consumers (e.g. external systems) cannot consume an NDJSON stream
- A pagination API is the standard REST pattern
- Results are still stored in Redis, but split into per-domain keys, each holding ~5000 records (~5MB)
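With fixed-size chunks, offset/limit can be served without flattening every chunk. A sketch (illustrative `paginate_chunks`; `load_chunk` stands in for a Redis GET + `json.loads` on one chunk key):

```python
def paginate_chunks(load_chunk, chunk_count, chunk_size, offset, limit):
    """Collect records[offset:offset+limit] across fixed-size chunks,
    loading only the chunks that overlap the requested window."""
    out = []
    first = offset // chunk_size
    for idx in range(first, chunk_count):
        if len(out) >= limit:
            break
        chunk = load_chunk(idx)
        start = offset - idx * chunk_size if idx == first else 0
        out.extend(chunk[start:start + limit - len(out)])
    return out

chunks = [[1, 2], [3, 4], [5]]   # 5 records stored as 2-record chunks
page = paginate_chunks(lambda i: chunks[i], len(chunks), 2, offset=1, limit=3)
```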
### D4: Frontend ReadableStream consumption
```javascript
async function consumeNDJSON(url, onChunk) {
  const response = await fetch(url)
  const reader = response.body.getReader()
  const decoder = new TextDecoder()
  let buffer = ''
  while (true) {
    const { done, value } = await reader.read()
    if (done) break
    buffer += decoder.decode(value, { stream: true })
    const lines = buffer.split('\n')
    buffer = lines.pop() // keep the incomplete final line
    for (const line of lines) {
      if (line.trim()) onChunk(JSON.parse(line))
    }
  }
}
```
**Rationale**:
- ReadableStream is a native browser API; no extra dependency
- Line-by-line parsing keeps memory constant, proportional only to batch_size
- Rendering while receiving gives a much better user experience
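The same buffering logic, mirrored in Python — handy for testing the stream contract server-side (`parse_ndjson_chunks` is illustrative, not a codebase name):

```python
import json

def parse_ndjson_chunks(chunks):
    """Accumulate text chunks, parse complete lines, and carry the
    incomplete tail into the next chunk, like the JS consumer."""
    buffer = ""
    messages = []
    for chunk in chunks:
        buffer += chunk
        *lines, buffer = buffer.split("\n")
        for line in lines:
            if line.strip():
                messages.append(json.loads(line))
    if buffer.strip():  # flush a final line that arrived without a trailing \n
        messages.append(json.loads(buffer))
    return messages

# A record split across two network chunks is reassembled correctly
msgs = parse_ndjson_chunks(['{"type":"meta"}\n{"ty', 'pe":"records"}\n'])
```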
## Risks / Trade-offs
| Risk | Mitigation |
|------|-----------|
| Streamed NDJSON is not compressed by default | Flask can add gzip middleware; 5000-record batches compress well |
| A dropped connection restarts the stream from scratch | the pagination API can resume from an offset; NDJSON is for one-shot consumption |
| Frontend must render partial results | table components switch to virtual scroll (vue-virtual-scroller already in use) |
| MSD aggregation still needs the full data set | aggregation completes inside the worker; only the (small) final result is streamed |
| Per-domain chunk keys increase the Redis key count | TTL cleanup + key-prefix isolation |


@@ -0,0 +1,39 @@
## Why
Even with async jobs (proposal 2), materializing large query results remains the memory bottleneck:
1. **Full job-result JSON**: the result for 114K CIDs × 2 domains can reach hundreds of MB;
   Redis storage + retrieval + Flask jsonify serialization all keep peak memory high
2. **One-shot frontend parsing**: parsing hundreds of MB of JSON freezes the browser UI
3. **Redis single-key limits**: huge values hurt Redis performance (blocking other operations)
Streaming (NDJSON/pagination) lets the server produce data batch by batch and the frontend consume it incrementally,
decoupling memory use from the total CID count; it scales only with the batch size.
## What Changes
- **EventFetcher iterator mode**: `fetch_events_iter()` yields each batch instead of accumulating everything
- **New `GET /api/trace/job/{job_id}/stream`**: streams job results as NDJSON
- **Frontend useTraceProgress streaming consumption**: parse NDJSON line by line via `fetch()` + `ReadableStream`
- **Result pagination API**: `GET /api/trace/job/{job_id}/result?domain=history&offset=0&limit=5000`
- **Update .env.example**: `TRACE_STREAM_BATCH_SIZE`
## Capabilities
### New Capabilities
- `trace-streaming-response`: NDJSON streaming + result pagination
### Modified Capabilities
- `event-fetcher-unified`: new iterator mode (`fetch_events_iter`)
- `trace-staged-api`: job result stream endpoint
- `progressive-trace-ux`: frontend streaming consumption + incremental rendering
## Impact
- **Backend core**: event_fetcher.py (iterator mode), trace_routes.py (stream endpoint)
- **Frontend changes**: useTraceProgress.js (ReadableStream consumption)
- **Deployment config**: .env.example (`TRACE_STREAM_BATCH_SIZE`)
- **Not affected**: sync path (CID < threshold keeps the existing flow), other services, realtime monitoring pages
- **Prerequisite**: trace-async-job-queue (proposal 2)


@@ -0,0 +1,14 @@
## ADDED Requirements
### Requirement: EventFetcher SHALL support iterator mode for streaming
`EventFetcher.fetch_events_iter()` SHALL yield batched results for streaming consumption.
#### Scenario: Iterator mode yields batches
- **WHEN** `fetch_events_iter(container_ids, domain, batch_size)` is called
- **THEN** it SHALL yield `Dict[str, List[Dict]]` batches (grouped by CONTAINERID)
- **THEN** each yielded batch SHALL contain results from one `cursor.fetchmany()` call
- **THEN** memory usage SHALL be proportional to `batch_size`, not total result count
#### Scenario: Iterator mode cache behavior
- **WHEN** `fetch_events_iter` is used for large CID sets (> CACHE_SKIP_CID_THRESHOLD)
- **THEN** per-domain cache SHALL be skipped (consistent with `fetch_events` behavior)


@@ -0,0 +1,22 @@
## ADDED Requirements
### Requirement: Trace API SHALL expose NDJSON stream endpoint for job results
`GET /api/trace/job/{job_id}/stream` SHALL return job results as NDJSON (Newline Delimited JSON) stream.
#### Scenario: Stream completed job result
- **WHEN** a client requests stream for a completed job
- **THEN** the endpoint SHALL return `Content-Type: application/x-ndjson`
- **THEN** the response SHALL contain ordered NDJSON lines: `meta``domain_start``records` batches → `domain_end``aggregation` (if applicable) → `complete`
- **THEN** each `records` line SHALL contain at most `TRACE_STREAM_BATCH_SIZE` (env, default: 5000) records
#### Scenario: Stream for non-completed job
- **WHEN** a client requests stream for a non-completed job
- **THEN** the endpoint SHALL return HTTP 409 with `{ "error": "...", "code": "JOB_NOT_COMPLETE" }`
### Requirement: Job result pagination SHALL support domain-level offset/limit
`GET /api/trace/job/{job_id}/result` SHALL support fine-grained pagination per domain.
#### Scenario: Paginated domain result
- **WHEN** a client requests `?domain=history&offset=0&limit=5000`
- **THEN** the endpoint SHALL return only the specified slice of records for that domain
- **THEN** the response SHALL include `total` count for the domain


@@ -0,0 +1,35 @@
## 1. EventFetcher Iterator Mode
- [ ] 1.1 Add `fetch_events_iter(container_ids, domain, batch_size)` static method to `EventFetcher` class: yields `Dict[str, List[Dict]]` batches using `read_sql_df_slow_iter`
- [ ] 1.2 Add unit tests for `fetch_events_iter` (mock read_sql_df_slow_iter, verify batch yields)
## 2. NDJSON Stream Endpoint
- [x] 2.1 Add `GET /api/trace/job/<job_id>/stream` endpoint: returns `Content-Type: application/x-ndjson` with Flask `Response(generate(), mimetype='application/x-ndjson')`
- [x] 2.2 Implement NDJSON generator: yield `meta``domain_start``records` batches → `domain_end``aggregation``complete` lines
- [x] 2.3 Add `TRACE_STREAM_BATCH_SIZE` env var (default 5000)
- [x] 2.4 Modify `execute_trace_events_job()` to store results in chunked Redis keys: `trace:job:{job_id}:result:{domain}:{chunk_idx}`
- [x] 2.5 Add unit tests for NDJSON stream endpoint
## 3. Result Pagination API
- [x] 3.1 Enhance `GET /api/trace/job/<job_id>/result` with `domain`, `offset`, `limit` query params
- [x] 3.2 Implement pagination over chunked Redis keys
- [x] 3.3 Add unit tests for pagination (offset/limit boundary cases)
## 4. Frontend Streaming Consumer
- [x] 4.1 Add `consumeNDJSONStream(url, onChunk)` utility using `ReadableStream`
- [x] 4.2 Modify `useTraceProgress.js`: for async jobs, prefer stream endpoint over full result endpoint
- [x] 4.3 Add progressive rendering: update table data as each NDJSON batch arrives
- [x] 4.4 Add error handling: stream interruption, malformed NDJSON lines
## 5. Deployment
- [x] 5.1 Update `.env.example`: add `TRACE_STREAM_BATCH_SIZE` with description
## 6. Verification
- [x] 6.1 Run `python -m pytest tests/ -v` — all existing tests pass
- [x] 6.2 Run `cd frontend && npm run build` — frontend builds successfully
- [ ] 6.3 Manual test: verify NDJSON stream produces valid output for multi-domain query