DashBoard/openspec/changes/archive/2026-02-25-trace-pipeline-pool-isolation/proposal.md
egg cbb943dfe5 feat(trace-pool-isolation): migrate event_fetcher/lineage_engine to slow connections + fix 51 test failures
Trace pipeline pool isolation:
- Switch event_fetcher and lineage_engine to read_sql_df_slow (non-pooled)
- Reduce EVENT_FETCHER_MAX_WORKERS 4→2, TRACE_EVENTS_MAX_WORKERS 4→2
- Add 60s timeout per batch query, cache skip for CID>10K
- Early del raw_domain_results + gc.collect() for large queries
- Increase DB_SLOW_MAX_CONCURRENT: base 3→5, dev 2→3, prod 3→5

Test fixes (51 pre-existing failures → 0):
- reject_history: WORKFLOW CSV header, strict bool validation, pareto mock path
- portal shell: remove non-existent /tmtt-defect route from tests
- conftest: add --run-stress option to skip stress/load tests by default
- migration tests: skipif baseline directory missing
- performance test: update Vite asset assertion
- wip hold: add firstname/waferdesc mock params
- template integration: add /reject-history canonical route

Co-Authored-By: Claude Opus 4.6 <noreply@anthropic.com>
2026-02-25 16:13:19 +08:00

## Why
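A wide-range MSD trace pipeline query (TMTT, 5 months) produced 114K CIDs. event_fetcher used
`read_sql_df` (pooled connections) to issue ~230 batch queries, which saturated the connection pool. As a result,
query times for background tasks (equipment cache, SYS_DATE) jumped from 1s to 500s, ending in Redis timeouts and a
gunicorn worker SIGKILL (the 2026-02-25 13:18 incident).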
## What Changes
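- event_fetcher and lineage_engine switch to `read_sql_df_slow` (dedicated connections) so they no longer occupy the pool
- Lower the default `EVENT_FETCHER_MAX_WORKERS` from 4 to 2 and the default `TRACE_EVENTS_MAX_WORKERS` from 4 to 2 to reduce concurrent load on Oracle
- Raise the `DB_SLOW_MAX_CONCURRENT` semaphore capacity from 3 to 5 to accommodate event_fetcher batch queries
- event_fetcher skips the L1/L2 cache when the CID count exceeds 10K (avoiding serialization of hundreds of MB of JSON, which causes Redis timeouts and heap growth)
- The trace_routes events endpoint releases `raw_domain_results` early and triggers `gc.collect()` after large queries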
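
The isolation idea in the list above can be sketched as a bounded semaphore guarding the slow path, combined with a small worker pool. This is only an illustration under assumptions: `run_slow_query`, `fetch_batches`, `chunk`, and `query_fn` are hypothetical names, and the real `read_sql_df_slow` implementation is not shown in this proposal.

```python
import threading
from concurrent.futures import ThreadPoolExecutor

# Values taken from the proposal's new defaults; the names mirror its settings.
DB_SLOW_MAX_CONCURRENT = 5
EVENT_FETCHER_MAX_WORKERS = 2

# Caps how many slow (non-pooled) queries may run at once, process-wide.
_slow_slots = threading.BoundedSemaphore(DB_SLOW_MAX_CONCURRENT)


def chunk(items, size):
    """Split a list into consecutive batches of at most `size` items."""
    return [items[i:i + size] for i in range(0, len(items), size)]


def run_slow_query(batch, query_fn):
    """Run one batch while holding a slow-query slot.

    `query_fn` is a hypothetical stand-in for a function that opens its own
    connection instead of borrowing one from the shared pool.
    """
    with _slow_slots:
        return query_fn(batch)


def fetch_batches(batches, query_fn, max_workers=EVENT_FETCHER_MAX_WORKERS):
    """Run all batches with at most `max_workers` in flight, preserving order."""
    with ThreadPoolExecutor(max_workers=max_workers) as executor:
        return list(executor.map(lambda b: run_slow_query(b, query_fn), batches))
```

With two workers and five slots, a single trace request can hold at most two slow slots, which leaves the remaining slots for other slow queries and leaves the shared pool untouched.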
## Capabilities
### New Capabilities
(No new capabilities)
### Modified Capabilities
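- `event-fetcher-unified`: uses non-pooled connections, lowers the default concurrency, and skips the cache for large CID sets
- `lineage-engine-core`: uses non-pooled connections (does not occupy the pool, avoiding contention with event_fetcher)
- `trace-staged-api`: lowers domain concurrency, releases memory early, and skips the route-level cache for large queries
- `runtime-resilience-recovery`: increases the slow-query semaphore capacity to accommodate trace pipeline batch queries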
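
The cache-skip and early-release behaviour listed for `event-fetcher-unified` and `trace-staged-api` might look roughly like the sketch below. The function names, the dict-based cache, and the key format are assumptions made for illustration; only the 10K threshold, `raw_domain_results`, and `gc.collect()` come from this proposal.

```python
import gc
import hashlib

# From the proposal: skip caching when the CID count exceeds 10K.
CACHE_SKIP_CID_THRESHOLD = 10_000


def cache_key(cids):
    """Build a stable key from the CID set (hypothetical key format)."""
    digest = hashlib.sha256(",".join(sorted(cids)).encode("utf-8")).hexdigest()
    return f"events:{digest}"


def fetch_events(cids, cache, load_fn, threshold=CACHE_SKIP_CID_THRESHOLD):
    """Read through `cache` for small CID sets; bypass it for large ones."""
    use_cache = len(cids) <= threshold
    key = cache_key(cids) if use_cache else None
    if use_cache and key in cache:
        return cache[key]
    result = load_fn(cids)
    if use_cache:
        cache[key] = result  # large results are never serialized into the cache
    return result


def summarize_and_release(raw_domain_results, cid_count,
                          threshold=CACHE_SKIP_CID_THRESHOLD):
    """Reduce raw per-domain rows to counts, then drop the raw data early."""
    summary = {domain: len(rows) for domain, rows in raw_domain_results.items()}
    # Drops this function's reference only; memory is freed once the caller
    # also lets go of the object.
    del raw_domain_results
    if cid_count > threshold:
        gc.collect()  # encourage prompt cleanup after a large query
    return summary
```

Bypassing the cache for large sets trades repeated database work for bounded memory and Redis payload size, which matches the failure mode described under Why.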
## Impact
- **Backend services**: event_fetcher.py, lineage_engine.py (import switch)
- **routes**: trace_routes.py (concurrency + memory management)
- **config**: settings.py (DB_SLOW_MAX_CONCURRENT)
- **Real-time monitoring pages**: unaffected (continue to use the pool)
- **Frontend**: no changes
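
As a rough illustration of the config change, the per-environment `DB_SLOW_MAX_CONCURRENT` defaults from the commit message (base 5, dev 3, prod 5) could be resolved as follows. The function name, the environment-variable override, and the fallback rules are assumptions, not the project's actual settings.py.

```python
import os

# New defaults per environment, as listed in the commit message.
_SLOW_CONCURRENCY_DEFAULTS = {"base": 5, "dev": 3, "prod": 5}


def resolve_slow_max_concurrent(env_name, environ=None):
    """Return the slow-query semaphore capacity for `env_name`.

    An explicit DB_SLOW_MAX_CONCURRENT variable wins when it is a positive
    integer (assumed behaviour); otherwise the environment default applies.
    """
    environ = os.environ if environ is None else environ
    default = _SLOW_CONCURRENCY_DEFAULTS.get(
        env_name, _SLOW_CONCURRENCY_DEFAULTS["base"]
    )
    raw = environ.get("DB_SLOW_MAX_CONCURRENT")
    if raw is None:
        return default
    try:
        value = int(raw)
    except ValueError:
        return default  # ignore malformed overrides
    return value if value > 0 else default
```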