egg cbb943dfe5 feat(trace-pool-isolation): migrate event_fetcher/lineage_engine to slow connections + fix 51 test failures
Trace pipeline pool isolation:
- Switch event_fetcher and lineage_engine to read_sql_df_slow (non-pooled)
- Reduce EVENT_FETCHER_MAX_WORKERS 4→2, TRACE_EVENTS_MAX_WORKERS 4→2
- Add 60s timeout per batch query, cache skip for CID>10K
- Early del raw_domain_results + gc.collect() for large queries
- Increase DB_SLOW_MAX_CONCURRENT: base 3→5, dev 2→3, prod 3→5

Test fixes (51 pre-existing failures → 0):
- reject_history: WORKFLOW CSV header, strict bool validation, pareto mock path
- portal shell: remove non-existent /tmtt-defect route from tests
- conftest: add --run-stress option to skip stress/load tests by default
- migration tests: skipif baseline directory missing
- performance test: update Vite asset assertion
- wip hold: add firstname/waferdesc mock params
- template integration: add /reject-history canonical route

Co-Authored-By: Claude Opus 4.6 <noreply@anthropic.com>
2026-02-25 16:13:19 +08:00


Why

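A wide-range MSD trace pipeline query (TMTT, 5 months) produced 114K CIDs. event_fetcher used read_sql_df (pooled connections) to issue ~230 batch queries, which saturated the connection pool. As a result, background tasks (equipment cache, SYS_DATE) saw query times jump from 1s to 500s, ending in a Redis timeout and a gunicorn worker SIGKILL (the 2026-02-25 13:18 incident).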

What Changes

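  • event_fetcher and lineage_engine switch to read_sql_df_slow (dedicated connections), so they no longer occupy the pool
  • Lower the EVENT_FETCHER_MAX_WORKERS default 4→2 and the TRACE_EVENTS_MAX_WORKERS default 4→2 to reduce concurrent load on Oracle
  • Raise the DB_SLOW_MAX_CONCURRENT semaphore capacity 3→5 to accommodate event_fetcher batch queries
  • event_fetcher skips the L1/L2 cache when the CID count exceeds 10K (avoiding hundreds of MB of JSON serialization, which leads to Redis timeouts and heap bloat)
  • The trace_routes events endpoint releases raw_domain_results early and triggers gc.collect() after large queries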
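
The interaction between the worker cap and the slow-query semaphore can be sketched roughly as follows. This is a minimal illustration, not the project's code: `run_slow`, `fetch_batches`, `query_fn`, and `acquire_timeout` are hypothetical names, and `query_fn` stands in for a call such as read_sql_df_slow, whose real signature is not shown in this proposal. Only the two numeric values come from the change itself.

```python
import threading
from concurrent.futures import ThreadPoolExecutor

DB_SLOW_MAX_CONCURRENT = 5      # new capacity from this change (was 3)
EVENT_FETCHER_MAX_WORKERS = 2   # new default from this change (was 4)

# One process-wide gate shared by every slow (non-pooled) query.
_slow_slots = threading.BoundedSemaphore(DB_SLOW_MAX_CONCURRENT)


def run_slow(query_fn, *args, acquire_timeout=60.0):
    """Run query_fn while holding one slow-query slot (hypothetical helper).

    acquire_timeout only bounds the wait for a slot; it is an illustrative
    parameter and not the per-query timeout mentioned in the commit message.
    """
    if not _slow_slots.acquire(timeout=acquire_timeout):
        raise TimeoutError("no slow-query slot available")
    try:
        return query_fn(*args)
    finally:
        _slow_slots.release()


def fetch_batches(batches, query_fn):
    """Fan batches out over a small worker pool; results keep batch order."""
    with ThreadPoolExecutor(max_workers=EVENT_FETCHER_MAX_WORKERS) as pool:
        return list(pool.map(lambda batch: run_slow(query_fn, batch), batches))
```

With two workers and five slots, a single trace request can hold at most two slow-query slots at a time, which leaves room for other slow queries while the shared pool stays untouched.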

Capabilities

New Capabilities

(No new capabilities)

Modified Capabilities

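  • event-fetcher-unified: use non-pooled connections, lower the default concurrency, and skip the cache for large CID sets
  • lineage-engine-core: use non-pooled connections (stays out of the pool and avoids contending with event_fetcher)
  • trace-staged-api: lower domain concurrency, release memory early, and skip the route-level cache for large queries
  • runtime-resilience-recovery: raise the slow query semaphore capacity to accommodate trace pipeline batch queries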
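
The cache bypass and early memory release described for event-fetcher-unified and trace-staged-api might look something like the sketch below. Everything here is an assumption made for illustration: the constant name `CACHE_SKIP_CID_THRESHOLD`, the helper names, the dict used as a cache, and the choice to reuse the same 10K threshold to decide when a query counts as "large" for gc.collect(). Only the 10K figure and the del + gc.collect() idea come from the change.

```python
import gc

CACHE_SKIP_CID_THRESHOLD = 10_000  # ">10K" from the proposal; the name is hypothetical


def fetch_events(cids, cache_key, query_fn, cache):
    """Read through the cache only when the CID set is small enough."""
    use_cache = len(cids) <= CACHE_SKIP_CID_THRESHOLD
    if use_cache and cache_key in cache:
        return cache[cache_key]
    result = query_fn(cids)
    if use_cache:
        # Large results are never serialized into the cache.
        cache[cache_key] = result
    return result


def summarize_and_release(raw_domain_results, cid_count):
    """Build the response, then drop the raw rows as early as possible."""
    summary = {domain: len(rows) for domain, rows in raw_domain_results.items()}
    del raw_domain_results  # drop this frame's reference to the raw rows
    if cid_count > CACHE_SKIP_CID_THRESHOLD:
        gc.collect()  # assumed trigger condition for "large" queries
    return summary
```

Skipping both the read and the write for large sets keeps the behavior simple: a large query never pays serialization cost, and a stale large entry can never be served.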

Impact

  • Backend services: event_fetcher.py, lineage_engine.py (import switch)
  • routes: trace_routes.py (concurrency + memory management)
  • config: settings.py (DB_SLOW_MAX_CONCURRENT)
  • Real-time monitoring pages: unaffected (continue to use the pool)
  • Frontend: no changes
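
As a rough picture of the config change, the per-environment defaults listed in the commit message (base 5, dev 3, prod 5) could be resolved as in the sketch below. The actual layout of settings.py is not shown in this proposal, so the dict, the function name `slow_max_concurrent`, and the environment-variable override are all assumptions.

```python
import os

# New defaults per the commit message: base 3→5, dev 2→3, prod 3→5.
_SLOW_MAX_CONCURRENT_DEFAULTS = {"base": 5, "dev": 3, "prod": 5}


def slow_max_concurrent(env_name, environ=os.environ):
    """Resolve DB_SLOW_MAX_CONCURRENT for an environment (hypothetical helper).

    An environment variable of the same name wins when it is a positive
    integer; that override is an assumption, not confirmed by the source.
    """
    default = _SLOW_MAX_CONCURRENT_DEFAULTS.get(
        env_name, _SLOW_MAX_CONCURRENT_DEFAULTS["base"]
    )
    raw = environ.get("DB_SLOW_MAX_CONCURRENT", "")
    if raw.strip().isdigit() and int(raw) > 0:
        return int(raw)
    return default
```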