feat(trace-pool-isolation): migrate event_fetcher/lineage_engine to slow connections + fix 51 test failures

Trace pipeline pool isolation:
- Switch event_fetcher and lineage_engine to read_sql_df_slow (non-pooled)
- Reduce EVENT_FETCHER_MAX_WORKERS 4→2, TRACE_EVENTS_MAX_WORKERS 4→2
- Add 60s timeout per batch query, cache skip for CID>10K
- Early del raw_domain_results + gc.collect() for large queries
- Increase DB_SLOW_MAX_CONCURRENT: base 3→5, dev 2→3, prod 3→5

Test fixes (51 pre-existing failures → 0):
- reject_history: WORKFLOW CSV header, strict bool validation, pareto mock path
- portal shell: remove non-existent /tmtt-defect route from tests
- conftest: add --run-stress option to skip stress/load tests by default
- migration tests: skipif baseline directory missing
- performance test: update Vite asset assertion
- wip hold: add firstname/waferdesc mock params
- template integration: add /reject-history canonical route

Co-Authored-By: Claude Opus 4.6 <noreply@anthropic.com>
Author: egg
Date: 2026-02-25 16:13:19 +08:00
Parent: 49bd4b31d3
Commit: cbb943dfe5
33 changed files with 453 additions and 94 deletions


@@ -0,0 +1,2 @@
schema: spec-driven
created: 2026-02-25


@@ -0,0 +1,81 @@
## Context
The MSD trace pipeline has three stages: seed-resolve → lineage → events.
seed-resolve already uses `read_sql_df_slow` (dedicated connection), but lineage and events still use
`read_sql_df` (pool connection, 55s timeout). For large-scope queries (114K CIDs), the events stage
issues ~230 Oracle queries, saturating the pool (10+15=25 connections), so background tasks and other
workers cannot obtain connections, ending in cascade failure (Redis timeout → worker SIGKILL).
The connection pool is designed for the real-time monitoring pages' short queries (<1s) and is not
suited to the trace pipeline's long-running batch work.
## Goals / Non-Goals
**Goals:**
- Trace pipeline (lineage + events) Oracle queries no longer consume the shared pool
- Lower the number of concurrent Oracle queries to reduce DB pressure
- Large queries (>10K CIDs) do not trigger Redis/L1 cache writes, avoiding OOM and Redis timeouts
- Background tasks (equipment cache, SYS_DATE) can still obtain pool connections while a trace runs
**Non-Goals:**
- Adding a CID count cap (would produce incomplete data)
- Refactoring event_fetcher to a streaming/chunked architecture (future improvement)
- Changing frontend timeouts or the real-time monitoring pages
## Decisions
### D1: Switch event_fetcher + lineage_engine to `read_sql_df_slow`
**Choice**: import alias switch (`read_sql_df_slow as read_sql_df`)
**Rationale**: Smallest possible change. No call sites need modification; only the import line changes.
`read_sql_df_slow` opens a dedicated oracledb connection and does not consume the pool.
**Alternatives considered**:
- A second dedicated pool → over-engineered, complex to manage
- A separate semaphore for event_fetcher → adds the complexity of managing two semaphores
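The import-alias pattern behind D1 can be sketched as follows. The real module path and function signatures are assumptions (both functions are stubbed locally here so the pattern itself can be exercised); only the aliasing technique matches the decision.

```python
def read_sql_df_pooled(sql, **binds):
    """Stand-in for the pooled read_sql_df: borrows a shared pool connection."""
    return {"sql": sql, "binds": binds, "connection": "pool"}

def read_sql_df_slow(sql, **binds):
    """Stand-in for read_sql_df_slow: opens a dedicated oracledb connection."""
    return {"sql": sql, "binds": binds, "connection": "dedicated"}

# The actual change is a one-line aliased import, e.g.:
#   from core.db import read_sql_df_slow as read_sql_df
# (module path hypothetical). Every existing call site stays untouched
# but now runs on a dedicated connection instead of the shared pool.
read_sql_df = read_sql_df_slow

def fetch_batch(cids):
    # Unchanged call site: still spelled read_sql_df(...).
    return read_sql_df("SELECT * FROM events WHERE cid IN (:cids)", cids=cids)

print(fetch_batch(["C1", "C2"])["connection"])  # dedicated
```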
### D2: Lower worker defaults
**Choice**:
- `EVENT_FETCHER_MAX_WORKERS`: 4 → 2
- `TRACE_EVENTS_MAX_WORKERS`: 4 → 2
**Rationale**: Peak concurrency = 2 domains × 2 workers = 4 slow queries.
Combined with semaphore=5 this leaves 1 slot for other slow queries. Still tunable via env vars.
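A minimal sketch of the env-var-overridable defaults and the capacity arithmetic above (the surrounding settings module is assumed):

```python
import os

# Defaults lowered to 2, but operators can still override via env vars.
EVENT_FETCHER_MAX_WORKERS = int(os.getenv("EVENT_FETCHER_MAX_WORKERS", "2"))
TRACE_EVENTS_MAX_WORKERS = int(os.getenv("TRACE_EVENTS_MAX_WORKERS", "2"))

# Peak slow-query concurrency from the trace pipeline alone:
# TRACE_EVENTS_MAX_WORKERS domains, each running EVENT_FETCHER_MAX_WORKERS batches.
peak = TRACE_EVENTS_MAX_WORKERS * EVENT_FETCHER_MAX_WORKERS

DB_SLOW_MAX_CONCURRENT = 5
spare_slots = DB_SLOW_MAX_CONCURRENT - peak  # slots left for other slow queries
print(peak, spare_slots)  # 4 1
```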
### D3: event_fetcher per-batch query timeout = 60s
**Choice**: pass `timeout_seconds=60` in `_fetch_batch`
**Rationale**: Each batch query normally takes 2-6s; the 300s default is excessive. 60s gives 10x headroom.
lineage stays uncapped (CONNECT BY can be slower; it keeps the 300s default).
### D4: Skip cache for large CID sets (threshold = 10K)
**Choice**: `CACHE_SKIP_CID_THRESHOLD = 10000` (tunable via env var)
**Rationale**:
- The cache key for 114K CIDs is an MD5 of the sorted CIDs; the chance the exact same set is queried again is negligible
- JSON-serializing 1M+ records can reach hundreds of MB; with Redis `socket_timeout=5s` this is guaranteed to time out
- The L1 MemoryTTLCache would pin a GB-scale dict on the heap for the full TTL (300s)
The route-level events cache is likewise skipped when CID > 10K.
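The D4 guard can be sketched as below. The threshold name and env var come from the decision; the cache containers, key, and logger wiring are stand-ins, not the real implementation.

```python
import logging
import os

logger = logging.getLogger("event_fetcher")

CACHE_SKIP_CID_THRESHOLD = int(
    os.getenv("EVENT_FETCHER_CACHE_SKIP_CID_THRESHOLD", "10000")
)

def maybe_cache(domain, cids, payload, l1_cache, l2_cache, key):
    """Write to L1/L2 only when the CID set is small enough to be worth caching."""
    if len(cids) > CACHE_SKIP_CID_THRESHOLD:
        logger.warning(
            "skip cache for %s: %d CIDs > threshold %d",
            domain, len(cids), CACHE_SKIP_CID_THRESHOLD,
        )
        return False  # the query result is still returned to the caller, just not cached
    l1_cache[key] = payload
    l2_cache[key] = payload
    return True
```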
### D5: Early memory release + gc.collect
**Choice**: in the trace_routes events endpoint, `del raw_domain_results` after MSD aggregation,
and call `gc.collect()` after large queries.
**Rationale**: `raw_domain_results` and `results` are two representations of the same data;
once aggregation completes, the grouped-by-CID version is no longer needed. `gc.collect()` ensures
Python's generational GC reclaims the large number of dicts promptly.
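The release pattern in D5 can be sketched as follows. The names `raw_domain_results` and the aggregation step come from the decision text; the aggregation body itself is a stub.

```python
import gc

def aggregate_msd(raw_domain_results):
    # Stand-in: collapse the grouped-by-CID dicts into a flat result list.
    return [row for rows in raw_domain_results.values() for row in rows]

def build_events_response(raw_domain_results, cid_count):
    results = aggregate_msd(raw_domain_results)
    # The grouped representation is now redundant; drop the reference
    # so the underlying dicts become collectable before the response is built.
    del raw_domain_results
    response = {"events": results, "count": len(results)}
    if cid_count > 10_000:
        gc.collect()  # prompt the generational GC to reclaim the dicts now
    return response

print(build_events_response({"a": [1, 2], "b": [3]}, 20_000))
# {'events': [1, 2, 3], 'count': 3}
```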
## Risks / Trade-offs
| Risk | Mitigation |
|------|------------|
| lineage with 70K lots → 70+ sequential slow queries (~140s) may approach the 360s timeout | Redis cache (TTL=300s) already exists, so repeated queries hit the cache. First-run worst case ~280s < 360s |
| With semaphore=5, multiple large queries running at once may queue | Each batch query takes 2-6s, so queue wait is <10s; the 60s wait timeout is ample |
| Skipping the cache means repeated large queries re-execute | Such large queries are rare (5-month range + specific stations) and not worth the cache space |
| `gc.collect()` has a small CPU cost | Only triggered when CID>10K, and runs after the response is built |


@@ -0,0 +1,35 @@
## Why
A large-scope MSD trace pipeline query (TMTT, 5 months) produced 114K CIDs; event_fetcher, using
`read_sql_df` (pool connections), issued ~230 batch queries and saturated the connection pool. Background
tasks (equipment cache, SYS_DATE) saw query times explode from 1s to 500s, ending in Redis timeouts +
gunicorn worker SIGKILL (2026-02-25 13:18 incident).
## What Changes
- event_fetcher and lineage_engine switch to `read_sql_df_slow` (dedicated connections), no longer consuming the pool
- Lower `EVENT_FETCHER_MAX_WORKERS` default 4→2 and `TRACE_EVENTS_MAX_WORKERS` default 4→2 to reduce Oracle concurrency pressure
- Raise the `DB_SLOW_MAX_CONCURRENT` semaphore capacity 3→5 to accommodate event_fetcher batch queries
- event_fetcher skips the L1/L2 cache when CID count >10K (avoids hundreds of MB of JSON serialization causing Redis timeouts and heap bloat)
- trace_routes events endpoint releases `raw_domain_results` early and triggers `gc.collect()` after large queries
## Capabilities
### New Capabilities
(no new capabilities)
### Modified Capabilities
- `event-fetcher-unified`: non-pool connections + lower default concurrency + cache skip for large CID sets
- `lineage-engine-core`: non-pool connections (no pool consumption, avoiding contention with event_fetcher)
- `trace-staged-api`: lower domain concurrency + early memory release + route-level cache skip for large queries
- `runtime-resilience-recovery`: larger slow query semaphore capacity to accommodate trace pipeline batch queries
## Impact
- **Backend services**: event_fetcher.py, lineage_engine.py (import switch)
- **Routes**: trace_routes.py (concurrency + memory management)
- **Config**: settings.py (DB_SLOW_MAX_CONCURRENT)
- **Real-time monitoring pages**: unaffected (they keep using the pool)
- **Frontend**: no changes


@@ -0,0 +1,31 @@
## MODIFIED Requirements
### Requirement: EventFetcher SHALL provide unified cached event querying across domains
`EventFetcher` SHALL encapsulate batch event queries with L1/L2 layered cache and rate limit bucket configuration, supporting domains: `history`, `materials`, `rejects`, `holds`, `jobs`, `upstream_history`, `downstream_rejects`.
#### Scenario: Cache miss for event domain query
- **WHEN** `EventFetcher` is called for a domain with container IDs and no cache exists
- **THEN** the domain query SHALL execute against Oracle via `read_sql_df_slow()` (non-pooled dedicated connection)
- **THEN** each batch query SHALL use `timeout_seconds=60`
- **THEN** the result SHALL be stored in L2 Redis cache with key format `evt:{domain}:{sorted_cids_hash}` if CID count is within cache threshold
- **THEN** L1 memory cache SHALL also be populated if CID count is within cache threshold
#### Scenario: Cache hit for event domain query
- **WHEN** `EventFetcher` is called for a domain and L2 Redis cache contains a valid entry
- **THEN** the cached result SHALL be returned without executing Oracle query
- **THEN** DB connection pool SHALL NOT be consumed
#### Scenario: Rate limit bucket per domain
- **WHEN** `EventFetcher` is used from a route handler
- **THEN** each domain SHALL have a configurable rate limit bucket aligned with `configured_rate_limit()` pattern
- **THEN** rate limit configuration SHALL be overridable via environment variables
#### Scenario: Large CID set exceeds cache threshold
- **WHEN** the normalized CID count exceeds `CACHE_SKIP_CID_THRESHOLD` (default 10000, env: `EVENT_FETCHER_CACHE_SKIP_CID_THRESHOLD`)
- **THEN** EventFetcher SHALL skip both L1 and L2 cache writes
- **THEN** a warning log SHALL be emitted with domain name, CID count, and threshold value
- **THEN** the query result SHALL still be returned to the caller
#### Scenario: Batch concurrency default
- **WHEN** EventFetcher processes batches for a domain with >1000 CIDs
- **THEN** the default `EVENT_FETCHER_MAX_WORKERS` SHALL be 2 (env: `EVENT_FETCHER_MAX_WORKERS`)
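The `evt:{domain}:{sorted_cids_hash}` key format named in the cache-miss scenario can be sketched as below. The exact hashing scheme (MD5 of the sorted, comma-joined CIDs) is an assumption based on the design notes, not the verbatim implementation.

```python
import hashlib

def event_cache_key(domain: str, cids: list) -> str:
    # Sort first so that input order never changes the key.
    digest = hashlib.md5(",".join(sorted(cids)).encode()).hexdigest()
    return f"evt:{domain}:{digest}"

# The same CID set in any order yields the same key:
print(event_cache_key("rejects", ["C2", "C1"]) ==
      event_cache_key("rejects", ["C1", "C2"]))  # True
```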


@@ -0,0 +1,13 @@
## MODIFIED Requirements
### Requirement: LineageEngine SHALL use non-pooled database connections
All Oracle queries executed by `LineageEngine` SHALL use `read_sql_df_slow()` (dedicated non-pooled connections) instead of `read_sql_df()` (connection pool).
#### Scenario: Lineage query does not consume pool connections
- **WHEN** `LineageEngine` executes split ancestor, merge source, or other Oracle queries
- **THEN** queries SHALL use `read_sql_df_slow()` with the default slow query timeout (300s)
- **THEN** the shared connection pool SHALL NOT be consumed by lineage queries
#### Scenario: Lineage queries respect slow query semaphore
- **WHEN** `LineageEngine` executes queries via `read_sql_df_slow()`
- **THEN** each query SHALL acquire and release a slot from the slow query semaphore (`DB_SLOW_MAX_CONCURRENT`)
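The acquire/release discipline in the semaphore scenario can be sketched as follows. `read_sql_df_slow` itself is stubbed; only the contract (block for a slot, always release, even on failure) is illustrated.

```python
import threading

DB_SLOW_MAX_CONCURRENT = 5
_slow_semaphore = threading.BoundedSemaphore(DB_SLOW_MAX_CONCURRENT)

def read_sql_df_slow(sql, timeout_seconds=300):
    # Block until a slow-query slot is free, run the query on a dedicated
    # connection (stubbed here), and always give the slot back.
    if not _slow_semaphore.acquire(timeout=timeout_seconds):
        raise TimeoutError("no slow-query slot available within timeout")
    try:
        return {"sql": sql, "timeout": timeout_seconds}  # stand-in result
    finally:
        _slow_semaphore.release()

print(read_sql_df_slow("SELECT 1 FROM dual")["timeout"])  # 300
```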


@@ -0,0 +1,16 @@
## MODIFIED Requirements
### Requirement: Database Pool Runtime Configuration SHALL Be Enforced
The system SHALL apply database pool and timeout parameters from runtime configuration to the active SQLAlchemy engine used by request handling.
#### Scenario: Runtime pool configuration takes effect
- **WHEN** operators set pool and timeout values via environment configuration and start the service
- **THEN** the active engine MUST use those values for pool size, overflow, wait timeout, and query call timeout
#### Scenario: Slow query semaphore capacity
- **WHEN** the service starts in production or staging configuration
- **THEN** `DB_SLOW_MAX_CONCURRENT` SHALL default to 5 (env: `DB_SLOW_MAX_CONCURRENT`)
- **WHEN** the service starts in development configuration
- **THEN** `DB_SLOW_MAX_CONCURRENT` SHALL default to 3
- **WHEN** the service starts in testing configuration
- **THEN** `DB_SLOW_MAX_CONCURRENT` SHALL remain at 1
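The per-environment defaults above can be sketched as a typical Flask-style settings module. The class names mirror a common convention and are assumptions, not the repository's actual `settings.py`.

```python
import os

class Config:
    # Base default raised 3 -> 5, still overridable via env var.
    DB_SLOW_MAX_CONCURRENT = int(os.getenv("DB_SLOW_MAX_CONCURRENT", "5"))

class DevelopmentConfig(Config):
    DB_SLOW_MAX_CONCURRENT = int(os.getenv("DB_SLOW_MAX_CONCURRENT", "3"))

class TestingConfig(Config):
    DB_SLOW_MAX_CONCURRENT = 1  # kept at 1 so tests stay deterministic

class ProductionConfig(Config):
    pass  # inherits the base default of 5
```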


@@ -0,0 +1,24 @@
## ADDED Requirements
### Requirement: Trace events endpoint SHALL limit domain concurrency
The `/api/trace/events` endpoint SHALL use `TRACE_EVENTS_MAX_WORKERS` to control how many domains execute concurrently.
#### Scenario: Default domain concurrency
- **WHEN** the events endpoint dispatches domain queries
- **THEN** the default `TRACE_EVENTS_MAX_WORKERS` SHALL be 2 (env: `TRACE_EVENTS_MAX_WORKERS`)
### Requirement: Trace events endpoint SHALL manage memory for large queries
The events endpoint SHALL proactively release memory after processing large CID sets.
#### Scenario: Early release of grouped domain results
- **WHEN** MSD aggregation completes using `raw_domain_results`
- **THEN** the `raw_domain_results` reference SHALL be deleted immediately after aggregation
- **THEN** for non-MSD profiles, `raw_domain_results` SHALL be deleted after result assembly
#### Scenario: Garbage collection for large CID sets
- **WHEN** the events endpoint completes processing and the CID count exceeds 10000
- **THEN** `gc.collect()` SHALL be called to prompt Python garbage collection
#### Scenario: Large CID set skips route-level cache
- **WHEN** the events endpoint completes for a non-MSD profile and CID count exceeds 10000
- **THEN** the route-level events cache write SHALL be skipped
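The domain-level concurrency cap from the first requirement above can be sketched with a bounded thread pool. `fetch_domain` is a stand-in for the real per-domain query.

```python
import os
from concurrent.futures import ThreadPoolExecutor

# At most TRACE_EVENTS_MAX_WORKERS domains query Oracle at the same time.
TRACE_EVENTS_MAX_WORKERS = int(os.getenv("TRACE_EVENTS_MAX_WORKERS", "2"))

def fetch_domain(domain, cids):
    return domain, len(cids)  # stand-in for the real per-domain query

def fetch_all_domains(domains, cids):
    with ThreadPoolExecutor(max_workers=TRACE_EVENTS_MAX_WORKERS) as pool:
        futures = [pool.submit(fetch_domain, d, cids) for d in domains]
        return dict(f.result() for f in futures)

print(fetch_all_domains(["history", "rejects", "holds"], ["C1", "C2"]))
# {'history': 2, 'rejects': 2, 'holds': 2}
```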


@@ -0,0 +1,24 @@
## 1. Config changes
- [x] 1.1 `settings.py`: Config base `DB_SLOW_MAX_CONCURRENT` 3→5, DevelopmentConfig 2→3, ProductionConfig 3→5
## 2. Backend service migration
- [x] 2.1 `event_fetcher.py`: import `read_sql_df_slow as read_sql_df` (line 16)
- [x] 2.2 `event_fetcher.py`: `EVENT_FETCHER_MAX_WORKERS` default 4→2 (line 22)
- [x] 2.3 `event_fetcher.py`: `_fetch_batch` passes `timeout_seconds=60` (line 247)
- [x] 2.4 `event_fetcher.py`: add `CACHE_SKIP_CID_THRESHOLD`; `fetch_events` skips cache for large CID sets + `del grouped`
- [x] 2.5 `lineage_engine.py`: import `read_sql_df_slow as read_sql_df` (line 10)
## 3. Route layer changes
- [x] 3.1 `trace_routes.py`: `TRACE_EVENTS_MAX_WORKERS` default 4→2 (line 39)
- [x] 3.2 `trace_routes.py`: early `del raw_domain_results` in the events endpoint
- [x] 3.3 `trace_routes.py`: `gc.collect()` after large queries
- [x] 3.4 `trace_routes.py`: skip the route-level events cache for large queries
## 4. Tests
- [x] 4.1 `test_event_fetcher.py`: add regression test verifying the slow-path import
- [x] 4.2 `test_lineage_engine.py`: add regression test verifying the slow-path import
- [x] 4.3 Run `pytest tests/ -v` and confirm everything passes
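The regression tests in 4.1/4.2 amount to asserting that the module-level `read_sql_df` name resolves to the slow variant. A sketch of that assertion pattern, with the service module faked inline (the real test would import `event_fetcher` from the project instead):

```python
import types

def read_sql_df_slow(sql, **binds):
    return "slow"  # stand-in for the real non-pooled query function

# Stand-in for the imported service module; in the real test this would be
# something like: from services import event_fetcher
event_fetcher = types.SimpleNamespace(read_sql_df=read_sql_df_slow)

def test_event_fetcher_uses_slow_path():
    # Identity check: the alias must point at the slow function, so any
    # future revert to the pooled import fails this test immediately.
    assert event_fetcher.read_sql_df is read_sql_df_slow

test_event_fetcher_uses_slow_path()
print("ok")
```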


@@ -4,13 +4,14 @@
TBD - created by archiving change unified-lineage-engine. Update Purpose after archive.
## Requirements
### Requirement: EventFetcher SHALL provide unified cached event querying across domains
`EventFetcher` SHALL encapsulate batch event queries with L1/L2 layered cache and rate limit bucket configuration, supporting domains: `history`, `materials`, `rejects`, `holds`, `jobs`, `upstream_history`.
`EventFetcher` SHALL encapsulate batch event queries with L1/L2 layered cache and rate limit bucket configuration, supporting domains: `history`, `materials`, `rejects`, `holds`, `jobs`, `upstream_history`, `downstream_rejects`.
#### Scenario: Cache miss for event domain query
- **WHEN** `EventFetcher` is called for a domain with container IDs and no cache exists
- **THEN** the domain query SHALL execute against Oracle via `read_sql_df()`
- **THEN** the result SHALL be stored in L2 Redis cache with key format `evt:{domain}:{sorted_cids_hash}`
- **THEN** L1 memory cache SHALL also be populated (aligned with `core/cache.py` LayeredCache pattern)
- **THEN** the domain query SHALL execute against Oracle via `read_sql_df_slow()` (non-pooled dedicated connection)
- **THEN** each batch query SHALL use `timeout_seconds=60`
- **THEN** the result SHALL be stored in L2 Redis cache with key format `evt:{domain}:{sorted_cids_hash}` if CID count is within cache threshold
- **THEN** L1 memory cache SHALL also be populated if CID count is within cache threshold
#### Scenario: Cache hit for event domain query
- **WHEN** `EventFetcher` is called for a domain and L2 Redis cache contains a valid entry
@@ -22,3 +23,13 @@ TBD - created by archiving change unified-lineage-engine. Update Purpose after a
- **THEN** each domain SHALL have a configurable rate limit bucket aligned with `configured_rate_limit()` pattern
- **THEN** rate limit configuration SHALL be overridable via environment variables
#### Scenario: Large CID set exceeds cache threshold
- **WHEN** the normalized CID count exceeds `CACHE_SKIP_CID_THRESHOLD` (default 10000, env: `EVENT_FETCHER_CACHE_SKIP_CID_THRESHOLD`)
- **THEN** EventFetcher SHALL skip both L1 and L2 cache writes
- **THEN** a warning log SHALL be emitted with domain name, CID count, and threshold value
- **THEN** the query result SHALL still be returned to the caller
#### Scenario: Batch concurrency default
- **WHEN** EventFetcher processes batches for a domain with >1000 CIDs
- **THEN** the default `EVENT_FETCHER_MAX_WORKERS` SHALL be 2 (env: `EVENT_FETCHER_MAX_WORKERS`)


@@ -59,3 +59,15 @@ New SQL files SHALL follow the existing `SQLLoader` convention under `src/mes_da
- **THEN** `split_ancestors.sql` and `merge_sources.sql` SHALL be loaded via `SQLLoader.load_with_params("lineage/split_ancestors", ...)`
- **THEN** the SQL files SHALL NOT reference `HM_LOTMOVEOUT` (48M row table no longer needed for genealogy)
### Requirement: LineageEngine SHALL use non-pooled database connections
All Oracle queries executed by `LineageEngine` SHALL use `read_sql_df_slow()` (dedicated non-pooled connections) instead of `read_sql_df()` (connection pool).
#### Scenario: Lineage query does not consume pool connections
- **WHEN** `LineageEngine` executes split ancestor, merge source, or other Oracle queries
- **THEN** queries SHALL use `read_sql_df_slow()` with the default slow query timeout (300s)
- **THEN** the shared connection pool SHALL NOT be consumed by lineage queries
#### Scenario: Lineage queries respect slow query semaphore
- **WHEN** `LineageEngine` executes queries via `read_sql_df_slow()`
- **THEN** each query SHALL acquire and release a slot from the slow query semaphore (`DB_SLOW_MAX_CONCURRENT`)


@@ -10,6 +10,14 @@ The system SHALL apply database pool and timeout parameters from runtime configu
- **WHEN** operators set pool and timeout values via environment configuration and start the service
- **THEN** the active engine MUST use those values for pool size, overflow, wait timeout, and query call timeout
#### Scenario: Slow query semaphore capacity
- **WHEN** the service starts in production or staging configuration
- **THEN** `DB_SLOW_MAX_CONCURRENT` SHALL default to 5 (env: `DB_SLOW_MAX_CONCURRENT`)
- **WHEN** the service starts in development configuration
- **THEN** `DB_SLOW_MAX_CONCURRENT` SHALL default to 3
- **WHEN** the service starts in testing configuration
- **THEN** `DB_SLOW_MAX_CONCURRENT` SHALL remain at 1
### Requirement: Pool Exhaustion MUST Return Retry-Aware Degraded Responses
The system MUST return explicit degraded responses for connection pool exhaustion and include machine-readable metadata for retry/backoff behavior.


@@ -1,5 +1,6 @@
## ADDED Requirements
## Purpose
Staged trace API for seed-resolve, lineage, and events pipeline with rate limiting, caching, and memory management.
## Requirements
### Requirement: Staged trace API SHALL expose seed-resolve endpoint
`POST /api/trace/seed-resolve` SHALL resolve seed lots based on the provided profile and parameters.
@@ -87,3 +88,27 @@ The existing analysis endpoint (GET method) SHALL internally delegate to the sta
- **THEN** the endpoint SHALL internally execute seed-resolve → lineage → events + aggregation
- **THEN** the response format SHALL be identical to the pre-refactoring output
- **THEN** a golden test SHALL verify output equivalence
### Requirement: Trace events endpoint SHALL limit domain concurrency
The `/api/trace/events` endpoint SHALL use `TRACE_EVENTS_MAX_WORKERS` to control how many domains execute concurrently.
#### Scenario: Default domain concurrency
- **WHEN** the events endpoint dispatches domain queries
- **THEN** the default `TRACE_EVENTS_MAX_WORKERS` SHALL be 2 (env: `TRACE_EVENTS_MAX_WORKERS`)
### Requirement: Trace events endpoint SHALL manage memory for large queries
The events endpoint SHALL proactively release memory after processing large CID sets.
#### Scenario: Early release of grouped domain results
- **WHEN** MSD aggregation completes using `raw_domain_results`
- **THEN** the `raw_domain_results` reference SHALL be deleted immediately after aggregation
- **THEN** for non-MSD profiles, `raw_domain_results` SHALL be deleted after result assembly
#### Scenario: Garbage collection for large CID sets
- **WHEN** the events endpoint completes processing and the CID count exceeds 10000
- **THEN** `gc.collect()` SHALL be called to prompt Python garbage collection
#### Scenario: Large CID set skips route-level cache
- **WHEN** the events endpoint completes for a non-MSD profile and CID count exceeds 10000
- **THEN** the route-level events cache write SHALL be skipped