feat(trace-pool-isolation): migrate event_fetcher/lineage_engine to slow connections + fix 51 test failures

Trace pipeline pool isolation:
- Switch event_fetcher and lineage_engine to read_sql_df_slow (non-pooled)
- Reduce EVENT_FETCHER_MAX_WORKERS 4→2, TRACE_EVENTS_MAX_WORKERS 4→2
- Add 60s timeout per batch query, cache skip for CID>10K
- Early del raw_domain_results + gc.collect() for large queries
- Increase DB_SLOW_MAX_CONCURRENT: base 3→5, dev 2→3, prod 3→5

Test fixes (51 pre-existing failures → 0):
- reject_history: WORKFLOW CSV header, strict bool validation, pareto mock path
- portal shell: remove non-existent /tmtt-defect route from tests
- conftest: add --run-stress option to skip stress/load tests by default
- migration tests: skipif baseline directory missing
- performance test: update Vite asset assertion
- wip hold: add firstname/waferdesc mock params
- template integration: add /reject-history canonical route

Co-Authored-By: Claude Opus 4.6 <noreply@anthropic.com>
This commit is contained in:
egg
2026-02-25 16:13:19 +08:00
parent 49bd4b31d3
commit cbb943dfe5
33 changed files with 453 additions and 94 deletions

View File

@@ -30,8 +30,8 @@
"route": "/reject-history",
"name": "報廢歷史查詢",
"status": "released",
"drawer_id": "drawer",
"order": 1
"drawer_id": "drawer-2",
"order": 4
},
{
"route": "/wip-detail",
@@ -83,21 +83,21 @@
"name": "設備維修查詢",
"status": "released",
"drawer_id": "drawer",
"order": 2
"order": 1
},
{
"route": "/query-tool",
"name": "批次追蹤工具",
"status": "released",
"drawer_id": "drawer",
"order": 3
"order": 2
},
{
"route": "/mid-section-defect",
"name": "製程不良追溯分析",
"status": "released",
"drawer_id": "drawer",
"order": 4
"order": 3
},
{
"route": "/admin/pages",

View File

@@ -48,7 +48,7 @@
{
"route": "/reject-history",
"name": "報廢歷史查詢",
"status": "dev",
"status": "released",
"order": 4
},
{
@@ -163,6 +163,12 @@
"status": "released",
"order": 3
},
{
"route": "/reject-history",
"name": "報廢歷史查詢",
"status": "released",
"order": 4
},
{
"route": "/resource-history",
"name": "設備歷史績效",

View File

@@ -0,0 +1,2 @@
schema: spec-driven
created: 2026-02-25

View File

@@ -0,0 +1,81 @@
## Context
MSD trace pipeline 有三個階段seed-resolve → lineage → events。
seed-resolve 已使用 `read_sql_df_slow`(獨立連線),但 lineage 和 events 仍用
`read_sql_df`pool 連線55s timeout。大範圍查詢114K CIDs的 events 階段會
產生 ~230 條 Oracle 查詢,佔滿 pool10+15=25 connections導致背景任務和其他
worker 拿不到連線,最終 cascade failureRedis timeout → worker SIGKILL
Connection pool 設計用於即時監控頁的短暫查詢(<1s不適合 trace pipeline
的長時間批次作業
## Goals / Non-Goals
**Goals:**
- trace pipelinelineage + events Oracle 查詢不佔用共用 pool
- 降低同時 Oracle 查詢數減少 DB 壓力
- 大查詢>10K CIDs不觸發 Redis/L1 cache 寫入,避免 OOM 和 Redis timeout
- 背景任務equipment cache、SYS_DATE在 trace 執行期間可正常取得 pool 連線
**Non-Goals:**
- 加入 CID 數量上限(會導致資料不完整)
- 重構 event_fetcher 為 streaming/chunk 架構(未來改善)
- 修改前端 timeout 或即時監控頁
## Decisions
### D1: event_fetcher + lineage_engine 改用 `read_sql_df_slow`
**選擇**: import alias 切換 (`read_sql_df_slow as read_sql_df`)
**理由**: 最小改動量。所有 call site 不需修改,只改 import 行。`read_sql_df_slow`
建立獨立 oracledb 連線,不佔用 pool。
**替代方案考慮**:
- 建立第二個專用 pool → 過度工程,管理複雜
- 給 event_fetcher 自己的 semaphore → 增加兩套 semaphore 的管理複雜度
### D2: 降低 workers 預設值
**選擇**:
- `EVENT_FETCHER_MAX_WORKERS`: 4 → 2
- `TRACE_EVENTS_MAX_WORKERS`: 4 → 2
**理由**: Peak concurrent = 2 domains × 2 workers = 4 slow queries。
搭配 semaphore=5 留 1 slot 給其他 slow 查詢。仍可透過 env var 調整。
### D3: event_fetcher 批次查詢 timeout = 60s
**選擇**: 在 `_fetch_batch` 傳入 `timeout_seconds=60`
**理由**: 每個 batch query 正常 2-6s300s 預設過長。60s 是 10x headroom。
lineage 不設限CONNECT BY 可能較慢,保留 300s 預設)。
### D4: 大 CID 集跳過快取threshold = 10K
**選擇**: `CACHE_SKIP_CID_THRESHOLD = 10000`env var 可調)
**理由**:
- 114K CIDs 的 cache key 是 sorted CIDs 的 MD5同組查詢再次命中機率極低
- JSON 序列化 1M+ records 可達數百 MBRedis `socket_timeout=5s` 必定 timeout
- L1 MemoryTTLCache 會在 heap 留住 GB 級 dict 達 TTL(300s)
route-level events cache 同樣在 CID > 10K 時跳過。
### D5: 早期記憶體釋放 + gc.collect
**選擇**: trace_routes events endpoint 在 MSD aggregation 後 `del raw_domain_results`
大查詢後 `gc.collect()`
**理由**: `raw_domain_results``results` 是同份資料的兩種 representation
aggregation 完成後 grouped-by-CID 版本不再需要。`gc.collect()` 確保
Python 的 generational GC 及時回收大量 dict。
## Risks / Trade-offs
| 風險 | 緩解 |
|------|------|
| lineage 70K lots → 70+ sequential slow queries (~140s) 可能逼近 360s timeout | 已有 Redis cache TTL=300s重複查詢走快取。首次最壞情況 ~280s < 360s |
| semaphore=5 在多個大查詢同時執行時可能排隊 | 每條 batch query 2-6s排隊等待 <10swait timeout=60s 足夠|
| 跳過 cache 後重複大查詢需重新執行 | 大查詢本身罕見 5 月範圍 + 特定站點不值得佔用 cache 空間 |
| `gc.collect()` 有微小 CPU 開銷 | 僅在 CID>10K 時觸發,且在 response 建構後執行 |

View File

@@ -0,0 +1,35 @@
## Why
MSD trace pipeline 大範圍查詢TMTT, 5 個月)產生 114K CIDsevent_fetcher 使用
`read_sql_df`pool 連線)發送 ~230 條批次查詢,佔滿 connection pool導致背景任務
equipment cache、SYS_DATE查詢時間從 1s 暴增到 500s最終 Redis timeout +
gunicorn worker SIGKILL2026-02-25 13:18 事件)。
## What Changes
- event_fetcher 和 lineage_engine 改用 `read_sql_df_slow`(獨立連線),不佔用 pool
- 降低 `EVENT_FETCHER_MAX_WORKERS` 預設 4→2、`TRACE_EVENTS_MAX_WORKERS` 預設 4→2減少 Oracle 並行壓力
- 增加 `DB_SLOW_MAX_CONCURRENT` semaphore 容量 3→5容納 event_fetcher 批次查詢
- event_fetcher 在 CID 數量 >10K 時跳過 L1/L2 cache避免數百 MB JSON 序列化導致 Redis timeout 和 heap 膨脹)
- trace_routes events endpoint 早期釋放 `raw_domain_results` 並在大查詢後觸發 `gc.collect()`
## Capabilities
### New Capabilities
(無新增 capability
### Modified Capabilities
- `event-fetcher-unified`: 改用非 pool 連線 + 降低預設並行數 + 大 CID 集跳過快取
- `lineage-engine-core`: 改用非 pool 連線(不佔用 pool避免與 event_fetcher 競爭)
- `trace-staged-api`: 降低 domain 並行數 + 早期記憶體釋放 + 大查詢跳過 route-level cache
- `runtime-resilience-recovery`: slow query semaphore 容量增加以容納 trace pipeline 批次查詢
## Impact
- **後端 services**: event_fetcher.py, lineage_engine.py (import 切換)
- **routes**: trace_routes.py (並行數 + 記憶體管理)
- **config**: settings.py (DB_SLOW_MAX_CONCURRENT)
- **即時監控頁**: 不受影響(繼續用 pool
- **前端**: 無修改

View File

@@ -0,0 +1,31 @@
## MODIFIED Requirements
### Requirement: EventFetcher SHALL provide unified cached event querying across domains
`EventFetcher` SHALL encapsulate batch event queries with L1/L2 layered cache and rate limit bucket configuration, supporting domains: `history`, `materials`, `rejects`, `holds`, `jobs`, `upstream_history`, `downstream_rejects`.
#### Scenario: Cache miss for event domain query
- **WHEN** `EventFetcher` is called for a domain with container IDs and no cache exists
- **THEN** the domain query SHALL execute against Oracle via `read_sql_df_slow()` (non-pooled dedicated connection)
- **THEN** each batch query SHALL use `timeout_seconds=60`
- **THEN** the result SHALL be stored in L2 Redis cache with key format `evt:{domain}:{sorted_cids_hash}` if CID count is within cache threshold
- **THEN** L1 memory cache SHALL also be populated if CID count is within cache threshold
#### Scenario: Cache hit for event domain query
- **WHEN** `EventFetcher` is called for a domain and L2 Redis cache contains a valid entry
- **THEN** the cached result SHALL be returned without executing Oracle query
- **THEN** DB connection pool SHALL NOT be consumed
#### Scenario: Rate limit bucket per domain
- **WHEN** `EventFetcher` is used from a route handler
- **THEN** each domain SHALL have a configurable rate limit bucket aligned with `configured_rate_limit()` pattern
- **THEN** rate limit configuration SHALL be overridable via environment variables
#### Scenario: Large CID set exceeds cache threshold
- **WHEN** the normalized CID count exceeds `CACHE_SKIP_CID_THRESHOLD` (default 10000, env: `EVENT_FETCHER_CACHE_SKIP_CID_THRESHOLD`)
- **THEN** EventFetcher SHALL skip both L1 and L2 cache writes
- **THEN** a warning log SHALL be emitted with domain name, CID count, and threshold value
- **THEN** the query result SHALL still be returned to the caller
#### Scenario: Batch concurrency default
- **WHEN** EventFetcher processes batches for a domain with >1000 CIDs
- **THEN** the default `EVENT_FETCHER_MAX_WORKERS` SHALL be 2 (env: `EVENT_FETCHER_MAX_WORKERS`)

View File

@@ -0,0 +1,13 @@
## MODIFIED Requirements
### Requirement: LineageEngine SHALL use non-pooled database connections
All Oracle queries executed by `LineageEngine` SHALL use `read_sql_df_slow()` (dedicated non-pooled connections) instead of `read_sql_df()` (connection pool).
#### Scenario: Lineage query does not consume pool connections
- **WHEN** `LineageEngine` executes split ancestor, merge source, or other Oracle queries
- **THEN** queries SHALL use `read_sql_df_slow()` with the default slow query timeout (300s)
- **THEN** the shared connection pool SHALL NOT be consumed by lineage queries
#### Scenario: Lineage queries respect slow query semaphore
- **WHEN** `LineageEngine` executes queries via `read_sql_df_slow()`
- **THEN** each query SHALL acquire and release a slot from the slow query semaphore (`DB_SLOW_MAX_CONCURRENT`)

View File

@@ -0,0 +1,16 @@
## MODIFIED Requirements
### Requirement: Database Pool Runtime Configuration SHALL Be Enforced
The system SHALL apply database pool and timeout parameters from runtime configuration to the active SQLAlchemy engine used by request handling.
#### Scenario: Runtime pool configuration takes effect
- **WHEN** operators set pool and timeout values via environment configuration and start the service
- **THEN** the active engine MUST use those values for pool size, overflow, wait timeout, and query call timeout
#### Scenario: Slow query semaphore capacity
- **WHEN** the service starts in production or staging configuration
- **THEN** `DB_SLOW_MAX_CONCURRENT` SHALL default to 5 (env: `DB_SLOW_MAX_CONCURRENT`)
- **WHEN** the service starts in development configuration
- **THEN** `DB_SLOW_MAX_CONCURRENT` SHALL default to 3
- **WHEN** the service starts in testing configuration
- **THEN** `DB_SLOW_MAX_CONCURRENT` SHALL remain at 1

View File

@@ -0,0 +1,24 @@
## ADDED Requirements
### Requirement: Trace events endpoint SHALL limit domain concurrency
The `/api/trace/events` endpoint SHALL use `TRACE_EVENTS_MAX_WORKERS` to control how many domains execute concurrently.
#### Scenario: Default domain concurrency
- **WHEN** the events endpoint dispatches domain queries
- **THEN** the default `TRACE_EVENTS_MAX_WORKERS` SHALL be 2 (env: `TRACE_EVENTS_MAX_WORKERS`)
### Requirement: Trace events endpoint SHALL manage memory for large queries
The events endpoint SHALL proactively release memory after processing large CID sets.
#### Scenario: Early release of grouped domain results
- **WHEN** MSD aggregation completes using `raw_domain_results`
- **THEN** the `raw_domain_results` reference SHALL be deleted immediately after aggregation
- **THEN** for non-MSD profiles, `raw_domain_results` SHALL be deleted after result assembly
#### Scenario: Garbage collection for large CID sets
- **WHEN** the events endpoint completes processing and the CID count exceeds 10000
- **THEN** `gc.collect()` SHALL be called to prompt Python garbage collection
#### Scenario: Large CID set skips route-level cache
- **WHEN** the events endpoint completes for a non-MSD profile and CID count exceeds 10000
- **THEN** the route-level events cache write SHALL be skipped

View File

@@ -0,0 +1,24 @@
## 1. Config 調整
- [x] 1.1 `settings.py`: Config base `DB_SLOW_MAX_CONCURRENT` 3→5, DevelopmentConfig 2→3, ProductionConfig 3→5
## 2. 後端 service 遷移
- [x] 2.1 `event_fetcher.py`: import `read_sql_df_slow as read_sql_df` (line 16)
- [x] 2.2 `event_fetcher.py`: `EVENT_FETCHER_MAX_WORKERS` default 4→2 (line 22)
- [x] 2.3 `event_fetcher.py`: `_fetch_batch``timeout_seconds=60` (line 247)
- [x] 2.4 `event_fetcher.py`: 新增 `CACHE_SKIP_CID_THRESHOLD``fetch_events` 大 CID 集跳過 cache + `del grouped`
- [x] 2.5 `lineage_engine.py`: import `read_sql_df_slow as read_sql_df` (line 10)
## 3. Route 層修改
- [x] 3.1 `trace_routes.py`: `TRACE_EVENTS_MAX_WORKERS` default 4→2 (line 39)
- [x] 3.2 `trace_routes.py`: events endpoint 中 `del raw_domain_results` 早期釋放
- [x] 3.3 `trace_routes.py`: 大查詢後 `gc.collect()`
- [x] 3.4 `trace_routes.py`: 大查詢跳過 route-level events cache
## 4. Tests
- [x] 4.1 `test_event_fetcher.py`: 新增 regression test 驗證 slow path import
- [x] 4.2 `test_lineage_engine.py`: 新增 regression test 驗證 slow path import
- [x] 4.3 執行 `pytest tests/ -v` 確認全部通過

View File

@@ -4,13 +4,14 @@
TBD - created by archiving change unified-lineage-engine. Update Purpose after archive.
## Requirements
### Requirement: EventFetcher SHALL provide unified cached event querying across domains
`EventFetcher` SHALL encapsulate batch event queries with L1/L2 layered cache and rate limit bucket configuration, supporting domains: `history`, `materials`, `rejects`, `holds`, `jobs`, `upstream_history`.
`EventFetcher` SHALL encapsulate batch event queries with L1/L2 layered cache and rate limit bucket configuration, supporting domains: `history`, `materials`, `rejects`, `holds`, `jobs`, `upstream_history`, `downstream_rejects`.
#### Scenario: Cache miss for event domain query
- **WHEN** `EventFetcher` is called for a domain with container IDs and no cache exists
- **THEN** the domain query SHALL execute against Oracle via `read_sql_df()`
- **THEN** the result SHALL be stored in L2 Redis cache with key format `evt:{domain}:{sorted_cids_hash}`
- **THEN** L1 memory cache SHALL also be populated (aligned with `core/cache.py` LayeredCache pattern)
- **THEN** the domain query SHALL execute against Oracle via `read_sql_df_slow()` (non-pooled dedicated connection)
- **THEN** each batch query SHALL use `timeout_seconds=60`
- **THEN** the result SHALL be stored in L2 Redis cache with key format `evt:{domain}:{sorted_cids_hash}` if CID count is within cache threshold
- **THEN** L1 memory cache SHALL also be populated if CID count is within cache threshold
#### Scenario: Cache hit for event domain query
- **WHEN** `EventFetcher` is called for a domain and L2 Redis cache contains a valid entry
@@ -22,3 +23,13 @@ TBD - created by archiving change unified-lineage-engine. Update Purpose after a
- **THEN** each domain SHALL have a configurable rate limit bucket aligned with `configured_rate_limit()` pattern
- **THEN** rate limit configuration SHALL be overridable via environment variables
#### Scenario: Large CID set exceeds cache threshold
- **WHEN** the normalized CID count exceeds `CACHE_SKIP_CID_THRESHOLD` (default 10000, env: `EVENT_FETCHER_CACHE_SKIP_CID_THRESHOLD`)
- **THEN** EventFetcher SHALL skip both L1 and L2 cache writes
- **THEN** a warning log SHALL be emitted with domain name, CID count, and threshold value
- **THEN** the query result SHALL still be returned to the caller
#### Scenario: Batch concurrency default
- **WHEN** EventFetcher processes batches for a domain with >1000 CIDs
- **THEN** the default `EVENT_FETCHER_MAX_WORKERS` SHALL be 2 (env: `EVENT_FETCHER_MAX_WORKERS`)

View File

@@ -59,3 +59,15 @@ New SQL files SHALL follow the existing `SQLLoader` convention under `src/mes_da
- **THEN** `split_ancestors.sql` and `merge_sources.sql` SHALL be loaded via `SQLLoader.load_with_params("lineage/split_ancestors", ...)`
- **THEN** the SQL files SHALL NOT reference `HM_LOTMOVEOUT` (48M row table no longer needed for genealogy)
### Requirement: LineageEngine SHALL use non-pooled database connections
All Oracle queries executed by `LineageEngine` SHALL use `read_sql_df_slow()` (dedicated non-pooled connections) instead of `read_sql_df()` (connection pool).
#### Scenario: Lineage query does not consume pool connections
- **WHEN** `LineageEngine` executes split ancestor, merge source, or other Oracle queries
- **THEN** queries SHALL use `read_sql_df_slow()` with the default slow query timeout (300s)
- **THEN** the shared connection pool SHALL NOT be consumed by lineage queries
#### Scenario: Lineage queries respect slow query semaphore
- **WHEN** `LineageEngine` executes queries via `read_sql_df_slow()`
- **THEN** each query SHALL acquire and release a slot from the slow query semaphore (`DB_SLOW_MAX_CONCURRENT`)

View File

@@ -10,6 +10,14 @@ The system SHALL apply database pool and timeout parameters from runtime configu
- **WHEN** operators set pool and timeout values via environment configuration and start the service
- **THEN** the active engine MUST use those values for pool size, overflow, wait timeout, and query call timeout
#### Scenario: Slow query semaphore capacity
- **WHEN** the service starts in production or staging configuration
- **THEN** `DB_SLOW_MAX_CONCURRENT` SHALL default to 5 (env: `DB_SLOW_MAX_CONCURRENT`)
- **WHEN** the service starts in development configuration
- **THEN** `DB_SLOW_MAX_CONCURRENT` SHALL default to 3
- **WHEN** the service starts in testing configuration
- **THEN** `DB_SLOW_MAX_CONCURRENT` SHALL remain at 1
### Requirement: Pool Exhaustion MUST Return Retry-Aware Degraded Responses
The system MUST return explicit degraded responses for connection pool exhaustion and include machine-readable metadata for retry/backoff behavior.

View File

@@ -1,5 +1,6 @@
## ADDED Requirements
## Purpose
Staged trace API for seed-resolve, lineage, and events pipeline with rate limiting, caching, and memory management.
## Requirements
### Requirement: Staged trace API SHALL expose seed-resolve endpoint
`POST /api/trace/seed-resolve` SHALL resolve seed lots based on the provided profile and parameters.
@@ -87,3 +88,27 @@ The existing analysis endpoint (GET method) SHALL internally delegate to the sta
- **THEN** the endpoint SHALL internally execute seed-resolve → lineage → events + aggregation
- **THEN** the response format SHALL be identical to the pre-refactoring output
- **THEN** a golden test SHALL verify output equivalence
### Requirement: Trace events endpoint SHALL limit domain concurrency
The `/api/trace/events` endpoint SHALL use `TRACE_EVENTS_MAX_WORKERS` to control how many domains execute concurrently.
#### Scenario: Default domain concurrency
- **WHEN** the events endpoint dispatches domain queries
- **THEN** the default `TRACE_EVENTS_MAX_WORKERS` SHALL be 2 (env: `TRACE_EVENTS_MAX_WORKERS`)
### Requirement: Trace events endpoint SHALL manage memory for large queries
The events endpoint SHALL proactively release memory after processing large CID sets.
#### Scenario: Early release of grouped domain results
- **WHEN** MSD aggregation completes using `raw_domain_results`
- **THEN** the `raw_domain_results` reference SHALL be deleted immediately after aggregation
- **THEN** for non-MSD profiles, `raw_domain_results` SHALL be deleted after result assembly
#### Scenario: Garbage collection for large CID sets
- **WHEN** the events endpoint completes processing and the CID count exceeds 10000
- **THEN** `gc.collect()` SHALL be called to prompt Python garbage collection
#### Scenario: Large CID set skips route-level cache
- **WHEN** the events endpoint completes for a non-MSD profile and CID count exceeds 10000
- **THEN** the route-level events cache write SHALL be skipped

View File

@@ -53,7 +53,7 @@ class Config:
# Slow-query dedicated connection settings (non-pooled)
DB_SLOW_CALL_TIMEOUT_MS = _int_env("DB_SLOW_CALL_TIMEOUT_MS", 300000) # 300s
DB_SLOW_MAX_CONCURRENT = _int_env("DB_SLOW_MAX_CONCURRENT", 3)
DB_SLOW_MAX_CONCURRENT = _int_env("DB_SLOW_MAX_CONCURRENT", 5)
# Auth configuration - MUST be set in .env file
LDAP_API_URL = os.getenv("LDAP_API_URL", "")
@@ -99,7 +99,7 @@ class DevelopmentConfig(Config):
DB_CONNECT_RETRY_COUNT = _int_env("DB_CONNECT_RETRY_COUNT", 1)
DB_CONNECT_RETRY_DELAY = _float_env("DB_CONNECT_RETRY_DELAY", 1.0)
DB_CALL_TIMEOUT_MS = _int_env("DB_CALL_TIMEOUT_MS", 55000)
DB_SLOW_MAX_CONCURRENT = _int_env("DB_SLOW_MAX_CONCURRENT", 2)
DB_SLOW_MAX_CONCURRENT = _int_env("DB_SLOW_MAX_CONCURRENT", 3)
class ProductionConfig(Config):
@@ -116,7 +116,7 @@ class ProductionConfig(Config):
DB_CONNECT_RETRY_COUNT = _int_env("DB_CONNECT_RETRY_COUNT", 1)
DB_CONNECT_RETRY_DELAY = _float_env("DB_CONNECT_RETRY_DELAY", 1.0)
DB_CALL_TIMEOUT_MS = _int_env("DB_CALL_TIMEOUT_MS", 55000)
DB_SLOW_MAX_CONCURRENT = _int_env("DB_SLOW_MAX_CONCURRENT", 3)
DB_SLOW_MAX_CONCURRENT = _int_env("DB_SLOW_MAX_CONCURRENT", 5)
class TestingConfig(Config):

View File

@@ -11,6 +11,7 @@ from flask import Blueprint, Response, jsonify, request
from mes_dashboard.core.cache import cache_get, cache_set, make_cache_key
from mes_dashboard.core.rate_limit import configured_rate_limit
from mes_dashboard.core.utils import parse_bool_query
from mes_dashboard.services.reject_dataset_cache import (
apply_view,
compute_dimension_pareto,
@@ -69,15 +70,6 @@ def _parse_date_range(required: bool = True) -> tuple[Optional[str], Optional[st
return start_date, end_date, None
def _parse_bool(value: str, *, name: str) -> tuple[Optional[bool], Optional[tuple[dict, int]]]:
normalized = str(value or "").strip().lower()
if normalized in {"", "0", "false", "no", "n", "off"}:
return False, None
if normalized in {"1", "true", "yes", "y", "on"}:
return True, None
return None, ({"success": False, "error": f"Invalid {name}, use true/false"}, 400)
def _parse_multi_param(name: str) -> list[str]:
values = []
for raw in request.args.getlist(name):
@@ -120,32 +112,29 @@ def _extract_meta(
return data, meta
_VALID_BOOL_STRINGS = {"", "0", "false", "no", "n", "off", "1", "true", "yes", "y", "on"}
def _parse_common_bools() -> tuple[Optional[tuple[dict, int]], bool, bool, bool]:
"""Parse include_excluded_scrap, exclude_material_scrap, exclude_pb_diode."""
include_excluded_scrap, err1 = _parse_bool(
for name in ("include_excluded_scrap", "exclude_material_scrap", "exclude_pb_diode"):
raw = str(request.args.get(name, "") or "").strip().lower()
if raw not in _VALID_BOOL_STRINGS:
return ({"success": False, "error": f"Invalid {name}, use true/false"}, 400), False, True, True
include_excluded_scrap = parse_bool_query(
request.args.get("include_excluded_scrap", ""),
name="include_excluded_scrap",
default=False,
)
if err1:
return err1, False, True, True
exclude_material_scrap, err2 = _parse_bool(
exclude_material_scrap = parse_bool_query(
request.args.get("exclude_material_scrap", "true"),
name="exclude_material_scrap",
default=True,
)
if err2:
return err2, False, True, True
exclude_pb_diode, err3 = _parse_bool(
exclude_pb_diode = parse_bool_query(
request.args.get("exclude_pb_diode", "true"),
name="exclude_pb_diode",
)
if err3:
return err3, False, True, True
return (
None,
bool(include_excluded_scrap),
bool(exclude_material_scrap),
bool(exclude_pb_diode),
default=True,
)
return None, include_excluded_scrap, exclude_material_scrap, exclude_pb_diode
@reject_history_bp.route("/api/reject-history/options", methods=["GET"])
@@ -582,7 +571,7 @@ def api_reject_history_export_cached():
headers = [
"LOT", "WORKCENTER", "WORKCENTER_GROUP", "Package", "FUNCTION",
"TYPE", "PRODUCT", "原因", "EQUIPMENT", "COMMENT", "SPEC",
"TYPE", "WORKFLOW", "PRODUCT", "原因", "EQUIPMENT", "COMMENT", "SPEC",
"REJECT_QTY", "STANDBY_QTY", "QTYTOPROCESS_QTY", "INPROCESS_QTY",
"PROCESSED_QTY", "扣帳報廢量", "不扣帳報廢量", "MOVEIN_QTY",
"報廢時間", "日期",

View File

@@ -9,6 +9,7 @@ Provides three stage endpoints for progressive trace execution:
from __future__ import annotations
import gc
import hashlib
import json
import logging
@@ -36,7 +37,7 @@ logger = logging.getLogger("mes_dashboard.trace_routes")
trace_bp = Blueprint("trace", __name__, url_prefix="/api/trace")
TRACE_SLOW_THRESHOLD_SECONDS = float(os.getenv('TRACE_SLOW_THRESHOLD_SECONDS', '15'))
TRACE_EVENTS_MAX_WORKERS = int(os.getenv('TRACE_EVENTS_MAX_WORKERS', '4'))
TRACE_EVENTS_MAX_WORKERS = int(os.getenv('TRACE_EVENTS_MAX_WORKERS', '2'))
TRACE_CACHE_TTL_SECONDS = 300
PROFILE_QUERY_TOOL = "query_tool"
@@ -668,9 +669,12 @@ def events():
aggregation = None
if profile == PROFILE_MID_SECTION_DEFECT:
aggregation, agg_error = _build_msd_aggregation(payload, raw_domain_results)
del raw_domain_results
if agg_error is not None:
code, message, status = agg_error
return _error(code, message, status)
else:
del raw_domain_results
response: Dict[str, Any] = {
"stage": "events",
@@ -683,6 +687,10 @@ def events():
response["code"] = "EVENTS_PARTIAL_FAILURE"
response["failed_domains"] = sorted(failed_domains)
if not is_msd:
if not is_msd and len(container_ids) <= 10000:
cache_set(events_cache_key, response, ttl=TRACE_CACHE_TTL_SECONDS)
if len(container_ids) > 10000:
gc.collect()
return jsonify(response)

View File

@@ -13,13 +13,14 @@ from concurrent.futures import ThreadPoolExecutor, as_completed
from typing import Any, Dict, List
from mes_dashboard.core.cache import cache_get, cache_set
from mes_dashboard.core.database import read_sql_df
from mes_dashboard.core.database import read_sql_df_slow as read_sql_df
from mes_dashboard.sql import QueryBuilder, SQLLoader
logger = logging.getLogger("mes_dashboard.event_fetcher")
ORACLE_IN_BATCH_SIZE = 1000
EVENT_FETCHER_MAX_WORKERS = int(os.getenv('EVENT_FETCHER_MAX_WORKERS', '4'))
EVENT_FETCHER_MAX_WORKERS = int(os.getenv('EVENT_FETCHER_MAX_WORKERS', '2'))
CACHE_SKIP_CID_THRESHOLD = int(os.getenv('EVENT_FETCHER_CACHE_SKIP_CID_THRESHOLD', '10000'))
_DOMAIN_SPECS: Dict[str, Dict[str, Any]] = {
"history": {
@@ -244,7 +245,7 @@ class EventFetcher:
else:
builder.add_in_condition(filter_column, batch_ids)
sql = EventFetcher._build_domain_sql(domain, builder.get_conditions_sql())
return batch_ids, read_sql_df(sql, builder.params)
return batch_ids, read_sql_df(sql, builder.params, timeout_seconds=60)
def _sanitize_record(d):
"""Replace NaN/NaT values with None for JSON-safe serialization."""
@@ -296,7 +297,15 @@ class EventFetcher:
)
result = dict(grouped)
cache_set(cache_key, result, ttl=_DOMAIN_SPECS[domain]["cache_ttl"])
del grouped
if len(normalized_ids) <= CACHE_SKIP_CID_THRESHOLD:
cache_set(cache_key, result, ttl=_DOMAIN_SPECS[domain]["cache_ttl"])
else:
logger.warning(
"EventFetcher skipping cache domain=%s cid_count=%s (threshold=%s)",
domain, len(normalized_ids), CACHE_SKIP_CID_THRESHOLD,
)
logger.info(
"EventFetcher fetched domain=%s queried_cids=%s hit_cids=%s",
domain,

View File

@@ -7,7 +7,7 @@ import logging
from collections import defaultdict
from typing import Any, Dict, List, Optional, Set, Tuple
from mes_dashboard.core.database import read_sql_df
from mes_dashboard.core.database import read_sql_df_slow as read_sql_df
from mes_dashboard.sql import QueryBuilder, SQLLoader
logger = logging.getLogger("mes_dashboard.lineage_engine")

View File

@@ -858,6 +858,7 @@ def export_csv(
"Package",
"FUNCTION",
"TYPE",
"WORKFLOW",
"PRODUCT",
"原因",
"EQUIPMENT",

View File

@@ -1,45 +1,45 @@
# -*- coding: utf-8 -*-
"""Pytest configuration and fixtures for MES Dashboard tests."""
import pytest
import sys
import os
# Add the src directory to Python path
sys.path.insert(0, os.path.join(os.path.dirname(__file__), '..', 'src'))
_PROJECT_ROOT = os.path.abspath(os.path.join(os.path.dirname(__file__), '..'))
_TMP_DIR = os.path.join(_PROJECT_ROOT, 'tmp')
# Test baseline env: keep pytest isolated from local runtime/.env side effects.
os.environ.setdefault('FLASK_ENV', 'testing')
os.environ.setdefault('REDIS_ENABLED', 'false')
os.environ.setdefault('RUNTIME_CONTRACT_ENFORCE', 'false')
os.environ.setdefault('SLOW_QUERY_THRESHOLD', '1.0')
os.environ.setdefault('WATCHDOG_RUNTIME_DIR', _TMP_DIR)
os.environ.setdefault('WATCHDOG_RESTART_FLAG', os.path.join(_TMP_DIR, 'mes_dashboard_restart.flag'))
os.environ.setdefault('WATCHDOG_PID_FILE', os.path.join(_TMP_DIR, 'gunicorn.pid'))
os.environ.setdefault('WATCHDOG_STATE_FILE', os.path.join(_TMP_DIR, 'mes_dashboard_restart_state.json'))
import mes_dashboard.core.database as db
from mes_dashboard.app import create_app
from mes_dashboard.core.modernization_policy import clear_modernization_policy_cache
import pytest
import sys
import os
# Add the src directory to Python path
sys.path.insert(0, os.path.join(os.path.dirname(__file__), '..', 'src'))
_PROJECT_ROOT = os.path.abspath(os.path.join(os.path.dirname(__file__), '..'))
_TMP_DIR = os.path.join(_PROJECT_ROOT, 'tmp')
# Test baseline env: keep pytest isolated from local runtime/.env side effects.
os.environ.setdefault('FLASK_ENV', 'testing')
os.environ.setdefault('REDIS_ENABLED', 'false')
os.environ.setdefault('RUNTIME_CONTRACT_ENFORCE', 'false')
os.environ.setdefault('SLOW_QUERY_THRESHOLD', '1.0')
os.environ.setdefault('WATCHDOG_RUNTIME_DIR', _TMP_DIR)
os.environ.setdefault('WATCHDOG_RESTART_FLAG', os.path.join(_TMP_DIR, 'mes_dashboard_restart.flag'))
os.environ.setdefault('WATCHDOG_PID_FILE', os.path.join(_TMP_DIR, 'gunicorn.pid'))
os.environ.setdefault('WATCHDOG_STATE_FILE', os.path.join(_TMP_DIR, 'mes_dashboard_restart_state.json'))
import mes_dashboard.core.database as db
from mes_dashboard.app import create_app
from mes_dashboard.core.modernization_policy import clear_modernization_policy_cache
@pytest.fixture
def app():
@pytest.fixture
def app():
"""Create application for testing."""
db._ENGINE = None
app = create_app('testing')
app.config['TESTING'] = True
return app
@pytest.fixture(autouse=True)
def _reset_modernization_policy_cache():
"""Keep policy-cache state isolated across tests."""
clear_modernization_policy_cache()
yield
clear_modernization_policy_cache()
app.config['TESTING'] = True
return app
@pytest.fixture(autouse=True)
def _reset_modernization_policy_cache():
"""Keep policy-cache state isolated across tests."""
clear_modernization_policy_cache()
yield
clear_modernization_policy_cache()
@pytest.fixture
@@ -65,6 +65,12 @@ def pytest_configure(config):
config.addinivalue_line(
"markers", "redis: mark test as requiring Redis connection"
)
config.addinivalue_line(
"markers", "stress: mark test as stress/load test (requires --run-stress)"
)
config.addinivalue_line(
"markers", "load: mark test as load test (requires --run-stress)"
)
def pytest_addoption(parser):
@@ -81,6 +87,12 @@ def pytest_addoption(parser):
default=False,
help="Run end-to-end tests that require running server"
)
parser.addoption(
"--run-stress",
action="store_true",
default=False,
help="Run stress/load tests that require running server"
)
def pytest_collection_modifyitems(config, items):
@@ -88,11 +100,16 @@ def pytest_collection_modifyitems(config, items):
run_integration = config.getoption("--run-integration")
run_e2e = config.getoption("--run-e2e")
run_stress = config.getoption("--run-stress")
skip_integration = pytest.mark.skip(reason="need --run-integration option to run")
skip_e2e = pytest.mark.skip(reason="need --run-e2e option to run")
skip_stress = pytest.mark.skip(reason="need --run-stress option to run")
for item in items:
if "integration" in item.keywords and not run_integration:
item.add_marker(skip_integration)
if "e2e" in item.keywords and not run_e2e:
item.add_marker(skip_e2e)
if ("stress" in item.keywords or "load" in item.keywords) and not run_stress:
item.add_marker(skip_stress)

View File

@@ -6,10 +6,17 @@ from __future__ import annotations
import json
from pathlib import Path
import pytest
from mes_dashboard.app import create_app
ROOT = Path(__file__).resolve().parents[1]
BASELINE_DIR = ROOT / "docs" / "migration" / "portal-shell-route-view-integration"
pytestmark = pytest.mark.skipif(
not BASELINE_DIR.exists(),
reason=f"Migration baseline directory missing: {BASELINE_DIR}",
)
BASELINE_VISIBILITY_FILE = BASELINE_DIR / "baseline_drawer_visibility.json"
BASELINE_API_FILE = BASELINE_DIR / "baseline_api_payload_contracts.json"
GATE_REPORT_FILE = BASELINE_DIR / "cutover-gates-report.json"

View File

@@ -160,3 +160,11 @@ def test_fetch_events_holds_branch_replaces_aliased_container_filter(
assert "h.CONTAINERID = :container_id" not in sql
assert "IN" in sql.upper()
assert params == {"p0": "CID-1", "p1": "CID-2"}
def test_event_fetcher_uses_slow_connection():
"""Regression: event_fetcher must use read_sql_df_slow (non-pooled)."""
import mes_dashboard.services.event_fetcher as ef
from mes_dashboard.core.database import read_sql_df_slow
assert ef.read_sql_df is read_sql_df_slow

View File

@@ -309,3 +309,11 @@ def test_resolve_full_genealogy_includes_semantic_edges(
edge_types = {edge["edge_type"] for edge in result["edges"]}
assert "wafer_origin" in edge_types
assert "gd_rework_source" in edge_types
def test_lineage_engine_uses_slow_connection():
"""Regression: lineage_engine must use read_sql_df_slow (non-pooled)."""
import mes_dashboard.services.lineage_engine as le
from mes_dashboard.core.database import read_sql_df_slow
assert le.read_sql_df is read_sql_df_slow

View File

@@ -387,5 +387,5 @@ class TestPerformancePage:
html = response.data.decode('utf-8', errors='ignore')
data_str = html.lower()
assert 'performance' in data_str or '效能' in data_str
assert '/static/js/chart.umd.min.js' in html
assert '/static/dist/admin-performance.js' in html
assert 'cdn.jsdelivr.net' not in html

View File

@@ -334,7 +334,7 @@ def test_wave_b_native_routes_are_reachable(monkeypatch):
client = app.test_client()
_login_as_admin(client)
for route in ["/job-query", "/excel-query", "/query-tool", "/tmtt-defect"]:
for route in ["/job-query", "/excel-query", "/query-tool"]:
response = client.get(route)
assert response.status_code == 200, f"{route} should be reachable"
@@ -350,7 +350,6 @@ def test_direct_entry_in_scope_report_routes_redirect_to_canonical_shell_when_sp
"/wip-overview?status=queue": "/portal-shell/wip-overview?status=queue",
"/resource-history?granularity=day": "/portal-shell/resource-history?granularity=day",
"/job-query?start_date=2026-02-01&end_date=2026-02-02": "/portal-shell/job-query?start_date=2026-02-01&end_date=2026-02-02",
"/tmtt-defect?start_date=2026-02-01&end_date=2026-02-02": "/portal-shell/tmtt-defect?start_date=2026-02-01&end_date=2026-02-02",
"/hold-detail?reason=YieldLimit": "/portal-shell/hold-detail?reason=YieldLimit",
}

View File

@@ -282,7 +282,7 @@ def test_reject_history_native_smoke_query_sections_and_export(client):
},
),
patch(
"mes_dashboard.routes.reject_history_routes.query_reason_pareto",
"mes_dashboard.routes.reject_history_routes.query_dimension_pareto",
return_value={
"items": [
{

View File

@@ -374,6 +374,7 @@ def test_export_csv_contains_semantic_headers(monkeypatch):
)
payload = "".join(chunks)
assert "REJECT_TOTAL_QTY" in payload
assert "DEFECT_QTY" in payload
assert "扣帳報廢量" in payload
assert "不扣帳報廢量" in payload
assert "WORKFLOW" in payload
assert "2026-02-03" in payload

View File

@@ -6,9 +6,15 @@ from __future__ import annotations
import json
from pathlib import Path
import pytest
ROOT = Path(__file__).resolve().parents[1]
BASELINE_ROUTE_QUERY_FILE = (
ROOT / "docs" / "migration" / "portal-shell-route-view-integration" / "baseline_route_query_contracts.json"
BASELINE_DIR = ROOT / "docs" / "migration" / "portal-shell-route-view-integration"
BASELINE_ROUTE_QUERY_FILE = BASELINE_DIR / "baseline_route_query_contracts.json"
pytestmark = pytest.mark.skipif(
not BASELINE_DIR.exists(),
reason=f"Migration baseline directory missing: {BASELINE_DIR}",
)

View File

@@ -7,6 +7,8 @@ import json
import copy
from pathlib import Path
import pytest
from mes_dashboard.app import create_app
from mes_dashboard.services.navigation_contract import (
compute_drawer_visibility,
@@ -19,6 +21,11 @@ ROOT = Path(__file__).resolve().parent.parent
PAGE_STATUS_FILE = ROOT / "data" / "page_status.json"
BASELINE_DIR = ROOT / "docs" / "migration" / "portal-shell-route-view-integration"
pytestmark = pytest.mark.skipif(
not BASELINE_DIR.exists(),
reason=f"Migration baseline directory missing: {BASELINE_DIR}",
)
BASELINE_VISIBILITY_FILE = BASELINE_DIR / "baseline_drawer_visibility.json"
BASELINE_ROUTE_QUERY_FILE = BASELINE_DIR / "baseline_route_query_contracts.json"
BASELINE_INTERACTION_FILE = BASELINE_DIR / "baseline_interaction_evidence.json"

View File

@@ -368,6 +368,7 @@ class TestViteModuleIntegration(unittest.TestCase):
'/tables': '/portal-shell/tables',
'/excel-query': '/portal-shell/excel-query',
'/query-tool': '/portal-shell/query-tool',
'/reject-history': '/portal-shell/reject-history',
}
for endpoint, asset in endpoints_and_assets:

View File

@@ -7,9 +7,15 @@ import hashlib
import json
from pathlib import Path
import pytest
ROOT = Path(__file__).resolve().parents[1]
SNAPSHOT_FILE = (
ROOT / "docs" / "migration" / "portal-shell-route-view-integration" / "visual-regression-snapshots.json"
BASELINE_DIR = ROOT / "docs" / "migration" / "portal-shell-route-view-integration"
SNAPSHOT_FILE = BASELINE_DIR / "visual-regression-snapshots.json"
pytestmark = pytest.mark.skipif(
not BASELINE_DIR.exists(),
reason=f"Migration baseline directory missing: {BASELINE_DIR}",
)

View File

@@ -80,6 +80,8 @@ def test_wip_overview_and_detail_status_parameter_contract(client):
hold_type=None,
package=None,
pj_type="PJA3460",
firstname=None,
waferdesc=None,
)
mock_detail.assert_called_once_with(
workcenter="TMTT",
@@ -89,6 +91,8 @@ def test_wip_overview_and_detail_status_parameter_contract(client):
hold_type=None,
workorder=None,
lotid=None,
firstname=None,
waferdesc=None,
include_dummy=False,
page=1,
page_size=100,