chore: finalize vite migration hardening and archive openspec changes

This commit is contained in:
beabigegg
2026-02-08 20:03:36 +08:00
parent b56e80381b
commit c8e225101e
119 changed files with 6547 additions and 1301 deletions

README.md
View File

@@ -26,11 +26,60 @@
| Worker restart control | ✅ Done |
| Runtime resilience diagnostics (threshold/churn/recommendation) | ✅ Done |
| Shared WIP autocomplete core module | ✅ Done |
| Shared WIP derive core module (KPI/filter/chart/table) | ✅ Done |
| WIP indexed query acceleration and incremental sync | ✅ Done |
| Cache memory amplification telemetry | ✅ Done |
| Cache benchmark gate (P95/memory thresholds) | ✅ Done |
| Worker guarded mode + manual override audit | ✅ Done |
| Runtime contract startup validation (conda/systemd/watchdog) | ✅ Done |
| Frontend core module tests (Node test) | ✅ Done |
| Deployment automation | ✅ Done |
---
## Development History (post-Vite refactor)
- 2026-02-07: Completed the switch to the single-port Flask + Vite architecture; the legacy `DashBoard/` was retired.
- 2026-02-08: Filled out runtime resilience governance (threshold/churn/recommendation) and watchdog observability fields.
- 2026-02-08: Completed P0 security/stability hardening:
  - Startup fails fast when the production `SECRET_KEY` is missing
  - CSRF protection for the admin form and admin mutation APIs
  - Health probes use a dedicated DB pool, so they cannot block (or be blocked by) the main query pool
  - Worker/app shutdown uniformly cleans up the cache updater, realtime sync, Redis, and the DB engine
  - `hold_detail` inline script variables are now serialized with `tojson`
- 2026-02-08: Completed the P1 cache/query efficiency refactor:
  - The WIP query path uses indexed selection while preserving the `resource`/`wip` full-table cache semantics
  - Incremental WIP search index sync (watermark/version) with a drift fallback
  - health/admin gained cache memory amplification telemetry
  - Added `scripts/run_cache_benchmarks.py` + a fixture gate
- 2026-02-08: Completed P2 ops self-healing governance:
  - Shared runtime contract (app/start_server/watchdog/systemd)
  - Fail-fast at startup on conda/watchdog path drift
  - Worker restart policy (cooldown / retry budget / churn guarded mode)
  - Manual override (requires ack + reason) with a structured audit log
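The override flow above can be sketched roughly as follows. Only the payload field names (`manual_override`, `override_acknowledged`, `override_reason`) come from the actual API; the function name, logger name, and policy-state strings are illustrative assumptions, not the project's real implementation:

```python
import json
import logging
import time

logger = logging.getLogger("worker.audit")  # hypothetical audit logger name

def authorize_restart(policy_state: str, payload: dict, actor: str) -> bool:
    """Allow a restart when policy permits, or when a fully acknowledged override is given."""
    if policy_state == "allowed":
        return True
    # Guarded/blocked state: require an explicit, acknowledged, justified override.
    if not (payload.get("manual_override")
            and payload.get("override_acknowledged")
            and str(payload.get("override_reason", "")).strip()):
        return False
    # Structured audit record for every override that bypasses the policy.
    logger.warning(json.dumps({
        "event": "worker_restart_manual_override",
        "actor": actor,
        "policy_state": policy_state,
        "reason": payload["override_reason"],
        "ts": time.time(),
    }))
    return True
```

The key property is that a bare `manual_override: true` is never enough on its own — the acknowledgement and reason are mandatory, so every forced restart leaves an attributable audit trail.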
- 2026-02-08: Completed round-2 security/stability reinforcement:
  - Strict LDAP endpoint validation (`https` + `LDAP_ALLOWED_HOSTS`)
  - Process-level caches gained `max_size` + LRU eviction (WIP/Resource)
  - Circuit breaker transition logging moved outside the lock to reduce lock contention
  - Global security headers (CSP/XFO/nosniff/Referrer-Policy); HSTS added in production
  - WIP detail pagination bounds (`page>=1`, `1<=page_size<=500`)
- 2026-02-08: Completed round-3 residual risk fixes:
  - WIP cache publish uses staged publish; a failed refresh no longer pollutes the previous snapshot
  - WIP slow-path parsing moved outside the lock; the realtime equipment process cache gained a bounded LRU
  - Resource NaN cleanup switched to depth-safe iteration; WIP/Hold boolean query parsing was unified
  - Filter cache view names are now env-configurable
  - `/health` and `/health/deep` gained a 5-second internal memo (disabled in testing mode)
  - Lightweight rate limits added to high-cost APIs (WIP detail/matrix, Hold lots, Resource status/detail)
  - DB connection string log redaction masks passwords
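The in-process rate limit mentioned above can be approximated with a sliding window. This is a minimal sketch under stated assumptions — the class name and `allow()` API are illustrative, not the project's actual code; it only shows the counting scheme the env vars (`*_RATE_LIMIT_MAX_REQUESTS` / `*_RATE_LIMIT_WINDOW_SECONDS`) imply:

```python
import threading
import time
from collections import deque

class SlidingWindowRateLimiter:
    """Allow at most max_requests per window_seconds, per process."""

    def __init__(self, max_requests, window_seconds):
        self.max_requests = max_requests
        self.window_seconds = window_seconds
        self._hits = deque()          # timestamps of accepted requests
        self._lock = threading.Lock()

    def allow(self, now=None):
        now = time.monotonic() if now is None else now
        with self._lock:
            # Drop timestamps that fell out of the window.
            while self._hits and now - self._hits[0] >= self.window_seconds:
                self._hits.popleft()
            if len(self._hits) >= self.max_requests:
                return False  # caller should respond with the consistent 429 structure
            self._hits.append(now)
            return True
```

Because the state is per-process, limits apply independently to each Gunicorn worker — a deliberate trade-off that avoids shared-state coordination for what is only a lightweight guard.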
- 2026-02-08: Completed round-4 residual governance consolidation:
  - The Resource derived index uses a row-position representation, removing the in-process copy of full records
  - Resource / Realtime Equipment share Oracle SQL fragments, reducing query-definition drift
  - Converged type annotations and high-frequency constant naming in `resource_cache` / `realtime_equipment_cache`
  - `page_registry` file writes use atomic replace (tmp + rename) to avoid half-written config files
  - New test guards: shared SQL fragments, index normalization, and no duplicate route bool-parser definitions
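The tmp + rename pattern used for `page_registry` can be sketched like this (function name and JSON payload are illustrative; the actual module may differ). `os.replace()` is atomic on POSIX, so a reader always sees either the complete old file or the complete new one:

```python
import json
import os
import tempfile

def atomic_write_json(path, data):
    """Write JSON to a temp file in the target directory, then rename over the target."""
    directory = os.path.dirname(os.path.abspath(path))
    fd, tmp_path = tempfile.mkstemp(dir=directory, suffix=".tmp")
    try:
        with os.fdopen(fd, "w", encoding="utf-8") as fh:
            json.dump(data, fh, ensure_ascii=False, indent=2)
            fh.flush()
            os.fsync(fh.fileno())  # make sure bytes hit disk before the rename
        os.replace(tmp_path, path)  # atomic swap; never exposes a partial file
    except BaseException:
        os.unlink(tmp_path)  # clean up the temp file on any failure
        raise
```

The temp file must live in the same directory as the target: `os.replace()` across filesystems is not atomic (and may fail outright).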
---
## Migration and Acceptance Documents
- Root cutover inventory: `docs/root_cutover_inventory.md`
@@ -46,20 +95,37 @@
1. The single-port contract is unchanged
   - Flask + Gunicorn + Vite dist are served by one service (`GUNICORN_BIND`); frontend and backend stay same-origin.
2. Runtime resilience follows "degradation + actionable recommendations + policy state"
   - `/health`, `/health/deep`, `/admin/api/system-status`, and `/admin/api/worker/status` all provide:
     - thresholds
     - policy state (`allowed` / `cooldown` / `blocked`)
     - restart churn summary
     - alerts (pool/circuit/churn)
     - recovery recommendation (suggested on-call actions)
3. The watchdog self-healing policy is bounded
   - The restart flow incorporates a cooldown, a retry budget, and a churn window.
   - When churn exceeds the threshold, guarded mode takes over; further restarts require an admin manual override.
   - The state file keeps a bounded restart history for policy decisions and auditing.
4. Frontend governance (shared WIP compute)
   - `frontend/src/core/autocomplete.js` is the shared logic source for WIP overview/detail.
   - `frontend/src/core/wip-derive.js` shares the KPI/filter/chart/table derivation logic.
   - Existing page flows and drill-down semantics are preserved; user workflows are unchanged.
5. P1 cache efficiency governance
   - The `resource` and `wip` full-table cache strategy is retained (business constraint unchanged).
   - Queries use indexed selection, with memory amplification / index efficiency telemetry.
   - A benchmark gate verifies P95 latency and memory amplification stay within thresholds.
6. P0 runtime hardening (security + stability)
   - Production must provide `SECRET_KEY`; the service refuses to start without it.
   - Mutating requests to `/admin/login` and `/admin/api/*` must carry a CSRF token.
   - The `/health` database connectivity probe uses a dedicated health pool, reducing false alarms when the main pool saturates.
   - Shutdown/restart uniformly releases background workers and Redis/DB connections.
   - The LDAP API URL is validated at startup: `https` only, plus a host allowlist.
   - Global security headers (CSP/X-Frame-Options/X-Content-Type-Options/Referrer-Policy); HSTS in production.
---
## Quick Start
@@ -175,6 +241,12 @@ DB_MAX_OVERFLOW=20
DB_POOL_TIMEOUT=30
DB_POOL_RECYCLE=1800
DB_CALL_TIMEOUT_MS=55000
DB_POOL_EXHAUSTED_RETRY_AFTER_SECONDS=5
# Dedicated DB pool for health probes (isolated from the main request pool)
DB_HEALTH_POOL_SIZE=1
DB_HEALTH_MAX_OVERFLOW=0
DB_HEALTH_POOL_TIMEOUT=2
# Circuit Breaker
CIRCUIT_BREAKER_ENABLED=true
@@ -192,6 +264,17 @@ WATCHDOG_RESTART_FLAG=./tmp/mes_dashboard_restart.flag
WATCHDOG_PID_FILE=./tmp/gunicorn.pid
WATCHDOG_STATE_FILE=./tmp/mes_dashboard_restart_state.json
WATCHDOG_RESTART_HISTORY_MAX=50
CONDA_BIN=/opt/miniconda3/bin/conda
CONDA_ENV_NAME=mes-dashboard
RUNTIME_CONTRACT_VERSION=2026.02-p2
RUNTIME_CONTRACT_ENFORCE=true
# Worker self-healing policy
WORKER_RESTART_COOLDOWN=60
WORKER_RESTART_RETRY_BUDGET=3
WORKER_RESTART_WINDOW_SECONDS=600
WORKER_RESTART_CHURN_THRESHOLD=3
WORKER_GUARDED_MODE_ENABLED=true
# Runtime resilience thresholds
RESILIENCE_DEGRADED_ALERT_SECONDS=300
@@ -202,6 +285,36 @@ RESILIENCE_RESTART_CHURN_THRESHOLD=3
# Administrator settings
ADMIN_EMAILS=admin@example.com  # administrator emails (comma-separated)
LDAP_API_URL=https://ldap-api.example.com
LDAP_ALLOWED_HOSTS=ldap-api.example.com,ldap-api-dr.example.com
# CSRF protection (admin form / admin mutation APIs)
CSRF_ENABLED=true
# Process-level cache bounded LRU (WIP/Resource)
PROCESS_CACHE_MAX_SIZE=32
WIP_PROCESS_CACHE_MAX_SIZE=32
RESOURCE_PROCESS_CACHE_MAX_SIZE=32
EQUIPMENT_PROCESS_CACHE_MAX_SIZE=32
# Filter cache source views (env-overridable)
FILTER_CACHE_WIP_VIEW=DWH.DW_MES_LOT_V
FILTER_CACHE_SPEC_WORKCENTER_VIEW=DWH.DW_MES_SPEC_WORKCENTER_V
# Health internal memoization
HEALTH_MEMO_TTL_SECONDS=5
# High-cost API rate limit (in-process)
WIP_MATRIX_RATE_LIMIT_MAX_REQUESTS=120
WIP_MATRIX_RATE_LIMIT_WINDOW_SECONDS=60
WIP_DETAIL_RATE_LIMIT_MAX_REQUESTS=90
WIP_DETAIL_RATE_LIMIT_WINDOW_SECONDS=60
HOLD_LOTS_RATE_LIMIT_MAX_REQUESTS=90
HOLD_LOTS_RATE_LIMIT_WINDOW_SECONDS=60
RESOURCE_DETAIL_RATE_LIMIT_MAX_REQUESTS=60
RESOURCE_DETAIL_RATE_LIMIT_WINDOW_SECONDS=60
RESOURCE_STATUS_RATE_LIMIT_MAX_REQUESTS=90
RESOURCE_STATUS_RATE_LIMIT_WINDOW_SECONDS=60
```
### Production Notes
@@ -226,6 +339,7 @@ sudo cp deploy/mes-dashboard-watchdog.service /etc/systemd/system/
# 2. Prepare the environment config file
sudo mkdir -p /etc/mes-dashboard
sudo cp deploy/mes-dashboard.env.example /etc/mes-dashboard/mes-dashboard.env
sudo cp .env /etc/mes-dashboard/mes-dashboard.env
# 3. Reload systemd
@@ -239,6 +353,12 @@ sudo systemctl status mes-dashboard
sudo systemctl status mes-dashboard-watchdog
```
Run the runtime contract validation:
```bash
RUNTIME_CONTRACT_ENFORCE=true ./scripts/start_server.sh check
```
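Conceptually, the contract check validates that the paths the app, watchdog, and systemd units agree on actually exist before anything starts. A rough sketch, assuming a dict-based interface — the function name and error wording are illustrative, and only the env variable names come from the document:

```python
import os

def check_runtime_contract(env):
    """Return actionable drift errors; an empty list means the contract holds."""
    errors = []
    conda_bin = env.get("CONDA_BIN", "")
    if not os.path.isfile(conda_bin):
        errors.append(f"CONDA_BIN does not exist: {conda_bin!r}")
    for key in ("WATCHDOG_RESTART_FLAG", "WATCHDOG_PID_FILE", "WATCHDOG_STATE_FILE"):
        value = env.get(key, "")
        if not value:
            errors.append(f"{key} is not set")
        elif not os.path.isdir(os.path.dirname(value)):
            errors.append(f"parent directory missing for {key}: {value!r}")
    if env.get("RUNTIME_CONTRACT_ENFORCE") == "true" and errors:
        # Enforced mode: fail fast instead of starting with a drifted contract.
        raise SystemExit("runtime contract drift: " + "; ".join(errors))
    return errors
```

With `RUNTIME_CONTRACT_ENFORCE` unset, the same check can run in report-only mode, which is useful during rehearsals before flipping enforcement on.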
### Rollback Steps
To roll back to a previous version:
@@ -494,7 +614,8 @@ DashBoard_vite/
│   └── worker_watchdog.py              # Worker monitor
├── deploy/                             # Deployment configs
│   ├── mes-dashboard.service           # Gunicorn systemd service (Conda)
│   ├── mes-dashboard-watchdog.service  # Watchdog systemd service (Conda)
│   └── mes-dashboard.env.example       # Runtime contract env template
├── tests/                              # Tests
├── data/                               # Data files
├── logs/                               # Logs
@@ -524,6 +645,9 @@ pytest tests/e2e/ -v
# Run stress tests
pytest tests/stress/ -v
# Cache benchmark gate (P1)
conda run -n mes-dashboard python scripts/run_cache_benchmarks.py --enforce
```
---
@@ -569,12 +693,17 @@ pytest tests/stress/ -v
### 2026-02-08
- Completed and archived the proposal `post-migration-resilience-governance`
- Completed and archived the proposal `p1-cache-query-efficiency`
- Completed and archived the proposal `p2-ops-self-healing-runbook`
- Added the runtime resilience diagnostics core (thresholds / restart churn / recovery recommendation)
- Added worker restart policy state (allowed/cooldown/blocked) and the guarded-mode override flow
- Health and admin APIs gained actionable resilience fields:
  - `/health`, `/health/deep`
  - `/admin/api/system-status`, `/admin/api/worker/status`
- Watchdog restart state supports bounded history (`WATCHDOG_RESTART_HISTORY_MAX`)
- WIP overview/detail now share the autocomplete/filter module (`frontend/src/core/autocomplete.js`)
- WIP overview/detail adopted the shared derive module (`frontend/src/core/wip-derive.js`)
- Added cache benchmark fixtures and baseline-vs-indexed threshold validation
- Added the frontend Node test flow (`npm --prefix frontend test`)
- Updated `README.mdj` and the migration runbook to align with the gates
@@ -654,5 +783,5 @@ pytest tests/stress/ -v
---
**Document version**: 4.2
**Last updated**: 2026-02-08

View File

@@ -1,61 +1,151 @@
# MES Dashboard (README.mdj)
This document is a condensed technical mirror of `README.md`, focused on the currently runnable architecture and the operations contract.
## 1. Architecture Summary (2026-02-08)
- Backend: Flask + Gunicorn (single port)
- Frontend: Vite build output to `src/mes_dashboard/static/dist`
- Cache: Redis + process-level cache + indexed selection telemetry
- Data: Oracle (QueuePool)
- Ops: watchdog + admin worker restart API + guarded-mode policy
## 2. Existing Design Principles (retained)
- `resource` (equipment master data) and `wip` (live line status) keep the full-table cache strategy.
- Frontend page logic and drill-down semantics stay unchanged.
- The system remains a single-port service (frontend and backend same-origin).
## 3. P0 Runtime Hardening (done)
- Production enforces `SECRET_KEY`: startup fails immediately when it is unset or uses an unsafe default.
- CSRF protection:
  - The `/admin/login` form requires a token
  - `POST/PUT/PATCH/DELETE` requests to `/admin/api/*` require `X-CSRF-Token`
- Session hardening: `session.clear()` on successful login + CSRF token rotation.
- Health probe isolation: the `/health` DB connectivity check uses a dedicated health pool.
- Shutdown cleanup: uniformly stops the cache updater and equipment sync worker, and closes the Redis and DB engines.
- XSS hardening: `reason` in the `hold_detail` fallback script now uses `tojson`.
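The CSRF rule above boils down to a single check on mutating methods. A framework-free sketch under stated assumptions — `csrf_ok` and its parameters are illustrative, not the project's actual helper; only the method list and the `X-CSRF-Token` header come from the document:

```python
import hmac

MUTATING_METHODS = {"POST", "PUT", "PATCH", "DELETE"}

def csrf_ok(method, session_token, header_token):
    """Reject mutating admin requests whose X-CSRF-Token does not match the session token."""
    if method.upper() not in MUTATING_METHODS:
        return True  # safe methods carry no state change and need no token
    if not session_token or not header_token:
        return False
    # Constant-time comparison avoids leaking token prefixes via timing.
    return hmac.compare_digest(session_token, header_token)
```

In a Flask app this would typically run in a `before_request` hook scoped to the admin blueprint, returning 403 when the check fails.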
## 4. P1 Cache/Query Efficiency (done)
- `resource` / `wip` keep the full-table cache strategy (business constraint unchanged).
- WIP queries use indexed selection, plus incremental sync (watermark/version) with a drift fallback.
- `/health`, `/health/deep`, and `/admin/api/system-status` expose cache memory amplification/index telemetry.
- New benchmark harness: `scripts/run_cache_benchmarks.py --enforce`.
## 5. P2 Ops Self-Healing (done)
- Shared runtime contract: app/start_server/watchdog/systemd use the same watchdog/conda path contract.
- Startup fail-fast: refuses to start on conda/runtime path drift and prints actionable diagnostics.
- Worker restart policy: cooldown + retry budget + churn guarded mode.
- Manual override: requires admin identity + `manual_override` + `override_acknowledged` + `override_reason`, and writes an audit log.
- Health/admin payloads expose the policy state (`allowed` / `cooldown` / `blocked`).
## 6. Round-3 Residual Hardening (done)
- WIP cache publish switched to staged publish; a failed update does not overwrite the previous snapshot.
- WIP process cache slow-path parsing moved outside the lock, reducing lock contention.
- The realtime equipment process cache gained a bounded LRU (with `EQUIPMENT_PROCESS_CACHE_MAX_SIZE`).
- `_clean_nan_values` switched to depth-safe iterative cleanup (avoiding deep-recursion risk).
- The WIP/Hold/Resource bool query parser was unified (`core/utils.py`).
- Filter cache source views can be overridden via env (easing environment switches and testing).
- `/health` and `/health/deep` gained a 5-second memo (auto-disabled in testing mode).
- High-cost APIs gained a lightweight in-process rate limit; over-limit requests return a consistent 429 structure.
- DB connection string logging masks sensitive fields (password redaction).
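The bounded LRU mentioned above (gated by the `*_PROCESS_CACHE_MAX_SIZE` variables) can be built on `OrderedDict`. A minimal sketch — the class name and API are illustrative, not the project's actual cache module:

```python
from collections import OrderedDict
from threading import Lock

class BoundedLRUCache:
    """Process-level cache with a hard max_size; least recently used entries are evicted."""

    def __init__(self, max_size=32):
        self.max_size = max_size
        self._data = OrderedDict()
        self._lock = Lock()

    def get(self, key, default=None):
        with self._lock:
            if key not in self._data:
                return default
            self._data.move_to_end(key)  # mark as most recently used
            return self._data[key]

    def put(self, key, value):
        with self._lock:
            self._data[key] = value
            self._data.move_to_end(key)
            while len(self._data) > self.max_size:
                self._data.popitem(last=False)  # evict the least recently used entry
```

The bound is what prevents the memory amplification that unbounded per-process caches accumulate over long worker lifetimes.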
## 7. Round-4 Residual Consolidation (done)
- The Resource derived index uses a row-position representation; no copy of the full records is kept in the process.
- Resource / Realtime Equipment share Oracle SQL fragments, avoiding duplicated query-definition drift.
- Converged the type annotation style and high-frequency constant naming in `resource_cache` / `realtime_equipment_cache`.
- `page_registry` file writes use atomic replace, lowering the risk of half-written config files.
- New tests cover shared SQL fragments and the no-duplicate bool-parser governance.
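The row-position idea above replaces per-key copies of records with per-key lists of integer positions into one shared records list. A sketch with hypothetical names (`build_row_position_index`/`lookup` and the field names are illustrative):

```python
from collections import defaultdict

def build_row_position_index(records, key_field):
    """Map each key value to the positions of matching rows, instead of copying rows."""
    index = defaultdict(list)
    for pos, record in enumerate(records):
        index[record.get(key_field)].append(pos)
    return dict(index)

def lookup(records, index, key):
    """Resolve positions back against the single shared records list."""
    return [records[pos] for pos in index.get(key, [])]
```

Memory cost drops from O(rows × copies) to one shared list plus small integer lists, and every lookup still sees the same record objects the cache snapshot holds.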
## 8. Key Environment Variables
```bash
FLASK_ENV=production
SECRET_KEY=<required-in-production>
CSRF_ENABLED=true
LDAP_API_URL=https://ldap-api.example.com
LDAP_ALLOWED_HOSTS=ldap-api.example.com,ldap-api-dr.example.com
DB_POOL_SIZE=10
DB_MAX_OVERFLOW=20
DB_POOL_TIMEOUT=30
DB_POOL_RECYCLE=1800
DB_CALL_TIMEOUT_MS=55000
DB_POOL_EXHAUSTED_RETRY_AFTER_SECONDS=5
DB_HEALTH_POOL_SIZE=1
DB_HEALTH_MAX_OVERFLOW=0
DB_HEALTH_POOL_TIMEOUT=2
CONDA_BIN=/opt/miniconda3/bin/conda
CONDA_ENV_NAME=mes-dashboard
RUNTIME_CONTRACT_VERSION=2026.02-p2
RUNTIME_CONTRACT_ENFORCE=true
WATCHDOG_RUNTIME_DIR=./tmp
WATCHDOG_RESTART_FLAG=./tmp/mes_dashboard_restart.flag
WATCHDOG_PID_FILE=./tmp/gunicorn.pid
WATCHDOG_STATE_FILE=./tmp/mes_dashboard_restart_state.json
WATCHDOG_RESTART_HISTORY_MAX=50
WORKER_RESTART_COOLDOWN=60
WORKER_RESTART_RETRY_BUDGET=3
WORKER_RESTART_WINDOW_SECONDS=600
WORKER_RESTART_CHURN_THRESHOLD=3
WORKER_GUARDED_MODE_ENABLED=true
PROCESS_CACHE_MAX_SIZE=32
WIP_PROCESS_CACHE_MAX_SIZE=32
RESOURCE_PROCESS_CACHE_MAX_SIZE=32
EQUIPMENT_PROCESS_CACHE_MAX_SIZE=32
FILTER_CACHE_WIP_VIEW=DWH.DW_MES_LOT_V
FILTER_CACHE_SPEC_WORKCENTER_VIEW=DWH.DW_MES_SPEC_WORKCENTER_V
HEALTH_MEMO_TTL_SECONDS=5
WIP_MATRIX_RATE_LIMIT_MAX_REQUESTS=120
WIP_MATRIX_RATE_LIMIT_WINDOW_SECONDS=60
WIP_DETAIL_RATE_LIMIT_MAX_REQUESTS=90
WIP_DETAIL_RATE_LIMIT_WINDOW_SECONDS=60
HOLD_LOTS_RATE_LIMIT_MAX_REQUESTS=90
HOLD_LOTS_RATE_LIMIT_WINDOW_SECONDS=60
RESOURCE_DETAIL_RATE_LIMIT_MAX_REQUESTS=60
RESOURCE_DETAIL_RATE_LIMIT_WINDOW_SECONDS=60
RESOURCE_STATUS_RATE_LIMIT_MAX_REQUESTS=90
RESOURCE_STATUS_RATE_LIMIT_WINDOW_SECONDS=60
```
## 9. Validation Commands (recommended)
```bash
# Backend (conda)
conda run -n mes-dashboard python -m pytest -q tests/test_runtime_hardening.py
# Frontend
npm --prefix frontend test
npm --prefix frontend run build
# P1 benchmark gate
conda run -n mes-dashboard python scripts/run_cache_benchmarks.py --enforce
# P2 runtime contract check
RUNTIME_CONTRACT_ENFORCE=true ./scripts/start_server.sh check
```
## 10. Development History (Vite project)
- 2026-02-07: Completed the Vite root-directory refactor and removal of the legacy version.
- 2026-02-08: Completed resilience diagnostics governance and frontend shared modules.
- 2026-02-08: Completed P0 security/stability hardening (this update).
- 2026-02-08: Completed the P1 cache/query efficiency refactor (index + benchmark gate).
- 2026-02-08: Completed P2 ops self-healing governance (guarded mode + manual override + runtime contract).
- 2026-02-08: Completed round-2 hardening (LDAP URL validation, bounded LRU cache, circuit-breaker logging outside the lock, security headers, pagination bounds).
- 2026-02-08: Completed round-3 residual hardening (staged publish, health memo, API rate limits, DB redaction, env-driven filter views).
- 2026-02-08: Completed round-4 residual consolidation (normalized resource index representation, shared SQL fragments, type/constant governance, atomic page-status writes).

View File

@@ -18,6 +18,13 @@ Environment="WATCHDOG_RESTART_FLAG=/run/mes-dashboard/mes_dashboard_restart.flag
Environment="WATCHDOG_PID_FILE=/run/mes-dashboard/gunicorn.pid"
Environment="WATCHDOG_STATE_FILE=/var/lib/mes-dashboard/restart_state.json"
Environment="WATCHDOG_CHECK_INTERVAL=5"
Environment="RUNTIME_CONTRACT_VERSION=2026.02-p2"
Environment="RUNTIME_CONTRACT_ENFORCE=true"
Environment="WORKER_RESTART_COOLDOWN=60"
Environment="WORKER_RESTART_RETRY_BUDGET=3"
Environment="WORKER_RESTART_WINDOW_SECONDS=600"
Environment="WORKER_RESTART_CHURN_THRESHOLD=3"
Environment="WORKER_GUARDED_MODE_ENABLED=true"
RuntimeDirectory=mes-dashboard
StateDirectory=mes-dashboard

View File

@@ -0,0 +1,26 @@
# MES Dashboard runtime contract (version 2026.02-p2)
# Conda runtime
CONDA_BIN=/opt/miniconda3/bin/conda
CONDA_ENV_NAME=mes-dashboard
# Single-port serving contract
GUNICORN_BIND=0.0.0.0:8080
# Watchdog/runtime paths
WATCHDOG_RUNTIME_DIR=/run/mes-dashboard
WATCHDOG_RESTART_FLAG=/run/mes-dashboard/mes_dashboard_restart.flag
WATCHDOG_PID_FILE=/run/mes-dashboard/gunicorn.pid
WATCHDOG_STATE_FILE=/var/lib/mes-dashboard/restart_state.json
WATCHDOG_CHECK_INTERVAL=5
# Runtime contract enforcement
RUNTIME_CONTRACT_VERSION=2026.02-p2
RUNTIME_CONTRACT_ENFORCE=true
# Worker recovery policy
WORKER_RESTART_COOLDOWN=60
WORKER_RESTART_RETRY_BUDGET=3
WORKER_RESTART_WINDOW_SECONDS=600
WORKER_RESTART_CHURN_THRESHOLD=3
WORKER_GUARDED_MODE_ENABLED=true

View File

@@ -18,6 +18,13 @@ Environment="WATCHDOG_RUNTIME_DIR=/run/mes-dashboard"
Environment="WATCHDOG_RESTART_FLAG=/run/mes-dashboard/mes_dashboard_restart.flag"
Environment="WATCHDOG_PID_FILE=/run/mes-dashboard/gunicorn.pid"
Environment="WATCHDOG_STATE_FILE=/var/lib/mes-dashboard/restart_state.json"
Environment="RUNTIME_CONTRACT_VERSION=2026.02-p2"
Environment="RUNTIME_CONTRACT_ENFORCE=true"
Environment="WORKER_RESTART_COOLDOWN=60"
Environment="WORKER_RESTART_RETRY_BUDGET=3"
Environment="WORKER_RESTART_WINDOW_SECONDS=600"
Environment="WORKER_RESTART_CHURN_THRESHOLD=3"
Environment="WORKER_GUARDED_MODE_ENABLED=true"
RuntimeDirectory=mes-dashboard
StateDirectory=mes-dashboard

View File

@@ -26,10 +26,12 @@ A release is cutover-ready only when all gates pass:
- pool exhaustion path returns `503` + `DB_POOL_EXHAUSTED` and `Retry-After`
- circuit-open path returns `503` + `CIRCUIT_BREAKER_OPEN` and fail-fast semantics
- frontend client does not aggressively retry on degraded pool exhaustion responses
- health/admin payloads expose worker policy state (`allowed`/`cooldown`/`blocked`) and alert booleans
6. Conda-systemd contract gate
- `deploy/mes-dashboard.service` and `deploy/mes-dashboard-watchdog.service` both run in the same conda runtime contract
- `WATCHDOG_RESTART_FLAG`, `WATCHDOG_PID_FILE`, `WATCHDOG_STATE_FILE` paths are consistent across app/admin/watchdog
- startup contract validation passes: `RUNTIME_CONTRACT_ENFORCE=true ./scripts/start_server.sh check`
- single-port bind (`GUNICORN_BIND`) remains stable during restart workflow - single-port bind (`GUNICORN_BIND`) remains stable during restart workflow
7. Regression gate
@@ -60,7 +62,8 @@ A release is cutover-ready only when all gates pass:
5. Conda + systemd rehearsal (recommended before production cutover)
- `sudo cp deploy/mes-dashboard.service /etc/systemd/system/`
- `sudo cp deploy/mes-dashboard-watchdog.service /etc/systemd/system/`
- `sudo mkdir -p /etc/mes-dashboard && sudo cp deploy/mes-dashboard.env.example /etc/mes-dashboard/mes-dashboard.env`
- merge deployment secrets from `.env` into `/etc/mes-dashboard/mes-dashboard.env`
- `sudo systemctl daemon-reload`
- `sudo systemctl enable --now mes-dashboard mes-dashboard-watchdog`
- call `/admin/api/worker/status` and verify runtime contract paths exist
@@ -69,6 +72,7 @@ A release is cutover-ready only when all gates pass:
- call `/health` and `/health/deep`
- confirm route cache mode, degraded flags, and pool/runtime diagnostics align with environment (Redis on/off)
- trigger one controlled worker restart from admin API and verify single-port continuity
- verify guarded mode flow: blocked restart requires manual override payload (`manual_override`, `override_acknowledged`, `override_reason`)
- verify README architecture section matches deployed runtime contract
## Rollback Procedure
@@ -111,3 +115,6 @@ Use these initial thresholds for alerting/escalation:
4. Frontend/API retry pressure
- significant increase of client retries for `DB_POOL_EXHAUSTED` or `CIRCUIT_BREAKER_OPEN` responses over baseline
5. Recovery policy blocked
- `resilience.policy_state.blocked == true` or `resilience.alerts.restart_blocked == true`

View File

@@ -1,5 +1,21 @@
const DEFAULT_TIMEOUT = 30000;
function getCsrfToken() {
  return document.querySelector('meta[name="csrf-token"]')?.content || '';
}
function withCsrfHeaders(headers = {}, method = 'GET') {
  const normalized = String(method).toUpperCase();
  const merged = { ...headers };
  if (['POST', 'PUT', 'PATCH', 'DELETE'].includes(normalized)) {
    const csrf = getCsrfToken();
    if (csrf && !merged['X-CSRF-Token']) {
      merged['X-CSRF-Token'] = csrf;
    }
  }
  return merged;
}
function buildApiError(response, payload) {
  const message =
    payload?.error?.message ||
@@ -47,15 +63,19 @@ export async function apiGet(url, options = {}) {
export async function apiPost(url, payload, options = {}) {
  if (window.MesApi?.post) {
    const enrichedOptions = {
      ...options,
      headers: withCsrfHeaders(options.headers || {}, 'POST')
    };
    return window.MesApi.post(url, payload, enrichedOptions);
  }
  return fetchJson(url, {
    ...options,
    method: 'POST',
    headers: withCsrfHeaders({
      'Content-Type': 'application/json',
      ...(options.headers || {})
    }, 'POST'),
    body: JSON.stringify(payload)
  });
}
@@ -64,6 +84,7 @@ export async function apiUpload(url, formData, options = {}) {
  return fetchJson(url, {
    ...options,
    method: 'POST',
    headers: withCsrfHeaders(options.headers || {}, 'POST'),
    body: formData
  });
}

View File

@@ -0,0 +1,75 @@
function toTrimmedString(value) {
  if (value === null || value === undefined) {
    return '';
  }
  return String(value).trim();
}
export function normalizeStatusFilter(statusFilter) {
  if (!statusFilter) {
    return {};
  }
  if (statusFilter === 'quality-hold') {
    return { status: 'HOLD', hold_type: 'quality' };
  }
  if (statusFilter === 'non-quality-hold') {
    return { status: 'HOLD', hold_type: 'non-quality' };
  }
  return { status: String(statusFilter).toUpperCase() };
}
export function buildWipOverviewQueryParams(filters = {}, statusFilter = null) {
  const params = {};
  const workorder = toTrimmedString(filters.workorder);
  const lotid = toTrimmedString(filters.lotid);
  const pkg = toTrimmedString(filters.package);
  const type = toTrimmedString(filters.type);
  if (workorder) params.workorder = workorder;
  if (lotid) params.lotid = lotid;
  if (pkg) params.package = pkg;
  if (type) params.type = type;
  return { ...params, ...normalizeStatusFilter(statusFilter) };
}
export function buildWipDetailQueryParams({
  page,
  pageSize,
  filters = {},
  statusFilter = null,
}) {
  return {
    page,
    page_size: pageSize,
    ...buildWipOverviewQueryParams(filters, statusFilter),
  };
}
export function splitHoldByType(data) {
  const items = Array.isArray(data?.items) ? data.items : [];
  const quality = items.filter((item) => item?.holdType === 'quality');
  const nonQuality = items.filter((item) => item?.holdType !== 'quality');
  return { quality, nonQuality };
}
export function prepareParetoData(items) {
  if (!Array.isArray(items) || items.length === 0) {
    return { reasons: [], qtys: [], lots: [], cumulative: [], totalQty: 0, items: [] };
  }
  const sorted = [...items].sort((a, b) => (Number(b?.qty) || 0) - (Number(a?.qty) || 0));
  const reasons = sorted.map((item) => toTrimmedString(item?.reason) || '未知');
  const qtys = sorted.map((item) => Number(item?.qty) || 0);
  const lots = sorted.map((item) => Number(item?.lots) || 0);
  const totalQty = qtys.reduce((sum, value) => sum + value, 0);
  let running = 0;
  const cumulative = qtys.map((qty) => {
    running += qty;
    if (totalQty <= 0) return 0;
    return Math.round((running / totalQty) * 100);
  });
  return { reasons, qtys, lots, cumulative, totalQty, items: sorted };
}

View File

@@ -3,6 +3,7 @@ import {
  debounce,
  fetchWipAutocompleteItems,
} from '../core/autocomplete.js';
import { buildWipDetailQueryParams } from '../core/wip-derive.js';
ensureMesApiAvailable();
@@ -73,36 +74,12 @@ ensureMesApiAvailable();
}
async function fetchDetail(signal = null) {
  const params = buildWipDetailQueryParams({
    page: state.page,
    pageSize: state.pageSize,
    filters: state.filters,
    statusFilter: activeStatusFilter,
  });
  const result = await MesApi.get(`/api/wip/detail/${encodeURIComponent(state.workcenter)}`, {
    params,

View File

@@ -3,6 +3,11 @@ import {
  debounce,
  fetchWipAutocompleteItems,
} from '../core/autocomplete.js';
import {
  buildWipOverviewQueryParams,
  splitHoldByType as splitHoldByTypeShared,
  prepareParetoData as prepareParetoDataShared,
} from '../core/wip-derive.js';
ensureMesApiAvailable();
@@ -61,20 +66,7 @@ ensureMesApiAvailable();
}
function buildQueryParams() {
  return buildWipOverviewQueryParams(state.filters);
}
// ============================================================
@@ -96,19 +88,7 @@ ensureMesApiAvailable();
}
async function fetchMatrix(signal = null) {
  const params = buildWipOverviewQueryParams(state.filters, activeStatusFilter);
  const result = await MesApi.get('/api/wip/overview/matrix', {
    params,
    timeout: API_TIMEOUT,
@@ -467,37 +447,12 @@ ensureMesApiAvailable();
// Task 2.1: Split hold data by type
function splitHoldByType(data) {
  return splitHoldByTypeShared(data);
}
// Task 2.2: Prepare Pareto data (sort by QTY desc, calculate cumulative %)
function prepareParetoData(items) {
  return prepareParetoDataShared(items);
}
// Task 3.1: Initialize Pareto charts
@@ -0,0 +1,80 @@
import test from 'node:test';
import assert from 'node:assert/strict';
import {
buildWipOverviewQueryParams,
buildWipDetailQueryParams,
splitHoldByType,
prepareParetoData,
} from '../src/core/wip-derive.js';
test('buildWipOverviewQueryParams keeps only non-empty filters', () => {
const params = buildWipOverviewQueryParams({
workorder: ' WO-1 ',
lotid: '',
package: 'PKG-A',
type: 'QFN',
});
assert.deepEqual(params, {
workorder: 'WO-1',
package: 'PKG-A',
type: 'QFN',
});
});
test('buildWipOverviewQueryParams maps quality hold status filter', () => {
const params = buildWipOverviewQueryParams({}, 'quality-hold');
assert.deepEqual(params, {
status: 'HOLD',
hold_type: 'quality',
});
});
test('buildWipDetailQueryParams uses page/page_size and shared filter mapper', () => {
const params = buildWipDetailQueryParams({
page: 2,
pageSize: 100,
filters: {
workorder: 'WO',
lotid: 'LOT',
package: '',
type: 'TSOP',
},
statusFilter: 'run',
});
assert.deepEqual(params, {
page: 2,
page_size: 100,
workorder: 'WO',
lotid: 'LOT',
type: 'TSOP',
status: 'RUN',
});
});
test('splitHoldByType partitions quality/non-quality correctly', () => {
const grouped = splitHoldByType({
items: [
{ reason: 'Q1', holdType: 'quality' },
{ reason: 'NQ1', holdType: 'non-quality' },
{ reason: 'NQ2' },
],
});
assert.equal(grouped.quality.length, 1);
assert.equal(grouped.nonQuality.length, 2);
});
test('prepareParetoData sorts by qty and builds cumulative percentages', () => {
const data = prepareParetoData([
{ reason: 'B', qty: 20, lots: 1 },
{ reason: 'A', qty: 80, lots: 2 },
]);
assert.deepEqual(data.reasons, ['A', 'B']);
assert.deepEqual(data.qtys, [80, 20]);
assert.deepEqual(data.cumulative, [80, 100]);
assert.equal(data.totalQty, 100);
});
@@ -1,12 +1,12 @@
import { defineConfig } from 'vite'; import { defineConfig } from 'vite';
import { resolve } from 'node:path'; import { resolve } from 'node:path';
export default defineConfig({ export default defineConfig(({ mode }) => ({
publicDir: false, publicDir: false,
build: { build: {
outDir: '../src/mes_dashboard/static/dist', outDir: '../src/mes_dashboard/static/dist',
emptyOutDir: false, emptyOutDir: false,
sourcemap: false, sourcemap: mode !== 'production',
rollupOptions: { rollupOptions: {
input: { input: {
portal: resolve(__dirname, 'src/portal/main.js'), portal: resolve(__dirname, 'src/portal/main.js'),
@@ -22,8 +22,17 @@ export default defineConfig({
      output: {
        entryFileNames: '[name].js',
        chunkFileNames: 'chunks/[name]-[hash].js',
        assetFileNames: '[name][extname]',
        manualChunks(id) {
          if (!id.includes('node_modules')) {
            return;
          }
          if (id.includes('echarts')) {
            return 'vendor-echarts';
          }
          return 'vendor';
        }
      }
    }
  }
}));
@@ -30,6 +30,18 @@ def worker_exit(server, worker):
    except Exception as e:
        server.log.warning(f"Error stopping equipment sync worker: {e}")

    try:
        from mes_dashboard.core.cache_updater import stop_cache_updater
        stop_cache_updater()
    except Exception as e:
        server.log.warning(f"Error stopping cache updater: {e}")

    try:
        from mes_dashboard.core.redis_client import close_redis
        close_redis()
    except Exception as e:
        server.log.warning(f"Error closing redis client: {e}")

    # Then dispose database connections
    try:
        from mes_dashboard.core.database import dispose_engine
@@ -0,0 +1,2 @@
schema: spec-driven
created: 2026-02-08
@@ -0,0 +1,46 @@
## Context
The current architecture already supports single-port Gunicorn runtime, circuit-breaker-aware degraded responses, and watchdog-assisted recovery. However, critical security and lifecycle controls are uneven: production startup can still fall back to a weak secret key, CSRF is not enforced globally, and background resources are not fully registered in a single shutdown lifecycle. These gaps are operationally risky when pool pressure or restart churn occurs.
## Goals / Non-Goals
**Goals:**
- Make production startup fail fast when required security secrets are missing.
- Enforce CSRF validation for all state-changing endpoints without breaking existing frontend flow.
- Make worker/app shutdown deterministic by stopping all background workers and shared clients.
- Keep degraded responses for pool exhaustion and circuit-open states stable and retry-aware.
- Isolate health probe connectivity from main request pool contention.
**Non-Goals:**
- Replacing LDAP provider or redesigning the full authentication architecture.
- Full CSP rollout across all templates in this change.
- Changing URL structure, page IA, or single-port deployment topology.
## Decisions
1. **Production secret-key guard at startup**
- Decision: enforce `SECRET_KEY` presence/strength in non-development modes and abort startup when invalid.
- Rationale: prevents silent insecure deployment.
2. **Unified CSRF contract across form + JSON flows**
- Decision: issue CSRF token from server session, validate hidden form field for HTML forms and `X-CSRF-Token` for JSON POST/PUT/PATCH/DELETE.
- Rationale: maintains current frontend behavior while covering non-form APIs.
3. **Centralized shutdown registry**
- Decision: register explicit shutdown hooks that call cache updater stop, realtime sync stop, Redis close, and DB dispose in bounded order.
- Rationale: avoids thread/client leaks during worker recycle and controlled reload.
4. **Health probe pool isolation**
- Decision: use a dedicated lightweight DB health engine/pool for `/health` checks.
- Rationale: prevents health endpoint from being blocked by request-pool exhaustion, improving observability fidelity.
5. **Template-safe JS serialization**
- Decision: replace HTML-escaped interpolation in JS string contexts with `tojson` serialization.
- Rationale: avoids context-mismatch injection edge cases.
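Decision 5 can be illustrated outside the template layer. The standalone helper below mirrors what Jinja's `tojson` filter does for HTML `<script>` contexts; the variable names and sample value are illustrative, not taken from `hold_detail.html`:

```python
import json

def to_js_literal(value):
    """Serialize a server-side value for an inline <script> context.
    Mirrors Jinja's `tojson` behaviour: JSON-encode, then escape the
    characters that could close the surrounding script block early."""
    return (
        json.dumps(value)
        .replace("<", "\\u003c")
        .replace(">", "\\u003e")
        .replace("&", "\\u0026")
    )

# A hold reason that would break naive '{{ value }}' interpolation:
reason = '</script><script>alert(1)//'
snippet = f"const holdReason = {to_js_literal(reason)};"
print(snippet)
```

The rendered snippet contains no literal `</script>` sequence, so the attacker-controlled string cannot escape the script context.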
## Risks / Trade-offs
- **[Risk] CSRF rollout may break undocumented clients** → **Mitigation:** provide opt-in transition flag and explicit error messaging during rollout.
- **[Risk] Strict startup secret validation can block misconfigured environments** → **Mitigation:** provide clear startup diagnostics and `.env.example` updates.
- **[Risk] Additional shutdown hooks can prolong worker exit** → **Mitigation:** bounded timeouts and idempotent stop handlers.
- **[Risk] Dedicated health pool introduces extra DB connections** → **Mitigation:** fixed minimal size and short timeout.
@@ -0,0 +1,40 @@
## Why
The Vite migration is functionally complete, but production runtime still has high-risk gaps in security baseline and worker lifecycle cleanup. Addressing these now prevents avoidable outages, authentication bypass risk, and unstable degradation behavior under pool pressure.
## What Changes
- Enforce production-safe startup security defaults (no weak SECRET_KEY fallback in non-development environments).
- Add first-class CSRF protection for admin forms and state-changing JSON APIs.
- Harden degradation behavior for pool exhaustion with consistent retry/backoff contract and isolated health probing.
- Ensure background workers and shared clients (cache updater, realtime sync, Redis) are explicitly stopped on worker/app shutdown.
- Fix template-to-JavaScript variable serialization in hold-detail fallback script.
## Capabilities
### New Capabilities
- `security-baseline-hardening`: Define mandatory secret/session/CSRF/XSS-safe baseline for production runtime.
### Modified Capabilities
- `runtime-resilience-recovery`: Strengthen shutdown lifecycle and degraded-response behavior for pool pressure scenarios.
## Impact
- Affected code:
- `src/mes_dashboard/app.py`
- `src/mes_dashboard/core/database.py`
- `src/mes_dashboard/core/cache_updater.py`
- `src/mes_dashboard/core/redis_client.py`
- `src/mes_dashboard/routes/health_routes.py`
- `src/mes_dashboard/routes/auth_routes.py`
- `src/mes_dashboard/templates/hold_detail.html`
- `gunicorn.conf.py`
- `tests/`
- APIs:
- `/health`
- `/health/deep`
- `/admin/login`
- state-changing `/api/*` endpoints
- Operational behavior:
- Keep single-port deployment model unchanged.
- Improve degraded-state stability and startup safety gates.
@@ -0,0 +1,24 @@
## MODIFIED Requirements
### Requirement: Pool Exhaustion MUST Return Retry-Aware Degraded Responses
The system MUST return explicit degraded responses for connection pool exhaustion, including stable machine-readable retry metadata and HTTP retry hints.
#### Scenario: Pool exhausted under load
- **WHEN** concurrent requests exceed available database connections and pool wait timeout is reached
- **THEN** the API MUST return `DB_POOL_EXHAUSTED` with `retry_after_seconds` metadata and a `Retry-After` header instead of a generic 500 failure
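A minimal sketch of this retry-aware degraded contract. The `DB_POOL_EXHAUSTED` code, `retry_after_seconds` field, and `Retry-After` header come from the requirement above; the surrounding response shape and the 503 status are illustrative assumptions, not the project's exact schema:

```python
def pool_exhausted_response(retry_after_seconds=5):
    """Build the degraded response for pool-wait timeout instead of a 500."""
    body = {
        "error": "DB_POOL_EXHAUSTED",
        "retry_after_seconds": retry_after_seconds,  # machine-readable hint
    }
    headers = {"Retry-After": str(retry_after_seconds)}  # HTTP retry hint
    status = 503  # Service Unavailable, not a generic failure
    return body, status, headers

body, status, headers = pool_exhausted_response()
print(status, headers["Retry-After"], body["error"])
```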
## ADDED Requirements
### Requirement: Runtime Shutdown SHALL Cleanly Stop Background Services
Worker/app shutdown MUST stop long-lived background services and shared clients in deterministic order.
#### Scenario: Worker exits during recycle or graceful reload
- **WHEN** Gunicorn worker shutdown hooks are triggered
- **THEN** cache updater, realtime equipment sync worker, Redis client, and DB engine resources MUST be stopped/disposed without orphan threads
### Requirement: Health Probing SHALL Remain Available During Request-Pool Saturation
Health checks MUST avoid depending solely on the same request pool used by business APIs.
#### Scenario: Request pool saturation
- **WHEN** the main database request pool is exhausted
- **THEN** `/health` and `/health/deep` MUST still provide timely degraded status using isolated probe connectivity
@@ -0,0 +1,29 @@
## ADDED Requirements
### Requirement: Production Startup SHALL Reject Weak Session Secrets
The system MUST reject startup in non-development environments when `SECRET_KEY` is missing or configured with known insecure default values.
#### Scenario: Missing production secret key
- **WHEN** runtime starts with `FLASK_ENV` not equal to `development` and no secure secret key is configured
- **THEN** application startup MUST fail fast with an explicit configuration error
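The fail-fast guard can be sketched as below; the insecure-default list and the 32-character minimum are illustrative policy choices, not the project's exact thresholds:

```python
import os

# Illustrative deny-list of known insecure defaults.
_INSECURE_DEFAULTS = {"", "dev", "changeme", "secret", "dev-secret-key"}

def validate_secret_key(env=None):
    """Fail fast when a non-development runtime lacks a secure SECRET_KEY."""
    env = os.environ if env is None else env
    if env.get("FLASK_ENV", "production") == "development":
        return  # development mode may use a generated key
    key = env.get("SECRET_KEY", "")
    if key in _INSECURE_DEFAULTS or len(key) < 32:
        raise SystemExit(
            "FATAL: SECRET_KEY missing or insecure; refusing to start. "
            "Set a random value of at least 32 characters."
        )
```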
### Requirement: State-Changing Endpoints SHALL Enforce CSRF Validation
All state-changing endpoints that rely on cookie-based authentication MUST enforce CSRF token validation.
#### Scenario: Missing or invalid CSRF token
- **WHEN** a POST/PUT/PATCH/DELETE request is sent without a valid CSRF token
- **THEN** the server MUST reject the request with a client error and MUST NOT execute the mutation
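The dual form-field / `X-CSRF-Token` header contract from the design can be sketched framework-free; the function names and session shape are illustrative:

```python
import hmac
import secrets

MUTATING_METHODS = {"POST", "PUT", "PATCH", "DELETE"}

def issue_csrf_token(session):
    """Store a per-session CSRF token, creating it on first use."""
    if "csrf_token" not in session:
        session["csrf_token"] = secrets.token_urlsafe(32)
    return session["csrf_token"]

def csrf_ok(session, method, form=None, headers=None):
    """Accept either the hidden form field or the X-CSRF-Token header;
    reject every mutating request without a matching token."""
    if method not in MUTATING_METHODS:
        return True
    expected = session.get("csrf_token")
    supplied = (form or {}).get("csrf_token") or (headers or {}).get("X-CSRF-Token")
    # Constant-time comparison avoids token-guessing via timing.
    return bool(expected and supplied and hmac.compare_digest(expected, supplied))
```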
### Requirement: Server-Rendered Values in JavaScript Context MUST Use Safe Serialization
Values inserted into inline JavaScript from templates MUST be serialized for JavaScript context safety.
#### Scenario: Hold reason rendered in fallback inline script
- **WHEN** server-side string values are embedded into script state payloads
- **THEN** template rendering MUST use JSON-safe serialization semantics to prevent script-context injection
### Requirement: Session Establishment SHALL Mitigate Fixation Risk
Successful admin login MUST rotate session identity material before granting authenticated privileges.
#### Scenario: Admin login success
- **WHEN** credentials are validated and admin session is created
- **THEN** session identity MUST be regenerated before storing authenticated user attributes
@@ -0,0 +1,18 @@
## 1. Runtime Stability Hardening
- [x] 1.1 Add startup validation for `SECRET_KEY` and environment-aware secure defaults.
- [x] 1.2 Register centralized shutdown hooks to stop cache updater, realtime sync worker, Redis client, and DB engine.
- [x] 1.3 Isolate database health probing from request pool and keep degraded signal contract stable.
- [x] 1.4 Normalize pool-exhausted response metadata and retry headers across API error paths.
## 2. Security Baseline Enforcement
- [x] 2.1 Add CSRF token issuance/validation for form posts and JSON mutation endpoints.
- [x] 2.2 Update login flow to rotate session identity on successful authentication.
- [x] 2.3 Replace JS-context template interpolation in `hold_detail.html` with JSON-safe serialization.
## 3. Verification and Documentation
- [x] 3.1 Add tests for startup secret guard, CSRF rejection, and session-rotation behavior.
- [x] 3.2 Add lifecycle tests/validation for shutdown cleanup and health endpoint behavior under pool saturation.
- [x] 3.3 Update README/README.mdj runtime hardening sections and operator rollout notes.
@@ -0,0 +1,2 @@
schema: spec-driven
created: 2026-02-08
@@ -0,0 +1,46 @@
## Context
The migration delivered feature parity, but efficiency work remains: backend query paths still do broad copies and whole-frame recomputation even when only slices are needed. At the same time, business constraints explicitly require full-table caching for `resource` and `wip` because those datasets are intentionally small and frequently reused. This design optimizes around that constraint rather than removing it.
## Goals / Non-Goals
**Goals:**
- Keep `resource` and `wip` full-table caches intact.
- Reduce memory amplification from redundant cache representations.
- Replace repeated full merge/rebuild paths with incremental/indexed query plans where applicable.
- Increase reuse of browser-side compute modules for chart/table/filter/KPI derivations.
- Add measurable telemetry to verify latency and memory improvements.
**Non-Goals:**
- Rewriting all reporting endpoints to client-only mode.
- Removing Redis or existing layered cache strategy.
- Changing user-visible filter semantics or report outputs.
## Decisions
1. **Constrained cache strategy**
- Decision: retain full-table snapshots for `resource` and `wip`; optimize surrounding representations and derivation paths.
- Rationale: business-approved data-size profile and low complexity for frequent lookups.
2. **Incremental + indexed path for heavy derived datasets**
- Decision: add watermark/version-aware incremental refresh and per-column indexes for high-cardinality filters.
- Rationale: avoids repeated full recompute and lowers request tail latency.
3. **Canonical in-process structure**
- Decision: keep one canonical structure per cache domain and derive alternate views on demand.
- Rationale: reduces 2x/3x memory amplification from parallel representations.
4. **Frontend compute module expansion**
- Decision: extract reusable browser compute helpers for matrix/table/KPI transformations used across report pages.
- Rationale: shifts deterministic shaping work off backend and improves component reuse in Vite architecture.
5. **Benchmark-driven acceptance**
- Decision: add repeatable benchmark fixtures and telemetry thresholds as merge gates.
- Rationale: prevent subjective "performance improved" claims without measurable proof.
## Risks / Trade-offs
- **[Risk] Incremental sync correctness drift** → **Mitigation:** version checksum validation and periodic full reconciliation jobs.
- **[Risk] Browser compute can increase client CPU on low-end devices** → **Mitigation:** bounded dataset chunking and fallback server aggregation path.
- **[Risk] Refactor introduces subtle field-contract regressions** → **Mitigation:** keep export/header contract tests and fixture comparisons.
- **[Risk] Telemetry overhead** → **Mitigation:** low-cost counters/histograms with sampling where needed.
@@ -0,0 +1,36 @@
## Why
Current reporting workloads still spend unnecessary CPU and memory on repeated full-data merges, broad DataFrame copies, and duplicated cache representations. We need a focused efficiency phase that preserves the intentional full-table cache strategy for `resource` and `wip`, while reducing cost for other query paths and increasing frontend compute reuse.
## What Changes
- Introduce indexed/incremental cache synchronization for heavy report datasets that do not require full-table snapshots.
- Keep `resource` and `wip` as full-table cache by design, but reduce redundant in-process representations and copy overhead.
- Move additional derived calculations (chart/table/KPI/filter shaping) to reusable browser modules in Vite frontend.
- Add cache/query efficiency telemetry and repeatable benchmark gates to validate gains.
## Capabilities
### New Capabilities
- `cache-indexed-query-acceleration`: Define incremental refresh and indexed query contracts for non-full-snapshot datasets.
### Modified Capabilities
- `cache-observability-hardening`: Add memory-efficiency and cache-structure telemetry expectations.
- `frontend-compute-shift`: Expand browser-side reusable compute coverage for report interactions.
## Impact
- Affected code:
- `src/mes_dashboard/core/cache.py`
- `src/mes_dashboard/services/resource_cache.py`
- `src/mes_dashboard/services/realtime_equipment_cache.py`
- `src/mes_dashboard/services/wip_service.py`
- `src/mes_dashboard/routes/health_routes.py`
- `frontend/src/core/`
- `frontend/src/**/main.js`
- `tests/`
- APIs:
- read-heavy `/api/wip/*` and `/api/resource/*` endpoints (response contract unchanged)
- Operational behavior:
- Preserve current `resource` and `wip` full-table caching strategy.
- Reduce server-side compute load through selective frontend compute offload.
@@ -0,0 +1,22 @@
## ADDED Requirements
### Requirement: Incremental Synchronization SHALL Use Versioned Watermarks
For heavy non-full-snapshot datasets, cache refresh SHALL support incremental synchronization keyed by stable version or watermark boundaries.
#### Scenario: Incremental refresh cycle
- **WHEN** source data version indicates partial changes since last sync
- **THEN** cache update logic MUST fetch and merge only changed partitions while preserving correctness guarantees
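A minimal sketch of the watermark-keyed incremental refresh; the cache shape and `fetch_changed` callback are illustrative assumptions, not the project's actual interfaces:

```python
def incremental_merge(cache, fetch_changed, source_version):
    """Watermark-based incremental refresh.

    cache: {"version": int, "rows": {key: row}}
    fetch_changed(since_version) -> iterable of (key, row) changed rows.
    """
    if source_version == cache["version"]:
        return cache                       # watermark unchanged: no work
    for key, row in fetch_changed(cache["version"]):
        cache["rows"][key] = row           # merge only changed partitions
    cache["version"] = source_version      # advance the watermark last
    return cache
```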
### Requirement: Query Paths SHALL Use Indexed Access for High-Frequency Filters
Query execution over cached data SHALL use prebuilt indexes for known high-frequency filter columns.
#### Scenario: Filtered report query
- **WHEN** request filters target indexed fields
- **THEN** result selection MUST avoid full dataset scans and maintain existing response contract
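The prebuilt-index idea can be sketched as a value-to-position map per high-frequency column; column and row names are hypothetical:

```python
from collections import defaultdict

def build_column_index(rows, column):
    """Prebuild a value -> row-position index for one filter column."""
    index = defaultdict(list)
    for pos, row in enumerate(rows):
        index[row.get(column)].append(pos)
    return index

def select_by_filter(rows, index, value):
    """Indexed selection: O(matches) lookups instead of a full scan,
    returning the same row objects the scan-based path would."""
    return [rows[pos] for pos in index.get(value, [])]
```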
### Requirement: Business-Mandated Full-Table Caches SHALL Be Preserved for Resource and WIP
The system SHALL continue to maintain full-table cache behavior for `resource` and `wip` domains.
#### Scenario: Resource or WIP cache refresh
- **WHEN** cache update runs for `resource` or `wip`
- **THEN** the updater MUST retain full-table snapshot semantics and MUST NOT switch these domains to partial-only cache mode
@@ -0,0 +1,15 @@
## ADDED Requirements
### Requirement: Cache Telemetry SHALL Include Memory Amplification Signals
Operational telemetry MUST expose cache domain memory usage indicators and representation amplification factors.
#### Scenario: Deep health telemetry request
- **WHEN** operators inspect cache telemetry
- **THEN** telemetry MUST include per-domain memory footprint and amplification indicators sufficient to detect redundant structures
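One way to express the amplification indicator is the ratio of total cache memory to the canonical representation alone; this shallow `sys.getsizeof` sketch is an approximation for illustration, not the project's telemetry implementation:

```python
import sys

def amplification_factor(canonical, derived_views):
    """Ratio of total cache memory to the canonical structure alone.
    ~1.0 means little redundancy; 2x-3x signals duplicated parallel
    representations worth collapsing. Shallow sizes only (sketch)."""
    base = sys.getsizeof(canonical)
    total = base + sum(sys.getsizeof(v) for v in derived_views)
    return round(total / base, 2)
```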
### Requirement: Efficiency Benchmarks SHALL Gate Cache Refactor Rollout
Cache/query efficiency changes MUST be validated against baseline latency and memory benchmarks before rollout.
#### Scenario: Pre-release validation
- **WHEN** cache refactor changes are prepared for deployment
- **THEN** benchmark results MUST demonstrate no regression beyond configured thresholds for P95 latency and memory usage
@@ -0,0 +1,15 @@
## ADDED Requirements
### Requirement: Reusable Browser Compute Modules SHALL Power Report Derivations
Derived computations for report filters, KPI cards, chart series, and table projections SHALL be implemented through reusable frontend modules.
#### Scenario: Shared report derivation logic
- **WHEN** multiple report pages require equivalent data-shaping behavior
- **THEN** pages MUST consume shared compute modules instead of duplicating transformation logic per page
### Requirement: Browser Compute Shift SHALL Preserve Export and Field Contracts
Moving computations to frontend MUST preserve existing field naming and export column contracts.
#### Scenario: User exports report after frontend-side derivation
- **WHEN** transformed data is rendered and exported
- **THEN** exported field names and ordering MUST remain consistent with governed field contract definitions
@@ -0,0 +1,23 @@
## 1. Cache Structure and Sync Refactor
- [x] 1.1 Define canonical per-domain cache representation and remove redundant parallel structures.
- [x] 1.2 Implement version/watermark-based incremental sync path for eligible non-full-snapshot datasets.
- [x] 1.3 Keep `resource` and `wip` full-table cache behavior while optimizing surrounding parse/index pipelines.
## 2. Indexed Query Acceleration
- [x] 2.1 Add index builders for high-frequency filter columns used by report endpoints.
- [x] 2.2 Refactor read paths to use indexed selection and reduce broad DataFrame copy operations.
- [x] 2.3 Add fallback and reconciliation logic to guarantee correctness under incremental/index drift.
## 3. Frontend Compute Reuse Expansion
- [x] 3.1 Extract shared Vite compute modules for KPI/filter/chart/table derivations.
- [x] 3.2 Refactor report pages to consume shared modules without changing user-visible behavior.
- [x] 3.3 Validate export/header field contract consistency after compute shift.
## 4. Performance Validation and Docs
- [x] 4.1 Add benchmark fixtures for baseline vs refactor latency/memory comparison.
- [x] 4.2 Surface cache memory amplification and index efficiency telemetry in health/admin outputs.
- [x] 4.3 Update README/README.mdj with cache strategy constraints and performance governance rules.
@@ -0,0 +1,2 @@
schema: spec-driven
created: 2026-02-08
@@ -0,0 +1,45 @@
## Context
The project already provides watchdog-assisted restart and resilience diagnostics, but policy boundaries for automated recovery are not yet formalized end-to-end. In practice, this can lead to either under-recovery (manual delays) or over-recovery (restart churn). We also need stronger conda/systemd path consistency checks to prevent runtime drift across deploy scripts and services.
## Goals / Non-Goals
**Goals:**
- Make conda/systemd runtime path contracts explicit, validated, and drift-detectable.
- Implement safe self-healing policy with cooldown and churn limits.
- Expose clear alert signals and recommended actions in health/admin payloads.
- Keep operator manual override available for incident control.
**Non-Goals:**
- Migrating from systemd to another orchestrator.
- Changing database vendor or introducing full autoscaling infrastructure.
- Removing existing admin restart endpoints.
## Decisions
1. **Single source runtime contract**
- Decision: centralize conda runtime path configuration consumed by systemd units, watchdog, and scripts.
- Rationale: prevents mismatched interpreter/path drift.
2. **Guarded self-healing state machine**
- Decision: implement bounded restart policy (cooldown + max retries per time window + circuit-open gating).
- Rationale: recovers quickly while preventing restart storms.
3. **Explicit recovery observability contract**
- Decision: enrich health/admin payloads with churn counters, cooldown state, and recommended operator action.
- Rationale: enables deterministic triage and alert automation.
4. **Auditability requirement**
- Decision: emit structured logs/events for auto-restart decision, manual override, and blocked restart attempts.
- Rationale: supports incident retrospectives and policy tuning.
5. **Runbook-first rollout**
- Decision: deploy policy changes behind documentation and validation gates, including rollback steps.
- Rationale: operational safety for production adoption.
## Risks / Trade-offs
- **[Risk] Overly strict policy delays recovery** → **Mitigation:** configurable thresholds and emergency manual override.
- **[Risk] Aggressive policy causes churn loops** → **Mitigation:** hard stop on churn threshold breach and explicit cool-off windows.
- **[Risk] Added operational complexity** → **Mitigation:** concise runbook with decision tables and tested scripts.
- **[Risk] Drift detection false positives** → **Mitigation:** normalize path resolution and clearly defined comparison sources.
@@ -0,0 +1,40 @@
## Why
Operations stability still depends heavily on manual intervention when workers degrade or pools saturate. We need a formal operations phase to align conda/systemd runtime contracts and add controlled self-healing with guardrails, so recovery is faster without triggering restart storms.
## What Changes
- Standardize conda-based runtime paths across app service, watchdog, and operational scripts from a single source of truth.
- Introduce guarded worker self-healing policy (cooldown, churn windows, bounded retries, manual override).
- Add alert thresholds and machine-readable recovery signals for pool pressure, circuit-open persistence, and restart churn.
- Harden runbook documentation and scripts for deterministic restart, rollback, and incident triage.
## Capabilities
### New Capabilities
- `worker-self-healing-governance`: Define safe autonomous recovery behavior with anti-storm guardrails.
### Modified Capabilities
- `conda-systemd-runtime-alignment`: Extend runtime consistency requirements with startup validation and drift detection.
- `runtime-resilience-recovery`: Add auditable recovery-action requirements for automated and operator-triggered restart flows.
## Impact
- Affected code:
- `deploy/systemd/*.service`
- `scripts/worker_watchdog.py`
- `src/mes_dashboard/routes/admin_routes.py`
- `src/mes_dashboard/routes/health_routes.py`
- `src/mes_dashboard/core/database.py`
- `src/mes_dashboard/core/circuit_breaker.py`
- `tests/`
- `README.md`, `README.mdj`, runbook docs
- APIs:
- `/health`
- `/health/deep`
- `/admin/api/system-status`
- `/admin/api/worker/status`
- `/admin/api/worker/restart`
- Operational behavior:
- Preserve single-port bind model.
- Add controlled self-healing policy and clearer alert thresholds.
@@ -0,0 +1,15 @@
## ADDED Requirements
### Requirement: Runtime Path Drift SHALL Be Detectable Before Service Start
Service startup checks MUST validate configured conda runtime paths across app, watchdog, and worker control scripts.
#### Scenario: Conda path mismatch detected
- **WHEN** startup validation finds runtime path inconsistency between configured units and scripts
- **THEN** service start MUST fail with actionable diagnostics instead of running with partial mismatch
### Requirement: Conda/Systemd Contract SHALL Be Versioned in Operations Docs
The documented runtime contract MUST include versioned path assumptions and verification commands.
#### Scenario: Operator verifies deployment contract
- **WHEN** operator follows runbook validation steps
- **THEN** commands MUST confirm active runtime paths match documented conda/systemd contract
@@ -0,0 +1,15 @@
## ADDED Requirements
### Requirement: Recovery Recommendations SHALL Reflect Self-Healing Policy State
Health and admin resilience payloads MUST expose whether automated recovery is allowed, cooling down, or blocked by churn policy.
#### Scenario: Operator inspects degraded state
- **WHEN** `/health` or `/admin/api/worker/status` is requested during degradation
- **THEN** response MUST include policy state, cooldown remaining time, and next recommended action
### Requirement: Manual Recovery Override SHALL Be Explicit and Controlled
Manual restart actions MUST bypass automatic block only through authenticated operator pathways with explicit acknowledgement.
#### Scenario: Churn-blocked state with manual override request
- **WHEN** authorized admin requests manual restart while auto-recovery is blocked
- **THEN** system MUST execute controlled restart path and log the override context for auditability
@@ -0,0 +1,22 @@
## ADDED Requirements
### Requirement: Automated Worker Recovery SHALL Use Bounded Policy Guards
Automated worker restart behavior MUST enforce cooldown periods and bounded restart attempts within a configurable time window.
#### Scenario: Repeated worker degradation within short window
- **WHEN** degradation events exceed configured restart-attempt budget
- **THEN** automated restarts MUST pause and surface a blocked-recovery signal for operator intervention
### Requirement: Restart-Churn Protection SHALL Prevent Recovery Storms
The runtime MUST classify restart churn and prevent uncontrolled restart loops.
#### Scenario: Churn threshold exceeded
- **WHEN** restart count crosses churn threshold in active window
- **THEN** watchdog MUST enter guarded mode and require explicit manual override before further restart attempts
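The two guard requirements above (bounded retry budget and churn-triggered guarded mode) can be sketched as one small policy object; the thresholds and method names are illustrative defaults, not the project's tuned values:

```python
import time

class RestartPolicy:
    """Bounded auto-restart guard: cooldown between attempts, a retry
    budget per sliding window, and guarded mode once churn is breached."""

    def __init__(self, cooldown=60, window=600, max_restarts=3, clock=time.monotonic):
        self.cooldown = cooldown
        self.window = window
        self.max_restarts = max_restarts
        self.clock = clock
        self.history = []      # timestamps of executed restarts
        self.guarded = False   # churn threshold breached: block auto-restart

    def allow_restart(self):
        now = self.clock()
        # Drop restart events that fell out of the churn window.
        self.history = [t for t in self.history if now - t < self.window]
        if self.guarded:
            return False
        if self.history and now - self.history[-1] < self.cooldown:
            return False       # still cooling down
        if len(self.history) >= self.max_restarts:
            self.guarded = True  # require explicit manual override
            return False
        self.history.append(now)
        return True

    def manual_override(self):
        """Authenticated operator acknowledgement clears guarded mode."""
        self.guarded = False
        self.history.clear()
```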
### Requirement: Recovery Decisions SHALL Be Audit-Ready
Every auto-recovery decision and manual override action MUST be recorded with structured metadata.
#### Scenario: Worker restart decision emitted
- **WHEN** system executes or denies a restart action
- **THEN** structured logs/events MUST include reason, thresholds, actor/source, and resulting state
@@ -0,0 +1,23 @@
## 1. Conda/Systemd Contract Alignment
- [x] 1.1 Centralize runtime path configuration consumed by service units, watchdog, and scripts.
- [x] 1.2 Add startup validation that fails fast on conda path drift.
- [x] 1.3 Update systemd/watchdog integration tests for consistent runtime contract.
## 2. Worker Self-Healing Policy
- [x] 2.1 Implement bounded auto-restart policy (cooldown, retry budget, churn window).
- [x] 2.2 Add guarded mode behavior when churn threshold is exceeded.
- [x] 2.3 Implement authenticated manual override flow with explicit logging context.
## 3. Alerting and Operational Signals
- [x] 3.1 Expose policy-state fields in health/admin payloads (`allowed`, `cooldown`, `blocked`).
- [x] 3.2 Add structured audit events for restart decisions and override actions.
- [x] 3.3 Define alert thresholds and wire monitoring-friendly fields for pool/circuit/churn conditions.
## 4. Validation and Runbook Delivery
- [x] 4.1 Add tests for policy transitions, guarded mode, and override behavior.
- [x] 4.2 Validate single-port continuity during controlled recovery and hot reload paths.
- [x] 4.3 Update README/README.mdj and deployment runbook with verified operational procedures.
@@ -0,0 +1,2 @@
schema: spec-driven
created: 2026-02-08
@@ -0,0 +1,50 @@
## Context
The system has completed the Vite single-port architecture and the major P0/P1/P2 hardening rounds, but residual risk is concentrated in cache slow-path lock contention, health-check hot-spot queries, and API boundary governance. Most of these issues only surface under medium-to-high traffic; if they are not converged in this phase, later troubleshooting cost will be high.
## Goals / Non-Goals
**Goals:**
- Complete the remaining stability and security fixes without changing page interaction semantics or the single-port architecture.
- Make cache/health paths more predictable under high concurrency and reduce security risk in logs.
- Use test coverage to ensure the fixes introduce no functional regressions.
**Non-Goals:**
- Do not rewrite the main query flows or remove the `resource/wip` full-table cache strategy.
- Do not introduce heavyweight distributed rate-limit infrastructure.
- Do not change frontend drill-down or report feature semantics.
## Decisions
1. **Cache publish consistency over local optimization**
   - Publish data and metadata via staging keys with an atomic rename/pipeline, so a failed publish never affects readability of the previous snapshot.
2. **Parse outside the lock; the lock only covers consistency check and commit**
   - The WIP process-cache slow path parses outside the lock, then double-checks and commits inside it, reducing lock hold time.
3. **Consistent process-cache policy**
   - The realtime equipment cache gains max_size + LRU, matching the existing WIP/Resource caches.
4. **Internal health short cache enabled only outside test environments**
   - TTL = 5 seconds reduces repeated DB/Redis pressure from high-frequency probes; testing mode keeps realtime computation to avoid cross-test pollution.
5. **Lightweight in-memory rate limiting for high-cost APIs**
   - Throttle by IP + route window with tunable parameters, without introducing new external dependencies.
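The invariant behind Decision 1 — a failed publish must never affect the readable snapshot — can be sketched in-process. Production uses Redis staging keys with rename/pipeline; this dict-based analogue only demonstrates the commit discipline, with illustrative names:

```python
import threading

class SnapshotStore:
    """Staging-style publish: the new snapshot is fully built first, then
    swapped into the live slot together with its metadata in one atomic
    step, so a failed build leaves the old snapshot readable."""

    def __init__(self):
        self._lock = threading.Lock()
        self._live = None
        self._version = 0

    def publish(self, build_snapshot, version):
        snapshot = build_snapshot()   # may raise; live data is untouched
        with self._lock:              # atomic commit of data + version
            self._live = snapshot
            self._version = version

    def read(self):
        with self._lock:
            return self._live, self._version
```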
## Risks / Trade-offs
- **[Risk] The cache-publish rework adds key-switch complexity** → **Mitigation:** add tests covering both publish failure and success.
- **[Risk] The health cache briefly delays observability** → **Mitigation:** cap TTL at 5 seconds and disable it in testing.
- **[Risk] In-memory rate limiting is not globally consistent across workers** → **Mitigation:** treat it as a protective valve first; upgrade to a Redis-based limiter later.
## Migration Plan
1. Land the core cache and health fixes first (no API contract impact).
2. Then introduce the API boundary/rate-limit changes and shared utility extraction.
3. Add unit and integration tests and run a benchmark smoke test.
4. Update the README documentation and environment variable descriptions.
## Open Questions
- Should default rate-limit thresholds for high-cost APIs be split per endpoint (WIP vs Resource)?
- Should we later upgrade to Redis-based distributed rate limiting for global consistency across multiple workers?
@@ -0,0 +1,44 @@
## Why
上一輪已完成高風險核心修復,但仍有一批殘餘問題會在高併發、長時間運行與惡意/異常輸入下放大風險(快取發布一致性、鎖競爭、健康檢查負載、輸入邊界與速率治理)。本輪目標是把這些尾端風險收斂到可接受範圍,避免後續運維與效能不穩。
## What Changes
- Harden the WIP cache publish flow so a failed update cannot pollute the existing read path.
- Narrow the lock scope of the process-cache slow path so large JSON is never parsed while the lock is held.
- Add a bounded LRU to the realtime equipment process cache, consistent with the WIP/Resource policy.
- Add depth protection to the NaN cleaning in the resource routes (guarding against deep-recursion risk).
- Extract a shared boolean parameter parser to remove duplicated logic.
- Make the filter-cache view names configurable, removing the hardcoded coupling.
- Add log redaction for sensitive connection strings.
- Add a 5-second internal short cache to `/health` and `/health/deep` (disabled in testing mode).
- Add lightweight rate limiting with tunable parameters to the high-cost query APIs.
- Update README/README.mdj and the validation tests.
## Capabilities
### New Capabilities
- `api-safety-hygiene`: API input boundaries, shared parameter parsing, configurable query sources, and rate governance for high-cost endpoints.
### Modified Capabilities
- `cache-observability-hardening`: strengthens cache publish consistency, process-cache lock scope, and bounded-policy alignment.
- `runtime-resilience-recovery`: operational-safety requirements for the short health-check cache and sensitive-information log redaction.
## Impact
- Affected code:
- `src/mes_dashboard/core/cache_updater.py`
- `src/mes_dashboard/core/cache.py`
- `src/mes_dashboard/services/realtime_equipment_cache.py`
- `src/mes_dashboard/routes/resource_routes.py`
- `src/mes_dashboard/routes/wip_routes.py`
- `src/mes_dashboard/routes/hold_routes.py`
- `src/mes_dashboard/services/filter_cache.py`
- `src/mes_dashboard/core/database.py`
- `src/mes_dashboard/routes/health_routes.py`
- APIs:
- `/health`, `/health/deep`
- `/api/wip/detail/<workcenter>`, `/api/wip/overview/*`
- `/api/resource/*` (high-cost routes)
- Docs/tests:
- `README.md`, `README.mdj`, `tests/*`


@@ -0,0 +1,29 @@
## ADDED Requirements
### Requirement: Recursive Payload Cleaning MUST Enforce Depth Safety
Routes that normalize nested payloads MUST prevent unbounded recursion depth.
#### Scenario: Deeply nested response object
- **WHEN** NaN-cleaning helper receives deeply nested list/dict payload
- **THEN** cleaning logic MUST enforce max depth or iterative traversal and return safely without recursion failure
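The requirement above can be satisfied with a depth-capped cleaner. This is a hedged sketch, not the project's actual helper: the name `clean_nan` and the default cap are assumptions; the point is that traversal refuses to descend past `max_depth` instead of recursing unboundedly.

```python
import math

def clean_nan(payload, max_depth=100):
    """Replace float NaN values with None, refusing to descend past max_depth."""
    def scrub(value, depth):
        if depth > max_depth:
            return None  # stop descending instead of risking a RecursionError
        if isinstance(value, float) and math.isnan(value):
            return None
        if isinstance(value, dict):
            return {k: scrub(v, depth + 1) for k, v in value.items()}
        if isinstance(value, list):
            return [scrub(v, depth + 1) for v in value]
        return value
    return scrub(payload, 0)
```

With the cap in place, a hostile or accidental 500-level nesting returns safely (truncated beyond the cap) rather than crashing the route.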
### Requirement: Filter Source Names MUST Be Configurable
Filter cache query sources MUST NOT rely on hardcoded view names only.
#### Scenario: Environment-specific view names
- **WHEN** deployment sets custom filter-source environment variables
- **THEN** filter cache loader MUST resolve and query configured view names
### Requirement: High-Cost APIs SHALL Apply Basic Rate Guardrails
High-cost read endpoints SHALL apply configurable request-rate guardrails to reduce abuse and accidental bursts.
#### Scenario: Burst traffic from same client
- **WHEN** a client exceeds configured request budget for guarded endpoints
- **THEN** endpoint SHALL return throttled response with clear retry guidance
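A minimal shape for the guardrail above is a fixed-window counter keyed by `(client, route)`. This is an illustrative per-worker sketch (class and parameter names are assumptions); when `allow` returns `False`, the caller would respond 429 with `Retry-After` guidance, as the scenario requires.

```python
import time

class WindowRateLimiter:
    """Fixed-window, in-memory limiter keyed by (client, route)."""

    def __init__(self, limit, window_seconds, clock=time.monotonic):
        self.limit = limit
        self.window = window_seconds
        self.clock = clock
        self._buckets = {}  # (client, route) -> [window_start, count]

    def allow(self, client, route):
        now = self.clock()
        key = (client, route)
        start, count = self._buckets.get(key, (now, 0))
        if now - start >= self.window:
            start, count = now, 0  # window expired: reset the budget
        if count >= self.limit:
            self._buckets[key] = [start, count]
            return False  # caller responds throttled (e.g. 429 + Retry-After)
        self._buckets[key] = [start, count + 1]
        return True
```

The injectable `clock` keeps the limiter deterministic in tests; a Redis-backed limiter could replace it later without changing call sites.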
### Requirement: Common Boolean Query Parsing SHALL Be Shared
Boolean query parsing in routes SHALL use shared helper behavior.
#### Scenario: Different routes parse include flags
- **WHEN** routes parse common boolean query parameters
- **THEN** parsing behavior MUST be consistent across routes via shared utility
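The shared-parser requirement can be sketched as one helper that every route imports. The truthy/falsey token sets below are assumed reasonable defaults, not the project's exact lists.

```python
# Assumed token sets; the real shared utility may accept a different vocabulary.
_TRUE = {"1", "true", "yes", "on"}
_FALSE = {"0", "false", "no", "off"}

def parse_bool_param(raw, default=False):
    """Parse a query-string flag consistently across routes."""
    if raw is None:
        return default
    token = raw.strip().lower()
    if token in _TRUE:
        return True
    if token in _FALSE:
        return False
    return default  # unknown tokens fall back rather than raising
```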


@@ -0,0 +1,26 @@
## ADDED Requirements
### Requirement: Cache Publish MUST Preserve Previous Readable Snapshot on Failure
When refreshing full-table cache payloads, the system MUST avoid exposing partially published states to readers.
#### Scenario: Publish fails after payload serialization
- **WHEN** a cache refresh has prepared new payload but publish operation fails
- **THEN** previously published cache keys MUST remain readable and metadata MUST remain consistent with old snapshot
#### Scenario: Publish succeeds
- **WHEN** publish operation completes successfully
- **THEN** data payload and metadata keys MUST be visible as one coherent new snapshot
### Requirement: Process-Level Cache Slow Path SHALL Minimize Lock Hold Time
Large payload parsing MUST NOT happen inside long-held process cache locks.
#### Scenario: Cache miss under concurrent requests
- **WHEN** multiple requests hit process cache miss
- **THEN** parsing work SHALL happen outside lock-protected mutation section, and lock scope SHALL be limited to consistency check + commit
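The slow-path rule above is essentially double-checked locking: heavy `json.loads` runs with the lock released, and the lock only covers the version check and commit. The structure below is an illustrative sketch, not the project's actual cache class.

```python
import json
import threading

class ParsedCache:
    """Process cache whose slow path parses outside the lock."""

    def __init__(self):
        self._lock = threading.Lock()
        self._value = None
        self._version = None

    def get(self, raw_payload, version):
        with self._lock:
            if self._version == version:
                return self._value          # fast path: already parsed
        parsed = json.loads(raw_payload)    # heavy work, no lock held
        with self._lock:
            if self._version != version:    # double-check: another thread may have won
                self._value, self._version = parsed, version
            return self._value
```

Concurrent misses each parse their own copy, but only one result is committed per version, and no request ever blocks behind another request's parse.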
### Requirement: Process-Level Cache Policies MUST Stay Consistent Across Services
All service-local process caches MUST support bounded capacity with deterministic eviction.
#### Scenario: Realtime equipment cache growth
- **WHEN** realtime equipment process cache reaches configured capacity
- **THEN** entries MUST be evicted according to deterministic LRU behavior
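The deterministic-LRU behavior above is what `collections.OrderedDict` provides almost directly. This is a minimal sketch; `max_size` and the method names are illustrative rather than the project's actual `ProcessLevelCache` API.

```python
from collections import OrderedDict

class BoundedLRUCache:
    """TTL-free sketch of bounded capacity with deterministic LRU eviction."""

    def __init__(self, max_size=128):
        self.max_size = max_size
        self._data = OrderedDict()

    def get(self, key, default=None):
        if key not in self._data:
            return default
        self._data.move_to_end(key)  # reads refresh recency, so hot keys survive
        return self._data[key]

    def set(self, key, value):
        if key in self._data:
            self._data.move_to_end(key)
        self._data[key] = value
        while len(self._data) > self.max_size:
            self._data.popitem(last=False)  # evict the least recently used entry
```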


@@ -0,0 +1,19 @@
## ADDED Requirements
### Requirement: Health Endpoints SHALL Use Short Internal Memoization
Health and deep-health computation SHALL use a short-lived internal cache to prevent probe storms from amplifying backend load.
#### Scenario: Frequent monitor scrapes
- **WHEN** health endpoints are called repeatedly within a small window
- **THEN** service SHALL return memoized payload for up to 5 seconds in non-testing environments
#### Scenario: Testing mode
- **WHEN** app is running in testing mode
- **THEN** health endpoint memoization MUST be bypassed to preserve deterministic tests
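The memoization requirement can be sketched as a small wrapper with an injectable clock and the testing-mode bypass the second scenario demands. Names and the decorator-free shape are assumptions for illustration.

```python
import time

def memoize_health(compute, ttl=5.0, testing=False, clock=time.monotonic):
    """Wrap a health computation in a short-lived single-slot cache."""
    state = {"at": None, "value": None}

    def cached():
        if testing:
            return compute()  # deterministic tests: never serve stale payloads
        now = clock()
        if state["at"] is None or now - state["at"] >= ttl:
            state["value"] = compute()
            state["at"] = now
        return state["value"]

    return cached
```

Probe storms within the TTL hit the memo; once the window passes, the next call recomputes against the real backends.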
### Requirement: Logs MUST Redact Connection Secrets
Runtime logs MUST avoid exposing DB connection credentials.
#### Scenario: Connection string appears in log message
- **WHEN** a log message contains DB URL credentials
- **THEN** logger output MUST redact password and sensitive userinfo before emission
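A sketch of the redaction rule: mask the userinfo section of a connection URL before emission. The regex targets `scheme://user:pass@host` shapes and is an assumption about the log format, not the project's exact logging filter; in practice this would run inside a `logging.Filter`.

```python
import re

# Matches the `user:password@` portion of URLs like oracle+oracledb://u:p@host.
_USERINFO = re.compile(r"(\w+(?:\+\w+)?://)[^/@\s]+:[^/@\s]+@")

def redact_db_url(message: str) -> str:
    """Replace credentials in connection strings with a placeholder."""
    return _USERINFO.sub(r"\1***:***@", message)
```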


@@ -0,0 +1,22 @@
## 1. Cache Consistency and Contention Hardening
- [x] 1.1 Harden WIP cache publish in `cache_updater.py` to preserve old snapshot on publish failure.
- [x] 1.2 Refactor WIP process-cache slow path in `core/cache.py` so heavy parse runs outside lock.
- [x] 1.3 Extend realtime equipment process cache with bounded `max_size` + deterministic LRU and add regression tests.
## 2. API Safety and Config Hygiene
- [x] 2.1 Add depth-safe NaN cleaning in `resource_routes.py` and tests for deep payloads.
- [x] 2.2 Add shared boolean query parser in `core/utils.py` and switch `wip_routes.py` / `hold_routes.py` to it.
- [x] 2.3 Make filter-cache source views configurable (env-based) in `filter_cache.py` and add config tests.
## 3. Runtime Guardrails
- [x] 3.1 Add DB connection-string redaction logging filter in `core/database.py` (or logging bootstrap) with tests.
- [x] 3.2 Add 5-second internal memoization for `/health` and `/health/deep` (disabled in testing) and tests.
- [x] 3.3 Add lightweight rate limiting for selected high-cost APIs with clear throttling responses and tests.
## 4. Validation and Documentation
- [x] 4.1 Run targeted backend/frontend tests and benchmark smoke gate.
- [x] 4.2 Update `README.md` and `README.mdj` with round-3 hardening notes and new env variables.


@@ -0,0 +1,2 @@
schema: spec-driven
created: 2026-02-08


@@ -0,0 +1,61 @@
## Context
After round 3 the main flows are stable, but three classes of technical debt remain:
- The Resource cache keeps both the DataFrame and a full copy of the records in the same process, causing memory amplification.
- The Oracle queries for Resource and Realtime Equipment duplicate SQL strings across services, so later edits can easily drift apart.
- Type annotations and magic numbers at some service boundaries are unsystematic, which keeps maintenance cost high.
Constraints:
- `resource` / `wip` keep the full-table cache strategy; data sources and refresh cadence are unchanged.
- External API fields and frontend behavior stay unchanged.
- The single-port architecture and the existing operational contract are preserved.
## Goals / Non-Goals
**Goals:**
- Reduce duplicated in-process data representations of the Resource cache while keeping query output compatible.
- Maintain cross-service Oracle query fragments from a single source.
- Give the key service/cache modules consistent type annotations and named constants.
**Non-Goals:**
- No database schema changes and no changes to SQL result columns.
- No rewrite of the overall cache architecture (Redis + process cache stays).
- No new infrastructure or external dependencies.
## Decisions
1. Switch the Resource derived index to a row-position index instead of keeping full record copies
   - Current state: the index keeps `records` plus several bucketed record sets, duplicating the DataFrame contents.
   - Decision: the index keeps only row positions (integer indices) and required metadata; dict output is converted from the DataFrame on demand.
   - Trade-off: each output pays a small conversion cost, but resident memory duplication drops significantly.
2. Create a shared Oracle query constants module
   - Current state: `resource_cache.py` and `realtime_equipment_cache.py` each maintain their own base SQL.
   - Decision: extract `services/sql_fragments.py` (or an equivalent module) to own the shared query text and table/view names.
   - Trade-off: one extra level of indirection, but query semantics stay consistent and changes remain controllable.
3. Govern types and constants core-boundaries-first, then spread
   - Current state: some functions mix `Optional` with PEP 604 style, and magic numbers are scattered across cache/service code.
   - Decision: first unify the type style and the high-frequency constants (TTL, size, window, limits) in the files touched this round.
   - Trade-off: no one-shot project-wide cleanup, which would create large-scale noise; establish a baseline that can be extended sustainably.
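The row-position decision can be sketched with plain lists standing in for the DataFrame (an assumption to keep the example dependency-free; the class and field names are illustrative). The index stores integer positions per bucket, and dicts are materialized lazily on output.

```python
class ResourceIndex:
    """Derived index that stores row positions, never duplicated records."""

    def __init__(self, rows, bucket_field):
        self._rows = rows          # authoritative data (stand-in for the DataFrame)
        self._buckets = {}         # bucket value -> [row positions]
        for pos, row in enumerate(rows):
            self._buckets.setdefault(row[bucket_field], []).append(pos)

    def records_for(self, bucket):
        """Materialize record dicts on demand from stored positions."""
        return [self._rows[pos] for pos in self._buckets.get(bucket, [])]
```

Memory held by the index is proportional to the number of integer positions rather than to a second full copy of every record.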
## Risks / Trade-offs
- [Risk] The row-position index drifts from the DataFrame version → Mitigation: rebuild the index on every cache invalidation and keep a version check.
- [Risk] Lazy conversion causes query-latency jitter → Mitigation: keep the process cache and optimize small-batch output on the hot paths.
- [Risk] Extracting shared SQL constants introduces wrong references → Mitigation: add unit tests asserting the query text matches the existing column contract.
- [Risk] Type/constant cleanup changes behavior → Mitigation: perform equivalence-only refactors, keep the original values, and cover them with regression tests.
## Migration Plan
1. Refactor the Resource index representation first, keeping API output unchanged.
2. Extract the shared SQL fragments and switch both cache services to them.
3. Clean up types and constants within scope and add tests.
4. Update README / README.mdj and the OpenSpec tasks; run the targeted backend/frontend test sets.
Rollback
- If compatibility issues appear, revert to the original records-based index and the old inline SQL (a single-file revert suffices).
## Open Questions
- Whether to extend the same governance to the remaining constants and types in `wip_service.py` in the next round (this round is limited to the residual scope).


@@ -0,0 +1,31 @@
## Why
The remaining risk now concentrates on maintainability and memory efficiency: the Resource cache keeps multiple data representations in a single process, some query SQL is maintained in duplicate across cache services, and type annotations and magic numbers are still inconsistent. None of this causes an immediate outage, but it raises memory usage, future change cost, and regression risk, so it should be closed out without changing existing behavior.
## What Changes
- Change the data representation of the Resource derived index to a lightweight index with lazy output, avoiding a duplicated full records copy in the process.
- Consolidate the Oracle query strings of Resource and Realtime Equipment into a shared SQL constants module, reducing duplicate definitions and drift risk.
- Align type annotations (especially at the cache/index/service boundaries) and promote high-frequency magic numbers to named constants or configurable parameters.
- Keep the existing API contract, full-table cache strategy, single-port architecture, and frontend behavior unchanged.
## Capabilities
### New Capabilities
- `resource-cache-representation-normalization`: replace multiple full in-process data copies with a single authoritative representation plus a lightweight index, preserving the existing query response structure.
- `oracle-query-fragment-governance`: extract cross-service Oracle query fragments into shared constants/templates to keep query semantics consistent.
- `maintainability-type-and-constant-hygiene`: establish working conventions for type annotations and named constants, reducing magic numbers and annotation-style drift.
### Modified Capabilities
- `cache-observability-hardening`: add observability consistency requirements covering memory amplification factors after the index representation change.
## Impact
- Primary affected files:
- `src/mes_dashboard/services/resource_cache.py`
- `src/mes_dashboard/services/realtime_equipment_cache.py`
- `src/mes_dashboard/services/resource_service.py` (if needed to support index output)
- `src/mes_dashboard/sql/*` or a new shared SQL constants module
- `src/mes_dashboard/config/constants.py` and `src/mes_dashboard/core/utils.py`
- Corresponding tests and the README/README.mdj docs
- No new external dependencies; external API paths and field contracts are unchanged.


@@ -0,0 +1,8 @@
## MODIFIED Requirements
### Requirement: Cache Telemetry SHALL Include Memory Amplification Signals
Operational telemetry MUST expose cache-domain memory usage indicators and representation amplification factors, and MUST differentiate between authoritative data payload and derived/index helper structures.
#### Scenario: Deep health telemetry request after representation normalization
- **WHEN** operators inspect cache telemetry for resource or WIP domains
- **THEN** telemetry MUST include per-domain memory footprint, amplification indicators, and enough structure detail to verify that full-record duplication is not reintroduced


@@ -0,0 +1,15 @@
## ADDED Requirements
### Requirement: Core Cache and Service Boundaries MUST Use Consistent Type Annotation Style
Core cache/service modules touched by this change SHALL use a consistent and explicit type-annotation style for public and internal helper boundaries.
#### Scenario: Reviewing updated cache/service modules
- **WHEN** maintainers inspect function signatures in affected modules
- **THEN** optional and collection types MUST follow a single consistent style and remain compatible with the project Python baseline
### Requirement: High-Frequency Magic Numbers MUST Be Replaced by Named Constants
Cache, throttling, and index-related numeric literals that control behavior MUST be extracted to named constants or env-configurable settings.
#### Scenario: Tuning cache/index behavior
- **WHEN** operators need to tune cache/index thresholds
- **THEN** they MUST find values in named constants or environment variables rather than scattered inline literals


@@ -0,0 +1,15 @@
## ADDED Requirements
### Requirement: Shared Oracle Query Fragments SHALL Have a Single Source of Truth
Cross-service Oracle query fragments for resource and equipment cache loading MUST be defined in a shared module and imported by service implementations.
#### Scenario: Update common table/view reference
- **WHEN** a common table or view name changes
- **THEN** operators and developers MUST be able to update one shared definition without editing duplicated SQL literals across services
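One possible shape for such a shared module is a handful of constants plus a small builder (the design names `services/sql_fragments.py`; the view names and columns below are hypothetical placeholders, not the real schema).

```python
# Hypothetical view names: a single place to edit when a reference changes.
RESOURCE_VIEW = "MES_APP.V_RESOURCE_STATUS"
EQUIPMENT_VIEW = "MES_APP.V_EQUIPMENT_RT"

def base_select(view: str, columns: tuple) -> str:
    """Single source of truth for the shared SELECT shape."""
    return f"SELECT {', '.join(columns)} FROM {view}"

# Each cache service imports its fragment instead of inlining SQL literals.
RESOURCE_BASE_SQL = base_select(RESOURCE_VIEW, ("RESOURCE_ID", "AREA", "STATUS"))
EQUIPMENT_BASE_SQL = base_select(EQUIPMENT_VIEW, ("EQP_ID", "STATE", "UPDATED_AT"))
```

Renaming a view then touches one constant, and a unit test pinning the generated text guards against drift.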
### Requirement: Service Queries MUST Preserve Existing Columns and Semantics
Services consuming shared Oracle query fragments SHALL preserve existing selected columns, filters, and downstream payload behavior.
#### Scenario: Resource and equipment cache refresh after refactor
- **WHEN** cache services execute queries via shared fragments
- **THEN** resulting payload structure MUST remain compatible with existing aggregation and API contracts


@@ -0,0 +1,22 @@
## ADDED Requirements
### Requirement: Resource Derived Index MUST Avoid Full Record Duplication
Resource derived index SHALL use lightweight row-position references instead of storing full duplicated record payloads alongside the process DataFrame cache.
#### Scenario: Build index from cached DataFrame
- **WHEN** resource cache data is parsed from Redis into process-level DataFrame
- **THEN** the derived index MUST store position-based references and metadata without a second full records copy
### Requirement: Resource Query APIs SHALL Preserve Existing Response Contract
Resource query APIs MUST keep existing output fields and semantics after index representation normalization.
#### Scenario: Read all resources after normalization
- **WHEN** callers request all resources or filtered resource lists
- **THEN** the returned payload MUST remain field-compatible with pre-normalization responses
### Requirement: Cache Invalidation MUST Keep Index/Data Coherent
The system SHALL invalidate and rebuild DataFrame/index representations atomically at cache refresh boundaries.
#### Scenario: Redis-backed cache refresh completes
- **WHEN** a new resource cache snapshot is published
- **THEN** stale index references MUST be invalidated before subsequent reads use refreshed DataFrame data


@@ -0,0 +1,22 @@
## 1. Resource Cache Representation Normalization
- [x] 1.1 Refactor `resource_cache` derived index to use lightweight row-position references instead of full duplicated records payload.
- [x] 1.2 Keep `get_all_resources` / `get_resources_by_filter` API outputs backward compatible while sourcing data from normalized representation.
- [x] 1.3 Update cache telemetry fields to reflect normalized representation and verify amplification calculation remains interpretable.
## 2. Oracle Query Fragment Governance
- [x] 2.1 Extract shared Oracle SQL fragments/constants for resource/equipment cache loading into a common module.
- [x] 2.2 Replace duplicated SQL literals in `resource_cache.py` and `realtime_equipment_cache.py` with shared definitions.
- [x] 2.3 Add/adjust tests to lock expected query semantics and prevent drift.
## 3. Maintainability Hygiene
- [x] 3.1 Normalize type annotations in touched cache/service modules to one consistent style.
- [x] 3.2 Replace high-frequency magic numbers with named constants or env-driven config in touched modules.
- [x] 3.3 Confirm existing login/API rate-limit and bool parser utilities remain centralized without new duplication.
## 4. Verification and Documentation
- [x] 4.1 Run targeted backend tests for resource cache, equipment cache, health/admin, and route behavior.
- [x] 4.2 Update `README.md` and `README.mdj` with round-4 hardening notes.


@@ -0,0 +1,2 @@
schema: spec-driven
created: 2026-02-08


@@ -0,0 +1,65 @@
## Context
The previous round completed the bulk of the P0/P1/P2 refactor, but code review still shows several residual high-risk points:
- `LDAP_API_URL` has no scheme/host guard, a configurable SSRF risk.
- The process-level DataFrame cache only has a TTL, with no capacity bound.
- The circuit breaker writes logs during state transitions while holding its lock, a lock-contention amplification risk.
- Global security headers are not yet emitted uniformly.
- Pagination parameters still lack lower-bound validation.
These issues span `app/core/services/routes/tests` and form a cross-module security and stability patch set.
## Goals / Non-Goals
**Goals:**
- Establish testable minimum defenses for the LDAP endpoint, HTTP response headers, and input boundaries.
- Give the process-level cache bounded capacity and predictable eviction behavior.
- Reduce lock-contention risk inside the circuit breaker so slow handlers cannot amplify blocking.
- Keep the single port, the existing API contracts, and the frontend interaction semantics unchanged.
**Non-Goals:**
- No full WAF / zero-trust architecture.
- No rewrite of the existing cache architecture onto an external cache service.
- No changes to report features or page flows.
## Decisions
1. **Fail-fast LDAP URL validation at startup**
   - Decision: validate `LDAP_API_URL` during `auth_service` startup, restricting it to `https` and an allowlist of hosts (set via env); on mismatch, disable the LDAP auth path and log an error.
   - Rationale: seals the configuration-driven SSRF risk with minimal change and leaves local auth mode unaffected.
2. **Bound `ProcessLevelCache`**
   - Decision: add `max_size` with LRU eviction (`OrderedDict`) to `ProcessLevelCache`, evicting the oldest key on `set`.
   - Rationale: keeps TTL behavior while preventing long-term accumulation of high-cardinality keys.
3. **Circuit breaker logs outside the lock**
   - Decision: `_transition_to` only updates state and assembles the log message inside the lock; the actual logger call moves outside.
   - Rationale: shortens the locked section so slow I/O handlers cannot block other request paths.
4. **Uniform injection of global security headers**
   - Decision: add `CSP`, `X-Frame-Options`, `X-Content-Type-Options`, and `Referrer-Policy` in `app.after_request`, plus `HSTS` in production.
   - Rationale: a centralized policy covers all pages and APIs, reducing the chance of omissions.
5. **Consistent pagination bounds**
   - Decision: apply uniform `max(1, min(...))` boundary handling to `page` and `page_size`.
   - Rationale: prevents negative or extreme values from causing unnecessary load or unexpected behavior.
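The fail-fast LDAP check in decision 1 can be sketched as a small validator run at startup. This is an illustrative sketch: the function name is an assumption, and `allowed_hosts` stands in for an env-derived setting such as `LDAP_ALLOWED_HOSTS`.

```python
from urllib.parse import urlsplit

def validate_ldap_url(url, allowed_hosts):
    """Return the URL if it is https and its host is allowlisted; raise otherwise."""
    if not url:
        raise ValueError("LDAP_API_URL is not configured")
    parts = urlsplit(url)
    if parts.scheme != "https":
        raise ValueError(f"LDAP_API_URL must use https, got {parts.scheme!r}")
    if parts.hostname not in allowed_hosts:
        raise ValueError(f"LDAP host {parts.hostname!r} is not in the allowlist")
    return url
```

Raising at startup means a misconfigured endpoint disables the LDAP path before any credentials could be sent to it.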
## Risks / Trade-offs
- **[Risk] An incomplete LDAP allowlist breaks logins** → **Mitigation:** emit a clear error message and local-auth fallback guidance.
- **[Risk] A too-small cache bound lowers the hit rate** → **Mitigation:** make `max_size` configurable, start with a conservative default, and watch telemetry.
- **[Risk] An overly strict CSP breaks existing inline scripts** → **Mitigation:** start with `default-src 'self'` and a compatible policy; fine-tune with nonces/allowlists if needed.
- **[Risk] Behavior changes cause test regressions** → **Mitigation:** add unit/integration tests covering every fix.
## Migration Plan
1. Land the backend fixes first (auth/cache/circuit breaker/app headers/routes).
2. Add tests (LDAP validation, LRU, lock-free logging, headers, pagination bounds).
3. Run the existing health checks and the key integration tests.
4. Update the security and stability sections of README/README.mdj.
5. If compatibility issues appear after deployment, temporarily relax the LDAP host allowlist and CSP details via env.
## Open Questions
- Do the LDAP host allowlists need multiple domains per environment (for example intranet + DR site)?
- Should CSP switch immediately to strict nonce-based mode, or stay on the compatible policy for now?


@@ -0,0 +1,40 @@
## Why
The previous round completed the core stability refactor, but several high-priority risks (LDAP URL validation, unbounded cache growth, lock-held circuit-breaker logging, security-header gaps, pagination lower-bound validation) were still open. These accumulate availability and security risk under long uptime and malicious input, so they are closed together in this round.
## What Changes
- Add startup validation for the LDAP API base URL (restricted to `https` and allowlisted hosts) to remove a controllable SSRF target.
- Add `max_size` with LRU eviction to the process-level cache, preventing unbounded memory growth from high-cardinality keys.
- Rework the circuit-breaker transition flow so logs are never written while the lock is held.
- Add global security headers (CSP, X-Frame-Options, X-Content-Type-Options, Referrer-Policy, HSTS).
- Enforce pagination lower bounds so negative values and unreasonable page sizes never reach the query flow.
- Add tests and documentation for the above, keeping the single port and the existing frontend semantics unchanged.
## Capabilities
### New Capabilities
- `security-surface-hardening`: codifies minimum defenses for the remaining security surface (SSRF protection, security headers, input boundary validation).
### Modified Capabilities
- `cache-observability-hardening`: extends cache governance with bounded process-level capacity and an eviction policy.
- `runtime-resilience-recovery`: adds the circuit-breaker lock-contention fix and compatibility requirements between security headers and operational diagnostic responses.
## Impact
- Affected code:
- `src/mes_dashboard/services/auth_service.py`
- `src/mes_dashboard/core/cache.py`
- `src/mes_dashboard/services/resource_cache.py`
- `src/mes_dashboard/core/circuit_breaker.py`
- `src/mes_dashboard/app.py`
- `src/mes_dashboard/routes/wip_routes.py`
- `tests/`
- `README.md`, `README.mdj`
- APIs:
- `/health`, `/health/deep`
- `/api/wip/detail/<workcenter>`
- `/admin/login` (indirectly affected by LDAP base validation)
- Operational behavior:
- Keeps the single port and the existing report UI flows.
- Strengthens the security and stability defenses without changing existing feature semantics.


@@ -0,0 +1,12 @@
## ADDED Requirements
### Requirement: Process-Level Cache SHALL Use Bounded Capacity with Deterministic Eviction
Process-level parsed-data caches MUST enforce a configurable maximum key capacity and use deterministic eviction behavior when capacity is exceeded.
#### Scenario: Cache capacity reached
- **WHEN** a new cache entry is inserted and key capacity is at limit
- **THEN** cache MUST evict entries according to defined policy before storing the new key
#### Scenario: Repeated access updates recency
- **WHEN** an existing cache key is read or overwritten
- **THEN** eviction order MUST reflect recency semantics so hot keys are retained preferentially


@@ -0,0 +1,12 @@
## ADDED Requirements
### Requirement: Circuit Breaker State Transitions SHALL Avoid Lock-Held Logging
Circuit breaker state transitions MUST avoid executing logger I/O while internal state locks are held.
#### Scenario: State transition occurs
- **WHEN** circuit breaker transitions between CLOSED, OPEN, or HALF_OPEN
- **THEN** lock-protected section MUST complete state mutation before emitting transition log output
#### Scenario: Slow log handler under load
- **WHEN** logger handlers are slow or blocked
- **THEN** circuit breaker lock contention MUST remain bounded and MUST NOT serialize unrelated request paths behind logging latency
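The transition rule above reduces to: mutate state and build the message under the lock, emit the log after releasing it. A minimal sketch (class and method names are illustrative; `emit` stands in for the real logger call):

```python
import threading

class Breaker:
    """Sketch of a breaker whose transition log is emitted outside the lock."""

    def __init__(self, emit=print):
        self._lock = threading.Lock()
        self.state = "CLOSED"
        self._emit = emit

    def transition_to(self, new_state):
        with self._lock:
            old, self.state = self.state, new_state
            message = f"circuit breaker: {old} -> {new_state}"
        self._emit(message)  # logger I/O happens with the lock released
```

Even if `emit` blocks for seconds, other threads can still acquire the lock and observe or change breaker state.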


@@ -0,0 +1,34 @@
## ADDED Requirements
### Requirement: LDAP Authentication Endpoint Configuration SHALL Be Strictly Validated
The system MUST validate LDAP authentication endpoint configuration before use, including HTTPS scheme enforcement and host allowlist checks.
#### Scenario: Invalid LDAP URL configuration detected
- **WHEN** `LDAP_API_URL` is missing, non-HTTPS, or points to a host outside the configured allowlist
- **THEN** the service MUST reject LDAP authentication calls and emit actionable diagnostics without sending credentials to that endpoint
#### Scenario: Valid LDAP URL configuration accepted
- **WHEN** `LDAP_API_URL` uses HTTPS and host is allowlisted
- **THEN** LDAP authentication requests MAY proceed with normal timeout and error handling behavior
### Requirement: Security Response Headers SHALL Be Applied Globally
All HTTP responses MUST include baseline security headers suitable for dashboard and API traffic.
#### Scenario: Standard response emitted
- **WHEN** any route returns a response
- **THEN** response MUST include `Content-Security-Policy`, `X-Frame-Options`, `X-Content-Type-Options`, and `Referrer-Policy`
#### Scenario: Production transport hardening
- **WHEN** runtime environment is production
- **THEN** response MUST include `Strict-Transport-Security`
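A framework-agnostic sketch of the header requirement; in Flask this logic would run inside an `after_request` hook. The header values below are assumed conservative defaults, not the project's exact policy.

```python
def apply_security_headers(headers: dict, production: bool = False) -> dict:
    """Add baseline security headers without overwriting explicit route values."""
    headers.setdefault("Content-Security-Policy", "default-src 'self'")
    headers.setdefault("X-Frame-Options", "DENY")
    headers.setdefault("X-Content-Type-Options", "nosniff")
    headers.setdefault("Referrer-Policy", "same-origin")
    if production:
        headers.setdefault("Strict-Transport-Security",
                           "max-age=31536000; includeSubDomains")
    return headers
```

Using `setdefault` lets individual routes tighten or relax a header while the global hook guarantees a floor.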
### Requirement: Pagination Input Boundaries SHALL Be Enforced
Endpoints accepting pagination parameters MUST enforce lower and upper bounds before query execution.
#### Scenario: Negative or zero pagination inputs
- **WHEN** client sends `page <= 0` or `page_size <= 0`
- **THEN** server MUST normalize values to minimum supported bounds
#### Scenario: Excessive page size requested
- **WHEN** client sends `page_size` above configured maximum
- **THEN** server MUST clamp to maximum supported page size
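Both scenarios reduce to the `max(1, min(...))` pattern the round-2 design names. A minimal sketch (the helper name and the default maximum are assumptions):

```python
def clamp_pagination(page, page_size, max_page_size=500):
    """Normalize pagination inputs to page >= 1 and 1 <= page_size <= max."""
    page = max(1, int(page))
    page_size = max(1, min(int(page_size), max_page_size))
    return page, page_size
```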


@@ -0,0 +1,24 @@
## 1. LDAP Endpoint Hardening
- [x] 1.1 Add strict `LDAP_API_URL` validation (`https` + allowlisted hosts) in auth service initialization.
- [x] 1.2 Add tests for valid/invalid LDAP URL configurations and ensure unsafe URLs are rejected without outbound auth call.
## 2. Bounded Process Cache
- [x] 2.1 Extend `ProcessLevelCache` with configurable `max_size` and LRU eviction behavior.
- [x] 2.2 Wire bounded cache configuration for WIP/Resource process-level caches and add regression tests.
## 3. Circuit Breaker Lock Contention Reduction
- [x] 3.1 Refactor circuit breaker transition logging to execute outside lock-protected section.
- [x] 3.2 Add tests verifying transition logs are emitted while state mutation remains correct.
## 4. HTTP Security Headers and Input Boundary Validation
- [x] 4.1 Add global `after_request` security headers (CSP, frame, content-type, referrer, HSTS in production).
- [x] 4.2 Tighten pagination boundary handling (`page`/`page_size`) for WIP detail endpoint and add tests.
## 5. Validation and Documentation
- [x] 5.1 Run targeted backend/frontend tests plus benchmark smoke to confirm no behavior regression.
- [x] 5.2 Update `README.md` and `README.mdj` with round-2 security/stability hardening notes.


@@ -0,0 +1,33 @@
# api-safety-hygiene Specification
## Purpose
TBD - created by archiving change residual-hardening-round3. Update Purpose after archive.
## Requirements
### Requirement: Recursive Payload Cleaning MUST Enforce Depth Safety
Routes that normalize nested payloads MUST prevent unbounded recursion depth.
#### Scenario: Deeply nested response object
- **WHEN** NaN-cleaning helper receives deeply nested list/dict payload
- **THEN** cleaning logic MUST enforce max depth or iterative traversal and return safely without recursion failure
### Requirement: Filter Source Names MUST Be Configurable
Filter cache query sources MUST NOT rely on hardcoded view names only.
#### Scenario: Environment-specific view names
- **WHEN** deployment sets custom filter-source environment variables
- **THEN** filter cache loader MUST resolve and query configured view names
### Requirement: High-Cost APIs SHALL Apply Basic Rate Guardrails
High-cost read endpoints SHALL apply configurable request-rate guardrails to reduce abuse and accidental bursts.
#### Scenario: Burst traffic from same client
- **WHEN** a client exceeds configured request budget for guarded endpoints
- **THEN** endpoint SHALL return throttled response with clear retry guidance
### Requirement: Common Boolean Query Parsing SHALL Be Shared
Boolean query parsing in routes SHALL use shared helper behavior.
#### Scenario: Different routes parse include flags
- **WHEN** routes parse common boolean query parameters
- **THEN** parsing behavior MUST be consistent across routes via shared utility


@@ -0,0 +1,26 @@
# cache-indexed-query-acceleration Specification
## Purpose
TBD - created by archiving change p1-cache-query-efficiency. Update Purpose after archive.
## Requirements
### Requirement: Incremental Synchronization SHALL Use Versioned Watermarks
For heavy non-full-snapshot datasets, cache refresh SHALL support incremental synchronization keyed by stable version or watermark boundaries.
#### Scenario: Incremental refresh cycle
- **WHEN** source data version indicates partial changes since last sync
- **THEN** cache update logic MUST fetch and merge only changed partitions while preserving correctness guarantees
### Requirement: Query Paths SHALL Use Indexed Access for High-Frequency Filters
Query execution over cached data SHALL use prebuilt indexes for known high-frequency filter columns.
#### Scenario: Filtered report query
- **WHEN** request filters target indexed fields
- **THEN** result selection MUST avoid full dataset scans and maintain existing response contract
### Requirement: Business-Mandated Full-Table Caches SHALL Be Preserved for Resource and WIP
The system SHALL continue to maintain full-table cache behavior for `resource` and `wip` domains.
#### Scenario: Resource or WIP cache refresh
- **WHEN** cache update runs for `resource` or `wip`
- **THEN** the updater MUST retain full-table snapshot semantics and MUST NOT switch these domains to partial-only cache mode


@@ -36,3 +36,53 @@ The system MUST define alert thresholds for sustained degraded state, repeated w
- **WHEN** degraded status persists beyond configured duration
- **THEN** the monitoring contract MUST classify the service as alert-worthy with actionable context
### Requirement: Cache Telemetry SHALL Include Memory Amplification Signals
Operational telemetry MUST expose cache-domain memory usage indicators and representation amplification factors, and MUST differentiate between authoritative data payload and derived/index helper structures.
#### Scenario: Deep health telemetry request after representation normalization
- **WHEN** operators inspect cache telemetry for resource or WIP domains
- **THEN** telemetry MUST include per-domain memory footprint, amplification indicators, and enough structure detail to verify that full-record duplication is not reintroduced
### Requirement: Efficiency Benchmarks SHALL Gate Cache Refactor Rollout
Cache/query efficiency changes MUST be validated against baseline latency and memory benchmarks before rollout.
#### Scenario: Pre-release validation
- **WHEN** cache refactor changes are prepared for deployment
- **THEN** benchmark results MUST demonstrate no regression beyond configured thresholds for P95 latency and memory usage
### Requirement: Process-Level Cache SHALL Use Bounded Capacity with Deterministic Eviction
Process-level parsed-data caches MUST enforce a configurable maximum key capacity and use deterministic eviction behavior when capacity is exceeded.
#### Scenario: Cache capacity reached
- **WHEN** a new cache entry is inserted and key capacity is at limit
- **THEN** cache MUST evict entries according to defined policy before storing the new key
#### Scenario: Repeated access updates recency
- **WHEN** an existing cache key is read or overwritten
- **THEN** eviction order MUST reflect recency semantics so hot keys are retained preferentially
### Requirement: Cache Publish MUST Preserve Previous Readable Snapshot on Failure
When refreshing full-table cache payloads, the system MUST avoid exposing partially published states to readers.
#### Scenario: Publish fails after payload serialization
- **WHEN** a cache refresh has prepared new payload but publish operation fails
- **THEN** previously published cache keys MUST remain readable and metadata MUST remain consistent with old snapshot
#### Scenario: Publish succeeds
- **WHEN** publish operation completes successfully
- **THEN** data payload and metadata keys MUST be visible as one coherent new snapshot
### Requirement: Process-Level Cache Slow Path SHALL Minimize Lock Hold Time
Large payload parsing MUST NOT happen inside long-held process cache locks.
#### Scenario: Cache miss under concurrent requests
- **WHEN** multiple requests hit process cache miss
- **THEN** parsing work SHALL happen outside lock-protected mutation section, and lock scope SHALL be limited to consistency check + commit
### Requirement: Process-Level Cache Policies MUST Stay Consistent Across Services
All service-local process caches MUST support bounded capacity with deterministic eviction.
#### Scenario: Realtime equipment cache growth
- **WHEN** realtime equipment process cache reaches configured capacity
- **THEN** entries MUST be evicted according to deterministic LRU behavior


@@ -24,3 +24,17 @@ Runbooks and deployment documentation MUST describe the same conda/systemd/watch
- **WHEN** an operator performs deploy, health check, and rollback from documentation
- **THEN** documented commands and paths MUST work without requiring venv-specific assumptions
### Requirement: Runtime Path Drift SHALL Be Detectable Before Service Start
Service startup checks MUST validate configured conda runtime paths across app, watchdog, and worker control scripts.
#### Scenario: Conda path mismatch detected
- **WHEN** startup validation finds runtime path inconsistency between configured units and scripts
- **THEN** service start MUST fail with actionable diagnostics instead of running with partial mismatch
### Requirement: Conda/Systemd Contract SHALL Be Versioned in Operations Docs
The documented runtime contract MUST include versioned path assumptions and verification commands.
#### Scenario: Operator verifies deployment contract
- **WHEN** operator follows runbook validation steps
- **THEN** commands MUST confirm active runtime paths match documented conda/systemd contract


@@ -50,3 +50,17 @@ Frontend matrix/filter computations SHALL produce deterministic selection and fi
- **WHEN** users toggle matrix cells across group, family, and resource rows
- **THEN** selected-state rendering and filtered equipment result sets MUST remain level-correct and reversible
### Requirement: Reusable Browser Compute Modules SHALL Power Report Derivations
Derived computations for report filters, KPI cards, chart series, and table projections SHALL be implemented through reusable frontend modules.
#### Scenario: Shared report derivation logic
- **WHEN** multiple report pages require equivalent data-shaping behavior
- **THEN** pages MUST consume shared compute modules instead of duplicating transformation logic per page
### Requirement: Browser Compute Shift SHALL Preserve Export and Field Contracts
Moving computations to frontend MUST preserve existing field naming and export column contracts.
#### Scenario: User exports report after frontend-side derivation
- **WHEN** transformed data is rendered and exported
- **THEN** exported field names and ordering MUST remain consistent with governed field contract definitions


@@ -0,0 +1,19 @@
# maintainability-type-and-constant-hygiene Specification
## Purpose
TBD - created by archiving change residual-hardening-round4. Update Purpose after archive.
## Requirements
### Requirement: Core Cache and Service Boundaries MUST Use Consistent Type Annotation Style
Core cache/service modules touched by this change SHALL use a consistent and explicit type-annotation style for public and internal helper boundaries.
#### Scenario: Reviewing updated cache/service modules
- **WHEN** maintainers inspect function signatures in affected modules
- **THEN** optional and collection types MUST follow a single consistent style and remain compatible with the project Python baseline
### Requirement: High-Frequency Magic Numbers MUST Be Replaced by Named Constants
Cache, throttling, and index-related numeric literals that control behavior MUST be extracted to named constants or env-configurable settings.
#### Scenario: Tuning cache/index behavior
- **WHEN** operators need to tune cache/index thresholds
- **THEN** they MUST find values in named constants or environment variables rather than scattered inline literals


@@ -0,0 +1,19 @@
# oracle-query-fragment-governance Specification
## Purpose
TBD - created by archiving change residual-hardening-round4. Update Purpose after archive.
## Requirements
### Requirement: Shared Oracle Query Fragments SHALL Have a Single Source of Truth
Cross-service Oracle query fragments for resource and equipment cache loading MUST be defined in a shared module and imported by service implementations.
#### Scenario: Update common table/view reference
- **WHEN** a common table or view name changes
- **THEN** operators and developers MUST be able to update one shared definition without editing duplicated SQL literals across services
### Requirement: Service Queries MUST Preserve Existing Columns and Semantics
Services consuming shared Oracle query fragments SHALL preserve existing selected columns, filters, and downstream payload behavior.
#### Scenario: Resource and equipment cache refresh after refactor
- **WHEN** cache services execute queries via shared fragments
- **THEN** resulting payload structure MUST remain compatible with existing aggregation and API contracts
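
An illustrative shape for such a shared-fragment module (module path, view name, and column list are hypothetical stand-ins, not the repository's actual definitions):

```python
# Hypothetical shared module, e.g. mes_dashboard/core/query_fragments.py:
# one definition of common table/view references, imported by services.
RESOURCE_BASE_VIEW = "MES.V_RESOURCE_STATUS"  # illustrative view name

RESOURCE_CACHE_COLUMNS = (
    "RESOURCE_ID",
    "WORKCENTER_GROUP",
    "STATUS",
)


def resource_cache_query(extra_where: str = "") -> str:
    """Compose the shared SELECT; services add only service-specific filters."""
    columns = ", ".join(RESOURCE_CACHE_COLUMNS)
    where = f" WHERE {extra_where}" if extra_where else ""
    return f"SELECT {columns} FROM {RESOURCE_BASE_VIEW}{where}"
```

Renaming the view then means editing one constant instead of duplicated SQL literals across services, while consumers keep the same selected columns.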


@@ -0,0 +1,26 @@
# resource-cache-representation-normalization Specification
## Purpose
TBD - created by archiving change residual-hardening-round4. Update Purpose after archive.
## Requirements
### Requirement: Resource Derived Index MUST Avoid Full Record Duplication
Resource derived index SHALL use lightweight row-position references instead of storing full duplicated record payloads alongside the process DataFrame cache.
#### Scenario: Build index from cached DataFrame
- **WHEN** resource cache data is parsed from Redis into process-level DataFrame
- **THEN** the derived index MUST store position-based references and metadata without a second full records copy
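
A minimal sketch of position-based indexing over a single cached DataFrame (column values are sample data; the real index also carries metadata):

```python
import pandas as pd

df = pd.DataFrame({
    "RESOURCE_ID": ["R1", "R2", "R3", "R4"],
    "WORKCENTER_GROUP": ["WC-01", "WC-02", "WC-01", "WC-01"],
})

# Index stores integer row positions, not a second full copy of the records.
positions_by_group = {
    str(key): [int(i) for i in rows]
    for key, rows in df.groupby("WORKCENTER_GROUP", sort=False).indices.items()
}

# Lookups materialize rows from the single DataFrame on demand.
wc01 = df.iloc[positions_by_group["WC-01"]]
```

Because only integer positions are duplicated, index memory stays small relative to the DataFrame, which is what the amplification telemetry in this change measures.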
### Requirement: Resource Query APIs SHALL Preserve Existing Response Contract
Resource query APIs MUST keep existing output fields and semantics after index representation normalization.
#### Scenario: Read all resources after normalization
- **WHEN** callers request all resources or filtered resource lists
- **THEN** the returned payload MUST remain field-compatible with pre-normalization responses
### Requirement: Cache Invalidation MUST Keep Index/Data Coherent
The system SHALL invalidate and rebuild DataFrame/index representations atomically at cache refresh boundaries.
#### Scenario: Redis-backed cache refresh completes
- **WHEN** a new resource cache snapshot is published
- **THEN** stale index references MUST be invalidated before subsequent reads use refreshed DataFrame data


@@ -48,3 +48,47 @@ The system MUST expose machine-readable resilience thresholds, restart-churn ind
#### Scenario: Admin status includes restart churn summary
- **WHEN** operators call `/admin/api/system-status` or `/admin/api/worker/status`
- **THEN** responses MUST include bounded restart history summary within a configured time window and indicate whether churn threshold is exceeded
### Requirement: Recovery Recommendations SHALL Reflect Self-Healing Policy State
Health and admin resilience payloads MUST expose whether automated recovery is allowed, cooling down, or blocked by churn policy.
#### Scenario: Operator inspects degraded state
- **WHEN** `/health` or `/admin/api/worker/status` is requested during degradation
- **THEN** response MUST include policy state, cooldown remaining time, and next recommended action
### Requirement: Manual Recovery Override SHALL Be Explicit and Controlled
Manual restart actions MUST bypass automatic block only through authenticated operator pathways with explicit acknowledgement.
#### Scenario: Churn-blocked state with manual override request
- **WHEN** authorized admin requests manual restart while auto-recovery is blocked
- **THEN** system MUST execute controlled restart path and log the override context for auditability
### Requirement: Circuit Breaker State Transitions SHALL Avoid Lock-Held Logging
Circuit breaker state transitions MUST avoid executing logger I/O while internal state locks are held.
#### Scenario: State transition occurs
- **WHEN** circuit breaker transitions between CLOSED, OPEN, or HALF_OPEN
- **THEN** lock-protected section MUST complete state mutation before emitting transition log output
#### Scenario: Slow log handler under load
- **WHEN** logger handlers are slow or blocked
- **THEN** circuit breaker lock contention MUST remain bounded and MUST NOT serialize unrelated request paths behind logging latency
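
A minimal sketch of the lock-scoping pattern this requirement describes (class and logger names hypothetical): mutate state under the lock, emit the transition log only after release.

```python
import logging
import threading

logging.basicConfig(level=logging.INFO)
logger = logging.getLogger("breaker")


class Breaker:
    """Sketch: state mutation inside the lock, logger I/O outside it."""

    def __init__(self) -> None:
        self._lock = threading.Lock()
        self._state = "CLOSED"

    def transition(self, new_state: str) -> None:
        with self._lock:
            old_state, self._state = self._state, new_state
        # Logger I/O happens outside the lock, so a slow or blocked log
        # handler cannot serialize unrelated requests behind logging latency.
        logger.info("circuit breaker %s -> %s", old_state, new_state)


b = Breaker()
b.transition("OPEN")
```

The trade-off is that under heavy concurrency log lines may interleave out of strict transition order, which is acceptable here because ordering is recoverable from the logged states.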
### Requirement: Health Endpoints SHALL Use Short Internal Memoization
Health and deep-health computation SHALL use a short-lived internal cache to prevent probe storms from amplifying backend load.
#### Scenario: Frequent monitor scrapes
- **WHEN** health endpoints are called repeatedly within a small window
- **THEN** service SHALL return memoized payload for up to 5 seconds in non-testing environments
#### Scenario: Testing mode
- **WHEN** app is running in testing mode
- **THEN** health endpoint memoization MUST be bypassed to preserve deterministic tests
### Requirement: Logs MUST Redact Connection Secrets
Runtime logs MUST avoid exposing DB connection credentials.
#### Scenario: Connection string appears in log message
- **WHEN** a log message contains DB URL credentials
- **THEN** logger output MUST redact password and sensitive userinfo before emission
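
One way to implement this as a `logging.Filter` (a sketch; the project's `install_log_redaction_filter` may differ): mask the password portion of any URL-style userinfo before the record is emitted.

```python
import logging
import re

# Matches "scheme://user:password@" and keeps everything but the password.
_URL_CREDS = re.compile(r"://([^/:@\s]+):([^@\s]+)@")


class RedactSecretsFilter(logging.Filter):
    """Sketch of a logging filter that masks userinfo in connection URLs."""

    def filter(self, record: logging.LogRecord) -> bool:
        message = record.getMessage()  # resolve %-style args first
        record.msg = _URL_CREDS.sub(r"://\1:***@", message)
        record.args = None
        return True


logger = logging.getLogger("redaction-demo")
logger.addFilter(RedactSecretsFilter())
```

Attaching the filter at the application logger keeps redaction centralized rather than relying on every call site to sanitize messages.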


@@ -0,0 +1,38 @@
# security-surface-hardening Specification
## Purpose
TBD - created by archiving change security-stability-hardening-round2. Update Purpose after archive.
## Requirements
### Requirement: LDAP Authentication Endpoint Configuration SHALL Be Strictly Validated
The system MUST validate LDAP authentication endpoint configuration before use, including HTTPS scheme enforcement and host allowlist checks.
#### Scenario: Invalid LDAP URL configuration detected
- **WHEN** `LDAP_API_URL` is missing, non-HTTPS, or points to a host outside the configured allowlist
- **THEN** the service MUST reject LDAP authentication calls and emit actionable diagnostics without sending credentials to that endpoint
#### Scenario: Valid LDAP URL configuration accepted
- **WHEN** `LDAP_API_URL` uses HTTPS and host is allowlisted
- **THEN** LDAP authentication requests MAY proceed with normal timeout and error handling behavior
### Requirement: Security Response Headers SHALL Be Applied Globally
All HTTP responses MUST include baseline security headers suitable for dashboard and API traffic.
#### Scenario: Standard response emitted
- **WHEN** any route returns a response
- **THEN** response MUST include `Content-Security-Policy`, `X-Frame-Options`, `X-Content-Type-Options`, and `Referrer-Policy`
#### Scenario: Production transport hardening
- **WHEN** runtime environment is production
- **THEN** response MUST include `Strict-Transport-Security`
### Requirement: Pagination Input Boundaries SHALL Be Enforced
Endpoints accepting pagination parameters MUST enforce lower and upper bounds before query execution.
#### Scenario: Negative or zero pagination inputs
- **WHEN** client sends `page <= 0` or `page_size <= 0`
- **THEN** server MUST normalize values to minimum supported bounds
#### Scenario: Excessive page size requested
- **WHEN** client sends `page_size` above configured maximum
- **THEN** server MUST clamp to maximum supported page size
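
The clamping behavior can be sketched in a few lines (the ceiling value is illustrative, not the configured maximum):

```python
MIN_PAGE = 1
MIN_PAGE_SIZE = 1
MAX_PAGE_SIZE = 200  # illustrative ceiling


def clamp_pagination(page: int, page_size: int) -> tuple[int, int]:
    """Normalize out-of-range pagination inputs before query execution."""
    page = max(page, MIN_PAGE)
    page_size = min(max(page_size, MIN_PAGE_SIZE), MAX_PAGE_SIZE)
    return page, page_size
```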


@@ -0,0 +1,26 @@
# worker-self-healing-governance Specification
## Purpose
TBD - created by archiving change p2-ops-self-healing-runbook. Update Purpose after archive.
## Requirements
### Requirement: Automated Worker Recovery SHALL Use Bounded Policy Guards
Automated worker restart behavior MUST enforce cooldown periods and bounded restart attempts within a configurable time window.
#### Scenario: Repeated worker degradation within short window
- **WHEN** degradation events exceed configured restart-attempt budget
- **THEN** automated restarts MUST pause and surface a blocked-recovery signal for operator intervention
### Requirement: Restart-Churn Protection SHALL Prevent Recovery Storms
The runtime MUST classify restart churn and prevent uncontrolled restart loops.
#### Scenario: Churn threshold exceeded
- **WHEN** restart count crosses churn threshold in active window
- **THEN** watchdog MUST enter guarded mode and require explicit manual override before further restart attempts
### Requirement: Recovery Decisions SHALL Be Audit-Ready
Every auto-recovery decision and manual override action MUST be recorded with structured metadata.
#### Scenario: Worker restart decision emitted
- **WHEN** system executes or denies a restart action
- **THEN** structured logs/events MUST include reason, thresholds, actor/source, and resulting state
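
A simplified sketch of such a bounded decision policy (constants and return shape are illustrative; the real `worker_recovery_policy` module is richer):

```python
import time

COOLDOWN_SECONDS = 60
RETRY_BUDGET = 3
WINDOW_SECONDS = 600


def decide_restart(history: list[float], *, manual_override: bool = False,
                   override_acknowledged: bool = False, now=time.time) -> dict:
    """Sketch of a cooldown + retry-budget + override decision."""
    current = now()
    recent = [t for t in history if current - t <= WINDOW_SECONDS]
    if manual_override:
        # Override bypasses churn guard only with explicit acknowledgement.
        allowed = override_acknowledged
        reason = "manual_override" if allowed else "override_requires_ack"
        return {"allowed": allowed, "reason": reason, "recent": len(recent)}
    if len(recent) >= RETRY_BUDGET:
        return {"allowed": False, "reason": "churn_guarded", "recent": len(recent)}
    if recent and current - max(recent) < COOLDOWN_SECONDS:
        return {"allowed": False, "reason": "cooldown", "recent": len(recent)}
    return {"allowed": True, "reason": "within_budget", "recent": len(recent)}
```

The returned dict doubles as the audit payload: every allow/deny carries the reason and the window statistics that produced it.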

scripts/run_cache_benchmarks.py Executable file

@@ -0,0 +1,223 @@
#!/usr/bin/env python3
# -*- coding: utf-8 -*-
"""Benchmark cache query baseline vs indexed selection.

This benchmark is used as a repeatable governance harness for P1 cache/query
efficiency work. It focuses on deterministic synthetic workloads so operators
can compare relative latency and memory amplification over time.
"""
from __future__ import annotations

import argparse
import json
import math
import random
import statistics
import time
from pathlib import Path
from typing import Any

import numpy as np
import pandas as pd

ROOT = Path(__file__).resolve().parents[1]
FIXTURE_PATH = ROOT / "tests" / "fixtures" / "cache_benchmark_fixture.json"


def load_fixture(path: Path = FIXTURE_PATH) -> dict[str, Any]:
    payload = json.loads(path.read_text())
    if "rows" not in payload:
        raise ValueError("fixture requires rows")
    return payload


def build_dataset(rows: int, seed: int) -> pd.DataFrame:
    random.seed(seed)
    np.random.seed(seed)
    workcenters = [f"WC-{idx:02d}" for idx in range(1, 31)]
    packages = ["QFN", "DFN", "SOT", "SOP", "BGA", "TSOP"]
    types = ["TYPE-A", "TYPE-B", "TYPE-C", "TYPE-D"]
    statuses = ["RUN", "QUEUE", "HOLD"]
    hold_reasons = ["", "", "", "YieldLimit", "特殊需求管控", "PM Hold"]
    frame = pd.DataFrame(
        {
            "WORKCENTER_GROUP": np.random.choice(workcenters, rows),
            "PACKAGE_LEF": np.random.choice(packages, rows),
            "PJ_TYPE": np.random.choice(types, rows),
            "WIP_STATUS": np.random.choice(statuses, rows, p=[0.45, 0.35, 0.20]),
            "HOLDREASONNAME": np.random.choice(hold_reasons, rows),
            "QTY": np.random.randint(1, 500, rows),
            "WORKORDER": [f"WO-{i:06d}" for i in range(rows)],
            "LOTID": [f"LOT-{i:07d}" for i in range(rows)],
        }
    )
    return frame


def _build_index(df: pd.DataFrame) -> dict[str, dict[str, set[int]]]:
    def by_column(column: str) -> dict[str, set[int]]:
        grouped = df.groupby(column, dropna=True, sort=False).indices
        return {str(k): {int(i) for i in v} for k, v in grouped.items()}

    return {
        "workcenter": by_column("WORKCENTER_GROUP"),
        "package": by_column("PACKAGE_LEF"),
        "type": by_column("PJ_TYPE"),
        "status": by_column("WIP_STATUS"),
    }


def _baseline_query(df: pd.DataFrame, query: dict[str, str]) -> int:
    subset = df
    if query.get("workcenter"):
        subset = subset[subset["WORKCENTER_GROUP"] == query["workcenter"]]
    if query.get("package"):
        subset = subset[subset["PACKAGE_LEF"] == query["package"]]
    if query.get("type"):
        subset = subset[subset["PJ_TYPE"] == query["type"]]
    if query.get("status"):
        subset = subset[subset["WIP_STATUS"] == query["status"]]
    return int(len(subset))


def _indexed_query(_df: pd.DataFrame, indexes: dict[str, dict[str, set[int]]], query: dict[str, str]) -> int:
    selected: set[int] | None = None
    for key, bucket in (
        ("workcenter", "workcenter"),
        ("package", "package"),
        ("type", "type"),
        ("status", "status"),
    ):
        current = indexes[bucket].get(query.get(key, ""))
        if current is None:
            return 0
        if selected is None:
            selected = set(current)
        else:
            selected.intersection_update(current)
        if not selected:
            return 0
    return len(selected or ())


def _build_queries(df: pd.DataFrame, query_count: int, seed: int) -> list[dict[str, str]]:
    random.seed(seed + 17)
    workcenters = sorted(df["WORKCENTER_GROUP"].dropna().astype(str).unique().tolist())
    packages = sorted(df["PACKAGE_LEF"].dropna().astype(str).unique().tolist())
    types = sorted(df["PJ_TYPE"].dropna().astype(str).unique().tolist())
    statuses = sorted(df["WIP_STATUS"].dropna().astype(str).unique().tolist())
    queries: list[dict[str, str]] = []
    for _ in range(query_count):
        queries.append(
            {
                "workcenter": random.choice(workcenters),
                "package": random.choice(packages),
                "type": random.choice(types),
                "status": random.choice(statuses),
            }
        )
    return queries


def _p95(values: list[float]) -> float:
    if not values:
        return 0.0
    sorted_values = sorted(values)
    index = min(max(math.ceil(0.95 * len(sorted_values)) - 1, 0), len(sorted_values) - 1)
    return sorted_values[index]


def run_benchmark(rows: int, query_count: int, seed: int) -> dict[str, Any]:
    df = build_dataset(rows=rows, seed=seed)
    queries = _build_queries(df, query_count=query_count, seed=seed)
    indexes = _build_index(df)

    baseline_latencies: list[float] = []
    indexed_latencies: list[float] = []
    baseline_rows: list[int] = []
    indexed_rows: list[int] = []
    for query in queries:
        start = time.perf_counter()
        baseline_rows.append(_baseline_query(df, query))
        baseline_latencies.append((time.perf_counter() - start) * 1000)

        start = time.perf_counter()
        indexed_rows.append(_indexed_query(df, indexes, query))
        indexed_latencies.append((time.perf_counter() - start) * 1000)

    if baseline_rows != indexed_rows:
        raise AssertionError("benchmark correctness drift: indexed result mismatch")

    frame_bytes = int(df.memory_usage(index=True, deep=True).sum())
    index_entries = sum(len(bucket) for buckets in indexes.values() for bucket in buckets.values())
    index_bytes_estimate = int(index_entries * 16)
    baseline_p95 = _p95(baseline_latencies)
    indexed_p95 = _p95(indexed_latencies)
    return {
        "rows": rows,
        "query_count": query_count,
        "seed": seed,
        "latency_ms": {
            "baseline_avg": round(statistics.fmean(baseline_latencies), 4),
            "baseline_p95": round(baseline_p95, 4),
            "indexed_avg": round(statistics.fmean(indexed_latencies), 4),
            "indexed_p95": round(indexed_p95, 4),
            "p95_ratio_indexed_vs_baseline": round(
                (indexed_p95 / baseline_p95) if baseline_p95 > 0 else 0.0,
                4,
            ),
        },
        "memory_bytes": {
            "frame": frame_bytes,
            "index_estimate": index_bytes_estimate,
            "amplification_ratio": round(
                (frame_bytes + index_bytes_estimate) / max(frame_bytes, 1),
                4,
            ),
        },
    }


def main() -> int:
    fixture = load_fixture()
    parser = argparse.ArgumentParser(description="Run cache baseline vs indexed benchmark")
    parser.add_argument("--rows", type=int, default=int(fixture.get("rows", 30000)))
    parser.add_argument("--queries", type=int, default=int(fixture.get("query_count", 400)))
    parser.add_argument("--seed", type=int, default=int(fixture.get("seed", 42)))
    parser.add_argument("--enforce", action="store_true")
    args = parser.parse_args()

    report = run_benchmark(rows=args.rows, query_count=args.queries, seed=args.seed)
    print(json.dumps(report, ensure_ascii=False, indent=2))
    if not args.enforce:
        return 0

    thresholds = fixture.get("thresholds") or {}
    max_latency_ratio = float(thresholds.get("max_p95_ratio_indexed_vs_baseline", 1.25))
    max_amplification = float(thresholds.get("max_memory_amplification_ratio", 1.8))
    latency_ratio = float(report["latency_ms"]["p95_ratio_indexed_vs_baseline"])
    amplification_ratio = float(report["memory_bytes"]["amplification_ratio"])
    if latency_ratio > max_latency_ratio:
        raise SystemExit(
            f"Latency regression: {latency_ratio:.4f} > max allowed {max_latency_ratio:.4f}"
        )
    if amplification_ratio > max_amplification:
        raise SystemExit(
            f"Memory amplification regression: {amplification_ratio:.4f} > max allowed {max_amplification:.4f}"
        )
    return 0


if __name__ == "__main__":
    raise SystemExit(main())

scripts/start_server.sh Normal file → Executable file

@@ -9,7 +9,7 @@ set -uo pipefail
# Configuration
# ============================================================
ROOT="$(cd "$(dirname "${BASH_SOURCE[0]}")/.." && pwd)"
-CONDA_ENV="mes-dashboard"
+CONDA_ENV="${CONDA_ENV_NAME:-mes-dashboard}"
APP_NAME="mes-dashboard"
PID_FILE_DEFAULT="${ROOT}/tmp/gunicorn.pid"
PID_FILE="${WATCHDOG_PID_FILE:-${PID_FILE_DEFAULT}}"
@@ -56,7 +56,7 @@ timestamp() {
resolve_runtime_paths() {
    WATCHDOG_RUNTIME_DIR="${WATCHDOG_RUNTIME_DIR:-${ROOT}/tmp}"
    WATCHDOG_RESTART_FLAG="${WATCHDOG_RESTART_FLAG:-${WATCHDOG_RUNTIME_DIR}/mes_dashboard_restart.flag}"
-    WATCHDOG_PID_FILE="${WATCHDOG_PID_FILE:-${PID_FILE_DEFAULT}}"
+    WATCHDOG_PID_FILE="${WATCHDOG_PID_FILE:-${WATCHDOG_RUNTIME_DIR}/gunicorn.pid}"
    WATCHDOG_STATE_FILE="${WATCHDOG_STATE_FILE:-${WATCHDOG_RUNTIME_DIR}/mes_dashboard_restart_state.json}"
    PID_FILE="${WATCHDOG_PID_FILE}"
    export WATCHDOG_RUNTIME_DIR WATCHDOG_RESTART_FLAG WATCHDOG_PID_FILE WATCHDOG_STATE_FILE
@@ -81,8 +81,14 @@ check_conda() {
        return 1
    fi

    if [ -n "${CONDA_BIN:-}" ] && [ ! -x "${CONDA_BIN}" ]; then
        log_error "CONDA_BIN is set but not executable: ${CONDA_BIN}"
        return 1
    fi

    # Source conda
-    source "$(conda info --base)/etc/profile.d/conda.sh"
+    local conda_cmd="${CONDA_BIN:-$(command -v conda)}"
+    source "$(${conda_cmd} info --base)/etc/profile.d/conda.sh"

    # Check if environment exists
    if ! conda env list | grep -q "^${CONDA_ENV} "; then
@@ -95,6 +101,33 @@ check_conda() {
    return 0
}

validate_runtime_contract() {
    conda activate "$CONDA_ENV"
    export PYTHONPATH="${ROOT}/src:${PYTHONPATH:-}"

    if python - <<'PY'
import os
import sys

from mes_dashboard.core.runtime_contract import build_runtime_contract_diagnostics

strict = os.getenv("RUNTIME_CONTRACT_ENFORCE", "true").strip().lower() in {"1", "true", "yes", "on"}
diag = build_runtime_contract_diagnostics(strict=strict)
if not diag["valid"]:
    for error in diag["errors"]:
        print(f"RUNTIME_CONTRACT_ERROR: {error}")
    raise SystemExit(1)
PY
    then
        log_success "Runtime contract validation passed"
        return 0
    fi

    log_error "Runtime contract validation failed"
    log_info "Fix env vars: WATCHDOG_RUNTIME_DIR / WATCHDOG_RESTART_FLAG / WATCHDOG_PID_FILE / WATCHDOG_STATE_FILE / CONDA_BIN"
    return 1
}

check_dependencies() {
    conda activate "$CONDA_ENV"
@@ -329,6 +362,7 @@ run_all_checks() {
    check_env_file
    load_env
    resolve_runtime_paths
    validate_runtime_contract || return 1
    check_port || return 1
    check_database
    check_redis

scripts/worker_watchdog.py Normal file → Executable file

@@ -31,6 +31,23 @@ import time
from datetime import datetime
from pathlib import Path

PROJECT_ROOT = Path(__file__).resolve().parents[1]
SRC_ROOT = PROJECT_ROOT / "src"
if str(SRC_ROOT) not in sys.path:
    sys.path.insert(0, str(SRC_ROOT))

from mes_dashboard.core.runtime_contract import (  # noqa: E402
    build_runtime_contract_diagnostics,
    load_runtime_contract,
)
from mes_dashboard.core.worker_recovery_policy import (  # noqa: E402
    decide_restart_request,
    evaluate_worker_recovery_state,
    extract_last_requested_at,
    extract_restart_history,
    get_worker_recovery_policy_config,
)

# Configure logging
logging.basicConfig(
    level=logging.INFO,
@@ -45,7 +62,10 @@ logger = logging.getLogger('mes_dashboard.watchdog')
# Configuration
# ============================================================
-CHECK_INTERVAL = int(os.getenv('WATCHDOG_CHECK_INTERVAL', '5'))
+_RUNTIME_CONTRACT = load_runtime_contract(project_root=PROJECT_ROOT)
+CHECK_INTERVAL = int(
+    os.getenv('WATCHDOG_CHECK_INTERVAL', str(_RUNTIME_CONTRACT['watchdog_check_interval']))
+)

def _env_int(name: str, default: int) -> int:
@@ -55,22 +75,11 @@ def _env_int(name: str, default: int) -> int:
    return default

-PROJECT_ROOT = Path(__file__).resolve().parents[1]
-DEFAULT_RUNTIME_DIR = Path(
-    os.getenv('WATCHDOG_RUNTIME_DIR', str(PROJECT_ROOT / 'tmp'))
-)
-RESTART_FLAG_PATH = os.getenv(
-    'WATCHDOG_RESTART_FLAG',
-    str(DEFAULT_RUNTIME_DIR / 'mes_dashboard_restart.flag')
-)
-GUNICORN_PID_FILE = os.getenv(
-    'WATCHDOG_PID_FILE',
-    str(DEFAULT_RUNTIME_DIR / 'gunicorn.pid')
-)
-RESTART_STATE_FILE = os.getenv(
-    'WATCHDOG_STATE_FILE',
-    str(DEFAULT_RUNTIME_DIR / 'mes_dashboard_restart_state.json')
-)
+DEFAULT_RUNTIME_DIR = Path(_RUNTIME_CONTRACT['watchdog_runtime_dir'])
+RESTART_FLAG_PATH = _RUNTIME_CONTRACT['watchdog_restart_flag']
+GUNICORN_PID_FILE = _RUNTIME_CONTRACT['watchdog_pid_file']
+RESTART_STATE_FILE = _RUNTIME_CONTRACT['watchdog_state_file']
+RUNTIME_CONTRACT_VERSION = _RUNTIME_CONTRACT['version']

RESTART_HISTORY_MAX = _env_int('WATCHDOG_RESTART_HISTORY_MAX', 50)
@@ -78,6 +87,32 @@ RESTART_HISTORY_MAX = _env_int('WATCHDOG_RESTART_HISTORY_MAX', 50)
# Watchdog Implementation
# ============================================================

def validate_runtime_contract_or_raise() -> None:
    """Fail fast if runtime contract is inconsistent."""
    strict = os.getenv("RUNTIME_CONTRACT_ENFORCE", "true").strip().lower() in {
        "1",
        "true",
        "yes",
        "on",
    }
    diagnostics = build_runtime_contract_diagnostics(strict=strict)
    if diagnostics["valid"]:
        return
    details = "; ".join(diagnostics["errors"])
    raise RuntimeError(f"Runtime contract validation failed: {details}")


def log_restart_audit(event: str, payload: dict) -> None:
    entry = {
        "event": event,
        "timestamp": datetime.utcnow().isoformat(),
        "runtime_contract_version": RUNTIME_CONTRACT_VERSION,
        **payload,
    }
    logger.info("worker_watchdog_audit %s", json.dumps(entry, ensure_ascii=False))


def get_gunicorn_pid() -> int | None:
    """Get Gunicorn master PID from PID file.
@@ -155,7 +190,12 @@ def save_restart_state(
    requested_at: str | None = None,
    requested_ip: str | None = None,
    completed_at: str | None = None,
-    success: bool = True
+    success: bool = True,
    source: str = "manual",
    decision: str = "allowed",
    decision_reason: str | None = None,
    manual_override: bool = False,
    policy_state: dict | None = None,
) -> None:
    """Save restart state for status queries.
@@ -173,7 +213,12 @@ def save_restart_state(
        "requested_at": requested_at,
        "requested_ip": requested_ip,
        "completed_at": completed_at,
-        "success": success
+        "success": success,
        "source": source,
        "decision": decision,
        "decision_reason": decision_reason,
        "manual_override": manual_override,
        "policy_state": policy_state or {},
    }

    current_state = load_restart_state()
    history = current_state.get("history", [])
@@ -229,6 +274,47 @@ def process_restart_request() -> bool:
        return False

    logger.info(f"Restart flag detected: {flag_data}")

    source = str(flag_data.get("source") or "manual").strip().lower()
    manual_override = bool(flag_data.get("manual_override"))
    override_ack = bool(flag_data.get("override_acknowledged"))
    restart_state = load_restart_state()
    restart_history = extract_restart_history(restart_state)
    policy_state = evaluate_worker_recovery_state(
        restart_history,
        last_requested_at=extract_last_requested_at(restart_state),
    )
    decision = decide_restart_request(
        policy_state,
        source=source,
        manual_override=manual_override,
        override_acknowledged=override_ack,
    )
    if not decision["allowed"]:
        remove_restart_flag()
        save_restart_state(
            requested_by=flag_data.get("user"),
            requested_at=flag_data.get("timestamp"),
            requested_ip=flag_data.get("ip"),
            completed_at=datetime.now().isoformat(),
            success=False,
            source=source,
            decision=decision["decision"],
            decision_reason=decision["reason"],
            manual_override=manual_override,
            policy_state=policy_state,
        )
        log_restart_audit(
            "restart_blocked",
            {
                "source": source,
                "actor": flag_data.get("user"),
                "ip": flag_data.get("ip"),
                "decision": decision,
                "policy_state": policy_state,
            },
        )
        return True

    # Get Gunicorn master PID
    pid = get_gunicorn_pid()
@@ -242,7 +328,22 @@ def process_restart_request() -> bool:
            requested_at=flag_data.get("timestamp"),
            requested_ip=flag_data.get("ip"),
            completed_at=datetime.now().isoformat(),
-            success=False
+            success=False,
            source=source,
            decision="failed",
            decision_reason="gunicorn_pid_unavailable",
            manual_override=manual_override,
            policy_state=policy_state,
        )
        log_restart_audit(
            "restart_failed",
            {
                "source": source,
                "actor": flag_data.get("user"),
                "ip": flag_data.get("ip"),
                "decision_reason": "gunicorn_pid_unavailable",
                "policy_state": policy_state,
            },
        )
        return True
@@ -258,7 +359,12 @@ def process_restart_request() -> bool:
        requested_at=flag_data.get("timestamp"),
        requested_ip=flag_data.get("ip"),
        completed_at=datetime.now().isoformat(),
-        success=success
+        success=success,
        source=source,
        decision="executed" if success else "failed",
        decision_reason="signal_sighup" if success else "signal_failed",
        manual_override=manual_override,
        policy_state=policy_state,
    )

    if success:
@@ -267,17 +373,44 @@ def process_restart_request() -> bool:
            f"Requested by: {flag_data.get('user', 'unknown')}, "
            f"IP: {flag_data.get('ip', 'unknown')}"
        )
        log_restart_audit(
            "restart_executed",
            {
                "source": source,
                "actor": flag_data.get("user"),
                "ip": flag_data.get("ip"),
                "manual_override": manual_override,
                "policy_state": policy_state,
            },
        )
    else:
        log_restart_audit(
            "restart_failed",
            {
                "source": source,
                "actor": flag_data.get("user"),
                "ip": flag_data.get("ip"),
                "decision_reason": "signal_failed",
                "policy_state": policy_state,
            },
        )

    return True


def run_watchdog() -> None:
    """Main watchdog loop."""
    validate_runtime_contract_or_raise()
    policy = get_worker_recovery_policy_config()
    logger.info(
        f"Worker watchdog started - "
        f"Check interval: {CHECK_INTERVAL}s, "
        f"Flag path: {RESTART_FLAG_PATH}, "
-        f"PID file: {GUNICORN_PID_FILE}"
+        f"PID file: {GUNICORN_PID_FILE}, "
        f"Policy(cooldown={policy['cooldown_seconds']}s, "
        f"retry_budget={policy['retry_budget']}, "
        f"window={policy['window_seconds']}s, "
        f"guarded={policy['guarded_mode_enabled']})"
    )

    while True:

@@ -3,24 +3,48 @@
from __future__ import annotations

import atexit
import logging
import os
import sys
import threading

from flask import Flask, jsonify, redirect, render_template, request, session, url_for

from mes_dashboard.config.tables import TABLES_CONFIG
from mes_dashboard.config.settings import get_config
from mes_dashboard.core.cache import create_default_cache_backend
-from mes_dashboard.core.database import get_table_data, get_table_columns, get_engine, init_db, start_keepalive
+from mes_dashboard.core.database import (
+    get_table_data,
+    get_table_columns,
+    get_engine,
+    init_db,
+    start_keepalive,
+    dispose_engine,
+    install_log_redaction_filter,
+)
from mes_dashboard.core.permissions import is_admin_logged_in, _is_ajax_request
from mes_dashboard.core.csrf import (
    get_csrf_token,
    should_enforce_csrf,
    validate_csrf,
)
from mes_dashboard.routes import register_routes
from mes_dashboard.routes.auth_routes import auth_bp
from mes_dashboard.routes.admin_routes import admin_bp
from mes_dashboard.routes.health_routes import health_bp
from mes_dashboard.services.page_registry import get_page_status, is_api_public
from mes_dashboard.core.cache_updater import start_cache_updater, stop_cache_updater
-from mes_dashboard.services.realtime_equipment_cache import init_realtime_equipment_cache
+from mes_dashboard.services.realtime_equipment_cache import (
+    init_realtime_equipment_cache,
+    stop_equipment_status_sync_worker,
+)
from mes_dashboard.core.redis_client import close_redis
from mes_dashboard.core.runtime_contract import build_runtime_contract_diagnostics

_SHUTDOWN_LOCK = threading.Lock()
_ATEXIT_REGISTERED = False


def _configure_logging(app: Flask) -> None:
@@ -63,6 +87,121 @@ def _configure_logging(app: Flask) -> None:
    # Prevent propagation to root logger (avoid duplicate logs)
    logger.propagate = False
    install_log_redaction_filter(logger)


def _is_production_env(app: Flask) -> bool:
    env_value = str(app.config.get("ENV") or os.getenv("FLASK_ENV") or "production").lower()
    return env_value in {"prod", "production"}


def _build_security_headers(production: bool) -> dict[str, str]:
    headers = {
        "Content-Security-Policy": (
            "default-src 'self'; "
            "script-src 'self' 'unsafe-inline' 'unsafe-eval'; "
            "style-src 'self' 'unsafe-inline'; "
            "img-src 'self' data: blob:; "
            "font-src 'self' data:; "
            "connect-src 'self'; "
            "frame-ancestors 'none'; "
            "base-uri 'self'; "
            "form-action 'self'"
        ),
        "X-Frame-Options": "DENY",
        "X-Content-Type-Options": "nosniff",
        "Referrer-Policy": "strict-origin-when-cross-origin",
    }
    if production:
        headers["Strict-Transport-Security"] = "max-age=31536000; includeSubDomains"
    return headers


def _resolve_secret_key(app: Flask) -> str:
    env_name = str(app.config.get("ENV") or os.getenv("FLASK_ENV") or "development").lower()
    configured = os.environ.get("SECRET_KEY") or app.config.get("SECRET_KEY")
    insecure_defaults = {"", "dev-secret-key-change-in-prod"}
    if configured and configured not in insecure_defaults:
        return configured
    if env_name in {"production", "prod"}:
        raise RuntimeError(
            "SECRET_KEY is required in production and cannot use insecure defaults."
        )
    # Development and testing get explicit environment-safe defaults.
    if env_name in {"testing", "test"}:
        return "test-secret-key"
    return "dev-local-only-secret-key"


def _shutdown_runtime_resources() -> None:
    """Stop background workers and shared clients during app/worker shutdown."""
    logger = logging.getLogger("mes_dashboard")
    try:
        stop_cache_updater()
    except Exception as exc:
        logger.warning("Error stopping cache updater: %s", exc)
    try:
        stop_equipment_status_sync_worker()
    except Exception as exc:
        logger.warning("Error stopping equipment sync worker: %s", exc)
    try:
        close_redis()
    except Exception as exc:
        logger.warning("Error closing Redis client: %s", exc)
    try:
        dispose_engine()
    except Exception as exc:
        logger.warning("Error disposing DB engines: %s", exc)


def _register_shutdown_hooks(app: Flask) -> None:
    global _ATEXIT_REGISTERED
    app.extensions["runtime_shutdown"] = _shutdown_runtime_resources
    if app.extensions.get("runtime_shutdown_registered"):
        return
    app.extensions["runtime_shutdown_registered"] = True
    if app.testing or bool(app.config.get("TESTING")) or os.getenv("PYTEST_CURRENT_TEST"):
        return
    with _SHUTDOWN_LOCK:
if not _ATEXIT_REGISTERED:
atexit.register(_shutdown_runtime_resources)
_ATEXIT_REGISTERED = True
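The hook registration above guards `atexit.register` with a lock plus a module-level flag so repeated `create_app` calls register cleanup only once. A minimal model of that once-only pattern; `register` here is a stand-in for `atexit.register` so the sketch stays testable:

```python
import threading

class ShutdownHookGuard:
    """Register a shutdown callback at most once, even across racing callers."""

    def __init__(self, register):
        self._register = register  # stand-in for atexit.register
        self._lock = threading.Lock()
        self._registered = False

    def ensure_registered(self, callback) -> bool:
        # Lock makes the check-then-register step atomic across threads.
        with self._lock:
            if self._registered:
                return False
            self._register(callback)
            self._registered = True
            return True
```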
def _is_runtime_contract_enforced(app: Flask) -> bool:
raw = os.getenv("RUNTIME_CONTRACT_ENFORCE")
if raw is not None:
return raw.strip().lower() in {"1", "true", "yes", "on"}
return _is_production_env(app)
def _validate_runtime_contract(app: Flask) -> None:
strict = _is_runtime_contract_enforced(app)
diagnostics = build_runtime_contract_diagnostics(strict=strict)
app.extensions["runtime_contract"] = diagnostics["contract"]
app.extensions["runtime_contract_validation"] = {
"valid": diagnostics["valid"],
"strict": diagnostics["strict"],
"errors": diagnostics["errors"],
}
if diagnostics["valid"]:
return
message = "Runtime contract validation failed: " + "; ".join(diagnostics["errors"])
if strict:
raise RuntimeError(message)
logging.getLogger("mes_dashboard").warning(message)
def create_app(config_name: str | None = None) -> Flask:
@@ -72,19 +211,22 @@ def create_app(config_name: str | None = None) -> Flask:
    config_class = get_config(config_name)
    app.config.from_object(config_class)

-    # Session configuration
+    # Session configuration with environment-aware secret validation.
-    app.secret_key = os.environ.get("SECRET_KEY", "dev-secret-key-change-in-prod")
+    app.secret_key = _resolve_secret_key(app)
+    app.config["SECRET_KEY"] = app.secret_key

    # Session cookie security settings
-    # SECURE: Only send cookie over HTTPS (disable for local development)
+    # SECURE: Only send cookie over HTTPS in production.
-    app.config['SESSION_COOKIE_SECURE'] = os.environ.get("FLASK_ENV") == "production"
+    app.config['SESSION_COOKIE_SECURE'] = _is_production_env(app)
    # HTTPONLY: Prevent JavaScript access to session cookie (XSS protection)
    app.config['SESSION_COOKIE_HTTPONLY'] = True
-    # SAMESITE: Prevent CSRF by restricting cross-site cookie sending
+    # SAMESITE: strict in production, relaxed for local development usability.
-    app.config['SESSION_COOKIE_SAMESITE'] = 'Lax'
+    app.config['SESSION_COOKIE_SAMESITE'] = 'Strict' if _is_production_env(app) else 'Lax'

    # Configure logging first
    _configure_logging(app)
+    _validate_runtime_contract(app)
+    security_headers = _build_security_headers(_is_production_env(app))

    # Route-level cache backend (L1 memory + optional L2 Redis)
    app.extensions["cache"] = create_default_cache_backend()
@@ -96,6 +238,7 @@ def create_app(config_name: str | None = None) -> Flask:
    start_keepalive()  # Keep database connections alive
    start_cache_updater()  # Start Redis cache updater
    init_realtime_equipment_cache(app)  # Start realtime equipment status cache
_register_shutdown_hooks(app)
    # Register API routes
    register_routes(app)
@@ -150,6 +293,34 @@ def create_app(config_name: str | None = None) -> Flask:
        return None
@app.before_request
def enforce_csrf():
if not should_enforce_csrf(
request,
enabled=bool(app.config.get("CSRF_ENABLED", True)),
):
return None
if validate_csrf(request):
return None
if request.path == "/admin/login":
return render_template("login.html", error="CSRF 驗證失敗,請重新提交"), 403
from mes_dashboard.core.response import error_response, FORBIDDEN
return error_response(
FORBIDDEN,
"CSRF 驗證失敗",
status_code=403,
)
@app.after_request
def apply_security_headers(response):
for header, value in security_headers.items():
response.headers.setdefault(header, value)
return response
    # ========================================================
    # Template Context Processor
    # ========================================================
@@ -185,6 +356,7 @@ def create_app(config_name: str | None = None) -> Flask:
            "admin_user": session.get("admin"),
            "can_view_page": can_view_page,
            "frontend_asset": frontend_asset,
"csrf_token": get_csrf_token,
        }

    # ========================================================

@@ -20,6 +20,13 @@ def _float_env(name: str, default: float) -> float:
        return default
def _bool_env(name: str, default: bool) -> bool:
value = os.getenv(name)
if value is None:
return default
return value.strip().lower() in {"1", "true", "yes", "on"}
class Config:
    """Base configuration."""
@@ -40,7 +47,8 @@ class Config:
    # Auth configuration - MUST be set in .env file
    LDAP_API_URL = os.getenv("LDAP_API_URL", "")
    ADMIN_EMAILS = os.getenv("ADMIN_EMAILS", "")
-    SECRET_KEY = os.getenv("SECRET_KEY", "dev-secret-key-change-in-prod")
+    SECRET_KEY = os.getenv("SECRET_KEY")
+    CSRF_ENABLED = _bool_env("CSRF_ENABLED", True)

    # Session configuration
    PERMANENT_SESSION_LIFETIME = _int_env("SESSION_LIFETIME", 28800)  # 8 hours
@@ -103,6 +111,7 @@ class TestingConfig(Config):
    DB_CONNECT_RETRY_COUNT = 0
    DB_CONNECT_RETRY_DELAY = 0.0
    DB_CALL_TIMEOUT_MS = 5000
CSRF_ENABLED = False
def get_config(env: str | None = None) -> Type[Config]:

@@ -10,8 +10,10 @@ from __future__ import annotations
import io
import json
import logging
+import os
import threading
import time
+from collections import OrderedDict
from typing import Any, Optional, Protocol, Tuple

import pandas as pd
@@ -39,26 +41,49 @@ class ProcessLevelCache:
    Uses a lock to ensure only one thread parses at a time.
    """

-    def __init__(self, ttl_seconds: int = 30):
+    def __init__(self, ttl_seconds: int = 30, max_size: int = 32):
-        self._cache: dict[str, Tuple[pd.DataFrame, float]] = {}
+        self._cache: OrderedDict[str, Tuple[pd.DataFrame, float]] = OrderedDict()
        self._lock = threading.Lock()
-        self._ttl = ttl_seconds
+        self._ttl = max(int(ttl_seconds), 1)
+        self._max_size = max(int(max_size), 1)
@property
def max_size(self) -> int:
return self._max_size
def _evict_expired_locked(self, now: float) -> None:
stale_keys = [
key for key, (_, timestamp) in self._cache.items()
if now - timestamp > self._ttl
]
for key in stale_keys:
self._cache.pop(key, None)
    def get(self, key: str) -> Optional[pd.DataFrame]:
        """Get cached DataFrame if not expired."""
        with self._lock:
-            if key not in self._cache:
-                return None
-            df, timestamp = self._cache[key]
-            if time.time() - timestamp > self._ttl:
-                del self._cache[key]
-                return None
+            payload = self._cache.get(key)
+            if payload is None:
+                return None
+            df, timestamp = payload
+            now = time.time()
+            if now - timestamp > self._ttl:
+                self._cache.pop(key, None)
+                return None
+            self._cache.move_to_end(key, last=True)
            return df

    def set(self, key: str, df: pd.DataFrame) -> None:
        """Cache a DataFrame with current timestamp."""
        with self._lock:
-            self._cache[key] = (df, time.time())
+            now = time.time()
+            self._evict_expired_locked(now)
+            if key in self._cache:
+                self._cache.pop(key, None)
+            elif len(self._cache) >= self._max_size:
+                self._cache.popitem(last=False)
+            self._cache[key] = (df, now)
+            self._cache.move_to_end(key, last=True)
    def invalidate(self, key: str) -> None:
        """Remove a key from cache."""
@@ -71,8 +96,26 @@ class ProcessLevelCache:
            self._cache.clear()
def _resolve_cache_max_size(env_name: str, default: int) -> int:
value = os.getenv(env_name)
if value is None:
return max(int(default), 1)
try:
return max(int(value), 1)
except (TypeError, ValueError):
return max(int(default), 1)
# Global process-level cache for WIP DataFrame (30s TTL)
-_wip_df_cache = ProcessLevelCache(ttl_seconds=30)
+PROCESS_CACHE_MAX_SIZE = _resolve_cache_max_size("PROCESS_CACHE_MAX_SIZE", 32)
+WIP_PROCESS_CACHE_MAX_SIZE = _resolve_cache_max_size(
+    "WIP_PROCESS_CACHE_MAX_SIZE",
+    PROCESS_CACHE_MAX_SIZE,
+)
+_wip_df_cache = ProcessLevelCache(
+    ttl_seconds=30,
+    max_size=WIP_PROCESS_CACHE_MAX_SIZE,
+)
_wip_parse_lock = threading.Lock()
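The bounded cache above combines a TTL sweep with LRU eviction on top of `OrderedDict`. A condensed, DataFrame-free sketch of the same policy (the class name `TTLLRUCache` is hypothetical, not the production class):

```python
import time
from collections import OrderedDict

class TTLLRUCache:
    """Entries expire after ttl seconds; beyond max_size the least
    recently used entry is evicted first."""

    def __init__(self, ttl: float = 30.0, max_size: int = 32):
        self._data: OrderedDict[str, tuple[object, float]] = OrderedDict()
        self._ttl = ttl
        self._max_size = max(int(max_size), 1)

    def get(self, key):
        item = self._data.get(key)
        if item is None:
            return None
        value, stamp = item
        if time.monotonic() - stamp > self._ttl:
            self._data.pop(key, None)  # expired entry
            return None
        self._data.move_to_end(key)  # mark as most recently used
        return value

    def set(self, key, value):
        self._data.pop(key, None)  # re-insert moves key to the MRU end
        if len(self._data) >= self._max_size:
            self._data.popitem(last=False)  # evict the LRU entry
        self._data[key] = (value, time.monotonic())
```

`move_to_end` on hits is what turns insertion order into recency order, so `popitem(last=False)` always drops the coldest entry.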
# ============================================================
@@ -328,14 +371,6 @@ def get_cached_wip_data() -> Optional[pd.DataFrame]:
    if client is None:
        return None

-    # Use lock to prevent multiple threads from parsing simultaneously
-    with _wip_parse_lock:
-        # Double-check after acquiring lock (another thread may have parsed)
-        cached_df = _wip_df_cache.get(cache_key)
-        if cached_df is not None:
-            logger.debug(f"Process cache hit (after lock): {len(cached_df)} rows")
-            return cached_df

    try:
        start_time = time.time()
        data_json = client.get(get_key("data"))
@@ -343,19 +378,24 @@ def get_cached_wip_data() -> Optional[pd.DataFrame]:
            logger.debug("Cache miss: no data in Redis")
            return None

-        # Parse JSON to DataFrame
-        df = pd.read_json(io.StringIO(data_json), orient='records')
+        # Parse outside lock to reduce contention on hot paths.
+        parsed_df = pd.read_json(io.StringIO(data_json), orient='records')
        parse_time = time.time() - start_time
-        # Store in process-level cache
-        _wip_df_cache.set(cache_key, df)
-        logger.debug(f"Cache hit: loaded {len(df)} rows from Redis (parsed in {parse_time:.2f}s)")
-        return df
    except Exception as e:
        logger.warning(f"Failed to read cache: {e}")
        return None
# Keep lock scope tight: consistency check + cache write only.
with _wip_parse_lock:
cached_df = _wip_df_cache.get(cache_key)
if cached_df is not None:
logger.debug(f"Process cache hit (after parse): {len(cached_df)} rows")
return cached_df
_wip_df_cache.set(cache_key, parsed_df)
logger.debug(f"Cache hit: loaded {len(parsed_df)} rows from Redis (parsed in {parse_time:.2f}s)")
return parsed_df
def get_cached_sys_date() -> Optional[str]:
    """Get cached SYS_DATE from Redis.

@@ -221,7 +221,7 @@ class CacheUpdater:
        return None

    def _update_redis_cache(self, df: pd.DataFrame, sys_date: str) -> bool:
-        """Update Redis cache with new data using pipeline for atomicity.
+        """Update Redis cache with staged publish for coherent snapshot visibility.

        Args:
            df: DataFrame with full table data.
@@ -234,18 +234,24 @@ class CacheUpdater:
        if client is None:
            return False

+        staging_key: str | None = None
        try:
            # Convert DataFrame to JSON
            # Handle datetime columns
-            for col in df.select_dtypes(include=['datetime64']).columns:
-                df[col] = df[col].astype(str)
+            df_copy = df.copy()
+            for col in df_copy.select_dtypes(include=['datetime64']).columns:
+                df_copy[col] = df_copy[col].astype(str)

-            data_json = df.to_json(orient='records', force_ascii=False)
+            data_json = df_copy.to_json(orient='records', force_ascii=False)

-            # Atomic update using pipeline
+            # Stage payload first, then atomically publish live key + metadata.
            now = datetime.now().isoformat()
+            unique_suffix = f"{int(time.time() * 1000)}:{threading.get_ident()}"
+            staging_key = get_key(f"data:staging:{unique_suffix}")
            pipe = client.pipeline()
-            pipe.set(get_key("data"), data_json)
+            pipe.set(staging_key, data_json)
+            pipe.rename(staging_key, get_key("data"))
            pipe.set(get_key("meta:sys_date"), sys_date)
            pipe.set(get_key("meta:updated_at"), now)
            pipe.execute()
@@ -253,6 +259,11 @@ class CacheUpdater:
            return True
        except Exception as e:
            logger.error(f"Failed to update Redis cache: {e}")
if staging_key:
try:
client.delete(staging_key)
except Exception:
pass
            return False

    def _check_resource_update(self, force: bool = False) -> bool:
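The staged publish above leans on Redis `RENAME` being atomic: readers of the live key see either the old snapshot or the complete new one, never a half-written payload. A sketch of the pattern against a hypothetical in-memory store exposing the same `set`/`rename`/`delete` subset:

```python
class FakeStore:
    """Minimal stand-in with the set/rename/delete subset used by the updater."""

    def __init__(self):
        self.data: dict[str, str] = {}

    def set(self, key: str, value: str) -> None:
        self.data[key] = value

    def rename(self, src: str, dst: str) -> None:
        # Models Redis RENAME: the destination flips to the staged value in one step.
        self.data[dst] = self.data.pop(src)

    def delete(self, key: str) -> None:
        self.data.pop(key, None)

def staged_publish(store: FakeStore, live_key: str, payload: str) -> None:
    staging_key = f"{live_key}:staging"
    store.set(staging_key, payload)      # write the full payload off to the side
    store.rename(staging_key, live_key)  # flip visibility atomically
```

On failure before the rename, the diff's cleanup path deletes the orphaned staging key so stale payloads do not accumulate.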

@@ -130,12 +130,16 @@ class CircuitBreaker:
    @property
    def state(self) -> CircuitState:
        """Get current circuit state, handling state transitions."""
+        transition_log: tuple[int, str] | None = None
        with self._lock:
            if self._state == CircuitState.OPEN:
                # Check if we should transition to HALF_OPEN
                if self._open_time and time.time() - self._open_time >= self.recovery_timeout:
-                    self._transition_to(CircuitState.HALF_OPEN)
+                    transition_log = self._transition_to_locked(CircuitState.HALF_OPEN)
-            return self._state
+            current_state = self._state
+        if transition_log:
+            self._emit_transition_log(*transition_log)
+        return current_state
    def allow_request(self) -> bool:
        """Check if a request should be allowed.
@@ -161,45 +165,57 @@ class CircuitBreaker:
        if not CIRCUIT_BREAKER_ENABLED:
            return

+        transition_log: tuple[int, str] | None = None
        with self._lock:
            self._results.append(True)
            if self._state == CircuitState.HALF_OPEN:
                # Success in half-open means we can close
-                self._transition_to(CircuitState.CLOSED)
+                transition_log = self._transition_to_locked(CircuitState.CLOSED)
+        if transition_log:
+            self._emit_transition_log(*transition_log)
    def record_failure(self) -> None:
        """Record a failed operation."""
        if not CIRCUIT_BREAKER_ENABLED:
            return

+        transition_log: tuple[int, str] | None = None
        with self._lock:
            self._results.append(False)
            self._last_failure_time = time.time()
            if self._state == CircuitState.HALF_OPEN:
                # Failure in half-open means back to open
-                self._transition_to(CircuitState.OPEN)
+                transition_log = self._transition_to_locked(CircuitState.OPEN)
            elif self._state == CircuitState.CLOSED:
                # Check if we should open
-                self._check_and_open()
+                transition_log = self._check_and_open_locked()
+        if transition_log:
+            self._emit_transition_log(*transition_log)

-    def _check_and_open(self) -> None:
+    def _check_and_open_locked(self) -> tuple[int, str] | None:
        """Check failure rate and open circuit if needed.

        Must be called with lock held.
        """
        if len(self._results) < self.failure_threshold:
-            return
+            return None

        failure_count = sum(1 for r in self._results if not r)
        failure_rate = failure_count / len(self._results)

        if (failure_count >= self.failure_threshold and
                failure_rate >= self.failure_rate_threshold):
-            self._transition_to(CircuitState.OPEN)
+            return self._transition_to_locked(CircuitState.OPEN)
+        return None

+    def _emit_transition_log(self, level: int, message: str) -> None:
+        logger.log(level, message)

-    def _transition_to(self, new_state: CircuitState) -> None:
+    def _transition_to_locked(self, new_state: CircuitState) -> tuple[int, str]:
        """Transition to a new state with logging.

        Must be called with lock held.
@@ -209,20 +225,22 @@ class CircuitBreaker:
        if new_state == CircuitState.OPEN:
            self._open_time = time.time()
-            logger.warning(
+            return (
+                logging.WARNING,
                f"Circuit breaker '{self.name}' OPENED: "
                f"state {old_state.value} -> {new_state.value}, "
                f"failures: {sum(1 for r in self._results if not r)}/{len(self._results)}"
            )
        elif new_state == CircuitState.HALF_OPEN:
-            logger.info(
+            return (
+                logging.INFO,
                f"Circuit breaker '{self.name}' entering HALF_OPEN: "
                f"testing service recovery..."
            )
+        elif new_state == CircuitState.CLOSED:
            self._open_time = None
            self._results.clear()
-            logger.info(
+            return (
+                logging.INFO,
                f"Circuit breaker '{self.name}' CLOSED: "
                f"service recovered"
            )

@@ -0,0 +1,85 @@
# -*- coding: utf-8 -*-
"""CSRF token utilities for admin form and API mutation protection."""
from __future__ import annotations
import hmac
import secrets
from typing import Optional
from flask import Request, request, session
CSRF_SESSION_KEY = "_csrf_token"
CSRF_HEADER_NAME = "X-CSRF-Token"
CSRF_FORM_FIELD = "csrf_token"
_MUTATING_METHODS = {"POST", "PUT", "PATCH", "DELETE"}
def _new_csrf_token() -> str:
return secrets.token_urlsafe(32)
def get_csrf_token() -> str:
"""Get a stable CSRF token for the current session."""
token = session.get(CSRF_SESSION_KEY)
if not token:
token = _new_csrf_token()
session[CSRF_SESSION_KEY] = token
return token
def rotate_csrf_token() -> str:
"""Rotate session CSRF token after authentication state changes."""
token = _new_csrf_token()
session[CSRF_SESSION_KEY] = token
return token
def _extract_request_token(req: Request) -> Optional[str]:
header_token = req.headers.get(CSRF_HEADER_NAME)
if header_token:
return header_token
form_token = req.form.get(CSRF_FORM_FIELD)
if form_token:
return form_token
if req.is_json:
payload = req.get_json(silent=True) or {}
json_token = payload.get(CSRF_FORM_FIELD)
if json_token:
return str(json_token)
return None
def should_enforce_csrf(req: Request = request, enabled: bool = True) -> bool:
"""Determine whether current request needs CSRF validation."""
if not enabled:
return False
if req.method.upper() not in _MUTATING_METHODS:
return False
path = req.path or ""
if path == "/admin/login":
return True
if path.startswith("/admin/api/"):
return True
if path.startswith("/admin/"):
return True
return False
def validate_csrf(req: Request = request) -> bool:
"""Validate request CSRF token against current session token."""
expected = session.get(CSRF_SESSION_KEY)
if not expected:
return False
provided = _extract_request_token(req)
if not provided:
return False
return hmac.compare_digest(str(expected), str(provided))

@@ -51,6 +51,59 @@ from mes_dashboard.config.settings import get_config
# Configure module logger
logger = logging.getLogger('mes_dashboard.database')
_REDACTION_INSTALLED = False
_ORACLE_URL_RE = re.compile(r"(oracle\+oracledb://[^:\s/]+:)([^@/\s]+)(@)")
_ENV_SECRET_RE = re.compile(r"(DB_PASSWORD=)([^\s]+)")
def redact_connection_secrets(message: str) -> str:
"""Redact DB credentials from log message text."""
if not message:
return message
sanitized = _ORACLE_URL_RE.sub(r"\1***\3", message)
sanitized = _ENV_SECRET_RE.sub(r"\1***", sanitized)
return sanitized
class SecretRedactionFilter(logging.Filter):
"""Filter that masks DB connection secrets in log messages."""
def filter(self, record: logging.LogRecord) -> bool:
try:
message = record.getMessage()
except Exception:
return True
sanitized = redact_connection_secrets(message)
if sanitized != message:
record.msg = sanitized
record.args = ()
return True
def install_log_redaction_filter(target_logger: logging.Logger | None = None) -> None:
"""Attach secret-redaction filter to mes_dashboard logging handlers once."""
global _REDACTION_INSTALLED
if target_logger is None and _REDACTION_INSTALLED:
return
logger_obj = target_logger or logging.getLogger("mes_dashboard")
redaction_filter = SecretRedactionFilter()
attached = False
for handler in logger_obj.handlers:
if any(isinstance(f, SecretRedactionFilter) for f in handler.filters):
attached = True
continue
handler.addFilter(redaction_filter)
attached = True
if not attached and not any(isinstance(f, SecretRedactionFilter) for f in logger_obj.filters):
logger_obj.addFilter(redaction_filter)
attached = True
if attached and target_logger is None:
_REDACTION_INSTALLED = True
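The two patterns above mask the password segment of an `oracle+oracledb://` URL and any `DB_PASSWORD=` pair before a message reaches a handler. A self-contained repro of the same substitution, with the regexes copied from the diff:

```python
import re

ORACLE_URL_RE = re.compile(r"(oracle\+oracledb://[^:\s/]+:)([^@/\s]+)(@)")
ENV_SECRET_RE = re.compile(r"(DB_PASSWORD=)([^\s]+)")

def redact(message: str) -> str:
    """Mask credentials in connection URLs and env-style assignments."""
    sanitized = ORACLE_URL_RE.sub(r"\1***\3", message)
    return ENV_SECRET_RE.sub(r"\1***", sanitized)
```

Keeping the username and host visible while masking only the password preserves enough context to debug connection errors from logs.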
# ============================================================
# SQLAlchemy Engine (QueuePool - connection pooling)
# ============================================================
@@ -59,6 +112,7 @@ logger = logging.getLogger('mes_dashboard.database')
# pool_recycle prevents stale connections from firewalls/NAT.
_ENGINE = None
+_HEALTH_ENGINE = None
_DB_RUNTIME_CONFIG: Optional[Dict[str, Any]] = None
@@ -132,6 +186,13 @@ def get_db_runtime_config(refresh: bool = False) -> Dict[str, Any]:
        "retry_count": _from_app_or_env_int("DB_CONNECT_RETRY_COUNT", config_class.DB_CONNECT_RETRY_COUNT),
        "retry_delay": _from_app_or_env_float("DB_CONNECT_RETRY_DELAY", config_class.DB_CONNECT_RETRY_DELAY),
        "call_timeout_ms": _from_app_or_env_int("DB_CALL_TIMEOUT_MS", config_class.DB_CALL_TIMEOUT_MS),
"health_pool_size": _from_app_or_env_int("DB_HEALTH_POOL_SIZE", 1),
"health_max_overflow": _from_app_or_env_int("DB_HEALTH_MAX_OVERFLOW", 0),
"health_pool_timeout": _from_app_or_env_int("DB_HEALTH_POOL_TIMEOUT", 2),
"pool_exhausted_retry_after_seconds": _from_app_or_env_int(
"DB_POOL_EXHAUSTED_RETRY_AFTER_SECONDS",
5,
),
    }
    return _DB_RUNTIME_CONFIG.copy()
@@ -202,6 +263,42 @@ def get_engine():
    return _ENGINE
def get_health_engine():
"""Get dedicated SQLAlchemy engine for health probes.
Health checks use a tiny isolated pool so status probes remain available
when the request pool is saturated.
"""
global _HEALTH_ENGINE
if _HEALTH_ENGINE is None:
runtime = get_db_runtime_config()
_HEALTH_ENGINE = create_engine(
CONNECTION_STRING,
poolclass=QueuePool,
pool_size=max(int(runtime["health_pool_size"]), 1),
max_overflow=max(int(runtime["health_max_overflow"]), 0),
pool_timeout=max(int(runtime["health_pool_timeout"]), 1),
pool_recycle=runtime["pool_recycle"],
pool_pre_ping=True,
connect_args={
"tcp_connect_timeout": runtime["tcp_connect_timeout"],
"retry_count": runtime["retry_count"],
"retry_delay": runtime["retry_delay"],
},
)
_register_pool_events(
_HEALTH_ENGINE,
min(int(runtime["call_timeout_ms"]), 10_000),
)
logger.info(
"Health engine created (pool_size=%s, max_overflow=%s, pool_timeout=%s)",
runtime["health_pool_size"],
runtime["health_max_overflow"],
runtime["health_pool_timeout"],
)
return _HEALTH_ENGINE
def _register_pool_events(engine, call_timeout_ms: int):
    """Register event listeners for connection pool monitoring."""
@@ -302,8 +399,12 @@ def dispose_engine():
    Call this during application shutdown to cleanly release resources.
    """
-    global _ENGINE, _DB_RUNTIME_CONFIG
+    global _ENGINE, _HEALTH_ENGINE, _DB_RUNTIME_CONFIG
    stop_keepalive()
if _HEALTH_ENGINE is not None:
_HEALTH_ENGINE.dispose()
logger.info("Health engine disposed")
_HEALTH_ENGINE = None
    if _ENGINE is not None:
        _ENGINE.dispose()
        logger.info("Database engine disposed, all connections closed")
@@ -432,9 +533,13 @@ def read_sql_df(sql: str, params: Optional[Dict[str, Any]] = None) -> pd.DataFra
            elapsed,
            exc,
        )
+        retry_after = max(
+            int(get_db_runtime_config().get("pool_exhausted_retry_after_seconds", 5)),
+            1,
+        )
        raise DatabasePoolExhaustedError(
            "Database connection pool exhausted",
-            retry_after_seconds=5,
+            retry_after_seconds=retry_after,
        ) from exc
    except Exception as exc:
        elapsed = time.time() - start_time

@@ -0,0 +1,103 @@
# -*- coding: utf-8 -*-
"""Lightweight in-process rate limiting helpers for high-cost routes."""
from __future__ import annotations
import os
import threading
import time
from collections import defaultdict, deque
from functools import wraps
from typing import Callable, Deque
from flask import request
from mes_dashboard.core.response import TOO_MANY_REQUESTS, error_response
_RATE_LOCK = threading.Lock()
_RATE_ATTEMPTS: dict[str, dict[str, Deque[float]]] = defaultdict(lambda: defaultdict(deque))
def _env_int(name: str, default: int) -> int:
raw = os.getenv(name)
if raw is None:
return int(default)
try:
value = int(raw)
except (TypeError, ValueError):
return int(default)
return max(value, 1)
def _client_identifier() -> str:
forwarded = request.headers.get("X-Forwarded-For", "").strip()
if forwarded:
return forwarded.split(",")[0].strip()
return request.remote_addr or "unknown"
def check_and_record(
bucket: str,
*,
client_id: str,
max_attempts: int,
window_seconds: int,
) -> tuple[bool, int]:
"""Check and record request attempt for a bucket+client pair."""
now = time.time()
window_start = now - max(window_seconds, 1)
with _RATE_LOCK:
per_bucket = _RATE_ATTEMPTS[bucket]
attempts = per_bucket[client_id]
while attempts and attempts[0] <= window_start:
attempts.popleft()
if len(attempts) >= max_attempts:
retry_after = max(int(window_seconds - (now - attempts[0])), 1)
return True, retry_after
attempts.append(now)
return False, 0
def configured_rate_limit(
*,
bucket: str,
max_attempts_env: str,
window_seconds_env: str,
default_max_attempts: int,
default_window_seconds: int,
) -> Callable:
"""Build a route decorator with env-configurable rate limits."""
max_attempts = _env_int(max_attempts_env, default_max_attempts)
window_seconds = _env_int(window_seconds_env, default_window_seconds)
def decorator(func: Callable) -> Callable:
@wraps(func)
def wrapped(*args, **kwargs):
limited, retry_after = check_and_record(
bucket,
client_id=_client_identifier(),
max_attempts=max_attempts,
window_seconds=window_seconds,
)
if limited:
return error_response(
TOO_MANY_REQUESTS,
"請求過於頻繁,請稍後再試",
status_code=429,
meta={"retry_after_seconds": retry_after},
headers={"Retry-After": str(retry_after)},
)
return func(*args, **kwargs)
return wrapped
return decorator
def reset_rate_limits_for_tests() -> None:
with _RATE_LOCK:
_RATE_ATTEMPTS.clear()
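`check_and_record` implements a sliding window over per-client timestamp deques: prune stamps older than the window, then either admit the request or report how long until the oldest attempt ages out. The same policy, extracted into a pure function over one deque for clarity (names here are illustrative):

```python
from collections import deque

def sliding_window_limited(attempts: deque, now: float,
                           max_attempts: int, window_seconds: int) -> tuple[bool, int]:
    """Return (limited, retry_after_seconds) for one client bucket."""
    window_start = now - max(window_seconds, 1)
    # Drop timestamps that have aged out of the window.
    while attempts and attempts[0] <= window_start:
        attempts.popleft()
    if len(attempts) >= max_attempts:
        # Retry-After counts down to when the oldest attempt leaves the window.
        return True, max(int(window_seconds - (now - attempts[0])), 1)
    attempts.append(now)
    return False, 0
```

Because attempts are recorded only when admitted, rejected requests do not extend the lockout, which keeps the limiter from punishing clients that back off.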

@@ -0,0 +1,143 @@
# -*- coding: utf-8 -*-
"""Runtime contract helpers shared by app, scripts, and watchdog."""
from __future__ import annotations
import os
import shutil
from pathlib import Path
from typing import Any, Mapping
CONTRACT_VERSION = "2026.02-p2"
DEFAULT_PROJECT_ROOT = Path(__file__).resolve().parents[3]
def _to_bool(value: str | None, default: bool) -> bool:
if value is None:
return default
return value.strip().lower() in {"1", "true", "yes", "on"}
def _resolve_path(value: str | None, fallback: Path, project_root: Path) -> Path:
if value is None or not str(value).strip():
return fallback.resolve()
raw = Path(str(value).strip())
if raw.is_absolute():
return raw.resolve()
return (project_root / raw).resolve()
def load_runtime_contract(
environ: Mapping[str, str] | None = None,
*,
project_root: Path | str | None = None,
) -> dict[str, Any]:
"""Load effective runtime contract from environment with normalized paths."""
env = environ or os.environ
root = Path(project_root or env.get("MES_DASHBOARD_ROOT", DEFAULT_PROJECT_ROOT)).resolve()
runtime_dir = _resolve_path(
env.get("WATCHDOG_RUNTIME_DIR"),
root / "tmp",
root,
)
restart_flag = _resolve_path(
env.get("WATCHDOG_RESTART_FLAG"),
runtime_dir / "mes_dashboard_restart.flag",
root,
)
pid_file = _resolve_path(
env.get("WATCHDOG_PID_FILE"),
runtime_dir / "gunicorn.pid",
root,
)
state_file = _resolve_path(
env.get("WATCHDOG_STATE_FILE"),
runtime_dir / "mes_dashboard_restart_state.json",
root,
)
contract = {
"version": env.get("RUNTIME_CONTRACT_VERSION", CONTRACT_VERSION),
"project_root": str(root),
"gunicorn_bind": env.get("GUNICORN_BIND", "0.0.0.0:8080"),
"conda_bin": (env.get("CONDA_BIN", "") or "").strip(),
"conda_env_name": (env.get("CONDA_ENV_NAME", "mes-dashboard") or "").strip(),
"watchdog_runtime_dir": str(runtime_dir),
"watchdog_restart_flag": str(restart_flag),
"watchdog_pid_file": str(pid_file),
"watchdog_state_file": str(state_file),
"watchdog_check_interval": int(env.get("WATCHDOG_CHECK_INTERVAL", "5")),
"validation_enforced": _to_bool(env.get("RUNTIME_CONTRACT_ENFORCE"), False),
}
return contract
def validate_runtime_contract(
contract: Mapping[str, Any] | None = None,
*,
strict: bool = False,
) -> list[str]:
"""Validate runtime contract and return actionable errors."""
cfg = dict(contract or load_runtime_contract())
errors: list[str] = []
runtime_dir = Path(str(cfg["watchdog_runtime_dir"])).resolve()
restart_flag = Path(str(cfg["watchdog_restart_flag"])).resolve()
pid_file = Path(str(cfg["watchdog_pid_file"])).resolve()
state_file = Path(str(cfg["watchdog_state_file"])).resolve()
if restart_flag.parent != runtime_dir:
errors.append(
"WATCHDOG_RESTART_FLAG must be under WATCHDOG_RUNTIME_DIR "
f"({restart_flag} not under {runtime_dir})."
)
if pid_file.parent != runtime_dir:
errors.append(
"WATCHDOG_PID_FILE must be under WATCHDOG_RUNTIME_DIR "
f"({pid_file} not under {runtime_dir})."
)
if not state_file.is_absolute():
errors.append("WATCHDOG_STATE_FILE must resolve to an absolute path.")
bind = str(cfg.get("gunicorn_bind", "")).strip()
if ":" not in bind:
errors.append(f"GUNICORN_BIND must include host:port (current: {bind!r}).")
conda_bin = str(cfg.get("conda_bin", "")).strip()
if strict and not conda_bin:
conda_on_path = shutil.which("conda")
if not conda_on_path:
errors.append(
"CONDA_BIN is required when strict runtime validation is enabled "
"and conda is not discoverable on PATH."
)
if conda_bin:
conda_path = Path(conda_bin)
if not conda_path.exists():
errors.append(f"CONDA_BIN does not exist: {conda_bin}")
elif not os.access(conda_bin, os.X_OK):
errors.append(f"CONDA_BIN is not executable: {conda_bin}")
conda_env_name = str(cfg.get("conda_env_name", "")).strip()
active_env = (os.getenv("CONDA_DEFAULT_ENV") or "").strip()
if strict and conda_env_name and active_env and active_env != conda_env_name:
errors.append(
"CONDA_DEFAULT_ENV mismatch: "
f"expected {conda_env_name!r}, got {active_env!r}."
)
return errors
def build_runtime_contract_diagnostics(*, strict: bool = False) -> dict[str, Any]:
"""Build diagnostics payload for runtime contract introspection."""
contract = load_runtime_contract()
errors = validate_runtime_contract(contract, strict=strict)
return {
"valid": not errors,
"strict": strict,
"errors": errors,
"contract": contract,
}


@@ -33,6 +33,22 @@ def get_days_back(filters: Optional[Dict] = None, default: int = DEFAULT_DAYS_BA
     return default


+def parse_bool_query(value: Any, default: bool = False) -> bool:
+    """Parse common boolean query parameter values."""
+    if value is None:
+        return default
+    if isinstance(value, bool):
+        return value
+    text = str(value).strip().lower()
+    if not text:
+        return default
+    if text in {"true", "1", "yes", "y", "on"}:
+        return True
+    if text in {"false", "0", "no", "n", "off"}:
+        return False
+    return default
+
+
 # ============================================================
 # SQL Filter Building (DEPRECATED)
 # Use mes_dashboard.sql.CommonFilters with QueryBuilder instead.
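The new helper accepts the usual truthy/falsy spellings and falls back to `default` for anything else. A standalone copy of the logic, to make the behavior concrete:

```python
def parse_bool_query(value, default: bool = False) -> bool:
    """Parse common boolean query parameter values."""
    if value is None:
        return default
    if isinstance(value, bool):
        return value
    text = str(value).strip().lower()
    if not text:
        return default
    if text in {"true", "1", "yes", "y", "on"}:
        return True
    if text in {"false", "0", "no", "n", "off"}:
        return False
    return default
```

Note that unrecognized strings return `default` rather than raising, so callers must pick the safe default per endpoint.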


@@ -0,0 +1,220 @@
# -*- coding: utf-8 -*-
"""Worker restart policy helpers (cooldown, retry budget, churn guard)."""
from __future__ import annotations
import json
import os
from datetime import datetime, timezone
from pathlib import Path
from typing import Any, Mapping
from mes_dashboard.core.runtime_contract import load_runtime_contract
def _env_int(name: str, default: int) -> int:
try:
return int(os.getenv(name, str(default)))
except (TypeError, ValueError):
return default
def _env_bool(name: str, default: bool) -> bool:
raw = os.getenv(name)
if raw is None:
return default
return raw.strip().lower() in {"1", "true", "yes", "on"}
def _parse_iso(ts: str | None) -> datetime | None:
if not ts:
return None
try:
value = datetime.fromisoformat(ts)
except (TypeError, ValueError):
return None
if value.tzinfo is None:
value = value.replace(tzinfo=timezone.utc)
return value
def _utc_now() -> datetime:
return datetime.now(timezone.utc)
def get_worker_recovery_policy_config() -> dict[str, Any]:
"""Return effective worker restart policy config."""
retry_budget = _env_int("WORKER_RESTART_RETRY_BUDGET", 3)
churn_threshold = _env_int(
"WORKER_RESTART_CHURN_THRESHOLD",
_env_int("RESILIENCE_RESTART_CHURN_THRESHOLD", retry_budget),
)
window_seconds = _env_int(
"WORKER_RESTART_WINDOW_SECONDS",
_env_int("RESILIENCE_RESTART_CHURN_WINDOW_SECONDS", 600),
)
return {
"cooldown_seconds": max(_env_int("WORKER_RESTART_COOLDOWN", 60), 1),
"retry_budget": max(retry_budget, 1),
"window_seconds": max(window_seconds, 30),
"churn_threshold": max(churn_threshold, 1),
"guarded_mode_enabled": _env_bool("WORKER_GUARDED_MODE_ENABLED", True),
}
def load_restart_state(path: str | None = None) -> dict[str, Any]:
"""Load persisted restart state from runtime contract state file."""
state_path = Path(path or load_runtime_contract()["watchdog_state_file"])
if not state_path.exists():
return {}
try:
return json.loads(state_path.read_text())
except (json.JSONDecodeError, IOError):
return {}
def extract_restart_history(state: Mapping[str, Any] | None = None) -> list[dict[str, Any]]:
"""Extract bounded restart history from persisted state."""
payload = dict(state or {})
raw_history = payload.get("history")
if not isinstance(raw_history, list):
return []
return [item for item in raw_history if isinstance(item, dict)][-50:]
def extract_last_requested_at(state: Mapping[str, Any] | None = None) -> str | None:
"""Extract last requested timestamp from persisted state."""
payload = dict(state or {})
last_restart = payload.get("last_restart") or {}
if not isinstance(last_restart, dict):
return None
value = last_restart.get("requested_at")
return str(value) if value else None
def evaluate_worker_recovery_state(
history: list[dict[str, Any]] | None,
*,
last_requested_at: str | None = None,
now: datetime | None = None,
) -> dict[str, Any]:
"""Evaluate restart policy state for automated/manual recovery decisions."""
cfg = get_worker_recovery_policy_config()
now_dt = now or _utc_now()
window_seconds = int(cfg["window_seconds"])
cooldown_seconds = int(cfg["cooldown_seconds"])
recent_attempts = 0
for item in history or []:
requested = _parse_iso(item.get("requested_at"))
completed = _parse_iso(item.get("completed_at"))
ts = requested or completed
if ts is None:
continue
age = (now_dt - ts).total_seconds()
if age <= window_seconds:
recent_attempts += 1
retry_budget = int(cfg["retry_budget"])
churn_threshold = int(cfg["churn_threshold"])
retry_budget_exhausted = recent_attempts >= retry_budget
churn_exceeded = recent_attempts >= churn_threshold
guarded_mode = bool(cfg["guarded_mode_enabled"] and (retry_budget_exhausted or churn_exceeded))
cooldown_active = False
cooldown_remaining = 0
last_requested_dt = _parse_iso(last_requested_at)
if last_requested_dt is not None:
elapsed = (now_dt - last_requested_dt).total_seconds()
if elapsed < cooldown_seconds:
cooldown_active = True
cooldown_remaining = int(max(cooldown_seconds - elapsed, 0))
blocked = guarded_mode
allowed = not blocked and not cooldown_active
state = "allowed"
if blocked:
state = "blocked"
elif cooldown_active:
state = "cooldown"
return {
"state": state,
"allowed": allowed,
"cooldown": cooldown_active,
"cooldown_remaining_seconds": cooldown_remaining,
"blocked": blocked,
"guarded_mode": guarded_mode,
"retry_budget_exhausted": retry_budget_exhausted,
"churn_exceeded": churn_exceeded,
"attempts_in_window": recent_attempts,
"retry_budget": retry_budget,
"churn_threshold": churn_threshold,
"window_seconds": window_seconds,
"cooldown_seconds": cooldown_seconds,
}
def decide_restart_request(
policy_state: Mapping[str, Any],
*,
source: str,
manual_override: bool = False,
override_acknowledged: bool = False,
) -> dict[str, Any]:
"""Decide whether restart request is allowed under current policy state."""
state = dict(policy_state or {})
blocked = bool(state.get("blocked"))
cooldown = bool(state.get("cooldown"))
source_value = (source or "manual").strip().lower()
if source_value not in {"auto", "manual"}:
source_value = "manual"
if source_value == "auto":
if blocked:
return {
"allowed": False,
"decision": "blocked",
"reason": "guarded_mode_blocked",
"requires_acknowledgement": False,
}
if cooldown:
return {
"allowed": False,
"decision": "blocked",
"reason": "cooldown_active",
"requires_acknowledgement": False,
}
return {
"allowed": True,
"decision": "allowed",
"reason": "policy_allows_auto_restart",
"requires_acknowledgement": False,
}
if (blocked or cooldown) and not (manual_override and override_acknowledged):
reason = "manual_override_required" if blocked else "cooldown_override_required"
return {
"allowed": False,
"decision": "blocked",
"reason": reason,
"requires_acknowledgement": True,
}
if manual_override and override_acknowledged:
return {
"allowed": True,
"decision": "manual_override",
"reason": "operator_override_acknowledged",
"requires_acknowledgement": False,
}
return {
"allowed": True,
"decision": "allowed",
"reason": "policy_allows_manual_restart",
"requires_acknowledgement": False,
}
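The decision matrix in `decide_restart_request` reduces to: auto-restarts never bypass guarded mode or cooldown, while manual restarts may with an acknowledged override. A condensed standalone sketch of that matrix (illustrative, not the module itself, and omitting the audit reasons):

```python
def decide(blocked: bool, cooldown: bool, source: str,
           manual_override: bool = False, acknowledged: bool = False) -> str:
    """Condensed restart decision: 'allowed', 'blocked', or 'manual_override'."""
    if source == "auto":
        # Automated recovery must respect both guardrails unconditionally.
        return "blocked" if (blocked or cooldown) else "allowed"
    if (blocked or cooldown) and not (manual_override and acknowledged):
        return "blocked"
    if manual_override and acknowledged:
        return "manual_override"
    return "allowed"
```

The full implementation additionally requires a non-empty `override_reason` (enforced by the admin route) so every override lands in the audit log with operator context.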


@@ -7,8 +7,9 @@ import json
 import logging
 import os
 import time
-from datetime import datetime
+from datetime import datetime, timezone
 from pathlib import Path
+from typing import Any

 from flask import Blueprint, g, jsonify, render_template, request
@@ -19,6 +20,17 @@ from mes_dashboard.core.resilience import (
     get_resilience_thresholds,
     summarize_restart_history,
 )
+from mes_dashboard.core.runtime_contract import (
+    build_runtime_contract_diagnostics,
+    load_runtime_contract,
+)
+from mes_dashboard.core.worker_recovery_policy import (
+    decide_restart_request,
+    evaluate_worker_recovery_state,
+    extract_last_requested_at,
+    extract_restart_history,
+    load_restart_state,
+)
 from mes_dashboard.services.page_registry import get_all_pages, set_page_status

 admin_bp = Blueprint("admin", __name__, url_prefix="/admin")
@@ -28,21 +40,13 @@ logger = logging.getLogger("mes_dashboard.admin")
 # Worker Restart Configuration
 # ============================================================

-WATCHDOG_RUNTIME_DIR = os.getenv("WATCHDOG_RUNTIME_DIR", "/tmp")
-RESTART_FLAG_PATH = os.getenv(
-    "WATCHDOG_RESTART_FLAG",
-    f"{WATCHDOG_RUNTIME_DIR}/mes_dashboard_restart.flag"
-)
-RESTART_STATE_PATH = os.getenv(
-    "WATCHDOG_STATE_FILE",
-    f"{WATCHDOG_RUNTIME_DIR}/mes_dashboard_restart_state.json"
-)
-WATCHDOG_PID_PATH = os.getenv(
-    "WATCHDOG_PID_FILE",
-    f"{WATCHDOG_RUNTIME_DIR}/gunicorn.pid"
-)
-GUNICORN_BIND = os.getenv("GUNICORN_BIND", "0.0.0.0:8080")
-RESTART_COOLDOWN_SECONDS = int(os.getenv("WORKER_RESTART_COOLDOWN", "60"))
+_RUNTIME_CONTRACT = load_runtime_contract()
+WATCHDOG_RUNTIME_DIR = _RUNTIME_CONTRACT["watchdog_runtime_dir"]
+RESTART_FLAG_PATH = _RUNTIME_CONTRACT["watchdog_restart_flag"]
+RESTART_STATE_PATH = _RUNTIME_CONTRACT["watchdog_state_file"]
+WATCHDOG_PID_PATH = _RUNTIME_CONTRACT["watchdog_pid_file"]
+GUNICORN_BIND = _RUNTIME_CONTRACT["gunicorn_bind"]
+RUNTIME_CONTRACT_VERSION = _RUNTIME_CONTRACT["version"]

 # Track last restart request time (in-memory for this worker)
 _last_restart_request: float = 0.0
@@ -91,7 +95,9 @@ def api_system_status():
     thresholds = get_resilience_thresholds()
     restart_state = _get_restart_state()
     restart_churn = _get_restart_churn_summary(restart_state)
-    in_cooldown, remaining = _check_restart_cooldown()
+    policy_state = _get_restart_policy_state(restart_state)
+    in_cooldown = bool(policy_state.get("cooldown"))
+    remaining = int(policy_state.get("cooldown_remaining_seconds") or 0)

     degraded_reason = None
     if db_status == "error":
@@ -111,6 +117,14 @@ def api_system_status():
         restart_churn_exceeded=bool(restart_churn.get("exceeded")),
         cooldown_active=in_cooldown,
     )
+    alerts = _build_restart_alerts(
+        pool_saturation=(pool_state or {}).get("saturation"),
+        circuit_state=circuit_breaker.get("state"),
+        route_cache_degraded=bool(route_cache.get("degraded")),
+        policy_state=policy_state,
+        thresholds=thresholds,
+    )
+    runtime_contract = build_runtime_contract_diagnostics(strict=False)

     # Cache status
     from mes_dashboard.routes.health_routes import (
@@ -142,13 +156,22 @@ def api_system_status():
             "pool_state": pool_state,
             "route_cache": route_cache,
             "thresholds": thresholds,
+            "alerts": alerts,
             "restart_churn": restart_churn,
+            "policy_state": {
+                "state": policy_state.get("state"),
+                "allowed": policy_state.get("allowed"),
+                "cooldown": policy_state.get("cooldown"),
+                "blocked": policy_state.get("blocked"),
+                "cooldown_remaining_seconds": remaining,
+            },
             "recovery_recommendation": recommendation,
             "restart_cooldown": {
                 "active": in_cooldown,
-                "remaining_seconds": int(remaining) if in_cooldown else 0,
+                "remaining_seconds": remaining if in_cooldown else 0,
             },
         },
+        "runtime_contract": runtime_contract,
         "single_port_bind": GUNICORN_BIND,
         "worker_pid": os.getpid()
     }
@@ -283,13 +306,13 @@ def api_logs_cleanup():
 def _get_restart_state() -> dict:
     """Read worker restart state from file."""
-    state_path = Path(RESTART_STATE_PATH)
-    if not state_path.exists():
-        return {}
-    try:
-        return json.loads(state_path.read_text())
-    except (json.JSONDecodeError, IOError):
-        return {}
+    return load_restart_state(RESTART_STATE_PATH)
+
+
+def _iso_from_epoch(ts: float) -> str | None:
+    if ts <= 0:
+        return None
+    return datetime.fromtimestamp(ts, tz=timezone.utc).isoformat()


 def _check_restart_cooldown() -> tuple[bool, float]:
@@ -298,38 +321,16 @@ def _check_restart_cooldown() -> tuple[bool, float]:
     Returns:
         Tuple of (is_in_cooldown, remaining_seconds).
     """
-    global _last_restart_request
-
-    # Check in-memory cooldown first
-    now = time.time()
-    elapsed = now - _last_restart_request
-    if elapsed < RESTART_COOLDOWN_SECONDS:
-        return True, RESTART_COOLDOWN_SECONDS - elapsed
-
-    # Check file-based state (for cross-worker coordination)
-    state = _get_restart_state()
-    last_restart = state.get("last_restart", {})
-    requested_at = last_restart.get("requested_at")
-    if requested_at:
-        try:
-            request_time = datetime.fromisoformat(requested_at).timestamp()
-            elapsed = now - request_time
-            if elapsed < RESTART_COOLDOWN_SECONDS:
-                return True, RESTART_COOLDOWN_SECONDS - elapsed
-        except (ValueError, TypeError):
-            pass
-
+    policy = _get_restart_policy_state()
+    if policy.get("cooldown"):
+        return True, float(policy.get("cooldown_remaining_seconds") or 0.0)
     return False, 0.0


 def _get_restart_history(state: dict | None = None) -> list[dict]:
     """Return bounded restart history for admin telemetry."""
     payload = state if state is not None else _get_restart_state()
-    raw_history = payload.get("history") or []
-    if not isinstance(raw_history, list):
-        return []
-    return raw_history[-20:]
+    return extract_restart_history(payload)[-20:]


 def _get_restart_churn_summary(state: dict | None = None) -> dict:
@@ -338,22 +339,58 @@ def _get_restart_churn_summary(state: dict | None = None) -> dict:
     return summarize_restart_history(history)


-def _worker_recovery_hint(churn: dict, cooldown_active: bool) -> dict:
-    """Build worker control recommendation from churn/cooldown state."""
-    if churn.get("exceeded"):
-        return {
-            "action": "throttle_and_investigate_queries",
-            "reason": "restart_churn_exceeded",
-        }
-    if cooldown_active:
-        return {
-            "action": "wait_for_restart_cooldown",
-            "reason": "restart_cooldown_active",
-        }
-    return {
-        "action": "restart_available",
-        "reason": "no_churn_or_cooldown",
-    }
+def _get_restart_policy_state(state: dict | None = None) -> dict[str, Any]:
+    """Return effective worker restart policy state."""
+    payload = state if state is not None else _get_restart_state()
+    history = _get_restart_history(payload)
+    last_requested = extract_last_requested_at(payload)
+    in_memory_requested = _iso_from_epoch(_last_restart_request)
+    if in_memory_requested:
+        try:
+            in_memory_dt = datetime.fromisoformat(in_memory_requested)
+            persisted_dt = datetime.fromisoformat(last_requested) if last_requested else None
+        except (TypeError, ValueError):
+            in_memory_dt = None
+            persisted_dt = None
+        if in_memory_dt and (persisted_dt is None or in_memory_dt > persisted_dt):
+            last_requested = in_memory_requested
+    return evaluate_worker_recovery_state(
+        history,
+        last_requested_at=last_requested,
+    )
+
+
+def _build_restart_alerts(
+    *,
+    pool_saturation: float | None,
+    circuit_state: str | None,
+    route_cache_degraded: bool,
+    policy_state: dict[str, Any],
+    thresholds: dict[str, Any],
+) -> dict[str, Any]:
+    saturation = float(pool_saturation or 0.0)
+    warning = float(thresholds.get("pool_saturation_warning", 0.9))
+    critical = float(thresholds.get("pool_saturation_critical", 1.0))
+    return {
+        "pool_warning": saturation >= warning,
+        "pool_critical": saturation >= critical,
+        "circuit_open": circuit_state == "OPEN",
+        "route_cache_degraded": bool(route_cache_degraded),
+        "restart_churn_exceeded": bool(policy_state.get("churn_exceeded")),
+        "restart_blocked": bool(policy_state.get("blocked")),
+    }
+
+
+def _log_restart_audit(event: str, payload: dict[str, Any]) -> None:
+    entry = {
+        "event": event,
+        "timestamp": datetime.now(tz=timezone.utc).isoformat(),
+        "runtime_contract_version": RUNTIME_CONTRACT_VERSION,
+        **payload,
+    }
+    logger.info("worker_restart_audit %s", json.dumps(entry, ensure_ascii=False))


 @admin_bp.route("/api/worker/restart", methods=["POST"])
@@ -366,19 +403,60 @@ def api_worker_restart():
     """
     global _last_restart_request

-    # Check cooldown
-    in_cooldown, remaining = _check_restart_cooldown()
-    if in_cooldown:
-        return error_response(
-            TOO_MANY_REQUESTS,
-            f"Restart in cooldown. Please wait {int(remaining)} seconds.",
-            status_code=429
-        )
+    payload = request.get_json(silent=True) or {}
+    manual_override = bool(payload.get("manual_override"))
+    override_acknowledged = bool(payload.get("override_acknowledged"))
+    override_reason = str(payload.get("override_reason") or "").strip()

     # Get request metadata
     user = getattr(g, "username", "unknown")
     ip = request.remote_addr or "unknown"
-    timestamp = datetime.now().isoformat()
+    timestamp = datetime.now(tz=timezone.utc).isoformat()
+
+    state = _get_restart_state()
+    policy_state = _get_restart_policy_state(state)
+    decision = decide_restart_request(
+        policy_state,
+        source="manual",
+        manual_override=manual_override,
+        override_acknowledged=override_acknowledged,
+    )
+
+    if manual_override and not override_reason:
+        return error_response(
+            "RESTART_OVERRIDE_REASON_REQUIRED",
+            "Manual override requires non-empty override_reason for audit traceability.",
+            status_code=400,
+        )
+
+    if not decision["allowed"]:
+        status_code = 429 if policy_state.get("cooldown") else 409
+        if status_code == 429:
+            message = (
+                f"Restart in cooldown. Please wait "
+                f"{int(policy_state.get('cooldown_remaining_seconds') or 0)} seconds."
+            )
+            code = TOO_MANY_REQUESTS
+        else:
+            message = (
+                "Restart blocked by guarded mode. "
+                "Set manual_override=true and override_acknowledged=true to proceed."
+            )
+            code = "RESTART_POLICY_BLOCKED"
+        _log_restart_audit(
+            "restart_request_blocked",
+            {
+                "actor": user,
+                "ip": ip,
+                "decision": decision,
+                "policy_state": policy_state,
+            },
+        )
+        return error_response(
+            code,
+            message,
+            status_code=status_code,
+        )

     # Write restart flag file
     flag_path = Path(RESTART_FLAG_PATH)
@@ -386,11 +464,21 @@ def api_worker_restart():
         "user": user,
         "ip": ip,
         "timestamp": timestamp,
-        "worker_pid": os.getpid()
+        "worker_pid": os.getpid(),
+        "source": "manual",
+        "manual_override": bool(manual_override and override_acknowledged),
+        "override_acknowledged": override_acknowledged,
+        "override_reason": override_reason or None,
+        "policy_state": policy_state,
+        "policy_decision": decision["decision"],
+        "runtime_contract_version": RUNTIME_CONTRACT_VERSION,
     }
     try:
-        flag_path.write_text(json.dumps(flag_data))
+        flag_path.parent.mkdir(parents=True, exist_ok=True)
+        tmp_path = flag_path.with_suffix(flag_path.suffix + ".tmp")
+        tmp_path.write_text(json.dumps(flag_data, ensure_ascii=False))
+        tmp_path.replace(flag_path)
     except IOError as e:
         logger.error(f"Failed to write restart flag: {e}")
         return error_response(
@@ -402,8 +490,15 @@ def api_worker_restart():
     # Update in-memory cooldown
     _last_restart_request = time.time()

-    logger.info(
-        f"Worker restart requested by {user} from {ip}"
+    _log_restart_audit(
+        "restart_request_accepted",
+        {
+            "actor": user,
+            "ip": ip,
+            "decision": decision,
+            "policy_state": policy_state,
+            "override_reason": override_reason or None,
+        },
     )

     return jsonify({
@@ -412,6 +507,14 @@ def api_worker_restart():
         "message": "Restart requested. Workers will reload shortly.",
         "requested_by": user,
         "requested_at": timestamp,
+        "policy_state": {
+            "state": policy_state.get("state"),
+            "allowed": policy_state.get("allowed"),
+            "cooldown": policy_state.get("cooldown"),
+            "blocked": policy_state.get("blocked"),
+            "cooldown_remaining_seconds": policy_state.get("cooldown_remaining_seconds"),
+        },
+        "decision": decision,
         "single_port_bind": GUNICORN_BIND,
         "watchdog": {
             "runtime_dir": WATCHDOG_RUNTIME_DIR,
@@ -427,16 +530,21 @@ def api_worker_restart():
 @admin_required
 def api_worker_status():
     """API: Get worker status and restart information."""
-    # Check cooldown
-    in_cooldown, remaining = _check_restart_cooldown()
-
     # Get last restart info
     state = _get_restart_state()
     last_restart = state.get("last_restart", {})
     history = _get_restart_history(state)
     churn = _get_restart_churn_summary(state)
+    policy_state = _get_restart_policy_state(state)
     thresholds = get_resilience_thresholds()
-    recommendation = _worker_recovery_hint(churn, in_cooldown)
+    recommendation = build_recovery_recommendation(
+        degraded_reason="db_pool_saturated" if policy_state.get("blocked") else None,
+        pool_saturation=None,
+        circuit_state=None,
+        restart_churn_exceeded=bool(churn.get("exceeded")),
+        cooldown_active=bool(policy_state.get("cooldown")),
+    )
+    runtime_contract = build_runtime_contract_diagnostics(strict=False)

     # Get worker start time (psutil is optional)
     worker_start_time = None
@@ -466,6 +574,11 @@ def api_worker_status():
         "worker_pid": os.getpid(),
         "worker_start_time": worker_start_time,
         "runtime_contract": {
+            "version": runtime_contract["contract"]["version"],
+            "validation": {
+                "valid": runtime_contract["valid"],
+                "errors": runtime_contract["errors"],
+            },
             "single_port_bind": GUNICORN_BIND,
             "watchdog": {
                 "runtime_dir": WATCHDOG_RUNTIME_DIR,
@@ -478,12 +591,27 @@ def api_worker_status():
             },
         },
         "cooldown": {
-            "active": in_cooldown,
-            "remaining_seconds": int(remaining) if in_cooldown else 0
+            "active": bool(policy_state.get("cooldown")),
+            "remaining_seconds": int(policy_state.get("cooldown_remaining_seconds") or 0)
         },
         "resilience": {
             "thresholds": thresholds,
+            "alerts": {
+                "restart_churn_exceeded": bool(churn.get("exceeded")),
+                "restart_blocked": bool(policy_state.get("blocked")),
+            },
             "restart_churn": churn,
+            "policy_state": {
+                "state": policy_state.get("state"),
+                "allowed": policy_state.get("allowed"),
+                "cooldown": policy_state.get("cooldown"),
+                "blocked": policy_state.get("blocked"),
+                "cooldown_remaining_seconds": policy_state.get("cooldown_remaining_seconds"),
+                "attempts_in_window": policy_state.get("attempts_in_window"),
+                "retry_budget": policy_state.get("retry_budget"),
+                "churn_threshold": policy_state.get("churn_threshold"),
+                "window_seconds": policy_state.get("window_seconds"),
+            },
             "recovery_recommendation": recommendation,
         },
         "restart_history": history,

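The accepted restart branch writes the flag file through a sibling temp file followed by a rename, so the watchdog never observes a half-written JSON flag. A minimal sketch of that write-then-replace pattern (paths here are illustrative):

```python
import json
import tempfile
from pathlib import Path


def write_flag_atomically(flag_path: Path, payload: dict) -> None:
    """Write JSON to a sibling temp file, then rename over the target."""
    flag_path.parent.mkdir(parents=True, exist_ok=True)
    tmp_path = flag_path.with_suffix(flag_path.suffix + ".tmp")
    tmp_path.write_text(json.dumps(payload, ensure_ascii=False))
    tmp_path.replace(flag_path)  # rename is atomic on POSIX within one filesystem


flag = Path(tempfile.mkdtemp()) / "restart.flag"
write_flag_atomically(flag, {"source": "manual", "worker_pid": 1234})
```

The rename guarantee only holds when the temp file lives on the same filesystem as the target, which is why the temp file is created next to the flag rather than in a system temp directory.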

@@ -11,6 +11,7 @@ from threading import Lock
 from flask import Blueprint, flash, redirect, render_template, request, session, url_for

+from mes_dashboard.core.csrf import rotate_csrf_token
 from mes_dashboard.services.auth_service import authenticate, is_admin

 logger = logging.getLogger('mes_dashboard.auth_routes')
@@ -93,6 +94,7 @@ def login():
                 error = "您不是管理員,無法登入後台"
             else:
                 # Login successful
+                session.clear()
                 session["admin"] = {
                     "username": user.get("username"),
                     "displayName": user.get("displayName"),
@@ -100,6 +102,7 @@ def login():
                     "department": user.get("department"),
                     "login_time": datetime.now().isoformat(),
                 }
+                rotate_csrf_token()
                 next_url = request.args.get("next", url_for("portal_index"))
                 return redirect(next_url)
@@ -109,5 +112,5 @@ def login():
 @auth_bp.route("/logout")
 def logout():
     """Admin logout."""
-    session.pop("admin", None)
+    session.clear()
     return redirect(url_for("portal_index"))


@@ -7,12 +7,14 @@ Provides /health and /health/deep endpoints for monitoring service status.
 from __future__ import annotations

 import logging
+import os
+import threading
 import time
 from datetime import datetime, timedelta

-from flask import Blueprint, jsonify, make_response
+from flask import Blueprint, current_app, jsonify, make_response

 from mes_dashboard.core.database import (
-    get_engine,
+    get_health_engine,
     get_pool_runtime_config,
     get_pool_status,
 )
@@ -28,6 +30,15 @@ from mes_dashboard.core.cache import (
 from mes_dashboard.core.resilience import (
     build_recovery_recommendation,
     get_resilience_thresholds,
+    summarize_restart_history,
+)
+from mes_dashboard.core.runtime_contract import build_runtime_contract_diagnostics
+from mes_dashboard.core.worker_recovery_policy import (
+    evaluate_worker_recovery_state,
+    extract_last_requested_at,
+    extract_restart_history,
+    get_worker_recovery_policy_config,
+    load_restart_state,
 )
 from sqlalchemy import text
@@ -41,6 +52,61 @@ health_bp = Blueprint('health', __name__)
 DB_LATENCY_WARNING_MS = 100  # Database latency > 100ms is slow
 CACHE_STALE_MINUTES = 2  # Cache update > 2 minutes is stale

+HEALTH_MEMO_TTL_SECONDS = int(os.getenv("HEALTH_MEMO_TTL_SECONDS", "5"))
+
+_HEALTH_MEMO_LOCK = threading.Lock()
+_HEALTH_MEMO: dict[str, dict | None] = {
+    "health": None,
+    "deep": None,
+}
+
+
+def _health_memo_enabled() -> bool:
+    if HEALTH_MEMO_TTL_SECONDS <= 0:
+        return False
+    if current_app.testing or bool(current_app.config.get("TESTING")):
+        return False
+    return True
+
+
+def _get_health_memo(cache_key: str) -> tuple[dict, int] | None:
+    if not _health_memo_enabled():
+        return None
+    now = time.time()
+    with _HEALTH_MEMO_LOCK:
+        entry = _HEALTH_MEMO.get(cache_key)
+        if not entry:
+            return None
+        if now - float(entry.get("ts", 0.0)) > HEALTH_MEMO_TTL_SECONDS:
+            _HEALTH_MEMO[cache_key] = None
+            return None
+        return entry["payload"], int(entry["status"])
+
+
+def _set_health_memo(cache_key: str, payload: dict, status_code: int) -> None:
+    if not _health_memo_enabled():
+        return
+    with _HEALTH_MEMO_LOCK:
+        _HEALTH_MEMO[cache_key] = {
+            "ts": time.time(),
+            "payload": payload,
+            "status": int(status_code),
+        }
+
+
+def _build_health_response(payload: dict, status_code: int):
+    """Build JSON response with explicit no-cache headers."""
+    resp = make_response(jsonify(payload), status_code)
+    resp.headers['Cache-Control'] = 'no-cache, no-store, must-revalidate'
+    resp.headers['Pragma'] = 'no-cache'
+    resp.headers['Expires'] = '0'
+    return resp
+
+
+def _reset_health_memo_for_tests() -> None:
+    with _HEALTH_MEMO_LOCK:
+        _HEALTH_MEMO["health"] = None
+        _HEALTH_MEMO["deep"] = None
def _classify_degraded_reason( def _classify_degraded_reason(
@@ -63,6 +129,48 @@ def _classify_degraded_reason(
return None return None
+def _build_resilience_alerts(
+    *,
+    pool_saturation: float | None,
+    circuit_state: str | None,
+    route_cache_degraded: bool,
+    restart_churn_exceeded: bool,
+    restart_blocked: bool,
+    thresholds: dict,
+) -> dict:
+    saturation = float(pool_saturation or 0.0)
+    warning = float(thresholds.get("pool_saturation_warning", 0.9))
+    critical = float(thresholds.get("pool_saturation_critical", 1.0))
+    return {
+        "pool_warning": saturation >= warning,
+        "pool_critical": saturation >= critical,
+        "circuit_open": circuit_state == "OPEN",
+        "route_cache_degraded": bool(route_cache_degraded),
+        "restart_churn_exceeded": bool(restart_churn_exceeded),
+        "restart_blocked": bool(restart_blocked),
+    }
+
+
+def get_worker_recovery_status() -> dict:
+    """Build worker recovery policy status for health/admin telemetry."""
+    state = load_restart_state()
+    history = extract_restart_history(state)
+    policy_state = evaluate_worker_recovery_state(
+        history,
+        last_requested_at=extract_last_requested_at(state),
+    )
+    churn = summarize_restart_history(
+        history,
+        window_seconds=int(policy_state.get("window_seconds") or 600),
+        threshold=int(policy_state.get("churn_threshold") or 3),
+    )
+    return {
+        "policy_state": policy_state,
+        "restart_churn": churn,
+        "policy_config": get_worker_recovery_policy_config(),
+    }
 def check_database() -> tuple[str, str | None]:
     """Check database connectivity.
@@ -71,7 +179,7 @@ def check_database() -> tuple[str, str | None]:
         status is 'ok' or 'error'.
     """
     try:
-        engine = get_engine()
+        engine = get_health_engine()
         with engine.connect() as conn:
             conn.execute(text("SELECT 1 FROM DUAL"))
         return 'ok', None
@@ -111,13 +219,21 @@ def get_cache_status() -> dict:
     status = {
         'enabled': REDIS_ENABLED,
         'sys_date': get_cached_sys_date(),
-        'updated_at': get_cache_updated_at()
+        'updated_at': get_cache_updated_at(),
+        'derived_search_index': {},
+        'derived_frame_snapshot': {},
+        'index_metrics': {},
+        'memory': {},
     }
     try:
         from mes_dashboard.services.wip_service import get_wip_search_index_status
-        status['derived_search_index'] = get_wip_search_index_status()
+        derived = get_wip_search_index_status()
+        status['derived_search_index'] = derived.get('derived_search_index', {})
+        status['derived_frame_snapshot'] = derived.get('derived_frame_snapshot', {})
+        status['index_metrics'] = derived.get('metrics', {})
+        status['memory'] = derived.get('memory', {})
     except Exception:
-        status['derived_search_index'] = {}
+        pass
     return status
@@ -205,6 +321,11 @@ def health_check():
     - 200 OK: All services healthy or degraded (Redis down but DB ok)
     - 503 Service Unavailable: Database unhealthy
     """
+    cached = _get_health_memo("health")
+    if cached is not None:
+        payload, status_code = cached
+        return _build_health_response(payload, status_code)
+
     from mes_dashboard.core.circuit_breaker import get_circuit_breaker_status
 
     db_status, db_error = check_database()
@@ -266,13 +387,25 @@ def health_check():
         warnings.append(f"Database pool saturation is high ({saturation:.0%})")
 
     thresholds = get_resilience_thresholds()
+    worker_recovery = get_worker_recovery_status()
+    policy_state = worker_recovery.get("policy_state", {})
+    restart_churn = worker_recovery.get("restart_churn", {})
     recommendation = build_recovery_recommendation(
         degraded_reason=degraded_reason,
         pool_saturation=pool_saturation,
         circuit_state=circuit_breaker.get('state'),
-        restart_churn_exceeded=False,
-        cooldown_active=False,
+        restart_churn_exceeded=bool(restart_churn.get("exceeded")),
+        cooldown_active=bool(policy_state.get("cooldown")),
     )
+    alerts = _build_resilience_alerts(
+        pool_saturation=pool_saturation,
+        circuit_state=circuit_breaker.get("state"),
+        route_cache_degraded=bool(route_cache.get("degraded")),
+        restart_churn_exceeded=bool(restart_churn.get("exceeded")),
+        restart_blocked=bool(policy_state.get("blocked")),
+        thresholds=thresholds,
+    )
+    runtime_contract = build_runtime_contract_diagnostics(strict=False)
 
     # Check equipment status cache
     equipment_status_cache = get_equipment_status_cache_status()
@@ -293,8 +426,18 @@ def health_check():
         },
         'resilience': {
             'thresholds': thresholds,
+            'alerts': alerts,
+            'policy_state': {
+                'state': policy_state.get("state"),
+                'allowed': policy_state.get("allowed"),
+                'cooldown': policy_state.get("cooldown"),
+                'blocked': policy_state.get("blocked"),
+                'cooldown_remaining_seconds': policy_state.get("cooldown_remaining_seconds"),
+            },
+            'restart_churn': restart_churn,
             'recovery_recommendation': recommendation,
         },
+        'runtime_contract': runtime_contract,
         'cache': get_cache_status(),
         'route_cache': route_cache,
        'resource_cache': resource_cache,
@@ -307,12 +450,8 @@ def health_check():
     if warnings:
         response['warnings'] = warnings
 
-    # Add no-cache headers to prevent browser caching
-    resp = make_response(jsonify(response), http_code)
-    resp.headers['Cache-Control'] = 'no-cache, no-store, must-revalidate'
-    resp.headers['Pragma'] = 'no-cache'
-    resp.headers['Expires'] = '0'
-    return resp
+    _set_health_memo("health", response, http_code)
+    return _build_health_response(response, http_code)
 @health_bp.route('/health/deep', methods=['GET'])
@@ -334,6 +473,11 @@ def deep_health_check():
     if not is_admin_logged_in():
         return redirect(url_for("auth.login", next=request.url))
 
+    cached = _get_health_memo("deep")
+    if cached is not None:
+        payload, status_code = cached
+        return _build_health_response(payload, status_code)
+
     # Check database with latency measurement
     db_start = time.time()
     db_status, db_error = check_database()
@@ -397,6 +541,9 @@ def deep_health_check():
         warnings.append(f"Database pool saturation is high ({pool_saturation:.0%})")
 
     thresholds = get_resilience_thresholds()
+    worker_recovery = get_worker_recovery_status()
+    policy_state = worker_recovery.get("policy_state", {})
+    restart_churn = worker_recovery.get("restart_churn", {})
     degraded_reason = _classify_degraded_reason(
         db_status=db_status,
         redis_status=redis_status,
@@ -408,9 +555,18 @@ def deep_health_check():
         degraded_reason=degraded_reason,
         pool_saturation=pool_saturation,
         circuit_state=circuit_breaker.get('state'),
-        restart_churn_exceeded=False,
-        cooldown_active=False,
+        restart_churn_exceeded=bool(restart_churn.get("exceeded")),
+        cooldown_active=bool(policy_state.get("cooldown")),
     )
+    alerts = _build_resilience_alerts(
+        pool_saturation=pool_saturation,
+        circuit_state=circuit_breaker.get("state"),
+        route_cache_degraded=bool(route_cache.get("degraded")),
+        restart_churn_exceeded=bool(restart_churn.get("exceeded")),
+        restart_blocked=bool(policy_state.get("blocked")),
+        thresholds=thresholds,
+    )
+    runtime_contract = build_runtime_contract_diagnostics(strict=False)
 
     # Check latency thresholds
     db_latency_status = 'healthy'
@@ -429,8 +585,18 @@ def deep_health_check():
         'degraded_reason': degraded_reason,
         'resilience': {
             'thresholds': thresholds,
+            'alerts': alerts,
+            'policy_state': {
+                'state': policy_state.get("state"),
+                'allowed': policy_state.get("allowed"),
+                'cooldown': policy_state.get("cooldown"),
+                'blocked': policy_state.get("blocked"),
+                'cooldown_remaining_seconds': policy_state.get("cooldown_remaining_seconds"),
+            },
+            'restart_churn': restart_churn,
             'recovery_recommendation': recommendation,
         },
+        'runtime_contract': runtime_contract,
         'checks': {
             'database': {
                 'status': db_latency_status if db_status == 'ok' else 'error',
@@ -446,7 +612,9 @@ def deep_health_check():
         'cache': {
             'freshness': cache_freshness,
             'updated_at': cache_updated_at,
-            'sys_date': cache_status.get('sys_date')
+            'sys_date': cache_status.get('sys_date'),
+            'index_metrics': cache_status.get('index_metrics', {}),
+            'memory': cache_status.get('memory', {}),
         },
         'route_cache': route_cache
     },
@@ -464,9 +632,5 @@ def deep_health_check():
     if warnings:
         response['warnings'] = warnings
 
-    # Add no-cache headers
-    resp = make_response(jsonify(response), http_code)
-    resp.headers['Cache-Control'] = 'no-cache, no-store, must-revalidate'
-    resp.headers['Pragma'] = 'no-cache'
-    resp.headers['Expires'] = '0'
-    return resp
+    _set_health_memo("deep", response, http_code)
+    return _build_health_response(response, http_code)
@@ -6,6 +6,8 @@ Contains Flask Blueprint for Hold Detail page and API endpoints.
 from flask import Blueprint, jsonify, request, render_template, redirect, url_for
 
+from mes_dashboard.core.rate_limit import configured_rate_limit
+from mes_dashboard.core.utils import parse_bool_query
 from mes_dashboard.services.wip_service import (
     get_hold_detail_summary,
     get_hold_detail_distribution,
@@ -16,10 +18,13 @@ from mes_dashboard.services.wip_service import (
 # Create Blueprint
 hold_bp = Blueprint('hold', __name__)
 
-def _parse_bool(value: str) -> bool:
-    """Parse boolean from query string."""
-    return value.lower() in ('true', '1', 'yes') if value else False
+_HOLD_LOTS_RATE_LIMIT = configured_rate_limit(
+    bucket="hold-detail-lots",
+    max_attempts_env="HOLD_LOTS_RATE_LIMIT_MAX_REQUESTS",
+    window_seconds_env="HOLD_LOTS_RATE_LIMIT_WINDOW_SECONDS",
+    default_max_attempts=90,
+    default_window_seconds=60,
+)
 
 # ============================================================
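`configured_rate_limit` is project-internal, so its exact semantics are an assumption here; the per-bucket env overrides suggest a fixed-window limiter along these lines (the decorator name, the env lookup order, and the 429 tuple are illustrative):

```python
import os
import time
from functools import wraps


def configured_rate_limit_sketch(bucket, max_attempts_env, window_seconds_env,
                                 default_max_attempts, default_window_seconds):
    """Fixed-window rate limit; env vars override the per-bucket defaults."""
    max_attempts = int(os.getenv(max_attempts_env, default_max_attempts))
    window = int(os.getenv(window_seconds_env, default_window_seconds))
    state = {"window_start": 0.0, "count": 0}

    def decorator(fn):
        @wraps(fn)
        def wrapper(*args, **kwargs):
            now = time.time()
            if now - state["window_start"] >= window:
                # New window: reset the counter.
                state["window_start"], state["count"] = now, 0
            state["count"] += 1
            if state["count"] > max_attempts:
                return ("rate limited", 429)  # real code would return a JSON 429
            return fn(*args, **kwargs)
        return wrapper
    return decorator
```

Applied as `@_HOLD_LOTS_RATE_LIMIT` above, the decorator caps `/api/wip/hold-detail/lots` at 90 requests per 60-second window unless the env vars say otherwise.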
@@ -64,7 +69,7 @@ def api_hold_detail_summary():
     if not reason:
         return jsonify({'success': False, 'error': '缺少必要參數: reason'}), 400
 
-    include_dummy = _parse_bool(request.args.get('include_dummy', ''))
+    include_dummy = parse_bool_query(request.args.get('include_dummy'))
 
     result = get_hold_detail_summary(
         reason=reason,
@@ -90,7 +95,7 @@ def api_hold_detail_distribution():
     if not reason:
         return jsonify({'success': False, 'error': '缺少必要參數: reason'}), 400
 
-    include_dummy = _parse_bool(request.args.get('include_dummy', ''))
+    include_dummy = parse_bool_query(request.args.get('include_dummy'))
 
     result = get_hold_detail_distribution(
         reason=reason,
@@ -102,6 +107,7 @@ def api_hold_detail_distribution():
 
 @hold_bp.route('/api/wip/hold-detail/lots')
+@_HOLD_LOTS_RATE_LIMIT
 def api_hold_detail_lots():
     """API: Get paginated lot details for a specific hold reason.
@@ -124,7 +130,7 @@ def api_hold_detail_lots():
     workcenter = request.args.get('workcenter', '').strip() or None
     package = request.args.get('package', '').strip() or None
     age_range = request.args.get('age_range', '').strip() or None
-    include_dummy = _parse_bool(request.args.get('include_dummy', ''))
+    include_dummy = parse_bool_query(request.args.get('include_dummy'))
     page = request.args.get('page', 1, type=int)
     per_page = min(request.args.get('per_page', 50, type=int), 200)
@@ -13,10 +13,12 @@ from mes_dashboard.core.database import (
     DatabaseCircuitOpenError,
 )
 from mes_dashboard.core.cache import cache_get, cache_set, make_cache_key
+from mes_dashboard.core.rate_limit import configured_rate_limit
+from mes_dashboard.core.utils import get_days_back, parse_bool_query
 
 def _clean_nan_values(data):
-    """Convert NaN and NaT values to None for JSON serialization.
+    """Convert NaN/NaT values to None for JSON serialization (depth-safe).
 
     Args:
         data: List of dicts or single dict.
@@ -24,28 +26,77 @@ def _clean_nan_values(data):
     Returns:
         Cleaned data with NaN/NaT replaced by None.
     """
-    if isinstance(data, list):
-        return [_clean_nan_values(item) for item in data]
-    elif isinstance(data, dict):
-        cleaned = {}
-        for key, value in data.items():
-            if isinstance(value, float) and math.isnan(value):
-                cleaned[key] = None
-            elif isinstance(value, str) and value == 'NaT':
-                cleaned[key] = None
-            elif value != value:  # NaN check (NaN != NaN)
-                cleaned[key] = None
-            elif isinstance(value, list):
-                # Recursively clean nested lists (e.g., LOT_DETAILS)
-                cleaned[key] = _clean_nan_values(value)
-            elif isinstance(value, dict):
-                # Recursively clean nested dicts
-                cleaned[key] = _clean_nan_values(value)
-            else:
-                cleaned[key] = value
-        return cleaned
-    return data
-
-from mes_dashboard.core.utils import get_days_back
+    def _normalize_scalar(value):
+        if isinstance(value, float) and math.isnan(value):
+            return None
+        if isinstance(value, str) and value == 'NaT':
+            return None
+        try:
+            if value != value:  # NaN check (NaN != NaN)
+                return None
+        except Exception:
+            pass
+        return value
+
+    if isinstance(data, list):
+        root: list = []
+    elif isinstance(data, dict):
+        root = {}
+    else:
+        return _normalize_scalar(data)
+
+    stack = [(data, root)]
+    seen: set[int] = {id(data)}
+    while stack:
+        source, target = stack.pop()
+        if isinstance(source, list):
+            for item in source:
+                if isinstance(item, list):
+                    item_id = id(item)
+                    if item_id in seen:
+                        target.append(None)
+                        continue
+                    child = []
+                    target.append(child)
+                    seen.add(item_id)
+                    stack.append((item, child))
+                elif isinstance(item, dict):
+                    item_id = id(item)
+                    if item_id in seen:
+                        target.append(None)
+                        continue
+                    child = {}
+                    target.append(child)
+                    seen.add(item_id)
+                    stack.append((item, child))
+                else:
+                    target.append(_normalize_scalar(item))
+            continue
+        for key, value in source.items():
+            if isinstance(value, list):
+                value_id = id(value)
+                if value_id in seen:
+                    target[key] = None
+                    continue
+                child = []
+                target[key] = child
+                seen.add(value_id)
+                stack.append((value, child))
+            elif isinstance(value, dict):
+                value_id = id(value)
+                if value_id in seen:
+                    target[key] = None
+                    continue
+                child = {}
+                target[key] = child
+                seen.add(value_id)
+                stack.append((value, child))
+            else:
+                target[key] = _normalize_scalar(value)
+    return root
 
 from mes_dashboard.services.resource_service import (
     query_resource_by_status,
     query_resource_by_workcenter,
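The rewrite above trades recursion for an explicit stack plus an `id()`-based `seen` set, so deeply nested or self-referential payloads can neither hit the recursion limit nor loop forever. A condensed, runnable version of the same idea:

```python
import math


def clean_nan(data):
    """Depth-safe NaN/NaT scrub, mirroring the iterative rewrite above."""
    def norm(v):
        if isinstance(v, float) and math.isnan(v):
            return None
        if isinstance(v, str) and v == 'NaT':
            return None
        return v

    if not isinstance(data, (list, dict)):
        return norm(data)
    root = [] if isinstance(data, list) else {}
    stack, seen = [(data, root)], {id(data)}
    while stack:
        src, dst = stack.pop()
        items = enumerate(src) if isinstance(src, list) else src.items()
        for key, val in items:
            if isinstance(val, (list, dict)):
                if id(val) in seen:
                    child = None  # cycle detected: break it with None
                else:
                    child = [] if isinstance(val, list) else {}
                    seen.add(id(val))
                    stack.append((val, child))
            else:
                child = norm(val)
            if isinstance(dst, list):
                dst.append(child)
            else:
                dst[key] = child
    return root
```

Containers are copied first and filled as the stack drains, so sibling order inside lists is preserved while no Python stack frame is ever deeper than one loop iteration.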
@@ -62,6 +113,32 @@ from mes_dashboard.config.constants import STATUS_CATEGORIES
 # Create Blueprint
 resource_bp = Blueprint('resource', __name__, url_prefix='/api/resource')
 
+_RESOURCE_DETAIL_RATE_LIMIT = configured_rate_limit(
+    bucket="resource-detail",
+    max_attempts_env="RESOURCE_DETAIL_RATE_LIMIT_MAX_REQUESTS",
+    window_seconds_env="RESOURCE_DETAIL_RATE_LIMIT_WINDOW_SECONDS",
+    default_max_attempts=60,
+    default_window_seconds=60,
+)
+
+_RESOURCE_STATUS_RATE_LIMIT = configured_rate_limit(
+    bucket="resource-status",
+    max_attempts_env="RESOURCE_STATUS_RATE_LIMIT_MAX_REQUESTS",
+    window_seconds_env="RESOURCE_STATUS_RATE_LIMIT_WINDOW_SECONDS",
+    default_max_attempts=90,
+    default_window_seconds=60,
+)
+
+
+def _optional_bool_arg(name: str):
+    raw = request.args.get(name)
+    if raw is None:
+        return None
+    text = str(raw).strip()
+    if not text:
+        return None
+    return parse_bool_query(text)
+
 
 @resource_bp.route('/by_status')
 def api_resource_by_status():
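`_optional_bool_arg` keeps tri-state semantics: an absent or blank query parameter means "no filter" (None) rather than False, so only explicit values narrow a query. A standalone sketch (the accepted tokens mirror the old inline parsers visible in this diff: 'true', '1', 'yes'; the helper name is illustrative):

```python
def parse_bool_token(value) -> bool:
    """Assumed behaviour of the shared parse_bool_query helper."""
    return str(value).strip().lower() in ('true', '1', 'yes')


def optional_bool(raw):
    """Tri-state parse: None/blank -> None (no filter), else a strict bool."""
    if raw is None:
        return None
    text = str(raw).strip()
    if not text:
        return None
    return parse_bool_token(text)
```

This collapses three copies of a 4-line `if is_*_param:` block per endpoint into one call each, with identical filter behaviour.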
@@ -118,6 +195,7 @@ def api_resource_workcenter_status_matrix():
 
 @resource_bp.route('/detail', methods=['POST'])
+@_RESOURCE_DETAIL_RATE_LIMIT
 def api_resource_detail():
     """API: Resource detail with filters."""
     data = request.get_json() or {}
@@ -183,6 +261,7 @@ def api_resource_status_values():
 # ============================================================
 
 @resource_bp.route('/status')
+@_RESOURCE_STATUS_RATE_LIMIT
 def api_resource_status():
     """API: Get merged resource status from realtime cache.
@@ -197,20 +276,9 @@ def api_resource_status():
     wc_groups_param = request.args.get('workcenter_groups')
     workcenter_groups = wc_groups_param.split(',') if wc_groups_param else None
 
-    is_production = None
-    is_prod_param = request.args.get('is_production')
-    if is_prod_param:
-        is_production = is_prod_param.lower() in ('1', 'true', 'yes')
-    is_key = None
-    is_key_param = request.args.get('is_key')
-    if is_key_param:
-        is_key = is_key_param.lower() in ('1', 'true', 'yes')
-    is_monitor = None
-    is_monitor_param = request.args.get('is_monitor')
-    if is_monitor_param:
-        is_monitor = is_monitor_param.lower() in ('1', 'true', 'yes')
+    is_production = _optional_bool_arg('is_production')
+    is_key = _optional_bool_arg('is_key')
+    is_monitor = _optional_bool_arg('is_monitor')
 
     status_cats_param = request.args.get('status_categories')
     status_categories = status_cats_param.split(',') if status_cats_param else None
@@ -260,6 +328,7 @@ def api_resource_status_options():
 
 @resource_bp.route('/status/summary')
+@_RESOURCE_STATUS_RATE_LIMIT
 def api_resource_status_summary():
     """API: Get resource status summary statistics.
@@ -269,20 +338,9 @@ def api_resource_status_summary():
     wc_groups_param = request.args.get('workcenter_groups')
     workcenter_groups = wc_groups_param.split(',') if wc_groups_param else None
 
-    is_production = None
-    is_prod_param = request.args.get('is_production')
-    if is_prod_param:
-        is_production = is_prod_param.lower() in ('1', 'true', 'yes')
-    is_key = None
-    is_key_param = request.args.get('is_key')
-    if is_key_param:
-        is_key = is_key_param.lower() in ('1', 'true', 'yes')
-    is_monitor = None
-    is_monitor_param = request.args.get('is_monitor')
-    if is_monitor_param:
-        is_monitor = is_monitor_param.lower() in ('1', 'true', 'yes')
+    is_production = _optional_bool_arg('is_production')
+    is_key = _optional_bool_arg('is_key')
+    is_monitor = _optional_bool_arg('is_monitor')
 
     try:
         data = get_resource_status_summary(
@@ -301,6 +359,7 @@ def api_resource_status_summary():
 
 @resource_bp.route('/status/matrix')
+@_RESOURCE_STATUS_RATE_LIMIT
 def api_resource_status_matrix():
     """API: Get workcenter × status matrix.
@@ -309,20 +368,9 @@ def api_resource_status_matrix():
         is_key: Filter by key equipment
         is_monitor: Filter by monitor equipment
     """
-    is_production = None
-    is_prod_param = request.args.get('is_production')
-    if is_prod_param:
-        is_production = is_prod_param.lower() in ('1', 'true', 'yes')
-    is_key = None
-    is_key_param = request.args.get('is_key')
-    if is_key_param:
-        is_key = is_key_param.lower() in ('1', 'true', 'yes')
-    is_monitor = None
-    is_monitor_param = request.args.get('is_monitor')
-    if is_monitor_param:
-        is_monitor = is_monitor_param.lower() in ('1', 'true', 'yes')
+    is_production = _optional_bool_arg('is_production')
+    is_key = _optional_bool_arg('is_key')
+    is_monitor = _optional_bool_arg('is_monitor')
 
     try:
         data = get_workcenter_status_matrix(
@@ -7,6 +7,8 @@ Uses DWH.DW_MES_LOT_V view for real-time WIP data.
 from flask import Blueprint, jsonify, request
 
+from mes_dashboard.core.rate_limit import configured_rate_limit
+from mes_dashboard.core.utils import parse_bool_query
 from mes_dashboard.services.wip_service import (
     get_wip_summary,
     get_wip_matrix,
@@ -24,10 +26,21 @@ from mes_dashboard.services.wip_service import (
 # Create Blueprint
 wip_bp = Blueprint('wip', __name__, url_prefix='/api/wip')
 
+_WIP_MATRIX_RATE_LIMIT = configured_rate_limit(
+    bucket="wip-overview-matrix",
+    max_attempts_env="WIP_MATRIX_RATE_LIMIT_MAX_REQUESTS",
+    window_seconds_env="WIP_MATRIX_RATE_LIMIT_WINDOW_SECONDS",
+    default_max_attempts=120,
+    default_window_seconds=60,
+)
 
-def _parse_bool(value: str) -> bool:
-    """Parse boolean from query string."""
-    return value.lower() in ('true', '1', 'yes') if value else False
+_WIP_DETAIL_RATE_LIMIT = configured_rate_limit(
+    bucket="wip-detail",
+    max_attempts_env="WIP_DETAIL_RATE_LIMIT_MAX_REQUESTS",
+    window_seconds_env="WIP_DETAIL_RATE_LIMIT_WINDOW_SECONDS",
+    default_max_attempts=90,
+    default_window_seconds=60,
+)
 
 # ============================================================
@@ -52,7 +65,7 @@ def api_overview_summary():
     lotid = request.args.get('lotid', '').strip() or None
     package = request.args.get('package', '').strip() or None
     pj_type = request.args.get('type', '').strip() or None
-    include_dummy = _parse_bool(request.args.get('include_dummy', ''))
+    include_dummy = parse_bool_query(request.args.get('include_dummy'))
 
     result = get_wip_summary(
         include_dummy=include_dummy,
@@ -67,6 +80,7 @@ def api_overview_summary():
 
 @wip_bp.route('/overview/matrix')
+@_WIP_MATRIX_RATE_LIMIT
 def api_overview_matrix():
     """API: Get workcenter x product line matrix for overview dashboard.
@@ -88,7 +102,7 @@ def api_overview_matrix():
     lotid = request.args.get('lotid', '').strip() or None
     package = request.args.get('package', '').strip() or None
     pj_type = request.args.get('type', '').strip() or None
-    include_dummy = _parse_bool(request.args.get('include_dummy', ''))
+    include_dummy = parse_bool_query(request.args.get('include_dummy'))
     status = request.args.get('status', '').strip().upper() or None
     hold_type = request.args.get('hold_type', '').strip().lower() or None
@@ -134,7 +148,7 @@ def api_overview_hold():
     """
     workorder = request.args.get('workorder', '').strip() or None
     lotid = request.args.get('lotid', '').strip() or None
-    include_dummy = _parse_bool(request.args.get('include_dummy', ''))
+    include_dummy = parse_bool_query(request.args.get('include_dummy'))
 
     result = get_wip_hold_summary(
         include_dummy=include_dummy,
@@ -151,6 +165,7 @@ def api_overview_hold():
 # ============================================================
 
 @wip_bp.route('/detail/<workcenter>')
+@_WIP_DETAIL_RATE_LIMIT
 def api_detail(workcenter: str):
     """API: Get WIP detail for a specific workcenter group.
@@ -176,12 +191,17 @@ def api_detail(workcenter: str):
     hold_type = request.args.get('hold_type', '').strip().lower() or None
     workorder = request.args.get('workorder', '').strip() or None
     lotid = request.args.get('lotid', '').strip() or None
-    include_dummy = _parse_bool(request.args.get('include_dummy', ''))
+    include_dummy = parse_bool_query(request.args.get('include_dummy'))
     page = request.args.get('page', 1, type=int)
-    page_size = min(request.args.get('page_size', 100, type=int), 500)
-    if page < 1:
-        page = 1
+    page_size = request.args.get('page_size', 100, type=int)
+    if page is None:
+        page = 1
+    if page_size is None:
+        page_size = 100
+    page = max(page, 1)
+    page_size = max(1, min(page_size, 500))
 
     # Validate status parameter
     if status and status not in ('RUN', 'QUEUE', 'HOLD'):
@@ -245,7 +265,7 @@ def api_meta_workcenters():
     Returns:
         JSON with list of {name, lot_count} sorted by sequence
     """
-    include_dummy = _parse_bool(request.args.get('include_dummy', ''))
+    include_dummy = parse_bool_query(request.args.get('include_dummy'))
 
     result = get_workcenters(include_dummy=include_dummy)
     if result is not None:
@@ -263,7 +283,7 @@ def api_meta_packages():
     Returns:
         JSON with list of {name, lot_count} sorted by count desc
     """
-    include_dummy = _parse_bool(request.args.get('include_dummy', ''))
+    include_dummy = parse_bool_query(request.args.get('include_dummy'))
 
     result = get_packages(include_dummy=include_dummy)
     if result is not None:
@@ -293,7 +313,7 @@ def api_meta_search():
     search_field = request.args.get('field', '').strip().lower()
     q = request.args.get('q', '').strip()
     limit = min(request.args.get('limit', 20, type=int), 50)
-    include_dummy = _parse_bool(request.args.get('include_dummy', ''))
+    include_dummy = parse_bool_query(request.args.get('include_dummy'))
 
     # Cross-filter parameters
     workorder = request.args.get('workorder', '').strip() or None
@@ -5,23 +5,83 @@ from __future__ import annotations
 import logging
 import os
+from urllib.parse import urlparse
 
 import requests
 
 logger = logging.getLogger(__name__)
 
-# Configuration - MUST be set in .env file
-LDAP_API_BASE = os.environ.get("LDAP_API_URL", "")
-ADMIN_EMAILS = os.environ.get("ADMIN_EMAILS", "").lower().split(",")
-
 # Timeout for LDAP API requests
 LDAP_TIMEOUT = 10
 
+# Configuration - MUST be set in .env file
+ADMIN_EMAILS = os.environ.get("ADMIN_EMAILS", "").lower().split(",")
+
 # Local authentication configuration (for development/testing)
 LOCAL_AUTH_ENABLED = os.environ.get("LOCAL_AUTH_ENABLED", "false").lower() in ("true", "1", "yes")
 LOCAL_AUTH_USERNAME = os.environ.get("LOCAL_AUTH_USERNAME", "")
 LOCAL_AUTH_PASSWORD = os.environ.get("LOCAL_AUTH_PASSWORD", "")
+# LDAP endpoint hardening configuration
+LDAP_API_URL = os.environ.get("LDAP_API_URL", "").strip()
+LDAP_ALLOWED_HOSTS_RAW = os.environ.get("LDAP_ALLOWED_HOSTS", "").strip()
+
+
+def _normalize_host(host: str) -> str:
+    return host.strip().lower().rstrip(".")
+
+
+def _parse_allowed_hosts(raw_hosts: str) -> tuple[str, ...]:
+    if not raw_hosts:
+        return tuple()
+    hosts: list[str] = []
+    for raw in raw_hosts.split(","):
+        host = _normalize_host(raw)
+        if host:
+            hosts.append(host)
+    return tuple(hosts)
+
+
+def _validate_ldap_api_url(raw_url: str, allowed_hosts: tuple[str, ...]) -> tuple[str | None, str | None]:
+    """Validate LDAP API URL to prevent configuration-based SSRF risks."""
+    url = (raw_url or "").strip()
+    if not url:
+        return None, "LDAP_API_URL is missing"
+    parsed = urlparse(url)
+    scheme = (parsed.scheme or "").lower()
+    host = _normalize_host(parsed.hostname or "")
+    if not host:
+        return None, f"LDAP_API_URL has no valid host: {url!r}"
+    if scheme != "https":
+        return None, f"LDAP_API_URL must use HTTPS: {url!r}"
+    effective_allowlist = allowed_hosts or (host,)
+    if host not in effective_allowlist:
+        return None, (
+            f"LDAP_API_URL host {host!r} is not allowlisted. "
+            f"Allowed hosts: {', '.join(effective_allowlist)}"
+        )
+    return url.rstrip("/"), None
+
+
+def _resolve_ldap_config() -> tuple[str | None, str | None, tuple[str, ...]]:
+    allowed_hosts = _parse_allowed_hosts(LDAP_ALLOWED_HOSTS_RAW)
+    api_base, error = _validate_ldap_api_url(LDAP_API_URL, allowed_hosts)
+    if api_base:
+        effective_hosts = allowed_hosts or (_normalize_host(urlparse(api_base).hostname or ""),)
+        return api_base, None, effective_hosts
+    return None, error, allowed_hosts
+
+
+LDAP_API_BASE, LDAP_CONFIG_ERROR, LDAP_ALLOWED_HOSTS = _resolve_ldap_config()
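The validation chain can be exercised in isolation; this condensed version mirrors `_validate_ldap_api_url` (error strings shortened for brevity):

```python
from urllib.parse import urlparse


def validate_ldap_url(raw_url, allowed_hosts=()):
    """HTTPS + host-allowlist check; returns (base_url, error)."""
    url = (raw_url or "").strip()
    if not url:
        return None, "missing"
    parsed = urlparse(url)
    host = (parsed.hostname or "").strip().lower().rstrip(".")
    if not host:
        return None, "no valid host"
    if (parsed.scheme or "").lower() != "https":
        return None, "must use HTTPS"
    # With no explicit allowlist, pin to the configured host itself.
    if host not in (allowed_hosts or (host,)):
        return None, "host not allowlisted"
    return url.rstrip("/"), None
```

Because the check runs once at import time and its result gates every `authenticate()` call, a later typo or hostile edit to `LDAP_API_URL` (say, pointing it at an internal metadata endpoint) fails closed instead of redirecting credentials.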
 
 def _authenticate_local(username: str, password: str) -> dict | None:
     """Authenticate using local environment credentials.
@@ -77,6 +137,14 @@ def authenticate(username: str, password: str, domain: str = "PANJIT") -> dict |
# This ensures local-only mode when LOCAL_AUTH_ENABLED is true # This ensures local-only mode when LOCAL_AUTH_ENABLED is true
return None return None
if LDAP_CONFIG_ERROR:
logger.error("LDAP authentication blocked: %s", LDAP_CONFIG_ERROR)
return None
if not LDAP_API_BASE:
logger.error("LDAP authentication blocked: LDAP_API_URL is not configured")
return None
    # LDAP authentication
    try:
        response = requests.post(
@@ -121,4 +189,5 @@ def is_admin(user: dict) -> bool:
        return True
    user_mail = user.get("mail", "").lower().strip()
    allowed_emails = [e.strip() for e in ADMIN_EMAILS if e and e.strip()]
    return user_mail in allowed_emails
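The filtered-allowlist check above can be sketched as a standalone helper. `is_admin_email` is a hypothetical name; this sketch additionally lowercases the allowlist entries so the comparison is case-insensitive on both sides, a slight hardening over the hunk shown.

```python
def is_admin_email(user_mail: str, admin_emails: list[str]) -> bool:
    """Case-insensitive membership check against a whitespace/empty-filtered allowlist."""
    allowed = [e.strip().lower() for e in admin_emails if e and e.strip()]
    return (user_mail or "").lower().strip() in allowed
```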

View File

@@ -6,6 +6,7 @@ Data is loaded from database and cached in memory with periodic refresh.
""" """
import logging import logging
import os
import threading import threading
from datetime import datetime, timedelta from datetime import datetime, timedelta
from typing import Optional, Dict, List, Any from typing import Optional, Dict, List, Any
@@ -19,8 +20,8 @@ logger = logging.getLogger('mes_dashboard.filter_cache')
# ============================================================
CACHE_TTL_SECONDS = 3600  # 1 hour cache TTL
WIP_VIEW = os.getenv("FILTER_CACHE_WIP_VIEW", "DWH.DW_MES_LOT_V")
SPEC_WORKCENTER_VIEW = os.getenv("FILTER_CACHE_SPEC_WORKCENTER_VIEW", "DWH.DW_MES_SPEC_WORKCENTER_V")

# ============================================================
# Cache Storage

View File

@@ -5,6 +5,8 @@ from __future__ import annotations
import json
import logging
import os
import tempfile
from pathlib import Path
from threading import Lock
@@ -37,15 +39,33 @@ def _load() -> dict:
def _save(data: dict) -> None:
    """Save page status configuration."""
    global _cache
    tmp_path: Path | None = None
    try:
        DATA_FILE.parent.mkdir(parents=True, exist_ok=True)
        payload = json.dumps(data, ensure_ascii=False, indent=2)

        # Atomic write: write to sibling temp file, then replace target.
        with tempfile.NamedTemporaryFile(
            mode="w",
            encoding="utf-8",
            dir=str(DATA_FILE.parent),
            prefix=f".{DATA_FILE.name}.",
            suffix=".tmp",
            delete=False,
        ) as tmp:
            tmp.write(payload)
            tmp.flush()
            os.fsync(tmp.fileno())
            tmp_path = Path(tmp.name)
        os.replace(tmp_path, DATA_FILE)

        _cache = data
        logger.debug("Saved page status to %s", DATA_FILE)
    except OSError as e:
        if tmp_path is not None:
            try:
                tmp_path.unlink(missing_ok=True)
            except OSError:
                pass
        logger.error("Failed to save page status: %s", e)
        raise

View File

@@ -7,10 +7,12 @@ Data is synced periodically (default 5 minutes) and stored in Redis.
import json
import logging
import os
import threading
import time
from collections import OrderedDict
from datetime import datetime
from typing import Any

from mes_dashboard.core.database import read_sql_df
from mes_dashboard.core.redis_client import (
@@ -26,6 +28,7 @@ from mes_dashboard.config.constants import (
    EQUIPMENT_STATUS_META_COUNT_KEY,
    STATUS_CATEGORY_MAP,
)
from mes_dashboard.services.sql_fragments import EQUIPMENT_STATUS_SELECT_SQL

logger = logging.getLogger('mes_dashboard.realtime_equipment_cache')
@@ -33,29 +36,56 @@ logger = logging.getLogger('mes_dashboard.realtime_equipment_cache')
# Process-Level Cache (Prevents redundant JSON parsing)
# ============================================================

DEFAULT_PROCESS_CACHE_TTL_SECONDS = 30
DEFAULT_PROCESS_CACHE_MAX_SIZE = 32
DEFAULT_LOOKUP_TTL_SECONDS = 30


class _ProcessLevelCache:
    """Thread-safe process-level cache for parsed equipment status data."""

    def __init__(self, ttl_seconds: int = 30, max_size: int = 32):
        self._cache: OrderedDict[str, tuple[list[dict[str, Any]], float]] = OrderedDict()
        self._lock = threading.Lock()
        self._ttl = max(int(ttl_seconds), 1)
        self._max_size = max(int(max_size), 1)

    @property
    def max_size(self) -> int:
        return self._max_size

    def _evict_expired_locked(self, now: float) -> None:
        stale_keys = [
            key for key, (_, timestamp) in self._cache.items()
            if now - timestamp > self._ttl
        ]
        for key in stale_keys:
            self._cache.pop(key, None)

    def get(self, key: str) -> list[dict[str, Any]] | None:
        """Get cached data if not expired."""
        with self._lock:
            payload = self._cache.get(key)
            if payload is None:
                return None
            data, timestamp = payload
            now = time.time()
            if now - timestamp > self._ttl:
                self._cache.pop(key, None)
                return None
            self._cache.move_to_end(key, last=True)
            return data

    def set(self, key: str, data: list[dict[str, Any]]) -> None:
        """Cache data with current timestamp."""
        with self._lock:
            now = time.time()
            self._evict_expired_locked(now)
            if key in self._cache:
                self._cache.pop(key, None)
            elif len(self._cache) >= self._max_size:
                self._cache.popitem(last=False)
            self._cache[key] = (data, now)
            self._cache.move_to_end(key, last=True)

    def invalidate(self, key: str) -> None:
        """Remove a key from cache."""
@@ -63,20 +93,38 @@ class _ProcessLevelCache:
            self._cache.pop(key, None)
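The TTL-plus-bounded-LRU semantics of `_ProcessLevelCache` can be illustrated with a trimmed-down sketch (class name and sizes are illustrative, and `invalidate`/expiry sweeping are omitted): a `get` marks the entry most-recently-used, and a `set` at capacity evicts the least-recently-used entry from the front of the `OrderedDict`.

```python
import threading
import time
from collections import OrderedDict

class TTLLRUCache:
    """Minimal sketch of the TTL + bounded-LRU policy used above."""
    def __init__(self, ttl_seconds: float = 30, max_size: int = 2):
        self._cache: OrderedDict[str, tuple[object, float]] = OrderedDict()
        self._lock = threading.Lock()
        self._ttl = ttl_seconds
        self._max_size = max(int(max_size), 1)

    def get(self, key: str):
        with self._lock:
            payload = self._cache.get(key)
            if payload is None:
                return None
            value, ts = payload
            if time.time() - ts > self._ttl:
                self._cache.pop(key, None)  # lazily drop the expired entry
                return None
            self._cache.move_to_end(key)  # mark as most recently used
            return value

    def set(self, key: str, value) -> None:
        with self._lock:
            if key in self._cache:
                self._cache.pop(key)  # re-insert to refresh recency
            elif len(self._cache) >= self._max_size:
                self._cache.popitem(last=False)  # evict least recently used
            self._cache[key] = (value, time.time())
```

With `max_size=2`, touching one key protects it from the next eviction:

```python
cache = TTLLRUCache(ttl_seconds=60, max_size=2)
cache.set("a", 1)
cache.set("b", 2)
cache.get("a")     # "b" is now the LRU entry
cache.set("c", 3)  # evicts "b", not "a"
```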
def _resolve_cache_max_size(env_name: str, default: int) -> int:
    value = os.getenv(env_name)
    if value is None:
        return max(int(default), 1)
    try:
        return max(int(value), 1)
    except (TypeError, ValueError):
        return max(int(default), 1)


# Global process-level cache for equipment status (30s TTL)
PROCESS_CACHE_MAX_SIZE = _resolve_cache_max_size("PROCESS_CACHE_MAX_SIZE", DEFAULT_PROCESS_CACHE_MAX_SIZE)
EQUIPMENT_PROCESS_CACHE_MAX_SIZE = _resolve_cache_max_size(
    "EQUIPMENT_PROCESS_CACHE_MAX_SIZE",
    PROCESS_CACHE_MAX_SIZE,
)
_equipment_status_cache = _ProcessLevelCache(
    ttl_seconds=DEFAULT_PROCESS_CACHE_TTL_SECONDS,
    max_size=EQUIPMENT_PROCESS_CACHE_MAX_SIZE,
)
_equipment_status_parse_lock = threading.Lock()
_equipment_lookup_lock = threading.Lock()
_equipment_status_lookup: dict[str, dict[str, Any]] = {}
_equipment_status_lookup_built_at: str | None = None
_equipment_status_lookup_ts: float = 0.0
LOOKUP_TTL_SECONDS = DEFAULT_LOOKUP_TTL_SECONDS
# ============================================================
# Module State
# ============================================================
_SYNC_THREAD: threading.Thread | None = None
_STOP_EVENT = threading.Event()
_SYNC_LOCK = threading.Lock()
@@ -85,40 +133,14 @@ _SYNC_LOCK = threading.Lock()
# Oracle Query
# ============================================================

def _load_equipment_status_from_oracle() -> list[dict[str, Any]] | None:
    """Query DW_MES_EQUIPMENTSTATUS_WIP_V from Oracle.

    Returns:
        List of equipment status records, or None if query fails.
    """
sql = """
SELECT
RESOURCEID,
EQUIPMENTID,
OBJECTCATEGORY,
EQUIPMENTASSETSSTATUS,
EQUIPMENTASSETSSTATUSREASON,
JOBORDER,
JOBMODEL,
JOBSTAGE,
JOBID,
JOBSTATUS,
CREATEDATE,
CREATEUSERNAME,
CREATEUSER,
TECHNICIANUSERNAME,
TECHNICIANUSER,
SYMPTOMCODE,
CAUSECODE,
REPAIRCODE,
RUNCARDLOTID,
LOTTRACKINQTY_PCS,
LOTTRACKINTIME,
LOTTRACKINEMPLOYEE
FROM DWH.DW_MES_EQUIPMENTSTATUS_WIP_V
"""
try: try:
df = read_sql_df(sql) df = read_sql_df(EQUIPMENT_STATUS_SELECT_SQL)
if df is None or df.empty: if df is None or df.empty:
logger.warning("No data returned from DW_MES_EQUIPMENTSTATUS_WIP_V") logger.warning("No data returned from DW_MES_EQUIPMENTSTATUS_WIP_V")
return [] return []
@@ -147,7 +169,7 @@ def _load_equipment_status_from_oracle() -> Optional[List[Dict[str, Any]]]:
# Data Aggregation
# ============================================================

def _classify_status(status: str | None) -> str:
    """Classify equipment status into category.

    Args:
@@ -183,7 +205,7 @@ def _is_valid_value(value) -> bool:
    return True


def _aggregate_by_resourceid(records: list[dict[str, Any]]) -> list[dict[str, Any]]:
    """Aggregate equipment status records by RESOURCEID.

    For each RESOURCEID:
@@ -203,7 +225,7 @@ def _aggregate_by_resourceid(records: List[Dict[str, Any]]) -> List[Dict[str, An
        return []

    # Group by RESOURCEID
    grouped: dict[str, list[dict[str, Any]]] = {}
    for record in records:
        resource_id = record.get('RESOURCEID')
        if resource_id:
@@ -272,7 +294,7 @@ def _aggregate_by_resourceid(records: List[Dict[str, Any]]) -> List[Dict[str, An
            'CAUSECODE': first.get('CAUSECODE'),
            'REPAIRCODE': first.get('REPAIRCODE'),
            # LOT related fields
            'LOT_COUNT': len(seen_lots) if seen_lots else len(group),  # distinct RUNCARDLOTIDs, else row count
            'LOT_DETAILS': lot_details,  # LOT details for tooltip
            'TOTAL_TRACKIN_QTY': total_qty,
            'LATEST_TRACKIN_TIME': latest_trackin,
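The `LOT_COUNT` fallback introduced in this hunk (distinct lot ids when any exist, otherwise the raw row count) can be sketched in isolation. `aggregate_lot_counts` is a hypothetical simplification that keeps only the grouping and counting, dropping the other aggregated fields:

```python
def aggregate_lot_counts(records: list[dict]) -> dict[str, int]:
    """Group by RESOURCEID; count distinct RUNCARDLOTID, falling back to row count."""
    grouped: dict[str, list[dict]] = {}
    for record in records:
        resource_id = record.get("RESOURCEID")
        if resource_id:
            grouped.setdefault(resource_id, []).append(record)
    counts: dict[str, int] = {}
    for resource_id, group in grouped.items():
        seen_lots = {r["RUNCARDLOTID"] for r in group if r.get("RUNCARDLOTID")}
        # Rows with no lot id at all still register as activity (count = rows).
        counts[resource_id] = len(seen_lots) if seen_lots else len(group)
    return counts
```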
@@ -286,7 +308,7 @@ def _aggregate_by_resourceid(records: List[Dict[str, Any]]) -> List[Dict[str, An
# Redis Storage
# ============================================================

def _save_to_redis(aggregated: list[dict[str, Any]]) -> bool:
    """Save aggregated equipment status to Redis.

    Uses pipeline for atomic update of all keys.
@@ -354,7 +376,7 @@ def _invalidate_equipment_status_lookup() -> None:
    _equipment_status_lookup_ts = 0.0


def get_equipment_status_lookup() -> dict[str, dict[str, Any]]:
    """Get RESOURCEID -> status record lookup with process-level caching."""
    global _equipment_status_lookup, _equipment_status_lookup_built_at, _equipment_status_lookup_ts
@@ -375,7 +397,7 @@ def get_equipment_status_lookup() -> Dict[str, Dict[str, Any]]:
    _equipment_status_lookup_ts = time.time()
    return _equipment_status_lookup


def get_all_equipment_status() -> list[dict[str, Any]]:
    """Get all equipment status from cache with process-level caching.

    Uses a two-tier cache strategy:
@@ -433,7 +455,7 @@ def get_all_equipment_status() -> List[Dict[str, Any]]:
        return []


def get_equipment_status_by_id(resource_id: str) -> dict[str, Any] | None:
    """Get equipment status by RESOURCEID.

    Uses index hash for O(1) lookup.
@@ -485,7 +507,7 @@ def get_equipment_status_by_id(resource_id: str) -> Optional[Dict[str, Any]]:
    return None


def get_equipment_status_by_ids(resource_ids: list[str]) -> list[dict[str, Any]]:
    """Get equipment status for multiple RESOURCEIDs.

    Args:
@@ -540,7 +562,7 @@ def get_equipment_status_by_ids(resource_ids: List[str]) -> List[Dict[str, Any]]
        return []


def get_equipment_status_cache_status() -> dict[str, Any]:
    """Get equipment status cache status.

    Returns:

View File

@@ -13,8 +13,9 @@ import logging
import os
import threading
import time
from collections import OrderedDict
from datetime import datetime
from typing import Any

import pandas as pd
@@ -31,9 +32,27 @@ from mes_dashboard.config.constants import (
    EQUIPMENT_TYPE_FILTER,
)
from mes_dashboard.sql import QueryBuilder
from mes_dashboard.services.sql_fragments import (
    RESOURCE_BASE_SELECT_TEMPLATE,
    RESOURCE_VERSION_SELECT_TEMPLATE,
)

logger = logging.getLogger('mes_dashboard.resource_cache')

ResourceRecord = dict[str, Any]
RowPosition = int
PositionBucket = dict[str, list[RowPosition]]
FlagBuckets = dict[str, list[RowPosition]]
ResourceIndex = dict[str, Any]

DEFAULT_PROCESS_CACHE_TTL_SECONDS = 30
DEFAULT_PROCESS_CACHE_MAX_SIZE = 32
DEFAULT_RESOURCE_SYNC_INTERVAL_SECONDS = 14_400  # 4 hours
DEFAULT_INDEX_VERSION_CHECK_INTERVAL_SECONDS = 5

RESOURCE_DF_CACHE_KEY = "resource_data"
TRUE_BUCKET = "1"
FALSE_BUCKET = "0"

# ============================================================
# Process-Level Cache (Prevents redundant JSON parsing)
# ============================================================
@@ -41,26 +60,49 @@ logger = logging.getLogger('mes_dashboard.resource_cache')
class _ProcessLevelCache:
    """Thread-safe process-level cache for parsed DataFrames."""

    def __init__(self, ttl_seconds: int = DEFAULT_PROCESS_CACHE_TTL_SECONDS, max_size: int = DEFAULT_PROCESS_CACHE_MAX_SIZE):
        self._cache: OrderedDict[str, tuple[pd.DataFrame, float]] = OrderedDict()
        self._lock = threading.Lock()
        self._ttl = max(int(ttl_seconds), 1)
        self._max_size = max(int(max_size), 1)

    @property
    def max_size(self) -> int:
        return self._max_size

    def _evict_expired_locked(self, now: float) -> None:
        stale_keys = [
            key for key, (_, timestamp) in self._cache.items()
            if now - timestamp > self._ttl
        ]
        for key in stale_keys:
            self._cache.pop(key, None)

    def get(self, key: str) -> pd.DataFrame | None:
        """Get cached DataFrame if not expired."""
        with self._lock:
            payload = self._cache.get(key)
            if payload is None:
                return None
            df, timestamp = payload
            now = time.time()
            if now - timestamp > self._ttl:
                self._cache.pop(key, None)
                return None
            self._cache.move_to_end(key, last=True)
            return df

    def set(self, key: str, df: pd.DataFrame) -> None:
        """Cache a DataFrame with current timestamp."""
        with self._lock:
            now = time.time()
            self._evict_expired_locked(now)
            if key in self._cache:
                self._cache.pop(key, None)
            elif len(self._cache) >= self._max_size:
                self._cache.popitem(last=False)
            self._cache[key] = (df, now)
            self._cache.move_to_end(key, last=True)

    def invalidate(self, key: str) -> None:
        """Remove a key from cache."""
@@ -68,11 +110,29 @@ class _ProcessLevelCache:
            self._cache.pop(key, None)
def _resolve_cache_max_size(env_name: str, default: int) -> int:
    value = os.getenv(env_name)
    if value is None:
        return max(int(default), 1)
    try:
        return max(int(value), 1)
    except (TypeError, ValueError):
        return max(int(default), 1)


# Global process-level cache for resource data (30s TTL)
PROCESS_CACHE_MAX_SIZE = _resolve_cache_max_size("PROCESS_CACHE_MAX_SIZE", DEFAULT_PROCESS_CACHE_MAX_SIZE)
RESOURCE_PROCESS_CACHE_MAX_SIZE = _resolve_cache_max_size(
    "RESOURCE_PROCESS_CACHE_MAX_SIZE",
    PROCESS_CACHE_MAX_SIZE,
)
_resource_df_cache = _ProcessLevelCache(
    ttl_seconds=DEFAULT_PROCESS_CACHE_TTL_SECONDS,
    max_size=RESOURCE_PROCESS_CACHE_MAX_SIZE,
)
_resource_parse_lock = threading.Lock()
_resource_index_lock = threading.Lock()
_resource_index: ResourceIndex = {
    "ready": False,
    "source": None,
    "version": None,
@@ -80,19 +140,27 @@ _resource_index: Dict[str, Any] = {
"built_at": None, "built_at": None,
"version_checked_at": 0.0, "version_checked_at": 0.0,
"count": 0, "count": 0,
"records": [], "all_positions": [],
"by_resource_id": {}, "by_resource_id": {},
"by_workcenter": {}, "by_workcenter": {},
"by_family": {}, "by_family": {},
"by_department": {}, "by_department": {},
"by_location": {}, "by_location": {},
"by_is_production": {"1": [], "0": []}, "by_is_production": {TRUE_BUCKET: [], FALSE_BUCKET: []},
"by_is_key": {"1": [], "0": []}, "by_is_key": {TRUE_BUCKET: [], FALSE_BUCKET: []},
"by_is_monitor": {"1": [], "0": []}, "by_is_monitor": {TRUE_BUCKET: [], FALSE_BUCKET: []},
"memory": {
"frame_bytes": 0,
"index_bytes": 0,
"records_json_bytes": 0,
"bucket_entries": 0,
"amplification_ratio": 0.0,
"representation": "dataframe+row-index",
},
} }
def _new_empty_index() -> ResourceIndex:
    return {
        "ready": False,
        "source": None,
@@ -101,15 +169,23 @@ def _new_empty_index() -> Dict[str, Any]:
"built_at": None, "built_at": None,
"version_checked_at": 0.0, "version_checked_at": 0.0,
"count": 0, "count": 0,
"records": [], "all_positions": [],
"by_resource_id": {}, "by_resource_id": {},
"by_workcenter": {}, "by_workcenter": {},
"by_family": {}, "by_family": {},
"by_department": {}, "by_department": {},
"by_location": {}, "by_location": {},
"by_is_production": {"1": [], "0": []}, "by_is_production": {TRUE_BUCKET: [], FALSE_BUCKET: []},
"by_is_key": {"1": [], "0": []}, "by_is_key": {TRUE_BUCKET: [], FALSE_BUCKET: []},
"by_is_monitor": {"1": [], "0": []}, "by_is_monitor": {TRUE_BUCKET: [], FALSE_BUCKET: []},
"memory": {
"frame_bytes": 0,
"index_bytes": 0,
"records_json_bytes": 0,
"bucket_entries": 0,
"amplification_ratio": 0.0,
"representation": "dataframe+row-index",
},
} }
@@ -129,23 +205,59 @@ def _is_truthy_flag(value: Any) -> bool:
    return False


def _bucket_append(bucket: PositionBucket, key: Any, row_position: RowPosition) -> None:
    if key is None:
        return
    if isinstance(key, float) and pd.isna(key):
        return
    key_str = str(key)
    bucket.setdefault(key_str, []).append(int(row_position))
def _estimate_dataframe_bytes(df: pd.DataFrame) -> int:
    try:
        return int(df.memory_usage(index=True, deep=True).sum())
    except Exception:
        return 0


def _estimate_index_bytes(index: ResourceIndex) -> int:
    """Estimate lightweight index memory footprint for telemetry."""
    by_resource_id = index.get("by_resource_id", {})
    by_workcenter = index.get("by_workcenter", {})
    by_family = index.get("by_family", {})
    by_department = index.get("by_department", {})
    by_location = index.get("by_location", {})
    by_is_production = index.get("by_is_production", {TRUE_BUCKET: [], FALSE_BUCKET: []})
    by_is_key = index.get("by_is_key", {TRUE_BUCKET: [], FALSE_BUCKET: []})
    by_is_monitor = index.get("by_is_monitor", {TRUE_BUCKET: [], FALSE_BUCKET: []})
    all_positions = index.get("all_positions", [])
    position_entries = (
        len(all_positions)
        + sum(len(v) for v in by_workcenter.values())
        + sum(len(v) for v in by_family.values())
        + sum(len(v) for v in by_department.values())
        + sum(len(v) for v in by_location.values())
        + len(by_is_production.get(TRUE_BUCKET, []))
        + len(by_is_production.get(FALSE_BUCKET, []))
        + len(by_is_key.get(TRUE_BUCKET, []))
        + len(by_is_key.get(FALSE_BUCKET, []))
        + len(by_is_monitor.get(TRUE_BUCKET, []))
        + len(by_is_monitor.get(FALSE_BUCKET, []))
    )
    # Approximate integer/list/dict overhead; telemetry only needs directional signal.
    return int(position_entries * 8 + len(by_resource_id) * 64)


def _build_resource_index(
    df: pd.DataFrame,
    *,
    source: str,
    version: str | None,
    updated_at: str | None,
) -> ResourceIndex:
    normalized_df = df.reset_index(drop=True)
    index = _new_empty_index()
    index["ready"] = True
    index["source"] = source
@@ -153,31 +265,58 @@ def _build_resource_index(
index["updated_at"] = updated_at index["updated_at"] = updated_at
index["built_at"] = datetime.now().isoformat() index["built_at"] = datetime.now().isoformat()
index["version_checked_at"] = time.time() index["version_checked_at"] = time.time()
index["count"] = len(records) index["count"] = len(normalized_df)
index["records"] = records index["all_positions"] = list(range(len(normalized_df)))
for record in records: for row_position, record in normalized_df.iterrows():
resource_id = record.get("RESOURCEID") resource_id = record.get("RESOURCEID")
if resource_id is not None and not (isinstance(resource_id, float) and pd.isna(resource_id)): if resource_id is not None and not (isinstance(resource_id, float) and pd.isna(resource_id)):
index["by_resource_id"][str(resource_id)] = record index["by_resource_id"][str(resource_id)] = int(row_position)
_bucket_append(index["by_workcenter"], record.get("WORKCENTERNAME"), record) _bucket_append(index["by_workcenter"], record.get("WORKCENTERNAME"), row_position)
_bucket_append(index["by_family"], record.get("RESOURCEFAMILYNAME"), record) _bucket_append(index["by_family"], record.get("RESOURCEFAMILYNAME"), row_position)
_bucket_append(index["by_department"], record.get("PJ_DEPARTMENT"), record) _bucket_append(index["by_department"], record.get("PJ_DEPARTMENT"), row_position)
_bucket_append(index["by_location"], record.get("LOCATIONNAME"), record) _bucket_append(index["by_location"], record.get("LOCATIONNAME"), row_position)
index["by_is_production"]["1" if _is_truthy_flag(record.get("PJ_ISPRODUCTION")) else "0"].append(record) index["by_is_production"][TRUE_BUCKET if _is_truthy_flag(record.get("PJ_ISPRODUCTION")) else FALSE_BUCKET].append(int(row_position))
index["by_is_key"]["1" if _is_truthy_flag(record.get("PJ_ISKEY")) else "0"].append(record) index["by_is_key"][TRUE_BUCKET if _is_truthy_flag(record.get("PJ_ISKEY")) else FALSE_BUCKET].append(int(row_position))
index["by_is_monitor"]["1" if _is_truthy_flag(record.get("PJ_ISMONITOR")) else "0"].append(record) index["by_is_monitor"][TRUE_BUCKET if _is_truthy_flag(record.get("PJ_ISMONITOR")) else FALSE_BUCKET].append(int(row_position))
bucket_entries = (
sum(len(v) for v in index["by_workcenter"].values())
+ sum(len(v) for v in index["by_family"].values())
+ sum(len(v) for v in index["by_department"].values())
+ sum(len(v) for v in index["by_location"].values())
+ len(index["by_is_production"][TRUE_BUCKET])
+ len(index["by_is_production"][FALSE_BUCKET])
+ len(index["by_is_key"][TRUE_BUCKET])
+ len(index["by_is_key"][FALSE_BUCKET])
+ len(index["by_is_monitor"][TRUE_BUCKET])
+ len(index["by_is_monitor"][FALSE_BUCKET])
)
frame_bytes = _estimate_dataframe_bytes(normalized_df)
index_bytes = _estimate_index_bytes(index)
amplification_ratio = round(
(frame_bytes + index_bytes) / max(frame_bytes, 1),
4,
)
index["memory"] = {
"frame_bytes": int(frame_bytes),
"index_bytes": int(index_bytes),
"records_json_bytes": 0, # kept for backward-compatible telemetry shape
"bucket_entries": int(bucket_entries),
"amplification_ratio": amplification_ratio,
"representation": "dataframe+row-index",
}
return index return index
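The memory telemetry math in this hunk reduces to two small formulas; the sketch below extracts them with illustrative names (`estimate_index_bytes` here takes plain position buckets rather than the module's `ResourceIndex` dict). The ratio answers "how much larger is frame plus index than the frame alone", so 1.0 means the row-position index adds no overhead:

```python
def estimate_index_bytes(position_buckets: list[list[int]], id_map_size: int) -> int:
    """Directional estimate: ~8 bytes per stored row position, ~64 per id-map entry."""
    position_entries = sum(len(bucket) for bucket in position_buckets)
    return position_entries * 8 + id_map_size * 64

def amplification_ratio(frame_bytes: int, index_bytes: int) -> float:
    """Cache footprint relative to the raw frame alone (guarded against zero)."""
    return round((frame_bytes + index_bytes) / max(frame_bytes, 1), 4)
```

This is exactly why the refactor replaced full `records` copies with row positions: a duplicated record list can push the ratio toward 2.0 or beyond, while integer positions keep it close to 1.0.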
def _index_matches(
    current: ResourceIndex,
    *,
    source: str,
    version: str | None,
    row_count: int,
) -> bool:
    if not current.get("ready"):
@@ -193,8 +332,8 @@ def _ensure_resource_index(
    df: pd.DataFrame,
    *,
    source: str,
    version: str | None = None,
    updated_at: str | None = None,
) -> None:
    global _resource_index
    with _resource_index_lock:
@@ -212,12 +351,12 @@ def _ensure_resource_index(
        _resource_index = new_index


def _get_resource_index() -> ResourceIndex:
    with _resource_index_lock:
        return _resource_index


def _get_cache_meta(client=None) -> tuple[str | None, str | None]:
    redis_client = client or get_redis_client()
    if redis_client is None:
        return None, None
@@ -244,31 +383,59 @@ def _redis_data_available(client=None) -> bool:
    return False


def _pick_bucket_positions(
    bucket: PositionBucket,
    keys: list[Any],
) -> list[RowPosition]:
    seen: set[int] = set()
    result: list[int] = []
    for key in keys:
        for row_position in bucket.get(str(key), []):
            normalized = int(row_position)
            if normalized in seen:
                continue
            seen.add(normalized)
            result.append(normalized)
    return result


def _records_from_positions(df: pd.DataFrame, positions: list[RowPosition]) -> list[ResourceRecord]:
    if not positions:
        return []
    unique_positions = sorted({int(pos) for pos in positions if 0 <= int(pos) < len(df)})
    if not unique_positions:
        return []
    return df.iloc[unique_positions].to_dict(orient='records')


def _records_from_index(index: ResourceIndex, positions: list[RowPosition] | None = None) -> list[ResourceRecord]:
    if not index.get("ready"):
        return []
    df = _resource_df_cache.get(RESOURCE_DF_CACHE_KEY)
    if df is None:
        legacy_records = index.get("records")
        if isinstance(legacy_records, list):
            if positions is None:
                return list(legacy_records)
            selected = [legacy_records[int(pos)] for pos in positions if 0 <= int(pos) < len(legacy_records)]
            return selected
        return []
    selected_positions = positions if positions is not None else index.get("all_positions", [])
    if not selected_positions:
        selected_positions = list(range(len(df)))
    return _records_from_positions(df, selected_positions)


# ============================================================
# Configuration
# ============================================================
RESOURCE_CACHE_ENABLED = os.getenv('RESOURCE_CACHE_ENABLED', 'true').lower() == 'true'
RESOURCE_SYNC_INTERVAL = int(
    os.getenv('RESOURCE_SYNC_INTERVAL', str(DEFAULT_RESOURCE_SYNC_INTERVAL_SECONDS))
)
RESOURCE_INDEX_VERSION_CHECK_INTERVAL = int(
    os.getenv('RESOURCE_INDEX_VERSION_CHECK_INTERVAL', str(DEFAULT_INDEX_VERSION_CHECK_INTERVAL_SECONDS))
)


# Redis key helpers
def _get_key(key: str) -> str:
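`_pick_bucket_positions` merges bucket hits for several filter keys while preserving first-seen order, which keeps result ordering stable across repeated queries. A standalone sketch of that merge (function name illustrative, pandas-free):

```python
def pick_bucket_positions(bucket: dict[str, list[int]], keys: list) -> list[int]:
    """Union the row positions for several keys, deduplicated, first-seen order kept."""
    seen: set[int] = set()
    result: list[int] = []
    for key in keys:
        for pos in bucket.get(str(key), []):
            if pos in seen:
                continue  # a row already matched an earlier key
            seen.add(pos)
            result.append(pos)
    return result
```

The positions then drive a single `df.iloc[positions]` slice, so only the matching rows are materialized as dict records instead of the whole table.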
@@ -313,14 +480,14 @@ def _build_filter_builder() -> QueryBuilder:
    return builder


def _load_from_oracle() -> pd.DataFrame | None:
    """Load the full resource table from Oracle (with global filters applied).

    Returns:
        DataFrame with all columns, or None if query failed.
    """
    builder = _build_filter_builder()
    builder.base_sql = RESOURCE_BASE_SELECT_TEMPLATE
    sql, params = builder.build()

    try:
@@ -333,14 +500,14 @@ def _load_from_oracle() -> Optional[pd.DataFrame]:
        return None


def _get_version_from_oracle() -> str | None:
    """Get the Oracle data version (MAX(LASTCHANGEDATE)).

    Returns:
        Version string (ISO format), or None if query failed.
    """
    builder = _build_filter_builder()
    builder.base_sql = RESOURCE_VERSION_SELECT_TEMPLATE
    sql, params = builder.build()

    try:
@@ -361,7 +528,7 @@ def _get_version_from_oracle() -> Optional[str]:
# ============================================================
# Internal: Redis Functions
# ============================================================
def _get_version_from_redis() -> str | None:
    """Get the cached version from Redis.

    Returns:
@@ -411,7 +578,7 @@ def _sync_to_redis(df: pd.DataFrame, version: str) -> bool:
        pipe.execute()

        # Invalidate process-level cache so next request picks up new data
        _resource_df_cache.invalidate(RESOURCE_DF_CACHE_KEY)
        _invalidate_resource_index()

        logger.info(f"Resource cache synced: {len(df)} rows, version={version}")
@@ -421,7 +588,7 @@ def _sync_to_redis(df: pd.DataFrame, version: str) -> bool:
        return False


def _get_cached_data() -> pd.DataFrame | None:
    """Get cached resource data from Redis with process-level caching.

    Uses a two-tier cache strategy:
@@ -433,11 +600,15 @@ def _get_cached_data() -> Optional[pd.DataFrame]:
    Returns:
        DataFrame with resource data, or None if cache miss.
    """
    cache_key = RESOURCE_DF_CACHE_KEY

    # Tier 1: Check process-level cache first (fast path)
    cached_df = _resource_df_cache.get(cache_key)
    if cached_df is not None:
        if REDIS_ENABLED and RESOURCE_CACHE_ENABLED and not _redis_data_available():
            _resource_df_cache.invalidate(cache_key)
            _invalidate_resource_index()
        else:
            if not _get_resource_index().get("ready"):
                version, updated_at = _get_cache_meta()
                _ensure_resource_index(
@@ -568,7 +739,7 @@ def init_cache() -> None:
        logger.error(f"Failed to init resource cache: {e}")


def get_cache_status() -> dict[str, Any]:
    """Get cache status information.

    Returns:
@@ -611,9 +782,10 @@ def get_cache_status() -> Dict[str, Any]:
# ============================================================
# Query API
# ============================================================
def get_resource_index_status() -> dict[str, Any]:
    """Get process-level derived index telemetry."""
    index = _get_resource_index()
    memory = index.get("memory") or {}
    built_at = index.get("built_at")
    age_seconds = None
    if built_at:
@@ -630,19 +802,32 @@ def get_resource_index_status() -> Dict[str, Any]:
"built_at": built_at, "built_at": built_at,
"count": int(index.get("count", 0)), "count": int(index.get("count", 0)),
"age_seconds": round(age_seconds, 3) if age_seconds is not None else None, "age_seconds": round(age_seconds, 3) if age_seconds is not None else None,
"memory": {
"frame_bytes": int(memory.get("frame_bytes", 0)),
"index_bytes": int(memory.get("index_bytes", 0)),
"records_json_bytes": int(memory.get("records_json_bytes", 0)),
"bucket_entries": int(memory.get("bucket_entries", 0)),
"amplification_ratio": float(memory.get("amplification_ratio", 0.0)),
"representation": str(memory.get("representation", "unknown")),
},
} }
def get_resource_index_snapshot() -> Dict[str, Any]: def get_resource_index_snapshot() -> ResourceIndex:
"""Get derived resource index snapshot, rebuilding if needed.""" """Get derived resource index snapshot, rebuilding if needed."""
index = _get_resource_index() index = _get_resource_index()
if index.get("ready"): if index.get("ready"):
if index.get("source") == "redis": if index.get("source") == "redis":
if not _redis_data_available():
_resource_df_cache.invalidate(RESOURCE_DF_CACHE_KEY)
_invalidate_resource_index()
index = _get_resource_index()
# If Redis metadata version is missing, verify payload existence on every call. # If Redis metadata version is missing, verify payload existence on every call.
# This avoids serving stale in-process index when Redis payload is evicted. # This avoids serving stale in-process index when Redis payload is evicted.
if not index.get("version"): if index.get("ready") and not index.get("version"):
if not _redis_data_available(): if not _redis_data_available():
_resource_df_cache.invalidate("resource_data") _resource_df_cache.invalidate(RESOURCE_DF_CACHE_KEY)
_invalidate_resource_index() _invalidate_resource_index()
index = _get_resource_index() index = _get_resource_index()
else: else:
@@ -661,7 +846,7 @@ def get_resource_index_snapshot() -> Dict[str, Any]:
                    current_version,
                    latest_version,
                )
                _resource_df_cache.invalidate(RESOURCE_DF_CACHE_KEY)
                _invalidate_resource_index()
                index = _get_resource_index()
            else:
@@ -678,6 +863,7 @@ def get_resource_index_snapshot() -> Dict[str, Any]:
    df = _get_cached_data()
    if df is not None:
        _resource_df_cache.set(RESOURCE_DF_CACHE_KEY, df.reset_index(drop=True))
        version, updated_at = _get_cache_meta()
        _ensure_resource_index(
            df,
@@ -690,6 +876,8 @@ def get_resource_index_snapshot() -> Dict[str, Any]:
    logger.info("Resource cache miss while building index, falling back to Oracle")
    oracle_df = _load_from_oracle()
    if oracle_df is None:
        _resource_df_cache.invalidate(RESOURCE_DF_CACHE_KEY)
        _invalidate_resource_index()
        return _new_empty_index()
    _ensure_resource_index(
@@ -698,9 +886,11 @@ def get_resource_index_snapshot() -> Dict[str, Any]:
        version=None,
        updated_at=datetime.now().isoformat(),
    )
    _resource_df_cache.set(RESOURCE_DF_CACHE_KEY, oracle_df.reset_index(drop=True))
    return _get_resource_index()


def get_all_resources() -> list[ResourceRecord]:
    """Get all cached resource records (all columns).

    Falls back to Oracle if cache unavailable.
@@ -709,11 +899,10 @@ def get_all_resources() -> List[Dict]:
    Returns:
        List of resource dicts.
    """
    index = get_resource_index_snapshot()
    return _records_from_index(index)


def get_resource_by_id(resource_id: str) -> ResourceRecord | None:
    """Get a single resource record by RESOURCEID.

    Args:
@@ -725,10 +914,12 @@ def get_resource_by_id(resource_id: str) -> Optional[Dict]:
    if not resource_id:
        return None

    index = get_resource_index_snapshot()
    by_id: dict[str, RowPosition] = index.get("by_resource_id", {})
    row_position = by_id.get(str(resource_id))
    if row_position is not None:
        rows = _records_from_index(index, [int(row_position)])
        if rows:
            return rows[0]

    # Backward-compatible fallback for call sites/tests that patch get_all_resources.
    target = str(resource_id)
@@ -738,7 +929,7 @@ def get_resource_by_id(resource_id: str) -> Optional[Dict]:
    return None


def get_resources_by_ids(resource_ids: list[str]) -> list[ResourceRecord]:
    """Batch-fetch resource records by a list of RESOURCEIDs.

    Args:
@@ -747,20 +938,28 @@ def get_resources_by_ids(resource_ids: List[str]) -> List[Dict]:
    Returns:
        List of matching resource dicts.
    """
    index = get_resource_index_snapshot()
    by_id: dict[str, RowPosition] = index.get("by_resource_id", {})
    positions = [by_id[str(resource_id)] for resource_id in resource_ids if str(resource_id) in by_id]
    if positions:
        rows = _records_from_index(index, positions)
        if rows:
            return rows

    # Backward-compatible fallback for call sites/tests that patch get_all_resources.
    id_set = set(resource_ids)
    return [r for r in get_all_resources() if r.get('RESOURCEID') in id_set]


def get_resources_by_filter(
    workcenters: list[str] | None = None,
    families: list[str] | None = None,
    departments: list[str] | None = None,
    locations: list[str] | None = None,
    is_production: bool | None = None,
    is_key: bool | None = None,
    is_monitor: bool | None = None,
) -> list[ResourceRecord]:
    """Filter resource records by criteria (filtered on the Python side).

    Args:
@@ -775,11 +974,9 @@ def get_resources_by_filter(
    Returns:
        List of matching resource dicts.
    """
    def _filter_from_records(resources: list[ResourceRecord]) -> list[ResourceRecord]:
        result: list[ResourceRecord] = []
        for r in resources:
            if workcenters and r.get('WORKCENTERNAME') not in workcenters:
                continue
            if families and r.get('RESOURCEFAMILYNAME') not in families:
@@ -788,29 +985,68 @@ def get_resources_by_filter(
                continue
            if departments and r.get('PJ_DEPARTMENT') not in departments:
                continue
            if locations and r.get('LOCATIONNAME') not in locations:
                continue
            if is_production is not None and (r.get('PJ_ISPRODUCTION') == 1) != is_production:
                continue
            if is_key is not None and (r.get('PJ_ISKEY') == 1) != is_key:
                continue
            if is_monitor is not None and (r.get('PJ_ISMONITOR') == 1) != is_monitor:
                continue
            result.append(r)
        return result

    index = get_resource_index_snapshot()
    if not index.get("ready"):
        return _filter_from_records(get_all_resources())
    if _resource_df_cache.get(RESOURCE_DF_CACHE_KEY) is None:
        return _filter_from_records(get_all_resources())

    candidate_positions: set[int] = set(int(pos) for pos in index.get("all_positions", []))
    if not candidate_positions:
        return []

    def _intersect_with_positions(selected: list[int] | None) -> None:
        nonlocal candidate_positions
        if selected is None:
            return
        candidate_positions &= set(int(item) for item in selected)

    if workcenters:
        _intersect_with_positions(
            _pick_bucket_positions(index.get("by_workcenter", {}), workcenters)
        )
    if families:
        _intersect_with_positions(
            _pick_bucket_positions(index.get("by_family", {}), families)
        )
    if departments:
        _intersect_with_positions(
            _pick_bucket_positions(index.get("by_department", {}), departments)
        )
    if locations:
        _intersect_with_positions(
            _pick_bucket_positions(index.get("by_location", {}), locations)
        )
    if is_production is not None:
        _intersect_with_positions(
            index.get("by_is_production", {}).get(TRUE_BUCKET if is_production else FALSE_BUCKET, [])
        )
    if is_key is not None:
        _intersect_with_positions(
            index.get("by_is_key", {}).get(TRUE_BUCKET if is_key else FALSE_BUCKET, [])
        )
    if is_monitor is not None:
        _intersect_with_positions(
            index.get("by_is_monitor", {}).get(TRUE_BUCKET if is_monitor else FALSE_BUCKET, [])
        )

    return _records_from_index(index, sorted(candidate_positions))
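The indexed filter path above is successive set intersection over precomputed value buckets. A minimal, self-contained sketch of the same idea (the index layout and bucket names here are illustrative stand-ins, not the service's actual schema):

```python
# AND-combine precomputed bucket indexes, assuming each bucket maps a
# value to the row positions that carry it.
def filter_positions(index: dict, selections: dict) -> list:
    candidates = set(index["all_positions"])
    for bucket_name, wanted in selections.items():
        bucket = index[bucket_name]
        matched = set()
        for value in wanted:
            # OR within one criterion: any wanted value matches
            matched.update(bucket.get(value, []))
        # AND across criteria: each one narrows the survivors
        candidates &= matched
    return sorted(candidates)

index = {
    "all_positions": [0, 1, 2, 3],
    "by_workcenter": {"WB": [0, 1], "DB": [2, 3]},
    "by_location": {"F1": [0, 2], "F2": [1, 3]},
}
print(filter_positions(index, {"by_workcenter": ["WB"], "by_location": ["F1"]}))  # [0]
```

Only candidate positions survive all criteria, so the full records are materialized once at the end rather than per-filter.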
# ============================================================
# Distinct Values API (for filters)
# ============================================================
def get_distinct_values(column: str) -> list[str]:
    """Get the sorted list of distinct values for a column.

    Args:
@@ -833,26 +1069,26 @@ def get_distinct_values(column: str) -> List[str]:
    return sorted(values)


def get_resource_families() -> list[str]:
    """Get the resource family list (convenience helper)."""
    return get_distinct_values('RESOURCEFAMILYNAME')


def get_workcenters() -> list[str]:
    """Get the workcenter list (convenience helper)."""
    return get_distinct_values('WORKCENTERNAME')


def get_departments() -> list[str]:
    """Get the department list (convenience helper)."""
    return get_distinct_values('PJ_DEPARTMENT')


def get_locations() -> list[str]:
    """Get the location list (convenience helper)."""
    return get_distinct_values('LOCATIONNAME')


def get_vendors() -> list[str]:
    """Get the vendor list (convenience helper)."""
    return get_distinct_values('VENDORNAME')

View File

@@ -0,0 +1,46 @@
# -*- coding: utf-8 -*-
"""Shared SQL fragments/constants for cache-oriented services.
Centralizing common Oracle table/view references reduces drift across
resource/equipment cache implementations.
"""
from __future__ import annotations
RESOURCE_TABLE = "DWH.DW_MES_RESOURCE"

# Plain-string concatenation keeps the literal "{{ WHERE_CLAUSE }}" placeholder
# intact; an f-string would collapse the doubled braces to "{ WHERE_CLAUSE }",
# which no longer matches the template the query builder replaces.
RESOURCE_BASE_SELECT_TEMPLATE = "SELECT * FROM " + RESOURCE_TABLE + " {{ WHERE_CLAUSE }}"
RESOURCE_VERSION_SELECT_TEMPLATE = (
    "SELECT MAX(LASTCHANGEDATE) as VERSION FROM " + RESOURCE_TABLE + " {{ WHERE_CLAUSE }}"
)

EQUIPMENT_STATUS_VIEW = "DWH.DW_MES_EQUIPMENTSTATUS_WIP_V"
EQUIPMENT_STATUS_COLUMNS: tuple[str, ...] = (
    "RESOURCEID",
    "EQUIPMENTID",
    "OBJECTCATEGORY",
    "EQUIPMENTASSETSSTATUS",
    "EQUIPMENTASSETSSTATUSREASON",
    "JOBORDER",
    "JOBMODEL",
    "JOBSTAGE",
    "JOBID",
    "JOBSTATUS",
    "CREATEDATE",
    "CREATEUSERNAME",
    "CREATEUSER",
    "TECHNICIANUSERNAME",
    "TECHNICIANUSER",
    "SYMPTOMCODE",
    "CAUSECODE",
    "REPAIRCODE",
    "RUNCARDLOTID",
    "LOTTRACKINQTY_PCS",
    "LOTTRACKINTIME",
    "LOTTRACKINEMPLOYEE",
)
EQUIPMENT_STATUS_SELECT_SQL = (
    "SELECT\n "
    + ",\n ".join(EQUIPMENT_STATUS_COLUMNS)
    + f"\nFROM {EQUIPMENT_STATUS_VIEW}"
)
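The join above expands the column tuple into a one-column-per-line SELECT. A quick standalone check of what it produces, using a trimmed column list for brevity:

```python
# Build a SELECT statement from a column tuple, mirroring the constants above.
COLUMNS = ("RESOURCEID", "EQUIPMENTID", "JOBSTATUS")
VIEW = "DWH.DW_MES_EQUIPMENTSTATUS_WIP_V"
sql = "SELECT\n " + ",\n ".join(COLUMNS) + f"\nFROM {VIEW}"
print(sql)
```

Each column lands on its own line with a trailing comma except the last, which keeps the generated SQL diff-friendly when columns are added.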

View File

@@ -9,6 +9,7 @@ Now uses Redis cache when available, with fallback to Oracle direct query.
import logging
import threading
from collections import Counter
from datetime import datetime
from typing import Optional, Dict, List, Any
@@ -32,6 +33,20 @@ logger = logging.getLogger('mes_dashboard.wip_service')
_wip_search_index_lock = threading.Lock()
_wip_search_index_cache: Dict[str, Dict[str, Any]] = {}
_wip_snapshot_lock = threading.Lock()
_wip_snapshot_cache: Dict[str, Dict[str, Any]] = {}
_wip_index_metrics_lock = threading.Lock()
_wip_index_metrics: Dict[str, Any] = {
    "snapshot_hits": 0,
    "snapshot_misses": 0,
    "search_index_hits": 0,
    "search_index_misses": 0,
    "search_index_rebuilds": 0,
    "search_index_incremental_updates": 0,
    "search_index_reconciliation_fallbacks": 0,
}
_EMPTY_INT_INDEX = np.array([], dtype=np.int64)
def _safe_value(val):
@@ -153,29 +168,373 @@ def _get_wip_cache_version() -> str:
    return f"{updated_at}|{sys_date}"


def _increment_wip_metric(metric: str, value: int = 1) -> None:
    with _wip_index_metrics_lock:
        _wip_index_metrics[metric] = int(_wip_index_metrics.get(metric, 0)) + value


def _estimate_dataframe_bytes(df: pd.DataFrame) -> int:
    if df is None:
        return 0
    try:
        return int(df.memory_usage(index=True, deep=True).sum())
    except Exception:
        return 0


def _estimate_counter_payload_bytes(counter: Counter) -> int:
    total = 0
    for key, count in counter.items():
        total += len(str(key)) + 16 + int(count)
    return total
def _normalize_text_value(value: Any) -> str:
    if value is None:
        return ""
    if isinstance(value, float) and pd.isna(value):
        return ""
    text = str(value).strip()
    return text
def _build_filter_mask(
    df: pd.DataFrame,
    *,
    include_dummy: bool,
    workorder: Optional[str] = None,
    lotid: Optional[str] = None,
) -> pd.Series:
    if df.empty:
        return pd.Series(dtype=bool)
    mask = df['WORKORDER'].notna()
    if not include_dummy and 'LOTID' in df.columns:
        mask &= ~df['LOTID'].astype(str).str.contains('DUMMY', case=False, na=False)
    if workorder and 'WORKORDER' in df.columns:
        mask &= df['WORKORDER'].astype(str).str.contains(workorder, case=False, na=False)
    if lotid and 'LOTID' in df.columns:
        mask &= df['LOTID'].astype(str).str.contains(lotid, case=False, na=False)
    return mask
def _build_value_index(df: pd.DataFrame, column: str) -> Dict[str, np.ndarray]:
    if column not in df.columns or df.empty:
        return {}
    grouped = df.groupby(column, dropna=True, sort=False).indices
    return {str(key): np.asarray(indices, dtype=np.int64) for key, indices in grouped.items()}
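The `groupby(...).indices` call above is what makes the buckets cheap to build: it returns positional row indexes per value without materializing sub-frames. A toy run (the column names and data are invented for illustration):

```python
import numpy as np
import pandas as pd

# GroupBy.indices maps each group key to the positional indexes of its rows.
df = pd.DataFrame({"WORKCENTER_GROUP": ["WB", "DB", "WB"], "QTY": [10, 5, 7]})
buckets = {
    str(key): np.asarray(idx, dtype=np.int64)
    for key, idx in df.groupby("WORKCENTER_GROUP", dropna=True, sort=False).indices.items()
}
print(buckets["WB"])  # positions 0 and 2
```

The positions index into the frame via `df.iloc`, so lookups later never re-scan the column.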
def _intersect_positions(current: Optional[np.ndarray], candidate: Optional[np.ndarray]) -> np.ndarray:
    if candidate is None:
        return _EMPTY_INT_INDEX
    if current is None:
        return candidate
    if len(current) == 0 or len(candidate) == 0:
        return _EMPTY_INT_INDEX
    return np.intersect1d(current, candidate, assume_unique=False)
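`np.intersect1d` returns the sorted common values of two arrays, so with position arrays it behaves exactly like an AND of two index buckets:

```python
import numpy as np

# Intersecting two position arrays: rows present in both buckets survive.
run_rows = np.array([4, 1, 7], dtype=np.int64)
hold_rows = np.array([7, 2, 1], dtype=np.int64)
both = np.intersect1d(run_rows, hold_rows, assume_unique=False)
print(both)  # [1 7]
```

`assume_unique=False` is the safe default here since bucket arrays are not guaranteed deduplicated; the result is always sorted and unique.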
def _select_with_snapshot_indexes(
    include_dummy: bool = False,
    workorder: Optional[str] = None,
    lotid: Optional[str] = None,
    package: Optional[str] = None,
    pj_type: Optional[str] = None,
    workcenter: Optional[str] = None,
    status: Optional[str] = None,
    hold_type: Optional[str] = None,
) -> Optional[pd.DataFrame]:
    snapshot = _get_wip_snapshot(include_dummy=include_dummy)
    if snapshot is None:
        return None

    df = snapshot["frame"]
    indexes = snapshot["indexes"]
    selected_positions: Optional[np.ndarray] = None

    if workcenter:
        selected_positions = _intersect_positions(
            selected_positions,
            indexes["workcenter"].get(str(workcenter)),
        )
    if package:
        selected_positions = _intersect_positions(
            selected_positions,
            indexes["package"].get(str(package)),
        )
    if pj_type:
        selected_positions = _intersect_positions(
            selected_positions,
            indexes["pj_type"].get(str(pj_type)),
        )
    if status:
        selected_positions = _intersect_positions(
            selected_positions,
            indexes["wip_status"].get(str(status).upper()),
        )
    if hold_type:
        selected_positions = _intersect_positions(
            selected_positions,
            indexes["hold_type"].get(str(hold_type).lower()),
        )

    if selected_positions is None:
        result = df
    elif len(selected_positions) == 0:
        result = df.iloc[0:0]
    else:
        result = df.iloc[selected_positions]

    if workorder:
        result = result[result['WORKORDER'].astype(str).str.contains(workorder, case=False, na=False)]
    if lotid:
        result = result[result['LOTID'].astype(str).str.contains(lotid, case=False, na=False)]
    return result
def _build_search_signatures(df: pd.DataFrame) -> tuple[Counter, Dict[str, tuple[str, str, str, str]]]:
    if df.empty:
        return Counter(), {}
    workorders = df.get("WORKORDER", pd.Series(index=df.index, dtype=object)).map(_normalize_text_value)
    lotids = df.get("LOTID", pd.Series(index=df.index, dtype=object)).map(_normalize_text_value)
    packages = df.get("PACKAGE_LEF", pd.Series(index=df.index, dtype=object)).map(_normalize_text_value)
    types = df.get("PJ_TYPE", pd.Series(index=df.index, dtype=object)).map(_normalize_text_value)
    signatures = (
        workorders
        + "\x1f"
        + lotids
        + "\x1f"
        + packages
        + "\x1f"
        + types
    ).tolist()
    signature_counter = Counter(signatures)
    signature_fields: Dict[str, tuple[str, str, str, str]] = {}
    for signature, wo, lot, pkg, pj in zip(signatures, workorders, lotids, packages, types):
        if signature not in signature_fields:
            signature_fields[signature] = (wo, lot, pkg, pj)
    return signature_counter, signature_fields
def _build_field_counters(
    signature_counter: Counter,
    signature_fields: Dict[str, tuple[str, str, str, str]],
) -> Dict[str, Counter]:
    counters = {
        "workorders": Counter(),
        "lotids": Counter(),
        "packages": Counter(),
        "types": Counter(),
    }
    for signature, count in signature_counter.items():
        wo, lot, pkg, pj = signature_fields.get(signature, ("", "", "", ""))
        if wo:
            counters["workorders"][wo] += count
        if lot:
            counters["lotids"][lot] += count
        if pkg:
            counters["packages"][pkg] += count
        if pj:
            counters["types"][pj] += count
    return counters
def _materialize_search_payload(
    *,
    version: str,
    row_count: int,
    signature_counter: Counter,
    field_counters: Dict[str, Counter],
    mode: str,
    added_rows: int = 0,
    removed_rows: int = 0,
    drift_ratio: float = 0.0,
) -> Dict[str, Any]:
    workorders = sorted(field_counters["workorders"].keys())
    lotids = sorted(field_counters["lotids"].keys())
    packages = sorted(field_counters["packages"].keys())
    types = sorted(field_counters["types"].keys())
    memory_bytes = (
        _estimate_counter_payload_bytes(field_counters["workorders"])
        + _estimate_counter_payload_bytes(field_counters["lotids"])
        + _estimate_counter_payload_bytes(field_counters["packages"])
        + _estimate_counter_payload_bytes(field_counters["types"])
    )
    return {
        "version": version,
        "built_at": datetime.now().isoformat(),
        "row_count": int(row_count),
        "workorders": workorders,
        "lotids": lotids,
        "packages": packages,
        "types": types,
        "sync_mode": mode,
        "sync_added_rows": int(added_rows),
        "sync_removed_rows": int(removed_rows),
        "drift_ratio": round(float(drift_ratio), 6),
        "memory_bytes": int(memory_bytes),
        "_signature_counter": dict(signature_counter),
        "_field_counters": {
            "workorders": dict(field_counters["workorders"]),
            "lotids": dict(field_counters["lotids"]),
            "packages": dict(field_counters["packages"]),
            "types": dict(field_counters["types"]),
        },
    }
def _build_wip_search_index(df: pd.DataFrame, include_dummy: bool) -> Dict[str, Any]:
    filtered = _filter_base_conditions(df, include_dummy=include_dummy)
    signatures, signature_fields = _build_search_signatures(filtered)
    field_counters = _build_field_counters(signatures, signature_fields)
    return _materialize_search_payload(
        version=_get_wip_cache_version(),
        row_count=len(filtered),
        signature_counter=signatures,
        field_counters=field_counters,
        mode="full",
    )
def _try_incremental_search_sync(
    previous: Dict[str, Any],
    *,
    version: str,
    row_count: int,
    signature_counter: Counter,
    signature_fields: Dict[str, tuple[str, str, str, str]],
) -> Optional[Dict[str, Any]]:
    if not previous:
        return None
    old_signature_counter = Counter(previous.get("_signature_counter") or {})
    old_field_counters_raw = previous.get("_field_counters") or {}
    if not old_signature_counter or not old_field_counters_raw:
        return None

    added = signature_counter - old_signature_counter
    removed = old_signature_counter - signature_counter
    total_delta = sum(added.values()) + sum(removed.values())
    drift_ratio = total_delta / max(int(row_count), 1)
    if drift_ratio > 0.6:
        _increment_wip_metric("search_index_reconciliation_fallbacks")
        return None

    field_counters = {
        "workorders": Counter(old_field_counters_raw.get("workorders") or {}),
        "lotids": Counter(old_field_counters_raw.get("lotids") or {}),
        "packages": Counter(old_field_counters_raw.get("packages") or {}),
        "types": Counter(old_field_counters_raw.get("types") or {}),
    }
    for signature, count in added.items():
        wo, lot, pkg, pj = signature_fields.get(signature, ("", "", "", ""))
        if wo:
            field_counters["workorders"][wo] += count
        if lot:
            field_counters["lotids"][lot] += count
        if pkg:
            field_counters["packages"][pkg] += count
        if pj:
            field_counters["types"][pj] += count

    previous_fields = {
        sig: tuple(str(v) for v in sig.split("\x1f", 3))
        for sig in old_signature_counter.keys()
    }
    for signature, count in removed.items():
        wo, lot, pkg, pj = previous_fields.get(signature, ("", "", "", ""))
        if wo:
            field_counters["workorders"][wo] -= count
            if field_counters["workorders"][wo] <= 0:
                field_counters["workorders"].pop(wo, None)
        if lot:
            field_counters["lotids"][lot] -= count
            if field_counters["lotids"][lot] <= 0:
                field_counters["lotids"].pop(lot, None)
        if pkg:
            field_counters["packages"][pkg] -= count
            if field_counters["packages"][pkg] <= 0:
                field_counters["packages"].pop(pkg, None)
        if pj:
            field_counters["types"][pj] -= count
            if field_counters["types"][pj] <= 0:
                field_counters["types"].pop(pj, None)

    _increment_wip_metric("search_index_incremental_updates")
    return _materialize_search_payload(
        version=version,
        row_count=row_count,
        signature_counter=signature_counter,
        field_counters=field_counters,
        mode="incremental",
        added_rows=sum(added.values()),
        removed_rows=sum(removed.values()),
        drift_ratio=drift_ratio,
    )
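The incremental path relies on `collections.Counter` set arithmetic: subtracting one counter from another keeps only positive deltas, so `new - old` yields what was added and `old - new` what was removed. A standalone sketch (the signature strings are invented; the drift denominator here uses the new total row count as a stand-in for `row_count`):

```python
from collections import Counter

# Counter subtraction keeps only positive counts, giving add/remove deltas.
old = Counter({"wo1\x1flot1": 2, "wo2\x1flot2": 1})
new = Counter({"wo1\x1flot1": 3, "wo3\x1flot3": 1})

added = new - old      # signatures that gained rows
removed = old - new    # signatures that lost rows
drift = (sum(added.values()) + sum(removed.values())) / max(sum(new.values()), 1)
print(dict(added), dict(removed), round(drift, 2))
```

When the drift ratio crosses the threshold, patching the old counters would touch most of the data anyway, so a full rebuild is cheaper and avoids accumulated error.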
def _build_wip_snapshot(df: pd.DataFrame, include_dummy: bool, version: str) -> Dict[str, Any]:
    filtered = _filter_base_conditions(df, include_dummy=include_dummy)
    filtered = _add_wip_status_columns(filtered).reset_index(drop=True)

    hold_type_series = pd.Series(index=filtered.index, dtype=object)
    if not filtered.empty:
        hold_type_series = pd.Series("", index=filtered.index, dtype=object)
        hold_type_series.loc[filtered["IS_QUALITY_HOLD"]] = "quality"
        hold_type_series.loc[filtered["IS_NON_QUALITY_HOLD"]] = "non-quality"

    indexes = {
        "workcenter": _build_value_index(filtered, "WORKCENTER_GROUP"),
        "package": _build_value_index(filtered, "PACKAGE_LEF"),
        "pj_type": _build_value_index(filtered, "PJ_TYPE"),
        "wip_status": _build_value_index(filtered, "WIP_STATUS"),
        "hold_type": _build_value_index(pd.DataFrame({"HOLD_TYPE": hold_type_series}), "HOLD_TYPE"),
    }
    exact_bucket_count = sum(len(bucket) for bucket in indexes.values())
    return {
        "version": version,
        "built_at": datetime.now().isoformat(),
        "row_count": int(len(filtered)),
        "frame": filtered,
        "indexes": indexes,
        "frame_bytes": _estimate_dataframe_bytes(filtered),
        "index_bucket_count": int(exact_bucket_count),
    }
def _get_wip_snapshot(include_dummy: bool) -> Optional[Dict[str, Any]]:
    cache_key = "with_dummy" if include_dummy else "without_dummy"
    version = _get_wip_cache_version()
    with _wip_snapshot_lock:
        cached = _wip_snapshot_cache.get(cache_key)
        if cached and cached.get("version") == version:
            _increment_wip_metric("snapshot_hits")
            return cached

    _increment_wip_metric("snapshot_misses")
    df = _get_wip_dataframe()
    if df is None:
        return None
    snapshot = _build_wip_snapshot(df, include_dummy=include_dummy, version=version)
    with _wip_snapshot_lock:
        existing = _wip_snapshot_cache.get(cache_key)
        if existing and existing.get("version") == version:
            _increment_wip_metric("snapshot_hits")
            return existing
        _wip_snapshot_cache[cache_key] = snapshot
    return snapshot
def _get_wip_search_index(include_dummy: bool) -> Optional[Dict[str, Any]]:
    cache_key = "with_dummy" if include_dummy else "without_dummy"
@@ -184,14 +543,37 @@ def _get_wip_search_index(include_dummy: bool) -> Optional[Dict[str, Any]]:
    with _wip_search_index_lock:
        cached = _wip_search_index_cache.get(cache_key)
        if cached and cached.get("version") == version:
            _increment_wip_metric("search_index_hits")
            return cached

    _increment_wip_metric("search_index_misses")
    snapshot = _get_wip_snapshot(include_dummy=include_dummy)
    if snapshot is None:
        return None

    filtered = snapshot["frame"]
    signature_counter, signature_fields = _build_search_signatures(filtered)
    with _wip_search_index_lock:
        previous = _wip_search_index_cache.get(cache_key)
    index_payload = _try_incremental_search_sync(
        previous or {},
        version=version,
        row_count=int(snapshot.get("row_count", 0)),
        signature_counter=signature_counter,
        signature_fields=signature_fields,
    )
    if index_payload is None:
        field_counters = _build_field_counters(signature_counter, signature_fields)
        index_payload = _materialize_search_payload(
            version=version,
            row_count=int(snapshot.get("row_count", 0)),
            signature_counter=signature_counter,
            field_counters=field_counters,
            mode="full",
        )
        _increment_wip_metric("search_index_rebuilds")

    with _wip_search_index_lock:
        _wip_search_index_cache[cache_key] = index_payload
@@ -207,9 +589,9 @@ def _search_values_from_index(values: List[str], query: str, limit: int) -> List
def get_wip_search_index_status() -> Dict[str, Any]:
    """Expose WIP derived search-index freshness for diagnostics."""
    with _wip_search_index_lock:
        search_snapshot = {}
        for key, payload in _wip_search_index_cache.items():
            search_snapshot[key] = {
                "version": payload.get("version"),
                "built_at": payload.get("built_at"),
                "row_count": payload.get("row_count", 0),
@@ -217,8 +599,39 @@ def get_wip_search_index_status() -> Dict[str, Any]:
"lotids": len(payload.get("lotids", [])), "lotids": len(payload.get("lotids", [])),
"packages": len(payload.get("packages", [])), "packages": len(payload.get("packages", [])),
"types": len(payload.get("types", [])), "types": len(payload.get("types", [])),
"sync_mode": payload.get("sync_mode"),
"sync_added_rows": payload.get("sync_added_rows", 0),
"sync_removed_rows": payload.get("sync_removed_rows", 0),
"drift_ratio": payload.get("drift_ratio", 0.0),
"memory_bytes": payload.get("memory_bytes", 0),
}
with _wip_snapshot_lock:
frame_snapshot = {}
for key, payload in _wip_snapshot_cache.items():
frame_snapshot[key] = {
"version": payload.get("version"),
"built_at": payload.get("built_at"),
"row_count": payload.get("row_count", 0),
"frame_bytes": payload.get("frame_bytes", 0),
"index_bucket_count": payload.get("index_bucket_count", 0),
}
with _wip_index_metrics_lock:
metrics = dict(_wip_index_metrics)
total_frame_bytes = sum(item.get("frame_bytes", 0) for item in frame_snapshot.values())
total_search_bytes = sum(item.get("memory_bytes", 0) for item in search_snapshot.values())
amplification_ratio = round((total_frame_bytes + total_search_bytes) / max(total_frame_bytes, 1), 4)
return {
"derived_search_index": search_snapshot,
"derived_frame_snapshot": frame_snapshot,
"metrics": metrics,
"memory": {
"frame_bytes_total": int(total_frame_bytes),
"search_bytes_total": int(total_search_bytes),
"amplification_ratio": amplification_ratio,
},
} }
return snapshot
def _add_wip_status_columns(df: pd.DataFrame) -> pd.DataFrame: def _add_wip_status_columns(df: pd.DataFrame) -> pd.DataFrame:
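The memory telemetry added above reduces to a single formula: total derived-cache bytes divided by base-frame bytes, guarded against division by zero. A minimal standalone sketch of that calculation (mirroring the expression in `get_wip_search_index_status`):

```python
def amplification_ratio(frame_bytes_total: int, search_bytes_total: int) -> float:
    """Ratio of total derived-cache memory to the base frame snapshot.

    Mirrors the formula in get_wip_search_index_status(); max(..., 1)
    guards against division by zero when no frame snapshot exists yet.
    """
    return round((frame_bytes_total + search_bytes_total) / max(frame_bytes_total, 1), 4)

# A 100 MB frame plus a 30 MB search index amplifies memory 1.3x.
print(amplification_ratio(100_000_000, 30_000_000))  # → 1.3
```

Values above 1.0 mean the derived indexes cost extra memory on top of the cached frame, which is what the `max_memory_amplification_ratio` benchmark threshold bounds.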
```diff
@@ -235,24 +648,31 @@ def _add_wip_status_columns(df: pd.DataFrame) -> pd.DataFrame:
     Returns:
         DataFrame with additional status columns
     """
-    df = df.copy()
+    required = {'WIP_STATUS', 'IS_QUALITY_HOLD', 'IS_NON_QUALITY_HOLD'}
+    if required.issubset(df.columns):
+        return df
+
+    working = df.copy()

     # Ensure numeric columns
-    df['EQUIPMENTCOUNT'] = pd.to_numeric(df['EQUIPMENTCOUNT'], errors='coerce').fillna(0)
-    df['CURRENTHOLDCOUNT'] = pd.to_numeric(df['CURRENTHOLDCOUNT'], errors='coerce').fillna(0)
-    df['QTY'] = pd.to_numeric(df['QTY'], errors='coerce').fillna(0)
+    working['EQUIPMENTCOUNT'] = pd.to_numeric(working['EQUIPMENTCOUNT'], errors='coerce').fillna(0)
+    working['CURRENTHOLDCOUNT'] = pd.to_numeric(working['CURRENTHOLDCOUNT'], errors='coerce').fillna(0)
+    working['QTY'] = pd.to_numeric(working['QTY'], errors='coerce').fillna(0)

     # Compute WIP status
-    df['WIP_STATUS'] = 'QUEUE'  # Default
-    df.loc[df['EQUIPMENTCOUNT'] > 0, 'WIP_STATUS'] = 'RUN'
-    df.loc[(df['EQUIPMENTCOUNT'] == 0) & (df['CURRENTHOLDCOUNT'] > 0), 'WIP_STATUS'] = 'HOLD'
+    working['WIP_STATUS'] = 'QUEUE'  # Default
+    working.loc[working['EQUIPMENTCOUNT'] > 0, 'WIP_STATUS'] = 'RUN'
+    working.loc[
+        (working['EQUIPMENTCOUNT'] == 0) & (working['CURRENTHOLDCOUNT'] > 0),
+        'WIP_STATUS'
+    ] = 'HOLD'

     # Compute hold type
-    df['IS_NON_QUALITY_HOLD'] = df['HOLDREASONNAME'].isin(NON_QUALITY_HOLD_REASONS)
-    df['IS_QUALITY_HOLD'] = (df['WIP_STATUS'] == 'HOLD') & ~df['IS_NON_QUALITY_HOLD']
-    df['IS_NON_QUALITY_HOLD'] = (df['WIP_STATUS'] == 'HOLD') & df['IS_NON_QUALITY_HOLD']
+    non_quality_flags = working['HOLDREASONNAME'].isin(NON_QUALITY_HOLD_REASONS)
+    working['IS_QUALITY_HOLD'] = (working['WIP_STATUS'] == 'HOLD') & ~non_quality_flags
+    working['IS_NON_QUALITY_HOLD'] = (working['WIP_STATUS'] == 'HOLD') & non_quality_flags

-    return df
+    return working


 def _filter_base_conditions(
```
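The classification rules in `_add_wip_status_columns` are equipment-running wins, otherwise an active hold, otherwise queued; hold lots are then split by whether the reason is a non-quality reason. A simplified pure-Python, per-row sketch of the same rules (the reason values here are illustrative placeholders, not the real `NON_QUALITY_HOLD_REASONS` contents; the real code is vectorized with pandas):

```python
# Illustrative placeholder values; the real set lives in the module constants.
NON_QUALITY_HOLD_REASONS = {"ENG HOLD", "CUSTOMER HOLD"}

def classify(row: dict) -> dict:
    """Per-row equivalent of the vectorized WIP status derivation."""
    equipment = row.get("EQUIPMENTCOUNT") or 0
    holds = row.get("CURRENTHOLDCOUNT") or 0
    if equipment > 0:
        status = "RUN"                 # any equipment assigned -> running
    elif holds > 0:
        status = "HOLD"                # no equipment but active holds
    else:
        status = "QUEUE"               # default
    non_quality = row.get("HOLDREASONNAME") in NON_QUALITY_HOLD_REASONS
    return {
        "WIP_STATUS": status,
        "IS_QUALITY_HOLD": status == "HOLD" and not non_quality,
        "IS_NON_QUALITY_HOLD": status == "HOLD" and non_quality,
    }
```

The `required.issubset` guard in the diff makes the pandas version idempotent: rows already carrying all three derived columns are returned untouched.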
```diff
@@ -272,24 +692,18 @@ def _filter_base_conditions(
     Returns:
         Filtered DataFrame
     """
-    df = df.copy()
-
-    # Exclude NULL WORKORDER (raw materials)
-    df = df[df['WORKORDER'].notna()]
-
-    # DUMMY exclusion
-    if not include_dummy:
-        df = df[~df['LOTID'].str.contains('DUMMY', case=False, na=False)]
-
-    # WORKORDER filter (fuzzy match)
-    if workorder:
-        df = df[df['WORKORDER'].str.contains(workorder, case=False, na=False)]
-
-    # LOTID filter (fuzzy match)
-    if lotid:
-        df = df[df['LOTID'].str.contains(lotid, case=False, na=False)]
-
-    return df
+    if df is None or df.empty:
+        return df.iloc[0:0] if isinstance(df, pd.DataFrame) else pd.DataFrame()
+
+    mask = _build_filter_mask(
+        df,
+        include_dummy=include_dummy,
+        workorder=workorder,
+        lotid=lotid,
+    )
+    if mask.empty:
+        return df.iloc[0:0]
+    return df.loc[mask]


 # ============================================================
```
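The rewrite above replaces a chain of intermediate DataFrame copies with one combined boolean mask. `_build_filter_mask` itself is not shown in this hunk; a hypothetical pure-Python sketch of what such a helper does (the real one returns a pandas boolean Series, but the conditions are the same ones the deleted code applied):

```python
def build_filter_mask(rows, include_dummy=False, workorder=None, lotid=None):
    """Return one boolean per row, ANDing all base conditions in a single pass.

    Hypothetical stand-in for _build_filter_mask, shown over plain dicts.
    """
    mask = []
    for row in rows:
        keep = row.get("WORKORDER") is not None            # exclude raw materials
        lot = row.get("LOTID") or ""
        if keep and not include_dummy:
            keep = "DUMMY" not in lot.upper()              # DUMMY exclusion
        if keep and workorder:
            keep = workorder.lower() in row["WORKORDER"].lower()  # fuzzy match
        if keep and lotid:
            keep = lotid.lower() in lot.lower()            # fuzzy match
        mask.append(keep)
    return mask
```

Building the mask once and indexing with `df.loc[mask]` avoids the repeated copy-and-filter passes of the old implementation.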
```diff
@@ -325,16 +739,15 @@ def get_wip_summary(
     cached_df = _get_wip_dataframe()
     if cached_df is not None:
         try:
-            df = _filter_base_conditions(cached_df, include_dummy, workorder, lotid)
-            df = _add_wip_status_columns(df)
-
-            # Apply package filter
-            if package and 'PACKAGE_LEF' in df.columns:
-                df = df[df['PACKAGE_LEF'] == package]
-
-            # Apply pj_type filter
-            if pj_type and 'PJ_TYPE' in df.columns:
-                df = df[df['PJ_TYPE'] == pj_type]
+            df = _select_with_snapshot_indexes(
+                include_dummy=include_dummy,
+                workorder=workorder,
+                lotid=lotid,
+                package=package,
+                pj_type=pj_type,
+            )
+            if df is None:
+                return _get_wip_summary_from_oracle(include_dummy, workorder, lotid, package, pj_type)

             if df.empty:
                 return {
```
```diff
@@ -495,32 +908,31 @@ def get_wip_matrix(
     cached_df = _get_wip_dataframe()
     if cached_df is not None:
         try:
-            df = _filter_base_conditions(cached_df, include_dummy, workorder, lotid)
-            df = _add_wip_status_columns(df)
+            status_upper = status.upper() if status else None
+            hold_type_filter = hold_type if status_upper == 'HOLD' else None
+            df = _select_with_snapshot_indexes(
+                include_dummy=include_dummy,
+                workorder=workorder,
+                lotid=lotid,
+                package=package,
+                pj_type=pj_type,
+                status=status_upper,
+                hold_type=hold_type_filter,
+            )
+            if df is None:
+                return _get_wip_matrix_from_oracle(
+                    include_dummy,
+                    workorder,
+                    lotid,
+                    status,
+                    hold_type,
+                    package,
+                    pj_type,
+                )

             # Filter by WORKCENTER_GROUP and PACKAGE_LEF
             df = df[df['WORKCENTER_GROUP'].notna() & df['PACKAGE_LEF'].notna()]

-            # Apply package filter
-            if package:
-                df = df[df['PACKAGE_LEF'] == package]
-
-            # Apply pj_type filter
-            if pj_type and 'PJ_TYPE' in df.columns:
-                df = df[df['PJ_TYPE'] == pj_type]
-
-            # WIP status filter
-            if status:
-                status_upper = status.upper()
-                df = df[df['WIP_STATUS'] == status_upper]
-
-                # Hold type sub-filter
-                if status_upper == 'HOLD' and hold_type:
-                    if hold_type == 'quality':
-                        df = df[df['IS_QUALITY_HOLD']]
-                    elif hold_type == 'non-quality':
-                        df = df[df['IS_NON_QUALITY_HOLD']]
-
             if df.empty:
                 return {
                     'workcenters': [],
```
```diff
@@ -677,11 +1089,17 @@ def get_wip_hold_summary(
     cached_df = _get_wip_dataframe()
     if cached_df is not None:
         try:
-            df = _filter_base_conditions(cached_df, include_dummy, workorder, lotid)
-            df = _add_wip_status_columns(df)
+            df = _select_with_snapshot_indexes(
+                include_dummy=include_dummy,
+                workorder=workorder,
+                lotid=lotid,
+                status='HOLD',
+            )
+            if df is None:
+                return _get_wip_hold_summary_from_oracle(include_dummy, workorder, lotid)

             # Filter for HOLD status with reason
-            df = df[(df['WIP_STATUS'] == 'HOLD') & df['HOLDREASONNAME'].notna()]
+            df = df[df['HOLDREASONNAME'].notna()]

             if df.empty:
                 return {'items': []}
```
```diff
@@ -805,17 +1223,40 @@ def get_wip_detail(
     cached_df = _get_wip_dataframe()
     if cached_df is not None:
         try:
-            df = _filter_base_conditions(cached_df, include_dummy, workorder, lotid)
-            df = _add_wip_status_columns(df)
-
-            # Filter by workcenter
-            df = df[df['WORKCENTER_GROUP'] == workcenter]
-
-            if package:
-                df = df[df['PACKAGE_LEF'] == package]
+            summary_df = _select_with_snapshot_indexes(
+                include_dummy=include_dummy,
+                workorder=workorder,
+                lotid=lotid,
+                package=package,
+                workcenter=workcenter,
+            )
+            if summary_df is None:
+                return _get_wip_detail_from_oracle(
+                    workcenter,
+                    package,
+                    status,
+                    hold_type,
+                    workorder,
+                    lotid,
+                    include_dummy,
+                    page,
+                    page_size,
+                )
+            if summary_df.empty:
+                summary = {
+                    'totalLots': 0,
+                    'runLots': 0,
+                    'queueLots': 0,
+                    'holdLots': 0,
+                    'qualityHoldLots': 0,
+                    'nonQualityHoldLots': 0
+                }
+                df = summary_df
+            else:
+                df = summary_df

             # Calculate summary before status filter
-            summary_df = df.copy()
             run_lots = len(summary_df[summary_df['WIP_STATUS'] == 'RUN'])
             queue_lots = len(summary_df[summary_df['WIP_STATUS'] == 'QUEUE'])
             hold_lots = len(summary_df[summary_df['WIP_STATUS'] == 'HOLD'])
@@ -835,13 +1276,29 @@ def get_wip_detail(
             # Apply status filter for lots list
             if status:
                 status_upper = status.upper()
-                df = df[df['WIP_STATUS'] == status_upper]
-
-                if status_upper == 'HOLD' and hold_type:
-                    if hold_type == 'quality':
-                        df = df[df['IS_QUALITY_HOLD']]
-                    elif hold_type == 'non-quality':
-                        df = df[df['IS_NON_QUALITY_HOLD']]
+                hold_type_filter = hold_type if status_upper == 'HOLD' else None
+                filtered_df = _select_with_snapshot_indexes(
+                    include_dummy=include_dummy,
+                    workorder=workorder,
+                    lotid=lotid,
+                    package=package,
+                    workcenter=workcenter,
+                    status=status_upper,
+                    hold_type=hold_type_filter,
+                )
+                if filtered_df is None:
+                    return _get_wip_detail_from_oracle(
+                        workcenter,
+                        package,
+                        status,
+                        hold_type,
+                        workorder,
+                        lotid,
+                        include_dummy,
+                        page,
+                        page_size,
+                    )
+                df = filtered_df

             # Get specs (sorted by SPECSEQUENCE if available)
             specs_df = df[df['SPECNAME'].notna()][['SPECNAME', 'SPECSEQUENCE']].drop_duplicates()
```
```diff
@@ -1083,7 +1540,9 @@ def get_workcenters(include_dummy: bool = False) -> Optional[List[Dict[str, Any]]]:
     cached_df = _get_wip_dataframe()
     if cached_df is not None:
         try:
-            df = _filter_base_conditions(cached_df, include_dummy)
+            df = _select_with_snapshot_indexes(include_dummy=include_dummy)
+            if df is None:
+                return _get_workcenters_from_oracle(include_dummy)

             df = df[df['WORKCENTER_GROUP'].notna()]
             if df.empty:
@@ -1162,7 +1621,9 @@ def get_packages(include_dummy: bool = False) -> Optional[List[Dict[str, Any]]]:
     cached_df = _get_wip_dataframe()
     if cached_df is not None:
         try:
-            df = _filter_base_conditions(cached_df, include_dummy)
+            df = _select_with_snapshot_indexes(include_dummy=include_dummy)
+            if df is None:
+                return _get_packages_from_oracle(include_dummy)

             df = df[df['PACKAGE_LEF'].notna()]
             if df.empty:
```
```diff
@@ -1267,15 +1728,16 @@ def search_workorders(
     cached_df = _get_wip_dataframe()
     if cached_df is not None:
         try:
-            df = _filter_base_conditions(cached_df, include_dummy, lotid=lotid)
+            df = _select_with_snapshot_indexes(
+                include_dummy=include_dummy,
+                lotid=lotid,
+                package=package,
+                pj_type=pj_type,
+            )
+            if df is None:
+                return _search_workorders_from_oracle(q, limit, include_dummy, lotid, package, pj_type)

             df = df[df['WORKORDER'].notna()]

-            # Apply cross-filters
-            if package and 'PACKAGE_LEF' in df.columns:
-                df = df[df['PACKAGE_LEF'] == package]
-            if pj_type and 'PJ_TYPE' in df.columns:
-                df = df[df['PJ_TYPE'] == pj_type]
-
             # Filter by search query (case-insensitive)
             df = df[df['WORKORDER'].str.contains(q, case=False, na=False)]
@@ -1375,13 +1837,14 @@ def search_lot_ids(
     cached_df = _get_wip_dataframe()
     if cached_df is not None:
         try:
-            df = _filter_base_conditions(cached_df, include_dummy, workorder=workorder)
-
-            # Apply cross-filters
-            if package and 'PACKAGE_LEF' in df.columns:
-                df = df[df['PACKAGE_LEF'] == package]
-            if pj_type and 'PJ_TYPE' in df.columns:
-                df = df[df['PJ_TYPE'] == pj_type]
+            df = _select_with_snapshot_indexes(
+                include_dummy=include_dummy,
+                workorder=workorder,
+                package=package,
+                pj_type=pj_type,
+            )
+            if df is None:
+                return _search_lot_ids_from_oracle(q, limit, include_dummy, workorder, package, pj_type)

             # Filter by search query (case-insensitive)
             df = df[df['LOTID'].str.contains(q, case=False, na=False)]
@@ -1481,7 +1944,14 @@ def search_packages(
     cached_df = _get_wip_dataframe()
     if cached_df is not None:
         try:
-            df = _filter_base_conditions(cached_df, include_dummy, workorder=workorder, lotid=lotid)
+            df = _select_with_snapshot_indexes(
+                include_dummy=include_dummy,
+                workorder=workorder,
+                lotid=lotid,
+                pj_type=pj_type,
+            )
+            if df is None:
+                return _search_packages_from_oracle(q, limit, include_dummy, workorder, lotid, pj_type)

             # Check if PACKAGE_LEF column exists
             if 'PACKAGE_LEF' not in df.columns:
@@ -1490,10 +1960,6 @@
             df = df[df['PACKAGE_LEF'].notna()]

-            # Apply cross-filter
-            if pj_type and 'PJ_TYPE' in df.columns:
-                df = df[df['PJ_TYPE'] == pj_type]
-
             # Filter by search query (case-insensitive)
             df = df[df['PACKAGE_LEF'].str.contains(q, case=False, na=False)]
```
```diff
@@ -1591,7 +2057,14 @@ def search_types(
     cached_df = _get_wip_dataframe()
     if cached_df is not None:
         try:
-            df = _filter_base_conditions(cached_df, include_dummy, workorder=workorder, lotid=lotid)
+            df = _select_with_snapshot_indexes(
+                include_dummy=include_dummy,
+                workorder=workorder,
+                lotid=lotid,
+                package=package,
+            )
+            if df is None:
+                return _search_types_from_oracle(q, limit, include_dummy, workorder, lotid, package)

             # Check if PJ_TYPE column exists
             if 'PJ_TYPE' not in df.columns:
@@ -1600,10 +2073,6 @@
             df = df[df['PJ_TYPE'].notna()]

-            # Apply cross-filter
-            if package and 'PACKAGE_LEF' in df.columns:
-                df = df[df['PACKAGE_LEF'] == package]
-
             # Filter by search query (case-insensitive)
             df = df[df['PJ_TYPE'].str.contains(q, case=False, na=False)]
```
```diff
@@ -1686,11 +2155,15 @@ def get_hold_detail_summary(
     cached_df = _get_wip_dataframe()
     if cached_df is not None:
         try:
-            df = _filter_base_conditions(cached_df, include_dummy)
-            df = _add_wip_status_columns(df)
+            df = _select_with_snapshot_indexes(
+                include_dummy=include_dummy,
+                status='HOLD',
+            )
+            if df is None:
+                return _get_hold_detail_summary_from_oracle(reason, include_dummy)

             # Filter for HOLD status with matching reason
-            df = df[(df['WIP_STATUS'] == 'HOLD') & (df['HOLDREASONNAME'] == reason)]
+            df = df[df['HOLDREASONNAME'] == reason]

             if df.empty:
                 return {
@@ -1783,11 +2256,15 @@ def get_hold_detail_distribution(
     cached_df = _get_wip_dataframe()
     if cached_df is not None:
         try:
-            df = _filter_base_conditions(cached_df, include_dummy)
-            df = _add_wip_status_columns(df)
+            df = _select_with_snapshot_indexes(
+                include_dummy=include_dummy,
+                status='HOLD',
+            )
+            if df is None:
+                return _get_hold_detail_distribution_from_oracle(reason, include_dummy)

             # Filter for HOLD status with matching reason
-            df = df[(df['WIP_STATUS'] == 'HOLD') & (df['HOLDREASONNAME'] == reason)]
+            df = df[df['HOLDREASONNAME'] == reason]

             total_lots = len(df)
@@ -2072,20 +2549,30 @@ def get_hold_detail_lots(
     cached_df = _get_wip_dataframe()
     if cached_df is not None:
         try:
-            df = _filter_base_conditions(cached_df, include_dummy)
-            df = _add_wip_status_columns(df)
+            df = _select_with_snapshot_indexes(
+                include_dummy=include_dummy,
+                workcenter=workcenter,
+                package=package,
+                status='HOLD',
+            )
+            if df is None:
+                return _get_hold_detail_lots_from_oracle(
+                    reason=reason,
+                    workcenter=workcenter,
+                    package=package,
+                    age_range=age_range,
+                    include_dummy=include_dummy,
+                    page=page,
+                    page_size=page_size,
+                )

             # Filter for HOLD status with matching reason
-            df = df[(df['WIP_STATUS'] == 'HOLD') & (df['HOLDREASONNAME'] == reason)]
+            df = df[df['HOLDREASONNAME'] == reason]

             # Ensure numeric columns
             df['AGEBYDAYS'] = pd.to_numeric(df['AGEBYDAYS'], errors='coerce').fillna(0)

-            # Optional filters
-            if workcenter:
-                df = df[df['WORKCENTER_GROUP'] == workcenter]
-            if package:
-                df = df[df['PACKAGE_LEF'] == package]
-
+            # Optional age filter
             if age_range:
                 if age_range == '0-1':
                     df = df[(df['AGEBYDAYS'] >= 0) & (df['AGEBYDAYS'] < 1)]
```

---

```diff
@@ -33,6 +33,23 @@ const MesApi = (function() {
     let requestCounter = 0;

+    function getCsrfToken() {
+        const meta = document.querySelector('meta[name="csrf-token"]');
+        return meta ? meta.content : '';
+    }
+
+    function withCsrfHeaders(headers, method) {
+        const normalized = (method || 'GET').toUpperCase();
+        const nextHeaders = { ...(headers || {}) };
+        if (['POST', 'PUT', 'PATCH', 'DELETE'].includes(normalized)) {
+            const token = getCsrfToken();
+            if (token && !nextHeaders['X-CSRF-Token']) {
+                nextHeaders['X-CSRF-Token'] = token;
+            }
+        }
+        return nextHeaders;
+    }
+
     /**
      * Generate a unique request ID
      */
@@ -205,9 +222,9 @@ const MesApi = (function() {
         const fetchOptions = {
             method: method,
-            headers: {
+            headers: withCsrfHeaders({
                 'Content-Type': 'application/json'
-            }
+            }, method)
         };

         if (options.body) {
```

---

```diff
@@ -3,6 +3,7 @@
 <head>
     <meta charset="UTF-8">
     <meta name="viewport" content="width=device-width, initial-scale=1.0">
+    <meta name="csrf-token" content="{{ csrf_token() }}">
     <title>{% block title %}MES Dashboard{% endblock %}</title>

     <!-- Toast 樣式 -->
```

---

```diff
@@ -223,6 +223,11 @@
 {% block scripts %}
 <script>
     const tbody = document.getElementById('pages-tbody');
+    const csrfToken = document.querySelector('meta[name="csrf-token"]')?.content || '';
+
+    function withCsrfHeaders(headers = {}) {
+        return csrfToken ? { ...headers, 'X-CSRF-Token': csrfToken } : headers;
+    }

     async function loadPages() {
         try {
@@ -266,7 +271,7 @@
         try {
             const response = await fetch(`/admin/api/pages${route}`, {
                 method: 'PUT',
-                headers: { 'Content-Type': 'application/json' },
+                headers: withCsrfHeaders({ 'Content-Type': 'application/json' }),
                 body: JSON.stringify({ status: newStatus })
             });
```

---

```diff
@@ -707,7 +707,13 @@
     // Auth Helper
     // ============================================================
     async function fetchWithAuth(url, options = {}) {
-        const resp = await fetch(url, { ...options, cache: 'no-store' });
+        const method = (options.method || 'GET').toUpperCase();
+        const csrfToken = document.querySelector('meta[name="csrf-token"]')?.content || '';
+        const headers = { ...(options.headers || {}) };
+        if (csrfToken && ['POST', 'PUT', 'PATCH', 'DELETE'].includes(method)) {
+            headers['X-CSRF-Token'] = csrfToken;
+        }
+        const resp = await fetch(url, { ...options, headers, cache: 'no-store' });
         if (resp.status === 401) {
             const json = await resp.json().catch(() => ({}));
             if (!authErrorShown) {
@@ -962,9 +968,15 @@
             document.getElementById('workerStartTime').textContent =
                 data.worker_start_time ? formatTimestamp(data.worker_start_time) : '--';

-            // Update cooldown status
+            // Update recovery policy status
+            const policyState = data?.resilience?.policy_state || {};
             const cooldown = data.cooldown;
-            if (cooldown && cooldown.active) {
+            if (policyState.blocked) {
+                document.getElementById('workerCooldown').textContent = 'Guarded mode(需手動 override)';
+                document.getElementById('restartBtn').disabled = false;
+                document.getElementById('restartBtn').style.opacity = '1';
+                document.getElementById('restartBtn').style.cursor = 'pointer';
+            } else if (cooldown && cooldown.active) {
                 document.getElementById('workerCooldown').textContent =
                     `冷卻中 (${cooldown.remaining_seconds}秒)`;
                 document.getElementById('restartBtn').disabled = true;
@@ -1017,11 +1029,41 @@
             btn.style.opacity = '0.5';

             try {
-                const resp = await fetchWithAuth('/admin/api/worker/restart', {
+                let resp = await fetchWithAuth('/admin/api/worker/restart', {
                     method: 'POST',
-                    headers: { 'Content-Type': 'application/json' }
+                    headers: { 'Content-Type': 'application/json' },
+                    body: JSON.stringify({})
                 });
-                const json = await resp.json();
+                let json = await resp.json();
+
+                if (!json.success && resp.status === 409) {
+                    const reason = window.prompt(
+                        '目前 restart policy 為 guarded mode。\n請輸入 override 原因(會記錄於稽核日誌):'
+                    );
+                    if (!reason || !reason.trim()) {
+                        alert('已取消 override。');
+                        return;
+                    }
+                    const acknowledged = window.confirm(
+                        '確認執行 manual override?此操作將繞過 guarded mode 保護。'
+                    );
+                    if (!acknowledged) {
+                        alert('已取消 override。');
+                        return;
+                    }
+                    resp = await fetchWithAuth('/admin/api/worker/restart', {
+                        method: 'POST',
+                        headers: { 'Content-Type': 'application/json' },
+                        body: JSON.stringify({
+                            manual_override: true,
+                            override_acknowledged: true,
+                            override_reason: reason.trim()
+                        })
+                    });
+                    json = await resp.json();
+                }

                 if (!json.success) {
                     alert('重啟失敗: ' + (json.error?.message || '未知錯誤'));
```
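The 409-then-retry flow above implies a server-side restart policy: restarts are allowed during normal operation, but once the churn budget is exhausted the endpoint enters guarded mode and answers 409 unless the caller supplies an acknowledged manual override with a reason. A minimal sketch of that decision (field names and thresholds are illustrative, not taken from the actual endpoint):

```python
from dataclasses import dataclass

@dataclass
class RestartPolicy:
    """Illustrative model of the guarded-mode decision behind
    /admin/api/worker/restart: 200 while within budget, 409 once blocked,
    unless an acknowledged override with a reason is provided."""
    max_restarts: int = 3
    restarts: int = 0
    blocked: bool = False

    def decide(self, manual_override=False, acknowledged=False, reason=""):
        if self.blocked:
            if manual_override and acknowledged and reason.strip():
                return 200  # override accepted; caller writes an audit record
            return 409      # guarded mode: client must re-submit with override
        self.restarts += 1
        if self.restarts >= self.max_restarts:
            self.blocked = True  # churn budget exhausted; enter guarded mode
        return 200
```

This matches the UI flow: the first 409 triggers the prompt/confirm dialog, and the retried request carries `manual_override`, `override_acknowledged`, and `override_reason` for the audit log.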

---

```diff
@@ -682,7 +682,7 @@
     // State
     // ============================================================
     const state = {
-        reason: '{{ reason | e }}',
+        reason: {{ reason | tojson }},
         summary: null,
         distribution: null,
         lots: null,
```

---

```diff
@@ -132,6 +132,7 @@
     {% endif %}

     <form method="POST">
+        <input type="hidden" name="csrf_token" value="{{ csrf_token() }}">
         <div class="form-group">
             <label for="username">帳號</label>
             <input type="text" id="username" name="username" placeholder="工號或 Email" required autofocus>
```

---

```diff
@@ -0,0 +1,9 @@
+{
+    "rows": 30000,
+    "query_count": 400,
+    "seed": 42,
+    "thresholds": {
+        "max_p95_ratio_indexed_vs_baseline": 1.25,
+        "max_memory_amplification_ratio": 1.8
+    }
+}
```
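The fixture's thresholds suggest a gate that compares indexed-vs-baseline P95 latency and cache memory amplification against configured ceilings. A minimal sketch of how `scripts/run_cache_benchmarks.py` might consume such a fixture (the gate logic here is assumed, not taken from the script):

```python
import json
import math

def p95(samples):
    """Nearest-rank 95th percentile."""
    ordered = sorted(samples)
    idx = max(0, math.ceil(0.95 * len(ordered)) - 1)
    return ordered[idx]

def gate(fixture_json: str, baseline_ms, indexed_ms, amplification: float):
    """Return a list of threshold violations; empty means the gate passes."""
    cfg = json.loads(fixture_json)["thresholds"]
    ratio = p95(indexed_ms) / p95(baseline_ms)
    failures = []
    if ratio > cfg["max_p95_ratio_indexed_vs_baseline"]:
        failures.append(f"p95 ratio {ratio:.2f} exceeds limit")
    if amplification > cfg["max_memory_amplification_ratio"]:
        failures.append(f"memory amplification {amplification:.2f} exceeds limit")
    return failures

fixture = '{"thresholds": {"max_p95_ratio_indexed_vs_baseline": 1.25, "max_memory_amplification_ratio": 1.8}}'
print(gate(fixture, baseline_ms=[10.0] * 100, indexed_ms=[11.0] * 100, amplification=1.3))  # → []
```

A non-empty failure list would fail the CI gate, tying the `rows`/`query_count`/`seed` fields to a reproducible synthetic workload.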

Some files were not shown because too many files have changed in this diff.