chore: finalize vite migration hardening and archive openspec changes
149  README.md
@@ -26,11 +26,60 @@
| Worker restart control | ✅ Done |
| Runtime resilience diagnostics (threshold/churn/recommendation) | ✅ Done |
| Shared WIP autocomplete core module | ✅ Done |
| Shared WIP derive core module (KPI/filter/chart/table) | ✅ Done |
| WIP indexed query acceleration and incremental sync | ✅ Done |
| Cache memory amplification telemetry | ✅ Done |
| Cache benchmark gate (P95/memory thresholds) | ✅ Done |
| Worker guarded mode + manual override audit | ✅ Done |
| Runtime contract startup validation (conda/systemd/watchdog) | ✅ Done |
| Frontend core module tests (Node test) | ✅ Done |
| Deployment automation | ✅ Done |
---

## Development History (post-Vite refactor)
- 2026-02-07: Switched to the Flask + Vite single-port architecture; the legacy `DashBoard/` tree is retired.
- 2026-02-08: Filled in runtime resilience governance (threshold/churn/recommendation) and observable watchdog fields.
- 2026-02-08: Completed P0 security/stability hardening:
  - startup fails fast when production `SECRET_KEY` is missing
  - CSRF protection for the admin form and admin mutation APIs
  - health probes use a dedicated DB pool so they cannot block, or be blocked by, the main query pool
  - worker/app shutdown uniformly cleans up the cache updater, realtime sync, Redis, and the DB engine
  - `hold_detail` inline script variables are serialized with `tojson`
- 2026-02-08: Completed the P1 cache/query efficiency refactor:
  - the WIP query path uses indexed selection while keeping the `resource/wip` full-table cache semantics
  - WIP search index incremental sync (watermark/version) with a drift fallback
  - health/admin gained cache memory amplification telemetry
  - added `scripts/run_cache_benchmarks.py` + fixture gate
- 2026-02-08: Completed P2 ops self-healing governance:
  - shared runtime contract (app/start_server/watchdog/systemd)
  - fail-fast on conda/watchdog path drift at startup
  - worker restart policy (cooldown/retry budget/churn guarded mode)
  - manual override (requires ack + reason) with a structured audit log
- 2026-02-08: Completed round-2 security/stability hardening:
  - strict LDAP endpoint validation (`https` + `LDAP_ALLOWED_HOSTS`)
  - process-level caches gained `max_size + LRU` (WIP/Resource)
  - circuit breaker transition logging moved outside the lock to reduce lock contention
  - global security headers (CSP/XFO/nosniff/Referrer-Policy, plus HSTS in production)
  - WIP detail pagination parameters are bounded (`page>=1`, `1<=page_size<=500`)
- 2026-02-08: Completed round-3 residual risk fixes:
  - WIP cache publish is staged, so a failed refresh never pollutes the old snapshot
  - WIP slow-path parsing moved outside the lock; the realtime equipment process cache gained a bounded LRU
  - resource NaN cleanup switched to depth-safe iteration; WIP/Hold bool query parsing is unified
  - filter cache view names are env-configurable
  - `/health` and `/health/deep` gained a 5-second internal memo (disabled in testing mode)
  - lightweight rate limits on high-cost APIs (WIP detail/matrix, Hold lots, Resource status/detail)
  - DB connection string log redaction masks passwords
- 2026-02-08: Completed round-4 residual governance consolidation:
  - Resource derived index uses a row-position representation, removing the in-process full-records copy
  - Resource / Realtime Equipment share Oracle SQL fragments, reducing query definition drift
  - `resource_cache` / `realtime_equipment_cache` type annotations and hot-path constant naming converged
  - `page_registry` writes use atomic replace (tmp + rename), avoiding half-written config files
  - new test protection: shared SQL fragments, index normalization, no duplicate route bool parser definitions
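The bounded process-level cache from round-2 (`max_size + LRU`) can be sketched with `collections.OrderedDict`. This is an illustrative Python sketch under assumed names (`BoundedLRUCache` is not the project's actual class):

```python
from collections import OrderedDict

class BoundedLRUCache:
    """Minimal sketch of a process-level cache with max_size + LRU eviction."""

    def __init__(self, max_size=32):
        self.max_size = max_size
        self._data = OrderedDict()

    def get(self, key, default=None):
        if key not in self._data:
            return default
        self._data.move_to_end(key)  # mark as most recently used
        return self._data[key]

    def put(self, key, value):
        self._data[key] = value
        self._data.move_to_end(key)
        while len(self._data) > self.max_size:
            self._data.popitem(last=False)  # evict the least recently used entry
```

A per-cache limit like this is what `WIP_PROCESS_CACHE_MAX_SIZE` and friends (default 32) would parameterize.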
---

## Migration and Acceptance Docs

- Root cutover inventory: `docs/root_cutover_inventory.md`
@@ -46,20 +95,37 @@
1. Single-port contract unchanged
   - Flask + Gunicorn + Vite dist are served by one service (`GUNICORN_BIND`); frontend and backend are same-origin.

2. Runtime resilience uses "degradation + actionable recommendations + policy state"
   - `/health`, `/health/deep`, `/admin/api/system-status`, `/admin/api/worker/status` all provide:
     - thresholds
     - policy state (`allowed` / `cooldown` / `blocked`)
     - restart churn summary
     - alerts (pool/circuit/churn)
     - recovery recommendation (suggested on-call actions)
3. Watchdog self-healing policy is bounded
   - The restart flow includes cooldown + retry budget + churn window.
   - When churn exceeds the threshold, guarded mode engages; further restarts require an admin manual override.
   - The state file keeps a bounded restart history for policy decisions and audit.
4. Frontend governance: shared WIP compute
   - `frontend/src/core/autocomplete.js` is the shared logic source for WIP overview/detail.
   - `frontend/src/core/wip-derive.js` shares the KPI/filter/chart/table derivations.
   - Existing page flows and drill-down semantics are preserved; user workflows do not change.
5. P1 cache efficiency governance
   - The full-table cache strategy for `resource` and `wip` is retained (business constraint unchanged).
   - Queries use indexed selection, with memory amplification / index efficiency telemetry.
   - A benchmark gate verifies that P95 latency and memory amplification stay under their thresholds.
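The benchmark-gate idea in point 5 — fail when P95 latency or memory amplification exceeds its threshold — can be sketched as below. The nearest-rank P95 and the budget defaults are illustrative assumptions; the real gate and its thresholds live in `scripts/run_cache_benchmarks.py` and its fixtures:

```python
import math

def p95(samples):
    """Nearest-rank P95 of a list of latency samples."""
    ordered = sorted(samples)
    rank = max(1, math.ceil(0.95 * len(ordered)))
    return ordered[rank - 1]

def benchmark_gate(latencies_ms, baseline_bytes, indexed_bytes,
                   p95_budget_ms=200.0, amplification_budget=2.0):
    """Pass only when P95 latency and memory amplification are within budget.

    amplification = indexed memory footprint relative to the baseline cache.
    """
    amplification = indexed_bytes / baseline_bytes
    return p95(latencies_ms) <= p95_budget_ms and amplification <= amplification_budget
```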
6. P0 runtime hardening (security + stability)
   - Production must provide `SECRET_KEY`; the service refuses to start without it.
   - Mutation requests to `/admin/login` and `/admin/api/*` must carry a CSRF token.
   - The `/health` DB connectivity probe uses a dedicated health pool, reducing false alarms when the main pool is saturated.
   - Shutdown/restart releases background workers and Redis/DB connections in one place.
   - The LDAP API URL is validated at startup: `https` only, plus a host allowlist.
   - Global security headers: CSP/X-Frame-Options/X-Content-Type-Options/Referrer-Policy (HSTS in production).
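The global security headers in point 6 could be applied by a small helper like this sketch; the header values shown are common baselines, not necessarily the exact policy the app ships:

```python
def apply_security_headers(headers, production=False):
    """Add baseline security headers to a response-header dict (illustrative values)."""
    headers.setdefault("Content-Security-Policy", "default-src 'self'")
    headers.setdefault("X-Frame-Options", "DENY")
    headers.setdefault("X-Content-Type-Options", "nosniff")
    headers.setdefault("Referrer-Policy", "same-origin")
    if production:
        # HSTS only makes sense when the site is reliably served over HTTPS.
        headers.setdefault("Strict-Transport-Security",
                           "max-age=31536000; includeSubDomains")
    return headers
```

In the Flask app this would typically run in an `after_request` hook so every response gets the same policy.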
---

## Quick Start
@@ -175,6 +241,12 @@ DB_MAX_OVERFLOW=20
DB_POOL_TIMEOUT=30
DB_POOL_RECYCLE=1800
DB_CALL_TIMEOUT_MS=55000
DB_POOL_EXHAUSTED_RETRY_AFTER_SECONDS=5

# Dedicated DB pool for health probes (isolated from the main request pool)
DB_HEALTH_POOL_SIZE=1
DB_HEALTH_MAX_OVERFLOW=0
DB_HEALTH_POOL_TIMEOUT=2

# Circuit Breaker
CIRCUIT_BREAKER_ENABLED=true
@@ -192,6 +264,17 @@ WATCHDOG_RESTART_FLAG=./tmp/mes_dashboard_restart.flag
WATCHDOG_PID_FILE=./tmp/gunicorn.pid
WATCHDOG_STATE_FILE=./tmp/mes_dashboard_restart_state.json
WATCHDOG_RESTART_HISTORY_MAX=50
CONDA_BIN=/opt/miniconda3/bin/conda
CONDA_ENV_NAME=mes-dashboard
RUNTIME_CONTRACT_VERSION=2026.02-p2
RUNTIME_CONTRACT_ENFORCE=true

# Worker self-healing policy
WORKER_RESTART_COOLDOWN=60
WORKER_RESTART_RETRY_BUDGET=3
WORKER_RESTART_WINDOW_SECONDS=600
WORKER_RESTART_CHURN_THRESHOLD=3
WORKER_GUARDED_MODE_ENABLED=true

# Runtime resilience thresholds
RESILIENCE_DEGRADED_ALERT_SECONDS=300
@@ -202,6 +285,36 @@ RESILIENCE_RESTART_CHURN_THRESHOLD=3

# Admin settings
ADMIN_EMAILS=admin@example.com  # admin emails (comma-separated)
LDAP_API_URL=https://ldap-api.example.com
LDAP_ALLOWED_HOSTS=ldap-api.example.com,ldap-api-dr.example.com

# CSRF protection (admin form / admin mutation API)
CSRF_ENABLED=true

# Process-level cache bounded LRU (WIP/Resource)
PROCESS_CACHE_MAX_SIZE=32
WIP_PROCESS_CACHE_MAX_SIZE=32
RESOURCE_PROCESS_CACHE_MAX_SIZE=32
EQUIPMENT_PROCESS_CACHE_MAX_SIZE=32

# Filter cache source views (env-overridable)
FILTER_CACHE_WIP_VIEW=DWH.DW_MES_LOT_V
FILTER_CACHE_SPEC_WORKCENTER_VIEW=DWH.DW_MES_SPEC_WORKCENTER_V

# Health internal memoization
HEALTH_MEMO_TTL_SECONDS=5

# High-cost API rate limit (in-process)
WIP_MATRIX_RATE_LIMIT_MAX_REQUESTS=120
WIP_MATRIX_RATE_LIMIT_WINDOW_SECONDS=60
WIP_DETAIL_RATE_LIMIT_MAX_REQUESTS=90
WIP_DETAIL_RATE_LIMIT_WINDOW_SECONDS=60
HOLD_LOTS_RATE_LIMIT_MAX_REQUESTS=90
HOLD_LOTS_RATE_LIMIT_WINDOW_SECONDS=60
RESOURCE_DETAIL_RATE_LIMIT_MAX_REQUESTS=60
RESOURCE_DETAIL_RATE_LIMIT_WINDOW_SECONDS=60
RESOURCE_STATUS_RATE_LIMIT_MAX_REQUESTS=90
RESOURCE_STATUS_RATE_LIMIT_WINDOW_SECONDS=60
```
### Production Notes

@@ -226,6 +339,7 @@ sudo cp deploy/mes-dashboard-watchdog.service /etc/systemd/system/

# 2. Prepare the environment config file
sudo mkdir -p /etc/mes-dashboard
sudo cp deploy/mes-dashboard.env.example /etc/mes-dashboard/mes-dashboard.env

# 3. Reload systemd
@@ -238,6 +352,12 @@ sudo systemctl enable --now mes-dashboard mes-dashboard-watchdog
sudo systemctl status mes-dashboard
sudo systemctl status mes-dashboard-watchdog
```

Run the runtime contract validation:

```bash
RUNTIME_CONTRACT_ENFORCE=true ./scripts/start_server.sh check
```

### Rollback Steps
@@ -494,7 +614,8 @@ DashBoard_vite/
│   └── worker_watchdog.py               # Worker watchdog process
├── deploy/                              # Deployment config
│   ├── mes-dashboard.service            # Gunicorn systemd service (Conda)
│   ├── mes-dashboard-watchdog.service   # Watchdog systemd service (Conda)
│   └── mes-dashboard.env.example        # Runtime contract env template
├── tests/                               # Tests
├── data/                                # Data files
├── logs/                                # Logs
@@ -522,9 +643,12 @@ pytest tests/test_*_integration.py -v
# Run E2E tests
pytest tests/e2e/ -v

# Run stress tests
pytest tests/stress/ -v

# Cache benchmark gate (P1)
conda run -n mes-dashboard python scripts/run_cache_benchmarks.py --enforce
```

---
@@ -569,12 +693,17 @@ pytest tests/stress/ -v
### 2026-02-08

- Completed and archived proposal `post-migration-resilience-governance`
- Completed and archived proposal `p1-cache-query-efficiency`
- Completed and archived proposal `p2-ops-self-healing-runbook`
- Added the runtime resilience diagnostics core (thresholds / restart churn / recovery recommendation)
- Added worker restart policy state (allowed/cooldown/blocked) and the guarded mode override flow
- Added actionable resilience fields to the health and admin APIs:
  - `/health`, `/health/deep`
  - `/admin/api/system-status`, `/admin/api/worker/status`
- watchdog restart state supports bounded history (`WATCHDOG_RESTART_HISTORY_MAX`)
- Extracted the shared WIP overview/detail autocomplete/filter module (`frontend/src/core/autocomplete.js`)
- WIP overview/detail adopt the shared derive module (`frontend/src/core/wip-derive.js`)
- Added the cache benchmark fixture and baseline-vs-indexed threshold validation
- Added the frontend Node test flow (`npm --prefix frontend test`)
- Updated `README.mdj` and the migration runbook to align with the gates
@@ -654,5 +783,5 @@ pytest tests/stress/ -v

---

**Document version**: 4.2
**Last updated**: 2026-02-08
158  README.mdj
@@ -1,61 +1,151 @@
# MES Dashboard (README.mdj)

This document is a condensed technical companion to `README.md`, focused on the currently running architecture and the operational contract.
## 1. Architecture Summary (2026-02-08)

- Backend: Flask + Gunicorn (single port)
- Frontend: Vite build output to `src/mes_dashboard/static/dist`
- Caching: Redis + process-level cache + indexed selection telemetry
- Data: Oracle (QueuePool)
- Ops: watchdog + admin worker restart API + guarded-mode policy
## 2. Existing Design Principles (retained)

- `resource` (equipment master data) and `wip` (live line status) keep the full-table cache strategy.
- Frontend page logic and drill-down semantics are unchanged.
- The system keeps the single-port serving model (frontend and backend same-origin).
## 3. P0 Runtime Hardening (done)

- Production enforces `SECRET_KEY`: startup fails outright when it is unset or uses an insecure default.
- CSRF protection:
  - the `/admin/login` form requires a token
  - `POST/PUT/PATCH/DELETE` to `/admin/api/*` require `X-CSRF-Token`
- Session hardening: `session.clear()` + CSRF token rotation after a successful login.
- Health probe isolation: the `/health` DB connectivity check uses a dedicated health pool.
- Shutdown cleanup: stops the cache updater and the equipment sync worker, and closes Redis and the DB engine, in one place.
- XSS hardening: the `reason` value in the `hold_detail` fallback script now uses `tojson`.
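The `SECRET_KEY` fail-fast behavior can be sketched as a startup check; the deny-list of insecure defaults here is an assumption, and the real check lives in the app's config loading:

```python
INSECURE_DEFAULTS = {"", "dev", "changeme", "secret"}  # assumed deny-list

def require_secret_key(env):
    """Fail fast when production has no usable SECRET_KEY."""
    if env.get("FLASK_ENV") != "production":
        return env.get("SECRET_KEY", "dev")  # permissive outside production
    key = env.get("SECRET_KEY", "")
    if key in INSECURE_DEFAULTS:
        raise RuntimeError(
            "SECRET_KEY must be set to a non-default value in production"
        )
    return key
```

Raising during app construction (rather than logging a warning) is what makes the misconfiguration impossible to miss.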
## 4. P1 Cache/Query Efficiency (done)

- `resource` / `wip` still keep the full-table cache strategy (business constraint unchanged).
- WIP queries now use indexed selection, plus incremental sync (watermark/version) with a drift fallback.
- `/health`, `/health/deep`, `/admin/api/system-status` expose cache memory amplification/index telemetry.
- New benchmark harness: `scripts/run_cache_benchmarks.py --enforce`.
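The watermark/version incremental sync with drift fallback can be sketched as follows; the index layout (`version`, `by_lot`) and the field names are illustrative assumptions, not the actual search-index schema:

```python
def incremental_sync(index, rows, watermark, expected_version):
    """Apply only rows newer than the watermark to a derived index.

    Returns (new_watermark, drift). drift=True means the index version no
    longer matches, so the caller should fall back to a full rebuild.
    """
    if index.get("version") != expected_version:
        return watermark, True  # version drift: incremental update is unsafe
    new_watermark = watermark
    for row in rows:
        if row["updated_at"] <= watermark:
            continue  # already reflected in the index
        index["by_lot"][row["lotid"]] = row
        new_watermark = max(new_watermark, row["updated_at"])
    return new_watermark, False
```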
## 5. P2 Ops Self-Healing (done)

- Shared runtime contract: app/start_server/watchdog/systemd all use the same watchdog/conda path contract.
- Startup fail-fast: the service refuses to start on conda/runtime path drift and prints actionable diagnostics.
- Worker restart policy: cooldown + retry budget + churn guarded mode.
- Manual override: requires an admin identity + `manual_override` + `override_acknowledged` + `override_reason`, and is written to the audit log.
- Health/admin payloads expose the policy state: `allowed` / `cooldown` / `blocked`.
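The restart policy state (`allowed` / `cooldown` / `blocked`) can be sketched as a pure function of the restart history; the defaults mirror the `WORKER_RESTART_*` env values, but the function name and the exact precedence of the checks are assumptions:

```python
import time

def evaluate_restart_policy(history, now=None, cooldown=60, retry_budget=3,
                            window=600, churn_threshold=3):
    """Classify a restart request as 'allowed', 'cooldown', or 'blocked'.

    history: ascending list of past restart timestamps (epoch seconds).
    """
    now = time.time() if now is None else now
    recent = [t for t in history if now - t <= window]
    if len(recent) >= churn_threshold:
        return "blocked"    # churn guard: manual override required
    if history and now - history[-1] < cooldown:
        return "cooldown"   # too soon after the previous restart
    if len(recent) >= retry_budget:
        return "blocked"    # retry budget for this window exhausted
    return "allowed"
```

Keeping the policy a pure function of the bounded restart history is what lets health and admin payloads report the same state the watchdog enforces.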
## 6. Round-3 Residual Hardening (done)

- WIP cache publish is now staged: a failed refresh never overwrites the previous snapshot.
- WIP process cache slow-path parsing moved outside the lock, reducing lock contention.
- The realtime equipment process cache gained a bounded LRU (including `EQUIPMENT_PROCESS_CACHE_MAX_SIZE`).
- `_clean_nan_values` switched to depth-safe iterative cleanup (avoiding deep-recursion risk).
- WIP/Hold/Resource bool query parsers are unified (`core/utils.py`).
- Filter cache source views can be overridden via env (easier environment switching and testing).
- `/health`, `/health/deep` gained a 5-second memo (auto-disabled in testing mode).
- High-cost APIs gained a lightweight in-process rate limit; over-limit requests get a consistent 429 structure.
- DB connection string logging masks sensitive fields (password redaction).
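Staged publish — build and validate a full snapshot before swapping the reference — can be sketched in a few lines; `refresh_cache` is a hypothetical name, not the project's API:

```python
def refresh_cache(cache, build_snapshot, validate=bool):
    """Staged publish: only swap in a snapshot that built and validated fully.

    If build_snapshot() raises or validation fails, cache['snapshot'] keeps
    its previous value, so readers never observe a partial refresh.
    """
    staged = build_snapshot()  # may raise; nothing has been published yet
    if not validate(staged):
        raise ValueError("staged snapshot failed validation; old snapshot kept")
    cache["snapshot"] = staged  # single-reference swap
    return staged
```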
## 7. Round-4 Residual Consolidation (done)

- The Resource derived index switched to a row-position representation; no full-records copy is kept per process anymore.
- Resource / Realtime Equipment share Oracle SQL fragments, preventing query definitions from drifting apart.
- `resource_cache` / `realtime_equipment_cache` type annotations and hot-path constant naming converged.
- `page_registry` file writes use atomic replace, reducing the risk of half-written config files.
- New tests cover the shared SQL fragments and the no-duplicate-definition rule for the bool parser.
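The atomic-replace write used for `page_registry` (tmp + rename) follows a standard pattern; this sketch assumes a JSON payload and relies on `os.replace`, which is atomic on both POSIX and Windows:

```python
import json
import os
import tempfile

def atomic_write_json(path, payload):
    """Write JSON via a tmp file + os.replace so readers never see a partial file."""
    directory = os.path.dirname(os.path.abspath(path))
    fd, tmp_path = tempfile.mkstemp(dir=directory, suffix=".tmp")
    try:
        with os.fdopen(fd, "w", encoding="utf-8") as handle:
            json.dump(payload, handle)
            handle.flush()
            os.fsync(handle.fileno())  # ensure bytes hit disk before the rename
        os.replace(tmp_path, path)     # atomic swap into place
    except BaseException:
        if os.path.exists(tmp_path):
            os.remove(tmp_path)        # never leave the tmp file behind
        raise
```

The tmp file must live in the same directory as the target, since `os.replace` is only atomic within one filesystem.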
## 8. Key Environment Variables

```bash
FLASK_ENV=production
SECRET_KEY=<required-in-production>
CSRF_ENABLED=true

LDAP_API_URL=https://ldap-api.example.com
LDAP_ALLOWED_HOSTS=ldap-api.example.com,ldap-api-dr.example.com

DB_POOL_SIZE=10
DB_MAX_OVERFLOW=20
DB_POOL_TIMEOUT=30
DB_POOL_RECYCLE=1800
DB_CALL_TIMEOUT_MS=55000
DB_POOL_EXHAUSTED_RETRY_AFTER_SECONDS=5

DB_HEALTH_POOL_SIZE=1
DB_HEALTH_MAX_OVERFLOW=0
DB_HEALTH_POOL_TIMEOUT=2

CONDA_BIN=/opt/miniconda3/bin/conda
CONDA_ENV_NAME=mes-dashboard
RUNTIME_CONTRACT_VERSION=2026.02-p2
RUNTIME_CONTRACT_ENFORCE=true

WATCHDOG_RUNTIME_DIR=./tmp
WATCHDOG_RESTART_FLAG=./tmp/mes_dashboard_restart.flag
WATCHDOG_PID_FILE=./tmp/gunicorn.pid
WATCHDOG_STATE_FILE=./tmp/mes_dashboard_restart_state.json
WATCHDOG_RESTART_HISTORY_MAX=50

RESILIENCE_DEGRADED_ALERT_SECONDS=300
RESILIENCE_POOL_SATURATION_WARNING=0.90
RESILIENCE_POOL_SATURATION_CRITICAL=1.0
RESILIENCE_RESTART_CHURN_WINDOW_SECONDS=600
RESILIENCE_RESTART_CHURN_THRESHOLD=3
WORKER_RESTART_COOLDOWN=60
WORKER_RESTART_RETRY_BUDGET=3
WORKER_RESTART_WINDOW_SECONDS=600
WORKER_RESTART_CHURN_THRESHOLD=3
WORKER_GUARDED_MODE_ENABLED=true

PROCESS_CACHE_MAX_SIZE=32
WIP_PROCESS_CACHE_MAX_SIZE=32
RESOURCE_PROCESS_CACHE_MAX_SIZE=32
EQUIPMENT_PROCESS_CACHE_MAX_SIZE=32

FILTER_CACHE_WIP_VIEW=DWH.DW_MES_LOT_V
FILTER_CACHE_SPEC_WORKCENTER_VIEW=DWH.DW_MES_SPEC_WORKCENTER_V

HEALTH_MEMO_TTL_SECONDS=5

WIP_MATRIX_RATE_LIMIT_MAX_REQUESTS=120
WIP_MATRIX_RATE_LIMIT_WINDOW_SECONDS=60
WIP_DETAIL_RATE_LIMIT_MAX_REQUESTS=90
WIP_DETAIL_RATE_LIMIT_WINDOW_SECONDS=60
HOLD_LOTS_RATE_LIMIT_MAX_REQUESTS=90
HOLD_LOTS_RATE_LIMIT_WINDOW_SECONDS=60
RESOURCE_DETAIL_RATE_LIMIT_MAX_REQUESTS=60
RESOURCE_DETAIL_RATE_LIMIT_WINDOW_SECONDS=60
RESOURCE_STATUS_RATE_LIMIT_MAX_REQUESTS=90
RESOURCE_STATUS_RATE_LIMIT_WINDOW_SECONDS=60
```
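The in-process rate limit behind each `*_RATE_LIMIT_MAX_REQUESTS` / `*_RATE_LIMIT_WINDOW_SECONDS` pair can be sketched as a sliding-window counter; the class name and single-key scope are illustrative (the real limiter is presumably keyed per route, and rejected calls map to the consistent 429 structure):

```python
import time
from collections import deque

class SlidingWindowRateLimiter:
    """In-process sliding-window limiter: at most max_requests per window."""

    def __init__(self, max_requests, window_seconds):
        self.max_requests = max_requests
        self.window_seconds = window_seconds
        self._hits = deque()  # timestamps of accepted requests

    def allow(self, now=None):
        now = time.monotonic() if now is None else now
        cutoff = now - self.window_seconds
        while self._hits and self._hits[0] <= cutoff:
            self._hits.popleft()         # drop hits outside the window
        if len(self._hits) >= self.max_requests:
            return False                 # caller should respond HTTP 429
        self._hits.append(now)
        return True
```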
## 9. Validation Commands (recommended)

```bash
# Backend (conda)
conda run -n mes-dashboard python -m pytest -q tests/test_runtime_hardening.py
python -m pytest -q tests/test_resilience.py tests/test_health_routes.py tests/test_performance_integration.py

# Frontend
npm --prefix frontend test
npm --prefix frontend run build

# P1 benchmark gate
conda run -n mes-dashboard python scripts/run_cache_benchmarks.py --enforce

# P2 runtime contract check
RUNTIME_CONTRACT_ENFORCE=true ./scripts/start_server.sh check
```
> For full deployment, usage, and configuration details, see `README.md`.

## 10. Development History (Vite project)

- 2026-02-07: Completed the Vite root-directory refactor and the legacy cutover.
- 2026-02-08: Completed resilience diagnostics governance and shared frontend modules.
- 2026-02-08: Completed P0 security/stability hardening (this update).
- 2026-02-08: Completed the P1 cache/query efficiency refactor (index + benchmark gate).
- 2026-02-08: Completed P2 ops self-healing governance (guarded mode + manual override + runtime contract).
- 2026-02-08: Completed round-2 hardening (LDAP URL validation, bounded LRU cache, circuit-breaker logging outside the lock, security headers, pagination bounds).
- 2026-02-08: Completed round-3 residual hardening (staged publish, health memo, API rate limit, DB redaction, env-configurable filter views).
- 2026-02-08: Completed round-4 residual consolidation (resource index representation normalization, shared SQL fragments, type and constant governance, atomic page status writes).
@@ -18,6 +18,13 @@ Environment="WATCHDOG_RESTART_FLAG=/run/mes-dashboard/mes_dashboard_restart.flag"
Environment="WATCHDOG_PID_FILE=/run/mes-dashboard/gunicorn.pid"
Environment="WATCHDOG_STATE_FILE=/var/lib/mes-dashboard/restart_state.json"
Environment="WATCHDOG_CHECK_INTERVAL=5"
Environment="RUNTIME_CONTRACT_VERSION=2026.02-p2"
Environment="RUNTIME_CONTRACT_ENFORCE=true"
Environment="WORKER_RESTART_COOLDOWN=60"
Environment="WORKER_RESTART_RETRY_BUDGET=3"
Environment="WORKER_RESTART_WINDOW_SECONDS=600"
Environment="WORKER_RESTART_CHURN_THRESHOLD=3"
Environment="WORKER_GUARDED_MODE_ENABLED=true"

RuntimeDirectory=mes-dashboard
StateDirectory=mes-dashboard
26  deploy/mes-dashboard.env.example  Normal file
@@ -0,0 +1,26 @@
# MES Dashboard runtime contract (version 2026.02-p2)

# Conda runtime
CONDA_BIN=/opt/miniconda3/bin/conda
CONDA_ENV_NAME=mes-dashboard

# Single-port serving contract
GUNICORN_BIND=0.0.0.0:8080

# Watchdog/runtime paths
WATCHDOG_RUNTIME_DIR=/run/mes-dashboard
WATCHDOG_RESTART_FLAG=/run/mes-dashboard/mes_dashboard_restart.flag
WATCHDOG_PID_FILE=/run/mes-dashboard/gunicorn.pid
WATCHDOG_STATE_FILE=/var/lib/mes-dashboard/restart_state.json
WATCHDOG_CHECK_INTERVAL=5

# Runtime contract enforcement
RUNTIME_CONTRACT_VERSION=2026.02-p2
RUNTIME_CONTRACT_ENFORCE=true

# Worker recovery policy
WORKER_RESTART_COOLDOWN=60
WORKER_RESTART_RETRY_BUDGET=3
WORKER_RESTART_WINDOW_SECONDS=600
WORKER_RESTART_CHURN_THRESHOLD=3
WORKER_GUARDED_MODE_ENABLED=true
@@ -18,6 +18,13 @@ Environment="WATCHDOG_RUNTIME_DIR=/run/mes-dashboard"
Environment="WATCHDOG_RESTART_FLAG=/run/mes-dashboard/mes_dashboard_restart.flag"
Environment="WATCHDOG_PID_FILE=/run/mes-dashboard/gunicorn.pid"
Environment="WATCHDOG_STATE_FILE=/var/lib/mes-dashboard/restart_state.json"
Environment="RUNTIME_CONTRACT_VERSION=2026.02-p2"
Environment="RUNTIME_CONTRACT_ENFORCE=true"
Environment="WORKER_RESTART_COOLDOWN=60"
Environment="WORKER_RESTART_RETRY_BUDGET=3"
Environment="WORKER_RESTART_WINDOW_SECONDS=600"
Environment="WORKER_RESTART_CHURN_THRESHOLD=3"
Environment="WORKER_GUARDED_MODE_ENABLED=true"

RuntimeDirectory=mes-dashboard
StateDirectory=mes-dashboard
@@ -26,10 +26,12 @@ A release is cutover-ready only when all gates pass:
- pool exhaustion path returns `503` + `DB_POOL_EXHAUSTED` and `Retry-After`
- circuit-open path returns `503` + `CIRCUIT_BREAKER_OPEN` and fail-fast semantics
- frontend client does not aggressively retry on degraded pool-exhaustion responses
- health/admin payloads expose worker policy state (`allowed`/`cooldown`/`blocked`) and alert booleans

6. Conda-systemd contract gate
- `deploy/mes-dashboard.service` and `deploy/mes-dashboard-watchdog.service` both run under the same conda runtime contract
- `WATCHDOG_RESTART_FLAG`, `WATCHDOG_PID_FILE`, `WATCHDOG_STATE_FILE` paths are consistent across app/admin/watchdog
- startup contract validation passes: `RUNTIME_CONTRACT_ENFORCE=true ./scripts/start_server.sh check`
- single-port bind (`GUNICORN_BIND`) remains stable during the restart workflow

7. Regression gate
@@ -60,7 +62,8 @@ A release is cutover-ready only when all gates pass:
5. Conda + systemd rehearsal (recommended before production cutover)
- `sudo cp deploy/mes-dashboard.service /etc/systemd/system/`
- `sudo cp deploy/mes-dashboard-watchdog.service /etc/systemd/system/`
- `sudo mkdir -p /etc/mes-dashboard && sudo cp deploy/mes-dashboard.env.example /etc/mes-dashboard/mes-dashboard.env`
- merge deployment secrets from `.env` into `/etc/mes-dashboard/mes-dashboard.env`
- `sudo systemctl daemon-reload`
- `sudo systemctl enable --now mes-dashboard mes-dashboard-watchdog`
- call `/admin/api/worker/status` and verify runtime contract paths exist
@@ -69,6 +72,7 @@ A release is cutover-ready only when all gates pass:
- call `/health` and `/health/deep`
- confirm route cache mode, degraded flags, and pool/runtime diagnostics align with the environment (Redis on/off)
- trigger one controlled worker restart from the admin API and verify single-port continuity
- verify the guarded mode flow: a blocked restart requires a manual override payload (`manual_override`, `override_acknowledged`, `override_reason`)
- verify the README architecture section matches the deployed runtime contract
## Rollback Procedure

@@ -111,3 +115,6 @@ Use these initial thresholds for alerting/escalation:

4. Frontend/API retry pressure
- significant increase over baseline in client retries for `DB_POOL_EXHAUSTED` or `CIRCUIT_BREAKER_OPEN` responses

5. Recovery policy blocked
- `resilience.policy_state.blocked == true` or `resilience.alerts.restart_blocked == true`
@@ -1,5 +1,21 @@
const DEFAULT_TIMEOUT = 30000;

function getCsrfToken() {
  return document.querySelector('meta[name="csrf-token"]')?.content || '';
}

function withCsrfHeaders(headers = {}, method = 'GET') {
  const normalized = String(method).toUpperCase();
  const merged = { ...headers };
  if (['POST', 'PUT', 'PATCH', 'DELETE'].includes(normalized)) {
    const csrf = getCsrfToken();
    if (csrf && !merged['X-CSRF-Token']) {
      merged['X-CSRF-Token'] = csrf;
    }
  }
  return merged;
}
function buildApiError(response, payload) {
  const message =
    payload?.error?.message ||

@@ -47,15 +63,19 @@ export async function apiGet(url, options = {}) {

export async function apiPost(url, payload, options = {}) {
  if (window.MesApi?.post) {
    const enrichedOptions = {
      ...options,
      headers: withCsrfHeaders(options.headers || {}, 'POST')
    };
    return window.MesApi.post(url, payload, enrichedOptions);
  }
  return fetchJson(url, {
    ...options,
    method: 'POST',
    headers: withCsrfHeaders({
      'Content-Type': 'application/json',
      ...(options.headers || {})
    }, 'POST'),
    body: JSON.stringify(payload)
  });
}
@@ -64,6 +84,7 @@ export async function apiUpload(url, formData, options = {}) {
  return fetchJson(url, {
    ...options,
    method: 'POST',
    headers: withCsrfHeaders(options.headers || {}, 'POST'),
    body: formData
  });
}
75  frontend/src/core/wip-derive.js  Normal file
@@ -0,0 +1,75 @@
function toTrimmedString(value) {
  if (value === null || value === undefined) {
    return '';
  }
  return String(value).trim();
}

export function normalizeStatusFilter(statusFilter) {
  if (!statusFilter) {
    return {};
  }
  if (statusFilter === 'quality-hold') {
    return { status: 'HOLD', hold_type: 'quality' };
  }
  if (statusFilter === 'non-quality-hold') {
    return { status: 'HOLD', hold_type: 'non-quality' };
  }
  return { status: String(statusFilter).toUpperCase() };
}

export function buildWipOverviewQueryParams(filters = {}, statusFilter = null) {
  const params = {};
  const workorder = toTrimmedString(filters.workorder);
  const lotid = toTrimmedString(filters.lotid);
  const pkg = toTrimmedString(filters.package);
  const type = toTrimmedString(filters.type);

  if (workorder) params.workorder = workorder;
  if (lotid) params.lotid = lotid;
  if (pkg) params.package = pkg;
  if (type) params.type = type;

  return { ...params, ...normalizeStatusFilter(statusFilter) };
}

export function buildWipDetailQueryParams({
  page,
  pageSize,
  filters = {},
  statusFilter = null,
}) {
  return {
    page,
    page_size: pageSize,
    ...buildWipOverviewQueryParams(filters, statusFilter),
  };
}

export function splitHoldByType(data) {
  const items = Array.isArray(data?.items) ? data.items : [];
  const quality = items.filter((item) => item?.holdType === 'quality');
  const nonQuality = items.filter((item) => item?.holdType !== 'quality');
  return { quality, nonQuality };
}

export function prepareParetoData(items) {
  if (!Array.isArray(items) || items.length === 0) {
    return { reasons: [], qtys: [], lots: [], cumulative: [], totalQty: 0, items: [] };
  }

  const sorted = [...items].sort((a, b) => (Number(b?.qty) || 0) - (Number(a?.qty) || 0));
  const reasons = sorted.map((item) => toTrimmedString(item?.reason) || '未知');
  const qtys = sorted.map((item) => Number(item?.qty) || 0);
  const lots = sorted.map((item) => Number(item?.lots) || 0);
  const totalQty = qtys.reduce((sum, value) => sum + value, 0);

  let running = 0;
  const cumulative = qtys.map((qty) => {
    running += qty;
    if (totalQty <= 0) return 0;
    return Math.round((running / totalQty) * 100);
  });

  return { reasons, qtys, lots, cumulative, totalQty, items: sorted };
}
@@ -3,6 +3,7 @@ import {
  debounce,
  fetchWipAutocompleteItems,
} from '../core/autocomplete.js';
import { buildWipDetailQueryParams } from '../core/wip-derive.js';

ensureMesApiAvailable();
@@ -72,37 +73,13 @@ ensureMesApiAvailable();
    throw new Error(result.error || 'Failed to fetch packages');
  }

  async function fetchDetail(signal = null) {
    const params = buildWipDetailQueryParams({
      page: state.page,
      pageSize: state.pageSize,
      filters: state.filters,
      statusFilter: activeStatusFilter,
    });

    const result = await MesApi.get(`/api/wip/detail/${encodeURIComponent(state.workcenter)}`, {
      params,
@@ -3,6 +3,11 @@ import {
  debounce,
  fetchWipAutocompleteItems,
} from '../core/autocomplete.js';
import {
  buildWipOverviewQueryParams,
  splitHoldByType as splitHoldByTypeShared,
  prepareParetoData as prepareParetoDataShared,
} from '../core/wip-derive.js';

ensureMesApiAvailable();
@@ -61,21 +66,8 @@ ensureMesApiAvailable();
  }

  function buildQueryParams() {
    return buildWipOverviewQueryParams(state.filters);
  }
  // ============================================================
  // API Functions (using MesApi)
@@ -95,23 +87,11 @@
    throw new Error(result.error || 'Failed to fetch summary');
  }

  async function fetchMatrix(signal = null) {
    const params = buildWipOverviewQueryParams(state.filters, activeStatusFilter);
    const result = await MesApi.get('/api/wip/overview/matrix', {
      params,
      timeout: API_TIMEOUT,
      signal
    });
    if (result.success) {
@@ -465,40 +445,15 @@
    nonQuality: null
  };

  // Task 2.1: Split hold data by type
  function splitHoldByType(data) {
    return splitHoldByTypeShared(data);
  }

  // Task 2.2: Prepare Pareto data (sort by QTY desc, calculate cumulative %)
  function prepareParetoData(items) {
    return prepareParetoDataShared(items);
  }

  // Task 3.1: Initialize Pareto charts
  function initParetoCharts() {
80
frontend/tests/wip-derive.test.js
Normal file
80
frontend/tests/wip-derive.test.js
Normal file
@@ -0,0 +1,80 @@
import test from 'node:test';
import assert from 'node:assert/strict';

import {
  buildWipOverviewQueryParams,
  buildWipDetailQueryParams,
  splitHoldByType,
  prepareParetoData,
} from '../src/core/wip-derive.js';

test('buildWipOverviewQueryParams keeps only non-empty filters', () => {
  const params = buildWipOverviewQueryParams({
    workorder: ' WO-1 ',
    lotid: '',
    package: 'PKG-A',
    type: 'QFN',
  });

  assert.deepEqual(params, {
    workorder: 'WO-1',
    package: 'PKG-A',
    type: 'QFN',
  });
});

test('buildWipOverviewQueryParams maps quality hold status filter', () => {
  const params = buildWipOverviewQueryParams({}, 'quality-hold');
  assert.deepEqual(params, {
    status: 'HOLD',
    hold_type: 'quality',
  });
});

test('buildWipDetailQueryParams uses page/page_size and shared filter mapper', () => {
  const params = buildWipDetailQueryParams({
    page: 2,
    pageSize: 100,
    filters: {
      workorder: 'WO',
      lotid: 'LOT',
      package: '',
      type: 'TSOP',
    },
    statusFilter: 'run',
  });

  assert.deepEqual(params, {
    page: 2,
    page_size: 100,
    workorder: 'WO',
    lotid: 'LOT',
    type: 'TSOP',
    status: 'RUN',
  });
});

test('splitHoldByType partitions quality/non-quality correctly', () => {
  const grouped = splitHoldByType({
    items: [
      { reason: 'Q1', holdType: 'quality' },
      { reason: 'NQ1', holdType: 'non-quality' },
      { reason: 'NQ2' },
    ],
  });

  assert.equal(grouped.quality.length, 1);
  assert.equal(grouped.nonQuality.length, 2);
});

test('prepareParetoData sorts by qty and builds cumulative percentages', () => {
  const data = prepareParetoData([
    { reason: 'B', qty: 20, lots: 1 },
    { reason: 'A', qty: 80, lots: 2 },
  ]);

  assert.deepEqual(data.reasons, ['A', 'B']);
  assert.deepEqual(data.qtys, [80, 20]);
  assert.deepEqual(data.cumulative, [80, 100]);
  assert.equal(data.totalQty, 100);
});
@@ -1,12 +1,12 @@
import { defineConfig } from 'vite';
import { resolve } from 'node:path';

export default defineConfig({
export default defineConfig(({ mode }) => ({
  publicDir: false,
  build: {
    outDir: '../src/mes_dashboard/static/dist',
    emptyOutDir: false,
    sourcemap: false,
    sourcemap: mode !== 'production',
    rollupOptions: {
      input: {
        portal: resolve(__dirname, 'src/portal/main.js'),
@@ -22,8 +22,17 @@ export default defineConfig({
      output: {
        entryFileNames: '[name].js',
        chunkFileNames: 'chunks/[name]-[hash].js',
        assetFileNames: '[name][extname]'
        assetFileNames: '[name][extname]',
        manualChunks(id) {
          if (!id.includes('node_modules')) {
            return;
          }
          if (id.includes('echarts')) {
            return 'vendor-echarts';
          }
          return 'vendor';
        }
      }
    }
  }
});
}));

@@ -30,6 +30,18 @@ def worker_exit(server, worker):
    except Exception as e:
        server.log.warning(f"Error stopping equipment sync worker: {e}")

    try:
        from mes_dashboard.core.cache_updater import stop_cache_updater
        stop_cache_updater()
    except Exception as e:
        server.log.warning(f"Error stopping cache updater: {e}")

    try:
        from mes_dashboard.core.redis_client import close_redis
        close_redis()
    except Exception as e:
        server.log.warning(f"Error closing redis client: {e}")

    # Then dispose database connections
    try:
        from mes_dashboard.core.database import dispose_engine

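The `worker_exit` hook above stops each background service behind its own try/except so one failing stop cannot block the rest. That pattern generalizes to a small shutdown registry; the sketch below is illustrative only (the class and hook names are hypothetical, not this project's actual API):

```python
# Minimal shutdown-registry sketch: hooks run in registration order,
# each guarded so a failing stop cannot block later ones.
class ShutdownRegistry:
    def __init__(self):
        self._hooks = []

    def register(self, name, stop_fn):
        # stop_fn should be idempotent: safe to call even if already stopped
        self._hooks.append((name, stop_fn))

    def shutdown(self, log=print):
        failed = []
        for name, stop_fn in self._hooks:
            try:
                stop_fn()
            except Exception as e:
                failed.append(name)
                log(f"Error stopping {name}: {e}")
        return failed

registry = ShutdownRegistry()
stopped = []
registry.register("cache_updater", lambda: stopped.append("cache_updater"))
registry.register("redis", lambda: (_ for _ in ()).throw(RuntimeError("boom")))
registry.register("db_engine", lambda: stopped.append("db_engine"))
failed = registry.shutdown(log=lambda msg: None)
```

Even though the simulated Redis stop raises, the DB engine hook still runs, which is the property the hunk above relies on.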
@@ -0,0 +1,2 @@
schema: spec-driven
created: 2026-02-08
@@ -0,0 +1,46 @@
## Context

The current architecture already supports single-port Gunicorn runtime, circuit-breaker-aware degraded responses, and watchdog-assisted recovery. However, critical security and lifecycle controls are uneven: production startup can still fall back to a weak secret key, CSRF is not enforced globally, and background resources are not fully registered in a single shutdown lifecycle. These gaps are operationally risky when pool pressure or restart churn occurs.

## Goals / Non-Goals

**Goals:**
- Make production startup fail fast when required security secrets are missing.
- Enforce CSRF validation for all state-changing endpoints without breaking existing frontend flows.
- Make worker/app shutdown deterministic by stopping all background workers and shared clients.
- Keep degraded responses for pool exhaustion and circuit-open states stable and retry-aware.
- Isolate health probe connectivity from main request pool contention.

**Non-Goals:**
- Replacing the LDAP provider or redesigning the full authentication architecture.
- Full CSP rollout across all templates in this change.
- Changing URL structure, page IA, or the single-port deployment topology.

## Decisions

1. **Production secret-key guard at startup**
   - Decision: enforce `SECRET_KEY` presence/strength in non-development modes and abort startup when invalid.
   - Rationale: prevents silent insecure deployment.

2. **Unified CSRF contract across form + JSON flows**
   - Decision: issue the CSRF token from the server session; validate a hidden form field for HTML forms and the `X-CSRF-Token` header for JSON POST/PUT/PATCH/DELETE requests.
   - Rationale: maintains current frontend behavior while covering non-form APIs.

3. **Centralized shutdown registry**
   - Decision: register explicit shutdown hooks that call cache updater stop, realtime sync stop, Redis close, and DB dispose in bounded order.
   - Rationale: avoids thread/client leaks during worker recycle and controlled reload.

4. **Health probe pool isolation**
   - Decision: use a dedicated lightweight DB health engine/pool for `/health` checks.
   - Rationale: prevents the health endpoint from being blocked by request-pool exhaustion, improving observability fidelity.

5. **Template-safe JS serialization**
   - Decision: replace HTML-escaped interpolation in JS string contexts with `tojson` serialization.
   - Rationale: avoids context-mismatch injection edge cases.

## Risks / Trade-offs

- **[Risk] CSRF rollout may break undocumented clients** → **Mitigation:** provide an opt-in transition flag and explicit error messaging during rollout.
- **[Risk] Strict startup secret validation can block misconfigured environments** → **Mitigation:** provide clear startup diagnostics and `.env.example` updates.
- **[Risk] Additional shutdown hooks can prolong worker exit** → **Mitigation:** bounded timeouts and idempotent stop handlers.
- **[Risk] Dedicated health pool introduces extra DB connections** → **Mitigation:** fixed minimal size and short timeout.
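Decision 2's dual CSRF contract (hidden form field for HTML forms, `X-CSRF-Token` header for JSON mutations) can be sketched framework-agnostically; function names, the `_csrf_token` session key, and field names below are illustrative assumptions, not the project's actual API:

```python
# CSRF contract sketch: the session holds a random token; HTML forms echo
# it in a hidden field, JSON mutations echo it in an X-CSRF-Token header.
import hmac
import secrets

SAFE_METHODS = {"GET", "HEAD", "OPTIONS"}

def issue_csrf_token(session):
    # create once per session; reuse thereafter so open tabs stay valid
    session.setdefault("_csrf_token", secrets.token_urlsafe(32))
    return session["_csrf_token"]

def csrf_ok(session, method, form, headers):
    if method.upper() in SAFE_METHODS:
        return True  # reads are exempt; only mutations are validated
    expected = session.get("_csrf_token")
    supplied = form.get("csrf_token") or headers.get("X-CSRF-Token")
    # constant-time compare avoids token guessing via timing side channels
    return bool(expected and supplied and hmac.compare_digest(expected, supplied))

session = {}
token = issue_csrf_token(session)
```

A request missing the token on any POST/PUT/PATCH/DELETE would be rejected before the mutation executes, matching the spec delta later in this commit.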
@@ -0,0 +1,40 @@
## Why

The Vite migration is functionally complete, but production runtime still has high-risk gaps in security baseline and worker lifecycle cleanup. Addressing these now prevents avoidable outages, authentication bypass risk, and unstable degradation behavior under pool pressure.

## What Changes

- Enforce production-safe startup security defaults (no weak SECRET_KEY fallback in non-development environments).
- Add first-class CSRF protection for admin forms and state-changing JSON APIs.
- Harden degradation behavior for pool exhaustion with a consistent retry/backoff contract and isolated health probing.
- Ensure background workers and shared clients (cache updater, realtime sync, Redis) are explicitly stopped on worker/app shutdown.
- Fix template-to-JavaScript variable serialization in the hold-detail fallback script.

## Capabilities

### New Capabilities
- `security-baseline-hardening`: Define mandatory secret/session/CSRF/XSS-safe baseline for production runtime.

### Modified Capabilities
- `runtime-resilience-recovery`: Strengthen shutdown lifecycle and degraded-response behavior for pool pressure scenarios.

## Impact

- Affected code:
  - `src/mes_dashboard/app.py`
  - `src/mes_dashboard/core/database.py`
  - `src/mes_dashboard/core/cache_updater.py`
  - `src/mes_dashboard/core/redis_client.py`
  - `src/mes_dashboard/routes/health_routes.py`
  - `src/mes_dashboard/routes/auth_routes.py`
  - `src/mes_dashboard/templates/hold_detail.html`
  - `gunicorn.conf.py`
  - `tests/`
- APIs:
  - `/health`
  - `/health/deep`
  - `/admin/login`
  - state-changing `/api/*` endpoints
- Operational behavior:
  - Keep single-port deployment model unchanged.
  - Improve degraded-state stability and startup safety gates.
@@ -0,0 +1,24 @@
## MODIFIED Requirements

### Requirement: Pool Exhaustion MUST Return Retry-Aware Degraded Responses
The system MUST return explicit degraded responses for connection pool exhaustion, including stable machine-readable retry metadata and HTTP retry hints.

#### Scenario: Pool exhausted under load
- **WHEN** concurrent requests exceed available database connections and pool wait timeout is reached
- **THEN** the API MUST return `DB_POOL_EXHAUSTED` with `retry_after_seconds` metadata and a `Retry-After` header instead of a generic 500 failure

## ADDED Requirements

### Requirement: Runtime Shutdown SHALL Cleanly Stop Background Services
Worker/app shutdown MUST stop long-lived background services and shared clients in deterministic order.

#### Scenario: Worker exits during recycle or graceful reload
- **WHEN** Gunicorn worker shutdown hooks are triggered
- **THEN** cache updater, realtime equipment sync worker, Redis client, and DB engine resources MUST be stopped/disposed without orphan threads

### Requirement: Health Probing SHALL Remain Available During Request-Pool Saturation
Health checks MUST avoid depending solely on the same request pool used by business APIs.

#### Scenario: Request pool saturation
- **WHEN** the main database request pool is exhausted
- **THEN** `/health` and `/health/deep` MUST still provide timely degraded status using isolated probe connectivity
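The pool-exhaustion requirement names the `DB_POOL_EXHAUSTED` code, the `retry_after_seconds` field, and the `Retry-After` header; a minimal sketch of a response builder satisfying that contract (the exact body shape beyond the named fields is assumed, not taken from the project):

```python
# Retry-aware degraded-response sketch for pool exhaustion.
def pool_exhausted_response(retry_after_seconds=5):
    """Return (body, status, headers) for a pool-exhausted degraded response."""
    body = {
        "success": False,
        "error_code": "DB_POOL_EXHAUSTED",          # stable machine-readable code
        "retry_after_seconds": retry_after_seconds,  # metadata mirror of the header
    }
    # 503 + Retry-After tells clients the failure is transient and when to retry
    headers = {"Retry-After": str(retry_after_seconds)}
    return body, 503, headers

body, status, headers = pool_exhausted_response()
```

Returning 503 with a retry hint instead of a generic 500 lets clients distinguish transient saturation from hard failures.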
@@ -0,0 +1,29 @@
## ADDED Requirements

### Requirement: Production Startup SHALL Reject Weak Session Secrets
The system MUST reject startup in non-development environments when `SECRET_KEY` is missing or configured with known insecure default values.

#### Scenario: Missing production secret key
- **WHEN** runtime starts with `FLASK_ENV` not equal to `development` and no secure secret key is configured
- **THEN** application startup MUST fail fast with an explicit configuration error

### Requirement: State-Changing Endpoints SHALL Enforce CSRF Validation
All state-changing endpoints that rely on cookie-based authentication MUST enforce CSRF token validation.

#### Scenario: Missing or invalid CSRF token
- **WHEN** a POST/PUT/PATCH/DELETE request is sent without a valid CSRF token
- **THEN** the server MUST reject the request with a client error and MUST NOT execute the mutation

### Requirement: Server-Rendered Values in JavaScript Context MUST Use Safe Serialization
Values inserted into inline JavaScript from templates MUST be serialized for JavaScript context safety.

#### Scenario: Hold reason rendered in fallback inline script
- **WHEN** server-side string values are embedded into script state payloads
- **THEN** template rendering MUST use JSON-safe serialization semantics to prevent script-context injection

### Requirement: Session Establishment SHALL Mitigate Fixation Risk
Successful admin login MUST rotate session identity material before granting authenticated privileges.

#### Scenario: Admin login success
- **WHEN** credentials are validated and admin session is created
- **THEN** session identity MUST be regenerated before storing authenticated user attributes
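The weak-secret requirement above can be expressed as a small startup guard. The insecure-default list and minimum length below are assumptions for illustration; the project's actual validation rules may differ:

```python
# Startup secret guard sketch: fail fast outside development when
# SECRET_KEY is absent, a known insecure default, or too short.
INSECURE_DEFAULTS = {"", "dev", "changeme", "secret", "dev-secret-key"}  # assumed list

def validate_secret_key(env, secret_key, min_length=32):
    """Raise RuntimeError when the configured secret is unsafe for production."""
    if env == "development":
        return True  # dev convenience: weak/missing secret is tolerated
    if not secret_key or secret_key in INSECURE_DEFAULTS or len(secret_key) < min_length:
        raise RuntimeError(
            f"SECRET_KEY missing or too weak; set a random value of >= {min_length} chars"
        )
    return True

dev_ok = validate_secret_key("development", None)       # allowed in dev
prod_ok = validate_secret_key("production", "x" * 48)   # long enough, not a default
try:
    validate_secret_key("production", "changeme")       # insecure default
    rejected = False
except RuntimeError:
    rejected = True
```

Calling such a guard before the app object is built guarantees the fail-fast behavior the scenario demands.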
@@ -0,0 +1,18 @@
## 1. Runtime Stability Hardening

- [x] 1.1 Add startup validation for `SECRET_KEY` and environment-aware secure defaults.
- [x] 1.2 Register centralized shutdown hooks to stop cache updater, realtime sync worker, Redis client, and DB engine.
- [x] 1.3 Isolate database health probing from request pool and keep degraded signal contract stable.
- [x] 1.4 Normalize pool-exhausted response metadata and retry headers across API error paths.

## 2. Security Baseline Enforcement

- [x] 2.1 Add CSRF token issuance/validation for form posts and JSON mutation endpoints.
- [x] 2.2 Update login flow to rotate session identity on successful authentication.
- [x] 2.3 Replace JS-context template interpolation in `hold_detail.html` with JSON-safe serialization.

## 3. Verification and Documentation

- [x] 3.1 Add tests for startup secret guard, CSRF rejection, and session-rotation behavior.
- [x] 3.2 Add lifecycle tests/validation for shutdown cleanup and health endpoint behavior under pool saturation.
- [x] 3.3 Update README/README.mdj runtime hardening sections and operator rollout notes.
@@ -0,0 +1,2 @@
schema: spec-driven
created: 2026-02-08
@@ -0,0 +1,46 @@
## Context

The migration delivered feature parity, but efficiency work remains: backend query paths still do broad copies and whole-frame recomputation even when only slices are needed. At the same time, business constraints explicitly require full-table caching for `resource` and `wip` because those datasets are intentionally small and frequently reused. This design optimizes around that constraint rather than removing it.

## Goals / Non-Goals

**Goals:**
- Keep `resource` and `wip` full-table caches intact.
- Reduce memory amplification from redundant cache representations.
- Replace repeated full merge/rebuild paths with incremental/indexed query plans where applicable.
- Increase reuse of browser-side compute modules for chart/table/filter/KPI derivations.
- Add measurable telemetry to verify latency and memory improvements.

**Non-Goals:**
- Rewriting all reporting endpoints to client-only mode.
- Removing Redis or the existing layered cache strategy.
- Changing user-visible filter semantics or report outputs.

## Decisions

1. **Constrained cache strategy**
   - Decision: retain full-table snapshots for `resource` and `wip`; optimize surrounding representations and derivation paths.
   - Rationale: business-approved data-size profile and low complexity for frequent lookups.

2. **Incremental + indexed path for heavy derived datasets**
   - Decision: add watermark/version-aware incremental refresh and per-column indexes for high-cardinality filters.
   - Rationale: avoids repeated full recompute and lowers request tail latency.

3. **Canonical in-process structure**
   - Decision: keep one canonical structure per cache domain and derive alternate views on demand.
   - Rationale: reduces 2x/3x memory amplification from parallel representations.

4. **Frontend compute module expansion**
   - Decision: extract reusable browser compute helpers for matrix/table/KPI transformations used across report pages.
   - Rationale: shifts deterministic shaping work off the backend and improves component reuse in the Vite architecture.

5. **Benchmark-driven acceptance**
   - Decision: add repeatable benchmark fixtures and telemetry thresholds as merge gates.
   - Rationale: prevent subjective "performance improved" claims without measurable proof.

## Risks / Trade-offs

- **[Risk] Incremental sync correctness drift** → **Mitigation:** version checksum validation and periodic full reconciliation jobs.
- **[Risk] Browser compute can increase client CPU on low-end devices** → **Mitigation:** bounded dataset chunking and a fallback server aggregation path.
- **[Risk] Refactor introduces subtle field-contract regressions** → **Mitigation:** keep export/header contract tests and fixture comparisons.
- **[Risk] Telemetry overhead** → **Mitigation:** low-cost counters/histograms with sampling where needed.
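Decision 2's watermark/version-aware incremental refresh can be sketched as an upsert keyed by primary id, where only rows newer than the last-seen watermark are fetched. All names and the simulated source below are hypothetical:

```python
# Watermark-based incremental sync sketch: fetch only rows whose version
# is newer than the last-seen watermark, then upsert them into the cache.
ALL_ROWS = [  # simulated source table; "version" acts as the watermark column
    {"id": 1, "version": 2, "qty": 10},
    {"id": 2, "version": 1, "qty": 5},
    {"id": 3, "version": 2, "qty": 7},
]

def fetch_changed_since(watermark):
    changed = [r for r in ALL_ROWS if r["version"] > watermark]
    new_watermark = max((r["version"] for r in ALL_ROWS), default=watermark)
    return changed, new_watermark

def incremental_sync(cache, watermark, fetch):
    rows, new_watermark = fetch(watermark)
    for row in rows:
        cache[row["id"]] = row  # upsert only the changed partitions
    return new_watermark

# Cache last synced at watermark 1: row 1 is stale, row 3 is missing.
cache = {1: {"id": 1, "version": 1, "qty": 9}, 2: {"id": 2, "version": 1, "qty": 5}}
new_wm = incremental_sync(cache, 1, fetch_changed_since)
```

Periodic full reconciliation (the drift mitigation above) would re-fetch everything and compare checksums, catching any rows this fast path missed.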
@@ -0,0 +1,36 @@
## Why

Current reporting workloads still spend unnecessary CPU and memory on repeated full-data merges, broad DataFrame copies, and duplicated cache representations. We need a focused efficiency phase that preserves the intentional full-table cache strategy for `resource` and `wip`, while reducing cost for other query paths and increasing frontend compute reuse.

## What Changes

- Introduce indexed/incremental cache synchronization for heavy report datasets that do not require full-table snapshots.
- Keep `resource` and `wip` as full-table caches by design, but reduce redundant in-process representations and copy overhead.
- Move additional derived calculations (chart/table/KPI/filter shaping) to reusable browser modules in the Vite frontend.
- Add cache/query efficiency telemetry and repeatable benchmark gates to validate gains.

## Capabilities

### New Capabilities
- `cache-indexed-query-acceleration`: Define incremental refresh and indexed query contracts for non-full-snapshot datasets.

### Modified Capabilities
- `cache-observability-hardening`: Add memory-efficiency and cache-structure telemetry expectations.
- `frontend-compute-shift`: Expand browser-side reusable compute coverage for report interactions.

## Impact

- Affected code:
  - `src/mes_dashboard/core/cache.py`
  - `src/mes_dashboard/services/resource_cache.py`
  - `src/mes_dashboard/services/realtime_equipment_cache.py`
  - `src/mes_dashboard/services/wip_service.py`
  - `src/mes_dashboard/routes/health_routes.py`
  - `frontend/src/core/`
  - `frontend/src/**/main.js`
  - `tests/`
- APIs:
  - read-heavy `/api/wip/*` and `/api/resource/*` endpoints (response contract unchanged)
- Operational behavior:
  - Preserve the current `resource` and `wip` full-table caching strategy.
  - Reduce server-side compute load through selective frontend compute offload.
@@ -0,0 +1,22 @@
## ADDED Requirements

### Requirement: Incremental Synchronization SHALL Use Versioned Watermarks
For heavy non-full-snapshot datasets, cache refresh SHALL support incremental synchronization keyed by stable version or watermark boundaries.

#### Scenario: Incremental refresh cycle
- **WHEN** source data version indicates partial changes since last sync
- **THEN** cache update logic MUST fetch and merge only changed partitions while preserving correctness guarantees

### Requirement: Query Paths SHALL Use Indexed Access for High-Frequency Filters
Query execution over cached data SHALL use prebuilt indexes for known high-frequency filter columns.

#### Scenario: Filtered report query
- **WHEN** request filters target indexed fields
- **THEN** result selection MUST avoid full dataset scans and maintain the existing response contract

### Requirement: Business-Mandated Full-Table Caches SHALL Be Preserved for Resource and WIP
The system SHALL continue to maintain full-table cache behavior for the `resource` and `wip` domains.

#### Scenario: Resource or WIP cache refresh
- **WHEN** cache update runs for `resource` or `wip`
- **THEN** the updater MUST retain full-table snapshot semantics and MUST NOT switch these domains to partial-only cache mode
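The indexed-access requirement amounts to prebuilt inverted indexes per high-frequency filter column, with candidate row ids intersected across filters instead of scanning every row. A minimal sketch (column names are illustrative):

```python
# Indexed selection sketch: one inverted index per filter column maps
# each value to the set of row positions holding that value.
from collections import defaultdict

def build_index(rows, column):
    index = defaultdict(set)
    for i, row in enumerate(rows):
        index[row[column]].add(i)
    return index

def select(rows, indexes, filters):
    # Intersect candidate row ids across all indexed filters; no full scan.
    candidate = None
    for column, value in filters.items():
        ids = indexes[column].get(value, set())
        candidate = ids if candidate is None else candidate & ids
    return [rows[i] for i in sorted(candidate or set())]

rows = [
    {"package": "PKG-A", "status": "RUN"},
    {"package": "PKG-A", "status": "HOLD"},
    {"package": "PKG-B", "status": "RUN"},
]
indexes = {c: build_index(rows, c) for c in ("package", "status")}
hits = select(rows, indexes, {"package": "PKG-A", "status": "RUN"})
```

Because only the index dictionaries are consulted, the cost scales with match-set size rather than table size, while the returned rows keep the existing response shape.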
@@ -0,0 +1,15 @@
## ADDED Requirements

### Requirement: Cache Telemetry SHALL Include Memory Amplification Signals
Operational telemetry MUST expose cache domain memory usage indicators and representation amplification factors.

#### Scenario: Deep health telemetry request
- **WHEN** operators inspect cache telemetry
- **THEN** telemetry MUST include per-domain memory footprint and amplification indicators sufficient to detect redundant structures

### Requirement: Efficiency Benchmarks SHALL Gate Cache Refactor Rollout
Cache/query efficiency changes MUST be validated against baseline latency and memory benchmarks before rollout.

#### Scenario: Pre-release validation
- **WHEN** cache refactor changes are prepared for deployment
- **THEN** benchmark results MUST demonstrate no regression beyond configured thresholds for P95 latency and memory usage
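One plausible definition of the amplification factor named above is the ratio of all bytes held across a domain's representations to the canonical snapshot's bytes — a factor near 1.0 means no redundant structures. This formula is an assumption, not the project's documented metric:

```python
# Amplification-factor sketch: total bytes across canonical + derived
# representations of one cache domain, divided by the canonical bytes.
def amplification_factor(canonical_bytes, derived_bytes_list):
    if canonical_bytes <= 0:
        return 0.0  # guard: no canonical snapshot means no meaningful ratio
    total = canonical_bytes + sum(derived_bytes_list)
    return round(total / canonical_bytes, 2)

# A 100 MB canonical frame plus an 80 MB index view and a 20 MB dict view
# amplifies memory 2x — a signal that one representation may be redundant.
factor = amplification_factor(100_000_000, [80_000_000, 20_000_000])
```

Surfacing this per domain in `/health/deep` would let operators spot when a refactor quietly reintroduces a parallel copy.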
@@ -0,0 +1,15 @@
## ADDED Requirements

### Requirement: Reusable Browser Compute Modules SHALL Power Report Derivations
Derived computations for report filters, KPI cards, chart series, and table projections SHALL be implemented through reusable frontend modules.

#### Scenario: Shared report derivation logic
- **WHEN** multiple report pages require equivalent data-shaping behavior
- **THEN** pages MUST consume shared compute modules instead of duplicating transformation logic per page

### Requirement: Browser Compute Shift SHALL Preserve Export and Field Contracts
Moving computations to the frontend MUST preserve existing field naming and export column contracts.

#### Scenario: User exports report after frontend-side derivation
- **WHEN** transformed data is rendered and exported
- **THEN** exported field names and ordering MUST remain consistent with governed field contract definitions
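The export-contract scenario can be enforced with a trivial comparison of exported headers against the governed contract, checking both names and order. The column list below is a made-up example, not the project's real contract:

```python
# Export-contract check sketch: exported headers must match the governed
# contract exactly — same names, same order.
GOVERNED_EXPORT_COLUMNS = ["workorder", "lotid", "package", "type", "qty"]  # assumed

def validate_export_contract(exported_columns, contract=GOVERNED_EXPORT_COLUMNS):
    """Raise when exported headers drift from the contract in name or order."""
    if list(exported_columns) != list(contract):
        raise ValueError(
            f"export contract drift: {list(exported_columns)} != {list(contract)}"
        )
    return True

ok = validate_export_contract(["workorder", "lotid", "package", "type", "qty"])
try:
    # Same names but reordered: still a contract violation.
    validate_export_contract(["workorder", "package", "lotid", "type", "qty"])
    drift_caught = False
except ValueError:
    drift_caught = True
```

Running such a check in the fixture-comparison tests mentioned in the design doc catches regressions introduced by the compute shift before release.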
@@ -0,0 +1,23 @@
## 1. Cache Structure and Sync Refactor

- [x] 1.1 Define canonical per-domain cache representation and remove redundant parallel structures.
- [x] 1.2 Implement version/watermark-based incremental sync path for eligible non-full-snapshot datasets.
- [x] 1.3 Keep `resource` and `wip` full-table cache behavior while optimizing surrounding parse/index pipelines.

## 2. Indexed Query Acceleration

- [x] 2.1 Add index builders for high-frequency filter columns used by report endpoints.
- [x] 2.2 Refactor read paths to use indexed selection and reduce broad DataFrame copy operations.
- [x] 2.3 Add fallback and reconciliation logic to guarantee correctness under incremental/index drift.

## 3. Frontend Compute Reuse Expansion

- [x] 3.1 Extract shared Vite compute modules for KPI/filter/chart/table derivations.
- [x] 3.2 Refactor report pages to consume shared modules without changing user-visible behavior.
- [x] 3.3 Validate export/header field contract consistency after compute shift.

## 4. Performance Validation and Docs

- [x] 4.1 Add benchmark fixtures for baseline vs refactor latency/memory comparison.
- [x] 4.2 Surface cache memory amplification and index efficiency telemetry in health/admin outputs.
- [x] 4.3 Update README/README.mdj with cache strategy constraints and performance governance rules.
@@ -0,0 +1,2 @@
schema: spec-driven
created: 2026-02-08
@@ -0,0 +1,45 @@
## Context

The project already provides watchdog-assisted restart and resilience diagnostics, but policy boundaries for automated recovery are not yet formalized end-to-end. In practice, this can lead to either under-recovery (manual delays) or over-recovery (restart churn). We also need stronger conda/systemd path consistency checks to prevent runtime drift across deploy scripts and services.

## Goals / Non-Goals

**Goals:**
- Make conda/systemd runtime path contracts explicit, validated, and drift-detectable.
- Implement a safe self-healing policy with cooldown and churn limits.
- Expose clear alert signals and recommended actions in health/admin payloads.
- Keep operator manual override available for incident control.

**Non-Goals:**
- Migrating from systemd to another orchestrator.
- Changing database vendor or introducing full autoscaling infrastructure.
- Removing existing admin restart endpoints.

## Decisions

1. **Single-source runtime contract**
   - Decision: centralize conda runtime path configuration consumed by systemd units, watchdog, and scripts.
   - Rationale: prevents mismatched interpreter/path drift.

2. **Guarded self-healing state machine**
   - Decision: implement a bounded restart policy (cooldown + max retries per time window + circuit-open gating).
   - Rationale: recovers quickly while preventing restart storms.

3. **Explicit recovery observability contract**
   - Decision: enrich health/admin payloads with churn counters, cooldown state, and recommended operator action.
   - Rationale: enables deterministic triage and alert automation.

4. **Auditability requirement**
   - Decision: emit structured logs/events for auto-restart decisions, manual overrides, and blocked restart attempts.
   - Rationale: supports incident retrospectives and policy tuning.

5. **Runbook-first rollout**
   - Decision: deploy policy changes behind documentation and validation gates, including rollback steps.
   - Rationale: operational safety for production adoption.

## Risks / Trade-offs

- **[Risk] Overly strict policy delays recovery** → **Mitigation:** configurable thresholds and emergency manual override.
- **[Risk] Aggressive policy causes churn loops** → **Mitigation:** hard stop on churn threshold breach and explicit cool-off windows.
- **[Risk] Added operational complexity** → **Mitigation:** concise runbook with decision tables and tested scripts.
- **[Risk] Drift detection false positives** → **Mitigation:** normalize path resolution and clearly define comparison sources.
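Decision 2's guarded state machine — cooldown between restarts plus a bounded attempt budget in a sliding window — can be sketched as follows (thresholds and names are illustrative, not the project's configured values):

```python
# Guarded restart-policy sketch: "allow" while within budget, "cooldown"
# shortly after a restart, "blocked" once churn exceeds the window budget.
from collections import deque

class RestartPolicy:
    def __init__(self, cooldown_s=60, max_attempts=3, window_s=600):
        self.cooldown_s = cooldown_s
        self.max_attempts = max_attempts
        self.window_s = window_s
        self.attempts = deque()   # restart timestamps inside the sliding window
        self.last_restart = None

    def decide(self, now):
        # drop attempts that fell out of the sliding window
        while self.attempts and now - self.attempts[0] > self.window_s:
            self.attempts.popleft()
        if len(self.attempts) >= self.max_attempts:
            return "blocked"      # churn guard: manual override required
        if self.last_restart is not None and now - self.last_restart < self.cooldown_s:
            return "cooldown"
        return "allow"

    def record_restart(self, now):
        self.attempts.append(now)
        self.last_restart = now

policy = RestartPolicy()
decisions = []
for t in (0, 30, 120, 240, 300):  # degradation events at these seconds
    d = policy.decide(t)
    decisions.append(d)
    if d == "allow":
        policy.record_restart(t)
```

The "blocked" state maps to the guarded mode described in the spec deltas below: auto-recovery pauses and an operator must acknowledge a manual override.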
@@ -0,0 +1,40 @@
## Why

Operations stability still depends heavily on manual intervention when workers degrade or pools saturate. We need a formal operations phase to align conda/systemd runtime contracts and add controlled self-healing with guardrails, so recovery is faster without triggering restart storms.

## What Changes

- Standardize conda-based runtime paths across app service, watchdog, and operational scripts from a single source of truth.
- Introduce a guarded worker self-healing policy (cooldown, churn windows, bounded retries, manual override).
- Add alert thresholds and machine-readable recovery signals for pool pressure, circuit-open persistence, and restart churn.
- Harden runbook documentation and scripts for deterministic restart, rollback, and incident triage.

## Capabilities

### New Capabilities
- `worker-self-healing-governance`: Define safe autonomous recovery behavior with anti-storm guardrails.

### Modified Capabilities
- `conda-systemd-runtime-alignment`: Extend runtime consistency requirements with startup validation and drift detection.
- `runtime-resilience-recovery`: Add auditable recovery-action requirements for automated and operator-triggered restart flows.

## Impact

- Affected code:
  - `deploy/systemd/*.service`
  - `scripts/worker_watchdog.py`
  - `src/mes_dashboard/routes/admin_routes.py`
  - `src/mes_dashboard/routes/health_routes.py`
  - `src/mes_dashboard/core/database.py`
  - `src/mes_dashboard/core/circuit_breaker.py`
  - `tests/`
  - `README.md`, `README.mdj`, runbook docs
- APIs:
  - `/health`
  - `/health/deep`
  - `/admin/api/system-status`
  - `/admin/api/worker/status`
  - `/admin/api/worker/restart`
- Operational behavior:
  - Preserve the single-port bind model.
  - Add a controlled self-healing policy and clearer alert thresholds.
@@ -0,0 +1,15 @@
## ADDED Requirements

### Requirement: Runtime Path Drift SHALL Be Detectable Before Service Start
Service startup checks MUST validate configured conda runtime paths across app, watchdog, and worker control scripts.

#### Scenario: Conda path mismatch detected
- **WHEN** startup validation finds runtime path inconsistency between configured units and scripts
- **THEN** service start MUST fail with actionable diagnostics instead of running with a partial mismatch

### Requirement: Conda/Systemd Contract SHALL Be Versioned in Operations Docs
The documented runtime contract MUST include versioned path assumptions and verification commands.

#### Scenario: Operator verifies deployment contract
- **WHEN** the operator follows runbook validation steps
- **THEN** commands MUST confirm active runtime paths match the documented conda/systemd contract
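The path-drift check above reduces to comparing the interpreter path each consumer (systemd unit, watchdog, scripts) resolves to, after normalization so cosmetic differences don't trigger false positives — the mitigation the design doc calls out. A sketch with hypothetical paths:

```python
# Runtime-contract drift check sketch: every consumer must resolve to the
# same conda interpreter path; normalization avoids false positives from
# trailing slashes or redundant separators.
import os

def detect_path_drift(declared_paths):
    normalized = {name: os.path.normpath(p) for name, p in declared_paths.items()}
    if len(set(normalized.values())) > 1:
        # Drift: return the full mapping as actionable diagnostics.
        return sorted(normalized.items())
    return []  # all consumers agree

drift = detect_path_drift({
    "systemd": "/opt/conda/envs/mes/bin/python",
    "watchdog": "/opt/conda/envs/mes/bin/python/",   # cosmetic difference only
    "scripts": "/opt/conda/envs/mes-old/bin/python", # real drift
})
```

Failing startup with the returned mapping gives the operator exactly which unit points at the stale environment.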
@@ -0,0 +1,15 @@
## ADDED Requirements

### Requirement: Recovery Recommendations SHALL Reflect Self-Healing Policy State
Health and admin resilience payloads MUST expose whether automated recovery is allowed, cooling down, or blocked by churn policy.

#### Scenario: Operator inspects degraded state
- **WHEN** `/health` or `/admin/api/worker/status` is requested during degradation
- **THEN** the response MUST include policy state, cooldown remaining time, and next recommended action

### Requirement: Manual Recovery Override SHALL Be Explicit and Controlled
Manual restart actions MUST bypass the automatic block only through authenticated operator pathways with explicit acknowledgement.

#### Scenario: Churn-blocked state with manual override request
- **WHEN** an authorized admin requests manual restart while auto-recovery is blocked
- **THEN** the system MUST execute the controlled restart path and log the override context for auditability
@@ -0,0 +1,22 @@

## ADDED Requirements

### Requirement: Automated Worker Recovery SHALL Use Bounded Policy Guards

Automated worker restart behavior MUST enforce cooldown periods and bounded restart attempts within a configurable time window.

#### Scenario: Repeated worker degradation within short window

- **WHEN** degradation events exceed configured restart-attempt budget
- **THEN** automated restarts MUST pause and surface a blocked-recovery signal for operator intervention

### Requirement: Restart-Churn Protection SHALL Prevent Recovery Storms

The runtime MUST classify restart churn and prevent uncontrolled restart loops.

#### Scenario: Churn threshold exceeded

- **WHEN** restart count crosses churn threshold in active window
- **THEN** watchdog MUST enter guarded mode and require explicit manual override before further restart attempts

### Requirement: Recovery Decisions SHALL Be Audit-Ready

Every auto-recovery decision and manual override action MUST be recorded with structured metadata.

#### Scenario: Worker restart decision emitted

- **WHEN** system executes or denies a restart action
- **THEN** structured logs/events MUST include reason, thresholds, actor/source, and resulting state

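The bounded-policy guards above (cooldown, retry budget, churn window, guarded mode) can be sketched as a small state machine. This is a minimal illustration, not the project's actual implementation; the class and parameter names are hypothetical.

```python
import time
from collections import deque


class RestartPolicy:
    """Illustrative guard: cooldown + bounded restart attempts in a sliding window."""

    def __init__(self, cooldown_s=60, max_attempts=3, window_s=600):
        self.cooldown_s = cooldown_s
        self.max_attempts = max_attempts
        self.window_s = window_s
        self._attempts = deque()   # timestamps of recent restarts
        self._last_restart = None
        self.guarded = False       # churn-blocked: manual override required

    def decide(self, now=None):
        """Return 'allow', 'cooldown', or 'blocked' for an auto-restart request."""
        now = time.monotonic() if now is None else now
        # Drop attempts that fell out of the churn window.
        while self._attempts and now - self._attempts[0] > self.window_s:
            self._attempts.popleft()
        if self.guarded:
            return "blocked"
        if self._last_restart is not None and now - self._last_restart < self.cooldown_s:
            return "cooldown"
        if len(self._attempts) >= self.max_attempts:
            self.guarded = True    # enter guarded mode; surface to operators
            return "blocked"
        return "allow"

    def record_restart(self, now=None):
        now = time.monotonic() if now is None else now
        self._attempts.append(now)
        self._last_restart = now
```

Each `decide` outcome maps directly onto the audit requirement: the returned state plus the thresholds involved are exactly the structured fields a restart decision log would carry.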
@@ -0,0 +1,23 @@

## 1. Conda/Systemd Contract Alignment

- [x] 1.1 Centralize runtime path configuration consumed by service units, watchdog, and scripts.
- [x] 1.2 Add startup validation that fails fast on conda path drift.
- [x] 1.3 Update systemd/watchdog integration tests for consistent runtime contract.

## 2. Worker Self-Healing Policy

- [x] 2.1 Implement bounded auto-restart policy (cooldown, retry budget, churn window).
- [x] 2.2 Add guarded mode behavior when churn threshold is exceeded.
- [x] 2.3 Implement authenticated manual override flow with explicit logging context.

## 3. Alerting and Operational Signals

- [x] 3.1 Expose policy-state fields in health/admin payloads (`allowed`, `cooldown`, `blocked`).
- [x] 3.2 Add structured audit events for restart decisions and override actions.
- [x] 3.3 Define alert thresholds and wire monitoring-friendly fields for pool/circuit/churn conditions.

## 4. Validation and Runbook Delivery

- [x] 4.1 Add tests for policy transitions, guarded mode, and override behavior.
- [x] 4.2 Validate single-port continuity during controlled recovery and hot reload paths.
- [x] 4.3 Update README/README.mdj and deployment runbook with verified operational procedures.

@@ -0,0 +1,2 @@
schema: spec-driven
created: 2026-02-08
@@ -0,0 +1,50 @@

## Context

The system has completed the single-port Vite architecture and the main P0/P1/P2 hardening, but residual risk concentrates in "cache slow-path lock contention + health-check hotspot queries + API boundary governance". Most of these issues only become visible under medium-to-high traffic; if they are not converged at this stage, later troubleshooting will be costly.

## Goals / Non-Goals

**Goals:**
- Complete the remaining stability and security fixes without changing page-interaction semantics or the single-port architecture.
- Make the cache/health paths more predictable under high concurrency and reduce log-related security risk.
- Use test coverage to ensure the fixes cause no functional regressions.

**Non-Goals:**
- Do not rewrite the main query flows or remove the `resource`/`wip` full-table cache strategy.
- Do not introduce heavyweight distributed rate-limit infrastructure.
- Do not change the semantics of front-end drill-down or reporting features.

## Decisions

1. **Cache publish consistency takes priority over local optimization**
   - Publish data and metadata via a staging key plus atomic rename/pipeline, so a failed publish never affects readability of the old data.

2. **Move parsing outside the lock; the lock only covers consistency check and commit**
   - The WIP process-cache slow path parses outside the lock, then double-checks and commits inside the lock, reducing lock hold time.

3. **Unify process-cache policies**
   - The realtime equipment cache gains `max_size` + LRU, consistent with the existing WIP/Resource caches.

4. **Health internal short cache enabled only outside testing**
   - TTL = 5 seconds, reducing repeated DB/Redis pressure from high-frequency probes; testing mode keeps real-time computation to avoid cross-test pollution.

5. **Lightweight in-memory rate limiting for high-cost APIs**
   - Throttle per IP + route window with tunable parameters, without introducing new external dependencies.

## Risks / Trade-offs

- [Risk] The cache-publish rework adds key-switching complexity → Mitigation: add tests covering publish failure and success.
- [Risk] The health cache briefly delays observability → Mitigation: cap the TTL at 5 seconds and disable it in testing.
- [Risk] In-memory rate limiting is not globally consistent across multiple workers → Mitigation: treat it as a safety valve first; upgrade to a Redis-based limiter later if needed.

## Migration Plan

1. Land the core cache and health fixes first (no API-contract impact).
2. Then introduce the API boundary/rate-limit work and extract shared utilities.
3. Add unit and integration tests; run the benchmark smoke.
4. Update the README documentation and environment-variable notes.

## Open Questions

- Should the default rate-limit thresholds for high-cost APIs be differentiated per endpoint (WIP vs Resource)?
- Should this later be upgraded to Redis-based distributed rate limiting for global consistency across workers?

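Decision 1 above (staging key + atomic rename) can be sketched as follows. This is a minimal illustration assuming a redis-py-compatible client; the key names and function signature are hypothetical, not the project's actual API.

```python
import json


def publish_snapshot(r, data_key, meta_key, payload, meta):
    """Two-phase publish: write the new snapshot to staging keys, then swap
    both into place with RENAME inside one pipeline. Readers see either the
    old snapshot or the complete new one, never a partial mix.
    `r` is assumed to be a redis.Redis-compatible client."""
    staging_data = f"{data_key}:staging"
    staging_meta = f"{meta_key}:staging"
    r.set(staging_data, json.dumps(payload))
    r.set(staging_meta, json.dumps(meta))
    pipe = r.pipeline(transaction=True)
    pipe.rename(staging_data, data_key)   # server-side key swap
    pipe.rename(staging_meta, meta_key)
    pipe.execute()
    # Any failure before execute() leaves the previously published keys intact.
```

Because the swap is the last step, a serialization or network failure during staging leaves the old `data_key`/`meta_key` fully readable, which is exactly the publish-failure guarantee the decision calls for.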
@@ -0,0 +1,44 @@

## Why

The previous round fixed the high-risk core issues, but a batch of residual problems would still amplify risk under high concurrency, long uptime, and malicious/abnormal input (cache publish consistency, lock contention, health-check load, input boundaries, and rate governance). This round converges those tail risks to an acceptable range, avoiding later operational and performance instability.

## What Changes

- Harden the WIP cache publish flow so a failed update cannot pollute the existing read path.
- Narrow the process-cache slow-path lock scope to avoid parsing large JSON while holding the lock.
- Add a bounded LRU to the realtime equipment process cache, consistent with the WIP/Resource policy.
- Add depth protection to resource-route NaN cleaning (avoiding deep-recursion risk).
- Extract shared boolean-parameter parsing to eliminate duplicated logic.
- Make the filter-cache view names configurable, removing hardcoded coupling.
- Add log redaction for sensitive connection strings.
- Add a 5-second internal short cache to `/health` and `/health/deep` (disabled in testing mode).
- Add lightweight, tunable rate limiting to high-cost query APIs.
- Update README/README.mdj and the verification tests.

## Capabilities

### New Capabilities
- `api-safety-hygiene`: API input boundaries, shared parameter parsing, configurable query sources, and rate governance for high-cost endpoints.

### Modified Capabilities
- `cache-observability-hardening`: strengthens cache publish consistency, process-cache lock scope, and bounded-policy alignment.
- `runtime-resilience-recovery`: operational-safety requirements for the health-check short cache and sensitive-data log redaction.

## Impact

- Affected code:
  - `src/mes_dashboard/core/cache_updater.py`
  - `src/mes_dashboard/core/cache.py`
  - `src/mes_dashboard/services/realtime_equipment_cache.py`
  - `src/mes_dashboard/routes/resource_routes.py`
  - `src/mes_dashboard/routes/wip_routes.py`
  - `src/mes_dashboard/routes/hold_routes.py`
  - `src/mes_dashboard/services/filter_cache.py`
  - `src/mes_dashboard/core/database.py`
  - `src/mes_dashboard/routes/health_routes.py`
- APIs:
  - `/health`, `/health/deep`
  - `/api/wip/detail/<workcenter>`, `/api/wip/overview/*`
  - `/api/resource/*` (high-cost routes)
- Docs/tests:
  - `README.md`, `README.mdj`, `tests/*`

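The lightweight rate limiting described above can be sketched as a process-local sliding-window limiter keyed by (client IP, route). The class and default values are illustrative assumptions, not the project's actual implementation; as the design notes, a process-local limiter is only approximate across multiple workers.

```python
import time
from collections import defaultdict, deque
from threading import Lock


class WindowRateLimiter:
    """Illustrative per-(client, route) sliding-window limiter, process-local only."""

    def __init__(self, limit=30, window_s=60.0):
        self.limit = limit
        self.window_s = window_s
        self._hits = defaultdict(deque)   # (ip, route) -> request timestamps
        self._lock = Lock()

    def allow(self, client_ip, route, now=None):
        now = time.monotonic() if now is None else now
        key = (client_ip, route)
        with self._lock:
            q = self._hits[key]
            # Evict timestamps that slid out of the window.
            while q and now - q[0] > self.window_s:
                q.popleft()
            if len(q) >= self.limit:
                return False   # caller should answer HTTP 429 with retry guidance
            q.append(now)
            return True
```

A route decorator would call `allow(request.remote_addr, request.path)` and return a throttled response with a `Retry-After` hint when it is false.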
@@ -0,0 +1,29 @@

## ADDED Requirements

### Requirement: Recursive Payload Cleaning MUST Enforce Depth Safety

Routes that normalize nested payloads MUST prevent unbounded recursion depth.

#### Scenario: Deeply nested response object

- **WHEN** NaN-cleaning helper receives deeply nested list/dict payload
- **THEN** cleaning logic MUST enforce max depth or iterative traversal and return safely without recursion failure

### Requirement: Filter Source Names MUST Be Configurable

Filter cache query sources MUST NOT rely on hardcoded view names only.

#### Scenario: Environment-specific view names

- **WHEN** deployment sets custom filter-source environment variables
- **THEN** filter cache loader MUST resolve and query configured view names

### Requirement: High-Cost APIs SHALL Apply Basic Rate Guardrails

High-cost read endpoints SHALL apply configurable request-rate guardrails to reduce abuse and accidental bursts.

#### Scenario: Burst traffic from same client

- **WHEN** a client exceeds configured request budget for guarded endpoints
- **THEN** endpoint SHALL return throttled response with clear retry guidance

### Requirement: Common Boolean Query Parsing SHALL Be Shared

Boolean query parsing in routes SHALL use shared helper behavior.

#### Scenario: Different routes parse include flags

- **WHEN** routes parse common boolean query parameters
- **THEN** parsing behavior MUST be consistent across routes via shared utility

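The depth-safety requirement above can be satisfied with an explicit depth guard on the cleaning helper. A minimal sketch, assuming JSON-style dict/list payloads; the function name and default depth are illustrative.

```python
import math


def clean_nan(obj, max_depth=50):
    """Replace NaN/inf floats with None in nested dict/list payloads.
    An explicit depth guard stops descent past max_depth, returning the
    subtree untouched instead of risking a RecursionError."""
    def _clean(value, depth):
        if depth > max_depth:
            return value   # depth guard: stop descending, never blow the stack
        if isinstance(value, float) and (math.isnan(value) or math.isinf(value)):
            return None
        if isinstance(value, dict):
            return {k: _clean(v, depth + 1) for k, v in value.items()}
        if isinstance(value, list):
            return [_clean(v, depth + 1) for v in value]
        return value
    return _clean(obj, 0)
```

An iterative traversal with an explicit stack would satisfy the same requirement; the depth guard is simply the smaller change to an existing recursive helper.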
@@ -0,0 +1,26 @@

## ADDED Requirements

### Requirement: Cache Publish MUST Preserve Previous Readable Snapshot on Failure

When refreshing full-table cache payloads, the system MUST avoid exposing partially published states to readers.

#### Scenario: Publish fails after payload serialization

- **WHEN** a cache refresh has prepared new payload but publish operation fails
- **THEN** previously published cache keys MUST remain readable and metadata MUST remain consistent with old snapshot

#### Scenario: Publish succeeds

- **WHEN** publish operation completes successfully
- **THEN** data payload and metadata keys MUST be visible as one coherent new snapshot

### Requirement: Process-Level Cache Slow Path SHALL Minimize Lock Hold Time

Large payload parsing MUST NOT happen inside long-held process cache locks.

#### Scenario: Cache miss under concurrent requests

- **WHEN** multiple requests hit process cache miss
- **THEN** parsing work SHALL happen outside lock-protected mutation section, and lock scope SHALL be limited to consistency check + commit

### Requirement: Process-Level Cache Policies MUST Stay Consistent Across Services

All service-local process caches MUST support bounded capacity with deterministic eviction.

#### Scenario: Realtime equipment cache growth

- **WHEN** realtime equipment process cache reaches configured capacity
- **THEN** entries MUST be evicted according to deterministic LRU behavior

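The lock-scope requirement above is the classic double-checked slow path: parse outside the lock, then re-check and commit inside it. A minimal sketch with hypothetical names; versioning details will differ in the real cache.

```python
import json
from threading import Lock


class ProcessCache:
    """Sketch of the double-checked slow path: expensive parsing happens with
    no lock held; the lock only guards a cheap version check plus commit."""

    def __init__(self):
        self._lock = Lock()
        self._value = None
        self._version = None

    def get(self, raw_payload, version):
        with self._lock:                  # fast path: cheap check only
            if self._version == version:
                return self._value
        parsed = json.loads(raw_payload)  # heavy work, lock released
        with self._lock:
            if self._version != version:  # double-check: first writer wins
                self._value = parsed
                self._version = version
            return self._value
```

Under a concurrent miss, several threads may parse redundantly, but none of them serializes the others behind the parse; the lock-held section stays O(1).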
@@ -0,0 +1,19 @@

## ADDED Requirements

### Requirement: Health Endpoints SHALL Use Short Internal Memoization

Health and deep-health computation SHALL use a short-lived internal cache to prevent probe storms from amplifying backend load.

#### Scenario: Frequent monitor scrapes

- **WHEN** health endpoints are called repeatedly within a small window
- **THEN** service SHALL return memoized payload for up to 5 seconds in non-testing environments

#### Scenario: Testing mode

- **WHEN** app is running in testing mode
- **THEN** health endpoint memoization MUST be bypassed to preserve deterministic tests

### Requirement: Logs MUST Redact Connection Secrets

Runtime logs MUST avoid exposing DB connection credentials.

#### Scenario: Connection string appears in log message

- **WHEN** a log message contains DB URL credentials
- **THEN** logger output MUST redact password and sensitive userinfo before emission

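The health memoization requirement can be sketched as a tiny TTL decorator with a testing bypass. Names and the decorator shape are illustrative assumptions.

```python
import time


def memoize_health(ttl_s=5.0, testing=False):
    """Decorator sketch: cache a health payload for ttl_s seconds; bypass
    caching entirely in testing mode so tests stay deterministic."""
    def wrap(fn):
        state = {"at": None, "value": None}

        def inner(*args, **kwargs):
            if testing:
                return fn(*args, **kwargs)   # testing mode: always recompute
            now = time.monotonic()
            if state["at"] is None or now - state["at"] >= ttl_s:
                state["value"] = fn(*args, **kwargs)
                state["at"] = now
            return state["value"]
        return inner
    return wrap
```

Wired onto the `/health` computation, a scrape storm collapses into at most one backend probe per TTL window.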
@@ -0,0 +1,22 @@

## 1. Cache Consistency and Contention Hardening

- [x] 1.1 Harden WIP cache publish in `cache_updater.py` to preserve old snapshot on publish failure.
- [x] 1.2 Refactor WIP process-cache slow path in `core/cache.py` so heavy parse runs outside lock.
- [x] 1.3 Extend realtime equipment process cache with bounded `max_size` + deterministic LRU and add regression tests.

## 2. API Safety and Config Hygiene

- [x] 2.1 Add depth-safe NaN cleaning in `resource_routes.py` and tests for deep payloads.
- [x] 2.2 Add shared boolean query parser in `core/utils.py` and switch `wip_routes.py` / `hold_routes.py` to it.
- [x] 2.3 Make filter-cache source views configurable (env-based) in `filter_cache.py` and add config tests.

## 3. Runtime Guardrails

- [x] 3.1 Add DB connection-string redaction logging filter in `core/database.py` (or logging bootstrap) with tests.
- [x] 3.2 Add 5-second internal memoization for `/health` and `/health/deep` (disabled in testing) and tests.
- [x] 3.3 Add lightweight rate limiting for selected high-cost APIs with clear throttling responses and tests.

## 4. Validation and Documentation

- [x] 4.1 Run targeted backend/frontend tests and benchmark smoke gate.
- [x] 4.2 Update `README.md` and `README.mdj` with round-3 hardening notes and new env variables.

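Task 3.1's redaction logging filter can be sketched with a `logging.Filter` that masks the userinfo part of connection URLs before any handler sees the record. The regex and filter name are illustrative assumptions.

```python
import logging
import re

# Matches the userinfo portion of URLs like oracle://user:pass@host:1521/db
_URL_CREDS = re.compile(r"(?P<scheme>\w+://)[^/@\s]+(?=@)")


class RedactSecretsFilter(logging.Filter):
    """Masks user:password@ userinfo in any URL-shaped token of a log message."""

    def filter(self, record):
        msg = record.getMessage()
        redacted = _URL_CREDS.sub(r"\g<scheme>***", msg)
        if redacted != msg:
            # Replace the formatted message so downstream handlers see no secret.
            record.msg, record.args = redacted, ()
        return True   # never drop the record, only sanitize it
```

Attaching the filter at logger (or handler) bootstrap covers every emitter, instead of relying on each call site to remember to redact.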
@@ -0,0 +1,2 @@
schema: spec-driven
created: 2026-02-08
@@ -0,0 +1,61 @@

## Context

After round 3 the main flows are stable, but three classes of technical debt remain:
- The Resource cache keeps both a DataFrame and a full records copy in the same process, causing memory amplification.
- The Oracle queries for Resource and Realtime Equipment duplicate SQL strings across services, so future edits can drift apart.
- Some service-boundary type annotations and magic numbers are unsystematic, raising maintenance cost.

Constraints:
- `resource` / `wip` keep the full-table cache strategy; data sources and refresh cadence are unchanged.
- External API fields and front-end behavior are unchanged.
- The single-port architecture and existing operations contract are preserved.

## Goals / Non-Goals

**Goals:**
- Reduce duplicated in-process representations in the Resource cache while keeping query output compatible.
- Maintain cross-service Oracle query fragments from a single source.
- Give key service/cache modules consistent type annotations and named constants.

**Non-Goals:**
- No database schema changes and no changes to SQL result columns.
- No rewrite of the overall cache architecture (Redis + process cache stays).
- No new infrastructure or external dependencies.

## Decisions

1. The Resource derived index becomes a "row-position index" rather than keeping a full records copy
   - Today: the index retains `records` plus several bucketed record sets, duplicating the DataFrame contents.
   - Decision: the index keeps only row positions (integer indices) and essential metadata; dict output is converted on demand from the DataFrame.
   - Trade-off: each output pays a small conversion cost, but resident memory duplication drops significantly.

2. Create a shared Oracle query-constant module
   - Today: `resource_cache.py` and `realtime_equipment_cache.py` each maintain their own base SQL.
   - Decision: extract `services/sql_fragments.py` (or an equivalent module) to own the shared query text and table/view names.
   - Trade-off: one extra level of indirection, but query semantics stay consistent and changes are controlled.

3. Type and constant governance proceeds "core boundaries first, then spread"
   - Today: some functions mix `Optional` and PEP 604 styles, and magic numbers are scattered across cache/service code.
   - Decision: first unify the annotation style and high-frequency constants (TTL, size, window, limits) in the files touched this round.
   - Trade-off: no one-shot project-wide cleanup, avoiding large-scale noise; this establishes a baseline that can expand sustainably.

## Risks / Trade-offs

- [Risk] The row-position index drifts from the DataFrame version → Mitigation: rebuild the index on every cache invalidation and keep a version check.
- [Risk] Lazy conversion makes query latency fluctuate → Mitigation: keep the process cache and optimize small-batch output on hot paths.
- [Risk] Extracted shared SQL constants cause reference mistakes → Mitigation: add unit tests verifying the query text matches the existing column contract.
- [Risk] Type/constant cleanup changes behavior → Mitigation: equivalence-only refactoring, keep original values, cover with regression tests.

## Migration Plan

1. Refactor the Resource index representation first, keeping API output unchanged.
2. Extract the shared SQL fragments and switch both cache services over.
3. Clean up types and constants in scope and add tests.
4. Update README / README.mdj and the OpenSpec tasks; run the targeted backend/frontend test sets.

Rollback:
- If compatibility problems appear, revert to the original records-based index and the old inline SQL (a single-file rollback each).

## Open Questions

- Whether the next round should extend the same governance to the remaining constants and types in `wip_service.py` (this round is limited to the residual scope).

@@ -0,0 +1,31 @@

## Why

The remaining risks concentrate on maintainability and memory efficiency: the Resource cache keeps multiple data representations in one process, some query SQL is maintained in duplicate across cache services, and type annotations and magic numbers are still inconsistent. These issues will not cause an immediate outage, but they raise memory usage, future change cost, and regression risk, so they should be converged without changing existing behavior.

## What Changes

- Change the Resource derived index to a "lightweight index + lazy output" representation, avoiding a duplicated full records copy in the process.
- Consolidate the Oracle query strings of Resource and Realtime Equipment into a shared SQL-constant module, reducing duplicated definitions and drift risk.
- Align type annotations (especially at cache/index/service boundaries) and promote high-frequency magic numbers to named constants or configurable parameters.
- Keep the existing API contract, full-table cache strategy, single-port architecture, and front-end behavior unchanged.

## Capabilities

### New Capabilities
- `resource-cache-representation-normalization`: replaces multiple full in-process data copies with a single authoritative representation plus a lightweight index, while keeping the existing query response structure.
- `oracle-query-fragment-governance`: extracts cross-service Oracle query fragments into shared constants/templates to keep query semantics consistent.
- `maintainability-type-and-constant-hygiene`: establishes working conventions for type annotations and named constants, reducing magic numbers and annotation-style drift.

### Modified Capabilities
- `cache-observability-hardening`: adds observability-consistency requirements for the memory amplification factor after the index-representation change.

## Impact

- Main affected files:
  - `src/mes_dashboard/services/resource_cache.py`
  - `src/mes_dashboard/services/realtime_equipment_cache.py`
  - `src/mes_dashboard/services/resource_service.py` (if index output requires it)
  - `src/mes_dashboard/sql/*` or a new shared SQL-constant module
  - `src/mes_dashboard/config/constants.py`, `src/mes_dashboard/core/utils.py`
  - Corresponding tests and README/README.mdj docs
- No new external dependencies; external API paths and field contracts are unchanged.

@@ -0,0 +1,8 @@

## MODIFIED Requirements

### Requirement: Cache Telemetry SHALL Include Memory Amplification Signals

Operational telemetry MUST expose cache-domain memory usage indicators and representation amplification factors, and MUST differentiate between authoritative data payload and derived/index helper structures.

#### Scenario: Deep health telemetry request after representation normalization

- **WHEN** operators inspect cache telemetry for resource or WIP domains
- **THEN** telemetry MUST include per-domain memory footprint, amplification indicators, and enough structure detail to verify that full-record duplication is not reintroduced

@@ -0,0 +1,15 @@

## ADDED Requirements

### Requirement: Core Cache and Service Boundaries MUST Use Consistent Type Annotation Style

Core cache/service modules touched by this change SHALL use a consistent and explicit type-annotation style for public and internal helper boundaries.

#### Scenario: Reviewing updated cache/service modules

- **WHEN** maintainers inspect function signatures in affected modules
- **THEN** optional and collection types MUST follow a single consistent style and remain compatible with the project Python baseline

### Requirement: High-Frequency Magic Numbers MUST Be Replaced by Named Constants

Cache, throttling, and index-related numeric literals that control behavior MUST be extracted to named constants or env-configurable settings.

#### Scenario: Tuning cache/index behavior

- **WHEN** operators need to tune cache/index thresholds
- **THEN** they MUST find values in named constants or environment variables rather than scattered inline literals

@@ -0,0 +1,15 @@

## ADDED Requirements

### Requirement: Shared Oracle Query Fragments SHALL Have a Single Source of Truth

Cross-service Oracle query fragments for resource and equipment cache loading MUST be defined in a shared module and imported by service implementations.

#### Scenario: Update common table/view reference

- **WHEN** a common table or view name changes
- **THEN** operators and developers MUST be able to update one shared definition without editing duplicated SQL literals across services

### Requirement: Service Queries MUST Preserve Existing Columns and Semantics

Services consuming shared Oracle query fragments SHALL preserve existing selected columns, filters, and downstream payload behavior.

#### Scenario: Resource and equipment cache refresh after refactor

- **WHEN** cache services execute queries via shared fragments
- **THEN** resulting payload structure MUST remain compatible with existing aggregation and API contracts

@@ -0,0 +1,22 @@

## ADDED Requirements

### Requirement: Resource Derived Index MUST Avoid Full Record Duplication

Resource derived index SHALL use lightweight row-position references instead of storing full duplicated record payloads alongside the process DataFrame cache.

#### Scenario: Build index from cached DataFrame

- **WHEN** resource cache data is parsed from Redis into process-level DataFrame
- **THEN** the derived index MUST store position-based references and metadata without a second full records copy

### Requirement: Resource Query APIs SHALL Preserve Existing Response Contract

Resource query APIs MUST keep existing output fields and semantics after index representation normalization.

#### Scenario: Read all resources after normalization

- **WHEN** callers request all resources or filtered resource lists
- **THEN** the returned payload MUST remain field-compatible with pre-normalization responses

### Requirement: Cache Invalidation MUST Keep Index/Data Coherent

The system SHALL invalidate and rebuild DataFrame/index representations atomically at cache refresh boundaries.

#### Scenario: Redis-backed cache refresh completes

- **WHEN** a new resource cache snapshot is published
- **THEN** stale index references MUST be invalidated before subsequent reads use refreshed DataFrame data

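The row-position index above can be sketched with pandas: the index maps a key to integer row positions, and dict records are materialized lazily only for the rows a query actually returns. Function names and the key column are illustrative assumptions.

```python
import pandas as pd


def build_position_index(df, key_col):
    """Map each key value to its integer row positions; no record duplication."""
    index = {}
    for pos, key in enumerate(df[key_col].tolist()):
        index.setdefault(key, []).append(pos)
    return index


def rows_for_key(df, index, key):
    """Materialize dict records on demand, only for the requested rows."""
    positions = index.get(key, [])
    return df.iloc[positions].to_dict(orient="records")
```

The index costs one small int list per key instead of a second full copy of every record, which is the memory-amplification reduction the requirement targets; rebuilding it together with the DataFrame at refresh boundaries keeps the two coherent.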
@@ -0,0 +1,22 @@

## 1. Resource Cache Representation Normalization

- [x] 1.1 Refactor `resource_cache` derived index to use lightweight row-position references instead of full duplicated records payload.
- [x] 1.2 Keep `get_all_resources` / `get_resources_by_filter` API outputs backward compatible while sourcing data from normalized representation.
- [x] 1.3 Update cache telemetry fields to reflect normalized representation and verify amplification calculation remains interpretable.

## 2. Oracle Query Fragment Governance

- [x] 2.1 Extract shared Oracle SQL fragments/constants for resource/equipment cache loading into a common module.
- [x] 2.2 Replace duplicated SQL literals in `resource_cache.py` and `realtime_equipment_cache.py` with shared definitions.
- [x] 2.3 Add/adjust tests to lock expected query semantics and prevent drift.

## 3. Maintainability Hygiene

- [x] 3.1 Normalize type annotations in touched cache/service modules to one consistent style.
- [x] 3.2 Replace high-frequency magic numbers with named constants or env-driven config in touched modules.
- [x] 3.3 Confirm existing login/API rate-limit and bool parser utilities remain centralized without new duplication.

## 4. Verification and Documentation

- [x] 4.1 Run targeted backend tests for resource cache, equipment cache, health/admin, and route behavior.
- [x] 4.2 Update `README.md` and `README.mdj` with round-4 hardening notes.

@@ -0,0 +1,2 @@
schema: spec-driven
created: 2026-02-08
@@ -0,0 +1,65 @@

## Context

The previous round completed the bulk of the P0/P1/P2 refactor, but code review still found several residual high-risk points:
- `LDAP_API_URL` lacks scheme/host guards, a configurable SSRF risk.
- The process-level DataFrame cache is TTL-only, with no capacity bound.
- The circuit breaker logs state transitions while holding its lock, risking amplified lock contention.
- Global security headers are not yet emitted uniformly.
- Pagination parameters still lack lower-bound validation.

These issues span `app/core/services/routes/tests` and amount to cross-module security and stability fixes.

## Goals / Non-Goals

**Goals:**
- Establish a testable minimum defense line for the LDAP endpoint, HTTP response headers, and input boundaries.
- Give the process-level cache bounded capacity with predictable eviction behavior.
- Reduce lock-contention risk inside the circuit breaker so slow handlers cannot amplify blocking.
- Keep the single port, the existing API contract, and front-end interaction semantics unchanged.

**Non-Goals:**
- No full WAF/zero-trust architecture.
- No rewrite of the existing cache architecture into an external cache service.
- No changes to reporting features or page flows.

## Decisions

1. **LDAP URL startup validation (fail-fast)**
   - Decision: validate `LDAP_API_URL` during `auth_service` startup, restricting it to `https` and an allowlisted host (configured via env); if invalid, disable the LDAP auth path and log an error.
   - Rationale: seals the configuration-driven SSRF risk with minimal change and does not affect local auth mode.

2. **Bound the ProcessLevelCache**
   - Decision: add `max_size` with LRU eviction (`OrderedDict`) to `ProcessLevelCache`; `set` evicts the oldest key.
   - Rationale: keeps the TTL behavior while preventing long-lived accumulation of high-cardinality keys.

3. **Circuit breaker logs outside the lock**
   - Decision: `_transition_to` only updates state and assembles the log message inside the lock; the actual logger call moves outside the lock.
   - Rationale: shortens the lock-held section so a slow I/O handler cannot block other request paths.

4. **Unified global security-header injection**
   - Decision: add `CSP`, `X-Frame-Options`, `X-Content-Type-Options`, and `Referrer-Policy` in `app.after_request`, plus `HSTS` in production.
   - Rationale: a centralized policy covers all pages and APIs, reducing the chance of omissions.

5. **Consistent pagination bounds**
   - Decision: apply uniform `max(1, min(...))` bounds to `page` and `page_size`.
   - Rationale: prevents negative or extreme values from causing unnecessary load and unexpected behavior.

## Risks / Trade-offs

- **[Risk] An incomplete LDAP allowlist breaks login** → **Mitigation:** provide clear error messages and local-auth fallback guidance.
- **[Risk] A too-small cache bound lowers the hit rate** → **Mitigation:** make `max_size` configurable; start with a conservative default and watch telemetry.
- **[Risk] An overly strict CSP breaks existing inline scripts** → **Mitigation:** start with `default-src 'self'` and a compatible policy; fine-tune with nonces/allowlists if needed.
- **[Risk] Behavior changes cause test regressions** → **Mitigation:** add unit/integration tests covering every fix point.

## Migration Plan

1. Land the backend fixes first (auth/cache/circuit breaker/app headers/routes).
2. Add tests (LDAP validation, LRU, lock-free logging, headers, pagination bounds).
3. Run the existing health checks and key integration tests.
4. Update the security and stability sections of README/README.mdj.
5. If compatibility issues surface after deployment, the LDAP host allowlist and CSP details can be temporarily relaxed via env.

## Open Questions

- Do the LDAP host allowlists need multiple domains per environment (e.g., intranet + DR site)?
- Should CSP switch to a strict nonce-based mode immediately, or keep the compatible policy first?

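Decision 4 above can be sketched framework-free as a function that returns the baseline header set; in Flask it would be applied inside an `app.after_request` hook. The specific header values are illustrative assumptions, not the project's confirmed policy.

```python
def security_headers(production=False):
    """Baseline security-header set. In Flask, wire via an after_request hook:
        resp.headers.setdefault(name, value) for each pair returned here.
    Values below are illustrative defaults."""
    headers = {
        "Content-Security-Policy": "default-src 'self'",
        "X-Frame-Options": "DENY",
        "X-Content-Type-Options": "nosniff",
        "Referrer-Policy": "same-origin",
    }
    if production:
        # HSTS is only meaningful behind TLS, hence production-only.
        headers["Strict-Transport-Security"] = "max-age=31536000; includeSubDomains"
    return headers
```

Using `setdefault` at injection time lets an individual route override a header (e.g., a relaxed CSP for one page) without fighting the global hook.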
@@ -0,0 +1,40 @@

## Why

The previous round completed the core stability refactor, but several high-priority risks (LDAP URL validation, unbounded cache growth, lock-held circuit-breaker logging, security-header gaps, pagination lower-bound validation) remain unconverged. Under long uptime and malicious input these accumulate availability and security risk, so they are closed out in this round.

## What Changes

- Add startup validation for the LDAP API base URL (restricted to `https` and allowlisted hosts), removing an attacker-controllable SSRF target.
- Add `max_size` with LRU eviction to the process-level cache, preventing unbounded memory growth from high-cardinality keys.
- Rework the circuit-breaker state-transition flow to avoid logging while the lock is held.
- Add global security headers (CSP, X-Frame-Options, X-Content-Type-Options, Referrer-Policy, HSTS).
- Tighten pagination lower-bound validation so negative values and unreasonable page sizes never reach the query flow.
- Add matching tests and documentation for the above, keeping the single port and existing front-end semantics unchanged.

## Capabilities

### New Capabilities
- `security-surface-hardening`: defines the minimum defense line for the remaining security surface (SSRF protection, security headers, input-boundary validation).

### Modified Capabilities
- `cache-observability-hardening`: extends cache governance with bounded capacity and an eviction policy for the process-level cache.
- `runtime-resilience-recovery`: adds the circuit-breaker lock-contention fix and compatibility requirements between security headers and operational diagnostic responses.

## Impact

- Affected code:
  - `src/mes_dashboard/services/auth_service.py`
  - `src/mes_dashboard/core/cache.py`
  - `src/mes_dashboard/services/resource_cache.py`
  - `src/mes_dashboard/core/circuit_breaker.py`
  - `src/mes_dashboard/app.py`
  - `src/mes_dashboard/routes/wip_routes.py`
  - `tests/`
  - `README.md`, `README.mdj`
- APIs:
  - `/health`, `/health/deep`
  - `/api/wip/detail/<workcenter>`
  - `/admin/login` (indirectly affected by LDAP base validation)
- Operational behavior:
  - Keeps the single port and the existing reporting UI flow.
  - Strengthens the security and stability defense line without changing feature semantics.

|

## ADDED Requirements

### Requirement: Process-Level Cache SHALL Use Bounded Capacity with Deterministic Eviction

Process-level parsed-data caches MUST enforce a configurable maximum key capacity and use deterministic eviction behavior when capacity is exceeded.

#### Scenario: Cache capacity reached

- **WHEN** a new cache entry is inserted and key capacity is at limit
- **THEN** cache MUST evict entries according to defined policy before storing the new key

#### Scenario: Repeated access updates recency

- **WHEN** an existing cache key is read or overwritten
- **THEN** eviction order MUST reflect recency semantics so hot keys are retained preferentially

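The bounded-capacity requirement above combines TTL expiry with `max_size` LRU eviction; `collections.OrderedDict` gives both recency tracking (`move_to_end`) and deterministic eviction (`popitem(last=False)`). A minimal sketch with illustrative names and defaults:

```python
import time
from collections import OrderedDict


class BoundedTTLCache:
    """TTL + max_size LRU sketch: reads refresh recency; inserting at
    capacity evicts the least recently used key first."""

    def __init__(self, max_size=64, ttl_s=300.0):
        self.max_size = max_size
        self.ttl_s = ttl_s
        self._data = OrderedDict()   # key -> (expires_at, value)

    def get(self, key, now=None):
        now = time.monotonic() if now is None else now
        item = self._data.get(key)
        if item is None:
            return None
        expires_at, value = item
        if now > expires_at:
            del self._data[key]      # TTL expiry keeps existing behavior
            return None
        self._data.move_to_end(key)  # mark as most recently used
        return value

    def set(self, key, value, now=None):
        now = time.monotonic() if now is None else now
        if key in self._data:
            self._data.move_to_end(key)
        elif len(self._data) >= self.max_size:
            self._data.popitem(last=False)   # evict the LRU entry
        self._data[key] = (now + self.ttl_s, value)
```

Both scenarios fall out directly: eviction at capacity is `popitem(last=False)`, and recency updates are the `move_to_end` calls on read and overwrite.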
@@ -0,0 +1,12 @@

## ADDED Requirements

### Requirement: Circuit Breaker State Transitions SHALL Avoid Lock-Held Logging

Circuit breaker state transitions MUST avoid executing logger I/O while internal state locks are held.

#### Scenario: State transition occurs

- **WHEN** circuit breaker transitions between CLOSED, OPEN, or HALF_OPEN
- **THEN** lock-protected section MUST complete state mutation before emitting transition log output

#### Scenario: Slow log handler under load

- **WHEN** logger handlers are slow or blocked
- **THEN** circuit breaker lock contention MUST remain bounded and MUST NOT serialize unrelated request paths behind logging latency

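The lock-held-logging requirement is a small restructuring of the transition method: mutate state and assemble the message under the lock, emit after release. A minimal sketch with hypothetical names:

```python
import logging
from threading import Lock

logger = logging.getLogger(__name__)


class Breaker:
    """Sketch: state mutation and message assembly happen under the lock;
    the (possibly slow) logger call runs only after the lock is released."""

    def __init__(self):
        self._lock = Lock()
        self.state = "CLOSED"

    def _transition_to(self, new_state, reason):
        with self._lock:
            old_state, self.state = self.state, new_state
            message = f"circuit breaker {old_state} -> {new_state}: {reason}"
        # Outside the lock: a blocked log handler no longer serializes
        # other threads waiting to read or mutate breaker state.
        logger.warning(message)
        return old_state
```

The message string is captured inside the lock so the log reflects the exact transition, even if another thread transitions again before the log call runs.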
@@ -0,0 +1,34 @@

## ADDED Requirements

### Requirement: LDAP Authentication Endpoint Configuration SHALL Be Strictly Validated

The system MUST validate LDAP authentication endpoint configuration before use, including HTTPS scheme enforcement and host allowlist checks.

#### Scenario: Invalid LDAP URL configuration detected

- **WHEN** `LDAP_API_URL` is missing, non-HTTPS, or points to a host outside the configured allowlist
- **THEN** the service MUST reject LDAP authentication calls and emit actionable diagnostics without sending credentials to that endpoint

#### Scenario: Valid LDAP URL configuration accepted

- **WHEN** `LDAP_API_URL` uses HTTPS and host is allowlisted
- **THEN** LDAP authentication requests MAY proceed with normal timeout and error handling behavior

### Requirement: Security Response Headers SHALL Be Applied Globally

All HTTP responses MUST include baseline security headers suitable for dashboard and API traffic.

#### Scenario: Standard response emitted

- **WHEN** any route returns a response
- **THEN** response MUST include `Content-Security-Policy`, `X-Frame-Options`, `X-Content-Type-Options`, and `Referrer-Policy`

#### Scenario: Production transport hardening

- **WHEN** runtime environment is production
- **THEN** response MUST include `Strict-Transport-Security`

### Requirement: Pagination Input Boundaries SHALL Be Enforced

Endpoints accepting pagination parameters MUST enforce lower and upper bounds before query execution.

#### Scenario: Negative or zero pagination inputs

- **WHEN** client sends `page <= 0` or `page_size <= 0`
- **THEN** server MUST normalize values to minimum supported bounds

#### Scenario: Excessive page size requested

- **WHEN** client sends `page_size` above configured maximum
- **THEN** server MUST clamp to maximum supported page size

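The LDAP validation requirement can be sketched with `urllib.parse.urlsplit`: require `https` and an allowlisted hostname before any credential leaves the process. Function name and the example hosts are illustrative assumptions.

```python
from urllib.parse import urlsplit


def validate_ldap_api_url(url, allowed_hosts):
    """Fail fast on configurable SSRF targets: require https and an
    allowlisted hostname, raising before any outbound auth call."""
    if not url:
        raise ValueError("LDAP_API_URL is not configured")
    parts = urlsplit(url)
    if parts.scheme != "https":
        raise ValueError(f"LDAP_API_URL must use https, got {parts.scheme!r}")
    if parts.hostname not in allowed_hosts:
        raise ValueError(f"LDAP_API_URL host {parts.hostname!r} is not allowlisted")
    return url
```

Calling this at service initialization (rather than per request) gives the fail-fast behavior the design describes: a misconfigured URL disables the LDAP path with an actionable error instead of sending credentials to an arbitrary endpoint.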
@@ -0,0 +1,24 @@

## 1. LDAP Endpoint Hardening

- [x] 1.1 Add strict `LDAP_API_URL` validation (`https` + allowlisted hosts) in auth service initialization.
- [x] 1.2 Add tests for valid/invalid LDAP URL configurations and ensure unsafe URLs are rejected without outbound auth call.

## 2. Bounded Process Cache

- [x] 2.1 Extend `ProcessLevelCache` with configurable `max_size` and LRU eviction behavior.
- [x] 2.2 Wire bounded cache configuration for WIP/Resource process-level caches and add regression tests.

## 3. Circuit Breaker Lock Contention Reduction

- [x] 3.1 Refactor circuit breaker transition logging to execute outside lock-protected section.
- [x] 3.2 Add tests verifying transition logs are emitted while state mutation remains correct.

## 4. HTTP Security Headers and Input Boundary Validation

- [x] 4.1 Add global `after_request` security headers (CSP, frame, content-type, referrer, HSTS in production).
- [x] 4.2 Tighten pagination boundary handling (`page`/`page_size`) for WIP detail endpoint and add tests.

## 5. Validation and Documentation

- [x] 5.1 Run targeted backend/frontend tests plus benchmark smoke to confirm no behavior regression.
- [x] 5.2 Update `README.md` and `README.mdj` with round-2 security/stability hardening notes.

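Task 4.2's pagination tightening is a small normalization helper applying the `max(1, min(...))` bounds from the design decision. The helper name and default values are illustrative assumptions.

```python
def clamp_pagination(page, page_size, max_page_size=200, default_page_size=50):
    """Normalize pagination inputs before query execution.
    Non-numeric input falls back to safe defaults; numeric input is clamped
    to [1, ...] for page and [1, max_page_size] for page_size."""
    try:
        page = int(page)
    except (TypeError, ValueError):
        page = 1
    try:
        page_size = int(page_size)
    except (TypeError, ValueError):
        page_size = default_page_size
    page = max(1, page)                               # lower bound
    page_size = max(1, min(page_size, max_page_size)) # both bounds
    return page, page_size
```

Applying the clamp at the route boundary means negative offsets and oversized pages never reach the query layer, which is the requirement's intent.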
33
openspec/specs/api-safety-hygiene/spec.md
Normal file
@@ -0,0 +1,33 @@
|
||||
# api-safety-hygiene Specification

## Purpose
TBD - created by archiving change residual-hardening-round3. Update Purpose after archive.

## Requirements

### Requirement: Recursive Payload Cleaning MUST Enforce Depth Safety
Routes that normalize nested payloads MUST prevent unbounded recursion depth.

#### Scenario: Deeply nested response object
- **WHEN** NaN-cleaning helper receives deeply nested list/dict payload
- **THEN** cleaning logic MUST enforce max depth or iterative traversal and return safely without recursion failure
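The depth-safety requirement above can be sketched with an explicit stack instead of recursion. This is an illustrative helper, not the project's actual cleaner; the function name and the `MAX_CLEAN_DEPTH` value are assumptions.

```python
import math
from typing import Any

MAX_CLEAN_DEPTH = 64  # hypothetical guard; a real limit would be project-configured


def clean_nan(payload: Any, max_depth: int = MAX_CLEAN_DEPTH) -> Any:
    """Replace NaN floats with None using an explicit stack (no recursion)."""

    def scrub(value: Any) -> Any:
        return None if isinstance(value, float) and math.isnan(value) else value

    if not isinstance(payload, (dict, list)):
        return scrub(payload)

    root: Any = {} if isinstance(payload, dict) else []
    # Each stack entry: (source container, cleaned container, depth)
    stack = [(payload, root, 0)]
    while stack:
        src, dst, depth = stack.pop()
        if depth >= max_depth:
            raise ValueError("payload nesting exceeds max depth")
        items = src.items() if isinstance(src, dict) else enumerate(src)
        for key, value in items:
            if isinstance(value, dict):
                child: Any = {}
            elif isinstance(value, list):
                child = []
            else:
                child = scrub(value)
            if isinstance(dst, dict):
                dst[key] = child
            else:
                dst.append(child)
            if isinstance(value, (dict, list)):
                stack.append((value, child, depth + 1))
    return root
```

Because traversal depth is tracked per container, a hostile or accidental deeply nested payload fails fast instead of exhausting the interpreter's recursion limit.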
### Requirement: Filter Source Names MUST Be Configurable
Filter cache query sources MUST NOT rely on hardcoded view names only.

#### Scenario: Environment-specific view names
- **WHEN** deployment sets custom filter-source environment variables
- **THEN** filter cache loader MUST resolve and query configured view names

### Requirement: High-Cost APIs SHALL Apply Basic Rate Guardrails
High-cost read endpoints SHALL apply configurable request-rate guardrails to reduce abuse and accidental bursts.

#### Scenario: Burst traffic from same client
- **WHEN** a client exceeds configured request budget for guarded endpoints
- **THEN** endpoint SHALL return throttled response with clear retry guidance
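A minimal per-client request budget satisfying the guardrail scenario could look like the sliding-window sketch below. The class name, budget, and window values are illustrative assumptions, not the project's implementation.

```python
from __future__ import annotations

import time
from collections import defaultdict, deque
from threading import Lock


class FixedWindowThrottle:
    """Per-client request budget over a sliding time window (illustrative)."""

    def __init__(self, budget: int = 30, window_seconds: float = 60.0) -> None:
        self.budget = budget
        self.window = window_seconds
        self._hits: dict[str, deque[float]] = defaultdict(deque)
        self._lock = Lock()

    def allow(self, client_id: str, now: float | None = None) -> tuple[bool, float]:
        """Return (allowed, retry_after_seconds) for one request."""
        now = time.monotonic() if now is None else now
        with self._lock:
            hits = self._hits[client_id]
            while hits and now - hits[0] >= self.window:
                hits.popleft()  # drop hits that fell outside the window
            if len(hits) >= self.budget:
                # Retry guidance: time until the oldest hit expires
                return False, self.window - (now - hits[0])
            hits.append(now)
            return True, 0.0
```

The returned `retry_after_seconds` maps naturally onto a `Retry-After` header in the throttled response.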
### Requirement: Common Boolean Query Parsing SHALL Be Shared
Boolean query parsing in routes SHALL use shared helper behavior.

#### Scenario: Different routes parse include flags
- **WHEN** routes parse common boolean query parameters
- **THEN** parsing behavior MUST be consistent across routes via shared utility
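The shared utility could be as small as the sketch below; the accepted truthy/falsy value sets are an assumption, not the project's definition.

```python
from __future__ import annotations

_TRUTHY = {"1", "true", "yes", "on", "y"}
_FALSY = {"0", "false", "no", "off", "n", ""}


def parse_bool_param(raw: str | None, default: bool = False) -> bool:
    """Shared boolean query-parameter parsing so all routes agree."""
    if raw is None:
        return default
    value = raw.strip().lower()
    if value in _TRUTHY:
        return True
    if value in _FALSY:
        return False
    return default  # unrecognized values fall back rather than erroring
```

Routes would call this with e.g. `parse_bool_param(request.args.get("include_hold"))` instead of re-implementing the comparison inline.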
26
openspec/specs/cache-indexed-query-acceleration/spec.md
Normal file
@@ -0,0 +1,26 @@
# cache-indexed-query-acceleration Specification

## Purpose
TBD - created by archiving change p1-cache-query-efficiency. Update Purpose after archive.

## Requirements

### Requirement: Incremental Synchronization SHALL Use Versioned Watermarks
For heavy non-full-snapshot datasets, cache refresh SHALL support incremental synchronization keyed by stable version or watermark boundaries.

#### Scenario: Incremental refresh cycle
- **WHEN** source data version indicates partial changes since last sync
- **THEN** cache update logic MUST fetch and merge only changed partitions while preserving correctness guarantees
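The watermark-keyed refresh above can be sketched as: compare per-partition versions against stored watermarks and fetch only what advanced. The class and method names are illustrative assumptions.

```python
from dataclasses import dataclass, field
from typing import Any


@dataclass
class WatermarkSync:
    """Merge only partitions whose version advanced past the stored watermark."""

    watermarks: dict = field(default_factory=dict)
    cache: dict = field(default_factory=dict)

    def refresh(self, source_versions: dict, fetch_partition) -> list:
        """Return the partitions actually fetched this cycle."""
        changed = [
            part for part, version in source_versions.items()
            if version > self.watermarks.get(part, -1)
        ]
        for part in changed:
            self.cache[part] = fetch_partition(part)  # only changed data is loaded
            self.watermarks[part] = source_versions[part]
        return changed
```

Unchanged partitions keep their previously merged payloads, so a cycle where nothing moved costs no fetches at all.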
### Requirement: Query Paths SHALL Use Indexed Access for High-Frequency Filters
Query execution over cached data SHALL use prebuilt indexes for known high-frequency filter columns.

#### Scenario: Filtered report query
- **WHEN** request filters target indexed fields
- **THEN** result selection MUST avoid full dataset scans and maintain existing response contract

### Requirement: Business-Mandated Full-Table Caches SHALL Be Preserved for Resource and WIP
The system SHALL continue to maintain full-table cache behavior for `resource` and `wip` domains.

#### Scenario: Resource or WIP cache refresh
- **WHEN** cache update runs for `resource` or `wip`
- **THEN** the updater MUST retain full-table snapshot semantics and MUST NOT switch these domains to partial-only cache mode
@@ -36,3 +36,53 @@ The system MUST define alert thresholds for sustained degraded state, repeated w
- **WHEN** degraded status persists beyond configured duration
- **THEN** the monitoring contract MUST classify the service as alert-worthy with actionable context

### Requirement: Cache Telemetry SHALL Include Memory Amplification Signals
Operational telemetry MUST expose cache-domain memory usage indicators and representation amplification factors, and MUST differentiate between authoritative data payload and derived/index helper structures.

#### Scenario: Deep health telemetry request after representation normalization
- **WHEN** operators inspect cache telemetry for resource or WIP domains
- **THEN** telemetry MUST include per-domain memory footprint, amplification indicators, and enough structure detail to verify that full-record duplication is not reintroduced

### Requirement: Efficiency Benchmarks SHALL Gate Cache Refactor Rollout
Cache/query efficiency changes MUST be validated against baseline latency and memory benchmarks before rollout.

#### Scenario: Pre-release validation
- **WHEN** cache refactor changes are prepared for deployment
- **THEN** benchmark results MUST demonstrate no regression beyond configured thresholds for P95 latency and memory usage
### Requirement: Process-Level Cache SHALL Use Bounded Capacity with Deterministic Eviction
Process-level parsed-data caches MUST enforce a configurable maximum key capacity and use deterministic eviction behavior when capacity is exceeded.

#### Scenario: Cache capacity reached
- **WHEN** a new cache entry is inserted and key capacity is at limit
- **THEN** cache MUST evict entries according to defined policy before storing the new key

#### Scenario: Repeated access updates recency
- **WHEN** an existing cache key is read or overwritten
- **THEN** eviction order MUST reflect recency semantics so hot keys are retained preferentially
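Both scenarios above (bounded capacity, recency on read and overwrite) fall out naturally from an `OrderedDict`-based LRU. This is a generic sketch, not the project's `ProcessLevelCache`.

```python
from __future__ import annotations

from collections import OrderedDict
from typing import Any


class BoundedLRUCache:
    """max_size + LRU sketch: reads and overwrites both refresh recency."""

    def __init__(self, max_size: int = 128) -> None:
        self.max_size = max_size
        self._data: OrderedDict = OrderedDict()

    def get(self, key: str) -> Any | None:
        if key not in self._data:
            return None
        self._data.move_to_end(key)  # a read keeps the key hot
        return self._data[key]

    def put(self, key: str, value: Any) -> None:
        if key in self._data:
            self._data.move_to_end(key)  # overwrite also refreshes recency
        self._data[key] = value
        while len(self._data) > self.max_size:
            self._data.popitem(last=False)  # evict least-recently-used first
```

Eviction is deterministic: the entry at the front of the ordered dict (least recently touched) is always dropped first.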
### Requirement: Cache Publish MUST Preserve Previous Readable Snapshot on Failure
When refreshing full-table cache payloads, the system MUST avoid exposing partially published states to readers.

#### Scenario: Publish fails after payload serialization
- **WHEN** a cache refresh has prepared new payload but publish operation fails
- **THEN** previously published cache keys MUST remain readable and metadata MUST remain consistent with old snapshot

#### Scenario: Publish succeeds
- **WHEN** publish operation completes successfully
- **THEN** data payload and metadata keys MUST be visible as one coherent new snapshot
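One common way to meet this requirement is a pointer-flip publish: stage the full new snapshot under a fresh version, then switch a single current-version pointer. The sketch below is an in-memory illustration of that pattern, not the project's Redis publish path.

```python
from __future__ import annotations

from typing import Any


class SnapshotStore:
    """Pointer-flip publish: readers see the old snapshot until commit succeeds."""

    def __init__(self) -> None:
        self._snapshots: dict = {}
        self._current: int | None = None

    def publish(self, payload: dict, metadata: dict) -> None:
        version = (self._current or 0) + 1
        # Stage the full snapshot first; a failure here leaves _current untouched,
        # so the previous snapshot stays readable and coherent.
        self._snapshots[version] = {"payload": payload, "metadata": metadata}
        self._current = version  # single flip exposes payload + metadata together

    def read(self) -> dict | None:
        return None if self._current is None else self._snapshots[self._current]
```

In a Redis-backed cache the same shape appears as versioned keys plus an atomically updated pointer key (or a MULTI/EXEC rename), so payload and metadata can never be observed half-updated.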
### Requirement: Process-Level Cache Slow Path SHALL Minimize Lock Hold Time
Large payload parsing MUST NOT happen inside long-held process cache locks.

#### Scenario: Cache miss under concurrent requests
- **WHEN** multiple requests hit process cache miss
- **THEN** parsing work SHALL happen outside lock-protected mutation section, and lock scope SHALL be limited to consistency check + commit

### Requirement: Process-Level Cache Policies MUST Stay Consistent Across Services
All service-local process caches MUST support bounded capacity with deterministic eviction.

#### Scenario: Realtime equipment cache growth
- **WHEN** realtime equipment process cache reaches configured capacity
- **THEN** entries MUST be evicted according to deterministic LRU behavior
@@ -24,3 +24,17 @@ Runbooks and deployment documentation MUST describe the same conda/systemd/watch
- **WHEN** an operator performs deploy, health check, and rollback from documentation
- **THEN** documented commands and paths MUST work without requiring venv-specific assumptions

### Requirement: Runtime Path Drift SHALL Be Detectable Before Service Start
Service startup checks MUST validate configured conda runtime paths across app, watchdog, and worker control scripts.

#### Scenario: Conda path mismatch detected
- **WHEN** startup validation finds runtime path inconsistency between configured units and scripts
- **THEN** service start MUST fail with actionable diagnostics instead of running with partial mismatch

### Requirement: Conda/Systemd Contract SHALL Be Versioned in Operations Docs
The documented runtime contract MUST include versioned path assumptions and verification commands.

#### Scenario: Operator verifies deployment contract
- **WHEN** operator follows runbook validation steps
- **THEN** commands MUST confirm active runtime paths match documented conda/systemd contract
@@ -50,3 +50,17 @@ Frontend matrix/filter computations SHALL produce deterministic selection and fi
- **WHEN** users toggle matrix cells across group, family, and resource rows
- **THEN** selected-state rendering and filtered equipment result sets MUST remain level-correct and reversible

### Requirement: Reusable Browser Compute Modules SHALL Power Report Derivations
Derived computations for report filters, KPI cards, chart series, and table projections SHALL be implemented through reusable frontend modules.

#### Scenario: Shared report derivation logic
- **WHEN** multiple report pages require equivalent data-shaping behavior
- **THEN** pages MUST consume shared compute modules instead of duplicating transformation logic per page

### Requirement: Browser Compute Shift SHALL Preserve Export and Field Contracts
Moving computations to frontend MUST preserve existing field naming and export column contracts.

#### Scenario: User exports report after frontend-side derivation
- **WHEN** transformed data is rendered and exported
- **THEN** exported field names and ordering MUST remain consistent with governed field contract definitions
@@ -0,0 +1,19 @@

# maintainability-type-and-constant-hygiene Specification

## Purpose
TBD - created by archiving change residual-hardening-round4. Update Purpose after archive.

## Requirements

### Requirement: Core Cache and Service Boundaries MUST Use Consistent Type Annotation Style
Core cache/service modules touched by this change SHALL use a consistent and explicit type-annotation style for public and internal helper boundaries.

#### Scenario: Reviewing updated cache/service modules
- **WHEN** maintainers inspect function signatures in affected modules
- **THEN** optional and collection types MUST follow a single consistent style and remain compatible with the project Python baseline

### Requirement: High-Frequency Magic Numbers MUST Be Replaced by Named Constants
Cache, throttling, and index-related numeric literals that control behavior MUST be extracted to named constants or env-configurable settings.

#### Scenario: Tuning cache/index behavior
- **WHEN** operators need to tune cache/index thresholds
- **THEN** they MUST find values in named constants or environment variables rather than scattered inline literals
19
openspec/specs/oracle-query-fragment-governance/spec.md
Normal file
@@ -0,0 +1,19 @@
# oracle-query-fragment-governance Specification

## Purpose
TBD - created by archiving change residual-hardening-round4. Update Purpose after archive.

## Requirements

### Requirement: Shared Oracle Query Fragments SHALL Have a Single Source of Truth
Cross-service Oracle query fragments for resource and equipment cache loading MUST be defined in a shared module and imported by service implementations.

#### Scenario: Update common table/view reference
- **WHEN** a common table or view name changes
- **THEN** operators and developers MUST be able to update one shared definition without editing duplicated SQL literals across services

### Requirement: Service Queries MUST Preserve Existing Columns and Semantics
Services consuming shared Oracle query fragments SHALL preserve existing selected columns, filters, and downstream payload behavior.

#### Scenario: Resource and equipment cache refresh after refactor
- **WHEN** cache services execute queries via shared fragments
- **THEN** resulting payload structure MUST remain compatible with existing aggregation and API contracts
@@ -0,0 +1,26 @@

# resource-cache-representation-normalization Specification

## Purpose
TBD - created by archiving change residual-hardening-round4. Update Purpose after archive.

## Requirements

### Requirement: Resource Derived Index MUST Avoid Full Record Duplication
Resource derived index SHALL use lightweight row-position references instead of storing full duplicated record payloads alongside the process DataFrame cache.

#### Scenario: Build index from cached DataFrame
- **WHEN** resource cache data is parsed from Redis into process-level DataFrame
- **THEN** the derived index MUST store position-based references and metadata without a second full records copy

### Requirement: Resource Query APIs SHALL Preserve Existing Response Contract
Resource query APIs MUST keep existing output fields and semantics after index representation normalization.

#### Scenario: Read all resources after normalization
- **WHEN** callers request all resources or filtered resource lists
- **THEN** the returned payload MUST remain field-compatible with pre-normalization responses

### Requirement: Cache Invalidation MUST Keep Index/Data Coherent
The system SHALL invalidate and rebuild DataFrame/index representations atomically at cache refresh boundaries.

#### Scenario: Redis-backed cache refresh completes
- **WHEN** a new resource cache snapshot is published
- **THEN** stale index references MUST be invalidated before subsequent reads use refreshed DataFrame data
@@ -48,3 +48,47 @@ The system MUST expose machine-readable resilience thresholds, restart-churn ind

#### Scenario: Admin status includes restart churn summary
- **WHEN** operators call `/admin/api/system-status` or `/admin/api/worker/status`
- **THEN** responses MUST include bounded restart history summary within a configured time window and indicate whether churn threshold is exceeded

### Requirement: Recovery Recommendations SHALL Reflect Self-Healing Policy State
Health and admin resilience payloads MUST expose whether automated recovery is allowed, cooling down, or blocked by churn policy.

#### Scenario: Operator inspects degraded state
- **WHEN** `/health` or `/admin/api/worker/status` is requested during degradation
- **THEN** response MUST include policy state, cooldown remaining time, and next recommended action

### Requirement: Manual Recovery Override SHALL Be Explicit and Controlled
Manual restart actions MUST bypass automatic block only through authenticated operator pathways with explicit acknowledgement.

#### Scenario: Churn-blocked state with manual override request
- **WHEN** authorized admin requests manual restart while auto-recovery is blocked
- **THEN** system MUST execute controlled restart path and log the override context for auditability
### Requirement: Circuit Breaker State Transitions SHALL Avoid Lock-Held Logging
Circuit breaker state transitions MUST avoid executing logger I/O while internal state locks are held.

#### Scenario: State transition occurs
- **WHEN** circuit breaker transitions between CLOSED, OPEN, or HALF_OPEN
- **THEN** lock-protected section MUST complete state mutation before emitting transition log output

#### Scenario: Slow log handler under load
- **WHEN** logger handlers are slow or blocked
- **THEN** circuit breaker lock contention MUST remain bounded and MUST NOT serialize unrelated request paths behind logging latency
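The pattern both scenarios describe is: capture the transition under the lock, release, then log. A minimal sketch (class and attribute names are illustrative, not the project's breaker):

```python
import logging
from threading import Lock

logger = logging.getLogger(__name__)


class Breaker:
    """Mutate state under the lock; emit the transition log after release."""

    def __init__(self) -> None:
        self._lock = Lock()
        self.state = "CLOSED"

    def transition(self, new_state: str) -> str:
        with self._lock:
            # Only the state mutation is serialized.
            old_state, self.state = self.state, new_state
        # Logger I/O happens here, outside the lock, so a slow or blocked
        # handler cannot stall other threads waiting on the breaker.
        logger.info("circuit breaker %s -> %s", old_state, new_state)
        return old_state
```

The trade-off is that log ordering is no longer guaranteed to match mutation ordering under heavy concurrency, which is usually acceptable for transition telemetry.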
### Requirement: Health Endpoints SHALL Use Short Internal Memoization
Health and deep-health computation SHALL use a short-lived internal cache to prevent probe storms from amplifying backend load.

#### Scenario: Frequent monitor scrapes
- **WHEN** health endpoints are called repeatedly within a small window
- **THEN** service SHALL return memoized payload for up to 5 seconds in non-testing environments

#### Scenario: Testing mode
- **WHEN** app is running in testing mode
- **THEN** health endpoint memoization MUST be bypassed to preserve deterministic tests
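A TTL-memoized probe with a testing bypass can be sketched as a small wrapper; the 5-second TTL mirrors the requirement, while the function name and injectable clock are assumptions for testability.

```python
import time
from typing import Any, Callable


def memoize_health(compute: Callable[[], dict],
                   ttl: float = 5.0,
                   testing: bool = False,
                   clock: Callable[[], float] = time.monotonic) -> Callable[[], dict]:
    """Wrap a health computation with a short TTL cache; bypass in testing mode."""
    cached: dict = {}
    stamp = [-float("inf")]  # timestamp of the last real computation

    def probe() -> dict:
        now = clock()
        if testing or now - stamp[0] >= ttl:
            cached.clear()
            cached.update(compute())
            stamp[0] = now
        return dict(cached)  # return a copy so callers cannot mutate the cache

    return probe
```

A probe storm within the TTL window then costs one backend computation instead of one per scrape.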
### Requirement: Logs MUST Redact Connection Secrets
Runtime logs MUST avoid exposing DB connection credentials.

#### Scenario: Connection string appears in log message
- **WHEN** a log message contains DB URL credentials
- **THEN** logger output MUST redact password and sensitive userinfo before emission
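Redaction of the userinfo password in a DB URL can be done with a single substitution before the record is emitted (for example inside a logging filter). The regex below is an illustrative sketch, not the project's exact pattern.

```python
import re

# Matches the password segment in URLs like scheme://user:secret@host/db
_DB_PASSWORD = re.compile(r"(//[^/:@\s]+:)[^@\s]+(@)")


def redact_db_url(message: str) -> str:
    """Replace URL passwords with *** before a log record is emitted."""
    return _DB_PASSWORD.sub(r"\1***\2", message)
```

Messages without a `user:password@` segment pass through unchanged, so the filter is safe to apply to every record.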
38
openspec/specs/security-surface-hardening/spec.md
Normal file
@@ -0,0 +1,38 @@
# security-surface-hardening Specification

## Purpose
TBD - created by archiving change security-stability-hardening-round2. Update Purpose after archive.

## Requirements

### Requirement: LDAP Authentication Endpoint Configuration SHALL Be Strictly Validated
The system MUST validate LDAP authentication endpoint configuration before use, including HTTPS scheme enforcement and host allowlist checks.

#### Scenario: Invalid LDAP URL configuration detected
- **WHEN** `LDAP_API_URL` is missing, non-HTTPS, or points to a host outside the configured allowlist
- **THEN** the service MUST reject LDAP authentication calls and emit actionable diagnostics without sending credentials to that endpoint

#### Scenario: Valid LDAP URL configuration accepted
- **WHEN** `LDAP_API_URL` uses HTTPS and host is allowlisted
- **THEN** LDAP authentication requests MAY proceed with normal timeout and error handling behavior
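The strict validation above amounts to three checks before any credentials leave the process: the URL exists, the scheme is `https`, and the host is allowlisted. A sketch (the function name and return shape are assumptions, not the project's auth service API):

```python
from __future__ import annotations

from urllib.parse import urlsplit


def validate_ldap_api_url(url: str | None, allowed_hosts: set) -> tuple:
    """Return (ok, reason); reject before any outbound auth call is made."""
    if not url:
        return False, "LDAP_API_URL is not configured"
    parts = urlsplit(url)
    if parts.scheme != "https":
        return False, f"scheme must be https, got {parts.scheme!r}"
    if parts.hostname not in allowed_hosts:
        return False, f"host {parts.hostname!r} is not in LDAP_ALLOWED_HOSTS"
    return True, "ok"
```

Because `urlsplit` extracts the hostname before any port or path, a URL like `https://evil.example/auth?x=ldap.corp.example` cannot spoof its way past the allowlist check.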
### Requirement: Security Response Headers SHALL Be Applied Globally
All HTTP responses MUST include baseline security headers suitable for dashboard and API traffic.

#### Scenario: Standard response emitted
- **WHEN** any route returns a response
- **THEN** response MUST include `Content-Security-Policy`, `X-Frame-Options`, `X-Content-Type-Options`, and `Referrer-Policy`

#### Scenario: Production transport hardening
- **WHEN** runtime environment is production
- **THEN** response MUST include `Strict-Transport-Security`
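The header set can be built as a plain function and applied from a global response hook (in Flask, typically an `after_request` handler). The specific CSP, referrer, and HSTS values below are placeholder assumptions, not the project's policy.

```python
def build_security_headers(production: bool) -> dict:
    """Baseline security header set; values here are illustrative defaults."""
    headers = {
        "Content-Security-Policy": "default-src 'self'",
        "X-Frame-Options": "DENY",
        "X-Content-Type-Options": "nosniff",
        "Referrer-Policy": "same-origin",
    }
    if production:
        # HSTS is only meaningful when TLS is terminated for this origin,
        # so it is gated on the production environment.
        headers["Strict-Transport-Security"] = "max-age=31536000; includeSubDomains"
    return headers
```

In a Flask app the hook would iterate this dict and set each header on every outgoing response object.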
### Requirement: Pagination Input Boundaries SHALL Be Enforced
Endpoints accepting pagination parameters MUST enforce lower and upper bounds before query execution.

#### Scenario: Negative or zero pagination inputs
- **WHEN** client sends `page <= 0` or `page_size <= 0`
- **THEN** server MUST normalize values to minimum supported bounds

#### Scenario: Excessive page size requested
- **WHEN** client sends `page_size` above configured maximum
- **THEN** server MUST clamp to maximum supported page size
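Both scenarios reduce to a clamp applied before the query runs. A sketch, where the `max_page_size` default of 200 is an assumption rather than the project's configured maximum:

```python
def clamp_pagination(page: int, page_size: int, max_page_size: int = 200) -> tuple:
    """Normalize pagination inputs before they reach query execution."""
    page = max(page, 1)            # page <= 0 becomes the first page
    page_size = max(page_size, 1)  # page_size <= 0 becomes the minimum of 1
    page_size = min(page_size, max_page_size)  # clamp oversized requests
    return page, page_size
```

Clamping (rather than rejecting) keeps the endpoint tolerant of sloppy clients while still bounding query cost.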
26
openspec/specs/worker-self-healing-governance/spec.md
Normal file
@@ -0,0 +1,26 @@
# worker-self-healing-governance Specification

## Purpose
TBD - created by archiving change p2-ops-self-healing-runbook. Update Purpose after archive.

## Requirements

### Requirement: Automated Worker Recovery SHALL Use Bounded Policy Guards
Automated worker restart behavior MUST enforce cooldown periods and bounded restart attempts within a configurable time window.

#### Scenario: Repeated worker degradation within short window
- **WHEN** degradation events exceed configured restart-attempt budget
- **THEN** automated restarts MUST pause and surface a blocked-recovery signal for operator intervention

### Requirement: Restart-Churn Protection SHALL Prevent Recovery Storms
The runtime MUST classify restart churn and prevent uncontrolled restart loops.

#### Scenario: Churn threshold exceeded
- **WHEN** restart count crosses churn threshold in active window
- **THEN** watchdog MUST enter guarded mode and require explicit manual override before further restart attempts

### Requirement: Recovery Decisions SHALL Be Audit-Ready
Every auto-recovery decision and manual override action MUST be recorded with structured metadata.

#### Scenario: Worker restart decision emitted
- **WHEN** system executes or denies a restart action
- **THEN** structured logs/events MUST include reason, thresholds, actor/source, and resulting state
223
scripts/run_cache_benchmarks.py
Executable file
@@ -0,0 +1,223 @@
#!/usr/bin/env python3
# -*- coding: utf-8 -*-
"""Benchmark cache query baseline vs indexed selection.

This benchmark is used as a repeatable governance harness for P1 cache/query
efficiency work. It focuses on deterministic synthetic workloads so operators
can compare relative latency and memory amplification over time.
"""

from __future__ import annotations

import argparse
import json
import math
import random
import statistics
import time
from pathlib import Path
from typing import Any

import numpy as np
import pandas as pd

ROOT = Path(__file__).resolve().parents[1]
FIXTURE_PATH = ROOT / "tests" / "fixtures" / "cache_benchmark_fixture.json"


def load_fixture(path: Path = FIXTURE_PATH) -> dict[str, Any]:
    payload = json.loads(path.read_text())
    if "rows" not in payload:
        raise ValueError("fixture requires rows")
    return payload


def build_dataset(rows: int, seed: int) -> pd.DataFrame:
    random.seed(seed)
    np.random.seed(seed)

    workcenters = [f"WC-{idx:02d}" for idx in range(1, 31)]
    packages = ["QFN", "DFN", "SOT", "SOP", "BGA", "TSOP"]
    types = ["TYPE-A", "TYPE-B", "TYPE-C", "TYPE-D"]
    statuses = ["RUN", "QUEUE", "HOLD"]
    hold_reasons = ["", "", "", "YieldLimit", "特殊需求管控", "PM Hold"]

    frame = pd.DataFrame(
        {
            "WORKCENTER_GROUP": np.random.choice(workcenters, rows),
            "PACKAGE_LEF": np.random.choice(packages, rows),
            "PJ_TYPE": np.random.choice(types, rows),
            "WIP_STATUS": np.random.choice(statuses, rows, p=[0.45, 0.35, 0.20]),
            "HOLDREASONNAME": np.random.choice(hold_reasons, rows),
            "QTY": np.random.randint(1, 500, rows),
            "WORKORDER": [f"WO-{i:06d}" for i in range(rows)],
            "LOTID": [f"LOT-{i:07d}" for i in range(rows)],
        }
    )
    return frame


def _build_index(df: pd.DataFrame) -> dict[str, dict[str, set[int]]]:
    def by_column(column: str) -> dict[str, set[int]]:
        grouped = df.groupby(column, dropna=True, sort=False).indices
        return {str(k): {int(i) for i in v} for k, v in grouped.items()}

    return {
        "workcenter": by_column("WORKCENTER_GROUP"),
        "package": by_column("PACKAGE_LEF"),
        "type": by_column("PJ_TYPE"),
        "status": by_column("WIP_STATUS"),
    }


def _baseline_query(df: pd.DataFrame, query: dict[str, str]) -> int:
    subset = df
    if query.get("workcenter"):
        subset = subset[subset["WORKCENTER_GROUP"] == query["workcenter"]]
    if query.get("package"):
        subset = subset[subset["PACKAGE_LEF"] == query["package"]]
    if query.get("type"):
        subset = subset[subset["PJ_TYPE"] == query["type"]]
    if query.get("status"):
        subset = subset[subset["WIP_STATUS"] == query["status"]]
    return int(len(subset))


def _indexed_query(_df: pd.DataFrame, indexes: dict[str, dict[str, set[int]]], query: dict[str, str]) -> int:
    selected: set[int] | None = None
    for key, bucket in (
        ("workcenter", "workcenter"),
        ("package", "package"),
        ("type", "type"),
        ("status", "status"),
    ):
        current = indexes[bucket].get(query.get(key, ""))
        if current is None:
            return 0
        if selected is None:
            selected = set(current)
        else:
            selected.intersection_update(current)
        if not selected:
            return 0
    return len(selected or ())


def _build_queries(df: pd.DataFrame, query_count: int, seed: int) -> list[dict[str, str]]:
    random.seed(seed + 17)
    workcenters = sorted(df["WORKCENTER_GROUP"].dropna().astype(str).unique().tolist())
    packages = sorted(df["PACKAGE_LEF"].dropna().astype(str).unique().tolist())
    types = sorted(df["PJ_TYPE"].dropna().astype(str).unique().tolist())
    statuses = sorted(df["WIP_STATUS"].dropna().astype(str).unique().tolist())

    queries: list[dict[str, str]] = []
    for _ in range(query_count):
        queries.append(
            {
                "workcenter": random.choice(workcenters),
                "package": random.choice(packages),
                "type": random.choice(types),
                "status": random.choice(statuses),
            }
        )
    return queries


def _p95(values: list[float]) -> float:
    if not values:
        return 0.0
    sorted_values = sorted(values)
    index = min(max(math.ceil(0.95 * len(sorted_values)) - 1, 0), len(sorted_values) - 1)
    return sorted_values[index]


def run_benchmark(rows: int, query_count: int, seed: int) -> dict[str, Any]:
    df = build_dataset(rows=rows, seed=seed)
    queries = _build_queries(df, query_count=query_count, seed=seed)
    indexes = _build_index(df)

    baseline_latencies: list[float] = []
    indexed_latencies: list[float] = []
    baseline_rows: list[int] = []
    indexed_rows: list[int] = []

    for query in queries:
        start = time.perf_counter()
        baseline_rows.append(_baseline_query(df, query))
        baseline_latencies.append((time.perf_counter() - start) * 1000)

        start = time.perf_counter()
        indexed_rows.append(_indexed_query(df, indexes, query))
        indexed_latencies.append((time.perf_counter() - start) * 1000)

    if baseline_rows != indexed_rows:
        raise AssertionError("benchmark correctness drift: indexed result mismatch")

    frame_bytes = int(df.memory_usage(index=True, deep=True).sum())
    index_entries = sum(len(bucket) for buckets in indexes.values() for bucket in buckets.values())
    index_bytes_estimate = int(index_entries * 16)

    baseline_p95 = _p95(baseline_latencies)
    indexed_p95 = _p95(indexed_latencies)

    return {
        "rows": rows,
        "query_count": query_count,
        "seed": seed,
        "latency_ms": {
            "baseline_avg": round(statistics.fmean(baseline_latencies), 4),
            "baseline_p95": round(baseline_p95, 4),
            "indexed_avg": round(statistics.fmean(indexed_latencies), 4),
            "indexed_p95": round(indexed_p95, 4),
            "p95_ratio_indexed_vs_baseline": round(
                (indexed_p95 / baseline_p95) if baseline_p95 > 0 else 0.0,
                4,
            ),
        },
        "memory_bytes": {
            "frame": frame_bytes,
            "index_estimate": index_bytes_estimate,
            "amplification_ratio": round(
                (frame_bytes + index_bytes_estimate) / max(frame_bytes, 1),
                4,
            ),
        },
    }


def main() -> int:
    fixture = load_fixture()

    parser = argparse.ArgumentParser(description="Run cache baseline vs indexed benchmark")
    parser.add_argument("--rows", type=int, default=int(fixture.get("rows", 30000)))
    parser.add_argument("--queries", type=int, default=int(fixture.get("query_count", 400)))
    parser.add_argument("--seed", type=int, default=int(fixture.get("seed", 42)))
    parser.add_argument("--enforce", action="store_true")
    args = parser.parse_args()

    report = run_benchmark(rows=args.rows, query_count=args.queries, seed=args.seed)
    print(json.dumps(report, ensure_ascii=False, indent=2))

    if not args.enforce:
        return 0

    thresholds = fixture.get("thresholds") or {}
    max_latency_ratio = float(thresholds.get("max_p95_ratio_indexed_vs_baseline", 1.25))
    max_amplification = float(thresholds.get("max_memory_amplification_ratio", 1.8))

    latency_ratio = float(report["latency_ms"]["p95_ratio_indexed_vs_baseline"])
    amplification_ratio = float(report["memory_bytes"]["amplification_ratio"])

    if latency_ratio > max_latency_ratio:
        raise SystemExit(
            f"Latency regression: {latency_ratio:.4f} > max allowed {max_latency_ratio:.4f}"
        )
    if amplification_ratio > max_amplification:
        raise SystemExit(
            f"Memory amplification regression: {amplification_ratio:.4f} > max allowed {max_amplification:.4f}"
        )
    return 0


if __name__ == "__main__":
    raise SystemExit(main())
40
scripts/start_server.sh
Normal file → Executable file
@@ -9,7 +9,7 @@ set -uo pipefail
# Configuration
|
||||
# ============================================================
|
||||
ROOT="$(cd "$(dirname "${BASH_SOURCE[0]}")/.." && pwd)"
|
||||
CONDA_ENV="mes-dashboard"
|
||||
CONDA_ENV="${CONDA_ENV_NAME:-mes-dashboard}"
|
||||
APP_NAME="mes-dashboard"
|
||||
PID_FILE_DEFAULT="${ROOT}/tmp/gunicorn.pid"
PID_FILE="${WATCHDOG_PID_FILE:-${PID_FILE_DEFAULT}}"
@@ -56,7 +56,7 @@ timestamp() {
resolve_runtime_paths() {
    WATCHDOG_RUNTIME_DIR="${WATCHDOG_RUNTIME_DIR:-${ROOT}/tmp}"
    WATCHDOG_RESTART_FLAG="${WATCHDOG_RESTART_FLAG:-${WATCHDOG_RUNTIME_DIR}/mes_dashboard_restart.flag}"
    WATCHDOG_PID_FILE="${WATCHDOG_PID_FILE:-${PID_FILE_DEFAULT}}"
    WATCHDOG_PID_FILE="${WATCHDOG_PID_FILE:-${WATCHDOG_RUNTIME_DIR}/gunicorn.pid}"
    WATCHDOG_STATE_FILE="${WATCHDOG_STATE_FILE:-${WATCHDOG_RUNTIME_DIR}/mes_dashboard_restart_state.json}"
    PID_FILE="${WATCHDOG_PID_FILE}"
    export WATCHDOG_RUNTIME_DIR WATCHDOG_RESTART_FLAG WATCHDOG_PID_FILE WATCHDOG_STATE_FILE
@@ -81,8 +81,14 @@ check_conda() {
        return 1
    fi

    if [ -n "${CONDA_BIN:-}" ] && [ ! -x "${CONDA_BIN}" ]; then
        log_error "CONDA_BIN is set but not executable: ${CONDA_BIN}"
        return 1
    fi

    # Source conda
    source "$(conda info --base)/etc/profile.d/conda.sh"
    local conda_cmd="${CONDA_BIN:-$(command -v conda)}"
    source "$(${conda_cmd} info --base)/etc/profile.d/conda.sh"

    # Check if environment exists
    if ! conda env list | grep -q "^${CONDA_ENV} "; then
@@ -95,6 +101,33 @@ check_conda() {
    return 0
}

validate_runtime_contract() {
    conda activate "$CONDA_ENV"
    export PYTHONPATH="${ROOT}/src:${PYTHONPATH:-}"

    if python - <<'PY'
import os
import sys

from mes_dashboard.core.runtime_contract import build_runtime_contract_diagnostics

strict = os.getenv("RUNTIME_CONTRACT_ENFORCE", "true").strip().lower() in {"1", "true", "yes", "on"}
diag = build_runtime_contract_diagnostics(strict=strict)
if not diag["valid"]:
    for error in diag["errors"]:
        print(f"RUNTIME_CONTRACT_ERROR: {error}")
    raise SystemExit(1)
PY
    then
        log_success "Runtime contract validation passed"
        return 0
    fi

    log_error "Runtime contract validation failed"
    log_info "Fix env vars: WATCHDOG_RUNTIME_DIR / WATCHDOG_RESTART_FLAG / WATCHDOG_PID_FILE / WATCHDOG_STATE_FILE / CONDA_BIN"
    return 1
}
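The heredoc above delegates the actual check to `build_runtime_contract_diagnostics` in `mes_dashboard.core.runtime_contract`. As an illustration only, a minimal sketch of what such a diagnostics builder might look like — the function name `build_diagnostics_sketch` and the specific field checks are assumptions, not the project's implementation:

```python
def build_diagnostics_sketch(env: dict[str, str], strict: bool = True) -> dict:
    """Hypothetical stand-in for build_runtime_contract_diagnostics():
    checks that the watchdog paths are set and agree with the runtime dir."""
    errors = []
    runtime_dir = env.get("WATCHDOG_RUNTIME_DIR", "")
    for name in ("WATCHDOG_RESTART_FLAG", "WATCHDOG_PID_FILE", "WATCHDOG_STATE_FILE"):
        value = env.get(name, "")
        if not value:
            errors.append(f"{name} is not set")
        elif strict and runtime_dir and not value.startswith(runtime_dir):
            # Drift: a path configured outside the shared runtime directory.
            errors.append(f"{name} drifted outside WATCHDOG_RUNTIME_DIR: {value}")
    return {"valid": not errors, "strict": strict, "errors": errors}
```

The shell wrapper only consumes `valid` and `errors`, so any builder returning that shape would fail the startup in the same way.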

check_dependencies() {
    conda activate "$CONDA_ENV"

@@ -329,6 +362,7 @@ run_all_checks() {
    check_env_file
    load_env
    resolve_runtime_paths
    validate_runtime_contract || return 1
    check_port || return 1
    check_database
    check_redis

177
scripts/worker_watchdog.py
Normal file → Executable file
@@ -31,6 +31,23 @@ import time
from datetime import datetime
from pathlib import Path

PROJECT_ROOT = Path(__file__).resolve().parents[1]
SRC_ROOT = PROJECT_ROOT / "src"
if str(SRC_ROOT) not in sys.path:
    sys.path.insert(0, str(SRC_ROOT))

from mes_dashboard.core.runtime_contract import (  # noqa: E402
    build_runtime_contract_diagnostics,
    load_runtime_contract,
)
from mes_dashboard.core.worker_recovery_policy import (  # noqa: E402
    decide_restart_request,
    evaluate_worker_recovery_state,
    extract_last_requested_at,
    extract_restart_history,
    get_worker_recovery_policy_config,
)

# Configure logging
logging.basicConfig(
    level=logging.INFO,
@@ -45,7 +62,10 @@ logger = logging.getLogger('mes_dashboard.watchdog')
# Configuration
# ============================================================

CHECK_INTERVAL = int(os.getenv('WATCHDOG_CHECK_INTERVAL', '5'))
_RUNTIME_CONTRACT = load_runtime_contract(project_root=PROJECT_ROOT)
CHECK_INTERVAL = int(
    os.getenv('WATCHDOG_CHECK_INTERVAL', str(_RUNTIME_CONTRACT['watchdog_check_interval']))
)


def _env_int(name: str, default: int) -> int:
@@ -55,22 +75,11 @@ def _env_int(name: str, default: int) -> int:
    return default


PROJECT_ROOT = Path(__file__).resolve().parents[1]
DEFAULT_RUNTIME_DIR = Path(
    os.getenv('WATCHDOG_RUNTIME_DIR', str(PROJECT_ROOT / 'tmp'))
)
RESTART_FLAG_PATH = os.getenv(
    'WATCHDOG_RESTART_FLAG',
    str(DEFAULT_RUNTIME_DIR / 'mes_dashboard_restart.flag')
)
GUNICORN_PID_FILE = os.getenv(
    'WATCHDOG_PID_FILE',
    str(DEFAULT_RUNTIME_DIR / 'gunicorn.pid')
)
RESTART_STATE_FILE = os.getenv(
    'WATCHDOG_STATE_FILE',
    str(DEFAULT_RUNTIME_DIR / 'mes_dashboard_restart_state.json')
)
DEFAULT_RUNTIME_DIR = Path(_RUNTIME_CONTRACT['watchdog_runtime_dir'])
RESTART_FLAG_PATH = _RUNTIME_CONTRACT['watchdog_restart_flag']
GUNICORN_PID_FILE = _RUNTIME_CONTRACT['watchdog_pid_file']
RESTART_STATE_FILE = _RUNTIME_CONTRACT['watchdog_state_file']
RUNTIME_CONTRACT_VERSION = _RUNTIME_CONTRACT['version']
RESTART_HISTORY_MAX = _env_int('WATCHDOG_RESTART_HISTORY_MAX', 50)


@@ -78,6 +87,32 @@ RESTART_HISTORY_MAX = _env_int('WATCHDOG_RESTART_HISTORY_MAX', 50)
# Watchdog Implementation
# ============================================================


def validate_runtime_contract_or_raise() -> None:
    """Fail fast if runtime contract is inconsistent."""
    strict = os.getenv("RUNTIME_CONTRACT_ENFORCE", "true").strip().lower() in {
        "1",
        "true",
        "yes",
        "on",
    }
    diagnostics = build_runtime_contract_diagnostics(strict=strict)
    if diagnostics["valid"]:
        return

    details = "; ".join(diagnostics["errors"])
    raise RuntimeError(f"Runtime contract validation failed: {details}")


def log_restart_audit(event: str, payload: dict) -> None:
    entry = {
        "event": event,
        "timestamp": datetime.utcnow().isoformat(),
        "runtime_contract_version": RUNTIME_CONTRACT_VERSION,
        **payload,
    }
    logger.info("worker_watchdog_audit %s", json.dumps(entry, ensure_ascii=False))

def get_gunicorn_pid() -> int | None:
    """Get Gunicorn master PID from PID file.

@@ -155,7 +190,12 @@ def save_restart_state(
    requested_at: str | None = None,
    requested_ip: str | None = None,
    completed_at: str | None = None,
    success: bool = True
    success: bool = True,
    source: str = "manual",
    decision: str = "allowed",
    decision_reason: str | None = None,
    manual_override: bool = False,
    policy_state: dict | None = None,
) -> None:
    """Save restart state for status queries.

@@ -173,7 +213,12 @@ def save_restart_state(
        "requested_at": requested_at,
        "requested_ip": requested_ip,
        "completed_at": completed_at,
        "success": success
        "success": success,
        "source": source,
        "decision": decision,
        "decision_reason": decision_reason,
        "manual_override": manual_override,
        "policy_state": policy_state or {},
    }
    current_state = load_restart_state()
    history = current_state.get("history", [])
@@ -229,6 +274,47 @@ def process_restart_request() -> bool:
        return False

    logger.info(f"Restart flag detected: {flag_data}")
    source = str(flag_data.get("source") or "manual").strip().lower()
    manual_override = bool(flag_data.get("manual_override"))
    override_ack = bool(flag_data.get("override_acknowledged"))
    restart_state = load_restart_state()
    restart_history = extract_restart_history(restart_state)
    policy_state = evaluate_worker_recovery_state(
        restart_history,
        last_requested_at=extract_last_requested_at(restart_state),
    )
    decision = decide_restart_request(
        policy_state,
        source=source,
        manual_override=manual_override,
        override_acknowledged=override_ack,
    )

    if not decision["allowed"]:
        remove_restart_flag()
        save_restart_state(
            requested_by=flag_data.get("user"),
            requested_at=flag_data.get("timestamp"),
            requested_ip=flag_data.get("ip"),
            completed_at=datetime.now().isoformat(),
            success=False,
            source=source,
            decision=decision["decision"],
            decision_reason=decision["reason"],
            manual_override=manual_override,
            policy_state=policy_state,
        )
        log_restart_audit(
            "restart_blocked",
            {
                "source": source,
                "actor": flag_data.get("user"),
                "ip": flag_data.get("ip"),
                "decision": decision,
                "policy_state": policy_state,
            },
        )
        return True

    # Get Gunicorn master PID
    pid = get_gunicorn_pid()
@@ -242,7 +328,22 @@ def process_restart_request() -> bool:
        requested_at=flag_data.get("timestamp"),
        requested_ip=flag_data.get("ip"),
        completed_at=datetime.now().isoformat(),
        success=False
        success=False,
        source=source,
        decision="failed",
        decision_reason="gunicorn_pid_unavailable",
        manual_override=manual_override,
        policy_state=policy_state,
    )
    log_restart_audit(
        "restart_failed",
        {
            "source": source,
            "actor": flag_data.get("user"),
            "ip": flag_data.get("ip"),
            "decision_reason": "gunicorn_pid_unavailable",
            "policy_state": policy_state,
        },
    )
    return True

@@ -258,7 +359,12 @@ def process_restart_request() -> bool:
        requested_at=flag_data.get("timestamp"),
        requested_ip=flag_data.get("ip"),
        completed_at=datetime.now().isoformat(),
        success=success
        success=success,
        source=source,
        decision="executed" if success else "failed",
        decision_reason="signal_sighup" if success else "signal_failed",
        manual_override=manual_override,
        policy_state=policy_state,
    )

    if success:
@@ -267,17 +373,44 @@ def process_restart_request() -> bool:
            f"Requested by: {flag_data.get('user', 'unknown')}, "
            f"IP: {flag_data.get('ip', 'unknown')}"
        )
        log_restart_audit(
            "restart_executed",
            {
                "source": source,
                "actor": flag_data.get("user"),
                "ip": flag_data.get("ip"),
                "manual_override": manual_override,
                "policy_state": policy_state,
            },
        )
    else:
        log_restart_audit(
            "restart_failed",
            {
                "source": source,
                "actor": flag_data.get("user"),
                "ip": flag_data.get("ip"),
                "decision_reason": "signal_failed",
                "policy_state": policy_state,
            },
        )

    return True
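The gate applied by `decide_restart_request` above can be sketched as follows. This is a hypothetical stand-in, not the project's `worker_recovery_policy` module; the function name, thresholds, and return shape are assumptions chosen to mirror the cooldown / retry-budget / manual-override behavior the diff describes:

```python
def decide_restart_sketch(history: list[float], *, now: float,
                          cooldown_s: float = 60.0, retry_budget: int = 3,
                          window_s: float = 600.0, manual_override: bool = False,
                          override_acknowledged: bool = False) -> dict:
    """Hypothetical restart gate: enforce a cooldown and a retry budget
    within a sliding window, with an acknowledged manual-override escape hatch."""
    if manual_override and override_acknowledged:
        return {"allowed": True, "decision": "override", "reason": "manual_override_ack"}
    # Only restarts inside the sliding window count against the budget.
    recent = [t for t in history if now - t <= window_s]
    if recent and now - max(recent) < cooldown_s:
        return {"allowed": False, "decision": "blocked", "reason": "cooldown_active"}
    if len(recent) >= retry_budget:
        return {"allowed": False, "decision": "blocked", "reason": "retry_budget_exhausted"}
    return {"allowed": True, "decision": "allowed", "reason": "within_policy"}
```

The watchdog above consumes exactly this kind of `allowed` / `decision` / `reason` triple when writing the audit log.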


def run_watchdog() -> None:
    """Main watchdog loop."""
    validate_runtime_contract_or_raise()
    policy = get_worker_recovery_policy_config()
    logger.info(
        f"Worker watchdog started - "
        f"Check interval: {CHECK_INTERVAL}s, "
        f"Flag path: {RESTART_FLAG_PATH}, "
        f"PID file: {GUNICORN_PID_FILE}"
        f"PID file: {GUNICORN_PID_FILE}, "
        f"Policy(cooldown={policy['cooldown_seconds']}s, "
        f"retry_budget={policy['retry_budget']}, "
        f"window={policy['window_seconds']}s, "
        f"guarded={policy['guarded_mode_enabled']})"
    )

    while True:

@@ -3,24 +3,48 @@

from __future__ import annotations

import atexit
import logging
import os
import sys
import threading

from flask import Flask, jsonify, redirect, render_template, request, session, url_for

from mes_dashboard.config.tables import TABLES_CONFIG
from mes_dashboard.config.settings import get_config
from mes_dashboard.core.cache import create_default_cache_backend
from mes_dashboard.core.database import get_table_data, get_table_columns, get_engine, init_db, start_keepalive
from mes_dashboard.core.database import (
    get_table_data,
    get_table_columns,
    get_engine,
    init_db,
    start_keepalive,
    dispose_engine,
    install_log_redaction_filter,
)
from mes_dashboard.core.permissions import is_admin_logged_in, _is_ajax_request
from mes_dashboard.core.csrf import (
    get_csrf_token,
    should_enforce_csrf,
    validate_csrf,
)
from mes_dashboard.routes import register_routes
from mes_dashboard.routes.auth_routes import auth_bp
from mes_dashboard.routes.admin_routes import admin_bp
from mes_dashboard.routes.health_routes import health_bp
from mes_dashboard.services.page_registry import get_page_status, is_api_public
from mes_dashboard.core.cache_updater import start_cache_updater, stop_cache_updater
from mes_dashboard.services.realtime_equipment_cache import init_realtime_equipment_cache
from mes_dashboard.services.realtime_equipment_cache import (
    init_realtime_equipment_cache,
    stop_equipment_status_sync_worker,
)
from mes_dashboard.core.redis_client import close_redis
from mes_dashboard.core.runtime_contract import build_runtime_contract_diagnostics


_SHUTDOWN_LOCK = threading.Lock()
_ATEXIT_REGISTERED = False


def _configure_logging(app: Flask) -> None:
@@ -63,6 +87,121 @@ def _configure_logging(app: Flask) -> None:

    # Prevent propagation to root logger (avoid duplicate logs)
    logger.propagate = False
    install_log_redaction_filter(logger)


def _is_production_env(app: Flask) -> bool:
    env_value = str(app.config.get("ENV") or os.getenv("FLASK_ENV") or "production").lower()
    return env_value in {"prod", "production"}


def _build_security_headers(production: bool) -> dict[str, str]:
    headers = {
        "Content-Security-Policy": (
            "default-src 'self'; "
            "script-src 'self' 'unsafe-inline' 'unsafe-eval'; "
            "style-src 'self' 'unsafe-inline'; "
            "img-src 'self' data: blob:; "
            "font-src 'self' data:; "
            "connect-src 'self'; "
            "frame-ancestors 'none'; "
            "base-uri 'self'; "
            "form-action 'self'"
        ),
        "X-Frame-Options": "DENY",
        "X-Content-Type-Options": "nosniff",
        "Referrer-Policy": "strict-origin-when-cross-origin",
    }
    if production:
        headers["Strict-Transport-Security"] = "max-age=31536000; includeSubDomains"
    return headers
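These defaults are later applied with `response.headers.setdefault(...)` in an `after_request` hook, so a header that a specific route already set is never overwritten. A minimal model of that semantics, using plain dicts rather than Flask response objects:

```python
def apply_default_headers(response_headers: dict[str, str],
                          defaults: dict[str, str]) -> dict[str, str]:
    """Sketch of the after_request semantics: a header already set by a
    route handler wins; global security defaults only fill the gaps."""
    for name, value in defaults.items():
        response_headers.setdefault(name, value)
    return response_headers
```

This is why a route can, for example, relax `X-Frame-Options` for an embeddable page without the global hook clobbering it.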


def _resolve_secret_key(app: Flask) -> str:
    env_name = str(app.config.get("ENV") or os.getenv("FLASK_ENV") or "development").lower()
    configured = os.environ.get("SECRET_KEY") or app.config.get("SECRET_KEY")
    insecure_defaults = {"", "dev-secret-key-change-in-prod"}

    if configured and configured not in insecure_defaults:
        return configured

    if env_name in {"production", "prod"}:
        raise RuntimeError(
            "SECRET_KEY is required in production and cannot use insecure defaults."
        )

    # Development and testing get explicit environment-safe defaults.
    if env_name in {"testing", "test"}:
        return "test-secret-key"
    return "dev-local-only-secret-key"


def _shutdown_runtime_resources() -> None:
    """Stop background workers and shared clients during app/worker shutdown."""
    logger = logging.getLogger("mes_dashboard")

    try:
        stop_cache_updater()
    except Exception as exc:
        logger.warning("Error stopping cache updater: %s", exc)

    try:
        stop_equipment_status_sync_worker()
    except Exception as exc:
        logger.warning("Error stopping equipment sync worker: %s", exc)

    try:
        close_redis()
    except Exception as exc:
        logger.warning("Error closing Redis client: %s", exc)

    try:
        dispose_engine()
    except Exception as exc:
        logger.warning("Error disposing DB engines: %s", exc)


def _register_shutdown_hooks(app: Flask) -> None:
    global _ATEXIT_REGISTERED

    app.extensions["runtime_shutdown"] = _shutdown_runtime_resources
    if app.extensions.get("runtime_shutdown_registered"):
        return

    app.extensions["runtime_shutdown_registered"] = True
    if app.testing or bool(app.config.get("TESTING")) or os.getenv("PYTEST_CURRENT_TEST"):
        return

    with _SHUTDOWN_LOCK:
        if not _ATEXIT_REGISTERED:
            atexit.register(_shutdown_runtime_resources)
            _ATEXIT_REGISTERED = True


def _is_runtime_contract_enforced(app: Flask) -> bool:
    raw = os.getenv("RUNTIME_CONTRACT_ENFORCE")
    if raw is not None:
        return raw.strip().lower() in {"1", "true", "yes", "on"}
    return _is_production_env(app)


def _validate_runtime_contract(app: Flask) -> None:
    strict = _is_runtime_contract_enforced(app)
    diagnostics = build_runtime_contract_diagnostics(strict=strict)
    app.extensions["runtime_contract"] = diagnostics["contract"]
    app.extensions["runtime_contract_validation"] = {
        "valid": diagnostics["valid"],
        "strict": diagnostics["strict"],
        "errors": diagnostics["errors"],
    }

    if diagnostics["valid"]:
        return

    message = "Runtime contract validation failed: " + "; ".join(diagnostics["errors"])
    if strict:
        raise RuntimeError(message)
    logging.getLogger("mes_dashboard").warning(message)


def create_app(config_name: str | None = None) -> Flask:
@@ -72,19 +211,22 @@ def create_app(config_name: str | None = None) -> Flask:
    config_class = get_config(config_name)
    app.config.from_object(config_class)

    # Session configuration
    app.secret_key = os.environ.get("SECRET_KEY", "dev-secret-key-change-in-prod")
    # Session configuration with environment-aware secret validation.
    app.secret_key = _resolve_secret_key(app)
    app.config["SECRET_KEY"] = app.secret_key

    # Session cookie security settings
    # SECURE: Only send cookie over HTTPS (disable for local development)
    app.config['SESSION_COOKIE_SECURE'] = os.environ.get("FLASK_ENV") == "production"
    # SECURE: Only send cookie over HTTPS in production.
    app.config['SESSION_COOKIE_SECURE'] = _is_production_env(app)
    # HTTPONLY: Prevent JavaScript access to session cookie (XSS protection)
    app.config['SESSION_COOKIE_HTTPONLY'] = True
    # SAMESITE: Prevent CSRF by restricting cross-site cookie sending
    app.config['SESSION_COOKIE_SAMESITE'] = 'Lax'
    # SAMESITE: strict in production, relaxed for local development usability.
    app.config['SESSION_COOKIE_SAMESITE'] = 'Strict' if _is_production_env(app) else 'Lax'

    # Configure logging first
    _configure_logging(app)
    _validate_runtime_contract(app)
    security_headers = _build_security_headers(_is_production_env(app))

    # Route-level cache backend (L1 memory + optional L2 Redis)
    app.extensions["cache"] = create_default_cache_backend()
@@ -96,6 +238,7 @@ def create_app(config_name: str | None = None) -> Flask:
    start_keepalive()  # Keep database connections alive
    start_cache_updater()  # Start Redis cache updater
    init_realtime_equipment_cache(app)  # Start realtime equipment status cache
    _register_shutdown_hooks(app)

    # Register API routes
    register_routes(app)
@@ -150,6 +293,34 @@ def create_app(config_name: str | None = None) -> Flask:

        return None

    @app.before_request
    def enforce_csrf():
        if not should_enforce_csrf(
            request,
            enabled=bool(app.config.get("CSRF_ENABLED", True)),
        ):
            return None

        if validate_csrf(request):
            return None

        if request.path == "/admin/login":
            return render_template("login.html", error="CSRF verification failed, please resubmit"), 403

        from mes_dashboard.core.response import error_response, FORBIDDEN

        return error_response(
            FORBIDDEN,
            "CSRF verification failed",
            status_code=403,
        )

    @app.after_request
    def apply_security_headers(response):
        for header, value in security_headers.items():
            response.headers.setdefault(header, value)
        return response

    # ========================================================
    # Template Context Processor
    # ========================================================
@@ -185,6 +356,7 @@ def create_app(config_name: str | None = None) -> Flask:
        "admin_user": session.get("admin"),
        "can_view_page": can_view_page,
        "frontend_asset": frontend_asset,
        "csrf_token": get_csrf_token,
    }

    # ========================================================

@@ -20,6 +20,13 @@ def _float_env(name: str, default: float) -> float:
    return default


def _bool_env(name: str, default: bool) -> bool:
    value = os.getenv(name)
    if value is None:
        return default
    return value.strip().lower() in {"1", "true", "yes", "on"}


class Config:
    """Base configuration."""

@@ -40,7 +47,8 @@ class Config:
    # Auth configuration - MUST be set in .env file
    LDAP_API_URL = os.getenv("LDAP_API_URL", "")
    ADMIN_EMAILS = os.getenv("ADMIN_EMAILS", "")
    SECRET_KEY = os.getenv("SECRET_KEY", "dev-secret-key-change-in-prod")
    SECRET_KEY = os.getenv("SECRET_KEY")
    CSRF_ENABLED = _bool_env("CSRF_ENABLED", True)

    # Session configuration
    PERMANENT_SESSION_LIFETIME = _int_env("SESSION_LIFETIME", 28800)  # 8 hours
@@ -103,6 +111,7 @@ class TestingConfig(Config):
    DB_CONNECT_RETRY_COUNT = 0
    DB_CONNECT_RETRY_DELAY = 0.0
    DB_CALL_TIMEOUT_MS = 5000
    CSRF_ENABLED = False


def get_config(env: str | None = None) -> Type[Config]:

@@ -10,8 +10,10 @@ from __future__ import annotations
import io
import json
import logging
import os
import threading
import time
from collections import OrderedDict
from typing import Any, Optional, Protocol, Tuple

import pandas as pd
@@ -39,26 +41,49 @@ class ProcessLevelCache:
    Uses a lock to ensure only one thread parses at a time.
    """

    def __init__(self, ttl_seconds: int = 30):
        self._cache: dict[str, Tuple[pd.DataFrame, float]] = {}
    def __init__(self, ttl_seconds: int = 30, max_size: int = 32):
        self._cache: OrderedDict[str, Tuple[pd.DataFrame, float]] = OrderedDict()
        self._lock = threading.Lock()
        self._ttl = ttl_seconds
        self._ttl = max(int(ttl_seconds), 1)
        self._max_size = max(int(max_size), 1)

    @property
    def max_size(self) -> int:
        return self._max_size

    def _evict_expired_locked(self, now: float) -> None:
        stale_keys = [
            key for key, (_, timestamp) in self._cache.items()
            if now - timestamp > self._ttl
        ]
        for key in stale_keys:
            self._cache.pop(key, None)

    def get(self, key: str) -> Optional[pd.DataFrame]:
        """Get cached DataFrame if not expired."""
        with self._lock:
            if key not in self._cache:
            payload = self._cache.get(key)
            if payload is None:
                return None
            df, timestamp = self._cache[key]
            if time.time() - timestamp > self._ttl:
                del self._cache[key]
            df, timestamp = payload
            now = time.time()
            if now - timestamp > self._ttl:
                self._cache.pop(key, None)
                return None
            self._cache.move_to_end(key, last=True)
            return df

    def set(self, key: str, df: pd.DataFrame) -> None:
        """Cache a DataFrame with current timestamp."""
        with self._lock:
            self._cache[key] = (df, time.time())
            now = time.time()
            self._evict_expired_locked(now)
            if key in self._cache:
                self._cache.pop(key, None)
            elif len(self._cache) >= self._max_size:
                self._cache.popitem(last=False)
            self._cache[key] = (df, now)
            self._cache.move_to_end(key, last=True)

    def invalidate(self, key: str) -> None:
        """Remove a key from cache."""
@@ -71,8 +96,26 @@ class ProcessLevelCache:
        self._cache.clear()


def _resolve_cache_max_size(env_name: str, default: int) -> int:
    value = os.getenv(env_name)
    if value is None:
        return max(int(default), 1)
    try:
        return max(int(value), 1)
    except (TypeError, ValueError):
        return max(int(default), 1)


# Global process-level cache for WIP DataFrame (30s TTL)
_wip_df_cache = ProcessLevelCache(ttl_seconds=30)
PROCESS_CACHE_MAX_SIZE = _resolve_cache_max_size("PROCESS_CACHE_MAX_SIZE", 32)
WIP_PROCESS_CACHE_MAX_SIZE = _resolve_cache_max_size(
    "WIP_PROCESS_CACHE_MAX_SIZE",
    PROCESS_CACHE_MAX_SIZE,
)
_wip_df_cache = ProcessLevelCache(
    ttl_seconds=30,
    max_size=WIP_PROCESS_CACHE_MAX_SIZE,
)
_wip_parse_lock = threading.Lock()

# ============================================================
@@ -328,33 +371,30 @@ def get_cached_wip_data() -> Optional[pd.DataFrame]:
    if client is None:
        return None

    # Use lock to prevent multiple threads from parsing simultaneously
    try:
        start_time = time.time()
        data_json = client.get(get_key("data"))
        if data_json is None:
            logger.debug("Cache miss: no data in Redis")
            return None

        # Parse outside lock to reduce contention on hot paths.
        parsed_df = pd.read_json(io.StringIO(data_json), orient='records')
        parse_time = time.time() - start_time
    except Exception as e:
        logger.warning(f"Failed to read cache: {e}")
        return None

    # Keep lock scope tight: consistency check + cache write only.
    with _wip_parse_lock:
        # Double-check after acquiring lock (another thread may have parsed)
        cached_df = _wip_df_cache.get(cache_key)
        if cached_df is not None:
            logger.debug(f"Process cache hit (after lock): {len(cached_df)} rows")
            logger.debug(f"Process cache hit (after parse): {len(cached_df)} rows")
            return cached_df
        _wip_df_cache.set(cache_key, parsed_df)

    try:
        start_time = time.time()
        data_json = client.get(get_key("data"))
        if data_json is None:
            logger.debug("Cache miss: no data in Redis")
            return None

        # Parse JSON to DataFrame
        df = pd.read_json(io.StringIO(data_json), orient='records')
        parse_time = time.time() - start_time

        # Store in process-level cache
        _wip_df_cache.set(cache_key, df)

        logger.debug(f"Cache hit: loaded {len(df)} rows from Redis (parsed in {parse_time:.2f}s)")
        return df
    except Exception as e:
        logger.warning(f"Failed to read cache: {e}")
        return None
    logger.debug(f"Cache hit: loaded {len(parsed_df)} rows from Redis (parsed in {parse_time:.2f}s)")
    return parsed_df


def get_cached_sys_date() -> Optional[str]:

@@ -221,7 +221,7 @@ class CacheUpdater:
        return None

    def _update_redis_cache(self, df: pd.DataFrame, sys_date: str) -> bool:
        """Update Redis cache with new data using pipeline for atomicity.
        """Update Redis cache with staged publish for coherent snapshot visibility.

        Args:
            df: DataFrame with full table data.
@@ -234,18 +234,24 @@ class CacheUpdater:
        if client is None:
            return False

        staging_key: str | None = None
        try:
            # Convert DataFrame to JSON
            # Handle datetime columns
            for col in df.select_dtypes(include=['datetime64']).columns:
                df[col] = df[col].astype(str)
            df_copy = df.copy()
            for col in df_copy.select_dtypes(include=['datetime64']).columns:
                df_copy[col] = df_copy[col].astype(str)

            data_json = df.to_json(orient='records', force_ascii=False)
            data_json = df_copy.to_json(orient='records', force_ascii=False)

            # Atomic update using pipeline
            # Stage payload first, then atomically publish live key + metadata.
            now = datetime.now().isoformat()
            unique_suffix = f"{int(time.time() * 1000)}:{threading.get_ident()}"
            staging_key = get_key(f"data:staging:{unique_suffix}")

            pipe = client.pipeline()
            pipe.set(get_key("data"), data_json)
            pipe.set(staging_key, data_json)
            pipe.rename(staging_key, get_key("data"))
            pipe.set(get_key("meta:sys_date"), sys_date)
            pipe.set(get_key("meta:updated_at"), now)
            pipe.execute()
@@ -253,6 +259,11 @@ class CacheUpdater:
            return True
        except Exception as e:
            logger.error(f"Failed to update Redis cache: {e}")
            if staging_key:
                try:
                    client.delete(staging_key)
                except Exception:
                    pass
            return False
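The stage-then-`RENAME` publish above can be exercised without a live Redis. The `FakeRedis` below is a test double written for this sketch (only the `pipeline`/`set`/`rename`/`execute` subset the updater touches), not the project's Redis client:

```python
class FakeRedis:
    """Minimal in-memory stand-in for the Redis commands the updater uses."""
    def __init__(self):
        self.store: dict[str, str] = {}
        self._ops: list[tuple] = []

    def pipeline(self):
        self._ops = []
        return self

    def set(self, key: str, value: str):
        self._ops.append(("set", key, value))
        return self

    def rename(self, src: str, dst: str):
        self._ops.append(("rename", src, dst))
        return self

    def execute(self):
        # Apply queued commands in order, like a Redis pipeline would.
        for op in self._ops:
            if op[0] == "set":
                self.store[op[1]] = op[2]
            else:
                self.store[op[2]] = self.store.pop(op[1])
        self._ops = []


def staged_publish(client: FakeRedis, live_key: str, payload: str, suffix: str) -> None:
    """Write to a unique staging key, then swap it onto the live key."""
    staging_key = f"{live_key}:staging:{suffix}"
    pipe = client.pipeline()
    pipe.set(staging_key, payload)
    pipe.rename(staging_key, live_key)
    pipe.execute()
```

In real Redis, `RENAME` replaces the destination key in one step, so readers of the live key never observe a half-written payload; the `delete(staging_key)` cleanup in the `except` branch above keeps failed publishes from leaking staging keys.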

    def _check_resource_update(self, force: bool = False) -> bool:

@@ -130,12 +130,16 @@ class CircuitBreaker:
    @property
    def state(self) -> CircuitState:
        """Get current circuit state, handling state transitions."""
        transition_log: tuple[int, str] | None = None
        with self._lock:
            if self._state == CircuitState.OPEN:
                # Check if we should transition to HALF_OPEN
                if self._open_time and time.time() - self._open_time >= self.recovery_timeout:
                    self._transition_to(CircuitState.HALF_OPEN)
            return self._state
                    transition_log = self._transition_to_locked(CircuitState.HALF_OPEN)
            current_state = self._state
        if transition_log:
            self._emit_transition_log(*transition_log)
        return current_state

    def allow_request(self) -> bool:
        """Check if a request should be allowed.
@@ -161,45 +165,57 @@ class CircuitBreaker:
        if not CIRCUIT_BREAKER_ENABLED:
            return

        transition_log: tuple[int, str] | None = None
        with self._lock:
            self._results.append(True)

            if self._state == CircuitState.HALF_OPEN:
                # Success in half-open means we can close
                self._transition_to(CircuitState.CLOSED)
                transition_log = self._transition_to_locked(CircuitState.CLOSED)

        if transition_log:
            self._emit_transition_log(*transition_log)

    def record_failure(self) -> None:
        """Record a failed operation."""
        if not CIRCUIT_BREAKER_ENABLED:
            return

        transition_log: tuple[int, str] | None = None
        with self._lock:
            self._results.append(False)
            self._last_failure_time = time.time()

            if self._state == CircuitState.HALF_OPEN:
                # Failure in half-open means back to open
                self._transition_to(CircuitState.OPEN)
                transition_log = self._transition_to_locked(CircuitState.OPEN)
            elif self._state == CircuitState.CLOSED:
                # Check if we should open
                self._check_and_open()
                transition_log = self._check_and_open_locked()

    def _check_and_open(self) -> None:
        if transition_log:
            self._emit_transition_log(*transition_log)

    def _check_and_open_locked(self) -> tuple[int, str] | None:
        """Check failure rate and open circuit if needed.

        Must be called with lock held.
        """
        if len(self._results) < self.failure_threshold:
            return
            return None

        failure_count = sum(1 for r in self._results if not r)
        failure_rate = failure_count / len(self._results)

        if (failure_count >= self.failure_threshold and
                failure_rate >= self.failure_rate_threshold):
            self._transition_to(CircuitState.OPEN)
            return self._transition_to_locked(CircuitState.OPEN)
        return None

    def _transition_to(self, new_state: CircuitState) -> None:
    def _emit_transition_log(self, level: int, message: str) -> None:
        logger.log(level, message)

    def _transition_to_locked(self, new_state: CircuitState) -> tuple[int, str]:
        """Transition to a new state with logging.

        Must be called with lock held.
@@ -209,23 +225,25 @@ class CircuitBreaker:

        if new_state == CircuitState.OPEN:
            self._open_time = time.time()
            logger.warning(
            return (
                logging.WARNING,
                f"Circuit breaker '{self.name}' OPENED: "
                f"state {old_state.value} -> {new_state.value}, "
                f"failures: {sum(1 for r in self._results if not r)}/{len(self._results)}"
            )
        elif new_state == CircuitState.HALF_OPEN:
            logger.info(
            return (
                logging.INFO,
                f"Circuit breaker '{self.name}' entering HALF_OPEN: "
                f"testing service recovery..."
            )
        elif new_state == CircuitState.CLOSED:
            self._open_time = None
            self._results.clear()
            logger.info(
                f"Circuit breaker '{self.name}' CLOSED: "
                f"service recovered"
            )
        self._open_time = None
        self._results.clear()
        return (
            logging.INFO,
            f"Circuit breaker '{self.name}' CLOSED: "
            f"service recovered"
        )

    def get_status(self) -> CircuitBreakerStatus:
        """Get current status information."""
@@ -266,7 +284,7 @@ class CircuitBreaker:
        self._results.clear()
        self._last_failure_time = None
        self._open_time = None
        logger.info(f"Circuit breaker '{self.name}' reset")
        logger.info(f"Circuit breaker '{self.name}' reset")


# ============================================================
# ============================================================
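The transition refactor above changes `_transition_to_locked` to return a `(level, message)` tuple instead of logging under the lock, and `_emit_transition_log` emits it afterwards. A minimal standalone sketch of that pattern (the class and names here are illustrative, not the project's actual `CircuitBreaker`):

```python
import logging
import threading

logger = logging.getLogger("sketch.breaker")

class TinyBreaker:
    """Illustrative only: compute the log record under the lock, emit it outside."""

    def __init__(self, threshold: int = 3) -> None:
        self._lock = threading.Lock()
        self._failures = 0
        self.threshold = threshold
        self.state = "CLOSED"

    def record_failure(self) -> None:
        with self._lock:
            self._failures += 1
            pending = self._check_and_open_locked()  # (level, message) or None
        if pending is not None:
            # Lock already released: a slow logging handler can no longer
            # block other threads contending for the breaker lock.
            logger.log(*pending)

    def _check_and_open_locked(self):
        if self._failures >= self.threshold and self.state != "OPEN":
            self.state = "OPEN"
            return (logging.WARNING, f"breaker OPENED after {self._failures} failures")
        return None

b = TinyBreaker(threshold=2)
b.record_failure()
b.record_failure()
print(b.state)  # the breaker is OPEN after hitting the threshold
```

This is the same lock-contention fix the commit message describes ("circuit breaker transition logging 移至鎖外" — moved outside the lock): only cheap tuple construction happens in the critical section.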
85	src/mes_dashboard/core/csrf.py	Normal file
@@ -0,0 +1,85 @@
# -*- coding: utf-8 -*-
"""CSRF token utilities for admin form and API mutation protection."""

from __future__ import annotations

import hmac
import secrets
from typing import Optional

from flask import Request, request, session

CSRF_SESSION_KEY = "_csrf_token"
CSRF_HEADER_NAME = "X-CSRF-Token"
CSRF_FORM_FIELD = "csrf_token"
_MUTATING_METHODS = {"POST", "PUT", "PATCH", "DELETE"}


def _new_csrf_token() -> str:
    return secrets.token_urlsafe(32)


def get_csrf_token() -> str:
    """Get a stable CSRF token for the current session."""
    token = session.get(CSRF_SESSION_KEY)
    if not token:
        token = _new_csrf_token()
        session[CSRF_SESSION_KEY] = token
    return token


def rotate_csrf_token() -> str:
    """Rotate session CSRF token after authentication state changes."""
    token = _new_csrf_token()
    session[CSRF_SESSION_KEY] = token
    return token


def _extract_request_token(req: Request) -> Optional[str]:
    header_token = req.headers.get(CSRF_HEADER_NAME)
    if header_token:
        return header_token

    form_token = req.form.get(CSRF_FORM_FIELD)
    if form_token:
        return form_token

    if req.is_json:
        payload = req.get_json(silent=True) or {}
        json_token = payload.get(CSRF_FORM_FIELD)
        if json_token:
            return str(json_token)

    return None


def should_enforce_csrf(req: Request = request, enabled: bool = True) -> bool:
    """Determine whether current request needs CSRF validation."""
    if not enabled:
        return False

    if req.method.upper() not in _MUTATING_METHODS:
        return False

    path = req.path or ""
    if path == "/admin/login":
        return True
    if path.startswith("/admin/api/"):
        return True
    if path.startswith("/admin/"):
        return True

    return False


def validate_csrf(req: Request = request) -> bool:
    """Validate request CSRF token against current session token."""
    expected = session.get(CSRF_SESSION_KEY)
    if not expected:
        return False

    provided = _extract_request_token(req)
    if not provided:
        return False

    return hmac.compare_digest(str(expected), str(provided))
@@ -51,6 +51,59 @@ from mes_dashboard.config.settings import get_config

# Configure module logger
logger = logging.getLogger('mes_dashboard.database')
+
+_REDACTION_INSTALLED = False
+_ORACLE_URL_RE = re.compile(r"(oracle\+oracledb://[^:\s/]+:)([^@/\s]+)(@)")
+_ENV_SECRET_RE = re.compile(r"(DB_PASSWORD=)([^\s]+)")
+
+
+def redact_connection_secrets(message: str) -> str:
+    """Redact DB credentials from log message text."""
+    if not message:
+        return message
+    sanitized = _ORACLE_URL_RE.sub(r"\1***\3", message)
+    sanitized = _ENV_SECRET_RE.sub(r"\1***", sanitized)
+    return sanitized
+
+
+class SecretRedactionFilter(logging.Filter):
+    """Filter that masks DB connection secrets in log messages."""
+
+    def filter(self, record: logging.LogRecord) -> bool:
+        try:
+            message = record.getMessage()
+        except Exception:
+            return True
+        sanitized = redact_connection_secrets(message)
+        if sanitized != message:
+            record.msg = sanitized
+            record.args = ()
+        return True
+
+
+def install_log_redaction_filter(target_logger: logging.Logger | None = None) -> None:
+    """Attach secret-redaction filter to mes_dashboard logging handlers once."""
+    global _REDACTION_INSTALLED
+    if target_logger is None and _REDACTION_INSTALLED:
+        return
+
+    logger_obj = target_logger or logging.getLogger("mes_dashboard")
+    redaction_filter = SecretRedactionFilter()
+
+    attached = False
+    for handler in logger_obj.handlers:
+        if any(isinstance(f, SecretRedactionFilter) for f in handler.filters):
+            attached = True
+            continue
+        handler.addFilter(redaction_filter)
+        attached = True
+
+    if not attached and not any(isinstance(f, SecretRedactionFilter) for f in logger_obj.filters):
+        logger_obj.addFilter(redaction_filter)
+        attached = True
+
+    if attached and target_logger is None:
+        _REDACTION_INSTALLED = True

# ============================================================
# SQLAlchemy Engine (QueuePool - connection pooling)
# ============================================================
@@ -59,6 +112,7 @@ logger = logging.getLogger('mes_dashboard.database')
# pool_recycle prevents stale connections from firewalls/NAT.

_ENGINE = None
+_HEALTH_ENGINE = None
_DB_RUNTIME_CONFIG: Optional[Dict[str, Any]] = None


@@ -132,6 +186,13 @@ def get_db_runtime_config(refresh: bool = False) -> Dict[str, Any]:
        "retry_count": _from_app_or_env_int("DB_CONNECT_RETRY_COUNT", config_class.DB_CONNECT_RETRY_COUNT),
        "retry_delay": _from_app_or_env_float("DB_CONNECT_RETRY_DELAY", config_class.DB_CONNECT_RETRY_DELAY),
        "call_timeout_ms": _from_app_or_env_int("DB_CALL_TIMEOUT_MS", config_class.DB_CALL_TIMEOUT_MS),
+        "health_pool_size": _from_app_or_env_int("DB_HEALTH_POOL_SIZE", 1),
+        "health_max_overflow": _from_app_or_env_int("DB_HEALTH_MAX_OVERFLOW", 0),
+        "health_pool_timeout": _from_app_or_env_int("DB_HEALTH_POOL_TIMEOUT", 2),
+        "pool_exhausted_retry_after_seconds": _from_app_or_env_int(
+            "DB_POOL_EXHAUSTED_RETRY_AFTER_SECONDS",
+            5,
+        ),
    }
    return _DB_RUNTIME_CONFIG.copy()

@@ -202,6 +263,42 @@ def get_engine():
    return _ENGINE


+def get_health_engine():
+    """Get dedicated SQLAlchemy engine for health probes.
+
+    Health checks use a tiny isolated pool so status probes remain available
+    when the request pool is saturated.
+    """
+    global _HEALTH_ENGINE
+    if _HEALTH_ENGINE is None:
+        runtime = get_db_runtime_config()
+        _HEALTH_ENGINE = create_engine(
+            CONNECTION_STRING,
+            poolclass=QueuePool,
+            pool_size=max(int(runtime["health_pool_size"]), 1),
+            max_overflow=max(int(runtime["health_max_overflow"]), 0),
+            pool_timeout=max(int(runtime["health_pool_timeout"]), 1),
+            pool_recycle=runtime["pool_recycle"],
+            pool_pre_ping=True,
+            connect_args={
+                "tcp_connect_timeout": runtime["tcp_connect_timeout"],
+                "retry_count": runtime["retry_count"],
+                "retry_delay": runtime["retry_delay"],
+            },
+        )
+        _register_pool_events(
+            _HEALTH_ENGINE,
+            min(int(runtime["call_timeout_ms"]), 10_000),
+        )
+        logger.info(
+            "Health engine created (pool_size=%s, max_overflow=%s, pool_timeout=%s)",
+            runtime["health_pool_size"],
+            runtime["health_max_overflow"],
+            runtime["health_pool_timeout"],
+        )
+    return _HEALTH_ENGINE
+
+
def _register_pool_events(engine, call_timeout_ms: int):
    """Register event listeners for connection pool monitoring."""

@@ -302,8 +399,12 @@ def dispose_engine():

    Call this during application shutdown to cleanly release resources.
    """
-    global _ENGINE, _DB_RUNTIME_CONFIG
+    global _ENGINE, _HEALTH_ENGINE, _DB_RUNTIME_CONFIG
    stop_keepalive()
+    if _HEALTH_ENGINE is not None:
+        _HEALTH_ENGINE.dispose()
+        logger.info("Health engine disposed")
+        _HEALTH_ENGINE = None
    if _ENGINE is not None:
        _ENGINE.dispose()
        logger.info("Database engine disposed, all connections closed")
@@ -432,9 +533,13 @@ def read_sql_df(sql: str, params: Optional[Dict[str, Any]] = None) -> pd.DataFrame:
                elapsed,
                exc,
            )
+            retry_after = max(
+                int(get_db_runtime_config().get("pool_exhausted_retry_after_seconds", 5)),
+                1,
+            )
            raise DatabasePoolExhaustedError(
                "Database connection pool exhausted",
-                retry_after_seconds=5,
+                retry_after_seconds=retry_after,
            ) from exc
        except Exception as exc:
            elapsed = time.time() - start_time
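The two redaction patterns added to this module can be sanity-checked in isolation. This restates them standalone (the connection string below is made up for illustration):

```python
import re

# Same patterns as _ORACLE_URL_RE / _ENV_SECRET_RE above.
oracle_url_re = re.compile(r"(oracle\+oracledb://[^:\s/]+:)([^@/\s]+)(@)")
env_secret_re = re.compile(r"(DB_PASSWORD=)([^\s]+)")

def redact(message: str) -> str:
    """Mask the password segment of an Oracle URL and DB_PASSWORD env echoes."""
    sanitized = oracle_url_re.sub(r"\1***\3", message)
    return env_secret_re.sub(r"\1***", sanitized)

msg = "connect oracle+oracledb://app_user:s3cret@db-host:1521/orcl with DB_PASSWORD=s3cret"
print(redact(msg))
# connect oracle+oracledb://app_user:***@db-host:1521/orcl with DB_PASSWORD=***
```

Note the filter clears `record.args` after rewriting `record.msg`, which is what makes the rewrite safe for `%`-style log records: the message is already fully formatted by `record.getMessage()`.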
103	src/mes_dashboard/core/rate_limit.py	Normal file
@@ -0,0 +1,103 @@
# -*- coding: utf-8 -*-
"""Lightweight in-process rate limiting helpers for high-cost routes."""

from __future__ import annotations

import os
import threading
import time
from collections import defaultdict, deque
from functools import wraps
from typing import Callable, Deque

from flask import request

from mes_dashboard.core.response import TOO_MANY_REQUESTS, error_response

_RATE_LOCK = threading.Lock()
_RATE_ATTEMPTS: dict[str, dict[str, Deque[float]]] = defaultdict(lambda: defaultdict(deque))


def _env_int(name: str, default: int) -> int:
    raw = os.getenv(name)
    if raw is None:
        return int(default)
    try:
        value = int(raw)
    except (TypeError, ValueError):
        return int(default)
    return max(value, 1)


def _client_identifier() -> str:
    forwarded = request.headers.get("X-Forwarded-For", "").strip()
    if forwarded:
        return forwarded.split(",")[0].strip()
    return request.remote_addr or "unknown"


def check_and_record(
    bucket: str,
    *,
    client_id: str,
    max_attempts: int,
    window_seconds: int,
) -> tuple[bool, int]:
    """Check and record request attempt for a bucket+client pair."""
    now = time.time()
    window_start = now - max(window_seconds, 1)

    with _RATE_LOCK:
        per_bucket = _RATE_ATTEMPTS[bucket]
        attempts = per_bucket[client_id]

        while attempts and attempts[0] <= window_start:
            attempts.popleft()

        if len(attempts) >= max_attempts:
            retry_after = max(int(window_seconds - (now - attempts[0])), 1)
            return True, retry_after

        attempts.append(now)
        return False, 0


def configured_rate_limit(
    *,
    bucket: str,
    max_attempts_env: str,
    window_seconds_env: str,
    default_max_attempts: int,
    default_window_seconds: int,
) -> Callable:
    """Build a route decorator with env-configurable rate limits."""
    max_attempts = _env_int(max_attempts_env, default_max_attempts)
    window_seconds = _env_int(window_seconds_env, default_window_seconds)

    def decorator(func: Callable) -> Callable:
        @wraps(func)
        def wrapped(*args, **kwargs):
            limited, retry_after = check_and_record(
                bucket,
                client_id=_client_identifier(),
                max_attempts=max_attempts,
                window_seconds=window_seconds,
            )
            if limited:
                return error_response(
                    TOO_MANY_REQUESTS,
                    "請求過於頻繁,請稍後再試",
                    status_code=429,
                    meta={"retry_after_seconds": retry_after},
                    headers={"Retry-After": str(retry_after)},
                )
            return func(*args, **kwargs)

        return wrapped

    return decorator


def reset_rate_limits_for_tests() -> None:
    with _RATE_LOCK:
        _RATE_ATTEMPTS.clear()
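The deque-based sliding window in `check_and_record` can be illustrated standalone. This sketch simplifies to a single bucket and injects the clock instead of calling `time.time()`, so the window behavior is deterministic:

```python
from collections import deque

def check_and_record(attempts: deque, now: float, max_attempts: int, window_seconds: int):
    """Return (limited, retry_after_seconds); record the attempt when allowed."""
    window_start = now - window_seconds
    while attempts and attempts[0] <= window_start:  # drop timestamps outside the window
        attempts.popleft()
    if len(attempts) >= max_attempts:
        # Retry once the oldest attempt in the window expires, minimum 1s.
        retry_after = max(int(window_seconds - (now - attempts[0])), 1)
        return True, retry_after
    attempts.append(now)
    return False, 0

attempts: deque = deque()
for t in (0.0, 1.0, 2.0):
    print(check_and_record(attempts, t, max_attempts=3, window_seconds=10))  # (False, 0) each time
print(check_and_record(attempts, 3.0, max_attempts=3, window_seconds=10))    # limited
print(check_and_record(attempts, 11.0, max_attempts=3, window_seconds=10))   # oldest expired, allowed again
```

The real module shares one lock across all buckets, which is fine for an in-process limiter but means the state is per worker, not cluster-wide.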
143	src/mes_dashboard/core/runtime_contract.py	Normal file
@@ -0,0 +1,143 @@
# -*- coding: utf-8 -*-
"""Runtime contract helpers shared by app, scripts, and watchdog."""

from __future__ import annotations

import os
import shutil
from pathlib import Path
from typing import Any, Mapping

CONTRACT_VERSION = "2026.02-p2"
DEFAULT_PROJECT_ROOT = Path(__file__).resolve().parents[3]


def _to_bool(value: str | None, default: bool) -> bool:
    if value is None:
        return default
    return value.strip().lower() in {"1", "true", "yes", "on"}


def _resolve_path(value: str | None, fallback: Path, project_root: Path) -> Path:
    if value is None or not str(value).strip():
        return fallback.resolve()
    raw = Path(str(value).strip())
    if raw.is_absolute():
        return raw.resolve()
    return (project_root / raw).resolve()


def load_runtime_contract(
    environ: Mapping[str, str] | None = None,
    *,
    project_root: Path | str | None = None,
) -> dict[str, Any]:
    """Load effective runtime contract from environment with normalized paths."""
    env = environ or os.environ
    root = Path(project_root or env.get("MES_DASHBOARD_ROOT", DEFAULT_PROJECT_ROOT)).resolve()
    runtime_dir = _resolve_path(
        env.get("WATCHDOG_RUNTIME_DIR"),
        root / "tmp",
        root,
    )

    restart_flag = _resolve_path(
        env.get("WATCHDOG_RESTART_FLAG"),
        runtime_dir / "mes_dashboard_restart.flag",
        root,
    )
    pid_file = _resolve_path(
        env.get("WATCHDOG_PID_FILE"),
        runtime_dir / "gunicorn.pid",
        root,
    )
    state_file = _resolve_path(
        env.get("WATCHDOG_STATE_FILE"),
        runtime_dir / "mes_dashboard_restart_state.json",
        root,
    )

    contract = {
        "version": env.get("RUNTIME_CONTRACT_VERSION", CONTRACT_VERSION),
        "project_root": str(root),
        "gunicorn_bind": env.get("GUNICORN_BIND", "0.0.0.0:8080"),
        "conda_bin": (env.get("CONDA_BIN", "") or "").strip(),
        "conda_env_name": (env.get("CONDA_ENV_NAME", "mes-dashboard") or "").strip(),
        "watchdog_runtime_dir": str(runtime_dir),
        "watchdog_restart_flag": str(restart_flag),
        "watchdog_pid_file": str(pid_file),
        "watchdog_state_file": str(state_file),
        "watchdog_check_interval": int(env.get("WATCHDOG_CHECK_INTERVAL", "5")),
        "validation_enforced": _to_bool(env.get("RUNTIME_CONTRACT_ENFORCE"), False),
    }
    return contract


def validate_runtime_contract(
    contract: Mapping[str, Any] | None = None,
    *,
    strict: bool = False,
) -> list[str]:
    """Validate runtime contract and return actionable errors."""
    cfg = dict(contract or load_runtime_contract())
    errors: list[str] = []

    runtime_dir = Path(str(cfg["watchdog_runtime_dir"])).resolve()
    restart_flag = Path(str(cfg["watchdog_restart_flag"])).resolve()
    pid_file = Path(str(cfg["watchdog_pid_file"])).resolve()
    state_file = Path(str(cfg["watchdog_state_file"])).resolve()

    if restart_flag.parent != runtime_dir:
        errors.append(
            "WATCHDOG_RESTART_FLAG must be under WATCHDOG_RUNTIME_DIR "
            f"({restart_flag} not under {runtime_dir})."
        )
    if pid_file.parent != runtime_dir:
        errors.append(
            "WATCHDOG_PID_FILE must be under WATCHDOG_RUNTIME_DIR "
            f"({pid_file} not under {runtime_dir})."
        )

    if not state_file.is_absolute():
        errors.append("WATCHDOG_STATE_FILE must resolve to an absolute path.")

    bind = str(cfg.get("gunicorn_bind", "")).strip()
    if ":" not in bind:
        errors.append(f"GUNICORN_BIND must include host:port (current: {bind!r}).")

    conda_bin = str(cfg.get("conda_bin", "")).strip()
    if strict and not conda_bin:
        conda_on_path = shutil.which("conda")
        if not conda_on_path:
            errors.append(
                "CONDA_BIN is required when strict runtime validation is enabled "
                "and conda is not discoverable on PATH."
            )
    if conda_bin:
        conda_path = Path(conda_bin)
        if not conda_path.exists():
            errors.append(f"CONDA_BIN does not exist: {conda_bin}")
        elif not os.access(conda_bin, os.X_OK):
            errors.append(f"CONDA_BIN is not executable: {conda_bin}")

    conda_env_name = str(cfg.get("conda_env_name", "")).strip()
    active_env = (os.getenv("CONDA_DEFAULT_ENV") or "").strip()
    if strict and conda_env_name and active_env and active_env != conda_env_name:
        errors.append(
            "CONDA_DEFAULT_ENV mismatch: "
            f"expected {conda_env_name!r}, got {active_env!r}."
        )

    return errors


def build_runtime_contract_diagnostics(*, strict: bool = False) -> dict[str, Any]:
    """Build diagnostics payload for runtime contract introspection."""
    contract = load_runtime_contract()
    errors = validate_runtime_contract(contract, strict=strict)
    return {
        "valid": not errors,
        "strict": strict,
        "errors": errors,
        "contract": contract,
    }
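The containment checks in `validate_runtime_contract` hinge on comparing `Path.parent` after `resolve()`, which also means "directly under", not "anywhere beneath". A quick standalone illustration (paths are examples, not the deployment's real layout):

```python
from pathlib import Path

def directly_under(runtime_dir: str, flag_path: str) -> bool:
    """True when the file sits immediately inside runtime_dir, as the contract requires."""
    return Path(flag_path).resolve().parent == Path(runtime_dir).resolve()

print(directly_under("/srv/app/tmp", "/srv/app/tmp/mes_dashboard_restart.flag"))  # True
print(directly_under("/srv/app/tmp", "/etc/mes_dashboard_restart.flag"))          # False
```

A nested path such as `/srv/app/tmp/sub/x.flag` would also fail this check, so the contract forces a flat runtime directory; `resolve()` normalizes relative segments and symlinks before the comparison.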
@@ -33,6 +33,22 @@ def get_days_back(filters: Optional[Dict] = None, default: int = DEFAULT_DAYS_BACK) -> int:
    return default


+def parse_bool_query(value: Any, default: bool = False) -> bool:
+    """Parse common boolean query parameter values."""
+    if value is None:
+        return default
+    if isinstance(value, bool):
+        return value
+    text = str(value).strip().lower()
+    if not text:
+        return default
+    if text in {"true", "1", "yes", "y", "on"}:
+        return True
+    if text in {"false", "0", "no", "n", "off"}:
+        return False
+    return default
+
+
# ============================================================
# SQL Filter Building (DEPRECATED)
# Use mes_dashboard.sql.CommonFilters with QueryBuilder instead.
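The `parse_bool_query` helper above is self-contained, so its behavior is easy to demonstrate; restated standalone with a few calls:

```python
from typing import Any

def parse_bool_query(value: Any, default: bool = False) -> bool:
    """Parse common boolean query parameter values (same logic as the hunk above)."""
    if value is None:
        return default
    if isinstance(value, bool):
        return value
    text = str(value).strip().lower()
    if not text:
        return default
    if text in {"true", "1", "yes", "y", "on"}:
        return True
    if text in {"false", "0", "no", "n", "off"}:
        return False
    return default

print(parse_bool_query("YES"))                   # True (case-insensitive)
print(parse_bool_query(" off "))                 # False (whitespace stripped)
print(parse_bool_query("maybe", default=True))   # True (unrecognized -> default)
```

Falling back to `default` for unrecognized text, instead of raising, keeps query-string parsing forgiving for dashboard URLs.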
220	src/mes_dashboard/core/worker_recovery_policy.py	Normal file
@@ -0,0 +1,220 @@
# -*- coding: utf-8 -*-
"""Worker restart policy helpers (cooldown, retry budget, churn guard)."""

from __future__ import annotations

import json
import os
from datetime import datetime, timezone
from pathlib import Path
from typing import Any, Mapping

from mes_dashboard.core.runtime_contract import load_runtime_contract


def _env_int(name: str, default: int) -> int:
    try:
        return int(os.getenv(name, str(default)))
    except (TypeError, ValueError):
        return default


def _env_bool(name: str, default: bool) -> bool:
    raw = os.getenv(name)
    if raw is None:
        return default
    return raw.strip().lower() in {"1", "true", "yes", "on"}


def _parse_iso(ts: str | None) -> datetime | None:
    if not ts:
        return None
    try:
        value = datetime.fromisoformat(ts)
    except (TypeError, ValueError):
        return None
    if value.tzinfo is None:
        value = value.replace(tzinfo=timezone.utc)
    return value


def _utc_now() -> datetime:
    return datetime.now(timezone.utc)


def get_worker_recovery_policy_config() -> dict[str, Any]:
    """Return effective worker restart policy config."""
    retry_budget = _env_int("WORKER_RESTART_RETRY_BUDGET", 3)
    churn_threshold = _env_int(
        "WORKER_RESTART_CHURN_THRESHOLD",
        _env_int("RESILIENCE_RESTART_CHURN_THRESHOLD", retry_budget),
    )
    window_seconds = _env_int(
        "WORKER_RESTART_WINDOW_SECONDS",
        _env_int("RESILIENCE_RESTART_CHURN_WINDOW_SECONDS", 600),
    )
    return {
        "cooldown_seconds": max(_env_int("WORKER_RESTART_COOLDOWN", 60), 1),
        "retry_budget": max(retry_budget, 1),
        "window_seconds": max(window_seconds, 30),
        "churn_threshold": max(churn_threshold, 1),
        "guarded_mode_enabled": _env_bool("WORKER_GUARDED_MODE_ENABLED", True),
    }


def load_restart_state(path: str | None = None) -> dict[str, Any]:
    """Load persisted restart state from runtime contract state file."""
    state_path = Path(path or load_runtime_contract()["watchdog_state_file"])
    if not state_path.exists():
        return {}
    try:
        return json.loads(state_path.read_text())
    except (json.JSONDecodeError, IOError):
        return {}


def extract_restart_history(state: Mapping[str, Any] | None = None) -> list[dict[str, Any]]:
    """Extract bounded restart history from persisted state."""
    payload = dict(state or {})
    raw_history = payload.get("history")
    if not isinstance(raw_history, list):
        return []
    return [item for item in raw_history if isinstance(item, dict)][-50:]


def extract_last_requested_at(state: Mapping[str, Any] | None = None) -> str | None:
    """Extract last requested timestamp from persisted state."""
    payload = dict(state or {})
    last_restart = payload.get("last_restart") or {}
    if not isinstance(last_restart, dict):
        return None
    value = last_restart.get("requested_at")
    return str(value) if value else None


def evaluate_worker_recovery_state(
    history: list[dict[str, Any]] | None,
    *,
    last_requested_at: str | None = None,
    now: datetime | None = None,
) -> dict[str, Any]:
    """Evaluate restart policy state for automated/manual recovery decisions."""
    cfg = get_worker_recovery_policy_config()
    now_dt = now or _utc_now()
    window_seconds = int(cfg["window_seconds"])
    cooldown_seconds = int(cfg["cooldown_seconds"])

    recent_attempts = 0
    for item in history or []:
        requested = _parse_iso(item.get("requested_at"))
        completed = _parse_iso(item.get("completed_at"))
        ts = requested or completed
        if ts is None:
            continue
        age = (now_dt - ts).total_seconds()
        if age <= window_seconds:
            recent_attempts += 1

    retry_budget = int(cfg["retry_budget"])
    churn_threshold = int(cfg["churn_threshold"])
    retry_budget_exhausted = recent_attempts >= retry_budget
    churn_exceeded = recent_attempts >= churn_threshold
    guarded_mode = bool(cfg["guarded_mode_enabled"] and (retry_budget_exhausted or churn_exceeded))

    cooldown_active = False
    cooldown_remaining = 0
    last_requested_dt = _parse_iso(last_requested_at)
    if last_requested_dt is not None:
        elapsed = (now_dt - last_requested_dt).total_seconds()
        if elapsed < cooldown_seconds:
            cooldown_active = True
            cooldown_remaining = int(max(cooldown_seconds - elapsed, 0))

    blocked = guarded_mode
    allowed = not blocked and not cooldown_active

    state = "allowed"
    if blocked:
        state = "blocked"
    elif cooldown_active:
        state = "cooldown"

    return {
        "state": state,
        "allowed": allowed,
        "cooldown": cooldown_active,
        "cooldown_remaining_seconds": cooldown_remaining,
        "blocked": blocked,
        "guarded_mode": guarded_mode,
        "retry_budget_exhausted": retry_budget_exhausted,
        "churn_exceeded": churn_exceeded,
        "attempts_in_window": recent_attempts,
        "retry_budget": retry_budget,
        "churn_threshold": churn_threshold,
        "window_seconds": window_seconds,
        "cooldown_seconds": cooldown_seconds,
    }


def decide_restart_request(
    policy_state: Mapping[str, Any],
    *,
    source: str,
    manual_override: bool = False,
    override_acknowledged: bool = False,
) -> dict[str, Any]:
    """Decide whether restart request is allowed under current policy state."""
    state = dict(policy_state or {})
    blocked = bool(state.get("blocked"))
    cooldown = bool(state.get("cooldown"))
    source_value = (source or "manual").strip().lower()

    if source_value not in {"auto", "manual"}:
        source_value = "manual"

    if source_value == "auto":
        if blocked:
            return {
                "allowed": False,
                "decision": "blocked",
                "reason": "guarded_mode_blocked",
                "requires_acknowledgement": False,
            }
        if cooldown:
            return {
                "allowed": False,
                "decision": "blocked",
                "reason": "cooldown_active",
                "requires_acknowledgement": False,
            }
        return {
            "allowed": True,
            "decision": "allowed",
            "reason": "policy_allows_auto_restart",
            "requires_acknowledgement": False,
        }

    if (blocked or cooldown) and not (manual_override and override_acknowledged):
        reason = "manual_override_required" if blocked else "cooldown_override_required"
        return {
            "allowed": False,
            "decision": "blocked",
            "reason": reason,
            "requires_acknowledgement": True,
        }

    if manual_override and override_acknowledged:
        return {
            "allowed": True,
            "decision": "manual_override",
            "reason": "operator_override_acknowledged",
            "requires_acknowledgement": False,
        }

    return {
        "allowed": True,
        "decision": "allowed",
        "reason": "policy_allows_manual_restart",
        "requires_acknowledgement": False,
    }
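The branches of `decide_restart_request` reduce to a small decision table: auto restarts are hard-blocked by guarded mode and cooldown, while manual restarts can cross both with an acknowledged override. This condensed sketch reproduces the same outcomes with a simplified signature (illustrative, not the module's API):

```python
def decide(source: str, *, blocked: bool, cooldown: bool,
           manual_override: bool = False, override_acknowledged: bool = False) -> str:
    """Condensed decision table mirroring decide_restart_request's outcomes."""
    if source == "auto":
        if blocked:
            return "blocked:guarded_mode_blocked"
        if cooldown:
            return "blocked:cooldown_active"
        return "allowed"
    # Manual path: blocked/cooldown states require an acknowledged override.
    if (blocked or cooldown) and not (manual_override and override_acknowledged):
        return "blocked:override_required"
    if manual_override and override_acknowledged:
        return "manual_override"
    return "allowed"

print(decide("auto", blocked=True, cooldown=False))   # blocked:guarded_mode_blocked
print(decide("manual", blocked=True, cooldown=False,
             manual_override=True, override_acknowledged=True))  # manual_override
```

Requiring both `manual_override` and `override_acknowledged` is what produces the "需 ack + reason" audit trail the commit message describes: an unacknowledged override is still refused.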
|
||||
|
||||
@@ -3,12 +3,13 @@
|
||||
|
||||
from __future__ import annotations
|
||||
|
||||
import json
|
||||
import logging
|
||||
import os
|
||||
import time
|
||||
from datetime import datetime
|
||||
from pathlib import Path
|
||||
import json
|
||||
import logging
|
||||
import os
|
||||
import time
|
||||
from datetime import datetime, timezone
|
||||
from pathlib import Path
|
||||
from typing import Any
|
||||
|
||||
from flask import Blueprint, g, jsonify, render_template, request
|
||||
|
||||
@@ -19,6 +20,17 @@ from mes_dashboard.core.resilience import (
|
||||
get_resilience_thresholds,
|
||||
summarize_restart_history,
|
||||
)
|
||||
from mes_dashboard.core.runtime_contract import (
|
||||
build_runtime_contract_diagnostics,
|
||||
load_runtime_contract,
|
||||
)
|
||||
from mes_dashboard.core.worker_recovery_policy import (
|
||||
decide_restart_request,
|
||||
evaluate_worker_recovery_state,
|
||||
extract_last_requested_at,
|
||||
extract_restart_history,
|
||||
load_restart_state,
|
||||
)
|
||||
from mes_dashboard.services.page_registry import get_all_pages, set_page_status
|
||||
|
||||
admin_bp = Blueprint("admin", __name__, url_prefix="/admin")
|
||||
@@ -28,21 +40,13 @@ logger = logging.getLogger("mes_dashboard.admin")
|
||||
# Worker Restart Configuration
|
||||
# ============================================================
|
||||
|
||||
WATCHDOG_RUNTIME_DIR = os.getenv("WATCHDOG_RUNTIME_DIR", "/tmp")
|
||||
RESTART_FLAG_PATH = os.getenv(
|
||||
"WATCHDOG_RESTART_FLAG",
|
||||
f"{WATCHDOG_RUNTIME_DIR}/mes_dashboard_restart.flag"
|
||||
)
|
||||
RESTART_STATE_PATH = os.getenv(
|
||||
"WATCHDOG_STATE_FILE",
|
||||
f"{WATCHDOG_RUNTIME_DIR}/mes_dashboard_restart_state.json"
|
||||
)
|
||||
WATCHDOG_PID_PATH = os.getenv(
|
||||
"WATCHDOG_PID_FILE",
|
||||
f"{WATCHDOG_RUNTIME_DIR}/gunicorn.pid"
|
||||
)
|
||||
GUNICORN_BIND = os.getenv("GUNICORN_BIND", "0.0.0.0:8080")
|
||||
RESTART_COOLDOWN_SECONDS = int(os.getenv("WORKER_RESTART_COOLDOWN", "60"))
|
||||
_RUNTIME_CONTRACT = load_runtime_contract()
|
||||
WATCHDOG_RUNTIME_DIR = _RUNTIME_CONTRACT["watchdog_runtime_dir"]
|
||||
RESTART_FLAG_PATH = _RUNTIME_CONTRACT["watchdog_restart_flag"]
|
||||
RESTART_STATE_PATH = _RUNTIME_CONTRACT["watchdog_state_file"]
|
||||
WATCHDOG_PID_PATH = _RUNTIME_CONTRACT["watchdog_pid_file"]
|
||||
GUNICORN_BIND = _RUNTIME_CONTRACT["gunicorn_bind"]
|
||||
RUNTIME_CONTRACT_VERSION = _RUNTIME_CONTRACT["version"]
|
||||
|
||||
# Track last restart request time (in-memory for this worker)
|
||||
_last_restart_request: float = 0.0
|
||||
@@ -91,7 +95,9 @@ def api_system_status():
|
||||
thresholds = get_resilience_thresholds()
|
||||
restart_state = _get_restart_state()
|
||||
restart_churn = _get_restart_churn_summary(restart_state)
|
||||
in_cooldown, remaining = _check_restart_cooldown()
|
||||
policy_state = _get_restart_policy_state(restart_state)
|
||||
in_cooldown = bool(policy_state.get("cooldown"))
|
||||
remaining = int(policy_state.get("cooldown_remaining_seconds") or 0)
|
||||
|
||||
degraded_reason = None
|
||||
if db_status == "error":
|
||||
@@ -111,6 +117,14 @@ def api_system_status():
|
||||
restart_churn_exceeded=bool(restart_churn.get("exceeded")),
|
||||
cooldown_active=in_cooldown,
|
||||
)
|
||||
alerts = _build_restart_alerts(
|
||||
pool_saturation=(pool_state or {}).get("saturation"),
|
||||
circuit_state=circuit_breaker.get("state"),
|
||||
route_cache_degraded=bool(route_cache.get("degraded")),
|
||||
policy_state=policy_state,
|
||||
thresholds=thresholds,
|
||||
)
|
||||
runtime_contract = build_runtime_contract_diagnostics(strict=False)
|
||||
|
||||
# Cache status
|
||||
from mes_dashboard.routes.health_routes import (
|
||||
@@ -142,13 +156,22 @@ def api_system_status():
|
||||
"pool_state": pool_state,
|
||||
"route_cache": route_cache,
|
||||
"thresholds": thresholds,
|
||||
"alerts": alerts,
|
||||
"restart_churn": restart_churn,
|
||||
"policy_state": {
|
||||
"state": policy_state.get("state"),
|
||||
"allowed": policy_state.get("allowed"),
|
||||
"cooldown": policy_state.get("cooldown"),
|
||||
"blocked": policy_state.get("blocked"),
|
||||
"cooldown_remaining_seconds": remaining,
|
||||
},
|
||||
"recovery_recommendation": recommendation,
|
||||
"restart_cooldown": {
|
||||
"active": in_cooldown,
|
||||
"remaining_seconds": int(remaining) if in_cooldown else 0,
|
||||
"remaining_seconds": remaining if in_cooldown else 0,
|
||||
},
|
||||
},
|
||||
"runtime_contract": runtime_contract,
|
||||
"single_port_bind": GUNICORN_BIND,
|
||||
"worker_pid": os.getpid()
|
||||
}
|
||||
@@ -281,55 +304,33 @@ def api_logs_cleanup():
|
||||
# Worker Restart Control Routes
|
||||
# ============================================================
|
||||
|
||||
def _get_restart_state() -> dict:
|
||||
"""Read worker restart state from file."""
|
||||
state_path = Path(RESTART_STATE_PATH)
|
||||
if not state_path.exists():
|
||||
return {}
|
||||
try:
|
||||
return json.loads(state_path.read_text())
|
||||
except (json.JSONDecodeError, IOError):
|
||||
return {}
|
||||
def _get_restart_state() -> dict:
|
||||
"""Read worker restart state from file."""
|
||||
return load_restart_state(RESTART_STATE_PATH)
|
||||
|
||||
|
||||
def _iso_from_epoch(ts: float) -> str | None:
|
||||
if ts <= 0:
|
||||
return None
|
||||
return datetime.fromtimestamp(ts, tz=timezone.utc).isoformat()
|
||||
|
||||
|


def _check_restart_cooldown() -> tuple[bool, float]:
    """Check if restart is in cooldown.

    Returns:
        Tuple of (is_in_cooldown, remaining_seconds).
    """
    global _last_restart_request

    # Check in-memory cooldown first
    now = time.time()
    elapsed = now - _last_restart_request
    if elapsed < RESTART_COOLDOWN_SECONDS:
        return True, RESTART_COOLDOWN_SECONDS - elapsed

    # Check file-based state (for cross-worker coordination)
    state = _get_restart_state()
    last_restart = state.get("last_restart", {})
    requested_at = last_restart.get("requested_at")

    if requested_at:
        try:
            request_time = datetime.fromisoformat(requested_at).timestamp()
            elapsed = now - request_time
            if elapsed < RESTART_COOLDOWN_SECONDS:
                return True, RESTART_COOLDOWN_SECONDS - elapsed
        except (ValueError, TypeError):
            pass

    policy = _get_restart_policy_state()
    if policy.get("cooldown"):
        return True, float(policy.get("cooldown_remaining_seconds") or 0.0)
    return False, 0.0
def _get_restart_history(state: dict | None = None) -> list[dict]:
    """Return bounded restart history for admin telemetry."""
    payload = state if state is not None else _get_restart_state()
    return extract_restart_history(payload)[-20:]


def _get_restart_churn_summary(state: dict | None = None) -> dict:
@@ -338,27 +339,63 @@ def _get_restart_churn_summary(state: dict | None = None) -> dict:
    return summarize_restart_history(history)


def _get_restart_policy_state(state: dict | None = None) -> dict[str, Any]:
    """Return effective worker restart policy state."""
    payload = state if state is not None else _get_restart_state()
    history = _get_restart_history(payload)
    last_requested = extract_last_requested_at(payload)

    in_memory_requested = _iso_from_epoch(_last_restart_request)
    if in_memory_requested:
        try:
            in_memory_dt = datetime.fromisoformat(in_memory_requested)
            persisted_dt = datetime.fromisoformat(last_requested) if last_requested else None
        except (TypeError, ValueError):
            in_memory_dt = None
            persisted_dt = None
        if in_memory_dt and (persisted_dt is None or in_memory_dt > persisted_dt):
            last_requested = in_memory_requested

    return evaluate_worker_recovery_state(
        history,
        last_requested_at=last_requested,
    )


def _build_restart_alerts(
    *,
    pool_saturation: float | None,
    circuit_state: str | None,
    route_cache_degraded: bool,
    policy_state: dict[str, Any],
    thresholds: dict[str, Any],
) -> dict[str, Any]:
    saturation = float(pool_saturation or 0.0)
    warning = float(thresholds.get("pool_saturation_warning", 0.9))
    critical = float(thresholds.get("pool_saturation_critical", 1.0))
    return {
        "pool_warning": saturation >= warning,
        "pool_critical": saturation >= critical,
        "circuit_open": circuit_state == "OPEN",
        "route_cache_degraded": bool(route_cache_degraded),
        "restart_churn_exceeded": bool(policy_state.get("churn_exceeded")),
        "restart_blocked": bool(policy_state.get("blocked")),
    }
def _log_restart_audit(event: str, payload: dict[str, Any]) -> None:
    entry = {
        "event": event,
        "timestamp": datetime.now(tz=timezone.utc).isoformat(),
        "runtime_contract_version": RUNTIME_CONTRACT_VERSION,
        **payload,
    }
    logger.info("worker_restart_audit %s", json.dumps(entry, ensure_ascii=False))


@admin_bp.route("/api/worker/restart", methods=["POST"])
@admin_required
def api_worker_restart():
    """API: Request worker restart.

    Writes a restart flag file that the watchdog process monitors.
@@ -366,52 +403,118 @@ def api_worker_restart():
    """
    global _last_restart_request

    payload = request.get_json(silent=True) or {}
    manual_override = bool(payload.get("manual_override"))
    override_acknowledged = bool(payload.get("override_acknowledged"))
    override_reason = str(payload.get("override_reason") or "").strip()

    # Get request metadata
    user = getattr(g, "username", "unknown")
    ip = request.remote_addr or "unknown"
    timestamp = datetime.now(tz=timezone.utc).isoformat()

    state = _get_restart_state()
    policy_state = _get_restart_policy_state(state)
    decision = decide_restart_request(
        policy_state,
        source="manual",
        manual_override=manual_override,
        override_acknowledged=override_acknowledged,
    )

    if manual_override and not override_reason:
        return error_response(
            "RESTART_OVERRIDE_REASON_REQUIRED",
            "Manual override requires non-empty override_reason for audit traceability.",
            status_code=400,
        )

    if not decision["allowed"]:
        status_code = 429 if policy_state.get("cooldown") else 409
        if status_code == 429:
            message = (
                f"Restart in cooldown. Please wait "
                f"{int(policy_state.get('cooldown_remaining_seconds') or 0)} seconds."
            )
            code = TOO_MANY_REQUESTS
        else:
            message = (
                "Restart blocked by guarded mode. "
                "Set manual_override=true and override_acknowledged=true to proceed."
            )
            code = "RESTART_POLICY_BLOCKED"
        _log_restart_audit(
            "restart_request_blocked",
            {
                "actor": user,
                "ip": ip,
                "decision": decision,
                "policy_state": policy_state,
            },
        )
        return error_response(
            code,
            message,
            status_code=status_code,
        )
    # Write restart flag file
    flag_path = Path(RESTART_FLAG_PATH)
    flag_data = {
        "user": user,
        "ip": ip,
        "timestamp": timestamp,
        "worker_pid": os.getpid(),
        "source": "manual",
        "manual_override": bool(manual_override and override_acknowledged),
        "override_acknowledged": override_acknowledged,
        "override_reason": override_reason or None,
        "policy_state": policy_state,
        "policy_decision": decision["decision"],
        "runtime_contract_version": RUNTIME_CONTRACT_VERSION,
    }

    try:
        flag_path.parent.mkdir(parents=True, exist_ok=True)
        tmp_path = flag_path.with_suffix(flag_path.suffix + ".tmp")
        tmp_path.write_text(json.dumps(flag_data, ensure_ascii=False))
        tmp_path.replace(flag_path)
    except IOError as e:
        logger.error(f"Failed to write restart flag: {e}")
        return error_response(
            "RESTART_FAILED",
            f"Failed to request restart: {e}",
            status_code=500
        )

    # Update in-memory cooldown
    _last_restart_request = time.time()

    logger.info(
        f"Worker restart requested by {user} from {ip}"
    )

    _log_restart_audit(
        "restart_request_accepted",
        {
            "actor": user,
            "ip": ip,
            "decision": decision,
            "policy_state": policy_state,
            "override_reason": override_reason or None,
        },
    )

    return jsonify({
        "success": True,
        "data": {
            "message": "Restart requested. Workers will reload shortly.",
            "requested_by": user,
            "requested_at": timestamp,
            "policy_state": {
                "state": policy_state.get("state"),
                "allowed": policy_state.get("allowed"),
                "cooldown": policy_state.get("cooldown"),
                "blocked": policy_state.get("blocked"),
                "cooldown_remaining_seconds": policy_state.get("cooldown_remaining_seconds"),
            },
            "decision": decision,
            "single_port_bind": GUNICORN_BIND,
            "watchdog": {
                "runtime_dir": WATCHDOG_RUNTIME_DIR,
@@ -425,18 +528,23 @@ def api_worker_restart():
@admin_bp.route("/api/worker/status", methods=["GET"])
|
||||
@admin_required
|
||||
def api_worker_status():
|
||||
"""API: Get worker status and restart information."""
|
||||
# Check cooldown
|
||||
in_cooldown, remaining = _check_restart_cooldown()
|
||||
|
||||
def api_worker_status():
|
||||
"""API: Get worker status and restart information."""
|
||||
# Get last restart info
|
||||
state = _get_restart_state()
|
||||
last_restart = state.get("last_restart", {})
|
||||
history = _get_restart_history(state)
|
||||
churn = _get_restart_churn_summary(state)
|
||||
policy_state = _get_restart_policy_state(state)
|
||||
thresholds = get_resilience_thresholds()
|
||||
recommendation = _worker_recovery_hint(churn, in_cooldown)
|
||||
recommendation = build_recovery_recommendation(
|
||||
degraded_reason="db_pool_saturated" if policy_state.get("blocked") else None,
|
||||
pool_saturation=None,
|
||||
circuit_state=None,
|
||||
restart_churn_exceeded=bool(churn.get("exceeded")),
|
||||
cooldown_active=bool(policy_state.get("cooldown")),
|
||||
)
|
    runtime_contract = build_runtime_contract_diagnostics(strict=False)

    # Get worker start time (psutil is optional)
    worker_start_time = None
@@ -466,6 +574,11 @@ def api_worker_status():
        "worker_pid": os.getpid(),
        "worker_start_time": worker_start_time,
        "runtime_contract": {
            "version": runtime_contract["contract"]["version"],
            "validation": {
                "valid": runtime_contract["valid"],
                "errors": runtime_contract["errors"],
            },
            "single_port_bind": GUNICORN_BIND,
            "watchdog": {
                "runtime_dir": WATCHDOG_RUNTIME_DIR,
@@ -478,12 +591,27 @@ def api_worker_status():
            },
        },
        "cooldown": {
            "active": bool(policy_state.get("cooldown")),
            "remaining_seconds": int(policy_state.get("cooldown_remaining_seconds") or 0)
        },
        "resilience": {
            "thresholds": thresholds,
            "alerts": {
                "restart_churn_exceeded": bool(churn.get("exceeded")),
                "restart_blocked": bool(policy_state.get("blocked")),
            },
            "restart_churn": churn,
            "policy_state": {
                "state": policy_state.get("state"),
                "allowed": policy_state.get("allowed"),
                "cooldown": policy_state.get("cooldown"),
                "blocked": policy_state.get("blocked"),
                "cooldown_remaining_seconds": policy_state.get("cooldown_remaining_seconds"),
                "attempts_in_window": policy_state.get("attempts_in_window"),
                "retry_budget": policy_state.get("retry_budget"),
                "churn_threshold": policy_state.get("churn_threshold"),
                "window_seconds": policy_state.get("window_seconds"),
            },
            "recovery_recommendation": recommendation,
        },
        "restart_history": history,
@@ -9,9 +9,10 @@ from collections import defaultdict
from datetime import datetime
from threading import Lock

from flask import Blueprint, flash, redirect, render_template, request, session, url_for

from mes_dashboard.core.csrf import rotate_csrf_token
from mes_dashboard.services.auth_service import authenticate, is_admin

logger = logging.getLogger('mes_dashboard.auth_routes')
auth_bp = Blueprint("auth", __name__, url_prefix="/admin")
@@ -89,25 +90,27 @@ def login():
        user = authenticate(username, password)
        if user is None:
            error = "帳號或密碼錯誤"
        elif not is_admin(user):
            error = "您不是管理員,無法登入後台"
        else:
            # Login successful
            session.clear()
            session["admin"] = {
                "username": user.get("username"),
                "displayName": user.get("displayName"),
                "mail": user.get("mail"),
                "department": user.get("department"),
                "login_time": datetime.now().isoformat(),
            }
            rotate_csrf_token()
            next_url = request.args.get("next", url_for("portal_index"))
            return redirect(next_url)

    return render_template("login.html", error=error)


@auth_bp.route("/logout")
def logout():
    """Admin logout."""
    session.clear()
    return redirect(url_for("portal_index"))
@@ -6,13 +6,15 @@ Provides /health and /health/deep endpoints for monitoring service status.

from __future__ import annotations

import logging
import os
import threading
import time
from datetime import datetime, timedelta
from flask import Blueprint, current_app, jsonify, make_response

from mes_dashboard.core.database import (
    get_engine,
    get_health_engine,
    get_pool_runtime_config,
    get_pool_status,
)
@@ -28,6 +30,15 @@ from mes_dashboard.core.cache import (
from mes_dashboard.core.resilience import (
    build_recovery_recommendation,
    get_resilience_thresholds,
    summarize_restart_history,
)
from mes_dashboard.core.runtime_contract import build_runtime_contract_diagnostics
from mes_dashboard.core.worker_recovery_policy import (
    evaluate_worker_recovery_state,
    extract_last_requested_at,
    extract_restart_history,
    get_worker_recovery_policy_config,
    load_restart_state,
)
from sqlalchemy import text
@@ -39,8 +50,63 @@ health_bp = Blueprint('health', __name__)
# Warning Thresholds
# ============================================================

DB_LATENCY_WARNING_MS = 100   # Database latency > 100ms is slow
CACHE_STALE_MINUTES = 2       # Cache update > 2 minutes is stale
HEALTH_MEMO_TTL_SECONDS = int(os.getenv("HEALTH_MEMO_TTL_SECONDS", "5"))

_HEALTH_MEMO_LOCK = threading.Lock()
_HEALTH_MEMO: dict[str, dict | None] = {
    "health": None,
    "deep": None,
}


def _health_memo_enabled() -> bool:
    if HEALTH_MEMO_TTL_SECONDS <= 0:
        return False
    if current_app.testing or bool(current_app.config.get("TESTING")):
        return False
    return True


def _get_health_memo(cache_key: str) -> tuple[dict, int] | None:
    if not _health_memo_enabled():
        return None
    now = time.time()
    with _HEALTH_MEMO_LOCK:
        entry = _HEALTH_MEMO.get(cache_key)
        if not entry:
            return None
        if now - float(entry.get("ts", 0.0)) > HEALTH_MEMO_TTL_SECONDS:
            _HEALTH_MEMO[cache_key] = None
            return None
        return entry["payload"], int(entry["status"])


def _set_health_memo(cache_key: str, payload: dict, status_code: int) -> None:
    if not _health_memo_enabled():
        return
    with _HEALTH_MEMO_LOCK:
        _HEALTH_MEMO[cache_key] = {
            "ts": time.time(),
            "payload": payload,
            "status": int(status_code),
        }


def _build_health_response(payload: dict, status_code: int):
    """Build JSON response with explicit no-cache headers."""
    resp = make_response(jsonify(payload), status_code)
    resp.headers['Cache-Control'] = 'no-cache, no-store, must-revalidate'
    resp.headers['Pragma'] = 'no-cache'
    resp.headers['Expires'] = '0'
    return resp


def _reset_health_memo_for_tests() -> None:
    with _HEALTH_MEMO_LOCK:
        _HEALTH_MEMO["health"] = None
        _HEALTH_MEMO["deep"] = None


def _classify_degraded_reason(
@@ -63,18 +129,60 @@ def _classify_degraded_reason(
    return None
def _build_resilience_alerts(
    *,
    pool_saturation: float | None,
    circuit_state: str | None,
    route_cache_degraded: bool,
    restart_churn_exceeded: bool,
    restart_blocked: bool,
    thresholds: dict,
) -> dict:
    saturation = float(pool_saturation or 0.0)
    warning = float(thresholds.get("pool_saturation_warning", 0.9))
    critical = float(thresholds.get("pool_saturation_critical", 1.0))
    return {
        "pool_warning": saturation >= warning,
        "pool_critical": saturation >= critical,
        "circuit_open": circuit_state == "OPEN",
        "route_cache_degraded": bool(route_cache_degraded),
        "restart_churn_exceeded": bool(restart_churn_exceeded),
        "restart_blocked": bool(restart_blocked),
    }
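Feeding sample values through the same threshold rules shows how the alert flags fall out of the defaults. A standalone restatement of the pool/circuit part only (illustrative, under the assumed defaults of 0.9 warning and 1.0 critical):

```python
def build_alerts(pool_saturation, circuit_state, thresholds):
    """Standalone restatement of the _build_resilience_alerts threshold
    rules (pool/circuit part only), for illustration."""
    saturation = float(pool_saturation or 0.0)
    warning = float(thresholds.get("pool_saturation_warning", 0.9))
    critical = float(thresholds.get("pool_saturation_critical", 1.0))
    return {
        "pool_warning": saturation >= warning,     # warn at 90% by default
        "pool_critical": saturation >= critical,   # critical only at 100%
        "circuit_open": circuit_state == "OPEN",
    }

# 95% saturation with a closed circuit: warning fires, critical does not
alerts = build_alerts(0.95, "CLOSED", {})
```

Using `>=` on both thresholds means a fully saturated pool (1.0) raises both flags at once, which is the intended escalation behavior.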


def get_worker_recovery_status() -> dict:
    """Build worker recovery policy status for health/admin telemetry."""
    state = load_restart_state()
    history = extract_restart_history(state)
    policy_state = evaluate_worker_recovery_state(
        history,
        last_requested_at=extract_last_requested_at(state),
    )
    churn = summarize_restart_history(
        history,
        window_seconds=int(policy_state.get("window_seconds") or 600),
        threshold=int(policy_state.get("churn_threshold") or 3),
    )
    return {
        "policy_state": policy_state,
        "restart_churn": churn,
        "policy_config": get_worker_recovery_policy_config(),
    }


def check_database() -> tuple[str, str | None]:
    """Check database connectivity.

    Returns:
        Tuple of (status, error_message).
        status is 'ok' or 'error'.
    """
    try:
        engine = get_health_engine()
        with engine.connect() as conn:
            conn.execute(text("SELECT 1 FROM DUAL"))
        return 'ok', None
    except Exception as e:
        logger.error(f"Database health check failed: {e}")
        return 'error', str(e)
@@ -111,13 +219,21 @@ def get_cache_status() -> dict:
    status = {
        'enabled': REDIS_ENABLED,
        'sys_date': get_cached_sys_date(),
        'updated_at': get_cache_updated_at(),
        'derived_search_index': {},
        'derived_frame_snapshot': {},
        'index_metrics': {},
        'memory': {},
    }
    try:
        from mes_dashboard.services.wip_service import get_wip_search_index_status
        derived = get_wip_search_index_status()
        status['derived_search_index'] = derived.get('derived_search_index', {})
        status['derived_frame_snapshot'] = derived.get('derived_frame_snapshot', {})
        status['index_metrics'] = derived.get('metrics', {})
        status['memory'] = derived.get('memory', {})
    except Exception:
        pass
    return status
@@ -201,10 +317,15 @@ def get_workcenter_mapping_status() -> dict:
def health_check():
    """Health check endpoint.

    Returns:
        - 200 OK: All services healthy or degraded (Redis down but DB ok)
        - 503 Service Unavailable: Database unhealthy
    """
    cached = _get_health_memo("health")
    if cached is not None:
        payload, status_code = cached
        return _build_health_response(payload, status_code)

    from mes_dashboard.core.circuit_breaker import get_circuit_breaker_status

    db_status, db_error = check_database()
@@ -266,13 +387,25 @@ def health_check():
        warnings.append(f"Database pool saturation is high ({saturation:.0%})")

    thresholds = get_resilience_thresholds()
    worker_recovery = get_worker_recovery_status()
    policy_state = worker_recovery.get("policy_state", {})
    restart_churn = worker_recovery.get("restart_churn", {})
    recommendation = build_recovery_recommendation(
        degraded_reason=degraded_reason,
        pool_saturation=pool_saturation,
        circuit_state=circuit_breaker.get('state'),
        restart_churn_exceeded=bool(restart_churn.get("exceeded")),
        cooldown_active=bool(policy_state.get("cooldown")),
    )
    alerts = _build_resilience_alerts(
        pool_saturation=pool_saturation,
        circuit_state=circuit_breaker.get("state"),
        route_cache_degraded=bool(route_cache.get("degraded")),
        restart_churn_exceeded=bool(restart_churn.get("exceeded")),
        restart_blocked=bool(policy_state.get("blocked")),
        thresholds=thresholds,
    )
    runtime_contract = build_runtime_contract_diagnostics(strict=False)

    # Check equipment status cache
    equipment_status_cache = get_equipment_status_cache_status()
@@ -293,8 +426,18 @@ def health_check():
        },
        'resilience': {
            'thresholds': thresholds,
            'alerts': alerts,
            'policy_state': {
                'state': policy_state.get("state"),
                'allowed': policy_state.get("allowed"),
                'cooldown': policy_state.get("cooldown"),
                'blocked': policy_state.get("blocked"),
                'cooldown_remaining_seconds': policy_state.get("cooldown_remaining_seconds"),
            },
            'restart_churn': restart_churn,
            'recovery_recommendation': recommendation,
        },
        'runtime_contract': runtime_contract,
        'cache': get_cache_status(),
        'route_cache': route_cache,
        'resource_cache': resource_cache,
@@ -307,12 +450,8 @@ def health_check():
    if warnings:
        response['warnings'] = warnings

    _set_health_memo("health", response, http_code)
    return _build_health_response(response, http_code)


@health_bp.route('/health/deep', methods=['GET'])
@@ -330,9 +469,14 @@ def deep_health_check():
    from mes_dashboard.core.metrics import get_metrics_summary
    from flask import redirect, url_for, request

    # Require admin authentication - redirect to login for consistency
    if not is_admin_logged_in():
        return redirect(url_for("auth.login", next=request.url))

    cached = _get_health_memo("deep")
    if cached is not None:
        payload, status_code = cached
        return _build_health_response(payload, status_code)

    # Check database with latency measurement
    db_start = time.time()
@@ -397,6 +541,9 @@ def deep_health_check():
        warnings.append(f"Database pool saturation is high ({pool_saturation:.0%})")

    thresholds = get_resilience_thresholds()
    worker_recovery = get_worker_recovery_status()
    policy_state = worker_recovery.get("policy_state", {})
    restart_churn = worker_recovery.get("restart_churn", {})
    degraded_reason = _classify_degraded_reason(
        db_status=db_status,
        redis_status=redis_status,
@@ -408,9 +555,18 @@ def deep_health_check():
        degraded_reason=degraded_reason,
        pool_saturation=pool_saturation,
        circuit_state=circuit_breaker.get('state'),
        restart_churn_exceeded=bool(restart_churn.get("exceeded")),
        cooldown_active=bool(policy_state.get("cooldown")),
    )
    alerts = _build_resilience_alerts(
        pool_saturation=pool_saturation,
        circuit_state=circuit_breaker.get("state"),
        route_cache_degraded=bool(route_cache.get("degraded")),
        restart_churn_exceeded=bool(restart_churn.get("exceeded")),
        restart_blocked=bool(policy_state.get("blocked")),
        thresholds=thresholds,
    )
    runtime_contract = build_runtime_contract_diagnostics(strict=False)

    # Check latency thresholds
    db_latency_status = 'healthy'
@@ -429,8 +585,18 @@ def deep_health_check():
        'degraded_reason': degraded_reason,
        'resilience': {
            'thresholds': thresholds,
            'alerts': alerts,
            'policy_state': {
                'state': policy_state.get("state"),
                'allowed': policy_state.get("allowed"),
                'cooldown': policy_state.get("cooldown"),
                'blocked': policy_state.get("blocked"),
                'cooldown_remaining_seconds': policy_state.get("cooldown_remaining_seconds"),
            },
            'restart_churn': restart_churn,
            'recovery_recommendation': recommendation,
        },
        'runtime_contract': runtime_contract,
        'checks': {
            'database': {
                'status': db_latency_status if db_status == 'ok' else 'error',
@@ -446,7 +612,9 @@ def deep_health_check():
            'cache': {
                'freshness': cache_freshness,
                'updated_at': cache_updated_at,
                'sys_date': cache_status.get('sys_date'),
                'index_metrics': cache_status.get('index_metrics', {}),
                'memory': cache_status.get('memory', {}),
            },
            'route_cache': route_cache
        },
@@ -464,9 +632,5 @@ def deep_health_check():
    if warnings:
        response['warnings'] = warnings

    _set_health_memo("deep", response, http_code)
    return _build_health_response(response, http_code)
@@ -4,22 +4,27 @@
Contains Flask Blueprint for Hold Detail page and API endpoints.
"""

from flask import Blueprint, jsonify, request, render_template, redirect, url_for

from mes_dashboard.core.rate_limit import configured_rate_limit
from mes_dashboard.core.utils import parse_bool_query
from mes_dashboard.services.wip_service import (
    get_hold_detail_summary,
    get_hold_detail_distribution,
    get_hold_detail_lots,
    is_quality_hold,
)

# Create Blueprint
hold_bp = Blueprint('hold', __name__)

_HOLD_LOTS_RATE_LIMIT = configured_rate_limit(
    bucket="hold-detail-lots",
    max_attempts_env="HOLD_LOTS_RATE_LIMIT_MAX_REQUESTS",
    window_seconds_env="HOLD_LOTS_RATE_LIMIT_WINDOW_SECONDS",
    default_max_attempts=90,
    default_window_seconds=60,
)

# ============================================================
@@ -64,7 +69,7 @@ def api_hold_detail_summary():
    if not reason:
        return jsonify({'success': False, 'error': '缺少必要參數: reason'}), 400

    include_dummy = parse_bool_query(request.args.get('include_dummy'))

    result = get_hold_detail_summary(
        reason=reason,
@@ -90,7 +95,7 @@ def api_hold_detail_distribution():
    if not reason:
        return jsonify({'success': False, 'error': '缺少必要參數: reason'}), 400

    include_dummy = parse_bool_query(request.args.get('include_dummy'))

    result = get_hold_detail_distribution(
        reason=reason,
@@ -101,8 +106,9 @@
        return jsonify({'success': False, 'error': '查詢失敗'}), 500


@hold_bp.route('/api/wip/hold-detail/lots')
@_HOLD_LOTS_RATE_LIMIT
def api_hold_detail_lots():
    """API: Get paginated lot details for a specific hold reason.

    Query Parameters:
@@ -124,7 +130,7 @@ def api_hold_detail_lots():
    workcenter = request.args.get('workcenter', '').strip() or None
    package = request.args.get('package', '').strip() or None
    age_range = request.args.get('age_range', '').strip() or None
    include_dummy = parse_bool_query(request.args.get('include_dummy'))
    page = request.args.get('page', 1, type=int)
    per_page = min(request.args.get('per_page', 50, type=int), 200)
@@ -13,10 +13,12 @@ from mes_dashboard.core.database import (
    DatabaseCircuitOpenError,
)
from mes_dashboard.core.cache import cache_get, cache_set, make_cache_key
from mes_dashboard.core.rate_limit import configured_rate_limit
from mes_dashboard.core.utils import get_days_back, parse_bool_query


def _clean_nan_values(data):
    """Convert NaN/NaT values to None for JSON serialization (depth-safe).

    Args:
        data: List of dicts or single dict.
@@ -24,28 +26,77 @@
    Returns:
        Cleaned data with NaN/NaT replaced by None.
"""
|
||||
def _normalize_scalar(value):
|
||||
if isinstance(value, float) and math.isnan(value):
|
||||
return None
|
||||
if isinstance(value, str) and value == 'NaT':
|
||||
return None
|
||||
try:
|
||||
if value != value: # NaN check (NaN != NaN)
|
||||
return None
|
||||
except Exception:
|
||||
pass
|
||||
return value
|
||||
|
||||
if isinstance(data, list):
|
||||
return [_clean_nan_values(item) for item in data]
|
||||
root: list = []
|
||||
elif isinstance(data, dict):
|
||||
cleaned = {}
|
||||
for key, value in data.items():
|
||||
if isinstance(value, float) and math.isnan(value):
|
||||
cleaned[key] = None
|
||||
elif isinstance(value, str) and value == 'NaT':
|
||||
cleaned[key] = None
|
||||
elif value != value: # NaN check (NaN != NaN)
|
||||
cleaned[key] = None
|
||||
elif isinstance(value, list):
|
||||
# Recursively clean nested lists (e.g., LOT_DETAILS)
|
||||
cleaned[key] = _clean_nan_values(value)
|
||||
root = {}
|
||||
else:
|
||||
return _normalize_scalar(data)
|
||||
|
||||
stack = [(data, root)]
|
||||
seen: set[int] = {id(data)}
|
||||
|
||||
while stack:
|
||||
source, target = stack.pop()
|
||||
if isinstance(source, list):
|
||||
for item in source:
|
||||
if isinstance(item, list):
|
||||
item_id = id(item)
|
||||
if item_id in seen:
|
||||
target.append(None)
|
||||
continue
|
||||
child = []
|
||||
target.append(child)
|
||||
seen.add(item_id)
|
||||
stack.append((item, child))
|
||||
elif isinstance(item, dict):
|
||||
item_id = id(item)
|
||||
if item_id in seen:
|
||||
target.append(None)
|
||||
continue
|
||||
child = {}
|
||||
target.append(child)
|
||||
seen.add(item_id)
|
||||
stack.append((item, child))
|
||||
else:
|
||||
target.append(_normalize_scalar(item))
|
||||
continue
|
||||
|
||||
for key, value in source.items():
|
||||
if isinstance(value, list):
|
||||
value_id = id(value)
|
||||
if value_id in seen:
|
||||
target[key] = None
|
||||
continue
|
||||
child = []
|
||||
target[key] = child
|
||||
seen.add(value_id)
|
||||
stack.append((value, child))
|
||||
elif isinstance(value, dict):
|
||||
# Recursively clean nested dicts
|
||||
cleaned[key] = _clean_nan_values(value)
|
||||
value_id = id(value)
|
||||
if value_id in seen:
|
||||
target[key] = None
|
||||
continue
|
||||
child = {}
|
||||
target[key] = child
|
||||
seen.add(value_id)
|
||||
stack.append((value, child))
|
||||
else:
|
||||
cleaned[key] = value
|
||||
return cleaned
|
||||
return data
|
||||
from mes_dashboard.core.utils import get_days_back
|
||||
target[key] = _normalize_scalar(value)
|
||||
return root
|
||||
|
||||
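The rewritten `_clean_nan_values` replaces recursion with an explicit stack plus a `seen` id-set, so deeply nested or self-referencing payloads can neither hit the recursion limit nor loop forever. A condensed standalone version of the same technique (the full version in the diff additionally mirrors the exact branch order shown above):

```python
import math

def clean_nan_values(data):
    """Iterative NaN/'NaT' -> None cleaner with cycle protection.

    Condensed sketch of the depth-safe rewrite in this diff.
    """
    def normalize(value):
        if isinstance(value, float) and math.isnan(value):
            return None
        if isinstance(value, str) and value == 'NaT':
            return None
        return value

    if isinstance(data, list):
        root = []
    elif isinstance(data, dict):
        root = {}
    else:
        return normalize(data)

    stack = [(data, root)]
    seen = {id(data)}          # ids of containers already scheduled

    while stack:
        source, target = stack.pop()
        items = enumerate(source) if isinstance(source, list) else source.items()
        for key, value in items:
            if isinstance(value, (list, dict)):
                if id(value) in seen:
                    child = None           # cycle: break the reference
                else:
                    child = [] if isinstance(value, list) else {}
                    seen.add(id(value))
                    stack.append((value, child))
            else:
                child = normalize(value)
            if isinstance(target, list):
                target.append(child)
            else:
                target[key] = child
    return root

print(clean_nan_values({'a': float('nan'), 'b': [1, float('nan'), 'NaT']}))
```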
 from mes_dashboard.services.resource_service import (
     query_resource_by_status,
     query_resource_by_workcenter,
@@ -62,6 +113,32 @@ from mes_dashboard.config.constants import STATUS_CATEGORIES
 # Create Blueprint
 resource_bp = Blueprint('resource', __name__, url_prefix='/api/resource')

+_RESOURCE_DETAIL_RATE_LIMIT = configured_rate_limit(
+    bucket="resource-detail",
+    max_attempts_env="RESOURCE_DETAIL_RATE_LIMIT_MAX_REQUESTS",
+    window_seconds_env="RESOURCE_DETAIL_RATE_LIMIT_WINDOW_SECONDS",
+    default_max_attempts=60,
+    default_window_seconds=60,
+)
+
+_RESOURCE_STATUS_RATE_LIMIT = configured_rate_limit(
+    bucket="resource-status",
+    max_attempts_env="RESOURCE_STATUS_RATE_LIMIT_MAX_REQUESTS",
+    window_seconds_env="RESOURCE_STATUS_RATE_LIMIT_WINDOW_SECONDS",
+    default_max_attempts=90,
+    default_window_seconds=60,
+)
+
+
+def _optional_bool_arg(name: str):
+    raw = request.args.get(name)
+    if raw is None:
+        return None
+    text = str(raw).strip()
+    if not text:
+        return None
+    return parse_bool_query(text)
+
+
 @resource_bp.route('/by_status')
 def api_resource_by_status():
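`_optional_bool_arg` is deliberately tri-state: `None` means "no filter applied", while `True`/`False` are explicit filters; that distinction is what lets the status endpoints below replace their repeated inline parsing. A standalone sketch, with a plain dict standing in for `request.args` and an assumed `parse_bool_query` (both are illustrative substitutions, not the real module code):

```python
def parse_bool_query(value) -> bool:
    # Stand-in for mes_dashboard.core.utils.parse_bool_query (assumed semantics).
    return str(value).strip().lower() in ('true', '1', 'yes')

def optional_bool_arg(args: dict, name: str):
    """Tri-state query flag: None = absent/blank, True/False = explicit."""
    raw = args.get(name)
    if raw is None:
        return None
    text = str(raw).strip()
    if not text:
        return None          # blank means "don't filter", not False
    return parse_bool_query(text)

print(optional_bool_arg({}, 'is_key'))                 # None -> no filter
print(optional_bool_arg({'is_key': '1'}, 'is_key'))    # True
print(optional_bool_arg({'is_key': 'no'}, 'is_key'))   # False
```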
@@ -118,6 +195,7 @@ def api_resource_workcenter_status_matrix():


 @resource_bp.route('/detail', methods=['POST'])
+@_RESOURCE_DETAIL_RATE_LIMIT
 def api_resource_detail():
     """API: Resource detail with filters."""
     data = request.get_json() or {}
@@ -183,6 +261,7 @@ def api_resource_status_values():
 # ============================================================

 @resource_bp.route('/status')
+@_RESOURCE_STATUS_RATE_LIMIT
 def api_resource_status():
     """API: Get merged resource status from realtime cache.

@@ -197,20 +276,9 @@ def api_resource_status():
     wc_groups_param = request.args.get('workcenter_groups')
     workcenter_groups = wc_groups_param.split(',') if wc_groups_param else None

-    is_production = None
-    is_prod_param = request.args.get('is_production')
-    if is_prod_param:
-        is_production = is_prod_param.lower() in ('1', 'true', 'yes')
-
-    is_key = None
-    is_key_param = request.args.get('is_key')
-    if is_key_param:
-        is_key = is_key_param.lower() in ('1', 'true', 'yes')
-
-    is_monitor = None
-    is_monitor_param = request.args.get('is_monitor')
-    if is_monitor_param:
-        is_monitor = is_monitor_param.lower() in ('1', 'true', 'yes')
+    is_production = _optional_bool_arg('is_production')
+    is_key = _optional_bool_arg('is_key')
+    is_monitor = _optional_bool_arg('is_monitor')

     status_cats_param = request.args.get('status_categories')
     status_categories = status_cats_param.split(',') if status_cats_param else None
@@ -260,6 +328,7 @@ def api_resource_status_options():


 @resource_bp.route('/status/summary')
+@_RESOURCE_STATUS_RATE_LIMIT
 def api_resource_status_summary():
     """API: Get resource status summary statistics.

@@ -269,20 +338,9 @@ def api_resource_status_summary():
     wc_groups_param = request.args.get('workcenter_groups')
     workcenter_groups = wc_groups_param.split(',') if wc_groups_param else None

-    is_production = None
-    is_prod_param = request.args.get('is_production')
-    if is_prod_param:
-        is_production = is_prod_param.lower() in ('1', 'true', 'yes')
-
-    is_key = None
-    is_key_param = request.args.get('is_key')
-    if is_key_param:
-        is_key = is_key_param.lower() in ('1', 'true', 'yes')
-
-    is_monitor = None
-    is_monitor_param = request.args.get('is_monitor')
-    if is_monitor_param:
-        is_monitor = is_monitor_param.lower() in ('1', 'true', 'yes')
+    is_production = _optional_bool_arg('is_production')
+    is_key = _optional_bool_arg('is_key')
+    is_monitor = _optional_bool_arg('is_monitor')

     try:
         data = get_resource_status_summary(
@@ -301,6 +359,7 @@ def api_resource_status_summary():


 @resource_bp.route('/status/matrix')
+@_RESOURCE_STATUS_RATE_LIMIT
 def api_resource_status_matrix():
     """API: Get workcenter × status matrix.

@@ -309,20 +368,9 @@ def api_resource_status_matrix():
         is_key: Filter by key equipment
         is_monitor: Filter by monitor equipment
     """
-    is_production = None
-    is_prod_param = request.args.get('is_production')
-    if is_prod_param:
-        is_production = is_prod_param.lower() in ('1', 'true', 'yes')
-
-    is_key = None
-    is_key_param = request.args.get('is_key')
-    if is_key_param:
-        is_key = is_key_param.lower() in ('1', 'true', 'yes')
-
-    is_monitor = None
-    is_monitor_param = request.args.get('is_monitor')
-    if is_monitor_param:
-        is_monitor = is_monitor_param.lower() in ('1', 'true', 'yes')
+    is_production = _optional_bool_arg('is_production')
+    is_key = _optional_bool_arg('is_key')
+    is_monitor = _optional_bool_arg('is_monitor')

     try:
         data = get_workcenter_status_matrix(

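`configured_rate_limit` (from `mes_dashboard.core.rate_limit`) is used throughout these hunks as a decorator factory whose limits come from environment variables, falling back to code defaults. Its real implementation is not shown in this diff; the following fixed-window sketch only mirrors the parameter names and the env-override behaviour, and is an assumption rather than the project's actual limiter (which presumably limits per client, returns a proper Flask response, and is thread-safe):

```python
import os
import time
from collections import deque
from functools import wraps

def configured_rate_limit(bucket, max_attempts_env, window_seconds_env,
                          default_max_attempts, default_window_seconds):
    """Decorator factory: env-overridable fixed-window rate limit (sketch)."""
    max_attempts = int(os.getenv(max_attempts_env, default_max_attempts))
    window = int(os.getenv(window_seconds_env, default_window_seconds))
    hits = deque()  # timestamps of accepted calls for this bucket

    def decorator(fn):
        @wraps(fn)
        def wrapper(*args, **kwargs):
            now = time.monotonic()
            while hits and now - hits[0] > window:
                hits.popleft()          # drop calls outside the window
            if len(hits) >= max_attempts:
                return {'success': False, 'error': 'rate limited'}, 429
            hits.append(now)
            return fn(*args, **kwargs)
        return wrapper
    return decorator
```

Applied exactly like the diff does: `@_RESOURCE_STATUS_RATE_LIMIT` between the route decorator and the view function, so over-limit requests are rejected before the view runs.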
@@ -7,6 +7,8 @@ Uses DWH.DW_MES_LOT_V view for real-time WIP data.

 from flask import Blueprint, jsonify, request

+from mes_dashboard.core.rate_limit import configured_rate_limit
+from mes_dashboard.core.utils import parse_bool_query
 from mes_dashboard.services.wip_service import (
     get_wip_summary,
     get_wip_matrix,
@@ -24,10 +26,21 @@ from mes_dashboard.services.wip_service import (
 # Create Blueprint
 wip_bp = Blueprint('wip', __name__, url_prefix='/api/wip')

+_WIP_MATRIX_RATE_LIMIT = configured_rate_limit(
+    bucket="wip-overview-matrix",
+    max_attempts_env="WIP_MATRIX_RATE_LIMIT_MAX_REQUESTS",
+    window_seconds_env="WIP_MATRIX_RATE_LIMIT_WINDOW_SECONDS",
+    default_max_attempts=120,
+    default_window_seconds=60,
+)

-def _parse_bool(value: str) -> bool:
-    """Parse boolean from query string."""
-    return value.lower() in ('true', '1', 'yes') if value else False
+_WIP_DETAIL_RATE_LIMIT = configured_rate_limit(
+    bucket="wip-detail",
+    max_attempts_env="WIP_DETAIL_RATE_LIMIT_MAX_REQUESTS",
+    window_seconds_env="WIP_DETAIL_RATE_LIMIT_WINDOW_SECONDS",
+    default_max_attempts=90,
+    default_window_seconds=60,
+)


 # ============================================================
@@ -52,7 +65,7 @@ def api_overview_summary():
     lotid = request.args.get('lotid', '').strip() or None
     package = request.args.get('package', '').strip() or None
     pj_type = request.args.get('type', '').strip() or None
-    include_dummy = _parse_bool(request.args.get('include_dummy', ''))
+    include_dummy = parse_bool_query(request.args.get('include_dummy'))

     result = get_wip_summary(
         include_dummy=include_dummy,
@@ -67,6 +80,7 @@ def api_overview_summary():


 @wip_bp.route('/overview/matrix')
+@_WIP_MATRIX_RATE_LIMIT
 def api_overview_matrix():
     """API: Get workcenter x product line matrix for overview dashboard.

@@ -88,7 +102,7 @@ def api_overview_matrix():
     lotid = request.args.get('lotid', '').strip() or None
     package = request.args.get('package', '').strip() or None
     pj_type = request.args.get('type', '').strip() or None
-    include_dummy = _parse_bool(request.args.get('include_dummy', ''))
+    include_dummy = parse_bool_query(request.args.get('include_dummy'))
     status = request.args.get('status', '').strip().upper() or None
     hold_type = request.args.get('hold_type', '').strip().lower() or None

@@ -134,7 +148,7 @@ def api_overview_hold():
     """
     workorder = request.args.get('workorder', '').strip() or None
     lotid = request.args.get('lotid', '').strip() or None
-    include_dummy = _parse_bool(request.args.get('include_dummy', ''))
+    include_dummy = parse_bool_query(request.args.get('include_dummy'))

     result = get_wip_hold_summary(
         include_dummy=include_dummy,
@@ -151,6 +165,7 @@ def api_overview_hold():
 # ============================================================

 @wip_bp.route('/detail/<workcenter>')
+@_WIP_DETAIL_RATE_LIMIT
 def api_detail(workcenter: str):
     """API: Get WIP detail for a specific workcenter group.

@@ -176,12 +191,17 @@ def api_detail(workcenter: str):
     hold_type = request.args.get('hold_type', '').strip().lower() or None
     workorder = request.args.get('workorder', '').strip() or None
     lotid = request.args.get('lotid', '').strip() or None
-    include_dummy = _parse_bool(request.args.get('include_dummy', ''))
+    include_dummy = parse_bool_query(request.args.get('include_dummy'))
     page = request.args.get('page', 1, type=int)
-    page_size = min(request.args.get('page_size', 100, type=int), 500)
+    page_size = request.args.get('page_size', 100, type=int)

-    if page < 1:
+    if page is None:
         page = 1
+    if page_size is None:
+        page_size = 100
+
+    page = max(page, 1)
+    page_size = max(1, min(page_size, 500))

     # Validate status parameter
     if status and status not in ('RUN', 'QUEUE', 'HOLD'):
@@ -245,7 +265,7 @@ def api_meta_workcenters():
     Returns:
         JSON with list of {name, lot_count} sorted by sequence
     """
-    include_dummy = _parse_bool(request.args.get('include_dummy', ''))
+    include_dummy = parse_bool_query(request.args.get('include_dummy'))

     result = get_workcenters(include_dummy=include_dummy)
     if result is not None:
@@ -263,7 +283,7 @@ def api_meta_packages():
     Returns:
         JSON with list of {name, lot_count} sorted by count desc
     """
-    include_dummy = _parse_bool(request.args.get('include_dummy', ''))
+    include_dummy = parse_bool_query(request.args.get('include_dummy'))

     result = get_packages(include_dummy=include_dummy)
     if result is not None:
@@ -293,7 +313,7 @@ def api_meta_search():
     search_field = request.args.get('field', '').strip().lower()
     q = request.args.get('q', '').strip()
     limit = min(request.args.get('limit', 20, type=int), 50)
-    include_dummy = _parse_bool(request.args.get('include_dummy', ''))
+    include_dummy = parse_bool_query(request.args.get('include_dummy'))

     # Cross-filter parameters
     workorder = request.args.get('workorder', '').strip() or None

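The reworked `/detail/<workcenter>` endpoint normalizes `page`/`page_size` with None-guards followed by clamping, instead of a single bare `min(...)`, so a missing, zero, negative, or oversized value can no longer reach the query layer. The same steps, extracted into a helper (the name `normalize_paging` is mine, not the project's):

```python
def normalize_paging(page, page_size, default_size=100, max_size=500):
    """Clamp pagination args as the reworked /detail endpoint does."""
    if page is None:
        page = 1
    if page_size is None:
        page_size = default_size
    page = max(page, 1)                            # pages start at 1
    page_size = max(1, min(page_size, max_size))   # 1 <= size <= max_size
    return page, page_size

print(normalize_paging(None, None))   # (1, 100)
print(normalize_paging(0, 9999))      # (1, 500)
print(normalize_paging(3, 50))        # (3, 50)
```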
@@ -1,124 +1,193 @@
-# -*- coding: utf-8 -*-
-"""Authentication service using LDAP API or local credentials."""
-
-from __future__ import annotations
-
-import logging
-import os
-
-import requests
-
-logger = logging.getLogger(__name__)
-
-# Configuration - MUST be set in .env file
-LDAP_API_BASE = os.environ.get("LDAP_API_URL", "")
-ADMIN_EMAILS = os.environ.get("ADMIN_EMAILS", "").lower().split(",")
-
-# Timeout for LDAP API requests
-LDAP_TIMEOUT = 10
-
-# Local authentication configuration (for development/testing)
-LOCAL_AUTH_ENABLED = os.environ.get("LOCAL_AUTH_ENABLED", "false").lower() in ("true", "1", "yes")
-LOCAL_AUTH_USERNAME = os.environ.get("LOCAL_AUTH_USERNAME", "")
-LOCAL_AUTH_PASSWORD = os.environ.get("LOCAL_AUTH_PASSWORD", "")
-
-
-def _authenticate_local(username: str, password: str) -> dict | None:
-    """Authenticate using local environment credentials.
-
-    Args:
-        username: User provided username
-        password: User provided password
-
-    Returns:
-        User info dict on success, None on failure
-    """
-    if not LOCAL_AUTH_ENABLED:
-        return None
-
-    if not LOCAL_AUTH_USERNAME or not LOCAL_AUTH_PASSWORD:
-        logger.warning("Local auth enabled but credentials not configured")
-        return None
-
-    if username == LOCAL_AUTH_USERNAME and password == LOCAL_AUTH_PASSWORD:
-        logger.info("Local auth success for user: %s", username)
-        return {
-            "username": username,
-            "displayName": f"Local User ({username})",
-            "mail": f"{username}@local.dev",
-            "department": "Development",
-        }
-
-    logger.warning("Local auth failed for user: %s", username)
-    return None
-
-
-def authenticate(username: str, password: str, domain: str = "PANJIT") -> dict | None:
-    """Authenticate user via local credentials or LDAP API.
-
-    If LOCAL_AUTH_ENABLED is set, tries local authentication first.
-    Falls back to LDAP API if local auth is disabled or fails.
-
-    Args:
-        username: Employee ID or email
-        password: User password
-        domain: Domain name (default: PANJIT)
-
-    Returns:
-        User info dict on success: {username, displayName, mail, department}
-        None on failure
-    """
-    # Try local authentication first if enabled
-    if LOCAL_AUTH_ENABLED:
-        local_result = _authenticate_local(username, password)
-        if local_result:
-            return local_result
-        # If local auth is enabled but failed, don't fall back to LDAP
-        # This ensures local-only mode when LOCAL_AUTH_ENABLED is true
-        return None
-
-    # LDAP authentication
-    try:
-        response = requests.post(
-            f"{LDAP_API_BASE}/api/v1/ldap/auth",
-            json={"username": username, "password": password, "domain": domain},
-            timeout=LDAP_TIMEOUT,
-        )
-        data = response.json()
-
-        if data.get("success"):
-            user = data.get("user", {})
-            logger.info("LDAP auth success for user: %s", user.get("username"))
-            return user
-
-        logger.warning("LDAP auth failed for user: %s", username)
-        return None
-
-    except requests.Timeout:
-        logger.error("LDAP API timeout for user: %s", username)
-        return None
-    except requests.RequestException as e:
-        logger.error("LDAP API error for user %s: %s", username, e)
-        return None
-    except (ValueError, KeyError) as e:
-        logger.error("LDAP API response parse error: %s", e)
-        return None
-
-
-def is_admin(user: dict) -> bool:
-    """Check if user is an admin.
-
-    Args:
-        user: User info dict with 'mail' field
-
-    Returns:
-        True if user email is in ADMIN_EMAILS list, or if local auth is enabled
-    """
-    # Local auth users are automatically admins (for development/testing)
-    if LOCAL_AUTH_ENABLED:
-        user_mail = user.get("mail", "")
-        if user_mail.endswith("@local.dev"):
-            return True
-
-    user_mail = user.get("mail", "").lower().strip()
-    return user_mail in [e.strip() for e in ADMIN_EMAILS]
+# -*- coding: utf-8 -*-
+"""Authentication service using LDAP API or local credentials."""
+
+from __future__ import annotations
+
+import logging
+import os
+from urllib.parse import urlparse
+
+import requests
+
+logger = logging.getLogger(__name__)
+
+# Timeout for LDAP API requests
+LDAP_TIMEOUT = 10
+
+# Configuration - MUST be set in .env file
+ADMIN_EMAILS = os.environ.get("ADMIN_EMAILS", "").lower().split(",")
+
+# Local authentication configuration (for development/testing)
+LOCAL_AUTH_ENABLED = os.environ.get("LOCAL_AUTH_ENABLED", "false").lower() in ("true", "1", "yes")
+LOCAL_AUTH_USERNAME = os.environ.get("LOCAL_AUTH_USERNAME", "")
+LOCAL_AUTH_PASSWORD = os.environ.get("LOCAL_AUTH_PASSWORD", "")
+
+# LDAP endpoint hardening configuration
+LDAP_API_URL = os.environ.get("LDAP_API_URL", "").strip()
+LDAP_ALLOWED_HOSTS_RAW = os.environ.get("LDAP_ALLOWED_HOSTS", "").strip()
+
+
+def _normalize_host(host: str) -> str:
+    return host.strip().lower().rstrip(".")
+
+
+def _parse_allowed_hosts(raw_hosts: str) -> tuple[str, ...]:
+    if not raw_hosts:
+        return tuple()
+
+    hosts: list[str] = []
+    for raw in raw_hosts.split(","):
+        host = _normalize_host(raw)
+        if host:
+            hosts.append(host)
+    return tuple(hosts)
+
+
+def _validate_ldap_api_url(raw_url: str, allowed_hosts: tuple[str, ...]) -> tuple[str | None, str | None]:
+    """Validate LDAP API URL to prevent configuration-based SSRF risks."""
+    url = (raw_url or "").strip()
+    if not url:
+        return None, "LDAP_API_URL is missing"
+
+    parsed = urlparse(url)
+    scheme = (parsed.scheme or "").lower()
+    host = _normalize_host(parsed.hostname or "")
+
+    if not host:
+        return None, f"LDAP_API_URL has no valid host: {url!r}"
+
+    if scheme != "https":
+        return None, f"LDAP_API_URL must use HTTPS: {url!r}"
+
+    effective_allowlist = allowed_hosts or (host,)
+    if host not in effective_allowlist:
+        return None, (
+            f"LDAP_API_URL host {host!r} is not allowlisted. "
+            f"Allowed hosts: {', '.join(effective_allowlist)}"
+        )
+
+    return url.rstrip("/"), None
+
+
+def _resolve_ldap_config() -> tuple[str | None, str | None, tuple[str, ...]]:
+    allowed_hosts = _parse_allowed_hosts(LDAP_ALLOWED_HOSTS_RAW)
+    api_base, error = _validate_ldap_api_url(LDAP_API_URL, allowed_hosts)
+
+    if api_base:
+        effective_hosts = allowed_hosts or (_normalize_host(urlparse(api_base).hostname or ""),)
+        return api_base, None, effective_hosts
+
+    return None, error, allowed_hosts
+
+
+LDAP_API_BASE, LDAP_CONFIG_ERROR, LDAP_ALLOWED_HOSTS = _resolve_ldap_config()
+
+
+def _authenticate_local(username: str, password: str) -> dict | None:
+    """Authenticate using local environment credentials.
+
+    Args:
+        username: User provided username
+        password: User provided password
+
+    Returns:
+        User info dict on success, None on failure
+    """
+    if not LOCAL_AUTH_ENABLED:
+        return None
+
+    if not LOCAL_AUTH_USERNAME or not LOCAL_AUTH_PASSWORD:
+        logger.warning("Local auth enabled but credentials not configured")
+        return None
+
+    if username == LOCAL_AUTH_USERNAME and password == LOCAL_AUTH_PASSWORD:
+        logger.info("Local auth success for user: %s", username)
+        return {
+            "username": username,
+            "displayName": f"Local User ({username})",
+            "mail": f"{username}@local.dev",
+            "department": "Development",
+        }
+
+    logger.warning("Local auth failed for user: %s", username)
+    return None
+
+
+def authenticate(username: str, password: str, domain: str = "PANJIT") -> dict | None:
+    """Authenticate user via local credentials or LDAP API.
+
+    If LOCAL_AUTH_ENABLED is set, tries local authentication first.
+    Falls back to LDAP API if local auth is disabled or fails.
+
+    Args:
+        username: Employee ID or email
+        password: User password
+        domain: Domain name (default: PANJIT)
+
+    Returns:
+        User info dict on success: {username, displayName, mail, department}
+        None on failure
+    """
+    # Try local authentication first if enabled
+    if LOCAL_AUTH_ENABLED:
+        local_result = _authenticate_local(username, password)
+        if local_result:
+            return local_result
+        # If local auth is enabled but failed, don't fall back to LDAP
+        # This ensures local-only mode when LOCAL_AUTH_ENABLED is true
+        return None
+
+    if LDAP_CONFIG_ERROR:
+        logger.error("LDAP authentication blocked: %s", LDAP_CONFIG_ERROR)
+        return None
+
+    if not LDAP_API_BASE:
+        logger.error("LDAP authentication blocked: LDAP_API_URL is not configured")
+        return None
+
+    # LDAP authentication
+    try:
+        response = requests.post(
+            f"{LDAP_API_BASE}/api/v1/ldap/auth",
+            json={"username": username, "password": password, "domain": domain},
+            timeout=LDAP_TIMEOUT,
+        )
+        data = response.json()
+
+        if data.get("success"):
+            user = data.get("user", {})
+            logger.info("LDAP auth success for user: %s", user.get("username"))
+            return user
+
+        logger.warning("LDAP auth failed for user: %s", username)
+        return None
+
+    except requests.Timeout:
+        logger.error("LDAP API timeout for user: %s", username)
+        return None
+    except requests.RequestException as e:
+        logger.error("LDAP API error for user %s: %s", username, e)
+        return None
+    except (ValueError, KeyError) as e:
+        logger.error("LDAP API response parse error: %s", e)
+        return None
+
+
+def is_admin(user: dict) -> bool:
+    """Check if user is an admin.
+
+    Args:
+        user: User info dict with 'mail' field
+
+    Returns:
+        True if user email is in ADMIN_EMAILS list, or if local auth is enabled
+    """
+    # Local auth users are automatically admins (for development/testing)
+    if LOCAL_AUTH_ENABLED:
+        user_mail = user.get("mail", "")
+        if user_mail.endswith("@local.dev"):
+            return True
+
+    user_mail = user.get("mail", "").lower().strip()
+    allowed_emails = [e.strip() for e in ADMIN_EMAILS if e and e.strip()]
+    return user_mail in allowed_emails

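The heart of the round-2 hardening above is `_validate_ldap_api_url`: the endpoint must be HTTPS and its host must be on the `LDAP_ALLOWED_HOSTS` allowlist, with an empty allowlist degrading to a one-entry allowlist of the URL's own host. A trimmed, runnable reproduction of that helper for illustration (hostnames here are placeholders):

```python
from urllib.parse import urlparse

def _normalize_host(host: str) -> str:
    return host.strip().lower().rstrip(".")

def validate_ldap_api_url(raw_url, allowed_hosts):
    """HTTPS + host-allowlist check, as in the hardened auth service."""
    url = (raw_url or "").strip()
    if not url:
        return None, "LDAP_API_URL is missing"
    parsed = urlparse(url)
    host = _normalize_host(parsed.hostname or "")
    if not host:
        return None, f"LDAP_API_URL has no valid host: {url!r}"
    if (parsed.scheme or "").lower() != "https":
        return None, f"LDAP_API_URL must use HTTPS: {url!r}"
    effective = allowed_hosts or (host,)   # empty allowlist -> trust URL's own host
    if host not in effective:
        return None, f"host {host!r} is not allowlisted"
    return url.rstrip("/"), None

base, err = validate_ldap_api_url("https://ldap.example.com/", ("ldap.example.com",))
print(base)   # https://ldap.example.com
print(validate_ldap_api_url("http://ldap.example.com", ())[1])  # HTTPS error message
```

Because validation happens once at import time (`_resolve_ldap_config`), a bad configuration disables LDAP authentication entirely rather than letting requests leak to an arbitrary endpoint.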
@@ -6,6 +6,7 @@ Data is loaded from database and cached in memory with periodic refresh.
 """

 import logging
+import os
 import threading
 from datetime import datetime, timedelta
 from typing import Optional, Dict, List, Any
@@ -19,8 +20,8 @@ logger = logging.getLogger('mes_dashboard.filter_cache')
 # ============================================================

 CACHE_TTL_SECONDS = 3600  # 1 hour cache TTL
-WIP_VIEW = "DWH.DW_MES_LOT_V"
-SPEC_WORKCENTER_VIEW = "DWH.DW_MES_SPEC_WORKCENTER_V"
+WIP_VIEW = os.getenv("FILTER_CACHE_WIP_VIEW", "DWH.DW_MES_LOT_V")
+SPEC_WORKCENTER_VIEW = os.getenv("FILTER_CACHE_SPEC_WORKCENTER_VIEW", "DWH.DW_MES_SPEC_WORKCENTER_V")

 # ============================================================
 # Cache Storage

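Making the view names env-resolvable lets tests and benchmark fixtures point the filter cache at stub views without monkeypatching the module. The resolution behaviour is just `os.getenv` with a default (the fixture view name below is hypothetical):

```python
import os

def resolve_view(env_name: str, default: str) -> str:
    """os.getenv-with-default, as the filter cache now resolves view names."""
    return os.getenv(env_name, default)

# Default when the variable is unset:
print(resolve_view("FILTER_CACHE_WIP_VIEW", "DWH.DW_MES_LOT_V"))

# A test harness can redirect the cache to a stub view:
os.environ["FILTER_CACHE_WIP_VIEW"] = "TEST.WIP_FIXTURE_V"
print(resolve_view("FILTER_CACHE_WIP_VIEW", "DWH.DW_MES_LOT_V"))  # TEST.WIP_FIXTURE_V
```

Note the lookup happens at module import, so the override must be exported before the filter cache module is first imported.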
@@ -1,12 +1,14 @@
 # -*- coding: utf-8 -*-
 """Page registry service for managing page access status."""

 from __future__ import annotations

 import json
 import logging
+import os
+import tempfile
 from pathlib import Path
 from threading import Lock

 logger = logging.getLogger(__name__)

@@ -34,20 +36,38 @@ def _load() -> dict:
     return _cache


-def _save(data: dict) -> None:
-    """Save page status configuration."""
-    global _cache
-    try:
-        DATA_FILE.parent.mkdir(parents=True, exist_ok=True)
-        DATA_FILE.write_text(
-            json.dumps(data, ensure_ascii=False, indent=2),
-            encoding="utf-8"
-        )
-        _cache = data
-        logger.debug("Saved page status to %s", DATA_FILE)
-    except OSError as e:
-        logger.error("Failed to save page status: %s", e)
-        raise
+def _save(data: dict) -> None:
+    """Save page status configuration."""
+    global _cache
+    tmp_path: Path | None = None
+    try:
+        DATA_FILE.parent.mkdir(parents=True, exist_ok=True)
+        payload = json.dumps(data, ensure_ascii=False, indent=2)
+
+        # Atomic write: write to sibling temp file, then replace target.
+        with tempfile.NamedTemporaryFile(
+            mode="w",
+            encoding="utf-8",
+            dir=str(DATA_FILE.parent),
+            prefix=f".{DATA_FILE.name}.",
+            suffix=".tmp",
+            delete=False,
+        ) as tmp:
+            tmp.write(payload)
+            tmp.flush()
+            os.fsync(tmp.fileno())
+            tmp_path = Path(tmp.name)
+        os.replace(tmp_path, DATA_FILE)
+        _cache = data
+        logger.debug("Saved page status to %s", DATA_FILE)
+    except OSError as e:
+        if tmp_path is not None:
+            try:
+                tmp_path.unlink(missing_ok=True)
+            except OSError:
+                pass
+        logger.error("Failed to save page status: %s", e)
+        raise


 def get_page_status(route: str) -> str | None:

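The rewritten `_save` follows the standard crash-safe write pattern: serialize, write to a temp file in the same directory (so `os.replace` never crosses a filesystem boundary), `fsync`, then atomically replace the target. Readers therefore always see either the old file or the complete new one, never a torn write. The same pattern as a self-contained helper:

```python
import json
import os
import tempfile
from pathlib import Path

def atomic_write_json(target: Path, data: dict) -> None:
    """Write JSON via sibling temp file + os.replace (same pattern as _save)."""
    target.parent.mkdir(parents=True, exist_ok=True)
    payload = json.dumps(data, ensure_ascii=False, indent=2)
    with tempfile.NamedTemporaryFile(
        mode="w", encoding="utf-8", dir=str(target.parent),
        prefix=f".{target.name}.", suffix=".tmp", delete=False,
    ) as tmp:
        tmp.write(payload)
        tmp.flush()
        os.fsync(tmp.fileno())          # durable before the rename
        tmp_path = Path(tmp.name)
    os.replace(tmp_path, target)        # atomic on POSIX and Windows

with tempfile.TemporaryDirectory() as d:
    f = Path(d) / "pages.json"
    atomic_write_json(f, {"/wip": "enabled"})
    print(json.loads(f.read_text(encoding="utf-8")))
```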
@@ -5,12 +5,14 @@ Provides cached equipment status from DW_MES_EQUIPMENTSTATUS_WIP_V.
 Data is synced periodically (default 5 minutes) and stored in Redis.
 """

-import json
-import logging
-import threading
-import time
-from datetime import datetime
-from typing import Any, Dict, List, Optional, Tuple
+import json
+import logging
+import os
+import threading
+import time
+from collections import OrderedDict
+from datetime import datetime
+from typing import Any

 from mes_dashboard.core.database import read_sql_df
 from mes_dashboard.core.redis_client import (
@@ -19,64 +21,110 @@ from mes_dashboard.core.redis_client import (
     try_acquire_lock,
     release_lock,
 )
-from mes_dashboard.config.constants import (
-    EQUIPMENT_STATUS_DATA_KEY,
-    EQUIPMENT_STATUS_INDEX_KEY,
-    EQUIPMENT_STATUS_META_UPDATED_KEY,
-    EQUIPMENT_STATUS_META_COUNT_KEY,
-    STATUS_CATEGORY_MAP,
-)
+from mes_dashboard.config.constants import (
+    EQUIPMENT_STATUS_DATA_KEY,
+    EQUIPMENT_STATUS_INDEX_KEY,
+    EQUIPMENT_STATUS_META_UPDATED_KEY,
+    EQUIPMENT_STATUS_META_COUNT_KEY,
+    STATUS_CATEGORY_MAP,
+)
+from mes_dashboard.services.sql_fragments import EQUIPMENT_STATUS_SELECT_SQL

-logger = logging.getLogger('mes_dashboard.realtime_equipment_cache')
-
-# ============================================================
-# Process-Level Cache (Prevents redundant JSON parsing)
-# ============================================================
-
-class _ProcessLevelCache:
-    """Thread-safe process-level cache for parsed equipment status data."""
-
-    def __init__(self, ttl_seconds: int = 30):
-        self._cache: Dict[str, Tuple[List[Dict[str, Any]], float]] = {}
-        self._lock = threading.Lock()
-        self._ttl = ttl_seconds
-
-    def get(self, key: str) -> Optional[List[Dict[str, Any]]]:
-        """Get cached data if not expired."""
-        with self._lock:
-            if key not in self._cache:
-                return None
-            data, timestamp = self._cache[key]
-            if time.time() - timestamp > self._ttl:
-                del self._cache[key]
-                return None
-            return data
-
-    def set(self, key: str, data: List[Dict[str, Any]]) -> None:
-        """Cache data with current timestamp."""
-        with self._lock:
-            self._cache[key] = (data, time.time())
-
-    def invalidate(self, key: str) -> None:
-        """Remove a key from cache."""
-        with self._lock:
-            self._cache.pop(key, None)
+logger = logging.getLogger('mes_dashboard.realtime_equipment_cache')
+
+# ============================================================
+# Process-Level Cache (Prevents redundant JSON parsing)
+# ============================================================
+
+DEFAULT_PROCESS_CACHE_TTL_SECONDS = 30
+DEFAULT_PROCESS_CACHE_MAX_SIZE = 32
+DEFAULT_LOOKUP_TTL_SECONDS = 30
+
+class _ProcessLevelCache:
+    """Thread-safe process-level cache for parsed equipment status data."""
+
+    def __init__(self, ttl_seconds: int = 30, max_size: int = 32):
+        self._cache: OrderedDict[str, tuple[list[dict[str, Any]], float]] = OrderedDict()
+        self._lock = threading.Lock()
+        self._ttl = max(int(ttl_seconds), 1)
+        self._max_size = max(int(max_size), 1)
+
+    @property
+    def max_size(self) -> int:
+        return self._max_size
+
+    def _evict_expired_locked(self, now: float) -> None:
+        stale_keys = [
+            key for key, (_, timestamp) in self._cache.items()
+            if now - timestamp > self._ttl
+        ]
+        for key in stale_keys:
+            self._cache.pop(key, None)
+
+    def get(self, key: str) -> list[dict[str, Any]] | None:
+        """Get cached data if not expired."""
+        with self._lock:
+            payload = self._cache.get(key)
+            if payload is None:
+                return None
+            data, timestamp = payload
+            now = time.time()
+            if now - timestamp > self._ttl:
+                self._cache.pop(key, None)
+                return None
+            self._cache.move_to_end(key, last=True)
+            return data
+
+    def set(self, key: str, data: list[dict[str, Any]]) -> None:
+        """Cache data with current timestamp."""
+        with self._lock:
+            now = time.time()
+            self._evict_expired_locked(now)
+            if key in self._cache:
+                self._cache.pop(key, None)
+            elif len(self._cache) >= self._max_size:
+                self._cache.popitem(last=False)
+            self._cache[key] = (data, now)
+            self._cache.move_to_end(key, last=True)
+
+    def invalidate(self, key: str) -> None:
+        """Remove a key from cache."""
+        with self._lock:
+            self._cache.pop(key, None)
+
+
+def _resolve_cache_max_size(env_name: str, default: int) -> int:
+    value = os.getenv(env_name)
+    if value is None:
+        return max(int(default), 1)
+    try:
+        return max(int(value), 1)
+    except (TypeError, ValueError):
+        return max(int(default), 1)


 # Global process-level cache for equipment status (30s TTL)
-_equipment_status_cache = _ProcessLevelCache(ttl_seconds=30)
+PROCESS_CACHE_MAX_SIZE = _resolve_cache_max_size("PROCESS_CACHE_MAX_SIZE", DEFAULT_PROCESS_CACHE_MAX_SIZE)
+EQUIPMENT_PROCESS_CACHE_MAX_SIZE = _resolve_cache_max_size(
+    "EQUIPMENT_PROCESS_CACHE_MAX_SIZE",
+    PROCESS_CACHE_MAX_SIZE,
+)
+_equipment_status_cache = _ProcessLevelCache(
+    ttl_seconds=DEFAULT_PROCESS_CACHE_TTL_SECONDS,
+    max_size=EQUIPMENT_PROCESS_CACHE_MAX_SIZE,
+)
 _equipment_status_parse_lock = threading.Lock()
+_equipment_lookup_lock = threading.Lock()
-_equipment_status_lookup: Dict[str, Dict[str, Any]] = {}
-_equipment_status_lookup_built_at: Optional[str] = None
+_equipment_status_lookup: dict[str, dict[str, Any]] = {}
+_equipment_status_lookup_built_at: str | None = None
+_equipment_status_lookup_ts: float = 0.0
||||
LOOKUP_TTL_SECONDS = 30
|
||||
LOOKUP_TTL_SECONDS = DEFAULT_LOOKUP_TTL_SECONDS
|
||||
|
||||
# ============================================================
|
||||
# Module State
|
||||
# ============================================================
|
||||
|
||||
_SYNC_THREAD: Optional[threading.Thread] = None
|
||||
_SYNC_THREAD: threading.Thread | None = None
|
||||
_STOP_EVENT = threading.Event()
|
||||
_SYNC_LOCK = threading.Lock()
|
||||
|
||||
@@ -85,40 +133,14 @@ _SYNC_LOCK = threading.Lock()
 # ============================================================
 # Oracle Query
 # ============================================================
 
-def _load_equipment_status_from_oracle() -> Optional[List[Dict[str, Any]]]:
+def _load_equipment_status_from_oracle() -> list[dict[str, Any]] | None:
     """Query DW_MES_EQUIPMENTSTATUS_WIP_V from Oracle.
 
     Returns:
         List of equipment status records, or None if query fails.
     """
-    sql = """
-        SELECT
-            RESOURCEID,
-            EQUIPMENTID,
-            OBJECTCATEGORY,
-            EQUIPMENTASSETSSTATUS,
-            EQUIPMENTASSETSSTATUSREASON,
-            JOBORDER,
-            JOBMODEL,
-            JOBSTAGE,
-            JOBID,
-            JOBSTATUS,
-            CREATEDATE,
-            CREATEUSERNAME,
-            CREATEUSER,
-            TECHNICIANUSERNAME,
-            TECHNICIANUSER,
-            SYMPTOMCODE,
-            CAUSECODE,
-            REPAIRCODE,
-            RUNCARDLOTID,
-            LOTTRACKINQTY_PCS,
-            LOTTRACKINTIME,
-            LOTTRACKINEMPLOYEE
-        FROM DWH.DW_MES_EQUIPMENTSTATUS_WIP_V
-    """
-    try:
-        df = read_sql_df(sql)
+    try:
+        df = read_sql_df(EQUIPMENT_STATUS_SELECT_SQL)
         if df is None or df.empty:
             logger.warning("No data returned from DW_MES_EQUIPMENTSTATUS_WIP_V")
             return []
@@ -147,7 +169,7 @@ def _load_equipment_status_from_oracle() -> Optional[List[Dict[str, Any]]]:
 # Data Aggregation
 # ============================================================
 
-def _classify_status(status: Optional[str]) -> str:
+def _classify_status(status: str | None) -> str:
     """Classify equipment status into category.
 
     Args:
@@ -183,7 +205,7 @@ def _is_valid_value(value) -> bool:
     return True
 
 
-def _aggregate_by_resourceid(records: List[Dict[str, Any]]) -> List[Dict[str, Any]]:
+def _aggregate_by_resourceid(records: list[dict[str, Any]]) -> list[dict[str, Any]]:
     """Aggregate equipment status records by RESOURCEID.
 
     For each RESOURCEID:
@@ -203,7 +225,7 @@ def _aggregate_by_resourceid(records: List[Dict[str, Any]]) -> List[Dict[str, An
         return []
 
     # Group by RESOURCEID
-    grouped: Dict[str, List[Dict[str, Any]]] = {}
+    grouped: dict[str, list[dict[str, Any]]] = {}
     for record in records:
         resource_id = record.get('RESOURCEID')
         if resource_id:
@@ -250,7 +272,7 @@ def _aggregate_by_resourceid(records: List[Dict[str, Any]]) -> List[Dict[str, An
 
         # Build aggregated record
         status = first.get('EQUIPMENTASSETSSTATUS')
-        aggregated.append({
+        aggregated.append({
             'RESOURCEID': resource_id,
             'EQUIPMENTID': first.get('EQUIPMENTID'),
             'OBJECTCATEGORY': first.get('OBJECTCATEGORY'),
@@ -270,11 +292,11 @@ def _aggregate_by_resourceid(records: List[Dict[str, Any]]) -> List[Dict[str, An
             'TECHNICIANUSER': first.get('TECHNICIANUSER'),
             'SYMPTOMCODE': first.get('SYMPTOMCODE'),
             'CAUSECODE': first.get('CAUSECODE'),
-            'REPAIRCODE': first.get('REPAIRCODE'),
-            # LOT related fields
-            'LOT_COUNT': len(seen_lots),  # Count distinct RUNCARDLOTID
-            'LOT_DETAILS': lot_details,  # LOT details for tooltip
-            'TOTAL_TRACKIN_QTY': total_qty,
+            'REPAIRCODE': first.get('REPAIRCODE'),
+            # LOT related fields
+            'LOT_COUNT': len(seen_lots) if seen_lots else len(group),
+            'LOT_DETAILS': lot_details,  # LOT details for tooltip
+            'TOTAL_TRACKIN_QTY': total_qty,
             'LATEST_TRACKIN_TIME': latest_trackin,
         })
 
@@ -286,7 +308,7 @@ def _aggregate_by_resourceid(records: List[Dict[str, Any]]) -> List[Dict[str, An
 # Redis Storage
 # ============================================================
 
-def _save_to_redis(aggregated: List[Dict[str, Any]]) -> bool:
+def _save_to_redis(aggregated: list[dict[str, Any]]) -> bool:
     """Save aggregated equipment status to Redis.
 
     Uses pipeline for atomic update of all keys.
@@ -354,7 +376,7 @@ def _invalidate_equipment_status_lookup() -> None:
    _equipment_status_lookup_ts = 0.0
 
 
-def get_equipment_status_lookup() -> Dict[str, Dict[str, Any]]:
+def get_equipment_status_lookup() -> dict[str, dict[str, Any]]:
     """Get RESOURCEID -> status record lookup with process-level caching."""
     global _equipment_status_lookup, _equipment_status_lookup_built_at, _equipment_status_lookup_ts
 
@@ -375,7 +397,7 @@ def get_equipment_status_lookup() -> Dict[str, Dict[str, Any]]:
         _equipment_status_lookup_ts = time.time()
         return _equipment_status_lookup
 
-def get_all_equipment_status() -> List[Dict[str, Any]]:
+def get_all_equipment_status() -> list[dict[str, Any]]:
     """Get all equipment status from cache with process-level caching.
 
     Uses a two-tier cache strategy:
@@ -433,7 +455,7 @@ def get_all_equipment_status() -> List[Dict[str, Any]]:
         return []
 
 
-def get_equipment_status_by_id(resource_id: str) -> Optional[Dict[str, Any]]:
+def get_equipment_status_by_id(resource_id: str) -> dict[str, Any] | None:
     """Get equipment status by RESOURCEID.
 
     Uses index hash for O(1) lookup.
@@ -485,7 +507,7 @@ def get_equipment_status_by_id(resource_id: str) -> Optional[Dict[str, Any]]:
     return None
 
 
-def get_equipment_status_by_ids(resource_ids: List[str]) -> List[Dict[str, Any]]:
+def get_equipment_status_by_ids(resource_ids: list[str]) -> list[dict[str, Any]]:
     """Get equipment status for multiple RESOURCEIDs.
 
     Args:
@@ -540,7 +562,7 @@ def get_equipment_status_by_ids(resource_ids: List[str]) -> List[Dict[str, Any]]
     return []
 
 
-def get_equipment_status_cache_status() -> Dict[str, Any]:
+def get_equipment_status_cache_status() -> dict[str, Any]:
     """Get equipment status cache status.
 
     Returns:
@@ -13,8 +13,9 @@ import logging
 import os
 import threading
 import time
+from collections import OrderedDict
 from datetime import datetime
-from typing import Any, Dict, List, Optional, Tuple
+from typing import Any
 
 import pandas as pd
 
@@ -31,9 +32,27 @@ from mes_dashboard.config.constants import (
     EQUIPMENT_TYPE_FILTER,
 )
 from mes_dashboard.sql import QueryBuilder
+from mes_dashboard.services.sql_fragments import (
+    RESOURCE_BASE_SELECT_TEMPLATE,
+    RESOURCE_VERSION_SELECT_TEMPLATE,
+)
 
 logger = logging.getLogger('mes_dashboard.resource_cache')
 
+ResourceRecord = dict[str, Any]
+RowPosition = int
+PositionBucket = dict[str, list[RowPosition]]
+FlagBuckets = dict[str, list[RowPosition]]
+ResourceIndex = dict[str, Any]
+
+DEFAULT_PROCESS_CACHE_TTL_SECONDS = 30
+DEFAULT_PROCESS_CACHE_MAX_SIZE = 32
+DEFAULT_RESOURCE_SYNC_INTERVAL_SECONDS = 14_400  # 4 hours
+DEFAULT_INDEX_VERSION_CHECK_INTERVAL_SECONDS = 5
+RESOURCE_DF_CACHE_KEY = "resource_data"
+TRUE_BUCKET = "1"
+FALSE_BUCKET = "0"
+
 # ============================================================
 # Process-Level Cache (Prevents redundant JSON parsing)
 # ============================================================
@@ -41,26 +60,49 @@ logger = logging.getLogger('mes_dashboard.resource_cache')
 class _ProcessLevelCache:
     """Thread-safe process-level cache for parsed DataFrames."""
 
-    def __init__(self, ttl_seconds: int = 30):
-        self._cache: Dict[str, Tuple[pd.DataFrame, float]] = {}
+    def __init__(self, ttl_seconds: int = DEFAULT_PROCESS_CACHE_TTL_SECONDS, max_size: int = DEFAULT_PROCESS_CACHE_MAX_SIZE):
+        self._cache: OrderedDict[str, tuple[pd.DataFrame, float]] = OrderedDict()
         self._lock = threading.Lock()
-        self._ttl = ttl_seconds
+        self._ttl = max(int(ttl_seconds), 1)
+        self._max_size = max(int(max_size), 1)
 
-    def get(self, key: str) -> Optional[pd.DataFrame]:
+    @property
+    def max_size(self) -> int:
+        return self._max_size
+
+    def _evict_expired_locked(self, now: float) -> None:
+        stale_keys = [
+            key for key, (_, timestamp) in self._cache.items()
+            if now - timestamp > self._ttl
+        ]
+        for key in stale_keys:
+            self._cache.pop(key, None)
+
+    def get(self, key: str) -> pd.DataFrame | None:
         """Get cached DataFrame if not expired."""
         with self._lock:
-            if key not in self._cache:
+            payload = self._cache.get(key)
+            if payload is None:
                 return None
-            df, timestamp = self._cache[key]
-            if time.time() - timestamp > self._ttl:
-                del self._cache[key]
+            df, timestamp = payload
+            now = time.time()
+            if now - timestamp > self._ttl:
+                self._cache.pop(key, None)
                 return None
+            self._cache.move_to_end(key, last=True)
             return df
 
     def set(self, key: str, df: pd.DataFrame) -> None:
         """Cache a DataFrame with current timestamp."""
         with self._lock:
-            self._cache[key] = (df, time.time())
+            now = time.time()
+            self._evict_expired_locked(now)
+            if key in self._cache:
+                self._cache.pop(key, None)
+            elif len(self._cache) >= self._max_size:
+                self._cache.popitem(last=False)
+            self._cache[key] = (df, now)
+            self._cache.move_to_end(key, last=True)
 
     def invalidate(self, key: str) -> None:
         """Remove a key from cache."""
@@ -68,11 +110,29 @@ class _ProcessLevelCache:
             self._cache.pop(key, None)
 
 
+def _resolve_cache_max_size(env_name: str, default: int) -> int:
+    value = os.getenv(env_name)
+    if value is None:
+        return max(int(default), 1)
+    try:
+        return max(int(value), 1)
+    except (TypeError, ValueError):
+        return max(int(default), 1)
+
+
 # Global process-level cache for resource data (30s TTL)
-_resource_df_cache = _ProcessLevelCache(ttl_seconds=30)
+PROCESS_CACHE_MAX_SIZE = _resolve_cache_max_size("PROCESS_CACHE_MAX_SIZE", DEFAULT_PROCESS_CACHE_MAX_SIZE)
+RESOURCE_PROCESS_CACHE_MAX_SIZE = _resolve_cache_max_size(
+    "RESOURCE_PROCESS_CACHE_MAX_SIZE",
+    PROCESS_CACHE_MAX_SIZE,
+)
+_resource_df_cache = _ProcessLevelCache(
+    ttl_seconds=DEFAULT_PROCESS_CACHE_TTL_SECONDS,
+    max_size=RESOURCE_PROCESS_CACHE_MAX_SIZE,
+)
 _resource_parse_lock = threading.Lock()
 _resource_index_lock = threading.Lock()
-_resource_index: Dict[str, Any] = {
+_resource_index: ResourceIndex = {
     "ready": False,
     "source": None,
     "version": None,
@@ -80,19 +140,27 @@ _resource_index: Dict[str, Any] = {
     "built_at": None,
     "version_checked_at": 0.0,
     "count": 0,
-    "records": [],
+    "all_positions": [],
     "by_resource_id": {},
     "by_workcenter": {},
     "by_family": {},
     "by_department": {},
     "by_location": {},
-    "by_is_production": {"1": [], "0": []},
-    "by_is_key": {"1": [], "0": []},
-    "by_is_monitor": {"1": [], "0": []},
+    "by_is_production": {TRUE_BUCKET: [], FALSE_BUCKET: []},
+    "by_is_key": {TRUE_BUCKET: [], FALSE_BUCKET: []},
+    "by_is_monitor": {TRUE_BUCKET: [], FALSE_BUCKET: []},
+    "memory": {
+        "frame_bytes": 0,
+        "index_bytes": 0,
+        "records_json_bytes": 0,
+        "bucket_entries": 0,
+        "amplification_ratio": 0.0,
+        "representation": "dataframe+row-index",
+    },
 }
 
 
-def _new_empty_index() -> Dict[str, Any]:
+def _new_empty_index() -> ResourceIndex:
     return {
         "ready": False,
         "source": None,
@@ -101,15 +169,23 @@ def _new_empty_index() -> Dict[str, Any]:
         "built_at": None,
         "version_checked_at": 0.0,
         "count": 0,
-        "records": [],
+        "all_positions": [],
         "by_resource_id": {},
         "by_workcenter": {},
         "by_family": {},
         "by_department": {},
         "by_location": {},
-        "by_is_production": {"1": [], "0": []},
-        "by_is_key": {"1": [], "0": []},
-        "by_is_monitor": {"1": [], "0": []},
+        "by_is_production": {TRUE_BUCKET: [], FALSE_BUCKET: []},
+        "by_is_key": {TRUE_BUCKET: [], FALSE_BUCKET: []},
+        "by_is_monitor": {TRUE_BUCKET: [], FALSE_BUCKET: []},
+        "memory": {
+            "frame_bytes": 0,
+            "index_bytes": 0,
+            "records_json_bytes": 0,
+            "bucket_entries": 0,
+            "amplification_ratio": 0.0,
+            "representation": "dataframe+row-index",
+        },
     }
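The `_resolve_cache_max_size` helper added above tolerates bad environment input: a missing or non-integer value falls back to the default, and any result is clamped to at least 1. A quick standalone check of those rules (the `DEMO_CACHE_MAX_SIZE` variable name is only for illustration):

```python
import os


def resolve_cache_max_size(env_name: str, default: int) -> int:
    # Same parsing rule as the diff's _resolve_cache_max_size:
    # fall back to the default on missing/invalid input, clamp to >= 1.
    value = os.getenv(env_name)
    if value is None:
        return max(int(default), 1)
    try:
        return max(int(value), 1)
    except (TypeError, ValueError):
        return max(int(default), 1)


os.environ["DEMO_CACHE_MAX_SIZE"] = "not-a-number"
print(resolve_cache_max_size("DEMO_CACHE_MAX_SIZE", 32))  # → 32 (fallback)
os.environ["DEMO_CACHE_MAX_SIZE"] = "0"
print(resolve_cache_max_size("DEMO_CACHE_MAX_SIZE", 32))  # → 1 (clamped)
```

Swallowing the `ValueError` keeps a typo in deployment config from crashing module import, at the cost of silently using the default.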
@@ -129,23 +205,59 @@ def _is_truthy_flag(value: Any) -> bool:
     return False
 
 
-def _bucket_append(bucket: Dict[str, List[Dict[str, Any]]], key: Any, record: Dict[str, Any]) -> None:
+def _bucket_append(bucket: PositionBucket, key: Any, row_position: RowPosition) -> None:
     if key is None:
         return
     if isinstance(key, float) and pd.isna(key):
         return
     key_str = str(key)
-    bucket.setdefault(key_str, []).append(record)
+    bucket.setdefault(key_str, []).append(int(row_position))
+
+
+def _estimate_dataframe_bytes(df: pd.DataFrame) -> int:
+    try:
+        return int(df.memory_usage(index=True, deep=True).sum())
+    except Exception:
+        return 0
+
+
+def _estimate_index_bytes(index: ResourceIndex) -> int:
+    """Estimate lightweight index memory footprint for telemetry."""
+    by_resource_id = index.get("by_resource_id", {})
+    by_workcenter = index.get("by_workcenter", {})
+    by_family = index.get("by_family", {})
+    by_department = index.get("by_department", {})
+    by_location = index.get("by_location", {})
+    by_is_production = index.get("by_is_production", {TRUE_BUCKET: [], FALSE_BUCKET: []})
+    by_is_key = index.get("by_is_key", {TRUE_BUCKET: [], FALSE_BUCKET: []})
+    by_is_monitor = index.get("by_is_monitor", {TRUE_BUCKET: [], FALSE_BUCKET: []})
+    all_positions = index.get("all_positions", [])
+
+    position_entries = (
+        len(all_positions)
+        + sum(len(v) for v in by_workcenter.values())
+        + sum(len(v) for v in by_family.values())
+        + sum(len(v) for v in by_department.values())
+        + sum(len(v) for v in by_location.values())
+        + len(by_is_production.get(TRUE_BUCKET, []))
+        + len(by_is_production.get(FALSE_BUCKET, []))
+        + len(by_is_key.get(TRUE_BUCKET, []))
+        + len(by_is_key.get(FALSE_BUCKET, []))
+        + len(by_is_monitor.get(TRUE_BUCKET, []))
+        + len(by_is_monitor.get(FALSE_BUCKET, []))
+    )
+    # Approximate integer/list/dict overhead; telemetry only needs directional signal.
+    return int(position_entries * 8 + len(by_resource_id) * 64)
 
 
 def _build_resource_index(
     df: pd.DataFrame,
     *,
     source: str,
-    version: Optional[str],
-    updated_at: Optional[str],
-) -> Dict[str, Any]:
-    records = df.to_dict(orient='records')
+    version: str | None,
+    updated_at: str | None,
+) -> ResourceIndex:
+    normalized_df = df.reset_index(drop=True)
     index = _new_empty_index()
     index["ready"] = True
     index["source"] = source
@@ -153,31 +265,58 @@ def _build_resource_index(
     index["updated_at"] = updated_at
     index["built_at"] = datetime.now().isoformat()
     index["version_checked_at"] = time.time()
-    index["count"] = len(records)
-    index["records"] = records
+    index["count"] = len(normalized_df)
+    index["all_positions"] = list(range(len(normalized_df)))
 
-    for record in records:
+    for row_position, record in normalized_df.iterrows():
         resource_id = record.get("RESOURCEID")
         if resource_id is not None and not (isinstance(resource_id, float) and pd.isna(resource_id)):
-            index["by_resource_id"][str(resource_id)] = record
+            index["by_resource_id"][str(resource_id)] = int(row_position)
 
-        _bucket_append(index["by_workcenter"], record.get("WORKCENTERNAME"), record)
-        _bucket_append(index["by_family"], record.get("RESOURCEFAMILYNAME"), record)
-        _bucket_append(index["by_department"], record.get("PJ_DEPARTMENT"), record)
-        _bucket_append(index["by_location"], record.get("LOCATIONNAME"), record)
+        _bucket_append(index["by_workcenter"], record.get("WORKCENTERNAME"), row_position)
+        _bucket_append(index["by_family"], record.get("RESOURCEFAMILYNAME"), row_position)
+        _bucket_append(index["by_department"], record.get("PJ_DEPARTMENT"), row_position)
+        _bucket_append(index["by_location"], record.get("LOCATIONNAME"), row_position)
 
-        index["by_is_production"]["1" if _is_truthy_flag(record.get("PJ_ISPRODUCTION")) else "0"].append(record)
-        index["by_is_key"]["1" if _is_truthy_flag(record.get("PJ_ISKEY")) else "0"].append(record)
-        index["by_is_monitor"]["1" if _is_truthy_flag(record.get("PJ_ISMONITOR")) else "0"].append(record)
+        index["by_is_production"][TRUE_BUCKET if _is_truthy_flag(record.get("PJ_ISPRODUCTION")) else FALSE_BUCKET].append(int(row_position))
+        index["by_is_key"][TRUE_BUCKET if _is_truthy_flag(record.get("PJ_ISKEY")) else FALSE_BUCKET].append(int(row_position))
+        index["by_is_monitor"][TRUE_BUCKET if _is_truthy_flag(record.get("PJ_ISMONITOR")) else FALSE_BUCKET].append(int(row_position))
 
+    bucket_entries = (
+        sum(len(v) for v in index["by_workcenter"].values())
+        + sum(len(v) for v in index["by_family"].values())
+        + sum(len(v) for v in index["by_department"].values())
+        + sum(len(v) for v in index["by_location"].values())
+        + len(index["by_is_production"][TRUE_BUCKET])
+        + len(index["by_is_production"][FALSE_BUCKET])
+        + len(index["by_is_key"][TRUE_BUCKET])
+        + len(index["by_is_key"][FALSE_BUCKET])
+        + len(index["by_is_monitor"][TRUE_BUCKET])
+        + len(index["by_is_monitor"][FALSE_BUCKET])
+    )
+    frame_bytes = _estimate_dataframe_bytes(normalized_df)
+    index_bytes = _estimate_index_bytes(index)
+    amplification_ratio = round(
+        (frame_bytes + index_bytes) / max(frame_bytes, 1),
+        4,
+    )
+    index["memory"] = {
+        "frame_bytes": int(frame_bytes),
+        "index_bytes": int(index_bytes),
+        "records_json_bytes": 0,  # kept for backward-compatible telemetry shape
+        "bucket_entries": int(bucket_entries),
+        "amplification_ratio": amplification_ratio,
+        "representation": "dataframe+row-index",
+    }
 
     return index
 
 
 def _index_matches(
-    current: Dict[str, Any],
+    current: ResourceIndex,
     *,
     source: str,
-    version: Optional[str],
+    version: str | None,
     row_count: int,
 ) -> bool:
     if not current.get("ready"):
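`_build_resource_index` above reports an amplification ratio of (frame bytes + index bytes) / frame bytes, rounded to four places, so health telemetry can flag when the derived index starts costing a noticeable fraction of the DataFrame it serves. The arithmetic in isolation:

```python
def amplification_ratio(frame_bytes: int, index_bytes: int) -> float:
    # Guard against a zero-byte frame with max(frame_bytes, 1),
    # matching the rounding and guard used in _build_resource_index.
    return round((frame_bytes + index_bytes) / max(frame_bytes, 1), 4)


# A 1 MB frame with a 250 KB index amplifies memory by 1.25x.
print(amplification_ratio(1_000_000, 250_000))  # → 1.25
```

A ratio near 1.0 means the index is nearly free; the benchmark gate mentioned in the commit history can then assert an upper bound on this number.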
@@ -193,8 +332,8 @@ def _ensure_resource_index(
     df: pd.DataFrame,
     *,
     source: str,
-    version: Optional[str] = None,
-    updated_at: Optional[str] = None,
+    version: str | None = None,
+    updated_at: str | None = None,
 ) -> None:
     global _resource_index
     with _resource_index_lock:
@@ -212,12 +351,12 @@ def _ensure_resource_index(
         _resource_index = new_index
 
 
-def _get_resource_index() -> Dict[str, Any]:
+def _get_resource_index() -> ResourceIndex:
     with _resource_index_lock:
         return _resource_index
 
 
-def _get_cache_meta(client=None) -> Tuple[Optional[str], Optional[str]]:
+def _get_cache_meta(client=None) -> tuple[str | None, str | None]:
     redis_client = client or get_redis_client()
     if redis_client is None:
         return None, None
@@ -244,31 +383,59 @@ def _redis_data_available(client=None) -> bool:
     return False
 
 
-def _pick_bucket_records(
-    bucket: Dict[str, List[Dict[str, Any]]],
-    keys: List[Any],
-) -> List[Dict[str, Any]]:
-    seen: set[str] = set()
-    result: List[Dict[str, Any]] = []
+def _pick_bucket_positions(
+    bucket: PositionBucket,
+    keys: list[Any],
+) -> list[RowPosition]:
+    seen: set[int] = set()
+    result: list[int] = []
     for key in keys:
-        for record in bucket.get(str(key), []):
-            rid = record.get("RESOURCEID")
-            rid_key = str(rid) if rid is not None else str(id(record))
-            if rid_key in seen:
+        for row_position in bucket.get(str(key), []):
+            normalized = int(row_position)
+            if normalized in seen:
                 continue
-            seen.add(rid_key)
-            result.append(record)
+            seen.add(normalized)
+            result.append(normalized)
     return result
 
 
+def _records_from_positions(df: pd.DataFrame, positions: list[RowPosition]) -> list[ResourceRecord]:
+    if not positions:
+        return []
+    unique_positions = sorted({int(pos) for pos in positions if 0 <= int(pos) < len(df)})
+    if not unique_positions:
+        return []
+    return df.iloc[unique_positions].to_dict(orient='records')
+
+
+def _records_from_index(index: ResourceIndex, positions: list[RowPosition] | None = None) -> list[ResourceRecord]:
+    if not index.get("ready"):
+        return []
+    df = _resource_df_cache.get(RESOURCE_DF_CACHE_KEY)
+    if df is None:
+        legacy_records = index.get("records")
+        if isinstance(legacy_records, list):
+            if positions is None:
+                return list(legacy_records)
+            selected = [legacy_records[int(pos)] for pos in positions if 0 <= int(pos) < len(legacy_records)]
+            return selected
+        return []
+    selected_positions = positions if positions is not None else index.get("all_positions", [])
+    if not selected_positions:
+        selected_positions = list(range(len(df)))
+    return _records_from_positions(df, selected_positions)
+
 # ============================================================
 # Configuration
 # ============================================================
 
 RESOURCE_CACHE_ENABLED = os.getenv('RESOURCE_CACHE_ENABLED', 'true').lower() == 'true'
-RESOURCE_SYNC_INTERVAL = int(os.getenv('RESOURCE_SYNC_INTERVAL', '14400'))  # 4 hours
+RESOURCE_SYNC_INTERVAL = int(
+    os.getenv('RESOURCE_SYNC_INTERVAL', str(DEFAULT_RESOURCE_SYNC_INTERVAL_SECONDS))
+)
 RESOURCE_INDEX_VERSION_CHECK_INTERVAL = int(
-    os.getenv('RESOURCE_INDEX_VERSION_CHECK_INTERVAL', '5')
-)  # seconds
+    os.getenv('RESOURCE_INDEX_VERSION_CHECK_INTERVAL', str(DEFAULT_INDEX_VERSION_CHECK_INTERVAL_SECONDS))
+)
 
 # Redis key helpers
 def _get_key(key: str) -> str:
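After this change the buckets hold integer row positions rather than record dicts, so deduplication becomes a cheap set-of-ints check instead of comparing RESOURCEID strings. A standalone sketch of the same first-seen-order dedup, with a hypothetical `bucket` payload:

```python
def pick_bucket_positions(bucket: dict[str, list[int]], keys: list[object]) -> list[int]:
    # Mirrors _pick_bucket_positions: preserve first-seen order while
    # dropping positions already emitted for an earlier key.
    seen: set[int] = set()
    result: list[int] = []
    for key in keys:
        for pos in bucket.get(str(key), []):
            if pos in seen:
                continue
            seen.add(pos)
            result.append(pos)
    return result


# Row 1 belongs to both workcenters but is returned only once.
bucket = {"W1": [0, 1], "W2": [1, 2]}
print(pick_bucket_positions(bucket, ["W1", "W2"]))  # → [0, 1, 2]
```

The resulting positions can then be turned back into records in a single `df.iloc[...]` call, which is the job of `_records_from_positions` above.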
@@ -313,14 +480,14 @@ def _build_filter_builder() -> QueryBuilder:
     return builder
 
 
-def _load_from_oracle() -> Optional[pd.DataFrame]:
+def _load_from_oracle() -> pd.DataFrame | None:
     """Load the full table from Oracle (applying global filters).
 
     Returns:
         DataFrame with all columns, or None if query failed.
     """
     builder = _build_filter_builder()
-    builder.base_sql = "SELECT * FROM DWH.DW_MES_RESOURCE {{ WHERE_CLAUSE }}"
+    builder.base_sql = RESOURCE_BASE_SELECT_TEMPLATE
     sql, params = builder.build()
 
     try:
@@ -333,14 +500,14 @@ def _load_from_oracle() -> Optional[pd.DataFrame]:
     return None
 
 
-def _get_version_from_oracle() -> Optional[str]:
+def _get_version_from_oracle() -> str | None:
     """Get the Oracle data version (MAX(LASTCHANGEDATE)).
 
     Returns:
         Version string (ISO format), or None if query failed.
     """
     builder = _build_filter_builder()
-    builder.base_sql = "SELECT MAX(LASTCHANGEDATE) as VERSION FROM DWH.DW_MES_RESOURCE {{ WHERE_CLAUSE }}"
+    builder.base_sql = RESOURCE_VERSION_SELECT_TEMPLATE
     sql, params = builder.build()
 
     try:
@@ -361,7 +528,7 @@ def _get_version_from_oracle() -> Optional[str]:
 # Internal: Redis Functions
 # ============================================================
 
-def _get_version_from_redis() -> Optional[str]:
+def _get_version_from_redis() -> str | None:
     """Get the cached version from Redis.
 
     Returns:
@@ -411,7 +578,7 @@ def _sync_to_redis(df: pd.DataFrame, version: str) -> bool:
         pipe.execute()
 
         # Invalidate process-level cache so next request picks up new data
-        _resource_df_cache.invalidate("resource_data")
+        _resource_df_cache.invalidate(RESOURCE_DF_CACHE_KEY)
         _invalidate_resource_index()
 
         logger.info(f"Resource cache synced: {len(df)} rows, version={version}")
@@ -421,7 +588,7 @@ def _sync_to_redis(df: pd.DataFrame, version: str) -> bool:
         return False
 
 
-def _get_cached_data() -> Optional[pd.DataFrame]:
+def _get_cached_data() -> pd.DataFrame | None:
     """Get cached resource data from Redis with process-level caching.
 
     Uses a two-tier cache strategy:
@@ -433,21 +600,25 @@ def _get_cached_data() -> Optional[pd.DataFrame]:
     Returns:
         DataFrame with resource data, or None if cache miss.
     """
-    cache_key = "resource_data"
+    cache_key = RESOURCE_DF_CACHE_KEY
 
     # Tier 1: Check process-level cache first (fast path)
     cached_df = _resource_df_cache.get(cache_key)
     if cached_df is not None:
-        if not _get_resource_index().get("ready"):
-            version, updated_at = _get_cache_meta()
-            _ensure_resource_index(
-                cached_df,
-                source="redis",
-                version=version,
-                updated_at=updated_at,
-            )
-        logger.debug(f"Process cache hit: {len(cached_df)} rows")
-        return cached_df
+        if REDIS_ENABLED and RESOURCE_CACHE_ENABLED and not _redis_data_available():
+            _resource_df_cache.invalidate(cache_key)
+            _invalidate_resource_index()
+        else:
+            if not _get_resource_index().get("ready"):
+                version, updated_at = _get_cache_meta()
+                _ensure_resource_index(
+                    cached_df,
+                    source="redis",
+                    version=version,
+                    updated_at=updated_at,
+                )
+            logger.debug(f"Process cache hit: {len(cached_df)} rows")
+            return cached_df
 
     # Tier 2: Parse from Redis (slow path - needs lock)
     if not REDIS_ENABLED or not RESOURCE_CACHE_ENABLED:
@@ -568,7 +739,7 @@ def init_cache() -> None:
         logger.error(f"Failed to init resource cache: {e}")
 
 
-def get_cache_status() -> Dict[str, Any]:
+def get_cache_status() -> dict[str, Any]:
     """Get cache status information.
 
     Returns:
@@ -611,9 +782,10 @@ def get_cache_status() -> Dict[str, Any]:
 # Query API
 # ============================================================
 
-def get_resource_index_status() -> Dict[str, Any]:
+def get_resource_index_status() -> dict[str, Any]:
     """Get process-level derived index telemetry."""
     index = _get_resource_index()
+    memory = index.get("memory") or {}
     built_at = index.get("built_at")
     age_seconds = None
     if built_at:
@@ -630,19 +802,32 @@ def get_resource_index_status() -> Dict[str, Any]:
|
||||
"built_at": built_at,
|
||||
"count": int(index.get("count", 0)),
|
||||
"age_seconds": round(age_seconds, 3) if age_seconds is not None else None,
|
||||
"memory": {
|
||||
"frame_bytes": int(memory.get("frame_bytes", 0)),
|
||||
"index_bytes": int(memory.get("index_bytes", 0)),
|
||||
"records_json_bytes": int(memory.get("records_json_bytes", 0)),
|
||||
"bucket_entries": int(memory.get("bucket_entries", 0)),
|
||||
"amplification_ratio": float(memory.get("amplification_ratio", 0.0)),
|
||||
"representation": str(memory.get("representation", "unknown")),
|
||||
},
|
||||
}
|
||||
|
||||
|
||||
def get_resource_index_snapshot() -> Dict[str, Any]:
def get_resource_index_snapshot() -> ResourceIndex:
    """Get derived resource index snapshot, rebuilding if needed."""
    index = _get_resource_index()
    if index.get("ready"):
        if index.get("source") == "redis":
            if not _redis_data_available():
                _resource_df_cache.invalidate(RESOURCE_DF_CACHE_KEY)
                _invalidate_resource_index()
                index = _get_resource_index()

    # If Redis metadata version is missing, verify payload existence on every call.
    # This avoids serving stale in-process index when Redis payload is evicted.
    if not index.get("version"):
    if index.get("ready") and not index.get("version"):
        if not _redis_data_available():
            _resource_df_cache.invalidate("resource_data")
            _resource_df_cache.invalidate(RESOURCE_DF_CACHE_KEY)
            _invalidate_resource_index()
            index = _get_resource_index()
    else:
@@ -661,7 +846,7 @@ def get_resource_index_snapshot() -> Dict[str, Any]:
                current_version,
                latest_version,
            )
            _resource_df_cache.invalidate("resource_data")
            _resource_df_cache.invalidate(RESOURCE_DF_CACHE_KEY)
            _invalidate_resource_index()
            index = _get_resource_index()
        else:
@@ -678,6 +863,7 @@ def get_resource_index_snapshot() -> Dict[str, Any]:

    df = _get_cached_data()
    if df is not None:
        _resource_df_cache.set(RESOURCE_DF_CACHE_KEY, df.reset_index(drop=True))
        version, updated_at = _get_cache_meta()
        _ensure_resource_index(
            df,
@@ -690,6 +876,8 @@ def get_resource_index_snapshot() -> Dict[str, Any]:
    logger.info("Resource cache miss while building index, falling back to Oracle")
    oracle_df = _load_from_oracle()
    if oracle_df is None:
        _resource_df_cache.invalidate(RESOURCE_DF_CACHE_KEY)
        _invalidate_resource_index()
        return _new_empty_index()

    _ensure_resource_index(
@@ -698,9 +886,11 @@ def get_resource_index_snapshot() -> Dict[str, Any]:
        version=None,
        updated_at=datetime.now().isoformat(),
    )
    _resource_df_cache.set(RESOURCE_DF_CACHE_KEY, oracle_df.reset_index(drop=True))
    return _get_resource_index()

def get_all_resources() -> List[Dict]:
def get_all_resources() -> list[ResourceRecord]:
    """Get all cached resource records (all columns).

    Falls back to Oracle if cache unavailable.
@@ -709,11 +899,10 @@ def get_all_resources() -> List[Dict]:
        List of resource dicts.
    """
    index = get_resource_index_snapshot()
    records = index.get("records", [])
    return list(records)
    return _records_from_index(index)


def get_resource_by_id(resource_id: str) -> Optional[Dict]:
def get_resource_by_id(resource_id: str) -> ResourceRecord | None:
    """Get a single resource record by RESOURCEID.

    Args:
@@ -725,10 +914,12 @@ def get_resource_by_id(resource_id: str) -> Optional[Dict]:
    if not resource_id:
        return None
    index = get_resource_index_snapshot()
    by_id = index.get("by_resource_id", {})
    row = by_id.get(str(resource_id))
    if row is not None:
        return row
    by_id: dict[str, RowPosition] = index.get("by_resource_id", {})
    row_position = by_id.get(str(resource_id))
    if row_position is not None:
        rows = _records_from_index(index, [int(row_position)])
        if rows:
            return rows[0]

    # Backward-compatible fallback for call sites/tests that patch get_all_resources.
    target = str(resource_id)
@@ -738,7 +929,7 @@ def get_resource_by_id(resource_id: str) -> Optional[Dict]:
    return None


def get_resources_by_ids(resource_ids: List[str]) -> List[Dict]:
def get_resources_by_ids(resource_ids: list[str]) -> list[ResourceRecord]:
    """Batch-fetch resource records by a list of RESOURCEIDs.

    Args:
@@ -747,20 +938,28 @@ def get_resources_by_ids(resource_ids: List[str]) -> List[Dict]:
    Returns:
        List of matching resource dicts.
    """
    index = get_resource_index_snapshot()
    by_id: dict[str, RowPosition] = index.get("by_resource_id", {})
    positions = [by_id[str(resource_id)] for resource_id in resource_ids if str(resource_id) in by_id]
    if positions:
        rows = _records_from_index(index, positions)
        if rows:
            return rows

    # Backward-compatible fallback for call sites/tests that patch get_all_resources.
    id_set = set(resource_ids)
    resources = get_all_resources()
    return [r for r in resources if r.get('RESOURCEID') in id_set]
    return [r for r in get_all_resources() if r.get('RESOURCEID') in id_set]

def get_resources_by_filter(
    workcenters: Optional[List[str]] = None,
    families: Optional[List[str]] = None,
    departments: Optional[List[str]] = None,
    locations: Optional[List[str]] = None,
    is_production: Optional[bool] = None,
    is_key: Optional[bool] = None,
    is_monitor: Optional[bool] = None,
) -> List[Dict]:
    workcenters: list[str] | None = None,
    families: list[str] | None = None,
    departments: list[str] | None = None,
    locations: list[str] | None = None,
    is_production: bool | None = None,
    is_key: bool | None = None,
    is_monitor: bool | None = None,
) -> list[ResourceRecord]:
    """Filter resource records by criteria (filtered in Python).

    Args:
@@ -775,42 +974,79 @@ def get_resources_by_filter(
    Returns:
        List of matching resource dicts.
    """
    resources = get_all_resources()

    result = []
    for r in resources:
        # Apply filters
        if workcenters and r.get('WORKCENTERNAME') not in workcenters:
            continue
        if families and r.get('RESOURCEFAMILYNAME') not in families:
            continue
        if departments and r.get('PJ_DEPARTMENT') not in departments:
            continue
        if locations and r.get('LOCATIONNAME') not in locations:
            continue
        if is_production is not None:
            val = r.get('PJ_ISPRODUCTION')
            if (val == 1) != is_production:
                continue
        if is_key is not None:
            val = r.get('PJ_ISKEY')
            if (val == 1) != is_key:
                continue
        if is_monitor is not None:
            val = r.get('PJ_ISMONITOR')
            if (val == 1) != is_monitor:
                continue
        result.append(r)

    return result
    def _filter_from_records(resources: list[ResourceRecord]) -> list[ResourceRecord]:
        result: list[ResourceRecord] = []
        for r in resources:
            if workcenters and r.get('WORKCENTERNAME') not in workcenters:
                continue
            if families and r.get('RESOURCEFAMILYNAME') not in families:
                continue
            if departments and r.get('PJ_DEPARTMENT') not in departments:
                continue
            if locations and r.get('LOCATIONNAME') not in locations:
                continue
            if is_production is not None and (r.get('PJ_ISPRODUCTION') == 1) != is_production:
                continue
            if is_key is not None and (r.get('PJ_ISKEY') == 1) != is_key:
                continue
            if is_monitor is not None and (r.get('PJ_ISMONITOR') == 1) != is_monitor:
                continue
            result.append(r)
        return result

    index = get_resource_index_snapshot()
    if not index.get("ready"):
        return _filter_from_records(get_all_resources())
    if _resource_df_cache.get(RESOURCE_DF_CACHE_KEY) is None:
        return _filter_from_records(get_all_resources())

    candidate_positions: set[int] = set(int(pos) for pos in index.get("all_positions", []))
    if not candidate_positions:
        return []

    def _intersect_with_positions(selected: list[int] | None) -> None:
        nonlocal candidate_positions
        if selected is None:
            return
        candidate_positions &= set(int(item) for item in selected)

    if workcenters:
        _intersect_with_positions(
            _pick_bucket_positions(index.get("by_workcenter", {}), workcenters)
        )
    if families:
        _intersect_with_positions(
            _pick_bucket_positions(index.get("by_family", {}), families)
        )
    if departments:
        _intersect_with_positions(
            _pick_bucket_positions(index.get("by_department", {}), departments)
        )
    if locations:
        _intersect_with_positions(
            _pick_bucket_positions(index.get("by_location", {}), locations)
        )
    if is_production is not None:
        _intersect_with_positions(
            index.get("by_is_production", {}).get(TRUE_BUCKET if is_production else FALSE_BUCKET, [])
        )
    if is_key is not None:
        _intersect_with_positions(
            index.get("by_is_key", {}).get(TRUE_BUCKET if is_key else FALSE_BUCKET, [])
        )
    if is_monitor is not None:
        _intersect_with_positions(
            index.get("by_is_monitor", {}).get(TRUE_BUCKET if is_monitor else FALSE_BUCKET, [])
        )

    return _records_from_index(index, sorted(candidate_positions))

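The bucket-intersection strategy in `get_resources_by_filter` can be sketched with plain dicts and sets. The record fields follow the service's schema; the index shapes and helper names here are illustrative, not the project's real `ResourceIndex`:

```python
# Build value -> row-position buckets once (the "index"), then narrow a
# candidate set with set intersection instead of scanning every record.
records = [
    {"RESOURCEID": "R1", "WORKCENTERNAME": "WC-A", "PJ_ISKEY": 1},
    {"RESOURCEID": "R2", "WORKCENTERNAME": "WC-A", "PJ_ISKEY": 0},
    {"RESOURCEID": "R3", "WORKCENTERNAME": "WC-B", "PJ_ISKEY": 1},
]

by_workcenter: dict[str, list[int]] = {}
by_is_key: dict[bool, list[int]] = {}
for pos, rec in enumerate(records):
    by_workcenter.setdefault(rec["WORKCENTERNAME"], []).append(pos)
    by_is_key.setdefault(rec["PJ_ISKEY"] == 1, []).append(pos)

def filter_positions(workcenters=None, is_key=None):
    # Start from all positions; each active filter shrinks the set.
    candidates = set(range(len(records)))
    if workcenters:
        selected: set[int] = set()
        for wc in workcenters:
            selected.update(by_workcenter.get(wc, []))
        candidates &= selected
    if is_key is not None:
        candidates &= set(by_is_key.get(is_key, []))
    return [records[pos] for pos in sorted(candidates)]

print([r["RESOURCEID"] for r in filter_positions(workcenters=["WC-A"], is_key=True)])
# → ['R1']
```

The sorted-position materialization at the end mirrors `_records_from_index(index, sorted(candidate_positions))`, which keeps output order stable regardless of filter order.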
# ============================================================
# Distinct Values API (for filters)
# ============================================================

def get_distinct_values(column: str) -> List[str]:
def get_distinct_values(column: str) -> list[str]:
    """Get the sorted distinct values of a column.

    Args:
@@ -833,26 +1069,26 @@ def get_distinct_values(column: str) -> List[str]:
    return sorted(values)


def get_resource_families() -> List[str]:
def get_resource_families() -> list[str]:
    """Get the resource family list (convenience helper)."""
    return get_distinct_values('RESOURCEFAMILYNAME')


def get_workcenters() -> List[str]:
def get_workcenters() -> list[str]:
    """Get the workcenter list (convenience helper)."""
    return get_distinct_values('WORKCENTERNAME')


def get_departments() -> List[str]:
def get_departments() -> list[str]:
    """Get the department list (convenience helper)."""
    return get_distinct_values('PJ_DEPARTMENT')


def get_locations() -> List[str]:
def get_locations() -> list[str]:
    """Get the location list (convenience helper)."""
    return get_distinct_values('LOCATIONNAME')


def get_vendors() -> List[str]:
def get_vendors() -> list[str]:
    """Get the vendor list (convenience helper)."""
    return get_distinct_values('VENDORNAME')

46  src/mes_dashboard/services/sql_fragments.py  Normal file
@@ -0,0 +1,46 @@
# -*- coding: utf-8 -*-
"""Shared SQL fragments/constants for cache-oriented services.

Centralizing common Oracle table/view references reduces drift across
resource/equipment cache implementations.
"""

from __future__ import annotations

RESOURCE_TABLE = "DWH.DW_MES_RESOURCE"
RESOURCE_BASE_SELECT_TEMPLATE = f"SELECT * FROM {RESOURCE_TABLE} {{WHERE_CLAUSE}}"
RESOURCE_VERSION_SELECT_TEMPLATE = (
    f"SELECT MAX(LASTCHANGEDATE) as VERSION FROM {RESOURCE_TABLE} {{WHERE_CLAUSE}}"
)

EQUIPMENT_STATUS_VIEW = "DWH.DW_MES_EQUIPMENTSTATUS_WIP_V"
EQUIPMENT_STATUS_COLUMNS: tuple[str, ...] = (
    "RESOURCEID",
    "EQUIPMENTID",
    "OBJECTCATEGORY",
    "EQUIPMENTASSETSSTATUS",
    "EQUIPMENTASSETSSTATUSREASON",
    "JOBORDER",
    "JOBMODEL",
    "JOBSTAGE",
    "JOBID",
    "JOBSTATUS",
    "CREATEDATE",
    "CREATEUSERNAME",
    "CREATEUSER",
    "TECHNICIANUSERNAME",
    "TECHNICIANUSER",
    "SYMPTOMCODE",
    "CAUSECODE",
    "REPAIRCODE",
    "RUNCARDLOTID",
    "LOTTRACKINQTY_PCS",
    "LOTTRACKINTIME",
    "LOTTRACKINEMPLOYEE",
)

EQUIPMENT_STATUS_SELECT_SQL = (
    "SELECT\n    "
    + ",\n    ".join(EQUIPMENT_STATUS_COLUMNS)
    + f"\nFROM {EQUIPMENT_STATUS_VIEW}"
)
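A note on the template constants: inside an f-string, `{{WHERE_CLAUSE}}` renders as the literal text `{WHERE_CLAUSE}`, leaving a placeholder for a later `str.format()` call. A minimal sketch, assuming that is how callers consume the templates (the `WHERE` clause below is illustrative; value binding is assumed to happen via DB parameters, not string interpolation):

```python
# Mirror of the module's constants; the escaped braces survive the f-string.
RESOURCE_TABLE = "DWH.DW_MES_RESOURCE"
RESOURCE_BASE_SELECT_TEMPLATE = f"SELECT * FROM {RESOURCE_TABLE} {{WHERE_CLAUSE}}"

# Fill the placeholder at query time.
sql = RESOURCE_BASE_SELECT_TEMPLATE.format(WHERE_CLAUSE="WHERE PJ_ISPRODUCTION = :flag")
print(sql)
# → SELECT * FROM DWH.DW_MES_RESOURCE WHERE PJ_ISPRODUCTION = :flag
```

This is also why the placeholder must contain no spaces (`{ WHERE_CLAUSE }` would not be a valid `str.format` field name).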
@@ -9,6 +9,7 @@ Now uses Redis cache when available, with fallback to Oracle direct query.

import logging
import threading
from collections import Counter
from datetime import datetime
from typing import Optional, Dict, List, Any

@@ -32,6 +33,20 @@ logger = logging.getLogger('mes_dashboard.wip_service')

_wip_search_index_lock = threading.Lock()
_wip_search_index_cache: Dict[str, Dict[str, Any]] = {}
_wip_snapshot_lock = threading.Lock()
_wip_snapshot_cache: Dict[str, Dict[str, Any]] = {}
_wip_index_metrics_lock = threading.Lock()
_wip_index_metrics: Dict[str, Any] = {
    "snapshot_hits": 0,
    "snapshot_misses": 0,
    "search_index_hits": 0,
    "search_index_misses": 0,
    "search_index_rebuilds": 0,
    "search_index_incremental_updates": 0,
    "search_index_reconciliation_fallbacks": 0,
}

_EMPTY_INT_INDEX = np.array([], dtype=np.int64)


def _safe_value(val):
@@ -153,29 +168,373 @@ def _get_wip_cache_version() -> str:
    return f"{updated_at}|{sys_date}"

def _distinct_sorted_values(df: pd.DataFrame, column: str) -> List[str]:
    if column not in df.columns:
        return []
    series = df[column].dropna().astype(str)
    if series.empty:
        return []
    series = series[series.str.len() > 0]
    if series.empty:
        return []
    return series.drop_duplicates().sort_values().tolist()
def _increment_wip_metric(metric: str, value: int = 1) -> None:
    with _wip_index_metrics_lock:
        _wip_index_metrics[metric] = int(_wip_index_metrics.get(metric, 0)) + value


def _estimate_dataframe_bytes(df: pd.DataFrame) -> int:
    if df is None:
        return 0
    try:
        return int(df.memory_usage(index=True, deep=True).sum())
    except Exception:
        return 0


def _estimate_counter_payload_bytes(counter: Counter) -> int:
    total = 0
    for key, count in counter.items():
        total += len(str(key)) + 16 + int(count)
    return total


def _normalize_text_value(value: Any) -> str:
    if value is None:
        return ""
    if isinstance(value, float) and pd.isna(value):
        return ""
    text = str(value).strip()
    return text


def _build_filter_mask(
    df: pd.DataFrame,
    *,
    include_dummy: bool,
    workorder: Optional[str] = None,
    lotid: Optional[str] = None,
) -> pd.Series:
    if df.empty:
        return pd.Series(dtype=bool)

    mask = df['WORKORDER'].notna()

    if not include_dummy and 'LOTID' in df.columns:
        mask &= ~df['LOTID'].astype(str).str.contains('DUMMY', case=False, na=False)

    if workorder and 'WORKORDER' in df.columns:
        mask &= df['WORKORDER'].astype(str).str.contains(workorder, case=False, na=False)

    if lotid and 'LOTID' in df.columns:
        mask &= df['LOTID'].astype(str).str.contains(lotid, case=False, na=False)

    return mask


def _build_value_index(df: pd.DataFrame, column: str) -> Dict[str, np.ndarray]:
    if column not in df.columns or df.empty:
        return {}
    grouped = df.groupby(column, dropna=True, sort=False).indices
    return {str(key): np.asarray(indices, dtype=np.int64) for key, indices in grouped.items()}


def _intersect_positions(current: Optional[np.ndarray], candidate: Optional[np.ndarray]) -> np.ndarray:
    if candidate is None:
        return _EMPTY_INT_INDEX
    if current is None:
        return candidate
    if len(current) == 0 or len(candidate) == 0:
        return _EMPTY_INT_INDEX
    return np.intersect1d(current, candidate, assume_unique=False)

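The pair `_build_value_index` / `_intersect_positions` rests on two building blocks: pandas `groupby(...).indices`, which maps each value to an array of row positions, and `np.intersect1d`, which narrows candidates. A minimal sketch with illustrative column values:

```python
import numpy as np
import pandas as pd

df = pd.DataFrame({
    "WORKCENTER_GROUP": ["A", "A", "B", "A"],
    "WIP_STATUS": ["RUN", "HOLD", "RUN", "RUN"],
})

def build_value_index(frame: pd.DataFrame, column: str) -> dict:
    # groupby(...).indices yields {value: positional row indices}.
    grouped = frame.groupby(column, dropna=True, sort=False).indices
    return {str(k): np.asarray(v, dtype=np.int64) for k, v in grouped.items()}

by_wc = build_value_index(df, "WORKCENTER_GROUP")
by_status = build_value_index(df, "WIP_STATUS")

# Narrow to rows that are both in workcenter "A" and in status "RUN".
positions = np.intersect1d(by_wc["A"], by_status["RUN"])
print(positions.tolist())
# → [0, 3]
```

`df.iloc[positions]` then materializes the filtered frame without re-scanning columns, which is the path `_select_with_snapshot_indexes` takes below.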
def _select_with_snapshot_indexes(
    include_dummy: bool = False,
    workorder: Optional[str] = None,
    lotid: Optional[str] = None,
    package: Optional[str] = None,
    pj_type: Optional[str] = None,
    workcenter: Optional[str] = None,
    status: Optional[str] = None,
    hold_type: Optional[str] = None,
) -> Optional[pd.DataFrame]:
    snapshot = _get_wip_snapshot(include_dummy=include_dummy)
    if snapshot is None:
        return None

    df = snapshot["frame"]
    indexes = snapshot["indexes"]
    selected_positions: Optional[np.ndarray] = None

    if workcenter:
        selected_positions = _intersect_positions(
            selected_positions,
            indexes["workcenter"].get(str(workcenter)),
        )
    if package:
        selected_positions = _intersect_positions(
            selected_positions,
            indexes["package"].get(str(package)),
        )
    if pj_type:
        selected_positions = _intersect_positions(
            selected_positions,
            indexes["pj_type"].get(str(pj_type)),
        )
    if status:
        selected_positions = _intersect_positions(
            selected_positions,
            indexes["wip_status"].get(str(status).upper()),
        )
    if hold_type:
        selected_positions = _intersect_positions(
            selected_positions,
            indexes["hold_type"].get(str(hold_type).lower()),
        )

    if selected_positions is None:
        result = df
    elif len(selected_positions) == 0:
        result = df.iloc[0:0]
    else:
        result = df.iloc[selected_positions]

    if workorder:
        result = result[result['WORKORDER'].astype(str).str.contains(workorder, case=False, na=False)]
    if lotid:
        result = result[result['LOTID'].astype(str).str.contains(lotid, case=False, na=False)]
    return result

def _build_search_signatures(df: pd.DataFrame) -> tuple[Counter, Dict[str, tuple[str, str, str, str]]]:
    if df.empty:
        return Counter(), {}

    workorders = df.get("WORKORDER", pd.Series(index=df.index, dtype=object)).map(_normalize_text_value)
    lotids = df.get("LOTID", pd.Series(index=df.index, dtype=object)).map(_normalize_text_value)
    packages = df.get("PACKAGE_LEF", pd.Series(index=df.index, dtype=object)).map(_normalize_text_value)
    types = df.get("PJ_TYPE", pd.Series(index=df.index, dtype=object)).map(_normalize_text_value)

    signatures = (
        workorders
        + "\x1f"
        + lotids
        + "\x1f"
        + packages
        + "\x1f"
        + types
    ).tolist()
    signature_counter = Counter(signatures)

    signature_fields: Dict[str, tuple[str, str, str, str]] = {}
    for signature, wo, lot, pkg, pj in zip(signatures, workorders, lotids, packages, types):
        if signature not in signature_fields:
            signature_fields[signature] = (wo, lot, pkg, pj)
    return signature_counter, signature_fields


def _build_field_counters(
    signature_counter: Counter,
    signature_fields: Dict[str, tuple[str, str, str, str]],
) -> Dict[str, Counter]:
    counters = {
        "workorders": Counter(),
        "lotids": Counter(),
        "packages": Counter(),
        "types": Counter(),
    }
    for signature, count in signature_counter.items():
        wo, lot, pkg, pj = signature_fields.get(signature, ("", "", "", ""))
        if wo:
            counters["workorders"][wo] += count
        if lot:
            counters["lotids"][lot] += count
        if pkg:
            counters["packages"][pkg] += count
        if pj:
            counters["types"][pj] += count
    return counters

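The signature scheme above can be sketched in isolation: the four searchable fields are joined with the ASCII unit separator `"\x1f"` into one hashable key, so row multiplicities can be counted and later diffed cheaply. The row values here are illustrative:

```python
from collections import Counter

rows = [
    ("WO1", "LOT1", "PKG-A", "TYPE-X"),
    ("WO1", "LOT1", "PKG-A", "TYPE-X"),
    ("WO2", "LOT2", "PKG-B", "TYPE-Y"),
]
# One signature per row; identical rows collapse into one Counter key.
signatures = ["\x1f".join(fields) for fields in rows]
signature_counter = Counter(signatures)

# Per-field counters are derived once per distinct signature, weighted by
# that signature's row count (mirroring _build_field_counters).
workorders = Counter()
for signature, count in signature_counter.items():
    wo = signature.split("\x1f", 3)[0]
    workorders[wo] += count

print(sorted(workorders.items()))
# → [('WO1', 2), ('WO2', 1)]
```

Iterating distinct signatures rather than raw rows is what makes the later incremental diff proportional to change volume, not table size.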
def _materialize_search_payload(
    *,
    version: str,
    row_count: int,
    signature_counter: Counter,
    field_counters: Dict[str, Counter],
    mode: str,
    added_rows: int = 0,
    removed_rows: int = 0,
    drift_ratio: float = 0.0,
) -> Dict[str, Any]:
    workorders = sorted(field_counters["workorders"].keys())
    lotids = sorted(field_counters["lotids"].keys())
    packages = sorted(field_counters["packages"].keys())
    types = sorted(field_counters["types"].keys())
    memory_bytes = (
        _estimate_counter_payload_bytes(field_counters["workorders"])
        + _estimate_counter_payload_bytes(field_counters["lotids"])
        + _estimate_counter_payload_bytes(field_counters["packages"])
        + _estimate_counter_payload_bytes(field_counters["types"])
    )
    return {
        "version": version,
        "built_at": datetime.now().isoformat(),
        "row_count": int(row_count),
        "workorders": workorders,
        "lotids": lotids,
        "packages": packages,
        "types": types,
        "sync_mode": mode,
        "sync_added_rows": int(added_rows),
        "sync_removed_rows": int(removed_rows),
        "drift_ratio": round(float(drift_ratio), 6),
        "memory_bytes": int(memory_bytes),
        "_signature_counter": dict(signature_counter),
        "_field_counters": {
            "workorders": dict(field_counters["workorders"]),
            "lotids": dict(field_counters["lotids"]),
            "packages": dict(field_counters["packages"]),
            "types": dict(field_counters["types"]),
        },
    }

def _build_wip_search_index(df: pd.DataFrame, include_dummy: bool) -> Dict[str, Any]:
    filtered = _filter_base_conditions(df, include_dummy=include_dummy)
    return {
        "built_at": datetime.now().isoformat(),
        "row_count": len(filtered),
        "workorders": _distinct_sorted_values(filtered, "WORKORDER"),
        "lotids": _distinct_sorted_values(filtered, "LOTID"),
        "packages": _distinct_sorted_values(filtered, "PACKAGE_LEF"),
        "types": _distinct_sorted_values(filtered, "PJ_TYPE"),
    signatures, signature_fields = _build_search_signatures(filtered)
    field_counters = _build_field_counters(signatures, signature_fields)
    return _materialize_search_payload(
        version=_get_wip_cache_version(),
        row_count=len(filtered),
        signature_counter=signatures,
        field_counters=field_counters,
        mode="full",
    )

def _try_incremental_search_sync(
    previous: Dict[str, Any],
    *,
    version: str,
    row_count: int,
    signature_counter: Counter,
    signature_fields: Dict[str, tuple[str, str, str, str]],
) -> Optional[Dict[str, Any]]:
    if not previous:
        return None
    old_signature_counter = Counter(previous.get("_signature_counter") or {})
    old_field_counters_raw = previous.get("_field_counters") or {}
    if not old_signature_counter or not old_field_counters_raw:
        return None

    added = signature_counter - old_signature_counter
    removed = old_signature_counter - signature_counter
    total_delta = sum(added.values()) + sum(removed.values())
    drift_ratio = total_delta / max(int(row_count), 1)
    if drift_ratio > 0.6:
        _increment_wip_metric("search_index_reconciliation_fallbacks")
        return None

    field_counters = {
        "workorders": Counter(old_field_counters_raw.get("workorders") or {}),
        "lotids": Counter(old_field_counters_raw.get("lotids") or {}),
        "packages": Counter(old_field_counters_raw.get("packages") or {}),
        "types": Counter(old_field_counters_raw.get("types") or {}),
    }

    for signature, count in added.items():
        wo, lot, pkg, pj = signature_fields.get(signature, ("", "", "", ""))
        if wo:
            field_counters["workorders"][wo] += count
        if lot:
            field_counters["lotids"][lot] += count
        if pkg:
            field_counters["packages"][pkg] += count
        if pj:
            field_counters["types"][pj] += count

    previous_fields = {
        sig: tuple(str(v) for v in sig.split("\x1f", 3))
        for sig in old_signature_counter.keys()
    }
    for signature, count in removed.items():
        wo, lot, pkg, pj = previous_fields.get(signature, ("", "", "", ""))
        if wo:
            field_counters["workorders"][wo] -= count
            if field_counters["workorders"][wo] <= 0:
                field_counters["workorders"].pop(wo, None)
        if lot:
            field_counters["lotids"][lot] -= count
            if field_counters["lotids"][lot] <= 0:
                field_counters["lotids"].pop(lot, None)
        if pkg:
            field_counters["packages"][pkg] -= count
            if field_counters["packages"][pkg] <= 0:
                field_counters["packages"].pop(pkg, None)
        if pj:
            field_counters["types"][pj] -= count
            if field_counters["types"][pj] <= 0:
                field_counters["types"].pop(pj, None)

    _increment_wip_metric("search_index_incremental_updates")
    return _materialize_search_payload(
        version=version,
        row_count=row_count,
        signature_counter=signature_counter,
        field_counters=field_counters,
        mode="incremental",
        added_rows=sum(added.values()),
        removed_rows=sum(removed.values()),
        drift_ratio=drift_ratio,
    )

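The core of `_try_incremental_search_sync` is `Counter` arithmetic: subtracting Counters keeps only positive differences, which yields the added and removed signature multisets directly. A minimal sketch with made-up signature keys, using the same 0.6 drift threshold as the function above:

```python
from collections import Counter

old = Counter({"sig-a": 5, "sig-b": 2})
new = Counter({"sig-a": 5, "sig-b": 1, "sig-c": 3})

# Counter subtraction drops zero/negative entries, so these are exactly
# the per-signature row counts that appeared or disappeared.
added = new - old        # Counter({'sig-c': 3})
removed = old - new      # Counter({'sig-b': 1})

total_delta = sum(added.values()) + sum(removed.values())
row_count = sum(new.values())
drift_ratio = total_delta / max(row_count, 1)

# Above the threshold, patching counters costs more than a clean rebuild.
mode = "full" if drift_ratio > 0.6 else "incremental"
print(round(drift_ratio, 4), mode)
# → 0.4444 incremental
```

Because the diff is over distinct signatures rather than rows, a steady-state refresh touches only the handful of lots that actually changed.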
def _build_wip_snapshot(df: pd.DataFrame, include_dummy: bool, version: str) -> Dict[str, Any]:
    filtered = _filter_base_conditions(df, include_dummy=include_dummy)
    filtered = _add_wip_status_columns(filtered).reset_index(drop=True)

    hold_type_series = pd.Series(index=filtered.index, dtype=object)
    if not filtered.empty:
        hold_type_series = pd.Series("", index=filtered.index, dtype=object)
        hold_type_series.loc[filtered["IS_QUALITY_HOLD"]] = "quality"
        hold_type_series.loc[filtered["IS_NON_QUALITY_HOLD"]] = "non-quality"

    indexes = {
        "workcenter": _build_value_index(filtered, "WORKCENTER_GROUP"),
        "package": _build_value_index(filtered, "PACKAGE_LEF"),
        "pj_type": _build_value_index(filtered, "PJ_TYPE"),
        "wip_status": _build_value_index(filtered, "WIP_STATUS"),
        "hold_type": _build_value_index(pd.DataFrame({"HOLD_TYPE": hold_type_series}), "HOLD_TYPE"),
    }

    exact_bucket_count = sum(len(bucket) for bucket in indexes.values())
    return {
        "version": version,
        "built_at": datetime.now().isoformat(),
        "row_count": int(len(filtered)),
        "frame": filtered,
        "indexes": indexes,
        "frame_bytes": _estimate_dataframe_bytes(filtered),
        "index_bucket_count": int(exact_bucket_count),
    }

def _get_wip_snapshot(include_dummy: bool) -> Optional[Dict[str, Any]]:
    cache_key = "with_dummy" if include_dummy else "without_dummy"
    version = _get_wip_cache_version()

    with _wip_snapshot_lock:
        cached = _wip_snapshot_cache.get(cache_key)
        if cached and cached.get("version") == version:
            _increment_wip_metric("snapshot_hits")
            return cached

    _increment_wip_metric("snapshot_misses")
    df = _get_wip_dataframe()
    if df is None:
        return None

    snapshot = _build_wip_snapshot(df, include_dummy=include_dummy, version=version)
    with _wip_snapshot_lock:
        existing = _wip_snapshot_cache.get(cache_key)
        if existing and existing.get("version") == version:
            _increment_wip_metric("snapshot_hits")
            return existing
        _wip_snapshot_cache[cache_key] = snapshot
        return snapshot

def _get_wip_search_index(include_dummy: bool) -> Optional[Dict[str, Any]]:
    cache_key = "with_dummy" if include_dummy else "without_dummy"
@@ -184,14 +543,37 @@ def _get_wip_search_index(include_dummy: bool) -> Optional[Dict[str, Any]]:
    with _wip_search_index_lock:
        cached = _wip_search_index_cache.get(cache_key)
        if cached and cached.get("version") == version:
            _increment_wip_metric("search_index_hits")
            return cached

    df = _get_wip_dataframe()
    if df is None:
    _increment_wip_metric("search_index_misses")
    snapshot = _get_wip_snapshot(include_dummy=include_dummy)
    if snapshot is None:
        return None

    index_payload = _build_wip_search_index(df, include_dummy=include_dummy)
    index_payload["version"] = version
    filtered = snapshot["frame"]
    signature_counter, signature_fields = _build_search_signatures(filtered)

    with _wip_search_index_lock:
        previous = _wip_search_index_cache.get(cache_key)

    index_payload = _try_incremental_search_sync(
        previous or {},
        version=version,
        row_count=int(snapshot.get("row_count", 0)),
        signature_counter=signature_counter,
        signature_fields=signature_fields,
    )
    if index_payload is None:
        field_counters = _build_field_counters(signature_counter, signature_fields)
        index_payload = _materialize_search_payload(
            version=version,
            row_count=int(snapshot.get("row_count", 0)),
            signature_counter=signature_counter,
            field_counters=field_counters,
            mode="full",
        )
        _increment_wip_metric("search_index_rebuilds")

    with _wip_search_index_lock:
        _wip_search_index_cache[cache_key] = index_payload

@@ -207,9 +589,9 @@ def _search_values_from_index(values: List[str], query: str, limit: int) -> List

def get_wip_search_index_status() -> Dict[str, Any]:
    """Expose WIP derived search-index freshness for diagnostics."""
    with _wip_search_index_lock:
        snapshot = {}
        search_snapshot = {}
        for key, payload in _wip_search_index_cache.items():
            snapshot[key] = {
            search_snapshot[key] = {
                "version": payload.get("version"),
                "built_at": payload.get("built_at"),
                "row_count": payload.get("row_count", 0),
@@ -217,8 +599,39 @@ def get_wip_search_index_status() -> Dict[str, Any]:
                "lotids": len(payload.get("lotids", [])),
                "packages": len(payload.get("packages", [])),
                "types": len(payload.get("types", [])),
                "sync_mode": payload.get("sync_mode"),
                "sync_added_rows": payload.get("sync_added_rows", 0),
                "sync_removed_rows": payload.get("sync_removed_rows", 0),
                "drift_ratio": payload.get("drift_ratio", 0.0),
                "memory_bytes": payload.get("memory_bytes", 0),
            }
    return snapshot
    with _wip_snapshot_lock:
        frame_snapshot = {}
        for key, payload in _wip_snapshot_cache.items():
            frame_snapshot[key] = {
                "version": payload.get("version"),
                "built_at": payload.get("built_at"),
                "row_count": payload.get("row_count", 0),
                "frame_bytes": payload.get("frame_bytes", 0),
                "index_bucket_count": payload.get("index_bucket_count", 0),
            }
    with _wip_index_metrics_lock:
        metrics = dict(_wip_index_metrics)

    total_frame_bytes = sum(item.get("frame_bytes", 0) for item in frame_snapshot.values())
    total_search_bytes = sum(item.get("memory_bytes", 0) for item in search_snapshot.values())
    amplification_ratio = round((total_frame_bytes + total_search_bytes) / max(total_frame_bytes, 1), 4)

    return {
        "derived_search_index": search_snapshot,
        "derived_frame_snapshot": frame_snapshot,
        "metrics": metrics,
        "memory": {
            "frame_bytes_total": int(total_frame_bytes),
            "search_bytes_total": int(total_search_bytes),
            "amplification_ratio": amplification_ratio,
        },
    }

def _add_wip_status_columns(df: pd.DataFrame) -> pd.DataFrame:
@@ -235,24 +648,31 @@ def _add_wip_status_columns(df: pd.DataFrame) -> pd.DataFrame:
    Returns:
        DataFrame with additional status columns
    """
    df = df.copy()
    required = {'WIP_STATUS', 'IS_QUALITY_HOLD', 'IS_NON_QUALITY_HOLD'}
    if required.issubset(df.columns):
        return df

    working = df.copy()

    # Ensure numeric columns
    df['EQUIPMENTCOUNT'] = pd.to_numeric(df['EQUIPMENTCOUNT'], errors='coerce').fillna(0)
    df['CURRENTHOLDCOUNT'] = pd.to_numeric(df['CURRENTHOLDCOUNT'], errors='coerce').fillna(0)
    df['QTY'] = pd.to_numeric(df['QTY'], errors='coerce').fillna(0)
    working['EQUIPMENTCOUNT'] = pd.to_numeric(working['EQUIPMENTCOUNT'], errors='coerce').fillna(0)
    working['CURRENTHOLDCOUNT'] = pd.to_numeric(working['CURRENTHOLDCOUNT'], errors='coerce').fillna(0)
    working['QTY'] = pd.to_numeric(working['QTY'], errors='coerce').fillna(0)

    # Compute WIP status
    df['WIP_STATUS'] = 'QUEUE'  # Default
    df.loc[df['EQUIPMENTCOUNT'] > 0, 'WIP_STATUS'] = 'RUN'
    df.loc[(df['EQUIPMENTCOUNT'] == 0) & (df['CURRENTHOLDCOUNT'] > 0), 'WIP_STATUS'] = 'HOLD'
    working['WIP_STATUS'] = 'QUEUE'  # Default
    working.loc[working['EQUIPMENTCOUNT'] > 0, 'WIP_STATUS'] = 'RUN'
    working.loc[
        (working['EQUIPMENTCOUNT'] == 0) & (working['CURRENTHOLDCOUNT'] > 0),
        'WIP_STATUS'
    ] = 'HOLD'

    # Compute hold type
    df['IS_NON_QUALITY_HOLD'] = df['HOLDREASONNAME'].isin(NON_QUALITY_HOLD_REASONS)
    df['IS_QUALITY_HOLD'] = (df['WIP_STATUS'] == 'HOLD') & ~df['IS_NON_QUALITY_HOLD']
    df['IS_NON_QUALITY_HOLD'] = (df['WIP_STATUS'] == 'HOLD') & df['IS_NON_QUALITY_HOLD']
    non_quality_flags = working['HOLDREASONNAME'].isin(NON_QUALITY_HOLD_REASONS)
    working['IS_QUALITY_HOLD'] = (working['WIP_STATUS'] == 'HOLD') & ~non_quality_flags
    working['IS_NON_QUALITY_HOLD'] = (working['WIP_STATUS'] == 'HOLD') & non_quality_flags

    return df
    return working

def _filter_base_conditions(
|
||||
@@ -272,24 +692,18 @@ def _filter_base_conditions(
     Returns:
         Filtered DataFrame
     """
-    df = df.copy()
-
-    # Exclude NULL WORKORDER (raw materials)
-    df = df[df['WORKORDER'].notna()]
-
-    # DUMMY exclusion
-    if not include_dummy:
-        df = df[~df['LOTID'].str.contains('DUMMY', case=False, na=False)]
-
-    # WORKORDER filter (fuzzy match)
-    if workorder:
-        df = df[df['WORKORDER'].str.contains(workorder, case=False, na=False)]
-
-    # LOTID filter (fuzzy match)
-    if lotid:
-        df = df[df['LOTID'].str.contains(lotid, case=False, na=False)]
-
-    return df
+    if df is None or df.empty:
+        return df.iloc[0:0] if isinstance(df, pd.DataFrame) else pd.DataFrame()
+
+    mask = _build_filter_mask(
+        df,
+        include_dummy=include_dummy,
+        workorder=workorder,
+        lotid=lotid,
+    )
+    if mask.empty:
+        return df.iloc[0:0]
+    return df.loc[mask]


 # ============================================================
@@ -325,16 +739,15 @@ def get_wip_summary(
     cached_df = _get_wip_dataframe()
     if cached_df is not None:
         try:
-            df = _filter_base_conditions(cached_df, include_dummy, workorder, lotid)
-            df = _add_wip_status_columns(df)
-
-            # Apply package filter
-            if package and 'PACKAGE_LEF' in df.columns:
-                df = df[df['PACKAGE_LEF'] == package]
-
-            # Apply pj_type filter
-            if pj_type and 'PJ_TYPE' in df.columns:
-                df = df[df['PJ_TYPE'] == pj_type]
+            df = _select_with_snapshot_indexes(
+                include_dummy=include_dummy,
+                workorder=workorder,
+                lotid=lotid,
+                package=package,
+                pj_type=pj_type,
+            )
+            if df is None:
+                return _get_wip_summary_from_oracle(include_dummy, workorder, lotid, package, pj_type)

             if df.empty:
                 return {
@@ -495,32 +908,31 @@ def get_wip_matrix(
     cached_df = _get_wip_dataframe()
     if cached_df is not None:
         try:
-            df = _filter_base_conditions(cached_df, include_dummy, workorder, lotid)
-            df = _add_wip_status_columns(df)
+            status_upper = status.upper() if status else None
+            hold_type_filter = hold_type if status_upper == 'HOLD' else None
+            df = _select_with_snapshot_indexes(
+                include_dummy=include_dummy,
+                workorder=workorder,
+                lotid=lotid,
+                package=package,
+                pj_type=pj_type,
+                status=status_upper,
+                hold_type=hold_type_filter,
+            )
+            if df is None:
+                return _get_wip_matrix_from_oracle(
+                    include_dummy,
+                    workorder,
+                    lotid,
+                    status,
+                    hold_type,
+                    package,
+                    pj_type,
+                )

             # Filter by WORKCENTER_GROUP and PACKAGE_LEF
             df = df[df['WORKCENTER_GROUP'].notna() & df['PACKAGE_LEF'].notna()]

-            # Apply package filter
-            if package:
-                df = df[df['PACKAGE_LEF'] == package]
-
-            # Apply pj_type filter
-            if pj_type and 'PJ_TYPE' in df.columns:
-                df = df[df['PJ_TYPE'] == pj_type]
-
-            # WIP status filter
-            if status:
-                status_upper = status.upper()
-                df = df[df['WIP_STATUS'] == status_upper]
-
-            # Hold type sub-filter
-            if status_upper == 'HOLD' and hold_type:
-                if hold_type == 'quality':
-                    df = df[df['IS_QUALITY_HOLD']]
-                elif hold_type == 'non-quality':
-                    df = df[df['IS_NON_QUALITY_HOLD']]
-
             if df.empty:
                 return {
                     'workcenters': [],
@@ -677,11 +1089,17 @@ def get_wip_hold_summary(
     cached_df = _get_wip_dataframe()
     if cached_df is not None:
         try:
-            df = _filter_base_conditions(cached_df, include_dummy, workorder, lotid)
-            df = _add_wip_status_columns(df)
+            df = _select_with_snapshot_indexes(
+                include_dummy=include_dummy,
+                workorder=workorder,
+                lotid=lotid,
+                status='HOLD',
+            )
+            if df is None:
+                return _get_wip_hold_summary_from_oracle(include_dummy, workorder, lotid)

             # Filter for HOLD status with reason
-            df = df[(df['WIP_STATUS'] == 'HOLD') & df['HOLDREASONNAME'].notna()]
+            df = df[df['HOLDREASONNAME'].notna()]

             if df.empty:
                 return {'items': []}
@@ -805,17 +1223,40 @@ def get_wip_detail(
     cached_df = _get_wip_dataframe()
     if cached_df is not None:
         try:
-            df = _filter_base_conditions(cached_df, include_dummy, workorder, lotid)
-            df = _add_wip_status_columns(df)
-
-            # Filter by workcenter
-            df = df[df['WORKCENTER_GROUP'] == workcenter]
-
-            if package:
-                df = df[df['PACKAGE_LEF'] == package]
+            summary_df = _select_with_snapshot_indexes(
+                include_dummy=include_dummy,
+                workorder=workorder,
+                lotid=lotid,
+                package=package,
+                workcenter=workcenter,
+            )
+            if summary_df is None:
+                return _get_wip_detail_from_oracle(
+                    workcenter,
+                    package,
+                    status,
+                    hold_type,
+                    workorder,
+                    lotid,
+                    include_dummy,
+                    page,
+                    page_size,
+                )
+
+            if summary_df.empty:
+                summary = {
+                    'totalLots': 0,
+                    'runLots': 0,
+                    'queueLots': 0,
+                    'holdLots': 0,
+                    'qualityHoldLots': 0,
+                    'nonQualityHoldLots': 0
+                }
+                df = summary_df
+            else:
+                df = summary_df

             # Calculate summary before status filter
-            summary_df = df.copy()
             run_lots = len(summary_df[summary_df['WIP_STATUS'] == 'RUN'])
             queue_lots = len(summary_df[summary_df['WIP_STATUS'] == 'QUEUE'])
             hold_lots = len(summary_df[summary_df['WIP_STATUS'] == 'HOLD'])
@@ -835,13 +1276,29 @@
             # Apply status filter for lots list
             if status:
                 status_upper = status.upper()
-                df = df[df['WIP_STATUS'] == status_upper]
-
-                if status_upper == 'HOLD' and hold_type:
-                    if hold_type == 'quality':
-                        df = df[df['IS_QUALITY_HOLD']]
-                    elif hold_type == 'non-quality':
-                        df = df[df['IS_NON_QUALITY_HOLD']]
+                hold_type_filter = hold_type if status_upper == 'HOLD' else None
+                filtered_df = _select_with_snapshot_indexes(
+                    include_dummy=include_dummy,
+                    workorder=workorder,
+                    lotid=lotid,
+                    package=package,
+                    workcenter=workcenter,
+                    status=status_upper,
+                    hold_type=hold_type_filter,
+                )
+                if filtered_df is None:
+                    return _get_wip_detail_from_oracle(
+                        workcenter,
+                        package,
+                        status,
+                        hold_type,
+                        workorder,
+                        lotid,
+                        include_dummy,
+                        page,
+                        page_size,
+                    )
+                df = filtered_df

             # Get specs (sorted by SPECSEQUENCE if available)
             specs_df = df[df['SPECNAME'].notna()][['SPECNAME', 'SPECSEQUENCE']].drop_duplicates()
@@ -1083,7 +1540,9 @@ def get_workcenters(include_dummy: bool = False) -> Optional[List[Dict[str, Any]]]:
     cached_df = _get_wip_dataframe()
     if cached_df is not None:
         try:
-            df = _filter_base_conditions(cached_df, include_dummy)
+            df = _select_with_snapshot_indexes(include_dummy=include_dummy)
+            if df is None:
+                return _get_workcenters_from_oracle(include_dummy)
             df = df[df['WORKCENTER_GROUP'].notna()]

             if df.empty:
@@ -1162,7 +1621,9 @@ def get_packages(include_dummy: bool = False) -> Optional[List[Dict[str, Any]]]:
     cached_df = _get_wip_dataframe()
     if cached_df is not None:
         try:
-            df = _filter_base_conditions(cached_df, include_dummy)
+            df = _select_with_snapshot_indexes(include_dummy=include_dummy)
+            if df is None:
+                return _get_packages_from_oracle(include_dummy)
             df = df[df['PACKAGE_LEF'].notna()]

             if df.empty:
@@ -1267,15 +1728,16 @@ def search_workorders(
     cached_df = _get_wip_dataframe()
     if cached_df is not None:
         try:
-            df = _filter_base_conditions(cached_df, include_dummy, lotid=lotid)
+            df = _select_with_snapshot_indexes(
+                include_dummy=include_dummy,
+                lotid=lotid,
+                package=package,
+                pj_type=pj_type,
+            )
+            if df is None:
+                return _search_workorders_from_oracle(q, limit, include_dummy, lotid, package, pj_type)
             df = df[df['WORKORDER'].notna()]

-            # Apply cross-filters
-            if package and 'PACKAGE_LEF' in df.columns:
-                df = df[df['PACKAGE_LEF'] == package]
-            if pj_type and 'PJ_TYPE' in df.columns:
-                df = df[df['PJ_TYPE'] == pj_type]
-
             # Filter by search query (case-insensitive)
             df = df[df['WORKORDER'].str.contains(q, case=False, na=False)]
@@ -1375,13 +1837,14 @@ def search_lot_ids(
     cached_df = _get_wip_dataframe()
    if cached_df is not None:
         try:
-            df = _filter_base_conditions(cached_df, include_dummy, workorder=workorder)
-
-            # Apply cross-filters
-            if package and 'PACKAGE_LEF' in df.columns:
-                df = df[df['PACKAGE_LEF'] == package]
-            if pj_type and 'PJ_TYPE' in df.columns:
-                df = df[df['PJ_TYPE'] == pj_type]
+            df = _select_with_snapshot_indexes(
+                include_dummy=include_dummy,
+                workorder=workorder,
+                package=package,
+                pj_type=pj_type,
+            )
+            if df is None:
+                return _search_lot_ids_from_oracle(q, limit, include_dummy, workorder, package, pj_type)

             # Filter by search query (case-insensitive)
             df = df[df['LOTID'].str.contains(q, case=False, na=False)]
@@ -1481,7 +1944,14 @@ def search_packages(
     cached_df = _get_wip_dataframe()
     if cached_df is not None:
         try:
-            df = _filter_base_conditions(cached_df, include_dummy, workorder=workorder, lotid=lotid)
+            df = _select_with_snapshot_indexes(
+                include_dummy=include_dummy,
+                workorder=workorder,
+                lotid=lotid,
+                pj_type=pj_type,
+            )
+            if df is None:
+                return _search_packages_from_oracle(q, limit, include_dummy, workorder, lotid, pj_type)

             # Check if PACKAGE_LEF column exists
             if 'PACKAGE_LEF' not in df.columns:
@@ -1490,10 +1960,6 @@

             df = df[df['PACKAGE_LEF'].notna()]

-            # Apply cross-filter
-            if pj_type and 'PJ_TYPE' in df.columns:
-                df = df[df['PJ_TYPE'] == pj_type]
-
             # Filter by search query (case-insensitive)
             df = df[df['PACKAGE_LEF'].str.contains(q, case=False, na=False)]
@@ -1591,7 +2057,14 @@ def search_types(
     cached_df = _get_wip_dataframe()
     if cached_df is not None:
         try:
-            df = _filter_base_conditions(cached_df, include_dummy, workorder=workorder, lotid=lotid)
+            df = _select_with_snapshot_indexes(
+                include_dummy=include_dummy,
+                workorder=workorder,
+                lotid=lotid,
+                package=package,
+            )
+            if df is None:
+                return _search_types_from_oracle(q, limit, include_dummy, workorder, lotid, package)

             # Check if PJ_TYPE column exists
             if 'PJ_TYPE' not in df.columns:
@@ -1600,10 +2073,6 @@

             df = df[df['PJ_TYPE'].notna()]

-            # Apply cross-filter
-            if package and 'PACKAGE_LEF' in df.columns:
-                df = df[df['PACKAGE_LEF'] == package]
-
             # Filter by search query (case-insensitive)
             df = df[df['PJ_TYPE'].str.contains(q, case=False, na=False)]
@@ -1686,11 +2155,15 @@ def get_hold_detail_summary(
     cached_df = _get_wip_dataframe()
     if cached_df is not None:
         try:
-            df = _filter_base_conditions(cached_df, include_dummy)
-            df = _add_wip_status_columns(df)
+            df = _select_with_snapshot_indexes(
+                include_dummy=include_dummy,
+                status='HOLD',
+            )
+            if df is None:
+                return _get_hold_detail_summary_from_oracle(reason, include_dummy)

             # Filter for HOLD status with matching reason
-            df = df[(df['WIP_STATUS'] == 'HOLD') & (df['HOLDREASONNAME'] == reason)]
+            df = df[df['HOLDREASONNAME'] == reason]

             if df.empty:
                 return {
@@ -1783,11 +2256,15 @@ def get_hold_detail_distribution(
     cached_df = _get_wip_dataframe()
     if cached_df is not None:
         try:
-            df = _filter_base_conditions(cached_df, include_dummy)
-            df = _add_wip_status_columns(df)
+            df = _select_with_snapshot_indexes(
+                include_dummy=include_dummy,
+                status='HOLD',
+            )
+            if df is None:
+                return _get_hold_detail_distribution_from_oracle(reason, include_dummy)

             # Filter for HOLD status with matching reason
-            df = df[(df['WIP_STATUS'] == 'HOLD') & (df['HOLDREASONNAME'] == reason)]
+            df = df[df['HOLDREASONNAME'] == reason]

             total_lots = len(df)
@@ -2072,20 +2549,30 @@ def get_hold_detail_lots(
     cached_df = _get_wip_dataframe()
     if cached_df is not None:
         try:
-            df = _filter_base_conditions(cached_df, include_dummy)
-            df = _add_wip_status_columns(df)
+            df = _select_with_snapshot_indexes(
+                include_dummy=include_dummy,
+                workcenter=workcenter,
+                package=package,
+                status='HOLD',
+            )
+            if df is None:
+                return _get_hold_detail_lots_from_oracle(
+                    reason=reason,
+                    workcenter=workcenter,
+                    package=package,
+                    age_range=age_range,
+                    include_dummy=include_dummy,
+                    page=page,
+                    page_size=page_size,
+                )

             # Filter for HOLD status with matching reason
-            df = df[(df['WIP_STATUS'] == 'HOLD') & (df['HOLDREASONNAME'] == reason)]
+            df = df[df['HOLDREASONNAME'] == reason]

             # Ensure numeric columns
             df['AGEBYDAYS'] = pd.to_numeric(df['AGEBYDAYS'], errors='coerce').fillna(0)

-            # Optional filters
-            if workcenter:
-                df = df[df['WORKCENTER_GROUP'] == workcenter]
-            if package:
-                df = df[df['PACKAGE_LEF'] == package]
+            # Optional age filter
             if age_range:
                 if age_range == '0-1':
                     df = df[(df['AGEBYDAYS'] >= 0) & (df['AGEBYDAYS'] < 1)]
@@ -32,6 +32,23 @@ const MesApi = (function() {
     const MIN_DEGRADED_DELAY_MS = 3000;

     let requestCounter = 0;

+    function getCsrfToken() {
+        const meta = document.querySelector('meta[name="csrf-token"]');
+        return meta ? meta.content : '';
+    }
+
+    function withCsrfHeaders(headers, method) {
+        const normalized = (method || 'GET').toUpperCase();
+        const nextHeaders = { ...(headers || {}) };
+        if (['POST', 'PUT', 'PATCH', 'DELETE'].includes(normalized)) {
+            const token = getCsrfToken();
+            if (token && !nextHeaders['X-CSRF-Token']) {
+                nextHeaders['X-CSRF-Token'] = token;
+            }
+        }
+        return nextHeaders;
+    }
+
     /**
      * Generate a unique request ID
@@ -203,12 +220,12 @@ const MesApi = (function() {

         console.log(`[MesApi] ${reqId} ${method} ${fullUrl}`);

-        const fetchOptions = {
-            method: method,
-            headers: {
-                'Content-Type': 'application/json'
-            }
-        };
+        const fetchOptions = {
+            method: method,
+            headers: withCsrfHeaders({
+                'Content-Type': 'application/json'
+            }, method)
+        };

         if (options.body) {
             fetchOptions.body = JSON.stringify(options.body);
@@ -1,9 +1,10 @@
 <!DOCTYPE html>
 <html lang="zh-TW">
 <head>
     <meta charset="UTF-8">
     <meta name="viewport" content="width=device-width, initial-scale=1.0">
+    <meta name="csrf-token" content="{{ csrf_token() }}">
     <title>{% block title %}MES Dashboard{% endblock %}</title>

     <!-- Toast 樣式 -->
     <style id="mes-core-styles">
@@ -221,8 +221,13 @@
 {% endblock %}

 {% block scripts %}
 <script>
     const tbody = document.getElementById('pages-tbody');
+    const csrfToken = document.querySelector('meta[name="csrf-token"]')?.content || '';
+
+    function withCsrfHeaders(headers = {}) {
+        return csrfToken ? { ...headers, 'X-CSRF-Token': csrfToken } : headers;
+    }

     async function loadPages() {
         try {
@@ -264,11 +269,11 @@
             const newStatus = currentStatus === 'released' ? 'dev' : 'released';

             try {
-                const response = await fetch(`/admin/api/pages${route}`, {
-                    method: 'PUT',
-                    headers: { 'Content-Type': 'application/json' },
-                    body: JSON.stringify({ status: newStatus })
-                });
+                const response = await fetch(`/admin/api/pages${route}`, {
+                    method: 'PUT',
+                    headers: withCsrfHeaders({ 'Content-Type': 'application/json' }),
+                    body: JSON.stringify({ status: newStatus })
+                });

                 const data = await response.json();
@@ -707,7 +707,13 @@
 // Auth Helper
 // ============================================================
 async function fetchWithAuth(url, options = {}) {
-    const resp = await fetch(url, { ...options, cache: 'no-store' });
+    const method = (options.method || 'GET').toUpperCase();
+    const csrfToken = document.querySelector('meta[name="csrf-token"]')?.content || '';
+    const headers = { ...(options.headers || {}) };
+    if (csrfToken && ['POST', 'PUT', 'PATCH', 'DELETE'].includes(method)) {
+        headers['X-CSRF-Token'] = csrfToken;
+    }
+    const resp = await fetch(url, { ...options, headers, cache: 'no-store' });
     if (resp.status === 401) {
         const json = await resp.json().catch(() => ({}));
         if (!authErrorShown) {
@@ -962,9 +968,15 @@
     document.getElementById('workerStartTime').textContent =
         data.worker_start_time ? formatTimestamp(data.worker_start_time) : '--';

-    // Update cooldown status
+    // Update recovery policy status
+    const policyState = data?.resilience?.policy_state || {};
     const cooldown = data.cooldown;
-    if (cooldown && cooldown.active) {
+    if (policyState.blocked) {
+        document.getElementById('workerCooldown').textContent = 'Guarded mode(需手動 override)';
+        document.getElementById('restartBtn').disabled = false;
+        document.getElementById('restartBtn').style.opacity = '1';
+        document.getElementById('restartBtn').style.cursor = 'pointer';
+    } else if (cooldown && cooldown.active) {
         document.getElementById('workerCooldown').textContent =
             `冷卻中 (${cooldown.remaining_seconds}秒)`;
         document.getElementById('restartBtn').disabled = true;
@@ -1017,11 +1029,41 @@
     btn.style.opacity = '0.5';

     try {
-        const resp = await fetchWithAuth('/admin/api/worker/restart', {
+        let resp = await fetchWithAuth('/admin/api/worker/restart', {
             method: 'POST',
-            headers: { 'Content-Type': 'application/json' }
+            headers: { 'Content-Type': 'application/json' },
+            body: JSON.stringify({})
         });
-        const json = await resp.json();
+        let json = await resp.json();
+
+        if (!json.success && resp.status === 409) {
+            const reason = window.prompt(
+                '目前 restart policy 為 guarded mode。\n請輸入 override 原因(會記錄於稽核日誌):'
+            );
+            if (!reason || !reason.trim()) {
+                alert('已取消 override。');
+                return;
+            }
+
+            const acknowledged = window.confirm(
+                '確認執行 manual override?此操作將繞過 guarded mode 保護。'
+            );
+            if (!acknowledged) {
+                alert('已取消 override。');
+                return;
+            }
+
+            resp = await fetchWithAuth('/admin/api/worker/restart', {
+                method: 'POST',
+                headers: { 'Content-Type': 'application/json' },
+                body: JSON.stringify({
+                    manual_override: true,
+                    override_acknowledged: true,
+                    override_reason: reason.trim()
+                })
+            });
+            json = await resp.json();
+        }

         if (!json.success) {
             alert('重啟失敗: ' + (json.error?.message || '未知錯誤'));
@@ -682,7 +682,7 @@
     // State
     // ============================================================
     const state = {
-        reason: '{{ reason | e }}',
+        reason: {{ reason | tojson }},
         summary: null,
         distribution: null,
         lots: null,
@@ -129,12 +129,13 @@
         <div class="error-message">
             {{ error }}
         </div>
         {% endif %}

         <form method="POST">
+            <input type="hidden" name="csrf_token" value="{{ csrf_token() }}">
             <div class="form-group">
                 <label for="username">帳號</label>
                 <input type="text" id="username" name="username" placeholder="工號或 Email" required autofocus>
             </div>

             <div class="form-group">
tests/fixtures/cache_benchmark_fixture.json (new file, 9 lines, vendored)
@@ -0,0 +1,9 @@
+{
+  "rows": 30000,
+  "query_count": 400,
+  "seed": 42,
+  "thresholds": {
+    "max_p95_ratio_indexed_vs_baseline": 1.25,
+    "max_memory_amplification_ratio": 1.8
+  }
+}
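The fixture above supplies the two gate thresholds consumed by `scripts/run_cache_benchmarks.py` (P95 latency ratio of the indexed path vs. baseline, and cache memory amplification). A minimal sketch of how such a gate might evaluate them follows; only the threshold keys and values come from the fixture, while `evaluate_gate` and its signature are hypothetical, not the script's actual API:

```python
import json

# Thresholds copied from tests/fixtures/cache_benchmark_fixture.json.
FIXTURE = json.loads('''
{
  "thresholds": {
    "max_p95_ratio_indexed_vs_baseline": 1.25,
    "max_memory_amplification_ratio": 1.8
  }
}
''')


def evaluate_gate(p95_ratio, memory_ratio, thresholds=FIXTURE["thresholds"]):
    """Return a list of threshold violations; an empty list means the gate passes."""
    failures = []
    if p95_ratio > thresholds["max_p95_ratio_indexed_vs_baseline"]:
        failures.append(
            f"indexed/baseline P95 ratio {p95_ratio:.2f} exceeds "
            f"{thresholds['max_p95_ratio_indexed_vs_baseline']}"
        )
    if memory_ratio > thresholds["max_memory_amplification_ratio"]:
        failures.append(
            f"memory amplification {memory_ratio:.2f} exceeds "
            f"{thresholds['max_memory_amplification_ratio']}"
        )
    return failures


print(evaluate_gate(1.10, 1.50))  # within both thresholds → []
print(evaluate_gate(1.30, 2.00))  # both thresholds exceeded
```

Returning a list of violations (rather than a bare boolean) lets a CI wrapper print every exceeded threshold before exiting non-zero.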