chore: finalize vite migration hardening and archive openspec changes

This commit is contained in:
beabigegg
2026-02-08 20:03:36 +08:00
parent b56e80381b
commit c8e225101e
119 changed files with 6547 additions and 1301 deletions

README.md
View File

@@ -26,11 +26,60 @@
| Worker restart control | ✅ Done |
| Runtime resilience diagnostics (threshold/churn/recommendation) | ✅ Done |
| Shared WIP autocomplete core module | ✅ Done |
| Shared WIP derive core module (KPI/filter/chart/table) | ✅ Done |
| WIP indexed query acceleration and incremental sync | ✅ Done |
| Cache memory amplification telemetry | ✅ Done |
| Cache benchmark gate (P95/memory thresholds) | ✅ Done |
| Worker guarded mode + manual override audit | ✅ Done |
| Runtime contract startup validation (conda/systemd/watchdog) | ✅ Done |
| Frontend core module tests (Node test) | ✅ Done |
| Deployment automation | ✅ Done |
---
## Development History (post-Vite refactor)
- 2026-02-07: Completed the switch to the single-port Flask + Vite architecture; the legacy `DashBoard/` was retired.
- 2026-02-08: Filled out runtime resilience governance (threshold/churn/recommendation) and watchdog observability fields.
- 2026-02-08: Completed P0 security/stability hardening:
  - Startup fails fast when the production `SECRET_KEY` is missing
  - CSRF protection for the admin form and admin mutation APIs
  - Health probes use a dedicated DB pool, so they cannot block (or be blocked by) the main query pool
  - Worker/app shutdown uniformly cleans up the cache updater, realtime sync, Redis, and the DB engine
  - `hold_detail` inline script variables are now serialized with `tojson`
- 2026-02-08: Completed the P1 cache/query efficiency refactor:
  - The WIP query path uses indexed selection while preserving the `resource`/`wip` full-table cache semantics
  - Incremental WIP search index sync (watermark/version) with a drift fallback
  - health/admin gained cache memory amplification telemetry
  - Added `scripts/run_cache_benchmarks.py` + a fixture gate
- 2026-02-08: Completed P2 ops self-healing governance:
  - Shared runtime contract (app/start_server/watchdog/systemd)
  - Fail-fast at startup on conda/watchdog path drift
  - Worker restart policy (cooldown / retry budget / churn guarded mode)
  - Manual override (requires ack + reason) with a structured audit log
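The override flow above can be sketched roughly as follows. Only the payload field names (`manual_override`, `override_acknowledged`, `override_reason`) come from the actual API; the function name, logger name, and policy-state strings are illustrative assumptions, not the project's real implementation:

```python
import json
import logging
import time

logger = logging.getLogger("worker.audit")  # hypothetical audit logger name

def authorize_restart(policy_state: str, payload: dict, actor: str) -> bool:
    """Allow a restart when policy permits, or when a fully acknowledged override is given."""
    if policy_state == "allowed":
        return True
    # Guarded/blocked state: require an explicit, acknowledged, justified override.
    if not (payload.get("manual_override")
            and payload.get("override_acknowledged")
            and str(payload.get("override_reason", "")).strip()):
        return False
    # Structured audit record for every override that bypasses the policy.
    logger.warning(json.dumps({
        "event": "worker_restart_manual_override",
        "actor": actor,
        "policy_state": policy_state,
        "reason": payload["override_reason"],
        "ts": time.time(),
    }))
    return True
```

The key property is that a bare `manual_override: true` is never enough on its own — the acknowledgement and reason are mandatory, so every forced restart leaves an attributable audit trail.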
- 2026-02-08: Completed round-2 security/stability reinforcement:
  - Strict LDAP endpoint validation (`https` + `LDAP_ALLOWED_HOSTS`)
  - Process-level caches gained `max_size` + LRU eviction (WIP/Resource)
  - Circuit breaker transition logging moved outside the lock to reduce lock contention
  - Global security headers (CSP/XFO/nosniff/Referrer-Policy); HSTS added in production
  - WIP detail pagination bounds (`page>=1`, `1<=page_size<=500`)
- 2026-02-08: Completed round-3 residual risk fixes:
  - WIP cache publish uses staged publish; a failed refresh no longer pollutes the previous snapshot
  - WIP slow-path parsing moved outside the lock; the realtime equipment process cache gained a bounded LRU
  - Resource NaN cleanup switched to depth-safe iteration; WIP/Hold boolean query parsing was unified
  - Filter cache view names are now env-configurable
  - `/health` and `/health/deep` gained a 5-second internal memo (disabled in testing mode)
  - Lightweight rate limits added to high-cost APIs (WIP detail/matrix, Hold lots, Resource status/detail)
  - DB connection string log redaction masks passwords
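The in-process rate limit mentioned above can be approximated with a sliding window. This is a minimal sketch under stated assumptions — the class name and `allow()` API are illustrative, not the project's actual code; it only shows the counting scheme the env vars (`*_RATE_LIMIT_MAX_REQUESTS` / `*_RATE_LIMIT_WINDOW_SECONDS`) imply:

```python
import threading
import time
from collections import deque

class SlidingWindowRateLimiter:
    """Allow at most max_requests per window_seconds, per process."""

    def __init__(self, max_requests, window_seconds):
        self.max_requests = max_requests
        self.window_seconds = window_seconds
        self._hits = deque()          # timestamps of accepted requests
        self._lock = threading.Lock()

    def allow(self, now=None):
        now = time.monotonic() if now is None else now
        with self._lock:
            # Drop timestamps that fell out of the window.
            while self._hits and now - self._hits[0] >= self.window_seconds:
                self._hits.popleft()
            if len(self._hits) >= self.max_requests:
                return False  # caller should respond with the consistent 429 structure
            self._hits.append(now)
            return True
```

Because the state is per-process, limits apply independently to each Gunicorn worker — a deliberate trade-off that avoids shared-state coordination for what is only a lightweight guard.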
- 2026-02-08: Completed round-4 residual governance consolidation:
  - The Resource derived index uses a row-position representation, removing the in-process copy of full records
  - Resource / Realtime Equipment share Oracle SQL fragments, reducing query-definition drift
  - Converged type annotations and high-frequency constant naming in `resource_cache` / `realtime_equipment_cache`
  - `page_registry` file writes use atomic replace (tmp + rename) to avoid half-written config files
  - New test guards: shared SQL fragments, index normalization, and no duplicate route bool-parser definitions
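The tmp + rename pattern used for `page_registry` can be sketched like this (function name and JSON payload are illustrative; the actual module may differ). `os.replace()` is atomic on POSIX, so a reader always sees either the complete old file or the complete new one:

```python
import json
import os
import tempfile

def atomic_write_json(path, data):
    """Write JSON to a temp file in the target directory, then rename over the target."""
    directory = os.path.dirname(os.path.abspath(path))
    fd, tmp_path = tempfile.mkstemp(dir=directory, suffix=".tmp")
    try:
        with os.fdopen(fd, "w", encoding="utf-8") as fh:
            json.dump(data, fh, ensure_ascii=False, indent=2)
            fh.flush()
            os.fsync(fh.fileno())  # make sure bytes hit disk before the rename
        os.replace(tmp_path, path)  # atomic swap; never exposes a partial file
    except BaseException:
        os.unlink(tmp_path)  # clean up the temp file on any failure
        raise
```

The temp file must live in the same directory as the target: `os.replace()` across filesystems is not atomic (and may fail outright).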
---
## Migration and Acceptance Documents
- Root cutover inventory: `docs/root_cutover_inventory.md`
@@ -46,20 +95,37 @@
1. The single-port contract is unchanged
   - Flask + Gunicorn + Vite dist are served by one service (`GUNICORN_BIND`); frontend and backend stay same-origin.
2. Runtime resilience follows "degradation + actionable recommendations + policy state"
   - `/health`, `/health/deep`, `/admin/api/system-status`, and `/admin/api/worker/status` all provide:
     - thresholds
     - policy state (`allowed` / `cooldown` / `blocked`)
     - restart churn summary
     - alerts (pool/circuit/churn)
     - recovery recommendation (suggested on-call actions)
3. The watchdog self-healing policy is bounded
   - The restart flow incorporates a cooldown, a retry budget, and a churn window.
   - When churn exceeds the threshold, guarded mode takes over; further restarts require an admin manual override.
   - The state file keeps a bounded restart history for policy decisions and auditing.
4. Frontend governance (shared WIP compute)
   - `frontend/src/core/autocomplete.js` is the shared logic source for WIP overview/detail.
   - `frontend/src/core/wip-derive.js` shares the KPI/filter/chart/table derivation logic.
   - Existing page flows and drill-down semantics are preserved; user workflows are unchanged.
5. P1 cache efficiency governance
   - The `resource` and `wip` full-table cache strategy is retained (business constraint unchanged).
   - Queries use indexed selection, with memory amplification / index efficiency telemetry.
   - A benchmark gate verifies P95 latency and memory amplification stay within thresholds.
6. P0 runtime hardening (security + stability)
   - Production must provide `SECRET_KEY`; the service refuses to start without it.
   - Mutating requests to `/admin/login` and `/admin/api/*` must carry a CSRF token.
   - The `/health` database connectivity probe uses a dedicated health pool, reducing false alarms when the main pool saturates.
   - Shutdown/restart uniformly releases background workers and Redis/DB connections.
   - The LDAP API URL is validated at startup: `https` only, plus a host allowlist.
   - Global security headers (CSP/X-Frame-Options/X-Content-Type-Options/Referrer-Policy); HSTS in production.
---
## Quick Start
@@ -175,6 +241,12 @@ DB_MAX_OVERFLOW=20
DB_POOL_TIMEOUT=30
DB_POOL_RECYCLE=1800
DB_CALL_TIMEOUT_MS=55000
DB_POOL_EXHAUSTED_RETRY_AFTER_SECONDS=5
# Dedicated DB pool for health probes (isolated from the main request pool)
DB_HEALTH_POOL_SIZE=1
DB_HEALTH_MAX_OVERFLOW=0
DB_HEALTH_POOL_TIMEOUT=2
# Circuit Breaker
CIRCUIT_BREAKER_ENABLED=true
@@ -192,6 +264,17 @@ WATCHDOG_RESTART_FLAG=./tmp/mes_dashboard_restart.flag
WATCHDOG_PID_FILE=./tmp/gunicorn.pid
WATCHDOG_STATE_FILE=./tmp/mes_dashboard_restart_state.json
WATCHDOG_RESTART_HISTORY_MAX=50
CONDA_BIN=/opt/miniconda3/bin/conda
CONDA_ENV_NAME=mes-dashboard
RUNTIME_CONTRACT_VERSION=2026.02-p2
RUNTIME_CONTRACT_ENFORCE=true
# Worker self-healing policy
WORKER_RESTART_COOLDOWN=60
WORKER_RESTART_RETRY_BUDGET=3
WORKER_RESTART_WINDOW_SECONDS=600
WORKER_RESTART_CHURN_THRESHOLD=3
WORKER_GUARDED_MODE_ENABLED=true
# Runtime resilience thresholds
RESILIENCE_DEGRADED_ALERT_SECONDS=300
@@ -202,6 +285,36 @@ RESILIENCE_RESTART_CHURN_THRESHOLD=3
# Administrator settings
ADMIN_EMAILS=admin@example.com  # administrator emails (comma-separated)
LDAP_API_URL=https://ldap-api.example.com
LDAP_ALLOWED_HOSTS=ldap-api.example.com,ldap-api-dr.example.com
# CSRF protection (admin form / admin mutation APIs)
CSRF_ENABLED=true
# Process-level cache bounded LRU (WIP/Resource)
PROCESS_CACHE_MAX_SIZE=32
WIP_PROCESS_CACHE_MAX_SIZE=32
RESOURCE_PROCESS_CACHE_MAX_SIZE=32
EQUIPMENT_PROCESS_CACHE_MAX_SIZE=32
# Filter cache source views (env-overridable)
FILTER_CACHE_WIP_VIEW=DWH.DW_MES_LOT_V
FILTER_CACHE_SPEC_WORKCENTER_VIEW=DWH.DW_MES_SPEC_WORKCENTER_V
# Health internal memoization
HEALTH_MEMO_TTL_SECONDS=5
# High-cost API rate limit (in-process)
WIP_MATRIX_RATE_LIMIT_MAX_REQUESTS=120
WIP_MATRIX_RATE_LIMIT_WINDOW_SECONDS=60
WIP_DETAIL_RATE_LIMIT_MAX_REQUESTS=90
WIP_DETAIL_RATE_LIMIT_WINDOW_SECONDS=60
HOLD_LOTS_RATE_LIMIT_MAX_REQUESTS=90
HOLD_LOTS_RATE_LIMIT_WINDOW_SECONDS=60
RESOURCE_DETAIL_RATE_LIMIT_MAX_REQUESTS=60
RESOURCE_DETAIL_RATE_LIMIT_WINDOW_SECONDS=60
RESOURCE_STATUS_RATE_LIMIT_MAX_REQUESTS=90
RESOURCE_STATUS_RATE_LIMIT_WINDOW_SECONDS=60
```
### Production Notes
@@ -226,6 +339,7 @@ sudo cp deploy/mes-dashboard-watchdog.service /etc/systemd/system/
# 2. Prepare the environment config file
sudo mkdir -p /etc/mes-dashboard
sudo cp deploy/mes-dashboard.env.example /etc/mes-dashboard/mes-dashboard.env
sudo cp .env /etc/mes-dashboard/mes-dashboard.env
# 3. Reload systemd
@@ -239,6 +353,12 @@ sudo systemctl status mes-dashboard
sudo systemctl status mes-dashboard-watchdog
```
Run the runtime contract validation:
```bash
RUNTIME_CONTRACT_ENFORCE=true ./scripts/start_server.sh check
```
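Conceptually, the contract check validates that the paths the app, watchdog, and systemd units agree on actually exist before anything starts. A rough sketch, assuming a dict-based interface — the function name and error wording are illustrative, and only the env variable names come from the document:

```python
import os

def check_runtime_contract(env):
    """Return actionable drift errors; an empty list means the contract holds."""
    errors = []
    conda_bin = env.get("CONDA_BIN", "")
    if not os.path.isfile(conda_bin):
        errors.append(f"CONDA_BIN does not exist: {conda_bin!r}")
    for key in ("WATCHDOG_RESTART_FLAG", "WATCHDOG_PID_FILE", "WATCHDOG_STATE_FILE"):
        value = env.get(key, "")
        if not value:
            errors.append(f"{key} is not set")
        elif not os.path.isdir(os.path.dirname(value)):
            errors.append(f"parent directory missing for {key}: {value!r}")
    if env.get("RUNTIME_CONTRACT_ENFORCE") == "true" and errors:
        # Enforced mode: fail fast instead of starting with a drifted contract.
        raise SystemExit("runtime contract drift: " + "; ".join(errors))
    return errors
```

With `RUNTIME_CONTRACT_ENFORCE` unset, the same check can run in report-only mode, which is useful during rehearsals before flipping enforcement on.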
### Rollback Steps
To roll back to a previous version:
@@ -494,7 +614,8 @@ DashBoard_vite/
│   └── worker_watchdog.py              # Worker monitor
├── deploy/                             # Deployment configs
│   ├── mes-dashboard.service           # Gunicorn systemd service (Conda)
│   ├── mes-dashboard-watchdog.service  # Watchdog systemd service (Conda)
│   └── mes-dashboard.env.example       # Runtime contract env template
├── tests/                              # Tests
├── data/                               # Data files
├── logs/                               # Logs
@@ -524,6 +645,9 @@ pytest tests/e2e/ -v
# Run stress tests
pytest tests/stress/ -v
# Cache benchmark gate (P1)
conda run -n mes-dashboard python scripts/run_cache_benchmarks.py --enforce
```
---
@@ -569,12 +693,17 @@ pytest tests/stress/ -v
### 2026-02-08
- Completed and archived the proposal `post-migration-resilience-governance`
- Completed and archived the proposal `p1-cache-query-efficiency`
- Completed and archived the proposal `p2-ops-self-healing-runbook`
- Added the runtime resilience diagnostics core (thresholds / restart churn / recovery recommendation)
- Added worker restart policy state (allowed/cooldown/blocked) and the guarded-mode override flow
- Health and admin APIs gained actionable resilience fields:
  - `/health`, `/health/deep`
  - `/admin/api/system-status`, `/admin/api/worker/status`
- Watchdog restart state supports bounded history (`WATCHDOG_RESTART_HISTORY_MAX`)
- WIP overview/detail now share the autocomplete/filter module (`frontend/src/core/autocomplete.js`)
- WIP overview/detail adopted the shared derive module (`frontend/src/core/wip-derive.js`)
- Added cache benchmark fixtures and baseline-vs-indexed threshold validation
- Added the frontend Node test flow (`npm --prefix frontend test`)
- Updated `README.mdj` and the migration runbook to align with the gates
@@ -654,5 +783,5 @@ pytest tests/stress/ -v
---
**Document version**: 4.2
**Last updated**: 2026-02-08

View File

@@ -1,61 +1,151 @@
# MES Dashboard (README.mdj)
This document is a condensed technical mirror of `README.md`, focused on the currently runnable architecture and the operations contract.
## 1. Architecture Summary (2026-02-08)
- Backend: Flask + Gunicorn (single port)
- Frontend: Vite build output to `src/mes_dashboard/static/dist`
- Cache: Redis + process-level cache + indexed selection telemetry
- Data: Oracle (QueuePool)
- Ops: watchdog + admin worker restart API + guarded-mode policy
## 2. Existing Design Principles (retained)
- `resource` (equipment master data) and `wip` (live line status) keep the full-table cache strategy.
- Frontend page logic and drill-down semantics stay unchanged.
- The system remains a single-port service (frontend and backend same-origin).
## 3. P0 Runtime Hardening (done)
- Production enforces `SECRET_KEY`: startup fails immediately when it is unset or uses an unsafe default.
- CSRF protection:
  - The `/admin/login` form requires a token
  - `POST/PUT/PATCH/DELETE` requests to `/admin/api/*` require `X-CSRF-Token`
- Session hardening: `session.clear()` on successful login + CSRF token rotation.
- Health probe isolation: the `/health` DB connectivity check uses a dedicated health pool.
- Shutdown cleanup: uniformly stops the cache updater and equipment sync worker, and closes the Redis and DB engines.
- XSS hardening: `reason` in the `hold_detail` fallback script now uses `tojson`.
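The CSRF rule above boils down to a single check on mutating methods. A framework-free sketch under stated assumptions — `csrf_ok` and its parameters are illustrative, not the project's actual helper; only the method list and the `X-CSRF-Token` header come from the document:

```python
import hmac

MUTATING_METHODS = {"POST", "PUT", "PATCH", "DELETE"}

def csrf_ok(method, session_token, header_token):
    """Reject mutating admin requests whose X-CSRF-Token does not match the session token."""
    if method.upper() not in MUTATING_METHODS:
        return True  # safe methods carry no state change and need no token
    if not session_token or not header_token:
        return False
    # Constant-time comparison avoids leaking token prefixes via timing.
    return hmac.compare_digest(session_token, header_token)
```

In a Flask app this would typically run in a `before_request` hook scoped to the admin blueprint, returning 403 when the check fails.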
## 4. P1 Cache/Query Efficiency (done)
- `resource` / `wip` keep the full-table cache strategy (business constraint unchanged).
- WIP queries use indexed selection, plus incremental sync (watermark/version) with a drift fallback.
- `/health`, `/health/deep`, and `/admin/api/system-status` expose cache memory amplification/index telemetry.
- New benchmark harness: `scripts/run_cache_benchmarks.py --enforce`.
## 5. P2 Ops Self-Healing (done)
- Shared runtime contract: app/start_server/watchdog/systemd use the same watchdog/conda path contract.
- Startup fail-fast: refuses to start on conda/runtime path drift and prints actionable diagnostics.
- Worker restart policy: cooldown + retry budget + churn guarded mode.
- Manual override: requires admin identity + `manual_override` + `override_acknowledged` + `override_reason`, and writes an audit log.
- Health/admin payloads expose the policy state (`allowed` / `cooldown` / `blocked`).
## 6. Round-3 Residual Hardening (done)
- WIP cache publish switched to staged publish; a failed update does not overwrite the previous snapshot.
- WIP process cache slow-path parsing moved outside the lock, reducing lock contention.
- The realtime equipment process cache gained a bounded LRU (with `EQUIPMENT_PROCESS_CACHE_MAX_SIZE`).
- `_clean_nan_values` switched to depth-safe iterative cleanup (avoiding deep-recursion risk).
- The WIP/Hold/Resource bool query parser was unified (`core/utils.py`).
- Filter cache source views can be overridden via env (easing environment switches and testing).
- `/health` and `/health/deep` gained a 5-second memo (auto-disabled in testing mode).
- High-cost APIs gained a lightweight in-process rate limit; over-limit requests return a consistent 429 structure.
- DB connection string logging masks sensitive fields (password redaction).
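The bounded LRU mentioned above (gated by the `*_PROCESS_CACHE_MAX_SIZE` variables) can be built on `OrderedDict`. A minimal sketch — the class name and API are illustrative, not the project's actual cache module:

```python
from collections import OrderedDict
from threading import Lock

class BoundedLRUCache:
    """Process-level cache with a hard max_size; least recently used entries are evicted."""

    def __init__(self, max_size=32):
        self.max_size = max_size
        self._data = OrderedDict()
        self._lock = Lock()

    def get(self, key, default=None):
        with self._lock:
            if key not in self._data:
                return default
            self._data.move_to_end(key)  # mark as most recently used
            return self._data[key]

    def put(self, key, value):
        with self._lock:
            self._data[key] = value
            self._data.move_to_end(key)
            while len(self._data) > self.max_size:
                self._data.popitem(last=False)  # evict the least recently used entry
```

The bound is what prevents the memory amplification that unbounded per-process caches accumulate over long worker lifetimes.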
## 7. Round-4 Residual Consolidation (done)
- The Resource derived index uses a row-position representation; no copy of the full records is kept in the process.
- Resource / Realtime Equipment share Oracle SQL fragments, avoiding duplicated query-definition drift.
- Converged the type annotation style and high-frequency constant naming in `resource_cache` / `realtime_equipment_cache`.
- `page_registry` file writes use atomic replace, lowering the risk of half-written config files.
- New tests cover shared SQL fragments and the no-duplicate bool-parser governance.
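The row-position idea above replaces per-key copies of records with per-key lists of integer positions into one shared records list. A sketch with hypothetical names (`build_row_position_index`/`lookup` and the field names are illustrative):

```python
from collections import defaultdict

def build_row_position_index(records, key_field):
    """Map each key value to the positions of matching rows, instead of copying rows."""
    index = defaultdict(list)
    for pos, record in enumerate(records):
        index[record.get(key_field)].append(pos)
    return dict(index)

def lookup(records, index, key):
    """Resolve positions back against the single shared records list."""
    return [records[pos] for pos in index.get(key, [])]
```

Memory cost drops from O(rows × copies) to one shared list plus small integer lists, and every lookup still sees the same record objects the cache snapshot holds.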
## 8. Key Environment Variables
```bash
FLASK_ENV=production
SECRET_KEY=<required-in-production>
CSRF_ENABLED=true
LDAP_API_URL=https://ldap-api.example.com
LDAP_ALLOWED_HOSTS=ldap-api.example.com,ldap-api-dr.example.com
DB_POOL_SIZE=10
DB_MAX_OVERFLOW=20
DB_POOL_TIMEOUT=30
DB_POOL_RECYCLE=1800
DB_CALL_TIMEOUT_MS=55000
DB_POOL_EXHAUSTED_RETRY_AFTER_SECONDS=5
DB_HEALTH_POOL_SIZE=1
DB_HEALTH_MAX_OVERFLOW=0
DB_HEALTH_POOL_TIMEOUT=2
CONDA_BIN=/opt/miniconda3/bin/conda
CONDA_ENV_NAME=mes-dashboard
RUNTIME_CONTRACT_VERSION=2026.02-p2
RUNTIME_CONTRACT_ENFORCE=true
WATCHDOG_RUNTIME_DIR=./tmp
WATCHDOG_RESTART_FLAG=./tmp/mes_dashboard_restart.flag
WATCHDOG_PID_FILE=./tmp/gunicorn.pid
WATCHDOG_STATE_FILE=./tmp/mes_dashboard_restart_state.json
WATCHDOG_RESTART_HISTORY_MAX=50
WORKER_RESTART_COOLDOWN=60
WORKER_RESTART_RETRY_BUDGET=3
WORKER_RESTART_WINDOW_SECONDS=600
WORKER_RESTART_CHURN_THRESHOLD=3
WORKER_GUARDED_MODE_ENABLED=true
PROCESS_CACHE_MAX_SIZE=32
WIP_PROCESS_CACHE_MAX_SIZE=32
RESOURCE_PROCESS_CACHE_MAX_SIZE=32
EQUIPMENT_PROCESS_CACHE_MAX_SIZE=32
FILTER_CACHE_WIP_VIEW=DWH.DW_MES_LOT_V
FILTER_CACHE_SPEC_WORKCENTER_VIEW=DWH.DW_MES_SPEC_WORKCENTER_V
HEALTH_MEMO_TTL_SECONDS=5
WIP_MATRIX_RATE_LIMIT_MAX_REQUESTS=120
WIP_MATRIX_RATE_LIMIT_WINDOW_SECONDS=60
WIP_DETAIL_RATE_LIMIT_MAX_REQUESTS=90
WIP_DETAIL_RATE_LIMIT_WINDOW_SECONDS=60
HOLD_LOTS_RATE_LIMIT_MAX_REQUESTS=90
HOLD_LOTS_RATE_LIMIT_WINDOW_SECONDS=60
RESOURCE_DETAIL_RATE_LIMIT_MAX_REQUESTS=60
RESOURCE_DETAIL_RATE_LIMIT_WINDOW_SECONDS=60
RESOURCE_STATUS_RATE_LIMIT_MAX_REQUESTS=90
RESOURCE_STATUS_RATE_LIMIT_WINDOW_SECONDS=60
```
## 9. Validation Commands (recommended)
```bash
# Backend (conda)
conda run -n mes-dashboard python -m pytest -q tests/test_runtime_hardening.py
# Frontend
npm --prefix frontend test
npm --prefix frontend run build
# P1 benchmark gate
conda run -n mes-dashboard python scripts/run_cache_benchmarks.py --enforce
# P2 runtime contract check
RUNTIME_CONTRACT_ENFORCE=true ./scripts/start_server.sh check
```
## 10. Development History (Vite project)
- 2026-02-07: Completed the Vite root-directory refactor and removal of the legacy version.
- 2026-02-08: Completed resilience diagnostics governance and frontend shared modules.
- 2026-02-08: Completed P0 security/stability hardening (this update).
- 2026-02-08: Completed the P1 cache/query efficiency refactor (index + benchmark gate).
- 2026-02-08: Completed P2 ops self-healing governance (guarded mode + manual override + runtime contract).
- 2026-02-08: Completed round-2 hardening (LDAP URL validation, bounded LRU cache, circuit-breaker logging outside the lock, security headers, pagination bounds).
- 2026-02-08: Completed round-3 residual hardening (staged publish, health memo, API rate limits, DB redaction, env-driven filter views).
- 2026-02-08: Completed round-4 residual consolidation (normalized resource index representation, shared SQL fragments, type/constant governance, atomic page-status writes).

View File

@@ -18,6 +18,13 @@ Environment="WATCHDOG_RESTART_FLAG=/run/mes-dashboard/mes_dashboard_restart.flag
Environment="WATCHDOG_PID_FILE=/run/mes-dashboard/gunicorn.pid"
Environment="WATCHDOG_STATE_FILE=/var/lib/mes-dashboard/restart_state.json"
Environment="WATCHDOG_CHECK_INTERVAL=5"
Environment="RUNTIME_CONTRACT_VERSION=2026.02-p2"
Environment="RUNTIME_CONTRACT_ENFORCE=true"
Environment="WORKER_RESTART_COOLDOWN=60"
Environment="WORKER_RESTART_RETRY_BUDGET=3"
Environment="WORKER_RESTART_WINDOW_SECONDS=600"
Environment="WORKER_RESTART_CHURN_THRESHOLD=3"
Environment="WORKER_GUARDED_MODE_ENABLED=true"
RuntimeDirectory=mes-dashboard
StateDirectory=mes-dashboard

View File

@@ -0,0 +1,26 @@
# MES Dashboard runtime contract (version 2026.02-p2)
# Conda runtime
CONDA_BIN=/opt/miniconda3/bin/conda
CONDA_ENV_NAME=mes-dashboard
# Single-port serving contract
GUNICORN_BIND=0.0.0.0:8080
# Watchdog/runtime paths
WATCHDOG_RUNTIME_DIR=/run/mes-dashboard
WATCHDOG_RESTART_FLAG=/run/mes-dashboard/mes_dashboard_restart.flag
WATCHDOG_PID_FILE=/run/mes-dashboard/gunicorn.pid
WATCHDOG_STATE_FILE=/var/lib/mes-dashboard/restart_state.json
WATCHDOG_CHECK_INTERVAL=5
# Runtime contract enforcement
RUNTIME_CONTRACT_VERSION=2026.02-p2
RUNTIME_CONTRACT_ENFORCE=true
# Worker recovery policy
WORKER_RESTART_COOLDOWN=60
WORKER_RESTART_RETRY_BUDGET=3
WORKER_RESTART_WINDOW_SECONDS=600
WORKER_RESTART_CHURN_THRESHOLD=3
WORKER_GUARDED_MODE_ENABLED=true

View File

@@ -18,6 +18,13 @@ Environment="WATCHDOG_RUNTIME_DIR=/run/mes-dashboard"
Environment="WATCHDOG_RESTART_FLAG=/run/mes-dashboard/mes_dashboard_restart.flag"
Environment="WATCHDOG_PID_FILE=/run/mes-dashboard/gunicorn.pid"
Environment="WATCHDOG_STATE_FILE=/var/lib/mes-dashboard/restart_state.json"
Environment="RUNTIME_CONTRACT_VERSION=2026.02-p2"
Environment="RUNTIME_CONTRACT_ENFORCE=true"
Environment="WORKER_RESTART_COOLDOWN=60"
Environment="WORKER_RESTART_RETRY_BUDGET=3"
Environment="WORKER_RESTART_WINDOW_SECONDS=600"
Environment="WORKER_RESTART_CHURN_THRESHOLD=3"
Environment="WORKER_GUARDED_MODE_ENABLED=true"
RuntimeDirectory=mes-dashboard
StateDirectory=mes-dashboard

View File

@@ -26,10 +26,12 @@ A release is cutover-ready only when all gates pass:
- pool exhaustion path returns `503` + `DB_POOL_EXHAUSTED` and `Retry-After`
- circuit-open path returns `503` + `CIRCUIT_BREAKER_OPEN` and fail-fast semantics
- frontend client does not aggressively retry on degraded pool exhaustion responses
- health/admin payloads expose worker policy state (`allowed`/`cooldown`/`blocked`) and alert booleans
6. Conda-systemd contract gate
- `deploy/mes-dashboard.service` and `deploy/mes-dashboard-watchdog.service` both run in the same conda runtime contract
- `WATCHDOG_RESTART_FLAG`, `WATCHDOG_PID_FILE`, `WATCHDOG_STATE_FILE` paths are consistent across app/admin/watchdog
- startup contract validation passes: `RUNTIME_CONTRACT_ENFORCE=true ./scripts/start_server.sh check`
- single-port bind (`GUNICORN_BIND`) remains stable during restart workflow - single-port bind (`GUNICORN_BIND`) remains stable during restart workflow
7. Regression gate
@@ -60,7 +62,8 @@ A release is cutover-ready only when all gates pass:
5. Conda + systemd rehearsal (recommended before production cutover)
- `sudo cp deploy/mes-dashboard.service /etc/systemd/system/`
- `sudo cp deploy/mes-dashboard-watchdog.service /etc/systemd/system/`
- `sudo mkdir -p /etc/mes-dashboard && sudo cp deploy/mes-dashboard.env.example /etc/mes-dashboard/mes-dashboard.env`
- merge deployment secrets from `.env` into `/etc/mes-dashboard/mes-dashboard.env`
- `sudo systemctl daemon-reload`
- `sudo systemctl enable --now mes-dashboard mes-dashboard-watchdog`
- call `/admin/api/worker/status` and verify runtime contract paths exist
@@ -69,6 +72,7 @@ A release is cutover-ready only when all gates pass:
- call `/health` and `/health/deep`
- confirm route cache mode, degraded flags, and pool/runtime diagnostics align with environment (Redis on/off)
- trigger one controlled worker restart from admin API and verify single-port continuity
- verify guarded mode flow: blocked restart requires manual override payload (`manual_override`, `override_acknowledged`, `override_reason`)
- verify README architecture section matches deployed runtime contract
## Rollback Procedure
@@ -111,3 +115,6 @@ Use these initial thresholds for alerting/escalation:
4. Frontend/API retry pressure
- significant increase of client retries for `DB_POOL_EXHAUSTED` or `CIRCUIT_BREAKER_OPEN` responses over baseline
5. Recovery policy blocked
- `resilience.policy_state.blocked == true` or `resilience.alerts.restart_blocked == true`

View File

@@ -1,5 +1,21 @@
const DEFAULT_TIMEOUT = 30000;
function getCsrfToken() {
  return document.querySelector('meta[name="csrf-token"]')?.content || '';
}
function withCsrfHeaders(headers = {}, method = 'GET') {
  const normalized = String(method).toUpperCase();
  const merged = { ...headers };
  if (['POST', 'PUT', 'PATCH', 'DELETE'].includes(normalized)) {
    const csrf = getCsrfToken();
    if (csrf && !merged['X-CSRF-Token']) {
      merged['X-CSRF-Token'] = csrf;
    }
  }
  return merged;
}
function buildApiError(response, payload) {
  const message =
    payload?.error?.message ||
@@ -47,15 +63,19 @@ export async function apiGet(url, options = {}) {
export async function apiPost(url, payload, options = {}) {
  if (window.MesApi?.post) {
    const enrichedOptions = {
      ...options,
      headers: withCsrfHeaders(options.headers || {}, 'POST')
    };
    return window.MesApi.post(url, payload, enrichedOptions);
  }
  return fetchJson(url, {
    ...options,
    method: 'POST',
    headers: withCsrfHeaders({
      'Content-Type': 'application/json',
      ...(options.headers || {})
    }, 'POST'),
    body: JSON.stringify(payload)
  });
}
@@ -64,6 +84,7 @@ export async function apiUpload(url, formData, options = {}) {
  return fetchJson(url, {
    ...options,
    method: 'POST',
    headers: withCsrfHeaders(options.headers || {}, 'POST'),
    body: formData
  });
}

View File

@@ -0,0 +1,75 @@
function toTrimmedString(value) {
  if (value === null || value === undefined) {
    return '';
  }
  return String(value).trim();
}
export function normalizeStatusFilter(statusFilter) {
  if (!statusFilter) {
    return {};
  }
  if (statusFilter === 'quality-hold') {
    return { status: 'HOLD', hold_type: 'quality' };
  }
  if (statusFilter === 'non-quality-hold') {
    return { status: 'HOLD', hold_type: 'non-quality' };
  }
  return { status: String(statusFilter).toUpperCase() };
}
export function buildWipOverviewQueryParams(filters = {}, statusFilter = null) {
  const params = {};
  const workorder = toTrimmedString(filters.workorder);
  const lotid = toTrimmedString(filters.lotid);
  const pkg = toTrimmedString(filters.package);
  const type = toTrimmedString(filters.type);
  if (workorder) params.workorder = workorder;
  if (lotid) params.lotid = lotid;
  if (pkg) params.package = pkg;
  if (type) params.type = type;
  return { ...params, ...normalizeStatusFilter(statusFilter) };
}
export function buildWipDetailQueryParams({
  page,
  pageSize,
  filters = {},
  statusFilter = null,
}) {
  return {
    page,
    page_size: pageSize,
    ...buildWipOverviewQueryParams(filters, statusFilter),
  };
}
export function splitHoldByType(data) {
  const items = Array.isArray(data?.items) ? data.items : [];
  const quality = items.filter((item) => item?.holdType === 'quality');
  const nonQuality = items.filter((item) => item?.holdType !== 'quality');
  return { quality, nonQuality };
}
export function prepareParetoData(items) {
  if (!Array.isArray(items) || items.length === 0) {
    return { reasons: [], qtys: [], lots: [], cumulative: [], totalQty: 0, items: [] };
  }
  const sorted = [...items].sort((a, b) => (Number(b?.qty) || 0) - (Number(a?.qty) || 0));
  const reasons = sorted.map((item) => toTrimmedString(item?.reason) || '未知');
  const qtys = sorted.map((item) => Number(item?.qty) || 0);
  const lots = sorted.map((item) => Number(item?.lots) || 0);
  const totalQty = qtys.reduce((sum, value) => sum + value, 0);
  let running = 0;
  const cumulative = qtys.map((qty) => {
    running += qty;
    if (totalQty <= 0) return 0;
    return Math.round((running / totalQty) * 100);
  });
  return { reasons, qtys, lots, cumulative, totalQty, items: sorted };
}

View File

@@ -3,6 +3,7 @@ import {
  debounce,
  fetchWipAutocompleteItems,
} from '../core/autocomplete.js';
import { buildWipDetailQueryParams } from '../core/wip-derive.js';
ensureMesApiAvailable();
@@ -73,36 +74,12 @@ ensureMesApiAvailable();
}
async function fetchDetail(signal = null) {
  const params = buildWipDetailQueryParams({
    page: state.page,
    pageSize: state.pageSize,
    filters: state.filters,
    statusFilter: activeStatusFilter,
  });
  const result = await MesApi.get(`/api/wip/detail/${encodeURIComponent(state.workcenter)}`, {
    params,

View File

@@ -3,6 +3,11 @@ import {
  debounce,
  fetchWipAutocompleteItems,
} from '../core/autocomplete.js';
import {
  buildWipOverviewQueryParams,
  splitHoldByType as splitHoldByTypeShared,
  prepareParetoData as prepareParetoDataShared,
} from '../core/wip-derive.js';
ensureMesApiAvailable();
@@ -61,20 +66,7 @@ ensureMesApiAvailable();
}
function buildQueryParams() {
  return buildWipOverviewQueryParams(state.filters);
}
// ============================================================
@@ -96,19 +88,7 @@ ensureMesApiAvailable();
}
async function fetchMatrix(signal = null) {
  const params = buildWipOverviewQueryParams(state.filters, activeStatusFilter);
  const result = await MesApi.get('/api/wip/overview/matrix', {
    params,
    timeout: API_TIMEOUT,
@@ -467,37 +447,12 @@ ensureMesApiAvailable();
// Task 2.1: Split hold data by type
function splitHoldByType(data) {
  return splitHoldByTypeShared(data);
}
// Task 2.2: Prepare Pareto data (sort by QTY desc, calculate cumulative %)
function prepareParetoData(items) {
  return prepareParetoDataShared(items);
}
// Task 3.1: Initialize Pareto charts
@@ -0,0 +1,80 @@
import test from 'node:test';
import assert from 'node:assert/strict';
import {
buildWipOverviewQueryParams,
buildWipDetailQueryParams,
splitHoldByType,
prepareParetoData,
} from '../src/core/wip-derive.js';
test('buildWipOverviewQueryParams keeps only non-empty filters', () => {
const params = buildWipOverviewQueryParams({
workorder: ' WO-1 ',
lotid: '',
package: 'PKG-A',
type: 'QFN',
});
assert.deepEqual(params, {
workorder: 'WO-1',
package: 'PKG-A',
type: 'QFN',
});
});
test('buildWipOverviewQueryParams maps quality hold status filter', () => {
const params = buildWipOverviewQueryParams({}, 'quality-hold');
assert.deepEqual(params, {
status: 'HOLD',
hold_type: 'quality',
});
});
test('buildWipDetailQueryParams uses page/page_size and shared filter mapper', () => {
const params = buildWipDetailQueryParams({
page: 2,
pageSize: 100,
filters: {
workorder: 'WO',
lotid: 'LOT',
package: '',
type: 'TSOP',
},
statusFilter: 'run',
});
assert.deepEqual(params, {
page: 2,
page_size: 100,
workorder: 'WO',
lotid: 'LOT',
type: 'TSOP',
status: 'RUN',
});
});
test('splitHoldByType partitions quality/non-quality correctly', () => {
const grouped = splitHoldByType({
items: [
{ reason: 'Q1', holdType: 'quality' },
{ reason: 'NQ1', holdType: 'non-quality' },
{ reason: 'NQ2' },
],
});
assert.equal(grouped.quality.length, 1);
assert.equal(grouped.nonQuality.length, 2);
});
test('prepareParetoData sorts by qty and builds cumulative percentages', () => {
const data = prepareParetoData([
{ reason: 'B', qty: 20, lots: 1 },
{ reason: 'A', qty: 80, lots: 2 },
]);
assert.deepEqual(data.reasons, ['A', 'B']);
assert.deepEqual(data.qtys, [80, 20]);
assert.deepEqual(data.cumulative, [80, 100]);
assert.equal(data.totalQty, 100);
});
@@ -1,12 +1,12 @@
import { defineConfig } from 'vite'; import { defineConfig } from 'vite';
import { resolve } from 'node:path'; import { resolve } from 'node:path';
export default defineConfig({ export default defineConfig(({ mode }) => ({
publicDir: false, publicDir: false,
build: { build: {
outDir: '../src/mes_dashboard/static/dist', outDir: '../src/mes_dashboard/static/dist',
emptyOutDir: false, emptyOutDir: false,
sourcemap: false, sourcemap: mode !== 'production',
rollupOptions: { rollupOptions: {
input: { input: {
portal: resolve(__dirname, 'src/portal/main.js'), portal: resolve(__dirname, 'src/portal/main.js'),
@@ -22,8 +22,17 @@ export default defineConfig({
      output: {
        entryFileNames: '[name].js',
        chunkFileNames: 'chunks/[name]-[hash].js',
        assetFileNames: '[name][extname]',
        manualChunks(id) {
          if (!id.includes('node_modules')) {
            return;
          }
          if (id.includes('echarts')) {
            return 'vendor-echarts';
          }
          return 'vendor';
        }
      }
    }
  }
}));
@@ -30,6 +30,18 @@ def worker_exit(server, worker):
    except Exception as e:
        server.log.warning(f"Error stopping equipment sync worker: {e}")

    try:
        from mes_dashboard.core.cache_updater import stop_cache_updater
        stop_cache_updater()
    except Exception as e:
        server.log.warning(f"Error stopping cache updater: {e}")

    try:
        from mes_dashboard.core.redis_client import close_redis
        close_redis()
    except Exception as e:
        server.log.warning(f"Error closing redis client: {e}")

    # Then dispose database connections
    try:
        from mes_dashboard.core.database import dispose_engine
@@ -0,0 +1,2 @@
schema: spec-driven
created: 2026-02-08
@@ -0,0 +1,46 @@
## Context
The current architecture already supports single-port Gunicorn runtime, circuit-breaker-aware degraded responses, and watchdog-assisted recovery. However, critical security and lifecycle controls are uneven: production startup can still fall back to a weak secret key, CSRF is not enforced globally, and background resources are not fully registered in a single shutdown lifecycle. These gaps are operationally risky when pool pressure or restart churn occurs.
## Goals / Non-Goals
**Goals:**
- Make production startup fail fast when required security secrets are missing.
- Enforce CSRF validation for all state-changing endpoints without breaking existing frontend flow.
- Make worker/app shutdown deterministic by stopping all background workers and shared clients.
- Keep degraded responses for pool exhaustion and circuit-open states stable and retry-aware.
- Isolate health probe connectivity from main request pool contention.
**Non-Goals:**
- Replacing LDAP provider or redesigning the full authentication architecture.
- Full CSP rollout across all templates in this change.
- Changing URL structure, page IA, or single-port deployment topology.
## Decisions
1. **Production secret-key guard at startup**
- Decision: enforce `SECRET_KEY` presence/strength in non-development modes and abort startup when invalid.
- Rationale: prevents silent insecure deployment.
2. **Unified CSRF contract across form + JSON flows**
- Decision: issue CSRF token from server session, validate hidden form field for HTML forms and `X-CSRF-Token` for JSON POST/PUT/PATCH/DELETE.
- Rationale: maintains current frontend behavior while covering non-form APIs.
3. **Centralized shutdown registry**
- Decision: register explicit shutdown hooks that call cache updater stop, realtime sync stop, Redis close, and DB dispose in bounded order.
- Rationale: avoids thread/client leaks during worker recycle and controlled reload.
4. **Health probe pool isolation**
- Decision: use a dedicated lightweight DB health engine/pool for `/health` checks.
- Rationale: prevents health endpoint from being blocked by request-pool exhaustion, improving observability fidelity.
5. **Template-safe JS serialization**
- Decision: replace HTML-escaped interpolation in JS string contexts with `tojson` serialization.
- Rationale: avoids context-mismatch injection edge cases.
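Decision 5 can be illustrated outside the template layer. The standalone helper below mirrors what Jinja's `tojson` filter does for HTML `<script>` contexts; the variable names and sample value are illustrative, not taken from `hold_detail.html`:

```python
import json

def to_js_literal(value):
    """Serialize a server-side value for an inline <script> context.
    Mirrors Jinja's `tojson` behaviour: JSON-encode, then escape the
    characters that could close the surrounding script block early."""
    return (
        json.dumps(value)
        .replace("<", "\\u003c")
        .replace(">", "\\u003e")
        .replace("&", "\\u0026")
    )

# A hold reason that would break naive '{{ value }}' interpolation:
reason = '</script><script>alert(1)//'
snippet = f"const holdReason = {to_js_literal(reason)};"
print(snippet)
```

The rendered snippet contains no literal `</script>` sequence, so the attacker-controlled string cannot escape the script context.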
## Risks / Trade-offs
- **[Risk] CSRF rollout may break undocumented clients** → **Mitigation:** provide opt-in transition flag and explicit error messaging during rollout.
- **[Risk] Strict startup secret validation can block misconfigured environments** → **Mitigation:** provide clear startup diagnostics and `.env.example` updates.
- **[Risk] Additional shutdown hooks can prolong worker exit** → **Mitigation:** bounded timeouts and idempotent stop handlers.
- **[Risk] Dedicated health pool introduces extra DB connections** → **Mitigation:** fixed minimal size and short timeout.
@@ -0,0 +1,40 @@
## Why
The Vite migration is functionally complete, but production runtime still has high-risk gaps in security baseline and worker lifecycle cleanup. Addressing these now prevents avoidable outages, authentication bypass risk, and unstable degradation behavior under pool pressure.
## What Changes
- Enforce production-safe startup security defaults (no weak SECRET_KEY fallback in non-development environments).
- Add first-class CSRF protection for admin forms and state-changing JSON APIs.
- Harden degradation behavior for pool exhaustion with consistent retry/backoff contract and isolated health probing.
- Ensure background workers and shared clients (cache updater, realtime sync, Redis) are explicitly stopped on worker/app shutdown.
- Fix template-to-JavaScript variable serialization in hold-detail fallback script.
## Capabilities
### New Capabilities
- `security-baseline-hardening`: Define mandatory secret/session/CSRF/XSS-safe baseline for production runtime.
### Modified Capabilities
- `runtime-resilience-recovery`: Strengthen shutdown lifecycle and degraded-response behavior for pool pressure scenarios.
## Impact
- Affected code:
- `src/mes_dashboard/app.py`
- `src/mes_dashboard/core/database.py`
- `src/mes_dashboard/core/cache_updater.py`
- `src/mes_dashboard/core/redis_client.py`
- `src/mes_dashboard/routes/health_routes.py`
- `src/mes_dashboard/routes/auth_routes.py`
- `src/mes_dashboard/templates/hold_detail.html`
- `gunicorn.conf.py`
- `tests/`
- APIs:
- `/health`
- `/health/deep`
- `/admin/login`
- state-changing `/api/*` endpoints
- Operational behavior:
- Keep single-port deployment model unchanged.
- Improve degraded-state stability and startup safety gates.
@@ -0,0 +1,24 @@
## MODIFIED Requirements
### Requirement: Pool Exhaustion MUST Return Retry-Aware Degraded Responses
The system MUST return explicit degraded responses for connection pool exhaustion, including stable machine-readable retry metadata and HTTP retry hints.
#### Scenario: Pool exhausted under load
- **WHEN** concurrent requests exceed available database connections and pool wait timeout is reached
- **THEN** the API MUST return `DB_POOL_EXHAUSTED` with `retry_after_seconds` metadata and a `Retry-After` header instead of a generic 500 failure
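A minimal sketch of this retry-aware degraded contract. The `DB_POOL_EXHAUSTED` code, `retry_after_seconds` field, and `Retry-After` header come from the requirement above; the surrounding response shape and the 503 status are illustrative assumptions, not the project's exact schema:

```python
def pool_exhausted_response(retry_after_seconds=5):
    """Build the degraded response for pool-wait timeout instead of a 500."""
    body = {
        "error": "DB_POOL_EXHAUSTED",
        "retry_after_seconds": retry_after_seconds,  # machine-readable hint
    }
    headers = {"Retry-After": str(retry_after_seconds)}  # HTTP retry hint
    status = 503  # Service Unavailable, not a generic failure
    return body, status, headers

body, status, headers = pool_exhausted_response()
print(status, headers["Retry-After"], body["error"])
```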
## ADDED Requirements
### Requirement: Runtime Shutdown SHALL Cleanly Stop Background Services
Worker/app shutdown MUST stop long-lived background services and shared clients in deterministic order.
#### Scenario: Worker exits during recycle or graceful reload
- **WHEN** Gunicorn worker shutdown hooks are triggered
- **THEN** cache updater, realtime equipment sync worker, Redis client, and DB engine resources MUST be stopped/disposed without orphan threads
### Requirement: Health Probing SHALL Remain Available During Request-Pool Saturation
Health checks MUST avoid depending solely on the same request pool used by business APIs.
#### Scenario: Request pool saturation
- **WHEN** the main database request pool is exhausted
- **THEN** `/health` and `/health/deep` MUST still provide timely degraded status using isolated probe connectivity
@@ -0,0 +1,29 @@
## ADDED Requirements
### Requirement: Production Startup SHALL Reject Weak Session Secrets
The system MUST reject startup in non-development environments when `SECRET_KEY` is missing or configured with known insecure default values.
#### Scenario: Missing production secret key
- **WHEN** runtime starts with `FLASK_ENV` not equal to `development` and no secure secret key is configured
- **THEN** application startup MUST fail fast with an explicit configuration error
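The fail-fast guard can be sketched as below; the insecure-default list and the 32-character minimum are illustrative policy choices, not the project's exact thresholds:

```python
import os

# Illustrative deny-list of known insecure defaults.
_INSECURE_DEFAULTS = {"", "dev", "changeme", "secret", "dev-secret-key"}

def validate_secret_key(env=None):
    """Fail fast when a non-development runtime lacks a secure SECRET_KEY."""
    env = os.environ if env is None else env
    if env.get("FLASK_ENV", "production") == "development":
        return  # development mode may use a generated key
    key = env.get("SECRET_KEY", "")
    if key in _INSECURE_DEFAULTS or len(key) < 32:
        raise SystemExit(
            "FATAL: SECRET_KEY missing or insecure; refusing to start. "
            "Set a random value of at least 32 characters."
        )
```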
### Requirement: State-Changing Endpoints SHALL Enforce CSRF Validation
All state-changing endpoints that rely on cookie-based authentication MUST enforce CSRF token validation.
#### Scenario: Missing or invalid CSRF token
- **WHEN** a POST/PUT/PATCH/DELETE request is sent without a valid CSRF token
- **THEN** the server MUST reject the request with a client error and MUST NOT execute the mutation
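The dual form-field / `X-CSRF-Token` header contract from the design can be sketched framework-free; the function names and session shape are illustrative:

```python
import hmac
import secrets

MUTATING_METHODS = {"POST", "PUT", "PATCH", "DELETE"}

def issue_csrf_token(session):
    """Store a per-session CSRF token, creating it on first use."""
    if "csrf_token" not in session:
        session["csrf_token"] = secrets.token_urlsafe(32)
    return session["csrf_token"]

def csrf_ok(session, method, form=None, headers=None):
    """Accept either the hidden form field or the X-CSRF-Token header;
    reject every mutating request without a matching token."""
    if method not in MUTATING_METHODS:
        return True
    expected = session.get("csrf_token")
    supplied = (form or {}).get("csrf_token") or (headers or {}).get("X-CSRF-Token")
    # Constant-time comparison avoids token-guessing via timing.
    return bool(expected and supplied and hmac.compare_digest(expected, supplied))
```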
### Requirement: Server-Rendered Values in JavaScript Context MUST Use Safe Serialization
Values inserted into inline JavaScript from templates MUST be serialized for JavaScript context safety.
#### Scenario: Hold reason rendered in fallback inline script
- **WHEN** server-side string values are embedded into script state payloads
- **THEN** template rendering MUST use JSON-safe serialization semantics to prevent script-context injection
### Requirement: Session Establishment SHALL Mitigate Fixation Risk
Successful admin login MUST rotate session identity material before granting authenticated privileges.
#### Scenario: Admin login success
- **WHEN** credentials are validated and admin session is created
- **THEN** session identity MUST be regenerated before storing authenticated user attributes
@@ -0,0 +1,18 @@
## 1. Runtime Stability Hardening
- [x] 1.1 Add startup validation for `SECRET_KEY` and environment-aware secure defaults.
- [x] 1.2 Register centralized shutdown hooks to stop cache updater, realtime sync worker, Redis client, and DB engine.
- [x] 1.3 Isolate database health probing from request pool and keep degraded signal contract stable.
- [x] 1.4 Normalize pool-exhausted response metadata and retry headers across API error paths.
## 2. Security Baseline Enforcement
- [x] 2.1 Add CSRF token issuance/validation for form posts and JSON mutation endpoints.
- [x] 2.2 Update login flow to rotate session identity on successful authentication.
- [x] 2.3 Replace JS-context template interpolation in `hold_detail.html` with JSON-safe serialization.
## 3. Verification and Documentation
- [x] 3.1 Add tests for startup secret guard, CSRF rejection, and session-rotation behavior.
- [x] 3.2 Add lifecycle tests/validation for shutdown cleanup and health endpoint behavior under pool saturation.
- [x] 3.3 Update README/README.mdj runtime hardening sections and operator rollout notes.
@@ -0,0 +1,2 @@
schema: spec-driven
created: 2026-02-08
@@ -0,0 +1,46 @@
## Context
The migration delivered feature parity, but efficiency work remains: backend query paths still do broad copies and whole-frame recomputation even when only slices are needed. At the same time, business constraints explicitly require full-table caching for `resource` and `wip` because those datasets are intentionally small and frequently reused. This design optimizes around that constraint rather than removing it.
## Goals / Non-Goals
**Goals:**
- Keep `resource` and `wip` full-table caches intact.
- Reduce memory amplification from redundant cache representations.
- Replace repeated full merge/rebuild paths with incremental/indexed query plans where applicable.
- Increase reuse of browser-side compute modules for chart/table/filter/KPI derivations.
- Add measurable telemetry to verify latency and memory improvements.
**Non-Goals:**
- Rewriting all reporting endpoints to client-only mode.
- Removing Redis or existing layered cache strategy.
- Changing user-visible filter semantics or report outputs.
## Decisions
1. **Constrained cache strategy**
- Decision: retain full-table snapshots for `resource` and `wip`; optimize surrounding representations and derivation paths.
- Rationale: business-approved data-size profile and low complexity for frequent lookups.
2. **Incremental + indexed path for heavy derived datasets**
- Decision: add watermark/version-aware incremental refresh and per-column indexes for high-cardinality filters.
- Rationale: avoids repeated full recompute and lowers request tail latency.
3. **Canonical in-process structure**
- Decision: keep one canonical structure per cache domain and derive alternate views on demand.
- Rationale: reduces 2x/3x memory amplification from parallel representations.
4. **Frontend compute module expansion**
- Decision: extract reusable browser compute helpers for matrix/table/KPI transformations used across report pages.
- Rationale: shifts deterministic shaping work off backend and improves component reuse in Vite architecture.
5. **Benchmark-driven acceptance**
- Decision: add repeatable benchmark fixtures and telemetry thresholds as merge gates.
- Rationale: prevent subjective "performance improved" claims without measurable proof.
## Risks / Trade-offs
- **[Risk] Incremental sync correctness drift** → **Mitigation:** version checksum validation and periodic full reconciliation jobs.
- **[Risk] Browser compute can increase client CPU on low-end devices** → **Mitigation:** bounded dataset chunking and fallback server aggregation path.
- **[Risk] Refactor introduces subtle field-contract regressions** → **Mitigation:** keep export/header contract tests and fixture comparisons.
- **[Risk] Telemetry overhead** → **Mitigation:** low-cost counters/histograms with sampling where needed.
@@ -0,0 +1,36 @@
## Why
Current reporting workloads still spend unnecessary CPU and memory on repeated full-data merges, broad DataFrame copies, and duplicated cache representations. We need a focused efficiency phase that preserves the intentional full-table cache strategy for `resource` and `wip`, while reducing cost for other query paths and increasing frontend compute reuse.
## What Changes
- Introduce indexed/incremental cache synchronization for heavy report datasets that do not require full-table snapshots.
- Keep `resource` and `wip` as full-table cache by design, but reduce redundant in-process representations and copy overhead.
- Move additional derived calculations (chart/table/KPI/filter shaping) to reusable browser modules in Vite frontend.
- Add cache/query efficiency telemetry and repeatable benchmark gates to validate gains.
## Capabilities
### New Capabilities
- `cache-indexed-query-acceleration`: Define incremental refresh and indexed query contracts for non-full-snapshot datasets.
### Modified Capabilities
- `cache-observability-hardening`: Add memory-efficiency and cache-structure telemetry expectations.
- `frontend-compute-shift`: Expand browser-side reusable compute coverage for report interactions.
## Impact
- Affected code:
- `src/mes_dashboard/core/cache.py`
- `src/mes_dashboard/services/resource_cache.py`
- `src/mes_dashboard/services/realtime_equipment_cache.py`
- `src/mes_dashboard/services/wip_service.py`
- `src/mes_dashboard/routes/health_routes.py`
- `frontend/src/core/`
- `frontend/src/**/main.js`
- `tests/`
- APIs:
- read-heavy `/api/wip/*` and `/api/resource/*` endpoints (response contract unchanged)
- Operational behavior:
- Preserve current `resource` and `wip` full-table caching strategy.
- Reduce server-side compute load through selective frontend compute offload.
@@ -0,0 +1,22 @@
## ADDED Requirements
### Requirement: Incremental Synchronization SHALL Use Versioned Watermarks
For heavy non-full-snapshot datasets, cache refresh SHALL support incremental synchronization keyed by stable version or watermark boundaries.
#### Scenario: Incremental refresh cycle
- **WHEN** source data version indicates partial changes since last sync
- **THEN** cache update logic MUST fetch and merge only changed partitions while preserving correctness guarantees
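A minimal sketch of the watermark-keyed incremental refresh; the cache shape and `fetch_changed` callback are illustrative assumptions, not the project's actual interfaces:

```python
def incremental_merge(cache, fetch_changed, source_version):
    """Watermark-based incremental refresh.

    cache: {"version": int, "rows": {key: row}}
    fetch_changed(since_version) -> iterable of (key, row) changed rows.
    """
    if source_version == cache["version"]:
        return cache                       # watermark unchanged: no work
    for key, row in fetch_changed(cache["version"]):
        cache["rows"][key] = row           # merge only changed partitions
    cache["version"] = source_version      # advance the watermark last
    return cache
```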
### Requirement: Query Paths SHALL Use Indexed Access for High-Frequency Filters
Query execution over cached data SHALL use prebuilt indexes for known high-frequency filter columns.
#### Scenario: Filtered report query
- **WHEN** request filters target indexed fields
- **THEN** result selection MUST avoid full dataset scans and maintain existing response contract
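The prebuilt-index idea can be sketched as a value-to-position map per high-frequency column; column and row names are hypothetical:

```python
from collections import defaultdict

def build_column_index(rows, column):
    """Prebuild a value -> row-position index for one filter column."""
    index = defaultdict(list)
    for pos, row in enumerate(rows):
        index[row.get(column)].append(pos)
    return index

def select_by_filter(rows, index, value):
    """Indexed selection: O(matches) lookups instead of a full scan,
    returning the same row objects the scan-based path would."""
    return [rows[pos] for pos in index.get(value, [])]
```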
### Requirement: Business-Mandated Full-Table Caches SHALL Be Preserved for Resource and WIP
The system SHALL continue to maintain full-table cache behavior for `resource` and `wip` domains.
#### Scenario: Resource or WIP cache refresh
- **WHEN** cache update runs for `resource` or `wip`
- **THEN** the updater MUST retain full-table snapshot semantics and MUST NOT switch these domains to partial-only cache mode
@@ -0,0 +1,15 @@
## ADDED Requirements
### Requirement: Cache Telemetry SHALL Include Memory Amplification Signals
Operational telemetry MUST expose cache domain memory usage indicators and representation amplification factors.
#### Scenario: Deep health telemetry request
- **WHEN** operators inspect cache telemetry
- **THEN** telemetry MUST include per-domain memory footprint and amplification indicators sufficient to detect redundant structures
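One way to express the amplification indicator is the ratio of total cache memory to the canonical representation alone; this shallow `sys.getsizeof` sketch is an approximation for illustration, not the project's telemetry implementation:

```python
import sys

def amplification_factor(canonical, derived_views):
    """Ratio of total cache memory to the canonical structure alone.
    ~1.0 means little redundancy; 2x-3x signals duplicated parallel
    representations worth collapsing. Shallow sizes only (sketch)."""
    base = sys.getsizeof(canonical)
    total = base + sum(sys.getsizeof(v) for v in derived_views)
    return round(total / base, 2)
```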
### Requirement: Efficiency Benchmarks SHALL Gate Cache Refactor Rollout
Cache/query efficiency changes MUST be validated against baseline latency and memory benchmarks before rollout.
#### Scenario: Pre-release validation
- **WHEN** cache refactor changes are prepared for deployment
- **THEN** benchmark results MUST demonstrate no regression beyond configured thresholds for P95 latency and memory usage
@@ -0,0 +1,15 @@
## ADDED Requirements
### Requirement: Reusable Browser Compute Modules SHALL Power Report Derivations
Derived computations for report filters, KPI cards, chart series, and table projections SHALL be implemented through reusable frontend modules.
#### Scenario: Shared report derivation logic
- **WHEN** multiple report pages require equivalent data-shaping behavior
- **THEN** pages MUST consume shared compute modules instead of duplicating transformation logic per page
### Requirement: Browser Compute Shift SHALL Preserve Export and Field Contracts
Moving computations to frontend MUST preserve existing field naming and export column contracts.
#### Scenario: User exports report after frontend-side derivation
- **WHEN** transformed data is rendered and exported
- **THEN** exported field names and ordering MUST remain consistent with governed field contract definitions
@@ -0,0 +1,23 @@
## 1. Cache Structure and Sync Refactor
- [x] 1.1 Define canonical per-domain cache representation and remove redundant parallel structures.
- [x] 1.2 Implement version/watermark-based incremental sync path for eligible non-full-snapshot datasets.
- [x] 1.3 Keep `resource` and `wip` full-table cache behavior while optimizing surrounding parse/index pipelines.
## 2. Indexed Query Acceleration
- [x] 2.1 Add index builders for high-frequency filter columns used by report endpoints.
- [x] 2.2 Refactor read paths to use indexed selection and reduce broad DataFrame copy operations.
- [x] 2.3 Add fallback and reconciliation logic to guarantee correctness under incremental/index drift.
## 3. Frontend Compute Reuse Expansion
- [x] 3.1 Extract shared Vite compute modules for KPI/filter/chart/table derivations.
- [x] 3.2 Refactor report pages to consume shared modules without changing user-visible behavior.
- [x] 3.3 Validate export/header field contract consistency after compute shift.
## 4. Performance Validation and Docs
- [x] 4.1 Add benchmark fixtures for baseline vs refactor latency/memory comparison.
- [x] 4.2 Surface cache memory amplification and index efficiency telemetry in health/admin outputs.
- [x] 4.3 Update README/README.mdj with cache strategy constraints and performance governance rules.
@@ -0,0 +1,2 @@
schema: spec-driven
created: 2026-02-08
@@ -0,0 +1,45 @@
## Context
The project already provides watchdog-assisted restart and resilience diagnostics, but policy boundaries for automated recovery are not yet formalized end-to-end. In practice, this can lead to either under-recovery (manual delays) or over-recovery (restart churn). We also need stronger conda/systemd path consistency checks to prevent runtime drift across deploy scripts and services.
## Goals / Non-Goals
**Goals:**
- Make conda/systemd runtime path contracts explicit, validated, and drift-detectable.
- Implement safe self-healing policy with cooldown and churn limits.
- Expose clear alert signals and recommended actions in health/admin payloads.
- Keep operator manual override available for incident control.
**Non-Goals:**
- Migrating from systemd to another orchestrator.
- Changing database vendor or introducing full autoscaling infrastructure.
- Removing existing admin restart endpoints.
## Decisions
1. **Single source runtime contract**
- Decision: centralize conda runtime path configuration consumed by systemd units, watchdog, and scripts.
- Rationale: prevents mismatched interpreter/path drift.
2. **Guarded self-healing state machine**
- Decision: implement bounded restart policy (cooldown + max retries per time window + circuit-open gating).
- Rationale: recovers quickly while preventing restart storms.
3. **Explicit recovery observability contract**
- Decision: enrich health/admin payloads with churn counters, cooldown state, and recommended operator action.
- Rationale: enables deterministic triage and alert automation.
4. **Auditability requirement**
- Decision: emit structured logs/events for auto-restart decision, manual override, and blocked restart attempts.
- Rationale: supports incident retrospectives and policy tuning.
5. **Runbook-first rollout**
- Decision: deploy policy changes behind documentation and validation gates, including rollback steps.
- Rationale: operational safety for production adoption.
## Risks / Trade-offs
- **[Risk] Overly strict policy delays recovery** → **Mitigation:** configurable thresholds and emergency manual override.
- **[Risk] Aggressive policy causes churn loops** → **Mitigation:** hard stop on churn threshold breach and explicit cool-off windows.
- **[Risk] Added operational complexity** → **Mitigation:** concise runbook with decision tables and tested scripts.
- **[Risk] Drift detection false positives** → **Mitigation:** normalize path resolution and clearly defined comparison sources.
@@ -0,0 +1,40 @@
## Why
Operations stability still depends heavily on manual intervention when workers degrade or pools saturate. We need a formal operations phase to align conda/systemd runtime contracts and add controlled self-healing with guardrails, so recovery is faster without triggering restart storms.
## What Changes
- Standardize conda-based runtime paths across app service, watchdog, and operational scripts from a single source of truth.
- Introduce guarded worker self-healing policy (cooldown, churn windows, bounded retries, manual override).
- Add alert thresholds and machine-readable recovery signals for pool pressure, circuit-open persistence, and restart churn.
- Harden runbook documentation and scripts for deterministic restart, rollback, and incident triage.
## Capabilities
### New Capabilities
- `worker-self-healing-governance`: Define safe autonomous recovery behavior with anti-storm guardrails.
### Modified Capabilities
- `conda-systemd-runtime-alignment`: Extend runtime consistency requirements with startup validation and drift detection.
- `runtime-resilience-recovery`: Add auditable recovery-action requirements for automated and operator-triggered restart flows.
## Impact
- Affected code:
- `deploy/systemd/*.service`
- `scripts/worker_watchdog.py`
- `src/mes_dashboard/routes/admin_routes.py`
- `src/mes_dashboard/routes/health_routes.py`
- `src/mes_dashboard/core/database.py`
- `src/mes_dashboard/core/circuit_breaker.py`
- `tests/`
- `README.md`, `README.mdj`, runbook docs
- APIs:
- `/health`
- `/health/deep`
- `/admin/api/system-status`
- `/admin/api/worker/status`
- `/admin/api/worker/restart`
- Operational behavior:
- Preserve single-port bind model.
- Add controlled self-healing policy and clearer alert thresholds.
@@ -0,0 +1,15 @@
## ADDED Requirements
### Requirement: Runtime Path Drift SHALL Be Detectable Before Service Start
Service startup checks MUST validate configured conda runtime paths across app, watchdog, and worker control scripts.
#### Scenario: Conda path mismatch detected
- **WHEN** startup validation finds runtime path inconsistency between configured units and scripts
- **THEN** service start MUST fail with actionable diagnostics instead of running with partial mismatch
### Requirement: Conda/Systemd Contract SHALL Be Versioned in Operations Docs
The documented runtime contract MUST include versioned path assumptions and verification commands.
#### Scenario: Operator verifies deployment contract
- **WHEN** operator follows runbook validation steps
- **THEN** commands MUST confirm active runtime paths match documented conda/systemd contract
@@ -0,0 +1,15 @@
## ADDED Requirements
### Requirement: Recovery Recommendations SHALL Reflect Self-Healing Policy State
Health and admin resilience payloads MUST expose whether automated recovery is allowed, cooling down, or blocked by churn policy.
#### Scenario: Operator inspects degraded state
- **WHEN** `/health` or `/admin/api/worker/status` is requested during degradation
- **THEN** response MUST include policy state, cooldown remaining time, and next recommended action
### Requirement: Manual Recovery Override SHALL Be Explicit and Controlled
Manual restart actions MUST bypass automatic block only through authenticated operator pathways with explicit acknowledgement.
#### Scenario: Churn-blocked state with manual override request
- **WHEN** authorized admin requests manual restart while auto-recovery is blocked
- **THEN** system MUST execute controlled restart path and log the override context for auditability
@@ -0,0 +1,22 @@
## ADDED Requirements
### Requirement: Automated Worker Recovery SHALL Use Bounded Policy Guards
Automated worker restart behavior MUST enforce cooldown periods and bounded restart attempts within a configurable time window.
#### Scenario: Repeated worker degradation within short window
- **WHEN** degradation events exceed configured restart-attempt budget
- **THEN** automated restarts MUST pause and surface a blocked-recovery signal for operator intervention
### Requirement: Restart-Churn Protection SHALL Prevent Recovery Storms
The runtime MUST classify restart churn and prevent uncontrolled restart loops.
#### Scenario: Churn threshold exceeded
- **WHEN** restart count crosses churn threshold in active window
- **THEN** watchdog MUST enter guarded mode and require explicit manual override before further restart attempts
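The two guard requirements above (bounded retry budget and churn-triggered guarded mode) can be sketched as one small policy object; the thresholds and method names are illustrative defaults, not the project's tuned values:

```python
import time

class RestartPolicy:
    """Bounded auto-restart guard: cooldown between attempts, a retry
    budget per sliding window, and guarded mode once churn is breached."""

    def __init__(self, cooldown=60, window=600, max_restarts=3, clock=time.monotonic):
        self.cooldown = cooldown
        self.window = window
        self.max_restarts = max_restarts
        self.clock = clock
        self.history = []      # timestamps of executed restarts
        self.guarded = False   # churn threshold breached: block auto-restart

    def allow_restart(self):
        now = self.clock()
        # Drop restart events that fell out of the churn window.
        self.history = [t for t in self.history if now - t < self.window]
        if self.guarded:
            return False
        if self.history and now - self.history[-1] < self.cooldown:
            return False       # still cooling down
        if len(self.history) >= self.max_restarts:
            self.guarded = True  # require explicit manual override
            return False
        self.history.append(now)
        return True

    def manual_override(self):
        """Authenticated operator acknowledgement clears guarded mode."""
        self.guarded = False
        self.history.clear()
```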
### Requirement: Recovery Decisions SHALL Be Audit-Ready
Every auto-recovery decision and manual override action MUST be recorded with structured metadata.
#### Scenario: Worker restart decision emitted
- **WHEN** system executes or denies a restart action
- **THEN** structured logs/events MUST include reason, thresholds, actor/source, and resulting state
@@ -0,0 +1,23 @@
## 1. Conda/Systemd Contract Alignment
- [x] 1.1 Centralize runtime path configuration consumed by service units, watchdog, and scripts.
- [x] 1.2 Add startup validation that fails fast on conda path drift.
- [x] 1.3 Update systemd/watchdog integration tests for consistent runtime contract.
## 2. Worker Self-Healing Policy
- [x] 2.1 Implement bounded auto-restart policy (cooldown, retry budget, churn window).
- [x] 2.2 Add guarded mode behavior when churn threshold is exceeded.
- [x] 2.3 Implement authenticated manual override flow with explicit logging context.
## 3. Alerting and Operational Signals
- [x] 3.1 Expose policy-state fields in health/admin payloads (`allowed`, `cooldown`, `blocked`).
- [x] 3.2 Add structured audit events for restart decisions and override actions.
- [x] 3.3 Define alert thresholds and wire monitoring-friendly fields for pool/circuit/churn conditions.
## 4. Validation and Runbook Delivery
- [x] 4.1 Add tests for policy transitions, guarded mode, and override behavior.
- [x] 4.2 Validate single-port continuity during controlled recovery and hot reload paths.
- [x] 4.3 Update README/README.mdj and deployment runbook with verified operational procedures.
@@ -0,0 +1,2 @@
schema: spec-driven
created: 2026-02-08
@@ -0,0 +1,50 @@
## Context
The system has completed the Vite single-port architecture and the major P0/P1/P2 hardening rounds, but residual risk is concentrated in cache slow-path lock contention, health-check hot-spot queries, and API boundary governance. Most of these issues only surface under medium-to-high traffic; if they are not converged in this phase, later troubleshooting cost will be high.
## Goals / Non-Goals
**Goals:**
- Complete the remaining stability and security fixes without changing page interaction semantics or the single-port architecture.
- Make cache/health paths more predictable under high concurrency and reduce security risk in logs.
- Use test coverage to ensure the fixes introduce no functional regressions.
**Non-Goals:**
- Do not rewrite the main query flows or remove the `resource/wip` full-table cache strategy.
- Do not introduce heavyweight distributed rate-limit infrastructure.
- Do not change frontend drill-down or report feature semantics.
## Decisions
1. **Cache publish consistency over local optimization**
   - Publish data and metadata via staging keys with an atomic rename/pipeline, so a failed publish never affects readability of the previous snapshot.
2. **Parse outside the lock; the lock only covers consistency check and commit**
   - The WIP process-cache slow path parses outside the lock, then double-checks and commits inside it, reducing lock hold time.
3. **Consistent process-cache policy**
   - The realtime equipment cache gains max_size + LRU, matching the existing WIP/Resource caches.
4. **Internal health short cache enabled only outside test environments**
   - TTL = 5 seconds reduces repeated DB/Redis pressure from high-frequency probes; testing mode keeps realtime computation to avoid cross-test pollution.
5. **Lightweight in-memory rate limiting for high-cost APIs**
   - Throttle by IP + route window with tunable parameters, without introducing new external dependencies.
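The invariant behind Decision 1 — a failed publish must never affect the readable snapshot — can be sketched in-process. Production uses Redis staging keys with rename/pipeline; this dict-based analogue only demonstrates the commit discipline, with illustrative names:

```python
import threading

class SnapshotStore:
    """Staging-style publish: the new snapshot is fully built first, then
    swapped into the live slot together with its metadata in one atomic
    step, so a failed build leaves the old snapshot readable."""

    def __init__(self):
        self._lock = threading.Lock()
        self._live = None
        self._version = 0

    def publish(self, build_snapshot, version):
        snapshot = build_snapshot()   # may raise; live data is untouched
        with self._lock:              # atomic commit of data + version
            self._live = snapshot
            self._version = version

    def read(self):
        with self._lock:
            return self._live, self._version
```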
## Risks / Trade-offs
- **[Risk] The cache-publish rework adds key-switch complexity** → **Mitigation:** add tests covering both publish failure and success.
- **[Risk] The health cache briefly delays observability** → **Mitigation:** cap TTL at 5 seconds and disable it in testing.
- **[Risk] In-memory rate limiting is not globally consistent across workers** → **Mitigation:** treat it as a protective valve first; upgrade to a Redis-based limiter later.
## Migration Plan
1. Land the core cache and health fixes first (no API contract impact).
2. Then introduce the API boundary/rate-limit changes and shared utility extraction.
3. Add unit and integration tests and run a benchmark smoke test.
4. Update the README documentation and environment variable descriptions.
## Open Questions
- Should default rate-limit thresholds for high-cost APIs be split per endpoint (WIP vs Resource)?
- Should we later upgrade to Redis-based distributed rate limiting for global consistency across multiple workers?
@@ -0,0 +1,44 @@
## Why
上一輪已完成高風險核心修復,但仍有一批殘餘問題會在高併發、長時間運行與惡意/異常輸入下放大風險(快取發布一致性、鎖競爭、健康檢查負載、輸入邊界與速率治理)。本輪目標是把這些尾端風險收斂到可接受範圍,避免後續運維與效能不穩。
## What Changes
- Harden the WIP cache publish flow so a failed update cannot pollute the existing read path.
- Narrow the lock scope of the process-cache slow path so large JSON is never parsed while the lock is held.
- Add a bounded LRU to the realtime equipment process cache, consistent with the WIP/Resource policy.
- Add depth protection to the NaN cleaning in the resource routes (guarding against deep-recursion risk).
- Extract a shared boolean parameter parser to remove duplicated logic.
- Make the filter-cache view names configurable, removing the hardcoded coupling.
- Add log redaction for sensitive connection strings.
- Add a 5-second internal short cache to `/health` and `/health/deep` (disabled in testing mode).
- Add lightweight rate limiting with tunable parameters to the high-cost query APIs.
- Update README/README.mdj and the validation tests.
## Capabilities
### New Capabilities
- `api-safety-hygiene`: API input boundaries, shared parameter parsing, configurable query sources, and rate governance for high-cost endpoints.
### Modified Capabilities
- `cache-observability-hardening`: strengthens cache publish consistency, process-cache lock scope, and bounded-policy alignment.
- `runtime-resilience-recovery`: operational-safety requirements for the short health-check cache and sensitive-information log redaction.
## Impact
- Affected code:
- `src/mes_dashboard/core/cache_updater.py`
- `src/mes_dashboard/core/cache.py`
- `src/mes_dashboard/services/realtime_equipment_cache.py`
- `src/mes_dashboard/routes/resource_routes.py`
- `src/mes_dashboard/routes/wip_routes.py`
- `src/mes_dashboard/routes/hold_routes.py`
- `src/mes_dashboard/services/filter_cache.py`
- `src/mes_dashboard/core/database.py`
- `src/mes_dashboard/routes/health_routes.py`
- APIs:
- `/health`, `/health/deep`
- `/api/wip/detail/<workcenter>`, `/api/wip/overview/*`
- `/api/resource/*` (high-cost routes)
- Docs/tests:
- `README.md`, `README.mdj`, `tests/*`


@@ -0,0 +1,29 @@
## ADDED Requirements
### Requirement: Recursive Payload Cleaning MUST Enforce Depth Safety
Routes that normalize nested payloads MUST prevent unbounded recursion depth.
#### Scenario: Deeply nested response object
- **WHEN** NaN-cleaning helper receives deeply nested list/dict payload
- **THEN** cleaning logic MUST enforce max depth or iterative traversal and return safely without recursion failure
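The requirement above can be satisfied with a depth-capped cleaner. This is a hedged sketch, not the project's actual helper: the name `clean_nan` and the default cap are assumptions; the point is that traversal refuses to descend past `max_depth` instead of recursing unboundedly.

```python
import math

def clean_nan(payload, max_depth=100):
    """Replace float NaN values with None, refusing to descend past max_depth."""
    def scrub(value, depth):
        if depth > max_depth:
            return None  # stop descending instead of risking a RecursionError
        if isinstance(value, float) and math.isnan(value):
            return None
        if isinstance(value, dict):
            return {k: scrub(v, depth + 1) for k, v in value.items()}
        if isinstance(value, list):
            return [scrub(v, depth + 1) for v in value]
        return value
    return scrub(payload, 0)
```

With the cap in place, a hostile or accidental 500-level nesting returns safely (truncated beyond the cap) rather than crashing the route.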
### Requirement: Filter Source Names MUST Be Configurable
Filter cache query sources MUST NOT rely on hardcoded view names only.
#### Scenario: Environment-specific view names
- **WHEN** deployment sets custom filter-source environment variables
- **THEN** filter cache loader MUST resolve and query configured view names
### Requirement: High-Cost APIs SHALL Apply Basic Rate Guardrails
High-cost read endpoints SHALL apply configurable request-rate guardrails to reduce abuse and accidental bursts.
#### Scenario: Burst traffic from same client
- **WHEN** a client exceeds configured request budget for guarded endpoints
- **THEN** endpoint SHALL return throttled response with clear retry guidance
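A minimal shape for the guardrail above is a fixed-window counter keyed by `(client, route)`. This is an illustrative per-worker sketch (class and parameter names are assumptions); when `allow` returns `False`, the caller would respond 429 with `Retry-After` guidance, as the scenario requires.

```python
import time

class WindowRateLimiter:
    """Fixed-window, in-memory limiter keyed by (client, route)."""

    def __init__(self, limit, window_seconds, clock=time.monotonic):
        self.limit = limit
        self.window = window_seconds
        self.clock = clock
        self._buckets = {}  # (client, route) -> [window_start, count]

    def allow(self, client, route):
        now = self.clock()
        key = (client, route)
        start, count = self._buckets.get(key, (now, 0))
        if now - start >= self.window:
            start, count = now, 0  # window expired: reset the budget
        if count >= self.limit:
            self._buckets[key] = [start, count]
            return False  # caller responds throttled (e.g. 429 + Retry-After)
        self._buckets[key] = [start, count + 1]
        return True
```

The injectable `clock` keeps the limiter deterministic in tests; a Redis-backed limiter could replace it later without changing call sites.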
### Requirement: Common Boolean Query Parsing SHALL Be Shared
Boolean query parsing in routes SHALL use shared helper behavior.
#### Scenario: Different routes parse include flags
- **WHEN** routes parse common boolean query parameters
- **THEN** parsing behavior MUST be consistent across routes via shared utility
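The shared-parser requirement can be sketched as one helper that every route imports. The truthy/falsey token sets below are assumed reasonable defaults, not the project's exact lists.

```python
# Assumed token sets; the real shared utility may accept a different vocabulary.
_TRUE = {"1", "true", "yes", "on"}
_FALSE = {"0", "false", "no", "off"}

def parse_bool_param(raw, default=False):
    """Parse a query-string flag consistently across routes."""
    if raw is None:
        return default
    token = raw.strip().lower()
    if token in _TRUE:
        return True
    if token in _FALSE:
        return False
    return default  # unknown tokens fall back rather than raising
```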


@@ -0,0 +1,26 @@
## ADDED Requirements
### Requirement: Cache Publish MUST Preserve Previous Readable Snapshot on Failure
When refreshing full-table cache payloads, the system MUST avoid exposing partially published states to readers.
#### Scenario: Publish fails after payload serialization
- **WHEN** a cache refresh has prepared new payload but publish operation fails
- **THEN** previously published cache keys MUST remain readable and metadata MUST remain consistent with old snapshot
#### Scenario: Publish succeeds
- **WHEN** publish operation completes successfully
- **THEN** data payload and metadata keys MUST be visible as one coherent new snapshot
### Requirement: Process-Level Cache Slow Path SHALL Minimize Lock Hold Time
Large payload parsing MUST NOT happen inside long-held process cache locks.
#### Scenario: Cache miss under concurrent requests
- **WHEN** multiple requests hit process cache miss
- **THEN** parsing work SHALL happen outside lock-protected mutation section, and lock scope SHALL be limited to consistency check + commit
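The slow-path rule above is essentially double-checked locking: heavy `json.loads` runs with the lock released, and the lock only covers the version check and commit. The structure below is an illustrative sketch, not the project's actual cache class.

```python
import json
import threading

class ParsedCache:
    """Process cache whose slow path parses outside the lock."""

    def __init__(self):
        self._lock = threading.Lock()
        self._value = None
        self._version = None

    def get(self, raw_payload, version):
        with self._lock:
            if self._version == version:
                return self._value          # fast path: already parsed
        parsed = json.loads(raw_payload)    # heavy work, no lock held
        with self._lock:
            if self._version != version:    # double-check: another thread may have won
                self._value, self._version = parsed, version
            return self._value
```

Concurrent misses each parse their own copy, but only one result is committed per version, and no request ever blocks behind another request's parse.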
### Requirement: Process-Level Cache Policies MUST Stay Consistent Across Services
All service-local process caches MUST support bounded capacity with deterministic eviction.
#### Scenario: Realtime equipment cache growth
- **WHEN** realtime equipment process cache reaches configured capacity
- **THEN** entries MUST be evicted according to deterministic LRU behavior
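The deterministic-LRU behavior above is what `collections.OrderedDict` provides almost directly. This is a minimal sketch; `max_size` and the method names are illustrative rather than the project's actual `ProcessLevelCache` API.

```python
from collections import OrderedDict

class BoundedLRUCache:
    """TTL-free sketch of bounded capacity with deterministic LRU eviction."""

    def __init__(self, max_size=128):
        self.max_size = max_size
        self._data = OrderedDict()

    def get(self, key, default=None):
        if key not in self._data:
            return default
        self._data.move_to_end(key)  # reads refresh recency, so hot keys survive
        return self._data[key]

    def set(self, key, value):
        if key in self._data:
            self._data.move_to_end(key)
        self._data[key] = value
        while len(self._data) > self.max_size:
            self._data.popitem(last=False)  # evict the least recently used entry
```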


@@ -0,0 +1,19 @@
## ADDED Requirements
### Requirement: Health Endpoints SHALL Use Short Internal Memoization
Health and deep-health computation SHALL use a short-lived internal cache to prevent probe storms from amplifying backend load.
#### Scenario: Frequent monitor scrapes
- **WHEN** health endpoints are called repeatedly within a small window
- **THEN** service SHALL return memoized payload for up to 5 seconds in non-testing environments
#### Scenario: Testing mode
- **WHEN** app is running in testing mode
- **THEN** health endpoint memoization MUST be bypassed to preserve deterministic tests
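The memoization requirement can be sketched as a small wrapper with an injectable clock and the testing-mode bypass the second scenario demands. Names and the decorator-free shape are assumptions for illustration.

```python
import time

def memoize_health(compute, ttl=5.0, testing=False, clock=time.monotonic):
    """Wrap a health computation in a short-lived single-slot cache."""
    state = {"at": None, "value": None}

    def cached():
        if testing:
            return compute()  # deterministic tests: never serve stale payloads
        now = clock()
        if state["at"] is None or now - state["at"] >= ttl:
            state["value"] = compute()
            state["at"] = now
        return state["value"]

    return cached
```

Probe storms within the TTL hit the memo; once the window passes, the next call recomputes against the real backends.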
### Requirement: Logs MUST Redact Connection Secrets
Runtime logs MUST avoid exposing DB connection credentials.
#### Scenario: Connection string appears in log message
- **WHEN** a log message contains DB URL credentials
- **THEN** logger output MUST redact password and sensitive userinfo before emission
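A sketch of the redaction rule: mask the userinfo section of a connection URL before emission. The regex targets `scheme://user:pass@host` shapes and is an assumption about the log format, not the project's exact logging filter; in practice this would run inside a `logging.Filter`.

```python
import re

# Matches the `user:password@` portion of URLs like oracle+oracledb://u:p@host.
_USERINFO = re.compile(r"(\w+(?:\+\w+)?://)[^/@\s]+:[^/@\s]+@")

def redact_db_url(message: str) -> str:
    """Replace credentials in connection strings with a placeholder."""
    return _USERINFO.sub(r"\1***:***@", message)
```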


@@ -0,0 +1,22 @@
## 1. Cache Consistency and Contention Hardening
- [x] 1.1 Harden WIP cache publish in `cache_updater.py` to preserve old snapshot on publish failure.
- [x] 1.2 Refactor WIP process-cache slow path in `core/cache.py` so heavy parse runs outside lock.
- [x] 1.3 Extend realtime equipment process cache with bounded `max_size` + deterministic LRU and add regression tests.
## 2. API Safety and Config Hygiene
- [x] 2.1 Add depth-safe NaN cleaning in `resource_routes.py` and tests for deep payloads.
- [x] 2.2 Add shared boolean query parser in `core/utils.py` and switch `wip_routes.py` / `hold_routes.py` to it.
- [x] 2.3 Make filter-cache source views configurable (env-based) in `filter_cache.py` and add config tests.
## 3. Runtime Guardrails
- [x] 3.1 Add DB connection-string redaction logging filter in `core/database.py` (or logging bootstrap) with tests.
- [x] 3.2 Add 5-second internal memoization for `/health` and `/health/deep` (disabled in testing) and tests.
- [x] 3.3 Add lightweight rate limiting for selected high-cost APIs with clear throttling responses and tests.
## 4. Validation and Documentation
- [x] 4.1 Run targeted backend/frontend tests and benchmark smoke gate.
- [x] 4.2 Update `README.md` and `README.mdj` with round-3 hardening notes and new env variables.


@@ -0,0 +1,2 @@
schema: spec-driven
created: 2026-02-08


@@ -0,0 +1,61 @@
## Context
After round 3 the main flows are stable, but three classes of technical debt remain:
- The Resource cache keeps both the DataFrame and a full copy of the records in the same process, causing memory amplification.
- The Oracle queries for Resource and Realtime Equipment duplicate SQL strings across services, so later edits can easily drift apart.
- Type annotations and magic numbers at some service boundaries are unsystematic, which keeps maintenance cost high.
Constraints:
- `resource` / `wip` keep the full-table cache strategy; data sources and refresh cadence are unchanged.
- External API fields and frontend behavior stay unchanged.
- The single-port architecture and the existing operational contract are preserved.
## Goals / Non-Goals
**Goals:**
- Reduce duplicated in-process data representations of the Resource cache while keeping query output compatible.
- Maintain cross-service Oracle query fragments from a single source.
- Give the key service/cache modules consistent type annotations and named constants.
**Non-Goals:**
- No database schema changes and no changes to SQL result columns.
- No rewrite of the overall cache architecture (Redis + process cache stays).
- No new infrastructure or external dependencies.
## Decisions
1. Switch the Resource derived index to a row-position index instead of keeping full record copies
   - Current state: the index keeps `records` plus several bucketed record sets, duplicating the DataFrame contents.
   - Decision: the index keeps only row positions (integer indices) and required metadata; dict output is converted from the DataFrame on demand.
   - Trade-off: each output pays a small conversion cost, but resident memory duplication drops significantly.
2. Create a shared Oracle query constants module
   - Current state: `resource_cache.py` and `realtime_equipment_cache.py` each maintain their own base SQL.
   - Decision: extract `services/sql_fragments.py` (or an equivalent module) to own the shared query text and table/view names.
   - Trade-off: one extra level of indirection, but query semantics stay consistent and changes remain controllable.
3. Govern types and constants core-boundaries-first, then spread
   - Current state: some functions mix `Optional` with PEP 604 style, and magic numbers are scattered across cache/service code.
   - Decision: first unify the type style and the high-frequency constants (TTL, size, window, limits) in the files touched this round.
   - Trade-off: no one-shot project-wide cleanup, which would create large-scale noise; establish a baseline that can be extended sustainably.
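The row-position decision can be sketched with plain lists standing in for the DataFrame (an assumption to keep the example dependency-free; the class and field names are illustrative). The index stores integer positions per bucket, and dicts are materialized lazily on output.

```python
class ResourceIndex:
    """Derived index that stores row positions, never duplicated records."""

    def __init__(self, rows, bucket_field):
        self._rows = rows          # authoritative data (stand-in for the DataFrame)
        self._buckets = {}         # bucket value -> [row positions]
        for pos, row in enumerate(rows):
            self._buckets.setdefault(row[bucket_field], []).append(pos)

    def records_for(self, bucket):
        """Materialize record dicts on demand from stored positions."""
        return [self._rows[pos] for pos in self._buckets.get(bucket, [])]
```

Memory held by the index is proportional to the number of integer positions rather than to a second full copy of every record.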
## Risks / Trade-offs
- [Risk] The row-position index drifts from the DataFrame version → Mitigation: rebuild the index on every cache invalidation and keep a version check.
- [Risk] Lazy conversion causes query-latency jitter → Mitigation: keep the process cache and optimize small-batch output on the hot paths.
- [Risk] Extracting shared SQL constants introduces wrong references → Mitigation: add unit tests asserting the query text matches the existing column contract.
- [Risk] Type/constant cleanup changes behavior → Mitigation: perform equivalence-only refactors, keep the original values, and cover them with regression tests.
## Migration Plan
1. Refactor the Resource index representation first, keeping API output unchanged.
2. Extract the shared SQL fragments and switch both cache services to them.
3. Clean up types and constants within scope and add tests.
4. Update README / README.mdj and the OpenSpec tasks; run the targeted backend/frontend test sets.
Rollback
- If compatibility issues appear, revert to the original records-based index and the old inline SQL (a single-file revert suffices).
## Open Questions
- Whether to extend the same governance to the remaining constants and types in `wip_service.py` in the next round (this round is limited to the residual scope).


@@ -0,0 +1,31 @@
## Why
The remaining risk now concentrates on maintainability and memory efficiency: the Resource cache keeps multiple data representations in a single process, some query SQL is maintained in duplicate across cache services, and type annotations and magic numbers are still inconsistent. None of this causes an immediate outage, but it raises memory usage, future change cost, and regression risk, so it should be closed out without changing existing behavior.
## What Changes
- Change the data representation of the Resource derived index to a lightweight index with lazy output, avoiding a duplicated full records copy in the process.
- Consolidate the Oracle query strings of Resource and Realtime Equipment into a shared SQL constants module, reducing duplicate definitions and drift risk.
- Align type annotations (especially at the cache/index/service boundaries) and promote high-frequency magic numbers to named constants or configurable parameters.
- Keep the existing API contract, full-table cache strategy, single-port architecture, and frontend behavior unchanged.
## Capabilities
### New Capabilities
- `resource-cache-representation-normalization`: replace multiple full in-process data copies with a single authoritative representation plus a lightweight index, preserving the existing query response structure.
- `oracle-query-fragment-governance`: extract cross-service Oracle query fragments into shared constants/templates to keep query semantics consistent.
- `maintainability-type-and-constant-hygiene`: establish working conventions for type annotations and named constants, reducing magic numbers and annotation-style drift.
### Modified Capabilities
- `cache-observability-hardening`: add observability consistency requirements covering memory amplification factors after the index representation change.
## Impact
- Primary affected files:
- `src/mes_dashboard/services/resource_cache.py`
- `src/mes_dashboard/services/realtime_equipment_cache.py`
- `src/mes_dashboard/services/resource_service.py` (if needed to support index output)
- `src/mes_dashboard/sql/*` or a new shared SQL constants module
- `src/mes_dashboard/config/constants.py` and `src/mes_dashboard/core/utils.py`
- Corresponding tests and the README/README.mdj docs
- No new external dependencies; external API paths and field contracts are unchanged.


@@ -0,0 +1,8 @@
## MODIFIED Requirements
### Requirement: Cache Telemetry SHALL Include Memory Amplification Signals
Operational telemetry MUST expose cache-domain memory usage indicators and representation amplification factors, and MUST differentiate between authoritative data payload and derived/index helper structures.
#### Scenario: Deep health telemetry request after representation normalization
- **WHEN** operators inspect cache telemetry for resource or WIP domains
- **THEN** telemetry MUST include per-domain memory footprint, amplification indicators, and enough structure detail to verify that full-record duplication is not reintroduced


@@ -0,0 +1,15 @@
## ADDED Requirements
### Requirement: Core Cache and Service Boundaries MUST Use Consistent Type Annotation Style
Core cache/service modules touched by this change SHALL use a consistent and explicit type-annotation style for public and internal helper boundaries.
#### Scenario: Reviewing updated cache/service modules
- **WHEN** maintainers inspect function signatures in affected modules
- **THEN** optional and collection types MUST follow a single consistent style and remain compatible with the project Python baseline
### Requirement: High-Frequency Magic Numbers MUST Be Replaced by Named Constants
Cache, throttling, and index-related numeric literals that control behavior MUST be extracted to named constants or env-configurable settings.
#### Scenario: Tuning cache/index behavior
- **WHEN** operators need to tune cache/index thresholds
- **THEN** they MUST find values in named constants or environment variables rather than scattered inline literals


@@ -0,0 +1,15 @@
## ADDED Requirements
### Requirement: Shared Oracle Query Fragments SHALL Have a Single Source of Truth
Cross-service Oracle query fragments for resource and equipment cache loading MUST be defined in a shared module and imported by service implementations.
#### Scenario: Update common table/view reference
- **WHEN** a common table or view name changes
- **THEN** operators and developers MUST be able to update one shared definition without editing duplicated SQL literals across services
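One possible shape for such a shared module is a handful of constants plus a small builder (the design names `services/sql_fragments.py`; the view names and columns below are hypothetical placeholders, not the real schema).

```python
# Hypothetical view names: a single place to edit when a reference changes.
RESOURCE_VIEW = "MES_APP.V_RESOURCE_STATUS"
EQUIPMENT_VIEW = "MES_APP.V_EQUIPMENT_RT"

def base_select(view: str, columns: tuple) -> str:
    """Single source of truth for the shared SELECT shape."""
    return f"SELECT {', '.join(columns)} FROM {view}"

# Each cache service imports its fragment instead of inlining SQL literals.
RESOURCE_BASE_SQL = base_select(RESOURCE_VIEW, ("RESOURCE_ID", "AREA", "STATUS"))
EQUIPMENT_BASE_SQL = base_select(EQUIPMENT_VIEW, ("EQP_ID", "STATE", "UPDATED_AT"))
```

Renaming a view then touches one constant, and a unit test pinning the generated text guards against drift.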
### Requirement: Service Queries MUST Preserve Existing Columns and Semantics
Services consuming shared Oracle query fragments SHALL preserve existing selected columns, filters, and downstream payload behavior.
#### Scenario: Resource and equipment cache refresh after refactor
- **WHEN** cache services execute queries via shared fragments
- **THEN** resulting payload structure MUST remain compatible with existing aggregation and API contracts


@@ -0,0 +1,22 @@
## ADDED Requirements
### Requirement: Resource Derived Index MUST Avoid Full Record Duplication
Resource derived index SHALL use lightweight row-position references instead of storing full duplicated record payloads alongside the process DataFrame cache.
#### Scenario: Build index from cached DataFrame
- **WHEN** resource cache data is parsed from Redis into process-level DataFrame
- **THEN** the derived index MUST store position-based references and metadata without a second full records copy
### Requirement: Resource Query APIs SHALL Preserve Existing Response Contract
Resource query APIs MUST keep existing output fields and semantics after index representation normalization.
#### Scenario: Read all resources after normalization
- **WHEN** callers request all resources or filtered resource lists
- **THEN** the returned payload MUST remain field-compatible with pre-normalization responses
### Requirement: Cache Invalidation MUST Keep Index/Data Coherent
The system SHALL invalidate and rebuild DataFrame/index representations atomically at cache refresh boundaries.
#### Scenario: Redis-backed cache refresh completes
- **WHEN** a new resource cache snapshot is published
- **THEN** stale index references MUST be invalidated before subsequent reads use refreshed DataFrame data


@@ -0,0 +1,22 @@
## 1. Resource Cache Representation Normalization
- [x] 1.1 Refactor `resource_cache` derived index to use lightweight row-position references instead of full duplicated records payload.
- [x] 1.2 Keep `get_all_resources` / `get_resources_by_filter` API outputs backward compatible while sourcing data from normalized representation.
- [x] 1.3 Update cache telemetry fields to reflect normalized representation and verify amplification calculation remains interpretable.
## 2. Oracle Query Fragment Governance
- [x] 2.1 Extract shared Oracle SQL fragments/constants for resource/equipment cache loading into a common module.
- [x] 2.2 Replace duplicated SQL literals in `resource_cache.py` and `realtime_equipment_cache.py` with shared definitions.
- [x] 2.3 Add/adjust tests to lock expected query semantics and prevent drift.
## 3. Maintainability Hygiene
- [x] 3.1 Normalize type annotations in touched cache/service modules to one consistent style.
- [x] 3.2 Replace high-frequency magic numbers with named constants or env-driven config in touched modules.
- [x] 3.3 Confirm existing login/API rate-limit and bool parser utilities remain centralized without new duplication.
## 4. Verification and Documentation
- [x] 4.1 Run targeted backend tests for resource cache, equipment cache, health/admin, and route behavior.
- [x] 4.2 Update `README.md` and `README.mdj` with round-4 hardening notes.


@@ -0,0 +1,2 @@
schema: spec-driven
created: 2026-02-08


@@ -0,0 +1,65 @@
## Context
The previous round completed the bulk of the P0/P1/P2 refactor, but code review still shows several residual high-risk points:
- `LDAP_API_URL` has no scheme/host guard, a configurable SSRF risk.
- The process-level DataFrame cache only has a TTL, with no capacity bound.
- The circuit breaker writes logs during state transitions while holding its lock, a lock-contention amplification risk.
- Global security headers are not yet emitted uniformly.
- Pagination parameters still lack lower-bound validation.
These issues span `app/core/services/routes/tests` and form a cross-module security and stability patch set.
## Goals / Non-Goals
**Goals:**
- Establish testable minimum defenses for the LDAP endpoint, HTTP response headers, and input boundaries.
- Give the process-level cache bounded capacity and predictable eviction behavior.
- Reduce lock-contention risk inside the circuit breaker so slow handlers cannot amplify blocking.
- Keep the single port, the existing API contracts, and the frontend interaction semantics unchanged.
**Non-Goals:**
- No full WAF / zero-trust architecture.
- No rewrite of the existing cache architecture onto an external cache service.
- No changes to report features or page flows.
## Decisions
1. **Fail-fast LDAP URL validation at startup**
   - Decision: validate `LDAP_API_URL` during `auth_service` startup, restricting it to `https` and an allowlist of hosts (set via env); on mismatch, disable the LDAP auth path and log an error.
   - Rationale: seals the configuration-driven SSRF risk with minimal change and leaves local auth mode unaffected.
2. **Bound `ProcessLevelCache`**
   - Decision: add `max_size` with LRU eviction (`OrderedDict`) to `ProcessLevelCache`, evicting the oldest key on `set`.
   - Rationale: keeps TTL behavior while preventing long-term accumulation of high-cardinality keys.
3. **Circuit breaker logs outside the lock**
   - Decision: `_transition_to` only updates state and assembles the log message inside the lock; the actual logger call moves outside.
   - Rationale: shortens the locked section so slow I/O handlers cannot block other request paths.
4. **Uniform injection of global security headers**
   - Decision: add `CSP`, `X-Frame-Options`, `X-Content-Type-Options`, and `Referrer-Policy` in `app.after_request`, plus `HSTS` in production.
   - Rationale: a centralized policy covers all pages and APIs, reducing the chance of omissions.
5. **Consistent pagination bounds**
   - Decision: apply uniform `max(1, min(...))` boundary handling to `page` and `page_size`.
   - Rationale: prevents negative or extreme values from causing unnecessary load or unexpected behavior.
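The fail-fast LDAP check in decision 1 can be sketched as a small validator run at startup. This is an illustrative sketch: the function name is an assumption, and `allowed_hosts` stands in for an env-derived setting such as `LDAP_ALLOWED_HOSTS`.

```python
from urllib.parse import urlsplit

def validate_ldap_url(url, allowed_hosts):
    """Return the URL if it is https and its host is allowlisted; raise otherwise."""
    if not url:
        raise ValueError("LDAP_API_URL is not configured")
    parts = urlsplit(url)
    if parts.scheme != "https":
        raise ValueError(f"LDAP_API_URL must use https, got {parts.scheme!r}")
    if parts.hostname not in allowed_hosts:
        raise ValueError(f"LDAP host {parts.hostname!r} is not in the allowlist")
    return url
```

Raising at startup means a misconfigured endpoint disables the LDAP path before any credentials could be sent to it.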
## Risks / Trade-offs
- **[Risk] An incomplete LDAP allowlist breaks logins** → **Mitigation:** emit a clear error message and local-auth fallback guidance.
- **[Risk] A too-small cache bound lowers the hit rate** → **Mitigation:** make `max_size` configurable, start with a conservative default, and watch telemetry.
- **[Risk] An overly strict CSP breaks existing inline scripts** → **Mitigation:** start with `default-src 'self'` and a compatible policy; fine-tune with nonces/allowlists if needed.
- **[Risk] Behavior changes cause test regressions** → **Mitigation:** add unit/integration tests covering every fix.
## Migration Plan
1. Land the backend fixes first (auth/cache/circuit breaker/app headers/routes).
2. Add tests (LDAP validation, LRU, lock-free logging, headers, pagination bounds).
3. Run the existing health checks and the key integration tests.
4. Update the security and stability sections of README/README.mdj.
5. If compatibility issues appear after deployment, temporarily relax the LDAP host allowlist and CSP details via env.
## Open Questions
- Do the LDAP host allowlists need multiple domains per environment (for example intranet + DR site)?
- Should CSP switch immediately to strict nonce-based mode, or stay on the compatible policy for now?


@@ -0,0 +1,40 @@
## Why
The previous round completed the core stability refactor, but several high-priority risks (LDAP URL validation, unbounded cache growth, lock-held circuit-breaker logging, security-header gaps, pagination lower-bound validation) were still open. These accumulate availability and security risk under long uptime and malicious input, so they are closed together in this round.
## What Changes
- Add startup validation for the LDAP API base URL (restricted to `https` and allowlisted hosts) to remove a controllable SSRF target.
- Add `max_size` with LRU eviction to the process-level cache, preventing unbounded memory growth from high-cardinality keys.
- Rework the circuit-breaker transition flow so logs are never written while the lock is held.
- Add global security headers (CSP, X-Frame-Options, X-Content-Type-Options, Referrer-Policy, HSTS).
- Enforce pagination lower bounds so negative values and unreasonable page sizes never reach the query flow.
- Add tests and documentation for the above, keeping the single port and the existing frontend semantics unchanged.
## Capabilities
### New Capabilities
- `security-surface-hardening`: codifies minimum defenses for the remaining security surface (SSRF protection, security headers, input boundary validation).
### Modified Capabilities
- `cache-observability-hardening`: extends cache governance with bounded process-level capacity and an eviction policy.
- `runtime-resilience-recovery`: adds the circuit-breaker lock-contention fix and compatibility requirements between security headers and operational diagnostic responses.
## Impact
- Affected code:
- `src/mes_dashboard/services/auth_service.py`
- `src/mes_dashboard/core/cache.py`
- `src/mes_dashboard/services/resource_cache.py`
- `src/mes_dashboard/core/circuit_breaker.py`
- `src/mes_dashboard/app.py`
- `src/mes_dashboard/routes/wip_routes.py`
- `tests/`
- `README.md`, `README.mdj`
- APIs:
- `/health`, `/health/deep`
- `/api/wip/detail/<workcenter>`
- `/admin/login` (indirectly affected by LDAP base validation)
- Operational behavior:
- Keeps the single port and the existing report UI flows.
- Strengthens the security and stability defenses without changing existing feature semantics.


@@ -0,0 +1,12 @@
## ADDED Requirements
### Requirement: Process-Level Cache SHALL Use Bounded Capacity with Deterministic Eviction
Process-level parsed-data caches MUST enforce a configurable maximum key capacity and use deterministic eviction behavior when capacity is exceeded.
#### Scenario: Cache capacity reached
- **WHEN** a new cache entry is inserted and key capacity is at limit
- **THEN** cache MUST evict entries according to defined policy before storing the new key
#### Scenario: Repeated access updates recency
- **WHEN** an existing cache key is read or overwritten
- **THEN** eviction order MUST reflect recency semantics so hot keys are retained preferentially


@@ -0,0 +1,12 @@
## ADDED Requirements
### Requirement: Circuit Breaker State Transitions SHALL Avoid Lock-Held Logging
Circuit breaker state transitions MUST avoid executing logger I/O while internal state locks are held.
#### Scenario: State transition occurs
- **WHEN** circuit breaker transitions between CLOSED, OPEN, or HALF_OPEN
- **THEN** lock-protected section MUST complete state mutation before emitting transition log output
#### Scenario: Slow log handler under load
- **WHEN** logger handlers are slow or blocked
- **THEN** circuit breaker lock contention MUST remain bounded and MUST NOT serialize unrelated request paths behind logging latency
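The transition rule above reduces to: mutate state and build the message under the lock, emit the log after releasing it. A minimal sketch (class and method names are illustrative; `emit` stands in for the real logger call):

```python
import threading

class Breaker:
    """Sketch of a breaker whose transition log is emitted outside the lock."""

    def __init__(self, emit=print):
        self._lock = threading.Lock()
        self.state = "CLOSED"
        self._emit = emit

    def transition_to(self, new_state):
        with self._lock:
            old, self.state = self.state, new_state
            message = f"circuit breaker: {old} -> {new_state}"
        self._emit(message)  # logger I/O happens with the lock released
```

Even if `emit` blocks for seconds, other threads can still acquire the lock and observe or change breaker state.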


@@ -0,0 +1,34 @@
## ADDED Requirements
### Requirement: LDAP Authentication Endpoint Configuration SHALL Be Strictly Validated
The system MUST validate LDAP authentication endpoint configuration before use, including HTTPS scheme enforcement and host allowlist checks.
#### Scenario: Invalid LDAP URL configuration detected
- **WHEN** `LDAP_API_URL` is missing, non-HTTPS, or points to a host outside the configured allowlist
- **THEN** the service MUST reject LDAP authentication calls and emit actionable diagnostics without sending credentials to that endpoint
#### Scenario: Valid LDAP URL configuration accepted
- **WHEN** `LDAP_API_URL` uses HTTPS and host is allowlisted
- **THEN** LDAP authentication requests MAY proceed with normal timeout and error handling behavior
### Requirement: Security Response Headers SHALL Be Applied Globally
All HTTP responses MUST include baseline security headers suitable for dashboard and API traffic.
#### Scenario: Standard response emitted
- **WHEN** any route returns a response
- **THEN** response MUST include `Content-Security-Policy`, `X-Frame-Options`, `X-Content-Type-Options`, and `Referrer-Policy`
#### Scenario: Production transport hardening
- **WHEN** runtime environment is production
- **THEN** response MUST include `Strict-Transport-Security`
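A framework-agnostic sketch of the header requirement; in Flask this logic would run inside an `after_request` hook. The header values below are assumed conservative defaults, not the project's exact policy.

```python
def apply_security_headers(headers: dict, production: bool = False) -> dict:
    """Add baseline security headers without overwriting explicit route values."""
    headers.setdefault("Content-Security-Policy", "default-src 'self'")
    headers.setdefault("X-Frame-Options", "DENY")
    headers.setdefault("X-Content-Type-Options", "nosniff")
    headers.setdefault("Referrer-Policy", "same-origin")
    if production:
        headers.setdefault("Strict-Transport-Security",
                           "max-age=31536000; includeSubDomains")
    return headers
```

Using `setdefault` lets individual routes tighten or relax a header while the global hook guarantees a floor.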
### Requirement: Pagination Input Boundaries SHALL Be Enforced
Endpoints accepting pagination parameters MUST enforce lower and upper bounds before query execution.
#### Scenario: Negative or zero pagination inputs
- **WHEN** client sends `page <= 0` or `page_size <= 0`
- **THEN** server MUST normalize values to minimum supported bounds
#### Scenario: Excessive page size requested
- **WHEN** client sends `page_size` above configured maximum
- **THEN** server MUST clamp to maximum supported page size
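Both scenarios reduce to the `max(1, min(...))` pattern the round-2 design names. A minimal sketch (the helper name and the default maximum are assumptions):

```python
def clamp_pagination(page, page_size, max_page_size=500):
    """Normalize pagination inputs to page >= 1 and 1 <= page_size <= max."""
    page = max(1, int(page))
    page_size = max(1, min(int(page_size), max_page_size))
    return page, page_size
```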


@@ -0,0 +1,24 @@
## 1. LDAP Endpoint Hardening
- [x] 1.1 Add strict `LDAP_API_URL` validation (`https` + allowlisted hosts) in auth service initialization.
- [x] 1.2 Add tests for valid/invalid LDAP URL configurations and ensure unsafe URLs are rejected without outbound auth call.
## 2. Bounded Process Cache
- [x] 2.1 Extend `ProcessLevelCache` with configurable `max_size` and LRU eviction behavior.
- [x] 2.2 Wire bounded cache configuration for WIP/Resource process-level caches and add regression tests.
## 3. Circuit Breaker Lock Contention Reduction
- [x] 3.1 Refactor circuit breaker transition logging to execute outside lock-protected section.
- [x] 3.2 Add tests verifying transition logs are emitted while state mutation remains correct.
## 4. HTTP Security Headers and Input Boundary Validation
- [x] 4.1 Add global `after_request` security headers (CSP, frame, content-type, referrer, HSTS in production).
- [x] 4.2 Tighten pagination boundary handling (`page`/`page_size`) for WIP detail endpoint and add tests.
## 5. Validation and Documentation
- [x] 5.1 Run targeted backend/frontend tests plus benchmark smoke to confirm no behavior regression.
- [x] 5.2 Update `README.md` and `README.mdj` with round-2 security/stability hardening notes.


@@ -0,0 +1,33 @@
# api-safety-hygiene Specification
## Purpose
TBD - created by archiving change residual-hardening-round3. Update Purpose after archive.
## Requirements
### Requirement: Recursive Payload Cleaning MUST Enforce Depth Safety
Routes that normalize nested payloads MUST prevent unbounded recursion depth.
#### Scenario: Deeply nested response object
- **WHEN** NaN-cleaning helper receives deeply nested list/dict payload
- **THEN** cleaning logic MUST enforce max depth or iterative traversal and return safely without recursion failure
### Requirement: Filter Source Names MUST Be Configurable
Filter cache query sources MUST NOT rely on hardcoded view names only.
#### Scenario: Environment-specific view names
- **WHEN** deployment sets custom filter-source environment variables
- **THEN** filter cache loader MUST resolve and query configured view names
### Requirement: High-Cost APIs SHALL Apply Basic Rate Guardrails
High-cost read endpoints SHALL apply configurable request-rate guardrails to reduce abuse and accidental bursts.
#### Scenario: Burst traffic from same client
- **WHEN** a client exceeds configured request budget for guarded endpoints
- **THEN** endpoint SHALL return throttled response with clear retry guidance
### Requirement: Common Boolean Query Parsing SHALL Be Shared
Boolean query parsing in routes SHALL use shared helper behavior.
#### Scenario: Different routes parse include flags
- **WHEN** routes parse common boolean query parameters
- **THEN** parsing behavior MUST be consistent across routes via shared utility


@@ -0,0 +1,26 @@
# cache-indexed-query-acceleration Specification
## Purpose
TBD - created by archiving change p1-cache-query-efficiency. Update Purpose after archive.
## Requirements
### Requirement: Incremental Synchronization SHALL Use Versioned Watermarks
For heavy non-full-snapshot datasets, cache refresh SHALL support incremental synchronization keyed by stable version or watermark boundaries.
#### Scenario: Incremental refresh cycle
- **WHEN** source data version indicates partial changes since last sync
- **THEN** cache update logic MUST fetch and merge only changed partitions while preserving correctness guarantees
### Requirement: Query Paths SHALL Use Indexed Access for High-Frequency Filters
Query execution over cached data SHALL use prebuilt indexes for known high-frequency filter columns.
#### Scenario: Filtered report query
- **WHEN** request filters target indexed fields
- **THEN** result selection MUST avoid full dataset scans and maintain existing response contract
### Requirement: Business-Mandated Full-Table Caches SHALL Be Preserved for Resource and WIP
The system SHALL continue to maintain full-table cache behavior for `resource` and `wip` domains.
#### Scenario: Resource or WIP cache refresh
- **WHEN** cache update runs for `resource` or `wip`
- **THEN** the updater MUST retain full-table snapshot semantics and MUST NOT switch these domains to partial-only cache mode


@@ -36,3 +36,53 @@ The system MUST define alert thresholds for sustained degraded state, repeated w
- **WHEN** degraded status persists beyond configured duration
- **THEN** the monitoring contract MUST classify the service as alert-worthy with actionable context
### Requirement: Cache Telemetry SHALL Include Memory Amplification Signals
Operational telemetry MUST expose cache-domain memory usage indicators and representation amplification factors, and MUST differentiate between authoritative data payload and derived/index helper structures.
#### Scenario: Deep health telemetry request after representation normalization
- **WHEN** operators inspect cache telemetry for resource or WIP domains
- **THEN** telemetry MUST include per-domain memory footprint, amplification indicators, and enough structure detail to verify that full-record duplication is not reintroduced
### Requirement: Efficiency Benchmarks SHALL Gate Cache Refactor Rollout
Cache/query efficiency changes MUST be validated against baseline latency and memory benchmarks before rollout.
#### Scenario: Pre-release validation
- **WHEN** cache refactor changes are prepared for deployment
- **THEN** benchmark results MUST demonstrate no regression beyond configured thresholds for P95 latency and memory usage
### Requirement: Process-Level Cache SHALL Use Bounded Capacity with Deterministic Eviction
Process-level parsed-data caches MUST enforce a configurable maximum key capacity and use deterministic eviction behavior when capacity is exceeded.
#### Scenario: Cache capacity reached
- **WHEN** a new cache entry is inserted and key capacity is at limit
- **THEN** cache MUST evict entries according to defined policy before storing the new key
#### Scenario: Repeated access updates recency
- **WHEN** an existing cache key is read or overwritten
- **THEN** eviction order MUST reflect recency semantics so hot keys are retained preferentially
### Requirement: Cache Publish MUST Preserve Previous Readable Snapshot on Failure
When refreshing full-table cache payloads, the system MUST avoid exposing partially published states to readers.
#### Scenario: Publish fails after payload serialization
- **WHEN** a cache refresh has prepared new payload but publish operation fails
- **THEN** previously published cache keys MUST remain readable and metadata MUST remain consistent with old snapshot
#### Scenario: Publish succeeds
- **WHEN** publish operation completes successfully
- **THEN** data payload and metadata keys MUST be visible as one coherent new snapshot
### Requirement: Process-Level Cache Slow Path SHALL Minimize Lock Hold Time
Large payload parsing MUST NOT happen inside long-held process cache locks.
#### Scenario: Cache miss under concurrent requests
- **WHEN** multiple requests hit process cache miss
- **THEN** parsing work SHALL happen outside lock-protected mutation section, and lock scope SHALL be limited to consistency check + commit
### Requirement: Process-Level Cache Policies MUST Stay Consistent Across Services
All service-local process caches MUST support bounded capacity with deterministic eviction.
#### Scenario: Realtime equipment cache growth
- **WHEN** realtime equipment process cache reaches configured capacity
- **THEN** entries MUST be evicted according to deterministic LRU behavior


@@ -24,3 +24,17 @@ Runbooks and deployment documentation MUST describe the same conda/systemd/watch
- **WHEN** an operator performs deploy, health check, and rollback from documentation
- **THEN** documented commands and paths MUST work without requiring venv-specific assumptions
### Requirement: Runtime Path Drift SHALL Be Detectable Before Service Start
Service startup checks MUST validate configured conda runtime paths across app, watchdog, and worker control scripts.
#### Scenario: Conda path mismatch detected
- **WHEN** startup validation finds runtime path inconsistency between configured units and scripts
- **THEN** service start MUST fail with actionable diagnostics instead of running with partial mismatch
### Requirement: Conda/Systemd Contract SHALL Be Versioned in Operations Docs
The documented runtime contract MUST include versioned path assumptions and verification commands.
#### Scenario: Operator verifies deployment contract
- **WHEN** operator follows runbook validation steps
- **THEN** commands MUST confirm active runtime paths match documented conda/systemd contract


@@ -50,3 +50,17 @@ Frontend matrix/filter computations SHALL produce deterministic selection and fi
- **WHEN** users toggle matrix cells across group, family, and resource rows
- **THEN** selected-state rendering and filtered equipment result sets MUST remain level-correct and reversible
### Requirement: Reusable Browser Compute Modules SHALL Power Report Derivations
Derived computations for report filters, KPI cards, chart series, and table projections SHALL be implemented through reusable frontend modules.
#### Scenario: Shared report derivation logic
- **WHEN** multiple report pages require equivalent data-shaping behavior
- **THEN** pages MUST consume shared compute modules instead of duplicating transformation logic per page
### Requirement: Browser Compute Shift SHALL Preserve Export and Field Contracts
Moving computations to frontend MUST preserve existing field naming and export column contracts.
#### Scenario: User exports report after frontend-side derivation
- **WHEN** transformed data is rendered and exported
- **THEN** exported field names and ordering MUST remain consistent with governed field contract definitions


@@ -0,0 +1,19 @@
# maintainability-type-and-constant-hygiene Specification
## Purpose
TBD - created by archiving change residual-hardening-round4. Update Purpose after archive.
## Requirements
### Requirement: Core Cache and Service Boundaries MUST Use Consistent Type Annotation Style
Core cache/service modules touched by this change SHALL use a consistent and explicit type-annotation style for public and internal helper boundaries.
#### Scenario: Reviewing updated cache/service modules
- **WHEN** maintainers inspect function signatures in affected modules
- **THEN** optional and collection types MUST follow a single consistent style and remain compatible with the project Python baseline
### Requirement: High-Frequency Magic Numbers MUST Be Replaced by Named Constants
Cache, throttling, and index-related numeric literals that control behavior MUST be extracted to named constants or env-configurable settings.
#### Scenario: Tuning cache/index behavior
- **WHEN** operators need to tune cache/index thresholds
- **THEN** they MUST find values in named constants or environment variables rather than scattered inline literals


@@ -0,0 +1,19 @@
# oracle-query-fragment-governance Specification
## Purpose
TBD - created by archiving change residual-hardening-round4. Update Purpose after archive.
## Requirements
### Requirement: Shared Oracle Query Fragments SHALL Have a Single Source of Truth
Cross-service Oracle query fragments for resource and equipment cache loading MUST be defined in a shared module and imported by service implementations.
#### Scenario: Update common table/view reference
- **WHEN** a common table or view name changes
- **THEN** operators and developers MUST be able to update one shared definition without editing duplicated SQL literals across services
### Requirement: Service Queries MUST Preserve Existing Columns and Semantics
Services consuming shared Oracle query fragments SHALL preserve existing selected columns, filters, and downstream payload behavior.
#### Scenario: Resource and equipment cache refresh after refactor
- **WHEN** cache services execute queries via shared fragments
- **THEN** resulting payload structure MUST remain compatible with existing aggregation and API contracts
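
An illustrative shape for such a shared-fragment module (module path, view name, and column list are hypothetical stand-ins, not the repository's actual definitions):

```python
# Hypothetical shared module, e.g. mes_dashboard/core/query_fragments.py:
# one definition of common table/view references, imported by services.
RESOURCE_BASE_VIEW = "MES.V_RESOURCE_STATUS"  # illustrative view name

RESOURCE_CACHE_COLUMNS = (
    "RESOURCE_ID",
    "WORKCENTER_GROUP",
    "STATUS",
)


def resource_cache_query(extra_where: str = "") -> str:
    """Compose the shared SELECT; services add only service-specific filters."""
    columns = ", ".join(RESOURCE_CACHE_COLUMNS)
    where = f" WHERE {extra_where}" if extra_where else ""
    return f"SELECT {columns} FROM {RESOURCE_BASE_VIEW}{where}"
```

Renaming the view then means editing one constant instead of duplicated SQL literals across services, while consumers keep the same selected columns.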


@@ -0,0 +1,26 @@
# resource-cache-representation-normalization Specification
## Purpose
TBD - created by archiving change residual-hardening-round4. Update Purpose after archive.
## Requirements
### Requirement: Resource Derived Index MUST Avoid Full Record Duplication
Resource derived index SHALL use lightweight row-position references instead of storing full duplicated record payloads alongside the process DataFrame cache.
#### Scenario: Build index from cached DataFrame
- **WHEN** resource cache data is parsed from Redis into process-level DataFrame
- **THEN** the derived index MUST store position-based references and metadata without a second full records copy
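
A minimal sketch of position-based indexing over a single cached DataFrame (column values are sample data; the real index also carries metadata):

```python
import pandas as pd

df = pd.DataFrame({
    "RESOURCE_ID": ["R1", "R2", "R3", "R4"],
    "WORKCENTER_GROUP": ["WC-01", "WC-02", "WC-01", "WC-01"],
})

# Index stores integer row positions, not a second full copy of the records.
positions_by_group = {
    str(key): [int(i) for i in rows]
    for key, rows in df.groupby("WORKCENTER_GROUP", sort=False).indices.items()
}

# Lookups materialize rows from the single DataFrame on demand.
wc01 = df.iloc[positions_by_group["WC-01"]]
```

Because only integer positions are duplicated, index memory stays small relative to the DataFrame, which is what the amplification telemetry in this change measures.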
### Requirement: Resource Query APIs SHALL Preserve Existing Response Contract
Resource query APIs MUST keep existing output fields and semantics after index representation normalization.
#### Scenario: Read all resources after normalization
- **WHEN** callers request all resources or filtered resource lists
- **THEN** the returned payload MUST remain field-compatible with pre-normalization responses
### Requirement: Cache Invalidation MUST Keep Index/Data Coherent
The system SHALL invalidate and rebuild DataFrame/index representations atomically at cache refresh boundaries.
#### Scenario: Redis-backed cache refresh completes
- **WHEN** a new resource cache snapshot is published
- **THEN** stale index references MUST be invalidated before subsequent reads use refreshed DataFrame data


@@ -48,3 +48,47 @@ The system MUST expose machine-readable resilience thresholds, restart-churn ind
#### Scenario: Admin status includes restart churn summary
- **WHEN** operators call `/admin/api/system-status` or `/admin/api/worker/status`
- **THEN** responses MUST include bounded restart history summary within a configured time window and indicate whether churn threshold is exceeded
### Requirement: Recovery Recommendations SHALL Reflect Self-Healing Policy State
Health and admin resilience payloads MUST expose whether automated recovery is allowed, cooling down, or blocked by churn policy.
#### Scenario: Operator inspects degraded state
- **WHEN** `/health` or `/admin/api/worker/status` is requested during degradation
- **THEN** response MUST include policy state, cooldown remaining time, and next recommended action
### Requirement: Manual Recovery Override SHALL Be Explicit and Controlled
Manual restart actions MUST bypass automatic block only through authenticated operator pathways with explicit acknowledgement.
#### Scenario: Churn-blocked state with manual override request
- **WHEN** authorized admin requests manual restart while auto-recovery is blocked
- **THEN** system MUST execute controlled restart path and log the override context for auditability
### Requirement: Circuit Breaker State Transitions SHALL Avoid Lock-Held Logging
Circuit breaker state transitions MUST avoid executing logger I/O while internal state locks are held.
#### Scenario: State transition occurs
- **WHEN** circuit breaker transitions between CLOSED, OPEN, or HALF_OPEN
- **THEN** lock-protected section MUST complete state mutation before emitting transition log output
#### Scenario: Slow log handler under load
- **WHEN** logger handlers are slow or blocked
- **THEN** circuit breaker lock contention MUST remain bounded and MUST NOT serialize unrelated request paths behind logging latency
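
A minimal sketch of the lock-scoping pattern this requirement describes (class and logger names hypothetical): mutate state under the lock, emit the transition log only after release.

```python
import logging
import threading

logging.basicConfig(level=logging.INFO)
logger = logging.getLogger("breaker")


class Breaker:
    """Sketch: state mutation inside the lock, logger I/O outside it."""

    def __init__(self) -> None:
        self._lock = threading.Lock()
        self._state = "CLOSED"

    def transition(self, new_state: str) -> None:
        with self._lock:
            old_state, self._state = self._state, new_state
        # Logger I/O happens outside the lock, so a slow or blocked log
        # handler cannot serialize unrelated requests behind logging latency.
        logger.info("circuit breaker %s -> %s", old_state, new_state)


b = Breaker()
b.transition("OPEN")
```

The trade-off is that under heavy concurrency log lines may interleave out of strict transition order, which is acceptable here because ordering is recoverable from the logged states.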
### Requirement: Health Endpoints SHALL Use Short Internal Memoization
Health and deep-health computation SHALL use a short-lived internal cache to prevent probe storms from amplifying backend load.
#### Scenario: Frequent monitor scrapes
- **WHEN** health endpoints are called repeatedly within a small window
- **THEN** service SHALL return memoized payload for up to 5 seconds in non-testing environments
#### Scenario: Testing mode
- **WHEN** app is running in testing mode
- **THEN** health endpoint memoization MUST be bypassed to preserve deterministic tests
### Requirement: Logs MUST Redact Connection Secrets
Runtime logs MUST avoid exposing DB connection credentials.
#### Scenario: Connection string appears in log message
- **WHEN** a log message contains DB URL credentials
- **THEN** logger output MUST redact password and sensitive userinfo before emission
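
One way to implement this as a `logging.Filter` (a sketch; the project's `install_log_redaction_filter` may differ): mask the password portion of any URL-style userinfo before the record is emitted.

```python
import logging
import re

# Matches "scheme://user:password@" and keeps everything but the password.
_URL_CREDS = re.compile(r"://([^/:@\s]+):([^@\s]+)@")


class RedactSecretsFilter(logging.Filter):
    """Sketch of a logging filter that masks userinfo in connection URLs."""

    def filter(self, record: logging.LogRecord) -> bool:
        message = record.getMessage()  # resolve %-style args first
        record.msg = _URL_CREDS.sub(r"://\1:***@", message)
        record.args = None
        return True


logger = logging.getLogger("redaction-demo")
logger.addFilter(RedactSecretsFilter())
```

Attaching the filter at the application logger keeps redaction centralized rather than relying on every call site to sanitize messages.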


@@ -0,0 +1,38 @@
# security-surface-hardening Specification
## Purpose
TBD - created by archiving change security-stability-hardening-round2. Update Purpose after archive.
## Requirements
### Requirement: LDAP Authentication Endpoint Configuration SHALL Be Strictly Validated
The system MUST validate LDAP authentication endpoint configuration before use, including HTTPS scheme enforcement and host allowlist checks.
#### Scenario: Invalid LDAP URL configuration detected
- **WHEN** `LDAP_API_URL` is missing, non-HTTPS, or points to a host outside the configured allowlist
- **THEN** the service MUST reject LDAP authentication calls and emit actionable diagnostics without sending credentials to that endpoint
#### Scenario: Valid LDAP URL configuration accepted
- **WHEN** `LDAP_API_URL` uses HTTPS and host is allowlisted
- **THEN** LDAP authentication requests MAY proceed with normal timeout and error handling behavior
### Requirement: Security Response Headers SHALL Be Applied Globally
All HTTP responses MUST include baseline security headers suitable for dashboard and API traffic.
#### Scenario: Standard response emitted
- **WHEN** any route returns a response
- **THEN** response MUST include `Content-Security-Policy`, `X-Frame-Options`, `X-Content-Type-Options`, and `Referrer-Policy`
#### Scenario: Production transport hardening
- **WHEN** runtime environment is production
- **THEN** response MUST include `Strict-Transport-Security`
### Requirement: Pagination Input Boundaries SHALL Be Enforced
Endpoints accepting pagination parameters MUST enforce lower and upper bounds before query execution.
#### Scenario: Negative or zero pagination inputs
- **WHEN** client sends `page <= 0` or `page_size <= 0`
- **THEN** server MUST normalize values to minimum supported bounds
#### Scenario: Excessive page size requested
- **WHEN** client sends `page_size` above configured maximum
- **THEN** server MUST clamp to maximum supported page size
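
The clamping behavior can be sketched in a few lines (the ceiling value is illustrative, not the configured maximum):

```python
MIN_PAGE = 1
MIN_PAGE_SIZE = 1
MAX_PAGE_SIZE = 200  # illustrative ceiling


def clamp_pagination(page: int, page_size: int) -> tuple[int, int]:
    """Normalize out-of-range pagination inputs before query execution."""
    page = max(page, MIN_PAGE)
    page_size = min(max(page_size, MIN_PAGE_SIZE), MAX_PAGE_SIZE)
    return page, page_size
```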


@@ -0,0 +1,26 @@
# worker-self-healing-governance Specification
## Purpose
TBD - created by archiving change p2-ops-self-healing-runbook. Update Purpose after archive.
## Requirements
### Requirement: Automated Worker Recovery SHALL Use Bounded Policy Guards
Automated worker restart behavior MUST enforce cooldown periods and bounded restart attempts within a configurable time window.
#### Scenario: Repeated worker degradation within short window
- **WHEN** degradation events exceed configured restart-attempt budget
- **THEN** automated restarts MUST pause and surface a blocked-recovery signal for operator intervention
### Requirement: Restart-Churn Protection SHALL Prevent Recovery Storms
The runtime MUST classify restart churn and prevent uncontrolled restart loops.
#### Scenario: Churn threshold exceeded
- **WHEN** restart count crosses churn threshold in active window
- **THEN** watchdog MUST enter guarded mode and require explicit manual override before further restart attempts
### Requirement: Recovery Decisions SHALL Be Audit-Ready
Every auto-recovery decision and manual override action MUST be recorded with structured metadata.
#### Scenario: Worker restart decision emitted
- **WHEN** system executes or denies a restart action
- **THEN** structured logs/events MUST include reason, thresholds, actor/source, and resulting state
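
A simplified sketch of such a bounded decision policy (constants and return shape are illustrative; the real `worker_recovery_policy` module is richer):

```python
import time

COOLDOWN_SECONDS = 60
RETRY_BUDGET = 3
WINDOW_SECONDS = 600


def decide_restart(history: list[float], *, manual_override: bool = False,
                   override_acknowledged: bool = False, now=time.time) -> dict:
    """Sketch of a cooldown + retry-budget + override decision."""
    current = now()
    recent = [t for t in history if current - t <= WINDOW_SECONDS]
    if manual_override:
        # Override bypasses churn guard only with explicit acknowledgement.
        allowed = override_acknowledged
        reason = "manual_override" if allowed else "override_requires_ack"
        return {"allowed": allowed, "reason": reason, "recent": len(recent)}
    if len(recent) >= RETRY_BUDGET:
        return {"allowed": False, "reason": "churn_guarded", "recent": len(recent)}
    if recent and current - max(recent) < COOLDOWN_SECONDS:
        return {"allowed": False, "reason": "cooldown", "recent": len(recent)}
    return {"allowed": True, "reason": "within_budget", "recent": len(recent)}
```

The returned dict doubles as the audit payload: every allow/deny carries the reason and the window statistics that produced it.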

scripts/run_cache_benchmarks.py Executable file

@@ -0,0 +1,223 @@
#!/usr/bin/env python3
# -*- coding: utf-8 -*-
"""Benchmark cache query baseline vs indexed selection.

This benchmark is used as a repeatable governance harness for P1 cache/query
efficiency work. It focuses on deterministic synthetic workloads so operators
can compare relative latency and memory amplification over time.
"""
from __future__ import annotations

import argparse
import json
import math
import random
import statistics
import time
from pathlib import Path
from typing import Any

import numpy as np
import pandas as pd

ROOT = Path(__file__).resolve().parents[1]
FIXTURE_PATH = ROOT / "tests" / "fixtures" / "cache_benchmark_fixture.json"


def load_fixture(path: Path = FIXTURE_PATH) -> dict[str, Any]:
    payload = json.loads(path.read_text())
    if "rows" not in payload:
        raise ValueError("fixture requires rows")
    return payload


def build_dataset(rows: int, seed: int) -> pd.DataFrame:
    random.seed(seed)
    np.random.seed(seed)
    workcenters = [f"WC-{idx:02d}" for idx in range(1, 31)]
    packages = ["QFN", "DFN", "SOT", "SOP", "BGA", "TSOP"]
    types = ["TYPE-A", "TYPE-B", "TYPE-C", "TYPE-D"]
    statuses = ["RUN", "QUEUE", "HOLD"]
    hold_reasons = ["", "", "", "YieldLimit", "特殊需求管控", "PM Hold"]
    frame = pd.DataFrame(
        {
            "WORKCENTER_GROUP": np.random.choice(workcenters, rows),
            "PACKAGE_LEF": np.random.choice(packages, rows),
            "PJ_TYPE": np.random.choice(types, rows),
            "WIP_STATUS": np.random.choice(statuses, rows, p=[0.45, 0.35, 0.20]),
            "HOLDREASONNAME": np.random.choice(hold_reasons, rows),
            "QTY": np.random.randint(1, 500, rows),
            "WORKORDER": [f"WO-{i:06d}" for i in range(rows)],
            "LOTID": [f"LOT-{i:07d}" for i in range(rows)],
        }
    )
    return frame


def _build_index(df: pd.DataFrame) -> dict[str, dict[str, set[int]]]:
    def by_column(column: str) -> dict[str, set[int]]:
        grouped = df.groupby(column, dropna=True, sort=False).indices
        return {str(k): {int(i) for i in v} for k, v in grouped.items()}

    return {
        "workcenter": by_column("WORKCENTER_GROUP"),
        "package": by_column("PACKAGE_LEF"),
        "type": by_column("PJ_TYPE"),
        "status": by_column("WIP_STATUS"),
    }


def _baseline_query(df: pd.DataFrame, query: dict[str, str]) -> int:
    subset = df
    if query.get("workcenter"):
        subset = subset[subset["WORKCENTER_GROUP"] == query["workcenter"]]
    if query.get("package"):
        subset = subset[subset["PACKAGE_LEF"] == query["package"]]
    if query.get("type"):
        subset = subset[subset["PJ_TYPE"] == query["type"]]
    if query.get("status"):
        subset = subset[subset["WIP_STATUS"] == query["status"]]
    return int(len(subset))


def _indexed_query(_df: pd.DataFrame, indexes: dict[str, dict[str, set[int]]], query: dict[str, str]) -> int:
    selected: set[int] | None = None
    for key, bucket in (
        ("workcenter", "workcenter"),
        ("package", "package"),
        ("type", "type"),
        ("status", "status"),
    ):
        current = indexes[bucket].get(query.get(key, ""))
        if current is None:
            return 0
        if selected is None:
            selected = set(current)
        else:
            selected.intersection_update(current)
        if not selected:
            return 0
    return len(selected or ())


def _build_queries(df: pd.DataFrame, query_count: int, seed: int) -> list[dict[str, str]]:
    random.seed(seed + 17)
    workcenters = sorted(df["WORKCENTER_GROUP"].dropna().astype(str).unique().tolist())
    packages = sorted(df["PACKAGE_LEF"].dropna().astype(str).unique().tolist())
    types = sorted(df["PJ_TYPE"].dropna().astype(str).unique().tolist())
    statuses = sorted(df["WIP_STATUS"].dropna().astype(str).unique().tolist())
    queries: list[dict[str, str]] = []
    for _ in range(query_count):
        queries.append(
            {
                "workcenter": random.choice(workcenters),
                "package": random.choice(packages),
                "type": random.choice(types),
                "status": random.choice(statuses),
            }
        )
    return queries


def _p95(values: list[float]) -> float:
    if not values:
        return 0.0
    sorted_values = sorted(values)
    index = min(max(math.ceil(0.95 * len(sorted_values)) - 1, 0), len(sorted_values) - 1)
    return sorted_values[index]


def run_benchmark(rows: int, query_count: int, seed: int) -> dict[str, Any]:
    df = build_dataset(rows=rows, seed=seed)
    queries = _build_queries(df, query_count=query_count, seed=seed)
    indexes = _build_index(df)

    baseline_latencies: list[float] = []
    indexed_latencies: list[float] = []
    baseline_rows: list[int] = []
    indexed_rows: list[int] = []
    for query in queries:
        start = time.perf_counter()
        baseline_rows.append(_baseline_query(df, query))
        baseline_latencies.append((time.perf_counter() - start) * 1000)

        start = time.perf_counter()
        indexed_rows.append(_indexed_query(df, indexes, query))
        indexed_latencies.append((time.perf_counter() - start) * 1000)

    if baseline_rows != indexed_rows:
        raise AssertionError("benchmark correctness drift: indexed result mismatch")

    frame_bytes = int(df.memory_usage(index=True, deep=True).sum())
    index_entries = sum(len(bucket) for buckets in indexes.values() for bucket in buckets.values())
    index_bytes_estimate = int(index_entries * 16)
    baseline_p95 = _p95(baseline_latencies)
    indexed_p95 = _p95(indexed_latencies)
    return {
        "rows": rows,
        "query_count": query_count,
        "seed": seed,
        "latency_ms": {
            "baseline_avg": round(statistics.fmean(baseline_latencies), 4),
            "baseline_p95": round(baseline_p95, 4),
            "indexed_avg": round(statistics.fmean(indexed_latencies), 4),
            "indexed_p95": round(indexed_p95, 4),
            "p95_ratio_indexed_vs_baseline": round(
                (indexed_p95 / baseline_p95) if baseline_p95 > 0 else 0.0,
                4,
            ),
        },
        "memory_bytes": {
            "frame": frame_bytes,
            "index_estimate": index_bytes_estimate,
            "amplification_ratio": round(
                (frame_bytes + index_bytes_estimate) / max(frame_bytes, 1),
                4,
            ),
        },
    }


def main() -> int:
    fixture = load_fixture()
    parser = argparse.ArgumentParser(description="Run cache baseline vs indexed benchmark")
    parser.add_argument("--rows", type=int, default=int(fixture.get("rows", 30000)))
    parser.add_argument("--queries", type=int, default=int(fixture.get("query_count", 400)))
    parser.add_argument("--seed", type=int, default=int(fixture.get("seed", 42)))
    parser.add_argument("--enforce", action="store_true")
    args = parser.parse_args()

    report = run_benchmark(rows=args.rows, query_count=args.queries, seed=args.seed)
    print(json.dumps(report, ensure_ascii=False, indent=2))
    if not args.enforce:
        return 0

    thresholds = fixture.get("thresholds") or {}
    max_latency_ratio = float(thresholds.get("max_p95_ratio_indexed_vs_baseline", 1.25))
    max_amplification = float(thresholds.get("max_memory_amplification_ratio", 1.8))
    latency_ratio = float(report["latency_ms"]["p95_ratio_indexed_vs_baseline"])
    amplification_ratio = float(report["memory_bytes"]["amplification_ratio"])
    if latency_ratio > max_latency_ratio:
        raise SystemExit(
            f"Latency regression: {latency_ratio:.4f} > max allowed {max_latency_ratio:.4f}"
        )
    if amplification_ratio > max_amplification:
        raise SystemExit(
            f"Memory amplification regression: {amplification_ratio:.4f} > max allowed {max_amplification:.4f}"
        )
    return 0


if __name__ == "__main__":
    raise SystemExit(main())

scripts/start_server.sh Normal file → Executable file

@@ -9,7 +9,7 @@ set -uo pipefail
# Configuration
# ============================================================
ROOT="$(cd "$(dirname "${BASH_SOURCE[0]}")/.." && pwd)"
-CONDA_ENV="mes-dashboard"
+CONDA_ENV="${CONDA_ENV_NAME:-mes-dashboard}"
APP_NAME="mes-dashboard"
PID_FILE_DEFAULT="${ROOT}/tmp/gunicorn.pid"
PID_FILE="${WATCHDOG_PID_FILE:-${PID_FILE_DEFAULT}}"
@@ -56,7 +56,7 @@ timestamp() {
resolve_runtime_paths() {
    WATCHDOG_RUNTIME_DIR="${WATCHDOG_RUNTIME_DIR:-${ROOT}/tmp}"
    WATCHDOG_RESTART_FLAG="${WATCHDOG_RESTART_FLAG:-${WATCHDOG_RUNTIME_DIR}/mes_dashboard_restart.flag}"
-    WATCHDOG_PID_FILE="${WATCHDOG_PID_FILE:-${PID_FILE_DEFAULT}}"
+    WATCHDOG_PID_FILE="${WATCHDOG_PID_FILE:-${WATCHDOG_RUNTIME_DIR}/gunicorn.pid}"
    WATCHDOG_STATE_FILE="${WATCHDOG_STATE_FILE:-${WATCHDOG_RUNTIME_DIR}/mes_dashboard_restart_state.json}"
    PID_FILE="${WATCHDOG_PID_FILE}"
    export WATCHDOG_RUNTIME_DIR WATCHDOG_RESTART_FLAG WATCHDOG_PID_FILE WATCHDOG_STATE_FILE
@@ -81,8 +81,14 @@ check_conda() {
        return 1
    fi

    if [ -n "${CONDA_BIN:-}" ] && [ ! -x "${CONDA_BIN}" ]; then
        log_error "CONDA_BIN is set but not executable: ${CONDA_BIN}"
        return 1
    fi

    # Source conda
-    source "$(conda info --base)/etc/profile.d/conda.sh"
+    local conda_cmd="${CONDA_BIN:-$(command -v conda)}"
+    source "$(${conda_cmd} info --base)/etc/profile.d/conda.sh"

    # Check if environment exists
    if ! conda env list | grep -q "^${CONDA_ENV} "; then
@@ -95,6 +101,33 @@ check_conda() {
    return 0
}

validate_runtime_contract() {
    conda activate "$CONDA_ENV"
    export PYTHONPATH="${ROOT}/src:${PYTHONPATH:-}"

    if python - <<'PY'
import os
import sys

from mes_dashboard.core.runtime_contract import build_runtime_contract_diagnostics

strict = os.getenv("RUNTIME_CONTRACT_ENFORCE", "true").strip().lower() in {"1", "true", "yes", "on"}
diag = build_runtime_contract_diagnostics(strict=strict)
if not diag["valid"]:
    for error in diag["errors"]:
        print(f"RUNTIME_CONTRACT_ERROR: {error}")
    raise SystemExit(1)
PY
    then
        log_success "Runtime contract validation passed"
        return 0
    fi

    log_error "Runtime contract validation failed"
    log_info "Fix env vars: WATCHDOG_RUNTIME_DIR / WATCHDOG_RESTART_FLAG / WATCHDOG_PID_FILE / WATCHDOG_STATE_FILE / CONDA_BIN"
    return 1
}

check_dependencies() {
    conda activate "$CONDA_ENV"
@@ -329,6 +362,7 @@ run_all_checks() {
    check_env_file
    load_env
    resolve_runtime_paths
    validate_runtime_contract || return 1
    check_port || return 1
    check_database
    check_redis

scripts/worker_watchdog.py Normal file → Executable file

@@ -31,6 +31,23 @@ import time
from datetime import datetime
from pathlib import Path

PROJECT_ROOT = Path(__file__).resolve().parents[1]
SRC_ROOT = PROJECT_ROOT / "src"
if str(SRC_ROOT) not in sys.path:
    sys.path.insert(0, str(SRC_ROOT))

from mes_dashboard.core.runtime_contract import (  # noqa: E402
    build_runtime_contract_diagnostics,
    load_runtime_contract,
)
from mes_dashboard.core.worker_recovery_policy import (  # noqa: E402
    decide_restart_request,
    evaluate_worker_recovery_state,
    extract_last_requested_at,
    extract_restart_history,
    get_worker_recovery_policy_config,
)

# Configure logging
logging.basicConfig(
    level=logging.INFO,
@@ -45,7 +62,10 @@ logger = logging.getLogger('mes_dashboard.watchdog')
# Configuration
# ============================================================
-CHECK_INTERVAL = int(os.getenv('WATCHDOG_CHECK_INTERVAL', '5'))
+_RUNTIME_CONTRACT = load_runtime_contract(project_root=PROJECT_ROOT)
+CHECK_INTERVAL = int(
+    os.getenv('WATCHDOG_CHECK_INTERVAL', str(_RUNTIME_CONTRACT['watchdog_check_interval']))
+)

def _env_int(name: str, default: int) -> int:
@@ -55,22 +75,11 @@ def _env_int(name: str, default: int) -> int:
    return default

-PROJECT_ROOT = Path(__file__).resolve().parents[1]
-DEFAULT_RUNTIME_DIR = Path(
-    os.getenv('WATCHDOG_RUNTIME_DIR', str(PROJECT_ROOT / 'tmp'))
-)
-RESTART_FLAG_PATH = os.getenv(
-    'WATCHDOG_RESTART_FLAG',
-    str(DEFAULT_RUNTIME_DIR / 'mes_dashboard_restart.flag')
-)
-GUNICORN_PID_FILE = os.getenv(
-    'WATCHDOG_PID_FILE',
-    str(DEFAULT_RUNTIME_DIR / 'gunicorn.pid')
-)
-RESTART_STATE_FILE = os.getenv(
-    'WATCHDOG_STATE_FILE',
-    str(DEFAULT_RUNTIME_DIR / 'mes_dashboard_restart_state.json')
-)
+DEFAULT_RUNTIME_DIR = Path(_RUNTIME_CONTRACT['watchdog_runtime_dir'])
+RESTART_FLAG_PATH = _RUNTIME_CONTRACT['watchdog_restart_flag']
+GUNICORN_PID_FILE = _RUNTIME_CONTRACT['watchdog_pid_file']
+RESTART_STATE_FILE = _RUNTIME_CONTRACT['watchdog_state_file']
+RUNTIME_CONTRACT_VERSION = _RUNTIME_CONTRACT['version']

RESTART_HISTORY_MAX = _env_int('WATCHDOG_RESTART_HISTORY_MAX', 50)
@@ -78,6 +87,32 @@ RESTART_HISTORY_MAX = _env_int('WATCHDOG_RESTART_HISTORY_MAX', 50)
# Watchdog Implementation
# ============================================================

def validate_runtime_contract_or_raise() -> None:
    """Fail fast if runtime contract is inconsistent."""
    strict = os.getenv("RUNTIME_CONTRACT_ENFORCE", "true").strip().lower() in {
        "1",
        "true",
        "yes",
        "on",
    }
    diagnostics = build_runtime_contract_diagnostics(strict=strict)
    if diagnostics["valid"]:
        return
    details = "; ".join(diagnostics["errors"])
    raise RuntimeError(f"Runtime contract validation failed: {details}")


def log_restart_audit(event: str, payload: dict) -> None:
    entry = {
        "event": event,
        "timestamp": datetime.utcnow().isoformat(),
        "runtime_contract_version": RUNTIME_CONTRACT_VERSION,
        **payload,
    }
    logger.info("worker_watchdog_audit %s", json.dumps(entry, ensure_ascii=False))


def get_gunicorn_pid() -> int | None:
    """Get Gunicorn master PID from PID file.
@@ -155,7 +190,12 @@ def save_restart_state(
    requested_at: str | None = None,
    requested_ip: str | None = None,
    completed_at: str | None = None,
-    success: bool = True
+    success: bool = True,
    source: str = "manual",
    decision: str = "allowed",
    decision_reason: str | None = None,
    manual_override: bool = False,
    policy_state: dict | None = None,
) -> None:
    """Save restart state for status queries.
@@ -173,7 +213,12 @@ def save_restart_state(
        "requested_at": requested_at,
        "requested_ip": requested_ip,
        "completed_at": completed_at,
-        "success": success
+        "success": success,
        "source": source,
        "decision": decision,
        "decision_reason": decision_reason,
        "manual_override": manual_override,
        "policy_state": policy_state or {},
    }

    current_state = load_restart_state()
    history = current_state.get("history", [])
@@ -229,6 +274,47 @@ def process_restart_request() -> bool:
        return False

    logger.info(f"Restart flag detected: {flag_data}")

    source = str(flag_data.get("source") or "manual").strip().lower()
    manual_override = bool(flag_data.get("manual_override"))
    override_ack = bool(flag_data.get("override_acknowledged"))
    restart_state = load_restart_state()
    restart_history = extract_restart_history(restart_state)
    policy_state = evaluate_worker_recovery_state(
        restart_history,
        last_requested_at=extract_last_requested_at(restart_state),
    )
    decision = decide_restart_request(
        policy_state,
        source=source,
        manual_override=manual_override,
        override_acknowledged=override_ack,
    )
    if not decision["allowed"]:
        remove_restart_flag()
        save_restart_state(
            requested_by=flag_data.get("user"),
            requested_at=flag_data.get("timestamp"),
            requested_ip=flag_data.get("ip"),
            completed_at=datetime.now().isoformat(),
            success=False,
            source=source,
            decision=decision["decision"],
            decision_reason=decision["reason"],
            manual_override=manual_override,
            policy_state=policy_state,
        )
        log_restart_audit(
            "restart_blocked",
            {
                "source": source,
                "actor": flag_data.get("user"),
                "ip": flag_data.get("ip"),
                "decision": decision,
                "policy_state": policy_state,
            },
        )
        return True

    # Get Gunicorn master PID
    pid = get_gunicorn_pid()
@@ -242,7 +328,22 @@ def process_restart_request() -> bool:
            requested_at=flag_data.get("timestamp"),
            requested_ip=flag_data.get("ip"),
            completed_at=datetime.now().isoformat(),
-            success=False
+            success=False,
            source=source,
            decision="failed",
            decision_reason="gunicorn_pid_unavailable",
            manual_override=manual_override,
            policy_state=policy_state,
        )
        log_restart_audit(
            "restart_failed",
            {
                "source": source,
                "actor": flag_data.get("user"),
                "ip": flag_data.get("ip"),
                "decision_reason": "gunicorn_pid_unavailable",
                "policy_state": policy_state,
            },
        )
        return True
@@ -258,7 +359,12 @@ def process_restart_request() -> bool:
        requested_at=flag_data.get("timestamp"),
        requested_ip=flag_data.get("ip"),
        completed_at=datetime.now().isoformat(),
-        success=success
+        success=success,
        source=source,
        decision="executed" if success else "failed",
        decision_reason="signal_sighup" if success else "signal_failed",
        manual_override=manual_override,
        policy_state=policy_state,
    )

    if success:
@@ -267,17 +373,44 @@ def process_restart_request() -> bool:
            f"Requested by: {flag_data.get('user', 'unknown')}, "
            f"IP: {flag_data.get('ip', 'unknown')}"
        )
        log_restart_audit(
            "restart_executed",
            {
                "source": source,
                "actor": flag_data.get("user"),
                "ip": flag_data.get("ip"),
                "manual_override": manual_override,
                "policy_state": policy_state,
            },
        )
    else:
        log_restart_audit(
            "restart_failed",
            {
                "source": source,
                "actor": flag_data.get("user"),
                "ip": flag_data.get("ip"),
                "decision_reason": "signal_failed",
                "policy_state": policy_state,
            },
        )

    return True


def run_watchdog() -> None:
    """Main watchdog loop."""
    validate_runtime_contract_or_raise()
    policy = get_worker_recovery_policy_config()
    logger.info(
        f"Worker watchdog started - "
        f"Check interval: {CHECK_INTERVAL}s, "
        f"Flag path: {RESTART_FLAG_PATH}, "
-        f"PID file: {GUNICORN_PID_FILE}"
+        f"PID file: {GUNICORN_PID_FILE}, "
        f"Policy(cooldown={policy['cooldown_seconds']}s, "
        f"retry_budget={policy['retry_budget']}, "
        f"window={policy['window_seconds']}s, "
        f"guarded={policy['guarded_mode_enabled']})"
    )

    while True:

@@ -3,24 +3,48 @@
from __future__ import annotations

import atexit
import logging
import os
import sys
import threading

from flask import Flask, jsonify, redirect, render_template, request, session, url_for

from mes_dashboard.config.tables import TABLES_CONFIG
from mes_dashboard.config.settings import get_config
from mes_dashboard.core.cache import create_default_cache_backend
-from mes_dashboard.core.database import get_table_data, get_table_columns, get_engine, init_db, start_keepalive
+from mes_dashboard.core.database import (
+    get_table_data,
+    get_table_columns,
+    get_engine,
+    init_db,
+    start_keepalive,
+    dispose_engine,
+    install_log_redaction_filter,
+)
from mes_dashboard.core.permissions import is_admin_logged_in, _is_ajax_request
from mes_dashboard.core.csrf import (
    get_csrf_token,
    should_enforce_csrf,
    validate_csrf,
)
from mes_dashboard.routes import register_routes
from mes_dashboard.routes.auth_routes import auth_bp
from mes_dashboard.routes.admin_routes import admin_bp
from mes_dashboard.routes.health_routes import health_bp
from mes_dashboard.services.page_registry import get_page_status, is_api_public
from mes_dashboard.core.cache_updater import start_cache_updater, stop_cache_updater
-from mes_dashboard.services.realtime_equipment_cache import init_realtime_equipment_cache
+from mes_dashboard.services.realtime_equipment_cache import (
+    init_realtime_equipment_cache,
+    stop_equipment_status_sync_worker,
+)
from mes_dashboard.core.redis_client import close_redis
from mes_dashboard.core.runtime_contract import build_runtime_contract_diagnostics

_SHUTDOWN_LOCK = threading.Lock()
_ATEXIT_REGISTERED = False


def _configure_logging(app: Flask) -> None:
@@ -63,6 +87,121 @@ def _configure_logging(app: Flask) -> None:
    # Prevent propagation to root logger (avoid duplicate logs)
    logger.propagate = False
    install_log_redaction_filter(logger)


def _is_production_env(app: Flask) -> bool:
    env_value = str(app.config.get("ENV") or os.getenv("FLASK_ENV") or "production").lower()
    return env_value in {"prod", "production"}


def _build_security_headers(production: bool) -> dict[str, str]:
    headers = {
        "Content-Security-Policy": (
            "default-src 'self'; "
            "script-src 'self' 'unsafe-inline' 'unsafe-eval'; "
            "style-src 'self' 'unsafe-inline'; "
            "img-src 'self' data: blob:; "
            "font-src 'self' data:; "
            "connect-src 'self'; "
            "frame-ancestors 'none'; "
            "base-uri 'self'; "
            "form-action 'self'"
        ),
        "X-Frame-Options": "DENY",
        "X-Content-Type-Options": "nosniff",
        "Referrer-Policy": "strict-origin-when-cross-origin",
    }
    if production:
        headers["Strict-Transport-Security"] = "max-age=31536000; includeSubDomains"
    return headers


def _resolve_secret_key(app: Flask) -> str:
    env_name = str(app.config.get("ENV") or os.getenv("FLASK_ENV") or "development").lower()
    configured = os.environ.get("SECRET_KEY") or app.config.get("SECRET_KEY")
    insecure_defaults = {"", "dev-secret-key-change-in-prod"}
    if configured and configured not in insecure_defaults:
        return configured
    if env_name in {"production", "prod"}:
        raise RuntimeError(
            "SECRET_KEY is required in production and cannot use insecure defaults."
        )
    # Development and testing get explicit environment-safe defaults.
    if env_name in {"testing", "test"}:
        return "test-secret-key"
    return "dev-local-only-secret-key"


def _shutdown_runtime_resources() -> None:
    """Stop background workers and shared clients during app/worker shutdown."""
    logger = logging.getLogger("mes_dashboard")
    try:
        stop_cache_updater()
    except Exception as exc:
        logger.warning("Error stopping cache updater: %s", exc)
    try:
        stop_equipment_status_sync_worker()
    except Exception as exc:
        logger.warning("Error stopping equipment sync worker: %s", exc)
    try:
        close_redis()
    except Exception as exc:
        logger.warning("Error closing Redis client: %s", exc)
    try:
        dispose_engine()
    except Exception as exc:
        logger.warning("Error disposing DB engines: %s", exc)


def _register_shutdown_hooks(app: Flask) -> None:
    global _ATEXIT_REGISTERED
    app.extensions["runtime_shutdown"] = _shutdown_runtime_resources
    if app.extensions.get("runtime_shutdown_registered"):
        return
    app.extensions["runtime_shutdown_registered"] = True
    if app.testing or bool(app.config.get("TESTING")) or os.getenv("PYTEST_CURRENT_TEST"):
        return
    with _SHUTDOWN_LOCK:
if not _ATEXIT_REGISTERED:
atexit.register(_shutdown_runtime_resources)
_ATEXIT_REGISTERED = True
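The hook registration above guards `atexit.register` with a lock plus a module-level flag so repeated `create_app` calls register cleanup only once. A minimal model of that once-only pattern; `register` here is a stand-in for `atexit.register` so the sketch stays testable:

```python
import threading

class ShutdownHookGuard:
    """Register a shutdown callback at most once, even across racing callers."""

    def __init__(self, register):
        self._register = register  # stand-in for atexit.register
        self._lock = threading.Lock()
        self._registered = False

    def ensure_registered(self, callback) -> bool:
        # Lock makes the check-then-register step atomic across threads.
        with self._lock:
            if self._registered:
                return False
            self._register(callback)
            self._registered = True
            return True
```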
def _is_runtime_contract_enforced(app: Flask) -> bool:
raw = os.getenv("RUNTIME_CONTRACT_ENFORCE")
if raw is not None:
return raw.strip().lower() in {"1", "true", "yes", "on"}
return _is_production_env(app)
def _validate_runtime_contract(app: Flask) -> None:
strict = _is_runtime_contract_enforced(app)
diagnostics = build_runtime_contract_diagnostics(strict=strict)
app.extensions["runtime_contract"] = diagnostics["contract"]
app.extensions["runtime_contract_validation"] = {
"valid": diagnostics["valid"],
"strict": diagnostics["strict"],
"errors": diagnostics["errors"],
}
if diagnostics["valid"]:
return
message = "Runtime contract validation failed: " + "; ".join(diagnostics["errors"])
if strict:
raise RuntimeError(message)
logging.getLogger("mes_dashboard").warning(message)
def create_app(config_name: str | None = None) -> Flask:
@@ -72,19 +211,22 @@ def create_app(config_name: str | None = None) -> Flask:
    config_class = get_config(config_name)
    app.config.from_object(config_class)

-    # Session configuration
+    # Session configuration with environment-aware secret validation.
-    app.secret_key = os.environ.get("SECRET_KEY", "dev-secret-key-change-in-prod")
+    app.secret_key = _resolve_secret_key(app)
+    app.config["SECRET_KEY"] = app.secret_key

    # Session cookie security settings
-    # SECURE: Only send cookie over HTTPS (disable for local development)
+    # SECURE: Only send cookie over HTTPS in production.
-    app.config['SESSION_COOKIE_SECURE'] = os.environ.get("FLASK_ENV") == "production"
+    app.config['SESSION_COOKIE_SECURE'] = _is_production_env(app)
    # HTTPONLY: Prevent JavaScript access to session cookie (XSS protection)
    app.config['SESSION_COOKIE_HTTPONLY'] = True
-    # SAMESITE: Prevent CSRF by restricting cross-site cookie sending
+    # SAMESITE: strict in production, relaxed for local development usability.
-    app.config['SESSION_COOKIE_SAMESITE'] = 'Lax'
+    app.config['SESSION_COOKIE_SAMESITE'] = 'Strict' if _is_production_env(app) else 'Lax'

    # Configure logging first
    _configure_logging(app)
+    _validate_runtime_contract(app)
+    security_headers = _build_security_headers(_is_production_env(app))

    # Route-level cache backend (L1 memory + optional L2 Redis)
    app.extensions["cache"] = create_default_cache_backend()
@@ -96,6 +238,7 @@ def create_app(config_name: str | None = None) -> Flask:
    start_keepalive()  # Keep database connections alive
    start_cache_updater()  # Start Redis cache updater
    init_realtime_equipment_cache(app)  # Start realtime equipment status cache
_register_shutdown_hooks(app)
    # Register API routes
    register_routes(app)
@@ -150,6 +293,34 @@ def create_app(config_name: str | None = None) -> Flask:
        return None
@app.before_request
def enforce_csrf():
if not should_enforce_csrf(
request,
enabled=bool(app.config.get("CSRF_ENABLED", True)),
):
return None
if validate_csrf(request):
return None
if request.path == "/admin/login":
return render_template("login.html", error="CSRF 驗證失敗,請重新提交"), 403
from mes_dashboard.core.response import error_response, FORBIDDEN
return error_response(
FORBIDDEN,
"CSRF 驗證失敗",
status_code=403,
)
@app.after_request
def apply_security_headers(response):
for header, value in security_headers.items():
response.headers.setdefault(header, value)
return response
    # ========================================================
    # Template Context Processor
    # ========================================================
@@ -185,6 +356,7 @@ def create_app(config_name: str | None = None) -> Flask:
            "admin_user": session.get("admin"),
            "can_view_page": can_view_page,
            "frontend_asset": frontend_asset,
"csrf_token": get_csrf_token,
        }

    # ========================================================

@@ -20,6 +20,13 @@ def _float_env(name: str, default: float) -> float:
        return default
def _bool_env(name: str, default: bool) -> bool:
value = os.getenv(name)
if value is None:
return default
return value.strip().lower() in {"1", "true", "yes", "on"}
class Config:
    """Base configuration."""
@@ -40,7 +47,8 @@ class Config:
    # Auth configuration - MUST be set in .env file
    LDAP_API_URL = os.getenv("LDAP_API_URL", "")
    ADMIN_EMAILS = os.getenv("ADMIN_EMAILS", "")
-    SECRET_KEY = os.getenv("SECRET_KEY", "dev-secret-key-change-in-prod")
+    SECRET_KEY = os.getenv("SECRET_KEY")
+    CSRF_ENABLED = _bool_env("CSRF_ENABLED", True)

    # Session configuration
    PERMANENT_SESSION_LIFETIME = _int_env("SESSION_LIFETIME", 28800)  # 8 hours
@@ -103,6 +111,7 @@ class TestingConfig(Config):
    DB_CONNECT_RETRY_COUNT = 0
    DB_CONNECT_RETRY_DELAY = 0.0
    DB_CALL_TIMEOUT_MS = 5000
CSRF_ENABLED = False
def get_config(env: str | None = None) -> Type[Config]:

@@ -10,8 +10,10 @@ from __future__ import annotations
import io
import json
import logging
+import os
import threading
import time
+from collections import OrderedDict
from typing import Any, Optional, Protocol, Tuple

import pandas as pd
@@ -39,26 +41,49 @@ class ProcessLevelCache:
    Uses a lock to ensure only one thread parses at a time.
    """

-    def __init__(self, ttl_seconds: int = 30):
+    def __init__(self, ttl_seconds: int = 30, max_size: int = 32):
-        self._cache: dict[str, Tuple[pd.DataFrame, float]] = {}
+        self._cache: OrderedDict[str, Tuple[pd.DataFrame, float]] = OrderedDict()
        self._lock = threading.Lock()
-        self._ttl = ttl_seconds
+        self._ttl = max(int(ttl_seconds), 1)
+        self._max_size = max(int(max_size), 1)
@property
def max_size(self) -> int:
return self._max_size
def _evict_expired_locked(self, now: float) -> None:
stale_keys = [
key for key, (_, timestamp) in self._cache.items()
if now - timestamp > self._ttl
]
for key in stale_keys:
self._cache.pop(key, None)
    def get(self, key: str) -> Optional[pd.DataFrame]:
        """Get cached DataFrame if not expired."""
        with self._lock:
-            if key not in self._cache:
-                return None
-            df, timestamp = self._cache[key]
-            if time.time() - timestamp > self._ttl:
-                del self._cache[key]
-                return None
+            payload = self._cache.get(key)
+            if payload is None:
+                return None
+            df, timestamp = payload
+            now = time.time()
+            if now - timestamp > self._ttl:
+                self._cache.pop(key, None)
+                return None
+            self._cache.move_to_end(key, last=True)
            return df

    def set(self, key: str, df: pd.DataFrame) -> None:
        """Cache a DataFrame with current timestamp."""
        with self._lock:
-            self._cache[key] = (df, time.time())
+            now = time.time()
+            self._evict_expired_locked(now)
+            if key in self._cache:
+                self._cache.pop(key, None)
+            elif len(self._cache) >= self._max_size:
+                self._cache.popitem(last=False)
+            self._cache[key] = (df, now)
+            self._cache.move_to_end(key, last=True)
    def invalidate(self, key: str) -> None:
        """Remove a key from cache."""
@@ -71,8 +96,26 @@ class ProcessLevelCache:
            self._cache.clear()
def _resolve_cache_max_size(env_name: str, default: int) -> int:
value = os.getenv(env_name)
if value is None:
return max(int(default), 1)
try:
return max(int(value), 1)
except (TypeError, ValueError):
return max(int(default), 1)
# Global process-level cache for WIP DataFrame (30s TTL)
-_wip_df_cache = ProcessLevelCache(ttl_seconds=30)
+PROCESS_CACHE_MAX_SIZE = _resolve_cache_max_size("PROCESS_CACHE_MAX_SIZE", 32)
+WIP_PROCESS_CACHE_MAX_SIZE = _resolve_cache_max_size(
+    "WIP_PROCESS_CACHE_MAX_SIZE",
+    PROCESS_CACHE_MAX_SIZE,
+)
+_wip_df_cache = ProcessLevelCache(
+    ttl_seconds=30,
+    max_size=WIP_PROCESS_CACHE_MAX_SIZE,
+)
_wip_parse_lock = threading.Lock()
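The bounded cache above combines a TTL sweep with LRU eviction on top of `OrderedDict`. A condensed, DataFrame-free sketch of the same policy (the class name `TTLLRUCache` is hypothetical, not the production class):

```python
import time
from collections import OrderedDict

class TTLLRUCache:
    """Entries expire after ttl seconds; beyond max_size the least
    recently used entry is evicted first."""

    def __init__(self, ttl: float = 30.0, max_size: int = 32):
        self._data: OrderedDict[str, tuple[object, float]] = OrderedDict()
        self._ttl = ttl
        self._max_size = max(int(max_size), 1)

    def get(self, key):
        item = self._data.get(key)
        if item is None:
            return None
        value, stamp = item
        if time.monotonic() - stamp > self._ttl:
            self._data.pop(key, None)  # expired entry
            return None
        self._data.move_to_end(key)  # mark as most recently used
        return value

    def set(self, key, value):
        self._data.pop(key, None)  # re-insert moves key to the MRU end
        if len(self._data) >= self._max_size:
            self._data.popitem(last=False)  # evict the LRU entry
        self._data[key] = (value, time.monotonic())
```

`move_to_end` on hits is what turns insertion order into recency order, so `popitem(last=False)` always drops the coldest entry.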
# ============================================================
@@ -328,14 +371,6 @@ def get_cached_wip_data() -> Optional[pd.DataFrame]:
    if client is None:
        return None

-    # Use lock to prevent multiple threads from parsing simultaneously
-    with _wip_parse_lock:
-        # Double-check after acquiring lock (another thread may have parsed)
-        cached_df = _wip_df_cache.get(cache_key)
-        if cached_df is not None:
-            logger.debug(f"Process cache hit (after lock): {len(cached_df)} rows")
-            return cached_df

    try:
        start_time = time.time()
        data_json = client.get(get_key("data"))
@@ -343,19 +378,24 @@ def get_cached_wip_data() -> Optional[pd.DataFrame]:
            logger.debug("Cache miss: no data in Redis")
            return None

-        # Parse JSON to DataFrame
-        df = pd.read_json(io.StringIO(data_json), orient='records')
+        # Parse outside lock to reduce contention on hot paths.
+        parsed_df = pd.read_json(io.StringIO(data_json), orient='records')
        parse_time = time.time() - start_time
-        # Store in process-level cache
-        _wip_df_cache.set(cache_key, df)
-        logger.debug(f"Cache hit: loaded {len(df)} rows from Redis (parsed in {parse_time:.2f}s)")
-        return df
    except Exception as e:
        logger.warning(f"Failed to read cache: {e}")
        return None
# Keep lock scope tight: consistency check + cache write only.
with _wip_parse_lock:
cached_df = _wip_df_cache.get(cache_key)
if cached_df is not None:
logger.debug(f"Process cache hit (after parse): {len(cached_df)} rows")
return cached_df
_wip_df_cache.set(cache_key, parsed_df)
logger.debug(f"Cache hit: loaded {len(parsed_df)} rows from Redis (parsed in {parse_time:.2f}s)")
return parsed_df
def get_cached_sys_date() -> Optional[str]:
    """Get cached SYS_DATE from Redis.

@@ -221,7 +221,7 @@ class CacheUpdater:
        return None

    def _update_redis_cache(self, df: pd.DataFrame, sys_date: str) -> bool:
-        """Update Redis cache with new data using pipeline for atomicity.
+        """Update Redis cache with staged publish for coherent snapshot visibility.

        Args:
            df: DataFrame with full table data.
@@ -234,18 +234,24 @@ class CacheUpdater:
        if client is None:
            return False

+        staging_key: str | None = None
        try:
            # Convert DataFrame to JSON
            # Handle datetime columns
-            for col in df.select_dtypes(include=['datetime64']).columns:
-                df[col] = df[col].astype(str)
+            df_copy = df.copy()
+            for col in df_copy.select_dtypes(include=['datetime64']).columns:
+                df_copy[col] = df_copy[col].astype(str)

-            data_json = df.to_json(orient='records', force_ascii=False)
+            data_json = df_copy.to_json(orient='records', force_ascii=False)

-            # Atomic update using pipeline
+            # Stage payload first, then atomically publish live key + metadata.
            now = datetime.now().isoformat()
+            unique_suffix = f"{int(time.time() * 1000)}:{threading.get_ident()}"
+            staging_key = get_key(f"data:staging:{unique_suffix}")
            pipe = client.pipeline()
-            pipe.set(get_key("data"), data_json)
+            pipe.set(staging_key, data_json)
+            pipe.rename(staging_key, get_key("data"))
            pipe.set(get_key("meta:sys_date"), sys_date)
            pipe.set(get_key("meta:updated_at"), now)
            pipe.execute()
@@ -253,6 +259,11 @@ class CacheUpdater:
            return True
        except Exception as e:
            logger.error(f"Failed to update Redis cache: {e}")
if staging_key:
try:
client.delete(staging_key)
except Exception:
pass
            return False

    def _check_resource_update(self, force: bool = False) -> bool:
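The staged publish above leans on Redis `RENAME` being atomic: readers of the live key see either the old snapshot or the complete new one, never a half-written payload. A sketch of the pattern against a hypothetical in-memory store exposing the same `set`/`rename`/`delete` subset:

```python
class FakeStore:
    """Minimal stand-in with the set/rename/delete subset used by the updater."""

    def __init__(self):
        self.data: dict[str, str] = {}

    def set(self, key: str, value: str) -> None:
        self.data[key] = value

    def rename(self, src: str, dst: str) -> None:
        # Models Redis RENAME: the destination flips to the staged value in one step.
        self.data[dst] = self.data.pop(src)

    def delete(self, key: str) -> None:
        self.data.pop(key, None)

def staged_publish(store: FakeStore, live_key: str, payload: str) -> None:
    staging_key = f"{live_key}:staging"
    store.set(staging_key, payload)      # write the full payload off to the side
    store.rename(staging_key, live_key)  # flip visibility atomically
```

On failure before the rename, the diff's cleanup path deletes the orphaned staging key so stale payloads do not accumulate.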

@@ -130,12 +130,16 @@ class CircuitBreaker:
    @property
    def state(self) -> CircuitState:
        """Get current circuit state, handling state transitions."""
+        transition_log: tuple[int, str] | None = None
        with self._lock:
            if self._state == CircuitState.OPEN:
                # Check if we should transition to HALF_OPEN
                if self._open_time and time.time() - self._open_time >= self.recovery_timeout:
-                    self._transition_to(CircuitState.HALF_OPEN)
+                    transition_log = self._transition_to_locked(CircuitState.HALF_OPEN)
-            return self._state
+            current_state = self._state
+        if transition_log:
+            self._emit_transition_log(*transition_log)
+        return current_state
    def allow_request(self) -> bool:
        """Check if a request should be allowed.
@@ -161,45 +165,57 @@ class CircuitBreaker:
        if not CIRCUIT_BREAKER_ENABLED:
            return

+        transition_log: tuple[int, str] | None = None
        with self._lock:
            self._results.append(True)
            if self._state == CircuitState.HALF_OPEN:
                # Success in half-open means we can close
-                self._transition_to(CircuitState.CLOSED)
+                transition_log = self._transition_to_locked(CircuitState.CLOSED)
+        if transition_log:
+            self._emit_transition_log(*transition_log)
    def record_failure(self) -> None:
        """Record a failed operation."""
        if not CIRCUIT_BREAKER_ENABLED:
            return

+        transition_log: tuple[int, str] | None = None
        with self._lock:
            self._results.append(False)
            self._last_failure_time = time.time()
            if self._state == CircuitState.HALF_OPEN:
                # Failure in half-open means back to open
-                self._transition_to(CircuitState.OPEN)
+                transition_log = self._transition_to_locked(CircuitState.OPEN)
            elif self._state == CircuitState.CLOSED:
                # Check if we should open
-                self._check_and_open()
+                transition_log = self._check_and_open_locked()
+        if transition_log:
+            self._emit_transition_log(*transition_log)

-    def _check_and_open(self) -> None:
+    def _check_and_open_locked(self) -> tuple[int, str] | None:
        """Check failure rate and open circuit if needed.

        Must be called with lock held.
        """
        if len(self._results) < self.failure_threshold:
-            return
+            return None

        failure_count = sum(1 for r in self._results if not r)
        failure_rate = failure_count / len(self._results)

        if (failure_count >= self.failure_threshold and
                failure_rate >= self.failure_rate_threshold):
-            self._transition_to(CircuitState.OPEN)
+            return self._transition_to_locked(CircuitState.OPEN)
+        return None

+    def _emit_transition_log(self, level: int, message: str) -> None:
+        logger.log(level, message)

-    def _transition_to(self, new_state: CircuitState) -> None:
+    def _transition_to_locked(self, new_state: CircuitState) -> tuple[int, str]:
        """Transition to a new state with logging.

        Must be called with lock held.
@@ -209,20 +225,22 @@ class CircuitBreaker:
        if new_state == CircuitState.OPEN:
            self._open_time = time.time()
-            logger.warning(
+            return (
+                logging.WARNING,
                f"Circuit breaker '{self.name}' OPENED: "
                f"state {old_state.value} -> {new_state.value}, "
                f"failures: {sum(1 for r in self._results if not r)}/{len(self._results)}"
            )
        elif new_state == CircuitState.HALF_OPEN:
-            logger.info(
+            return (
+                logging.INFO,
                f"Circuit breaker '{self.name}' entering HALF_OPEN: "
                f"testing service recovery..."
            )
+        elif new_state == CircuitState.CLOSED:
            self._open_time = None
            self._results.clear()
-            logger.info(
+            return (
+                logging.INFO,
                f"Circuit breaker '{self.name}' CLOSED: "
                f"service recovered"
            )

@@ -0,0 +1,85 @@
# -*- coding: utf-8 -*-
"""CSRF token utilities for admin form and API mutation protection."""
from __future__ import annotations
import hmac
import secrets
from typing import Optional
from flask import Request, request, session
CSRF_SESSION_KEY = "_csrf_token"
CSRF_HEADER_NAME = "X-CSRF-Token"
CSRF_FORM_FIELD = "csrf_token"
_MUTATING_METHODS = {"POST", "PUT", "PATCH", "DELETE"}
def _new_csrf_token() -> str:
return secrets.token_urlsafe(32)
def get_csrf_token() -> str:
"""Get a stable CSRF token for the current session."""
token = session.get(CSRF_SESSION_KEY)
if not token:
token = _new_csrf_token()
session[CSRF_SESSION_KEY] = token
return token
def rotate_csrf_token() -> str:
"""Rotate session CSRF token after authentication state changes."""
token = _new_csrf_token()
session[CSRF_SESSION_KEY] = token
return token
def _extract_request_token(req: Request) -> Optional[str]:
header_token = req.headers.get(CSRF_HEADER_NAME)
if header_token:
return header_token
form_token = req.form.get(CSRF_FORM_FIELD)
if form_token:
return form_token
if req.is_json:
payload = req.get_json(silent=True) or {}
json_token = payload.get(CSRF_FORM_FIELD)
if json_token:
return str(json_token)
return None
def should_enforce_csrf(req: Request = request, enabled: bool = True) -> bool:
"""Determine whether current request needs CSRF validation."""
if not enabled:
return False
if req.method.upper() not in _MUTATING_METHODS:
return False
path = req.path or ""
if path == "/admin/login":
return True
if path.startswith("/admin/api/"):
return True
if path.startswith("/admin/"):
return True
return False
def validate_csrf(req: Request = request) -> bool:
"""Validate request CSRF token against current session token."""
expected = session.get(CSRF_SESSION_KEY)
if not expected:
return False
provided = _extract_request_token(req)
if not provided:
return False
return hmac.compare_digest(str(expected), str(provided))

@@ -51,6 +51,59 @@ from mes_dashboard.config.settings import get_config
# Configure module logger
logger = logging.getLogger('mes_dashboard.database')
_REDACTION_INSTALLED = False
_ORACLE_URL_RE = re.compile(r"(oracle\+oracledb://[^:\s/]+:)([^@/\s]+)(@)")
_ENV_SECRET_RE = re.compile(r"(DB_PASSWORD=)([^\s]+)")
def redact_connection_secrets(message: str) -> str:
"""Redact DB credentials from log message text."""
if not message:
return message
sanitized = _ORACLE_URL_RE.sub(r"\1***\3", message)
sanitized = _ENV_SECRET_RE.sub(r"\1***", sanitized)
return sanitized
class SecretRedactionFilter(logging.Filter):
"""Filter that masks DB connection secrets in log messages."""
def filter(self, record: logging.LogRecord) -> bool:
try:
message = record.getMessage()
except Exception:
return True
sanitized = redact_connection_secrets(message)
if sanitized != message:
record.msg = sanitized
record.args = ()
return True
def install_log_redaction_filter(target_logger: logging.Logger | None = None) -> None:
"""Attach secret-redaction filter to mes_dashboard logging handlers once."""
global _REDACTION_INSTALLED
if target_logger is None and _REDACTION_INSTALLED:
return
logger_obj = target_logger or logging.getLogger("mes_dashboard")
redaction_filter = SecretRedactionFilter()
attached = False
for handler in logger_obj.handlers:
if any(isinstance(f, SecretRedactionFilter) for f in handler.filters):
attached = True
continue
handler.addFilter(redaction_filter)
attached = True
if not attached and not any(isinstance(f, SecretRedactionFilter) for f in logger_obj.filters):
logger_obj.addFilter(redaction_filter)
attached = True
if attached and target_logger is None:
_REDACTION_INSTALLED = True
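The two patterns above mask the password segment of an `oracle+oracledb://` URL and any `DB_PASSWORD=` pair before a message reaches a handler. A self-contained repro of the same substitution, with the regexes copied from the diff:

```python
import re

ORACLE_URL_RE = re.compile(r"(oracle\+oracledb://[^:\s/]+:)([^@/\s]+)(@)")
ENV_SECRET_RE = re.compile(r"(DB_PASSWORD=)([^\s]+)")

def redact(message: str) -> str:
    """Mask credentials in connection URLs and env-style assignments."""
    sanitized = ORACLE_URL_RE.sub(r"\1***\3", message)
    return ENV_SECRET_RE.sub(r"\1***", sanitized)
```

Keeping the username and host visible while masking only the password preserves enough context to debug connection errors from logs.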
# ============================================================
# SQLAlchemy Engine (QueuePool - connection pooling)
# ============================================================
@@ -59,6 +112,7 @@ logger = logging.getLogger('mes_dashboard.database')
# pool_recycle prevents stale connections from firewalls/NAT.
_ENGINE = None
+_HEALTH_ENGINE = None
_DB_RUNTIME_CONFIG: Optional[Dict[str, Any]] = None
@@ -132,6 +186,13 @@ def get_db_runtime_config(refresh: bool = False) -> Dict[str, Any]:
        "retry_count": _from_app_or_env_int("DB_CONNECT_RETRY_COUNT", config_class.DB_CONNECT_RETRY_COUNT),
        "retry_delay": _from_app_or_env_float("DB_CONNECT_RETRY_DELAY", config_class.DB_CONNECT_RETRY_DELAY),
        "call_timeout_ms": _from_app_or_env_int("DB_CALL_TIMEOUT_MS", config_class.DB_CALL_TIMEOUT_MS),
"health_pool_size": _from_app_or_env_int("DB_HEALTH_POOL_SIZE", 1),
"health_max_overflow": _from_app_or_env_int("DB_HEALTH_MAX_OVERFLOW", 0),
"health_pool_timeout": _from_app_or_env_int("DB_HEALTH_POOL_TIMEOUT", 2),
"pool_exhausted_retry_after_seconds": _from_app_or_env_int(
"DB_POOL_EXHAUSTED_RETRY_AFTER_SECONDS",
5,
),
    }
    return _DB_RUNTIME_CONFIG.copy()
@@ -202,6 +263,42 @@ def get_engine():
    return _ENGINE
def get_health_engine():
"""Get dedicated SQLAlchemy engine for health probes.
Health checks use a tiny isolated pool so status probes remain available
when the request pool is saturated.
"""
global _HEALTH_ENGINE
if _HEALTH_ENGINE is None:
runtime = get_db_runtime_config()
_HEALTH_ENGINE = create_engine(
CONNECTION_STRING,
poolclass=QueuePool,
pool_size=max(int(runtime["health_pool_size"]), 1),
max_overflow=max(int(runtime["health_max_overflow"]), 0),
pool_timeout=max(int(runtime["health_pool_timeout"]), 1),
pool_recycle=runtime["pool_recycle"],
pool_pre_ping=True,
connect_args={
"tcp_connect_timeout": runtime["tcp_connect_timeout"],
"retry_count": runtime["retry_count"],
"retry_delay": runtime["retry_delay"],
},
)
_register_pool_events(
_HEALTH_ENGINE,
min(int(runtime["call_timeout_ms"]), 10_000),
)
logger.info(
"Health engine created (pool_size=%s, max_overflow=%s, pool_timeout=%s)",
runtime["health_pool_size"],
runtime["health_max_overflow"],
runtime["health_pool_timeout"],
)
return _HEALTH_ENGINE
def _register_pool_events(engine, call_timeout_ms: int):
    """Register event listeners for connection pool monitoring."""
@@ -302,8 +399,12 @@ def dispose_engine():
    Call this during application shutdown to cleanly release resources.
    """
-    global _ENGINE, _DB_RUNTIME_CONFIG
+    global _ENGINE, _HEALTH_ENGINE, _DB_RUNTIME_CONFIG
    stop_keepalive()
if _HEALTH_ENGINE is not None:
_HEALTH_ENGINE.dispose()
logger.info("Health engine disposed")
_HEALTH_ENGINE = None
    if _ENGINE is not None:
        _ENGINE.dispose()
        logger.info("Database engine disposed, all connections closed")
@@ -432,9 +533,13 @@ def read_sql_df(sql: str, params: Optional[Dict[str, Any]] = None) -> pd.DataFra
            elapsed,
            exc,
        )
+        retry_after = max(
+            int(get_db_runtime_config().get("pool_exhausted_retry_after_seconds", 5)),
+            1,
+        )
        raise DatabasePoolExhaustedError(
            "Database connection pool exhausted",
-            retry_after_seconds=5,
+            retry_after_seconds=retry_after,
        ) from exc
    except Exception as exc:
        elapsed = time.time() - start_time

@@ -0,0 +1,103 @@
# -*- coding: utf-8 -*-
"""Lightweight in-process rate limiting helpers for high-cost routes."""
from __future__ import annotations
import os
import threading
import time
from collections import defaultdict, deque
from functools import wraps
from typing import Callable, Deque
from flask import request
from mes_dashboard.core.response import TOO_MANY_REQUESTS, error_response
_RATE_LOCK = threading.Lock()
_RATE_ATTEMPTS: dict[str, dict[str, Deque[float]]] = defaultdict(lambda: defaultdict(deque))
def _env_int(name: str, default: int) -> int:
raw = os.getenv(name)
if raw is None:
return int(default)
try:
value = int(raw)
except (TypeError, ValueError):
return int(default)
return max(value, 1)
def _client_identifier() -> str:
forwarded = request.headers.get("X-Forwarded-For", "").strip()
if forwarded:
return forwarded.split(",")[0].strip()
return request.remote_addr or "unknown"
def check_and_record(
bucket: str,
*,
client_id: str,
max_attempts: int,
window_seconds: int,
) -> tuple[bool, int]:
"""Check and record request attempt for a bucket+client pair."""
now = time.time()
window_start = now - max(window_seconds, 1)
with _RATE_LOCK:
per_bucket = _RATE_ATTEMPTS[bucket]
attempts = per_bucket[client_id]
while attempts and attempts[0] <= window_start:
attempts.popleft()
if len(attempts) >= max_attempts:
retry_after = max(int(window_seconds - (now - attempts[0])), 1)
return True, retry_after
attempts.append(now)
return False, 0
def configured_rate_limit(
*,
bucket: str,
max_attempts_env: str,
window_seconds_env: str,
default_max_attempts: int,
default_window_seconds: int,
) -> Callable:
"""Build a route decorator with env-configurable rate limits."""
max_attempts = _env_int(max_attempts_env, default_max_attempts)
window_seconds = _env_int(window_seconds_env, default_window_seconds)
def decorator(func: Callable) -> Callable:
@wraps(func)
def wrapped(*args, **kwargs):
limited, retry_after = check_and_record(
bucket,
client_id=_client_identifier(),
max_attempts=max_attempts,
window_seconds=window_seconds,
)
if limited:
return error_response(
TOO_MANY_REQUESTS,
"請求過於頻繁,請稍後再試",
status_code=429,
meta={"retry_after_seconds": retry_after},
headers={"Retry-After": str(retry_after)},
)
return func(*args, **kwargs)
return wrapped
return decorator
def reset_rate_limits_for_tests() -> None:
with _RATE_LOCK:
_RATE_ATTEMPTS.clear()
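`check_and_record` implements a sliding window over per-client timestamp deques: prune stamps older than the window, then either admit the request or report how long until the oldest attempt ages out. The same policy, extracted into a pure function over one deque for clarity (names here are illustrative):

```python
from collections import deque

def sliding_window_limited(attempts: deque, now: float,
                           max_attempts: int, window_seconds: int) -> tuple[bool, int]:
    """Return (limited, retry_after_seconds) for one client bucket."""
    window_start = now - max(window_seconds, 1)
    # Drop timestamps that have aged out of the window.
    while attempts and attempts[0] <= window_start:
        attempts.popleft()
    if len(attempts) >= max_attempts:
        # Retry-After counts down to when the oldest attempt leaves the window.
        return True, max(int(window_seconds - (now - attempts[0])), 1)
    attempts.append(now)
    return False, 0
```

Because attempts are recorded only when admitted, rejected requests do not extend the lockout, which keeps the limiter from punishing clients that back off.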

@@ -0,0 +1,143 @@
# -*- coding: utf-8 -*-
"""Runtime contract helpers shared by app, scripts, and watchdog."""
from __future__ import annotations
import os
import shutil
from pathlib import Path
from typing import Any, Mapping
CONTRACT_VERSION = "2026.02-p2"
DEFAULT_PROJECT_ROOT = Path(__file__).resolve().parents[3]
def _to_bool(value: str | None, default: bool) -> bool:
if value is None:
return default
return value.strip().lower() in {"1", "true", "yes", "on"}
def _resolve_path(value: str | None, fallback: Path, project_root: Path) -> Path:
if value is None or not str(value).strip():
return fallback.resolve()
raw = Path(str(value).strip())
if raw.is_absolute():
return raw.resolve()
return (project_root / raw).resolve()
def load_runtime_contract(
environ: Mapping[str, str] | None = None,
*,
project_root: Path | str | None = None,
) -> dict[str, Any]:
"""Load effective runtime contract from environment with normalized paths."""
env = environ or os.environ
root = Path(project_root or env.get("MES_DASHBOARD_ROOT", DEFAULT_PROJECT_ROOT)).resolve()
runtime_dir = _resolve_path(
env.get("WATCHDOG_RUNTIME_DIR"),
root / "tmp",
root,
)
restart_flag = _resolve_path(
env.get("WATCHDOG_RESTART_FLAG"),
runtime_dir / "mes_dashboard_restart.flag",
root,
)
pid_file = _resolve_path(
env.get("WATCHDOG_PID_FILE"),
runtime_dir / "gunicorn.pid",
root,
)
state_file = _resolve_path(
env.get("WATCHDOG_STATE_FILE"),
runtime_dir / "mes_dashboard_restart_state.json",
root,
)
contract = {
"version": env.get("RUNTIME_CONTRACT_VERSION", CONTRACT_VERSION),
"project_root": str(root),
"gunicorn_bind": env.get("GUNICORN_BIND", "0.0.0.0:8080"),
"conda_bin": (env.get("CONDA_BIN", "") or "").strip(),
"conda_env_name": (env.get("CONDA_ENV_NAME", "mes-dashboard") or "").strip(),
"watchdog_runtime_dir": str(runtime_dir),
"watchdog_restart_flag": str(restart_flag),
"watchdog_pid_file": str(pid_file),
"watchdog_state_file": str(state_file),
"watchdog_check_interval": int(env.get("WATCHDOG_CHECK_INTERVAL", "5")),
"validation_enforced": _to_bool(env.get("RUNTIME_CONTRACT_ENFORCE"), False),
}
return contract
def validate_runtime_contract(
contract: Mapping[str, Any] | None = None,
*,
strict: bool = False,
) -> list[str]:
"""Validate runtime contract and return actionable errors."""
cfg = dict(contract or load_runtime_contract())
errors: list[str] = []
runtime_dir = Path(str(cfg["watchdog_runtime_dir"])).resolve()
restart_flag = Path(str(cfg["watchdog_restart_flag"])).resolve()
pid_file = Path(str(cfg["watchdog_pid_file"])).resolve()
state_file = Path(str(cfg["watchdog_state_file"])).resolve()
if restart_flag.parent != runtime_dir:
errors.append(
"WATCHDOG_RESTART_FLAG must be under WATCHDOG_RUNTIME_DIR "
f"({restart_flag} not under {runtime_dir})."
)
if pid_file.parent != runtime_dir:
errors.append(
"WATCHDOG_PID_FILE must be under WATCHDOG_RUNTIME_DIR "
f"({pid_file} not under {runtime_dir})."
)
if not state_file.is_absolute():
errors.append("WATCHDOG_STATE_FILE must resolve to an absolute path.")
bind = str(cfg.get("gunicorn_bind", "")).strip()
if ":" not in bind:
errors.append(f"GUNICORN_BIND must include host:port (current: {bind!r}).")
conda_bin = str(cfg.get("conda_bin", "")).strip()
if strict and not conda_bin:
conda_on_path = shutil.which("conda")
if not conda_on_path:
errors.append(
"CONDA_BIN is required when strict runtime validation is enabled "
"and conda is not discoverable on PATH."
)
if conda_bin:
conda_path = Path(conda_bin)
if not conda_path.exists():
errors.append(f"CONDA_BIN does not exist: {conda_bin}")
elif not os.access(conda_bin, os.X_OK):
errors.append(f"CONDA_BIN is not executable: {conda_bin}")
conda_env_name = str(cfg.get("conda_env_name", "")).strip()
active_env = (os.getenv("CONDA_DEFAULT_ENV") or "").strip()
if strict and conda_env_name and active_env and active_env != conda_env_name:
errors.append(
"CONDA_DEFAULT_ENV mismatch: "
f"expected {conda_env_name!r}, got {active_env!r}."
)
return errors
def build_runtime_contract_diagnostics(*, strict: bool = False) -> dict[str, Any]:
"""Build diagnostics payload for runtime contract introspection."""
contract = load_runtime_contract()
errors = validate_runtime_contract(contract, strict=strict)
return {
"valid": not errors,
"strict": strict,
"errors": errors,
"contract": contract,
}


@@ -33,6 +33,22 @@ def get_days_back(filters: Optional[Dict] = None, default: int = DEFAULT_DAYS_BA
     return default


+def parse_bool_query(value: Any, default: bool = False) -> bool:
+    """Parse common boolean query parameter values."""
+    if value is None:
+        return default
+    if isinstance(value, bool):
+        return value
+    text = str(value).strip().lower()
+    if not text:
+        return default
+    if text in {"true", "1", "yes", "y", "on"}:
+        return True
+    if text in {"false", "0", "no", "n", "off"}:
+        return False
+    return default
+
+
 # ============================================================
 # SQL Filter Building (DEPRECATED)
 # Use mes_dashboard.sql.CommonFilters with QueryBuilder instead.
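The new helper accepts the usual truthy/falsy spellings and falls back to `default` for anything else. A standalone copy of the logic, to make the behavior concrete:

```python
def parse_bool_query(value, default: bool = False) -> bool:
    """Parse common boolean query parameter values."""
    if value is None:
        return default
    if isinstance(value, bool):
        return value
    text = str(value).strip().lower()
    if not text:
        return default
    if text in {"true", "1", "yes", "y", "on"}:
        return True
    if text in {"false", "0", "no", "n", "off"}:
        return False
    return default
```

Note that unrecognized strings return `default` rather than raising, so callers must pick the safe default per endpoint.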


@@ -0,0 +1,220 @@
# -*- coding: utf-8 -*-
"""Worker restart policy helpers (cooldown, retry budget, churn guard)."""
from __future__ import annotations
import json
import os
from datetime import datetime, timezone
from pathlib import Path
from typing import Any, Mapping
from mes_dashboard.core.runtime_contract import load_runtime_contract
def _env_int(name: str, default: int) -> int:
try:
return int(os.getenv(name, str(default)))
except (TypeError, ValueError):
return default
def _env_bool(name: str, default: bool) -> bool:
raw = os.getenv(name)
if raw is None:
return default
return raw.strip().lower() in {"1", "true", "yes", "on"}
def _parse_iso(ts: str | None) -> datetime | None:
if not ts:
return None
try:
value = datetime.fromisoformat(ts)
except (TypeError, ValueError):
return None
if value.tzinfo is None:
value = value.replace(tzinfo=timezone.utc)
return value
def _utc_now() -> datetime:
return datetime.now(timezone.utc)
def get_worker_recovery_policy_config() -> dict[str, Any]:
"""Return effective worker restart policy config."""
retry_budget = _env_int("WORKER_RESTART_RETRY_BUDGET", 3)
churn_threshold = _env_int(
"WORKER_RESTART_CHURN_THRESHOLD",
_env_int("RESILIENCE_RESTART_CHURN_THRESHOLD", retry_budget),
)
window_seconds = _env_int(
"WORKER_RESTART_WINDOW_SECONDS",
_env_int("RESILIENCE_RESTART_CHURN_WINDOW_SECONDS", 600),
)
return {
"cooldown_seconds": max(_env_int("WORKER_RESTART_COOLDOWN", 60), 1),
"retry_budget": max(retry_budget, 1),
"window_seconds": max(window_seconds, 30),
"churn_threshold": max(churn_threshold, 1),
"guarded_mode_enabled": _env_bool("WORKER_GUARDED_MODE_ENABLED", True),
}
def load_restart_state(path: str | None = None) -> dict[str, Any]:
"""Load persisted restart state from runtime contract state file."""
state_path = Path(path or load_runtime_contract()["watchdog_state_file"])
if not state_path.exists():
return {}
try:
return json.loads(state_path.read_text())
except (json.JSONDecodeError, IOError):
return {}
def extract_restart_history(state: Mapping[str, Any] | None = None) -> list[dict[str, Any]]:
"""Extract bounded restart history from persisted state."""
payload = dict(state or {})
raw_history = payload.get("history")
if not isinstance(raw_history, list):
return []
return [item for item in raw_history if isinstance(item, dict)][-50:]
def extract_last_requested_at(state: Mapping[str, Any] | None = None) -> str | None:
"""Extract last requested timestamp from persisted state."""
payload = dict(state or {})
last_restart = payload.get("last_restart") or {}
if not isinstance(last_restart, dict):
return None
value = last_restart.get("requested_at")
return str(value) if value else None
def evaluate_worker_recovery_state(
history: list[dict[str, Any]] | None,
*,
last_requested_at: str | None = None,
now: datetime | None = None,
) -> dict[str, Any]:
"""Evaluate restart policy state for automated/manual recovery decisions."""
cfg = get_worker_recovery_policy_config()
now_dt = now or _utc_now()
window_seconds = int(cfg["window_seconds"])
cooldown_seconds = int(cfg["cooldown_seconds"])
recent_attempts = 0
for item in history or []:
requested = _parse_iso(item.get("requested_at"))
completed = _parse_iso(item.get("completed_at"))
ts = requested or completed
if ts is None:
continue
age = (now_dt - ts).total_seconds()
if age <= window_seconds:
recent_attempts += 1
retry_budget = int(cfg["retry_budget"])
churn_threshold = int(cfg["churn_threshold"])
retry_budget_exhausted = recent_attempts >= retry_budget
churn_exceeded = recent_attempts >= churn_threshold
guarded_mode = bool(cfg["guarded_mode_enabled"] and (retry_budget_exhausted or churn_exceeded))
cooldown_active = False
cooldown_remaining = 0
last_requested_dt = _parse_iso(last_requested_at)
if last_requested_dt is not None:
elapsed = (now_dt - last_requested_dt).total_seconds()
if elapsed < cooldown_seconds:
cooldown_active = True
cooldown_remaining = int(max(cooldown_seconds - elapsed, 0))
blocked = guarded_mode
allowed = not blocked and not cooldown_active
state = "allowed"
if blocked:
state = "blocked"
elif cooldown_active:
state = "cooldown"
return {
"state": state,
"allowed": allowed,
"cooldown": cooldown_active,
"cooldown_remaining_seconds": cooldown_remaining,
"blocked": blocked,
"guarded_mode": guarded_mode,
"retry_budget_exhausted": retry_budget_exhausted,
"churn_exceeded": churn_exceeded,
"attempts_in_window": recent_attempts,
"retry_budget": retry_budget,
"churn_threshold": churn_threshold,
"window_seconds": window_seconds,
"cooldown_seconds": cooldown_seconds,
}
def decide_restart_request(
policy_state: Mapping[str, Any],
*,
source: str,
manual_override: bool = False,
override_acknowledged: bool = False,
) -> dict[str, Any]:
"""Decide whether restart request is allowed under current policy state."""
state = dict(policy_state or {})
blocked = bool(state.get("blocked"))
cooldown = bool(state.get("cooldown"))
source_value = (source or "manual").strip().lower()
if source_value not in {"auto", "manual"}:
source_value = "manual"
if source_value == "auto":
if blocked:
return {
"allowed": False,
"decision": "blocked",
"reason": "guarded_mode_blocked",
"requires_acknowledgement": False,
}
if cooldown:
return {
"allowed": False,
"decision": "blocked",
"reason": "cooldown_active",
"requires_acknowledgement": False,
}
return {
"allowed": True,
"decision": "allowed",
"reason": "policy_allows_auto_restart",
"requires_acknowledgement": False,
}
if (blocked or cooldown) and not (manual_override and override_acknowledged):
reason = "manual_override_required" if blocked else "cooldown_override_required"
return {
"allowed": False,
"decision": "blocked",
"reason": reason,
"requires_acknowledgement": True,
}
if manual_override and override_acknowledged:
return {
"allowed": True,
"decision": "manual_override",
"reason": "operator_override_acknowledged",
"requires_acknowledgement": False,
}
return {
"allowed": True,
"decision": "allowed",
"reason": "policy_allows_manual_restart",
"requires_acknowledgement": False,
}
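The decision matrix in `decide_restart_request` reduces to: auto-restarts never bypass guarded mode or cooldown, while manual restarts may with an acknowledged override. A condensed standalone sketch of that matrix (illustrative, not the module itself, and omitting the audit reasons):

```python
def decide(blocked: bool, cooldown: bool, source: str,
           manual_override: bool = False, acknowledged: bool = False) -> str:
    """Condensed restart decision: 'allowed', 'blocked', or 'manual_override'."""
    if source == "auto":
        # Automated recovery must respect both guardrails unconditionally.
        return "blocked" if (blocked or cooldown) else "allowed"
    if (blocked or cooldown) and not (manual_override and acknowledged):
        return "blocked"
    if manual_override and acknowledged:
        return "manual_override"
    return "allowed"
```

The full implementation additionally requires a non-empty `override_reason` (enforced by the admin route) so every override lands in the audit log with operator context.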


@@ -7,8 +7,9 @@ import json
 import logging
 import os
 import time
-from datetime import datetime
+from datetime import datetime, timezone
 from pathlib import Path
+from typing import Any

 from flask import Blueprint, g, jsonify, render_template, request
@@ -19,6 +20,17 @@ from mes_dashboard.core.resilience import (
     get_resilience_thresholds,
     summarize_restart_history,
 )
+from mes_dashboard.core.runtime_contract import (
+    build_runtime_contract_diagnostics,
+    load_runtime_contract,
+)
+from mes_dashboard.core.worker_recovery_policy import (
+    decide_restart_request,
+    evaluate_worker_recovery_state,
+    extract_last_requested_at,
+    extract_restart_history,
+    load_restart_state,
+)
 from mes_dashboard.services.page_registry import get_all_pages, set_page_status

 admin_bp = Blueprint("admin", __name__, url_prefix="/admin")
@@ -28,21 +40,13 @@ logger = logging.getLogger("mes_dashboard.admin")
 # Worker Restart Configuration
 # ============================================================

-WATCHDOG_RUNTIME_DIR = os.getenv("WATCHDOG_RUNTIME_DIR", "/tmp")
-RESTART_FLAG_PATH = os.getenv(
-    "WATCHDOG_RESTART_FLAG",
-    f"{WATCHDOG_RUNTIME_DIR}/mes_dashboard_restart.flag"
-)
-RESTART_STATE_PATH = os.getenv(
-    "WATCHDOG_STATE_FILE",
-    f"{WATCHDOG_RUNTIME_DIR}/mes_dashboard_restart_state.json"
-)
-WATCHDOG_PID_PATH = os.getenv(
-    "WATCHDOG_PID_FILE",
-    f"{WATCHDOG_RUNTIME_DIR}/gunicorn.pid"
-)
-GUNICORN_BIND = os.getenv("GUNICORN_BIND", "0.0.0.0:8080")
-RESTART_COOLDOWN_SECONDS = int(os.getenv("WORKER_RESTART_COOLDOWN", "60"))
+_RUNTIME_CONTRACT = load_runtime_contract()
+WATCHDOG_RUNTIME_DIR = _RUNTIME_CONTRACT["watchdog_runtime_dir"]
+RESTART_FLAG_PATH = _RUNTIME_CONTRACT["watchdog_restart_flag"]
+RESTART_STATE_PATH = _RUNTIME_CONTRACT["watchdog_state_file"]
+WATCHDOG_PID_PATH = _RUNTIME_CONTRACT["watchdog_pid_file"]
+GUNICORN_BIND = _RUNTIME_CONTRACT["gunicorn_bind"]
+RUNTIME_CONTRACT_VERSION = _RUNTIME_CONTRACT["version"]

 # Track last restart request time (in-memory for this worker)
 _last_restart_request: float = 0.0
@@ -91,7 +95,9 @@ def api_system_status():
     thresholds = get_resilience_thresholds()
     restart_state = _get_restart_state()
     restart_churn = _get_restart_churn_summary(restart_state)
-    in_cooldown, remaining = _check_restart_cooldown()
+    policy_state = _get_restart_policy_state(restart_state)
+    in_cooldown = bool(policy_state.get("cooldown"))
+    remaining = int(policy_state.get("cooldown_remaining_seconds") or 0)

     degraded_reason = None
     if db_status == "error":
@@ -111,6 +117,14 @@ def api_system_status():
         restart_churn_exceeded=bool(restart_churn.get("exceeded")),
         cooldown_active=in_cooldown,
     )
+    alerts = _build_restart_alerts(
+        pool_saturation=(pool_state or {}).get("saturation"),
+        circuit_state=circuit_breaker.get("state"),
+        route_cache_degraded=bool(route_cache.get("degraded")),
+        policy_state=policy_state,
+        thresholds=thresholds,
+    )
+    runtime_contract = build_runtime_contract_diagnostics(strict=False)

     # Cache status
     from mes_dashboard.routes.health_routes import (
@@ -142,13 +156,22 @@ def api_system_status():
             "pool_state": pool_state,
             "route_cache": route_cache,
             "thresholds": thresholds,
+            "alerts": alerts,
             "restart_churn": restart_churn,
+            "policy_state": {
+                "state": policy_state.get("state"),
+                "allowed": policy_state.get("allowed"),
+                "cooldown": policy_state.get("cooldown"),
+                "blocked": policy_state.get("blocked"),
+                "cooldown_remaining_seconds": remaining,
+            },
             "recovery_recommendation": recommendation,
             "restart_cooldown": {
                 "active": in_cooldown,
-                "remaining_seconds": int(remaining) if in_cooldown else 0,
+                "remaining_seconds": remaining if in_cooldown else 0,
             },
         },
+        "runtime_contract": runtime_contract,
         "single_port_bind": GUNICORN_BIND,
         "worker_pid": os.getpid()
     }
@@ -283,13 +306,13 @@ def api_logs_cleanup():
 def _get_restart_state() -> dict:
     """Read worker restart state from file."""
-    state_path = Path(RESTART_STATE_PATH)
-    if not state_path.exists():
-        return {}
-    try:
-        return json.loads(state_path.read_text())
-    except (json.JSONDecodeError, IOError):
-        return {}
+    return load_restart_state(RESTART_STATE_PATH)
+
+
+def _iso_from_epoch(ts: float) -> str | None:
+    if ts <= 0:
+        return None
+    return datetime.fromtimestamp(ts, tz=timezone.utc).isoformat()


 def _check_restart_cooldown() -> tuple[bool, float]:
@@ -298,38 +321,16 @@ def _check_restart_cooldown() -> tuple[bool, float]:
     Returns:
         Tuple of (is_in_cooldown, remaining_seconds).
     """
-    global _last_restart_request
-
-    # Check in-memory cooldown first
-    now = time.time()
-    elapsed = now - _last_restart_request
-    if elapsed < RESTART_COOLDOWN_SECONDS:
-        return True, RESTART_COOLDOWN_SECONDS - elapsed
-
-    # Check file-based state (for cross-worker coordination)
-    state = _get_restart_state()
-    last_restart = state.get("last_restart", {})
-    requested_at = last_restart.get("requested_at")
-    if requested_at:
-        try:
-            request_time = datetime.fromisoformat(requested_at).timestamp()
-            elapsed = now - request_time
-            if elapsed < RESTART_COOLDOWN_SECONDS:
-                return True, RESTART_COOLDOWN_SECONDS - elapsed
-        except (ValueError, TypeError):
-            pass
-
+    policy = _get_restart_policy_state()
+    if policy.get("cooldown"):
+        return True, float(policy.get("cooldown_remaining_seconds") or 0.0)
     return False, 0.0


 def _get_restart_history(state: dict | None = None) -> list[dict]:
     """Return bounded restart history for admin telemetry."""
     payload = state if state is not None else _get_restart_state()
-    raw_history = payload.get("history") or []
-    if not isinstance(raw_history, list):
-        return []
-    return raw_history[-20:]
+    return extract_restart_history(payload)[-20:]


 def _get_restart_churn_summary(state: dict | None = None) -> dict:
@@ -338,22 +339,58 @@ def _get_restart_churn_summary(state: dict | None = None) -> dict:
     return summarize_restart_history(history)


-def _worker_recovery_hint(churn: dict, cooldown_active: bool) -> dict:
-    """Build worker control recommendation from churn/cooldown state."""
-    if churn.get("exceeded"):
-        return {
-            "action": "throttle_and_investigate_queries",
-            "reason": "restart_churn_exceeded",
-        }
-    if cooldown_active:
-        return {
-            "action": "wait_for_restart_cooldown",
-            "reason": "restart_cooldown_active",
-        }
-    return {
-        "action": "restart_available",
-        "reason": "no_churn_or_cooldown",
-    }
+def _get_restart_policy_state(state: dict | None = None) -> dict[str, Any]:
+    """Return effective worker restart policy state."""
+    payload = state if state is not None else _get_restart_state()
+    history = _get_restart_history(payload)
+    last_requested = extract_last_requested_at(payload)
+    in_memory_requested = _iso_from_epoch(_last_restart_request)
+    if in_memory_requested:
+        try:
+            in_memory_dt = datetime.fromisoformat(in_memory_requested)
+            persisted_dt = datetime.fromisoformat(last_requested) if last_requested else None
+        except (TypeError, ValueError):
+            in_memory_dt = None
+            persisted_dt = None
+        if in_memory_dt and (persisted_dt is None or in_memory_dt > persisted_dt):
+            last_requested = in_memory_requested
+    return evaluate_worker_recovery_state(
+        history,
+        last_requested_at=last_requested,
+    )
+
+
+def _build_restart_alerts(
+    *,
+    pool_saturation: float | None,
+    circuit_state: str | None,
+    route_cache_degraded: bool,
+    policy_state: dict[str, Any],
+    thresholds: dict[str, Any],
+) -> dict[str, Any]:
+    saturation = float(pool_saturation or 0.0)
+    warning = float(thresholds.get("pool_saturation_warning", 0.9))
+    critical = float(thresholds.get("pool_saturation_critical", 1.0))
+    return {
+        "pool_warning": saturation >= warning,
+        "pool_critical": saturation >= critical,
+        "circuit_open": circuit_state == "OPEN",
+        "route_cache_degraded": bool(route_cache_degraded),
+        "restart_churn_exceeded": bool(policy_state.get("churn_exceeded")),
+        "restart_blocked": bool(policy_state.get("blocked")),
+    }
+
+
+def _log_restart_audit(event: str, payload: dict[str, Any]) -> None:
+    entry = {
+        "event": event,
+        "timestamp": datetime.now(tz=timezone.utc).isoformat(),
+        "runtime_contract_version": RUNTIME_CONTRACT_VERSION,
+        **payload,
+    }
+    logger.info("worker_restart_audit %s", json.dumps(entry, ensure_ascii=False))


 @admin_bp.route("/api/worker/restart", methods=["POST"])
@@ -366,19 +403,60 @@ def api_worker_restart():
     """
     global _last_restart_request

-    # Check cooldown
-    in_cooldown, remaining = _check_restart_cooldown()
-    if in_cooldown:
-        return error_response(
-            TOO_MANY_REQUESTS,
-            f"Restart in cooldown. Please wait {int(remaining)} seconds.",
-            status_code=429
-        )
+    payload = request.get_json(silent=True) or {}
+    manual_override = bool(payload.get("manual_override"))
+    override_acknowledged = bool(payload.get("override_acknowledged"))
+    override_reason = str(payload.get("override_reason") or "").strip()

     # Get request metadata
     user = getattr(g, "username", "unknown")
     ip = request.remote_addr or "unknown"
-    timestamp = datetime.now().isoformat()
+    timestamp = datetime.now(tz=timezone.utc).isoformat()
+
+    state = _get_restart_state()
+    policy_state = _get_restart_policy_state(state)
+    decision = decide_restart_request(
+        policy_state,
+        source="manual",
+        manual_override=manual_override,
+        override_acknowledged=override_acknowledged,
+    )
+
+    if manual_override and not override_reason:
+        return error_response(
+            "RESTART_OVERRIDE_REASON_REQUIRED",
+            "Manual override requires non-empty override_reason for audit traceability.",
+            status_code=400,
+        )
+
+    if not decision["allowed"]:
+        status_code = 429 if policy_state.get("cooldown") else 409
+        if status_code == 429:
+            message = (
+                f"Restart in cooldown. Please wait "
+                f"{int(policy_state.get('cooldown_remaining_seconds') or 0)} seconds."
+            )
+            code = TOO_MANY_REQUESTS
+        else:
+            message = (
+                "Restart blocked by guarded mode. "
+                "Set manual_override=true and override_acknowledged=true to proceed."
+            )
+            code = "RESTART_POLICY_BLOCKED"
+        _log_restart_audit(
+            "restart_request_blocked",
+            {
+                "actor": user,
+                "ip": ip,
+                "decision": decision,
+                "policy_state": policy_state,
+            },
+        )
+        return error_response(
+            code,
+            message,
+            status_code=status_code,
+        )

     # Write restart flag file
     flag_path = Path(RESTART_FLAG_PATH)
@@ -386,11 +464,21 @@ def api_worker_restart():
         "user": user,
         "ip": ip,
         "timestamp": timestamp,
-        "worker_pid": os.getpid()
+        "worker_pid": os.getpid(),
+        "source": "manual",
+        "manual_override": bool(manual_override and override_acknowledged),
+        "override_acknowledged": override_acknowledged,
+        "override_reason": override_reason or None,
+        "policy_state": policy_state,
+        "policy_decision": decision["decision"],
+        "runtime_contract_version": RUNTIME_CONTRACT_VERSION,
     }
     try:
-        flag_path.write_text(json.dumps(flag_data))
+        flag_path.parent.mkdir(parents=True, exist_ok=True)
+        tmp_path = flag_path.with_suffix(flag_path.suffix + ".tmp")
+        tmp_path.write_text(json.dumps(flag_data, ensure_ascii=False))
+        tmp_path.replace(flag_path)
     except IOError as e:
         logger.error(f"Failed to write restart flag: {e}")
         return error_response(
@@ -402,8 +490,15 @@ def api_worker_restart():
     # Update in-memory cooldown
     _last_restart_request = time.time()

-    logger.info(
-        f"Worker restart requested by {user} from {ip}"
+    _log_restart_audit(
+        "restart_request_accepted",
+        {
+            "actor": user,
+            "ip": ip,
+            "decision": decision,
+            "policy_state": policy_state,
+            "override_reason": override_reason or None,
+        },
     )

     return jsonify({
@@ -412,6 +507,14 @@ def api_worker_restart():
         "message": "Restart requested. Workers will reload shortly.",
         "requested_by": user,
         "requested_at": timestamp,
+        "policy_state": {
+            "state": policy_state.get("state"),
+            "allowed": policy_state.get("allowed"),
+            "cooldown": policy_state.get("cooldown"),
+            "blocked": policy_state.get("blocked"),
+            "cooldown_remaining_seconds": policy_state.get("cooldown_remaining_seconds"),
+        },
+        "decision": decision,
         "single_port_bind": GUNICORN_BIND,
         "watchdog": {
             "runtime_dir": WATCHDOG_RUNTIME_DIR,
@@ -427,16 +530,21 @@ def api_worker_restart():
 @admin_required
 def api_worker_status():
     """API: Get worker status and restart information."""
-    # Check cooldown
-    in_cooldown, remaining = _check_restart_cooldown()
-
     # Get last restart info
     state = _get_restart_state()
     last_restart = state.get("last_restart", {})
     history = _get_restart_history(state)
     churn = _get_restart_churn_summary(state)
+    policy_state = _get_restart_policy_state(state)
     thresholds = get_resilience_thresholds()
-    recommendation = _worker_recovery_hint(churn, in_cooldown)
+    recommendation = build_recovery_recommendation(
+        degraded_reason="db_pool_saturated" if policy_state.get("blocked") else None,
+        pool_saturation=None,
+        circuit_state=None,
+        restart_churn_exceeded=bool(churn.get("exceeded")),
+        cooldown_active=bool(policy_state.get("cooldown")),
+    )
+    runtime_contract = build_runtime_contract_diagnostics(strict=False)

     # Get worker start time (psutil is optional)
     worker_start_time = None
@@ -466,6 +574,11 @@ def api_worker_status():
         "worker_pid": os.getpid(),
         "worker_start_time": worker_start_time,
         "runtime_contract": {
+            "version": runtime_contract["contract"]["version"],
+            "validation": {
+                "valid": runtime_contract["valid"],
+                "errors": runtime_contract["errors"],
+            },
             "single_port_bind": GUNICORN_BIND,
             "watchdog": {
                 "runtime_dir": WATCHDOG_RUNTIME_DIR,
@@ -478,12 +591,27 @@ def api_worker_status():
             },
         },
         "cooldown": {
-            "active": in_cooldown,
-            "remaining_seconds": int(remaining) if in_cooldown else 0
+            "active": bool(policy_state.get("cooldown")),
+            "remaining_seconds": int(policy_state.get("cooldown_remaining_seconds") or 0)
         },
         "resilience": {
             "thresholds": thresholds,
+            "alerts": {
+                "restart_churn_exceeded": bool(churn.get("exceeded")),
+                "restart_blocked": bool(policy_state.get("blocked")),
+            },
             "restart_churn": churn,
+            "policy_state": {
+                "state": policy_state.get("state"),
+                "allowed": policy_state.get("allowed"),
+                "cooldown": policy_state.get("cooldown"),
+                "blocked": policy_state.get("blocked"),
+                "cooldown_remaining_seconds": policy_state.get("cooldown_remaining_seconds"),
+                "attempts_in_window": policy_state.get("attempts_in_window"),
+                "retry_budget": policy_state.get("retry_budget"),
+                "churn_threshold": policy_state.get("churn_threshold"),
+                "window_seconds": policy_state.get("window_seconds"),
+            },
             "recovery_recommendation": recommendation,
         },
         "restart_history": history,

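The accepted restart branch writes the flag file through a sibling temp file followed by a rename, so the watchdog never observes a half-written JSON flag. A minimal sketch of that write-then-replace pattern (paths here are illustrative):

```python
import json
import tempfile
from pathlib import Path


def write_flag_atomically(flag_path: Path, payload: dict) -> None:
    """Write JSON to a sibling temp file, then rename over the target."""
    flag_path.parent.mkdir(parents=True, exist_ok=True)
    tmp_path = flag_path.with_suffix(flag_path.suffix + ".tmp")
    tmp_path.write_text(json.dumps(payload, ensure_ascii=False))
    tmp_path.replace(flag_path)  # rename is atomic on POSIX within one filesystem


flag = Path(tempfile.mkdtemp()) / "restart.flag"
write_flag_atomically(flag, {"source": "manual", "worker_pid": 1234})
```

The rename guarantee only holds when the temp file lives on the same filesystem as the target, which is why the temp file is created next to the flag rather than in a system temp directory.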

@@ -11,6 +11,7 @@ from threading import Lock
 from flask import Blueprint, flash, redirect, render_template, request, session, url_for

+from mes_dashboard.core.csrf import rotate_csrf_token
 from mes_dashboard.services.auth_service import authenticate, is_admin

 logger = logging.getLogger('mes_dashboard.auth_routes')
@@ -93,6 +94,7 @@ def login():
                 error = "您不是管理員,無法登入後台"
             else:
                 # Login successful
+                session.clear()
                 session["admin"] = {
                     "username": user.get("username"),
                     "displayName": user.get("displayName"),
@@ -100,6 +102,7 @@ def login():
                     "department": user.get("department"),
                     "login_time": datetime.now().isoformat(),
                 }
+                rotate_csrf_token()
                 next_url = request.args.get("next", url_for("portal_index"))
                 return redirect(next_url)
@@ -109,5 +112,5 @@ def login():
 @auth_bp.route("/logout")
 def logout():
     """Admin logout."""
-    session.pop("admin", None)
+    session.clear()
     return redirect(url_for("portal_index"))


@@ -7,12 +7,14 @@ Provides /health and /health/deep endpoints for monitoring service status.
 from __future__ import annotations

 import logging
+import os
+import threading
 import time
 from datetime import datetime, timedelta

-from flask import Blueprint, jsonify, make_response
+from flask import Blueprint, current_app, jsonify, make_response

 from mes_dashboard.core.database import (
-    get_engine,
+    get_health_engine,
     get_pool_runtime_config,
     get_pool_status,
 )
@@ -28,6 +30,15 @@ from mes_dashboard.core.cache import (
 from mes_dashboard.core.resilience import (
     build_recovery_recommendation,
     get_resilience_thresholds,
+    summarize_restart_history,
+)
+from mes_dashboard.core.runtime_contract import build_runtime_contract_diagnostics
+from mes_dashboard.core.worker_recovery_policy import (
+    evaluate_worker_recovery_state,
+    extract_last_requested_at,
+    extract_restart_history,
+    get_worker_recovery_policy_config,
+    load_restart_state,
 )
 from sqlalchemy import text
@@ -41,6 +52,61 @@ health_bp = Blueprint('health', __name__)
 DB_LATENCY_WARNING_MS = 100  # Database latency > 100ms is slow
 CACHE_STALE_MINUTES = 2  # Cache update > 2 minutes is stale

+HEALTH_MEMO_TTL_SECONDS = int(os.getenv("HEALTH_MEMO_TTL_SECONDS", "5"))
+
+_HEALTH_MEMO_LOCK = threading.Lock()
+_HEALTH_MEMO: dict[str, dict | None] = {
+    "health": None,
+    "deep": None,
+}
+
+
+def _health_memo_enabled() -> bool:
+    if HEALTH_MEMO_TTL_SECONDS <= 0:
+        return False
+    if current_app.testing or bool(current_app.config.get("TESTING")):
+        return False
+    return True
+
+
+def _get_health_memo(cache_key: str) -> tuple[dict, int] | None:
+    if not _health_memo_enabled():
+        return None
+    now = time.time()
+    with _HEALTH_MEMO_LOCK:
+        entry = _HEALTH_MEMO.get(cache_key)
+        if not entry:
+            return None
+        if now - float(entry.get("ts", 0.0)) > HEALTH_MEMO_TTL_SECONDS:
+            _HEALTH_MEMO[cache_key] = None
+            return None
+        return entry["payload"], int(entry["status"])
+
+
+def _set_health_memo(cache_key: str, payload: dict, status_code: int) -> None:
+    if not _health_memo_enabled():
+        return
+    with _HEALTH_MEMO_LOCK:
+        _HEALTH_MEMO[cache_key] = {
+            "ts": time.time(),
+            "payload": payload,
+            "status": int(status_code),
+        }
+
+
+def _build_health_response(payload: dict, status_code: int):
+    """Build JSON response with explicit no-cache headers."""
+    resp = make_response(jsonify(payload), status_code)
+    resp.headers['Cache-Control'] = 'no-cache, no-store, must-revalidate'
+    resp.headers['Pragma'] = 'no-cache'
+    resp.headers['Expires'] = '0'
+    return resp
+
+
+def _reset_health_memo_for_tests() -> None:
+    with _HEALTH_MEMO_LOCK:
+        _HEALTH_MEMO["health"] = None
+        _HEALTH_MEMO["deep"] = None
def _classify_degraded_reason( def _classify_degraded_reason(
@@ -63,6 +129,48 @@ def _classify_degraded_reason(
return None return None
+def _build_resilience_alerts(
+    *,
+    pool_saturation: float | None,
+    circuit_state: str | None,
+    route_cache_degraded: bool,
+    restart_churn_exceeded: bool,
+    restart_blocked: bool,
+    thresholds: dict,
+) -> dict:
+    saturation = float(pool_saturation or 0.0)
+    warning = float(thresholds.get("pool_saturation_warning", 0.9))
+    critical = float(thresholds.get("pool_saturation_critical", 1.0))
+    return {
+        "pool_warning": saturation >= warning,
+        "pool_critical": saturation >= critical,
+        "circuit_open": circuit_state == "OPEN",
+        "route_cache_degraded": bool(route_cache_degraded),
+        "restart_churn_exceeded": bool(restart_churn_exceeded),
+        "restart_blocked": bool(restart_blocked),
+    }
+
+
+def get_worker_recovery_status() -> dict:
+    """Build worker recovery policy status for health/admin telemetry."""
+    state = load_restart_state()
+    history = extract_restart_history(state)
+    policy_state = evaluate_worker_recovery_state(
+        history,
+        last_requested_at=extract_last_requested_at(state),
+    )
+    churn = summarize_restart_history(
+        history,
+        window_seconds=int(policy_state.get("window_seconds") or 600),
+        threshold=int(policy_state.get("churn_threshold") or 3),
+    )
+    return {
+        "policy_state": policy_state,
+        "restart_churn": churn,
+        "policy_config": get_worker_recovery_policy_config(),
+    }
 def check_database() -> tuple[str, str | None]:
     """Check database connectivity.
@@ -71,7 +179,7 @@ def check_database() -> tuple[str, str | None]:
         status is 'ok' or 'error'.
     """
     try:
-        engine = get_engine()
+        engine = get_health_engine()
         with engine.connect() as conn:
             conn.execute(text("SELECT 1 FROM DUAL"))
         return 'ok', None
@@ -111,13 +219,21 @@ def get_cache_status() -> dict:
     status = {
         'enabled': REDIS_ENABLED,
         'sys_date': get_cached_sys_date(),
-        'updated_at': get_cache_updated_at()
+        'updated_at': get_cache_updated_at(),
+        'derived_search_index': {},
+        'derived_frame_snapshot': {},
+        'index_metrics': {},
+        'memory': {},
     }
     try:
         from mes_dashboard.services.wip_service import get_wip_search_index_status
-        status['derived_search_index'] = get_wip_search_index_status()
+        derived = get_wip_search_index_status()
+        status['derived_search_index'] = derived.get('derived_search_index', {})
+        status['derived_frame_snapshot'] = derived.get('derived_frame_snapshot', {})
+        status['index_metrics'] = derived.get('metrics', {})
+        status['memory'] = derived.get('memory', {})
     except Exception:
-        status['derived_search_index'] = {}
+        pass
     return status
@@ -205,6 +321,11 @@ def health_check():
     - 200 OK: All services healthy or degraded (Redis down but DB ok)
     - 503 Service Unavailable: Database unhealthy
     """
+    cached = _get_health_memo("health")
+    if cached is not None:
+        payload, status_code = cached
+        return _build_health_response(payload, status_code)
+
     from mes_dashboard.core.circuit_breaker import get_circuit_breaker_status
 
     db_status, db_error = check_database()
@@ -266,13 +387,25 @@ def health_check():
         warnings.append(f"Database pool saturation is high ({saturation:.0%})")
 
     thresholds = get_resilience_thresholds()
+    worker_recovery = get_worker_recovery_status()
+    policy_state = worker_recovery.get("policy_state", {})
+    restart_churn = worker_recovery.get("restart_churn", {})
     recommendation = build_recovery_recommendation(
         degraded_reason=degraded_reason,
         pool_saturation=pool_saturation,
         circuit_state=circuit_breaker.get('state'),
-        restart_churn_exceeded=False,
-        cooldown_active=False,
+        restart_churn_exceeded=bool(restart_churn.get("exceeded")),
+        cooldown_active=bool(policy_state.get("cooldown")),
     )
+    alerts = _build_resilience_alerts(
+        pool_saturation=pool_saturation,
+        circuit_state=circuit_breaker.get("state"),
+        route_cache_degraded=bool(route_cache.get("degraded")),
+        restart_churn_exceeded=bool(restart_churn.get("exceeded")),
+        restart_blocked=bool(policy_state.get("blocked")),
+        thresholds=thresholds,
+    )
+    runtime_contract = build_runtime_contract_diagnostics(strict=False)
 
     # Check equipment status cache
     equipment_status_cache = get_equipment_status_cache_status()
@@ -293,8 +426,18 @@ def health_check():
         },
         'resilience': {
             'thresholds': thresholds,
+            'alerts': alerts,
+            'policy_state': {
+                'state': policy_state.get("state"),
+                'allowed': policy_state.get("allowed"),
+                'cooldown': policy_state.get("cooldown"),
+                'blocked': policy_state.get("blocked"),
+                'cooldown_remaining_seconds': policy_state.get("cooldown_remaining_seconds"),
+            },
+            'restart_churn': restart_churn,
             'recovery_recommendation': recommendation,
         },
+        'runtime_contract': runtime_contract,
         'cache': get_cache_status(),
         'route_cache': route_cache,
        'resource_cache': resource_cache,
@@ -307,12 +450,8 @@ def health_check():
     if warnings:
         response['warnings'] = warnings
 
-    # Add no-cache headers to prevent browser caching
-    resp = make_response(jsonify(response), http_code)
-    resp.headers['Cache-Control'] = 'no-cache, no-store, must-revalidate'
-    resp.headers['Pragma'] = 'no-cache'
-    resp.headers['Expires'] = '0'
-    return resp
+    _set_health_memo("health", response, http_code)
+    return _build_health_response(response, http_code)
 @health_bp.route('/health/deep', methods=['GET'])
@@ -334,6 +473,11 @@ def deep_health_check():
     if not is_admin_logged_in():
         return redirect(url_for("auth.login", next=request.url))
 
+    cached = _get_health_memo("deep")
+    if cached is not None:
+        payload, status_code = cached
+        return _build_health_response(payload, status_code)
+
     # Check database with latency measurement
     db_start = time.time()
     db_status, db_error = check_database()
@@ -397,6 +541,9 @@ def deep_health_check():
         warnings.append(f"Database pool saturation is high ({pool_saturation:.0%})")
 
     thresholds = get_resilience_thresholds()
+    worker_recovery = get_worker_recovery_status()
+    policy_state = worker_recovery.get("policy_state", {})
+    restart_churn = worker_recovery.get("restart_churn", {})
     degraded_reason = _classify_degraded_reason(
         db_status=db_status,
         redis_status=redis_status,
@@ -408,9 +555,18 @@ def deep_health_check():
         degraded_reason=degraded_reason,
         pool_saturation=pool_saturation,
         circuit_state=circuit_breaker.get('state'),
-        restart_churn_exceeded=False,
-        cooldown_active=False,
+        restart_churn_exceeded=bool(restart_churn.get("exceeded")),
+        cooldown_active=bool(policy_state.get("cooldown")),
     )
+    alerts = _build_resilience_alerts(
+        pool_saturation=pool_saturation,
+        circuit_state=circuit_breaker.get("state"),
+        route_cache_degraded=bool(route_cache.get("degraded")),
+        restart_churn_exceeded=bool(restart_churn.get("exceeded")),
+        restart_blocked=bool(policy_state.get("blocked")),
+        thresholds=thresholds,
+    )
+    runtime_contract = build_runtime_contract_diagnostics(strict=False)
 
     # Check latency thresholds
     db_latency_status = 'healthy'
@@ -429,8 +585,18 @@ def deep_health_check():
         'degraded_reason': degraded_reason,
         'resilience': {
             'thresholds': thresholds,
+            'alerts': alerts,
+            'policy_state': {
+                'state': policy_state.get("state"),
+                'allowed': policy_state.get("allowed"),
+                'cooldown': policy_state.get("cooldown"),
+                'blocked': policy_state.get("blocked"),
+                'cooldown_remaining_seconds': policy_state.get("cooldown_remaining_seconds"),
+            },
+            'restart_churn': restart_churn,
             'recovery_recommendation': recommendation,
         },
+        'runtime_contract': runtime_contract,
         'checks': {
             'database': {
                 'status': db_latency_status if db_status == 'ok' else 'error',
@@ -446,7 +612,9 @@ def deep_health_check():
         'cache': {
             'freshness': cache_freshness,
             'updated_at': cache_updated_at,
-            'sys_date': cache_status.get('sys_date')
+            'sys_date': cache_status.get('sys_date'),
+            'index_metrics': cache_status.get('index_metrics', {}),
+            'memory': cache_status.get('memory', {}),
         },
         'route_cache': route_cache
     },
@@ -464,9 +632,5 @@ def deep_health_check():
     if warnings:
         response['warnings'] = warnings
 
-    # Add no-cache headers
-    resp = make_response(jsonify(response), http_code)
-    resp.headers['Cache-Control'] = 'no-cache, no-store, must-revalidate'
-    resp.headers['Pragma'] = 'no-cache'
-    resp.headers['Expires'] = '0'
-    return resp
+    _set_health_memo("deep", response, http_code)
+    return _build_health_response(response, http_code)
@@ -6,6 +6,8 @@ Contains Flask Blueprint for Hold Detail page and API endpoints.
 from flask import Blueprint, jsonify, request, render_template, redirect, url_for
 
+from mes_dashboard.core.rate_limit import configured_rate_limit
+from mes_dashboard.core.utils import parse_bool_query
 from mes_dashboard.services.wip_service import (
     get_hold_detail_summary,
     get_hold_detail_distribution,
@@ -16,10 +18,13 @@ from mes_dashboard.services.wip_service import (
 # Create Blueprint
 hold_bp = Blueprint('hold', __name__)
 
-def _parse_bool(value: str) -> bool:
-    """Parse boolean from query string."""
-    return value.lower() in ('true', '1', 'yes') if value else False
+_HOLD_LOTS_RATE_LIMIT = configured_rate_limit(
+    bucket="hold-detail-lots",
+    max_attempts_env="HOLD_LOTS_RATE_LIMIT_MAX_REQUESTS",
+    window_seconds_env="HOLD_LOTS_RATE_LIMIT_WINDOW_SECONDS",
+    default_max_attempts=90,
+    default_window_seconds=60,
+)
 
 # ============================================================
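`configured_rate_limit` is project-internal, so its exact semantics are an assumption here; the per-bucket env overrides suggest a fixed-window limiter along these lines (the decorator name, the env lookup order, and the 429 tuple are illustrative):

```python
import os
import time
from functools import wraps


def configured_rate_limit_sketch(bucket, max_attempts_env, window_seconds_env,
                                 default_max_attempts, default_window_seconds):
    """Fixed-window rate limit; env vars override the per-bucket defaults."""
    max_attempts = int(os.getenv(max_attempts_env, default_max_attempts))
    window = int(os.getenv(window_seconds_env, default_window_seconds))
    state = {"window_start": 0.0, "count": 0}

    def decorator(fn):
        @wraps(fn)
        def wrapper(*args, **kwargs):
            now = time.time()
            if now - state["window_start"] >= window:
                # New window: reset the counter.
                state["window_start"], state["count"] = now, 0
            state["count"] += 1
            if state["count"] > max_attempts:
                return ("rate limited", 429)  # real code would return a JSON 429
            return fn(*args, **kwargs)
        return wrapper
    return decorator
```

Applied as `@_HOLD_LOTS_RATE_LIMIT` above, the decorator caps `/api/wip/hold-detail/lots` at 90 requests per 60-second window unless the env vars say otherwise.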
@@ -64,7 +69,7 @@ def api_hold_detail_summary():
     if not reason:
         return jsonify({'success': False, 'error': '缺少必要參數: reason'}), 400
 
-    include_dummy = _parse_bool(request.args.get('include_dummy', ''))
+    include_dummy = parse_bool_query(request.args.get('include_dummy'))
 
     result = get_hold_detail_summary(
         reason=reason,
@@ -90,7 +95,7 @@ def api_hold_detail_distribution():
     if not reason:
         return jsonify({'success': False, 'error': '缺少必要參數: reason'}), 400
 
-    include_dummy = _parse_bool(request.args.get('include_dummy', ''))
+    include_dummy = parse_bool_query(request.args.get('include_dummy'))
 
     result = get_hold_detail_distribution(
         reason=reason,
@@ -102,6 +107,7 @@ def api_hold_detail_distribution():
 
 @hold_bp.route('/api/wip/hold-detail/lots')
+@_HOLD_LOTS_RATE_LIMIT
 def api_hold_detail_lots():
     """API: Get paginated lot details for a specific hold reason.
@@ -124,7 +130,7 @@ def api_hold_detail_lots():
     workcenter = request.args.get('workcenter', '').strip() or None
     package = request.args.get('package', '').strip() or None
     age_range = request.args.get('age_range', '').strip() or None
-    include_dummy = _parse_bool(request.args.get('include_dummy', ''))
+    include_dummy = parse_bool_query(request.args.get('include_dummy'))
     page = request.args.get('page', 1, type=int)
     per_page = min(request.args.get('per_page', 50, type=int), 200)
@@ -13,10 +13,12 @@ from mes_dashboard.core.database import (
     DatabaseCircuitOpenError,
 )
 from mes_dashboard.core.cache import cache_get, cache_set, make_cache_key
+from mes_dashboard.core.rate_limit import configured_rate_limit
+from mes_dashboard.core.utils import get_days_back, parse_bool_query
 
 def _clean_nan_values(data):
-    """Convert NaN and NaT values to None for JSON serialization.
+    """Convert NaN/NaT values to None for JSON serialization (depth-safe).
 
     Args:
         data: List of dicts or single dict.
@@ -24,28 +26,77 @@ def _clean_nan_values(data):
     Returns:
         Cleaned data with NaN/NaT replaced by None.
     """
-    if isinstance(data, list):
-        return [_clean_nan_values(item) for item in data]
-    elif isinstance(data, dict):
-        cleaned = {}
-        for key, value in data.items():
-            if isinstance(value, float) and math.isnan(value):
-                cleaned[key] = None
-            elif isinstance(value, str) and value == 'NaT':
-                cleaned[key] = None
-            elif value != value:  # NaN check (NaN != NaN)
-                cleaned[key] = None
-            elif isinstance(value, list):
-                # Recursively clean nested lists (e.g., LOT_DETAILS)
-                cleaned[key] = _clean_nan_values(value)
-            elif isinstance(value, dict):
-                # Recursively clean nested dicts
-                cleaned[key] = _clean_nan_values(value)
-            else:
-                cleaned[key] = value
-        return cleaned
-    return data
-
-from mes_dashboard.core.utils import get_days_back
+    def _normalize_scalar(value):
+        if isinstance(value, float) and math.isnan(value):
+            return None
+        if isinstance(value, str) and value == 'NaT':
+            return None
+        try:
+            if value != value:  # NaN check (NaN != NaN)
+                return None
+        except Exception:
+            pass
+        return value
+
+    if isinstance(data, list):
+        root: list = []
+    elif isinstance(data, dict):
+        root = {}
+    else:
+        return _normalize_scalar(data)
+
+    stack = [(data, root)]
+    seen: set[int] = {id(data)}
+    while stack:
+        source, target = stack.pop()
+        if isinstance(source, list):
+            for item in source:
+                if isinstance(item, list):
+                    item_id = id(item)
+                    if item_id in seen:
+                        target.append(None)
+                        continue
+                    child = []
+                    target.append(child)
+                    seen.add(item_id)
+                    stack.append((item, child))
+                elif isinstance(item, dict):
+                    item_id = id(item)
+                    if item_id in seen:
+                        target.append(None)
+                        continue
+                    child = {}
+                    target.append(child)
+                    seen.add(item_id)
+                    stack.append((item, child))
+                else:
+                    target.append(_normalize_scalar(item))
+            continue
+        for key, value in source.items():
+            if isinstance(value, list):
+                value_id = id(value)
+                if value_id in seen:
+                    target[key] = None
+                    continue
+                child = []
+                target[key] = child
+                seen.add(value_id)
+                stack.append((value, child))
+            elif isinstance(value, dict):
+                value_id = id(value)
+                if value_id in seen:
+                    target[key] = None
+                    continue
+                child = {}
+                target[key] = child
+                seen.add(value_id)
+                stack.append((value, child))
+            else:
+                target[key] = _normalize_scalar(value)
+    return root
 
 from mes_dashboard.services.resource_service import (
     query_resource_by_status,
     query_resource_by_workcenter,
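The rewrite above trades recursion for an explicit stack plus an `id()`-based `seen` set, so deeply nested or self-referential payloads can neither hit the recursion limit nor loop forever. A condensed, runnable version of the same idea:

```python
import math


def clean_nan(data):
    """Depth-safe NaN/NaT scrub, mirroring the iterative rewrite above."""
    def norm(v):
        if isinstance(v, float) and math.isnan(v):
            return None
        if isinstance(v, str) and v == 'NaT':
            return None
        return v

    if not isinstance(data, (list, dict)):
        return norm(data)
    root = [] if isinstance(data, list) else {}
    stack, seen = [(data, root)], {id(data)}
    while stack:
        src, dst = stack.pop()
        items = enumerate(src) if isinstance(src, list) else src.items()
        for key, val in items:
            if isinstance(val, (list, dict)):
                if id(val) in seen:
                    child = None  # cycle detected: break it with None
                else:
                    child = [] if isinstance(val, list) else {}
                    seen.add(id(val))
                    stack.append((val, child))
            else:
                child = norm(val)
            if isinstance(dst, list):
                dst.append(child)
            else:
                dst[key] = child
    return root
```

Containers are copied first and filled as the stack drains, so sibling order inside lists is preserved while no Python stack frame is ever deeper than one loop iteration.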
@@ -62,6 +113,32 @@ from mes_dashboard.config.constants import STATUS_CATEGORIES
 # Create Blueprint
 resource_bp = Blueprint('resource', __name__, url_prefix='/api/resource')
 
+_RESOURCE_DETAIL_RATE_LIMIT = configured_rate_limit(
+    bucket="resource-detail",
+    max_attempts_env="RESOURCE_DETAIL_RATE_LIMIT_MAX_REQUESTS",
+    window_seconds_env="RESOURCE_DETAIL_RATE_LIMIT_WINDOW_SECONDS",
+    default_max_attempts=60,
+    default_window_seconds=60,
+)
+
+_RESOURCE_STATUS_RATE_LIMIT = configured_rate_limit(
+    bucket="resource-status",
+    max_attempts_env="RESOURCE_STATUS_RATE_LIMIT_MAX_REQUESTS",
+    window_seconds_env="RESOURCE_STATUS_RATE_LIMIT_WINDOW_SECONDS",
+    default_max_attempts=90,
+    default_window_seconds=60,
+)
+
+
+def _optional_bool_arg(name: str):
+    raw = request.args.get(name)
+    if raw is None:
+        return None
+    text = str(raw).strip()
+    if not text:
+        return None
+    return parse_bool_query(text)
+
 
 @resource_bp.route('/by_status')
 def api_resource_by_status():
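`_optional_bool_arg` keeps tri-state semantics: an absent or blank query parameter means "no filter" (None) rather than False, so only explicit values narrow a query. A standalone sketch (the accepted tokens mirror the old inline parsers visible in this diff: 'true', '1', 'yes'; the helper name is illustrative):

```python
def parse_bool_token(value) -> bool:
    """Assumed behaviour of the shared parse_bool_query helper."""
    return str(value).strip().lower() in ('true', '1', 'yes')


def optional_bool(raw):
    """Tri-state parse: None/blank -> None (no filter), else a strict bool."""
    if raw is None:
        return None
    text = str(raw).strip()
    if not text:
        return None
    return parse_bool_token(text)
```

This collapses three copies of a 4-line `if is_*_param:` block per endpoint into one call each, with identical filter behaviour.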
@@ -118,6 +195,7 @@ def api_resource_workcenter_status_matrix():
 
 @resource_bp.route('/detail', methods=['POST'])
+@_RESOURCE_DETAIL_RATE_LIMIT
 def api_resource_detail():
     """API: Resource detail with filters."""
     data = request.get_json() or {}
@@ -183,6 +261,7 @@ def api_resource_status_values():
 # ============================================================
 
 @resource_bp.route('/status')
+@_RESOURCE_STATUS_RATE_LIMIT
 def api_resource_status():
     """API: Get merged resource status from realtime cache.
@@ -197,20 +276,9 @@ def api_resource_status():
     wc_groups_param = request.args.get('workcenter_groups')
     workcenter_groups = wc_groups_param.split(',') if wc_groups_param else None
 
-    is_production = None
-    is_prod_param = request.args.get('is_production')
-    if is_prod_param:
-        is_production = is_prod_param.lower() in ('1', 'true', 'yes')
-    is_key = None
-    is_key_param = request.args.get('is_key')
-    if is_key_param:
-        is_key = is_key_param.lower() in ('1', 'true', 'yes')
-    is_monitor = None
-    is_monitor_param = request.args.get('is_monitor')
-    if is_monitor_param:
-        is_monitor = is_monitor_param.lower() in ('1', 'true', 'yes')
+    is_production = _optional_bool_arg('is_production')
+    is_key = _optional_bool_arg('is_key')
+    is_monitor = _optional_bool_arg('is_monitor')
 
     status_cats_param = request.args.get('status_categories')
     status_categories = status_cats_param.split(',') if status_cats_param else None
@@ -260,6 +328,7 @@ def api_resource_status_options():
 
 @resource_bp.route('/status/summary')
+@_RESOURCE_STATUS_RATE_LIMIT
 def api_resource_status_summary():
     """API: Get resource status summary statistics.
@@ -269,20 +338,9 @@ def api_resource_status_summary():
     wc_groups_param = request.args.get('workcenter_groups')
     workcenter_groups = wc_groups_param.split(',') if wc_groups_param else None
 
-    is_production = None
-    is_prod_param = request.args.get('is_production')
-    if is_prod_param:
-        is_production = is_prod_param.lower() in ('1', 'true', 'yes')
-    is_key = None
-    is_key_param = request.args.get('is_key')
-    if is_key_param:
-        is_key = is_key_param.lower() in ('1', 'true', 'yes')
-    is_monitor = None
-    is_monitor_param = request.args.get('is_monitor')
-    if is_monitor_param:
-        is_monitor = is_monitor_param.lower() in ('1', 'true', 'yes')
+    is_production = _optional_bool_arg('is_production')
+    is_key = _optional_bool_arg('is_key')
+    is_monitor = _optional_bool_arg('is_monitor')
 
     try:
         data = get_resource_status_summary(
@@ -301,6 +359,7 @@ def api_resource_status_summary():
 
 @resource_bp.route('/status/matrix')
+@_RESOURCE_STATUS_RATE_LIMIT
 def api_resource_status_matrix():
     """API: Get workcenter × status matrix.
@@ -309,20 +368,9 @@ def api_resource_status_matrix():
         is_key: Filter by key equipment
         is_monitor: Filter by monitor equipment
     """
-    is_production = None
-    is_prod_param = request.args.get('is_production')
-    if is_prod_param:
-        is_production = is_prod_param.lower() in ('1', 'true', 'yes')
-    is_key = None
-    is_key_param = request.args.get('is_key')
-    if is_key_param:
-        is_key = is_key_param.lower() in ('1', 'true', 'yes')
-    is_monitor = None
-    is_monitor_param = request.args.get('is_monitor')
-    if is_monitor_param:
-        is_monitor = is_monitor_param.lower() in ('1', 'true', 'yes')
+    is_production = _optional_bool_arg('is_production')
+    is_key = _optional_bool_arg('is_key')
+    is_monitor = _optional_bool_arg('is_monitor')
 
     try:
         data = get_workcenter_status_matrix(
@@ -7,6 +7,8 @@ Uses DWH.DW_MES_LOT_V view for real-time WIP data.
 from flask import Blueprint, jsonify, request
 
+from mes_dashboard.core.rate_limit import configured_rate_limit
+from mes_dashboard.core.utils import parse_bool_query
 from mes_dashboard.services.wip_service import (
     get_wip_summary,
     get_wip_matrix,
@@ -24,10 +26,21 @@ from mes_dashboard.services.wip_service import (
 # Create Blueprint
 wip_bp = Blueprint('wip', __name__, url_prefix='/api/wip')
 
+_WIP_MATRIX_RATE_LIMIT = configured_rate_limit(
+    bucket="wip-overview-matrix",
+    max_attempts_env="WIP_MATRIX_RATE_LIMIT_MAX_REQUESTS",
+    window_seconds_env="WIP_MATRIX_RATE_LIMIT_WINDOW_SECONDS",
+    default_max_attempts=120,
+    default_window_seconds=60,
+)
 
-def _parse_bool(value: str) -> bool:
-    """Parse boolean from query string."""
-    return value.lower() in ('true', '1', 'yes') if value else False
+_WIP_DETAIL_RATE_LIMIT = configured_rate_limit(
+    bucket="wip-detail",
+    max_attempts_env="WIP_DETAIL_RATE_LIMIT_MAX_REQUESTS",
+    window_seconds_env="WIP_DETAIL_RATE_LIMIT_WINDOW_SECONDS",
+    default_max_attempts=90,
+    default_window_seconds=60,
+)
 
 # ============================================================
@@ -52,7 +65,7 @@ def api_overview_summary():
     lotid = request.args.get('lotid', '').strip() or None
     package = request.args.get('package', '').strip() or None
     pj_type = request.args.get('type', '').strip() or None
-    include_dummy = _parse_bool(request.args.get('include_dummy', ''))
+    include_dummy = parse_bool_query(request.args.get('include_dummy'))
 
     result = get_wip_summary(
         include_dummy=include_dummy,
@@ -67,6 +80,7 @@ def api_overview_summary():
 
 @wip_bp.route('/overview/matrix')
+@_WIP_MATRIX_RATE_LIMIT
 def api_overview_matrix():
     """API: Get workcenter x product line matrix for overview dashboard.
@@ -88,7 +102,7 @@ def api_overview_matrix():
     lotid = request.args.get('lotid', '').strip() or None
     package = request.args.get('package', '').strip() or None
     pj_type = request.args.get('type', '').strip() or None
-    include_dummy = _parse_bool(request.args.get('include_dummy', ''))
+    include_dummy = parse_bool_query(request.args.get('include_dummy'))
     status = request.args.get('status', '').strip().upper() or None
     hold_type = request.args.get('hold_type', '').strip().lower() or None
@@ -134,7 +148,7 @@ def api_overview_hold():
     """
     workorder = request.args.get('workorder', '').strip() or None
     lotid = request.args.get('lotid', '').strip() or None
-    include_dummy = _parse_bool(request.args.get('include_dummy', ''))
+    include_dummy = parse_bool_query(request.args.get('include_dummy'))
 
     result = get_wip_hold_summary(
         include_dummy=include_dummy,
@@ -151,6 +165,7 @@ def api_overview_hold():
 # ============================================================
 
 @wip_bp.route('/detail/<workcenter>')
+@_WIP_DETAIL_RATE_LIMIT
 def api_detail(workcenter: str):
     """API: Get WIP detail for a specific workcenter group.
@@ -176,12 +191,17 @@ def api_detail(workcenter: str):
     hold_type = request.args.get('hold_type', '').strip().lower() or None
     workorder = request.args.get('workorder', '').strip() or None
     lotid = request.args.get('lotid', '').strip() or None
-    include_dummy = _parse_bool(request.args.get('include_dummy', ''))
+    include_dummy = parse_bool_query(request.args.get('include_dummy'))
     page = request.args.get('page', 1, type=int)
-    page_size = min(request.args.get('page_size', 100, type=int), 500)
-    if page < 1:
-        page = 1
+    page_size = request.args.get('page_size', 100, type=int)
+    if page is None:
+        page = 1
+    if page_size is None:
+        page_size = 100
+    page = max(page, 1)
+    page_size = max(1, min(page_size, 500))
 
     # Validate status parameter
     if status and status not in ('RUN', 'QUEUE', 'HOLD'):
@@ -245,7 +265,7 @@ def api_meta_workcenters():
     Returns:
         JSON with list of {name, lot_count} sorted by sequence
     """
-    include_dummy = _parse_bool(request.args.get('include_dummy', ''))
+    include_dummy = parse_bool_query(request.args.get('include_dummy'))
 
     result = get_workcenters(include_dummy=include_dummy)
     if result is not None:
@@ -263,7 +283,7 @@ def api_meta_packages():
     Returns:
         JSON with list of {name, lot_count} sorted by count desc
     """
-    include_dummy = _parse_bool(request.args.get('include_dummy', ''))
+    include_dummy = parse_bool_query(request.args.get('include_dummy'))
 
     result = get_packages(include_dummy=include_dummy)
     if result is not None:
@@ -293,7 +313,7 @@ def api_meta_search():
     search_field = request.args.get('field', '').strip().lower()
     q = request.args.get('q', '').strip()
     limit = min(request.args.get('limit', 20, type=int), 50)
-    include_dummy = _parse_bool(request.args.get('include_dummy', ''))
+    include_dummy = parse_bool_query(request.args.get('include_dummy'))
 
     # Cross-filter parameters
     workorder = request.args.get('workorder', '').strip() or None
@@ -5,23 +5,83 @@ from __future__ import annotations
 import logging
 import os
+from urllib.parse import urlparse
 
 import requests
 
 logger = logging.getLogger(__name__)
 
-# Configuration - MUST be set in .env file
-LDAP_API_BASE = os.environ.get("LDAP_API_URL", "")
-ADMIN_EMAILS = os.environ.get("ADMIN_EMAILS", "").lower().split(",")
-
 # Timeout for LDAP API requests
 LDAP_TIMEOUT = 10
 
+# Configuration - MUST be set in .env file
+ADMIN_EMAILS = os.environ.get("ADMIN_EMAILS", "").lower().split(",")
+
 # Local authentication configuration (for development/testing)
 LOCAL_AUTH_ENABLED = os.environ.get("LOCAL_AUTH_ENABLED", "false").lower() in ("true", "1", "yes")
 LOCAL_AUTH_USERNAME = os.environ.get("LOCAL_AUTH_USERNAME", "")
 LOCAL_AUTH_PASSWORD = os.environ.get("LOCAL_AUTH_PASSWORD", "")
+# LDAP endpoint hardening configuration
+LDAP_API_URL = os.environ.get("LDAP_API_URL", "").strip()
+LDAP_ALLOWED_HOSTS_RAW = os.environ.get("LDAP_ALLOWED_HOSTS", "").strip()
+
+
+def _normalize_host(host: str) -> str:
+    return host.strip().lower().rstrip(".")
+
+
+def _parse_allowed_hosts(raw_hosts: str) -> tuple[str, ...]:
+    if not raw_hosts:
+        return tuple()
+    hosts: list[str] = []
+    for raw in raw_hosts.split(","):
+        host = _normalize_host(raw)
+        if host:
+            hosts.append(host)
+    return tuple(hosts)
+
+
+def _validate_ldap_api_url(raw_url: str, allowed_hosts: tuple[str, ...]) -> tuple[str | None, str | None]:
+    """Validate LDAP API URL to prevent configuration-based SSRF risks."""
+    url = (raw_url or "").strip()
+    if not url:
+        return None, "LDAP_API_URL is missing"
+    parsed = urlparse(url)
+    scheme = (parsed.scheme or "").lower()
+    host = _normalize_host(parsed.hostname or "")
+    if not host:
+        return None, f"LDAP_API_URL has no valid host: {url!r}"
+    if scheme != "https":
+        return None, f"LDAP_API_URL must use HTTPS: {url!r}"
+    effective_allowlist = allowed_hosts or (host,)
+    if host not in effective_allowlist:
+        return None, (
+            f"LDAP_API_URL host {host!r} is not allowlisted. "
+            f"Allowed hosts: {', '.join(effective_allowlist)}"
+        )
+    return url.rstrip("/"), None
+
+
+def _resolve_ldap_config() -> tuple[str | None, str | None, tuple[str, ...]]:
+    allowed_hosts = _parse_allowed_hosts(LDAP_ALLOWED_HOSTS_RAW)
+    api_base, error = _validate_ldap_api_url(LDAP_API_URL, allowed_hosts)
+    if api_base:
+        effective_hosts = allowed_hosts or (_normalize_host(urlparse(api_base).hostname or ""),)
+        return api_base, None, effective_hosts
+    return None, error, allowed_hosts
+
+
+LDAP_API_BASE, LDAP_CONFIG_ERROR, LDAP_ALLOWED_HOSTS = _resolve_ldap_config()
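The validation chain can be exercised in isolation; this condensed version mirrors `_validate_ldap_api_url` (error strings shortened for brevity):

```python
from urllib.parse import urlparse


def validate_ldap_url(raw_url, allowed_hosts=()):
    """HTTPS + host-allowlist check; returns (base_url, error)."""
    url = (raw_url or "").strip()
    if not url:
        return None, "missing"
    parsed = urlparse(url)
    host = (parsed.hostname or "").strip().lower().rstrip(".")
    if not host:
        return None, "no valid host"
    if (parsed.scheme or "").lower() != "https":
        return None, "must use HTTPS"
    # With no explicit allowlist, pin to the configured host itself.
    if host not in (allowed_hosts or (host,)):
        return None, "host not allowlisted"
    return url.rstrip("/"), None
```

Because the check runs once at import time and its result gates every `authenticate()` call, a later typo or hostile edit to `LDAP_API_URL` (say, pointing it at an internal metadata endpoint) fails closed instead of redirecting credentials.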
 
 def _authenticate_local(username: str, password: str) -> dict | None:
     """Authenticate using local environment credentials.
@@ -77,6 +137,14 @@ def authenticate(username: str, password: str, domain: str = "PANJIT") -> dict |
# This ensures local-only mode when LOCAL_AUTH_ENABLED is true # This ensures local-only mode when LOCAL_AUTH_ENABLED is true
return None return None
if LDAP_CONFIG_ERROR:
logger.error("LDAP authentication blocked: %s", LDAP_CONFIG_ERROR)
return None
if not LDAP_API_BASE:
logger.error("LDAP authentication blocked: LDAP_API_URL is not configured")
return None
    # LDAP authentication
    try:
        response = requests.post(
@@ -121,4 +189,5 @@ def is_admin(user: dict) -> bool:
        return True
    user_mail = user.get("mail", "").lower().strip()
    allowed_emails = [e.strip() for e in ADMIN_EMAILS if e and e.strip()]
    return user_mail in allowed_emails
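The filtered-allowlist check above can be sketched as a standalone helper. `is_admin_email` is a hypothetical name; this sketch additionally lowercases the allowlist entries so the comparison is case-insensitive on both sides, a slight hardening over the hunk shown.

```python
def is_admin_email(user_mail: str, admin_emails: list[str]) -> bool:
    """Case-insensitive membership check against a whitespace/empty-filtered allowlist."""
    allowed = [e.strip().lower() for e in admin_emails if e and e.strip()]
    return (user_mail or "").lower().strip() in allowed
```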

View File

@@ -6,6 +6,7 @@ Data is loaded from database and cached in memory with periodic refresh.
""" """
import logging import logging
import os
import threading import threading
from datetime import datetime, timedelta from datetime import datetime, timedelta
from typing import Optional, Dict, List, Any from typing import Optional, Dict, List, Any
@@ -19,8 +20,8 @@ logger = logging.getLogger('mes_dashboard.filter_cache')
# ============================================================
CACHE_TTL_SECONDS = 3600  # 1 hour cache TTL
WIP_VIEW = os.getenv("FILTER_CACHE_WIP_VIEW", "DWH.DW_MES_LOT_V")
SPEC_WORKCENTER_VIEW = os.getenv("FILTER_CACHE_SPEC_WORKCENTER_VIEW", "DWH.DW_MES_SPEC_WORKCENTER_V")

# ============================================================
# Cache Storage

View File

@@ -5,6 +5,8 @@ from __future__ import annotations
import json
import logging
import os
import tempfile
from pathlib import Path
from threading import Lock
@@ -37,15 +39,33 @@ def _load() -> dict:
def _save(data: dict) -> None:
    """Save page status configuration."""
    global _cache
    tmp_path: Path | None = None
    try:
        DATA_FILE.parent.mkdir(parents=True, exist_ok=True)
        payload = json.dumps(data, ensure_ascii=False, indent=2)

        # Atomic write: write to sibling temp file, then replace target.
        with tempfile.NamedTemporaryFile(
            mode="w",
            encoding="utf-8",
            dir=str(DATA_FILE.parent),
            prefix=f".{DATA_FILE.name}.",
            suffix=".tmp",
            delete=False,
        ) as tmp:
            tmp.write(payload)
            tmp.flush()
            os.fsync(tmp.fileno())
            tmp_path = Path(tmp.name)
        os.replace(tmp_path, DATA_FILE)

        _cache = data
        logger.debug("Saved page status to %s", DATA_FILE)
    except OSError as e:
        if tmp_path is not None:
            try:
                tmp_path.unlink(missing_ok=True)
            except OSError:
                pass
        logger.error("Failed to save page status: %s", e)
        raise

View File

@@ -7,10 +7,12 @@ Data is synced periodically (default 5 minutes) and stored in Redis.
import json
import logging
import os
import threading
import time
from collections import OrderedDict
from datetime import datetime
from typing import Any

from mes_dashboard.core.database import read_sql_df
from mes_dashboard.core.redis_client import (
@@ -26,6 +28,7 @@ from mes_dashboard.config.constants import (
    EQUIPMENT_STATUS_META_COUNT_KEY,
    STATUS_CATEGORY_MAP,
)
from mes_dashboard.services.sql_fragments import EQUIPMENT_STATUS_SELECT_SQL

logger = logging.getLogger('mes_dashboard.realtime_equipment_cache')
@@ -33,29 +36,56 @@ logger = logging.getLogger('mes_dashboard.realtime_equipment_cache')
# Process-Level Cache (Prevents redundant JSON parsing)
# ============================================================

DEFAULT_PROCESS_CACHE_TTL_SECONDS = 30
DEFAULT_PROCESS_CACHE_MAX_SIZE = 32
DEFAULT_LOOKUP_TTL_SECONDS = 30


class _ProcessLevelCache:
    """Thread-safe process-level cache for parsed equipment status data."""

    def __init__(self, ttl_seconds: int = 30, max_size: int = 32):
        self._cache: OrderedDict[str, tuple[list[dict[str, Any]], float]] = OrderedDict()
        self._lock = threading.Lock()
        self._ttl = max(int(ttl_seconds), 1)
        self._max_size = max(int(max_size), 1)

    @property
    def max_size(self) -> int:
        return self._max_size

    def _evict_expired_locked(self, now: float) -> None:
        stale_keys = [
            key for key, (_, timestamp) in self._cache.items()
            if now - timestamp > self._ttl
        ]
        for key in stale_keys:
            self._cache.pop(key, None)

    def get(self, key: str) -> list[dict[str, Any]] | None:
        """Get cached data if not expired."""
        with self._lock:
            payload = self._cache.get(key)
            if payload is None:
                return None
            data, timestamp = payload
            now = time.time()
            if now - timestamp > self._ttl:
                self._cache.pop(key, None)
                return None
            self._cache.move_to_end(key, last=True)
            return data

    def set(self, key: str, data: list[dict[str, Any]]) -> None:
        """Cache data with current timestamp."""
        with self._lock:
            now = time.time()
            self._evict_expired_locked(now)
            if key in self._cache:
                self._cache.pop(key, None)
            elif len(self._cache) >= self._max_size:
                self._cache.popitem(last=False)
            self._cache[key] = (data, now)
            self._cache.move_to_end(key, last=True)

    def invalidate(self, key: str) -> None:
        """Remove a key from cache."""
@@ -63,20 +93,38 @@ class _ProcessLevelCache:
            self._cache.pop(key, None)
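The TTL-plus-bounded-LRU semantics of `_ProcessLevelCache` can be illustrated with a trimmed-down sketch (class name and sizes are illustrative, and `invalidate`/expiry sweeping are omitted): a `get` marks the entry most-recently-used, and a `set` at capacity evicts the least-recently-used entry from the front of the `OrderedDict`.

```python
import threading
import time
from collections import OrderedDict

class TTLLRUCache:
    """Minimal sketch of the TTL + bounded-LRU policy used above."""
    def __init__(self, ttl_seconds: float = 30, max_size: int = 2):
        self._cache: OrderedDict[str, tuple[object, float]] = OrderedDict()
        self._lock = threading.Lock()
        self._ttl = ttl_seconds
        self._max_size = max(int(max_size), 1)

    def get(self, key: str):
        with self._lock:
            payload = self._cache.get(key)
            if payload is None:
                return None
            value, ts = payload
            if time.time() - ts > self._ttl:
                self._cache.pop(key, None)  # lazily drop the expired entry
                return None
            self._cache.move_to_end(key)  # mark as most recently used
            return value

    def set(self, key: str, value) -> None:
        with self._lock:
            if key in self._cache:
                self._cache.pop(key)  # re-insert to refresh recency
            elif len(self._cache) >= self._max_size:
                self._cache.popitem(last=False)  # evict least recently used
            self._cache[key] = (value, time.time())
```

With `max_size=2`, touching one key protects it from the next eviction:

```python
cache = TTLLRUCache(ttl_seconds=60, max_size=2)
cache.set("a", 1)
cache.set("b", 2)
cache.get("a")     # "b" is now the LRU entry
cache.set("c", 3)  # evicts "b", not "a"
```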
def _resolve_cache_max_size(env_name: str, default: int) -> int:
    value = os.getenv(env_name)
    if value is None:
        return max(int(default), 1)
    try:
        return max(int(value), 1)
    except (TypeError, ValueError):
        return max(int(default), 1)


# Global process-level cache for equipment status (30s TTL)
PROCESS_CACHE_MAX_SIZE = _resolve_cache_max_size("PROCESS_CACHE_MAX_SIZE", DEFAULT_PROCESS_CACHE_MAX_SIZE)
EQUIPMENT_PROCESS_CACHE_MAX_SIZE = _resolve_cache_max_size(
    "EQUIPMENT_PROCESS_CACHE_MAX_SIZE",
    PROCESS_CACHE_MAX_SIZE,
)
_equipment_status_cache = _ProcessLevelCache(
    ttl_seconds=DEFAULT_PROCESS_CACHE_TTL_SECONDS,
    max_size=EQUIPMENT_PROCESS_CACHE_MAX_SIZE,
)
_equipment_status_parse_lock = threading.Lock()
_equipment_lookup_lock = threading.Lock()
_equipment_status_lookup: dict[str, dict[str, Any]] = {}
_equipment_status_lookup_built_at: str | None = None
_equipment_status_lookup_ts: float = 0.0
LOOKUP_TTL_SECONDS = DEFAULT_LOOKUP_TTL_SECONDS
# ============================================================
# Module State
# ============================================================
_SYNC_THREAD: threading.Thread | None = None
_STOP_EVENT = threading.Event()
_SYNC_LOCK = threading.Lock()
@@ -85,40 +133,14 @@ _SYNC_LOCK = threading.Lock()
# Oracle Query
# ============================================================

def _load_equipment_status_from_oracle() -> list[dict[str, Any]] | None:
    """Query DW_MES_EQUIPMENTSTATUS_WIP_V from Oracle.

    Returns:
        List of equipment status records, or None if query fails.
    """
sql = """
SELECT
RESOURCEID,
EQUIPMENTID,
OBJECTCATEGORY,
EQUIPMENTASSETSSTATUS,
EQUIPMENTASSETSSTATUSREASON,
JOBORDER,
JOBMODEL,
JOBSTAGE,
JOBID,
JOBSTATUS,
CREATEDATE,
CREATEUSERNAME,
CREATEUSER,
TECHNICIANUSERNAME,
TECHNICIANUSER,
SYMPTOMCODE,
CAUSECODE,
REPAIRCODE,
RUNCARDLOTID,
LOTTRACKINQTY_PCS,
LOTTRACKINTIME,
LOTTRACKINEMPLOYEE
FROM DWH.DW_MES_EQUIPMENTSTATUS_WIP_V
"""
try: try:
df = read_sql_df(sql) df = read_sql_df(EQUIPMENT_STATUS_SELECT_SQL)
if df is None or df.empty: if df is None or df.empty:
logger.warning("No data returned from DW_MES_EQUIPMENTSTATUS_WIP_V") logger.warning("No data returned from DW_MES_EQUIPMENTSTATUS_WIP_V")
return [] return []
@@ -147,7 +169,7 @@ def _load_equipment_status_from_oracle() -> Optional[List[Dict[str, Any]]]:
# Data Aggregation
# ============================================================

def _classify_status(status: str | None) -> str:
    """Classify equipment status into category.

    Args:
@@ -183,7 +205,7 @@ def _is_valid_value(value) -> bool:
    return True


def _aggregate_by_resourceid(records: list[dict[str, Any]]) -> list[dict[str, Any]]:
    """Aggregate equipment status records by RESOURCEID.

    For each RESOURCEID:
@@ -203,7 +225,7 @@ def _aggregate_by_resourceid(records: List[Dict[str, Any]]) -> List[Dict[str, An
        return []

    # Group by RESOURCEID
    grouped: dict[str, list[dict[str, Any]]] = {}
    for record in records:
        resource_id = record.get('RESOURCEID')
        if resource_id:
@@ -272,7 +294,7 @@ def _aggregate_by_resourceid(records: List[Dict[str, Any]]) -> List[Dict[str, An
            'CAUSECODE': first.get('CAUSECODE'),
            'REPAIRCODE': first.get('REPAIRCODE'),
            # LOT related fields
            'LOT_COUNT': len(seen_lots) if seen_lots else len(group),  # distinct RUNCARDLOTIDs, else row count
            'LOT_DETAILS': lot_details,  # LOT details for tooltip
            'TOTAL_TRACKIN_QTY': total_qty,
            'LATEST_TRACKIN_TIME': latest_trackin,
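The `LOT_COUNT` fallback introduced in this hunk (distinct lot ids when any exist, otherwise the raw row count) can be sketched in isolation. `aggregate_lot_counts` is a hypothetical simplification that keeps only the grouping and counting, dropping the other aggregated fields:

```python
def aggregate_lot_counts(records: list[dict]) -> dict[str, int]:
    """Group by RESOURCEID; count distinct RUNCARDLOTID, falling back to row count."""
    grouped: dict[str, list[dict]] = {}
    for record in records:
        resource_id = record.get("RESOURCEID")
        if resource_id:
            grouped.setdefault(resource_id, []).append(record)
    counts: dict[str, int] = {}
    for resource_id, group in grouped.items():
        seen_lots = {r["RUNCARDLOTID"] for r in group if r.get("RUNCARDLOTID")}
        # Rows with no lot id at all still register as activity (count = rows).
        counts[resource_id] = len(seen_lots) if seen_lots else len(group)
    return counts
```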
@@ -286,7 +308,7 @@ def _aggregate_by_resourceid(records: List[Dict[str, Any]]) -> List[Dict[str, An
# Redis Storage
# ============================================================

def _save_to_redis(aggregated: list[dict[str, Any]]) -> bool:
    """Save aggregated equipment status to Redis.

    Uses pipeline for atomic update of all keys.
@@ -354,7 +376,7 @@ def _invalidate_equipment_status_lookup() -> None:
    _equipment_status_lookup_ts = 0.0


def get_equipment_status_lookup() -> dict[str, dict[str, Any]]:
    """Get RESOURCEID -> status record lookup with process-level caching."""
    global _equipment_status_lookup, _equipment_status_lookup_built_at, _equipment_status_lookup_ts
@@ -375,7 +397,7 @@ def get_equipment_status_lookup() -> Dict[str, Dict[str, Any]]:
    _equipment_status_lookup_ts = time.time()
    return _equipment_status_lookup


def get_all_equipment_status() -> list[dict[str, Any]]:
    """Get all equipment status from cache with process-level caching.

    Uses a two-tier cache strategy:
@@ -433,7 +455,7 @@ def get_all_equipment_status() -> List[Dict[str, Any]]:
        return []


def get_equipment_status_by_id(resource_id: str) -> dict[str, Any] | None:
    """Get equipment status by RESOURCEID.

    Uses index hash for O(1) lookup.
@@ -485,7 +507,7 @@ def get_equipment_status_by_id(resource_id: str) -> Optional[Dict[str, Any]]:
    return None


def get_equipment_status_by_ids(resource_ids: list[str]) -> list[dict[str, Any]]:
    """Get equipment status for multiple RESOURCEIDs.

    Args:
@@ -540,7 +562,7 @@ def get_equipment_status_by_ids(resource_ids: List[str]) -> List[Dict[str, Any]]
        return []


def get_equipment_status_cache_status() -> dict[str, Any]:
    """Get equipment status cache status.

    Returns:

View File

@@ -13,8 +13,9 @@ import logging
import os
import threading
import time
from collections import OrderedDict
from datetime import datetime
from typing import Any

import pandas as pd
@@ -31,9 +32,27 @@ from mes_dashboard.config.constants import (
    EQUIPMENT_TYPE_FILTER,
)
from mes_dashboard.sql import QueryBuilder
from mes_dashboard.services.sql_fragments import (
    RESOURCE_BASE_SELECT_TEMPLATE,
    RESOURCE_VERSION_SELECT_TEMPLATE,
)

logger = logging.getLogger('mes_dashboard.resource_cache')

ResourceRecord = dict[str, Any]
RowPosition = int
PositionBucket = dict[str, list[RowPosition]]
FlagBuckets = dict[str, list[RowPosition]]
ResourceIndex = dict[str, Any]

DEFAULT_PROCESS_CACHE_TTL_SECONDS = 30
DEFAULT_PROCESS_CACHE_MAX_SIZE = 32
DEFAULT_RESOURCE_SYNC_INTERVAL_SECONDS = 14_400  # 4 hours
DEFAULT_INDEX_VERSION_CHECK_INTERVAL_SECONDS = 5

RESOURCE_DF_CACHE_KEY = "resource_data"
TRUE_BUCKET = "1"
FALSE_BUCKET = "0"

# ============================================================
# Process-Level Cache (Prevents redundant JSON parsing)
# ============================================================
@@ -41,26 +60,49 @@ logger = logging.getLogger('mes_dashboard.resource_cache')
class _ProcessLevelCache:
    """Thread-safe process-level cache for parsed DataFrames."""

    def __init__(self, ttl_seconds: int = DEFAULT_PROCESS_CACHE_TTL_SECONDS, max_size: int = DEFAULT_PROCESS_CACHE_MAX_SIZE):
        self._cache: OrderedDict[str, tuple[pd.DataFrame, float]] = OrderedDict()
        self._lock = threading.Lock()
        self._ttl = max(int(ttl_seconds), 1)
        self._max_size = max(int(max_size), 1)

    @property
    def max_size(self) -> int:
        return self._max_size

    def _evict_expired_locked(self, now: float) -> None:
        stale_keys = [
            key for key, (_, timestamp) in self._cache.items()
            if now - timestamp > self._ttl
        ]
        for key in stale_keys:
            self._cache.pop(key, None)

    def get(self, key: str) -> pd.DataFrame | None:
        """Get cached DataFrame if not expired."""
        with self._lock:
            payload = self._cache.get(key)
            if payload is None:
                return None
            df, timestamp = payload
            now = time.time()
            if now - timestamp > self._ttl:
                self._cache.pop(key, None)
                return None
            self._cache.move_to_end(key, last=True)
            return df

    def set(self, key: str, df: pd.DataFrame) -> None:
        """Cache a DataFrame with current timestamp."""
        with self._lock:
            now = time.time()
            self._evict_expired_locked(now)
            if key in self._cache:
                self._cache.pop(key, None)
            elif len(self._cache) >= self._max_size:
                self._cache.popitem(last=False)
            self._cache[key] = (df, now)
            self._cache.move_to_end(key, last=True)

    def invalidate(self, key: str) -> None:
        """Remove a key from cache."""
@@ -68,11 +110,29 @@ class _ProcessLevelCache:
            self._cache.pop(key, None)
def _resolve_cache_max_size(env_name: str, default: int) -> int:
    value = os.getenv(env_name)
    if value is None:
        return max(int(default), 1)
    try:
        return max(int(value), 1)
    except (TypeError, ValueError):
        return max(int(default), 1)


# Global process-level cache for resource data (30s TTL)
PROCESS_CACHE_MAX_SIZE = _resolve_cache_max_size("PROCESS_CACHE_MAX_SIZE", DEFAULT_PROCESS_CACHE_MAX_SIZE)
RESOURCE_PROCESS_CACHE_MAX_SIZE = _resolve_cache_max_size(
    "RESOURCE_PROCESS_CACHE_MAX_SIZE",
    PROCESS_CACHE_MAX_SIZE,
)
_resource_df_cache = _ProcessLevelCache(
    ttl_seconds=DEFAULT_PROCESS_CACHE_TTL_SECONDS,
    max_size=RESOURCE_PROCESS_CACHE_MAX_SIZE,
)
_resource_parse_lock = threading.Lock()
_resource_index_lock = threading.Lock()
_resource_index: ResourceIndex = {
    "ready": False,
    "source": None,
    "version": None,
@@ -80,19 +140,27 @@ _resource_index: Dict[str, Any] = {
"built_at": None, "built_at": None,
"version_checked_at": 0.0, "version_checked_at": 0.0,
"count": 0, "count": 0,
"records": [], "all_positions": [],
"by_resource_id": {}, "by_resource_id": {},
"by_workcenter": {}, "by_workcenter": {},
"by_family": {}, "by_family": {},
"by_department": {}, "by_department": {},
"by_location": {}, "by_location": {},
"by_is_production": {"1": [], "0": []}, "by_is_production": {TRUE_BUCKET: [], FALSE_BUCKET: []},
"by_is_key": {"1": [], "0": []}, "by_is_key": {TRUE_BUCKET: [], FALSE_BUCKET: []},
"by_is_monitor": {"1": [], "0": []}, "by_is_monitor": {TRUE_BUCKET: [], FALSE_BUCKET: []},
"memory": {
"frame_bytes": 0,
"index_bytes": 0,
"records_json_bytes": 0,
"bucket_entries": 0,
"amplification_ratio": 0.0,
"representation": "dataframe+row-index",
},
} }
def _new_empty_index() -> ResourceIndex:
    return {
        "ready": False,
        "source": None,
@@ -101,15 +169,23 @@ def _new_empty_index() -> Dict[str, Any]:
"built_at": None, "built_at": None,
"version_checked_at": 0.0, "version_checked_at": 0.0,
"count": 0, "count": 0,
"records": [], "all_positions": [],
"by_resource_id": {}, "by_resource_id": {},
"by_workcenter": {}, "by_workcenter": {},
"by_family": {}, "by_family": {},
"by_department": {}, "by_department": {},
"by_location": {}, "by_location": {},
"by_is_production": {"1": [], "0": []}, "by_is_production": {TRUE_BUCKET: [], FALSE_BUCKET: []},
"by_is_key": {"1": [], "0": []}, "by_is_key": {TRUE_BUCKET: [], FALSE_BUCKET: []},
"by_is_monitor": {"1": [], "0": []}, "by_is_monitor": {TRUE_BUCKET: [], FALSE_BUCKET: []},
"memory": {
"frame_bytes": 0,
"index_bytes": 0,
"records_json_bytes": 0,
"bucket_entries": 0,
"amplification_ratio": 0.0,
"representation": "dataframe+row-index",
},
} }
@@ -129,23 +205,59 @@ def _is_truthy_flag(value: Any) -> bool:
    return False


def _bucket_append(bucket: PositionBucket, key: Any, row_position: RowPosition) -> None:
    if key is None:
        return
    if isinstance(key, float) and pd.isna(key):
        return
    key_str = str(key)
    bucket.setdefault(key_str, []).append(int(row_position))
def _estimate_dataframe_bytes(df: pd.DataFrame) -> int:
    try:
        return int(df.memory_usage(index=True, deep=True).sum())
    except Exception:
        return 0


def _estimate_index_bytes(index: ResourceIndex) -> int:
    """Estimate lightweight index memory footprint for telemetry."""
    by_resource_id = index.get("by_resource_id", {})
    by_workcenter = index.get("by_workcenter", {})
    by_family = index.get("by_family", {})
    by_department = index.get("by_department", {})
    by_location = index.get("by_location", {})
    by_is_production = index.get("by_is_production", {TRUE_BUCKET: [], FALSE_BUCKET: []})
    by_is_key = index.get("by_is_key", {TRUE_BUCKET: [], FALSE_BUCKET: []})
    by_is_monitor = index.get("by_is_monitor", {TRUE_BUCKET: [], FALSE_BUCKET: []})
    all_positions = index.get("all_positions", [])
    position_entries = (
        len(all_positions)
        + sum(len(v) for v in by_workcenter.values())
        + sum(len(v) for v in by_family.values())
        + sum(len(v) for v in by_department.values())
        + sum(len(v) for v in by_location.values())
        + len(by_is_production.get(TRUE_BUCKET, []))
        + len(by_is_production.get(FALSE_BUCKET, []))
        + len(by_is_key.get(TRUE_BUCKET, []))
        + len(by_is_key.get(FALSE_BUCKET, []))
        + len(by_is_monitor.get(TRUE_BUCKET, []))
        + len(by_is_monitor.get(FALSE_BUCKET, []))
    )
    # Approximate integer/list/dict overhead; telemetry only needs directional signal.
    return int(position_entries * 8 + len(by_resource_id) * 64)


def _build_resource_index(
    df: pd.DataFrame,
    *,
    source: str,
    version: str | None,
    updated_at: str | None,
) -> ResourceIndex:
    normalized_df = df.reset_index(drop=True)
    index = _new_empty_index()
    index["ready"] = True
    index["source"] = source
@@ -153,31 +265,58 @@ def _build_resource_index(
index["updated_at"] = updated_at index["updated_at"] = updated_at
index["built_at"] = datetime.now().isoformat() index["built_at"] = datetime.now().isoformat()
index["version_checked_at"] = time.time() index["version_checked_at"] = time.time()
index["count"] = len(records) index["count"] = len(normalized_df)
index["records"] = records index["all_positions"] = list(range(len(normalized_df)))
for record in records: for row_position, record in normalized_df.iterrows():
resource_id = record.get("RESOURCEID") resource_id = record.get("RESOURCEID")
if resource_id is not None and not (isinstance(resource_id, float) and pd.isna(resource_id)): if resource_id is not None and not (isinstance(resource_id, float) and pd.isna(resource_id)):
index["by_resource_id"][str(resource_id)] = record index["by_resource_id"][str(resource_id)] = int(row_position)
_bucket_append(index["by_workcenter"], record.get("WORKCENTERNAME"), record) _bucket_append(index["by_workcenter"], record.get("WORKCENTERNAME"), row_position)
_bucket_append(index["by_family"], record.get("RESOURCEFAMILYNAME"), record) _bucket_append(index["by_family"], record.get("RESOURCEFAMILYNAME"), row_position)
_bucket_append(index["by_department"], record.get("PJ_DEPARTMENT"), record) _bucket_append(index["by_department"], record.get("PJ_DEPARTMENT"), row_position)
_bucket_append(index["by_location"], record.get("LOCATIONNAME"), record) _bucket_append(index["by_location"], record.get("LOCATIONNAME"), row_position)
index["by_is_production"]["1" if _is_truthy_flag(record.get("PJ_ISPRODUCTION")) else "0"].append(record) index["by_is_production"][TRUE_BUCKET if _is_truthy_flag(record.get("PJ_ISPRODUCTION")) else FALSE_BUCKET].append(int(row_position))
index["by_is_key"]["1" if _is_truthy_flag(record.get("PJ_ISKEY")) else "0"].append(record) index["by_is_key"][TRUE_BUCKET if _is_truthy_flag(record.get("PJ_ISKEY")) else FALSE_BUCKET].append(int(row_position))
index["by_is_monitor"]["1" if _is_truthy_flag(record.get("PJ_ISMONITOR")) else "0"].append(record) index["by_is_monitor"][TRUE_BUCKET if _is_truthy_flag(record.get("PJ_ISMONITOR")) else FALSE_BUCKET].append(int(row_position))
bucket_entries = (
sum(len(v) for v in index["by_workcenter"].values())
+ sum(len(v) for v in index["by_family"].values())
+ sum(len(v) for v in index["by_department"].values())
+ sum(len(v) for v in index["by_location"].values())
+ len(index["by_is_production"][TRUE_BUCKET])
+ len(index["by_is_production"][FALSE_BUCKET])
+ len(index["by_is_key"][TRUE_BUCKET])
+ len(index["by_is_key"][FALSE_BUCKET])
+ len(index["by_is_monitor"][TRUE_BUCKET])
+ len(index["by_is_monitor"][FALSE_BUCKET])
)
frame_bytes = _estimate_dataframe_bytes(normalized_df)
index_bytes = _estimate_index_bytes(index)
amplification_ratio = round(
(frame_bytes + index_bytes) / max(frame_bytes, 1),
4,
)
index["memory"] = {
"frame_bytes": int(frame_bytes),
"index_bytes": int(index_bytes),
"records_json_bytes": 0, # kept for backward-compatible telemetry shape
"bucket_entries": int(bucket_entries),
"amplification_ratio": amplification_ratio,
"representation": "dataframe+row-index",
}
return index return index
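The memory telemetry math in this hunk reduces to two small formulas; the sketch below extracts them with illustrative names (`estimate_index_bytes` here takes plain position buckets rather than the module's `ResourceIndex` dict). The ratio answers "how much larger is frame plus index than the frame alone", so 1.0 means the row-position index adds no overhead:

```python
def estimate_index_bytes(position_buckets: list[list[int]], id_map_size: int) -> int:
    """Directional estimate: ~8 bytes per stored row position, ~64 per id-map entry."""
    position_entries = sum(len(bucket) for bucket in position_buckets)
    return position_entries * 8 + id_map_size * 64

def amplification_ratio(frame_bytes: int, index_bytes: int) -> float:
    """Cache footprint relative to the raw frame alone (guarded against zero)."""
    return round((frame_bytes + index_bytes) / max(frame_bytes, 1), 4)
```

This is exactly why the refactor replaced full `records` copies with row positions: a duplicated record list can push the ratio toward 2.0 or beyond, while integer positions keep it close to 1.0.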
def _index_matches(
    current: ResourceIndex,
    *,
    source: str,
    version: str | None,
    row_count: int,
) -> bool:
    if not current.get("ready"):
@@ -193,8 +332,8 @@ def _ensure_resource_index(
    df: pd.DataFrame,
    *,
    source: str,
    version: str | None = None,
    updated_at: str | None = None,
) -> None:
    global _resource_index
    with _resource_index_lock:
@@ -212,12 +351,12 @@ def _ensure_resource_index(
        _resource_index = new_index


def _get_resource_index() -> ResourceIndex:
    with _resource_index_lock:
        return _resource_index


def _get_cache_meta(client=None) -> tuple[str | None, str | None]:
    redis_client = client or get_redis_client()
    if redis_client is None:
        return None, None
@@ -244,31 +383,59 @@ def _redis_data_available(client=None) -> bool:
    return False


def _pick_bucket_positions(
    bucket: PositionBucket,
    keys: list[Any],
) -> list[RowPosition]:
    seen: set[int] = set()
    result: list[int] = []
    for key in keys:
        for row_position in bucket.get(str(key), []):
            normalized = int(row_position)
            if normalized in seen:
                continue
            seen.add(normalized)
            result.append(normalized)
    return result


def _records_from_positions(df: pd.DataFrame, positions: list[RowPosition]) -> list[ResourceRecord]:
    if not positions:
        return []
    unique_positions = sorted({int(pos) for pos in positions if 0 <= int(pos) < len(df)})
    if not unique_positions:
        return []
    return df.iloc[unique_positions].to_dict(orient='records')


def _records_from_index(index: ResourceIndex, positions: list[RowPosition] | None = None) -> list[ResourceRecord]:
    if not index.get("ready"):
        return []
    df = _resource_df_cache.get(RESOURCE_DF_CACHE_KEY)
    if df is None:
        legacy_records = index.get("records")
        if isinstance(legacy_records, list):
            if positions is None:
                return list(legacy_records)
            selected = [legacy_records[int(pos)] for pos in positions if 0 <= int(pos) < len(legacy_records)]
            return selected
        return []
    selected_positions = positions if positions is not None else index.get("all_positions", [])
    if not selected_positions:
        selected_positions = list(range(len(df)))
    return _records_from_positions(df, selected_positions)


# ============================================================
# Configuration
# ============================================================
RESOURCE_CACHE_ENABLED = os.getenv('RESOURCE_CACHE_ENABLED', 'true').lower() == 'true'
RESOURCE_SYNC_INTERVAL = int(
    os.getenv('RESOURCE_SYNC_INTERVAL', str(DEFAULT_RESOURCE_SYNC_INTERVAL_SECONDS))
)
RESOURCE_INDEX_VERSION_CHECK_INTERVAL = int(
    os.getenv('RESOURCE_INDEX_VERSION_CHECK_INTERVAL', str(DEFAULT_INDEX_VERSION_CHECK_INTERVAL_SECONDS))
)


# Redis key helpers
def _get_key(key: str) -> str:
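`_pick_bucket_positions` merges bucket hits for several filter keys while preserving first-seen order, which keeps result ordering stable across repeated queries. A standalone sketch of that merge (function name illustrative, pandas-free):

```python
def pick_bucket_positions(bucket: dict[str, list[int]], keys: list) -> list[int]:
    """Union the row positions for several keys, deduplicated, first-seen order kept."""
    seen: set[int] = set()
    result: list[int] = []
    for key in keys:
        for pos in bucket.get(str(key), []):
            if pos in seen:
                continue  # a row already matched an earlier key
            seen.add(pos)
            result.append(pos)
    return result
```

The positions then drive a single `df.iloc[positions]` slice, so only the matching rows are materialized as dict records instead of the whole table.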
@@ -313,14 +480,14 @@ def _build_filter_builder() -> QueryBuilder:
    return builder


def _load_from_oracle() -> pd.DataFrame | None:
    """Load the full resource table from Oracle (with global filters applied).

    Returns:
        DataFrame with all columns, or None if query failed.
    """
    builder = _build_filter_builder()
    builder.base_sql = RESOURCE_BASE_SELECT_TEMPLATE
    sql, params = builder.build()

    try:
@@ -333,14 +500,14 @@ def _load_from_oracle() -> Optional[pd.DataFrame]:
        return None


def _get_version_from_oracle() -> str | None:
    """Get the Oracle data version (MAX(LASTCHANGEDATE)).

    Returns:
        Version string (ISO format), or None if query failed.
    """
    builder = _build_filter_builder()
    builder.base_sql = RESOURCE_VERSION_SELECT_TEMPLATE
    sql, params = builder.build()

    try:
@@ -361,7 +528,7 @@ def _get_version_from_oracle() -> Optional[str]:
# ============================================================
# Internal: Redis Functions
# ============================================================
def _get_version_from_redis() -> str | None:
    """Get the cached version from Redis.

    Returns:
@@ -411,7 +578,7 @@ def _sync_to_redis(df: pd.DataFrame, version: str) -> bool:
        pipe.execute()

        # Invalidate process-level cache so next request picks up new data
        _resource_df_cache.invalidate(RESOURCE_DF_CACHE_KEY)
        _invalidate_resource_index()

        logger.info(f"Resource cache synced: {len(df)} rows, version={version}")
@@ -421,7 +588,7 @@ def _sync_to_redis(df: pd.DataFrame, version: str) -> bool:
        return False


def _get_cached_data() -> pd.DataFrame | None:
    """Get cached resource data from Redis with process-level caching.

    Uses a two-tier cache strategy:
@@ -433,11 +600,15 @@ def _get_cached_data() -> Optional[pd.DataFrame]:
    Returns:
        DataFrame with resource data, or None if cache miss.
    """
    cache_key = RESOURCE_DF_CACHE_KEY

    # Tier 1: Check process-level cache first (fast path)
    cached_df = _resource_df_cache.get(cache_key)
    if cached_df is not None:
        if REDIS_ENABLED and RESOURCE_CACHE_ENABLED and not _redis_data_available():
            _resource_df_cache.invalidate(cache_key)
            _invalidate_resource_index()
        else:
            if not _get_resource_index().get("ready"):
                version, updated_at = _get_cache_meta()
                _ensure_resource_index(
@@ -568,7 +739,7 @@ def init_cache() -> None:
        logger.error(f"Failed to init resource cache: {e}")


def get_cache_status() -> dict[str, Any]:
    """Get cache status information.

    Returns:
@@ -611,9 +782,10 @@ def get_cache_status() -> Dict[str, Any]:
# ============================================================
# Query API
# ============================================================
def get_resource_index_status() -> dict[str, Any]:
    """Get process-level derived index telemetry."""
    index = _get_resource_index()
    memory = index.get("memory") or {}
    built_at = index.get("built_at")
    age_seconds = None
    if built_at:
@@ -630,19 +802,32 @@ def get_resource_index_status() -> Dict[str, Any]:
"built_at": built_at, "built_at": built_at,
"count": int(index.get("count", 0)), "count": int(index.get("count", 0)),
"age_seconds": round(age_seconds, 3) if age_seconds is not None else None, "age_seconds": round(age_seconds, 3) if age_seconds is not None else None,
"memory": {
"frame_bytes": int(memory.get("frame_bytes", 0)),
"index_bytes": int(memory.get("index_bytes", 0)),
"records_json_bytes": int(memory.get("records_json_bytes", 0)),
"bucket_entries": int(memory.get("bucket_entries", 0)),
"amplification_ratio": float(memory.get("amplification_ratio", 0.0)),
"representation": str(memory.get("representation", "unknown")),
},
} }
def get_resource_index_snapshot() -> Dict[str, Any]: def get_resource_index_snapshot() -> ResourceIndex:
"""Get derived resource index snapshot, rebuilding if needed.""" """Get derived resource index snapshot, rebuilding if needed."""
index = _get_resource_index() index = _get_resource_index()
if index.get("ready"): if index.get("ready"):
if index.get("source") == "redis": if index.get("source") == "redis":
if not _redis_data_available():
_resource_df_cache.invalidate(RESOURCE_DF_CACHE_KEY)
_invalidate_resource_index()
index = _get_resource_index()
# If Redis metadata version is missing, verify payload existence on every call. # If Redis metadata version is missing, verify payload existence on every call.
# This avoids serving stale in-process index when Redis payload is evicted. # This avoids serving stale in-process index when Redis payload is evicted.
if not index.get("version"): if index.get("ready") and not index.get("version"):
if not _redis_data_available(): if not _redis_data_available():
_resource_df_cache.invalidate("resource_data") _resource_df_cache.invalidate(RESOURCE_DF_CACHE_KEY)
_invalidate_resource_index() _invalidate_resource_index()
index = _get_resource_index() index = _get_resource_index()
else: else:
@@ -661,7 +846,7 @@ def get_resource_index_snapshot() -> Dict[str, Any]:
                    current_version,
                    latest_version,
                )
                _resource_df_cache.invalidate(RESOURCE_DF_CACHE_KEY)
                _invalidate_resource_index()
                index = _get_resource_index()
            else:
@@ -678,6 +863,7 @@ def get_resource_index_snapshot() -> Dict[str, Any]:
    df = _get_cached_data()
    if df is not None:
        _resource_df_cache.set(RESOURCE_DF_CACHE_KEY, df.reset_index(drop=True))
        version, updated_at = _get_cache_meta()
        _ensure_resource_index(
            df,
@@ -690,6 +876,8 @@ def get_resource_index_snapshot() -> Dict[str, Any]:
    logger.info("Resource cache miss while building index, falling back to Oracle")
    oracle_df = _load_from_oracle()
    if oracle_df is None:
        _resource_df_cache.invalidate(RESOURCE_DF_CACHE_KEY)
        _invalidate_resource_index()
        return _new_empty_index()
    _ensure_resource_index(
@@ -698,9 +886,11 @@ def get_resource_index_snapshot() -> Dict[str, Any]:
        version=None,
        updated_at=datetime.now().isoformat(),
    )
    _resource_df_cache.set(RESOURCE_DF_CACHE_KEY, oracle_df.reset_index(drop=True))
    return _get_resource_index()


def get_all_resources() -> list[ResourceRecord]:
    """Get all cached resource records (all columns).

    Falls back to Oracle if cache unavailable.
@@ -709,11 +899,10 @@ def get_all_resources() -> List[Dict]:
    Returns:
        List of resource dicts.
    """
    index = get_resource_index_snapshot()
    return _records_from_index(index)


def get_resource_by_id(resource_id: str) -> ResourceRecord | None:
    """Get a single resource record by RESOURCEID.

    Args:
@@ -725,10 +914,12 @@ def get_resource_by_id(resource_id: str) -> Optional[Dict]:
    if not resource_id:
        return None

    index = get_resource_index_snapshot()
    by_id: dict[str, RowPosition] = index.get("by_resource_id", {})
    row_position = by_id.get(str(resource_id))
    if row_position is not None:
        rows = _records_from_index(index, [int(row_position)])
        if rows:
            return rows[0]

    # Backward-compatible fallback for call sites/tests that patch get_all_resources.
    target = str(resource_id)
@@ -738,7 +929,7 @@ def get_resource_by_id(resource_id: str) -> Optional[Dict]:
    return None


def get_resources_by_ids(resource_ids: list[str]) -> list[ResourceRecord]:
    """Batch-fetch resource records by a list of RESOURCEIDs.

    Args:
@@ -747,20 +938,28 @@ def get_resources_by_ids(resource_ids: List[str]) -> List[Dict]:
    Returns:
        List of matching resource dicts.
    """
    index = get_resource_index_snapshot()
    by_id: dict[str, RowPosition] = index.get("by_resource_id", {})
    positions = [by_id[str(resource_id)] for resource_id in resource_ids if str(resource_id) in by_id]
    if positions:
        rows = _records_from_index(index, positions)
        if rows:
            return rows

    # Backward-compatible fallback for call sites/tests that patch get_all_resources.
    id_set = set(resource_ids)
    return [r for r in get_all_resources() if r.get('RESOURCEID') in id_set]


def get_resources_by_filter(
    workcenters: list[str] | None = None,
    families: list[str] | None = None,
    departments: list[str] | None = None,
    locations: list[str] | None = None,
    is_production: bool | None = None,
    is_key: bool | None = None,
    is_monitor: bool | None = None,
) -> list[ResourceRecord]:
    """Filter resource records by criteria (filtered on the Python side).

    Args:
@@ -775,11 +974,9 @@ def get_resources_by_filter(
    Returns:
        List of matching resource dicts.
    """
    def _filter_from_records(resources: list[ResourceRecord]) -> list[ResourceRecord]:
        result: list[ResourceRecord] = []
        for r in resources:
            if workcenters and r.get('WORKCENTERNAME') not in workcenters:
                continue
            if families and r.get('RESOURCEFAMILYNAME') not in families:
@@ -788,29 +985,68 @@ def get_resources_by_filter(
                continue
            if departments and r.get('PJ_DEPARTMENT') not in departments:
                continue
            if locations and r.get('LOCATIONNAME') not in locations:
                continue
            if is_production is not None and (r.get('PJ_ISPRODUCTION') == 1) != is_production:
                continue
            if is_key is not None and (r.get('PJ_ISKEY') == 1) != is_key:
                continue
            if is_monitor is not None and (r.get('PJ_ISMONITOR') == 1) != is_monitor:
                continue
            result.append(r)
        return result

    index = get_resource_index_snapshot()
    if not index.get("ready"):
        return _filter_from_records(get_all_resources())
    if _resource_df_cache.get(RESOURCE_DF_CACHE_KEY) is None:
        return _filter_from_records(get_all_resources())

    candidate_positions: set[int] = set(int(pos) for pos in index.get("all_positions", []))
    if not candidate_positions:
        return []

    def _intersect_with_positions(selected: list[int] | None) -> None:
        nonlocal candidate_positions
        if selected is None:
            return
        candidate_positions &= set(int(item) for item in selected)

    if workcenters:
        _intersect_with_positions(
            _pick_bucket_positions(index.get("by_workcenter", {}), workcenters)
        )
    if families:
        _intersect_with_positions(
            _pick_bucket_positions(index.get("by_family", {}), families)
        )
    if departments:
        _intersect_with_positions(
            _pick_bucket_positions(index.get("by_department", {}), departments)
        )
    if locations:
        _intersect_with_positions(
            _pick_bucket_positions(index.get("by_location", {}), locations)
        )
    if is_production is not None:
        _intersect_with_positions(
            index.get("by_is_production", {}).get(TRUE_BUCKET if is_production else FALSE_BUCKET, [])
        )
    if is_key is not None:
        _intersect_with_positions(
            index.get("by_is_key", {}).get(TRUE_BUCKET if is_key else FALSE_BUCKET, [])
        )
    if is_monitor is not None:
        _intersect_with_positions(
            index.get("by_is_monitor", {}).get(TRUE_BUCKET if is_monitor else FALSE_BUCKET, [])
        )

    return _records_from_index(index, sorted(candidate_positions))
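The indexed filter path above is successive set intersection over precomputed value buckets. A minimal, self-contained sketch of the same idea (the index layout and bucket names here are illustrative stand-ins, not the service's actual schema):

```python
# AND-combine precomputed bucket indexes, assuming each bucket maps a
# value to the row positions that carry it.
def filter_positions(index: dict, selections: dict) -> list:
    candidates = set(index["all_positions"])
    for bucket_name, wanted in selections.items():
        bucket = index[bucket_name]
        matched = set()
        for value in wanted:
            # OR within one criterion: any wanted value matches
            matched.update(bucket.get(value, []))
        # AND across criteria: each one narrows the survivors
        candidates &= matched
    return sorted(candidates)

index = {
    "all_positions": [0, 1, 2, 3],
    "by_workcenter": {"WB": [0, 1], "DB": [2, 3]},
    "by_location": {"F1": [0, 2], "F2": [1, 3]},
}
print(filter_positions(index, {"by_workcenter": ["WB"], "by_location": ["F1"]}))  # [0]
```

Only candidate positions survive all criteria, so the full records are materialized once at the end rather than per-filter.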
# ============================================================
# Distinct Values API (for filters)
# ============================================================
def get_distinct_values(column: str) -> list[str]:
    """Get the sorted list of distinct values for a column.

    Args:
@@ -833,26 +1069,26 @@ def get_distinct_values(column: str) -> List[str]:
    return sorted(values)


def get_resource_families() -> list[str]:
    """Get the resource family list (convenience helper)."""
    return get_distinct_values('RESOURCEFAMILYNAME')


def get_workcenters() -> list[str]:
    """Get the workcenter list (convenience helper)."""
    return get_distinct_values('WORKCENTERNAME')


def get_departments() -> list[str]:
    """Get the department list (convenience helper)."""
    return get_distinct_values('PJ_DEPARTMENT')


def get_locations() -> list[str]:
    """Get the location list (convenience helper)."""
    return get_distinct_values('LOCATIONNAME')


def get_vendors() -> list[str]:
    """Get the vendor list (convenience helper)."""
    return get_distinct_values('VENDORNAME')

View File

@@ -0,0 +1,46 @@
# -*- coding: utf-8 -*-
"""Shared SQL fragments/constants for cache-oriented services.
Centralizing common Oracle table/view references reduces drift across
resource/equipment cache implementations.
"""
from __future__ import annotations
RESOURCE_TABLE = "DWH.DW_MES_RESOURCE"

# Plain-string concatenation keeps the literal "{{ WHERE_CLAUSE }}" placeholder
# intact; an f-string would collapse the doubled braces to "{ WHERE_CLAUSE }",
# which no longer matches the template the query builder replaces.
RESOURCE_BASE_SELECT_TEMPLATE = "SELECT * FROM " + RESOURCE_TABLE + " {{ WHERE_CLAUSE }}"
RESOURCE_VERSION_SELECT_TEMPLATE = (
    "SELECT MAX(LASTCHANGEDATE) as VERSION FROM " + RESOURCE_TABLE + " {{ WHERE_CLAUSE }}"
)

EQUIPMENT_STATUS_VIEW = "DWH.DW_MES_EQUIPMENTSTATUS_WIP_V"
EQUIPMENT_STATUS_COLUMNS: tuple[str, ...] = (
    "RESOURCEID",
    "EQUIPMENTID",
    "OBJECTCATEGORY",
    "EQUIPMENTASSETSSTATUS",
    "EQUIPMENTASSETSSTATUSREASON",
    "JOBORDER",
    "JOBMODEL",
    "JOBSTAGE",
    "JOBID",
    "JOBSTATUS",
    "CREATEDATE",
    "CREATEUSERNAME",
    "CREATEUSER",
    "TECHNICIANUSERNAME",
    "TECHNICIANUSER",
    "SYMPTOMCODE",
    "CAUSECODE",
    "REPAIRCODE",
    "RUNCARDLOTID",
    "LOTTRACKINQTY_PCS",
    "LOTTRACKINTIME",
    "LOTTRACKINEMPLOYEE",
)
EQUIPMENT_STATUS_SELECT_SQL = (
    "SELECT\n "
    + ",\n ".join(EQUIPMENT_STATUS_COLUMNS)
    + f"\nFROM {EQUIPMENT_STATUS_VIEW}"
)
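The join above expands the column tuple into a one-column-per-line SELECT. A quick standalone check of what it produces, using a trimmed column list for brevity:

```python
# Build a SELECT statement from a column tuple, mirroring the constants above.
COLUMNS = ("RESOURCEID", "EQUIPMENTID", "JOBSTATUS")
VIEW = "DWH.DW_MES_EQUIPMENTSTATUS_WIP_V"
sql = "SELECT\n " + ",\n ".join(COLUMNS) + f"\nFROM {VIEW}"
print(sql)
```

Each column lands on its own line with a trailing comma except the last, which keeps the generated SQL diff-friendly when columns are added.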

View File

@@ -9,6 +9,7 @@ Now uses Redis cache when available, with fallback to Oracle direct query.
import logging
import threading
from collections import Counter
from datetime import datetime
from typing import Optional, Dict, List, Any
@@ -32,6 +33,20 @@ logger = logging.getLogger('mes_dashboard.wip_service')
_wip_search_index_lock = threading.Lock()
_wip_search_index_cache: Dict[str, Dict[str, Any]] = {}
_wip_snapshot_lock = threading.Lock()
_wip_snapshot_cache: Dict[str, Dict[str, Any]] = {}
_wip_index_metrics_lock = threading.Lock()
_wip_index_metrics: Dict[str, Any] = {
    "snapshot_hits": 0,
    "snapshot_misses": 0,
    "search_index_hits": 0,
    "search_index_misses": 0,
    "search_index_rebuilds": 0,
    "search_index_incremental_updates": 0,
    "search_index_reconciliation_fallbacks": 0,
}
_EMPTY_INT_INDEX = np.array([], dtype=np.int64)
def _safe_value(val):
@@ -153,29 +168,373 @@ def _get_wip_cache_version() -> str:
    return f"{updated_at}|{sys_date}"


def _increment_wip_metric(metric: str, value: int = 1) -> None:
    with _wip_index_metrics_lock:
        _wip_index_metrics[metric] = int(_wip_index_metrics.get(metric, 0)) + value


def _estimate_dataframe_bytes(df: pd.DataFrame) -> int:
    if df is None:
        return 0
    try:
        return int(df.memory_usage(index=True, deep=True).sum())
    except Exception:
        return 0


def _estimate_counter_payload_bytes(counter: Counter) -> int:
    total = 0
    for key, count in counter.items():
        total += len(str(key)) + 16 + int(count)
    return total
def _normalize_text_value(value: Any) -> str:
    if value is None:
        return ""
    if isinstance(value, float) and pd.isna(value):
        return ""
    text = str(value).strip()
    return text
def _build_filter_mask(
    df: pd.DataFrame,
    *,
    include_dummy: bool,
    workorder: Optional[str] = None,
    lotid: Optional[str] = None,
) -> pd.Series:
    if df.empty:
        return pd.Series(dtype=bool)
    mask = df['WORKORDER'].notna()
    if not include_dummy and 'LOTID' in df.columns:
        mask &= ~df['LOTID'].astype(str).str.contains('DUMMY', case=False, na=False)
    if workorder and 'WORKORDER' in df.columns:
        mask &= df['WORKORDER'].astype(str).str.contains(workorder, case=False, na=False)
    if lotid and 'LOTID' in df.columns:
        mask &= df['LOTID'].astype(str).str.contains(lotid, case=False, na=False)
    return mask
def _build_value_index(df: pd.DataFrame, column: str) -> Dict[str, np.ndarray]:
    if column not in df.columns or df.empty:
        return {}
    grouped = df.groupby(column, dropna=True, sort=False).indices
    return {str(key): np.asarray(indices, dtype=np.int64) for key, indices in grouped.items()}
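The `groupby(...).indices` call above is what makes the buckets cheap to build: it returns positional row indexes per value without materializing sub-frames. A toy run (the column names and data are invented for illustration):

```python
import numpy as np
import pandas as pd

# GroupBy.indices maps each group key to the positional indexes of its rows.
df = pd.DataFrame({"WORKCENTER_GROUP": ["WB", "DB", "WB"], "QTY": [10, 5, 7]})
buckets = {
    str(key): np.asarray(idx, dtype=np.int64)
    for key, idx in df.groupby("WORKCENTER_GROUP", dropna=True, sort=False).indices.items()
}
print(buckets["WB"])  # positions 0 and 2
```

The positions index into the frame via `df.iloc`, so lookups later never re-scan the column.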
def _intersect_positions(current: Optional[np.ndarray], candidate: Optional[np.ndarray]) -> np.ndarray:
    if candidate is None:
        return _EMPTY_INT_INDEX
    if current is None:
        return candidate
    if len(current) == 0 or len(candidate) == 0:
        return _EMPTY_INT_INDEX
    return np.intersect1d(current, candidate, assume_unique=False)
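`np.intersect1d` returns the sorted common values of two arrays, so with position arrays it behaves exactly like an AND of two index buckets:

```python
import numpy as np

# Intersecting two position arrays: rows present in both buckets survive.
run_rows = np.array([4, 1, 7], dtype=np.int64)
hold_rows = np.array([7, 2, 1], dtype=np.int64)
both = np.intersect1d(run_rows, hold_rows, assume_unique=False)
print(both)  # [1 7]
```

`assume_unique=False` is the safe default here since bucket arrays are not guaranteed deduplicated; the result is always sorted and unique.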
def _select_with_snapshot_indexes(
    include_dummy: bool = False,
    workorder: Optional[str] = None,
    lotid: Optional[str] = None,
    package: Optional[str] = None,
    pj_type: Optional[str] = None,
    workcenter: Optional[str] = None,
    status: Optional[str] = None,
    hold_type: Optional[str] = None,
) -> Optional[pd.DataFrame]:
    snapshot = _get_wip_snapshot(include_dummy=include_dummy)
    if snapshot is None:
        return None

    df = snapshot["frame"]
    indexes = snapshot["indexes"]
    selected_positions: Optional[np.ndarray] = None

    if workcenter:
        selected_positions = _intersect_positions(
            selected_positions,
            indexes["workcenter"].get(str(workcenter)),
        )
    if package:
        selected_positions = _intersect_positions(
            selected_positions,
            indexes["package"].get(str(package)),
        )
    if pj_type:
        selected_positions = _intersect_positions(
            selected_positions,
            indexes["pj_type"].get(str(pj_type)),
        )
    if status:
        selected_positions = _intersect_positions(
            selected_positions,
            indexes["wip_status"].get(str(status).upper()),
        )
    if hold_type:
        selected_positions = _intersect_positions(
            selected_positions,
            indexes["hold_type"].get(str(hold_type).lower()),
        )

    if selected_positions is None:
        result = df
    elif len(selected_positions) == 0:
        result = df.iloc[0:0]
    else:
        result = df.iloc[selected_positions]

    if workorder:
        result = result[result['WORKORDER'].astype(str).str.contains(workorder, case=False, na=False)]
    if lotid:
        result = result[result['LOTID'].astype(str).str.contains(lotid, case=False, na=False)]
    return result
def _build_search_signatures(df: pd.DataFrame) -> tuple[Counter, Dict[str, tuple[str, str, str, str]]]:
    if df.empty:
        return Counter(), {}
    workorders = df.get("WORKORDER", pd.Series(index=df.index, dtype=object)).map(_normalize_text_value)
    lotids = df.get("LOTID", pd.Series(index=df.index, dtype=object)).map(_normalize_text_value)
    packages = df.get("PACKAGE_LEF", pd.Series(index=df.index, dtype=object)).map(_normalize_text_value)
    types = df.get("PJ_TYPE", pd.Series(index=df.index, dtype=object)).map(_normalize_text_value)
    signatures = (
        workorders
        + "\x1f"
        + lotids
        + "\x1f"
        + packages
        + "\x1f"
        + types
    ).tolist()
    signature_counter = Counter(signatures)
    signature_fields: Dict[str, tuple[str, str, str, str]] = {}
    for signature, wo, lot, pkg, pj in zip(signatures, workorders, lotids, packages, types):
        if signature not in signature_fields:
            signature_fields[signature] = (wo, lot, pkg, pj)
    return signature_counter, signature_fields
def _build_field_counters(
    signature_counter: Counter,
    signature_fields: Dict[str, tuple[str, str, str, str]],
) -> Dict[str, Counter]:
    counters = {
        "workorders": Counter(),
        "lotids": Counter(),
        "packages": Counter(),
        "types": Counter(),
    }
    for signature, count in signature_counter.items():
        wo, lot, pkg, pj = signature_fields.get(signature, ("", "", "", ""))
        if wo:
            counters["workorders"][wo] += count
        if lot:
            counters["lotids"][lot] += count
        if pkg:
            counters["packages"][pkg] += count
        if pj:
            counters["types"][pj] += count
    return counters
def _materialize_search_payload(
    *,
    version: str,
    row_count: int,
    signature_counter: Counter,
    field_counters: Dict[str, Counter],
    mode: str,
    added_rows: int = 0,
    removed_rows: int = 0,
    drift_ratio: float = 0.0,
) -> Dict[str, Any]:
    workorders = sorted(field_counters["workorders"].keys())
    lotids = sorted(field_counters["lotids"].keys())
    packages = sorted(field_counters["packages"].keys())
    types = sorted(field_counters["types"].keys())
    memory_bytes = (
        _estimate_counter_payload_bytes(field_counters["workorders"])
        + _estimate_counter_payload_bytes(field_counters["lotids"])
        + _estimate_counter_payload_bytes(field_counters["packages"])
        + _estimate_counter_payload_bytes(field_counters["types"])
    )
    return {
        "version": version,
        "built_at": datetime.now().isoformat(),
        "row_count": int(row_count),
        "workorders": workorders,
        "lotids": lotids,
        "packages": packages,
        "types": types,
        "sync_mode": mode,
        "sync_added_rows": int(added_rows),
        "sync_removed_rows": int(removed_rows),
        "drift_ratio": round(float(drift_ratio), 6),
        "memory_bytes": int(memory_bytes),
        "_signature_counter": dict(signature_counter),
        "_field_counters": {
            "workorders": dict(field_counters["workorders"]),
            "lotids": dict(field_counters["lotids"]),
            "packages": dict(field_counters["packages"]),
            "types": dict(field_counters["types"]),
        },
    }
def _build_wip_search_index(df: pd.DataFrame, include_dummy: bool) -> Dict[str, Any]:
    filtered = _filter_base_conditions(df, include_dummy=include_dummy)
    signatures, signature_fields = _build_search_signatures(filtered)
    field_counters = _build_field_counters(signatures, signature_fields)
    return _materialize_search_payload(
        version=_get_wip_cache_version(),
        row_count=len(filtered),
        signature_counter=signatures,
        field_counters=field_counters,
        mode="full",
    )
def _try_incremental_search_sync(
    previous: Dict[str, Any],
    *,
    version: str,
    row_count: int,
    signature_counter: Counter,
    signature_fields: Dict[str, tuple[str, str, str, str]],
) -> Optional[Dict[str, Any]]:
    if not previous:
        return None
    old_signature_counter = Counter(previous.get("_signature_counter") or {})
    old_field_counters_raw = previous.get("_field_counters") or {}
    if not old_signature_counter or not old_field_counters_raw:
        return None

    added = signature_counter - old_signature_counter
    removed = old_signature_counter - signature_counter
    total_delta = sum(added.values()) + sum(removed.values())
    drift_ratio = total_delta / max(int(row_count), 1)
    if drift_ratio > 0.6:
        _increment_wip_metric("search_index_reconciliation_fallbacks")
        return None

    field_counters = {
        "workorders": Counter(old_field_counters_raw.get("workorders") or {}),
        "lotids": Counter(old_field_counters_raw.get("lotids") or {}),
        "packages": Counter(old_field_counters_raw.get("packages") or {}),
        "types": Counter(old_field_counters_raw.get("types") or {}),
    }
    for signature, count in added.items():
        wo, lot, pkg, pj = signature_fields.get(signature, ("", "", "", ""))
        if wo:
            field_counters["workorders"][wo] += count
        if lot:
            field_counters["lotids"][lot] += count
        if pkg:
            field_counters["packages"][pkg] += count
        if pj:
            field_counters["types"][pj] += count

    previous_fields = {
        sig: tuple(str(v) for v in sig.split("\x1f", 3))
        for sig in old_signature_counter.keys()
    }
    for signature, count in removed.items():
        wo, lot, pkg, pj = previous_fields.get(signature, ("", "", "", ""))
        if wo:
            field_counters["workorders"][wo] -= count
            if field_counters["workorders"][wo] <= 0:
                field_counters["workorders"].pop(wo, None)
        if lot:
            field_counters["lotids"][lot] -= count
            if field_counters["lotids"][lot] <= 0:
                field_counters["lotids"].pop(lot, None)
        if pkg:
            field_counters["packages"][pkg] -= count
            if field_counters["packages"][pkg] <= 0:
                field_counters["packages"].pop(pkg, None)
        if pj:
            field_counters["types"][pj] -= count
            if field_counters["types"][pj] <= 0:
                field_counters["types"].pop(pj, None)

    _increment_wip_metric("search_index_incremental_updates")
    return _materialize_search_payload(
        version=version,
        row_count=row_count,
        signature_counter=signature_counter,
        field_counters=field_counters,
        mode="incremental",
        added_rows=sum(added.values()),
        removed_rows=sum(removed.values()),
        drift_ratio=drift_ratio,
    )
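The incremental path relies on `collections.Counter` set arithmetic: subtracting one counter from another keeps only positive deltas, so `new - old` yields what was added and `old - new` what was removed. A standalone sketch (the signature strings are invented; the drift denominator here uses the new total row count as a stand-in for `row_count`):

```python
from collections import Counter

# Counter subtraction keeps only positive counts, giving add/remove deltas.
old = Counter({"wo1\x1flot1": 2, "wo2\x1flot2": 1})
new = Counter({"wo1\x1flot1": 3, "wo3\x1flot3": 1})

added = new - old      # signatures that gained rows
removed = old - new    # signatures that lost rows
drift = (sum(added.values()) + sum(removed.values())) / max(sum(new.values()), 1)
print(dict(added), dict(removed), round(drift, 2))
```

When the drift ratio crosses the threshold, patching the old counters would touch most of the data anyway, so a full rebuild is cheaper and avoids accumulated error.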
def _build_wip_snapshot(df: pd.DataFrame, include_dummy: bool, version: str) -> Dict[str, Any]:
    filtered = _filter_base_conditions(df, include_dummy=include_dummy)
    filtered = _add_wip_status_columns(filtered).reset_index(drop=True)

    hold_type_series = pd.Series(index=filtered.index, dtype=object)
    if not filtered.empty:
        hold_type_series = pd.Series("", index=filtered.index, dtype=object)
        hold_type_series.loc[filtered["IS_QUALITY_HOLD"]] = "quality"
        hold_type_series.loc[filtered["IS_NON_QUALITY_HOLD"]] = "non-quality"

    indexes = {
        "workcenter": _build_value_index(filtered, "WORKCENTER_GROUP"),
        "package": _build_value_index(filtered, "PACKAGE_LEF"),
        "pj_type": _build_value_index(filtered, "PJ_TYPE"),
        "wip_status": _build_value_index(filtered, "WIP_STATUS"),
        "hold_type": _build_value_index(pd.DataFrame({"HOLD_TYPE": hold_type_series}), "HOLD_TYPE"),
    }
    exact_bucket_count = sum(len(bucket) for bucket in indexes.values())
    return {
        "version": version,
        "built_at": datetime.now().isoformat(),
        "row_count": int(len(filtered)),
        "frame": filtered,
        "indexes": indexes,
        "frame_bytes": _estimate_dataframe_bytes(filtered),
        "index_bucket_count": int(exact_bucket_count),
    }
def _get_wip_snapshot(include_dummy: bool) -> Optional[Dict[str, Any]]:
    cache_key = "with_dummy" if include_dummy else "without_dummy"
    version = _get_wip_cache_version()
    with _wip_snapshot_lock:
        cached = _wip_snapshot_cache.get(cache_key)
        if cached and cached.get("version") == version:
            _increment_wip_metric("snapshot_hits")
            return cached

    _increment_wip_metric("snapshot_misses")
    df = _get_wip_dataframe()
    if df is None:
        return None
    snapshot = _build_wip_snapshot(df, include_dummy=include_dummy, version=version)
    with _wip_snapshot_lock:
        existing = _wip_snapshot_cache.get(cache_key)
        if existing and existing.get("version") == version:
            _increment_wip_metric("snapshot_hits")
            return existing
        _wip_snapshot_cache[cache_key] = snapshot
    return snapshot
def _get_wip_search_index(include_dummy: bool) -> Optional[Dict[str, Any]]:
    cache_key = "with_dummy" if include_dummy else "without_dummy"
@@ -184,14 +543,37 @@ def _get_wip_search_index(include_dummy: bool) -> Optional[Dict[str, Any]]:
    with _wip_search_index_lock:
        cached = _wip_search_index_cache.get(cache_key)
        if cached and cached.get("version") == version:
            _increment_wip_metric("search_index_hits")
            return cached

    _increment_wip_metric("search_index_misses")
    snapshot = _get_wip_snapshot(include_dummy=include_dummy)
    if snapshot is None:
        return None

    filtered = snapshot["frame"]
    signature_counter, signature_fields = _build_search_signatures(filtered)
    with _wip_search_index_lock:
        previous = _wip_search_index_cache.get(cache_key)
    index_payload = _try_incremental_search_sync(
        previous or {},
        version=version,
        row_count=int(snapshot.get("row_count", 0)),
        signature_counter=signature_counter,
        signature_fields=signature_fields,
    )
    if index_payload is None:
        field_counters = _build_field_counters(signature_counter, signature_fields)
        index_payload = _materialize_search_payload(
            version=version,
            row_count=int(snapshot.get("row_count", 0)),
            signature_counter=signature_counter,
            field_counters=field_counters,
            mode="full",
        )
        _increment_wip_metric("search_index_rebuilds")

    with _wip_search_index_lock:
        _wip_search_index_cache[cache_key] = index_payload
@@ -207,9 +589,9 @@ def _search_values_from_index(values: List[str], query: str, limit: int) -> List
def get_wip_search_index_status() -> Dict[str, Any]:
    """Expose WIP derived search-index freshness for diagnostics."""
    with _wip_search_index_lock:
        search_snapshot = {}
        for key, payload in _wip_search_index_cache.items():
            search_snapshot[key] = {
                "version": payload.get("version"),
                "built_at": payload.get("built_at"),
                "row_count": payload.get("row_count", 0),
@@ -217,8 +599,39 @@ def get_wip_search_index_status() -> Dict[str, Any]:
"lotids": len(payload.get("lotids", [])), "lotids": len(payload.get("lotids", [])),
"packages": len(payload.get("packages", [])), "packages": len(payload.get("packages", [])),
"types": len(payload.get("types", [])), "types": len(payload.get("types", [])),
"sync_mode": payload.get("sync_mode"),
"sync_added_rows": payload.get("sync_added_rows", 0),
"sync_removed_rows": payload.get("sync_removed_rows", 0),
"drift_ratio": payload.get("drift_ratio", 0.0),
"memory_bytes": payload.get("memory_bytes", 0),
}
with _wip_snapshot_lock:
frame_snapshot = {}
for key, payload in _wip_snapshot_cache.items():
frame_snapshot[key] = {
"version": payload.get("version"),
"built_at": payload.get("built_at"),
"row_count": payload.get("row_count", 0),
"frame_bytes": payload.get("frame_bytes", 0),
"index_bucket_count": payload.get("index_bucket_count", 0),
}
with _wip_index_metrics_lock:
metrics = dict(_wip_index_metrics)
total_frame_bytes = sum(item.get("frame_bytes", 0) for item in frame_snapshot.values())
total_search_bytes = sum(item.get("memory_bytes", 0) for item in search_snapshot.values())
amplification_ratio = round((total_frame_bytes + total_search_bytes) / max(total_frame_bytes, 1), 4)
return {
"derived_search_index": search_snapshot,
"derived_frame_snapshot": frame_snapshot,
"metrics": metrics,
"memory": {
"frame_bytes_total": int(total_frame_bytes),
"search_bytes_total": int(total_search_bytes),
"amplification_ratio": amplification_ratio,
},
} }
return snapshot
def _add_wip_status_columns(df: pd.DataFrame) -> pd.DataFrame: def _add_wip_status_columns(df: pd.DataFrame) -> pd.DataFrame:
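The memory telemetry added above reduces to a single formula: total derived-cache bytes divided by base-frame bytes, guarded against division by zero. A minimal standalone sketch of that calculation (mirroring the expression in `get_wip_search_index_status`):

```python
def amplification_ratio(frame_bytes_total: int, search_bytes_total: int) -> float:
    """Ratio of total derived-cache memory to the base frame snapshot.

    Mirrors the formula in get_wip_search_index_status(); max(..., 1)
    guards against division by zero when no frame snapshot exists yet.
    """
    return round((frame_bytes_total + search_bytes_total) / max(frame_bytes_total, 1), 4)

# A 100 MB frame plus a 30 MB search index amplifies memory 1.3x.
print(amplification_ratio(100_000_000, 30_000_000))  # → 1.3
```

Values above 1.0 mean the derived indexes cost extra memory on top of the cached frame, which is what the `max_memory_amplification_ratio` benchmark threshold bounds.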
```diff
@@ -235,24 +648,31 @@ def _add_wip_status_columns(df: pd.DataFrame) -> pd.DataFrame:
     Returns:
         DataFrame with additional status columns
     """
-    df = df.copy()
+    required = {'WIP_STATUS', 'IS_QUALITY_HOLD', 'IS_NON_QUALITY_HOLD'}
+    if required.issubset(df.columns):
+        return df
+
+    working = df.copy()

     # Ensure numeric columns
-    df['EQUIPMENTCOUNT'] = pd.to_numeric(df['EQUIPMENTCOUNT'], errors='coerce').fillna(0)
-    df['CURRENTHOLDCOUNT'] = pd.to_numeric(df['CURRENTHOLDCOUNT'], errors='coerce').fillna(0)
-    df['QTY'] = pd.to_numeric(df['QTY'], errors='coerce').fillna(0)
+    working['EQUIPMENTCOUNT'] = pd.to_numeric(working['EQUIPMENTCOUNT'], errors='coerce').fillna(0)
+    working['CURRENTHOLDCOUNT'] = pd.to_numeric(working['CURRENTHOLDCOUNT'], errors='coerce').fillna(0)
+    working['QTY'] = pd.to_numeric(working['QTY'], errors='coerce').fillna(0)

     # Compute WIP status
-    df['WIP_STATUS'] = 'QUEUE'  # Default
-    df.loc[df['EQUIPMENTCOUNT'] > 0, 'WIP_STATUS'] = 'RUN'
-    df.loc[(df['EQUIPMENTCOUNT'] == 0) & (df['CURRENTHOLDCOUNT'] > 0), 'WIP_STATUS'] = 'HOLD'
+    working['WIP_STATUS'] = 'QUEUE'  # Default
+    working.loc[working['EQUIPMENTCOUNT'] > 0, 'WIP_STATUS'] = 'RUN'
+    working.loc[
+        (working['EQUIPMENTCOUNT'] == 0) & (working['CURRENTHOLDCOUNT'] > 0),
+        'WIP_STATUS'
+    ] = 'HOLD'

     # Compute hold type
-    df['IS_NON_QUALITY_HOLD'] = df['HOLDREASONNAME'].isin(NON_QUALITY_HOLD_REASONS)
-    df['IS_QUALITY_HOLD'] = (df['WIP_STATUS'] == 'HOLD') & ~df['IS_NON_QUALITY_HOLD']
-    df['IS_NON_QUALITY_HOLD'] = (df['WIP_STATUS'] == 'HOLD') & df['IS_NON_QUALITY_HOLD']
+    non_quality_flags = working['HOLDREASONNAME'].isin(NON_QUALITY_HOLD_REASONS)
+    working['IS_QUALITY_HOLD'] = (working['WIP_STATUS'] == 'HOLD') & ~non_quality_flags
+    working['IS_NON_QUALITY_HOLD'] = (working['WIP_STATUS'] == 'HOLD') & non_quality_flags

-    return df
+    return working


 def _filter_base_conditions(
```
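The classification rules in `_add_wip_status_columns` are equipment-running wins, otherwise an active hold, otherwise queued; hold lots are then split by whether the reason is a non-quality reason. A simplified pure-Python, per-row sketch of the same rules (the reason values here are illustrative placeholders, not the real `NON_QUALITY_HOLD_REASONS` contents; the real code is vectorized with pandas):

```python
# Illustrative placeholder values; the real set lives in the module constants.
NON_QUALITY_HOLD_REASONS = {"ENG HOLD", "CUSTOMER HOLD"}

def classify(row: dict) -> dict:
    """Per-row equivalent of the vectorized WIP status derivation."""
    equipment = row.get("EQUIPMENTCOUNT") or 0
    holds = row.get("CURRENTHOLDCOUNT") or 0
    if equipment > 0:
        status = "RUN"                 # any equipment assigned -> running
    elif holds > 0:
        status = "HOLD"                # no equipment but active holds
    else:
        status = "QUEUE"               # default
    non_quality = row.get("HOLDREASONNAME") in NON_QUALITY_HOLD_REASONS
    return {
        "WIP_STATUS": status,
        "IS_QUALITY_HOLD": status == "HOLD" and not non_quality,
        "IS_NON_QUALITY_HOLD": status == "HOLD" and non_quality,
    }
```

The `required.issubset` guard in the diff makes the pandas version idempotent: rows already carrying all three derived columns are returned untouched.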
```diff
@@ -272,24 +692,18 @@ def _filter_base_conditions(
     Returns:
         Filtered DataFrame
     """
-    df = df.copy()
-
-    # Exclude NULL WORKORDER (raw materials)
-    df = df[df['WORKORDER'].notna()]
-
-    # DUMMY exclusion
-    if not include_dummy:
-        df = df[~df['LOTID'].str.contains('DUMMY', case=False, na=False)]
-
-    # WORKORDER filter (fuzzy match)
-    if workorder:
-        df = df[df['WORKORDER'].str.contains(workorder, case=False, na=False)]
-
-    # LOTID filter (fuzzy match)
-    if lotid:
-        df = df[df['LOTID'].str.contains(lotid, case=False, na=False)]
-
-    return df
+    if df is None or df.empty:
+        return df.iloc[0:0] if isinstance(df, pd.DataFrame) else pd.DataFrame()
+
+    mask = _build_filter_mask(
+        df,
+        include_dummy=include_dummy,
+        workorder=workorder,
+        lotid=lotid,
+    )
+    if mask.empty:
+        return df.iloc[0:0]
+    return df.loc[mask]


 # ============================================================
```
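The rewrite above replaces a chain of intermediate DataFrame copies with one combined boolean mask. `_build_filter_mask` itself is not shown in this hunk; a hypothetical pure-Python sketch of what such a helper does (the real one returns a pandas boolean Series, but the conditions are the same ones the deleted code applied):

```python
def build_filter_mask(rows, include_dummy=False, workorder=None, lotid=None):
    """Return one boolean per row, ANDing all base conditions in a single pass.

    Hypothetical stand-in for _build_filter_mask, shown over plain dicts.
    """
    mask = []
    for row in rows:
        keep = row.get("WORKORDER") is not None            # exclude raw materials
        lot = row.get("LOTID") or ""
        if keep and not include_dummy:
            keep = "DUMMY" not in lot.upper()              # DUMMY exclusion
        if keep and workorder:
            keep = workorder.lower() in row["WORKORDER"].lower()  # fuzzy match
        if keep and lotid:
            keep = lotid.lower() in lot.lower()            # fuzzy match
        mask.append(keep)
    return mask
```

Building the mask once and indexing with `df.loc[mask]` avoids the repeated copy-and-filter passes of the old implementation.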
```diff
@@ -325,16 +739,15 @@ def get_wip_summary(
     cached_df = _get_wip_dataframe()
     if cached_df is not None:
         try:
-            df = _filter_base_conditions(cached_df, include_dummy, workorder, lotid)
-            df = _add_wip_status_columns(df)
-
-            # Apply package filter
-            if package and 'PACKAGE_LEF' in df.columns:
-                df = df[df['PACKAGE_LEF'] == package]
-
-            # Apply pj_type filter
-            if pj_type and 'PJ_TYPE' in df.columns:
-                df = df[df['PJ_TYPE'] == pj_type]
+            df = _select_with_snapshot_indexes(
+                include_dummy=include_dummy,
+                workorder=workorder,
+                lotid=lotid,
+                package=package,
+                pj_type=pj_type,
+            )
+            if df is None:
+                return _get_wip_summary_from_oracle(include_dummy, workorder, lotid, package, pj_type)

             if df.empty:
                 return {
```
```diff
@@ -495,32 +908,31 @@ def get_wip_matrix(
     cached_df = _get_wip_dataframe()
     if cached_df is not None:
         try:
-            df = _filter_base_conditions(cached_df, include_dummy, workorder, lotid)
-            df = _add_wip_status_columns(df)
+            status_upper = status.upper() if status else None
+            hold_type_filter = hold_type if status_upper == 'HOLD' else None
+            df = _select_with_snapshot_indexes(
+                include_dummy=include_dummy,
+                workorder=workorder,
+                lotid=lotid,
+                package=package,
+                pj_type=pj_type,
+                status=status_upper,
+                hold_type=hold_type_filter,
+            )
+            if df is None:
+                return _get_wip_matrix_from_oracle(
+                    include_dummy,
+                    workorder,
+                    lotid,
+                    status,
+                    hold_type,
+                    package,
+                    pj_type,
+                )

             # Filter by WORKCENTER_GROUP and PACKAGE_LEF
             df = df[df['WORKCENTER_GROUP'].notna() & df['PACKAGE_LEF'].notna()]

-            # Apply package filter
-            if package:
-                df = df[df['PACKAGE_LEF'] == package]
-
-            # Apply pj_type filter
-            if pj_type and 'PJ_TYPE' in df.columns:
-                df = df[df['PJ_TYPE'] == pj_type]
-
-            # WIP status filter
-            if status:
-                status_upper = status.upper()
-                df = df[df['WIP_STATUS'] == status_upper]
-
-                # Hold type sub-filter
-                if status_upper == 'HOLD' and hold_type:
-                    if hold_type == 'quality':
-                        df = df[df['IS_QUALITY_HOLD']]
-                    elif hold_type == 'non-quality':
-                        df = df[df['IS_NON_QUALITY_HOLD']]
-
             if df.empty:
                 return {
                     'workcenters': [],
```
```diff
@@ -677,11 +1089,17 @@ def get_wip_hold_summary(
     cached_df = _get_wip_dataframe()
     if cached_df is not None:
         try:
-            df = _filter_base_conditions(cached_df, include_dummy, workorder, lotid)
-            df = _add_wip_status_columns(df)
+            df = _select_with_snapshot_indexes(
+                include_dummy=include_dummy,
+                workorder=workorder,
+                lotid=lotid,
+                status='HOLD',
+            )
+            if df is None:
+                return _get_wip_hold_summary_from_oracle(include_dummy, workorder, lotid)

             # Filter for HOLD status with reason
-            df = df[(df['WIP_STATUS'] == 'HOLD') & df['HOLDREASONNAME'].notna()]
+            df = df[df['HOLDREASONNAME'].notna()]

             if df.empty:
                 return {'items': []}
```
```diff
@@ -805,17 +1223,40 @@ def get_wip_detail(
     cached_df = _get_wip_dataframe()
     if cached_df is not None:
         try:
-            df = _filter_base_conditions(cached_df, include_dummy, workorder, lotid)
-            df = _add_wip_status_columns(df)
-
-            # Filter by workcenter
-            df = df[df['WORKCENTER_GROUP'] == workcenter]
-
-            if package:
-                df = df[df['PACKAGE_LEF'] == package]
+            summary_df = _select_with_snapshot_indexes(
+                include_dummy=include_dummy,
+                workorder=workorder,
+                lotid=lotid,
+                package=package,
+                workcenter=workcenter,
+            )
+            if summary_df is None:
+                return _get_wip_detail_from_oracle(
+                    workcenter,
+                    package,
+                    status,
+                    hold_type,
+                    workorder,
+                    lotid,
+                    include_dummy,
+                    page,
+                    page_size,
+                )
+            if summary_df.empty:
+                summary = {
+                    'totalLots': 0,
+                    'runLots': 0,
+                    'queueLots': 0,
+                    'holdLots': 0,
+                    'qualityHoldLots': 0,
+                    'nonQualityHoldLots': 0
+                }
+                df = summary_df
+            else:
+                df = summary_df

             # Calculate summary before status filter
-            summary_df = df.copy()
             run_lots = len(summary_df[summary_df['WIP_STATUS'] == 'RUN'])
             queue_lots = len(summary_df[summary_df['WIP_STATUS'] == 'QUEUE'])
             hold_lots = len(summary_df[summary_df['WIP_STATUS'] == 'HOLD'])
@@ -835,13 +1276,29 @@ def get_wip_detail(
             # Apply status filter for lots list
             if status:
                 status_upper = status.upper()
-                df = df[df['WIP_STATUS'] == status_upper]
-
-                if status_upper == 'HOLD' and hold_type:
-                    if hold_type == 'quality':
-                        df = df[df['IS_QUALITY_HOLD']]
-                    elif hold_type == 'non-quality':
-                        df = df[df['IS_NON_QUALITY_HOLD']]
+                hold_type_filter = hold_type if status_upper == 'HOLD' else None
+                filtered_df = _select_with_snapshot_indexes(
+                    include_dummy=include_dummy,
+                    workorder=workorder,
+                    lotid=lotid,
+                    package=package,
+                    workcenter=workcenter,
+                    status=status_upper,
+                    hold_type=hold_type_filter,
+                )
+                if filtered_df is None:
+                    return _get_wip_detail_from_oracle(
+                        workcenter,
+                        package,
+                        status,
+                        hold_type,
+                        workorder,
+                        lotid,
+                        include_dummy,
+                        page,
+                        page_size,
+                    )
+                df = filtered_df

             # Get specs (sorted by SPECSEQUENCE if available)
             specs_df = df[df['SPECNAME'].notna()][['SPECNAME', 'SPECSEQUENCE']].drop_duplicates()
```
```diff
@@ -1083,7 +1540,9 @@ def get_workcenters(include_dummy: bool = False) -> Optional[List[Dict[str, Any]]]:
     cached_df = _get_wip_dataframe()
     if cached_df is not None:
         try:
-            df = _filter_base_conditions(cached_df, include_dummy)
+            df = _select_with_snapshot_indexes(include_dummy=include_dummy)
+            if df is None:
+                return _get_workcenters_from_oracle(include_dummy)

             df = df[df['WORKCENTER_GROUP'].notna()]
             if df.empty:
@@ -1162,7 +1621,9 @@ def get_packages(include_dummy: bool = False) -> Optional[List[Dict[str, Any]]]:
     cached_df = _get_wip_dataframe()
     if cached_df is not None:
         try:
-            df = _filter_base_conditions(cached_df, include_dummy)
+            df = _select_with_snapshot_indexes(include_dummy=include_dummy)
+            if df is None:
+                return _get_packages_from_oracle(include_dummy)

             df = df[df['PACKAGE_LEF'].notna()]
             if df.empty:
```
```diff
@@ -1267,15 +1728,16 @@ def search_workorders(
     cached_df = _get_wip_dataframe()
     if cached_df is not None:
         try:
-            df = _filter_base_conditions(cached_df, include_dummy, lotid=lotid)
+            df = _select_with_snapshot_indexes(
+                include_dummy=include_dummy,
+                lotid=lotid,
+                package=package,
+                pj_type=pj_type,
+            )
+            if df is None:
+                return _search_workorders_from_oracle(q, limit, include_dummy, lotid, package, pj_type)

             df = df[df['WORKORDER'].notna()]

-            # Apply cross-filters
-            if package and 'PACKAGE_LEF' in df.columns:
-                df = df[df['PACKAGE_LEF'] == package]
-            if pj_type and 'PJ_TYPE' in df.columns:
-                df = df[df['PJ_TYPE'] == pj_type]
-
             # Filter by search query (case-insensitive)
             df = df[df['WORKORDER'].str.contains(q, case=False, na=False)]
@@ -1375,13 +1837,14 @@ def search_lot_ids(
     cached_df = _get_wip_dataframe()
     if cached_df is not None:
         try:
-            df = _filter_base_conditions(cached_df, include_dummy, workorder=workorder)
-
-            # Apply cross-filters
-            if package and 'PACKAGE_LEF' in df.columns:
-                df = df[df['PACKAGE_LEF'] == package]
-            if pj_type and 'PJ_TYPE' in df.columns:
-                df = df[df['PJ_TYPE'] == pj_type]
+            df = _select_with_snapshot_indexes(
+                include_dummy=include_dummy,
+                workorder=workorder,
+                package=package,
+                pj_type=pj_type,
+            )
+            if df is None:
+                return _search_lot_ids_from_oracle(q, limit, include_dummy, workorder, package, pj_type)

             # Filter by search query (case-insensitive)
             df = df[df['LOTID'].str.contains(q, case=False, na=False)]
@@ -1481,7 +1944,14 @@ def search_packages(
     cached_df = _get_wip_dataframe()
     if cached_df is not None:
         try:
-            df = _filter_base_conditions(cached_df, include_dummy, workorder=workorder, lotid=lotid)
+            df = _select_with_snapshot_indexes(
+                include_dummy=include_dummy,
+                workorder=workorder,
+                lotid=lotid,
+                pj_type=pj_type,
+            )
+            if df is None:
+                return _search_packages_from_oracle(q, limit, include_dummy, workorder, lotid, pj_type)

             # Check if PACKAGE_LEF column exists
             if 'PACKAGE_LEF' not in df.columns:
@@ -1490,10 +1960,6 @@
             df = df[df['PACKAGE_LEF'].notna()]

-            # Apply cross-filter
-            if pj_type and 'PJ_TYPE' in df.columns:
-                df = df[df['PJ_TYPE'] == pj_type]
-
             # Filter by search query (case-insensitive)
             df = df[df['PACKAGE_LEF'].str.contains(q, case=False, na=False)]
```
```diff
@@ -1591,7 +2057,14 @@ def search_types(
     cached_df = _get_wip_dataframe()
     if cached_df is not None:
         try:
-            df = _filter_base_conditions(cached_df, include_dummy, workorder=workorder, lotid=lotid)
+            df = _select_with_snapshot_indexes(
+                include_dummy=include_dummy,
+                workorder=workorder,
+                lotid=lotid,
+                package=package,
+            )
+            if df is None:
+                return _search_types_from_oracle(q, limit, include_dummy, workorder, lotid, package)

             # Check if PJ_TYPE column exists
             if 'PJ_TYPE' not in df.columns:
@@ -1600,10 +2073,6 @@
             df = df[df['PJ_TYPE'].notna()]

-            # Apply cross-filter
-            if package and 'PACKAGE_LEF' in df.columns:
-                df = df[df['PACKAGE_LEF'] == package]
-
             # Filter by search query (case-insensitive)
             df = df[df['PJ_TYPE'].str.contains(q, case=False, na=False)]
```
```diff
@@ -1686,11 +2155,15 @@ def get_hold_detail_summary(
     cached_df = _get_wip_dataframe()
     if cached_df is not None:
         try:
-            df = _filter_base_conditions(cached_df, include_dummy)
-            df = _add_wip_status_columns(df)
+            df = _select_with_snapshot_indexes(
+                include_dummy=include_dummy,
+                status='HOLD',
+            )
+            if df is None:
+                return _get_hold_detail_summary_from_oracle(reason, include_dummy)

             # Filter for HOLD status with matching reason
-            df = df[(df['WIP_STATUS'] == 'HOLD') & (df['HOLDREASONNAME'] == reason)]
+            df = df[df['HOLDREASONNAME'] == reason]

             if df.empty:
                 return {
@@ -1783,11 +2256,15 @@ def get_hold_detail_distribution(
     cached_df = _get_wip_dataframe()
     if cached_df is not None:
         try:
-            df = _filter_base_conditions(cached_df, include_dummy)
-            df = _add_wip_status_columns(df)
+            df = _select_with_snapshot_indexes(
+                include_dummy=include_dummy,
+                status='HOLD',
+            )
+            if df is None:
+                return _get_hold_detail_distribution_from_oracle(reason, include_dummy)

             # Filter for HOLD status with matching reason
-            df = df[(df['WIP_STATUS'] == 'HOLD') & (df['HOLDREASONNAME'] == reason)]
+            df = df[df['HOLDREASONNAME'] == reason]

             total_lots = len(df)
@@ -2072,20 +2549,30 @@ def get_hold_detail_lots(
     cached_df = _get_wip_dataframe()
     if cached_df is not None:
         try:
-            df = _filter_base_conditions(cached_df, include_dummy)
-            df = _add_wip_status_columns(df)
+            df = _select_with_snapshot_indexes(
+                include_dummy=include_dummy,
+                workcenter=workcenter,
+                package=package,
+                status='HOLD',
+            )
+            if df is None:
+                return _get_hold_detail_lots_from_oracle(
+                    reason=reason,
+                    workcenter=workcenter,
+                    package=package,
+                    age_range=age_range,
+                    include_dummy=include_dummy,
+                    page=page,
+                    page_size=page_size,
+                )

             # Filter for HOLD status with matching reason
-            df = df[(df['WIP_STATUS'] == 'HOLD') & (df['HOLDREASONNAME'] == reason)]
+            df = df[df['HOLDREASONNAME'] == reason]

             # Ensure numeric columns
             df['AGEBYDAYS'] = pd.to_numeric(df['AGEBYDAYS'], errors='coerce').fillna(0)

-            # Optional filters
-            if workcenter:
-                df = df[df['WORKCENTER_GROUP'] == workcenter]
-            if package:
-                df = df[df['PACKAGE_LEF'] == package]
-
+            # Optional age filter
             if age_range:
                 if age_range == '0-1':
                     df = df[(df['AGEBYDAYS'] >= 0) & (df['AGEBYDAYS'] < 1)]
```

---

```diff
@@ -33,6 +33,23 @@ const MesApi = (function() {
     let requestCounter = 0;

+    function getCsrfToken() {
+        const meta = document.querySelector('meta[name="csrf-token"]');
+        return meta ? meta.content : '';
+    }
+
+    function withCsrfHeaders(headers, method) {
+        const normalized = (method || 'GET').toUpperCase();
+        const nextHeaders = { ...(headers || {}) };
+        if (['POST', 'PUT', 'PATCH', 'DELETE'].includes(normalized)) {
+            const token = getCsrfToken();
+            if (token && !nextHeaders['X-CSRF-Token']) {
+                nextHeaders['X-CSRF-Token'] = token;
+            }
+        }
+        return nextHeaders;
+    }
+
     /**
      * Generate a unique request ID
      */
@@ -205,9 +222,9 @@ const MesApi = (function() {
         const fetchOptions = {
             method: method,
-            headers: {
+            headers: withCsrfHeaders({
                 'Content-Type': 'application/json'
-            }
+            }, method)
         };

         if (options.body) {
```

---

```diff
@@ -3,6 +3,7 @@
 <head>
     <meta charset="UTF-8">
     <meta name="viewport" content="width=device-width, initial-scale=1.0">
+    <meta name="csrf-token" content="{{ csrf_token() }}">
     <title>{% block title %}MES Dashboard{% endblock %}</title>

     <!-- Toast 樣式 -->
```

---

```diff
@@ -223,6 +223,11 @@
 {% block scripts %}
 <script>
     const tbody = document.getElementById('pages-tbody');
+    const csrfToken = document.querySelector('meta[name="csrf-token"]')?.content || '';
+
+    function withCsrfHeaders(headers = {}) {
+        return csrfToken ? { ...headers, 'X-CSRF-Token': csrfToken } : headers;
+    }

     async function loadPages() {
         try {
@@ -266,7 +271,7 @@
         try {
             const response = await fetch(`/admin/api/pages${route}`, {
                 method: 'PUT',
-                headers: { 'Content-Type': 'application/json' },
+                headers: withCsrfHeaders({ 'Content-Type': 'application/json' }),
                 body: JSON.stringify({ status: newStatus })
             });
```

---

```diff
@@ -707,7 +707,13 @@
     // Auth Helper
     // ============================================================
     async function fetchWithAuth(url, options = {}) {
-        const resp = await fetch(url, { ...options, cache: 'no-store' });
+        const method = (options.method || 'GET').toUpperCase();
+        const csrfToken = document.querySelector('meta[name="csrf-token"]')?.content || '';
+        const headers = { ...(options.headers || {}) };
+        if (csrfToken && ['POST', 'PUT', 'PATCH', 'DELETE'].includes(method)) {
+            headers['X-CSRF-Token'] = csrfToken;
+        }
+        const resp = await fetch(url, { ...options, headers, cache: 'no-store' });
         if (resp.status === 401) {
             const json = await resp.json().catch(() => ({}));
             if (!authErrorShown) {
@@ -962,9 +968,15 @@
             document.getElementById('workerStartTime').textContent =
                 data.worker_start_time ? formatTimestamp(data.worker_start_time) : '--';

-            // Update cooldown status
+            // Update recovery policy status
+            const policyState = data?.resilience?.policy_state || {};
             const cooldown = data.cooldown;
-            if (cooldown && cooldown.active) {
+            if (policyState.blocked) {
+                document.getElementById('workerCooldown').textContent = 'Guarded mode(需手動 override)';
+                document.getElementById('restartBtn').disabled = false;
+                document.getElementById('restartBtn').style.opacity = '1';
+                document.getElementById('restartBtn').style.cursor = 'pointer';
+            } else if (cooldown && cooldown.active) {
                 document.getElementById('workerCooldown').textContent =
                     `冷卻中 (${cooldown.remaining_seconds}秒)`;
                 document.getElementById('restartBtn').disabled = true;
@@ -1017,11 +1029,41 @@
             btn.style.opacity = '0.5';

             try {
-                const resp = await fetchWithAuth('/admin/api/worker/restart', {
+                let resp = await fetchWithAuth('/admin/api/worker/restart', {
                     method: 'POST',
-                    headers: { 'Content-Type': 'application/json' }
+                    headers: { 'Content-Type': 'application/json' },
+                    body: JSON.stringify({})
                 });
-                const json = await resp.json();
+                let json = await resp.json();
+
+                if (!json.success && resp.status === 409) {
+                    const reason = window.prompt(
+                        '目前 restart policy 為 guarded mode。\n請輸入 override 原因(會記錄於稽核日誌):'
+                    );
+                    if (!reason || !reason.trim()) {
+                        alert('已取消 override。');
+                        return;
+                    }
+                    const acknowledged = window.confirm(
+                        '確認執行 manual override?此操作將繞過 guarded mode 保護。'
+                    );
+                    if (!acknowledged) {
+                        alert('已取消 override。');
+                        return;
+                    }
+                    resp = await fetchWithAuth('/admin/api/worker/restart', {
+                        method: 'POST',
+                        headers: { 'Content-Type': 'application/json' },
+                        body: JSON.stringify({
+                            manual_override: true,
+                            override_acknowledged: true,
+                            override_reason: reason.trim()
+                        })
+                    });
+                    json = await resp.json();
+                }

                 if (!json.success) {
                     alert('重啟失敗: ' + (json.error?.message || '未知錯誤'));
```
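The 409-then-retry flow above implies a server-side restart policy: restarts are allowed during normal operation, but once the churn budget is exhausted the endpoint enters guarded mode and answers 409 unless the caller supplies an acknowledged manual override with a reason. A minimal sketch of that decision (field names and thresholds are illustrative, not taken from the actual endpoint):

```python
from dataclasses import dataclass

@dataclass
class RestartPolicy:
    """Illustrative model of the guarded-mode decision behind
    /admin/api/worker/restart: 200 while within budget, 409 once blocked,
    unless an acknowledged override with a reason is provided."""
    max_restarts: int = 3
    restarts: int = 0
    blocked: bool = False

    def decide(self, manual_override=False, acknowledged=False, reason=""):
        if self.blocked:
            if manual_override and acknowledged and reason.strip():
                return 200  # override accepted; caller writes an audit record
            return 409      # guarded mode: client must re-submit with override
        self.restarts += 1
        if self.restarts >= self.max_restarts:
            self.blocked = True  # churn budget exhausted; enter guarded mode
        return 200
```

This matches the UI flow: the first 409 triggers the prompt/confirm dialog, and the retried request carries `manual_override`, `override_acknowledged`, and `override_reason` for the audit log.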

---

```diff
@@ -682,7 +682,7 @@
     // State
     // ============================================================
     const state = {
-        reason: '{{ reason | e }}',
+        reason: {{ reason | tojson }},
         summary: null,
         distribution: null,
         lots: null,
```

---

```diff
@@ -132,6 +132,7 @@
     {% endif %}

     <form method="POST">
+        <input type="hidden" name="csrf_token" value="{{ csrf_token() }}">
         <div class="form-group">
             <label for="username">帳號</label>
             <input type="text" id="username" name="username" placeholder="工號或 Email" required autofocus>
```

---

```diff
@@ -0,0 +1,9 @@
+{
+    "rows": 30000,
+    "query_count": 400,
+    "seed": 42,
+    "thresholds": {
+        "max_p95_ratio_indexed_vs_baseline": 1.25,
+        "max_memory_amplification_ratio": 1.8
+    }
+}
```
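The fixture's thresholds suggest a gate that compares indexed-vs-baseline P95 latency and cache memory amplification against configured ceilings. A minimal sketch of how `scripts/run_cache_benchmarks.py` might consume such a fixture (the gate logic here is assumed, not taken from the script):

```python
import json
import math

def p95(samples):
    """Nearest-rank 95th percentile."""
    ordered = sorted(samples)
    idx = max(0, math.ceil(0.95 * len(ordered)) - 1)
    return ordered[idx]

def gate(fixture_json: str, baseline_ms, indexed_ms, amplification: float):
    """Return a list of threshold violations; empty means the gate passes."""
    cfg = json.loads(fixture_json)["thresholds"]
    ratio = p95(indexed_ms) / p95(baseline_ms)
    failures = []
    if ratio > cfg["max_p95_ratio_indexed_vs_baseline"]:
        failures.append(f"p95 ratio {ratio:.2f} exceeds limit")
    if amplification > cfg["max_memory_amplification_ratio"]:
        failures.append(f"memory amplification {amplification:.2f} exceeds limit")
    return failures

fixture = '{"thresholds": {"max_p95_ratio_indexed_vs_baseline": 1.25, "max_memory_amplification_ratio": 1.8}}'
print(gate(fixture, baseline_ms=[10.0] * 100, indexed_ms=[11.0] * 100, amplification=1.3))  # → []
```

A non-empty failure list would fail the CI gate, tying the `rows`/`query_count`/`seed` fields to a reproducible synthetic workload.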

Some files were not shown because too many files have changed in this diff.