diff --git a/README.md b/README.md index b699d23..3c67de7 100644 --- a/README.md +++ b/README.md @@ -26,11 +26,60 @@ | Worker 重啟控制 | ✅ 已完成 | | Runtime 韌性診斷(threshold/churn/recommendation) | ✅ 已完成 | | WIP 共用 autocomplete core 模組 | ✅ 已完成 | +| WIP 共用 derive core 模組(KPI/filter/chart/table) | ✅ 已完成 | +| WIP 索引查詢加速與增量同步 | ✅ 已完成 | +| 快取記憶體放大係數 telemetry | ✅ 已完成 | +| Cache benchmark gate(P95/記憶體門檻) | ✅ 已完成 | +| Worker guarded mode + manual override 稽核 | ✅ 已完成 | +| Runtime contract 啟動校驗(conda/systemd/watchdog) | ✅ 已完成 | | 前端核心模組測試(Node test) | ✅ 已完成 | | 部署自動化 | ✅ 已完成 | --- +## 開發歷史(Vite 重構後) + +- 2026-02-07:完成 Flask + Vite 單一 port 架構切換,舊版 `DashBoard/` 停用。 +- 2026-02-08:補齊 runtime 韌性治理(threshold/churn/recommendation)與 watchdog 可觀測欄位。 +- 2026-02-08:完成 P0 安全/穩定性硬化: + - production `SECRET_KEY` 缺失時啟動失敗(fail-fast) + - admin form + admin mutation API CSRF 防護 + - health probe 使用獨立 DB pool,避免與主查詢池互相阻塞 + - worker/app shutdown 統一清理 cache updater、realtime sync、Redis、DB engine + - `hold_detail` inline script 變數改為 `tojson` 序列化 +- 2026-02-08:完成 P1 快取/查詢效率重構: + - WIP 查詢路徑改為索引選擇,保留 `resource/wip` 全表快取語意 + - WIP search index 增量同步(watermark/version)與 drift fallback + - health/admin 新增 cache memory amplification telemetry + - 建立 `scripts/run_cache_benchmarks.py` + fixture gate +- 2026-02-08:完成 P2 運維自癒治理: + - runtime contract 共用化(app/start_server/watchdog/systemd) + - 啟動時 conda/watchdog 路徑 drift fail-fast + - worker restart policy(cooldown/retry budget/churn guarded mode) + - manual override(需 ack + reason)與結構化 audit log +- 2026-02-08:完成 round-2 安全/穩定補強: + - LDAP endpoint 改為嚴格驗證(`https` + `LDAP_ALLOWED_HOSTS`) + - process-level cache 新增 `max_size + LRU`(WIP/Resource) + - circuit breaker transition logging 移至鎖外,降低 lock contention + - 全域安全標頭(CSP/XFO/nosniff/Referrer-Policy,production 加 HSTS) + - WIP detail 分頁參數加上下限(`page>=1`、`1<=page_size<=500`) +- 2026-02-08:完成 round-3 殘餘風險修補: + - WIP cache publish 採 staged publish,失敗不污染舊快照 + - WIP slow-path parse 移至鎖外;realtime equipment process cache 補齊 bounded 
LRU + - resource NaN 清理改為 depth-safe 迭代;WIP/Hold 布林查詢解析共用化 + - filter cache view 名稱改為 env 可配置 + - `/health`、`/health/deep` 新增 5 秒內部 memo(testing 模式禁用) + - 高成本 API 增加輕量 rate limit(WIP detail/matrix、Hold lots、Resource status/detail) + - DB 連線字串 log redaction 遮罩密碼 +- 2026-02-08:完成 round-4 殘餘治理收斂: + - Resource derived index 改為 row-position representation,移除 process 內 full records 複本 + - Resource / Realtime Equipment 共用 Oracle SQL fragments,降低查詢定義漂移 + - `resource_cache` / `realtime_equipment_cache` 型別註記與高頻常數命名收斂 + - `page_registry` 寫檔改為 atomic replace(tmp + rename),避免設定檔半寫入 + - 新增測試保護:共享 SQL 片段、index normalization、route bool parser 不重複定義 + +--- + ## 遷移與驗收文件 - Root cutover 盤點:`docs/root_cutover_inventory.md` @@ -46,20 +95,37 @@ 1. 單一 port 契約維持不變 - Flask + Gunicorn + Vite dist 由同一服務提供(`GUNICORN_BIND`),前後端同源。 -2. Runtime 韌性採「降級 + 可操作建議」 +2. Runtime 韌性採「降級 + 可操作建議 + policy state」 - `/health`、`/health/deep`、`/admin/api/system-status`、`/admin/api/worker/status` 皆提供: - 門檻(thresholds) + - policy state(`allowed` / `cooldown` / `blocked`) - 重啟 churn 摘要 + - alerts(pool/circuit/churn) - recovery recommendation(值班建議動作) -3. Watchdog 維持手動觸發重啟模型 -- 仍以 admin API 觸發 reload,不預設啟用自動重啟風暴風險。 -- state 檔新增 bounded restart history,方便追蹤 churn。 +3. Watchdog 自癒策略具界限保護 +- restart 流程納入 cooldown + retry budget + churn window。 +- churn 超標時進入 guarded mode,需 admin manual override 才可繼續重啟。 +- state 檔保留 bounded restart history,供 policy 與稽核使用。 -4. 前端治理:WIP autocomplete/filter 共用化 +4. 前端治理:WIP compute 共用化 - `frontend/src/core/autocomplete.js` 作為 WIP overview/detail 共用邏輯來源。 +- `frontend/src/core/wip-derive.js` 共用 KPI/filter/chart/table 導出運算。 - 維持既有頁面流程與 drill-down 語意,不變更操作習慣。 +5. P1 快取效率治理 +- 保留 `resource`、`wip` 全表快取策略(業務約束不變)。 +- 查詢改走索引選擇,並提供 memory amplification / index efficiency telemetry。 +- 以 benchmark gate 驗證 P95 延遲與記憶體放大不超過門檻。 + +6. 
P0 Runtime Hardening(安全 + 穩定) +- Production 必須提供 `SECRET_KEY`;未設定時服務拒絕啟動。 +- `/admin/login` 與 `/admin/api/*` 變更請求必須攜帶 CSRF token。 +- `/health` 資料庫連通探針使用獨立 health pool,降低 pool 飽和時誤判。 +- 關機/重啟時統一釋放 background workers 與 Redis/DB 連線資源。 +- LDAP API URL 啟動驗證:僅允許 `https` + host allowlist。 +- 全域 security headers:CSP/X-Frame-Options/X-Content-Type-Options/Referrer-Policy(production 含 HSTS)。 + --- ## 快速開始 @@ -175,6 +241,12 @@ DB_MAX_OVERFLOW=20 DB_POOL_TIMEOUT=30 DB_POOL_RECYCLE=1800 DB_CALL_TIMEOUT_MS=55000 +DB_POOL_EXHAUSTED_RETRY_AFTER_SECONDS=5 + +# Health probe 專用 DB pool(與主 request pool 隔離) +DB_HEALTH_POOL_SIZE=1 +DB_HEALTH_MAX_OVERFLOW=0 +DB_HEALTH_POOL_TIMEOUT=2 # Circuit Breaker CIRCUIT_BREAKER_ENABLED=true @@ -192,6 +264,17 @@ WATCHDOG_RESTART_FLAG=./tmp/mes_dashboard_restart.flag WATCHDOG_PID_FILE=./tmp/gunicorn.pid WATCHDOG_STATE_FILE=./tmp/mes_dashboard_restart_state.json WATCHDOG_RESTART_HISTORY_MAX=50 +CONDA_BIN=/opt/miniconda3/bin/conda +CONDA_ENV_NAME=mes-dashboard +RUNTIME_CONTRACT_VERSION=2026.02-p2 +RUNTIME_CONTRACT_ENFORCE=true + +# Worker self-healing policy +WORKER_RESTART_COOLDOWN=60 +WORKER_RESTART_RETRY_BUDGET=3 +WORKER_RESTART_WINDOW_SECONDS=600 +WORKER_RESTART_CHURN_THRESHOLD=3 +WORKER_GUARDED_MODE_ENABLED=true # Runtime resilience thresholds RESILIENCE_DEGRADED_ALERT_SECONDS=300 @@ -202,6 +285,36 @@ RESILIENCE_RESTART_CHURN_THRESHOLD=3 # 管理員設定 ADMIN_EMAILS=admin@example.com # 管理員郵件(逗號分隔) +LDAP_API_URL=https://ldap-api.example.com +LDAP_ALLOWED_HOSTS=ldap-api.example.com,ldap-api-dr.example.com + +# CSRF 防護(admin form/admin mutation API) +CSRF_ENABLED=true + +# Process-level cache bounded LRU(WIP/Resource) +PROCESS_CACHE_MAX_SIZE=32 +WIP_PROCESS_CACHE_MAX_SIZE=32 +RESOURCE_PROCESS_CACHE_MAX_SIZE=32 +EQUIPMENT_PROCESS_CACHE_MAX_SIZE=32 + +# Filter cache source views (env-overridable) +FILTER_CACHE_WIP_VIEW=DWH.DW_MES_LOT_V +FILTER_CACHE_SPEC_WORKCENTER_VIEW=DWH.DW_MES_SPEC_WORKCENTER_V + +# Health internal memoization +HEALTH_MEMO_TTL_SECONDS=5 + 
+# High-cost API rate limit (in-process)
+WIP_MATRIX_RATE_LIMIT_MAX_REQUESTS=120
+WIP_MATRIX_RATE_LIMIT_WINDOW_SECONDS=60
+WIP_DETAIL_RATE_LIMIT_MAX_REQUESTS=90
+WIP_DETAIL_RATE_LIMIT_WINDOW_SECONDS=60
+HOLD_LOTS_RATE_LIMIT_MAX_REQUESTS=90
+HOLD_LOTS_RATE_LIMIT_WINDOW_SECONDS=60
+RESOURCE_DETAIL_RATE_LIMIT_MAX_REQUESTS=60
+RESOURCE_DETAIL_RATE_LIMIT_WINDOW_SECONDS=60
+RESOURCE_STATUS_RATE_LIMIT_MAX_REQUESTS=90
+RESOURCE_STATUS_RATE_LIMIT_WINDOW_SECONDS=60
 ```
 
 ### 生產環境注意事項
@@ -226,6 +339,7 @@ sudo cp deploy/mes-dashboard-watchdog.service /etc/systemd/system/
 
 # 2. 準備環境設定檔
 sudo mkdir -p /etc/mes-dashboard
-sudo cp .env /etc/mes-dashboard/mes-dashboard.env
+sudo cp deploy/mes-dashboard.env.example /etc/mes-dashboard/mes-dashboard.env
+# 再將 .env 中的部署機密合併進 /etc/mes-dashboard/mes-dashboard.env
 
 # 3. 重新載入 systemd
@@ -238,6 +352,12 @@ sudo systemctl enable --now mes-dashboard mes-dashboard-watchdog
 sudo systemctl status mes-dashboard
 sudo systemctl status mes-dashboard-watchdog
 ```
+
+執行 runtime contract 驗證:
+
+```bash
+RUNTIME_CONTRACT_ENFORCE=true ./scripts/start_server.sh check
+```
 
 ### Rollback 步驟
@@ -494,7 +614,8 @@ DashBoard_vite/
 │   └── worker_watchdog.py               # Worker 監控程式
 ├── deploy/                              # 部署設定
 │   ├── mes-dashboard.service            # Gunicorn systemd 服務 (Conda)
-│   └── mes-dashboard-watchdog.service   # Watchdog systemd 服務 (Conda)
+│   ├── mes-dashboard-watchdog.service   # Watchdog systemd 服務 (Conda)
+│   └── mes-dashboard.env.example        # Runtime contract 環境範本
 ├── tests/                               # 測試
 ├── data/                                # 資料檔案
 ├── logs/                                # 日誌
@@ -522,9 +643,12 @@ pytest tests/test_*_integration.py -v
 
 # 執行 E2E 測試
 pytest tests/e2e/ -v
 
-# 執行壓力測試
-pytest tests/stress/ -v
-```
+# 執行壓力測試
+pytest tests/stress/ -v
+
+# Cache benchmark gate(P1)
+conda run -n mes-dashboard python scripts/run_cache_benchmarks.py --enforce
+```
 
 ---
@@ -569,12 +693,17 @@ pytest tests/stress/ -v
 
 ### 2026-02-08
 
 - 完成並封存提案 `post-migration-resilience-governance`
+- 完成並封存提案 `p1-cache-query-efficiency`
+- 完成並封存提案 `p2-ops-self-healing-runbook`
 - 新增 runtime 韌性診斷核心(thresholds / restart churn / recovery recommendation)
+- 新增 
worker restart policy state(allowed/cooldown/blocked)與 guarded mode override 流程 - health 與 admin API 新增可操作韌性欄位: - `/health`、`/health/deep` - `/admin/api/system-status`、`/admin/api/worker/status` - watchdog restart state 支援 bounded history(`WATCHDOG_RESTART_HISTORY_MAX`) - WIP overview/detail 抽離共用 autocomplete/filter 模組(`frontend/src/core/autocomplete.js`) +- WIP overview/detail 導入共享 derive 模組(`frontend/src/core/wip-derive.js`) +- 新增 cache benchmark fixture 與 baseline-vs-indexed 門檻驗證 - 新增前端 Node 測試流程(`npm --prefix frontend test`) - 更新 `README.mdj` 與 migration runbook 文件對齊 gate @@ -654,5 +783,5 @@ pytest tests/stress/ -v --- -**文檔版本**: 4.1 +**文檔版本**: 4.2 **最後更新**: 2026-02-08 diff --git a/README.mdj b/README.mdj index 17c1f1b..c11cf35 100644 --- a/README.mdj +++ b/README.mdj @@ -1,61 +1,151 @@ -# MES Dashboard Architecture Snapshot (README.mdj) +# MES Dashboard(README.mdj) -本檔案為 `README.md` 的架構摘要鏡像,重點反映目前已完成的 Vite + 單一 port 運行契約與韌性治理策略。 +本文件為 `README.md` 的精簡技術同步版,聚焦目前可運行架構與運維契約。 -## Runtime Contract +## 1. 架構摘要(2026-02-08) -- 單一服務單一 port:`GUNICORN_BIND`(預設 `0.0.0.0:8080`) -- 前端資產由 Vite build 到 `src/mes_dashboard/static/dist/`,由 Flask/Gunicorn 同源提供 -- Watchdog 透過 restart flag + `SIGHUP` 進行 graceful worker reload +- 後端:Flask + Gunicorn(單一 port) +- 前端:Vite build 輸出到 `src/mes_dashboard/static/dist` +- 快取:Redis + process-level cache + indexed selection telemetry +- 資料:Oracle(QueuePool) +- 運維:watchdog + admin worker restart API + guarded-mode policy -## Resilience Contract +## 2. 既有設計原則(保留) -- 降級回應:`DB_POOL_EXHAUSTED`、`CIRCUIT_BREAKER_OPEN` + `Retry-After` -- health/admin 診斷輸出包含: - - thresholds - - restart churn summary - - recovery recommendation -- 不預設啟用自動重啟;維持受控人工觸發,避免重啟風暴 +- `resource`(設備基礎資料)與 `wip`(線上即時狀況)維持全表快取策略。 +- 前端頁面邏輯與 drill-down 操作語意維持不變。 +- 系統維持單一 port 服務模式(前後端同源)。 -## Frontend Governance +## 3. 
P0 Runtime Hardening(已完成) -- WIP overview/detail 的 autocomplete/filter 查詢邏輯共用 `frontend/src/core/autocomplete.js` -- 目標:維持既有操作語意,同時降低重複邏輯與維護成本 -- 前端核心模組測試:`npm --prefix frontend test` +- Production 強制 `SECRET_KEY`:未設定或使用不安全預設值時,啟動直接失敗。 +- CSRF 防護: + - `/admin/login` 表單需 token + - `/admin/api/*` 的 `POST/PUT/PATCH/DELETE` 需 `X-CSRF-Token` +- Session hardening:登入成功後 `session.clear()` + CSRF token rotation。 +- Health probe isolation:`/health` DB 連通檢查使用獨立 health pool。 +- Shutdown cleanup:統一停止 cache updater、equipment sync worker,並關閉 Redis 與 DB engine。 +- XSS hardening:`hold_detail` fallback script 的 `reason` 改用 `tojson`。 -## 開發歷史(摘要) +## 4. P1 Cache/Query Efficiency(已完成) -### 2026-02-08 -- 封存 `post-migration-resilience-governance` -- 新增韌性診斷欄位(thresholds/churn/recommendation) -- 完成 WIP autocomplete 共用模組化與前端測試腳本 +- `resource` / `wip` 仍維持全表快取策略(業務約束不變)。 +- WIP 查詢改走 indexed selection,並加入增量同步(watermark/version)與 drift fallback。 +- `/health`、`/health/deep`、`/admin/api/system-status` 提供 cache memory amplification/index telemetry。 +- 新增 benchmark harness:`scripts/run_cache_benchmarks.py --enforce`。 -### 2026-02-07 -- 封存完整 Vite 遷移相關提案群組 -- 單一 port 架構、抽屜導航、欄位契約治理與 migration gates 就位 +## 5. P2 Ops Self-Healing(已完成) -## Key Configs +- runtime contract 共用化:app/start_server/watchdog/systemd 使用同一組 watchdog/conda 路徑契約。 +- 啟動 fail-fast:conda/runtime path drift 時拒絕啟動並輸出可操作診斷。 +- worker restart policy:cooldown + retry budget + churn guarded mode。 +- manual override:需 admin 身分 + `manual_override` + `override_acknowledged` + `override_reason`,且寫入 audit log。 +- health/admin payload 提供 policy state:`allowed` / `cooldown` / `blocked`。 + +## 6. 
Round-3 Residual Hardening(已完成) + +- WIP cache publish 改為 staged publish,更新失敗不覆寫舊快照。 +- WIP process cache slow-path parse 移到鎖外,降低 lock contention。 +- realtime equipment process cache 補齊 bounded LRU(含 `EQUIPMENT_PROCESS_CACHE_MAX_SIZE`)。 +- `_clean_nan_values` 改為 depth-safe 迭代式清理(避免深層遞迴風險)。 +- WIP/Hold/Resource bool query parser 共用化(`core/utils.py`)。 +- filter cache source view 可由 env 覆寫(便於環境切換與測試)。 +- `/health`、`/health/deep` 增加 5 秒 memo(testing 模式自動關閉)。 +- 高成本 API 增加輕量 in-process rate limit,超限回傳一致 429 結構。 +- DB 連線字串記錄加上敏感欄位遮罩(密碼 redaction)。 + +## 7. Round-4 Residual Consolidation(已完成) + +- Resource derived index 改為 row-position representation,不再在 process 內保存 full records 複本。 +- Resource / Realtime Equipment 共用 Oracle SQL fragments,避免查詢定義重複漂移。 +- `resource_cache` / `realtime_equipment_cache` 型別註記風格與高頻常數命名收斂。 +- `page_registry` 寫檔改為 atomic replace,降低設定檔半寫入風險。 +- 新增測試覆蓋 shared SQL fragment 與 bool parser 不重複定義治理。 + +## 8. 重要環境變數 ```bash +FLASK_ENV=production +SECRET_KEY= +CSRF_ENABLED=true + +LDAP_API_URL=https://ldap-api.example.com +LDAP_ALLOWED_HOSTS=ldap-api.example.com,ldap-api-dr.example.com + +DB_POOL_SIZE=10 +DB_MAX_OVERFLOW=20 +DB_POOL_TIMEOUT=30 +DB_POOL_RECYCLE=1800 +DB_CALL_TIMEOUT_MS=55000 +DB_POOL_EXHAUSTED_RETRY_AFTER_SECONDS=5 + +DB_HEALTH_POOL_SIZE=1 +DB_HEALTH_MAX_OVERFLOW=0 +DB_HEALTH_POOL_TIMEOUT=2 + +CONDA_BIN=/opt/miniconda3/bin/conda +CONDA_ENV_NAME=mes-dashboard +RUNTIME_CONTRACT_VERSION=2026.02-p2 +RUNTIME_CONTRACT_ENFORCE=true + WATCHDOG_RUNTIME_DIR=./tmp WATCHDOG_RESTART_FLAG=./tmp/mes_dashboard_restart.flag WATCHDOG_PID_FILE=./tmp/gunicorn.pid WATCHDOG_STATE_FILE=./tmp/mes_dashboard_restart_state.json WATCHDOG_RESTART_HISTORY_MAX=50 -RESILIENCE_DEGRADED_ALERT_SECONDS=300 -RESILIENCE_POOL_SATURATION_WARNING=0.90 -RESILIENCE_POOL_SATURATION_CRITICAL=1.0 -RESILIENCE_RESTART_CHURN_WINDOW_SECONDS=600 -RESILIENCE_RESTART_CHURN_THRESHOLD=3 +WORKER_RESTART_COOLDOWN=60 +WORKER_RESTART_RETRY_BUDGET=3 +WORKER_RESTART_WINDOW_SECONDS=600 
+WORKER_RESTART_CHURN_THRESHOLD=3 +WORKER_GUARDED_MODE_ENABLED=true + +PROCESS_CACHE_MAX_SIZE=32 +WIP_PROCESS_CACHE_MAX_SIZE=32 +RESOURCE_PROCESS_CACHE_MAX_SIZE=32 +EQUIPMENT_PROCESS_CACHE_MAX_SIZE=32 + +FILTER_CACHE_WIP_VIEW=DWH.DW_MES_LOT_V +FILTER_CACHE_SPEC_WORKCENTER_VIEW=DWH.DW_MES_SPEC_WORKCENTER_V + +HEALTH_MEMO_TTL_SECONDS=5 + +WIP_MATRIX_RATE_LIMIT_MAX_REQUESTS=120 +WIP_MATRIX_RATE_LIMIT_WINDOW_SECONDS=60 +WIP_DETAIL_RATE_LIMIT_MAX_REQUESTS=90 +WIP_DETAIL_RATE_LIMIT_WINDOW_SECONDS=60 +HOLD_LOTS_RATE_LIMIT_MAX_REQUESTS=90 +HOLD_LOTS_RATE_LIMIT_WINDOW_SECONDS=60 +RESOURCE_DETAIL_RATE_LIMIT_MAX_REQUESTS=60 +RESOURCE_DETAIL_RATE_LIMIT_WINDOW_SECONDS=60 +RESOURCE_STATUS_RATE_LIMIT_MAX_REQUESTS=90 +RESOURCE_STATUS_RATE_LIMIT_WINDOW_SECONDS=60 ``` -## Validation Quick Commands +## 9. 驗證命令(建議) ```bash +# 後端(conda) +conda run -n mes-dashboard python -m pytest -q tests/test_runtime_hardening.py + +# 前端 npm --prefix frontend test npm --prefix frontend run build -python -m pytest -q tests/test_resilience.py tests/test_health_routes.py tests/test_performance_integration.py + +# P1 benchmark gate +conda run -n mes-dashboard python scripts/run_cache_benchmarks.py --enforce + +# P2 runtime contract check +RUNTIME_CONTRACT_ENFORCE=true ./scripts/start_server.sh check ``` -> 詳細部署、使用說明與完整環境配置請參考 `README.md`。 +## 10. 
開發歷史(Vite 專案) + +- 2026-02-07:完成 Vite 根目錄重構與舊版切除。 +- 2026-02-08:完成 resilience 診斷治理與前端共用模組化。 +- 2026-02-08:完成 P0 安全/穩定性硬化(本次更新)。 +- 2026-02-08:完成 P1 快取查詢效率重構(index + benchmark gate)。 +- 2026-02-08:完成 P2 運維自癒治理(guarded mode + manual override + runtime contract)。 +- 2026-02-08:完成 round-2 hardening(LDAP URL 驗證、bounded LRU cache、circuit breaker 鎖外日誌、安全標頭、分頁邊界)。 +- 2026-02-08:完成 round-3 residual hardening(staged publish、health memo、API rate limit、DB redaction、filter view env 化)。 +- 2026-02-08:完成 round-4 residual consolidation(resource index 表示正規化、shared SQL fragments、型別與常數治理、atomic page status 寫入)。 diff --git a/deploy/mes-dashboard-watchdog.service b/deploy/mes-dashboard-watchdog.service index c291d9e..fe31b58 100644 --- a/deploy/mes-dashboard-watchdog.service +++ b/deploy/mes-dashboard-watchdog.service @@ -18,6 +18,13 @@ Environment="WATCHDOG_RESTART_FLAG=/run/mes-dashboard/mes_dashboard_restart.flag Environment="WATCHDOG_PID_FILE=/run/mes-dashboard/gunicorn.pid" Environment="WATCHDOG_STATE_FILE=/var/lib/mes-dashboard/restart_state.json" Environment="WATCHDOG_CHECK_INTERVAL=5" +Environment="RUNTIME_CONTRACT_VERSION=2026.02-p2" +Environment="RUNTIME_CONTRACT_ENFORCE=true" +Environment="WORKER_RESTART_COOLDOWN=60" +Environment="WORKER_RESTART_RETRY_BUDGET=3" +Environment="WORKER_RESTART_WINDOW_SECONDS=600" +Environment="WORKER_RESTART_CHURN_THRESHOLD=3" +Environment="WORKER_GUARDED_MODE_ENABLED=true" RuntimeDirectory=mes-dashboard StateDirectory=mes-dashboard diff --git a/deploy/mes-dashboard.env.example b/deploy/mes-dashboard.env.example new file mode 100644 index 0000000..82f6f81 --- /dev/null +++ b/deploy/mes-dashboard.env.example @@ -0,0 +1,26 @@ +# MES Dashboard runtime contract (version 2026.02-p2) + +# Conda runtime +CONDA_BIN=/opt/miniconda3/bin/conda +CONDA_ENV_NAME=mes-dashboard + +# Single-port serving contract +GUNICORN_BIND=0.0.0.0:8080 + +# Watchdog/runtime paths +WATCHDOG_RUNTIME_DIR=/run/mes-dashboard 
+WATCHDOG_RESTART_FLAG=/run/mes-dashboard/mes_dashboard_restart.flag
+WATCHDOG_PID_FILE=/run/mes-dashboard/gunicorn.pid
+WATCHDOG_STATE_FILE=/var/lib/mes-dashboard/restart_state.json
+WATCHDOG_CHECK_INTERVAL=5
+
+# Runtime contract enforcement
+RUNTIME_CONTRACT_VERSION=2026.02-p2
+RUNTIME_CONTRACT_ENFORCE=true
+
+# Worker recovery policy
+WORKER_RESTART_COOLDOWN=60
+WORKER_RESTART_RETRY_BUDGET=3
+WORKER_RESTART_WINDOW_SECONDS=600
+WORKER_RESTART_CHURN_THRESHOLD=3
+WORKER_GUARDED_MODE_ENABLED=true
diff --git a/deploy/mes-dashboard.service b/deploy/mes-dashboard.service
index 630577d..44f6c83 100644
--- a/deploy/mes-dashboard.service
+++ b/deploy/mes-dashboard.service
@@ -18,6 +18,13 @@ Environment="WATCHDOG_RUNTIME_DIR=/run/mes-dashboard"
 Environment="WATCHDOG_RESTART_FLAG=/run/mes-dashboard/mes_dashboard_restart.flag"
 Environment="WATCHDOG_PID_FILE=/run/mes-dashboard/gunicorn.pid"
 Environment="WATCHDOG_STATE_FILE=/var/lib/mes-dashboard/restart_state.json"
+Environment="RUNTIME_CONTRACT_VERSION=2026.02-p2"
+Environment="RUNTIME_CONTRACT_ENFORCE=true"
+Environment="WORKER_RESTART_COOLDOWN=60"
+Environment="WORKER_RESTART_RETRY_BUDGET=3"
+Environment="WORKER_RESTART_WINDOW_SECONDS=600"
+Environment="WORKER_RESTART_CHURN_THRESHOLD=3"
+Environment="WORKER_GUARDED_MODE_ENABLED=true"
 
 RuntimeDirectory=mes-dashboard
 StateDirectory=mes-dashboard
diff --git a/docs/migration_gates_and_runbook.md b/docs/migration_gates_and_runbook.md
index 838032d..a63bfdc 100644
--- a/docs/migration_gates_and_runbook.md
+++ b/docs/migration_gates_and_runbook.md
@@ -26,10 +26,12 @@ A release is cutover-ready only when all gates pass:
 - pool exhaustion path returns `503` + `DB_POOL_EXHAUSTED` and `Retry-After`
 - circuit-open path returns `503` + `CIRCUIT_BREAKER_OPEN` and fail-fast semantics
 - frontend client does not aggressively retry on degraded pool exhaustion responses
+- health/admin payloads expose worker policy state (`allowed`/`cooldown`/`blocked`) and alert booleans
 
 6. 
Conda-systemd contract gate - `deploy/mes-dashboard.service` and `deploy/mes-dashboard-watchdog.service` both run in the same conda runtime contract - `WATCHDOG_RESTART_FLAG`, `WATCHDOG_PID_FILE`, `WATCHDOG_STATE_FILE` paths are consistent across app/admin/watchdog +- startup contract validation passes: `RUNTIME_CONTRACT_ENFORCE=true ./scripts/start_server.sh check` - single-port bind (`GUNICORN_BIND`) remains stable during restart workflow 7. Regression gate @@ -60,7 +62,8 @@ A release is cutover-ready only when all gates pass: 5. Conda + systemd rehearsal (recommended before production cutover) - `sudo cp deploy/mes-dashboard.service /etc/systemd/system/` - `sudo cp deploy/mes-dashboard-watchdog.service /etc/systemd/system/` -- `sudo mkdir -p /etc/mes-dashboard && sudo cp .env /etc/mes-dashboard/mes-dashboard.env` +- `sudo mkdir -p /etc/mes-dashboard && sudo cp deploy/mes-dashboard.env.example /etc/mes-dashboard/mes-dashboard.env` +- merge deployment secrets from `.env` into `/etc/mes-dashboard/mes-dashboard.env` - `sudo systemctl daemon-reload` - `sudo systemctl enable --now mes-dashboard mes-dashboard-watchdog` - call `/admin/api/worker/status` and verify runtime contract paths exist @@ -69,6 +72,7 @@ A release is cutover-ready only when all gates pass: - call `/health` and `/health/deep` - confirm route cache mode, degraded flags, and pool/runtime diagnostics align with environment (Redis on/off) - trigger one controlled worker restart from admin API and verify single-port continuity +- verify guarded mode flow: blocked restart requires manual override payload (`manual_override`, `override_acknowledged`, `override_reason`) - verify README architecture section matches deployed runtime contract ## Rollback Procedure @@ -111,3 +115,6 @@ Use these initial thresholds for alerting/escalation: 4. Frontend/API retry pressure - significant increase of client retries for `DB_POOL_EXHAUSTED` or `CIRCUIT_BREAKER_OPEN` responses over baseline + +5. 
Recovery policy blocked +- `resilience.policy_state.blocked == true` or `resilience.alerts.restart_blocked == true` diff --git a/frontend/src/core/api.js b/frontend/src/core/api.js index 51a4728..6958eab 100644 --- a/frontend/src/core/api.js +++ b/frontend/src/core/api.js @@ -1,5 +1,21 @@ const DEFAULT_TIMEOUT = 30000; +function getCsrfToken() { + return document.querySelector('meta[name="csrf-token"]')?.content || ''; +} + +function withCsrfHeaders(headers = {}, method = 'GET') { + const normalized = String(method).toUpperCase(); + const merged = { ...headers }; + if (['POST', 'PUT', 'PATCH', 'DELETE'].includes(normalized)) { + const csrf = getCsrfToken(); + if (csrf && !merged['X-CSRF-Token']) { + merged['X-CSRF-Token'] = csrf; + } + } + return merged; +} + function buildApiError(response, payload) { const message = payload?.error?.message || @@ -47,15 +63,19 @@ export async function apiGet(url, options = {}) { export async function apiPost(url, payload, options = {}) { if (window.MesApi?.post) { - return window.MesApi.post(url, payload, options); + const enrichedOptions = { + ...options, + headers: withCsrfHeaders(options.headers || {}, 'POST') + }; + return window.MesApi.post(url, payload, enrichedOptions); } return fetchJson(url, { ...options, method: 'POST', - headers: { + headers: withCsrfHeaders({ 'Content-Type': 'application/json', ...(options.headers || {}) - }, + }, 'POST'), body: JSON.stringify(payload) }); } @@ -64,6 +84,7 @@ export async function apiUpload(url, formData, options = {}) { return fetchJson(url, { ...options, method: 'POST', + headers: withCsrfHeaders(options.headers || {}, 'POST'), body: formData }); } diff --git a/frontend/src/core/wip-derive.js b/frontend/src/core/wip-derive.js new file mode 100644 index 0000000..e181b33 --- /dev/null +++ b/frontend/src/core/wip-derive.js @@ -0,0 +1,75 @@ +function toTrimmedString(value) { + if (value === null || value === undefined) { + return ''; + } + return String(value).trim(); +} + +export 
function normalizeStatusFilter(statusFilter) { + if (!statusFilter) { + return {}; + } + if (statusFilter === 'quality-hold') { + return { status: 'HOLD', hold_type: 'quality' }; + } + if (statusFilter === 'non-quality-hold') { + return { status: 'HOLD', hold_type: 'non-quality' }; + } + return { status: String(statusFilter).toUpperCase() }; +} + +export function buildWipOverviewQueryParams(filters = {}, statusFilter = null) { + const params = {}; + const workorder = toTrimmedString(filters.workorder); + const lotid = toTrimmedString(filters.lotid); + const pkg = toTrimmedString(filters.package); + const type = toTrimmedString(filters.type); + + if (workorder) params.workorder = workorder; + if (lotid) params.lotid = lotid; + if (pkg) params.package = pkg; + if (type) params.type = type; + + return { ...params, ...normalizeStatusFilter(statusFilter) }; +} + +export function buildWipDetailQueryParams({ + page, + pageSize, + filters = {}, + statusFilter = null, +}) { + return { + page, + page_size: pageSize, + ...buildWipOverviewQueryParams(filters, statusFilter), + }; +} + +export function splitHoldByType(data) { + const items = Array.isArray(data?.items) ? 
data.items : []; + const quality = items.filter((item) => item?.holdType === 'quality'); + const nonQuality = items.filter((item) => item?.holdType !== 'quality'); + return { quality, nonQuality }; +} + +export function prepareParetoData(items) { + if (!Array.isArray(items) || items.length === 0) { + return { reasons: [], qtys: [], lots: [], cumulative: [], totalQty: 0, items: [] }; + } + + const sorted = [...items].sort((a, b) => (Number(b?.qty) || 0) - (Number(a?.qty) || 0)); + const reasons = sorted.map((item) => toTrimmedString(item?.reason) || '未知'); + const qtys = sorted.map((item) => Number(item?.qty) || 0); + const lots = sorted.map((item) => Number(item?.lots) || 0); + const totalQty = qtys.reduce((sum, value) => sum + value, 0); + + let running = 0; + const cumulative = qtys.map((qty) => { + running += qty; + if (totalQty <= 0) return 0; + return Math.round((running / totalQty) * 100); + }); + + return { reasons, qtys, lots, cumulative, totalQty, items: sorted }; +} diff --git a/frontend/src/wip-detail/main.js b/frontend/src/wip-detail/main.js index 5d578b1..0b8750b 100644 --- a/frontend/src/wip-detail/main.js +++ b/frontend/src/wip-detail/main.js @@ -3,6 +3,7 @@ import { debounce, fetchWipAutocompleteItems, } from '../core/autocomplete.js'; +import { buildWipDetailQueryParams } from '../core/wip-derive.js'; ensureMesApiAvailable(); @@ -72,37 +73,13 @@ ensureMesApiAvailable(); throw new Error(result.error || 'Failed to fetch packages'); } - async function fetchDetail(signal = null) { - const params = { - page: state.page, - page_size: state.pageSize - }; - - if (state.filters.package) { - params.package = state.filters.package; - } - if (state.filters.type) { - params.type = state.filters.type; - } - if (activeStatusFilter) { - // Handle hold type filters - if (activeStatusFilter === 'quality-hold') { - params.status = 'HOLD'; - params.hold_type = 'quality'; - } else if (activeStatusFilter === 'non-quality-hold') { - params.status = 'HOLD'; - 
params.hold_type = 'non-quality'; - } else { - // Convert to API status format (RUN/QUEUE) - params.status = activeStatusFilter.toUpperCase(); - } - } - if (state.filters.workorder) { - params.workorder = state.filters.workorder; - } - if (state.filters.lotid) { - params.lotid = state.filters.lotid; - } + async function fetchDetail(signal = null) { + const params = buildWipDetailQueryParams({ + page: state.page, + pageSize: state.pageSize, + filters: state.filters, + statusFilter: activeStatusFilter, + }); const result = await MesApi.get(`/api/wip/detail/${encodeURIComponent(state.workcenter)}`, { params, diff --git a/frontend/src/wip-overview/main.js b/frontend/src/wip-overview/main.js index 20904a7..11a0533 100644 --- a/frontend/src/wip-overview/main.js +++ b/frontend/src/wip-overview/main.js @@ -3,6 +3,11 @@ import { debounce, fetchWipAutocompleteItems, } from '../core/autocomplete.js'; +import { + buildWipOverviewQueryParams, + splitHoldByType as splitHoldByTypeShared, + prepareParetoData as prepareParetoDataShared, +} from '../core/wip-derive.js'; ensureMesApiAvailable(); @@ -61,21 +66,8 @@ ensureMesApiAvailable(); } function buildQueryParams() { - const params = {}; - if (state.filters.workorder) { - params.workorder = state.filters.workorder; - } - if (state.filters.lotid) { - params.lotid = state.filters.lotid; - } - if (state.filters.package) { - params.package = state.filters.package; - } - if (state.filters.type) { - params.type = state.filters.type; - } - return params; - } + return buildWipOverviewQueryParams(state.filters); + } // ============================================================ // API Functions (using MesApi) @@ -95,23 +87,11 @@ ensureMesApiAvailable(); throw new Error(result.error || 'Failed to fetch summary'); } - async function fetchMatrix(signal = null) { - const params = buildQueryParams(); - // Add status filter if active - if (activeStatusFilter) { - if (activeStatusFilter === 'quality-hold') { - params.status = 'HOLD'; - 
params.hold_type = 'quality'; - } else if (activeStatusFilter === 'non-quality-hold') { - params.status = 'HOLD'; - params.hold_type = 'non-quality'; - } else { - params.status = activeStatusFilter.toUpperCase(); - } - } - const result = await MesApi.get('/api/wip/overview/matrix', { - params, - timeout: API_TIMEOUT, + async function fetchMatrix(signal = null) { + const params = buildWipOverviewQueryParams(state.filters, activeStatusFilter); + const result = await MesApi.get('/api/wip/overview/matrix', { + params, + timeout: API_TIMEOUT, signal }); if (result.success) { @@ -465,40 +445,15 @@ ensureMesApiAvailable(); nonQuality: null }; - // Task 2.1: Split hold data by type - function splitHoldByType(data) { - if (!data || !data.items) { - return { quality: [], nonQuality: [] }; - } - const quality = data.items.filter(item => item.holdType === 'quality'); - const nonQuality = data.items.filter(item => item.holdType !== 'quality'); - return { quality, nonQuality }; - } + // Task 2.1: Split hold data by type + function splitHoldByType(data) { + return splitHoldByTypeShared(data); + } - // Task 2.2: Prepare Pareto data (sort by QTY desc, calculate cumulative %) - function prepareParetoData(items) { - if (!items || items.length === 0) { - return { reasons: [], qtys: [], lots: [], cumulative: [], totalQty: 0 }; - } - - // Sort by QTY descending - const sorted = [...items].sort((a, b) => (b.qty || 0) - (a.qty || 0)); - - const reasons = sorted.map(item => item.reason || '未知'); - const qtys = sorted.map(item => item.qty || 0); - const lots = sorted.map(item => item.lots || 0); - const totalQty = qtys.reduce((sum, q) => sum + q, 0); - - // Calculate cumulative percentage - const cumulative = []; - let runningSum = 0; - qtys.forEach(qty => { - runningSum += qty; - cumulative.push(totalQty > 0 ? 
Math.round((runningSum / totalQty) * 100) : 0); - }); - - return { reasons, qtys, lots, cumulative, totalQty, items: sorted }; - } + // Task 2.2: Prepare Pareto data (sort by QTY desc, calculate cumulative %) + function prepareParetoData(items) { + return prepareParetoDataShared(items); + } // Task 3.1: Initialize Pareto charts function initParetoCharts() { diff --git a/frontend/tests/wip-derive.test.js b/frontend/tests/wip-derive.test.js new file mode 100644 index 0000000..5d66708 --- /dev/null +++ b/frontend/tests/wip-derive.test.js @@ -0,0 +1,80 @@ +import test from 'node:test'; +import assert from 'node:assert/strict'; + +import { + buildWipOverviewQueryParams, + buildWipDetailQueryParams, + splitHoldByType, + prepareParetoData, +} from '../src/core/wip-derive.js'; + +test('buildWipOverviewQueryParams keeps only non-empty filters', () => { + const params = buildWipOverviewQueryParams({ + workorder: ' WO-1 ', + lotid: '', + package: 'PKG-A', + type: 'QFN', + }); + + assert.deepEqual(params, { + workorder: 'WO-1', + package: 'PKG-A', + type: 'QFN', + }); +}); + +test('buildWipOverviewQueryParams maps quality hold status filter', () => { + const params = buildWipOverviewQueryParams({}, 'quality-hold'); + assert.deepEqual(params, { + status: 'HOLD', + hold_type: 'quality', + }); +}); + +test('buildWipDetailQueryParams uses page/page_size and shared filter mapper', () => { + const params = buildWipDetailQueryParams({ + page: 2, + pageSize: 100, + filters: { + workorder: 'WO', + lotid: 'LOT', + package: '', + type: 'TSOP', + }, + statusFilter: 'run', + }); + + assert.deepEqual(params, { + page: 2, + page_size: 100, + workorder: 'WO', + lotid: 'LOT', + type: 'TSOP', + status: 'RUN', + }); +}); + +test('splitHoldByType partitions quality/non-quality correctly', () => { + const grouped = splitHoldByType({ + items: [ + { reason: 'Q1', holdType: 'quality' }, + { reason: 'NQ1', holdType: 'non-quality' }, + { reason: 'NQ2' }, + ], + }); + + 
+  assert.equal(grouped.quality.length, 1);
+  assert.equal(grouped.nonQuality.length, 2);
+});
+
+test('prepareParetoData sorts by qty and builds cumulative percentages', () => {
+  const data = prepareParetoData([
+    { reason: 'B', qty: 20, lots: 1 },
+    { reason: 'A', qty: 80, lots: 2 },
+  ]);
+
+  assert.deepEqual(data.reasons, ['A', 'B']);
+  assert.deepEqual(data.qtys, [80, 20]);
+  assert.deepEqual(data.cumulative, [80, 100]);
+  assert.equal(data.totalQty, 100);
+});
diff --git a/frontend/vite.config.js b/frontend/vite.config.js
index 44c29c8..888ad5d 100644
--- a/frontend/vite.config.js
+++ b/frontend/vite.config.js
@@ -1,12 +1,12 @@
 import { defineConfig } from 'vite';
 import { resolve } from 'node:path';

-export default defineConfig({
+export default defineConfig(({ mode }) => ({
   publicDir: false,
   build: {
     outDir: '../src/mes_dashboard/static/dist',
     emptyOutDir: false,
-    sourcemap: false,
+    sourcemap: mode !== 'production',
     rollupOptions: {
       input: {
         portal: resolve(__dirname, 'src/portal/main.js'),
@@ -22,8 +22,17 @@ export default defineConfig({
       output: {
         entryFileNames: '[name].js',
         chunkFileNames: 'chunks/[name]-[hash].js',
-        assetFileNames: '[name][extname]'
+        assetFileNames: '[name][extname]',
+        manualChunks(id) {
+          if (!id.includes('node_modules')) {
+            return;
+          }
+          if (id.includes('echarts')) {
+            return 'vendor-echarts';
+          }
+          return 'vendor';
+        }
       }
     }
   }
-});
+}));
diff --git a/gunicorn.conf.py b/gunicorn.conf.py
index bcd7e9f..c39c509 100644
--- a/gunicorn.conf.py
+++ b/gunicorn.conf.py
@@ -30,6 +30,18 @@ def worker_exit(server, worker):
     except Exception as e:
         server.log.warning(f"Error stopping equipment sync worker: {e}")

+    try:
+        from mes_dashboard.core.cache_updater import stop_cache_updater
+        stop_cache_updater()
+    except Exception as e:
+        server.log.warning(f"Error stopping cache updater: {e}")
+
+    try:
+        from mes_dashboard.core.redis_client import close_redis
+        close_redis()
+    except Exception as e:
+        server.log.warning(f"Error closing redis client: {e}")
+
     # Then dispose database connections
     try:
         from mes_dashboard.core.database import dispose_engine
diff --git a/openspec/changes/archive/2026-02-08-p0-runtime-stability-hardening/.openspec.yaml b/openspec/changes/archive/2026-02-08-p0-runtime-stability-hardening/.openspec.yaml
new file mode 100644
index 0000000..565fad5
--- /dev/null
+++ b/openspec/changes/archive/2026-02-08-p0-runtime-stability-hardening/.openspec.yaml
@@ -0,0 +1,2 @@
+schema: spec-driven
+created: 2026-02-08
diff --git a/openspec/changes/archive/2026-02-08-p0-runtime-stability-hardening/design.md b/openspec/changes/archive/2026-02-08-p0-runtime-stability-hardening/design.md
new file mode 100644
index 0000000..8917e6a
--- /dev/null
+++ b/openspec/changes/archive/2026-02-08-p0-runtime-stability-hardening/design.md
@@ -0,0 +1,46 @@
+## Context
+
+The current architecture already supports single-port Gunicorn runtime, circuit-breaker-aware degraded responses, and watchdog-assisted recovery. However, critical security and lifecycle controls are uneven: production startup can still fall back to a weak secret key, CSRF is not enforced globally, and background resources are not fully registered in a single shutdown lifecycle. These gaps are operationally risky when pool pressure or restart churn occurs.
+
+## Goals / Non-Goals
+
+**Goals:**
+- Make production startup fail fast when required security secrets are missing.
+- Enforce CSRF validation for all state-changing endpoints without breaking the existing frontend flow.
+- Make worker/app shutdown deterministic by stopping all background workers and shared clients.
+- Keep degraded responses for pool exhaustion and circuit-open states stable and retry-aware.
+- Isolate health probe connectivity from main request pool contention.
+
+**Non-Goals:**
+- Replacing the LDAP provider or redesigning the full authentication architecture.
+- Full CSP rollout across all templates in this change.
+- Changing URL structure, page IA, or single-port deployment topology.
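The fail-fast startup goal above can be sketched as a small secret-key guard. This is an illustrative sketch only: the function name `validate_secret_key`, the weak-default list, and the 32-character minimum are assumptions, not the project's actual implementation.

```python
import os

# Hypothetical list of known-insecure defaults; the real project's list may differ.
_WEAK_DEFAULTS = {"", "dev", "secret", "changeme", "dev-secret-key"}


def validate_secret_key(env=None, secret=None):
    """Return a usable SECRET_KEY or raise at startup (fail-fast).

    In development a throwaway key is tolerated; in any other mode a
    missing, short, or known-default key aborts startup.
    """
    env = env if env is not None else os.environ.get("FLASK_ENV", "production")
    secret = secret if secret is not None else os.environ.get("SECRET_KEY", "")

    if env == "development":
        # Development may run with a clearly-labeled throwaway key.
        return secret or "dev-only-insecure-key"

    if secret in _WEAK_DEFAULTS or len(secret) < 32:
        raise RuntimeError(
            "SECRET_KEY missing or too weak for non-development startup; "
            "set a random value of at least 32 characters."
        )
    return secret
```

Called once during app factory setup, this turns a silent insecure deployment into an immediate, diagnosable startup failure.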
+
+## Decisions
+
+1. **Production secret-key guard at startup**
+   - Decision: enforce `SECRET_KEY` presence/strength in non-development modes and abort startup when invalid.
+   - Rationale: prevents silent insecure deployment.
+
+2. **Unified CSRF contract across form + JSON flows**
+   - Decision: issue CSRF token from server session, validate hidden form field for HTML forms and `X-CSRF-Token` for JSON POST/PUT/PATCH/DELETE.
+   - Rationale: maintains current frontend behavior while covering non-form APIs.
+
+3. **Centralized shutdown registry**
+   - Decision: register explicit shutdown hooks that call cache updater stop, realtime sync stop, Redis close, and DB dispose in bounded order.
+   - Rationale: avoids thread/client leaks during worker recycle and controlled reload.
+
+4. **Health probe pool isolation**
+   - Decision: use a dedicated lightweight DB health engine/pool for `/health` checks.
+   - Rationale: prevents the health endpoint from being blocked by request-pool exhaustion, improving observability fidelity.
+
+5. **Template-safe JS serialization**
+   - Decision: replace HTML-escaped interpolation in JS string contexts with `tojson` serialization.
+   - Rationale: avoids context-mismatch injection edge cases.
+
+## Risks / Trade-offs
+
+- **[Risk] CSRF rollout may break undocumented clients** → **Mitigation:** provide an opt-in transition flag and explicit error messaging during rollout.
+- **[Risk] Strict startup secret validation can block misconfigured environments** → **Mitigation:** provide clear startup diagnostics and `.env.example` updates.
+- **[Risk] Additional shutdown hooks can prolong worker exit** → **Mitigation:** bounded timeouts and idempotent stop handlers.
+- **[Risk] Dedicated health pool introduces extra DB connections** → **Mitigation:** fixed minimal size and short timeout.
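Decision 2's dual form/header CSRF contract can be sketched framework-neutrally as below. The function names, the `csrf_token` field name, and the dict-based session model are illustrative assumptions; the real code presumably wires an equivalent check into Flask's `before_request`.

```python
import hmac
import secrets

MUTATING_METHODS = {"POST", "PUT", "PATCH", "DELETE"}


def issue_csrf_token(session):
    """Store a per-session CSRF token (idempotent) and return it."""
    if "_csrf_token" not in session:
        session["_csrf_token"] = secrets.token_urlsafe(32)
    return session["_csrf_token"]


def csrf_ok(session, method, form=None, headers=None):
    """Accept the hidden form field (HTML forms) or X-CSRF-Token (JSON APIs)."""
    if method.upper() not in MUTATING_METHODS:
        return True  # safe methods are never CSRF-checked
    expected = session.get("_csrf_token")
    if not expected:
        return False
    supplied = (form or {}).get("csrf_token") or (headers or {}).get("X-CSRF-Token")
    # compare_digest avoids leaking token prefixes through timing differences.
    return bool(supplied) and hmac.compare_digest(expected, supplied)
```

Because both transports validate against the same session token, existing form pages and JSON mutation endpoints share one contract, matching the "without breaking existing frontend flow" goal.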
diff --git a/openspec/changes/archive/2026-02-08-p0-runtime-stability-hardening/proposal.md b/openspec/changes/archive/2026-02-08-p0-runtime-stability-hardening/proposal.md
new file mode 100644
index 0000000..9aa93b4
--- /dev/null
+++ b/openspec/changes/archive/2026-02-08-p0-runtime-stability-hardening/proposal.md
@@ -0,0 +1,40 @@
+## Why
+
+The Vite migration is functionally complete, but production runtime still has high-risk gaps in its security baseline and worker lifecycle cleanup. Addressing these now prevents avoidable outages, authentication bypass risk, and unstable degradation behavior under pool pressure.
+
+## What Changes
+
+- Enforce production-safe startup security defaults (no weak SECRET_KEY fallback in non-development environments).
+- Add first-class CSRF protection for admin forms and state-changing JSON APIs.
+- Harden degradation behavior for pool exhaustion with a consistent retry/backoff contract and isolated health probing.
+- Ensure background workers and shared clients (cache updater, realtime sync, Redis) are explicitly stopped on worker/app shutdown.
+- Fix template-to-JavaScript variable serialization in the hold-detail fallback script.
+
+## Capabilities
+
+### New Capabilities
+- `security-baseline-hardening`: Define mandatory secret/session/CSRF/XSS-safe baseline for production runtime.
+
+### Modified Capabilities
+- `runtime-resilience-recovery`: Strengthen shutdown lifecycle and degraded-response behavior for pool pressure scenarios.
+
+## Impact
+
+- Affected code:
+  - `src/mes_dashboard/app.py`
+  - `src/mes_dashboard/core/database.py`
+  - `src/mes_dashboard/core/cache_updater.py`
+  - `src/mes_dashboard/core/redis_client.py`
+  - `src/mes_dashboard/routes/health_routes.py`
+  - `src/mes_dashboard/routes/auth_routes.py`
+  - `src/mes_dashboard/templates/hold_detail.html`
+  - `gunicorn.conf.py`
+  - `tests/`
+- APIs:
+  - `/health`
+  - `/health/deep`
+  - `/admin/login`
+  - state-changing `/api/*` endpoints
+- Operational behavior:
+  - Keep single-port deployment model unchanged.
+  - Improve degraded-state stability and startup safety gates.
diff --git a/openspec/changes/archive/2026-02-08-p0-runtime-stability-hardening/specs/runtime-resilience-recovery/spec.md b/openspec/changes/archive/2026-02-08-p0-runtime-stability-hardening/specs/runtime-resilience-recovery/spec.md
new file mode 100644
index 0000000..d83bfa0
--- /dev/null
+++ b/openspec/changes/archive/2026-02-08-p0-runtime-stability-hardening/specs/runtime-resilience-recovery/spec.md
@@ -0,0 +1,24 @@
+## MODIFIED Requirements
+
+### Requirement: Pool Exhaustion MUST Return Retry-Aware Degraded Responses
+The system MUST return explicit degraded responses for connection pool exhaustion, including stable machine-readable retry metadata and HTTP retry hints.
+
+#### Scenario: Pool exhausted under load
+- **WHEN** concurrent requests exceed available database connections and the pool wait timeout is reached
+- **THEN** the API MUST return `DB_POOL_EXHAUSTED` with `retry_after_seconds` metadata and a `Retry-After` header instead of a generic 500 failure
+
+## ADDED Requirements
+
+### Requirement: Runtime Shutdown SHALL Cleanly Stop Background Services
+Worker/app shutdown MUST stop long-lived background services and shared clients in deterministic order.
+
+#### Scenario: Worker exits during recycle or graceful reload
+- **WHEN** Gunicorn worker shutdown hooks are triggered
+- **THEN** cache updater, realtime equipment sync worker, Redis client, and DB engine resources MUST be stopped/disposed without orphan threads
+
+### Requirement: Health Probing SHALL Remain Available During Request-Pool Saturation
+Health checks MUST avoid depending solely on the same request pool used by business APIs.
+
+#### Scenario: Request pool saturation
+- **WHEN** the main database request pool is exhausted
+- **THEN** `/health` and `/health/deep` MUST still provide timely degraded status using isolated probe connectivity
diff --git a/openspec/changes/archive/2026-02-08-p0-runtime-stability-hardening/specs/security-baseline-hardening/spec.md b/openspec/changes/archive/2026-02-08-p0-runtime-stability-hardening/specs/security-baseline-hardening/spec.md
new file mode 100644
index 0000000..cf7c514
--- /dev/null
+++ b/openspec/changes/archive/2026-02-08-p0-runtime-stability-hardening/specs/security-baseline-hardening/spec.md
@@ -0,0 +1,29 @@
+## ADDED Requirements
+
+### Requirement: Production Startup SHALL Reject Weak Session Secrets
+The system MUST reject startup in non-development environments when `SECRET_KEY` is missing or configured with known insecure default values.
+
+#### Scenario: Missing production secret key
+- **WHEN** runtime starts with `FLASK_ENV` not equal to `development` and no secure secret key is configured
+- **THEN** application startup MUST fail fast with an explicit configuration error
+
+### Requirement: State-Changing Endpoints SHALL Enforce CSRF Validation
+All state-changing endpoints that rely on cookie-based authentication MUST enforce CSRF token validation.
+
+#### Scenario: Missing or invalid CSRF token
+- **WHEN** a POST/PUT/PATCH/DELETE request is sent without a valid CSRF token
+- **THEN** the server MUST reject the request with a client error and MUST NOT execute the mutation
+
+### Requirement: Server-Rendered Values in JavaScript Context MUST Use Safe Serialization
+Values inserted into inline JavaScript from templates MUST be serialized for JavaScript context safety.
+
+#### Scenario: Hold reason rendered in fallback inline script
+- **WHEN** server-side string values are embedded into script state payloads
+- **THEN** template rendering MUST use JSON-safe serialization semantics to prevent script-context injection
+
+### Requirement: Session Establishment SHALL Mitigate Fixation Risk
+Successful admin login MUST rotate session identity material before granting authenticated privileges.
+
+#### Scenario: Admin login success
+- **WHEN** credentials are validated and an admin session is created
+- **THEN** session identity MUST be regenerated before storing authenticated user attributes
diff --git a/openspec/changes/archive/2026-02-08-p0-runtime-stability-hardening/tasks.md b/openspec/changes/archive/2026-02-08-p0-runtime-stability-hardening/tasks.md
new file mode 100644
index 0000000..300ace1
--- /dev/null
+++ b/openspec/changes/archive/2026-02-08-p0-runtime-stability-hardening/tasks.md
@@ -0,0 +1,18 @@
+## 1. Runtime Stability Hardening
+
+- [x] 1.1 Add startup validation for `SECRET_KEY` and environment-aware secure defaults.
+- [x] 1.2 Register centralized shutdown hooks to stop cache updater, realtime sync worker, Redis client, and DB engine.
+- [x] 1.3 Isolate database health probing from the request pool and keep the degraded signal contract stable.
+- [x] 1.4 Normalize pool-exhausted response metadata and retry headers across API error paths.
+
+## 2. Security Baseline Enforcement
+
+- [x] 2.1 Add CSRF token issuance/validation for form posts and JSON mutation endpoints.
+- [x] 2.2 Update login flow to rotate session identity on successful authentication.
+- [x] 2.3 Replace JS-context template interpolation in `hold_detail.html` with JSON-safe serialization.
+
+## 3. Verification and Documentation
+
+- [x] 3.1 Add tests for startup secret guard, CSRF rejection, and session-rotation behavior.
+- [x] 3.2 Add lifecycle tests/validation for shutdown cleanup and health endpoint behavior under pool saturation.
+- [x] 3.3 Update README/README.mdj runtime hardening sections and operator rollout notes.
diff --git a/openspec/changes/archive/2026-02-08-p1-cache-query-efficiency/.openspec.yaml b/openspec/changes/archive/2026-02-08-p1-cache-query-efficiency/.openspec.yaml
new file mode 100644
index 0000000..565fad5
--- /dev/null
+++ b/openspec/changes/archive/2026-02-08-p1-cache-query-efficiency/.openspec.yaml
@@ -0,0 +1,2 @@
+schema: spec-driven
+created: 2026-02-08
diff --git a/openspec/changes/archive/2026-02-08-p1-cache-query-efficiency/design.md b/openspec/changes/archive/2026-02-08-p1-cache-query-efficiency/design.md
new file mode 100644
index 0000000..b191684
--- /dev/null
+++ b/openspec/changes/archive/2026-02-08-p1-cache-query-efficiency/design.md
@@ -0,0 +1,46 @@
+## Context
+
+The migration delivered feature parity, but efficiency work remains: backend query paths still do broad copies and whole-frame recomputation even when only slices are needed. At the same time, business constraints explicitly require full-table caching for `resource` and `wip` because those datasets are intentionally small and frequently reused. This design optimizes around that constraint rather than removing it.
+
+## Goals / Non-Goals
+
+**Goals:**
+- Keep `resource` and `wip` full-table caches intact.
+- Reduce memory amplification from redundant cache representations.
+- Replace repeated full merge/rebuild paths with incremental/indexed query plans where applicable.
+- Increase reuse of browser-side compute modules for chart/table/filter/KPI derivations.
+- Add measurable telemetry to verify latency and memory improvements.
+
+**Non-Goals:**
+- Rewriting all reporting endpoints to client-only mode.
+- Removing Redis or the existing layered cache strategy.
+- Changing user-visible filter semantics or report outputs.
+
+## Decisions
+
+1. **Constrained cache strategy**
+   - Decision: retain full-table snapshots for `resource` and `wip`; optimize surrounding representations and derivation paths.
+   - Rationale: business-approved data-size profile and low complexity for frequent lookups.
+
+2. **Incremental + indexed path for heavy derived datasets**
+   - Decision: add watermark/version-aware incremental refresh and per-column indexes for high-cardinality filters.
+   - Rationale: avoids repeated full recompute and lowers request tail latency.
+
+3. **Canonical in-process structure**
+   - Decision: keep one canonical structure per cache domain and derive alternate views on demand.
+   - Rationale: reduces 2x/3x memory amplification from parallel representations.
+
+4. **Frontend compute module expansion**
+   - Decision: extract reusable browser compute helpers for matrix/table/KPI transformations used across report pages.
+   - Rationale: shifts deterministic shaping work off the backend and improves component reuse in the Vite architecture.
+
+5. **Benchmark-driven acceptance**
+   - Decision: add repeatable benchmark fixtures and telemetry thresholds as merge gates.
+   - Rationale: prevent subjective "performance improved" claims without measurable proof.
+
+## Risks / Trade-offs
+
+- **[Risk] Incremental sync correctness drift** → **Mitigation:** version checksum validation and periodic full reconciliation jobs.
+- **[Risk] Browser compute can increase client CPU on low-end devices** → **Mitigation:** bounded dataset chunking and a fallback server aggregation path.
+- **[Risk] Refactor introduces subtle field-contract regressions** → **Mitigation:** keep export/header contract tests and fixture comparisons.
+- **[Risk] Telemetry overhead** → **Mitigation:** low-cost counters/histograms with sampling where needed.
diff --git a/openspec/changes/archive/2026-02-08-p1-cache-query-efficiency/proposal.md b/openspec/changes/archive/2026-02-08-p1-cache-query-efficiency/proposal.md
new file mode 100644
index 0000000..94bb281
--- /dev/null
+++ b/openspec/changes/archive/2026-02-08-p1-cache-query-efficiency/proposal.md
@@ -0,0 +1,36 @@
+## Why
+
+Current reporting workloads still spend unnecessary CPU and memory on repeated full-data merges, broad DataFrame copies, and duplicated cache representations. We need a focused efficiency phase that preserves the intentional full-table cache strategy for `resource` and `wip`, while reducing cost for other query paths and increasing frontend compute reuse.
+
+## What Changes
+
+- Introduce indexed/incremental cache synchronization for heavy report datasets that do not require full-table snapshots.
+- Keep `resource` and `wip` as full-table caches by design, but reduce redundant in-process representations and copy overhead.
+- Move additional derived calculations (chart/table/KPI/filter shaping) to reusable browser modules in the Vite frontend.
+- Add cache/query efficiency telemetry and repeatable benchmark gates to validate gains.
+
+## Capabilities
+
+### New Capabilities
+- `cache-indexed-query-acceleration`: Define incremental refresh and indexed query contracts for non-full-snapshot datasets.
+
+### Modified Capabilities
+- `cache-observability-hardening`: Add memory-efficiency and cache-structure telemetry expectations.
+- `frontend-compute-shift`: Expand browser-side reusable compute coverage for report interactions.
+
+## Impact
+
+- Affected code:
+  - `src/mes_dashboard/core/cache.py`
+  - `src/mes_dashboard/services/resource_cache.py`
+  - `src/mes_dashboard/services/realtime_equipment_cache.py`
+  - `src/mes_dashboard/services/wip_service.py`
+  - `src/mes_dashboard/routes/health_routes.py`
+  - `frontend/src/core/`
+  - `frontend/src/**/main.js`
+  - `tests/`
+- APIs:
+  - read-heavy `/api/wip/*` and `/api/resource/*` endpoints (response contract unchanged)
+- Operational behavior:
+  - Preserve the current `resource` and `wip` full-table caching strategy.
+  - Reduce server-side compute load through selective frontend compute offload.
diff --git a/openspec/changes/archive/2026-02-08-p1-cache-query-efficiency/specs/cache-indexed-query-acceleration/spec.md b/openspec/changes/archive/2026-02-08-p1-cache-query-efficiency/specs/cache-indexed-query-acceleration/spec.md
new file mode 100644
index 0000000..418f722
--- /dev/null
+++ b/openspec/changes/archive/2026-02-08-p1-cache-query-efficiency/specs/cache-indexed-query-acceleration/spec.md
@@ -0,0 +1,22 @@
+## ADDED Requirements
+
+### Requirement: Incremental Synchronization SHALL Use Versioned Watermarks
+For heavy non-full-snapshot datasets, cache refresh SHALL support incremental synchronization keyed by stable version or watermark boundaries.
+
+#### Scenario: Incremental refresh cycle
+- **WHEN** the source data version indicates partial changes since the last sync
+- **THEN** cache update logic MUST fetch and merge only changed partitions while preserving correctness guarantees
+
+### Requirement: Query Paths SHALL Use Indexed Access for High-Frequency Filters
+Query execution over cached data SHALL use prebuilt indexes for known high-frequency filter columns.
+
+#### Scenario: Filtered report query
+- **WHEN** request filters target indexed fields
+- **THEN** result selection MUST avoid full dataset scans and maintain the existing response contract
+
+### Requirement: Business-Mandated Full-Table Caches SHALL Be Preserved for Resource and WIP
+The system SHALL continue to maintain full-table cache behavior for the `resource` and `wip` domains.
+
+#### Scenario: Resource or WIP cache refresh
+- **WHEN** cache update runs for `resource` or `wip`
+- **THEN** the updater MUST retain full-table snapshot semantics and MUST NOT switch these domains to partial-only cache mode
diff --git a/openspec/changes/archive/2026-02-08-p1-cache-query-efficiency/specs/cache-observability-hardening/spec.md b/openspec/changes/archive/2026-02-08-p1-cache-query-efficiency/specs/cache-observability-hardening/spec.md
new file mode 100644
index 0000000..acf4de9
--- /dev/null
+++ b/openspec/changes/archive/2026-02-08-p1-cache-query-efficiency/specs/cache-observability-hardening/spec.md
@@ -0,0 +1,15 @@
+## ADDED Requirements
+
+### Requirement: Cache Telemetry SHALL Include Memory Amplification Signals
+Operational telemetry MUST expose cache domain memory usage indicators and representation amplification factors.
+
+#### Scenario: Deep health telemetry request
+- **WHEN** operators inspect cache telemetry
+- **THEN** telemetry MUST include per-domain memory footprint and amplification indicators sufficient to detect redundant structures
+
+### Requirement: Efficiency Benchmarks SHALL Gate Cache Refactor Rollout
+Cache/query efficiency changes MUST be validated against baseline latency and memory benchmarks before rollout.
+
+#### Scenario: Pre-release validation
+- **WHEN** cache refactor changes are prepared for deployment
+- **THEN** benchmark results MUST demonstrate no regression beyond configured thresholds for P95 latency and memory usage
diff --git a/openspec/changes/archive/2026-02-08-p1-cache-query-efficiency/specs/frontend-compute-shift/spec.md b/openspec/changes/archive/2026-02-08-p1-cache-query-efficiency/specs/frontend-compute-shift/spec.md
new file mode 100644
index 0000000..4434c99
--- /dev/null
+++ b/openspec/changes/archive/2026-02-08-p1-cache-query-efficiency/specs/frontend-compute-shift/spec.md
@@ -0,0 +1,15 @@
+## ADDED Requirements
+
+### Requirement: Reusable Browser Compute Modules SHALL Power Report Derivations
+Derived computations for report filters, KPI cards, chart series, and table projections SHALL be implemented through reusable frontend modules.
+
+#### Scenario: Shared report derivation logic
+- **WHEN** multiple report pages require equivalent data-shaping behavior
+- **THEN** pages MUST consume shared compute modules instead of duplicating transformation logic per page
+
+### Requirement: Browser Compute Shift SHALL Preserve Export and Field Contracts
+Moving computations to the frontend MUST preserve existing field naming and export column contracts.
+
+#### Scenario: User exports report after frontend-side derivation
+- **WHEN** transformed data is rendered and exported
+- **THEN** exported field names and ordering MUST remain consistent with governed field contract definitions
diff --git a/openspec/changes/archive/2026-02-08-p1-cache-query-efficiency/tasks.md b/openspec/changes/archive/2026-02-08-p1-cache-query-efficiency/tasks.md
new file mode 100644
index 0000000..1d225c7
--- /dev/null
+++ b/openspec/changes/archive/2026-02-08-p1-cache-query-efficiency/tasks.md
@@ -0,0 +1,23 @@
+## 1. Cache Structure and Sync Refactor
+
+- [x] 1.1 Define canonical per-domain cache representation and remove redundant parallel structures.
+- [x] 1.2 Implement version/watermark-based incremental sync path for eligible non-full-snapshot datasets.
+- [x] 1.3 Keep `resource` and `wip` full-table cache behavior while optimizing surrounding parse/index pipelines.
+
+## 2. Indexed Query Acceleration
+
+- [x] 2.1 Add index builders for high-frequency filter columns used by report endpoints.
+- [x] 2.2 Refactor read paths to use indexed selection and reduce broad DataFrame copy operations.
+- [x] 2.3 Add fallback and reconciliation logic to guarantee correctness under incremental/index drift.
+
+## 3. Frontend Compute Reuse Expansion
+
+- [x] 3.1 Extract shared Vite compute modules for KPI/filter/chart/table derivations.
+- [x] 3.2 Refactor report pages to consume shared modules without changing user-visible behavior.
+- [x] 3.3 Validate export/header field contract consistency after compute shift.
+
+## 4. Performance Validation and Docs
+
+- [x] 4.1 Add benchmark fixtures for baseline vs refactor latency/memory comparison.
+- [x] 4.2 Surface cache memory amplification and index efficiency telemetry in health/admin outputs.
+- [x] 4.3 Update README/README.mdj with cache strategy constraints and performance governance rules.
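The watermark-based incremental sync and drift fallback called out in tasks 1.2 and 2.3 can be sketched as below. The snapshot shape and function names are hypothetical; they only illustrate the contract (merge only changed rows, never mutate the published snapshot, fall back to a full rebuild on version drift).

```python
def incremental_merge(snapshot, changed_rows, new_watermark, key="id"):
    """Merge only rows changed since the last watermark into a new snapshot.

    snapshot shape (illustrative): {"watermark": int | None, "rows": {key: row}}.
    Returns a *new* snapshot so a failed merge never corrupts the old one,
    mirroring the staged-publish idea used elsewhere in the plan.
    """
    rows = dict(snapshot["rows"])  # copy-on-write: old snapshot stays readable
    for row in changed_rows:
        rows[row[key]] = row
    return {"watermark": new_watermark, "rows": rows}


def needs_full_rebuild(snapshot, source_version):
    """Drift fallback: if the source version regressed past our watermark
    (e.g. after a restore) or we have no watermark yet, do a full reload."""
    return snapshot["watermark"] is None or source_version < snapshot["watermark"]
```

In use, a refresh cycle would check `needs_full_rebuild` first and only then apply `incremental_merge` with the changed partition, keeping the `resource`/`wip` full-snapshot domains on their separate full-reload path.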
diff --git a/openspec/changes/archive/2026-02-08-p2-ops-self-healing-runbook/.openspec.yaml b/openspec/changes/archive/2026-02-08-p2-ops-self-healing-runbook/.openspec.yaml
new file mode 100644
index 0000000..565fad5
--- /dev/null
+++ b/openspec/changes/archive/2026-02-08-p2-ops-self-healing-runbook/.openspec.yaml
@@ -0,0 +1,2 @@
+schema: spec-driven
+created: 2026-02-08
diff --git a/openspec/changes/archive/2026-02-08-p2-ops-self-healing-runbook/design.md b/openspec/changes/archive/2026-02-08-p2-ops-self-healing-runbook/design.md
new file mode 100644
index 0000000..1a17d69
--- /dev/null
+++ b/openspec/changes/archive/2026-02-08-p2-ops-self-healing-runbook/design.md
@@ -0,0 +1,45 @@
+## Context
+
+The project already provides watchdog-assisted restart and resilience diagnostics, but policy boundaries for automated recovery are not yet formalized end-to-end. In practice, this can lead to either under-recovery (manual delays) or over-recovery (restart churn). We also need stronger conda/systemd path consistency checks to prevent runtime drift across deploy scripts and services.
+
+## Goals / Non-Goals
+
+**Goals:**
+- Make conda/systemd runtime path contracts explicit, validated, and drift-detectable.
+- Implement a safe self-healing policy with cooldown and churn limits.
+- Expose clear alert signals and recommended actions in health/admin payloads.
+- Keep operator manual override available for incident control.
+
+**Non-Goals:**
+- Migrating from systemd to another orchestrator.
+- Changing database vendor or introducing full autoscaling infrastructure.
+- Removing existing admin restart endpoints.
+
+## Decisions
+
+1. **Single-source runtime contract**
+   - Decision: centralize conda runtime path configuration consumed by systemd units, watchdog, and scripts.
+   - Rationale: prevents mismatched interpreter/path drift.
+
+2. **Guarded self-healing state machine**
+   - Decision: implement a bounded restart policy (cooldown + max retries per time window + circuit-open gating).
+   - Rationale: recovers quickly while preventing restart storms.
+
+3. **Explicit recovery observability contract**
+   - Decision: enrich health/admin payloads with churn counters, cooldown state, and recommended operator action.
+   - Rationale: enables deterministic triage and alert automation.
+
+4. **Auditability requirement**
+   - Decision: emit structured logs/events for auto-restart decisions, manual overrides, and blocked restart attempts.
+   - Rationale: supports incident retrospectives and policy tuning.
+
+5. **Runbook-first rollout**
+   - Decision: deploy policy changes behind documentation and validation gates, including rollback steps.
+   - Rationale: operational safety for production adoption.
+
+## Risks / Trade-offs
+
+- **[Risk] Overly strict policy delays recovery** → **Mitigation:** configurable thresholds and emergency manual override.
+- **[Risk] Aggressive policy causes churn loops** → **Mitigation:** hard stop on churn threshold breach and explicit cool-off windows.
+- **[Risk] Added operational complexity** → **Mitigation:** concise runbook with decision tables and tested scripts.
+- **[Risk] Drift detection false positives** → **Mitigation:** normalize path resolution and clearly define comparison sources.
diff --git a/openspec/changes/archive/2026-02-08-p2-ops-self-healing-runbook/proposal.md b/openspec/changes/archive/2026-02-08-p2-ops-self-healing-runbook/proposal.md
new file mode 100644
index 0000000..7a83bc0
--- /dev/null
+++ b/openspec/changes/archive/2026-02-08-p2-ops-self-healing-runbook/proposal.md
@@ -0,0 +1,40 @@
+## Why
+
+Operations stability still depends heavily on manual intervention when workers degrade or pools saturate.
+We need a formal operations phase to align conda/systemd runtime contracts and add controlled self-healing with guardrails, so recovery is faster without triggering restart storms.
+
+## What Changes
+
+- Standardize conda-based runtime paths across the app service, watchdog, and operational scripts from a single source of truth.
+- Introduce a guarded worker self-healing policy (cooldown, churn windows, bounded retries, manual override).
+- Add alert thresholds and machine-readable recovery signals for pool pressure, circuit-open persistence, and restart churn.
+- Harden runbook documentation and scripts for deterministic restart, rollback, and incident triage.
+
+## Capabilities
+
+### New Capabilities
+- `worker-self-healing-governance`: Define safe autonomous recovery behavior with anti-storm guardrails.
+
+### Modified Capabilities
+- `conda-systemd-runtime-alignment`: Extend runtime consistency requirements with startup validation and drift detection.
+- `runtime-resilience-recovery`: Add auditable recovery-action requirements for automated and operator-triggered restart flows.
+
+## Impact
+
+- Affected code:
+  - `deploy/systemd/*.service`
+  - `scripts/worker_watchdog.py`
+  - `src/mes_dashboard/routes/admin_routes.py`
+  - `src/mes_dashboard/routes/health_routes.py`
+  - `src/mes_dashboard/core/database.py`
+  - `src/mes_dashboard/core/circuit_breaker.py`
+  - `tests/`
+  - `README.md`, `README.mdj`, runbook docs
+- APIs:
+  - `/health`
+  - `/health/deep`
+  - `/admin/api/system-status`
+  - `/admin/api/worker/status`
+  - `/admin/api/worker/restart`
+- Operational behavior:
+  - Preserve single-port bind model.
+  - Add controlled self-healing policy and clearer alert thresholds.
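The guarded self-healing policy proposed above (cooldown, retry budget per window, churn-triggered guarded mode, manual override) can be sketched as a small state machine. Class and method names, thresholds, and the return values are illustrative assumptions, not the watchdog's actual API.

```python
import collections


class RestartPolicy:
    """Bounded auto-restart guard: cooldown + retry budget per window + churn gate."""

    def __init__(self, cooldown_s=60, budget=3, window_s=600):
        self.cooldown_s = cooldown_s
        self.budget = budget          # max restarts allowed inside the window
        self.window_s = window_s
        self.restarts = collections.deque()  # timestamps of recent restarts
        self.guarded = False          # set once the churn threshold is breached

    def decide(self, now):
        """Return 'restart', 'cooldown', or 'blocked' for a degradation event."""
        if self.guarded:
            return "blocked"          # guarded mode: only manual override proceeds
        while self.restarts and now - self.restarts[0] > self.window_s:
            self.restarts.popleft()   # drop events that fell out of the window
        if self.restarts and now - self.restarts[-1] < self.cooldown_s:
            return "cooldown"
        if len(self.restarts) >= self.budget:
            self.guarded = True       # churn threshold exceeded -> guarded mode
            return "blocked"
        self.restarts.append(now)
        return "restart"

    def manual_override(self, ack, reason):
        """Operator override requires an explicit ack and a reason (auditable)."""
        if not (ack and reason):
            return False
        self.guarded = False
        self.restarts.clear()
        return True
```

Each `decide` outcome maps naturally onto the `allowed`/`cooldown`/`blocked` policy-state fields that the proposal exposes in health/admin payloads, and every transition would be emitted as a structured audit event.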
diff --git a/openspec/changes/archive/2026-02-08-p2-ops-self-healing-runbook/specs/conda-systemd-runtime-alignment/spec.md b/openspec/changes/archive/2026-02-08-p2-ops-self-healing-runbook/specs/conda-systemd-runtime-alignment/spec.md
new file mode 100644
index 0000000..b44aab2
--- /dev/null
+++ b/openspec/changes/archive/2026-02-08-p2-ops-self-healing-runbook/specs/conda-systemd-runtime-alignment/spec.md
@@ -0,0 +1,15 @@
+## ADDED Requirements
+
+### Requirement: Runtime Path Drift SHALL Be Detectable Before Service Start
+Service startup checks MUST validate configured conda runtime paths across app, watchdog, and worker control scripts.
+
+#### Scenario: Conda path mismatch detected
+- **WHEN** startup validation finds a runtime path inconsistency between configured units and scripts
+- **THEN** service start MUST fail with actionable diagnostics instead of running with a partial mismatch
+
+### Requirement: Conda/Systemd Contract SHALL Be Versioned in Operations Docs
+The documented runtime contract MUST include versioned path assumptions and verification commands.
+
+#### Scenario: Operator verifies deployment contract
+- **WHEN** an operator follows the runbook validation steps
+- **THEN** the commands MUST confirm active runtime paths match the documented conda/systemd contract
diff --git a/openspec/changes/archive/2026-02-08-p2-ops-self-healing-runbook/specs/runtime-resilience-recovery/spec.md b/openspec/changes/archive/2026-02-08-p2-ops-self-healing-runbook/specs/runtime-resilience-recovery/spec.md
new file mode 100644
index 0000000..4bac0fc
--- /dev/null
+++ b/openspec/changes/archive/2026-02-08-p2-ops-self-healing-runbook/specs/runtime-resilience-recovery/spec.md
@@ -0,0 +1,15 @@
+## ADDED Requirements
+
+### Requirement: Recovery Recommendations SHALL Reflect Self-Healing Policy State
+Health and admin resilience payloads MUST expose whether automated recovery is allowed, cooling down, or blocked by churn policy.
+
+#### Scenario: Operator inspects degraded state
+- **WHEN** `/health` or `/admin/api/worker/status` is requested during degradation
+- **THEN** the response MUST include policy state, remaining cooldown time, and the next recommended action
+
+### Requirement: Manual Recovery Override SHALL Be Explicit and Controlled
+Manual restart actions MUST bypass the automatic block only through authenticated operator pathways with explicit acknowledgement.
+
+#### Scenario: Churn-blocked state with manual override request
+- **WHEN** an authorized admin requests a manual restart while auto-recovery is blocked
+- **THEN** the system MUST execute the controlled restart path and log the override context for auditability
diff --git a/openspec/changes/archive/2026-02-08-p2-ops-self-healing-runbook/specs/worker-self-healing-governance/spec.md b/openspec/changes/archive/2026-02-08-p2-ops-self-healing-runbook/specs/worker-self-healing-governance/spec.md
new file mode 100644
index 0000000..6cf2037
--- /dev/null
+++ b/openspec/changes/archive/2026-02-08-p2-ops-self-healing-runbook/specs/worker-self-healing-governance/spec.md
@@ -0,0 +1,22 @@
+## ADDED Requirements
+
+### Requirement: Automated Worker Recovery SHALL Use Bounded Policy Guards
+Automated worker restart behavior MUST enforce cooldown periods and bounded restart attempts within a configurable time window.
+
+#### Scenario: Repeated worker degradation within short window
+- **WHEN** degradation events exceed the configured restart-attempt budget
+- **THEN** automated restarts MUST pause and surface a blocked-recovery signal for operator intervention
+
+### Requirement: Restart-Churn Protection SHALL Prevent Recovery Storms
+The runtime MUST classify restart churn and prevent uncontrolled restart loops.
+ +#### Scenario: Churn threshold exceeded +- **WHEN** restart count crosses churn threshold in active window +- **THEN** watchdog MUST enter guarded mode and require explicit manual override before further restart attempts + +### Requirement: Recovery Decisions SHALL Be Audit-Ready +Every auto-recovery decision and manual override action MUST be recorded with structured metadata. + +#### Scenario: Worker restart decision emitted +- **WHEN** system executes or denies a restart action +- **THEN** structured logs/events MUST include reason, thresholds, actor/source, and resulting state diff --git a/openspec/changes/archive/2026-02-08-p2-ops-self-healing-runbook/tasks.md b/openspec/changes/archive/2026-02-08-p2-ops-self-healing-runbook/tasks.md new file mode 100644 index 0000000..c921ab1 --- /dev/null +++ b/openspec/changes/archive/2026-02-08-p2-ops-self-healing-runbook/tasks.md @@ -0,0 +1,23 @@ +## 1. Conda/Systemd Contract Alignment + +- [x] 1.1 Centralize runtime path configuration consumed by service units, watchdog, and scripts. +- [x] 1.2 Add startup validation that fails fast on conda path drift. +- [x] 1.3 Update systemd/watchdog integration tests for consistent runtime contract. + +## 2. Worker Self-Healing Policy + +- [x] 2.1 Implement bounded auto-restart policy (cooldown, retry budget, churn window). +- [x] 2.2 Add guarded mode behavior when churn threshold is exceeded. +- [x] 2.3 Implement authenticated manual override flow with explicit logging context. + +## 3. Alerting and Operational Signals + +- [x] 3.1 Expose policy-state fields in health/admin payloads (`allowed`, `cooldown`, `blocked`). +- [x] 3.2 Add structured audit events for restart decisions and override actions. +- [x] 3.3 Define alert thresholds and wire monitoring-friendly fields for pool/circuit/churn conditions. + +## 4. Validation and Runbook Delivery + +- [x] 4.1 Add tests for policy transitions, guarded mode, and override behavior. 
+- [x] 4.2 Validate single-port continuity during controlled recovery and hot reload paths. +- [x] 4.3 Update README/README.mdj and deployment runbook with verified operational procedures. diff --git a/openspec/changes/archive/2026-02-08-residual-hardening-round3/.openspec.yaml b/openspec/changes/archive/2026-02-08-residual-hardening-round3/.openspec.yaml new file mode 100644 index 0000000..565fad5 --- /dev/null +++ b/openspec/changes/archive/2026-02-08-residual-hardening-round3/.openspec.yaml @@ -0,0 +1,2 @@ +schema: spec-driven +created: 2026-02-08 diff --git a/openspec/changes/archive/2026-02-08-residual-hardening-round3/design.md b/openspec/changes/archive/2026-02-08-residual-hardening-round3/design.md new file mode 100644 index 0000000..fff93ec --- /dev/null +++ b/openspec/changes/archive/2026-02-08-residual-hardening-round3/design.md @@ -0,0 +1,50 @@ +## Context + +目前系統已完成 Vite 單一 port 架構與主要 P0/P1/P2 硬化,但殘餘風險集中在「快取慢路徑鎖競爭 + health 熱點查詢 + API 邊界治理」。這些問題多屬中高流量下才明顯,若不在此階段收斂,後續排障成本會高。 + +## Goals / Non-Goals + +**Goals:** +- 在不改變頁面操作語意與單一 port 架構前提下,完成殘餘穩定性與安全性修補。 +- 讓 cache/health 路徑在高併發下更可預期,並降低 log 資安風險。 +- 透過測試覆蓋確保修補不造成功能回歸。 + +**Non-Goals:** +- 不重寫主要查詢流程或移除 `resource/wip` 全表快取策略。 +- 不引入重量級 distributed rate-limit 基礎設施。 +- 不改動前端 drill-down 與報表功能語意。 + +## Decisions + +1. **Cache 發布一致性優先於局部最佳化** +- 使用 staging key + 原子 rename/pipeline 發布資料與 metadata,確保 publish 失敗不影響舊資料可讀性。 + +2. **解析移至鎖外,鎖內僅做快取一致性檢查/寫入** +- WIP process cache 慢路徑改為鎖外 parse,再鎖內 double-check+commit,降低持鎖時間。 + +3. **Process cache 策略一致化** +- realtime equipment cache 補齊 max_size + LRU,與既有 WIP/Resource 一致。 + +4. **Health 內部短快取僅在非測試環境啟用** +- TTL=5 秒,降低高頻 probe 對 DB/Redis 的重複壓力;測試模式維持即時計算避免互相污染。 + +5. 
**高成本 API 採輕量 in-memory 速率限制** +- 以 IP+route window 限流,參數化可調,不引入新外部依賴。 + +## Risks / Trade-offs + +- [Risk] 快取發布改造引入 key 切換邏輯複雜度 → Mitigation: 補上 publish 失敗/成功測試。 +- [Risk] health 快取造成短時間觀測延遲 → Mitigation: TTL 限制 5 秒,並於 testing 禁用。 +- [Risk] in-memory rate limit 在多 worker 下非全域一致 → Mitigation: 先作保護閥,後續可升級 Redis-based limiter。 + +## Migration Plan + +1. 先完成 cache 與 health 核心修補(不影響 API contract)。 +2. 再導入 API 邊界/限流與共用工具抽離。 +3. 補單元與整合測試,執行 benchmark smoke。 +4. 更新 README 文件與環境變數說明。 + +## Open Questions + +- 高成本 API 的預設限流門檻是否要按端點細分(WIP vs Resource)? +- 後續是否要升級為 Redis 分散式限流以覆蓋多 worker 全域一致性? diff --git a/openspec/changes/archive/2026-02-08-residual-hardening-round3/proposal.md b/openspec/changes/archive/2026-02-08-residual-hardening-round3/proposal.md new file mode 100644 index 0000000..7b48b91 --- /dev/null +++ b/openspec/changes/archive/2026-02-08-residual-hardening-round3/proposal.md @@ -0,0 +1,44 @@ +## Why + +上一輪已完成高風險核心修復,但仍有一批殘餘問題會在高併發、長時間運行與惡意/異常輸入下放大風險(快取發布一致性、鎖競爭、健康檢查負載、輸入邊界與速率治理)。本輪目標是把這些尾端風險收斂到可接受範圍,避免後續運維與效能不穩。 + +## What Changes + +- 強化 WIP 快取發布流程,確保更新失敗時不污染既有讀取路徑。 +- 調整 process cache 慢路徑鎖範圍,避免持鎖解析大 JSON。 +- 補齊 realtime equipment process cache 的 bounded LRU,與 WIP/Resource 策略一致。 +- 為資源路由 NaN 清理加入深度保護(避免深層遞迴風險)。 +- 抽取共用布林參數解析,消除重複邏輯。 +- 將 filter cache 的 view 名稱改為可配置,移除硬編碼耦合。 +- 加入敏感連線字串 log redaction。 +- 對 `/health`、`/health/deep` 增加 5 秒內部短快取(測試模式禁用)。 +- 對高成本查詢 API 增加輕量速率限制與可調參數。 +- 更新 README/README.mdj 與驗證測試。 + +## Capabilities + +### New Capabilities +- `api-safety-hygiene`: API 輸入邊界、共享參數解析、可配置查詢來源、與高成本端點速率治理。 + +### Modified Capabilities +- `cache-observability-hardening`: 補強快取發布一致性、process cache 鎖範圍與 bounded 策略一致化。 +- `runtime-resilience-recovery`: 健康檢查短快取與敏感資訊日誌遮罩的運維安全要求。 + +## Impact + +- Affected code: + - `src/mes_dashboard/core/cache_updater.py` + - `src/mes_dashboard/core/cache.py` + - `src/mes_dashboard/services/realtime_equipment_cache.py` + - `src/mes_dashboard/routes/resource_routes.py` + - `src/mes_dashboard/routes/wip_routes.py` + - 
`src/mes_dashboard/routes/hold_routes.py` + - `src/mes_dashboard/services/filter_cache.py` + - `src/mes_dashboard/core/database.py` + - `src/mes_dashboard/routes/health_routes.py` +- APIs: + - `/health`, `/health/deep` + - `/api/wip/detail/`, `/api/wip/overview/*` + - `/api/resource/*`(高成本路由) +- Docs/tests: + - `README.md`, `README.mdj`, `tests/*` diff --git a/openspec/changes/archive/2026-02-08-residual-hardening-round3/specs/api-safety-hygiene/spec.md b/openspec/changes/archive/2026-02-08-residual-hardening-round3/specs/api-safety-hygiene/spec.md new file mode 100644 index 0000000..32db75f --- /dev/null +++ b/openspec/changes/archive/2026-02-08-residual-hardening-round3/specs/api-safety-hygiene/spec.md @@ -0,0 +1,29 @@ +## ADDED Requirements + +### Requirement: Recursive Payload Cleaning MUST Enforce Depth Safety +Routes that normalize nested payloads MUST prevent unbounded recursion depth. + +#### Scenario: Deeply nested response object +- **WHEN** NaN-cleaning helper receives deeply nested list/dict payload +- **THEN** cleaning logic MUST enforce max depth or iterative traversal and return safely without recursion failure + +### Requirement: Filter Source Names MUST Be Configurable +Filter cache query sources MUST NOT rely on hardcoded view names only. + +#### Scenario: Environment-specific view names +- **WHEN** deployment sets custom filter-source environment variables +- **THEN** filter cache loader MUST resolve and query configured view names + +### Requirement: High-Cost APIs SHALL Apply Basic Rate Guardrails +High-cost read endpoints SHALL apply configurable request-rate guardrails to reduce abuse and accidental bursts. + +#### Scenario: Burst traffic from same client +- **WHEN** a client exceeds configured request budget for guarded endpoints +- **THEN** endpoint SHALL return throttled response with clear retry guidance + +### Requirement: Common Boolean Query Parsing SHALL Be Shared +Boolean query parsing in routes SHALL use shared helper behavior. 
+ +#### Scenario: Different routes parse include flags +- **WHEN** routes parse common boolean query parameters +- **THEN** parsing behavior MUST be consistent across routes via shared utility diff --git a/openspec/changes/archive/2026-02-08-residual-hardening-round3/specs/cache-observability-hardening/spec.md b/openspec/changes/archive/2026-02-08-residual-hardening-round3/specs/cache-observability-hardening/spec.md new file mode 100644 index 0000000..eed748d --- /dev/null +++ b/openspec/changes/archive/2026-02-08-residual-hardening-round3/specs/cache-observability-hardening/spec.md @@ -0,0 +1,26 @@ +## ADDED Requirements + +### Requirement: Cache Publish MUST Preserve Previous Readable Snapshot on Failure +When refreshing full-table cache payloads, the system MUST avoid exposing partially published states to readers. + +#### Scenario: Publish fails after payload serialization +- **WHEN** a cache refresh has prepared new payload but publish operation fails +- **THEN** previously published cache keys MUST remain readable and metadata MUST remain consistent with old snapshot + +#### Scenario: Publish succeeds +- **WHEN** publish operation completes successfully +- **THEN** data payload and metadata keys MUST be visible as one coherent new snapshot + +### Requirement: Process-Level Cache Slow Path SHALL Minimize Lock Hold Time +Large payload parsing MUST NOT happen inside long-held process cache locks. + +#### Scenario: Cache miss under concurrent requests +- **WHEN** multiple requests hit process cache miss +- **THEN** parsing work SHALL happen outside lock-protected mutation section, and lock scope SHALL be limited to consistency check + commit + +### Requirement: Process-Level Cache Policies MUST Stay Consistent Across Services +All service-local process caches MUST support bounded capacity with deterministic eviction. 
+ +#### Scenario: Realtime equipment cache growth +- **WHEN** realtime equipment process cache reaches configured capacity +- **THEN** entries MUST be evicted according to deterministic LRU behavior diff --git a/openspec/changes/archive/2026-02-08-residual-hardening-round3/specs/runtime-resilience-recovery/spec.md b/openspec/changes/archive/2026-02-08-residual-hardening-round3/specs/runtime-resilience-recovery/spec.md new file mode 100644 index 0000000..563a0b2 --- /dev/null +++ b/openspec/changes/archive/2026-02-08-residual-hardening-round3/specs/runtime-resilience-recovery/spec.md @@ -0,0 +1,19 @@ +## ADDED Requirements + +### Requirement: Health Endpoints SHALL Use Short Internal Memoization +Health and deep-health computation SHALL use a short-lived internal cache to prevent probe storms from amplifying backend load. + +#### Scenario: Frequent monitor scrapes +- **WHEN** health endpoints are called repeatedly within a small window +- **THEN** service SHALL return memoized payload for up to 5 seconds in non-testing environments + +#### Scenario: Testing mode +- **WHEN** app is running in testing mode +- **THEN** health endpoint memoization MUST be bypassed to preserve deterministic tests + +### Requirement: Logs MUST Redact Connection Secrets +Runtime logs MUST avoid exposing DB connection credentials. + +#### Scenario: Connection string appears in log message +- **WHEN** a log message contains DB URL credentials +- **THEN** logger output MUST redact password and sensitive userinfo before emission diff --git a/openspec/changes/archive/2026-02-08-residual-hardening-round3/tasks.md b/openspec/changes/archive/2026-02-08-residual-hardening-round3/tasks.md new file mode 100644 index 0000000..ab8963e --- /dev/null +++ b/openspec/changes/archive/2026-02-08-residual-hardening-round3/tasks.md @@ -0,0 +1,22 @@ +## 1. Cache Consistency and Contention Hardening + +- [x] 1.1 Harden WIP cache publish in `cache_updater.py` to preserve old snapshot on publish failure. 
+- [x] 1.2 Refactor WIP process-cache slow path in `core/cache.py` so heavy parse runs outside lock. +- [x] 1.3 Extend realtime equipment process cache with bounded `max_size` + deterministic LRU and add regression tests. + +## 2. API Safety and Config Hygiene + +- [x] 2.1 Add depth-safe NaN cleaning in `resource_routes.py` and tests for deep payloads. +- [x] 2.2 Add shared boolean query parser in `core/utils.py` and switch `wip_routes.py` / `hold_routes.py` to it. +- [x] 2.3 Make filter-cache source views configurable (env-based) in `filter_cache.py` and add config tests. + +## 3. Runtime Guardrails + +- [x] 3.1 Add DB connection-string redaction logging filter in `core/database.py` (or logging bootstrap) with tests. +- [x] 3.2 Add 5-second internal memoization for `/health` and `/health/deep` (disabled in testing) and tests. +- [x] 3.3 Add lightweight rate limiting for selected high-cost APIs with clear throttling responses and tests. + +## 4. Validation and Documentation + +- [x] 4.1 Run targeted backend/frontend tests and benchmark smoke gate. +- [x] 4.2 Update `README.md` and `README.mdj` with round-3 hardening notes and new env variables. 
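The connection-string redaction in task 3.1 can be sketched with a standard `logging.Filter`. This is a minimal illustration under assumed names — `RedactSecretsFilter` and the regex are hypothetical, not the actual implementation in `core/database.py`:

```python
import logging
import re

# Matches userinfo credentials in URL-style connection strings,
# e.g. oracle+cx_oracle://app_user:s3cret@db-host:1521/SVC
_CREDENTIAL_RE = re.compile(r"(://[^/\s:@]+):[^@\s]+@")


class RedactSecretsFilter(logging.Filter):
    """Mask passwords embedded in DB URLs before a record is emitted."""

    def filter(self, record: logging.LogRecord) -> bool:
        # Rewrite the message in place; returning True keeps the record.
        record.msg = _CREDENTIAL_RE.sub(r"\1:***@", str(record.msg))
        return True
```

Handlers that must never leak credentials would get `handler.addFilter(RedactSecretsFilter())`, so call sites need no changes.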
diff --git a/openspec/changes/archive/2026-02-08-residual-hardening-round4/.openspec.yaml b/openspec/changes/archive/2026-02-08-residual-hardening-round4/.openspec.yaml new file mode 100644 index 0000000..565fad5 --- /dev/null +++ b/openspec/changes/archive/2026-02-08-residual-hardening-round4/.openspec.yaml @@ -0,0 +1,2 @@ +schema: spec-driven +created: 2026-02-08 diff --git a/openspec/changes/archive/2026-02-08-residual-hardening-round4/design.md b/openspec/changes/archive/2026-02-08-residual-hardening-round4/design.md new file mode 100644 index 0000000..54d5cd7 --- /dev/null +++ b/openspec/changes/archive/2026-02-08-residual-hardening-round4/design.md @@ -0,0 +1,61 @@ +## Context + +round-3 後主流程已穩定,但仍有 3 類技術債: +- Resource 快取在同一 process 內同時保存 DataFrame 與完整 records 複本,導致記憶體放大。 +- Resource 與 Realtime Equipment 的 Oracle 查詢存在跨服務重複字串,日後修改容易偏移。 +- 部分服務邊界型別註記與魔術數字未系統化,維護成本偏高。 + +約束條件: +- `resource` / `wip` 維持全表快取策略,不改資料來源與刷新頻率。 +- 對外 API 欄位與前端行為不變。 +- 保持單一 port 架構與既有運維契約。 + +## Goals / Non-Goals + +**Goals:** +- 降低 Resource 快取在 process 內的重複資料表示,保留查詢輸出相容性。 +- 讓跨服務 Oracle 查詢片段由單一來源維護。 +- 讓關鍵 service/cache 模組具備一致的型別註記與具名常數。 + +**Non-Goals:** +- 不改動資料庫 schema 或 SQL 查詢結果欄位。 +- 不重寫整體 cache 架構(Redis + process cache 維持)。 +- 不引入新基礎設施或外部依賴。 + +## Decisions + +1. Resource derived index 改為「row-position index」而非保存完整 records 複本 +- 現況:index 中保留 `records` 與多組 bucket records,與 DataFrame 內容重複。 +- 決策:index 只保留 row positions(整數索引)與必要 metadata;需要輸出 dict 時由 DataFrame 按需轉換。 +- 取捨:單次輸出會增加少量轉換成本,但可顯著降低常駐記憶體重複。 + +2. 建立共用 Oracle 查詢常數模組 +- 現況:`resource_cache.py`、`realtime_equipment_cache.py` 各自維護 base SQL。 +- 決策:抽出 `services/sql_fragments.py`(或等效模組)管理共用 query 文本與 table/view 名稱。 +- 取捨:增加一層間接引用,但查詢語意一致性與變更可控性更高。 + +3. 
型別與常數治理採「先核心邊界,後擴散」 +- 現況:部分函式已使用 `Optional` / PEP604 混搭,且魔術數字散落於 cache/service。 +- 決策:先統一這輪觸及檔案中的型別風格與高頻常數(TTL、size、window、limits)。 +- 取捨:不追求一次全專案清零,以避免大範圍 noise;先建立可持續擴展基線。 + +## Risks / Trade-offs + +- [Risk] row-position index 與 DataFrame 版本不同步 → Mitigation:每次 cache invalidate 時同步重建 index,並保留版本檢查。 +- [Risk] 惰性轉換導致查詢端 latency 波動 → Mitigation:保留 process cache,並對高頻路徑做小批量輸出優化。 +- [Risk] SQL 共用常數抽離造成引用錯誤 → Mitigation:補齊單元測試,驗證 query 文本與既有欄位契約一致。 +- [Risk] 型別/常數清理引發行為改變 → Mitigation:僅做等價重構,保留原值並用回歸測試覆蓋。 + +## Migration Plan + +1. 先重構 Resource index 表示,確保 API 輸出不變。 +2. 抽離 SQL 共用片段並替換兩個快取服務引用。 +3. 清理該範圍型別與常數,補測試。 +4. 更新 README/README.mdj 與 OpenSpec tasks,跑 backend/frontend 目標測試集。 + +Rollback: +- 若出現相容性問題,可回退至原 index records 表示與舊 SQL 內嵌寫法(單檔回退即可)。 + +## Open Questions + +- 是否要在下一輪把相同治理擴展到 `wip_service.py` 的其餘常數與型別(本輪先限定 residual 範圍)。 diff --git a/openspec/changes/archive/2026-02-08-residual-hardening-round4/proposal.md b/openspec/changes/archive/2026-02-08-residual-hardening-round4/proposal.md new file mode 100644 index 0000000..a18a035 --- /dev/null +++ b/openspec/changes/archive/2026-02-08-residual-hardening-round4/proposal.md @@ -0,0 +1,31 @@ +## Why + +目前剩餘風險集中在可維護性與記憶體效率:Resource 快取在同一個 process 內維持多種資料表示,部分查詢 SQL 在不同快取服務重複維護,且型別註記與魔術數字仍不一致。這些問題不會立刻造成中斷,但會提高記憶體占用、增加後續修改成本與回歸風險,因此需要在既有功能不變前提下完成收斂。 + +## What Changes + +- 將 Resource derived index 的資料表示改為「輕量索引 + 惰性輸出」,避免在 process 中重複保留完整 records 複本。 +- 將 Resource 與 Realtime Equipment 的 Oracle 查詢字串收斂到共用 SQL 常數模組,降低重複定義與不同步漂移風險。 +- 補齊型別註記一致性(尤其 cache/index/service 邊界)並把高頻魔術數字提升為具名常數或可配置參數。 +- 維持現有 API 契約、全表快取策略、單一 port 架構與前端行為不變。 + +## Capabilities + +### New Capabilities +- `resource-cache-representation-normalization`: 以單一權威資料表示與輕量索引替代 process 內多份完整資料複本,並保留既有查詢回傳結構。 +- `oracle-query-fragment-governance`: 將跨服務共用的 Oracle 查詢片段抽離為共享常數/模板,確保查詢語意一致。 +- `maintainability-type-and-constant-hygiene`: 建立型別註記與具名常數的落地規範,降低魔術數字與註記風格漂移。 + +### Modified Capabilities +- `cache-observability-hardening`: 補充記憶體放大係數與索引表示調整後的可觀測一致性要求。 + +## 
Impact + +- 主要影響檔案: + - `src/mes_dashboard/services/resource_cache.py` + - `src/mes_dashboard/services/realtime_equipment_cache.py` + - `src/mes_dashboard/services/resource_service.py`(若需配合索引輸出) + - `src/mes_dashboard/sql/*` 或新增共享 SQL 常數模組 + - `src/mes_dashboard/config/constants.py`、`src/mes_dashboard/core/utils.py` + - 對應測試與 README/README.mdj 文檔 +- 不新增外部依賴,不變更對外 API 路徑與欄位契約。 diff --git a/openspec/changes/archive/2026-02-08-residual-hardening-round4/specs/cache-observability-hardening/spec.md b/openspec/changes/archive/2026-02-08-residual-hardening-round4/specs/cache-observability-hardening/spec.md new file mode 100644 index 0000000..013821f --- /dev/null +++ b/openspec/changes/archive/2026-02-08-residual-hardening-round4/specs/cache-observability-hardening/spec.md @@ -0,0 +1,8 @@ +## MODIFIED Requirements + +### Requirement: Cache Telemetry SHALL Include Memory Amplification Signals +Operational telemetry MUST expose cache-domain memory usage indicators and representation amplification factors, and MUST differentiate between authoritative data payload and derived/index helper structures. 
+ +#### Scenario: Deep health telemetry request after representation normalization +- **WHEN** operators inspect cache telemetry for resource or WIP domains +- **THEN** telemetry MUST include per-domain memory footprint, amplification indicators, and enough structure detail to verify that full-record duplication is not reintroduced diff --git a/openspec/changes/archive/2026-02-08-residual-hardening-round4/specs/maintainability-type-and-constant-hygiene/spec.md b/openspec/changes/archive/2026-02-08-residual-hardening-round4/specs/maintainability-type-and-constant-hygiene/spec.md new file mode 100644 index 0000000..8a83c0d --- /dev/null +++ b/openspec/changes/archive/2026-02-08-residual-hardening-round4/specs/maintainability-type-and-constant-hygiene/spec.md @@ -0,0 +1,15 @@ +## ADDED Requirements + +### Requirement: Core Cache and Service Boundaries MUST Use Consistent Type Annotation Style +Core cache/service modules touched by this change SHALL use a consistent and explicit type-annotation style for public and internal helper boundaries. + +#### Scenario: Reviewing updated cache/service modules +- **WHEN** maintainers inspect function signatures in affected modules +- **THEN** optional and collection types MUST follow a single consistent style and remain compatible with the project Python baseline + +### Requirement: High-Frequency Magic Numbers MUST Be Replaced by Named Constants +Cache, throttling, and index-related numeric literals that control behavior MUST be extracted to named constants or env-configurable settings. 
+ +#### Scenario: Tuning cache/index behavior +- **WHEN** operators need to tune cache/index thresholds +- **THEN** they MUST find values in named constants or environment variables rather than scattered inline literals diff --git a/openspec/changes/archive/2026-02-08-residual-hardening-round4/specs/oracle-query-fragment-governance/spec.md b/openspec/changes/archive/2026-02-08-residual-hardening-round4/specs/oracle-query-fragment-governance/spec.md new file mode 100644 index 0000000..607654c --- /dev/null +++ b/openspec/changes/archive/2026-02-08-residual-hardening-round4/specs/oracle-query-fragment-governance/spec.md @@ -0,0 +1,15 @@ +## ADDED Requirements + +### Requirement: Shared Oracle Query Fragments SHALL Have a Single Source of Truth +Cross-service Oracle query fragments for resource and equipment cache loading MUST be defined in a shared module and imported by service implementations. + +#### Scenario: Update common table/view reference +- **WHEN** a common table or view name changes +- **THEN** operators and developers MUST be able to update one shared definition without editing duplicated SQL literals across services + +### Requirement: Service Queries MUST Preserve Existing Columns and Semantics +Services consuming shared Oracle query fragments SHALL preserve existing selected columns, filters, and downstream payload behavior. 
+ +#### Scenario: Resource and equipment cache refresh after refactor +- **WHEN** cache services execute queries via shared fragments +- **THEN** resulting payload structure MUST remain compatible with existing aggregation and API contracts diff --git a/openspec/changes/archive/2026-02-08-residual-hardening-round4/specs/resource-cache-representation-normalization/spec.md b/openspec/changes/archive/2026-02-08-residual-hardening-round4/specs/resource-cache-representation-normalization/spec.md new file mode 100644 index 0000000..481f5b8 --- /dev/null +++ b/openspec/changes/archive/2026-02-08-residual-hardening-round4/specs/resource-cache-representation-normalization/spec.md @@ -0,0 +1,22 @@ +## ADDED Requirements + +### Requirement: Resource Derived Index MUST Avoid Full Record Duplication +Resource derived index SHALL use lightweight row-position references instead of storing full duplicated record payloads alongside the process DataFrame cache. + +#### Scenario: Build index from cached DataFrame +- **WHEN** resource cache data is parsed from Redis into process-level DataFrame +- **THEN** the derived index MUST store position-based references and metadata without a second full records copy + +### Requirement: Resource Query APIs SHALL Preserve Existing Response Contract +Resource query APIs MUST keep existing output fields and semantics after index representation normalization. + +#### Scenario: Read all resources after normalization +- **WHEN** callers request all resources or filtered resource lists +- **THEN** the returned payload MUST remain field-compatible with pre-normalization responses + +### Requirement: Cache Invalidation MUST Keep Index/Data Coherent +The system SHALL invalidate and rebuild DataFrame/index representations atomically at cache refresh boundaries. 
+ +#### Scenario: Redis-backed cache refresh completes +- **WHEN** a new resource cache snapshot is published +- **THEN** stale index references MUST be invalidated before subsequent reads use refreshed DataFrame data diff --git a/openspec/changes/archive/2026-02-08-residual-hardening-round4/tasks.md b/openspec/changes/archive/2026-02-08-residual-hardening-round4/tasks.md new file mode 100644 index 0000000..0cdbab1 --- /dev/null +++ b/openspec/changes/archive/2026-02-08-residual-hardening-round4/tasks.md @@ -0,0 +1,22 @@ +## 1. Resource Cache Representation Normalization + +- [x] 1.1 Refactor `resource_cache` derived index to use lightweight row-position references instead of full duplicated records payload. +- [x] 1.2 Keep `get_all_resources` / `get_resources_by_filter` API outputs backward compatible while sourcing data from normalized representation. +- [x] 1.3 Update cache telemetry fields to reflect normalized representation and verify amplification calculation remains interpretable. + +## 2. Oracle Query Fragment Governance + +- [x] 2.1 Extract shared Oracle SQL fragments/constants for resource/equipment cache loading into a common module. +- [x] 2.2 Replace duplicated SQL literals in `resource_cache.py` and `realtime_equipment_cache.py` with shared definitions. +- [x] 2.3 Add/adjust tests to lock expected query semantics and prevent drift. + +## 3. Maintainability Hygiene + +- [x] 3.1 Normalize type annotations in touched cache/service modules to one consistent style. +- [x] 3.2 Replace high-frequency magic numbers with named constants or env-driven config in touched modules. +- [x] 3.3 Confirm existing login/API rate-limit and bool parser utilities remain centralized without new duplication. + +## 4. Verification and Documentation + +- [x] 4.1 Run targeted backend tests for resource cache, equipment cache, health/admin, and route behavior. +- [x] 4.2 Update `README.md` and `README.mdj` with round-4 hardening notes. 
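The row-position representation from task 1.1 can be illustrated with a small sketch (hypothetical `RowPositionIndex` / `bucket_col` names; the real `resource_cache` index carries more metadata and version checks): the index keeps only integer row positions per bucket and materializes dict records lazily from the single authoritative DataFrame, avoiding a second full-records copy.

```python
from __future__ import annotations

import pandas as pd


class RowPositionIndex:
    """Derived index holding only integer row positions per bucket;
    dict records are materialized on demand from the DataFrame."""

    def __init__(self, df: pd.DataFrame, bucket_col: str) -> None:
        self._df = df
        # bucket value -> list of integer row positions (no record copies)
        self._buckets: dict[object, list[int]] = {
            key: positions.tolist()
            for key, positions in df.groupby(bucket_col).indices.items()
        }

    def records_for(self, key: object) -> list[dict]:
        positions = self._buckets.get(key, [])
        # lazy output: nothing beyond the positions is retained after the call
        return self._df.iloc[positions].to_dict(orient="records")
```

Each read pays a small conversion cost, which is the trade-off the round-4 design accepts in exchange for removing resident duplication; the index must be rebuilt whenever the DataFrame snapshot is refreshed.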
diff --git a/openspec/changes/archive/2026-02-08-security-stability-hardening-round2/.openspec.yaml b/openspec/changes/archive/2026-02-08-security-stability-hardening-round2/.openspec.yaml new file mode 100644 index 0000000..565fad5 --- /dev/null +++ b/openspec/changes/archive/2026-02-08-security-stability-hardening-round2/.openspec.yaml @@ -0,0 +1,2 @@ +schema: spec-driven +created: 2026-02-08 diff --git a/openspec/changes/archive/2026-02-08-security-stability-hardening-round2/design.md b/openspec/changes/archive/2026-02-08-security-stability-hardening-round2/design.md new file mode 100644 index 0000000..a4eb05c --- /dev/null +++ b/openspec/changes/archive/2026-02-08-security-stability-hardening-round2/design.md @@ -0,0 +1,65 @@ +## Context + +本專案上一輪已完成 P0/P1/P2 的主體重構,但 code review 後仍存在幾個殘餘高風險點: +- `LDAP_API_URL` 缺少 scheme/host 防線,屬於可配置 SSRF 風險。 +- process-level DataFrame cache 僅用 TTL,缺少容量上限。 +- circuit breaker 狀態轉換在持鎖期間寫日誌,存在鎖競爭放大風險。 +- 全域 security headers 尚未統一輸出。 +- 分頁參數尚有下限驗證缺口。 + +這些問題橫跨 `app/core/services/routes/tests`,屬於跨模組安全與穩定性修補。 + +## Goals / Non-Goals + +**Goals:** +- 對 LDAP endpoint、HTTP 回應標頭、輸入邊界建立可測試的最低防線。 +- 讓 process-level cache 具備有界容量與可預期淘汰行為。 +- 降低 circuit breaker 內部鎖競爭風險,避免慢 handler 放大阻塞。 +- 維持單一 port、現有 API 契約與前端互動語意不變。 + +**Non-Goals:** +- 不引入完整 WAF/零信任架構。 +- 不重寫既有 cache 架構為外部快取服務。 +- 不改動報表功能或頁面流程。 + +## Decisions + +1. **LDAP URL 啟動驗證(fail-fast)** + - Decision: 在 `auth_service` 啟動階段驗證 `LDAP_API_URL`,限制 `https` 與白名單 host(由 env 設定),不符合即禁用 LDAP 驗證路徑並記錄錯誤。 + - Rationale: 以最低改動封住配置型 SSRF 風險,不影響 local auth 模式。 + +2. **ProcessLevelCache 有界化** + - Decision: 在 `ProcessLevelCache` 新增 `max_size` 與 LRU 淘汰(`OrderedDict`),`set` 時淘汰最舊 key。 + - Rationale: 保留 TTL 行為,同時避免高基數 key 長時間堆積。 + +3. **Circuit breaker 鎖外寫日誌** + - Decision: `_transition_to` 僅在鎖內更新狀態並組裝日誌訊息,實際 logger 呼叫移到鎖外。 + - Rationale: 降低持鎖區塊執行時間,避免慢 I/O handler 阻塞其他請求路徑。 + +4. 
**全域安全標頭統一注入** + - Decision: 在 `app.after_request` 加入 `CSP`、`X-Frame-Options`、`X-Content-Type-Options`、`Referrer-Policy`,並在 production 加上 `HSTS`。 + - Rationale: 以集中式策略覆蓋所有頁面與 API,降低遺漏機率。 + +5. **分頁參數上下限一致化** + - Decision: 對 `page` 與 `page_size` 統一加入 `max(1, min(...))` 邊界處理。 + - Rationale: 防止負值或極端數值造成不必要負載與非預期行為。 + +## Risks / Trade-offs + +- **[Risk] LDAP 白名單設定不完整導致登入中斷** → **Mitigation:** 提供明確錯誤訊息與 local auth fallback 指引。 +- **[Risk] Cache 上限過小造成命中率下降** → **Mitigation:** `max_size` 設為可配置,先給保守預設值並觀察 telemetry。 +- **[Risk] CSP 過嚴影響既有 inline 腳本** → **Mitigation:** 先採 `default-src 'self'` 與相容策略,必要時以 nonce/白名單微調。 +- **[Risk] 行為調整引發測試回歸** → **Mitigation:** 補 unit/integration 測試覆蓋每個修補點。 + +## Migration Plan + +1. 先落地 backend 修補(auth/cache/circuit breaker/app headers/routes)。 +2. 補測試(LDAP 驗證、LRU、鎖外日誌、headers、分頁邊界)。 +3. 執行既有健康檢查與重點整合測試。 +4. 更新 README/README.mdj 的安全與穩定性章節。 +5. 若部署後有相容性問題,可暫時透過 env 放寬 LDAP host 白名單與 CSP 細項。 + +## Open Questions + +- LDAP host 白名單在各環境是否需要多個網域(例如內網 + DR site)? +- CSP 是否要立即切換到 nonce-based 嚴格模式,或先維持相容策略? 
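上述 Decision 1 的 fail-fast 驗證可用如下最小示意表達(`validate_ldap_api_url` 為假設性名稱,實際 `auth_service.py` 介面可能不同):

```python
from __future__ import annotations

from typing import Optional
from urllib.parse import urlsplit


def validate_ldap_api_url(url: Optional[str], allowed_hosts: set[str]) -> str:
    """啟動階段檢查 LDAP endpoint:僅允許 https 與白名單 host,
    不符合即丟出帶診斷訊息的 ValueError(fail-fast)。"""
    if not url:
        raise ValueError("LDAP_API_URL is not set; configure it or disable LDAP auth")
    parts = urlsplit(url)
    if parts.scheme != "https":
        raise ValueError(f"LDAP_API_URL must use https, got '{parts.scheme}'")
    if parts.hostname not in allowed_hosts:
        raise ValueError(f"host '{parts.hostname}' not in LDAP_ALLOWED_HOSTS")
    return url
```

白名單可由環境變數解析,例如 `set(filter(None, os.environ.get("LDAP_ALLOWED_HOSTS", "").split(",")))`,不符合時不對該 endpoint 發送任何憑證。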
diff --git a/openspec/changes/archive/2026-02-08-security-stability-hardening-round2/proposal.md b/openspec/changes/archive/2026-02-08-security-stability-hardening-round2/proposal.md new file mode 100644 index 0000000..a70e11a --- /dev/null +++ b/openspec/changes/archive/2026-02-08-security-stability-hardening-round2/proposal.md @@ -0,0 +1,40 @@ +## Why + +上一輪已完成核心穩定性重構,但仍有數個高優先風險(LDAP URL 驗證、無界快取成長、circuit breaker 持鎖寫日誌、安全標頭缺口、分頁下限驗證)未收斂。這些問題會在長時運行與惡意輸入情境下累積可用性與安全風險,需在同一輪中補齊。 + +## What Changes + +- 新增 LDAP API base URL 啟動驗證(限定 `https` 與白名單主機),避免可控 SSRF 目標。 +- 對 process-level cache 加入 `max_size` 與 LRU 淘汰,避免高基數 key 造成無界記憶體成長。 +- 調整 circuit breaker 狀態轉換流程,避免在持鎖期間寫日誌。 +- 新增全域 security headers(CSP、X-Frame-Options、X-Content-Type-Options、Referrer-Policy、HSTS)。 +- 補齊分頁參數下限驗證,避免負值與不合理 page size 進入查詢流程。 +- 為上述修補新增對應測試與文件更新,並維持單一 port 與既有前端操作語意不變。 + +## Capabilities + +### New Capabilities +- `security-surface-hardening`: 規範剩餘安全面向(SSRF 防護、security headers、輸入邊界驗證)的最低防線。 + +### Modified Capabilities +- `cache-observability-hardening`: 擴充快取治理需求,納入 process-level cache 有界容量與淘汰策略。 +- `runtime-resilience-recovery`: 補充 circuit breaker 鎖競爭風險修補與安全標頭對運維診斷回應的相容性要求。 + +## Impact + +- Affected code: + - `src/mes_dashboard/services/auth_service.py` + - `src/mes_dashboard/core/cache.py` + - `src/mes_dashboard/services/resource_cache.py` + - `src/mes_dashboard/core/circuit_breaker.py` + - `src/mes_dashboard/app.py` + - `src/mes_dashboard/routes/wip_routes.py` + - `tests/` + - `README.md`, `README.mdj` +- APIs: + - `/health`, `/health/deep` + - `/api/wip/detail/` + - `/admin/login`(間接受影響:LDAP base 驗證) +- Operational behavior: + - 保持單一 port 與既有報表 UI 流程。 + - 強化安全與穩定性防線,不改變既有功能語意。 diff --git a/openspec/changes/archive/2026-02-08-security-stability-hardening-round2/specs/cache-observability-hardening/spec.md b/openspec/changes/archive/2026-02-08-security-stability-hardening-round2/specs/cache-observability-hardening/spec.md new file mode 100644 index 0000000..55c9cdd --- /dev/null +++ 
b/openspec/changes/archive/2026-02-08-security-stability-hardening-round2/specs/cache-observability-hardening/spec.md @@ -0,0 +1,12 @@ +## ADDED Requirements + +### Requirement: Process-Level Cache SHALL Use Bounded Capacity with Deterministic Eviction +Process-level parsed-data caches MUST enforce a configurable maximum key capacity and use deterministic eviction behavior when capacity is exceeded. + +#### Scenario: Cache capacity reached +- **WHEN** a new cache entry is inserted and key capacity is at limit +- **THEN** cache MUST evict entries according to defined policy before storing the new key + +#### Scenario: Repeated access updates recency +- **WHEN** an existing cache key is read or overwritten +- **THEN** eviction order MUST reflect recency semantics so hot keys are retained preferentially diff --git a/openspec/changes/archive/2026-02-08-security-stability-hardening-round2/specs/runtime-resilience-recovery/spec.md b/openspec/changes/archive/2026-02-08-security-stability-hardening-round2/specs/runtime-resilience-recovery/spec.md new file mode 100644 index 0000000..c76c6f9 --- /dev/null +++ b/openspec/changes/archive/2026-02-08-security-stability-hardening-round2/specs/runtime-resilience-recovery/spec.md @@ -0,0 +1,12 @@ +## ADDED Requirements + +### Requirement: Circuit Breaker State Transitions SHALL Avoid Lock-Held Logging +Circuit breaker state transitions MUST avoid executing logger I/O while internal state locks are held. 
+ +#### Scenario: State transition occurs +- **WHEN** circuit breaker transitions between CLOSED, OPEN, or HALF_OPEN +- **THEN** lock-protected section MUST complete state mutation before emitting transition log output + +#### Scenario: Slow log handler under load +- **WHEN** logger handlers are slow or blocked +- **THEN** circuit breaker lock contention MUST remain bounded and MUST NOT serialize unrelated request paths behind logging latency diff --git a/openspec/changes/archive/2026-02-08-security-stability-hardening-round2/specs/security-surface-hardening/spec.md b/openspec/changes/archive/2026-02-08-security-stability-hardening-round2/specs/security-surface-hardening/spec.md new file mode 100644 index 0000000..6178ae2 --- /dev/null +++ b/openspec/changes/archive/2026-02-08-security-stability-hardening-round2/specs/security-surface-hardening/spec.md @@ -0,0 +1,34 @@ +## ADDED Requirements + +### Requirement: LDAP Authentication Endpoint Configuration SHALL Be Strictly Validated +The system MUST validate LDAP authentication endpoint configuration before use, including HTTPS scheme enforcement and host allowlist checks. + +#### Scenario: Invalid LDAP URL configuration detected +- **WHEN** `LDAP_API_URL` is missing, non-HTTPS, or points to a host outside the configured allowlist +- **THEN** the service MUST reject LDAP authentication calls and emit actionable diagnostics without sending credentials to that endpoint + +#### Scenario: Valid LDAP URL configuration accepted +- **WHEN** `LDAP_API_URL` uses HTTPS and host is allowlisted +- **THEN** LDAP authentication requests MAY proceed with normal timeout and error handling behavior + +### Requirement: Security Response Headers SHALL Be Applied Globally +All HTTP responses MUST include baseline security headers suitable for dashboard and API traffic. 
+ +#### Scenario: Standard response emitted +- **WHEN** any route returns a response +- **THEN** response MUST include `Content-Security-Policy`, `X-Frame-Options`, `X-Content-Type-Options`, and `Referrer-Policy` + +#### Scenario: Production transport hardening +- **WHEN** runtime environment is production +- **THEN** response MUST include `Strict-Transport-Security` + +### Requirement: Pagination Input Boundaries SHALL Be Enforced +Endpoints accepting pagination parameters MUST enforce lower and upper bounds before query execution. + +#### Scenario: Negative or zero pagination inputs +- **WHEN** client sends `page <= 0` or `page_size <= 0` +- **THEN** server MUST normalize values to minimum supported bounds + +#### Scenario: Excessive page size requested +- **WHEN** client sends `page_size` above configured maximum +- **THEN** server MUST clamp to maximum supported page size diff --git a/openspec/changes/archive/2026-02-08-security-stability-hardening-round2/tasks.md b/openspec/changes/archive/2026-02-08-security-stability-hardening-round2/tasks.md new file mode 100644 index 0000000..8090ac6 --- /dev/null +++ b/openspec/changes/archive/2026-02-08-security-stability-hardening-round2/tasks.md @@ -0,0 +1,24 @@ +## 1. LDAP Endpoint Hardening + +- [x] 1.1 Add strict `LDAP_API_URL` validation (`https` + allowlisted hosts) in auth service initialization. +- [x] 1.2 Add tests for valid/invalid LDAP URL configurations and ensure unsafe URLs are rejected without outbound auth call. + +## 2. Bounded Process Cache + +- [x] 2.1 Extend `ProcessLevelCache` with configurable `max_size` and LRU eviction behavior. +- [x] 2.2 Wire bounded cache configuration for WIP/Resource process-level caches and add regression tests. + +## 3. Circuit Breaker Lock Contention Reduction + +- [x] 3.1 Refactor circuit breaker transition logging to execute outside lock-protected section. +- [x] 3.2 Add tests verifying transition logs are emitted while state mutation remains correct. + +## 4. 
HTTP Security Headers and Input Boundary Validation + +- [x] 4.1 Add global `after_request` security headers (CSP, frame, content-type, referrer, HSTS in production). +- [x] 4.2 Tighten pagination boundary handling (`page`/`page_size`) for WIP detail endpoint and add tests. + +## 5. Validation and Documentation + +- [x] 5.1 Run targeted backend/frontend tests plus a benchmark smoke run to confirm no behavior regression. +- [x] 5.2 Update `README.md` and `README.mdj` with round-2 security/stability hardening notes. diff --git a/openspec/specs/api-safety-hygiene/spec.md b/openspec/specs/api-safety-hygiene/spec.md new file mode 100644 index 0000000..4237904 --- /dev/null +++ b/openspec/specs/api-safety-hygiene/spec.md @@ -0,0 +1,33 @@ +# api-safety-hygiene Specification + +## Purpose +TBD - created by archiving change residual-hardening-round3. Update Purpose after archive. +## Requirements +### Requirement: Recursive Payload Cleaning MUST Enforce Depth Safety +Routes that normalize nested payloads MUST prevent unbounded recursion depth. + +#### Scenario: Deeply nested response object +- **WHEN** NaN-cleaning helper receives deeply nested list/dict payload +- **THEN** cleaning logic MUST enforce max depth or iterative traversal and return safely without recursion failure + +### Requirement: Filter Source Names MUST Be Configurable +Filter cache query sources MUST NOT rely solely on hardcoded view names. + +#### Scenario: Environment-specific view names +- **WHEN** deployment sets custom filter-source environment variables +- **THEN** filter cache loader MUST resolve and query configured view names + +### Requirement: High-Cost APIs SHALL Apply Basic Rate Guardrails +High-cost read endpoints SHALL apply configurable request-rate guardrails to reduce abuse and accidental bursts.
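A minimal token-bucket sketch of such a guardrail (capacity and refill rate are illustrative; the real budget should be configurable per endpoint). A `False` result is where a guarded endpoint would return HTTP 429 with retry guidance:

```python
import time


class TokenBucket:
    """Minimal per-client request budget; parameters are illustrative."""

    def __init__(self, capacity=10, refill_per_sec=1.0, clock=time.monotonic):
        self.capacity = float(capacity)
        self.refill = float(refill_per_sec)
        self.clock = clock  # injectable for deterministic tests
        self.tokens = self.capacity
        self.last = clock()

    def allow(self):
        now = self.clock()
        # replenish tokens proportionally to elapsed time, capped at capacity
        self.tokens = min(self.capacity, self.tokens + (now - self.last) * self.refill)
        self.last = now
        if self.tokens >= 1.0:
            self.tokens -= 1.0
            return True
        return False  # caller returns a throttled response with Retry-After
```
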
+ +#### Scenario: Burst traffic from same client +- **WHEN** a client exceeds configured request budget for guarded endpoints +- **THEN** endpoint SHALL return throttled response with clear retry guidance + +### Requirement: Common Boolean Query Parsing SHALL Be Shared +Boolean query parsing in routes SHALL use shared helper behavior. + +#### Scenario: Different routes parse include flags +- **WHEN** routes parse common boolean query parameters +- **THEN** parsing behavior MUST be consistent across routes via shared utility + diff --git a/openspec/specs/cache-indexed-query-acceleration/spec.md b/openspec/specs/cache-indexed-query-acceleration/spec.md new file mode 100644 index 0000000..baf8e19 --- /dev/null +++ b/openspec/specs/cache-indexed-query-acceleration/spec.md @@ -0,0 +1,26 @@ +# cache-indexed-query-acceleration Specification + +## Purpose +TBD - created by archiving change p1-cache-query-efficiency. Update Purpose after archive. +## Requirements +### Requirement: Incremental Synchronization SHALL Use Versioned Watermarks +For heavy non-full-snapshot datasets, cache refresh SHALL support incremental synchronization keyed by stable version or watermark boundaries. + +#### Scenario: Incremental refresh cycle +- **WHEN** source data version indicates partial changes since last sync +- **THEN** cache update logic MUST fetch and merge only changed partitions while preserving correctness guarantees + +### Requirement: Query Paths SHALL Use Indexed Access for High-Frequency Filters +Query execution over cached data SHALL use prebuilt indexes for known high-frequency filter columns. + +#### Scenario: Filtered report query +- **WHEN** request filters target indexed fields +- **THEN** result selection MUST avoid full dataset scans and maintain existing response contract + +### Requirement: Business-Mandated Full-Table Caches SHALL Be Preserved for Resource and WIP +The system SHALL continue to maintain full-table cache behavior for `resource` and `wip` domains. 
+ +#### Scenario: Resource or WIP cache refresh +- **WHEN** cache update runs for `resource` or `wip` +- **THEN** the updater MUST retain full-table snapshot semantics and MUST NOT switch these domains to partial-only cache mode + diff --git a/openspec/specs/cache-observability-hardening/spec.md b/openspec/specs/cache-observability-hardening/spec.md index f027cc2..633e82b 100644 --- a/openspec/specs/cache-observability-hardening/spec.md +++ b/openspec/specs/cache-observability-hardening/spec.md @@ -36,3 +36,53 @@ The system MUST define alert thresholds for sustained degraded state, repeated w - **WHEN** degraded status persists beyond configured duration - **THEN** the monitoring contract MUST classify the service as alert-worthy with actionable context +### Requirement: Cache Telemetry SHALL Include Memory Amplification Signals +Operational telemetry MUST expose cache-domain memory usage indicators and representation amplification factors, and MUST differentiate between authoritative data payload and derived/index helper structures. + +#### Scenario: Deep health telemetry request after representation normalization +- **WHEN** operators inspect cache telemetry for resource or WIP domains +- **THEN** telemetry MUST include per-domain memory footprint, amplification indicators, and enough structure detail to verify that full-record duplication is not reintroduced + +### Requirement: Efficiency Benchmarks SHALL Gate Cache Refactor Rollout +Cache/query efficiency changes MUST be validated against baseline latency and memory benchmarks before rollout. 
+ +#### Scenario: Pre-release validation +- **WHEN** cache refactor changes are prepared for deployment +- **THEN** benchmark results MUST demonstrate no regression beyond configured thresholds for P95 latency and memory usage + +### Requirement: Process-Level Cache SHALL Use Bounded Capacity with Deterministic Eviction +Process-level parsed-data caches MUST enforce a configurable maximum key capacity and use deterministic eviction behavior when capacity is exceeded. + +#### Scenario: Cache capacity reached +- **WHEN** a new cache entry is inserted and key capacity is at limit +- **THEN** cache MUST evict entries according to defined policy before storing the new key + +#### Scenario: Repeated access updates recency +- **WHEN** an existing cache key is read or overwritten +- **THEN** eviction order MUST reflect recency semantics so hot keys are retained preferentially + +### Requirement: Cache Publish MUST Preserve Previous Readable Snapshot on Failure +When refreshing full-table cache payloads, the system MUST avoid exposing partially published states to readers. + +#### Scenario: Publish fails after payload serialization +- **WHEN** a cache refresh has prepared new payload but publish operation fails +- **THEN** previously published cache keys MUST remain readable and metadata MUST remain consistent with old snapshot + +#### Scenario: Publish succeeds +- **WHEN** publish operation completes successfully +- **THEN** data payload and metadata keys MUST be visible as one coherent new snapshot + +### Requirement: Process-Level Cache Slow Path SHALL Minimize Lock Hold Time +Large payload parsing MUST NOT happen inside long-held process cache locks. 
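A sketch of the slow-path discipline this requirement mandates (class and names are illustrative, not the project's `ProcessLevelCache`): the expensive parse runs with no lock held, and the lock is re-acquired only for a version consistency check plus commit, so a racing thread's result is reused instead of overwritten.

```python
import threading


class ProcessCache:
    """Parse outside the lock; re-check the version before committing."""

    def __init__(self):
        self._lock = threading.Lock()
        self._store = {}  # key -> (version, parsed payload)

    def get_or_parse(self, key, version, raw, parse):
        with self._lock:  # fast path: short read under lock
            hit = self._store.get(key)
            if hit is not None and hit[0] == version:
                return hit[1]
        parsed = parse(raw)  # heavy work, no lock held
        with self._lock:  # commit: consistency check + store only
            hit = self._store.get(key)
            if hit is not None and hit[0] == version:
                return hit[1]  # another thread won the race; reuse its result
            self._store[key] = (version, parsed)
            return parsed
```
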
+ +#### Scenario: Cache miss under concurrent requests +- **WHEN** multiple requests hit process cache miss +- **THEN** parsing work SHALL happen outside lock-protected mutation section, and lock scope SHALL be limited to consistency check + commit + +### Requirement: Process-Level Cache Policies MUST Stay Consistent Across Services +All service-local process caches MUST support bounded capacity with deterministic eviction. + +#### Scenario: Realtime equipment cache growth +- **WHEN** realtime equipment process cache reaches configured capacity +- **THEN** entries MUST be evicted according to deterministic LRU behavior + diff --git a/openspec/specs/conda-systemd-runtime-alignment/spec.md b/openspec/specs/conda-systemd-runtime-alignment/spec.md index 614e07f..64a3758 100644 --- a/openspec/specs/conda-systemd-runtime-alignment/spec.md +++ b/openspec/specs/conda-systemd-runtime-alignment/spec.md @@ -24,3 +24,17 @@ Runbooks and deployment documentation MUST describe the same conda/systemd/watch - **WHEN** an operator performs deploy, health check, and rollback from documentation - **THEN** documented commands and paths MUST work without requiring venv-specific assumptions +### Requirement: Runtime Path Drift SHALL Be Detectable Before Service Start +Service startup checks MUST validate configured conda runtime paths across app, watchdog, and worker control scripts. + +#### Scenario: Conda path mismatch detected +- **WHEN** startup validation finds runtime path inconsistency between configured units and scripts +- **THEN** service start MUST fail with actionable diagnostics instead of running with partial mismatch + +### Requirement: Conda/Systemd Contract SHALL Be Versioned in Operations Docs +The documented runtime contract MUST include versioned path assumptions and verification commands. 
+ +#### Scenario: Operator verifies deployment contract +- **WHEN** operator follows runbook validation steps +- **THEN** commands MUST confirm active runtime paths match documented conda/systemd contract + diff --git a/openspec/specs/frontend-compute-shift/spec.md b/openspec/specs/frontend-compute-shift/spec.md index 2a6075a..5fe0d2f 100644 --- a/openspec/specs/frontend-compute-shift/spec.md +++ b/openspec/specs/frontend-compute-shift/spec.md @@ -50,3 +50,17 @@ Frontend matrix/filter computations SHALL produce deterministic selection and fi - **WHEN** users toggle matrix cells across group, family, and resource rows - **THEN** selected-state rendering and filtered equipment result sets MUST remain level-correct and reversible +### Requirement: Reusable Browser Compute Modules SHALL Power Report Derivations +Derived computations for report filters, KPI cards, chart series, and table projections SHALL be implemented through reusable frontend modules. + +#### Scenario: Shared report derivation logic +- **WHEN** multiple report pages require equivalent data-shaping behavior +- **THEN** pages MUST consume shared compute modules instead of duplicating transformation logic per page + +### Requirement: Browser Compute Shift SHALL Preserve Export and Field Contracts +Moving computations to frontend MUST preserve existing field naming and export column contracts. 
+ +#### Scenario: User exports report after frontend-side derivation +- **WHEN** transformed data is rendered and exported +- **THEN** exported field names and ordering MUST remain consistent with governed field contract definitions + diff --git a/openspec/specs/maintainability-type-and-constant-hygiene/spec.md b/openspec/specs/maintainability-type-and-constant-hygiene/spec.md new file mode 100644 index 0000000..97f5bdf --- /dev/null +++ b/openspec/specs/maintainability-type-and-constant-hygiene/spec.md @@ -0,0 +1,19 @@ +# maintainability-type-and-constant-hygiene Specification + +## Purpose +TBD - created by archiving change residual-hardening-round4. Update Purpose after archive. +## Requirements +### Requirement: Core Cache and Service Boundaries MUST Use Consistent Type Annotation Style +Core cache/service modules touched by this change SHALL use a consistent and explicit type-annotation style for public and internal helper boundaries. + +#### Scenario: Reviewing updated cache/service modules +- **WHEN** maintainers inspect function signatures in affected modules +- **THEN** optional and collection types MUST follow a single consistent style and remain compatible with the project Python baseline + +### Requirement: High-Frequency Magic Numbers MUST Be Replaced by Named Constants +Cache, throttling, and index-related numeric literals that control behavior MUST be extracted to named constants or env-configurable settings. 
+ +#### Scenario: Tuning cache/index behavior +- **WHEN** operators need to tune cache/index thresholds +- **THEN** they MUST find values in named constants or environment variables rather than scattered inline literals + diff --git a/openspec/specs/oracle-query-fragment-governance/spec.md b/openspec/specs/oracle-query-fragment-governance/spec.md new file mode 100644 index 0000000..001701a --- /dev/null +++ b/openspec/specs/oracle-query-fragment-governance/spec.md @@ -0,0 +1,19 @@ +# oracle-query-fragment-governance Specification + +## Purpose +TBD - created by archiving change residual-hardening-round4. Update Purpose after archive. +## Requirements +### Requirement: Shared Oracle Query Fragments SHALL Have a Single Source of Truth +Cross-service Oracle query fragments for resource and equipment cache loading MUST be defined in a shared module and imported by service implementations. + +#### Scenario: Update common table/view reference +- **WHEN** a common table or view name changes +- **THEN** operators and developers MUST be able to update one shared definition without editing duplicated SQL literals across services + +### Requirement: Service Queries MUST Preserve Existing Columns and Semantics +Services consuming shared Oracle query fragments SHALL preserve existing selected columns, filters, and downstream payload behavior. 
+ +#### Scenario: Resource and equipment cache refresh after refactor +- **WHEN** cache services execute queries via shared fragments +- **THEN** resulting payload structure MUST remain compatible with existing aggregation and API contracts + diff --git a/openspec/specs/resource-cache-representation-normalization/spec.md b/openspec/specs/resource-cache-representation-normalization/spec.md new file mode 100644 index 0000000..5bcf219 --- /dev/null +++ b/openspec/specs/resource-cache-representation-normalization/spec.md @@ -0,0 +1,26 @@ +# resource-cache-representation-normalization Specification + +## Purpose +TBD - created by archiving change residual-hardening-round4. Update Purpose after archive. +## Requirements +### Requirement: Resource Derived Index MUST Avoid Full Record Duplication +Resource derived index SHALL use lightweight row-position references instead of storing full duplicated record payloads alongside the process DataFrame cache. + +#### Scenario: Build index from cached DataFrame +- **WHEN** resource cache data is parsed from Redis into process-level DataFrame +- **THEN** the derived index MUST store position-based references and metadata without a second full records copy + +### Requirement: Resource Query APIs SHALL Preserve Existing Response Contract +Resource query APIs MUST keep existing output fields and semantics after index representation normalization. + +#### Scenario: Read all resources after normalization +- **WHEN** callers request all resources or filtered resource lists +- **THEN** the returned payload MUST remain field-compatible with pre-normalization responses + +### Requirement: Cache Invalidation MUST Keep Index/Data Coherent +The system SHALL invalidate and rebuild DataFrame/index representations atomically at cache refresh boundaries. 
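A sketch of one way to satisfy atomic data/index coherence (names are illustrative; the real implementation publishes through Redis keys): data and its derived index live in a single snapshot object that is built fully before being swapped in, so readers see either the old pair or the new pair, never a mix. If index building fails, the old snapshot stays published.

```python
import threading


class SnapshotCache:
    """Keep data and derived index in one object and swap it atomically."""

    class _Snapshot:
        def __init__(self, version, rows, index):
            self.version = version
            self.rows = rows
            self.index = index  # derived positions, never a second full copy

    def __init__(self):
        self._lock = threading.Lock()
        self._current = None

    def publish(self, version, rows, build_index):
        index = build_index(rows)  # staged: built before anything is exposed
        snap = self._Snapshot(version, rows, index)
        with self._lock:  # readers see old or new snapshot, never mixed state
            self._current = snap

    def read(self):
        with self._lock:
            return self._current
```
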
+ +#### Scenario: Redis-backed cache refresh completes +- **WHEN** a new resource cache snapshot is published +- **THEN** stale index references MUST be invalidated before subsequent reads use refreshed DataFrame data + diff --git a/openspec/specs/runtime-resilience-recovery/spec.md b/openspec/specs/runtime-resilience-recovery/spec.md index 67b2566..2be5861 100644 --- a/openspec/specs/runtime-resilience-recovery/spec.md +++ b/openspec/specs/runtime-resilience-recovery/spec.md @@ -48,3 +48,47 @@ The system MUST expose machine-readable resilience thresholds, restart-churn ind #### Scenario: Admin status includes restart churn summary - **WHEN** operators call `/admin/api/system-status` or `/admin/api/worker/status` - **THEN** responses MUST include bounded restart history summary within a configured time window and indicate whether churn threshold is exceeded + +### Requirement: Recovery Recommendations SHALL Reflect Self-Healing Policy State +Health and admin resilience payloads MUST expose whether automated recovery is allowed, cooling down, or blocked by churn policy. + +#### Scenario: Operator inspects degraded state +- **WHEN** `/health` or `/admin/api/worker/status` is requested during degradation +- **THEN** response MUST include policy state, cooldown remaining time, and next recommended action + +### Requirement: Manual Recovery Override SHALL Be Explicit and Controlled +Manual restart actions MUST bypass automatic block only through authenticated operator pathways with explicit acknowledgement. + +#### Scenario: Churn-blocked state with manual override request +- **WHEN** authorized admin requests manual restart while auto-recovery is blocked +- **THEN** system MUST execute controlled restart path and log the override context for auditability + +### Requirement: Circuit Breaker State Transitions SHALL Avoid Lock-Held Logging +Circuit breaker state transitions MUST avoid executing logger I/O while internal state locks are held. 
+ +#### Scenario: State transition occurs +- **WHEN** circuit breaker transitions between CLOSED, OPEN, or HALF_OPEN +- **THEN** lock-protected section MUST complete state mutation before emitting transition log output + +#### Scenario: Slow log handler under load +- **WHEN** logger handlers are slow or blocked +- **THEN** circuit breaker lock contention MUST remain bounded and MUST NOT serialize unrelated request paths behind logging latency + +### Requirement: Health Endpoints SHALL Use Short Internal Memoization +Health and deep-health computation SHALL use a short-lived internal cache to prevent probe storms from amplifying backend load. + +#### Scenario: Frequent monitor scrapes +- **WHEN** health endpoints are called repeatedly within a small window +- **THEN** service SHALL return memoized payload for up to 5 seconds in non-testing environments + +#### Scenario: Testing mode +- **WHEN** app is running in testing mode +- **THEN** health endpoint memoization MUST be bypassed to preserve deterministic tests + +### Requirement: Logs MUST Redact Connection Secrets +Runtime logs MUST avoid exposing DB connection credentials. + +#### Scenario: Connection string appears in log message +- **WHEN** a log message contains DB URL credentials +- **THEN** logger output MUST redact password and sensitive userinfo before emission + diff --git a/openspec/specs/security-surface-hardening/spec.md b/openspec/specs/security-surface-hardening/spec.md new file mode 100644 index 0000000..177af53 --- /dev/null +++ b/openspec/specs/security-surface-hardening/spec.md @@ -0,0 +1,38 @@ +# security-surface-hardening Specification + +## Purpose +TBD - created by archiving change security-stability-hardening-round2. Update Purpose after archive. +## Requirements +### Requirement: LDAP Authentication Endpoint Configuration SHALL Be Strictly Validated +The system MUST validate LDAP authentication endpoint configuration before use, including HTTPS scheme enforcement and host allowlist checks. 
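A minimal sketch of the validation this requirement describes (function name and return shape are illustrative; the real service rejects the auth call and emits diagnostics): enforce the `https` scheme and check the parsed hostname against the allowlist before any credentials leave the process.

```python
from urllib.parse import urlsplit


def validate_ldap_api_url(url, allowed_hosts):
    """Return (ok, reason) after https-scheme and host-allowlist checks."""
    if not url:
        return False, "LDAP_API_URL is not configured"
    parts = urlsplit(url)
    if parts.scheme != "https":
        return False, f"scheme must be https, got {parts.scheme or 'none'!r}"
    host = (parts.hostname or "").lower()
    if host not in {h.lower() for h in allowed_hosts}:
        return False, f"host {host!r} is not in the configured allowlist"
    return True, "ok"
```

On failure the caller should refuse to send credentials and log the reason, matching the "reject without outbound auth call" behavior in the scenarios below the archived spec.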
+ +#### Scenario: Invalid LDAP URL configuration detected +- **WHEN** `LDAP_API_URL` is missing, non-HTTPS, or points to a host outside the configured allowlist +- **THEN** the service MUST reject LDAP authentication calls and emit actionable diagnostics without sending credentials to that endpoint + +#### Scenario: Valid LDAP URL configuration accepted +- **WHEN** `LDAP_API_URL` uses HTTPS and host is allowlisted +- **THEN** LDAP authentication requests MAY proceed with normal timeout and error handling behavior + +### Requirement: Security Response Headers SHALL Be Applied Globally +All HTTP responses MUST include baseline security headers suitable for dashboard and API traffic. + +#### Scenario: Standard response emitted +- **WHEN** any route returns a response +- **THEN** response MUST include `Content-Security-Policy`, `X-Frame-Options`, `X-Content-Type-Options`, and `Referrer-Policy` + +#### Scenario: Production transport hardening +- **WHEN** runtime environment is production +- **THEN** response MUST include `Strict-Transport-Security` + +### Requirement: Pagination Input Boundaries SHALL Be Enforced +Endpoints accepting pagination parameters MUST enforce lower and upper bounds before query execution. + +#### Scenario: Negative or zero pagination inputs +- **WHEN** client sends `page <= 0` or `page_size <= 0` +- **THEN** server MUST normalize values to minimum supported bounds + +#### Scenario: Excessive page size requested +- **WHEN** client sends `page_size` above configured maximum +- **THEN** server MUST clamp to maximum supported page size + diff --git a/openspec/specs/worker-self-healing-governance/spec.md b/openspec/specs/worker-self-healing-governance/spec.md new file mode 100644 index 0000000..3fbc0d5 --- /dev/null +++ b/openspec/specs/worker-self-healing-governance/spec.md @@ -0,0 +1,26 @@ +# worker-self-healing-governance Specification + +## Purpose +TBD - created by archiving change p2-ops-self-healing-runbook. Update Purpose after archive. 
+## Requirements +### Requirement: Automated Worker Recovery SHALL Use Bounded Policy Guards +Automated worker restart behavior MUST enforce cooldown periods and bounded restart attempts within a configurable time window. + +#### Scenario: Repeated worker degradation within short window +- **WHEN** degradation events exceed configured restart-attempt budget +- **THEN** automated restarts MUST pause and surface a blocked-recovery signal for operator intervention + +### Requirement: Restart-Churn Protection SHALL Prevent Recovery Storms +The runtime MUST classify restart churn and prevent uncontrolled restart loops. + +#### Scenario: Churn threshold exceeded +- **WHEN** restart count crosses churn threshold in active window +- **THEN** watchdog MUST enter guarded mode and require explicit manual override before further restart attempts + +### Requirement: Recovery Decisions SHALL Be Audit-Ready +Every auto-recovery decision and manual override action MUST be recorded with structured metadata. + +#### Scenario: Worker restart decision emitted +- **WHEN** system executes or denies a restart action +- **THEN** structured logs/events MUST include reason, thresholds, actor/source, and resulting state + diff --git a/scripts/run_cache_benchmarks.py b/scripts/run_cache_benchmarks.py new file mode 100755 index 0000000..b9733ef --- /dev/null +++ b/scripts/run_cache_benchmarks.py @@ -0,0 +1,223 @@ +#!/usr/bin/env python3 +# -*- coding: utf-8 -*- +"""Benchmark cache query baseline vs indexed selection. + +This benchmark is used as a repeatable governance harness for P1 cache/query +efficiency work. It focuses on deterministic synthetic workloads so operators +can compare relative latency and memory amplification over time. 
+""" + +from __future__ import annotations + +import argparse +import json +import math +import random +import statistics +import time +from pathlib import Path +from typing import Any + +import numpy as np +import pandas as pd + +ROOT = Path(__file__).resolve().parents[1] +FIXTURE_PATH = ROOT / "tests" / "fixtures" / "cache_benchmark_fixture.json" + + +def load_fixture(path: Path = FIXTURE_PATH) -> dict[str, Any]: + payload = json.loads(path.read_text()) + if "rows" not in payload: + raise ValueError("fixture requires rows") + return payload + + +def build_dataset(rows: int, seed: int) -> pd.DataFrame: + random.seed(seed) + np.random.seed(seed) + + workcenters = [f"WC-{idx:02d}" for idx in range(1, 31)] + packages = ["QFN", "DFN", "SOT", "SOP", "BGA", "TSOP"] + types = ["TYPE-A", "TYPE-B", "TYPE-C", "TYPE-D"] + statuses = ["RUN", "QUEUE", "HOLD"] + hold_reasons = ["", "", "", "YieldLimit", "特殊需求管控", "PM Hold"] + + frame = pd.DataFrame( + { + "WORKCENTER_GROUP": np.random.choice(workcenters, rows), + "PACKAGE_LEF": np.random.choice(packages, rows), + "PJ_TYPE": np.random.choice(types, rows), + "WIP_STATUS": np.random.choice(statuses, rows, p=[0.45, 0.35, 0.20]), + "HOLDREASONNAME": np.random.choice(hold_reasons, rows), + "QTY": np.random.randint(1, 500, rows), + "WORKORDER": [f"WO-{i:06d}" for i in range(rows)], + "LOTID": [f"LOT-{i:07d}" for i in range(rows)], + } + ) + return frame + + +def _build_index(df: pd.DataFrame) -> dict[str, dict[str, set[int]]]: + def by_column(column: str) -> dict[str, set[int]]: + grouped = df.groupby(column, dropna=True, sort=False).indices + return {str(k): {int(i) for i in v} for k, v in grouped.items()} + + return { + "workcenter": by_column("WORKCENTER_GROUP"), + "package": by_column("PACKAGE_LEF"), + "type": by_column("PJ_TYPE"), + "status": by_column("WIP_STATUS"), + } + + +def _baseline_query(df: pd.DataFrame, query: dict[str, str]) -> int: + subset = df + if query.get("workcenter"): + subset = 
subset[subset["WORKCENTER_GROUP"] == query["workcenter"]] + if query.get("package"): + subset = subset[subset["PACKAGE_LEF"] == query["package"]] + if query.get("type"): + subset = subset[subset["PJ_TYPE"] == query["type"]] + if query.get("status"): + subset = subset[subset["WIP_STATUS"] == query["status"]] + return int(len(subset)) + + +def _indexed_query(_df: pd.DataFrame, indexes: dict[str, dict[str, set[int]]], query: dict[str, str]) -> int: + selected: set[int] | None = None + for key, bucket in ( + ("workcenter", "workcenter"), + ("package", "package"), + ("type", "type"), + ("status", "status"), + ): + current = indexes[bucket].get(query.get(key, "")) + if current is None: + return 0 + if selected is None: + selected = set(current) + else: + selected.intersection_update(current) + if not selected: + return 0 + return len(selected or ()) + + +def _build_queries(df: pd.DataFrame, query_count: int, seed: int) -> list[dict[str, str]]: + random.seed(seed + 17) + workcenters = sorted(df["WORKCENTER_GROUP"].dropna().astype(str).unique().tolist()) + packages = sorted(df["PACKAGE_LEF"].dropna().astype(str).unique().tolist()) + types = sorted(df["PJ_TYPE"].dropna().astype(str).unique().tolist()) + statuses = sorted(df["WIP_STATUS"].dropna().astype(str).unique().tolist()) + + queries: list[dict[str, str]] = [] + for _ in range(query_count): + queries.append( + { + "workcenter": random.choice(workcenters), + "package": random.choice(packages), + "type": random.choice(types), + "status": random.choice(statuses), + } + ) + return queries + + +def _p95(values: list[float]) -> float: + if not values: + return 0.0 + sorted_values = sorted(values) + index = min(max(math.ceil(0.95 * len(sorted_values)) - 1, 0), len(sorted_values) - 1) + return sorted_values[index] + + +def run_benchmark(rows: int, query_count: int, seed: int) -> dict[str, Any]: + df = build_dataset(rows=rows, seed=seed) + queries = _build_queries(df, query_count=query_count, seed=seed) + indexes = 
_build_index(df) + + baseline_latencies: list[float] = [] + indexed_latencies: list[float] = [] + baseline_rows: list[int] = [] + indexed_rows: list[int] = [] + + for query in queries: + start = time.perf_counter() + baseline_rows.append(_baseline_query(df, query)) + baseline_latencies.append((time.perf_counter() - start) * 1000) + + start = time.perf_counter() + indexed_rows.append(_indexed_query(df, indexes, query)) + indexed_latencies.append((time.perf_counter() - start) * 1000) + + if baseline_rows != indexed_rows: + raise AssertionError("benchmark correctness drift: indexed result mismatch") + + frame_bytes = int(df.memory_usage(index=True, deep=True).sum()) + index_entries = sum(len(bucket) for buckets in indexes.values() for bucket in buckets.values()) + index_bytes_estimate = int(index_entries * 16) + + baseline_p95 = _p95(baseline_latencies) + indexed_p95 = _p95(indexed_latencies) + + return { + "rows": rows, + "query_count": query_count, + "seed": seed, + "latency_ms": { + "baseline_avg": round(statistics.fmean(baseline_latencies), 4), + "baseline_p95": round(baseline_p95, 4), + "indexed_avg": round(statistics.fmean(indexed_latencies), 4), + "indexed_p95": round(indexed_p95, 4), + "p95_ratio_indexed_vs_baseline": round( + (indexed_p95 / baseline_p95) if baseline_p95 > 0 else 0.0, + 4, + ), + }, + "memory_bytes": { + "frame": frame_bytes, + "index_estimate": index_bytes_estimate, + "amplification_ratio": round( + (frame_bytes + index_bytes_estimate) / max(frame_bytes, 1), + 4, + ), + }, + } + + +def main() -> int: + fixture = load_fixture() + + parser = argparse.ArgumentParser(description="Run cache baseline vs indexed benchmark") + parser.add_argument("--rows", type=int, default=int(fixture.get("rows", 30000))) + parser.add_argument("--queries", type=int, default=int(fixture.get("query_count", 400))) + parser.add_argument("--seed", type=int, default=int(fixture.get("seed", 42))) + parser.add_argument("--enforce", action="store_true") + args = 
parser.parse_args() + + report = run_benchmark(rows=args.rows, query_count=args.queries, seed=args.seed) + print(json.dumps(report, ensure_ascii=False, indent=2)) + + if not args.enforce: + return 0 + + thresholds = fixture.get("thresholds") or {} + max_latency_ratio = float(thresholds.get("max_p95_ratio_indexed_vs_baseline", 1.25)) + max_amplification = float(thresholds.get("max_memory_amplification_ratio", 1.8)) + + latency_ratio = float(report["latency_ms"]["p95_ratio_indexed_vs_baseline"]) + amplification_ratio = float(report["memory_bytes"]["amplification_ratio"]) + + if latency_ratio > max_latency_ratio: + raise SystemExit( + f"Latency regression: {latency_ratio:.4f} > max allowed {max_latency_ratio:.4f}" + ) + if amplification_ratio > max_amplification: + raise SystemExit( + f"Memory amplification regression: {amplification_ratio:.4f} > max allowed {max_amplification:.4f}" + ) + return 0 + + +if __name__ == "__main__": + raise SystemExit(main()) diff --git a/scripts/start_server.sh b/scripts/start_server.sh old mode 100644 new mode 100755 index 1334ff2..7000561 --- a/scripts/start_server.sh +++ b/scripts/start_server.sh @@ -9,7 +9,7 @@ set -uo pipefail # Configuration # ============================================================ ROOT="$(cd "$(dirname "${BASH_SOURCE[0]}")/.." 
&& pwd)" -CONDA_ENV="mes-dashboard" +CONDA_ENV="${CONDA_ENV_NAME:-mes-dashboard}" APP_NAME="mes-dashboard" PID_FILE_DEFAULT="${ROOT}/tmp/gunicorn.pid" PID_FILE="${WATCHDOG_PID_FILE:-${PID_FILE_DEFAULT}}" @@ -56,7 +56,7 @@ timestamp() { resolve_runtime_paths() { WATCHDOG_RUNTIME_DIR="${WATCHDOG_RUNTIME_DIR:-${ROOT}/tmp}" WATCHDOG_RESTART_FLAG="${WATCHDOG_RESTART_FLAG:-${WATCHDOG_RUNTIME_DIR}/mes_dashboard_restart.flag}" - WATCHDOG_PID_FILE="${WATCHDOG_PID_FILE:-${PID_FILE_DEFAULT}}" + WATCHDOG_PID_FILE="${WATCHDOG_PID_FILE:-${WATCHDOG_RUNTIME_DIR}/gunicorn.pid}" WATCHDOG_STATE_FILE="${WATCHDOG_STATE_FILE:-${WATCHDOG_RUNTIME_DIR}/mes_dashboard_restart_state.json}" PID_FILE="${WATCHDOG_PID_FILE}" export WATCHDOG_RUNTIME_DIR WATCHDOG_RESTART_FLAG WATCHDOG_PID_FILE WATCHDOG_STATE_FILE @@ -81,8 +81,14 @@ check_conda() { return 1 fi + if [ -n "${CONDA_BIN:-}" ] && [ ! -x "${CONDA_BIN}" ]; then + log_error "CONDA_BIN is set but not executable: ${CONDA_BIN}" + return 1 + fi + # Source conda - source "$(conda info --base)/etc/profile.d/conda.sh" + local conda_cmd="${CONDA_BIN:-$(command -v conda)}" + source "$(${conda_cmd} info --base)/etc/profile.d/conda.sh" # Check if environment exists if ! 
conda env list | grep -q "^${CONDA_ENV} "; then @@ -95,6 +101,33 @@ check_conda() { return 0 } +validate_runtime_contract() { + conda activate "$CONDA_ENV" + export PYTHONPATH="${ROOT}/src:${PYTHONPATH:-}" + + if python - <<'PY' +import os +import sys + +from mes_dashboard.core.runtime_contract import build_runtime_contract_diagnostics + +strict = os.getenv("RUNTIME_CONTRACT_ENFORCE", "true").strip().lower() in {"1", "true", "yes", "on"} +diag = build_runtime_contract_diagnostics(strict=strict) +if not diag["valid"]: + for error in diag["errors"]: + print(f"RUNTIME_CONTRACT_ERROR: {error}") + raise SystemExit(1) +PY + then + log_success "Runtime contract validation passed" + return 0 + fi + + log_error "Runtime contract validation failed" + log_info "Fix env vars: WATCHDOG_RUNTIME_DIR / WATCHDOG_RESTART_FLAG / WATCHDOG_PID_FILE / WATCHDOG_STATE_FILE / CONDA_BIN" + return 1 +} + check_dependencies() { conda activate "$CONDA_ENV" @@ -329,6 +362,7 @@ run_all_checks() { check_env_file load_env resolve_runtime_paths + validate_runtime_contract || return 1 check_port || return 1 check_database check_redis diff --git a/scripts/worker_watchdog.py b/scripts/worker_watchdog.py old mode 100644 new mode 100755 index c354610..bac2c8d --- a/scripts/worker_watchdog.py +++ b/scripts/worker_watchdog.py @@ -31,6 +31,23 @@ import time from datetime import datetime from pathlib import Path +PROJECT_ROOT = Path(__file__).resolve().parents[1] +SRC_ROOT = PROJECT_ROOT / "src" +if str(SRC_ROOT) not in sys.path: + sys.path.insert(0, str(SRC_ROOT)) + +from mes_dashboard.core.runtime_contract import ( # noqa: E402 + build_runtime_contract_diagnostics, + load_runtime_contract, +) +from mes_dashboard.core.worker_recovery_policy import ( # noqa: E402 + decide_restart_request, + evaluate_worker_recovery_state, + extract_last_requested_at, + extract_restart_history, + get_worker_recovery_policy_config, +) + # Configure logging logging.basicConfig( level=logging.INFO, @@ -45,7 +62,10 @@ logger = 
logging.getLogger('mes_dashboard.watchdog') # Configuration # ============================================================ -CHECK_INTERVAL = int(os.getenv('WATCHDOG_CHECK_INTERVAL', '5')) +_RUNTIME_CONTRACT = load_runtime_contract(project_root=PROJECT_ROOT) +CHECK_INTERVAL = int( + os.getenv('WATCHDOG_CHECK_INTERVAL', str(_RUNTIME_CONTRACT['watchdog_check_interval'])) +) def _env_int(name: str, default: int) -> int: @@ -55,22 +75,11 @@ def _env_int(name: str, default: int) -> int: return default -PROJECT_ROOT = Path(__file__).resolve().parents[1] -DEFAULT_RUNTIME_DIR = Path( - os.getenv('WATCHDOG_RUNTIME_DIR', str(PROJECT_ROOT / 'tmp')) -) -RESTART_FLAG_PATH = os.getenv( - 'WATCHDOG_RESTART_FLAG', - str(DEFAULT_RUNTIME_DIR / 'mes_dashboard_restart.flag') -) -GUNICORN_PID_FILE = os.getenv( - 'WATCHDOG_PID_FILE', - str(DEFAULT_RUNTIME_DIR / 'gunicorn.pid') -) -RESTART_STATE_FILE = os.getenv( - 'WATCHDOG_STATE_FILE', - str(DEFAULT_RUNTIME_DIR / 'mes_dashboard_restart_state.json') -) +DEFAULT_RUNTIME_DIR = Path(_RUNTIME_CONTRACT['watchdog_runtime_dir']) +RESTART_FLAG_PATH = _RUNTIME_CONTRACT['watchdog_restart_flag'] +GUNICORN_PID_FILE = _RUNTIME_CONTRACT['watchdog_pid_file'] +RESTART_STATE_FILE = _RUNTIME_CONTRACT['watchdog_state_file'] +RUNTIME_CONTRACT_VERSION = _RUNTIME_CONTRACT['version'] RESTART_HISTORY_MAX = _env_int('WATCHDOG_RESTART_HISTORY_MAX', 50) @@ -78,6 +87,32 @@ RESTART_HISTORY_MAX = _env_int('WATCHDOG_RESTART_HISTORY_MAX', 50) # Watchdog Implementation # ============================================================ + +def validate_runtime_contract_or_raise() -> None: + """Fail fast if runtime contract is inconsistent.""" + strict = os.getenv("RUNTIME_CONTRACT_ENFORCE", "true").strip().lower() in { + "1", + "true", + "yes", + "on", + } + diagnostics = build_runtime_contract_diagnostics(strict=strict) + if diagnostics["valid"]: + return + + details = "; ".join(diagnostics["errors"]) + raise RuntimeError(f"Runtime contract validation failed: {details}") + + 
+def log_restart_audit(event: str, payload: dict) -> None: + entry = { + "event": event, + "timestamp": datetime.utcnow().isoformat(), + "runtime_contract_version": RUNTIME_CONTRACT_VERSION, + **payload, + } + logger.info("worker_watchdog_audit %s", json.dumps(entry, ensure_ascii=False)) + def get_gunicorn_pid() -> int | None: """Get Gunicorn master PID from PID file. @@ -155,7 +190,12 @@ def save_restart_state( requested_at: str | None = None, requested_ip: str | None = None, completed_at: str | None = None, - success: bool = True + success: bool = True, + source: str = "manual", + decision: str = "allowed", + decision_reason: str | None = None, + manual_override: bool = False, + policy_state: dict | None = None, ) -> None: """Save restart state for status queries. @@ -173,7 +213,12 @@ def save_restart_state( "requested_at": requested_at, "requested_ip": requested_ip, "completed_at": completed_at, - "success": success + "success": success, + "source": source, + "decision": decision, + "decision_reason": decision_reason, + "manual_override": manual_override, + "policy_state": policy_state or {}, } current_state = load_restart_state() history = current_state.get("history", []) @@ -229,6 +274,47 @@ def process_restart_request() -> bool: return False logger.info(f"Restart flag detected: {flag_data}") + source = str(flag_data.get("source") or "manual").strip().lower() + manual_override = bool(flag_data.get("manual_override")) + override_ack = bool(flag_data.get("override_acknowledged")) + restart_state = load_restart_state() + restart_history = extract_restart_history(restart_state) + policy_state = evaluate_worker_recovery_state( + restart_history, + last_requested_at=extract_last_requested_at(restart_state), + ) + decision = decide_restart_request( + policy_state, + source=source, + manual_override=manual_override, + override_acknowledged=override_ack, + ) + + if not decision["allowed"]: + remove_restart_flag() + save_restart_state( + 
requested_by=flag_data.get("user"), + requested_at=flag_data.get("timestamp"), + requested_ip=flag_data.get("ip"), + completed_at=datetime.now().isoformat(), + success=False, + source=source, + decision=decision["decision"], + decision_reason=decision["reason"], + manual_override=manual_override, + policy_state=policy_state, + ) + log_restart_audit( + "restart_blocked", + { + "source": source, + "actor": flag_data.get("user"), + "ip": flag_data.get("ip"), + "decision": decision, + "policy_state": policy_state, + }, + ) + return True # Get Gunicorn master PID pid = get_gunicorn_pid() @@ -242,7 +328,22 @@ def process_restart_request() -> bool: requested_at=flag_data.get("timestamp"), requested_ip=flag_data.get("ip"), completed_at=datetime.now().isoformat(), - success=False + success=False, + source=source, + decision="failed", + decision_reason="gunicorn_pid_unavailable", + manual_override=manual_override, + policy_state=policy_state, + ) + log_restart_audit( + "restart_failed", + { + "source": source, + "actor": flag_data.get("user"), + "ip": flag_data.get("ip"), + "decision_reason": "gunicorn_pid_unavailable", + "policy_state": policy_state, + }, ) return True @@ -258,7 +359,12 @@ def process_restart_request() -> bool: requested_at=flag_data.get("timestamp"), requested_ip=flag_data.get("ip"), completed_at=datetime.now().isoformat(), - success=success + success=success, + source=source, + decision="executed" if success else "failed", + decision_reason="signal_sighup" if success else "signal_failed", + manual_override=manual_override, + policy_state=policy_state, ) if success: @@ -267,17 +373,44 @@ def process_restart_request() -> bool: f"Requested by: {flag_data.get('user', 'unknown')}, " f"IP: {flag_data.get('ip', 'unknown')}" ) + log_restart_audit( + "restart_executed", + { + "source": source, + "actor": flag_data.get("user"), + "ip": flag_data.get("ip"), + "manual_override": manual_override, + "policy_state": policy_state, + }, + ) + else: + log_restart_audit( + 
"restart_failed", + { + "source": source, + "actor": flag_data.get("user"), + "ip": flag_data.get("ip"), + "decision_reason": "signal_failed", + "policy_state": policy_state, + }, + ) return True def run_watchdog() -> None: """Main watchdog loop.""" + validate_runtime_contract_or_raise() + policy = get_worker_recovery_policy_config() logger.info( f"Worker watchdog started - " f"Check interval: {CHECK_INTERVAL}s, " f"Flag path: {RESTART_FLAG_PATH}, " - f"PID file: {GUNICORN_PID_FILE}" + f"PID file: {GUNICORN_PID_FILE}, " + f"Policy(cooldown={policy['cooldown_seconds']}s, " + f"retry_budget={policy['retry_budget']}, " + f"window={policy['window_seconds']}s, " + f"guarded={policy['guarded_mode_enabled']})" ) while True: diff --git a/src/mes_dashboard/app.py b/src/mes_dashboard/app.py index 1433684..e98d78e 100644 --- a/src/mes_dashboard/app.py +++ b/src/mes_dashboard/app.py @@ -3,24 +3,48 @@ from __future__ import annotations +import atexit import logging import os import sys +import threading from flask import Flask, jsonify, redirect, render_template, request, session, url_for from mes_dashboard.config.tables import TABLES_CONFIG from mes_dashboard.config.settings import get_config from mes_dashboard.core.cache import create_default_cache_backend -from mes_dashboard.core.database import get_table_data, get_table_columns, get_engine, init_db, start_keepalive +from mes_dashboard.core.database import ( + get_table_data, + get_table_columns, + get_engine, + init_db, + start_keepalive, + dispose_engine, + install_log_redaction_filter, +) from mes_dashboard.core.permissions import is_admin_logged_in, _is_ajax_request +from mes_dashboard.core.csrf import ( + get_csrf_token, + should_enforce_csrf, + validate_csrf, +) from mes_dashboard.routes import register_routes from mes_dashboard.routes.auth_routes import auth_bp from mes_dashboard.routes.admin_routes import admin_bp from mes_dashboard.routes.health_routes import health_bp from mes_dashboard.services.page_registry 
import get_page_status, is_api_public from mes_dashboard.core.cache_updater import start_cache_updater, stop_cache_updater -from mes_dashboard.services.realtime_equipment_cache import init_realtime_equipment_cache +from mes_dashboard.services.realtime_equipment_cache import ( + init_realtime_equipment_cache, + stop_equipment_status_sync_worker, +) +from mes_dashboard.core.redis_client import close_redis +from mes_dashboard.core.runtime_contract import build_runtime_contract_diagnostics + + +_SHUTDOWN_LOCK = threading.Lock() +_ATEXIT_REGISTERED = False def _configure_logging(app: Flask) -> None: @@ -63,6 +87,121 @@ def _configure_logging(app: Flask) -> None: # Prevent propagation to root logger (avoid duplicate logs) logger.propagate = False + install_log_redaction_filter(logger) + + +def _is_production_env(app: Flask) -> bool: + env_value = str(app.config.get("ENV") or os.getenv("FLASK_ENV") or "production").lower() + return env_value in {"prod", "production"} + + +def _build_security_headers(production: bool) -> dict[str, str]: + headers = { + "Content-Security-Policy": ( + "default-src 'self'; " + "script-src 'self' 'unsafe-inline' 'unsafe-eval'; " + "style-src 'self' 'unsafe-inline'; " + "img-src 'self' data: blob:; " + "font-src 'self' data:; " + "connect-src 'self'; " + "frame-ancestors 'none'; " + "base-uri 'self'; " + "form-action 'self'" + ), + "X-Frame-Options": "DENY", + "X-Content-Type-Options": "nosniff", + "Referrer-Policy": "strict-origin-when-cross-origin", + } + if production: + headers["Strict-Transport-Security"] = "max-age=31536000; includeSubDomains" + return headers + + +def _resolve_secret_key(app: Flask) -> str: + env_name = str(app.config.get("ENV") or os.getenv("FLASK_ENV") or "development").lower() + configured = os.environ.get("SECRET_KEY") or app.config.get("SECRET_KEY") + insecure_defaults = {"", "dev-secret-key-change-in-prod"} + + if configured and configured not in insecure_defaults: + return configured + + if env_name in 
{"production", "prod"}: + raise RuntimeError( + "SECRET_KEY is required in production and cannot use insecure defaults." + ) + + # Development and testing get explicit environment-safe defaults. + if env_name in {"testing", "test"}: + return "test-secret-key" + return "dev-local-only-secret-key" + + +def _shutdown_runtime_resources() -> None: + """Stop background workers and shared clients during app/worker shutdown.""" + logger = logging.getLogger("mes_dashboard") + + try: + stop_cache_updater() + except Exception as exc: + logger.warning("Error stopping cache updater: %s", exc) + + try: + stop_equipment_status_sync_worker() + except Exception as exc: + logger.warning("Error stopping equipment sync worker: %s", exc) + + try: + close_redis() + except Exception as exc: + logger.warning("Error closing Redis client: %s", exc) + + try: + dispose_engine() + except Exception as exc: + logger.warning("Error disposing DB engines: %s", exc) + + +def _register_shutdown_hooks(app: Flask) -> None: + global _ATEXIT_REGISTERED + + app.extensions["runtime_shutdown"] = _shutdown_runtime_resources + if app.extensions.get("runtime_shutdown_registered"): + return + + app.extensions["runtime_shutdown_registered"] = True + if app.testing or bool(app.config.get("TESTING")) or os.getenv("PYTEST_CURRENT_TEST"): + return + + with _SHUTDOWN_LOCK: + if not _ATEXIT_REGISTERED: + atexit.register(_shutdown_runtime_resources) + _ATEXIT_REGISTERED = True + + +def _is_runtime_contract_enforced(app: Flask) -> bool: + raw = os.getenv("RUNTIME_CONTRACT_ENFORCE") + if raw is not None: + return raw.strip().lower() in {"1", "true", "yes", "on"} + return _is_production_env(app) + + +def _validate_runtime_contract(app: Flask) -> None: + strict = _is_runtime_contract_enforced(app) + diagnostics = build_runtime_contract_diagnostics(strict=strict) + app.extensions["runtime_contract"] = diagnostics["contract"] + app.extensions["runtime_contract_validation"] = { + "valid": diagnostics["valid"], + "strict": 
diagnostics["strict"], + "errors": diagnostics["errors"], + } + + if diagnostics["valid"]: + return + + message = "Runtime contract validation failed: " + "; ".join(diagnostics["errors"]) + if strict: + raise RuntimeError(message) + logging.getLogger("mes_dashboard").warning(message) def create_app(config_name: str | None = None) -> Flask: @@ -72,19 +211,22 @@ def create_app(config_name: str | None = None) -> Flask: config_class = get_config(config_name) app.config.from_object(config_class) - # Session configuration - app.secret_key = os.environ.get("SECRET_KEY", "dev-secret-key-change-in-prod") + # Session configuration with environment-aware secret validation. + app.secret_key = _resolve_secret_key(app) + app.config["SECRET_KEY"] = app.secret_key # Session cookie security settings - # SECURE: Only send cookie over HTTPS (disable for local development) - app.config['SESSION_COOKIE_SECURE'] = os.environ.get("FLASK_ENV") == "production" + # SECURE: Only send cookie over HTTPS in production. + app.config['SESSION_COOKIE_SECURE'] = _is_production_env(app) # HTTPONLY: Prevent JavaScript access to session cookie (XSS protection) app.config['SESSION_COOKIE_HTTPONLY'] = True - # SAMESITE: Prevent CSRF by restricting cross-site cookie sending - app.config['SESSION_COOKIE_SAMESITE'] = 'Lax' + # SAMESITE: strict in production, relaxed for local development usability. 
+ app.config['SESSION_COOKIE_SAMESITE'] = 'Strict' if _is_production_env(app) else 'Lax' # Configure logging first _configure_logging(app) + _validate_runtime_contract(app) + security_headers = _build_security_headers(_is_production_env(app)) # Route-level cache backend (L1 memory + optional L2 Redis) app.extensions["cache"] = create_default_cache_backend() @@ -96,6 +238,7 @@ def create_app(config_name: str | None = None) -> Flask: start_keepalive() # Keep database connections alive start_cache_updater() # Start Redis cache updater init_realtime_equipment_cache(app) # Start realtime equipment status cache + _register_shutdown_hooks(app) # Register API routes register_routes(app) @@ -150,6 +293,34 @@ def create_app(config_name: str | None = None) -> Flask: return None + @app.before_request + def enforce_csrf(): + if not should_enforce_csrf( + request, + enabled=bool(app.config.get("CSRF_ENABLED", True)), + ): + return None + + if validate_csrf(request): + return None + + if request.path == "/admin/login": + return render_template("login.html", error="CSRF 驗證失敗,請重新提交"), 403 + + from mes_dashboard.core.response import error_response, FORBIDDEN + + return error_response( + FORBIDDEN, + "CSRF 驗證失敗", + status_code=403, + ) + + @app.after_request + def apply_security_headers(response): + for header, value in security_headers.items(): + response.headers.setdefault(header, value) + return response + # ======================================================== # Template Context Processor # ======================================================== @@ -185,6 +356,7 @@ def create_app(config_name: str | None = None) -> Flask: "admin_user": session.get("admin"), "can_view_page": can_view_page, "frontend_asset": frontend_asset, + "csrf_token": get_csrf_token, } # ======================================================== diff --git a/src/mes_dashboard/config/settings.py b/src/mes_dashboard/config/settings.py index c9c283a..e46f3af 100644 --- a/src/mes_dashboard/config/settings.py +++ 
b/src/mes_dashboard/config/settings.py @@ -20,6 +20,13 @@ def _float_env(name: str, default: float) -> float: return default +def _bool_env(name: str, default: bool) -> bool: + value = os.getenv(name) + if value is None: + return default + return value.strip().lower() in {"1", "true", "yes", "on"} + + class Config: """Base configuration.""" @@ -40,7 +47,8 @@ class Config: # Auth configuration - MUST be set in .env file LDAP_API_URL = os.getenv("LDAP_API_URL", "") ADMIN_EMAILS = os.getenv("ADMIN_EMAILS", "") - SECRET_KEY = os.getenv("SECRET_KEY", "dev-secret-key-change-in-prod") + SECRET_KEY = os.getenv("SECRET_KEY") + CSRF_ENABLED = _bool_env("CSRF_ENABLED", True) # Session configuration PERMANENT_SESSION_LIFETIME = _int_env("SESSION_LIFETIME", 28800) # 8 hours @@ -103,6 +111,7 @@ class TestingConfig(Config): DB_CONNECT_RETRY_COUNT = 0 DB_CONNECT_RETRY_DELAY = 0.0 DB_CALL_TIMEOUT_MS = 5000 + CSRF_ENABLED = False def get_config(env: str | None = None) -> Type[Config]: diff --git a/src/mes_dashboard/core/cache.py b/src/mes_dashboard/core/cache.py index f5f5906..ea59d48 100644 --- a/src/mes_dashboard/core/cache.py +++ b/src/mes_dashboard/core/cache.py @@ -10,8 +10,10 @@ from __future__ import annotations import io import json import logging +import os import threading import time +from collections import OrderedDict from typing import Any, Optional, Protocol, Tuple import pandas as pd @@ -39,26 +41,49 @@ class ProcessLevelCache: Uses a lock to ensure only one thread parses at a time. 
""" - def __init__(self, ttl_seconds: int = 30): - self._cache: dict[str, Tuple[pd.DataFrame, float]] = {} + def __init__(self, ttl_seconds: int = 30, max_size: int = 32): + self._cache: OrderedDict[str, Tuple[pd.DataFrame, float]] = OrderedDict() self._lock = threading.Lock() - self._ttl = ttl_seconds + self._ttl = max(int(ttl_seconds), 1) + self._max_size = max(int(max_size), 1) + + @property + def max_size(self) -> int: + return self._max_size + + def _evict_expired_locked(self, now: float) -> None: + stale_keys = [ + key for key, (_, timestamp) in self._cache.items() + if now - timestamp > self._ttl + ] + for key in stale_keys: + self._cache.pop(key, None) def get(self, key: str) -> Optional[pd.DataFrame]: """Get cached DataFrame if not expired.""" with self._lock: - if key not in self._cache: + payload = self._cache.get(key) + if payload is None: return None - df, timestamp = self._cache[key] - if time.time() - timestamp > self._ttl: - del self._cache[key] + df, timestamp = payload + now = time.time() + if now - timestamp > self._ttl: + self._cache.pop(key, None) return None + self._cache.move_to_end(key, last=True) return df def set(self, key: str, df: pd.DataFrame) -> None: """Cache a DataFrame with current timestamp.""" with self._lock: - self._cache[key] = (df, time.time()) + now = time.time() + self._evict_expired_locked(now) + if key in self._cache: + self._cache.pop(key, None) + elif len(self._cache) >= self._max_size: + self._cache.popitem(last=False) + self._cache[key] = (df, now) + self._cache.move_to_end(key, last=True) def invalidate(self, key: str) -> None: """Remove a key from cache.""" @@ -71,8 +96,26 @@ class ProcessLevelCache: self._cache.clear() +def _resolve_cache_max_size(env_name: str, default: int) -> int: + value = os.getenv(env_name) + if value is None: + return max(int(default), 1) + try: + return max(int(value), 1) + except (TypeError, ValueError): + return max(int(default), 1) + + # Global process-level cache for WIP DataFrame (30s 
TTL) -_wip_df_cache = ProcessLevelCache(ttl_seconds=30) +PROCESS_CACHE_MAX_SIZE = _resolve_cache_max_size("PROCESS_CACHE_MAX_SIZE", 32) +WIP_PROCESS_CACHE_MAX_SIZE = _resolve_cache_max_size( + "WIP_PROCESS_CACHE_MAX_SIZE", + PROCESS_CACHE_MAX_SIZE, +) +_wip_df_cache = ProcessLevelCache( + ttl_seconds=30, + max_size=WIP_PROCESS_CACHE_MAX_SIZE, +) _wip_parse_lock = threading.Lock() # ============================================================ @@ -328,33 +371,30 @@ def get_cached_wip_data() -> Optional[pd.DataFrame]: if client is None: return None - # Use lock to prevent multiple threads from parsing simultaneously + try: + start_time = time.time() + data_json = client.get(get_key("data")) + if data_json is None: + logger.debug("Cache miss: no data in Redis") + return None + + # Parse outside lock to reduce contention on hot paths. + parsed_df = pd.read_json(io.StringIO(data_json), orient='records') + parse_time = time.time() - start_time + except Exception as e: + logger.warning(f"Failed to read cache: {e}") + return None + + # Keep lock scope tight: consistency check + cache write only. 
with _wip_parse_lock: - # Double-check after acquiring lock (another thread may have parsed) cached_df = _wip_df_cache.get(cache_key) if cached_df is not None: - logger.debug(f"Process cache hit (after lock): {len(cached_df)} rows") + logger.debug(f"Process cache hit (after parse): {len(cached_df)} rows") return cached_df + _wip_df_cache.set(cache_key, parsed_df) - try: - start_time = time.time() - data_json = client.get(get_key("data")) - if data_json is None: - logger.debug("Cache miss: no data in Redis") - return None - - # Parse JSON to DataFrame - df = pd.read_json(io.StringIO(data_json), orient='records') - parse_time = time.time() - start_time - - # Store in process-level cache - _wip_df_cache.set(cache_key, df) - - logger.debug(f"Cache hit: loaded {len(df)} rows from Redis (parsed in {parse_time:.2f}s)") - return df - except Exception as e: - logger.warning(f"Failed to read cache: {e}") - return None + logger.debug(f"Cache hit: loaded {len(parsed_df)} rows from Redis (parsed in {parse_time:.2f}s)") + return parsed_df def get_cached_sys_date() -> Optional[str]: diff --git a/src/mes_dashboard/core/cache_updater.py b/src/mes_dashboard/core/cache_updater.py index b4033cd..ffa6e01 100644 --- a/src/mes_dashboard/core/cache_updater.py +++ b/src/mes_dashboard/core/cache_updater.py @@ -221,7 +221,7 @@ class CacheUpdater: return None def _update_redis_cache(self, df: pd.DataFrame, sys_date: str) -> bool: - """Update Redis cache with new data using pipeline for atomicity. + """Update Redis cache with staged publish for coherent snapshot visibility. Args: df: DataFrame with full table data. 
@@ -234,18 +234,24 @@ class CacheUpdater: if client is None: return False + staging_key: str | None = None try: # Convert DataFrame to JSON # Handle datetime columns - for col in df.select_dtypes(include=['datetime64']).columns: - df[col] = df[col].astype(str) + df_copy = df.copy() + for col in df_copy.select_dtypes(include=['datetime64']).columns: + df_copy[col] = df_copy[col].astype(str) - data_json = df.to_json(orient='records', force_ascii=False) + data_json = df_copy.to_json(orient='records', force_ascii=False) - # Atomic update using pipeline + # Stage payload first, then atomically publish live key + metadata. now = datetime.now().isoformat() + unique_suffix = f"{int(time.time() * 1000)}:{threading.get_ident()}" + staging_key = get_key(f"data:staging:{unique_suffix}") + pipe = client.pipeline() - pipe.set(get_key("data"), data_json) + pipe.set(staging_key, data_json) + pipe.rename(staging_key, get_key("data")) pipe.set(get_key("meta:sys_date"), sys_date) pipe.set(get_key("meta:updated_at"), now) pipe.execute() @@ -253,6 +259,11 @@ class CacheUpdater: return True except Exception as e: logger.error(f"Failed to update Redis cache: {e}") + if staging_key: + try: + client.delete(staging_key) + except Exception: + pass return False def _check_resource_update(self, force: bool = False) -> bool: diff --git a/src/mes_dashboard/core/circuit_breaker.py b/src/mes_dashboard/core/circuit_breaker.py index ded364a..bea3351 100644 --- a/src/mes_dashboard/core/circuit_breaker.py +++ b/src/mes_dashboard/core/circuit_breaker.py @@ -130,12 +130,16 @@ class CircuitBreaker: @property def state(self) -> CircuitState: """Get current circuit state, handling state transitions.""" + transition_log: tuple[int, str] | None = None with self._lock: if self._state == CircuitState.OPEN: # Check if we should transition to HALF_OPEN if self._open_time and time.time() - self._open_time >= self.recovery_timeout: - self._transition_to(CircuitState.HALF_OPEN) - return self._state + transition_log 
= self._transition_to_locked(CircuitState.HALF_OPEN) + current_state = self._state + if transition_log: + self._emit_transition_log(*transition_log) + return current_state def allow_request(self) -> bool: """Check if a request should be allowed. @@ -161,45 +165,57 @@ class CircuitBreaker: if not CIRCUIT_BREAKER_ENABLED: return + transition_log: tuple[int, str] | None = None with self._lock: self._results.append(True) if self._state == CircuitState.HALF_OPEN: # Success in half-open means we can close - self._transition_to(CircuitState.CLOSED) + transition_log = self._transition_to_locked(CircuitState.CLOSED) + + if transition_log: + self._emit_transition_log(*transition_log) def record_failure(self) -> None: """Record a failed operation.""" if not CIRCUIT_BREAKER_ENABLED: return + transition_log: tuple[int, str] | None = None with self._lock: self._results.append(False) self._last_failure_time = time.time() if self._state == CircuitState.HALF_OPEN: # Failure in half-open means back to open - self._transition_to(CircuitState.OPEN) + transition_log = self._transition_to_locked(CircuitState.OPEN) elif self._state == CircuitState.CLOSED: # Check if we should open - self._check_and_open() + transition_log = self._check_and_open_locked() - def _check_and_open(self) -> None: + if transition_log: + self._emit_transition_log(*transition_log) + + def _check_and_open_locked(self) -> tuple[int, str] | None: """Check failure rate and open circuit if needed. Must be called with lock held. 
""" if len(self._results) < self.failure_threshold: - return + return None failure_count = sum(1 for r in self._results if not r) failure_rate = failure_count / len(self._results) if (failure_count >= self.failure_threshold and failure_rate >= self.failure_rate_threshold): - self._transition_to(CircuitState.OPEN) + return self._transition_to_locked(CircuitState.OPEN) + return None - def _transition_to(self, new_state: CircuitState) -> None: + def _emit_transition_log(self, level: int, message: str) -> None: + logger.log(level, message) + + def _transition_to_locked(self, new_state: CircuitState) -> tuple[int, str]: """Transition to a new state with logging. Must be called with lock held. @@ -209,23 +225,25 @@ class CircuitBreaker: if new_state == CircuitState.OPEN: self._open_time = time.time() - logger.warning( + return ( + logging.WARNING, f"Circuit breaker '{self.name}' OPENED: " f"state {old_state.value} -> {new_state.value}, " f"failures: {sum(1 for r in self._results if not r)}/{len(self._results)}" ) elif new_state == CircuitState.HALF_OPEN: - logger.info( + return ( + logging.INFO, f"Circuit breaker '{self.name}' entering HALF_OPEN: " f"testing service recovery..." 
) - elif new_state == CircuitState.CLOSED: - self._open_time = None - self._results.clear() - logger.info( - f"Circuit breaker '{self.name}' CLOSED: " - f"service recovered" - ) + self._open_time = None + self._results.clear() + return ( + logging.INFO, + f"Circuit breaker '{self.name}' CLOSED: " + f"service recovered" + ) def get_status(self) -> CircuitBreakerStatus: """Get current status information.""" @@ -266,7 +284,7 @@ class CircuitBreaker: self._results.clear() self._last_failure_time = None self._open_time = None - logger.info(f"Circuit breaker '{self.name}' reset") + logger.info(f"Circuit breaker '{self.name}' reset") # ============================================================ diff --git a/src/mes_dashboard/core/csrf.py b/src/mes_dashboard/core/csrf.py new file mode 100644 index 0000000..cfc4d76 --- /dev/null +++ b/src/mes_dashboard/core/csrf.py @@ -0,0 +1,85 @@ +# -*- coding: utf-8 -*- +"""CSRF token utilities for admin form and API mutation protection.""" + +from __future__ import annotations + +import hmac +import secrets +from typing import Optional + +from flask import Request, request, session + +CSRF_SESSION_KEY = "_csrf_token" +CSRF_HEADER_NAME = "X-CSRF-Token" +CSRF_FORM_FIELD = "csrf_token" +_MUTATING_METHODS = {"POST", "PUT", "PATCH", "DELETE"} + + +def _new_csrf_token() -> str: + return secrets.token_urlsafe(32) + + +def get_csrf_token() -> str: + """Get a stable CSRF token for the current session.""" + token = session.get(CSRF_SESSION_KEY) + if not token: + token = _new_csrf_token() + session[CSRF_SESSION_KEY] = token + return token + + +def rotate_csrf_token() -> str: + """Rotate session CSRF token after authentication state changes.""" + token = _new_csrf_token() + session[CSRF_SESSION_KEY] = token + return token + + +def _extract_request_token(req: Request) -> Optional[str]: + header_token = req.headers.get(CSRF_HEADER_NAME) + if header_token: + return header_token + + form_token = req.form.get(CSRF_FORM_FIELD) + if form_token: + return 
form_token
+
+    if req.is_json:
+        payload = req.get_json(silent=True) or {}
+        json_token = payload.get(CSRF_FORM_FIELD)
+        if json_token:
+            return str(json_token)
+
+    return None
+
+
+def should_enforce_csrf(req: Request = request, enabled: bool = True) -> bool:
+    """Determine whether current request needs CSRF validation."""
+    if not enabled:
+        return False
+
+    if req.method.upper() not in _MUTATING_METHODS:
+        return False
+
+    path = req.path or ""
+    # /admin/login and every /admin/api/* endpoint share the /admin/
+    # prefix, so a single prefix check covers all admin surfaces that
+    # accept mutating requests; non-admin mutation paths are currently
+    # exempt from CSRF validation.
+    if path.startswith("/admin/"):
+        return True
+
+    return False
+
+
+def validate_csrf(req: Request = request) -> bool:
+    """Validate request CSRF token against current session token."""
+    expected = session.get(CSRF_SESSION_KEY)
+    if not expected:
+        return False
+
+    provided = _extract_request_token(req)
+    if not provided:
+        return False
+
+    return hmac.compare_digest(str(expected), str(provided))
diff --git a/src/mes_dashboard/core/database.py b/src/mes_dashboard/core/database.py
index 728a274..a5db670 100644
--- a/src/mes_dashboard/core/database.py
+++ b/src/mes_dashboard/core/database.py
@@ -51,6 +51,59 @@ from mes_dashboard.config.settings import get_config
 # Configure module logger
 logger = logging.getLogger('mes_dashboard.database')
 
+_REDACTION_INSTALLED = False
+_ORACLE_URL_RE = re.compile(r"(oracle\+oracledb://[^:\s/]+:)([^@/\s]+)(@)")
+_ENV_SECRET_RE = re.compile(r"(DB_PASSWORD=)([^\s]+)")
+
+
+def redact_connection_secrets(message: str) -> str:
+    """Redact DB credentials from log message text."""
+    if not message:
+        return message
+    sanitized = _ORACLE_URL_RE.sub(r"\1***\3", message)
+    sanitized = _ENV_SECRET_RE.sub(r"\1***", sanitized)
+    return sanitized
+
+
+class SecretRedactionFilter(logging.Filter):
+    """Filter that masks DB connection secrets in log messages."""
+
+    def filter(self, record: logging.LogRecord) -> bool:
+        try:
+            message = record.getMessage()
+        except Exception:
+            return True
+        sanitized =
redact_connection_secrets(message) + if sanitized != message: + record.msg = sanitized + record.args = () + return True + + +def install_log_redaction_filter(target_logger: logging.Logger | None = None) -> None: + """Attach secret-redaction filter to mes_dashboard logging handlers once.""" + global _REDACTION_INSTALLED + if target_logger is None and _REDACTION_INSTALLED: + return + + logger_obj = target_logger or logging.getLogger("mes_dashboard") + redaction_filter = SecretRedactionFilter() + + attached = False + for handler in logger_obj.handlers: + if any(isinstance(f, SecretRedactionFilter) for f in handler.filters): + attached = True + continue + handler.addFilter(redaction_filter) + attached = True + + if not attached and not any(isinstance(f, SecretRedactionFilter) for f in logger_obj.filters): + logger_obj.addFilter(redaction_filter) + attached = True + + if attached and target_logger is None: + _REDACTION_INSTALLED = True + # ============================================================ # SQLAlchemy Engine (QueuePool - connection pooling) # ============================================================ @@ -59,6 +112,7 @@ logger = logging.getLogger('mes_dashboard.database') # pool_recycle prevents stale connections from firewalls/NAT. 
_ENGINE = None +_HEALTH_ENGINE = None _DB_RUNTIME_CONFIG: Optional[Dict[str, Any]] = None @@ -132,6 +186,13 @@ def get_db_runtime_config(refresh: bool = False) -> Dict[str, Any]: "retry_count": _from_app_or_env_int("DB_CONNECT_RETRY_COUNT", config_class.DB_CONNECT_RETRY_COUNT), "retry_delay": _from_app_or_env_float("DB_CONNECT_RETRY_DELAY", config_class.DB_CONNECT_RETRY_DELAY), "call_timeout_ms": _from_app_or_env_int("DB_CALL_TIMEOUT_MS", config_class.DB_CALL_TIMEOUT_MS), + "health_pool_size": _from_app_or_env_int("DB_HEALTH_POOL_SIZE", 1), + "health_max_overflow": _from_app_or_env_int("DB_HEALTH_MAX_OVERFLOW", 0), + "health_pool_timeout": _from_app_or_env_int("DB_HEALTH_POOL_TIMEOUT", 2), + "pool_exhausted_retry_after_seconds": _from_app_or_env_int( + "DB_POOL_EXHAUSTED_RETRY_AFTER_SECONDS", + 5, + ), } return _DB_RUNTIME_CONFIG.copy() @@ -202,6 +263,42 @@ def get_engine(): return _ENGINE +def get_health_engine(): + """Get dedicated SQLAlchemy engine for health probes. + + Health checks use a tiny isolated pool so status probes remain available + when the request pool is saturated. 
+ """ + global _HEALTH_ENGINE + if _HEALTH_ENGINE is None: + runtime = get_db_runtime_config() + _HEALTH_ENGINE = create_engine( + CONNECTION_STRING, + poolclass=QueuePool, + pool_size=max(int(runtime["health_pool_size"]), 1), + max_overflow=max(int(runtime["health_max_overflow"]), 0), + pool_timeout=max(int(runtime["health_pool_timeout"]), 1), + pool_recycle=runtime["pool_recycle"], + pool_pre_ping=True, + connect_args={ + "tcp_connect_timeout": runtime["tcp_connect_timeout"], + "retry_count": runtime["retry_count"], + "retry_delay": runtime["retry_delay"], + }, + ) + _register_pool_events( + _HEALTH_ENGINE, + min(int(runtime["call_timeout_ms"]), 10_000), + ) + logger.info( + "Health engine created (pool_size=%s, max_overflow=%s, pool_timeout=%s)", + runtime["health_pool_size"], + runtime["health_max_overflow"], + runtime["health_pool_timeout"], + ) + return _HEALTH_ENGINE + + def _register_pool_events(engine, call_timeout_ms: int): """Register event listeners for connection pool monitoring.""" @@ -302,8 +399,12 @@ def dispose_engine(): Call this during application shutdown to cleanly release resources. 
""" - global _ENGINE, _DB_RUNTIME_CONFIG + global _ENGINE, _HEALTH_ENGINE, _DB_RUNTIME_CONFIG stop_keepalive() + if _HEALTH_ENGINE is not None: + _HEALTH_ENGINE.dispose() + logger.info("Health engine disposed") + _HEALTH_ENGINE = None if _ENGINE is not None: _ENGINE.dispose() logger.info("Database engine disposed, all connections closed") @@ -432,9 +533,13 @@ def read_sql_df(sql: str, params: Optional[Dict[str, Any]] = None) -> pd.DataFra elapsed, exc, ) + retry_after = max( + int(get_db_runtime_config().get("pool_exhausted_retry_after_seconds", 5)), + 1, + ) raise DatabasePoolExhaustedError( "Database connection pool exhausted", - retry_after_seconds=5, + retry_after_seconds=retry_after, ) from exc except Exception as exc: elapsed = time.time() - start_time diff --git a/src/mes_dashboard/core/rate_limit.py b/src/mes_dashboard/core/rate_limit.py new file mode 100644 index 0000000..f982d42 --- /dev/null +++ b/src/mes_dashboard/core/rate_limit.py @@ -0,0 +1,103 @@ +# -*- coding: utf-8 -*- +"""Lightweight in-process rate limiting helpers for high-cost routes.""" + +from __future__ import annotations + +import os +import threading +import time +from collections import defaultdict, deque +from functools import wraps +from typing import Callable, Deque + +from flask import request + +from mes_dashboard.core.response import TOO_MANY_REQUESTS, error_response + +_RATE_LOCK = threading.Lock() +_RATE_ATTEMPTS: dict[str, dict[str, Deque[float]]] = defaultdict(lambda: defaultdict(deque)) + + +def _env_int(name: str, default: int) -> int: + raw = os.getenv(name) + if raw is None: + return int(default) + try: + value = int(raw) + except (TypeError, ValueError): + return int(default) + return max(value, 1) + + +def _client_identifier() -> str: + forwarded = request.headers.get("X-Forwarded-For", "").strip() + if forwarded: + return forwarded.split(",")[0].strip() + return request.remote_addr or "unknown" + + +def check_and_record( + bucket: str, + *, + client_id: str, + 
max_attempts: int, + window_seconds: int, +) -> tuple[bool, int]: + """Check and record request attempt for a bucket+client pair.""" + now = time.time() + window_start = now - max(window_seconds, 1) + + with _RATE_LOCK: + per_bucket = _RATE_ATTEMPTS[bucket] + attempts = per_bucket[client_id] + + while attempts and attempts[0] <= window_start: + attempts.popleft() + + if len(attempts) >= max_attempts: + retry_after = max(int(window_seconds - (now - attempts[0])), 1) + return True, retry_after + + attempts.append(now) + return False, 0 + + +def configured_rate_limit( + *, + bucket: str, + max_attempts_env: str, + window_seconds_env: str, + default_max_attempts: int, + default_window_seconds: int, +) -> Callable: + """Build a route decorator with env-configurable rate limits.""" + max_attempts = _env_int(max_attempts_env, default_max_attempts) + window_seconds = _env_int(window_seconds_env, default_window_seconds) + + def decorator(func: Callable) -> Callable: + @wraps(func) + def wrapped(*args, **kwargs): + limited, retry_after = check_and_record( + bucket, + client_id=_client_identifier(), + max_attempts=max_attempts, + window_seconds=window_seconds, + ) + if limited: + return error_response( + TOO_MANY_REQUESTS, + "請求過於頻繁,請稍後再試", + status_code=429, + meta={"retry_after_seconds": retry_after}, + headers={"Retry-After": str(retry_after)}, + ) + return func(*args, **kwargs) + + return wrapped + + return decorator + + +def reset_rate_limits_for_tests() -> None: + with _RATE_LOCK: + _RATE_ATTEMPTS.clear() diff --git a/src/mes_dashboard/core/runtime_contract.py b/src/mes_dashboard/core/runtime_contract.py new file mode 100644 index 0000000..3eff73c --- /dev/null +++ b/src/mes_dashboard/core/runtime_contract.py @@ -0,0 +1,143 @@ +# -*- coding: utf-8 -*- +"""Runtime contract helpers shared by app, scripts, and watchdog.""" + +from __future__ import annotations + +import os +import shutil +from pathlib import Path +from typing import Any, Mapping + +CONTRACT_VERSION = 
"2026.02-p2" +DEFAULT_PROJECT_ROOT = Path(__file__).resolve().parents[3] + + +def _to_bool(value: str | None, default: bool) -> bool: + if value is None: + return default + return value.strip().lower() in {"1", "true", "yes", "on"} + + +def _resolve_path(value: str | None, fallback: Path, project_root: Path) -> Path: + if value is None or not str(value).strip(): + return fallback.resolve() + raw = Path(str(value).strip()) + if raw.is_absolute(): + return raw.resolve() + return (project_root / raw).resolve() + + +def load_runtime_contract( + environ: Mapping[str, str] | None = None, + *, + project_root: Path | str | None = None, +) -> dict[str, Any]: + """Load effective runtime contract from environment with normalized paths.""" + env = environ or os.environ + root = Path(project_root or env.get("MES_DASHBOARD_ROOT", DEFAULT_PROJECT_ROOT)).resolve() + runtime_dir = _resolve_path( + env.get("WATCHDOG_RUNTIME_DIR"), + root / "tmp", + root, + ) + + restart_flag = _resolve_path( + env.get("WATCHDOG_RESTART_FLAG"), + runtime_dir / "mes_dashboard_restart.flag", + root, + ) + pid_file = _resolve_path( + env.get("WATCHDOG_PID_FILE"), + runtime_dir / "gunicorn.pid", + root, + ) + state_file = _resolve_path( + env.get("WATCHDOG_STATE_FILE"), + runtime_dir / "mes_dashboard_restart_state.json", + root, + ) + + contract = { + "version": env.get("RUNTIME_CONTRACT_VERSION", CONTRACT_VERSION), + "project_root": str(root), + "gunicorn_bind": env.get("GUNICORN_BIND", "0.0.0.0:8080"), + "conda_bin": (env.get("CONDA_BIN", "") or "").strip(), + "conda_env_name": (env.get("CONDA_ENV_NAME", "mes-dashboard") or "").strip(), + "watchdog_runtime_dir": str(runtime_dir), + "watchdog_restart_flag": str(restart_flag), + "watchdog_pid_file": str(pid_file), + "watchdog_state_file": str(state_file), + "watchdog_check_interval": int(env.get("WATCHDOG_CHECK_INTERVAL", "5")), + "validation_enforced": _to_bool(env.get("RUNTIME_CONTRACT_ENFORCE"), False), + } + return contract + + +def 
validate_runtime_contract( + contract: Mapping[str, Any] | None = None, + *, + strict: bool = False, +) -> list[str]: + """Validate runtime contract and return actionable errors.""" + cfg = dict(contract or load_runtime_contract()) + errors: list[str] = [] + + runtime_dir = Path(str(cfg["watchdog_runtime_dir"])).resolve() + restart_flag = Path(str(cfg["watchdog_restart_flag"])).resolve() + pid_file = Path(str(cfg["watchdog_pid_file"])).resolve() + state_file = Path(str(cfg["watchdog_state_file"])).resolve() + + if restart_flag.parent != runtime_dir: + errors.append( + "WATCHDOG_RESTART_FLAG must be under WATCHDOG_RUNTIME_DIR " + f"({restart_flag} not under {runtime_dir})." + ) + if pid_file.parent != runtime_dir: + errors.append( + "WATCHDOG_PID_FILE must be under WATCHDOG_RUNTIME_DIR " + f"({pid_file} not under {runtime_dir})." + ) + + if not state_file.is_absolute(): + errors.append("WATCHDOG_STATE_FILE must resolve to an absolute path.") + + bind = str(cfg.get("gunicorn_bind", "")).strip() + if ":" not in bind: + errors.append(f"GUNICORN_BIND must include host:port (current: {bind!r}).") + + conda_bin = str(cfg.get("conda_bin", "")).strip() + if strict and not conda_bin: + conda_on_path = shutil.which("conda") + if not conda_on_path: + errors.append( + "CONDA_BIN is required when strict runtime validation is enabled " + "and conda is not discoverable on PATH." + ) + if conda_bin: + conda_path = Path(conda_bin) + if not conda_path.exists(): + errors.append(f"CONDA_BIN does not exist: {conda_bin}") + elif not os.access(conda_bin, os.X_OK): + errors.append(f"CONDA_BIN is not executable: {conda_bin}") + + conda_env_name = str(cfg.get("conda_env_name", "")).strip() + active_env = (os.getenv("CONDA_DEFAULT_ENV") or "").strip() + if strict and conda_env_name and active_env and active_env != conda_env_name: + errors.append( + "CONDA_DEFAULT_ENV mismatch: " + f"expected {conda_env_name!r}, got {active_env!r}." 
+ ) + + return errors + + +def build_runtime_contract_diagnostics(*, strict: bool = False) -> dict[str, Any]: + """Build diagnostics payload for runtime contract introspection.""" + contract = load_runtime_contract() + errors = validate_runtime_contract(contract, strict=strict) + return { + "valid": not errors, + "strict": strict, + "errors": errors, + "contract": contract, + } diff --git a/src/mes_dashboard/core/utils.py b/src/mes_dashboard/core/utils.py index f744208..dc1fb2a 100644 --- a/src/mes_dashboard/core/utils.py +++ b/src/mes_dashboard/core/utils.py @@ -33,6 +33,22 @@ def get_days_back(filters: Optional[Dict] = None, default: int = DEFAULT_DAYS_BA return default +def parse_bool_query(value: Any, default: bool = False) -> bool: + """Parse common boolean query parameter values.""" + if value is None: + return default + if isinstance(value, bool): + return value + text = str(value).strip().lower() + if not text: + return default + if text in {"true", "1", "yes", "y", "on"}: + return True + if text in {"false", "0", "no", "n", "off"}: + return False + return default + + # ============================================================ # SQL Filter Building (DEPRECATED) # Use mes_dashboard.sql.CommonFilters with QueryBuilder instead. 
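The `parse_bool_query` helper added to `core/utils.py` above accepts the usual truthy/falsy spellings and falls back to a caller-supplied default for anything unrecognized. A self-contained sketch of its contract — the function body mirrors the diff; the assertions are illustrative, not part of the patch:

```python
from typing import Any


def parse_bool_query(value: Any, default: bool = False) -> bool:
    """Parse common boolean query parameter values (mirrors the diff above)."""
    if value is None:
        return default
    if isinstance(value, bool):
        return value
    text = str(value).strip().lower()
    if not text:
        return default
    if text in {"true", "1", "yes", "y", "on"}:
        return True
    if text in {"false", "0", "no", "n", "off"}:
        return False
    return default


# Accepted spellings are case-insensitive and whitespace-tolerant.
assert parse_bool_query("Yes") is True
assert parse_bool_query(" 0 ") is False
assert parse_bool_query(None) is False
# Unknown text falls back to the default rather than raising.
assert parse_bool_query("maybe", default=True) is True
```

Returning the default for unknown input (instead of raising) keeps query-string parsing forgiving, which suits dashboard routes where a malformed flag should not 500 the request.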
diff --git a/src/mes_dashboard/core/worker_recovery_policy.py b/src/mes_dashboard/core/worker_recovery_policy.py new file mode 100644 index 0000000..faacb80 --- /dev/null +++ b/src/mes_dashboard/core/worker_recovery_policy.py @@ -0,0 +1,220 @@ +# -*- coding: utf-8 -*- +"""Worker restart policy helpers (cooldown, retry budget, churn guard).""" + +from __future__ import annotations + +import json +import os +from datetime import datetime, timezone +from pathlib import Path +from typing import Any, Mapping + +from mes_dashboard.core.runtime_contract import load_runtime_contract + + +def _env_int(name: str, default: int) -> int: + try: + return int(os.getenv(name, str(default))) + except (TypeError, ValueError): + return default + + +def _env_bool(name: str, default: bool) -> bool: + raw = os.getenv(name) + if raw is None: + return default + return raw.strip().lower() in {"1", "true", "yes", "on"} + + +def _parse_iso(ts: str | None) -> datetime | None: + if not ts: + return None + try: + value = datetime.fromisoformat(ts) + except (TypeError, ValueError): + return None + if value.tzinfo is None: + value = value.replace(tzinfo=timezone.utc) + return value + + +def _utc_now() -> datetime: + return datetime.now(timezone.utc) + + +def get_worker_recovery_policy_config() -> dict[str, Any]: + """Return effective worker restart policy config.""" + retry_budget = _env_int("WORKER_RESTART_RETRY_BUDGET", 3) + churn_threshold = _env_int( + "WORKER_RESTART_CHURN_THRESHOLD", + _env_int("RESILIENCE_RESTART_CHURN_THRESHOLD", retry_budget), + ) + window_seconds = _env_int( + "WORKER_RESTART_WINDOW_SECONDS", + _env_int("RESILIENCE_RESTART_CHURN_WINDOW_SECONDS", 600), + ) + return { + "cooldown_seconds": max(_env_int("WORKER_RESTART_COOLDOWN", 60), 1), + "retry_budget": max(retry_budget, 1), + "window_seconds": max(window_seconds, 30), + "churn_threshold": max(churn_threshold, 1), + "guarded_mode_enabled": _env_bool("WORKER_GUARDED_MODE_ENABLED", True), + } + + +def 
load_restart_state(path: str | None = None) -> dict[str, Any]: + """Load persisted restart state from runtime contract state file.""" + state_path = Path(path or load_runtime_contract()["watchdog_state_file"]) + if not state_path.exists(): + return {} + try: + return json.loads(state_path.read_text()) + except (json.JSONDecodeError, IOError): + return {} + + +def extract_restart_history(state: Mapping[str, Any] | None = None) -> list[dict[str, Any]]: + """Extract bounded restart history from persisted state.""" + payload = dict(state or {}) + raw_history = payload.get("history") + if not isinstance(raw_history, list): + return [] + return [item for item in raw_history if isinstance(item, dict)][-50:] + + +def extract_last_requested_at(state: Mapping[str, Any] | None = None) -> str | None: + """Extract last requested timestamp from persisted state.""" + payload = dict(state or {}) + last_restart = payload.get("last_restart") or {} + if not isinstance(last_restart, dict): + return None + value = last_restart.get("requested_at") + return str(value) if value else None + + +def evaluate_worker_recovery_state( + history: list[dict[str, Any]] | None, + *, + last_requested_at: str | None = None, + now: datetime | None = None, +) -> dict[str, Any]: + """Evaluate restart policy state for automated/manual recovery decisions.""" + cfg = get_worker_recovery_policy_config() + now_dt = now or _utc_now() + window_seconds = int(cfg["window_seconds"]) + cooldown_seconds = int(cfg["cooldown_seconds"]) + + recent_attempts = 0 + for item in history or []: + requested = _parse_iso(item.get("requested_at")) + completed = _parse_iso(item.get("completed_at")) + ts = requested or completed + if ts is None: + continue + age = (now_dt - ts).total_seconds() + if age <= window_seconds: + recent_attempts += 1 + + retry_budget = int(cfg["retry_budget"]) + churn_threshold = int(cfg["churn_threshold"]) + retry_budget_exhausted = recent_attempts >= retry_budget + churn_exceeded = recent_attempts >= 
churn_threshold + guarded_mode = bool(cfg["guarded_mode_enabled"] and (retry_budget_exhausted or churn_exceeded)) + + cooldown_active = False + cooldown_remaining = 0 + last_requested_dt = _parse_iso(last_requested_at) + if last_requested_dt is not None: + elapsed = (now_dt - last_requested_dt).total_seconds() + if elapsed < cooldown_seconds: + cooldown_active = True + cooldown_remaining = int(max(cooldown_seconds - elapsed, 0)) + + blocked = guarded_mode + allowed = not blocked and not cooldown_active + + state = "allowed" + if blocked: + state = "blocked" + elif cooldown_active: + state = "cooldown" + + return { + "state": state, + "allowed": allowed, + "cooldown": cooldown_active, + "cooldown_remaining_seconds": cooldown_remaining, + "blocked": blocked, + "guarded_mode": guarded_mode, + "retry_budget_exhausted": retry_budget_exhausted, + "churn_exceeded": churn_exceeded, + "attempts_in_window": recent_attempts, + "retry_budget": retry_budget, + "churn_threshold": churn_threshold, + "window_seconds": window_seconds, + "cooldown_seconds": cooldown_seconds, + } + + +def decide_restart_request( + policy_state: Mapping[str, Any], + *, + source: str, + manual_override: bool = False, + override_acknowledged: bool = False, +) -> dict[str, Any]: + """Decide whether restart request is allowed under current policy state.""" + state = dict(policy_state or {}) + blocked = bool(state.get("blocked")) + cooldown = bool(state.get("cooldown")) + source_value = (source or "manual").strip().lower() + + if source_value not in {"auto", "manual"}: + source_value = "manual" + + if source_value == "auto": + if blocked: + return { + "allowed": False, + "decision": "blocked", + "reason": "guarded_mode_blocked", + "requires_acknowledgement": False, + } + if cooldown: + return { + "allowed": False, + "decision": "blocked", + "reason": "cooldown_active", + "requires_acknowledgement": False, + } + return { + "allowed": True, + "decision": "allowed", + "reason": "policy_allows_auto_restart", + 
"requires_acknowledgement": False, + } + + if (blocked or cooldown) and not (manual_override and override_acknowledged): + reason = "manual_override_required" if blocked else "cooldown_override_required" + return { + "allowed": False, + "decision": "blocked", + "reason": reason, + "requires_acknowledgement": True, + } + + if manual_override and override_acknowledged: + return { + "allowed": True, + "decision": "manual_override", + "reason": "operator_override_acknowledged", + "requires_acknowledgement": False, + } + + return { + "allowed": True, + "decision": "allowed", + "reason": "policy_allows_manual_restart", + "requires_acknowledgement": False, + } + diff --git a/src/mes_dashboard/routes/admin_routes.py b/src/mes_dashboard/routes/admin_routes.py index e50cf85..50f726f 100644 --- a/src/mes_dashboard/routes/admin_routes.py +++ b/src/mes_dashboard/routes/admin_routes.py @@ -3,12 +3,13 @@ from __future__ import annotations -import json -import logging -import os -import time -from datetime import datetime -from pathlib import Path +import json +import logging +import os +import time +from datetime import datetime, timezone +from pathlib import Path +from typing import Any from flask import Blueprint, g, jsonify, render_template, request @@ -19,6 +20,17 @@ from mes_dashboard.core.resilience import ( get_resilience_thresholds, summarize_restart_history, ) +from mes_dashboard.core.runtime_contract import ( + build_runtime_contract_diagnostics, + load_runtime_contract, +) +from mes_dashboard.core.worker_recovery_policy import ( + decide_restart_request, + evaluate_worker_recovery_state, + extract_last_requested_at, + extract_restart_history, + load_restart_state, +) from mes_dashboard.services.page_registry import get_all_pages, set_page_status admin_bp = Blueprint("admin", __name__, url_prefix="/admin") @@ -28,21 +40,13 @@ logger = logging.getLogger("mes_dashboard.admin") # Worker Restart Configuration # ============================================================ 
-WATCHDOG_RUNTIME_DIR = os.getenv("WATCHDOG_RUNTIME_DIR", "/tmp") -RESTART_FLAG_PATH = os.getenv( - "WATCHDOG_RESTART_FLAG", - f"{WATCHDOG_RUNTIME_DIR}/mes_dashboard_restart.flag" -) -RESTART_STATE_PATH = os.getenv( - "WATCHDOG_STATE_FILE", - f"{WATCHDOG_RUNTIME_DIR}/mes_dashboard_restart_state.json" -) -WATCHDOG_PID_PATH = os.getenv( - "WATCHDOG_PID_FILE", - f"{WATCHDOG_RUNTIME_DIR}/gunicorn.pid" -) -GUNICORN_BIND = os.getenv("GUNICORN_BIND", "0.0.0.0:8080") -RESTART_COOLDOWN_SECONDS = int(os.getenv("WORKER_RESTART_COOLDOWN", "60")) +_RUNTIME_CONTRACT = load_runtime_contract() +WATCHDOG_RUNTIME_DIR = _RUNTIME_CONTRACT["watchdog_runtime_dir"] +RESTART_FLAG_PATH = _RUNTIME_CONTRACT["watchdog_restart_flag"] +RESTART_STATE_PATH = _RUNTIME_CONTRACT["watchdog_state_file"] +WATCHDOG_PID_PATH = _RUNTIME_CONTRACT["watchdog_pid_file"] +GUNICORN_BIND = _RUNTIME_CONTRACT["gunicorn_bind"] +RUNTIME_CONTRACT_VERSION = _RUNTIME_CONTRACT["version"] # Track last restart request time (in-memory for this worker) _last_restart_request: float = 0.0 @@ -91,7 +95,9 @@ def api_system_status(): thresholds = get_resilience_thresholds() restart_state = _get_restart_state() restart_churn = _get_restart_churn_summary(restart_state) - in_cooldown, remaining = _check_restart_cooldown() + policy_state = _get_restart_policy_state(restart_state) + in_cooldown = bool(policy_state.get("cooldown")) + remaining = int(policy_state.get("cooldown_remaining_seconds") or 0) degraded_reason = None if db_status == "error": @@ -111,6 +117,14 @@ def api_system_status(): restart_churn_exceeded=bool(restart_churn.get("exceeded")), cooldown_active=in_cooldown, ) + alerts = _build_restart_alerts( + pool_saturation=(pool_state or {}).get("saturation"), + circuit_state=circuit_breaker.get("state"), + route_cache_degraded=bool(route_cache.get("degraded")), + policy_state=policy_state, + thresholds=thresholds, + ) + runtime_contract = build_runtime_contract_diagnostics(strict=False) # Cache status from 
mes_dashboard.routes.health_routes import ( @@ -142,13 +156,22 @@ def api_system_status(): "pool_state": pool_state, "route_cache": route_cache, "thresholds": thresholds, + "alerts": alerts, "restart_churn": restart_churn, + "policy_state": { + "state": policy_state.get("state"), + "allowed": policy_state.get("allowed"), + "cooldown": policy_state.get("cooldown"), + "blocked": policy_state.get("blocked"), + "cooldown_remaining_seconds": remaining, + }, "recovery_recommendation": recommendation, "restart_cooldown": { "active": in_cooldown, - "remaining_seconds": int(remaining) if in_cooldown else 0, + "remaining_seconds": remaining if in_cooldown else 0, }, }, + "runtime_contract": runtime_contract, "single_port_bind": GUNICORN_BIND, "worker_pid": os.getpid() } @@ -281,55 +304,33 @@ def api_logs_cleanup(): # Worker Restart Control Routes # ============================================================ -def _get_restart_state() -> dict: - """Read worker restart state from file.""" - state_path = Path(RESTART_STATE_PATH) - if not state_path.exists(): - return {} - try: - return json.loads(state_path.read_text()) - except (json.JSONDecodeError, IOError): - return {} +def _get_restart_state() -> dict: + """Read worker restart state from file.""" + return load_restart_state(RESTART_STATE_PATH) + + +def _iso_from_epoch(ts: float) -> str | None: + if ts <= 0: + return None + return datetime.fromtimestamp(ts, tz=timezone.utc).isoformat() def _check_restart_cooldown() -> tuple[bool, float]: - """Check if restart is in cooldown. + """Check if restart is in cooldown. Returns: Tuple of (is_in_cooldown, remaining_seconds). 
""" - global _last_restart_request - - # Check in-memory cooldown first - now = time.time() - elapsed = now - _last_restart_request - if elapsed < RESTART_COOLDOWN_SECONDS: - return True, RESTART_COOLDOWN_SECONDS - elapsed - - # Check file-based state (for cross-worker coordination) - state = _get_restart_state() - last_restart = state.get("last_restart", {}) - requested_at = last_restart.get("requested_at") - - if requested_at: - try: - request_time = datetime.fromisoformat(requested_at).timestamp() - elapsed = now - request_time - if elapsed < RESTART_COOLDOWN_SECONDS: - return True, RESTART_COOLDOWN_SECONDS - elapsed - except (ValueError, TypeError): - pass - + policy = _get_restart_policy_state() + if policy.get("cooldown"): + return True, float(policy.get("cooldown_remaining_seconds") or 0.0) return False, 0.0 def _get_restart_history(state: dict | None = None) -> list[dict]: """Return bounded restart history for admin telemetry.""" payload = state if state is not None else _get_restart_state() - raw_history = payload.get("history") or [] - if not isinstance(raw_history, list): - return [] - return raw_history[-20:] + return extract_restart_history(payload)[-20:] def _get_restart_churn_summary(state: dict | None = None) -> dict: @@ -338,27 +339,63 @@ def _get_restart_churn_summary(state: dict | None = None) -> dict: return summarize_restart_history(history) -def _worker_recovery_hint(churn: dict, cooldown_active: bool) -> dict: - """Build worker control recommendation from churn/cooldown state.""" - if churn.get("exceeded"): - return { - "action": "throttle_and_investigate_queries", - "reason": "restart_churn_exceeded", - } - if cooldown_active: - return { - "action": "wait_for_restart_cooldown", - "reason": "restart_cooldown_active", - } +def _get_restart_policy_state(state: dict | None = None) -> dict[str, Any]: + """Return effective worker restart policy state.""" + payload = state if state is not None else _get_restart_state() + history = 
_get_restart_history(payload) + last_requested = extract_last_requested_at(payload) + + in_memory_requested = _iso_from_epoch(_last_restart_request) + if in_memory_requested: + try: + in_memory_dt = datetime.fromisoformat(in_memory_requested) + persisted_dt = datetime.fromisoformat(last_requested) if last_requested else None + except (TypeError, ValueError): + in_memory_dt = None + persisted_dt = None + if in_memory_dt and (persisted_dt is None or in_memory_dt > persisted_dt): + last_requested = in_memory_requested + + return evaluate_worker_recovery_state( + history, + last_requested_at=last_requested, + ) + + +def _build_restart_alerts( + *, + pool_saturation: float | None, + circuit_state: str | None, + route_cache_degraded: bool, + policy_state: dict[str, Any], + thresholds: dict[str, Any], +) -> dict[str, Any]: + saturation = float(pool_saturation or 0.0) + warning = float(thresholds.get("pool_saturation_warning", 0.9)) + critical = float(thresholds.get("pool_saturation_critical", 1.0)) return { - "action": "restart_available", - "reason": "no_churn_or_cooldown", + "pool_warning": saturation >= warning, + "pool_critical": saturation >= critical, + "circuit_open": circuit_state == "OPEN", + "route_cache_degraded": bool(route_cache_degraded), + "restart_churn_exceeded": bool(policy_state.get("churn_exceeded")), + "restart_blocked": bool(policy_state.get("blocked")), } + + +def _log_restart_audit(event: str, payload: dict[str, Any]) -> None: + entry = { + "event": event, + "timestamp": datetime.now(tz=timezone.utc).isoformat(), + "runtime_contract_version": RUNTIME_CONTRACT_VERSION, + **payload, + } + logger.info("worker_restart_audit %s", json.dumps(entry, ensure_ascii=False)) -@admin_bp.route("/api/worker/restart", methods=["POST"]) -@admin_required -def api_worker_restart(): +@admin_bp.route("/api/worker/restart", methods=["POST"]) +@admin_required +def api_worker_restart(): """API: Request worker restart. 
Writes a restart flag file that the watchdog process monitors. @@ -366,52 +403,118 @@ def api_worker_restart(): """ global _last_restart_request - # Check cooldown - in_cooldown, remaining = _check_restart_cooldown() - if in_cooldown: - return error_response( - TOO_MANY_REQUESTS, - f"Restart in cooldown. Please wait {int(remaining)} seconds.", - status_code=429 - ) - - # Get request metadata - user = getattr(g, "username", "unknown") - ip = request.remote_addr or "unknown" - timestamp = datetime.now().isoformat() - - # Write restart flag file - flag_path = Path(RESTART_FLAG_PATH) - flag_data = { - "user": user, - "ip": ip, - "timestamp": timestamp, - "worker_pid": os.getpid() - } - - try: - flag_path.write_text(json.dumps(flag_data)) - except IOError as e: - logger.error(f"Failed to write restart flag: {e}") - return error_response( - "RESTART_FAILED", - f"Failed to request restart: {e}", - status_code=500 - ) + payload = request.get_json(silent=True) or {} + manual_override = bool(payload.get("manual_override")) + override_acknowledged = bool(payload.get("override_acknowledged")) + override_reason = str(payload.get("override_reason") or "").strip() + + # Get request metadata + user = getattr(g, "username", "unknown") + ip = request.remote_addr or "unknown" + timestamp = datetime.now(tz=timezone.utc).isoformat() + + state = _get_restart_state() + policy_state = _get_restart_policy_state(state) + decision = decide_restart_request( + policy_state, + source="manual", + manual_override=manual_override, + override_acknowledged=override_acknowledged, + ) + + if manual_override and not override_reason: + return error_response( + "RESTART_OVERRIDE_REASON_REQUIRED", + "Manual override requires non-empty override_reason for audit traceability.", + status_code=400, + ) + + if not decision["allowed"]: + status_code = 429 if policy_state.get("cooldown") else 409 + if status_code == 429: + message = ( + f"Restart in cooldown. 
Please wait " + f"{int(policy_state.get('cooldown_remaining_seconds') or 0)} seconds." + ) + code = TOO_MANY_REQUESTS + else: + message = ( + "Restart blocked by guarded mode. " + "Set manual_override=true and override_acknowledged=true to proceed." + ) + code = "RESTART_POLICY_BLOCKED" + _log_restart_audit( + "restart_request_blocked", + { + "actor": user, + "ip": ip, + "decision": decision, + "policy_state": policy_state, + }, + ) + return error_response( + code, + message, + status_code=status_code, + ) + + # Write restart flag file + flag_path = Path(RESTART_FLAG_PATH) + flag_data = { + "user": user, + "ip": ip, + "timestamp": timestamp, + "worker_pid": os.getpid(), + "source": "manual", + "manual_override": bool(manual_override and override_acknowledged), + "override_acknowledged": override_acknowledged, + "override_reason": override_reason or None, + "policy_state": policy_state, + "policy_decision": decision["decision"], + "runtime_contract_version": RUNTIME_CONTRACT_VERSION, + } + + try: + flag_path.parent.mkdir(parents=True, exist_ok=True) + tmp_path = flag_path.with_suffix(flag_path.suffix + ".tmp") + tmp_path.write_text(json.dumps(flag_data, ensure_ascii=False)) + tmp_path.replace(flag_path) + except IOError as e: + logger.error(f"Failed to write restart flag: {e}") + return error_response( + "RESTART_FAILED", + f"Failed to request restart: {e}", + status_code=500 + ) # Update in-memory cooldown _last_restart_request = time.time() - logger.info( - f"Worker restart requested by {user} from {ip}" - ) - + _log_restart_audit( + "restart_request_accepted", + { + "actor": user, + "ip": ip, + "decision": decision, + "policy_state": policy_state, + "override_reason": override_reason or None, + }, + ) + return jsonify({ "success": True, "data": { "message": "Restart requested. 
Workers will reload shortly.", "requested_by": user, "requested_at": timestamp, + "policy_state": { + "state": policy_state.get("state"), + "allowed": policy_state.get("allowed"), + "cooldown": policy_state.get("cooldown"), + "blocked": policy_state.get("blocked"), + "cooldown_remaining_seconds": policy_state.get("cooldown_remaining_seconds"), + }, + "decision": decision, "single_port_bind": GUNICORN_BIND, "watchdog": { "runtime_dir": WATCHDOG_RUNTIME_DIR, @@ -425,18 +528,23 @@ def api_worker_restart(): @admin_bp.route("/api/worker/status", methods=["GET"]) @admin_required -def api_worker_status(): - """API: Get worker status and restart information.""" - # Check cooldown - in_cooldown, remaining = _check_restart_cooldown() - +def api_worker_status(): + """API: Get worker status and restart information.""" # Get last restart info state = _get_restart_state() last_restart = state.get("last_restart", {}) history = _get_restart_history(state) churn = _get_restart_churn_summary(state) + policy_state = _get_restart_policy_state(state) thresholds = get_resilience_thresholds() - recommendation = _worker_recovery_hint(churn, in_cooldown) + recommendation = build_recovery_recommendation( + degraded_reason="db_pool_saturated" if policy_state.get("blocked") else None, + pool_saturation=None, + circuit_state=None, + restart_churn_exceeded=bool(churn.get("exceeded")), + cooldown_active=bool(policy_state.get("cooldown")), + ) + runtime_contract = build_runtime_contract_diagnostics(strict=False) # Get worker start time (psutil is optional) worker_start_time = None @@ -466,6 +574,11 @@ def api_worker_status(): "worker_pid": os.getpid(), "worker_start_time": worker_start_time, "runtime_contract": { + "version": runtime_contract["contract"]["version"], + "validation": { + "valid": runtime_contract["valid"], + "errors": runtime_contract["errors"], + }, "single_port_bind": GUNICORN_BIND, "watchdog": { "runtime_dir": WATCHDOG_RUNTIME_DIR, @@ -478,12 +591,27 @@ def api_worker_status(): 
}, }, "cooldown": { - "active": in_cooldown, - "remaining_seconds": int(remaining) if in_cooldown else 0 + "active": bool(policy_state.get("cooldown")), + "remaining_seconds": int(policy_state.get("cooldown_remaining_seconds") or 0) }, "resilience": { "thresholds": thresholds, + "alerts": { + "restart_churn_exceeded": bool(churn.get("exceeded")), + "restart_blocked": bool(policy_state.get("blocked")), + }, "restart_churn": churn, + "policy_state": { + "state": policy_state.get("state"), + "allowed": policy_state.get("allowed"), + "cooldown": policy_state.get("cooldown"), + "blocked": policy_state.get("blocked"), + "cooldown_remaining_seconds": policy_state.get("cooldown_remaining_seconds"), + "attempts_in_window": policy_state.get("attempts_in_window"), + "retry_budget": policy_state.get("retry_budget"), + "churn_threshold": policy_state.get("churn_threshold"), + "window_seconds": policy_state.get("window_seconds"), + }, "recovery_recommendation": recommendation, }, "restart_history": history, diff --git a/src/mes_dashboard/routes/auth_routes.py b/src/mes_dashboard/routes/auth_routes.py index a320701..21a4060 100644 --- a/src/mes_dashboard/routes/auth_routes.py +++ b/src/mes_dashboard/routes/auth_routes.py @@ -9,9 +9,10 @@ from collections import defaultdict from datetime import datetime from threading import Lock -from flask import Blueprint, flash, redirect, render_template, request, session, url_for - -from mes_dashboard.services.auth_service import authenticate, is_admin +from flask import Blueprint, flash, redirect, render_template, request, session, url_for + +from mes_dashboard.core.csrf import rotate_csrf_token +from mes_dashboard.services.auth_service import authenticate, is_admin logger = logging.getLogger('mes_dashboard.auth_routes') auth_bp = Blueprint("auth", __name__, url_prefix="/admin") @@ -89,25 +90,27 @@ def login(): user = authenticate(username, password) if user is None: error = "帳號或密碼錯誤" - elif not is_admin(user): - error = "您不是管理員,無法登入後台" - 
else: - # Login successful - session["admin"] = { - "username": user.get("username"), - "displayName": user.get("displayName"), - "mail": user.get("mail"), - "department": user.get("department"), - "login_time": datetime.now().isoformat(), - } - next_url = request.args.get("next", url_for("portal_index")) - return redirect(next_url) + elif not is_admin(user): + error = "您不是管理員,無法登入後台" + else: + # Login successful + session.clear() + session["admin"] = { + "username": user.get("username"), + "displayName": user.get("displayName"), + "mail": user.get("mail"), + "department": user.get("department"), + "login_time": datetime.now().isoformat(), + } + rotate_csrf_token() + next_url = request.args.get("next", url_for("portal_index")) + return redirect(next_url) return render_template("login.html", error=error) @auth_bp.route("/logout") -def logout(): - """Admin logout.""" - session.pop("admin", None) - return redirect(url_for("portal_index")) +def logout(): + """Admin logout.""" + session.clear() + return redirect(url_for("portal_index")) diff --git a/src/mes_dashboard/routes/health_routes.py b/src/mes_dashboard/routes/health_routes.py index 2f34db3..ba5dafe 100644 --- a/src/mes_dashboard/routes/health_routes.py +++ b/src/mes_dashboard/routes/health_routes.py @@ -6,13 +6,15 @@ Provides /health and /health/deep endpoints for monitoring service status. 
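The login hardening in the hunk above clears all pre-authentication session state and rotates the CSRF token after a successful login (a standard session-fixation defense). The diff imports `rotate_csrf_token` from `mes_dashboard.core.csrf`, which is not shown here; the sketch below is a minimal standalone stand-in using a plain dict in place of Flask's session object, with an assumed token format — not the project's actual implementation:

```python
import secrets


def rotate_csrf_token(session: dict) -> str:
    """Issue a fresh CSRF token, invalidating any token minted pre-login."""
    token = secrets.token_urlsafe(32)
    session["csrf_token"] = token
    return token


def login_success(session: dict, user: dict) -> None:
    # Drop all pre-authentication state (session fixation defense),
    # then establish the authenticated identity and a new CSRF token.
    session.clear()
    session["admin"] = {"username": user.get("username")}
    rotate_csrf_token(session)
```

The ordering matters: `clear()` must run before the identity is written, so nothing attacker-seeded survives into the authenticated session.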
from __future__ import annotations -import logging -import time -from datetime import datetime, timedelta -from flask import Blueprint, jsonify, make_response +import logging +import os +import threading +import time +from datetime import datetime, timedelta +from flask import Blueprint, current_app, jsonify, make_response from mes_dashboard.core.database import ( - get_engine, + get_health_engine, get_pool_runtime_config, get_pool_status, ) @@ -28,6 +30,15 @@ from mes_dashboard.core.cache import ( from mes_dashboard.core.resilience import ( build_recovery_recommendation, get_resilience_thresholds, + summarize_restart_history, +) +from mes_dashboard.core.runtime_contract import build_runtime_contract_diagnostics +from mes_dashboard.core.worker_recovery_policy import ( + evaluate_worker_recovery_state, + extract_last_requested_at, + extract_restart_history, + get_worker_recovery_policy_config, + load_restart_state, ) from sqlalchemy import text @@ -39,8 +50,63 @@ health_bp = Blueprint('health', __name__) # Warning Thresholds # ============================================================ -DB_LATENCY_WARNING_MS = 100 # Database latency > 100ms is slow +DB_LATENCY_WARNING_MS = 100 # Database latency > 100ms is slow CACHE_STALE_MINUTES = 2 # Cache update > 2 minutes is stale +HEALTH_MEMO_TTL_SECONDS = int(os.getenv("HEALTH_MEMO_TTL_SECONDS", "5")) + +_HEALTH_MEMO_LOCK = threading.Lock() +_HEALTH_MEMO: dict[str, dict | None] = { + "health": None, + "deep": None, +} + + +def _health_memo_enabled() -> bool: + if HEALTH_MEMO_TTL_SECONDS <= 0: + return False + if current_app.testing or bool(current_app.config.get("TESTING")): + return False + return True + + +def _get_health_memo(cache_key: str) -> tuple[dict, int] | None: + if not _health_memo_enabled(): + return None + now = time.time() + with _HEALTH_MEMO_LOCK: + entry = _HEALTH_MEMO.get(cache_key) + if not entry: + return None + if now - float(entry.get("ts", 0.0)) > HEALTH_MEMO_TTL_SECONDS: + _HEALTH_MEMO[cache_key] = 
None + return None + return entry["payload"], int(entry["status"]) + + +def _set_health_memo(cache_key: str, payload: dict, status_code: int) -> None: + if not _health_memo_enabled(): + return + with _HEALTH_MEMO_LOCK: + _HEALTH_MEMO[cache_key] = { + "ts": time.time(), + "payload": payload, + "status": int(status_code), + } + + +def _build_health_response(payload: dict, status_code: int): + """Build JSON response with explicit no-cache headers.""" + resp = make_response(jsonify(payload), status_code) + resp.headers['Cache-Control'] = 'no-cache, no-store, must-revalidate' + resp.headers['Pragma'] = 'no-cache' + resp.headers['Expires'] = '0' + return resp + + +def _reset_health_memo_for_tests() -> None: + with _HEALTH_MEMO_LOCK: + _HEALTH_MEMO["health"] = None + _HEALTH_MEMO["deep"] = None def _classify_degraded_reason( @@ -63,18 +129,60 @@ def _classify_degraded_reason( return None +def _build_resilience_alerts( + *, + pool_saturation: float | None, + circuit_state: str | None, + route_cache_degraded: bool, + restart_churn_exceeded: bool, + restart_blocked: bool, + thresholds: dict, +) -> dict: + saturation = float(pool_saturation or 0.0) + warning = float(thresholds.get("pool_saturation_warning", 0.9)) + critical = float(thresholds.get("pool_saturation_critical", 1.0)) + return { + "pool_warning": saturation >= warning, + "pool_critical": saturation >= critical, + "circuit_open": circuit_state == "OPEN", + "route_cache_degraded": bool(route_cache_degraded), + "restart_churn_exceeded": bool(restart_churn_exceeded), + "restart_blocked": bool(restart_blocked), + } + + +def get_worker_recovery_status() -> dict: + """Build worker recovery policy status for health/admin telemetry.""" + state = load_restart_state() + history = extract_restart_history(state) + policy_state = evaluate_worker_recovery_state( + history, + last_requested_at=extract_last_requested_at(state), + ) + churn = summarize_restart_history( + history, + 
window_seconds=int(policy_state.get("window_seconds") or 600), + threshold=int(policy_state.get("churn_threshold") or 3), + ) + return { + "policy_state": policy_state, + "restart_churn": churn, + "policy_config": get_worker_recovery_policy_config(), + } + + def check_database() -> tuple[str, str | None]: """Check database connectivity. Returns: Tuple of (status, error_message). status is 'ok' or 'error'. - """ - try: - engine = get_engine() - with engine.connect() as conn: - conn.execute(text("SELECT 1 FROM DUAL")) - return 'ok', None + """ + try: + engine = get_health_engine() + with engine.connect() as conn: + conn.execute(text("SELECT 1 FROM DUAL")) + return 'ok', None except Exception as e: logger.error(f"Database health check failed: {e}") return 'error', str(e) @@ -111,13 +219,21 @@ def get_cache_status() -> dict: status = { 'enabled': REDIS_ENABLED, 'sys_date': get_cached_sys_date(), - 'updated_at': get_cache_updated_at() + 'updated_at': get_cache_updated_at(), + 'derived_search_index': {}, + 'derived_frame_snapshot': {}, + 'index_metrics': {}, + 'memory': {}, } try: from mes_dashboard.services.wip_service import get_wip_search_index_status - status['derived_search_index'] = get_wip_search_index_status() + derived = get_wip_search_index_status() + status['derived_search_index'] = derived.get('derived_search_index', {}) + status['derived_frame_snapshot'] = derived.get('derived_frame_snapshot', {}) + status['index_metrics'] = derived.get('metrics', {}) + status['memory'] = derived.get('memory', {}) except Exception: - status['derived_search_index'] = {} + pass return status @@ -201,10 +317,15 @@ def get_workcenter_mapping_status() -> dict: def health_check(): """Health check endpoint. 
- Returns: - - 200 OK: All services healthy or degraded (Redis down but DB ok) - - 503 Service Unavailable: Database unhealthy - """ + Returns: + - 200 OK: All services healthy or degraded (Redis down but DB ok) + - 503 Service Unavailable: Database unhealthy + """ + cached = _get_health_memo("health") + if cached is not None: + payload, status_code = cached + return _build_health_response(payload, status_code) + from mes_dashboard.core.circuit_breaker import get_circuit_breaker_status db_status, db_error = check_database() @@ -266,13 +387,25 @@ def health_check(): warnings.append(f"Database pool saturation is high ({saturation:.0%})") thresholds = get_resilience_thresholds() + worker_recovery = get_worker_recovery_status() + policy_state = worker_recovery.get("policy_state", {}) + restart_churn = worker_recovery.get("restart_churn", {}) recommendation = build_recovery_recommendation( degraded_reason=degraded_reason, pool_saturation=pool_saturation, circuit_state=circuit_breaker.get('state'), - restart_churn_exceeded=False, - cooldown_active=False, + restart_churn_exceeded=bool(restart_churn.get("exceeded")), + cooldown_active=bool(policy_state.get("cooldown")), ) + alerts = _build_resilience_alerts( + pool_saturation=pool_saturation, + circuit_state=circuit_breaker.get("state"), + route_cache_degraded=bool(route_cache.get("degraded")), + restart_churn_exceeded=bool(restart_churn.get("exceeded")), + restart_blocked=bool(policy_state.get("blocked")), + thresholds=thresholds, + ) + runtime_contract = build_runtime_contract_diagnostics(strict=False) # Check equipment status cache equipment_status_cache = get_equipment_status_cache_status() @@ -293,8 +426,18 @@ def health_check(): }, 'resilience': { 'thresholds': thresholds, + 'alerts': alerts, + 'policy_state': { + 'state': policy_state.get("state"), + 'allowed': policy_state.get("allowed"), + 'cooldown': policy_state.get("cooldown"), + 'blocked': policy_state.get("blocked"), + 'cooldown_remaining_seconds': 
policy_state.get("cooldown_remaining_seconds"), + }, + 'restart_churn': restart_churn, 'recovery_recommendation': recommendation, }, + 'runtime_contract': runtime_contract, 'cache': get_cache_status(), 'route_cache': route_cache, 'resource_cache': resource_cache, @@ -307,12 +450,8 @@ def health_check(): if warnings: response['warnings'] = warnings - # Add no-cache headers to prevent browser caching - resp = make_response(jsonify(response), http_code) - resp.headers['Cache-Control'] = 'no-cache, no-store, must-revalidate' - resp.headers['Pragma'] = 'no-cache' - resp.headers['Expires'] = '0' - return resp + _set_health_memo("health", response, http_code) + return _build_health_response(response, http_code) @health_bp.route('/health/deep', methods=['GET']) @@ -330,9 +469,14 @@ def deep_health_check(): from mes_dashboard.core.metrics import get_metrics_summary from flask import redirect, url_for, request - # Require admin authentication - redirect to login for consistency - if not is_admin_logged_in(): - return redirect(url_for("auth.login", next=request.url)) + # Require admin authentication - redirect to login for consistency + if not is_admin_logged_in(): + return redirect(url_for("auth.login", next=request.url)) + + cached = _get_health_memo("deep") + if cached is not None: + payload, status_code = cached + return _build_health_response(payload, status_code) # Check database with latency measurement db_start = time.time() @@ -397,6 +541,9 @@ def deep_health_check(): warnings.append(f"Database pool saturation is high ({pool_saturation:.0%})") thresholds = get_resilience_thresholds() + worker_recovery = get_worker_recovery_status() + policy_state = worker_recovery.get("policy_state", {}) + restart_churn = worker_recovery.get("restart_churn", {}) degraded_reason = _classify_degraded_reason( db_status=db_status, redis_status=redis_status, @@ -408,9 +555,18 @@ def deep_health_check(): degraded_reason=degraded_reason, pool_saturation=pool_saturation, 
circuit_state=circuit_breaker.get('state'), - restart_churn_exceeded=False, - cooldown_active=False, + restart_churn_exceeded=bool(restart_churn.get("exceeded")), + cooldown_active=bool(policy_state.get("cooldown")), ) + alerts = _build_resilience_alerts( + pool_saturation=pool_saturation, + circuit_state=circuit_breaker.get("state"), + route_cache_degraded=bool(route_cache.get("degraded")), + restart_churn_exceeded=bool(restart_churn.get("exceeded")), + restart_blocked=bool(policy_state.get("blocked")), + thresholds=thresholds, + ) + runtime_contract = build_runtime_contract_diagnostics(strict=False) # Check latency thresholds db_latency_status = 'healthy' @@ -429,8 +585,18 @@ def deep_health_check(): 'degraded_reason': degraded_reason, 'resilience': { 'thresholds': thresholds, + 'alerts': alerts, + 'policy_state': { + 'state': policy_state.get("state"), + 'allowed': policy_state.get("allowed"), + 'cooldown': policy_state.get("cooldown"), + 'blocked': policy_state.get("blocked"), + 'cooldown_remaining_seconds': policy_state.get("cooldown_remaining_seconds"), + }, + 'restart_churn': restart_churn, 'recovery_recommendation': recommendation, }, + 'runtime_contract': runtime_contract, 'checks': { 'database': { 'status': db_latency_status if db_status == 'ok' else 'error', @@ -446,7 +612,9 @@ def deep_health_check(): 'cache': { 'freshness': cache_freshness, 'updated_at': cache_updated_at, - 'sys_date': cache_status.get('sys_date') + 'sys_date': cache_status.get('sys_date'), + 'index_metrics': cache_status.get('index_metrics', {}), + 'memory': cache_status.get('memory', {}), }, 'route_cache': route_cache }, @@ -464,9 +632,5 @@ def deep_health_check(): if warnings: response['warnings'] = warnings - # Add no-cache headers - resp = make_response(jsonify(response), http_code) - resp.headers['Cache-Control'] = 'no-cache, no-store, must-revalidate' - resp.headers['Pragma'] = 'no-cache' - resp.headers['Expires'] = '0' - return resp + _set_health_memo("deep", response, 
http_code) + return _build_health_response(response, http_code) diff --git a/src/mes_dashboard/routes/hold_routes.py b/src/mes_dashboard/routes/hold_routes.py index dc0f9bc..3ff244c 100644 --- a/src/mes_dashboard/routes/hold_routes.py +++ b/src/mes_dashboard/routes/hold_routes.py @@ -4,22 +4,27 @@ Contains Flask Blueprint for Hold Detail page and API endpoints. """ -from flask import Blueprint, jsonify, request, render_template, redirect, url_for - -from mes_dashboard.services.wip_service import ( +from flask import Blueprint, jsonify, request, render_template, redirect, url_for + +from mes_dashboard.core.rate_limit import configured_rate_limit +from mes_dashboard.core.utils import parse_bool_query +from mes_dashboard.services.wip_service import ( get_hold_detail_summary, get_hold_detail_distribution, get_hold_detail_lots, is_quality_hold, ) -# Create Blueprint -hold_bp = Blueprint('hold', __name__) - - -def _parse_bool(value: str) -> bool: - """Parse boolean from query string.""" - return value.lower() in ('true', '1', 'yes') if value else False +# Create Blueprint +hold_bp = Blueprint('hold', __name__) + +_HOLD_LOTS_RATE_LIMIT = configured_rate_limit( + bucket="hold-detail-lots", + max_attempts_env="HOLD_LOTS_RATE_LIMIT_MAX_REQUESTS", + window_seconds_env="HOLD_LOTS_RATE_LIMIT_WINDOW_SECONDS", + default_max_attempts=90, + default_window_seconds=60, +) # ============================================================ @@ -64,7 +69,7 @@ def api_hold_detail_summary(): if not reason: return jsonify({'success': False, 'error': '缺少必要參數: reason'}), 400 - include_dummy = _parse_bool(request.args.get('include_dummy', '')) + include_dummy = parse_bool_query(request.args.get('include_dummy')) result = get_hold_detail_summary( reason=reason, @@ -90,7 +95,7 @@ def api_hold_detail_distribution(): if not reason: return jsonify({'success': False, 'error': '缺少必要參數: reason'}), 400 - include_dummy = _parse_bool(request.args.get('include_dummy', '')) + include_dummy = 
parse_bool_query(request.args.get('include_dummy')) result = get_hold_detail_distribution( reason=reason, @@ -101,8 +106,9 @@ def api_hold_detail_distribution(): return jsonify({'success': False, 'error': '查詢失敗'}), 500 -@hold_bp.route('/api/wip/hold-detail/lots') -def api_hold_detail_lots(): +@hold_bp.route('/api/wip/hold-detail/lots') +@_HOLD_LOTS_RATE_LIMIT +def api_hold_detail_lots(): """API: Get paginated lot details for a specific hold reason. Query Parameters: @@ -124,7 +130,7 @@ def api_hold_detail_lots(): workcenter = request.args.get('workcenter', '').strip() or None package = request.args.get('package', '').strip() or None age_range = request.args.get('age_range', '').strip() or None - include_dummy = _parse_bool(request.args.get('include_dummy', '')) + include_dummy = parse_bool_query(request.args.get('include_dummy')) page = request.args.get('page', 1, type=int) per_page = min(request.args.get('per_page', 50, type=int), 200) diff --git a/src/mes_dashboard/routes/resource_routes.py b/src/mes_dashboard/routes/resource_routes.py index 84610aa..5cf66f6 100644 --- a/src/mes_dashboard/routes/resource_routes.py +++ b/src/mes_dashboard/routes/resource_routes.py @@ -13,10 +13,12 @@ from mes_dashboard.core.database import ( DatabaseCircuitOpenError, ) from mes_dashboard.core.cache import cache_get, cache_set, make_cache_key +from mes_dashboard.core.rate_limit import configured_rate_limit +from mes_dashboard.core.utils import get_days_back, parse_bool_query def _clean_nan_values(data): - """Convert NaN and NaT values to None for JSON serialization. + """Convert NaN/NaT values to None for JSON serialization (depth-safe). Args: data: List of dicts or single dict. @@ -24,28 +26,77 @@ def _clean_nan_values(data): Returns: Cleaned data with NaN/NaT replaced by None. 
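The `configured_rate_limit` factory wired into these route modules reads its limits from environment variables with per-bucket defaults; its implementation is not part of this diff. A plausible sketch under those assumptions is a fixed-window counter — note this toy version keeps one global window per decorator, whereas the project's real limiter is presumably keyed per client:

```python
import os
import threading
import time
from functools import wraps


def configured_rate_limit(bucket, max_attempts_env, window_seconds_env,
                          default_max_attempts, default_window_seconds):
    """Fixed-window rate limiter; limits come from env vars with defaults."""
    max_attempts = int(os.getenv(max_attempts_env, default_max_attempts))
    window = int(os.getenv(window_seconds_env, default_window_seconds))
    lock = threading.Lock()
    state = {"window_start": 0.0, "count": 0}

    def decorator(fn):
        @wraps(fn)
        def wrapper(*args, **kwargs):
            now = time.time()
            with lock:
                # Start a new window once the current one has elapsed.
                if now - state["window_start"] >= window:
                    state["window_start"] = now
                    state["count"] = 0
                state["count"] += 1
                if state["count"] > max_attempts:
                    # Flask-style (body, status) tuple for a 429 response.
                    return {"success": False, "error": "rate limited"}, 429
            return fn(*args, **kwargs)
        return wrapper
    return decorator
```

Reading the env vars once at decoration time (rather than per request) matches how the routes construct module-level `_..._RATE_LIMIT` decorators.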
""" + def _normalize_scalar(value): + if isinstance(value, float) and math.isnan(value): + return None + if isinstance(value, str) and value == 'NaT': + return None + try: + if value != value: # NaN check (NaN != NaN) + return None + except Exception: + pass + return value + if isinstance(data, list): - return [_clean_nan_values(item) for item in data] + root: list = [] elif isinstance(data, dict): - cleaned = {} - for key, value in data.items(): - if isinstance(value, float) and math.isnan(value): - cleaned[key] = None - elif isinstance(value, str) and value == 'NaT': - cleaned[key] = None - elif value != value: # NaN check (NaN != NaN) - cleaned[key] = None - elif isinstance(value, list): - # Recursively clean nested lists (e.g., LOT_DETAILS) - cleaned[key] = _clean_nan_values(value) + root = {} + else: + return _normalize_scalar(data) + + stack = [(data, root)] + seen: set[int] = {id(data)} + + while stack: + source, target = stack.pop() + if isinstance(source, list): + for item in source: + if isinstance(item, list): + item_id = id(item) + if item_id in seen: + target.append(None) + continue + child = [] + target.append(child) + seen.add(item_id) + stack.append((item, child)) + elif isinstance(item, dict): + item_id = id(item) + if item_id in seen: + target.append(None) + continue + child = {} + target.append(child) + seen.add(item_id) + stack.append((item, child)) + else: + target.append(_normalize_scalar(item)) + continue + + for key, value in source.items(): + if isinstance(value, list): + value_id = id(value) + if value_id in seen: + target[key] = None + continue + child = [] + target[key] = child + seen.add(value_id) + stack.append((value, child)) elif isinstance(value, dict): - # Recursively clean nested dicts - cleaned[key] = _clean_nan_values(value) + value_id = id(value) + if value_id in seen: + target[key] = None + continue + child = {} + target[key] = child + seen.add(value_id) + stack.append((value, child)) else: - cleaned[key] = value - return 
cleaned - return data -from mes_dashboard.core.utils import get_days_back + target[key] = _normalize_scalar(value) + return root + from mes_dashboard.services.resource_service import ( query_resource_by_status, query_resource_by_workcenter, @@ -62,6 +113,32 @@ from mes_dashboard.config.constants import STATUS_CATEGORIES # Create Blueprint resource_bp = Blueprint('resource', __name__, url_prefix='/api/resource') +_RESOURCE_DETAIL_RATE_LIMIT = configured_rate_limit( + bucket="resource-detail", + max_attempts_env="RESOURCE_DETAIL_RATE_LIMIT_MAX_REQUESTS", + window_seconds_env="RESOURCE_DETAIL_RATE_LIMIT_WINDOW_SECONDS", + default_max_attempts=60, + default_window_seconds=60, +) + +_RESOURCE_STATUS_RATE_LIMIT = configured_rate_limit( + bucket="resource-status", + max_attempts_env="RESOURCE_STATUS_RATE_LIMIT_MAX_REQUESTS", + window_seconds_env="RESOURCE_STATUS_RATE_LIMIT_WINDOW_SECONDS", + default_max_attempts=90, + default_window_seconds=60, +) + + +def _optional_bool_arg(name: str): + raw = request.args.get(name) + if raw is None: + return None + text = str(raw).strip() + if not text: + return None + return parse_bool_query(text) + @resource_bp.route('/by_status') def api_resource_by_status(): @@ -118,6 +195,7 @@ def api_resource_workcenter_status_matrix(): @resource_bp.route('/detail', methods=['POST']) +@_RESOURCE_DETAIL_RATE_LIMIT def api_resource_detail(): """API: Resource detail with filters.""" data = request.get_json() or {} @@ -183,6 +261,7 @@ def api_resource_status_values(): # ============================================================ @resource_bp.route('/status') +@_RESOURCE_STATUS_RATE_LIMIT def api_resource_status(): """API: Get merged resource status from realtime cache. 
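The repeated inline `is_production`/`is_key`/`is_monitor` parsing that `_optional_bool_arg` replaces is a tri-state lookup: absent or blank means "no filter" (`None`), anything else is parsed as a boolean. A self-contained sketch, with a plain dict standing in for `request.args` and `parse_bool_query` assumed to accept `'true'`/`'1'`/`'yes'` as in the old inline checks:

```python
def parse_bool_query(value):
    """Interpret a query-string value as a boolean ('true'/'1'/'yes' => True)."""
    return value.lower() in ('true', '1', 'yes') if value else False


def optional_bool_arg(args: dict, name: str):
    """Return None when the parameter is absent or blank, else a parsed bool."""
    raw = args.get(name)
    if raw is None:
        return None
    text = str(raw).strip()
    if not text:
        return None
    return parse_bool_query(text)
```

The `None` case is what lets the service layer skip the filter entirely instead of filtering on `False`.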
@@ -197,20 +276,9 @@ def api_resource_status(): wc_groups_param = request.args.get('workcenter_groups') workcenter_groups = wc_groups_param.split(',') if wc_groups_param else None - is_production = None - is_prod_param = request.args.get('is_production') - if is_prod_param: - is_production = is_prod_param.lower() in ('1', 'true', 'yes') - - is_key = None - is_key_param = request.args.get('is_key') - if is_key_param: - is_key = is_key_param.lower() in ('1', 'true', 'yes') - - is_monitor = None - is_monitor_param = request.args.get('is_monitor') - if is_monitor_param: - is_monitor = is_monitor_param.lower() in ('1', 'true', 'yes') + is_production = _optional_bool_arg('is_production') + is_key = _optional_bool_arg('is_key') + is_monitor = _optional_bool_arg('is_monitor') status_cats_param = request.args.get('status_categories') status_categories = status_cats_param.split(',') if status_cats_param else None @@ -260,6 +328,7 @@ def api_resource_status_options(): @resource_bp.route('/status/summary') +@_RESOURCE_STATUS_RATE_LIMIT def api_resource_status_summary(): """API: Get resource status summary statistics. 
@@ -269,20 +338,9 @@ def api_resource_status_summary(): wc_groups_param = request.args.get('workcenter_groups') workcenter_groups = wc_groups_param.split(',') if wc_groups_param else None - is_production = None - is_prod_param = request.args.get('is_production') - if is_prod_param: - is_production = is_prod_param.lower() in ('1', 'true', 'yes') - - is_key = None - is_key_param = request.args.get('is_key') - if is_key_param: - is_key = is_key_param.lower() in ('1', 'true', 'yes') - - is_monitor = None - is_monitor_param = request.args.get('is_monitor') - if is_monitor_param: - is_monitor = is_monitor_param.lower() in ('1', 'true', 'yes') + is_production = _optional_bool_arg('is_production') + is_key = _optional_bool_arg('is_key') + is_monitor = _optional_bool_arg('is_monitor') try: data = get_resource_status_summary( @@ -301,6 +359,7 @@ def api_resource_status_summary(): @resource_bp.route('/status/matrix') +@_RESOURCE_STATUS_RATE_LIMIT def api_resource_status_matrix(): """API: Get workcenter × status matrix. 
@@ -309,20 +368,9 @@ def api_resource_status_matrix(): is_key: Filter by key equipment is_monitor: Filter by monitor equipment """ - is_production = None - is_prod_param = request.args.get('is_production') - if is_prod_param: - is_production = is_prod_param.lower() in ('1', 'true', 'yes') - - is_key = None - is_key_param = request.args.get('is_key') - if is_key_param: - is_key = is_key_param.lower() in ('1', 'true', 'yes') - - is_monitor = None - is_monitor_param = request.args.get('is_monitor') - if is_monitor_param: - is_monitor = is_monitor_param.lower() in ('1', 'true', 'yes') + is_production = _optional_bool_arg('is_production') + is_key = _optional_bool_arg('is_key') + is_monitor = _optional_bool_arg('is_monitor') try: data = get_workcenter_status_matrix( diff --git a/src/mes_dashboard/routes/wip_routes.py b/src/mes_dashboard/routes/wip_routes.py index 4b1a2a9..2b1182d 100644 --- a/src/mes_dashboard/routes/wip_routes.py +++ b/src/mes_dashboard/routes/wip_routes.py @@ -7,6 +7,8 @@ Uses DWH.DW_MES_LOT_V view for real-time WIP data. 
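The WIP detail endpoint in this file clamps its pagination parameters to safe bounds (`page >= 1`, `1 <= page_size <= 500`), defaulting when the query values are missing or unparseable. The clamp reduces to a small helper (hypothetical name; the diff inlines this logic rather than factoring it out):

```python
def clamp_pagination(page, page_size, default_page=1, default_size=100, max_size=500):
    """Clamp pagination query params: page >= 1, 1 <= page_size <= max_size."""
    if page is None:
        page = default_page
    if page_size is None:
        page_size = default_size
    # Lower-bound page, and keep page_size within [1, max_size].
    return max(page, 1), max(1, min(page_size, max_size))
```

Bounding `page_size` caps the per-request row count, which is what the README's P2/round-2 hardening notes describe for the detail APIs.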
from flask import Blueprint, jsonify, request +from mes_dashboard.core.rate_limit import configured_rate_limit +from mes_dashboard.core.utils import parse_bool_query from mes_dashboard.services.wip_service import ( get_wip_summary, get_wip_matrix, @@ -24,10 +26,21 @@ from mes_dashboard.services.wip_service import ( # Create Blueprint wip_bp = Blueprint('wip', __name__, url_prefix='/api/wip') +_WIP_MATRIX_RATE_LIMIT = configured_rate_limit( + bucket="wip-overview-matrix", + max_attempts_env="WIP_MATRIX_RATE_LIMIT_MAX_REQUESTS", + window_seconds_env="WIP_MATRIX_RATE_LIMIT_WINDOW_SECONDS", + default_max_attempts=120, + default_window_seconds=60, +) -def _parse_bool(value: str) -> bool: - """Parse boolean from query string.""" - return value.lower() in ('true', '1', 'yes') if value else False +_WIP_DETAIL_RATE_LIMIT = configured_rate_limit( + bucket="wip-detail", + max_attempts_env="WIP_DETAIL_RATE_LIMIT_MAX_REQUESTS", + window_seconds_env="WIP_DETAIL_RATE_LIMIT_WINDOW_SECONDS", + default_max_attempts=90, + default_window_seconds=60, +) # ============================================================ @@ -52,7 +65,7 @@ def api_overview_summary(): lotid = request.args.get('lotid', '').strip() or None package = request.args.get('package', '').strip() or None pj_type = request.args.get('type', '').strip() or None - include_dummy = _parse_bool(request.args.get('include_dummy', '')) + include_dummy = parse_bool_query(request.args.get('include_dummy')) result = get_wip_summary( include_dummy=include_dummy, @@ -67,6 +80,7 @@ def api_overview_summary(): @wip_bp.route('/overview/matrix') +@_WIP_MATRIX_RATE_LIMIT def api_overview_matrix(): """API: Get workcenter x product line matrix for overview dashboard. 
@@ -88,7 +102,7 @@ def api_overview_matrix(): lotid = request.args.get('lotid', '').strip() or None package = request.args.get('package', '').strip() or None pj_type = request.args.get('type', '').strip() or None - include_dummy = _parse_bool(request.args.get('include_dummy', '')) + include_dummy = parse_bool_query(request.args.get('include_dummy')) status = request.args.get('status', '').strip().upper() or None hold_type = request.args.get('hold_type', '').strip().lower() or None @@ -134,7 +148,7 @@ def api_overview_hold(): """ workorder = request.args.get('workorder', '').strip() or None lotid = request.args.get('lotid', '').strip() or None - include_dummy = _parse_bool(request.args.get('include_dummy', '')) + include_dummy = parse_bool_query(request.args.get('include_dummy')) result = get_wip_hold_summary( include_dummy=include_dummy, @@ -151,6 +165,7 @@ def api_overview_hold(): # ============================================================ @wip_bp.route('/detail/') +@_WIP_DETAIL_RATE_LIMIT def api_detail(workcenter: str): """API: Get WIP detail for a specific workcenter group. 
@@ -176,12 +191,17 @@ def api_detail(workcenter: str): hold_type = request.args.get('hold_type', '').strip().lower() or None workorder = request.args.get('workorder', '').strip() or None lotid = request.args.get('lotid', '').strip() or None - include_dummy = _parse_bool(request.args.get('include_dummy', '')) + include_dummy = parse_bool_query(request.args.get('include_dummy')) page = request.args.get('page', 1, type=int) - page_size = min(request.args.get('page_size', 100, type=int), 500) + page_size = request.args.get('page_size', 100, type=int) - if page < 1: + if page is None: page = 1 + if page_size is None: + page_size = 100 + + page = max(page, 1) + page_size = max(1, min(page_size, 500)) # Validate status parameter if status and status not in ('RUN', 'QUEUE', 'HOLD'): @@ -245,7 +265,7 @@ def api_meta_workcenters(): Returns: JSON with list of {name, lot_count} sorted by sequence """ - include_dummy = _parse_bool(request.args.get('include_dummy', '')) + include_dummy = parse_bool_query(request.args.get('include_dummy')) result = get_workcenters(include_dummy=include_dummy) if result is not None: @@ -263,7 +283,7 @@ def api_meta_packages(): Returns: JSON with list of {name, lot_count} sorted by count desc """ - include_dummy = _parse_bool(request.args.get('include_dummy', '')) + include_dummy = parse_bool_query(request.args.get('include_dummy')) result = get_packages(include_dummy=include_dummy) if result is not None: @@ -293,7 +313,7 @@ def api_meta_search(): search_field = request.args.get('field', '').strip().lower() q = request.args.get('q', '').strip() limit = min(request.args.get('limit', 20, type=int), 50) - include_dummy = _parse_bool(request.args.get('include_dummy', '')) + include_dummy = parse_bool_query(request.args.get('include_dummy')) # Cross-filter parameters workorder = request.args.get('workorder', '').strip() or None diff --git a/src/mes_dashboard/services/auth_service.py b/src/mes_dashboard/services/auth_service.py index a4ee1a8..bf4c866 
100644 --- a/src/mes_dashboard/services/auth_service.py +++ b/src/mes_dashboard/services/auth_service.py @@ -1,124 +1,193 @@ -# -*- coding: utf-8 -*- -"""Authentication service using LDAP API or local credentials.""" - -from __future__ import annotations - -import logging -import os - -import requests - -logger = logging.getLogger(__name__) - -# Configuration - MUST be set in .env file -LDAP_API_BASE = os.environ.get("LDAP_API_URL", "") -ADMIN_EMAILS = os.environ.get("ADMIN_EMAILS", "").lower().split(",") - -# Timeout for LDAP API requests -LDAP_TIMEOUT = 10 - -# Local authentication configuration (for development/testing) -LOCAL_AUTH_ENABLED = os.environ.get("LOCAL_AUTH_ENABLED", "false").lower() in ("true", "1", "yes") -LOCAL_AUTH_USERNAME = os.environ.get("LOCAL_AUTH_USERNAME", "") -LOCAL_AUTH_PASSWORD = os.environ.get("LOCAL_AUTH_PASSWORD", "") - - -def _authenticate_local(username: str, password: str) -> dict | None: - """Authenticate using local environment credentials. - - Args: - username: User provided username - password: User provided password - - Returns: - User info dict on success, None on failure - """ - if not LOCAL_AUTH_ENABLED: - return None - - if not LOCAL_AUTH_USERNAME or not LOCAL_AUTH_PASSWORD: - logger.warning("Local auth enabled but credentials not configured") - return None - - if username == LOCAL_AUTH_USERNAME and password == LOCAL_AUTH_PASSWORD: - logger.info("Local auth success for user: %s", username) - return { - "username": username, - "displayName": f"Local User ({username})", - "mail": f"{username}@local.dev", - "department": "Development", - } - - logger.warning("Local auth failed for user: %s", username) - return None - - -def authenticate(username: str, password: str, domain: str = "PANJIT") -> dict | None: - """Authenticate user via local credentials or LDAP API. - - If LOCAL_AUTH_ENABLED is set, tries local authentication first. - Falls back to LDAP API if local auth is disabled or fails. 
- - Args: - username: Employee ID or email - password: User password - domain: Domain name (default: PANJIT) - - Returns: - User info dict on success: {username, displayName, mail, department} - None on failure - """ - # Try local authentication first if enabled - if LOCAL_AUTH_ENABLED: - local_result = _authenticate_local(username, password) - if local_result: - return local_result - # If local auth is enabled but failed, don't fall back to LDAP - # This ensures local-only mode when LOCAL_AUTH_ENABLED is true - return None - - # LDAP authentication - try: - response = requests.post( - f"{LDAP_API_BASE}/api/v1/ldap/auth", - json={"username": username, "password": password, "domain": domain}, - timeout=LDAP_TIMEOUT, - ) - data = response.json() - - if data.get("success"): - user = data.get("user", {}) - logger.info("LDAP auth success for user: %s", user.get("username")) - return user - - logger.warning("LDAP auth failed for user: %s", username) - return None - - except requests.Timeout: - logger.error("LDAP API timeout for user: %s", username) - return None - except requests.RequestException as e: - logger.error("LDAP API error for user %s: %s", username, e) - return None - except (ValueError, KeyError) as e: - logger.error("LDAP API response parse error: %s", e) - return None - - -def is_admin(user: dict) -> bool: - """Check if user is an admin. 
- - Args: - user: User info dict with 'mail' field - - Returns: - True if user email is in ADMIN_EMAILS list, or if local auth is enabled - """ - # Local auth users are automatically admins (for development/testing) - if LOCAL_AUTH_ENABLED: - user_mail = user.get("mail", "") - if user_mail.endswith("@local.dev"): - return True - - user_mail = user.get("mail", "").lower().strip() - return user_mail in [e.strip() for e in ADMIN_EMAILS] +# -*- coding: utf-8 -*- +"""Authentication service using LDAP API or local credentials.""" + +from __future__ import annotations + +import logging +import os +from urllib.parse import urlparse + +import requests + +logger = logging.getLogger(__name__) + +# Timeout for LDAP API requests +LDAP_TIMEOUT = 10 + +# Configuration - MUST be set in .env file +ADMIN_EMAILS = os.environ.get("ADMIN_EMAILS", "").lower().split(",") + +# Local authentication configuration (for development/testing) +LOCAL_AUTH_ENABLED = os.environ.get("LOCAL_AUTH_ENABLED", "false").lower() in ("true", "1", "yes") +LOCAL_AUTH_USERNAME = os.environ.get("LOCAL_AUTH_USERNAME", "") +LOCAL_AUTH_PASSWORD = os.environ.get("LOCAL_AUTH_PASSWORD", "") + +# LDAP endpoint hardening configuration +LDAP_API_URL = os.environ.get("LDAP_API_URL", "").strip() +LDAP_ALLOWED_HOSTS_RAW = os.environ.get("LDAP_ALLOWED_HOSTS", "").strip() + + +def _normalize_host(host: str) -> str: + return host.strip().lower().rstrip(".") + + +def _parse_allowed_hosts(raw_hosts: str) -> tuple[str, ...]: + if not raw_hosts: + return tuple() + + hosts: list[str] = [] + for raw in raw_hosts.split(","): + host = _normalize_host(raw) + if host: + hosts.append(host) + return tuple(hosts) + + +def _validate_ldap_api_url(raw_url: str, allowed_hosts: tuple[str, ...]) -> tuple[str | None, str | None]: + """Validate LDAP API URL to prevent configuration-based SSRF risks.""" + url = (raw_url or "").strip() + if not url: + return None, "LDAP_API_URL is missing" + + parsed = urlparse(url) + scheme = (parsed.scheme or 
"").lower() + host = _normalize_host(parsed.hostname or "") + + if not host: + return None, f"LDAP_API_URL has no valid host: {url!r}" + + if scheme != "https": + return None, f"LDAP_API_URL must use HTTPS: {url!r}" + + effective_allowlist = allowed_hosts or (host,) + if host not in effective_allowlist: + return None, ( + f"LDAP_API_URL host {host!r} is not allowlisted. " + f"Allowed hosts: {', '.join(effective_allowlist)}" + ) + + return url.rstrip("/"), None + + +def _resolve_ldap_config() -> tuple[str | None, str | None, tuple[str, ...]]: + allowed_hosts = _parse_allowed_hosts(LDAP_ALLOWED_HOSTS_RAW) + api_base, error = _validate_ldap_api_url(LDAP_API_URL, allowed_hosts) + + if api_base: + effective_hosts = allowed_hosts or (_normalize_host(urlparse(api_base).hostname or ""),) + return api_base, None, effective_hosts + + return None, error, allowed_hosts + + +LDAP_API_BASE, LDAP_CONFIG_ERROR, LDAP_ALLOWED_HOSTS = _resolve_ldap_config() + + +def _authenticate_local(username: str, password: str) -> dict | None: + """Authenticate using local environment credentials. + + Args: + username: User provided username + password: User provided password + + Returns: + User info dict on success, None on failure + """ + if not LOCAL_AUTH_ENABLED: + return None + + if not LOCAL_AUTH_USERNAME or not LOCAL_AUTH_PASSWORD: + logger.warning("Local auth enabled but credentials not configured") + return None + + if username == LOCAL_AUTH_USERNAME and password == LOCAL_AUTH_PASSWORD: + logger.info("Local auth success for user: %s", username) + return { + "username": username, + "displayName": f"Local User ({username})", + "mail": f"{username}@local.dev", + "department": "Development", + } + + logger.warning("Local auth failed for user: %s", username) + return None + + +def authenticate(username: str, password: str, domain: str = "PANJIT") -> dict | None: + """Authenticate user via local credentials or LDAP API. + + If LOCAL_AUTH_ENABLED is set, tries local authentication first. 
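The HTTPS-plus-allowlist check applied to `LDAP_API_URL` reduces to a few guard clauses. A minimal standalone sketch (the function name is illustrative; the real implementation is `_validate_ldap_api_url`), including the behavior where an empty allowlist implicitly allows the URL's own host:

```python
from urllib.parse import urlparse


def validate_ldap_url(raw_url, allowed_hosts=()):
    """Return (normalized_url, None) on success or (None, reason) on failure."""
    url = (raw_url or "").strip()
    if not url:
        return None, "missing URL"
    parsed = urlparse(url)
    host = (parsed.hostname or "").strip().lower().rstrip(".")
    if not host:
        return None, "no valid host"
    if (parsed.scheme or "").lower() != "https":
        return None, "must use HTTPS"
    # With no explicit allowlist, the URL's own host is implicitly allowed,
    # matching `effective_allowlist = allowed_hosts or (host,)` above.
    allowlist = tuple(allowed_hosts) or (host,)
    if host not in allowlist:
        return None, f"host {host!r} not allowlisted"
    return url.rstrip("/"), None


print(validate_ldap_url("https://ldap.example.com/", ("ldap.example.com",)))
# ('https://ldap.example.com', None)
print(validate_ldap_url("http://ldap.example.com"))
# (None, 'must use HTTPS')
```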
+ Falls back to LDAP API if local auth is disabled or fails. + + Args: + username: Employee ID or email + password: User password + domain: Domain name (default: PANJIT) + + Returns: + User info dict on success: {username, displayName, mail, department} + None on failure + """ + # Try local authentication first if enabled + if LOCAL_AUTH_ENABLED: + local_result = _authenticate_local(username, password) + if local_result: + return local_result + # If local auth is enabled but failed, don't fall back to LDAP + # This ensures local-only mode when LOCAL_AUTH_ENABLED is true + return None + + if LDAP_CONFIG_ERROR: + logger.error("LDAP authentication blocked: %s", LDAP_CONFIG_ERROR) + return None + + if not LDAP_API_BASE: + logger.error("LDAP authentication blocked: LDAP_API_URL is not configured") + return None + + # LDAP authentication + try: + response = requests.post( + f"{LDAP_API_BASE}/api/v1/ldap/auth", + json={"username": username, "password": password, "domain": domain}, + timeout=LDAP_TIMEOUT, + ) + data = response.json() + + if data.get("success"): + user = data.get("user", {}) + logger.info("LDAP auth success for user: %s", user.get("username")) + return user + + logger.warning("LDAP auth failed for user: %s", username) + return None + + except requests.Timeout: + logger.error("LDAP API timeout for user: %s", username) + return None + except requests.RequestException as e: + logger.error("LDAP API error for user %s: %s", username, e) + return None + except (ValueError, KeyError) as e: + logger.error("LDAP API response parse error: %s", e) + return None + + +def is_admin(user: dict) -> bool: + """Check if user is an admin. 
+ + Args: + user: User info dict with 'mail' field + + Returns: + True if user email is in ADMIN_EMAILS list, or if local auth is enabled + """ + # Local auth users are automatically admins (for development/testing) + if LOCAL_AUTH_ENABLED: + user_mail = user.get("mail", "") + if user_mail.endswith("@local.dev"): + return True + + user_mail = user.get("mail", "").lower().strip() + allowed_emails = [e.strip() for e in ADMIN_EMAILS if e and e.strip()] + return user_mail in allowed_emails diff --git a/src/mes_dashboard/services/filter_cache.py b/src/mes_dashboard/services/filter_cache.py index d5dbba2..baff4fb 100644 --- a/src/mes_dashboard/services/filter_cache.py +++ b/src/mes_dashboard/services/filter_cache.py @@ -6,6 +6,7 @@ Data is loaded from database and cached in memory with periodic refresh. """ import logging +import os import threading from datetime import datetime, timedelta from typing import Optional, Dict, List, Any @@ -19,8 +20,8 @@ logger = logging.getLogger('mes_dashboard.filter_cache') # ============================================================ CACHE_TTL_SECONDS = 3600 # 1 hour cache TTL -WIP_VIEW = "DWH.DW_MES_LOT_V" -SPEC_WORKCENTER_VIEW = "DWH.DW_MES_SPEC_WORKCENTER_V" +WIP_VIEW = os.getenv("FILTER_CACHE_WIP_VIEW", "DWH.DW_MES_LOT_V") +SPEC_WORKCENTER_VIEW = os.getenv("FILTER_CACHE_SPEC_WORKCENTER_VIEW", "DWH.DW_MES_SPEC_WORKCENTER_V") # ============================================================ # Cache Storage diff --git a/src/mes_dashboard/services/page_registry.py b/src/mes_dashboard/services/page_registry.py index 91e7448..8d50f0d 100644 --- a/src/mes_dashboard/services/page_registry.py +++ b/src/mes_dashboard/services/page_registry.py @@ -1,12 +1,14 @@ # -*- coding: utf-8 -*- """Page registry service for managing page access status.""" -from __future__ import annotations - -import json -import logging -from pathlib import Path -from threading import Lock +from __future__ import annotations + +import json +import logging +import os 
+import tempfile +from pathlib import Path +from threading import Lock logger = logging.getLogger(__name__) @@ -34,20 +36,38 @@ def _load() -> dict: return _cache -def _save(data: dict) -> None: - """Save page status configuration.""" - global _cache - try: - DATA_FILE.parent.mkdir(parents=True, exist_ok=True) - DATA_FILE.write_text( - json.dumps(data, ensure_ascii=False, indent=2), - encoding="utf-8" - ) - _cache = data - logger.debug("Saved page status to %s", DATA_FILE) - except OSError as e: - logger.error("Failed to save page status: %s", e) - raise +def _save(data: dict) -> None: + """Save page status configuration.""" + global _cache + tmp_path: Path | None = None + try: + DATA_FILE.parent.mkdir(parents=True, exist_ok=True) + payload = json.dumps(data, ensure_ascii=False, indent=2) + + # Atomic write: write to sibling temp file, then replace target. + with tempfile.NamedTemporaryFile( + mode="w", + encoding="utf-8", + dir=str(DATA_FILE.parent), + prefix=f".{DATA_FILE.name}.", + suffix=".tmp", + delete=False, + ) as tmp: + tmp.write(payload) + tmp.flush() + os.fsync(tmp.fileno()) + tmp_path = Path(tmp.name) + os.replace(tmp_path, DATA_FILE) + _cache = data + logger.debug("Saved page status to %s", DATA_FILE) + except OSError as e: + if tmp_path is not None: + try: + tmp_path.unlink(missing_ok=True) + except OSError: + pass + logger.error("Failed to save page status: %s", e) + raise def get_page_status(route: str) -> str | None: diff --git a/src/mes_dashboard/services/realtime_equipment_cache.py b/src/mes_dashboard/services/realtime_equipment_cache.py index 40980e9..4aef7c2 100644 --- a/src/mes_dashboard/services/realtime_equipment_cache.py +++ b/src/mes_dashboard/services/realtime_equipment_cache.py @@ -5,12 +5,14 @@ Provides cached equipment status from DW_MES_EQUIPMENTSTATUS_WIP_V. Data is synced periodically (default 5 minutes) and stored in Redis. 
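The `_save` rewrite in `page_registry.py` uses the classic stage-then-replace pattern: write to a temp file in the same directory, `fsync`, then `os.replace` (atomic on the same filesystem), so readers never observe a half-written file and a failed write leaves the old file untouched. A self-contained sketch with an illustrative helper name:

```python
import json
import os
import tempfile
from pathlib import Path


def atomic_write_json(path: Path, data: dict) -> None:
    """Write JSON atomically: stage to a sibling temp file, then replace."""
    path.parent.mkdir(parents=True, exist_ok=True)
    payload = json.dumps(data, ensure_ascii=False, indent=2)
    tmp_path = None
    try:
        with tempfile.NamedTemporaryFile(
            mode="w", encoding="utf-8", dir=str(path.parent),
            prefix=f".{path.name}.", suffix=".tmp", delete=False,
        ) as tmp:
            tmp.write(payload)
            tmp.flush()
            os.fsync(tmp.fileno())  # ensure bytes hit disk before the rename
            tmp_path = Path(tmp.name)
        os.replace(tmp_path, path)  # atomic on the same filesystem
    except OSError:
        if tmp_path is not None:
            tmp_path.unlink(missing_ok=True)
        raise


target = Path(tempfile.mkdtemp()) / "pages.json"
atomic_write_json(target, {"dashboard": "active"})
print(json.loads(target.read_text(encoding="utf-8")))  # {'dashboard': 'active'}
```

Staging in the same directory matters: `os.replace` across filesystems is not atomic, so the temp file must live next to its target.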
""" -import json -import logging -import threading -import time -from datetime import datetime -from typing import Any, Dict, List, Optional, Tuple +import json +import logging +import os +import threading +import time +from collections import OrderedDict +from datetime import datetime +from typing import Any from mes_dashboard.core.database import read_sql_df from mes_dashboard.core.redis_client import ( @@ -19,64 +21,110 @@ from mes_dashboard.core.redis_client import ( try_acquire_lock, release_lock, ) -from mes_dashboard.config.constants import ( - EQUIPMENT_STATUS_DATA_KEY, - EQUIPMENT_STATUS_INDEX_KEY, - EQUIPMENT_STATUS_META_UPDATED_KEY, - EQUIPMENT_STATUS_META_COUNT_KEY, - STATUS_CATEGORY_MAP, -) +from mes_dashboard.config.constants import ( + EQUIPMENT_STATUS_DATA_KEY, + EQUIPMENT_STATUS_INDEX_KEY, + EQUIPMENT_STATUS_META_UPDATED_KEY, + EQUIPMENT_STATUS_META_COUNT_KEY, + STATUS_CATEGORY_MAP, +) +from mes_dashboard.services.sql_fragments import EQUIPMENT_STATUS_SELECT_SQL -logger = logging.getLogger('mes_dashboard.realtime_equipment_cache') - -# ============================================================ -# Process-Level Cache (Prevents redundant JSON parsing) -# ============================================================ - -class _ProcessLevelCache: - """Thread-safe process-level cache for parsed equipment status data.""" - - def __init__(self, ttl_seconds: int = 30): - self._cache: Dict[str, Tuple[List[Dict[str, Any]], float]] = {} - self._lock = threading.Lock() - self._ttl = ttl_seconds - - def get(self, key: str) -> Optional[List[Dict[str, Any]]]: - """Get cached data if not expired.""" - with self._lock: - if key not in self._cache: - return None - data, timestamp = self._cache[key] - if time.time() - timestamp > self._ttl: - del self._cache[key] - return None - return data - - def set(self, key: str, data: List[Dict[str, Any]]) -> None: - """Cache data with current timestamp.""" - with self._lock: - self._cache[key] = (data, time.time()) - - def 
invalidate(self, key: str) -> None: - """Remove a key from cache.""" - with self._lock: - self._cache.pop(key, None) +logger = logging.getLogger('mes_dashboard.realtime_equipment_cache') + +# ============================================================ +# Process-Level Cache (Prevents redundant JSON parsing) +# ============================================================ + +DEFAULT_PROCESS_CACHE_TTL_SECONDS = 30 +DEFAULT_PROCESS_CACHE_MAX_SIZE = 32 +DEFAULT_LOOKUP_TTL_SECONDS = 30 +class _ProcessLevelCache: + """Thread-safe process-level cache for parsed equipment status data.""" + + def __init__(self, ttl_seconds: int = 30, max_size: int = 32): + self._cache: OrderedDict[str, tuple[list[dict[str, Any]], float]] = OrderedDict() + self._lock = threading.Lock() + self._ttl = max(int(ttl_seconds), 1) + self._max_size = max(int(max_size), 1) + + @property + def max_size(self) -> int: + return self._max_size + + def _evict_expired_locked(self, now: float) -> None: + stale_keys = [ + key for key, (_, timestamp) in self._cache.items() + if now - timestamp > self._ttl + ] + for key in stale_keys: + self._cache.pop(key, None) + + def get(self, key: str) -> list[dict[str, Any]] | None: + """Get cached data if not expired.""" + with self._lock: + payload = self._cache.get(key) + if payload is None: + return None + data, timestamp = payload + now = time.time() + if now - timestamp > self._ttl: + self._cache.pop(key, None) + return None + self._cache.move_to_end(key, last=True) + return data + + def set(self, key: str, data: list[dict[str, Any]]) -> None: + """Cache data with current timestamp.""" + with self._lock: + now = time.time() + self._evict_expired_locked(now) + if key in self._cache: + self._cache.pop(key, None) + elif len(self._cache) >= self._max_size: + self._cache.popitem(last=False) + self._cache[key] = (data, now) + self._cache.move_to_end(key, last=True) + def invalidate(self, key: str) -> None: + """Remove a key from cache.""" + with self._lock: + 
self._cache.pop(key, None) + + +def _resolve_cache_max_size(env_name: str, default: int) -> int: + value = os.getenv(env_name) + if value is None: + return max(int(default), 1) + try: + return max(int(value), 1) + except (TypeError, ValueError): + return max(int(default), 1) + + # Global process-level cache for equipment status (30s TTL) -_equipment_status_cache = _ProcessLevelCache(ttl_seconds=30) +PROCESS_CACHE_MAX_SIZE = _resolve_cache_max_size("PROCESS_CACHE_MAX_SIZE", DEFAULT_PROCESS_CACHE_MAX_SIZE) +EQUIPMENT_PROCESS_CACHE_MAX_SIZE = _resolve_cache_max_size( + "EQUIPMENT_PROCESS_CACHE_MAX_SIZE", + PROCESS_CACHE_MAX_SIZE, +) +_equipment_status_cache = _ProcessLevelCache( + ttl_seconds=DEFAULT_PROCESS_CACHE_TTL_SECONDS, + max_size=EQUIPMENT_PROCESS_CACHE_MAX_SIZE, +) _equipment_status_parse_lock = threading.Lock() _equipment_lookup_lock = threading.Lock() -_equipment_status_lookup: Dict[str, Dict[str, Any]] = {} -_equipment_status_lookup_built_at: Optional[str] = None +_equipment_status_lookup: dict[str, dict[str, Any]] = {} +_equipment_status_lookup_built_at: str | None = None _equipment_status_lookup_ts: float = 0.0 -LOOKUP_TTL_SECONDS = 30 +LOOKUP_TTL_SECONDS = DEFAULT_LOOKUP_TTL_SECONDS # ============================================================ # Module State # ============================================================ -_SYNC_THREAD: Optional[threading.Thread] = None +_SYNC_THREAD: threading.Thread | None = None _STOP_EVENT = threading.Event() _SYNC_LOCK = threading.Lock() @@ -85,40 +133,14 @@ _SYNC_LOCK = threading.Lock() # Oracle Query # ============================================================ -def _load_equipment_status_from_oracle() -> Optional[List[Dict[str, Any]]]: +def _load_equipment_status_from_oracle() -> list[dict[str, Any]] | None: """Query DW_MES_EQUIPMENTSTATUS_WIP_V from Oracle. Returns: List of equipment status records, or None if query fails. 
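`_ProcessLevelCache` now combines per-entry TTL expiry with size-bounded LRU eviction on an `OrderedDict`: `get()` refreshes recency, and `set()` evicts expired entries first, then the least-recently-used entry if the cache is still full. A minimal standalone version of that pattern (class name illustrative):

```python
import threading
import time
from collections import OrderedDict


class TTLLRUCache:
    """Thread-safe cache with per-entry TTL plus LRU size bounding."""

    def __init__(self, ttl_seconds=30, max_size=2):
        self._cache = OrderedDict()
        self._lock = threading.Lock()
        self._ttl = max(int(ttl_seconds), 1)
        self._max_size = max(int(max_size), 1)

    def get(self, key):
        with self._lock:
            entry = self._cache.get(key)
            if entry is None:
                return None
            value, ts = entry
            if time.time() - ts > self._ttl:
                self._cache.pop(key, None)
                return None
            self._cache.move_to_end(key)  # mark as most recently used
            return value

    def set(self, key, value):
        with self._lock:
            now = time.time()
            # Drop expired entries before considering LRU eviction.
            for stale in [k for k, (_, ts) in self._cache.items() if now - ts > self._ttl]:
                self._cache.pop(stale, None)
            if key in self._cache:
                self._cache.pop(key)
            elif len(self._cache) >= self._max_size:
                self._cache.popitem(last=False)  # evict least recently used
            self._cache[key] = (value, now)


cache = TTLLRUCache(ttl_seconds=30, max_size=2)
cache.set("a", 1)
cache.set("b", 2)
cache.get("a")     # touch "a" so "b" becomes the LRU entry
cache.set("c", 3)  # evicts "b"
print(cache.get("a"), cache.get("b"), cache.get("c"))  # 1 None 3
```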
""" - sql = """ - SELECT - RESOURCEID, - EQUIPMENTID, - OBJECTCATEGORY, - EQUIPMENTASSETSSTATUS, - EQUIPMENTASSETSSTATUSREASON, - JOBORDER, - JOBMODEL, - JOBSTAGE, - JOBID, - JOBSTATUS, - CREATEDATE, - CREATEUSERNAME, - CREATEUSER, - TECHNICIANUSERNAME, - TECHNICIANUSER, - SYMPTOMCODE, - CAUSECODE, - REPAIRCODE, - RUNCARDLOTID, - LOTTRACKINQTY_PCS, - LOTTRACKINTIME, - LOTTRACKINEMPLOYEE - FROM DWH.DW_MES_EQUIPMENTSTATUS_WIP_V - """ - try: - df = read_sql_df(sql) + try: + df = read_sql_df(EQUIPMENT_STATUS_SELECT_SQL) if df is None or df.empty: logger.warning("No data returned from DW_MES_EQUIPMENTSTATUS_WIP_V") return [] @@ -147,7 +169,7 @@ def _load_equipment_status_from_oracle() -> Optional[List[Dict[str, Any]]]: # Data Aggregation # ============================================================ -def _classify_status(status: Optional[str]) -> str: +def _classify_status(status: str | None) -> str: """Classify equipment status into category. Args: @@ -183,7 +205,7 @@ def _is_valid_value(value) -> bool: return True -def _aggregate_by_resourceid(records: List[Dict[str, Any]]) -> List[Dict[str, Any]]: +def _aggregate_by_resourceid(records: list[dict[str, Any]]) -> list[dict[str, Any]]: """Aggregate equipment status records by RESOURCEID. 
For each RESOURCEID: @@ -203,7 +225,7 @@ def _aggregate_by_resourceid(records: List[Dict[str, Any]]) -> List[Dict[str, An return [] # Group by RESOURCEID - grouped: Dict[str, List[Dict[str, Any]]] = {} + grouped: dict[str, list[dict[str, Any]]] = {} for record in records: resource_id = record.get('RESOURCEID') if resource_id: @@ -250,7 +272,7 @@ def _aggregate_by_resourceid(records: List[Dict[str, Any]]) -> List[Dict[str, An # Build aggregated record status = first.get('EQUIPMENTASSETSSTATUS') - aggregated.append({ + aggregated.append({ 'RESOURCEID': resource_id, 'EQUIPMENTID': first.get('EQUIPMENTID'), 'OBJECTCATEGORY': first.get('OBJECTCATEGORY'), @@ -270,11 +292,11 @@ def _aggregate_by_resourceid(records: List[Dict[str, Any]]) -> List[Dict[str, An 'TECHNICIANUSER': first.get('TECHNICIANUSER'), 'SYMPTOMCODE': first.get('SYMPTOMCODE'), 'CAUSECODE': first.get('CAUSECODE'), - 'REPAIRCODE': first.get('REPAIRCODE'), - # LOT related fields - 'LOT_COUNT': len(seen_lots), # Count distinct RUNCARDLOTID - 'LOT_DETAILS': lot_details, # LOT details for tooltip - 'TOTAL_TRACKIN_QTY': total_qty, + 'REPAIRCODE': first.get('REPAIRCODE'), + # LOT related fields + 'LOT_COUNT': len(seen_lots) if seen_lots else len(group), + 'LOT_DETAILS': lot_details, # LOT details for tooltip + 'TOTAL_TRACKIN_QTY': total_qty, 'LATEST_TRACKIN_TIME': latest_trackin, }) @@ -286,7 +308,7 @@ def _aggregate_by_resourceid(records: List[Dict[str, Any]]) -> List[Dict[str, An # Redis Storage # ============================================================ -def _save_to_redis(aggregated: List[Dict[str, Any]]) -> bool: +def _save_to_redis(aggregated: list[dict[str, Any]]) -> bool: """Save aggregated equipment status to Redis. Uses pipeline for atomic update of all keys. 
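`_aggregate_by_resourceid` groups raw rows by `RESOURCEID` and counts distinct `RUNCARDLOTID` values, falling back to the group size when no lot IDs are present (the `len(seen_lots) if seen_lots else len(group)` change). A reduced sketch carrying only the fields needed to show the shape:

```python
def aggregate_by_resource(records):
    """Group raw status rows by RESOURCEID and count distinct lot IDs.

    Illustrative subset of _aggregate_by_resourceid: when a group has no
    RUNCARDLOTID values, LOT_COUNT falls back to the group size.
    """
    grouped = {}
    for record in records:
        rid = record.get("RESOURCEID")
        if rid:
            grouped.setdefault(rid, []).append(record)

    aggregated = []
    for rid, group in grouped.items():
        seen_lots = {r["RUNCARDLOTID"] for r in group if r.get("RUNCARDLOTID")}
        aggregated.append({
            "RESOURCEID": rid,
            "LOT_COUNT": len(seen_lots) if seen_lots else len(group),
        })
    return aggregated


rows = [
    {"RESOURCEID": "R1", "RUNCARDLOTID": "L1"},
    {"RESOURCEID": "R1", "RUNCARDLOTID": "L1"},  # duplicate lot, counted once
    {"RESOURCEID": "R1", "RUNCARDLOTID": "L2"},
    {"RESOURCEID": "R2", "RUNCARDLOTID": None},  # no lots -> group-size fallback
]
print(aggregate_by_resource(rows))
# [{'RESOURCEID': 'R1', 'LOT_COUNT': 2}, {'RESOURCEID': 'R2', 'LOT_COUNT': 1}]
```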
@@ -354,7 +376,7 @@ def _invalidate_equipment_status_lookup() -> None: _equipment_status_lookup_ts = 0.0 -def get_equipment_status_lookup() -> Dict[str, Dict[str, Any]]: +def get_equipment_status_lookup() -> dict[str, dict[str, Any]]: """Get RESOURCEID -> status record lookup with process-level caching.""" global _equipment_status_lookup, _equipment_status_lookup_built_at, _equipment_status_lookup_ts @@ -375,7 +397,7 @@ def get_equipment_status_lookup() -> Dict[str, Dict[str, Any]]: _equipment_status_lookup_ts = time.time() return _equipment_status_lookup -def get_all_equipment_status() -> List[Dict[str, Any]]: +def get_all_equipment_status() -> list[dict[str, Any]]: """Get all equipment status from cache with process-level caching. Uses a two-tier cache strategy: @@ -433,7 +455,7 @@ def get_all_equipment_status() -> List[Dict[str, Any]]: return [] -def get_equipment_status_by_id(resource_id: str) -> Optional[Dict[str, Any]]: +def get_equipment_status_by_id(resource_id: str) -> dict[str, Any] | None: """Get equipment status by RESOURCEID. Uses index hash for O(1) lookup. @@ -485,7 +507,7 @@ def get_equipment_status_by_id(resource_id: str) -> Optional[Dict[str, Any]]: return None -def get_equipment_status_by_ids(resource_ids: List[str]) -> List[Dict[str, Any]]: +def get_equipment_status_by_ids(resource_ids: list[str]) -> list[dict[str, Any]]: """Get equipment status for multiple RESOURCEIDs. Args: @@ -540,7 +562,7 @@ def get_equipment_status_by_ids(resource_ids: List[str]) -> List[Dict[str, Any]] return [] -def get_equipment_status_cache_status() -> Dict[str, Any]: +def get_equipment_status_cache_status() -> dict[str, Any]: """Get equipment status cache status. 
Returns: diff --git a/src/mes_dashboard/services/resource_cache.py b/src/mes_dashboard/services/resource_cache.py index 2ce26b9..af498d9 100644 --- a/src/mes_dashboard/services/resource_cache.py +++ b/src/mes_dashboard/services/resource_cache.py @@ -13,8 +13,9 @@ import logging import os import threading import time +from collections import OrderedDict from datetime import datetime -from typing import Any, Dict, List, Optional, Tuple +from typing import Any import pandas as pd @@ -31,9 +32,27 @@ from mes_dashboard.config.constants import ( EQUIPMENT_TYPE_FILTER, ) from mes_dashboard.sql import QueryBuilder +from mes_dashboard.services.sql_fragments import ( + RESOURCE_BASE_SELECT_TEMPLATE, + RESOURCE_VERSION_SELECT_TEMPLATE, +) logger = logging.getLogger('mes_dashboard.resource_cache') +ResourceRecord = dict[str, Any] +RowPosition = int +PositionBucket = dict[str, list[RowPosition]] +FlagBuckets = dict[str, list[RowPosition]] +ResourceIndex = dict[str, Any] + +DEFAULT_PROCESS_CACHE_TTL_SECONDS = 30 +DEFAULT_PROCESS_CACHE_MAX_SIZE = 32 +DEFAULT_RESOURCE_SYNC_INTERVAL_SECONDS = 14_400 # 4 hours +DEFAULT_INDEX_VERSION_CHECK_INTERVAL_SECONDS = 5 +RESOURCE_DF_CACHE_KEY = "resource_data" +TRUE_BUCKET = "1" +FALSE_BUCKET = "0" + # ============================================================ # Process-Level Cache (Prevents redundant JSON parsing) # ============================================================ @@ -41,26 +60,49 @@ logger = logging.getLogger('mes_dashboard.resource_cache') class _ProcessLevelCache: """Thread-safe process-level cache for parsed DataFrames.""" - def __init__(self, ttl_seconds: int = 30): - self._cache: Dict[str, Tuple[pd.DataFrame, float]] = {} + def __init__(self, ttl_seconds: int = DEFAULT_PROCESS_CACHE_TTL_SECONDS, max_size: int = DEFAULT_PROCESS_CACHE_MAX_SIZE): + self._cache: OrderedDict[str, tuple[pd.DataFrame, float]] = OrderedDict() self._lock = threading.Lock() - self._ttl = ttl_seconds + self._ttl = max(int(ttl_seconds), 1) + 
self._max_size = max(int(max_size), 1) - def get(self, key: str) -> Optional[pd.DataFrame]: + @property + def max_size(self) -> int: + return self._max_size + + def _evict_expired_locked(self, now: float) -> None: + stale_keys = [ + key for key, (_, timestamp) in self._cache.items() + if now - timestamp > self._ttl + ] + for key in stale_keys: + self._cache.pop(key, None) + + def get(self, key: str) -> pd.DataFrame | None: """Get cached DataFrame if not expired.""" with self._lock: - if key not in self._cache: + payload = self._cache.get(key) + if payload is None: return None - df, timestamp = self._cache[key] - if time.time() - timestamp > self._ttl: - del self._cache[key] + df, timestamp = payload + now = time.time() + if now - timestamp > self._ttl: + self._cache.pop(key, None) return None + self._cache.move_to_end(key, last=True) return df def set(self, key: str, df: pd.DataFrame) -> None: """Cache a DataFrame with current timestamp.""" with self._lock: - self._cache[key] = (df, time.time()) + now = time.time() + self._evict_expired_locked(now) + if key in self._cache: + self._cache.pop(key, None) + elif len(self._cache) >= self._max_size: + self._cache.popitem(last=False) + self._cache[key] = (df, now) + self._cache.move_to_end(key, last=True) def invalidate(self, key: str) -> None: """Remove a key from cache.""" @@ -68,11 +110,29 @@ class _ProcessLevelCache: self._cache.pop(key, None) +def _resolve_cache_max_size(env_name: str, default: int) -> int: + value = os.getenv(env_name) + if value is None: + return max(int(default), 1) + try: + return max(int(value), 1) + except (TypeError, ValueError): + return max(int(default), 1) + + # Global process-level cache for resource data (30s TTL) -_resource_df_cache = _ProcessLevelCache(ttl_seconds=30) +PROCESS_CACHE_MAX_SIZE = _resolve_cache_max_size("PROCESS_CACHE_MAX_SIZE", DEFAULT_PROCESS_CACHE_MAX_SIZE) +RESOURCE_PROCESS_CACHE_MAX_SIZE = _resolve_cache_max_size( + "RESOURCE_PROCESS_CACHE_MAX_SIZE", + 
PROCESS_CACHE_MAX_SIZE, +) +_resource_df_cache = _ProcessLevelCache( + ttl_seconds=DEFAULT_PROCESS_CACHE_TTL_SECONDS, + max_size=RESOURCE_PROCESS_CACHE_MAX_SIZE, +) _resource_parse_lock = threading.Lock() _resource_index_lock = threading.Lock() -_resource_index: Dict[str, Any] = { +_resource_index: ResourceIndex = { "ready": False, "source": None, "version": None, @@ -80,19 +140,27 @@ _resource_index: Dict[str, Any] = { "built_at": None, "version_checked_at": 0.0, "count": 0, - "records": [], + "all_positions": [], "by_resource_id": {}, "by_workcenter": {}, "by_family": {}, "by_department": {}, "by_location": {}, - "by_is_production": {"1": [], "0": []}, - "by_is_key": {"1": [], "0": []}, - "by_is_monitor": {"1": [], "0": []}, + "by_is_production": {TRUE_BUCKET: [], FALSE_BUCKET: []}, + "by_is_key": {TRUE_BUCKET: [], FALSE_BUCKET: []}, + "by_is_monitor": {TRUE_BUCKET: [], FALSE_BUCKET: []}, + "memory": { + "frame_bytes": 0, + "index_bytes": 0, + "records_json_bytes": 0, + "bucket_entries": 0, + "amplification_ratio": 0.0, + "representation": "dataframe+row-index", + }, } -def _new_empty_index() -> Dict[str, Any]: +def _new_empty_index() -> ResourceIndex: return { "ready": False, "source": None, @@ -101,15 +169,23 @@ def _new_empty_index() -> Dict[str, Any]: "built_at": None, "version_checked_at": 0.0, "count": 0, - "records": [], + "all_positions": [], "by_resource_id": {}, "by_workcenter": {}, "by_family": {}, "by_department": {}, "by_location": {}, - "by_is_production": {"1": [], "0": []}, - "by_is_key": {"1": [], "0": []}, - "by_is_monitor": {"1": [], "0": []}, + "by_is_production": {TRUE_BUCKET: [], FALSE_BUCKET: []}, + "by_is_key": {TRUE_BUCKET: [], FALSE_BUCKET: []}, + "by_is_monitor": {TRUE_BUCKET: [], FALSE_BUCKET: []}, + "memory": { + "frame_bytes": 0, + "index_bytes": 0, + "records_json_bytes": 0, + "bucket_entries": 0, + "amplification_ratio": 0.0, + "representation": "dataframe+row-index", + }, } @@ -129,23 +205,59 @@ def _is_truthy_flag(value: Any) -> 
bool: return False -def _bucket_append(bucket: Dict[str, List[Dict[str, Any]]], key: Any, record: Dict[str, Any]) -> None: +def _bucket_append(bucket: PositionBucket, key: Any, row_position: RowPosition) -> None: if key is None: return if isinstance(key, float) and pd.isna(key): return key_str = str(key) - bucket.setdefault(key_str, []).append(record) + bucket.setdefault(key_str, []).append(int(row_position)) + + +def _estimate_dataframe_bytes(df: pd.DataFrame) -> int: + try: + return int(df.memory_usage(index=True, deep=True).sum()) + except Exception: + return 0 + + +def _estimate_index_bytes(index: ResourceIndex) -> int: + """Estimate lightweight index memory footprint for telemetry.""" + by_resource_id = index.get("by_resource_id", {}) + by_workcenter = index.get("by_workcenter", {}) + by_family = index.get("by_family", {}) + by_department = index.get("by_department", {}) + by_location = index.get("by_location", {}) + by_is_production = index.get("by_is_production", {TRUE_BUCKET: [], FALSE_BUCKET: []}) + by_is_key = index.get("by_is_key", {TRUE_BUCKET: [], FALSE_BUCKET: []}) + by_is_monitor = index.get("by_is_monitor", {TRUE_BUCKET: [], FALSE_BUCKET: []}) + all_positions = index.get("all_positions", []) + + position_entries = ( + len(all_positions) + + sum(len(v) for v in by_workcenter.values()) + + sum(len(v) for v in by_family.values()) + + sum(len(v) for v in by_department.values()) + + sum(len(v) for v in by_location.values()) + + len(by_is_production.get(TRUE_BUCKET, [])) + + len(by_is_production.get(FALSE_BUCKET, [])) + + len(by_is_key.get(TRUE_BUCKET, [])) + + len(by_is_key.get(FALSE_BUCKET, [])) + + len(by_is_monitor.get(TRUE_BUCKET, [])) + + len(by_is_monitor.get(FALSE_BUCKET, [])) + ) + # Approximate integer/list/dict overhead; telemetry only needs directional signal. 
+ return int(position_entries * 8 + len(by_resource_id) * 64) def _build_resource_index( df: pd.DataFrame, *, source: str, - version: Optional[str], - updated_at: Optional[str], -) -> Dict[str, Any]: - records = df.to_dict(orient='records') + version: str | None, + updated_at: str | None, +) -> ResourceIndex: + normalized_df = df.reset_index(drop=True) index = _new_empty_index() index["ready"] = True index["source"] = source @@ -153,31 +265,58 @@ def _build_resource_index( index["updated_at"] = updated_at index["built_at"] = datetime.now().isoformat() index["version_checked_at"] = time.time() - index["count"] = len(records) - index["records"] = records + index["count"] = len(normalized_df) + index["all_positions"] = list(range(len(normalized_df))) - for record in records: + for row_position, record in normalized_df.iterrows(): resource_id = record.get("RESOURCEID") if resource_id is not None and not (isinstance(resource_id, float) and pd.isna(resource_id)): - index["by_resource_id"][str(resource_id)] = record + index["by_resource_id"][str(resource_id)] = int(row_position) - _bucket_append(index["by_workcenter"], record.get("WORKCENTERNAME"), record) - _bucket_append(index["by_family"], record.get("RESOURCEFAMILYNAME"), record) - _bucket_append(index["by_department"], record.get("PJ_DEPARTMENT"), record) - _bucket_append(index["by_location"], record.get("LOCATIONNAME"), record) + _bucket_append(index["by_workcenter"], record.get("WORKCENTERNAME"), row_position) + _bucket_append(index["by_family"], record.get("RESOURCEFAMILYNAME"), row_position) + _bucket_append(index["by_department"], record.get("PJ_DEPARTMENT"), row_position) + _bucket_append(index["by_location"], record.get("LOCATIONNAME"), row_position) - index["by_is_production"]["1" if _is_truthy_flag(record.get("PJ_ISPRODUCTION")) else "0"].append(record) - index["by_is_key"]["1" if _is_truthy_flag(record.get("PJ_ISKEY")) else "0"].append(record) - index["by_is_monitor"]["1" if 
_is_truthy_flag(record.get("PJ_ISMONITOR")) else "0"].append(record) + index["by_is_production"][TRUE_BUCKET if _is_truthy_flag(record.get("PJ_ISPRODUCTION")) else FALSE_BUCKET].append(int(row_position)) + index["by_is_key"][TRUE_BUCKET if _is_truthy_flag(record.get("PJ_ISKEY")) else FALSE_BUCKET].append(int(row_position)) + index["by_is_monitor"][TRUE_BUCKET if _is_truthy_flag(record.get("PJ_ISMONITOR")) else FALSE_BUCKET].append(int(row_position)) + + bucket_entries = ( + sum(len(v) for v in index["by_workcenter"].values()) + + sum(len(v) for v in index["by_family"].values()) + + sum(len(v) for v in index["by_department"].values()) + + sum(len(v) for v in index["by_location"].values()) + + len(index["by_is_production"][TRUE_BUCKET]) + + len(index["by_is_production"][FALSE_BUCKET]) + + len(index["by_is_key"][TRUE_BUCKET]) + + len(index["by_is_key"][FALSE_BUCKET]) + + len(index["by_is_monitor"][TRUE_BUCKET]) + + len(index["by_is_monitor"][FALSE_BUCKET]) + ) + frame_bytes = _estimate_dataframe_bytes(normalized_df) + index_bytes = _estimate_index_bytes(index) + amplification_ratio = round( + (frame_bytes + index_bytes) / max(frame_bytes, 1), + 4, + ) + index["memory"] = { + "frame_bytes": int(frame_bytes), + "index_bytes": int(index_bytes), + "records_json_bytes": 0, # kept for backward-compatible telemetry shape + "bucket_entries": int(bucket_entries), + "amplification_ratio": amplification_ratio, + "representation": "dataframe+row-index", + } return index def _index_matches( - current: Dict[str, Any], + current: ResourceIndex, *, source: str, - version: Optional[str], + version: str | None, row_count: int, ) -> bool: if not current.get("ready"): @@ -193,8 +332,8 @@ def _ensure_resource_index( df: pd.DataFrame, *, source: str, - version: Optional[str] = None, - updated_at: Optional[str] = None, + version: str | None = None, + updated_at: str | None = None, ) -> None: global _resource_index with _resource_index_lock: @@ -212,12 +351,12 @@ def _ensure_resource_index( 
         _resource_index = new_index


-def _get_resource_index() -> Dict[str, Any]:
+def _get_resource_index() -> ResourceIndex:
     with _resource_index_lock:
         return _resource_index


-def _get_cache_meta(client=None) -> Tuple[Optional[str], Optional[str]]:
+def _get_cache_meta(client=None) -> tuple[str | None, str | None]:
     redis_client = client or get_redis_client()
     if redis_client is None:
         return None, None
@@ -244,31 +383,59 @@ def _redis_data_available(client=None) -> bool:
     return False


-def _pick_bucket_records(
-    bucket: Dict[str, List[Dict[str, Any]]],
-    keys: List[Any],
-) -> List[Dict[str, Any]]:
-    seen: set[str] = set()
-    result: List[Dict[str, Any]] = []
+def _pick_bucket_positions(
+    bucket: PositionBucket,
+    keys: list[Any],
+) -> list[RowPosition]:
+    seen: set[int] = set()
+    result: list[int] = []
     for key in keys:
-        for record in bucket.get(str(key), []):
-            rid = record.get("RESOURCEID")
-            rid_key = str(rid) if rid is not None else str(id(record))
-            if rid_key in seen:
+        for row_position in bucket.get(str(key), []):
+            normalized = int(row_position)
+            if normalized in seen:
                 continue
-            seen.add(rid_key)
-            result.append(record)
+            seen.add(normalized)
+            result.append(normalized)
     return result
+
+
+def _records_from_positions(df: pd.DataFrame, positions: list[RowPosition]) -> list[ResourceRecord]:
+    if not positions:
+        return []
+    unique_positions = sorted({int(pos) for pos in positions if 0 <= int(pos) < len(df)})
+    if not unique_positions:
+        return []
+    return df.iloc[unique_positions].to_dict(orient='records')
+
+
+def _records_from_index(index: ResourceIndex, positions: list[RowPosition] | None = None) -> list[ResourceRecord]:
+    if not index.get("ready"):
+        return []
+    df = _resource_df_cache.get(RESOURCE_DF_CACHE_KEY)
+    if df is None:
+        legacy_records = index.get("records")
+        if isinstance(legacy_records, list):
+            if positions is None:
+                return list(legacy_records)
+            selected = [legacy_records[int(pos)] for pos in positions if 0 <= int(pos) < len(legacy_records)]
+            return selected
+        return []
+    selected_positions = positions if positions is not None else index.get("all_positions", [])
+    if not selected_positions:
+        selected_positions = list(range(len(df)))
+    return _records_from_positions(df, selected_positions)
+

 # ============================================================
 # Configuration
 # ============================================================

 RESOURCE_CACHE_ENABLED = os.getenv('RESOURCE_CACHE_ENABLED', 'true').lower() == 'true'
-RESOURCE_SYNC_INTERVAL = int(os.getenv('RESOURCE_SYNC_INTERVAL', '14400'))  # 4 hours
+RESOURCE_SYNC_INTERVAL = int(
+    os.getenv('RESOURCE_SYNC_INTERVAL', str(DEFAULT_RESOURCE_SYNC_INTERVAL_SECONDS))
+)
 RESOURCE_INDEX_VERSION_CHECK_INTERVAL = int(
-    os.getenv('RESOURCE_INDEX_VERSION_CHECK_INTERVAL', '5')
-)  # seconds
+    os.getenv('RESOURCE_INDEX_VERSION_CHECK_INTERVAL', str(DEFAULT_INDEX_VERSION_CHECK_INTERVAL_SECONDS))
+)

 # Redis key helpers
 def _get_key(key: str) -> str:
@@ -313,14 +480,14 @@ def _build_filter_builder() -> QueryBuilder:
     return builder


-def _load_from_oracle() -> Optional[pd.DataFrame]:
+def _load_from_oracle() -> pd.DataFrame | None:
     """從 Oracle 載入全表資料(套用全域篩選).

     Returns:
         DataFrame with all columns, or None if query failed.
     """
     builder = _build_filter_builder()
-    builder.base_sql = "SELECT * FROM DWH.DW_MES_RESOURCE {{ WHERE_CLAUSE }}"
+    builder.base_sql = RESOURCE_BASE_SELECT_TEMPLATE
     sql, params = builder.build()

     try:
@@ -333,14 +500,14 @@
     return None


-def _get_version_from_oracle() -> Optional[str]:
+def _get_version_from_oracle() -> str | None:
     """取得 Oracle 資料版本(MAX(LASTCHANGEDATE)).

     Returns:
         Version string (ISO format), or None if query failed.
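The resource-cache hunks above replace bucket lists of full record dicts with integer row positions that are materialized lazily through `DataFrame.iloc`, so the index stores only small ints while the DataFrame stays the single copy of the data. A minimal, self-contained sketch of that technique — `build_position_index` and `records_for` are illustrative names, not the service's actual helpers:

```python
from typing import Any

import pandas as pd


def build_position_index(df: pd.DataFrame, column: str) -> dict[str, list[int]]:
    """Map each distinct column value to the row positions that hold it."""
    index: dict[str, list[int]] = {}
    for pos, value in enumerate(df[column].tolist()):
        index.setdefault(str(value), []).append(pos)
    return index


def records_for(df: pd.DataFrame, positions: list[int]) -> list[dict[str, Any]]:
    """Materialize dicts only for the selected rows via positional lookup."""
    unique = sorted({p for p in positions if 0 <= p < len(df)})
    return df.iloc[unique].to_dict(orient="records")


df = pd.DataFrame({
    "RESOURCEID": ["R1", "R2", "R3"],
    "WORKCENTERNAME": ["WC-A", "WC-B", "WC-A"],
})
by_workcenter = build_position_index(df, "WORKCENTERNAME")
rows = records_for(df, by_workcenter.get("WC-A", []))  # only WC-A rows materialized
```

The pay-off is the same one the diff's telemetry measures: the index's memory cost is proportional to row count, not to record width.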
""" builder = _build_filter_builder() - builder.base_sql = "SELECT MAX(LASTCHANGEDATE) as VERSION FROM DWH.DW_MES_RESOURCE {{ WHERE_CLAUSE }}" + builder.base_sql = RESOURCE_VERSION_SELECT_TEMPLATE sql, params = builder.build() try: @@ -361,7 +528,7 @@ def _get_version_from_oracle() -> Optional[str]: # Internal: Redis Functions # ============================================================ -def _get_version_from_redis() -> Optional[str]: +def _get_version_from_redis() -> str | None: """取得 Redis 快取版本. Returns: @@ -411,7 +578,7 @@ def _sync_to_redis(df: pd.DataFrame, version: str) -> bool: pipe.execute() # Invalidate process-level cache so next request picks up new data - _resource_df_cache.invalidate("resource_data") + _resource_df_cache.invalidate(RESOURCE_DF_CACHE_KEY) _invalidate_resource_index() logger.info(f"Resource cache synced: {len(df)} rows, version={version}") @@ -421,7 +588,7 @@ def _sync_to_redis(df: pd.DataFrame, version: str) -> bool: return False -def _get_cached_data() -> Optional[pd.DataFrame]: +def _get_cached_data() -> pd.DataFrame | None: """Get cached resource data from Redis with process-level caching. Uses a two-tier cache strategy: @@ -433,21 +600,25 @@ def _get_cached_data() -> Optional[pd.DataFrame]: Returns: DataFrame with resource data, or None if cache miss. 
""" - cache_key = "resource_data" + cache_key = RESOURCE_DF_CACHE_KEY # Tier 1: Check process-level cache first (fast path) cached_df = _resource_df_cache.get(cache_key) if cached_df is not None: - if not _get_resource_index().get("ready"): - version, updated_at = _get_cache_meta() - _ensure_resource_index( - cached_df, - source="redis", - version=version, - updated_at=updated_at, - ) - logger.debug(f"Process cache hit: {len(cached_df)} rows") - return cached_df + if REDIS_ENABLED and RESOURCE_CACHE_ENABLED and not _redis_data_available(): + _resource_df_cache.invalidate(cache_key) + _invalidate_resource_index() + else: + if not _get_resource_index().get("ready"): + version, updated_at = _get_cache_meta() + _ensure_resource_index( + cached_df, + source="redis", + version=version, + updated_at=updated_at, + ) + logger.debug(f"Process cache hit: {len(cached_df)} rows") + return cached_df # Tier 2: Parse from Redis (slow path - needs lock) if not REDIS_ENABLED or not RESOURCE_CACHE_ENABLED: @@ -568,7 +739,7 @@ def init_cache() -> None: logger.error(f"Failed to init resource cache: {e}") -def get_cache_status() -> Dict[str, Any]: +def get_cache_status() -> dict[str, Any]: """取得快取狀態資訊. 
Returns: @@ -611,9 +782,10 @@ def get_cache_status() -> Dict[str, Any]: # Query API # ============================================================ -def get_resource_index_status() -> Dict[str, Any]: +def get_resource_index_status() -> dict[str, Any]: """Get process-level derived index telemetry.""" index = _get_resource_index() + memory = index.get("memory") or {} built_at = index.get("built_at") age_seconds = None if built_at: @@ -630,19 +802,32 @@ def get_resource_index_status() -> Dict[str, Any]: "built_at": built_at, "count": int(index.get("count", 0)), "age_seconds": round(age_seconds, 3) if age_seconds is not None else None, + "memory": { + "frame_bytes": int(memory.get("frame_bytes", 0)), + "index_bytes": int(memory.get("index_bytes", 0)), + "records_json_bytes": int(memory.get("records_json_bytes", 0)), + "bucket_entries": int(memory.get("bucket_entries", 0)), + "amplification_ratio": float(memory.get("amplification_ratio", 0.0)), + "representation": str(memory.get("representation", "unknown")), + }, } -def get_resource_index_snapshot() -> Dict[str, Any]: +def get_resource_index_snapshot() -> ResourceIndex: """Get derived resource index snapshot, rebuilding if needed.""" index = _get_resource_index() if index.get("ready"): if index.get("source") == "redis": + if not _redis_data_available(): + _resource_df_cache.invalidate(RESOURCE_DF_CACHE_KEY) + _invalidate_resource_index() + index = _get_resource_index() + # If Redis metadata version is missing, verify payload existence on every call. # This avoids serving stale in-process index when Redis payload is evicted. 
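The `memory` telemetry added in these hunks reports how much the derived index inflates the cached frame's footprint (`amplification_ratio = (frame_bytes + index_bytes) / frame_bytes`). A standalone sketch of that computation, using pandas' deep memory accounting; the function name is illustrative:

```python
import pandas as pd


def amplification_telemetry(df: pd.DataFrame, index_bytes: int) -> dict:
    """Report the relative memory overhead a derived index adds on top of a frame."""
    frame_bytes = int(df.memory_usage(index=True, deep=True).sum())
    ratio = round((frame_bytes + index_bytes) / max(frame_bytes, 1), 4)
    return {
        "frame_bytes": frame_bytes,
        "index_bytes": int(index_bytes),
        "amplification_ratio": ratio,  # 1.0 means the index costs nothing extra
    }


df = pd.DataFrame({"RESOURCEID": ["R1", "R2"], "QTY": [10, 20]})
telemetry = amplification_telemetry(df, index_bytes=512)
```

`max(frame_bytes, 1)` guards the empty-frame case, the same division-by-zero guard the diff applies.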
-            if not index.get("version"):
+            if index.get("ready") and not index.get("version"):
                 if not _redis_data_available():
-                    _resource_df_cache.invalidate("resource_data")
+                    _resource_df_cache.invalidate(RESOURCE_DF_CACHE_KEY)
                     _invalidate_resource_index()
                     index = _get_resource_index()
                 else:
@@ -661,7 +846,7 @@
                         current_version,
                         latest_version,
                     )
-                    _resource_df_cache.invalidate("resource_data")
+                    _resource_df_cache.invalidate(RESOURCE_DF_CACHE_KEY)
                     _invalidate_resource_index()
                     index = _get_resource_index()
             else:
@@ -678,6 +863,7 @@
     df = _get_cached_data()
     if df is not None:
+        _resource_df_cache.set(RESOURCE_DF_CACHE_KEY, df.reset_index(drop=True))
         version, updated_at = _get_cache_meta()
         _ensure_resource_index(
             df,
@@ -690,6 +876,8 @@
     logger.info("Resource cache miss while building index, falling back to Oracle")
     oracle_df = _load_from_oracle()
     if oracle_df is None:
+        _resource_df_cache.invalidate(RESOURCE_DF_CACHE_KEY)
+        _invalidate_resource_index()
         return _new_empty_index()

     _ensure_resource_index(
@@ -698,9 +886,11 @@
         version=None,
         updated_at=datetime.now().isoformat(),
     )
+    _resource_df_cache.set(RESOURCE_DF_CACHE_KEY, oracle_df.reset_index(drop=True))
     return _get_resource_index()


-def get_all_resources() -> List[Dict]:
+
+def get_all_resources() -> list[ResourceRecord]:
     """取得所有快取中的設備資料(全欄位).

     Falls back to Oracle if cache unavailable.

     Returns:
         List of resource dicts.
     """
     index = get_resource_index_snapshot()
-    records = index.get("records", [])
-    return list(records)
+    return _records_from_index(index)


-def get_resource_by_id(resource_id: str) -> Optional[Dict]:
+def get_resource_by_id(resource_id: str) -> ResourceRecord | None:
     """依 RESOURCEID 取得單筆設備資料.

     Args:
@@ -725,10 +914,12 @@
     if not resource_id:
         return None
     index = get_resource_index_snapshot()
-    by_id = index.get("by_resource_id", {})
-    row = by_id.get(str(resource_id))
-    if row is not None:
-        return row
+    by_id: dict[str, RowPosition] = index.get("by_resource_id", {})
+    row_position = by_id.get(str(resource_id))
+    if row_position is not None:
+        rows = _records_from_index(index, [int(row_position)])
+        if rows:
+            return rows[0]

     # Backward-compatible fallback for call sites/tests that patch get_all_resources.
     target = str(resource_id)
@@ -738,7 +929,7 @@
     return None


-def get_resources_by_ids(resource_ids: List[str]) -> List[Dict]:
+def get_resources_by_ids(resource_ids: list[str]) -> list[ResourceRecord]:
     """依 RESOURCEID 清單批次取得設備資料.

     Args:
@@ -747,20 +938,28 @@
     Returns:
         List of matching resource dicts.
     """
+    index = get_resource_index_snapshot()
+    by_id: dict[str, RowPosition] = index.get("by_resource_id", {})
+    positions = [by_id[str(resource_id)] for resource_id in resource_ids if str(resource_id) in by_id]
+    if positions:
+        rows = _records_from_index(index, positions)
+        if rows:
+            return rows
+
+    # Backward-compatible fallback for call sites/tests that patch get_all_resources.
id_set = set(resource_ids) - resources = get_all_resources() - return [r for r in resources if r.get('RESOURCEID') in id_set] + return [r for r in get_all_resources() if r.get('RESOURCEID') in id_set] def get_resources_by_filter( - workcenters: Optional[List[str]] = None, - families: Optional[List[str]] = None, - departments: Optional[List[str]] = None, - locations: Optional[List[str]] = None, - is_production: Optional[bool] = None, - is_key: Optional[bool] = None, - is_monitor: Optional[bool] = None, -) -> List[Dict]: + workcenters: list[str] | None = None, + families: list[str] | None = None, + departments: list[str] | None = None, + locations: list[str] | None = None, + is_production: bool | None = None, + is_key: bool | None = None, + is_monitor: bool | None = None, +) -> list[ResourceRecord]: """依條件篩選設備資料(在 Python 端篩選). Args: @@ -775,42 +974,79 @@ def get_resources_by_filter( Returns: List of matching resource dicts. """ - resources = get_all_resources() - - result = [] - for r in resources: - # Apply filters - if workcenters and r.get('WORKCENTERNAME') not in workcenters: - continue - if families and r.get('RESOURCEFAMILYNAME') not in families: - continue - if departments and r.get('PJ_DEPARTMENT') not in departments: - continue - if locations and r.get('LOCATIONNAME') not in locations: - continue - if is_production is not None: - val = r.get('PJ_ISPRODUCTION') - if (val == 1) != is_production: + def _filter_from_records(resources: list[ResourceRecord]) -> list[ResourceRecord]: + result: list[ResourceRecord] = [] + for r in resources: + if workcenters and r.get('WORKCENTERNAME') not in workcenters: continue - if is_key is not None: - val = r.get('PJ_ISKEY') - if (val == 1) != is_key: + if families and r.get('RESOURCEFAMILYNAME') not in families: continue - if is_monitor is not None: - val = r.get('PJ_ISMONITOR') - if (val == 1) != is_monitor: + if departments and r.get('PJ_DEPARTMENT') not in departments: continue + if locations and r.get('LOCATIONNAME') not 
in locations: + continue + if is_production is not None and (r.get('PJ_ISPRODUCTION') == 1) != is_production: + continue + if is_key is not None and (r.get('PJ_ISKEY') == 1) != is_key: + continue + if is_monitor is not None and (r.get('PJ_ISMONITOR') == 1) != is_monitor: + continue + result.append(r) + return result - result.append(r) + index = get_resource_index_snapshot() + if not index.get("ready"): + return _filter_from_records(get_all_resources()) + if _resource_df_cache.get(RESOURCE_DF_CACHE_KEY) is None: + return _filter_from_records(get_all_resources()) - return result + candidate_positions: set[int] = set(int(pos) for pos in index.get("all_positions", [])) + if not candidate_positions: + return [] + + def _intersect_with_positions(selected: list[int] | None) -> None: + nonlocal candidate_positions + if selected is None: + return + candidate_positions &= set(int(item) for item in selected) + + if workcenters: + _intersect_with_positions( + _pick_bucket_positions(index.get("by_workcenter", {}), workcenters) + ) + if families: + _intersect_with_positions( + _pick_bucket_positions(index.get("by_family", {}), families) + ) + if departments: + _intersect_with_positions( + _pick_bucket_positions(index.get("by_department", {}), departments) + ) + if locations: + _intersect_with_positions( + _pick_bucket_positions(index.get("by_location", {}), locations) + ) + if is_production is not None: + _intersect_with_positions( + index.get("by_is_production", {}).get(TRUE_BUCKET if is_production else FALSE_BUCKET, []) + ) + if is_key is not None: + _intersect_with_positions( + index.get("by_is_key", {}).get(TRUE_BUCKET if is_key else FALSE_BUCKET, []) + ) + if is_monitor is not None: + _intersect_with_positions( + index.get("by_is_monitor", {}).get(TRUE_BUCKET if is_monitor else FALSE_BUCKET, []) + ) + + return _records_from_index(index, sorted(candidate_positions)) # ============================================================ # Distinct Values API (for filters) # 
============================================================ -def get_distinct_values(column: str) -> List[str]: +def get_distinct_values(column: str) -> list[str]: """取得指定欄位的唯一值清單(排序後). Args: @@ -833,26 +1069,26 @@ def get_distinct_values(column: str) -> List[str]: return sorted(values) -def get_resource_families() -> List[str]: +def get_resource_families() -> list[str]: """取得型號清單(便捷方法).""" return get_distinct_values('RESOURCEFAMILYNAME') -def get_workcenters() -> List[str]: +def get_workcenters() -> list[str]: """取得站點清單(便捷方法).""" return get_distinct_values('WORKCENTERNAME') -def get_departments() -> List[str]: +def get_departments() -> list[str]: """取得部門清單(便捷方法).""" return get_distinct_values('PJ_DEPARTMENT') -def get_locations() -> List[str]: +def get_locations() -> list[str]: """取得區域清單(便捷方法).""" return get_distinct_values('LOCATIONNAME') -def get_vendors() -> List[str]: +def get_vendors() -> list[str]: """取得供應商清單(便捷方法).""" return get_distinct_values('VENDORNAME') diff --git a/src/mes_dashboard/services/sql_fragments.py b/src/mes_dashboard/services/sql_fragments.py new file mode 100644 index 0000000..a364b07 --- /dev/null +++ b/src/mes_dashboard/services/sql_fragments.py @@ -0,0 +1,46 @@ +# -*- coding: utf-8 -*- +"""Shared SQL fragments/constants for cache-oriented services. + +Centralizing common Oracle table/view references reduces drift across +resource/equipment cache implementations. +""" + +from __future__ import annotations + +RESOURCE_TABLE = "DWH.DW_MES_RESOURCE" +RESOURCE_BASE_SELECT_TEMPLATE = f"SELECT * FROM {RESOURCE_TABLE} {{ WHERE_CLAUSE }}" +RESOURCE_VERSION_SELECT_TEMPLATE = ( + f"SELECT MAX(LASTCHANGEDATE) as VERSION FROM {RESOURCE_TABLE} {{ WHERE_CLAUSE }}" +) + +EQUIPMENT_STATUS_VIEW = "DWH.DW_MES_EQUIPMENTSTATUS_WIP_V" +EQUIPMENT_STATUS_COLUMNS: tuple[str, ...] 
= ( + "RESOURCEID", + "EQUIPMENTID", + "OBJECTCATEGORY", + "EQUIPMENTASSETSSTATUS", + "EQUIPMENTASSETSSTATUSREASON", + "JOBORDER", + "JOBMODEL", + "JOBSTAGE", + "JOBID", + "JOBSTATUS", + "CREATEDATE", + "CREATEUSERNAME", + "CREATEUSER", + "TECHNICIANUSERNAME", + "TECHNICIANUSER", + "SYMPTOMCODE", + "CAUSECODE", + "REPAIRCODE", + "RUNCARDLOTID", + "LOTTRACKINQTY_PCS", + "LOTTRACKINTIME", + "LOTTRACKINEMPLOYEE", +) + +EQUIPMENT_STATUS_SELECT_SQL = ( + "SELECT\n " + + ",\n ".join(EQUIPMENT_STATUS_COLUMNS) + + f"\nFROM {EQUIPMENT_STATUS_VIEW}" +) diff --git a/src/mes_dashboard/services/wip_service.py b/src/mes_dashboard/services/wip_service.py index 60548ed..2c0fdc9 100644 --- a/src/mes_dashboard/services/wip_service.py +++ b/src/mes_dashboard/services/wip_service.py @@ -9,6 +9,7 @@ Now uses Redis cache when available, with fallback to Oracle direct query. import logging import threading +from collections import Counter from datetime import datetime from typing import Optional, Dict, List, Any @@ -32,6 +33,20 @@ logger = logging.getLogger('mes_dashboard.wip_service') _wip_search_index_lock = threading.Lock() _wip_search_index_cache: Dict[str, Dict[str, Any]] = {} +_wip_snapshot_lock = threading.Lock() +_wip_snapshot_cache: Dict[str, Dict[str, Any]] = {} +_wip_index_metrics_lock = threading.Lock() +_wip_index_metrics: Dict[str, Any] = { + "snapshot_hits": 0, + "snapshot_misses": 0, + "search_index_hits": 0, + "search_index_misses": 0, + "search_index_rebuilds": 0, + "search_index_incremental_updates": 0, + "search_index_reconciliation_fallbacks": 0, +} + +_EMPTY_INT_INDEX = np.array([], dtype=np.int64) def _safe_value(val): @@ -153,29 +168,373 @@ def _get_wip_cache_version() -> str: return f"{updated_at}|{sys_date}" -def _distinct_sorted_values(df: pd.DataFrame, column: str) -> List[str]: - if column not in df.columns: - return [] - series = df[column].dropna().astype(str) - if series.empty: - return [] - series = series[series.str.len() > 0] - if series.empty: - return 
[] - return series.drop_duplicates().sort_values().tolist() +def _increment_wip_metric(metric: str, value: int = 1) -> None: + with _wip_index_metrics_lock: + _wip_index_metrics[metric] = int(_wip_index_metrics.get(metric, 0)) + value + + +def _estimate_dataframe_bytes(df: pd.DataFrame) -> int: + if df is None: + return 0 + try: + return int(df.memory_usage(index=True, deep=True).sum()) + except Exception: + return 0 + + +def _estimate_counter_payload_bytes(counter: Counter) -> int: + total = 0 + for key, count in counter.items(): + total += len(str(key)) + 16 + int(count) + return total + + +def _normalize_text_value(value: Any) -> str: + if value is None: + return "" + if isinstance(value, float) and pd.isna(value): + return "" + text = str(value).strip() + return text + + +def _build_filter_mask( + df: pd.DataFrame, + *, + include_dummy: bool, + workorder: Optional[str] = None, + lotid: Optional[str] = None, +) -> pd.Series: + if df.empty: + return pd.Series(dtype=bool) + + mask = df['WORKORDER'].notna() + + if not include_dummy and 'LOTID' in df.columns: + mask &= ~df['LOTID'].astype(str).str.contains('DUMMY', case=False, na=False) + + if workorder and 'WORKORDER' in df.columns: + mask &= df['WORKORDER'].astype(str).str.contains(workorder, case=False, na=False) + + if lotid and 'LOTID' in df.columns: + mask &= df['LOTID'].astype(str).str.contains(lotid, case=False, na=False) + + return mask + + +def _build_value_index(df: pd.DataFrame, column: str) -> Dict[str, np.ndarray]: + if column not in df.columns or df.empty: + return {} + grouped = df.groupby(column, dropna=True, sort=False).indices + return {str(key): np.asarray(indices, dtype=np.int64) for key, indices in grouped.items()} + + +def _intersect_positions(current: Optional[np.ndarray], candidate: Optional[np.ndarray]) -> np.ndarray: + if candidate is None: + return _EMPTY_INT_INDEX + if current is None: + return candidate + if len(current) == 0 or len(candidate) == 0: + return _EMPTY_INT_INDEX + return 
np.intersect1d(current, candidate, assume_unique=False) + + +def _select_with_snapshot_indexes( + include_dummy: bool = False, + workorder: Optional[str] = None, + lotid: Optional[str] = None, + package: Optional[str] = None, + pj_type: Optional[str] = None, + workcenter: Optional[str] = None, + status: Optional[str] = None, + hold_type: Optional[str] = None, +) -> Optional[pd.DataFrame]: + snapshot = _get_wip_snapshot(include_dummy=include_dummy) + if snapshot is None: + return None + + df = snapshot["frame"] + indexes = snapshot["indexes"] + selected_positions: Optional[np.ndarray] = None + + if workcenter: + selected_positions = _intersect_positions( + selected_positions, + indexes["workcenter"].get(str(workcenter)), + ) + if package: + selected_positions = _intersect_positions( + selected_positions, + indexes["package"].get(str(package)), + ) + if pj_type: + selected_positions = _intersect_positions( + selected_positions, + indexes["pj_type"].get(str(pj_type)), + ) + if status: + selected_positions = _intersect_positions( + selected_positions, + indexes["wip_status"].get(str(status).upper()), + ) + if hold_type: + selected_positions = _intersect_positions( + selected_positions, + indexes["hold_type"].get(str(hold_type).lower()), + ) + + if selected_positions is None: + result = df + elif len(selected_positions) == 0: + result = df.iloc[0:0] + else: + result = df.iloc[selected_positions] + + if workorder: + result = result[result['WORKORDER'].astype(str).str.contains(workorder, case=False, na=False)] + if lotid: + result = result[result['LOTID'].astype(str).str.contains(lotid, case=False, na=False)] + return result + + +def _build_search_signatures(df: pd.DataFrame) -> tuple[Counter, Dict[str, tuple[str, str, str, str]]]: + if df.empty: + return Counter(), {} + + workorders = df.get("WORKORDER", pd.Series(index=df.index, dtype=object)).map(_normalize_text_value) + lotids = df.get("LOTID", pd.Series(index=df.index, dtype=object)).map(_normalize_text_value) + 
packages = df.get("PACKAGE_LEF", pd.Series(index=df.index, dtype=object)).map(_normalize_text_value) + types = df.get("PJ_TYPE", pd.Series(index=df.index, dtype=object)).map(_normalize_text_value) + + signatures = ( + workorders + + "\x1f" + + lotids + + "\x1f" + + packages + + "\x1f" + + types + ).tolist() + signature_counter = Counter(signatures) + + signature_fields: Dict[str, tuple[str, str, str, str]] = {} + for signature, wo, lot, pkg, pj in zip(signatures, workorders, lotids, packages, types): + if signature not in signature_fields: + signature_fields[signature] = (wo, lot, pkg, pj) + return signature_counter, signature_fields + + +def _build_field_counters( + signature_counter: Counter, + signature_fields: Dict[str, tuple[str, str, str, str]], +) -> Dict[str, Counter]: + counters = { + "workorders": Counter(), + "lotids": Counter(), + "packages": Counter(), + "types": Counter(), + } + for signature, count in signature_counter.items(): + wo, lot, pkg, pj = signature_fields.get(signature, ("", "", "", "")) + if wo: + counters["workorders"][wo] += count + if lot: + counters["lotids"][lot] += count + if pkg: + counters["packages"][pkg] += count + if pj: + counters["types"][pj] += count + return counters + + +def _materialize_search_payload( + *, + version: str, + row_count: int, + signature_counter: Counter, + field_counters: Dict[str, Counter], + mode: str, + added_rows: int = 0, + removed_rows: int = 0, + drift_ratio: float = 0.0, +) -> Dict[str, Any]: + workorders = sorted(field_counters["workorders"].keys()) + lotids = sorted(field_counters["lotids"].keys()) + packages = sorted(field_counters["packages"].keys()) + types = sorted(field_counters["types"].keys()) + memory_bytes = ( + _estimate_counter_payload_bytes(field_counters["workorders"]) + + _estimate_counter_payload_bytes(field_counters["lotids"]) + + _estimate_counter_payload_bytes(field_counters["packages"]) + + _estimate_counter_payload_bytes(field_counters["types"]) + ) + return { + "version": 
version, + "built_at": datetime.now().isoformat(), + "row_count": int(row_count), + "workorders": workorders, + "lotids": lotids, + "packages": packages, + "types": types, + "sync_mode": mode, + "sync_added_rows": int(added_rows), + "sync_removed_rows": int(removed_rows), + "drift_ratio": round(float(drift_ratio), 6), + "memory_bytes": int(memory_bytes), + "_signature_counter": dict(signature_counter), + "_field_counters": { + "workorders": dict(field_counters["workorders"]), + "lotids": dict(field_counters["lotids"]), + "packages": dict(field_counters["packages"]), + "types": dict(field_counters["types"]), + }, + } def _build_wip_search_index(df: pd.DataFrame, include_dummy: bool) -> Dict[str, Any]: filtered = _filter_base_conditions(df, include_dummy=include_dummy) - return { - "built_at": datetime.now().isoformat(), - "row_count": len(filtered), - "workorders": _distinct_sorted_values(filtered, "WORKORDER"), - "lotids": _distinct_sorted_values(filtered, "LOTID"), - "packages": _distinct_sorted_values(filtered, "PACKAGE_LEF"), - "types": _distinct_sorted_values(filtered, "PJ_TYPE"), + signatures, signature_fields = _build_search_signatures(filtered) + field_counters = _build_field_counters(signatures, signature_fields) + return _materialize_search_payload( + version=_get_wip_cache_version(), + row_count=len(filtered), + signature_counter=signatures, + field_counters=field_counters, + mode="full", + ) + + +def _try_incremental_search_sync( + previous: Dict[str, Any], + *, + version: str, + row_count: int, + signature_counter: Counter, + signature_fields: Dict[str, tuple[str, str, str, str]], +) -> Optional[Dict[str, Any]]: + if not previous: + return None + old_signature_counter = Counter(previous.get("_signature_counter") or {}) + old_field_counters_raw = previous.get("_field_counters") or {} + if not old_signature_counter or not old_field_counters_raw: + return None + + added = signature_counter - old_signature_counter + removed = old_signature_counter - 
signature_counter + total_delta = sum(added.values()) + sum(removed.values()) + drift_ratio = total_delta / max(int(row_count), 1) + if drift_ratio > 0.6: + _increment_wip_metric("search_index_reconciliation_fallbacks") + return None + + field_counters = { + "workorders": Counter(old_field_counters_raw.get("workorders") or {}), + "lotids": Counter(old_field_counters_raw.get("lotids") or {}), + "packages": Counter(old_field_counters_raw.get("packages") or {}), + "types": Counter(old_field_counters_raw.get("types") or {}), } + for signature, count in added.items(): + wo, lot, pkg, pj = signature_fields.get(signature, ("", "", "", "")) + if wo: + field_counters["workorders"][wo] += count + if lot: + field_counters["lotids"][lot] += count + if pkg: + field_counters["packages"][pkg] += count + if pj: + field_counters["types"][pj] += count + + previous_fields = { + sig: tuple(str(v) for v in sig.split("\x1f", 3)) + for sig in old_signature_counter.keys() + } + for signature, count in removed.items(): + wo, lot, pkg, pj = previous_fields.get(signature, ("", "", "", "")) + if wo: + field_counters["workorders"][wo] -= count + if field_counters["workorders"][wo] <= 0: + field_counters["workorders"].pop(wo, None) + if lot: + field_counters["lotids"][lot] -= count + if field_counters["lotids"][lot] <= 0: + field_counters["lotids"].pop(lot, None) + if pkg: + field_counters["packages"][pkg] -= count + if field_counters["packages"][pkg] <= 0: + field_counters["packages"].pop(pkg, None) + if pj: + field_counters["types"][pj] -= count + if field_counters["types"][pj] <= 0: + field_counters["types"].pop(pj, None) + + _increment_wip_metric("search_index_incremental_updates") + return _materialize_search_payload( + version=version, + row_count=row_count, + signature_counter=signature_counter, + field_counters=field_counters, + mode="incremental", + added_rows=sum(added.values()), + removed_rows=sum(removed.values()), + drift_ratio=drift_ratio, + ) + + +def _build_wip_snapshot(df: 
pd.DataFrame, include_dummy: bool, version: str) -> Dict[str, Any]: + filtered = _filter_base_conditions(df, include_dummy=include_dummy) + filtered = _add_wip_status_columns(filtered).reset_index(drop=True) + + hold_type_series = pd.Series(index=filtered.index, dtype=object) + if not filtered.empty: + hold_type_series = pd.Series("", index=filtered.index, dtype=object) + hold_type_series.loc[filtered["IS_QUALITY_HOLD"]] = "quality" + hold_type_series.loc[filtered["IS_NON_QUALITY_HOLD"]] = "non-quality" + + indexes = { + "workcenter": _build_value_index(filtered, "WORKCENTER_GROUP"), + "package": _build_value_index(filtered, "PACKAGE_LEF"), + "pj_type": _build_value_index(filtered, "PJ_TYPE"), + "wip_status": _build_value_index(filtered, "WIP_STATUS"), + "hold_type": _build_value_index(pd.DataFrame({"HOLD_TYPE": hold_type_series}), "HOLD_TYPE"), + } + + exact_bucket_count = sum(len(bucket) for bucket in indexes.values()) + return { + "version": version, + "built_at": datetime.now().isoformat(), + "row_count": int(len(filtered)), + "frame": filtered, + "indexes": indexes, + "frame_bytes": _estimate_dataframe_bytes(filtered), + "index_bucket_count": int(exact_bucket_count), + } + + +def _get_wip_snapshot(include_dummy: bool) -> Optional[Dict[str, Any]]: + cache_key = "with_dummy" if include_dummy else "without_dummy" + version = _get_wip_cache_version() + + with _wip_snapshot_lock: + cached = _wip_snapshot_cache.get(cache_key) + if cached and cached.get("version") == version: + _increment_wip_metric("snapshot_hits") + return cached + + _increment_wip_metric("snapshot_misses") + df = _get_wip_dataframe() + if df is None: + return None + + snapshot = _build_wip_snapshot(df, include_dummy=include_dummy, version=version) + with _wip_snapshot_lock: + existing = _wip_snapshot_cache.get(cache_key) + if existing and existing.get("version") == version: + _increment_wip_metric("snapshot_hits") + return existing + _wip_snapshot_cache[cache_key] = snapshot + return snapshot + 
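The WIP search-index hunks keep per-signature `Counter`s so a refresh can apply only the added/removed delta, falling back to a full rebuild when drift is too large (the diff uses a 0.6 ratio). A self-contained sketch of that incremental-sync-with-drift-fallback pattern — `incremental_sync` is an illustrative name, not the service's API:

```python
from collections import Counter

# Mirrors the drift threshold used in the diff above; above it, a full rebuild
# is cheaper and safer than replaying a large delta.
DRIFT_FALLBACK_RATIO = 0.6


def incremental_sync(old: Counter, new: Counter) -> tuple[str, Counter]:
    """Merge old counters toward `new`, or signal a full rebuild on heavy drift."""
    added = new - old        # signatures that appeared or grew
    removed = old - new      # signatures that disappeared or shrank
    total_rows = max(sum(new.values()), 1)
    drift = (sum(added.values()) + sum(removed.values())) / total_rows
    if drift > DRIFT_FALLBACK_RATIO:
        return "full", new   # delta too large: rebuild from scratch
    merged = old + added - removed  # Counter arithmetic drops non-positive counts
    return "incremental", merged


# Signatures join the indexed fields with an unlikely separator, as in the diff.
old = Counter({"WO1\x1fLOT1": 5, "WO2\x1fLOT2": 1})
new = Counter({"WO1\x1fLOT1": 5, "WO3\x1fLOT3": 1})
mode, merged = incremental_sync(old, new)  # small delta -> incremental merge
```

Because `Counter` subtraction discards non-positive counts, fully removed signatures vanish from `merged` without explicit cleanup, which is what keeps the incremental path equivalent to a rebuild.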
 def _get_wip_search_index(include_dummy: bool) -> Optional[Dict[str, Any]]:
     cache_key = "with_dummy" if include_dummy else "without_dummy"
@@ -184,14 +543,37 @@ def _get_wip_search_index(include_dummy: bool) -> Optional[Dict[str, Any]]:
     with _wip_search_index_lock:
         cached = _wip_search_index_cache.get(cache_key)
         if cached and cached.get("version") == version:
+            _increment_wip_metric("search_index_hits")
             return cached
 
-    df = _get_wip_dataframe()
-    if df is None:
+    _increment_wip_metric("search_index_misses")
+    snapshot = _get_wip_snapshot(include_dummy=include_dummy)
+    if snapshot is None:
         return None
 
-    index_payload = _build_wip_search_index(df, include_dummy=include_dummy)
-    index_payload["version"] = version
+    filtered = snapshot["frame"]
+    signature_counter, signature_fields = _build_search_signatures(filtered)
+
+    with _wip_search_index_lock:
+        previous = _wip_search_index_cache.get(cache_key)
+
+    index_payload = _try_incremental_search_sync(
+        previous or {},
+        version=version,
+        row_count=int(snapshot.get("row_count", 0)),
+        signature_counter=signature_counter,
+        signature_fields=signature_fields,
+    )
+    if index_payload is None:
+        field_counters = _build_field_counters(signature_counter, signature_fields)
+        index_payload = _materialize_search_payload(
+            version=version,
+            row_count=int(snapshot.get("row_count", 0)),
+            signature_counter=signature_counter,
+            field_counters=field_counters,
+            mode="full",
+        )
+        _increment_wip_metric("search_index_rebuilds")
 
     with _wip_search_index_lock:
         _wip_search_index_cache[cache_key] = index_payload
@@ -207,9 +589,9 @@ def _search_values_from_index(values: List[str], query: str, limit: int) -> List
 
 def get_wip_search_index_status() -> Dict[str, Any]:
     """Expose WIP derived search-index freshness for diagnostics."""
     with _wip_search_index_lock:
-        snapshot = {}
+        search_snapshot = {}
         for key, payload in _wip_search_index_cache.items():
-            snapshot[key] = {
+            search_snapshot[key] = {
                 "version": payload.get("version"),
                 "built_at": payload.get("built_at"),
                 "row_count": payload.get("row_count", 0),
@@ -217,8 +599,39 @@ def get_wip_search_index_status() -> Dict[str, Any]:
                 "lotids": len(payload.get("lotids", [])),
                 "packages": len(payload.get("packages", [])),
                 "types": len(payload.get("types", [])),
+                "sync_mode": payload.get("sync_mode"),
+                "sync_added_rows": payload.get("sync_added_rows", 0),
+                "sync_removed_rows": payload.get("sync_removed_rows", 0),
+                "drift_ratio": payload.get("drift_ratio", 0.0),
+                "memory_bytes": payload.get("memory_bytes", 0),
             }
-    return snapshot
+    with _wip_snapshot_lock:
+        frame_snapshot = {}
+        for key, payload in _wip_snapshot_cache.items():
+            frame_snapshot[key] = {
+                "version": payload.get("version"),
+                "built_at": payload.get("built_at"),
+                "row_count": payload.get("row_count", 0),
+                "frame_bytes": payload.get("frame_bytes", 0),
+                "index_bucket_count": payload.get("index_bucket_count", 0),
+            }
+    with _wip_index_metrics_lock:
+        metrics = dict(_wip_index_metrics)
+
+    total_frame_bytes = sum(item.get("frame_bytes", 0) for item in frame_snapshot.values())
+    total_search_bytes = sum(item.get("memory_bytes", 0) for item in search_snapshot.values())
+    amplification_ratio = round((total_frame_bytes + total_search_bytes) / max(total_frame_bytes, 1), 4)
+
+    return {
+        "derived_search_index": search_snapshot,
+        "derived_frame_snapshot": frame_snapshot,
+        "metrics": metrics,
+        "memory": {
+            "frame_bytes_total": int(total_frame_bytes),
+            "search_bytes_total": int(total_search_bytes),
+            "amplification_ratio": amplification_ratio,
+        },
+    }
 
 
 def _add_wip_status_columns(df: pd.DataFrame) -> pd.DataFrame:
@@ -235,24 +648,31 @@ def _add_wip_status_columns(df: pd.DataFrame) -> pd.DataFrame:
     Returns:
         DataFrame with additional status columns
     """
-    df = df.copy()
+    required = {'WIP_STATUS', 'IS_QUALITY_HOLD', 'IS_NON_QUALITY_HOLD'}
+    if required.issubset(df.columns):
+        return df
+
+    working = df.copy()
 
     # Ensure numeric columns
-    df['EQUIPMENTCOUNT'] = pd.to_numeric(df['EQUIPMENTCOUNT'], errors='coerce').fillna(0)
-    df['CURRENTHOLDCOUNT'] = pd.to_numeric(df['CURRENTHOLDCOUNT'], errors='coerce').fillna(0)
-    df['QTY'] = pd.to_numeric(df['QTY'], errors='coerce').fillna(0)
+    working['EQUIPMENTCOUNT'] = pd.to_numeric(working['EQUIPMENTCOUNT'], errors='coerce').fillna(0)
+    working['CURRENTHOLDCOUNT'] = pd.to_numeric(working['CURRENTHOLDCOUNT'], errors='coerce').fillna(0)
+    working['QTY'] = pd.to_numeric(working['QTY'], errors='coerce').fillna(0)
 
     # Compute WIP status
-    df['WIP_STATUS'] = 'QUEUE'  # Default
-    df.loc[df['EQUIPMENTCOUNT'] > 0, 'WIP_STATUS'] = 'RUN'
-    df.loc[(df['EQUIPMENTCOUNT'] == 0) & (df['CURRENTHOLDCOUNT'] > 0), 'WIP_STATUS'] = 'HOLD'
+    working['WIP_STATUS'] = 'QUEUE'  # Default
+    working.loc[working['EQUIPMENTCOUNT'] > 0, 'WIP_STATUS'] = 'RUN'
+    working.loc[
+        (working['EQUIPMENTCOUNT'] == 0) & (working['CURRENTHOLDCOUNT'] > 0),
+        'WIP_STATUS'
+    ] = 'HOLD'
 
     # Compute hold type
-    df['IS_NON_QUALITY_HOLD'] = df['HOLDREASONNAME'].isin(NON_QUALITY_HOLD_REASONS)
-    df['IS_QUALITY_HOLD'] = (df['WIP_STATUS'] == 'HOLD') & ~df['IS_NON_QUALITY_HOLD']
-    df['IS_NON_QUALITY_HOLD'] = (df['WIP_STATUS'] == 'HOLD') & df['IS_NON_QUALITY_HOLD']
+    non_quality_flags = working['HOLDREASONNAME'].isin(NON_QUALITY_HOLD_REASONS)
+    working['IS_QUALITY_HOLD'] = (working['WIP_STATUS'] == 'HOLD') & ~non_quality_flags
+    working['IS_NON_QUALITY_HOLD'] = (working['WIP_STATUS'] == 'HOLD') & non_quality_flags
 
-    return df
+    return working
 
 
 def _filter_base_conditions(
@@ -272,24 +692,18 @@ def _filter_base_conditions(
     Returns:
         Filtered DataFrame
     """
-    df = df.copy()
+    if df is None or df.empty:
+        return df.iloc[0:0] if isinstance(df, pd.DataFrame) else pd.DataFrame()
 
-    # Exclude NULL WORKORDER (raw materials)
-    df = df[df['WORKORDER'].notna()]
-
-    # DUMMY exclusion
-    if not include_dummy:
-        df = df[~df['LOTID'].str.contains('DUMMY', case=False, na=False)]
-
-    # WORKORDER filter (fuzzy match)
-    if workorder:
-        df = df[df['WORKORDER'].str.contains(workorder, case=False, na=False)]
-
-    # LOTID filter (fuzzy match)
-    if lotid:
-        df = df[df['LOTID'].str.contains(lotid, case=False, na=False)]
-
-    return df
+    mask = _build_filter_mask(
+        df,
+        include_dummy=include_dummy,
+        workorder=workorder,
+        lotid=lotid,
+    )
+    if mask.empty:
+        return df.iloc[0:0]
+    return df.loc[mask]
 
 
 # ============================================================
@@ -325,16 +739,15 @@ def get_wip_summary(
     cached_df = _get_wip_dataframe()
     if cached_df is not None:
         try:
-            df = _filter_base_conditions(cached_df, include_dummy, workorder, lotid)
-            df = _add_wip_status_columns(df)
-
-            # Apply package filter
-            if package and 'PACKAGE_LEF' in df.columns:
-                df = df[df['PACKAGE_LEF'] == package]
-
-            # Apply pj_type filter
-            if pj_type and 'PJ_TYPE' in df.columns:
-                df = df[df['PJ_TYPE'] == pj_type]
+            df = _select_with_snapshot_indexes(
+                include_dummy=include_dummy,
+                workorder=workorder,
+                lotid=lotid,
+                package=package,
+                pj_type=pj_type,
+            )
+            if df is None:
+                return _get_wip_summary_from_oracle(include_dummy, workorder, lotid, package, pj_type)
 
             if df.empty:
                 return {
@@ -495,32 +908,31 @@ def get_wip_matrix(
     cached_df = _get_wip_dataframe()
     if cached_df is not None:
         try:
-            df = _filter_base_conditions(cached_df, include_dummy, workorder, lotid)
-            df = _add_wip_status_columns(df)
+            status_upper = status.upper() if status else None
+            hold_type_filter = hold_type if status_upper == 'HOLD' else None
+            df = _select_with_snapshot_indexes(
+                include_dummy=include_dummy,
+                workorder=workorder,
+                lotid=lotid,
+                package=package,
+                pj_type=pj_type,
+                status=status_upper,
+                hold_type=hold_type_filter,
+            )
+            if df is None:
+                return _get_wip_matrix_from_oracle(
+                    include_dummy,
+                    workorder,
+                    lotid,
+                    status,
+                    hold_type,
+                    package,
+                    pj_type,
+                )
 
             # Filter by WORKCENTER_GROUP and PACKAGE_LEF
             df = df[df['WORKCENTER_GROUP'].notna() & df['PACKAGE_LEF'].notna()]
 
-            # Apply package filter
-            if package:
-                df = df[df['PACKAGE_LEF'] == package]
-
-            # Apply pj_type filter
-            if pj_type and 'PJ_TYPE' in df.columns:
-                df = df[df['PJ_TYPE'] == pj_type]
-
-            # WIP status filter
-            if status:
-                status_upper = status.upper()
-                df = df[df['WIP_STATUS'] == status_upper]
-
-                # Hold type sub-filter
-                if status_upper == 'HOLD' and hold_type:
-                    if hold_type == 'quality':
-                        df = df[df['IS_QUALITY_HOLD']]
-                    elif hold_type == 'non-quality':
-                        df = df[df['IS_NON_QUALITY_HOLD']]
-
             if df.empty:
                 return {
                     'workcenters': [],
@@ -677,11 +1089,17 @@ def get_wip_hold_summary(
     cached_df = _get_wip_dataframe()
     if cached_df is not None:
         try:
-            df = _filter_base_conditions(cached_df, include_dummy, workorder, lotid)
-            df = _add_wip_status_columns(df)
+            df = _select_with_snapshot_indexes(
+                include_dummy=include_dummy,
+                workorder=workorder,
+                lotid=lotid,
+                status='HOLD',
+            )
+            if df is None:
+                return _get_wip_hold_summary_from_oracle(include_dummy, workorder, lotid)
 
             # Filter for HOLD status with reason
-            df = df[(df['WIP_STATUS'] == 'HOLD') & df['HOLDREASONNAME'].notna()]
+            df = df[df['HOLDREASONNAME'].notna()]
 
             if df.empty:
                 return {'items': []}
@@ -805,17 +1223,40 @@ def get_wip_detail(
     cached_df = _get_wip_dataframe()
     if cached_df is not None:
         try:
-            df = _filter_base_conditions(cached_df, include_dummy, workorder, lotid)
-            df = _add_wip_status_columns(df)
+            summary_df = _select_with_snapshot_indexes(
+                include_dummy=include_dummy,
+                workorder=workorder,
+                lotid=lotid,
+                package=package,
+                workcenter=workcenter,
+            )
+            if summary_df is None:
+                return _get_wip_detail_from_oracle(
+                    workcenter,
+                    package,
+                    status,
+                    hold_type,
+                    workorder,
+                    lotid,
+                    include_dummy,
+                    page,
+                    page_size,
+                )
 
-            # Filter by workcenter
-            df = df[df['WORKCENTER_GROUP'] == workcenter]
-
-            if package:
-                df = df[df['PACKAGE_LEF'] == package]
+            if summary_df.empty:
+                summary = {
+                    'totalLots': 0,
+                    'runLots': 0,
+                    'queueLots': 0,
+                    'holdLots': 0,
+                    'qualityHoldLots': 0,
+                    'nonQualityHoldLots': 0
+                }
+                df = summary_df
+            else:
+                df = summary_df
 
             # Calculate summary before status filter
-            summary_df = df.copy()
             run_lots = len(summary_df[summary_df['WIP_STATUS'] == 'RUN'])
             queue_lots = len(summary_df[summary_df['WIP_STATUS'] == 'QUEUE'])
             hold_lots = len(summary_df[summary_df['WIP_STATUS'] == 'HOLD'])
@@ -835,13 +1276,29 @@ def get_wip_detail(
             # Apply status filter for lots list
             if status:
                 status_upper = status.upper()
-                df = df[df['WIP_STATUS'] == status_upper]
-
-                if status_upper == 'HOLD' and hold_type:
-                    if hold_type == 'quality':
-                        df = df[df['IS_QUALITY_HOLD']]
-                    elif hold_type == 'non-quality':
-                        df = df[df['IS_NON_QUALITY_HOLD']]
+                hold_type_filter = hold_type if status_upper == 'HOLD' else None
+                filtered_df = _select_with_snapshot_indexes(
+                    include_dummy=include_dummy,
+                    workorder=workorder,
+                    lotid=lotid,
+                    package=package,
+                    workcenter=workcenter,
+                    status=status_upper,
+                    hold_type=hold_type_filter,
+                )
+                if filtered_df is None:
+                    return _get_wip_detail_from_oracle(
+                        workcenter,
+                        package,
+                        status,
+                        hold_type,
+                        workorder,
+                        lotid,
+                        include_dummy,
+                        page,
+                        page_size,
+                    )
+                df = filtered_df
 
             # Get specs (sorted by SPECSEQUENCE if available)
             specs_df = df[df['SPECNAME'].notna()][['SPECNAME', 'SPECSEQUENCE']].drop_duplicates()
@@ -1083,7 +1540,9 @@ def get_workcenters(include_dummy: bool = False) -> Optional[List[Dict[str, Any]
     cached_df = _get_wip_dataframe()
     if cached_df is not None:
         try:
-            df = _filter_base_conditions(cached_df, include_dummy)
+            df = _select_with_snapshot_indexes(include_dummy=include_dummy)
+            if df is None:
+                return _get_workcenters_from_oracle(include_dummy)
             df = df[df['WORKCENTER_GROUP'].notna()]
 
             if df.empty:
@@ -1162,7 +1621,9 @@ def get_packages(include_dummy: bool = False) -> Optional[List[Dict[str, Any]]]:
     cached_df = _get_wip_dataframe()
     if cached_df is not None:
         try:
-            df = _filter_base_conditions(cached_df, include_dummy)
+            df = _select_with_snapshot_indexes(include_dummy=include_dummy)
+            if df is None:
+                return _get_packages_from_oracle(include_dummy)
             df = df[df['PACKAGE_LEF'].notna()]
 
             if df.empty:
@@ -1267,15 +1728,16 @@ def search_workorders(
     cached_df = _get_wip_dataframe()
     if cached_df is not None:
         try:
-            df = _filter_base_conditions(cached_df, include_dummy, lotid=lotid)
+            df = _select_with_snapshot_indexes(
+                include_dummy=include_dummy,
+                lotid=lotid,
+                package=package,
+                pj_type=pj_type,
+            )
+            if df is None:
+                return _search_workorders_from_oracle(q, limit, include_dummy, lotid, package, pj_type)
             df = df[df['WORKORDER'].notna()]
 
-            # Apply cross-filters
-            if package and 'PACKAGE_LEF' in df.columns:
-                df = df[df['PACKAGE_LEF'] == package]
-            if pj_type and 'PJ_TYPE' in df.columns:
-                df = df[df['PJ_TYPE'] == pj_type]
-
             # Filter by search query (case-insensitive)
             df = df[df['WORKORDER'].str.contains(q, case=False, na=False)]
@@ -1375,13 +1837,14 @@ def search_lot_ids(
     cached_df = _get_wip_dataframe()
     if cached_df is not None:
         try:
-            df = _filter_base_conditions(cached_df, include_dummy, workorder=workorder)
-
-            # Apply cross-filters
-            if package and 'PACKAGE_LEF' in df.columns:
-                df = df[df['PACKAGE_LEF'] == package]
-            if pj_type and 'PJ_TYPE' in df.columns:
-                df = df[df['PJ_TYPE'] == pj_type]
+            df = _select_with_snapshot_indexes(
+                include_dummy=include_dummy,
+                workorder=workorder,
+                package=package,
+                pj_type=pj_type,
+            )
+            if df is None:
+                return _search_lot_ids_from_oracle(q, limit, include_dummy, workorder, package, pj_type)
 
             # Filter by search query (case-insensitive)
             df = df[df['LOTID'].str.contains(q, case=False, na=False)]
@@ -1481,7 +1944,14 @@ def search_packages(
     cached_df = _get_wip_dataframe()
     if cached_df is not None:
         try:
-            df = _filter_base_conditions(cached_df, include_dummy, workorder=workorder, lotid=lotid)
+            df = _select_with_snapshot_indexes(
+                include_dummy=include_dummy,
+                workorder=workorder,
+                lotid=lotid,
+                pj_type=pj_type,
+            )
+            if df is None:
+                return _search_packages_from_oracle(q, limit, include_dummy, workorder, lotid, pj_type)
 
             # Check if PACKAGE_LEF column exists
             if 'PACKAGE_LEF' not in df.columns:
@@ -1490,10 +1960,6 @@ def search_packages(
             df = df[df['PACKAGE_LEF'].notna()]
 
-            # Apply cross-filter
-            if pj_type and 'PJ_TYPE' in df.columns:
-                df = df[df['PJ_TYPE'] == pj_type]
-
             # Filter by search query (case-insensitive)
             df = df[df['PACKAGE_LEF'].str.contains(q, case=False, na=False)]
@@ -1591,7 +2057,14 @@ def search_types(
     cached_df = _get_wip_dataframe()
     if cached_df is not None:
         try:
-            df = _filter_base_conditions(cached_df, include_dummy, workorder=workorder, lotid=lotid)
+            df = _select_with_snapshot_indexes(
+                include_dummy=include_dummy,
+                workorder=workorder,
+                lotid=lotid,
+                package=package,
+            )
+            if df is None:
+                return _search_types_from_oracle(q, limit, include_dummy, workorder, lotid, package)
 
             # Check if PJ_TYPE column exists
             if 'PJ_TYPE' not in df.columns:
@@ -1600,10 +2073,6 @@ def search_types(
             df = df[df['PJ_TYPE'].notna()]
 
-            # Apply cross-filter
-            if package and 'PACKAGE_LEF' in df.columns:
-                df = df[df['PACKAGE_LEF'] == package]
-
             # Filter by search query (case-insensitive)
             df = df[df['PJ_TYPE'].str.contains(q, case=False, na=False)]
@@ -1686,11 +2155,15 @@ def get_hold_detail_summary(
     cached_df = _get_wip_dataframe()
     if cached_df is not None:
         try:
-            df = _filter_base_conditions(cached_df, include_dummy)
-            df = _add_wip_status_columns(df)
+            df = _select_with_snapshot_indexes(
+                include_dummy=include_dummy,
+                status='HOLD',
+            )
+            if df is None:
+                return _get_hold_detail_summary_from_oracle(reason, include_dummy)
 
             # Filter for HOLD status with matching reason
-            df = df[(df['WIP_STATUS'] == 'HOLD') & (df['HOLDREASONNAME'] == reason)]
+            df = df[df['HOLDREASONNAME'] == reason]
 
             if df.empty:
                 return {
@@ -1783,11 +2256,15 @@ def get_hold_detail_distribution(
     cached_df = _get_wip_dataframe()
     if cached_df is not None:
         try:
-            df = _filter_base_conditions(cached_df, include_dummy)
-            df = _add_wip_status_columns(df)
+            df = _select_with_snapshot_indexes(
+                include_dummy=include_dummy,
+                status='HOLD',
+            )
+            if df is None:
+                return _get_hold_detail_distribution_from_oracle(reason, include_dummy)
 
             # Filter for HOLD status with matching reason
-            df = df[(df['WIP_STATUS'] == 'HOLD') & (df['HOLDREASONNAME'] == reason)]
+            df = df[df['HOLDREASONNAME'] == reason]
 
             total_lots = len(df)
@@ -2072,20 +2549,30 @@ def get_hold_detail_lots(
     cached_df = _get_wip_dataframe()
     if cached_df is not None:
         try:
-            df = _filter_base_conditions(cached_df, include_dummy)
-            df = _add_wip_status_columns(df)
+            df = _select_with_snapshot_indexes(
+                include_dummy=include_dummy,
+                workcenter=workcenter,
+                package=package,
+                status='HOLD',
+            )
+            if df is None:
+                return _get_hold_detail_lots_from_oracle(
+                    reason=reason,
+                    workcenter=workcenter,
+                    package=package,
+                    age_range=age_range,
+                    include_dummy=include_dummy,
+                    page=page,
+                    page_size=page_size,
+                )
 
             # Filter for HOLD status with matching reason
-            df = df[(df['WIP_STATUS'] == 'HOLD') & (df['HOLDREASONNAME'] == reason)]
+            df = df[df['HOLDREASONNAME'] == reason]
 
             # Ensure numeric columns
             df['AGEBYDAYS'] = pd.to_numeric(df['AGEBYDAYS'], errors='coerce').fillna(0)
 
-            # Optional filters
-            if workcenter:
-                df = df[df['WORKCENTER_GROUP'] == workcenter]
-            if package:
-                df = df[df['PACKAGE_LEF'] == package]
+            # Optional age filter
             if age_range:
                 if age_range == '0-1':
                     df = df[(df['AGEBYDAYS'] >= 0) & (df['AGEBYDAYS'] < 1)]
diff --git a/src/mes_dashboard/static/js/mes-api.js b/src/mes_dashboard/static/js/mes-api.js
index 2a9a786..dd166ed 100644
--- a/src/mes_dashboard/static/js/mes-api.js
+++ b/src/mes_dashboard/static/js/mes-api.js
@@ -32,6 +32,23 @@ const MesApi = (function() {
     const MIN_DEGRADED_DELAY_MS = 3000;
 
     let requestCounter = 0;
+
+    function getCsrfToken() {
+        const meta = document.querySelector('meta[name="csrf-token"]');
+        return meta ? meta.content : '';
+    }
+
+    function withCsrfHeaders(headers, method) {
+        const normalized = (method || 'GET').toUpperCase();
+        const nextHeaders = { ...(headers || {}) };
+        if (['POST', 'PUT', 'PATCH', 'DELETE'].includes(normalized)) {
+            const token = getCsrfToken();
+            if (token && !nextHeaders['X-CSRF-Token']) {
+                nextHeaders['X-CSRF-Token'] = token;
+            }
+        }
+        return nextHeaders;
+    }
 
     /**
      * Generate a unique request ID
@@ -203,12 +220,12 @@ const MesApi = (function() {
 
         console.log(`[MesApi] ${reqId} ${method} ${fullUrl}`);
 
-        const fetchOptions = {
-            method: method,
-            headers: {
-                'Content-Type': 'application/json'
-            }
-        };
+        const fetchOptions = {
+            method: method,
+            headers: withCsrfHeaders({
+                'Content-Type': 'application/json'
+            }, method)
+        };
 
         if (options.body) {
             fetchOptions.body = JSON.stringify(options.body);
diff --git a/src/mes_dashboard/templates/_base.html b/src/mes_dashboard/templates/_base.html
index 86583c6..3008567 100644
--- a/src/mes_dashboard/templates/_base.html
+++ b/src/mes_dashboard/templates/_base.html
@@ -1,9 +1,10 @@
-
-
-
-    {% block title %}MES Dashboard{% endblock %}
+
+
+
+
+    {% block title %}MES Dashboard{% endblock %}