chore: finalize vite migration hardening and archive openspec changes
149  README.md
@@ -26,11 +26,60 @@
| Worker restart control | ✅ Done |
| Runtime resilience diagnostics (threshold/churn/recommendation) | ✅ Done |
| Shared WIP autocomplete core module | ✅ Done |
| Shared WIP derive core module (KPI/filter/chart/table) | ✅ Done |
| WIP indexed query acceleration and incremental sync | ✅ Done |
| Cache memory amplification telemetry | ✅ Done |
| Cache benchmark gate (P95/memory thresholds) | ✅ Done |
| Worker guarded mode + manual override audit | ✅ Done |
| Runtime contract startup validation (conda/systemd/watchdog) | ✅ Done |
| Frontend core module tests (Node test) | ✅ Done |
| Deployment automation | ✅ Done |
---

## Development History (post-Vite refactor)
- 2026-02-07: Switched to the Flask + Vite single-port architecture; the legacy `DashBoard/` tree is retired.
- 2026-02-08: Filled in runtime resilience governance (threshold/churn/recommendation) and observable watchdog fields.
- 2026-02-08: Completed P0 security/stability hardening:
  - startup fails fast when production `SECRET_KEY` is missing
  - CSRF protection for the admin form and admin mutation APIs
  - health probes use a dedicated DB pool so they cannot block, or be blocked by, the main query pool
  - worker/app shutdown uniformly cleans up the cache updater, realtime sync, Redis, and the DB engine
  - `hold_detail` inline script variables are serialized with `tojson`
- 2026-02-08: Completed the P1 cache/query efficiency refactor:
  - the WIP query path uses indexed selection while keeping the `resource/wip` full-table cache semantics
  - WIP search index incremental sync (watermark/version) with a drift fallback
  - health/admin gained cache memory amplification telemetry
  - added `scripts/run_cache_benchmarks.py` + fixture gate
- 2026-02-08: Completed P2 ops self-healing governance:
  - shared runtime contract (app/start_server/watchdog/systemd)
  - fail-fast on conda/watchdog path drift at startup
  - worker restart policy (cooldown/retry budget/churn guarded mode)
  - manual override (requires ack + reason) with a structured audit log
- 2026-02-08: Completed round-2 security/stability hardening:
  - strict LDAP endpoint validation (`https` + `LDAP_ALLOWED_HOSTS`)
  - process-level caches gained `max_size + LRU` (WIP/Resource)
  - circuit breaker transition logging moved outside the lock to reduce lock contention
  - global security headers (CSP/XFO/nosniff/Referrer-Policy, plus HSTS in production)
  - WIP detail pagination parameters are bounded (`page>=1`, `1<=page_size<=500`)
- 2026-02-08: Completed round-3 residual risk fixes:
  - WIP cache publish is staged, so a failed refresh never pollutes the old snapshot
  - WIP slow-path parsing moved outside the lock; the realtime equipment process cache gained a bounded LRU
  - resource NaN cleanup switched to depth-safe iteration; WIP/Hold bool query parsing is unified
  - filter cache view names are env-configurable
  - `/health` and `/health/deep` gained a 5-second internal memo (disabled in testing mode)
  - lightweight rate limits on high-cost APIs (WIP detail/matrix, Hold lots, Resource status/detail)
  - DB connection string log redaction masks passwords
- 2026-02-08: Completed round-4 residual governance consolidation:
  - Resource derived index uses a row-position representation, removing the in-process full-records copy
  - Resource / Realtime Equipment share Oracle SQL fragments, reducing query definition drift
  - `resource_cache` / `realtime_equipment_cache` type annotations and hot-path constant naming converged
  - `page_registry` writes use atomic replace (tmp + rename), avoiding half-written config files
  - new test protection: shared SQL fragments, index normalization, no duplicate route bool parser definitions
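The bounded process-level cache from round-2 (`max_size + LRU`) can be sketched with `collections.OrderedDict`. This is an illustrative Python sketch under assumed names (`BoundedLRUCache` is not the project's actual class):

```python
from collections import OrderedDict

class BoundedLRUCache:
    """Minimal sketch of a process-level cache with max_size + LRU eviction."""

    def __init__(self, max_size=32):
        self.max_size = max_size
        self._data = OrderedDict()

    def get(self, key, default=None):
        if key not in self._data:
            return default
        self._data.move_to_end(key)  # mark as most recently used
        return self._data[key]

    def put(self, key, value):
        self._data[key] = value
        self._data.move_to_end(key)
        while len(self._data) > self.max_size:
            self._data.popitem(last=False)  # evict the least recently used entry
```

A per-cache limit like this is what `WIP_PROCESS_CACHE_MAX_SIZE` and friends (default 32) would parameterize.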
---

## Migration and Acceptance Docs

- Root cutover inventory: `docs/root_cutover_inventory.md`
@@ -46,20 +95,37 @@
1. Single-port contract unchanged
   - Flask + Gunicorn + Vite dist are served by one service (`GUNICORN_BIND`); frontend and backend are same-origin.

2. Runtime resilience uses "degradation + actionable recommendations + policy state"
   - `/health`, `/health/deep`, `/admin/api/system-status`, `/admin/api/worker/status` all provide:
     - thresholds
     - policy state (`allowed` / `cooldown` / `blocked`)
     - restart churn summary
     - alerts (pool/circuit/churn)
     - recovery recommendation (suggested on-call actions)
3. Watchdog self-healing policy is bounded
   - The restart flow includes cooldown + retry budget + churn window.
   - When churn exceeds the threshold, guarded mode engages; further restarts require an admin manual override.
   - The state file keeps a bounded restart history for policy decisions and audit.
4. Frontend governance: shared WIP compute
   - `frontend/src/core/autocomplete.js` is the shared logic source for WIP overview/detail.
   - `frontend/src/core/wip-derive.js` shares the KPI/filter/chart/table derivations.
   - Existing page flows and drill-down semantics are preserved; user workflows do not change.
5. P1 cache efficiency governance
   - The full-table cache strategy for `resource` and `wip` is retained (business constraint unchanged).
   - Queries use indexed selection, with memory amplification / index efficiency telemetry.
   - A benchmark gate verifies that P95 latency and memory amplification stay under their thresholds.
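The benchmark-gate idea in point 5 — fail when P95 latency or memory amplification exceeds its threshold — can be sketched as below. The nearest-rank P95 and the budget defaults are illustrative assumptions; the real gate and its thresholds live in `scripts/run_cache_benchmarks.py` and its fixtures:

```python
import math

def p95(samples):
    """Nearest-rank P95 of a list of latency samples."""
    ordered = sorted(samples)
    rank = max(1, math.ceil(0.95 * len(ordered)))
    return ordered[rank - 1]

def benchmark_gate(latencies_ms, baseline_bytes, indexed_bytes,
                   p95_budget_ms=200.0, amplification_budget=2.0):
    """Pass only when P95 latency and memory amplification are within budget.

    amplification = indexed memory footprint relative to the baseline cache.
    """
    amplification = indexed_bytes / baseline_bytes
    return p95(latencies_ms) <= p95_budget_ms and amplification <= amplification_budget
```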
6. P0 runtime hardening (security + stability)
   - Production must provide `SECRET_KEY`; the service refuses to start without it.
   - Mutation requests to `/admin/login` and `/admin/api/*` must carry a CSRF token.
   - The `/health` DB connectivity probe uses a dedicated health pool, reducing false alarms when the main pool is saturated.
   - Shutdown/restart releases background workers and Redis/DB connections in one place.
   - The LDAP API URL is validated at startup: `https` only, plus a host allowlist.
   - Global security headers: CSP/X-Frame-Options/X-Content-Type-Options/Referrer-Policy (HSTS in production).
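The global security headers in point 6 could be applied by a small helper like this sketch; the header values shown are common baselines, not necessarily the exact policy the app ships:

```python
def apply_security_headers(headers, production=False):
    """Add baseline security headers to a response-header dict (illustrative values)."""
    headers.setdefault("Content-Security-Policy", "default-src 'self'")
    headers.setdefault("X-Frame-Options", "DENY")
    headers.setdefault("X-Content-Type-Options", "nosniff")
    headers.setdefault("Referrer-Policy", "same-origin")
    if production:
        # HSTS only makes sense when the site is reliably served over HTTPS.
        headers.setdefault("Strict-Transport-Security",
                           "max-age=31536000; includeSubDomains")
    return headers
```

In the Flask app this would typically run in an `after_request` hook so every response gets the same policy.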
---

## Quick Start
@@ -175,6 +241,12 @@ DB_MAX_OVERFLOW=20
DB_POOL_TIMEOUT=30
DB_POOL_RECYCLE=1800
DB_CALL_TIMEOUT_MS=55000
DB_POOL_EXHAUSTED_RETRY_AFTER_SECONDS=5

# Dedicated DB pool for health probes (isolated from the main request pool)
DB_HEALTH_POOL_SIZE=1
DB_HEALTH_MAX_OVERFLOW=0
DB_HEALTH_POOL_TIMEOUT=2

# Circuit Breaker
CIRCUIT_BREAKER_ENABLED=true
@@ -192,6 +264,17 @@ WATCHDOG_RESTART_FLAG=./tmp/mes_dashboard_restart.flag
WATCHDOG_PID_FILE=./tmp/gunicorn.pid
WATCHDOG_STATE_FILE=./tmp/mes_dashboard_restart_state.json
WATCHDOG_RESTART_HISTORY_MAX=50
CONDA_BIN=/opt/miniconda3/bin/conda
CONDA_ENV_NAME=mes-dashboard
RUNTIME_CONTRACT_VERSION=2026.02-p2
RUNTIME_CONTRACT_ENFORCE=true

# Worker self-healing policy
WORKER_RESTART_COOLDOWN=60
WORKER_RESTART_RETRY_BUDGET=3
WORKER_RESTART_WINDOW_SECONDS=600
WORKER_RESTART_CHURN_THRESHOLD=3
WORKER_GUARDED_MODE_ENABLED=true

# Runtime resilience thresholds
RESILIENCE_DEGRADED_ALERT_SECONDS=300
@@ -202,6 +285,36 @@ RESILIENCE_RESTART_CHURN_THRESHOLD=3

# Admin settings
ADMIN_EMAILS=admin@example.com  # admin emails (comma-separated)
LDAP_API_URL=https://ldap-api.example.com
LDAP_ALLOWED_HOSTS=ldap-api.example.com,ldap-api-dr.example.com

# CSRF protection (admin form / admin mutation API)
CSRF_ENABLED=true

# Process-level cache bounded LRU (WIP/Resource)
PROCESS_CACHE_MAX_SIZE=32
WIP_PROCESS_CACHE_MAX_SIZE=32
RESOURCE_PROCESS_CACHE_MAX_SIZE=32
EQUIPMENT_PROCESS_CACHE_MAX_SIZE=32

# Filter cache source views (env-overridable)
FILTER_CACHE_WIP_VIEW=DWH.DW_MES_LOT_V
FILTER_CACHE_SPEC_WORKCENTER_VIEW=DWH.DW_MES_SPEC_WORKCENTER_V

# Health internal memoization
HEALTH_MEMO_TTL_SECONDS=5

# High-cost API rate limit (in-process)
WIP_MATRIX_RATE_LIMIT_MAX_REQUESTS=120
WIP_MATRIX_RATE_LIMIT_WINDOW_SECONDS=60
WIP_DETAIL_RATE_LIMIT_MAX_REQUESTS=90
WIP_DETAIL_RATE_LIMIT_WINDOW_SECONDS=60
HOLD_LOTS_RATE_LIMIT_MAX_REQUESTS=90
HOLD_LOTS_RATE_LIMIT_WINDOW_SECONDS=60
RESOURCE_DETAIL_RATE_LIMIT_MAX_REQUESTS=60
RESOURCE_DETAIL_RATE_LIMIT_WINDOW_SECONDS=60
RESOURCE_STATUS_RATE_LIMIT_MAX_REQUESTS=90
RESOURCE_STATUS_RATE_LIMIT_WINDOW_SECONDS=60
```
### Production Notes

@@ -226,6 +339,7 @@ sudo cp deploy/mes-dashboard-watchdog.service /etc/systemd/system/

# 2. Prepare the environment config file
sudo mkdir -p /etc/mes-dashboard
sudo cp deploy/mes-dashboard.env.example /etc/mes-dashboard/mes-dashboard.env

# 3. Reload systemd
@@ -238,6 +352,12 @@ sudo systemctl enable --now mes-dashboard mes-dashboard-watchdog
sudo systemctl status mes-dashboard
sudo systemctl status mes-dashboard-watchdog
```

Run the runtime contract validation:

```bash
RUNTIME_CONTRACT_ENFORCE=true ./scripts/start_server.sh check
```

### Rollback Steps
@@ -494,7 +614,8 @@ DashBoard_vite/
│   └── worker_watchdog.py               # Worker watchdog process
├── deploy/                              # Deployment config
│   ├── mes-dashboard.service            # Gunicorn systemd service (Conda)
│   ├── mes-dashboard-watchdog.service   # Watchdog systemd service (Conda)
│   └── mes-dashboard.env.example        # Runtime contract env template
├── tests/                               # Tests
├── data/                                # Data files
├── logs/                                # Logs
@@ -522,9 +643,12 @@ pytest tests/test_*_integration.py -v
# Run E2E tests
pytest tests/e2e/ -v

# Run stress tests
pytest tests/stress/ -v

# Cache benchmark gate (P1)
conda run -n mes-dashboard python scripts/run_cache_benchmarks.py --enforce
```

---
@@ -569,12 +693,17 @@ pytest tests/stress/ -v
### 2026-02-08

- Completed and archived proposal `post-migration-resilience-governance`
- Completed and archived proposal `p1-cache-query-efficiency`
- Completed and archived proposal `p2-ops-self-healing-runbook`
- Added the runtime resilience diagnostics core (thresholds / restart churn / recovery recommendation)
- Added worker restart policy state (allowed/cooldown/blocked) and the guarded mode override flow
- Added actionable resilience fields to the health and admin APIs:
  - `/health`, `/health/deep`
  - `/admin/api/system-status`, `/admin/api/worker/status`
- watchdog restart state supports bounded history (`WATCHDOG_RESTART_HISTORY_MAX`)
- Extracted the shared WIP overview/detail autocomplete/filter module (`frontend/src/core/autocomplete.js`)
- WIP overview/detail adopt the shared derive module (`frontend/src/core/wip-derive.js`)
- Added the cache benchmark fixture and baseline-vs-indexed threshold validation
- Added the frontend Node test flow (`npm --prefix frontend test`)
- Updated `README.mdj` and the migration runbook to align with the gates
@@ -654,5 +783,5 @@ pytest tests/stress/ -v

---

**Document version**: 4.2
**Last updated**: 2026-02-08
158  README.mdj
@@ -1,61 +1,151 @@
# MES Dashboard (README.mdj)

This document is a condensed technical companion to `README.md`, focused on the currently running architecture and the operational contract.
## 1. Architecture Summary (2026-02-08)

- Backend: Flask + Gunicorn (single port)
- Frontend: Vite build output to `src/mes_dashboard/static/dist`
- Caching: Redis + process-level cache + indexed selection telemetry
- Data: Oracle (QueuePool)
- Ops: watchdog + admin worker restart API + guarded-mode policy
## 2. Existing Design Principles (retained)

- `resource` (equipment master data) and `wip` (live line status) keep the full-table cache strategy.
- Frontend page logic and drill-down semantics are unchanged.
- The system keeps the single-port serving model (frontend and backend same-origin).
## 3. P0 Runtime Hardening (done)

- Production enforces `SECRET_KEY`: startup fails outright when it is unset or uses an insecure default.
- CSRF protection:
  - the `/admin/login` form requires a token
  - `POST/PUT/PATCH/DELETE` to `/admin/api/*` require `X-CSRF-Token`
- Session hardening: `session.clear()` + CSRF token rotation after a successful login.
- Health probe isolation: the `/health` DB connectivity check uses a dedicated health pool.
- Shutdown cleanup: stops the cache updater and the equipment sync worker, and closes Redis and the DB engine, in one place.
- XSS hardening: the `reason` value in the `hold_detail` fallback script now uses `tojson`.
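The `SECRET_KEY` fail-fast behavior can be sketched as a startup check; the deny-list of insecure defaults here is an assumption, and the real check lives in the app's config loading:

```python
INSECURE_DEFAULTS = {"", "dev", "changeme", "secret"}  # assumed deny-list

def require_secret_key(env):
    """Fail fast when production has no usable SECRET_KEY."""
    if env.get("FLASK_ENV") != "production":
        return env.get("SECRET_KEY", "dev")  # permissive outside production
    key = env.get("SECRET_KEY", "")
    if key in INSECURE_DEFAULTS:
        raise RuntimeError(
            "SECRET_KEY must be set to a non-default value in production"
        )
    return key
```

Raising during app construction (rather than logging a warning) is what makes the misconfiguration impossible to miss.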
## 4. P1 Cache/Query Efficiency (done)

- `resource` / `wip` still keep the full-table cache strategy (business constraint unchanged).
- WIP queries now use indexed selection, plus incremental sync (watermark/version) with a drift fallback.
- `/health`, `/health/deep`, `/admin/api/system-status` expose cache memory amplification/index telemetry.
- New benchmark harness: `scripts/run_cache_benchmarks.py --enforce`.
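The watermark/version incremental sync with drift fallback can be sketched as follows; the index layout (`version`, `by_lot`) and the field names are illustrative assumptions, not the actual search-index schema:

```python
def incremental_sync(index, rows, watermark, expected_version):
    """Apply only rows newer than the watermark to a derived index.

    Returns (new_watermark, drift). drift=True means the index version no
    longer matches, so the caller should fall back to a full rebuild.
    """
    if index.get("version") != expected_version:
        return watermark, True  # version drift: incremental update is unsafe
    new_watermark = watermark
    for row in rows:
        if row["updated_at"] <= watermark:
            continue  # already reflected in the index
        index["by_lot"][row["lotid"]] = row
        new_watermark = max(new_watermark, row["updated_at"])
    return new_watermark, False
```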
## 5. P2 Ops Self-Healing (done)

- Shared runtime contract: app/start_server/watchdog/systemd all use the same watchdog/conda path contract.
- Startup fail-fast: the service refuses to start on conda/runtime path drift and prints actionable diagnostics.
- Worker restart policy: cooldown + retry budget + churn guarded mode.
- Manual override: requires an admin identity + `manual_override` + `override_acknowledged` + `override_reason`, and is written to the audit log.
- Health/admin payloads expose the policy state: `allowed` / `cooldown` / `blocked`.
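The restart policy state (`allowed` / `cooldown` / `blocked`) can be sketched as a pure function of the restart history; the defaults mirror the `WORKER_RESTART_*` env values, but the function name and the exact precedence of the checks are assumptions:

```python
import time

def evaluate_restart_policy(history, now=None, cooldown=60, retry_budget=3,
                            window=600, churn_threshold=3):
    """Classify a restart request as 'allowed', 'cooldown', or 'blocked'.

    history: ascending list of past restart timestamps (epoch seconds).
    """
    now = time.time() if now is None else now
    recent = [t for t in history if now - t <= window]
    if len(recent) >= churn_threshold:
        return "blocked"    # churn guard: manual override required
    if history and now - history[-1] < cooldown:
        return "cooldown"   # too soon after the previous restart
    if len(recent) >= retry_budget:
        return "blocked"    # retry budget for this window exhausted
    return "allowed"
```

Keeping the policy a pure function of the bounded restart history is what lets health and admin payloads report the same state the watchdog enforces.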
## 6. Round-3 Residual Hardening (done)

- WIP cache publish is now staged: a failed refresh never overwrites the previous snapshot.
- WIP process cache slow-path parsing moved outside the lock, reducing lock contention.
- The realtime equipment process cache gained a bounded LRU (including `EQUIPMENT_PROCESS_CACHE_MAX_SIZE`).
- `_clean_nan_values` switched to depth-safe iterative cleanup (avoiding deep-recursion risk).
- WIP/Hold/Resource bool query parsers are unified (`core/utils.py`).
- Filter cache source views can be overridden via env (easier environment switching and testing).
- `/health`, `/health/deep` gained a 5-second memo (auto-disabled in testing mode).
- High-cost APIs gained a lightweight in-process rate limit; over-limit requests get a consistent 429 structure.
- DB connection string logging masks sensitive fields (password redaction).
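Staged publish — build and validate a full snapshot before swapping the reference — can be sketched in a few lines; `refresh_cache` is a hypothetical name, not the project's API:

```python
def refresh_cache(cache, build_snapshot, validate=bool):
    """Staged publish: only swap in a snapshot that built and validated fully.

    If build_snapshot() raises or validation fails, cache['snapshot'] keeps
    its previous value, so readers never observe a partial refresh.
    """
    staged = build_snapshot()  # may raise; nothing has been published yet
    if not validate(staged):
        raise ValueError("staged snapshot failed validation; old snapshot kept")
    cache["snapshot"] = staged  # single-reference swap
    return staged
```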
## 7. Round-4 Residual Consolidation (done)

- The Resource derived index switched to a row-position representation; no full-records copy is kept per process anymore.
- Resource / Realtime Equipment share Oracle SQL fragments, preventing query definitions from drifting apart.
- `resource_cache` / `realtime_equipment_cache` type annotations and hot-path constant naming converged.
- `page_registry` file writes use atomic replace, reducing the risk of half-written config files.
- New tests cover the shared SQL fragments and the no-duplicate-definition rule for the bool parser.
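The atomic-replace write used for `page_registry` (tmp + rename) follows a standard pattern; this sketch assumes a JSON payload and relies on `os.replace`, which is atomic on both POSIX and Windows:

```python
import json
import os
import tempfile

def atomic_write_json(path, payload):
    """Write JSON via a tmp file + os.replace so readers never see a partial file."""
    directory = os.path.dirname(os.path.abspath(path))
    fd, tmp_path = tempfile.mkstemp(dir=directory, suffix=".tmp")
    try:
        with os.fdopen(fd, "w", encoding="utf-8") as handle:
            json.dump(payload, handle)
            handle.flush()
            os.fsync(handle.fileno())  # ensure bytes hit disk before the rename
        os.replace(tmp_path, path)     # atomic swap into place
    except BaseException:
        if os.path.exists(tmp_path):
            os.remove(tmp_path)        # never leave the tmp file behind
        raise
```

The tmp file must live in the same directory as the target, since `os.replace` is only atomic within one filesystem.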
## 8. Key Environment Variables

```bash
FLASK_ENV=production
SECRET_KEY=<required-in-production>
CSRF_ENABLED=true

LDAP_API_URL=https://ldap-api.example.com
LDAP_ALLOWED_HOSTS=ldap-api.example.com,ldap-api-dr.example.com

DB_POOL_SIZE=10
DB_MAX_OVERFLOW=20
DB_POOL_TIMEOUT=30
DB_POOL_RECYCLE=1800
DB_CALL_TIMEOUT_MS=55000
DB_POOL_EXHAUSTED_RETRY_AFTER_SECONDS=5

DB_HEALTH_POOL_SIZE=1
DB_HEALTH_MAX_OVERFLOW=0
DB_HEALTH_POOL_TIMEOUT=2

CONDA_BIN=/opt/miniconda3/bin/conda
CONDA_ENV_NAME=mes-dashboard
RUNTIME_CONTRACT_VERSION=2026.02-p2
RUNTIME_CONTRACT_ENFORCE=true

WATCHDOG_RUNTIME_DIR=./tmp
WATCHDOG_RESTART_FLAG=./tmp/mes_dashboard_restart.flag
WATCHDOG_PID_FILE=./tmp/gunicorn.pid
WATCHDOG_STATE_FILE=./tmp/mes_dashboard_restart_state.json
WATCHDOG_RESTART_HISTORY_MAX=50

RESILIENCE_DEGRADED_ALERT_SECONDS=300
RESILIENCE_POOL_SATURATION_WARNING=0.90
RESILIENCE_POOL_SATURATION_CRITICAL=1.0
RESILIENCE_RESTART_CHURN_WINDOW_SECONDS=600
RESILIENCE_RESTART_CHURN_THRESHOLD=3
WORKER_RESTART_COOLDOWN=60
WORKER_RESTART_RETRY_BUDGET=3
WORKER_RESTART_WINDOW_SECONDS=600
WORKER_RESTART_CHURN_THRESHOLD=3
WORKER_GUARDED_MODE_ENABLED=true

PROCESS_CACHE_MAX_SIZE=32
WIP_PROCESS_CACHE_MAX_SIZE=32
RESOURCE_PROCESS_CACHE_MAX_SIZE=32
EQUIPMENT_PROCESS_CACHE_MAX_SIZE=32

FILTER_CACHE_WIP_VIEW=DWH.DW_MES_LOT_V
FILTER_CACHE_SPEC_WORKCENTER_VIEW=DWH.DW_MES_SPEC_WORKCENTER_V

HEALTH_MEMO_TTL_SECONDS=5

WIP_MATRIX_RATE_LIMIT_MAX_REQUESTS=120
WIP_MATRIX_RATE_LIMIT_WINDOW_SECONDS=60
WIP_DETAIL_RATE_LIMIT_MAX_REQUESTS=90
WIP_DETAIL_RATE_LIMIT_WINDOW_SECONDS=60
HOLD_LOTS_RATE_LIMIT_MAX_REQUESTS=90
HOLD_LOTS_RATE_LIMIT_WINDOW_SECONDS=60
RESOURCE_DETAIL_RATE_LIMIT_MAX_REQUESTS=60
RESOURCE_DETAIL_RATE_LIMIT_WINDOW_SECONDS=60
RESOURCE_STATUS_RATE_LIMIT_MAX_REQUESTS=90
RESOURCE_STATUS_RATE_LIMIT_WINDOW_SECONDS=60
```
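The in-process rate limit behind each `*_RATE_LIMIT_MAX_REQUESTS` / `*_RATE_LIMIT_WINDOW_SECONDS` pair can be sketched as a sliding-window counter; the class name and single-key scope are illustrative (the real limiter is presumably keyed per route, and rejected calls map to the consistent 429 structure):

```python
import time
from collections import deque

class SlidingWindowRateLimiter:
    """In-process sliding-window limiter: at most max_requests per window."""

    def __init__(self, max_requests, window_seconds):
        self.max_requests = max_requests
        self.window_seconds = window_seconds
        self._hits = deque()  # timestamps of accepted requests

    def allow(self, now=None):
        now = time.monotonic() if now is None else now
        cutoff = now - self.window_seconds
        while self._hits and self._hits[0] <= cutoff:
            self._hits.popleft()         # drop hits outside the window
        if len(self._hits) >= self.max_requests:
            return False                 # caller should respond HTTP 429
        self._hits.append(now)
        return True
```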
## 9. Validation Commands (recommended)

```bash
# Backend (conda)
conda run -n mes-dashboard python -m pytest -q tests/test_runtime_hardening.py
python -m pytest -q tests/test_resilience.py tests/test_health_routes.py tests/test_performance_integration.py

# Frontend
npm --prefix frontend test
npm --prefix frontend run build

# P1 benchmark gate
conda run -n mes-dashboard python scripts/run_cache_benchmarks.py --enforce

# P2 runtime contract check
RUNTIME_CONTRACT_ENFORCE=true ./scripts/start_server.sh check
```
> For full deployment, usage, and configuration details, see `README.md`.

## 10. Development History (Vite project)

- 2026-02-07: Completed the Vite root-directory refactor and the legacy cutover.
- 2026-02-08: Completed resilience diagnostics governance and shared frontend modules.
- 2026-02-08: Completed P0 security/stability hardening (this update).
- 2026-02-08: Completed the P1 cache/query efficiency refactor (index + benchmark gate).
- 2026-02-08: Completed P2 ops self-healing governance (guarded mode + manual override + runtime contract).
- 2026-02-08: Completed round-2 hardening (LDAP URL validation, bounded LRU cache, circuit-breaker logging outside the lock, security headers, pagination bounds).
- 2026-02-08: Completed round-3 residual hardening (staged publish, health memo, API rate limit, DB redaction, env-configurable filter views).
- 2026-02-08: Completed round-4 residual consolidation (resource index representation normalization, shared SQL fragments, type and constant governance, atomic page status writes).
@@ -18,6 +18,13 @@ Environment="WATCHDOG_RESTART_FLAG=/run/mes-dashboard/mes_dashboard_restart.flag"
Environment="WATCHDOG_PID_FILE=/run/mes-dashboard/gunicorn.pid"
Environment="WATCHDOG_STATE_FILE=/var/lib/mes-dashboard/restart_state.json"
Environment="WATCHDOG_CHECK_INTERVAL=5"
Environment="RUNTIME_CONTRACT_VERSION=2026.02-p2"
Environment="RUNTIME_CONTRACT_ENFORCE=true"
Environment="WORKER_RESTART_COOLDOWN=60"
Environment="WORKER_RESTART_RETRY_BUDGET=3"
Environment="WORKER_RESTART_WINDOW_SECONDS=600"
Environment="WORKER_RESTART_CHURN_THRESHOLD=3"
Environment="WORKER_GUARDED_MODE_ENABLED=true"

RuntimeDirectory=mes-dashboard
StateDirectory=mes-dashboard
26  deploy/mes-dashboard.env.example  Normal file
@@ -0,0 +1,26 @@
# MES Dashboard runtime contract (version 2026.02-p2)

# Conda runtime
CONDA_BIN=/opt/miniconda3/bin/conda
CONDA_ENV_NAME=mes-dashboard

# Single-port serving contract
GUNICORN_BIND=0.0.0.0:8080

# Watchdog/runtime paths
WATCHDOG_RUNTIME_DIR=/run/mes-dashboard
WATCHDOG_RESTART_FLAG=/run/mes-dashboard/mes_dashboard_restart.flag
WATCHDOG_PID_FILE=/run/mes-dashboard/gunicorn.pid
WATCHDOG_STATE_FILE=/var/lib/mes-dashboard/restart_state.json
WATCHDOG_CHECK_INTERVAL=5

# Runtime contract enforcement
RUNTIME_CONTRACT_VERSION=2026.02-p2
RUNTIME_CONTRACT_ENFORCE=true

# Worker recovery policy
WORKER_RESTART_COOLDOWN=60
WORKER_RESTART_RETRY_BUDGET=3
WORKER_RESTART_WINDOW_SECONDS=600
WORKER_RESTART_CHURN_THRESHOLD=3
WORKER_GUARDED_MODE_ENABLED=true
@@ -18,6 +18,13 @@ Environment="WATCHDOG_RUNTIME_DIR=/run/mes-dashboard"
Environment="WATCHDOG_RESTART_FLAG=/run/mes-dashboard/mes_dashboard_restart.flag"
Environment="WATCHDOG_PID_FILE=/run/mes-dashboard/gunicorn.pid"
Environment="WATCHDOG_STATE_FILE=/var/lib/mes-dashboard/restart_state.json"
Environment="RUNTIME_CONTRACT_VERSION=2026.02-p2"
Environment="RUNTIME_CONTRACT_ENFORCE=true"
Environment="WORKER_RESTART_COOLDOWN=60"
Environment="WORKER_RESTART_RETRY_BUDGET=3"
Environment="WORKER_RESTART_WINDOW_SECONDS=600"
Environment="WORKER_RESTART_CHURN_THRESHOLD=3"
Environment="WORKER_GUARDED_MODE_ENABLED=true"

RuntimeDirectory=mes-dashboard
StateDirectory=mes-dashboard
@@ -26,10 +26,12 @@ A release is cutover-ready only when all gates pass:
- pool exhaustion path returns `503` + `DB_POOL_EXHAUSTED` and `Retry-After`
- circuit-open path returns `503` + `CIRCUIT_BREAKER_OPEN` and fail-fast semantics
- frontend client does not aggressively retry on degraded pool-exhaustion responses
- health/admin payloads expose worker policy state (`allowed`/`cooldown`/`blocked`) and alert booleans

6. Conda-systemd contract gate
- `deploy/mes-dashboard.service` and `deploy/mes-dashboard-watchdog.service` both run under the same conda runtime contract
- `WATCHDOG_RESTART_FLAG`, `WATCHDOG_PID_FILE`, `WATCHDOG_STATE_FILE` paths are consistent across app/admin/watchdog
- startup contract validation passes: `RUNTIME_CONTRACT_ENFORCE=true ./scripts/start_server.sh check`
- single-port bind (`GUNICORN_BIND`) remains stable during the restart workflow

7. Regression gate
@@ -60,7 +62,8 @@ A release is cutover-ready only when all gates pass:
5. Conda + systemd rehearsal (recommended before production cutover)
- `sudo cp deploy/mes-dashboard.service /etc/systemd/system/`
- `sudo cp deploy/mes-dashboard-watchdog.service /etc/systemd/system/`
- `sudo mkdir -p /etc/mes-dashboard && sudo cp deploy/mes-dashboard.env.example /etc/mes-dashboard/mes-dashboard.env`
- merge deployment secrets from `.env` into `/etc/mes-dashboard/mes-dashboard.env`
- `sudo systemctl daemon-reload`
- `sudo systemctl enable --now mes-dashboard mes-dashboard-watchdog`
- call `/admin/api/worker/status` and verify runtime contract paths exist
@@ -69,6 +72,7 @@ A release is cutover-ready only when all gates pass:
- call `/health` and `/health/deep`
- confirm route cache mode, degraded flags, and pool/runtime diagnostics align with the environment (Redis on/off)
- trigger one controlled worker restart from the admin API and verify single-port continuity
- verify the guarded mode flow: a blocked restart requires a manual override payload (`manual_override`, `override_acknowledged`, `override_reason`)
- verify the README architecture section matches the deployed runtime contract
## Rollback Procedure

@@ -111,3 +115,6 @@ Use these initial thresholds for alerting/escalation:

4. Frontend/API retry pressure
- significant increase over baseline in client retries for `DB_POOL_EXHAUSTED` or `CIRCUIT_BREAKER_OPEN` responses

5. Recovery policy blocked
- `resilience.policy_state.blocked == true` or `resilience.alerts.restart_blocked == true`
@@ -1,5 +1,21 @@
const DEFAULT_TIMEOUT = 30000;

function getCsrfToken() {
  return document.querySelector('meta[name="csrf-token"]')?.content || '';
}

function withCsrfHeaders(headers = {}, method = 'GET') {
  const normalized = String(method).toUpperCase();
  const merged = { ...headers };
  if (['POST', 'PUT', 'PATCH', 'DELETE'].includes(normalized)) {
    const csrf = getCsrfToken();
    if (csrf && !merged['X-CSRF-Token']) {
      merged['X-CSRF-Token'] = csrf;
    }
  }
  return merged;
}
function buildApiError(response, payload) {
  const message =
    payload?.error?.message ||

@@ -47,15 +63,19 @@ export async function apiGet(url, options = {}) {

export async function apiPost(url, payload, options = {}) {
  if (window.MesApi?.post) {
    const enrichedOptions = {
      ...options,
      headers: withCsrfHeaders(options.headers || {}, 'POST')
    };
    return window.MesApi.post(url, payload, enrichedOptions);
  }
  return fetchJson(url, {
    ...options,
    method: 'POST',
    headers: withCsrfHeaders({
      'Content-Type': 'application/json',
      ...(options.headers || {})
    }, 'POST'),
    body: JSON.stringify(payload)
  });
}
@@ -64,6 +84,7 @@ export async function apiUpload(url, formData, options = {}) {
  return fetchJson(url, {
    ...options,
    method: 'POST',
    headers: withCsrfHeaders(options.headers || {}, 'POST'),
    body: formData
  });
}
75  frontend/src/core/wip-derive.js  Normal file
@@ -0,0 +1,75 @@
function toTrimmedString(value) {
  if (value === null || value === undefined) {
    return '';
  }
  return String(value).trim();
}

export function normalizeStatusFilter(statusFilter) {
  if (!statusFilter) {
    return {};
  }
  if (statusFilter === 'quality-hold') {
    return { status: 'HOLD', hold_type: 'quality' };
  }
  if (statusFilter === 'non-quality-hold') {
    return { status: 'HOLD', hold_type: 'non-quality' };
  }
  return { status: String(statusFilter).toUpperCase() };
}

export function buildWipOverviewQueryParams(filters = {}, statusFilter = null) {
  const params = {};
  const workorder = toTrimmedString(filters.workorder);
  const lotid = toTrimmedString(filters.lotid);
  const pkg = toTrimmedString(filters.package);
  const type = toTrimmedString(filters.type);

  if (workorder) params.workorder = workorder;
  if (lotid) params.lotid = lotid;
  if (pkg) params.package = pkg;
  if (type) params.type = type;

  return { ...params, ...normalizeStatusFilter(statusFilter) };
}

export function buildWipDetailQueryParams({
  page,
  pageSize,
  filters = {},
  statusFilter = null,
}) {
  return {
    page,
    page_size: pageSize,
    ...buildWipOverviewQueryParams(filters, statusFilter),
  };
}

export function splitHoldByType(data) {
  const items = Array.isArray(data?.items) ? data.items : [];
  const quality = items.filter((item) => item?.holdType === 'quality');
  const nonQuality = items.filter((item) => item?.holdType !== 'quality');
  return { quality, nonQuality };
}

export function prepareParetoData(items) {
  if (!Array.isArray(items) || items.length === 0) {
    return { reasons: [], qtys: [], lots: [], cumulative: [], totalQty: 0, items: [] };
  }

  const sorted = [...items].sort((a, b) => (Number(b?.qty) || 0) - (Number(a?.qty) || 0));
  const reasons = sorted.map((item) => toTrimmedString(item?.reason) || '未知');
  const qtys = sorted.map((item) => Number(item?.qty) || 0);
  const lots = sorted.map((item) => Number(item?.lots) || 0);
  const totalQty = qtys.reduce((sum, value) => sum + value, 0);

  let running = 0;
  const cumulative = qtys.map((qty) => {
    running += qty;
    if (totalQty <= 0) return 0;
    return Math.round((running / totalQty) * 100);
  });

  return { reasons, qtys, lots, cumulative, totalQty, items: sorted };
}
@@ -3,6 +3,7 @@ import {
  debounce,
  fetchWipAutocompleteItems,
} from '../core/autocomplete.js';
import { buildWipDetailQueryParams } from '../core/wip-derive.js';

ensureMesApiAvailable();
@@ -72,37 +73,13 @@ ensureMesApiAvailable();
    throw new Error(result.error || 'Failed to fetch packages');
  }

  async function fetchDetail(signal = null) {
    const params = buildWipDetailQueryParams({
      page: state.page,
      pageSize: state.pageSize,
      filters: state.filters,
      statusFilter: activeStatusFilter,
    });

    const result = await MesApi.get(`/api/wip/detail/${encodeURIComponent(state.workcenter)}`, {
      params,
@@ -3,6 +3,11 @@ import {
  debounce,
  fetchWipAutocompleteItems,
} from '../core/autocomplete.js';
import {
  buildWipOverviewQueryParams,
  splitHoldByType as splitHoldByTypeShared,
  prepareParetoData as prepareParetoDataShared,
} from '../core/wip-derive.js';

ensureMesApiAvailable();
@@ -61,21 +66,8 @@ ensureMesApiAvailable();
  }

  function buildQueryParams() {
    return buildWipOverviewQueryParams(state.filters);
  }
  // ============================================================
  // API Functions (using MesApi)
@@ -95,23 +87,11 @@
    throw new Error(result.error || 'Failed to fetch summary');
  }

  async function fetchMatrix(signal = null) {
    const params = buildWipOverviewQueryParams(state.filters, activeStatusFilter);
    const result = await MesApi.get('/api/wip/overview/matrix', {
      params,
      timeout: API_TIMEOUT,
      signal
    });
    if (result.success) {
@@ -465,40 +445,15 @@
    nonQuality: null
  };

  // Task 2.1: Split hold data by type
  function splitHoldByType(data) {
    return splitHoldByTypeShared(data);
  }

  // Task 2.2: Prepare Pareto data (sort by QTY desc, calculate cumulative %)
  function prepareParetoData(items) {
    return prepareParetoDataShared(items);
  }

  // Task 3.1: Initialize Pareto charts
  function initParetoCharts() {
80
frontend/tests/wip-derive.test.js
Normal file
80
frontend/tests/wip-derive.test.js
Normal file
@@ -0,0 +1,80 @@
import test from 'node:test';
import assert from 'node:assert/strict';

import {
  buildWipOverviewQueryParams,
  buildWipDetailQueryParams,
  splitHoldByType,
  prepareParetoData,
} from '../src/core/wip-derive.js';

test('buildWipOverviewQueryParams keeps only non-empty filters', () => {
  const params = buildWipOverviewQueryParams({
    workorder: ' WO-1 ',
    lotid: '',
    package: 'PKG-A',
    type: 'QFN',
  });

  assert.deepEqual(params, {
    workorder: 'WO-1',
    package: 'PKG-A',
    type: 'QFN',
  });
});

test('buildWipOverviewQueryParams maps quality hold status filter', () => {
  const params = buildWipOverviewQueryParams({}, 'quality-hold');
  assert.deepEqual(params, {
    status: 'HOLD',
    hold_type: 'quality',
  });
});

test('buildWipDetailQueryParams uses page/page_size and shared filter mapper', () => {
  const params = buildWipDetailQueryParams({
    page: 2,
    pageSize: 100,
    filters: {
      workorder: 'WO',
      lotid: 'LOT',
      package: '',
      type: 'TSOP',
    },
    statusFilter: 'run',
  });

  assert.deepEqual(params, {
    page: 2,
    page_size: 100,
    workorder: 'WO',
    lotid: 'LOT',
    type: 'TSOP',
    status: 'RUN',
  });
});

test('splitHoldByType partitions quality/non-quality correctly', () => {
  const grouped = splitHoldByType({
    items: [
      { reason: 'Q1', holdType: 'quality' },
      { reason: 'NQ1', holdType: 'non-quality' },
      { reason: 'NQ2' },
    ],
  });

  assert.equal(grouped.quality.length, 1);
  assert.equal(grouped.nonQuality.length, 2);
});

test('prepareParetoData sorts by qty and builds cumulative percentages', () => {
  const data = prepareParetoData([
    { reason: 'B', qty: 20, lots: 1 },
    { reason: 'A', qty: 80, lots: 2 },
  ]);

  assert.deepEqual(data.reasons, ['A', 'B']);
  assert.deepEqual(data.qtys, [80, 20]);
  assert.deepEqual(data.cumulative, [80, 100]);
  assert.equal(data.totalQty, 100);
});
@@ -1,12 +1,12 @@
import { defineConfig } from 'vite';
import { resolve } from 'node:path';

export default defineConfig({
export default defineConfig(({ mode }) => ({
  publicDir: false,
  build: {
    outDir: '../src/mes_dashboard/static/dist',
    emptyOutDir: false,
    sourcemap: false,
    sourcemap: mode !== 'production',
    rollupOptions: {
      input: {
        portal: resolve(__dirname, 'src/portal/main.js'),
@@ -22,8 +22,17 @@ export default defineConfig({
      output: {
        entryFileNames: '[name].js',
        chunkFileNames: 'chunks/[name]-[hash].js',
        assetFileNames: '[name][extname]'
        assetFileNames: '[name][extname]',
        manualChunks(id) {
          if (!id.includes('node_modules')) {
            return;
          }
          if (id.includes('echarts')) {
            return 'vendor-echarts';
          }
          return 'vendor';
        }
      }
    }
  }
});
}));

@@ -30,6 +30,18 @@ def worker_exit(server, worker):
    except Exception as e:
        server.log.warning(f"Error stopping equipment sync worker: {e}")

    try:
        from mes_dashboard.core.cache_updater import stop_cache_updater
        stop_cache_updater()
    except Exception as e:
        server.log.warning(f"Error stopping cache updater: {e}")

    try:
        from mes_dashboard.core.redis_client import close_redis
        close_redis()
    except Exception as e:
        server.log.warning(f"Error closing redis client: {e}")

    # Then dispose database connections
    try:
        from mes_dashboard.core.database import dispose_engine

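The `worker_exit` hook above stops each background service behind its own try/except so one failing stop cannot block the rest. That pattern generalizes to a small shutdown registry; the sketch below is illustrative only (the class and hook names are hypothetical, not this project's actual API):

```python
# Minimal shutdown-registry sketch: hooks run in registration order,
# each guarded so a failing stop cannot block later ones.
class ShutdownRegistry:
    def __init__(self):
        self._hooks = []

    def register(self, name, stop_fn):
        # stop_fn should be idempotent: safe to call even if already stopped
        self._hooks.append((name, stop_fn))

    def shutdown(self, log=print):
        failed = []
        for name, stop_fn in self._hooks:
            try:
                stop_fn()
            except Exception as e:
                failed.append(name)
                log(f"Error stopping {name}: {e}")
        return failed

registry = ShutdownRegistry()
stopped = []
registry.register("cache_updater", lambda: stopped.append("cache_updater"))
registry.register("redis", lambda: (_ for _ in ()).throw(RuntimeError("boom")))
registry.register("db_engine", lambda: stopped.append("db_engine"))
failed = registry.shutdown(log=lambda msg: None)
```

Even though the simulated Redis stop raises, the DB engine hook still runs, which is the property the hunk above relies on.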
@@ -0,0 +1,2 @@
schema: spec-driven
created: 2026-02-08
@@ -0,0 +1,46 @@
## Context

The current architecture already supports single-port Gunicorn runtime, circuit-breaker-aware degraded responses, and watchdog-assisted recovery. However, critical security and lifecycle controls are uneven: production startup can still fall back to a weak secret key, CSRF is not enforced globally, and background resources are not fully registered in a single shutdown lifecycle. These gaps are operationally risky when pool pressure or restart churn occurs.

## Goals / Non-Goals

**Goals:**
- Make production startup fail fast when required security secrets are missing.
- Enforce CSRF validation for all state-changing endpoints without breaking existing frontend flows.
- Make worker/app shutdown deterministic by stopping all background workers and shared clients.
- Keep degraded responses for pool exhaustion and circuit-open states stable and retry-aware.
- Isolate health probe connectivity from main request pool contention.

**Non-Goals:**
- Replacing the LDAP provider or redesigning the full authentication architecture.
- Full CSP rollout across all templates in this change.
- Changing URL structure, page IA, or the single-port deployment topology.

## Decisions

1. **Production secret-key guard at startup**
   - Decision: enforce `SECRET_KEY` presence/strength in non-development modes and abort startup when invalid.
   - Rationale: prevents silent insecure deployment.

2. **Unified CSRF contract across form + JSON flows**
   - Decision: issue the CSRF token from the server session; validate a hidden form field for HTML forms and the `X-CSRF-Token` header for JSON POST/PUT/PATCH/DELETE requests.
   - Rationale: maintains current frontend behavior while covering non-form APIs.

3. **Centralized shutdown registry**
   - Decision: register explicit shutdown hooks that call cache updater stop, realtime sync stop, Redis close, and DB dispose in bounded order.
   - Rationale: avoids thread/client leaks during worker recycle and controlled reload.

4. **Health probe pool isolation**
   - Decision: use a dedicated lightweight DB health engine/pool for `/health` checks.
   - Rationale: prevents the health endpoint from being blocked by request-pool exhaustion, improving observability fidelity.

5. **Template-safe JS serialization**
   - Decision: replace HTML-escaped interpolation in JS string contexts with `tojson` serialization.
   - Rationale: avoids context-mismatch injection edge cases.

## Risks / Trade-offs

- **[Risk] CSRF rollout may break undocumented clients** → **Mitigation:** provide an opt-in transition flag and explicit error messaging during rollout.
- **[Risk] Strict startup secret validation can block misconfigured environments** → **Mitigation:** provide clear startup diagnostics and `.env.example` updates.
- **[Risk] Additional shutdown hooks can prolong worker exit** → **Mitigation:** bounded timeouts and idempotent stop handlers.
- **[Risk] Dedicated health pool introduces extra DB connections** → **Mitigation:** fixed minimal size and short timeout.
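Decision 2's dual CSRF contract (hidden form field for HTML forms, `X-CSRF-Token` header for JSON mutations) can be sketched framework-agnostically; function names, the `_csrf_token` session key, and field names below are illustrative assumptions, not the project's actual API:

```python
# CSRF contract sketch: the session holds a random token; HTML forms echo
# it in a hidden field, JSON mutations echo it in an X-CSRF-Token header.
import hmac
import secrets

SAFE_METHODS = {"GET", "HEAD", "OPTIONS"}

def issue_csrf_token(session):
    # create once per session; reuse thereafter so open tabs stay valid
    session.setdefault("_csrf_token", secrets.token_urlsafe(32))
    return session["_csrf_token"]

def csrf_ok(session, method, form, headers):
    if method.upper() in SAFE_METHODS:
        return True  # reads are exempt; only mutations are validated
    expected = session.get("_csrf_token")
    supplied = form.get("csrf_token") or headers.get("X-CSRF-Token")
    # constant-time compare avoids token guessing via timing side channels
    return bool(expected and supplied and hmac.compare_digest(expected, supplied))

session = {}
token = issue_csrf_token(session)
```

A request missing the token on any POST/PUT/PATCH/DELETE would be rejected before the mutation executes, matching the spec delta later in this commit.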
@@ -0,0 +1,40 @@
## Why

The Vite migration is functionally complete, but production runtime still has high-risk gaps in security baseline and worker lifecycle cleanup. Addressing these now prevents avoidable outages, authentication bypass risk, and unstable degradation behavior under pool pressure.

## What Changes

- Enforce production-safe startup security defaults (no weak SECRET_KEY fallback in non-development environments).
- Add first-class CSRF protection for admin forms and state-changing JSON APIs.
- Harden degradation behavior for pool exhaustion with a consistent retry/backoff contract and isolated health probing.
- Ensure background workers and shared clients (cache updater, realtime sync, Redis) are explicitly stopped on worker/app shutdown.
- Fix template-to-JavaScript variable serialization in the hold-detail fallback script.

## Capabilities

### New Capabilities
- `security-baseline-hardening`: Define mandatory secret/session/CSRF/XSS-safe baseline for production runtime.

### Modified Capabilities
- `runtime-resilience-recovery`: Strengthen shutdown lifecycle and degraded-response behavior for pool pressure scenarios.

## Impact

- Affected code:
  - `src/mes_dashboard/app.py`
  - `src/mes_dashboard/core/database.py`
  - `src/mes_dashboard/core/cache_updater.py`
  - `src/mes_dashboard/core/redis_client.py`
  - `src/mes_dashboard/routes/health_routes.py`
  - `src/mes_dashboard/routes/auth_routes.py`
  - `src/mes_dashboard/templates/hold_detail.html`
  - `gunicorn.conf.py`
  - `tests/`
- APIs:
  - `/health`
  - `/health/deep`
  - `/admin/login`
  - state-changing `/api/*` endpoints
- Operational behavior:
  - Keep single-port deployment model unchanged.
  - Improve degraded-state stability and startup safety gates.
@@ -0,0 +1,24 @@
## MODIFIED Requirements

### Requirement: Pool Exhaustion MUST Return Retry-Aware Degraded Responses
The system MUST return explicit degraded responses for connection pool exhaustion, including stable machine-readable retry metadata and HTTP retry hints.

#### Scenario: Pool exhausted under load
- **WHEN** concurrent requests exceed available database connections and pool wait timeout is reached
- **THEN** the API MUST return `DB_POOL_EXHAUSTED` with `retry_after_seconds` metadata and a `Retry-After` header instead of a generic 500 failure

## ADDED Requirements

### Requirement: Runtime Shutdown SHALL Cleanly Stop Background Services
Worker/app shutdown MUST stop long-lived background services and shared clients in deterministic order.

#### Scenario: Worker exits during recycle or graceful reload
- **WHEN** Gunicorn worker shutdown hooks are triggered
- **THEN** cache updater, realtime equipment sync worker, Redis client, and DB engine resources MUST be stopped/disposed without orphan threads

### Requirement: Health Probing SHALL Remain Available During Request-Pool Saturation
Health checks MUST avoid depending solely on the same request pool used by business APIs.

#### Scenario: Request pool saturation
- **WHEN** the main database request pool is exhausted
- **THEN** `/health` and `/health/deep` MUST still provide timely degraded status using isolated probe connectivity
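The pool-exhaustion requirement names the `DB_POOL_EXHAUSTED` code, the `retry_after_seconds` field, and the `Retry-After` header; a minimal sketch of a response builder satisfying that contract (the exact body shape beyond the named fields is assumed, not taken from the project):

```python
# Retry-aware degraded-response sketch for pool exhaustion.
def pool_exhausted_response(retry_after_seconds=5):
    """Return (body, status, headers) for a pool-exhausted degraded response."""
    body = {
        "success": False,
        "error_code": "DB_POOL_EXHAUSTED",          # stable machine-readable code
        "retry_after_seconds": retry_after_seconds,  # metadata mirror of the header
    }
    # 503 + Retry-After tells clients the failure is transient and when to retry
    headers = {"Retry-After": str(retry_after_seconds)}
    return body, 503, headers

body, status, headers = pool_exhausted_response()
```

Returning 503 with a retry hint instead of a generic 500 lets clients distinguish transient saturation from hard failures.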
@@ -0,0 +1,29 @@
## ADDED Requirements

### Requirement: Production Startup SHALL Reject Weak Session Secrets
The system MUST reject startup in non-development environments when `SECRET_KEY` is missing or configured with known insecure default values.

#### Scenario: Missing production secret key
- **WHEN** runtime starts with `FLASK_ENV` not equal to `development` and no secure secret key is configured
- **THEN** application startup MUST fail fast with an explicit configuration error

### Requirement: State-Changing Endpoints SHALL Enforce CSRF Validation
All state-changing endpoints that rely on cookie-based authentication MUST enforce CSRF token validation.

#### Scenario: Missing or invalid CSRF token
- **WHEN** a POST/PUT/PATCH/DELETE request is sent without a valid CSRF token
- **THEN** the server MUST reject the request with a client error and MUST NOT execute the mutation

### Requirement: Server-Rendered Values in JavaScript Context MUST Use Safe Serialization
Values inserted into inline JavaScript from templates MUST be serialized for JavaScript context safety.

#### Scenario: Hold reason rendered in fallback inline script
- **WHEN** server-side string values are embedded into script state payloads
- **THEN** template rendering MUST use JSON-safe serialization semantics to prevent script-context injection

### Requirement: Session Establishment SHALL Mitigate Fixation Risk
Successful admin login MUST rotate session identity material before granting authenticated privileges.

#### Scenario: Admin login success
- **WHEN** credentials are validated and admin session is created
- **THEN** session identity MUST be regenerated before storing authenticated user attributes
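The weak-secret requirement above can be expressed as a small startup guard. The insecure-default list and minimum length below are assumptions for illustration; the project's actual validation rules may differ:

```python
# Startup secret guard sketch: fail fast outside development when
# SECRET_KEY is absent, a known insecure default, or too short.
INSECURE_DEFAULTS = {"", "dev", "changeme", "secret", "dev-secret-key"}  # assumed list

def validate_secret_key(env, secret_key, min_length=32):
    """Raise RuntimeError when the configured secret is unsafe for production."""
    if env == "development":
        return True  # dev convenience: weak/missing secret is tolerated
    if not secret_key or secret_key in INSECURE_DEFAULTS or len(secret_key) < min_length:
        raise RuntimeError(
            f"SECRET_KEY missing or too weak; set a random value of >= {min_length} chars"
        )
    return True

dev_ok = validate_secret_key("development", None)       # allowed in dev
prod_ok = validate_secret_key("production", "x" * 48)   # long enough, not a default
try:
    validate_secret_key("production", "changeme")       # insecure default
    rejected = False
except RuntimeError:
    rejected = True
```

Calling such a guard before the app object is built guarantees the fail-fast behavior the scenario demands.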
@@ -0,0 +1,18 @@
## 1. Runtime Stability Hardening

- [x] 1.1 Add startup validation for `SECRET_KEY` and environment-aware secure defaults.
- [x] 1.2 Register centralized shutdown hooks to stop cache updater, realtime sync worker, Redis client, and DB engine.
- [x] 1.3 Isolate database health probing from request pool and keep degraded signal contract stable.
- [x] 1.4 Normalize pool-exhausted response metadata and retry headers across API error paths.

## 2. Security Baseline Enforcement

- [x] 2.1 Add CSRF token issuance/validation for form posts and JSON mutation endpoints.
- [x] 2.2 Update login flow to rotate session identity on successful authentication.
- [x] 2.3 Replace JS-context template interpolation in `hold_detail.html` with JSON-safe serialization.

## 3. Verification and Documentation

- [x] 3.1 Add tests for startup secret guard, CSRF rejection, and session-rotation behavior.
- [x] 3.2 Add lifecycle tests/validation for shutdown cleanup and health endpoint behavior under pool saturation.
- [x] 3.3 Update README/README.mdj runtime hardening sections and operator rollout notes.
@@ -0,0 +1,2 @@
schema: spec-driven
created: 2026-02-08
@@ -0,0 +1,46 @@
## Context

The migration delivered feature parity, but efficiency work remains: backend query paths still do broad copies and whole-frame recomputation even when only slices are needed. At the same time, business constraints explicitly require full-table caching for `resource` and `wip` because those datasets are intentionally small and frequently reused. This design optimizes around that constraint rather than removing it.

## Goals / Non-Goals

**Goals:**
- Keep `resource` and `wip` full-table caches intact.
- Reduce memory amplification from redundant cache representations.
- Replace repeated full merge/rebuild paths with incremental/indexed query plans where applicable.
- Increase reuse of browser-side compute modules for chart/table/filter/KPI derivations.
- Add measurable telemetry to verify latency and memory improvements.

**Non-Goals:**
- Rewriting all reporting endpoints to client-only mode.
- Removing Redis or the existing layered cache strategy.
- Changing user-visible filter semantics or report outputs.

## Decisions

1. **Constrained cache strategy**
   - Decision: retain full-table snapshots for `resource` and `wip`; optimize surrounding representations and derivation paths.
   - Rationale: business-approved data-size profile and low complexity for frequent lookups.

2. **Incremental + indexed path for heavy derived datasets**
   - Decision: add watermark/version-aware incremental refresh and per-column indexes for high-cardinality filters.
   - Rationale: avoids repeated full recompute and lowers request tail latency.

3. **Canonical in-process structure**
   - Decision: keep one canonical structure per cache domain and derive alternate views on demand.
   - Rationale: reduces 2x/3x memory amplification from parallel representations.

4. **Frontend compute module expansion**
   - Decision: extract reusable browser compute helpers for matrix/table/KPI transformations used across report pages.
   - Rationale: shifts deterministic shaping work off the backend and improves component reuse in the Vite architecture.

5. **Benchmark-driven acceptance**
   - Decision: add repeatable benchmark fixtures and telemetry thresholds as merge gates.
   - Rationale: prevent subjective "performance improved" claims without measurable proof.

## Risks / Trade-offs

- **[Risk] Incremental sync correctness drift** → **Mitigation:** version checksum validation and periodic full reconciliation jobs.
- **[Risk] Browser compute can increase client CPU on low-end devices** → **Mitigation:** bounded dataset chunking and a fallback server aggregation path.
- **[Risk] Refactor introduces subtle field-contract regressions** → **Mitigation:** keep export/header contract tests and fixture comparisons.
- **[Risk] Telemetry overhead** → **Mitigation:** low-cost counters/histograms with sampling where needed.
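Decision 2's watermark/version-aware incremental refresh can be sketched as an upsert keyed by primary id, where only rows newer than the last-seen watermark are fetched. All names and the simulated source below are hypothetical:

```python
# Watermark-based incremental sync sketch: fetch only rows whose version
# is newer than the last-seen watermark, then upsert them into the cache.
ALL_ROWS = [  # simulated source table; "version" acts as the watermark column
    {"id": 1, "version": 2, "qty": 10},
    {"id": 2, "version": 1, "qty": 5},
    {"id": 3, "version": 2, "qty": 7},
]

def fetch_changed_since(watermark):
    changed = [r for r in ALL_ROWS if r["version"] > watermark]
    new_watermark = max((r["version"] for r in ALL_ROWS), default=watermark)
    return changed, new_watermark

def incremental_sync(cache, watermark, fetch):
    rows, new_watermark = fetch(watermark)
    for row in rows:
        cache[row["id"]] = row  # upsert only the changed partitions
    return new_watermark

# Cache last synced at watermark 1: row 1 is stale, row 3 is missing.
cache = {1: {"id": 1, "version": 1, "qty": 9}, 2: {"id": 2, "version": 1, "qty": 5}}
new_wm = incremental_sync(cache, 1, fetch_changed_since)
```

Periodic full reconciliation (the drift mitigation above) would re-fetch everything and compare checksums, catching any rows this fast path missed.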
@@ -0,0 +1,36 @@
## Why

Current reporting workloads still spend unnecessary CPU and memory on repeated full-data merges, broad DataFrame copies, and duplicated cache representations. We need a focused efficiency phase that preserves the intentional full-table cache strategy for `resource` and `wip`, while reducing cost for other query paths and increasing frontend compute reuse.

## What Changes

- Introduce indexed/incremental cache synchronization for heavy report datasets that do not require full-table snapshots.
- Keep `resource` and `wip` as full-table caches by design, but reduce redundant in-process representations and copy overhead.
- Move additional derived calculations (chart/table/KPI/filter shaping) to reusable browser modules in the Vite frontend.
- Add cache/query efficiency telemetry and repeatable benchmark gates to validate gains.

## Capabilities

### New Capabilities
- `cache-indexed-query-acceleration`: Define incremental refresh and indexed query contracts for non-full-snapshot datasets.

### Modified Capabilities
- `cache-observability-hardening`: Add memory-efficiency and cache-structure telemetry expectations.
- `frontend-compute-shift`: Expand browser-side reusable compute coverage for report interactions.

## Impact

- Affected code:
  - `src/mes_dashboard/core/cache.py`
  - `src/mes_dashboard/services/resource_cache.py`
  - `src/mes_dashboard/services/realtime_equipment_cache.py`
  - `src/mes_dashboard/services/wip_service.py`
  - `src/mes_dashboard/routes/health_routes.py`
  - `frontend/src/core/`
  - `frontend/src/**/main.js`
  - `tests/`
- APIs:
  - read-heavy `/api/wip/*` and `/api/resource/*` endpoints (response contract unchanged)
- Operational behavior:
  - Preserve the current `resource` and `wip` full-table caching strategy.
  - Reduce server-side compute load through selective frontend compute offload.
@@ -0,0 +1,22 @@
## ADDED Requirements

### Requirement: Incremental Synchronization SHALL Use Versioned Watermarks
For heavy non-full-snapshot datasets, cache refresh SHALL support incremental synchronization keyed by stable version or watermark boundaries.

#### Scenario: Incremental refresh cycle
- **WHEN** source data version indicates partial changes since last sync
- **THEN** cache update logic MUST fetch and merge only changed partitions while preserving correctness guarantees

### Requirement: Query Paths SHALL Use Indexed Access for High-Frequency Filters
Query execution over cached data SHALL use prebuilt indexes for known high-frequency filter columns.

#### Scenario: Filtered report query
- **WHEN** request filters target indexed fields
- **THEN** result selection MUST avoid full dataset scans and maintain the existing response contract

### Requirement: Business-Mandated Full-Table Caches SHALL Be Preserved for Resource and WIP
The system SHALL continue to maintain full-table cache behavior for the `resource` and `wip` domains.

#### Scenario: Resource or WIP cache refresh
- **WHEN** cache update runs for `resource` or `wip`
- **THEN** the updater MUST retain full-table snapshot semantics and MUST NOT switch these domains to partial-only cache mode
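The indexed-access requirement amounts to prebuilt inverted indexes per high-frequency filter column, with candidate row ids intersected across filters instead of scanning every row. A minimal sketch (column names are illustrative):

```python
# Indexed selection sketch: one inverted index per filter column maps
# each value to the set of row positions holding that value.
from collections import defaultdict

def build_index(rows, column):
    index = defaultdict(set)
    for i, row in enumerate(rows):
        index[row[column]].add(i)
    return index

def select(rows, indexes, filters):
    # Intersect candidate row ids across all indexed filters; no full scan.
    candidate = None
    for column, value in filters.items():
        ids = indexes[column].get(value, set())
        candidate = ids if candidate is None else candidate & ids
    return [rows[i] for i in sorted(candidate or set())]

rows = [
    {"package": "PKG-A", "status": "RUN"},
    {"package": "PKG-A", "status": "HOLD"},
    {"package": "PKG-B", "status": "RUN"},
]
indexes = {c: build_index(rows, c) for c in ("package", "status")}
hits = select(rows, indexes, {"package": "PKG-A", "status": "RUN"})
```

Because only the index dictionaries are consulted, the cost scales with match-set size rather than table size, while the returned rows keep the existing response shape.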
@@ -0,0 +1,15 @@
## ADDED Requirements

### Requirement: Cache Telemetry SHALL Include Memory Amplification Signals
Operational telemetry MUST expose cache domain memory usage indicators and representation amplification factors.

#### Scenario: Deep health telemetry request
- **WHEN** operators inspect cache telemetry
- **THEN** telemetry MUST include per-domain memory footprint and amplification indicators sufficient to detect redundant structures

### Requirement: Efficiency Benchmarks SHALL Gate Cache Refactor Rollout
Cache/query efficiency changes MUST be validated against baseline latency and memory benchmarks before rollout.

#### Scenario: Pre-release validation
- **WHEN** cache refactor changes are prepared for deployment
- **THEN** benchmark results MUST demonstrate no regression beyond configured thresholds for P95 latency and memory usage
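One plausible definition of the amplification factor named above is the ratio of all bytes held across a domain's representations to the canonical snapshot's bytes — a factor near 1.0 means no redundant structures. This formula is an assumption, not the project's documented metric:

```python
# Amplification-factor sketch: total bytes across canonical + derived
# representations of one cache domain, divided by the canonical bytes.
def amplification_factor(canonical_bytes, derived_bytes_list):
    if canonical_bytes <= 0:
        return 0.0  # guard: no canonical snapshot means no meaningful ratio
    total = canonical_bytes + sum(derived_bytes_list)
    return round(total / canonical_bytes, 2)

# A 100 MB canonical frame plus an 80 MB index view and a 20 MB dict view
# amplifies memory 2x — a signal that one representation may be redundant.
factor = amplification_factor(100_000_000, [80_000_000, 20_000_000])
```

Surfacing this per domain in `/health/deep` would let operators spot when a refactor quietly reintroduces a parallel copy.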
@@ -0,0 +1,15 @@
## ADDED Requirements

### Requirement: Reusable Browser Compute Modules SHALL Power Report Derivations
Derived computations for report filters, KPI cards, chart series, and table projections SHALL be implemented through reusable frontend modules.

#### Scenario: Shared report derivation logic
- **WHEN** multiple report pages require equivalent data-shaping behavior
- **THEN** pages MUST consume shared compute modules instead of duplicating transformation logic per page

### Requirement: Browser Compute Shift SHALL Preserve Export and Field Contracts
Moving computations to the frontend MUST preserve existing field naming and export column contracts.

#### Scenario: User exports report after frontend-side derivation
- **WHEN** transformed data is rendered and exported
- **THEN** exported field names and ordering MUST remain consistent with governed field contract definitions
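The export-contract scenario can be enforced with a trivial comparison of exported headers against the governed contract, checking both names and order. The column list below is a made-up example, not the project's real contract:

```python
# Export-contract check sketch: exported headers must match the governed
# contract exactly — same names, same order.
GOVERNED_EXPORT_COLUMNS = ["workorder", "lotid", "package", "type", "qty"]  # assumed

def validate_export_contract(exported_columns, contract=GOVERNED_EXPORT_COLUMNS):
    """Raise when exported headers drift from the contract in name or order."""
    if list(exported_columns) != list(contract):
        raise ValueError(
            f"export contract drift: {list(exported_columns)} != {list(contract)}"
        )
    return True

ok = validate_export_contract(["workorder", "lotid", "package", "type", "qty"])
try:
    # Same names but reordered: still a contract violation.
    validate_export_contract(["workorder", "package", "lotid", "type", "qty"])
    drift_caught = False
except ValueError:
    drift_caught = True
```

Running such a check in the fixture-comparison tests mentioned in the design doc catches regressions introduced by the compute shift before release.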
@@ -0,0 +1,23 @@
## 1. Cache Structure and Sync Refactor

- [x] 1.1 Define canonical per-domain cache representation and remove redundant parallel structures.
- [x] 1.2 Implement version/watermark-based incremental sync path for eligible non-full-snapshot datasets.
- [x] 1.3 Keep `resource` and `wip` full-table cache behavior while optimizing surrounding parse/index pipelines.

## 2. Indexed Query Acceleration

- [x] 2.1 Add index builders for high-frequency filter columns used by report endpoints.
- [x] 2.2 Refactor read paths to use indexed selection and reduce broad DataFrame copy operations.
- [x] 2.3 Add fallback and reconciliation logic to guarantee correctness under incremental/index drift.

## 3. Frontend Compute Reuse Expansion

- [x] 3.1 Extract shared Vite compute modules for KPI/filter/chart/table derivations.
- [x] 3.2 Refactor report pages to consume shared modules without changing user-visible behavior.
- [x] 3.3 Validate export/header field contract consistency after compute shift.

## 4. Performance Validation and Docs

- [x] 4.1 Add benchmark fixtures for baseline vs refactor latency/memory comparison.
- [x] 4.2 Surface cache memory amplification and index efficiency telemetry in health/admin outputs.
- [x] 4.3 Update README/README.mdj with cache strategy constraints and performance governance rules.
@@ -0,0 +1,2 @@
schema: spec-driven
created: 2026-02-08
@@ -0,0 +1,45 @@
## Context

The project already provides watchdog-assisted restart and resilience diagnostics, but policy boundaries for automated recovery are not yet formalized end-to-end. In practice, this can lead to either under-recovery (manual delays) or over-recovery (restart churn). We also need stronger conda/systemd path consistency checks to prevent runtime drift across deploy scripts and services.

## Goals / Non-Goals

**Goals:**
- Make conda/systemd runtime path contracts explicit, validated, and drift-detectable.
- Implement a safe self-healing policy with cooldown and churn limits.
- Expose clear alert signals and recommended actions in health/admin payloads.
- Keep operator manual override available for incident control.

**Non-Goals:**
- Migrating from systemd to another orchestrator.
- Changing database vendor or introducing full autoscaling infrastructure.
- Removing existing admin restart endpoints.

## Decisions

1. **Single-source runtime contract**
   - Decision: centralize conda runtime path configuration consumed by systemd units, watchdog, and scripts.
   - Rationale: prevents mismatched interpreter/path drift.

2. **Guarded self-healing state machine**
   - Decision: implement a bounded restart policy (cooldown + max retries per time window + circuit-open gating).
   - Rationale: recovers quickly while preventing restart storms.

3. **Explicit recovery observability contract**
   - Decision: enrich health/admin payloads with churn counters, cooldown state, and recommended operator action.
   - Rationale: enables deterministic triage and alert automation.

4. **Auditability requirement**
   - Decision: emit structured logs/events for auto-restart decisions, manual overrides, and blocked restart attempts.
   - Rationale: supports incident retrospectives and policy tuning.

5. **Runbook-first rollout**
   - Decision: deploy policy changes behind documentation and validation gates, including rollback steps.
   - Rationale: operational safety for production adoption.

## Risks / Trade-offs

- **[Risk] Overly strict policy delays recovery** → **Mitigation:** configurable thresholds and emergency manual override.
- **[Risk] Aggressive policy causes churn loops** → **Mitigation:** hard stop on churn threshold breach and explicit cool-off windows.
- **[Risk] Added operational complexity** → **Mitigation:** concise runbook with decision tables and tested scripts.
- **[Risk] Drift detection false positives** → **Mitigation:** normalize path resolution and clearly define comparison sources.
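Decision 2's guarded state machine — cooldown between restarts plus a bounded attempt budget in a sliding window — can be sketched as follows (thresholds and names are illustrative, not the project's configured values):

```python
# Guarded restart-policy sketch: "allow" while within budget, "cooldown"
# shortly after a restart, "blocked" once churn exceeds the window budget.
from collections import deque

class RestartPolicy:
    def __init__(self, cooldown_s=60, max_attempts=3, window_s=600):
        self.cooldown_s = cooldown_s
        self.max_attempts = max_attempts
        self.window_s = window_s
        self.attempts = deque()   # restart timestamps inside the sliding window
        self.last_restart = None

    def decide(self, now):
        # drop attempts that fell out of the sliding window
        while self.attempts and now - self.attempts[0] > self.window_s:
            self.attempts.popleft()
        if len(self.attempts) >= self.max_attempts:
            return "blocked"      # churn guard: manual override required
        if self.last_restart is not None and now - self.last_restart < self.cooldown_s:
            return "cooldown"
        return "allow"

    def record_restart(self, now):
        self.attempts.append(now)
        self.last_restart = now

policy = RestartPolicy()
decisions = []
for t in (0, 30, 120, 240, 300):  # degradation events at these seconds
    d = policy.decide(t)
    decisions.append(d)
    if d == "allow":
        policy.record_restart(t)
```

The "blocked" state maps to the guarded mode described in the spec deltas below: auto-recovery pauses and an operator must acknowledge a manual override.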
@@ -0,0 +1,40 @@
## Why

Operations stability still depends heavily on manual intervention when workers degrade or pools saturate. We need a formal operations phase to align conda/systemd runtime contracts and add controlled self-healing with guardrails, so recovery is faster without triggering restart storms.

## What Changes

- Standardize conda-based runtime paths across app service, watchdog, and operational scripts from a single source of truth.
- Introduce a guarded worker self-healing policy (cooldown, churn windows, bounded retries, manual override).
- Add alert thresholds and machine-readable recovery signals for pool pressure, circuit-open persistence, and restart churn.
- Harden runbook documentation and scripts for deterministic restart, rollback, and incident triage.

## Capabilities

### New Capabilities
- `worker-self-healing-governance`: Define safe autonomous recovery behavior with anti-storm guardrails.

### Modified Capabilities
- `conda-systemd-runtime-alignment`: Extend runtime consistency requirements with startup validation and drift detection.
- `runtime-resilience-recovery`: Add auditable recovery-action requirements for automated and operator-triggered restart flows.

## Impact

- Affected code:
  - `deploy/systemd/*.service`
  - `scripts/worker_watchdog.py`
  - `src/mes_dashboard/routes/admin_routes.py`
  - `src/mes_dashboard/routes/health_routes.py`
  - `src/mes_dashboard/core/database.py`
  - `src/mes_dashboard/core/circuit_breaker.py`
  - `tests/`
  - `README.md`, `README.mdj`, runbook docs
- APIs:
  - `/health`
  - `/health/deep`
  - `/admin/api/system-status`
  - `/admin/api/worker/status`
  - `/admin/api/worker/restart`
- Operational behavior:
  - Preserve the single-port bind model.
  - Add a controlled self-healing policy and clearer alert thresholds.
@@ -0,0 +1,15 @@
## ADDED Requirements

### Requirement: Runtime Path Drift SHALL Be Detectable Before Service Start
Service startup checks MUST validate configured conda runtime paths across app, watchdog, and worker control scripts.

#### Scenario: Conda path mismatch detected
- **WHEN** startup validation finds runtime path inconsistency between configured units and scripts
- **THEN** service start MUST fail with actionable diagnostics instead of running with a partial mismatch

### Requirement: Conda/Systemd Contract SHALL Be Versioned in Operations Docs
The documented runtime contract MUST include versioned path assumptions and verification commands.

#### Scenario: Operator verifies deployment contract
- **WHEN** the operator follows runbook validation steps
- **THEN** commands MUST confirm active runtime paths match the documented conda/systemd contract
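The path-drift check above reduces to comparing the interpreter path each consumer (systemd unit, watchdog, scripts) resolves to, after normalization so cosmetic differences don't trigger false positives — the mitigation the design doc calls out. A sketch with hypothetical paths:

```python
# Runtime-contract drift check sketch: every consumer must resolve to the
# same conda interpreter path; normalization avoids false positives from
# trailing slashes or redundant separators.
import os

def detect_path_drift(declared_paths):
    normalized = {name: os.path.normpath(p) for name, p in declared_paths.items()}
    if len(set(normalized.values())) > 1:
        # Drift: return the full mapping as actionable diagnostics.
        return sorted(normalized.items())
    return []  # all consumers agree

drift = detect_path_drift({
    "systemd": "/opt/conda/envs/mes/bin/python",
    "watchdog": "/opt/conda/envs/mes/bin/python/",   # cosmetic difference only
    "scripts": "/opt/conda/envs/mes-old/bin/python", # real drift
})
```

Failing startup with the returned mapping gives the operator exactly which unit points at the stale environment.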
@@ -0,0 +1,15 @@
## ADDED Requirements

### Requirement: Recovery Recommendations SHALL Reflect Self-Healing Policy State
Health and admin resilience payloads MUST expose whether automated recovery is allowed, cooling down, or blocked by churn policy.

#### Scenario: Operator inspects degraded state
- **WHEN** `/health` or `/admin/api/worker/status` is requested during degradation
- **THEN** the response MUST include policy state, cooldown remaining time, and next recommended action

### Requirement: Manual Recovery Override SHALL Be Explicit and Controlled
Manual restart actions MUST bypass the automatic block only through authenticated operator pathways with explicit acknowledgement.

#### Scenario: Churn-blocked state with manual override request
- **WHEN** an authorized admin requests manual restart while auto-recovery is blocked
- **THEN** the system MUST execute the controlled restart path and log the override context for auditability
@@ -0,0 +1,22 @@

## ADDED Requirements

### Requirement: Automated Worker Recovery SHALL Use Bounded Policy Guards

Automated worker restart behavior MUST enforce cooldown periods and bounded restart attempts within a configurable time window.

#### Scenario: Repeated worker degradation within short window

- **WHEN** degradation events exceed configured restart-attempt budget
- **THEN** automated restarts MUST pause and surface a blocked-recovery signal for operator intervention

### Requirement: Restart-Churn Protection SHALL Prevent Recovery Storms

The runtime MUST classify restart churn and prevent uncontrolled restart loops.

#### Scenario: Churn threshold exceeded

- **WHEN** restart count crosses churn threshold in active window
- **THEN** watchdog MUST enter guarded mode and require explicit manual override before further restart attempts

### Requirement: Recovery Decisions SHALL Be Audit-Ready

Every auto-recovery decision and manual override action MUST be recorded with structured metadata.

#### Scenario: Worker restart decision emitted

- **WHEN** system executes or denies a restart action
- **THEN** structured logs/events MUST include reason, thresholds, actor/source, and resulting state

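The bounded-policy guards above (cooldown, retry budget, churn window, guarded mode) can be sketched as a small state machine. This is a minimal illustration, not the project's actual implementation; the class and parameter names are hypothetical.

```python
import time
from collections import deque


class RestartPolicy:
    """Illustrative guard: cooldown + bounded restart attempts in a sliding window."""

    def __init__(self, cooldown_s=60, max_attempts=3, window_s=600):
        self.cooldown_s = cooldown_s
        self.max_attempts = max_attempts
        self.window_s = window_s
        self._attempts = deque()   # timestamps of recent restarts
        self._last_restart = None
        self.guarded = False       # churn-blocked: manual override required

    def decide(self, now=None):
        """Return 'allow', 'cooldown', or 'blocked' for an auto-restart request."""
        now = time.monotonic() if now is None else now
        # Drop attempts that fell out of the churn window.
        while self._attempts and now - self._attempts[0] > self.window_s:
            self._attempts.popleft()
        if self.guarded:
            return "blocked"
        if self._last_restart is not None and now - self._last_restart < self.cooldown_s:
            return "cooldown"
        if len(self._attempts) >= self.max_attempts:
            self.guarded = True    # enter guarded mode; surface to operators
            return "blocked"
        return "allow"

    def record_restart(self, now=None):
        now = time.monotonic() if now is None else now
        self._attempts.append(now)
        self._last_restart = now
```

Each `decide` outcome maps directly onto the audit requirement: the returned state plus the thresholds involved are exactly the structured fields a restart decision log would carry.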
@@ -0,0 +1,23 @@

## 1. Conda/Systemd Contract Alignment

- [x] 1.1 Centralize runtime path configuration consumed by service units, watchdog, and scripts.
- [x] 1.2 Add startup validation that fails fast on conda path drift.
- [x] 1.3 Update systemd/watchdog integration tests for consistent runtime contract.

## 2. Worker Self-Healing Policy

- [x] 2.1 Implement bounded auto-restart policy (cooldown, retry budget, churn window).
- [x] 2.2 Add guarded mode behavior when churn threshold is exceeded.
- [x] 2.3 Implement authenticated manual override flow with explicit logging context.

## 3. Alerting and Operational Signals

- [x] 3.1 Expose policy-state fields in health/admin payloads (`allowed`, `cooldown`, `blocked`).
- [x] 3.2 Add structured audit events for restart decisions and override actions.
- [x] 3.3 Define alert thresholds and wire monitoring-friendly fields for pool/circuit/churn conditions.

## 4. Validation and Runbook Delivery

- [x] 4.1 Add tests for policy transitions, guarded mode, and override behavior.
- [x] 4.2 Validate single-port continuity during controlled recovery and hot reload paths.
- [x] 4.3 Update README/README.mdj and deployment runbook with verified operational procedures.

@@ -0,0 +1,2 @@
schema: spec-driven
created: 2026-02-08
@@ -0,0 +1,50 @@

## Context

The system has completed the single-port Vite architecture and the main P0/P1/P2 hardening, but residual risk concentrates in "cache slow-path lock contention + health-check hotspot queries + API boundary governance". Most of these issues only become visible under medium-to-high traffic; if they are not converged at this stage, later troubleshooting will be costly.

## Goals / Non-Goals

**Goals:**
- Complete the remaining stability and security fixes without changing page-interaction semantics or the single-port architecture.
- Make the cache/health paths more predictable under high concurrency and reduce log-related security risk.
- Use test coverage to ensure the fixes cause no functional regressions.

**Non-Goals:**
- Do not rewrite the main query flows or remove the `resource`/`wip` full-table cache strategy.
- Do not introduce heavyweight distributed rate-limit infrastructure.
- Do not change the semantics of front-end drill-down or reporting features.

## Decisions

1. **Cache publish consistency takes priority over local optimization**
   - Publish data and metadata via a staging key plus atomic rename/pipeline, so a failed publish never affects readability of the old data.

2. **Move parsing outside the lock; the lock only covers consistency check and commit**
   - The WIP process-cache slow path parses outside the lock, then double-checks and commits inside the lock, reducing lock hold time.

3. **Unify process-cache policies**
   - The realtime equipment cache gains `max_size` + LRU, consistent with the existing WIP/Resource caches.

4. **Health internal short cache enabled only outside testing**
   - TTL = 5 seconds, reducing repeated DB/Redis pressure from high-frequency probes; testing mode keeps real-time computation to avoid cross-test pollution.

5. **Lightweight in-memory rate limiting for high-cost APIs**
   - Throttle per IP + route window with tunable parameters, without introducing new external dependencies.

## Risks / Trade-offs

- [Risk] The cache-publish rework adds key-switching complexity → Mitigation: add tests covering publish failure and success.
- [Risk] The health cache briefly delays observability → Mitigation: cap the TTL at 5 seconds and disable it in testing.
- [Risk] In-memory rate limiting is not globally consistent across multiple workers → Mitigation: treat it as a safety valve first; upgrade to a Redis-based limiter later if needed.

## Migration Plan

1. Land the core cache and health fixes first (no API-contract impact).
2. Then introduce the API boundary/rate-limit work and extract shared utilities.
3. Add unit and integration tests; run the benchmark smoke.
4. Update the README documentation and environment-variable notes.

## Open Questions

- Should the default rate-limit thresholds for high-cost APIs be differentiated per endpoint (WIP vs Resource)?
- Should this later be upgraded to Redis-based distributed rate limiting for global consistency across workers?

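Decision 1 above (staging key + atomic rename) can be sketched as follows. This is a minimal illustration assuming a redis-py-compatible client; the key names and function signature are hypothetical, not the project's actual API.

```python
import json


def publish_snapshot(r, data_key, meta_key, payload, meta):
    """Two-phase publish: write the new snapshot to staging keys, then swap
    both into place with RENAME inside one pipeline. Readers see either the
    old snapshot or the complete new one, never a partial mix.
    `r` is assumed to be a redis.Redis-compatible client."""
    staging_data = f"{data_key}:staging"
    staging_meta = f"{meta_key}:staging"
    r.set(staging_data, json.dumps(payload))
    r.set(staging_meta, json.dumps(meta))
    pipe = r.pipeline(transaction=True)
    pipe.rename(staging_data, data_key)   # server-side key swap
    pipe.rename(staging_meta, meta_key)
    pipe.execute()
    # Any failure before execute() leaves the previously published keys intact.
```

Because the swap is the last step, a serialization or network failure during staging leaves the old `data_key`/`meta_key` fully readable, which is exactly the publish-failure guarantee the decision calls for.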
@@ -0,0 +1,44 @@

## Why

The previous round fixed the high-risk core issues, but a batch of residual problems would still amplify risk under high concurrency, long uptime, and malicious/abnormal input (cache publish consistency, lock contention, health-check load, input boundaries, and rate governance). This round converges those tail risks to an acceptable range, avoiding later operational and performance instability.

## What Changes

- Harden the WIP cache publish flow so a failed update cannot pollute the existing read path.
- Narrow the process-cache slow-path lock scope to avoid parsing large JSON while holding the lock.
- Add a bounded LRU to the realtime equipment process cache, consistent with the WIP/Resource policy.
- Add depth protection to resource-route NaN cleaning (avoiding deep-recursion risk).
- Extract shared boolean-parameter parsing to eliminate duplicated logic.
- Make the filter-cache view names configurable, removing hardcoded coupling.
- Add log redaction for sensitive connection strings.
- Add a 5-second internal short cache to `/health` and `/health/deep` (disabled in testing mode).
- Add lightweight, tunable rate limiting to high-cost query APIs.
- Update README/README.mdj and the verification tests.

## Capabilities

### New Capabilities
- `api-safety-hygiene`: API input boundaries, shared parameter parsing, configurable query sources, and rate governance for high-cost endpoints.

### Modified Capabilities
- `cache-observability-hardening`: strengthens cache publish consistency, process-cache lock scope, and bounded-policy alignment.
- `runtime-resilience-recovery`: operational-safety requirements for the health-check short cache and sensitive-data log redaction.

## Impact

- Affected code:
  - `src/mes_dashboard/core/cache_updater.py`
  - `src/mes_dashboard/core/cache.py`
  - `src/mes_dashboard/services/realtime_equipment_cache.py`
  - `src/mes_dashboard/routes/resource_routes.py`
  - `src/mes_dashboard/routes/wip_routes.py`
  - `src/mes_dashboard/routes/hold_routes.py`
  - `src/mes_dashboard/services/filter_cache.py`
  - `src/mes_dashboard/core/database.py`
  - `src/mes_dashboard/routes/health_routes.py`
- APIs:
  - `/health`, `/health/deep`
  - `/api/wip/detail/<workcenter>`, `/api/wip/overview/*`
  - `/api/resource/*` (high-cost routes)
- Docs/tests:
  - `README.md`, `README.mdj`, `tests/*`

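The lightweight rate limiting described above can be sketched as a process-local sliding-window limiter keyed by (client IP, route). The class and default values are illustrative assumptions, not the project's actual implementation; as the design notes, a process-local limiter is only approximate across multiple workers.

```python
import time
from collections import defaultdict, deque
from threading import Lock


class WindowRateLimiter:
    """Illustrative per-(client, route) sliding-window limiter, process-local only."""

    def __init__(self, limit=30, window_s=60.0):
        self.limit = limit
        self.window_s = window_s
        self._hits = defaultdict(deque)   # (ip, route) -> request timestamps
        self._lock = Lock()

    def allow(self, client_ip, route, now=None):
        now = time.monotonic() if now is None else now
        key = (client_ip, route)
        with self._lock:
            q = self._hits[key]
            # Evict timestamps that slid out of the window.
            while q and now - q[0] > self.window_s:
                q.popleft()
            if len(q) >= self.limit:
                return False   # caller should answer HTTP 429 with retry guidance
            q.append(now)
            return True
```

A route decorator would call `allow(request.remote_addr, request.path)` and return a throttled response with a `Retry-After` hint when it is false.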
@@ -0,0 +1,29 @@

## ADDED Requirements

### Requirement: Recursive Payload Cleaning MUST Enforce Depth Safety

Routes that normalize nested payloads MUST prevent unbounded recursion depth.

#### Scenario: Deeply nested response object

- **WHEN** NaN-cleaning helper receives deeply nested list/dict payload
- **THEN** cleaning logic MUST enforce max depth or iterative traversal and return safely without recursion failure

### Requirement: Filter Source Names MUST Be Configurable

Filter cache query sources MUST NOT rely on hardcoded view names only.

#### Scenario: Environment-specific view names

- **WHEN** deployment sets custom filter-source environment variables
- **THEN** filter cache loader MUST resolve and query configured view names

### Requirement: High-Cost APIs SHALL Apply Basic Rate Guardrails

High-cost read endpoints SHALL apply configurable request-rate guardrails to reduce abuse and accidental bursts.

#### Scenario: Burst traffic from same client

- **WHEN** a client exceeds configured request budget for guarded endpoints
- **THEN** endpoint SHALL return throttled response with clear retry guidance

### Requirement: Common Boolean Query Parsing SHALL Be Shared

Boolean query parsing in routes SHALL use shared helper behavior.

#### Scenario: Different routes parse include flags

- **WHEN** routes parse common boolean query parameters
- **THEN** parsing behavior MUST be consistent across routes via shared utility

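The depth-safety requirement above can be satisfied with an explicit depth guard on the cleaning helper. A minimal sketch, assuming JSON-style dict/list payloads; the function name and default depth are illustrative.

```python
import math


def clean_nan(obj, max_depth=50):
    """Replace NaN/inf floats with None in nested dict/list payloads.
    An explicit depth guard stops descent past max_depth, returning the
    subtree untouched instead of risking a RecursionError."""
    def _clean(value, depth):
        if depth > max_depth:
            return value   # depth guard: stop descending, never blow the stack
        if isinstance(value, float) and (math.isnan(value) or math.isinf(value)):
            return None
        if isinstance(value, dict):
            return {k: _clean(v, depth + 1) for k, v in value.items()}
        if isinstance(value, list):
            return [_clean(v, depth + 1) for v in value]
        return value
    return _clean(obj, 0)
```

An iterative traversal with an explicit stack would satisfy the same requirement; the depth guard is simply the smaller change to an existing recursive helper.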
@@ -0,0 +1,26 @@

## ADDED Requirements

### Requirement: Cache Publish MUST Preserve Previous Readable Snapshot on Failure

When refreshing full-table cache payloads, the system MUST avoid exposing partially published states to readers.

#### Scenario: Publish fails after payload serialization

- **WHEN** a cache refresh has prepared new payload but publish operation fails
- **THEN** previously published cache keys MUST remain readable and metadata MUST remain consistent with old snapshot

#### Scenario: Publish succeeds

- **WHEN** publish operation completes successfully
- **THEN** data payload and metadata keys MUST be visible as one coherent new snapshot

### Requirement: Process-Level Cache Slow Path SHALL Minimize Lock Hold Time

Large payload parsing MUST NOT happen inside long-held process cache locks.

#### Scenario: Cache miss under concurrent requests

- **WHEN** multiple requests hit process cache miss
- **THEN** parsing work SHALL happen outside lock-protected mutation section, and lock scope SHALL be limited to consistency check + commit

### Requirement: Process-Level Cache Policies MUST Stay Consistent Across Services

All service-local process caches MUST support bounded capacity with deterministic eviction.

#### Scenario: Realtime equipment cache growth

- **WHEN** realtime equipment process cache reaches configured capacity
- **THEN** entries MUST be evicted according to deterministic LRU behavior

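The lock-scope requirement above is the classic double-checked slow path: parse outside the lock, then re-check and commit inside it. A minimal sketch with hypothetical names; versioning details will differ in the real cache.

```python
import json
from threading import Lock


class ProcessCache:
    """Sketch of the double-checked slow path: expensive parsing happens with
    no lock held; the lock only guards a cheap version check plus commit."""

    def __init__(self):
        self._lock = Lock()
        self._value = None
        self._version = None

    def get(self, raw_payload, version):
        with self._lock:                  # fast path: cheap check only
            if self._version == version:
                return self._value
        parsed = json.loads(raw_payload)  # heavy work, lock released
        with self._lock:
            if self._version != version:  # double-check: first writer wins
                self._value = parsed
                self._version = version
            return self._value
```

Under a concurrent miss, several threads may parse redundantly, but none of them serializes the others behind the parse; the lock-held section stays O(1).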
@@ -0,0 +1,19 @@

## ADDED Requirements

### Requirement: Health Endpoints SHALL Use Short Internal Memoization

Health and deep-health computation SHALL use a short-lived internal cache to prevent probe storms from amplifying backend load.

#### Scenario: Frequent monitor scrapes

- **WHEN** health endpoints are called repeatedly within a small window
- **THEN** service SHALL return memoized payload for up to 5 seconds in non-testing environments

#### Scenario: Testing mode

- **WHEN** app is running in testing mode
- **THEN** health endpoint memoization MUST be bypassed to preserve deterministic tests

### Requirement: Logs MUST Redact Connection Secrets

Runtime logs MUST avoid exposing DB connection credentials.

#### Scenario: Connection string appears in log message

- **WHEN** a log message contains DB URL credentials
- **THEN** logger output MUST redact password and sensitive userinfo before emission

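The health memoization requirement can be sketched as a tiny TTL decorator with a testing bypass. Names and the decorator shape are illustrative assumptions.

```python
import time


def memoize_health(ttl_s=5.0, testing=False):
    """Decorator sketch: cache a health payload for ttl_s seconds; bypass
    caching entirely in testing mode so tests stay deterministic."""
    def wrap(fn):
        state = {"at": None, "value": None}

        def inner(*args, **kwargs):
            if testing:
                return fn(*args, **kwargs)   # testing mode: always recompute
            now = time.monotonic()
            if state["at"] is None or now - state["at"] >= ttl_s:
                state["value"] = fn(*args, **kwargs)
                state["at"] = now
            return state["value"]
        return inner
    return wrap
```

Wired onto the `/health` computation, a scrape storm collapses into at most one backend probe per TTL window.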
@@ -0,0 +1,22 @@

## 1. Cache Consistency and Contention Hardening

- [x] 1.1 Harden WIP cache publish in `cache_updater.py` to preserve old snapshot on publish failure.
- [x] 1.2 Refactor WIP process-cache slow path in `core/cache.py` so heavy parse runs outside lock.
- [x] 1.3 Extend realtime equipment process cache with bounded `max_size` + deterministic LRU and add regression tests.

## 2. API Safety and Config Hygiene

- [x] 2.1 Add depth-safe NaN cleaning in `resource_routes.py` and tests for deep payloads.
- [x] 2.2 Add shared boolean query parser in `core/utils.py` and switch `wip_routes.py` / `hold_routes.py` to it.
- [x] 2.3 Make filter-cache source views configurable (env-based) in `filter_cache.py` and add config tests.

## 3. Runtime Guardrails

- [x] 3.1 Add DB connection-string redaction logging filter in `core/database.py` (or logging bootstrap) with tests.
- [x] 3.2 Add 5-second internal memoization for `/health` and `/health/deep` (disabled in testing) and tests.
- [x] 3.3 Add lightweight rate limiting for selected high-cost APIs with clear throttling responses and tests.

## 4. Validation and Documentation

- [x] 4.1 Run targeted backend/frontend tests and benchmark smoke gate.
- [x] 4.2 Update `README.md` and `README.mdj` with round-3 hardening notes and new env variables.

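Task 3.1's redaction logging filter can be sketched with a `logging.Filter` that masks the userinfo part of connection URLs before any handler sees the record. The regex and filter name are illustrative assumptions.

```python
import logging
import re

# Matches the userinfo portion of URLs like oracle://user:pass@host:1521/db
_URL_CREDS = re.compile(r"(?P<scheme>\w+://)[^/@\s]+(?=@)")


class RedactSecretsFilter(logging.Filter):
    """Masks user:password@ userinfo in any URL-shaped token of a log message."""

    def filter(self, record):
        msg = record.getMessage()
        redacted = _URL_CREDS.sub(r"\g<scheme>***", msg)
        if redacted != msg:
            # Replace the formatted message so downstream handlers see no secret.
            record.msg, record.args = redacted, ()
        return True   # never drop the record, only sanitize it
```

Attaching the filter at logger (or handler) bootstrap covers every emitter, instead of relying on each call site to remember to redact.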
@@ -0,0 +1,2 @@
schema: spec-driven
created: 2026-02-08
@@ -0,0 +1,61 @@

## Context

After round 3 the main flows are stable, but three classes of technical debt remain:
- The Resource cache keeps both a DataFrame and a full records copy in the same process, causing memory amplification.
- The Oracle queries for Resource and Realtime Equipment duplicate SQL strings across services, so future edits can drift apart.
- Some service-boundary type annotations and magic numbers are unsystematic, raising maintenance cost.

Constraints:
- `resource` / `wip` keep the full-table cache strategy; data sources and refresh cadence are unchanged.
- External API fields and front-end behavior are unchanged.
- The single-port architecture and existing operations contract are preserved.

## Goals / Non-Goals

**Goals:**
- Reduce duplicated in-process representations in the Resource cache while keeping query output compatible.
- Maintain cross-service Oracle query fragments from a single source.
- Give key service/cache modules consistent type annotations and named constants.

**Non-Goals:**
- No database schema changes and no changes to SQL result columns.
- No rewrite of the overall cache architecture (Redis + process cache stays).
- No new infrastructure or external dependencies.

## Decisions

1. The Resource derived index becomes a "row-position index" rather than keeping a full records copy
   - Today: the index retains `records` plus several bucketed record sets, duplicating the DataFrame contents.
   - Decision: the index keeps only row positions (integer indices) and essential metadata; dict output is converted on demand from the DataFrame.
   - Trade-off: each output pays a small conversion cost, but resident memory duplication drops significantly.

2. Create a shared Oracle query-constant module
   - Today: `resource_cache.py` and `realtime_equipment_cache.py` each maintain their own base SQL.
   - Decision: extract `services/sql_fragments.py` (or an equivalent module) to own the shared query text and table/view names.
   - Trade-off: one extra level of indirection, but query semantics stay consistent and changes are controlled.

3. Type and constant governance proceeds "core boundaries first, then spread"
   - Today: some functions mix `Optional` and PEP 604 styles, and magic numbers are scattered across cache/service code.
   - Decision: first unify the annotation style and high-frequency constants (TTL, size, window, limits) in the files touched this round.
   - Trade-off: no one-shot project-wide cleanup, avoiding large-scale noise; this establishes a baseline that can expand sustainably.

## Risks / Trade-offs

- [Risk] The row-position index drifts from the DataFrame version → Mitigation: rebuild the index on every cache invalidation and keep a version check.
- [Risk] Lazy conversion makes query latency fluctuate → Mitigation: keep the process cache and optimize small-batch output on hot paths.
- [Risk] Extracted shared SQL constants cause reference mistakes → Mitigation: add unit tests verifying the query text matches the existing column contract.
- [Risk] Type/constant cleanup changes behavior → Mitigation: equivalence-only refactoring, keep original values, cover with regression tests.

## Migration Plan

1. Refactor the Resource index representation first, keeping API output unchanged.
2. Extract the shared SQL fragments and switch both cache services over.
3. Clean up types and constants in scope and add tests.
4. Update README / README.mdj and the OpenSpec tasks; run the targeted backend/frontend test sets.

Rollback:
- If compatibility problems appear, revert to the original records-based index and the old inline SQL (a single-file rollback each).

## Open Questions

- Whether the next round should extend the same governance to the remaining constants and types in `wip_service.py` (this round is limited to the residual scope).

@@ -0,0 +1,31 @@

## Why

The remaining risks concentrate on maintainability and memory efficiency: the Resource cache keeps multiple data representations in one process, some query SQL is maintained in duplicate across cache services, and type annotations and magic numbers are still inconsistent. These issues will not cause an immediate outage, but they raise memory usage, future change cost, and regression risk, so they should be converged without changing existing behavior.

## What Changes

- Change the Resource derived index to a "lightweight index + lazy output" representation, avoiding a duplicated full records copy in the process.
- Consolidate the Oracle query strings of Resource and Realtime Equipment into a shared SQL-constant module, reducing duplicated definitions and drift risk.
- Align type annotations (especially at cache/index/service boundaries) and promote high-frequency magic numbers to named constants or configurable parameters.
- Keep the existing API contract, full-table cache strategy, single-port architecture, and front-end behavior unchanged.

## Capabilities

### New Capabilities
- `resource-cache-representation-normalization`: replaces multiple full in-process data copies with a single authoritative representation plus a lightweight index, while keeping the existing query response structure.
- `oracle-query-fragment-governance`: extracts cross-service Oracle query fragments into shared constants/templates to keep query semantics consistent.
- `maintainability-type-and-constant-hygiene`: establishes working conventions for type annotations and named constants, reducing magic numbers and annotation-style drift.

### Modified Capabilities
- `cache-observability-hardening`: adds observability-consistency requirements for the memory amplification factor after the index-representation change.

## Impact

- Main affected files:
  - `src/mes_dashboard/services/resource_cache.py`
  - `src/mes_dashboard/services/realtime_equipment_cache.py`
  - `src/mes_dashboard/services/resource_service.py` (if index output requires it)
  - `src/mes_dashboard/sql/*` or a new shared SQL-constant module
  - `src/mes_dashboard/config/constants.py`, `src/mes_dashboard/core/utils.py`
  - Corresponding tests and README/README.mdj docs
- No new external dependencies; external API paths and field contracts are unchanged.

@@ -0,0 +1,8 @@

## MODIFIED Requirements

### Requirement: Cache Telemetry SHALL Include Memory Amplification Signals

Operational telemetry MUST expose cache-domain memory usage indicators and representation amplification factors, and MUST differentiate between authoritative data payload and derived/index helper structures.

#### Scenario: Deep health telemetry request after representation normalization

- **WHEN** operators inspect cache telemetry for resource or WIP domains
- **THEN** telemetry MUST include per-domain memory footprint, amplification indicators, and enough structure detail to verify that full-record duplication is not reintroduced

@@ -0,0 +1,15 @@

## ADDED Requirements

### Requirement: Core Cache and Service Boundaries MUST Use Consistent Type Annotation Style

Core cache/service modules touched by this change SHALL use a consistent and explicit type-annotation style for public and internal helper boundaries.

#### Scenario: Reviewing updated cache/service modules

- **WHEN** maintainers inspect function signatures in affected modules
- **THEN** optional and collection types MUST follow a single consistent style and remain compatible with the project Python baseline

### Requirement: High-Frequency Magic Numbers MUST Be Replaced by Named Constants

Cache, throttling, and index-related numeric literals that control behavior MUST be extracted to named constants or env-configurable settings.

#### Scenario: Tuning cache/index behavior

- **WHEN** operators need to tune cache/index thresholds
- **THEN** they MUST find values in named constants or environment variables rather than scattered inline literals

@@ -0,0 +1,15 @@

## ADDED Requirements

### Requirement: Shared Oracle Query Fragments SHALL Have a Single Source of Truth

Cross-service Oracle query fragments for resource and equipment cache loading MUST be defined in a shared module and imported by service implementations.

#### Scenario: Update common table/view reference

- **WHEN** a common table or view name changes
- **THEN** operators and developers MUST be able to update one shared definition without editing duplicated SQL literals across services

### Requirement: Service Queries MUST Preserve Existing Columns and Semantics

Services consuming shared Oracle query fragments SHALL preserve existing selected columns, filters, and downstream payload behavior.

#### Scenario: Resource and equipment cache refresh after refactor

- **WHEN** cache services execute queries via shared fragments
- **THEN** resulting payload structure MUST remain compatible with existing aggregation and API contracts

@@ -0,0 +1,22 @@

## ADDED Requirements

### Requirement: Resource Derived Index MUST Avoid Full Record Duplication

Resource derived index SHALL use lightweight row-position references instead of storing full duplicated record payloads alongside the process DataFrame cache.

#### Scenario: Build index from cached DataFrame

- **WHEN** resource cache data is parsed from Redis into process-level DataFrame
- **THEN** the derived index MUST store position-based references and metadata without a second full records copy

### Requirement: Resource Query APIs SHALL Preserve Existing Response Contract

Resource query APIs MUST keep existing output fields and semantics after index representation normalization.

#### Scenario: Read all resources after normalization

- **WHEN** callers request all resources or filtered resource lists
- **THEN** the returned payload MUST remain field-compatible with pre-normalization responses

### Requirement: Cache Invalidation MUST Keep Index/Data Coherent

The system SHALL invalidate and rebuild DataFrame/index representations atomically at cache refresh boundaries.

#### Scenario: Redis-backed cache refresh completes

- **WHEN** a new resource cache snapshot is published
- **THEN** stale index references MUST be invalidated before subsequent reads use refreshed DataFrame data

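The row-position index above can be sketched with pandas: the index maps a key to integer row positions, and dict records are materialized lazily only for the rows a query actually returns. Function names and the key column are illustrative assumptions.

```python
import pandas as pd


def build_position_index(df, key_col):
    """Map each key value to its integer row positions; no record duplication."""
    index = {}
    for pos, key in enumerate(df[key_col].tolist()):
        index.setdefault(key, []).append(pos)
    return index


def rows_for_key(df, index, key):
    """Materialize dict records on demand, only for the requested rows."""
    positions = index.get(key, [])
    return df.iloc[positions].to_dict(orient="records")
```

The index costs one small int list per key instead of a second full copy of every record, which is the memory-amplification reduction the requirement targets; rebuilding it together with the DataFrame at refresh boundaries keeps the two coherent.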
@@ -0,0 +1,22 @@

## 1. Resource Cache Representation Normalization

- [x] 1.1 Refactor `resource_cache` derived index to use lightweight row-position references instead of full duplicated records payload.
- [x] 1.2 Keep `get_all_resources` / `get_resources_by_filter` API outputs backward compatible while sourcing data from normalized representation.
- [x] 1.3 Update cache telemetry fields to reflect normalized representation and verify amplification calculation remains interpretable.

## 2. Oracle Query Fragment Governance

- [x] 2.1 Extract shared Oracle SQL fragments/constants for resource/equipment cache loading into a common module.
- [x] 2.2 Replace duplicated SQL literals in `resource_cache.py` and `realtime_equipment_cache.py` with shared definitions.
- [x] 2.3 Add/adjust tests to lock expected query semantics and prevent drift.

## 3. Maintainability Hygiene

- [x] 3.1 Normalize type annotations in touched cache/service modules to one consistent style.
- [x] 3.2 Replace high-frequency magic numbers with named constants or env-driven config in touched modules.
- [x] 3.3 Confirm existing login/API rate-limit and bool parser utilities remain centralized without new duplication.

## 4. Verification and Documentation

- [x] 4.1 Run targeted backend tests for resource cache, equipment cache, health/admin, and route behavior.
- [x] 4.2 Update `README.md` and `README.mdj` with round-4 hardening notes.

@@ -0,0 +1,2 @@
schema: spec-driven
created: 2026-02-08
@@ -0,0 +1,65 @@

## Context

The previous round completed the bulk of the P0/P1/P2 refactor, but code review still found several residual high-risk points:
- `LDAP_API_URL` lacks scheme/host guards, a configurable SSRF risk.
- The process-level DataFrame cache is TTL-only, with no capacity bound.
- The circuit breaker logs state transitions while holding its lock, risking amplified lock contention.
- Global security headers are not yet emitted uniformly.
- Pagination parameters still lack lower-bound validation.

These issues span `app/core/services/routes/tests` and amount to cross-module security and stability fixes.

## Goals / Non-Goals

**Goals:**
- Establish a testable minimum defense line for the LDAP endpoint, HTTP response headers, and input boundaries.
- Give the process-level cache bounded capacity with predictable eviction behavior.
- Reduce lock-contention risk inside the circuit breaker so slow handlers cannot amplify blocking.
- Keep the single port, the existing API contract, and front-end interaction semantics unchanged.

**Non-Goals:**
- No full WAF/zero-trust architecture.
- No rewrite of the existing cache architecture into an external cache service.
- No changes to reporting features or page flows.

## Decisions

1. **LDAP URL startup validation (fail-fast)**
   - Decision: validate `LDAP_API_URL` during `auth_service` startup, restricting it to `https` and an allowlisted host (configured via env); if invalid, disable the LDAP auth path and log an error.
   - Rationale: seals the configuration-driven SSRF risk with minimal change and does not affect local auth mode.

2. **Bound the ProcessLevelCache**
   - Decision: add `max_size` with LRU eviction (`OrderedDict`) to `ProcessLevelCache`; `set` evicts the oldest key.
   - Rationale: keeps the TTL behavior while preventing long-lived accumulation of high-cardinality keys.

3. **Circuit breaker logs outside the lock**
   - Decision: `_transition_to` only updates state and assembles the log message inside the lock; the actual logger call moves outside the lock.
   - Rationale: shortens the lock-held section so a slow I/O handler cannot block other request paths.

4. **Unified global security-header injection**
   - Decision: add `CSP`, `X-Frame-Options`, `X-Content-Type-Options`, and `Referrer-Policy` in `app.after_request`, plus `HSTS` in production.
   - Rationale: a centralized policy covers all pages and APIs, reducing the chance of omissions.

5. **Consistent pagination bounds**
   - Decision: apply uniform `max(1, min(...))` bounds to `page` and `page_size`.
   - Rationale: prevents negative or extreme values from causing unnecessary load and unexpected behavior.

## Risks / Trade-offs

- **[Risk] An incomplete LDAP allowlist breaks login** → **Mitigation:** provide clear error messages and local-auth fallback guidance.
- **[Risk] A too-small cache bound lowers the hit rate** → **Mitigation:** make `max_size` configurable; start with a conservative default and watch telemetry.
- **[Risk] An overly strict CSP breaks existing inline scripts** → **Mitigation:** start with `default-src 'self'` and a compatible policy; fine-tune with nonces/allowlists if needed.
- **[Risk] Behavior changes cause test regressions** → **Mitigation:** add unit/integration tests covering every fix point.

## Migration Plan

1. Land the backend fixes first (auth/cache/circuit breaker/app headers/routes).
2. Add tests (LDAP validation, LRU, lock-free logging, headers, pagination bounds).
3. Run the existing health checks and key integration tests.
4. Update the security and stability sections of README/README.mdj.
5. If compatibility issues surface after deployment, the LDAP host allowlist and CSP details can be temporarily relaxed via env.

## Open Questions

- Do the LDAP host allowlists need multiple domains per environment (e.g., intranet + DR site)?
- Should CSP switch to a strict nonce-based mode immediately, or keep the compatible policy first?

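Decision 4 above can be sketched framework-free as a function that returns the baseline header set; in Flask it would be applied inside an `app.after_request` hook. The specific header values are illustrative assumptions, not the project's confirmed policy.

```python
def security_headers(production=False):
    """Baseline security-header set. In Flask, wire via an after_request hook:
        resp.headers.setdefault(name, value) for each pair returned here.
    Values below are illustrative defaults."""
    headers = {
        "Content-Security-Policy": "default-src 'self'",
        "X-Frame-Options": "DENY",
        "X-Content-Type-Options": "nosniff",
        "Referrer-Policy": "same-origin",
    }
    if production:
        # HSTS is only meaningful behind TLS, hence production-only.
        headers["Strict-Transport-Security"] = "max-age=31536000; includeSubDomains"
    return headers
```

Using `setdefault` at injection time lets an individual route override a header (e.g., a relaxed CSP for one page) without fighting the global hook.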
@@ -0,0 +1,40 @@

## Why

The previous round completed the core stability refactor, but several high-priority risks (LDAP URL validation, unbounded cache growth, lock-held circuit-breaker logging, security-header gaps, pagination lower-bound validation) remain unconverged. Under long uptime and malicious input these accumulate availability and security risk, so they are closed out in this round.

## What Changes

- Add startup validation for the LDAP API base URL (restricted to `https` and allowlisted hosts), removing an attacker-controllable SSRF target.
- Add `max_size` with LRU eviction to the process-level cache, preventing unbounded memory growth from high-cardinality keys.
- Rework the circuit-breaker state-transition flow to avoid logging while the lock is held.
- Add global security headers (CSP, X-Frame-Options, X-Content-Type-Options, Referrer-Policy, HSTS).
- Tighten pagination lower-bound validation so negative values and unreasonable page sizes never reach the query flow.
- Add matching tests and documentation for the above, keeping the single port and existing front-end semantics unchanged.

## Capabilities

### New Capabilities
- `security-surface-hardening`: defines the minimum defense line for the remaining security surface (SSRF protection, security headers, input-boundary validation).

### Modified Capabilities
- `cache-observability-hardening`: extends cache governance with bounded capacity and an eviction policy for the process-level cache.
- `runtime-resilience-recovery`: adds the circuit-breaker lock-contention fix and compatibility requirements between security headers and operational diagnostic responses.

## Impact

- Affected code:
  - `src/mes_dashboard/services/auth_service.py`
  - `src/mes_dashboard/core/cache.py`
  - `src/mes_dashboard/services/resource_cache.py`
  - `src/mes_dashboard/core/circuit_breaker.py`
  - `src/mes_dashboard/app.py`
  - `src/mes_dashboard/routes/wip_routes.py`
  - `tests/`
  - `README.md`, `README.mdj`
- APIs:
  - `/health`, `/health/deep`
  - `/api/wip/detail/<workcenter>`
  - `/admin/login` (indirectly affected by LDAP base validation)
- Operational behavior:
  - Keeps the single port and the existing reporting UI flow.
  - Strengthens the security and stability defense line without changing feature semantics.

|

## ADDED Requirements

### Requirement: Process-Level Cache SHALL Use Bounded Capacity with Deterministic Eviction

Process-level parsed-data caches MUST enforce a configurable maximum key capacity and use deterministic eviction behavior when capacity is exceeded.

#### Scenario: Cache capacity reached

- **WHEN** a new cache entry is inserted and key capacity is at limit
- **THEN** cache MUST evict entries according to defined policy before storing the new key

#### Scenario: Repeated access updates recency

- **WHEN** an existing cache key is read or overwritten
- **THEN** eviction order MUST reflect recency semantics so hot keys are retained preferentially

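The bounded-capacity requirement above combines TTL expiry with `max_size` LRU eviction; `collections.OrderedDict` gives both recency tracking (`move_to_end`) and deterministic eviction (`popitem(last=False)`). A minimal sketch with illustrative names and defaults:

```python
import time
from collections import OrderedDict


class BoundedTTLCache:
    """TTL + max_size LRU sketch: reads refresh recency; inserting at
    capacity evicts the least recently used key first."""

    def __init__(self, max_size=64, ttl_s=300.0):
        self.max_size = max_size
        self.ttl_s = ttl_s
        self._data = OrderedDict()   # key -> (expires_at, value)

    def get(self, key, now=None):
        now = time.monotonic() if now is None else now
        item = self._data.get(key)
        if item is None:
            return None
        expires_at, value = item
        if now > expires_at:
            del self._data[key]      # TTL expiry keeps existing behavior
            return None
        self._data.move_to_end(key)  # mark as most recently used
        return value

    def set(self, key, value, now=None):
        now = time.monotonic() if now is None else now
        if key in self._data:
            self._data.move_to_end(key)
        elif len(self._data) >= self.max_size:
            self._data.popitem(last=False)   # evict the LRU entry
        self._data[key] = (now + self.ttl_s, value)
```

Both scenarios fall out directly: eviction at capacity is `popitem(last=False)`, and recency updates are the `move_to_end` calls on read and overwrite.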
@@ -0,0 +1,12 @@

## ADDED Requirements

### Requirement: Circuit Breaker State Transitions SHALL Avoid Lock-Held Logging

Circuit breaker state transitions MUST avoid executing logger I/O while internal state locks are held.

#### Scenario: State transition occurs

- **WHEN** circuit breaker transitions between CLOSED, OPEN, or HALF_OPEN
- **THEN** lock-protected section MUST complete state mutation before emitting transition log output

#### Scenario: Slow log handler under load

- **WHEN** logger handlers are slow or blocked
- **THEN** circuit breaker lock contention MUST remain bounded and MUST NOT serialize unrelated request paths behind logging latency

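The lock-held-logging requirement is a small restructuring of the transition method: mutate state and assemble the message under the lock, emit after release. A minimal sketch with hypothetical names:

```python
import logging
from threading import Lock

logger = logging.getLogger(__name__)


class Breaker:
    """Sketch: state mutation and message assembly happen under the lock;
    the (possibly slow) logger call runs only after the lock is released."""

    def __init__(self):
        self._lock = Lock()
        self.state = "CLOSED"

    def _transition_to(self, new_state, reason):
        with self._lock:
            old_state, self.state = self.state, new_state
            message = f"circuit breaker {old_state} -> {new_state}: {reason}"
        # Outside the lock: a blocked log handler no longer serializes
        # other threads waiting to read or mutate breaker state.
        logger.warning(message)
        return old_state
```

The message string is captured inside the lock so the log reflects the exact transition, even if another thread transitions again before the log call runs.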
@@ -0,0 +1,34 @@

## ADDED Requirements

### Requirement: LDAP Authentication Endpoint Configuration SHALL Be Strictly Validated

The system MUST validate LDAP authentication endpoint configuration before use, including HTTPS scheme enforcement and host allowlist checks.

#### Scenario: Invalid LDAP URL configuration detected

- **WHEN** `LDAP_API_URL` is missing, non-HTTPS, or points to a host outside the configured allowlist
- **THEN** the service MUST reject LDAP authentication calls and emit actionable diagnostics without sending credentials to that endpoint

#### Scenario: Valid LDAP URL configuration accepted

- **WHEN** `LDAP_API_URL` uses HTTPS and host is allowlisted
- **THEN** LDAP authentication requests MAY proceed with normal timeout and error handling behavior

### Requirement: Security Response Headers SHALL Be Applied Globally

All HTTP responses MUST include baseline security headers suitable for dashboard and API traffic.

#### Scenario: Standard response emitted

- **WHEN** any route returns a response
- **THEN** response MUST include `Content-Security-Policy`, `X-Frame-Options`, `X-Content-Type-Options`, and `Referrer-Policy`

#### Scenario: Production transport hardening

- **WHEN** runtime environment is production
- **THEN** response MUST include `Strict-Transport-Security`

### Requirement: Pagination Input Boundaries SHALL Be Enforced

Endpoints accepting pagination parameters MUST enforce lower and upper bounds before query execution.

#### Scenario: Negative or zero pagination inputs

- **WHEN** client sends `page <= 0` or `page_size <= 0`
- **THEN** server MUST normalize values to minimum supported bounds

#### Scenario: Excessive page size requested

- **WHEN** client sends `page_size` above configured maximum
- **THEN** server MUST clamp to maximum supported page size

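The LDAP validation requirement can be sketched with `urllib.parse.urlsplit`: require `https` and an allowlisted hostname before any credential leaves the process. Function name and the example hosts are illustrative assumptions.

```python
from urllib.parse import urlsplit


def validate_ldap_api_url(url, allowed_hosts):
    """Fail fast on configurable SSRF targets: require https and an
    allowlisted hostname, raising before any outbound auth call."""
    if not url:
        raise ValueError("LDAP_API_URL is not configured")
    parts = urlsplit(url)
    if parts.scheme != "https":
        raise ValueError(f"LDAP_API_URL must use https, got {parts.scheme!r}")
    if parts.hostname not in allowed_hosts:
        raise ValueError(f"LDAP_API_URL host {parts.hostname!r} is not allowlisted")
    return url
```

Calling this at service initialization (rather than per request) gives the fail-fast behavior the design describes: a misconfigured URL disables the LDAP path with an actionable error instead of sending credentials to an arbitrary endpoint.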
@@ -0,0 +1,24 @@

## 1. LDAP Endpoint Hardening

- [x] 1.1 Add strict `LDAP_API_URL` validation (`https` + allowlisted hosts) in auth service initialization.
- [x] 1.2 Add tests for valid/invalid LDAP URL configurations and ensure unsafe URLs are rejected without outbound auth call.

## 2. Bounded Process Cache

- [x] 2.1 Extend `ProcessLevelCache` with configurable `max_size` and LRU eviction behavior.
- [x] 2.2 Wire bounded cache configuration for WIP/Resource process-level caches and add regression tests.

## 3. Circuit Breaker Lock Contention Reduction

- [x] 3.1 Refactor circuit breaker transition logging to execute outside lock-protected section.
- [x] 3.2 Add tests verifying transition logs are emitted while state mutation remains correct.

## 4. HTTP Security Headers and Input Boundary Validation

- [x] 4.1 Add global `after_request` security headers (CSP, frame, content-type, referrer, HSTS in production).
- [x] 4.2 Tighten pagination boundary handling (`page`/`page_size`) for WIP detail endpoint and add tests.

## 5. Validation and Documentation

- [x] 5.1 Run targeted backend/frontend tests plus benchmark smoke to confirm no behavior regression.
- [x] 5.2 Update `README.md` and `README.mdj` with round-2 security/stability hardening notes.

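Task 4.2's pagination tightening is a small normalization helper applying the `max(1, min(...))` bounds from the design decision. The helper name and default values are illustrative assumptions.

```python
def clamp_pagination(page, page_size, max_page_size=200, default_page_size=50):
    """Normalize pagination inputs before query execution.
    Non-numeric input falls back to safe defaults; numeric input is clamped
    to [1, ...] for page and [1, max_page_size] for page_size."""
    try:
        page = int(page)
    except (TypeError, ValueError):
        page = 1
    try:
        page_size = int(page_size)
    except (TypeError, ValueError):
        page_size = default_page_size
    page = max(1, page)                               # lower bound
    page_size = max(1, min(page_size, max_page_size)) # both bounds
    return page, page_size
```

Applying the clamp at the route boundary means negative offsets and oversized pages never reach the query layer, which is the requirement's intent.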
33
openspec/specs/api-safety-hygiene/spec.md
Normal file
@@ -0,0 +1,33 @@
|
||||
# api-safety-hygiene Specification

## Purpose
TBD - created by archiving change residual-hardening-round3. Update Purpose after archive.

## Requirements

### Requirement: Recursive Payload Cleaning MUST Enforce Depth Safety
Routes that normalize nested payloads MUST prevent unbounded recursion depth.

#### Scenario: Deeply nested response object
- **WHEN** NaN-cleaning helper receives deeply nested list/dict payload
- **THEN** cleaning logic MUST enforce max depth or iterative traversal and return safely without recursion failure
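The depth-safety requirement above can be sketched with an explicit stack instead of recursion. This is an illustrative helper, not the project's actual cleaner; the function name and the `MAX_CLEAN_DEPTH` value are assumptions.

```python
import math
from typing import Any

MAX_CLEAN_DEPTH = 64  # hypothetical guard; a real limit would be project-configured


def clean_nan(payload: Any, max_depth: int = MAX_CLEAN_DEPTH) -> Any:
    """Replace NaN floats with None using an explicit stack (no recursion)."""

    def scrub(value: Any) -> Any:
        return None if isinstance(value, float) and math.isnan(value) else value

    if not isinstance(payload, (dict, list)):
        return scrub(payload)

    root: Any = {} if isinstance(payload, dict) else []
    # Each stack entry: (source container, cleaned container, depth)
    stack = [(payload, root, 0)]
    while stack:
        src, dst, depth = stack.pop()
        if depth >= max_depth:
            raise ValueError("payload nesting exceeds max depth")
        items = src.items() if isinstance(src, dict) else enumerate(src)
        for key, value in items:
            if isinstance(value, dict):
                child: Any = {}
            elif isinstance(value, list):
                child = []
            else:
                child = scrub(value)
            if isinstance(dst, dict):
                dst[key] = child
            else:
                dst.append(child)
            if isinstance(value, (dict, list)):
                stack.append((value, child, depth + 1))
    return root
```

Because traversal depth is tracked per container, a hostile or accidental deeply nested payload fails fast instead of exhausting the interpreter's recursion limit.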
### Requirement: Filter Source Names MUST Be Configurable
Filter cache query sources MUST NOT rely on hardcoded view names only.

#### Scenario: Environment-specific view names
- **WHEN** deployment sets custom filter-source environment variables
- **THEN** filter cache loader MUST resolve and query configured view names

### Requirement: High-Cost APIs SHALL Apply Basic Rate Guardrails
High-cost read endpoints SHALL apply configurable request-rate guardrails to reduce abuse and accidental bursts.

#### Scenario: Burst traffic from same client
- **WHEN** a client exceeds configured request budget for guarded endpoints
- **THEN** endpoint SHALL return throttled response with clear retry guidance
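A minimal per-client request budget satisfying the guardrail scenario could look like the sliding-window sketch below. The class name, budget, and window values are illustrative assumptions, not the project's implementation.

```python
from __future__ import annotations

import time
from collections import defaultdict, deque
from threading import Lock


class FixedWindowThrottle:
    """Per-client request budget over a sliding time window (illustrative)."""

    def __init__(self, budget: int = 30, window_seconds: float = 60.0) -> None:
        self.budget = budget
        self.window = window_seconds
        self._hits: dict[str, deque[float]] = defaultdict(deque)
        self._lock = Lock()

    def allow(self, client_id: str, now: float | None = None) -> tuple[bool, float]:
        """Return (allowed, retry_after_seconds) for one request."""
        now = time.monotonic() if now is None else now
        with self._lock:
            hits = self._hits[client_id]
            while hits and now - hits[0] >= self.window:
                hits.popleft()  # drop hits that fell outside the window
            if len(hits) >= self.budget:
                # Retry guidance: time until the oldest hit expires
                return False, self.window - (now - hits[0])
            hits.append(now)
            return True, 0.0
```

The returned `retry_after_seconds` maps naturally onto a `Retry-After` header in the throttled response.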
### Requirement: Common Boolean Query Parsing SHALL Be Shared
Boolean query parsing in routes SHALL use shared helper behavior.

#### Scenario: Different routes parse include flags
- **WHEN** routes parse common boolean query parameters
- **THEN** parsing behavior MUST be consistent across routes via shared utility
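The shared utility could be as small as the sketch below; the accepted truthy/falsy value sets are an assumption, not the project's definition.

```python
from __future__ import annotations

_TRUTHY = {"1", "true", "yes", "on", "y"}
_FALSY = {"0", "false", "no", "off", "n", ""}


def parse_bool_param(raw: str | None, default: bool = False) -> bool:
    """Shared boolean query-parameter parsing so all routes agree."""
    if raw is None:
        return default
    value = raw.strip().lower()
    if value in _TRUTHY:
        return True
    if value in _FALSY:
        return False
    return default  # unrecognized values fall back rather than erroring
```

Routes would call this with e.g. `parse_bool_param(request.args.get("include_hold"))` instead of re-implementing the comparison inline.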
26
openspec/specs/cache-indexed-query-acceleration/spec.md
Normal file
@@ -0,0 +1,26 @@
# cache-indexed-query-acceleration Specification

## Purpose
TBD - created by archiving change p1-cache-query-efficiency. Update Purpose after archive.

## Requirements

### Requirement: Incremental Synchronization SHALL Use Versioned Watermarks
For heavy non-full-snapshot datasets, cache refresh SHALL support incremental synchronization keyed by stable version or watermark boundaries.

#### Scenario: Incremental refresh cycle
- **WHEN** source data version indicates partial changes since last sync
- **THEN** cache update logic MUST fetch and merge only changed partitions while preserving correctness guarantees
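The watermark-keyed refresh above can be sketched as: compare per-partition versions against stored watermarks and fetch only what advanced. The class and method names are illustrative assumptions.

```python
from dataclasses import dataclass, field
from typing import Any


@dataclass
class WatermarkSync:
    """Merge only partitions whose version advanced past the stored watermark."""

    watermarks: dict = field(default_factory=dict)
    cache: dict = field(default_factory=dict)

    def refresh(self, source_versions: dict, fetch_partition) -> list:
        """Return the partitions actually fetched this cycle."""
        changed = [
            part for part, version in source_versions.items()
            if version > self.watermarks.get(part, -1)
        ]
        for part in changed:
            self.cache[part] = fetch_partition(part)  # only changed data is loaded
            self.watermarks[part] = source_versions[part]
        return changed
```

Unchanged partitions keep their previously merged payloads, so a cycle where nothing moved costs no fetches at all.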
### Requirement: Query Paths SHALL Use Indexed Access for High-Frequency Filters
Query execution over cached data SHALL use prebuilt indexes for known high-frequency filter columns.

#### Scenario: Filtered report query
- **WHEN** request filters target indexed fields
- **THEN** result selection MUST avoid full dataset scans and maintain existing response contract

### Requirement: Business-Mandated Full-Table Caches SHALL Be Preserved for Resource and WIP
The system SHALL continue to maintain full-table cache behavior for `resource` and `wip` domains.

#### Scenario: Resource or WIP cache refresh
- **WHEN** cache update runs for `resource` or `wip`
- **THEN** the updater MUST retain full-table snapshot semantics and MUST NOT switch these domains to partial-only cache mode
@@ -36,3 +36,53 @@ The system MUST define alert thresholds for sustained degraded state, repeated w
- **WHEN** degraded status persists beyond configured duration
- **THEN** the monitoring contract MUST classify the service as alert-worthy with actionable context

### Requirement: Cache Telemetry SHALL Include Memory Amplification Signals
Operational telemetry MUST expose cache-domain memory usage indicators and representation amplification factors, and MUST differentiate between authoritative data payload and derived/index helper structures.

#### Scenario: Deep health telemetry request after representation normalization
- **WHEN** operators inspect cache telemetry for resource or WIP domains
- **THEN** telemetry MUST include per-domain memory footprint, amplification indicators, and enough structure detail to verify that full-record duplication is not reintroduced

### Requirement: Efficiency Benchmarks SHALL Gate Cache Refactor Rollout
Cache/query efficiency changes MUST be validated against baseline latency and memory benchmarks before rollout.

#### Scenario: Pre-release validation
- **WHEN** cache refactor changes are prepared for deployment
- **THEN** benchmark results MUST demonstrate no regression beyond configured thresholds for P95 latency and memory usage
### Requirement: Process-Level Cache SHALL Use Bounded Capacity with Deterministic Eviction
Process-level parsed-data caches MUST enforce a configurable maximum key capacity and use deterministic eviction behavior when capacity is exceeded.

#### Scenario: Cache capacity reached
- **WHEN** a new cache entry is inserted and key capacity is at limit
- **THEN** cache MUST evict entries according to defined policy before storing the new key

#### Scenario: Repeated access updates recency
- **WHEN** an existing cache key is read or overwritten
- **THEN** eviction order MUST reflect recency semantics so hot keys are retained preferentially
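Both scenarios above (bounded capacity, recency on read and overwrite) fall out naturally from an `OrderedDict`-based LRU. This is a generic sketch, not the project's `ProcessLevelCache`.

```python
from __future__ import annotations

from collections import OrderedDict
from typing import Any


class BoundedLRUCache:
    """max_size + LRU sketch: reads and overwrites both refresh recency."""

    def __init__(self, max_size: int = 128) -> None:
        self.max_size = max_size
        self._data: OrderedDict = OrderedDict()

    def get(self, key: str) -> Any | None:
        if key not in self._data:
            return None
        self._data.move_to_end(key)  # a read keeps the key hot
        return self._data[key]

    def put(self, key: str, value: Any) -> None:
        if key in self._data:
            self._data.move_to_end(key)  # overwrite also refreshes recency
        self._data[key] = value
        while len(self._data) > self.max_size:
            self._data.popitem(last=False)  # evict least-recently-used first
```

Eviction is deterministic: the entry at the front of the ordered dict (least recently touched) is always dropped first.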
### Requirement: Cache Publish MUST Preserve Previous Readable Snapshot on Failure
When refreshing full-table cache payloads, the system MUST avoid exposing partially published states to readers.

#### Scenario: Publish fails after payload serialization
- **WHEN** a cache refresh has prepared new payload but publish operation fails
- **THEN** previously published cache keys MUST remain readable and metadata MUST remain consistent with old snapshot

#### Scenario: Publish succeeds
- **WHEN** publish operation completes successfully
- **THEN** data payload and metadata keys MUST be visible as one coherent new snapshot
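One common way to meet this requirement is a pointer-flip publish: stage the full new snapshot under a fresh version, then switch a single current-version pointer. The sketch below is an in-memory illustration of that pattern, not the project's Redis publish path.

```python
from __future__ import annotations

from typing import Any


class SnapshotStore:
    """Pointer-flip publish: readers see the old snapshot until commit succeeds."""

    def __init__(self) -> None:
        self._snapshots: dict = {}
        self._current: int | None = None

    def publish(self, payload: dict, metadata: dict) -> None:
        version = (self._current or 0) + 1
        # Stage the full snapshot first; a failure here leaves _current untouched,
        # so the previous snapshot stays readable and coherent.
        self._snapshots[version] = {"payload": payload, "metadata": metadata}
        self._current = version  # single flip exposes payload + metadata together

    def read(self) -> dict | None:
        return None if self._current is None else self._snapshots[self._current]
```

In a Redis-backed cache the same shape appears as versioned keys plus an atomically updated pointer key (or a MULTI/EXEC rename), so payload and metadata can never be observed half-updated.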
### Requirement: Process-Level Cache Slow Path SHALL Minimize Lock Hold Time
Large payload parsing MUST NOT happen inside long-held process cache locks.

#### Scenario: Cache miss under concurrent requests
- **WHEN** multiple requests hit process cache miss
- **THEN** parsing work SHALL happen outside lock-protected mutation section, and lock scope SHALL be limited to consistency check + commit

### Requirement: Process-Level Cache Policies MUST Stay Consistent Across Services
All service-local process caches MUST support bounded capacity with deterministic eviction.

#### Scenario: Realtime equipment cache growth
- **WHEN** realtime equipment process cache reaches configured capacity
- **THEN** entries MUST be evicted according to deterministic LRU behavior
@@ -24,3 +24,17 @@ Runbooks and deployment documentation MUST describe the same conda/systemd/watch
- **WHEN** an operator performs deploy, health check, and rollback from documentation
- **THEN** documented commands and paths MUST work without requiring venv-specific assumptions

### Requirement: Runtime Path Drift SHALL Be Detectable Before Service Start
Service startup checks MUST validate configured conda runtime paths across app, watchdog, and worker control scripts.

#### Scenario: Conda path mismatch detected
- **WHEN** startup validation finds runtime path inconsistency between configured units and scripts
- **THEN** service start MUST fail with actionable diagnostics instead of running with partial mismatch

### Requirement: Conda/Systemd Contract SHALL Be Versioned in Operations Docs
The documented runtime contract MUST include versioned path assumptions and verification commands.

#### Scenario: Operator verifies deployment contract
- **WHEN** operator follows runbook validation steps
- **THEN** commands MUST confirm active runtime paths match documented conda/systemd contract
@@ -50,3 +50,17 @@ Frontend matrix/filter computations SHALL produce deterministic selection and fi
- **WHEN** users toggle matrix cells across group, family, and resource rows
- **THEN** selected-state rendering and filtered equipment result sets MUST remain level-correct and reversible

### Requirement: Reusable Browser Compute Modules SHALL Power Report Derivations
Derived computations for report filters, KPI cards, chart series, and table projections SHALL be implemented through reusable frontend modules.

#### Scenario: Shared report derivation logic
- **WHEN** multiple report pages require equivalent data-shaping behavior
- **THEN** pages MUST consume shared compute modules instead of duplicating transformation logic per page

### Requirement: Browser Compute Shift SHALL Preserve Export and Field Contracts
Moving computations to frontend MUST preserve existing field naming and export column contracts.

#### Scenario: User exports report after frontend-side derivation
- **WHEN** transformed data is rendered and exported
- **THEN** exported field names and ordering MUST remain consistent with governed field contract definitions
@@ -0,0 +1,19 @@

# maintainability-type-and-constant-hygiene Specification

## Purpose
TBD - created by archiving change residual-hardening-round4. Update Purpose after archive.

## Requirements

### Requirement: Core Cache and Service Boundaries MUST Use Consistent Type Annotation Style
Core cache/service modules touched by this change SHALL use a consistent and explicit type-annotation style for public and internal helper boundaries.

#### Scenario: Reviewing updated cache/service modules
- **WHEN** maintainers inspect function signatures in affected modules
- **THEN** optional and collection types MUST follow a single consistent style and remain compatible with the project Python baseline

### Requirement: High-Frequency Magic Numbers MUST Be Replaced by Named Constants
Cache, throttling, and index-related numeric literals that control behavior MUST be extracted to named constants or env-configurable settings.

#### Scenario: Tuning cache/index behavior
- **WHEN** operators need to tune cache/index thresholds
- **THEN** they MUST find values in named constants or environment variables rather than scattered inline literals
19
openspec/specs/oracle-query-fragment-governance/spec.md
Normal file
@@ -0,0 +1,19 @@
# oracle-query-fragment-governance Specification

## Purpose
TBD - created by archiving change residual-hardening-round4. Update Purpose after archive.

## Requirements

### Requirement: Shared Oracle Query Fragments SHALL Have a Single Source of Truth
Cross-service Oracle query fragments for resource and equipment cache loading MUST be defined in a shared module and imported by service implementations.

#### Scenario: Update common table/view reference
- **WHEN** a common table or view name changes
- **THEN** operators and developers MUST be able to update one shared definition without editing duplicated SQL literals across services

### Requirement: Service Queries MUST Preserve Existing Columns and Semantics
Services consuming shared Oracle query fragments SHALL preserve existing selected columns, filters, and downstream payload behavior.

#### Scenario: Resource and equipment cache refresh after refactor
- **WHEN** cache services execute queries via shared fragments
- **THEN** resulting payload structure MUST remain compatible with existing aggregation and API contracts
@@ -0,0 +1,26 @@

# resource-cache-representation-normalization Specification

## Purpose
TBD - created by archiving change residual-hardening-round4. Update Purpose after archive.

## Requirements

### Requirement: Resource Derived Index MUST Avoid Full Record Duplication
Resource derived index SHALL use lightweight row-position references instead of storing full duplicated record payloads alongside the process DataFrame cache.

#### Scenario: Build index from cached DataFrame
- **WHEN** resource cache data is parsed from Redis into process-level DataFrame
- **THEN** the derived index MUST store position-based references and metadata without a second full records copy

### Requirement: Resource Query APIs SHALL Preserve Existing Response Contract
Resource query APIs MUST keep existing output fields and semantics after index representation normalization.

#### Scenario: Read all resources after normalization
- **WHEN** callers request all resources or filtered resource lists
- **THEN** the returned payload MUST remain field-compatible with pre-normalization responses

### Requirement: Cache Invalidation MUST Keep Index/Data Coherent
The system SHALL invalidate and rebuild DataFrame/index representations atomically at cache refresh boundaries.

#### Scenario: Redis-backed cache refresh completes
- **WHEN** a new resource cache snapshot is published
- **THEN** stale index references MUST be invalidated before subsequent reads use refreshed DataFrame data
@@ -48,3 +48,47 @@ The system MUST expose machine-readable resilience thresholds, restart-churn ind

#### Scenario: Admin status includes restart churn summary
- **WHEN** operators call `/admin/api/system-status` or `/admin/api/worker/status`
- **THEN** responses MUST include bounded restart history summary within a configured time window and indicate whether churn threshold is exceeded

### Requirement: Recovery Recommendations SHALL Reflect Self-Healing Policy State
Health and admin resilience payloads MUST expose whether automated recovery is allowed, cooling down, or blocked by churn policy.

#### Scenario: Operator inspects degraded state
- **WHEN** `/health` or `/admin/api/worker/status` is requested during degradation
- **THEN** response MUST include policy state, cooldown remaining time, and next recommended action

### Requirement: Manual Recovery Override SHALL Be Explicit and Controlled
Manual restart actions MUST bypass automatic block only through authenticated operator pathways with explicit acknowledgement.

#### Scenario: Churn-blocked state with manual override request
- **WHEN** authorized admin requests manual restart while auto-recovery is blocked
- **THEN** system MUST execute controlled restart path and log the override context for auditability
### Requirement: Circuit Breaker State Transitions SHALL Avoid Lock-Held Logging
Circuit breaker state transitions MUST avoid executing logger I/O while internal state locks are held.

#### Scenario: State transition occurs
- **WHEN** circuit breaker transitions between CLOSED, OPEN, or HALF_OPEN
- **THEN** lock-protected section MUST complete state mutation before emitting transition log output

#### Scenario: Slow log handler under load
- **WHEN** logger handlers are slow or blocked
- **THEN** circuit breaker lock contention MUST remain bounded and MUST NOT serialize unrelated request paths behind logging latency
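The pattern both scenarios describe is: capture the transition under the lock, release, then log. A minimal sketch (class and attribute names are illustrative, not the project's breaker):

```python
import logging
from threading import Lock

logger = logging.getLogger(__name__)


class Breaker:
    """Mutate state under the lock; emit the transition log after release."""

    def __init__(self) -> None:
        self._lock = Lock()
        self.state = "CLOSED"

    def transition(self, new_state: str) -> str:
        with self._lock:
            # Only the state mutation is serialized.
            old_state, self.state = self.state, new_state
        # Logger I/O happens here, outside the lock, so a slow or blocked
        # handler cannot stall other threads waiting on the breaker.
        logger.info("circuit breaker %s -> %s", old_state, new_state)
        return old_state
```

The trade-off is that log ordering is no longer guaranteed to match mutation ordering under heavy concurrency, which is usually acceptable for transition telemetry.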
### Requirement: Health Endpoints SHALL Use Short Internal Memoization
Health and deep-health computation SHALL use a short-lived internal cache to prevent probe storms from amplifying backend load.

#### Scenario: Frequent monitor scrapes
- **WHEN** health endpoints are called repeatedly within a small window
- **THEN** service SHALL return memoized payload for up to 5 seconds in non-testing environments

#### Scenario: Testing mode
- **WHEN** app is running in testing mode
- **THEN** health endpoint memoization MUST be bypassed to preserve deterministic tests
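A TTL-memoized probe with a testing bypass can be sketched as a small wrapper; the 5-second TTL mirrors the requirement, while the function name and injectable clock are assumptions for testability.

```python
import time
from typing import Any, Callable


def memoize_health(compute: Callable[[], dict],
                   ttl: float = 5.0,
                   testing: bool = False,
                   clock: Callable[[], float] = time.monotonic) -> Callable[[], dict]:
    """Wrap a health computation with a short TTL cache; bypass in testing mode."""
    cached: dict = {}
    stamp = [-float("inf")]  # timestamp of the last real computation

    def probe() -> dict:
        now = clock()
        if testing or now - stamp[0] >= ttl:
            cached.clear()
            cached.update(compute())
            stamp[0] = now
        return dict(cached)  # return a copy so callers cannot mutate the cache

    return probe
```

A probe storm within the TTL window then costs one backend computation instead of one per scrape.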
### Requirement: Logs MUST Redact Connection Secrets
Runtime logs MUST avoid exposing DB connection credentials.

#### Scenario: Connection string appears in log message
- **WHEN** a log message contains DB URL credentials
- **THEN** logger output MUST redact password and sensitive userinfo before emission
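Redaction of the userinfo password in a DB URL can be done with a single substitution before the record is emitted (for example inside a logging filter). The regex below is an illustrative sketch, not the project's exact pattern.

```python
import re

# Matches the password segment in URLs like scheme://user:secret@host/db
_DB_PASSWORD = re.compile(r"(//[^/:@\s]+:)[^@\s]+(@)")


def redact_db_url(message: str) -> str:
    """Replace URL passwords with *** before a log record is emitted."""
    return _DB_PASSWORD.sub(r"\1***\2", message)
```

Messages without a `user:password@` segment pass through unchanged, so the filter is safe to apply to every record.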
38
openspec/specs/security-surface-hardening/spec.md
Normal file
@@ -0,0 +1,38 @@
# security-surface-hardening Specification

## Purpose
TBD - created by archiving change security-stability-hardening-round2. Update Purpose after archive.

## Requirements

### Requirement: LDAP Authentication Endpoint Configuration SHALL Be Strictly Validated
The system MUST validate LDAP authentication endpoint configuration before use, including HTTPS scheme enforcement and host allowlist checks.

#### Scenario: Invalid LDAP URL configuration detected
- **WHEN** `LDAP_API_URL` is missing, non-HTTPS, or points to a host outside the configured allowlist
- **THEN** the service MUST reject LDAP authentication calls and emit actionable diagnostics without sending credentials to that endpoint

#### Scenario: Valid LDAP URL configuration accepted
- **WHEN** `LDAP_API_URL` uses HTTPS and host is allowlisted
- **THEN** LDAP authentication requests MAY proceed with normal timeout and error handling behavior
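The strict validation above amounts to three checks before any credentials leave the process: the URL exists, the scheme is `https`, and the host is allowlisted. A sketch (the function name and return shape are assumptions, not the project's auth service API):

```python
from __future__ import annotations

from urllib.parse import urlsplit


def validate_ldap_api_url(url: str | None, allowed_hosts: set) -> tuple:
    """Return (ok, reason); reject before any outbound auth call is made."""
    if not url:
        return False, "LDAP_API_URL is not configured"
    parts = urlsplit(url)
    if parts.scheme != "https":
        return False, f"scheme must be https, got {parts.scheme!r}"
    if parts.hostname not in allowed_hosts:
        return False, f"host {parts.hostname!r} is not in LDAP_ALLOWED_HOSTS"
    return True, "ok"
```

Because `urlsplit` extracts the hostname before any port or path, a URL like `https://evil.example/auth?x=ldap.corp.example` cannot spoof its way past the allowlist check.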
### Requirement: Security Response Headers SHALL Be Applied Globally
All HTTP responses MUST include baseline security headers suitable for dashboard and API traffic.

#### Scenario: Standard response emitted
- **WHEN** any route returns a response
- **THEN** response MUST include `Content-Security-Policy`, `X-Frame-Options`, `X-Content-Type-Options`, and `Referrer-Policy`

#### Scenario: Production transport hardening
- **WHEN** runtime environment is production
- **THEN** response MUST include `Strict-Transport-Security`
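The header set can be built as a plain function and applied from a global response hook (in Flask, typically an `after_request` handler). The specific CSP, referrer, and HSTS values below are placeholder assumptions, not the project's policy.

```python
def build_security_headers(production: bool) -> dict:
    """Baseline security header set; values here are illustrative defaults."""
    headers = {
        "Content-Security-Policy": "default-src 'self'",
        "X-Frame-Options": "DENY",
        "X-Content-Type-Options": "nosniff",
        "Referrer-Policy": "same-origin",
    }
    if production:
        # HSTS is only meaningful when TLS is terminated for this origin,
        # so it is gated on the production environment.
        headers["Strict-Transport-Security"] = "max-age=31536000; includeSubDomains"
    return headers
```

In a Flask app the hook would iterate this dict and set each header on every outgoing response object.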
### Requirement: Pagination Input Boundaries SHALL Be Enforced
Endpoints accepting pagination parameters MUST enforce lower and upper bounds before query execution.

#### Scenario: Negative or zero pagination inputs
- **WHEN** client sends `page <= 0` or `page_size <= 0`
- **THEN** server MUST normalize values to minimum supported bounds

#### Scenario: Excessive page size requested
- **WHEN** client sends `page_size` above configured maximum
- **THEN** server MUST clamp to maximum supported page size
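Both scenarios reduce to a clamp applied before the query runs. A sketch, where the `max_page_size` default of 200 is an assumption rather than the project's configured maximum:

```python
def clamp_pagination(page: int, page_size: int, max_page_size: int = 200) -> tuple:
    """Normalize pagination inputs before they reach query execution."""
    page = max(page, 1)            # page <= 0 becomes the first page
    page_size = max(page_size, 1)  # page_size <= 0 becomes the minimum of 1
    page_size = min(page_size, max_page_size)  # clamp oversized requests
    return page, page_size
```

Clamping (rather than rejecting) keeps the endpoint tolerant of sloppy clients while still bounding query cost.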
26
openspec/specs/worker-self-healing-governance/spec.md
Normal file
@@ -0,0 +1,26 @@
# worker-self-healing-governance Specification

## Purpose
TBD - created by archiving change p2-ops-self-healing-runbook. Update Purpose after archive.

## Requirements

### Requirement: Automated Worker Recovery SHALL Use Bounded Policy Guards
Automated worker restart behavior MUST enforce cooldown periods and bounded restart attempts within a configurable time window.

#### Scenario: Repeated worker degradation within short window
- **WHEN** degradation events exceed configured restart-attempt budget
- **THEN** automated restarts MUST pause and surface a blocked-recovery signal for operator intervention

### Requirement: Restart-Churn Protection SHALL Prevent Recovery Storms
The runtime MUST classify restart churn and prevent uncontrolled restart loops.

#### Scenario: Churn threshold exceeded
- **WHEN** restart count crosses churn threshold in active window
- **THEN** watchdog MUST enter guarded mode and require explicit manual override before further restart attempts

### Requirement: Recovery Decisions SHALL Be Audit-Ready
Every auto-recovery decision and manual override action MUST be recorded with structured metadata.

#### Scenario: Worker restart decision emitted
- **WHEN** system executes or denies a restart action
- **THEN** structured logs/events MUST include reason, thresholds, actor/source, and resulting state
223
scripts/run_cache_benchmarks.py
Executable file
@@ -0,0 +1,223 @@
#!/usr/bin/env python3
# -*- coding: utf-8 -*-
"""Benchmark cache query baseline vs indexed selection.

This benchmark is used as a repeatable governance harness for P1 cache/query
efficiency work. It focuses on deterministic synthetic workloads so operators
can compare relative latency and memory amplification over time.
"""

from __future__ import annotations

import argparse
import json
import math
import random
import statistics
import time
from pathlib import Path
from typing import Any

import numpy as np
import pandas as pd

ROOT = Path(__file__).resolve().parents[1]
FIXTURE_PATH = ROOT / "tests" / "fixtures" / "cache_benchmark_fixture.json"


def load_fixture(path: Path = FIXTURE_PATH) -> dict[str, Any]:
    payload = json.loads(path.read_text())
    if "rows" not in payload:
        raise ValueError("fixture requires rows")
    return payload


def build_dataset(rows: int, seed: int) -> pd.DataFrame:
    random.seed(seed)
    np.random.seed(seed)

    workcenters = [f"WC-{idx:02d}" for idx in range(1, 31)]
    packages = ["QFN", "DFN", "SOT", "SOP", "BGA", "TSOP"]
    types = ["TYPE-A", "TYPE-B", "TYPE-C", "TYPE-D"]
    statuses = ["RUN", "QUEUE", "HOLD"]
    hold_reasons = ["", "", "", "YieldLimit", "特殊需求管控", "PM Hold"]

    frame = pd.DataFrame(
        {
            "WORKCENTER_GROUP": np.random.choice(workcenters, rows),
            "PACKAGE_LEF": np.random.choice(packages, rows),
            "PJ_TYPE": np.random.choice(types, rows),
            "WIP_STATUS": np.random.choice(statuses, rows, p=[0.45, 0.35, 0.20]),
            "HOLDREASONNAME": np.random.choice(hold_reasons, rows),
            "QTY": np.random.randint(1, 500, rows),
            "WORKORDER": [f"WO-{i:06d}" for i in range(rows)],
            "LOTID": [f"LOT-{i:07d}" for i in range(rows)],
        }
    )
    return frame


def _build_index(df: pd.DataFrame) -> dict[str, dict[str, set[int]]]:
    def by_column(column: str) -> dict[str, set[int]]:
        grouped = df.groupby(column, dropna=True, sort=False).indices
        return {str(k): {int(i) for i in v} for k, v in grouped.items()}

    return {
        "workcenter": by_column("WORKCENTER_GROUP"),
        "package": by_column("PACKAGE_LEF"),
        "type": by_column("PJ_TYPE"),
        "status": by_column("WIP_STATUS"),
    }


def _baseline_query(df: pd.DataFrame, query: dict[str, str]) -> int:
    subset = df
    if query.get("workcenter"):
        subset = subset[subset["WORKCENTER_GROUP"] == query["workcenter"]]
    if query.get("package"):
        subset = subset[subset["PACKAGE_LEF"] == query["package"]]
    if query.get("type"):
        subset = subset[subset["PJ_TYPE"] == query["type"]]
    if query.get("status"):
        subset = subset[subset["WIP_STATUS"] == query["status"]]
    return int(len(subset))


def _indexed_query(_df: pd.DataFrame, indexes: dict[str, dict[str, set[int]]], query: dict[str, str]) -> int:
    selected: set[int] | None = None
    for key, bucket in (
        ("workcenter", "workcenter"),
        ("package", "package"),
        ("type", "type"),
        ("status", "status"),
    ):
        current = indexes[bucket].get(query.get(key, ""))
        if current is None:
            return 0
        if selected is None:
            selected = set(current)
        else:
            selected.intersection_update(current)
        if not selected:
            return 0
    return len(selected or ())


def _build_queries(df: pd.DataFrame, query_count: int, seed: int) -> list[dict[str, str]]:
    random.seed(seed + 17)
    workcenters = sorted(df["WORKCENTER_GROUP"].dropna().astype(str).unique().tolist())
    packages = sorted(df["PACKAGE_LEF"].dropna().astype(str).unique().tolist())
    types = sorted(df["PJ_TYPE"].dropna().astype(str).unique().tolist())
    statuses = sorted(df["WIP_STATUS"].dropna().astype(str).unique().tolist())

    queries: list[dict[str, str]] = []
    for _ in range(query_count):
        queries.append(
            {
                "workcenter": random.choice(workcenters),
                "package": random.choice(packages),
                "type": random.choice(types),
                "status": random.choice(statuses),
            }
        )
    return queries


def _p95(values: list[float]) -> float:
    if not values:
        return 0.0
    sorted_values = sorted(values)
    index = min(max(math.ceil(0.95 * len(sorted_values)) - 1, 0), len(sorted_values) - 1)
    return sorted_values[index]


def run_benchmark(rows: int, query_count: int, seed: int) -> dict[str, Any]:
    df = build_dataset(rows=rows, seed=seed)
    queries = _build_queries(df, query_count=query_count, seed=seed)
    indexes = _build_index(df)

    baseline_latencies: list[float] = []
    indexed_latencies: list[float] = []
    baseline_rows: list[int] = []
    indexed_rows: list[int] = []

    for query in queries:
        start = time.perf_counter()
        baseline_rows.append(_baseline_query(df, query))
        baseline_latencies.append((time.perf_counter() - start) * 1000)

        start = time.perf_counter()
        indexed_rows.append(_indexed_query(df, indexes, query))
        indexed_latencies.append((time.perf_counter() - start) * 1000)

    if baseline_rows != indexed_rows:
        raise AssertionError("benchmark correctness drift: indexed result mismatch")

    frame_bytes = int(df.memory_usage(index=True, deep=True).sum())
    index_entries = sum(len(bucket) for buckets in indexes.values() for bucket in buckets.values())
    index_bytes_estimate = int(index_entries * 16)

    baseline_p95 = _p95(baseline_latencies)
    indexed_p95 = _p95(indexed_latencies)

    return {
        "rows": rows,
        "query_count": query_count,
        "seed": seed,
        "latency_ms": {
            "baseline_avg": round(statistics.fmean(baseline_latencies), 4),
            "baseline_p95": round(baseline_p95, 4),
            "indexed_avg": round(statistics.fmean(indexed_latencies), 4),
            "indexed_p95": round(indexed_p95, 4),
            "p95_ratio_indexed_vs_baseline": round(
                (indexed_p95 / baseline_p95) if baseline_p95 > 0 else 0.0,
                4,
            ),
        },
        "memory_bytes": {
            "frame": frame_bytes,
            "index_estimate": index_bytes_estimate,
            "amplification_ratio": round(
                (frame_bytes + index_bytes_estimate) / max(frame_bytes, 1),
                4,
            ),
        },
    }


def main() -> int:
    fixture = load_fixture()

    parser = argparse.ArgumentParser(description="Run cache baseline vs indexed benchmark")
    parser.add_argument("--rows", type=int, default=int(fixture.get("rows", 30000)))
    parser.add_argument("--queries", type=int, default=int(fixture.get("query_count", 400)))
    parser.add_argument("--seed", type=int, default=int(fixture.get("seed", 42)))
    parser.add_argument("--enforce", action="store_true")
    args = parser.parse_args()

    report = run_benchmark(rows=args.rows, query_count=args.queries, seed=args.seed)
    print(json.dumps(report, ensure_ascii=False, indent=2))

    if not args.enforce:
        return 0

    thresholds = fixture.get("thresholds") or {}
    max_latency_ratio = float(thresholds.get("max_p95_ratio_indexed_vs_baseline", 1.25))
    max_amplification = float(thresholds.get("max_memory_amplification_ratio", 1.8))

    latency_ratio = float(report["latency_ms"]["p95_ratio_indexed_vs_baseline"])
    amplification_ratio = float(report["memory_bytes"]["amplification_ratio"])

    if latency_ratio > max_latency_ratio:
        raise SystemExit(
            f"Latency regression: {latency_ratio:.4f} > max allowed {max_latency_ratio:.4f}"
        )
    if amplification_ratio > max_amplification:
        raise SystemExit(
            f"Memory amplification regression: {amplification_ratio:.4f} > max allowed {max_amplification:.4f}"
        )
    return 0


if __name__ == "__main__":
    raise SystemExit(main())
40
scripts/start_server.sh
Normal file → Executable file
@@ -9,7 +9,7 @@ set -uo pipefail
# Configuration
|
||||
# ============================================================
|
||||
ROOT="$(cd "$(dirname "${BASH_SOURCE[0]}")/.." && pwd)"
|
||||
CONDA_ENV="mes-dashboard"
|
||||
CONDA_ENV="${CONDA_ENV_NAME:-mes-dashboard}"
|
||||
APP_NAME="mes-dashboard"
|
||||
PID_FILE_DEFAULT="${ROOT}/tmp/gunicorn.pid"
PID_FILE="${WATCHDOG_PID_FILE:-${PID_FILE_DEFAULT}}"
@@ -56,7 +56,7 @@ timestamp() {
resolve_runtime_paths() {
    WATCHDOG_RUNTIME_DIR="${WATCHDOG_RUNTIME_DIR:-${ROOT}/tmp}"
    WATCHDOG_RESTART_FLAG="${WATCHDOG_RESTART_FLAG:-${WATCHDOG_RUNTIME_DIR}/mes_dashboard_restart.flag}"
    WATCHDOG_PID_FILE="${WATCHDOG_PID_FILE:-${PID_FILE_DEFAULT}}"
    WATCHDOG_PID_FILE="${WATCHDOG_PID_FILE:-${WATCHDOG_RUNTIME_DIR}/gunicorn.pid}"
    WATCHDOG_STATE_FILE="${WATCHDOG_STATE_FILE:-${WATCHDOG_RUNTIME_DIR}/mes_dashboard_restart_state.json}"
    PID_FILE="${WATCHDOG_PID_FILE}"
    export WATCHDOG_RUNTIME_DIR WATCHDOG_RESTART_FLAG WATCHDOG_PID_FILE WATCHDOG_STATE_FILE
@@ -81,8 +81,14 @@ check_conda() {
        return 1
    fi

    if [ -n "${CONDA_BIN:-}" ] && [ ! -x "${CONDA_BIN}" ]; then
        log_error "CONDA_BIN is set but not executable: ${CONDA_BIN}"
        return 1
    fi

    # Source conda
    source "$(conda info --base)/etc/profile.d/conda.sh"
    local conda_cmd="${CONDA_BIN:-$(command -v conda)}"
    source "$(${conda_cmd} info --base)/etc/profile.d/conda.sh"

    # Check if environment exists
    if ! conda env list | grep -q "^${CONDA_ENV} "; then
@@ -95,6 +101,33 @@ check_conda() {
    return 0
}

validate_runtime_contract() {
    conda activate "$CONDA_ENV"
    export PYTHONPATH="${ROOT}/src:${PYTHONPATH:-}"

    if python - <<'PY'
import os
import sys

from mes_dashboard.core.runtime_contract import build_runtime_contract_diagnostics

strict = os.getenv("RUNTIME_CONTRACT_ENFORCE", "true").strip().lower() in {"1", "true", "yes", "on"}
diag = build_runtime_contract_diagnostics(strict=strict)
if not diag["valid"]:
    for error in diag["errors"]:
        print(f"RUNTIME_CONTRACT_ERROR: {error}")
    raise SystemExit(1)
PY
    then
        log_success "Runtime contract validation passed"
        return 0
    fi

    log_error "Runtime contract validation failed"
    log_info "Fix env vars: WATCHDOG_RUNTIME_DIR / WATCHDOG_RESTART_FLAG / WATCHDOG_PID_FILE / WATCHDOG_STATE_FILE / CONDA_BIN"
    return 1
}
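The heredoc above delegates the actual check to `build_runtime_contract_diagnostics` in `mes_dashboard.core.runtime_contract`. As an illustration only, a minimal sketch of what such a diagnostics builder might look like — the function name `build_diagnostics_sketch` and the specific field checks are assumptions, not the project's implementation:

```python
def build_diagnostics_sketch(env: dict[str, str], strict: bool = True) -> dict:
    """Hypothetical stand-in for build_runtime_contract_diagnostics():
    checks that the watchdog paths are set and agree with the runtime dir."""
    errors = []
    runtime_dir = env.get("WATCHDOG_RUNTIME_DIR", "")
    for name in ("WATCHDOG_RESTART_FLAG", "WATCHDOG_PID_FILE", "WATCHDOG_STATE_FILE"):
        value = env.get(name, "")
        if not value:
            errors.append(f"{name} is not set")
        elif strict and runtime_dir and not value.startswith(runtime_dir):
            # Drift: a path configured outside the shared runtime directory.
            errors.append(f"{name} drifted outside WATCHDOG_RUNTIME_DIR: {value}")
    return {"valid": not errors, "strict": strict, "errors": errors}
```

The shell wrapper only consumes `valid` and `errors`, so any builder returning that shape would fail the startup in the same way.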

check_dependencies() {
    conda activate "$CONDA_ENV"

@@ -329,6 +362,7 @@ run_all_checks() {
    check_env_file
    load_env
    resolve_runtime_paths
    validate_runtime_contract || return 1
    check_port || return 1
    check_database
    check_redis

177
scripts/worker_watchdog.py
Normal file → Executable file
@@ -31,6 +31,23 @@ import time
from datetime import datetime
from pathlib import Path

PROJECT_ROOT = Path(__file__).resolve().parents[1]
SRC_ROOT = PROJECT_ROOT / "src"
if str(SRC_ROOT) not in sys.path:
    sys.path.insert(0, str(SRC_ROOT))

from mes_dashboard.core.runtime_contract import (  # noqa: E402
    build_runtime_contract_diagnostics,
    load_runtime_contract,
)
from mes_dashboard.core.worker_recovery_policy import (  # noqa: E402
    decide_restart_request,
    evaluate_worker_recovery_state,
    extract_last_requested_at,
    extract_restart_history,
    get_worker_recovery_policy_config,
)

# Configure logging
logging.basicConfig(
    level=logging.INFO,
@@ -45,7 +62,10 @@ logger = logging.getLogger('mes_dashboard.watchdog')
# Configuration
# ============================================================

CHECK_INTERVAL = int(os.getenv('WATCHDOG_CHECK_INTERVAL', '5'))
_RUNTIME_CONTRACT = load_runtime_contract(project_root=PROJECT_ROOT)
CHECK_INTERVAL = int(
    os.getenv('WATCHDOG_CHECK_INTERVAL', str(_RUNTIME_CONTRACT['watchdog_check_interval']))
)


def _env_int(name: str, default: int) -> int:
@@ -55,22 +75,11 @@ def _env_int(name: str, default: int) -> int:
    return default


PROJECT_ROOT = Path(__file__).resolve().parents[1]
DEFAULT_RUNTIME_DIR = Path(
    os.getenv('WATCHDOG_RUNTIME_DIR', str(PROJECT_ROOT / 'tmp'))
)
RESTART_FLAG_PATH = os.getenv(
    'WATCHDOG_RESTART_FLAG',
    str(DEFAULT_RUNTIME_DIR / 'mes_dashboard_restart.flag')
)
GUNICORN_PID_FILE = os.getenv(
    'WATCHDOG_PID_FILE',
    str(DEFAULT_RUNTIME_DIR / 'gunicorn.pid')
)
RESTART_STATE_FILE = os.getenv(
    'WATCHDOG_STATE_FILE',
    str(DEFAULT_RUNTIME_DIR / 'mes_dashboard_restart_state.json')
)
DEFAULT_RUNTIME_DIR = Path(_RUNTIME_CONTRACT['watchdog_runtime_dir'])
RESTART_FLAG_PATH = _RUNTIME_CONTRACT['watchdog_restart_flag']
GUNICORN_PID_FILE = _RUNTIME_CONTRACT['watchdog_pid_file']
RESTART_STATE_FILE = _RUNTIME_CONTRACT['watchdog_state_file']
RUNTIME_CONTRACT_VERSION = _RUNTIME_CONTRACT['version']
RESTART_HISTORY_MAX = _env_int('WATCHDOG_RESTART_HISTORY_MAX', 50)


@@ -78,6 +87,32 @@ RESTART_HISTORY_MAX = _env_int('WATCHDOG_RESTART_HISTORY_MAX', 50)
# Watchdog Implementation
# ============================================================


def validate_runtime_contract_or_raise() -> None:
    """Fail fast if runtime contract is inconsistent."""
    strict = os.getenv("RUNTIME_CONTRACT_ENFORCE", "true").strip().lower() in {
        "1",
        "true",
        "yes",
        "on",
    }
    diagnostics = build_runtime_contract_diagnostics(strict=strict)
    if diagnostics["valid"]:
        return

    details = "; ".join(diagnostics["errors"])
    raise RuntimeError(f"Runtime contract validation failed: {details}")


def log_restart_audit(event: str, payload: dict) -> None:
    entry = {
        "event": event,
        "timestamp": datetime.utcnow().isoformat(),
        "runtime_contract_version": RUNTIME_CONTRACT_VERSION,
        **payload,
    }
    logger.info("worker_watchdog_audit %s", json.dumps(entry, ensure_ascii=False))

def get_gunicorn_pid() -> int | None:
    """Get Gunicorn master PID from PID file.

@@ -155,7 +190,12 @@ def save_restart_state(
    requested_at: str | None = None,
    requested_ip: str | None = None,
    completed_at: str | None = None,
    success: bool = True
    success: bool = True,
    source: str = "manual",
    decision: str = "allowed",
    decision_reason: str | None = None,
    manual_override: bool = False,
    policy_state: dict | None = None,
) -> None:
    """Save restart state for status queries.

@@ -173,7 +213,12 @@ def save_restart_state(
        "requested_at": requested_at,
        "requested_ip": requested_ip,
        "completed_at": completed_at,
        "success": success
        "success": success,
        "source": source,
        "decision": decision,
        "decision_reason": decision_reason,
        "manual_override": manual_override,
        "policy_state": policy_state or {},
    }
    current_state = load_restart_state()
    history = current_state.get("history", [])
@@ -229,6 +274,47 @@ def process_restart_request() -> bool:
        return False

    logger.info(f"Restart flag detected: {flag_data}")
    source = str(flag_data.get("source") or "manual").strip().lower()
    manual_override = bool(flag_data.get("manual_override"))
    override_ack = bool(flag_data.get("override_acknowledged"))
    restart_state = load_restart_state()
    restart_history = extract_restart_history(restart_state)
    policy_state = evaluate_worker_recovery_state(
        restart_history,
        last_requested_at=extract_last_requested_at(restart_state),
    )
    decision = decide_restart_request(
        policy_state,
        source=source,
        manual_override=manual_override,
        override_acknowledged=override_ack,
    )

    if not decision["allowed"]:
        remove_restart_flag()
        save_restart_state(
            requested_by=flag_data.get("user"),
            requested_at=flag_data.get("timestamp"),
            requested_ip=flag_data.get("ip"),
            completed_at=datetime.now().isoformat(),
            success=False,
            source=source,
            decision=decision["decision"],
            decision_reason=decision["reason"],
            manual_override=manual_override,
            policy_state=policy_state,
        )
        log_restart_audit(
            "restart_blocked",
            {
                "source": source,
                "actor": flag_data.get("user"),
                "ip": flag_data.get("ip"),
                "decision": decision,
                "policy_state": policy_state,
            },
        )
        return True

    # Get Gunicorn master PID
    pid = get_gunicorn_pid()
@@ -242,7 +328,22 @@ def process_restart_request() -> bool:
        requested_at=flag_data.get("timestamp"),
        requested_ip=flag_data.get("ip"),
        completed_at=datetime.now().isoformat(),
        success=False
        success=False,
        source=source,
        decision="failed",
        decision_reason="gunicorn_pid_unavailable",
        manual_override=manual_override,
        policy_state=policy_state,
    )
    log_restart_audit(
        "restart_failed",
        {
            "source": source,
            "actor": flag_data.get("user"),
            "ip": flag_data.get("ip"),
            "decision_reason": "gunicorn_pid_unavailable",
            "policy_state": policy_state,
        },
    )
    return True

@@ -258,7 +359,12 @@ def process_restart_request() -> bool:
        requested_at=flag_data.get("timestamp"),
        requested_ip=flag_data.get("ip"),
        completed_at=datetime.now().isoformat(),
        success=success
        success=success,
        source=source,
        decision="executed" if success else "failed",
        decision_reason="signal_sighup" if success else "signal_failed",
        manual_override=manual_override,
        policy_state=policy_state,
    )

    if success:
@@ -267,17 +373,44 @@ def process_restart_request() -> bool:
            f"Requested by: {flag_data.get('user', 'unknown')}, "
            f"IP: {flag_data.get('ip', 'unknown')}"
        )
        log_restart_audit(
            "restart_executed",
            {
                "source": source,
                "actor": flag_data.get("user"),
                "ip": flag_data.get("ip"),
                "manual_override": manual_override,
                "policy_state": policy_state,
            },
        )
    else:
        log_restart_audit(
            "restart_failed",
            {
                "source": source,
                "actor": flag_data.get("user"),
                "ip": flag_data.get("ip"),
                "decision_reason": "signal_failed",
                "policy_state": policy_state,
            },
        )

    return True
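The gate applied by `decide_restart_request` above can be sketched as follows. This is a hypothetical stand-in, not the project's `worker_recovery_policy` module; the function name, thresholds, and return shape are assumptions chosen to mirror the cooldown / retry-budget / manual-override behavior the diff describes:

```python
def decide_restart_sketch(history: list[float], *, now: float,
                          cooldown_s: float = 60.0, retry_budget: int = 3,
                          window_s: float = 600.0, manual_override: bool = False,
                          override_acknowledged: bool = False) -> dict:
    """Hypothetical restart gate: enforce a cooldown and a retry budget
    within a sliding window, with an acknowledged manual-override escape hatch."""
    if manual_override and override_acknowledged:
        return {"allowed": True, "decision": "override", "reason": "manual_override_ack"}
    # Only restarts inside the sliding window count against the budget.
    recent = [t for t in history if now - t <= window_s]
    if recent and now - max(recent) < cooldown_s:
        return {"allowed": False, "decision": "blocked", "reason": "cooldown_active"}
    if len(recent) >= retry_budget:
        return {"allowed": False, "decision": "blocked", "reason": "retry_budget_exhausted"}
    return {"allowed": True, "decision": "allowed", "reason": "within_policy"}
```

The watchdog above consumes exactly this kind of `allowed` / `decision` / `reason` triple when writing the audit log.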


def run_watchdog() -> None:
    """Main watchdog loop."""
    validate_runtime_contract_or_raise()
    policy = get_worker_recovery_policy_config()
    logger.info(
        f"Worker watchdog started - "
        f"Check interval: {CHECK_INTERVAL}s, "
        f"Flag path: {RESTART_FLAG_PATH}, "
        f"PID file: {GUNICORN_PID_FILE}"
        f"PID file: {GUNICORN_PID_FILE}, "
        f"Policy(cooldown={policy['cooldown_seconds']}s, "
        f"retry_budget={policy['retry_budget']}, "
        f"window={policy['window_seconds']}s, "
        f"guarded={policy['guarded_mode_enabled']})"
    )

    while True:

@@ -3,24 +3,48 @@

from __future__ import annotations

import atexit
import logging
import os
import sys
import threading

from flask import Flask, jsonify, redirect, render_template, request, session, url_for

from mes_dashboard.config.tables import TABLES_CONFIG
from mes_dashboard.config.settings import get_config
from mes_dashboard.core.cache import create_default_cache_backend
from mes_dashboard.core.database import get_table_data, get_table_columns, get_engine, init_db, start_keepalive
from mes_dashboard.core.database import (
    get_table_data,
    get_table_columns,
    get_engine,
    init_db,
    start_keepalive,
    dispose_engine,
    install_log_redaction_filter,
)
from mes_dashboard.core.permissions import is_admin_logged_in, _is_ajax_request
from mes_dashboard.core.csrf import (
    get_csrf_token,
    should_enforce_csrf,
    validate_csrf,
)
from mes_dashboard.routes import register_routes
from mes_dashboard.routes.auth_routes import auth_bp
from mes_dashboard.routes.admin_routes import admin_bp
from mes_dashboard.routes.health_routes import health_bp
from mes_dashboard.services.page_registry import get_page_status, is_api_public
from mes_dashboard.core.cache_updater import start_cache_updater, stop_cache_updater
from mes_dashboard.services.realtime_equipment_cache import init_realtime_equipment_cache
from mes_dashboard.services.realtime_equipment_cache import (
    init_realtime_equipment_cache,
    stop_equipment_status_sync_worker,
)
from mes_dashboard.core.redis_client import close_redis
from mes_dashboard.core.runtime_contract import build_runtime_contract_diagnostics


_SHUTDOWN_LOCK = threading.Lock()
_ATEXIT_REGISTERED = False


def _configure_logging(app: Flask) -> None:
@@ -63,6 +87,121 @@ def _configure_logging(app: Flask) -> None:

    # Prevent propagation to root logger (avoid duplicate logs)
    logger.propagate = False
    install_log_redaction_filter(logger)


def _is_production_env(app: Flask) -> bool:
    env_value = str(app.config.get("ENV") or os.getenv("FLASK_ENV") or "production").lower()
    return env_value in {"prod", "production"}


def _build_security_headers(production: bool) -> dict[str, str]:
    headers = {
        "Content-Security-Policy": (
            "default-src 'self'; "
            "script-src 'self' 'unsafe-inline' 'unsafe-eval'; "
            "style-src 'self' 'unsafe-inline'; "
            "img-src 'self' data: blob:; "
            "font-src 'self' data:; "
            "connect-src 'self'; "
            "frame-ancestors 'none'; "
            "base-uri 'self'; "
            "form-action 'self'"
        ),
        "X-Frame-Options": "DENY",
        "X-Content-Type-Options": "nosniff",
        "Referrer-Policy": "strict-origin-when-cross-origin",
    }
    if production:
        headers["Strict-Transport-Security"] = "max-age=31536000; includeSubDomains"
    return headers
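These defaults are later applied with `response.headers.setdefault(...)` in an `after_request` hook, so a header that a specific route already set is never overwritten. A minimal model of that semantics, using plain dicts rather than Flask response objects:

```python
def apply_default_headers(response_headers: dict[str, str],
                          defaults: dict[str, str]) -> dict[str, str]:
    """Sketch of the after_request semantics: a header already set by a
    route handler wins; global security defaults only fill the gaps."""
    for name, value in defaults.items():
        response_headers.setdefault(name, value)
    return response_headers
```

This is why a route can, for example, relax `X-Frame-Options` for an embeddable page without the global hook clobbering it.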


def _resolve_secret_key(app: Flask) -> str:
    env_name = str(app.config.get("ENV") or os.getenv("FLASK_ENV") or "development").lower()
    configured = os.environ.get("SECRET_KEY") or app.config.get("SECRET_KEY")
    insecure_defaults = {"", "dev-secret-key-change-in-prod"}

    if configured and configured not in insecure_defaults:
        return configured

    if env_name in {"production", "prod"}:
        raise RuntimeError(
            "SECRET_KEY is required in production and cannot use insecure defaults."
        )

    # Development and testing get explicit environment-safe defaults.
    if env_name in {"testing", "test"}:
        return "test-secret-key"
    return "dev-local-only-secret-key"


def _shutdown_runtime_resources() -> None:
    """Stop background workers and shared clients during app/worker shutdown."""
    logger = logging.getLogger("mes_dashboard")

    try:
        stop_cache_updater()
    except Exception as exc:
        logger.warning("Error stopping cache updater: %s", exc)

    try:
        stop_equipment_status_sync_worker()
    except Exception as exc:
        logger.warning("Error stopping equipment sync worker: %s", exc)

    try:
        close_redis()
    except Exception as exc:
        logger.warning("Error closing Redis client: %s", exc)

    try:
        dispose_engine()
    except Exception as exc:
        logger.warning("Error disposing DB engines: %s", exc)


def _register_shutdown_hooks(app: Flask) -> None:
    global _ATEXIT_REGISTERED

    app.extensions["runtime_shutdown"] = _shutdown_runtime_resources
    if app.extensions.get("runtime_shutdown_registered"):
        return

    app.extensions["runtime_shutdown_registered"] = True
    if app.testing or bool(app.config.get("TESTING")) or os.getenv("PYTEST_CURRENT_TEST"):
        return

    with _SHUTDOWN_LOCK:
        if not _ATEXIT_REGISTERED:
            atexit.register(_shutdown_runtime_resources)
            _ATEXIT_REGISTERED = True


def _is_runtime_contract_enforced(app: Flask) -> bool:
    raw = os.getenv("RUNTIME_CONTRACT_ENFORCE")
    if raw is not None:
        return raw.strip().lower() in {"1", "true", "yes", "on"}
    return _is_production_env(app)


def _validate_runtime_contract(app: Flask) -> None:
    strict = _is_runtime_contract_enforced(app)
    diagnostics = build_runtime_contract_diagnostics(strict=strict)
    app.extensions["runtime_contract"] = diagnostics["contract"]
    app.extensions["runtime_contract_validation"] = {
        "valid": diagnostics["valid"],
        "strict": diagnostics["strict"],
        "errors": diagnostics["errors"],
    }

    if diagnostics["valid"]:
        return

    message = "Runtime contract validation failed: " + "; ".join(diagnostics["errors"])
    if strict:
        raise RuntimeError(message)
    logging.getLogger("mes_dashboard").warning(message)


def create_app(config_name: str | None = None) -> Flask:
@@ -72,19 +211,22 @@ def create_app(config_name: str | None = None) -> Flask:
    config_class = get_config(config_name)
    app.config.from_object(config_class)

    # Session configuration
    app.secret_key = os.environ.get("SECRET_KEY", "dev-secret-key-change-in-prod")
    # Session configuration with environment-aware secret validation.
    app.secret_key = _resolve_secret_key(app)
    app.config["SECRET_KEY"] = app.secret_key

    # Session cookie security settings
    # SECURE: Only send cookie over HTTPS (disable for local development)
    app.config['SESSION_COOKIE_SECURE'] = os.environ.get("FLASK_ENV") == "production"
    # SECURE: Only send cookie over HTTPS in production.
    app.config['SESSION_COOKIE_SECURE'] = _is_production_env(app)
    # HTTPONLY: Prevent JavaScript access to session cookie (XSS protection)
    app.config['SESSION_COOKIE_HTTPONLY'] = True
    # SAMESITE: Prevent CSRF by restricting cross-site cookie sending
    app.config['SESSION_COOKIE_SAMESITE'] = 'Lax'
    # SAMESITE: strict in production, relaxed for local development usability.
    app.config['SESSION_COOKIE_SAMESITE'] = 'Strict' if _is_production_env(app) else 'Lax'

    # Configure logging first
    _configure_logging(app)
    _validate_runtime_contract(app)
    security_headers = _build_security_headers(_is_production_env(app))

    # Route-level cache backend (L1 memory + optional L2 Redis)
    app.extensions["cache"] = create_default_cache_backend()
@@ -96,6 +238,7 @@ def create_app(config_name: str | None = None) -> Flask:
    start_keepalive()  # Keep database connections alive
    start_cache_updater()  # Start Redis cache updater
    init_realtime_equipment_cache(app)  # Start realtime equipment status cache
    _register_shutdown_hooks(app)

    # Register API routes
    register_routes(app)
@@ -150,6 +293,34 @@ def create_app(config_name: str | None = None) -> Flask:

        return None

    @app.before_request
    def enforce_csrf():
        if not should_enforce_csrf(
            request,
            enabled=bool(app.config.get("CSRF_ENABLED", True)),
        ):
            return None

        if validate_csrf(request):
            return None

        if request.path == "/admin/login":
            return render_template("login.html", error="CSRF verification failed, please resubmit"), 403

        from mes_dashboard.core.response import error_response, FORBIDDEN

        return error_response(
            FORBIDDEN,
            "CSRF verification failed",
            status_code=403,
        )

    @app.after_request
    def apply_security_headers(response):
        for header, value in security_headers.items():
            response.headers.setdefault(header, value)
        return response

    # ========================================================
    # Template Context Processor
    # ========================================================
@@ -185,6 +356,7 @@ def create_app(config_name: str | None = None) -> Flask:
        "admin_user": session.get("admin"),
        "can_view_page": can_view_page,
        "frontend_asset": frontend_asset,
        "csrf_token": get_csrf_token,
    }

    # ========================================================

@@ -20,6 +20,13 @@ def _float_env(name: str, default: float) -> float:
    return default


def _bool_env(name: str, default: bool) -> bool:
    value = os.getenv(name)
    if value is None:
        return default
    return value.strip().lower() in {"1", "true", "yes", "on"}


class Config:
    """Base configuration."""

@@ -40,7 +47,8 @@ class Config:
    # Auth configuration - MUST be set in .env file
    LDAP_API_URL = os.getenv("LDAP_API_URL", "")
    ADMIN_EMAILS = os.getenv("ADMIN_EMAILS", "")
    SECRET_KEY = os.getenv("SECRET_KEY", "dev-secret-key-change-in-prod")
    SECRET_KEY = os.getenv("SECRET_KEY")
    CSRF_ENABLED = _bool_env("CSRF_ENABLED", True)

    # Session configuration
    PERMANENT_SESSION_LIFETIME = _int_env("SESSION_LIFETIME", 28800)  # 8 hours
@@ -103,6 +111,7 @@ class TestingConfig(Config):
    DB_CONNECT_RETRY_COUNT = 0
    DB_CONNECT_RETRY_DELAY = 0.0
    DB_CALL_TIMEOUT_MS = 5000
    CSRF_ENABLED = False


def get_config(env: str | None = None) -> Type[Config]:

@@ -10,8 +10,10 @@ from __future__ import annotations
import io
import json
import logging
import os
import threading
import time
from collections import OrderedDict
from typing import Any, Optional, Protocol, Tuple

import pandas as pd
@@ -39,26 +41,49 @@ class ProcessLevelCache:
    Uses a lock to ensure only one thread parses at a time.
    """

    def __init__(self, ttl_seconds: int = 30):
        self._cache: dict[str, Tuple[pd.DataFrame, float]] = {}
    def __init__(self, ttl_seconds: int = 30, max_size: int = 32):
        self._cache: OrderedDict[str, Tuple[pd.DataFrame, float]] = OrderedDict()
        self._lock = threading.Lock()
        self._ttl = ttl_seconds
        self._ttl = max(int(ttl_seconds), 1)
        self._max_size = max(int(max_size), 1)

    @property
    def max_size(self) -> int:
        return self._max_size

    def _evict_expired_locked(self, now: float) -> None:
        stale_keys = [
            key for key, (_, timestamp) in self._cache.items()
            if now - timestamp > self._ttl
        ]
        for key in stale_keys:
            self._cache.pop(key, None)

    def get(self, key: str) -> Optional[pd.DataFrame]:
        """Get cached DataFrame if not expired."""
        with self._lock:
            if key not in self._cache:
            payload = self._cache.get(key)
            if payload is None:
                return None
            df, timestamp = self._cache[key]
            if time.time() - timestamp > self._ttl:
                del self._cache[key]
            df, timestamp = payload
            now = time.time()
            if now - timestamp > self._ttl:
                self._cache.pop(key, None)
                return None
            self._cache.move_to_end(key, last=True)
            return df

    def set(self, key: str, df: pd.DataFrame) -> None:
        """Cache a DataFrame with current timestamp."""
        with self._lock:
            self._cache[key] = (df, time.time())
            now = time.time()
            self._evict_expired_locked(now)
            if key in self._cache:
                self._cache.pop(key, None)
            elif len(self._cache) >= self._max_size:
                self._cache.popitem(last=False)
            self._cache[key] = (df, now)
            self._cache.move_to_end(key, last=True)

    def invalidate(self, key: str) -> None:
        """Remove a key from cache."""
@@ -71,8 +96,26 @@ class ProcessLevelCache:
        self._cache.clear()


def _resolve_cache_max_size(env_name: str, default: int) -> int:
    value = os.getenv(env_name)
    if value is None:
        return max(int(default), 1)
    try:
        return max(int(value), 1)
    except (TypeError, ValueError):
        return max(int(default), 1)


# Global process-level cache for WIP DataFrame (30s TTL)
_wip_df_cache = ProcessLevelCache(ttl_seconds=30)
PROCESS_CACHE_MAX_SIZE = _resolve_cache_max_size("PROCESS_CACHE_MAX_SIZE", 32)
WIP_PROCESS_CACHE_MAX_SIZE = _resolve_cache_max_size(
    "WIP_PROCESS_CACHE_MAX_SIZE",
    PROCESS_CACHE_MAX_SIZE,
)
_wip_df_cache = ProcessLevelCache(
    ttl_seconds=30,
    max_size=WIP_PROCESS_CACHE_MAX_SIZE,
)
_wip_parse_lock = threading.Lock()

# ============================================================
@@ -328,33 +371,30 @@ def get_cached_wip_data() -> Optional[pd.DataFrame]:
    if client is None:
        return None

    # Use lock to prevent multiple threads from parsing simultaneously
    try:
        start_time = time.time()
        data_json = client.get(get_key("data"))
        if data_json is None:
            logger.debug("Cache miss: no data in Redis")
            return None

        # Parse outside lock to reduce contention on hot paths.
        parsed_df = pd.read_json(io.StringIO(data_json), orient='records')
        parse_time = time.time() - start_time
    except Exception as e:
        logger.warning(f"Failed to read cache: {e}")
        return None

    # Keep lock scope tight: consistency check + cache write only.
    with _wip_parse_lock:
        # Double-check after acquiring lock (another thread may have parsed)
        cached_df = _wip_df_cache.get(cache_key)
        if cached_df is not None:
            logger.debug(f"Process cache hit (after lock): {len(cached_df)} rows")
            logger.debug(f"Process cache hit (after parse): {len(cached_df)} rows")
            return cached_df
        _wip_df_cache.set(cache_key, parsed_df)

    try:
        start_time = time.time()
        data_json = client.get(get_key("data"))
        if data_json is None:
            logger.debug("Cache miss: no data in Redis")
            return None

        # Parse JSON to DataFrame
        df = pd.read_json(io.StringIO(data_json), orient='records')
        parse_time = time.time() - start_time

        # Store in process-level cache
        _wip_df_cache.set(cache_key, df)

        logger.debug(f"Cache hit: loaded {len(df)} rows from Redis (parsed in {parse_time:.2f}s)")
        return df
    except Exception as e:
        logger.warning(f"Failed to read cache: {e}")
        return None
    logger.debug(f"Cache hit: loaded {len(parsed_df)} rows from Redis (parsed in {parse_time:.2f}s)")
    return parsed_df


def get_cached_sys_date() -> Optional[str]:

@@ -221,7 +221,7 @@ class CacheUpdater:
        return None

    def _update_redis_cache(self, df: pd.DataFrame, sys_date: str) -> bool:
        """Update Redis cache with new data using pipeline for atomicity.
        """Update Redis cache with staged publish for coherent snapshot visibility.

        Args:
            df: DataFrame with full table data.
@@ -234,18 +234,24 @@ class CacheUpdater:
        if client is None:
            return False

        staging_key: str | None = None
        try:
            # Convert DataFrame to JSON
            # Handle datetime columns
            for col in df.select_dtypes(include=['datetime64']).columns:
                df[col] = df[col].astype(str)
            df_copy = df.copy()
            for col in df_copy.select_dtypes(include=['datetime64']).columns:
                df_copy[col] = df_copy[col].astype(str)

            data_json = df.to_json(orient='records', force_ascii=False)
            data_json = df_copy.to_json(orient='records', force_ascii=False)

            # Atomic update using pipeline
            # Stage payload first, then atomically publish live key + metadata.
            now = datetime.now().isoformat()
            unique_suffix = f"{int(time.time() * 1000)}:{threading.get_ident()}"
            staging_key = get_key(f"data:staging:{unique_suffix}")

            pipe = client.pipeline()
            pipe.set(get_key("data"), data_json)
            pipe.set(staging_key, data_json)
            pipe.rename(staging_key, get_key("data"))
            pipe.set(get_key("meta:sys_date"), sys_date)
            pipe.set(get_key("meta:updated_at"), now)
            pipe.execute()
@@ -253,6 +259,11 @@ class CacheUpdater:
            return True
        except Exception as e:
            logger.error(f"Failed to update Redis cache: {e}")
            if staging_key:
                try:
                    client.delete(staging_key)
                except Exception:
                    pass
            return False
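The stage-then-`RENAME` publish above can be exercised without a live Redis. The `FakeRedis` below is a test double written for this sketch (only the `pipeline`/`set`/`rename`/`execute` subset the updater touches), not the project's Redis client:

```python
class FakeRedis:
    """Minimal in-memory stand-in for the Redis commands the updater uses."""
    def __init__(self):
        self.store: dict[str, str] = {}
        self._ops: list[tuple] = []

    def pipeline(self):
        self._ops = []
        return self

    def set(self, key: str, value: str):
        self._ops.append(("set", key, value))
        return self

    def rename(self, src: str, dst: str):
        self._ops.append(("rename", src, dst))
        return self

    def execute(self):
        # Apply queued commands in order, like a Redis pipeline would.
        for op in self._ops:
            if op[0] == "set":
                self.store[op[1]] = op[2]
            else:
                self.store[op[2]] = self.store.pop(op[1])
        self._ops = []


def staged_publish(client: FakeRedis, live_key: str, payload: str, suffix: str) -> None:
    """Write to a unique staging key, then swap it onto the live key."""
    staging_key = f"{live_key}:staging:{suffix}"
    pipe = client.pipeline()
    pipe.set(staging_key, payload)
    pipe.rename(staging_key, live_key)
    pipe.execute()
```

In real Redis, `RENAME` replaces the destination key in one step, so readers of the live key never observe a half-written payload; the `delete(staging_key)` cleanup in the `except` branch above keeps failed publishes from leaking staging keys.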

    def _check_resource_update(self, force: bool = False) -> bool:

@@ -130,12 +130,16 @@ class CircuitBreaker:
    @property
    def state(self) -> CircuitState:
        """Get current circuit state, handling state transitions."""
        transition_log: tuple[int, str] | None = None
        with self._lock:
            if self._state == CircuitState.OPEN:
                # Check if we should transition to HALF_OPEN
                if self._open_time and time.time() - self._open_time >= self.recovery_timeout:
                    self._transition_to(CircuitState.HALF_OPEN)
            return self._state
                    transition_log = self._transition_to_locked(CircuitState.HALF_OPEN)
            current_state = self._state
        if transition_log:
            self._emit_transition_log(*transition_log)
        return current_state

    def allow_request(self) -> bool:
        """Check if a request should be allowed.
@@ -161,45 +165,57 @@ class CircuitBreaker:
        if not CIRCUIT_BREAKER_ENABLED:
            return

        transition_log: tuple[int, str] | None = None
        with self._lock:
            self._results.append(True)

            if self._state == CircuitState.HALF_OPEN:
                # Success in half-open means we can close
                self._transition_to(CircuitState.CLOSED)
                transition_log = self._transition_to_locked(CircuitState.CLOSED)

        if transition_log:
            self._emit_transition_log(*transition_log)

    def record_failure(self) -> None:
        """Record a failed operation."""
        if not CIRCUIT_BREAKER_ENABLED:
            return

        transition_log: tuple[int, str] | None = None
        with self._lock:
            self._results.append(False)
            self._last_failure_time = time.time()

            if self._state == CircuitState.HALF_OPEN:
                # Failure in half-open means back to open
                self._transition_to(CircuitState.OPEN)
                transition_log = self._transition_to_locked(CircuitState.OPEN)
            elif self._state == CircuitState.CLOSED:
                # Check if we should open
                self._check_and_open()
                transition_log = self._check_and_open_locked()

    def _check_and_open(self) -> None:
        if transition_log:
            self._emit_transition_log(*transition_log)

    def _check_and_open_locked(self) -> tuple[int, str] | None:
        """Check failure rate and open circuit if needed.

        Must be called with lock held.
        """
        if len(self._results) < self.failure_threshold:
            return
            return None

        failure_count = sum(1 for r in self._results if not r)
        failure_rate = failure_count / len(self._results)

        if (failure_count >= self.failure_threshold and
                failure_rate >= self.failure_rate_threshold):
            self._transition_to(CircuitState.OPEN)
            return self._transition_to_locked(CircuitState.OPEN)
        return None

    def _transition_to(self, new_state: CircuitState) -> None:
    def _emit_transition_log(self, level: int, message: str) -> None:
        logger.log(level, message)

    def _transition_to_locked(self, new_state: CircuitState) -> tuple[int, str]:
        """Transition to a new state with logging.

        Must be called with lock held.
@@ -209,23 +225,25 @@ class CircuitBreaker:

        if new_state == CircuitState.OPEN:
            self._open_time = time.time()
            logger.warning(
            return (
                logging.WARNING,
                f"Circuit breaker '{self.name}' OPENED: "
                f"state {old_state.value} -> {new_state.value}, "
                f"failures: {sum(1 for r in self._results if not r)}/{len(self._results)}"
            )
        elif new_state == CircuitState.HALF_OPEN:
            logger.info(
            return (
                logging.INFO,
                f"Circuit breaker '{self.name}' entering HALF_OPEN: "
                f"testing service recovery..."
            )
        elif new_state == CircuitState.CLOSED:
            self._open_time = None
            self._results.clear()
            logger.info(
                f"Circuit breaker '{self.name}' CLOSED: "
                f"service recovered"
            )
        self._open_time = None
        self._results.clear()
        return (
            logging.INFO,
            f"Circuit breaker '{self.name}' CLOSED: "
            f"service recovered"
        )

    def get_status(self) -> CircuitBreakerStatus:
        """Get current status information."""
@@ -266,7 +284,7 @@ class CircuitBreaker:
        self._results.clear()
        self._last_failure_time = None
        self._open_time = None
        logger.info(f"Circuit breaker '{self.name}' reset")
        logger.info(f"Circuit breaker '{self.name}' reset")


# ============================================================
# ============================================================
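The transition refactor above changes `_transition_to_locked` to return a `(level, message)` tuple instead of logging under the lock, and `_emit_transition_log` emits it afterwards. A minimal standalone sketch of that pattern (the class and names here are illustrative, not the project's actual `CircuitBreaker`):

```python
import logging
import threading

logger = logging.getLogger("sketch.breaker")

class TinyBreaker:
    """Illustrative only: compute the log record under the lock, emit it outside."""

    def __init__(self, threshold: int = 3) -> None:
        self._lock = threading.Lock()
        self._failures = 0
        self.threshold = threshold
        self.state = "CLOSED"

    def record_failure(self) -> None:
        with self._lock:
            self._failures += 1
            pending = self._check_and_open_locked()  # (level, message) or None
        if pending is not None:
            # Lock already released: a slow logging handler can no longer
            # block other threads contending for the breaker lock.
            logger.log(*pending)

    def _check_and_open_locked(self):
        if self._failures >= self.threshold and self.state != "OPEN":
            self.state = "OPEN"
            return (logging.WARNING, f"breaker OPENED after {self._failures} failures")
        return None

b = TinyBreaker(threshold=2)
b.record_failure()
b.record_failure()
print(b.state)  # the breaker is OPEN after hitting the threshold
```

This is the same lock-contention fix the commit message describes ("circuit breaker transition logging 移至鎖外" — moved outside the lock): only cheap tuple construction happens in the critical section.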
85	src/mes_dashboard/core/csrf.py	Normal file
@@ -0,0 +1,85 @@
# -*- coding: utf-8 -*-
"""CSRF token utilities for admin form and API mutation protection."""

from __future__ import annotations

import hmac
import secrets
from typing import Optional

from flask import Request, request, session

CSRF_SESSION_KEY = "_csrf_token"
CSRF_HEADER_NAME = "X-CSRF-Token"
CSRF_FORM_FIELD = "csrf_token"
_MUTATING_METHODS = {"POST", "PUT", "PATCH", "DELETE"}


def _new_csrf_token() -> str:
    return secrets.token_urlsafe(32)


def get_csrf_token() -> str:
    """Get a stable CSRF token for the current session."""
    token = session.get(CSRF_SESSION_KEY)
    if not token:
        token = _new_csrf_token()
        session[CSRF_SESSION_KEY] = token
    return token


def rotate_csrf_token() -> str:
    """Rotate session CSRF token after authentication state changes."""
    token = _new_csrf_token()
    session[CSRF_SESSION_KEY] = token
    return token


def _extract_request_token(req: Request) -> Optional[str]:
    header_token = req.headers.get(CSRF_HEADER_NAME)
    if header_token:
        return header_token

    form_token = req.form.get(CSRF_FORM_FIELD)
    if form_token:
        return form_token

    if req.is_json:
        payload = req.get_json(silent=True) or {}
        json_token = payload.get(CSRF_FORM_FIELD)
        if json_token:
            return str(json_token)

    return None


def should_enforce_csrf(req: Request = request, enabled: bool = True) -> bool:
    """Determine whether current request needs CSRF validation."""
    if not enabled:
        return False

    if req.method.upper() not in _MUTATING_METHODS:
        return False

    path = req.path or ""
    if path == "/admin/login":
        return True
    if path.startswith("/admin/api/"):
        return True
    if path.startswith("/admin/"):
        return True

    return False


def validate_csrf(req: Request = request) -> bool:
    """Validate request CSRF token against current session token."""
    expected = session.get(CSRF_SESSION_KEY)
    if not expected:
        return False

    provided = _extract_request_token(req)
    if not provided:
        return False

    return hmac.compare_digest(str(expected), str(provided))
@@ -51,6 +51,59 @@ from mes_dashboard.config.settings import get_config

# Configure module logger
logger = logging.getLogger('mes_dashboard.database')
+
+_REDACTION_INSTALLED = False
+_ORACLE_URL_RE = re.compile(r"(oracle\+oracledb://[^:\s/]+:)([^@/\s]+)(@)")
+_ENV_SECRET_RE = re.compile(r"(DB_PASSWORD=)([^\s]+)")
+
+
+def redact_connection_secrets(message: str) -> str:
+    """Redact DB credentials from log message text."""
+    if not message:
+        return message
+    sanitized = _ORACLE_URL_RE.sub(r"\1***\3", message)
+    sanitized = _ENV_SECRET_RE.sub(r"\1***", sanitized)
+    return sanitized
+
+
+class SecretRedactionFilter(logging.Filter):
+    """Filter that masks DB connection secrets in log messages."""
+
+    def filter(self, record: logging.LogRecord) -> bool:
+        try:
+            message = record.getMessage()
+        except Exception:
+            return True
+        sanitized = redact_connection_secrets(message)
+        if sanitized != message:
+            record.msg = sanitized
+            record.args = ()
+        return True
+
+
+def install_log_redaction_filter(target_logger: logging.Logger | None = None) -> None:
+    """Attach secret-redaction filter to mes_dashboard logging handlers once."""
+    global _REDACTION_INSTALLED
+    if target_logger is None and _REDACTION_INSTALLED:
+        return
+
+    logger_obj = target_logger or logging.getLogger("mes_dashboard")
+    redaction_filter = SecretRedactionFilter()
+
+    attached = False
+    for handler in logger_obj.handlers:
+        if any(isinstance(f, SecretRedactionFilter) for f in handler.filters):
+            attached = True
+            continue
+        handler.addFilter(redaction_filter)
+        attached = True
+
+    if not attached and not any(isinstance(f, SecretRedactionFilter) for f in logger_obj.filters):
+        logger_obj.addFilter(redaction_filter)
+        attached = True
+
+    if attached and target_logger is None:
+        _REDACTION_INSTALLED = True

# ============================================================
# SQLAlchemy Engine (QueuePool - connection pooling)
# ============================================================
@@ -59,6 +112,7 @@ logger = logging.getLogger('mes_dashboard.database')
# pool_recycle prevents stale connections from firewalls/NAT.

_ENGINE = None
+_HEALTH_ENGINE = None
_DB_RUNTIME_CONFIG: Optional[Dict[str, Any]] = None


@@ -132,6 +186,13 @@ def get_db_runtime_config(refresh: bool = False) -> Dict[str, Any]:
        "retry_count": _from_app_or_env_int("DB_CONNECT_RETRY_COUNT", config_class.DB_CONNECT_RETRY_COUNT),
        "retry_delay": _from_app_or_env_float("DB_CONNECT_RETRY_DELAY", config_class.DB_CONNECT_RETRY_DELAY),
        "call_timeout_ms": _from_app_or_env_int("DB_CALL_TIMEOUT_MS", config_class.DB_CALL_TIMEOUT_MS),
+        "health_pool_size": _from_app_or_env_int("DB_HEALTH_POOL_SIZE", 1),
+        "health_max_overflow": _from_app_or_env_int("DB_HEALTH_MAX_OVERFLOW", 0),
+        "health_pool_timeout": _from_app_or_env_int("DB_HEALTH_POOL_TIMEOUT", 2),
+        "pool_exhausted_retry_after_seconds": _from_app_or_env_int(
+            "DB_POOL_EXHAUSTED_RETRY_AFTER_SECONDS",
+            5,
+        ),
    }
    return _DB_RUNTIME_CONFIG.copy()

@@ -202,6 +263,42 @@ def get_engine():
    return _ENGINE


+def get_health_engine():
+    """Get dedicated SQLAlchemy engine for health probes.
+
+    Health checks use a tiny isolated pool so status probes remain available
+    when the request pool is saturated.
+    """
+    global _HEALTH_ENGINE
+    if _HEALTH_ENGINE is None:
+        runtime = get_db_runtime_config()
+        _HEALTH_ENGINE = create_engine(
+            CONNECTION_STRING,
+            poolclass=QueuePool,
+            pool_size=max(int(runtime["health_pool_size"]), 1),
+            max_overflow=max(int(runtime["health_max_overflow"]), 0),
+            pool_timeout=max(int(runtime["health_pool_timeout"]), 1),
+            pool_recycle=runtime["pool_recycle"],
+            pool_pre_ping=True,
+            connect_args={
+                "tcp_connect_timeout": runtime["tcp_connect_timeout"],
+                "retry_count": runtime["retry_count"],
+                "retry_delay": runtime["retry_delay"],
+            },
+        )
+        _register_pool_events(
+            _HEALTH_ENGINE,
+            min(int(runtime["call_timeout_ms"]), 10_000),
+        )
+        logger.info(
+            "Health engine created (pool_size=%s, max_overflow=%s, pool_timeout=%s)",
+            runtime["health_pool_size"],
+            runtime["health_max_overflow"],
+            runtime["health_pool_timeout"],
+        )
+    return _HEALTH_ENGINE
+
+
def _register_pool_events(engine, call_timeout_ms: int):
    """Register event listeners for connection pool monitoring."""

@@ -302,8 +399,12 @@ def dispose_engine():

    Call this during application shutdown to cleanly release resources.
    """
-    global _ENGINE, _DB_RUNTIME_CONFIG
+    global _ENGINE, _HEALTH_ENGINE, _DB_RUNTIME_CONFIG
    stop_keepalive()
+    if _HEALTH_ENGINE is not None:
+        _HEALTH_ENGINE.dispose()
+        logger.info("Health engine disposed")
+        _HEALTH_ENGINE = None
    if _ENGINE is not None:
        _ENGINE.dispose()
        logger.info("Database engine disposed, all connections closed")
@@ -432,9 +533,13 @@ def read_sql_df(sql: str, params: Optional[Dict[str, Any]] = None) -> pd.DataFrame:
                elapsed,
                exc,
            )
+            retry_after = max(
+                int(get_db_runtime_config().get("pool_exhausted_retry_after_seconds", 5)),
+                1,
+            )
            raise DatabasePoolExhaustedError(
                "Database connection pool exhausted",
-                retry_after_seconds=5,
+                retry_after_seconds=retry_after,
            ) from exc
        except Exception as exc:
            elapsed = time.time() - start_time
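The two redaction patterns added to this module can be sanity-checked in isolation. This restates them standalone (the connection string below is made up for illustration):

```python
import re

# Same patterns as _ORACLE_URL_RE / _ENV_SECRET_RE above.
oracle_url_re = re.compile(r"(oracle\+oracledb://[^:\s/]+:)([^@/\s]+)(@)")
env_secret_re = re.compile(r"(DB_PASSWORD=)([^\s]+)")

def redact(message: str) -> str:
    """Mask the password segment of an Oracle URL and DB_PASSWORD env echoes."""
    sanitized = oracle_url_re.sub(r"\1***\3", message)
    return env_secret_re.sub(r"\1***", sanitized)

msg = "connect oracle+oracledb://app_user:s3cret@db-host:1521/orcl with DB_PASSWORD=s3cret"
print(redact(msg))
# connect oracle+oracledb://app_user:***@db-host:1521/orcl with DB_PASSWORD=***
```

Note the filter clears `record.args` after rewriting `record.msg`, which is what makes the rewrite safe for `%`-style log records: the message is already fully formatted by `record.getMessage()`.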
103	src/mes_dashboard/core/rate_limit.py	Normal file
@@ -0,0 +1,103 @@
# -*- coding: utf-8 -*-
"""Lightweight in-process rate limiting helpers for high-cost routes."""

from __future__ import annotations

import os
import threading
import time
from collections import defaultdict, deque
from functools import wraps
from typing import Callable, Deque

from flask import request

from mes_dashboard.core.response import TOO_MANY_REQUESTS, error_response

_RATE_LOCK = threading.Lock()
_RATE_ATTEMPTS: dict[str, dict[str, Deque[float]]] = defaultdict(lambda: defaultdict(deque))


def _env_int(name: str, default: int) -> int:
    raw = os.getenv(name)
    if raw is None:
        return int(default)
    try:
        value = int(raw)
    except (TypeError, ValueError):
        return int(default)
    return max(value, 1)


def _client_identifier() -> str:
    forwarded = request.headers.get("X-Forwarded-For", "").strip()
    if forwarded:
        return forwarded.split(",")[0].strip()
    return request.remote_addr or "unknown"


def check_and_record(
    bucket: str,
    *,
    client_id: str,
    max_attempts: int,
    window_seconds: int,
) -> tuple[bool, int]:
    """Check and record request attempt for a bucket+client pair."""
    now = time.time()
    window_start = now - max(window_seconds, 1)

    with _RATE_LOCK:
        per_bucket = _RATE_ATTEMPTS[bucket]
        attempts = per_bucket[client_id]

        while attempts and attempts[0] <= window_start:
            attempts.popleft()

        if len(attempts) >= max_attempts:
            retry_after = max(int(window_seconds - (now - attempts[0])), 1)
            return True, retry_after

        attempts.append(now)
        return False, 0


def configured_rate_limit(
    *,
    bucket: str,
    max_attempts_env: str,
    window_seconds_env: str,
    default_max_attempts: int,
    default_window_seconds: int,
) -> Callable:
    """Build a route decorator with env-configurable rate limits."""
    max_attempts = _env_int(max_attempts_env, default_max_attempts)
    window_seconds = _env_int(window_seconds_env, default_window_seconds)

    def decorator(func: Callable) -> Callable:
        @wraps(func)
        def wrapped(*args, **kwargs):
            limited, retry_after = check_and_record(
                bucket,
                client_id=_client_identifier(),
                max_attempts=max_attempts,
                window_seconds=window_seconds,
            )
            if limited:
                return error_response(
                    TOO_MANY_REQUESTS,
                    "請求過於頻繁,請稍後再試",
                    status_code=429,
                    meta={"retry_after_seconds": retry_after},
                    headers={"Retry-After": str(retry_after)},
                )
            return func(*args, **kwargs)

        return wrapped

    return decorator


def reset_rate_limits_for_tests() -> None:
    with _RATE_LOCK:
        _RATE_ATTEMPTS.clear()
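The deque-based sliding window in `check_and_record` can be illustrated standalone. This sketch simplifies to a single bucket and injects the clock instead of calling `time.time()`, so the window behavior is deterministic:

```python
from collections import deque

def check_and_record(attempts: deque, now: float, max_attempts: int, window_seconds: int):
    """Return (limited, retry_after_seconds); record the attempt when allowed."""
    window_start = now - window_seconds
    while attempts and attempts[0] <= window_start:  # drop timestamps outside the window
        attempts.popleft()
    if len(attempts) >= max_attempts:
        # Retry once the oldest attempt in the window expires, minimum 1s.
        retry_after = max(int(window_seconds - (now - attempts[0])), 1)
        return True, retry_after
    attempts.append(now)
    return False, 0

attempts: deque = deque()
for t in (0.0, 1.0, 2.0):
    print(check_and_record(attempts, t, max_attempts=3, window_seconds=10))  # (False, 0) each time
print(check_and_record(attempts, 3.0, max_attempts=3, window_seconds=10))    # limited
print(check_and_record(attempts, 11.0, max_attempts=3, window_seconds=10))   # oldest expired, allowed again
```

The real module shares one lock across all buckets, which is fine for an in-process limiter but means the state is per worker, not cluster-wide.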
143	src/mes_dashboard/core/runtime_contract.py	Normal file
@@ -0,0 +1,143 @@
# -*- coding: utf-8 -*-
"""Runtime contract helpers shared by app, scripts, and watchdog."""

from __future__ import annotations

import os
import shutil
from pathlib import Path
from typing import Any, Mapping

CONTRACT_VERSION = "2026.02-p2"
DEFAULT_PROJECT_ROOT = Path(__file__).resolve().parents[3]


def _to_bool(value: str | None, default: bool) -> bool:
    if value is None:
        return default
    return value.strip().lower() in {"1", "true", "yes", "on"}


def _resolve_path(value: str | None, fallback: Path, project_root: Path) -> Path:
    if value is None or not str(value).strip():
        return fallback.resolve()
    raw = Path(str(value).strip())
    if raw.is_absolute():
        return raw.resolve()
    return (project_root / raw).resolve()


def load_runtime_contract(
    environ: Mapping[str, str] | None = None,
    *,
    project_root: Path | str | None = None,
) -> dict[str, Any]:
    """Load effective runtime contract from environment with normalized paths."""
    env = environ or os.environ
    root = Path(project_root or env.get("MES_DASHBOARD_ROOT", DEFAULT_PROJECT_ROOT)).resolve()
    runtime_dir = _resolve_path(
        env.get("WATCHDOG_RUNTIME_DIR"),
        root / "tmp",
        root,
    )

    restart_flag = _resolve_path(
        env.get("WATCHDOG_RESTART_FLAG"),
        runtime_dir / "mes_dashboard_restart.flag",
        root,
    )
    pid_file = _resolve_path(
        env.get("WATCHDOG_PID_FILE"),
        runtime_dir / "gunicorn.pid",
        root,
    )
    state_file = _resolve_path(
        env.get("WATCHDOG_STATE_FILE"),
        runtime_dir / "mes_dashboard_restart_state.json",
        root,
    )

    contract = {
        "version": env.get("RUNTIME_CONTRACT_VERSION", CONTRACT_VERSION),
        "project_root": str(root),
        "gunicorn_bind": env.get("GUNICORN_BIND", "0.0.0.0:8080"),
        "conda_bin": (env.get("CONDA_BIN", "") or "").strip(),
        "conda_env_name": (env.get("CONDA_ENV_NAME", "mes-dashboard") or "").strip(),
        "watchdog_runtime_dir": str(runtime_dir),
        "watchdog_restart_flag": str(restart_flag),
        "watchdog_pid_file": str(pid_file),
        "watchdog_state_file": str(state_file),
        "watchdog_check_interval": int(env.get("WATCHDOG_CHECK_INTERVAL", "5")),
        "validation_enforced": _to_bool(env.get("RUNTIME_CONTRACT_ENFORCE"), False),
    }
    return contract


def validate_runtime_contract(
    contract: Mapping[str, Any] | None = None,
    *,
    strict: bool = False,
) -> list[str]:
    """Validate runtime contract and return actionable errors."""
    cfg = dict(contract or load_runtime_contract())
    errors: list[str] = []

    runtime_dir = Path(str(cfg["watchdog_runtime_dir"])).resolve()
    restart_flag = Path(str(cfg["watchdog_restart_flag"])).resolve()
    pid_file = Path(str(cfg["watchdog_pid_file"])).resolve()
    state_file = Path(str(cfg["watchdog_state_file"])).resolve()

    if restart_flag.parent != runtime_dir:
        errors.append(
            "WATCHDOG_RESTART_FLAG must be under WATCHDOG_RUNTIME_DIR "
            f"({restart_flag} not under {runtime_dir})."
        )
    if pid_file.parent != runtime_dir:
        errors.append(
            "WATCHDOG_PID_FILE must be under WATCHDOG_RUNTIME_DIR "
            f"({pid_file} not under {runtime_dir})."
        )

    if not state_file.is_absolute():
        errors.append("WATCHDOG_STATE_FILE must resolve to an absolute path.")

    bind = str(cfg.get("gunicorn_bind", "")).strip()
    if ":" not in bind:
        errors.append(f"GUNICORN_BIND must include host:port (current: {bind!r}).")

    conda_bin = str(cfg.get("conda_bin", "")).strip()
    if strict and not conda_bin:
        conda_on_path = shutil.which("conda")
        if not conda_on_path:
            errors.append(
                "CONDA_BIN is required when strict runtime validation is enabled "
                "and conda is not discoverable on PATH."
            )
    if conda_bin:
        conda_path = Path(conda_bin)
        if not conda_path.exists():
            errors.append(f"CONDA_BIN does not exist: {conda_bin}")
        elif not os.access(conda_bin, os.X_OK):
            errors.append(f"CONDA_BIN is not executable: {conda_bin}")

    conda_env_name = str(cfg.get("conda_env_name", "")).strip()
    active_env = (os.getenv("CONDA_DEFAULT_ENV") or "").strip()
    if strict and conda_env_name and active_env and active_env != conda_env_name:
        errors.append(
            "CONDA_DEFAULT_ENV mismatch: "
            f"expected {conda_env_name!r}, got {active_env!r}."
        )

    return errors


def build_runtime_contract_diagnostics(*, strict: bool = False) -> dict[str, Any]:
    """Build diagnostics payload for runtime contract introspection."""
    contract = load_runtime_contract()
    errors = validate_runtime_contract(contract, strict=strict)
    return {
        "valid": not errors,
        "strict": strict,
        "errors": errors,
        "contract": contract,
    }
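The containment checks in `validate_runtime_contract` hinge on comparing `Path.parent` after `resolve()`, which also means "directly under", not "anywhere beneath". A quick standalone illustration (paths are examples, not the deployment's real layout):

```python
from pathlib import Path

def directly_under(runtime_dir: str, flag_path: str) -> bool:
    """True when the file sits immediately inside runtime_dir, as the contract requires."""
    return Path(flag_path).resolve().parent == Path(runtime_dir).resolve()

print(directly_under("/srv/app/tmp", "/srv/app/tmp/mes_dashboard_restart.flag"))  # True
print(directly_under("/srv/app/tmp", "/etc/mes_dashboard_restart.flag"))          # False
```

A nested path such as `/srv/app/tmp/sub/x.flag` would also fail this check, so the contract forces a flat runtime directory; `resolve()` normalizes relative segments and symlinks before the comparison.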
@@ -33,6 +33,22 @@ def get_days_back(filters: Optional[Dict] = None, default: int = DEFAULT_DAYS_BACK) -> int:
    return default


+def parse_bool_query(value: Any, default: bool = False) -> bool:
+    """Parse common boolean query parameter values."""
+    if value is None:
+        return default
+    if isinstance(value, bool):
+        return value
+    text = str(value).strip().lower()
+    if not text:
+        return default
+    if text in {"true", "1", "yes", "y", "on"}:
+        return True
+    if text in {"false", "0", "no", "n", "off"}:
+        return False
+    return default
+
+
# ============================================================
# SQL Filter Building (DEPRECATED)
# Use mes_dashboard.sql.CommonFilters with QueryBuilder instead.
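The `parse_bool_query` helper above is self-contained, so its behavior is easy to demonstrate; restated standalone with a few calls:

```python
from typing import Any

def parse_bool_query(value: Any, default: bool = False) -> bool:
    """Parse common boolean query parameter values (same logic as the hunk above)."""
    if value is None:
        return default
    if isinstance(value, bool):
        return value
    text = str(value).strip().lower()
    if not text:
        return default
    if text in {"true", "1", "yes", "y", "on"}:
        return True
    if text in {"false", "0", "no", "n", "off"}:
        return False
    return default

print(parse_bool_query("YES"))                   # True (case-insensitive)
print(parse_bool_query(" off "))                 # False (whitespace stripped)
print(parse_bool_query("maybe", default=True))   # True (unrecognized -> default)
```

Falling back to `default` for unrecognized text, instead of raising, keeps query-string parsing forgiving for dashboard URLs.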
220	src/mes_dashboard/core/worker_recovery_policy.py	Normal file
@@ -0,0 +1,220 @@
# -*- coding: utf-8 -*-
"""Worker restart policy helpers (cooldown, retry budget, churn guard)."""

from __future__ import annotations

import json
import os
from datetime import datetime, timezone
from pathlib import Path
from typing import Any, Mapping

from mes_dashboard.core.runtime_contract import load_runtime_contract


def _env_int(name: str, default: int) -> int:
    try:
        return int(os.getenv(name, str(default)))
    except (TypeError, ValueError):
        return default


def _env_bool(name: str, default: bool) -> bool:
    raw = os.getenv(name)
    if raw is None:
        return default
    return raw.strip().lower() in {"1", "true", "yes", "on"}


def _parse_iso(ts: str | None) -> datetime | None:
    if not ts:
        return None
    try:
        value = datetime.fromisoformat(ts)
    except (TypeError, ValueError):
        return None
    if value.tzinfo is None:
        value = value.replace(tzinfo=timezone.utc)
    return value


def _utc_now() -> datetime:
    return datetime.now(timezone.utc)


def get_worker_recovery_policy_config() -> dict[str, Any]:
    """Return effective worker restart policy config."""
    retry_budget = _env_int("WORKER_RESTART_RETRY_BUDGET", 3)
    churn_threshold = _env_int(
        "WORKER_RESTART_CHURN_THRESHOLD",
        _env_int("RESILIENCE_RESTART_CHURN_THRESHOLD", retry_budget),
    )
    window_seconds = _env_int(
        "WORKER_RESTART_WINDOW_SECONDS",
        _env_int("RESILIENCE_RESTART_CHURN_WINDOW_SECONDS", 600),
    )
    return {
        "cooldown_seconds": max(_env_int("WORKER_RESTART_COOLDOWN", 60), 1),
        "retry_budget": max(retry_budget, 1),
        "window_seconds": max(window_seconds, 30),
        "churn_threshold": max(churn_threshold, 1),
        "guarded_mode_enabled": _env_bool("WORKER_GUARDED_MODE_ENABLED", True),
    }


def load_restart_state(path: str | None = None) -> dict[str, Any]:
    """Load persisted restart state from runtime contract state file."""
    state_path = Path(path or load_runtime_contract()["watchdog_state_file"])
    if not state_path.exists():
        return {}
    try:
        return json.loads(state_path.read_text())
    except (json.JSONDecodeError, IOError):
        return {}


def extract_restart_history(state: Mapping[str, Any] | None = None) -> list[dict[str, Any]]:
    """Extract bounded restart history from persisted state."""
    payload = dict(state or {})
    raw_history = payload.get("history")
    if not isinstance(raw_history, list):
        return []
    return [item for item in raw_history if isinstance(item, dict)][-50:]


def extract_last_requested_at(state: Mapping[str, Any] | None = None) -> str | None:
    """Extract last requested timestamp from persisted state."""
    payload = dict(state or {})
    last_restart = payload.get("last_restart") or {}
    if not isinstance(last_restart, dict):
        return None
    value = last_restart.get("requested_at")
    return str(value) if value else None


def evaluate_worker_recovery_state(
    history: list[dict[str, Any]] | None,
    *,
    last_requested_at: str | None = None,
    now: datetime | None = None,
) -> dict[str, Any]:
    """Evaluate restart policy state for automated/manual recovery decisions."""
    cfg = get_worker_recovery_policy_config()
    now_dt = now or _utc_now()
    window_seconds = int(cfg["window_seconds"])
    cooldown_seconds = int(cfg["cooldown_seconds"])

    recent_attempts = 0
    for item in history or []:
        requested = _parse_iso(item.get("requested_at"))
        completed = _parse_iso(item.get("completed_at"))
        ts = requested or completed
        if ts is None:
            continue
        age = (now_dt - ts).total_seconds()
        if age <= window_seconds:
            recent_attempts += 1

    retry_budget = int(cfg["retry_budget"])
    churn_threshold = int(cfg["churn_threshold"])
    retry_budget_exhausted = recent_attempts >= retry_budget
    churn_exceeded = recent_attempts >= churn_threshold
    guarded_mode = bool(cfg["guarded_mode_enabled"] and (retry_budget_exhausted or churn_exceeded))

    cooldown_active = False
    cooldown_remaining = 0
    last_requested_dt = _parse_iso(last_requested_at)
    if last_requested_dt is not None:
        elapsed = (now_dt - last_requested_dt).total_seconds()
        if elapsed < cooldown_seconds:
            cooldown_active = True
            cooldown_remaining = int(max(cooldown_seconds - elapsed, 0))

    blocked = guarded_mode
    allowed = not blocked and not cooldown_active

    state = "allowed"
    if blocked:
        state = "blocked"
    elif cooldown_active:
        state = "cooldown"

    return {
        "state": state,
        "allowed": allowed,
        "cooldown": cooldown_active,
        "cooldown_remaining_seconds": cooldown_remaining,
        "blocked": blocked,
        "guarded_mode": guarded_mode,
        "retry_budget_exhausted": retry_budget_exhausted,
        "churn_exceeded": churn_exceeded,
        "attempts_in_window": recent_attempts,
        "retry_budget": retry_budget,
        "churn_threshold": churn_threshold,
        "window_seconds": window_seconds,
        "cooldown_seconds": cooldown_seconds,
    }


def decide_restart_request(
    policy_state: Mapping[str, Any],
    *,
    source: str,
    manual_override: bool = False,
    override_acknowledged: bool = False,
) -> dict[str, Any]:
    """Decide whether restart request is allowed under current policy state."""
    state = dict(policy_state or {})
    blocked = bool(state.get("blocked"))
    cooldown = bool(state.get("cooldown"))
    source_value = (source or "manual").strip().lower()

    if source_value not in {"auto", "manual"}:
        source_value = "manual"

    if source_value == "auto":
        if blocked:
            return {
                "allowed": False,
                "decision": "blocked",
                "reason": "guarded_mode_blocked",
                "requires_acknowledgement": False,
            }
        if cooldown:
            return {
                "allowed": False,
                "decision": "blocked",
                "reason": "cooldown_active",
                "requires_acknowledgement": False,
            }
        return {
            "allowed": True,
            "decision": "allowed",
            "reason": "policy_allows_auto_restart",
            "requires_acknowledgement": False,
        }

    if (blocked or cooldown) and not (manual_override and override_acknowledged):
        reason = "manual_override_required" if blocked else "cooldown_override_required"
        return {
            "allowed": False,
            "decision": "blocked",
            "reason": reason,
            "requires_acknowledgement": True,
        }

    if manual_override and override_acknowledged:
        return {
            "allowed": True,
            "decision": "manual_override",
            "reason": "operator_override_acknowledged",
            "requires_acknowledgement": False,
        }

    return {
        "allowed": True,
        "decision": "allowed",
        "reason": "policy_allows_manual_restart",
        "requires_acknowledgement": False,
    }
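The branches of `decide_restart_request` reduce to a small decision table: auto restarts are hard-blocked by guarded mode and cooldown, while manual restarts can cross both with an acknowledged override. This condensed sketch reproduces the same outcomes with a simplified signature (illustrative, not the module's API):

```python
def decide(source: str, *, blocked: bool, cooldown: bool,
           manual_override: bool = False, override_acknowledged: bool = False) -> str:
    """Condensed decision table mirroring decide_restart_request's outcomes."""
    if source == "auto":
        if blocked:
            return "blocked:guarded_mode_blocked"
        if cooldown:
            return "blocked:cooldown_active"
        return "allowed"
    # Manual path: blocked/cooldown states require an acknowledged override.
    if (blocked or cooldown) and not (manual_override and override_acknowledged):
        return "blocked:override_required"
    if manual_override and override_acknowledged:
        return "manual_override"
    return "allowed"

print(decide("auto", blocked=True, cooldown=False))   # blocked:guarded_mode_blocked
print(decide("manual", blocked=True, cooldown=False,
             manual_override=True, override_acknowledged=True))  # manual_override
```

Requiring both `manual_override` and `override_acknowledged` is what produces the "需 ack + reason" audit trail the commit message describes: an unacknowledged override is still refused.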
|
||||
|
||||
@@ -3,12 +3,13 @@
|
||||
|
||||
from __future__ import annotations
|
||||
|
||||
import json
|
||||
import logging
|
||||
import os
|
||||
import time
|
||||
from datetime import datetime
|
||||
from pathlib import Path
|
||||
import json
|
||||
import logging
|
||||
import os
|
||||
import time
|
||||
from datetime import datetime, timezone
|
||||
from pathlib import Path
|
||||
from typing import Any
|
||||
|
||||
from flask import Blueprint, g, jsonify, render_template, request
|
||||
|
||||
@@ -19,6 +20,17 @@ from mes_dashboard.core.resilience import (
|
||||
get_resilience_thresholds,
|
||||
summarize_restart_history,
|
||||
)
|
||||
from mes_dashboard.core.runtime_contract import (
|
||||
build_runtime_contract_diagnostics,
|
||||
load_runtime_contract,
|
||||
)
|
||||
from mes_dashboard.core.worker_recovery_policy import (
|
||||
decide_restart_request,
|
||||
evaluate_worker_recovery_state,
|
||||
extract_last_requested_at,
|
||||
extract_restart_history,
|
||||
load_restart_state,
|
||||
)
|
||||
from mes_dashboard.services.page_registry import get_all_pages, set_page_status
|
||||
|
||||
admin_bp = Blueprint("admin", __name__, url_prefix="/admin")
|
||||
@@ -28,21 +40,13 @@ logger = logging.getLogger("mes_dashboard.admin")
|
||||
# Worker Restart Configuration
|
||||
# ============================================================
|
||||
|
||||
WATCHDOG_RUNTIME_DIR = os.getenv("WATCHDOG_RUNTIME_DIR", "/tmp")
|
||||
RESTART_FLAG_PATH = os.getenv(
|
||||
"WATCHDOG_RESTART_FLAG",
|
||||
f"{WATCHDOG_RUNTIME_DIR}/mes_dashboard_restart.flag"
|
||||
)
|
||||
RESTART_STATE_PATH = os.getenv(
|
||||
"WATCHDOG_STATE_FILE",
|
||||
f"{WATCHDOG_RUNTIME_DIR}/mes_dashboard_restart_state.json"
|
||||
)
|
||||
WATCHDOG_PID_PATH = os.getenv(
|
||||
"WATCHDOG_PID_FILE",
|
||||
f"{WATCHDOG_RUNTIME_DIR}/gunicorn.pid"
|
||||
)
|
||||
GUNICORN_BIND = os.getenv("GUNICORN_BIND", "0.0.0.0:8080")
|
||||
RESTART_COOLDOWN_SECONDS = int(os.getenv("WORKER_RESTART_COOLDOWN", "60"))
|
||||
_RUNTIME_CONTRACT = load_runtime_contract()
|
||||
WATCHDOG_RUNTIME_DIR = _RUNTIME_CONTRACT["watchdog_runtime_dir"]
|
||||
RESTART_FLAG_PATH = _RUNTIME_CONTRACT["watchdog_restart_flag"]
|
||||
RESTART_STATE_PATH = _RUNTIME_CONTRACT["watchdog_state_file"]
|
||||
WATCHDOG_PID_PATH = _RUNTIME_CONTRACT["watchdog_pid_file"]
|
||||
GUNICORN_BIND = _RUNTIME_CONTRACT["gunicorn_bind"]
|
||||
RUNTIME_CONTRACT_VERSION = _RUNTIME_CONTRACT["version"]
|
||||
|
||||
# Track last restart request time (in-memory for this worker)
|
||||
_last_restart_request: float = 0.0
|
||||
@@ -91,7 +95,9 @@ def api_system_status():
|
||||
thresholds = get_resilience_thresholds()
|
||||
restart_state = _get_restart_state()
|
||||
restart_churn = _get_restart_churn_summary(restart_state)
|
||||
in_cooldown, remaining = _check_restart_cooldown()
|
||||
policy_state = _get_restart_policy_state(restart_state)
|
||||
in_cooldown = bool(policy_state.get("cooldown"))
|
||||
remaining = int(policy_state.get("cooldown_remaining_seconds") or 0)
|
||||
|
||||
degraded_reason = None
|
||||
if db_status == "error":
|
||||
@@ -111,6 +117,14 @@ def api_system_status():
|
||||
restart_churn_exceeded=bool(restart_churn.get("exceeded")),
|
||||
cooldown_active=in_cooldown,
|
||||
)
|
||||
alerts = _build_restart_alerts(
|
||||
pool_saturation=(pool_state or {}).get("saturation"),
|
||||
circuit_state=circuit_breaker.get("state"),
|
||||
route_cache_degraded=bool(route_cache.get("degraded")),
|
||||
policy_state=policy_state,
|
||||
thresholds=thresholds,
|
||||
)
|
||||
runtime_contract = build_runtime_contract_diagnostics(strict=False)
|
||||
|
||||
# Cache status
|
||||
from mes_dashboard.routes.health_routes import (
|
||||
@@ -142,13 +156,22 @@ def api_system_status():
|
||||
"pool_state": pool_state,
|
||||
"route_cache": route_cache,
|
||||
"thresholds": thresholds,
|
||||
"alerts": alerts,
|
||||
"restart_churn": restart_churn,
|
||||
"policy_state": {
|
||||
"state": policy_state.get("state"),
|
||||
"allowed": policy_state.get("allowed"),
|
||||
"cooldown": policy_state.get("cooldown"),
|
||||
"blocked": policy_state.get("blocked"),
|
||||
"cooldown_remaining_seconds": remaining,
|
||||
},
|
||||
"recovery_recommendation": recommendation,
|
||||
"restart_cooldown": {
|
||||
"active": in_cooldown,
|
||||
"remaining_seconds": int(remaining) if in_cooldown else 0,
|
||||
"remaining_seconds": remaining if in_cooldown else 0,
|
||||
},
|
||||
},
|
||||
"runtime_contract": runtime_contract,
|
||||
"single_port_bind": GUNICORN_BIND,
|
||||
"worker_pid": os.getpid()
|
||||
}
|
||||
@@ -281,55 +304,33 @@ def api_logs_cleanup():
|
||||
# Worker Restart Control Routes
|
||||
# ============================================================
|
||||
|
||||
def _get_restart_state() -> dict:
|
||||
"""Read worker restart state from file."""
|
||||
state_path = Path(RESTART_STATE_PATH)
|
||||
if not state_path.exists():
|
||||
return {}
|
||||
try:
|
||||
return json.loads(state_path.read_text())
|
||||
except (json.JSONDecodeError, IOError):
|
||||
return {}
|
||||
def _get_restart_state() -> dict:
|
||||
"""Read worker restart state from file."""
|
||||
return load_restart_state(RESTART_STATE_PATH)
|
||||
|
||||
|
||||
def _iso_from_epoch(ts: float) -> str | None:
|
||||
if ts <= 0:
|
||||
return None
|
||||
return datetime.fromtimestamp(ts, tz=timezone.utc).isoformat()
|
||||
|
||||
|


def _check_restart_cooldown() -> tuple[bool, float]:
    """Check if restart is in cooldown.

    Returns:
        Tuple of (is_in_cooldown, remaining_seconds).
    """
    global _last_restart_request

    # Check in-memory cooldown first
    now = time.time()
    elapsed = now - _last_restart_request
    if elapsed < RESTART_COOLDOWN_SECONDS:
        return True, RESTART_COOLDOWN_SECONDS - elapsed

    # Check file-based state (for cross-worker coordination)
    state = _get_restart_state()
    last_restart = state.get("last_restart", {})
    requested_at = last_restart.get("requested_at")

    if requested_at:
        try:
            request_time = datetime.fromisoformat(requested_at).timestamp()
            elapsed = now - request_time
            if elapsed < RESTART_COOLDOWN_SECONDS:
                return True, RESTART_COOLDOWN_SECONDS - elapsed
        except (ValueError, TypeError):
            pass

    policy = _get_restart_policy_state()
    if policy.get("cooldown"):
        return True, float(policy.get("cooldown_remaining_seconds") or 0.0)
    return False, 0.0
def _get_restart_history(state: dict | None = None) -> list[dict]:
    """Return bounded restart history for admin telemetry."""
    payload = state if state is not None else _get_restart_state()
    return extract_restart_history(payload)[-20:]


def _get_restart_churn_summary(state: dict | None = None) -> dict:
@@ -338,27 +339,63 @@ def _get_restart_churn_summary(state: dict | None = None) -> dict:
    return summarize_restart_history(history)


def _get_restart_policy_state(state: dict | None = None) -> dict[str, Any]:
    """Return effective worker restart policy state."""
    payload = state if state is not None else _get_restart_state()
    history = _get_restart_history(payload)
    last_requested = extract_last_requested_at(payload)

    in_memory_requested = _iso_from_epoch(_last_restart_request)
    if in_memory_requested:
        try:
            in_memory_dt = datetime.fromisoformat(in_memory_requested)
            persisted_dt = datetime.fromisoformat(last_requested) if last_requested else None
        except (TypeError, ValueError):
            in_memory_dt = None
            persisted_dt = None
        if in_memory_dt and (persisted_dt is None or in_memory_dt > persisted_dt):
            last_requested = in_memory_requested

    return evaluate_worker_recovery_state(
        history,
        last_requested_at=last_requested,
    )


def _build_restart_alerts(
    *,
    pool_saturation: float | None,
    circuit_state: str | None,
    route_cache_degraded: bool,
    policy_state: dict[str, Any],
    thresholds: dict[str, Any],
) -> dict[str, Any]:
    saturation = float(pool_saturation or 0.0)
    warning = float(thresholds.get("pool_saturation_warning", 0.9))
    critical = float(thresholds.get("pool_saturation_critical", 1.0))
    return {
        "pool_warning": saturation >= warning,
        "pool_critical": saturation >= critical,
        "circuit_open": circuit_state == "OPEN",
        "route_cache_degraded": bool(route_cache_degraded),
        "restart_churn_exceeded": bool(policy_state.get("churn_exceeded")),
        "restart_blocked": bool(policy_state.get("blocked")),
    }
def _log_restart_audit(event: str, payload: dict[str, Any]) -> None:
    entry = {
        "event": event,
        "timestamp": datetime.now(tz=timezone.utc).isoformat(),
        "runtime_contract_version": RUNTIME_CONTRACT_VERSION,
        **payload,
    }
    logger.info("worker_restart_audit %s", json.dumps(entry, ensure_ascii=False))


@admin_bp.route("/api/worker/restart", methods=["POST"])
@admin_required
def api_worker_restart():
    """API: Request worker restart.

    Writes a restart flag file that the watchdog process monitors.
@@ -366,52 +403,118 @@ def api_worker_restart():
    """
    global _last_restart_request

    payload = request.get_json(silent=True) or {}
    manual_override = bool(payload.get("manual_override"))
    override_acknowledged = bool(payload.get("override_acknowledged"))
    override_reason = str(payload.get("override_reason") or "").strip()

    # Get request metadata
    user = getattr(g, "username", "unknown")
    ip = request.remote_addr or "unknown"
    timestamp = datetime.now(tz=timezone.utc).isoformat()

    state = _get_restart_state()
    policy_state = _get_restart_policy_state(state)
    decision = decide_restart_request(
        policy_state,
        source="manual",
        manual_override=manual_override,
        override_acknowledged=override_acknowledged,
    )

    if manual_override and not override_reason:
        return error_response(
            "RESTART_OVERRIDE_REASON_REQUIRED",
            "Manual override requires non-empty override_reason for audit traceability.",
            status_code=400,
        )

    if not decision["allowed"]:
        status_code = 429 if policy_state.get("cooldown") else 409
        if status_code == 429:
            message = (
                f"Restart in cooldown. Please wait "
                f"{int(policy_state.get('cooldown_remaining_seconds') or 0)} seconds."
            )
            code = TOO_MANY_REQUESTS
        else:
            message = (
                "Restart blocked by guarded mode. "
                "Set manual_override=true and override_acknowledged=true to proceed."
            )
            code = "RESTART_POLICY_BLOCKED"
        _log_restart_audit(
            "restart_request_blocked",
            {
                "actor": user,
                "ip": ip,
                "decision": decision,
                "policy_state": policy_state,
            },
        )
        return error_response(
            code,
            message,
            status_code=status_code,
        )
    # Write restart flag file
    flag_path = Path(RESTART_FLAG_PATH)
    flag_data = {
        "user": user,
        "ip": ip,
        "timestamp": timestamp,
        "worker_pid": os.getpid(),
        "source": "manual",
        "manual_override": bool(manual_override and override_acknowledged),
        "override_acknowledged": override_acknowledged,
        "override_reason": override_reason or None,
        "policy_state": policy_state,
        "policy_decision": decision["decision"],
        "runtime_contract_version": RUNTIME_CONTRACT_VERSION,
    }

    try:
        flag_path.parent.mkdir(parents=True, exist_ok=True)
        tmp_path = flag_path.with_suffix(flag_path.suffix + ".tmp")
        tmp_path.write_text(json.dumps(flag_data, ensure_ascii=False))
        tmp_path.replace(flag_path)
    except IOError as e:
        logger.error(f"Failed to write restart flag: {e}")
        return error_response(
            "RESTART_FAILED",
            f"Failed to request restart: {e}",
            status_code=500
        )

    # Update in-memory cooldown
    _last_restart_request = time.time()

    logger.info(
        f"Worker restart requested by {user} from {ip}"
    )

    _log_restart_audit(
        "restart_request_accepted",
        {
            "actor": user,
            "ip": ip,
            "decision": decision,
            "policy_state": policy_state,
            "override_reason": override_reason or None,
        },
    )

    return jsonify({
        "success": True,
        "data": {
            "message": "Restart requested. Workers will reload shortly.",
            "requested_by": user,
            "requested_at": timestamp,
            "policy_state": {
                "state": policy_state.get("state"),
                "allowed": policy_state.get("allowed"),
                "cooldown": policy_state.get("cooldown"),
                "blocked": policy_state.get("blocked"),
                "cooldown_remaining_seconds": policy_state.get("cooldown_remaining_seconds"),
            },
            "decision": decision,
            "single_port_bind": GUNICORN_BIND,
            "watchdog": {
                "runtime_dir": WATCHDOG_RUNTIME_DIR,
@@ -425,18 +528,23 @@ def api_worker_restart():
@admin_bp.route("/api/worker/status", methods=["GET"])
|
||||
@admin_required
|
||||
def api_worker_status():
|
||||
"""API: Get worker status and restart information."""
|
||||
# Check cooldown
|
||||
in_cooldown, remaining = _check_restart_cooldown()
|
||||
|
||||
def api_worker_status():
|
||||
"""API: Get worker status and restart information."""
|
||||
# Get last restart info
|
||||
state = _get_restart_state()
|
||||
last_restart = state.get("last_restart", {})
|
||||
history = _get_restart_history(state)
|
||||
churn = _get_restart_churn_summary(state)
|
||||
policy_state = _get_restart_policy_state(state)
|
||||
thresholds = get_resilience_thresholds()
|
||||
recommendation = _worker_recovery_hint(churn, in_cooldown)
|
||||
recommendation = build_recovery_recommendation(
|
||||
degraded_reason="db_pool_saturated" if policy_state.get("blocked") else None,
|
||||
pool_saturation=None,
|
||||
circuit_state=None,
|
||||
restart_churn_exceeded=bool(churn.get("exceeded")),
|
||||
cooldown_active=bool(policy_state.get("cooldown")),
|
||||
)
|
    runtime_contract = build_runtime_contract_diagnostics(strict=False)

    # Get worker start time (psutil is optional)
    worker_start_time = None
@@ -466,6 +574,11 @@ def api_worker_status():
        "worker_pid": os.getpid(),
        "worker_start_time": worker_start_time,
        "runtime_contract": {
            "version": runtime_contract["contract"]["version"],
            "validation": {
                "valid": runtime_contract["valid"],
                "errors": runtime_contract["errors"],
            },
            "single_port_bind": GUNICORN_BIND,
            "watchdog": {
                "runtime_dir": WATCHDOG_RUNTIME_DIR,
@@ -478,12 +591,27 @@ def api_worker_status():
            },
        },
        "cooldown": {
            "active": bool(policy_state.get("cooldown")),
            "remaining_seconds": int(policy_state.get("cooldown_remaining_seconds") or 0)
        },
        "resilience": {
            "thresholds": thresholds,
            "alerts": {
                "restart_churn_exceeded": bool(churn.get("exceeded")),
                "restart_blocked": bool(policy_state.get("blocked")),
            },
            "restart_churn": churn,
            "policy_state": {
                "state": policy_state.get("state"),
                "allowed": policy_state.get("allowed"),
                "cooldown": policy_state.get("cooldown"),
                "blocked": policy_state.get("blocked"),
                "cooldown_remaining_seconds": policy_state.get("cooldown_remaining_seconds"),
                "attempts_in_window": policy_state.get("attempts_in_window"),
                "retry_budget": policy_state.get("retry_budget"),
                "churn_threshold": policy_state.get("churn_threshold"),
                "window_seconds": policy_state.get("window_seconds"),
            },
            "recovery_recommendation": recommendation,
        },
        "restart_history": history,
@@ -9,9 +9,10 @@ from collections import defaultdict
from datetime import datetime
from threading import Lock

from flask import Blueprint, flash, redirect, render_template, request, session, url_for

from mes_dashboard.core.csrf import rotate_csrf_token
from mes_dashboard.services.auth_service import authenticate, is_admin

logger = logging.getLogger('mes_dashboard.auth_routes')
auth_bp = Blueprint("auth", __name__, url_prefix="/admin")
@@ -89,25 +90,27 @@ def login():
        user = authenticate(username, password)
        if user is None:
            error = "帳號或密碼錯誤"
        elif not is_admin(user):
            error = "您不是管理員,無法登入後台"
        else:
            # Login successful
            session.clear()
            session["admin"] = {
                "username": user.get("username"),
                "displayName": user.get("displayName"),
                "mail": user.get("mail"),
                "department": user.get("department"),
                "login_time": datetime.now().isoformat(),
            }
            rotate_csrf_token()
            next_url = request.args.get("next", url_for("portal_index"))
            return redirect(next_url)

    return render_template("login.html", error=error)


@auth_bp.route("/logout")
def logout():
    """Admin logout."""
    session.clear()
    return redirect(url_for("portal_index"))
@@ -6,13 +6,15 @@ Provides /health and /health/deep endpoints for monitoring service status.

from __future__ import annotations

import logging
import os
import threading
import time
from datetime import datetime, timedelta
from flask import Blueprint, current_app, jsonify, make_response

from mes_dashboard.core.database import (
    get_engine,
    get_health_engine,
    get_pool_runtime_config,
    get_pool_status,
)
@@ -28,6 +30,15 @@ from mes_dashboard.core.cache import (
from mes_dashboard.core.resilience import (
    build_recovery_recommendation,
    get_resilience_thresholds,
    summarize_restart_history,
)
from mes_dashboard.core.runtime_contract import build_runtime_contract_diagnostics
from mes_dashboard.core.worker_recovery_policy import (
    evaluate_worker_recovery_state,
    extract_last_requested_at,
    extract_restart_history,
    get_worker_recovery_policy_config,
    load_restart_state,
)
from sqlalchemy import text
@@ -39,8 +50,63 @@ health_bp = Blueprint('health', __name__)
# Warning Thresholds
# ============================================================

DB_LATENCY_WARNING_MS = 100   # Database latency > 100ms is slow
CACHE_STALE_MINUTES = 2       # Cache update > 2 minutes is stale
HEALTH_MEMO_TTL_SECONDS = int(os.getenv("HEALTH_MEMO_TTL_SECONDS", "5"))

_HEALTH_MEMO_LOCK = threading.Lock()
_HEALTH_MEMO: dict[str, dict | None] = {
    "health": None,
    "deep": None,
}


def _health_memo_enabled() -> bool:
    if HEALTH_MEMO_TTL_SECONDS <= 0:
        return False
    if current_app.testing or bool(current_app.config.get("TESTING")):
        return False
    return True


def _get_health_memo(cache_key: str) -> tuple[dict, int] | None:
    if not _health_memo_enabled():
        return None
    now = time.time()
    with _HEALTH_MEMO_LOCK:
        entry = _HEALTH_MEMO.get(cache_key)
        if not entry:
            return None
        if now - float(entry.get("ts", 0.0)) > HEALTH_MEMO_TTL_SECONDS:
            _HEALTH_MEMO[cache_key] = None
            return None
        return entry["payload"], int(entry["status"])


def _set_health_memo(cache_key: str, payload: dict, status_code: int) -> None:
    if not _health_memo_enabled():
        return
    with _HEALTH_MEMO_LOCK:
        _HEALTH_MEMO[cache_key] = {
            "ts": time.time(),
            "payload": payload,
            "status": int(status_code),
        }


def _build_health_response(payload: dict, status_code: int):
    """Build JSON response with explicit no-cache headers."""
    resp = make_response(jsonify(payload), status_code)
    resp.headers['Cache-Control'] = 'no-cache, no-store, must-revalidate'
    resp.headers['Pragma'] = 'no-cache'
    resp.headers['Expires'] = '0'
    return resp


def _reset_health_memo_for_tests() -> None:
    with _HEALTH_MEMO_LOCK:
        _HEALTH_MEMO["health"] = None
        _HEALTH_MEMO["deep"] = None


def _classify_degraded_reason(
@@ -63,18 +129,60 @@ def _classify_degraded_reason(
    return None
def _build_resilience_alerts(
    *,
    pool_saturation: float | None,
    circuit_state: str | None,
    route_cache_degraded: bool,
    restart_churn_exceeded: bool,
    restart_blocked: bool,
    thresholds: dict,
) -> dict:
    saturation = float(pool_saturation or 0.0)
    warning = float(thresholds.get("pool_saturation_warning", 0.9))
    critical = float(thresholds.get("pool_saturation_critical", 1.0))
    return {
        "pool_warning": saturation >= warning,
        "pool_critical": saturation >= critical,
        "circuit_open": circuit_state == "OPEN",
        "route_cache_degraded": bool(route_cache_degraded),
        "restart_churn_exceeded": bool(restart_churn_exceeded),
        "restart_blocked": bool(restart_blocked),
    }
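Feeding sample values through the same threshold rules shows how the alert flags fall out of the defaults. A standalone restatement of the pool/circuit part only (illustrative, under the assumed defaults of 0.9 warning and 1.0 critical):

```python
def build_alerts(pool_saturation, circuit_state, thresholds):
    """Standalone restatement of the _build_resilience_alerts threshold
    rules (pool/circuit part only), for illustration."""
    saturation = float(pool_saturation or 0.0)
    warning = float(thresholds.get("pool_saturation_warning", 0.9))
    critical = float(thresholds.get("pool_saturation_critical", 1.0))
    return {
        "pool_warning": saturation >= warning,     # warn at 90% by default
        "pool_critical": saturation >= critical,   # critical only at 100%
        "circuit_open": circuit_state == "OPEN",
    }

# 95% saturation with a closed circuit: warning fires, critical does not
alerts = build_alerts(0.95, "CLOSED", {})
```

Using `>=` on both thresholds means a fully saturated pool (1.0) raises both flags at once, which is the intended escalation behavior.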


def get_worker_recovery_status() -> dict:
    """Build worker recovery policy status for health/admin telemetry."""
    state = load_restart_state()
    history = extract_restart_history(state)
    policy_state = evaluate_worker_recovery_state(
        history,
        last_requested_at=extract_last_requested_at(state),
    )
    churn = summarize_restart_history(
        history,
        window_seconds=int(policy_state.get("window_seconds") or 600),
        threshold=int(policy_state.get("churn_threshold") or 3),
    )
    return {
        "policy_state": policy_state,
        "restart_churn": churn,
        "policy_config": get_worker_recovery_policy_config(),
    }


def check_database() -> tuple[str, str | None]:
    """Check database connectivity.

    Returns:
        Tuple of (status, error_message).
        status is 'ok' or 'error'.
    """
    try:
        engine = get_health_engine()
        with engine.connect() as conn:
            conn.execute(text("SELECT 1 FROM DUAL"))
        return 'ok', None
    except Exception as e:
        logger.error(f"Database health check failed: {e}")
        return 'error', str(e)
@@ -111,13 +219,21 @@ def get_cache_status() -> dict:
    status = {
        'enabled': REDIS_ENABLED,
        'sys_date': get_cached_sys_date(),
        'updated_at': get_cache_updated_at(),
        'derived_search_index': {},
        'derived_frame_snapshot': {},
        'index_metrics': {},
        'memory': {},
    }
    try:
        from mes_dashboard.services.wip_service import get_wip_search_index_status
        derived = get_wip_search_index_status()
        status['derived_search_index'] = derived.get('derived_search_index', {})
        status['derived_frame_snapshot'] = derived.get('derived_frame_snapshot', {})
        status['index_metrics'] = derived.get('metrics', {})
        status['memory'] = derived.get('memory', {})
    except Exception:
        pass
    return status
@@ -201,10 +317,15 @@ def get_workcenter_mapping_status() -> dict:
def health_check():
    """Health check endpoint.

    Returns:
        - 200 OK: All services healthy or degraded (Redis down but DB ok)
        - 503 Service Unavailable: Database unhealthy
    """
    cached = _get_health_memo("health")
    if cached is not None:
        payload, status_code = cached
        return _build_health_response(payload, status_code)

    from mes_dashboard.core.circuit_breaker import get_circuit_breaker_status

    db_status, db_error = check_database()
@@ -266,13 +387,25 @@ def health_check():
        warnings.append(f"Database pool saturation is high ({saturation:.0%})")

    thresholds = get_resilience_thresholds()
    worker_recovery = get_worker_recovery_status()
    policy_state = worker_recovery.get("policy_state", {})
    restart_churn = worker_recovery.get("restart_churn", {})
    recommendation = build_recovery_recommendation(
        degraded_reason=degraded_reason,
        pool_saturation=pool_saturation,
        circuit_state=circuit_breaker.get('state'),
        restart_churn_exceeded=bool(restart_churn.get("exceeded")),
        cooldown_active=bool(policy_state.get("cooldown")),
    )
    alerts = _build_resilience_alerts(
        pool_saturation=pool_saturation,
        circuit_state=circuit_breaker.get("state"),
        route_cache_degraded=bool(route_cache.get("degraded")),
        restart_churn_exceeded=bool(restart_churn.get("exceeded")),
        restart_blocked=bool(policy_state.get("blocked")),
        thresholds=thresholds,
    )
    runtime_contract = build_runtime_contract_diagnostics(strict=False)

    # Check equipment status cache
    equipment_status_cache = get_equipment_status_cache_status()
@@ -293,8 +426,18 @@ def health_check():
        },
        'resilience': {
            'thresholds': thresholds,
            'alerts': alerts,
            'policy_state': {
                'state': policy_state.get("state"),
                'allowed': policy_state.get("allowed"),
                'cooldown': policy_state.get("cooldown"),
                'blocked': policy_state.get("blocked"),
                'cooldown_remaining_seconds': policy_state.get("cooldown_remaining_seconds"),
            },
            'restart_churn': restart_churn,
            'recovery_recommendation': recommendation,
        },
        'runtime_contract': runtime_contract,
        'cache': get_cache_status(),
        'route_cache': route_cache,
        'resource_cache': resource_cache,
@@ -307,12 +450,8 @@ def health_check():
    if warnings:
        response['warnings'] = warnings

    _set_health_memo("health", response, http_code)
    return _build_health_response(response, http_code)


@health_bp.route('/health/deep', methods=['GET'])
@@ -330,9 +469,14 @@ def deep_health_check():
    from mes_dashboard.core.metrics import get_metrics_summary
    from flask import redirect, url_for, request

    # Require admin authentication - redirect to login for consistency
    if not is_admin_logged_in():
        return redirect(url_for("auth.login", next=request.url))

    cached = _get_health_memo("deep")
    if cached is not None:
        payload, status_code = cached
        return _build_health_response(payload, status_code)

    # Check database with latency measurement
    db_start = time.time()
@@ -397,6 +541,9 @@ def deep_health_check():
        warnings.append(f"Database pool saturation is high ({pool_saturation:.0%})")

    thresholds = get_resilience_thresholds()
    worker_recovery = get_worker_recovery_status()
    policy_state = worker_recovery.get("policy_state", {})
    restart_churn = worker_recovery.get("restart_churn", {})
    degraded_reason = _classify_degraded_reason(
        db_status=db_status,
        redis_status=redis_status,
@@ -408,9 +555,18 @@ def deep_health_check():
        degraded_reason=degraded_reason,
        pool_saturation=pool_saturation,
        circuit_state=circuit_breaker.get('state'),
        restart_churn_exceeded=bool(restart_churn.get("exceeded")),
        cooldown_active=bool(policy_state.get("cooldown")),
    )
    alerts = _build_resilience_alerts(
        pool_saturation=pool_saturation,
        circuit_state=circuit_breaker.get("state"),
        route_cache_degraded=bool(route_cache.get("degraded")),
        restart_churn_exceeded=bool(restart_churn.get("exceeded")),
        restart_blocked=bool(policy_state.get("blocked")),
        thresholds=thresholds,
    )
    runtime_contract = build_runtime_contract_diagnostics(strict=False)

    # Check latency thresholds
    db_latency_status = 'healthy'
@@ -429,8 +585,18 @@ def deep_health_check():
        'degraded_reason': degraded_reason,
        'resilience': {
            'thresholds': thresholds,
            'alerts': alerts,
            'policy_state': {
                'state': policy_state.get("state"),
                'allowed': policy_state.get("allowed"),
                'cooldown': policy_state.get("cooldown"),
                'blocked': policy_state.get("blocked"),
                'cooldown_remaining_seconds': policy_state.get("cooldown_remaining_seconds"),
            },
            'restart_churn': restart_churn,
            'recovery_recommendation': recommendation,
        },
        'runtime_contract': runtime_contract,
        'checks': {
            'database': {
                'status': db_latency_status if db_status == 'ok' else 'error',
@@ -446,7 +612,9 @@ def deep_health_check():
            'cache': {
                'freshness': cache_freshness,
                'updated_at': cache_updated_at,
                'sys_date': cache_status.get('sys_date'),
                'index_metrics': cache_status.get('index_metrics', {}),
                'memory': cache_status.get('memory', {}),
            },
            'route_cache': route_cache
        },
@@ -464,9 +632,5 @@ def deep_health_check():
    if warnings:
        response['warnings'] = warnings

    _set_health_memo("deep", response, http_code)
    return _build_health_response(response, http_code)
@@ -4,22 +4,27 @@
Contains Flask Blueprint for Hold Detail page and API endpoints.
"""

from flask import Blueprint, jsonify, request, render_template, redirect, url_for

from mes_dashboard.core.rate_limit import configured_rate_limit
from mes_dashboard.core.utils import parse_bool_query
from mes_dashboard.services.wip_service import (
    get_hold_detail_summary,
    get_hold_detail_distribution,
    get_hold_detail_lots,
    is_quality_hold,
)

# Create Blueprint
hold_bp = Blueprint('hold', __name__)

_HOLD_LOTS_RATE_LIMIT = configured_rate_limit(
    bucket="hold-detail-lots",
    max_attempts_env="HOLD_LOTS_RATE_LIMIT_MAX_REQUESTS",
    window_seconds_env="HOLD_LOTS_RATE_LIMIT_WINDOW_SECONDS",
    default_max_attempts=90,
    default_window_seconds=60,
)

# ============================================================
@@ -64,7 +69,7 @@ def api_hold_detail_summary():
    if not reason:
        return jsonify({'success': False, 'error': '缺少必要參數: reason'}), 400

    include_dummy = parse_bool_query(request.args.get('include_dummy'))

    result = get_hold_detail_summary(
        reason=reason,
@@ -90,7 +95,7 @@ def api_hold_detail_distribution():
    if not reason:
        return jsonify({'success': False, 'error': '缺少必要參數: reason'}), 400

    include_dummy = parse_bool_query(request.args.get('include_dummy'))

    result = get_hold_detail_distribution(
        reason=reason,
@@ -101,8 +106,9 @@
        return jsonify({'success': False, 'error': '查詢失敗'}), 500


@hold_bp.route('/api/wip/hold-detail/lots')
@_HOLD_LOTS_RATE_LIMIT
def api_hold_detail_lots():
    """API: Get paginated lot details for a specific hold reason.

    Query Parameters:
@@ -124,7 +130,7 @@ def api_hold_detail_lots():
    workcenter = request.args.get('workcenter', '').strip() or None
    package = request.args.get('package', '').strip() or None
    age_range = request.args.get('age_range', '').strip() or None
    include_dummy = parse_bool_query(request.args.get('include_dummy'))
    page = request.args.get('page', 1, type=int)
    per_page = min(request.args.get('per_page', 50, type=int), 200)
@@ -13,10 +13,12 @@ from mes_dashboard.core.database import (
    DatabaseCircuitOpenError,
)
from mes_dashboard.core.cache import cache_get, cache_set, make_cache_key
from mes_dashboard.core.rate_limit import configured_rate_limit
from mes_dashboard.core.utils import get_days_back, parse_bool_query


def _clean_nan_values(data):
    """Convert NaN/NaT values to None for JSON serialization (depth-safe).

    Args:
        data: List of dicts or single dict.
@@ -24,28 +26,77 @@
    Returns:
        Cleaned data with NaN/NaT replaced by None.
"""
|
||||
def _normalize_scalar(value):
|
||||
if isinstance(value, float) and math.isnan(value):
|
||||
return None
|
||||
if isinstance(value, str) and value == 'NaT':
|
||||
return None
|
||||
try:
|
||||
if value != value: # NaN check (NaN != NaN)
|
||||
return None
|
||||
except Exception:
|
||||
pass
|
||||
return value
|
||||
|
||||
if isinstance(data, list):
|
||||
return [_clean_nan_values(item) for item in data]
|
||||
root: list = []
|
||||
elif isinstance(data, dict):
|
||||
cleaned = {}
|
||||
for key, value in data.items():
|
||||
if isinstance(value, float) and math.isnan(value):
|
||||
cleaned[key] = None
|
||||
elif isinstance(value, str) and value == 'NaT':
|
||||
cleaned[key] = None
|
||||
elif value != value: # NaN check (NaN != NaN)
|
||||
cleaned[key] = None
|
||||
elif isinstance(value, list):
|
||||
# Recursively clean nested lists (e.g., LOT_DETAILS)
|
||||
cleaned[key] = _clean_nan_values(value)
|
||||
root = {}
|
||||
else:
|
||||
return _normalize_scalar(data)
|
||||
|
||||
stack = [(data, root)]
|
||||
seen: set[int] = {id(data)}
|
||||
|
||||
while stack:
|
||||
source, target = stack.pop()
|
||||
if isinstance(source, list):
|
||||
for item in source:
|
||||
if isinstance(item, list):
|
||||
item_id = id(item)
|
||||
if item_id in seen:
|
||||
target.append(None)
|
||||
continue
|
||||
child = []
|
||||
target.append(child)
|
||||
seen.add(item_id)
|
||||
stack.append((item, child))
|
||||
elif isinstance(item, dict):
|
||||
item_id = id(item)
|
||||
if item_id in seen:
|
||||
target.append(None)
|
||||
continue
|
||||
child = {}
|
||||
target.append(child)
|
||||
seen.add(item_id)
|
||||
stack.append((item, child))
|
||||
else:
|
||||
target.append(_normalize_scalar(item))
|
||||
continue
|
||||
|
||||
for key, value in source.items():
|
||||
if isinstance(value, list):
|
||||
value_id = id(value)
|
||||
if value_id in seen:
|
||||
target[key] = None
|
||||
continue
|
||||
child = []
|
||||
target[key] = child
|
||||
seen.add(value_id)
|
||||
stack.append((value, child))
|
||||
elif isinstance(value, dict):
|
||||
# Recursively clean nested dicts
|
||||
cleaned[key] = _clean_nan_values(value)
|
||||
value_id = id(value)
|
||||
if value_id in seen:
|
||||
target[key] = None
|
||||
continue
|
||||
child = {}
|
||||
target[key] = child
|
||||
seen.add(value_id)
|
||||
stack.append((value, child))
|
||||
else:
|
||||
cleaned[key] = value
|
||||
return cleaned
|
||||
return data
|
||||
from mes_dashboard.core.utils import get_days_back
|
||||
target[key] = _normalize_scalar(value)
|
||||
return root
|
||||
|
||||
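The rewritten `_clean_nan_values` replaces recursion with an explicit stack plus a `seen` id-set, so deeply nested or self-referencing payloads can neither hit the recursion limit nor loop forever. A condensed standalone version of the same technique (the full version in the diff additionally mirrors the exact branch order shown above):

```python
import math

def clean_nan_values(data):
    """Iterative NaN/'NaT' -> None cleaner with cycle protection.

    Condensed sketch of the depth-safe rewrite in this diff.
    """
    def normalize(value):
        if isinstance(value, float) and math.isnan(value):
            return None
        if isinstance(value, str) and value == 'NaT':
            return None
        return value

    if isinstance(data, list):
        root = []
    elif isinstance(data, dict):
        root = {}
    else:
        return normalize(data)

    stack = [(data, root)]
    seen = {id(data)}          # ids of containers already scheduled

    while stack:
        source, target = stack.pop()
        items = enumerate(source) if isinstance(source, list) else source.items()
        for key, value in items:
            if isinstance(value, (list, dict)):
                if id(value) in seen:
                    child = None           # cycle: break the reference
                else:
                    child = [] if isinstance(value, list) else {}
                    seen.add(id(value))
                    stack.append((value, child))
            else:
                child = normalize(value)
            if isinstance(target, list):
                target.append(child)
            else:
                target[key] = child
    return root

print(clean_nan_values({'a': float('nan'), 'b': [1, float('nan'), 'NaT']}))
```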
 from mes_dashboard.services.resource_service import (
     query_resource_by_status,
     query_resource_by_workcenter,
@@ -62,6 +113,32 @@ from mes_dashboard.config.constants import STATUS_CATEGORIES
 # Create Blueprint
 resource_bp = Blueprint('resource', __name__, url_prefix='/api/resource')

+_RESOURCE_DETAIL_RATE_LIMIT = configured_rate_limit(
+    bucket="resource-detail",
+    max_attempts_env="RESOURCE_DETAIL_RATE_LIMIT_MAX_REQUESTS",
+    window_seconds_env="RESOURCE_DETAIL_RATE_LIMIT_WINDOW_SECONDS",
+    default_max_attempts=60,
+    default_window_seconds=60,
+)
+
+_RESOURCE_STATUS_RATE_LIMIT = configured_rate_limit(
+    bucket="resource-status",
+    max_attempts_env="RESOURCE_STATUS_RATE_LIMIT_MAX_REQUESTS",
+    window_seconds_env="RESOURCE_STATUS_RATE_LIMIT_WINDOW_SECONDS",
+    default_max_attempts=90,
+    default_window_seconds=60,
+)
+
+
+def _optional_bool_arg(name: str):
+    raw = request.args.get(name)
+    if raw is None:
+        return None
+    text = str(raw).strip()
+    if not text:
+        return None
+    return parse_bool_query(text)
+
+
 @resource_bp.route('/by_status')
 def api_resource_by_status():
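`_optional_bool_arg` is deliberately tri-state: `None` means "no filter applied", while `True`/`False` are explicit filters; that distinction is what lets the status endpoints below replace their repeated inline parsing. A standalone sketch, with a plain dict standing in for `request.args` and an assumed `parse_bool_query` (both are illustrative substitutions, not the real module code):

```python
def parse_bool_query(value) -> bool:
    # Stand-in for mes_dashboard.core.utils.parse_bool_query (assumed semantics).
    return str(value).strip().lower() in ('true', '1', 'yes')

def optional_bool_arg(args: dict, name: str):
    """Tri-state query flag: None = absent/blank, True/False = explicit."""
    raw = args.get(name)
    if raw is None:
        return None
    text = str(raw).strip()
    if not text:
        return None          # blank means "don't filter", not False
    return parse_bool_query(text)

print(optional_bool_arg({}, 'is_key'))                 # None -> no filter
print(optional_bool_arg({'is_key': '1'}, 'is_key'))    # True
print(optional_bool_arg({'is_key': 'no'}, 'is_key'))   # False
```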
@@ -118,6 +195,7 @@ def api_resource_workcenter_status_matrix():


 @resource_bp.route('/detail', methods=['POST'])
+@_RESOURCE_DETAIL_RATE_LIMIT
 def api_resource_detail():
     """API: Resource detail with filters."""
     data = request.get_json() or {}
@@ -183,6 +261,7 @@ def api_resource_status_values():
 # ============================================================

 @resource_bp.route('/status')
+@_RESOURCE_STATUS_RATE_LIMIT
 def api_resource_status():
     """API: Get merged resource status from realtime cache.

@@ -197,20 +276,9 @@ def api_resource_status():
     wc_groups_param = request.args.get('workcenter_groups')
     workcenter_groups = wc_groups_param.split(',') if wc_groups_param else None

-    is_production = None
-    is_prod_param = request.args.get('is_production')
-    if is_prod_param:
-        is_production = is_prod_param.lower() in ('1', 'true', 'yes')
-
-    is_key = None
-    is_key_param = request.args.get('is_key')
-    if is_key_param:
-        is_key = is_key_param.lower() in ('1', 'true', 'yes')
-
-    is_monitor = None
-    is_monitor_param = request.args.get('is_monitor')
-    if is_monitor_param:
-        is_monitor = is_monitor_param.lower() in ('1', 'true', 'yes')
+    is_production = _optional_bool_arg('is_production')
+    is_key = _optional_bool_arg('is_key')
+    is_monitor = _optional_bool_arg('is_monitor')

     status_cats_param = request.args.get('status_categories')
     status_categories = status_cats_param.split(',') if status_cats_param else None
@@ -260,6 +328,7 @@ def api_resource_status_options():


 @resource_bp.route('/status/summary')
+@_RESOURCE_STATUS_RATE_LIMIT
 def api_resource_status_summary():
     """API: Get resource status summary statistics.

@@ -269,20 +338,9 @@ def api_resource_status_summary():
     wc_groups_param = request.args.get('workcenter_groups')
     workcenter_groups = wc_groups_param.split(',') if wc_groups_param else None

-    is_production = None
-    is_prod_param = request.args.get('is_production')
-    if is_prod_param:
-        is_production = is_prod_param.lower() in ('1', 'true', 'yes')
-
-    is_key = None
-    is_key_param = request.args.get('is_key')
-    if is_key_param:
-        is_key = is_key_param.lower() in ('1', 'true', 'yes')
-
-    is_monitor = None
-    is_monitor_param = request.args.get('is_monitor')
-    if is_monitor_param:
-        is_monitor = is_monitor_param.lower() in ('1', 'true', 'yes')
+    is_production = _optional_bool_arg('is_production')
+    is_key = _optional_bool_arg('is_key')
+    is_monitor = _optional_bool_arg('is_monitor')

     try:
         data = get_resource_status_summary(
@@ -301,6 +359,7 @@ def api_resource_status_summary():


 @resource_bp.route('/status/matrix')
+@_RESOURCE_STATUS_RATE_LIMIT
 def api_resource_status_matrix():
     """API: Get workcenter × status matrix.

@@ -309,20 +368,9 @@ def api_resource_status_matrix():
         is_key: Filter by key equipment
         is_monitor: Filter by monitor equipment
     """
-    is_production = None
-    is_prod_param = request.args.get('is_production')
-    if is_prod_param:
-        is_production = is_prod_param.lower() in ('1', 'true', 'yes')
-
-    is_key = None
-    is_key_param = request.args.get('is_key')
-    if is_key_param:
-        is_key = is_key_param.lower() in ('1', 'true', 'yes')
-
-    is_monitor = None
-    is_monitor_param = request.args.get('is_monitor')
-    if is_monitor_param:
-        is_monitor = is_monitor_param.lower() in ('1', 'true', 'yes')
+    is_production = _optional_bool_arg('is_production')
+    is_key = _optional_bool_arg('is_key')
+    is_monitor = _optional_bool_arg('is_monitor')

     try:
         data = get_workcenter_status_matrix(

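`configured_rate_limit` (from `mes_dashboard.core.rate_limit`) is used throughout these hunks as a decorator factory whose limits come from environment variables, falling back to code defaults. Its real implementation is not shown in this diff; the following fixed-window sketch only mirrors the parameter names and the env-override behaviour, and is an assumption rather than the project's actual limiter (which presumably limits per client, returns a proper Flask response, and is thread-safe):

```python
import os
import time
from collections import deque
from functools import wraps

def configured_rate_limit(bucket, max_attempts_env, window_seconds_env,
                          default_max_attempts, default_window_seconds):
    """Decorator factory: env-overridable fixed-window rate limit (sketch)."""
    max_attempts = int(os.getenv(max_attempts_env, default_max_attempts))
    window = int(os.getenv(window_seconds_env, default_window_seconds))
    hits = deque()  # timestamps of accepted calls for this bucket

    def decorator(fn):
        @wraps(fn)
        def wrapper(*args, **kwargs):
            now = time.monotonic()
            while hits and now - hits[0] > window:
                hits.popleft()          # drop calls outside the window
            if len(hits) >= max_attempts:
                return {'success': False, 'error': 'rate limited'}, 429
            hits.append(now)
            return fn(*args, **kwargs)
        return wrapper
    return decorator
```

Applied exactly like the diff does: `@_RESOURCE_STATUS_RATE_LIMIT` between the route decorator and the view function, so over-limit requests are rejected before the view runs.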
@@ -7,6 +7,8 @@ Uses DWH.DW_MES_LOT_V view for real-time WIP data.

 from flask import Blueprint, jsonify, request

+from mes_dashboard.core.rate_limit import configured_rate_limit
+from mes_dashboard.core.utils import parse_bool_query
 from mes_dashboard.services.wip_service import (
     get_wip_summary,
     get_wip_matrix,
@@ -24,10 +26,21 @@ from mes_dashboard.services.wip_service import (
 # Create Blueprint
 wip_bp = Blueprint('wip', __name__, url_prefix='/api/wip')

+_WIP_MATRIX_RATE_LIMIT = configured_rate_limit(
+    bucket="wip-overview-matrix",
+    max_attempts_env="WIP_MATRIX_RATE_LIMIT_MAX_REQUESTS",
+    window_seconds_env="WIP_MATRIX_RATE_LIMIT_WINDOW_SECONDS",
+    default_max_attempts=120,
+    default_window_seconds=60,
+)

-def _parse_bool(value: str) -> bool:
-    """Parse boolean from query string."""
-    return value.lower() in ('true', '1', 'yes') if value else False
+_WIP_DETAIL_RATE_LIMIT = configured_rate_limit(
+    bucket="wip-detail",
+    max_attempts_env="WIP_DETAIL_RATE_LIMIT_MAX_REQUESTS",
+    window_seconds_env="WIP_DETAIL_RATE_LIMIT_WINDOW_SECONDS",
+    default_max_attempts=90,
+    default_window_seconds=60,
+)


 # ============================================================
@@ -52,7 +65,7 @@ def api_overview_summary():
     lotid = request.args.get('lotid', '').strip() or None
     package = request.args.get('package', '').strip() or None
     pj_type = request.args.get('type', '').strip() or None
-    include_dummy = _parse_bool(request.args.get('include_dummy', ''))
+    include_dummy = parse_bool_query(request.args.get('include_dummy'))

     result = get_wip_summary(
         include_dummy=include_dummy,
@@ -67,6 +80,7 @@ def api_overview_summary():


 @wip_bp.route('/overview/matrix')
+@_WIP_MATRIX_RATE_LIMIT
 def api_overview_matrix():
     """API: Get workcenter x product line matrix for overview dashboard.

@@ -88,7 +102,7 @@ def api_overview_matrix():
     lotid = request.args.get('lotid', '').strip() or None
     package = request.args.get('package', '').strip() or None
     pj_type = request.args.get('type', '').strip() or None
-    include_dummy = _parse_bool(request.args.get('include_dummy', ''))
+    include_dummy = parse_bool_query(request.args.get('include_dummy'))
     status = request.args.get('status', '').strip().upper() or None
     hold_type = request.args.get('hold_type', '').strip().lower() or None

@@ -134,7 +148,7 @@ def api_overview_hold():
     """
     workorder = request.args.get('workorder', '').strip() or None
     lotid = request.args.get('lotid', '').strip() or None
-    include_dummy = _parse_bool(request.args.get('include_dummy', ''))
+    include_dummy = parse_bool_query(request.args.get('include_dummy'))

     result = get_wip_hold_summary(
         include_dummy=include_dummy,
@@ -151,6 +165,7 @@ def api_overview_hold():
 # ============================================================

 @wip_bp.route('/detail/<workcenter>')
+@_WIP_DETAIL_RATE_LIMIT
 def api_detail(workcenter: str):
     """API: Get WIP detail for a specific workcenter group.

@@ -176,12 +191,17 @@ def api_detail(workcenter: str):
     hold_type = request.args.get('hold_type', '').strip().lower() or None
     workorder = request.args.get('workorder', '').strip() or None
     lotid = request.args.get('lotid', '').strip() or None
-    include_dummy = _parse_bool(request.args.get('include_dummy', ''))
+    include_dummy = parse_bool_query(request.args.get('include_dummy'))
     page = request.args.get('page', 1, type=int)
-    page_size = min(request.args.get('page_size', 100, type=int), 500)
+    page_size = request.args.get('page_size', 100, type=int)

-    if page < 1:
+    if page is None:
         page = 1
+    if page_size is None:
+        page_size = 100
+
+    page = max(page, 1)
+    page_size = max(1, min(page_size, 500))

     # Validate status parameter
     if status and status not in ('RUN', 'QUEUE', 'HOLD'):
@@ -245,7 +265,7 @@ def api_meta_workcenters():
     Returns:
         JSON with list of {name, lot_count} sorted by sequence
     """
-    include_dummy = _parse_bool(request.args.get('include_dummy', ''))
+    include_dummy = parse_bool_query(request.args.get('include_dummy'))

     result = get_workcenters(include_dummy=include_dummy)
     if result is not None:
@@ -263,7 +283,7 @@ def api_meta_packages():
     Returns:
         JSON with list of {name, lot_count} sorted by count desc
     """
-    include_dummy = _parse_bool(request.args.get('include_dummy', ''))
+    include_dummy = parse_bool_query(request.args.get('include_dummy'))

     result = get_packages(include_dummy=include_dummy)
     if result is not None:
@@ -293,7 +313,7 @@ def api_meta_search():
     search_field = request.args.get('field', '').strip().lower()
     q = request.args.get('q', '').strip()
     limit = min(request.args.get('limit', 20, type=int), 50)
-    include_dummy = _parse_bool(request.args.get('include_dummy', ''))
+    include_dummy = parse_bool_query(request.args.get('include_dummy'))

     # Cross-filter parameters
     workorder = request.args.get('workorder', '').strip() or None

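The reworked `/detail/<workcenter>` endpoint normalizes `page`/`page_size` with None-guards followed by clamping, instead of a single bare `min(...)`, so a missing, zero, negative, or oversized value can no longer reach the query layer. The same steps, extracted into a helper (the name `normalize_paging` is mine, not the project's):

```python
def normalize_paging(page, page_size, default_size=100, max_size=500):
    """Clamp pagination args as the reworked /detail endpoint does."""
    if page is None:
        page = 1
    if page_size is None:
        page_size = default_size
    page = max(page, 1)                            # pages start at 1
    page_size = max(1, min(page_size, max_size))   # 1 <= size <= max_size
    return page, page_size

print(normalize_paging(None, None))   # (1, 100)
print(normalize_paging(0, 9999))      # (1, 500)
print(normalize_paging(3, 50))        # (3, 50)
```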
@@ -1,124 +1,193 @@
-# -*- coding: utf-8 -*-
-"""Authentication service using LDAP API or local credentials."""
-
-from __future__ import annotations
-
-import logging
-import os
-
-import requests
-
-logger = logging.getLogger(__name__)
-
-# Configuration - MUST be set in .env file
-LDAP_API_BASE = os.environ.get("LDAP_API_URL", "")
-ADMIN_EMAILS = os.environ.get("ADMIN_EMAILS", "").lower().split(",")
-
-# Timeout for LDAP API requests
-LDAP_TIMEOUT = 10
-
-# Local authentication configuration (for development/testing)
-LOCAL_AUTH_ENABLED = os.environ.get("LOCAL_AUTH_ENABLED", "false").lower() in ("true", "1", "yes")
-LOCAL_AUTH_USERNAME = os.environ.get("LOCAL_AUTH_USERNAME", "")
-LOCAL_AUTH_PASSWORD = os.environ.get("LOCAL_AUTH_PASSWORD", "")
-
-
-def _authenticate_local(username: str, password: str) -> dict | None:
-    """Authenticate using local environment credentials.
-
-    Args:
-        username: User provided username
-        password: User provided password
-
-    Returns:
-        User info dict on success, None on failure
-    """
-    if not LOCAL_AUTH_ENABLED:
-        return None
-
-    if not LOCAL_AUTH_USERNAME or not LOCAL_AUTH_PASSWORD:
-        logger.warning("Local auth enabled but credentials not configured")
-        return None
-
-    if username == LOCAL_AUTH_USERNAME and password == LOCAL_AUTH_PASSWORD:
-        logger.info("Local auth success for user: %s", username)
-        return {
-            "username": username,
-            "displayName": f"Local User ({username})",
-            "mail": f"{username}@local.dev",
-            "department": "Development",
-        }
-
-    logger.warning("Local auth failed for user: %s", username)
-    return None
-
-
-def authenticate(username: str, password: str, domain: str = "PANJIT") -> dict | None:
-    """Authenticate user via local credentials or LDAP API.
-
-    If LOCAL_AUTH_ENABLED is set, tries local authentication first.
-    Falls back to LDAP API if local auth is disabled or fails.
-
-    Args:
-        username: Employee ID or email
-        password: User password
-        domain: Domain name (default: PANJIT)
-
-    Returns:
-        User info dict on success: {username, displayName, mail, department}
-        None on failure
-    """
-    # Try local authentication first if enabled
-    if LOCAL_AUTH_ENABLED:
-        local_result = _authenticate_local(username, password)
-        if local_result:
-            return local_result
-        # If local auth is enabled but failed, don't fall back to LDAP
-        # This ensures local-only mode when LOCAL_AUTH_ENABLED is true
-        return None
-
-    # LDAP authentication
-    try:
-        response = requests.post(
-            f"{LDAP_API_BASE}/api/v1/ldap/auth",
-            json={"username": username, "password": password, "domain": domain},
-            timeout=LDAP_TIMEOUT,
-        )
-        data = response.json()
-
-        if data.get("success"):
-            user = data.get("user", {})
-            logger.info("LDAP auth success for user: %s", user.get("username"))
-            return user
-
-        logger.warning("LDAP auth failed for user: %s", username)
-        return None
-
-    except requests.Timeout:
-        logger.error("LDAP API timeout for user: %s", username)
-        return None
-    except requests.RequestException as e:
-        logger.error("LDAP API error for user %s: %s", username, e)
-        return None
-    except (ValueError, KeyError) as e:
-        logger.error("LDAP API response parse error: %s", e)
-        return None
-
-
-def is_admin(user: dict) -> bool:
-    """Check if user is an admin.
-
-    Args:
-        user: User info dict with 'mail' field
-
-    Returns:
-        True if user email is in ADMIN_EMAILS list, or if local auth is enabled
-    """
-    # Local auth users are automatically admins (for development/testing)
-    if LOCAL_AUTH_ENABLED:
-        user_mail = user.get("mail", "")
-        if user_mail.endswith("@local.dev"):
-            return True
-
-    user_mail = user.get("mail", "").lower().strip()
-    return user_mail in [e.strip() for e in ADMIN_EMAILS]
+# -*- coding: utf-8 -*-
+"""Authentication service using LDAP API or local credentials."""
+
+from __future__ import annotations
+
+import logging
+import os
+from urllib.parse import urlparse
+
+import requests
+
+logger = logging.getLogger(__name__)
+
+# Timeout for LDAP API requests
+LDAP_TIMEOUT = 10
+
+# Configuration - MUST be set in .env file
+ADMIN_EMAILS = os.environ.get("ADMIN_EMAILS", "").lower().split(",")
+
+# Local authentication configuration (for development/testing)
+LOCAL_AUTH_ENABLED = os.environ.get("LOCAL_AUTH_ENABLED", "false").lower() in ("true", "1", "yes")
+LOCAL_AUTH_USERNAME = os.environ.get("LOCAL_AUTH_USERNAME", "")
+LOCAL_AUTH_PASSWORD = os.environ.get("LOCAL_AUTH_PASSWORD", "")
+
+# LDAP endpoint hardening configuration
+LDAP_API_URL = os.environ.get("LDAP_API_URL", "").strip()
+LDAP_ALLOWED_HOSTS_RAW = os.environ.get("LDAP_ALLOWED_HOSTS", "").strip()
+
+
+def _normalize_host(host: str) -> str:
+    return host.strip().lower().rstrip(".")
+
+
+def _parse_allowed_hosts(raw_hosts: str) -> tuple[str, ...]:
+    if not raw_hosts:
+        return tuple()
+
+    hosts: list[str] = []
+    for raw in raw_hosts.split(","):
+        host = _normalize_host(raw)
+        if host:
+            hosts.append(host)
+    return tuple(hosts)
+
+
+def _validate_ldap_api_url(raw_url: str, allowed_hosts: tuple[str, ...]) -> tuple[str | None, str | None]:
+    """Validate LDAP API URL to prevent configuration-based SSRF risks."""
+    url = (raw_url or "").strip()
+    if not url:
+        return None, "LDAP_API_URL is missing"
+
+    parsed = urlparse(url)
+    scheme = (parsed.scheme or "").lower()
+    host = _normalize_host(parsed.hostname or "")
+
+    if not host:
+        return None, f"LDAP_API_URL has no valid host: {url!r}"
+
+    if scheme != "https":
+        return None, f"LDAP_API_URL must use HTTPS: {url!r}"
+
+    effective_allowlist = allowed_hosts or (host,)
+    if host not in effective_allowlist:
+        return None, (
+            f"LDAP_API_URL host {host!r} is not allowlisted. "
+            f"Allowed hosts: {', '.join(effective_allowlist)}"
+        )
+
+    return url.rstrip("/"), None
+
+
+def _resolve_ldap_config() -> tuple[str | None, str | None, tuple[str, ...]]:
+    allowed_hosts = _parse_allowed_hosts(LDAP_ALLOWED_HOSTS_RAW)
+    api_base, error = _validate_ldap_api_url(LDAP_API_URL, allowed_hosts)
+
+    if api_base:
+        effective_hosts = allowed_hosts or (_normalize_host(urlparse(api_base).hostname or ""),)
+        return api_base, None, effective_hosts
+
+    return None, error, allowed_hosts
+
+
+LDAP_API_BASE, LDAP_CONFIG_ERROR, LDAP_ALLOWED_HOSTS = _resolve_ldap_config()
+
+
+def _authenticate_local(username: str, password: str) -> dict | None:
+    """Authenticate using local environment credentials.
+
+    Args:
+        username: User provided username
+        password: User provided password
+
+    Returns:
+        User info dict on success, None on failure
+    """
+    if not LOCAL_AUTH_ENABLED:
+        return None
+
+    if not LOCAL_AUTH_USERNAME or not LOCAL_AUTH_PASSWORD:
+        logger.warning("Local auth enabled but credentials not configured")
+        return None
+
+    if username == LOCAL_AUTH_USERNAME and password == LOCAL_AUTH_PASSWORD:
+        logger.info("Local auth success for user: %s", username)
+        return {
+            "username": username,
+            "displayName": f"Local User ({username})",
+            "mail": f"{username}@local.dev",
+            "department": "Development",
+        }
+
+    logger.warning("Local auth failed for user: %s", username)
+    return None
+
+
+def authenticate(username: str, password: str, domain: str = "PANJIT") -> dict | None:
+    """Authenticate user via local credentials or LDAP API.
+
+    If LOCAL_AUTH_ENABLED is set, tries local authentication first.
+    Falls back to LDAP API if local auth is disabled or fails.
+
+    Args:
+        username: Employee ID or email
+        password: User password
+        domain: Domain name (default: PANJIT)
+
+    Returns:
+        User info dict on success: {username, displayName, mail, department}
+        None on failure
+    """
+    # Try local authentication first if enabled
+    if LOCAL_AUTH_ENABLED:
+        local_result = _authenticate_local(username, password)
+        if local_result:
+            return local_result
+        # If local auth is enabled but failed, don't fall back to LDAP
+        # This ensures local-only mode when LOCAL_AUTH_ENABLED is true
+        return None
+
+    if LDAP_CONFIG_ERROR:
+        logger.error("LDAP authentication blocked: %s", LDAP_CONFIG_ERROR)
+        return None
+
+    if not LDAP_API_BASE:
+        logger.error("LDAP authentication blocked: LDAP_API_URL is not configured")
+        return None
+
+    # LDAP authentication
+    try:
+        response = requests.post(
+            f"{LDAP_API_BASE}/api/v1/ldap/auth",
+            json={"username": username, "password": password, "domain": domain},
+            timeout=LDAP_TIMEOUT,
+        )
+        data = response.json()
+
+        if data.get("success"):
+            user = data.get("user", {})
+            logger.info("LDAP auth success for user: %s", user.get("username"))
+            return user
+
+        logger.warning("LDAP auth failed for user: %s", username)
+        return None
+
+    except requests.Timeout:
+        logger.error("LDAP API timeout for user: %s", username)
+        return None
+    except requests.RequestException as e:
+        logger.error("LDAP API error for user %s: %s", username, e)
+        return None
+    except (ValueError, KeyError) as e:
+        logger.error("LDAP API response parse error: %s", e)
+        return None
+
+
+def is_admin(user: dict) -> bool:
+    """Check if user is an admin.
+
+    Args:
+        user: User info dict with 'mail' field
+
+    Returns:
+        True if user email is in ADMIN_EMAILS list, or if local auth is enabled
+    """
+    # Local auth users are automatically admins (for development/testing)
+    if LOCAL_AUTH_ENABLED:
+        user_mail = user.get("mail", "")
+        if user_mail.endswith("@local.dev"):
+            return True
+
+    user_mail = user.get("mail", "").lower().strip()
+    allowed_emails = [e.strip() for e in ADMIN_EMAILS if e and e.strip()]
+    return user_mail in allowed_emails

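The heart of the round-2 hardening above is `_validate_ldap_api_url`: the endpoint must be HTTPS and its host must be on the `LDAP_ALLOWED_HOSTS` allowlist, with an empty allowlist degrading to a one-entry allowlist of the URL's own host. A trimmed, runnable reproduction of that helper for illustration (hostnames here are placeholders):

```python
from urllib.parse import urlparse

def _normalize_host(host: str) -> str:
    return host.strip().lower().rstrip(".")

def validate_ldap_api_url(raw_url, allowed_hosts):
    """HTTPS + host-allowlist check, as in the hardened auth service."""
    url = (raw_url or "").strip()
    if not url:
        return None, "LDAP_API_URL is missing"
    parsed = urlparse(url)
    host = _normalize_host(parsed.hostname or "")
    if not host:
        return None, f"LDAP_API_URL has no valid host: {url!r}"
    if (parsed.scheme or "").lower() != "https":
        return None, f"LDAP_API_URL must use HTTPS: {url!r}"
    effective = allowed_hosts or (host,)   # empty allowlist -> trust URL's own host
    if host not in effective:
        return None, f"host {host!r} is not allowlisted"
    return url.rstrip("/"), None

base, err = validate_ldap_api_url("https://ldap.example.com/", ("ldap.example.com",))
print(base)   # https://ldap.example.com
print(validate_ldap_api_url("http://ldap.example.com", ())[1])  # HTTPS error message
```

Because validation happens once at import time (`_resolve_ldap_config`), a bad configuration disables LDAP authentication entirely rather than letting requests leak to an arbitrary endpoint.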
@@ -6,6 +6,7 @@ Data is loaded from database and cached in memory with periodic refresh.
 """

 import logging
+import os
 import threading
 from datetime import datetime, timedelta
 from typing import Optional, Dict, List, Any
@@ -19,8 +20,8 @@ logger = logging.getLogger('mes_dashboard.filter_cache')
 # ============================================================

 CACHE_TTL_SECONDS = 3600  # 1 hour cache TTL
-WIP_VIEW = "DWH.DW_MES_LOT_V"
-SPEC_WORKCENTER_VIEW = "DWH.DW_MES_SPEC_WORKCENTER_V"
+WIP_VIEW = os.getenv("FILTER_CACHE_WIP_VIEW", "DWH.DW_MES_LOT_V")
+SPEC_WORKCENTER_VIEW = os.getenv("FILTER_CACHE_SPEC_WORKCENTER_VIEW", "DWH.DW_MES_SPEC_WORKCENTER_V")

 # ============================================================
 # Cache Storage

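Making the view names env-resolvable lets tests and benchmark fixtures point the filter cache at stub views without monkeypatching the module. The resolution behaviour is just `os.getenv` with a default (the fixture view name below is hypothetical):

```python
import os

def resolve_view(env_name: str, default: str) -> str:
    """os.getenv-with-default, as the filter cache now resolves view names."""
    return os.getenv(env_name, default)

# Default when the variable is unset:
print(resolve_view("FILTER_CACHE_WIP_VIEW", "DWH.DW_MES_LOT_V"))

# A test harness can redirect the cache to a stub view:
os.environ["FILTER_CACHE_WIP_VIEW"] = "TEST.WIP_FIXTURE_V"
print(resolve_view("FILTER_CACHE_WIP_VIEW", "DWH.DW_MES_LOT_V"))  # TEST.WIP_FIXTURE_V
```

Note the lookup happens at module import, so the override must be exported before the filter cache module is first imported.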
@@ -1,12 +1,14 @@
 # -*- coding: utf-8 -*-
 """Page registry service for managing page access status."""

 from __future__ import annotations

 import json
 import logging
+import os
+import tempfile
 from pathlib import Path
 from threading import Lock

 logger = logging.getLogger(__name__)

@@ -34,20 +36,38 @@ def _load() -> dict:
     return _cache


-def _save(data: dict) -> None:
-    """Save page status configuration."""
-    global _cache
-    try:
-        DATA_FILE.parent.mkdir(parents=True, exist_ok=True)
-        DATA_FILE.write_text(
-            json.dumps(data, ensure_ascii=False, indent=2),
-            encoding="utf-8"
-        )
-        _cache = data
-        logger.debug("Saved page status to %s", DATA_FILE)
-    except OSError as e:
-        logger.error("Failed to save page status: %s", e)
-        raise
+def _save(data: dict) -> None:
+    """Save page status configuration."""
+    global _cache
+    tmp_path: Path | None = None
+    try:
+        DATA_FILE.parent.mkdir(parents=True, exist_ok=True)
+        payload = json.dumps(data, ensure_ascii=False, indent=2)
+
+        # Atomic write: write to sibling temp file, then replace target.
+        with tempfile.NamedTemporaryFile(
+            mode="w",
+            encoding="utf-8",
+            dir=str(DATA_FILE.parent),
+            prefix=f".{DATA_FILE.name}.",
+            suffix=".tmp",
+            delete=False,
+        ) as tmp:
+            tmp.write(payload)
+            tmp.flush()
+            os.fsync(tmp.fileno())
+            tmp_path = Path(tmp.name)
+        os.replace(tmp_path, DATA_FILE)
+        _cache = data
+        logger.debug("Saved page status to %s", DATA_FILE)
+    except OSError as e:
+        if tmp_path is not None:
+            try:
+                tmp_path.unlink(missing_ok=True)
+            except OSError:
+                pass
+        logger.error("Failed to save page status: %s", e)
+        raise


 def get_page_status(route: str) -> str | None:

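The rewritten `_save` follows the standard crash-safe write pattern: serialize, write to a temp file in the same directory (so `os.replace` never crosses a filesystem boundary), `fsync`, then atomically replace the target. Readers therefore always see either the old file or the complete new one, never a torn write. The same pattern as a self-contained helper:

```python
import json
import os
import tempfile
from pathlib import Path

def atomic_write_json(target: Path, data: dict) -> None:
    """Write JSON via sibling temp file + os.replace (same pattern as _save)."""
    target.parent.mkdir(parents=True, exist_ok=True)
    payload = json.dumps(data, ensure_ascii=False, indent=2)
    with tempfile.NamedTemporaryFile(
        mode="w", encoding="utf-8", dir=str(target.parent),
        prefix=f".{target.name}.", suffix=".tmp", delete=False,
    ) as tmp:
        tmp.write(payload)
        tmp.flush()
        os.fsync(tmp.fileno())          # durable before the rename
        tmp_path = Path(tmp.name)
    os.replace(tmp_path, target)        # atomic on POSIX and Windows

with tempfile.TemporaryDirectory() as d:
    f = Path(d) / "pages.json"
    atomic_write_json(f, {"/wip": "enabled"})
    print(json.loads(f.read_text(encoding="utf-8")))
```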
@@ -5,12 +5,14 @@ Provides cached equipment status from DW_MES_EQUIPMENTSTATUS_WIP_V.
 Data is synced periodically (default 5 minutes) and stored in Redis.
 """

-import json
-import logging
-import threading
-import time
-from datetime import datetime
-from typing import Any, Dict, List, Optional, Tuple
+import json
+import logging
+import os
+import threading
+import time
+from collections import OrderedDict
+from datetime import datetime
+from typing import Any

 from mes_dashboard.core.database import read_sql_df
 from mes_dashboard.core.redis_client import (
@@ -19,64 +21,110 @@ from mes_dashboard.core.redis_client import (
     try_acquire_lock,
     release_lock,
 )
-from mes_dashboard.config.constants import (
-    EQUIPMENT_STATUS_DATA_KEY,
-    EQUIPMENT_STATUS_INDEX_KEY,
-    EQUIPMENT_STATUS_META_UPDATED_KEY,
-    EQUIPMENT_STATUS_META_COUNT_KEY,
-    STATUS_CATEGORY_MAP,
-)
+from mes_dashboard.config.constants import (
+    EQUIPMENT_STATUS_DATA_KEY,
+    EQUIPMENT_STATUS_INDEX_KEY,
+    EQUIPMENT_STATUS_META_UPDATED_KEY,
+    EQUIPMENT_STATUS_META_COUNT_KEY,
+    STATUS_CATEGORY_MAP,
+)
+from mes_dashboard.services.sql_fragments import EQUIPMENT_STATUS_SELECT_SQL

-logger = logging.getLogger('mes_dashboard.realtime_equipment_cache')
-
-# ============================================================
-# Process-Level Cache (Prevents redundant JSON parsing)
-# ============================================================
-
-class _ProcessLevelCache:
-    """Thread-safe process-level cache for parsed equipment status data."""
-
-    def __init__(self, ttl_seconds: int = 30):
-        self._cache: Dict[str, Tuple[List[Dict[str, Any]], float]] = {}
-        self._lock = threading.Lock()
-        self._ttl = ttl_seconds
-
-    def get(self, key: str) -> Optional[List[Dict[str, Any]]]:
-        """Get cached data if not expired."""
-        with self._lock:
-            if key not in self._cache:
-                return None
-            data, timestamp = self._cache[key]
-            if time.time() - timestamp > self._ttl:
-                del self._cache[key]
-                return None
-            return data
-
-    def set(self, key: str, data: List[Dict[str, Any]]) -> None:
-        """Cache data with current timestamp."""
-        with self._lock:
-            self._cache[key] = (data, time.time())
-
-    def invalidate(self, key: str) -> None:
-        """Remove a key from cache."""
-        with self._lock:
-            self._cache.pop(key, None)
+logger = logging.getLogger('mes_dashboard.realtime_equipment_cache')
+
+# ============================================================
+# Process-Level Cache (Prevents redundant JSON parsing)
+# ============================================================
+
+DEFAULT_PROCESS_CACHE_TTL_SECONDS = 30
+DEFAULT_PROCESS_CACHE_MAX_SIZE = 32
+DEFAULT_LOOKUP_TTL_SECONDS = 30
+
+class _ProcessLevelCache:
+    """Thread-safe process-level cache for parsed equipment status data."""
+
+    def __init__(self, ttl_seconds: int = 30, max_size: int = 32):
+        self._cache: OrderedDict[str, tuple[list[dict[str, Any]], float]] = OrderedDict()
+        self._lock = threading.Lock()
+        self._ttl = max(int(ttl_seconds), 1)
+        self._max_size = max(int(max_size), 1)
+
+    @property
+    def max_size(self) -> int:
+        return self._max_size
+
+    def _evict_expired_locked(self, now: float) -> None:
+        stale_keys = [
+            key for key, (_, timestamp) in self._cache.items()
+            if now - timestamp > self._ttl
+        ]
+        for key in stale_keys:
+            self._cache.pop(key, None)
+
+    def get(self, key: str) -> list[dict[str, Any]] | None:
+        """Get cached data if not expired."""
+        with self._lock:
+            payload = self._cache.get(key)
+            if payload is None:
+                return None
+            data, timestamp = payload
+            now = time.time()
+            if now - timestamp > self._ttl:
+                self._cache.pop(key, None)
+                return None
+            self._cache.move_to_end(key, last=True)
+            return data
+
+    def set(self, key: str, data: list[dict[str, Any]]) -> None:
+        """Cache data with current timestamp."""
+        with self._lock:
+            now = time.time()
+            self._evict_expired_locked(now)
+            if key in self._cache:
+                self._cache.pop(key, None)
+            elif len(self._cache) >= self._max_size:
+                self._cache.popitem(last=False)
+            self._cache[key] = (data, now)
+            self._cache.move_to_end(key, last=True)
+
+    def invalidate(self, key: str) -> None:
+        """Remove a key from cache."""
+        with self._lock:
+            self._cache.pop(key, None)
+
+
+def _resolve_cache_max_size(env_name: str, default: int) -> int:
+    value = os.getenv(env_name)
+    if value is None:
+        return max(int(default), 1)
+    try:
+        return max(int(value), 1)
+    except (TypeError, ValueError):
+        return max(int(default), 1)


 # Global process-level cache for equipment status (30s TTL)
-_equipment_status_cache = _ProcessLevelCache(ttl_seconds=30)
+PROCESS_CACHE_MAX_SIZE = _resolve_cache_max_size("PROCESS_CACHE_MAX_SIZE", DEFAULT_PROCESS_CACHE_MAX_SIZE)
+EQUIPMENT_PROCESS_CACHE_MAX_SIZE = _resolve_cache_max_size(
+    "EQUIPMENT_PROCESS_CACHE_MAX_SIZE",
+    PROCESS_CACHE_MAX_SIZE,
+)
+_equipment_status_cache = _ProcessLevelCache(
+    ttl_seconds=DEFAULT_PROCESS_CACHE_TTL_SECONDS,
+    max_size=EQUIPMENT_PROCESS_CACHE_MAX_SIZE,
+)
 _equipment_status_parse_lock = threading.Lock()
+_equipment_lookup_lock = threading.Lock()
-_equipment_status_lookup: Dict[str, Dict[str, Any]] = {}
-_equipment_status_lookup_built_at: Optional[str] = None
+_equipment_status_lookup: dict[str, dict[str, Any]] = {}
+_equipment_status_lookup_built_at: str | None = None
+_equipment_status_lookup_ts: float = 0.0
||||
LOOKUP_TTL_SECONDS = 30
|
||||
LOOKUP_TTL_SECONDS = DEFAULT_LOOKUP_TTL_SECONDS
|
||||
|
||||
# ============================================================
|
||||
# Module State
|
||||
# ============================================================
|
||||
|
||||
_SYNC_THREAD: Optional[threading.Thread] = None
|
||||
_SYNC_THREAD: threading.Thread | None = None
|
||||
_STOP_EVENT = threading.Event()
|
||||
_SYNC_LOCK = threading.Lock()
|
||||
|
||||
@@ -85,40 +133,14 @@ _SYNC_LOCK = threading.Lock()
 # ============================================================
 # Oracle Query
 # ============================================================
 
-def _load_equipment_status_from_oracle() -> Optional[List[Dict[str, Any]]]:
+def _load_equipment_status_from_oracle() -> list[dict[str, Any]] | None:
     """Query DW_MES_EQUIPMENTSTATUS_WIP_V from Oracle.
 
     Returns:
         List of equipment status records, or None if query fails.
     """
-    sql = """
-        SELECT
-            RESOURCEID,
-            EQUIPMENTID,
-            OBJECTCATEGORY,
-            EQUIPMENTASSETSSTATUS,
-            EQUIPMENTASSETSSTATUSREASON,
-            JOBORDER,
-            JOBMODEL,
-            JOBSTAGE,
-            JOBID,
-            JOBSTATUS,
-            CREATEDATE,
-            CREATEUSERNAME,
-            CREATEUSER,
-            TECHNICIANUSERNAME,
-            TECHNICIANUSER,
-            SYMPTOMCODE,
-            CAUSECODE,
-            REPAIRCODE,
-            RUNCARDLOTID,
-            LOTTRACKINQTY_PCS,
-            LOTTRACKINTIME,
-            LOTTRACKINEMPLOYEE
-        FROM DWH.DW_MES_EQUIPMENTSTATUS_WIP_V
-    """
-    try:
-        df = read_sql_df(sql)
+    try:
+        df = read_sql_df(EQUIPMENT_STATUS_SELECT_SQL)
         if df is None or df.empty:
             logger.warning("No data returned from DW_MES_EQUIPMENTSTATUS_WIP_V")
             return []
@@ -147,7 +169,7 @@ def _load_equipment_status_from_oracle() -> Optional[List[Dict[str, Any]]]:
 # Data Aggregation
 # ============================================================
 
-def _classify_status(status: Optional[str]) -> str:
+def _classify_status(status: str | None) -> str:
     """Classify equipment status into category.
 
     Args:
@@ -183,7 +205,7 @@ def _is_valid_value(value) -> bool:
     return True
 
 
-def _aggregate_by_resourceid(records: List[Dict[str, Any]]) -> List[Dict[str, Any]]:
+def _aggregate_by_resourceid(records: list[dict[str, Any]]) -> list[dict[str, Any]]:
     """Aggregate equipment status records by RESOURCEID.
 
     For each RESOURCEID:
@@ -203,7 +225,7 @@ def _aggregate_by_resourceid(records: List[Dict[str, Any]]) -> List[Dict[str, An
         return []
 
     # Group by RESOURCEID
-    grouped: Dict[str, List[Dict[str, Any]]] = {}
+    grouped: dict[str, list[dict[str, Any]]] = {}
     for record in records:
         resource_id = record.get('RESOURCEID')
         if resource_id:
@@ -250,7 +272,7 @@ def _aggregate_by_resourceid(records: List[Dict[str, Any]]) -> List[Dict[str, An
 
         # Build aggregated record
         status = first.get('EQUIPMENTASSETSSTATUS')
-        aggregated.append({
+        aggregated.append({
             'RESOURCEID': resource_id,
             'EQUIPMENTID': first.get('EQUIPMENTID'),
             'OBJECTCATEGORY': first.get('OBJECTCATEGORY'),
@@ -270,11 +292,11 @@ def _aggregate_by_resourceid(records: List[Dict[str, Any]]) -> List[Dict[str, An
             'TECHNICIANUSER': first.get('TECHNICIANUSER'),
             'SYMPTOMCODE': first.get('SYMPTOMCODE'),
             'CAUSECODE': first.get('CAUSECODE'),
-            'REPAIRCODE': first.get('REPAIRCODE'),
-            # LOT related fields
-            'LOT_COUNT': len(seen_lots),  # Count distinct RUNCARDLOTID
-            'LOT_DETAILS': lot_details,  # LOT details for tooltip
-            'TOTAL_TRACKIN_QTY': total_qty,
+            'REPAIRCODE': first.get('REPAIRCODE'),
+            # LOT related fields
+            'LOT_COUNT': len(seen_lots) if seen_lots else len(group),
+            'LOT_DETAILS': lot_details,  # LOT details for tooltip
+            'TOTAL_TRACKIN_QTY': total_qty,
             'LATEST_TRACKIN_TIME': latest_trackin,
         })
 
@@ -286,7 +308,7 @@ def _aggregate_by_resourceid(records: List[Dict[str, Any]]) -> List[Dict[str, An
 # Redis Storage
 # ============================================================
 
-def _save_to_redis(aggregated: List[Dict[str, Any]]) -> bool:
+def _save_to_redis(aggregated: list[dict[str, Any]]) -> bool:
     """Save aggregated equipment status to Redis.
 
     Uses pipeline for atomic update of all keys.
@@ -354,7 +376,7 @@ def _invalidate_equipment_status_lookup() -> None:
    _equipment_status_lookup_ts = 0.0
 
 
-def get_equipment_status_lookup() -> Dict[str, Dict[str, Any]]:
+def get_equipment_status_lookup() -> dict[str, dict[str, Any]]:
     """Get RESOURCEID -> status record lookup with process-level caching."""
     global _equipment_status_lookup, _equipment_status_lookup_built_at, _equipment_status_lookup_ts
 
@@ -375,7 +397,7 @@ def get_equipment_status_lookup() -> Dict[str, Dict[str, Any]]:
         _equipment_status_lookup_ts = time.time()
         return _equipment_status_lookup
 
-def get_all_equipment_status() -> List[Dict[str, Any]]:
+def get_all_equipment_status() -> list[dict[str, Any]]:
     """Get all equipment status from cache with process-level caching.
 
     Uses a two-tier cache strategy:
@@ -433,7 +455,7 @@ def get_all_equipment_status() -> List[Dict[str, Any]]:
         return []
 
 
-def get_equipment_status_by_id(resource_id: str) -> Optional[Dict[str, Any]]:
+def get_equipment_status_by_id(resource_id: str) -> dict[str, Any] | None:
     """Get equipment status by RESOURCEID.
 
     Uses index hash for O(1) lookup.
@@ -485,7 +507,7 @@ def get_equipment_status_by_id(resource_id: str) -> Optional[Dict[str, Any]]:
     return None
 
 
-def get_equipment_status_by_ids(resource_ids: List[str]) -> List[Dict[str, Any]]:
+def get_equipment_status_by_ids(resource_ids: list[str]) -> list[dict[str, Any]]:
     """Get equipment status for multiple RESOURCEIDs.
 
     Args:
@@ -540,7 +562,7 @@ def get_equipment_status_by_ids(resource_ids: List[str]) -> List[Dict[str, Any]]
     return []
 
 
-def get_equipment_status_cache_status() -> Dict[str, Any]:
+def get_equipment_status_cache_status() -> dict[str, Any]:
     """Get equipment status cache status.
 
     Returns:
@@ -13,8 +13,9 @@ import logging
 import os
 import threading
 import time
+from collections import OrderedDict
 from datetime import datetime
-from typing import Any, Dict, List, Optional, Tuple
+from typing import Any
 
 import pandas as pd
 
@@ -31,9 +32,27 @@ from mes_dashboard.config.constants import (
     EQUIPMENT_TYPE_FILTER,
 )
 from mes_dashboard.sql import QueryBuilder
+from mes_dashboard.services.sql_fragments import (
+    RESOURCE_BASE_SELECT_TEMPLATE,
+    RESOURCE_VERSION_SELECT_TEMPLATE,
+)
 
 logger = logging.getLogger('mes_dashboard.resource_cache')
 
+ResourceRecord = dict[str, Any]
+RowPosition = int
+PositionBucket = dict[str, list[RowPosition]]
+FlagBuckets = dict[str, list[RowPosition]]
+ResourceIndex = dict[str, Any]
+
+DEFAULT_PROCESS_CACHE_TTL_SECONDS = 30
+DEFAULT_PROCESS_CACHE_MAX_SIZE = 32
+DEFAULT_RESOURCE_SYNC_INTERVAL_SECONDS = 14_400  # 4 hours
+DEFAULT_INDEX_VERSION_CHECK_INTERVAL_SECONDS = 5
+RESOURCE_DF_CACHE_KEY = "resource_data"
+TRUE_BUCKET = "1"
+FALSE_BUCKET = "0"
+
 # ============================================================
 # Process-Level Cache (Prevents redundant JSON parsing)
 # ============================================================
@@ -41,26 +60,49 @@ logger = logging.getLogger('mes_dashboard.resource_cache')
 class _ProcessLevelCache:
     """Thread-safe process-level cache for parsed DataFrames."""
 
-    def __init__(self, ttl_seconds: int = 30):
-        self._cache: Dict[str, Tuple[pd.DataFrame, float]] = {}
+    def __init__(self, ttl_seconds: int = DEFAULT_PROCESS_CACHE_TTL_SECONDS, max_size: int = DEFAULT_PROCESS_CACHE_MAX_SIZE):
+        self._cache: OrderedDict[str, tuple[pd.DataFrame, float]] = OrderedDict()
         self._lock = threading.Lock()
-        self._ttl = ttl_seconds
+        self._ttl = max(int(ttl_seconds), 1)
+        self._max_size = max(int(max_size), 1)
 
-    def get(self, key: str) -> Optional[pd.DataFrame]:
+    @property
+    def max_size(self) -> int:
+        return self._max_size
+
+    def _evict_expired_locked(self, now: float) -> None:
+        stale_keys = [
+            key for key, (_, timestamp) in self._cache.items()
+            if now - timestamp > self._ttl
+        ]
+        for key in stale_keys:
+            self._cache.pop(key, None)
+
+    def get(self, key: str) -> pd.DataFrame | None:
         """Get cached DataFrame if not expired."""
         with self._lock:
-            if key not in self._cache:
+            payload = self._cache.get(key)
+            if payload is None:
                 return None
-            df, timestamp = self._cache[key]
-            if time.time() - timestamp > self._ttl:
-                del self._cache[key]
+            df, timestamp = payload
+            now = time.time()
+            if now - timestamp > self._ttl:
+                self._cache.pop(key, None)
                 return None
+            self._cache.move_to_end(key, last=True)
             return df
 
     def set(self, key: str, df: pd.DataFrame) -> None:
         """Cache a DataFrame with current timestamp."""
         with self._lock:
-            self._cache[key] = (df, time.time())
+            now = time.time()
+            self._evict_expired_locked(now)
+            if key in self._cache:
+                self._cache.pop(key, None)
+            elif len(self._cache) >= self._max_size:
+                self._cache.popitem(last=False)
+            self._cache[key] = (df, now)
+            self._cache.move_to_end(key, last=True)
 
     def invalidate(self, key: str) -> None:
         """Remove a key from cache."""
@@ -68,11 +110,29 @@ class _ProcessLevelCache:
             self._cache.pop(key, None)
 
 
+def _resolve_cache_max_size(env_name: str, default: int) -> int:
+    value = os.getenv(env_name)
+    if value is None:
+        return max(int(default), 1)
+    try:
+        return max(int(value), 1)
+    except (TypeError, ValueError):
+        return max(int(default), 1)
+
+
 # Global process-level cache for resource data (30s TTL)
-_resource_df_cache = _ProcessLevelCache(ttl_seconds=30)
+PROCESS_CACHE_MAX_SIZE = _resolve_cache_max_size("PROCESS_CACHE_MAX_SIZE", DEFAULT_PROCESS_CACHE_MAX_SIZE)
+RESOURCE_PROCESS_CACHE_MAX_SIZE = _resolve_cache_max_size(
+    "RESOURCE_PROCESS_CACHE_MAX_SIZE",
+    PROCESS_CACHE_MAX_SIZE,
+)
+_resource_df_cache = _ProcessLevelCache(
+    ttl_seconds=DEFAULT_PROCESS_CACHE_TTL_SECONDS,
+    max_size=RESOURCE_PROCESS_CACHE_MAX_SIZE,
+)
 _resource_parse_lock = threading.Lock()
 _resource_index_lock = threading.Lock()
-_resource_index: Dict[str, Any] = {
+_resource_index: ResourceIndex = {
     "ready": False,
     "source": None,
     "version": None,
@@ -80,19 +140,27 @@ _resource_index: Dict[str, Any] = {
     "built_at": None,
     "version_checked_at": 0.0,
     "count": 0,
-    "records": [],
+    "all_positions": [],
     "by_resource_id": {},
     "by_workcenter": {},
     "by_family": {},
     "by_department": {},
     "by_location": {},
-    "by_is_production": {"1": [], "0": []},
-    "by_is_key": {"1": [], "0": []},
-    "by_is_monitor": {"1": [], "0": []},
+    "by_is_production": {TRUE_BUCKET: [], FALSE_BUCKET: []},
+    "by_is_key": {TRUE_BUCKET: [], FALSE_BUCKET: []},
+    "by_is_monitor": {TRUE_BUCKET: [], FALSE_BUCKET: []},
+    "memory": {
+        "frame_bytes": 0,
+        "index_bytes": 0,
+        "records_json_bytes": 0,
+        "bucket_entries": 0,
+        "amplification_ratio": 0.0,
+        "representation": "dataframe+row-index",
+    },
 }
 
 
-def _new_empty_index() -> Dict[str, Any]:
+def _new_empty_index() -> ResourceIndex:
     return {
         "ready": False,
         "source": None,
@@ -101,15 +169,23 @@ def _new_empty_index() -> Dict[str, Any]:
         "built_at": None,
         "version_checked_at": 0.0,
         "count": 0,
-        "records": [],
+        "all_positions": [],
         "by_resource_id": {},
         "by_workcenter": {},
         "by_family": {},
         "by_department": {},
         "by_location": {},
-        "by_is_production": {"1": [], "0": []},
-        "by_is_key": {"1": [], "0": []},
-        "by_is_monitor": {"1": [], "0": []},
+        "by_is_production": {TRUE_BUCKET: [], FALSE_BUCKET: []},
+        "by_is_key": {TRUE_BUCKET: [], FALSE_BUCKET: []},
+        "by_is_monitor": {TRUE_BUCKET: [], FALSE_BUCKET: []},
+        "memory": {
+            "frame_bytes": 0,
+            "index_bytes": 0,
+            "records_json_bytes": 0,
+            "bucket_entries": 0,
+            "amplification_ratio": 0.0,
+            "representation": "dataframe+row-index",
+        },
     }
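The `_resolve_cache_max_size` helper added above tolerates bad environment input: a missing or non-integer value falls back to the default, and any result is clamped to at least 1. A quick standalone check of those rules (the `DEMO_CACHE_MAX_SIZE` variable name is only for illustration):

```python
import os


def resolve_cache_max_size(env_name: str, default: int) -> int:
    # Same parsing rule as the diff's _resolve_cache_max_size:
    # fall back to the default on missing/invalid input, clamp to >= 1.
    value = os.getenv(env_name)
    if value is None:
        return max(int(default), 1)
    try:
        return max(int(value), 1)
    except (TypeError, ValueError):
        return max(int(default), 1)


os.environ["DEMO_CACHE_MAX_SIZE"] = "not-a-number"
print(resolve_cache_max_size("DEMO_CACHE_MAX_SIZE", 32))  # → 32 (fallback)
os.environ["DEMO_CACHE_MAX_SIZE"] = "0"
print(resolve_cache_max_size("DEMO_CACHE_MAX_SIZE", 32))  # → 1 (clamped)
```

Swallowing the `ValueError` keeps a typo in deployment config from crashing module import, at the cost of silently using the default.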
@@ -129,23 +205,59 @@ def _is_truthy_flag(value: Any) -> bool:
     return False
 
 
-def _bucket_append(bucket: Dict[str, List[Dict[str, Any]]], key: Any, record: Dict[str, Any]) -> None:
+def _bucket_append(bucket: PositionBucket, key: Any, row_position: RowPosition) -> None:
     if key is None:
         return
     if isinstance(key, float) and pd.isna(key):
         return
     key_str = str(key)
-    bucket.setdefault(key_str, []).append(record)
+    bucket.setdefault(key_str, []).append(int(row_position))
+
+
+def _estimate_dataframe_bytes(df: pd.DataFrame) -> int:
+    try:
+        return int(df.memory_usage(index=True, deep=True).sum())
+    except Exception:
+        return 0
+
+
+def _estimate_index_bytes(index: ResourceIndex) -> int:
+    """Estimate lightweight index memory footprint for telemetry."""
+    by_resource_id = index.get("by_resource_id", {})
+    by_workcenter = index.get("by_workcenter", {})
+    by_family = index.get("by_family", {})
+    by_department = index.get("by_department", {})
+    by_location = index.get("by_location", {})
+    by_is_production = index.get("by_is_production", {TRUE_BUCKET: [], FALSE_BUCKET: []})
+    by_is_key = index.get("by_is_key", {TRUE_BUCKET: [], FALSE_BUCKET: []})
+    by_is_monitor = index.get("by_is_monitor", {TRUE_BUCKET: [], FALSE_BUCKET: []})
+    all_positions = index.get("all_positions", [])
+
+    position_entries = (
+        len(all_positions)
+        + sum(len(v) for v in by_workcenter.values())
+        + sum(len(v) for v in by_family.values())
+        + sum(len(v) for v in by_department.values())
+        + sum(len(v) for v in by_location.values())
+        + len(by_is_production.get(TRUE_BUCKET, []))
+        + len(by_is_production.get(FALSE_BUCKET, []))
+        + len(by_is_key.get(TRUE_BUCKET, []))
+        + len(by_is_key.get(FALSE_BUCKET, []))
+        + len(by_is_monitor.get(TRUE_BUCKET, []))
+        + len(by_is_monitor.get(FALSE_BUCKET, []))
+    )
+    # Approximate integer/list/dict overhead; telemetry only needs directional signal.
+    return int(position_entries * 8 + len(by_resource_id) * 64)
 
 
 def _build_resource_index(
     df: pd.DataFrame,
     *,
     source: str,
-    version: Optional[str],
-    updated_at: Optional[str],
-) -> Dict[str, Any]:
-    records = df.to_dict(orient='records')
+    version: str | None,
+    updated_at: str | None,
+) -> ResourceIndex:
+    normalized_df = df.reset_index(drop=True)
     index = _new_empty_index()
     index["ready"] = True
     index["source"] = source
@@ -153,31 +265,58 @@ def _build_resource_index(
     index["updated_at"] = updated_at
     index["built_at"] = datetime.now().isoformat()
     index["version_checked_at"] = time.time()
-    index["count"] = len(records)
-    index["records"] = records
+    index["count"] = len(normalized_df)
+    index["all_positions"] = list(range(len(normalized_df)))
 
-    for record in records:
+    for row_position, record in normalized_df.iterrows():
         resource_id = record.get("RESOURCEID")
         if resource_id is not None and not (isinstance(resource_id, float) and pd.isna(resource_id)):
-            index["by_resource_id"][str(resource_id)] = record
+            index["by_resource_id"][str(resource_id)] = int(row_position)
 
-        _bucket_append(index["by_workcenter"], record.get("WORKCENTERNAME"), record)
-        _bucket_append(index["by_family"], record.get("RESOURCEFAMILYNAME"), record)
-        _bucket_append(index["by_department"], record.get("PJ_DEPARTMENT"), record)
-        _bucket_append(index["by_location"], record.get("LOCATIONNAME"), record)
+        _bucket_append(index["by_workcenter"], record.get("WORKCENTERNAME"), row_position)
+        _bucket_append(index["by_family"], record.get("RESOURCEFAMILYNAME"), row_position)
+        _bucket_append(index["by_department"], record.get("PJ_DEPARTMENT"), row_position)
+        _bucket_append(index["by_location"], record.get("LOCATIONNAME"), row_position)
 
-        index["by_is_production"]["1" if _is_truthy_flag(record.get("PJ_ISPRODUCTION")) else "0"].append(record)
-        index["by_is_key"]["1" if _is_truthy_flag(record.get("PJ_ISKEY")) else "0"].append(record)
-        index["by_is_monitor"]["1" if _is_truthy_flag(record.get("PJ_ISMONITOR")) else "0"].append(record)
+        index["by_is_production"][TRUE_BUCKET if _is_truthy_flag(record.get("PJ_ISPRODUCTION")) else FALSE_BUCKET].append(int(row_position))
+        index["by_is_key"][TRUE_BUCKET if _is_truthy_flag(record.get("PJ_ISKEY")) else FALSE_BUCKET].append(int(row_position))
+        index["by_is_monitor"][TRUE_BUCKET if _is_truthy_flag(record.get("PJ_ISMONITOR")) else FALSE_BUCKET].append(int(row_position))
 
+    bucket_entries = (
+        sum(len(v) for v in index["by_workcenter"].values())
+        + sum(len(v) for v in index["by_family"].values())
+        + sum(len(v) for v in index["by_department"].values())
+        + sum(len(v) for v in index["by_location"].values())
+        + len(index["by_is_production"][TRUE_BUCKET])
+        + len(index["by_is_production"][FALSE_BUCKET])
+        + len(index["by_is_key"][TRUE_BUCKET])
+        + len(index["by_is_key"][FALSE_BUCKET])
+        + len(index["by_is_monitor"][TRUE_BUCKET])
+        + len(index["by_is_monitor"][FALSE_BUCKET])
+    )
+    frame_bytes = _estimate_dataframe_bytes(normalized_df)
+    index_bytes = _estimate_index_bytes(index)
+    amplification_ratio = round(
+        (frame_bytes + index_bytes) / max(frame_bytes, 1),
+        4,
+    )
+    index["memory"] = {
+        "frame_bytes": int(frame_bytes),
+        "index_bytes": int(index_bytes),
+        "records_json_bytes": 0,  # kept for backward-compatible telemetry shape
+        "bucket_entries": int(bucket_entries),
+        "amplification_ratio": amplification_ratio,
+        "representation": "dataframe+row-index",
+    }
 
     return index
 
 
 def _index_matches(
-    current: Dict[str, Any],
+    current: ResourceIndex,
     *,
     source: str,
-    version: Optional[str],
+    version: str | None,
     row_count: int,
 ) -> bool:
     if not current.get("ready"):
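`_build_resource_index` above reports an amplification ratio of (frame bytes + index bytes) / frame bytes, rounded to four places, so health telemetry can flag when the derived index starts costing a noticeable fraction of the DataFrame it serves. The arithmetic in isolation:

```python
def amplification_ratio(frame_bytes: int, index_bytes: int) -> float:
    # Guard against a zero-byte frame with max(frame_bytes, 1),
    # matching the rounding and guard used in _build_resource_index.
    return round((frame_bytes + index_bytes) / max(frame_bytes, 1), 4)


# A 1 MB frame with a 250 KB index amplifies memory by 1.25x.
print(amplification_ratio(1_000_000, 250_000))  # → 1.25
```

A ratio near 1.0 means the index is nearly free; the benchmark gate mentioned in the commit history can then assert an upper bound on this number.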
@@ -193,8 +332,8 @@ def _ensure_resource_index(
     df: pd.DataFrame,
     *,
     source: str,
-    version: Optional[str] = None,
-    updated_at: Optional[str] = None,
+    version: str | None = None,
+    updated_at: str | None = None,
 ) -> None:
     global _resource_index
     with _resource_index_lock:
@@ -212,12 +351,12 @@ def _ensure_resource_index(
         _resource_index = new_index
 
 
-def _get_resource_index() -> Dict[str, Any]:
+def _get_resource_index() -> ResourceIndex:
     with _resource_index_lock:
         return _resource_index
 
 
-def _get_cache_meta(client=None) -> Tuple[Optional[str], Optional[str]]:
+def _get_cache_meta(client=None) -> tuple[str | None, str | None]:
     redis_client = client or get_redis_client()
     if redis_client is None:
         return None, None
@@ -244,31 +383,59 @@ def _redis_data_available(client=None) -> bool:
     return False
 
 
-def _pick_bucket_records(
-    bucket: Dict[str, List[Dict[str, Any]]],
-    keys: List[Any],
-) -> List[Dict[str, Any]]:
-    seen: set[str] = set()
-    result: List[Dict[str, Any]] = []
+def _pick_bucket_positions(
+    bucket: PositionBucket,
+    keys: list[Any],
+) -> list[RowPosition]:
+    seen: set[int] = set()
+    result: list[int] = []
     for key in keys:
-        for record in bucket.get(str(key), []):
-            rid = record.get("RESOURCEID")
-            rid_key = str(rid) if rid is not None else str(id(record))
-            if rid_key in seen:
+        for row_position in bucket.get(str(key), []):
+            normalized = int(row_position)
+            if normalized in seen:
                 continue
-            seen.add(rid_key)
-            result.append(record)
+            seen.add(normalized)
+            result.append(normalized)
     return result
 
 
+def _records_from_positions(df: pd.DataFrame, positions: list[RowPosition]) -> list[ResourceRecord]:
+    if not positions:
+        return []
+    unique_positions = sorted({int(pos) for pos in positions if 0 <= int(pos) < len(df)})
+    if not unique_positions:
+        return []
+    return df.iloc[unique_positions].to_dict(orient='records')
+
+
+def _records_from_index(index: ResourceIndex, positions: list[RowPosition] | None = None) -> list[ResourceRecord]:
+    if not index.get("ready"):
+        return []
+    df = _resource_df_cache.get(RESOURCE_DF_CACHE_KEY)
+    if df is None:
+        legacy_records = index.get("records")
+        if isinstance(legacy_records, list):
+            if positions is None:
+                return list(legacy_records)
+            selected = [legacy_records[int(pos)] for pos in positions if 0 <= int(pos) < len(legacy_records)]
+            return selected
+        return []
+    selected_positions = positions if positions is not None else index.get("all_positions", [])
+    if not selected_positions:
+        selected_positions = list(range(len(df)))
+    return _records_from_positions(df, selected_positions)
+
 # ============================================================
 # Configuration
 # ============================================================
 
 RESOURCE_CACHE_ENABLED = os.getenv('RESOURCE_CACHE_ENABLED', 'true').lower() == 'true'
-RESOURCE_SYNC_INTERVAL = int(os.getenv('RESOURCE_SYNC_INTERVAL', '14400'))  # 4 hours
+RESOURCE_SYNC_INTERVAL = int(
+    os.getenv('RESOURCE_SYNC_INTERVAL', str(DEFAULT_RESOURCE_SYNC_INTERVAL_SECONDS))
+)
 RESOURCE_INDEX_VERSION_CHECK_INTERVAL = int(
-    os.getenv('RESOURCE_INDEX_VERSION_CHECK_INTERVAL', '5')
-)  # seconds
+    os.getenv('RESOURCE_INDEX_VERSION_CHECK_INTERVAL', str(DEFAULT_INDEX_VERSION_CHECK_INTERVAL_SECONDS))
+)
 
 # Redis key helpers
 def _get_key(key: str) -> str:
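After this change the buckets hold integer row positions rather than record dicts, so deduplication becomes a cheap set-of-ints check instead of comparing RESOURCEID strings. A standalone sketch of the same first-seen-order dedup, with a hypothetical `bucket` payload:

```python
def pick_bucket_positions(bucket: dict[str, list[int]], keys: list[object]) -> list[int]:
    # Mirrors _pick_bucket_positions: preserve first-seen order while
    # dropping positions already emitted for an earlier key.
    seen: set[int] = set()
    result: list[int] = []
    for key in keys:
        for pos in bucket.get(str(key), []):
            if pos in seen:
                continue
            seen.add(pos)
            result.append(pos)
    return result


# Row 1 belongs to both workcenters but is returned only once.
bucket = {"W1": [0, 1], "W2": [1, 2]}
print(pick_bucket_positions(bucket, ["W1", "W2"]))  # → [0, 1, 2]
```

The resulting positions can then be turned back into records in a single `df.iloc[...]` call, which is the job of `_records_from_positions` above.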
@@ -313,14 +480,14 @@ def _build_filter_builder() -> QueryBuilder:
     return builder
 
 
-def _load_from_oracle() -> Optional[pd.DataFrame]:
+def _load_from_oracle() -> pd.DataFrame | None:
     """Load the full table from Oracle (applying global filters).
 
     Returns:
         DataFrame with all columns, or None if query failed.
     """
     builder = _build_filter_builder()
-    builder.base_sql = "SELECT * FROM DWH.DW_MES_RESOURCE {{ WHERE_CLAUSE }}"
+    builder.base_sql = RESOURCE_BASE_SELECT_TEMPLATE
     sql, params = builder.build()
 
     try:
@@ -333,14 +500,14 @@ def _load_from_oracle() -> Optional[pd.DataFrame]:
     return None
 
 
-def _get_version_from_oracle() -> Optional[str]:
+def _get_version_from_oracle() -> str | None:
     """Get the Oracle data version (MAX(LASTCHANGEDATE)).
 
     Returns:
         Version string (ISO format), or None if query failed.
     """
     builder = _build_filter_builder()
-    builder.base_sql = "SELECT MAX(LASTCHANGEDATE) as VERSION FROM DWH.DW_MES_RESOURCE {{ WHERE_CLAUSE }}"
+    builder.base_sql = RESOURCE_VERSION_SELECT_TEMPLATE
     sql, params = builder.build()
 
     try:
@@ -361,7 +528,7 @@ def _get_version_from_oracle() -> Optional[str]:
 # Internal: Redis Functions
 # ============================================================
 
-def _get_version_from_redis() -> Optional[str]:
+def _get_version_from_redis() -> str | None:
     """Get the cached version from Redis.
 
     Returns:
@@ -411,7 +578,7 @@ def _sync_to_redis(df: pd.DataFrame, version: str) -> bool:
         pipe.execute()
 
         # Invalidate process-level cache so next request picks up new data
-        _resource_df_cache.invalidate("resource_data")
+        _resource_df_cache.invalidate(RESOURCE_DF_CACHE_KEY)
         _invalidate_resource_index()
 
         logger.info(f"Resource cache synced: {len(df)} rows, version={version}")
@@ -421,7 +588,7 @@ def _sync_to_redis(df: pd.DataFrame, version: str) -> bool:
         return False
 
 
-def _get_cached_data() -> Optional[pd.DataFrame]:
+def _get_cached_data() -> pd.DataFrame | None:
     """Get cached resource data from Redis with process-level caching.
 
     Uses a two-tier cache strategy:
@@ -433,21 +600,25 @@ def _get_cached_data() -> Optional[pd.DataFrame]:
     Returns:
         DataFrame with resource data, or None if cache miss.
     """
-    cache_key = "resource_data"
+    cache_key = RESOURCE_DF_CACHE_KEY
 
     # Tier 1: Check process-level cache first (fast path)
     cached_df = _resource_df_cache.get(cache_key)
     if cached_df is not None:
-        if not _get_resource_index().get("ready"):
-            version, updated_at = _get_cache_meta()
-            _ensure_resource_index(
-                cached_df,
-                source="redis",
-                version=version,
-                updated_at=updated_at,
-            )
-        logger.debug(f"Process cache hit: {len(cached_df)} rows")
-        return cached_df
+        if REDIS_ENABLED and RESOURCE_CACHE_ENABLED and not _redis_data_available():
+            _resource_df_cache.invalidate(cache_key)
+            _invalidate_resource_index()
+        else:
+            if not _get_resource_index().get("ready"):
+                version, updated_at = _get_cache_meta()
+                _ensure_resource_index(
+                    cached_df,
+                    source="redis",
+                    version=version,
+                    updated_at=updated_at,
+                )
+            logger.debug(f"Process cache hit: {len(cached_df)} rows")
+            return cached_df
 
     # Tier 2: Parse from Redis (slow path - needs lock)
     if not REDIS_ENABLED or not RESOURCE_CACHE_ENABLED:
@@ -568,7 +739,7 @@ def init_cache() -> None:
         logger.error(f"Failed to init resource cache: {e}")
 
 
-def get_cache_status() -> Dict[str, Any]:
+def get_cache_status() -> dict[str, Any]:
     """Get cache status information.
 
     Returns:
@@ -611,9 +782,10 @@ def get_cache_status() -> Dict[str, Any]:
 # Query API
 # ============================================================
 
-def get_resource_index_status() -> Dict[str, Any]:
+def get_resource_index_status() -> dict[str, Any]:
     """Get process-level derived index telemetry."""
     index = _get_resource_index()
+    memory = index.get("memory") or {}
     built_at = index.get("built_at")
     age_seconds = None
     if built_at:
@@ -630,19 +802,32 @@ def get_resource_index_status() -> Dict[str, Any]:
|
||||
"built_at": built_at,
|
||||
"count": int(index.get("count", 0)),
|
||||
"age_seconds": round(age_seconds, 3) if age_seconds is not None else None,
|
||||
"memory": {
|
||||
"frame_bytes": int(memory.get("frame_bytes", 0)),
|
||||
"index_bytes": int(memory.get("index_bytes", 0)),
|
||||
"records_json_bytes": int(memory.get("records_json_bytes", 0)),
|
||||
"bucket_entries": int(memory.get("bucket_entries", 0)),
|
||||
"amplification_ratio": float(memory.get("amplification_ratio", 0.0)),
|
||||
"representation": str(memory.get("representation", "unknown")),
|
||||
},
|
||||
}
|
||||
|
||||
|
||||
def get_resource_index_snapshot() -> Dict[str, Any]:
def get_resource_index_snapshot() -> ResourceIndex:
    """Get derived resource index snapshot, rebuilding if needed."""
    index = _get_resource_index()
    if index.get("ready"):
        if index.get("source") == "redis":
            if not _redis_data_available():
                _resource_df_cache.invalidate(RESOURCE_DF_CACHE_KEY)
                _invalidate_resource_index()
                index = _get_resource_index()

    # If Redis metadata version is missing, verify payload existence on every call.
    # This avoids serving stale in-process index when Redis payload is evicted.
    if not index.get("version"):
    if index.get("ready") and not index.get("version"):
        if not _redis_data_available():
            _resource_df_cache.invalidate("resource_data")
            _resource_df_cache.invalidate(RESOURCE_DF_CACHE_KEY)
            _invalidate_resource_index()
            index = _get_resource_index()
    else:
@@ -661,7 +846,7 @@ def get_resource_index_snapshot() -> Dict[str, Any]:
                current_version,
                latest_version,
            )
            _resource_df_cache.invalidate("resource_data")
            _resource_df_cache.invalidate(RESOURCE_DF_CACHE_KEY)
            _invalidate_resource_index()
            index = _get_resource_index()
        else:
@@ -678,6 +863,7 @@ def get_resource_index_snapshot() -> Dict[str, Any]:

    df = _get_cached_data()
    if df is not None:
        _resource_df_cache.set(RESOURCE_DF_CACHE_KEY, df.reset_index(drop=True))
        version, updated_at = _get_cache_meta()
        _ensure_resource_index(
            df,
@@ -690,6 +876,8 @@ def get_resource_index_snapshot() -> Dict[str, Any]:
    logger.info("Resource cache miss while building index, falling back to Oracle")
    oracle_df = _load_from_oracle()
    if oracle_df is None:
        _resource_df_cache.invalidate(RESOURCE_DF_CACHE_KEY)
        _invalidate_resource_index()
        return _new_empty_index()

    _ensure_resource_index(
@@ -698,9 +886,11 @@ def get_resource_index_snapshot() -> Dict[str, Any]:
        version=None,
        updated_at=datetime.now().isoformat(),
    )
    _resource_df_cache.set(RESOURCE_DF_CACHE_KEY, oracle_df.reset_index(drop=True))
    return _get_resource_index()

def get_all_resources() -> List[Dict]:
def get_all_resources() -> list[ResourceRecord]:
    """Get all cached resource records (all columns).

    Falls back to Oracle if cache unavailable.
@@ -709,11 +899,10 @@ def get_all_resources() -> List[Dict]:
        List of resource dicts.
    """
    index = get_resource_index_snapshot()
    records = index.get("records", [])
    return list(records)
    return _records_from_index(index)


def get_resource_by_id(resource_id: str) -> Optional[Dict]:
def get_resource_by_id(resource_id: str) -> ResourceRecord | None:
    """Get a single resource record by RESOURCEID.

    Args:
@@ -725,10 +914,12 @@ def get_resource_by_id(resource_id: str) -> Optional[Dict]:
    if not resource_id:
        return None
    index = get_resource_index_snapshot()
    by_id = index.get("by_resource_id", {})
    row = by_id.get(str(resource_id))
    if row is not None:
        return row
    by_id: dict[str, RowPosition] = index.get("by_resource_id", {})
    row_position = by_id.get(str(resource_id))
    if row_position is not None:
        rows = _records_from_index(index, [int(row_position)])
        if rows:
            return rows[0]

    # Backward-compatible fallback for call sites/tests that patch get_all_resources.
    target = str(resource_id)
@@ -738,7 +929,7 @@ def get_resource_by_id(resource_id: str) -> Optional[Dict]:
    return None


def get_resources_by_ids(resource_ids: List[str]) -> List[Dict]:
def get_resources_by_ids(resource_ids: list[str]) -> list[ResourceRecord]:
    """Batch-fetch resource records by a list of RESOURCEIDs.

    Args:
@@ -747,20 +938,28 @@ def get_resources_by_ids(resource_ids: List[str]) -> List[Dict]:
    Returns:
        List of matching resource dicts.
    """
    index = get_resource_index_snapshot()
    by_id: dict[str, RowPosition] = index.get("by_resource_id", {})
    positions = [by_id[str(resource_id)] for resource_id in resource_ids if str(resource_id) in by_id]
    if positions:
        rows = _records_from_index(index, positions)
        if rows:
            return rows

    # Backward-compatible fallback for call sites/tests that patch get_all_resources.
    id_set = set(resource_ids)
    resources = get_all_resources()
    return [r for r in resources if r.get('RESOURCEID') in id_set]
    return [r for r in get_all_resources() if r.get('RESOURCEID') in id_set]

def get_resources_by_filter(
    workcenters: Optional[List[str]] = None,
    families: Optional[List[str]] = None,
    departments: Optional[List[str]] = None,
    locations: Optional[List[str]] = None,
    is_production: Optional[bool] = None,
    is_key: Optional[bool] = None,
    is_monitor: Optional[bool] = None,
) -> List[Dict]:
    workcenters: list[str] | None = None,
    families: list[str] | None = None,
    departments: list[str] | None = None,
    locations: list[str] | None = None,
    is_production: bool | None = None,
    is_key: bool | None = None,
    is_monitor: bool | None = None,
) -> list[ResourceRecord]:
    """Filter resource records by criteria (filtered in Python).

    Args:
@@ -775,42 +974,79 @@ def get_resources_by_filter(
    Returns:
        List of matching resource dicts.
    """
    resources = get_all_resources()

    result = []
    for r in resources:
        # Apply filters
        if workcenters and r.get('WORKCENTERNAME') not in workcenters:
            continue
        if families and r.get('RESOURCEFAMILYNAME') not in families:
            continue
        if departments and r.get('PJ_DEPARTMENT') not in departments:
            continue
        if locations and r.get('LOCATIONNAME') not in locations:
            continue
        if is_production is not None:
            val = r.get('PJ_ISPRODUCTION')
            if (val == 1) != is_production:
                continue
        if is_key is not None:
            val = r.get('PJ_ISKEY')
            if (val == 1) != is_key:
                continue
        if is_monitor is not None:
            val = r.get('PJ_ISMONITOR')
            if (val == 1) != is_monitor:
                continue
        result.append(r)

    return result
    def _filter_from_records(resources: list[ResourceRecord]) -> list[ResourceRecord]:
        result: list[ResourceRecord] = []
        for r in resources:
            if workcenters and r.get('WORKCENTERNAME') not in workcenters:
                continue
            if families and r.get('RESOURCEFAMILYNAME') not in families:
                continue
            if departments and r.get('PJ_DEPARTMENT') not in departments:
                continue
            if locations and r.get('LOCATIONNAME') not in locations:
                continue
            if is_production is not None and (r.get('PJ_ISPRODUCTION') == 1) != is_production:
                continue
            if is_key is not None and (r.get('PJ_ISKEY') == 1) != is_key:
                continue
            if is_monitor is not None and (r.get('PJ_ISMONITOR') == 1) != is_monitor:
                continue
            result.append(r)
        return result

    index = get_resource_index_snapshot()
    if not index.get("ready"):
        return _filter_from_records(get_all_resources())
    if _resource_df_cache.get(RESOURCE_DF_CACHE_KEY) is None:
        return _filter_from_records(get_all_resources())

    candidate_positions: set[int] = set(int(pos) for pos in index.get("all_positions", []))
    if not candidate_positions:
        return []

    def _intersect_with_positions(selected: list[int] | None) -> None:
        nonlocal candidate_positions
        if selected is None:
            return
        candidate_positions &= set(int(item) for item in selected)

    if workcenters:
        _intersect_with_positions(
            _pick_bucket_positions(index.get("by_workcenter", {}), workcenters)
        )
    if families:
        _intersect_with_positions(
            _pick_bucket_positions(index.get("by_family", {}), families)
        )
    if departments:
        _intersect_with_positions(
            _pick_bucket_positions(index.get("by_department", {}), departments)
        )
    if locations:
        _intersect_with_positions(
            _pick_bucket_positions(index.get("by_location", {}), locations)
        )
    if is_production is not None:
        _intersect_with_positions(
            index.get("by_is_production", {}).get(TRUE_BUCKET if is_production else FALSE_BUCKET, [])
        )
    if is_key is not None:
        _intersect_with_positions(
            index.get("by_is_key", {}).get(TRUE_BUCKET if is_key else FALSE_BUCKET, [])
        )
    if is_monitor is not None:
        _intersect_with_positions(
            index.get("by_is_monitor", {}).get(TRUE_BUCKET if is_monitor else FALSE_BUCKET, [])
        )

    return _records_from_index(index, sorted(candidate_positions))

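The bucket-intersection strategy in `get_resources_by_filter` can be sketched with plain dicts and sets. The record fields follow the service's schema; the index shapes and helper names here are illustrative, not the project's real `ResourceIndex`:

```python
# Build value -> row-position buckets once (the "index"), then narrow a
# candidate set with set intersection instead of scanning every record.
records = [
    {"RESOURCEID": "R1", "WORKCENTERNAME": "WC-A", "PJ_ISKEY": 1},
    {"RESOURCEID": "R2", "WORKCENTERNAME": "WC-A", "PJ_ISKEY": 0},
    {"RESOURCEID": "R3", "WORKCENTERNAME": "WC-B", "PJ_ISKEY": 1},
]

by_workcenter: dict[str, list[int]] = {}
by_is_key: dict[bool, list[int]] = {}
for pos, rec in enumerate(records):
    by_workcenter.setdefault(rec["WORKCENTERNAME"], []).append(pos)
    by_is_key.setdefault(rec["PJ_ISKEY"] == 1, []).append(pos)

def filter_positions(workcenters=None, is_key=None):
    # Start from all positions; each active filter shrinks the set.
    candidates = set(range(len(records)))
    if workcenters:
        selected: set[int] = set()
        for wc in workcenters:
            selected.update(by_workcenter.get(wc, []))
        candidates &= selected
    if is_key is not None:
        candidates &= set(by_is_key.get(is_key, []))
    return [records[pos] for pos in sorted(candidates)]

print([r["RESOURCEID"] for r in filter_positions(workcenters=["WC-A"], is_key=True)])
# → ['R1']
```

The sorted-position materialization at the end mirrors `_records_from_index(index, sorted(candidate_positions))`, which keeps output order stable regardless of filter order.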
# ============================================================
# Distinct Values API (for filters)
# ============================================================

def get_distinct_values(column: str) -> List[str]:
def get_distinct_values(column: str) -> list[str]:
    """Get the sorted distinct values of a column.

    Args:
@@ -833,26 +1069,26 @@ def get_distinct_values(column: str) -> List[str]:
    return sorted(values)


def get_resource_families() -> List[str]:
def get_resource_families() -> list[str]:
    """Get the resource family list (convenience helper)."""
    return get_distinct_values('RESOURCEFAMILYNAME')


def get_workcenters() -> List[str]:
def get_workcenters() -> list[str]:
    """Get the workcenter list (convenience helper)."""
    return get_distinct_values('WORKCENTERNAME')


def get_departments() -> List[str]:
def get_departments() -> list[str]:
    """Get the department list (convenience helper)."""
    return get_distinct_values('PJ_DEPARTMENT')


def get_locations() -> List[str]:
def get_locations() -> list[str]:
    """Get the location list (convenience helper)."""
    return get_distinct_values('LOCATIONNAME')


def get_vendors() -> List[str]:
def get_vendors() -> list[str]:
    """Get the vendor list (convenience helper)."""
    return get_distinct_values('VENDORNAME')

46  src/mes_dashboard/services/sql_fragments.py  Normal file
@@ -0,0 +1,46 @@
# -*- coding: utf-8 -*-
"""Shared SQL fragments/constants for cache-oriented services.

Centralizing common Oracle table/view references reduces drift across
resource/equipment cache implementations.
"""

from __future__ import annotations

RESOURCE_TABLE = "DWH.DW_MES_RESOURCE"
RESOURCE_BASE_SELECT_TEMPLATE = f"SELECT * FROM {RESOURCE_TABLE} {{WHERE_CLAUSE}}"
RESOURCE_VERSION_SELECT_TEMPLATE = (
    f"SELECT MAX(LASTCHANGEDATE) as VERSION FROM {RESOURCE_TABLE} {{WHERE_CLAUSE}}"
)

EQUIPMENT_STATUS_VIEW = "DWH.DW_MES_EQUIPMENTSTATUS_WIP_V"
EQUIPMENT_STATUS_COLUMNS: tuple[str, ...] = (
    "RESOURCEID",
    "EQUIPMENTID",
    "OBJECTCATEGORY",
    "EQUIPMENTASSETSSTATUS",
    "EQUIPMENTASSETSSTATUSREASON",
    "JOBORDER",
    "JOBMODEL",
    "JOBSTAGE",
    "JOBID",
    "JOBSTATUS",
    "CREATEDATE",
    "CREATEUSERNAME",
    "CREATEUSER",
    "TECHNICIANUSERNAME",
    "TECHNICIANUSER",
    "SYMPTOMCODE",
    "CAUSECODE",
    "REPAIRCODE",
    "RUNCARDLOTID",
    "LOTTRACKINQTY_PCS",
    "LOTTRACKINTIME",
    "LOTTRACKINEMPLOYEE",
)

EQUIPMENT_STATUS_SELECT_SQL = (
    "SELECT\n    "
    + ",\n    ".join(EQUIPMENT_STATUS_COLUMNS)
    + f"\nFROM {EQUIPMENT_STATUS_VIEW}"
)
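A note on the template constants: inside an f-string, `{{WHERE_CLAUSE}}` renders as the literal text `{WHERE_CLAUSE}`, leaving a placeholder for a later `str.format()` call. A minimal sketch, assuming that is how callers consume the templates (the `WHERE` clause below is illustrative; value binding is assumed to happen via DB parameters, not string interpolation):

```python
# Mirror of the module's constants; the escaped braces survive the f-string.
RESOURCE_TABLE = "DWH.DW_MES_RESOURCE"
RESOURCE_BASE_SELECT_TEMPLATE = f"SELECT * FROM {RESOURCE_TABLE} {{WHERE_CLAUSE}}"

# Fill the placeholder at query time.
sql = RESOURCE_BASE_SELECT_TEMPLATE.format(WHERE_CLAUSE="WHERE PJ_ISPRODUCTION = :flag")
print(sql)
# → SELECT * FROM DWH.DW_MES_RESOURCE WHERE PJ_ISPRODUCTION = :flag
```

This is also why the placeholder must contain no spaces (`{ WHERE_CLAUSE }` would not be a valid `str.format` field name).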
@@ -9,6 +9,7 @@ Now uses Redis cache when available, with fallback to Oracle direct query.

import logging
import threading
from collections import Counter
from datetime import datetime
from typing import Optional, Dict, List, Any

@@ -32,6 +33,20 @@ logger = logging.getLogger('mes_dashboard.wip_service')

_wip_search_index_lock = threading.Lock()
_wip_search_index_cache: Dict[str, Dict[str, Any]] = {}
_wip_snapshot_lock = threading.Lock()
_wip_snapshot_cache: Dict[str, Dict[str, Any]] = {}
_wip_index_metrics_lock = threading.Lock()
_wip_index_metrics: Dict[str, Any] = {
    "snapshot_hits": 0,
    "snapshot_misses": 0,
    "search_index_hits": 0,
    "search_index_misses": 0,
    "search_index_rebuilds": 0,
    "search_index_incremental_updates": 0,
    "search_index_reconciliation_fallbacks": 0,
}

_EMPTY_INT_INDEX = np.array([], dtype=np.int64)


def _safe_value(val):
@@ -153,29 +168,373 @@ def _get_wip_cache_version() -> str:
    return f"{updated_at}|{sys_date}"

def _distinct_sorted_values(df: pd.DataFrame, column: str) -> List[str]:
    if column not in df.columns:
        return []
    series = df[column].dropna().astype(str)
    if series.empty:
        return []
    series = series[series.str.len() > 0]
    if series.empty:
        return []
    return series.drop_duplicates().sort_values().tolist()
def _increment_wip_metric(metric: str, value: int = 1) -> None:
    with _wip_index_metrics_lock:
        _wip_index_metrics[metric] = int(_wip_index_metrics.get(metric, 0)) + value


def _estimate_dataframe_bytes(df: pd.DataFrame) -> int:
    if df is None:
        return 0
    try:
        return int(df.memory_usage(index=True, deep=True).sum())
    except Exception:
        return 0


def _estimate_counter_payload_bytes(counter: Counter) -> int:
    total = 0
    for key, count in counter.items():
        total += len(str(key)) + 16 + int(count)
    return total


def _normalize_text_value(value: Any) -> str:
    if value is None:
        return ""
    if isinstance(value, float) and pd.isna(value):
        return ""
    text = str(value).strip()
    return text


def _build_filter_mask(
    df: pd.DataFrame,
    *,
    include_dummy: bool,
    workorder: Optional[str] = None,
    lotid: Optional[str] = None,
) -> pd.Series:
    if df.empty:
        return pd.Series(dtype=bool)

    mask = df['WORKORDER'].notna()

    if not include_dummy and 'LOTID' in df.columns:
        mask &= ~df['LOTID'].astype(str).str.contains('DUMMY', case=False, na=False)

    if workorder and 'WORKORDER' in df.columns:
        mask &= df['WORKORDER'].astype(str).str.contains(workorder, case=False, na=False)

    if lotid and 'LOTID' in df.columns:
        mask &= df['LOTID'].astype(str).str.contains(lotid, case=False, na=False)

    return mask


def _build_value_index(df: pd.DataFrame, column: str) -> Dict[str, np.ndarray]:
    if column not in df.columns or df.empty:
        return {}
    grouped = df.groupby(column, dropna=True, sort=False).indices
    return {str(key): np.asarray(indices, dtype=np.int64) for key, indices in grouped.items()}


def _intersect_positions(current: Optional[np.ndarray], candidate: Optional[np.ndarray]) -> np.ndarray:
    if candidate is None:
        return _EMPTY_INT_INDEX
    if current is None:
        return candidate
    if len(current) == 0 or len(candidate) == 0:
        return _EMPTY_INT_INDEX
    return np.intersect1d(current, candidate, assume_unique=False)

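The pair `_build_value_index` / `_intersect_positions` rests on two building blocks: pandas `groupby(...).indices`, which maps each value to an array of row positions, and `np.intersect1d`, which narrows candidates. A minimal sketch with illustrative column values:

```python
import numpy as np
import pandas as pd

df = pd.DataFrame({
    "WORKCENTER_GROUP": ["A", "A", "B", "A"],
    "WIP_STATUS": ["RUN", "HOLD", "RUN", "RUN"],
})

def build_value_index(frame: pd.DataFrame, column: str) -> dict:
    # groupby(...).indices yields {value: positional row indices}.
    grouped = frame.groupby(column, dropna=True, sort=False).indices
    return {str(k): np.asarray(v, dtype=np.int64) for k, v in grouped.items()}

by_wc = build_value_index(df, "WORKCENTER_GROUP")
by_status = build_value_index(df, "WIP_STATUS")

# Narrow to rows that are both in workcenter "A" and in status "RUN".
positions = np.intersect1d(by_wc["A"], by_status["RUN"])
print(positions.tolist())
# → [0, 3]
```

`df.iloc[positions]` then materializes the filtered frame without re-scanning columns, which is the path `_select_with_snapshot_indexes` takes below.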
def _select_with_snapshot_indexes(
    include_dummy: bool = False,
    workorder: Optional[str] = None,
    lotid: Optional[str] = None,
    package: Optional[str] = None,
    pj_type: Optional[str] = None,
    workcenter: Optional[str] = None,
    status: Optional[str] = None,
    hold_type: Optional[str] = None,
) -> Optional[pd.DataFrame]:
    snapshot = _get_wip_snapshot(include_dummy=include_dummy)
    if snapshot is None:
        return None

    df = snapshot["frame"]
    indexes = snapshot["indexes"]
    selected_positions: Optional[np.ndarray] = None

    if workcenter:
        selected_positions = _intersect_positions(
            selected_positions,
            indexes["workcenter"].get(str(workcenter)),
        )
    if package:
        selected_positions = _intersect_positions(
            selected_positions,
            indexes["package"].get(str(package)),
        )
    if pj_type:
        selected_positions = _intersect_positions(
            selected_positions,
            indexes["pj_type"].get(str(pj_type)),
        )
    if status:
        selected_positions = _intersect_positions(
            selected_positions,
            indexes["wip_status"].get(str(status).upper()),
        )
    if hold_type:
        selected_positions = _intersect_positions(
            selected_positions,
            indexes["hold_type"].get(str(hold_type).lower()),
        )

    if selected_positions is None:
        result = df
    elif len(selected_positions) == 0:
        result = df.iloc[0:0]
    else:
        result = df.iloc[selected_positions]

    if workorder:
        result = result[result['WORKORDER'].astype(str).str.contains(workorder, case=False, na=False)]
    if lotid:
        result = result[result['LOTID'].astype(str).str.contains(lotid, case=False, na=False)]
    return result

def _build_search_signatures(df: pd.DataFrame) -> tuple[Counter, Dict[str, tuple[str, str, str, str]]]:
    if df.empty:
        return Counter(), {}

    workorders = df.get("WORKORDER", pd.Series(index=df.index, dtype=object)).map(_normalize_text_value)
    lotids = df.get("LOTID", pd.Series(index=df.index, dtype=object)).map(_normalize_text_value)
    packages = df.get("PACKAGE_LEF", pd.Series(index=df.index, dtype=object)).map(_normalize_text_value)
    types = df.get("PJ_TYPE", pd.Series(index=df.index, dtype=object)).map(_normalize_text_value)

    signatures = (
        workorders
        + "\x1f"
        + lotids
        + "\x1f"
        + packages
        + "\x1f"
        + types
    ).tolist()
    signature_counter = Counter(signatures)

    signature_fields: Dict[str, tuple[str, str, str, str]] = {}
    for signature, wo, lot, pkg, pj in zip(signatures, workorders, lotids, packages, types):
        if signature not in signature_fields:
            signature_fields[signature] = (wo, lot, pkg, pj)
    return signature_counter, signature_fields


def _build_field_counters(
    signature_counter: Counter,
    signature_fields: Dict[str, tuple[str, str, str, str]],
) -> Dict[str, Counter]:
    counters = {
        "workorders": Counter(),
        "lotids": Counter(),
        "packages": Counter(),
        "types": Counter(),
    }
    for signature, count in signature_counter.items():
        wo, lot, pkg, pj = signature_fields.get(signature, ("", "", "", ""))
        if wo:
            counters["workorders"][wo] += count
        if lot:
            counters["lotids"][lot] += count
        if pkg:
            counters["packages"][pkg] += count
        if pj:
            counters["types"][pj] += count
    return counters

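The signature scheme above can be sketched in isolation: the four searchable fields are joined with the ASCII unit separator `"\x1f"` into one hashable key, so row multiplicities can be counted and later diffed cheaply. The row values here are illustrative:

```python
from collections import Counter

rows = [
    ("WO1", "LOT1", "PKG-A", "TYPE-X"),
    ("WO1", "LOT1", "PKG-A", "TYPE-X"),
    ("WO2", "LOT2", "PKG-B", "TYPE-Y"),
]
# One signature per row; identical rows collapse into one Counter key.
signatures = ["\x1f".join(fields) for fields in rows]
signature_counter = Counter(signatures)

# Per-field counters are derived once per distinct signature, weighted by
# that signature's row count (mirroring _build_field_counters).
workorders = Counter()
for signature, count in signature_counter.items():
    wo = signature.split("\x1f", 3)[0]
    workorders[wo] += count

print(sorted(workorders.items()))
# → [('WO1', 2), ('WO2', 1)]
```

Iterating distinct signatures rather than raw rows is what makes the later incremental diff proportional to change volume, not table size.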
def _materialize_search_payload(
    *,
    version: str,
    row_count: int,
    signature_counter: Counter,
    field_counters: Dict[str, Counter],
    mode: str,
    added_rows: int = 0,
    removed_rows: int = 0,
    drift_ratio: float = 0.0,
) -> Dict[str, Any]:
    workorders = sorted(field_counters["workorders"].keys())
    lotids = sorted(field_counters["lotids"].keys())
    packages = sorted(field_counters["packages"].keys())
    types = sorted(field_counters["types"].keys())
    memory_bytes = (
        _estimate_counter_payload_bytes(field_counters["workorders"])
        + _estimate_counter_payload_bytes(field_counters["lotids"])
        + _estimate_counter_payload_bytes(field_counters["packages"])
        + _estimate_counter_payload_bytes(field_counters["types"])
    )
    return {
        "version": version,
        "built_at": datetime.now().isoformat(),
        "row_count": int(row_count),
        "workorders": workorders,
        "lotids": lotids,
        "packages": packages,
        "types": types,
        "sync_mode": mode,
        "sync_added_rows": int(added_rows),
        "sync_removed_rows": int(removed_rows),
        "drift_ratio": round(float(drift_ratio), 6),
        "memory_bytes": int(memory_bytes),
        "_signature_counter": dict(signature_counter),
        "_field_counters": {
            "workorders": dict(field_counters["workorders"]),
            "lotids": dict(field_counters["lotids"]),
            "packages": dict(field_counters["packages"]),
            "types": dict(field_counters["types"]),
        },
    }

def _build_wip_search_index(df: pd.DataFrame, include_dummy: bool) -> Dict[str, Any]:
    filtered = _filter_base_conditions(df, include_dummy=include_dummy)
    return {
        "built_at": datetime.now().isoformat(),
        "row_count": len(filtered),
        "workorders": _distinct_sorted_values(filtered, "WORKORDER"),
        "lotids": _distinct_sorted_values(filtered, "LOTID"),
        "packages": _distinct_sorted_values(filtered, "PACKAGE_LEF"),
        "types": _distinct_sorted_values(filtered, "PJ_TYPE"),
    signatures, signature_fields = _build_search_signatures(filtered)
    field_counters = _build_field_counters(signatures, signature_fields)
    return _materialize_search_payload(
        version=_get_wip_cache_version(),
        row_count=len(filtered),
        signature_counter=signatures,
        field_counters=field_counters,
        mode="full",
    )

def _try_incremental_search_sync(
    previous: Dict[str, Any],
    *,
    version: str,
    row_count: int,
    signature_counter: Counter,
    signature_fields: Dict[str, tuple[str, str, str, str]],
) -> Optional[Dict[str, Any]]:
    if not previous:
        return None
    old_signature_counter = Counter(previous.get("_signature_counter") or {})
    old_field_counters_raw = previous.get("_field_counters") or {}
    if not old_signature_counter or not old_field_counters_raw:
        return None

    added = signature_counter - old_signature_counter
    removed = old_signature_counter - signature_counter
    total_delta = sum(added.values()) + sum(removed.values())
    drift_ratio = total_delta / max(int(row_count), 1)
    if drift_ratio > 0.6:
        _increment_wip_metric("search_index_reconciliation_fallbacks")
        return None

    field_counters = {
        "workorders": Counter(old_field_counters_raw.get("workorders") or {}),
        "lotids": Counter(old_field_counters_raw.get("lotids") or {}),
        "packages": Counter(old_field_counters_raw.get("packages") or {}),
        "types": Counter(old_field_counters_raw.get("types") or {}),
    }

    for signature, count in added.items():
        wo, lot, pkg, pj = signature_fields.get(signature, ("", "", "", ""))
        if wo:
            field_counters["workorders"][wo] += count
        if lot:
            field_counters["lotids"][lot] += count
        if pkg:
            field_counters["packages"][pkg] += count
        if pj:
            field_counters["types"][pj] += count

    previous_fields = {
        sig: tuple(str(v) for v in sig.split("\x1f", 3))
        for sig in old_signature_counter.keys()
    }
    for signature, count in removed.items():
        wo, lot, pkg, pj = previous_fields.get(signature, ("", "", "", ""))
        if wo:
            field_counters["workorders"][wo] -= count
            if field_counters["workorders"][wo] <= 0:
                field_counters["workorders"].pop(wo, None)
        if lot:
            field_counters["lotids"][lot] -= count
            if field_counters["lotids"][lot] <= 0:
                field_counters["lotids"].pop(lot, None)
        if pkg:
            field_counters["packages"][pkg] -= count
            if field_counters["packages"][pkg] <= 0:
                field_counters["packages"].pop(pkg, None)
        if pj:
            field_counters["types"][pj] -= count
            if field_counters["types"][pj] <= 0:
                field_counters["types"].pop(pj, None)

    _increment_wip_metric("search_index_incremental_updates")
    return _materialize_search_payload(
        version=version,
        row_count=row_count,
        signature_counter=signature_counter,
        field_counters=field_counters,
        mode="incremental",
        added_rows=sum(added.values()),
        removed_rows=sum(removed.values()),
        drift_ratio=drift_ratio,
    )

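The core of `_try_incremental_search_sync` is `Counter` arithmetic: subtracting Counters keeps only positive differences, which yields the added and removed signature multisets directly. A minimal sketch with made-up signature keys, using the same 0.6 drift threshold as the function above:

```python
from collections import Counter

old = Counter({"sig-a": 5, "sig-b": 2})
new = Counter({"sig-a": 5, "sig-b": 1, "sig-c": 3})

# Counter subtraction drops zero/negative entries, so these are exactly
# the per-signature row counts that appeared or disappeared.
added = new - old        # Counter({'sig-c': 3})
removed = old - new      # Counter({'sig-b': 1})

total_delta = sum(added.values()) + sum(removed.values())
row_count = sum(new.values())
drift_ratio = total_delta / max(row_count, 1)

# Above the threshold, patching counters costs more than a clean rebuild.
mode = "full" if drift_ratio > 0.6 else "incremental"
print(round(drift_ratio, 4), mode)
# → 0.4444 incremental
```

Because the diff is over distinct signatures rather than rows, a steady-state refresh touches only the handful of lots that actually changed.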
def _build_wip_snapshot(df: pd.DataFrame, include_dummy: bool, version: str) -> Dict[str, Any]:
    filtered = _filter_base_conditions(df, include_dummy=include_dummy)
    filtered = _add_wip_status_columns(filtered).reset_index(drop=True)

    hold_type_series = pd.Series(index=filtered.index, dtype=object)
    if not filtered.empty:
        hold_type_series = pd.Series("", index=filtered.index, dtype=object)
        hold_type_series.loc[filtered["IS_QUALITY_HOLD"]] = "quality"
        hold_type_series.loc[filtered["IS_NON_QUALITY_HOLD"]] = "non-quality"

    indexes = {
        "workcenter": _build_value_index(filtered, "WORKCENTER_GROUP"),
        "package": _build_value_index(filtered, "PACKAGE_LEF"),
        "pj_type": _build_value_index(filtered, "PJ_TYPE"),
        "wip_status": _build_value_index(filtered, "WIP_STATUS"),
        "hold_type": _build_value_index(pd.DataFrame({"HOLD_TYPE": hold_type_series}), "HOLD_TYPE"),
    }

    exact_bucket_count = sum(len(bucket) for bucket in indexes.values())
    return {
        "version": version,
        "built_at": datetime.now().isoformat(),
        "row_count": int(len(filtered)),
        "frame": filtered,
        "indexes": indexes,
        "frame_bytes": _estimate_dataframe_bytes(filtered),
        "index_bucket_count": int(exact_bucket_count),
    }

def _get_wip_snapshot(include_dummy: bool) -> Optional[Dict[str, Any]]:
    cache_key = "with_dummy" if include_dummy else "without_dummy"
    version = _get_wip_cache_version()

    with _wip_snapshot_lock:
        cached = _wip_snapshot_cache.get(cache_key)
        if cached and cached.get("version") == version:
            _increment_wip_metric("snapshot_hits")
            return cached

    _increment_wip_metric("snapshot_misses")
    df = _get_wip_dataframe()
    if df is None:
        return None

    snapshot = _build_wip_snapshot(df, include_dummy=include_dummy, version=version)
    with _wip_snapshot_lock:
        existing = _wip_snapshot_cache.get(cache_key)
        if existing and existing.get("version") == version:
            _increment_wip_metric("snapshot_hits")
            return existing
        _wip_snapshot_cache[cache_key] = snapshot
        return snapshot

def _get_wip_search_index(include_dummy: bool) -> Optional[Dict[str, Any]]:
    cache_key = "with_dummy" if include_dummy else "without_dummy"
@@ -184,14 +543,37 @@ def _get_wip_search_index(include_dummy: bool) -> Optional[Dict[str, Any]]:
    with _wip_search_index_lock:
        cached = _wip_search_index_cache.get(cache_key)
        if cached and cached.get("version") == version:
            _increment_wip_metric("search_index_hits")
            return cached

    df = _get_wip_dataframe()
    if df is None:
    _increment_wip_metric("search_index_misses")
    snapshot = _get_wip_snapshot(include_dummy=include_dummy)
    if snapshot is None:
        return None

    index_payload = _build_wip_search_index(df, include_dummy=include_dummy)
    index_payload["version"] = version
    filtered = snapshot["frame"]
    signature_counter, signature_fields = _build_search_signatures(filtered)

    with _wip_search_index_lock:
        previous = _wip_search_index_cache.get(cache_key)

    index_payload = _try_incremental_search_sync(
        previous or {},
        version=version,
        row_count=int(snapshot.get("row_count", 0)),
        signature_counter=signature_counter,
        signature_fields=signature_fields,
    )
    if index_payload is None:
        field_counters = _build_field_counters(signature_counter, signature_fields)
        index_payload = _materialize_search_payload(
            version=version,
            row_count=int(snapshot.get("row_count", 0)),
            signature_counter=signature_counter,
            field_counters=field_counters,
            mode="full",
        )
        _increment_wip_metric("search_index_rebuilds")

    with _wip_search_index_lock:
        _wip_search_index_cache[cache_key] = index_payload

@@ -207,9 +589,9 @@ def _search_values_from_index(values: List[str], query: str, limit: int) -> List

def get_wip_search_index_status() -> Dict[str, Any]:
    """Expose WIP derived search-index freshness for diagnostics."""
    with _wip_search_index_lock:
        snapshot = {}
        search_snapshot = {}
        for key, payload in _wip_search_index_cache.items():
            snapshot[key] = {
            search_snapshot[key] = {
                "version": payload.get("version"),
                "built_at": payload.get("built_at"),
                "row_count": payload.get("row_count", 0),
@@ -217,8 +599,39 @@ def get_wip_search_index_status() -> Dict[str, Any]:
                "lotids": len(payload.get("lotids", [])),
                "packages": len(payload.get("packages", [])),
                "types": len(payload.get("types", [])),
                "sync_mode": payload.get("sync_mode"),
                "sync_added_rows": payload.get("sync_added_rows", 0),
                "sync_removed_rows": payload.get("sync_removed_rows", 0),
                "drift_ratio": payload.get("drift_ratio", 0.0),
                "memory_bytes": payload.get("memory_bytes", 0),
            }
    return snapshot
    with _wip_snapshot_lock:
        frame_snapshot = {}
        for key, payload in _wip_snapshot_cache.items():
            frame_snapshot[key] = {
                "version": payload.get("version"),
                "built_at": payload.get("built_at"),
                "row_count": payload.get("row_count", 0),
                "frame_bytes": payload.get("frame_bytes", 0),
                "index_bucket_count": payload.get("index_bucket_count", 0),
            }
    with _wip_index_metrics_lock:
        metrics = dict(_wip_index_metrics)

    total_frame_bytes = sum(item.get("frame_bytes", 0) for item in frame_snapshot.values())
    total_search_bytes = sum(item.get("memory_bytes", 0) for item in search_snapshot.values())
    amplification_ratio = round((total_frame_bytes + total_search_bytes) / max(total_frame_bytes, 1), 4)

    return {
        "derived_search_index": search_snapshot,
        "derived_frame_snapshot": frame_snapshot,
        "metrics": metrics,
        "memory": {
            "frame_bytes_total": int(total_frame_bytes),
            "search_bytes_total": int(total_search_bytes),
            "amplification_ratio": amplification_ratio,
        },
    }

def _add_wip_status_columns(df: pd.DataFrame) -> pd.DataFrame:
@@ -235,24 +648,31 @@ def _add_wip_status_columns(df: pd.DataFrame) -> pd.DataFrame:
    Returns:
        DataFrame with additional status columns
    """
    df = df.copy()
    required = {'WIP_STATUS', 'IS_QUALITY_HOLD', 'IS_NON_QUALITY_HOLD'}
    if required.issubset(df.columns):
        return df

    working = df.copy()

    # Ensure numeric columns
    df['EQUIPMENTCOUNT'] = pd.to_numeric(df['EQUIPMENTCOUNT'], errors='coerce').fillna(0)
    df['CURRENTHOLDCOUNT'] = pd.to_numeric(df['CURRENTHOLDCOUNT'], errors='coerce').fillna(0)
    df['QTY'] = pd.to_numeric(df['QTY'], errors='coerce').fillna(0)
    working['EQUIPMENTCOUNT'] = pd.to_numeric(working['EQUIPMENTCOUNT'], errors='coerce').fillna(0)
    working['CURRENTHOLDCOUNT'] = pd.to_numeric(working['CURRENTHOLDCOUNT'], errors='coerce').fillna(0)
    working['QTY'] = pd.to_numeric(working['QTY'], errors='coerce').fillna(0)

    # Compute WIP status
    df['WIP_STATUS'] = 'QUEUE'  # Default
    df.loc[df['EQUIPMENTCOUNT'] > 0, 'WIP_STATUS'] = 'RUN'
    df.loc[(df['EQUIPMENTCOUNT'] == 0) & (df['CURRENTHOLDCOUNT'] > 0), 'WIP_STATUS'] = 'HOLD'
    working['WIP_STATUS'] = 'QUEUE'  # Default
    working.loc[working['EQUIPMENTCOUNT'] > 0, 'WIP_STATUS'] = 'RUN'
    working.loc[
        (working['EQUIPMENTCOUNT'] == 0) & (working['CURRENTHOLDCOUNT'] > 0),
        'WIP_STATUS'
    ] = 'HOLD'

    # Compute hold type
    df['IS_NON_QUALITY_HOLD'] = df['HOLDREASONNAME'].isin(NON_QUALITY_HOLD_REASONS)
    df['IS_QUALITY_HOLD'] = (df['WIP_STATUS'] == 'HOLD') & ~df['IS_NON_QUALITY_HOLD']
    df['IS_NON_QUALITY_HOLD'] = (df['WIP_STATUS'] == 'HOLD') & df['IS_NON_QUALITY_HOLD']
    non_quality_flags = working['HOLDREASONNAME'].isin(NON_QUALITY_HOLD_REASONS)
    working['IS_QUALITY_HOLD'] = (working['WIP_STATUS'] == 'HOLD') & ~non_quality_flags
    working['IS_NON_QUALITY_HOLD'] = (working['WIP_STATUS'] == 'HOLD') & non_quality_flags

    return df
    return working

def _filter_base_conditions(
|
||||
@@ -272,24 +692,18 @@ def _filter_base_conditions(
     Returns:
         Filtered DataFrame
     """
-    df = df.copy()
-
-    # Exclude NULL WORKORDER (raw materials)
-    df = df[df['WORKORDER'].notna()]
-
-    # DUMMY exclusion
-    if not include_dummy:
-        df = df[~df['LOTID'].str.contains('DUMMY', case=False, na=False)]
-
-    # WORKORDER filter (fuzzy match)
-    if workorder:
-        df = df[df['WORKORDER'].str.contains(workorder, case=False, na=False)]
-
-    # LOTID filter (fuzzy match)
-    if lotid:
-        df = df[df['LOTID'].str.contains(lotid, case=False, na=False)]
-
-    return df
+    if df is None or df.empty:
+        return df.iloc[0:0] if isinstance(df, pd.DataFrame) else pd.DataFrame()
+
+    mask = _build_filter_mask(
+        df,
+        include_dummy=include_dummy,
+        workorder=workorder,
+        lotid=lotid,
+    )
+    if mask.empty:
+        return df.iloc[0:0]
+    return df.loc[mask]


 # ============================================================
@@ -325,16 +739,15 @@ def get_wip_summary(
     cached_df = _get_wip_dataframe()
     if cached_df is not None:
         try:
-            df = _filter_base_conditions(cached_df, include_dummy, workorder, lotid)
-            df = _add_wip_status_columns(df)
-
-            # Apply package filter
-            if package and 'PACKAGE_LEF' in df.columns:
-                df = df[df['PACKAGE_LEF'] == package]
-
-            # Apply pj_type filter
-            if pj_type and 'PJ_TYPE' in df.columns:
-                df = df[df['PJ_TYPE'] == pj_type]
+            df = _select_with_snapshot_indexes(
+                include_dummy=include_dummy,
+                workorder=workorder,
+                lotid=lotid,
+                package=package,
+                pj_type=pj_type,
+            )
+            if df is None:
+                return _get_wip_summary_from_oracle(include_dummy, workorder, lotid, package, pj_type)

             if df.empty:
                 return {
@@ -495,32 +908,31 @@ def get_wip_matrix(
     cached_df = _get_wip_dataframe()
     if cached_df is not None:
         try:
-            df = _filter_base_conditions(cached_df, include_dummy, workorder, lotid)
-            df = _add_wip_status_columns(df)
+            status_upper = status.upper() if status else None
+            hold_type_filter = hold_type if status_upper == 'HOLD' else None
+            df = _select_with_snapshot_indexes(
+                include_dummy=include_dummy,
+                workorder=workorder,
+                lotid=lotid,
+                package=package,
+                pj_type=pj_type,
+                status=status_upper,
+                hold_type=hold_type_filter,
+            )
+            if df is None:
+                return _get_wip_matrix_from_oracle(
+                    include_dummy,
+                    workorder,
+                    lotid,
+                    status,
+                    hold_type,
+                    package,
+                    pj_type,
+                )

             # Filter by WORKCENTER_GROUP and PACKAGE_LEF
             df = df[df['WORKCENTER_GROUP'].notna() & df['PACKAGE_LEF'].notna()]

-            # Apply package filter
-            if package:
-                df = df[df['PACKAGE_LEF'] == package]
-
-            # Apply pj_type filter
-            if pj_type and 'PJ_TYPE' in df.columns:
-                df = df[df['PJ_TYPE'] == pj_type]
-
-            # WIP status filter
-            if status:
-                status_upper = status.upper()
-                df = df[df['WIP_STATUS'] == status_upper]
-
-            # Hold type sub-filter
-            if status_upper == 'HOLD' and hold_type:
-                if hold_type == 'quality':
-                    df = df[df['IS_QUALITY_HOLD']]
-                elif hold_type == 'non-quality':
-                    df = df[df['IS_NON_QUALITY_HOLD']]
-
             if df.empty:
                 return {
                     'workcenters': [],
@@ -677,11 +1089,17 @@ def get_wip_hold_summary(
     cached_df = _get_wip_dataframe()
     if cached_df is not None:
         try:
-            df = _filter_base_conditions(cached_df, include_dummy, workorder, lotid)
-            df = _add_wip_status_columns(df)
+            df = _select_with_snapshot_indexes(
+                include_dummy=include_dummy,
+                workorder=workorder,
+                lotid=lotid,
+                status='HOLD',
+            )
+            if df is None:
+                return _get_wip_hold_summary_from_oracle(include_dummy, workorder, lotid)

             # Filter for HOLD status with reason
-            df = df[(df['WIP_STATUS'] == 'HOLD') & df['HOLDREASONNAME'].notna()]
+            df = df[df['HOLDREASONNAME'].notna()]

             if df.empty:
                 return {'items': []}
@@ -805,17 +1223,40 @@ def get_wip_detail(
     cached_df = _get_wip_dataframe()
     if cached_df is not None:
         try:
-            df = _filter_base_conditions(cached_df, include_dummy, workorder, lotid)
-            df = _add_wip_status_columns(df)
-
-            # Filter by workcenter
-            df = df[df['WORKCENTER_GROUP'] == workcenter]
-
-            if package:
-                df = df[df['PACKAGE_LEF'] == package]
+            summary_df = _select_with_snapshot_indexes(
+                include_dummy=include_dummy,
+                workorder=workorder,
+                lotid=lotid,
+                package=package,
+                workcenter=workcenter,
+            )
+            if summary_df is None:
+                return _get_wip_detail_from_oracle(
+                    workcenter,
+                    package,
+                    status,
+                    hold_type,
+                    workorder,
+                    lotid,
+                    include_dummy,
+                    page,
+                    page_size,
+                )
+
+            if summary_df.empty:
+                summary = {
+                    'totalLots': 0,
+                    'runLots': 0,
+                    'queueLots': 0,
+                    'holdLots': 0,
+                    'qualityHoldLots': 0,
+                    'nonQualityHoldLots': 0
+                }
+                df = summary_df
+            else:
+                df = summary_df

             # Calculate summary before status filter
-            summary_df = df.copy()
             run_lots = len(summary_df[summary_df['WIP_STATUS'] == 'RUN'])
             queue_lots = len(summary_df[summary_df['WIP_STATUS'] == 'QUEUE'])
             hold_lots = len(summary_df[summary_df['WIP_STATUS'] == 'HOLD'])
@@ -835,13 +1276,29 @@
             # Apply status filter for lots list
             if status:
                 status_upper = status.upper()
-                df = df[df['WIP_STATUS'] == status_upper]
-
-                if status_upper == 'HOLD' and hold_type:
-                    if hold_type == 'quality':
-                        df = df[df['IS_QUALITY_HOLD']]
-                    elif hold_type == 'non-quality':
-                        df = df[df['IS_NON_QUALITY_HOLD']]
+                hold_type_filter = hold_type if status_upper == 'HOLD' else None
+                filtered_df = _select_with_snapshot_indexes(
+                    include_dummy=include_dummy,
+                    workorder=workorder,
+                    lotid=lotid,
+                    package=package,
+                    workcenter=workcenter,
+                    status=status_upper,
+                    hold_type=hold_type_filter,
+                )
+                if filtered_df is None:
+                    return _get_wip_detail_from_oracle(
+                        workcenter,
+                        package,
+                        status,
+                        hold_type,
+                        workorder,
+                        lotid,
+                        include_dummy,
+                        page,
+                        page_size,
+                    )
+                df = filtered_df

             # Get specs (sorted by SPECSEQUENCE if available)
             specs_df = df[df['SPECNAME'].notna()][['SPECNAME', 'SPECSEQUENCE']].drop_duplicates()
@@ -1083,7 +1540,9 @@ def get_workcenters(include_dummy: bool = False) -> Optional[List[Dict[str, Any]]]:
     cached_df = _get_wip_dataframe()
     if cached_df is not None:
         try:
-            df = _filter_base_conditions(cached_df, include_dummy)
+            df = _select_with_snapshot_indexes(include_dummy=include_dummy)
+            if df is None:
+                return _get_workcenters_from_oracle(include_dummy)
             df = df[df['WORKCENTER_GROUP'].notna()]

             if df.empty:
@@ -1162,7 +1621,9 @@ def get_packages(include_dummy: bool = False) -> Optional[List[Dict[str, Any]]]:
     cached_df = _get_wip_dataframe()
     if cached_df is not None:
         try:
-            df = _filter_base_conditions(cached_df, include_dummy)
+            df = _select_with_snapshot_indexes(include_dummy=include_dummy)
+            if df is None:
+                return _get_packages_from_oracle(include_dummy)
             df = df[df['PACKAGE_LEF'].notna()]

             if df.empty:
@@ -1267,15 +1728,16 @@ def search_workorders(
     cached_df = _get_wip_dataframe()
     if cached_df is not None:
         try:
-            df = _filter_base_conditions(cached_df, include_dummy, lotid=lotid)
+            df = _select_with_snapshot_indexes(
+                include_dummy=include_dummy,
+                lotid=lotid,
+                package=package,
+                pj_type=pj_type,
+            )
+            if df is None:
+                return _search_workorders_from_oracle(q, limit, include_dummy, lotid, package, pj_type)
             df = df[df['WORKORDER'].notna()]

-            # Apply cross-filters
-            if package and 'PACKAGE_LEF' in df.columns:
-                df = df[df['PACKAGE_LEF'] == package]
-            if pj_type and 'PJ_TYPE' in df.columns:
-                df = df[df['PJ_TYPE'] == pj_type]
-
             # Filter by search query (case-insensitive)
             df = df[df['WORKORDER'].str.contains(q, case=False, na=False)]
@@ -1375,13 +1837,14 @@ def search_lot_ids(
     cached_df = _get_wip_dataframe()
    if cached_df is not None:
         try:
-            df = _filter_base_conditions(cached_df, include_dummy, workorder=workorder)
-
-            # Apply cross-filters
-            if package and 'PACKAGE_LEF' in df.columns:
-                df = df[df['PACKAGE_LEF'] == package]
-            if pj_type and 'PJ_TYPE' in df.columns:
-                df = df[df['PJ_TYPE'] == pj_type]
+            df = _select_with_snapshot_indexes(
+                include_dummy=include_dummy,
+                workorder=workorder,
+                package=package,
+                pj_type=pj_type,
+            )
+            if df is None:
+                return _search_lot_ids_from_oracle(q, limit, include_dummy, workorder, package, pj_type)

             # Filter by search query (case-insensitive)
             df = df[df['LOTID'].str.contains(q, case=False, na=False)]
@@ -1481,7 +1944,14 @@ def search_packages(
     cached_df = _get_wip_dataframe()
     if cached_df is not None:
         try:
-            df = _filter_base_conditions(cached_df, include_dummy, workorder=workorder, lotid=lotid)
+            df = _select_with_snapshot_indexes(
+                include_dummy=include_dummy,
+                workorder=workorder,
+                lotid=lotid,
+                pj_type=pj_type,
+            )
+            if df is None:
+                return _search_packages_from_oracle(q, limit, include_dummy, workorder, lotid, pj_type)

             # Check if PACKAGE_LEF column exists
             if 'PACKAGE_LEF' not in df.columns:
@@ -1490,10 +1960,6 @@

             df = df[df['PACKAGE_LEF'].notna()]

-            # Apply cross-filter
-            if pj_type and 'PJ_TYPE' in df.columns:
-                df = df[df['PJ_TYPE'] == pj_type]
-
             # Filter by search query (case-insensitive)
             df = df[df['PACKAGE_LEF'].str.contains(q, case=False, na=False)]
@@ -1591,7 +2057,14 @@ def search_types(
     cached_df = _get_wip_dataframe()
     if cached_df is not None:
         try:
-            df = _filter_base_conditions(cached_df, include_dummy, workorder=workorder, lotid=lotid)
+            df = _select_with_snapshot_indexes(
+                include_dummy=include_dummy,
+                workorder=workorder,
+                lotid=lotid,
+                package=package,
+            )
+            if df is None:
+                return _search_types_from_oracle(q, limit, include_dummy, workorder, lotid, package)

             # Check if PJ_TYPE column exists
             if 'PJ_TYPE' not in df.columns:
@@ -1600,10 +2073,6 @@

             df = df[df['PJ_TYPE'].notna()]

-            # Apply cross-filter
-            if package and 'PACKAGE_LEF' in df.columns:
-                df = df[df['PACKAGE_LEF'] == package]
-
             # Filter by search query (case-insensitive)
             df = df[df['PJ_TYPE'].str.contains(q, case=False, na=False)]
@@ -1686,11 +2155,15 @@ def get_hold_detail_summary(
     cached_df = _get_wip_dataframe()
     if cached_df is not None:
         try:
-            df = _filter_base_conditions(cached_df, include_dummy)
-            df = _add_wip_status_columns(df)
+            df = _select_with_snapshot_indexes(
+                include_dummy=include_dummy,
+                status='HOLD',
+            )
+            if df is None:
+                return _get_hold_detail_summary_from_oracle(reason, include_dummy)

             # Filter for HOLD status with matching reason
-            df = df[(df['WIP_STATUS'] == 'HOLD') & (df['HOLDREASONNAME'] == reason)]
+            df = df[df['HOLDREASONNAME'] == reason]

             if df.empty:
                 return {
@@ -1783,11 +2256,15 @@ def get_hold_detail_distribution(
     cached_df = _get_wip_dataframe()
     if cached_df is not None:
         try:
-            df = _filter_base_conditions(cached_df, include_dummy)
-            df = _add_wip_status_columns(df)
+            df = _select_with_snapshot_indexes(
+                include_dummy=include_dummy,
+                status='HOLD',
+            )
+            if df is None:
+                return _get_hold_detail_distribution_from_oracle(reason, include_dummy)

             # Filter for HOLD status with matching reason
-            df = df[(df['WIP_STATUS'] == 'HOLD') & (df['HOLDREASONNAME'] == reason)]
+            df = df[df['HOLDREASONNAME'] == reason]

             total_lots = len(df)
@@ -2072,20 +2549,30 @@ def get_hold_detail_lots(
     cached_df = _get_wip_dataframe()
     if cached_df is not None:
         try:
-            df = _filter_base_conditions(cached_df, include_dummy)
-            df = _add_wip_status_columns(df)
+            df = _select_with_snapshot_indexes(
+                include_dummy=include_dummy,
+                workcenter=workcenter,
+                package=package,
+                status='HOLD',
+            )
+            if df is None:
+                return _get_hold_detail_lots_from_oracle(
+                    reason=reason,
+                    workcenter=workcenter,
+                    package=package,
+                    age_range=age_range,
+                    include_dummy=include_dummy,
+                    page=page,
+                    page_size=page_size,
+                )

             # Filter for HOLD status with matching reason
-            df = df[(df['WIP_STATUS'] == 'HOLD') & (df['HOLDREASONNAME'] == reason)]
+            df = df[df['HOLDREASONNAME'] == reason]

             # Ensure numeric columns
             df['AGEBYDAYS'] = pd.to_numeric(df['AGEBYDAYS'], errors='coerce').fillna(0)

-            # Optional filters
-            if workcenter:
-                df = df[df['WORKCENTER_GROUP'] == workcenter]
-            if package:
-                df = df[df['PACKAGE_LEF'] == package]
+            # Optional age filter
             if age_range:
                 if age_range == '0-1':
                     df = df[(df['AGEBYDAYS'] >= 0) & (df['AGEBYDAYS'] < 1)]
@@ -32,6 +32,23 @@ const MesApi = (function() {
     const MIN_DEGRADED_DELAY_MS = 3000;

     let requestCounter = 0;

+    function getCsrfToken() {
+        const meta = document.querySelector('meta[name="csrf-token"]');
+        return meta ? meta.content : '';
+    }
+
+    function withCsrfHeaders(headers, method) {
+        const normalized = (method || 'GET').toUpperCase();
+        const nextHeaders = { ...(headers || {}) };
+        if (['POST', 'PUT', 'PATCH', 'DELETE'].includes(normalized)) {
+            const token = getCsrfToken();
+            if (token && !nextHeaders['X-CSRF-Token']) {
+                nextHeaders['X-CSRF-Token'] = token;
+            }
+        }
+        return nextHeaders;
+    }
+
     /**
      * Generate a unique request ID
@@ -203,12 +220,12 @@ const MesApi = (function() {

         console.log(`[MesApi] ${reqId} ${method} ${fullUrl}`);

-        const fetchOptions = {
-            method: method,
-            headers: {
-                'Content-Type': 'application/json'
-            }
-        };
+        const fetchOptions = {
+            method: method,
+            headers: withCsrfHeaders({
+                'Content-Type': 'application/json'
+            }, method)
+        };

         if (options.body) {
             fetchOptions.body = JSON.stringify(options.body);
@@ -1,9 +1,10 @@
 <!DOCTYPE html>
 <html lang="zh-TW">
 <head>
     <meta charset="UTF-8">
     <meta name="viewport" content="width=device-width, initial-scale=1.0">
+    <meta name="csrf-token" content="{{ csrf_token() }}">
     <title>{% block title %}MES Dashboard{% endblock %}</title>

     <!-- Toast 樣式 -->
     <style id="mes-core-styles">
@@ -221,8 +221,13 @@
 {% endblock %}

 {% block scripts %}
 <script>
     const tbody = document.getElementById('pages-tbody');
+    const csrfToken = document.querySelector('meta[name="csrf-token"]')?.content || '';
+
+    function withCsrfHeaders(headers = {}) {
+        return csrfToken ? { ...headers, 'X-CSRF-Token': csrfToken } : headers;
+    }

     async function loadPages() {
         try {
@@ -264,11 +269,11 @@
             const newStatus = currentStatus === 'released' ? 'dev' : 'released';

             try {
-                const response = await fetch(`/admin/api/pages${route}`, {
-                    method: 'PUT',
-                    headers: { 'Content-Type': 'application/json' },
-                    body: JSON.stringify({ status: newStatus })
-                });
+                const response = await fetch(`/admin/api/pages${route}`, {
+                    method: 'PUT',
+                    headers: withCsrfHeaders({ 'Content-Type': 'application/json' }),
+                    body: JSON.stringify({ status: newStatus })
+                });

                 const data = await response.json();
@@ -707,7 +707,13 @@
 // Auth Helper
 // ============================================================
 async function fetchWithAuth(url, options = {}) {
-    const resp = await fetch(url, { ...options, cache: 'no-store' });
+    const method = (options.method || 'GET').toUpperCase();
+    const csrfToken = document.querySelector('meta[name="csrf-token"]')?.content || '';
+    const headers = { ...(options.headers || {}) };
+    if (csrfToken && ['POST', 'PUT', 'PATCH', 'DELETE'].includes(method)) {
+        headers['X-CSRF-Token'] = csrfToken;
+    }
+    const resp = await fetch(url, { ...options, headers, cache: 'no-store' });
     if (resp.status === 401) {
         const json = await resp.json().catch(() => ({}));
         if (!authErrorShown) {
@@ -962,9 +968,15 @@
     document.getElementById('workerStartTime').textContent =
         data.worker_start_time ? formatTimestamp(data.worker_start_time) : '--';

-    // Update cooldown status
+    // Update recovery policy status
+    const policyState = data?.resilience?.policy_state || {};
     const cooldown = data.cooldown;
-    if (cooldown && cooldown.active) {
+    if (policyState.blocked) {
+        document.getElementById('workerCooldown').textContent = 'Guarded mode(需手動 override)';
+        document.getElementById('restartBtn').disabled = false;
+        document.getElementById('restartBtn').style.opacity = '1';
+        document.getElementById('restartBtn').style.cursor = 'pointer';
+    } else if (cooldown && cooldown.active) {
         document.getElementById('workerCooldown').textContent =
             `冷卻中 (${cooldown.remaining_seconds}秒)`;
         document.getElementById('restartBtn').disabled = true;
@@ -1017,11 +1029,41 @@
     btn.style.opacity = '0.5';

     try {
-        const resp = await fetchWithAuth('/admin/api/worker/restart', {
+        let resp = await fetchWithAuth('/admin/api/worker/restart', {
             method: 'POST',
-            headers: { 'Content-Type': 'application/json' }
+            headers: { 'Content-Type': 'application/json' },
+            body: JSON.stringify({})
         });
-        const json = await resp.json();
+        let json = await resp.json();
+
+        if (!json.success && resp.status === 409) {
+            const reason = window.prompt(
+                '目前 restart policy 為 guarded mode。\n請輸入 override 原因(會記錄於稽核日誌):'
+            );
+            if (!reason || !reason.trim()) {
+                alert('已取消 override。');
+                return;
+            }
+
+            const acknowledged = window.confirm(
+                '確認執行 manual override?此操作將繞過 guarded mode 保護。'
+            );
+            if (!acknowledged) {
+                alert('已取消 override。');
+                return;
+            }
+
+            resp = await fetchWithAuth('/admin/api/worker/restart', {
+                method: 'POST',
+                headers: { 'Content-Type': 'application/json' },
+                body: JSON.stringify({
+                    manual_override: true,
+                    override_acknowledged: true,
+                    override_reason: reason.trim()
+                })
+            });
+            json = await resp.json();
+        }

         if (!json.success) {
             alert('重啟失敗: ' + (json.error?.message || '未知錯誤'));
@@ -682,7 +682,7 @@
     // State
     // ============================================================
     const state = {
-        reason: '{{ reason | e }}',
+        reason: {{ reason | tojson }},
         summary: null,
         distribution: null,
         lots: null,
@@ -129,12 +129,13 @@
         <div class="error-message">
             {{ error }}
         </div>
         {% endif %}

         <form method="POST">
+            <input type="hidden" name="csrf_token" value="{{ csrf_token() }}">
             <div class="form-group">
                 <label for="username">帳號</label>
                 <input type="text" id="username" name="username" placeholder="工號或 Email" required autofocus>
             </div>

             <div class="form-group">
tests/fixtures/cache_benchmark_fixture.json (new file, 9 lines, vendored)
@@ -0,0 +1,9 @@
+{
+  "rows": 30000,
+  "query_count": 400,
+  "seed": 42,
+  "thresholds": {
+    "max_p95_ratio_indexed_vs_baseline": 1.25,
+    "max_memory_amplification_ratio": 1.8
+  }
+}
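The fixture above supplies the two gate thresholds consumed by `scripts/run_cache_benchmarks.py` (P95 latency ratio of the indexed path vs. baseline, and cache memory amplification). A minimal sketch of how such a gate might evaluate them follows; only the threshold keys and values come from the fixture, while `evaluate_gate` and its signature are hypothetical, not the script's actual API:

```python
import json

# Thresholds copied from tests/fixtures/cache_benchmark_fixture.json.
FIXTURE = json.loads('''
{
  "thresholds": {
    "max_p95_ratio_indexed_vs_baseline": 1.25,
    "max_memory_amplification_ratio": 1.8
  }
}
''')


def evaluate_gate(p95_ratio, memory_ratio, thresholds=FIXTURE["thresholds"]):
    """Return a list of threshold violations; an empty list means the gate passes."""
    failures = []
    if p95_ratio > thresholds["max_p95_ratio_indexed_vs_baseline"]:
        failures.append(
            f"indexed/baseline P95 ratio {p95_ratio:.2f} exceeds "
            f"{thresholds['max_p95_ratio_indexed_vs_baseline']}"
        )
    if memory_ratio > thresholds["max_memory_amplification_ratio"]:
        failures.append(
            f"memory amplification {memory_ratio:.2f} exceeds "
            f"{thresholds['max_memory_amplification_ratio']}"
        )
    return failures


print(evaluate_gate(1.10, 1.50))  # within both thresholds → []
print(evaluate_gate(1.30, 2.00))  # both thresholds exceeded
```

Returning a list of violations (rather than a bare boolean) lets a CI wrapper print every exceeded threshold before exiting non-zero.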