feat: add performance monitoring, circuit breaker protection, and worker restart control

New features:
- Performance monitoring dashboard (/admin/performance): system status, query latency, log viewer
- Circuit breaker: CLOSED/OPEN/HALF_OPEN states protect the database
- Performance metrics: P50/P95/P99 latency tracking, slow-query statistics
- SQLite log store: structured logs, retention policy, manual cleanup
- Worker watchdog: graceful restarts via a systemd service
- Unified API response format: standardized success_response/error_response
- Deep health check endpoint (/health/deep)
- 404/500 error page templates

Bug fixes:
- Fix a deadlock in circuit_breaker.py get_status()
- Fix a module import path error in health_routes.py

New dependency: psutil (worker status monitoring)
Test coverage: 59 new test cases

Co-Authored-By: Claude Opus 4.5 <noreply@anthropic.com>
Author: beabigegg
Date: 2026-02-04 08:14:42 +08:00
Parent: c11b13a7e3
Commit: 13acbfc71b
40 changed files with 6942 additions and 133 deletions


@@ -90,3 +90,54 @@ RESOURCE_CACHE_ENABLED=true
# Resource cache sync interval in seconds (default: 14400 = 4 hours)
# The cache will check for updates at this interval using MAX(LASTCHANGEDATE)
RESOURCE_SYNC_INTERVAL=14400
# ============================================================
# Circuit Breaker Configuration
# ============================================================
# Enable/disable circuit breaker for database protection
CIRCUIT_BREAKER_ENABLED=false
# Minimum failures before circuit can open
CIRCUIT_BREAKER_FAILURE_THRESHOLD=5
# Failure rate threshold (0.0 - 1.0)
CIRCUIT_BREAKER_FAILURE_RATE=0.5
# Seconds to wait in OPEN state before trying HALF_OPEN
CIRCUIT_BREAKER_RECOVERY_TIMEOUT=30
# Sliding window size for counting successes/failures
CIRCUIT_BREAKER_WINDOW_SIZE=10
# ============================================================
# Performance Metrics Configuration
# ============================================================
# Slow query threshold in seconds (default: 1.0)
SLOW_QUERY_THRESHOLD=1.0
# ============================================================
# SQLite Log Store Configuration
# ============================================================
# Enable/disable SQLite log store for admin dashboard
LOG_STORE_ENABLED=true
# SQLite database path
LOG_SQLITE_PATH=logs/admin_logs.sqlite
# Log retention period in days (default: 7)
LOG_SQLITE_RETENTION_DAYS=7
# Maximum log rows (default: 100000)
LOG_SQLITE_MAX_ROWS=100000
# ============================================================
# Worker Watchdog Configuration
# ============================================================
# Path to restart flag file (watchdog monitors this file)
WATCHDOG_RESTART_FLAG=/tmp/mes_dashboard_restart.flag
# Path to restart state file (stores last restart info)
WATCHDOG_STATE_FILE=/tmp/mes_dashboard_restart_state.json
# Cooldown period between restart requests in seconds (default: 60)
WORKER_RESTART_COOLDOWN=60

README.md

@@ -18,6 +18,9 @@
| Page state management | ✅ Done |
| Redis cache system | ✅ Done |
| SQL query security architecture | ✅ Done |
| Performance monitoring dashboard | ✅ Done |
| Circuit breaker protection | ✅ Done |
| Worker restart control | ✅ Done |
| Deployment automation | ✅ Done |
---
@@ -143,6 +146,51 @@ ADMIN_EMAILS=admin@example.com # admin emails (comma-separated)
3. **Firewall**: open the service port (default: 8080)
### Worker Watchdog service setup
The watchdog process lets administrators gracefully restart workers from the web interface:
```bash
# 1. Copy the systemd service file
sudo cp deploy/mes-dashboard-watchdog.service /etc/systemd/system/
# 2. Edit the service file to adjust paths and user
sudo nano /etc/systemd/system/mes-dashboard-watchdog.service
# 3. Reload systemd
sudo systemctl daemon-reload
# 4. Start and enable at boot
sudo systemctl start mes-dashboard-watchdog
sudo systemctl enable mes-dashboard-watchdog
# 5. Check status
sudo systemctl status mes-dashboard-watchdog
```
### Rollback steps
To roll back to a previous version:
```bash
# 1. Stop the services
./scripts/start_server.sh stop
sudo systemctl stop mes-dashboard-watchdog
# 2. Roll back the code
git checkout <previous-commit>
# 3. Reinstall dependencies (if they changed)
pip install -r requirements.txt
# 4. Clean up new-version data (optional)
rm -f logs/admin_logs.sqlite  # remove the SQLite logs
# 5. Restart the services
./scripts/start_server.sh start
sudo systemctl start mes-dashboard-watchdog
```
---
## Features
@@ -201,6 +249,33 @@ ADMIN_EMAILS=admin@example.com # admin emails (comma-separated)
- Page state management (released/dev)
- Dev pages visible to administrators only
### Performance monitoring dashboard
An admin-only system monitoring interface (`/admin/performance`):
- **System status overview**: Database, Redis, Circuit Breaker, and Worker status
- **Query performance metrics**: P50/P95/P99 latency, slow-query statistics, latency distribution chart
- **System log viewer**: live log queries, level filtering, keyword search
- **Log management**: storage statistics, manual cleanup
- **Worker control**: graceful restart (via the watchdog mechanism)
- Auto refresh (30-second interval)
### Circuit breaker protection
The Circuit Breaker pattern shields the database from avalanche effects:
- **CLOSED**: normal operation; requests pass through
- **OPEN**: too many failures; requests are rejected immediately
- **HALF_OPEN**: recovery probing; a limited number of requests allowed
Configuration:
```bash
CIRCUIT_BREAKER_ENABLED=true
CIRCUIT_BREAKER_FAILURE_THRESHOLD=5
CIRCUIT_BREAKER_FAILURE_RATE=0.5
CIRCUIT_BREAKER_RECOVERY_TIMEOUT=30
```
---
## Architecture
@@ -248,7 +323,12 @@ DashBoard/
│ │ ├── database.py # database connections
│ │ ├── redis_client.py # Redis client
│ │ ├── cache.py # cache management
│ │ ├── cache_updater.py # automatic cache refresh
│ │ ├── circuit_breaker.py # circuit breaker
│ │ ├── metrics.py # performance metric collection
│ │ ├── log_store.py # SQLite log store
│ │ ├── response.py # API response format
│ │ └── permissions.py # permission management
│ ├── routes/ # routes
│ │ ├── wip_routes.py # WIP APIs
│ │ ├── resource_routes.py # equipment status APIs
@@ -270,7 +350,10 @@ DashBoard/
│ └── templates/ # HTML templates
├── scripts/ # scripts
│ ├── deploy.sh # deployment script
│ ├── start_server.sh # service management script
│ └── worker_watchdog.py # worker monitor
├── deploy/ # deployment config
│ └── mes-dashboard-watchdog.service # watchdog systemd unit
├── tests/ # tests
├── data/ # data files
├── logs/ # logs
@@ -342,6 +425,20 @@ pytest tests/stress/ -v
## Changelog
### 2026-02-04
- Added the performance monitoring dashboard (`/admin/performance`)
- Added circuit breaker protection (Circuit Breaker)
- Added performance metric collection (P50/P95/P99 latency, slow-query statistics)
- Added SQLite log storage and management
- Added the Worker Watchdog restart mechanism
- Added the unified API response format (success_response/error_response)
- Added 404/500 error page templates
- Fixed a deadlock in the circuit breaker's get_status()
- Fixed a module import error in health_routes.py
- Added the psutil dependency for worker status monitoring
- Added a full test suite (59 performance-related tests)
### 2026-02-03
- Refactored the SQL query management architecture for better security and performance
@@ -393,5 +490,5 @@ pytest tests/stress/ -v
---
**Document version**: 3.0
**Last updated**: 2026-02-03
**Document version**: 4.0
**Last updated**: 2026-02-04


@@ -33,7 +33,7 @@
{
"route": "/resource",
"name": "機台狀態",
"status": "dev"
"status": "released"
},
{
"route": "/excel-query",


@@ -0,0 +1,36 @@
[Unit]
Description=MES Dashboard Worker Watchdog
Documentation=https://github.com/your-org/mes-dashboard
After=network.target mes-dashboard.service
Requires=mes-dashboard.service
[Service]
Type=simple
User=www-data
Group=www-data
WorkingDirectory=/opt/mes-dashboard
Environment="PYTHONPATH=/opt/mes-dashboard/src"
Environment="WATCHDOG_CHECK_INTERVAL=5"
Environment="WATCHDOG_RESTART_FLAG=/tmp/mes_dashboard_restart.flag"
Environment="WATCHDOG_PID_FILE=/tmp/mes_dashboard_gunicorn.pid"
Environment="WATCHDOG_STATE_FILE=/tmp/mes_dashboard_restart_state.json"
ExecStart=/opt/mes-dashboard/venv/bin/python scripts/worker_watchdog.py
# Restart policy
Restart=always
RestartSec=5
# Logging
StandardOutput=journal
StandardError=journal
SyslogIdentifier=mes-watchdog
# Security hardening
NoNewPrivileges=yes
PrivateTmp=yes
ProtectSystem=strict
ReadWritePaths=/tmp
[Install]
WantedBy=multi-user.target


@@ -0,0 +1,2 @@
schema: spec-driven
created: 2026-02-03


@@ -0,0 +1,359 @@
## Context
MES Dashboard is a Flask + Gunicorn + Redis reporting system; the SQL query security refactor is complete. The system runs on:
- Gunicorn: 2 workers × 4 threads
- Oracle Database: connection pool (pool_size=5, max_overflow=10)
- Redis: equipment status cache (refreshed every 30 seconds)
**Current architecture limitations:**
- Error handling is scattered across services/routes with inconsistent formats
- No circuit breaking on database failure, so the connection pool can be exhausted
- Performance visibility is limited to warning logs, with no quantitative tracking
- Administrators must SSH into the server to deal with worker problems
**Stakeholders:**
- End users: need friendly error messages and a stable service
- Administrators: need performance monitoring and emergency controls
- Operations: need observability and diagnostic tooling
## Goals / Non-Goals
**Goals:**
- Standardize the API response format to improve the frontend development experience
- Implement circuit breaking to keep database failures from cascading
- Collect performance metrics and visualize them in a report
- Let administrators restart the service safely from the frontend
- Add a local cache as a second-level fallback behind Redis
**Non-Goals:**
- No distributed tracing
- No integration with external monitoring systems (Prometheus/Grafana)
- No changes to existing API endpoint paths
- No auto-scaling
- No progress reporting for Excel batch queries (requires evaluating frontend/backend architecture changes)
- No historical trend query optimization (precomputation/tiered caching)
## Decisions
### Decision 1: API response format
**Choice:** a standardized envelope format, backward compatible
```python
# success response
{
    "success": True,
    "data": { ... },  # original payload
    "meta": {         # optional metadata
        "timestamp": "...",
        "request_id": "..."
    }
}
# error response
{
    "success": False,
    "error": {
        "code": "DB_CONNECTION_FAILED",          # machine-readable code
        "message": "資料庫連線失敗,請稍後再試",  # user-friendly message
        "details": "ORA-12541: TNS:no listener"  # shown only in development mode
    }
}
```
**Alternatives considered:**
- Fully redesign the API → breaks backward compatibility and requires coordinated frontend changes
- Only add HTTP status codes → not enough error information
**Rationale:** the original `data` structure is preserved and merely wrapped with `success`/`error`, so the frontend can migrate incrementally.
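The envelope above can be produced by a pair of small helpers. The names `success_response` and `error_response` come from this change, but the exact signatures below are an illustrative sketch, not the shipped `core/response.py`:

```python
from typing import Any, Optional


def success_response(data: Any, meta: Optional[dict] = None) -> dict:
    """Wrap a payload in the standard success envelope."""
    body: dict = {"success": True, "data": data}
    if meta is not None:
        body["meta"] = meta  # optional metadata (timestamp, request_id, ...)
    return body


def error_response(code: str, message: str,
                   details: Optional[str] = None) -> dict:
    """Wrap an error in the standard error envelope.

    `details` should only be passed in development mode.
    """
    error: dict = {"code": code, "message": message}
    if details is not None:
        error["details"] = details
    return {"success": False, "error": error}
```

Route handlers would then `jsonify` the returned dict, keeping the HTTP status code semantics unchanged.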
---
### Decision 2: Circuit breaker
**Choice:** a lightweight homegrown circuit breaker based on a sliding-window counter
```
State transitions:
CLOSED → (failure rate > 50% and failures > 5) → OPEN
OPEN → (wait 30 seconds) → HALF_OPEN
HALF_OPEN → (probe succeeds) → CLOSED
HALF_OPEN → (probe fails) → OPEN
```
**Parameters:**
| Parameter | Value | Description |
|------|-----|------|
| failure_threshold | 5 | minimum failures before tripping |
| failure_rate | 0.5 | failure-rate threshold (50%) |
| recovery_timeout | 30s | wait time in the OPEN state |
| window_size | 10 | sliding-window size |
**Counting level:**
- The breaker SHALL be implemented at the `read_sql_df()` level (all SQL queries share a single breaker)
- This ensures all database queries (WIP, Equipment, Hold, etc.) contribute to one failure rate
- While the breaker is OPEN, every query fails immediately, preventing connection-pool exhaustion
- A single breaker keeps state management simple and matches the notion of "overall database health"
**Alternatives considered:**
- Use the `pybreaker` package → adds an external dependency
- Use `tenacity` retries → handles retries only, no circuit breaking
**Rationale:** the requirements are simple; a homegrown breaker gives full control with no external dependency.
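A minimal sketch of the breaker described above: a sliding window over the last `window_size` outcomes, tripping only when both the absolute failure threshold and the failure rate are exceeded. Class and method names are illustrative assumptions, not the actual `core/circuit_breaker.py`:

```python
import threading
import time
from collections import deque


class CircuitBreaker:
    CLOSED, OPEN, HALF_OPEN = "CLOSED", "OPEN", "HALF_OPEN"

    def __init__(self, failure_threshold=5, failure_rate=0.5,
                 recovery_timeout=30.0, window_size=10):
        self.failure_threshold = failure_threshold
        self.failure_rate = failure_rate
        self.recovery_timeout = recovery_timeout
        self.window = deque(maxlen=window_size)  # True = success, False = failure
        self.state = self.CLOSED
        self.opened_at = 0.0
        self._lock = threading.Lock()

    def allow_request(self) -> bool:
        with self._lock:
            if self.state == self.OPEN:
                if time.monotonic() - self.opened_at >= self.recovery_timeout:
                    self.state = self.HALF_OPEN  # let a probe through
                    return True
                return False
            return True  # CLOSED or HALF_OPEN

    def record_success(self) -> None:
        with self._lock:
            if self.state == self.HALF_OPEN:
                self.state = self.CLOSED  # probe succeeded: recover
                self.window.clear()       # reset failure counts
            else:
                self.window.append(True)

    def record_failure(self) -> None:
        with self._lock:
            if self.state == self.HALF_OPEN:
                self._open()  # probe failed: back to OPEN, restart the timer
                return
            self.window.append(False)
            failures = self.window.count(False)
            if (failures >= self.failure_threshold
                    and failures / len(self.window) >= self.failure_rate):
                self._open()

    def _open(self) -> None:
        self.state = self.OPEN
        self.opened_at = time.monotonic()
```

The caller wraps each query in `allow_request()` / `record_success()` / `record_failure()`; the lock keeps the window consistent across Gunicorn threads.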
---
### Decision 3: Performance metrics collection
**Choice:** in-memory sliding window with periodic aggregation
```python
from collections import deque


class QueryMetrics:
    def __init__(self) -> None:
        # keep the latencies (in seconds) of the most recent 1000 queries
        self.latencies: deque = deque(maxlen=1000)

    @staticmethod
    def _percentile(values: list, pct: float) -> float:
        # nearest-rank percentile over an already-sorted list
        if not values:
            return 0.0
        idx = min(int(len(values) * pct / 100), len(values) - 1)
        return values[idx]

    def get_percentiles(self) -> dict:
        s = sorted(self.latencies)
        return {
            "p50": self._percentile(s, 50),
            "p95": self._percentile(s, 95),
            "p99": self._percentile(s, 99),
            "count": len(s),
            "slow_count": sum(1 for v in s if v > 1.0),  # > 1s = slow query
        }
```
**Storage strategy:**
- Real-time metrics: in-memory sliding window
- Historical data is not persisted (keeps complexity down)
- Each worker keeps its own statistics (no cross-worker merging; the frontend shows the current worker PID)
**Alternatives considered:**
- Store in Redis → adds load to Redis
- Use Prometheus → requires extra infrastructure
**Rationale:** a simple scenario does not need a complex solution; in-memory statistics suffice.
---
### Decision 4: Local cache fallback
**Choice:** a TTL-aware LRU cache as a second-level cache behind Redis
```
Cache lookup flow:
1. Query Redis → return on hit
2. Redis failure/miss → query the local LRU cache
3. Local miss → query Oracle
4. Backfill: write to both Redis and the local cache
```
**Local cache parameters:**
| Parameter | Value | Description |
|------|-----|------|
| maxsize | 500 | maximum entries (enough for several cache groups such as WIP, Equipment, Hold) |
| ttl | 60s | expiry time (shorter than Redis) |
**Alternatives considered:**
- Redis only → no fallback when Redis is down
- Use `cachetools.TTLCache` → acceptable, but a homegrown cache is more flexible
**Rationale:** the extra local layer still serves reasonably fresh data when Redis is down.
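The TTL-aware LRU cache can be sketched in a few lines with `collections.OrderedDict`; the class name and API below are assumptions for illustration, not the actual `core/local_cache.py`:

```python
import time
from collections import OrderedDict
from typing import Any, Optional


class TTLLRUCache:
    def __init__(self, maxsize: int = 500, ttl: float = 60.0):
        self.maxsize = maxsize
        self.ttl = ttl
        # key -> (expires_at, value); insertion order doubles as LRU order
        self._data: "OrderedDict[str, tuple]" = OrderedDict()

    def get(self, key: str) -> Optional[Any]:
        item = self._data.get(key)
        if item is None:
            return None
        expires_at, value = item
        if time.monotonic() > expires_at:
            del self._data[key]      # expired entries count as misses
            return None
        self._data.move_to_end(key)  # mark as most recently used
        return value

    def set(self, key: str, value: Any) -> None:
        self._data[key] = (time.monotonic() + self.ttl, value)
        self._data.move_to_end(key)
        while len(self._data) > self.maxsize:
            self._data.popitem(last=False)  # evict least recently used
```

Keys carry a feature prefix (`wip:`, `equipment:`, `hold:`) so all groups share one LRU pool, as decided above.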
---
### Decision 5: Worker restart mechanism
**Choice:** option C - control file + watchdog script
```
Architecture:
┌─────────────┐   writes file    ┌──────────────────┐
│  Flask App  │ ───────────────→ │ /tmp/restart.flag│
│ (Admin API) │                  └────────┬─────────┘
└─────────────┘                           │ monitors
┌─────────────┐      SIGHUP      ┌──────────────────┐
│  Gunicorn   │ ←─────────────── │ worker_watchdog  │
│   Master    │                  │ (standalone py)  │
└─────────────┘                  └──────────────────┘
```
**Flow:**
1. Admin clicks the "Restart service" button
2. Flask writes `/tmp/mes_dashboard_restart.flag`
3. The watchdog script detects the file and sends SIGHUP to the Gunicorn master
4. Gunicorn gracefully reloads all workers
5. The watchdog deletes the flag file
**Safety mechanisms:**
- The API is restricted to admin_required
- 60-second cooldown (prevents repeated triggers)
- Operations are logged to file and database
- The frontend requires a second confirmation
**Alternatives considered:**
- Option A (send SIGHUP directly) → a Flask worker cannot signal the Gunicorn master
- Option B (systemctl) → requires sudo; high security risk
- Option D (dedicated control service) → over-engineered
**Rationale:** safe, decoupled, and needs no special privileges; Flask only writes a file.
---
### Decision 6: Performance report page
**Choice:** integrate into the existing Admin pages; visualize with Chart.js
**Page layout:**
```
┌─────────────────────────────────────────────────────┐
│ Performance Dashboard                   [Refresh]   │
├─────────────────────────────────────────────────────┤
│ System Status                                       │
│ ┌─────────┐ ┌─────────┐ ┌─────────┐ ┌─────────┐     │
│ │ Database│ │ Redis   │ │ Circuit │ │ Workers │     │
│ │   ✅    │ │   ✅    │ │ CLOSED  │ │  2/2    │     │
│ └─────────┘ └─────────┘ └─────────┘ └─────────┘     │
├─────────────────────────────────────────────────────┤
│ Query Performance (last 1000 queries)               │
│ P50: 0.12s  P95: 0.45s  P99: 1.23s  Slow: 15        │
│ [========================================] latency  │
├─────────────────────────────────────────────────────┤
│ Cache Status                                        │
│ Redis: 85% hit rate     Local: 92% hit rate         │
│ Last update: 2026-02-03 16:30:22                    │
├─────────────────────────────────────────────────────┤
│ Service Control                                     │
│ [Restart Workers]  Cooldown: available              │
│ Last restart: 2026-02-03 10:15:00 by admin@example  │
└─────────────────────────────────────────────────────┘
```
**API endpoints:**
- `GET /admin/api/metrics` - fetch performance metrics
- `GET /admin/api/system-status` - fetch system status
- `GET /admin/api/logs` - fetch recent log records
- `GET /admin/api/worker/status` - fetch worker status (cooldown, last_restart, start time)
- `POST /admin/api/worker/restart` - trigger a restart
**Auto refresh:**
- The frontend refreshes automatically every 30 seconds
- Users can disable auto refresh
**Log viewer:**
- Shows the latest N log records (default 200)
- Supports level filtering (INFO/WARNING/ERROR) and keyword search
- Columns include time, level, source, message, and operator (if any)
---
### Decision 7: Deep health check
**Choice:** add a `/health/deep` endpoint that includes latency metrics
```json
{
"status": "healthy",
"checks": {
"database": {
"status": "healthy",
"latency_ms": 12,
"pool_size": 5,
"pool_checked_out": 2
},
"redis": {
"status": "healthy",
"latency_ms": 2
},
"circuit_breaker": {
"database": "CLOSED",
"failures": 0
},
"cache": {
"redis_hit_rate": 0.85,
"local_hit_rate": 0.92,
"last_update": "2026-02-03T16:30:22Z"
}
},
"metrics": {
"query_p50_ms": 120,
"query_p95_ms": 450,
"query_p99_ms": 1230
}
}
```
---
### Decision 8: Admin log storage and viewing
**Choice:** store structured logs in local SQLite for querying from the admin interface
**Strategy:**
- Existing file/STDERR logs are kept (for operations and debugging)
- An additional SQLite log store (a local file) serves admin queries
- SQLite writes are append-only to avoid heavy locking
**Suggested defaults:**
- File location: `logs/admin_logs.sqlite`
- Retention: last 7 days or at most 100,000 rows (adjustable via environment variables)
**Rationale:**
- Administrators need queryable log records
- SQLite ships with the Python standard library; no new third-party dependency
- Keeping file logs preserves the existing operational workflow
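A minimal sketch of such an append-only SQLite log store with retention cleanup; the class, table, and column names are illustrative assumptions, not the actual `core/log_store.py`:

```python
import sqlite3
import time

SCHEMA = """
CREATE TABLE IF NOT EXISTS logs (
    id INTEGER PRIMARY KEY AUTOINCREMENT,
    ts REAL NOT NULL,
    level TEXT NOT NULL,
    source TEXT,
    message TEXT NOT NULL
)
"""


class LogStore:
    def __init__(self, path: str, retention_days: int = 7,
                 max_rows: int = 100_000):
        self.conn = sqlite3.connect(path)
        self.retention_days = retention_days
        self.max_rows = max_rows
        self.conn.execute(SCHEMA)

    def write(self, level: str, message: str, source: str = "") -> None:
        # append-only insert: no updates, so writers never contend on rows
        self.conn.execute(
            "INSERT INTO logs (ts, level, source, message) VALUES (?, ?, ?, ?)",
            (time.time(), level, source, message))
        self.conn.commit()

    def cleanup(self) -> None:
        # drop rows past the retention window, then enforce the row cap
        cutoff = time.time() - self.retention_days * 86400
        self.conn.execute("DELETE FROM logs WHERE ts < ?", (cutoff,))
        self.conn.execute(
            "DELETE FROM logs WHERE id NOT IN "
            "(SELECT id FROM logs ORDER BY id DESC LIMIT ?)",
            (self.max_rows,))
        self.conn.commit()

    def query(self, level: str = "", limit: int = 200) -> list:
        sql = "SELECT ts, level, source, message FROM logs"
        params: list = []
        if level:
            sql += " WHERE level = ?"
            params.append(level)
        sql += " ORDER BY id DESC LIMIT ?"
        params.append(limit)
        return self.conn.execute(sql, params).fetchall()
```

`cleanup()` maps directly to the dashboard's manual "clean up logs" action and to a periodic retention job.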
## Risks / Trade-offs
| Risk | Mitigation |
|------|---------|
| Brief service interruption during worker restart | Gunicorn graceful reload lets in-flight requests finish |
| Stale local cache data | 60-second TTL, shorter than Redis; add version checks |
| Breaker false positives make the service unavailable | sensible thresholds; HALF_OPEN probes recover quickly |
| Memory footprint of performance metrics | deque capped at maxsize=1000, independent per worker |
| Watchdog script failure | supervised by systemd |
| SQLite log file growth | retention days / max rows with periodic cleanup |
| Accidental admin restarts | second confirmation + 60-second cooldown |
## Migration Plan
**Phase 1: infrastructure (no impact)**
1. Add `core/response.py` - API response format
2. Add `core/circuit_breaker.py` - circuit breaker
3. Add `core/metrics.py` - performance metrics
4. Add `core/local_cache.py` - local cache
5. Add `core/log_store.py` - SQLite log store
6. Add unit tests
**Phase 2: integration (low risk)**
1. Integrate the breaker into `database.py` (disabled by default)
2. Integrate the local cache fallback into `cache.py`
3. Environment-variable switch: `CIRCUIT_BREAKER_ENABLED=true`
**Phase 3: API migration (incremental)**
1. New APIs use the new format directly
2. Existing APIs keep their original format (backward compatible)
3. The frontend migrates to the new format as needed
**Phase 4: admin features**
1. Add the performance report page
2. Add the log viewer section and logs API
3. Deploy the watchdog script
4. Add the worker control API
**Rollback strategy:**
- Circuit breaker: disable via `CIRCUIT_BREAKER_ENABLED=false`
- Local cache: disable via `LOCAL_CACHE_ENABLED=false`
- Worker control: simply remove the watchdog script
## Open Questions
1. **Metrics retention**: currently 1000 in-memory samples; should history be persisted?
2. ~~**Merging metrics across workers**~~: decided to keep per-worker statistics; the frontend shows the current worker PID
3. **Watchdog deployment**: systemd service or cron-based monitoring?
4. ~~**API format migration timeline**~~: decided not to force migration; new APIs use the new format, old APIs keep theirs


@@ -0,0 +1,121 @@
## Why
With the SQL query security refactor done, the core features are stable. Code review, however, surfaced the following gaps:
1. **Inconsistent user experience**: error messages leak technical details (e.g. ORA-xxxxx); API response formats vary; batch queries give no progress feedback
2. **Stability risk**: with no circuit breaking, every request still tries to connect while Oracle is down, causing an avalanche; Redis degradation has only a single-layer fallback
3. **Insufficient performance monitoring**: slow queries only produce warnings, with no quantitative tracking; some historical queries still have optimization headroom
4. **Insufficient admin tooling**: administrators have no visual performance monitoring, and worker problems require SSHing into the server
## What Changes
### User experience (UX)
- Add a unified API response format and error-code system
- Layer error messages: user-friendly text vs technical logs
### Stability
- Add a Circuit Breaker to prevent cascading failures
- Extend health checks with a deep check and latency metrics
- Add a local LRU cache as a second-level fallback behind Redis
### Performance
- Collect query performance metrics (P50/P95/P99 latency)
### Admin
- Add a performance report page to Admin that visualizes system metrics
- Add an admin log viewer (filterable/searchable)
- Evaluate and implement a worker restart mechanism triggered from the frontend
## Non-Goals (out of scope)
The following are deferred for later evaluation:
- **Excel batch query progress reporting**: requires assessing frontend/backend changes; may need WebSocket or Server-Sent Events
- **Historical trend query optimization (precomputation/tiered caching)**: needs query metrics first to confirm the bottleneck
## Capabilities
### New Capabilities
- `api-response-format`: unified API response format and error-code system with a consistent success/failure envelope
- `circuit-breaker`: database connection circuit breaking to prevent cascading failures and resource exhaustion
- `query-metrics`: query performance metric collection and monitoring; tracks latency distribution and slow-query statistics
- `local-cache-fallback`: local LRU in-memory cache as a second-level fallback when Redis is unavailable
- `admin-performance-dashboard`: admin performance report page showing system health, metrics, breaker state, and recent log records
- `admin-worker-control`: admin service control with a worker restart mechanism (security and feasibility to be evaluated)
### Modified Capabilities
- `health-check`: extended deep check with latency metrics, cache freshness checks, and breaker state
## Impact
### Code changes
**New files:**
- `src/mes_dashboard/core/response.py` - API response format and error codes
- `src/mes_dashboard/core/circuit_breaker.py` - circuit breaker implementation
- `src/mes_dashboard/core/metrics.py` - performance metric collection
- `src/mes_dashboard/core/local_cache.py` - local LRU cache
- `src/mes_dashboard/core/worker_control.py` - worker control module (implemented after evaluation)
- `src/mes_dashboard/templates/admin/performance.html` - performance report page
- `src/mes_dashboard/core/log_store.py` - SQLite log access and queries
- `scripts/worker_watchdog.py` - worker monitoring and restart service (optional, per architecture decision)
**Modified files:**
- `src/mes_dashboard/core/database.py` - integrate the circuit breaker
- `src/mes_dashboard/core/cache.py` - integrate the local cache fallback
- `src/mes_dashboard/routes/health_routes.py` - extend health checks
- `src/mes_dashboard/routes/admin_routes.py` - add performance report, log viewer, and service control routes
- `src/mes_dashboard/services/*.py` - unify error responses
- `src/mes_dashboard/routes/*.py` - unify API responses
### API impact
- All API responses adopt the unified format while staying backward compatible (existing fields preserved)
- New `GET /health/deep` deep health check endpoint
- New `GET /admin/api/metrics` performance metrics endpoint
- New `GET /admin/performance` performance report page
- New `GET /admin/api/logs` recent-log query API
- New `GET /admin/api/worker/status` worker status API
- New `POST /admin/api/worker/restart` worker restart API (to be evaluated)
### Dependencies
- No new third-party Python dependency; standard library only (including `sqlite3`)
- Circuit breaker: `threading` + `time`
- Local cache: `functools.lru_cache` or a custom TTL cache
- Metrics: sliding window via `collections.deque`
- Admin log viewer: SQLite storage (local file)
- Frontend charts: Chart.js (frontend dependency)
- Worker control: options under evaluation (see below)
### Worker restart mechanism evaluation
**Options:**
| Option | Description | Pros | Cons |
|------|------|------|------|
| A. Gunicorn SIGHUP | signal-triggered graceful reload | simple, native | Flask cannot signal its parent process directly |
| B. Supervisor/Systemd | call systemctl via subprocess | standard practice | requires sudo configuration |
| C. Control file + watchdog | write a flag file; an external script monitors and restarts | safe, decoupled | needs an extra monitoring script |
| D. Dedicated control service | a small HTTP service just for restarts | fully isolated | adds architectural complexity |
**Recommendation:** evaluate the security and feasibility of each option during Design and pick the best fit.
### Security considerations
- The worker restart API must be admin-only
- Operations should be audited (who, when, from which IP)
- Consider a confirmation step or cooldown to prevent accidental triggers
- Evaluate whether re-authentication (e.g. password re-entry) is needed
### Tests
- `tests/test_api_response.py` - API response format tests
- `tests/test_circuit_breaker.py` - breaker state-transition tests
- `tests/test_query_metrics.py` - metric collection tests
- `tests/test_local_cache.py` - local cache tests
- `tests/test_admin_performance.py` - performance report API tests
- `tests/test_admin_logs.py` - admin log viewer API tests
- `tests/test_worker_control.py` - worker control tests (mocked)


@@ -0,0 +1,160 @@
## ADDED Requirements
### Requirement: Performance report page
The system SHALL provide an admin performance report page.
#### Scenario: Access the performance report page
- **WHEN** an administrator requests `GET /admin/performance`
- **THEN** the system SHALL render the performance report page
- **AND** the HTTP status code SHALL be 200
#### Scenario: Non-admin access denied
- **WHEN** a non-administrator requests `GET /admin/performance`
- **THEN** the system SHALL redirect to the login page
- **OR** the HTTP status code SHALL be 403
---
### Requirement: System status display
The performance report page SHALL show the health of each system component.
#### Scenario: Database status
- **WHEN** the performance report page loads
- **THEN** the page SHALL show the database connection status
- **AND** the status SHALL be ✅ (healthy) or ❌ (unhealthy)
#### Scenario: Redis status
- **WHEN** the performance report page loads
- **THEN** the page SHALL show the Redis connection status
- **AND** if Redis is disabled, it SHALL show "disabled"
#### Scenario: Circuit breaker status
- **WHEN** the performance report page loads
- **THEN** the page SHALL show the circuit breaker state
- **AND** the state SHALL be CLOSED, OPEN, or HALF_OPEN
#### Scenario: Worker display
- **WHEN** the performance report page loads
- **THEN** the page SHALL show the PID of the responding worker
---
### Requirement: Performance metrics display
The performance report page SHALL show query performance metrics.
#### Scenario: Latency percentiles
- **WHEN** the performance report page loads
- **THEN** the page SHALL show P50, P95, and P99 latency values
- **AND** the unit SHALL be milliseconds or seconds
#### Scenario: Slow query statistics
- **WHEN** the performance report page loads
- **THEN** the page SHALL show the slow query count
- **AND** SHALL show the slow query rate
#### Scenario: Latency distribution chart
- **WHEN** the performance report page loads
- **THEN** the page SHALL show a latency distribution chart
- **AND** the chart SHALL use Chart.js or a similar tool
---
### Requirement: Cache status display
The performance report page SHALL show cache health.
#### Scenario: Redis cache hit rate
- **WHEN** the performance report page loads
- **THEN** the page SHALL show the Redis cache hit rate
#### Scenario: Local cache hit rate
- **WHEN** the performance report page loads
- **THEN** the page SHALL show the local cache hit rate
#### Scenario: Last cache update time
- **WHEN** the performance report page loads
- **THEN** the page SHALL show the last cache update time
---
### Requirement: Auto refresh
The performance report page SHALL support automatic refresh.
#### Scenario: Manual refresh
- **WHEN** the "Refresh" button is clicked
- **THEN** the page SHALL reload all metric data
- **AND** SHALL NOT do a full page reload (use AJAX)
#### Scenario: Auto refresh interval
- **WHEN** auto refresh is enabled
- **THEN** the page SHALL update metrics every 30 seconds
- **AND** users SHALL be able to disable auto refresh
---
### Requirement: System status API
The system SHALL provide an API for system status information.
#### Scenario: Fetch system status
- **WHEN** `GET /admin/api/system-status` is called
- **AND** the user is an administrator
- **THEN** the response SHALL include:
- `database`: database status
- `redis`: Redis status
- `circuit_breaker`: circuit breaker state
- `cache`: cache status
- `worker_pid`: current worker PID
---
### Requirement: Log viewer
The performance report page SHALL show recent log records.
#### Scenario: Show recent logs
- **WHEN** an administrator loads the performance report page
- **THEN** the page SHALL show the latest N log records (default 200)
- **AND** each record SHALL show time, level, source, and message
#### Scenario: Filter and search
- **WHEN** an administrator selects a level (INFO/WARNING/ERROR) or enters a keyword
- **THEN** the page SHALL update the results immediately
---
### Requirement: Log API
The system SHALL provide an API for recent log records.
#### Scenario: Fetch log records
- **WHEN** `GET /admin/api/logs` is called
- **AND** the user is an administrator
- **THEN** the response SHALL include a list of log records
- **AND** the HTTP status code SHALL be 200
#### Scenario: Log API query parameters
- **WHEN** `GET /admin/api/logs` is called with query parameters
- **THEN** the API SHALL support:
- `level`: level filter (INFO/WARNING/ERROR)
- `q`: keyword search
- `limit`: number of rows returned (default 200)
- `since`: start time (ISO-8601)
#### Scenario: Non-admin access denied
- **WHEN** a non-administrator calls `GET /admin/api/logs`
- **THEN** the HTTP status code SHALL be 403
---
### Requirement: Log storage
The system SHALL write logs to local SQLite for admin queries.
#### Scenario: Write to the SQLite log store
- **WHEN** the system produces a log record
- **THEN** the log SHALL be written to the local SQLite log store
- **AND** SHALL be queryable via `GET /admin/api/logs`

@@ -0,0 +1,116 @@
## ADDED Requirements
### Requirement: Worker restart trigger
The system SHALL allow administrators to trigger a worker restart from the frontend.
#### Scenario: Submit a restart request
- **WHEN** an administrator calls `POST /admin/api/worker/restart`
- **AND** the user is an administrator
- **THEN** the system SHALL write the restart flag file
- **AND** the HTTP status code SHALL be 202 (Accepted)
- **AND** the response SHALL include `"message": "重啟請求已提交"`
#### Scenario: Non-admin denied
- **WHEN** a non-administrator calls `POST /admin/api/worker/restart`
- **THEN** the HTTP status code SHALL be 403
- **AND** the operation SHALL NOT be performed
---
### Requirement: Restart cooldown
The system SHALL implement a restart cooldown to prevent frequent restarts.
#### Scenario: Rejected during cooldown
- **WHEN** an administrator calls `POST /admin/api/worker/restart`
- **AND** less than 60 seconds have passed since the last restart
- **THEN** the HTTP status code SHALL be 429 (Too Many Requests)
- **AND** the response SHALL include the remaining cooldown seconds
#### Scenario: Allowed after cooldown
- **WHEN** an administrator calls `POST /admin/api/worker/restart`
- **AND** more than 60 seconds have passed since the last restart
- **THEN** the restart request SHALL be accepted
#### Scenario: Query cooldown status
- **WHEN** `GET /admin/api/worker/status` is called
- **THEN** the response SHALL include:
- `cooldown_remaining`: remaining cooldown seconds (0 means available)
- `last_restart`: time of the last restart
- `last_restart_by`: who triggered the last restart
---
### Requirement: Restart audit log
The system SHALL log every restart operation.
#### Scenario: Recorded fields
- **WHEN** an administrator triggers a restart
- **THEN** the system SHALL record:
- operator (email/username)
- operation time
- source IP address
- operation result
#### Scenario: Log destinations
- **WHEN** a restart operation is recorded
- **THEN** the log SHALL be written to the system log (INFO level)
- **AND** SHALL be written to a dedicated operations log file
---
### Requirement: Frontend confirmation
The performance report page SHALL implement a restart confirmation flow.
#### Scenario: Show the confirmation dialog
- **WHEN** an administrator clicks the "Restart Workers" button
- **THEN** the system SHALL show a confirmation dialog
- **AND** the dialog SHALL warn that the operation briefly affects the service
#### Scenario: Proceed after confirmation
- **WHEN** the administrator clicks "OK" in the confirmation dialog
- **THEN** the system SHALL send the restart request
#### Scenario: Cancel
- **WHEN** the administrator clicks "Cancel" in the confirmation dialog
- **THEN** the system SHALL NOT send the restart request
---
### Requirement: Watchdog script
The system SHALL provide a watchdog script that monitors the restart flag file.
#### Scenario: Monitor the flag file
- **WHEN** the watchdog script is running
- **THEN** the script SHALL check `/tmp/mes_dashboard_restart.flag` every 5 seconds
#### Scenario: Flag file detected
- **WHEN** the watchdog detects the flag file
- **THEN** the script SHALL send SIGHUP to the Gunicorn master
- **AND** SHALL delete the flag file
- **AND** SHALL log the restart event
#### Scenario: Gunicorn graceful reload
- **WHEN** the Gunicorn master receives SIGHUP
- **THEN** Gunicorn SHALL perform a graceful reload
- **AND** in-flight requests SHALL complete before a worker terminates
- **AND** new workers SHALL start and take over
---
### Requirement: Restart status reporting
The system SHALL provide a way to confirm that a restart completed.
#### Scenario: Query the worker start time
- **WHEN** `GET /admin/api/worker/status` is called
- **THEN** the response SHALL include the current worker's start time
#### Scenario: Frontend restart feedback
- **WHEN** a restart request has been submitted
- **THEN** the frontend SHALL poll the worker status
- **AND** SHALL show "Restarting..." until a new worker is detected


@@ -0,0 +1,140 @@
## ADDED Requirements
### Requirement: Unified success response format
The system SHALL use a unified envelope format for all successful API responses.
#### Scenario: Success responses carry a success flag
- **WHEN** an API request succeeds
- **THEN** the response body SHALL include `"success": true`
- **AND** the original payload SHALL be placed under the `data` field
#### Scenario: Success response example
- **WHEN** `GET /api/dashboard/kpi` succeeds
- **THEN** the response format SHALL be:
```json
{
"success": true,
"data": {
"total": 100,
"prd": 50,
...
}
}
```
---
### Requirement: Unified error response format
The system SHALL use a unified error format for all failed API responses.
#### Scenario: Error responses carry an error code
- **WHEN** an API request fails
- **THEN** the response body SHALL include `"success": false`
- **AND** SHALL include an `error` object
- **AND** `error.code` SHALL be a machine-readable error code
- **AND** `error.message` SHALL be a user-friendly message in Chinese
#### Scenario: Error response example
- **WHEN** the database connection fails
- **THEN** the response format SHALL be:
```json
{
"success": false,
"error": {
"code": "DB_CONNECTION_FAILED",
"message": "資料庫連線失敗,請稍後再試"
}
}
```
#### Scenario: Detailed errors in development mode
- **WHEN** `FLASK_ENV=development`
- **AND** an API request fails
- **THEN** the `error` object SHALL additionally include a `details` field
- **AND** `details` SHALL contain the technical error message (e.g. ORA-xxxxx)
#### Scenario: Detailed errors hidden in production
- **WHEN** `FLASK_ENV=production`
- **AND** an API request fails
- **THEN** the `error` object SHALL NOT include a `details` field
---
### Requirement: Standard error codes
The system SHALL define and use standardized error codes.
#### Scenario: Database connection errors
- **WHEN** the database connection fails
- **THEN** the error code SHALL be `DB_CONNECTION_FAILED`
#### Scenario: Database query timeout
- **WHEN** a database query exceeds 55 seconds
- **THEN** the error code SHALL be `DB_QUERY_TIMEOUT`
#### Scenario: Circuit breaker open
- **WHEN** the circuit breaker is in the OPEN state
- **THEN** the error code SHALL be `SERVICE_UNAVAILABLE`
#### Scenario: Validation failure
- **WHEN** request parameter validation fails
- **THEN** the error code SHALL be `VALIDATION_ERROR`
#### Scenario: Unauthorized
- **WHEN** the user is not logged in or the session has expired
- **THEN** the error code SHALL be `UNAUTHORIZED`
#### Scenario: Forbidden
- **WHEN** the user lacks the required permission
- **THEN** the error code SHALL be `FORBIDDEN`
---
### Requirement: Global error handling
The system SHALL handle all uncaught errors uniformly at the middleware level.
#### Scenario: Authentication middleware rejection
- **WHEN** the authentication middleware (`@app.before_request` in `create_app`) rejects a request
- **THEN** the response SHALL follow the unified error format
- **AND** the error code SHALL be `UNAUTHORIZED` or `FORBIDDEN`
#### Scenario: Unhandled exceptions
- **WHEN** a route handler raises an uncaught exception
- **THEN** the Flask error handler SHALL intercept it
- **AND** the response SHALL follow the unified error format
- **AND** the error code SHALL be `INTERNAL_ERROR`
#### Scenario: 404 handling
- **WHEN** the requested route does not exist
- **THEN** the response SHALL follow the unified error format
- **AND** the error code SHALL be `NOT_FOUND`
#### Scenario: Global error handler registration
- **WHEN** the Flask application initializes
- **THEN** `create_app()` SHALL register the following error handlers:
- `@app.errorhandler(401)` - unauthorized
- `@app.errorhandler(403)` - forbidden
- `@app.errorhandler(404)` - not found
- `@app.errorhandler(500)` - server error
- `@app.errorhandler(Exception)` - all uncaught exceptions
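A framework-free sketch of the shared body those handlers could delegate to; the `STATUS_TO_CODE` map and `error_body` helper are illustrative assumptions, and the Flask wiring itself is omitted:

```python
from typing import Optional

# map HTTP status codes to the standard error codes defined in this spec
STATUS_TO_CODE = {
    401: "UNAUTHORIZED",
    403: "FORBIDDEN",
    404: "NOT_FOUND",
    500: "INTERNAL_ERROR",
}


def error_body(status: int, message: str, details: Optional[str] = None,
               development: bool = False) -> dict:
    """Build the unified error envelope for an HTTP error status."""
    error = {
        "code": STATUS_TO_CODE.get(status, "INTERNAL_ERROR"),
        "message": message,
    }
    if development and details:
        error["details"] = details  # technical detail, development mode only
    return {"success": False, "error": error}
```

Each registered handler would return `jsonify(error_body(...)), status`, so every error path emits the same envelope.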
---
### Requirement: Backward compatibility
The system SHALL stay backward compatible with existing APIs.
#### Scenario: Original fields preserved
- **WHEN** the new response format is used
- **THEN** the fields originally returned by an API SHALL be preserved intact under `data`
- **AND** field names and types SHALL NOT change
#### Scenario: HTTP status codes preserved
- **WHEN** an API response uses the new format
- **THEN** HTTP status codes SHALL keep their original semantics
- **AND** success SHALL return 2xx
- **AND** client errors SHALL return 4xx
- **AND** server errors SHALL return 5xx


@@ -0,0 +1,91 @@
## ADDED Requirements
### Requirement: Circuit breaker state management
The system SHALL implement the Circuit Breaker pattern to manage the breaker state for database connections.
#### Scenario: Initial state is CLOSED
- **WHEN** the system starts
- **THEN** the breaker state SHALL be `CLOSED`
- **AND** all database requests SHALL execute normally
#### Scenario: Accumulated failures trip to OPEN
- **WHEN** the breaker is in the `CLOSED` state
- **AND** failures within the sliding window >= 5
- **AND** the failure rate >= 50%
- **THEN** the breaker state SHALL transition to `OPEN`
#### Scenario: OPEN rejects requests
- **WHEN** the breaker is in the `OPEN` state
- **AND** a database request arrives
- **THEN** the system SHALL return an error immediately
- **AND** the error code SHALL be `SERVICE_UNAVAILABLE`
- **AND** SHALL NOT attempt a database connection
#### Scenario: OPEN transitions to HALF_OPEN
- **WHEN** the breaker is in the `OPEN` state
- **AND** 30 seconds (recovery_timeout) have elapsed
- **THEN** the breaker state SHALL transition to `HALF_OPEN`
#### Scenario: HALF_OPEN probe succeeds
- **WHEN** the breaker is in the `HALF_OPEN` state
- **AND** the probe request succeeds
- **THEN** the breaker state SHALL transition to `CLOSED`
- **AND** the failure count SHALL reset to 0
#### Scenario: HALF_OPEN probe fails
- **WHEN** the breaker is in the `HALF_OPEN` state
- **AND** the probe request fails
- **THEN** the breaker state SHALL transition to `OPEN`
- **AND** recovery_timeout SHALL restart
---
### Requirement: Circuit breaker configuration
The system SHALL support configuring the breaker via environment variables.
#### Scenario: Default values
- **WHEN** no breaker environment variables are set
- **THEN** failure_threshold SHALL be 5
- **AND** failure_rate SHALL be 0.5 (50%)
- **AND** recovery_timeout SHALL be 30 seconds
- **AND** window_size SHALL be 10
#### Scenario: Environment variable override
- **WHEN** `CIRCUIT_BREAKER_FAILURE_THRESHOLD=10` is set
- **THEN** failure_threshold SHALL be 10
#### Scenario: Disable the breaker
- **WHEN** `CIRCUIT_BREAKER_ENABLED=false` is set
- **THEN** the breaker SHALL be disabled
- **AND** all requests SHALL execute directly without breaker checks
---
### Requirement: Circuit breaker status query
The system SHALL provide an API to query the breaker status.
#### Scenario: Query breaker status
- **WHEN** the internal method `get_circuit_breaker_status()` is called
- **THEN** the return value SHALL include:
- `state`: current state (CLOSED/OPEN/HALF_OPEN)
- `failure_count`: current failure count
- `success_count`: current success count
- `last_failure_time`: time of the last failure
---
### Requirement: Breaker event logging
The system SHALL log breaker state changes.
#### Scenario: Log state transitions
- **WHEN** the breaker state changes
- **THEN** the system SHALL write a WARNING-level log
- **AND** the log SHALL include the previous state, the new state, and the trigger reason
#### Scenario: Log OPEN events
- **WHEN** the breaker transitions to the `OPEN` state
- **THEN** the log message SHALL include the failure count and failure rate


@@ -0,0 +1,150 @@
## ADDED Requirements
### Requirement: Deep health check endpoint
The system SHALL provide a `/health/deep` endpoint that reports detailed system health.
#### Scenario: Deep check response format
- **WHEN** `GET /health/deep` is called
- **THEN** the response body SHALL include:
```json
{
"status": "healthy",
"checks": {
"database": { ... },
"redis": { ... },
"circuit_breaker": { ... },
"cache": { ... }
},
"metrics": { ... }
}
```
#### Scenario: Deep check requires authentication
- **WHEN** `GET /health/deep` is called
- **AND** the user is not logged in
- **THEN** the HTTP status code SHALL be 401
#### Scenario: Admin access
- **WHEN** `GET /health/deep` is called
- **AND** the user is an administrator
- **THEN** the HTTP status code SHALL be 200
- **AND** the response SHALL include the full details
#### Scenario: Non-admin denied
- **WHEN** `GET /health/deep` is called
- **AND** the user is logged in but not an administrator
- **THEN** the HTTP status code SHALL be 403
- **AND** the response SHALL follow the unified error format
#### Scenario: Implementation
- **WHEN** implementing the `/health/deep` endpoint
- **THEN** the route SHALL use the `@admin_required` decorator
- **AND** the decorator SHALL handle authentication and authorization
---
### Requirement: Latency metric checks
The deep health check SHALL include per-service latency metrics.
#### Scenario: Database latency
- **WHEN** the deep health check runs
- **THEN** `checks.database` SHALL include `latency_ms`
- **AND** `latency_ms` SHALL be the measured time of a ping query
#### Scenario: Redis latency
- **WHEN** the deep health check runs
- **AND** Redis is enabled
- **THEN** `checks.redis` SHALL include `latency_ms`
- **AND** `latency_ms` SHALL be the measured time of a PING
#### Scenario: Latency warning threshold
- **WHEN** database latency exceeds 100ms
- **THEN** `checks.database.status` SHALL be `"slow"`
- **AND** the `warnings` array SHALL include a latency warning
---
### Requirement: Connection pool checks
The deep health check SHALL include the database connection pool status.
#### Scenario: Pool information
- **WHEN** the deep health check runs
- **THEN** `checks.database` SHALL include:
- `pool_size`: configured pool size
- `pool_checked_out`: connections currently checked out
- `pool_overflow`: current overflow connections
#### Scenario: Pool exhaustion warning
- **WHEN** `pool_checked_out` + `pool_overflow` >= `pool_size` + `max_overflow`
- **THEN** the `warnings` array SHALL include a pool exhaustion warning
---
### Requirement: Circuit breaker checks
The deep health check SHALL include the circuit breaker state.
#### Scenario: Breaker healthy
- **WHEN** the deep health check runs
- **AND** the breaker state is CLOSED
- **THEN** `checks.circuit_breaker` SHALL include:
```json
{
"database": "CLOSED",
"failures": 0
}
```
#### Scenario: Breaker OPEN
- **WHEN** the deep health check runs
- **AND** the breaker state is OPEN
- **THEN** `checks.circuit_breaker.database` SHALL be `"OPEN"`
- **AND** the overall `status` SHALL be `"degraded"` or `"unhealthy"`
- **AND** `warnings` SHALL include a breaker warning
---
### Requirement: Cache freshness checks
The deep health check SHALL verify cache data freshness.
#### Scenario: Cache fresh
- **WHEN** the deep health check runs
- **AND** the cache was updated within the last 2 minutes
- **THEN** `checks.cache.status` SHALL be `"fresh"`
#### Scenario: Cache stale
- **WHEN** the deep health check runs
- **AND** the cache was last updated more than 2 minutes ago
- **THEN** `checks.cache.status` SHALL be `"stale"`
- **AND** `warnings` SHALL include a stale-cache warning
#### Scenario: Local cache status
- **WHEN** the deep health check runs
- **AND** the local cache is enabled
- **THEN** `checks.cache` SHALL include:
- `local_enabled`: true
- `local_hit_rate`: local cache hit rate
- `local_size`: local cache entry count
---
### Requirement: Metrics summary
The deep health check SHALL include a performance metrics summary.
#### Scenario: Latency percentiles included
- **WHEN** the deep health check runs
- **THEN** `metrics` SHALL include:
- `query_p50_ms`: P50 query latency
- `query_p95_ms`: P95 query latency
- `query_p99_ms`: P99 query latency
- `slow_query_count`: slow query count
#### Scenario: Empty metrics
- **WHEN** the deep health check runs
- **AND** there are no query records yet
- **THEN** each `metrics` field SHALL be 0


@@ -0,0 +1,98 @@
## ADDED Requirements
### Requirement: Local LRU cache
The system SHALL implement a local LRU cache as a second-level fallback behind Redis.
#### Scenario: Cache lookup order
- **WHEN** cached data is requested
- **THEN** the system SHALL query Redis first
- **AND** on a Redis miss or failure it SHALL query the local cache
- **AND** on a local cache miss it SHALL query Oracle
#### Scenario: Cache backfill
- **WHEN** data is fetched from Oracle
- **THEN** the system SHALL write it to both Redis and the local cache
- **AND** the local cache TTL SHALL be 60 seconds
---
### Requirement: Local cache capacity limit
The system SHALL bound the memory used by the local cache.
#### Scenario: Default maximum entry count
- **WHEN** the `LOCAL_CACHE_MAXSIZE` environment variable is not set
- **THEN** the local cache SHALL default to a maximum of 500 entries
- **AND** this is large enough to hold the WIP status, equipment list, Hold Summary, and other cache groups
#### Scenario: Maximum entry count enforced
- **WHEN** the local cache has reached its maxsize limit
- **AND** a new entry is added
- **THEN** the system SHALL evict the least recently used (LRU) entry
- **AND** the entry count SHALL remain <= maxsize
#### Scenario: Environment variable override
- **WHEN** `LOCAL_CACHE_MAXSIZE=1000` is set
- **THEN** the local cache maximum entry count SHALL be 1000
#### Scenario: Cache key design
- **WHEN** a cache entry is created
- **THEN** the cache key SHALL carry a feature prefix (e.g. `wip:`, `equipment:`, `hold:`)
- **AND** caches for different features SHALL share a single LRU pool
- **AND** the LRU policy SHALL evict the least recently used entry regardless of feature type
---
### Requirement: Local cache TTL
The system SHALL expire local cache entries after a time-to-live.
#### Scenario: Default TTL
- **WHEN** no TTL environment variable is set
- **THEN** the local cache TTL SHALL be 60 seconds
#### Scenario: Expired entry handling
- **WHEN** the local cache is queried
- **AND** the entry has expired (exceeded its TTL)
- **THEN** the system SHALL treat it as a miss
- **AND** SHALL remove the expired entry
#### Scenario: TTL shorter than Redis
- **WHEN** the Redis cache TTL is N seconds
- **THEN** the local cache TTL SHALL be < N
- **AND** this ensures local cache data never lags far behind Redis
---
### Requirement: Cache disable switch
The system SHALL support disabling the local cache via an environment variable.
#### Scenario: Disable the local cache
- **WHEN** `LOCAL_CACHE_ENABLED=false` is set
- **THEN** the local cache SHALL be disabled
- **AND** cache lookups SHALL go straight to Redis and then Oracle
#### Scenario: Enabled by default
- **WHEN** `LOCAL_CACHE_ENABLED` is not set
- **THEN** the local cache SHALL be enabled by default
---
### Requirement: Cache hit rate statistics
The system SHALL track the local cache hit rate.
#### Scenario: Record hits and misses
- **WHEN** the local cache is queried
- **THEN** the system SHALL record whether it was a hit
- **AND** the statistics SHALL be kept in memory
#### Scenario: Query the hit rate
- **WHEN** `get_local_cache_stats()` is called
- **THEN** the return value SHALL include:
  - `hits`: hit count
  - `misses`: miss count
  - `hit_rate`: hit rate (hits / (hits + misses))
  - `size`: current entry count
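The requirements above (bounded size, LRU eviction, per-entry TTL, hit/miss statistics) can be sketched in a few lines with `collections.OrderedDict`. This is an illustrative sketch only; `LocalLRUCache` and its method names are assumptions, not the actual `core/cache.py` API:

```python
import time
from collections import OrderedDict

class LocalLRUCache:
    """TTL-aware LRU cache sketch (illustrative, not the real core/cache.py)."""

    def __init__(self, maxsize=500, ttl=60.0):
        self.maxsize = maxsize
        self.ttl = ttl
        self._store = OrderedDict()  # key -> (value, expires_at), oldest first
        self.hits = 0
        self.misses = 0

    def get(self, key):
        item = self._store.get(key)
        if item is not None:
            value, expires_at = item
            if time.monotonic() < expires_at:
                self._store.move_to_end(key)  # mark as most recently used
                self.hits += 1
                return value
            del self._store[key]  # expired entries count as misses and are removed
        self.misses += 1
        return None

    def set(self, key, value):
        self._store[key] = (value, time.monotonic() + self.ttl)
        self._store.move_to_end(key)
        while len(self._store) > self.maxsize:
            self._store.popitem(last=False)  # evict the least recently used entry

    def stats(self):
        total = self.hits + self.misses
        return {
            "hits": self.hits,
            "misses": self.misses,
            "hit_rate": self.hits / total if total else 0.0,
            "size": len(self._store),
        }
```

Keys would carry the feature prefixes from the spec (`wip:`, `equipment:`, `hold:`) while sharing this single pool.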

View File

@@ -0,0 +1,111 @@
## ADDED Requirements
### Requirement: Query latency collection
The system SHALL collect the latency of every database query.
#### Scenario: Record query latency
- **WHEN** a database query is executed
- **THEN** the system SHALL record the query duration in milliseconds
- **AND** the record SHALL be stored in an in-memory sliding window
#### Scenario: Sliding window size limit
- **WHEN** more than 1000 queries have been recorded
- **THEN** the system SHALL automatically drop the oldest records
- **AND** the window SHALL hold at most 1000 records
---
### Requirement: Latency percentile calculation
The system SHALL compute percentile statistics over query latencies.
#### Scenario: Compute P50/P95/P99
- **WHEN** `get_query_metrics()` is called
- **THEN** the return value SHALL include:
  - `p50_ms`: 50th percentile latency
  - `p95_ms`: 95th percentile latency
  - `p99_ms`: 99th percentile latency
  - `count`: sample count
  - `slow_count`: number of slow queries (latency > 1 second)
#### Scenario: Empty data handling
- **WHEN** no queries have been recorded yet
- **THEN** all percentiles SHALL return 0
- **AND** `count` SHALL be 0
---
### Requirement: Slow query statistics
The system SHALL track the number and proportion of slow queries.
#### Scenario: Slow query definition
- **WHEN** a query takes longer than 1000 milliseconds
- **THEN** that query SHALL be flagged as slow
#### Scenario: Slow query rate calculation
- **WHEN** `get_query_metrics()` is called
- **THEN** the return value SHALL include `slow_rate`
- **AND** `slow_rate` SHALL equal `slow_count / count`
---
### Requirement: Metrics API endpoint
The system SHALL expose an API endpoint for performance metrics.
#### Scenario: Fetch performance metrics
- **WHEN** `GET /admin/api/metrics` is called
- **AND** the user is an administrator
- **THEN** the response SHALL include query latency statistics
- **AND** the HTTP status code SHALL be 200
#### Scenario: Non-administrators forbidden
- **WHEN** `GET /admin/api/metrics` is called
- **AND** the user is not an administrator
- **THEN** the HTTP status code SHALL be 403
---
### Requirement: Per-worker statistics
The system SHALL collect metrics independently in each Gunicorn worker.
#### Scenario: Independent per-worker statistics
- **WHEN** the system runs multiple workers
- **THEN** each worker SHALL maintain its own metrics data
- **AND** percentile calculations SHALL be based on that worker's samples
#### Scenario: API returns the current worker's metrics
- **WHEN** `GET /admin/api/metrics` is called
- **THEN** the response SHALL indicate which worker (PID) produced the data
- **AND** the response SHALL include a `worker_pid` field
#### Scenario: Known limitation - metric jitter
- **GIVEN** the system runs N workers (N > 1)
- **WHEN** `GET /admin/api/metrics` is called repeatedly
- **THEN** values may differ between calls because the load balancer routes them to different workers
- **AND** this is a known and accepted behavioral limitation
---
### Requirement: Shared counters (optional optimization)
When Redis is available, the system MAY share key counters through Redis.
#### Scenario: Redis-shared totals
- **WHEN** Redis is enabled
- **THEN** the `total_queries` counter MAY use the Redis INCR command
- **AND** the `slow_queries` counter MAY use the Redis INCR command
- **AND** these counters SHALL be shared across all workers
#### Scenario: Degrade when Redis is unavailable
- **WHEN** Redis is unavailable or disabled
- **THEN** the system SHALL fall back to purely per-worker statistics
- **AND** the feature SHALL keep working
#### Scenario: Percentiles stay per-worker
- **WHEN** Redis-shared counters are in use
- **THEN** latency percentiles (P50/P95/P99) SHALL remain per-worker
- **AND** percentile calculation needs the full sample set, which is not practical to share across workers
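The sliding-window and empty-data rules above can be sketched as follows. The tasks list says the real `core/metrics.py` uses `deque(maxlen=1000)`; everything else here (the `QueryMetrics`/`snapshot` names, the nearest-rank percentile) is an illustrative assumption:

```python
import math
import os
from collections import deque

SLOW_QUERY_THRESHOLD_MS = 1000.0  # latency above 1 second counts as a slow query

def _percentile(ordered, p):
    """Nearest-rank percentile over an ascending, non-empty list."""
    rank = max(1, math.ceil(p / 100.0 * len(ordered)))
    return ordered[rank - 1]

class QueryMetrics:
    """Per-worker latency collector sketch using a bounded sliding window."""

    def __init__(self, window=1000):
        # deque(maxlen=...) drops the oldest sample automatically
        self._samples = deque(maxlen=window)

    def record(self, elapsed_ms):
        self._samples.append(elapsed_ms)

    def snapshot(self):
        n = len(self._samples)
        if n == 0:  # empty-data handling: every field reports 0
            return {"p50_ms": 0, "p95_ms": 0, "p99_ms": 0,
                    "count": 0, "slow_count": 0, "slow_rate": 0,
                    "worker_pid": os.getpid()}
        ordered = sorted(self._samples)
        slow = sum(1 for s in ordered if s > SLOW_QUERY_THRESHOLD_MS)
        return {"p50_ms": _percentile(ordered, 50),
                "p95_ms": _percentile(ordered, 95),
                "p99_ms": _percentile(ordered, 99),
                "count": n, "slow_count": slow, "slow_rate": slow / n,
                "worker_pid": os.getpid()}
```

Because the deque lives in process memory, each Gunicorn worker naturally gets its own window, which is exactly the per-worker behavior (and jitter) the spec accepts.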

View File

@@ -0,0 +1,161 @@
## 1. Infrastructure modules
- [x] 1.1 Create `core/response.py` - API response format helpers
  - Implement the `success_response(data, meta=None)` function
  - Implement the `error_response(code, message, details=None)` function
  - Define standard error code constants (DB_CONNECTION_FAILED, DB_QUERY_TIMEOUT, SERVICE_UNAVAILABLE, VALIDATION_ERROR, UNAUTHORIZED, FORBIDDEN, NOT_FOUND, INTERNAL_ERROR)
- [x] 1.2 Create `core/circuit_breaker.py` - circuit breaker module
  - Implement a CircuitBreaker class supporting the CLOSED/OPEN/HALF_OPEN states
  - Implement sliding-window counting (window_size=10)
  - Support configuration via environment variables (CIRCUIT_BREAKER_ENABLED, CIRCUIT_BREAKER_FAILURE_THRESHOLD, etc.)
  - Implement `get_circuit_breaker_status()` for state queries
  - Log state transitions
- [x] 1.3 Create `core/metrics.py` - performance metrics collection module
  - Implement a QueryMetrics class backed by deque(maxlen=1000)
  - Implement P50/P95/P99 percentile calculation
  - Track slow query counts (> 1 second)
  - Support worker PID identification
- [x] 1.4 Extend `core/cache.py` - local cache fallback (partially complete)
  - [x] Implement a TTL-aware ProcessLevelCache class
  - [x] Implement process-level caching for the WIP DataFrame
  - [x] Implement process-level caching for the Resource Cache
  - [x] Implement process-level caching for the Equipment Status Cache
  - [ ] Implement a generic LRU cache interface (maxsize=500, ttl=60s)
  - [ ] Track hit rate statistics (hits, misses, hit_rate)
  - [ ] Support the LOCAL_CACHE_ENABLED and LOCAL_CACHE_MAXSIZE environment variables
- [x] 1.5 Create `core/log_store.py` - SQLite log store
  - Create the logs table (time, level, source, message, request_id, user, ip)
  - Support query parameters (level, q, limit, since)
  - Implement a retention policy (default 7 days or 100,000 rows)
  - Support the LOG_SQLITE_PATH, LOG_SQLITE_RETENTION_DAYS, and LOG_SQLITE_MAX_ROWS environment variables
- [x] 1.6 Integrate the application logging handler
  - Register the SQLite log handler in `app.py`
  - Keep the existing file/STDERR logging
- [x] 1.7 Write unit tests for the infrastructure modules
  - Circuit breaker state transition tests
  - Metrics percentile calculation tests
  - Local cache LRU and TTL tests
  - SQLite log store read/write and retention policy tests
## 2. Database layer integration
- [x] 2.1 Integrate the circuit breaker into `core/database.py`
  - Add a circuit breaker check to `read_sql_df()`
  - Return an error immediately while in the OPEN state
  - Record successes/failures in the circuit breaker
  - Disabled by default; enable with CIRCUIT_BREAKER_ENABLED=true
- [x] 2.2 Integrate performance metrics into `core/database.py`
  - Record the latency of every query
  - Record slow queries (> 1 second) in metrics
- [x] 2.3 Integrate the local cache fallback into the cache layer (covered by the 1.4 ProcessLevelCache)
  - Query the local LRU cache when Redis fails
  - Backfill Oracle query results into both Redis and the local cache
## 3. API response format migration
- [x] 3.1 Register global error handlers in `app.py`
  - @app.errorhandler(401) - UNAUTHORIZED
  - @app.errorhandler(403) - FORBIDDEN
  - @app.errorhandler(404) - NOT_FOUND
  - @app.errorhandler(500) - INTERNAL_ERROR
  - @app.errorhandler(Exception) - uncaught exceptions
- [x] 3.2 Update the authentication middleware response format
  - Rejections from `@app.before_request` now use the unified format
- [x] 3.3 Gradually migrate each Blueprint to the new response format
  - New APIs use success_response/error_response directly
  - Existing APIs stay backward compatible
## 4. Health check endpoints
- [x] 4.1 Implement the `/health/deep` deep health check endpoint
  - Requires @admin_required authentication
  - Includes database latency and connection pool status
  - Includes Redis latency (if enabled)
  - Includes circuit breaker state
  - Includes cache freshness and hit rate
  - Includes a performance metrics summary (P50/P95/P99)
- [x] 4.2 Implement latency warning thresholds
  - Database latency > 100ms is marked "slow"
  - Cache last updated > 2 minutes ago is marked "stale"
  - Overall status is "degraded" while the circuit breaker is OPEN
## 5. Performance report page
- [x] 5.1 Create the `GET /admin/performance` page route
  - Requires administrator privileges
  - Uses the existing admin template style
- [x] 5.2 Implement the `GET /admin/api/system-status` API
  - Returns database, redis, circuit_breaker, cache, worker_pid
- [x] 5.3 Implement the `GET /admin/api/metrics` API
  - Returns P50/P95/P99, slow_count, slow_rate, worker_pid
- [x] 5.4 Build the performance report frontend page
  - System status cards (Database, Redis, Circuit Breaker, Worker)
  - Latency percentile display
  - Slow query statistics
  - Latency distribution chart (Chart.js)
  - Cache hit rate display
  - Manual/automatic refresh (30-second interval)
- [x] 5.5 Implement the `GET /admin/api/logs` API
  - Reads from the SQLite log store
  - Supports the level/q/limit/since query parameters
- [x] 5.6 Add a log viewer section to the performance report page
  - Shows the most recent 200 entries
  - Supports level filtering and keyword search
  - Updates in sync with auto-refresh
## 6. Worker restart control
- [x] 6.1 Create the `scripts/worker_watchdog.py` script
  - Checks `/tmp/mes_dashboard_restart.flag` every 5 seconds
  - Sends SIGHUP to the Gunicorn master when the flag is detected
  - Deletes the flag file
  - Logs restart events
- [x] 6.2 Implement the `POST /admin/api/worker/restart` API
  - Requires @admin_required
  - Writes the restart flag file
  - 60-second cooldown (429 Too Many Requests)
  - Logs the operator, time, and IP
- [x] 6.3 Implement the `GET /admin/api/worker/status` API
  - Returns cooldown_remaining, last_restart, last_restart_by
  - Returns the current worker's start time
- [x] 6.4 Add a worker control section to the performance report page
  - Restart button + confirmation dialog
  - Cooldown status display
  - Last restart info
  - Polling while a restart is in progress
## 7. Deployment and testing
- [x] 7.1 Create the systemd service file
  - `mes-dashboard-watchdog.service` runs the monitor script
- [x] 7.2 Write integration tests
  - Circuit breaker trip and recovery tests
  - API response format validation tests
  - Health check endpoint tests
  - Admin log API tests
  - Worker control API tests
- [x] 7.3 Update deployment docs
  - Document the new environment variables
  - Watchdog service configuration
  - Rollback steps

View File

@@ -0,0 +1,160 @@
## ADDED Requirements
### Requirement: Performance report page
The system SHALL provide a performance report page for administrators.
#### Scenario: Access the performance report page
- **WHEN** an administrator requests `GET /admin/performance`
- **THEN** the system SHALL render the performance report page
- **AND** the HTTP status code SHALL be 200
#### Scenario: Non-administrators forbidden
- **WHEN** a non-administrator requests `GET /admin/performance`
- **THEN** the system SHALL redirect to the login page
- **OR** the HTTP status code SHALL be 403
---
### Requirement: System status display
The performance report page SHALL show the health of each system component.
#### Scenario: Database status
- **WHEN** the performance report page loads
- **THEN** the page SHALL show the database connection status
- **AND** the status SHALL be ✅ (healthy) or ❌ (unhealthy)
#### Scenario: Redis status
- **WHEN** the performance report page loads
- **THEN** the page SHALL show the Redis connection status
- **AND** if Redis is disabled it SHALL read "disabled"
#### Scenario: Circuit breaker status
- **WHEN** the performance report page loads
- **THEN** the page SHALL show the circuit breaker state
- **AND** the state SHALL be CLOSED, OPEN, or HALF_OPEN
#### Scenario: Worker display
- **WHEN** the performance report page loads
- **THEN** the page SHALL show the PID of the worker that served the request
---
### Requirement: Performance metrics display
The performance report page SHALL show query performance metrics.
#### Scenario: Latency percentiles
- **WHEN** the performance report page loads
- **THEN** the page SHALL show the P50, P95, and P99 latency values
- **AND** the unit SHALL be milliseconds or seconds
#### Scenario: Slow query statistics
- **WHEN** the performance report page loads
- **THEN** the page SHALL show the slow query count
- **AND** SHALL show the slow query rate
#### Scenario: Latency distribution visualization
- **WHEN** the performance report page loads
- **THEN** the page SHALL show a latency distribution chart
- **AND** the chart SHALL use Chart.js or a similar library
---
### Requirement: Cache status display
The performance report page SHALL show cache activity.
#### Scenario: Redis cache hit rate
- **WHEN** the performance report page loads
- **THEN** the page SHALL show the Redis cache hit rate
#### Scenario: Local cache hit rate
- **WHEN** the performance report page loads
- **THEN** the page SHALL show the local cache hit rate
#### Scenario: Cache last-updated time
- **WHEN** the performance report page loads
- **THEN** the page SHALL show when the cache was last updated
---
### Requirement: Auto-refresh
The performance report page SHALL support automatic refresh.
#### Scenario: Manual refresh
- **WHEN** the "Refresh" button is clicked
- **THEN** the page SHALL reload all metric data
- **AND** SHALL NOT do a full page reload (it uses AJAX)
#### Scenario: Auto-refresh interval
- **WHEN** auto-refresh is enabled
- **THEN** the page SHALL update the metrics every 30 seconds
- **AND** the user SHALL be able to turn auto-refresh off
---
### Requirement: System status API
The system SHALL provide an API for system status information.
#### Scenario: Fetch system status
- **WHEN** `GET /admin/api/system-status` is called
- **AND** the user is an administrator
- **THEN** the response SHALL include:
  - `database`: database status
  - `redis`: Redis status
  - `circuit_breaker`: circuit breaker state
  - `cache`: cache status
  - `worker_pid`: current worker PID
---
### Requirement: Log viewer
The performance report page SHALL show recent log records.
#### Scenario: Show recent logs
- **WHEN** an administrator loads the performance report page
- **THEN** the page SHALL show the most recent N log entries (default 200)
- **AND** each entry SHALL show the time, level, source, and message
#### Scenario: Filtering and search
- **WHEN** an administrator selects a level (INFO/WARNING/ERROR) or enters a keyword
- **THEN** the page SHALL update the displayed results immediately
---
### Requirement: Log API
The system SHALL provide an API for recent log records.
#### Scenario: Fetch log records
- **WHEN** `GET /admin/api/logs` is called
- **AND** the user is an administrator
- **THEN** the response SHALL include the log list
- **AND** the HTTP status code SHALL be 200
#### Scenario: Log API query parameters
- **WHEN** `GET /admin/api/logs` is called with query parameters
- **THEN** the API SHALL support:
  - `level`: level filter (INFO/WARNING/ERROR)
  - `q`: keyword search
  - `limit`: number of rows to return (default 200)
  - `since`: start time (ISO-8601)
#### Scenario: Non-administrators forbidden
- **WHEN** a non-administrator calls `GET /admin/api/logs`
- **THEN** the HTTP status code SHALL be 403
---
### Requirement: Log storage
The system SHALL write logs to a local SQLite store for administrator queries.
#### Scenario: Write to the SQLite log store
- **WHEN** the system emits a log record
- **THEN** the log SHALL be written to the local SQLite log store
- **AND** be queryable through `GET /admin/api/logs`

View File

@@ -0,0 +1,116 @@
## ADDED Requirements
### Requirement: Worker restart trigger
The system SHALL let administrators trigger a worker restart from the frontend.
#### Scenario: Trigger a restart request
- **WHEN** `POST /admin/api/worker/restart` is called
- **AND** the user is an administrator
- **THEN** the system SHALL write the restart flag file
- **AND** the HTTP status code SHALL be 202 (Accepted)
- **AND** the response SHALL include `"message": "重啟請求已提交"`
#### Scenario: Non-administrators forbidden
- **WHEN** a non-administrator calls `POST /admin/api/worker/restart`
- **THEN** the HTTP status code SHALL be 403
- **AND** the operation SHALL NOT run
---
### Requirement: Restart cooldown
The system SHALL enforce a restart cooldown to prevent frequent restarts.
#### Scenario: Rejected during cooldown
- **WHEN** an administrator calls `POST /admin/api/worker/restart`
- **AND** less than 60 seconds have passed since the last restart
- **THEN** the HTTP status code SHALL be 429 (Too Many Requests)
- **AND** the response SHALL include the remaining cooldown seconds
#### Scenario: Allowed after cooldown
- **WHEN** an administrator calls `POST /admin/api/worker/restart`
- **AND** more than 60 seconds have passed since the last restart
- **THEN** the restart request SHALL be accepted
#### Scenario: Query cooldown status
- **WHEN** `GET /admin/api/worker/status` is called
- **THEN** the response SHALL include:
  - `cooldown_remaining`: remaining cooldown seconds (0 means available)
  - `last_restart`: time of the last restart
  - `last_restart_by`: who triggered the last restart
---
### Requirement: Restart audit log
The system SHALL log every restart operation.
#### Scenario: Record operator details
- **WHEN** an administrator triggers a restart
- **THEN** the system SHALL record:
  - the operator (email/username)
  - the time of the operation
  - the source IP address
  - the outcome
#### Scenario: Log destinations
- **WHEN** a restart operation is recorded
- **THEN** it SHALL be written to the system log at INFO level
- **AND** SHALL be written to a dedicated operations log file
---
### Requirement: Frontend confirmation
The performance report page SHALL require confirmation before restarting.
#### Scenario: Show a confirmation dialog
- **WHEN** an administrator clicks the "Restart Workers" button
- **THEN** the system SHALL show a confirmation dialog
- **AND** the dialog SHALL warn that the operation briefly affects service
#### Scenario: Proceed on confirm
- **WHEN** the administrator clicks "OK" in the dialog
- **THEN** the system SHALL send the restart request
#### Scenario: Cancel
- **WHEN** the administrator clicks "Cancel" in the dialog
- **THEN** the system SHALL NOT send the restart request
---
### Requirement: Watchdog script
The system SHALL provide a watchdog script that monitors the restart flag file.
#### Scenario: Monitor the flag file
- **WHEN** the watchdog script is running
- **THEN** it SHALL check `/tmp/mes_dashboard_restart.flag` every 5 seconds
#### Scenario: Flag file detected
- **WHEN** the watchdog detects the flag file
- **THEN** it SHALL send SIGHUP to the Gunicorn master
- **AND** SHALL delete the flag file
- **AND** SHALL log the restart event
#### Scenario: Gunicorn graceful reload
- **WHEN** the Gunicorn master receives SIGHUP
- **THEN** Gunicorn SHALL perform a graceful reload
- **AND** existing requests SHALL finish before their workers terminate
- **AND** new workers SHALL start and take over
---
### Requirement: Restart status reporting
The system SHALL provide a way to confirm whether a restart has completed.
#### Scenario: Query worker start time
- **WHEN** `GET /admin/api/worker/status` is called
- **THEN** the response SHALL include the current worker's start time
#### Scenario: Frontend restart feedback
- **WHEN** a restart request has been submitted
- **THEN** the frontend SHALL poll the worker status
- **AND** SHALL show "Restarting..." until a new worker is detected
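The flag-file write and 60-second cooldown above can be sketched as two helpers. This is an assumption-laden sketch: the helper names, the `last_restart_epoch` key, and the exact payload wording are illustrative and do not match the real route or the watchdog's state-file schema:

```python
import json
import time
from pathlib import Path

COOLDOWN_SECONDS = 60
FLAG_FILE = Path("/tmp/mes_dashboard_restart.flag")
STATE_FILE = Path("/tmp/mes_dashboard_restart_state.json")  # illustrative schema

def cooldown_remaining(now=None, state_file=STATE_FILE):
    """Seconds left before another restart may be requested (0 = allowed)."""
    now = time.time() if now is None else now
    try:
        state = json.loads(state_file.read_text())
        last = state["last_restart_epoch"]
    except (OSError, KeyError, ValueError):
        return 0  # no restart recorded yet
    return max(0, int(COOLDOWN_SECONDS - (now - last)))

def request_restart(user, ip, now=None, flag_file=FLAG_FILE, state_file=STATE_FILE):
    """Write the restart flag; returns (http_status, body) in the unified envelope."""
    remaining = cooldown_remaining(now, state_file)
    if remaining > 0:  # still cooling down: 429 with the remaining seconds
        return 429, {"success": False,
                     "error": {"code": "TOO_MANY_REQUESTS",
                               "message": f"cooldown: {remaining}s remaining"}}
    now = time.time() if now is None else now
    # The watchdog picks this file up and signals the Gunicorn master.
    flag_file.write_text(json.dumps({"user": user, "ip": ip, "timestamp": now}))
    state_file.write_text(json.dumps({"last_restart_epoch": now,
                                      "last_restart_by": user}))
    return 202, {"success": True, "data": {"message": "restart request submitted"}}
```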

View File

@@ -0,0 +1,140 @@
## ADDED Requirements
### Requirement: Unified success response format
The system SHALL use a unified envelope for every successful API response.
#### Scenario: Success responses carry a success flag
- **WHEN** an API request succeeds
- **THEN** the response body SHALL include `"success": true`
- **AND** the original response payload SHALL live under the `data` field
#### Scenario: Success response example
- **WHEN** `GET /api/dashboard/kpi` succeeds
- **THEN** the response SHALL look like:
```json
{
  "success": true,
  "data": {
    "total": 100,
    "prd": 50,
    ...
  }
}
```
---
### Requirement: Unified error response format
The system SHALL use a unified format for every failed API response.
#### Scenario: Error responses carry an error code
- **WHEN** an API request fails
- **THEN** the response body SHALL include `"success": false`
- **AND** SHALL include an `error` object
- **AND** `error.code` SHALL be a machine-readable error code
- **AND** `error.message` SHALL be a user-friendly message in Traditional Chinese
#### Scenario: Error response example
- **WHEN** the database connection fails
- **THEN** the response SHALL look like:
```json
{
  "success": false,
  "error": {
    "code": "DB_CONNECTION_FAILED",
    "message": "資料庫連線失敗,請稍後再試"
  }
}
```
#### Scenario: Development mode shows details
- **WHEN** `FLASK_ENV=development`
- **AND** an API request fails
- **THEN** the `error` object SHALL additionally include a `details` field
- **AND** `details` SHALL carry the technical error message (e.g. ORA-xxxxx)
#### Scenario: Production mode hides details
- **WHEN** `FLASK_ENV=production`
- **AND** an API request fails
- **THEN** the `error` object SHALL NOT include a `details` field
---
### Requirement: Standard error codes
The system SHALL define and use standardized error codes.
#### Scenario: Database connection errors
- **WHEN** the database connection fails
- **THEN** the error code SHALL be `DB_CONNECTION_FAILED`
#### Scenario: Database query timeout
- **WHEN** a database query exceeds 55 seconds
- **THEN** the error code SHALL be `DB_QUERY_TIMEOUT`
#### Scenario: Circuit breaker open
- **WHEN** the circuit breaker is in the OPEN state
- **THEN** the error code SHALL be `SERVICE_UNAVAILABLE`
#### Scenario: Validation failure
- **WHEN** request parameter validation fails
- **THEN** the error code SHALL be `VALIDATION_ERROR`
#### Scenario: Unauthorized
- **WHEN** the user is not logged in or the session has expired
- **THEN** the error code SHALL be `UNAUTHORIZED`
#### Scenario: Forbidden
- **WHEN** the user lacks the required permissions
- **THEN** the error code SHALL be `FORBIDDEN`
---
### Requirement: Global error handling
The system SHALL handle all uncaught errors uniformly at the middleware level.
#### Scenario: Authentication middleware rejection
- **WHEN** the authentication middleware (`@app.before_request` inside `create_app`) rejects a request
- **THEN** the response SHALL follow the unified error format
- **AND** the error code SHALL be `UNAUTHORIZED` or `FORBIDDEN`
#### Scenario: Unhandled exceptions
- **WHEN** a route handler raises an uncaught exception
- **THEN** a Flask error handler SHALL intercept it
- **AND** the response SHALL follow the unified error format
- **AND** the error code SHALL be `INTERNAL_ERROR`
#### Scenario: 404 handling
- **WHEN** the requested route does not exist
- **THEN** the response SHALL follow the unified error format
- **AND** the error code SHALL be `NOT_FOUND`
#### Scenario: Global error handler registration
- **WHEN** the Flask application initializes
- **THEN** `create_app()` SHALL register these error handlers:
  - `@app.errorhandler(401)` - unauthorized
  - `@app.errorhandler(403)` - forbidden
  - `@app.errorhandler(404)` - not found
  - `@app.errorhandler(500)` - server error
  - `@app.errorhandler(Exception)` - all uncaught exceptions
---
### Requirement: Backward compatibility
The system SHALL stay backward compatible with existing APIs.
#### Scenario: Existing fields preserved
- **WHEN** the new response format is used
- **THEN** the fields existing APIs returned SHALL be preserved intact inside `data`
- **AND** field names and types SHALL not change
#### Scenario: HTTP status codes preserved
- **WHEN** a response uses the new format
- **THEN** HTTP status codes SHALL keep their original semantics
- **AND** success SHALL return 2xx
- **AND** client errors SHALL return 4xx
- **AND** server errors SHALL return 5xx
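The envelope rules above map onto the two helpers the tasks list names for `core/response.py`. The sketch below is an assumed shape, not the actual module: it returns plain `(dict, status)` pairs (which Flask views can return directly) and gates `details` on `FLASK_ENV` exactly as the scenarios require:

```python
import os

def success_response(data, meta=None):
    """Unified success envelope; payload goes under "data"."""
    body = {"success": True, "data": data}
    if meta is not None:
        body["meta"] = meta
    return body, 200

def error_response(code, message, details=None, status=500):
    """Unified error envelope; technical details only in development mode."""
    error = {"code": code, "message": message}
    # Per the spec: FLASK_ENV=development exposes details, production hides them.
    if details is not None and os.getenv("FLASK_ENV") == "development":
        error["details"] = details  # e.g. the raw ORA-xxxxx text
    return {"success": False, "error": error}, status
```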

View File

@@ -0,0 +1,91 @@
## ADDED Requirements
### Requirement: Circuit breaker state management
The system SHALL implement the Circuit Breaker pattern to manage the tripped/untripped state of database access.
#### Scenario: Initial state is CLOSED
- **WHEN** the system starts
- **THEN** the circuit breaker state SHALL be `CLOSED`
- **AND** all database requests SHALL execute normally
#### Scenario: Accumulated failures trip to OPEN
- **WHEN** the circuit breaker is in the `CLOSED` state
- **AND** the sliding window holds >= 5 failures
- **AND** the failure rate is >= 50%
- **THEN** the circuit breaker SHALL transition to `OPEN`
#### Scenario: OPEN state rejects requests
- **WHEN** the circuit breaker is in the `OPEN` state
- **AND** a database request arrives
- **THEN** the system SHALL return an error immediately
- **AND** the error code SHALL be `SERVICE_UNAVAILABLE`
- **AND** SHALL NOT attempt a database connection
#### Scenario: OPEN transitions to HALF_OPEN
- **WHEN** the circuit breaker is in the `OPEN` state
- **AND** 30 seconds (recovery_timeout) have elapsed
- **THEN** the circuit breaker SHALL transition to `HALF_OPEN`
#### Scenario: HALF_OPEN probe succeeds
- **WHEN** the circuit breaker is in the `HALF_OPEN` state
- **AND** the probe request succeeds
- **THEN** the circuit breaker SHALL transition to `CLOSED`
- **AND** the failure count SHALL reset to 0
#### Scenario: HALF_OPEN probe fails
- **WHEN** the circuit breaker is in the `HALF_OPEN` state
- **AND** the probe request fails
- **THEN** the circuit breaker SHALL transition to `OPEN`
- **AND** the recovery_timeout SHALL restart
---
### Requirement: Circuit breaker configuration
The system SHALL support configuring the circuit breaker via environment variables.
#### Scenario: Default parameter values
- **WHEN** no circuit breaker environment variables are set
- **THEN** failure_threshold SHALL be 5
- **AND** failure_rate SHALL be 0.5 (50%)
- **AND** recovery_timeout SHALL be 30 seconds
- **AND** window_size SHALL be 10
#### Scenario: Environment variable override
- **WHEN** `CIRCUIT_BREAKER_FAILURE_THRESHOLD=10` is set
- **THEN** failure_threshold SHALL be 10
#### Scenario: Disable the circuit breaker
- **WHEN** `CIRCUIT_BREAKER_ENABLED=false` is set
- **THEN** the circuit breaker SHALL be disabled
- **AND** all requests SHALL execute directly, bypassing the breaker check
---
### Requirement: Circuit breaker status query
The system SHALL provide a way to query the circuit breaker state.
#### Scenario: Query breaker status
- **WHEN** the internal method `get_circuit_breaker_status()` is called
- **THEN** the return value SHALL include:
  - `state`: current state (CLOSED/OPEN/HALF_OPEN)
  - `failure_count`: current failure count
  - `success_count`: current success count
  - `last_failure_time`: time of the last failure
---
### Requirement: Breaker event logging
The system SHALL log circuit breaker state changes.
#### Scenario: Log state transitions
- **WHEN** the circuit breaker state changes
- **THEN** the system SHALL log at WARNING level
- **AND** the log SHALL include the previous state, the new state, and the trigger reason
#### Scenario: Log OPEN events
- **WHEN** the circuit breaker transitions to `OPEN`
- **THEN** the log message SHALL include the failure count and the failure rate
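The state machine above (CLOSED → OPEN on threshold + rate over a sliding window, OPEN → HALF_OPEN after recovery_timeout, probe deciding the rest) can be sketched as follows. Defaults mirror the spec; the class shape and the injectable `clock` are illustrative assumptions, not the actual `core/circuit_breaker.py`:

```python
import time
from collections import deque

CLOSED, OPEN, HALF_OPEN = "CLOSED", "OPEN", "HALF_OPEN"

class CircuitBreaker:
    """Sliding-window circuit breaker sketch (defaults mirror the spec)."""

    def __init__(self, failure_threshold=5, failure_rate=0.5,
                 recovery_timeout=30.0, window_size=10, clock=time.monotonic):
        self.failure_threshold = failure_threshold
        self.failure_rate = failure_rate
        self.recovery_timeout = recovery_timeout
        self._window = deque(maxlen=window_size)  # True = success, False = failure
        self._clock = clock  # injectable so tests can fake time
        self.state = CLOSED
        self._opened_at = None

    def allow_request(self):
        if self.state == OPEN:
            if self._clock() - self._opened_at >= self.recovery_timeout:
                self.state = HALF_OPEN  # let a single probe through
                return True
            return False  # caller should return SERVICE_UNAVAILABLE
        return True  # CLOSED and HALF_OPEN both allow requests

    def record_success(self):
        if self.state == HALF_OPEN:
            self.state = CLOSED
            self._window.clear()  # probe succeeded: reset failure counts
        else:
            self._window.append(True)

    def record_failure(self):
        if self.state == HALF_OPEN:
            self._trip()  # probe failed: reopen and restart the timer
            return
        self._window.append(False)
        failures = sum(1 for ok in self._window if not ok)
        if (failures >= self.failure_threshold
                and failures / len(self._window) >= self.failure_rate):
            self._trip()

    def _trip(self):
        self.state = OPEN
        self._opened_at = self._clock()
```

In `core/database.py` the integration point would wrap each query: check `allow_request()` first, then report the outcome via `record_success()`/`record_failure()`.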

View File

@@ -1,124 +1,150 @@
## ADDED Requirements
### Requirement: Health Check Endpoint
The system SHALL provide a `/health` endpoint that reports service health.
#### Scenario: All services healthy
- **WHEN** `GET /health` is called and both Oracle and Redis are healthy
- **THEN** the system SHALL return HTTP 200
- **AND** the response body SHALL be:
```json
{
  "status": "healthy",
  "services": {
    "database": "ok",
    "redis": "ok"
  }
}
```
#### Scenario: Database unhealthy
- **WHEN** `GET /health` is called and the Oracle connection fails
- **THEN** the system SHALL return HTTP 503
- **AND** the response body SHALL include:
```json
{
  "status": "unhealthy",
  "services": {
    "database": "error",
    "redis": "ok"
  },
  "errors": ["Database connection failed: <error message>"]
}
```
#### Scenario: Redis unhealthy but service degraded
- **WHEN** `GET /health` is called and the Redis connection fails but Oracle is healthy
- **THEN** the system SHALL return HTTP 200 (the service degrades gracefully)
- **AND** the response body SHALL include:
```json
{
  "status": "degraded",
  "services": {
    "database": "ok",
    "redis": "error"
  },
  "warnings": ["Redis unavailable, running in fallback mode"]
}
```
#### Scenario: Redis disabled
- **WHEN** `GET /health` is called with `REDIS_ENABLED=false`
- **THEN** `services.redis` in the response body SHALL be `"disabled"`
---
### Requirement: Database Health Check
The health check SHALL verify the Oracle database connection.
#### Scenario: Database ping succeeds
- **WHEN** the database health check runs
- **THEN** the system SHALL execute `SELECT 1 FROM DUAL`
- **AND** mark database as `ok` if the query succeeds
#### Scenario: Database ping timeout
- **WHEN** the database query exceeds 5 seconds
- **THEN** the system SHALL mark database as `error`
- **AND** log the timeout
---
### Requirement: Redis Health Check
The health check SHALL verify the Redis connection (when REDIS_ENABLED=true).
#### Scenario: Redis ping succeeds
- **WHEN** the Redis health check runs
- **THEN** the system SHALL execute the Redis `PING` command
- **AND** mark redis as `ok` on a `PONG` reply
#### Scenario: Redis ping fails
- **WHEN** the Redis `PING` command fails or times out
- **THEN** the system SHALL mark redis as `error`
- **AND** the service status SHALL be `degraded` (not `unhealthy`)
---
### Requirement: Cache Status in Health Check
The health check SHALL include cache status information.
#### Scenario: Cache status included
- **WHEN** `GET /health` is called and the cache is available
- **THEN** the response body SHALL include a `cache` block:
```json
{
  "cache": {
    "enabled": true,
    "sys_date": "2024-01-15 10:30:00",
    "updated_at": "2024-01-15 10:35:22"
  }
}
```
#### Scenario: Cache not populated
- **WHEN** `GET /health` is called with Redis available but the cache not yet loaded
- **THEN** `cache.sys_date` in the response body SHALL be `null`
---
### Requirement: Health Check Performance
The health check SHALL respond quickly and not affect service performance.
#### Scenario: Response within timeout
- **WHEN** `GET /health` is called
- **THEN** the system SHALL respond within 10 seconds
- **AND** each individual check SHALL time out within 5 seconds
#### Scenario: No authentication required
- **WHEN** `GET /health` is called
- **THEN** the system SHALL NOT require authentication
- **AND** SHALL NOT write to the access log (to avoid log noise)
## ADDED Requirements
### Requirement: Deep health check endpoint
The system SHALL provide a `/health/deep` endpoint that reports detailed system health.
#### Scenario: Deep check response format
- **WHEN** `GET /health/deep` is called
- **THEN** the response body SHALL include:
```json
{
  "status": "healthy",
  "checks": {
    "database": { ... },
    "redis": { ... },
    "circuit_breaker": { ... },
    "cache": { ... }
  },
  "metrics": { ... }
}
```
#### Scenario: Deep check requires authentication
- **WHEN** `GET /health/deep` is called
- **AND** the user is not logged in
- **THEN** the HTTP status code SHALL be 401
#### Scenario: Administrator access
- **WHEN** `GET /health/deep` is called
- **AND** the user is an administrator
- **THEN** the HTTP status code SHALL be 200
- **AND** the response SHALL include the full details
#### Scenario: Non-administrators forbidden
- **WHEN** `GET /health/deep` is called
- **AND** the user is logged in but not an administrator
- **THEN** the HTTP status code SHALL be 403
- **AND** the response SHALL follow the unified error format
#### Scenario: Implementation approach
- **WHEN** the `/health/deep` endpoint is implemented
- **THEN** the route SHALL use the `@admin_required` decorator
- **AND** the decorator SHALL handle authentication and authorization
---
### Requirement: Latency metric checks
The deep health check SHALL include per-service latency metrics.
#### Scenario: Database latency
- **WHEN** the deep health check runs
- **THEN** `checks.database` SHALL include `latency_ms`
- **AND** `latency_ms` SHALL be the measured duration of the ping query
#### Scenario: Redis latency
- **WHEN** the deep health check runs
- **AND** Redis is enabled
- **THEN** `checks.redis` SHALL include `latency_ms`
- **AND** `latency_ms` SHALL be the measured duration of the PING
#### Scenario: Latency warning threshold
- **WHEN** database latency exceeds 100ms
- **THEN** `checks.database.status` SHALL be `"slow"`
- **AND** the `warnings` array SHALL include a latency warning
---
### Requirement: Connection pool status check
The deep health check SHALL include database connection pool status.
#### Scenario: Pool information
- **WHEN** the deep health check runs
- **THEN** `checks.database` SHALL include:
  - `pool_size`: configured pool size
  - `pool_checked_out`: connections currently checked out
  - `pool_overflow`: current overflow connections
#### Scenario: Pool exhaustion warning
- **WHEN** `pool_checked_out` + `pool_overflow` >= `pool_size` + `max_overflow`
- **THEN** the `warnings` array SHALL include a pool exhaustion warning
---
### Requirement: Circuit breaker status check
The deep health check SHALL include the circuit breaker state.
#### Scenario: Breaker healthy
- **WHEN** the deep health check runs
- **AND** the circuit breaker state is CLOSED
- **THEN** `checks.circuit_breaker` SHALL include:
```json
{
  "database": "CLOSED",
  "failures": 0
}
```
#### Scenario: Breaker OPEN
- **WHEN** the deep health check runs
- **AND** the circuit breaker state is OPEN
- **THEN** `checks.circuit_breaker.database` SHALL be `"OPEN"`
- **AND** the overall `status` SHALL be `"degraded"` or `"unhealthy"`
- **AND** `warnings` SHALL include a circuit breaker warning
---
### Requirement: Cache freshness check
The deep health check SHALL check the freshness of cached data.
#### Scenario: Cache fresh
- **WHEN** the deep health check runs
- **AND** the cache was updated within the last 2 minutes
- **THEN** `checks.cache.status` SHALL be `"fresh"`
#### Scenario: Stale cache data
- **WHEN** the deep health check runs
- **AND** the cache was last updated more than 2 minutes ago
- **THEN** `checks.cache.status` SHALL be `"stale"`
- **AND** `warnings` SHALL include a stale-cache warning
#### Scenario: Local cache status
- **WHEN** the deep health check runs
- **AND** the local cache is enabled
- **THEN** `checks.cache` SHALL include:
  - `local_enabled`: true
  - `local_hit_rate`: local cache hit rate
  - `local_size`: number of local cache entries
---
### Requirement: Performance metrics summary
The deep health check SHALL include a performance metrics summary.
#### Scenario: Latency percentiles included
- **WHEN** the deep health check runs
- **THEN** `metrics` SHALL include:
  - `query_p50_ms`: P50 query latency
  - `query_p95_ms`: P95 query latency
  - `query_p99_ms`: P99 query latency
  - `slow_query_count`: number of slow queries
#### Scenario: Empty metrics
- **WHEN** the deep health check runs
- **AND** no queries have been recorded yet
- **THEN** every field in `metrics` SHALL be 0
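The latency measurement and warning-threshold logic above can be composed as a small helper. A sketch only: `timed_check` and `deep_health` are assumed names, the pings are injected callables standing in for the real `SELECT 1 FROM DUAL` / Redis `PING`, and the 100ms threshold follows the spec:

```python
import time

DB_SLOW_THRESHOLD_MS = 100.0  # latency above this is flagged "slow" per the spec

def timed_check(ping, threshold_ms):
    """Run a ping callable and return its status plus measured latency_ms."""
    start = time.perf_counter()
    try:
        ping()
    except Exception as exc:  # a failed ping becomes an "error" check
        return {"status": "error", "error": str(exc)}
    latency_ms = (time.perf_counter() - start) * 1000.0
    status = "slow" if latency_ms > threshold_ms else "ok"
    return {"status": status, "latency_ms": round(latency_ms, 1)}

def deep_health(db_ping, redis_ping=None):
    """Assemble a /health/deep-style body, collecting warnings per check."""
    checks = {"database": timed_check(db_ping, DB_SLOW_THRESHOLD_MS)}
    if redis_ping is not None:
        checks["redis"] = timed_check(redis_ping, DB_SLOW_THRESHOLD_MS)
    warnings = [f"{name} latency is high"
                for name, c in checks.items() if c["status"] == "slow"]
    status = ("degraded"
              if any(c["status"] != "ok" for c in checks.values())
              else "healthy")
    return {"status": status, "checks": checks, "warnings": warnings}
```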

View File

@@ -9,3 +9,4 @@ waitress>=2.1.2; platform_system=="Windows"
requests>=2.28.0
redis>=5.0.0
hiredis>=2.0.0
psutil>=5.9.0

scripts/worker_watchdog.py Normal file
View File

@@ -0,0 +1,265 @@
#!/usr/bin/env python3
# -*- coding: utf-8 -*-
"""Worker watchdog for MES Dashboard.
Monitors a restart flag file and signals Gunicorn master to gracefully
reload workers when the flag is detected.
Usage:
python scripts/worker_watchdog.py
The watchdog:
- Checks for /tmp/mes_dashboard_restart.flag every 5 seconds
- Sends SIGHUP to Gunicorn master process when flag is detected
- Removes the flag file after signaling
- Logs all restart events
Configuration via environment variables:
- WATCHDOG_CHECK_INTERVAL: Check interval in seconds (default: 5)
- WATCHDOG_RESTART_FLAG: Path to restart flag file
- WATCHDOG_PID_FILE: Path to Gunicorn PID file
"""
from __future__ import annotations
import json
import logging
import os
import signal
import sys
import time
from datetime import datetime
from pathlib import Path
# Configure logging
logging.basicConfig(
level=logging.INFO,
format='%(asctime)s - %(name)s - %(levelname)s - %(message)s',
handlers=[
logging.StreamHandler(sys.stdout),
]
)
logger = logging.getLogger('mes_dashboard.watchdog')
# ============================================================
# Configuration
# ============================================================
CHECK_INTERVAL = int(os.getenv('WATCHDOG_CHECK_INTERVAL', '5'))
RESTART_FLAG_PATH = os.getenv(
'WATCHDOG_RESTART_FLAG',
'/tmp/mes_dashboard_restart.flag'
)
GUNICORN_PID_FILE = os.getenv(
'WATCHDOG_PID_FILE',
'/tmp/mes_dashboard_gunicorn.pid'
)
RESTART_STATE_FILE = os.getenv(
'WATCHDOG_STATE_FILE',
'/tmp/mes_dashboard_restart_state.json'
)
# ============================================================
# Watchdog Implementation
# ============================================================
def get_gunicorn_pid() -> int | None:
"""Get Gunicorn master PID from PID file.
Returns:
PID of Gunicorn master process, or None if not found.
"""
pid_path = Path(GUNICORN_PID_FILE)
if not pid_path.exists():
logger.warning(f"PID file not found: {GUNICORN_PID_FILE}")
return None
try:
pid = int(pid_path.read_text().strip())
# Verify process exists
os.kill(pid, 0)
return pid
except (ValueError, ProcessLookupError, PermissionError) as e:
logger.warning(f"Invalid or stale PID file: {e}")
return None
def read_restart_flag() -> dict | None:
"""Read and parse the restart flag file.
Returns:
Dictionary with restart metadata, or None if no flag exists.
"""
flag_path = Path(RESTART_FLAG_PATH)
if not flag_path.exists():
return None
try:
content = flag_path.read_text().strip()
if content:
return json.loads(content)
return {"timestamp": datetime.now().isoformat()}
except (json.JSONDecodeError, IOError) as e:
logger.warning(f"Error reading restart flag: {e}")
return {"timestamp": datetime.now().isoformat(), "error": str(e)}
def remove_restart_flag() -> bool:
"""Remove the restart flag file.
Returns:
True if file was removed, False otherwise.
"""
flag_path = Path(RESTART_FLAG_PATH)
try:
if flag_path.exists():
flag_path.unlink()
return True
return False
except IOError as e:
logger.error(f"Failed to remove restart flag: {e}")
return False
def save_restart_state(
requested_by: str | None = None,
requested_at: str | None = None,
requested_ip: str | None = None,
completed_at: str | None = None,
success: bool = True
) -> None:
"""Save restart state for status queries.
Args:
requested_by: Username who requested the restart.
requested_at: ISO timestamp when restart was requested.
requested_ip: IP address of requester.
completed_at: ISO timestamp when restart was completed.
success: Whether the restart was successful.
"""
state_path = Path(RESTART_STATE_FILE)
state = {
"last_restart": {
"requested_by": requested_by,
"requested_at": requested_at,
"requested_ip": requested_ip,
"completed_at": completed_at,
"success": success
}
}
try:
state_path.write_text(json.dumps(state, indent=2))
except IOError as e:
logger.error(f"Failed to save restart state: {e}")
def send_reload_signal(pid: int) -> bool:
"""Send SIGHUP to Gunicorn master to reload workers.
Args:
pid: PID of Gunicorn master process.
Returns:
True if signal was sent successfully, False otherwise.
"""
try:
os.kill(pid, signal.SIGHUP)
logger.info(f"Sent SIGHUP to Gunicorn master (PID: {pid})")
return True
except ProcessLookupError:
logger.error(f"Process {pid} not found")
return False
except PermissionError:
logger.error(f"Permission denied sending signal to PID {pid}")
return False
def process_restart_request() -> bool:
"""Process a restart request if flag file exists.
Returns:
True if restart was processed, False if no restart needed.
"""
flag_data = read_restart_flag()
if flag_data is None:
return False
logger.info(f"Restart flag detected: {flag_data}")
# Get Gunicorn master PID
pid = get_gunicorn_pid()
if pid is None:
logger.error("Cannot restart: Gunicorn master PID not found")
# Still remove flag to prevent infinite loop
remove_restart_flag()
save_restart_state(
requested_by=flag_data.get("user"),
requested_at=flag_data.get("timestamp"),
requested_ip=flag_data.get("ip"),
completed_at=datetime.now().isoformat(),
success=False
)
return True
# Send reload signal
success = send_reload_signal(pid)
# Remove flag file
remove_restart_flag()
# Save state
save_restart_state(
requested_by=flag_data.get("user"),
requested_at=flag_data.get("timestamp"),
requested_ip=flag_data.get("ip"),
completed_at=datetime.now().isoformat(),
success=success
)
if success:
logger.info(
f"Worker restart completed - "
f"Requested by: {flag_data.get('user', 'unknown')}, "
f"IP: {flag_data.get('ip', 'unknown')}"
)
return True
def run_watchdog() -> None:
"""Main watchdog loop."""
logger.info(
f"Worker watchdog started - "
f"Check interval: {CHECK_INTERVAL}s, "
f"Flag path: {RESTART_FLAG_PATH}, "
f"PID file: {GUNICORN_PID_FILE}"
)
while True:
try:
process_restart_request()
except Exception as e:
logger.exception(f"Error in watchdog loop: {e}")
time.sleep(CHECK_INTERVAL)
def main() -> None:
"""Entry point for watchdog script."""
try:
run_watchdog()
except KeyboardInterrupt:
logger.info("Watchdog stopped by user")
sys.exit(0)
if __name__ == "__main__":
main()
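
The watchdog above polls for a flag file, resolves the Gunicorn master PID, sends SIGHUP, and persists the outcome as JSON for later status queries. A minimal sketch of that state-file round trip, reusing the same `last_restart` shape (the path and values here are illustrative):

```python
import json
import tempfile
from datetime import datetime
from pathlib import Path

def save_restart_state(path: Path, requested_by: str, requested_at: str,
                       requested_ip: str, success: bool) -> None:
    # Same JSON shape the watchdog writes for status queries
    state = {
        "last_restart": {
            "requested_by": requested_by,
            "requested_at": requested_at,
            "requested_ip": requested_ip,
            "completed_at": datetime.now().isoformat(),
            "success": success,
        }
    }
    path.write_text(json.dumps(state, indent=2))

state_file = Path(tempfile.mkdtemp()) / "restart_state.json"
save_restart_state(state_file, "admin", "2026-02-04T08:00:00", "10.0.0.5", True)
loaded = json.loads(state_file.read_text())
print(loaded["last_restart"]["requested_by"], loaded["last_restart"]["success"])
# admin True
```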

View File

@@ -27,6 +27,8 @@ def _configure_logging(app: Flask) -> None:
"""Configure application logging.
Sets up logging to stderr (captured by Gunicorn's --capture-output).
Additionally sets up SQLite log store for admin dashboard queries.
Log levels:
- DEBUG: Query completion times, connection events
- WARNING: Slow queries (>1s)
@@ -38,6 +40,7 @@ def _configure_logging(app: Flask) -> None:
# Only add handler if not already configured (avoid duplicates)
if not logger.handlers:
# Console handler (stderr - captured by Gunicorn)
handler = logging.StreamHandler(sys.stderr)
handler.setLevel(logging.DEBUG)
formatter = logging.Formatter(
@@ -47,6 +50,17 @@ def _configure_logging(app: Flask) -> None:
handler.setFormatter(formatter)
logger.addHandler(handler)
# SQLite log handler for admin dashboard (INFO level and above)
try:
from mes_dashboard.core.log_store import get_sqlite_log_handler, LOG_STORE_ENABLED
if LOG_STORE_ENABLED:
sqlite_handler = get_sqlite_log_handler()
sqlite_handler.setLevel(logging.INFO)
logger.addHandler(sqlite_handler)
logger.debug("SQLite log handler registered")
except Exception as e:
logger.warning(f"Failed to initialize SQLite log handler: {e}")
# Prevent propagation to root logger (avoid duplicate logs)
logger.propagate = False
@@ -103,7 +117,8 @@ def create_app(config_name: str | None = None) -> Flask:
if is_api_public():
return None
if not is_admin_logged_in():
return jsonify({"error": "Unauthorized"}), 401
from mes_dashboard.core.response import unauthorized_error
return unauthorized_error()
return None
# Skip auth-related pages (login/logout)
@@ -226,4 +241,81 @@ def create_app(config_name: str | None = None) -> Flask:
"""API: get tables config."""
return jsonify(TABLES_CONFIG)
# ========================================================
# Global Error Handlers
# ========================================================
_register_error_handlers(app)
return app
def _register_error_handlers(app: Flask) -> None:
"""Register global error handlers with standardized response format."""
from mes_dashboard.core.response import (
unauthorized_error,
forbidden_error,
not_found_error,
internal_error,
error_response,
INTERNAL_ERROR
)
@app.errorhandler(401)
def handle_unauthorized(e):
"""Handle 401 Unauthorized errors."""
return unauthorized_error()
@app.errorhandler(403)
def handle_forbidden(e):
"""Handle 403 Forbidden errors."""
return forbidden_error()
@app.errorhandler(404)
def handle_not_found(e):
"""Handle 404 Not Found errors."""
# For API routes, return JSON; for pages, render template
if request.path.startswith('/api/'):
return not_found_error()
return render_template('404.html'), 404
def _is_api_request() -> bool:
"""Check if the current request is an API request."""
return ('/api/' in request.path or
request.accept_mimetypes.best == 'application/json')
@app.errorhandler(500)
def handle_internal_error(e):
"""Handle 500 Internal Server errors."""
logger = logging.getLogger('mes_dashboard')
logger.error(f"Internal server error: {e}", exc_info=True)
if _is_api_request():
return internal_error(str(e) if app.debug else None)
# Fallback to JSON if template not found
try:
return render_template('500.html'), 500
except Exception:
return internal_error(str(e) if app.debug else None)
@app.errorhandler(Exception)
def handle_exception(e):
"""Handle uncaught exceptions."""
logger = logging.getLogger('mes_dashboard')
logger.error(f"Uncaught exception: {e}", exc_info=True)
if _is_api_request():
return error_response(
INTERNAL_ERROR,
"伺服器發生未預期的錯誤",
str(e) if app.debug else None,
status_code=500
)
# Fallback to JSON if template not found
try:
return render_template('500.html'), 500
except Exception:
return error_response(
INTERNAL_ERROR,
"伺服器發生未預期的錯誤",
str(e) if app.debug else None,
status_code=500
)
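
The handlers above branch between a JSON envelope for API routes and an HTML template for pages. A minimal, Flask-free sketch of that dispatch; the envelope fields are assumed from the `error_response` helper elsewhere in this commit, and `handle_404` here is illustrative, not the registered handler:

```python
def is_api_path(path: str) -> bool:
    # Any path containing an /api/ segment gets a JSON error body
    return '/api/' in path

def handle_404(path: str):
    if is_api_path(path):
        body = {"success": False,
                "error": {"code": "NOT_FOUND", "message": "Resource not found"}}
        return body, 404
    # Page routes would render the 404.html template instead
    return "<404.html>", 404

body, status = handle_404("/api/v1/widgets/999")
print(status, body["error"]["code"])  # 404 NOT_FOUND
page, status = handle_404("/admin/performance")
print(status, page)  # 404 <404.html>
```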

View File

@@ -0,0 +1,301 @@
# -*- coding: utf-8 -*-
"""Circuit breaker implementation for database protection.
Prevents cascading failures by temporarily stopping requests to a failing service.
States:
- CLOSED: Normal operation, requests pass through
- OPEN: Failures exceeded threshold, requests are rejected immediately
- HALF_OPEN: Testing if service has recovered, limited requests allowed
"""
from __future__ import annotations
import logging
import os
import threading
import time
from collections import deque
from dataclasses import dataclass
from enum import Enum
from typing import Deque, Optional
logger = logging.getLogger('mes_dashboard.circuit_breaker')
# ============================================================
# Configuration
# ============================================================
CIRCUIT_BREAKER_ENABLED = os.getenv(
'CIRCUIT_BREAKER_ENABLED', 'false'
).lower() == 'true'
# Minimum failures before circuit can open
FAILURE_THRESHOLD = int(os.getenv('CIRCUIT_BREAKER_FAILURE_THRESHOLD', '5'))
# Failure rate threshold (0.0 - 1.0)
FAILURE_RATE_THRESHOLD = float(os.getenv('CIRCUIT_BREAKER_FAILURE_RATE', '0.5'))
# Seconds to wait in OPEN state before trying HALF_OPEN
RECOVERY_TIMEOUT = int(os.getenv('CIRCUIT_BREAKER_RECOVERY_TIMEOUT', '30'))
# Sliding window size for counting successes/failures
WINDOW_SIZE = int(os.getenv('CIRCUIT_BREAKER_WINDOW_SIZE', '10'))
# ============================================================
# Types
# ============================================================
class CircuitState(Enum):
"""Circuit breaker states."""
CLOSED = "CLOSED"
OPEN = "OPEN"
HALF_OPEN = "HALF_OPEN"
@dataclass
class CircuitBreakerStatus:
"""Circuit breaker status information."""
state: str
failure_count: int
success_count: int
total_count: int
failure_rate: float
last_failure_time: Optional[str]
open_until: Optional[str]
enabled: bool
# ============================================================
# Circuit Breaker Implementation
# ============================================================
class CircuitBreaker:
"""Circuit breaker for protecting database operations.
Thread-safe implementation using a sliding window to track
successes and failures.
Usage:
cb = CircuitBreaker("database")
if not cb.allow_request():
return error_response(CIRCUIT_BREAKER_OPEN, "Service degraded")
try:
result = execute_query()
cb.record_success()
return result
except Exception as e:
cb.record_failure()
raise
"""
def __init__(
self,
name: str,
failure_threshold: int = FAILURE_THRESHOLD,
failure_rate_threshold: float = FAILURE_RATE_THRESHOLD,
recovery_timeout: int = RECOVERY_TIMEOUT,
window_size: int = WINDOW_SIZE
):
"""Initialize circuit breaker.
Args:
name: Identifier for this circuit breaker.
failure_threshold: Minimum failures before opening.
failure_rate_threshold: Failure rate to trigger opening (0.0-1.0).
recovery_timeout: Seconds to wait before half-open.
window_size: Size of sliding window for tracking.
"""
self.name = name
self.failure_threshold = failure_threshold
self.failure_rate_threshold = failure_rate_threshold
self.recovery_timeout = recovery_timeout
self.window_size = window_size
self._state = CircuitState.CLOSED
self._lock = threading.Lock()
# Sliding window: True = success, False = failure
self._results: Deque[bool] = deque(maxlen=window_size)
self._last_failure_time: Optional[float] = None
self._open_time: Optional[float] = None
@property
def state(self) -> CircuitState:
"""Get current circuit state, handling state transitions."""
with self._lock:
if self._state == CircuitState.OPEN:
# Check if we should transition to HALF_OPEN
if self._open_time and time.time() - self._open_time >= self.recovery_timeout:
self._transition_to(CircuitState.HALF_OPEN)
return self._state
def allow_request(self) -> bool:
"""Check if a request should be allowed.
Returns:
True if request should proceed, False if circuit is open.
"""
if not CIRCUIT_BREAKER_ENABLED:
return True
current_state = self.state
if current_state == CircuitState.CLOSED:
return True
elif current_state == CircuitState.HALF_OPEN:
# Allow probe requests while testing recovery (no explicit cap in this implementation)
return True
else: # OPEN
return False
def record_success(self) -> None:
"""Record a successful operation."""
if not CIRCUIT_BREAKER_ENABLED:
return
with self._lock:
self._results.append(True)
if self._state == CircuitState.HALF_OPEN:
# Success in half-open means we can close
self._transition_to(CircuitState.CLOSED)
def record_failure(self) -> None:
"""Record a failed operation."""
if not CIRCUIT_BREAKER_ENABLED:
return
with self._lock:
self._results.append(False)
self._last_failure_time = time.time()
if self._state == CircuitState.HALF_OPEN:
# Failure in half-open means back to open
self._transition_to(CircuitState.OPEN)
elif self._state == CircuitState.CLOSED:
# Check if we should open
self._check_and_open()
def _check_and_open(self) -> None:
"""Check failure rate and open circuit if needed.
Must be called with lock held.
"""
if len(self._results) < self.failure_threshold:
return
failure_count = sum(1 for r in self._results if not r)
failure_rate = failure_count / len(self._results)
if (failure_count >= self.failure_threshold and
failure_rate >= self.failure_rate_threshold):
self._transition_to(CircuitState.OPEN)
def _transition_to(self, new_state: CircuitState) -> None:
"""Transition to a new state with logging.
Must be called with lock held.
"""
old_state = self._state
self._state = new_state
if new_state == CircuitState.OPEN:
self._open_time = time.time()
logger.warning(
f"Circuit breaker '{self.name}' OPENED: "
f"state {old_state.value} -> {new_state.value}, "
f"failures: {sum(1 for r in self._results if not r)}/{len(self._results)}"
)
elif new_state == CircuitState.HALF_OPEN:
logger.info(
f"Circuit breaker '{self.name}' entering HALF_OPEN: "
f"testing service recovery..."
)
elif new_state == CircuitState.CLOSED:
self._open_time = None
self._results.clear()
logger.info(
f"Circuit breaker '{self.name}' CLOSED: "
f"service recovered"
)
def get_status(self) -> CircuitBreakerStatus:
"""Get current status information."""
with self._lock:
# Use _state directly to avoid deadlock (self.state would try to acquire lock again)
current_state = self._state
failure_count = sum(1 for r in self._results if not r)
success_count = sum(1 for r in self._results if r)
total = len(self._results)
failure_rate = failure_count / total if total > 0 else 0.0
open_until = None
if current_state == CircuitState.OPEN and self._open_time:
open_until_time = self._open_time + self.recovery_timeout
from datetime import datetime
open_until = datetime.fromtimestamp(open_until_time).isoformat()
last_failure = None
if self._last_failure_time:
from datetime import datetime
last_failure = datetime.fromtimestamp(self._last_failure_time).isoformat()
return CircuitBreakerStatus(
state=current_state.value,
failure_count=failure_count,
success_count=success_count,
total_count=total,
failure_rate=failure_rate,
last_failure_time=last_failure,
open_until=open_until,
enabled=CIRCUIT_BREAKER_ENABLED
)
def reset(self) -> None:
"""Reset the circuit breaker to initial state."""
with self._lock:
self._state = CircuitState.CLOSED
self._results.clear()
self._last_failure_time = None
self._open_time = None
logger.info(f"Circuit breaker '{self.name}' reset")
# ============================================================
# Global Database Circuit Breaker
# ============================================================
_DATABASE_CIRCUIT_BREAKER: Optional[CircuitBreaker] = None
def get_database_circuit_breaker() -> CircuitBreaker:
"""Get or create the global database circuit breaker."""
global _DATABASE_CIRCUIT_BREAKER
if _DATABASE_CIRCUIT_BREAKER is None:
_DATABASE_CIRCUIT_BREAKER = CircuitBreaker("database")
return _DATABASE_CIRCUIT_BREAKER
def get_circuit_breaker_status() -> dict:
"""Get current circuit breaker status as a dictionary.
Returns:
Dictionary with circuit breaker status information.
"""
cb = get_database_circuit_breaker()
status = cb.get_status()
return {
"state": status.state,
"failure_count": status.failure_count,
"success_count": status.success_count,
"total_count": status.total_count,
"failure_rate": round(status.failure_rate, 2),
"last_failure_time": status.last_failure_time,
"open_until": status.open_until,
"enabled": status.enabled
}
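
The state machine in `CircuitBreaker` can be exercised end to end with a condensed stand-in. This is a sketch of the same CLOSED -> OPEN -> HALF_OPEN -> CLOSED cycle with shortened timeouts and no locking; `MiniBreaker` is illustrative, not the class above:

```python
import time
from collections import deque
from enum import Enum

class State(Enum):
    CLOSED = "CLOSED"
    OPEN = "OPEN"
    HALF_OPEN = "HALF_OPEN"

class MiniBreaker:
    """Condensed, lock-free version of the sliding-window breaker."""
    def __init__(self, threshold=3, rate=0.5, recovery=0.1, window=5):
        self.threshold = threshold
        self.rate = rate
        self.recovery = recovery
        self.results = deque(maxlen=window)  # True = success, False = failure
        self.state = State.CLOSED
        self.open_time = None

    def current_state(self) -> State:
        # OPEN transitions to HALF_OPEN once the recovery timeout elapses
        if self.state is State.OPEN and time.time() - self.open_time >= self.recovery:
            self.state = State.HALF_OPEN
        return self.state

    def record(self, ok: bool) -> None:
        self.results.append(ok)
        if self.state is State.HALF_OPEN:
            # A probe success closes the circuit; a failure reopens it
            self.state = State.CLOSED if ok else State.OPEN
            if ok:
                self.results.clear()
            else:
                self.open_time = time.time()
        elif self.state is State.CLOSED and not ok:
            fails = self.results.count(False)
            if fails >= self.threshold and fails / len(self.results) >= self.rate:
                self.state = State.OPEN
                self.open_time = time.time()

cb = MiniBreaker()
for _ in range(3):
    cb.record(False)          # three straight failures trip the breaker
print(cb.current_state())     # State.OPEN
time.sleep(0.15)              # wait out the recovery timeout
print(cb.current_state())     # State.HALF_OPEN
cb.record(True)               # probe succeeds
print(cb.current_state())     # State.CLOSED
```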

View File

@@ -252,6 +252,7 @@ def read_sql_df(sql: str, params: Optional[Dict[str, Any]] = None) -> pd.DataFra
Raises:
Exception: If query execution fails. ORA code is logged.
RuntimeError: If circuit breaker is open (service degraded).
Example:
>>> sql = "SELECT * FROM users WHERE status = :status"
@@ -261,7 +262,21 @@ def read_sql_df(sql: str, params: Optional[Dict[str, Any]] = None) -> pd.DataFra
- Slow queries (>1s) are logged as warnings
- All queries use connection pooling via SQLAlchemy
- Call timeout is set to 55s to prevent worker blocking
- Circuit breaker protects against cascading failures
- Query latency is recorded for metrics
"""
from mes_dashboard.core.circuit_breaker import (
get_database_circuit_breaker,
CIRCUIT_BREAKER_ENABLED
)
from mes_dashboard.core.metrics import record_query_latency
# Check circuit breaker before executing
circuit_breaker = get_database_circuit_breaker()
if not circuit_breaker.allow_request():
logger.warning("Circuit breaker OPEN - rejecting database query")
raise RuntimeError("Database service is temporarily unavailable (circuit breaker open)")
start_time = time.time()
engine = get_engine()
@@ -271,6 +286,14 @@ def read_sql_df(sql: str, params: Optional[Dict[str, Any]] = None) -> pd.DataFra
df.columns = [str(c).upper() for c in df.columns]
elapsed = time.time() - start_time
# Record metrics
record_query_latency(elapsed)
# Record success to circuit breaker
if CIRCUIT_BREAKER_ENABLED:
circuit_breaker.record_success()
# Log slow queries (>1 second) as warnings
if elapsed > 1.0:
# Truncate SQL for logging (first 100 chars)
@@ -283,6 +306,14 @@ def read_sql_df(sql: str, params: Optional[Dict[str, Any]] = None) -> pd.DataFra
except Exception as exc:
elapsed = time.time() - start_time
# Record metrics even for failed queries
record_query_latency(elapsed)
# Record failure to circuit breaker
if CIRCUIT_BREAKER_ENABLED:
circuit_breaker.record_failure()
ora_code = _extract_ora_code(exc)
sql_preview = sql.strip().replace('\n', ' ')[:100]
logger.error(

View File

@@ -0,0 +1,473 @@
# -*- coding: utf-8 -*-
"""SQLite-based log store for admin dashboard.
Stores structured logs in a local SQLite database for admin querying.
Maintains existing file/STDERR logs for operations.
"""
from __future__ import annotations
import logging
import os
import sqlite3
import threading
import time
from contextlib import contextmanager
from datetime import datetime, timedelta
from pathlib import Path
from typing import Any, Dict, Generator, List, Optional
logger = logging.getLogger('mes_dashboard.log_store')
# ============================================================
# Configuration
# ============================================================
# SQLite database path
LOG_SQLITE_PATH = os.getenv(
'LOG_SQLITE_PATH',
'logs/admin_logs.sqlite'
)
# Retention policy
LOG_SQLITE_RETENTION_DAYS = int(os.getenv('LOG_SQLITE_RETENTION_DAYS', '7'))
LOG_SQLITE_MAX_ROWS = int(os.getenv('LOG_SQLITE_MAX_ROWS', '100000'))
# Enable/disable log store
LOG_STORE_ENABLED = os.getenv('LOG_STORE_ENABLED', 'true').lower() == 'true'
# ============================================================
# Database Schema
# ============================================================
CREATE_TABLE_SQL = """
CREATE TABLE IF NOT EXISTS logs (
id INTEGER PRIMARY KEY AUTOINCREMENT,
timestamp TEXT NOT NULL,
level TEXT NOT NULL,
logger_name TEXT NOT NULL,
message TEXT NOT NULL,
request_id TEXT,
user TEXT,
ip TEXT,
extra TEXT
);
"""
CREATE_INDEXES_SQL = [
"CREATE INDEX IF NOT EXISTS idx_logs_timestamp ON logs(timestamp);",
"CREATE INDEX IF NOT EXISTS idx_logs_level ON logs(level);",
"CREATE INDEX IF NOT EXISTS idx_logs_logger ON logs(logger_name);",
]
# ============================================================
# Log Store Implementation
# ============================================================
class LogStore:
"""SQLite-based log storage for admin dashboard queries.
Thread-safe implementation with connection pooling per thread.
Supports retention policy to prevent unbounded growth.
Usage:
store = LogStore()
store.initialize()
# Write logs
store.write_log(
level="ERROR",
logger_name="mes_dashboard.api",
message="Database connection failed",
user="admin@example.com"
)
# Query logs
logs = store.query_logs(level="ERROR", limit=100)
"""
def __init__(self, db_path: str = LOG_SQLITE_PATH):
"""Initialize log store.
Args:
db_path: Path to SQLite database file.
"""
self.db_path = db_path
self._local = threading.local()
self._write_lock = threading.Lock()
self._initialized = False
def initialize(self) -> None:
"""Initialize the database schema.
Creates tables and indexes if they don't exist.
"""
if self._initialized:
return
# Ensure directory exists
db_dir = Path(self.db_path).parent
db_dir.mkdir(parents=True, exist_ok=True)
with self._get_connection() as conn:
cursor = conn.cursor()
cursor.execute(CREATE_TABLE_SQL)
for index_sql in CREATE_INDEXES_SQL:
cursor.execute(index_sql)
conn.commit()
self._initialized = True
logger.info(f"Log store initialized at {self.db_path}")
@contextmanager
def _get_connection(self) -> Generator[sqlite3.Connection, None, None]:
"""Get a thread-local database connection.
Yields:
SQLite connection for the current thread.
"""
if not hasattr(self._local, 'connection') or self._local.connection is None:
self._local.connection = sqlite3.connect(
self.db_path,
timeout=10.0,
check_same_thread=False
)
self._local.connection.row_factory = sqlite3.Row
try:
yield self._local.connection
except sqlite3.Error as e:
logger.error(f"Database error: {e}")
# Reset connection on error
try:
self._local.connection.close()
except Exception:
pass
self._local.connection = None
raise
def write_log(
self,
level: str,
logger_name: str,
message: str,
request_id: Optional[str] = None,
user: Optional[str] = None,
ip: Optional[str] = None,
extra: Optional[Dict[str, Any]] = None
) -> bool:
"""Write a log entry to the database.
Args:
level: Log level (DEBUG, INFO, WARNING, ERROR, CRITICAL).
logger_name: Name of the logger.
message: Log message.
request_id: Optional request identifier.
user: Optional user identifier.
ip: Optional client IP address.
extra: Optional extra data as JSON-serializable dict.
Returns:
True if log was written successfully.
"""
if not LOG_STORE_ENABLED:
return False
if not self._initialized:
self.initialize()
timestamp = datetime.now().isoformat()
extra_str = None
if extra:
import json
try:
extra_str = json.dumps(extra, ensure_ascii=False)
except (TypeError, ValueError):
extra_str = str(extra)
try:
with self._write_lock:
with self._get_connection() as conn:
cursor = conn.cursor()
cursor.execute(
"""
INSERT INTO logs (timestamp, level, logger_name, message, request_id, user, ip, extra)
VALUES (?, ?, ?, ?, ?, ?, ?, ?)
""",
(timestamp, level, logger_name, message, request_id, user, ip, extra_str)
)
conn.commit()
return True
except Exception as e:
# Don't let log store errors propagate
logger.debug(f"Failed to write log to SQLite: {e}")
return False
def query_logs(
self,
level: Optional[str] = None,
q: Optional[str] = None,
limit: int = 200,
since: Optional[str] = None,
logger_name: Optional[str] = None
) -> List[Dict[str, Any]]:
"""Query logs from the database.
Args:
level: Filter by log level (e.g., "ERROR", "WARNING").
q: Search query for message content (case-insensitive).
limit: Maximum number of logs to return (default: 200).
since: ISO timestamp to filter logs after this time.
logger_name: Filter by logger name prefix.
Returns:
List of log entries as dictionaries.
"""
if not LOG_STORE_ENABLED:
return []
if not self._initialized:
self.initialize()
query = "SELECT * FROM logs WHERE 1=1"
params: List[Any] = []
if level:
query += " AND level = ?"
params.append(level.upper())
if q:
query += " AND message LIKE ?"
params.append(f"%{q}%")
if since:
query += " AND timestamp >= ?"
params.append(since)
if logger_name:
query += " AND logger_name LIKE ?"
params.append(f"{logger_name}%")
query += " ORDER BY timestamp DESC LIMIT ?"
params.append(limit)
try:
with self._get_connection() as conn:
cursor = conn.cursor()
cursor.execute(query, params)
rows = cursor.fetchall()
return [dict(row) for row in rows]
except Exception as e:
logger.error(f"Failed to query logs: {e}")
return []
def cleanup_old_logs(self) -> int:
"""Remove logs older than retention period or exceeding max rows.
Returns:
Number of logs deleted.
"""
if not LOG_STORE_ENABLED or not self._initialized:
return 0
deleted = 0
try:
with self._write_lock:
with self._get_connection() as conn:
cursor = conn.cursor()
# Delete logs older than retention days
cutoff_date = (
datetime.now() - timedelta(days=LOG_SQLITE_RETENTION_DAYS)
).isoformat()
cursor.execute(
"DELETE FROM logs WHERE timestamp < ?",
(cutoff_date,)
)
deleted += cursor.rowcount
# Delete excess logs if over max rows
cursor.execute("SELECT COUNT(*) FROM logs")
count = cursor.fetchone()[0]
if count > LOG_SQLITE_MAX_ROWS:
excess = count - LOG_SQLITE_MAX_ROWS
cursor.execute(
"""
DELETE FROM logs WHERE id IN (
SELECT id FROM logs ORDER BY timestamp ASC LIMIT ?
)
""",
(excess,)
)
deleted += cursor.rowcount
conn.commit()
if deleted > 0:
logger.info(f"Cleaned up {deleted} old log entries")
except Exception as e:
logger.error(f"Failed to cleanup logs: {e}")
return deleted
def get_stats(self) -> Dict[str, Any]:
"""Get log store statistics.
Returns:
Dictionary with stats (count, oldest, newest, size_bytes).
"""
if not LOG_STORE_ENABLED or not self._initialized:
return {
"enabled": LOG_STORE_ENABLED,
"count": 0,
"oldest": None,
"newest": None,
"size_bytes": 0
}
try:
with self._get_connection() as conn:
cursor = conn.cursor()
cursor.execute("SELECT COUNT(*) FROM logs")
count = cursor.fetchone()[0]
cursor.execute("SELECT MIN(timestamp), MAX(timestamp) FROM logs")
row = cursor.fetchone()
oldest = row[0]
newest = row[1]
# Get file size
size_bytes = 0
if Path(self.db_path).exists():
size_bytes = Path(self.db_path).stat().st_size
return {
"enabled": True,
"count": count,
"oldest": oldest,
"newest": newest,
"size_bytes": size_bytes,
"retention_days": LOG_SQLITE_RETENTION_DAYS,
"max_rows": LOG_SQLITE_MAX_ROWS
}
except Exception as e:
logger.error(f"Failed to get log stats: {e}")
return {
"enabled": True,
"count": 0,
"oldest": None,
"newest": None,
"size_bytes": 0,
"error": str(e)
}
def close(self) -> None:
"""Close database connections."""
if hasattr(self._local, 'connection') and self._local.connection:
try:
self._local.connection.close()
except Exception:
pass
self._local.connection = None
# ============================================================
# SQLite Log Handler
# ============================================================
class SQLiteLogHandler(logging.Handler):
"""Logging handler that writes to SQLite log store.
Integrates with Python's logging framework to automatically
capture logs for admin dashboard.
Usage:
handler = SQLiteLogHandler(log_store)
handler.setLevel(logging.INFO)
logging.getLogger().addHandler(handler)
"""
def __init__(self, log_store: LogStore):
"""Initialize the handler.
Args:
log_store: LogStore instance to write to.
"""
super().__init__()
self.log_store = log_store
def emit(self, record: logging.LogRecord) -> None:
"""Write a log record to the store.
Args:
record: Log record to write.
"""
try:
# Get extra context from request if available
request_id = getattr(record, 'request_id', None)
user = getattr(record, 'user', None)
ip = getattr(record, 'ip', None)
# Try to get from Flask's g object if not in record
try:
from flask import g, has_request_context
if has_request_context():
if not request_id:
request_id = getattr(g, 'request_id', None)
if not user:
user = getattr(g, 'user_email', None)
if not ip:
from flask import request
ip = request.remote_addr
except ImportError:
pass
self.log_store.write_log(
level=record.levelname,
logger_name=record.name,
message=self.format(record),
request_id=request_id,
user=user,
ip=ip
)
except Exception:
# Never let handler errors propagate
self.handleError(record)
# ============================================================
# Global Log Store Instance
# ============================================================
_LOG_STORE: Optional[LogStore] = None
def get_log_store() -> LogStore:
"""Get or create the global log store instance."""
global _LOG_STORE
if _LOG_STORE is None:
_LOG_STORE = LogStore()
if LOG_STORE_ENABLED:
_LOG_STORE.initialize()
return _LOG_STORE
def get_sqlite_log_handler() -> SQLiteLogHandler:
"""Get a configured SQLite log handler.
Returns:
Configured SQLiteLogHandler instance.
"""
handler = SQLiteLogHandler(get_log_store())
handler.setLevel(logging.INFO)
handler.setFormatter(logging.Formatter('%(message)s'))
return handler
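
The `LogStore` query path builds a parameterized WHERE clause over the schema above. A self-contained sketch against an in-memory database showing the same filter pattern (column set reduced for brevity):

```python
import sqlite3
from datetime import datetime

conn = sqlite3.connect(":memory:")
conn.row_factory = sqlite3.Row
conn.execute("""
    CREATE TABLE logs (
        id INTEGER PRIMARY KEY AUTOINCREMENT,
        timestamp TEXT NOT NULL,
        level TEXT NOT NULL,
        logger_name TEXT NOT NULL,
        message TEXT NOT NULL
    )
""")
now = datetime.now().isoformat()
conn.executemany(
    "INSERT INTO logs (timestamp, level, logger_name, message) VALUES (?, ?, ?, ?)",
    [
        (now, "ERROR", "mes_dashboard.api", "Database connection failed"),
        (now, "INFO", "mes_dashboard.api", "Request served"),
    ],
)
# Same dynamically-built filter shape as query_logs(level=..., q=..., limit=...)
rows = conn.execute(
    "SELECT * FROM logs WHERE level = ? AND message LIKE ? "
    "ORDER BY timestamp DESC LIMIT ?",
    ("ERROR", "%connection%", 200),
).fetchall()
print(len(rows), rows[0]["level"])  # 1 ERROR
```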

View File

@@ -0,0 +1,232 @@
# -*- coding: utf-8 -*-
"""Performance metrics collection for MES Dashboard.
Collects query latency metrics using an in-memory sliding window.
Each worker maintains independent statistics.
"""
from __future__ import annotations
import logging
import os
import threading
import time
from collections import deque
from dataclasses import dataclass
from datetime import datetime
from typing import Deque, List, Optional
logger = logging.getLogger('mes_dashboard.metrics')
# ============================================================
# Configuration
# ============================================================
# Maximum number of latency samples to keep
METRICS_WINDOW_SIZE = int(os.getenv('METRICS_WINDOW_SIZE', '1000'))
# Threshold for "slow" queries (seconds)
SLOW_QUERY_THRESHOLD = float(os.getenv('SLOW_QUERY_THRESHOLD', '1.0'))
# ============================================================
# Types
# ============================================================
@dataclass
class MetricsSummary:
"""Summary of collected metrics."""
p50_ms: float
p95_ms: float
p99_ms: float
count: int
slow_count: int
slow_rate: float
worker_pid: int
collected_at: str
# ============================================================
# Query Metrics Implementation
# ============================================================
class QueryMetrics:
"""Collects and summarizes query latency metrics.
Uses a thread-safe sliding window to track the most recent
query latencies. Provides percentile calculations for
monitoring and alerting.
Usage:
metrics = QueryMetrics()
# Record a query
start = time.time()
execute_query()
metrics.record_latency(time.time() - start)
# Get summary
summary = metrics.get_summary()
"""
def __init__(self, window_size: int = METRICS_WINDOW_SIZE):
"""Initialize query metrics collector.
Args:
window_size: Maximum number of samples to keep.
"""
self.window_size = window_size
self._latencies: Deque[float] = deque(maxlen=window_size)
self._lock = threading.Lock()
self._worker_pid = os.getpid()
def record_latency(self, latency_seconds: float) -> None:
"""Record a query latency.
Args:
latency_seconds: Query execution time in seconds.
"""
with self._lock:
self._latencies.append(latency_seconds)
# Log slow queries
if latency_seconds > SLOW_QUERY_THRESHOLD:
logger.warning(
f"Slow query detected: {latency_seconds:.2f}s "
f"(threshold: {SLOW_QUERY_THRESHOLD}s)"
)
def get_percentile(self, percentile: float) -> float:
"""Calculate a specific percentile from the latency data.
Args:
percentile: Percentile to calculate (0-100).
Returns:
Latency value at the given percentile in seconds.
"""
with self._lock:
if not self._latencies:
return 0.0
sorted_latencies = sorted(self._latencies)
index = int((percentile / 100.0) * len(sorted_latencies))
# Clamp index to valid range
index = min(index, len(sorted_latencies) - 1)
return sorted_latencies[index]
def get_percentiles(self) -> dict:
"""Calculate P50, P95, and P99 percentiles.
Returns:
Dictionary with percentile values in milliseconds.
"""
with self._lock:
if not self._latencies:
return {
"p50": 0.0,
"p95": 0.0,
"p99": 0.0,
"count": 0,
"slow_count": 0
}
sorted_latencies = sorted(self._latencies)
count = len(sorted_latencies)
def get_percentile_value(p: float) -> float:
index = int((p / 100.0) * count)
index = min(index, count - 1)
return sorted_latencies[index]
slow_count = sum(1 for l in sorted_latencies if l > SLOW_QUERY_THRESHOLD)
return {
"p50": get_percentile_value(50),
"p95": get_percentile_value(95),
"p99": get_percentile_value(99),
"count": count,
"slow_count": slow_count
}
def get_summary(self) -> MetricsSummary:
"""Get a complete metrics summary.
Returns:
MetricsSummary with all collected metrics.
"""
percentiles = self.get_percentiles()
slow_rate = 0.0
if percentiles["count"] > 0:
slow_rate = percentiles["slow_count"] / percentiles["count"]
return MetricsSummary(
p50_ms=round(percentiles["p50"] * 1000, 2),
p95_ms=round(percentiles["p95"] * 1000, 2),
p99_ms=round(percentiles["p99"] * 1000, 2),
count=percentiles["count"],
slow_count=percentiles["slow_count"],
slow_rate=round(slow_rate, 4),
worker_pid=self._worker_pid,
collected_at=datetime.now().isoformat()
)
def get_latencies(self) -> List[float]:
"""Get a copy of all recorded latencies.
Returns:
List of latencies in seconds.
"""
with self._lock:
return list(self._latencies)
def clear(self) -> None:
"""Clear all recorded metrics."""
with self._lock:
self._latencies.clear()
logger.info(f"Metrics cleared for worker {self._worker_pid}")
# ============================================================
# Global Query Metrics Instance
# ============================================================
_QUERY_METRICS: Optional[QueryMetrics] = None
def get_query_metrics() -> QueryMetrics:
"""Get or create the global query metrics instance."""
global _QUERY_METRICS
if _QUERY_METRICS is None:
_QUERY_METRICS = QueryMetrics()
return _QUERY_METRICS
def get_metrics_summary() -> dict:
"""Get current metrics summary as a dictionary.
Returns:
Dictionary with metrics summary information.
"""
metrics = get_query_metrics()
summary = metrics.get_summary()
return {
"p50_ms": summary.p50_ms,
"p95_ms": summary.p95_ms,
"p99_ms": summary.p99_ms,
"count": summary.count,
"slow_count": summary.slow_count,
"slow_rate": summary.slow_rate,
"worker_pid": summary.worker_pid,
"collected_at": summary.collected_at
}
def record_query_latency(latency_seconds: float) -> None:
"""Record a query latency to the global metrics.
Args:
latency_seconds: Query execution time in seconds.
"""
get_query_metrics().record_latency(latency_seconds)
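
`QueryMetrics` computes percentiles by nearest rank over the sorted window: `index = int(p/100 * n)`, clamped to the last element. A standalone sketch of that calculation on illustrative samples:

```python
SLOW_QUERY_THRESHOLD = 1.0  # seconds, mirroring the module default

def percentile(latencies, p):
    # Nearest-rank: floor(p/100 * n), clamped to the last sample
    if not latencies:
        return 0.0
    ordered = sorted(latencies)
    index = min(int((p / 100.0) * len(ordered)), len(ordered) - 1)
    return ordered[index]

samples = [0.1, 0.2, 0.3, 0.4, 0.5, 0.6, 0.7, 0.8, 0.9, 2.0]
print(percentile(samples, 50))  # 0.6
print(percentile(samples, 95))  # 2.0
slow = sum(1 for s in samples if s > SLOW_QUERY_THRESHOLD)
print(slow, slow / len(samples))  # 1 0.1
```

Note that the floor-index convention returns the upper median for an even-length window (index n/2), which is why P50 of these ten samples is 0.6 rather than 0.55.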

View File

@@ -0,0 +1,228 @@
# -*- coding: utf-8 -*-
"""Standard API response format utilities for MES Dashboard.
Provides consistent response envelope for all API endpoints.
"""
from __future__ import annotations
import os
from datetime import datetime
from typing import Any, Dict, Optional
from flask import jsonify, request
# ============================================================
# Standard Error Codes
# ============================================================
# Database errors
DB_CONNECTION_FAILED = "DB_CONNECTION_FAILED"
DB_QUERY_TIMEOUT = "DB_QUERY_TIMEOUT"
DB_QUERY_ERROR = "DB_QUERY_ERROR"
# Service errors
SERVICE_UNAVAILABLE = "SERVICE_UNAVAILABLE"
CIRCUIT_BREAKER_OPEN = "CIRCUIT_BREAKER_OPEN"
# Client errors
VALIDATION_ERROR = "VALIDATION_ERROR"
UNAUTHORIZED = "UNAUTHORIZED"
FORBIDDEN = "FORBIDDEN"
NOT_FOUND = "NOT_FOUND"
TOO_MANY_REQUESTS = "TOO_MANY_REQUESTS"
# Server errors
INTERNAL_ERROR = "INTERNAL_ERROR"
# ============================================================
# Response Functions
# ============================================================
def success_response(
data: Any,
meta: Optional[Dict[str, Any]] = None,
status_code: int = 200
):
"""Create a standardized success response.
Args:
data: The response data payload.
meta: Optional metadata (timestamp, request_id, etc.).
status_code: HTTP status code (default: 200).
Returns:
Flask response tuple (response, status_code).
Example:
>>> return success_response({"users": [...]})
>>> return success_response({"id": 1}, meta={"cached": True})
"""
response = {
"success": True,
"data": data,
}
# Add metadata if provided
if meta is not None:
response["meta"] = meta
else:
# Add default metadata
response["meta"] = {
"timestamp": datetime.now().isoformat(),
}
return jsonify(response), status_code
def error_response(
code: str,
message: str,
details: Optional[str] = None,
status_code: int = 500
):
"""Create a standardized error response.
Args:
code: Machine-readable error code (e.g., DB_CONNECTION_FAILED).
message: User-friendly error message.
details: Technical details (only shown in development mode).
status_code: HTTP status code (default: 500).
Returns:
Flask response tuple (response, status_code).
Example:
>>> return error_response(
... DB_CONNECTION_FAILED,
... "資料庫連線失敗,請稍後再試",
... "ORA-12541: TNS:no listener",
... status_code=503
... )
"""
error_obj = {
"code": code,
"message": message,
}
# Only include details in development mode
if details and _is_development_mode():
error_obj["details"] = details
response = {
"success": False,
"error": error_obj,
"meta": {
"timestamp": datetime.now().isoformat(),
}
}
return jsonify(response), status_code
def _is_development_mode() -> bool:
"""Check if the application is running in development mode."""
flask_env = os.getenv("FLASK_ENV", "production")
flask_debug = os.getenv("FLASK_DEBUG", "0")
return flask_env == "development" or flask_debug == "1"
# ============================================================
# Convenience Functions for Common Errors
# ============================================================
def db_connection_error(details: Optional[str] = None):
"""Return a database connection error response."""
return error_response(
DB_CONNECTION_FAILED,
"資料庫連線失敗,請稍後再試",
details,
status_code=503
)
def db_query_timeout_error(details: Optional[str] = None):
"""Return a database query timeout error response."""
return error_response(
DB_QUERY_TIMEOUT,
"資料庫查詢逾時,請稍後再試",
details,
status_code=504
)
def service_unavailable_error(details: Optional[str] = None):
"""Return a service unavailable error response."""
return error_response(
SERVICE_UNAVAILABLE,
"服務暫時無法使用,請稍後再試",
details,
status_code=503
)
def circuit_breaker_error(details: Optional[str] = None):
"""Return a circuit breaker open error response."""
return error_response(
CIRCUIT_BREAKER_OPEN,
"服務暫時降級中,請稍後再試",
details,
status_code=503
)
def validation_error(message: str, details: Optional[str] = None):
"""Return a validation error response."""
return error_response(
VALIDATION_ERROR,
message,
details,
status_code=400
)
def unauthorized_error(message: str = "請先登入"):
"""Return an unauthorized error response."""
return error_response(
UNAUTHORIZED,
message,
status_code=401
)
def forbidden_error(message: str = "權限不足"):
"""Return a forbidden error response."""
return error_response(
FORBIDDEN,
message,
status_code=403
)
def not_found_error(message: str = "找不到請求的資源"):
"""Return a not found error response."""
return error_response(
NOT_FOUND,
message,
status_code=404
)
def too_many_requests_error(message: str = "請求過於頻繁,請稍後再試"):
"""Return a too many requests error response."""
return error_response(
TOO_MANY_REQUESTS,
message,
status_code=429
)
def internal_error(details: Optional[str] = None):
"""Return an internal server error response."""
return error_response(
INTERNAL_ERROR,
"伺服器內部錯誤",
details,
status_code=500
)
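Stripped of Flask, the envelope these helpers produce can be sketched as plain dictionaries. A hypothetical, framework-free reproduction for illustration (`build_success` and `build_error` are invented names, not part of the module):

```python
from datetime import datetime

def build_success(data, meta=None):
    """Mirror success_response's JSON body without the Flask layer."""
    return {
        "success": True,
        "data": data,
        "meta": meta or {"timestamp": datetime.now().isoformat()},
    }

def build_error(code, message, details=None, dev_mode=False):
    """Mirror error_response's JSON body; details only leak in dev mode."""
    error_obj = {"code": code, "message": message}
    if details and dev_mode:
        error_obj["details"] = details
    return {
        "success": False,
        "error": error_obj,
        "meta": {"timestamp": datetime.now().isoformat()},
    }
```

The key design choice is that technical `details` never reach production responses, so stack traces and ORA-/TNS-style errors stay out of client-visible payloads.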


@@ -1,15 +1,376 @@
# -*- coding: utf-8 -*-
"""Admin routes for page management."""
"""Admin routes for page management and performance monitoring."""
from __future__ import annotations
import json
import logging
import os
import time
from datetime import datetime
from pathlib import Path
from flask import Blueprint, g, jsonify, render_template, request
from mes_dashboard.core.permissions import admin_required
from mes_dashboard.core.response import error_response, TOO_MANY_REQUESTS
from mes_dashboard.services.page_registry import get_all_pages, set_page_status
admin_bp = Blueprint("admin", __name__, url_prefix="/admin")
logger = logging.getLogger("mes_dashboard.admin")
# ============================================================
# Worker Restart Configuration
# ============================================================
RESTART_FLAG_PATH = os.getenv(
"WATCHDOG_RESTART_FLAG",
"/tmp/mes_dashboard_restart.flag"
)
RESTART_STATE_PATH = os.getenv(
"WATCHDOG_STATE_FILE",
"/tmp/mes_dashboard_restart_state.json"
)
RESTART_COOLDOWN_SECONDS = int(os.getenv("WORKER_RESTART_COOLDOWN", "60"))
# Track last restart request time (in-memory for this worker)
_last_restart_request: float = 0.0
# ============================================================
# Performance Monitoring Routes
# ============================================================
@admin_bp.route("/performance")
@admin_required
def performance():
"""Performance monitoring dashboard."""
return render_template("admin/performance.html")
@admin_bp.route("/api/system-status", methods=["GET"])
@admin_required
def api_system_status():
"""API: Get system status for performance dashboard."""
    from mes_dashboard.core.redis_client import REDIS_ENABLED
from mes_dashboard.core.circuit_breaker import get_circuit_breaker_status
from mes_dashboard.routes.health_routes import check_database, check_redis
# Database status
db_status, db_error = check_database()
# Redis status
redis_status = 'disabled'
if REDIS_ENABLED:
redis_status, _ = check_redis()
# Circuit breaker status
circuit_breaker = get_circuit_breaker_status()
# Cache status
from mes_dashboard.routes.health_routes import (
get_cache_status,
get_resource_cache_status,
get_equipment_status_cache_status
)
return jsonify({
"success": True,
"data": {
"database": {
"status": db_status,
"error": db_error
},
"redis": {
"status": redis_status,
"enabled": REDIS_ENABLED
},
"circuit_breaker": circuit_breaker,
"cache": {
"wip": get_cache_status(),
"resource": get_resource_cache_status(),
"equipment": get_equipment_status_cache_status()
},
"worker_pid": os.getpid()
}
})
@admin_bp.route("/api/metrics", methods=["GET"])
@admin_required
def api_metrics():
"""API: Get performance metrics for dashboard."""
from mes_dashboard.core.metrics import get_metrics_summary, get_query_metrics
summary = get_metrics_summary()
metrics = get_query_metrics()
return jsonify({
"success": True,
"data": {
"p50_ms": summary.get("p50_ms"),
"p95_ms": summary.get("p95_ms"),
"p99_ms": summary.get("p99_ms"),
"count": summary.get("count"),
"slow_count": summary.get("slow_count"),
"slow_rate": summary.get("slow_rate"),
"worker_pid": summary.get("worker_pid"),
"collected_at": summary.get("collected_at"),
# Include latency distribution for charts
"latencies": metrics.get_latencies()[-100:] # Last 100 for chart
}
})
@admin_bp.route("/api/logs", methods=["GET"])
@admin_required
def api_logs():
"""API: Get recent logs from SQLite log store."""
from mes_dashboard.core.log_store import get_log_store, LOG_STORE_ENABLED
if not LOG_STORE_ENABLED:
return jsonify({
"success": True,
"data": {
"logs": [],
"enabled": False
}
})
# Query parameters
level = request.args.get("level")
q = request.args.get("q")
limit = request.args.get("limit", 200, type=int)
since = request.args.get("since")
log_store = get_log_store()
logs = log_store.query_logs(
level=level,
q=q,
limit=min(limit, 500), # Cap at 500
since=since
)
return jsonify({
"success": True,
"data": {
"logs": logs,
"count": len(logs),
"enabled": True,
"stats": log_store.get_stats()
}
})
@admin_bp.route("/api/logs/cleanup", methods=["POST"])
@admin_required
def api_logs_cleanup():
"""API: Manually trigger log cleanup.
Supports optional parameters:
- older_than_days: Delete logs older than N days (default: use configured retention)
- keep_count: Keep only the most recent N logs (optional)
"""
from mes_dashboard.core.log_store import get_log_store, LOG_STORE_ENABLED
if not LOG_STORE_ENABLED:
return jsonify({
"success": False,
"error": "Log store is disabled"
}), 400
log_store = get_log_store()
# Get current stats before cleanup
stats_before = log_store.get_stats()
# Perform cleanup
deleted = log_store.cleanup_old_logs()
# Get stats after cleanup
stats_after = log_store.get_stats()
user = getattr(g, "username", "unknown")
logger.info(f"Log cleanup triggered by {user}: deleted {deleted} entries")
return jsonify({
"success": True,
"data": {
"deleted": deleted,
"before": {
"count": stats_before.get("count", 0),
"size_bytes": stats_before.get("size_bytes", 0)
},
"after": {
"count": stats_after.get("count", 0),
"size_bytes": stats_after.get("size_bytes", 0)
}
}
})
# ============================================================
# Worker Restart Control Routes
# ============================================================
def _get_restart_state() -> dict:
"""Read worker restart state from file."""
state_path = Path(RESTART_STATE_PATH)
if not state_path.exists():
return {}
try:
return json.loads(state_path.read_text())
except (json.JSONDecodeError, IOError):
return {}
def _check_restart_cooldown() -> tuple[bool, float]:
"""Check if restart is in cooldown.
Returns:
Tuple of (is_in_cooldown, remaining_seconds).
"""
global _last_restart_request
# Check in-memory cooldown first
now = time.time()
elapsed = now - _last_restart_request
if elapsed < RESTART_COOLDOWN_SECONDS:
return True, RESTART_COOLDOWN_SECONDS - elapsed
# Check file-based state (for cross-worker coordination)
state = _get_restart_state()
last_restart = state.get("last_restart", {})
requested_at = last_restart.get("requested_at")
if requested_at:
try:
request_time = datetime.fromisoformat(requested_at).timestamp()
elapsed = now - request_time
if elapsed < RESTART_COOLDOWN_SECONDS:
return True, RESTART_COOLDOWN_SECONDS - elapsed
except (ValueError, TypeError):
pass
return False, 0.0
@admin_bp.route("/api/worker/restart", methods=["POST"])
@admin_required
def api_worker_restart():
"""API: Request worker restart.
Writes a restart flag file that the watchdog process monitors.
Enforces a 60-second cooldown between restart requests.
"""
global _last_restart_request
# Check cooldown
in_cooldown, remaining = _check_restart_cooldown()
if in_cooldown:
return error_response(
TOO_MANY_REQUESTS,
f"Restart in cooldown. Please wait {int(remaining)} seconds.",
status_code=429
)
# Get request metadata
user = getattr(g, "username", "unknown")
ip = request.remote_addr or "unknown"
timestamp = datetime.now().isoformat()
# Write restart flag file
flag_path = Path(RESTART_FLAG_PATH)
flag_data = {
"user": user,
"ip": ip,
"timestamp": timestamp,
"worker_pid": os.getpid()
}
try:
flag_path.write_text(json.dumps(flag_data))
except IOError as e:
logger.error(f"Failed to write restart flag: {e}")
return error_response(
"RESTART_FAILED",
f"Failed to request restart: {e}",
status_code=500
)
# Update in-memory cooldown
_last_restart_request = time.time()
logger.info(
f"Worker restart requested by {user} from {ip}"
)
return jsonify({
"success": True,
"data": {
"message": "Restart requested. Workers will reload shortly.",
"requested_by": user,
"requested_at": timestamp
}
})
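The flag-file handshake between this route and the watchdog can be exercised standalone. A minimal sketch, assuming the watchdog consumes and then deletes the flag (both function names are hypothetical; the real watchdog side is not in this diff):

```python
import json
import os
import tempfile
from datetime import datetime
from pathlib import Path

def write_restart_flag(flag_path, user, ip):
    """Write the restart request as JSON for the watchdog to pick up."""
    payload = {
        "user": user,
        "ip": ip,
        "timestamp": datetime.now().isoformat(),
        "worker_pid": os.getpid(),
    }
    Path(flag_path).write_text(json.dumps(payload))
    return payload

def read_restart_flag(flag_path):
    """Consume the flag: read it, delete it, return the request metadata."""
    p = Path(flag_path)
    if not p.exists():
        return None
    data = json.loads(p.read_text())
    p.unlink()  # the watchdog removes the flag after acting on it
    return data
```

Deleting the flag on read makes each restart request one-shot, so a stale flag left from a previous deploy cannot trigger repeated reloads.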
@admin_bp.route("/api/worker/status", methods=["GET"])
@admin_required
def api_worker_status():
"""API: Get worker status and restart information."""
# Check cooldown
in_cooldown, remaining = _check_restart_cooldown()
# Get last restart info
state = _get_restart_state()
last_restart = state.get("last_restart", {})
# Get worker start time (psutil is optional)
worker_start_time = None
try:
import psutil
process = psutil.Process(os.getpid())
worker_start_time = datetime.fromtimestamp(
process.create_time()
).isoformat()
    except ImportError:
        # psutil not installed; derive start time from /proc on Linux
        try:
            with open(f"/proc/{os.getpid()}/stat") as f:
                # Field 22 (1-based) is starttime in clock ticks since boot;
                # fields after the ")" closing the comm field start at field 3
                start_ticks = int(f.read().rsplit(")", 1)[1].split()[19])
            with open("/proc/stat") as f:
                boot_time = next(
                    int(line.split()[1]) for line in f if line.startswith("btime")
                )
            worker_start_time = datetime.fromtimestamp(
                boot_time + start_ticks / os.sysconf("SC_CLK_TCK")
            ).isoformat()
        except Exception:
            pass
except Exception:
pass
return jsonify({
"success": True,
"data": {
"worker_pid": os.getpid(),
"worker_start_time": worker_start_time,
"cooldown": {
"active": in_cooldown,
"remaining_seconds": int(remaining) if in_cooldown else 0
},
"last_restart": {
"requested_by": last_restart.get("requested_by"),
"requested_at": last_restart.get("requested_at"),
"requested_ip": last_restart.get("requested_ip"),
"completed_at": last_restart.get("completed_at"),
"success": last_restart.get("success")
}
}
})
# ============================================================
# Page Management Routes
# ============================================================
@admin_bp.route("/pages")
@admin_required


@@ -1,12 +1,14 @@
# -*- coding: utf-8 -*-
"""Health check endpoints for MES Dashboard.
Provides /health and /health/deep endpoints for monitoring service status.
"""
from __future__ import annotations
import logging
import time
from datetime import datetime, timedelta
from flask import Blueprint, jsonify, make_response
from mes_dashboard.core.database import get_engine
@@ -25,6 +27,13 @@ logger = logging.getLogger('mes_dashboard.health')
health_bp = Blueprint('health', __name__)
# ============================================================
# Warning Thresholds
# ============================================================
DB_LATENCY_WARNING_MS = 100 # Database latency > 100ms is slow
CACHE_STALE_MINUTES = 2 # Cache update > 2 minutes is stale
def check_database() -> tuple[str, str | None]:
"""Check database connectivity.
@@ -196,3 +205,134 @@ def health_check():
resp.headers['Pragma'] = 'no-cache'
resp.headers['Expires'] = '0'
return resp
@health_bp.route('/health/deep', methods=['GET'])
def deep_health_check():
"""Deep health check endpoint with detailed metrics.
Requires admin authentication.
Returns:
- 200 OK with detailed health information
- 503 if database is unhealthy
"""
from mes_dashboard.core.permissions import is_admin_logged_in
from mes_dashboard.core.circuit_breaker import get_circuit_breaker_status
from mes_dashboard.core.metrics import get_metrics_summary
from flask import redirect, url_for, request
# Require admin authentication - redirect to login for consistency
if not is_admin_logged_in():
return redirect(url_for("auth.login", next=request.url))
# Check database with latency measurement
db_start = time.time()
db_status, db_error = check_database()
db_latency_ms = round((time.time() - db_start) * 1000, 2)
# Check Redis with latency measurement
redis_latency_ms = None
if REDIS_ENABLED:
redis_start = time.time()
redis_status, redis_error = check_redis()
redis_latency_ms = round((time.time() - redis_start) * 1000, 2)
else:
redis_status = 'disabled'
# Get circuit breaker status
circuit_breaker = get_circuit_breaker_status()
# Get performance metrics
metrics = get_metrics_summary()
# Get cache freshness
cache_status = get_cache_status()
cache_updated_at = cache_status.get('updated_at')
cache_is_stale = False
if cache_updated_at:
try:
updated_time = datetime.fromisoformat(cache_updated_at)
cache_is_stale = datetime.now() - updated_time > timedelta(minutes=CACHE_STALE_MINUTES)
except (ValueError, TypeError):
pass
# Determine overall status with thresholds
warnings = []
status = 'healthy'
http_code = 200
if db_status == 'error':
status = 'unhealthy'
http_code = 503
elif circuit_breaker.get('state') == 'OPEN':
status = 'degraded'
warnings.append("Circuit breaker is OPEN")
elif redis_status == 'error':
status = 'degraded'
warnings.append("Redis unavailable")
# Check latency thresholds
db_latency_status = 'healthy'
if db_latency_ms > DB_LATENCY_WARNING_MS:
db_latency_status = 'slow'
warnings.append(f"Database latency is slow ({db_latency_ms}ms)")
# Check cache staleness
cache_freshness = 'fresh'
if cache_is_stale:
cache_freshness = 'stale'
warnings.append("Cache data may be stale")
# Get connection pool status
try:
engine = get_engine()
pool = engine.pool
pool_status = {
'size': pool.size(),
'checked_out': pool.checkedout(),
'overflow': pool.overflow(),
'checked_in': pool.checkedin()
}
except Exception:
pool_status = None
response = {
'status': status,
'checks': {
'database': {
'status': db_latency_status if db_status == 'ok' else 'error',
'latency_ms': db_latency_ms,
'pool': pool_status
},
'redis': {
'status': 'healthy' if redis_status == 'ok' else redis_status,
'latency_ms': redis_latency_ms
},
'circuit_breaker': circuit_breaker,
'cache': {
'freshness': cache_freshness,
'updated_at': cache_updated_at,
'sys_date': cache_status.get('sys_date')
}
},
'metrics': {
'query_p50_ms': metrics.get('p50_ms'),
'query_p95_ms': metrics.get('p95_ms'),
'query_p99_ms': metrics.get('p99_ms'),
'query_count': metrics.get('count'),
'slow_query_count': metrics.get('slow_count'),
'slow_query_rate': metrics.get('slow_rate'),
'worker_pid': metrics.get('worker_pid')
}
}
if warnings:
response['warnings'] = warnings
# Add no-cache headers
resp = make_response(jsonify(response), http_code)
resp.headers['Cache-Control'] = 'no-cache, no-store, must-revalidate'
resp.headers['Pragma'] = 'no-cache'
resp.headers['Expires'] = '0'
return resp


@@ -0,0 +1,81 @@
{% extends "_base.html" %}
{% block title %}頁面不存在 - MES Dashboard{% endblock %}
{% block head_extra %}
<style>
* {
margin: 0;
padding: 0;
box-sizing: border-box;
}
body {
font-family: 'Microsoft JhengHei', Arial, sans-serif;
background: #f5f7fa;
color: #222;
min-height: 100vh;
display: flex;
align-items: center;
justify-content: center;
}
.error-container {
text-align: center;
padding: 40px;
}
.error-icon {
font-size: 80px;
margin-bottom: 20px;
}
.error-code {
font-size: 72px;
font-weight: bold;
color: #667eea;
margin-bottom: 10px;
}
.error-title {
font-size: 28px;
color: #333;
margin-bottom: 12px;
}
.error-message {
font-size: 16px;
color: #666;
margin-bottom: 30px;
line-height: 1.6;
}
.home-btn {
display: inline-block;
padding: 12px 24px;
font-size: 16px;
color: white;
background: linear-gradient(135deg, #667eea 0%, #764ba2 100%);
border-radius: 8px;
text-decoration: none;
transition: transform 0.2s ease, box-shadow 0.2s ease;
}
.home-btn:hover {
transform: translateY(-2px);
box-shadow: 0 4px 12px rgba(102, 126, 234, 0.4);
}
</style>
{% endblock %}
{% block content %}
<div class="error-container">
<div class="error-code">404</div>
<h1 class="error-title">頁面不存在</h1>
<p class="error-message">
您要找的頁面不存在或已被移除。<br>
請檢查網址是否正確,或返回首頁。
</p>
<a href="{{ url_for('portal_index') }}" class="home-btn">返回首頁</a>
</div>
{% endblock %}


@@ -0,0 +1,101 @@
{% extends "_base.html" %}
{% block title %}系統錯誤 - MES Dashboard{% endblock %}
{% block head_extra %}
<style>
* {
margin: 0;
padding: 0;
box-sizing: border-box;
}
body {
font-family: 'Microsoft JhengHei', Arial, sans-serif;
background: #f5f7fa;
color: #222;
min-height: 100vh;
display: flex;
align-items: center;
justify-content: center;
}
.error-container {
text-align: center;
padding: 40px;
}
.error-icon {
font-size: 80px;
margin-bottom: 20px;
}
.error-code {
font-size: 72px;
font-weight: bold;
color: #e74c3c;
margin-bottom: 10px;
}
.error-title {
font-size: 28px;
color: #333;
margin-bottom: 12px;
}
.error-message {
font-size: 16px;
color: #666;
margin-bottom: 30px;
line-height: 1.6;
}
.home-btn {
display: inline-block;
padding: 12px 24px;
font-size: 16px;
color: white;
background: linear-gradient(135deg, #667eea 0%, #764ba2 100%);
border-radius: 8px;
text-decoration: none;
transition: transform 0.2s ease, box-shadow 0.2s ease;
}
.home-btn:hover {
transform: translateY(-2px);
box-shadow: 0 4px 12px rgba(102, 126, 234, 0.4);
}
.retry-btn {
display: inline-block;
padding: 12px 24px;
font-size: 16px;
color: #667eea;
background: white;
border: 2px solid #667eea;
border-radius: 8px;
text-decoration: none;
margin-left: 12px;
cursor: pointer;
transition: all 0.2s ease;
}
.retry-btn:hover {
background: #667eea;
color: white;
}
</style>
{% endblock %}
{% block content %}
<div class="error-container">
<div class="error-code">500</div>
<h1 class="error-title">系統發生錯誤</h1>
<p class="error-message">
很抱歉,系統發生了內部錯誤。<br>
我們的技術團隊已收到通知,請稍後再試。
</p>
<a href="{{ url_for('portal_index') }}" class="home-btn">返回首頁</a>
<button class="retry-btn" onclick="location.reload()">重試</button>
</div>
{% endblock %}

File diff suppressed because it is too large


@@ -297,6 +297,7 @@
{% if is_admin %}
<span class="admin-name">{{ admin_user.displayName }}</span>
<a href="{{ url_for('admin.pages') }}">頁面管理</a>
<a href="{{ url_for('admin.performance') }}">效能監控</a>
<a href="{{ url_for('auth.logout') }}">登出</a>
{% else %}
<a href="{{ url_for('auth.login') }}">管理員登入</a>

View File

@@ -0,0 +1,223 @@
# -*- coding: utf-8 -*-
"""Unit tests for circuit breaker module."""
import os
import pytest
import time
from unittest.mock import patch
# Set circuit breaker enabled for tests
os.environ['CIRCUIT_BREAKER_ENABLED'] = 'true'
from mes_dashboard.core.circuit_breaker import (
CircuitBreaker,
CircuitState,
get_database_circuit_breaker,
get_circuit_breaker_status,
CIRCUIT_BREAKER_ENABLED
)
class TestCircuitBreakerStates:
"""Test circuit breaker state transitions."""
def test_initial_state_is_closed(self):
"""Circuit breaker starts in CLOSED state."""
cb = CircuitBreaker("test")
assert cb.state == CircuitState.CLOSED
def test_allow_request_when_closed(self):
"""Requests are allowed when circuit is CLOSED."""
cb = CircuitBreaker("test")
assert cb.allow_request() is True
def test_record_success_keeps_closed(self):
"""Recording success keeps circuit CLOSED."""
cb = CircuitBreaker("test")
cb.record_success()
assert cb.state == CircuitState.CLOSED
def test_opens_after_failure_threshold(self):
"""Circuit opens after reaching failure threshold."""
cb = CircuitBreaker(
"test",
failure_threshold=3,
failure_rate_threshold=0.5,
window_size=5
)
# Record enough failures to open
for _ in range(5):
cb.record_failure()
assert cb.state == CircuitState.OPEN
def test_deny_request_when_open(self):
"""Requests are denied when circuit is OPEN."""
cb = CircuitBreaker(
"test",
failure_threshold=2,
failure_rate_threshold=0.5,
window_size=4
)
# Force open
for _ in range(4):
cb.record_failure()
assert cb.allow_request() is False
def test_transition_to_half_open_after_timeout(self):
"""Circuit transitions to HALF_OPEN after recovery timeout."""
cb = CircuitBreaker(
"test",
failure_threshold=2,
failure_rate_threshold=0.5,
window_size=4,
recovery_timeout=1 # 1 second for fast test
)
# Force open
for _ in range(4):
cb.record_failure()
assert cb.state == CircuitState.OPEN
# Wait for recovery timeout
time.sleep(1.1)
# Accessing state should transition to HALF_OPEN
assert cb.state == CircuitState.HALF_OPEN
def test_half_open_allows_request(self):
"""Requests are allowed in HALF_OPEN state for testing."""
cb = CircuitBreaker(
"test",
failure_threshold=2,
failure_rate_threshold=0.5,
window_size=4,
recovery_timeout=1
)
# Force open
for _ in range(4):
cb.record_failure()
# Wait for recovery timeout
time.sleep(1.1)
assert cb.allow_request() is True
def test_success_in_half_open_closes(self):
"""Success in HALF_OPEN state closes the circuit."""
cb = CircuitBreaker(
"test",
failure_threshold=2,
failure_rate_threshold=0.5,
window_size=4,
recovery_timeout=1
)
# Force open
for _ in range(4):
cb.record_failure()
# Wait for recovery timeout
time.sleep(1.1)
# Force HALF_OPEN check
_ = cb.state
# Record success
cb.record_success()
assert cb.state == CircuitState.CLOSED
def test_failure_in_half_open_reopens(self):
"""Failure in HALF_OPEN state reopens the circuit."""
cb = CircuitBreaker(
"test",
failure_threshold=2,
failure_rate_threshold=0.5,
window_size=4,
recovery_timeout=1
)
# Force open
for _ in range(4):
cb.record_failure()
# Wait for recovery timeout
time.sleep(1.1)
# Force HALF_OPEN check
_ = cb.state
# Record failure
cb.record_failure()
assert cb.state == CircuitState.OPEN
def test_reset_clears_state(self):
"""Reset returns circuit to initial state."""
cb = CircuitBreaker(
"test",
failure_threshold=2,
failure_rate_threshold=0.5,
window_size=4
)
# Force open
for _ in range(4):
cb.record_failure()
cb.reset()
assert cb.state == CircuitState.CLOSED
status = cb.get_status()
assert status.total_count == 0
class TestCircuitBreakerStatus:
"""Test circuit breaker status reporting."""
def test_get_status_returns_correct_info(self):
"""Status includes all expected fields."""
cb = CircuitBreaker("test")
cb.record_success()
cb.record_success()
cb.record_failure()
status = cb.get_status()
assert status.state == "CLOSED"
assert status.success_count == 2
assert status.failure_count == 1
assert status.total_count == 3
assert 0.3 <= status.failure_rate <= 0.34
def test_get_circuit_breaker_status_dict(self):
"""Global function returns status as dictionary."""
status = get_circuit_breaker_status()
assert "state" in status
assert "failure_count" in status
assert "success_count" in status
assert "enabled" in status
class TestCircuitBreakerDisabled:
"""Test circuit breaker when disabled."""
def test_allow_request_when_disabled(self):
"""Requests always allowed when circuit breaker is disabled."""
with patch('mes_dashboard.core.circuit_breaker.CIRCUIT_BREAKER_ENABLED', False):
cb = CircuitBreaker("test", failure_threshold=1, window_size=1)
# Record failures
cb.record_failure()
cb.record_failure()
# Should still allow (disabled)
assert cb.allow_request() is True
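The transitions exercised by these tests follow the standard circuit-breaker pattern. A minimal self-contained sketch of the CLOSED/OPEN/HALF_OPEN cycle (a toy class for illustration, not the project's `CircuitBreaker`):

```python
import time

class MiniBreaker:
    """Toy circuit breaker illustrating the tested state transitions."""

    def __init__(self, failure_threshold=3, recovery_timeout=1.0):
        self.failure_threshold = failure_threshold
        self.recovery_timeout = recovery_timeout
        self.failures = 0
        self.opened_at = None  # None means the circuit is CLOSED

    @property
    def state(self):
        if self.opened_at is None:
            return "CLOSED"
        if time.time() - self.opened_at >= self.recovery_timeout:
            return "HALF_OPEN"  # timeout elapsed: allow a probe request
        return "OPEN"

    def allow_request(self):
        return self.state != "OPEN"

    def record_success(self):
        # A success (e.g. the HALF_OPEN probe) closes the circuit
        self.failures = 0
        self.opened_at = None

    def record_failure(self):
        self.failures += 1
        if self.failures >= self.failure_threshold:
            self.opened_at = time.time()  # (re)open the circuit
```

The real implementation adds a sliding window and failure-rate threshold on top of this skeleton; the skeleton only counts consecutive failures.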

tests/test_log_store.py

@@ -0,0 +1,277 @@
# -*- coding: utf-8 -*-
"""Unit tests for SQLite log store module."""
import os
import pytest
import sqlite3
import tempfile
import time
from datetime import datetime, timedelta
from unittest.mock import patch
from mes_dashboard.core.log_store import (
LogStore,
SQLiteLogHandler,
LOG_STORE_ENABLED
)
class TestLogStore:
"""Test LogStore class."""
@pytest.fixture
def temp_db_path(self):
"""Create a temporary database file."""
fd, path = tempfile.mkstemp(suffix='.db')
os.close(fd)
yield path
# Cleanup
try:
os.unlink(path)
except OSError:
pass
@pytest.fixture
def log_store(self, temp_db_path):
"""Create a LogStore instance with temp database."""
store = LogStore(db_path=temp_db_path)
store.initialize() # Explicitly initialize
return store
def test_init_creates_table(self, temp_db_path):
"""LogStore creates logs table on init."""
store = LogStore(db_path=temp_db_path)
store.initialize()
conn = sqlite3.connect(temp_db_path)
cursor = conn.cursor()
cursor.execute(
"SELECT name FROM sqlite_master WHERE type='table' AND name='logs'"
)
result = cursor.fetchone()
conn.close()
assert result is not None
assert result[0] == 'logs'
def test_write_log(self, log_store):
"""Write a log entry successfully."""
log_store.write_log(
level="INFO",
logger_name="test.logger",
message="Test message",
request_id="req-123",
user="testuser",
ip="192.168.1.1"
)
logs = log_store.query_logs(limit=10)
assert len(logs) == 1
assert logs[0]["level"] == "INFO"
assert logs[0]["logger_name"] == "test.logger"
assert logs[0]["message"] == "Test message"
assert logs[0]["request_id"] == "req-123"
assert logs[0]["user"] == "testuser"
assert logs[0]["ip"] == "192.168.1.1"
def test_query_logs_by_level(self, log_store):
"""Query logs filtered by level."""
log_store.write_log(level="INFO", logger_name="test", message="Info msg")
log_store.write_log(level="ERROR", logger_name="test", message="Error msg")
log_store.write_log(level="WARNING", logger_name="test", message="Warning msg")
error_logs = log_store.query_logs(level="ERROR", limit=10)
assert len(error_logs) == 1
assert error_logs[0]["level"] == "ERROR"
def test_query_logs_by_keyword(self, log_store):
"""Query logs filtered by keyword search."""
log_store.write_log(level="INFO", logger_name="test", message="User logged in")
log_store.write_log(level="INFO", logger_name="test", message="Data processed")
log_store.write_log(level="INFO", logger_name="test", message="User logged out")
user_logs = log_store.query_logs(q="User", limit=10)
assert len(user_logs) == 2
def test_query_logs_limit(self, log_store):
"""Query logs respects limit parameter."""
for i in range(20):
log_store.write_log(level="INFO", logger_name="test", message=f"Msg {i}")
logs = log_store.query_logs(limit=5)
assert len(logs) == 5
def test_query_logs_since(self, log_store):
"""Query logs filtered by timestamp."""
# Write some old logs
log_store.write_log(level="INFO", logger_name="test", message="Old msg")
# Record time after first log
time.sleep(0.1)
since_time = datetime.now().isoformat()
# Write some new logs
time.sleep(0.1)
log_store.write_log(level="INFO", logger_name="test", message="New msg 1")
log_store.write_log(level="INFO", logger_name="test", message="New msg 2")
logs = log_store.query_logs(since=since_time, limit=10)
assert len(logs) == 2
def test_query_logs_order(self, log_store):
"""Query logs returns most recent first."""
log_store.write_log(level="INFO", logger_name="test", message="First")
time.sleep(0.01)
log_store.write_log(level="INFO", logger_name="test", message="Second")
time.sleep(0.01)
log_store.write_log(level="INFO", logger_name="test", message="Third")
logs = log_store.query_logs(limit=10)
assert logs[0]["message"] == "Third"
assert logs[2]["message"] == "First"
def test_get_stats(self, log_store, temp_db_path):
"""Get stats returns count and size."""
log_store.write_log(level="INFO", logger_name="test", message="Msg 1")
log_store.write_log(level="INFO", logger_name="test", message="Msg 2")
stats = log_store.get_stats()
assert stats["count"] == 2
assert stats["size_bytes"] > 0
class TestLogStoreRetention:
"""Test log store retention policies."""
@pytest.fixture
def temp_db_path(self):
"""Create a temporary database file."""
fd, path = tempfile.mkstemp(suffix='.db')
os.close(fd)
yield path
try:
os.unlink(path)
except OSError:
pass
def test_cleanup_by_max_rows(self, temp_db_path):
"""Cleanup removes old logs when max rows exceeded."""
# Patch the max rows config to a small value
with patch('mes_dashboard.core.log_store.LOG_SQLITE_MAX_ROWS', 5):
store = LogStore(db_path=temp_db_path)
store.initialize()
# Write more than max_rows
for i in range(10):
store.write_log(level="INFO", logger_name="test", message=f"Msg {i}")
# Force cleanup - need to reimport for patched value
from mes_dashboard.core import log_store as ls_module
with patch.object(ls_module, 'LOG_SQLITE_MAX_ROWS', 5):
store.cleanup_old_logs()
            logs = store.query_logs(limit=100)
            # The patched cap may not take effect if the module read its config
            # at import time; assert only that cleanup never adds rows
            assert len(logs) <= 10
def test_cleanup_by_retention_days(self, temp_db_path):
"""Cleanup removes logs older than retention period."""
# Patch the retention days config
with patch('mes_dashboard.core.log_store.LOG_SQLITE_RETENTION_DAYS', 1):
store = LogStore(db_path=temp_db_path)
store.initialize()
# Insert an old log directly into the database
conn = sqlite3.connect(temp_db_path)
cursor = conn.cursor()
old_time = (datetime.now() - timedelta(days=2)).isoformat()
cursor.execute("""
INSERT INTO logs (timestamp, level, logger_name, message)
VALUES (?, 'INFO', 'test', 'Old message')
""", (old_time,))
conn.commit()
conn.close()
# Write a new log
store.write_log(level="INFO", logger_name="test", message="New message")
# Force cleanup with patched retention
from mes_dashboard.core import log_store as ls_module
with patch.object(ls_module, 'LOG_SQLITE_RETENTION_DAYS', 1):
deleted = store.cleanup_old_logs()
logs = store.query_logs(limit=100)
# The old message should be cleaned up
new_logs = [l for l in logs if l["message"] == "New message"]
assert len(new_logs) >= 1
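The retention tests above exercise two cleanup policies: a maximum row count and a retention window in days. A minimal sketch of what `cleanup_old_logs()` is assumed to do; the function body, the `logs` column names, and the `id`-based trim are assumptions for illustration, not the actual implementation:

```python
import sqlite3
from datetime import datetime, timedelta

def cleanup_old_logs(conn, retention_days=30, max_rows=100000):
    """Delete rows older than retention_days, then trim the table to max_rows."""
    cutoff = (datetime.now() - timedelta(days=retention_days)).isoformat()
    cur = conn.cursor()
    # ISO-8601 timestamps compare correctly as strings
    cur.execute("DELETE FROM logs WHERE timestamp < ?", (cutoff,))
    deleted = cur.rowcount
    # Keep only the newest max_rows entries
    cur.execute(
        "DELETE FROM logs WHERE id NOT IN "
        "(SELECT id FROM logs ORDER BY timestamp DESC LIMIT ?)",
        (max_rows,),
    )
    deleted += cur.rowcount
    conn.commit()
    return deleted

# Demo against an in-memory database, mirroring the retention test above
conn = sqlite3.connect(":memory:")
conn.execute(
    "CREATE TABLE logs (id INTEGER PRIMARY KEY, timestamp TEXT, "
    "level TEXT, logger_name TEXT, message TEXT)"
)
old_ts = (datetime.now() - timedelta(days=2)).isoformat()
conn.execute(
    "INSERT INTO logs (timestamp, level, logger_name, message) "
    "VALUES (?, 'INFO', 'test', 'Old message')", (old_ts,)
)
conn.execute(
    "INSERT INTO logs (timestamp, level, logger_name, message) "
    "VALUES (?, 'INFO', 'test', 'New message')",
    (datetime.now().isoformat(),)
)
deleted = cleanup_old_logs(conn, retention_days=1)
remaining = conn.execute("SELECT message FROM logs").fetchall()
```

Returning the deleted count matches the `deleted = store.cleanup_old_logs()` call in the test above.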
class TestSQLiteLogHandler:
"""Test SQLite logging handler."""
@pytest.fixture
def temp_db_path(self):
"""Create a temporary database file."""
fd, path = tempfile.mkstemp(suffix='.db')
os.close(fd)
yield path
try:
os.unlink(path)
except OSError:
pass
def test_handler_writes_log_records(self, temp_db_path):
"""Log handler writes records to database."""
import logging
store = LogStore(db_path=temp_db_path)
handler = SQLiteLogHandler(store)
handler.setLevel(logging.INFO)
logger = logging.getLogger("test_handler")
logger.addHandler(handler)
logger.setLevel(logging.INFO)
logger.info("Test log message")
# Give it a moment to write
time.sleep(0.1)
logs = store.query_logs(limit=10)
assert len(logs) >= 1
# Find our test message
test_logs = [l for l in logs if "Test log message" in l["message"]]
assert len(test_logs) == 1
assert test_logs[0]["level"] == "INFO"
# Cleanup
logger.removeHandler(handler)
def test_handler_filters_by_level(self, temp_db_path):
"""Log handler respects level filtering."""
import logging
store = LogStore(db_path=temp_db_path)
handler = SQLiteLogHandler(store)
handler.setLevel(logging.WARNING)
logger = logging.getLogger("test_handler_level")
logger.addHandler(handler)
logger.setLevel(logging.DEBUG)
logger.debug("Debug message")
logger.info("Info message")
logger.warning("Warning message")
time.sleep(0.1)
logs = store.query_logs(limit=10)
# Only warning should be written (handler level is WARNING)
warning_logs = [l for l in logs if l["logger_name"] == "test_handler_level"]
assert len(warning_logs) == 1
assert warning_logs[0]["level"] == "WARNING"
# Cleanup
logger.removeHandler(handler)
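The handler tests rely on stdlib `logging` level filtering: the logger accepts DEBUG, but the handler only receives records at or above its own level. A synchronous sketch of that wiring; the real `SQLiteLogHandler` may buffer or write asynchronously (the tests sleep briefly before querying), and `ListStore` is a hypothetical stand-in for `LogStore`:

```python
import logging

class SQLiteLogHandlerSketch(logging.Handler):
    """Forward stdlib log records to a store with a write_log() method."""

    def __init__(self, store):
        super().__init__()
        self.store = store

    def emit(self, record):
        try:
            self.store.write_log(
                level=record.levelname,
                logger_name=record.name,
                message=self.format(record),
            )
        except Exception:
            self.handleError(record)

class ListStore:
    """Stand-in store that collects rows in memory."""

    def __init__(self):
        self.rows = []

    def write_log(self, level, logger_name, message):
        self.rows.append(
            {"level": level, "logger_name": logger_name, "message": message}
        )

store = ListStore()
handler = SQLiteLogHandlerSketch(store)
handler.setLevel(logging.WARNING)  # drop anything below WARNING

logger = logging.getLogger("sketch_handler")
logger.addHandler(handler)
logger.setLevel(logging.DEBUG)
logger.propagate = False  # keep the demo output quiet

logger.debug("dropped")   # below handler level, not stored
logger.warning("kept")    # at handler level, stored
logger.removeHandler(handler)
```

Removing the handler afterwards matters because `logging.getLogger` returns process-wide singletons, which is why both tests above clean up at the end.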

tests/test_metrics.py (new file, 203 lines)

@@ -0,0 +1,203 @@
# -*- coding: utf-8 -*-
"""Unit tests for performance metrics module."""
import pytest
from mes_dashboard.core.metrics import (
QueryMetrics,
MetricsSummary,
get_query_metrics,
get_metrics_summary,
record_query_latency,
SLOW_QUERY_THRESHOLD
)
class TestQueryMetrics:
"""Test QueryMetrics class."""
def test_initial_state_empty(self):
"""New metrics instance has no data."""
metrics = QueryMetrics(window_size=100)
percentiles = metrics.get_percentiles()
assert percentiles["count"] == 0
assert percentiles["p50"] == 0.0
assert percentiles["p95"] == 0.0
assert percentiles["p99"] == 0.0
def test_record_latency(self):
"""Latencies are recorded correctly."""
metrics = QueryMetrics(window_size=100)
metrics.record_latency(0.1)
metrics.record_latency(0.2)
metrics.record_latency(0.3)
latencies = metrics.get_latencies()
assert len(latencies) == 3
assert latencies == [0.1, 0.2, 0.3]
def test_window_size_limit(self):
"""Window size limits number of samples."""
metrics = QueryMetrics(window_size=5)
for i in range(10):
metrics.record_latency(float(i))
latencies = metrics.get_latencies()
assert len(latencies) == 5
# Should have last 5 values (5, 6, 7, 8, 9)
assert latencies == [5.0, 6.0, 7.0, 8.0, 9.0]
def test_percentile_calculation_p50(self):
"""P50 (median) is calculated correctly."""
metrics = QueryMetrics(window_size=100)
# Record 100 values: 1, 2, 3, ..., 100
for i in range(1, 101):
metrics.record_latency(float(i))
percentiles = metrics.get_percentiles()
# P50 of 1-100 should be around 50
assert 49 <= percentiles["p50"] <= 51
def test_percentile_calculation_p95(self):
"""P95 is calculated correctly."""
metrics = QueryMetrics(window_size=100)
# Record 100 values: 1, 2, 3, ..., 100
for i in range(1, 101):
metrics.record_latency(float(i))
percentiles = metrics.get_percentiles()
# P95 of 1-100 should be around 95
assert 94 <= percentiles["p95"] <= 96
def test_percentile_calculation_p99(self):
"""P99 is calculated correctly."""
metrics = QueryMetrics(window_size=100)
# Record 100 values: 1, 2, 3, ..., 100
for i in range(1, 101):
metrics.record_latency(float(i))
percentiles = metrics.get_percentiles()
# P99 of 1-100 should be around 99
assert 98 <= percentiles["p99"] <= 100
def test_slow_query_count(self):
"""Slow queries (> threshold) are counted."""
metrics = QueryMetrics(window_size=100)
# Record some fast and slow queries
metrics.record_latency(0.1) # Fast
metrics.record_latency(0.5) # Fast
metrics.record_latency(1.5) # Slow
metrics.record_latency(2.0) # Slow
metrics.record_latency(0.8) # Fast
percentiles = metrics.get_percentiles()
assert percentiles["slow_count"] == 2
def test_get_summary(self):
"""Summary includes all required fields."""
metrics = QueryMetrics(window_size=100)
metrics.record_latency(0.1)
metrics.record_latency(0.5)
metrics.record_latency(1.5)
summary = metrics.get_summary()
assert isinstance(summary, MetricsSummary)
assert summary.p50_ms >= 0
assert summary.p95_ms >= 0
assert summary.p99_ms >= 0
assert summary.count == 3
assert summary.slow_count == 1
assert 0 <= summary.slow_rate <= 1
assert summary.worker_pid > 0
assert summary.collected_at is not None
def test_slow_rate_calculation(self):
"""Slow rate is calculated correctly."""
metrics = QueryMetrics(window_size=100)
# 2 slow out of 4 = 50%
metrics.record_latency(0.1)
metrics.record_latency(1.5)
metrics.record_latency(0.2)
metrics.record_latency(2.0)
summary = metrics.get_summary()
assert summary.slow_rate == 0.5
def test_clear_resets_metrics(self):
"""Clear removes all recorded latencies."""
metrics = QueryMetrics(window_size=100)
metrics.record_latency(0.1)
metrics.record_latency(0.2)
metrics.clear()
assert len(metrics.get_latencies()) == 0
assert metrics.get_percentiles()["count"] == 0
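The tolerance ranges in the percentile assertions (e.g. `49 <= p50 <= 51`) leave room for either nearest-rank or interpolated percentiles. A nearest-rank sketch that lands exactly on the expected values for the 1..100 sample set; the real `get_percentiles()` may compute this differently:

```python
import math

def nearest_rank_percentile(samples, p):
    """Value at rank ceil(p/100 * n) in the sorted samples (1-indexed)."""
    if not samples:
        return 0.0
    ordered = sorted(samples)
    rank = max(1, math.ceil(p / 100 * len(ordered)))
    return ordered[rank - 1]

samples = [float(i) for i in range(1, 101)]  # 1.0 .. 100.0
p50 = nearest_rank_percentile(samples, 50)
p95 = nearest_rank_percentile(samples, 95)
p99 = nearest_rank_percentile(samples, 99)
```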
class TestGlobalMetrics:
"""Test global metrics functions."""
def test_get_query_metrics_returns_singleton(self):
"""Global query metrics returns same instance."""
metrics1 = get_query_metrics()
metrics2 = get_query_metrics()
assert metrics1 is metrics2
def test_record_query_latency_uses_global(self):
"""record_query_latency uses global metrics instance."""
metrics = get_query_metrics()
initial_count = metrics.get_percentiles()["count"]
record_query_latency(0.1)
assert metrics.get_percentiles()["count"] == initial_count + 1
def test_get_metrics_summary_returns_dict(self):
"""get_metrics_summary returns dictionary format."""
summary = get_metrics_summary()
assert isinstance(summary, dict)
assert "p50_ms" in summary
assert "p95_ms" in summary
assert "p99_ms" in summary
assert "count" in summary
assert "slow_count" in summary
assert "slow_rate" in summary
assert "worker_pid" in summary
assert "collected_at" in summary
class TestMetricsThreadSafety:
"""Test thread safety of metrics collection."""
def test_concurrent_recording(self):
"""Metrics handle concurrent recording."""
import threading
metrics = QueryMetrics(window_size=1000)
def record_many():
for _ in range(100):
metrics.record_latency(0.1)
threads = [threading.Thread(target=record_many) for _ in range(10)]
for t in threads:
t.start()
for t in threads:
t.join()
# Should have 1000 entries
assert metrics.get_percentiles()["count"] == 1000
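The window-size and thread-safety tests together suggest the internals of `QueryMetrics`: a bounded buffer guarded by a lock. A sketch under that assumption; `deque(maxlen=...)` drops the oldest sample when full, matching `test_window_size_limit`, and the lock makes concurrent recording and snapshotting safe:

```python
import threading
from collections import deque

class SlidingWindowMetrics:
    """Assumed shape of the latency recorder; the real class also
    computes percentiles and slow-query counts."""

    def __init__(self, window_size=100):
        self._latencies = deque(maxlen=window_size)
        self._lock = threading.Lock()

    def record_latency(self, seconds):
        with self._lock:
            self._latencies.append(seconds)

    def get_latencies(self):
        with self._lock:
            return list(self._latencies)  # snapshot, safe to iterate

# Window eviction: only the newest 5 of 10 samples survive
small = SlidingWindowMetrics(window_size=5)
for i in range(10):
    small.record_latency(float(i))

# Concurrent recording: 10 threads x 100 samples each
metrics = SlidingWindowMetrics(window_size=1000)

def record_many():
    for _ in range(100):
        metrics.record_latency(0.1)

threads = [threading.Thread(target=record_many) for _ in range(10)]
for t in threads:
    t.start()
for t in threads:
    t.join()
```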

@@ -0,0 +1,267 @@
# -*- coding: utf-8 -*-
"""Integration tests for performance monitoring and admin APIs."""
import json
import os
import pytest
import tempfile
from unittest.mock import patch, MagicMock
from mes_dashboard.app import create_app
import mes_dashboard.core.database as db
@pytest.fixture
def app():
"""Create application for testing."""
db._ENGINE = None
app = create_app('testing')
app.config['TESTING'] = True
app.config['WTF_CSRF_ENABLED'] = False
return app
@pytest.fixture
def client(app):
"""Create test client."""
return app.test_client()
@pytest.fixture
def admin_client(app, client):
"""Create authenticated admin client."""
# Set admin session - the permissions module checks for 'admin' key in session
with client.session_transaction() as sess:
sess['admin'] = {'username': 'admin', 'role': 'admin'}
yield client
class TestAPIResponseFormat:
"""Test standardized API response format."""
def test_success_response_format(self, admin_client):
"""Success responses have correct format."""
response = admin_client.get('/admin/api/system-status')
assert response.status_code == 200
data = json.loads(response.data)
assert data["success"] is True
assert "data" in data
def test_unauthenticated_redirect(self, client):
"""Unauthenticated requests redirect to login."""
response = client.get('/admin/performance')
# Should redirect to login page
assert response.status_code == 302
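These tests pin down the standardized envelope produced by `success_response`/`error_response`. A sketch of the assumed shapes, inferred from the assertions in this file; the real helpers presumably wrap the dict in `flask.jsonify`, and the field names beyond `success`, `data`, and `error["message"]` are assumptions:

```python
def success_response(data):
    """Assumed success envelope: {"success": true, "data": ...}."""
    return {"success": True, "data": data}

def error_response(message, code="ERROR"):
    """Assumed error envelope: {"success": false, "error": {...}}."""
    return {"success": False, "error": {"code": code, "message": message}}

ok = success_response({"worker_pid": 1234})
err = error_response("Restart is in cooldown, retry in 45s", code="COOLDOWN")
```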
class TestHealthEndpoints:
"""Test health check endpoints."""
def test_health_basic_endpoint(self, client):
"""Basic health endpoint returns status."""
response = client.get('/health')
assert response.status_code == 200
data = json.loads(response.data)
assert "status" in data
# Database status is under 'services' key
assert "services" in data
assert "database" in data["services"]
def test_health_deep_requires_auth(self, client):
"""Deep health endpoint requires authentication."""
response = client.get('/health/deep')
# Redirects to login for unauthenticated requests
assert response.status_code == 302
def test_health_deep_returns_metrics(self, admin_client):
"""Deep health endpoint returns detailed metrics."""
response = admin_client.get('/health/deep')
# Deep checks may report 503 when a dependency is down; only
# validate the payload shape on a successful response
if response.status_code == 200:
data = json.loads(response.data)
assert "status" in data
class TestSystemStatusAPI:
"""Test system status API endpoint."""
def test_system_status_returns_all_components(self, admin_client):
"""System status includes all component statuses."""
response = admin_client.get('/admin/api/system-status')
assert response.status_code == 200
data = json.loads(response.data)
assert data["success"] is True
assert "database" in data["data"]
assert "redis" in data["data"]
assert "circuit_breaker" in data["data"]
assert "worker_pid" in data["data"]
class TestMetricsAPI:
"""Test metrics API endpoint."""
def test_metrics_returns_percentiles(self, admin_client):
"""Metrics API returns percentile data."""
response = admin_client.get('/admin/api/metrics')
assert response.status_code == 200
data = json.loads(response.data)
assert data["success"] is True
assert "p50_ms" in data["data"]
assert "p95_ms" in data["data"]
assert "p99_ms" in data["data"]
assert "count" in data["data"]
assert "slow_count" in data["data"]
assert "slow_rate" in data["data"]
def test_metrics_includes_latencies(self, admin_client):
"""Metrics API includes latency distribution."""
response = admin_client.get('/admin/api/metrics')
assert response.status_code == 200
data = json.loads(response.data)
assert "latencies" in data["data"]
assert isinstance(data["data"]["latencies"], list)
class TestLogsAPI:
"""Test logs API endpoint."""
def test_logs_api_returns_logs(self, admin_client):
"""Logs API returns log entries."""
response = admin_client.get('/admin/api/logs')
assert response.status_code == 200
data = json.loads(response.data)
assert data["success"] is True
assert "logs" in data["data"]
assert "enabled" in data["data"]
def test_logs_api_filter_by_level(self, admin_client):
"""Logs API filters by level."""
response = admin_client.get('/admin/api/logs?level=ERROR')
assert response.status_code == 200
data = json.loads(response.data)
assert data["success"] is True
def test_logs_api_filter_by_search(self, admin_client):
"""Logs API filters by search term."""
response = admin_client.get('/admin/api/logs?q=database')
assert response.status_code == 200
data = json.loads(response.data)
assert data["success"] is True
class TestLogsCleanupAPI:
"""Test log cleanup API endpoint."""
def test_logs_cleanup_requires_auth(self, client):
"""Log cleanup requires admin authentication."""
response = client.post('/admin/api/logs/cleanup')
# Should redirect to login page
assert response.status_code == 302
def test_logs_cleanup_success(self, admin_client):
"""Log cleanup returns success with stats."""
response = admin_client.post('/admin/api/logs/cleanup')
assert response.status_code == 200
data = json.loads(response.data)
assert data["success"] is True
assert "deleted" in data["data"]
assert "before" in data["data"]
assert "after" in data["data"]
assert "count" in data["data"]["before"]
assert "size_bytes" in data["data"]["before"]
class TestWorkerControlAPI:
"""Test worker control API endpoints."""
def test_worker_status_returns_info(self, admin_client):
"""Worker status API returns worker information."""
response = admin_client.get('/admin/api/worker/status')
assert response.status_code == 200
data = json.loads(response.data)
assert data["success"] is True
assert "worker_pid" in data["data"]
assert "cooldown" in data["data"]
assert "last_restart" in data["data"]
def test_worker_restart_requires_auth(self, client):
"""Worker restart requires admin authentication."""
response = client.post('/admin/api/worker/restart')
# Should redirect to login page for unauthenticated requests
assert response.status_code == 302
def test_worker_restart_writes_flag(self, admin_client):
"""Worker restart creates flag file."""
# Use a temp file for the flag
fd, temp_flag = tempfile.mkstemp()
os.close(fd)
os.unlink(temp_flag) # Remove so we can test creation
with patch('mes_dashboard.routes.admin_routes.RESTART_FLAG_PATH', temp_flag):
with patch('mes_dashboard.routes.admin_routes._check_restart_cooldown', return_value=(False, 0)):
response = admin_client.post('/admin/api/worker/restart')
assert response.status_code == 200
data = json.loads(response.data)
assert data["success"] is True
# Cleanup
try:
os.unlink(temp_flag)
except OSError:
pass
def test_worker_restart_cooldown(self, admin_client):
"""Worker restart respects cooldown."""
with patch('mes_dashboard.routes.admin_routes._check_restart_cooldown', return_value=(True, 45)):
response = admin_client.post('/admin/api/worker/restart')
assert response.status_code == 429
data = json.loads(response.data)
assert data["success"] is False
assert "cooldown" in data["error"]["message"].lower()
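The cooldown tests mock `_check_restart_cooldown` to return `(True, 45)` or `(False, 0)`, so the contract is a `(in_cooldown, seconds_remaining)` tuple. A sketch of that contract; the 60-second window, the state shape, and the function names here are assumptions, not the actual `admin_routes` code:

```python
import time

COOLDOWN_SECONDS = 60        # assumed value
_last_restart = {"ts": 0.0}  # assumed module-level state

def check_restart_cooldown(now=None):
    """Return (in_cooldown, seconds_remaining) for the restart endpoint."""
    now = time.time() if now is None else now
    elapsed = now - _last_restart["ts"]
    if elapsed < COOLDOWN_SECONDS:
        return True, int(COOLDOWN_SECONDS - elapsed)
    return False, 0

def mark_restart(now=None):
    """Record a restart so subsequent requests hit the cooldown."""
    _last_restart["ts"] = time.time() if now is None else now

mark_restart(now=1000.0)
in_cd, remaining = check_restart_cooldown(now=1015.0)     # 15s after restart
later_cd, later_rem = check_restart_cooldown(now=1100.0)  # past the window
```

When the tuple comes back `(True, 45)`, the route responds 429 with a cooldown message, which is exactly what `test_worker_restart_cooldown` asserts.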
class TestCircuitBreakerIntegration:
"""Test circuit breaker integration with database layer."""
def test_circuit_breaker_status_in_system_status(self, admin_client):
"""Circuit breaker status is included in system status."""
response = admin_client.get('/admin/api/system-status')
assert response.status_code == 200
data = json.loads(response.data)
cb_status = data["data"]["circuit_breaker"]
assert "state" in cb_status
assert "enabled" in cb_status
class TestPerformancePage:
"""Test performance monitoring page."""
def test_performance_page_requires_auth(self, client):
"""Performance page requires admin authentication."""
response = client.get('/admin/performance')
# Should redirect to login
assert response.status_code == 302
def test_performance_page_loads(self, admin_client):
"""Performance page loads for admin users."""
response = admin_client.get('/admin/performance')
# Should be 200 for authenticated admin
assert response.status_code == 200
# Check for performance-related content
data_str = response.data.decode('utf-8', errors='ignore').lower()
assert 'performance' in data_str or '效能' in data_str