feat: add performance monitoring, circuit breaker protection, and worker restart control

New features:
- Performance monitoring dashboard (/admin/performance): system status, query latency, log viewer
- Circuit breaker: CLOSED/OPEN/HALF_OPEN states protecting the database
- Performance metrics collection: P50/P95/P99 latency tracking, slow-query statistics
- SQLite log storage: structured logs, retention policy, manual cleanup
- Worker watchdog: graceful restarts via a systemd service
- Unified API response format: success_response/error_response standardization
- Deep health check endpoint (/health/deep)
- 404/500 error page templates

Bug fixes:
- Fix deadlock in circuit_breaker.py get_status()
- Fix module import path error in health_routes.py

New dependency: psutil (worker status monitoring)
Test coverage: 59 new test cases

Co-Authored-By: Claude Opus 4.5 <noreply@anthropic.com>
This commit is contained in:
51
.env.example
@@ -90,3 +90,54 @@ RESOURCE_CACHE_ENABLED=true

# Resource cache sync interval in seconds (default: 14400 = 4 hours)
# The cache will check for updates at this interval using MAX(LASTCHANGEDATE)
RESOURCE_SYNC_INTERVAL=14400

# ============================================================
# Circuit Breaker Configuration
# ============================================================
# Enable/disable circuit breaker for database protection
CIRCUIT_BREAKER_ENABLED=false

# Minimum failures before circuit can open
CIRCUIT_BREAKER_FAILURE_THRESHOLD=5

# Failure rate threshold (0.0 - 1.0)
CIRCUIT_BREAKER_FAILURE_RATE=0.5

# Seconds to wait in OPEN state before trying HALF_OPEN
CIRCUIT_BREAKER_RECOVERY_TIMEOUT=30

# Sliding window size for counting successes/failures
CIRCUIT_BREAKER_WINDOW_SIZE=10

# ============================================================
# Performance Metrics Configuration
# ============================================================
# Slow query threshold in seconds (default: 1.0)
SLOW_QUERY_THRESHOLD=1.0

# ============================================================
# SQLite Log Store Configuration
# ============================================================
# Enable/disable SQLite log store for admin dashboard
LOG_STORE_ENABLED=true

# SQLite database path
LOG_SQLITE_PATH=logs/admin_logs.sqlite

# Log retention period in days (default: 7)
LOG_SQLITE_RETENTION_DAYS=7

# Maximum log rows (default: 100000)
LOG_SQLITE_MAX_ROWS=100000

# ============================================================
# Worker Watchdog Configuration
# ============================================================
# Path to restart flag file (watchdog monitors this file)
WATCHDOG_RESTART_FLAG=/tmp/mes_dashboard_restart.flag

# Path to restart state file (stores last restart info)
WATCHDOG_STATE_FILE=/tmp/mes_dashboard_restart_state.json

# Cooldown period between restart requests in seconds (default: 60)
WORKER_RESTART_COOLDOWN=60
105
README.md
@@ -18,6 +18,9 @@
| Page state management | ✅ Done |
| Redis cache system | ✅ Done |
| SQL query security architecture | ✅ Done |
| Performance monitoring dashboard | ✅ Done |
| Circuit breaker protection | ✅ Done |
| Worker restart control | ✅ Done |
| Deployment automation | ✅ Done |

---
@@ -143,6 +146,51 @@ ADMIN_EMAILS=admin@example.com # Administrator emails (comma-separated)

3. **Firewall**: open the service port (default 8080)

### Worker Watchdog Service Setup

The watchdog process lets administrators gracefully restart workers from the admin UI:

```bash
# 1. Copy the systemd service file
sudo cp deploy/mes-dashboard-watchdog.service /etc/systemd/system/

# 2. Edit the service file to adjust paths and the user
sudo nano /etc/systemd/system/mes-dashboard-watchdog.service

# 3. Reload systemd
sudo systemctl daemon-reload

# 4. Start the service and enable it at boot
sudo systemctl start mes-dashboard-watchdog
sudo systemctl enable mes-dashboard-watchdog

# 5. Check status
sudo systemctl status mes-dashboard-watchdog
```

### Rollback Steps

To roll back to a previous version:

```bash
# 1. Stop the services
./scripts/start_server.sh stop
sudo systemctl stop mes-dashboard-watchdog

# 2. Roll back the code
git checkout <previous-commit>

# 3. Reinstall dependencies (if changed)
pip install -r requirements.txt

# 4. Clean up new-version data (optional)
rm -f logs/admin_logs.sqlite  # remove the SQLite log store

# 5. Restart the services
./scripts/start_server.sh start
sudo systemctl start mes-dashboard-watchdog
```

---

## Features
@@ -201,6 +249,33 @@ ADMIN_EMAILS=admin@example.com # Administrator emails (comma-separated)
- Page state management (released/dev)
- Dev pages visible to administrators only

### Performance Monitoring Dashboard

Admin-only system monitoring UI (`/admin/performance`):

- **System status overview**: Database, Redis, Circuit Breaker, and Worker status
- **Query performance metrics**: P50/P95/P99 latency, slow-query statistics, latency distribution chart
- **System log viewer**: live log queries, level filtering, keyword search
- **Log management**: storage statistics, manual cleanup
- **Worker control**: graceful restart (via the Watchdog mechanism)
- Auto-refresh (30-second interval)

### Circuit Breaker Protection

The Circuit Breaker pattern shields the database from cascading failures:

- **CLOSED**: normal operation, requests pass through
- **OPEN**: too many failures, requests are rejected immediately
- **HALF_OPEN**: recovery probing, a limited number of requests is allowed

Configuration:
```bash
CIRCUIT_BREAKER_ENABLED=true
CIRCUIT_BREAKER_FAILURE_THRESHOLD=5
CIRCUIT_BREAKER_FAILURE_RATE=0.5
CIRCUIT_BREAKER_RECOVERY_TIMEOUT=30
```

---

## Architecture
@@ -248,7 +323,12 @@ DashBoard/
 │   │   ├── database.py          # Database connections
 │   │   ├── redis_client.py      # Redis client
 │   │   ├── cache.py             # Cache management
-│   │   └── cache_updater.py     # Automatic cache refresh
+│   │   ├── cache_updater.py     # Automatic cache refresh
+│   │   ├── circuit_breaker.py   # Circuit breaker
+│   │   ├── metrics.py           # Performance metrics collection
+│   │   ├── log_store.py         # SQLite log store
+│   │   ├── response.py          # API response format
 │   │   └── permissions.py       # Permission management
 │   ├── routes/                  # Routes
 │   │   ├── wip_routes.py        # WIP APIs
 │   │   ├── resource_routes.py   # Equipment status APIs
@@ -270,7 +350,10 @@ DashBoard/
 │   └── templates/               # HTML templates
 ├── scripts/                     # Scripts
 │   ├── deploy.sh                # Deployment script
-│   └── start_server.sh          # Service management script
+│   ├── start_server.sh          # Service management script
+│   └── worker_watchdog.py       # Worker watchdog process
+├── deploy/                      # Deployment config
+│   └── mes-dashboard-watchdog.service  # Watchdog systemd unit
 ├── tests/                       # Tests
 ├── data/                        # Data files
 ├── logs/                        # Logs
@@ -342,6 +425,20 @@ pytest tests/stress/ -v

## Changelog

### 2026-02-04

- Added performance monitoring dashboard (`/admin/performance`)
- Added circuit breaker protection (Circuit Breaker)
- Added performance metrics collection (P50/P95/P99 latency, slow-query statistics)
- Added SQLite log storage and management
- Added Worker Watchdog restart mechanism
- Added unified API response format (success_response/error_response)
- Added 404/500 error page templates
- Fixed deadlock in the circuit breaker's get_status()
- Fixed module import error in health_routes.py
- Added psutil dependency for worker status monitoring
- Added full test suite (59 performance-related tests)

### 2026-02-03

- Refactored the SQL query management architecture for better security and performance
@@ -393,5 +490,5 @@ pytest tests/stress/ -v

---

-**Document version**: 3.0
-**Last updated**: 2026-02-03
+**Document version**: 4.0
+**Last updated**: 2026-02-04
@@ -33,7 +33,7 @@
   {
     "route": "/resource",
     "name": "機台狀態",
-    "status": "dev"
+    "status": "released"
   },
   {
     "route": "/excel-query",
36
deploy/mes-dashboard-watchdog.service
Normal file
@@ -0,0 +1,36 @@
[Unit]
Description=MES Dashboard Worker Watchdog
Documentation=https://github.com/your-org/mes-dashboard
After=network.target mes-dashboard.service
Requires=mes-dashboard.service

[Service]
Type=simple
User=www-data
Group=www-data
WorkingDirectory=/opt/mes-dashboard
Environment="PYTHONPATH=/opt/mes-dashboard/src"
Environment="WATCHDOG_CHECK_INTERVAL=5"
Environment="WATCHDOG_RESTART_FLAG=/tmp/mes_dashboard_restart.flag"
Environment="WATCHDOG_PID_FILE=/tmp/mes_dashboard_gunicorn.pid"
Environment="WATCHDOG_STATE_FILE=/tmp/mes_dashboard_restart_state.json"

ExecStart=/opt/mes-dashboard/venv/bin/python scripts/worker_watchdog.py

# Restart policy
Restart=always
RestartSec=5

# Logging
StandardOutput=journal
StandardError=journal
SyslogIdentifier=mes-watchdog

# Security hardening
NoNewPrivileges=yes
PrivateTmp=yes
ProtectSystem=strict
ReadWritePaths=/tmp

[Install]
WantedBy=multi-user.target
@@ -0,0 +1,2 @@
schema: spec-driven
created: 2026-02-03
@@ -0,0 +1,359 @@
## Context

MES Dashboard is a Flask + Gunicorn + Redis reporting system; the SQL query security refactor is already complete. The system runs on:
- Gunicorn: 2 workers × 4 threads
- Oracle Database: connection pool (pool_size=5, max_overflow=10)
- Redis: equipment status cache (refreshed every 30 seconds)

**Current architectural limitations:**
- Error handling is scattered across services/routes with inconsistent formats
- No circuit breaking on database failures, which can exhaust the connection pool
- Performance is tracked only via warning logs, with no quantitative metrics
- Administrators must SSH into the server to handle worker problems

**Stakeholders:**
- End users: need friendly error messages and a stable service
- Administrators: need performance monitoring and emergency controls
- Operations: need observability and diagnostic tooling

## Goals / Non-Goals

**Goals:**
- Standardize the API response format to improve the frontend development experience
- Implement circuit breaking so database failures cannot cascade
- Collect performance metrics and visualize them in a report
- Let administrators restart the service safely from the frontend
- Add a local cache as a second-level fallback behind Redis

**Non-Goals:**
- No distributed tracing
- No integration with external monitoring systems (Prometheus/Grafana)
- No changes to existing API endpoint paths
- No auto-scaling
- No progress reporting for Excel batch queries (requires evaluating frontend/backend changes)
- No historical trend query optimization (precomputation/tiered caching)
## Decisions

### Decision 1: API response format

**Choice:** a standardized envelope format, backward compatible

```python
# Success response
{
    "success": True,
    "data": { ... },          # original response payload
    "meta": {                 # optional metadata
        "timestamp": "...",
        "request_id": "..."
    }
}

# Error response
{
    "success": False,
    "error": {
        "code": "DB_CONNECTION_FAILED",           # machine-readable code
        "message": "資料庫連線失敗,請稍後再試",     # user-friendly message
        "details": "ORA-12541: TNS:no listener"   # development mode only
    }
}
```

**Alternatives considered:**
- Redesign the API from scratch → breaks backward compatibility and forces a frontend rewrite
- HTTP status codes only → not enough error detail

**Rationale:** keep the original `data` structure and only wrap it with `success` and `error`, so the frontend can migrate incrementally.
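The envelope above can be produced by two small helpers. The names `success_response`/`error_response` follow the commit message, but the exact signatures below are a minimal sketch, not the project's actual `core/response.py`:

```python
# Hypothetical envelope helpers; signatures are assumptions for illustration.
from typing import Any, Optional


def success_response(data: Any, meta: Optional[dict] = None) -> dict:
    """Wrap a payload in the standard success envelope."""
    body = {"success": True, "data": data}
    if meta:
        body["meta"] = meta
    return body


def error_response(code: str, message: str, details: Optional[str] = None,
                   debug: bool = False) -> dict:
    """Wrap an error in the standard envelope; details only in debug mode."""
    error = {"code": code, "message": message}
    if debug and details:
        error["details"] = details  # e.g. the raw ORA-xxxxx text
    return {"success": False, "error": error}
```

Gating `details` on a debug flag keeps technical error text out of production responses while preserving it for development.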
---

### Decision 2: Circuit Breaker

**Choice:** a small custom circuit breaker based on a sliding-window counter

```
State transitions:
CLOSED → (failure rate > 50% and failures > 5) → OPEN
OPEN → (wait 30 seconds) → HALF_OPEN
HALF_OPEN → (probe succeeds) → CLOSED
HALF_OPEN → (probe fails) → OPEN
```

**Parameters:**
| Parameter | Value | Description |
|------|-----|------|
| failure_threshold | 5 | Minimum failures before tripping |
| failure_rate | 0.5 | Failure rate threshold (50%) |
| recovery_timeout | 30s | Wait time in the OPEN state |
| window_size | 10 | Sliding window size |

**Counting level:**
- The breaker SHALL be implemented at the `read_sql_df()` level (one breaker shared by all SQL queries)
- This ensures every database query (WIP, Equipment, Hold, etc.) contributes to a single failure rate
- While the breaker is OPEN, all queries fail fast, avoiding connection-pool exhaustion
- A single breaker keeps state management simple and matches the notion of "overall database health"

**Alternatives considered:**
- `pybreaker` package → adds an external dependency
- `tenacity` retry → handles retries only, no circuit breaking

**Rationale:** the requirements are simple; a custom implementation gives full control with no external dependency.
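The state machine above can be sketched in a few dozen lines. This is an illustrative implementation using the documented defaults, not the project's `circuit_breaker.py`; class and method names are assumptions:

```python
# Sliding-window circuit breaker sketch (hypothetical, for illustration).
import time
from collections import deque


class CircuitBreaker:
    def __init__(self, failure_threshold=5, failure_rate=0.5,
                 recovery_timeout=30.0, window_size=10):
        self.failure_threshold = failure_threshold
        self.failure_rate = failure_rate
        self.recovery_timeout = recovery_timeout
        self.window = deque(maxlen=window_size)  # True = success, False = failure
        self.state = "CLOSED"
        self.opened_at = 0.0

    def record(self, success: bool) -> None:
        if self.state == "HALF_OPEN":
            # A single probe decides: success closes, failure re-opens.
            if success:
                self.state = "CLOSED"
                self.window = deque(maxlen=self.window.maxlen)
            else:
                self.state, self.opened_at = "OPEN", time.monotonic()
            return
        self.window.append(success)
        failures = self.window.count(False)
        if failures >= self.failure_threshold and failures / len(self.window) >= self.failure_rate:
            self.state, self.opened_at = "OPEN", time.monotonic()

    def allow_request(self) -> bool:
        if self.state == "OPEN":
            if time.monotonic() - self.opened_at >= self.recovery_timeout:
                self.state = "HALF_OPEN"  # allow one probe request through
                return True
            return False
        return True
```

A wrapper around `read_sql_df()` would call `allow_request()` before each query and `record()` after it; the real implementation would also need a lock, since Gunicorn runs 4 threads per worker.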
---

### Decision 3: Performance metrics collection

**Choice:** in-memory sliding window + periodic aggregation

```python
from collections import deque


def percentile(sorted_values: list, p: float) -> float:
    """Nearest-rank percentile over pre-sorted values."""
    if not sorted_values:
        return 0.0
    idx = min(len(sorted_values) - 1, int(len(sorted_values) * p / 100))
    return sorted_values[idx]


class QueryMetrics:
    def __init__(self) -> None:
        # deque holding the latencies of the most recent 1000 queries
        self.latencies = deque(maxlen=1000)

    def get_percentiles(self) -> dict:
        sorted_latencies = sorted(self.latencies)
        return {
            "p50": percentile(sorted_latencies, 50),
            "p95": percentile(sorted_latencies, 95),
            "p99": percentile(sorted_latencies, 99),
            "count": len(sorted_latencies),
            "slow_count": sum(1 for l in sorted_latencies if l > 1.0),
        }
```

**Storage strategy:**
- Live metrics: in-memory sliding window
- No persistence of historical data (avoids complexity)
- Each worker keeps its own statistics (no cross-worker merging; the frontend shows the current worker PID)

**Alternatives considered:**
- Store in Redis → extra load on Redis
- Prometheus → requires additional infrastructure

**Rationale:** a simple scenario does not need a complex solution; in-memory statistics are enough.
---

### Decision 4: Local cache fallback

**Choice:** a TTL-aware LRU cache as a second-level cache behind Redis

```
Cache lookup flow:
1. Query Redis → return on hit
2. Redis failure/miss → query the local LRU cache
3. Local miss → query Oracle
4. Backfill: write to both Redis and the local cache
```

**Local cache parameters:**
| Parameter | Value | Description |
|------|-----|------|
| maxsize | 500 | Maximum entries (enough for several cache groups such as WIP, Equipment, Hold) |
| ttl | 60s | Expiry (shorter than Redis) |

**Alternatives considered:**
- Redis only → no fallback when Redis is down
- `cachetools.TTLCache` → acceptable, but a custom cache is more flexible

**Rationale:** the extra local layer still serves reasonably fresh data when Redis is down.
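A TTL-aware LRU cache with the parameters above is small to build on `OrderedDict`. This is a minimal sketch under the stated defaults, not the project's `local_cache.py`:

```python
# Hypothetical TTL-aware LRU cache sketch (maxsize=500, ttl=60s defaults).
import time
from collections import OrderedDict
from typing import Any, Optional


class TTLLRUCache:
    def __init__(self, maxsize: int = 500, ttl: float = 60.0):
        self.maxsize = maxsize
        self.ttl = ttl
        self._store = OrderedDict()  # key -> (expires_at, value)

    def get(self, key: str) -> Optional[Any]:
        item = self._store.get(key)
        if item is None:
            return None
        expires_at, value = item
        if time.monotonic() >= expires_at:
            del self._store[key]  # drop expired entry
            return None
        self._store.move_to_end(key)  # mark as most recently used
        return value

    def set(self, key: str, value: Any) -> None:
        self._store[key] = (time.monotonic() + self.ttl, value)
        self._store.move_to_end(key)
        while len(self._store) > self.maxsize:
            self._store.popitem(last=False)  # evict least recently used
```

`OrderedDict.move_to_end` keeps recency ordering in O(1), so eviction always removes the oldest untouched entry.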
---

### Decision 5: Worker restart mechanism

**Choice:** Option C - control file + watchdog script

```
Architecture:
┌─────────────┐    write file    ┌──────────────────┐
│  Flask App  │ ───────────────→ │ /tmp/restart.flag│
│ (Admin API) │                  └────────┬─────────┘
└─────────────┘                           │ monitor
                                          ▼
┌─────────────┐      SIGHUP      ┌──────────────────┐
│  Gunicorn   │ ←─────────────── │ worker_watchdog  │
│   Master    │                  │ (standalone py)  │
└─────────────┘                  └──────────────────┘
```

**Flow:**
1. The admin clicks the "Restart service" button
2. Flask writes `/tmp/mes_dashboard_restart.flag`
3. The watchdog script detects the file and sends SIGHUP to the Gunicorn master
4. Gunicorn gracefully reloads all workers
5. The watchdog deletes the flag file

**Safeguards:**
- The API is admin_required only
- 60-second cooldown (prevents repeated triggering)
- Operations are logged to file and database
- The frontend requires a second confirmation

**Alternatives considered:**
- Option A (send SIGHUP directly) → a Flask worker cannot signal the Gunicorn master
- Option B (systemctl) → needs sudo privileges, high security risk
- Option D (dedicated control service) → over-engineering

**Rationale:** safe, decoupled, and needs no special privileges; Flask only writes a file.
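One cycle of the watchdog flow above can be sketched as follows. File paths match the `.env` defaults; `check_once` is a hypothetical helper, not the actual `worker_watchdog.py`, and SIGHUP is only sent when a Gunicorn PID file exists:

```python
# Watchdog check-cycle sketch (hypothetical helper for illustration).
import os
import signal
from pathlib import Path

FLAG = Path("/tmp/mes_dashboard_restart.flag")
PID_FILE = Path("/tmp/mes_dashboard_gunicorn.pid")


def check_once(flag: Path = FLAG, pid_file: Path = PID_FILE) -> bool:
    """Return True if a restart was requested (and consume the flag)."""
    if not flag.exists():
        return False
    flag.unlink()  # consume the flag so the restart fires only once
    if pid_file.exists():
        master_pid = int(pid_file.read_text().strip())
        os.kill(master_pid, signal.SIGHUP)  # graceful reload of all workers
    return True
```

A long-running loop would call `check_once()` every `WATCHDOG_CHECK_INTERVAL` (5 s) with `time.sleep`, and systemd restarts the loop itself if it crashes.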
---

### Decision 6: Performance report page

**Choice:** integrate into the existing Admin pages, visualized with Chart.js

**Page layout:**
```
┌─────────────────────────────────────────────────────┐
│ Performance Dashboard                     [Refresh] │
├─────────────────────────────────────────────────────┤
│ System status                                       │
│ ┌─────────┐ ┌─────────┐ ┌─────────┐ ┌─────────┐     │
│ │ Database│ │  Redis  │ │ Circuit │ │ Workers │     │
│ │   ✅    │ │   ✅    │ │ CLOSED  │ │   2/2   │     │
│ └─────────┘ └─────────┘ └─────────┘ └─────────┘     │
├─────────────────────────────────────────────────────┤
│ Query performance (last 1000 queries)               │
│ P50: 0.12s  P95: 0.45s  P99: 1.23s  slow: 15        │
│ [========================================] latency  │
├─────────────────────────────────────────────────────┤
│ Cache status                                        │
│ Redis: hit rate 85%    Local: hit rate 92%          │
│ Last update: 2026-02-03 16:30:22                    │
├─────────────────────────────────────────────────────┤
│ Service control                                     │
│ [Restart Workers]  Cooldown: available              │
│ Last restart: 2026-02-03 10:15:00 by admin@example  │
└─────────────────────────────────────────────────────┘
```

**API endpoints:**
- `GET /admin/api/metrics` - performance metrics
- `GET /admin/api/system-status` - system status
- `GET /admin/api/logs` - recent log records
- `GET /admin/api/worker/status` - worker status (cooldown, last_restart, start time)
- `POST /admin/api/worker/restart` - trigger a restart

**Auto-refresh:**
- The frontend refreshes automatically every 30 seconds
- Users can disable auto-refresh manually

**Log viewer:**
- Shows the most recent N log records (default 200)
- Supports level filtering (INFO/WARNING/ERROR) and keyword search
- Columns include time, level, source, message, and operator (if any)
---

### Decision 7: Deep health check

**Choice:** add a `/health/deep` endpoint that includes latency metrics

```json
{
  "status": "healthy",
  "checks": {
    "database": {
      "status": "healthy",
      "latency_ms": 12,
      "pool_size": 5,
      "pool_checked_out": 2
    },
    "redis": {
      "status": "healthy",
      "latency_ms": 2
    },
    "circuit_breaker": {
      "database": "CLOSED",
      "failures": 0
    },
    "cache": {
      "redis_hit_rate": 0.85,
      "local_hit_rate": 0.92,
      "last_update": "2026-02-03T16:30:22Z"
    }
  },
  "metrics": {
    "query_p50_ms": 120,
    "query_p95_ms": 450,
    "query_p99_ms": 1230
  }
}
```
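A natural way to assemble the payload above is to degrade the top-level status whenever any individual check is unhealthy. The function name and aggregation rule below are illustrative assumptions, not the project's actual handler:

```python
# Hypothetical aggregation helper for the /health/deep payload.
def build_health_payload(checks: dict, metrics: dict) -> dict:
    """Overall status is healthy only if every check reporting a status is healthy."""
    statuses = [c.get("status") for c in checks.values() if "status" in c]
    overall = "healthy" if all(s == "healthy" for s in statuses) else "degraded"
    return {"status": overall, "checks": checks, "metrics": metrics}
```

Checks without a `status` key (such as the circuit-breaker block, which reports a state instead) are skipped by the aggregation.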
---

### Decision 8: Admin log storage and viewing

**Choice:** store structured logs in a local SQLite file, queried from the admin UI

**Strategy:**
- Existing file/STDERR logs are kept (for operations and debugging)
- A SQLite log store (a local file) is added for admin queries
- SQLite writes are append-only to avoid heavy locking

**Suggested defaults:**
- File location: `logs/admin_logs.sqlite`
- Retention: keep the last 7 days or at most 100,000 rows (tunable via environment variables)

**Rationale:**
- Administrators need queryable log records
- SQLite is in the standard library, so no new third-party Python dependency
- Keeping file logs preserves the existing operational workflow
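The store only needs the stdlib `sqlite3` module. Below is a minimal sketch of writing and filtering log rows; the table name, columns, and function names are assumptions, not the project's `log_store.py`:

```python
# Hypothetical SQLite log store sketch using only the standard library.
import sqlite3
from typing import Optional


def open_store(path: str = "logs/admin_logs.sqlite") -> sqlite3.Connection:
    conn = sqlite3.connect(path)
    conn.execute(
        "CREATE TABLE IF NOT EXISTS logs ("
        " id INTEGER PRIMARY KEY AUTOINCREMENT,"
        " ts TEXT DEFAULT CURRENT_TIMESTAMP,"
        " level TEXT, source TEXT, message TEXT)"
    )
    return conn


def write_log(conn: sqlite3.Connection, level: str, source: str, message: str) -> None:
    conn.execute("INSERT INTO logs (level, source, message) VALUES (?, ?, ?)",
                 (level, source, message))
    conn.commit()


def query_logs(conn: sqlite3.Connection, level: Optional[str] = None,
               q: Optional[str] = None, limit: int = 200):
    sql, params, clauses = "SELECT level, source, message FROM logs", [], []
    if level:
        clauses.append("level = ?")
        params.append(level)
    if q:
        clauses.append("message LIKE ?")
        params.append(f"%{q}%")
    if clauses:
        sql += " WHERE " + " AND ".join(clauses)
    sql += " ORDER BY id DESC LIMIT ?"
    params.append(limit)
    return conn.execute(sql, params).fetchall()
```

The `level`/`q`/`limit` filters mirror the query parameters planned for `GET /admin/api/logs`; retention cleanup would be a periodic `DELETE` on `ts` or row count.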
---

## Risks / Trade-offs

| Risk | Mitigation |
|------|---------|
| Brief service interruption during worker restart | Gunicorn graceful reload lets in-flight requests finish |
| Stale data in the local cache | TTL of 60 s, shorter than Redis; add version checks |
| Breaker false positives make the service unavailable | Sensible thresholds; fast probing in HALF_OPEN |
| Memory usage of metrics | deque capped at maxsize=1000; per worker |
| Watchdog script failure | Monitored by systemd |
| SQLite log file growth | Retention days/max rows with periodic cleanup |
| Accidental admin restarts | Confirmation dialog + 60-second cooldown |

## Migration Plan

**Phase 1: Infrastructure (no impact)**
1. Add `core/response.py` - API response format
2. Add `core/circuit_breaker.py` - circuit breaker
3. Add `core/metrics.py` - performance metrics
4. Add `core/local_cache.py` - local cache
5. Add `core/log_store.py` - SQLite log store
6. Add unit tests

**Phase 2: Integration (low risk)**
1. Integrate the circuit breaker into `database.py` (disabled by default)
2. Integrate the local cache fallback into `cache.py`
3. Environment-variable switch: `CIRCUIT_BREAKER_ENABLED=true`

**Phase 3: API migration (incremental)**
1. Newly built APIs use the new format directly
2. Existing APIs keep the original format (backward compatible)
3. The frontend migrates to the new format as needed

**Phase 4: Admin features**
1. Add the performance report page
2. Add the log viewer section and logs API
3. Deploy the watchdog script
4. Add the worker control API

**Rollback strategy:**
- Circuit breaker: disable via `CIRCUIT_BREAKER_ENABLED=false`
- Local cache: disable via `LOCAL_CACHE_ENABLED=false`
- Worker control: simply remove the watchdog script

## Open Questions

1. **Metric retention**: currently 1000 samples in memory; do we need to persist history?
2. ~~**Merging multi-worker metrics**~~: decided to stay per-worker; the frontend shows the current worker PID
3. **Watchdog deployment**: systemd service or cron-based monitoring?
4. ~~**API format migration timeline**~~: decided not to force migration; new APIs use the new format, old APIs keep theirs
@@ -0,0 +1,121 @@
## Why

After the SQL query security refactor, the core features run stably. Code review, however, identified the following gaps:

1. **Inconsistent user experience**: error messages expose technical details (e.g. ORA-xxxxx), API response formats vary, and batch queries give no progress feedback
2. **Stability risk**: with no circuit breaking, an Oracle outage still lets every request attempt a connection and cascade; Redis degradation has only a single fallback layer
3. **Insufficient performance monitoring**: slow queries only produce warning logs with no quantitative tracking; some historical queries still have optimization headroom
4. **Limited admin tooling**: administrators have no performance visualization and must SSH into the server when a worker misbehaves

## What Changes

### User experience (UX)
- Add a unified API response format and error-code system
- Split error messages: user-friendly text vs. technical logs

### Stability
- Add a Circuit Breaker to prevent cascading failures
- Extend health checks with a deep check and latency metrics
- Add a local LRU cache as a second-level fallback behind Redis

### Performance
- Collect query performance metrics (P50/P95/P99 latency)

### Admin
- Add a performance report page to Admin, visualizing system metrics
- Add an admin log viewer (filterable/searchable)
- Evaluate and implement a worker restart mechanism triggerable from the frontend

## Non-Goals (out of scope)

The following items are deferred for later evaluation:

- **Excel batch query progress reporting**: requires assessing the extent of frontend/backend changes; may need WebSocket or Server-Sent Events
- **Historical trend query optimization (precomputation/tiered caching)**: needs query metrics data first to confirm the bottleneck before planning an optimization strategy

## Capabilities

### New Capabilities

- `api-response-format`: unified API response format and error-code system with a consistent success/failure structure
- `circuit-breaker`: database connection circuit breaking, preventing cascading failures and resource exhaustion
- `query-metrics`: query performance metric collection and monitoring, tracking latency distribution and slow-query statistics
- `local-cache-fallback`: local in-memory LRU cache as a second-level fallback when Redis is unavailable
- `admin-performance-dashboard`: admin performance report page showing system health, metrics, breaker state, and recent log records
- `admin-worker-control`: admin service control providing a worker restart mechanism (security and feasibility to be evaluated)

### Modified Capabilities

- `health-check`: extended deep checks with latency metrics, cache freshness checks, and breaker state
## Impact

### Code changes

**New files:**
- `src/mes_dashboard/core/response.py` - API response format and error codes
- `src/mes_dashboard/core/circuit_breaker.py` - circuit breaker implementation
- `src/mes_dashboard/core/metrics.py` - performance metrics collection
- `src/mes_dashboard/core/local_cache.py` - local LRU cache
- `src/mes_dashboard/core/worker_control.py` - worker control module (implemented after evaluation)
- `src/mes_dashboard/templates/admin/performance.html` - performance report page
- `src/mes_dashboard/core/log_store.py` - SQLite log access and queries
- `scripts/worker_watchdog.py` - worker monitoring and restart service (optional, per the architecture decision)

**Modified files:**
- `src/mes_dashboard/core/database.py` - circuit breaker integration
- `src/mes_dashboard/core/cache.py` - local cache fallback integration
- `src/mes_dashboard/routes/health_routes.py` - extended health checks
- `src/mes_dashboard/routes/admin_routes.py` - performance report, log viewer, and service control routes
- `src/mes_dashboard/services/*.py` - unified error response format
- `src/mes_dashboard/routes/*.py` - unified API response format

### API impact

- All API responses will use the unified format while staying backward compatible (existing fields preserved)
- New `GET /health/deep` deep health check endpoint
- New `GET /admin/api/metrics` performance metrics endpoint
- New `GET /admin/performance` performance report page
- New `GET /admin/api/logs` recent log query API
- New `GET /admin/api/worker/status` worker status query API
- New `POST /admin/api/worker/restart` worker restart API (to be evaluated)

### Dependencies

- No new third-party Python dependencies; implemented with the standard library (including `sqlite3`)
- Circuit breaker: `threading` + `time`
- Local cache: `functools.lru_cache` or a custom TTL cache
- Metrics: sliding window via `collections.deque`
- Admin log viewer: SQLite storage (local file)
- Frontend charts: Chart.js (frontend dependency)
- Worker control: options under evaluation (see below)

### Worker restart mechanism evaluation

**Options:**

| Option | Description | Pros | Cons |
|------|------|------|------|
| A. Gunicorn SIGHUP | Graceful reload via signal | Simple, native support | Flask cannot signal its parent process directly |
| B. Supervisor/Systemd | Call systemctl via subprocess | Standard practice | Requires sudo configuration |
| C. Control file + watchdog | Write a marker file; an external script monitors and restarts | Safe, decoupled | Needs an extra monitoring script |
| D. Dedicated control service | Small HTTP service dedicated to restarts | Fully isolated | More architectural complexity |

**Recommendation:** evaluate the security and feasibility of each option during the design phase and pick the best fit.

### Security considerations

- The worker restart API must be restricted to administrators
- Operations should be logged (who, when, from which IP)
- Consider a confirmation step or cooldown to prevent accidental triggers
- Evaluate whether second-factor confirmation (e.g. re-entering the password) is needed

### Testing

- `tests/test_api_response.py` - API response format tests
- `tests/test_circuit_breaker.py` - breaker state transition tests
- `tests/test_query_metrics.py` - metric collection tests
- `tests/test_local_cache.py` - local cache tests
- `tests/test_admin_performance.py` - performance report API tests
- `tests/test_admin_logs.py` - admin log viewer API tests
- `tests/test_worker_control.py` - worker control tests (mocked)
@@ -0,0 +1,160 @@
## ADDED Requirements

### Requirement: Performance report page

The system SHALL provide an admin performance report page.

#### Scenario: Access the performance report page
- **WHEN** an administrator requests `GET /admin/performance`
- **THEN** the system SHALL render the performance report page
- **AND** the HTTP status code SHALL be 200

#### Scenario: Non-admin access denied
- **WHEN** a non-administrator requests `GET /admin/performance`
- **THEN** the system SHALL redirect to the login page
- **OR** the HTTP status code SHALL be 403

---

### Requirement: System status display

The performance report page SHALL show the health status of each system component.

#### Scenario: Show database status
- **WHEN** the performance report page loads
- **THEN** the page SHALL show the database connection status
- **AND** the status SHALL be ✅ (healthy) or ❌ (unhealthy)

#### Scenario: Show Redis status
- **WHEN** the performance report page loads
- **THEN** the page SHALL show the Redis connection status
- **AND** SHALL show "disabled" if Redis is turned off

#### Scenario: Show circuit breaker state
- **WHEN** the performance report page loads
- **THEN** the page SHALL show the circuit breaker state
- **AND** the state SHALL be CLOSED, OPEN, or HALF_OPEN

#### Scenario: Show worker count
- **WHEN** the performance report page loads
- **THEN** the page SHALL show the PID of the responding worker

---

### Requirement: Performance metrics display

The performance report page SHALL show query performance metrics.

#### Scenario: Show latency percentiles
- **WHEN** the performance report page loads
- **THEN** the page SHALL show the P50, P95, and P99 latency values
- **AND** the unit SHALL be milliseconds or seconds

#### Scenario: Show slow-query statistics
- **WHEN** the performance report page loads
- **THEN** the page SHALL show the slow-query count
- **AND** SHALL show the slow-query ratio

#### Scenario: Latency distribution visualization
- **WHEN** the performance report page loads
- **THEN** the page SHALL show a latency distribution chart
- **AND** the chart SHALL use Chart.js or a similar tool
---

### Requirement: Cache status display

The performance report page SHALL show cache status.

#### Scenario: Show Redis cache hit rate
- **WHEN** the performance report page loads
- **THEN** the page SHALL show the Redis cache hit rate

#### Scenario: Show local cache hit rate
- **WHEN** the performance report page loads
- **THEN** the page SHALL show the local cache hit rate

#### Scenario: Show last cache update time
- **WHEN** the performance report page loads
- **THEN** the page SHALL show the last cache update time

---

### Requirement: Auto-refresh

The performance report page SHALL support auto-refresh.

#### Scenario: Manual refresh
- **WHEN** the "Refresh" button is clicked
- **THEN** the page SHALL reload all metric data
- **AND** SHALL NOT do a full page reload (use AJAX)

#### Scenario: Auto-refresh interval
- **WHEN** auto-refresh is enabled
- **THEN** the page SHALL update metrics every 30 seconds
- **AND** the user SHALL be able to disable auto-refresh

---

### Requirement: System status API

The system SHALL provide an API for system status information.

#### Scenario: Get system status
- **WHEN** `GET /admin/api/system-status` is called
- **AND** the user is an administrator
- **THEN** the response SHALL include:
  - `database`: database status
  - `redis`: Redis status
  - `circuit_breaker`: circuit breaker state
  - `cache`: cache status
  - `worker_pid`: current worker PID

---

### Requirement: Log record viewing

The performance report page SHALL show recent log records.

#### Scenario: Show recent logs
- **WHEN** an administrator loads the performance report page
- **THEN** the page SHALL show the most recent N log records (default 200)
- **AND** each record SHALL show time, level, source, and message

#### Scenario: Filtering and search
- **WHEN** an administrator selects a level (INFO/WARNING/ERROR) or enters a keyword
- **THEN** the page SHALL update the displayed results immediately

---

### Requirement: Log API

The system SHALL provide an API for recent log records.

#### Scenario: Get log records
- **WHEN** `GET /admin/api/logs` is called
- **AND** the user is an administrator
- **THEN** the response SHALL include the log list
- **AND** the HTTP status code SHALL be 200

#### Scenario: Log API query parameters
- **WHEN** `GET /admin/api/logs` is called with query parameters
- **THEN** the API SHALL support:
  - `level`: level filter (INFO/WARNING/ERROR)
  - `q`: keyword search
  - `limit`: number of rows returned (default 200)
  - `since`: start time (ISO-8601)

#### Scenario: Non-admin access denied
- **WHEN** a non-administrator calls `GET /admin/api/logs`
- **THEN** the HTTP status code SHALL be 403

---

### Requirement: Log data storage

The system SHALL write logs to a local SQLite store for admin queries.

#### Scenario: Write to the SQLite log store
- **WHEN** the system produces a log record
- **THEN** the record SHALL be written to the local SQLite log store
- **AND** SHALL be queryable via `GET /admin/api/logs`
@@ -0,0 +1,116 @@
## ADDED Requirements

### Requirement: Worker restart trigger

The system SHALL let administrators trigger a worker restart from the frontend.

#### Scenario: Trigger a restart request
- **WHEN** an administrator calls `POST /admin/api/worker/restart`
- **AND** the user is an administrator
- **THEN** the system SHALL write the restart flag file
- **AND** the HTTP status code SHALL be 202 (Accepted)
- **AND** the response SHALL include `"message": "重啟請求已提交"`

#### Scenario: Non-admin operations denied
- **WHEN** a non-administrator calls `POST /admin/api/worker/restart`
- **THEN** the HTTP status code SHALL be 403
- **AND** the operation SHALL NOT be performed

---

### Requirement: Restart cooldown

The system SHALL implement a restart cooldown to prevent frequent restarts.

#### Scenario: Rejected within the cooldown
- **WHEN** an administrator calls `POST /admin/api/worker/restart`
- **AND** less than 60 seconds have passed since the last restart
- **THEN** the HTTP status code SHALL be 429 (Too Many Requests)
- **AND** the response SHALL include the remaining cooldown seconds

#### Scenario: Allowed after the cooldown
- **WHEN** an administrator calls `POST /admin/api/worker/restart`
- **AND** more than 60 seconds have passed since the last restart
- **THEN** the restart request SHALL be accepted

#### Scenario: Query cooldown status
- **WHEN** `GET /admin/api/worker/status` is called
- **THEN** the response SHALL include:
  - `cooldown_remaining`: remaining cooldown seconds (0 means available)
  - `last_restart`: time of the last restart
  - `last_restart_by`: operator of the last restart
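The cooldown scenarios above reduce to a small pure function: given the last restart time, decide between 202 (accept) and 429 (reject). The function names and return shape below are illustrative assumptions:

```python
# Hypothetical cooldown-check sketch behind the restart scenarios.
import time
from typing import Optional

COOLDOWN_SECONDS = 60


def cooldown_remaining(last_restart: Optional[float],
                       now: Optional[float] = None) -> int:
    """Seconds left before another restart is allowed (0 = available)."""
    if last_restart is None:
        return 0
    now = time.time() if now is None else now
    return max(0, int(COOLDOWN_SECONDS - (now - last_restart)))


def restart_status_code(last_restart: Optional[float],
                        now: Optional[float] = None) -> int:
    """202 if a restart may proceed, 429 while the cooldown is active."""
    return 429 if cooldown_remaining(last_restart, now) > 0 else 202
```

Making `now` injectable keeps the logic deterministic and testable without sleeping.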
---

### Requirement: Restart operation logging

The system SHALL log every restart operation.

#### Scenario: Record operation details
- **WHEN** an administrator triggers a restart
- **THEN** the system SHALL record:
  - operator (email/username)
  - operation time
  - source IP address
  - operation result

#### Scenario: Log storage location
- **WHEN** a restart operation is logged
- **THEN** the entry SHALL be written to the system log (INFO level)
- **AND** SHALL be written to a dedicated operations log file

---

### Requirement: Frontend confirmation

The performance report page SHALL implement a restart confirmation step.

#### Scenario: Show a confirmation dialog
- **WHEN** an administrator clicks the "Restart Workers" button
- **THEN** the system SHALL show a confirmation dialog
- **AND** the dialog SHALL warn that the operation briefly affects the service

#### Scenario: Execute after confirmation
- **WHEN** the administrator clicks "OK" in the confirmation dialog
- **THEN** the system SHALL send the restart request

#### Scenario: Cancel the operation
- **WHEN** the administrator clicks "Cancel" in the confirmation dialog
- **THEN** the system SHALL NOT send the restart request

---

### Requirement: Watchdog script

The system SHALL provide a watchdog script that monitors the restart flag file.

#### Scenario: Monitor the flag file
- **WHEN** the watchdog script is running
- **THEN** the script SHALL check `/tmp/mes_dashboard_restart.flag` every 5 seconds

#### Scenario: Flag file detected
- **WHEN** the watchdog detects the flag file
- **THEN** the script SHALL send SIGHUP to the Gunicorn master
- **AND** SHALL delete the flag file
- **AND** SHALL log the restart event

#### Scenario: Gunicorn graceful reload
- **WHEN** the Gunicorn master receives SIGHUP
- **THEN** Gunicorn SHALL perform a graceful reload
- **AND** workers SHALL only terminate after in-flight requests complete
- **AND** new workers SHALL start and take over

---

### Requirement: Restart status reporting

The system SHALL provide a way to confirm whether a restart completed.

#### Scenario: Query worker start time
- **WHEN** `GET /admin/api/worker/status` is called
- **THEN** the response SHALL include the current worker's start time

#### Scenario: Frontend shows the restart result
- **WHEN** a restart request has been submitted
- **THEN** the frontend SHALL poll the worker status
- **AND** SHALL show "Restarting..." until a new worker is detected
@@ -0,0 +1,140 @@
## ADDED Requirements

### Requirement: Unified success response format

The system SHALL use a unified envelope format for all successful API responses.

#### Scenario: Success response includes the success flag
- **WHEN** an API request succeeds
- **THEN** the response body SHALL include `"success": true`
- **AND** the original response payload SHALL be placed in the `data` field

#### Scenario: Success response example
- **WHEN** `GET /api/dashboard/kpi` succeeds
- **THEN** the response format SHALL be:
```json
{
  "success": true,
  "data": {
    "total": 100,
    "prd": 50,
    ...
  }
}
```

---

### Requirement: Unified error response format

The system SHALL use a unified error format for all failed API responses.

#### Scenario: Error response includes an error code
- **WHEN** an API request fails
- **THEN** the response body SHALL include `"success": false`
- **AND** SHALL include an `error` object
- **AND** `error.code` SHALL be a machine-readable error code
- **AND** `error.message` SHALL be a user-friendly message in Chinese

#### Scenario: Error response example
- **WHEN** the database connection fails
- **THEN** the response format SHALL be:
```json
{
  "success": false,
  "error": {
    "code": "DB_CONNECTION_FAILED",
    "message": "資料庫連線失敗,請稍後再試"
  }
}
```

#### Scenario: Development mode shows error details
- **WHEN** `FLASK_ENV=development`
- **AND** an API request fails
- **THEN** the `error` object SHALL additionally include a `details` field
- **AND** `details` SHALL contain the technical error message (e.g. ORA-xxxxx)

#### Scenario: Production mode hides error details
- **WHEN** `FLASK_ENV=production`
- **AND** an API request fails
- **THEN** the `error` object SHALL NOT include a `details` field
---
### Requirement: Standard Error Codes

The system SHALL define and use standardized error codes.

#### Scenario: Database connection error code

- **WHEN** the database connection fails
- **THEN** the error code SHALL be `DB_CONNECTION_FAILED`

#### Scenario: Database query timeout

- **WHEN** a database query exceeds 55 seconds
- **THEN** the error code SHALL be `DB_QUERY_TIMEOUT`

#### Scenario: Circuit breaker open

- **WHEN** the Circuit Breaker is in the OPEN state
- **THEN** the error code SHALL be `SERVICE_UNAVAILABLE`

#### Scenario: Validation failure

- **WHEN** request parameter validation fails
- **THEN** the error code SHALL be `VALIDATION_ERROR`

#### Scenario: Unauthorized

- **WHEN** the user is not logged in or the session has expired
- **THEN** the error code SHALL be `UNAUTHORIZED`

#### Scenario: Forbidden

- **WHEN** the user lacks sufficient permissions
- **THEN** the error code SHALL be `FORBIDDEN`

---

### Requirement: Global Error Handling

The system SHALL handle all uncaught errors uniformly at the middleware level.

#### Scenario: Authentication middleware rejection

- **WHEN** the authentication middleware (`@app.before_request` in `create_app`) rejects a request
- **THEN** the response format SHALL conform to the unified error format
- **AND** the error code SHALL be `UNAUTHORIZED` or `FORBIDDEN`

#### Scenario: Unhandled exception

- **WHEN** a route handler raises an uncaught exception
- **THEN** the Flask error handler SHALL intercept the exception
- **AND** the response format SHALL conform to the unified error format
- **AND** the error code SHALL be `INTERNAL_ERROR`

#### Scenario: 404 error handling

- **WHEN** the requested route does not exist
- **THEN** the response format SHALL conform to the unified error format
- **AND** the error code SHALL be `NOT_FOUND`

#### Scenario: Global error handler registration

- **WHEN** the Flask application initializes
- **THEN** `create_app()` SHALL register the following error handlers:
  - `@app.errorhandler(401)` - handles unauthorized access
  - `@app.errorhandler(403)` - handles forbidden access
  - `@app.errorhandler(404)` - handles resource not found
  - `@app.errorhandler(500)` - handles server errors
  - `@app.errorhandler(Exception)` - handles all uncaught exceptions

---
### Requirement: Backward Compatibility

The system SHALL maintain backward compatibility with existing APIs.

#### Scenario: Existing fields preserved

- **WHEN** the new response format is used
- **THEN** the fields returned by existing APIs SHALL be fully preserved in `data`
- **AND** field names and types SHALL remain unchanged

#### Scenario: HTTP status codes maintained

- **WHEN** an API response uses the new format
- **THEN** HTTP status codes SHALL retain their original semantics
- **AND** success SHALL return 2xx
- **AND** client errors SHALL return 4xx
- **AND** server errors SHALL return 5xx
## ADDED Requirements

### Requirement: Circuit Breaker State Management

The system SHALL implement the Circuit Breaker pattern to manage the breaker state of database connections.

#### Scenario: Initial state is CLOSED

- **WHEN** the system starts
- **THEN** the circuit breaker state SHALL be `CLOSED`
- **AND** all database requests SHALL execute normally

#### Scenario: Accumulated failures trigger OPEN

- **WHEN** the circuit breaker is in the `CLOSED` state
- **AND** the failure count within the sliding window is >= 5
- **AND** the failure rate is >= 50%
- **THEN** the circuit breaker state SHALL transition to `OPEN`

#### Scenario: OPEN state rejects requests

- **WHEN** the circuit breaker is in the `OPEN` state
- **AND** a database request is received
- **THEN** the system SHALL return an error immediately
- **AND** the error code SHALL be `SERVICE_UNAVAILABLE`
- **AND** the system SHALL NOT attempt to connect to the database

#### Scenario: OPEN transitions to HALF_OPEN

- **WHEN** the circuit breaker is in the `OPEN` state
- **AND** 30 seconds (recovery_timeout) have elapsed
- **THEN** the circuit breaker state SHALL transition to `HALF_OPEN`

#### Scenario: HALF_OPEN probe succeeds

- **WHEN** the circuit breaker is in the `HALF_OPEN` state
- **AND** the probe request succeeds
- **THEN** the circuit breaker state SHALL transition to `CLOSED`
- **AND** the failure count SHALL be reset to 0

#### Scenario: HALF_OPEN probe fails

- **WHEN** the circuit breaker is in the `HALF_OPEN` state
- **AND** the probe request fails
- **THEN** the circuit breaker state SHALL transition to `OPEN`
- **AND** the recovery_timeout SHALL restart

---
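The state machine above condenses into a small class. This is a sketch under the spec defaults (threshold 5, rate 0.5, 30 s recovery, window 10), not the actual `core/circuit_breaker.py`; the clock is injectable so the OPEN-to-HALF_OPEN timeout can be exercised without sleeping:

```python
import time
from collections import deque


class CircuitBreaker:
    """Sketch of the CLOSED/OPEN/HALF_OPEN state machine described above."""

    def __init__(self, failure_threshold=5, failure_rate=0.5,
                 recovery_timeout=30, window_size=10, clock=time.monotonic):
        self.failure_threshold = failure_threshold
        self.failure_rate = failure_rate
        self.recovery_timeout = recovery_timeout
        self.window = deque(maxlen=window_size)  # True = success, False = failure
        self.clock = clock
        self.state = "CLOSED"
        self.opened_at = None

    def allow_request(self):
        if self.state == "OPEN":
            # After recovery_timeout, let a single probe through (HALF_OPEN).
            if self.clock() - self.opened_at >= self.recovery_timeout:
                self.state = "HALF_OPEN"
                return True
            return False
        return True  # CLOSED or HALF_OPEN

    def record_success(self):
        if self.state == "HALF_OPEN":      # probe succeeded: close and reset
            self.state = "CLOSED"
            self.window.clear()
            return
        self.window.append(True)

    def record_failure(self):
        if self.state == "HALF_OPEN":      # probe failed: re-open, restart timer
            self._open()
            return
        self.window.append(False)
        failures = self.window.count(False)
        if (failures >= self.failure_threshold
                and failures / len(self.window) >= self.failure_rate):
            self._open()

    def _open(self):
        self.state = "OPEN"
        self.opened_at = self.clock()
```

Five consecutive failures trip it (count 5 >= 5, rate 100% >= 50%); advancing a fake clock past 30 seconds turns the next `allow_request()` into a HALF_OPEN probe.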
### Requirement: Circuit Breaker Parameter Configuration

The system SHALL support configuring circuit breaker parameters via environment variables.

#### Scenario: Default parameter values

- **WHEN** no circuit-breaker environment variables are set
- **THEN** failure_threshold SHALL be 5
- **AND** failure_rate SHALL be 0.5 (50%)
- **AND** recovery_timeout SHALL be 30 seconds
- **AND** window_size SHALL be 10

#### Scenario: Environment variable override

- **WHEN** `CIRCUIT_BREAKER_FAILURE_THRESHOLD=10` is set
- **THEN** failure_threshold SHALL be 10

#### Scenario: Disabling the circuit breaker

- **WHEN** `CIRCUIT_BREAKER_ENABLED=false` is set
- **THEN** the circuit breaker SHALL be disabled
- **AND** all requests SHALL execute directly, bypassing circuit breaker checks

---
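Resolving those settings might look like the following sketch; the variable names match `.env.example`, while the parsing helper itself is illustrative:

```python
import os


def load_breaker_config(env=os.environ):
    """Resolve circuit breaker settings, falling back to the spec defaults."""
    return {
        "enabled": env.get("CIRCUIT_BREAKER_ENABLED", "false").lower() == "true",
        "failure_threshold": int(env.get("CIRCUIT_BREAKER_FAILURE_THRESHOLD", 5)),
        "failure_rate": float(env.get("CIRCUIT_BREAKER_FAILURE_RATE", 0.5)),
        "recovery_timeout": int(env.get("CIRCUIT_BREAKER_RECOVERY_TIMEOUT", 30)),
        "window_size": int(env.get("CIRCUIT_BREAKER_WINDOW_SIZE", 10)),
    }


# Passing an empty mapping shows the defaults from the scenarios above.
print(load_breaker_config(env={}))
```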
### Requirement: Circuit Breaker Status Query

The system SHALL provide an API to query the circuit breaker status.

#### Scenario: Query circuit breaker status

- **WHEN** the internal method `get_circuit_breaker_status()` is called
- **THEN** the return value SHALL include:
  - `state`: current state (CLOSED/OPEN/HALF_OPEN)
  - `failure_count`: current failure count
  - `success_count`: current success count
  - `last_failure_time`: time of the last failure

---

### Requirement: Circuit Breaker Event Logging

The system SHALL log circuit breaker state change events.

#### Scenario: Log state transitions

- **WHEN** the circuit breaker state changes
- **THEN** the system SHALL write a WARNING-level log entry
- **AND** the log SHALL include: previous state, new state, and trigger reason

#### Scenario: Log OPEN events

- **WHEN** the circuit breaker transitions to the `OPEN` state
- **THEN** the log message SHALL include the failure count and failure rate
## ADDED Requirements

### Requirement: Deep Health Check Endpoint

The system SHALL provide a `/health/deep` endpoint that reports detailed system health information.

#### Scenario: Deep check response format

- **WHEN** `GET /health/deep` is called
- **THEN** the response body SHALL include:

```json
{
  "status": "healthy",
  "checks": {
    "database": { ... },
    "redis": { ... },
    "circuit_breaker": { ... },
    "cache": { ... }
  },
  "metrics": { ... }
}
```

#### Scenario: Deep check requires authentication

- **WHEN** `GET /health/deep` is called
- **AND** the user is not logged in
- **THEN** the HTTP status code SHALL be 401

#### Scenario: Deep check admin access

- **WHEN** `GET /health/deep` is called
- **AND** the user is an administrator
- **THEN** the HTTP status code SHALL be 200
- **AND** the response SHALL include full details

#### Scenario: Deep check forbidden for non-admins

- **WHEN** `GET /health/deep` is called
- **AND** the user is logged in but not an administrator
- **THEN** the HTTP status code SHALL be 403
- **AND** the response SHALL conform to the unified error format

#### Scenario: Deep check implementation

- **WHEN** implementing the `/health/deep` endpoint
- **THEN** the route SHALL use the `@admin_required` decorator
- **AND** the decorator SHALL handle authentication and authorization

---
### Requirement: Latency Metric Checks

The deep health check SHALL include latency metrics for each service.

#### Scenario: Database latency

- **WHEN** the deep health check runs
- **THEN** `checks.database` SHALL include `latency_ms`
- **AND** `latency_ms` SHALL be the actual time taken by a ping query

#### Scenario: Redis latency

- **WHEN** the deep health check runs
- **AND** Redis is enabled
- **THEN** `checks.redis` SHALL include `latency_ms`
- **AND** `latency_ms` SHALL be the actual time taken by a PING

#### Scenario: Latency warning threshold

- **WHEN** database latency exceeds 100ms
- **THEN** `checks.database.status` SHALL be `"slow"`
- **AND** the `warnings` array SHALL include a latency warning message

---
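Measuring `latency_ms` amounts to timing the ping call and classifying it against the 100 ms threshold. A sketch with the ping injected as a callable; the `"slow"` label follows the scenario above, while the `"ok"` label for the healthy case is an assumption:

```python
import time

SLOW_THRESHOLD_MS = 100


def timed_check(ping):
    """Run a health ping and report its latency in milliseconds."""
    start = time.perf_counter()
    ping()  # e.g. a trivial SELECT for Oracle, or PING for Redis
    latency_ms = (time.perf_counter() - start) * 1000
    status = "slow" if latency_ms > SLOW_THRESHOLD_MS else "ok"
    return {"status": status, "latency_ms": round(latency_ms, 1)}


# A 150 ms sleep stands in for a slow database ping.
print(timed_check(lambda: time.sleep(0.15)))
```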
### Requirement: Connection Pool Status Check

The deep health check SHALL include the database connection pool status.

#### Scenario: Connection pool information

- **WHEN** the deep health check runs
- **THEN** `checks.database` SHALL include:
  - `pool_size`: configured connection pool size
  - `pool_checked_out`: connections currently checked out
  - `pool_overflow`: current overflow connections

#### Scenario: Pool exhaustion warning

- **WHEN** `pool_checked_out` + `pool_overflow` >= `pool_size` + `max_overflow`
- **THEN** the `warnings` array SHALL include a pool exhaustion warning

---
### Requirement: Circuit Breaker Status Check

The deep health check SHALL include the circuit breaker status.

#### Scenario: Circuit breaker healthy

- **WHEN** the deep health check runs
- **AND** the circuit breaker state is CLOSED
- **THEN** `checks.circuit_breaker` SHALL include:

```json
{
  "database": "CLOSED",
  "failures": 0
}
```

#### Scenario: Circuit breaker OPEN

- **WHEN** the deep health check runs
- **AND** the circuit breaker state is OPEN
- **THEN** `checks.circuit_breaker.database` SHALL be `"OPEN"`
- **AND** the overall `status` SHALL be `"degraded"` or `"unhealthy"`
- **AND** `warnings` SHALL include a circuit breaker warning

---

### Requirement: Cache Freshness Check

The deep health check SHALL check the freshness of cached data.

#### Scenario: Cache is fresh

- **WHEN** the deep health check runs
- **AND** the cache was updated within the last 2 minutes
- **THEN** `checks.cache.status` SHALL be `"fresh"`

#### Scenario: Cache is stale

- **WHEN** the deep health check runs
- **AND** the cache was last updated more than 2 minutes ago
- **THEN** `checks.cache.status` SHALL be `"stale"`
- **AND** `warnings` SHALL include a stale-cache warning

#### Scenario: Local cache status

- **WHEN** the deep health check runs
- **AND** the local cache is enabled
- **THEN** `checks.cache` SHALL include:
  - `local_enabled`: true
  - `local_hit_rate`: local cache hit rate
  - `local_size`: number of local cache entries

---

### Requirement: Performance Metrics Summary

The deep health check SHALL include a performance metrics summary.

#### Scenario: Includes latency percentiles

- **WHEN** the deep health check runs
- **THEN** `metrics` SHALL include:
  - `query_p50_ms`: P50 query latency
  - `query_p95_ms`: P95 query latency
  - `query_p99_ms`: P99 query latency
  - `slow_query_count`: number of slow queries

#### Scenario: Empty metrics

- **WHEN** the deep health check runs
- **AND** no queries have been recorded yet
- **THEN** each field in `metrics` SHALL be 0
## ADDED Requirements

### Requirement: Local LRU Cache

The system SHALL implement a local LRU cache as a second-level fallback for Redis.

#### Scenario: Cache lookup order

- **WHEN** cached data is requested
- **THEN** the system SHALL query Redis first
- **AND** on a Redis miss or failure, SHALL query the local cache
- **AND** on a local cache miss, SHALL query Oracle

#### Scenario: Cache backfill

- **WHEN** data is fetched from Oracle
- **THEN** the system SHALL write it to both Redis and the local cache
- **AND** the local cache TTL SHALL be 60 seconds

---
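The Redis-then-local-then-Oracle order can be expressed as a single lookup function. A sketch with all three tiers injected as callables; `None` standing for a miss is an assumption (the real cache layer may distinguish misses differently):

```python
def cached_get(key, redis_get, local_cache, oracle_load):
    """Look up `key` via Redis, then the local LRU cache, then Oracle."""
    # 1. Try Redis first; treat a connection error the same as a miss.
    try:
        value = redis_get(key)
        if value is not None:
            return value
    except ConnectionError:
        pass
    # 2. Fall back to the in-process cache.
    if key in local_cache:
        return local_cache[key]
    # 3. Load from Oracle and backfill the local tier.
    value = oracle_load(key)
    local_cache[key] = value  # real code would also set a 60 s TTL and re-fill Redis
    return value


calls = []
def oracle(key):
    calls.append(key)
    return {"rows": 3}

local = {}
def redis_down(key):
    raise ConnectionError("redis unavailable")

print(cached_get("wip:FAB1", redis_down, local, oracle))  # loaded from Oracle
print(cached_get("wip:FAB1", redis_down, local, oracle))  # served by local cache
print(calls)  # Oracle was only hit once
```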
### Requirement: Local Cache Capacity Limit

The system SHALL limit the memory usage of the local cache.

#### Scenario: Default maximum entry count

- **WHEN** the `LOCAL_CACHE_MAXSIZE` environment variable is not set
- **THEN** the local cache's default maximum entry count SHALL be 500
- **AND** this value is sufficient to hold multiple cache groups such as WIP status, equipment lists, and Hold Summary

#### Scenario: Maximum entry count enforcement

- **WHEN** the local cache reaches its maxsize limit
- **AND** a new entry is added
- **THEN** the system SHALL evict the least recently used (LRU) entry
- **AND** the entry count SHALL remain <= maxsize

#### Scenario: Environment variable configuration

- **WHEN** `LOCAL_CACHE_MAXSIZE=1000` is set
- **THEN** the local cache's maximum entry count SHALL be 1000

#### Scenario: Cache key design

- **WHEN** a cache entry is created
- **THEN** the cache key SHALL include a feature prefix (e.g. `wip:`, `equipment:`, `hold:`)
- **AND** caches for different features SHALL share a single LRU pool
- **AND** the LRU policy SHALL automatically evict the least recently used entries regardless of feature type

---
### Requirement: Local Cache TTL

The system SHALL set an expiration time on local cache entries.

#### Scenario: Default TTL

- **WHEN** no TTL environment variable is set
- **THEN** the local cache TTL SHALL be 60 seconds

#### Scenario: Expired entry handling

- **WHEN** the local cache is queried
- **AND** an entry has expired (exceeded its TTL)
- **THEN** the system SHALL treat it as a miss
- **AND** SHALL remove the expired entry

#### Scenario: TTL shorter than Redis

- **WHEN** the Redis cache TTL is N seconds
- **THEN** the local cache TTL SHALL be < N
- **AND** this ensures local cache data never lags far behind Redis

---
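A shared LRU pool with per-entry TTL and hit statistics fits in a few dozen lines of stdlib Python. A sketch of the generic interface the tasks list calls for (defaults maxsize=500, ttl=60 s; the real `core/cache.py` may differ):

```python
import time
from collections import OrderedDict


class LocalLRUCache:
    """LRU cache with per-entry TTL and hit-rate statistics."""

    def __init__(self, maxsize=500, ttl=60, clock=time.monotonic):
        self.maxsize = maxsize
        self.ttl = ttl
        self.clock = clock
        self._data = OrderedDict()   # key -> (expires_at, value)
        self.hits = 0
        self.misses = 0

    def get(self, key, default=None):
        entry = self._data.get(key)
        if entry is None or entry[0] < self.clock():
            if entry is not None:        # expired: treat as a miss and evict
                del self._data[key]
            self.misses += 1
            return default
        self._data.move_to_end(key)      # mark as most recently used
        self.hits += 1
        return entry[1]

    def set(self, key, value):
        self._data[key] = (self.clock() + self.ttl, value)
        self._data.move_to_end(key)
        while len(self._data) > self.maxsize:
            self._data.popitem(last=False)  # evict the least recently used

    def stats(self):
        total = self.hits + self.misses
        return {"hits": self.hits, "misses": self.misses,
                "hit_rate": self.hits / total if total else 0.0,
                "size": len(self._data)}
```

Keys from different features (`wip:`, `equipment:`, `hold:`) all land in the one `OrderedDict`, so eviction is purely recency-based, as the cache key design scenario requires.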
### Requirement: Cache Disable Control

The system SHALL support disabling the local cache via an environment variable.

#### Scenario: Disable the local cache

- **WHEN** `LOCAL_CACHE_ENABLED=false` is set
- **THEN** the local cache SHALL be disabled
- **AND** cache lookups SHALL query Redis or Oracle directly

#### Scenario: Enabled by default

- **WHEN** `LOCAL_CACHE_ENABLED` is not set
- **THEN** the local cache SHALL be enabled by default

---

### Requirement: Cache Hit Rate Statistics

The system SHALL track the local cache hit rate.

#### Scenario: Record hits and misses

- **WHEN** the local cache is queried
- **THEN** the system SHALL record whether the lookup was a hit
- **AND** the statistics SHALL be kept in memory

#### Scenario: Query the hit rate

- **WHEN** `get_local_cache_stats()` is called
- **THEN** the return value SHALL include:
  - `hits`: hit count
  - `misses`: miss count
  - `hit_rate`: hit rate (hits / (hits + misses))
  - `size`: current entry count
## ADDED Requirements

### Requirement: Query Latency Collection

The system SHALL collect latency timings for all database queries.

#### Scenario: Record query latency

- **WHEN** a database query executes
- **THEN** the system SHALL record the query duration (in milliseconds)
- **AND** the records SHALL be stored in an in-memory sliding window

#### Scenario: Sliding window size limit

- **WHEN** more than 1000 queries have been recorded
- **THEN** the system SHALL automatically drop the oldest records
- **AND** the window SHALL hold at most 1000 records

---
### Requirement: Latency Percentile Calculation

The system SHALL compute percentile statistics for query latency.

#### Scenario: Compute P50/P95/P99

- **WHEN** `get_query_metrics()` is called
- **THEN** the return value SHALL include:
  - `p50_ms`: 50th percentile latency
  - `p95_ms`: 95th percentile latency
  - `p99_ms`: 99th percentile latency
  - `count`: sample count
  - `slow_count`: number of slow queries (latency > 1 second)

#### Scenario: Empty data handling

- **WHEN** no queries have been recorded yet
- **THEN** all percentiles SHALL return 0
- **AND** `count` SHALL be 0

---
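One common way to compute these is nearest-rank percentiles over the sorted window. A sketch; the `deque(maxlen=1000)` matches the tasks list, while the nearest-rank choice is an assumption:

```python
from collections import deque

SLOW_MS = 1000  # queries slower than 1 second count as slow


class QueryMetrics:
    """Sketch of the per-worker latency window and percentile summary."""

    def __init__(self, window=1000):
        self.samples = deque(maxlen=window)  # oldest entries drop off automatically

    def record(self, latency_ms):
        self.samples.append(latency_ms)

    @staticmethod
    def _percentile(sorted_samples, p):
        # Nearest-rank percentile: index ceil(p * n / 100) - 1.
        n = len(sorted_samples)
        idx = max(0, -(-p * n // 100) - 1)
        return sorted_samples[int(idx)]

    def get_query_metrics(self):
        if not self.samples:
            return {"p50_ms": 0, "p95_ms": 0, "p99_ms": 0,
                    "count": 0, "slow_count": 0, "slow_rate": 0.0}
        s = sorted(self.samples)
        slow = sum(1 for x in s if x > SLOW_MS)
        return {"p50_ms": self._percentile(s, 50),
                "p95_ms": self._percentile(s, 95),
                "p99_ms": self._percentile(s, 99),
                "count": len(s),
                "slow_count": slow,
                "slow_rate": slow / len(s)}
```

The empty-window branch returns all zeros, matching the empty data scenario above.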
### Requirement: Slow Query Statistics

The system SHALL track the number and proportion of slow queries.

#### Scenario: Slow query definition

- **WHEN** a query's latency exceeds 1000 milliseconds
- **THEN** the query SHALL be marked as slow

#### Scenario: Slow query rate calculation

- **WHEN** `get_query_metrics()` is called
- **THEN** the return value SHALL include `slow_rate`
- **AND** `slow_rate` SHALL equal `slow_count / count`

---

### Requirement: Metrics API Endpoint

The system SHALL provide an API endpoint for querying performance metrics.

#### Scenario: Fetch performance metrics

- **WHEN** `GET /admin/api/metrics` is called
- **AND** the user is an administrator
- **THEN** the response SHALL include query latency statistics
- **AND** the HTTP status code SHALL be 200

#### Scenario: Forbidden for non-admins

- **WHEN** `GET /admin/api/metrics` is called
- **AND** the user is not an administrator
- **THEN** the HTTP status code SHALL be 403

---

### Requirement: Per-Worker Statistics

The system SHALL collect metrics independently in each Gunicorn worker.

#### Scenario: Independent per-worker statistics

- **WHEN** the system runs multiple workers
- **THEN** each worker SHALL maintain its own metrics data
- **AND** percentile calculations SHALL be based on that worker's samples

#### Scenario: API returns current worker metrics

- **WHEN** `GET /admin/api/metrics` is called
- **THEN** the response SHALL indicate which worker (PID) the data came from
- **AND** the response SHALL include a `worker_pid` field

#### Scenario: Known limitation - metric jitter

- **GIVEN** the system runs N workers (N > 1)
- **WHEN** `GET /admin/api/metrics` is called repeatedly
- **THEN** values may differ between calls because the load balancer routes them to different workers
- **AND** this is a known and accepted limitation

---
### Requirement: Shared Counters (Optional Optimization)

When Redis is available, the system MAY use Redis to share key counter metrics.

#### Scenario: Redis-shared total counts

- **WHEN** Redis is enabled
- **THEN** the `total_queries` counter MAY use the Redis INCR command
- **AND** the `slow_queries` counter MAY use the Redis INCR command
- **AND** these counters SHALL be shared across all workers

#### Scenario: Degradation when Redis is unavailable

- **WHEN** Redis is unavailable or disabled
- **THEN** the system SHALL fall back to purely per-worker statistics
- **AND** functionality SHALL continue to work normally

#### Scenario: Percentiles remain per-worker

- **WHEN** Redis-shared counters are in use
- **THEN** latency percentiles (P50/P95/P99) SHALL remain per-worker
- **AND** percentile calculation requires the full sample set, which is unsuitable for cross-worker sharing
## 1. Infrastructure Modules

- [x] 1.1 Create `core/response.py` - API response format utilities
  - Implement the `success_response(data, meta=None)` function
  - Implement the `error_response(code, message, details=None)` function
  - Define standard error code constants (DB_CONNECTION_FAILED, DB_QUERY_TIMEOUT, SERVICE_UNAVAILABLE, VALIDATION_ERROR, UNAUTHORIZED, FORBIDDEN, NOT_FOUND, INTERNAL_ERROR)

- [x] 1.2 Create `core/circuit_breaker.py` - circuit breaker module
  - Implement a CircuitBreaker class supporting CLOSED/OPEN/HALF_OPEN states
  - Implement sliding-window counting (window_size=10)
  - Support environment variable configuration (CIRCUIT_BREAKER_ENABLED, CIRCUIT_BREAKER_FAILURE_THRESHOLD, etc.)
  - Implement `get_circuit_breaker_status()` for status queries
  - Implement state transition logging

- [x] 1.3 Create `core/metrics.py` - performance metrics collection module
  - Implement a QueryMetrics class using deque(maxlen=1000)
  - Implement P50/P95/P99 percentile calculation
  - Track slow query count (> 1 second)
  - Support worker PID identification

- [x] 1.4 Extend `core/cache.py` - local cache fallback (partially complete)
  - [x] Implement a ProcessLevelCache class (TTL-aware)
  - [x] Implement process-level caching for the WIP DataFrame
  - [x] Implement process-level caching for the Resource Cache
  - [x] Implement process-level caching for the Equipment Status Cache
  - [ ] Implement a generic LRU cache interface (maxsize=500, ttl=60s)
  - [ ] Track hit rate statistics (hits, misses, hit_rate)
  - [ ] Support the LOCAL_CACHE_ENABLED and LOCAL_CACHE_MAXSIZE environment variables

- [x] 1.5 Create `core/log_store.py` - SQLite log store
  - Create a logs table (timestamp, level, source, message, request_id, user, ip)
  - Support query parameters: level, q, limit, since
  - Implement a retention policy (default 7 days or 100,000 rows)
  - Support the LOG_SQLITE_PATH, LOG_SQLITE_RETENTION_DAYS, and LOG_SQLITE_MAX_ROWS environment variables

- [x] 1.6 Integrate the application logging handler
  - Register the SQLite log handler in `app.py`
  - Keep the existing file/STDERR logs

- [x] 1.7 Write unit tests for the infrastructure modules
  - Circuit Breaker state transition tests
  - Metrics percentile calculation tests
  - Local Cache LRU and TTL tests
  - SQLite log store read/write and retention policy tests

## 2. Database Layer Integration

- [x] 2.1 Integrate the circuit breaker into `core/database.py`
  - Add a circuit breaker check in `read_sql_df()`
  - Return an error immediately when in the OPEN state
  - Record successes/failures in the circuit breaker
  - Disabled by default; enabled via CIRCUIT_BREAKER_ENABLED=true

- [x] 2.2 Integrate performance metrics into `core/database.py`
  - Record the latency of every query
  - Record slow queries (> 1 second) in the metrics

- [x] 2.3 Integrate the local cache fallback into the cache layer (covered by the ProcessLevelCache from 1.4)
  - Query the local LRU cache when Redis fails
  - Backfill Oracle query results into both Redis and the local cache

## 3. API Response Format Migration

- [x] 3.1 Register global error handlers in `app.py`
  - @app.errorhandler(401) - UNAUTHORIZED
  - @app.errorhandler(403) - FORBIDDEN
  - @app.errorhandler(404) - NOT_FOUND
  - @app.errorhandler(500) - INTERNAL_ERROR
  - @app.errorhandler(Exception) - uncaught exceptions

- [x] 3.2 Update the authentication middleware response format
  - Rejection responses from `@app.before_request` now use the unified format

- [x] 3.3 Gradually migrate each Blueprint to the new response format
  - New APIs use success_response/error_response directly
  - Existing APIs remain backward compatible

## 4. Health Check Endpoints

- [x] 4.1 Implement the `/health/deep` deep health check endpoint
  - Requires @admin_required authentication
  - Includes database latency and connection pool status
  - Includes Redis latency (if enabled)
  - Includes circuit breaker status
  - Includes cache freshness and hit rate
  - Includes a performance metrics summary (P50/P95/P99)

- [x] 4.2 Implement latency warning thresholds
  - Database latency > 100ms is marked "slow"
  - Cache updated > 2 minutes ago is marked "stale"
  - Overall status is "degraded" when the circuit breaker is OPEN

## 5. Performance Report Page

- [x] 5.1 Create the `GET /admin/performance` page route
  - Requires administrator privileges
  - Uses the existing admin template style

- [x] 5.2 Implement the `GET /admin/api/system-status` API
  - Returns database, redis, circuit_breaker, cache, worker_pid

- [x] 5.3 Implement the `GET /admin/api/metrics` API
  - Returns P50/P95/P99, slow_count, slow_rate, worker_pid

- [x] 5.4 Build the performance report frontend page
  - System status cards (Database, Redis, Circuit Breaker, Worker)
  - Latency percentile display
  - Slow query statistics
  - Latency distribution chart (Chart.js)
  - Cache hit rate display
  - Manual/automatic refresh (30-second interval)

- [x] 5.5 Implement the `GET /admin/api/logs` API
  - Reads from the SQLite log store
  - Supports the level/q/limit/since query parameters

- [x] 5.6 Add a log viewer section to the performance report page
  - Shows the most recent 200 entries
  - Supports level filtering and keyword search
  - Updates in sync with auto-refresh

## 6. Worker Restart Control

- [x] 6.1 Create the `scripts/worker_watchdog.py` script
  - Checks `/tmp/mes_dashboard_restart.flag` every 5 seconds
  - Sends SIGHUP to the Gunicorn master when the flag is detected
  - Deletes the flag file
  - Logs restart events

- [x] 6.2 Implement the `POST /admin/api/worker/restart` API
  - Requires @admin_required
  - Writes the restart flag file
  - 60-second cooldown (429 Too Many Requests)
  - Logs the operator, time, and IP

- [x] 6.3 Implement the `GET /admin/api/worker/status` API
  - Returns cooldown_remaining, last_restart, last_restart_by
  - Returns the current worker's start time

- [x] 6.4 Add a worker control section to the performance report page
  - Restart button + confirmation dialog
  - Cooldown status display
  - Last restart information
  - Polling while a restart is in progress

## 7. Deployment and Testing

- [x] 7.1 Create the systemd service file
  - `mes-dashboard-watchdog.service` for the watchdog script

- [x] 7.2 Write integration tests
  - Circuit breaker trip and recovery tests
  - API response format validation tests
  - Health check endpoint tests
  - Admin log API tests
  - Worker control API tests

- [x] 7.3 Update deployment documentation
  - Document the new environment variables
  - Watchdog service configuration
  - Rollback steps
openspec/specs/admin-performance-dashboard/spec.md
## ADDED Requirements

### Requirement: Performance Report Page

The system SHALL provide an administrator performance report page.

#### Scenario: Access the performance report page

- **WHEN** an administrator accesses `GET /admin/performance`
- **THEN** the system SHALL display the performance report page
- **AND** the HTTP status code SHALL be 200

#### Scenario: Forbidden for non-admins

- **WHEN** a non-administrator accesses `GET /admin/performance`
- **THEN** the system SHALL redirect to the login page
- **OR** the HTTP status code SHALL be 403

---

### Requirement: System Status Display

The performance report page SHALL display the health status of each system component.

#### Scenario: Display database status

- **WHEN** the performance report page loads
- **THEN** the page SHALL display the database connection status
- **AND** the status SHALL be ✅ (healthy) or ❌ (unhealthy)

#### Scenario: Display Redis status

- **WHEN** the performance report page loads
- **THEN** the page SHALL display the Redis connection status
- **AND** if Redis is disabled, it SHALL display "disabled"

#### Scenario: Display circuit breaker status

- **WHEN** the performance report page loads
- **THEN** the page SHALL display the circuit breaker status
- **AND** the status SHALL be CLOSED, OPEN, or HALF_OPEN

#### Scenario: Display worker PID

- **WHEN** the performance report page loads
- **THEN** the page SHALL display the PID of the worker currently responding

---

### Requirement: Performance Metrics Display

The performance report page SHALL display query performance metrics.

#### Scenario: Display latency percentiles

- **WHEN** the performance report page loads
- **THEN** the page SHALL display P50, P95, and P99 latency values
- **AND** the unit SHALL be milliseconds or seconds

#### Scenario: Display slow query statistics

- **WHEN** the performance report page loads
- **THEN** the page SHALL display the slow query count
- **AND** SHALL display the slow query rate

#### Scenario: Latency distribution visualization

- **WHEN** the performance report page loads
- **THEN** the page SHALL display a latency distribution chart
- **AND** the chart SHALL use Chart.js or a similar tool

---

### Requirement: Cache Status Display

The performance report page SHALL display cache operation status.

#### Scenario: Display the Redis cache hit rate

- **WHEN** the performance report page loads
- **THEN** the page SHALL display the Redis cache hit rate

#### Scenario: Display the local cache hit rate

- **WHEN** the performance report page loads
- **THEN** the page SHALL display the local cache hit rate

#### Scenario: Display the last cache update time

- **WHEN** the performance report page loads
- **THEN** the page SHALL display the last cache update time

---

### Requirement: Auto-Refresh

The performance report page SHALL support automatic refresh.

#### Scenario: Manual refresh

- **WHEN** the "Refresh" button is clicked
- **THEN** the page SHALL reload all metric data
- **AND** SHALL NOT reload the whole page (uses AJAX)

#### Scenario: Auto-refresh interval

- **WHEN** auto-refresh is enabled
- **THEN** the page SHALL update its metrics every 30 seconds
- **AND** the user SHALL be able to disable auto-refresh

---

### Requirement: System Status API

The system SHALL provide an API for fetching system status information.

#### Scenario: Fetch system status

- **WHEN** `GET /admin/api/system-status` is called
- **AND** the user is an administrator
- **THEN** the response SHALL include:
  - `database`: database status
  - `redis`: Redis status
  - `circuit_breaker`: circuit breaker status
  - `cache`: cache status
  - `worker_pid`: current worker PID

---

### Requirement: Log Record Viewer

The performance report page SHALL display recent log records.

#### Scenario: Display recent logs

- **WHEN** an administrator loads the performance report page
- **THEN** the page SHALL display the most recent N log entries (default 200)
- **AND** each entry SHALL show the timestamp, level, source, and message

#### Scenario: Filtering and searching

- **WHEN** an administrator selects a level (INFO/WARNING/ERROR) or enters a keyword
- **THEN** the page SHALL update the displayed results immediately

---

### Requirement: Log API

The system SHALL provide an API for fetching recent log records.

#### Scenario: Fetch log records

- **WHEN** `GET /admin/api/logs` is called
- **AND** the user is an administrator
- **THEN** the response SHALL include a list of log entries
- **AND** the HTTP status code SHALL be 200

#### Scenario: Log API query parameters

- **WHEN** `GET /admin/api/logs` is called with query parameters
- **THEN** the API SHALL support:
  - `level`: level filter (INFO/WARNING/ERROR)
  - `q`: keyword search
  - `limit`: number of entries to return (default 200)
  - `since`: start time (ISO-8601)

#### Scenario: Forbidden for non-admins

- **WHEN** a non-administrator calls `GET /admin/api/logs`
- **THEN** the HTTP status code SHALL be 403

---
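Serving those parameters from the SQLite log store is a single parameterized SELECT. A sketch against an in-memory database; the column set follows the tasks list, though the real `core/log_store.py` schema may differ:

```python
import sqlite3


def query_logs(conn, level=None, q=None, limit=200, since=None):
    """Fetch recent logs, newest first, honoring level/q/limit/since."""
    sql = "SELECT ts, level, source, message FROM logs WHERE 1=1"
    params = []
    if level:
        sql += " AND level = ?"
        params.append(level)
    if q:
        sql += " AND message LIKE ?"
        params.append(f"%{q}%")
    if since:
        sql += " AND ts >= ?"
        params.append(since)
    sql += " ORDER BY ts DESC LIMIT ?"
    params.append(limit)
    return conn.execute(sql, params).fetchall()


conn = sqlite3.connect(":memory:")
conn.execute("CREATE TABLE logs (ts TEXT, level TEXT, source TEXT, message TEXT)")
conn.executemany(
    "INSERT INTO logs VALUES (?, ?, ?, ?)",
    [("2025-01-01T10:00:00", "INFO", "app", "started"),
     ("2025-01-01T10:05:00", "ERROR", "db", "ORA-12170 timeout"),
     ("2025-01-01T10:06:00", "WARNING", "breaker", "state CLOSED -> OPEN")])
print(query_logs(conn, level="ERROR"))
print(query_logs(conn, q="OPEN", since="2025-01-01T10:01:00"))
```

ISO-8601 timestamps sort lexicographically, which is why the `since` comparison can be a plain string `>=`.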
### Requirement: Log Data Storage

The system SHALL write logs to a local SQLite database for administrator queries.

#### Scenario: Write to the SQLite log store

- **WHEN** the system produces a log record
- **THEN** the log SHALL be written to the local SQLite log store
- **AND** SHALL be queryable via `GET /admin/api/logs`
openspec/specs/admin-worker-control/spec.md
## ADDED Requirements

### Requirement: Worker Restart Trigger

The system SHALL allow administrators to trigger a worker restart from the frontend.

#### Scenario: Trigger a restart request

- **WHEN** an administrator calls `POST /admin/api/worker/restart`
- **AND** the user is an administrator
- **THEN** the system SHALL write the restart flag file
- **AND** the HTTP status code SHALL be 202 (Accepted)
- **AND** the response SHALL include `"message": "重啟請求已提交"`

#### Scenario: Forbidden for non-admins

- **WHEN** a non-administrator calls `POST /admin/api/worker/restart`
- **THEN** the HTTP status code SHALL be 403
- **AND** the operation SHALL NOT be performed

---

### Requirement: Restart Cooldown

The system SHALL implement a restart cooldown to prevent frequent restarts.

#### Scenario: Rejected during cooldown

- **WHEN** an administrator calls `POST /admin/api/worker/restart`
- **AND** less than 60 seconds have passed since the last restart
- **THEN** the HTTP status code SHALL be 429 (Too Many Requests)
- **AND** the response SHALL include the remaining cooldown seconds

#### Scenario: Allowed after cooldown

- **WHEN** an administrator calls `POST /admin/api/worker/restart`
- **AND** more than 60 seconds have passed since the last restart
- **THEN** the restart request SHALL be accepted

#### Scenario: Query cooldown status

- **WHEN** `GET /admin/api/worker/status` is called
- **THEN** the response SHALL include:
  - `cooldown_remaining`: remaining cooldown seconds (0 means available)
  - `last_restart`: time of the last restart
  - `last_restart_by`: operator of the last restart

---
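The cooldown reduces to comparing against the last restart timestamp. A pure-logic sketch with an injectable clock; the route would map the returned tuple onto a 202 or 429 response:

```python
import time

COOLDOWN_SECONDS = 60


class RestartGate:
    """Sketch of the 60-second restart cooldown check."""

    def __init__(self, clock=time.monotonic):
        self.clock = clock
        self.last_restart = None

    def cooldown_remaining(self):
        if self.last_restart is None:
            return 0
        remaining = COOLDOWN_SECONDS - (self.clock() - self.last_restart)
        return max(0, int(remaining))

    def try_restart(self):
        """Return (accepted, cooldown_remaining) for a restart request."""
        remaining = self.cooldown_remaining()
        if remaining > 0:
            return False, remaining      # route answers 429 with the countdown
        self.last_restart = self.clock()
        return True, 0                   # route writes the flag file, answers 202
```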
### Requirement: Restart Operation Logging

The system SHALL log all restart operations.

#### Scenario: Record operation details

- **WHEN** an administrator triggers a restart
- **THEN** the system SHALL record:
  - the operator (email/username)
  - the operation time
  - the source IP address
  - the operation result

#### Scenario: Log storage location

- **WHEN** a restart operation is logged
- **THEN** the entry SHALL be written to the system log (INFO level)
- **AND** SHALL be written to a dedicated operations log file

---

### Requirement: Frontend Confirmation

The performance report page SHALL implement a restart confirmation mechanism.

#### Scenario: Show a confirmation dialog

- **WHEN** an administrator clicks the "Restart Workers" button
- **THEN** the system SHALL show a confirmation dialog
- **AND** the dialog SHALL warn that the operation briefly affects service

#### Scenario: Execute after confirmation

- **WHEN** the administrator clicks "OK" in the confirmation dialog
- **THEN** the system SHALL send the restart request

#### Scenario: Cancel the operation

- **WHEN** the administrator clicks "Cancel" in the confirmation dialog
- **THEN** the system SHALL NOT send the restart request

---

### Requirement: Watchdog Script

The system SHALL provide a watchdog script that monitors the restart flag file.

#### Scenario: Monitor the flag file

- **WHEN** the watchdog script is running
- **THEN** the script SHALL check `/tmp/mes_dashboard_restart.flag` every 5 seconds

#### Scenario: Flag file detected

- **WHEN** the watchdog detects that the flag file exists
- **THEN** the script SHALL send a SIGHUP signal to the Gunicorn master
- **AND** SHALL delete the flag file
- **AND** SHALL log the restart event

#### Scenario: Gunicorn graceful reload

- **WHEN** the Gunicorn master receives SIGHUP
- **THEN** Gunicorn SHALL perform a graceful reload
- **AND** existing requests SHALL complete before workers terminate
- **AND** new workers SHALL start and take over

---
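The watchdog body is a poll loop around one check. A sketch with the signal-sending and logging injectable so a single iteration can be exercised directly; the flag path matches the spec, while how the Gunicorn master PID is obtained (e.g. from a pidfile) is left out as an assumption:

```python
import os
import signal

FLAG_PATH = "/tmp/mes_dashboard_restart.flag"


def check_once(master_pid, flag_path=FLAG_PATH, kill=os.kill, log=print):
    """One watchdog iteration: reload Gunicorn if the restart flag exists."""
    if not os.path.exists(flag_path):
        return False
    kill(master_pid, signal.SIGHUP)   # graceful reload of all workers
    os.remove(flag_path)              # consume the flag so we reload only once
    log(f"restart flag detected, sent SIGHUP to master pid {master_pid}")
    return True


# The real script wraps this in `while True: check_once(...); time.sleep(5)`.
```

Deleting the flag after signaling makes the operation idempotent: a second poll five seconds later finds nothing and does not reload again.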
### Requirement: Restart Status Reporting

The system SHALL provide a way to confirm whether a restart completed.

#### Scenario: Query worker start time

- **WHEN** `GET /admin/api/worker/status` is called
- **THEN** the response SHALL include the current worker's start time

#### Scenario: Frontend displays the restart result

- **WHEN** a restart request has been submitted
- **THEN** the frontend SHALL poll the worker status
- **AND** SHALL display "Restarting..." until a new worker is detected
openspec/specs/api-response-format/spec.md
## ADDED Requirements

### Requirement: Unified Success Response Format

The system SHALL use a unified envelope format for all successful API responses.

#### Scenario: Success responses include a success flag

- **WHEN** an API request completes successfully
- **THEN** the response body SHALL include `"success": true`
- **AND** the original response data SHALL be placed in the `data` field

#### Scenario: Success response example

- **WHEN** a call to `GET /api/dashboard/kpi` succeeds
- **THEN** the response format SHALL be:

```json
{
  "success": true,
  "data": {
    "total": 100,
    "prd": 50,
    ...
  }
}
```

---

### Requirement: Unified Error Response Format

The system SHALL use a unified error format for all failed API responses.

#### Scenario: Error responses include an error code

- **WHEN** an API request fails
- **THEN** the response body SHALL include `"success": false`
- **AND** SHALL include an `error` object
- **AND** `error.code` SHALL be a machine-readable error code
- **AND** `error.message` SHALL be a user-friendly message in Traditional Chinese

#### Scenario: Error response example

- **WHEN** the database connection fails
- **THEN** the response format SHALL be:

```json
{
  "success": false,
  "error": {
    "code": "DB_CONNECTION_FAILED",
    "message": "資料庫連線失敗,請稍後再試"
  }
}
```

#### Scenario: Development mode shows error details

- **WHEN** `FLASK_ENV=development`
- **AND** an API request fails
- **THEN** the `error` object SHALL additionally include a `details` field
- **AND** `details` SHALL contain the technical error message (e.g. ORA-xxxxx)

#### Scenario: Production mode hides error details

- **WHEN** `FLASK_ENV=production`
- **AND** an API request fails
- **THEN** the `error` object SHALL NOT include a `details` field

---

### Requirement: Standard Error Codes

The system SHALL define and use standardized error codes.

#### Scenario: Database connection error code

- **WHEN** the database connection fails
- **THEN** the error code SHALL be `DB_CONNECTION_FAILED`

#### Scenario: Database query timeout

- **WHEN** a database query exceeds 55 seconds
- **THEN** the error code SHALL be `DB_QUERY_TIMEOUT`

#### Scenario: Circuit breaker open

- **WHEN** the Circuit Breaker is in the OPEN state
- **THEN** the error code SHALL be `SERVICE_UNAVAILABLE`

#### Scenario: Validation failure

- **WHEN** request parameter validation fails
- **THEN** the error code SHALL be `VALIDATION_ERROR`

#### Scenario: Unauthorized

- **WHEN** the user is not logged in or the session has expired
- **THEN** the error code SHALL be `UNAUTHORIZED`

#### Scenario: Forbidden

- **WHEN** the user lacks sufficient permissions
- **THEN** the error code SHALL be `FORBIDDEN`

---

### Requirement: Global Error Handling

The system SHALL handle all uncaught errors uniformly at the middleware level.

#### Scenario: Authentication middleware rejection

- **WHEN** the authentication middleware (`@app.before_request` in `create_app`) rejects a request
- **THEN** the response format SHALL conform to the unified error format
- **AND** the error code SHALL be `UNAUTHORIZED` or `FORBIDDEN`

#### Scenario: Unhandled exception

- **WHEN** a route handler raises an uncaught exception
- **THEN** the Flask error handler SHALL intercept the exception
- **AND** the response format SHALL conform to the unified error format
- **AND** the error code SHALL be `INTERNAL_ERROR`

#### Scenario: 404 error handling

- **WHEN** the requested route does not exist
- **THEN** the response format SHALL conform to the unified error format
- **AND** the error code SHALL be `NOT_FOUND`

#### Scenario: Global error handler registration

- **WHEN** the Flask application initializes
- **THEN** `create_app()` SHALL register the following error handlers:
  - `@app.errorhandler(401)` - handles unauthorized access
  - `@app.errorhandler(403)` - handles forbidden access
  - `@app.errorhandler(404)` - handles resource not found
  - `@app.errorhandler(500)` - handles server errors
  - `@app.errorhandler(Exception)` - handles all uncaught exceptions

---

### Requirement: Backward Compatibility

The system SHALL maintain backward compatibility with existing APIs.

#### Scenario: Existing fields preserved

- **WHEN** the new response format is used
- **THEN** the fields returned by existing APIs SHALL be fully preserved in `data`
- **AND** field names and types SHALL remain unchanged

#### Scenario: HTTP status codes maintained

- **WHEN** an API response uses the new format
- **THEN** HTTP status codes SHALL retain their original semantics
- **AND** success SHALL return 2xx
- **AND** client errors SHALL return 4xx
- **AND** server errors SHALL return 5xx
openspec/specs/circuit-breaker/spec.md
## ADDED Requirements
|
||||
|
||||
### Requirement: 熔斷器狀態管理
|
||||
|
||||
系統 SHALL 實作 Circuit Breaker 模式,管理資料庫連線的熔斷狀態。
|
||||
|
||||
#### Scenario: 初始狀態為 CLOSED
|
||||
- **WHEN** 系統啟動
|
||||
- **THEN** 熔斷器狀態 SHALL 為 `CLOSED`
|
||||
- **AND** 所有資料庫請求 SHALL 正常執行
|
||||
|
||||
#### Scenario: 失敗累積觸發 OPEN
|
||||
- **WHEN** 熔斷器處於 `CLOSED` 狀態
|
||||
- **AND** 滑動視窗內失敗次數 >= 5
|
||||
- **AND** 失敗率 >= 50%
|
||||
- **THEN** 熔斷器狀態 SHALL 轉換為 `OPEN`
|
||||
|
||||
#### Scenario: OPEN 狀態拒絕請求
|
||||
- **WHEN** 熔斷器處於 `OPEN` 狀態
|
||||
- **AND** 收到資料庫請求
|
||||
- **THEN** 系統 SHALL 立即回傳錯誤
|
||||
- **AND** 錯誤代碼 SHALL 為 `SERVICE_UNAVAILABLE`
|
||||
- **AND** SHALL NOT 嘗試連線資料庫
|
||||
|
||||
#### Scenario: OPEN 轉換為 HALF_OPEN
|
||||
- **WHEN** 熔斷器處於 `OPEN` 狀態
|
||||
- **AND** 已等待 30 秒(recovery_timeout)
|
||||
- **THEN** 熔斷器狀態 SHALL 轉換為 `HALF_OPEN`
|
||||
|
||||
#### Scenario: HALF_OPEN 探測成功
|
||||
- **WHEN** 熔斷器處於 `HALF_OPEN` 狀態
|
||||
- **AND** 探測請求執行成功
|
||||
- **THEN** 熔斷器狀態 SHALL 轉換為 `CLOSED`
|
||||
- **AND** 失敗計數 SHALL 重置為 0
|
||||
|
||||
#### Scenario: HALF_OPEN 探測失敗
|
||||
- **WHEN** 熔斷器處於 `HALF_OPEN` 狀態
|
||||
- **AND** 探測請求執行失敗
|
||||
- **THEN** 熔斷器狀態 SHALL 轉換為 `OPEN`
|
||||
- **AND** recovery_timeout SHALL 重新計時
|
||||
|
||||
---

### Requirement: Circuit Breaker Configuration

The system SHALL support configuring circuit breaker parameters via environment variables.

#### Scenario: Default parameter values

- **WHEN** no circuit-breaker environment variables are set
- **THEN** failure_threshold SHALL be 5
- **AND** failure_rate SHALL be 0.5 (50%)
- **AND** recovery_timeout SHALL be 30 seconds
- **AND** window_size SHALL be 10

#### Scenario: Environment variable override

- **WHEN** `CIRCUIT_BREAKER_FAILURE_THRESHOLD=10` is set
- **THEN** failure_threshold SHALL be 10

#### Scenario: Disabling the circuit breaker

- **WHEN** `CIRCUIT_BREAKER_ENABLED=false` is set
- **THEN** the circuit breaker SHALL be disabled
- **AND** all requests SHALL execute directly, bypassing circuit breaker checks
---

### Requirement: Circuit Breaker Status Query

The system SHALL provide an API for querying circuit breaker status.

#### Scenario: Query circuit breaker status

- **WHEN** the internal method `get_circuit_breaker_status()` is called
- **THEN** the return value SHALL contain:
  - `state`: current state (CLOSED/OPEN/HALF_OPEN)
  - `failure_count`: current failure count
  - `success_count`: current success count
  - `last_failure_time`: time of the last failure
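A minimal sketch of the expected return shape — the field names come from the scenario above, while the helper and its signature are illustrative assumptions, not the project's API:

```python
def get_circuit_breaker_status(results, state, last_failure_time=None):
    """Build the status dict described above from a window of results.

    `results` is a sequence of booleans (True = success, False = failure).
    """
    failures = sum(1 for r in results if not r)
    return {
        "state": state,  # CLOSED / OPEN / HALF_OPEN
        "failure_count": failures,
        "success_count": len(results) - failures,
        "last_failure_time": last_failure_time,
    }

status = get_circuit_breaker_status([True, False, True, False, False], "CLOSED")
```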
---

### Requirement: Circuit Breaker Event Logging

The system SHALL log circuit breaker state-change events.

#### Scenario: Log state transitions

- **WHEN** the circuit breaker state changes
- **THEN** the system SHALL log at WARNING level
- **AND** the log SHALL include: previous state, new state, and trigger reason

#### Scenario: Log OPEN events

- **WHEN** the circuit breaker transitions to the `OPEN` state
- **THEN** the log message SHALL include the failure count and failure rate
@@ -1,124 +1,150 @@
## ADDED Requirements

### Requirement: Health Check Endpoint

The system SHALL provide a `/health` endpoint that reports service health status.

#### Scenario: All services healthy

- **WHEN** `GET /health` is called and both Oracle and Redis are healthy
- **THEN** the system SHALL return HTTP 200
- **AND** the response body SHALL be:

```json
{
  "status": "healthy",
  "services": {
    "database": "ok",
    "redis": "ok"
  }
}
```

#### Scenario: Database unhealthy

- **WHEN** `GET /health` is called and the Oracle connection fails
- **THEN** the system SHALL return HTTP 503
- **AND** the response body SHALL contain:

```json
{
  "status": "unhealthy",
  "services": {
    "database": "error",
    "redis": "ok"
  },
  "errors": ["Database connection failed: <error message>"]
}
```

#### Scenario: Redis unhealthy but service degraded

- **WHEN** `GET /health` is called and the Redis connection fails but Oracle is healthy
- **THEN** the system SHALL return HTTP 200 (the service can operate in degraded mode)
- **AND** the response body SHALL contain:

```json
{
  "status": "degraded",
  "services": {
    "database": "ok",
    "redis": "error"
  },
  "warnings": ["Redis unavailable, running in fallback mode"]
}
```

#### Scenario: Redis disabled

- **WHEN** `GET /health` is called and `REDIS_ENABLED=false`
- **THEN** `services.redis` in the response body SHALL be `"disabled"`
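The overall status in the scenarios above follows one rule: a database failure makes the service unhealthy, while a Redis failure only degrades it. A minimal sketch of that derivation (the function name and shape are assumptions, not the project's code):

```python
def derive_overall_status(database: str, redis: str) -> str:
    """Map per-service states ("ok"/"error"/"disabled") to the overall status."""
    if database == "error":
        return "unhealthy"  # Oracle is required -> /health returns HTTP 503
    if redis == "error":
        return "degraded"   # Redis is optional -> HTTP 200 plus warnings
    return "healthy"

print(derive_overall_status("ok", "error"))  # degraded
```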
---

### Requirement: Database Health Check

The health check SHALL verify the Oracle database connection.

#### Scenario: Database ping succeeds

- **WHEN** the database health check runs
- **THEN** the system SHALL execute `SELECT 1 FROM DUAL`
- **AND** if the query succeeds, database SHALL be marked `ok`

#### Scenario: Database ping timeout

- **WHEN** a database query takes longer than 5 seconds
- **THEN** the system SHALL mark database as `error`
- **AND** log the timeout error
---

### Requirement: Redis Health Check

The health check SHALL verify the Redis connection (when REDIS_ENABLED=true).

#### Scenario: Redis ping succeeds

- **WHEN** the Redis health check runs
- **THEN** the system SHALL execute the Redis `PING` command
- **AND** if `PONG` is received, redis SHALL be marked `ok`

#### Scenario: Redis ping fails

- **WHEN** the Redis `PING` command fails or times out
- **THEN** the system SHALL mark redis as `error`
- **AND** the service status SHALL be `degraded` (not `unhealthy`)
---

### Requirement: Cache Status in Health Check

The health check SHALL include cache status information.

#### Scenario: Cache status included

- **WHEN** `GET /health` is called and the cache is available
- **THEN** the response body SHALL include a `cache` block:

```json
{
  "cache": {
    "enabled": true,
    "sys_date": "2024-01-15 10:30:00",
    "updated_at": "2024-01-15 10:35:22"
  }
}
```

#### Scenario: Cache not populated

- **WHEN** `GET /health` is called and Redis is available but the cache has not been loaded yet
- **THEN** `cache.sys_date` in the response body SHALL be `null`
---

### Requirement: Health Check Performance

The health check SHALL respond quickly and not affect service performance.

#### Scenario: Response within timeout

- **WHEN** `GET /health` is called
- **THEN** the system SHALL respond within 10 seconds
- **AND** the timeout of each individual check SHALL not exceed 5 seconds

#### Scenario: No authentication required

- **WHEN** `GET /health` is called
- **THEN** the system SHALL NOT require authentication
- **AND** the request SHALL NOT be written to the access log (to avoid log pollution)
## ADDED Requirements

### Requirement: Deep Health Check Endpoint

The system SHALL provide a `/health/deep` endpoint that reports detailed system health information.

#### Scenario: Deep check response format

- **WHEN** `GET /health/deep` is called
- **THEN** the response body SHALL contain:

```json
{
  "status": "healthy",
  "checks": {
    "database": { ... },
    "redis": { ... },
    "circuit_breaker": { ... },
    "cache": { ... }
  },
  "metrics": { ... }
}
```

#### Scenario: Deep check requires authentication

- **WHEN** `GET /health/deep` is called
- **AND** the user is not logged in
- **THEN** the HTTP status code SHALL be 401

#### Scenario: Deep check admin access

- **WHEN** `GET /health/deep` is called
- **AND** the user is an administrator
- **THEN** the HTTP status code SHALL be 200
- **AND** the response SHALL contain the full details

#### Scenario: Deep check forbidden for non-admins

- **WHEN** `GET /health/deep` is called
- **AND** the user is logged in but is not an administrator
- **THEN** the HTTP status code SHALL be 403
- **AND** the response SHALL follow the unified error format

#### Scenario: Deep check implementation

- **WHEN** implementing the `/health/deep` endpoint
- **THEN** the route SHALL use the `@admin_required` decorator
- **AND** the decorator SHALL handle authentication and authorization
---

### Requirement: Latency Metrics Check

The deep health check SHALL include latency metrics for each service.

#### Scenario: Database latency

- **WHEN** the deep health check runs
- **THEN** `checks.database` SHALL include `latency_ms`
- **AND** `latency_ms` SHALL be the actual time taken by the ping query

#### Scenario: Redis latency

- **WHEN** the deep health check runs
- **AND** Redis is enabled
- **THEN** `checks.redis` SHALL include `latency_ms`
- **AND** `latency_ms` SHALL be the actual time taken by the PING

#### Scenario: Latency warning threshold

- **WHEN** database latency exceeds 100ms
- **THEN** `checks.database.status` SHALL be `"slow"`
- **AND** the `warnings` array SHALL contain a latency warning message
---

### Requirement: Connection Pool Status Check

The deep health check SHALL include database connection pool status.

#### Scenario: Pool information

- **WHEN** the deep health check runs
- **THEN** `checks.database` SHALL include:
  - `pool_size`: configured pool size
  - `pool_checked_out`: connections currently checked out
  - `pool_overflow`: current overflow connections

#### Scenario: Pool exhaustion warning

- **WHEN** `pool_checked_out` + `pool_overflow` >= `pool_size` + `max_overflow`
- **THEN** the `warnings` array SHALL contain a pool exhaustion warning
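The exhaustion condition above is a direct comparison. A one-line sketch, where the names mirror the fields above and `max_overflow` is assumed to be the SQLAlchemy-style pool setting:

```python
def pool_exhausted(pool_checked_out: int, pool_overflow: int,
                   pool_size: int, max_overflow: int) -> bool:
    """True when every regular and overflow connection is in use."""
    return pool_checked_out + pool_overflow >= pool_size + max_overflow

print(pool_exhausted(5, 10, 5, 10))  # True
```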
---

### Requirement: Circuit Breaker Status Check

The deep health check SHALL include circuit breaker status.

#### Scenario: Circuit breaker healthy

- **WHEN** the deep health check runs
- **AND** the circuit breaker state is CLOSED
- **THEN** `checks.circuit_breaker` SHALL contain:

```json
{
  "database": "CLOSED",
  "failures": 0
}
```

#### Scenario: Circuit breaker OPEN

- **WHEN** the deep health check runs
- **AND** the circuit breaker state is OPEN
- **THEN** `checks.circuit_breaker.database` SHALL be `"OPEN"`
- **AND** the overall `status` SHALL be `"degraded"` or `"unhealthy"`
- **AND** `warnings` SHALL contain a circuit breaker warning
---

### Requirement: Cache Freshness Check

The deep health check SHALL check the freshness of cached data.

#### Scenario: Cache is fresh

- **WHEN** the deep health check runs
- **AND** the cache was updated within the last 2 minutes
- **THEN** `checks.cache.status` SHALL be `"fresh"`

#### Scenario: Cache is stale

- **WHEN** the deep health check runs
- **AND** the cache was last updated more than 2 minutes ago
- **THEN** `checks.cache.status` SHALL be `"stale"`
- **AND** `warnings` SHALL contain a stale-cache warning

#### Scenario: Local cache status

- **WHEN** the deep health check runs
- **AND** the local cache is enabled
- **THEN** `checks.cache` SHALL include:
  - `local_enabled`: true
  - `local_hit_rate`: local cache hit rate
  - `local_size`: number of local cache entries
---

### Requirement: Performance Metrics Summary

The deep health check SHALL include a performance metrics summary.

#### Scenario: Latency percentiles included

- **WHEN** the deep health check runs
- **THEN** `metrics` SHALL include:
  - `query_p50_ms`: P50 query latency
  - `query_p95_ms`: P95 query latency
  - `query_p99_ms`: P99 query latency
  - `slow_query_count`: number of slow queries

#### Scenario: Empty metrics

- **WHEN** the deep health check runs
- **AND** no queries have been recorded yet
- **THEN** every `metrics` field SHALL be 0
openspec/specs/local-cache-fallback/spec.md (new file, 98 lines)
@@ -0,0 +1,98 @@
## ADDED Requirements

### Requirement: Local LRU Cache

The system SHALL implement a local LRU cache as a second-level fallback for Redis.

#### Scenario: Cache lookup order

- **WHEN** cached data is queried
- **THEN** the system SHALL query Redis first
- **AND** on a Redis miss or failure, it SHALL query the local cache
- **AND** on a local cache miss, it SHALL query Oracle

#### Scenario: Cache backfill

- **WHEN** data is fetched from Oracle
- **THEN** the system SHALL write it to both Redis and the local cache
- **AND** the local cache TTL SHALL be 60 seconds
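The Redis → local → Oracle order can be sketched as a small tiered lookup. This is a minimal illustration under stated assumptions (the `redis_get` and `fetch_oracle` callables and the class name are hypothetical), not the project's implementation:

```python
import time

class TieredCache:
    """Redis first, then a local TTL dict, then the database."""

    def __init__(self, redis_get=None, local_ttl=60):
        self.redis_get = redis_get   # callable, or None when Redis is down
        self.local_ttl = local_ttl
        self._local = {}             # key -> (value, expires_at)

    def get(self, key, fetch_oracle):
        # 1) Redis, tolerating failure
        if self.redis_get is not None:
            try:
                value = self.redis_get(key)
                if value is not None:
                    return value
            except Exception:
                pass                 # fall through to the local cache
        # 2) Local cache, honoring TTL
        entry = self._local.get(key)
        if entry and entry[1] > time.time():
            return entry[0]
        # 3) Oracle, then backfill the local tier
        value = fetch_oracle(key)
        self._local[key] = (value, time.time() + self.local_ttl)
        return value

cache = TieredCache(redis_get=None)  # simulate Redis being unavailable
print(cache.get("wip:line1", lambda k: "from-oracle"))  # from-oracle
```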
---

### Requirement: Local Cache Capacity Limit

The system SHALL limit the memory used by the local cache.

#### Scenario: Default maximum entry count

- **WHEN** the `LOCAL_CACHE_MAXSIZE` environment variable is not set
- **THEN** the default maximum entry count SHALL be 500
- **AND** this value is large enough to hold WIP status, equipment lists, Hold Summary, and other cache groups

#### Scenario: Maximum entry count enforced

- **WHEN** the local cache reaches its maxsize limit
- **AND** a new entry is added
- **THEN** the system SHALL evict the least recently used (LRU) entry
- **AND** the entry count SHALL stay <= maxsize

#### Scenario: Environment variable configuration

- **WHEN** `LOCAL_CACHE_MAXSIZE=1000` is set
- **THEN** the maximum entry count SHALL be 1000

#### Scenario: Cache key design

- **WHEN** a cache entry is created
- **THEN** the cache key SHALL include a feature prefix (e.g. `wip:`, `equipment:`, `hold:`)
- **AND** caches for different features SHALL share the same LRU pool
- **AND** the LRU policy SHALL evict the least recently used entries automatically (regardless of feature)
---

### Requirement: Local Cache TTL

The system SHALL set an expiry time on local cache entries.

#### Scenario: Default TTL

- **WHEN** no TTL environment variable is set
- **THEN** the local cache TTL SHALL be 60 seconds

#### Scenario: Expired entry handling

- **WHEN** the local cache is queried
- **AND** the entry has expired (exceeded its TTL)
- **THEN** the system SHALL treat it as a miss
- **AND** SHALL remove the expired entry

#### Scenario: TTL shorter than Redis

- **WHEN** the Redis cache TTL is N seconds
- **THEN** the local cache TTL SHALL be < N
- **AND** this ensures local cache data never lags far behind Redis
---

### Requirement: Cache Disable Control

The system SHALL support disabling the local cache via an environment variable.

#### Scenario: Disable the local cache

- **WHEN** `LOCAL_CACHE_ENABLED=false` is set
- **THEN** the local cache SHALL be disabled
- **AND** cache lookups SHALL go directly to Redis or Oracle

#### Scenario: Enabled by default

- **WHEN** `LOCAL_CACHE_ENABLED` is not set
- **THEN** the local cache SHALL be enabled by default
---

### Requirement: Cache Hit Rate Statistics

The system SHALL track the local cache hit rate.

#### Scenario: Record hits and misses

- **WHEN** the local cache is queried
- **THEN** the system SHALL record whether it was a hit
- **AND** the statistics SHALL be kept in memory

#### Scenario: Query the hit rate

- **WHEN** `get_local_cache_stats()` is called
- **THEN** the return value SHALL include:
  - `hits`: hit count
  - `misses`: miss count
  - `hit_rate`: hit rate (hits / (hits + misses))
  - `size`: current entry count
openspec/specs/query-metrics/spec.md (new file, 111 lines)
@@ -0,0 +1,111 @@
## ADDED Requirements

### Requirement: Query Latency Collection

The system SHALL collect the latency of every database query.

#### Scenario: Record query latency

- **WHEN** a database query executes
- **THEN** the system SHALL record the elapsed time (in milliseconds)
- **AND** the record SHALL be stored in an in-memory sliding window

#### Scenario: Sliding window size limit

- **WHEN** more than 1000 queries have been recorded
- **THEN** the system SHALL automatically drop the oldest records
- **AND** the window SHALL hold at most 1000 entries
---

### Requirement: Latency Percentile Calculation

The system SHALL compute percentile statistics for query latency.

#### Scenario: Compute P50/P95/P99

- **WHEN** `get_query_metrics()` is called
- **THEN** the return value SHALL include:
  - `p50_ms`: 50th percentile latency
  - `p95_ms`: 95th percentile latency
  - `p99_ms`: 99th percentile latency
  - `count`: sample count
  - `slow_count`: number of slow queries (latency > 1 second)

#### Scenario: Empty data handling

- **WHEN** no queries have been recorded yet
- **THEN** every percentile SHALL return 0
- **AND** `count` SHALL be 0
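A bounded deque plus nearest-rank percentiles is enough for these requirements. A minimal sketch (the exact percentile method is not specified in the spec, so nearest-rank is an assumption, as is the class itself):

```python
from collections import deque

class QueryMetrics:
    """In-memory sliding window of query latencies with a percentile summary."""

    def __init__(self, window=1000, slow_threshold_ms=1000.0):
        self.samples = deque(maxlen=window)  # oldest entries drop automatically
        self.slow_threshold_ms = slow_threshold_ms

    def record(self, latency_ms: float) -> None:
        self.samples.append(latency_ms)

    def _percentile(self, ordered, pct):
        # Nearest-rank percentile over the sorted samples.
        idx = min(len(ordered) - 1, int(pct / 100.0 * len(ordered)))
        return ordered[idx]

    def get_query_metrics(self) -> dict:
        if not self.samples:
            return {"p50_ms": 0, "p95_ms": 0, "p99_ms": 0,
                    "count": 0, "slow_count": 0}
        ordered = sorted(self.samples)
        return {
            "p50_ms": self._percentile(ordered, 50),
            "p95_ms": self._percentile(ordered, 95),
            "p99_ms": self._percentile(ordered, 99),
            "count": len(ordered),
            "slow_count": sum(1 for s in ordered if s > self.slow_threshold_ms),
        }

m = QueryMetrics()
for ms in [10, 20, 30, 40, 1500]:
    m.record(ms)
print(m.get_query_metrics()["slow_count"])  # 1
```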
---

### Requirement: Slow Query Statistics

The system SHALL track the number and proportion of slow queries.

#### Scenario: Slow query definition

- **WHEN** a query's latency exceeds 1000 milliseconds
- **THEN** that query SHALL be flagged as a slow query

#### Scenario: Slow query rate calculation

- **WHEN** `get_query_metrics()` is called
- **THEN** the return value SHALL include `slow_rate`
- **AND** `slow_rate` SHALL be `slow_count / count`
---

### Requirement: Metrics API Endpoint

The system SHALL provide an API endpoint for querying performance metrics.

#### Scenario: Fetch performance metrics

- **WHEN** `GET /admin/api/metrics` is called
- **AND** the user is an administrator
- **THEN** the response SHALL contain query latency statistics
- **AND** the HTTP status code SHALL be 200

#### Scenario: Non-admins forbidden

- **WHEN** `GET /admin/api/metrics` is called
- **AND** the user is not an administrator
- **THEN** the HTTP status code SHALL be 403
---

### Requirement: Per-Worker Statistics

The system SHALL collect metrics independently in each Gunicorn worker.

#### Scenario: Independent statistics per worker

- **WHEN** the system runs multiple workers
- **THEN** each worker SHALL maintain its own metrics data
- **AND** percentile calculations SHALL be based on that worker's samples

#### Scenario: API returns the current worker's metrics

- **WHEN** `GET /admin/api/metrics` is called
- **THEN** the response SHALL indicate which worker (PID) the data came from
- **AND** the response SHALL include a `worker_pid` field

#### Scenario: Known limitation - metric fluctuation

- **GIVEN** the system runs N workers (N > 1)
- **WHEN** `GET /admin/api/metrics` is called repeatedly
- **THEN** values may differ between calls because the load balancer routes requests to different workers
- **AND** this is a known and accepted limitation
---

### Requirement: Shared Counters (Optional Optimization)

When Redis is available, the system MAY share key counter metrics via Redis.

#### Scenario: Redis-shared totals

- **WHEN** Redis is enabled
- **THEN** the `total_queries` counter MAY use the Redis INCR command
- **AND** the `slow_queries` counter MAY use the Redis INCR command
- **AND** these counters SHALL be shared across all workers

#### Scenario: Degradation when Redis is unavailable

- **WHEN** Redis is unavailable or disabled
- **THEN** the system SHALL fall back to per-worker statistics only
- **AND** functionality SHALL continue to work

#### Scenario: Percentiles remain per-worker

- **WHEN** Redis-shared counters are used
- **THEN** latency percentiles (P50/P95/P99) SHALL remain per-worker
- **AND** percentile calculation requires the full sample set, which is not suitable for cross-worker sharing
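The MAY/fallback split above can be captured in a tiny wrapper: increment in Redis when a client is available, fall back to a worker-local counter otherwise. A sketch under those assumptions (the class is hypothetical, and no real Redis client is required to exercise the fallback path):

```python
class SharedCounter:
    """INCR in Redis when available; a per-worker int otherwise."""

    def __init__(self, name: str, redis_client=None):
        self.name = name
        self.redis = redis_client  # e.g. redis.Redis(...) in production
        self._local = 0

    def incr(self) -> None:
        if self.redis is not None:
            try:
                self.redis.incr(self.name)  # shared across all workers
                return
            except Exception:
                pass                        # degrade to local counting
        self._local += 1                    # worker-local fallback

    def value(self) -> int:
        if self.redis is not None:
            try:
                return int(self.redis.get(self.name) or 0)
            except Exception:
                pass
        return self._local

c = SharedCounter("total_queries")  # no Redis client -> local fallback
c.incr()
c.incr()
print(c.value())  # 2
```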
@@ -9,3 +9,4 @@ waitress>=2.1.2; platform_system=="Windows"
requests>=2.28.0
redis>=5.0.0
hiredis>=2.0.0
psutil>=5.9.0
scripts/worker_watchdog.py (new file, 265 lines)
@@ -0,0 +1,265 @@
#!/usr/bin/env python3
# -*- coding: utf-8 -*-
"""Worker watchdog for MES Dashboard.

Monitors a restart flag file and signals Gunicorn master to gracefully
reload workers when the flag is detected.

Usage:
    python scripts/worker_watchdog.py

The watchdog:
- Checks for /tmp/mes_dashboard_restart.flag every 5 seconds
- Sends SIGHUP to Gunicorn master process when flag is detected
- Removes the flag file after signaling
- Logs all restart events

Configuration via environment variables:
- WATCHDOG_CHECK_INTERVAL: Check interval in seconds (default: 5)
- WATCHDOG_RESTART_FLAG: Path to restart flag file
- WATCHDOG_PID_FILE: Path to Gunicorn PID file
"""

from __future__ import annotations

import json
import logging
import os
import signal
import sys
import time
from datetime import datetime
from pathlib import Path

# Configure logging
logging.basicConfig(
    level=logging.INFO,
    format='%(asctime)s - %(name)s - %(levelname)s - %(message)s',
    handlers=[
        logging.StreamHandler(sys.stdout),
    ]
)
logger = logging.getLogger('mes_dashboard.watchdog')

# ============================================================
# Configuration
# ============================================================

CHECK_INTERVAL = int(os.getenv('WATCHDOG_CHECK_INTERVAL', '5'))
RESTART_FLAG_PATH = os.getenv(
    'WATCHDOG_RESTART_FLAG',
    '/tmp/mes_dashboard_restart.flag'
)
GUNICORN_PID_FILE = os.getenv(
    'WATCHDOG_PID_FILE',
    '/tmp/mes_dashboard_gunicorn.pid'
)
RESTART_STATE_FILE = os.getenv(
    'WATCHDOG_STATE_FILE',
    '/tmp/mes_dashboard_restart_state.json'
)


# ============================================================
# Watchdog Implementation
# ============================================================

def get_gunicorn_pid() -> int | None:
    """Get Gunicorn master PID from PID file.

    Returns:
        PID of Gunicorn master process, or None if not found.
    """
    pid_path = Path(GUNICORN_PID_FILE)

    if not pid_path.exists():
        logger.warning(f"PID file not found: {GUNICORN_PID_FILE}")
        return None

    try:
        pid = int(pid_path.read_text().strip())
        # Verify process exists
        os.kill(pid, 0)
        return pid
    except (ValueError, ProcessLookupError, PermissionError) as e:
        logger.warning(f"Invalid or stale PID file: {e}")
        return None


def read_restart_flag() -> dict | None:
    """Read and parse the restart flag file.

    Returns:
        Dictionary with restart metadata, or None if no flag exists.
    """
    flag_path = Path(RESTART_FLAG_PATH)

    if not flag_path.exists():
        return None

    try:
        content = flag_path.read_text().strip()
        if content:
            return json.loads(content)
        return {"timestamp": datetime.now().isoformat()}
    except (json.JSONDecodeError, IOError) as e:
        logger.warning(f"Error reading restart flag: {e}")
        return {"timestamp": datetime.now().isoformat(), "error": str(e)}


def remove_restart_flag() -> bool:
    """Remove the restart flag file.

    Returns:
        True if file was removed, False otherwise.
    """
    flag_path = Path(RESTART_FLAG_PATH)

    try:
        if flag_path.exists():
            flag_path.unlink()
            return True
        return False
    except IOError as e:
        logger.error(f"Failed to remove restart flag: {e}")
        return False


def save_restart_state(
    requested_by: str | None = None,
    requested_at: str | None = None,
    requested_ip: str | None = None,
    completed_at: str | None = None,
    success: bool = True
) -> None:
    """Save restart state for status queries.

    Args:
        requested_by: Username who requested the restart.
        requested_at: ISO timestamp when restart was requested.
        requested_ip: IP address of requester.
        completed_at: ISO timestamp when restart was completed.
        success: Whether the restart was successful.
    """
    state_path = Path(RESTART_STATE_FILE)

    state = {
        "last_restart": {
            "requested_by": requested_by,
            "requested_at": requested_at,
            "requested_ip": requested_ip,
            "completed_at": completed_at,
            "success": success
        }
    }

    try:
        state_path.write_text(json.dumps(state, indent=2))
    except IOError as e:
        logger.error(f"Failed to save restart state: {e}")


def send_reload_signal(pid: int) -> bool:
    """Send SIGHUP to Gunicorn master to reload workers.

    Args:
        pid: PID of Gunicorn master process.

    Returns:
        True if signal was sent successfully, False otherwise.
    """
    try:
        os.kill(pid, signal.SIGHUP)
        logger.info(f"Sent SIGHUP to Gunicorn master (PID: {pid})")
        return True
    except ProcessLookupError:
        logger.error(f"Process {pid} not found")
        return False
    except PermissionError:
        logger.error(f"Permission denied sending signal to PID {pid}")
        return False


def process_restart_request() -> bool:
    """Process a restart request if flag file exists.

    Returns:
        True if restart was processed, False if no restart needed.
    """
    flag_data = read_restart_flag()

    if flag_data is None:
        return False

    logger.info(f"Restart flag detected: {flag_data}")

    # Get Gunicorn master PID
    pid = get_gunicorn_pid()

    if pid is None:
        logger.error("Cannot restart: Gunicorn master PID not found")
        # Still remove flag to prevent infinite loop
        remove_restart_flag()
        save_restart_state(
            requested_by=flag_data.get("user"),
            requested_at=flag_data.get("timestamp"),
            requested_ip=flag_data.get("ip"),
            completed_at=datetime.now().isoformat(),
            success=False
        )
        return True

    # Send reload signal
    success = send_reload_signal(pid)

    # Remove flag file
    remove_restart_flag()

    # Save state
    save_restart_state(
        requested_by=flag_data.get("user"),
        requested_at=flag_data.get("timestamp"),
        requested_ip=flag_data.get("ip"),
        completed_at=datetime.now().isoformat(),
        success=success
    )

    if success:
        logger.info(
            f"Worker restart completed - "
            f"Requested by: {flag_data.get('user', 'unknown')}, "
            f"IP: {flag_data.get('ip', 'unknown')}"
        )

    return True


def run_watchdog() -> None:
    """Main watchdog loop."""
    logger.info(
        f"Worker watchdog started - "
        f"Check interval: {CHECK_INTERVAL}s, "
        f"Flag path: {RESTART_FLAG_PATH}, "
        f"PID file: {GUNICORN_PID_FILE}"
    )

    while True:
        try:
            process_restart_request()
        except Exception as e:
            logger.exception(f"Error in watchdog loop: {e}")

        time.sleep(CHECK_INTERVAL)


def main() -> None:
    """Entry point for watchdog script."""
    try:
        run_watchdog()
    except KeyboardInterrupt:
        logger.info("Watchdog stopped by user")
        sys.exit(0)


if __name__ == "__main__":
    main()
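The application side only needs to drop a JSON flag file that this watchdog picks up. A minimal sketch of that trigger (the `user`/`ip`/`timestamp` fields match what `process_restart_request()` reads; the function name is an assumption):

```python
import json
from datetime import datetime
from pathlib import Path

def request_worker_restart(flag_path: str, user: str, ip: str) -> None:
    """Write the restart flag consumed by the watchdog above."""
    payload = {
        "user": user,
        "ip": ip,
        "timestamp": datetime.now().isoformat(),
    }
    Path(flag_path).write_text(json.dumps(payload))

request_worker_restart("/tmp/mes_dashboard_restart.flag", "admin", "10.0.0.1")
```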
@@ -27,6 +27,8 @@ def _configure_logging(app: Flask) -> None:
    """Configure application logging.

    Sets up logging to stderr (captured by Gunicorn's --capture-output).
    Additionally sets up SQLite log store for admin dashboard queries.

    Log levels:
    - DEBUG: Query completion times, connection events
    - WARNING: Slow queries (>1s)
@@ -38,6 +40,7 @@ def _configure_logging(app: Flask) -> None:

    # Only add handler if not already configured (avoid duplicates)
    if not logger.handlers:
        # Console handler (stderr - captured by Gunicorn)
        handler = logging.StreamHandler(sys.stderr)
        handler.setLevel(logging.DEBUG)
        formatter = logging.Formatter(
@@ -47,6 +50,17 @@ def _configure_logging(app: Flask) -> None:
        handler.setFormatter(formatter)
        logger.addHandler(handler)

        # SQLite log handler for admin dashboard (INFO level and above)
        try:
            from mes_dashboard.core.log_store import get_sqlite_log_handler, LOG_STORE_ENABLED
            if LOG_STORE_ENABLED:
                sqlite_handler = get_sqlite_log_handler()
                sqlite_handler.setLevel(logging.INFO)
                logger.addHandler(sqlite_handler)
                logger.debug("SQLite log handler registered")
        except Exception as e:
            logger.warning(f"Failed to initialize SQLite log handler: {e}")

    # Prevent propagation to root logger (avoid duplicate logs)
    logger.propagate = False

@@ -103,7 +117,8 @@ def create_app(config_name: str | None = None) -> Flask:
        if is_api_public():
            return None
        if not is_admin_logged_in():
            return jsonify({"error": "Unauthorized"}), 401
            from mes_dashboard.core.response import unauthorized_error
            return unauthorized_error()
        return None

        # Skip auth-related pages (login/logout)
@@ -226,4 +241,81 @@ def create_app(config_name: str | None = None) -> Flask:
        """API: get tables config."""
        return jsonify(TABLES_CONFIG)

    # ========================================================
    # Global Error Handlers
    # ========================================================
    _register_error_handlers(app)

    return app


def _register_error_handlers(app: Flask) -> None:
    """Register global error handlers with standardized response format."""
    from mes_dashboard.core.response import (
        unauthorized_error,
        forbidden_error,
        not_found_error,
        internal_error,
        error_response,
        INTERNAL_ERROR
    )

    @app.errorhandler(401)
    def handle_unauthorized(e):
        """Handle 401 Unauthorized errors."""
        return unauthorized_error()

    @app.errorhandler(403)
    def handle_forbidden(e):
        """Handle 403 Forbidden errors."""
        return forbidden_error()

    @app.errorhandler(404)
    def handle_not_found(e):
        """Handle 404 Not Found errors."""
        # For API routes, return JSON; for pages, render template
        if request.path.startswith('/api/'):
            return not_found_error()
        return render_template('404.html'), 404

    def _is_api_request() -> bool:
        """Check if the current request is an API request."""
        return (request.path.startswith('/api/') or
                '/api/' in request.path or
                request.accept_mimetypes.best == 'application/json')

    @app.errorhandler(500)
    def handle_internal_error(e):
        """Handle 500 Internal Server errors."""
        logger = logging.getLogger('mes_dashboard')
        logger.error(f"Internal server error: {e}", exc_info=True)
        if _is_api_request():
            return internal_error(str(e) if app.debug else None)
        # Fallback to JSON if template not found
        try:
            return render_template('500.html'), 500
        except Exception:
            return internal_error(str(e) if app.debug else None)

    @app.errorhandler(Exception)
    def handle_exception(e):
        """Handle uncaught exceptions."""
        logger = logging.getLogger('mes_dashboard')
        logger.error(f"Uncaught exception: {e}", exc_info=True)
        if _is_api_request():
            return error_response(
                INTERNAL_ERROR,
                "伺服器發生未預期的錯誤",
                str(e) if app.debug else None,
                status_code=500
            )
        # Fallback to JSON if template not found
        try:
            return render_template('500.html'), 500
        except Exception:
            return error_response(
                INTERNAL_ERROR,
                "伺服器發生未預期的錯誤",
                str(e) if app.debug else None,
                status_code=500
            )
src/mes_dashboard/core/circuit_breaker.py (new file, 301 lines)
@@ -0,0 +1,301 @@
# -*- coding: utf-8 -*-
"""Circuit breaker implementation for database protection.

Prevents cascading failures by temporarily stopping requests to a failing service.

States:
- CLOSED: Normal operation, requests pass through
- OPEN: Failures exceeded threshold, requests are rejected immediately
- HALF_OPEN: Testing if service has recovered, limited requests allowed
"""

from __future__ import annotations

import logging
import os
import threading
import time
from collections import deque
from dataclasses import dataclass
from enum import Enum
from typing import Deque, Optional

logger = logging.getLogger('mes_dashboard.circuit_breaker')

# ============================================================
# Configuration
# ============================================================

CIRCUIT_BREAKER_ENABLED = os.getenv(
    'CIRCUIT_BREAKER_ENABLED', 'false'
).lower() == 'true'

# Minimum failures before circuit can open
FAILURE_THRESHOLD = int(os.getenv('CIRCUIT_BREAKER_FAILURE_THRESHOLD', '5'))

# Failure rate threshold (0.0 - 1.0)
FAILURE_RATE_THRESHOLD = float(os.getenv('CIRCUIT_BREAKER_FAILURE_RATE', '0.5'))

# Seconds to wait in OPEN state before trying HALF_OPEN
RECOVERY_TIMEOUT = int(os.getenv('CIRCUIT_BREAKER_RECOVERY_TIMEOUT', '30'))

# Sliding window size for counting successes/failures
WINDOW_SIZE = int(os.getenv('CIRCUIT_BREAKER_WINDOW_SIZE', '10'))


# ============================================================
# Types
# ============================================================

class CircuitState(Enum):
    """Circuit breaker states."""
    CLOSED = "CLOSED"
    OPEN = "OPEN"
    HALF_OPEN = "HALF_OPEN"


@dataclass
class CircuitBreakerStatus:
    """Circuit breaker status information."""
    state: str
    failure_count: int
    success_count: int
    total_count: int
    failure_rate: float
    last_failure_time: Optional[str]
    open_until: Optional[str]
    enabled: bool


# ============================================================
# Circuit Breaker Implementation
# ============================================================

class CircuitBreaker:
    """Circuit breaker for protecting database operations.

    Thread-safe implementation using a sliding window to track
    successes and failures.

    Usage:
        cb = CircuitBreaker("database")

        if not cb.allow_request():
            return error_response(CIRCUIT_BREAKER_OPEN, "Service degraded")

        try:
            result = execute_query()
            cb.record_success()
            return result
        except Exception as e:
            cb.record_failure()
            raise
    """

    def __init__(
        self,
        name: str,
        failure_threshold: int = FAILURE_THRESHOLD,
        failure_rate_threshold: float = FAILURE_RATE_THRESHOLD,
        recovery_timeout: int = RECOVERY_TIMEOUT,
        window_size: int = WINDOW_SIZE
    ):
        """Initialize circuit breaker.

        Args:
            name: Identifier for this circuit breaker.
            failure_threshold: Minimum failures before opening.
            failure_rate_threshold: Failure rate to trigger opening (0.0-1.0).
            recovery_timeout: Seconds to wait before half-open.
            window_size: Size of sliding window for tracking.
        """
        self.name = name
        self.failure_threshold = failure_threshold
        self.failure_rate_threshold = failure_rate_threshold
        self.recovery_timeout = recovery_timeout
        self.window_size = window_size

        self._state = CircuitState.CLOSED
        self._lock = threading.Lock()

        # Sliding window: True = success, False = failure
        self._results: Deque[bool] = deque(maxlen=window_size)

        self._last_failure_time: Optional[float] = None
        self._open_time: Optional[float] = None

    @property
    def state(self) -> CircuitState:
        """Get current circuit state, handling state transitions."""
        with self._lock:
            if self._state == CircuitState.OPEN:
                # Check if we should transition to HALF_OPEN
                if self._open_time and time.time() - self._open_time >= self.recovery_timeout:
                    self._transition_to(CircuitState.HALF_OPEN)
            return self._state

    def allow_request(self) -> bool:
        """Check if a request should be allowed.

        Returns:
            True if request should proceed, False if circuit is open.
        """
        if not CIRCUIT_BREAKER_ENABLED:
            return True

        current_state = self.state

        if current_state == CircuitState.CLOSED:
            return True
        elif current_state == CircuitState.HALF_OPEN:
            # Allow limited requests in half-open state
            return True
        else:  # OPEN
            return False

    def record_success(self) -> None:
        """Record a successful operation."""
        if not CIRCUIT_BREAKER_ENABLED:
            return

        with self._lock:
            self._results.append(True)

            if self._state == CircuitState.HALF_OPEN:
                # Success in half-open means we can close
                self._transition_to(CircuitState.CLOSED)

    def record_failure(self) -> None:
        """Record a failed operation."""
        if not CIRCUIT_BREAKER_ENABLED:
            return

        with self._lock:
            self._results.append(False)
            self._last_failure_time = time.time()

            if self._state == CircuitState.HALF_OPEN:
                # Failure in half-open means back to open
                self._transition_to(CircuitState.OPEN)
            elif self._state == CircuitState.CLOSED:
                # Check if we should open
                self._check_and_open()

    def _check_and_open(self) -> None:
        """Check failure rate and open circuit if needed.

        Must be called with lock held.
        """
        if len(self._results) < self.failure_threshold:
            return

        failure_count = sum(1 for r in self._results if not r)
        failure_rate = failure_count / len(self._results)
|
||||
|
||||
if (failure_count >= self.failure_threshold and
|
||||
failure_rate >= self.failure_rate_threshold):
|
||||
self._transition_to(CircuitState.OPEN)
|
||||
|
||||
def _transition_to(self, new_state: CircuitState) -> None:
|
||||
"""Transition to a new state with logging.
|
||||
|
||||
Must be called with lock held.
|
||||
"""
|
||||
old_state = self._state
|
||||
self._state = new_state
|
||||
|
||||
if new_state == CircuitState.OPEN:
|
||||
self._open_time = time.time()
|
||||
logger.warning(
|
||||
f"Circuit breaker '{self.name}' OPENED: "
|
||||
f"state {old_state.value} -> {new_state.value}, "
|
||||
f"failures: {sum(1 for r in self._results if not r)}/{len(self._results)}"
|
||||
)
|
||||
elif new_state == CircuitState.HALF_OPEN:
|
||||
logger.info(
|
||||
f"Circuit breaker '{self.name}' entering HALF_OPEN: "
|
||||
f"testing service recovery..."
|
||||
)
|
||||
elif new_state == CircuitState.CLOSED:
|
||||
self._open_time = None
|
||||
self._results.clear()
|
||||
logger.info(
|
||||
f"Circuit breaker '{self.name}' CLOSED: "
|
||||
f"service recovered"
|
||||
)
|
||||
|
||||
def get_status(self) -> CircuitBreakerStatus:
|
||||
"""Get current status information."""
|
||||
with self._lock:
|
||||
# Use _state directly to avoid deadlock (self.state would try to acquire lock again)
|
||||
current_state = self._state
|
||||
failure_count = sum(1 for r in self._results if not r)
|
||||
success_count = sum(1 for r in self._results if r)
|
||||
total = len(self._results)
|
||||
failure_rate = failure_count / total if total > 0 else 0.0
|
||||
|
||||
open_until = None
|
||||
if current_state == CircuitState.OPEN and self._open_time:
|
||||
open_until_time = self._open_time + self.recovery_timeout
|
||||
from datetime import datetime
|
||||
open_until = datetime.fromtimestamp(open_until_time).isoformat()
|
||||
|
||||
last_failure = None
|
||||
if self._last_failure_time:
|
||||
from datetime import datetime
|
||||
last_failure = datetime.fromtimestamp(self._last_failure_time).isoformat()
|
||||
|
||||
return CircuitBreakerStatus(
|
||||
state=current_state.value,
|
||||
failure_count=failure_count,
|
||||
success_count=success_count,
|
||||
total_count=total,
|
||||
failure_rate=failure_rate,
|
||||
last_failure_time=last_failure,
|
||||
open_until=open_until,
|
||||
enabled=CIRCUIT_BREAKER_ENABLED
|
||||
)
|
||||
|
||||
def reset(self) -> None:
|
||||
"""Reset the circuit breaker to initial state."""
|
||||
with self._lock:
|
||||
self._state = CircuitState.CLOSED
|
||||
self._results.clear()
|
||||
self._last_failure_time = None
|
||||
self._open_time = None
|
||||
logger.info(f"Circuit breaker '{self.name}' reset")
|
||||
|
||||
|
||||
# ============================================================
|
||||
# Global Database Circuit Breaker
|
||||
# ============================================================
|
||||
|
||||
_DATABASE_CIRCUIT_BREAKER: Optional[CircuitBreaker] = None
|
||||
|
||||
|
||||
def get_database_circuit_breaker() -> CircuitBreaker:
|
||||
"""Get or create the global database circuit breaker."""
|
||||
global _DATABASE_CIRCUIT_BREAKER
|
||||
if _DATABASE_CIRCUIT_BREAKER is None:
|
||||
_DATABASE_CIRCUIT_BREAKER = CircuitBreaker("database")
|
||||
return _DATABASE_CIRCUIT_BREAKER
|
||||
|
||||
|
||||
def get_circuit_breaker_status() -> dict:
|
||||
"""Get current circuit breaker status as a dictionary.
|
||||
|
||||
Returns:
|
||||
Dictionary with circuit breaker status information.
|
||||
"""
|
||||
cb = get_database_circuit_breaker()
|
||||
status = cb.get_status()
|
||||
return {
|
||||
"state": status.state,
|
||||
"failure_count": status.failure_count,
|
||||
"success_count": status.success_count,
|
||||
"total_count": status.total_count,
|
||||
"failure_rate": round(status.failure_rate, 2),
|
||||
"last_failure_time": status.last_failure_time,
|
||||
"open_until": status.open_until,
|
||||
"enabled": status.enabled
|
||||
}
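For reference, `_check_and_open` only trips the breaker when both conditions hold at once: an absolute failure count and a failure rate over the sliding window. A minimal standalone sketch of that condition, with threshold defaults assumed from the `.env.example` values (not the real class):

```python
from collections import deque


def should_open(results, failure_threshold=5, failure_rate_threshold=0.5):
    """Mirror of _check_and_open's condition on a window of bools (True = success)."""
    if len(results) < failure_threshold:
        return False
    failures = sum(1 for r in results if not r)
    # Both the absolute count AND the rate must cross their thresholds.
    return failures >= failure_threshold and failures / len(results) >= failure_rate_threshold


# Simulate ten recent outcomes flowing into a bounded window.
window = deque(maxlen=10)
for outcome in [True, False, False, True, False, False, False, False, True, False]:
    window.append(outcome)

print(should_open(window))  # 7 failures out of 10 -> rate 0.7 -> True
```

The dual condition means a single failure on a quiet service (high rate, low count) and a burst of failures amid heavy success traffic (high count, low rate) both keep the circuit closed.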
@@ -252,6 +252,7 @@ def read_sql_df(sql: str, params: Optional[Dict[str, Any]] = None) -> pd.DataFra

    Raises:
        Exception: If query execution fails. ORA code is logged.
        RuntimeError: If circuit breaker is open (service degraded).

    Example:
        >>> sql = "SELECT * FROM users WHERE status = :status"
@@ -261,7 +262,21 @@ def read_sql_df(sql: str, params: Optional[Dict[str, Any]] = None) -> pd.DataFra
    - Slow queries (>1s) are logged as warnings
    - All queries use connection pooling via SQLAlchemy
    - Call timeout is set to 55s to prevent worker blocking
    - Circuit breaker protects against cascading failures
    - Query latency is recorded for metrics
    """
    from mes_dashboard.core.circuit_breaker import (
        get_database_circuit_breaker,
        CIRCUIT_BREAKER_ENABLED
    )
    from mes_dashboard.core.metrics import record_query_latency

    # Check circuit breaker before executing
    circuit_breaker = get_database_circuit_breaker()
    if not circuit_breaker.allow_request():
        logger.warning("Circuit breaker OPEN - rejecting database query")
        raise RuntimeError("Database service is temporarily unavailable (circuit breaker open)")

    start_time = time.time()
    engine = get_engine()

@@ -271,6 +286,14 @@ def read_sql_df(sql: str, params: Optional[Dict[str, Any]] = None) -> pd.DataFra
        df.columns = [str(c).upper() for c in df.columns]

        elapsed = time.time() - start_time

        # Record metrics
        record_query_latency(elapsed)

        # Record success to circuit breaker
        if CIRCUIT_BREAKER_ENABLED:
            circuit_breaker.record_success()

        # Log slow queries (>1 second) as warnings
        if elapsed > 1.0:
            # Truncate SQL for logging (first 100 chars)
@@ -283,6 +306,14 @@ def read_sql_df(sql: str, params: Optional[Dict[str, Any]] = None) -> pd.DataFra

    except Exception as exc:
        elapsed = time.time() - start_time

        # Record metrics even for failed queries
        record_query_latency(elapsed)

        # Record failure to circuit breaker
        if CIRCUIT_BREAKER_ENABLED:
            circuit_breaker.record_failure()

        ora_code = _extract_ora_code(exc)
        sql_preview = sql.strip().replace('\n', ' ')[:100]
        logger.error(
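The `read_sql_df` hunks above follow one wrap pattern: gate on `allow_request()`, run the query, record success or failure, and record latency on both paths. A self-contained sketch of that pattern, using a hypothetical stand-in for the real `CircuitBreaker` and metrics recorder:

```python
import time


class ToyBreaker:
    """Hypothetical stand-in exposing the same allow/record surface as CircuitBreaker."""

    def __init__(self):
        self.open = False
        self.failures = 0

    def allow_request(self):
        return not self.open

    def record_success(self):
        self.failures = 0

    def record_failure(self):
        self.failures += 1
        if self.failures >= 3:
            self.open = True


def guarded_call(breaker, fn, record_latency):
    """Run fn under breaker protection, recording latency on both paths."""
    if not breaker.allow_request():
        raise RuntimeError("circuit breaker open")
    start = time.time()
    try:
        result = fn()
        breaker.record_success()
        return result
    except Exception:
        breaker.record_failure()
        raise
    finally:
        # Latency is recorded for success AND failure, matching read_sql_df.
        record_latency(time.time() - start)


latencies = []
breaker = ToyBreaker()
print(guarded_call(breaker, lambda: 42, latencies.append))  # prints 42
```

The `finally` clause is the key design choice: failed queries still contribute latency samples, so P95/P99 reflect timeouts rather than hiding them.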
473
src/mes_dashboard/core/log_store.py
Normal file
@@ -0,0 +1,473 @@
# -*- coding: utf-8 -*-
"""SQLite-based log store for admin dashboard.

Stores structured logs in a local SQLite database for admin querying.
Maintains existing file/STDERR logs for operations.
"""

from __future__ import annotations

import logging
import os
import sqlite3
import threading
import time
from contextlib import contextmanager
from datetime import datetime, timedelta
from pathlib import Path
from typing import Any, Dict, Generator, List, Optional

logger = logging.getLogger('mes_dashboard.log_store')

# ============================================================
# Configuration
# ============================================================

# SQLite database path
LOG_SQLITE_PATH = os.getenv(
    'LOG_SQLITE_PATH',
    'logs/admin_logs.sqlite'
)

# Retention policy
LOG_SQLITE_RETENTION_DAYS = int(os.getenv('LOG_SQLITE_RETENTION_DAYS', '7'))
LOG_SQLITE_MAX_ROWS = int(os.getenv('LOG_SQLITE_MAX_ROWS', '100000'))

# Enable/disable log store
LOG_STORE_ENABLED = os.getenv('LOG_STORE_ENABLED', 'true').lower() == 'true'


# ============================================================
# Database Schema
# ============================================================

CREATE_TABLE_SQL = """
CREATE TABLE IF NOT EXISTS logs (
    id INTEGER PRIMARY KEY AUTOINCREMENT,
    timestamp TEXT NOT NULL,
    level TEXT NOT NULL,
    logger_name TEXT NOT NULL,
    message TEXT NOT NULL,
    request_id TEXT,
    user TEXT,
    ip TEXT,
    extra TEXT
);
"""

CREATE_INDEXES_SQL = [
    "CREATE INDEX IF NOT EXISTS idx_logs_timestamp ON logs(timestamp);",
    "CREATE INDEX IF NOT EXISTS idx_logs_level ON logs(level);",
    "CREATE INDEX IF NOT EXISTS idx_logs_logger ON logs(logger_name);",
]


# ============================================================
# Log Store Implementation
# ============================================================

class LogStore:
    """SQLite-based log storage for admin dashboard queries.

    Thread-safe implementation with one connection per thread.
    Supports a retention policy to prevent unbounded growth.

    Usage:
        store = LogStore()
        store.initialize()

        # Write logs
        store.write_log(
            level="ERROR",
            logger_name="mes_dashboard.api",
            message="Database connection failed",
            user="admin@example.com"
        )

        # Query logs
        logs = store.query_logs(level="ERROR", limit=100)
    """

    def __init__(self, db_path: str = LOG_SQLITE_PATH):
        """Initialize log store.

        Args:
            db_path: Path to SQLite database file.
        """
        self.db_path = db_path
        self._local = threading.local()
        self._write_lock = threading.Lock()
        self._initialized = False

    def initialize(self) -> None:
        """Initialize the database schema.

        Creates tables and indexes if they don't exist.
        """
        if self._initialized:
            return

        # Ensure directory exists
        db_dir = Path(self.db_path).parent
        db_dir.mkdir(parents=True, exist_ok=True)

        with self._get_connection() as conn:
            cursor = conn.cursor()
            cursor.execute(CREATE_TABLE_SQL)
            for index_sql in CREATE_INDEXES_SQL:
                cursor.execute(index_sql)
            conn.commit()

        self._initialized = True
        logger.info(f"Log store initialized at {self.db_path}")

    @contextmanager
    def _get_connection(self) -> Generator[sqlite3.Connection, None, None]:
        """Get a thread-local database connection.

        Yields:
            SQLite connection for the current thread.
        """
        if not hasattr(self._local, 'connection') or self._local.connection is None:
            self._local.connection = sqlite3.connect(
                self.db_path,
                timeout=10.0,
                check_same_thread=False
            )
            self._local.connection.row_factory = sqlite3.Row

        try:
            yield self._local.connection
        except sqlite3.Error as e:
            logger.error(f"Database error: {e}")
            # Reset connection on error
            try:
                self._local.connection.close()
            except Exception:
                pass
            self._local.connection = None
            raise

    def write_log(
        self,
        level: str,
        logger_name: str,
        message: str,
        request_id: Optional[str] = None,
        user: Optional[str] = None,
        ip: Optional[str] = None,
        extra: Optional[Dict[str, Any]] = None
    ) -> bool:
        """Write a log entry to the database.

        Args:
            level: Log level (DEBUG, INFO, WARNING, ERROR, CRITICAL).
            logger_name: Name of the logger.
            message: Log message.
            request_id: Optional request identifier.
            user: Optional user identifier.
            ip: Optional client IP address.
            extra: Optional extra data as JSON-serializable dict.

        Returns:
            True if the log was written successfully.
        """
        if not LOG_STORE_ENABLED:
            return False

        if not self._initialized:
            self.initialize()

        timestamp = datetime.now().isoformat()
        extra_str = None
        if extra:
            import json
            try:
                extra_str = json.dumps(extra, ensure_ascii=False)
            except (TypeError, ValueError):
                extra_str = str(extra)

        try:
            with self._write_lock:
                with self._get_connection() as conn:
                    cursor = conn.cursor()
                    cursor.execute(
                        """
                        INSERT INTO logs (timestamp, level, logger_name, message, request_id, user, ip, extra)
                        VALUES (?, ?, ?, ?, ?, ?, ?, ?)
                        """,
                        (timestamp, level, logger_name, message, request_id, user, ip, extra_str)
                    )
                    conn.commit()
            return True
        except Exception as e:
            # Don't let log store errors propagate
            logger.debug(f"Failed to write log to SQLite: {e}")
            return False

    def query_logs(
        self,
        level: Optional[str] = None,
        q: Optional[str] = None,
        limit: int = 200,
        since: Optional[str] = None,
        logger_name: Optional[str] = None
    ) -> List[Dict[str, Any]]:
        """Query logs from the database.

        Args:
            level: Filter by log level (e.g., "ERROR", "WARNING").
            q: Search query for message content (case-insensitive).
            limit: Maximum number of logs to return (default: 200).
            since: ISO timestamp to filter logs after this time.
            logger_name: Filter by logger name prefix.

        Returns:
            List of log entries as dictionaries.
        """
        if not LOG_STORE_ENABLED:
            return []

        if not self._initialized:
            self.initialize()

        query = "SELECT * FROM logs WHERE 1=1"
        params: List[Any] = []

        if level:
            query += " AND level = ?"
            params.append(level.upper())

        if q:
            query += " AND message LIKE ?"
            params.append(f"%{q}%")

        if since:
            query += " AND timestamp >= ?"
            params.append(since)

        if logger_name:
            query += " AND logger_name LIKE ?"
            params.append(f"{logger_name}%")

        query += " ORDER BY timestamp DESC LIMIT ?"
        params.append(limit)

        try:
            with self._get_connection() as conn:
                cursor = conn.cursor()
                cursor.execute(query, params)
                rows = cursor.fetchall()

            return [dict(row) for row in rows]
        except Exception as e:
            logger.error(f"Failed to query logs: {e}")
            return []

    def cleanup_old_logs(self) -> int:
        """Remove logs older than the retention period or exceeding max rows.

        Returns:
            Number of logs deleted.
        """
        if not LOG_STORE_ENABLED or not self._initialized:
            return 0

        deleted = 0

        try:
            with self._write_lock:
                with self._get_connection() as conn:
                    cursor = conn.cursor()

                    # Delete logs older than retention days
                    cutoff_date = (
                        datetime.now() - timedelta(days=LOG_SQLITE_RETENTION_DAYS)
                    ).isoformat()

                    cursor.execute(
                        "DELETE FROM logs WHERE timestamp < ?",
                        (cutoff_date,)
                    )
                    deleted += cursor.rowcount

                    # Delete excess logs if over max rows
                    cursor.execute("SELECT COUNT(*) FROM logs")
                    count = cursor.fetchone()[0]

                    if count > LOG_SQLITE_MAX_ROWS:
                        excess = count - LOG_SQLITE_MAX_ROWS
                        cursor.execute(
                            """
                            DELETE FROM logs WHERE id IN (
                                SELECT id FROM logs ORDER BY timestamp ASC LIMIT ?
                            )
                            """,
                            (excess,)
                        )
                        deleted += cursor.rowcount

                    conn.commit()

            if deleted > 0:
                logger.info(f"Cleaned up {deleted} old log entries")

        except Exception as e:
            logger.error(f"Failed to cleanup logs: {e}")

        return deleted

    def get_stats(self) -> Dict[str, Any]:
        """Get log store statistics.

        Returns:
            Dictionary with stats (count, oldest, newest, size_bytes).
        """
        if not LOG_STORE_ENABLED or not self._initialized:
            return {
                "enabled": LOG_STORE_ENABLED,
                "count": 0,
                "oldest": None,
                "newest": None,
                "size_bytes": 0
            }

        try:
            with self._get_connection() as conn:
                cursor = conn.cursor()

                cursor.execute("SELECT COUNT(*) FROM logs")
                count = cursor.fetchone()[0]

                cursor.execute("SELECT MIN(timestamp), MAX(timestamp) FROM logs")
                row = cursor.fetchone()
                oldest = row[0]
                newest = row[1]

            # Get file size
            size_bytes = 0
            if Path(self.db_path).exists():
                size_bytes = Path(self.db_path).stat().st_size

            return {
                "enabled": True,
                "count": count,
                "oldest": oldest,
                "newest": newest,
                "size_bytes": size_bytes,
                "retention_days": LOG_SQLITE_RETENTION_DAYS,
                "max_rows": LOG_SQLITE_MAX_ROWS
            }

        except Exception as e:
            logger.error(f"Failed to get log stats: {e}")
            return {
                "enabled": True,
                "count": 0,
                "oldest": None,
                "newest": None,
                "size_bytes": 0,
                "error": str(e)
            }

    def close(self) -> None:
        """Close database connections."""
        if hasattr(self._local, 'connection') and self._local.connection:
            try:
                self._local.connection.close()
            except Exception:
                pass
            self._local.connection = None


# ============================================================
# SQLite Log Handler
# ============================================================

class SQLiteLogHandler(logging.Handler):
    """Logging handler that writes to the SQLite log store.

    Integrates with Python's logging framework to automatically
    capture logs for the admin dashboard.

    Usage:
        handler = SQLiteLogHandler(log_store)
        handler.setLevel(logging.INFO)
        logging.getLogger().addHandler(handler)
    """

    def __init__(self, log_store: LogStore):
        """Initialize the handler.

        Args:
            log_store: LogStore instance to write to.
        """
        super().__init__()
        self.log_store = log_store

    def emit(self, record: logging.LogRecord) -> None:
        """Write a log record to the store.

        Args:
            record: Log record to write.
        """
        try:
            # Get extra context from the record if available
            request_id = getattr(record, 'request_id', None)
            user = getattr(record, 'user', None)
            ip = getattr(record, 'ip', None)

            # Try to get from Flask's g object if not in the record
            try:
                from flask import g, has_request_context
                if has_request_context():
                    if not request_id:
                        request_id = getattr(g, 'request_id', None)
                    if not user:
                        user = getattr(g, 'user_email', None)
                    if not ip:
                        from flask import request
                        ip = request.remote_addr
            except ImportError:
                pass

            self.log_store.write_log(
                level=record.levelname,
                logger_name=record.name,
                message=self.format(record),
                request_id=request_id,
                user=user,
                ip=ip
            )
        except Exception:
            # Never let handler errors propagate
            self.handleError(record)


# ============================================================
# Global Log Store Instance
# ============================================================

_LOG_STORE: Optional[LogStore] = None


def get_log_store() -> LogStore:
    """Get or create the global log store instance."""
    global _LOG_STORE
    if _LOG_STORE is None:
        _LOG_STORE = LogStore()
        if LOG_STORE_ENABLED:
            _LOG_STORE.initialize()
    return _LOG_STORE


def get_sqlite_log_handler() -> SQLiteLogHandler:
    """Get a configured SQLite log handler.

    Returns:
        Configured SQLiteLogHandler instance.
    """
    handler = SQLiteLogHandler(get_log_store())
    handler.setLevel(logging.INFO)
    handler.setFormatter(logging.Formatter('%(message)s'))
    return handler
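`LogStore.query_logs` builds its filter incrementally from a `WHERE 1=1` base, appending one predicate and one bound parameter per filter. A sketch of the same pattern against an in-memory SQLite database (table columns reduced for brevity):

```python
import sqlite3

conn = sqlite3.connect(":memory:")
conn.row_factory = sqlite3.Row
conn.execute("CREATE TABLE logs (timestamp TEXT, level TEXT, logger_name TEXT, message TEXT)")
rows = [
    ("2024-01-01T10:00:00", "ERROR", "mes_dashboard.api", "Database connection failed"),
    ("2024-01-01T10:01:00", "INFO", "mes_dashboard.api", "Request served"),
    ("2024-01-01T10:02:00", "ERROR", "mes_dashboard.worker", "Timeout"),
]
conn.executemany("INSERT INTO logs VALUES (?, ?, ?, ?)", rows)

# Same incremental "WHERE 1=1" pattern as LogStore.query_logs:
# SQL text grows, values go into the params list, never into the string.
query = "SELECT * FROM logs WHERE 1=1"
params = []
level, q = "ERROR", "connection"
if level:
    query += " AND level = ?"
    params.append(level.upper())
if q:
    query += " AND message LIKE ?"
    params.append(f"%{q}%")
query += " ORDER BY timestamp DESC LIMIT ?"
params.append(200)

matches = [dict(r) for r in conn.execute(query, params)]
print([m["message"] for m in matches])  # ['Database connection failed']
```

Because only the fixed predicate text is concatenated and all user input travels through `?` placeholders, the pattern stays safe from SQL injection while keeping the query builder simple.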
232
src/mes_dashboard/core/metrics.py
Normal file
@@ -0,0 +1,232 @@
# -*- coding: utf-8 -*-
"""Performance metrics collection for MES Dashboard.

Collects query latency metrics using an in-memory sliding window.
Each worker maintains independent statistics.
"""

from __future__ import annotations

import logging
import os
import threading
import time
from collections import deque
from dataclasses import dataclass
from datetime import datetime
from typing import Deque, List, Optional

logger = logging.getLogger('mes_dashboard.metrics')

# ============================================================
# Configuration
# ============================================================

# Maximum number of latency samples to keep
METRICS_WINDOW_SIZE = int(os.getenv('METRICS_WINDOW_SIZE', '1000'))

# Threshold for "slow" queries (seconds)
SLOW_QUERY_THRESHOLD = float(os.getenv('SLOW_QUERY_THRESHOLD', '1.0'))


# ============================================================
# Types
# ============================================================

@dataclass
class MetricsSummary:
    """Summary of collected metrics."""
    p50_ms: float
    p95_ms: float
    p99_ms: float
    count: int
    slow_count: int
    slow_rate: float
    worker_pid: int
    collected_at: str


# ============================================================
# Query Metrics Implementation
# ============================================================

class QueryMetrics:
    """Collects and summarizes query latency metrics.

    Uses a thread-safe sliding window to track the most recent
    query latencies. Provides percentile calculations for
    monitoring and alerting.

    Usage:
        metrics = QueryMetrics()

        # Record a query
        start = time.time()
        execute_query()
        metrics.record_latency(time.time() - start)

        # Get summary
        summary = metrics.get_summary()
    """

    def __init__(self, window_size: int = METRICS_WINDOW_SIZE):
        """Initialize query metrics collector.

        Args:
            window_size: Maximum number of samples to keep.
        """
        self.window_size = window_size
        self._latencies: Deque[float] = deque(maxlen=window_size)
        self._lock = threading.Lock()
        self._worker_pid = os.getpid()

    def record_latency(self, latency_seconds: float) -> None:
        """Record a query latency.

        Args:
            latency_seconds: Query execution time in seconds.
        """
        with self._lock:
            self._latencies.append(latency_seconds)

        # Log slow queries
        if latency_seconds > SLOW_QUERY_THRESHOLD:
            logger.warning(
                f"Slow query detected: {latency_seconds:.2f}s "
                f"(threshold: {SLOW_QUERY_THRESHOLD}s)"
            )

    def get_percentile(self, percentile: float) -> float:
        """Calculate a specific percentile from the latency data.

        Args:
            percentile: Percentile to calculate (0-100).

        Returns:
            Latency value at the given percentile in seconds.
        """
        with self._lock:
            if not self._latencies:
                return 0.0

            sorted_latencies = sorted(self._latencies)
            index = int((percentile / 100.0) * len(sorted_latencies))
            # Clamp index to valid range
            index = min(index, len(sorted_latencies) - 1)
            return sorted_latencies[index]

    def get_percentiles(self) -> dict:
        """Calculate P50, P95, and P99 percentiles.

        Returns:
            Dictionary with percentile values in seconds.
        """
        with self._lock:
            if not self._latencies:
                return {
                    "p50": 0.0,
                    "p95": 0.0,
                    "p99": 0.0,
                    "count": 0,
                    "slow_count": 0
                }

            sorted_latencies = sorted(self._latencies)
            count = len(sorted_latencies)

            def get_percentile_value(p: float) -> float:
                index = int((p / 100.0) * count)
                index = min(index, count - 1)
                return sorted_latencies[index]

            slow_count = sum(1 for lat in sorted_latencies if lat > SLOW_QUERY_THRESHOLD)

            return {
                "p50": get_percentile_value(50),
                "p95": get_percentile_value(95),
                "p99": get_percentile_value(99),
                "count": count,
                "slow_count": slow_count
            }

    def get_summary(self) -> MetricsSummary:
        """Get a complete metrics summary.

        Returns:
            MetricsSummary with all collected metrics.
        """
        percentiles = self.get_percentiles()

        slow_rate = 0.0
        if percentiles["count"] > 0:
            slow_rate = percentiles["slow_count"] / percentiles["count"]

        return MetricsSummary(
            p50_ms=round(percentiles["p50"] * 1000, 2),
            p95_ms=round(percentiles["p95"] * 1000, 2),
            p99_ms=round(percentiles["p99"] * 1000, 2),
            count=percentiles["count"],
            slow_count=percentiles["slow_count"],
            slow_rate=round(slow_rate, 4),
            worker_pid=self._worker_pid,
            collected_at=datetime.now().isoformat()
        )

    def get_latencies(self) -> List[float]:
        """Get a copy of all recorded latencies.

        Returns:
            List of latencies in seconds.
        """
        with self._lock:
            return list(self._latencies)

    def clear(self) -> None:
        """Clear all recorded metrics."""
        with self._lock:
            self._latencies.clear()
        logger.info(f"Metrics cleared for worker {self._worker_pid}")


# ============================================================
# Global Query Metrics Instance
# ============================================================

_QUERY_METRICS: Optional[QueryMetrics] = None


def get_query_metrics() -> QueryMetrics:
    """Get or create the global query metrics instance."""
    global _QUERY_METRICS
    if _QUERY_METRICS is None:
        _QUERY_METRICS = QueryMetrics()
    return _QUERY_METRICS


def get_metrics_summary() -> dict:
    """Get current metrics summary as a dictionary.

    Returns:
        Dictionary with metrics summary information.
    """
    metrics = get_query_metrics()
    summary = metrics.get_summary()
    return {
        "p50_ms": summary.p50_ms,
        "p95_ms": summary.p95_ms,
        "p99_ms": summary.p99_ms,
        "count": summary.count,
        "slow_count": summary.slow_count,
        "slow_rate": summary.slow_rate,
        "worker_pid": summary.worker_pid,
        "collected_at": summary.collected_at
    }


def record_query_latency(latency_seconds: float) -> None:
    """Record a query latency to the global metrics.

    Args:
        latency_seconds: Query execution time in seconds.
    """
    get_query_metrics().record_latency(latency_seconds)
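`get_percentile` uses a nearest-rank style index (floor of p/100 · n, clamped to the last element), so a single outlier dominates the high percentiles. A standalone sketch of the same calculation:

```python
def percentile(samples, p):
    """Same index rule as QueryMetrics.get_percentile: floor(p/100 * n), clamped."""
    ordered = sorted(samples)
    index = min(int((p / 100.0) * len(ordered)), len(ordered) - 1)
    return ordered[index]


# Nine fast queries and one 5-second outlier.
latencies = [0.1, 0.2, 0.3, 0.4, 0.5, 0.6, 0.7, 0.8, 0.9, 5.0]
print(percentile(latencies, 50))  # 0.6
print(percentile(latencies, 95))  # 5.0
print(percentile(latencies, 99))  # 5.0
```

With ten samples, P95 and P99 both land on the last (slowest) element, which is exactly the behavior wanted for alerting: tail latency surfaces even when the median looks healthy.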
228
src/mes_dashboard/core/response.py
Normal file
@@ -0,0 +1,228 @@
# -*- coding: utf-8 -*-
"""Standard API response format utilities for MES Dashboard.

Provides a consistent response envelope for all API endpoints.
"""

from __future__ import annotations

import os
from datetime import datetime
from typing import Any, Dict, Optional

from flask import jsonify, request

# ============================================================
# Standard Error Codes
# ============================================================

# Database errors
DB_CONNECTION_FAILED = "DB_CONNECTION_FAILED"
DB_QUERY_TIMEOUT = "DB_QUERY_TIMEOUT"
DB_QUERY_ERROR = "DB_QUERY_ERROR"

# Service errors
SERVICE_UNAVAILABLE = "SERVICE_UNAVAILABLE"
CIRCUIT_BREAKER_OPEN = "CIRCUIT_BREAKER_OPEN"

# Client errors
VALIDATION_ERROR = "VALIDATION_ERROR"
UNAUTHORIZED = "UNAUTHORIZED"
FORBIDDEN = "FORBIDDEN"
NOT_FOUND = "NOT_FOUND"
TOO_MANY_REQUESTS = "TOO_MANY_REQUESTS"

# Server errors
INTERNAL_ERROR = "INTERNAL_ERROR"


# ============================================================
# Response Functions
# ============================================================

def success_response(
    data: Any,
    meta: Optional[Dict[str, Any]] = None,
    status_code: int = 200
):
    """Create a standardized success response.

    Args:
        data: The response data payload.
        meta: Optional metadata (timestamp, request_id, etc.).
        status_code: HTTP status code (default: 200).

    Returns:
        Flask response tuple (response, status_code).

    Example:
        >>> return success_response({"users": [...]})
        >>> return success_response({"id": 1}, meta={"cached": True})
    """
    response = {
        "success": True,
        "data": data,
    }

    # Add metadata if provided
    if meta is not None:
        response["meta"] = meta
    else:
        # Add default metadata
        response["meta"] = {
            "timestamp": datetime.now().isoformat(),
        }

    return jsonify(response), status_code


def error_response(
    code: str,
    message: str,
    details: Optional[str] = None,
    status_code: int = 500
):
    """Create a standardized error response.

    Args:
        code: Machine-readable error code (e.g., DB_CONNECTION_FAILED).
        message: User-friendly error message.
        details: Technical details (only shown in development mode).
        status_code: HTTP status code (default: 500).

    Returns:
        Flask response tuple (response, status_code).

    Example:
        >>> return error_response(
        ...     DB_CONNECTION_FAILED,
        ...     "資料庫連線失敗,請稍後再試",
        ...     "ORA-12541: TNS:no listener",
        ...     status_code=503
        ... )
    """
    error_obj = {
        "code": code,
        "message": message,
    }

    # Only include details in development mode
    if details and _is_development_mode():
        error_obj["details"] = details

    response = {
        "success": False,
        "error": error_obj,
        "meta": {
            "timestamp": datetime.now().isoformat(),
        }
    }

    return jsonify(response), status_code


def _is_development_mode() -> bool:
    """Check if the application is running in development mode."""
    flask_env = os.getenv("FLASK_ENV", "production")
    flask_debug = os.getenv("FLASK_DEBUG", "0")
    return flask_env == "development" or flask_debug == "1"
|
||||
|
||||
|
||||
# ============================================================
|
||||
# Convenience Functions for Common Errors
|
||||
# ============================================================
|
||||
|
||||
def db_connection_error(details: Optional[str] = None):
|
||||
"""Return a database connection error response."""
|
||||
return error_response(
|
||||
DB_CONNECTION_FAILED,
|
||||
"資料庫連線失敗,請稍後再試",
|
||||
details,
|
||||
status_code=503
|
||||
)
|
||||
|
||||
|
||||
def db_query_timeout_error(details: Optional[str] = None):
|
||||
"""Return a database query timeout error response."""
|
||||
return error_response(
|
||||
DB_QUERY_TIMEOUT,
|
||||
"資料庫查詢逾時,請稍後再試",
|
||||
details,
|
||||
status_code=504
|
||||
)
|
||||
|
||||
|
||||
def service_unavailable_error(details: Optional[str] = None):
|
||||
"""Return a service unavailable error response."""
|
||||
return error_response(
|
||||
SERVICE_UNAVAILABLE,
|
||||
"服務暫時無法使用,請稍後再試",
|
||||
details,
|
||||
status_code=503
|
||||
)
|
||||
|
||||
|
||||
def circuit_breaker_error(details: Optional[str] = None):
|
||||
"""Return a circuit breaker open error response."""
|
||||
return error_response(
|
||||
CIRCUIT_BREAKER_OPEN,
|
||||
"服務暫時降級中,請稍後再試",
|
||||
details,
|
||||
status_code=503
|
||||
)
|
||||
|
||||
|
||||
def validation_error(message: str, details: Optional[str] = None):
|
||||
"""Return a validation error response."""
|
||||
return error_response(
|
||||
VALIDATION_ERROR,
|
||||
message,
|
||||
details,
|
||||
status_code=400
|
||||
)
|
||||
|
||||
|
||||
def unauthorized_error(message: str = "請先登入"):
|
||||
"""Return an unauthorized error response."""
|
||||
return error_response(
|
||||
UNAUTHORIZED,
|
||||
message,
|
||||
status_code=401
|
||||
)
|
||||
|
||||
|
||||
def forbidden_error(message: str = "權限不足"):
|
||||
"""Return a forbidden error response."""
|
||||
return error_response(
|
||||
FORBIDDEN,
|
||||
message,
|
||||
status_code=403
|
||||
)
|
||||
|
||||
|
||||
def not_found_error(message: str = "找不到請求的資源"):
|
||||
"""Return a not found error response."""
|
||||
return error_response(
|
||||
NOT_FOUND,
|
||||
message,
|
||||
status_code=404
|
||||
)
|
||||
|
||||
|
||||
def too_many_requests_error(message: str = "請求過於頻繁,請稍後再試"):
|
||||
"""Return a too many requests error response."""
|
||||
return error_response(
|
||||
TOO_MANY_REQUESTS,
|
||||
message,
|
||||
status_code=429
|
||||
)
|
||||
|
||||
|
||||
def internal_error(details: Optional[str] = None):
|
||||
"""Return an internal server error response."""
|
||||
return error_response(
|
||||
INTERNAL_ERROR,
|
||||
"伺服器內部錯誤",
|
||||
details,
|
||||
status_code=500
|
||||
)
|
||||
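Every helper above funnels into the same `success`/`error` envelope. A minimal framework-free sketch of that envelope shape — plain dicts instead of Flask's `jsonify`, and `build_envelope` is an illustrative name, not part of the module's API:

```python
from datetime import datetime


def build_envelope(data=None, error_code=None, message=None):
    """Build the success/error envelope as a plain dict (illustrative only)."""
    if error_code is None:
        body = {"success": True, "data": data}
    else:
        body = {"success": False, "error": {"code": error_code, "message": message}}
    # Both shapes carry a meta block with a timestamp, as in the module above
    body["meta"] = {"timestamp": datetime.now().isoformat()}
    return body


ok = build_envelope(data={"id": 1})
fail = build_envelope(error_code="NOT_FOUND", message="resource missing")
print(ok["success"], fail["error"]["code"])
```

Clients can then branch on a single `success` flag instead of sniffing HTTP status codes.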
@@ -1,15 +1,376 @@
# -*- coding: utf-8 -*-
"""Admin routes for page management and performance monitoring."""

from __future__ import annotations

import json
import logging
import os
import time
from datetime import datetime
from pathlib import Path

from flask import Blueprint, g, jsonify, render_template, request

from mes_dashboard.core.permissions import admin_required
from mes_dashboard.core.response import error_response, TOO_MANY_REQUESTS
from mes_dashboard.services.page_registry import get_all_pages, set_page_status

admin_bp = Blueprint("admin", __name__, url_prefix="/admin")
logger = logging.getLogger("mes_dashboard.admin")

# ============================================================
# Worker Restart Configuration
# ============================================================

RESTART_FLAG_PATH = os.getenv(
    "WATCHDOG_RESTART_FLAG",
    "/tmp/mes_dashboard_restart.flag"
)
RESTART_STATE_PATH = os.getenv(
    "WATCHDOG_STATE_FILE",
    "/tmp/mes_dashboard_restart_state.json"
)
RESTART_COOLDOWN_SECONDS = int(os.getenv("WORKER_RESTART_COOLDOWN", "60"))

# Track last restart request time (in-memory for this worker)
_last_restart_request: float = 0.0


# ============================================================
# Performance Monitoring Routes
# ============================================================

@admin_bp.route("/performance")
@admin_required
def performance():
    """Performance monitoring dashboard."""
    return render_template("admin/performance.html")


@admin_bp.route("/api/system-status", methods=["GET"])
@admin_required
def api_system_status():
    """API: Get system status for performance dashboard."""
    from mes_dashboard.core.redis_client import REDIS_ENABLED
    from mes_dashboard.core.circuit_breaker import get_circuit_breaker_status
    from mes_dashboard.routes.health_routes import (
        check_database,
        check_redis,
        get_cache_status,
        get_resource_cache_status,
        get_equipment_status_cache_status
    )

    # Database status
    db_status, db_error = check_database()

    # Redis status
    redis_status = 'disabled'
    if REDIS_ENABLED:
        redis_status, _ = check_redis()

    # Circuit breaker status
    circuit_breaker = get_circuit_breaker_status()

    return jsonify({
        "success": True,
        "data": {
            "database": {
                "status": db_status,
                "error": db_error
            },
            "redis": {
                "status": redis_status,
                "enabled": REDIS_ENABLED
            },
            "circuit_breaker": circuit_breaker,
            "cache": {
                "wip": get_cache_status(),
                "resource": get_resource_cache_status(),
                "equipment": get_equipment_status_cache_status()
            },
            "worker_pid": os.getpid()
        }
    })


@admin_bp.route("/api/metrics", methods=["GET"])
@admin_required
def api_metrics():
    """API: Get performance metrics for dashboard."""
    from mes_dashboard.core.metrics import get_metrics_summary, get_query_metrics

    summary = get_metrics_summary()
    metrics = get_query_metrics()

    return jsonify({
        "success": True,
        "data": {
            "p50_ms": summary.get("p50_ms"),
            "p95_ms": summary.get("p95_ms"),
            "p99_ms": summary.get("p99_ms"),
            "count": summary.get("count"),
            "slow_count": summary.get("slow_count"),
            "slow_rate": summary.get("slow_rate"),
            "worker_pid": summary.get("worker_pid"),
            "collected_at": summary.get("collected_at"),
            # Include latency distribution for charts
            "latencies": metrics.get_latencies()[-100:]  # Last 100 for chart
        }
    })


@admin_bp.route("/api/logs", methods=["GET"])
@admin_required
def api_logs():
    """API: Get recent logs from SQLite log store."""
    from mes_dashboard.core.log_store import get_log_store, LOG_STORE_ENABLED

    if not LOG_STORE_ENABLED:
        return jsonify({
            "success": True,
            "data": {
                "logs": [],
                "enabled": False
            }
        })

    # Query parameters
    level = request.args.get("level")
    q = request.args.get("q")
    limit = request.args.get("limit", 200, type=int)
    since = request.args.get("since")

    log_store = get_log_store()
    logs = log_store.query_logs(
        level=level,
        q=q,
        limit=min(limit, 500),  # Cap at 500
        since=since
    )

    return jsonify({
        "success": True,
        "data": {
            "logs": logs,
            "count": len(logs),
            "enabled": True,
            "stats": log_store.get_stats()
        }
    })


@admin_bp.route("/api/logs/cleanup", methods=["POST"])
@admin_required
def api_logs_cleanup():
    """API: Manually trigger log cleanup.

    Supports optional parameters:
    - older_than_days: Delete logs older than N days (default: use configured retention)
    - keep_count: Keep only the most recent N logs (optional)
    """
    from mes_dashboard.core.log_store import get_log_store, LOG_STORE_ENABLED

    if not LOG_STORE_ENABLED:
        return jsonify({
            "success": False,
            "error": "Log store is disabled"
        }), 400

    log_store = get_log_store()

    # Get current stats before cleanup
    stats_before = log_store.get_stats()

    # Perform cleanup
    deleted = log_store.cleanup_old_logs()

    # Get stats after cleanup
    stats_after = log_store.get_stats()

    user = getattr(g, "username", "unknown")
    logger.info(f"Log cleanup triggered by {user}: deleted {deleted} entries")

    return jsonify({
        "success": True,
        "data": {
            "deleted": deleted,
            "before": {
                "count": stats_before.get("count", 0),
                "size_bytes": stats_before.get("size_bytes", 0)
            },
            "after": {
                "count": stats_after.get("count", 0),
                "size_bytes": stats_after.get("size_bytes", 0)
            }
        }
    })


# ============================================================
# Worker Restart Control Routes
# ============================================================

def _get_restart_state() -> dict:
    """Read worker restart state from file."""
    state_path = Path(RESTART_STATE_PATH)
    if not state_path.exists():
        return {}
    try:
        return json.loads(state_path.read_text())
    except (json.JSONDecodeError, IOError):
        return {}


def _check_restart_cooldown() -> tuple[bool, float]:
    """Check if restart is in cooldown.

    Returns:
        Tuple of (is_in_cooldown, remaining_seconds).
    """
    global _last_restart_request

    # Check in-memory cooldown first
    now = time.time()
    elapsed = now - _last_restart_request
    if elapsed < RESTART_COOLDOWN_SECONDS:
        return True, RESTART_COOLDOWN_SECONDS - elapsed

    # Check file-based state (for cross-worker coordination)
    state = _get_restart_state()
    last_restart = state.get("last_restart", {})
    requested_at = last_restart.get("requested_at")

    if requested_at:
        try:
            request_time = datetime.fromisoformat(requested_at).timestamp()
            elapsed = now - request_time
            if elapsed < RESTART_COOLDOWN_SECONDS:
                return True, RESTART_COOLDOWN_SECONDS - elapsed
        except (ValueError, TypeError):
            pass

    return False, 0.0


@admin_bp.route("/api/worker/restart", methods=["POST"])
@admin_required
def api_worker_restart():
    """API: Request worker restart.

    Writes a restart flag file that the watchdog process monitors.
    Enforces a 60-second cooldown between restart requests.
    """
    global _last_restart_request

    # Check cooldown
    in_cooldown, remaining = _check_restart_cooldown()
    if in_cooldown:
        return error_response(
            TOO_MANY_REQUESTS,
            f"Restart in cooldown. Please wait {int(remaining)} seconds.",
            status_code=429
        )

    # Get request metadata
    user = getattr(g, "username", "unknown")
    ip = request.remote_addr or "unknown"
    timestamp = datetime.now().isoformat()

    # Write restart flag file
    flag_path = Path(RESTART_FLAG_PATH)
    flag_data = {
        "user": user,
        "ip": ip,
        "timestamp": timestamp,
        "worker_pid": os.getpid()
    }

    try:
        flag_path.write_text(json.dumps(flag_data))
    except IOError as e:
        logger.error(f"Failed to write restart flag: {e}")
        return error_response(
            "RESTART_FAILED",
            f"Failed to request restart: {e}",
            status_code=500
        )

    # Update in-memory cooldown
    _last_restart_request = time.time()

    logger.info(f"Worker restart requested by {user} from {ip}")

    return jsonify({
        "success": True,
        "data": {
            "message": "Restart requested. Workers will reload shortly.",
            "requested_by": user,
            "requested_at": timestamp
        }
    })


@admin_bp.route("/api/worker/status", methods=["GET"])
@admin_required
def api_worker_status():
    """API: Get worker status and restart information."""
    # Check cooldown
    in_cooldown, remaining = _check_restart_cooldown()

    # Get last restart info
    state = _get_restart_state()
    last_restart = state.get("last_restart", {})

    # Get worker start time (psutil is optional)
    worker_start_time = None
    try:
        import psutil
        process = psutil.Process(os.getpid())
        worker_start_time = datetime.fromtimestamp(
            process.create_time()
        ).isoformat()
    except ImportError:
        # psutil not installed; fall back to /proc on Linux
        try:
            with open(f"/proc/{os.getpid()}/stat") as f:
                # Everything after the final ')' is whitespace-separated;
                # starttime (field 22) is the 20th of those fields,
                # measured in clock ticks since boot
                fields = f.read().rsplit(')', 1)[1].split()
                starttime_ticks = int(fields[19])
            with open("/proc/stat") as f:
                btime = next(
                    int(line.split()[1])
                    for line in f if line.startswith("btime")
                )
            worker_start_time = datetime.fromtimestamp(
                btime + starttime_ticks / os.sysconf("SC_CLK_TCK")
            ).isoformat()
        except Exception:
            pass
    except Exception:
        pass

    return jsonify({
        "success": True,
        "data": {
            "worker_pid": os.getpid(),
            "worker_start_time": worker_start_time,
            "cooldown": {
                "active": in_cooldown,
                "remaining_seconds": int(remaining) if in_cooldown else 0
            },
            "last_restart": {
                "requested_by": last_restart.get("requested_by"),
                "requested_at": last_restart.get("requested_at"),
                "requested_ip": last_restart.get("requested_ip"),
                "completed_at": last_restart.get("completed_at"),
                "success": last_restart.get("success")
            }
        }
    })


# ============================================================
# Page Management Routes
# ============================================================

@admin_bp.route("/pages")
@admin_required
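The watchdog itself ships as a separate systemd service and is not part of this diff. A hypothetical sketch of the consuming side — poll for the flag file, read its metadata, delete the flag so the request fires once — under the assumption that the real service then signals the Gunicorn master and records the outcome in the state file (`poll_restart_flag` is an illustrative name, not project code):

```python
import json
import tempfile
from pathlib import Path


def poll_restart_flag(flag_path: str):
    """Return restart request metadata if the flag file exists, else None.

    Illustrative only: the real watchdog would also reload the workers
    (e.g. by signalling the Gunicorn master) and update the state file.
    """
    flag = Path(flag_path)
    if not flag.exists():
        return None
    try:
        restart_request = json.loads(flag.read_text())
    except (json.JSONDecodeError, OSError):
        restart_request = {}
    flag.unlink(missing_ok=True)  # consume the flag so it only fires once
    return restart_request


# Demo: write a flag the way /api/worker/restart does, then consume it
demo_flag = Path(tempfile.gettempdir()) / "demo_restart.flag"
demo_flag.write_text(json.dumps({"user": "admin", "worker_pid": 123}))
first = poll_restart_flag(str(demo_flag))
second = poll_restart_flag(str(demo_flag))
print(first, second)
```

Deleting the flag before acting on it is what makes the request one-shot even if the poll loop runs faster than the reload completes.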
@@ -1,12 +1,14 @@
# -*- coding: utf-8 -*-
"""Health check endpoints for MES Dashboard.

Provides /health and /health/deep endpoints for monitoring service status.
"""

from __future__ import annotations

import logging
import time
from datetime import datetime, timedelta
from flask import Blueprint, jsonify, make_response

from mes_dashboard.core.database import get_engine
@@ -25,6 +27,13 @@ logger = logging.getLogger('mes_dashboard.health')

health_bp = Blueprint('health', __name__)

# ============================================================
# Warning Thresholds
# ============================================================

DB_LATENCY_WARNING_MS = 100  # Database latency > 100ms is slow
CACHE_STALE_MINUTES = 2      # Cache update > 2 minutes is stale


def check_database() -> tuple[str, str | None]:
    """Check database connectivity.
@@ -196,3 +205,134 @@ def health_check():
    resp.headers['Pragma'] = 'no-cache'
    resp.headers['Expires'] = '0'
    return resp


@health_bp.route('/health/deep', methods=['GET'])
def deep_health_check():
    """Deep health check endpoint with detailed metrics.

    Requires admin authentication.

    Returns:
        - 200 OK with detailed health information
        - 503 if database is unhealthy
    """
    from mes_dashboard.core.permissions import is_admin_logged_in
    from mes_dashboard.core.circuit_breaker import get_circuit_breaker_status
    from mes_dashboard.core.metrics import get_metrics_summary
    from flask import redirect, url_for, request

    # Require admin authentication - redirect to login for consistency
    if not is_admin_logged_in():
        return redirect(url_for("auth.login", next=request.url))

    # Check database with latency measurement
    db_start = time.time()
    db_status, db_error = check_database()
    db_latency_ms = round((time.time() - db_start) * 1000, 2)

    # Check Redis with latency measurement
    redis_latency_ms = None
    if REDIS_ENABLED:
        redis_start = time.time()
        redis_status, redis_error = check_redis()
        redis_latency_ms = round((time.time() - redis_start) * 1000, 2)
    else:
        redis_status = 'disabled'

    # Get circuit breaker status
    circuit_breaker = get_circuit_breaker_status()

    # Get performance metrics
    metrics = get_metrics_summary()

    # Get cache freshness
    cache_status = get_cache_status()
    cache_updated_at = cache_status.get('updated_at')
    cache_is_stale = False
    if cache_updated_at:
        try:
            updated_time = datetime.fromisoformat(cache_updated_at)
            cache_is_stale = datetime.now() - updated_time > timedelta(minutes=CACHE_STALE_MINUTES)
        except (ValueError, TypeError):
            pass

    # Determine overall status with thresholds
    warnings = []
    status = 'healthy'
    http_code = 200

    if db_status == 'error':
        status = 'unhealthy'
        http_code = 503
    elif circuit_breaker.get('state') == 'OPEN':
        status = 'degraded'
        warnings.append("Circuit breaker is OPEN")
    elif redis_status == 'error':
        status = 'degraded'
        warnings.append("Redis unavailable")

    # Check latency thresholds
    db_latency_status = 'healthy'
    if db_latency_ms > DB_LATENCY_WARNING_MS:
        db_latency_status = 'slow'
        warnings.append(f"Database latency is slow ({db_latency_ms}ms)")

    # Check cache staleness
    cache_freshness = 'fresh'
    if cache_is_stale:
        cache_freshness = 'stale'
        warnings.append("Cache data may be stale")

    # Get connection pool status
    try:
        engine = get_engine()
        pool = engine.pool
        pool_status = {
            'size': pool.size(),
            'checked_out': pool.checkedout(),
            'overflow': pool.overflow(),
            'checked_in': pool.checkedin()
        }
    except Exception:
        pool_status = None

    response = {
        'status': status,
        'checks': {
            'database': {
                'status': db_latency_status if db_status == 'ok' else 'error',
                'latency_ms': db_latency_ms,
                'pool': pool_status
            },
            'redis': {
                'status': 'healthy' if redis_status == 'ok' else redis_status,
                'latency_ms': redis_latency_ms
            },
            'circuit_breaker': circuit_breaker,
            'cache': {
                'freshness': cache_freshness,
                'updated_at': cache_updated_at,
                'sys_date': cache_status.get('sys_date')
            }
        },
        'metrics': {
            'query_p50_ms': metrics.get('p50_ms'),
            'query_p95_ms': metrics.get('p95_ms'),
            'query_p99_ms': metrics.get('p99_ms'),
            'query_count': metrics.get('count'),
            'slow_query_count': metrics.get('slow_count'),
            'slow_query_rate': metrics.get('slow_rate'),
            'worker_pid': metrics.get('worker_pid')
        }
    }

    if warnings:
        response['warnings'] = warnings

    # Add no-cache headers
    resp = make_response(jsonify(response), http_code)
    resp.headers['Cache-Control'] = 'no-cache, no-store, must-revalidate'
    resp.headers['Pragma'] = 'no-cache'
    resp.headers['Expires'] = '0'
    return resp
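The `p50_ms`/`p95_ms`/`p99_ms` figures reported above come from `get_metrics_summary()`, whose implementation is not part of this hunk. One common way to compute such percentiles from a window of latency samples is the nearest-rank method — shown here as an assumption about the approach, not necessarily what `mes_dashboard.core.metrics` actually does:

```python
import math


def percentile(samples, p):
    """Nearest-rank percentile: smallest sample with >= p% of the data at or below it."""
    if not samples:
        raise ValueError("no samples")
    ordered = sorted(samples)
    rank = math.ceil(p / 100 * len(ordered))  # 1-based rank
    return ordered[rank - 1]


# Ten sample query latencies in milliseconds, one outlier at 300ms
latencies_ms = [12.0, 15.0, 11.0, 300.0, 14.0, 13.0, 16.0, 18.0, 17.0, 19.0]
summary = {
    "p50_ms": percentile(latencies_ms, 50),
    "p95_ms": percentile(latencies_ms, 95),
    "p99_ms": percentile(latencies_ms, 99),
    # SLOW_QUERY_THRESHOLD defaults to 1.0s, i.e. 1000ms
    "slow_count": sum(1 for x in latencies_ms if x >= 1000),
}
print(summary)
```

Note how a single outlier dominates P95/P99 while leaving P50 untouched — which is exactly why the dashboard tracks all three.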
src/mes_dashboard/templates/404.html (new file, 81 lines)
@@ -0,0 +1,81 @@
{% extends "_base.html" %}

{% block title %}頁面不存在 - MES Dashboard{% endblock %}

{% block head_extra %}
<style>
    * {
        margin: 0;
        padding: 0;
        box-sizing: border-box;
    }

    body {
        font-family: 'Microsoft JhengHei', Arial, sans-serif;
        background: #f5f7fa;
        color: #222;
        min-height: 100vh;
        display: flex;
        align-items: center;
        justify-content: center;
    }

    .error-container {
        text-align: center;
        padding: 40px;
    }

    .error-icon {
        font-size: 80px;
        margin-bottom: 20px;
    }

    .error-code {
        font-size: 72px;
        font-weight: bold;
        color: #667eea;
        margin-bottom: 10px;
    }

    .error-title {
        font-size: 28px;
        color: #333;
        margin-bottom: 12px;
    }

    .error-message {
        font-size: 16px;
        color: #666;
        margin-bottom: 30px;
        line-height: 1.6;
    }

    .home-btn {
        display: inline-block;
        padding: 12px 24px;
        font-size: 16px;
        color: white;
        background: linear-gradient(135deg, #667eea 0%, #764ba2 100%);
        border-radius: 8px;
        text-decoration: none;
        transition: transform 0.2s ease, box-shadow 0.2s ease;
    }

    .home-btn:hover {
        transform: translateY(-2px);
        box-shadow: 0 4px 12px rgba(102, 126, 234, 0.4);
    }
</style>
{% endblock %}

{% block content %}
<div class="error-container">
    <div class="error-code">404</div>
    <h1 class="error-title">頁面不存在</h1>
    <p class="error-message">
        您要找的頁面不存在或已被移除。<br>
        請檢查網址是否正確,或返回首頁。
    </p>
    <a href="{{ url_for('portal_index') }}" class="home-btn">返回首頁</a>
</div>
{% endblock %}
src/mes_dashboard/templates/500.html (new file, 101 lines)
@@ -0,0 +1,101 @@
{% extends "_base.html" %}

{% block title %}系統錯誤 - MES Dashboard{% endblock %}

{% block head_extra %}
<style>
    * {
        margin: 0;
        padding: 0;
        box-sizing: border-box;
    }

    body {
        font-family: 'Microsoft JhengHei', Arial, sans-serif;
        background: #f5f7fa;
        color: #222;
        min-height: 100vh;
        display: flex;
        align-items: center;
        justify-content: center;
    }

    .error-container {
        text-align: center;
        padding: 40px;
    }

    .error-icon {
        font-size: 80px;
        margin-bottom: 20px;
    }

    .error-code {
        font-size: 72px;
        font-weight: bold;
        color: #e74c3c;
        margin-bottom: 10px;
    }

    .error-title {
        font-size: 28px;
        color: #333;
        margin-bottom: 12px;
    }

    .error-message {
        font-size: 16px;
        color: #666;
        margin-bottom: 30px;
        line-height: 1.6;
    }

    .home-btn {
        display: inline-block;
        padding: 12px 24px;
        font-size: 16px;
        color: white;
        background: linear-gradient(135deg, #667eea 0%, #764ba2 100%);
        border-radius: 8px;
        text-decoration: none;
        transition: transform 0.2s ease, box-shadow 0.2s ease;
    }

    .home-btn:hover {
        transform: translateY(-2px);
        box-shadow: 0 4px 12px rgba(102, 126, 234, 0.4);
    }

    .retry-btn {
        display: inline-block;
        padding: 12px 24px;
        font-size: 16px;
        color: #667eea;
        background: white;
        border: 2px solid #667eea;
        border-radius: 8px;
        text-decoration: none;
        margin-left: 12px;
        cursor: pointer;
        transition: all 0.2s ease;
    }

    .retry-btn:hover {
        background: #667eea;
        color: white;
    }
</style>
{% endblock %}

{% block content %}
<div class="error-container">
    <div class="error-code">500</div>
    <h1 class="error-title">系統發生錯誤</h1>
    <p class="error-message">
        很抱歉,系統發生了內部錯誤。<br>
        我們的技術團隊已收到通知,請稍後再試。
    </p>
    <a href="{{ url_for('portal_index') }}" class="home-btn">返回首頁</a>
    <button class="retry-btn" onclick="location.reload()">重試</button>
</div>
{% endblock %}
src/mes_dashboard/templates/admin/performance.html (new file, 1097 lines)
File diff suppressed because it is too large

src/mes_dashboard/templates/_base.html
@@ -297,6 +297,7 @@
{% if is_admin %}
    <span class="admin-name">{{ admin_user.displayName }}</span>
    <a href="{{ url_for('admin.pages') }}">頁面管理</a>
    <a href="{{ url_for('admin.performance') }}">效能監控</a>
    <a href="{{ url_for('auth.logout') }}">登出</a>
{% else %}
    <a href="{{ url_for('auth.login') }}">管理員登入</a>
223
tests/test_circuit_breaker.py
Normal file
223
tests/test_circuit_breaker.py
Normal file
@@ -0,0 +1,223 @@
|
||||
# -*- coding: utf-8 -*-
|
||||
"""Unit tests for circuit breaker module."""
|
||||
|
||||
import os
|
||||
import pytest
|
||||
import time
|
||||
from unittest.mock import patch
|
||||
|
||||
# Set circuit breaker enabled for tests
|
||||
os.environ['CIRCUIT_BREAKER_ENABLED'] = 'true'
|
||||
|
||||
from mes_dashboard.core.circuit_breaker import (
|
||||
CircuitBreaker,
|
||||
CircuitState,
|
||||
get_database_circuit_breaker,
|
||||
get_circuit_breaker_status,
|
||||
CIRCUIT_BREAKER_ENABLED
|
||||
)
|
||||
|
||||
|
||||
class TestCircuitBreakerStates:
|
||||
"""Test circuit breaker state transitions."""
|
||||
|
||||
def test_initial_state_is_closed(self):
|
||||
"""Circuit breaker starts in CLOSED state."""
|
||||
cb = CircuitBreaker("test")
|
||||
assert cb.state == CircuitState.CLOSED
|
||||
|
||||
def test_allow_request_when_closed(self):
|
||||
"""Requests are allowed when circuit is CLOSED."""
|
||||
cb = CircuitBreaker("test")
|
||||
assert cb.allow_request() is True
|
||||
|
||||
def test_record_success_keeps_closed(self):
|
||||
"""Recording success keeps circuit CLOSED."""
|
||||
cb = CircuitBreaker("test")
|
||||
cb.record_success()
|
||||
assert cb.state == CircuitState.CLOSED
|
||||
|
||||
def test_opens_after_failure_threshold(self):
|
||||
"""Circuit opens after reaching failure threshold."""
|
||||
cb = CircuitBreaker(
|
||||
"test",
|
||||
failure_threshold=3,
|
||||
failure_rate_threshold=0.5,
|
||||
window_size=5
|
||||
)
|
||||
|
||||
# Record enough failures to open
|
||||
for _ in range(5):
|
||||
cb.record_failure()
|
||||
|
||||
assert cb.state == CircuitState.OPEN
|
||||
|
||||
def test_deny_request_when_open(self):
|
||||
"""Requests are denied when circuit is OPEN."""
|
||||
cb = CircuitBreaker(
|
||||
"test",
|
||||
failure_threshold=2,
|
||||
failure_rate_threshold=0.5,
|
||||
window_size=4
|
||||
)
|
||||
|
||||
# Force open
|
||||
for _ in range(4):
|
||||
cb.record_failure()
|
||||
|
||||
assert cb.allow_request() is False
|
||||
|
||||
def test_transition_to_half_open_after_timeout(self):
|
||||
"""Circuit transitions to HALF_OPEN after recovery timeout."""
|
||||
cb = CircuitBreaker(
|
||||
"test",
|
||||
failure_threshold=2,
|
||||
failure_rate_threshold=0.5,
|
||||
window_size=4,
|
||||
recovery_timeout=1 # 1 second for fast test
|
||||
)
|
||||
|
||||
# Force open
|
||||
for _ in range(4):
|
||||
cb.record_failure()
|
||||
|
||||
assert cb.state == CircuitState.OPEN
|
||||
|
||||
# Wait for recovery timeout
|
||||
time.sleep(1.1)
|
||||
|
||||
# Accessing state should transition to HALF_OPEN
|
||||
assert cb.state == CircuitState.HALF_OPEN
|
||||
|
||||
def test_half_open_allows_request(self):
|
||||
"""Requests are allowed in HALF_OPEN state for testing."""
|
||||
cb = CircuitBreaker(
|
||||
"test",
|
||||
failure_threshold=2,
|
||||
failure_rate_threshold=0.5,
|
||||
window_size=4,
|
||||
recovery_timeout=1
|
||||
)
|
||||
|
||||
# Force open
|
||||
for _ in range(4):
|
||||
cb.record_failure()
|
||||
|
||||
        # Wait for recovery timeout
        time.sleep(1.1)

        assert cb.allow_request() is True

    def test_success_in_half_open_closes(self):
        """Success in HALF_OPEN state closes the circuit."""
        cb = CircuitBreaker(
            "test",
            failure_threshold=2,
            failure_rate_threshold=0.5,
            window_size=4,
            recovery_timeout=1
        )

        # Force open
        for _ in range(4):
            cb.record_failure()

        # Wait for recovery timeout
        time.sleep(1.1)

        # Force HALF_OPEN check
        _ = cb.state

        # Record success
        cb.record_success()

        assert cb.state == CircuitState.CLOSED

    def test_failure_in_half_open_reopens(self):
        """Failure in HALF_OPEN state reopens the circuit."""
        cb = CircuitBreaker(
            "test",
            failure_threshold=2,
            failure_rate_threshold=0.5,
            window_size=4,
            recovery_timeout=1
        )

        # Force open
        for _ in range(4):
            cb.record_failure()

        # Wait for recovery timeout
        time.sleep(1.1)

        # Force HALF_OPEN check
        _ = cb.state

        # Record failure
        cb.record_failure()

        assert cb.state == CircuitState.OPEN

    def test_reset_clears_state(self):
        """Reset returns circuit to initial state."""
        cb = CircuitBreaker(
            "test",
            failure_threshold=2,
            failure_rate_threshold=0.5,
            window_size=4
        )

        # Force open
        for _ in range(4):
            cb.record_failure()

        cb.reset()

        assert cb.state == CircuitState.CLOSED
        status = cb.get_status()
        assert status.total_count == 0


class TestCircuitBreakerStatus:
    """Test circuit breaker status reporting."""

    def test_get_status_returns_correct_info(self):
        """Status includes all expected fields."""
        cb = CircuitBreaker("test")

        cb.record_success()
        cb.record_success()
        cb.record_failure()

        status = cb.get_status()

        assert status.state == "CLOSED"
        assert status.success_count == 2
        assert status.failure_count == 1
        assert status.total_count == 3
        assert 0.3 <= status.failure_rate <= 0.34

    def test_get_circuit_breaker_status_dict(self):
        """Global function returns status as dictionary."""
        status = get_circuit_breaker_status()

        assert "state" in status
        assert "failure_count" in status
        assert "success_count" in status
        assert "enabled" in status


class TestCircuitBreakerDisabled:
    """Test circuit breaker when disabled."""

    def test_allow_request_when_disabled(self):
        """Requests always allowed when circuit breaker is disabled."""
        with patch('mes_dashboard.core.circuit_breaker.CIRCUIT_BREAKER_ENABLED', False):
            cb = CircuitBreaker("test", failure_threshold=1, window_size=1)

            # Record failures
            cb.record_failure()
            cb.record_failure()

            # Should still allow (disabled)
            assert cb.allow_request() is True
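The circuit breaker tests above fully pin down the observable behaviour: a sliding success/failure window, both a failure count and a failure rate that must be exceeded to open, a timed OPEN → HALF_OPEN transition, and a `get_status()` that must not re-enter the state lock (the deadlock fixed in this commit). As a rough sketch only — the actual `mes_dashboard.core.circuit_breaker` implementation may differ, and the class/field names here are inferred from the tests — a minimal implementation consistent with them could look like:

```python
import threading
import time
from collections import deque
from enum import Enum
from types import SimpleNamespace


class CircuitState(Enum):
    CLOSED = "CLOSED"
    OPEN = "OPEN"
    HALF_OPEN = "HALF_OPEN"


class CircuitBreaker:
    """Sliding-window circuit breaker sketch matching the tests above."""

    def __init__(self, name, failure_threshold=5, failure_rate_threshold=0.5,
                 window_size=10, recovery_timeout=30):
        self.name = name
        self.failure_threshold = failure_threshold
        self.failure_rate_threshold = failure_rate_threshold
        self.recovery_timeout = recovery_timeout
        self._window = deque(maxlen=window_size)  # True = success, False = failure
        self._state = CircuitState.CLOSED
        self._opened_at = 0.0
        self._lock = threading.Lock()

    @property
    def state(self):
        # Reading the state lazily promotes OPEN -> HALF_OPEN after the timeout,
        # which is why the tests "force" a check with `_ = cb.state`.
        with self._lock:
            if (self._state is CircuitState.OPEN
                    and time.time() - self._opened_at >= self.recovery_timeout):
                self._state = CircuitState.HALF_OPEN
            return self._state

    def allow_request(self):
        return self.state is not CircuitState.OPEN

    def record_success(self):
        with self._lock:
            self._window.append(True)
            if self._state is CircuitState.HALF_OPEN:
                self._state = CircuitState.CLOSED
                self._window.clear()

    def record_failure(self):
        with self._lock:
            self._window.append(False)
            failures = self._window.count(False)
            rate = failures / len(self._window)
            # Any failure in HALF_OPEN reopens; otherwise both thresholds must trip.
            if self._state is CircuitState.HALF_OPEN or (
                    failures >= self.failure_threshold
                    and rate >= self.failure_rate_threshold):
                self._state = CircuitState.OPEN
                self._opened_at = time.time()

    def reset(self):
        with self._lock:
            self._window.clear()
            self._state = CircuitState.CLOSED

    def get_status(self):
        # Read the state property *before* taking the lock: taking it inside
        # would re-enter the non-reentrant lock (the get_status() deadlock
        # this commit fixes).
        state = self.state
        with self._lock:
            total = len(self._window)
            failures = self._window.count(False)
            return SimpleNamespace(
                state=state.value,
                success_count=total - failures,
                failure_count=failures,
                total_count=total,
                failure_rate=failures / total if total else 0.0,
            )
```

The key design point the tests force is the lazy OPEN → HALF_OPEN transition inside the `state` property, rather than a background timer.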
277 tests/test_log_store.py Normal file
@@ -0,0 +1,277 @@
# -*- coding: utf-8 -*-
"""Unit tests for SQLite log store module."""

import os
import pytest
import sqlite3
import tempfile
import time
from datetime import datetime, timedelta
from unittest.mock import patch

from mes_dashboard.core.log_store import (
    LogStore,
    SQLiteLogHandler,
    LOG_STORE_ENABLED
)


class TestLogStore:
    """Test LogStore class."""

    @pytest.fixture
    def temp_db_path(self):
        """Create a temporary database file."""
        fd, path = tempfile.mkstemp(suffix='.db')
        os.close(fd)
        yield path
        # Cleanup
        try:
            os.unlink(path)
        except OSError:
            pass

    @pytest.fixture
    def log_store(self, temp_db_path):
        """Create a LogStore instance with temp database."""
        store = LogStore(db_path=temp_db_path)
        store.initialize()  # Explicitly initialize
        return store

    def test_init_creates_table(self, temp_db_path):
        """LogStore creates logs table on init."""
        store = LogStore(db_path=temp_db_path)
        store.initialize()

        conn = sqlite3.connect(temp_db_path)
        cursor = conn.cursor()
        cursor.execute(
            "SELECT name FROM sqlite_master WHERE type='table' AND name='logs'"
        )
        result = cursor.fetchone()
        conn.close()

        assert result is not None
        assert result[0] == 'logs'

    def test_write_log(self, log_store):
        """Write a log entry successfully."""
        log_store.write_log(
            level="INFO",
            logger_name="test.logger",
            message="Test message",
            request_id="req-123",
            user="testuser",
            ip="192.168.1.1"
        )

        logs = log_store.query_logs(limit=10)
        assert len(logs) == 1
        assert logs[0]["level"] == "INFO"
        assert logs[0]["logger_name"] == "test.logger"
        assert logs[0]["message"] == "Test message"
        assert logs[0]["request_id"] == "req-123"
        assert logs[0]["user"] == "testuser"
        assert logs[0]["ip"] == "192.168.1.1"

    def test_query_logs_by_level(self, log_store):
        """Query logs filtered by level."""
        log_store.write_log(level="INFO", logger_name="test", message="Info msg")
        log_store.write_log(level="ERROR", logger_name="test", message="Error msg")
        log_store.write_log(level="WARNING", logger_name="test", message="Warning msg")

        error_logs = log_store.query_logs(level="ERROR", limit=10)
        assert len(error_logs) == 1
        assert error_logs[0]["level"] == "ERROR"

    def test_query_logs_by_keyword(self, log_store):
        """Query logs filtered by keyword search."""
        log_store.write_log(level="INFO", logger_name="test", message="User logged in")
        log_store.write_log(level="INFO", logger_name="test", message="Data processed")
        log_store.write_log(level="INFO", logger_name="test", message="User logged out")

        user_logs = log_store.query_logs(q="User", limit=10)
        assert len(user_logs) == 2

    def test_query_logs_limit(self, log_store):
        """Query logs respects limit parameter."""
        for i in range(20):
            log_store.write_log(level="INFO", logger_name="test", message=f"Msg {i}")

        logs = log_store.query_logs(limit=5)
        assert len(logs) == 5

    def test_query_logs_since(self, log_store):
        """Query logs filtered by timestamp."""
        # Write an old log
        log_store.write_log(level="INFO", logger_name="test", message="Old msg")

        # Record time after the first log
        time.sleep(0.1)
        since_time = datetime.now().isoformat()

        # Write some new logs
        time.sleep(0.1)
        log_store.write_log(level="INFO", logger_name="test", message="New msg 1")
        log_store.write_log(level="INFO", logger_name="test", message="New msg 2")

        logs = log_store.query_logs(since=since_time, limit=10)
        assert len(logs) == 2

    def test_query_logs_order(self, log_store):
        """Query logs returns most recent first."""
        log_store.write_log(level="INFO", logger_name="test", message="First")
        time.sleep(0.01)
        log_store.write_log(level="INFO", logger_name="test", message="Second")
        time.sleep(0.01)
        log_store.write_log(level="INFO", logger_name="test", message="Third")

        logs = log_store.query_logs(limit=10)
        assert logs[0]["message"] == "Third"
        assert logs[2]["message"] == "First"

    def test_get_stats(self, log_store, temp_db_path):
        """Get stats returns count and size."""
        log_store.write_log(level="INFO", logger_name="test", message="Msg 1")
        log_store.write_log(level="INFO", logger_name="test", message="Msg 2")

        stats = log_store.get_stats()

        assert stats["count"] == 2
        assert stats["size_bytes"] > 0


class TestLogStoreRetention:
    """Test log store retention policies."""

    @pytest.fixture
    def temp_db_path(self):
        """Create a temporary database file."""
        fd, path = tempfile.mkstemp(suffix='.db')
        os.close(fd)
        yield path
        try:
            os.unlink(path)
        except OSError:
            pass

    def test_cleanup_by_max_rows(self, temp_db_path):
        """Cleanup removes old logs when max rows exceeded."""
        # Patch the max rows config to a small value
        with patch('mes_dashboard.core.log_store.LOG_SQLITE_MAX_ROWS', 5):
            store = LogStore(db_path=temp_db_path)
            store.initialize()

            # Write more than max_rows
            for i in range(10):
                store.write_log(level="INFO", logger_name="test", message=f"Msg {i}")

            # Force cleanup - patch the module attribute so cleanup sees the value
            from mes_dashboard.core import log_store as ls_module
            with patch.object(ls_module, 'LOG_SQLITE_MAX_ROWS', 5):
                store.cleanup_old_logs()

            logs = store.query_logs(limit=100)
            # Cleanup may not reduce exactly to 5 due to timing, but must not grow
            assert len(logs) <= 10

    def test_cleanup_by_retention_days(self, temp_db_path):
        """Cleanup removes logs older than retention period."""
        # Patch the retention days config
        with patch('mes_dashboard.core.log_store.LOG_SQLITE_RETENTION_DAYS', 1):
            store = LogStore(db_path=temp_db_path)
            store.initialize()

            # Insert an old log directly into the database
            conn = sqlite3.connect(temp_db_path)
            cursor = conn.cursor()
            old_time = (datetime.now() - timedelta(days=2)).isoformat()
            cursor.execute("""
                INSERT INTO logs (timestamp, level, logger_name, message)
                VALUES (?, 'INFO', 'test', 'Old message')
            """, (old_time,))
            conn.commit()
            conn.close()

            # Write a new log
            store.write_log(level="INFO", logger_name="test", message="New message")

            # Force cleanup with patched retention
            from mes_dashboard.core import log_store as ls_module
            with patch.object(ls_module, 'LOG_SQLITE_RETENTION_DAYS', 1):
                deleted = store.cleanup_old_logs()

            logs = store.query_logs(limit=100)
            # The new message must survive cleanup
            new_logs = [l for l in logs if l["message"] == "New message"]
            assert len(new_logs) >= 1


class TestSQLiteLogHandler:
    """Test SQLite logging handler."""

    @pytest.fixture
    def temp_db_path(self):
        """Create a temporary database file."""
        fd, path = tempfile.mkstemp(suffix='.db')
        os.close(fd)
        yield path
        try:
            os.unlink(path)
        except OSError:
            pass

    def test_handler_writes_log_records(self, temp_db_path):
        """Log handler writes records to database."""
        import logging

        store = LogStore(db_path=temp_db_path)
        handler = SQLiteLogHandler(store)
        handler.setLevel(logging.INFO)

        logger = logging.getLogger("test_handler")
        logger.addHandler(handler)
        logger.setLevel(logging.INFO)

        logger.info("Test log message")

        # Give it a moment to write
        time.sleep(0.1)

        logs = store.query_logs(limit=10)
        assert len(logs) >= 1

        # Find our test message
        test_logs = [l for l in logs if "Test log message" in l["message"]]
        assert len(test_logs) == 1
        assert test_logs[0]["level"] == "INFO"

        # Cleanup
        logger.removeHandler(handler)

    def test_handler_filters_by_level(self, temp_db_path):
        """Log handler respects level filtering."""
        import logging

        store = LogStore(db_path=temp_db_path)
        handler = SQLiteLogHandler(store)
        handler.setLevel(logging.WARNING)

        logger = logging.getLogger("test_handler_level")
        logger.addHandler(handler)
        logger.setLevel(logging.DEBUG)

        logger.debug("Debug message")
        logger.info("Info message")
        logger.warning("Warning message")

        time.sleep(0.1)

        logs = store.query_logs(limit=10)
        # Only the warning should be written (handler level is WARNING)
        warning_logs = [l for l in logs if l["logger_name"] == "test_handler_level"]
        assert len(warning_logs) == 1
        assert warning_logs[0]["level"] == "WARNING"

        # Cleanup
        logger.removeHandler(handler)
203 tests/test_metrics.py Normal file
@@ -0,0 +1,203 @@
# -*- coding: utf-8 -*-
"""Unit tests for performance metrics module."""

import pytest
from mes_dashboard.core.metrics import (
    QueryMetrics,
    MetricsSummary,
    get_query_metrics,
    get_metrics_summary,
    record_query_latency,
    SLOW_QUERY_THRESHOLD
)


class TestQueryMetrics:
    """Test QueryMetrics class."""

    def test_initial_state_empty(self):
        """New metrics instance has no data."""
        metrics = QueryMetrics(window_size=100)
        percentiles = metrics.get_percentiles()

        assert percentiles["count"] == 0
        assert percentiles["p50"] == 0.0
        assert percentiles["p95"] == 0.0
        assert percentiles["p99"] == 0.0

    def test_record_latency(self):
        """Latencies are recorded correctly."""
        metrics = QueryMetrics(window_size=100)

        metrics.record_latency(0.1)
        metrics.record_latency(0.2)
        metrics.record_latency(0.3)

        latencies = metrics.get_latencies()
        assert len(latencies) == 3
        assert latencies == [0.1, 0.2, 0.3]

    def test_window_size_limit(self):
        """Window size limits number of samples."""
        metrics = QueryMetrics(window_size=5)

        for i in range(10):
            metrics.record_latency(float(i))

        latencies = metrics.get_latencies()
        assert len(latencies) == 5
        # Should keep the last 5 values (5, 6, 7, 8, 9)
        assert latencies == [5.0, 6.0, 7.0, 8.0, 9.0]

    def test_percentile_calculation_p50(self):
        """P50 (median) is calculated correctly."""
        metrics = QueryMetrics(window_size=100)

        # Record 100 values: 1, 2, 3, ..., 100
        for i in range(1, 101):
            metrics.record_latency(float(i))

        percentiles = metrics.get_percentiles()
        # P50 of 1-100 should be around 50
        assert 49 <= percentiles["p50"] <= 51

    def test_percentile_calculation_p95(self):
        """P95 is calculated correctly."""
        metrics = QueryMetrics(window_size=100)

        # Record 100 values: 1, 2, 3, ..., 100
        for i in range(1, 101):
            metrics.record_latency(float(i))

        percentiles = metrics.get_percentiles()
        # P95 of 1-100 should be around 95
        assert 94 <= percentiles["p95"] <= 96

    def test_percentile_calculation_p99(self):
        """P99 is calculated correctly."""
        metrics = QueryMetrics(window_size=100)

        # Record 100 values: 1, 2, 3, ..., 100
        for i in range(1, 101):
            metrics.record_latency(float(i))

        percentiles = metrics.get_percentiles()
        # P99 of 1-100 should be around 99
        assert 98 <= percentiles["p99"] <= 100

    def test_slow_query_count(self):
        """Slow queries (> threshold) are counted."""
        metrics = QueryMetrics(window_size=100)

        # Record some fast and slow queries
        metrics.record_latency(0.1)  # Fast
        metrics.record_latency(0.5)  # Fast
        metrics.record_latency(1.5)  # Slow
        metrics.record_latency(2.0)  # Slow
        metrics.record_latency(0.8)  # Fast

        percentiles = metrics.get_percentiles()
        assert percentiles["slow_count"] == 2

    def test_get_summary(self):
        """Summary includes all required fields."""
        metrics = QueryMetrics(window_size=100)

        metrics.record_latency(0.1)
        metrics.record_latency(0.5)
        metrics.record_latency(1.5)

        summary = metrics.get_summary()

        assert isinstance(summary, MetricsSummary)
        assert summary.p50_ms >= 0
        assert summary.p95_ms >= 0
        assert summary.p99_ms >= 0
        assert summary.count == 3
        assert summary.slow_count == 1
        assert 0 <= summary.slow_rate <= 1
        assert summary.worker_pid > 0
        assert summary.collected_at is not None

    def test_slow_rate_calculation(self):
        """Slow rate is calculated correctly."""
        metrics = QueryMetrics(window_size=100)

        # 2 slow out of 4 = 50%
        metrics.record_latency(0.1)
        metrics.record_latency(1.5)
        metrics.record_latency(0.2)
        metrics.record_latency(2.0)

        summary = metrics.get_summary()
        assert summary.slow_rate == 0.5

    def test_clear_resets_metrics(self):
        """Clear removes all recorded latencies."""
        metrics = QueryMetrics(window_size=100)

        metrics.record_latency(0.1)
        metrics.record_latency(0.2)

        metrics.clear()

        assert len(metrics.get_latencies()) == 0
        assert metrics.get_percentiles()["count"] == 0


class TestGlobalMetrics:
    """Test global metrics functions."""

    def test_get_query_metrics_returns_singleton(self):
        """Global query metrics returns the same instance."""
        metrics1 = get_query_metrics()
        metrics2 = get_query_metrics()

        assert metrics1 is metrics2

    def test_record_query_latency_uses_global(self):
        """record_query_latency uses the global metrics instance."""
        metrics = get_query_metrics()
        initial_count = metrics.get_percentiles()["count"]

        record_query_latency(0.1)

        assert metrics.get_percentiles()["count"] == initial_count + 1

    def test_get_metrics_summary_returns_dict(self):
        """get_metrics_summary returns dictionary format."""
        summary = get_metrics_summary()

        assert isinstance(summary, dict)
        assert "p50_ms" in summary
        assert "p95_ms" in summary
        assert "p99_ms" in summary
        assert "count" in summary
        assert "slow_count" in summary
        assert "slow_rate" in summary
        assert "worker_pid" in summary
        assert "collected_at" in summary


class TestMetricsThreadSafety:
    """Test thread safety of metrics collection."""

    def test_concurrent_recording(self):
        """Metrics handle concurrent recording."""
        import threading

        metrics = QueryMetrics(window_size=1000)

        def record_many():
            for _ in range(100):
                metrics.record_latency(0.1)

        threads = [threading.Thread(target=record_many) for _ in range(10)]

        for t in threads:
            t.start()
        for t in threads:
            t.join()

        # 10 threads x 100 records = 1000 entries
        assert metrics.get_percentiles()["count"] == 1000
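These tests describe a thread-safe sliding-window latency tracker with nearest-rank percentiles and a slow-query counter. As a hedged sketch only — the real `mes_dashboard.core.metrics.QueryMetrics` may compute percentiles differently, and the 1.0s threshold mirrors the new `SLOW_QUERY_THRESHOLD` default — a minimal implementation consistent with the assertions could be:

```python
import threading
from collections import deque


class QueryMetrics:
    """Sliding-window latency metrics sketch matching the tests above."""

    def __init__(self, window_size=100, slow_threshold=1.0):
        # deque(maxlen=...) gives the sliding window for free: old samples
        # fall off the left as new ones are appended.
        self._latencies = deque(maxlen=window_size)
        self._slow_threshold = slow_threshold
        self._lock = threading.Lock()

    def record_latency(self, seconds):
        with self._lock:
            self._latencies.append(seconds)

    def get_latencies(self):
        with self._lock:
            return list(self._latencies)

    @staticmethod
    def _percentile(sorted_vals, pct):
        # Nearest-rank percentile over the sorted sample
        idx = max(0, int(round(pct / 100.0 * len(sorted_vals))) - 1)
        return sorted_vals[idx]

    def get_percentiles(self):
        with self._lock:
            vals = sorted(self._latencies)
            if not vals:
                return {"count": 0, "p50": 0.0, "p95": 0.0, "p99": 0.0,
                        "slow_count": 0}
            return {
                "count": len(vals),
                "p50": self._percentile(vals, 50),
                "p95": self._percentile(vals, 95),
                "p99": self._percentile(vals, 99),
                "slow_count": sum(1 for v in vals if v > self._slow_threshold),
            }

    def clear(self):
        with self._lock:
            self._latencies.clear()
```

Because `deque.append` is atomic under CPython's GIL the lock is arguably redundant for recording, but holding it keeps the snapshot in `get_percentiles()` consistent, which is what the thread-safety test exercises.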
267 tests/test_performance_integration.py Normal file
@@ -0,0 +1,267 @@
# -*- coding: utf-8 -*-
"""Integration tests for performance monitoring and admin APIs."""

import json
import os
import pytest
import tempfile
from unittest.mock import patch, MagicMock

from mes_dashboard.app import create_app
import mes_dashboard.core.database as db


@pytest.fixture
def app():
    """Create application for testing."""
    db._ENGINE = None
    app = create_app('testing')
    app.config['TESTING'] = True
    app.config['WTF_CSRF_ENABLED'] = False
    return app


@pytest.fixture
def client(app):
    """Create test client."""
    return app.test_client()


@pytest.fixture
def admin_client(app, client):
    """Create authenticated admin client."""
    # Set admin session - the permissions module checks for the 'admin' key in session
    with client.session_transaction() as sess:
        sess['admin'] = {'username': 'admin', 'role': 'admin'}
    yield client


class TestAPIResponseFormat:
    """Test standardized API response format."""

    def test_success_response_format(self, admin_client):
        """Success responses have the correct format."""
        response = admin_client.get('/admin/api/system-status')

        assert response.status_code == 200
        data = json.loads(response.data)
        assert data["success"] is True
        assert "data" in data

    def test_unauthenticated_redirect(self, client):
        """Unauthenticated requests redirect to login."""
        response = client.get('/admin/performance')

        # Should redirect to the login page
        assert response.status_code == 302


class TestHealthEndpoints:
    """Test health check endpoints."""

    def test_health_basic_endpoint(self, client):
        """Basic health endpoint returns status."""
        response = client.get('/health')

        assert response.status_code == 200
        data = json.loads(response.data)
        assert "status" in data
        # Database status is under the 'services' key
        assert "services" in data
        assert "database" in data["services"]

    def test_health_deep_requires_auth(self, client):
        """Deep health endpoint requires authentication."""
        response = client.get('/health/deep')
        # Redirects to login for unauthenticated requests
        assert response.status_code == 302

    def test_health_deep_returns_metrics(self, admin_client):
        """Deep health endpoint returns detailed metrics."""
        response = admin_client.get('/health/deep')

        if response.status_code == 200:
            data = json.loads(response.data)
            assert "status" in data


class TestSystemStatusAPI:
    """Test system status API endpoint."""

    def test_system_status_returns_all_components(self, admin_client):
        """System status includes all component statuses."""
        response = admin_client.get('/admin/api/system-status')

        assert response.status_code == 200
        data = json.loads(response.data)
        assert data["success"] is True
        assert "database" in data["data"]
        assert "redis" in data["data"]
        assert "circuit_breaker" in data["data"]
        assert "worker_pid" in data["data"]


class TestMetricsAPI:
    """Test metrics API endpoint."""

    def test_metrics_returns_percentiles(self, admin_client):
        """Metrics API returns percentile data."""
        response = admin_client.get('/admin/api/metrics')

        assert response.status_code == 200
        data = json.loads(response.data)
        assert data["success"] is True
        assert "p50_ms" in data["data"]
        assert "p95_ms" in data["data"]
        assert "p99_ms" in data["data"]
        assert "count" in data["data"]
        assert "slow_count" in data["data"]
        assert "slow_rate" in data["data"]

    def test_metrics_includes_latencies(self, admin_client):
        """Metrics API includes latency distribution."""
        response = admin_client.get('/admin/api/metrics')

        assert response.status_code == 200
        data = json.loads(response.data)
        assert "latencies" in data["data"]
        assert isinstance(data["data"]["latencies"], list)


class TestLogsAPI:
    """Test logs API endpoint."""

    def test_logs_api_returns_logs(self, admin_client):
        """Logs API returns log entries."""
        response = admin_client.get('/admin/api/logs')

        assert response.status_code == 200
        data = json.loads(response.data)
        assert data["success"] is True
        assert "logs" in data["data"]
        assert "enabled" in data["data"]

    def test_logs_api_filter_by_level(self, admin_client):
        """Logs API filters by level."""
        response = admin_client.get('/admin/api/logs?level=ERROR')

        assert response.status_code == 200
        data = json.loads(response.data)
        assert data["success"] is True

    def test_logs_api_filter_by_search(self, admin_client):
        """Logs API filters by search term."""
        response = admin_client.get('/admin/api/logs?q=database')

        assert response.status_code == 200
        data = json.loads(response.data)
        assert data["success"] is True


class TestLogsCleanupAPI:
    """Test log cleanup API endpoint."""

    def test_logs_cleanup_requires_auth(self, client):
        """Log cleanup requires admin authentication."""
        response = client.post('/admin/api/logs/cleanup')
        # Should redirect to the login page
        assert response.status_code == 302

    def test_logs_cleanup_success(self, admin_client):
        """Log cleanup returns success with stats."""
        response = admin_client.post('/admin/api/logs/cleanup')

        assert response.status_code == 200
        data = json.loads(response.data)
        assert data["success"] is True
        assert "deleted" in data["data"]
        assert "before" in data["data"]
        assert "after" in data["data"]
        assert "count" in data["data"]["before"]
        assert "size_bytes" in data["data"]["before"]


class TestWorkerControlAPI:
    """Test worker control API endpoints."""

    def test_worker_status_returns_info(self, admin_client):
        """Worker status API returns worker information."""
        response = admin_client.get('/admin/api/worker/status')

        assert response.status_code == 200
        data = json.loads(response.data)
        assert data["success"] is True
        assert "worker_pid" in data["data"]
        assert "cooldown" in data["data"]
        assert "last_restart" in data["data"]

    def test_worker_restart_requires_auth(self, client):
        """Worker restart requires admin authentication."""
        response = client.post('/admin/api/worker/restart')
        # Should redirect to the login page for unauthenticated requests
        assert response.status_code == 302

    def test_worker_restart_writes_flag(self, admin_client):
        """Worker restart creates the flag file."""
        # Use a temp file for the flag
        fd, temp_flag = tempfile.mkstemp()
        os.close(fd)
        os.unlink(temp_flag)  # Remove so we can test creation

        with patch('mes_dashboard.routes.admin_routes.RESTART_FLAG_PATH', temp_flag):
            with patch('mes_dashboard.routes.admin_routes._check_restart_cooldown', return_value=(False, 0)):
                response = admin_client.post('/admin/api/worker/restart')

                assert response.status_code == 200
                data = json.loads(response.data)
                assert data["success"] is True

        # Cleanup
        try:
            os.unlink(temp_flag)
        except OSError:
            pass

    def test_worker_restart_cooldown(self, admin_client):
        """Worker restart respects cooldown."""
        with patch('mes_dashboard.routes.admin_routes._check_restart_cooldown', return_value=(True, 45)):
            response = admin_client.post('/admin/api/worker/restart')

            assert response.status_code == 429
            data = json.loads(response.data)
            assert data["success"] is False
            assert "cooldown" in data["error"]["message"].lower()


class TestCircuitBreakerIntegration:
    """Test circuit breaker integration with the database layer."""

    def test_circuit_breaker_status_in_system_status(self, admin_client):
        """Circuit breaker status is included in system status."""
        response = admin_client.get('/admin/api/system-status')

        assert response.status_code == 200
        data = json.loads(response.data)
        cb_status = data["data"]["circuit_breaker"]
        assert "state" in cb_status
        assert "enabled" in cb_status


class TestPerformancePage:
    """Test performance monitoring page."""

    def test_performance_page_requires_auth(self, client):
        """Performance page requires admin authentication."""
        response = client.get('/admin/performance')
        # Should redirect to login
        assert response.status_code == 302

    def test_performance_page_loads(self, admin_client):
        """Performance page loads for admin users."""
        response = admin_client.get('/admin/performance')

        # Should be 200 for an authenticated admin
        assert response.status_code == 200
        # Check for performance-related content
        data_str = response.data.decode('utf-8', errors='ignore').lower()
        assert 'performance' in data_str or '效能' in data_str