feat(admin-perf): full Vue SPA migration + close slow-query/memory monitoring gaps

Remove Jinja2 template fallback (1249 lines) — /admin/performance now serves
Vue SPA exclusively via send_from_directory.

Backend:
- Add _SLOW_QUERY_WAITING counter with get_slow_query_waiting_count()
- Record slow-path latency in read_sql_df_slow/iter via record_query_latency()
- Extend metrics_history schema with slow_query_active, slow_query_waiting,
  worker_rss_bytes columns + ALTER TABLE migration for existing DBs
- Add cleanup_archive_logs() with configurable ARCHIVE_LOG_DIR/KEEP_COUNT
- Integrate archive cleanup into MetricsHistoryCollector 50-min cycle

Frontend:
- Add slow_query_active and slow_query_waiting StatCards to connection pool
- Add slow_query_active trend line to pool trend chart
- Add Worker memory (RSS MB) trend chart with preprocessing
- Update modernization gate check path to frontend style.css

Co-Authored-By: Claude Opus 4.6 <noreply@anthropic.com>
This commit is contained in:
egg
2026-02-26 09:48:54 +08:00
parent c6f982ae50
commit 07ced80fb0
22 changed files with 740 additions and 1456 deletions


@@ -0,0 +1,2 @@
schema: spec-driven
created: 2026-02-26


@@ -0,0 +1,81 @@
## Context
The 2026-02-25 server crash exposed a monitoring blind spot left by the pool-isolation change. event_fetcher and lineage_engine have already migrated to `read_sql_df_slow` (dedicated connections + a semaphore), but metrics_history snapshots only record pool-related metrics; slow-query concurrency, queue depth, and Worker RSS have no historical record at all.
Meanwhile, `/admin/performance` still keeps a 1249-line Jinja template as a fallback for the Vue SPA, even though the SPA is the only version in use (the build artifact exists at `static/dist/admin-performance.html`); maintaining two UIs adds cost, and the Jinja version lags far behind the SPA.
The `logs/archive/` directory accumulates rotated log files with no automatic cleanup; it is the only storage location that grows without bound.
## Goals / Non-Goals
**Goals:**
- Remove the Jinja fallback; make the Vue SPA the single architecture
- Make slow-query concurrency, queue depth, and Worker RSS observable as historical trend metrics
- Make P50/P95/P99 reflect all query paths (pool + slow path)
- Fix the unbounded growth of archive logs
**Non-Goals:**
- No changes to `/admin/pages` (still a Jinja template)
- No async job queue panel (P1, handled in a follow-up change)
- No event cache hit/miss counters (P2)
- No real-time alerting or webhook notification mechanism
## Decisions
### D1: SQLite schema migration strategy
**Choice**: run `ALTER TABLE ADD COLUMN` at startup, tolerating the "duplicate column" error (SQLite has no `ADD COLUMN IF NOT EXISTS`)
**Alternative**: a version table + migration scripts → over-engineered; SQLite retention is only 3 days and adding columns is backward compatible
**Rationale**: the new columns are nullable, so old rows read as NULL and existing queries are unaffected. MetricsHistoryStore.initialize() already runs CREATE TABLE IF NOT EXISTS at startup, so the ALTER TABLE statements integrate naturally.
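A minimal sketch of the tolerant migration, assuming the table is named `metrics_snapshots` as elsewhere in this change (the real logic lives in `MetricsHistoryStore.initialize()`):

```python
import sqlite3

# Column set from the spec; names/types match the metrics_history extension.
NEW_COLUMNS = {
    "slow_query_active": "INTEGER",
    "slow_query_waiting": "INTEGER",
    "worker_rss_bytes": "INTEGER",
}

def migrate_schema(conn: sqlite3.Connection) -> None:
    """Add missing columns, tolerating 'duplicate column' on re-runs."""
    for name, sql_type in NEW_COLUMNS.items():
        try:
            conn.execute(
                f"ALTER TABLE metrics_snapshots ADD COLUMN {name} {sql_type}"
            )
        except sqlite3.OperationalError as exc:
            if "duplicate column" not in str(exc).lower():
                raise  # a real error, not an already-applied migration

conn = sqlite3.connect(":memory:")
conn.execute("CREATE TABLE metrics_snapshots (ts TEXT)")
migrate_schema(conn)
migrate_schema(conn)  # idempotent: the second run is a no-op
cols = [row[1] for row in conn.execute("PRAGMA table_info(metrics_snapshots)")]
```

Running the migration twice demonstrates the idempotence that startup-time execution relies on.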
### D2: How RSS memory is obtained
**Choice**: `resource.getrusage(resource.RUSAGE_SELF).ru_maxrss * 1024` (Python stdlib; the unit is KB on Linux)
**Alternative A**: read VmRSS from `/proc/self/status` → platform-dependent, with parsing overhead
**Alternative B**: `psutil.Process().memory_info().rss` → requires a new external dependency
**Rationale**: the `resource` module is in the Python standard library, so no extra dependency. On Linux `ru_maxrss` is reported in KB; multiplying by 1024 converts to bytes.
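A small helper along these lines (the Linux KB scaling is from the decision above; the macOS branch is an extra safeguard, since `ru_maxrss` is reported in bytes there):

```python
import resource
import sys

def get_worker_rss_bytes() -> int:
    """Peak RSS of the current process, in bytes.

    ru_maxrss is kilobytes on Linux but bytes on macOS; the deployment
    target in this change is Linux.
    """
    rss = resource.getrusage(resource.RUSAGE_SELF).ru_maxrss
    return rss if sys.platform == "darwin" else rss * 1024

rss_bytes = get_worker_rss_bytes()
```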
### D3: Semaphore waiting-counter implementation
**Choice**: increment/decrement the `_SLOW_QUERY_WAITING` atomic counter before and after semaphore.acquire() in `read_sql_df_slow()`
**Flow**:
```
_SLOW_QUERY_WAITING += 1          # under _COUNTER_LOCK
try:
    acquired = semaphore.acquire(timeout=60)
finally:
    _SLOW_QUERY_WAITING -= 1      # under _COUNTER_LOCK; covers timeout too
if not acquired:
    raise RuntimeError("slow query semaphore timeout")
_SLOW_QUERY_ACTIVE += 1
try:
    ... # execute query
finally:
    _SLOW_QUERY_ACTIVE -= 1
    semaphore.release()
```
**Rationale**: consistent with the existing `_SLOW_QUERY_ACTIVE` pattern; both counters are protected by a threading.Lock.
### D4: Where archive log cleanup is integrated
**Choice**: integrate into the cleanup cycle of `MetricsHistoryCollector._run()` (every ~100 intervals ≈ 50 minutes)
**Alternative**: a standalone cron job → requires extra crontab configuration and is not self-contained
**Rationale**: a daemon thread already cleans up SQLite periodically; adding archive cleanup there is consistent and self-contained.
### D5: Safety of removing the Jinja fallback
**Choice**: remove the fallback outright; admin_routes.py only calls `send_from_directory(dist_dir, "admin-performance.html")`
**Rationale**:
- The Vue SPA build artifact already exists (`static/dist/admin-performance.html`, updated 2026-02-26)
- The `frontend/package.json` build script already includes the admin-performance entry
- The CI/deploy flow always runs `npx vite build`
- If the build fails, the existing `/health/frontend-shell` asset-readiness check detects it
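The fallback-free handler can be sketched as follows; the temp directory stands in for `static/dist`, and the route body is an assumption about how `admin_routes.py` wires it up:

```python
import tempfile
from pathlib import Path

from flask import Flask, send_from_directory

# Sketch only: dist_dir is a temp stand-in for static/dist.
dist_dir = Path(tempfile.mkdtemp())
(dist_dir / "admin-performance.html").write_text('<div id="app"></div>')

app = Flask(__name__)

@app.route("/admin/performance")
def performance():
    # No Jinja fallback: a missing build artifact surfaces as HTTP 404.
    return send_from_directory(dist_dir, "admin-performance.html")

client = app.test_client()
ok = client.get("/admin/performance")
(dist_dir / "admin-performance.html").unlink()
missing = client.get("/admin/performance")
```

Deleting the artifact and re-requesting shows the failure mode: a plain 404, which is exactly what the health check is meant to catch.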
## Risks / Trade-offs
- **[Risk] A failed build makes /admin/performance return 404** → mitigated by the existing `/health/frontend-shell` check plus deploy-script validation; removing the fallback surfaces the problem earlier rather than hiding it.
- **[Risk] ALTER TABLE can be slow on large SQLite tables** → metrics_history holds at most 50K rows; ALTER TABLE completes immediately.
- **[Trade-off] `ru_maxrss` is peak RSS, not current RSS** → on Linux `ru_maxrss` is the maximum RSS over the process lifetime. Reading VmRSS from `/proc/self/status` would give current RSS but requires file I/O. Given the 30-second collection interval, and that peak RSS better reflects memory pressure, this trade-off is accepted; switch to reading `/proc/self/status` if current RSS is ever needed.


@@ -0,0 +1,34 @@
## Why
The 2026-02-25 server crash exposed a critical blind spot in the admin performance page after the pool-isolation change: core metrics such as slow-query concurrency, slow-path latency, and Worker memory were neither collected nor displayed, so the system's real load before the crash was completely unobservable. At the same time, `/admin/performance` still kept a 1249-line Jinja template as a fallback, inconsistent with the completed Vue SPA migration and adding maintenance cost.
## What Changes
- **Remove** the Jinja template `templates/admin/performance.html`; the `/admin/performance` route serves the Vue SPA (`static/dist/admin-performance.html`) directly, with no fallback logic
- **Add** three columns, `slow_query_active`, `slow_query_waiting`, and `worker_rss_bytes`, to the `metrics_history.sqlite` snapshots, with a SQLite schema migration
- **Add** a semaphore waiting counter (`_SLOW_QUERY_WAITING`) tracking the number of threads waiting on the slow-query semaphore
- **Fix** `read_sql_df_slow()` and `read_sql_df_slow_iter()` to record query latency into `QueryMetrics`, so P50/P95/P99 reflect all query paths
- **Add** StatCards in the Vue SPA connection-pool section for "慢查詢執行中" (slow queries executing) and "慢查詢排隊中" (slow queries queued), a `slow_query_active` line in the pool trend chart, and a Worker memory trend chart
- **Add** automatic archive log cleanup, integrated into the existing `MetricsHistoryCollector` cleanup cycle
## Capabilities
### New Capabilities
- `slow-query-observability`: track slow-query concurrency, queue depth, and latency; write them to metrics history and show trends in the frontend
- `worker-memory-tracking`: track Worker RSS memory; write it to metrics history and show trends in the frontend
- `archive-log-rotation`: automatic cleanup for the logs/archive/ directory, preventing unbounded file growth
### Modified Capabilities
- `admin-performance-spa`: remove the Jinja template fallback and fully migrate to the Vue SPA; add slow-query and memory monitoring panels
- `metrics-history-trending`: extend the snapshot schema with slow_query_active, slow_query_waiting, and worker_rss_bytes
- `connection-pool-monitoring`: add the semaphore waiting counter; include slow-path latency in QueryMetrics
## Impact
- **Backend**: `core/database.py` (waiting counter + latency recording), `core/metrics_history.py` (schema extension + archive cleanup), `routes/admin_routes.py` (fallback removal)
- **Frontend**: `frontend/src/admin-performance/App.vue` (new panels + trend charts) → requires a rebuild
- **Deleted**: `templates/admin/performance.html` (1249 lines)
- **Data**: the existing `metrics_history.sqlite` needs ALTER TABLE to add columns (backward compatible; new columns are nullable)
- **Tests**: the existing `test_performance_integration.py` already exercises the SPA path and needs no changes; a schema-migration test must be added


@@ -0,0 +1,37 @@
## MODIFIED Requirements
### Requirement: Vue 3 SPA page replaces Jinja2 template
The `/admin/performance` route SHALL serve the Vite-built `admin-performance.html` static file directly. The Jinja2 template fallback SHALL be removed. If the SPA build artifact does not exist, the server SHALL return a standard HTTP error (no fallback rendering).
#### Scenario: Page loads as Vue SPA
- **WHEN** user navigates to `/admin/performance`
- **THEN** the server SHALL return the Vite-built `admin-performance.html` static file via `send_from_directory`
#### Scenario: Portal-shell integration
- **WHEN** the portal-shell renders `/admin/performance`
- **THEN** it SHALL load the page as a native Vue SPA (not an external iframe)
#### Scenario: Build artifact missing
- **WHEN** the SPA build artifact `admin-performance.html` does not exist in `static/dist/`
- **THEN** the server SHALL return an HTTP error (no Jinja2 fallback)
### Requirement: Connection pool panel
The dashboard SHALL display connection pool saturation as a GaugeBar and stat cards showing checked_out, checked_in, overflow, max_capacity, pool_size, pool_recycle, pool_timeout, direct connection count, slow_query_active, and slow_query_waiting.
#### Scenario: Pool under normal load
- **WHEN** pool saturation is below 80%
- **THEN** the GaugeBar SHALL display in a normal color (green/blue)
#### Scenario: Pool near saturation
- **WHEN** pool saturation exceeds 80%
- **THEN** the GaugeBar SHALL display in a warning color (yellow/orange/red)
#### Scenario: Slow query metrics displayed
- **WHEN** `db_pool.status` includes `slow_query_active` and `slow_query_waiting`
- **THEN** the panel SHALL display StatCards for both values
## REMOVED Requirements
### Requirement: Jinja2 template fallback for performance page
**Reason**: The Vue SPA is the sole UI. Maintaining a 1249-line Jinja template as fallback adds maintenance burden and feature divergence.
**Migration**: Delete `templates/admin/performance.html`. The route handler serves the SPA directly.


@@ -0,0 +1,30 @@
## ADDED Requirements
### Requirement: Automatic archive log cleanup
The system SHALL provide a `cleanup_archive_logs()` function in `core/metrics_history.py` that deletes old rotated log files from `logs/archive/`, keeping the most recent N files per log type (access, error, watchdog, rq_worker, startup).
#### Scenario: Cleanup keeps recent files
- **WHEN** `cleanup_archive_logs()` is called with `keep_per_type=20` and there are 30 access_*.log files
- **THEN** 10 oldest access_*.log files SHALL be deleted, keeping the 20 most recent by modification time
#### Scenario: No excess files
- **WHEN** `cleanup_archive_logs()` is called and each type has fewer than `keep_per_type` files
- **THEN** no files SHALL be deleted
#### Scenario: Archive directory missing
- **WHEN** `cleanup_archive_logs()` is called and the archive directory does not exist
- **THEN** the function SHALL return 0 without error
### Requirement: Archive cleanup integrated into collector cycle
The `MetricsHistoryCollector` SHALL call `cleanup_archive_logs()` alongside the existing SQLite cleanup, running approximately every 50 minutes (every 100 collection intervals).
#### Scenario: Periodic cleanup executes
- **WHEN** the cleanup counter reaches 100 intervals
- **THEN** both SQLite metrics cleanup and archive log cleanup SHALL execute
### Requirement: Archive cleanup configuration
The archive log cleanup SHALL be configurable via environment variables: `ARCHIVE_LOG_DIR` (default: `logs/archive`) and `ARCHIVE_LOG_KEEP_COUNT` (default: 20).
#### Scenario: Custom keep count
- **WHEN** `ARCHIVE_LOG_KEEP_COUNT=10` is set
- **THEN** cleanup SHALL keep only the 10 most recent files per type
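Under the configuration and scenarios above, the function might look like this sketch (the `<type>_*.log` glob pattern is an assumption about the rotated-file naming):

```python
import os
import tempfile
import time
from pathlib import Path

LOG_TYPES = ("access", "error", "watchdog", "rq_worker", "startup")

def cleanup_archive_logs(archive_dir=None, keep_per_type=None):
    """Delete old rotated logs, keeping the newest N per type; return count deleted."""
    archive = Path(archive_dir or os.environ.get("ARCHIVE_LOG_DIR", "logs/archive"))
    if keep_per_type is None:
        keep_per_type = int(os.environ.get("ARCHIVE_LOG_KEEP_COUNT", "20"))
    if not archive.is_dir():
        return 0  # missing directory is not an error
    deleted = 0
    for log_type in LOG_TYPES:
        files = sorted(
            archive.glob(f"{log_type}_*.log"),
            key=lambda p: p.stat().st_mtime,
            reverse=True,  # newest first
        )
        for old in files[keep_per_type:]:
            old.unlink()
            deleted += 1
    return deleted

# Demo: 30 access logs with distinct mtimes, keep 20 -> 10 deleted.
demo = Path(tempfile.mkdtemp())
for i in range(30):
    f = demo / f"access_{i:02d}.log"
    f.write_text("x")
    os.utime(f, (time.time() - i, time.time() - i))
removed = cleanup_archive_logs(str(demo), keep_per_type=20)
```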


@@ -0,0 +1,29 @@
## MODIFIED Requirements
### Requirement: Connection pool status in performance detail
The performance-detail API SHALL include `db_pool` section with `status` (checked_out, checked_in, overflow, max_capacity, saturation, slow_query_active, slow_query_waiting) from `get_pool_status()` and `config` (pool_size, max_overflow, pool_timeout, pool_recycle) from `get_pool_runtime_config()`.
#### Scenario: Pool status retrieved
- **WHEN** the API is called
- **THEN** `db_pool.status` SHALL contain current pool utilization metrics including `slow_query_active` and `slow_query_waiting`, and `db_pool.config` SHALL contain the pool configuration values
#### Scenario: Saturation calculation
- **WHEN** the pool has 8 checked_out connections and max_capacity is 30
- **THEN** saturation SHALL be reported as approximately 26.7%
#### Scenario: Slow query waiting included
- **WHEN** 2 threads are waiting for the slow query semaphore
- **THEN** `db_pool.status.slow_query_waiting` SHALL be 2
## ADDED Requirements
### Requirement: Slow-path query latency included in QueryMetrics
The `read_sql_df_slow()` and `read_sql_df_slow_iter()` functions SHALL call `record_query_latency()` with the total elapsed time upon completion, ensuring P50/P95/P99 percentiles reflect queries from all paths (pooled and slow/direct).
#### Scenario: Slow query latency recorded
- **WHEN** `read_sql_df_slow()` completes a query in 8.5 seconds
- **THEN** `record_query_latency(8.5)` SHALL be called and the value SHALL appear in subsequent `get_percentiles()` results
#### Scenario: Slow iter latency recorded
- **WHEN** `read_sql_df_slow_iter()` completes streaming in 45 seconds
- **THEN** `record_query_latency(45.0)` SHALL be called in the finally block


@@ -0,0 +1,54 @@
## MODIFIED Requirements
### Requirement: SQLite metrics history store
The system SHALL provide a `MetricsHistoryStore` class in `core/metrics_history.py` that persists metrics snapshots to a SQLite database (`logs/metrics_history.sqlite` by default). The store SHALL use thread-local connections and a write lock, following the `LogStore` pattern in `core/log_store.py`. The schema SHALL include columns for `slow_query_active` (INTEGER), `slow_query_waiting` (INTEGER), and `worker_rss_bytes` (INTEGER) in addition to the existing pool, Redis, route cache, and latency columns.
#### Scenario: Write and query snapshots
- **WHEN** `write_snapshot(data)` is called with pool/redis/route_cache/latency/slow_query/memory metrics
- **THEN** a row SHALL be inserted into `metrics_snapshots` with the current ISO 8601 timestamp, worker PID, and all metric columns
#### Scenario: Query by time range
- **WHEN** `query_snapshots(minutes=30)` is called
- **THEN** it SHALL return all rows from the last 30 minutes, ordered by timestamp ascending, including the new columns
#### Scenario: Retention cleanup
- **WHEN** `cleanup()` is called
- **THEN** rows older than `METRICS_HISTORY_RETENTION_DAYS` (default 3) SHALL be deleted, and total rows SHALL be capped at `METRICS_HISTORY_MAX_ROWS` (default 50000)
#### Scenario: Thread safety
- **WHEN** multiple threads write snapshots concurrently
- **THEN** the write lock SHALL serialize writes and prevent database corruption
#### Scenario: Schema migration for existing databases
- **WHEN** the store initializes on an existing database without the new columns
- **THEN** it SHALL execute ALTER TABLE ADD COLUMN for each missing column, tolerating "duplicate column" errors
### Requirement: Background metrics collector
The system SHALL provide a `MetricsHistoryCollector` class that runs a daemon thread collecting metrics snapshots at a configurable interval (default 30 seconds, via `METRICS_HISTORY_INTERVAL` env var). The collector SHALL include `slow_query_active`, `slow_query_waiting`, and `worker_rss_bytes` in each snapshot.
#### Scenario: Automatic collection
- **WHEN** the collector is started via `start_metrics_history(app)`
- **THEN** it SHALL collect pool status (including slow_query_active and slow_query_waiting), Redis info, route cache status, query latency metrics, and worker RSS memory every interval and write them to the store
#### Scenario: Graceful shutdown
- **WHEN** `stop_metrics_history()` is called
- **THEN** the collector thread SHALL stop within one interval period
#### Scenario: Subsystem unavailability
- **WHEN** a subsystem (e.g., Redis) is unavailable during collection
- **THEN** the collector SHALL write null/0 for those fields and continue collecting other metrics
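One way to satisfy this degradation behavior is a per-subsystem guard; the probe names and signatures here are hypothetical, not the collector's actual API:

```python
def _safe(probe, default=None):
    """Run one subsystem probe; on failure fall back so collection continues."""
    try:
        return probe()
    except Exception:
        return default

def collect_snapshot(pool_probe, redis_probe, rss_probe):
    # Hypothetical probes; the real collector reads get_pool_status(),
    # Redis INFO, and resource.getrusage().
    return {
        "slow_query_active": _safe(pool_probe, 0),
        "redis_used_memory": _safe(redis_probe),   # None -> NULL when Redis is down
        "worker_rss_bytes": _safe(rss_probe),
    }

def broken_redis():
    raise ConnectionError("redis unavailable")

snap = collect_snapshot(lambda: 3, broken_redis, lambda: 256 * 1024 * 1024)
```

A failing Redis probe yields a NULL column while the pool and RSS fields are still collected, matching the scenario above.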
### Requirement: Frontend trend charts
The system SHALL display 5 trend chart panels in the admin performance dashboard using vue-echarts VChart line/area charts: connection pool saturation, query latency (P50/P95/P99), Redis memory, cache hit rates, and worker memory.
#### Scenario: Trend charts with data
- **WHEN** historical snapshots contain more than 1 data point
- **THEN** the dashboard SHALL display trend charts for: connection pool saturation (including slow_query_active), query latency (P50/P95/P99), Redis memory, cache hit rates, and worker memory (RSS in MB)
#### Scenario: Trend charts without data
- **WHEN** historical snapshots are empty or contain only 1 data point
- **THEN** the trend charts SHALL NOT be displayed (hidden via `v-if`)
#### Scenario: Auto-refresh
- **WHEN** the dashboard auto-refreshes
- **THEN** historical data SHALL also be refreshed alongside real-time metrics


@@ -0,0 +1,49 @@
## ADDED Requirements
### Requirement: Slow query active count in metrics history snapshots
The `MetricsHistoryCollector` SHALL include `slow_query_active` in each 30-second snapshot, recording the number of slow queries currently executing via dedicated connections.
#### Scenario: Snapshot includes slow_query_active
- **WHEN** the collector writes a snapshot while 3 slow queries are executing
- **THEN** the `slow_query_active` column SHALL contain the value 3
#### Scenario: No slow queries active
- **WHEN** the collector writes a snapshot while no slow queries are executing
- **THEN** the `slow_query_active` column SHALL contain the value 0
### Requirement: Slow query waiting count tracked and persisted
The system SHALL maintain a thread-safe counter `_SLOW_QUERY_WAITING` in `database.py` that tracks the number of threads currently waiting to acquire the slow query semaphore. This counter SHALL be included in `get_pool_status()` and persisted to metrics history snapshots.
#### Scenario: Counter increments on semaphore wait
- **WHEN** a thread enters `read_sql_df_slow()` and the semaphore is full
- **THEN** `_SLOW_QUERY_WAITING` SHALL be incremented before `semaphore.acquire()` and decremented after acquire completes (success or timeout)
#### Scenario: Counter in pool status API
- **WHEN** `get_pool_status()` is called
- **THEN** the returned dict SHALL include `slow_query_waiting` with the current waiting thread count
#### Scenario: Counter persisted to metrics history
- **WHEN** the collector writes a snapshot
- **THEN** the `slow_query_waiting` column SHALL reflect the count at snapshot time
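A sketch of the counter and accessor under these scenarios (the semaphore capacity of 2 is arbitrary; the real limit is configuration in `database.py`):

```python
import threading

# Hypothetical module-level state mirroring database.py.
_COUNTER_LOCK = threading.Lock()
_SLOW_QUERY_WAITING = 0
_SLOW_QUERY_SEMAPHORE = threading.Semaphore(2)

def get_slow_query_waiting_count():
    with _COUNTER_LOCK:
        return _SLOW_QUERY_WAITING

def acquire_slow_slot(timeout=60):
    """Count this thread as waiting for the whole acquire, success or timeout."""
    global _SLOW_QUERY_WAITING
    with _COUNTER_LOCK:
        _SLOW_QUERY_WAITING += 1
    try:
        return _SLOW_QUERY_SEMAPHORE.acquire(timeout=timeout)
    finally:
        with _COUNTER_LOCK:
            _SLOW_QUERY_WAITING -= 1

acquired = acquire_slow_slot(timeout=1)
waiting_after = get_slow_query_waiting_count()
```

The decrement sits in a finally block so the counter drains correctly even when the acquire times out, as the first scenario requires.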
### Requirement: Slow-path query latency recorded in QueryMetrics
The `read_sql_df_slow()` and `read_sql_df_slow_iter()` functions SHALL call `record_query_latency()` with the elapsed query time, so that P50/P95/P99 metrics reflect all query paths (pool + slow).
#### Scenario: Slow query latency appears in percentiles
- **WHEN** a `read_sql_df_slow()` call completes in 5.2 seconds
- **THEN** `record_query_latency(5.2)` SHALL be called and the latency SHALL appear in subsequent `get_percentiles()` results
#### Scenario: Slow iter latency recorded on completion
- **WHEN** a `read_sql_df_slow_iter()` generator completes after yielding all batches in 120 seconds total
- **THEN** `record_query_latency(120.0)` SHALL be called in the finally block
### Requirement: Slow query metrics displayed in Vue SPA
The admin performance Vue SPA SHALL display `slow_query_active` and `slow_query_waiting` as StatCards in the connection pool panel, and include `slow_query_active` as a trend line in the connection pool trend chart.
#### Scenario: StatCards display current values
- **WHEN** the performance-detail API returns `db_pool.status.slow_query_active = 4` and `db_pool.status.slow_query_waiting = 2`
- **THEN** the connection pool panel SHALL display StatCards showing "慢查詢執行中: 4" and "慢查詢排隊中: 2"
#### Scenario: Trend chart includes slow_query_active
- **WHEN** historical snapshots contain `slow_query_active` data points
- **THEN** the connection pool trend chart SHALL include a "慢查詢執行中" line series


@@ -0,0 +1,23 @@
## ADDED Requirements
### Requirement: Worker RSS memory in metrics history snapshots
The `MetricsHistoryCollector` SHALL include `worker_rss_bytes` in each 30-second snapshot, recording the current worker process peak RSS memory using Python's `resource.getrusage()`.
#### Scenario: RSS recorded in snapshot
- **WHEN** the collector writes a snapshot and the worker process has 256 MB peak RSS
- **THEN** the `worker_rss_bytes` column SHALL contain approximately 268435456
#### Scenario: RSS collection failure
- **WHEN** `resource.getrusage()` raises an exception
- **THEN** the collector SHALL write NULL for `worker_rss_bytes` and continue collecting other metrics
### Requirement: Worker memory trend chart in Vue SPA
The admin performance Vue SPA SHALL display a "Worker 記憶體趨勢" TrendChart showing RSS memory over time in megabytes.
#### Scenario: Memory trend displayed
- **WHEN** historical snapshots contain `worker_rss_bytes` data with more than 1 data point
- **THEN** the dashboard SHALL display a TrendChart with RSS values converted to MB
#### Scenario: No memory data
- **WHEN** historical snapshots do not contain `worker_rss_bytes` data (all NULL)
- **THEN** the trend chart SHALL show "趨勢資料不足" message


@@ -0,0 +1,37 @@
## 1. Backend: semaphore waiting counter + slow-path latency
- [x] 1.1 Add the `_SLOW_QUERY_WAITING` counter and the `get_slow_query_waiting_count()` function in `src/mes_dashboard/core/database.py`
- [x] 1.2 Modify `read_sql_df_slow()` to increment/decrement `_SLOW_QUERY_WAITING` around semaphore.acquire()
- [x] 1.3 Modify `read_sql_df_slow_iter()` likewise to add the waiting-counter logic
- [x] 1.4 Add a `slow_query_waiting` field to the dict returned by `get_pool_status()`
- [x] 1.5 Call `record_query_latency(elapsed)` in the `read_sql_df_slow()` finally block
- [x] 1.6 Call `record_query_latency(elapsed)` in the `read_sql_df_slow_iter()` finally block
## 2. Backend: metrics_history schema extension + archive cleanup
- [x] 2.1 Add `slow_query_active INTEGER`, `slow_query_waiting INTEGER`, and `worker_rss_bytes INTEGER` columns to the schema in `src/mes_dashboard/core/metrics_history.py`
- [x] 2.2 Add the ALTER TABLE ADD COLUMN migration to `MetricsHistoryStore.initialize()` (tolerating duplicate column)
- [x] 2.3 Add the new columns to the `COLUMNS` list
- [x] 2.4 Read and INSERT the new columns in `write_snapshot()`
- [x] 2.5 Collect `slow_query_active` and `slow_query_waiting` (from `get_pool_status()`) and `worker_rss_bytes` (from `resource.getrusage()`) in `_collect_snapshot()`
- [x] 2.6 Add the `cleanup_archive_logs(archive_dir, keep_per_type)` function, configurable via the `ARCHIVE_LOG_DIR` and `ARCHIVE_LOG_KEEP_COUNT` env vars
- [x] 2.7 Call `cleanup_archive_logs()` in the `MetricsHistoryCollector._run()` cleanup cycle
## 3. Backend: remove the Jinja fallback
- [x] 3.1 Modify the `performance()` route in `src/mes_dashboard/routes/admin_routes.py` to drop the Jinja fallback logic (serve directly via `send_from_directory`)
- [x] 3.2 Delete `src/mes_dashboard/templates/admin/performance.html`
- [x] 3.3 Update the `/admin/performance` gate check in `scripts/check_full_modernization_gates.py` from the template path to `frontend/src/admin-performance/style.css`
## 4. Frontend: new monitoring panels in the Vue SPA
- [x] 4.1 Add `slow_query_active` and `slow_query_waiting` StatCards to the connection-pool section in `frontend/src/admin-performance/App.vue`
- [x] 4.2 Add a `slow_query_active` trend line to `poolTrendSeries`
- [x] 4.3 Add the `memoryTrendSeries` definition and the Worker memory TrendChart component
- [x] 4.4 Add `historyData` preprocessing: convert `worker_rss_bytes` to `worker_rss_mb`
## 5. Build + test verification
- [x] 5.1 Run `cd frontend && npx vite build` and confirm the build succeeds
- [x] 5.2 Run `python -m pytest tests/ -v --tb=short` and confirm existing tests pass
- [x] 5.3 Confirm `test_performance_page_loads` passes (SPA path verification)


@@ -1,100 +1,37 @@
## ADDED Requirements
### Requirement: Vue 3 SPA page replaces Jinja2 template
The `/admin/performance` route SHALL serve a Vue 3 SPA page built by Vite, replacing the existing Jinja2 server-rendered template. The SPA SHALL be registered as a Vite entry point and integrated into the portal-shell navigation as a `renderMode: 'native'` route.
#### Scenario: Page loads as Vue SPA
- **WHEN** user navigates to `/admin/performance`
- **THEN** the server SHALL return the Vite-built `admin-performance.html` static file (not a Jinja2 rendered template)
#### Scenario: Portal-shell integration
- **WHEN** the portal-shell renders `/admin/performance`
- **THEN** it SHALL load the page as a native Vue SPA (not an external iframe)
### Requirement: Status cards display system health
The dashboard SHALL display 4 status cards in a horizontal grid: Database, Redis, Circuit Breaker, and Worker PID. Each card SHALL show a StatusDot indicator (healthy/degraded/error/disabled) with the current status value.
#### Scenario: All systems healthy
- **WHEN** all backend systems report healthy status via `/admin/api/system-status`
- **THEN** all 4 status cards SHALL display green StatusDot indicators with their respective values
#### Scenario: Redis disabled
- **WHEN** Redis is disabled (`REDIS_ENABLED=false`)
- **THEN** the Redis status card SHALL display a disabled StatusDot indicator and the Redis cache panel SHALL show a graceful degradation message
### Requirement: Query performance panel with ECharts
The dashboard SHALL display query performance metrics (P50, P95, P99 latencies, total queries, slow queries) and an ECharts latency distribution chart, replacing the existing Chart.js implementation.
#### Scenario: Metrics loaded successfully
- **WHEN** `/admin/api/metrics` returns valid performance data
- **THEN** the panel SHALL display P50/P95/P99 latency values and render an ECharts bar chart showing latency distribution
#### Scenario: No metrics data
- **WHEN** `/admin/api/metrics` returns empty or null metrics
- **THEN** the panel SHALL display placeholder text indicating no data available
### Requirement: Redis cache detail panel
The dashboard SHALL display a Redis cache detail panel showing memory usage (as a GaugeBar), connected clients, hit rate percentage, peak memory, and a namespace key distribution table.
#### Scenario: Redis active with data
- **WHEN** `/admin/api/performance-detail` returns Redis data with namespace key counts
- **THEN** the panel SHALL display a memory GaugeBar, hit rate, client count, and a table listing each namespace with its key count
#### Scenario: Redis disabled
- **WHEN** Redis is disabled
- **THEN** the Redis detail panel SHALL display a disabled state message without errors
### Requirement: Memory cache panel
The dashboard SHALL display ProcessLevelCache statistics as grid cards (showing entries/max_size as a mini gauge and TTL) plus Route Cache telemetry (L1 hit rate, L2 hit rate, miss rate, total reads).
#### Scenario: Multiple caches registered
- **WHEN** `/admin/api/performance-detail` returns process_caches with multiple entries
- **THEN** the panel SHALL render one card per cache instance showing entries, max_size, TTL, and description
#### Scenario: Route cache telemetry
- **WHEN** `/admin/api/performance-detail` returns route_cache data
- **THEN** the panel SHALL display L1 hit rate, L2 hit rate, miss rate, and total reads
### Requirement: Connection pool panel
The dashboard SHALL display connection pool saturation as a GaugeBar and stat cards showing checked_out, checked_in, overflow, max_capacity, pool_size, pool_recycle, pool_timeout, and direct connection count.
#### Scenario: Pool under normal load
- **WHEN** pool saturation is below 80%
- **THEN** the GaugeBar SHALL display in a normal color (green/blue)
#### Scenario: Pool near saturation
- **WHEN** pool saturation exceeds 80%
- **THEN** the GaugeBar SHALL display in a warning color (yellow/orange/red)
### Requirement: Worker control panel
The dashboard SHALL display worker PID, uptime, cooldown status, and provide a restart button with a confirmation modal.
#### Scenario: Restart worker
- **WHEN** user clicks the restart button and confirms in the modal
- **THEN** the system SHALL POST to `/admin/api/worker/restart` and display the result
#### Scenario: Restart during cooldown
- **WHEN** worker is in cooldown period
- **THEN** the restart button SHALL be disabled with a cooldown indicator
### Requirement: System logs panel with filtering and pagination
The dashboard SHALL display system logs with level filtering, text search, and pagination controls.
#### Scenario: Filter by log level
- **WHEN** user selects a specific log level filter
- **THEN** only logs matching that level SHALL be displayed
#### Scenario: Paginate logs
- **WHEN** logs exceed the page size
- **THEN** pagination controls SHALL allow navigating between pages
### Requirement: Auto-refresh with toggle
The dashboard SHALL auto-refresh all panels every 30 seconds using `useAutoRefresh`. The user SHALL be able to toggle auto-refresh on/off and manually trigger a refresh.
#### Scenario: Auto-refresh enabled
- **WHEN** auto-refresh is enabled (default)
- **THEN** all panels SHALL refresh their data every 30 seconds via `Promise.all` parallel fetch
#### Scenario: Manual refresh
- **WHEN** user clicks the manual refresh button
- **THEN** all panels SHALL immediately refresh their data
## MODIFIED Requirements
### Requirement: Vue 3 SPA page replaces Jinja2 template
The `/admin/performance` route SHALL serve the Vite-built `admin-performance.html` static file directly. The Jinja2 template fallback SHALL be removed. If the SPA build artifact does not exist, the server SHALL return a standard HTTP error (no fallback rendering).
#### Scenario: Page loads as Vue SPA
- **WHEN** user navigates to `/admin/performance`
- **THEN** the server SHALL return the Vite-built `admin-performance.html` static file via `send_from_directory`
#### Scenario: Portal-shell integration
- **WHEN** the portal-shell renders `/admin/performance`
- **THEN** it SHALL load the page as a native Vue SPA (not an external iframe)
#### Scenario: Build artifact missing
- **WHEN** the SPA build artifact `admin-performance.html` does not exist in `static/dist/`
- **THEN** the server SHALL return an HTTP error (no Jinja2 fallback)
### Requirement: Connection pool panel
The dashboard SHALL display connection pool saturation as a GaugeBar and stat cards showing checked_out, checked_in, overflow, max_capacity, pool_size, pool_recycle, pool_timeout, direct connection count, slow_query_active, and slow_query_waiting.
#### Scenario: Pool under normal load
- **WHEN** pool saturation is below 80%
- **THEN** the GaugeBar SHALL display in a normal color (green/blue)
#### Scenario: Pool near saturation
- **WHEN** pool saturation exceeds 80%
- **THEN** the GaugeBar SHALL display in a warning color (yellow/orange/red)
#### Scenario: Slow query metrics displayed
- **WHEN** `db_pool.status` includes `slow_query_active` and `slow_query_waiting`
- **THEN** the panel SHALL display StatCards for both values
## REMOVED Requirements
### Requirement: Jinja2 template fallback for performance page
**Reason**: The Vue SPA is the sole UI. Maintaining a 1249-line Jinja template as fallback adds maintenance burden and feature divergence.
**Migration**: Delete `templates/admin/performance.html`. The route handler serves the SPA directly.


@@ -0,0 +1,30 @@
## ADDED Requirements
### Requirement: Automatic archive log cleanup
The system SHALL provide a `cleanup_archive_logs()` function in `core/metrics_history.py` that deletes old rotated log files from `logs/archive/`, keeping the most recent N files per log type (access, error, watchdog, rq_worker, startup).
#### Scenario: Cleanup keeps recent files
- **WHEN** `cleanup_archive_logs()` is called with `keep_per_type=20` and there are 30 access_*.log files
- **THEN** 10 oldest access_*.log files SHALL be deleted, keeping the 20 most recent by modification time
#### Scenario: No excess files
- **WHEN** `cleanup_archive_logs()` is called and each type has fewer than `keep_per_type` files
- **THEN** no files SHALL be deleted
#### Scenario: Archive directory missing
- **WHEN** `cleanup_archive_logs()` is called and the archive directory does not exist
- **THEN** the function SHALL return 0 without error
### Requirement: Archive cleanup integrated into collector cycle
The `MetricsHistoryCollector` SHALL call `cleanup_archive_logs()` alongside the existing SQLite cleanup, running approximately every 50 minutes (every 100 collection intervals).
#### Scenario: Periodic cleanup executes
- **WHEN** the cleanup counter reaches 100 intervals
- **THEN** both SQLite metrics cleanup and archive log cleanup SHALL execute
### Requirement: Archive cleanup configuration
The archive log cleanup SHALL be configurable via environment variables: `ARCHIVE_LOG_DIR` (default: `logs/archive`) and `ARCHIVE_LOG_KEEP_COUNT` (default: 20).
#### Scenario: Custom keep count
- **WHEN** `ARCHIVE_LOG_KEEP_COUNT=10` is set
- **THEN** cleanup SHALL keep only the 10 most recent files per type


@@ -1,27 +1,29 @@
## ADDED Requirements
### Requirement: Connection pool status in performance detail
The performance-detail API SHALL include `db_pool` section with `status` (checked_out, checked_in, overflow, max_capacity, saturation) from `get_pool_status()` and `config` (pool_size, max_overflow, pool_timeout, pool_recycle) from `get_pool_runtime_config()`.
#### Scenario: Pool status retrieved
- **WHEN** the API is called
- **THEN** `db_pool.status` SHALL contain current pool utilization metrics and `db_pool.config` SHALL contain the pool configuration values
#### Scenario: Saturation calculation
- **WHEN** the pool has 8 checked_out connections and max_capacity is 30
- **THEN** saturation SHALL be reported as approximately 26.7%
### Requirement: Direct Oracle connection counter
The system SHALL maintain a thread-safe monotonic counter in `database.py` that increments each time `get_db_connection()` or `read_sql_df_slow()` successfully creates a direct (non-pooled) Oracle connection.
#### Scenario: Counter increments on direct connection
- **WHEN** `get_db_connection()` successfully creates a connection
- **THEN** the direct connection counter SHALL increment by 1
#### Scenario: Counter in performance detail
- **WHEN** the performance-detail API is called
- **THEN** `direct_connections` SHALL contain `total_since_start` (counter value) and `worker_pid` (current process PID)
#### Scenario: Counter is per-worker
- **WHEN** multiple gunicorn workers are running
- **THEN** each worker SHALL maintain its own independent counter, and the API SHALL return the counter for the responding worker
## MODIFIED Requirements
### Requirement: Connection pool status in performance detail
The performance-detail API SHALL include a `db_pool` section with `status` (checked_out, checked_in, overflow, max_capacity, saturation, slow_query_active, slow_query_waiting) from `get_pool_status()` and `config` (pool_size, max_overflow, pool_timeout, pool_recycle) from `get_pool_runtime_config()`.
#### Scenario: Pool status retrieved
- **WHEN** the API is called
- **THEN** `db_pool.status` SHALL contain current pool utilization metrics including `slow_query_active` and `slow_query_waiting`, and `db_pool.config` SHALL contain the pool configuration values
#### Scenario: Saturation calculation
- **WHEN** the pool has 8 checked_out connections and max_capacity is 30
- **THEN** saturation SHALL be reported as approximately 26.7%
#### Scenario: Slow query waiting included
- **WHEN** 2 threads are waiting for the slow query semaphore
- **THEN** `db_pool.status.slow_query_waiting` SHALL be 2
## ADDED Requirements
### Requirement: Slow-path query latency included in QueryMetrics
The `read_sql_df_slow()` and `read_sql_df_slow_iter()` functions SHALL call `record_query_latency()` with the total elapsed time upon completion, ensuring P50/P95/P99 percentiles reflect queries from all paths (pooled and slow/direct).
#### Scenario: Slow query latency recorded
- **WHEN** `read_sql_df_slow()` completes a query in 8.5 seconds
- **THEN** `record_query_latency(8.5)` SHALL be called and the value SHALL appear in subsequent `get_percentiles()` results
#### Scenario: Slow iter latency recorded
- **WHEN** `read_sql_df_slow_iter()` completes streaming in 45 seconds
- **THEN** `record_query_latency(45.0)` SHALL be called in the finally block
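The try/finally placement implied by the scenarios can be sketched as follows. `record_query_latency` and the query body are stand-ins here; the point is only that the elapsed time is recorded whether the slow path returns, raises, or times out.

```python
import time

LATENCIES = []  # stand-in for the real QueryMetrics ring buffer

def record_query_latency(seconds):
    LATENCIES.append(seconds)

def read_sql_df_slow(run_query):
    """Illustrative wrapper: time the whole slow path, record in finally."""
    start = time.monotonic()
    try:
        return run_query()  # acquire semaphore, open direct conn, fetch
    finally:
        # Fires on success and on exception, so P50/P95/P99 see both.
        record_query_latency(time.monotonic() - start)
```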



@@ -1,65 +1,54 @@
## ADDED Requirements
### Requirement: SQLite metrics history store
The system SHALL provide a `MetricsHistoryStore` class in `core/metrics_history.py` that persists metrics snapshots to a SQLite database (`logs/metrics_history.sqlite` by default). The store SHALL use thread-local connections and a write lock, following the `LogStore` pattern in `core/log_store.py`.
#### Scenario: Write and query snapshots
- **WHEN** `write_snapshot(data)` is called with pool/redis/route_cache/latency metrics
- **THEN** a row SHALL be inserted into `metrics_snapshots` with the current ISO 8601 timestamp and worker PID
#### Scenario: Query by time range
- **WHEN** `query_snapshots(minutes=30)` is called
- **THEN** it SHALL return all rows from the last 30 minutes, ordered by timestamp ascending
#### Scenario: Retention cleanup
- **WHEN** `cleanup()` is called
- **THEN** rows older than `METRICS_HISTORY_RETENTION_DAYS` (default 3) SHALL be deleted, and total rows SHALL be capped at `METRICS_HISTORY_MAX_ROWS` (default 50000)
#### Scenario: Thread safety
- **WHEN** multiple threads write snapshots concurrently
- **THEN** the write lock SHALL serialize writes and prevent database corruption
### Requirement: Background metrics collector
The system SHALL provide a `MetricsHistoryCollector` class that runs a daemon thread collecting metrics snapshots at a configurable interval (default 30 seconds, via `METRICS_HISTORY_INTERVAL` env var).
#### Scenario: Automatic collection
- **WHEN** the collector is started via `start_metrics_history(app)`
- **THEN** it SHALL collect pool status, Redis info, route cache status, and query latency metrics every interval and write them to the store
#### Scenario: Graceful shutdown
- **WHEN** `stop_metrics_history()` is called
- **THEN** the collector thread SHALL stop within one interval period
#### Scenario: Subsystem unavailability
- **WHEN** a subsystem (e.g., Redis) is unavailable during collection
- **THEN** the collector SHALL write null/0 for those fields and continue collecting other metrics
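One collection tick under the degradation rule above can be sketched as probing each subsystem independently, so a failing one (e.g. Redis down) yields a null field instead of aborting the whole snapshot. The probe-dict shape is an illustration, not the collector's actual internal structure.

```python
def collect_snapshot(probes):
    """Run each metric probe in isolation; a failure becomes None, not an abort."""
    snapshot = {}
    for name, probe in probes.items():
        try:
            snapshot[name] = probe()
        except Exception:
            snapshot[name] = None  # subsystem unavailable; keep collecting
    return snapshot
```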
### Requirement: Performance history API endpoint
The system SHALL expose `GET /admin/api/performance-history` that returns historical metrics snapshots.
#### Scenario: Query with time range
- **WHEN** the API is called with `?minutes=30`
- **THEN** it SHALL return `{"success": true, "data": {"snapshots": [...], "count": N}}`
#### Scenario: Time range bounds
- **WHEN** `minutes` is less than 1 or greater than 180
- **THEN** it SHALL be clamped to the range [1, 180]
#### Scenario: Admin authentication
- **WHEN** the API is called without admin authentication
- **THEN** it SHALL be rejected by the `@admin_required` decorator
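The [1, 180] clamping can be expressed as a small helper. The non-numeric fallback of 30 is an assumption for illustration; the spec only fixes the bounds.

```python
def clamp_minutes(raw, lo=1, hi=180):
    """Parse the ?minutes= query param and clamp it to [lo, hi]."""
    try:
        minutes = int(raw)
    except (TypeError, ValueError):
        minutes = 30  # default window; an assumption, not from the spec
    return max(lo, min(hi, minutes))
```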
### Requirement: Frontend trend charts
The system SHALL display 4 trend chart panels in the admin performance dashboard using vue-echarts VChart line/area charts.
#### Scenario: Trend charts with data
- **WHEN** historical snapshots contain more than 1 data point
- **THEN** the dashboard SHALL display trend charts for: connection pool saturation, query latency (P50/P95/P99), Redis memory, and cache hit rates
#### Scenario: Trend charts without data
- **WHEN** historical snapshots are empty or contain only 1 data point
- **THEN** the trend charts SHALL NOT be displayed (hidden via `v-if`)
#### Scenario: Auto-refresh
- **WHEN** the dashboard auto-refreshes
- **THEN** historical data SHALL also be refreshed alongside real-time metrics
## MODIFIED Requirements
### Requirement: SQLite metrics history store
The system SHALL provide a `MetricsHistoryStore` class in `core/metrics_history.py` that persists metrics snapshots to a SQLite database (`logs/metrics_history.sqlite` by default). The store SHALL use thread-local connections and a write lock, following the `LogStore` pattern in `core/log_store.py`. The schema SHALL include columns for `slow_query_active` (INTEGER), `slow_query_waiting` (INTEGER), and `worker_rss_bytes` (INTEGER) in addition to the existing pool, Redis, route cache, and latency columns.
#### Scenario: Write and query snapshots
- **WHEN** `write_snapshot(data)` is called with pool/redis/route_cache/latency/slow_query/memory metrics
- **THEN** a row SHALL be inserted into `metrics_snapshots` with the current ISO 8601 timestamp, worker PID, and all metric columns
#### Scenario: Query by time range
- **WHEN** `query_snapshots(minutes=30)` is called
- **THEN** it SHALL return all rows from the last 30 minutes, ordered by timestamp ascending, including the new columns
#### Scenario: Retention cleanup
- **WHEN** `cleanup()` is called
- **THEN** rows older than `METRICS_HISTORY_RETENTION_DAYS` (default 3) SHALL be deleted, and total rows SHALL be capped at `METRICS_HISTORY_MAX_ROWS` (default 50000)
#### Scenario: Thread safety
- **WHEN** multiple threads write snapshots concurrently
- **THEN** the write lock SHALL serialize writes and prevent database corruption
#### Scenario: Schema migration for existing databases
- **WHEN** the store initializes on an existing database without the new columns
- **THEN** it SHALL execute ALTER TABLE ADD COLUMN for each missing column, tolerating "duplicate column" errors
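The tolerate-duplicate-column migration above can be sketched as follows. SQLite has no `ADD COLUMN IF NOT EXISTS`, so the idiomatic approach is to attempt each `ALTER TABLE` and swallow only the expected `OperationalError`; running the migration twice must be a no-op. The function name is an assumption.

```python
import sqlite3

NEW_COLUMNS = [
    ("slow_query_active", "INTEGER"),
    ("slow_query_waiting", "INTEGER"),
    ("worker_rss_bytes", "INTEGER"),
]

def migrate_schema(conn):
    """Idempotently add the new columns to an existing metrics_snapshots table."""
    for name, sql_type in NEW_COLUMNS:
        try:
            conn.execute(
                f"ALTER TABLE metrics_snapshots ADD COLUMN {name} {sql_type}"
            )
        except sqlite3.OperationalError as exc:
            if "duplicate column" not in str(exc).lower():
                raise  # only tolerate the expected error
    conn.commit()
```

New databases never hit the error path because `CREATE TABLE` already includes the columns; only pre-existing databases take the `ALTER TABLE` route.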
### Requirement: Background metrics collector
The system SHALL provide a `MetricsHistoryCollector` class that runs a daemon thread collecting metrics snapshots at a configurable interval (default 30 seconds, via `METRICS_HISTORY_INTERVAL` env var). The collector SHALL include `slow_query_active`, `slow_query_waiting`, and `worker_rss_bytes` in each snapshot.
#### Scenario: Automatic collection
- **WHEN** the collector is started via `start_metrics_history(app)`
- **THEN** it SHALL collect pool status (including slow_query_active and slow_query_waiting), Redis info, route cache status, query latency metrics, and worker RSS memory every interval and write them to the store
#### Scenario: Graceful shutdown
- **WHEN** `stop_metrics_history()` is called
- **THEN** the collector thread SHALL stop within one interval period
#### Scenario: Subsystem unavailability
- **WHEN** a subsystem (e.g., Redis) is unavailable during collection
- **THEN** the collector SHALL write null/0 for those fields and continue collecting other metrics
### Requirement: Frontend trend charts
The system SHALL display 5 trend chart panels in the admin performance dashboard using vue-echarts VChart line/area charts: connection pool saturation, query latency (P50/P95/P99), Redis memory, cache hit rates, and worker memory.
#### Scenario: Trend charts with data
- **WHEN** historical snapshots contain more than 1 data point
- **THEN** the dashboard SHALL display trend charts for: connection pool saturation (including slow_query_active), query latency (P50/P95/P99), Redis memory, cache hit rates, and worker memory (RSS in MB)
#### Scenario: Trend charts without data
- **WHEN** historical snapshots are empty or contain only 1 data point
- **THEN** the trend charts SHALL NOT be displayed (hidden via `v-if`)
#### Scenario: Auto-refresh
- **WHEN** the dashboard auto-refreshes
- **THEN** historical data SHALL also be refreshed alongside real-time metrics


@@ -0,0 +1,49 @@
## ADDED Requirements
### Requirement: Slow query active count in metrics history snapshots
The `MetricsHistoryCollector` SHALL include `slow_query_active` in each 30-second snapshot, recording the number of slow queries currently executing via dedicated connections.
#### Scenario: Snapshot includes slow_query_active
- **WHEN** the collector writes a snapshot while 3 slow queries are executing
- **THEN** the `slow_query_active` column SHALL contain the value 3
#### Scenario: No slow queries active
- **WHEN** the collector writes a snapshot while no slow queries are executing
- **THEN** the `slow_query_active` column SHALL contain the value 0
### Requirement: Slow query waiting count tracked and persisted
The system SHALL maintain a thread-safe counter `_SLOW_QUERY_WAITING` in `database.py` that tracks the number of threads currently waiting to acquire the slow query semaphore. This counter SHALL be included in `get_pool_status()` and persisted to metrics history snapshots.
#### Scenario: Counter increments on semaphore wait
- **WHEN** a thread enters `read_sql_df_slow()` and the semaphore is full
- **THEN** `_SLOW_QUERY_WAITING` SHALL be incremented before `semaphore.acquire()` and decremented after acquire completes (success or timeout)
#### Scenario: Counter in pool status API
- **WHEN** `get_pool_status()` is called
- **THEN** the returned dict SHALL include `slow_query_waiting` with the current waiting thread count
#### Scenario: Counter persisted to metrics history
- **WHEN** the collector writes a snapshot
- **THEN** the `slow_query_waiting` column SHALL reflect the count at snapshot time
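The increment-before-acquire / decrement-in-finally discipline above can be sketched as a small acquire helper. The semaphore size and timeout are placeholder values; the lock-guarded counter and the finally-based decrement (which covers both successful acquires and timeouts) are the behavior the requirement fixes.

```python
import threading

_SLOW_QUERY_SEMAPHORE = threading.BoundedSemaphore(3)  # placeholder capacity
_SLOW_QUERY_WAITING = 0
_WAITING_LOCK = threading.Lock()

def acquire_slow_slot(timeout=30.0):
    """Count this thread as waiting while it blocks on the semaphore."""
    global _SLOW_QUERY_WAITING
    with _WAITING_LOCK:
        _SLOW_QUERY_WAITING += 1
    try:
        return _SLOW_QUERY_SEMAPHORE.acquire(timeout=timeout)
    finally:
        # Runs whether acquire succeeded or timed out, so the
        # waiting count can never leak upward.
        with _WAITING_LOCK:
            _SLOW_QUERY_WAITING -= 1

def get_slow_query_waiting_count():
    with _WAITING_LOCK:
        return _SLOW_QUERY_WAITING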
### Requirement: Slow-path query latency recorded in QueryMetrics
The `read_sql_df_slow()` and `read_sql_df_slow_iter()` functions SHALL call `record_query_latency()` with the elapsed query time, so that P50/P95/P99 metrics reflect all query paths (pool + slow).
#### Scenario: Slow query latency appears in percentiles
- **WHEN** a `read_sql_df_slow()` call completes in 5.2 seconds
- **THEN** `record_query_latency(5.2)` SHALL be called and the latency SHALL appear in subsequent `get_percentiles()` results
#### Scenario: Slow iter latency recorded on completion
- **WHEN** a `read_sql_df_slow_iter()` generator completes after yielding all batches in 120 seconds total
- **THEN** `record_query_latency(120.0)` SHALL be called in the finally block
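For the iterator variant, the finally block wraps the whole generator body, so the total streaming time is recorded when the generator is exhausted, closed early by the caller, or torn down by an exception. `TIMINGS` stands in for `record_query_latency()` here.

```python
import time

TIMINGS = []  # stand-in for record_query_latency()

def read_sql_df_slow_iter(batches):
    """Illustrative streaming path: time from first call to generator teardown."""
    start = time.monotonic()
    try:
        for batch in batches:
            yield batch
    finally:
        # Fires on exhaustion, on .close() (GeneratorExit), and on exception.
        TIMINGS.append(time.monotonic() - start)
```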
### Requirement: Slow query metrics displayed in Vue SPA
The admin performance Vue SPA SHALL display `slow_query_active` and `slow_query_waiting` as StatCards in the connection pool panel, and include `slow_query_active` as a trend line in the connection pool trend chart.
#### Scenario: StatCards display current values
- **WHEN** the performance-detail API returns `db_pool.status.slow_query_active = 4` and `db_pool.status.slow_query_waiting = 2`
- **THEN** the connection pool panel SHALL display StatCards showing "慢查詢執行中: 4" and "慢查詢排隊中: 2"
#### Scenario: Trend chart includes slow_query_active
- **WHEN** historical snapshots contain `slow_query_active` data points
- **THEN** the connection pool trend chart SHALL include a "慢查詢執行中" line series


@@ -0,0 +1,23 @@
## ADDED Requirements
### Requirement: Worker RSS memory in metrics history snapshots
The `MetricsHistoryCollector` SHALL include `worker_rss_bytes` in each 30-second snapshot, recording the worker process's peak RSS memory as reported by Python's `resource.getrusage()`.
#### Scenario: RSS recorded in snapshot
- **WHEN** the collector writes a snapshot and the worker process has 256 MB peak RSS
- **THEN** the `worker_rss_bytes` column SHALL contain approximately 268435456
#### Scenario: RSS collection failure
- **WHEN** `resource.getrusage()` raises an exception
- **THEN** the collector SHALL write NULL for `worker_rss_bytes` and continue collecting other metrics
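A sketch of the RSS probe, assuming `ru_maxrss` is the source. One portability wrinkle worth encoding: `ru_maxrss` is reported in kilobytes on Linux but in bytes on macOS, so a byte-normalization step is needed for the 256 MB ≈ 268435456 scenario to hold on Linux. The function name is an assumption.

```python
import resource
import sys

def get_worker_rss_bytes():
    """Peak RSS of this process in bytes, or None if the probe fails."""
    try:
        peak = resource.getrusage(resource.RUSAGE_SELF).ru_maxrss
        # ru_maxrss: kilobytes on Linux, bytes on macOS (darwin).
        return peak if sys.platform == "darwin" else peak * 1024
    except Exception:
        return None  # collector stores NULL and keeps collecting
```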
### Requirement: Worker memory trend chart in Vue SPA
The admin performance Vue SPA SHALL display a "Worker 記憶體趨勢" TrendChart showing RSS memory over time in megabytes.
#### Scenario: Memory trend displayed
- **WHEN** historical snapshots contain `worker_rss_bytes` data with more than 1 data point
- **THEN** the dashboard SHALL display a TrendChart with RSS values converted to MB
#### Scenario: No memory data
- **WHEN** historical snapshots do not contain `worker_rss_bytes` data (all NULL)
- **THEN** the trend chart SHALL show "趨勢資料不足" message