feat(admin-performance): Vue 3 SPA dashboard with metrics history trending

Rebuild /admin/performance from Jinja2 to Vue 3 SPA with ECharts, adding
cache telemetry infrastructure, connection pool monitoring, and SQLite-backed
historical metrics collection with trend chart visualization.

Co-Authored-By: Claude Opus 4.6 <noreply@anthropic.com>
Author: egg
Date: 2026-02-23 09:18:10 +08:00
Commit: 5d570ca7a2 (parent: 1c46f5eb69)
32 changed files with 2903 additions and 261 deletions


@@ -0,0 +1,2 @@
schema: spec-driven
created: 2026-02-22


@@ -0,0 +1,91 @@
## Context
The existing `/admin/performance` is a Jinja2 server-rendered page (vanilla JS + Chart.js) and the only frontend page not yet migrated to the Vue 3 SPA architecture. The backend already exposes rich monitoring data (connection pool `get_pool_status()`, the Redis client, LayeredCache `.telemetry()`), but the frontend shows only 4 status cards plus query performance, worker control, and logs; it lacks key panels such as Redis details, ProcessLevelCache statistics, and connection pool saturation.
## Goals / Non-Goals
**Goals:**
- Migrate the admin/performance page from Jinja2 to a Vue 3 SPA, consistent with the architecture of all report pages
- Add complete system monitoring panels: Redis cache details, ProcessLevelCache statistics, connection pool saturation, direct Oracle connection tracking
- Provide reusable gauge/stat card components to ease future monitoring additions
- Preserve all existing features: status cards, query performance, worker control, system logs
**Non-Goals:**
- No alerting/notification mechanism (future extension)
- No WebSocket real-time push (keep 30-second polling)
- No changes to existing API response formats (`system-status`, `metrics`, `logs` stay unchanged)
- No new user permission controls (reuse existing admin authentication)
## Decisions
### 1. Vue 3 SPA + ECharts replaces Jinja2 + Chart.js
**Choice**: Rebuild the page entirely as a Vue 3 SPA, using ECharts for charting.
**Rationale**: All report pages have completed the Vue SPA migration; admin/performance is the last Jinja2 page. A unified architecture reuses shared infrastructure such as `apiGet` and `useAutoRefresh`, reducing maintenance cost. ECharts is already the project's standard chart library (used by query-tool, reject-history, and others).
**Alternative**: Keep Jinja2 and only add APIs. This would keep accumulating technical debt and forgo reuse of the Vue ecosystem.
### 2. A single performance-detail API aggregates all new monitoring data
**Choice**: Add one endpoint, `GET /admin/api/performance-detail`, returning five sections: `redis`, `process_caches`, `route_cache`, `db_pool`, and `direct_connections`.
**Rationale**: Reduces the number of concurrent frontend requests (5 existing APIs plus 1 makes 6), and the backend can collect each subsystem's state sequentially within a single request, avoiding multiple round-trips.
**Alternative**: A separate endpoint per monitoring dimension. More RESTful, but it increases frontend complexity and network overhead.
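The aggregation shape described above can be sketched as follows. This is a hypothetical illustration (the function and collector names are invented, not from the actual implementation): each section is collected independently, so a failing subsystem degrades to a `null` section instead of failing the whole response.

```python
def _safe(collector):
    """Run one section collector; return None if it raises."""
    try:
        return collector()
    except Exception:
        return None

def build_performance_detail(collectors):
    """Assemble the five-section payload from named collectors."""
    return {name: _safe(fn) for name, fn in collectors.items()}

def _broken_pool():
    # Simulates a subsystem that is temporarily unavailable.
    raise RuntimeError("pool status unavailable")

detail = build_performance_detail({
    "redis": lambda: {"enabled": False},
    "process_caches": dict,
    "route_cache": lambda: {"mode": "layered"},
    "db_pool": _broken_pool,
    "direct_connections": lambda: {"total_since_start": 0},
})
```

Collecting sequentially in one request keeps the endpoint simple; the per-section error isolation is what makes "other sections SHALL still return normally" cheap to guarantee.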
### 3. Global registry pattern for ProcessLevelCache
**Choice**: Add a `_PROCESS_CACHE_REGISTRY` dict and a `register_process_cache()` function to `core/cache.py`; each service registers its cache at module load time.
**Rationale**: Avoids hard-coding each cache instance's import path in admin_routes. Adding a cache only requires a one-line `register_process_cache()` call in its own service to appear automatically in the monitoring panel.
**Alternative**: Have admin_routes import each cache instance directly. Tighter coupling; adding a cache would require changes in two places.
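A minimal sketch of this registry pattern, assuming the names from the decision above; `_DemoCache` stands in for a real `ProcessLevelCache` instance:

```python
import threading

_PROCESS_CACHE_REGISTRY = {}
_REGISTRY_LOCK = threading.Lock()

def register_process_cache(name, instance, description=""):
    """Called once by each service at module load time."""
    with _REGISTRY_LOCK:
        _PROCESS_CACHE_REGISTRY[name] = (description, instance)

def get_all_process_cache_stats():
    """Snapshot stats() from every registered cache for the monitoring panel."""
    with _REGISTRY_LOCK:
        items = list(_PROCESS_CACHE_REGISTRY.items())
    return {name: {**inst.stats(), "description": desc}
            for name, (desc, inst) in items}

class _DemoCache:
    """Stand-in for a real ProcessLevelCache."""
    def stats(self):
        return {"entries": 2, "max_size": 32, "ttl_seconds": 30}

register_process_cache("wip_df", _DemoCache(), "WIP dataframe cache")
```

An unregistered cache simply never appears in the stats dict, which is exactly the behavior the spec requires.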
### 4. Redis namespace monitoring uses SCAN, not KEYS
**Choice**: Use `SCAN` with a `MATCH` pattern to count the keys in each namespace.
**Rationale**: `KEYS *` blocks Redis in production; `SCAN` is a non-blocking iterator and is safer.
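The per-namespace count can be sketched with redis-py's `scan_iter` (which wraps the non-blocking `SCAN` command); `_StubRedis` below is an in-memory stand-in so the sketch runs without a server, and `count_namespace_keys` is an illustrative name, not the actual function:

```python
def count_namespace_keys(client, namespaces, count=100):
    """Count keys per namespace with non-blocking SCAN iteration (never KEYS)."""
    return [
        {"name": ns,
         "key_count": sum(1 for _ in client.scan_iter(match=f"{ns}:*", count=count))}
        for ns in namespaces
    ]

class _StubRedis:
    """In-memory stand-in for a redis-py client, for demonstration only."""
    def __init__(self, keys):
        self._keys = keys
    def scan_iter(self, match, count=100):
        prefix = match.rstrip("*")
        return iter(k for k in self._keys if k.startswith(prefix))

stub = _StubRedis(["data:a", "data:b", "lock:x", "meta:1"])
counts = count_namespace_keys(stub, ["data", "lock", "route_cache"])
```

With a real client, `count=100` bounds the work of each SCAN iteration, matching the mitigation listed under Risks.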
### 5. Direct Oracle connections use a thread-safe atomic counter
**Choice**: A global counter in `database.py` protected by a `threading.Lock`, incremented after `get_db_connection()` or `read_sql_df_slow()` establishes a connection.
**Rationale**: Tracks direct connection usage outside the pool, helping decide whether the pool size needs adjusting. The counter is monotonic (increment-only), recording the total since the worker started.
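A minimal sketch of the counter, with a concurrent-increment demo showing why the lock matters (variable names follow the task list; the demo threads are illustrative):

```python
import threading

_DIRECT_CONN_COUNTER = 0
_DIRECT_CONN_LOCK = threading.Lock()

def _record_direct_connection():
    """Called after a direct (non-pooled) Oracle connection is created."""
    global _DIRECT_CONN_COUNTER
    with _DIRECT_CONN_LOCK:
        _DIRECT_CONN_COUNTER += 1

def get_direct_connection_count():
    with _DIRECT_CONN_LOCK:
        return _DIRECT_CONN_COUNTER

# Concurrent increments stay exact because += is serialized by the lock.
workers = [
    threading.Thread(target=lambda: [_record_direct_connection() for _ in range(250)])
    for _ in range(4)
]
for t in workers:
    t.start()
for t in workers:
    t.join()
```

After the four demo threads finish, the counter reads exactly 1000; an unlocked `+=` could lose increments under contention.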
### 6. Reusable frontend components: GaugeBar / StatCard / StatusDot
**Choice**: Add 3 small reusable components under `admin-performance/components/`.
**Rationale**: Redis memory, connection pool saturation, and ProcessLevelCache usage all need gauge visualization, and stat cards repeat across panels. Componentization unifies the visual style and reduces duplicated templates.
### 7. SQLite-persisted metrics history store
**Choice**: Add `core/metrics_history.py`, storing metrics snapshots in SQLite (modeled on the `LogStore` pattern in `core/log_store.py`), with a daemon thread sampling every 30 seconds.
**Rationale**: An in-memory deque cannot be shared across workers under gunicorn prefork and loses history on worker restart. SQLite provides cross-worker reads, persistence across restarts, and configurable retention (default 3 days / 50000 rows) without extra infrastructure.
**Alternatives**:
- In-memory deque: simple, but per-worker and lost on restart
- Redis TSDB: requires an extra module and adds load on Redis
- PostgreSQL: too heavy, and this data does not need ACID guarantees
**Schema**: `metrics_snapshots` table with timestamp, worker PID, and pool/redis/route_cache/latency columns; an `idx_metrics_ts` index speeds up time-range queries.
**Background collection**: a `MetricsHistoryCollector` daemon thread; the interval is configurable via the `METRICS_HISTORY_INTERVAL` environment variable. Started and stopped in the `app.py` lifecycle.
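The store can be sketched as below. This is a simplified illustration: the real class uses thread-local connections per the `LogStore` pattern, whereas this sketch shares one connection behind a write lock, and the column set is abbreviated to three fields:

```python
import sqlite3
import threading

class MetricsHistoryStore:
    """Abbreviated sketch of the SQLite-backed snapshot store."""

    def __init__(self, path=":memory:"):
        self._conn = sqlite3.connect(path, check_same_thread=False)
        self._lock = threading.Lock()
        with self._lock:
            self._conn.execute(
                "CREATE TABLE IF NOT EXISTS metrics_snapshots ("
                "ts TEXT NOT NULL, worker_pid INTEGER, pool_saturation REAL)")
            self._conn.execute(
                "CREATE INDEX IF NOT EXISTS idx_metrics_ts "
                "ON metrics_snapshots (ts)")
            self._conn.commit()

    def write_snapshot(self, ts, pid, saturation):
        with self._lock:
            self._conn.execute(
                "INSERT INTO metrics_snapshots VALUES (?, ?, ?)",
                (ts, pid, saturation))
            self._conn.commit()

    def query_snapshots(self, since_ts):
        """Return snapshots at or after since_ts, oldest first."""
        with self._lock:
            return self._conn.execute(
                "SELECT ts, worker_pid, pool_saturation FROM metrics_snapshots "
                "WHERE ts >= ? ORDER BY ts ASC", (since_ts,)).fetchall()

store = MetricsHistoryStore()
store.write_snapshot("2026-02-23T09:00:00", 1234, 12.5)
store.write_snapshot("2026-02-23T09:00:30", 1234, 40.0)
recent = store.query_snapshots("2026-02-23T09:00:00")
```

Because ISO 8601 timestamps sort lexicographically, plain string comparison in the `WHERE ts >= ?` clause gives correct time-range queries, which is what `idx_metrics_ts` accelerates.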
## Risks / Trade-offs
- **Redis SCAN performance**: SCAN can be slow with many keys → set `COUNT 100` to bound each iteration; with one scan every 30 seconds, this is acceptable
- **ProcessLevelCache registry depends on module load order**: a service that is never imported never registers → ensure all service modules are loaded in the app factory or gunicorn post_fork
- **Direct connection counter is not shared across workers**: under gunicorn prefork, each worker keeps its own count → the API returns the responding worker's PID for identification, which can be cross-checked against the worker info in `/admin/api/system-status`
- **Old Jinja2 template kept but unmaintained**: after the switch, the old template receives no updates → `rollbackStrategy: 'fallback_to_legacy_route'` in `routeContracts.js` preserves a rollback path
## Migration Plan
1. Backend first: add `stats()`, the registry, the direct connection counter, and the new APIs (no impact on existing features)
2. Frontend build: create the `admin-performance/` Vue SPA and register the Vite entry
3. Route switch: change `admin_routes.py` to `send_from_directory` and set `renderMode: 'native'` in `routeContracts.js`
4. Verify, then deploy: confirm all panels display correctly before going live
5. Rollback: revert `routeContracts.js` to `renderMode: 'external'` and `admin_routes.py` to `render_template`


@@ -0,0 +1,31 @@
## Why
The existing `/admin/performance` is the only page still using Jinja2 + vanilla JS + Chart.js, inconsistent with all report pages already migrated to Vue 3 SPAs. Meanwhile, as the reporting system has grown (L1/L2 cache layers, connection pool, direct Oracle connections), the backend now exposes rich telemetry, but the admin monitoring panels cover too little: Redis details, ProcessLevelCache statistics, connection pool saturation, and direct Oracle connection tracking are all missing.
## What Changes
- Rebuild `/admin/performance` from a Jinja2 server-rendered page into a Vue 3 SPA (ECharts replaces Chart.js)
- Add a `GET /admin/api/performance-detail` API aggregating complete monitoring data: Redis INFO/SCAN, the ProcessLevelCache registry, connection pool status, and the direct connection counter
- Backend: add a `stats()` method and a global registry to `ProcessLevelCache`, enabling dynamic collection of every cache instance's state
- Backend: add a direct Oracle connection counter to `database.py`, tracking non-pooled connection usage
- Frontend: add reusable GaugeBar / StatCard / StatusDot components, providing gauge-style saturation visualization
- Switch the portal-shell route from `renderMode: 'external'` to `'native'`
- Add an `admin-performance` entry point to the Vite build
## Capabilities
### New Capabilities
- `admin-performance-spa`: Vue 3 SPA rebuild of the admin performance dashboard, with complete panels for status cards, query performance, Redis cache, memory caches, connection pool, worker control, and system logs
- `cache-telemetry-api`: ProcessLevelCache `stats()` + global registry + performance-detail API, providing telemetry for all memory caches, the Redis cache, and the route cache
- `connection-pool-monitoring`: connection pool saturation tracking + direct Oracle connection counter, giving a complete picture of database connection usage
- `metrics-history-trending`: SQLite-persisted background collection + time-series trend charts, enabling lookback over pool saturation, query latency, Redis memory, cache hit rates, and other historical data
### Modified Capabilities
<!-- No existing spec-level requirements are changing -->
## Impact
- **Backend** (7 modified + 1 new): `core/cache.py`, `core/database.py`, `core/metrics_history.py` (NEW), `routes/admin_routes.py`, `services/resource_cache.py`, `services/realtime_equipment_cache.py`, `services/reject_dataset_cache.py`, `app.py`
- **Frontend** (8 new + 3 modified): new `admin-performance/` directory (index.html, main.js, App.vue, style.css, 4 components including TrendChart); modified `vite.config.js`, `package.json`, `routeContracts.js`
- **API**: 2 new endpoints (`/admin/api/performance-detail`, `/admin/api/performance-history`); the existing 5 endpoints are unchanged
- **Rollback**: the old Jinja2 template is kept; switch back via `renderMode: 'external'` in `routeContracts.js`


@@ -0,0 +1,100 @@
## ADDED Requirements
### Requirement: Vue 3 SPA page replaces Jinja2 template
The `/admin/performance` route SHALL serve a Vue 3 SPA page built by Vite, replacing the existing Jinja2 server-rendered template. The SPA SHALL be registered as a Vite entry point and integrated into the portal-shell navigation as a `renderMode: 'native'` route.
#### Scenario: Page loads as Vue SPA
- **WHEN** user navigates to `/admin/performance`
- **THEN** the server SHALL return the Vite-built `admin-performance.html` static file (not a Jinja2 rendered template)
#### Scenario: Portal-shell integration
- **WHEN** the portal-shell renders `/admin/performance`
- **THEN** it SHALL load the page as a native Vue SPA (not an external iframe)
### Requirement: Status cards display system health
The dashboard SHALL display 4 status cards in a horizontal grid: Database, Redis, Circuit Breaker, and Worker PID. Each card SHALL show a StatusDot indicator (healthy/degraded/error/disabled) with the current status value.
#### Scenario: All systems healthy
- **WHEN** all backend systems report healthy status via `/admin/api/system-status`
- **THEN** all 4 status cards SHALL display green StatusDot indicators with their respective values
#### Scenario: Redis disabled
- **WHEN** Redis is disabled (`REDIS_ENABLED=false`)
- **THEN** the Redis status card SHALL display a disabled StatusDot indicator and the Redis cache panel SHALL show a graceful degradation message
### Requirement: Query performance panel with ECharts
The dashboard SHALL display query performance metrics (P50, P95, P99 latencies, total queries, slow queries) and an ECharts latency distribution chart, replacing the existing Chart.js implementation.
#### Scenario: Metrics loaded successfully
- **WHEN** `/admin/api/metrics` returns valid performance data
- **THEN** the panel SHALL display P50/P95/P99 latency values and render an ECharts bar chart showing latency distribution
#### Scenario: No metrics data
- **WHEN** `/admin/api/metrics` returns empty or null metrics
- **THEN** the panel SHALL display placeholder text indicating no data available
### Requirement: Redis cache detail panel
The dashboard SHALL display a Redis cache detail panel showing memory usage (as a GaugeBar), connected clients, hit rate percentage, peak memory, and a namespace key distribution table.
#### Scenario: Redis active with data
- **WHEN** `/admin/api/performance-detail` returns Redis data with namespace key counts
- **THEN** the panel SHALL display a memory GaugeBar, hit rate, client count, and a table listing each namespace with its key count
#### Scenario: Redis disabled
- **WHEN** Redis is disabled
- **THEN** the Redis detail panel SHALL display a disabled state message without errors
### Requirement: Memory cache panel
The dashboard SHALL display ProcessLevelCache statistics as grid cards (showing entries/max_size as a mini gauge and TTL) plus Route Cache telemetry (L1 hit rate, L2 hit rate, miss rate, total reads).
#### Scenario: Multiple caches registered
- **WHEN** `/admin/api/performance-detail` returns process_caches with multiple entries
- **THEN** the panel SHALL render one card per cache instance showing entries, max_size, TTL, and description
#### Scenario: Route cache telemetry
- **WHEN** `/admin/api/performance-detail` returns route_cache data
- **THEN** the panel SHALL display L1 hit rate, L2 hit rate, miss rate, and total reads
### Requirement: Connection pool panel
The dashboard SHALL display connection pool saturation as a GaugeBar and stat cards showing checked_out, checked_in, overflow, max_capacity, pool_size, pool_recycle, pool_timeout, and direct connection count.
#### Scenario: Pool under normal load
- **WHEN** pool saturation is below 80%
- **THEN** the GaugeBar SHALL display in a normal color (green/blue)
#### Scenario: Pool near saturation
- **WHEN** pool saturation exceeds 80%
- **THEN** the GaugeBar SHALL display in a warning color (yellow/orange/red)
### Requirement: Worker control panel
The dashboard SHALL display worker PID, uptime, cooldown status, and provide a restart button with a confirmation modal.
#### Scenario: Restart worker
- **WHEN** user clicks the restart button and confirms in the modal
- **THEN** the system SHALL POST to `/admin/api/worker/restart` and display the result
#### Scenario: Restart during cooldown
- **WHEN** worker is in cooldown period
- **THEN** the restart button SHALL be disabled with a cooldown indicator
### Requirement: System logs panel with filtering and pagination
The dashboard SHALL display system logs with level filtering, text search, and pagination controls.
#### Scenario: Filter by log level
- **WHEN** user selects a specific log level filter
- **THEN** only logs matching that level SHALL be displayed
#### Scenario: Paginate logs
- **WHEN** logs exceed the page size
- **THEN** pagination controls SHALL allow navigating between pages
### Requirement: Auto-refresh with toggle
The dashboard SHALL auto-refresh all panels every 30 seconds using `useAutoRefresh`. The user SHALL be able to toggle auto-refresh on/off and manually trigger a refresh.
#### Scenario: Auto-refresh enabled
- **WHEN** auto-refresh is enabled (default)
- **THEN** all panels SHALL refresh their data every 30 seconds via `Promise.all` parallel fetch
#### Scenario: Manual refresh
- **WHEN** user clicks the manual refresh button
- **THEN** all panels SHALL immediately refresh their data


@@ -0,0 +1,56 @@
## ADDED Requirements
### Requirement: ProcessLevelCache stats method
Every `ProcessLevelCache` instance SHALL expose a `stats()` method that returns a dict containing `entries` (live entries count), `max_size`, and `ttl_seconds`.
#### Scenario: Stats on active cache
- **WHEN** `stats()` is called on a ProcessLevelCache with 5 live entries (max_size=32, ttl=30s)
- **THEN** it SHALL return `{"entries": 5, "max_size": 32, "ttl_seconds": 30}`
#### Scenario: Stats with expired entries
- **WHEN** `stats()` is called and some entries have exceeded TTL
- **THEN** `entries` SHALL only count entries where `now - timestamp <= ttl`
#### Scenario: Thread safety
- **WHEN** `stats()` is called concurrently with cache writes
- **THEN** it SHALL acquire the cache lock and return consistent data without races
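The three scenarios above can be sketched in one minimal `ProcessLevelCache` focused on `stats()`: count only non-expired entries, under the lock (the `set` method and internal layout are illustrative assumptions, not the real implementation):

```python
import threading
import time

class ProcessLevelCache:
    """Sketch focused on stats(); internal layout is assumed."""

    def __init__(self, max_size=32, ttl_seconds=30):
        self.max_size = max_size
        self.ttl_seconds = ttl_seconds
        self._data = {}  # key -> (inserted_at, value)
        self._lock = threading.Lock()

    def set(self, key, value):
        with self._lock:
            self._data[key] = (time.monotonic(), value)

    def stats(self):
        now = time.monotonic()
        with self._lock:
            # Count only entries still within TTL (live entries).
            live = sum(1 for ts, _ in self._data.values()
                       if now - ts <= self.ttl_seconds)
        return {"entries": live, "max_size": self.max_size,
                "ttl_seconds": self.ttl_seconds}

cache = ProcessLevelCache(max_size=32, ttl_seconds=30)
for i in range(5):
    cache.set(f"k{i}", i)
# Backdate one entry to simulate TTL expiry; it must not be counted.
cache._data["stale"] = (time.monotonic() - 100, "expired")
```

The lock acquisition in `stats()` is what makes concurrent reads consistent with in-flight writes.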
### Requirement: ProcessLevelCache global registry
The system SHALL maintain a module-level registry in `core/cache.py` that maps cache names to `(description, instance)` tuples. Services SHALL register their cache instances at module load time via `register_process_cache(name, instance, description)`.
#### Scenario: Register and retrieve all caches
- **WHEN** multiple services register their caches and `get_all_process_cache_stats()` is called
- **THEN** it SHALL return a dict of `{name: {entries, max_size, ttl_seconds, description}}` for all registered caches
#### Scenario: Cache not registered
- **WHEN** a service's ProcessLevelCache is not registered
- **THEN** it SHALL NOT appear in `get_all_process_cache_stats()` output
### Requirement: Performance detail API endpoint
The system SHALL expose `GET /admin/api/performance-detail` that returns a JSON object with sections: `redis`, `process_caches`, `route_cache`, `db_pool`, and `direct_connections`.
#### Scenario: All systems available
- **WHEN** the API is called and all subsystems are healthy
- **THEN** it SHALL return all 5 sections with current telemetry data
#### Scenario: Redis disabled
- **WHEN** Redis is disabled (`REDIS_ENABLED=false`)
- **THEN** the `redis` section SHALL be `null` or contain `{"enabled": false}`, and other sections SHALL still return normally
### Requirement: Redis namespace key distribution
The performance-detail API SHALL scan Redis keys by namespace prefix and return key counts per namespace. Namespaces SHALL include: `data`, `route_cache`, `equipment_status`, `reject_dataset`, `meta`, `lock`, `scrap_exclusion`.
#### Scenario: Keys exist across namespaces
- **WHEN** Redis contains keys across multiple namespaces
- **THEN** the `redis.namespaces` array SHALL list each namespace with its `name` and `key_count`
#### Scenario: SCAN safety
- **WHEN** scanning Redis keys
- **THEN** the system SHALL use `SCAN` (not `KEYS`) to avoid blocking Redis
### Requirement: Route cache telemetry in performance detail
The performance-detail API SHALL include route cache telemetry from `get_route_cache_status()`, providing `mode`, `l1_size`, `l1_hit_rate`, `l2_hit_rate`, `miss_rate`, and `reads_total`.
#### Scenario: LayeredCache active
- **WHEN** route cache is in layered mode
- **THEN** the `route_cache` section SHALL include L1 and L2 hit rates from telemetry


@@ -0,0 +1,27 @@
## ADDED Requirements
### Requirement: Connection pool status in performance detail
The performance-detail API SHALL include `db_pool` section with `status` (checked_out, checked_in, overflow, max_capacity, saturation) from `get_pool_status()` and `config` (pool_size, max_overflow, pool_timeout, pool_recycle) from `get_pool_runtime_config()`.
#### Scenario: Pool status retrieved
- **WHEN** the API is called
- **THEN** `db_pool.status` SHALL contain current pool utilization metrics and `db_pool.config` SHALL contain the pool configuration values
#### Scenario: Saturation calculation
- **WHEN** the pool has 8 checked_out connections and max_capacity is 30
- **THEN** saturation SHALL be reported as approximately 26.7%
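The saturation arithmetic in this scenario is a straightforward percentage; a small sketch (the function name is illustrative, and the zero-capacity guard is an assumption):

```python
def pool_saturation(checked_out, max_capacity):
    """Saturation as a percentage of max capacity (pool_size + max_overflow)."""
    if max_capacity <= 0:
        return 0.0  # guard against division by zero (assumed behavior)
    return round(checked_out / max_capacity * 100, 1)
```

For the scenario above: 8 / 30 * 100 = 26.666..., reported as 26.7%.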
### Requirement: Direct Oracle connection counter
The system SHALL maintain a thread-safe monotonic counter in `database.py` that increments each time `get_db_connection()` or `read_sql_df_slow()` successfully creates a direct (non-pooled) Oracle connection.
#### Scenario: Counter increments on direct connection
- **WHEN** `get_db_connection()` successfully creates a connection
- **THEN** the direct connection counter SHALL increment by 1
#### Scenario: Counter in performance detail
- **WHEN** the performance-detail API is called
- **THEN** `direct_connections` SHALL contain `total_since_start` (counter value) and `worker_pid` (current process PID)
#### Scenario: Counter is per-worker
- **WHEN** multiple gunicorn workers are running
- **THEN** each worker SHALL maintain its own independent counter, and the API SHALL return the counter for the responding worker


@@ -0,0 +1,65 @@
## ADDED Requirements
### Requirement: SQLite metrics history store
The system SHALL provide a `MetricsHistoryStore` class in `core/metrics_history.py` that persists metrics snapshots to a SQLite database (`logs/metrics_history.sqlite` by default). The store SHALL use thread-local connections and a write lock, following the `LogStore` pattern in `core/log_store.py`.
#### Scenario: Write and query snapshots
- **WHEN** `write_snapshot(data)` is called with pool/redis/route_cache/latency metrics
- **THEN** a row SHALL be inserted into `metrics_snapshots` with the current ISO 8601 timestamp and worker PID
#### Scenario: Query by time range
- **WHEN** `query_snapshots(minutes=30)` is called
- **THEN** it SHALL return all rows from the last 30 minutes, ordered by timestamp ascending
#### Scenario: Retention cleanup
- **WHEN** `cleanup()` is called
- **THEN** rows older than `METRICS_HISTORY_RETENTION_DAYS` (default 3) SHALL be deleted, and total rows SHALL be capped at `METRICS_HISTORY_MAX_ROWS` (default 50000)
#### Scenario: Thread safety
- **WHEN** multiple threads write snapshots concurrently
- **THEN** the write lock SHALL serialize writes and prevent database corruption
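The retention cleanup scenario combines two rules: an age cutoff, then a row cap. A self-contained sketch (schema reduced to one column; the function name follows the spec but the SQL is an illustrative assumption):

```python
import sqlite3

def cleanup(conn, cutoff_ts, max_rows=50000):
    """Drop rows older than the cutoff, then cap total rows by deleting
    the oldest beyond max_rows."""
    conn.execute("DELETE FROM metrics_snapshots WHERE ts < ?", (cutoff_ts,))
    conn.execute(
        "DELETE FROM metrics_snapshots WHERE rowid NOT IN "
        "(SELECT rowid FROM metrics_snapshots ORDER BY ts DESC LIMIT ?)",
        (max_rows,))
    conn.commit()

conn = sqlite3.connect(":memory:")
conn.execute("CREATE TABLE metrics_snapshots (ts TEXT NOT NULL)")
conn.executemany("INSERT INTO metrics_snapshots VALUES (?)",
                 [("2026-02-19T00:00:00",), ("2026-02-22T00:00:00",),
                  ("2026-02-23T00:00:00",)])
cleanup(conn, cutoff_ts="2026-02-20T00:00:00", max_rows=1)
remaining = [row[0] for row in conn.execute(
    "SELECT ts FROM metrics_snapshots ORDER BY ts")]
```

With a cutoff of 2026-02-20 and a cap of 1 row, only the newest snapshot survives.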
### Requirement: Background metrics collector
The system SHALL provide a `MetricsHistoryCollector` class that runs a daemon thread collecting metrics snapshots at a configurable interval (default 30 seconds, via `METRICS_HISTORY_INTERVAL` env var).
#### Scenario: Automatic collection
- **WHEN** the collector is started via `start_metrics_history(app)`
- **THEN** it SHALL collect pool status, Redis info, route cache status, and query latency metrics every interval and write them to the store
#### Scenario: Graceful shutdown
- **WHEN** `stop_metrics_history()` is called
- **THEN** the collector thread SHALL stop within one interval period
#### Scenario: Subsystem unavailability
- **WHEN** a subsystem (e.g., Redis) is unavailable during collection
- **THEN** the collector SHALL write null/0 for those fields and continue collecting other metrics
### Requirement: Performance history API endpoint
The system SHALL expose `GET /admin/api/performance-history` that returns historical metrics snapshots.
#### Scenario: Query with time range
- **WHEN** the API is called with `?minutes=30`
- **THEN** it SHALL return `{"success": true, "data": {"snapshots": [...], "count": N}}`
#### Scenario: Time range bounds
- **WHEN** `minutes` is less than 1 or greater than 180
- **THEN** it SHALL be clamped to the range [1, 180]
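The clamping rule can be sketched in a few lines; the fallback default of 30 on unparsable input is an assumption, not stated by the spec:

```python
def clamp_minutes(raw, default=30):
    """Parse and clamp the ?minutes= query param to [1, 180].
    The fallback default is an assumed value, not from the spec."""
    try:
        value = int(raw)
    except (TypeError, ValueError):
        value = default
    return max(1, min(180, value))
```

This keeps out-of-range and malformed inputs from producing unbounded history queries.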
#### Scenario: Admin authentication
- **WHEN** the API is called without admin authentication
- **THEN** it SHALL be rejected by the `@admin_required` decorator
### Requirement: Frontend trend charts
The system SHALL display 4 trend chart panels in the admin performance dashboard using vue-echarts VChart line/area charts.
#### Scenario: Trend charts with data
- **WHEN** historical snapshots contain more than 1 data point
- **THEN** the dashboard SHALL display trend charts for: connection pool saturation, query latency (P50/P95/P99), Redis memory, and cache hit rates
#### Scenario: Trend charts without data
- **WHEN** historical snapshots are empty or contain only 1 data point
- **THEN** the trend charts SHALL NOT be displayed (hidden via `v-if`)
#### Scenario: Auto-refresh
- **WHEN** the dashboard auto-refreshes
- **THEN** historical data SHALL also be refreshed alongside real-time metrics


@@ -0,0 +1,80 @@
## 1. Backend — Cache Telemetry Infrastructure
- [x] 1.1 Add `stats()` method to `ProcessLevelCache` in `core/cache.py` (returns entries/max_size/ttl_seconds with lock)
- [x] 1.2 Add `_PROCESS_CACHE_REGISTRY`, `register_process_cache()`, and `get_all_process_cache_stats()` to `core/cache.py`
- [x] 1.3 Register `_wip_df_cache` in `core/cache.py`
- [x] 1.4 Add `stats()` + `register_process_cache()` to `services/resource_cache.py`
- [x] 1.5 Add `stats()` + `register_process_cache()` to `services/realtime_equipment_cache.py`
- [x] 1.6 Add `register_process_cache()` to `services/reject_dataset_cache.py`
## 2. Backend — Direct Connection Counter
- [x] 2.1 Add `_DIRECT_CONN_COUNTER`, `_DIRECT_CONN_LOCK`, and `get_direct_connection_count()` to `core/database.py`
- [x] 2.2 Increment counter in `get_db_connection()` and `read_sql_df_slow()` after successful connection creation
## 3. Backend — Performance Detail API
- [x] 3.1 Add `GET /admin/api/performance-detail` endpoint in `routes/admin_routes.py` returning redis, process_caches, route_cache, db_pool, and direct_connections sections
- [x] 3.2 Implement Redis INFO + SCAN namespace key distribution (data, route_cache, equipment_status, reject_dataset, meta, lock, scrap_exclusion) with graceful degradation when Redis is disabled
## 4. Frontend — Page Scaffolding
- [x] 4.1 Create `frontend/src/admin-performance/index.html` and `main.js` (standard Vue SPA entry)
- [x] 4.2 Register `admin-performance` entry in `vite.config.js`
- [x] 4.3 Add `cp` command for `admin-performance.html` in `package.json` build script
## 5. Frontend — Reusable Components
- [x] 5.1 Create `GaugeBar.vue` — horizontal gauge bar with label, value, max, and color threshold props
- [x] 5.2 Create `StatCard.vue` — mini card with numeric value, label, and optional unit/icon
- [x] 5.3 Create `StatusDot.vue` — colored dot indicator (healthy/degraded/error/disabled) with label
## 6. Frontend — App.vue Main Dashboard
- [x] 6.1 Implement data fetching layer: `loadSystemStatus()`, `loadMetrics()`, `loadPerformanceDetail()`, `loadLogs()`, `loadWorkerStatus()` with `Promise.all` parallel fetch and `useAutoRefresh` (30s)
- [x] 6.2 Build header section with gradient background, title, auto-refresh toggle, and manual refresh button
- [x] 6.3 Build status cards section (Database / Redis / Circuit Breaker / Worker PID) using StatusDot
- [x] 6.4 Build query performance panel with P50/P95/P99 stat cards and ECharts latency distribution chart
- [x] 6.5 Build Redis cache detail panel with memory GaugeBar, hit rate, client count, peak memory, and namespace key distribution table
- [x] 6.6 Build memory cache panel with ProcessLevelCache grid cards (entries/max gauge + TTL) and route cache telemetry (L1/L2 hit rate, miss rate, total reads)
- [x] 6.7 Build connection pool panel with saturation GaugeBar and stat card grid (checked_out, checked_in, overflow, max_capacity, pool_size, pool_recycle, pool_timeout, direct connections)
- [x] 6.8 Build worker control panel with PID/uptime/cooldown display, restart button, and confirmation modal
- [x] 6.9 Build system logs panel with level filter, text search, pagination, and log clearing
- [x] 6.10 Create `style.css` with all panel, grid, gauge, card, and responsive layout styles
## 7. Route Integration
- [x] 7.1 Change `/admin/performance` route handler in `admin_routes.py` from `render_template` to `send_from_directory` serving the Vue SPA
- [x] 7.2 Update `routeContracts.js`: change renderMode to `'native'`, rollbackStrategy to `'fallback_to_legacy_route'`, compatibilityPolicy to `'redirect_to_shell_when_spa_enabled'`
## 8. Verification (Phase 1)
- [x] 8.1 Run `cd frontend && npx vite build` — confirm no compilation errors and `admin-performance.html` is produced
- [x] 8.2 Verify all dashboard panels render correctly with live data after service restart
## 9. Backend — Metrics History Store
- [x] 9.1 Create `core/metrics_history.py` with `MetricsHistoryStore` class (SQLite schema, thread-local connections, write_lock, write_snapshot, query_snapshots, cleanup)
- [x] 9.2 Add `MetricsHistoryCollector` class (daemon thread, configurable interval, collect pool/redis/route_cache/latency)
- [x] 9.3 Add module-level `get_metrics_history_store()`, `start_metrics_history(app)`, `stop_metrics_history()` functions
## 10. Backend — Lifecycle Integration
- [x] 10.1 Call `start_metrics_history(app)` in `app.py` after other background services
- [x] 10.2 Call `stop_metrics_history()` in `_shutdown_runtime_resources()` in `app.py`
## 11. Backend — Performance History API
- [x] 11.1 Add `GET /admin/api/performance-history` endpoint in `admin_routes.py` (minutes param, clamped 1-180, returns snapshots array)
## 12. Frontend — Trend Charts
- [x] 12.1 Create `TrendChart.vue` component using vue-echarts VChart (line/area chart, dual yAxis support, time labels, autoresize)
- [x] 12.2 Add `loadPerformanceHistory()` fetch to `App.vue` and integrate into `refreshAll()`
- [x] 12.3 Add 4 TrendChart panels to `App.vue` template (pool saturation, query latency, Redis memory, cache hit rates)
- [x] 12.4 Add trend chart styles to `style.css`
## 13. Verification (Phase 2)
- [x] 13.1 Run `cd frontend && npm run build` — confirm no compilation errors
- [x] 13.2 Verify trend charts render with historical data after service restart + 60s collection

View File

@@ -0,0 +1,100 @@
## ADDED Requirements
### Requirement: Vue 3 SPA page replaces Jinja2 template
The `/admin/performance` route SHALL serve a Vue 3 SPA page built by Vite, replacing the existing Jinja2 server-rendered template. The SPA SHALL be registered as a Vite entry point and integrated into the portal-shell navigation as a `renderMode: 'native'` route.
#### Scenario: Page loads as Vue SPA
- **WHEN** user navigates to `/admin/performance`
- **THEN** the server SHALL return the Vite-built `admin-performance.html` static file (not a Jinja2 rendered template)
#### Scenario: Portal-shell integration
- **WHEN** the portal-shell renders `/admin/performance`
- **THEN** it SHALL load the page as a native Vue SPA (not an external iframe)
### Requirement: Status cards display system health
The dashboard SHALL display 4 status cards in a horizontal grid: Database, Redis, Circuit Breaker, and Worker PID. Each card SHALL show a StatusDot indicator (healthy/degraded/error/disabled) with the current status value.
#### Scenario: All systems healthy
- **WHEN** all backend systems report healthy status via `/admin/api/system-status`
- **THEN** all 4 status cards SHALL display green StatusDot indicators with their respective values
#### Scenario: Redis disabled
- **WHEN** Redis is disabled (`REDIS_ENABLED=false`)
- **THEN** the Redis status card SHALL display a disabled StatusDot indicator and the Redis cache panel SHALL show a graceful degradation message
### Requirement: Query performance panel with ECharts
The dashboard SHALL display query performance metrics (P50, P95, P99 latencies, total queries, slow queries) and an ECharts latency distribution chart, replacing the existing Chart.js implementation.
#### Scenario: Metrics loaded successfully
- **WHEN** `/admin/api/metrics` returns valid performance data
- **THEN** the panel SHALL display P50/P95/P99 latency values and render an ECharts bar chart showing latency distribution
#### Scenario: No metrics data
- **WHEN** `/admin/api/metrics` returns empty or null metrics
- **THEN** the panel SHALL display placeholder text indicating no data available
### Requirement: Redis cache detail panel
The dashboard SHALL display a Redis cache detail panel showing memory usage (as a GaugeBar), connected clients, hit rate percentage, peak memory, and a namespace key distribution table.
#### Scenario: Redis active with data
- **WHEN** `/admin/api/performance-detail` returns Redis data with namespace key counts
- **THEN** the panel SHALL display a memory GaugeBar, hit rate, client count, and a table listing each namespace with its key count
#### Scenario: Redis disabled
- **WHEN** Redis is disabled
- **THEN** the Redis detail panel SHALL display a disabled state message without errors
### Requirement: Memory cache panel
The dashboard SHALL display ProcessLevelCache statistics as grid cards (showing entries/max_size as a mini gauge and TTL) plus Route Cache telemetry (L1 hit rate, L2 hit rate, miss rate, total reads).
#### Scenario: Multiple caches registered
- **WHEN** `/admin/api/performance-detail` returns process_caches with multiple entries
- **THEN** the panel SHALL render one card per cache instance showing entries, max_size, TTL, and description
#### Scenario: Route cache telemetry
- **WHEN** `/admin/api/performance-detail` returns route_cache data
- **THEN** the panel SHALL display L1 hit rate, L2 hit rate, miss rate, and total reads
### Requirement: Connection pool panel
The dashboard SHALL display connection pool saturation as a GaugeBar and stat cards showing checked_out, checked_in, overflow, max_capacity, pool_size, pool_recycle, pool_timeout, and direct connection count.
#### Scenario: Pool under normal load
- **WHEN** pool saturation is below 80%
- **THEN** the GaugeBar SHALL display in a normal color (green/blue)
#### Scenario: Pool near saturation
- **WHEN** pool saturation exceeds 80%
- **THEN** the GaugeBar SHALL display in a warning color (yellow/orange/red)
### Requirement: Worker control panel
The dashboard SHALL display worker PID, uptime, cooldown status, and provide a restart button with a confirmation modal.
#### Scenario: Restart worker
- **WHEN** user clicks the restart button and confirms in the modal
- **THEN** the system SHALL POST to `/admin/api/worker/restart` and display the result
#### Scenario: Restart during cooldown
- **WHEN** worker is in cooldown period
- **THEN** the restart button SHALL be disabled with a cooldown indicator
### Requirement: System logs panel with filtering and pagination
The dashboard SHALL display system logs with level filtering, text search, and pagination controls.
#### Scenario: Filter by log level
- **WHEN** user selects a specific log level filter
- **THEN** only logs matching that level SHALL be displayed
#### Scenario: Paginate logs
- **WHEN** logs exceed the page size
- **THEN** pagination controls SHALL allow navigating between pages
### Requirement: Auto-refresh with toggle
The dashboard SHALL auto-refresh all panels every 30 seconds using `useAutoRefresh`. The user SHALL be able to toggle auto-refresh on/off and manually trigger a refresh.
#### Scenario: Auto-refresh enabled
- **WHEN** auto-refresh is enabled (default)
- **THEN** all panels SHALL refresh their data every 30 seconds via `Promise.all` parallel fetch
#### Scenario: Manual refresh
- **WHEN** user clicks the manual refresh button
- **THEN** all panels SHALL immediately refresh their data

View File

@@ -0,0 +1,56 @@
## ADDED Requirements
### Requirement: ProcessLevelCache stats method
Every `ProcessLevelCache` instance SHALL expose a `stats()` method that returns a dict containing `entries` (live entries count), `max_size`, and `ttl_seconds`.
#### Scenario: Stats on active cache
- **WHEN** `stats()` is called on a ProcessLevelCache with 5 live entries (max_size=32, ttl=30s)
- **THEN** it SHALL return `{"entries": 5, "max_size": 32, "ttl_seconds": 30}`
#### Scenario: Stats with expired entries
- **WHEN** `stats()` is called and some entries have exceeded TTL
- **THEN** `entries` SHALL only count entries where `now - timestamp <= ttl`
#### Scenario: Thread safety
- **WHEN** `stats()` is called concurrently with cache writes
- **THEN** it SHALL acquire the cache lock and return consistent data without races
### Requirement: ProcessLevelCache global registry
The system SHALL maintain a module-level registry in `core/cache.py` that maps cache names to `(description, instance)` tuples. Services SHALL register their cache instances at module load time via `register_process_cache(name, instance, description)`.
#### Scenario: Register and retrieve all caches
- **WHEN** multiple services register their caches and `get_all_process_cache_stats()` is called
- **THEN** it SHALL return a dict of `{name: {entries, max_size, ttl_seconds, description}}` for all registered caches
#### Scenario: Cache not registered
- **WHEN** a service's ProcessLevelCache is not registered
- **THEN** it SHALL NOT appear in `get_all_process_cache_stats()` output
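A registry satisfying both scenarios can be sketched in a few lines; the tuple layout `(description, instance)` follows the requirement, while everything else here is illustrative:

```python
# Module-level registry in core/cache.py: name -> (description, instance).
_PROCESS_CACHE_REGISTRY = {}

def register_process_cache(name, instance, description=""):
    # Called once per cache at module import time by the owning service.
    _PROCESS_CACHE_REGISTRY[name] = (description, instance)

def get_all_process_cache_stats():
    # Unregistered caches simply never appear here.
    out = {}
    for name, (description, instance) in _PROCESS_CACHE_REGISTRY.items():
        entry = dict(instance.stats())
        entry["description"] = description
        out[name] = entry
    return out
```

Registration at import time means the admin dashboard only ever reports caches that opted in, which keeps the registry free of internal or short-lived caches.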
### Requirement: Performance detail API endpoint
The system SHALL expose `GET /admin/api/performance-detail` that returns a JSON object with sections: `redis`, `process_caches`, `route_cache`, `db_pool`, and `direct_connections`.
#### Scenario: All systems available
- **WHEN** the API is called and all subsystems are healthy
- **THEN** it SHALL return all 5 sections with current telemetry data
#### Scenario: Redis disabled
- **WHEN** Redis is disabled (`REDIS_ENABLED=false`)
- **THEN** the `redis` section SHALL be `null` or contain `{"enabled": false}`, and other sections SHALL still return normally
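The degraded-subsystem behaviour can be isolated into a small aggregation helper that the Flask view would call. This is a sketch under stated assumptions: the real endpoint wires in `get_pool_status()`, the Redis client, and `.telemetry()` as the per-section collectors, and may choose `{"enabled": false}` rather than `null` for a disabled subsystem.

```python
def collect_performance_detail(sections):
    """Assemble the performance-detail payload from per-section collectors.

    ``sections`` maps section name -> zero-arg callable. A failing or
    disabled subsystem yields None for its section instead of failing
    the whole request, so the other sections still return normally.
    """
    payload = {}
    for name, collect in sections.items():
        try:
            payload[name] = collect()
        except Exception:
            payload[name] = None
    return payload
```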
### Requirement: Redis namespace key distribution
The performance-detail API SHALL scan Redis keys by namespace prefix and return key counts per namespace. Namespaces SHALL include: `data`, `route_cache`, `equipment_status`, `reject_dataset`, `meta`, `lock`, `scrap_exclusion`.
#### Scenario: Keys exist across namespaces
- **WHEN** Redis contains keys across multiple namespaces
- **THEN** the `redis.namespaces` array SHALL list each namespace with its `name` and `key_count`
#### Scenario: SCAN safety
- **WHEN** scanning Redis keys
- **THEN** the system SHALL use `SCAN` (not `KEYS`) to avoid blocking Redis
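A sketch of the namespace scan, using redis-py's `scan_iter` (which wraps the cursor-based `SCAN` command) rather than `KEYS`. The `<namespace>:<suffix>` key layout and the `count` hint are assumptions:

```python
NAMESPACES = ["data", "route_cache", "equipment_status", "reject_dataset",
              "meta", "lock", "scrap_exclusion"]

def namespace_key_counts(client, namespaces=NAMESPACES):
    """Count keys per namespace prefix with cursor-based SCAN, never KEYS,
    so a large keyspace does not block the Redis server."""
    counts = []
    for ns in namespaces:
        n = sum(1 for _ in client.scan_iter(match=f"{ns}:*", count=500))
        counts.append({"name": ns, "key_count": n})
    return counts
```

Note that `SCAN` is O(total keys) per namespace, so on very large keyspaces the counts are best collected once per refresh interval rather than per request.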
### Requirement: Route cache telemetry in performance detail
The performance-detail API SHALL include route cache telemetry from `get_route_cache_status()`, providing `mode`, `l1_size`, `l1_hit_rate`, `l2_hit_rate`, `miss_rate`, and `reads_total`.
#### Scenario: LayeredCache active
- **WHEN** route cache is in layered mode
- **THEN** the `route_cache` section SHALL include L1 and L2 hit rates from telemetry

## ADDED Requirements
### Requirement: Connection pool status in performance detail
The performance-detail API SHALL include `db_pool` section with `status` (checked_out, checked_in, overflow, max_capacity, saturation) from `get_pool_status()` and `config` (pool_size, max_overflow, pool_timeout, pool_recycle) from `get_pool_runtime_config()`.
#### Scenario: Pool status retrieved
- **WHEN** the API is called
- **THEN** `db_pool.status` SHALL contain current pool utilization metrics and `db_pool.config` SHALL contain the pool configuration values
#### Scenario: Saturation calculation
- **WHEN** the pool has 8 checked_out connections and max_capacity is 30
- **THEN** saturation SHALL be reported as approximately 26.7%
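The saturation figure in the scenario is simply checked-out connections over maximum capacity; a sketch with a zero-capacity guard (the one-decimal rounding is an assumption):

```python
def pool_saturation(checked_out, max_capacity):
    """Pool saturation as a percentage: checked_out / max_capacity * 100,
    rounded to one decimal, with a guard for a zero-capacity pool."""
    if max_capacity <= 0:
        return 0.0
    return round(checked_out / max_capacity * 100, 1)
```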
### Requirement: Direct Oracle connection counter
The system SHALL maintain a thread-safe monotonic counter in `database.py` that increments each time `get_db_connection()` or `read_sql_df_slow()` successfully creates a direct (non-pooled) Oracle connection.
#### Scenario: Counter increments on direct connection
- **WHEN** `get_db_connection()` successfully creates a connection
- **THEN** the direct connection counter SHALL increment by 1
#### Scenario: Counter in performance detail
- **WHEN** the performance-detail API is called
- **THEN** `direct_connections` SHALL contain `total_since_start` (counter value) and `worker_pid` (current process PID)
#### Scenario: Counter is per-worker
- **WHEN** multiple gunicorn workers are running
- **THEN** each worker SHALL maintain its own independent counter, and the API SHALL return the counter for the responding worker
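The per-worker property falls out naturally from using module-level state: each gunicorn worker is a separate process with its own copy of `database.py`'s globals. A sketch of the counter and its read side (names are assumptions):

```python
import os
import threading

_direct_conn_lock = threading.Lock()
_direct_conn_count = 0  # module-level, so inherently per-worker-process

def record_direct_connection():
    """Called after get_db_connection() / read_sql_df_slow() successfully
    opens a direct (non-pooled) Oracle connection."""
    global _direct_conn_count
    with _direct_conn_lock:
        _direct_conn_count += 1

def direct_connection_stats():
    with _direct_conn_lock:
        total = _direct_conn_count
    return {"total_since_start": total, "worker_pid": os.getpid()}
```

Reporting `worker_pid` alongside the count lets the dashboard make clear that the number reflects only the worker that happened to serve the request.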

## ADDED Requirements
### Requirement: SQLite metrics history store
The system SHALL provide a `MetricsHistoryStore` class in `core/metrics_history.py` that persists metrics snapshots to a SQLite database (`logs/metrics_history.sqlite` by default). The store SHALL use thread-local connections and a write lock, following the `LogStore` pattern in `core/log_store.py`.
#### Scenario: Write and query snapshots
- **WHEN** `write_snapshot(data)` is called with pool/redis/route_cache/latency metrics
- **THEN** a row SHALL be inserted into `metrics_snapshots` with the current ISO 8601 timestamp and worker PID
#### Scenario: Query by time range
- **WHEN** `query_snapshots(minutes=30)` is called
- **THEN** it SHALL return all rows from the last 30 minutes, ordered by timestamp ascending
#### Scenario: Retention cleanup
- **WHEN** `cleanup()` is called
- **THEN** rows older than `METRICS_HISTORY_RETENTION_DAYS` (default 3) SHALL be deleted, and total rows SHALL be capped at `METRICS_HISTORY_MAX_ROWS` (default 50000)
#### Scenario: Thread safety
- **WHEN** multiple threads write snapshots concurrently
- **THEN** the write lock SHALL serialize writes and prevent database corruption
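The thread-local-connection-plus-write-lock pattern borrowed from `LogStore` can be sketched as below. Column names and the JSON payload column are assumptions, and the retention `cleanup()` is omitted for brevity:

```python
import json
import os
import sqlite3
import threading
from datetime import datetime, timedelta, timezone

class MetricsHistoryStore:
    """Sketch of the LogStore pattern: one SQLite connection per thread,
    writes serialized by a single lock."""

    def __init__(self, path="logs/metrics_history.sqlite"):
        self._path = path
        self._local = threading.local()
        self._write_lock = threading.Lock()
        self._conn().execute(
            "CREATE TABLE IF NOT EXISTS metrics_snapshots ("
            "ts TEXT NOT NULL, worker_pid INTEGER NOT NULL, data TEXT NOT NULL)")

    def _conn(self):
        # Each thread lazily opens its own connection (sqlite3 connections
        # are not safe to share across threads by default).
        if not hasattr(self._local, "conn"):
            self._local.conn = sqlite3.connect(self._path)
        return self._local.conn

    def write_snapshot(self, data):
        ts = datetime.now(timezone.utc).isoformat()
        with self._write_lock:  # serialize writers
            self._conn().execute(
                "INSERT INTO metrics_snapshots VALUES (?, ?, ?)",
                (ts, os.getpid(), json.dumps(data)))
            self._conn().commit()

    def query_snapshots(self, minutes=30):
        # ISO 8601 UTC timestamps compare correctly as strings.
        cutoff = (datetime.now(timezone.utc)
                  - timedelta(minutes=minutes)).isoformat()
        cur = self._conn().execute(
            "SELECT ts, worker_pid, data FROM metrics_snapshots "
            "WHERE ts >= ? ORDER BY ts ASC", (cutoff,))
        return [{"ts": t, "worker_pid": pid, **json.loads(d)}
                for t, pid, d in cur]
```

Note that with thread-local connections an `":memory:"` path gives each thread an independent database, so that mode is only suitable for single-threaded tests.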
### Requirement: Background metrics collector
The system SHALL provide a `MetricsHistoryCollector` class that runs a daemon thread collecting metrics snapshots at a configurable interval (default 30 seconds, via `METRICS_HISTORY_INTERVAL` env var).
#### Scenario: Automatic collection
- **WHEN** the collector is started via `start_metrics_history(app)`
- **THEN** it SHALL collect pool status, Redis info, route cache status, and query latency metrics every interval and write them to the store
#### Scenario: Graceful shutdown
- **WHEN** `stop_metrics_history()` is called
- **THEN** the collector thread SHALL stop within one interval period
#### Scenario: Subsystem unavailability
- **WHEN** a subsystem (e.g., Redis) is unavailable during collection
- **THEN** the collector SHALL write null/0 for those fields and continue collecting other metrics
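A daemon-thread collector satisfying the shutdown and resilience scenarios can be built around `threading.Event.wait(interval)`, which doubles as the sleep and the stop signal, so `stop()` takes effect within one interval. The `collect`/`sink` callables are stand-ins for the real pool/Redis/route-cache gatherers and the store:

```python
import threading

class MetricsHistoryCollector:
    """Sketch: daemon thread calling ``collect`` every ``interval`` seconds
    and passing each snapshot to ``sink``."""

    def __init__(self, collect, sink, interval=30.0):
        self._collect = collect
        self._sink = sink
        self._interval = interval
        self._stop = threading.Event()
        self._thread = None

    def start(self):
        self._thread = threading.Thread(target=self._run, daemon=True)
        self._thread.start()

    def _run(self):
        # wait() returns True once stop() sets the event, ending the loop
        # within at most one interval.
        while not self._stop.wait(self._interval):
            try:
                self._sink(self._collect())
            except Exception:
                pass  # one failed cycle must not kill the collector thread

    def stop(self):
        self._stop.set()
        if self._thread is not None:
            self._thread.join(timeout=self._interval)
```

The null/0 handling for an unavailable subsystem would live inside `collect`, which catches per-subsystem errors and still emits a snapshot.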
### Requirement: Performance history API endpoint
The system SHALL expose `GET /admin/api/performance-history` that returns historical metrics snapshots.
#### Scenario: Query with time range
- **WHEN** the API is called with `?minutes=30`
- **THEN** it SHALL return `{"success": true, "data": {"snapshots": [...], "count": N}}`
#### Scenario: Time range bounds
- **WHEN** `minutes` is less than 1 or greater than 180
- **THEN** it SHALL be clamped to the range [1, 180]
#### Scenario: Admin authentication
- **WHEN** the API is called without admin authentication
- **THEN** it SHALL be rejected by the `@admin_required` decorator
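The clamping rule is a one-liner worth pinning down, since `?minutes=` arrives as an untrusted string. A sketch, where the 30-minute fallback for non-numeric input is an assumption:

```python
def clamp_minutes(raw, lo=1, hi=180):
    """Parse the ?minutes= query parameter and clamp it to [lo, hi];
    non-numeric or missing input falls back to a 30-minute default."""
    try:
        minutes = int(raw)
    except (TypeError, ValueError):
        minutes = 30
    return max(lo, min(hi, minutes))
```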
### Requirement: Frontend trend charts
The system SHALL display 4 trend chart panels in the admin performance dashboard using vue-echarts VChart line/area charts.
#### Scenario: Trend charts with data
- **WHEN** historical snapshots contain more than 1 data point
- **THEN** the dashboard SHALL display trend charts for: connection pool saturation, query latency (P50/P95/P99), Redis memory, and cache hit rates
#### Scenario: Trend charts without data
- **WHEN** historical snapshots are empty or contain only 1 data point
- **THEN** the trend charts SHALL NOT be displayed (hidden via `v-if`)
#### Scenario: Auto-refresh
- **WHEN** the dashboard auto-refreshes
- **THEN** historical data SHALL also be refreshed alongside real-time metrics