feat(admin-performance): Vue 3 SPA dashboard with metrics history trending

Rebuild /admin/performance as a Vue 3 SPA with ECharts, replacing the
Jinja2 server-rendered template. Adds cache telemetry infrastructure,
connection pool monitoring, and SQLite-backed historical metrics
collection with trend chart visualization.

Co-Authored-By: Claude Opus 4.6 <noreply@anthropic.com>
Author: egg
Date: 2026-02-23 09:18:10 +08:00
Parent: 1c46f5eb69
Commit: 5d570ca7a2
32 changed files with 2903 additions and 261 deletions

@@ -0,0 +1,100 @@
## ADDED Requirements
### Requirement: Vue 3 SPA page replaces Jinja2 template
The `/admin/performance` route SHALL serve a Vue 3 SPA page built by Vite, replacing the existing Jinja2 server-rendered template. The SPA SHALL be registered as a Vite entry point and integrated into the portal-shell navigation as a `renderMode: 'native'` route.
#### Scenario: Page loads as Vue SPA
- **WHEN** user navigates to `/admin/performance`
- **THEN** the server SHALL return the Vite-built `admin-performance.html` static file (not a Jinja2 rendered template)
#### Scenario: Portal-shell integration
- **WHEN** the portal-shell renders `/admin/performance`
- **THEN** it SHALL load the page as a native Vue SPA (not an external iframe)
### Requirement: Status cards display system health
The dashboard SHALL display 4 status cards in a horizontal grid: Database, Redis, Circuit Breaker, and Worker PID. Each card SHALL show a StatusDot indicator (healthy/degraded/error/disabled) with the current status value.
#### Scenario: All systems healthy
- **WHEN** all backend systems report healthy status via `/admin/api/system-status`
- **THEN** all 4 status cards SHALL display green StatusDot indicators with their respective values
#### Scenario: Redis disabled
- **WHEN** Redis is disabled (`REDIS_ENABLED=false`)
- **THEN** the Redis status card SHALL display a disabled StatusDot indicator and the Redis cache panel SHALL show a graceful degradation message
### Requirement: Query performance panel with ECharts
The dashboard SHALL display query performance metrics (P50, P95, P99 latencies, total queries, slow queries) and an ECharts latency distribution chart, replacing the existing Chart.js implementation.
#### Scenario: Metrics loaded successfully
- **WHEN** `/admin/api/metrics` returns valid performance data
- **THEN** the panel SHALL display P50/P95/P99 latency values and render an ECharts bar chart showing latency distribution
#### Scenario: No metrics data
- **WHEN** `/admin/api/metrics` returns empty or null metrics
- **THEN** the panel SHALL display placeholder text indicating no data available
### Requirement: Redis cache detail panel
The dashboard SHALL display a Redis cache detail panel showing memory usage (as a GaugeBar), connected clients, hit rate percentage, peak memory, and a namespace key distribution table.
#### Scenario: Redis active with data
- **WHEN** `/admin/api/performance-detail` returns Redis data with namespace key counts
- **THEN** the panel SHALL display a memory GaugeBar, hit rate, client count, and a table listing each namespace with its key count
#### Scenario: Redis disabled
- **WHEN** Redis is disabled
- **THEN** the Redis detail panel SHALL display a disabled state message without errors
### Requirement: Memory cache panel
The dashboard SHALL display ProcessLevelCache statistics as grid cards (showing entries/max_size as a mini gauge and TTL) plus Route Cache telemetry (L1 hit rate, L2 hit rate, miss rate, total reads).
#### Scenario: Multiple caches registered
- **WHEN** `/admin/api/performance-detail` returns process_caches with multiple entries
- **THEN** the panel SHALL render one card per cache instance showing entries, max_size, TTL, and description
#### Scenario: Route cache telemetry
- **WHEN** `/admin/api/performance-detail` returns route_cache data
- **THEN** the panel SHALL display L1 hit rate, L2 hit rate, miss rate, and total reads
### Requirement: Connection pool panel
The dashboard SHALL display connection pool saturation as a GaugeBar and stat cards showing checked_out, checked_in, overflow, max_capacity, pool_size, pool_recycle, pool_timeout, and direct connection count.
#### Scenario: Pool under normal load
- **WHEN** pool saturation is below 80%
- **THEN** the GaugeBar SHALL display in a normal color (green/blue)
#### Scenario: Pool near saturation
- **WHEN** pool saturation exceeds 80%
- **THEN** the GaugeBar SHALL display in a warning color (yellow/orange/red)
### Requirement: Worker control panel
The dashboard SHALL display worker PID, uptime, cooldown status, and provide a restart button with a confirmation modal.
#### Scenario: Restart worker
- **WHEN** user clicks the restart button and confirms in the modal
- **THEN** the system SHALL POST to `/admin/api/worker/restart` and display the result
#### Scenario: Restart during cooldown
- **WHEN** worker is in cooldown period
- **THEN** the restart button SHALL be disabled with a cooldown indicator
### Requirement: System logs panel with filtering and pagination
The dashboard SHALL display system logs with level filtering, text search, and pagination controls.
#### Scenario: Filter by log level
- **WHEN** user selects a specific log level filter
- **THEN** only logs matching that level SHALL be displayed
#### Scenario: Paginate logs
- **WHEN** logs exceed the page size
- **THEN** pagination controls SHALL allow navigating between pages
### Requirement: Auto-refresh with toggle
The dashboard SHALL auto-refresh all panels every 30 seconds using `useAutoRefresh`. The user SHALL be able to toggle auto-refresh on/off and manually trigger a refresh.
#### Scenario: Auto-refresh enabled
- **WHEN** auto-refresh is enabled (default)
- **THEN** all panels SHALL refresh their data every 30 seconds via `Promise.all` parallel fetch
#### Scenario: Manual refresh
- **WHEN** user clicks the manual refresh button
- **THEN** all panels SHALL immediately refresh their data

@@ -0,0 +1,56 @@
## ADDED Requirements
### Requirement: ProcessLevelCache stats method
Every `ProcessLevelCache` instance SHALL expose a `stats()` method that returns a dict containing `entries` (live entries count), `max_size`, and `ttl_seconds`.
#### Scenario: Stats on active cache
- **WHEN** `stats()` is called on a ProcessLevelCache with 5 live entries (max_size=32, ttl=30s)
- **THEN** it SHALL return `{"entries": 5, "max_size": 32, "ttl_seconds": 30}`
#### Scenario: Stats with expired entries
- **WHEN** `stats()` is called and some entries have exceeded TTL
- **THEN** `entries` SHALL only count entries where `now - timestamp <= ttl`
#### Scenario: Thread safety
- **WHEN** `stats()` is called concurrently with cache writes
- **THEN** it SHALL acquire the cache lock and return consistent data without races
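A minimal Python sketch of what the three scenarios above imply, assuming the cache stores `(value, timestamp)` pairs behind a single lock (the internal field names here are illustrative, not the actual implementation):

```python
import threading
import time


class ProcessLevelCache:
    """Sketch only: internal layout is an assumption for illustration."""

    def __init__(self, max_size: int = 32, ttl_seconds: int = 30):
        self.max_size = max_size
        self.ttl_seconds = ttl_seconds
        self._data: dict = {}  # key -> (value, timestamp)
        self._lock = threading.Lock()

    def set(self, key, value):
        with self._lock:
            self._data[key] = (value, time.monotonic())

    def stats(self) -> dict:
        # Acquire the same lock as writers so a concurrent set()
        # cannot produce an inconsistent count (thread-safety scenario).
        with self._lock:
            now = time.monotonic()
            # Only count entries still within TTL (expired-entries scenario).
            live = sum(
                1 for _, ts in self._data.values()
                if now - ts <= self.ttl_seconds
            )
            return {
                "entries": live,
                "max_size": self.max_size,
                "ttl_seconds": self.ttl_seconds,
            }
```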
### Requirement: ProcessLevelCache global registry
The system SHALL maintain a module-level registry in `core/cache.py` that maps cache names to `(description, instance)` tuples. Services SHALL register their cache instances at module load time via `register_process_cache(name, instance, description)`.
#### Scenario: Register and retrieve all caches
- **WHEN** multiple services register their caches and `get_all_process_cache_stats()` is called
- **THEN** it SHALL return a dict of `{name: {entries, max_size, ttl_seconds, description}}` for all registered caches
#### Scenario: Cache not registered
- **WHEN** a service's ProcessLevelCache is not registered
- **THEN** it SHALL NOT appear in `get_all_process_cache_stats()` output
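The registry contract above can be sketched in a few lines; the tuple order `(description, instance)` follows the requirement text, everything else is an assumption:

```python
from typing import Any, Dict, Tuple

# Module-level registry in core/cache.py: name -> (description, instance).
_PROCESS_CACHES: Dict[str, Tuple[str, Any]] = {}


def register_process_cache(name: str, instance: Any, description: str = "") -> None:
    """Called by services at module load time to expose their cache."""
    _PROCESS_CACHES[name] = (description, instance)


def get_all_process_cache_stats() -> Dict[str, dict]:
    """Merge each registered cache's stats() with its description.

    Unregistered caches simply never appear here, matching the
    'cache not registered' scenario.
    """
    return {
        name: {**instance.stats(), "description": description}
        for name, (description, instance) in _PROCESS_CACHES.items()
    }
```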
### Requirement: Performance detail API endpoint
The system SHALL expose `GET /admin/api/performance-detail` that returns a JSON object with sections: `redis`, `process_caches`, `route_cache`, `db_pool`, and `direct_connections`.
#### Scenario: All systems available
- **WHEN** the API is called and all subsystems are healthy
- **THEN** it SHALL return all 5 sections with current telemetry data
#### Scenario: Redis disabled
- **WHEN** Redis is disabled (`REDIS_ENABLED=false`)
- **THEN** the `redis` section SHALL be `null` or contain `{"enabled": false}`, and other sections SHALL still return normally
### Requirement: Redis namespace key distribution
The performance-detail API SHALL scan Redis keys by namespace prefix and return key counts per namespace. Namespaces SHALL include: `data`, `route_cache`, `equipment_status`, `reject_dataset`, `meta`, `lock`, `scrap_exclusion`.
#### Scenario: Keys exist across namespaces
- **WHEN** Redis contains keys across multiple namespaces
- **THEN** the `redis.namespaces` array SHALL list each namespace with its `name` and `key_count`
#### Scenario: SCAN safety
- **WHEN** scanning Redis keys
- **THEN** the system SHALL use `SCAN` (not `KEYS`) to avoid blocking Redis
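One way to satisfy the SCAN-safety scenario with redis-py's `scan_iter` (which wraps the cursor loop); the `namespace:*` key pattern is an assumption about how keys are prefixed:

```python
def count_namespace_keys(client, namespaces):
    """Count keys per namespace prefix using non-blocking SCAN.

    scan_iter issues SCAN in batches (count=500 per round trip)
    instead of KEYS, so the Redis event loop is never blocked on
    a full keyspace walk.
    """
    result = []
    for ns in namespaces:
        key_count = sum(1 for _ in client.scan_iter(match=f"{ns}:*", count=500))
        result.append({"name": ns, "key_count": key_count})
    return result
```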
### Requirement: Route cache telemetry in performance detail
The performance-detail API SHALL include route cache telemetry from `get_route_cache_status()`, providing `mode`, `l1_size`, `l1_hit_rate`, `l2_hit_rate`, `miss_rate`, and `reads_total`.
#### Scenario: LayeredCache active
- **WHEN** route cache is in layered mode
- **THEN** the `route_cache` section SHALL include L1 and L2 hit rates from telemetry

@@ -0,0 +1,27 @@
## ADDED Requirements
### Requirement: Connection pool status in performance detail
The performance-detail API SHALL include `db_pool` section with `status` (checked_out, checked_in, overflow, max_capacity, saturation) from `get_pool_status()` and `config` (pool_size, max_overflow, pool_timeout, pool_recycle) from `get_pool_runtime_config()`.
#### Scenario: Pool status retrieved
- **WHEN** the API is called
- **THEN** `db_pool.status` SHALL contain current pool utilization metrics and `db_pool.config` SHALL contain the pool configuration values
#### Scenario: Saturation calculation
- **WHEN** the pool has 8 checked_out connections and max_capacity is 30
- **THEN** saturation SHALL be reported as approximately 26.7%
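The saturation arithmetic in the scenario above (8 of 30 connections checked out) reduces to a one-liner; rounding to one decimal place is an assumption, the guard against a zero-capacity pool is defensive:

```python
def pool_saturation(checked_out: int, max_capacity: int) -> float:
    """Pool saturation as a percentage of total capacity."""
    if max_capacity <= 0:
        return 0.0
    return round(checked_out / max_capacity * 100, 1)
```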
### Requirement: Direct Oracle connection counter
The system SHALL maintain a thread-safe monotonic counter in `database.py` that increments each time `get_db_connection()` or `read_sql_df_slow()` successfully creates a direct (non-pooled) Oracle connection.
#### Scenario: Counter increments on direct connection
- **WHEN** `get_db_connection()` successfully creates a connection
- **THEN** the direct connection counter SHALL increment by 1
#### Scenario: Counter in performance detail
- **WHEN** the performance-detail API is called
- **THEN** `direct_connections` SHALL contain `total_since_start` (counter value) and `worker_pid` (current process PID)
#### Scenario: Counter is per-worker
- **WHEN** multiple gunicorn workers are running
- **THEN** each worker SHALL maintain its own independent counter, and the API SHALL return the counter for the responding worker

@@ -0,0 +1,65 @@
## ADDED Requirements
### Requirement: SQLite metrics history store
The system SHALL provide a `MetricsHistoryStore` class in `core/metrics_history.py` that persists metrics snapshots to a SQLite database (`logs/metrics_history.sqlite` by default). The store SHALL use thread-local connections and a write lock, following the `LogStore` pattern in `core/log_store.py`.
#### Scenario: Write and query snapshots
- **WHEN** `write_snapshot(data)` is called with pool/redis/route_cache/latency metrics
- **THEN** a row SHALL be inserted into `metrics_snapshots` with the current ISO 8601 timestamp and worker PID
#### Scenario: Query by time range
- **WHEN** `query_snapshots(minutes=30)` is called
- **THEN** it SHALL return all rows from the last 30 minutes, ordered by timestamp ascending
#### Scenario: Retention cleanup
- **WHEN** `cleanup()` is called
- **THEN** rows older than `METRICS_HISTORY_RETENTION_DAYS` (default 3) SHALL be deleted, and total rows SHALL be capped at `METRICS_HISTORY_MAX_ROWS` (default 50000)
#### Scenario: Thread safety
- **WHEN** multiple threads write snapshots concurrently
- **THEN** the write lock SHALL serialize writes and prevent database corruption
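The store's concurrency pattern (thread-local connections for reads, one lock serializing writes) can be sketched as below; the column layout and JSON payload column are assumptions, only the table name, timestamp format, and locking discipline come from the requirement:

```python
import json
import os
import sqlite3
import threading
from datetime import datetime, timedelta, timezone


class MetricsHistoryStore:
    """Sketch of the LogStore-style pattern from core/log_store.py."""

    def __init__(self, path: str = "logs/metrics_history.sqlite"):
        self._path = path
        self._local = threading.local()   # one connection per thread
        self._write_lock = threading.Lock()
        with self._conn() as conn:
            conn.execute(
                "CREATE TABLE IF NOT EXISTS metrics_snapshots ("
                " id INTEGER PRIMARY KEY AUTOINCREMENT,"
                " ts TEXT NOT NULL,"
                " worker_pid INTEGER NOT NULL,"
                " payload TEXT NOT NULL)"
            )

    def _conn(self) -> sqlite3.Connection:
        # SQLite connections must not be shared across threads by
        # default, hence thread-local storage.
        if not hasattr(self._local, "conn"):
            self._local.conn = sqlite3.connect(self._path)
        return self._local.conn

    def write_snapshot(self, data: dict) -> None:
        ts = datetime.now(timezone.utc).isoformat()  # ISO 8601
        # The lock serializes writers; the connection context manager
        # commits the transaction on success.
        with self._write_lock, self._conn() as conn:
            conn.execute(
                "INSERT INTO metrics_snapshots (ts, worker_pid, payload)"
                " VALUES (?, ?, ?)",
                (ts, os.getpid(), json.dumps(data)),
            )

    def query_snapshots(self, minutes: int = 30) -> list:
        cutoff = (datetime.now(timezone.utc) - timedelta(minutes=minutes)).isoformat()
        rows = self._conn().execute(
            "SELECT ts, worker_pid, payload FROM metrics_snapshots"
            " WHERE ts >= ? ORDER BY ts ASC",
            (cutoff,),
        ).fetchall()
        return [{"ts": t, "worker_pid": pid, **json.loads(p)} for t, pid, p in rows]
```

Comparing ISO 8601 UTC strings lexicographically is safe here because every timestamp is written in the same format.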
### Requirement: Background metrics collector
The system SHALL provide a `MetricsHistoryCollector` class that runs a daemon thread collecting metrics snapshots at a configurable interval (default 30 seconds, via `METRICS_HISTORY_INTERVAL` env var).
#### Scenario: Automatic collection
- **WHEN** the collector is started via `start_metrics_history(app)`
- **THEN** it SHALL collect pool status, Redis info, route cache status, and query latency metrics every interval and write them to the store
#### Scenario: Graceful shutdown
- **WHEN** `stop_metrics_history()` is called
- **THEN** the collector thread SHALL stop within one interval period
#### Scenario: Subsystem unavailability
- **WHEN** a subsystem (e.g., Redis) is unavailable during collection
- **THEN** the collector SHALL write null/0 for those fields and continue collecting other metrics
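The graceful-shutdown scenario is the interesting part of the collector: using `threading.Event.wait(interval)` as the sleep means `stop()` wakes the thread immediately, so it exits within one interval. A sketch under assumed names:

```python
import threading


class MetricsHistoryCollector:
    """Daemon-thread collector; collect_fn stands in for the real
    snapshot routine (pool status, Redis info, latency metrics)."""

    def __init__(self, collect_fn, interval_seconds: float = 30):
        self._collect_fn = collect_fn
        self._interval = interval_seconds
        self._stop_event = threading.Event()
        self._thread = None

    def start(self) -> None:
        self._thread = threading.Thread(target=self._run, daemon=True)
        self._thread.start()

    def _run(self) -> None:
        while not self._stop_event.is_set():
            try:
                self._collect_fn()
            except Exception:
                # An unavailable subsystem must not kill the loop;
                # the real collector would write null/0 fields instead.
                pass
            # wait() doubles as the sleep and the shutdown signal:
            # it returns early as soon as stop() sets the event.
            self._stop_event.wait(self._interval)

    def stop(self) -> None:
        self._stop_event.set()
        if self._thread is not None:
            self._thread.join(timeout=self._interval + 1)
```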
### Requirement: Performance history API endpoint
The system SHALL expose `GET /admin/api/performance-history` that returns historical metrics snapshots.
#### Scenario: Query with time range
- **WHEN** the API is called with `?minutes=30`
- **THEN** it SHALL return `{"success": true, "data": {"snapshots": [...], "count": N}}`
#### Scenario: Time range bounds
- **WHEN** `minutes` is less than 1 or greater than 180
- **THEN** it SHALL be clamped to the range [1, 180]
#### Scenario: Admin authentication
- **WHEN** the API is called without admin authentication
- **THEN** it SHALL be rejected by the `@admin_required` decorator
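The clamping rule in the time-range-bounds scenario is a simple min/max; the fallback default of 30 for a missing or non-numeric parameter is an assumption, not stated by the requirement:

```python
def clamp_minutes(raw, lo: int = 1, hi: int = 180) -> int:
    """Clamp the ?minutes= query parameter into [1, 180]."""
    try:
        value = int(raw)
    except (TypeError, ValueError):
        value = 30  # assumed default when the parameter is absent/invalid
    return max(lo, min(hi, value))
```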
### Requirement: Frontend trend charts
The system SHALL display 4 trend chart panels in the admin performance dashboard using vue-echarts VChart line/area charts.
#### Scenario: Trend charts with data
- **WHEN** historical snapshots contain more than 1 data point
- **THEN** the dashboard SHALL display trend charts for: connection pool saturation, query latency (P50/P95/P99), Redis memory, and cache hit rates
#### Scenario: Trend charts without data
- **WHEN** historical snapshots are empty or contain only 1 data point
- **THEN** the trend charts SHALL NOT be displayed (hidden via `v-if`)
#### Scenario: Auto-refresh
- **WHEN** the dashboard auto-refreshes
- **THEN** historical data SHALL also be refreshed alongside real-time metrics