feat(admin-performance): Vue 3 SPA dashboard with metrics history trending

Rebuild /admin/performance as a Vue 3 SPA with ECharts, replacing the
Jinja2 server-rendered template. Adds cache telemetry infrastructure,
connection pool monitoring, and SQLite-backed historical metrics
collection with trend chart visualization.

Co-Authored-By: Claude Opus 4.6 <noreply@anthropic.com>
Author: egg
Date: 2026-02-23 09:18:10 +08:00
Parent: 1c46f5eb69
Commit: 5d570ca7a2
32 changed files with 2903 additions and 261 deletions

@@ -0,0 +1,100 @@
## ADDED Requirements
### Requirement: Vue 3 SPA page replaces Jinja2 template
The `/admin/performance` route SHALL serve a Vue 3 SPA page built by Vite, replacing the existing Jinja2 server-rendered template. The SPA SHALL be registered as a Vite entry point and integrated into the portal-shell navigation as a `renderMode: 'native'` route.
#### Scenario: Page loads as Vue SPA
- **WHEN** user navigates to `/admin/performance`
- **THEN** the server SHALL return the Vite-built `admin-performance.html` static file (not a Jinja2 rendered template)
#### Scenario: Portal-shell integration
- **WHEN** the portal-shell renders `/admin/performance`
- **THEN** it SHALL load the page as a native Vue SPA (not an external iframe)
### Requirement: Status cards display system health
The dashboard SHALL display 4 status cards in a horizontal grid: Database, Redis, Circuit Breaker, and Worker PID. Each card SHALL show a StatusDot indicator (healthy/degraded/error/disabled) with the current status value.
#### Scenario: All systems healthy
- **WHEN** all backend systems report healthy status via `/admin/api/system-status`
- **THEN** all 4 status cards SHALL display green StatusDot indicators with their respective values
#### Scenario: Redis disabled
- **WHEN** Redis is disabled (`REDIS_ENABLED=false`)
- **THEN** the Redis status card SHALL display a disabled StatusDot indicator and the Redis cache panel SHALL show a graceful degradation message
### Requirement: Query performance panel with ECharts
The dashboard SHALL display query performance metrics (P50, P95, P99 latencies, total queries, slow queries) and an ECharts latency distribution chart, replacing the existing Chart.js implementation.
#### Scenario: Metrics loaded successfully
- **WHEN** `/admin/api/metrics` returns valid performance data
- **THEN** the panel SHALL display P50/P95/P99 latency values and render an ECharts bar chart showing latency distribution
#### Scenario: No metrics data
- **WHEN** `/admin/api/metrics` returns empty or null metrics
- **THEN** the panel SHALL display placeholder text indicating no data available
### Requirement: Redis cache detail panel
The dashboard SHALL display a Redis cache detail panel showing memory usage (as a GaugeBar), connected clients, hit rate percentage, peak memory, and a namespace key distribution table.
#### Scenario: Redis active with data
- **WHEN** `/admin/api/performance-detail` returns Redis data with namespace key counts
- **THEN** the panel SHALL display a memory GaugeBar, hit rate, client count, and a table listing each namespace with its key count
#### Scenario: Redis disabled
- **WHEN** Redis is disabled
- **THEN** the Redis detail panel SHALL display a disabled state message without errors
### Requirement: Memory cache panel
The dashboard SHALL display ProcessLevelCache statistics as grid cards (showing entries/max_size as a mini gauge and TTL) plus Route Cache telemetry (L1 hit rate, L2 hit rate, miss rate, total reads).
#### Scenario: Multiple caches registered
- **WHEN** `/admin/api/performance-detail` returns process_caches with multiple entries
- **THEN** the panel SHALL render one card per cache instance showing entries, max_size, TTL, and description
#### Scenario: Route cache telemetry
- **WHEN** `/admin/api/performance-detail` returns route_cache data
- **THEN** the panel SHALL display L1 hit rate, L2 hit rate, miss rate, and total reads
### Requirement: Connection pool panel
The dashboard SHALL display connection pool saturation as a GaugeBar and stat cards showing checked_out, checked_in, overflow, max_capacity, pool_size, pool_recycle, pool_timeout, and direct connection count.
#### Scenario: Pool under normal load
- **WHEN** pool saturation is below 80%
- **THEN** the GaugeBar SHALL display in a normal color (green/blue)
#### Scenario: Pool near saturation
- **WHEN** pool saturation exceeds 80%
- **THEN** the GaugeBar SHALL display in a warning color (yellow/orange/red)
### Requirement: Worker control panel
The dashboard SHALL display worker PID, uptime, cooldown status, and provide a restart button with a confirmation modal.
#### Scenario: Restart worker
- **WHEN** user clicks the restart button and confirms in the modal
- **THEN** the system SHALL POST to `/admin/api/worker/restart` and display the result
#### Scenario: Restart during cooldown
- **WHEN** worker is in cooldown period
- **THEN** the restart button SHALL be disabled with a cooldown indicator
### Requirement: System logs panel with filtering and pagination
The dashboard SHALL display system logs with level filtering, text search, and pagination controls.
#### Scenario: Filter by log level
- **WHEN** user selects a specific log level filter
- **THEN** only logs matching that level SHALL be displayed
#### Scenario: Paginate logs
- **WHEN** logs exceed the page size
- **THEN** pagination controls SHALL allow navigating between pages
### Requirement: Auto-refresh with toggle
The dashboard SHALL auto-refresh all panels every 30 seconds using `useAutoRefresh`. The user SHALL be able to toggle auto-refresh on/off and manually trigger a refresh.
#### Scenario: Auto-refresh enabled
- **WHEN** auto-refresh is enabled (default)
- **THEN** all panels SHALL refresh their data every 30 seconds via `Promise.all` parallel fetch
#### Scenario: Manual refresh
- **WHEN** user clicks the manual refresh button
- **THEN** all panels SHALL immediately refresh their data

@@ -0,0 +1,56 @@
## ADDED Requirements
### Requirement: ProcessLevelCache stats method
Every `ProcessLevelCache` instance SHALL expose a `stats()` method that returns a dict containing `entries` (live entries count), `max_size`, and `ttl_seconds`.
#### Scenario: Stats on active cache
- **WHEN** `stats()` is called on a ProcessLevelCache with 5 live entries (max_size=32, ttl=30s)
- **THEN** it SHALL return `{"entries": 5, "max_size": 32, "ttl_seconds": 30}`
#### Scenario: Stats with expired entries
- **WHEN** `stats()` is called and some entries have exceeded TTL
- **THEN** `entries` SHALL only count entries where `now - timestamp <= ttl`
#### Scenario: Thread safety
- **WHEN** `stats()` is called concurrently with cache writes
- **THEN** it SHALL acquire the cache lock and return consistent data without races
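A minimal Python sketch of what the three scenarios above imply, assuming the cache stores `(value, timestamp)` pairs behind a single lock (the internal field names here are illustrative, not the actual implementation):

```python
import threading
import time


class ProcessLevelCache:
    """Sketch only: internal layout is an assumption for illustration."""

    def __init__(self, max_size: int = 32, ttl_seconds: int = 30):
        self.max_size = max_size
        self.ttl_seconds = ttl_seconds
        self._data: dict = {}  # key -> (value, timestamp)
        self._lock = threading.Lock()

    def set(self, key, value):
        with self._lock:
            self._data[key] = (value, time.monotonic())

    def stats(self) -> dict:
        # Acquire the same lock as writers so a concurrent set()
        # cannot produce an inconsistent count (thread-safety scenario).
        with self._lock:
            now = time.monotonic()
            # Only count entries still within TTL (expired-entries scenario).
            live = sum(
                1 for _, ts in self._data.values()
                if now - ts <= self.ttl_seconds
            )
            return {
                "entries": live,
                "max_size": self.max_size,
                "ttl_seconds": self.ttl_seconds,
            }
```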
### Requirement: ProcessLevelCache global registry
The system SHALL maintain a module-level registry in `core/cache.py` that maps cache names to `(description, instance)` tuples. Services SHALL register their cache instances at module load time via `register_process_cache(name, instance, description)`.
#### Scenario: Register and retrieve all caches
- **WHEN** multiple services register their caches and `get_all_process_cache_stats()` is called
- **THEN** it SHALL return a dict of `{name: {entries, max_size, ttl_seconds, description}}` for all registered caches
#### Scenario: Cache not registered
- **WHEN** a service's ProcessLevelCache is not registered
- **THEN** it SHALL NOT appear in `get_all_process_cache_stats()` output
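The registry contract above can be sketched in a few lines; the tuple order `(description, instance)` follows the requirement text, everything else is an assumption:

```python
from typing import Any, Dict, Tuple

# Module-level registry in core/cache.py: name -> (description, instance).
_PROCESS_CACHES: Dict[str, Tuple[str, Any]] = {}


def register_process_cache(name: str, instance: Any, description: str = "") -> None:
    """Called by services at module load time to expose their cache."""
    _PROCESS_CACHES[name] = (description, instance)


def get_all_process_cache_stats() -> Dict[str, dict]:
    """Merge each registered cache's stats() with its description.

    Unregistered caches simply never appear here, matching the
    'cache not registered' scenario.
    """
    return {
        name: {**instance.stats(), "description": description}
        for name, (description, instance) in _PROCESS_CACHES.items()
    }
```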
### Requirement: Performance detail API endpoint
The system SHALL expose `GET /admin/api/performance-detail` that returns a JSON object with sections: `redis`, `process_caches`, `route_cache`, `db_pool`, and `direct_connections`.
#### Scenario: All systems available
- **WHEN** the API is called and all subsystems are healthy
- **THEN** it SHALL return all 5 sections with current telemetry data
#### Scenario: Redis disabled
- **WHEN** Redis is disabled (`REDIS_ENABLED=false`)
- **THEN** the `redis` section SHALL be `null` or contain `{"enabled": false}`, and other sections SHALL still return normally
### Requirement: Redis namespace key distribution
The performance-detail API SHALL scan Redis keys by namespace prefix and return key counts per namespace. Namespaces SHALL include: `data`, `route_cache`, `equipment_status`, `reject_dataset`, `meta`, `lock`, `scrap_exclusion`.
#### Scenario: Keys exist across namespaces
- **WHEN** Redis contains keys across multiple namespaces
- **THEN** the `redis.namespaces` array SHALL list each namespace with its `name` and `key_count`
#### Scenario: SCAN safety
- **WHEN** scanning Redis keys
- **THEN** the system SHALL use `SCAN` (not `KEYS`) to avoid blocking Redis
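One way to satisfy the SCAN-safety scenario with redis-py's `scan_iter` (which wraps the cursor loop); the `namespace:*` key pattern is an assumption about how keys are prefixed:

```python
def count_namespace_keys(client, namespaces):
    """Count keys per namespace prefix using non-blocking SCAN.

    scan_iter issues SCAN in batches (count=500 per round trip)
    instead of KEYS, so the Redis event loop is never blocked on
    a full keyspace walk.
    """
    result = []
    for ns in namespaces:
        key_count = sum(1 for _ in client.scan_iter(match=f"{ns}:*", count=500))
        result.append({"name": ns, "key_count": key_count})
    return result
```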
### Requirement: Route cache telemetry in performance detail
The performance-detail API SHALL include route cache telemetry from `get_route_cache_status()`, providing `mode`, `l1_size`, `l1_hit_rate`, `l2_hit_rate`, `miss_rate`, and `reads_total`.
#### Scenario: LayeredCache active
- **WHEN** route cache is in layered mode
- **THEN** the `route_cache` section SHALL include L1 and L2 hit rates from telemetry

@@ -0,0 +1,27 @@
## ADDED Requirements
### Requirement: Connection pool status in performance detail
The performance-detail API SHALL include `db_pool` section with `status` (checked_out, checked_in, overflow, max_capacity, saturation) from `get_pool_status()` and `config` (pool_size, max_overflow, pool_timeout, pool_recycle) from `get_pool_runtime_config()`.
#### Scenario: Pool status retrieved
- **WHEN** the API is called
- **THEN** `db_pool.status` SHALL contain current pool utilization metrics and `db_pool.config` SHALL contain the pool configuration values
#### Scenario: Saturation calculation
- **WHEN** the pool has 8 checked_out connections and max_capacity is 30
- **THEN** saturation SHALL be reported as approximately 26.7%
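The saturation arithmetic in the scenario above (8 of 30 connections checked out) reduces to a one-liner; rounding to one decimal place is an assumption, the guard against a zero-capacity pool is defensive:

```python
def pool_saturation(checked_out: int, max_capacity: int) -> float:
    """Pool saturation as a percentage of total capacity."""
    if max_capacity <= 0:
        return 0.0
    return round(checked_out / max_capacity * 100, 1)
```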
### Requirement: Direct Oracle connection counter
The system SHALL maintain a thread-safe monotonic counter in `database.py` that increments each time `get_db_connection()` or `read_sql_df_slow()` successfully creates a direct (non-pooled) Oracle connection.
#### Scenario: Counter increments on direct connection
- **WHEN** `get_db_connection()` successfully creates a connection
- **THEN** the direct connection counter SHALL increment by 1
#### Scenario: Counter in performance detail
- **WHEN** the performance-detail API is called
- **THEN** `direct_connections` SHALL contain `total_since_start` (counter value) and `worker_pid` (current process PID)
#### Scenario: Counter is per-worker
- **WHEN** multiple gunicorn workers are running
- **THEN** each worker SHALL maintain its own independent counter, and the API SHALL return the counter for the responding worker

@@ -0,0 +1,65 @@
## ADDED Requirements
### Requirement: SQLite metrics history store
The system SHALL provide a `MetricsHistoryStore` class in `core/metrics_history.py` that persists metrics snapshots to a SQLite database (`logs/metrics_history.sqlite` by default). The store SHALL use thread-local connections and a write lock, following the `LogStore` pattern in `core/log_store.py`.
#### Scenario: Write and query snapshots
- **WHEN** `write_snapshot(data)` is called with pool/redis/route_cache/latency metrics
- **THEN** a row SHALL be inserted into `metrics_snapshots` with the current ISO 8601 timestamp and worker PID
#### Scenario: Query by time range
- **WHEN** `query_snapshots(minutes=30)` is called
- **THEN** it SHALL return all rows from the last 30 minutes, ordered by timestamp ascending
#### Scenario: Retention cleanup
- **WHEN** `cleanup()` is called
- **THEN** rows older than `METRICS_HISTORY_RETENTION_DAYS` (default 3) SHALL be deleted, and total rows SHALL be capped at `METRICS_HISTORY_MAX_ROWS` (default 50000)
#### Scenario: Thread safety
- **WHEN** multiple threads write snapshots concurrently
- **THEN** the write lock SHALL serialize writes and prevent database corruption
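The store's concurrency pattern (thread-local connections for reads, one lock serializing writes) can be sketched as below; the column layout and JSON payload column are assumptions, only the table name, timestamp format, and locking discipline come from the requirement:

```python
import json
import os
import sqlite3
import threading
from datetime import datetime, timedelta, timezone


class MetricsHistoryStore:
    """Sketch of the LogStore-style pattern from core/log_store.py."""

    def __init__(self, path: str = "logs/metrics_history.sqlite"):
        self._path = path
        self._local = threading.local()   # one connection per thread
        self._write_lock = threading.Lock()
        with self._conn() as conn:
            conn.execute(
                "CREATE TABLE IF NOT EXISTS metrics_snapshots ("
                " id INTEGER PRIMARY KEY AUTOINCREMENT,"
                " ts TEXT NOT NULL,"
                " worker_pid INTEGER NOT NULL,"
                " payload TEXT NOT NULL)"
            )

    def _conn(self) -> sqlite3.Connection:
        # SQLite connections must not be shared across threads by
        # default, hence thread-local storage.
        if not hasattr(self._local, "conn"):
            self._local.conn = sqlite3.connect(self._path)
        return self._local.conn

    def write_snapshot(self, data: dict) -> None:
        ts = datetime.now(timezone.utc).isoformat()  # ISO 8601
        # The lock serializes writers; the connection context manager
        # commits the transaction on success.
        with self._write_lock, self._conn() as conn:
            conn.execute(
                "INSERT INTO metrics_snapshots (ts, worker_pid, payload)"
                " VALUES (?, ?, ?)",
                (ts, os.getpid(), json.dumps(data)),
            )

    def query_snapshots(self, minutes: int = 30) -> list:
        cutoff = (datetime.now(timezone.utc) - timedelta(minutes=minutes)).isoformat()
        rows = self._conn().execute(
            "SELECT ts, worker_pid, payload FROM metrics_snapshots"
            " WHERE ts >= ? ORDER BY ts ASC",
            (cutoff,),
        ).fetchall()
        return [{"ts": t, "worker_pid": pid, **json.loads(p)} for t, pid, p in rows]
```

Comparing ISO 8601 UTC strings lexicographically is safe here because every timestamp is written in the same format.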
### Requirement: Background metrics collector
The system SHALL provide a `MetricsHistoryCollector` class that runs a daemon thread collecting metrics snapshots at a configurable interval (default 30 seconds, via `METRICS_HISTORY_INTERVAL` env var).
#### Scenario: Automatic collection
- **WHEN** the collector is started via `start_metrics_history(app)`
- **THEN** it SHALL collect pool status, Redis info, route cache status, and query latency metrics every interval and write them to the store
#### Scenario: Graceful shutdown
- **WHEN** `stop_metrics_history()` is called
- **THEN** the collector thread SHALL stop within one interval period
#### Scenario: Subsystem unavailability
- **WHEN** a subsystem (e.g., Redis) is unavailable during collection
- **THEN** the collector SHALL write null/0 for those fields and continue collecting other metrics
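The graceful-shutdown scenario is the interesting part of the collector: using `threading.Event.wait(interval)` as the sleep means `stop()` wakes the thread immediately, so it exits within one interval. A sketch under assumed names:

```python
import threading


class MetricsHistoryCollector:
    """Daemon-thread collector; collect_fn stands in for the real
    snapshot routine (pool status, Redis info, latency metrics)."""

    def __init__(self, collect_fn, interval_seconds: float = 30):
        self._collect_fn = collect_fn
        self._interval = interval_seconds
        self._stop_event = threading.Event()
        self._thread = None

    def start(self) -> None:
        self._thread = threading.Thread(target=self._run, daemon=True)
        self._thread.start()

    def _run(self) -> None:
        while not self._stop_event.is_set():
            try:
                self._collect_fn()
            except Exception:
                # An unavailable subsystem must not kill the loop;
                # the real collector would write null/0 fields instead.
                pass
            # wait() doubles as the sleep and the shutdown signal:
            # it returns early as soon as stop() sets the event.
            self._stop_event.wait(self._interval)

    def stop(self) -> None:
        self._stop_event.set()
        if self._thread is not None:
            self._thread.join(timeout=self._interval + 1)
```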
### Requirement: Performance history API endpoint
The system SHALL expose `GET /admin/api/performance-history` that returns historical metrics snapshots.
#### Scenario: Query with time range
- **WHEN** the API is called with `?minutes=30`
- **THEN** it SHALL return `{"success": true, "data": {"snapshots": [...], "count": N}}`
#### Scenario: Time range bounds
- **WHEN** `minutes` is less than 1 or greater than 180
- **THEN** it SHALL be clamped to the range [1, 180]
#### Scenario: Admin authentication
- **WHEN** the API is called without admin authentication
- **THEN** it SHALL be rejected by the `@admin_required` decorator
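The clamping rule in the time-range-bounds scenario is a simple min/max; the fallback default of 30 for a missing or non-numeric parameter is an assumption, not stated by the requirement:

```python
def clamp_minutes(raw, lo: int = 1, hi: int = 180) -> int:
    """Clamp the ?minutes= query parameter into [1, 180]."""
    try:
        value = int(raw)
    except (TypeError, ValueError):
        value = 30  # assumed default when the parameter is absent/invalid
    return max(lo, min(hi, value))
```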
### Requirement: Frontend trend charts
The system SHALL display 4 trend chart panels in the admin performance dashboard using vue-echarts VChart line/area charts.
#### Scenario: Trend charts with data
- **WHEN** historical snapshots contain more than 1 data point
- **THEN** the dashboard SHALL display trend charts for: connection pool saturation, query latency (P50/P95/P99), Redis memory, and cache hit rates
#### Scenario: Trend charts without data
- **WHEN** historical snapshots are empty or contain only 1 data point
- **THEN** the trend charts SHALL NOT be displayed (hidden via `v-if`)
#### Scenario: Auto-refresh
- **WHEN** the dashboard auto-refreshes
- **THEN** historical data SHALL also be refreshed alongside real-time metrics