feat(admin-perf): full Vue SPA migration + slow-query/memory monitoring gaps
Remove Jinja2 template fallback (1249 lines): /admin/performance now serves the Vue SPA exclusively via send_from_directory.

Backend:
- Add _SLOW_QUERY_WAITING counter with get_slow_query_waiting_count()
- Record slow-path latency in read_sql_df_slow/iter via record_query_latency()
- Extend metrics_history schema with slow_query_active, slow_query_waiting, worker_rss_bytes columns + ALTER TABLE migration for existing DBs
- Add cleanup_archive_logs() with configurable ARCHIVE_LOG_DIR/KEEP_COUNT
- Integrate archive cleanup into MetricsHistoryCollector 50-min cycle

Frontend:
- Add slow_query_active and slow_query_waiting StatCards to connection pool
- Add slow_query_active trend line to pool trend chart
- Add Worker memory (RSS MB) trend chart with preprocessing
- Update modernization gate check path to frontend style.css

Co-Authored-By: Claude Opus 4.6 <noreply@anthropic.com>
## ADDED Requirements

### Requirement: Vue 3 SPA page replaces Jinja2 template

The `/admin/performance` route SHALL serve a Vue 3 SPA page built by Vite, replacing the existing Jinja2 server-rendered template. The SPA SHALL be registered as a Vite entry point and integrated into the portal-shell navigation as a `renderMode: 'native'` route.

#### Scenario: Page loads as Vue SPA

- **WHEN** user navigates to `/admin/performance`
- **THEN** the server SHALL return the Vite-built `admin-performance.html` static file (not a Jinja2 rendered template)

#### Scenario: Portal-shell integration

- **WHEN** the portal-shell renders `/admin/performance`
- **THEN** it SHALL load the page as a native Vue SPA (not an external iframe)

### Requirement: Status cards display system health

The dashboard SHALL display 4 status cards in a horizontal grid: Database, Redis, Circuit Breaker, and Worker PID. Each card SHALL show a StatusDot indicator (healthy/degraded/error/disabled) with the current status value.

#### Scenario: All systems healthy

- **WHEN** all backend systems report healthy status via `/admin/api/system-status`
- **THEN** all 4 status cards SHALL display green StatusDot indicators with their respective values

#### Scenario: Redis disabled

- **WHEN** Redis is disabled (`REDIS_ENABLED=false`)
- **THEN** the Redis status card SHALL display a disabled StatusDot indicator and the Redis cache panel SHALL show a graceful degradation message

### Requirement: Query performance panel with ECharts

The dashboard SHALL display query performance metrics (P50, P95, P99 latencies, total queries, slow queries) and an ECharts latency distribution chart, replacing the existing Chart.js implementation.

#### Scenario: Metrics loaded successfully

- **WHEN** `/admin/api/metrics` returns valid performance data
- **THEN** the panel SHALL display P50/P95/P99 latency values and render an ECharts bar chart showing latency distribution

#### Scenario: No metrics data

- **WHEN** `/admin/api/metrics` returns empty or null metrics
- **THEN** the panel SHALL display placeholder text indicating no data available

### Requirement: Redis cache detail panel

The dashboard SHALL display a Redis cache detail panel showing memory usage (as a GaugeBar), connected clients, hit rate percentage, peak memory, and a namespace key distribution table.

#### Scenario: Redis active with data

- **WHEN** `/admin/api/performance-detail` returns Redis data with namespace key counts
- **THEN** the panel SHALL display a memory GaugeBar, hit rate, client count, and a table listing each namespace with its key count

#### Scenario: Redis disabled

- **WHEN** Redis is disabled
- **THEN** the Redis detail panel SHALL display a disabled state message without errors

### Requirement: Memory cache panel

The dashboard SHALL display ProcessLevelCache statistics as grid cards (showing entries/max_size as a mini gauge and TTL) plus Route Cache telemetry (L1 hit rate, L2 hit rate, miss rate, total reads).

#### Scenario: Multiple caches registered

- **WHEN** `/admin/api/performance-detail` returns process_caches with multiple entries
- **THEN** the panel SHALL render one card per cache instance showing entries, max_size, TTL, and description

#### Scenario: Route cache telemetry

- **WHEN** `/admin/api/performance-detail` returns route_cache data
- **THEN** the panel SHALL display L1 hit rate, L2 hit rate, miss rate, and total reads

### Requirement: Connection pool panel

The dashboard SHALL display connection pool saturation as a GaugeBar and stat cards showing checked_out, checked_in, overflow, max_capacity, pool_size, pool_recycle, pool_timeout, and direct connection count.

#### Scenario: Pool under normal load

- **WHEN** pool saturation is below 80%
- **THEN** the GaugeBar SHALL display in a normal color (green/blue)

#### Scenario: Pool near saturation

- **WHEN** pool saturation exceeds 80%
- **THEN** the GaugeBar SHALL display in a warning color (yellow/orange/red)

### Requirement: Worker control panel

The dashboard SHALL display worker PID, uptime, cooldown status, and provide a restart button with a confirmation modal.

#### Scenario: Restart worker

- **WHEN** user clicks the restart button and confirms in the modal
- **THEN** the system SHALL POST to `/admin/api/worker/restart` and display the result

#### Scenario: Restart during cooldown

- **WHEN** worker is in cooldown period
- **THEN** the restart button SHALL be disabled with a cooldown indicator

### Requirement: System logs panel with filtering and pagination

The dashboard SHALL display system logs with level filtering, text search, and pagination controls.

#### Scenario: Filter by log level

- **WHEN** user selects a specific log level filter
- **THEN** only logs matching that level SHALL be displayed

#### Scenario: Paginate logs

- **WHEN** logs exceed the page size
- **THEN** pagination controls SHALL allow navigating between pages

### Requirement: Auto-refresh with toggle

The dashboard SHALL auto-refresh all panels every 30 seconds using `useAutoRefresh`. The user SHALL be able to toggle auto-refresh on/off and manually trigger a refresh.

#### Scenario: Auto-refresh enabled

- **WHEN** auto-refresh is enabled (default)
- **THEN** all panels SHALL refresh their data every 30 seconds via `Promise.all` parallel fetch

#### Scenario: Manual refresh

- **WHEN** user clicks the manual refresh button
- **THEN** all panels SHALL immediately refresh their data

## MODIFIED Requirements

### Requirement: Vue 3 SPA page replaces Jinja2 template

The `/admin/performance` route SHALL serve the Vite-built `admin-performance.html` static file directly. The Jinja2 template fallback SHALL be removed. If the SPA build artifact does not exist, the server SHALL return a standard HTTP error (no fallback rendering).

#### Scenario: Page loads as Vue SPA

- **WHEN** user navigates to `/admin/performance`
- **THEN** the server SHALL return the Vite-built `admin-performance.html` static file via `send_from_directory`

#### Scenario: Portal-shell integration

- **WHEN** the portal-shell renders `/admin/performance`
- **THEN** it SHALL load the page as a native Vue SPA (not an external iframe)

#### Scenario: Build artifact missing

- **WHEN** the SPA build artifact `admin-performance.html` does not exist in `static/dist/`
- **THEN** the server SHALL return an HTTP error (no Jinja2 fallback)
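A minimal sketch of what such a fallback-free route could look like in Flask. The blueprint-free app, the `DIST_DIR` constant, and the directory layout are assumptions for illustration, not the project's actual code:

```python
# Hypothetical sketch: serve the Vite build artifact directly and let
# Flask's send_from_directory raise a standard 404 when it is missing.
from flask import Flask, send_from_directory

app = Flask(__name__)
DIST_DIR = "static/dist"  # assumed Vite output directory

@app.route("/admin/performance")
def admin_performance():
    # No Jinja2 fallback: if admin-performance.html is absent,
    # send_from_directory aborts with an HTTP 404 response.
    return send_from_directory(DIST_DIR, "admin-performance.html")
```

Because `send_from_directory` already raises `NotFound` for missing files, the "build artifact missing" scenario needs no extra code.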

### Requirement: Connection pool panel

The dashboard SHALL display connection pool saturation as a GaugeBar and stat cards showing checked_out, checked_in, overflow, max_capacity, pool_size, pool_recycle, pool_timeout, direct connection count, slow_query_active, and slow_query_waiting.

#### Scenario: Pool under normal load

- **WHEN** pool saturation is below 80%
- **THEN** the GaugeBar SHALL display in a normal color (green/blue)

#### Scenario: Pool near saturation

- **WHEN** pool saturation exceeds 80%
- **THEN** the GaugeBar SHALL display in a warning color (yellow/orange/red)

#### Scenario: Slow query metrics displayed

- **WHEN** `db_pool.status` includes `slow_query_active` and `slow_query_waiting`
- **THEN** the panel SHALL display StatCards for both values

## REMOVED Requirements

### Requirement: Jinja2 template fallback for performance page

**Reason**: The Vue SPA is the sole UI. Maintaining a 1249-line Jinja template as fallback adds maintenance burden and feature divergence.

**Migration**: Delete `templates/admin/performance.html`. The route handler serves the SPA directly.

---

`openspec/specs/archive-log-rotation/spec.md` (new file)

## ADDED Requirements

### Requirement: Automatic archive log cleanup

The system SHALL provide a `cleanup_archive_logs()` function in `core/metrics_history.py` that deletes old rotated log files from `logs/archive/`, keeping the most recent N files per log type (access, error, watchdog, rq_worker, startup).

#### Scenario: Cleanup keeps recent files

- **WHEN** `cleanup_archive_logs()` is called with `keep_per_type=20` and there are 30 access_*.log files
- **THEN** the 10 oldest access_*.log files SHALL be deleted, keeping the 20 most recent by modification time

#### Scenario: No excess files

- **WHEN** `cleanup_archive_logs()` is called and each type has fewer than `keep_per_type` files
- **THEN** no files SHALL be deleted

#### Scenario: Archive directory missing

- **WHEN** `cleanup_archive_logs()` is called and the archive directory does not exist
- **THEN** the function SHALL return 0 without error
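A sketch of how `cleanup_archive_logs()` might satisfy the three scenarios above. The glob pattern `<type>_*.log` and the delete-count return value are assumptions; the real implementation in `core/metrics_history.py` may differ:

```python
# Hypothetical sketch of cleanup_archive_logs(): keep the N most recent
# rotated files per log type, delete the rest, return the delete count.
import os
from pathlib import Path

LOG_TYPES = ("access", "error", "watchdog", "rq_worker", "startup")

def cleanup_archive_logs(archive_dir="logs/archive", keep_per_type=20):
    root = Path(archive_dir)
    if not root.is_dir():
        return 0  # missing archive directory is not an error
    deleted = 0
    for log_type in LOG_TYPES:
        # newest first by modification time
        files = sorted(root.glob(f"{log_type}_*.log"),
                       key=lambda p: p.stat().st_mtime, reverse=True)
        for stale in files[keep_per_type:]:
            try:
                stale.unlink()
                deleted += 1
            except OSError:
                pass  # a concurrently removed file should not abort the sweep
    return deleted
```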

### Requirement: Archive cleanup integrated into collector cycle

The `MetricsHistoryCollector` SHALL call `cleanup_archive_logs()` alongside the existing SQLite cleanup, running approximately every 50 minutes (every 100 collection intervals).

#### Scenario: Periodic cleanup executes

- **WHEN** the cleanup counter reaches 100 intervals
- **THEN** both SQLite metrics cleanup and archive log cleanup SHALL execute

### Requirement: Archive cleanup configuration

The archive log cleanup SHALL be configurable via environment variables: `ARCHIVE_LOG_DIR` (default: `logs/archive`) and `ARCHIVE_LOG_KEEP_COUNT` (default: 20).

#### Scenario: Custom keep count

- **WHEN** `ARCHIVE_LOG_KEEP_COUNT=10` is set
- **THEN** cleanup SHALL keep only the 10 most recent files per type

---

## ADDED Requirements

### Requirement: Connection pool status in performance detail

The performance-detail API SHALL include a `db_pool` section with `status` (checked_out, checked_in, overflow, max_capacity, saturation) from `get_pool_status()` and `config` (pool_size, max_overflow, pool_timeout, pool_recycle) from `get_pool_runtime_config()`.

#### Scenario: Pool status retrieved

- **WHEN** the API is called
- **THEN** `db_pool.status` SHALL contain current pool utilization metrics and `db_pool.config` SHALL contain the pool configuration values

#### Scenario: Saturation calculation

- **WHEN** the pool has 8 checked_out connections and max_capacity is 30
- **THEN** saturation SHALL be reported as approximately 26.7%
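The saturation figure above is simply checked_out over max_capacity expressed as a percentage (8 / 30 ≈ 26.7%). A one-line illustrative helper, not the project's actual function:

```python
# Illustrative saturation helper: checked_out as a percentage of capacity.
def pool_saturation(checked_out: int, max_capacity: int) -> float:
    return round(checked_out / max_capacity * 100, 1)
```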

### Requirement: Direct Oracle connection counter

The system SHALL maintain a thread-safe monotonic counter in `database.py` that increments each time `get_db_connection()` or `read_sql_df_slow()` successfully creates a direct (non-pooled) Oracle connection.

#### Scenario: Counter increments on direct connection

- **WHEN** `get_db_connection()` successfully creates a connection
- **THEN** the direct connection counter SHALL increment by 1

#### Scenario: Counter in performance detail

- **WHEN** the performance-detail API is called
- **THEN** `direct_connections` SHALL contain `total_since_start` (counter value) and `worker_pid` (current process PID)

#### Scenario: Counter is per-worker

- **WHEN** multiple gunicorn workers are running
- **THEN** each worker SHALL maintain its own independent counter, and the API SHALL return the counter for the responding worker
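A sketch of the per-worker counter pattern the requirement describes. Function names here (`record_direct_connection`, `get_direct_connection_stats`) are assumptions; only the payload shape follows the spec:

```python
# Hypothetical sketch of the per-worker direct-connection counter.
# Each gunicorn worker process gets its own module-level counter;
# a lock keeps increments safe across request threads.
import os
import threading

_DIRECT_CONN_LOCK = threading.Lock()
_DIRECT_CONN_TOTAL = 0

def record_direct_connection() -> None:
    global _DIRECT_CONN_TOTAL
    with _DIRECT_CONN_LOCK:
        _DIRECT_CONN_TOTAL += 1

def get_direct_connection_stats() -> dict:
    # Mirrors the performance-detail payload shape described above.
    return {"total_since_start": _DIRECT_CONN_TOTAL, "worker_pid": os.getpid()}
```

Because the counter is plain process memory, each worker reports its own total; no cross-process aggregation is implied by the spec.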

## MODIFIED Requirements

### Requirement: Connection pool status in performance detail

The performance-detail API SHALL include a `db_pool` section with `status` (checked_out, checked_in, overflow, max_capacity, saturation, slow_query_active, slow_query_waiting) from `get_pool_status()` and `config` (pool_size, max_overflow, pool_timeout, pool_recycle) from `get_pool_runtime_config()`.

#### Scenario: Pool status retrieved

- **WHEN** the API is called
- **THEN** `db_pool.status` SHALL contain current pool utilization metrics including `slow_query_active` and `slow_query_waiting`, and `db_pool.config` SHALL contain the pool configuration values

#### Scenario: Saturation calculation

- **WHEN** the pool has 8 checked_out connections and max_capacity is 30
- **THEN** saturation SHALL be reported as approximately 26.7%

#### Scenario: Slow query waiting included

- **WHEN** 2 threads are waiting for the slow query semaphore
- **THEN** `db_pool.status.slow_query_waiting` SHALL be 2

## ADDED Requirements

### Requirement: Slow-path query latency included in QueryMetrics

The `read_sql_df_slow()` and `read_sql_df_slow_iter()` functions SHALL call `record_query_latency()` with the total elapsed time upon completion, ensuring P50/P95/P99 percentiles reflect queries from all paths (pooled and slow/direct).

#### Scenario: Slow query latency recorded

- **WHEN** `read_sql_df_slow()` completes a query in 8.5 seconds
- **THEN** `record_query_latency(8.5)` SHALL be called and the value SHALL appear in subsequent `get_percentiles()` results

#### Scenario: Slow iter latency recorded

- **WHEN** `read_sql_df_slow_iter()` completes streaming in 45 seconds
- **THEN** `record_query_latency(45.0)` SHALL be called in the finally block

---

## ADDED Requirements

### Requirement: SQLite metrics history store

The system SHALL provide a `MetricsHistoryStore` class in `core/metrics_history.py` that persists metrics snapshots to a SQLite database (`logs/metrics_history.sqlite` by default). The store SHALL use thread-local connections and a write lock, following the `LogStore` pattern in `core/log_store.py`.

#### Scenario: Write and query snapshots

- **WHEN** `write_snapshot(data)` is called with pool/redis/route_cache/latency metrics
- **THEN** a row SHALL be inserted into `metrics_snapshots` with the current ISO 8601 timestamp and worker PID

#### Scenario: Query by time range

- **WHEN** `query_snapshots(minutes=30)` is called
- **THEN** it SHALL return all rows from the last 30 minutes, ordered by timestamp ascending

#### Scenario: Retention cleanup

- **WHEN** `cleanup()` is called
- **THEN** rows older than `METRICS_HISTORY_RETENTION_DAYS` (default 3) SHALL be deleted, and total rows SHALL be capped at `METRICS_HISTORY_MAX_ROWS` (default 50000)

#### Scenario: Thread safety

- **WHEN** multiple threads write snapshots concurrently
- **THEN** the write lock SHALL serialize writes and prevent database corruption
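A condensed sketch of the thread-local-connection-plus-write-lock pattern the store is specified to follow. The single `payload` column is a simplification; the real schema has typed metric columns:

```python
# Hypothetical sketch of the LogStore-style pattern: one SQLite connection
# per thread, a single lock serializing writes.
import sqlite3
import threading
from datetime import datetime, timezone

class MetricsHistoryStore:
    def __init__(self, path="logs/metrics_history.sqlite"):
        self._path = path
        self._local = threading.local()
        self._write_lock = threading.Lock()
        with self._conn() as conn:
            conn.execute(
                "CREATE TABLE IF NOT EXISTS metrics_snapshots ("
                "ts TEXT, worker_pid INTEGER, payload TEXT)"
            )

    def _conn(self):
        # SQLite connections are not shareable across threads by default,
        # hence one lazily created connection per thread.
        if not hasattr(self._local, "conn"):
            self._local.conn = sqlite3.connect(self._path)
        return self._local.conn

    def write_snapshot(self, payload: str, pid: int) -> None:
        ts = datetime.now(timezone.utc).isoformat()  # ISO 8601 timestamp
        with self._write_lock, self._conn() as conn:
            conn.execute(
                "INSERT INTO metrics_snapshots VALUES (?, ?, ?)",
                (ts, pid, payload),
            )
```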

### Requirement: Background metrics collector

The system SHALL provide a `MetricsHistoryCollector` class that runs a daemon thread collecting metrics snapshots at a configurable interval (default 30 seconds, via `METRICS_HISTORY_INTERVAL` env var).

#### Scenario: Automatic collection

- **WHEN** the collector is started via `start_metrics_history(app)`
- **THEN** it SHALL collect pool status, Redis info, route cache status, and query latency metrics every interval and write them to the store

#### Scenario: Graceful shutdown

- **WHEN** `stop_metrics_history()` is called
- **THEN** the collector thread SHALL stop within one interval period

#### Scenario: Subsystem unavailability

- **WHEN** a subsystem (e.g., Redis) is unavailable during collection
- **THEN** the collector SHALL write null/0 for those fields and continue collecting other metrics
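The "stop within one interval" scenario is naturally met by sleeping on a `threading.Event` instead of `time.sleep`. A minimal sketch (the constructor shape is an assumption):

```python
# Hypothetical sketch of an interval collector that a stop event can halt
# within one interval, matching the graceful-shutdown scenario above.
import threading

class MetricsHistoryCollector:
    def __init__(self, collect, interval=30.0):
        self._collect = collect          # callable producing one snapshot
        self._interval = interval
        self._stop = threading.Event()
        self._thread = threading.Thread(target=self._run, daemon=True)

    def _run(self):
        # Event.wait doubles as an interruptible sleep: it returns early
        # (True) the moment stop() sets the event.
        while not self._stop.wait(self._interval):
            try:
                self._collect()
            except Exception:
                pass  # one failed subsystem must not kill the thread

    def start(self):
        self._thread.start()

    def stop(self):
        self._stop.set()
        self._thread.join(timeout=self._interval)
```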

### Requirement: Performance history API endpoint

The system SHALL expose `GET /admin/api/performance-history` that returns historical metrics snapshots.

#### Scenario: Query with time range

- **WHEN** the API is called with `?minutes=30`
- **THEN** it SHALL return `{"success": true, "data": {"snapshots": [...], "count": N}}`

#### Scenario: Time range bounds

- **WHEN** `minutes` is less than 1 or greater than 180
- **THEN** it SHALL be clamped to the range [1, 180]

#### Scenario: Admin authentication

- **WHEN** the API is called without admin authentication
- **THEN** it SHALL be rejected by the `@admin_required` decorator
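The clamping and envelope behavior above can be sketched without the web framework. The default of 30 minutes for a missing or malformed parameter is an assumption, as is the `query_snapshots` stand-in:

```python
# Hypothetical sketch of the history endpoint's core behavior: clamp the
# minutes parameter to [1, 180] and wrap snapshots in the success envelope.
def clamp_minutes(raw, lo=1, hi=180):
    try:
        minutes = int(raw)
    except (TypeError, ValueError):
        minutes = 30  # assumed default when the parameter is absent or bad
    return max(lo, min(hi, minutes))

def performance_history_payload(raw_minutes, query_snapshots):
    # query_snapshots is a stand-in for MetricsHistoryStore.query_snapshots.
    snapshots = query_snapshots(minutes=clamp_minutes(raw_minutes))
    return {"success": True, "data": {"snapshots": snapshots, "count": len(snapshots)}}
```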

### Requirement: Frontend trend charts

The system SHALL display 4 trend chart panels in the admin performance dashboard using vue-echarts VChart line/area charts.

#### Scenario: Trend charts with data

- **WHEN** historical snapshots contain more than 1 data point
- **THEN** the dashboard SHALL display trend charts for: connection pool saturation, query latency (P50/P95/P99), Redis memory, and cache hit rates

#### Scenario: Trend charts without data

- **WHEN** historical snapshots are empty or contain only 1 data point
- **THEN** the trend charts SHALL NOT be displayed (hidden via `v-if`)

#### Scenario: Auto-refresh

- **WHEN** the dashboard auto-refreshes
- **THEN** historical data SHALL also be refreshed alongside real-time metrics

## MODIFIED Requirements

### Requirement: SQLite metrics history store

The system SHALL provide a `MetricsHistoryStore` class in `core/metrics_history.py` that persists metrics snapshots to a SQLite database (`logs/metrics_history.sqlite` by default). The store SHALL use thread-local connections and a write lock, following the `LogStore` pattern in `core/log_store.py`. The schema SHALL include columns for `slow_query_active` (INTEGER), `slow_query_waiting` (INTEGER), and `worker_rss_bytes` (INTEGER) in addition to the existing pool, Redis, route cache, and latency columns.

#### Scenario: Write and query snapshots

- **WHEN** `write_snapshot(data)` is called with pool/redis/route_cache/latency/slow_query/memory metrics
- **THEN** a row SHALL be inserted into `metrics_snapshots` with the current ISO 8601 timestamp, worker PID, and all metric columns

#### Scenario: Query by time range

- **WHEN** `query_snapshots(minutes=30)` is called
- **THEN** it SHALL return all rows from the last 30 minutes, ordered by timestamp ascending, including the new columns

#### Scenario: Retention cleanup

- **WHEN** `cleanup()` is called
- **THEN** rows older than `METRICS_HISTORY_RETENTION_DAYS` (default 3) SHALL be deleted, and total rows SHALL be capped at `METRICS_HISTORY_MAX_ROWS` (default 50000)

#### Scenario: Thread safety

- **WHEN** multiple threads write snapshots concurrently
- **THEN** the write lock SHALL serialize writes and prevent database corruption

#### Scenario: Schema migration for existing databases

- **WHEN** the store initializes on an existing database without the new columns
- **THEN** it SHALL execute ALTER TABLE ADD COLUMN for each missing column, tolerating "duplicate column" errors
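The additive migration above can be sketched as a loop of tolerant `ALTER TABLE` statements. The function name is hypothetical; the column list comes from the requirement:

```python
# Hypothetical sketch of the additive schema migration: attempt each
# ALTER TABLE ADD COLUMN and tolerate "duplicate column" on re-runs.
import sqlite3

NEW_COLUMNS = (
    ("slow_query_active", "INTEGER"),
    ("slow_query_waiting", "INTEGER"),
    ("worker_rss_bytes", "INTEGER"),
)

def migrate_snapshot_schema(conn: sqlite3.Connection) -> None:
    for name, sql_type in NEW_COLUMNS:
        try:
            conn.execute(
                f"ALTER TABLE metrics_snapshots ADD COLUMN {name} {sql_type}"
            )
        except sqlite3.OperationalError as exc:
            if "duplicate column" not in str(exc).lower():
                raise  # only swallow the already-migrated case
```

Swallowing only the duplicate-column error keeps the migration idempotent without hiding genuine schema problems.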

### Requirement: Background metrics collector

The system SHALL provide a `MetricsHistoryCollector` class that runs a daemon thread collecting metrics snapshots at a configurable interval (default 30 seconds, via `METRICS_HISTORY_INTERVAL` env var). The collector SHALL include `slow_query_active`, `slow_query_waiting`, and `worker_rss_bytes` in each snapshot.

#### Scenario: Automatic collection

- **WHEN** the collector is started via `start_metrics_history(app)`
- **THEN** it SHALL collect pool status (including slow_query_active and slow_query_waiting), Redis info, route cache status, query latency metrics, and worker RSS memory every interval and write them to the store

#### Scenario: Graceful shutdown

- **WHEN** `stop_metrics_history()` is called
- **THEN** the collector thread SHALL stop within one interval period

#### Scenario: Subsystem unavailability

- **WHEN** a subsystem (e.g., Redis) is unavailable during collection
- **THEN** the collector SHALL write null/0 for those fields and continue collecting other metrics

### Requirement: Frontend trend charts

The system SHALL display 5 trend chart panels in the admin performance dashboard using vue-echarts VChart line/area charts: connection pool saturation, query latency (P50/P95/P99), Redis memory, cache hit rates, and worker memory.

#### Scenario: Trend charts with data

- **WHEN** historical snapshots contain more than 1 data point
- **THEN** the dashboard SHALL display trend charts for: connection pool saturation (including slow_query_active), query latency (P50/P95/P99), Redis memory, cache hit rates, and worker memory (RSS in MB)

#### Scenario: Trend charts without data

- **WHEN** historical snapshots are empty or contain only 1 data point
- **THEN** the trend charts SHALL NOT be displayed (hidden via `v-if`)

#### Scenario: Auto-refresh

- **WHEN** the dashboard auto-refreshes
- **THEN** historical data SHALL also be refreshed alongside real-time metrics

---

`openspec/specs/slow-query-observability/spec.md` (new file)

## ADDED Requirements

### Requirement: Slow query active count in metrics history snapshots

The `MetricsHistoryCollector` SHALL include `slow_query_active` in each 30-second snapshot, recording the number of slow queries currently executing via dedicated connections.

#### Scenario: Snapshot includes slow_query_active

- **WHEN** the collector writes a snapshot while 3 slow queries are executing
- **THEN** the `slow_query_active` column SHALL contain the value 3

#### Scenario: No slow queries active

- **WHEN** the collector writes a snapshot while no slow queries are executing
- **THEN** the `slow_query_active` column SHALL contain the value 0

### Requirement: Slow query waiting count tracked and persisted

The system SHALL maintain a thread-safe counter `_SLOW_QUERY_WAITING` in `database.py` that tracks the number of threads currently waiting to acquire the slow query semaphore. This counter SHALL be included in `get_pool_status()` and persisted to metrics history snapshots.

#### Scenario: Counter increments on semaphore wait

- **WHEN** a thread enters `read_sql_df_slow()` and the semaphore is full
- **THEN** `_SLOW_QUERY_WAITING` SHALL be incremented before `semaphore.acquire()` and decremented after acquire completes (success or timeout)

#### Scenario: Counter in pool status API

- **WHEN** `get_pool_status()` is called
- **THEN** the returned dict SHALL include `slow_query_waiting` with the current waiting thread count

#### Scenario: Counter persisted to metrics history

- **WHEN** the collector writes a snapshot
- **THEN** the `slow_query_waiting` column SHALL reflect the count at snapshot time

### Requirement: Slow-path query latency recorded in QueryMetrics

The `read_sql_df_slow()` and `read_sql_df_slow_iter()` functions SHALL call `record_query_latency()` with the elapsed query time, so that P50/P95/P99 metrics reflect all query paths (pool + slow).

#### Scenario: Slow query latency appears in percentiles

- **WHEN** a `read_sql_df_slow()` call completes in 5.2 seconds
- **THEN** `record_query_latency(5.2)` SHALL be called and the latency SHALL appear in subsequent `get_percentiles()` results

#### Scenario: Slow iter latency recorded on completion

- **WHEN** a `read_sql_df_slow_iter()` generator completes after yielding all batches in 120 seconds total
- **THEN** `record_query_latency(120.0)` SHALL be called in the finally block

### Requirement: Slow query metrics displayed in Vue SPA

The admin performance Vue SPA SHALL display `slow_query_active` and `slow_query_waiting` as StatCards in the connection pool panel, and include `slow_query_active` as a trend line in the connection pool trend chart.

#### Scenario: StatCards display current values

- **WHEN** the performance-detail API returns `db_pool.status.slow_query_active = 4` and `db_pool.status.slow_query_waiting = 2`
- **THEN** the connection pool panel SHALL display StatCards showing "慢查詢執行中: 4" ("slow queries running: 4") and "慢查詢排隊中: 2" ("slow queries queued: 2")

#### Scenario: Trend chart includes slow_query_active

- **WHEN** historical snapshots contain `slow_query_active` data points
- **THEN** the connection pool trend chart SHALL include a "慢查詢執行中" ("slow queries running") line series

---

`openspec/specs/worker-memory-tracking/spec.md` (new file)

## ADDED Requirements

### Requirement: Worker RSS memory in metrics history snapshots

The `MetricsHistoryCollector` SHALL include `worker_rss_bytes` in each 30-second snapshot, recording the current worker process peak RSS memory using Python's `resource.getrusage()`.

#### Scenario: RSS recorded in snapshot

- **WHEN** the collector writes a snapshot and the worker process has 256 MB peak RSS
- **THEN** the `worker_rss_bytes` column SHALL contain approximately 268435456

#### Scenario: RSS collection failure

- **WHEN** `resource.getrusage()` raises an exception
- **THEN** the collector SHALL write NULL for `worker_rss_bytes` and continue collecting other metrics
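One subtlety worth sketching: `resource.getrusage()` reports `ru_maxrss` in kilobytes on Linux but in bytes on macOS, so a bytes-normalizing helper is needed to hit the 268435456 figure above. The function name is an assumption:

```python
# Hypothetical sketch of the RSS sampler: normalize ru_maxrss to bytes
# (kilobytes on Linux, bytes on macOS) and return None on failure,
# mirroring the NULL-on-failure scenario above.
import resource
import sys

def worker_rss_bytes():
    try:
        peak = resource.getrusage(resource.RUSAGE_SELF).ru_maxrss
    except Exception:
        return None  # caller writes NULL and keeps collecting other metrics
    return peak if sys.platform == "darwin" else peak * 1024
```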

### Requirement: Worker memory trend chart in Vue SPA

The admin performance Vue SPA SHALL display a "Worker 記憶體趨勢" (worker memory trend) TrendChart showing RSS memory over time in megabytes.

#### Scenario: Memory trend displayed

- **WHEN** historical snapshots contain `worker_rss_bytes` data with more than 1 data point
- **THEN** the dashboard SHALL display a TrendChart with RSS values converted to MB

#### Scenario: No memory data

- **WHEN** historical snapshots do not contain `worker_rss_bytes` data (all NULL)
- **THEN** the trend chart SHALL show the "趨勢資料不足" (insufficient trend data) message