chore: finalize vite migration hardening and archive openspec changes

This commit is contained in:
beabigegg
2026-02-08 20:03:36 +08:00
parent b56e80381b
commit c8e225101e
119 changed files with 6547 additions and 1301 deletions

View File

@@ -0,0 +1,2 @@
schema: spec-driven
created: 2026-02-08

View File

@@ -0,0 +1,46 @@
## Context
The current architecture already supports single-port Gunicorn runtime, circuit-breaker-aware degraded responses, and watchdog-assisted recovery. However, critical security and lifecycle controls are uneven: production startup can still fall back to a weak secret key, CSRF is not enforced globally, and background resources are not fully registered in a single shutdown lifecycle. These gaps are operationally risky under pool pressure or restart churn.
## Goals / Non-Goals
**Goals:**
- Make production startup fail fast when required security secrets are missing.
- Enforce CSRF validation for all state-changing endpoints without breaking existing frontend flow.
- Make worker/app shutdown deterministic by stopping all background workers and shared clients.
- Keep degraded responses for pool exhaustion and circuit-open states stable and retry-aware.
- Isolate health probe connectivity from main request pool contention.
**Non-Goals:**
- Replacing LDAP provider or redesigning the full authentication architecture.
- Full CSP rollout across all templates in this change.
- Changing URL structure, page IA, or single-port deployment topology.
## Decisions
1. **Production secret-key guard at startup**
- Decision: enforce `SECRET_KEY` presence/strength in non-development modes and abort startup when invalid.
- Rationale: prevents silent insecure deployment.
2. **Unified CSRF contract across form + JSON flows**
- Decision: issue CSRF token from server session, validate hidden form field for HTML forms and `X-CSRF-Token` for JSON POST/PUT/PATCH/DELETE.
- Rationale: maintains current frontend behavior while covering non-form APIs.
3. **Centralized shutdown registry**
- Decision: register explicit shutdown hooks that call cache updater stop, realtime sync stop, Redis close, and DB dispose in bounded order.
- Rationale: avoids thread/client leaks during worker recycle and controlled reload.
4. **Health probe pool isolation**
- Decision: use a dedicated lightweight DB health engine/pool for `/health` checks.
- Rationale: prevents health endpoint from being blocked by request-pool exhaustion, improving observability fidelity.
5. **Template-safe JS serialization**
- Decision: replace HTML-escaped interpolation in JS string contexts with `tojson` serialization.
- Rationale: avoids context-mismatch injection edge cases.
## Risks / Trade-offs
- **[Risk] CSRF rollout may break undocumented clients** → **Mitigation:** provide opt-in transition flag and explicit error messaging during rollout.
- **[Risk] Strict startup secret validation can block misconfigured environments** → **Mitigation:** provide clear startup diagnostics and `.env.example` updates.
- **[Risk] Additional shutdown hooks can prolong worker exit** → **Mitigation:** bounded timeouts and idempotent stop handlers.
- **[Risk] Dedicated health pool introduces extra DB connections** → **Mitigation:** fixed minimal size and short timeout.

View File

@@ -0,0 +1,40 @@
## Why
The Vite migration is functionally complete, but production runtime still has high-risk gaps in security baseline and worker lifecycle cleanup. Addressing these now prevents avoidable outages, authentication bypass risk, and unstable degradation behavior under pool pressure.
## What Changes
- Enforce production-safe startup security defaults (no weak SECRET_KEY fallback in non-development environments).
- Add first-class CSRF protection for admin forms and state-changing JSON APIs.
- Harden degradation behavior for pool exhaustion with consistent retry/backoff contract and isolated health probing.
- Ensure background workers and shared clients (cache updater, realtime sync, Redis) are explicitly stopped on worker/app shutdown.
- Fix template-to-JavaScript variable serialization in hold-detail fallback script.
## Capabilities
### New Capabilities
- `security-baseline-hardening`: Define mandatory secret/session/CSRF/XSS-safe baseline for production runtime.
### Modified Capabilities
- `runtime-resilience-recovery`: Strengthen shutdown lifecycle and degraded-response behavior for pool pressure scenarios.
## Impact
- Affected code:
- `src/mes_dashboard/app.py`
- `src/mes_dashboard/core/database.py`
- `src/mes_dashboard/core/cache_updater.py`
- `src/mes_dashboard/core/redis_client.py`
- `src/mes_dashboard/routes/health_routes.py`
- `src/mes_dashboard/routes/auth_routes.py`
- `src/mes_dashboard/templates/hold_detail.html`
- `gunicorn.conf.py`
- `tests/`
- APIs:
- `/health`
- `/health/deep`
- `/admin/login`
- state-changing `/api/*` endpoints
- Operational behavior:
- Keep single-port deployment model unchanged.
- Improve degraded-state stability and startup safety gates.

View File

@@ -0,0 +1,24 @@
## MODIFIED Requirements
### Requirement: Pool Exhaustion MUST Return Retry-Aware Degraded Responses
The system MUST return explicit degraded responses for connection pool exhaustion, including stable machine-readable retry metadata and HTTP retry hints.
#### Scenario: Pool exhausted under load
- **WHEN** concurrent requests exceed available database connections and pool wait timeout is reached
- **THEN** the API MUST return `DB_POOL_EXHAUSTED` with `retry_after_seconds` metadata and a `Retry-After` header instead of a generic 500 failure
## ADDED Requirements
### Requirement: Runtime Shutdown SHALL Cleanly Stop Background Services
Worker/app shutdown MUST stop long-lived background services and shared clients in deterministic order.
#### Scenario: Worker exits during recycle or graceful reload
- **WHEN** Gunicorn worker shutdown hooks are triggered
- **THEN** cache updater, realtime equipment sync worker, Redis client, and DB engine resources MUST be stopped/disposed without orphan threads
### Requirement: Health Probing SHALL Remain Available During Request-Pool Saturation
Health checks MUST avoid depending solely on the same request pool used by business APIs.
#### Scenario: Request pool saturation
- **WHEN** the main database request pool is exhausted
- **THEN** `/health` and `/health/deep` MUST still provide timely degraded status using isolated probe connectivity
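The pool-exhaustion contract above can be sketched as a framework-agnostic response builder. The `DB_POOL_EXHAUSTED` code, `retry_after_seconds` field, and `Retry-After` header come from the requirement; the helper name and the default backoff value are assumptions.

```python
import json

def pool_exhausted_response(retry_after_seconds: int = 5):
    """Build a retry-aware 503 degraded response instead of a generic 500 (sketch)."""
    body = {
        "error": "DB_POOL_EXHAUSTED",
        "retry_after_seconds": retry_after_seconds,
    }
    headers = {
        "Retry-After": str(retry_after_seconds),  # HTTP-level retry hint
        "Content-Type": "application/json",
    }
    return 503, headers, json.dumps(body)
```

In practice this would be wired into the API error path that catches the pool wait-timeout exception, so all endpoints emit the same machine-readable metadata.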

View File

@@ -0,0 +1,29 @@
## ADDED Requirements
### Requirement: Production Startup SHALL Reject Weak Session Secrets
The system MUST reject startup in non-development environments when `SECRET_KEY` is missing or configured with known insecure default values.
#### Scenario: Missing production secret key
- **WHEN** runtime starts with `FLASK_ENV` not equal to `development` and no secure secret key is configured
- **THEN** application startup MUST fail fast with an explicit configuration error
### Requirement: State-Changing Endpoints SHALL Enforce CSRF Validation
All state-changing endpoints that rely on cookie-based authentication MUST enforce CSRF token validation.
#### Scenario: Missing or invalid CSRF token
- **WHEN** a POST/PUT/PATCH/DELETE request is sent without a valid CSRF token
- **THEN** the server MUST reject the request with a client error and MUST NOT execute the mutation
### Requirement: Server-Rendered Values in JavaScript Context MUST Use Safe Serialization
Values inserted into inline JavaScript from templates MUST be serialized for JavaScript context safety.
#### Scenario: Hold reason rendered in fallback inline script
- **WHEN** server-side string values are embedded into script state payloads
- **THEN** template rendering MUST use JSON-safe serialization semantics to prevent script-context injection
### Requirement: Session Establishment SHALL Mitigate Fixation Risk
Successful admin login MUST rotate session identity material before granting authenticated privileges.
#### Scenario: Admin login success
- **WHEN** credentials are validated and admin session is created
- **THEN** session identity MUST be regenerated before storing authenticated user attributes

View File

@@ -0,0 +1,18 @@
## 1. Runtime Stability Hardening
- [x] 1.1 Add startup validation for `SECRET_KEY` and environment-aware secure defaults.
- [x] 1.2 Register centralized shutdown hooks to stop cache updater, realtime sync worker, Redis client, and DB engine.
- [x] 1.3 Isolate database health probing from request pool and keep degraded signal contract stable.
- [x] 1.4 Normalize pool-exhausted response metadata and retry headers across API error paths.
## 2. Security Baseline Enforcement
- [x] 2.1 Add CSRF token issuance/validation for form posts and JSON mutation endpoints.
- [x] 2.2 Update login flow to rotate session identity on successful authentication.
- [x] 2.3 Replace JS-context template interpolation in `hold_detail.html` with JSON-safe serialization.
## 3. Verification and Documentation
- [x] 3.1 Add tests for startup secret guard, CSRF rejection, and session-rotation behavior.
- [x] 3.2 Add lifecycle tests/validation for shutdown cleanup and health endpoint behavior under pool saturation.
- [x] 3.3 Update README/README.mdj runtime hardening sections and operator rollout notes.

View File

@@ -0,0 +1,2 @@
schema: spec-driven
created: 2026-02-08

View File

@@ -0,0 +1,46 @@
## Context
The migration delivered feature parity, but efficiency work remains: backend query paths still do broad copies and whole-frame recomputation even when only slices are needed. At the same time, business constraints explicitly require full-table caching for `resource` and `wip` because those datasets are intentionally small and frequently reused. This design optimizes around that constraint rather than removing it.
## Goals / Non-Goals
**Goals:**
- Keep `resource` and `wip` full-table caches intact.
- Reduce memory amplification from redundant cache representations.
- Replace repeated full merge/rebuild paths with incremental/indexed query plans where applicable.
- Increase reuse of browser-side compute modules for chart/table/filter/KPI derivations.
- Add measurable telemetry to verify latency and memory improvements.
**Non-Goals:**
- Rewriting all reporting endpoints to client-only mode.
- Removing Redis or existing layered cache strategy.
- Changing user-visible filter semantics or report outputs.
## Decisions
1. **Constrained cache strategy**
- Decision: retain full-table snapshots for `resource` and `wip`; optimize surrounding representations and derivation paths.
- Rationale: business-approved data-size profile and low complexity for frequent lookups.
2. **Incremental + indexed path for heavy derived datasets**
- Decision: add watermark/version-aware incremental refresh and per-column indexes for high-cardinality filters.
- Rationale: avoids repeated full recompute and lowers request tail latency.
3. **Canonical in-process structure**
- Decision: keep one canonical structure per cache domain and derive alternate views on demand.
- Rationale: reduces 2x/3x memory amplification from parallel representations.
4. **Frontend compute module expansion**
- Decision: extract reusable browser compute helpers for matrix/table/KPI transformations used across report pages.
- Rationale: shifts deterministic shaping work off backend and improves component reuse in Vite architecture.
5. **Benchmark-driven acceptance**
- Decision: add repeatable benchmark fixtures and telemetry thresholds as merge gates.
- Rationale: prevent subjective "performance improved" claims without measurable proof.
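Decision 2 (watermark-aware incremental refresh) can be sketched as follows; `fetch_changed` is an assumed callable yielding `(key, row)` pairs for the `(last_watermark, source_watermark]` range, and the flat-dict cache shape is illustrative.

```python
def incremental_refresh(cache: dict, last_watermark: int,
                        source_watermark: int, fetch_changed) -> int:
    """Merge only partitions changed since the last sync (sketch)."""
    if source_watermark <= last_watermark:
        return last_watermark              # nothing new; keep current snapshot
    for key, row in fetch_changed(last_watermark, source_watermark):
        cache[key] = row                   # partition-level merge, not full rebuild
    return source_watermark
```

Pairing this with periodic full reconciliation (as the risks section suggests) bounds any correctness drift from missed increments.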
## Risks / Trade-offs
- **[Risk] Incremental sync correctness drift** → **Mitigation:** version checksum validation and periodic full reconciliation jobs.
- **[Risk] Browser compute can increase client CPU on low-end devices** → **Mitigation:** bounded dataset chunking and fallback server aggregation path.
- **[Risk] Refactor introduces subtle field-contract regressions** → **Mitigation:** keep export/header contract tests and fixture comparisons.
- **[Risk] Telemetry overhead** → **Mitigation:** low-cost counters/histograms with sampling where needed.

View File

@@ -0,0 +1,36 @@
## Why
Current reporting workloads still spend unnecessary CPU and memory on repeated full-data merges, broad DataFrame copies, and duplicated cache representations. We need a focused efficiency phase that preserves the intentional full-table cache strategy for `resource` and `wip`, while reducing cost for other query paths and increasing frontend compute reuse.
## What Changes
- Introduce indexed/incremental cache synchronization for heavy report datasets that do not require full-table snapshots.
- Keep `resource` and `wip` as full-table cache by design, but reduce redundant in-process representations and copy overhead.
- Move additional derived calculations (chart/table/KPI/filter shaping) to reusable browser modules in Vite frontend.
- Add cache/query efficiency telemetry and repeatable benchmark gates to validate gains.
## Capabilities
### New Capabilities
- `cache-indexed-query-acceleration`: Define incremental refresh and indexed query contracts for non-full-snapshot datasets.
### Modified Capabilities
- `cache-observability-hardening`: Add memory-efficiency and cache-structure telemetry expectations.
- `frontend-compute-shift`: Expand browser-side reusable compute coverage for report interactions.
## Impact
- Affected code:
- `src/mes_dashboard/core/cache.py`
- `src/mes_dashboard/services/resource_cache.py`
- `src/mes_dashboard/services/realtime_equipment_cache.py`
- `src/mes_dashboard/services/wip_service.py`
- `src/mes_dashboard/routes/health_routes.py`
- `frontend/src/core/`
- `frontend/src/**/main.js`
- `tests/`
- APIs:
- read-heavy `/api/wip/*` and `/api/resource/*` endpoints (response contract unchanged)
- Operational behavior:
- Preserve current `resource` and `wip` full-table caching strategy.
- Reduce server-side compute load through selective frontend compute offload.

View File

@@ -0,0 +1,22 @@
## ADDED Requirements
### Requirement: Incremental Synchronization SHALL Use Versioned Watermarks
For heavy non-full-snapshot datasets, cache refresh SHALL support incremental synchronization keyed by stable version or watermark boundaries.
#### Scenario: Incremental refresh cycle
- **WHEN** source data version indicates partial changes since last sync
- **THEN** cache update logic MUST fetch and merge only changed partitions while preserving correctness guarantees
### Requirement: Query Paths SHALL Use Indexed Access for High-Frequency Filters
Query execution over cached data SHALL use prebuilt indexes for known high-frequency filter columns.
#### Scenario: Filtered report query
- **WHEN** request filters target indexed fields
- **THEN** result selection MUST avoid full dataset scans and maintain existing response contract
### Requirement: Business-Mandated Full-Table Caches SHALL Be Preserved for Resource and WIP
The system SHALL continue to maintain full-table cache behavior for `resource` and `wip` domains.
#### Scenario: Resource or WIP cache refresh
- **WHEN** cache update runs for `resource` or `wip`
- **THEN** the updater MUST retain full-table snapshot semantics and MUST NOT switch these domains to partial-only cache mode
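The indexed-access requirement can be sketched with a prebuilt per-column index; the helper names and the row-as-dict shape are assumptions, not the service's actual data structures.

```python
from collections import defaultdict

def build_column_index(rows: list[dict], column: str) -> dict:
    """Prebuild value -> row positions for a high-frequency filter column."""
    index = defaultdict(list)
    for pos, row in enumerate(rows):
        index[row[column]].append(pos)
    return index

def select_indexed(rows: list[dict], index: dict, value) -> list[dict]:
    # Indexed selection: no full-dataset scan, same row objects returned,
    # so the existing response contract is preserved.
    return [rows[pos] for pos in index.get(value, [])]
```

Index builders would run once per cache refresh, so each filtered report query pays only the lookup cost.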

View File

@@ -0,0 +1,15 @@
## ADDED Requirements
### Requirement: Cache Telemetry SHALL Include Memory Amplification Signals
Operational telemetry MUST expose cache domain memory usage indicators and representation amplification factors.
#### Scenario: Deep health telemetry request
- **WHEN** operators inspect cache telemetry
- **THEN** telemetry MUST include per-domain memory footprint and amplification indicators sufficient to detect redundant structures
### Requirement: Efficiency Benchmarks SHALL Gate Cache Refactor Rollout
Cache/query efficiency changes MUST be validated against baseline latency and memory benchmarks before rollout.
#### Scenario: Pre-release validation
- **WHEN** cache refactor changes are prepared for deployment
- **THEN** benchmark results MUST demonstrate no regression beyond configured thresholds for P95 latency and memory usage

View File

@@ -0,0 +1,15 @@
## ADDED Requirements
### Requirement: Reusable Browser Compute Modules SHALL Power Report Derivations
Derived computations for report filters, KPI cards, chart series, and table projections SHALL be implemented through reusable frontend modules.
#### Scenario: Shared report derivation logic
- **WHEN** multiple report pages require equivalent data-shaping behavior
- **THEN** pages MUST consume shared compute modules instead of duplicating transformation logic per page
### Requirement: Browser Compute Shift SHALL Preserve Export and Field Contracts
Moving computations to frontend MUST preserve existing field naming and export column contracts.
#### Scenario: User exports report after frontend-side derivation
- **WHEN** transformed data is rendered and exported
- **THEN** exported field names and ordering MUST remain consistent with governed field contract definitions

View File

@@ -0,0 +1,23 @@
## 1. Cache Structure and Sync Refactor
- [x] 1.1 Define canonical per-domain cache representation and remove redundant parallel structures.
- [x] 1.2 Implement version/watermark-based incremental sync path for eligible non-full-snapshot datasets.
- [x] 1.3 Keep `resource` and `wip` full-table cache behavior while optimizing surrounding parse/index pipelines.
## 2. Indexed Query Acceleration
- [x] 2.1 Add index builders for high-frequency filter columns used by report endpoints.
- [x] 2.2 Refactor read paths to use indexed selection and reduce broad DataFrame copy operations.
- [x] 2.3 Add fallback and reconciliation logic to guarantee correctness under incremental/index drift.
## 3. Frontend Compute Reuse Expansion
- [x] 3.1 Extract shared Vite compute modules for KPI/filter/chart/table derivations.
- [x] 3.2 Refactor report pages to consume shared modules without changing user-visible behavior.
- [x] 3.3 Validate export/header field contract consistency after compute shift.
## 4. Performance Validation and Docs
- [x] 4.1 Add benchmark fixtures for baseline vs refactor latency/memory comparison.
- [x] 4.2 Surface cache memory amplification and index efficiency telemetry in health/admin outputs.
- [x] 4.3 Update README/README.mdj with cache strategy constraints and performance governance rules.

View File

@@ -0,0 +1,2 @@
schema: spec-driven
created: 2026-02-08

View File

@@ -0,0 +1,45 @@
## Context
The project already provides watchdog-assisted restart and resilience diagnostics, but policy boundaries for automated recovery are not yet formalized end-to-end. In practice, this can lead to either under-recovery (manual delays) or over-recovery (restart churn). We also need stronger conda/systemd path consistency checks to prevent runtime drift across deploy scripts and services.
## Goals / Non-Goals
**Goals:**
- Make conda/systemd runtime path contracts explicit, validated, and drift-detectable.
- Implement safe self-healing policy with cooldown and churn limits.
- Expose clear alert signals and recommended actions in health/admin payloads.
- Keep operator manual override available for incident control.
**Non-Goals:**
- Migrating from systemd to another orchestrator.
- Changing database vendor or introducing full autoscaling infrastructure.
- Removing existing admin restart endpoints.
## Decisions
1. **Single source runtime contract**
- Decision: centralize conda runtime path configuration consumed by systemd units, watchdog, and scripts.
- Rationale: prevents mismatched interpreter/path drift.
2. **Guarded self-healing state machine**
- Decision: implement bounded restart policy (cooldown + max retries per time window + circuit-open gating).
- Rationale: recovers quickly while preventing restart storms.
3. **Explicit recovery observability contract**
- Decision: enrich health/admin payloads with churn counters, cooldown state, and recommended operator action.
- Rationale: enables deterministic triage and alert automation.
4. **Auditability requirement**
- Decision: emit structured logs/events for auto-restart decision, manual override, and blocked restart attempts.
- Rationale: supports incident retrospectives and policy tuning.
5. **Runbook-first rollout**
- Decision: deploy policy changes behind documentation and validation gates, including rollback steps.
- Rationale: operational safety for production adoption.
## Risks / Trade-offs
- **[Risk] Overly strict policy delays recovery** → **Mitigation:** configurable thresholds and emergency manual override.
- **[Risk] Aggressive policy causes churn loops** → **Mitigation:** hard stop on churn threshold breach and explicit cool-off windows.
- **[Risk] Added operational complexity** → **Mitigation:** concise runbook with decision tables and tested scripts.
- **[Risk] Drift detection false positives** → **Mitigation:** normalize path resolution and clearly defined comparison sources.

View File

@@ -0,0 +1,40 @@
## Why
Operations stability still depends heavily on manual intervention when workers degrade or pools saturate. We need a formal operations phase to align conda/systemd runtime contracts and add controlled self-healing with guardrails, so recovery is faster without triggering restart storms.
## What Changes
- Standardize conda-based runtime paths across app service, watchdog, and operational scripts from a single source of truth.
- Introduce guarded worker self-healing policy (cooldown, churn windows, bounded retries, manual override).
- Add alert thresholds and machine-readable recovery signals for pool pressure, circuit-open persistence, and restart churn.
- Harden runbook documentation and scripts for deterministic restart, rollback, and incident triage.
## Capabilities
### New Capabilities
- `worker-self-healing-governance`: Define safe autonomous recovery behavior with anti-storm guardrails.
### Modified Capabilities
- `conda-systemd-runtime-alignment`: Extend runtime consistency requirements with startup validation and drift detection.
- `runtime-resilience-recovery`: Add auditable recovery-action requirements for automated and operator-triggered restart flows.
## Impact
- Affected code:
- `deploy/systemd/*.service`
- `scripts/worker_watchdog.py`
- `src/mes_dashboard/routes/admin_routes.py`
- `src/mes_dashboard/routes/health_routes.py`
- `src/mes_dashboard/core/database.py`
- `src/mes_dashboard/core/circuit_breaker.py`
- `tests/`
- `README.md`, `README.mdj`, runbook docs
- APIs:
- `/health`
- `/health/deep`
- `/admin/api/system-status`
- `/admin/api/worker/status`
- `/admin/api/worker/restart`
- Operational behavior:
- Preserve single-port bind model.
- Add controlled self-healing policy and clearer alert thresholds.

View File

@@ -0,0 +1,15 @@
## ADDED Requirements
### Requirement: Runtime Path Drift SHALL Be Detectable Before Service Start
Service startup checks MUST validate configured conda runtime paths across app, watchdog, and worker control scripts.
#### Scenario: Conda path mismatch detected
- **WHEN** startup validation finds runtime path inconsistency between configured units and scripts
- **THEN** service start MUST fail with actionable diagnostics instead of running with partial mismatch
### Requirement: Conda/Systemd Contract SHALL Be Versioned in Operations Docs
The documented runtime contract MUST include versioned path assumptions and verification commands.
#### Scenario: Operator verifies deployment contract
- **WHEN** operator follows runbook validation steps
- **THEN** commands MUST confirm active runtime paths match documented conda/systemd contract
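The drift-detection requirement could be sketched as a startup check that compares resolved conda runtime paths across consumers; the function name and the consumer labels are illustrative.

```python
from pathlib import Path

def check_conda_path_drift(configured: dict[str, str]) -> list[str]:
    """Return diagnostics when app/watchdog/script runtime paths drift (sketch)."""
    resolved = {name: str(Path(raw).resolve()) for name, raw in configured.items()}
    diagnostics = []
    if len(set(resolved.values())) > 1:
        diagnostics.append(f"runtime path mismatch across consumers: {resolved}")
    for name, path in resolved.items():
        if not Path(path).exists():
            diagnostics.append(f"{name}: configured path does not exist: {path}")
    return diagnostics
```

A non-empty result would abort service start with actionable diagnostics, matching the "fail with actionable diagnostics" scenario above.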

View File

@@ -0,0 +1,15 @@
## ADDED Requirements
### Requirement: Recovery Recommendations SHALL Reflect Self-Healing Policy State
Health and admin resilience payloads MUST expose whether automated recovery is allowed, cooling down, or blocked by churn policy.
#### Scenario: Operator inspects degraded state
- **WHEN** `/health` or `/admin/api/worker/status` is requested during degradation
- **THEN** response MUST include policy state, cooldown remaining time, and next recommended action
### Requirement: Manual Recovery Override SHALL Be Explicit and Controlled
Manual restart actions MUST bypass the automatic recovery block only through authenticated operator pathways with explicit acknowledgement.
#### Scenario: Churn-blocked state with manual override request
- **WHEN** authorized admin requests manual restart while auto-recovery is blocked
- **THEN** system MUST execute controlled restart path and log the override context for auditability

View File

@@ -0,0 +1,22 @@
## ADDED Requirements
### Requirement: Automated Worker Recovery SHALL Use Bounded Policy Guards
Automated worker restart behavior MUST enforce cooldown periods and bounded restart attempts within a configurable time window.
#### Scenario: Repeated worker degradation within short window
- **WHEN** degradation events exceed configured restart-attempt budget
- **THEN** automated restarts MUST pause and surface a blocked-recovery signal for operator intervention
### Requirement: Restart-Churn Protection SHALL Prevent Recovery Storms
The runtime MUST classify restart churn and prevent uncontrolled restart loops.
#### Scenario: Churn threshold exceeded
- **WHEN** restart count crosses churn threshold in active window
- **THEN** watchdog MUST enter guarded mode and require explicit manual override before further restart attempts
### Requirement: Recovery Decisions SHALL Be Audit-Ready
Every auto-recovery decision and manual override action MUST be recorded with structured metadata.
#### Scenario: Worker restart decision emitted
- **WHEN** system executes or denies a restart action
- **THEN** structured logs/events MUST include reason, thresholds, actor/source, and resulting state
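The bounded-policy and churn-protection requirements above can be sketched as a small state machine. Class and state names are illustrative; the real watchdog policy is assumed to carry more context (audit metadata, override handling).

```python
import time
from collections import deque

class RestartPolicy:
    """Bounded restart guard: cooldown + attempt budget per sliding window (sketch)."""

    def __init__(self, max_restarts=3, window_s=600, cooldown_s=60,
                 clock=time.monotonic):
        self.max_restarts = max_restarts
        self.window_s = window_s
        self.cooldown_s = cooldown_s
        self.clock = clock
        self.history = deque()       # timestamps of recent restarts
        self.last_restart = None

    def state(self) -> str:
        now = self.clock()
        while self.history and now - self.history[0] > self.window_s:
            self.history.popleft()   # drop restarts outside the churn window
        if len(self.history) >= self.max_restarts:
            return "blocked"         # guarded mode: require manual override
        if self.last_restart is not None and now - self.last_restart < self.cooldown_s:
            return "cooldown"
        return "allowed"

    def record_restart(self) -> None:
        now = self.clock()
        self.history.append(now)
        self.last_restart = now
```

Exposing `state()` in health/admin payloads gives the "allowed / cooldown / blocked" signal the tasks list describes.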

View File

@@ -0,0 +1,23 @@
## 1. Conda/Systemd Contract Alignment
- [x] 1.1 Centralize runtime path configuration consumed by service units, watchdog, and scripts.
- [x] 1.2 Add startup validation that fails fast on conda path drift.
- [x] 1.3 Update systemd/watchdog integration tests for consistent runtime contract.
## 2. Worker Self-Healing Policy
- [x] 2.1 Implement bounded auto-restart policy (cooldown, retry budget, churn window).
- [x] 2.2 Add guarded mode behavior when churn threshold is exceeded.
- [x] 2.3 Implement authenticated manual override flow with explicit logging context.
## 3. Alerting and Operational Signals
- [x] 3.1 Expose policy-state fields in health/admin payloads (`allowed`, `cooldown`, `blocked`).
- [x] 3.2 Add structured audit events for restart decisions and override actions.
- [x] 3.3 Define alert thresholds and wire monitoring-friendly fields for pool/circuit/churn conditions.
## 4. Validation and Runbook Delivery
- [x] 4.1 Add tests for policy transitions, guarded mode, and override behavior.
- [x] 4.2 Validate single-port continuity during controlled recovery and hot reload paths.
- [x] 4.3 Update README/README.mdj and deployment runbook with verified operational procedures.

View File

@@ -0,0 +1,2 @@
schema: spec-driven
created: 2026-02-08

View File

@@ -0,0 +1,50 @@
## Context
The system has completed the Vite single-port architecture and the major P0/P1/P2 hardening, but residual risk concentrates in cache slow-path lock contention, health-endpoint hot queries, and API boundary governance. Most of these issues only surface under medium-to-high traffic; deferring them would make later troubleshooting costly.
## Goals / Non-Goals
**Goals:**
- Complete the remaining stability and security fixes without changing page interaction semantics or the single-port architecture.
- Make cache/health paths more predictable under high concurrency and reduce security risk in logs.
- Use test coverage to ensure the fixes cause no functional regressions.
**Non-Goals:**
- No rewrite of the main query flows and no removal of the `resource`/`wip` full-table cache strategy.
- No heavyweight distributed rate-limit infrastructure.
- No changes to frontend drill-down or report feature semantics.
## Decisions
1. **Cache publish consistency over local optimization**
   - Publish data and metadata via staging keys with an atomic rename/pipeline, so a failed publish never affects readability of the previous snapshot.
2. **Parse outside the lock; keep the lock for consistency check/commit only**
   - The WIP process-cache slow path parses outside the lock, then double-checks and commits inside the lock, shortening lock hold time.
3. **Consistent process-cache policy**
   - Add `max_size` + LRU to the realtime equipment cache, matching the existing WIP/Resource caches.
4. **Short internal health cache enabled only outside testing**
   - TTL = 5 seconds, reducing repeated DB/Redis pressure from high-frequency probes; testing mode keeps real-time computation to avoid cross-test pollution.
5. **Lightweight in-memory rate limiting for high-cost APIs**
   - Throttle per IP + route window with tunable parameters, without introducing new external dependencies.
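Decision 1 (staging key + atomic rename publish) can be sketched against a redis-py-style client; the key naming and `client` interface are assumptions, not the project's actual cache updater.

```python
import json

def publish_snapshot(client, data_key: str, meta_key: str, payload, meta) -> None:
    """Write to staging keys, then atomically swap via RENAME in one pipeline,
    so a failed publish leaves the previous snapshot fully readable (sketch)."""
    staging_data = data_key + ":staging"
    staging_meta = meta_key + ":staging"
    # If either write fails here, the live keys are untouched.
    client.set(staging_data, json.dumps(payload))
    client.set(staging_meta, json.dumps(meta))
    pipe = client.pipeline(transaction=True)
    pipe.rename(staging_data, data_key)
    pipe.rename(staging_meta, meta_key)
    pipe.execute()  # both keys flip together; readers never see a partial state
```

Because data and metadata rename in the same transaction, readers observe either the whole old snapshot or the whole new one, which is exactly the publish-consistency decision.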
## Risks / Trade-offs
- [Risk] The cache publish rework adds key-switching complexity → Mitigation: add tests covering both publish failure and success paths.
- [Risk] The health cache briefly delays observability → Mitigation: cap the TTL at 5 seconds and disable it in testing.
- [Risk] In-memory rate limiting is not globally consistent across multiple workers → Mitigation: treat it as a protective valve first; upgrade to a Redis-based limiter later.
## Migration Plan
1. Complete the core cache and health fixes first (no API contract impact).
2. Then introduce API boundary controls/rate limiting and shared-utility extraction.
3. Add unit and integration tests, and run a benchmark smoke test.
4. Update the README documentation and environment-variable descriptions.
## Open Questions
- Should default rate-limit thresholds for high-cost APIs be tuned per endpoint (WIP vs Resource)?
- Should we later upgrade to Redis-based distributed rate limiting for global consistency across workers?

View File

@@ -0,0 +1,44 @@
## Why
The previous round fixed the high-risk core issues, but a batch of residual problems remains that amplifies risk under high concurrency, long uptime, and malicious or abnormal input (cache publish consistency, lock contention, health-check load, input boundaries, and rate governance). This round converges these tail risks to an acceptable range, avoiding later operational and performance instability.
## What Changes
- Harden the WIP cache publish flow so a failed update cannot pollute the existing read path.
- Narrow the process-cache slow-path lock scope to avoid parsing large JSON payloads while holding the lock.
- Add a bounded LRU to the realtime equipment process cache, consistent with the WIP/Resource policy.
- Add depth protection to NaN cleaning in resource routes (avoiding deep-recursion risk).
- Extract shared boolean query-parameter parsing to eliminate duplicated logic.
- Make filter cache view names configurable, removing hardcoded coupling.
- Add log redaction for sensitive connection strings.
- Add a 5-second internal short cache for `/health` and `/health/deep` (disabled in testing mode).
- Add lightweight, tunable rate limiting to high-cost query APIs.
- Update README/README.mdj and validation tests.
## Capabilities
### New Capabilities
- `api-safety-hygiene`: API input boundaries, shared parameter parsing, configurable query sources, and rate governance for high-cost endpoints.
### Modified Capabilities
- `cache-observability-hardening`: strengthen cache publish consistency, process-cache lock scope, and bounded-policy alignment.
- `runtime-resilience-recovery`: operational safety requirements for the health-check short cache and sensitive-information log redaction.
## Impact
- Affected code:
- `src/mes_dashboard/core/cache_updater.py`
- `src/mes_dashboard/core/cache.py`
- `src/mes_dashboard/services/realtime_equipment_cache.py`
- `src/mes_dashboard/routes/resource_routes.py`
- `src/mes_dashboard/routes/wip_routes.py`
- `src/mes_dashboard/routes/hold_routes.py`
- `src/mes_dashboard/services/filter_cache.py`
- `src/mes_dashboard/core/database.py`
- `src/mes_dashboard/routes/health_routes.py`
- APIs:
- `/health`, `/health/deep`
- `/api/wip/detail/<workcenter>`, `/api/wip/overview/*`
- `/api/resource/*` (high-cost routes)
- Docs/tests:
- `README.md`, `README.mdj`, `tests/*`

View File

@@ -0,0 +1,29 @@
## ADDED Requirements
### Requirement: Recursive Payload Cleaning MUST Enforce Depth Safety
Routes that normalize nested payloads MUST prevent unbounded recursion depth.
#### Scenario: Deeply nested response object
- **WHEN** NaN-cleaning helper receives deeply nested list/dict payload
- **THEN** cleaning logic MUST enforce max depth or iterative traversal and return safely without recursion failure
### Requirement: Filter Source Names MUST Be Configurable
Filter cache query sources MUST NOT rely solely on hardcoded view names.
#### Scenario: Environment-specific view names
- **WHEN** deployment sets custom filter-source environment variables
- **THEN** filter cache loader MUST resolve and query configured view names
### Requirement: High-Cost APIs SHALL Apply Basic Rate Guardrails
High-cost read endpoints SHALL apply configurable request-rate guardrails to reduce abuse and accidental bursts.
#### Scenario: Burst traffic from same client
- **WHEN** a client exceeds configured request budget for guarded endpoints
- **THEN** endpoint SHALL return throttled response with clear retry guidance
### Requirement: Common Boolean Query Parsing SHALL Be Shared
Boolean query parsing in routes SHALL use shared helper behavior.
#### Scenario: Different routes parse include flags
- **WHEN** routes parse common boolean query parameters
- **THEN** parsing behavior MUST be consistent across routes via shared utility
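The rate-guardrail requirement can be sketched as a fixed-window in-memory limiter keyed by `(client_ip, route)`, matching the design's "no new external dependencies" constraint. The class name and defaults are assumptions.

```python
import time
from collections import defaultdict, deque

class WindowRateLimiter:
    """In-memory fixed-window guard keyed by (client_ip, route) — a protective
    valve, not a globally consistent limiter across workers (sketch)."""

    def __init__(self, limit=30, window_s=60, clock=time.monotonic):
        self.limit = limit
        self.window_s = window_s
        self.clock = clock
        self.hits = defaultdict(deque)

    def allow(self, client_ip: str, route: str) -> bool:
        key = (client_ip, route)
        now = self.clock()
        q = self.hits[key]
        while q and now - q[0] > self.window_s:
            q.popleft()                 # expire hits outside the window
        if len(q) >= self.limit:
            return False                # caller responds throttled with retry guidance
        q.append(now)
        return True
```

A guarded endpoint would check `allow()` first and return a throttled response with retry guidance when it fails, as the scenario requires.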

View File

@@ -0,0 +1,26 @@
## ADDED Requirements
### Requirement: Cache Publish MUST Preserve Previous Readable Snapshot on Failure
When refreshing full-table cache payloads, the system MUST avoid exposing partially published states to readers.
#### Scenario: Publish fails after payload serialization
- **WHEN** a cache refresh has prepared new payload but publish operation fails
- **THEN** previously published cache keys MUST remain readable and metadata MUST remain consistent with old snapshot
#### Scenario: Publish succeeds
- **WHEN** publish operation completes successfully
- **THEN** data payload and metadata keys MUST be visible as one coherent new snapshot
### Requirement: Process-Level Cache Slow Path SHALL Minimize Lock Hold Time
Large payload parsing MUST NOT happen inside long-held process cache locks.
#### Scenario: Cache miss under concurrent requests
- **WHEN** multiple requests hit process cache miss
- **THEN** parsing work SHALL happen outside lock-protected mutation section, and lock scope SHALL be limited to consistency check + commit
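The lock-scope rule can be illustrated with a minimal parse-outside-lock cache; `ParsedCache` is a hypothetical stand-in for the project's process cache, with `json.loads` as the "heavy parse":

```python
import json
import threading

class ParsedCache:
    """Process cache whose slow path parses outside the lock.

    The expensive parse runs with no lock held; the lock only guards a
    version check plus commit, so concurrent misses do redundant parsing
    at worst, never serialized parsing.
    """

    def __init__(self):
        self._lock = threading.Lock()
        self._value = None
        self._version = None

    def get(self, raw_payload, version):
        with self._lock:
            if self._version == version:
                return self._value       # hit path: no parse at all
        parsed = json.loads(raw_payload)  # heavy work, lock not held
        with self._lock:
            if self._version != version:  # consistency check + commit
                self._value = parsed
                self._version = version
            return self._value
```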
### Requirement: Process-Level Cache Policies MUST Stay Consistent Across Services
All service-local process caches MUST support bounded capacity with deterministic eviction.
#### Scenario: Realtime equipment cache growth
- **WHEN** realtime equipment process cache reaches configured capacity
- **THEN** entries MUST be evicted according to deterministic LRU behavior
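A bounded TTL + LRU policy of the kind required here can be sketched with `OrderedDict` (the class name and defaults are illustrative):

```python
import time
from collections import OrderedDict

class BoundedTTLCache:
    """TTL cache with a max_size bound and deterministic LRU eviction."""

    def __init__(self, max_size=128, ttl_seconds=300.0):
        self.max_size = max_size
        self.ttl = ttl_seconds
        self._data = OrderedDict()  # key -> (value, stored_at)

    def get(self, key, now=None):
        now = time.monotonic() if now is None else now
        entry = self._data.get(key)
        if entry is None:
            return None
        value, stored_at = entry
        if now - stored_at >= self.ttl:
            del self._data[key]
            return None
        self._data.move_to_end(key)  # reads refresh recency
        return value

    def set(self, key, value, now=None):
        now = time.monotonic() if now is None else now
        if key in self._data:
            self._data.move_to_end(key)
        self._data[key] = (value, now)
        while len(self._data) > self.max_size:
            self._data.popitem(last=False)  # evict least-recently-used
```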


@@ -0,0 +1,19 @@
## ADDED Requirements
### Requirement: Health Endpoints SHALL Use Short Internal Memoization
Health and deep-health computation SHALL use a short-lived internal cache to prevent probe storms from amplifying backend load.
#### Scenario: Frequent monitor scrapes
- **WHEN** health endpoints are called repeatedly within a small window
- **THEN** service SHALL return memoized payload for up to 5 seconds in non-testing environments
#### Scenario: Testing mode
- **WHEN** app is running in testing mode
- **THEN** health endpoint memoization MUST be bypassed to preserve deterministic tests
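A short-lived memo honoring the testing bypass might be wired like this sketch (`memoize_health` is an assumed helper name; the clock is injectable purely to keep the example testable):

```python
import time

def memoize_health(compute, ttl_seconds=5.0, testing=False, clock=time.monotonic):
    """Wrap a health-payload function with a short-lived memo.

    Repeated probe scrapes within `ttl_seconds` reuse the last payload;
    testing mode bypasses the memo for deterministic tests.
    """
    state = {"at": None, "payload": None}

    def wrapper():
        if testing:
            return compute()
        now = clock()
        if state["at"] is not None and now - state["at"] < ttl_seconds:
            return state["payload"]  # memo hit: no backend work
        state["payload"] = compute()
        state["at"] = now
        return state["payload"]

    return wrapper
```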
### Requirement: Logs MUST Redact Connection Secrets
Runtime logs MUST avoid exposing DB connection credentials.
#### Scenario: Connection string appears in log message
- **WHEN** a log message contains DB URL credentials
- **THEN** logger output MUST redact password and sensitive userinfo before emission


@@ -0,0 +1,22 @@
## 1. Cache Consistency and Contention Hardening
- [x] 1.1 Harden WIP cache publish in `cache_updater.py` to preserve old snapshot on publish failure.
- [x] 1.2 Refactor WIP process-cache slow path in `core/cache.py` so heavy parse runs outside lock.
- [x] 1.3 Extend realtime equipment process cache with bounded `max_size` + deterministic LRU and add regression tests.
## 2. API Safety and Config Hygiene
- [x] 2.1 Add depth-safe NaN cleaning in `resource_routes.py` and tests for deep payloads.
- [x] 2.2 Add shared boolean query parser in `core/utils.py` and switch `wip_routes.py` / `hold_routes.py` to it.
- [x] 2.3 Make filter-cache source views configurable (env-based) in `filter_cache.py` and add config tests.
## 3. Runtime Guardrails
- [x] 3.1 Add DB connection-string redaction logging filter in `core/database.py` (or logging bootstrap) with tests.
- [x] 3.2 Add 5-second internal memoization for `/health` and `/health/deep` (disabled in testing) and tests.
- [x] 3.3 Add lightweight rate limiting for selected high-cost APIs with clear throttling responses and tests.
## 4. Validation and Documentation
- [x] 4.1 Run targeted backend/frontend tests and benchmark smoke gate.
- [x] 4.2 Update `README.md` and `README.mdj` with round-3 hardening notes and new env variables.


@@ -0,0 +1,2 @@
schema: spec-driven
created: 2026-02-08


@@ -0,0 +1,61 @@
## Context
After round 3, the main flow is stable, but three classes of technical debt remain:
- The Resource cache keeps both a DataFrame and a full copy of the records in the same process, amplifying memory usage.
- Resource and Realtime Equipment Oracle queries duplicate SQL strings across services, so later edits can easily drift apart.
- Type annotations at some service boundaries and magic numbers are not yet systematized, keeping maintenance cost high.
Constraints:
- `resource` / `wip` keep the full-table cache strategy; data sources and refresh cadence are unchanged.
- External API fields and frontend behavior are unchanged.
- The single-port architecture and existing operations contract are preserved.
## Goals / Non-Goals
**Goals:**
- Reduce duplicated in-process data representations in the Resource cache while keeping query output compatible.
- Maintain cross-service Oracle query fragments from a single source.
- Give key service/cache modules consistent type annotations and named constants.
**Non-Goals:**
- No changes to the database schema or SQL result columns.
- No rewrite of the overall cache architecture (Redis + process cache stays).
- No new infrastructure or external dependencies.
## Decisions
1. Switch the Resource derived index to a "row-position index" instead of keeping a full records copy
   - Current state: the index keeps `records` plus several bucket record sets, duplicating DataFrame content.
   - Decision: the index keeps only row positions (integer indices) and essential metadata; dict output is converted on demand from the DataFrame.
   - Trade-off: each output pays a small conversion cost, but resident memory duplication drops significantly.
2. Create a shared Oracle query constants module
   - Current state: `resource_cache.py` and `realtime_equipment_cache.py` each maintain their own base SQL.
   - Decision: extract `services/sql_fragments.py` (or an equivalent module) to own shared query text and table/view names.
   - Trade-off: adds one level of indirection, but query semantics stay consistent and changes stay controlled.
3. Govern types and constants "core boundaries first, then expand"
   - Current state: some functions mix `Optional` with PEP 604 syntax, and magic numbers are scattered across cache/service code.
   - Decision: first unify the type style and high-frequency constants (TTL, size, window, limits) in the files touched this round.
   - Trade-off: no one-shot project-wide cleanup, to avoid large-scale noise; establish a sustainable, extensible baseline instead.
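The row-position index from Decision 1 can be sketched as follows, assuming pandas is available; `build_position_index` and `rows_for` are illustrative names, not the project's actual API:

```python
import pandas as pd

def build_position_index(df, column):
    """Bucket DataFrame rows by `column`, storing only integer positions."""
    index = {}
    for pos, value in enumerate(df[column].tolist()):
        index.setdefault(value, []).append(pos)
    return index

def rows_for(df, index, value):
    """Materialize record dicts on demand instead of keeping stored copies."""
    positions = index.get(value, [])
    return df.iloc[positions].to_dict(orient="records")
```

The index holds only small integer lists; the DataFrame stays the single authoritative representation, which is exactly the memory-amplification reduction the decision targets.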
## Risks / Trade-offs
- [Risk] The row-position index drifts out of sync with the DataFrame version → Mitigation: rebuild the index on every cache invalidation and keep a version check.
- [Risk] Lazy conversion causes query-side latency jitter → Mitigation: keep the process cache and optimize hot paths with small-batch output.
- [Risk] Extracting shared SQL constants introduces reference errors → Mitigation: add unit tests verifying query text matches the existing column contracts.
- [Risk] Type/constant cleanup changes behavior → Mitigation: restrict to equivalence-preserving refactors, keep original values, and cover with regression tests.
## Migration Plan
1. Refactor the Resource index representation first, keeping API output unchanged.
2. Extract the shared SQL fragments and switch both cache services to them.
3. Clean up types and constants in scope and add tests.
4. Update README / README.mdj and the OpenSpec tasks; run the targeted backend/frontend test sets.
Rollback:
- If compatibility issues appear, revert to the original records-based index and the old inline SQL (a single-file revert is enough).
## Open Questions
- Whether to extend the same governance to the remaining constants and types in `wip_service.py` in the next round (this round is limited to the residual scope).


@@ -0,0 +1,31 @@
## Why
The remaining risks are concentrated in maintainability and memory efficiency: the Resource cache keeps multiple data representations in a single process, some query SQL is maintained in duplicate across cache services, and type annotations and magic numbers are still inconsistent. These issues do not cause outages immediately, but they raise memory usage, the cost of later changes, and regression risk, so they should be converged without changing existing functionality.
## What Changes
- Switch the Resource derived index to a "lightweight index + lazy output" representation, so the process no longer keeps a duplicated full copy of the records.
- Consolidate the Resource and Realtime Equipment Oracle query strings into a shared SQL constants module, reducing duplicate definitions and drift risk.
- Make type annotations consistent (especially at cache/index/service boundaries) and promote high-frequency magic numbers to named constants or configurable parameters.
- Keep the existing API contract, full-table cache strategy, single-port architecture, and frontend behavior unchanged.
## Capabilities
### New Capabilities
- `resource-cache-representation-normalization`: replace multiple full in-process data copies with a single authoritative representation plus a lightweight index, preserving the existing query return structures.
- `oracle-query-fragment-governance`: extract cross-service Oracle query fragments into shared constants/templates to keep query semantics consistent.
- `maintainability-type-and-constant-hygiene`: establish working conventions for type annotations and named constants, reducing magic numbers and annotation-style drift.
### Modified Capabilities
- `cache-observability-hardening`: adds observability-consistency requirements covering memory amplification factors and the adjusted index representation.
## Impact
- Primary affected files:
  - `src/mes_dashboard/services/resource_cache.py`
  - `src/mes_dashboard/services/realtime_equipment_cache.py`
  - `src/mes_dashboard/services/resource_service.py` (if the index output requires coordination)
  - `src/mes_dashboard/sql/*` or a new shared SQL constants module
  - `src/mes_dashboard/config/constants.py` and `src/mes_dashboard/core/utils.py`
  - Related tests and README/README.mdj documentation
- No new external dependencies; no changes to external API paths or field contracts.


@@ -0,0 +1,8 @@
## MODIFIED Requirements
### Requirement: Cache Telemetry SHALL Include Memory Amplification Signals
Operational telemetry MUST expose cache-domain memory usage indicators and representation amplification factors, and MUST differentiate between authoritative data payload and derived/index helper structures.
#### Scenario: Deep health telemetry request after representation normalization
- **WHEN** operators inspect cache telemetry for resource or WIP domains
- **THEN** telemetry MUST include per-domain memory footprint, amplification indicators, and enough structure detail to verify that full-record duplication is not reintroduced


@@ -0,0 +1,15 @@
## ADDED Requirements
### Requirement: Core Cache and Service Boundaries MUST Use Consistent Type Annotation Style
Core cache/service modules touched by this change SHALL use a consistent and explicit type-annotation style for public and internal helper boundaries.
#### Scenario: Reviewing updated cache/service modules
- **WHEN** maintainers inspect function signatures in affected modules
- **THEN** optional and collection types MUST follow a single consistent style and remain compatible with the project Python baseline
### Requirement: High-Frequency Magic Numbers MUST Be Replaced by Named Constants
Cache, throttling, and index-related numeric literals that control behavior MUST be extracted to named constants or env-configurable settings.
#### Scenario: Tuning cache/index behavior
- **WHEN** operators need to tune cache/index thresholds
- **THEN** they MUST find values in named constants or environment variables rather than scattered inline literals


@@ -0,0 +1,15 @@
## ADDED Requirements
### Requirement: Shared Oracle Query Fragments SHALL Have a Single Source of Truth
Cross-service Oracle query fragments for resource and equipment cache loading MUST be defined in a shared module and imported by service implementations.
#### Scenario: Update common table/view reference
- **WHEN** a common table or view name changes
- **THEN** operators and developers MUST be able to update one shared definition without editing duplicated SQL literals across services
### Requirement: Service Queries MUST Preserve Existing Columns and Semantics
Services consuming shared Oracle query fragments SHALL preserve existing selected columns, filters, and downstream payload behavior.
#### Scenario: Resource and equipment cache refresh after refactor
- **WHEN** cache services execute queries via shared fragments
- **THEN** resulting payload structure MUST remain compatible with existing aggregation and API contracts


@@ -0,0 +1,22 @@
## ADDED Requirements
### Requirement: Resource Derived Index MUST Avoid Full Record Duplication
Resource derived index SHALL use lightweight row-position references instead of storing full duplicated record payloads alongside the process DataFrame cache.
#### Scenario: Build index from cached DataFrame
- **WHEN** resource cache data is parsed from Redis into process-level DataFrame
- **THEN** the derived index MUST store position-based references and metadata without a second full records copy
### Requirement: Resource Query APIs SHALL Preserve Existing Response Contract
Resource query APIs MUST keep existing output fields and semantics after index representation normalization.
#### Scenario: Read all resources after normalization
- **WHEN** callers request all resources or filtered resource lists
- **THEN** the returned payload MUST remain field-compatible with pre-normalization responses
### Requirement: Cache Invalidation MUST Keep Index/Data Coherent
The system SHALL invalidate and rebuild DataFrame/index representations atomically at cache refresh boundaries.
#### Scenario: Redis-backed cache refresh completes
- **WHEN** a new resource cache snapshot is published
- **THEN** stale index references MUST be invalidated before subsequent reads use refreshed DataFrame data


@@ -0,0 +1,22 @@
## 1. Resource Cache Representation Normalization
- [x] 1.1 Refactor `resource_cache` derived index to use lightweight row-position references instead of full duplicated records payload.
- [x] 1.2 Keep `get_all_resources` / `get_resources_by_filter` API outputs backward compatible while sourcing data from normalized representation.
- [x] 1.3 Update cache telemetry fields to reflect normalized representation and verify amplification calculation remains interpretable.
## 2. Oracle Query Fragment Governance
- [x] 2.1 Extract shared Oracle SQL fragments/constants for resource/equipment cache loading into a common module.
- [x] 2.2 Replace duplicated SQL literals in `resource_cache.py` and `realtime_equipment_cache.py` with shared definitions.
- [x] 2.3 Add/adjust tests to lock expected query semantics and prevent drift.
## 3. Maintainability Hygiene
- [x] 3.1 Normalize type annotations in touched cache/service modules to one consistent style.
- [x] 3.2 Replace high-frequency magic numbers with named constants or env-driven config in touched modules.
- [x] 3.3 Confirm existing login/API rate-limit and bool parser utilities remain centralized without new duplication.
## 4. Verification and Documentation
- [x] 4.1 Run targeted backend tests for resource cache, equipment cache, health/admin, and route behavior.
- [x] 4.2 Update `README.md` and `README.mdj` with round-4 hardening notes.


@@ -0,0 +1,2 @@
schema: spec-driven
created: 2026-02-08


@@ -0,0 +1,65 @@
## Context
The previous round completed the bulk of the P0/P1/P2 refactor, but code review still shows several residual high-risk points:
- `LDAP_API_URL` has no scheme/host guard, a configurable SSRF risk.
- The process-level DataFrame cache uses TTL only, with no capacity bound.
- Circuit breaker state transitions write logs while holding the lock, risking contention amplification.
- Global security headers are not yet emitted uniformly.
- Pagination parameters still lack lower-bound validation.
These issues span `app/core/services/routes/tests` and amount to cross-module security and stability fixes.
## Goals / Non-Goals
**Goals:**
- Establish testable minimum defenses for the LDAP endpoint, HTTP response headers, and input boundaries.
- Give the process-level cache bounded capacity and predictable eviction behavior.
- Reduce internal lock-contention risk in the circuit breaker so slow handlers cannot amplify blocking.
- Keep the single port, existing API contracts, and frontend interaction semantics unchanged.
**Non-Goals:**
- No full WAF/zero-trust architecture.
- No rewrite of the existing cache architecture onto an external cache service.
- No changes to report features or page flows.
## Decisions
1. **LDAP URL startup validation (fail-fast)**
   - Decision: validate `LDAP_API_URL` during `auth_service` startup, restricting it to `https` and an allowlist of hosts (configured via env); on mismatch, disable the LDAP authentication path and log the error.
   - Rationale: closes the configuration-driven SSRF risk with minimal change and does not affect local auth mode.
2. **Bounded ProcessLevelCache**
   - Decision: add `max_size` and LRU eviction (`OrderedDict`) to `ProcessLevelCache`, evicting the oldest key on `set`.
   - Rationale: keeps TTL behavior while preventing long-lived accumulation of high-cardinality keys.
3. **Circuit breaker logging outside the lock**
   - Decision: `_transition_to` updates state and assembles the log message inside the lock; the actual logger call moves outside the lock.
   - Rationale: shortens the lock-held section so slow I/O handlers cannot block other request paths.
4. **Unified global security-header injection**
   - Decision: add `CSP`, `X-Frame-Options`, `X-Content-Type-Options`, and `Referrer-Policy` in `app.after_request`, plus `HSTS` in production.
   - Rationale: a centralized policy covers all pages and APIs and reduces the chance of gaps.
5. **Consistent pagination bounds**
   - Decision: apply uniform `max(1, min(...))` boundary handling to `page` and `page_size`.
   - Rationale: prevents negative or extreme values from causing unnecessary load and unexpected behavior.
## Risks / Trade-offs
- **[Risk] An incomplete LDAP allowlist breaks login** → **Mitigation:** provide clear error messages and local-auth fallback guidance.
- **[Risk] A cache bound that is too small lowers the hit rate** → **Mitigation:** make `max_size` configurable, start with a conservative default, and watch telemetry.
- **[Risk] An overly strict CSP breaks existing inline scripts** → **Mitigation:** start with `default-src 'self'` and a compatible policy; fine-tune with nonces/allowlists if needed.
- **[Risk] Behavior changes trigger test regressions** → **Mitigation:** add unit/integration tests covering each fix.
## Migration Plan
1. Land the backend fixes first (auth/cache/circuit breaker/app headers/routes).
2. Add tests (LDAP validation, LRU, logging outside the lock, headers, pagination bounds).
3. Run the existing health checks and the key integration tests.
4. Update the security and stability sections of README/README.mdj.
5. If compatibility issues appear after deployment, the LDAP host allowlist and CSP details can be relaxed temporarily via env.
## Open Questions
- Does the LDAP host allowlist need multiple domains per environment (e.g. intranet + DR site)?
- Should CSP move to strict nonce-based mode immediately, or stay on the compatible policy first?


@@ -0,0 +1,40 @@
## Why
The previous round completed the core stability refactor, but several high-priority risks (LDAP URL validation, unbounded cache growth, circuit breaker logging while holding its lock, security-header gaps, pagination lower-bound validation) remain open. These issues accumulate availability and security risk under long-running operation and malicious input, so they need to be closed within the same round.
## What Changes
- Add startup validation of the LDAP API base URL (restricted to `https` and allowlisted hosts) to remove a controllable SSRF target.
- Add `max_size` and LRU eviction to the process-level cache to prevent unbounded memory growth from high-cardinality keys.
- Rework the circuit breaker state-transition flow to avoid writing logs while the lock is held.
- Add global security headers (CSP, X-Frame-Options, X-Content-Type-Options, Referrer-Policy, HSTS).
- Add pagination lower-bound validation so negative values and unreasonable page sizes never reach the query flow.
- Add tests and documentation updates for the fixes above, keeping the single port and existing frontend interaction semantics unchanged.
## Capabilities
### New Capabilities
- `security-surface-hardening`: defines the minimum defenses for the remaining security surface (SSRF protection, security headers, input boundary validation).
### Modified Capabilities
- `cache-observability-hardening`: extends cache-governance requirements with bounded process-level cache capacity and an eviction policy.
- `runtime-resilience-recovery`: adds the circuit breaker lock-contention fix and compatibility requirements between security headers and operational diagnostic responses.
## Impact
- Affected code:
- `src/mes_dashboard/services/auth_service.py`
- `src/mes_dashboard/core/cache.py`
- `src/mes_dashboard/services/resource_cache.py`
- `src/mes_dashboard/core/circuit_breaker.py`
- `src/mes_dashboard/app.py`
- `src/mes_dashboard/routes/wip_routes.py`
- `tests/`
- `README.md`, `README.mdj`
- APIs:
- `/health`, `/health/deep`
- `/api/wip/detail/<workcenter>`
- `/admin/login` (indirectly affected: LDAP base URL validation)
- Operational behavior:
- Keep the single port and the existing report UI flow.
- Strengthen the security and stability defenses without changing existing feature semantics.


@@ -0,0 +1,12 @@
## ADDED Requirements
### Requirement: Process-Level Cache SHALL Use Bounded Capacity with Deterministic Eviction
Process-level parsed-data caches MUST enforce a configurable maximum key capacity and use deterministic eviction behavior when capacity is exceeded.
#### Scenario: Cache capacity reached
- **WHEN** a new cache entry is inserted and key capacity is at limit
- **THEN** cache MUST evict entries according to defined policy before storing the new key
#### Scenario: Repeated access updates recency
- **WHEN** an existing cache key is read or overwritten
- **THEN** eviction order MUST reflect recency semantics so hot keys are retained preferentially


@@ -0,0 +1,12 @@
## ADDED Requirements
### Requirement: Circuit Breaker State Transitions SHALL Avoid Lock-Held Logging
Circuit breaker state transitions MUST avoid executing logger I/O while internal state locks are held.
#### Scenario: State transition occurs
- **WHEN** circuit breaker transitions between CLOSED, OPEN, or HALF_OPEN
- **THEN** lock-protected section MUST complete state mutation before emitting transition log output
#### Scenario: Slow log handler under load
- **WHEN** logger handlers are slow or blocked
- **THEN** circuit breaker lock contention MUST remain bounded and MUST NOT serialize unrelated request paths behind logging latency
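The required split — mutate under the lock, log after release — reduces to a small pattern; the `CircuitBreaker` below is a minimal sketch of the `_transition_to` decision, not the project's actual class:

```python
import logging
import threading

logger = logging.getLogger("circuit")

class CircuitBreaker:
    """Minimal breaker whose transitions log outside the state lock."""

    def __init__(self):
        self._lock = threading.Lock()
        self.state = "CLOSED"

    def _transition_to(self, new_state, reason):
        with self._lock:                      # lock covers mutation only
            old_state = self.state
            self.state = new_state
            log_line = f"circuit {old_state} -> {new_state}: {reason}"
        logger.warning(log_line)              # logger I/O with lock released
        return log_line
```

A slow or blocked log handler now stalls only the transitioning thread, never other threads waiting on the state lock.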


@@ -0,0 +1,34 @@
## ADDED Requirements
### Requirement: LDAP Authentication Endpoint Configuration SHALL Be Strictly Validated
The system MUST validate LDAP authentication endpoint configuration before use, including HTTPS scheme enforcement and host allowlist checks.
#### Scenario: Invalid LDAP URL configuration detected
- **WHEN** `LDAP_API_URL` is missing, non-HTTPS, or points to a host outside the configured allowlist
- **THEN** the service MUST reject LDAP authentication calls and emit actionable diagnostics without sending credentials to that endpoint
#### Scenario: Valid LDAP URL configuration accepted
- **WHEN** `LDAP_API_URL` uses HTTPS and host is allowlisted
- **THEN** LDAP authentication requests MAY proceed with normal timeout and error handling behavior
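The scheme and allowlist checks can be expressed as one small pure function; `validate_ldap_api_url` is an assumed name, and the `(ok, reason)` return shape is illustrative:

```python
from urllib.parse import urlsplit

def validate_ldap_api_url(url, allowed_hosts):
    """Fail-fast check for LDAP_API_URL: HTTPS only, allowlisted host.

    Returns (ok, reason); callers disable the LDAP path and log the
    reason instead of ever sending credentials to a rejected endpoint.
    """
    if not url:
        return False, "LDAP_API_URL is not configured"
    parts = urlsplit(url)
    if parts.scheme != "https":
        return False, f"scheme {parts.scheme!r} rejected: HTTPS required"
    host = (parts.hostname or "").lower()
    if host not in {h.lower() for h in allowed_hosts}:
        return False, f"host {host!r} not in allowlist"
    return True, "ok"
```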
### Requirement: Security Response Headers SHALL Be Applied Globally
All HTTP responses MUST include baseline security headers suitable for dashboard and API traffic.
#### Scenario: Standard response emitted
- **WHEN** any route returns a response
- **THEN** response MUST include `Content-Security-Policy`, `X-Frame-Options`, `X-Content-Type-Options`, and `Referrer-Policy`
#### Scenario: Production transport hardening
- **WHEN** runtime environment is production
- **THEN** response MUST include `Strict-Transport-Security`
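Against a plain headers mapping, the baseline policy looks like this sketch (the concrete header values are assumptions; a real CSP may need nonces or allowlist entries):

```python
def apply_security_headers(headers, production=False):
    """Set baseline security headers on a response-headers mapping.

    setdefault keeps any route-specific overrides; HSTS is added only
    in production, where traffic is expected to be HTTPS-only.
    """
    headers.setdefault("Content-Security-Policy", "default-src 'self'")
    headers.setdefault("X-Frame-Options", "DENY")
    headers.setdefault("X-Content-Type-Options", "nosniff")
    headers.setdefault("Referrer-Policy", "same-origin")
    if production:
        headers.setdefault(
            "Strict-Transport-Security", "max-age=31536000; includeSubDomains"
        )
    return headers
```

In a Flask app this would be wired once through an `after_request` hook that calls `apply_security_headers(response.headers, production=...)` and returns the response.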
### Requirement: Pagination Input Boundaries SHALL Be Enforced
Endpoints accepting pagination parameters MUST enforce lower and upper bounds before query execution.
#### Scenario: Negative or zero pagination inputs
- **WHEN** client sends `page <= 0` or `page_size <= 0`
- **THEN** server MUST normalize values to minimum supported bounds
#### Scenario: Excessive page size requested
- **WHEN** client sends `page_size` above configured maximum
- **THEN** server MUST clamp to maximum supported page size
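Both scenarios reduce to a clamp helper of this shape (`clamp_pagination` and its default bounds are illustrative):

```python
def clamp_pagination(page, page_size, max_page_size=200, default_page_size=50):
    """Normalize pagination inputs before they reach the query layer.

    Non-numeric input falls back to defaults; numeric input is clamped
    to [1, ...] for page and [1, max_page_size] for page_size.
    """
    def to_int(raw, fallback):
        try:
            return int(raw)
        except (TypeError, ValueError):
            return fallback

    page = max(1, to_int(page, 1))
    page_size = max(1, min(to_int(page_size, default_page_size), max_page_size))
    return page, page_size
```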


@@ -0,0 +1,24 @@
## 1. LDAP Endpoint Hardening
- [x] 1.1 Add strict `LDAP_API_URL` validation (`https` + allowlisted hosts) in auth service initialization.
- [x] 1.2 Add tests for valid/invalid LDAP URL configurations and ensure unsafe URLs are rejected without outbound auth call.
## 2. Bounded Process Cache
- [x] 2.1 Extend `ProcessLevelCache` with configurable `max_size` and LRU eviction behavior.
- [x] 2.2 Wire bounded cache configuration for WIP/Resource process-level caches and add regression tests.
## 3. Circuit Breaker Lock Contention Reduction
- [x] 3.1 Refactor circuit breaker transition logging to execute outside lock-protected section.
- [x] 3.2 Add tests verifying transition logs are emitted while state mutation remains correct.
## 4. HTTP Security Headers and Input Boundary Validation
- [x] 4.1 Add global `after_request` security headers (CSP, frame, content-type, referrer, HSTS in production).
- [x] 4.2 Tighten pagination boundary handling (`page`/`page_size`) for WIP detail endpoint and add tests.
## 5. Validation and Documentation
- [x] 5.1 Run targeted backend/frontend tests plus benchmark smoke to confirm no behavior regression.
- [x] 5.2 Update `README.md` and `README.mdj` with round-2 security/stability hardening notes.


@@ -0,0 +1,33 @@
# api-safety-hygiene Specification
## Purpose
TBD - created by archiving change residual-hardening-round3. Update Purpose after archive.
## Requirements
### Requirement: Recursive Payload Cleaning MUST Enforce Depth Safety
Routes that normalize nested payloads MUST prevent unbounded recursion depth.
#### Scenario: Deeply nested response object
- **WHEN** NaN-cleaning helper receives deeply nested list/dict payload
- **THEN** cleaning logic MUST enforce max depth or iterative traversal and return safely without recursion failure
### Requirement: Filter Source Names MUST Be Configurable
Filter cache query sources MUST NOT rely on hardcoded view names only.
#### Scenario: Environment-specific view names
- **WHEN** deployment sets custom filter-source environment variables
- **THEN** filter cache loader MUST resolve and query configured view names
### Requirement: High-Cost APIs SHALL Apply Basic Rate Guardrails
High-cost read endpoints SHALL apply configurable request-rate guardrails to reduce abuse and accidental bursts.
#### Scenario: Burst traffic from same client
- **WHEN** a client exceeds configured request budget for guarded endpoints
- **THEN** endpoint SHALL return throttled response with clear retry guidance
### Requirement: Common Boolean Query Parsing SHALL Be Shared
Boolean query parsing in routes SHALL use shared helper behavior.
#### Scenario: Different routes parse include flags
- **WHEN** routes parse common boolean query parameters
- **THEN** parsing behavior MUST be consistent across routes via shared utility


@@ -0,0 +1,26 @@
# cache-indexed-query-acceleration Specification
## Purpose
TBD - created by archiving change p1-cache-query-efficiency. Update Purpose after archive.
## Requirements
### Requirement: Incremental Synchronization SHALL Use Versioned Watermarks
For heavy non-full-snapshot datasets, cache refresh SHALL support incremental synchronization keyed by stable version or watermark boundaries.
#### Scenario: Incremental refresh cycle
- **WHEN** source data version indicates partial changes since last sync
- **THEN** cache update logic MUST fetch and merge only changed partitions while preserving correctness guarantees
### Requirement: Query Paths SHALL Use Indexed Access for High-Frequency Filters
Query execution over cached data SHALL use prebuilt indexes for known high-frequency filter columns.
#### Scenario: Filtered report query
- **WHEN** request filters target indexed fields
- **THEN** result selection MUST avoid full dataset scans and maintain existing response contract
### Requirement: Business-Mandated Full-Table Caches SHALL Be Preserved for Resource and WIP
The system SHALL continue to maintain full-table cache behavior for `resource` and `wip` domains.
#### Scenario: Resource or WIP cache refresh
- **WHEN** cache update runs for `resource` or `wip`
- **THEN** the updater MUST retain full-table snapshot semantics and MUST NOT switch these domains to partial-only cache mode


@@ -36,3 +36,53 @@ The system MUST define alert thresholds for sustained degraded state, repeated w
- **WHEN** degraded status persists beyond configured duration
- **THEN** the monitoring contract MUST classify the service as alert-worthy with actionable context
### Requirement: Cache Telemetry SHALL Include Memory Amplification Signals
Operational telemetry MUST expose cache-domain memory usage indicators and representation amplification factors, and MUST differentiate between authoritative data payload and derived/index helper structures.
#### Scenario: Deep health telemetry request after representation normalization
- **WHEN** operators inspect cache telemetry for resource or WIP domains
- **THEN** telemetry MUST include per-domain memory footprint, amplification indicators, and enough structure detail to verify that full-record duplication is not reintroduced
### Requirement: Efficiency Benchmarks SHALL Gate Cache Refactor Rollout
Cache/query efficiency changes MUST be validated against baseline latency and memory benchmarks before rollout.
#### Scenario: Pre-release validation
- **WHEN** cache refactor changes are prepared for deployment
- **THEN** benchmark results MUST demonstrate no regression beyond configured thresholds for P95 latency and memory usage
### Requirement: Process-Level Cache SHALL Use Bounded Capacity with Deterministic Eviction
Process-level parsed-data caches MUST enforce a configurable maximum key capacity and use deterministic eviction behavior when capacity is exceeded.
#### Scenario: Cache capacity reached
- **WHEN** a new cache entry is inserted and key capacity is at limit
- **THEN** cache MUST evict entries according to defined policy before storing the new key
#### Scenario: Repeated access updates recency
- **WHEN** an existing cache key is read or overwritten
- **THEN** eviction order MUST reflect recency semantics so hot keys are retained preferentially
### Requirement: Cache Publish MUST Preserve Previous Readable Snapshot on Failure
When refreshing full-table cache payloads, the system MUST avoid exposing partially published states to readers.
#### Scenario: Publish fails after payload serialization
- **WHEN** a cache refresh has prepared new payload but publish operation fails
- **THEN** previously published cache keys MUST remain readable and metadata MUST remain consistent with old snapshot
#### Scenario: Publish succeeds
- **WHEN** publish operation completes successfully
- **THEN** data payload and metadata keys MUST be visible as one coherent new snapshot
### Requirement: Process-Level Cache Slow Path SHALL Minimize Lock Hold Time
Large payload parsing MUST NOT happen inside long-held process cache locks.
#### Scenario: Cache miss under concurrent requests
- **WHEN** multiple requests hit process cache miss
- **THEN** parsing work SHALL happen outside lock-protected mutation section, and lock scope SHALL be limited to consistency check + commit
### Requirement: Process-Level Cache Policies MUST Stay Consistent Across Services
All service-local process caches MUST support bounded capacity with deterministic eviction.
#### Scenario: Realtime equipment cache growth
- **WHEN** realtime equipment process cache reaches configured capacity
- **THEN** entries MUST be evicted according to deterministic LRU behavior


@@ -24,3 +24,17 @@ Runbooks and deployment documentation MUST describe the same conda/systemd/watch
- **WHEN** an operator performs deploy, health check, and rollback from documentation
- **THEN** documented commands and paths MUST work without requiring venv-specific assumptions
### Requirement: Runtime Path Drift SHALL Be Detectable Before Service Start
Service startup checks MUST validate configured conda runtime paths across app, watchdog, and worker control scripts.
#### Scenario: Conda path mismatch detected
- **WHEN** startup validation finds runtime path inconsistency between configured units and scripts
- **THEN** service start MUST fail with actionable diagnostics instead of running with partial mismatch
### Requirement: Conda/Systemd Contract SHALL Be Versioned in Operations Docs
The documented runtime contract MUST include versioned path assumptions and verification commands.
#### Scenario: Operator verifies deployment contract
- **WHEN** operator follows runbook validation steps
- **THEN** commands MUST confirm active runtime paths match documented conda/systemd contract


@@ -50,3 +50,17 @@ Frontend matrix/filter computations SHALL produce deterministic selection and fi
- **WHEN** users toggle matrix cells across group, family, and resource rows
- **THEN** selected-state rendering and filtered equipment result sets MUST remain level-correct and reversible
### Requirement: Reusable Browser Compute Modules SHALL Power Report Derivations
Derived computations for report filters, KPI cards, chart series, and table projections SHALL be implemented through reusable frontend modules.
#### Scenario: Shared report derivation logic
- **WHEN** multiple report pages require equivalent data-shaping behavior
- **THEN** pages MUST consume shared compute modules instead of duplicating transformation logic per page
### Requirement: Browser Compute Shift SHALL Preserve Export and Field Contracts
Moving computations to frontend MUST preserve existing field naming and export column contracts.
#### Scenario: User exports report after frontend-side derivation
- **WHEN** transformed data is rendered and exported
- **THEN** exported field names and ordering MUST remain consistent with governed field contract definitions


@@ -0,0 +1,19 @@
# maintainability-type-and-constant-hygiene Specification
## Purpose
TBD - created by archiving change residual-hardening-round4. Update Purpose after archive.
## Requirements
### Requirement: Core Cache and Service Boundaries MUST Use Consistent Type Annotation Style
Core cache/service modules touched by this change SHALL use a consistent and explicit type-annotation style for public and internal helper boundaries.
#### Scenario: Reviewing updated cache/service modules
- **WHEN** maintainers inspect function signatures in affected modules
- **THEN** optional and collection types MUST follow a single consistent style and remain compatible with the project Python baseline
### Requirement: High-Frequency Magic Numbers MUST Be Replaced by Named Constants
Cache, throttling, and index-related numeric literals that control behavior MUST be extracted to named constants or env-configurable settings.
#### Scenario: Tuning cache/index behavior
- **WHEN** operators need to tune cache/index thresholds
- **THEN** they MUST find values in named constants or environment variables rather than scattered inline literals


@@ -0,0 +1,19 @@
# oracle-query-fragment-governance Specification
## Purpose
TBD - created by archiving change residual-hardening-round4. Update Purpose after archive.
## Requirements
### Requirement: Shared Oracle Query Fragments SHALL Have a Single Source of Truth
Cross-service Oracle query fragments for resource and equipment cache loading MUST be defined in a shared module and imported by service implementations.
#### Scenario: Update common table/view reference
- **WHEN** a common table or view name changes
- **THEN** operators and developers MUST be able to update one shared definition without editing duplicated SQL literals across services
### Requirement: Service Queries MUST Preserve Existing Columns and Semantics
Services consuming shared Oracle query fragments SHALL preserve existing selected columns, filters, and downstream payload behavior.
#### Scenario: Resource and equipment cache refresh after refactor
- **WHEN** cache services execute queries via shared fragments
- **THEN** resulting payload structure MUST remain compatible with existing aggregation and API contracts


@@ -0,0 +1,26 @@
# resource-cache-representation-normalization Specification
## Purpose
TBD - created by archiving change residual-hardening-round4. Update Purpose after archive.
## Requirements
### Requirement: Resource Derived Index MUST Avoid Full Record Duplication
Resource derived index SHALL use lightweight row-position references instead of storing full duplicated record payloads alongside the process DataFrame cache.
#### Scenario: Build index from cached DataFrame
- **WHEN** resource cache data is parsed from Redis into process-level DataFrame
- **THEN** the derived index MUST store position-based references and metadata without a second full records copy
### Requirement: Resource Query APIs SHALL Preserve Existing Response Contract
Resource query APIs MUST keep existing output fields and semantics after index representation normalization.
#### Scenario: Read all resources after normalization
- **WHEN** callers request all resources or filtered resource lists
- **THEN** the returned payload MUST remain field-compatible with pre-normalization responses
### Requirement: Cache Invalidation MUST Keep Index/Data Coherent
The system SHALL invalidate and rebuild DataFrame/index representations atomically at cache refresh boundaries.
#### Scenario: Redis-backed cache refresh completes
- **WHEN** a new resource cache snapshot is published
- **THEN** stale index references MUST be invalidated before subsequent reads use refreshed DataFrame data


@@ -48,3 +48,47 @@ The system MUST expose machine-readable resilience thresholds, restart-churn ind
#### Scenario: Admin status includes restart churn summary
- **WHEN** operators call `/admin/api/system-status` or `/admin/api/worker/status`
- **THEN** responses MUST include bounded restart history summary within a configured time window and indicate whether churn threshold is exceeded
### Requirement: Recovery Recommendations SHALL Reflect Self-Healing Policy State
Health and admin resilience payloads MUST expose whether automated recovery is allowed, cooling down, or blocked by churn policy.
#### Scenario: Operator inspects degraded state
- **WHEN** `/health` or `/admin/api/worker/status` is requested during degradation
- **THEN** response MUST include policy state, cooldown remaining time, and next recommended action
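The resilience section of such a payload might be derived as below; the field names are illustrative assumptions, not the actual response contract.

```python
import time

def recovery_policy_payload(blocked: bool, cooldown_until: float, now=None) -> dict:
    """Build the (hypothetical) policy-state fragment for /health responses."""
    now = now if now is not None else time.time()
    cooldown_remaining = max(0, int(cooldown_until - now))
    if blocked:
        state, action = "blocked", "manual restart with explicit acknowledgement"
    elif cooldown_remaining > 0:
        state, action = "cooling_down", "wait for cooldown to elapse"
    else:
        state, action = "allowed", "automatic recovery will proceed"
    return {
        "auto_recovery_state": state,
        "cooldown_remaining_seconds": cooldown_remaining,
        "next_recommended_action": action,
    }
```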
### Requirement: Manual Recovery Override SHALL Be Explicit and Controlled
Manual restart actions MUST bypass automatic block only through authenticated operator pathways with explicit acknowledgement.
#### Scenario: Churn-blocked state with manual override request
- **WHEN** authorized admin requests manual restart while auto-recovery is blocked
- **THEN** system MUST execute controlled restart path and log the override context for auditability
### Requirement: Circuit Breaker State Transitions SHALL Avoid Lock-Held Logging
Circuit breaker state transitions MUST avoid executing logger I/O while internal state locks are held.
#### Scenario: State transition occurs
- **WHEN** circuit breaker transitions between CLOSED, OPEN, or HALF_OPEN
- **THEN** lock-protected section MUST complete state mutation before emitting transition log output
#### Scenario: Slow log handler under load
- **WHEN** logger handlers are slow or blocked
- **THEN** circuit breaker lock contention MUST remain bounded and MUST NOT serialize unrelated request paths behind logging latency
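The lock discipline above amounts to capturing the transition under the lock and deferring logger I/O until after release, as in this minimal sketch (a simplified breaker, not the project's actual implementation):

```python
import logging
import threading

logger = logging.getLogger("circuit_breaker")

class CircuitBreaker:
    """Mutate state under the lock; emit the transition log only after release."""
    def __init__(self):
        self._lock = threading.Lock()
        self.state = "CLOSED"

    def transition(self, new_state: str) -> str:
        with self._lock:
            # Lock-held section: state mutation only, no I/O.
            old_state, self.state = self.state, new_state
        # Logger I/O outside the lock: a slow or blocked handler cannot
        # serialize unrelated request paths behind logging latency.
        logger.info("circuit breaker %s -> %s", old_state, new_state)
        return old_state
```

Capturing `old_state` inside the critical section keeps the log line accurate even if another transition races in before the log is emitted.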
### Requirement: Health Endpoints SHALL Use Short Internal Memoization
Health and deep-health computation SHALL use a short-lived internal cache to prevent probe storms from amplifying backend load.
#### Scenario: Frequent monitor scrapes
- **WHEN** health endpoints are called repeatedly within a small window
- **THEN** the service SHALL return the memoized payload for up to 5 seconds in non-testing environments
#### Scenario: Testing mode
- **WHEN** app is running in testing mode
- **THEN** health endpoint memoization MUST be bypassed to preserve deterministic tests
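A short-TTL memo with a testing bypass can be sketched as follows (class name and defaults are hypothetical; the real service would key this per endpoint):

```python
import time

class HealthMemo:
    """Short-lived memoization for health payloads; bypassed in testing mode."""
    def __init__(self, compute, ttl: float = 5.0, testing: bool = False):
        self._compute = compute
        self._ttl = ttl
        self._testing = testing
        self._cached = None
        self._cached_at = -float("inf")  # forces a compute on first call

    def get(self, now=None):
        now = now if now is not None else time.monotonic()
        if self._testing or now - self._cached_at > self._ttl:
            self._cached = self._compute()
            self._cached_at = now
        return self._cached
```

Under a probe storm, the expensive deep-health computation runs at most once per TTL window, while `testing=True` keeps every test call deterministic.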
### Requirement: Logs MUST Redact Connection Secrets
Runtime logs MUST avoid exposing DB connection credentials.
#### Scenario: Connection string appears in log message
- **WHEN** a log message contains DB URL credentials
- **THEN** logger output MUST redact password and sensitive userinfo before emission
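Redaction can be done with a small filter over the URL userinfo before the record is emitted; this sketch (regex and function name are illustrative) masks the password while keeping the username and host for diagnostics.

```python
import re

# Matches the userinfo portion of a URL (scheme://user:password@host),
# keeping the username and masking only the password.
_DB_URL_CREDS = re.compile(r"(://[^/\s:@]+):([^@\s]+)@")

def redact_db_url(message: str) -> str:
    """Replace the password in any embedded connection URL before emission."""
    return _DB_URL_CREDS.sub(r"\1:***@", message)

print(redact_db_url("connect failed: mysql://app:s3cret@db:3306/main"))
# connect failed: mysql://app:***@db:3306/main
```

In practice this would sit in a `logging.Filter` or formatter so every handler benefits, rather than relying on call sites to remember it.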


@@ -0,0 +1,38 @@
# security-surface-hardening Specification
## Purpose
TBD - created by archiving change security-stability-hardening-round2. Update Purpose after archive.
## Requirements
### Requirement: LDAP Authentication Endpoint Configuration SHALL Be Strictly Validated
The system MUST validate LDAP authentication endpoint configuration before use, including HTTPS scheme enforcement and host allowlist checks.
#### Scenario: Invalid LDAP URL configuration detected
- **WHEN** `LDAP_API_URL` is missing, non-HTTPS, or points to a host outside the configured allowlist
- **THEN** the service MUST reject LDAP authentication calls and emit actionable diagnostics without sending credentials to that endpoint
#### Scenario: Valid LDAP URL configuration accepted
- **WHEN** `LDAP_API_URL` uses HTTPS and host is allowlisted
- **THEN** LDAP authentication requests MAY proceed with normal timeout and error handling behavior
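The validation gate can be a single function called before any credentials leave the process; the allowlist host below is a placeholder, not a real endpoint.

```python
from urllib.parse import urlparse

ALLOWED_LDAP_HOSTS = {"auth.example.internal"}  # hypothetical allowlist

def validate_ldap_api_url(url: str) -> str:
    """Return the URL if HTTPS and allowlisted; raise before credentials are sent."""
    if not url:
        raise ValueError("LDAP_API_URL is not configured")
    parsed = urlparse(url)
    if parsed.scheme != "https":
        raise ValueError(f"LDAP_API_URL must use https, got {parsed.scheme!r}")
    if parsed.hostname not in ALLOWED_LDAP_HOSTS:
        raise ValueError(f"LDAP host {parsed.hostname!r} is not in the allowlist")
    return url
```

Raising with a specific message satisfies the "actionable diagnostics" scenario while guaranteeing no credential ever reaches a misconfigured endpoint.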
### Requirement: Security Response Headers SHALL Be Applied Globally
All HTTP responses MUST include baseline security headers suitable for dashboard and API traffic.
#### Scenario: Standard response emitted
- **WHEN** any route returns a response
- **THEN** response MUST include `Content-Security-Policy`, `X-Frame-Options`, `X-Content-Type-Options`, and `Referrer-Policy`
#### Scenario: Production transport hardening
- **WHEN** runtime environment is production
- **THEN** response MUST include `Strict-Transport-Security`
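A framework-agnostic sketch of the header policy (in Flask this would run in an `after_request` hook); the specific header values are assumptions to be tuned per deployment.

```python
def apply_security_headers(headers: dict, production: bool = False) -> dict:
    """Add baseline security headers to a response; HSTS only in production."""
    headers.setdefault("Content-Security-Policy", "default-src 'self'")
    headers.setdefault("X-Frame-Options", "DENY")
    headers.setdefault("X-Content-Type-Options", "nosniff")
    headers.setdefault("Referrer-Policy", "strict-origin-when-cross-origin")
    if production:
        # HSTS must only be sent over HTTPS-terminated production traffic.
        headers.setdefault("Strict-Transport-Security",
                           "max-age=31536000; includeSubDomains")
    return headers
```

Using `setdefault` lets individual routes override a header deliberately without the global hook clobbering it.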
### Requirement: Pagination Input Boundaries SHALL Be Enforced
Endpoints accepting pagination parameters MUST enforce lower and upper bounds before query execution.
#### Scenario: Negative or zero pagination inputs
- **WHEN** client sends `page <= 0` or `page_size <= 0`
- **THEN** server MUST normalize values to minimum supported bounds
#### Scenario: Excessive page size requested
- **WHEN** client sends `page_size` above configured maximum
- **THEN** server MUST clamp to maximum supported page size
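Both scenarios reduce to a simple clamp applied before query construction; the bounds below are illustrative defaults.

```python
def clamp_pagination(page: int, page_size: int,
                     min_page_size: int = 1, max_page_size: int = 100):
    """Enforce lower and upper pagination bounds before query execution."""
    page = max(page, 1)                        # page <= 0 -> first page
    page_size = max(page_size, min_page_size)  # page_size <= 0 -> minimum
    page_size = min(page_size, max_page_size)  # excessive -> clamp to maximum
    return page, page_size
```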


@@ -0,0 +1,26 @@
# worker-self-healing-governance Specification
## Purpose
TBD - created by archiving change p2-ops-self-healing-runbook. Update Purpose after archive.
## Requirements
### Requirement: Automated Worker Recovery SHALL Use Bounded Policy Guards
Automated worker restart behavior MUST enforce cooldown periods and bounded restart attempts within a configurable time window.
#### Scenario: Repeated worker degradation within short window
- **WHEN** degradation events exceed configured restart-attempt budget
- **THEN** automated restarts MUST pause and surface a blocked-recovery signal for operator intervention
### Requirement: Restart-Churn Protection SHALL Prevent Recovery Storms
The runtime MUST classify restart churn and prevent uncontrolled restart loops.
#### Scenario: Churn threshold exceeded
- **WHEN** restart count crosses churn threshold in active window
- **THEN** watchdog MUST enter guarded mode and require explicit manual override before further restart attempts
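The two requirements above (cooldown plus bounded attempts, and churn-triggered guarded mode) can be combined into one policy object; this is a sketch with hypothetical names and thresholds, not the watchdog's actual implementation.

```python
import time

class RestartPolicy:
    """Cooldown + bounded attempts in a window; guarded mode needs manual override."""
    def __init__(self, cooldown=60, max_attempts=3, window=600):
        self.cooldown = cooldown
        self.max_attempts = max_attempts
        self.window = window
        self._attempts = []
        self._last_restart = None
        self.guarded = False  # churn protection engaged

    def allow_restart(self, now=None, manual_override=False):
        now = now if now is not None else time.time()
        # Drop attempts that fell out of the active window.
        self._attempts = [t for t in self._attempts if now - t <= self.window]
        if self.guarded and not manual_override:
            return False  # blocked until an operator explicitly overrides
        if (self._last_restart is not None
                and now - self._last_restart < self.cooldown
                and not manual_override):
            return False  # still cooling down
        if len(self._attempts) >= self.max_attempts and not manual_override:
            self.guarded = True  # churn threshold crossed: enter guarded mode
            return False
        self._attempts.append(now)
        self._last_restart = now
        return True
```

An authenticated admin path would call `allow_restart(manual_override=True)` and log the override context, matching the manual-override requirement above.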
### Requirement: Recovery Decisions SHALL Be Audit-Ready
Every auto-recovery decision and manual override action MUST be recorded with structured metadata.
#### Scenario: Worker restart decision emitted
- **WHEN** system executes or denies a restart action
- **THEN** structured logs/events MUST include reason, thresholds, actor/source, and resulting state
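One minimal shape for such an audit record, emitted as a single structured log line (all field names here are illustrative assumptions about the eventual schema):

```python
import json
import logging

logger = logging.getLogger("recovery_audit")

def emit_recovery_audit(action, allowed, reason, actor, thresholds, resulting_state):
    """Emit one structured record per executed or denied restart decision."""
    record = {
        "event": "worker_recovery_decision",
        "action": action,                  # e.g. "restart"
        "allowed": allowed,                # executed vs denied
        "reason": reason,                  # why the decision was made
        "actor": actor,                    # "watchdog" or an operator identity
        "thresholds": thresholds,          # policy values in force at decision time
        "resulting_state": resulting_state,
    }
    logger.info(json.dumps(record))
    return record
```

Keeping the record machine-parseable (one JSON object per line) makes churn analysis and override audits straightforward to query later.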