chore: finalize vite migration hardening and archive openspec changes

This commit is contained in:
beabigegg
2026-02-08 20:03:36 +08:00
parent b56e80381b
commit c8e225101e
119 changed files with 6547 additions and 1301 deletions

View File

@@ -0,0 +1,2 @@
schema: spec-driven
created: 2026-02-08

View File

@@ -0,0 +1,46 @@
## Context
The current architecture already supports single-port Gunicorn runtime, circuit-breaker-aware degraded responses, and watchdog-assisted recovery. However, critical security and lifecycle controls are uneven: production startup can still fall back to a weak secret key, CSRF is not enforced globally, and background resources are not fully registered in a single shutdown lifecycle. These gaps are operationally risky under pool pressure or restart churn.
## Goals / Non-Goals
**Goals:**
- Make production startup fail fast when required security secrets are missing.
- Enforce CSRF validation for all state-changing endpoints without breaking existing frontend flow.
- Make worker/app shutdown deterministic by stopping all background workers and shared clients.
- Keep degraded responses for pool exhaustion and circuit-open states stable and retry-aware.
- Isolate health probe connectivity from main request pool contention.
**Non-Goals:**
- Replacing LDAP provider or redesigning the full authentication architecture.
- Full CSP rollout across all templates in this change.
- Changing URL structure, page IA, or single-port deployment topology.
## Decisions
1. **Production secret-key guard at startup**
- Decision: enforce `SECRET_KEY` presence/strength in non-development modes and abort startup when invalid.
- Rationale: prevents silent insecure deployment.
2. **Unified CSRF contract across form + JSON flows**
- Decision: issue CSRF token from server session, validate hidden form field for HTML forms and `X-CSRF-Token` for JSON POST/PUT/PATCH/DELETE.
- Rationale: maintains current frontend behavior while covering non-form APIs.
3. **Centralized shutdown registry**
- Decision: register explicit shutdown hooks that call cache updater stop, realtime sync stop, Redis close, and DB dispose in bounded order.
- Rationale: avoids thread/client leaks during worker recycle and controlled reload.
4. **Health probe pool isolation**
- Decision: use a dedicated lightweight DB health engine/pool for `/health` checks.
- Rationale: prevents health endpoint from being blocked by request-pool exhaustion, improving observability fidelity.
5. **Template-safe JS serialization**
- Decision: replace HTML-escaped interpolation in JS string contexts with `tojson` serialization.
- Rationale: avoids context-mismatch injection edge cases.
## Risks / Trade-offs
- **[Risk] CSRF rollout may break undocumented clients** → **Mitigation:** provide opt-in transition flag and explicit error messaging during rollout.
- **[Risk] Strict startup secret validation can block misconfigured environments** → **Mitigation:** provide clear startup diagnostics and `.env.example` updates.
- **[Risk] Additional shutdown hooks can prolong worker exit** → **Mitigation:** bounded timeouts and idempotent stop handlers.
- **[Risk] Dedicated health pool introduces extra DB connections** → **Mitigation:** fixed minimal size and short timeout.

View File

@@ -0,0 +1,40 @@
## Why
The Vite migration is functionally complete, but production runtime still has high-risk gaps in security baseline and worker lifecycle cleanup. Addressing these now prevents avoidable outages, authentication bypass risk, and unstable degradation behavior under pool pressure.
## What Changes
- Enforce production-safe startup security defaults (no weak SECRET_KEY fallback in non-development environments).
- Add first-class CSRF protection for admin forms and state-changing JSON APIs.
- Harden degradation behavior for pool exhaustion with consistent retry/backoff contract and isolated health probing.
- Ensure background workers and shared clients (cache updater, realtime sync, Redis) are explicitly stopped on worker/app shutdown.
- Fix template-to-JavaScript variable serialization in hold-detail fallback script.
## Capabilities
### New Capabilities
- `security-baseline-hardening`: Define mandatory secret/session/CSRF/XSS-safe baseline for production runtime.
### Modified Capabilities
- `runtime-resilience-recovery`: Strengthen shutdown lifecycle and degraded-response behavior for pool pressure scenarios.
## Impact
- Affected code:
- `src/mes_dashboard/app.py`
- `src/mes_dashboard/core/database.py`
- `src/mes_dashboard/core/cache_updater.py`
- `src/mes_dashboard/core/redis_client.py`
- `src/mes_dashboard/routes/health_routes.py`
- `src/mes_dashboard/routes/auth_routes.py`
- `src/mes_dashboard/templates/hold_detail.html`
- `gunicorn.conf.py`
- `tests/`
- APIs:
- `/health`
- `/health/deep`
- `/admin/login`
- state-changing `/api/*` endpoints
- Operational behavior:
- Keep single-port deployment model unchanged.
- Improve degraded-state stability and startup safety gates.

View File

@@ -0,0 +1,24 @@
## MODIFIED Requirements
### Requirement: Pool Exhaustion MUST Return Retry-Aware Degraded Responses
The system MUST return explicit degraded responses for connection pool exhaustion, including stable machine-readable retry metadata and HTTP retry hints.
#### Scenario: Pool exhausted under load
- **WHEN** concurrent requests exceed available database connections and pool wait timeout is reached
- **THEN** the API MUST return `DB_POOL_EXHAUSTED` with `retry_after_seconds` metadata and a `Retry-After` header instead of a generic 500 failure
## ADDED Requirements
### Requirement: Runtime Shutdown SHALL Cleanly Stop Background Services
Worker/app shutdown MUST stop long-lived background services and shared clients in deterministic order.
#### Scenario: Worker exits during recycle or graceful reload
- **WHEN** Gunicorn worker shutdown hooks are triggered
- **THEN** cache updater, realtime equipment sync worker, Redis client, and DB engine resources MUST be stopped/disposed without orphan threads
### Requirement: Health Probing SHALL Remain Available During Request-Pool Saturation
Health checks MUST avoid depending solely on the same request pool used by business APIs.
#### Scenario: Request pool saturation
- **WHEN** the main database request pool is exhausted
- **THEN** `/health` and `/health/deep` MUST still provide timely degraded status using isolated probe connectivity
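The pool-exhaustion contract above can be sketched as a framework-agnostic response builder. The `DB_POOL_EXHAUSTED` code, `retry_after_seconds` field, and `Retry-After` header come from the requirement; the helper name and the default backoff value are assumptions.

```python
import json

def pool_exhausted_response(retry_after_seconds: int = 5):
    """Build a retry-aware 503 degraded response instead of a generic 500 (sketch)."""
    body = {
        "error": "DB_POOL_EXHAUSTED",
        "retry_after_seconds": retry_after_seconds,
    }
    headers = {
        "Retry-After": str(retry_after_seconds),  # HTTP-level retry hint
        "Content-Type": "application/json",
    }
    return 503, headers, json.dumps(body)
```

In practice this would be wired into the API error path that catches the pool wait-timeout exception, so all endpoints emit the same machine-readable metadata.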

View File

@@ -0,0 +1,29 @@
## ADDED Requirements
### Requirement: Production Startup SHALL Reject Weak Session Secrets
The system MUST reject startup in non-development environments when `SECRET_KEY` is missing or configured with known insecure default values.
#### Scenario: Missing production secret key
- **WHEN** runtime starts with `FLASK_ENV` not equal to `development` and no secure secret key is configured
- **THEN** application startup MUST fail fast with an explicit configuration error
### Requirement: State-Changing Endpoints SHALL Enforce CSRF Validation
All state-changing endpoints that rely on cookie-based authentication MUST enforce CSRF token validation.
#### Scenario: Missing or invalid CSRF token
- **WHEN** a POST/PUT/PATCH/DELETE request is sent without a valid CSRF token
- **THEN** the server MUST reject the request with a client error and MUST NOT execute the mutation
### Requirement: Server-Rendered Values in JavaScript Context MUST Use Safe Serialization
Values inserted into inline JavaScript from templates MUST be serialized for JavaScript context safety.
#### Scenario: Hold reason rendered in fallback inline script
- **WHEN** server-side string values are embedded into script state payloads
- **THEN** template rendering MUST use JSON-safe serialization semantics to prevent script-context injection
### Requirement: Session Establishment SHALL Mitigate Fixation Risk
Successful admin login MUST rotate session identity material before granting authenticated privileges.
#### Scenario: Admin login success
- **WHEN** credentials are validated and admin session is created
- **THEN** session identity MUST be regenerated before storing authenticated user attributes

View File

@@ -0,0 +1,18 @@
## 1. Runtime Stability Hardening
- [x] 1.1 Add startup validation for `SECRET_KEY` and environment-aware secure defaults.
- [x] 1.2 Register centralized shutdown hooks to stop cache updater, realtime sync worker, Redis client, and DB engine.
- [x] 1.3 Isolate database health probing from request pool and keep degraded signal contract stable.
- [x] 1.4 Normalize pool-exhausted response metadata and retry headers across API error paths.
## 2. Security Baseline Enforcement
- [x] 2.1 Add CSRF token issuance/validation for form posts and JSON mutation endpoints.
- [x] 2.2 Update login flow to rotate session identity on successful authentication.
- [x] 2.3 Replace JS-context template interpolation in `hold_detail.html` with JSON-safe serialization.
## 3. Verification and Documentation
- [x] 3.1 Add tests for startup secret guard, CSRF rejection, and session-rotation behavior.
- [x] 3.2 Add lifecycle tests/validation for shutdown cleanup and health endpoint behavior under pool saturation.
- [x] 3.3 Update README/README.mdj runtime hardening sections and operator rollout notes.

View File

@@ -0,0 +1,2 @@
schema: spec-driven
created: 2026-02-08

View File

@@ -0,0 +1,46 @@
## Context
The migration delivered feature parity, but efficiency work remains: backend query paths still do broad copies and whole-frame recomputation even when only slices are needed. At the same time, business constraints explicitly require full-table caching for `resource` and `wip` because those datasets are intentionally small and frequently reused. This design optimizes around that constraint rather than removing it.
## Goals / Non-Goals
**Goals:**
- Keep `resource` and `wip` full-table caches intact.
- Reduce memory amplification from redundant cache representations.
- Replace repeated full merge/rebuild paths with incremental/indexed query plans where applicable.
- Increase reuse of browser-side compute modules for chart/table/filter/KPI derivations.
- Add measurable telemetry to verify latency and memory improvements.
**Non-Goals:**
- Rewriting all reporting endpoints to client-only mode.
- Removing Redis or existing layered cache strategy.
- Changing user-visible filter semantics or report outputs.
## Decisions
1. **Constrained cache strategy**
- Decision: retain full-table snapshots for `resource` and `wip`; optimize surrounding representations and derivation paths.
- Rationale: business-approved data-size profile and low complexity for frequent lookups.
2. **Incremental + indexed path for heavy derived datasets**
- Decision: add watermark/version-aware incremental refresh and per-column indexes for high-cardinality filters.
- Rationale: avoids repeated full recompute and lowers request tail latency.
3. **Canonical in-process structure**
- Decision: keep one canonical structure per cache domain and derive alternate views on demand.
- Rationale: reduces 2x/3x memory amplification from parallel representations.
4. **Frontend compute module expansion**
- Decision: extract reusable browser compute helpers for matrix/table/KPI transformations used across report pages.
- Rationale: shifts deterministic shaping work off backend and improves component reuse in Vite architecture.
5. **Benchmark-driven acceptance**
- Decision: add repeatable benchmark fixtures and telemetry thresholds as merge gates.
- Rationale: prevent subjective "performance improved" claims without measurable proof.
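Decision 2 (watermark-aware incremental refresh) can be sketched as follows; `fetch_changed` is an assumed callable yielding `(key, row)` pairs for the `(last_watermark, source_watermark]` range, and the flat-dict cache shape is illustrative.

```python
def incremental_refresh(cache: dict, last_watermark: int,
                        source_watermark: int, fetch_changed) -> int:
    """Merge only partitions changed since the last sync (sketch)."""
    if source_watermark <= last_watermark:
        return last_watermark              # nothing new; keep current snapshot
    for key, row in fetch_changed(last_watermark, source_watermark):
        cache[key] = row                   # partition-level merge, not full rebuild
    return source_watermark
```

Pairing this with periodic full reconciliation (as the risks section suggests) bounds any correctness drift from missed increments.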
## Risks / Trade-offs
- **[Risk] Incremental sync correctness drift** → **Mitigation:** version checksum validation and periodic full reconciliation jobs.
- **[Risk] Browser compute can increase client CPU on low-end devices** → **Mitigation:** bounded dataset chunking and fallback server aggregation path.
- **[Risk] Refactor introduces subtle field-contract regressions** → **Mitigation:** keep export/header contract tests and fixture comparisons.
- **[Risk] Telemetry overhead** → **Mitigation:** low-cost counters/histograms with sampling where needed.

View File

@@ -0,0 +1,36 @@
## Why
Current reporting workloads still spend unnecessary CPU and memory on repeated full-data merges, broad DataFrame copies, and duplicated cache representations. We need a focused efficiency phase that preserves the intentional full-table cache strategy for `resource` and `wip`, while reducing cost for other query paths and increasing frontend compute reuse.
## What Changes
- Introduce indexed/incremental cache synchronization for heavy report datasets that do not require full-table snapshots.
- Keep `resource` and `wip` as full-table cache by design, but reduce redundant in-process representations and copy overhead.
- Move additional derived calculations (chart/table/KPI/filter shaping) to reusable browser modules in Vite frontend.
- Add cache/query efficiency telemetry and repeatable benchmark gates to validate gains.
## Capabilities
### New Capabilities
- `cache-indexed-query-acceleration`: Define incremental refresh and indexed query contracts for non-full-snapshot datasets.
### Modified Capabilities
- `cache-observability-hardening`: Add memory-efficiency and cache-structure telemetry expectations.
- `frontend-compute-shift`: Expand browser-side reusable compute coverage for report interactions.
## Impact
- Affected code:
- `src/mes_dashboard/core/cache.py`
- `src/mes_dashboard/services/resource_cache.py`
- `src/mes_dashboard/services/realtime_equipment_cache.py`
- `src/mes_dashboard/services/wip_service.py`
- `src/mes_dashboard/routes/health_routes.py`
- `frontend/src/core/`
- `frontend/src/**/main.js`
- `tests/`
- APIs:
- read-heavy `/api/wip/*` and `/api/resource/*` endpoints (response contract unchanged)
- Operational behavior:
- Preserve current `resource` and `wip` full-table caching strategy.
- Reduce server-side compute load through selective frontend compute offload.

View File

@@ -0,0 +1,22 @@
## ADDED Requirements
### Requirement: Incremental Synchronization SHALL Use Versioned Watermarks
For heavy non-full-snapshot datasets, cache refresh SHALL support incremental synchronization keyed by stable version or watermark boundaries.
#### Scenario: Incremental refresh cycle
- **WHEN** source data version indicates partial changes since last sync
- **THEN** cache update logic MUST fetch and merge only changed partitions while preserving correctness guarantees
### Requirement: Query Paths SHALL Use Indexed Access for High-Frequency Filters
Query execution over cached data SHALL use prebuilt indexes for known high-frequency filter columns.
#### Scenario: Filtered report query
- **WHEN** request filters target indexed fields
- **THEN** result selection MUST avoid full dataset scans and maintain existing response contract
### Requirement: Business-Mandated Full-Table Caches SHALL Be Preserved for Resource and WIP
The system SHALL continue to maintain full-table cache behavior for `resource` and `wip` domains.
#### Scenario: Resource or WIP cache refresh
- **WHEN** cache update runs for `resource` or `wip`
- **THEN** the updater MUST retain full-table snapshot semantics and MUST NOT switch these domains to partial-only cache mode
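The indexed-access requirement can be sketched with a prebuilt per-column index; the helper names and the row-as-dict shape are assumptions, not the service's actual data structures.

```python
from collections import defaultdict

def build_column_index(rows: list[dict], column: str) -> dict:
    """Prebuild value -> row positions for a high-frequency filter column."""
    index = defaultdict(list)
    for pos, row in enumerate(rows):
        index[row[column]].append(pos)
    return index

def select_indexed(rows: list[dict], index: dict, value) -> list[dict]:
    # Indexed selection: no full-dataset scan, same row objects returned,
    # so the existing response contract is preserved.
    return [rows[pos] for pos in index.get(value, [])]
```

Index builders would run once per cache refresh, so each filtered report query pays only the lookup cost.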

View File

@@ -0,0 +1,15 @@
## ADDED Requirements
### Requirement: Cache Telemetry SHALL Include Memory Amplification Signals
Operational telemetry MUST expose cache domain memory usage indicators and representation amplification factors.
#### Scenario: Deep health telemetry request
- **WHEN** operators inspect cache telemetry
- **THEN** telemetry MUST include per-domain memory footprint and amplification indicators sufficient to detect redundant structures
### Requirement: Efficiency Benchmarks SHALL Gate Cache Refactor Rollout
Cache/query efficiency changes MUST be validated against baseline latency and memory benchmarks before rollout.
#### Scenario: Pre-release validation
- **WHEN** cache refactor changes are prepared for deployment
- **THEN** benchmark results MUST demonstrate no regression beyond configured thresholds for P95 latency and memory usage

View File

@@ -0,0 +1,15 @@
## ADDED Requirements
### Requirement: Reusable Browser Compute Modules SHALL Power Report Derivations
Derived computations for report filters, KPI cards, chart series, and table projections SHALL be implemented through reusable frontend modules.
#### Scenario: Shared report derivation logic
- **WHEN** multiple report pages require equivalent data-shaping behavior
- **THEN** pages MUST consume shared compute modules instead of duplicating transformation logic per page
### Requirement: Browser Compute Shift SHALL Preserve Export and Field Contracts
Moving computations to frontend MUST preserve existing field naming and export column contracts.
#### Scenario: User exports report after frontend-side derivation
- **WHEN** transformed data is rendered and exported
- **THEN** exported field names and ordering MUST remain consistent with governed field contract definitions

View File

@@ -0,0 +1,23 @@
## 1. Cache Structure and Sync Refactor
- [x] 1.1 Define canonical per-domain cache representation and remove redundant parallel structures.
- [x] 1.2 Implement version/watermark-based incremental sync path for eligible non-full-snapshot datasets.
- [x] 1.3 Keep `resource` and `wip` full-table cache behavior while optimizing surrounding parse/index pipelines.
## 2. Indexed Query Acceleration
- [x] 2.1 Add index builders for high-frequency filter columns used by report endpoints.
- [x] 2.2 Refactor read paths to use indexed selection and reduce broad DataFrame copy operations.
- [x] 2.3 Add fallback and reconciliation logic to guarantee correctness under incremental/index drift.
## 3. Frontend Compute Reuse Expansion
- [x] 3.1 Extract shared Vite compute modules for KPI/filter/chart/table derivations.
- [x] 3.2 Refactor report pages to consume shared modules without changing user-visible behavior.
- [x] 3.3 Validate export/header field contract consistency after compute shift.
## 4. Performance Validation and Docs
- [x] 4.1 Add benchmark fixtures for baseline vs refactor latency/memory comparison.
- [x] 4.2 Surface cache memory amplification and index efficiency telemetry in health/admin outputs.
- [x] 4.3 Update README/README.mdj with cache strategy constraints and performance governance rules.

View File

@@ -0,0 +1,2 @@
schema: spec-driven
created: 2026-02-08

View File

@@ -0,0 +1,45 @@
## Context
The project already provides watchdog-assisted restart and resilience diagnostics, but policy boundaries for automated recovery are not yet formalized end-to-end. In practice, this can lead to either under-recovery (manual delays) or over-recovery (restart churn). We also need stronger conda/systemd path consistency checks to prevent runtime drift across deploy scripts and services.
## Goals / Non-Goals
**Goals:**
- Make conda/systemd runtime path contracts explicit, validated, and drift-detectable.
- Implement safe self-healing policy with cooldown and churn limits.
- Expose clear alert signals and recommended actions in health/admin payloads.
- Keep operator manual override available for incident control.
**Non-Goals:**
- Migrating from systemd to another orchestrator.
- Changing database vendor or introducing full autoscaling infrastructure.
- Removing existing admin restart endpoints.
## Decisions
1. **Single source runtime contract**
- Decision: centralize conda runtime path configuration consumed by systemd units, watchdog, and scripts.
- Rationale: prevents mismatched interpreter/path drift.
2. **Guarded self-healing state machine**
- Decision: implement bounded restart policy (cooldown + max retries per time window + circuit-open gating).
- Rationale: recovers quickly while preventing restart storms.
3. **Explicit recovery observability contract**
- Decision: enrich health/admin payloads with churn counters, cooldown state, and recommended operator action.
- Rationale: enables deterministic triage and alert automation.
4. **Auditability requirement**
- Decision: emit structured logs/events for auto-restart decision, manual override, and blocked restart attempts.
- Rationale: supports incident retrospectives and policy tuning.
5. **Runbook-first rollout**
- Decision: deploy policy changes behind documentation and validation gates, including rollback steps.
- Rationale: operational safety for production adoption.
## Risks / Trade-offs
- **[Risk] Overly strict policy delays recovery** → **Mitigation:** configurable thresholds and emergency manual override.
- **[Risk] Aggressive policy causes churn loops** → **Mitigation:** hard stop on churn threshold breach and explicit cool-off windows.
- **[Risk] Added operational complexity** → **Mitigation:** concise runbook with decision tables and tested scripts.
- **[Risk] Drift detection false positives** → **Mitigation:** normalize path resolution and clearly defined comparison sources.

View File

@@ -0,0 +1,40 @@
## Why
Operations stability still depends heavily on manual intervention when workers degrade or pools saturate. We need a formal operations phase to align conda/systemd runtime contracts and add controlled self-healing with guardrails, so recovery is faster without triggering restart storms.
## What Changes
- Standardize conda-based runtime paths across app service, watchdog, and operational scripts from a single source of truth.
- Introduce guarded worker self-healing policy (cooldown, churn windows, bounded retries, manual override).
- Add alert thresholds and machine-readable recovery signals for pool pressure, circuit-open persistence, and restart churn.
- Harden runbook documentation and scripts for deterministic restart, rollback, and incident triage.
## Capabilities
### New Capabilities
- `worker-self-healing-governance`: Define safe autonomous recovery behavior with anti-storm guardrails.
### Modified Capabilities
- `conda-systemd-runtime-alignment`: Extend runtime consistency requirements with startup validation and drift detection.
- `runtime-resilience-recovery`: Add auditable recovery-action requirements for automated and operator-triggered restart flows.
## Impact
- Affected code:
- `deploy/systemd/*.service`
- `scripts/worker_watchdog.py`
- `src/mes_dashboard/routes/admin_routes.py`
- `src/mes_dashboard/routes/health_routes.py`
- `src/mes_dashboard/core/database.py`
- `src/mes_dashboard/core/circuit_breaker.py`
- `tests/`
- `README.md`, `README.mdj`, runbook docs
- APIs:
- `/health`
- `/health/deep`
- `/admin/api/system-status`
- `/admin/api/worker/status`
- `/admin/api/worker/restart`
- Operational behavior:
- Preserve single-port bind model.
- Add controlled self-healing policy and clearer alert thresholds.

View File

@@ -0,0 +1,15 @@
## ADDED Requirements
### Requirement: Runtime Path Drift SHALL Be Detectable Before Service Start
Service startup checks MUST validate configured conda runtime paths across app, watchdog, and worker control scripts.
#### Scenario: Conda path mismatch detected
- **WHEN** startup validation finds runtime path inconsistency between configured units and scripts
- **THEN** service start MUST fail with actionable diagnostics instead of running with partial mismatch
### Requirement: Conda/Systemd Contract SHALL Be Versioned in Operations Docs
The documented runtime contract MUST include versioned path assumptions and verification commands.
#### Scenario: Operator verifies deployment contract
- **WHEN** operator follows runbook validation steps
- **THEN** commands MUST confirm active runtime paths match documented conda/systemd contract
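The drift-detection requirement could be sketched as a startup check that compares resolved conda runtime paths across consumers; the function name and the consumer labels are illustrative.

```python
from pathlib import Path

def check_conda_path_drift(configured: dict[str, str]) -> list[str]:
    """Return diagnostics when app/watchdog/script runtime paths drift (sketch)."""
    resolved = {name: str(Path(raw).resolve()) for name, raw in configured.items()}
    diagnostics = []
    if len(set(resolved.values())) > 1:
        diagnostics.append(f"runtime path mismatch across consumers: {resolved}")
    for name, path in resolved.items():
        if not Path(path).exists():
            diagnostics.append(f"{name}: configured path does not exist: {path}")
    return diagnostics
```

A non-empty result would abort service start with actionable diagnostics, matching the "fail with actionable diagnostics" scenario above.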

View File

@@ -0,0 +1,15 @@
## ADDED Requirements
### Requirement: Recovery Recommendations SHALL Reflect Self-Healing Policy State
Health and admin resilience payloads MUST expose whether automated recovery is allowed, cooling down, or blocked by churn policy.
#### Scenario: Operator inspects degraded state
- **WHEN** `/health` or `/admin/api/worker/status` is requested during degradation
- **THEN** response MUST include policy state, cooldown remaining time, and next recommended action
### Requirement: Manual Recovery Override SHALL Be Explicit and Controlled
Manual restart actions MUST bypass the automatic recovery block only through authenticated operator pathways with explicit acknowledgement.
#### Scenario: Churn-blocked state with manual override request
- **WHEN** authorized admin requests manual restart while auto-recovery is blocked
- **THEN** system MUST execute controlled restart path and log the override context for auditability

View File

@@ -0,0 +1,22 @@
## ADDED Requirements
### Requirement: Automated Worker Recovery SHALL Use Bounded Policy Guards
Automated worker restart behavior MUST enforce cooldown periods and bounded restart attempts within a configurable time window.
#### Scenario: Repeated worker degradation within short window
- **WHEN** degradation events exceed configured restart-attempt budget
- **THEN** automated restarts MUST pause and surface a blocked-recovery signal for operator intervention
### Requirement: Restart-Churn Protection SHALL Prevent Recovery Storms
The runtime MUST classify restart churn and prevent uncontrolled restart loops.
#### Scenario: Churn threshold exceeded
- **WHEN** restart count crosses churn threshold in active window
- **THEN** watchdog MUST enter guarded mode and require explicit manual override before further restart attempts
### Requirement: Recovery Decisions SHALL Be Audit-Ready
Every auto-recovery decision and manual override action MUST be recorded with structured metadata.
#### Scenario: Worker restart decision emitted
- **WHEN** system executes or denies a restart action
- **THEN** structured logs/events MUST include reason, thresholds, actor/source, and resulting state
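The bounded-policy and churn-protection requirements above can be sketched as a small state machine. Class and state names are illustrative; the real watchdog policy is assumed to carry more context (audit metadata, override handling).

```python
import time
from collections import deque

class RestartPolicy:
    """Bounded restart guard: cooldown + attempt budget per sliding window (sketch)."""

    def __init__(self, max_restarts=3, window_s=600, cooldown_s=60,
                 clock=time.monotonic):
        self.max_restarts = max_restarts
        self.window_s = window_s
        self.cooldown_s = cooldown_s
        self.clock = clock
        self.history = deque()       # timestamps of recent restarts
        self.last_restart = None

    def state(self) -> str:
        now = self.clock()
        while self.history and now - self.history[0] > self.window_s:
            self.history.popleft()   # drop restarts outside the churn window
        if len(self.history) >= self.max_restarts:
            return "blocked"         # guarded mode: require manual override
        if self.last_restart is not None and now - self.last_restart < self.cooldown_s:
            return "cooldown"
        return "allowed"

    def record_restart(self) -> None:
        now = self.clock()
        self.history.append(now)
        self.last_restart = now
```

Exposing `state()` in health/admin payloads gives the "allowed / cooldown / blocked" signal the tasks list describes.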

View File

@@ -0,0 +1,23 @@
## 1. Conda/Systemd Contract Alignment
- [x] 1.1 Centralize runtime path configuration consumed by service units, watchdog, and scripts.
- [x] 1.2 Add startup validation that fails fast on conda path drift.
- [x] 1.3 Update systemd/watchdog integration tests for consistent runtime contract.
## 2. Worker Self-Healing Policy
- [x] 2.1 Implement bounded auto-restart policy (cooldown, retry budget, churn window).
- [x] 2.2 Add guarded mode behavior when churn threshold is exceeded.
- [x] 2.3 Implement authenticated manual override flow with explicit logging context.
## 3. Alerting and Operational Signals
- [x] 3.1 Expose policy-state fields in health/admin payloads (`allowed`, `cooldown`, `blocked`).
- [x] 3.2 Add structured audit events for restart decisions and override actions.
- [x] 3.3 Define alert thresholds and wire monitoring-friendly fields for pool/circuit/churn conditions.
## 4. Validation and Runbook Delivery
- [x] 4.1 Add tests for policy transitions, guarded mode, and override behavior.
- [x] 4.2 Validate single-port continuity during controlled recovery and hot reload paths.
- [x] 4.3 Update README/README.mdj and deployment runbook with verified operational procedures.

View File

@@ -0,0 +1,2 @@
schema: spec-driven
created: 2026-02-08

View File

@@ -0,0 +1,50 @@
## Context
The system has completed the Vite single-port architecture and the major P0/P1/P2 hardening, but residual risk concentrates in cache slow-path lock contention, health-endpoint hot queries, and API boundary governance. Most of these issues only surface under medium-to-high traffic; deferring them would make later troubleshooting costly.
## Goals / Non-Goals
**Goals:**
- Complete the remaining stability and security fixes without changing page interaction semantics or the single-port architecture.
- Make cache/health paths more predictable under high concurrency and reduce security risk in logs.
- Use test coverage to ensure the fixes cause no functional regressions.
**Non-Goals:**
- No rewrite of the main query flows and no removal of the `resource`/`wip` full-table cache strategy.
- No heavyweight distributed rate-limit infrastructure.
- No changes to frontend drill-down or report feature semantics.
## Decisions
1. **Cache publish consistency over local optimization**
   - Publish data and metadata via staging keys with an atomic rename/pipeline, so a failed publish never affects readability of the previous snapshot.
2. **Parse outside the lock; keep the lock for consistency check/commit only**
   - The WIP process-cache slow path parses outside the lock, then double-checks and commits inside the lock, shortening lock hold time.
3. **Consistent process-cache policy**
   - Add `max_size` + LRU to the realtime equipment cache, matching the existing WIP/Resource caches.
4. **Short internal health cache enabled only outside testing**
   - TTL = 5 seconds, reducing repeated DB/Redis pressure from high-frequency probes; testing mode keeps real-time computation to avoid cross-test pollution.
5. **Lightweight in-memory rate limiting for high-cost APIs**
   - Throttle per IP + route window with tunable parameters, without introducing new external dependencies.
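Decision 1 (staging key + atomic rename publish) can be sketched against a redis-py-style client; the key naming and `client` interface are assumptions, not the project's actual cache updater.

```python
import json

def publish_snapshot(client, data_key: str, meta_key: str, payload, meta) -> None:
    """Write to staging keys, then atomically swap via RENAME in one pipeline,
    so a failed publish leaves the previous snapshot fully readable (sketch)."""
    staging_data = data_key + ":staging"
    staging_meta = meta_key + ":staging"
    # If either write fails here, the live keys are untouched.
    client.set(staging_data, json.dumps(payload))
    client.set(staging_meta, json.dumps(meta))
    pipe = client.pipeline(transaction=True)
    pipe.rename(staging_data, data_key)
    pipe.rename(staging_meta, meta_key)
    pipe.execute()  # both keys flip together; readers never see a partial state
```

Because data and metadata rename in the same transaction, readers observe either the whole old snapshot or the whole new one, which is exactly the publish-consistency decision.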
## Risks / Trade-offs
- [Risk] The cache publish rework adds key-switching complexity → Mitigation: add tests covering both publish failure and success paths.
- [Risk] The health cache briefly delays observability → Mitigation: cap the TTL at 5 seconds and disable it in testing.
- [Risk] In-memory rate limiting is not globally consistent across multiple workers → Mitigation: treat it as a protective valve first; upgrade to a Redis-based limiter later.
## Migration Plan
1. Complete the core cache and health fixes first (no API contract impact).
2. Then introduce API boundary controls/rate limiting and shared-utility extraction.
3. Add unit and integration tests, and run a benchmark smoke test.
4. Update the README documentation and environment-variable descriptions.
## Open Questions
- Should default rate-limit thresholds for high-cost APIs be tuned per endpoint (WIP vs Resource)?
- Should we later upgrade to Redis-based distributed rate limiting for global consistency across workers?

View File

@@ -0,0 +1,44 @@
## Why
The previous round fixed the high-risk core issues, but a batch of residual problems remains that amplifies risk under high concurrency, long uptime, and malicious or abnormal input (cache publish consistency, lock contention, health-check load, input boundaries, and rate governance). This round converges these tail risks to an acceptable range, avoiding later operational and performance instability.
## What Changes
- Harden the WIP cache publish flow so a failed update cannot pollute the existing read path.
- Narrow the process-cache slow-path lock scope to avoid parsing large JSON payloads while holding the lock.
- Add a bounded LRU to the realtime equipment process cache, consistent with the WIP/Resource policy.
- Add depth protection to NaN cleaning in resource routes (avoiding deep-recursion risk).
- Extract shared boolean query-parameter parsing to eliminate duplicated logic.
- Make filter cache view names configurable, removing hardcoded coupling.
- Add log redaction for sensitive connection strings.
- Add a 5-second internal short cache for `/health` and `/health/deep` (disabled in testing mode).
- Add lightweight, tunable rate limiting to high-cost query APIs.
- Update README/README.mdj and validation tests.
## Capabilities
### New Capabilities
- `api-safety-hygiene`: API input boundaries, shared parameter parsing, configurable query sources, and rate governance for high-cost endpoints.
### Modified Capabilities
- `cache-observability-hardening`: strengthen cache publish consistency, process-cache lock scope, and bounded-policy alignment.
- `runtime-resilience-recovery`: operational safety requirements for the health-check short cache and sensitive-information log redaction.
## Impact
- Affected code:
- `src/mes_dashboard/core/cache_updater.py`
- `src/mes_dashboard/core/cache.py`
- `src/mes_dashboard/services/realtime_equipment_cache.py`
- `src/mes_dashboard/routes/resource_routes.py`
- `src/mes_dashboard/routes/wip_routes.py`
- `src/mes_dashboard/routes/hold_routes.py`
- `src/mes_dashboard/services/filter_cache.py`
- `src/mes_dashboard/core/database.py`
- `src/mes_dashboard/routes/health_routes.py`
- APIs:
- `/health`, `/health/deep`
- `/api/wip/detail/<workcenter>`, `/api/wip/overview/*`
- `/api/resource/*` (high-cost routes)
- Docs/tests:
- `README.md`, `README.mdj`, `tests/*`

View File

@@ -0,0 +1,29 @@
## ADDED Requirements
### Requirement: Recursive Payload Cleaning MUST Enforce Depth Safety
Routes that normalize nested payloads MUST prevent unbounded recursion depth.
#### Scenario: Deeply nested response object
- **WHEN** NaN-cleaning helper receives deeply nested list/dict payload
- **THEN** cleaning logic MUST enforce max depth or iterative traversal and return safely without recursion failure
### Requirement: Filter Source Names MUST Be Configurable
Filter cache query sources MUST NOT rely solely on hardcoded view names.
#### Scenario: Environment-specific view names
- **WHEN** deployment sets custom filter-source environment variables
- **THEN** filter cache loader MUST resolve and query configured view names
### Requirement: High-Cost APIs SHALL Apply Basic Rate Guardrails
High-cost read endpoints SHALL apply configurable request-rate guardrails to reduce abuse and accidental bursts.
#### Scenario: Burst traffic from same client
- **WHEN** a client exceeds configured request budget for guarded endpoints
- **THEN** endpoint SHALL return throttled response with clear retry guidance
### Requirement: Common Boolean Query Parsing SHALL Be Shared
Boolean query parsing in routes SHALL use shared helper behavior.
#### Scenario: Different routes parse include flags
- **WHEN** routes parse common boolean query parameters
- **THEN** parsing behavior MUST be consistent across routes via shared utility
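The rate-guardrail requirement can be sketched as a fixed-window in-memory limiter keyed by `(client_ip, route)`, matching the design's "no new external dependencies" constraint. The class name and defaults are assumptions.

```python
import time
from collections import defaultdict, deque

class WindowRateLimiter:
    """In-memory fixed-window guard keyed by (client_ip, route) — a protective
    valve, not a globally consistent limiter across workers (sketch)."""

    def __init__(self, limit=30, window_s=60, clock=time.monotonic):
        self.limit = limit
        self.window_s = window_s
        self.clock = clock
        self.hits = defaultdict(deque)

    def allow(self, client_ip: str, route: str) -> bool:
        key = (client_ip, route)
        now = self.clock()
        q = self.hits[key]
        while q and now - q[0] > self.window_s:
            q.popleft()                 # expire hits outside the window
        if len(q) >= self.limit:
            return False                # caller responds throttled with retry guidance
        q.append(now)
        return True
```

A guarded endpoint would check `allow()` first and return a throttled response with retry guidance when it fails, as the scenario requires.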

View File

@@ -0,0 +1,26 @@
## ADDED Requirements
### Requirement: Cache Publish MUST Preserve Previous Readable Snapshot on Failure
When refreshing full-table cache payloads, the system MUST avoid exposing partially published states to readers.
#### Scenario: Publish fails after payload serialization
- **WHEN** a cache refresh has prepared new payload but publish operation fails
- **THEN** previously published cache keys MUST remain readable and metadata MUST remain consistent with old snapshot
#### Scenario: Publish succeeds
- **WHEN** publish operation completes successfully
- **THEN** data payload and metadata keys MUST be visible as one coherent new snapshot
### Requirement: Process-Level Cache Slow Path SHALL Minimize Lock Hold Time
Large payload parsing MUST NOT happen inside long-held process cache locks.
#### Scenario: Cache miss under concurrent requests
- **WHEN** multiple requests hit process cache miss
- **THEN** parsing work SHALL happen outside lock-protected mutation section, and lock scope SHALL be limited to consistency check + commit
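The lock-scope rule can be illustrated with a minimal parse-outside-lock cache; `ParsedCache` is a hypothetical stand-in for the project's process cache, with `json.loads` as the "heavy parse":

```python
import json
import threading

class ParsedCache:
    """Process cache whose slow path parses outside the lock.

    The expensive parse runs with no lock held; the lock only guards a
    version check plus commit, so concurrent misses do redundant parsing
    at worst, never serialized parsing.
    """

    def __init__(self):
        self._lock = threading.Lock()
        self._value = None
        self._version = None

    def get(self, raw_payload, version):
        with self._lock:
            if self._version == version:
                return self._value       # hit path: no parse at all
        parsed = json.loads(raw_payload)  # heavy work, lock not held
        with self._lock:
            if self._version != version:  # consistency check + commit
                self._value = parsed
                self._version = version
            return self._value
```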
### Requirement: Process-Level Cache Policies MUST Stay Consistent Across Services
All service-local process caches MUST support bounded capacity with deterministic eviction.
#### Scenario: Realtime equipment cache growth
- **WHEN** realtime equipment process cache reaches configured capacity
- **THEN** entries MUST be evicted according to deterministic LRU behavior
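A bounded TTL + LRU policy of the kind required here can be sketched with `OrderedDict` (the class name and defaults are illustrative):

```python
import time
from collections import OrderedDict

class BoundedTTLCache:
    """TTL cache with a max_size bound and deterministic LRU eviction."""

    def __init__(self, max_size=128, ttl_seconds=300.0):
        self.max_size = max_size
        self.ttl = ttl_seconds
        self._data = OrderedDict()  # key -> (value, stored_at)

    def get(self, key, now=None):
        now = time.monotonic() if now is None else now
        entry = self._data.get(key)
        if entry is None:
            return None
        value, stored_at = entry
        if now - stored_at >= self.ttl:
            del self._data[key]
            return None
        self._data.move_to_end(key)  # reads refresh recency
        return value

    def set(self, key, value, now=None):
        now = time.monotonic() if now is None else now
        if key in self._data:
            self._data.move_to_end(key)
        self._data[key] = (value, now)
        while len(self._data) > self.max_size:
            self._data.popitem(last=False)  # evict least-recently-used
```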


@@ -0,0 +1,19 @@
## ADDED Requirements
### Requirement: Health Endpoints SHALL Use Short Internal Memoization
Health and deep-health computation SHALL use a short-lived internal cache to prevent probe storms from amplifying backend load.
#### Scenario: Frequent monitor scrapes
- **WHEN** health endpoints are called repeatedly within a small window
- **THEN** service SHALL return memoized payload for up to 5 seconds in non-testing environments
#### Scenario: Testing mode
- **WHEN** app is running in testing mode
- **THEN** health endpoint memoization MUST be bypassed to preserve deterministic tests
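A short-lived memo honoring the testing bypass might be wired like this sketch (`memoize_health` is an assumed helper name; the clock is injectable purely to keep the example testable):

```python
import time

def memoize_health(compute, ttl_seconds=5.0, testing=False, clock=time.monotonic):
    """Wrap a health-payload function with a short-lived memo.

    Repeated probe scrapes within `ttl_seconds` reuse the last payload;
    testing mode bypasses the memo for deterministic tests.
    """
    state = {"at": None, "payload": None}

    def wrapper():
        if testing:
            return compute()
        now = clock()
        if state["at"] is not None and now - state["at"] < ttl_seconds:
            return state["payload"]  # memo hit: no backend work
        state["payload"] = compute()
        state["at"] = now
        return state["payload"]

    return wrapper
```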
### Requirement: Logs MUST Redact Connection Secrets
Runtime logs MUST avoid exposing DB connection credentials.
#### Scenario: Connection string appears in log message
- **WHEN** a log message contains DB URL credentials
- **THEN** logger output MUST redact password and sensitive userinfo before emission


@@ -0,0 +1,22 @@
## 1. Cache Consistency and Contention Hardening
- [x] 1.1 Harden WIP cache publish in `cache_updater.py` to preserve old snapshot on publish failure.
- [x] 1.2 Refactor WIP process-cache slow path in `core/cache.py` so heavy parse runs outside lock.
- [x] 1.3 Extend realtime equipment process cache with bounded `max_size` + deterministic LRU and add regression tests.
## 2. API Safety and Config Hygiene
- [x] 2.1 Add depth-safe NaN cleaning in `resource_routes.py` and tests for deep payloads.
- [x] 2.2 Add shared boolean query parser in `core/utils.py` and switch `wip_routes.py` / `hold_routes.py` to it.
- [x] 2.3 Make filter-cache source views configurable (env-based) in `filter_cache.py` and add config tests.
## 3. Runtime Guardrails
- [x] 3.1 Add DB connection-string redaction logging filter in `core/database.py` (or logging bootstrap) with tests.
- [x] 3.2 Add 5-second internal memoization for `/health` and `/health/deep` (disabled in testing) and tests.
- [x] 3.3 Add lightweight rate limiting for selected high-cost APIs with clear throttling responses and tests.
## 4. Validation and Documentation
- [x] 4.1 Run targeted backend/frontend tests and benchmark smoke gate.
- [x] 4.2 Update `README.md` and `README.mdj` with round-3 hardening notes and new env variables.


@@ -0,0 +1,2 @@
schema: spec-driven
created: 2026-02-08


@@ -0,0 +1,61 @@
## Context
After round 3, the main flow is stable, but three classes of technical debt remain:
- The Resource cache keeps both a DataFrame and a full copy of the records in the same process, amplifying memory usage.
- Resource and Realtime Equipment Oracle queries duplicate SQL strings across services, so later edits can easily drift apart.
- Type annotations at some service boundaries and magic numbers are not yet systematized, keeping maintenance cost high.
Constraints:
- `resource` / `wip` keep the full-table cache strategy; data sources and refresh cadence are unchanged.
- External API fields and frontend behavior are unchanged.
- The single-port architecture and existing operations contract are preserved.
## Goals / Non-Goals
**Goals:**
- Reduce duplicated in-process data representations in the Resource cache while keeping query output compatible.
- Maintain cross-service Oracle query fragments from a single source.
- Give key service/cache modules consistent type annotations and named constants.
**Non-Goals:**
- No changes to the database schema or SQL result columns.
- No rewrite of the overall cache architecture (Redis + process cache stays).
- No new infrastructure or external dependencies.
## Decisions
1. Switch the Resource derived index to a "row-position index" instead of keeping a full records copy
   - Current state: the index keeps `records` plus several bucket record sets, duplicating DataFrame content.
   - Decision: the index keeps only row positions (integer indices) and essential metadata; dict output is converted on demand from the DataFrame.
   - Trade-off: each output pays a small conversion cost, but resident memory duplication drops significantly.
2. Create a shared Oracle query constants module
   - Current state: `resource_cache.py` and `realtime_equipment_cache.py` each maintain their own base SQL.
   - Decision: extract `services/sql_fragments.py` (or an equivalent module) to own shared query text and table/view names.
   - Trade-off: adds one level of indirection, but query semantics stay consistent and changes stay controlled.
3. Govern types and constants "core boundaries first, then expand"
   - Current state: some functions mix `Optional` with PEP 604 syntax, and magic numbers are scattered across cache/service code.
   - Decision: first unify the type style and high-frequency constants (TTL, size, window, limits) in the files touched this round.
   - Trade-off: no one-shot project-wide cleanup, to avoid large-scale noise; establish a sustainable, extensible baseline instead.
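The row-position index from Decision 1 can be sketched as follows, assuming pandas is available; `build_position_index` and `rows_for` are illustrative names, not the project's actual API:

```python
import pandas as pd

def build_position_index(df, column):
    """Bucket DataFrame rows by `column`, storing only integer positions."""
    index = {}
    for pos, value in enumerate(df[column].tolist()):
        index.setdefault(value, []).append(pos)
    return index

def rows_for(df, index, value):
    """Materialize record dicts on demand instead of keeping stored copies."""
    positions = index.get(value, [])
    return df.iloc[positions].to_dict(orient="records")
```

The index holds only small integer lists; the DataFrame stays the single authoritative representation, which is exactly the memory-amplification reduction the decision targets.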
## Risks / Trade-offs
- [Risk] The row-position index drifts out of sync with the DataFrame version → Mitigation: rebuild the index on every cache invalidation and keep a version check.
- [Risk] Lazy conversion causes query-side latency jitter → Mitigation: keep the process cache and optimize hot paths with small-batch output.
- [Risk] Extracting shared SQL constants introduces reference errors → Mitigation: add unit tests verifying query text matches the existing column contracts.
- [Risk] Type/constant cleanup changes behavior → Mitigation: restrict to equivalence-preserving refactors, keep original values, and cover with regression tests.
## Migration Plan
1. Refactor the Resource index representation first, keeping API output unchanged.
2. Extract the shared SQL fragments and switch both cache services to them.
3. Clean up types and constants in scope and add tests.
4. Update README / README.mdj and the OpenSpec tasks; run the targeted backend/frontend test sets.
Rollback:
- If compatibility issues appear, revert to the original records-based index and the old inline SQL (a single-file revert is enough).
## Open Questions
- Whether to extend the same governance to the remaining constants and types in `wip_service.py` in the next round (this round is limited to the residual scope).


@@ -0,0 +1,31 @@
## Why
The remaining risks are concentrated in maintainability and memory efficiency: the Resource cache keeps multiple data representations in a single process, some query SQL is maintained in duplicate across cache services, and type annotations and magic numbers are still inconsistent. These issues do not cause outages immediately, but they raise memory usage, the cost of later changes, and regression risk, so they should be converged without changing existing functionality.
## What Changes
- Switch the Resource derived index to a "lightweight index + lazy output" representation, so the process no longer keeps a duplicated full copy of the records.
- Consolidate the Resource and Realtime Equipment Oracle query strings into a shared SQL constants module, reducing duplicate definitions and drift risk.
- Make type annotations consistent (especially at cache/index/service boundaries) and promote high-frequency magic numbers to named constants or configurable parameters.
- Keep the existing API contract, full-table cache strategy, single-port architecture, and frontend behavior unchanged.
## Capabilities
### New Capabilities
- `resource-cache-representation-normalization`: replace multiple full in-process data copies with a single authoritative representation plus a lightweight index, preserving the existing query return structures.
- `oracle-query-fragment-governance`: extract cross-service Oracle query fragments into shared constants/templates to keep query semantics consistent.
- `maintainability-type-and-constant-hygiene`: establish working conventions for type annotations and named constants, reducing magic numbers and annotation-style drift.
### Modified Capabilities
- `cache-observability-hardening`: adds observability-consistency requirements covering memory amplification factors and the adjusted index representation.
## Impact
- Primary affected files:
  - `src/mes_dashboard/services/resource_cache.py`
  - `src/mes_dashboard/services/realtime_equipment_cache.py`
  - `src/mes_dashboard/services/resource_service.py` (if the index output requires coordination)
  - `src/mes_dashboard/sql/*` or a new shared SQL constants module
  - `src/mes_dashboard/config/constants.py` and `src/mes_dashboard/core/utils.py`
  - Related tests and README/README.mdj documentation
- No new external dependencies; no changes to external API paths or field contracts.


@@ -0,0 +1,8 @@
## MODIFIED Requirements
### Requirement: Cache Telemetry SHALL Include Memory Amplification Signals
Operational telemetry MUST expose cache-domain memory usage indicators and representation amplification factors, and MUST differentiate between authoritative data payload and derived/index helper structures.
#### Scenario: Deep health telemetry request after representation normalization
- **WHEN** operators inspect cache telemetry for resource or WIP domains
- **THEN** telemetry MUST include per-domain memory footprint, amplification indicators, and enough structure detail to verify that full-record duplication is not reintroduced


@@ -0,0 +1,15 @@
## ADDED Requirements
### Requirement: Core Cache and Service Boundaries MUST Use Consistent Type Annotation Style
Core cache/service modules touched by this change SHALL use a consistent and explicit type-annotation style for public and internal helper boundaries.
#### Scenario: Reviewing updated cache/service modules
- **WHEN** maintainers inspect function signatures in affected modules
- **THEN** optional and collection types MUST follow a single consistent style and remain compatible with the project Python baseline
### Requirement: High-Frequency Magic Numbers MUST Be Replaced by Named Constants
Cache, throttling, and index-related numeric literals that control behavior MUST be extracted to named constants or env-configurable settings.
#### Scenario: Tuning cache/index behavior
- **WHEN** operators need to tune cache/index thresholds
- **THEN** they MUST find values in named constants or environment variables rather than scattered inline literals


@@ -0,0 +1,15 @@
## ADDED Requirements
### Requirement: Shared Oracle Query Fragments SHALL Have a Single Source of Truth
Cross-service Oracle query fragments for resource and equipment cache loading MUST be defined in a shared module and imported by service implementations.
#### Scenario: Update common table/view reference
- **WHEN** a common table or view name changes
- **THEN** operators and developers MUST be able to update one shared definition without editing duplicated SQL literals across services
### Requirement: Service Queries MUST Preserve Existing Columns and Semantics
Services consuming shared Oracle query fragments SHALL preserve existing selected columns, filters, and downstream payload behavior.
#### Scenario: Resource and equipment cache refresh after refactor
- **WHEN** cache services execute queries via shared fragments
- **THEN** resulting payload structure MUST remain compatible with existing aggregation and API contracts


@@ -0,0 +1,22 @@
## ADDED Requirements
### Requirement: Resource Derived Index MUST Avoid Full Record Duplication
Resource derived index SHALL use lightweight row-position references instead of storing full duplicated record payloads alongside the process DataFrame cache.
#### Scenario: Build index from cached DataFrame
- **WHEN** resource cache data is parsed from Redis into process-level DataFrame
- **THEN** the derived index MUST store position-based references and metadata without a second full records copy
### Requirement: Resource Query APIs SHALL Preserve Existing Response Contract
Resource query APIs MUST keep existing output fields and semantics after index representation normalization.
#### Scenario: Read all resources after normalization
- **WHEN** callers request all resources or filtered resource lists
- **THEN** the returned payload MUST remain field-compatible with pre-normalization responses
### Requirement: Cache Invalidation MUST Keep Index/Data Coherent
The system SHALL invalidate and rebuild DataFrame/index representations atomically at cache refresh boundaries.
#### Scenario: Redis-backed cache refresh completes
- **WHEN** a new resource cache snapshot is published
- **THEN** stale index references MUST be invalidated before subsequent reads use refreshed DataFrame data


@@ -0,0 +1,22 @@
## 1. Resource Cache Representation Normalization
- [x] 1.1 Refactor `resource_cache` derived index to use lightweight row-position references instead of full duplicated records payload.
- [x] 1.2 Keep `get_all_resources` / `get_resources_by_filter` API outputs backward compatible while sourcing data from normalized representation.
- [x] 1.3 Update cache telemetry fields to reflect normalized representation and verify amplification calculation remains interpretable.
## 2. Oracle Query Fragment Governance
- [x] 2.1 Extract shared Oracle SQL fragments/constants for resource/equipment cache loading into a common module.
- [x] 2.2 Replace duplicated SQL literals in `resource_cache.py` and `realtime_equipment_cache.py` with shared definitions.
- [x] 2.3 Add/adjust tests to lock expected query semantics and prevent drift.
## 3. Maintainability Hygiene
- [x] 3.1 Normalize type annotations in touched cache/service modules to one consistent style.
- [x] 3.2 Replace high-frequency magic numbers with named constants or env-driven config in touched modules.
- [x] 3.3 Confirm existing login/API rate-limit and bool parser utilities remain centralized without new duplication.
## 4. Verification and Documentation
- [x] 4.1 Run targeted backend tests for resource cache, equipment cache, health/admin, and route behavior.
- [x] 4.2 Update `README.md` and `README.mdj` with round-4 hardening notes.


@@ -0,0 +1,2 @@
schema: spec-driven
created: 2026-02-08


@@ -0,0 +1,65 @@
## Context
The previous round completed the bulk of the P0/P1/P2 refactor, but code review still shows several residual high-risk points:
- `LDAP_API_URL` has no scheme/host guard, a configurable SSRF risk.
- The process-level DataFrame cache uses TTL only, with no capacity bound.
- Circuit breaker state transitions write logs while holding the lock, risking contention amplification.
- Global security headers are not yet emitted uniformly.
- Pagination parameters still lack lower-bound validation.
These issues span `app/core/services/routes/tests` and amount to cross-module security and stability fixes.
## Goals / Non-Goals
**Goals:**
- Establish testable minimum defenses for the LDAP endpoint, HTTP response headers, and input boundaries.
- Give the process-level cache bounded capacity and predictable eviction behavior.
- Reduce internal lock-contention risk in the circuit breaker so slow handlers cannot amplify blocking.
- Keep the single port, existing API contracts, and frontend interaction semantics unchanged.
**Non-Goals:**
- No full WAF/zero-trust architecture.
- No rewrite of the existing cache architecture onto an external cache service.
- No changes to report features or page flows.
## Decisions
1. **LDAP URL startup validation (fail-fast)**
   - Decision: validate `LDAP_API_URL` during `auth_service` startup, restricting it to `https` and an allowlist of hosts (configured via env); on mismatch, disable the LDAP authentication path and log the error.
   - Rationale: closes the configuration-driven SSRF risk with minimal change and does not affect local auth mode.
2. **Bounded ProcessLevelCache**
   - Decision: add `max_size` and LRU eviction (`OrderedDict`) to `ProcessLevelCache`, evicting the oldest key on `set`.
   - Rationale: keeps TTL behavior while preventing long-lived accumulation of high-cardinality keys.
3. **Circuit breaker logging outside the lock**
   - Decision: `_transition_to` updates state and assembles the log message inside the lock; the actual logger call moves outside the lock.
   - Rationale: shortens the lock-held section so slow I/O handlers cannot block other request paths.
4. **Unified global security-header injection**
   - Decision: add `CSP`, `X-Frame-Options`, `X-Content-Type-Options`, and `Referrer-Policy` in `app.after_request`, plus `HSTS` in production.
   - Rationale: a centralized policy covers all pages and APIs and reduces the chance of gaps.
5. **Consistent pagination bounds**
   - Decision: apply uniform `max(1, min(...))` boundary handling to `page` and `page_size`.
   - Rationale: prevents negative or extreme values from causing unnecessary load and unexpected behavior.
## Risks / Trade-offs
- **[Risk] An incomplete LDAP allowlist breaks login** → **Mitigation:** provide clear error messages and local-auth fallback guidance.
- **[Risk] A cache bound that is too small lowers the hit rate** → **Mitigation:** make `max_size` configurable, start with a conservative default, and watch telemetry.
- **[Risk] An overly strict CSP breaks existing inline scripts** → **Mitigation:** start with `default-src 'self'` and a compatible policy; fine-tune with nonces/allowlists if needed.
- **[Risk] Behavior changes trigger test regressions** → **Mitigation:** add unit/integration tests covering each fix.
## Migration Plan
1. Land the backend fixes first (auth/cache/circuit breaker/app headers/routes).
2. Add tests (LDAP validation, LRU, logging outside the lock, headers, pagination bounds).
3. Run the existing health checks and the key integration tests.
4. Update the security and stability sections of README/README.mdj.
5. If compatibility issues appear after deployment, the LDAP host allowlist and CSP details can be relaxed temporarily via env.
## Open Questions
- Does the LDAP host allowlist need multiple domains per environment (e.g. intranet + DR site)?
- Should CSP move to strict nonce-based mode immediately, or stay on the compatible policy first?


@@ -0,0 +1,40 @@
## Why
The previous round completed the core stability refactor, but several high-priority risks (LDAP URL validation, unbounded cache growth, circuit breaker logging while holding its lock, security-header gaps, pagination lower-bound validation) remain open. These issues accumulate availability and security risk under long-running operation and malicious input, so they need to be closed within the same round.
## What Changes
- Add startup validation of the LDAP API base URL (restricted to `https` and allowlisted hosts) to remove a controllable SSRF target.
- Add `max_size` and LRU eviction to the process-level cache to prevent unbounded memory growth from high-cardinality keys.
- Rework the circuit breaker state-transition flow to avoid writing logs while the lock is held.
- Add global security headers (CSP, X-Frame-Options, X-Content-Type-Options, Referrer-Policy, HSTS).
- Add pagination lower-bound validation so negative values and unreasonable page sizes never reach the query flow.
- Add tests and documentation updates for the fixes above, keeping the single port and existing frontend interaction semantics unchanged.
## Capabilities
### New Capabilities
- `security-surface-hardening`: defines the minimum defenses for the remaining security surface (SSRF protection, security headers, input boundary validation).
### Modified Capabilities
- `cache-observability-hardening`: extends cache-governance requirements with bounded process-level cache capacity and an eviction policy.
- `runtime-resilience-recovery`: adds the circuit breaker lock-contention fix and compatibility requirements between security headers and operational diagnostic responses.
## Impact
- Affected code:
- `src/mes_dashboard/services/auth_service.py`
- `src/mes_dashboard/core/cache.py`
- `src/mes_dashboard/services/resource_cache.py`
- `src/mes_dashboard/core/circuit_breaker.py`
- `src/mes_dashboard/app.py`
- `src/mes_dashboard/routes/wip_routes.py`
- `tests/`
- `README.md`, `README.mdj`
- APIs:
- `/health`, `/health/deep`
- `/api/wip/detail/<workcenter>`
- `/admin/login` (indirectly affected: LDAP base URL validation)
- Operational behavior:
- Keep the single port and the existing report UI flow.
- Strengthen the security and stability defenses without changing existing feature semantics.


@@ -0,0 +1,12 @@
## ADDED Requirements
### Requirement: Process-Level Cache SHALL Use Bounded Capacity with Deterministic Eviction
Process-level parsed-data caches MUST enforce a configurable maximum key capacity and use deterministic eviction behavior when capacity is exceeded.
#### Scenario: Cache capacity reached
- **WHEN** a new cache entry is inserted and key capacity is at limit
- **THEN** cache MUST evict entries according to defined policy before storing the new key
#### Scenario: Repeated access updates recency
- **WHEN** an existing cache key is read or overwritten
- **THEN** eviction order MUST reflect recency semantics so hot keys are retained preferentially


@@ -0,0 +1,12 @@
## ADDED Requirements
### Requirement: Circuit Breaker State Transitions SHALL Avoid Lock-Held Logging
Circuit breaker state transitions MUST avoid executing logger I/O while internal state locks are held.
#### Scenario: State transition occurs
- **WHEN** circuit breaker transitions between CLOSED, OPEN, or HALF_OPEN
- **THEN** lock-protected section MUST complete state mutation before emitting transition log output
#### Scenario: Slow log handler under load
- **WHEN** logger handlers are slow or blocked
- **THEN** circuit breaker lock contention MUST remain bounded and MUST NOT serialize unrelated request paths behind logging latency
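The required split — mutate under the lock, log after release — reduces to a small pattern; the `CircuitBreaker` below is a minimal sketch of the `_transition_to` decision, not the project's actual class:

```python
import logging
import threading

logger = logging.getLogger("circuit")

class CircuitBreaker:
    """Minimal breaker whose transitions log outside the state lock."""

    def __init__(self):
        self._lock = threading.Lock()
        self.state = "CLOSED"

    def _transition_to(self, new_state, reason):
        with self._lock:                      # lock covers mutation only
            old_state = self.state
            self.state = new_state
            log_line = f"circuit {old_state} -> {new_state}: {reason}"
        logger.warning(log_line)              # logger I/O with lock released
        return log_line
```

A slow or blocked log handler now stalls only the transitioning thread, never other threads waiting on the state lock.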


@@ -0,0 +1,34 @@
## ADDED Requirements
### Requirement: LDAP Authentication Endpoint Configuration SHALL Be Strictly Validated
The system MUST validate LDAP authentication endpoint configuration before use, including HTTPS scheme enforcement and host allowlist checks.
#### Scenario: Invalid LDAP URL configuration detected
- **WHEN** `LDAP_API_URL` is missing, non-HTTPS, or points to a host outside the configured allowlist
- **THEN** the service MUST reject LDAP authentication calls and emit actionable diagnostics without sending credentials to that endpoint
#### Scenario: Valid LDAP URL configuration accepted
- **WHEN** `LDAP_API_URL` uses HTTPS and host is allowlisted
- **THEN** LDAP authentication requests MAY proceed with normal timeout and error handling behavior
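The scheme and allowlist checks can be expressed as one small pure function; `validate_ldap_api_url` is an assumed name, and the `(ok, reason)` return shape is illustrative:

```python
from urllib.parse import urlsplit

def validate_ldap_api_url(url, allowed_hosts):
    """Fail-fast check for LDAP_API_URL: HTTPS only, allowlisted host.

    Returns (ok, reason); callers disable the LDAP path and log the
    reason instead of ever sending credentials to a rejected endpoint.
    """
    if not url:
        return False, "LDAP_API_URL is not configured"
    parts = urlsplit(url)
    if parts.scheme != "https":
        return False, f"scheme {parts.scheme!r} rejected: HTTPS required"
    host = (parts.hostname or "").lower()
    if host not in {h.lower() for h in allowed_hosts}:
        return False, f"host {host!r} not in allowlist"
    return True, "ok"
```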
### Requirement: Security Response Headers SHALL Be Applied Globally
All HTTP responses MUST include baseline security headers suitable for dashboard and API traffic.
#### Scenario: Standard response emitted
- **WHEN** any route returns a response
- **THEN** response MUST include `Content-Security-Policy`, `X-Frame-Options`, `X-Content-Type-Options`, and `Referrer-Policy`
#### Scenario: Production transport hardening
- **WHEN** runtime environment is production
- **THEN** response MUST include `Strict-Transport-Security`
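Against a plain headers mapping, the baseline policy looks like this sketch (the concrete header values are assumptions; a real CSP may need nonces or allowlist entries):

```python
def apply_security_headers(headers, production=False):
    """Set baseline security headers on a response-headers mapping.

    setdefault keeps any route-specific overrides; HSTS is added only
    in production, where traffic is expected to be HTTPS-only.
    """
    headers.setdefault("Content-Security-Policy", "default-src 'self'")
    headers.setdefault("X-Frame-Options", "DENY")
    headers.setdefault("X-Content-Type-Options", "nosniff")
    headers.setdefault("Referrer-Policy", "same-origin")
    if production:
        headers.setdefault(
            "Strict-Transport-Security", "max-age=31536000; includeSubDomains"
        )
    return headers
```

In a Flask app this would be wired once through an `after_request` hook that calls `apply_security_headers(response.headers, production=...)` and returns the response.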
### Requirement: Pagination Input Boundaries SHALL Be Enforced
Endpoints accepting pagination parameters MUST enforce lower and upper bounds before query execution.
#### Scenario: Negative or zero pagination inputs
- **WHEN** client sends `page <= 0` or `page_size <= 0`
- **THEN** server MUST normalize values to minimum supported bounds
#### Scenario: Excessive page size requested
- **WHEN** client sends `page_size` above configured maximum
- **THEN** server MUST clamp to maximum supported page size
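Both scenarios reduce to a clamp helper of this shape (`clamp_pagination` and its default bounds are illustrative):

```python
def clamp_pagination(page, page_size, max_page_size=200, default_page_size=50):
    """Normalize pagination inputs before they reach the query layer.

    Non-numeric input falls back to defaults; numeric input is clamped
    to [1, ...] for page and [1, max_page_size] for page_size.
    """
    def to_int(raw, fallback):
        try:
            return int(raw)
        except (TypeError, ValueError):
            return fallback

    page = max(1, to_int(page, 1))
    page_size = max(1, min(to_int(page_size, default_page_size), max_page_size))
    return page, page_size
```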


@@ -0,0 +1,24 @@
## 1. LDAP Endpoint Hardening
- [x] 1.1 Add strict `LDAP_API_URL` validation (`https` + allowlisted hosts) in auth service initialization.
- [x] 1.2 Add tests for valid/invalid LDAP URL configurations and ensure unsafe URLs are rejected without outbound auth call.
## 2. Bounded Process Cache
- [x] 2.1 Extend `ProcessLevelCache` with configurable `max_size` and LRU eviction behavior.
- [x] 2.2 Wire bounded cache configuration for WIP/Resource process-level caches and add regression tests.
## 3. Circuit Breaker Lock Contention Reduction
- [x] 3.1 Refactor circuit breaker transition logging to execute outside lock-protected section.
- [x] 3.2 Add tests verifying transition logs are emitted while state mutation remains correct.
## 4. HTTP Security Headers and Input Boundary Validation
- [x] 4.1 Add global `after_request` security headers (CSP, frame, content-type, referrer, HSTS in production).
- [x] 4.2 Tighten pagination boundary handling (`page`/`page_size`) for WIP detail endpoint and add tests.
## 5. Validation and Documentation
- [x] 5.1 Run targeted backend/frontend tests plus benchmark smoke to confirm no behavior regression.
- [x] 5.2 Update `README.md` and `README.mdj` with round-2 security/stability hardening notes.


@@ -0,0 +1,33 @@
# api-safety-hygiene Specification
## Purpose
TBD - created by archiving change residual-hardening-round3. Update Purpose after archive.
## Requirements
### Requirement: Recursive Payload Cleaning MUST Enforce Depth Safety
Routes that normalize nested payloads MUST prevent unbounded recursion depth.
#### Scenario: Deeply nested response object
- **WHEN** NaN-cleaning helper receives deeply nested list/dict payload
- **THEN** cleaning logic MUST enforce max depth or iterative traversal and return safely without recursion failure
### Requirement: Filter Source Names MUST Be Configurable
Filter cache query sources MUST NOT rely on hardcoded view names only.
#### Scenario: Environment-specific view names
- **WHEN** deployment sets custom filter-source environment variables
- **THEN** filter cache loader MUST resolve and query configured view names
### Requirement: High-Cost APIs SHALL Apply Basic Rate Guardrails
High-cost read endpoints SHALL apply configurable request-rate guardrails to reduce abuse and accidental bursts.
#### Scenario: Burst traffic from same client
- **WHEN** a client exceeds configured request budget for guarded endpoints
- **THEN** endpoint SHALL return throttled response with clear retry guidance
### Requirement: Common Boolean Query Parsing SHALL Be Shared
Boolean query parsing in routes SHALL use shared helper behavior.
#### Scenario: Different routes parse include flags
- **WHEN** routes parse common boolean query parameters
- **THEN** parsing behavior MUST be consistent across routes via shared utility


@@ -0,0 +1,26 @@
# cache-indexed-query-acceleration Specification
## Purpose
TBD - created by archiving change p1-cache-query-efficiency. Update Purpose after archive.
## Requirements
### Requirement: Incremental Synchronization SHALL Use Versioned Watermarks
For heavy non-full-snapshot datasets, cache refresh SHALL support incremental synchronization keyed by stable version or watermark boundaries.
#### Scenario: Incremental refresh cycle
- **WHEN** source data version indicates partial changes since last sync
- **THEN** cache update logic MUST fetch and merge only changed partitions while preserving correctness guarantees
### Requirement: Query Paths SHALL Use Indexed Access for High-Frequency Filters
Query execution over cached data SHALL use prebuilt indexes for known high-frequency filter columns.
#### Scenario: Filtered report query
- **WHEN** request filters target indexed fields
- **THEN** result selection MUST avoid full dataset scans and maintain existing response contract
### Requirement: Business-Mandated Full-Table Caches SHALL Be Preserved for Resource and WIP
The system SHALL continue to maintain full-table cache behavior for `resource` and `wip` domains.
#### Scenario: Resource or WIP cache refresh
- **WHEN** cache update runs for `resource` or `wip`
- **THEN** the updater MUST retain full-table snapshot semantics and MUST NOT switch these domains to partial-only cache mode


@@ -36,3 +36,53 @@ The system MUST define alert thresholds for sustained degraded state, repeated w
- **WHEN** degraded status persists beyond configured duration
- **THEN** the monitoring contract MUST classify the service as alert-worthy with actionable context
### Requirement: Cache Telemetry SHALL Include Memory Amplification Signals
Operational telemetry MUST expose cache-domain memory usage indicators and representation amplification factors, and MUST differentiate between authoritative data payload and derived/index helper structures.
#### Scenario: Deep health telemetry request after representation normalization
- **WHEN** operators inspect cache telemetry for resource or WIP domains
- **THEN** telemetry MUST include per-domain memory footprint, amplification indicators, and enough structure detail to verify that full-record duplication is not reintroduced
### Requirement: Efficiency Benchmarks SHALL Gate Cache Refactor Rollout
Cache/query efficiency changes MUST be validated against baseline latency and memory benchmarks before rollout.
#### Scenario: Pre-release validation
- **WHEN** cache refactor changes are prepared for deployment
- **THEN** benchmark results MUST demonstrate no regression beyond configured thresholds for P95 latency and memory usage
### Requirement: Process-Level Cache SHALL Use Bounded Capacity with Deterministic Eviction
Process-level parsed-data caches MUST enforce a configurable maximum key capacity and use deterministic eviction behavior when capacity is exceeded.
#### Scenario: Cache capacity reached
- **WHEN** a new cache entry is inserted and key capacity is at limit
- **THEN** cache MUST evict entries according to defined policy before storing the new key
#### Scenario: Repeated access updates recency
- **WHEN** an existing cache key is read or overwritten
- **THEN** eviction order MUST reflect recency semantics so hot keys are retained preferentially
### Requirement: Cache Publish MUST Preserve Previous Readable Snapshot on Failure
When refreshing full-table cache payloads, the system MUST avoid exposing partially published states to readers.
#### Scenario: Publish fails after payload serialization
- **WHEN** a cache refresh has prepared new payload but publish operation fails
- **THEN** previously published cache keys MUST remain readable and metadata MUST remain consistent with old snapshot
#### Scenario: Publish succeeds
- **WHEN** publish operation completes successfully
- **THEN** data payload and metadata keys MUST be visible as one coherent new snapshot
### Requirement: Process-Level Cache Slow Path SHALL Minimize Lock Hold Time
Large payload parsing MUST NOT happen inside long-held process cache locks.
#### Scenario: Cache miss under concurrent requests
- **WHEN** multiple requests hit process cache miss
- **THEN** parsing work SHALL happen outside lock-protected mutation section, and lock scope SHALL be limited to consistency check + commit
### Requirement: Process-Level Cache Policies MUST Stay Consistent Across Services
All service-local process caches MUST support bounded capacity with deterministic eviction.
#### Scenario: Realtime equipment cache growth
- **WHEN** realtime equipment process cache reaches configured capacity
- **THEN** entries MUST be evicted according to deterministic LRU behavior


@@ -24,3 +24,17 @@ Runbooks and deployment documentation MUST describe the same conda/systemd/watch
- **WHEN** an operator performs deploy, health check, and rollback from documentation
- **THEN** documented commands and paths MUST work without requiring venv-specific assumptions
### Requirement: Runtime Path Drift SHALL Be Detectable Before Service Start
Service startup checks MUST validate configured conda runtime paths across app, watchdog, and worker control scripts.
#### Scenario: Conda path mismatch detected
- **WHEN** startup validation finds runtime path inconsistency between configured units and scripts
- **THEN** service start MUST fail with actionable diagnostics instead of running with partial mismatch
### Requirement: Conda/Systemd Contract SHALL Be Versioned in Operations Docs
The documented runtime contract MUST include versioned path assumptions and verification commands.
#### Scenario: Operator verifies deployment contract
- **WHEN** operator follows runbook validation steps
- **THEN** commands MUST confirm active runtime paths match documented conda/systemd contract


@@ -50,3 +50,17 @@ Frontend matrix/filter computations SHALL produce deterministic selection and fi
- **WHEN** users toggle matrix cells across group, family, and resource rows
- **THEN** selected-state rendering and filtered equipment result sets MUST remain level-correct and reversible
### Requirement: Reusable Browser Compute Modules SHALL Power Report Derivations
Derived computations for report filters, KPI cards, chart series, and table projections SHALL be implemented through reusable frontend modules.
#### Scenario: Shared report derivation logic
- **WHEN** multiple report pages require equivalent data-shaping behavior
- **THEN** pages MUST consume shared compute modules instead of duplicating transformation logic per page
### Requirement: Browser Compute Shift SHALL Preserve Export and Field Contracts
Moving computations to frontend MUST preserve existing field naming and export column contracts.
#### Scenario: User exports report after frontend-side derivation
- **WHEN** transformed data is rendered and exported
- **THEN** exported field names and ordering MUST remain consistent with governed field contract definitions


@@ -0,0 +1,19 @@
# maintainability-type-and-constant-hygiene Specification
## Purpose
TBD - created by archiving change residual-hardening-round4. Update Purpose after archive.
## Requirements
### Requirement: Core Cache and Service Boundaries MUST Use Consistent Type Annotation Style
Core cache/service modules touched by this change SHALL use a consistent and explicit type-annotation style for public and internal helper boundaries.
#### Scenario: Reviewing updated cache/service modules
- **WHEN** maintainers inspect function signatures in affected modules
- **THEN** optional and collection types MUST follow a single consistent style and remain compatible with the project Python baseline
### Requirement: High-Frequency Magic Numbers MUST Be Replaced by Named Constants
Cache, throttling, and index-related numeric literals that control behavior MUST be extracted to named constants or env-configurable settings.
#### Scenario: Tuning cache/index behavior
- **WHEN** operators need to tune cache/index thresholds
- **THEN** they MUST find values in named constants or environment variables rather than scattered inline literals


@@ -0,0 +1,19 @@
# oracle-query-fragment-governance Specification
## Purpose
TBD - created by archiving change residual-hardening-round4. Update Purpose after archive.
## Requirements
### Requirement: Shared Oracle Query Fragments SHALL Have a Single Source of Truth
Cross-service Oracle query fragments for resource and equipment cache loading MUST be defined in a shared module and imported by service implementations.
#### Scenario: Update common table/view reference
- **WHEN** a common table or view name changes
- **THEN** operators and developers MUST be able to update one shared definition without editing duplicated SQL literals across services
### Requirement: Service Queries MUST Preserve Existing Columns and Semantics
Services consuming shared Oracle query fragments SHALL preserve existing selected columns, filters, and downstream payload behavior.
#### Scenario: Resource and equipment cache refresh after refactor
- **WHEN** cache services execute queries via shared fragments
- **THEN** resulting payload structure MUST remain compatible with existing aggregation and API contracts


@@ -0,0 +1,26 @@
# resource-cache-representation-normalization Specification
## Purpose
TBD - created by archiving change residual-hardening-round4. Update Purpose after archive.
## Requirements
### Requirement: Resource Derived Index MUST Avoid Full Record Duplication
Resource derived index SHALL use lightweight row-position references instead of storing full duplicated record payloads alongside the process DataFrame cache.
#### Scenario: Build index from cached DataFrame
- **WHEN** resource cache data is parsed from Redis into process-level DataFrame
- **THEN** the derived index MUST store position-based references and metadata without a second full records copy
### Requirement: Resource Query APIs SHALL Preserve Existing Response Contract
Resource query APIs MUST keep existing output fields and semantics after index representation normalization.
#### Scenario: Read all resources after normalization
- **WHEN** callers request all resources or filtered resource lists
- **THEN** the returned payload MUST remain field-compatible with pre-normalization responses
### Requirement: Cache Invalidation MUST Keep Index/Data Coherent
The system SHALL invalidate and rebuild DataFrame/index representations atomically at cache refresh boundaries.
#### Scenario: Redis-backed cache refresh completes
- **WHEN** a new resource cache snapshot is published
- **THEN** stale index references MUST be invalidated before subsequent reads use refreshed DataFrame data


@@ -48,3 +48,47 @@ The system MUST expose machine-readable resilience thresholds, restart-churn ind
#### Scenario: Admin status includes restart churn summary
- **WHEN** operators call `/admin/api/system-status` or `/admin/api/worker/status`
- **THEN** responses MUST include bounded restart history summary within a configured time window and indicate whether churn threshold is exceeded
### Requirement: Recovery Recommendations SHALL Reflect Self-Healing Policy State
Health and admin resilience payloads MUST expose whether automated recovery is allowed, cooling down, or blocked by churn policy.
#### Scenario: Operator inspects degraded state
- **WHEN** `/health` or `/admin/api/worker/status` is requested during degradation
- **THEN** response MUST include policy state, cooldown remaining time, and next recommended action
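The resilience section of such a payload might be derived as below; the field names are illustrative assumptions, not the actual response contract.

```python
import time

def recovery_policy_payload(blocked: bool, cooldown_until: float, now=None) -> dict:
    """Build the (hypothetical) policy-state fragment for /health responses."""
    now = now if now is not None else time.time()
    cooldown_remaining = max(0, int(cooldown_until - now))
    if blocked:
        state, action = "blocked", "manual restart with explicit acknowledgement"
    elif cooldown_remaining > 0:
        state, action = "cooling_down", "wait for cooldown to elapse"
    else:
        state, action = "allowed", "automatic recovery will proceed"
    return {
        "auto_recovery_state": state,
        "cooldown_remaining_seconds": cooldown_remaining,
        "next_recommended_action": action,
    }
```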
### Requirement: Manual Recovery Override SHALL Be Explicit and Controlled
Manual restart actions MUST bypass automatic block only through authenticated operator pathways with explicit acknowledgement.
#### Scenario: Churn-blocked state with manual override request
- **WHEN** authorized admin requests manual restart while auto-recovery is blocked
- **THEN** system MUST execute controlled restart path and log the override context for auditability
### Requirement: Circuit Breaker State Transitions SHALL Avoid Lock-Held Logging
Circuit breaker state transitions MUST avoid executing logger I/O while internal state locks are held.
#### Scenario: State transition occurs
- **WHEN** circuit breaker transitions between CLOSED, OPEN, or HALF_OPEN
- **THEN** lock-protected section MUST complete state mutation before emitting transition log output
#### Scenario: Slow log handler under load
- **WHEN** logger handlers are slow or blocked
- **THEN** circuit breaker lock contention MUST remain bounded and MUST NOT serialize unrelated request paths behind logging latency
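The lock discipline above amounts to capturing the transition under the lock and deferring logger I/O until after release, as in this minimal sketch (a simplified breaker, not the project's actual implementation):

```python
import logging
import threading

logger = logging.getLogger("circuit_breaker")

class CircuitBreaker:
    """Mutate state under the lock; emit the transition log only after release."""
    def __init__(self):
        self._lock = threading.Lock()
        self.state = "CLOSED"

    def transition(self, new_state: str) -> str:
        with self._lock:
            # Lock-held section: state mutation only, no I/O.
            old_state, self.state = self.state, new_state
        # Logger I/O outside the lock: a slow or blocked handler cannot
        # serialize unrelated request paths behind logging latency.
        logger.info("circuit breaker %s -> %s", old_state, new_state)
        return old_state
```

Capturing `old_state` inside the critical section keeps the log line accurate even if another transition races in before the log is emitted.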
### Requirement: Health Endpoints SHALL Use Short Internal Memoization
Health and deep-health computation SHALL use a short-lived internal cache to prevent probe storms from amplifying backend load.
#### Scenario: Frequent monitor scrapes
- **WHEN** health endpoints are called repeatedly within a small window
- **THEN** the service SHALL return the memoized payload for up to 5 seconds in non-testing environments
#### Scenario: Testing mode
- **WHEN** app is running in testing mode
- **THEN** health endpoint memoization MUST be bypassed to preserve deterministic tests
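A short-TTL memo with a testing bypass can be sketched as follows (class name and defaults are hypothetical; the real service would key this per endpoint):

```python
import time

class HealthMemo:
    """Short-lived memoization for health payloads; bypassed in testing mode."""
    def __init__(self, compute, ttl: float = 5.0, testing: bool = False):
        self._compute = compute
        self._ttl = ttl
        self._testing = testing
        self._cached = None
        self._cached_at = -float("inf")  # forces a compute on first call

    def get(self, now=None):
        now = now if now is not None else time.monotonic()
        if self._testing or now - self._cached_at > self._ttl:
            self._cached = self._compute()
            self._cached_at = now
        return self._cached
```

Under a probe storm, the expensive deep-health computation runs at most once per TTL window, while `testing=True` keeps every test call deterministic.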
### Requirement: Logs MUST Redact Connection Secrets
Runtime logs MUST avoid exposing DB connection credentials.
#### Scenario: Connection string appears in log message
- **WHEN** a log message contains DB URL credentials
- **THEN** logger output MUST redact password and sensitive userinfo before emission
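Redaction can be done with a small filter over the URL userinfo before the record is emitted; this sketch (regex and function name are illustrative) masks the password while keeping the username and host for diagnostics.

```python
import re

# Matches the userinfo portion of a URL (scheme://user:password@host),
# keeping the username and masking only the password.
_DB_URL_CREDS = re.compile(r"(://[^/\s:@]+):([^@\s]+)@")

def redact_db_url(message: str) -> str:
    """Replace the password in any embedded connection URL before emission."""
    return _DB_URL_CREDS.sub(r"\1:***@", message)

print(redact_db_url("connect failed: mysql://app:s3cret@db:3306/main"))
# connect failed: mysql://app:***@db:3306/main
```

In practice this would sit in a `logging.Filter` or formatter so every handler benefits, rather than relying on call sites to remember it.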


@@ -0,0 +1,38 @@
# security-surface-hardening Specification
## Purpose
TBD - created by archiving change security-stability-hardening-round2. Update Purpose after archive.
## Requirements
### Requirement: LDAP Authentication Endpoint Configuration SHALL Be Strictly Validated
The system MUST validate LDAP authentication endpoint configuration before use, including HTTPS scheme enforcement and host allowlist checks.
#### Scenario: Invalid LDAP URL configuration detected
- **WHEN** `LDAP_API_URL` is missing, non-HTTPS, or points to a host outside the configured allowlist
- **THEN** the service MUST reject LDAP authentication calls and emit actionable diagnostics without sending credentials to that endpoint
#### Scenario: Valid LDAP URL configuration accepted
- **WHEN** `LDAP_API_URL` uses HTTPS and host is allowlisted
- **THEN** LDAP authentication requests MAY proceed with normal timeout and error handling behavior
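The validation gate can be a single function called before any credentials leave the process; the allowlist host below is a placeholder, not a real endpoint.

```python
from urllib.parse import urlparse

ALLOWED_LDAP_HOSTS = {"auth.example.internal"}  # hypothetical allowlist

def validate_ldap_api_url(url: str) -> str:
    """Return the URL if HTTPS and allowlisted; raise before credentials are sent."""
    if not url:
        raise ValueError("LDAP_API_URL is not configured")
    parsed = urlparse(url)
    if parsed.scheme != "https":
        raise ValueError(f"LDAP_API_URL must use https, got {parsed.scheme!r}")
    if parsed.hostname not in ALLOWED_LDAP_HOSTS:
        raise ValueError(f"LDAP host {parsed.hostname!r} is not in the allowlist")
    return url
```

Raising with a specific message satisfies the "actionable diagnostics" scenario while guaranteeing no credential ever reaches a misconfigured endpoint.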
### Requirement: Security Response Headers SHALL Be Applied Globally
All HTTP responses MUST include baseline security headers suitable for dashboard and API traffic.
#### Scenario: Standard response emitted
- **WHEN** any route returns a response
- **THEN** response MUST include `Content-Security-Policy`, `X-Frame-Options`, `X-Content-Type-Options`, and `Referrer-Policy`
#### Scenario: Production transport hardening
- **WHEN** runtime environment is production
- **THEN** response MUST include `Strict-Transport-Security`
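A framework-agnostic sketch of the header policy (in Flask this would run in an `after_request` hook); the specific header values are assumptions to be tuned per deployment.

```python
def apply_security_headers(headers: dict, production: bool = False) -> dict:
    """Add baseline security headers to a response; HSTS only in production."""
    headers.setdefault("Content-Security-Policy", "default-src 'self'")
    headers.setdefault("X-Frame-Options", "DENY")
    headers.setdefault("X-Content-Type-Options", "nosniff")
    headers.setdefault("Referrer-Policy", "strict-origin-when-cross-origin")
    if production:
        # HSTS must only be sent over HTTPS-terminated production traffic.
        headers.setdefault("Strict-Transport-Security",
                           "max-age=31536000; includeSubDomains")
    return headers
```

Using `setdefault` lets individual routes override a header deliberately without the global hook clobbering it.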
### Requirement: Pagination Input Boundaries SHALL Be Enforced
Endpoints accepting pagination parameters MUST enforce lower and upper bounds before query execution.
#### Scenario: Negative or zero pagination inputs
- **WHEN** client sends `page <= 0` or `page_size <= 0`
- **THEN** server MUST normalize values to minimum supported bounds
#### Scenario: Excessive page size requested
- **WHEN** client sends `page_size` above configured maximum
- **THEN** server MUST clamp to maximum supported page size
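Both scenarios reduce to a simple clamp applied before query construction; the bounds below are illustrative defaults.

```python
def clamp_pagination(page: int, page_size: int,
                     min_page_size: int = 1, max_page_size: int = 100):
    """Enforce lower and upper pagination bounds before query execution."""
    page = max(page, 1)                        # page <= 0 -> first page
    page_size = max(page_size, min_page_size)  # page_size <= 0 -> minimum
    page_size = min(page_size, max_page_size)  # excessive -> clamp to maximum
    return page, page_size
```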


@@ -0,0 +1,26 @@
# worker-self-healing-governance Specification
## Purpose
TBD - created by archiving change p2-ops-self-healing-runbook. Update Purpose after archive.
## Requirements
### Requirement: Automated Worker Recovery SHALL Use Bounded Policy Guards
Automated worker restart behavior MUST enforce cooldown periods and bounded restart attempts within a configurable time window.
#### Scenario: Repeated worker degradation within short window
- **WHEN** degradation events exceed configured restart-attempt budget
- **THEN** automated restarts MUST pause and surface a blocked-recovery signal for operator intervention
### Requirement: Restart-Churn Protection SHALL Prevent Recovery Storms
The runtime MUST classify restart churn and prevent uncontrolled restart loops.
#### Scenario: Churn threshold exceeded
- **WHEN** restart count crosses churn threshold in active window
- **THEN** watchdog MUST enter guarded mode and require explicit manual override before further restart attempts
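The two requirements above (cooldown plus bounded attempts, and churn-triggered guarded mode) can be combined into one policy object; this is a sketch with hypothetical names and thresholds, not the watchdog's actual implementation.

```python
import time

class RestartPolicy:
    """Cooldown + bounded attempts in a window; guarded mode needs manual override."""
    def __init__(self, cooldown=60, max_attempts=3, window=600):
        self.cooldown = cooldown
        self.max_attempts = max_attempts
        self.window = window
        self._attempts = []
        self._last_restart = None
        self.guarded = False  # churn protection engaged

    def allow_restart(self, now=None, manual_override=False):
        now = now if now is not None else time.time()
        # Drop attempts that fell out of the active window.
        self._attempts = [t for t in self._attempts if now - t <= self.window]
        if self.guarded and not manual_override:
            return False  # blocked until an operator explicitly overrides
        if (self._last_restart is not None
                and now - self._last_restart < self.cooldown
                and not manual_override):
            return False  # still cooling down
        if len(self._attempts) >= self.max_attempts and not manual_override:
            self.guarded = True  # churn threshold crossed: enter guarded mode
            return False
        self._attempts.append(now)
        self._last_restart = now
        return True
```

An authenticated admin path would call `allow_restart(manual_override=True)` and log the override context, matching the manual-override requirement above.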
### Requirement: Recovery Decisions SHALL Be Audit-Ready
Every auto-recovery decision and manual override action MUST be recorded with structured metadata.
#### Scenario: Worker restart decision emitted
- **WHEN** system executes or denies a restart action
- **THEN** structured logs/events MUST include reason, thresholds, actor/source, and resulting state
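One minimal shape for such an audit record, emitted as a single structured log line (all field names here are illustrative assumptions about the eventual schema):

```python
import json
import logging

logger = logging.getLogger("recovery_audit")

def emit_recovery_audit(action, allowed, reason, actor, thresholds, resulting_state):
    """Emit one structured record per executed or denied restart decision."""
    record = {
        "event": "worker_recovery_decision",
        "action": action,                  # e.g. "restart"
        "allowed": allowed,                # executed vs denied
        "reason": reason,                  # why the decision was made
        "actor": actor,                    # "watchdog" or an operator identity
        "thresholds": thresholds,          # policy values in force at decision time
        "resulting_state": resulting_state,
    }
    logger.info(json.dumps(record))
    return record
```

Keeping the record machine-parseable (one JSON object per line) makes churn analysis and override audits straightforward to query later.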