feat(reject-history): add materialized Pareto aggregate layer with feature-flagged rollout

Pre-compute 6-dimension metric cubes from cached LOT-level DataFrames so
interactive Pareto requests read compact snapshots instead of re-scanning
detail rows on every filter change. Includes single-flight build guard,
TTL/size guardrails, cross-filter exclude-self evaluation, safe legacy
fallback, response metadata exposure, telemetry counters, and a 3-stage
rollout plan (telemetry-only → build-enabled → read-through).

Co-Authored-By: Claude Opus 4.6 <noreply@anthropic.com>
Author: egg
Date: 2026-03-04 08:05:02 +08:00
parent 98eea066ea
commit e79fb657a3
22 changed files with 2500 additions and 484 deletions


@@ -0,0 +1,2 @@
schema: spec-driven
created: 2026-03-03


@@ -0,0 +1,89 @@
## Context
Reject History currently runs in two phases: `POST /api/reject-history/query` first builds a lot-level dataset cache, and subsequent `reason-pareto` / `batch-pareto` / `view` requests repeatedly recompute over the full detail rows with pandas inside the worker. When a user cross-filters multi-dimension Paretos over a wide date range, this amplifies memory pressure in multi-worker deployments, producing high RSS, response jitter, and even worker instability.
The goal of this change is to add a "Pareto pre-aggregation/materialization" layer, without changing the frontend API contract, so interactive Pareto requests read aggregate snapshots instead of scanning lot-level detail on every call.
## Goals / Non-Goals
**Goals:**
- Have `batch-pareto` and `reason-pareto` (cache path) prefer reading the materialized Pareto snapshot.
- Reduce the worker memory and CPU needed per interaction by avoiding repeated full-dataset groupby operations.
- Define consistency rules for snapshots (version/freshness/invalidation) so stale or mismatched data is never returned.
- Fill in operational observability: build latency, hit/miss, fallback reasons, snapshot size.
- Keep the response schema and frontend interaction semantics (cross-filter, top80/top20) compatible.
**Non-Goals:**
- No changes to the Oracle detail SQL or the primary query's data source.
- No new "hard-reject wide date ranges" rule.
- No changes to the frontend Pareto components or the URL parameter contract.
- No new storage system introduced in this change (reuse the existing process cache + Redis/spool ecosystem).
## Decisions
### 1) Add a standalone Pareto materialization service, decoupled from the lot-level cache
- Decision: Add `reject_pareto_materialized.py` (name subject to change) to own snapshot build/read/invalidate, rather than embedding the logic directly in the route.
- Why: Centralizes the aggregate lifecycle, key strategy, and telemetry in one place, and keeps `reject_dataset_cache.py` from growing more complex.
- Alternative considered:
  - Add a dict cache directly inside `compute_batch_pareto()`: it entangles easily with the dataset cache lifecycle, and cross-worker hit rates would be poor.
### 2) Bind snapshot keys to the query dataset and filter context, plus a schema version
- Decision: The key includes at least `query_id`, policy toggles, supplementary filters, a trend_dates hash, and the materialization schema version.
- Why: Prevents a snapshot from being reused across different filter contexts; a schema change can force-invalidate old data.
- Alternative considered:
  - Key on `query_id` only: cannot distinguish supplementary filters, so it can easily return the wrong Pareto.
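A minimal sketch of the key-builder idea, assuming a JSON-canonicalization approach; the function name, normalization rules, and `PARETO_SCHEMA_VERSION` constant are illustrative, not the actual implementation:

```python
import hashlib
import json

PARETO_SCHEMA_VERSION = 3  # bump to invalidate all snapshots built under older schemas


def build_snapshot_key(query_id: str,
                       policy_toggles: dict,
                       supplementary_filters: dict,
                       trend_dates: list[str]) -> str:
    """Derive a canonical snapshot key; identical filter contexts must hash identically."""

    def _normalize(d: dict) -> dict:
        # Sort keys and drop empty values so {"x": None} and {} cannot
        # yield different keys for the same effective filter context.
        return {k: d[k] for k in sorted(d) if d[k] not in (None, "", [], {})}

    payload = json.dumps(
        {
            "query_id": query_id,
            "policy": _normalize(policy_toggles),
            "filters": _normalize(supplementary_filters),
            "trend_dates": sorted(trend_dates),
            "schema_version": PARETO_SCHEMA_VERSION,
        },
        sort_keys=True,
        separators=(",", ":"),
    )
    digest = hashlib.sha256(payload.encode("utf-8")).hexdigest()[:16]
    return f"pareto:v{PARETO_SCHEMA_VERSION}:{query_id}:{digest}"
```

Because the schema version is hashed into both the payload and the key prefix, incrementing it makes every old key unreachable, which is the force-invalidation behavior the decision calls for.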
### 3) Store the minimal aggregates needed for cross-filter computation, not full detail
- Decision: The snapshot stores the six-dimension aggregate results plus the intermediate structures required for cross-filtering (enough to rebuild any cross-filter), and does not copy lot-level rows.
- Why: The goal is to cut memory amplification; storing full detail would defeat the purpose.
- Alternative considered:
  - Store only the final items per dimension: smaller, but cannot support recomputation under arbitrary cross-filters.
### 4) Read path: materialized first, with safe fallback
- Decision: `batch-pareto` / `reason-pareto(query_id)` read the snapshot first; on miss/stale/build-fail they fall back to the existing cached-DataFrame computation and emit telemetry.
- Why: Preserves availability and enables gradual rollout, avoiding an all-at-once switch that could break the feature.
- Alternative considered:
  - Hard-error on snapshot miss: high risk and hostile to existing users.
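The read-through decision can be sketched as follows; the `Snapshot` shape, store/telemetry interfaces, and TTL default are assumptions for illustration, not the real module's API:

```python
import time
from dataclasses import dataclass

PARETO_SCHEMA_VERSION = 3


@dataclass
class Snapshot:
    payload: dict
    schema_version: int
    built_at: float
    ttl_seconds: float = 900.0

    def is_expired(self) -> bool:
        return time.time() - self.built_at > self.ttl_seconds


def get_batch_pareto(key, store, legacy_compute, telemetry):
    """Prefer a fresh materialized snapshot; fall back to the legacy
    DataFrame path on miss/stale, tagging telemetry with a stable reason code."""
    snap = store.get(key)
    if snap is None:
        telemetry.append(("fallback", "miss"))
        return legacy_compute(), {"source": "legacy", "fallback_reason": "miss"}
    if snap.schema_version != PARETO_SCHEMA_VERSION or snap.is_expired():
        telemetry.append(("fallback", "stale"))
        return legacy_compute(), {"source": "legacy", "fallback_reason": "stale"}
    telemetry.append(("hit", None))
    return snap.payload, {"source": "materialized",
                          "snapshot_version": snap.schema_version}
```

Note that both fallback branches return the same response shape as the hit branch, which is what keeps the API contract unchanged during rollout.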
### 5) Single-flight builds with capacity limits
- Decision: Allow only one concurrent build per snapshot key (other requests wait or read the old value), and cap per-snapshot size plus apply TTLs across keys.
- Why: Avoids thundering-herd builds and uncontrolled Redis/spool growth.
- Alternative considered:
  - No locking at all: under high concurrency, duplicate builds amplify CPU, memory, and IO.
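A minimal in-process single-flight sketch, assuming a thread-per-request worker model; class and method names are hypothetical (a cross-worker deployment would need a distributed lock, e.g. Redis `SET NX`, instead):

```python
import threading


class SingleFlight:
    """Allow only one in-flight build per key; concurrent callers for the
    same key wait for the leader and reuse its result."""

    def __init__(self):
        self._lock = threading.Lock()
        self._inflight: dict[str, threading.Event] = {}
        self._results: dict[str, object] = {}

    def build(self, key, builder):
        with self._lock:
            event = self._inflight.get(key)
            if event is None:
                # No build in progress: this caller becomes the leader.
                event = threading.Event()
                self._inflight[key] = event
                leader = True
            else:
                leader = False
        if leader:
            try:
                self._results[key] = builder()
            finally:
                with self._lock:
                    del self._inflight[key]
                event.set()  # wake followers even if the build raised
            return self._results[key]
        event.wait()
        return self._results.get(key)
```

Only *concurrent* callers are deduplicated; a caller arriving after the build completes becomes a new leader, which is the intended single-flight semantics (freshness is handled separately by the TTL rules).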
### 6) Fold observability into the existing cache telemetry contract
- Decision: Add pareto-materialized fields (hit/miss/fallback/build/size/freshness) to the existing cache observability structure.
- Why: Operations can diagnose cache issues from a single entry point instead of querying multiple APIs.
- Alternative considered:
  - Logs only: hard to trend or alert on.
## Risks / Trade-offs
- [Risk] An incomplete snapshot key composition causes contamination (cross-user/filter pollution).
  - Mitigation: Unit tests for the key builder covering parameter ordering, empty-value normalization, and version isolation.
- [Risk] A flawed intermediate aggregate design makes cross-filter results diverge from the legacy path.
  - Mitigation: Parity tests comparing materialized vs. legacy results for the same `query_id`.
- [Risk] A persistently high fallback ratio erases the benefit of materialization.
  - Mitigation: Telemetry always emits the fallback reason; set alert thresholds and make them a rollout gate.
- [Risk] The added build step lengthens the first Pareto response.
  - Mitigation: The first request can return the legacy result while the snapshot fills in the background, or build synchronously under a timeout cap; controlled by configuration.
## Migration Plan
1. Implement the materialization service (key, payload, build/read/invalidate, telemetry).
2. Wire `compute_batch_pareto` and `compute_dimension_pareto` (query_id path) into the read-through/fallback flow.
3. Add parity tests (legacy vs. materialized) and stress tests (repeated cross-filter toggling).
4. Roll out gradually: telemetry-only / build-disabled first, then read-through, then raise the hit ratio.
5. Monitor hit ratio, fallback ratio, and worker RSS trends; only after targets are met, consider retiring the legacy path.
Rollback strategy:
- Turn off the materialized read path via feature flag to revert immediately to the current DataFrame computation, with no impact on the API schema.
## Open Questions
- Should the materialized payload live in Redis (fast) or spool (durability-first), and is a two-tier strategy needed?
- Should the first build block synchronously or run in the background (a UX-latency trade-off)?
- Longer term, should materialization move earlier, running right after the primary query completes, to buy interaction stability?


@@ -0,0 +1,37 @@
## Why
The reject-history Pareto workflow currently recomputes aggregates from full lot-level cached datasets on every interactive filter change. Under wide date ranges and cross-filtering, this drives repeated high-memory pandas operations and can destabilize workers.
## What Changes
- Introduce a materialized Pareto aggregate layer for reject-history that precomputes dimension metrics from cached query datasets.
- Serve `/api/reject-history/batch-pareto` and related Pareto reads from pre-aggregated artifacts instead of recomputing from full detail rows each request.
- Add freshness/version metadata so aggregate snapshots stay aligned with source query datasets and policy toggles.
- Add bounded invalidation and lifecycle rules for aggregate artifacts to avoid stale growth and memory pressure.
- Add observability for aggregate build latency, hit ratio, memory footprint, and fallback reasons.
- Keep API response schema compatible with existing frontend contracts to avoid UI rewrites.
## Capabilities
### New Capabilities
- `reject-history-pareto-materialized-aggregate`: Build, store, and read pre-aggregated Pareto data for interactive cross-filter workflows.
### Modified Capabilities
- `reject-history-api`: Route Pareto endpoints to materialized aggregates with cache-consistency and fallback behavior.
- `cache-observability-hardening`: Extend telemetry to cover aggregate generation/hit/fallback and memory-guard events.
## Impact
- Affected backend code:
- `src/mes_dashboard/services/reject_dataset_cache.py`
- `src/mes_dashboard/services/reject_history_service.py`
- `src/mes_dashboard/routes/reject_history_routes.py`
- new aggregate builder/storage modules under `src/mes_dashboard/services/`
- Affected APIs:
- `/api/reject-history/batch-pareto`
- `/api/reject-history/reason-pareto` (cache-backed path)
- Affected runtime systems:
- Process L1 cache, Redis/L2 cache, spool lifecycle and metrics history
- Dependencies/ops:
- Additional Redis key space and retention policy tuning for aggregate artifacts
- New monitoring/alerts for aggregate freshness and fallback rates


@@ -0,0 +1,30 @@
## MODIFIED Requirements
### Requirement: Cache Telemetry MUST be Queryable for Operations
The system MUST provide cache telemetry suitable for operations diagnostics, including materialized Pareto cache behavior for reject-history workloads.
#### Scenario: Telemetry inspection
- **WHEN** operators request deep health status
- **THEN** cache-related metrics/state SHALL be present and interpretable for troubleshooting
#### Scenario: Materialized Pareto telemetry visibility
- **WHEN** materialized Pareto cache is enabled
- **THEN** telemetry SHALL expose at least hit count/rate, miss count/rate, build count, build failure count, and fallback count
- **THEN** telemetry SHALL expose latest snapshot freshness indicators and aggregate payload size indicators
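One way the telemetry payload section could be shaped to satisfy these scenarios; the field names and function are illustrative only — the requirement fixes which signals must exist, not their spelling:

```python
def materialized_telemetry_snapshot(counters: dict) -> dict:
    """Shape the pareto-materialized section of the cache telemetry payload
    from raw counters (hypothetical input/output field names)."""
    hits = counters.get("hits", 0)
    misses = counters.get("misses", 0)
    reads = hits + misses
    return {
        "hits": hits,
        "misses": misses,
        "hit_rate": hits / reads if reads else None,
        "builds": counters.get("builds", 0),
        "build_failures": counters.get("build_failures", 0),
        "fallbacks": counters.get("fallbacks", 0),
        # Freshness and size indicators required by the second scenario.
        "latest_snapshot_age_seconds": counters.get("latest_snapshot_age_seconds"),
        "latest_snapshot_bytes": counters.get("latest_snapshot_bytes"),
    }
```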
## ADDED Requirements
### Requirement: Pareto materialization fallback reasons SHALL be operationally classifiable
Telemetry MUST classify fallback outcomes with stable reason codes so repeated degradations can be monitored and alerted.
#### Scenario: Snapshot miss fallback reason
- **WHEN** request falls back because no snapshot exists
- **THEN** telemetry SHALL record a stable reason code for snapshot miss
#### Scenario: Snapshot stale fallback reason
- **WHEN** request falls back because snapshot fails freshness/version checks
- **THEN** telemetry SHALL record a stable reason code for stale/incompatible snapshot
#### Scenario: Build failure fallback reason
- **WHEN** request falls back because materialization build failed
- **THEN** telemetry SHALL record a stable reason code for build failure
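The three scenarios above imply a small closed set of reason codes; a sketch of how they might be pinned down (the enum name and string values are assumptions consistent with the codes listed later in the tasks file):

```python
from enum import Enum


class FallbackReason(str, Enum):
    """Stable, alertable reason codes for materialized-Pareto fallbacks."""
    MISS = "miss"                  # no snapshot exists for the key
    STALE = "stale"                # snapshot failed freshness/version checks
    BUILD_FAILED = "build_failed"  # materialization build raised or timed out
```

Deriving codes from an enum rather than ad-hoc strings is what makes them "stable": telemetry consumers and alert rules can rely on the value set not drifting.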


@@ -0,0 +1,57 @@
## MODIFIED Requirements
### Requirement: Reject History API SHALL provide batch Pareto endpoint with cross-filter
The API SHALL provide a batch Pareto endpoint that returns all 6 dimension Pareto results in a single response, supporting cross-dimension filtering with exclude-self logic, and SHALL prefer materialized Pareto snapshots over full detail regrouping.
#### Scenario: Batch Pareto response structure
- **WHEN** `GET /api/reject-history/batch-pareto` is called with valid `query_id`
- **THEN** response SHALL be `{ success: true, data: { dimensions: { reason: {...}, package: {...}, type: {...}, workflow: {...}, workcenter: {...}, equipment: {...} } } }`
- **THEN** each dimension object SHALL include `items` array with schema (`reason`, `metric_value`, `pct`, `cumPct`, `MOVEIN_QTY`, `REJECT_TOTAL_QTY`, `DEFECT_QTY`, `count`)
#### Scenario: Cross-filter exclude-self logic
- **WHEN** `sel_reason=A&sel_type=X` is provided
- **THEN** reason Pareto SHALL be computed with type=X filter applied (but NOT reason=A filter)
- **THEN** type Pareto SHALL be computed with reason=A filter applied (but NOT type=X filter)
- **THEN** package/workflow/workcenter/equipment Paretos SHALL be computed with both reason=A AND type=X filters applied
#### Scenario: Empty selections return unfiltered Paretos
- **WHEN** batch-pareto is called with no `sel_*` parameters
- **THEN** all 6 dimensions SHALL return their full Pareto distribution (subject to `pareto_scope`)
#### Scenario: Cache-only computation
- **WHEN** `query_id` does not exist in cache
- **THEN** the endpoint SHALL return HTTP 400 with error message indicating cache miss
- **THEN** the endpoint SHALL NOT fall back to Oracle query
#### Scenario: Materialized snapshot preferred
- **WHEN** a valid and fresh materialized Pareto snapshot exists for the request context
- **THEN** the endpoint SHALL return results from that snapshot
- **THEN** the endpoint SHALL avoid full lot-level DataFrame regrouping for the same request
#### Scenario: Materialized miss fallback behavior
- **WHEN** materialized snapshot is unavailable, stale, or build fails
- **THEN** the endpoint SHALL fall back to legacy cache DataFrame computation
- **THEN** the response schema and filter semantics SHALL remain unchanged
#### Scenario: Supplementary and policy filters apply
- **WHEN** batch-pareto is called with supplementary filters (packages, workcenter_groups, reason) and policy toggles
- **THEN** all 6 dimension Paretos SHALL be computed after applying policy and supplementary filters first (before cross-filter)
#### Scenario: Display scope (TOP20) support
- **WHEN** `pareto_display_scope=top20` is provided
- **THEN** applicable dimensions (type, workflow, equipment) SHALL truncate results to top 20 items after sorting
- **WHEN** `pareto_display_scope` is omitted or `all`
- **THEN** all items SHALL be returned (subject to `pareto_scope` filter)
## ADDED Requirements
### Requirement: Reject History API SHALL expose materialized Pareto freshness metadata
The API SHALL surface stable metadata so operators and clients can identify whether Pareto responses came from materialized snapshots or fallback paths.
#### Scenario: Materialized hit metadata
- **WHEN** batch pareto response is served from materialized snapshot
- **THEN** response metadata SHALL indicate materialized source and snapshot freshness/version identifiers
#### Scenario: Fallback metadata
- **WHEN** response uses legacy fallback due to snapshot miss/stale/build failure
- **THEN** response metadata SHALL include a stable fallback reason code
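The exclude-self cross-filter scenarios above can be sketched over plain row dicts (the production path operates on cached aggregate structures, and the row/field names here are illustrative):

```python
def cross_filtered_paretos(rows, selections):
    """For each dimension, apply every *other* dimension's selection but
    never its own, then rank by summed qty with pct/cumPct."""
    dimensions = ["reason", "package", "type", "workflow", "workcenter", "equipment"]
    result = {}
    for dim in dimensions:
        # Exclude-self: drop this dimension's own selection before filtering.
        active = {d: v for d, v in selections.items() if d != dim}
        kept = [r for r in rows if all(r[d] == v for d, v in active.items())]
        totals = {}
        for r in kept:
            totals[r[dim]] = totals.get(r[dim], 0) + r["qty"]
        grand = sum(totals.values())
        items = sorted(totals.items(), key=lambda kv: kv[1], reverse=True)
        cum = 0
        result[dim] = []
        for name, qty in items:
            cum += qty
            result[dim].append({"name": name, "qty": qty,
                                "pct": qty / grand if grand else 0,
                                "cumPct": cum / grand if grand else 0})
    return result
```

With `sel_reason=A` and `sel_type=X`, the reason Pareto sees only the `type=X` filter, the type Pareto only `reason=A`, and every other dimension sees both — exactly the parity the spec requires between materialized and legacy paths.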


@@ -0,0 +1,49 @@
## ADDED Requirements
### Requirement: Reject History Pareto materialization SHALL build reusable aggregate snapshots
The system SHALL build reusable Pareto aggregate snapshots from cached reject-history query datasets so interactive Pareto requests do not require full lot-level regrouping on every call.
#### Scenario: Build snapshot from cached dataset
- **WHEN** a valid `query_id` has cached reject-history dataset and Pareto data is requested
- **THEN** the system SHALL build a materialized snapshot containing the six supported Pareto dimensions (`reason`, `package`, `type`, `workflow`, `workcenter`, `equipment`)
- **THEN** the snapshot SHALL include quantities needed to compute `metric_value`, `pct`, `cumPct`, and affected count fields
#### Scenario: Build skipped for missing dataset cache
- **WHEN** the referenced `query_id` dataset is missing or expired
- **THEN** snapshot build SHALL NOT proceed
- **THEN** the caller SHALL receive a deterministic cache-miss outcome
### Requirement: Materialized snapshot keys SHALL encode filter identity and schema version
The system SHALL key materialized Pareto snapshots by canonical filter identity and schema version to prevent cross-context reuse.
#### Scenario: Distinct supplementary filters generate distinct snapshots
- **WHEN** two requests share the same `query_id` but differ in supplementary filters or policy toggles
- **THEN** they SHALL resolve to different materialized snapshot keys
#### Scenario: Schema version invalidates prior snapshots
- **WHEN** materialization schema version is incremented
- **THEN** snapshots produced by prior versions SHALL NOT be treated as valid hits
### Requirement: Materialized snapshots SHALL preserve cross-filter semantics
Materialized read paths SHALL produce the same cross-filter behavior as legacy DataFrame-based Pareto computation.
#### Scenario: Exclude-self behavior parity
- **WHEN** `sel_reason=A` and `sel_type=X` are active
- **THEN** reason Pareto SHALL be computed with `type=X` applied but without `reason=A` self-filter
- **THEN** type Pareto SHALL be computed with `reason=A` applied but without `type=X` self-filter
#### Scenario: Multi-dimension intersection parity
- **WHEN** multiple `sel_*` filters are active across dimensions
- **THEN** each non-excluded dimension result SHALL reflect the AND intersection of all other selected dimensions
### Requirement: Materialized snapshots SHALL enforce bounded lifecycle and capacity
Materialized Pareto cache storage SHALL be bounded by TTL and size guardrails to avoid unbounded memory growth.
#### Scenario: Snapshot expiry follows configured retention
- **WHEN** a materialized snapshot exceeds configured TTL
- **THEN** it SHALL be treated as expired and SHALL NOT be returned as a cache hit
#### Scenario: Oversized snapshot handling
- **WHEN** a snapshot build exceeds configured snapshot size guardrail
- **THEN** the snapshot SHALL be rejected or degraded according to policy
- **THEN** the system SHALL record the rejection/degradation reason for operations telemetry
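A compact admission check capturing the TTL and size guardrails above; the thresholds and function name are placeholders — the requirement says these limits are configuration-driven:

```python
SNAPSHOT_TTL_SECONDS = 900        # assumed default; configuration-driven in practice
SNAPSHOT_MAX_BYTES = 2_000_000    # assumed per-snapshot size guardrail


def admit_snapshot(payload_bytes: int, built_at: float, now: float):
    """Gate a snapshot on TTL and size guardrails.

    Returns (admitted, reason): reason is None on admission, else a stable
    code suitable for the operations telemetry the spec requires."""
    if now - built_at > SNAPSHOT_TTL_SECONDS:
        return False, "expired"
    if payload_bytes > SNAPSHOT_MAX_BYTES:
        return False, "oversized"
    return True, None
```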


@@ -0,0 +1,35 @@
## 1. Materialization Service Foundation
- [x] 1.1 Create a dedicated reject Pareto materialization service module with key builder, payload schema versioning, and read/write interfaces
- [x] 1.2 Implement canonical filter-context hashing (policy toggles, supplementary filters, trend dates) for materialized snapshot key isolation
- [x] 1.3 Implement single-flight guard for concurrent snapshot builds targeting the same key
- [x] 1.4 Add TTL and payload-size guardrails for materialized snapshots with explicit rejection paths
## 2. Snapshot Build and Compute Path
- [x] 2.1 Implement snapshot build pipeline from cached reject dataset to six-dimension aggregate structures
- [x] 2.2 Implement cross-filter evaluation on materialized structures with exclude-self parity to current batch Pareto behavior
- [x] 2.3 Implement `pareto_scope` (`top80`/`all`) and `pareto_display_scope` compatibility on materialized outputs
- [x] 2.4 Add deterministic invalidation rules for stale or schema-mismatched snapshots
## 3. API Integration and Compatibility
- [x] 3.1 Integrate materialized read-through path into `compute_batch_pareto` and cached `compute_dimension_pareto` flow
- [x] 3.2 Implement safe fallback to legacy DataFrame-based compute when snapshot is missing, stale, or build fails
- [x] 3.3 Add response metadata fields for materialized source/freshness/version and fallback reason codes without breaking existing payload schema
- [x] 3.4 Ensure cache-miss behavior for missing `query_id` remains unchanged (no Oracle fallback)
## 4. Observability and Operations Signals
- [x] 4.1 Extend cache telemetry payload to include materialized hit/miss/build/fallback counters and rates
- [x] 4.2 Add snapshot freshness and payload-size telemetry fields to deep health diagnostics
- [x] 4.3 Emit and document stable fallback reason codes (`miss`, `stale`, `build_failed`, etc.) for alert correlation
- [x] 4.4 Add logging hooks for build latency and build failure diagnostics with request/key correlation context
## 5. Validation, Rollout, and Regression Safety
- [x] 5.1 Add unit tests for key isolation, schema version invalidation, single-flight behavior, and guardrail enforcement
- [x] 5.2 Add parity tests comparing materialized and legacy results across multi-dimension cross-filter scenarios
- [x] 5.3 Add route/service tests validating metadata exposure and fallback behavior under snapshot miss/stale/build-failure paths
- [x] 5.4 Execute reject-history regression suite and stress checks for repeated Pareto filter toggling to confirm lower worker memory pressure
- [x] 5.5 Add feature-flagged rollout plan (telemetry-only -> read-through enabled -> default-on) with rollback switch