feat(reject-history): add materialized Pareto aggregate layer with feature-flagged rollout

Pre-compute 6-dimension metric cubes from cached LOT-level DataFrames so
interactive Pareto requests read compact snapshots instead of re-scanning
detail rows on every filter change. Includes single-flight build guard,
TTL/size guardrails, cross-filter exclude-self evaluation, safe legacy
fallback, response metadata exposure, telemetry counters, and a 3-stage
rollout plan (telemetry-only → build-enabled → read-through).

Co-Authored-By: Claude Opus 4.6 <noreply@anthropic.com>
Author: egg
Date: 2026-03-04 08:05:02 +08:00
parent 98eea066ea
commit e79fb657a3
22 changed files with 2500 additions and 484 deletions


@@ -0,0 +1,2 @@
schema: spec-driven
created: 2026-03-03


@@ -0,0 +1,89 @@
## Context
Reject History currently runs in two phases: `POST /api/reject-history/query` first builds a lot-level dataset cache, and subsequent `reason-pareto` / `batch-pareto` / `view` requests repeatedly recompute over the full detail rows with pandas inside the worker. When a user cross-filters multi-dimension Paretos over a wide date range, this amplifies memory pressure in multi-worker deployments, producing high RSS, response jitter, and even worker instability.
The goal of this change is to add a "Pareto pre-aggregation/materialization" layer, without changing the frontend API contract, so interactive Pareto requests read aggregate snapshots instead of scanning lot-level detail on every call.
## Goals / Non-Goals
**Goals:**
- Have `batch-pareto` and `reason-pareto` (cache path) prefer reading the materialized Pareto snapshot.
- Reduce the worker memory and CPU needed per interaction by avoiding repeated full-dataset groupby operations.
- Define consistency rules for snapshots (version/freshness/invalidation) so stale or mismatched data is never returned.
- Fill in operational observability: build latency, hit/miss, fallback reasons, snapshot size.
- Keep the response schema and frontend interaction semantics (cross-filter, top80/top20) compatible.
**Non-Goals:**
- No changes to the Oracle detail SQL or the primary query's data source.
- No new "hard-reject wide date ranges" rule.
- No changes to the frontend Pareto components or the URL parameter contract.
- No new storage system introduced in this change (reuse the existing process cache + Redis/spool ecosystem).
## Decisions
### 1) Add a standalone Pareto materialization service, decoupled from the lot-level cache
- Decision: Add `reject_pareto_materialized.py` (name subject to change) to own snapshot build/read/invalidate, rather than embedding the logic directly in the route.
- Why: Centralizes the aggregate lifecycle, key strategy, and telemetry in one place, and keeps `reject_dataset_cache.py` from growing more complex.
- Alternative considered:
  - Add a dict cache directly inside `compute_batch_pareto()`: it entangles easily with the dataset cache lifecycle, and cross-worker hit rates would be poor.
### 2) Bind snapshot keys to the query dataset and filter context, plus a schema version
- Decision: The key includes at least `query_id`, policy toggles, supplementary filters, a trend_dates hash, and the materialization schema version.
- Why: Prevents a snapshot from being reused across different filter contexts; a schema change can force-invalidate old data.
- Alternative considered:
  - Key on `query_id` only: cannot distinguish supplementary filters, so it can easily return the wrong Pareto.
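A minimal sketch of the key-builder idea, assuming a JSON-canonicalization approach; the function name, normalization rules, and `PARETO_SCHEMA_VERSION` constant are illustrative, not the actual implementation:

```python
import hashlib
import json

PARETO_SCHEMA_VERSION = 3  # bump to invalidate all snapshots built under older schemas


def build_snapshot_key(query_id: str,
                       policy_toggles: dict,
                       supplementary_filters: dict,
                       trend_dates: list[str]) -> str:
    """Derive a canonical snapshot key; identical filter contexts must hash identically."""

    def _normalize(d: dict) -> dict:
        # Sort keys and drop empty values so {"x": None} and {} cannot
        # yield different keys for the same effective filter context.
        return {k: d[k] for k in sorted(d) if d[k] not in (None, "", [], {})}

    payload = json.dumps(
        {
            "query_id": query_id,
            "policy": _normalize(policy_toggles),
            "filters": _normalize(supplementary_filters),
            "trend_dates": sorted(trend_dates),
            "schema_version": PARETO_SCHEMA_VERSION,
        },
        sort_keys=True,
        separators=(",", ":"),
    )
    digest = hashlib.sha256(payload.encode("utf-8")).hexdigest()[:16]
    return f"pareto:v{PARETO_SCHEMA_VERSION}:{query_id}:{digest}"
```

Because the schema version is hashed into both the payload and the key prefix, incrementing it makes every old key unreachable, which is the force-invalidation behavior the decision calls for.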
### 3) Store the minimal aggregates needed for cross-filter computation, not full detail
- Decision: The snapshot stores the six-dimension aggregate results plus the intermediate structures required for cross-filtering (enough to rebuild any cross-filter), and does not copy lot-level rows.
- Why: The goal is to cut memory amplification; storing full detail would defeat the purpose.
- Alternative considered:
  - Store only the final items per dimension: smaller, but cannot support recomputation under arbitrary cross-filters.
### 4) Read path: materialized first, with safe fallback
- Decision: `batch-pareto` / `reason-pareto(query_id)` read the snapshot first; on miss/stale/build-fail they fall back to the existing cached-DataFrame computation and emit telemetry.
- Why: Preserves availability and enables gradual rollout, avoiding an all-at-once switch that could break the feature.
- Alternative considered:
  - Hard-error on snapshot miss: high risk and hostile to existing users.
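The read-through decision can be sketched as follows; the `Snapshot` shape, store/telemetry interfaces, and TTL default are assumptions for illustration, not the real module's API:

```python
import time
from dataclasses import dataclass

PARETO_SCHEMA_VERSION = 3


@dataclass
class Snapshot:
    payload: dict
    schema_version: int
    built_at: float
    ttl_seconds: float = 900.0

    def is_expired(self) -> bool:
        return time.time() - self.built_at > self.ttl_seconds


def get_batch_pareto(key, store, legacy_compute, telemetry):
    """Prefer a fresh materialized snapshot; fall back to the legacy
    DataFrame path on miss/stale, tagging telemetry with a stable reason code."""
    snap = store.get(key)
    if snap is None:
        telemetry.append(("fallback", "miss"))
        return legacy_compute(), {"source": "legacy", "fallback_reason": "miss"}
    if snap.schema_version != PARETO_SCHEMA_VERSION or snap.is_expired():
        telemetry.append(("fallback", "stale"))
        return legacy_compute(), {"source": "legacy", "fallback_reason": "stale"}
    telemetry.append(("hit", None))
    return snap.payload, {"source": "materialized",
                          "snapshot_version": snap.schema_version}
```

Note that both fallback branches return the same response shape as the hit branch, which is what keeps the API contract unchanged during rollout.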
### 5) Single-flight builds with capacity limits
- Decision: Allow only one concurrent build per snapshot key (other requests wait or read the old value), and cap per-snapshot size plus apply TTLs across keys.
- Why: Avoids thundering-herd builds and uncontrolled Redis/spool growth.
- Alternative considered:
  - No locking at all: under high concurrency, duplicate builds amplify CPU, memory, and IO.
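A minimal in-process single-flight sketch, assuming a thread-per-request worker model; class and method names are hypothetical (a cross-worker deployment would need a distributed lock, e.g. Redis `SET NX`, instead):

```python
import threading


class SingleFlight:
    """Allow only one in-flight build per key; concurrent callers for the
    same key wait for the leader and reuse its result."""

    def __init__(self):
        self._lock = threading.Lock()
        self._inflight: dict[str, threading.Event] = {}
        self._results: dict[str, object] = {}

    def build(self, key, builder):
        with self._lock:
            event = self._inflight.get(key)
            if event is None:
                # No build in progress: this caller becomes the leader.
                event = threading.Event()
                self._inflight[key] = event
                leader = True
            else:
                leader = False
        if leader:
            try:
                self._results[key] = builder()
            finally:
                with self._lock:
                    del self._inflight[key]
                event.set()  # wake followers even if the build raised
            return self._results[key]
        event.wait()
        return self._results.get(key)
```

Only *concurrent* callers are deduplicated; a caller arriving after the build completes becomes a new leader, which is the intended single-flight semantics (freshness is handled separately by the TTL rules).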
### 6) Fold observability into the existing cache telemetry contract
- Decision: Add pareto-materialized fields (hit/miss/fallback/build/size/freshness) to the existing cache observability structure.
- Why: Operations can diagnose cache issues from a single entry point instead of querying multiple APIs.
- Alternative considered:
  - Logs only: hard to trend or alert on.
## Risks / Trade-offs
- [Risk] An incomplete snapshot key composition causes contamination (cross-user/filter pollution).
  - Mitigation: Unit tests for the key builder covering parameter ordering, empty-value normalization, and version isolation.
- [Risk] A flawed intermediate aggregate design makes cross-filter results diverge from the legacy path.
  - Mitigation: Parity tests comparing materialized vs. legacy results for the same `query_id`.
- [Risk] A persistently high fallback ratio erases the benefit of materialization.
  - Mitigation: Telemetry always emits the fallback reason; set alert thresholds and make them a rollout gate.
- [Risk] The added build step lengthens the first Pareto response.
  - Mitigation: The first request can return the legacy result while the snapshot fills in the background, or build synchronously under a timeout cap; controlled by configuration.
## Migration Plan
1. Implement the materialization service (key, payload, build/read/invalidate, telemetry).
2. Wire `compute_batch_pareto` and `compute_dimension_pareto` (query_id path) into the read-through/fallback flow.
3. Add parity tests (legacy vs. materialized) and stress tests (repeated cross-filter toggling).
4. Roll out gradually: telemetry-only / build-disabled first, then read-through, then raise the hit ratio.
5. Monitor hit ratio, fallback ratio, and worker RSS trends; only after targets are met, consider retiring the legacy path.
Rollback strategy:
- Turn off the materialized read path via feature flag to revert immediately to the current DataFrame computation, with no impact on the API schema.
## Open Questions
- Should the materialized payload live in Redis (fast) or spool (durability-first), and is a two-tier strategy needed?
- Should the first build block synchronously or run in the background (a UX-latency trade-off)?
- Longer term, should materialization move earlier, running right after the primary query completes, to buy interaction stability?


@@ -0,0 +1,37 @@
## Why
The reject-history Pareto workflow currently recomputes aggregates from full lot-level cached datasets on every interactive filter change. Under wide date ranges and cross-filtering, this drives repeated high-memory pandas operations and can destabilize workers.
## What Changes
- Introduce a materialized Pareto aggregate layer for reject-history that precomputes dimension metrics from cached query datasets.
- Serve `/api/reject-history/batch-pareto` and related Pareto reads from pre-aggregated artifacts instead of recomputing from full detail rows each request.
- Add freshness/version metadata so aggregate snapshots stay aligned with source query datasets and policy toggles.
- Add bounded invalidation and lifecycle rules for aggregate artifacts to avoid stale growth and memory pressure.
- Add observability for aggregate build latency, hit ratio, memory footprint, and fallback reasons.
- Keep API response schema compatible with existing frontend contracts to avoid UI rewrites.
## Capabilities
### New Capabilities
- `reject-history-pareto-materialized-aggregate`: Build, store, and read pre-aggregated Pareto data for interactive cross-filter workflows.
### Modified Capabilities
- `reject-history-api`: Route Pareto endpoints to materialized aggregates with cache-consistency and fallback behavior.
- `cache-observability-hardening`: Extend telemetry to cover aggregate generation/hit/fallback and memory-guard events.
## Impact
- Affected backend code:
- `src/mes_dashboard/services/reject_dataset_cache.py`
- `src/mes_dashboard/services/reject_history_service.py`
- `src/mes_dashboard/routes/reject_history_routes.py`
- new aggregate builder/storage modules under `src/mes_dashboard/services/`
- Affected APIs:
- `/api/reject-history/batch-pareto`
- `/api/reject-history/reason-pareto` (cache-backed path)
- Affected runtime systems:
- Process L1 cache, Redis/L2 cache, spool lifecycle and metrics history
- Dependencies/ops:
- Additional Redis key space and retention policy tuning for aggregate artifacts
- New monitoring/alerts for aggregate freshness and fallback rates


@@ -0,0 +1,30 @@
## MODIFIED Requirements
### Requirement: Cache Telemetry MUST be Queryable for Operations
The system MUST provide cache telemetry suitable for operations diagnostics, including materialized Pareto cache behavior for reject-history workloads.
#### Scenario: Telemetry inspection
- **WHEN** operators request deep health status
- **THEN** cache-related metrics/state SHALL be present and interpretable for troubleshooting
#### Scenario: Materialized Pareto telemetry visibility
- **WHEN** materialized Pareto cache is enabled
- **THEN** telemetry SHALL expose at least hit count/rate, miss count/rate, build count, build failure count, and fallback count
- **THEN** telemetry SHALL expose latest snapshot freshness indicators and aggregate payload size indicators
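One way the telemetry payload section could be shaped to satisfy these scenarios; the field names and function are illustrative only — the requirement fixes which signals must exist, not their spelling:

```python
def materialized_telemetry_snapshot(counters: dict) -> dict:
    """Shape the pareto-materialized section of the cache telemetry payload
    from raw counters (hypothetical input/output field names)."""
    hits = counters.get("hits", 0)
    misses = counters.get("misses", 0)
    reads = hits + misses
    return {
        "hits": hits,
        "misses": misses,
        "hit_rate": hits / reads if reads else None,
        "builds": counters.get("builds", 0),
        "build_failures": counters.get("build_failures", 0),
        "fallbacks": counters.get("fallbacks", 0),
        # Freshness and size indicators required by the second scenario.
        "latest_snapshot_age_seconds": counters.get("latest_snapshot_age_seconds"),
        "latest_snapshot_bytes": counters.get("latest_snapshot_bytes"),
    }
```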
## ADDED Requirements
### Requirement: Pareto materialization fallback reasons SHALL be operationally classifiable
Telemetry MUST classify fallback outcomes with stable reason codes so repeated degradations can be monitored and alerted.
#### Scenario: Snapshot miss fallback reason
- **WHEN** request falls back because no snapshot exists
- **THEN** telemetry SHALL record a stable reason code for snapshot miss
#### Scenario: Snapshot stale fallback reason
- **WHEN** request falls back because snapshot fails freshness/version checks
- **THEN** telemetry SHALL record a stable reason code for stale/incompatible snapshot
#### Scenario: Build failure fallback reason
- **WHEN** request falls back because materialization build failed
- **THEN** telemetry SHALL record a stable reason code for build failure
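The three scenarios above imply a small closed set of reason codes; a sketch of how they might be pinned down (the enum name and string values are assumptions consistent with the codes listed later in the tasks file):

```python
from enum import Enum


class FallbackReason(str, Enum):
    """Stable, alertable reason codes for materialized-Pareto fallbacks."""
    MISS = "miss"                  # no snapshot exists for the key
    STALE = "stale"                # snapshot failed freshness/version checks
    BUILD_FAILED = "build_failed"  # materialization build raised or timed out
```

Deriving codes from an enum rather than ad-hoc strings is what makes them "stable": telemetry consumers and alert rules can rely on the value set not drifting.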


@@ -0,0 +1,57 @@
## MODIFIED Requirements
### Requirement: Reject History API SHALL provide batch Pareto endpoint with cross-filter
The API SHALL provide a batch Pareto endpoint that returns all 6 dimension Pareto results in a single response, supporting cross-dimension filtering with exclude-self logic, and SHALL prefer materialized Pareto snapshots over full detail regrouping.
#### Scenario: Batch Pareto response structure
- **WHEN** `GET /api/reject-history/batch-pareto` is called with valid `query_id`
- **THEN** response SHALL be `{ success: true, data: { dimensions: { reason: {...}, package: {...}, type: {...}, workflow: {...}, workcenter: {...}, equipment: {...} } } }`
- **THEN** each dimension object SHALL include `items` array with schema (`reason`, `metric_value`, `pct`, `cumPct`, `MOVEIN_QTY`, `REJECT_TOTAL_QTY`, `DEFECT_QTY`, `count`)
#### Scenario: Cross-filter exclude-self logic
- **WHEN** `sel_reason=A&sel_type=X` is provided
- **THEN** reason Pareto SHALL be computed with type=X filter applied (but NOT reason=A filter)
- **THEN** type Pareto SHALL be computed with reason=A filter applied (but NOT type=X filter)
- **THEN** package/workflow/workcenter/equipment Paretos SHALL be computed with both reason=A AND type=X filters applied
#### Scenario: Empty selections return unfiltered Paretos
- **WHEN** batch-pareto is called with no `sel_*` parameters
- **THEN** all 6 dimensions SHALL return their full Pareto distribution (subject to `pareto_scope`)
#### Scenario: Cache-only computation
- **WHEN** `query_id` does not exist in cache
- **THEN** the endpoint SHALL return HTTP 400 with error message indicating cache miss
- **THEN** the endpoint SHALL NOT fall back to Oracle query
#### Scenario: Materialized snapshot preferred
- **WHEN** a valid and fresh materialized Pareto snapshot exists for the request context
- **THEN** the endpoint SHALL return results from that snapshot
- **THEN** the endpoint SHALL avoid full lot-level DataFrame regrouping for the same request
#### Scenario: Materialized miss fallback behavior
- **WHEN** materialized snapshot is unavailable, stale, or build fails
- **THEN** the endpoint SHALL fall back to legacy cache DataFrame computation
- **THEN** the response schema and filter semantics SHALL remain unchanged
#### Scenario: Supplementary and policy filters apply
- **WHEN** batch-pareto is called with supplementary filters (packages, workcenter_groups, reason) and policy toggles
- **THEN** all 6 dimension Paretos SHALL be computed after applying policy and supplementary filters first (before cross-filter)
#### Scenario: Display scope (TOP20) support
- **WHEN** `pareto_display_scope=top20` is provided
- **THEN** applicable dimensions (type, workflow, equipment) SHALL truncate results to top 20 items after sorting
- **WHEN** `pareto_display_scope` is omitted or `all`
- **THEN** all items SHALL be returned (subject to `pareto_scope` filter)
## ADDED Requirements
### Requirement: Reject History API SHALL expose materialized Pareto freshness metadata
The API SHALL surface stable metadata so operators and clients can identify whether Pareto responses came from materialized snapshots or fallback paths.
#### Scenario: Materialized hit metadata
- **WHEN** batch pareto response is served from materialized snapshot
- **THEN** response metadata SHALL indicate materialized source and snapshot freshness/version identifiers
#### Scenario: Fallback metadata
- **WHEN** response uses legacy fallback due to snapshot miss/stale/build failure
- **THEN** response metadata SHALL include a stable fallback reason code
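The exclude-self cross-filter scenarios above can be sketched over plain row dicts (the production path operates on cached aggregate structures, and the row/field names here are illustrative):

```python
def cross_filtered_paretos(rows, selections):
    """For each dimension, apply every *other* dimension's selection but
    never its own, then rank by summed qty with pct/cumPct."""
    dimensions = ["reason", "package", "type", "workflow", "workcenter", "equipment"]
    result = {}
    for dim in dimensions:
        # Exclude-self: drop this dimension's own selection before filtering.
        active = {d: v for d, v in selections.items() if d != dim}
        kept = [r for r in rows if all(r[d] == v for d, v in active.items())]
        totals = {}
        for r in kept:
            totals[r[dim]] = totals.get(r[dim], 0) + r["qty"]
        grand = sum(totals.values())
        items = sorted(totals.items(), key=lambda kv: kv[1], reverse=True)
        cum = 0
        result[dim] = []
        for name, qty in items:
            cum += qty
            result[dim].append({"name": name, "qty": qty,
                                "pct": qty / grand if grand else 0,
                                "cumPct": cum / grand if grand else 0})
    return result
```

With `sel_reason=A` and `sel_type=X`, the reason Pareto sees only the `type=X` filter, the type Pareto only `reason=A`, and every other dimension sees both — exactly the parity the spec requires between materialized and legacy paths.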


@@ -0,0 +1,49 @@
## ADDED Requirements
### Requirement: Reject History Pareto materialization SHALL build reusable aggregate snapshots
The system SHALL build reusable Pareto aggregate snapshots from cached reject-history query datasets so interactive Pareto requests do not require full lot-level regrouping on every call.
#### Scenario: Build snapshot from cached dataset
- **WHEN** a valid `query_id` has cached reject-history dataset and Pareto data is requested
- **THEN** the system SHALL build a materialized snapshot containing the six supported Pareto dimensions (`reason`, `package`, `type`, `workflow`, `workcenter`, `equipment`)
- **THEN** the snapshot SHALL include quantities needed to compute `metric_value`, `pct`, `cumPct`, and affected count fields
#### Scenario: Build skipped for missing dataset cache
- **WHEN** the referenced `query_id` dataset is missing or expired
- **THEN** snapshot build SHALL NOT proceed
- **THEN** the caller SHALL receive a deterministic cache-miss outcome
### Requirement: Materialized snapshot keys SHALL encode filter identity and schema version
The system SHALL key materialized Pareto snapshots by canonical filter identity and schema version to prevent cross-context reuse.
#### Scenario: Distinct supplementary filters generate distinct snapshots
- **WHEN** two requests share the same `query_id` but differ in supplementary filters or policy toggles
- **THEN** they SHALL resolve to different materialized snapshot keys
#### Scenario: Schema version invalidates prior snapshots
- **WHEN** materialization schema version is incremented
- **THEN** snapshots produced by prior versions SHALL NOT be treated as valid hits
### Requirement: Materialized snapshots SHALL preserve cross-filter semantics
Materialized read paths SHALL produce the same cross-filter behavior as legacy DataFrame-based Pareto computation.
#### Scenario: Exclude-self behavior parity
- **WHEN** `sel_reason=A` and `sel_type=X` are active
- **THEN** reason Pareto SHALL be computed with `type=X` applied but without `reason=A` self-filter
- **THEN** type Pareto SHALL be computed with `reason=A` applied but without `type=X` self-filter
#### Scenario: Multi-dimension intersection parity
- **WHEN** multiple `sel_*` filters are active across dimensions
- **THEN** each non-excluded dimension result SHALL reflect the AND intersection of all other selected dimensions
### Requirement: Materialized snapshots SHALL enforce bounded lifecycle and capacity
Materialized Pareto cache storage SHALL be bounded by TTL and size guardrails to avoid unbounded memory growth.
#### Scenario: Snapshot expiry follows configured retention
- **WHEN** a materialized snapshot exceeds configured TTL
- **THEN** it SHALL be treated as expired and SHALL NOT be returned as a cache hit
#### Scenario: Oversized snapshot handling
- **WHEN** a snapshot build exceeds configured snapshot size guardrail
- **THEN** the snapshot SHALL be rejected or degraded according to policy
- **THEN** the system SHALL record the rejection/degradation reason for operations telemetry
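A compact admission check capturing the TTL and size guardrails above; the thresholds and function name are placeholders — the requirement says these limits are configuration-driven:

```python
SNAPSHOT_TTL_SECONDS = 900        # assumed default; configuration-driven in practice
SNAPSHOT_MAX_BYTES = 2_000_000    # assumed per-snapshot size guardrail


def admit_snapshot(payload_bytes: int, built_at: float, now: float):
    """Gate a snapshot on TTL and size guardrails.

    Returns (admitted, reason): reason is None on admission, else a stable
    code suitable for the operations telemetry the spec requires."""
    if now - built_at > SNAPSHOT_TTL_SECONDS:
        return False, "expired"
    if payload_bytes > SNAPSHOT_MAX_BYTES:
        return False, "oversized"
    return True, None
```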


@@ -0,0 +1,35 @@
## 1. Materialization Service Foundation
- [x] 1.1 Create a dedicated reject Pareto materialization service module with key builder, payload schema versioning, and read/write interfaces
- [x] 1.2 Implement canonical filter-context hashing (policy toggles, supplementary filters, trend dates) for materialized snapshot key isolation
- [x] 1.3 Implement single-flight guard for concurrent snapshot builds targeting the same key
- [x] 1.4 Add TTL and payload-size guardrails for materialized snapshots with explicit rejection paths
## 2. Snapshot Build and Compute Path
- [x] 2.1 Implement snapshot build pipeline from cached reject dataset to six-dimension aggregate structures
- [x] 2.2 Implement cross-filter evaluation on materialized structures with exclude-self parity to current batch Pareto behavior
- [x] 2.3 Implement `pareto_scope` (`top80`/`all`) and `pareto_display_scope` compatibility on materialized outputs
- [x] 2.4 Add deterministic invalidation rules for stale or schema-mismatched snapshots
## 3. API Integration and Compatibility
- [x] 3.1 Integrate materialized read-through path into `compute_batch_pareto` and cached `compute_dimension_pareto` flow
- [x] 3.2 Implement safe fallback to legacy DataFrame-based compute when snapshot is missing, stale, or build fails
- [x] 3.3 Add response metadata fields for materialized source/freshness/version and fallback reason codes without breaking existing payload schema
- [x] 3.4 Ensure cache-miss behavior for missing `query_id` remains unchanged (no Oracle fallback)
## 4. Observability and Operations Signals
- [x] 4.1 Extend cache telemetry payload to include materialized hit/miss/build/fallback counters and rates
- [x] 4.2 Add snapshot freshness and payload-size telemetry fields to deep health diagnostics
- [x] 4.3 Emit and document stable fallback reason codes (`miss`, `stale`, `build_failed`, etc.) for alert correlation
- [x] 4.4 Add logging hooks for build latency and build failure diagnostics with request/key correlation context
## 5. Validation, Rollout, and Regression Safety
- [x] 5.1 Add unit tests for key isolation, schema version invalidation, single-flight behavior, and guardrail enforcement
- [x] 5.2 Add parity tests comparing materialized and legacy results across multi-dimension cross-filter scenarios
- [x] 5.3 Add route/service tests validating metadata exposure and fallback behavior under snapshot miss/stale/build-failure paths
- [x] 5.4 Execute reject-history regression suite and stress checks for repeated Pareto filter toggling to confirm lower worker memory pressure
- [x] 5.5 Add feature-flagged rollout plan (telemetry-only -> read-through enabled -> default-on) with rollback switch