feat(reject-history): add materialized Pareto aggregate layer with feature-flagged rollout
Pre-compute 6-dimension metric cubes from cached LOT-level DataFrames so interactive Pareto requests read compact snapshots instead of re-scanning detail rows on every filter change. Includes single-flight build guard, TTL/size guardrails, cross-filter exclude-self evaluation, safe legacy fallback, response metadata exposure, telemetry counters, and a 3-stage rollout plan (telemetry-only → build-enabled → read-through).

Co-Authored-By: Claude Opus 4.6 <noreply@anthropic.com>
schema: spec-driven
created: 2026-03-03
## Context

Reject History currently uses a two-stage model: `POST /api/reject-history/query` first builds a lot-level dataset cache, then `reason-pareto` / `batch-pareto` / `view` repeatedly recompute over the full detail rows with pandas inside the worker. When users cross-filter multi-dimension Paretos over wide date ranges, this amplifies memory pressure in multi-worker deployments, causing high RSS, response-time jitter, and even worker instability.

The goal of this change is to add a "Pareto pre-aggregation / materialization" layer, without changing the frontend API contract, so interactive Pareto requests read aggregate snapshots instead of scanning lot-level detail on every call.
## Goals / Non-Goals

**Goals:**

- Have `batch-pareto` and `reason-pareto` (cache path) prefer reading a materialized Pareto snapshot.
- Reduce the worker memory and CPU required per interaction, avoiding repeated full-dataset groupby operations.
- Establish consistency rules for snapshots (version/freshness/invalidation) so stale or mismatched data is never returned.
- Fill in operational observability: build latency, hit/miss, fallback reasons, snapshot size.
- Keep the response schema and frontend interaction semantics (cross-filter, top80/top20) compatible.

**Non-Goals:**

- No changes to the Oracle detail SQL or the primary query's data source.
- No new "hard-reject wide date ranges" rule.
- No changes to the frontend Pareto components or the URL parameter contract.
- No new storage system in this change (reuse the existing process cache + Redis/spool ecosystem).
## Decisions

### 1) Add a standalone Pareto materialization service, decoupled from the lot-level cache

- Decision: Add `reject_pareto_materialized.py` (name subject to change) responsible for build/read/invalidate of snapshots, rather than embedding the logic directly in the route.
- Why: Centralizes the aggregate lifecycle, key strategy, and telemetry, and keeps `reject_dataset_cache.py` from growing more complex.
- Alternative considered:
  - Adding a dict cache directly inside `compute_batch_pareto()`: easily entangled with the dataset cache lifecycle, and cross-worker hit rates would be poor.
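A minimal sketch of the service surface this decision implies — class and method names are illustrative, with an in-memory dict standing in for the Redis/spool store:

```python
from dataclasses import dataclass
from typing import Any, Optional

@dataclass
class ParetoSnapshot:
    key: str
    schema_version: int
    built_at: float
    payload: dict[str, Any]  # per-dimension aggregate structures

class RejectParetoMaterializedService:
    """Owns build/read/invalidate for Pareto snapshots, keeping the
    aggregate lifecycle out of route handlers (in-memory stand-in store)."""

    def __init__(self) -> None:
        self._store: dict[str, ParetoSnapshot] = {}

    def read(self, key: str) -> Optional[ParetoSnapshot]:
        return self._store.get(key)

    def write(self, snapshot: ParetoSnapshot) -> None:
        self._store[snapshot.key] = snapshot

    def invalidate(self, key: str) -> None:
        self._store.pop(key, None)
```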
### 2) Bind the snapshot key to the query dataset and filter context, plus a schema version

- Decision: The key includes at least `query_id`, policy toggles, supplementary filters, a `trend_dates` hash, and the materialization schema version.
- Why: Prevents a snapshot from being misused across filter contexts; a schema change can force-invalidate old data.
- Alternative considered:
  - `query_id` alone: cannot distinguish supplementary filters and can easily return an incorrect Pareto.
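A sketch of canonical key construction under these assumptions (parameter names are placeholders, not the project's actual API):

```python
import hashlib
import json

PARETO_SCHEMA_VERSION = 1  # bump to force-invalidate all prior snapshots

def build_snapshot_key(query_id, policy_toggles, supplementary_filters, trend_dates):
    """Hash a canonicalized filter context so equal contexts always collide.

    json.dumps(sort_keys=True) plus sorted lists normalize dict/list ordering,
    so semantically identical contexts produce identical keys.
    """
    context = {
        "query_id": query_id,
        "policy": policy_toggles,
        "filters": {k: sorted(v) if isinstance(v, list) else v
                    for k, v in supplementary_filters.items()},
        "trend_dates": sorted(trend_dates),
        "schema_version": PARETO_SCHEMA_VERSION,
    }
    digest = hashlib.sha256(
        json.dumps(context, sort_keys=True, default=str).encode("utf-8")
    ).hexdigest()
    return f"pareto:{query_id}:v{PARETO_SCHEMA_VERSION}:{digest[:16]}"
```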
### 3) The materialized payload stores the minimal aggregates needed for cross-computation, not full detail

- Decision: The snapshot stores the 6-dimension aggregate results plus the intermediate structures required for cross-filtering (guiding principle: any cross-filter must be rebuildable from them); it does not copy lot-level rows.
- Why: The goal is to reduce memory amplification; storing full detail would defeat the purpose.
- Alternative considered:
  - Storing only the final items per dimension: smaller, but cannot support recomputation under arbitrary cross-filters.
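For illustration, the minimal cross-computable structure could be the full six-dimension cross-tabulation, assuming the dimension and quantity columns named elsewhere in this spec exist on the cached DataFrame:

```python
import pandas as pd

DIMS = ["reason", "package", "type", "workflow", "workcenter", "equipment"]
QTYS = ["MOVEIN_QTY", "REJECT_TOTAL_QTY", "DEFECT_QTY"]

def build_cross_cube(lot_df: pd.DataFrame) -> pd.DataFrame:
    """Collapse lot-level rows into one row per unique 6-dimension combination.

    Any cross-filtered Pareto can then be rebuilt by masking this cube and
    re-grouping on a single dimension -- no lot rows are retained.
    """
    grouped = lot_df.groupby(DIMS, dropna=False)
    cube = grouped[QTYS].sum()
    cube["count"] = grouped.size()  # lots per combination
    return cube.reset_index()
```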
### 4) Read path: materialized first, with safe fallback

- Decision: `batch-pareto` / `reason-pareto` (query_id path) read the snapshot first; on miss/stale/build failure they fall back to the existing cached-DataFrame computation and emit telemetry.
- Why: Preserves availability and enables gradual rollout, avoiding breakage from a hard cutover.
- Alternative considered:
  - Erroring outright on snapshot miss: high risk and hostile to existing users.
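The read path can be sketched as follows (function names and the TTL default are assumptions, not the project's API; `record` stands in for the telemetry hook):

```python
import time

PARETO_SCHEMA_VERSION = 1
SNAPSHOT_TTL_SECONDS = 900  # assumed default; the real value is configuration-driven

def read_pareto(key, store, build_snapshot, legacy_compute, record):
    """Materialized-first read with classified fallback.

    `store` maps key -> (built_at, schema_version, payload); `record` is any
    callable that accepts a stable telemetry/reason code.
    """
    entry = store.get(key)
    if entry is not None:
        built_at, version, payload = entry
        fresh = (version == PARETO_SCHEMA_VERSION
                 and time.time() - built_at <= SNAPSHOT_TTL_SECONDS)
        if fresh:
            record("hit")
            return payload
        record("stale")
    else:
        record("miss")
    try:
        payload = build_snapshot()  # read-through: rebuild on miss/stale
        store[key] = (time.time(), PARETO_SCHEMA_VERSION, payload)
        return payload
    except Exception:
        record("build_failed")  # safe fallback keeps the endpoint working
        return legacy_compute()
```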
### 5) Single-flight builds with capacity limits

- Decision: Only one build per snapshot key runs at a time; other requests wait or read the previous value. Individual snapshot size and total key TTL are capped.
- Why: Prevents thundering herds and uncontrolled Redis/spool growth.
- Alternative considered:
  - No locking at all: high concurrency would trigger duplicate builds, amplifying CPU, memory, and IO.
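A minimal in-process single-flight sketch (a multi-worker deployment would need a shared lock, e.g. in Redis, instead):

```python
import threading

class SingleFlight:
    """Allow one in-flight build per key: the first caller builds, concurrent
    callers block until it finishes and receive the same result."""

    def __init__(self):
        self._mu = threading.Lock()
        self._inflight = {}   # key -> Event set when the build completes
        self._results = {}    # key -> last built value

    def do(self, key, build):
        with self._mu:
            event = self._inflight.get(key)
            if event is None:
                event = threading.Event()
                self._inflight[key] = event
                is_builder = True
            else:
                is_builder = False
        if not is_builder:
            event.wait()              # piggyback on the in-flight build
            return self._results.get(key)
        try:
            self._results[key] = build()
            return self._results[key]
        finally:
            with self._mu:
                del self._inflight[key]
            event.set()
```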
### 6) Fold observability into the existing cache telemetry contract

- Decision: Add pareto-materialized fields (hit/miss/fallback/build/size/freshness) to the existing cache observability structure.
- Why: Operations can diagnose cache problems from a single entry point instead of querying multiple APIs.
- Alternative considered:
  - Logs only: hard to trend and alert on.
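The counters might look like the following sketch (field names are illustrative; the real output would merge into the existing cache observability payload):

```python
import threading
import time
from collections import Counter

class ParetoMaterializedTelemetry:
    """Thread-safe counters for the proposed materialized-Pareto telemetry."""

    def __init__(self):
        self._lock = threading.Lock()
        self._counts = Counter()
        self._last_built_at = None
        self._last_payload_bytes = None

    def record(self, event, payload_bytes=None):
        with self._lock:
            self._counts[event] += 1
            if event == "build":
                self._last_built_at = time.time()
                self._last_payload_bytes = payload_bytes

    def snapshot(self):
        with self._lock:
            total_reads = self._counts["hit"] + self._counts["miss"]
            return {
                "hits": self._counts["hit"],
                "misses": self._counts["miss"],
                "hit_rate": self._counts["hit"] / total_reads if total_reads else None,
                "builds": self._counts["build"],
                "build_failures": self._counts["build_failed"],
                "fallbacks": self._counts["fallback"],
                "last_built_at": self._last_built_at,
                "last_payload_bytes": self._last_payload_bytes,
            }
```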
## Risks / Trade-offs

- [Risk] Incomplete snapshot key composition causes contamination (cross-user / cross-filter pollution).
- Mitigation: Unit tests on the key builder covering parameter ordering, null normalization, and version isolation.

- [Risk] A flawed intermediate-aggregate design makes cross-filter results diverge from the old path.
- Mitigation: Parity tests comparing materialized and legacy results for the same query_id.

- [Risk] A persistently high fallback ratio erases the benefit of materialization.
- Mitigation: Telemetry always emits a fallback reason; set alert thresholds and include them in the rollout gate.

- [Risk] The added build step lengthens the first Pareto response time.
- Mitigation: The first request can return the legacy result while the snapshot is filled in the background, or build synchronously under a timeout cap; controlled by configuration.
## Migration Plan

1. Implement the materialization service (key, payload, build/read/invalidate, telemetry).
2. Wire read-through/fallback into `compute_batch_pareto` and `compute_dimension_pareto` (query_id path).
3. Add parity tests (legacy vs. materialized) and stress tests (repeated cross-filter switching).
4. Staged enablement: telemetry-only / build-disabled first, then read-through, then raise the hit ratio.
5. Monitor hit ratio, fallback ratio, and worker RSS trends; once targets are met, consider retiring the legacy path.

Rollback strategy:

- Turn off the materialized read path via feature flag, immediately reverting to the current DataFrame computation with no impact on the API schema.
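The staged enablement might be modeled as an explicit flag, sketched here with illustrative names:

```python
from enum import Enum

class ParetoRolloutStage(str, Enum):
    """Three-stage rollout from the migration plan (names illustrative;
    in practice this maps onto the existing configuration system)."""
    TELEMETRY_ONLY = "telemetry_only"   # stage 1: observe only, no builds or reads
    BUILD_ENABLED = "build_enabled"     # stage 2: build snapshots, still serve legacy
    READ_THROUGH = "read_through"       # stage 3: serve snapshots, fallback on miss

def should_build(stage):
    return stage in (ParetoRolloutStage.BUILD_ENABLED, ParetoRolloutStage.READ_THROUGH)

def should_read(stage):
    return stage is ParetoRolloutStage.READ_THROUGH
```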
## Open Questions

- Should the materialized payload live primarily in Redis (fast) or spool (durable)? Is a two-tier strategy needed?
- Should the first build block synchronously or run in the background (a UX latency trade-off)?
- Longer term, should materialization move earlier, building immediately after the primary query completes, to buy interactive stability?
## Why

The reject-history Pareto workflow currently recomputes aggregates from full lot-level cached datasets on every interactive filter change. Under wide date ranges and cross-filtering, this drives repeated high-memory pandas operations and can destabilize workers.

## What Changes

- Introduce a materialized Pareto aggregate layer for reject-history that precomputes dimension metrics from cached query datasets.
- Serve `/api/reject-history/batch-pareto` and related Pareto reads from pre-aggregated artifacts instead of recomputing from full detail rows each request.
- Add freshness/version metadata so aggregate snapshots stay aligned with source query datasets and policy toggles.
- Add bounded invalidation and lifecycle rules for aggregate artifacts to avoid stale growth and memory pressure.
- Add observability for aggregate build latency, hit ratio, memory footprint, and fallback reasons.
- Keep API response schema compatible with existing frontend contracts to avoid UI rewrites.

## Capabilities

### New Capabilities

- `reject-history-pareto-materialized-aggregate`: Build, store, and read pre-aggregated Pareto data for interactive cross-filter workflows.

### Modified Capabilities

- `reject-history-api`: Route Pareto endpoints to materialized aggregates with cache-consistency and fallback behavior.
- `cache-observability-hardening`: Extend telemetry to cover aggregate generation/hit/fallback and memory-guard events.

## Impact

- Affected backend code:
  - `src/mes_dashboard/services/reject_dataset_cache.py`
  - `src/mes_dashboard/services/reject_history_service.py`
  - `src/mes_dashboard/routes/reject_history_routes.py`
  - new aggregate builder/storage modules under `src/mes_dashboard/services/`
- Affected APIs:
  - `/api/reject-history/batch-pareto`
  - `/api/reject-history/reason-pareto` (cache-backed path)
- Affected runtime systems:
  - Process L1 cache, Redis/L2 cache, spool lifecycle and metrics history
- Dependencies/ops:
  - Additional Redis key space and retention policy tuning for aggregate artifacts
  - New monitoring/alerts for aggregate freshness and fallback rates
## MODIFIED Requirements

### Requirement: Cache Telemetry MUST be Queryable for Operations

The system MUST provide cache telemetry suitable for operations diagnostics, including materialized Pareto cache behavior for reject-history workloads.

#### Scenario: Telemetry inspection

- **WHEN** operators request deep health status
- **THEN** cache-related metrics/state SHALL be present and interpretable for troubleshooting

#### Scenario: Materialized Pareto telemetry visibility

- **WHEN** materialized Pareto cache is enabled
- **THEN** telemetry SHALL expose at least hit count/rate, miss count/rate, build count, build failure count, and fallback count
- **THEN** telemetry SHALL expose latest snapshot freshness indicators and aggregate payload size indicators

## ADDED Requirements

### Requirement: Pareto materialization fallback reasons SHALL be operationally classifiable

Telemetry MUST classify fallback outcomes with stable reason codes so repeated degradations can be monitored and alerted.

#### Scenario: Snapshot miss fallback reason

- **WHEN** request falls back because no snapshot exists
- **THEN** telemetry SHALL record a stable reason code for snapshot miss

#### Scenario: Snapshot stale fallback reason

- **WHEN** request falls back because snapshot fails freshness/version checks
- **THEN** telemetry SHALL record a stable reason code for stale/incompatible snapshot

#### Scenario: Build failure fallback reason

- **WHEN** request falls back because materialization build failed
- **THEN** telemetry SHALL record a stable reason code for build failure
## MODIFIED Requirements

### Requirement: Reject History API SHALL provide batch Pareto endpoint with cross-filter

The API SHALL provide a batch Pareto endpoint that returns all 6 dimension Pareto results in a single response, supporting cross-dimension filtering with exclude-self logic, and SHALL prefer materialized Pareto snapshots over full detail regrouping.

#### Scenario: Batch Pareto response structure

- **WHEN** `GET /api/reject-history/batch-pareto` is called with valid `query_id`
- **THEN** response SHALL be `{ success: true, data: { dimensions: { reason: {...}, package: {...}, type: {...}, workflow: {...}, workcenter: {...}, equipment: {...} } } }`
- **THEN** each dimension object SHALL include `items` array with schema (`reason`, `metric_value`, `pct`, `cumPct`, `MOVEIN_QTY`, `REJECT_TOTAL_QTY`, `DEFECT_QTY`, `count`)

#### Scenario: Cross-filter exclude-self logic

- **WHEN** `sel_reason=A&sel_type=X` is provided
- **THEN** reason Pareto SHALL be computed with type=X filter applied (but NOT reason=A filter)
- **THEN** type Pareto SHALL be computed with reason=A filter applied (but NOT type=X filter)
- **THEN** package/workflow/workcenter/equipment Paretos SHALL be computed with both reason=A AND type=X filters applied
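The exclude-self rule above can be sketched against a pre-aggregated cube (the function name, metric choice, and column subset are illustrative, not the project's actual code):

```python
import pandas as pd

def dimension_pareto(cube: pd.DataFrame, dim: str, selections: dict) -> pd.DataFrame:
    """Compute one dimension's Pareto from the aggregate cube, applying every
    selection EXCEPT the dimension's own selection (exclude-self)."""
    mask = pd.Series(True, index=cube.index)
    for sel_dim, sel_value in selections.items():
        if sel_dim != dim:  # exclude-self: never filter a dimension by itself
            mask &= cube[sel_dim] == sel_value
    grouped = (cube[mask].groupby(dim)["REJECT_TOTAL_QTY"].sum()
               .sort_values(ascending=False).reset_index())
    total = grouped["REJECT_TOTAL_QTY"].sum()
    grouped["pct"] = grouped["REJECT_TOTAL_QTY"] / total * 100 if total else 0.0
    grouped["cumPct"] = grouped["pct"].cumsum()
    return grouped
```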
#### Scenario: Empty selections return unfiltered Paretos

- **WHEN** batch-pareto is called with no `sel_*` parameters
- **THEN** all 6 dimensions SHALL return their full Pareto distribution (subject to `pareto_scope`)

#### Scenario: Cache-only computation

- **WHEN** `query_id` does not exist in cache
- **THEN** the endpoint SHALL return HTTP 400 with error message indicating cache miss
- **THEN** the endpoint SHALL NOT fall back to Oracle query

#### Scenario: Materialized snapshot preferred

- **WHEN** a valid and fresh materialized Pareto snapshot exists for the request context
- **THEN** the endpoint SHALL return results from that snapshot
- **THEN** the endpoint SHALL avoid full lot-level DataFrame regrouping for the same request

#### Scenario: Materialized miss fallback behavior

- **WHEN** materialized snapshot is unavailable, stale, or build fails
- **THEN** the endpoint SHALL fall back to legacy cache DataFrame computation
- **THEN** the response schema and filter semantics SHALL remain unchanged

#### Scenario: Supplementary and policy filters apply

- **WHEN** batch-pareto is called with supplementary filters (packages, workcenter_groups, reason) and policy toggles
- **THEN** all 6 dimension Paretos SHALL be computed after applying policy and supplementary filters first (before cross-filter)

#### Scenario: Display scope (TOP20) support

- **WHEN** `pareto_display_scope=top20` is provided
- **THEN** applicable dimensions (type, workflow, equipment) SHALL truncate results to top 20 items after sorting
- **WHEN** `pareto_display_scope` is omitted or `all`
- **THEN** all items SHALL be returned (subject to `pareto_scope` filter)

## ADDED Requirements

### Requirement: Reject History API SHALL expose materialized Pareto freshness metadata

The API SHALL surface stable metadata so operators and clients can identify whether Pareto responses came from materialized snapshots or fallback paths.

#### Scenario: Materialized hit metadata

- **WHEN** batch pareto response is served from materialized snapshot
- **THEN** response metadata SHALL indicate materialized source and snapshot freshness/version identifiers

#### Scenario: Fallback metadata

- **WHEN** response uses legacy fallback due to snapshot miss/stale/build failure
- **THEN** response metadata SHALL include a stable fallback reason code
## ADDED Requirements

### Requirement: Reject History Pareto materialization SHALL build reusable aggregate snapshots

The system SHALL build reusable Pareto aggregate snapshots from cached reject-history query datasets so interactive Pareto requests do not require full lot-level regrouping on every call.

#### Scenario: Build snapshot from cached dataset

- **WHEN** a valid `query_id` has cached reject-history dataset and Pareto data is requested
- **THEN** the system SHALL build a materialized snapshot containing the six supported Pareto dimensions (`reason`, `package`, `type`, `workflow`, `workcenter`, `equipment`)
- **THEN** the snapshot SHALL include quantities needed to compute `metric_value`, `pct`, `cumPct`, and affected count fields

#### Scenario: Build skipped for missing dataset cache

- **WHEN** the referenced `query_id` dataset is missing or expired
- **THEN** snapshot build SHALL NOT proceed
- **THEN** the caller SHALL receive a deterministic cache-miss outcome

### Requirement: Materialized snapshot keys SHALL encode filter identity and schema version

The system SHALL key materialized Pareto snapshots by canonical filter identity and schema version to prevent cross-context reuse.

#### Scenario: Distinct supplementary filters generate distinct snapshots

- **WHEN** two requests share the same `query_id` but differ in supplementary filters or policy toggles
- **THEN** they SHALL resolve to different materialized snapshot keys

#### Scenario: Schema version invalidates prior snapshots

- **WHEN** materialization schema version is incremented
- **THEN** snapshots produced by prior versions SHALL NOT be treated as valid hits

### Requirement: Materialized snapshots SHALL preserve cross-filter semantics

Materialized read paths SHALL produce the same cross-filter behavior as legacy DataFrame-based Pareto computation.

#### Scenario: Exclude-self behavior parity

- **WHEN** `sel_reason=A` and `sel_type=X` are active
- **THEN** reason Pareto SHALL be computed with `type=X` applied but without `reason=A` self-filter
- **THEN** type Pareto SHALL be computed with `reason=A` applied but without `type=X` self-filter

#### Scenario: Multi-dimension intersection parity

- **WHEN** multiple `sel_*` filters are active across dimensions
- **THEN** each non-excluded dimension result SHALL reflect the AND intersection of all other selected dimensions
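The parity requirement above can be expressed as a test sketch; both helper functions and the fixtures are illustrative stand-ins, not the project's actual code:

```python
import pandas as pd

def legacy_reason_pareto(lot_df, selections):
    """Stand-in for the legacy path: filter lot rows, then group."""
    mask = pd.Series(True, index=lot_df.index)
    for dim, value in selections.items():
        if dim != "reason":  # exclude-self
            mask &= lot_df[dim] == value
    return lot_df[mask].groupby("reason")["REJECT_TOTAL_QTY"].sum().sort_index()

def materialized_reason_pareto(cube, selections):
    """Same computation over the pre-aggregated cube."""
    mask = pd.Series(True, index=cube.index)
    for dim, value in selections.items():
        if dim != "reason":
            mask &= cube[dim] == value
    return cube[mask].groupby("reason")["REJECT_TOTAL_QTY"].sum().sort_index()

def assert_parity(lot_df, cube, selections):
    """Parity check: both paths must agree for the same selections."""
    pd.testing.assert_series_equal(
        legacy_reason_pareto(lot_df, selections),
        materialized_reason_pareto(cube, selections),
    )
```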
### Requirement: Materialized snapshots SHALL enforce bounded lifecycle and capacity

Materialized Pareto cache storage SHALL be bounded by TTL and size guardrails to avoid unbounded memory growth.

#### Scenario: Snapshot expiry follows configured retention

- **WHEN** a materialized snapshot exceeds configured TTL
- **THEN** it SHALL be treated as expired and SHALL NOT be returned as a cache hit

#### Scenario: Oversized snapshot handling

- **WHEN** a snapshot build exceeds configured snapshot size guardrail
- **THEN** the snapshot SHALL be rejected or degraded according to policy
- **THEN** the system SHALL record the rejection/degradation reason for operations telemetry
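A sketch of the oversize handling, assuming a hard-reject policy and a hypothetical size limit (a "degrade" policy might instead store only the top-N rows per dimension):

```python
import pickle

MAX_SNAPSHOT_BYTES = 8 * 1024 * 1024  # assumed guardrail; real value is configuration

def store_with_guardrail(store, key, payload, record):
    """Reject oversized snapshots instead of caching them, and emit a
    stable reason code for operations telemetry."""
    size = len(pickle.dumps(payload))  # serialized footprint, not in-memory size
    if size > MAX_SNAPSHOT_BYTES:
        record("snapshot_rejected:oversize")
        return False
    store[key] = payload
    record("build")
    return True
```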
## 1. Materialization Service Foundation

- [x] 1.1 Create a dedicated reject Pareto materialization service module with key builder, payload schema versioning, and read/write interfaces
- [x] 1.2 Implement canonical filter-context hashing (policy toggles, supplementary filters, trend dates) for materialized snapshot key isolation
- [x] 1.3 Implement single-flight guard for concurrent snapshot builds targeting the same key
- [x] 1.4 Add TTL and payload-size guardrails for materialized snapshots with explicit rejection paths

## 2. Snapshot Build and Compute Path

- [x] 2.1 Implement snapshot build pipeline from cached reject dataset to six-dimension aggregate structures
- [x] 2.2 Implement cross-filter evaluation on materialized structures with exclude-self parity to current batch Pareto behavior
- [x] 2.3 Implement `pareto_scope` (`top80`/`all`) and `pareto_display_scope` compatibility on materialized outputs
- [x] 2.4 Add deterministic invalidation rules for stale or schema-mismatched snapshots

## 3. API Integration and Compatibility

- [x] 3.1 Integrate materialized read-through path into `compute_batch_pareto` and cached `compute_dimension_pareto` flow
- [x] 3.2 Implement safe fallback to legacy DataFrame-based compute when snapshot is missing, stale, or build fails
- [x] 3.3 Add response metadata fields for materialized source/freshness/version and fallback reason codes without breaking existing payload schema
- [x] 3.4 Ensure cache-miss behavior for missing `query_id` remains unchanged (no Oracle fallback)

## 4. Observability and Operations Signals

- [x] 4.1 Extend cache telemetry payload to include materialized hit/miss/build/fallback counters and rates
- [x] 4.2 Add snapshot freshness and payload-size telemetry fields to deep health diagnostics
- [x] 4.3 Emit and document stable fallback reason codes (`miss`, `stale`, `build_failed`, etc.) for alert correlation
- [x] 4.4 Add logging hooks for build latency and build failure diagnostics with request/key correlation context

## 5. Validation, Rollout, and Regression Safety

- [x] 5.1 Add unit tests for key isolation, schema version invalidation, single-flight behavior, and guardrail enforcement
- [x] 5.2 Add parity tests comparing materialized and legacy results across multi-dimension cross-filter scenarios
- [x] 5.3 Add route/service tests validating metadata exposure and fallback behavior under snapshot miss/stale/build-failure paths
- [x] 5.4 Execute reject-history regression suite and stress checks for repeated Pareto filter toggling to confirm lower worker memory pressure
- [x] 5.5 Add feature-flagged rollout plan (telemetry-only -> read-through enabled -> default-on) with rollback switch