feat(reject-history): add materialized Pareto aggregate layer with feature-flagged rollout

Pre-compute 6-dimension metric cubes from cached LOT-level DataFrames so
interactive Pareto requests read compact snapshots instead of re-scanning
detail rows on every filter change. Includes single-flight build guard,
TTL/size guardrails, cross-filter exclude-self evaluation, safe legacy
fallback, response metadata exposure, telemetry counters, and a 3-stage
rollout plan (telemetry-only → build-enabled → read-through).

Co-Authored-By: Claude Opus 4.6 <noreply@anthropic.com>
egg
2026-03-04 08:05:02 +08:00
parent 98eea066ea
commit e79fb657a3
22 changed files with 2500 additions and 484 deletions


@@ -1,88 +1,30 @@
## Purpose
Define stable requirements for cache-observability-hardening.
## Requirements
### Requirement: Layered Cache SHALL Expose Operational State
The route cache implementation SHALL expose layered cache operational state, including mode, freshness, and degradation status.
#### Scenario: Redis unavailable degradation state
- **WHEN** Redis is unavailable
- **THEN** health endpoints MUST indicate degraded cache mode while keeping L1 memory cache active
## MODIFIED Requirements
### Requirement: Cache Telemetry MUST be Queryable for Operations
The system MUST provide cache telemetry suitable for operations diagnostics, including materialized Pareto cache behavior for reject-history workloads.
#### Scenario: Telemetry inspection
- **WHEN** operators request deep health status
- **THEN** cache-related metrics/state SHALL be present and interpretable for troubleshooting
### Requirement: Health Endpoints SHALL Expose Pool Saturation and Degradation Reason Codes
Operational health endpoints MUST report connection pool saturation indicators and explicit degradation reason codes.
#### Scenario: Materialized Pareto telemetry visibility
- **WHEN** materialized Pareto cache is enabled
- **THEN** telemetry SHALL expose at least hit count/rate, miss count/rate, build count, build failure count, and fallback count
- **THEN** telemetry SHALL expose latest snapshot freshness indicators and aggregate payload size indicators
#### Scenario: Pool saturation observed
- **WHEN** checked-out connections and overflow approach configured limits
- **THEN** deep health output MUST expose saturation metrics and degraded reason classification
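The counters required by the materialized-Pareto telemetry scenario above can be sketched as a small bundle exposed through the deep-health payload. This is a minimal illustration only; the class and method names (`ParetoCacheStats`, `as_health_payload`) are hypothetical, not the real telemetry API.

```python
from dataclasses import dataclass


@dataclass
class ParetoCacheStats:
    """Illustrative counter bundle for materialized-Pareto telemetry."""
    hits: int = 0
    misses: int = 0
    builds: int = 0
    build_failures: int = 0
    fallbacks: int = 0

    def as_health_payload(self) -> dict:
        # Derive hit rate lazily so counters stay cheap on the hot path.
        total = self.hits + self.misses
        return {
            "hit_count": self.hits,
            "miss_count": self.misses,
            "hit_rate": (self.hits / total) if total else None,
            "build_count": self.builds,
            "build_failure_count": self.build_failures,
            "fallback_count": self.fallbacks,
        }


stats = ParetoCacheStats(hits=9, misses=1, builds=1, fallbacks=1)
payload = stats.as_health_payload()
```

Freshness and payload-size indicators would sit alongside these counters in the same payload.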
## ADDED Requirements
### Requirement: Degraded Responses MUST Be Correlatable Across API and Health Telemetry
Error responses for degraded states SHALL include stable codes that can be mapped to health telemetry and operational dashboards.
### Requirement: Pareto materialization fallback reasons SHALL be operationally classifiable
Telemetry MUST classify fallback outcomes with stable reason codes so repeated degradations can be monitored and alerted on.
#### Scenario: Degraded API response correlation
- **WHEN** an API request fails due to circuit-open or pool-exhausted conditions
- **THEN** operators MUST be able to match the response code to current health telemetry state
#### Scenario: Snapshot miss fallback reason
- **WHEN** a request falls back because no snapshot exists
- **THEN** telemetry SHALL record a stable reason code for snapshot miss
### Requirement: Operational Alert Thresholds SHALL Be Explicitly Defined
The system MUST define alert thresholds for sustained degraded state, repeated worker recovery, and abnormal retry pressure.
#### Scenario: Sustained degradation threshold exceeded
- **WHEN** degraded status persists beyond configured duration
- **THEN** the monitoring contract MUST classify the service as alert-worthy with actionable context
### Requirement: Cache Telemetry SHALL Include Memory Amplification Signals
Operational telemetry MUST expose cache-domain memory usage indicators and representation amplification factors, and MUST differentiate between authoritative data payload and derived/index helper structures.
#### Scenario: Deep health telemetry request after representation normalization
- **WHEN** operators inspect cache telemetry for resource or WIP domains
- **THEN** telemetry MUST include per-domain memory footprint, amplification indicators, and enough structure detail to verify that full-record duplication is not reintroduced
### Requirement: Efficiency Benchmarks SHALL Gate Cache Refactor Rollout
Cache/query efficiency changes MUST be validated against baseline latency and memory benchmarks before rollout.
#### Scenario: Pre-release validation
- **WHEN** cache refactor changes are prepared for deployment
- **THEN** benchmark results MUST demonstrate no regression beyond configured thresholds for P95 latency and memory usage
### Requirement: Process-Level Cache SHALL Use Bounded Capacity with Deterministic Eviction
Process-level parsed-data caches MUST enforce a configurable maximum key capacity and use deterministic eviction behavior when capacity is exceeded.
#### Scenario: Cache capacity reached
- **WHEN** a new cache entry is inserted and key capacity is at limit
- **THEN** cache MUST evict entries according to defined policy before storing the new key
#### Scenario: Repeated access updates recency
- **WHEN** an existing cache key is read or overwritten
- **THEN** eviction order MUST reflect recency semantics so hot keys are retained preferentially
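The bounded-capacity requirement above maps naturally onto an `OrderedDict`-based LRU. A minimal sketch (class name `BoundedLRU` is illustrative, not the project's implementation):

```python
from collections import OrderedDict


class BoundedLRU:
    """Bounded process cache with deterministic LRU eviction."""

    def __init__(self, max_keys: int):
        self._max = max_keys
        self._data = OrderedDict()

    def get(self, key):
        if key not in self._data:
            return None
        self._data.move_to_end(key)  # reads refresh recency, per the scenario
        return self._data[key]

    def put(self, key, value):
        if key in self._data:
            self._data.move_to_end(key)  # overwrite also refreshes recency
        self._data[key] = value
        while len(self._data) > self._max:
            self._data.popitem(last=False)  # evict least-recently-used first
```

Eviction is deterministic: the oldest untouched key always leaves first, so hot keys are retained preferentially.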
### Requirement: Cache Publish MUST Preserve Previous Readable Snapshot on Failure
When refreshing full-table cache payloads, the system MUST avoid exposing partially published states to readers.
#### Scenario: Publish fails after payload serialization
- **WHEN** a cache refresh has prepared new payload but publish operation fails
- **THEN** previously published cache keys MUST remain readable and metadata MUST remain consistent with old snapshot
#### Scenario: Publish succeeds
- **WHEN** publish operation completes successfully
- **THEN** data payload and metadata keys MUST be visible as one coherent new snapshot
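One way to satisfy both scenarios above is to prepare the payload outside any published state and swap a single reference only on success, so a failed build never disturbs the old snapshot. A hedged sketch (names are hypothetical; the real publish path may use Redis keys rather than an in-process reference):

```python
import threading


class SnapshotStore:
    """Publish payload + metadata as one atomic reference swap."""

    def __init__(self):
        self._lock = threading.Lock()
        self._current = None  # (payload, meta), replaced atomically

    def publish(self, build_payload) -> bool:
        try:
            payload, meta = build_payload()  # serialize BEFORE touching state
        except Exception:
            return False  # old snapshot stays readable and consistent
        with self._lock:
            self._current = (payload, meta)  # one coherent new snapshot
        return True

    def read(self):
        return self._current


store = SnapshotStore()
store.publish(lambda: ({"rows": 1}, {"version": 1}))


def _failing_build():
    raise RuntimeError("serialize failed")


ok = store.publish(_failing_build)
```

Readers observe either the whole old snapshot or the whole new one, never a mix.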
### Requirement: Process-Level Cache Slow Path SHALL Minimize Lock Hold Time
Large payload parsing MUST NOT happen inside long-held process cache locks.
#### Scenario: Cache miss under concurrent requests
- **WHEN** multiple requests hit process cache miss
- **THEN** parsing work SHALL happen outside lock-protected mutation section, and lock scope SHALL be limited to consistency check + commit
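The lock-scope rule above amounts to a check/parse/commit split: a short locked lookup, heavy parsing outside the lock, then a short locked commit. A minimal sketch (helper names are illustrative; under a race two threads may parse the same payload, which the pattern accepts in exchange for short lock hold times):

```python
import threading

_cache = {}
_lock = threading.Lock()


def get_parsed(key, raw_bytes, parse):
    with _lock:  # short lock: lookup only
        hit = _cache.get(key)
    if hit is not None:
        return hit
    parsed = parse(raw_bytes)  # heavy parsing happens OUTSIDE the lock
    with _lock:  # short lock: consistency check + commit
        return _cache.setdefault(key, parsed)


first = get_parsed("cfg", b"1,2", lambda b: b.decode().split(","))
second = get_parsed("cfg", b"1,2", lambda b: b.decode().split(","))
```

`setdefault` makes the commit idempotent: if another thread won the race, its entry is kept and returned.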
### Requirement: Process-Level Cache Policies MUST Stay Consistent Across Services
All service-local process caches MUST support bounded capacity with deterministic eviction.
#### Scenario: Realtime equipment cache growth
- **WHEN** realtime equipment process cache reaches configured capacity
- **THEN** entries MUST be evicted according to deterministic LRU behavior
#### Scenario: Snapshot stale fallback reason
- **WHEN** a request falls back because the snapshot fails freshness/version checks
- **THEN** telemetry SHALL record a stable reason code for a stale/incompatible snapshot
#### Scenario: Build failure fallback reason
- **WHEN** a request falls back because the materialization build failed
- **THEN** telemetry SHALL record a stable reason code for build failure
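The three fallback scenarios above suggest a small closed enum of reason codes. A hedged sketch, assuming string-valued codes so they serialize directly into telemetry and response metadata (enum and function names are illustrative):

```python
from enum import Enum


class ParetoFallbackReason(str, Enum):
    """Stable, alertable reason codes for materialized-Pareto fallbacks."""
    SNAPSHOT_MISS = "snapshot_miss"
    SNAPSHOT_STALE = "snapshot_stale"
    BUILD_FAILED = "build_failed"


def classify_fallback(snapshot, fresh: bool, build_ok: bool):
    """Map an observed fallback condition to its reason code.

    Returns None when no fallback occurred.
    """
    if snapshot is None:
        return ParetoFallbackReason.SNAPSHOT_MISS
    if not fresh:
        return ParetoFallbackReason.SNAPSHOT_STALE
    if not build_ok:
        return ParetoFallbackReason.BUILD_FAILED
    return None
```

Keeping the set closed lets dashboards alert on any unexpected growth in a specific code.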


@@ -1,105 +1,12 @@
# reject-history-api Specification
## Purpose
TBD - created by archiving change reject-history-query-page. Update Purpose after archive.
## Requirements
### Requirement: Reject History API SHALL validate required query parameters
The API SHALL validate date parameters and basic paging bounds before executing database work.
#### Scenario: Missing required dates
- **WHEN** a reject-history endpoint requiring date range is called without `start_date` or `end_date`
- **THEN** the API SHALL return HTTP 400 with a descriptive validation error
#### Scenario: Invalid date order
- **WHEN** `end_date` is earlier than `start_date`
- **THEN** the API SHALL return HTTP 400 and SHALL NOT run SQL queries
#### Scenario: Date range exceeds maximum
- **WHEN** the date range between `start_date` and `end_date` exceeds 730 days
- **THEN** the API SHALL return HTTP 400 with error message "日期範圍不可超過 730 天" ("date range must not exceed 730 days")
### Requirement: Reject History API primary query response SHALL include partial failure metadata
The primary query endpoint SHALL include batch execution completeness information in the response `meta` field when chunks fail during batch query execution.
#### Scenario: Partial failure metadata in response
- **WHEN** `POST /api/reject-history/query` completes with some chunks failing
- **THEN** the response SHALL include `meta.has_partial_failure: true`
- **THEN** the response SHALL include `meta.failed_chunk_count` as a positive integer
- **THEN** the response SHALL include `meta.failed_ranges` as an array of `{start, end}` date strings (if available)
- **THEN** the HTTP status SHALL still be 200 (data is partially available)
#### Scenario: No partial failure metadata on full success
- **WHEN** `POST /api/reject-history/query` completes with all chunks succeeding
- **THEN** the response `meta` SHALL NOT include `has_partial_failure`, `failed_chunk_count`, or `failed_ranges`
#### Scenario: Partial failure metadata preserved on cache hit
- **WHEN** `POST /api/reject-history/query` returns cached data that originally had partial failures
- **THEN** the response SHALL include the same `meta.has_partial_failure`, `meta.failed_chunk_count`, and `meta.failed_ranges` as the original response
### Requirement: Reject History API SHALL provide summary metrics endpoint
The API SHALL provide aggregated summary metrics for the selected filter context.
#### Scenario: Summary response payload
- **WHEN** `GET /api/reject-history/summary` is called with valid filters
- **THEN** response SHALL be `{ success: true, data: { ... } }`
- **THEN** data SHALL include `MOVEIN_QTY`, `REJECT_TOTAL_QTY`, `DEFECT_QTY`, `REJECT_RATE_PCT`, `DEFECT_RATE_PCT`, `REJECT_SHARE_PCT`, `AFFECTED_LOT_COUNT`, and `AFFECTED_WORKORDER_COUNT`
### Requirement: Reject History API SHALL support yield-exclusion policy toggle
The API SHALL support excluding or including policy-marked scrap reasons through a shared query parameter.
#### Scenario: Default policy mode
- **WHEN** reject-history endpoints are called without `include_excluded_scrap`
- **THEN** `include_excluded_scrap` SHALL default to `false`
- **THEN** rows mapped to `ERP_PJ_WIP_SCRAP_REASONS_EXCLUDE.ENABLE_FLAG='Y'` SHALL be excluded from yield-related calculations
#### Scenario: Explicitly include policy-marked scrap
- **WHEN** `include_excluded_scrap=true` is provided
- **THEN** policy-marked rows SHALL be included in summary/trend/pareto/list/export calculations
- **THEN** API response `meta` SHALL include the effective `include_excluded_scrap` value
#### Scenario: Invalid toggle value
- **WHEN** `include_excluded_scrap` is not parseable as boolean
- **THEN** the API SHALL return HTTP 400 with a descriptive validation error
### Requirement: Reject History API SHALL provide trend endpoint
The API SHALL return time-series trend data for quantity and rate metrics.
#### Scenario: Trend response structure
- **WHEN** `GET /api/reject-history/trend` is called
- **THEN** response SHALL be `{ success: true, data: { items: [...] } }`
- **THEN** each trend item SHALL contain bucket date, `REJECT_TOTAL_QTY`, `DEFECT_QTY`, `REJECT_RATE_PCT`, and `DEFECT_RATE_PCT`
#### Scenario: Trend granularity
- **WHEN** `granularity` is provided as `day`, `week`, or `month`
- **THEN** the API SHALL aggregate by the requested granularity
- **THEN** invalid granularity SHALL return HTTP 400
### Requirement: Reject History API SHALL provide reason Pareto endpoint
The API SHALL return sorted reason distribution data with cumulative percentages. The endpoint supports dimension selection via `dimension` parameter for single-dimension queries.
#### Scenario: Pareto response payload
- **WHEN** `GET /api/reject-history/reason-pareto` is called
- **THEN** each item SHALL include `reason`, `category`, selected metric value, `pct`, and `cumPct`
- **THEN** items SHALL be sorted descending by selected metric
#### Scenario: Metric mode validation
- **WHEN** `metric_mode` is provided
- **THEN** accepted values SHALL be `reject_total` or `defect`
- **THEN** invalid `metric_mode` SHALL return HTTP 400
#### Scenario: Dimension selection
- **WHEN** `dimension` parameter is provided with a valid value (reason, package, type, workflow, workcenter, equipment)
- **THEN** the endpoint SHALL return Pareto data for that dimension
- **WHEN** `dimension` is not provided
- **THEN** the endpoint SHALL default to `reason`
## MODIFIED Requirements
### Requirement: Reject History API SHALL provide batch Pareto endpoint with cross-filter
The API SHALL provide a batch Pareto endpoint that returns all 6 dimension Pareto results in a single response, supporting cross-dimension filtering with exclude-self logic, and SHALL prefer materialized Pareto snapshots over full detail regrouping.
#### Scenario: Batch Pareto response structure
- **WHEN** `GET /api/reject-history/batch-pareto` is called with valid `query_id`
- **THEN** response SHALL be `{ success: true, data: { dimensions: { reason: {...}, package: {...}, type: {...}, workflow: {...}, workcenter: {...}, equipment: {...} } } }`
- **THEN** each dimension object SHALL include `items` array with schema (`reason`, `metric_value`, `pct`, `cumPct`, `MOVEIN_QTY`, `REJECT_TOTAL_QTY`, `DEFECT_QTY`, `count`)
#### Scenario: Cross-filter exclude-self logic
- **WHEN** `sel_reason=A&sel_type=X` is provided
@@ -109,112 +16,42 @@ The API SHALL provide a batch Pareto endpoint that returns all 6 dimension Paret
#### Scenario: Empty selections return unfiltered Paretos
- **WHEN** batch-pareto is called with no `sel_*` parameters
- **THEN** all 6 dimensions SHALL return their full Pareto distribution (subject to `pareto_scope`)
#### Scenario: Cache-only computation
- **WHEN** `query_id` does not exist in cache
- **THEN** the endpoint SHALL return HTTP 400 with error message indicating cache miss
- **THEN** the endpoint SHALL NOT fall back to Oracle query
#### Scenario: Materialized snapshot preferred
- **WHEN** a valid and fresh materialized Pareto snapshot exists for the request context
- **THEN** the endpoint SHALL return results from that snapshot
- **THEN** the endpoint SHALL avoid full lot-level DataFrame regrouping for the same request
#### Scenario: Materialized miss fallback behavior
- **WHEN** the materialized snapshot is unavailable or stale, or its build fails
- **THEN** the endpoint SHALL fall back to legacy cached-DataFrame computation
- **THEN** the response schema and filter semantics SHALL remain unchanged
#### Scenario: Supplementary and policy filters apply
- **WHEN** batch-pareto is called with supplementary filters (packages, workcenter_groups, reason) and policy toggles
- **THEN** all 6 dimension Paretos SHALL be computed after applying policy and supplementary filters first (before cross-filter)
#### Scenario: Data source is cached DataFrame only
- **WHEN** batch-pareto computes dimension Paretos
- **THEN** computation SHALL operate on the in-memory cached Pandas DataFrame (populated by the primary query)
- **THEN** the endpoint SHALL NOT issue any additional Oracle database queries
- **THEN** response time SHALL be sub-100ms since all computation is in-memory
#### Scenario: Display scope (TOP20) support
- **WHEN** `pareto_display_scope=top20` is provided
- **THEN** applicable dimensions (type, workflow, equipment) SHALL truncate results to top 20 items after sorting
- **WHEN** `pareto_display_scope` is omitted or `all`
- **THEN** all items SHALL be returned (subject to `pareto_scope` filter)
### Requirement: Reject History API SHALL support multi-dimension Pareto selection in view and export
The detail view and export endpoints SHALL accept multiple dimension selections simultaneously and apply them with AND logic.
## ADDED Requirements
#### Scenario: Multi-dimension filter on view endpoint
- **WHEN** `GET /api/reject-history/view` is called with `sel_reason=A&sel_type=X`
- **THEN** returned rows SHALL match reason=A AND type=X (both filters applied simultaneously)
### Requirement: Reject History API SHALL expose materialized Pareto freshness metadata
The API SHALL surface stable metadata so operators and clients can identify whether Pareto responses came from materialized snapshots or fallback paths.
#### Scenario: Multi-dimension filter on export endpoint
- **WHEN** `GET /api/reject-history/export-cached` is called with `sel_reason=A&sel_type=X`
- **THEN** exported CSV SHALL contain only rows matching reason=A AND type=X
#### Scenario: Backward compatibility with single-dimension params
- **WHEN** `pareto_dimension` and `pareto_values` are provided (legacy format)
- **THEN** the API SHALL still accept and apply them as before
- **WHEN** both `sel_*` params and legacy params are provided
- **THEN** `sel_*` params SHALL take precedence
### Requirement: Reject History API SHALL provide paginated detail endpoint
The API SHALL return paginated detailed rows for the selected filter context.
#### Scenario: List response payload
- **WHEN** `GET /api/reject-history/list?page=1&per_page=50` is called
- **THEN** response SHALL include `{ items: [...], pagination: { page, perPage, total, totalPages } }`
- **THEN** each row SHALL include date, process dimensions, reason fields, `MOVEIN_QTY`, `REJECT_TOTAL_QTY`, `DEFECT_QTY`, and reject component columns
#### Scenario: Paging bounds
- **WHEN** `page < 1`
- **THEN** page SHALL be treated as 1
- **WHEN** `per_page > 200`
- **THEN** `per_page` SHALL be capped at 200
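The paging-bounds rules above reduce to a small clamp helper. A sketch only; the function name and the `50` default for an unparseable `per_page` are assumptions, while the floor of 1 and cap of 200 come from the scenarios:

```python
def clamp_paging(page, per_page, max_per_page=200):
    """Normalize paging inputs: page < 1 becomes 1, per_page is capped."""
    try:
        page = int(page)
    except (TypeError, ValueError):
        page = 1
    try:
        per_page = int(per_page)
    except (TypeError, ValueError):
        per_page = 50  # assumed default page size, not stated by the spec
    return max(page, 1), min(max(per_page, 1), max_per_page)
```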
### Requirement: Reject History API SHALL provide CSV export endpoint
The API SHALL provide CSV export using the same filter and metric semantics as list/query APIs.
#### Scenario: Export payload consistency
- **WHEN** `GET /api/reject-history/export` is called with valid filters
- **THEN** CSV headers SHALL include both `REJECT_TOTAL_QTY` and `DEFECT_QTY`
- **THEN** export rows SHALL follow the same semantic definitions as summary/list endpoints
#### Scenario: Cached export supports full detail-filter parity
- **WHEN** `GET /api/reject-history/export-cached` is called with an existing `query_id`
- **THEN** the endpoint SHALL apply primary policy toggles, supplementary filters, trend-date filters, metric filter, and Pareto-selected item filters
- **THEN** returned rows SHALL match the same filtered detail dataset semantics used by `GET /api/reject-history/view`
#### Scenario: CSV encoding and escaping are stable
- **WHEN** either export endpoint returns CSV
- **THEN** response charset SHALL be `utf-8-sig`
- **THEN** values containing commas, quotes, or newlines SHALL be CSV-escaped correctly
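The encoding/escaping scenario above can be satisfied by the stdlib `csv` writer plus a `utf-8-sig` encode, which prepends the BOM that Excel needs to detect UTF-8. A minimal sketch (function name illustrative):

```python
import csv
import io


def to_csv_bytes(rows, headers) -> bytes:
    """Render rows as a CSV payload with stable escaping and BOM."""
    buf = io.StringIO()
    writer = csv.writer(buf)  # handles commas, quotes, newlines per RFC 4180
    writer.writerow(headers)
    writer.writerows(rows)
    return buf.getvalue().encode("utf-8-sig")  # BOM-prefixed UTF-8


payload = to_csv_bytes(
    [["a,b", 'say "hi"', 3]],
    ["REASON", "NOTE", "DEFECT_QTY"],
)
```

Values containing delimiters are quoted and embedded quotes doubled, so downstream parsers round-trip the data exactly.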
### Requirement: Reject History API SHALL centralize SQL in reject_history SQL directory
The service SHALL load SQL from dedicated files under `src/mes_dashboard/sql/reject_history/`.
#### Scenario: SQL file loading
- **WHEN** reject-history service executes queries
- **THEN** SQL SHALL be loaded from files in `sql/reject_history`
- **THEN** user-supplied filters SHALL be passed through bind parameters
- **THEN** user input SHALL NOT be interpolated into SQL strings directly
### Requirement: Reject History API SHALL use cached exclusion-policy source
The API SHALL read exclusion-policy reasons from cached `ERP_PJ_WIP_SCRAP_REASONS_EXCLUDE` data instead of querying Oracle on every request.
#### Scenario: Enabled exclusions only
- **WHEN** exclusion-policy data is loaded
- **THEN** only rows with `ENABLE_FLAG='Y'` SHALL be treated as active exclusions
#### Scenario: Daily full-table cache refresh
- **WHEN** exclusion cache is initialized
- **THEN** the full table SHALL be loaded and refreshed at least once per 24 hours
- **THEN** Redis SHOULD be used as shared cache when available, with in-memory fallback when unavailable
### Requirement: Reject History API SHALL apply rate limiting on expensive endpoints
The API SHALL rate-limit high-cost endpoints to protect Oracle and application resources.
#### Scenario: List and export rate limiting
- **WHEN** `/api/reject-history/list` or `/api/reject-history/export` receives excessive requests
- **THEN** configured rate limiting SHALL reject requests beyond the threshold within the time window
### Requirement: Database query execution path
The reject-history service (`reject_history_service.py` and `reject_dataset_cache.py`) SHALL use `read_sql_df_slow` (dedicated connection) instead of `read_sql_df` (pooled connection) for all Oracle queries.
#### Scenario: Primary query uses dedicated connection
- **WHEN** the reject-history primary query is executed
- **THEN** it uses `read_sql_df_slow` which creates a dedicated Oracle connection outside the pool
- **AND** the connection has a 300-second call_timeout (configurable)
- **AND** the connection is subject to the global slow query semaphore
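The semaphore constraint above can be sketched as a context manager that gates dedicated-connection creation. This is a hypothetical illustration of the guard assumed by `read_sql_df_slow`; the slot count of 3, the helper name, and the timeout semantics are assumptions, not the real implementation:

```python
import threading
from contextlib import contextmanager

# Illustrative global cap on concurrent slow queries.
SLOW_QUERY_SLOTS = threading.Semaphore(3)


@contextmanager
def slow_query_slot(timeout: float = 300):
    """Hold a global slow-query slot while a dedicated connection is used."""
    if not SLOW_QUERY_SLOTS.acquire(timeout=timeout):
        raise TimeoutError("slow query slots exhausted")
    try:
        yield  # caller opens its dedicated (non-pooled) Oracle connection here
    finally:
        SLOW_QUERY_SLOTS.release()  # slot freed even if the query raises
```

Keeping the semaphore outside the connection pool ensures slow scans cannot starve pooled fast-path queries.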
#### Scenario: Materialized hit metadata
- **WHEN** a batch Pareto response is served from a materialized snapshot
- **THEN** response metadata SHALL indicate materialized source and snapshot freshness/version identifiers
#### Scenario: Fallback metadata
- **WHEN** response uses legacy fallback due to snapshot miss/stale/build failure
- **THEN** response metadata SHALL include a stable fallback reason code


@@ -0,0 +1,49 @@
## ADDED Requirements
### Requirement: Reject History Pareto materialization SHALL build reusable aggregate snapshots
The system SHALL build reusable Pareto aggregate snapshots from cached reject-history query datasets so interactive Pareto requests do not require full lot-level regrouping on every call.
#### Scenario: Build snapshot from cached dataset
- **WHEN** a valid `query_id` has cached reject-history dataset and Pareto data is requested
- **THEN** the system SHALL build a materialized snapshot containing the six supported Pareto dimensions (`reason`, `package`, `type`, `workflow`, `workcenter`, `equipment`)
- **THEN** the snapshot SHALL include quantities needed to compute `metric_value`, `pct`, `cumPct`, and affected count fields
#### Scenario: Build skipped for missing dataset cache
- **WHEN** the referenced `query_id` dataset is missing or expired
- **THEN** snapshot build SHALL NOT proceed
- **THEN** the caller SHALL receive a deterministic cache-miss outcome
### Requirement: Materialized snapshot keys SHALL encode filter identity and schema version
The system SHALL key materialized Pareto snapshots by canonical filter identity and schema version to prevent cross-context reuse.
#### Scenario: Distinct supplementary filters generate distinct snapshots
- **WHEN** two requests share the same `query_id` but differ in supplementary filters or policy toggles
- **THEN** they SHALL resolve to different materialized snapshot keys
#### Scenario: Schema version invalidates prior snapshots
- **WHEN** materialization schema version is incremented
- **THEN** snapshots produced by prior versions SHALL NOT be treated as valid hits
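The keying requirement above amounts to canonicalizing filter identity and embedding the schema version in the key itself. A hedged sketch (the version number, key prefix, and digest length are illustrative):

```python
import hashlib
import json

SCHEMA_VERSION = 3  # illustrative; bumping it invalidates all prior snapshots


def snapshot_key(query_id: str, filters: dict) -> str:
    """Build a canonical snapshot key from filter identity + schema version."""
    # Sort keys and normalize list order so semantically equal filter sets
    # always hash identically.
    canonical = json.dumps(
        {k: sorted(v) if isinstance(v, list) else v
         for k, v in sorted(filters.items())},
        separators=(",", ":"),
        ensure_ascii=False,
    )
    digest = hashlib.sha256(canonical.encode("utf-8")).hexdigest()[:16]
    return f"pareto:v{SCHEMA_VERSION}:{query_id}:{digest}"
```

Because the version lives in the key rather than the value, a schema bump makes old snapshots unreachable without any explicit purge.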
### Requirement: Materialized snapshots SHALL preserve cross-filter semantics
Materialized read paths SHALL produce the same cross-filter behavior as legacy DataFrame-based Pareto computation.
#### Scenario: Exclude-self behavior parity
- **WHEN** `sel_reason=A` and `sel_type=X` are active
- **THEN** reason Pareto SHALL be computed with `type=X` applied but without `reason=A` self-filter
- **THEN** type Pareto SHALL be computed with `reason=A` applied but without `type=X` self-filter
#### Scenario: Multi-dimension intersection parity
- **WHEN** multiple `sel_*` filters are active across dimensions
- **THEN** each non-excluded dimension result SHALL reflect the AND intersection of all other selected dimensions
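The exclude-self and intersection scenarios above can be captured in one filtering helper: when aggregating dimension D, apply every active `sel_*` filter except D's own. A dependency-free sketch (the real path filters cached DataFrames; names here are illustrative):

```python
def filter_rows(rows, selections, exclude_dim):
    """Apply all sel_* filters EXCEPT the one for the dimension being built.

    `selections` maps dimension name -> set of selected values.
    """
    out = rows
    for dim, wanted in selections.items():
        if dim == exclude_dim:
            continue  # self-filter skipped: own Pareto keeps full distribution
        out = [r for r in out if r[dim] in wanted]  # AND intersection
    return out
```

Each dimension's Pareto is then aggregated from its own `filter_rows(...)` result, giving exclude-self parity by construction.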
### Requirement: Materialized snapshots SHALL enforce bounded lifecycle and capacity
Materialized Pareto cache storage SHALL be bounded by TTL and size guardrails to avoid unbounded memory growth.
#### Scenario: Snapshot expiry follows configured retention
- **WHEN** a materialized snapshot exceeds configured TTL
- **THEN** it SHALL be treated as expired and SHALL NOT be returned as a cache hit
#### Scenario: Oversized snapshot handling
- **WHEN** a snapshot build exceeds configured snapshot size guardrail
- **THEN** the snapshot SHALL be rejected or degraded according to policy
- **THEN** the system SHALL record the rejection/degradation reason for operations telemetry