DashBoard/openspec/changes/archive/2026-02-25-hold-resource-history-dataset-cache/design.md
commit 71c8102de6 feat: dataset cache for hold/resource history + slow connection migration
Two changes combined:

1. historical-query-slow-connection: Migrate all historical query pages
   to read_sql_df_slow with semaphore concurrency control (max 3),
   raise DB slow timeout to 300s, gunicorn timeout to 360s, and
   unify frontend timeouts to 360s for all historical pages.

2. hold-resource-history-dataset-cache: Convert hold-history and
   resource-history from multi-query to single-query + dataset cache
   pattern (L1 ProcessLevelCache + L2 Redis parquet/base64, TTL=900s).
   Replace old GET endpoints with POST /query + GET /view two-phase
   API. Frontend auto-retries on 410 cache_expired.

Co-Authored-By: Claude Opus 4.6 <noreply@anthropic.com>
2026-02-25 13:15:02 +08:00

## Context
Hold-history fires 4 independent Oracle queries per user interaction (trend, reason-pareto, duration, list), all against `DW_MES_HOLDRELEASEHISTORY` with the same date range + hold_type filter. Resource-history fires 4 Oracle queries (3 parallel in summary + 1 detail), all against `DW_MES_RESOURCESTATUS_SHIFT` with the same filter set. Both pages re-query Oracle on every filter change or pagination.

Reject-history already solved this problem with a two-phase dataset cache pattern (`reject_dataset_cache.py`): one Oracle query caches the full fact set, and subsequent views are derived from the cache via pandas. This change applies the same pattern to hold-history and resource-history.
## Goals / Non-Goals
**Goals:**
- Reduce Oracle queries from 4 per interaction to 1 per user session (per filter combination)
- Same cache infrastructure as reject-history: L1 (ProcessLevelCache) + L2 (Redis parquet/base64), 15-minute TTL
- Same API pattern: POST /query (primary) + GET /view (supplementary, from cache)
- Maintain all existing UI functionality — same charts, tables, filters, pagination
- Frontend adopts queryId-based two-phase flow

**Non-Goals:**
- Changing the SQL queries themselves (same table, same WHERE logic)
- Adding new visualizations or metrics
- Modifying other pages (reject-history, query-tool, etc.)
- Changing the department endpoint on hold-history (it has unique person-level expansion logic that benefits from its own query — we keep it as a separate call)
## Decisions
### D1: Follow reject_dataset_cache.py architecture exactly
**Decision**: Create `hold_dataset_cache.py` and `resource_dataset_cache.py` following the same module structure:
- `_make_query_id()` — SHA256 hash of primary params
- `_redis_store_df()` / `_redis_load_df()` — parquet/base64 encoding
- `_get_cached_df()` / `_store_df()` — L1 → L2 read-through
- `execute_primary_query()` — Oracle query + cache + derive initial view
- `apply_view()` — read cache + filter + re-derive

**Rationale**: Proven pattern, consistent codebase, shared infrastructure. Alternatives (a custom cache format, separate cache layers) add complexity for no benefit.
### D2: Hold-history primary query scope
**Decision**: The primary query fetches ALL hold/release records for the date range (all hold_types). Trend, reason-pareto, duration, and list are all derived from this single cached DataFrame. Department remains a separate API call.

**Rationale**: Trend data already contains all 3 hold_type variants in one query. By caching the raw facts (not pre-aggregated), we can switch hold_type views instantly from cache. Department has unique person-level JOINs and GROUP BY logic that doesn't fit the "filter from flat DataFrame" pattern cleanly.

**Alternatives considered**:
- Cache per hold_type: wastes 3x cache memory, still requires Oracle for type switching
- Include department in cache: complex person-level aggregation doesn't map well to flat DataFrame filtering
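As an illustration of the "filter from flat DataFrame" pattern (column names here are hypothetical, not the real `DW_MES_HOLDRELEASEHISTORY` schema):

```python
import pandas as pd

# Hypothetical columns; the real DW_MES_HOLDRELEASEHISTORY schema may differ.
facts = pd.DataFrame({
    "hold_type": ["QA", "QA", "ENG", "PROD"],
    "hold_date": ["2026-02-01", "2026-02-02", "2026-02-01", "2026-02-02"],
    "reason": ["R1", "R2", "R1", "R3"],
    "duration_hr": [2.0, 4.0, 1.0, 3.0],
})

# Switching hold_type is a cache-local filter, not a new Oracle query.
qa = facts[facts["hold_type"] == "QA"]

# The reason pareto is re-derived from the filtered slice.
pareto = (qa.groupby("reason")["duration_hr"]
            .sum()
            .sort_values(ascending=False))
```

Because the cache holds raw facts, every hold_type slice and its pareto come out of pandas without touching Oracle.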
### D3: Resource-history primary query scope
**Decision**: The primary query fetches ALL shift-status records for the date range and resource filter combination. KPI, trend, heatmap, workcenter comparison, and detail are all derived from this single cached DataFrame.

**Rationale**: All 4 current queries (kpi, trend, heatmap, detail) use the same base WHERE clause against the same table. The aggregations (GROUP BY date for trend, GROUP BY workcenter×date for heatmap, etc.) are simple pandas operations on the cached raw data.
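A minimal sketch of those derivations (column names are hypothetical, not the real `DW_MES_RESOURCESTATUS_SHIFT` schema):

```python
import pandas as pd

# Hypothetical columns; the real DW_MES_RESOURCESTATUS_SHIFT schema may differ.
shifts = pd.DataFrame({
    "shift_date": ["2026-02-01", "2026-02-01", "2026-02-02", "2026-02-02"],
    "workcenter": ["WC1", "WC2", "WC1", "WC2"],
    "run_hours": [10.0, 8.0, 9.5, 7.0],
})

# Trend: the pandas equivalent of GROUP BY date.
trend = shifts.groupby("shift_date", as_index=False)["run_hours"].sum()

# Heatmap: workcenter x date matrix, equivalent of GROUP BY workcenter, date.
heatmap = shifts.pivot_table(index="workcenter", columns="shift_date",
                             values="run_hours", aggfunc="sum")
```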
### D4: Cache TTL = 15 minutes, same as reject-history
**Decision**: Use `_CACHE_TTL = 900` (15 min) for both modules, with L1 `max_size = 8`.

**Rationale**: Matches reject-history. 15 minutes covers typical analysis sessions, and users who need fresh data can re-query (which replaces the cache). Hold-history's existing 12-hour Redis cache for trend data holds entries far longer but serves correspondingly staler data — 15 minutes is a better balance.
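The L1 behavior this relies on can be sketched as a TTL-plus-LRU cache; this is only in the spirit of `ProcessLevelCache`, whose real implementation may differ:

```python
import time
from collections import OrderedDict


class DatasetCache:
    """Sketch of an in-process LRU cache with TTL (ProcessLevelCache-style)."""

    def __init__(self, max_size: int = 8, ttl: float = 900.0):
        self.max_size = max_size
        self.ttl = ttl
        self._store = OrderedDict()  # key -> (stored_at, value)

    def get(self, key):
        entry = self._store.get(key)
        if entry is None:
            return None
        stored_at, value = entry
        if time.monotonic() - stored_at > self.ttl:  # expired entry
            del self._store[key]
            return None
        self._store.move_to_end(key)  # mark as most recently used
        return value

    def put(self, key, value):
        self._store[key] = (time.monotonic(), value)
        self._store.move_to_end(key)
        while len(self._store) > self.max_size:
            self._store.popitem(last=False)  # evict least recently used
```

With `max_size=8`, at most 8 cached DataFrames live per worker, bounding the memory cost noted under Risks.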
### D5: API contract — POST /query + GET /view
**Decision**: Both pages switch to:
- `POST /api/hold-history/query` → primary query, returns `query_id` + initial view (trend, reason, duration, list page 1)
- `GET /api/hold-history/view` → supplementary filter/pagination from cache
- `POST /api/resource/history/query` → primary query, returns `query_id` + initial view (summary + detail page 1)
- `GET /api/resource/history/view` → supplementary filter/pagination from cache

Old GET endpoints (trend, reason-pareto, duration, list, summary, detail) are removed.

**Rationale**: Same pattern as reject-history. POST for primary (sends filter params in body), GET for view (sends query_id + supplementary params).
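The contract can be sketched framework-agnostically as handlers returning `(status, body)`; the Oracle call and the L1/L2 cache are stubbed, and the function names are illustrative, not the real route handlers:

```python
import hashlib
import json

_DATASET_CACHE = {}  # stand-in for the L1/L2 dataset cache


def post_query(filters: dict):
    # POST /query: run the primary query (stubbed here), cache the facts,
    # and return query_id plus the initial view.
    query_id = hashlib.sha256(
        json.dumps(filters, sort_keys=True).encode("utf-8")
    ).hexdigest()
    rows = [{"hold_type": "QA", "count": 3}]  # placeholder for the Oracle result
    _DATASET_CACHE[query_id] = rows
    return 200, {"query_id": query_id, "view": {"list": rows}}


def get_view(query_id: str, view_params: dict):
    # GET /view: derive a supplementary view from cache only; a missing or
    # expired entry yields 410 so the frontend re-runs the primary query.
    rows = _DATASET_CACHE.get(query_id)
    if rows is None:
        return 410, {"error": "cache_expired"}
    return 200, {"view": {"list": rows}, "params": view_params}
```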
### D6: Frontend queryId-based flow
**Decision**: Both `App.vue` files adopt the two-phase pattern:
1. User clicks "查詢" (Query) → `POST /query` → store `queryId`
2. Filter change / pagination → `GET /view?query_id=...&filters...` (no Oracle)
3. Cache expired (HTTP 410) → auto re-execute primary query

**Rationale**: Proven pattern from reject-history. Keeps the UI responsive after the initial query.
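The auto-retry lives in `App.vue`; the control flow can be sketched language-neutrally in Python (all names here are hypothetical):

```python
def fetch_view(run_primary_query, get_view, query_id, params):
    # Try the cache-backed view; on 410 (cache_expired) re-run the primary
    # query once and retry with the fresh query_id, mirroring the frontend.
    status, body = get_view(query_id, params)
    if status == 410:
        query_id = run_primary_query()
        status, body = get_view(query_id, params)
    return query_id, status, body
```

A single retry suffices because `run_primary_query` always repopulates the cache for the returned `query_id`.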
## Risks / Trade-offs
- **[Memory]** Caching full DataFrames in L1 (per-worker) uses more RAM than current approach → Mitigated by `max_size=8` LRU eviction (same as reject-history, works well in production)
- **[Staleness]** 15-min TTL means data could be up to 15 minutes old during an analysis session → Acceptable for historical analysis; user can re-query for fresh data
- **[Department endpoint]** Hold-history department still makes a separate Oracle call → Acceptable trade-off; person-level aggregation doesn't fit flat DataFrame model. Could be addressed later.
- **[Breaking API]** Old GET endpoints removed → No external consumers; frontend is the only client
- **[Redis dependency]** If Redis is down, only L1 cache works (per-worker, not cross-worker) → Same behavior as reject-history; L1 still provides 15-min cache per worker