Two changes combined:

1. historical-query-slow-connection: Migrate all historical query pages to `read_sql_df_slow` with semaphore concurrency control (max 3), raise the DB slow timeout to 300s and the gunicorn timeout to 360s, and unify frontend timeouts at 360s for all historical pages.
2. hold-resource-history-dataset-cache: Convert hold-history and resource-history from the multi-query pattern to a single-query + dataset-cache pattern (L1 ProcessLevelCache + L2 Redis parquet/base64, TTL=900s). Replace the old GET endpoints with a POST /query + GET /view two-phase API. The frontend auto-retries on 410 cache_expired.

Co-Authored-By: Claude Opus 4.6 <noreply@anthropic.com>
## Context
Hold-history fires 4 independent Oracle queries per user interaction (trend, reason-pareto, duration, list), all against DW_MES_HOLDRELEASEHISTORY with the same date range + hold_type filter. Resource-history fires 4 Oracle queries (3 parallel in summary + 1 detail), all against DW_MES_RESOURCESTATUS_SHIFT with the same filter set. Both pages re-query Oracle on every filter change or pagination.
Reject-history already solved this problem with a two-phase dataset cache pattern (reject_dataset_cache.py): one Oracle query caches the full fact set, subsequent views are derived from cache via pandas. This change applies the same pattern to hold-history and resource-history.
## Goals / Non-Goals
Goals:
- Reduce Oracle queries from 4 per interaction to 1 per user session (per filter combination)
- Same cache infrastructure as reject-history: L1 (ProcessLevelCache) + L2 (Redis parquet/base64), 15-minute TTL
- Same API pattern: POST /query (primary) + GET /view (supplementary, from cache)
- Maintain all existing UI functionality — same charts, tables, filters, pagination
- Frontend adopts queryId-based two-phase flow
Non-Goals:
- Changing the SQL queries themselves (same table, same WHERE logic)
- Adding new visualizations or metrics
- Modifying other pages (reject-history, query-tool, etc.)
- Changing the department endpoint on hold-history (it has unique person-level expansion logic that benefits from its own query — we keep it as a separate call)
## Decisions
### D1: Follow `reject_dataset_cache.py` architecture exactly
Decision: Create `hold_dataset_cache.py` and `resource_dataset_cache.py` following the same module structure:
- `_make_query_id()` — SHA256 hash of primary params
- `_redis_store_df()` / `_redis_load_df()` — parquet/base64 encoding
- `_get_cached_df()` / `_store_df()` — L1 → L2 read-through
- `execute_primary_query()` — Oracle query + cache + derive initial view
- `apply_view()` — read cache + filter + re-derive
Rationale: Proven pattern, consistent codebase, shared infrastructure. Alternatives (custom cache format, separate cache layers) add complexity for no benefit.
### D2: Hold-history primary query scope
Decision: The primary query fetches ALL hold/release records for the date range (all hold_types). Trend, reason-pareto, duration, and list are all derived from this single cached DataFrame. Department remains a separate API call.
Rationale: Trend data already contains all 3 hold_type variants in one query. By caching the raw facts (not pre-aggregated), we can switch hold_type views instantly from cache. Department has unique person-level JOINs and GROUP BY logic that doesn't fit the "filter from flat DataFrame" pattern cleanly.
Alternatives considered:
- Cache per hold_type: wastes 3x cache memory, still requires Oracle for type switching
- Include department in cache: complex person-level aggregation doesn't map well to flat DataFrame filtering
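A sketch of how the derived views fall out of the cached fact set; the column names (`hold_date`, `hold_type`, `hold_reason`, `duration_hours`) are illustrative placeholders, not the actual DW_MES_HOLDRELEASEHISTORY schema:

```python
from typing import Optional

import pandas as pd


def derive_hold_views(df: pd.DataFrame, hold_type: Optional[str] = None) -> dict:
    """Derive trend / reason-pareto / duration views from the cached DataFrame.

    Switching hold_type is a pandas filter on the cached raw facts -- no Oracle.
    """
    view = df if hold_type is None else df[df["hold_type"] == hold_type]
    # Trend: hold count per day
    trend = view.groupby("hold_date").size().rename("hold_count").reset_index()
    # Pareto: hold count per reason, descending
    pareto = (view.groupby("hold_reason").size().rename("count")
                  .sort_values(ascending=False).reset_index())
    # Duration: summary statistics over hold durations
    duration = view["duration_hours"].describe().to_dict()
    return {"trend": trend, "reason_pareto": pareto, "duration": duration}
```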
### D3: Resource-history primary query scope
Decision: The primary query fetches ALL shift-status records for the date range and resource filter combination. KPI, trend, heatmap, workcenter comparison, and detail are all derived from this single cached DataFrame.
Rationale: All 4 current queries (kpi, trend, heatmap, detail) use the same base WHERE clause against the same table. The aggregations (GROUP BY date for trend, GROUP BY workcenter×date for heatmap, etc.) are simple pandas operations on the cached raw data.
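For illustration, the aggregations named above as pandas operations on the cached raw data; the columns (`shift_date`, `workcenter`, `run_hours`) are hypothetical stand-ins for the real DW_MES_RESOURCESTATUS_SHIFT schema:

```python
import pandas as pd


def derive_resource_views(df: pd.DataFrame) -> dict:
    """Derive KPI, trend, and heatmap views from the cached shift-status facts."""
    # Trend: total run hours per date (GROUP BY date)
    trend = df.groupby("shift_date")["run_hours"].sum().reset_index()
    # Heatmap: workcenter x date matrix (GROUP BY workcenter, date)
    heatmap = df.pivot_table(index="workcenter", columns="shift_date",
                             values="run_hours", aggfunc="sum", fill_value=0)
    # KPI: simple totals over the whole filtered set
    kpi = {"total_run_hours": float(df["run_hours"].sum())}
    return {"trend": trend, "heatmap": heatmap, "kpi": kpi}
```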
### D4: Cache TTL = 15 minutes, same as reject-history
Decision: Use `_CACHE_TTL = 900` (15 min) for both modules, with L1 `max_size = 8`.
Rationale: Matches reject-history. 15 minutes covers typical analysis sessions. Users who need fresh data can re-query (which replaces the cache). Hold-history's existing 12h Redis cache for trend data is more aggressive but stale — 15 minutes is a better balance.
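The L1 behavior this decision relies on (LRU eviction at `max_size`, entries expiring after `ttl`) can be sketched as follows; this is a hypothetical illustration, the real ProcessLevelCache lives in the shared infrastructure and may differ:

```python
import time
from collections import OrderedDict


class ProcessLevelCache:
    """Per-worker L1 cache: LRU eviction at max_size, entries expire after ttl seconds."""

    def __init__(self, max_size: int = 8, ttl: int = 900):
        self.max_size = max_size
        self.ttl = ttl
        self._data = OrderedDict()  # key -> (expires_at, value)

    def get(self, key):
        entry = self._data.get(key)
        if entry is None:
            return None
        expires_at, value = entry
        if time.time() >= expires_at:
            del self._data[key]  # lazily drop expired entries
            return None
        self._data.move_to_end(key)  # mark as recently used
        return value

    def set(self, key, value):
        self._data[key] = (time.time() + self.ttl, value)
        self._data.move_to_end(key)
        while len(self._data) > self.max_size:
            self._data.popitem(last=False)  # evict least recently used
```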
### D5: API contract — POST /query + GET /view
Decision: Both pages switch to:
- `POST /api/hold-history/query` → primary query, returns `query_id` + initial view (trend, reason, duration, list page 1)
- `GET /api/hold-history/view` → supplementary filter/pagination from cache
- `POST /api/resource/history/query` → primary query, returns `query_id` + initial view (summary + detail page 1)
- `GET /api/resource/history/view` → supplementary filter/pagination from cache
Old GET endpoints (trend, reason-pareto, duration, list, summary, detail) are removed.
Rationale: Same pattern as reject-history. POST for primary (sends filter params in body), GET for view (sends query_id + supplementary params).
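An in-memory sketch of the two-phase contract, with route wiring, the Oracle query, and the real cache layers stubbed out; the handler names and payload shapes here are hypothetical:

```python
_CACHE = {}  # stand-in for the L1/L2 dataset cache


def post_query(params: dict):
    """POST .../query: run the (stubbed) primary query, cache the dataset,
    and return query_id plus the initial view."""
    query_id = "qid-" + repr(sorted(params.items()))
    dataset = [dict(params, row=i) for i in range(3)]  # stand-in for cached facts
    _CACHE[query_id] = dataset
    return 200, {"query_id": query_id, "view": dataset[:1]}


def get_view(query_id: str, page: int = 1):
    """GET .../view: derive a page from cache; HTTP 410 tells the client
    the cache expired and it should re-run the primary query."""
    dataset = _CACHE.get(query_id)
    if dataset is None:
        return 410, {"error": "cache_expired"}
    return 200, {"view": dataset[:page]}
```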
### D6: Frontend queryId-based flow
Decision: Both App.vue files adopt the two-phase pattern:
- User clicks "Query" (查詢) → `POST /query` → store `queryId`
- Filter change / pagination → `GET /view?query_id=...&filters...` (no Oracle)
- Cache expired (HTTP 410) → auto re-execute primary query
Rationale: Proven pattern from reject-history. Keeps UI responsive after initial query.
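The flow above, sketched in Python for brevity (the real client is App.vue); `run_primary` and `fetch_view` are hypothetical stand-ins for the two HTTP calls:

```python
def fetch_with_retry(run_primary, fetch_view, filters):
    """Two-phase flow: primary query once, views from cache, and a single
    automatic re-query when the server answers 410 (cache expired)."""
    query_id = run_primary()                      # POST /query -> store queryId
    status, view = fetch_view(query_id, filters)  # GET /view (no Oracle)
    if status == 410:                             # cache_expired: re-run primary once
        query_id = run_primary()
        status, view = fetch_view(query_id, filters)
    return status, view
```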
## Risks / Trade-offs
- [Memory] Caching full DataFrames in L1 (per-worker) uses more RAM than the current approach → Mitigated by `max_size=8` LRU eviction (same as reject-history, works well in production)
- [Staleness] 15-min TTL means data could be up to 15 minutes old during an analysis session → Acceptable for historical analysis; the user can re-query for fresh data
- [Department endpoint] Hold-history department still makes a separate Oracle call → Acceptable trade-off; person-level aggregation doesn't fit flat DataFrame model. Could be addressed later.
- [Breaking API] Old GET endpoints removed → No external consumers; frontend is the only client
- [Redis dependency] If Redis is down, only L1 cache works (per-worker, not cross-worker) → Same behavior as reject-history; L1 still provides 15-min cache per worker