fix(hold): dedup equipment cache, fix portal iframe, improve Hold dashboards

- Equipment cache: add freshness gate so only 1 Oracle query per 5-min cycle
  across 4 gunicorn workers; sync worker waits before first refresh
- Portal: add frame-busting to prevent recursive iframe nesting
- Hold Overview: remove redundant TreeMap, add Product & Future Hold Comment
  columns to LotTable
- Hold History: switch list.sql JOIN from DW_MES_LOT_V (WIP snapshot) to
  DW_MES_CONTAINER (historical master) for reliable Product data; add
  Future Hold Comment column; fix comment truncation with hover tooltip
- Page status: reorganize drawer groupings

Co-Authored-By: Claude Opus 4.6 <noreply@anthropic.com>
Author: egg
Date: 2026-02-11 09:01:02 +08:00
Commit: e2ce75b004 (parent be22571421)
18 changed files with 420 additions and 237 deletions


@@ -0,0 +1,2 @@
schema: spec-driven
created: 2026-02-10


@@ -0,0 +1,54 @@
## Context
`GUNICORN_WORKERS=4` starts 4 worker processes, and each worker spawns its own equipment sync daemon thread in `init_realtime_equipment_cache()`. The existing distributed lock (`try_acquire_lock`) only serializes: after worker A releases the lock, worker B acquires it and still queries Oracle, even though worker A has just written the exact same data.
The current `_sync_worker()` loop is `refresh → wait(interval)`: it calls `refresh()` immediately on entry, which double-calls with the `refresh_equipment_status_cache()` already performed during init.
**Only file modified**: `src/mes_dashboard/services/realtime_equipment_cache.py`
## Goals / Non-Goals
**Goals:**
- Produce only 1 Oracle query per 5-minute sync cycle (currently 4)
- Eliminate the double-call between init and the sync thread
- Preserve the ability of `force=True` to bypass deduplication
**Non-Goals:**
- Changing the cache's external API behavior or data format
- Changing the TTL or capacity policy of the process-level (L1) cache
- Moving to a single-worker architecture (the multi-worker setup remains)
- Changing the implementation of the distributed lock itself
## Decisions
### Decision 1: Freshness gate (check the Redis timestamp after acquiring the lock)
**Option A (chosen)**: After acquiring the distributed lock, read Redis `equipment_status:meta:updated`. If the age is less than `_SYNC_INTERVAL // 2`, treat the cache as fresh, release the lock, and skip.
**Option B (rejected)**: Check the timestamp before acquiring the lock. Problem: TOCTOU. After the check, another worker may take the lock and complete an update without this worker knowing.
**Option C (rejected)**: Use a Redis SETNX "sync epoch" marker instead of a timestamp comparison. It adds extra key-management complexity with no real advantage.
**Rationale**: Option A is the simplest, and checking after acquiring the lock guarantees there is no TOCTOU window. Setting the threshold to `interval / 2` provides a safety margin: even with slight clock drift or a long-running refresh, workers will not misjudge freshness.
### Decision 2: Wait-first sync worker loop
Current: `while not stop: refresh(); wait(interval)` → the sync thread refreshes immediately on startup (double-call).
New: `while not _STOP_EVENT.wait(timeout=interval): refresh()` → the sync thread waits one interval before its first refresh.
**Rationale**: `init_realtime_equipment_cache()` already performs the initial sync, so the sync thread does not need to repeat it. `_STOP_EVENT.wait(timeout)` returns False on timeout (continue looping) and True on a stop signal (exit the loop): the semantics are clear, and this is the idiomatic Python threading pattern.
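A self-contained sketch of the wait-first loop; the refresh function here is a stand-in that just records calls. It also demonstrates the "stop signal during wait" behavior described later in the spec:

```python
import threading
import time

_STOP_EVENT = threading.Event()
refresh_calls = []

def refresh_equipment_status_cache():
    # Stand-in for the real refresh; records each invocation.
    refresh_calls.append(time.monotonic())

def _sync_worker(interval: float) -> None:
    # Event.wait(timeout) returns False on timeout (continue looping)
    # and True once the event is set (stop signal), so the thread
    # waits a full interval BEFORE its first refresh.
    while not _STOP_EVENT.wait(timeout=interval):
        refresh_equipment_status_cache()

# Demo: stop the worker during its first wait; no refresh should run.
t = threading.Thread(target=_sync_worker, args=(2.0,), daemon=True)
t.start()
time.sleep(0.1)
_STOP_EVENT.set()  # stop signal arrives before the first interval elapses
t.join()
print(refresh_calls)  # [] — the worker exited without refreshing
```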
### Decision 3: Module-level `_SYNC_INTERVAL` variable
`refresh_equipment_status_cache()` needs the sync interval to compute the freshness threshold. `init_realtime_equipment_cache()` sets the module-level variable `_SYNC_INTERVAL` (default 300).
**Rationale**: This avoids re-reading the Flask config inside the refresh function (refresh may be called outside an app context). Module-level variables are an established convention in this codebase (e.g. `_STOP_EVENT`, `_SYNC_THREAD`).
## Risks / Trade-offs
| 風險 | 緩解 |
|------|------|
| Freshness gate 過於激進導致整個週期無 worker 更新 | Threshold 為 `interval / 2`150s遠小於完整 interval300s。只要 1 個 worker 成功更新,其餘 worker 看到 age < 150s 就會跳過若連 1 worker 都沒成功150s 後下一個取得鎖的 worker 會正常更新 |
| `_SYNC_INTERVAL` init 前被 refresh 呼叫 | Default 300 確保安全只有透過 init 才會啟動 sync thread所以正常流程下 init 一定先於週期性 refresh |
| Wait-first loop 延遲首次週期性 refresh 5 分鐘 | 這是期望行為——init 已完成首次同步sync thread 5 分鐘後才需要下一次 |


@@ -0,0 +1,28 @@
## Why
`.env` sets `GUNICORN_WORKERS=4`, and each worker runs its own equipment sync thread (4 in total). In every 5-minute cycle, the 4 sync threads take turns acquiring the distributed lock and then each query Oracle, producing 4 identical `SELECT ... FROM DW_MES_EQUIPMENTSTATUS_WIP_V` queries (~2700 rows), 3 of which are redundant. The distributed lock only serializes; it does not deduplicate. In addition, `init_realtime_equipment_cache()` has a double-call problem: init calls `refresh_equipment_status_cache()` once, then starts the sync thread, which immediately calls it again.
## What Changes
- **Freshness gate**: after acquiring the distributed lock and before querying Oracle, `refresh_equipment_status_cache()` checks the Redis `equipment_status:meta:updated` timestamp. If the last update was less than `sync_interval / 2` seconds ago, it skips the Oracle query and releases the lock. `force=True` bypasses this check.
- **Wait-first sync worker**: `_sync_worker()` now waits one interval before its first query (a `_STOP_EVENT.wait(timeout=interval)` loop), avoiding duplication with init's initial refresh.
- **Module-level `_SYNC_INTERVAL` variable**: set by `init_realtime_equipment_cache()` and used by the freshness gate.
## Capabilities
### New Capabilities
(No new capabilities. This is a deduplication optimization of the existing equipment cache sync mechanism.)
### Modified Capabilities
(No spec-level requirement changes. The change is purely an implementation-level optimization and does not affect the cache's external behavior, data freshness, or API contract.)
## Impact
- **Files**: `src/mes_dashboard/services/realtime_equipment_cache.py` (only file modified)
- **Oracle load**: drops from 4 queries to 1 per 5-minute cycle
- **Data freshness**: unaffected; every cycle still guarantees at least 1 update
- **`force=True` callers**: unaffected; they bypass the freshness gate
- **Process-level cache**: unaffected; `_save_to_redis()` already calls `invalidate()`, and other workers' L1 caches expire naturally within their 30s TTL
- **Worker restarts (gunicorn max_requests)**: improved, since a new worker's init refresh is blocked by the freshness gate


@@ -0,0 +1,27 @@
## ADDED Requirements
### Requirement: Equipment Sync Refresh SHALL Skip Redundant Oracle Queries Within Same Cycle
When multiple workers attempt to refresh the equipment status cache within the same sync cycle, only the first successful refresh SHALL query Oracle. Subsequent workers that acquire the distributed lock MUST check the freshness of the existing cache and skip the Oracle query if the cache was recently updated.
#### Scenario: Another worker already refreshed within current cycle
- **WHEN** a worker acquires the distributed lock and the `equipment_status:meta:updated` timestamp is less than half the sync interval old
- **THEN** the worker MUST release the lock without querying Oracle and return False
#### Scenario: No recent refresh exists
- **WHEN** a worker acquires the distributed lock and the `equipment_status:meta:updated` timestamp is older than half the sync interval (or missing)
- **THEN** the worker MUST proceed with the full Oracle query and cache update
#### Scenario: Force refresh bypasses freshness gate
- **WHEN** `refresh_equipment_status_cache(force=True)` is called
- **THEN** the freshness gate MUST be skipped and the Oracle query MUST proceed regardless of `meta:updated` age
### Requirement: Sync Worker SHALL Not Duplicate Init Refresh
The background sync worker thread MUST wait for one full sync interval before its first refresh attempt, since `init_realtime_equipment_cache()` already performs an initial refresh at startup.
#### Scenario: Sync worker startup after init
- **WHEN** the sync worker thread starts after `init_realtime_equipment_cache()` completes the initial refresh
- **THEN** the worker MUST wait for the configured interval before attempting its first refresh
#### Scenario: Stop signal during wait
- **WHEN** a stop signal is received while the sync worker is waiting
- **THEN** the worker MUST exit without performing a refresh


@@ -0,0 +1,18 @@
## 1. Freshness Gate
- [x] 1.1 Add module-level `_SYNC_INTERVAL: int = 300` variable in `realtime_equipment_cache.py`
- [x] 1.2 In `init_realtime_equipment_cache()`, set `_SYNC_INTERVAL` from `config.get('EQUIPMENT_STATUS_SYNC_INTERVAL', 300)` before starting sync worker
- [x] 1.3 In `refresh_equipment_status_cache()`, after acquiring distributed lock and before Oracle query: if `force` is False, read Redis `equipment_status:meta:updated`, compute age, skip if age < `_SYNC_INTERVAL // 2`
## 2. Wait-First Sync Worker
- [x] 2.1 Rewrite the `_sync_worker()` loop from `while not stop: refresh(); wait()` to `while not _STOP_EVENT.wait(timeout=interval): refresh()` so the sync thread waits one full interval before its first refresh
## 3. Tests
- [x] 3.1 Add test `test_refresh_skips_when_recently_updated`: mock `meta:updated` as 10s ago; verify Oracle is not called
- [x] 3.2 Add test `test_refresh_proceeds_when_stale`: mock `meta:updated` as 200s ago; verify Oracle is called
- [x] 3.3 Add test `test_refresh_proceeds_when_force`: set `meta:updated` to 10s ago with `force=True`; verify Oracle is called
- [x] 3.4 Add test `test_sync_worker_waits_before_first_refresh`: verify the sync worker does not call refresh immediately on start
- [x] 3.5 Run `python -m pytest tests/test_realtime_equipment_cache.py -x -q`: existing + new tests pass
- [x] 3.6 Run `python -m pytest tests/ -x -q`: full test suite passes
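As a sketch of what tests 3.1–3.3 might look like: the real tests patch the actual module, but this self-contained version inlines a minimal stand-in for the gated refresh, so `refresh` and its signature here are illustrative only.

```python
import time
from unittest import mock

SYNC_INTERVAL = 300

def refresh(redis_client, oracle_query, force=False):
    """Minimal stand-in for refresh_equipment_status_cache():
    skip the Oracle query when the cache is fresh, unless forced."""
    raw = redis_client.get("equipment_status:meta:updated")
    if not force and raw is not None:
        if time.time() - float(raw) < SYNC_INTERVAL // 2:
            return False  # fresh: another worker already refreshed
    oracle_query()
    redis_client.set("equipment_status:meta:updated", str(time.time()))
    return True

def test_refresh_skips_when_recently_updated():
    redis_client = mock.Mock()
    redis_client.get.return_value = str(time.time() - 10)  # 10s ago: fresh
    oracle_query = mock.Mock()
    assert refresh(redis_client, oracle_query) is False
    oracle_query.assert_not_called()

def test_refresh_proceeds_when_stale():
    redis_client = mock.Mock()
    redis_client.get.return_value = str(time.time() - 200)  # 200s ago: stale
    oracle_query = mock.Mock()
    assert refresh(redis_client, oracle_query) is True
    oracle_query.assert_called_once()

def test_refresh_proceeds_when_force():
    redis_client = mock.Mock()
    redis_client.get.return_value = str(time.time() - 10)  # fresh, but forced
    oracle_query = mock.Mock()
    assert refresh(redis_client, oracle_query, force=True) is True
    oracle_query.assert_called_once()
```

Running the three functions (directly or via pytest) exercises the skip, stale, and force paths of the gate.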


@@ -0,0 +1,32 @@
# equipment-sync-dedup Specification
## Purpose
Ensure multi-worker equipment status cache sync performs at most one Oracle query per sync cycle, preventing redundant identical queries across gunicorn workers.
## Requirements
### Requirement: Equipment Sync Refresh SHALL Skip Redundant Oracle Queries Within Same Cycle
When multiple workers attempt to refresh the equipment status cache within the same sync cycle, only the first successful refresh SHALL query Oracle. Subsequent workers that acquire the distributed lock MUST check the freshness of the existing cache and skip the Oracle query if the cache was recently updated.
#### Scenario: Another worker already refreshed within current cycle
- **WHEN** a worker acquires the distributed lock and the `equipment_status:meta:updated` timestamp is less than half the sync interval old
- **THEN** the worker MUST release the lock without querying Oracle and return False
#### Scenario: No recent refresh exists
- **WHEN** a worker acquires the distributed lock and the `equipment_status:meta:updated` timestamp is older than half the sync interval (or missing)
- **THEN** the worker MUST proceed with the full Oracle query and cache update
#### Scenario: Force refresh bypasses freshness gate
- **WHEN** `refresh_equipment_status_cache(force=True)` is called
- **THEN** the freshness gate MUST be skipped and the Oracle query MUST proceed regardless of `meta:updated` age
### Requirement: Sync Worker SHALL Not Duplicate Init Refresh
The background sync worker thread MUST wait for one full sync interval before its first refresh attempt, since `init_realtime_equipment_cache()` already performs an initial refresh at startup.
#### Scenario: Sync worker startup after init
- **WHEN** the sync worker thread starts after `init_realtime_equipment_cache()` completes the initial refresh
- **THEN** the worker MUST wait for the configured interval before attempting its first refresh
#### Scenario: Stop signal during wait
- **WHEN** a stop signal is received while the sync worker is waiting
- **THEN** the worker MUST exit without performing a refresh