test

2025-12-04 18:00:37 +08:00
parent 9437387ef1
commit 8265be1741
22 changed files with 2672 additions and 196 deletions
--- a/openspec/changes/refactor-dual-track-architecture/design.md
+++ b/openspec/changes/refactor-dual-track-architecture/design.md
@@ -0,0 +1,240 @@
+# Design: Refactor Dual-Track Architecture
+
+## Context
+
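+Tool_OCR is a dual-track document processing system that supports:
+- **Direct Track**: extracts structured content directly from editable PDFs
+- **OCR Track**: performs optical character recognition with PaddleOCR + PP-StructureV3
+
+The system currently carries the following technical debt:
+- OCRService (2,326 lines) takes on too many responsibilities
+- PDFGeneratorService (4,644 lines) is a monolithic service
+- Memory management is scattered across multiple components
+- Known bugs degrade output quality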
+
+## Goals / Non-Goals
+
+### Goals
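+- Fix all known bugs listed in PLAN.md
+- Split OCRService into maintainable units of < 800 lines each
+- Reduce PDFGeneratorService to < 2,000 lines
+- Simplify the memory management configuration
+- Improve the consistency of frontend state management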
+
+### Non-Goals
+- No changes to the existing API contract
+- No new external dependencies
+- No changes to the database schema
+- No changes to the user interface
+
+## Decisions
+
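+### Decision 1: Replace custom table detection with PyMuPDF find_tables()
+
+**Choice**: Use PyMuPDF's built-in `page.find_tables()` API
+
+**Rationale**:
+- PyMuPDF's table detection correctly identifies merged cells
+- The returned `table.cells` structure includes span information
+- Reduces the maintenance burden of custom code
+
+**Alternatives**:
+- Improve the `_detect_tables_by_position()` algorithm
+  - Pros: no exposure to external API changes
+  - Cons: high complexity; hard to cover every edge case
+- Use Camelot or Tabula
+  - Pros: mature table extraction libraries
+  - Cons: adds a new dependency and increases system complexity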
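A minimal sketch, not taken from the commit, of how the `find_tables()` call could be isolated behind one function so that it is easy to swap out later. The function name `extract_tables` and the shape of the returned dictionaries are assumptions for illustration; `find_tables()`, `.tables`, `.bbox`, `.cells`, and `.extract()` are the PyMuPDF names the sketch relies on.

```python
from typing import Any


def extract_tables(page: Any) -> list[dict[str, Any]]:
    """Hypothetical wrapper: the only place that calls page.find_tables()."""
    finder = page.find_tables()          # PyMuPDF returns a table finder object
    results = []
    for table in finder.tables:          # one entry per detected table
        results.append({
            "bbox": tuple(table.bbox),   # bounding box of the whole table
            "cells": list(table.cells),  # per-cell bounding boxes
            "rows": table.extract(),     # cell text as a list of rows
        })
    return results
```

With a real document, `page` would be a PyMuPDF page object such as `fitz.open(path)[0]`. Because the wrapper only duck-types its argument, it can also be exercised with a fake page in unit tests.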
+
+### Decision 2: Refactor the service layer with the Strategy Pattern
+
+**Choice**: Introduce a ProcessingOrchestrator built on the strategy pattern
+
+```python
+class ProcessingPipeline(Protocol):
+    def process(self, file_path: str, options: ProcessingOptions) -> UnifiedDocument:
+        ...
+
+class DirectPipeline(ProcessingPipeline):
+    def __init__(self, extraction_engine: DirectExtractionEngine):
+        self.engine = extraction_engine
+
+    def process(self, file_path, options):
+        return self.engine.extract(file_path)
+
+class OCRPipeline(ProcessingPipeline):
+    def __init__(self, ocr_service: OCRService, preprocessor: LayoutPreprocessingService):
+        self.ocr = ocr_service
+        self.preprocessor = preprocessor
+
+    def process(self, file_path, options):
+        # Preprocessing + OCR + Conversion
+        ...
+
+class ProcessingOrchestrator:
+    def __init__(self, detector: DocumentTypeDetector, pipelines: dict[str, ProcessingPipeline]):
+        self.detector = detector
+        self.pipelines = pipelines
+
+    def process(self, file_path, options):
+        track = options.force_track or self.detector.detect(file_path).track
+        return self.pipelines[track].process(file_path, options)
+```
+
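+**Rationale**:
+- Separation of concerns: detection, processing, and conversion are independent of one another
+- Easy to test: each Pipeline can be tested in isolation
+- Easy to extend: a new processing mode only requires adding a new Pipeline
+
+**Alternatives**:
+- Use Chain of Responsibility
+  - Pros: a more flexible processing chain
+  - Cons: overly complex for a one-of-two choice
+- Keep the current structure and only tidy up the code
+  - Pros: lowest risk
+  - Cons: does not address the root problems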
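To illustrate the testability argument, here is a self-contained sketch, not taken from the commit, of the orchestrator dispatching between stub pipelines. `StubPipeline`, `StubDetector`, `Detection`, the track keys `"direct"` and `"ocr"`, and the single-field `ProcessingOptions` are all hypothetical stand-ins for the real classes.

```python
from dataclasses import dataclass
from typing import Optional, Protocol


@dataclass
class ProcessingOptions:              # stand-in with only the field used here
    force_track: Optional[str] = None


class ProcessingPipeline(Protocol):
    def process(self, file_path: str, options: ProcessingOptions) -> str: ...


class StubPipeline:
    """Hypothetical pipeline that only reports which track handled the file."""

    def __init__(self, name: str):
        self.name = name

    def process(self, file_path: str, options: ProcessingOptions) -> str:
        return f"{self.name}:{file_path}"


@dataclass
class Detection:
    track: str


class StubDetector:
    """Hypothetical detector that routes by file extension, for illustration only."""

    def detect(self, file_path: str) -> Detection:
        return Detection("direct" if file_path.endswith(".pdf") else "ocr")


class ProcessingOrchestrator:
    def __init__(self, detector, pipelines: dict[str, ProcessingPipeline]):
        self.detector = detector
        self.pipelines = pipelines

    def process(self, file_path: str, options: ProcessingOptions):
        # An explicit force_track wins; otherwise the detector picks the track.
        track = options.force_track or self.detector.detect(file_path).track
        return self.pipelines[track].process(file_path, options)


orchestrator = ProcessingOrchestrator(
    StubDetector(), {"direct": StubPipeline("direct"), "ocr": StubPipeline("ocr")}
)
```

Calling `orchestrator.process("a.pdf", ProcessingOptions(force_track="ocr"))` routes the file to the OCR stub even though the detector would have chosen the direct track, which is the override behaviour the design relies on.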
+
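+### Decision 3: Extract the PDF generation logic into layers
+
+**Choice**: Split PDFGeneratorService into three modules
+
+```
+PDFGeneratorService (main orchestration)
+├── PDFTableRenderer (table rendering)
+│   ├── HTMLTableParser (HTML table parsing)
+│   └── CellRenderer (cell rendering)
+├── PDFFontManager (font management)
+│   ├── FontLoader (font loading)
+│   └── FontFallback (font fallback)
+└── PDFLayoutEngine (page layout)
+```
+
+**Rationale**:
+- Single responsibility: each module focuses on one thing
+- Reusable: FontManager can be used by other services
+- Easy to test: table rendering can be tested independently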
+
+### Decision 4: Unified memory policy engine
+
+**Choice**: Merge the memory management components into a single MemoryPolicyEngine
+
+```python
+from __future__ import annotations  # lets annotations name types defined later
+import asyncio
+from dataclasses import dataclass
+
+class MemoryPolicyEngine:
+    """Unified memory policy engine."""
+
+    def __init__(self, config: MemoryConfig):
+        self.config = config
+        self._semaphore = asyncio.Semaphore(config.max_concurrent_predictions)
+
+    @property
+    def gpu_usage_percent(self) -> float:
+        """Single place to query GPU usage."""
+
+    def check_availability(self) -> MemoryStatus:
+        """Return AVAILABLE, WARNING, CRITICAL, or EMERGENCY."""
+
+    async def acquire_prediction_slot(self):
+        """Unified concurrency control."""
+
+    def cleanup_if_needed(self):
+        """Clean up automatically based on the current status."""
+
+@dataclass
+class MemoryConfig:
+    warning_threshold: float = 0.80      # 80%
+    critical_threshold: float = 0.95     # 95%
+    max_concurrent_predictions: int = 2
+    model_idle_timeout: int = 300        # 5 minutes
+```
+
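+**Rationale**:
+- Fewer settings: from 8+ down to 4 core settings
+- Simpler dependencies: services depend on a single memory engine
+- Consistent behavior: all memory decisions are made in one place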
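A minimal sketch, not taken from the commit, of how the two thresholds in `MemoryConfig` might map a usage reading to a status. The `classify` function, the `>=` comparisons, and the assumption that usage is a fraction in [0, 1] are illustrative choices. The design also names an EMERGENCY status but gives no threshold for it, so the sketch leaves it out.

```python
from dataclasses import dataclass
from enum import Enum


class MemoryStatus(Enum):
    AVAILABLE = "available"
    WARNING = "warning"
    CRITICAL = "critical"


@dataclass
class MemoryConfig:
    warning_threshold: float = 0.80
    critical_threshold: float = 0.95


def classify(usage: float, config: MemoryConfig) -> MemoryStatus:
    """Map a usage fraction to a status, checking the highest threshold first."""
    if usage >= config.critical_threshold:
        return MemoryStatus.CRITICAL
    if usage >= config.warning_threshold:
        return MemoryStatus.WARNING
    return MemoryStatus.AVAILABLE
```

Keeping this mapping in one pure function would let every service share the same decision, and it can be unit tested without a GPU.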
+
+### Decision 5: Manage task state with Zustand
+
+**Choice**: Add a TaskStore to manage task state in one place
+
+```typescript
+interface TaskState {
+  currentTaskId: string | null;
+  tasks: Record<string, TaskDetail>;
+  processingStatus: Record<string, ProcessingStatus>;
+}
+
+interface TaskActions {
+  setCurrentTask: (taskId: string) => void;
+  updateTask: (taskId: string, updates: Partial<TaskDetail>) => void;
+  updateProcessingStatus: (taskId: string, status: ProcessingStatus) => void;
+  clearTasks: () => void;
+}
+
+const useTaskStore = create<TaskState & TaskActions>()(
+  persist(
+    (set) => ({
+      currentTaskId: null,
+      tasks: {},
+      processingStatus: {},
+      // ... actions
+    }),
+    { name: 'task-storage' }
+  )
+);
+```
+
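+**Rationale**:
+- Consistency: follows the same pattern as the existing uploadStore and authStore
+- Traceability: task state changes are managed centrally
+- Persistence: state survives a page refresh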
+
+## Risks / Trade-offs
+
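+| Risk | Impact | Mitigation |
+|------|--------|------------|
+| PyMuPDF find_tables() API changes | Medium | Wrap it in a standalone function so it is easy to replace |
+| Service refactoring introduces processing logic errors | High | Keep the existing tests and refactor incrementally |
+| Memory engine changes cause OOM | High | Keep the same thresholds and change only the code structure |
+| Frontend state migration introduces bugs | Medium | Migrate page by page and fully test each page |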
+
+## Migration Plan
+
+### Step 1: Bug Fixes (independently deployable)
+1. Implement the PyMuPDF find_tables() integration
+2. Fix image paths in the OCR Track
+3. Add coordinate validation for cell_boxes
+4. Test and deploy
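The coordinate validation in item 3 could look roughly like the sketch below, which is not taken from the commit. It assumes each entry in `cell_boxes` is an `(x0, y0, x1, y1)` tuple in page coordinates. The function names and the exact rules (four values, positive area, inside the page) are hypothetical.

```python
from typing import Sequence

Box = Sequence[float]  # assumed layout: (x0, y0, x1, y1)


def is_valid_box(box: Box, page_width: float, page_height: float) -> bool:
    """True if the box has four coordinates, positive area, and lies on the page."""
    if len(box) != 4:
        return False
    x0, y0, x1, y1 = box
    # Chained comparisons also reject NaN, since any comparison with NaN is False.
    return 0 <= x0 < x1 <= page_width and 0 <= y0 < y1 <= page_height


def filter_cell_boxes(cell_boxes: Sequence[Box], page_width: float, page_height: float) -> list:
    """Keep only the boxes that pass is_valid_box, preserving their order."""
    return [b for b in cell_boxes if is_valid_box(b, page_width, page_height)]
```

Whether invalid boxes should be dropped, clamped to the page, or reported as errors is a design choice the source does not specify; this sketch simply drops them.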
+
+### Step 2: Service Refactoring (independently deployable)
+1. Extract ProcessingOrchestrator
+2. Extract TableRenderer and FontManager
+3. Update OCRService to use the new components
+4. Test and deploy
+
+### Step 3: Memory Management (independently deployable)
+1. Implement MemoryPolicyEngine
+2. Gradually migrate services to the new engine
+3. Remove the old components
+4. Test and deploy
+
+### Step 4: Frontend Improvements (independently deployable)
+1. Add TaskStore
+2. Migrate ProcessingPage
+3. Migrate TaskDetailPage
+4. Consolidate type definitions
+5. Test and deploy
+
+### Rollback Plan
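+- Each Step is deployed independently; if a problem occurs, roll back to the previous stable version
+- Bug fixes come first, to ensure basic functionality is correct
+- The refactoring does not change external behavior, so a rollback has minimal impact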
+
+## Open Questions
+
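+1. **Version compatibility of PyMuPDF find_tables()**: confirm whether the PyMuPDF version currently in use supports this API
+2. **Scope of frontend state persistence**: should all tasks be persisted, or only those from the current session?
+3. **Memory threshold tuning**: have the existing thresholds been validated in production, so that they can be reused as-is?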