egg 940a406dce chore: backup before code cleanup

2025-12-11 11:55:39 +08:00

Design: Refactor Dual-Track Architecture

Context

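Tool_OCR is a dual-track document processing system that supports:

Direct Track: extracts structured content directly from editable PDFs
OCR Track: performs optical character recognition with PaddleOCR + PP-StructureV3

The system currently carries the following technical debt:

OCRService (2,326 lines) has too many responsibilities
PDFGeneratorService (4,644 lines) is a monolithic service
Memory management is scattered across multiple components
Known bugs degrade output quality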

Goals / Non-Goals

Goals

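Fix all known bugs listed in PLAN.md
Split OCRService into maintainable units of fewer than 800 lines each
Split PDFGeneratorService down to fewer than 2,000 lines
Simplify the memory management configuration
Make frontend state management more consistent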

Non-Goals

No changes to existing API contracts
No new external dependencies
No changes to the database schema
No changes to the user interface

Decisions

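Decision 1: Use PyMuPDF find_tables() instead of custom table detection

Choice: Use PyMuPDF's built-in page.find_tables() API

Rationale:

PyMuPDF's table detection correctly identifies merged cells
The returned table.cells structure includes span information
Less custom code to maintain

Alternatives:

Improve the _detect_tables_by_position() algorithm
- Pro: no exposure to external API changes
- Con: high complexity; hard to cover every edge case
Use Camelot or Tabula
- Pro: mature table extraction libraries
- Con: adds a new dependency and increases system complexity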
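
As an illustration of keeping the find_tables() call behind one replaceable function, here is a minimal sketch. The function name extract_tables and the shape of the returned dicts are assumptions made for this example, not part of the existing codebase. It reads only page.find_tables(), the finder's tables list, and each table's bbox, row_count, col_count and extract().

```python
from typing import Any


def extract_tables(page: Any) -> list[dict[str, Any]]:
    """Return one dict per table found on a PyMuPDF page.

    Only find_tables() and the attributes read below are touched, so
    swapping in another detector means changing this function alone.
    The dict layout is a hypothetical choice for this sketch.
    """
    finder = page.find_tables()
    results = []
    for table in finder.tables:
        results.append(
            {
                "bbox": tuple(table.bbox),   # table bounding box (x0, y0, x1, y1)
                "rows": table.row_count,
                "cols": table.col_count,
                "cells": table.extract(),    # list of rows, each a list of cell texts
            }
        )
    return results
```

With PyMuPDF installed, a call might look like extract_tables(fitz.open(path)[0]), where path is a placeholder for a PDF file on disk.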

Decision 2: Refactor the service layer with the Strategy Pattern

Choice: Introduce a ProcessingOrchestrator that uses the strategy pattern

from typing import Protocol

class ProcessingPipeline(Protocol):
    def process(self, file_path: str, options: ProcessingOptions) -> UnifiedDocument:
        ...

class DirectPipeline(ProcessingPipeline):
    def __init__(self, extraction_engine: DirectExtractionEngine):
        self.engine = extraction_engine

    def process(self, file_path, options):
        return self.engine.extract(file_path)

class OCRPipeline(ProcessingPipeline):
    def __init__(self, ocr_service: OCRService, preprocessor: LayoutPreprocessingService):
        self.ocr = ocr_service
        self.preprocessor = preprocessor

    def process(self, file_path, options):
        # Preprocessing + OCR + Conversion
        ...

class ProcessingOrchestrator:
    def __init__(self, detector: DocumentTypeDetector, pipelines: dict[str, ProcessingPipeline]):
        self.detector = detector
        self.pipelines = pipelines

    def process(self, file_path, options):
        track = options.force_track or self.detector.detect(file_path).track
        return self.pipelines[track].process(file_path, options)

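Rationale:

Separation of concerns: detection, processing, and conversion are independent of each other
Easy to test: each Pipeline can be tested in isolation
Easy to extend: a new processing mode only requires adding a new Pipeline

Alternatives:

Use Chain of Responsibility
- Pro: a more flexible processing chain
- Con: overly complex for an either/or choice between two tracks
Keep the current structure and only tidy up the code
- Pro: lowest risk
- Con: does not address the root problem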
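
To show how the routing in this decision could be tested without real engines, here is a small self-contained sketch. StubPipeline, StubDetector, Detection, the single-field ProcessingOptions, and the track keys "direct" and "ocr" are all stand-ins invented for the example; the real classes will carry more fields and return a UnifiedDocument.

```python
from dataclasses import dataclass
from typing import Optional, Protocol


@dataclass
class ProcessingOptions:      # stand-in: the real options are not shown here
    force_track: Optional[str] = None


@dataclass
class Detection:              # stand-in for the detector's result object
    track: str


class ProcessingPipeline(Protocol):
    def process(self, file_path: str, options: ProcessingOptions) -> str: ...


class StubPipeline:
    """Records calls and returns a label instead of a UnifiedDocument."""

    def __init__(self, label: str):
        self.label = label
        self.calls: list[str] = []

    def process(self, file_path: str, options: ProcessingOptions) -> str:
        self.calls.append(file_path)
        return f"{self.label}:{file_path}"


class StubDetector:
    """Always reports the track it was constructed with."""

    def __init__(self, track: str):
        self.track = track

    def detect(self, file_path: str) -> Detection:
        return Detection(track=self.track)


class ProcessingOrchestrator:
    def __init__(self, detector: StubDetector, pipelines: dict[str, ProcessingPipeline]):
        self.detector = detector
        self.pipelines = pipelines

    def process(self, file_path: str, options: ProcessingOptions) -> str:
        # An explicit force_track wins; otherwise the detector decides.
        track = options.force_track or self.detector.detect(file_path).track
        return self.pipelines[track].process(file_path, options)


direct, ocr = StubPipeline("direct"), StubPipeline("ocr")
orchestrator = ProcessingOrchestrator(StubDetector("direct"), {"direct": direct, "ocr": ocr})

auto = orchestrator.process("a.pdf", ProcessingOptions())
forced = orchestrator.process("b.pdf", ProcessingOptions(force_track="ocr"))
```

Because the orchestrator only sees the Protocol, the stubs need no inheritance, and each real pipeline can be swapped in later without touching the routing logic.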

Decision 3: Extract PDF generation logic into layers

Choice: Split PDFGeneratorService into three modules

PDFGeneratorService (main orchestration)
├── PDFTableRenderer (table rendering)
│   ├── HTMLTableParser (HTML table parsing)
│   └── CellRenderer (cell rendering)
├── PDFFontManager (font management)
│   ├── FontLoader (font loading)
│   └── FontFallback (font fallback)
└── PDFLayoutEngine (page layout)

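Rationale:

Single responsibility: each module focuses on one thing
Reusable: FontManager can be used by other services
Easy to test: table rendering can be tested independently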
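
One way the font pieces might compose, as a rough sketch only. The class names follow the tree above, but every method here (pick, register, font_for), the glyph-coverage dictionary, and the font names are hypothetical and chosen for illustration. Real font loading and glyph lookup would go through the PDF library already in use.

```python
from dataclasses import dataclass, field


@dataclass
class FontFallback:
    """Ordered list of font names; earlier entries are preferred."""

    chain: list[str]

    def pick(self, text: str, coverage: dict[str, set[str]]) -> str:
        # Return the first font whose known glyph set covers every character.
        for name in self.chain:
            if set(text) <= coverage.get(name, set()):
                return name
        # Nothing covers the text: fall back to the last font in the chain.
        return self.chain[-1]


@dataclass
class PDFFontManager:
    fallback: FontFallback
    coverage: dict[str, set[str]] = field(default_factory=dict)

    def register(self, name: str, glyphs: str) -> None:
        # Record which characters a font can draw (hypothetical bookkeeping).
        self.coverage[name] = set(glyphs)

    def font_for(self, text: str) -> str:
        return self.fallback.pick(text, self.coverage)


manager = PDFFontManager(FontFallback(["Latin", "CJK"]))
manager.register("Latin", "abc ")
manager.register("CJK", "abc 表格")

latin_choice = manager.font_for("abc")
cjk_choice = manager.font_for("表格")
```

Since the manager holds no reference to the table renderer or layout engine, it can be unit tested and reused by other services on its own.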

Decision 4: Unified memory policy engine

Choice: Merge the memory management components into a single MemoryPolicyEngine

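import asyncio
from dataclasses import dataclass

# Defined first so the MemoryConfig annotation below resolves at definition time.
@dataclass
class MemoryConfig:
    warning_threshold: float = 0.80      # 80%
    critical_threshold: float = 0.95     # 95%
    max_concurrent_predictions: int = 2
    model_idle_timeout: int = 300        # seconds (5 minutes)

class MemoryPolicyEngine:
    """Unified memory policy engine."""

    def __init__(self, config: MemoryConfig):
        self.config = config
        self._semaphore = asyncio.Semaphore(config.max_concurrent_predictions)

    @property
    def gpu_usage_percent(self) -> float:
        # Single place to query GPU usage
        ...

    def check_availability(self) -> "MemoryStatus":
        # Returns AVAILABLE, WARNING, CRITICAL, or EMERGENCY
        # (MemoryStatus is defined elsewhere, hence the quoted annotation)
        ...

    async def acquire_prediction_slot(self):
        # Single place for concurrency control
        ...

    def cleanup_if_needed(self):
        # Cleans up automatically based on the current status
        ...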

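Rationale:

Fewer settings: from 8+ down to 4 core configuration values
Simpler dependencies: services depend on a single memory engine
Consistent behavior: all memory decisions are made in one place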
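
The threshold check and the concurrency limit could work roughly as follows. This is a sketch under stated assumptions: the classify and run_predictions helpers are invented for the example, the comparisons use >= (the document does not say whether thresholds are inclusive), and EMERGENCY is left out because MemoryConfig defines no threshold for it.

```python
import asyncio
from dataclasses import dataclass
from enum import Enum


class MemoryStatus(Enum):
    AVAILABLE = "available"
    WARNING = "warning"
    CRITICAL = "critical"


@dataclass
class MemoryConfig:
    warning_threshold: float = 0.80
    critical_threshold: float = 0.95
    max_concurrent_predictions: int = 2


def classify(usage: float, config: MemoryConfig) -> MemoryStatus:
    """Map a usage fraction in [0, 1] to a status using the two thresholds."""
    if usage >= config.critical_threshold:
        return MemoryStatus.CRITICAL
    if usage >= config.warning_threshold:
        return MemoryStatus.WARNING
    return MemoryStatus.AVAILABLE


async def run_predictions(n_tasks: int, config: MemoryConfig) -> int:
    """Run n_tasks fake predictions and return the peak number running at once."""
    semaphore = asyncio.Semaphore(config.max_concurrent_predictions)
    running = 0
    peak = 0

    async def predict() -> None:
        nonlocal running, peak
        async with semaphore:          # one slot per in-flight prediction
            running += 1
            peak = max(peak, running)
            await asyncio.sleep(0.01)  # placeholder for model inference
            running -= 1

    await asyncio.gather(*(predict() for _ in range(n_tasks)))
    return peak


peak = asyncio.run(run_predictions(6, MemoryConfig()))
```

Keeping classification as a pure function of usage and config makes the threshold behavior testable without a GPU, which supports the goal of leaving the existing thresholds unchanged during the migration.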

Decision 5: Manage task state with Zustand

Choice: Add a TaskStore that manages task state in one place

import { create } from 'zustand';
import { persist } from 'zustand/middleware';

interface TaskState {
  currentTaskId: string | null;
  tasks: Record<string, TaskDetail>;
  processingStatus: Record<string, ProcessingStatus>;
}

interface TaskActions {
  setCurrentTask: (taskId: string) => void;
  updateTask: (taskId: string, updates: Partial<TaskDetail>) => void;
  updateProcessingStatus: (taskId: string, status: ProcessingStatus) => void;
  clearTasks: () => void;
}

const useTaskStore = create<TaskState & TaskActions>()(
  persist(
    (set) => ({
      currentTaskId: null,
      tasks: {},
      processingStatus: {},
      // ... actions
    }),
    { name: 'task-storage' }
  )
);

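Rationale:

Consistency: matches the existing uploadStore and authStore pattern
Traceability: task state changes are managed in one place
Persistence: state survives a page refresh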

Risks / Trade-offs

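Risk	Impact	Mitigation
PyMuPDF find_tables() API changes	Medium	Wrap the call in a standalone function so it is easy to replace
Service refactoring introduces processing-logic errors	High	Keep the existing tests and refactor incrementally
Memory engine changes cause OOM	High	Keep the same thresholds; change only the code structure
Frontend state migration introduces bugs	Medium	Migrate page by page and fully test each page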

Migration Plan

Step 1: Bug Fixes (independently deployable)

Implement the PyMuPDF find_tables() integration
Fix image paths in the OCR Track
Add coordinate validation for cell_boxes
Test and deploy
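
The cell_boxes coordinate validation in this step might look something like the sketch below. The exact rules are not specified in this document, so the checks here (four coordinates, positive area, inside the page) and the names is_valid_cell_box and filter_cell_boxes are assumptions for illustration.

```python
from typing import Sequence

Box = Sequence[float]  # (x0, y0, x1, y1), assumed to be in page coordinates


def is_valid_cell_box(box: Box, page_width: float, page_height: float) -> bool:
    """True if the box has four properly ordered coordinates inside the page."""
    if len(box) != 4:
        return False
    x0, y0, x1, y1 = box
    if not (x0 < x1 and y0 < y1):   # zero or negative area (also rejects NaN)
        return False
    return 0 <= x0 and 0 <= y0 and x1 <= page_width and y1 <= page_height


def filter_cell_boxes(boxes: Sequence[Box], page_width: float, page_height: float) -> list[Box]:
    """Keep only boxes that pass is_valid_cell_box, preserving order."""
    return [b for b in boxes if is_valid_cell_box(b, page_width, page_height)]
```

Whether invalid boxes should be dropped, clamped to the page, or reported is a design choice the real fix would need to settle; this sketch simply drops them.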

Step 2: Service Refactoring (independently deployable)

Extract ProcessingOrchestrator
Extract TableRenderer and FontManager
Update OCRService to use the new components
Test and deploy

Step 3: Memory Management (independently deployable)

Implement MemoryPolicyEngine
Migrate services to the new engine incrementally
Remove the old components
Test and deploy

Step 4: Frontend Improvements (independently deployable)

Add TaskStore
Migrate ProcessingPage
Migrate TaskDetailPage
Merge type definitions
Test and deploy

Rollback Plan

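Each Step is deployed independently, so a problem can be rolled back to the previous stable version
Bug fixes come first to make sure basic functionality is correct
The refactoring does not change external behavior, so a rollback has minimal impact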

Open Questions

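PyMuPDF find_tables() version compatibility: need to confirm that the PyMuPDF version currently in use supports this API
Scope of frontend state persistence: should all tasks be persisted, or only those of the current session?
Memory threshold tuning: have the existing thresholds been validated in production, so they can be carried over as-is?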
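
For the first open question, a quick runtime probe is one way to check the installed PyMuPDF. This is a minimal sketch: the helper name supports_find_tables is made up for this example, and it only tests whether the Page class exposes a find_tables attribute, not whether its behavior matches what Decision 1 expects.

```python
import importlib


def supports_find_tables() -> bool:
    """Report whether the installed PyMuPDF exposes Page.find_tables."""
    try:
        fitz = importlib.import_module("fitz")  # PyMuPDF's classic import name
    except ImportError:
        return False                            # PyMuPDF is not installed at all
    page_cls = getattr(fitz, "Page", None)
    return page_cls is not None and hasattr(page_cls, "find_tables")


available = supports_find_tables()
```

The PyMuPDF release notes remain the authority on which version introduced the API; a probe like this only confirms what the deployed environment actually has.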
