egg
2025-12-04 18:00:37 +08:00
parent 9437387ef1
commit 8265be1741
22 changed files with 2672 additions and 196 deletions


@@ -0,0 +1,240 @@
# Design: Refactor Dual-Track Architecture
## Context
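Tool_OCR is a dual-track document processing system that supports:
- **Direct Track**: extracts structured content directly from editable PDFs
- **OCR Track**: performs optical character recognition with PaddleOCR + PP-StructureV3
The system currently carries the following technical debt:
- OCRService (2,326 lines) has too many responsibilities
- PDFGeneratorService (4,644 lines) is a monolithic service
- Memory management is scattered across multiple components
- Known bugs degrade output quality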
## Goals / Non-Goals
### Goals
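- Fix all known bugs listed in PLAN.md
- Split OCRService into maintainable units of < 800 lines
- Split PDFGeneratorService down to < 2,000 lines
- Simplify the memory management configuration
- Improve the consistency of frontend state management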
### Non-Goals
- No changes to the existing API contract
- No new external dependencies
- No changes to the database schema
- No changes to the user interface
## Decisions
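### Decision 1: Use PyMuPDF find_tables() instead of custom table detection
**Choice**: Use PyMuPDF's built-in `page.find_tables()` API
**Rationale**:
- PyMuPDF's table detection correctly identifies merged cells
- The returned `table.cells` structure contains span information
- Less custom code to maintain
**Alternatives**:
- Improve the `_detect_tables_by_position()` algorithm
  - Pro: does not depend on external API changes
  - Con: high complexity; hard to cover every edge case
- Use Camelot or Tabula
  - Pro: mature table extraction libraries
  - Con: adds new dependencies and increases system complexity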
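A minimal sketch of how spans might be recovered from cell geometry, assuming `table.cells` yields one `(x0, y0, x1, y1)` bounding box per visual cell (with `None` for placeholders). The names `GridCell`, `cells_to_spans`, and the 1.0-unit tolerance are hypothetical and are not taken from the codebase; the function takes plain tuples so it can be tried without PyMuPDF installed.

```python
from typing import NamedTuple, Optional, Sequence

BBox = tuple[float, float, float, float]  # (x0, y0, x1, y1)


class GridCell(NamedTuple):
    row: int
    col: int
    row_span: int
    col_span: int


def _snap_edges(values: Sequence[float], tol: float) -> list[float]:
    """Collapse nearly equal coordinates into one sorted list of grid lines."""
    edges: list[float] = []
    for v in sorted(values):
        if not edges or v - edges[-1] > tol:
            edges.append(v)
    return edges


def _edge_index(edges: Sequence[float], value: float, tol: float) -> int:
    """Return the index of the grid line that matches `value` within `tol`."""
    for i, edge in enumerate(edges):
        if abs(edge - value) <= tol:
            return i
    raise ValueError(f"no grid line near {value}")


def cells_to_spans(cells: Sequence[Optional[BBox]], tol: float = 1.0) -> list[GridCell]:
    """Derive row/column positions and spans from cell bounding boxes."""
    boxes = [c for c in cells if c is not None]  # skip placeholder entries
    xs = _snap_edges([x for b in boxes for x in (b[0], b[2])], tol)
    ys = _snap_edges([y for b in boxes for y in (b[1], b[3])], tol)
    result = []
    for x0, y0, x1, y1 in boxes:
        row, col = _edge_index(ys, y0, tol), _edge_index(xs, x0, tol)
        # A span is the number of grid lines crossed between the two edges.
        result.append(GridCell(row, col,
                               _edge_index(ys, y1, tol) - row,
                               _edge_index(xs, x1, tol) - col))
    return result
```

A cell whose box crosses two column boundaries comes back with `col_span == 2`, so the merged header of a two-column table is reported once instead of being split.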
### Decision 2: Refactor the service layer with the Strategy pattern
**Choice**: Introduce a ProcessingOrchestrator built on the Strategy pattern
```python
from __future__ import annotations

from typing import Protocol


class ProcessingPipeline(Protocol):
    """Common interface implemented by every processing track."""

    def process(self, file_path: str, options: ProcessingOptions) -> UnifiedDocument:
        ...


class DirectPipeline(ProcessingPipeline):
    def __init__(self, extraction_engine: DirectExtractionEngine):
        self.engine = extraction_engine

    def process(self, file_path: str, options: ProcessingOptions) -> UnifiedDocument:
        return self.engine.extract(file_path)


class OCRPipeline(ProcessingPipeline):
    def __init__(self, ocr_service: OCRService, preprocessor: LayoutPreprocessingService):
        self.ocr = ocr_service
        self.preprocessor = preprocessor

    def process(self, file_path: str, options: ProcessingOptions) -> UnifiedDocument:
        # Preprocessing + OCR + conversion
        ...


class ProcessingOrchestrator:
    def __init__(self, detector: DocumentTypeDetector, pipelines: dict[str, ProcessingPipeline]):
        self.detector = detector
        self.pipelines = pipelines

    def process(self, file_path: str, options: ProcessingOptions) -> UnifiedDocument:
        # An explicit force_track overrides automatic detection.
        track = options.force_track or self.detector.detect(file_path).track
        return self.pipelines[track].process(file_path, options)
```
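**Rationale**:
- Separation of concerns: detection, processing, and conversion are independent
- Easy to test: each Pipeline can be tested in isolation
- Easy to extend: a new processing mode only requires a new Pipeline
**Alternatives**:
- Use Chain of Responsibility
  - Pro: a more flexible processing chain
  - Con: overly complex for a choose-one-of-two scenario
- Keep the current design and only tidy the code
  - Pro: lowest risk
  - Con: does not address the root problems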
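The dispatch rule in Decision 2 (a forced track wins, otherwise the detector decides) can be exercised end to end with stubs. This is only a sketch: `Options`, `Detection`, `EchoPipeline`, `SuffixDetector`, and the track keys `"direct"` and `"ocr"` are hypothetical stand-ins, not the project's real classes.

```python
from dataclasses import dataclass
from typing import Optional, Protocol


@dataclass
class Options:  # stand-in for ProcessingOptions
    force_track: Optional[str] = None


@dataclass
class Detection:  # stand-in for the detector's result object
    track: str


class Pipeline(Protocol):
    def process(self, file_path: str, options: Options) -> str: ...


class EchoPipeline:
    """Stub pipeline that reports which track handled the file."""

    def __init__(self, name: str):
        self.name = name

    def process(self, file_path: str, options: Options) -> str:
        return f"{self.name}:{file_path}"


class SuffixDetector:
    """Stub detector: treats .pdf as editable, everything else as scanned."""

    def detect(self, file_path: str) -> Detection:
        return Detection("direct" if file_path.endswith(".pdf") else "ocr")


class Orchestrator:
    def __init__(self, detector: SuffixDetector, pipelines: dict[str, Pipeline]):
        self.detector = detector
        self.pipelines = pipelines

    def process(self, file_path: str, options: Options) -> str:
        # Same rule as the design: force_track overrides detection.
        track = options.force_track or self.detector.detect(file_path).track
        return self.pipelines[track].process(file_path, options)


orchestrator = Orchestrator(
    SuffixDetector(),
    {"direct": EchoPipeline("direct"), "ocr": EchoPipeline("ocr")},
)
```

Because each pipeline is injected, a test can swap in stubs like these and check routing without loading OCR models.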
### Decision 3: Extract PDF generation logic into layers
**Choice**: Split PDFGeneratorService into three modules
```
PDFGeneratorService (main orchestration)
├── PDFTableRenderer (table rendering)
│   ├── HTMLTableParser (HTML table parsing)
│   └── CellRenderer (cell rendering)
├── PDFFontManager (font management)
│   ├── FontLoader (font loading)
│   └── FontFallback (font fallback)
└── PDFLayoutEngine (layout)
```
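**Rationale**:
- Single responsibility: each module focuses on one thing
- Reusable: FontManager can be used by other services
- Easy to test: table rendering can be tested independently
### Decision 4: Unified memory policy engine
**Choice**: Merge the memory management components into a single MemoryPolicyEngine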
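To illustrate why an extracted table parser is easy to test on its own, here is a rough sketch of an HTML table parser that keeps `rowspan`/`colspan`, built only on the standard library's `html.parser`. The class name `SpanTableParser` and the `(text, rowspan, colspan)` tuple shape are assumptions; the real HTMLTableParser interface is not shown in this design.

```python
from html.parser import HTMLParser
from typing import Optional

Cell = tuple[str, int, int]  # (text, rowspan, colspan)


class SpanTableParser(HTMLParser):
    """Collect (text, rowspan, colspan) for each cell, row by row."""

    def __init__(self) -> None:
        super().__init__()
        self.rows: list[list[Cell]] = []
        # [text_parts, rowspan, colspan] while inside a <td>/<th>, else None
        self._cell: Optional[list] = None

    def handle_starttag(self, tag, attrs):
        if tag == "tr":
            self.rows.append([])
        elif tag in ("td", "th"):
            if not self.rows:  # tolerate a cell that appears before any <tr>
                self.rows.append([])
            a = dict(attrs)
            self._cell = [[], int(a.get("rowspan") or 1), int(a.get("colspan") or 1)]

    def handle_data(self, data):
        if self._cell is not None:
            self._cell[0].append(data)

    def handle_endtag(self, tag):
        if tag in ("td", "th") and self._cell is not None:
            parts, rowspan, colspan = self._cell
            self.rows[-1].append(("".join(parts).strip(), rowspan, colspan))
            self._cell = None


def parse_table(html: str) -> list[list[Cell]]:
    parser = SpanTableParser()
    parser.feed(html)
    parser.close()
    return parser.rows
```

With the parser isolated like this, merged-cell handling can be covered by small string fixtures rather than by rendering a full PDF.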
```python
from __future__ import annotations

import asyncio
from dataclasses import dataclass


class MemoryPolicyEngine:
    """Unified memory policy engine."""

    def __init__(self, config: MemoryConfig):
        self.config = config
        self._semaphore = asyncio.Semaphore(config.max_concurrent_predictions)

    @property
    def gpu_usage_percent(self) -> float:
        # Single place to query GPU usage
        ...

    def check_availability(self) -> MemoryStatus:
        # Returns AVAILABLE, WARNING, CRITICAL, or EMERGENCY
        ...

    async def acquire_prediction_slot(self):
        # Unified concurrency control
        ...

    def cleanup_if_needed(self):
        # Clean up automatically based on the current status
        ...


@dataclass
class MemoryConfig:
    warning_threshold: float = 0.80  # 80%
    critical_threshold: float = 0.95  # 95%
    max_concurrent_predictions: int = 2
    model_idle_timeout: int = 300  # 5 minutes
```
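**Rationale**:
- Fewer settings: from 8+ down to 4 core settings
- Simpler dependencies: services depend on a single memory engine
- Consistent behavior: all memory decisions are made in one place
### Decision 5: Manage task state with Zustand
**Choice**: Add a TaskStore to manage task state in one place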
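A small runnable sketch of the two behaviors the engine centralizes: threshold classification and a concurrency cap. It assumes usage is a fraction in 0.0-1.0 and that a threshold is crossed with `>=`; both are guesses. The EMERGENCY status is left out because the design gives no threshold for it. `classify` and `run_limited` are hypothetical names.

```python
import asyncio
from dataclasses import dataclass
from enum import Enum


class MemoryStatus(Enum):
    AVAILABLE = "available"
    WARNING = "warning"
    CRITICAL = "critical"


@dataclass
class MemoryConfig:
    warning_threshold: float = 0.80
    critical_threshold: float = 0.95
    max_concurrent_predictions: int = 2


def classify(usage: float, config: MemoryConfig) -> MemoryStatus:
    """Map a GPU usage fraction to a status using the configured thresholds."""
    if usage >= config.critical_threshold:
        return MemoryStatus.CRITICAL
    if usage >= config.warning_threshold:
        return MemoryStatus.WARNING
    return MemoryStatus.AVAILABLE


async def run_limited(n_jobs: int, config: MemoryConfig) -> int:
    """Run n_jobs fake predictions; return the peak number running at once."""
    semaphore = asyncio.Semaphore(config.max_concurrent_predictions)
    running = peak = 0

    async def job() -> None:
        nonlocal running, peak
        async with semaphore:  # holds one prediction slot
            running += 1
            peak = max(peak, running)
            await asyncio.sleep(0)  # yield so other jobs can try to start
            running -= 1

    await asyncio.gather(*(job() for _ in range(n_jobs)))
    return peak
```

Keeping classification a pure function of `(usage, config)` makes the thresholds testable without a GPU, which is one way to check the "same thresholds, new structure" mitigation.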
```typescript
import { create } from 'zustand';
import { persist } from 'zustand/middleware';

interface TaskState {
  currentTaskId: string | null;
  tasks: Record<string, TaskDetail>;
  processingStatus: Record<string, ProcessingStatus>;
}

interface TaskActions {
  setCurrentTask: (taskId: string) => void;
  updateTask: (taskId: string, updates: Partial<TaskDetail>) => void;
  updateProcessingStatus: (taskId: string, status: ProcessingStatus) => void;
  clearTasks: () => void;
}

const useTaskStore = create<TaskState & TaskActions>()(
  persist(
    (set) => ({
      currentTaskId: null,
      tasks: {},
      processingStatus: {},
      // ... actions
    }),
    { name: 'task-storage' }
  )
);
```
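**Rationale**:
- Consistency: matches the existing uploadStore and authStore patterns
- Traceability: task state changes are managed centrally
- Persistence: state survives a page refresh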
## Risks / Trade-offs
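| Risk | Impact | Mitigation |
|------|------|----------|
| PyMuPDF find_tables() API changes | | Wrap it in a standalone function so it is easy to replace |
| Service refactoring introduces processing logic errors | | Keep the existing tests and refactor incrementally |
| Memory engine changes cause OOM | | Use the same thresholds; change only the code structure |
| Frontend state migration introduces bugs | | Migrate page by page and fully test each page |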
## Migration Plan
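### Step 1: Bug Fixes (independently deployable)
1. Integrate PyMuPDF find_tables()
2. Fix OCR Track image paths
3. Add cell_boxes coordinate validation
4. Test and deploy
### Step 2: Service Refactoring (independently deployable)
1. Extract ProcessingOrchestrator
2. Extract TableRenderer and FontManager
3. Update OCRService to use the new components
4. Test and deploy
### Step 3: Memory Management (independently deployable)
1. Implement MemoryPolicyEngine
2. Gradually migrate services to the new engine
3. Remove the old components
4. Test and deploy
### Step 4: Frontend Improvements (independently deployable)
1. Add TaskStore
2. Migrate ProcessingPage
3. Migrate TaskDetailPage
4. Merge type definitions
5. Test and deploy
### Rollback Plan
- Each Step is deployed independently; if a problem occurs, roll back to the previous stable version
- Bug fixes come first to ensure basic functionality is correct
- Refactoring does not change external behavior, so rollbacks have minimal impact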
## Open Questions
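1. **Version compatibility of PyMuPDF find_tables()**: confirm that the PyMuPDF version currently in use supports this API
2. **Scope of frontend state persistence**: should all tasks be persisted, or only the current session?
3. **Memory threshold tuning**: have the existing thresholds been validated in production, and can they be reused as-is?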


@@ -0,0 +1,68 @@
# Change: Refactor Dual-Track Architecture
## Why
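The current dual-track OCR system has several known issues and architectural debt:
1. **Direct Track table issue**: `_detect_tables_by_position()` cannot identify merged cells, so edit3.pdf yields 204 incorrectly split cells (should be 83)
2. **OCR Track image paths lost**: the `saved_path` of visual elements such as CHART/DIAGRAM is lost during conversion, so images are not placed back into the PDF
3. **OCR Track cell_boxes coordinates corrupted**: cell_boxes returned by PP-StructureV3 exceed the page boundaries
4. **Overly complex service layer**: OCRService (2,326 lines) has too many responsibilities and is hard to maintain and test
5. **Oversized PDF generator**: PDFGeneratorService (4,644 lines) is a monolithic service that is hard to extend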
## What Changes
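### Phase 1: Fix known bugs (priority: highest)
- **Direct Track table fix**: use the PyMuPDF `find_tables()` API instead of `_detect_tables_by_position()`
- **OCR Track image path fix**: extend `_convert_pp3_element` to handle all visual element types (IMAGE, FIGURE, CHART, DIAGRAM, LOGO, STAMP)
- **Cell boxes coordinate validation**: add bounds checks; fall back to CV line detection when coordinates are out of range
- **Filter tiny decorative images**: filter out images smaller than 200 px²
- **Remove covering images**: at render time, filter out images that overlap with covering_images
### Phase 2: Service layer refactoring (priority: high)
- **Split OCRService**: extract a standalone `ProcessingOrchestrator` responsible for workflow orchestration
- **Establish the Pipeline pattern**: use composition instead of the current aggregation
- **Extract TableRenderer**: extract the table rendering logic from PDFGeneratorService
- **Extract FontManager**: extract the font management logic from PDFGeneratorService
### Phase 3: Memory management simplification (priority: medium)
- **Unified memory policy**: merge MemoryManager, MemoryGuard, and the various Semaphores into a single policy engine
- **Simplified configuration**: reduce the 8+ memory-related settings to 3-4 core ones
### Phase 4: Frontend state management improvements (priority: medium)
- **Add TaskStore**: use Zustand to manage task state instead of scattered useState calls
- **Merge type definitions**: unify api.ts and apiV2.ts into a single type definition file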
## Impact
- Affected specs: `document-processing`
- Affected code:
  - `backend/app/services/direct_extraction_engine.py` (table detection)
  - `backend/app/services/ocr_to_unified_converter.py` (element conversion)
  - `backend/app/services/ocr_service.py` (service orchestration)
  - `backend/app/services/pdf_generator_service.py` (PDF generation)
  - `backend/app/services/memory_manager.py` (memory management)
  - `frontend/src/store/` (state management)
  - `frontend/src/types/` (type definitions)
## Risk Assessment
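| Risk | Severity | Mitigation |
|------|--------|----------|
| Table rendering regression | | Use edit.pdf and edit3.pdf as regression tests |
| Memory management changes cause OOM | | Keep the existing thresholds; refactor only the code structure |
| Service refactoring causes processing failures | | Refactor incrementally with full testing at each stage |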
## Success Metrics
| Metric | Current | Target |
|------|------|------|
| edit3.pdf Direct Track cells | 204 (incorrect) | 83 (correct) |
| OCR Track image placement rate | 0% | 100% |
| cell_boxes coordinate accuracy | ~40% | 100% |
| OCRService line count | 2,326 | < 800 |
| PDFGeneratorService line count | 4,644 | < 2,000 |


@@ -0,0 +1,151 @@
# document-processing Specification Delta
## ADDED Requirements
### Requirement: Table Cell Merging Detection
The system SHALL correctly detect and preserve merged cells (rowspan/colspan) when extracting tables from PDF documents.
#### Scenario: Detect merged cells in Direct Track
- **WHEN** extracting tables from an editable PDF using Direct Track
- **THEN** the system SHALL use PyMuPDF find_tables() API
- **AND** correctly identify cells with rowspan > 1 or colspan > 1
- **AND** preserve merge information in UnifiedDocument table structure
- **AND** skip placeholder cells that are covered by merged cells
#### Scenario: Handle complex table structures
- **WHEN** processing a table with mixed merged and regular cells (e.g., edit3.pdf with 83 cells including 121 merges)
- **THEN** the system SHALL NOT split merged cells into individual cells
- **AND** the output cell count SHALL match the actual visual cell count
- **AND** the rendered PDF SHALL display correct merged cell boundaries
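The "skip placeholder cells" rule can be sketched as a covered-grid computation: every merged cell marks the grid slots it hides, and only anchor cells are counted as visual cells. The tuple layout `(row, col, row_span, col_span)` and the function name `build_covered_grid` are assumptions for illustration only.

```python
from typing import Iterable

Span = tuple[int, int, int, int]  # (row, col, row_span, col_span)
Pos = tuple[int, int]  # (row, col)


def build_covered_grid(cells: Iterable[Span]) -> tuple[set[Pos], set[Pos]]:
    """Return (anchors, covered): visible cell origins and hidden placeholder slots."""
    anchors: set[Pos] = set()
    covered: set[Pos] = set()
    for row, col, row_span, col_span in cells:
        anchors.add((row, col))
        # Every slot inside the span except the anchor itself is a placeholder.
        for r in range(row, row + row_span):
            for c in range(col, col + col_span):
                if (r, c) != (row, col):
                    covered.add((r, c))
    return anchors, covered
```

Under this model the visual cell count is `len(anchors)`, and a renderer would skip any slot found in `covered`.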
### Requirement: Visual Element Path Preservation
The system SHALL preserve image paths for all visual element types during OCR conversion.
#### Scenario: Preserve CHART element paths
- **WHEN** converting PP-StructureV3 output containing CHART elements
- **THEN** the system SHALL treat CHART as a visual element type
- **AND** extract saved_path from the element data
- **AND** include saved_path in the UnifiedDocument content field
#### Scenario: Support all visual element types
- **WHEN** processing visual elements of types IMAGE, FIGURE, CHART, DIAGRAM, LOGO, or STAMP
- **THEN** the system SHALL extract saved_path or img_path for each element
- **AND** preserve path, width, height, and format in content dictionary
- **AND** enable downstream PDF generation to embed these images
#### Scenario: Fallback path resolution
- **WHEN** a visual element has multiple path fields (saved_path, img_path)
- **THEN** the system SHALL prefer saved_path over img_path
- **AND** fall back to img_path if saved_path is missing
- **AND** log a warning if both paths are missing
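The preference order in this scenario can be written as a short helper. This is a sketch that assumes each element is a dict-like mapping with optional `saved_path` and `img_path` keys, and that an empty string counts as missing; `resolve_image_path` is a hypothetical name.

```python
import logging
from typing import Any, Mapping, Optional

logger = logging.getLogger(__name__)


def resolve_image_path(element: Mapping[str, Any]) -> Optional[str]:
    """Prefer saved_path, fall back to img_path, warn when neither is set."""
    path = element.get("saved_path") or element.get("img_path")
    if not path:
        logger.warning(
            "visual element of type %r has neither saved_path nor img_path",
            element.get("type"),
        )
        return None
    return path
```

Returning `None` instead of raising lets the converter keep the element's other metadata while still recording the missing image in the logs.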
### Requirement: Cell Box Coordinate Validation
The system SHALL validate cell box coordinates from PP-StructureV3 and handle out-of-bounds cases.
#### Scenario: Detect out-of-bounds coordinates
- **WHEN** processing cell_boxes from PP-StructureV3
- **THEN** the system SHALL validate each coordinate against page boundaries (0, 0, page_width, page_height)
- **AND** log tables with coordinates exceeding page bounds
- **AND** mark affected cells for fallback processing
#### Scenario: Apply CV line detection fallback
- **WHEN** cell_boxes coordinates are invalid (out of bounds)
- **THEN** the system SHALL apply OpenCV line detection as fallback
- **AND** reconstruct table structure from detected lines
- **AND** include fallback_used flag in table metadata
#### Scenario: Coordinate normalization
- **WHEN** coordinates are within page bounds but slightly outside table bbox
- **THEN** the system SHALL clamp coordinates to table boundaries
- **AND** preserve relative cell positions
- **AND** ensure no cells overlap after normalization
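A minimal sketch of the bounds check and clamping described above, assuming each cell box is an `(x0, y0, x1, y1)` tuple in page coordinates. The return shape (clamped boxes plus a `needs_fallback` flag) is one possible reading of the scenarios, not the actual signature of `validate_cell_boxes()`.

```python
from typing import Sequence

Box = tuple[float, float, float, float]  # (x0, y0, x1, y1)


def _clamp(value: float, low: float, high: float) -> float:
    return min(max(value, low), high)


def validate_cell_boxes(
    cell_boxes: Sequence[Box], page_width: float, page_height: float
) -> tuple[list[Box], bool]:
    """Clamp boxes to the page and report whether any box was out of bounds."""
    clamped: list[Box] = []
    needs_fallback = False
    for x0, y0, x1, y1 in cell_boxes:
        in_bounds = 0 <= x0 <= x1 <= page_width and 0 <= y0 <= y1 <= page_height
        if not in_bounds:
            needs_fallback = True  # caller may switch to line-detection fallback
        clamped.append((
            _clamp(x0, 0, page_width), _clamp(y0, 0, page_height),
            _clamp(x1, 0, page_width), _clamp(y1, 0, page_height),
        ))
    return clamped, needs_fallback
```

The same clamp could be reused with a table's bbox in place of the page rectangle for the normalization scenario.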
### Requirement: Decoration Image Filtering
The system SHALL filter out minimal decoration images that do not contribute meaningful content.
#### Scenario: Filter tiny images by area
- **WHEN** extracting images from a document
- **THEN** the system SHALL calculate image area (width x height)
- **AND** filter out images with area < 200 square pixels
- **AND** log filtered image count for debugging
#### Scenario: Configurable filtering threshold
- **WHEN** processing documents with intentionally small images
- **THEN** the system SHALL support configuration of minimum image area threshold
- **AND** default to 200 square pixels if not specified
- **AND** allow threshold = 0 to disable filtering
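The area filter with a configurable threshold might look like the following. It assumes each image is a mapping with numeric `width` and `height` keys; `filter_small_images` is a hypothetical name, while `min_image_area` follows the setting named in the tasks file.

```python
from typing import Any, Mapping, Sequence


def filter_small_images(
    images: Sequence[Mapping[str, Any]], min_image_area: float = 200.0
) -> tuple[list[Mapping[str, Any]], int]:
    """Return (kept_images, filtered_count); a threshold of 0 disables filtering."""
    if min_image_area <= 0:
        return list(images), 0
    # Keep images whose area is at least the threshold (area < threshold is dropped).
    kept = [img for img in images if img["width"] * img["height"] >= min_image_area]
    return kept, len(images) - len(kept)
```

The returned count can feed the debug log line that the first scenario asks for.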
### Requirement: Covering Image Removal
The system SHALL remove covering/redaction images from the final output.
#### Scenario: Detect covering rectangles
- **WHEN** preprocessing a PDF page
- **THEN** the system SHALL detect black/white rectangles covering text regions
- **AND** identify covering images by high IoU (> 0.8) with underlying content
- **AND** mark covering images for exclusion
#### Scenario: Exclude covering images from rendering
- **WHEN** generating output PDF
- **THEN** the system SHALL exclude images marked as covering
- **AND** preserve the text content that was covered
- **AND** include covering_images_removed count in metadata
#### Scenario: Handle both black and white covering
- **WHEN** detecting covering rectangles
- **THEN** the system SHALL detect both black fill (redaction style)
- **AND** white fill (whiteout style)
- **AND** low-contrast rectangles intended to hide content
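The IoU test used to flag covering images can be sketched as below, assuming axis-aligned `(x0, y0, x1, y1)` boxes. `calculate_iou` mirrors the `_calculate_iou()` helper named in the tasks file but is not its actual code; `is_covering` is a hypothetical wrapper around the 0.8 threshold.

```python
Box = tuple[float, float, float, float]  # (x0, y0, x1, y1)


def _area(box: Box) -> float:
    return max(0.0, box[2] - box[0]) * max(0.0, box[3] - box[1])


def calculate_iou(a: Box, b: Box) -> float:
    """Intersection over union of two axis-aligned boxes (0.0 when disjoint)."""
    inter_w = max(0.0, min(a[2], b[2]) - max(a[0], b[0]))
    inter_h = max(0.0, min(a[3], b[3]) - max(a[1], b[1]))
    inter = inter_w * inter_h
    union = _area(a) + _area(b) - inter
    return inter / union if union > 0 else 0.0


def is_covering(image_box: Box, content_box: Box, threshold: float = 0.8) -> bool:
    """Flag an image as covering when its IoU with the content exceeds the threshold."""
    return calculate_iou(image_box, content_box) > threshold
```

A rectangle that matches the text region almost exactly scores close to 1.0, while a large background image that merely contains the text scores low and is kept.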
## MODIFIED Requirements
### Requirement: Enhanced OCR with Full PP-StructureV3
The system SHALL utilize the full capabilities of PP-StructureV3, extracting all 23 element types from parsing_res_list, with proper handling of visual elements and table coordinates.
#### Scenario: Extract comprehensive document structure
- **WHEN** processing through OCR track
- **THEN** the system SHALL use page_result.json['parsing_res_list']
- **AND** extract all element types including headers, lists, tables, figures
- **AND** preserve layout_bbox coordinates for each element
#### Scenario: Maintain reading order
- **WHEN** extracting elements from PP-StructureV3
- **THEN** the system SHALL preserve the reading order from parsing_res_list
- **AND** assign sequential indices to elements
- **AND** support reordering for complex layouts
#### Scenario: Extract table structure
- **WHEN** PP-StructureV3 identifies a table
- **THEN** the system SHALL extract cell content and boundaries
- **AND** validate cell_boxes coordinates against page boundaries
- **AND** apply fallback detection for invalid coordinates
- **AND** preserve table HTML for structure
- **AND** extract plain text for translation
#### Scenario: Extract visual elements with paths
- **WHEN** PP-StructureV3 identifies visual elements (IMAGE, FIGURE, CHART, DIAGRAM)
- **THEN** the system SHALL preserve saved_path for each element
- **AND** include image dimensions and format
- **AND** enable image embedding in output PDF
### Requirement: Generate UnifiedDocument from direct extraction
The system SHALL convert PyMuPDF results to UnifiedDocument with correct table cell merging.
#### Scenario: Extract tables with cell merging
- **WHEN** direct extraction encounters a table
- **THEN** the system SHALL use PyMuPDF find_tables() API
- **AND** extract cell content with correct rowspan/colspan
- **AND** preserve merged cell boundaries
- **AND** skip placeholder cells covered by merges
#### Scenario: Filter decoration images
- **WHEN** extracting images from PDF
- **THEN** the system SHALL filter images smaller than minimum area threshold
- **AND** exclude covering/redaction images
- **AND** preserve meaningful content images
#### Scenario: Preserve text styling with image handling
- **WHEN** direct extraction completes
- **THEN** the system SHALL convert PyMuPDF results to UnifiedDocument
- **AND** preserve text styling, fonts, and exact positioning
- **AND** extract tables with cell boundaries, content, and merge info
- **AND** include only meaningful images in output


@@ -0,0 +1,108 @@
# Tasks: Refactor Dual-Track Architecture
## Phase 1: Fix known bugs (completed)
### 1.1 Direct Track table fix (completed ✓)
- [x] 1.1.1 Update the `_process_native_table()` method to handle merged cells via `table.cells`
- [x] 1.1.2 Use the PyMuPDF `page.find_tables()` API (already in use)
- [x] 1.1.3 Parse `table.cells` and compute `row_span`/`col_span` correctly
- [x] 1.1.4 Handle merged-away cells (skip `None` values, build a covered grid)
- [x] 1.1.5 Verify that edit3.pdf returns 83 correct cells ✓
### 1.2 OCR Track image path fix (completed ✓)
- [x] 1.2.1 Modify `ocr_to_unified_converter.py`, lines 604-613
- [x] 1.2.2 Extend the visual element type check: `IMAGE, FIGURE, CHART, DIAGRAM, LOGO, STAMP`
- [x] 1.2.3 Prefer `saved_path`; fall back to `img_path`
- [x] 1.2.4 Ensure the content dict contains `saved_path`, `path`, `width`, `height`, `format`
- [x] 1.2.5 Code fixed (requires full OCR Track test verification)
- [x] 1.2.6 Code fixed (requires full OCR Track test verification)
### 1.3 Cell boxes coordinate validation (completed ✓)
- [x] 1.3.1 Add a `validate_cell_boxes()` function to `ocr_to_unified_converter.py`
- [x] 1.3.2 Check whether cell_boxes exceed the page boundaries (0, 0, page_width, page_height)
- [x] 1.3.3 When out of range, use clamped coordinates and set needs_fallback
- [x] 1.3.4 Log abnormal coordinates
- [x] 1.3.5 Unit tests confirm the coordinate validation logic is correct ✓
### 1.4 Filter tiny decorative images (completed ✓)
- [x] 1.4.1 Add an area check to the image extraction logic in `direct_extraction_engine.py`
- [x] 1.4.2 Filter out images with `image_area < min_image_area` (default 200 px²)
- [x] 1.4.3 Add a `min_image_area` setting so the threshold can be tuned
- [x] 1.4.4 Verify that 3 tiny decorative images are detected in edit3.pdf ✓
### 1.5 Remove covering images (completed ✓)
- [x] 1.5.1 Pass `covering_images` to the `_extract_images()` method
- [x] 1.5.2 Identify covering images using an IoU threshold (0.8) and xref matching
- [x] 1.5.3 Exclude covering images from the final output
- [x] 1.5.4 Add the `_calculate_iou()` helper method
- [x] 1.5.5 Verify that 6 black-box covering images are detected in edit3.pdf ✓
## Phase 2: Service layer refactoring
### 2.1 Extract ProcessingOrchestrator
- [ ] 2.1.1 Create `backend/app/services/processing_orchestrator.py`
- [ ] 2.1.2 Extract the workflow orchestration logic from OCRService
- [ ] 2.1.3 Define the `ProcessingPipeline` interface
- [ ] 2.1.4 Implement DirectPipeline and OCRPipeline
- [ ] 2.1.5 Update OCRService to use ProcessingOrchestrator
- [ ] 2.1.6 Ensure existing functionality is unaffected
### 2.2 Extract TableRenderer
- [ ] 2.2.1 Create `backend/app/services/pdf_table_renderer.py`
- [ ] 2.2.2 Extract HTMLTableParser from PDFGeneratorService
- [ ] 2.2.3 Extract the table rendering logic into a standalone class
- [ ] 2.2.4 Support rendering of merged cells
- [ ] 2.2.5 Update PDFGeneratorService to use TableRenderer
### 2.3 Extract FontManager
- [ ] 2.3.1 Create `backend/app/services/pdf_font_manager.py`
- [ ] 2.3.2 Extract the font loading and caching logic
- [ ] 2.3.3 Extract the CJK font support logic
- [ ] 2.3.4 Implement the font fallback mechanism
- [ ] 2.3.5 Update PDFGeneratorService to use FontManager
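## Phase 3: Memory management simplification
### 3.1 Unified memory policy engine
- [ ] 3.1.1 Create `backend/app/services/memory_policy_engine.py`
- [ ] 3.1.2 Define a unified memory policy interface
- [ ] 3.1.3 Merge the MemoryManager and MemoryGuard logic
- [ ] 3.1.4 Consolidate Semaphore management
- [ ] 3.1.5 Reduce the configuration to 3-4 core settings
### 3.2 Update services to use the new memory engine
- [ ] 3.2.1 Update OCRService to use MemoryPolicyEngine
- [ ] 3.2.2 Update ServicePool to use MemoryPolicyEngine
- [ ] 3.2.3 Remove the old MemoryGuard references
- [ ] 3.2.4 Verify that GPU memory monitoring works correctly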
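## Phase 4: Frontend state management improvements
### 4.1 Add TaskStore
- [ ] 4.1.1 Create `frontend/src/store/taskStore.ts`
- [ ] 4.1.2 Define the task state structure (currentTask, tasks, processingStatus)
- [ ] 4.1.3 Implement CRUD operations and state transitions
- [ ] 4.1.4 Add localStorage persistence
- [ ] 4.1.5 Update ProcessingPage to use TaskStore
- [ ] 4.1.6 Update TaskDetailPage to use TaskStore
### 4.2 Merge type definitions
- [ ] 4.2.1 Review the differences between `api.ts` and `apiV2.ts`
- [ ] 4.2.2 Merge the type definitions into `apiV2.ts`
- [ ] 4.2.3 Remove the duplicate definitions from `api.ts`
- [ ] 4.2.4 Update all import paths
- [ ] 4.2.5 Verify that TypeScript compiles without errors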
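## Phase 5: Testing and verification
### 5.1 Regression tests
- [ ] 5.1.1 Test Direct Track with edit.pdf (ensure no regressions)
- [ ] 5.1.2 Test Direct Track table merging with edit3.pdf
- [ ] 5.1.3 Test OCR Track image placement with edit.pdf
- [ ] 5.1.4 Test OCR Track image placement with edit3.pdf
- [ ] 5.1.5 Verify that all cell_boxes coordinates are correct
### 5.2 Performance tests
- [ ] 5.2.1 Measure processing time after the refactoring
- [ ] 5.2.2 Verify that memory usage has not noticeably increased
- [ ] 5.2.3 Verify that GPU utilization is normal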