feat: enable document orientation detection for scanned PDFs

- Enable PP-StructureV3's use_doc_orientation_classify feature
- Detect rotation angle from doc_preprocessor_res.angle
- Swap page dimensions (width <-> height) for 90°/270° rotations
- Output PDF now correctly displays landscape-scanned content

Also includes:
- Archive completed openspec proposals
- Add simplify-frontend-ocr-config proposal (pending)
- Code cleanup and frontend simplification

🤖 Generated with [Claude Code](https://claude.com/claude-code)

Co-Authored-By: Claude Opus 4.5 <noreply@anthropic.com>
2025-12-11 17:13:46 +08:00
parent 57070af307
commit cfe65158a3
58 changed files with 1271 additions and 3048 deletions
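
The dimension swap described in the commit message can be sketched roughly as follows. This is a minimal illustration, not the code from this commit: the function name `oriented_page_size` is hypothetical, and it assumes the detected angle (read from `doc_preprocessor_res.angle` according to the message) arrives as an integer number of degrees.

```python
def oriented_page_size(width: float, height: float, angle: int) -> tuple[float, float]:
    """Return (width, height) adjusted for a detected rotation angle.

    Hypothetical helper: for 90 or 270 degree rotations the page is
    effectively landscape, so width and height are swapped.
    """
    normalized = angle % 360
    if normalized in (90, 270):
        return height, width
    return width, height


# A portrait scan detected as rotated by 90 degrees becomes landscape.
landscape = oriented_page_size(1275, 1650, 90)
unchanged = oriented_page_size(1275, 1650, 180)
```

For 0° and 180° the page dimensions are left as they are, since those rotations do not change the aspect ratio.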
--- a/openspec/changes/archive/2025-12-11-fix-ocr-track-table-rendering/design.md
+++ b/openspec/changes/archive/2025-12-11-fix-ocr-track-table-rendering/design.md
@@ -0,0 +1,88 @@
+## Context
+
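+The OCR Track processes documents with PP-StructureV3: it converts the PDF to PNG images (150 DPI) for OCR, converts the results to the UnifiedDocument format, and generates the output PDF.
+
+Current problems:
+1. Table HTML content is not extracted on the bbox-overlap matching path
+2. Coordinate scaling during PDF generation produces incorrect text sizes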
+## Goals / Non-Goals
+
+**Goals:**
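+- Fix table HTML extraction so that every table has correct `html` and `extracted_text`
+- Fix the coordinate-system problem in PDF generation so that text is sized correctly
+- Leave the Direct Track and Hybrid Track unaffected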
+
+**Non-Goals:**
+- Do not change how PP-StructureV3 is invoked
+- Do not change the UnifiedDocument data structure
+- Do not change the frontend API
+
+## Decisions
+
+### Decision 1: Fix table HTML extraction
+
+**Location**: `pp_structure_enhanced.py` L527-534
+
+**Change**: When bbox-overlap matching succeeds, also extract `pred_html`:
+
+```python
+if best_match and best_overlap > 0.1:
+    cell_boxes = best_match['cell_box_list']
+    element['cell_boxes'] = [[float(c) for c in box] for box in cell_boxes]
+    element['cell_boxes_source'] = 'table_res_list'
+
+    # New: extract pred_html
+    if not html_content and 'pred_html' in best_match:
+        html_content = best_match['pred_html']
+        element['html'] = html_content
+        element['extracted_text'] = self._extract_text_from_html(html_content)
+        logger.info("[TABLE] Extracted HTML from table_res_list (bbox match)")
+```
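
The snippet above relies on a `best_match` and `best_overlap` computed earlier. As a rough sketch of how such a match could be found, the example below makes several assumptions that the design does not state: boxes are `[x1, y1, x2, y2]`, each table's bbox is derived from the union of its `cell_box_list`, and the overlap ratio is intersection area divided by the element's bbox area. The helper names are hypothetical.

```python
def bbox_of_cells(cell_boxes):
    """Bounding box enclosing all cell boxes (assumed [x1, y1, x2, y2])."""
    x1s, y1s, x2s, y2s = zip(*cell_boxes)
    return [min(x1s), min(y1s), max(x2s), max(y2s)]


def overlap_ratio(a, b):
    """Intersection area of a and b divided by the area of a (an assumed definition)."""
    inter_w = max(0.0, min(a[2], b[2]) - max(a[0], b[0]))
    inter_h = max(0.0, min(a[3], b[3]) - max(a[1], b[1]))
    area_a = max(0.0, a[2] - a[0]) * max(0.0, a[3] - a[1])
    return (inter_w * inter_h) / area_a if area_a > 0 else 0.0


def best_table_match(element_bbox, table_res_list, min_overlap=0.1):
    """Return (best_match, best_overlap); best_match is None below the threshold."""
    best_match, best_overlap = None, 0.0
    for table_res in table_res_list:
        cells = table_res.get("cell_box_list") or []
        if not cells:
            continue
        ratio = overlap_ratio(element_bbox, bbox_of_cells(cells))
        if ratio > best_overlap:
            best_match, best_overlap = table_res, ratio
    if best_overlap > min_overlap:
        return best_match, best_overlap
    return None, best_overlap
```

The 0.1 default mirrors the `best_overlap > 0.1` condition in the snippet above; everything else is illustrative.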
+
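+### Decision 2: OCR Track PDF coordinate system handling
+
+**Option A (recommended)**: The OCR Track uses the OCR coordinate-system dimensions as the PDF page size
+
+- The PDF page size is taken directly from the OCR coordinate system (e.g. 1275x1650 pixels → 1275x1650 pts)
+- No coordinate scaling is applied: scale_x = scale_y = 1.0
+- Font size is taken directly from the bbox height, with no extra calculation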
+
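+**Pros**:
+- Coordinate conversion is simple, with no loss of precision
+- Font size calculation is accurate
+- The PDF page aspect ratio matches the original document
+
+**Cons**:
+- The PDF page is larger (about 2x Letter size in each dimension)
+- Viewers may need to zoom out to see the whole page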
+
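+**Option B**: Keep Letter size and improve the scaling calculation
+
+- Keep the PDF page at 612x792 pts
+- Compute the DPI conversion factor correctly (72/150 = 0.48)
+- Ensure font sizes remain readable after scaling
+
+**Choice**: Adopt Option A, because it simplifies the implementation and avoids scaling precision problems.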
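
The arithmetic behind the two options can be made concrete with a small sketch. The function names `option_a_layout` and `option_b_layout` are hypothetical and are not taken from `pdf_generator_service.py`; the numbers come from the 150 DPI, 1275x1650 example used in this design.

```python
LETTER_PTS = (612.0, 792.0)  # US Letter in PDF points (72 pt per inch)


def option_a_layout(ocr_width, ocr_height):
    """Option A: the page size equals the OCR image size, so no scaling is needed."""
    return (float(ocr_width), float(ocr_height)), (1.0, 1.0)


def option_b_layout(ocr_width, ocr_height, page=LETTER_PTS):
    """Option B: keep a fixed page size and scale OCR coordinates onto it."""
    return page, (page[0] / ocr_width, page[1] / ocr_height)


# A text line whose bbox is 20 px high in OCR coordinates:
bbox_height = 20.0
_, (_, scale_y_a) = option_a_layout(1275, 1650)
_, (_, scale_y_b) = option_b_layout(1275, 1650)
font_size_a = bbox_height * scale_y_a  # stays at the bbox height
font_size_b = bbox_height * scale_y_b  # shrinks by the 72/150 factor
```

Under Option B every bbox-derived font size is multiplied by roughly 0.48, which is where small rounding differences between lines can turn into visibly inconsistent text sizes.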
+
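+### Decision 3: Adjust the table quality check
+
+**Current problem**: `_check_cell_boxes_quality()` filters out too many valid tables
+
+**Change**:
+1. Raise the cell_density threshold (3.0 → 5.0 cells/10000px²)
+2. Lower the min_avg_cell_area threshold (3000 → 2000 px²)
+3. Add detailed logging that states which metric failed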
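
A minimal sketch of a check with these thresholds is shown below. It is not the actual `_check_cell_boxes_quality()` implementation: the function name, signature, and `[x1, y1, x2, y2]` box format are assumptions, and the reasons are returned as strings rather than logged so the example stays self-contained.

```python
def check_cell_boxes_quality(cell_boxes, table_bbox,
                             max_density=5.0, min_avg_cell_area=2000.0):
    """Return (is_good, reasons) for a table's cell boxes.

    Hypothetical sketch: density is measured in cells per 10,000 px² of
    table area, and boxes are assumed to be [x1, y1, x2, y2] in pixels.
    """
    table_area = (table_bbox[2] - table_bbox[0]) * (table_bbox[3] - table_bbox[1])
    if not cell_boxes or table_area <= 0:
        return False, ["no cell boxes or empty table area"]

    density = len(cell_boxes) / (table_area / 10_000)
    avg_area = sum((b[2] - b[0]) * (b[3] - b[1]) for b in cell_boxes) / len(cell_boxes)

    reasons = []
    if density > max_density:
        reasons.append(f"cell density {density:.2f} > {max_density} cells/10000px²")
    if avg_area < min_avg_cell_area:
        reasons.append(f"avg cell area {avg_area:.0f} < {min_avg_cell_area:.0f} px²")
    return not reasons, reasons
```

If the cells tile the table exactly, 5 cells per 10,000 px² corresponds to an average cell area of 2,000 px², so the two thresholds coincide in that case and differ only when cells overlap or leave gaps.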
+
+## Risks / Trade-offs
+
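+- **Risk**: Changing the coordinate system may affect the existing PDF output format
+- **Mitigation**: Apply the change to the OCR Track only; the Direct Track keeps its existing logic
+
+- **Risk**: Relaxing the table quality check may cause some genuinely low-quality tables to be rendered
+- **Mitigation**: Adjust the thresholds gradually, validating the results on test documents first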
+
+## Open Questions
+
+1. Will the larger OCR Track PDF size affect the user experience?
+2. Should a configuration option be provided so users can choose the PDF output size?
--- a/openspec/changes/archive/2025-12-11-fix-ocr-track-table-rendering/proposal.md
+++ b/openspec/changes/archive/2025-12-11-fix-ocr-track-table-rendering/proposal.md
@@ -0,0 +1,17 @@
+# Change: Fix OCR Track Table Rendering and Text Sizing
+
+## Why
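+The PDFs produced by OCR Track processing have two main problems:
+1. **Table content disappears**: PP-StructureV3 correctly returns `table_res_list` (containing `pred_html` and `cell_box_list`), but when matching by bbox overlap, `pp_structure_enhanced.py` extracts only `cell_boxes` and not `pred_html`, so the table's HTML content is empty.
+2. **Inconsistent text size**: The scale factor (0.48) between the OCR coordinate system (1275x1650 pixels) and the PDF output size (612x792 pts) makes the font size calculation inaccurate, so text is too small or inconsistently sized.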
+
+## What Changes
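+- Fix the HTML extraction logic for bbox-overlap matching in `pp_structure_enhanced.py`
+- Improve OCR Track coordinate handling in `pdf_generator_service.py` by using the OCR coordinate-system dimensions as the PDF output size
+- Adjust the decision logic of `_check_cell_boxes_quality()` to avoid filtering out valid tables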
+
+## Impact
+- Affected specs: `ocr-processing`
+- Affected code:
+  - `backend/app/services/pp_structure_enhanced.py` - table HTML extraction logic
+  - `backend/app/services/pdf_generator_service.py` - coordinate-system handling for PDF generation
--- a/openspec/changes/archive/2025-12-11-fix-ocr-track-table-rendering/specs/ocr-processing/spec.md
+++ b/openspec/changes/archive/2025-12-11-fix-ocr-track-table-rendering/specs/ocr-processing/spec.md
@@ -0,0 +1,91 @@
+## MODIFIED Requirements
+
+### Requirement: Enhanced OCR with Full PP-StructureV3
+
+The system SHALL utilize the full capabilities of PP-StructureV3, extracting all element types from parsing_res_list, with proper handling of visual elements and table coordinates.
+
+#### Scenario: Extract comprehensive document structure
+- **WHEN** processing through OCR track
+- **THEN** the system SHALL use page_result.json['parsing_res_list']
+- **AND** extract all element types including headers, lists, tables, figures
+- **AND** preserve layout_bbox coordinates for each element
+
+#### Scenario: Maintain reading order
+- **WHEN** extracting elements from PP-StructureV3
+- **THEN** the system SHALL preserve the reading order from parsing_res_list
+- **AND** assign sequential indices to elements
+- **AND** support reordering for complex layouts
+
+#### Scenario: Extract table structure with HTML content
+- **WHEN** PP-StructureV3 identifies a table
+- **THEN** the system SHALL extract cell content and boundaries from table_res_list
+- **AND** extract pred_html for table HTML content
+- **AND** validate cell_boxes coordinates against page boundaries
+- **AND** apply fallback detection for invalid coordinates
+- **AND** preserve table HTML for structure
+- **AND** extract plain text for translation
+
+#### Scenario: Table matching via bbox overlap
+- **GIVEN** a table element from parsing_res_list without direct HTML content
+- **WHEN** matching against table_res_list using bbox overlap
+- **AND** overlap ratio exceeds 10%
+- **THEN** the system SHALL extract both cell_box_list and pred_html from the matched table_res
+- **AND** set element['html'] to the extracted pred_html
+- **AND** set element['extracted_text'] from the HTML content
+- **AND** log the successful extraction
+
+#### Scenario: Extract visual elements with paths
+- **WHEN** PP-StructureV3 identifies visual elements (IMAGE, FIGURE, CHART, DIAGRAM)
+- **THEN** the system SHALL preserve saved_path for each element
+- **AND** include image dimensions and format
+- **AND** enable image embedding in output PDF
+
+## ADDED Requirements
+
+### Requirement: OCR Track PDF Coordinate System
+
+The system SHALL generate PDF output for OCR Track using the OCR coordinate system dimensions to ensure accurate text sizing and positioning.
+
+#### Scenario: PDF page size matches OCR coordinate system
+- **GIVEN** an OCR track processing task
+- **WHEN** generating the output PDF
+- **THEN** the system SHALL use the OCR image dimensions as PDF page size
+- **AND** set scale factors to 1.0 (no scaling)
+- **AND** preserve original bbox coordinates without transformation
+
+#### Scenario: Text font size calculation without scaling
+- **GIVEN** a text element with bbox height H in OCR coordinates
+- **WHEN** rendering text in PDF
+- **THEN** the system SHALL calculate font size based directly on bbox height
+- **AND** NOT apply additional scaling factors
+- **AND** ensure readable text output
+
+#### Scenario: Direct Track PDF maintains original size
+- **GIVEN** a direct track processing task
+- **WHEN** generating the output PDF
+- **THEN** the system SHALL use the original PDF page dimensions
+- **AND** preserve existing coordinate transformation logic
+- **AND** NOT be affected by OCR Track coordinate changes
+
+### Requirement: Table Cell Quality Assessment
+
+The system SHALL assess table cell_boxes quality with appropriate thresholds to avoid filtering valid tables.
+
+#### Scenario: Cell density threshold
+- **GIVEN** a table with cell_boxes from PP-StructureV3
+- **WHEN** cell density exceeds 5.0 cells per 10,000 px²
+- **THEN** the system SHALL flag the table as potentially over-detected
+- **AND** log the specific density value for debugging
+
+#### Scenario: Average cell area threshold
+- **GIVEN** a table with cell_boxes
+- **WHEN** average cell area is less than 2,000 px²
+- **THEN** the system SHALL flag the table as potentially over-detected
+- **AND** log the specific area value for debugging
+
+#### Scenario: Valid tables with normal metrics
+- **GIVEN** a table with density < 5.0 cells/10000px² and avg area > 2000px²
+- **WHEN** quality assessment is applied
+- **THEN** the table SHALL be considered valid
+- **AND** cell_boxes SHALL be used for rendering
+- **AND** table content SHALL be displayed in PDF output
--- a/openspec/changes/archive/2025-12-11-fix-ocr-track-table-rendering/tasks.md
+++ b/openspec/changes/archive/2025-12-11-fix-ocr-track-table-rendering/tasks.md
@@ -0,0 +1,34 @@
+## 1. Fix Table HTML Extraction
+
+### 1.1 pp_structure_enhanced.py
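+- [x] 1.1.1 Add `pred_html` extraction logic to the bbox-overlap match (L527-534)
+- [x] 1.1.2 Ensure `element['html']` is set correctly on every matching path
+- [x] 1.1.3 Add `extracted_text`, the plain-text content extracted from the HTML
+- [x] 1.1.4 Add logging of the HTML extraction status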
+
+## 2. Fix PDF Coordinate System
+
+### 2.1 pdf_generator_service.py
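+- [x] 2.1.1 For the OCR Track, use the OCR coordinate-system dimensions (e.g. 1275x1650) as the PDF page size
+- [x] 2.1.2 Modify `_get_page_size_for_track()` to distinguish the OCR and Direct tracks
+- [x] 2.1.3 Adjust the font size calculation so that scaling no longer makes text too small
+- [x] 2.1.4 Ensure coordinate conversion applies no extra scaling on the OCR Track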
+
+## 3. Improve Table Cell Quality Check
+
+### 3.1 pdf_generator_service.py
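+- [x] 3.1.1 Review the decision conditions in `_check_cell_boxes_quality()`
+- [x] 3.1.2 Relax or adjust the thresholds to avoid filtering out valid tables (overlap threshold 10% → 25%)
+- [x] 3.1.3 Add more detailed logging that explains why a table was judged "bad quality"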
+
+### 3.2 Fix Table Content Rendering
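+- [x] 3.2.1 Issue found: `_draw_table_with_cell_boxes` renders only borders, not text content
+- [x] 3.2.2 Add a `cell_boxes_rendered` flag to track whether borders have already been rendered
+- [x] 3.2.3 Change the logic: after cell_boxes render the borders, continue rendering text with a ReportLab Table
+- [x] 3.2.4 Conditionally skip the GRID style when cell_boxes have already rendered the borders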
+
+## 4. Testing
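+- [x] 4.1 Test the fixed OCR Track processing with edit.pdf
+- [x] 4.2 Verify that table HTML is extracted and rendered correctly
+- [x] 4.3 Verify that text size is consistent and clearly readable
+- [ ] 4.4 Confirm that other document types are unaffected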