feat: enable document orientation detection for scanned PDFs
- Enable PP-StructureV3's use_doc_orientation_classify feature
- Detect rotation angle from doc_preprocessor_res.angle
- Swap page dimensions (width <-> height) for 90°/270° rotations
- Output PDF now correctly displays landscape-scanned content

Also includes:
- Archive completed openspec proposals
- Add simplify-frontend-ocr-config proposal (pending)
- Code cleanup and frontend simplification

🤖 Generated with [Claude Code](https://claude.com/claude-code)

Co-Authored-By: Claude Opus 4.5 <noreply@anthropic.com>
@@ -0,0 +1,88 @@
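## Context

The OCR Track processes documents with PP-StructureV3: it converts each PDF to PNG images (150 DPI) for OCR recognition, then converts the results to the UnifiedDocument format and generates the output PDF.

Current problems:
1. Table HTML content is not extracted in the bbox overlap matching path
2. Coordinate scaling during PDF generation produces incorrect text sizes
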
## Goals / Non-Goals

**Goals:**
- Fix table HTML extraction so that every table has a correct `html` and `extracted_text`
- Fix the coordinate-system problem in PDF generation so that text is sized correctly
- Leave the Direct Track and Hybrid Track unaffected

**Non-Goals:**
- Changing how PP-StructureV3 is invoked
- Changing the UnifiedDocument data structure
- Changing the frontend API

## Decisions

### Decision 1: Table HTML extraction fix

**Location**: `pp_structure_enhanced.py` L527-534

**Change**: When the bbox overlap match succeeds, also extract `pred_html`:

```python
if best_match and best_overlap > 0.1:
    cell_boxes = best_match['cell_box_list']
    element['cell_boxes'] = [[float(c) for c in box] for box in cell_boxes]
    element['cell_boxes_source'] = 'table_res_list'

    # New: also extract pred_html when no HTML has been found yet
    if not html_content and 'pred_html' in best_match:
        html_content = best_match['pred_html']
        element['html'] = html_content
        element['extracted_text'] = self._extract_text_from_html(html_content)
        logger.info("[TABLE] Extracted HTML from table_res_list (bbox match)")
```
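
The `_extract_text_from_html` helper called in the snippet is not shown in this proposal. As a minimal sketch of what such a step could look like, the standalone function below collects the text of each `<td>`/`<th>` cell with the standard-library `html.parser` and joins the cells with spaces. The names `extract_text_from_html` and `_CellTextCollector` are hypothetical, and the real helper may behave differently:

```python
from html.parser import HTMLParser


class _CellTextCollector(HTMLParser):
    """Collects the text of each <td>/<th> cell in document order."""

    def __init__(self):
        super().__init__()
        self.cells = []
        self._in_cell = False
        self._buffer = []

    def handle_starttag(self, tag, attrs):
        if tag in ("td", "th"):
            self._in_cell = True
            self._buffer = []

    def handle_endtag(self, tag):
        if tag in ("td", "th") and self._in_cell:
            # Collapse runs of whitespace inside the cell
            text = " ".join("".join(self._buffer).split())
            if text:
                self.cells.append(text)
            self._in_cell = False

    def handle_data(self, data):
        if self._in_cell:
            self._buffer.append(data)


def extract_text_from_html(html_content: str) -> str:
    """Hypothetical stand-in for _extract_text_from_html (assumption)."""
    collector = _CellTextCollector()
    collector.feed(html_content)
    collector.close()
    return " ".join(collector.cells)
```

Empty cells are skipped so that the plain text handed to translation contains only real content.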
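### Decision 2: OCR Track PDF coordinate system

**Option A (recommended)**: The OCR Track uses the OCR coordinate-system dimensions as the PDF page size

- The PDF page size is taken directly from the OCR coordinate system (e.g. 1275x1650 pixels → 1275x1650 pts)
- No coordinate scaling is applied: scale_x = scale_y = 1.0
- Font size is taken directly from the bbox height, with no further calculation

**Pros**:
- Coordinate conversion is simple, with no loss of precision
- Font sizes are computed accurately
- The PDF page aspect ratio matches the original document

**Cons**:
- The PDF page is larger (about twice Letter size per side: 1275x1650 vs. 612x792 pts)
- Viewers may need to zoom to read it

**Option B**: Keep Letter size and improve the scaling calculation

- Keep the PDF page at 612x792 pts
- Compute the DPI conversion factor correctly (72/150 = 0.48)
- Ensure font sizes remain readable after scaling

**Decision**: Adopt Option A, because it simplifies the implementation and avoids scaling precision problems.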
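
The arithmetic behind the two options can be sketched as follows. A Letter page (8.5 x 11 in) rasterized at 150 DPI is 1275x1650 pixels, and a PDF point is 1/72 in, so fitting those pixels back onto a 612x792 pt page needs a factor of 72/150 = 0.48. The function name `page_size_and_scale` and its `use_ocr_dimensions` flag are illustrative assumptions, not names taken from `pdf_generator_service.py`:

```python
OCR_DPI = 150                    # rasterization DPI stated in the Context section
POINTS_PER_INCH = 72             # PDF user-space units per inch
LETTER_SIZE_PT = (612.0, 792.0)  # 8.5 x 11 in expressed in points


def page_size_and_scale(ocr_width_px, ocr_height_px, use_ocr_dimensions=True):
    """Return ((page_w, page_h), (scale_x, scale_y)) for one page.

    use_ocr_dimensions=True  -> Option A: page size equals OCR size, no scaling.
    use_ocr_dimensions=False -> Option B: Letter page, coordinates scaled to fit.
    """
    if use_ocr_dimensions:
        return (float(ocr_width_px), float(ocr_height_px)), (1.0, 1.0)
    page_w, page_h = LETTER_SIZE_PT
    return (page_w, page_h), (page_w / ocr_width_px, page_h / ocr_height_px)


if __name__ == "__main__":
    print(page_size_and_scale(1275, 1650))         # Option A
    print(page_size_and_scale(1275, 1650, False))  # Option B
    print(POINTS_PER_INCH / OCR_DPI)               # DPI conversion factor
```

With Option A every bbox can be drawn at its OCR coordinates unchanged, whereas Option B multiplies every coordinate and font size by the scale factor.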
### Decision 3: Table quality check adjustment

**Current problem**: `_check_cell_boxes_quality()` filters out valid tables too aggressively

**Change**:
1. Raise the cell_density threshold (from 3.0 → 5.0 cells/10000px²)
2. Lower the min_avg_cell_area threshold (from 3000 → 2000 px²)
3. Add detailed logging that states which specific metric failed
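
A minimal sketch of a check using the two adjusted thresholds is shown below. It assumes that density is the cell count per 10,000 px² of table area and that these are the only two metrics; the actual `_check_cell_boxes_quality()` may compute them differently or check more. The function name, signature, and return shape here are hypothetical:

```python
import logging

logger = logging.getLogger(__name__)

MAX_CELL_DENSITY = 5.0      # cells per 10,000 px^2 (raised from 3.0)
MIN_AVG_CELL_AREA = 2000.0  # px^2 (lowered from 3000)


def check_cell_boxes_quality(cell_boxes, table_bbox):
    """Return (is_good, reasons); each box is [x0, y0, x1, y1] in pixels."""
    x0, y0, x1, y1 = table_bbox
    table_area = max(x1 - x0, 0) * max(y1 - y0, 0)
    if not cell_boxes or table_area <= 0:
        return False, ["no cell boxes or empty table area"]

    density = len(cell_boxes) / table_area * 10_000
    avg_area = sum(
        (bx1 - bx0) * (by1 - by0) for bx0, by0, bx1, by1 in cell_boxes
    ) / len(cell_boxes)

    reasons = []
    if density > MAX_CELL_DENSITY:
        reasons.append(f"cell density {density:.2f} > {MAX_CELL_DENSITY}")
    if avg_area < MIN_AVG_CELL_AREA:
        reasons.append(f"avg cell area {avg_area:.0f} < {MIN_AVG_CELL_AREA:.0f}")

    # Log each failing metric so it is clear why a table was rejected
    for reason in reasons:
        logger.info("[TABLE] bad quality: %s", reason)
    return not reasons, reasons
```

Returning the list of reasons alongside the verdict keeps the logging requirement testable without capturing log output.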
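## Risks / Trade-offs

- **Risk**: Changing the coordinate system may affect the existing PDF output format
  - **Mitigation**: The change applies only to the OCR Track; the Direct Track keeps its existing logic

- **Risk**: Relaxing the table quality check may let some genuinely low-quality tables be rendered
  - **Mitigation**: Adjust the thresholds gradually and validate the effect on test documents first

## Open Questions

1. Will the larger OCR Track PDF page size affect the user experience?
2. Should a configuration option let users choose the PDF output size?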
@@ -0,0 +1,17 @@
# Change: Fix OCR Track Table Rendering and Text Sizing

## Why
PDFs produced by the OCR Track have two main problems:
1. **Table content disappears**: PP-StructureV3 correctly returns `table_res_list` (containing `pred_html` and `cell_box_list`), but when matching by bbox overlap, `pp_structure_enhanced.py` extracted only `cell_boxes` and not `pred_html`, leaving the table's HTML content empty.
2. **Inconsistent text size**: The scale factor (0.48) between the OCR coordinate system (1275x1650 pixels) and the PDF output size (612x792 pts) makes the font size calculation inaccurate, so text comes out too small or inconsistently sized.

## What Changes
- Fix the HTML extraction logic for bbox overlap matching in `pp_structure_enhanced.py`
- Improve the OCR Track coordinate-system handling in `pdf_generator_service.py` so that the OCR coordinate-system dimensions are used as the PDF output size
- Adjust the decision logic in `_check_cell_boxes_quality()` to avoid over-filtering valid tables

## Impact
- Affected specs: `ocr-processing`
- Affected code:
  - `backend/app/services/pp_structure_enhanced.py` - table HTML extraction logic
  - `backend/app/services/pdf_generator_service.py` - PDF generation coordinate-system handling
@@ -0,0 +1,91 @@
## MODIFIED Requirements

### Requirement: Enhanced OCR with Full PP-StructureV3

The system SHALL utilize the full capabilities of PP-StructureV3, extracting all element types from parsing_res_list, with proper handling of visual elements and table coordinates.

#### Scenario: Extract comprehensive document structure
- **WHEN** processing through OCR track
- **THEN** the system SHALL use page_result.json['parsing_res_list']
- **AND** extract all element types including headers, lists, tables, figures
- **AND** preserve layout_bbox coordinates for each element

#### Scenario: Maintain reading order
- **WHEN** extracting elements from PP-StructureV3
- **THEN** the system SHALL preserve the reading order from parsing_res_list
- **AND** assign sequential indices to elements
- **AND** support reordering for complex layouts

#### Scenario: Extract table structure with HTML content
- **WHEN** PP-StructureV3 identifies a table
- **THEN** the system SHALL extract cell content and boundaries from table_res_list
- **AND** extract pred_html for table HTML content
- **AND** validate cell_boxes coordinates against page boundaries
- **AND** apply fallback detection for invalid coordinates
- **AND** preserve table HTML for structure
- **AND** extract plain text for translation

#### Scenario: Table matching via bbox overlap
- **GIVEN** a table element from parsing_res_list without direct HTML content
- **WHEN** matching against table_res_list using bbox overlap
- **AND** overlap ratio exceeds 10%
- **THEN** the system SHALL extract both cell_box_list and pred_html from the matched table_res
- **AND** set element['html'] to the extracted pred_html
- **AND** set element['extracted_text'] from the HTML content
- **AND** log the successful extraction
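
The matching step in this scenario can be sketched as below. Two assumptions are not confirmed by this spec: the overlap ratio is taken as the intersection area divided by the element's own bbox area, and each `table_res` entry is assumed to carry a `bbox` key. The function names are hypothetical:

```python
def overlap_ratio(box_a, box_b):
    """Intersection area divided by the area of box_a (assumed definition)."""
    ax0, ay0, ax1, ay1 = box_a
    bx0, by0, bx1, by1 = box_b
    inter_w = max(0.0, min(ax1, bx1) - max(ax0, bx0))
    inter_h = max(0.0, min(ay1, by1) - max(ay0, by0))
    area_a = (ax1 - ax0) * (ay1 - ay0)
    return (inter_w * inter_h) / area_a if area_a > 0 else 0.0


def find_best_table_match(element_bbox, table_res_list, min_overlap=0.1):
    """Return (best_match, best_overlap); best_match is None below the threshold."""
    best_match, best_overlap = None, 0.0
    for table_res in table_res_list:
        # "bbox" is an assumed key name for the table's bounding box
        ratio = overlap_ratio(element_bbox, table_res["bbox"])
        if ratio > best_overlap:
            best_match, best_overlap = table_res, ratio
    if best_match is not None and best_overlap > min_overlap:
        return best_match, best_overlap
    return None, best_overlap
```

A caller would then copy `cell_box_list` and `pred_html` from the returned entry onto the element, as the scenario requires.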
#### Scenario: Extract visual elements with paths
- **WHEN** PP-StructureV3 identifies visual elements (IMAGE, FIGURE, CHART, DIAGRAM)
- **THEN** the system SHALL preserve saved_path for each element
- **AND** include image dimensions and format
- **AND** enable image embedding in output PDF

## ADDED Requirements

### Requirement: OCR Track PDF Coordinate System

The system SHALL generate PDF output for OCR Track using the OCR coordinate system dimensions to ensure accurate text sizing and positioning.

#### Scenario: PDF page size matches OCR coordinate system
- **GIVEN** an OCR track processing task
- **WHEN** generating the output PDF
- **THEN** the system SHALL use the OCR image dimensions as PDF page size
- **AND** set scale factors to 1.0 (no scaling)
- **AND** preserve original bbox coordinates without transformation

#### Scenario: Text font size calculation without scaling
- **GIVEN** a text element with bbox height H in OCR coordinates
- **WHEN** rendering text in PDF
- **THEN** the system SHALL calculate font size based directly on bbox height
- **AND** NOT apply additional scaling factors
- **AND** ensure readable text output
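
As a rough sketch of this scenario, the font size can be read straight from the bbox height because the page uses OCR coordinates at a scale of 1.0. The function name `font_size_for_bbox` and the `min_font_size` floor (one possible reading of "ensure readable text output") are assumptions, not part of this spec:

```python
def font_size_for_bbox(bbox, min_font_size=6.0):
    """Font size in points taken directly from the bbox height H.

    bbox is [x0, y0, x1, y1] in OCR coordinates; no scale factor is applied.
    min_font_size is a hypothetical readability floor.
    """
    _, y0, _, y1 = bbox
    height = abs(y1 - y0)
    return max(float(height), min_font_size)
```

For example, a text line whose bbox spans y=200 to y=224 would be drawn at 24 pt on the 1275x1650 pt page.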
#### Scenario: Direct Track PDF maintains original size
- **GIVEN** a direct track processing task
- **WHEN** generating the output PDF
- **THEN** the system SHALL use the original PDF page dimensions
- **AND** preserve existing coordinate transformation logic
- **AND** NOT be affected by OCR Track coordinate changes

### Requirement: Table Cell Quality Assessment

The system SHALL assess table cell_boxes quality with appropriate thresholds to avoid filtering valid tables.

#### Scenario: Cell density threshold
- **GIVEN** a table with cell_boxes from PP-StructureV3
- **WHEN** cell density exceeds 5.0 cells per 10,000 px²
- **THEN** the system SHALL flag the table as potentially over-detected
- **AND** log the specific density value for debugging

#### Scenario: Average cell area threshold
- **GIVEN** a table with cell_boxes
- **WHEN** average cell area is less than 2,000 px²
- **THEN** the system SHALL flag the table as potentially over-detected
- **AND** log the specific area value for debugging

#### Scenario: Valid tables with normal metrics
- **GIVEN** a table with density < 5.0 cells/10000px² and avg area > 2000px²
- **WHEN** quality assessment is applied
- **THEN** the table SHALL be considered valid
- **AND** cell_boxes SHALL be used for rendering
- **AND** table content SHALL be displayed in PDF output
@@ -0,0 +1,34 @@
## 1. Fix Table HTML Extraction

### 1.1 pp_structure_enhanced.py
- [x] 1.1.1 Add `pred_html` extraction logic to the bbox overlap match (L527-534)
- [x] 1.1.2 Ensure `element['html']` is set correctly on every matching path
- [x] 1.1.3 Add `extracted_text`, the plain-text content extracted from the HTML
- [x] 1.1.4 Add logging of the HTML extraction status

## 2. Fix PDF Coordinate System

### 2.1 pdf_generator_service.py
- [x] 2.1.1 For the OCR Track, use the OCR coordinate-system dimensions (e.g. 1275x1650) as the PDF page size
- [x] 2.1.2 Modify `_get_page_size_for_track()` to distinguish between the OCR and Direct tracks
- [x] 2.1.3 Adjust the font size calculation so that scaling no longer makes text too small
- [x] 2.1.4 Ensure coordinate conversion applies no additional scaling on the OCR Track

## 3. Improve Table Cell Quality Check

### 3.1 pdf_generator_service.py
- [x] 3.1.1 Review the decision conditions in `_check_cell_boxes_quality()`
- [x] 3.1.2 Relax or adjust the thresholds to avoid over-filtering valid tables (overlap threshold 10% → 25%)
- [x] 3.1.3 Add more detailed logging explaining why a table is judged "bad quality"

### 3.2 Fix Table Content Rendering
- [x] 3.2.1 Problem found: `_draw_table_with_cell_boxes` renders only the borders, not the text content
- [x] 3.2.2 Add a `cell_boxes_rendered` flag to track whether the borders have already been rendered
- [x] 3.2.3 Change the logic: after cell_boxes render the borders, continue to render the text with a ReportLab Table
- [x] 3.2.4 Conditionally skip the GRID style when cell_boxes have already rendered the borders

## 4. Testing
- [x] 4.1 Test the fixed OCR Track processing with edit.pdf
- [x] 4.2 Verify that table HTML is extracted and rendered correctly
- [x] 4.3 Verify that text size is consistent and clearly readable
- [ ] 4.4 Confirm that other file types are unaffected