chore: backup before code cleanup

Backup commit before executing remove-unused-code proposal.
This includes all pending changes and new features.

🤖 Generated with [Claude Code](https://claude.com/claude-code)

Co-Authored-By: Claude Opus 4.5 <noreply@anthropic.com>
egg
2025-12-11 11:55:39 +08:00
parent eff9b0bcd5
commit 940a406dce
58 changed files with 8226 additions and 175 deletions


@@ -0,0 +1,88 @@
## Context
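The OCR Track processes documents with PP-StructureV3: it converts the PDF to PNG images (150 DPI) for OCR, converts the results to the UnifiedDocument format, and generates the output PDF.
Current problems:
1. Table HTML content is not extracted on the bbox overlap matching path
2. Coordinate scaling during PDF generation produces abnormal text sizes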
## Goals / Non-Goals
**Goals:**
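- Fix table HTML extraction so that every table has a correct `html` and `extracted_text`
- Fix the coordinate system used in PDF generation so that text sizes are correct
- Leave the Direct Track and Hybrid Track unaffected
**Non-Goals:**
- Do not change how PP-StructureV3 is invoked
- Do not change the UnifiedDocument data structure
- Do not change the frontend API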
## Decisions
### Decision 1: Fix table HTML extraction
**Location**: `pp_structure_enhanced.py` L527-534
**Change**: When the bbox overlap match succeeds, also extract `pred_html`
```python
if best_match and best_overlap > 0.1:
    cell_boxes = best_match['cell_box_list']
    element['cell_boxes'] = [[float(c) for c in box] for box in cell_boxes]
    element['cell_boxes_source'] = 'table_res_list'
    # New: also extract pred_html when no HTML was found on an earlier path
    if not html_content and 'pred_html' in best_match:
        html_content = best_match['pred_html']
        element['html'] = html_content
        element['extracted_text'] = self._extract_text_from_html(html_content)
        logger.info("[TABLE] Extracted HTML from table_res_list (bbox match)")
```
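The matching step that produces `best_match` and `best_overlap` is not shown in this change. A minimal sketch of how such a match could be computed follows. The helper names `overlap_ratio` and `find_best_match`, the `bbox` key on each `table_res_list` entry, and the choice of "intersection area divided by element area" as the overlap measure are all assumptions made for illustration, not the code in `pp_structure_enhanced.py`.

```python
from typing import Dict, List, Optional, Sequence, Tuple

Box = Sequence[float]  # [x0, y0, x1, y1]


def overlap_ratio(a: Box, b: Box) -> float:
    """Intersection area divided by the area of box a (assumed metric)."""
    ix0, iy0 = max(a[0], b[0]), max(a[1], b[1])
    ix1, iy1 = min(a[2], b[2]), min(a[3], b[3])
    inter = max(0.0, ix1 - ix0) * max(0.0, iy1 - iy0)
    area_a = (a[2] - a[0]) * (a[3] - a[1])
    return inter / area_a if area_a > 0 else 0.0


def find_best_match(
    element_bbox: Box, table_res_list: List[Dict]
) -> Tuple[Optional[Dict], float]:
    """Return the table_res entry that overlaps the element the most."""
    best_match, best_overlap = None, 0.0
    for table_res in table_res_list:
        bbox = table_res.get("bbox")  # hypothetical key, for illustration only
        if bbox is None:
            continue
        ratio = overlap_ratio(element_bbox, bbox)
        if ratio > best_overlap:
            best_match, best_overlap = table_res, ratio
    return best_match, best_overlap
```

With a result like this, the `best_overlap > 0.1` guard above accepts any candidate that covers more than 10% of the element.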
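### Decision 2: OCR Track PDF coordinate system handling
**Option A (recommended)**: The OCR Track uses the OCR coordinate system dimensions as the PDF page size
- The PDF page size uses the OCR coordinate system dimensions directly (e.g. 1275x1650 pixels → 1275x1650 pts)
- No coordinate scaling is applied (scale_x = scale_y = 1.0)
- Font size is taken directly from the bbox height, with no extra calculation
**Pros**:
- Coordinate conversion is simple and loses no precision
- Font size calculation is accurate
- The PDF page aspect ratio matches the original document
**Cons**:
- The PDF page is larger (about 2x Letter size in each dimension)
- Viewers may need to zoom to read it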
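**Option B**: Keep Letter size and improve the scaling calculation
- Keep the PDF page at 612x792 pts
- Compute the DPI conversion factor correctly (72/150 = 0.48)
- Ensure font sizes stay readable after scaling
**Choice**: Option A, because it simplifies the implementation and avoids scaling precision problems.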
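The arithmetic behind the two options can be checked with a short sketch. The function names `scale_factors` and `scaled_font_size` are hypothetical and exist only for this illustration; the page sizes are the ones quoted above.

```python
from typing import Tuple

Size = Tuple[float, float]

OCR_SIZE: Size = (1275, 1650)  # pixels at 150 DPI, from the example above
LETTER: Size = (612, 792)      # points


def scale_factors(ocr_size: Size, page_size: Size) -> Tuple[float, float]:
    """Factors that map OCR coordinates onto a PDF page of page_size."""
    return page_size[0] / ocr_size[0], page_size[1] / ocr_size[1]


def scaled_font_size(bbox_height: float, scale_y: float) -> float:
    """Font size derived from the bbox height after vertical scaling."""
    return bbox_height * scale_y


# Option A: the page size equals the OCR size, so nothing is scaled.
scale_a = scale_factors(OCR_SIZE, OCR_SIZE)
# Option B: Letter page, so both axes shrink by 72/150.
scale_b = scale_factors(OCR_SIZE, LETTER)
```

Under Option A a text line whose bbox is 20 px tall is drawn at 20 pt on a 1275x1650 pt page; under Option B the same line becomes roughly 9.6 pt on a Letter page, which is where the small-text symptom comes from.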
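### Decision 3: Adjust table quality assessment
**Current problem**: `_check_cell_boxes_quality()` filters out valid tables too aggressively
**Change**:
1. Raise the cell_density threshold (3.0 → 5.0 cells/10000px²)
2. Lower the min_avg_cell_area threshold (3000 → 2000 px²)
3. Add detailed logging that states which metric failed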
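A minimal sketch of a check using the two adjusted thresholds is shown here. It is not the body of `_check_cell_boxes_quality()`: the function name, its signature, and the way the metrics are computed (cell count per 10000 px² of table area, mean cell area) are assumptions based on the units given above.

```python
from typing import List, Sequence, Tuple

Box = Sequence[float]  # [x0, y0, x1, y1]


def check_cell_boxes_quality(
    cell_boxes: List[Box],
    table_bbox: Box,
    max_density: float = 5.0,           # cells per 10000 px²
    min_avg_cell_area: float = 2000.0,  # px²
) -> Tuple[bool, List[str]]:
    """Return (is_good, reasons); reasons names every metric that failed."""
    table_area = (table_bbox[2] - table_bbox[0]) * (table_bbox[3] - table_bbox[1])
    if not cell_boxes or table_area <= 0:
        return False, ["no cell boxes or empty table area"]

    density = len(cell_boxes) / table_area * 10000
    avg_area = sum((b[2] - b[0]) * (b[3] - b[1]) for b in cell_boxes) / len(cell_boxes)

    reasons = []
    if density > max_density:
        reasons.append(f"cell_density {density:.2f} > {max_density}")
    if avg_area < min_avg_cell_area:
        reasons.append(f"avg_cell_area {avg_area:.0f} < {min_avg_cell_area}")
    return not reasons, reasons
```

Returning the reasons alongside the verdict lets the caller log exactly which metric rejected a table, which is the point of item 3.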
## Risks / Trade-offs
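- **Risk**: Changing the coordinate system may affect the existing PDF output format
- **Mitigation**: Apply the change to the OCR Track only; the Direct Track keeps its existing logic
- **Risk**: Relaxing the table quality assessment may let some genuinely low-quality tables be rendered
- **Mitigation**: Adjust the thresholds gradually and verify the effect on test documents first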
## Open Questions
1. Will the larger OCR Track PDF size affect the user experience?
2. Should a configuration option let users choose the PDF output size?


@@ -0,0 +1,17 @@
# Change: Fix OCR Track Table Rendering and Text Sizing
## Why
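PDFs produced by the OCR Track have two main problems:
1. **Table content disappears**: PP-StructureV3 correctly returns `table_res_list` (containing `pred_html` and `cell_box_list`), but when matching via bbox overlap `pp_structure_enhanced.py` extracts only `cell_boxes` and not `pred_html`, so the table's HTML content is empty.
2. **Inconsistent text size**: The scaling factor (0.48) between the OCR coordinate system (1275x1650 pixels) and the PDF output size (612x792 pts) makes the font size calculation inaccurate, so text comes out too small or inconsistently sized.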
## What Changes
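- Fix the HTML extraction logic for bbox overlap matching in `pp_structure_enhanced.py`
- Improve OCR Track coordinate system handling in `pdf_generator_service.py` by using the OCR coordinate system dimensions as the PDF output size
- Adjust the assessment logic in `_check_cell_boxes_quality()` to avoid filtering out valid tables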
## Impact
- Affected specs: `ocr-processing`
- Affected code:
- `backend/app/services/pp_structure_enhanced.py` - table HTML extraction logic
- `backend/app/services/pdf_generator_service.py` - PDF generation coordinate system handling


@@ -0,0 +1,91 @@
## MODIFIED Requirements
### Requirement: Enhanced OCR with Full PP-StructureV3
The system SHALL utilize the full capabilities of PP-StructureV3, extracting all element types from parsing_res_list, with proper handling of visual elements and table coordinates.
#### Scenario: Extract comprehensive document structure
- **WHEN** processing through OCR track
- **THEN** the system SHALL use page_result.json['parsing_res_list']
- **AND** extract all element types including headers, lists, tables, figures
- **AND** preserve layout_bbox coordinates for each element
#### Scenario: Maintain reading order
- **WHEN** extracting elements from PP-StructureV3
- **THEN** the system SHALL preserve the reading order from parsing_res_list
- **AND** assign sequential indices to elements
- **AND** support reordering for complex layouts
#### Scenario: Extract table structure with HTML content
- **WHEN** PP-StructureV3 identifies a table
- **THEN** the system SHALL extract cell content and boundaries from table_res_list
- **AND** extract pred_html for table HTML content
- **AND** validate cell_boxes coordinates against page boundaries
- **AND** apply fallback detection for invalid coordinates
- **AND** preserve table HTML for structure
- **AND** extract plain text for translation
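The boundary validation in this scenario could look roughly like the sketch below. The function name `invalid_cell_box_indices` and the `tolerance` parameter are hypothetical; the spec does not say how invalid coordinates are detected or what the fallback does, so this only illustrates the check itself.

```python
from typing import List, Sequence

Box = Sequence[float]  # [x0, y0, x1, y1] in OCR pixel coordinates


def invalid_cell_box_indices(
    cell_boxes: List[Box],
    page_width: float,
    page_height: float,
    tolerance: float = 1.0,  # assumed slack for rounding at the page edge
) -> List[int]:
    """Indices of boxes that leave the page or have non-positive size."""
    bad = []
    for i, (x0, y0, x1, y1) in enumerate(cell_boxes):
        inside = (
            x0 >= -tolerance
            and y0 >= -tolerance
            and x1 <= page_width + tolerance
            and y1 <= page_height + tolerance
        )
        well_formed = x1 > x0 and y1 > y0
        if not (inside and well_formed):
            bad.append(i)
    return bad
```

A caller could trigger the fallback detection whenever the returned list is non-empty.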
#### Scenario: Table matching via bbox overlap
- **GIVEN** a table element from parsing_res_list without direct HTML content
- **WHEN** matching against table_res_list using bbox overlap
- **AND** overlap ratio exceeds 10%
- **THEN** the system SHALL extract both cell_box_list and pred_html from the matched table_res
- **AND** set element['html'] to the extracted pred_html
- **AND** set element['extracted_text'] from the HTML content
- **AND** log the successful extraction
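Deriving `extracted_text` from `pred_html` can be done with the standard library alone. The sketch below is one possible approach and is not the project's `_extract_text_from_html` implementation, which this change does not show; the class and function names are chosen for illustration.

```python
from html.parser import HTMLParser
from typing import List


class TableTextExtractor(HTMLParser):
    """Collects the non-empty text nodes of an HTML fragment."""

    def __init__(self) -> None:
        super().__init__()
        self.parts: List[str] = []

    def handle_data(self, data: str) -> None:
        text = data.strip()
        if text:
            self.parts.append(text)


def extract_text_from_html(html: str) -> str:
    """Return the cell texts of a table joined by single spaces."""
    parser = TableTextExtractor()
    parser.feed(html)
    parser.close()  # flush any buffered trailing text
    return " ".join(parser.parts)
```

For `<table><tr><td>Item</td><td>Qty</td></tr></table>` this yields `Item Qty`, which is enough for a translation step that only needs plain text.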
#### Scenario: Extract visual elements with paths
- **WHEN** PP-StructureV3 identifies visual elements (IMAGE, FIGURE, CHART, DIAGRAM)
- **THEN** the system SHALL preserve saved_path for each element
- **AND** include image dimensions and format
- **AND** enable image embedding in output PDF
## ADDED Requirements
### Requirement: OCR Track PDF Coordinate System
The system SHALL generate PDF output for OCR Track using the OCR coordinate system dimensions to ensure accurate text sizing and positioning.
#### Scenario: PDF page size matches OCR coordinate system
- **GIVEN** an OCR track processing task
- **WHEN** generating the output PDF
- **THEN** the system SHALL use the OCR image dimensions as PDF page size
- **AND** set scale factors to 1.0 (no scaling)
- **AND** preserve original bbox coordinates without transformation
#### Scenario: Text font size calculation without scaling
- **GIVEN** a text element with bbox height H in OCR coordinates
- **WHEN** rendering text in PDF
- **THEN** the system SHALL calculate font size based directly on bbox height
- **AND** NOT apply additional scaling factors
- **AND** ensure readable text output
#### Scenario: Direct Track PDF maintains original size
- **GIVEN** a direct track processing task
- **WHEN** generating the output PDF
- **THEN** the system SHALL use the original PDF page dimensions
- **AND** preserve existing coordinate transformation logic
- **AND** NOT be affected by OCR Track coordinate changes
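The split between the two tracks can be sketched as a single selection function. The name, the string track identifiers, and the parameters are assumptions for illustration; they are not taken from `pdf_generator_service.py`.

```python
from typing import Tuple

Size = Tuple[float, float]


def get_page_size_for_track(
    track: str, ocr_dimensions: Size, original_page_size: Size
) -> Size:
    """OCR Track: page size equals the OCR image size (scale 1.0).
    Any other track: keep the original PDF page dimensions."""
    if track == "ocr":
        return ocr_dimensions
    return original_page_size
```

Keeping the decision in one place means the Direct Track path never sees the OCR dimensions, which matches the last scenario above.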
### Requirement: Table Cell Quality Assessment
The system SHALL assess table cell_boxes quality with appropriate thresholds to avoid filtering valid tables.
#### Scenario: Cell density threshold
- **GIVEN** a table with cell_boxes from PP-StructureV3
- **WHEN** cell density exceeds 5.0 cells per 10,000 px²
- **THEN** the system SHALL flag the table as potentially over-detected
- **AND** log the specific density value for debugging
#### Scenario: Average cell area threshold
- **GIVEN** a table with cell_boxes
- **WHEN** average cell area is less than 2,000 px²
- **THEN** the system SHALL flag the table as potentially over-detected
- **AND** log the specific area value for debugging
#### Scenario: Valid tables with normal metrics
- **GIVEN** a table with density < 5.0 cells/10000px² and avg area > 2000px²
- **WHEN** quality assessment is applied
- **THEN** the table SHALL be considered valid
- **AND** cell_boxes SHALL be used for rendering
- **AND** table content SHALL be displayed in PDF output


@@ -0,0 +1,34 @@
## 1. Fix Table HTML Extraction
### 1.1 pp_structure_enhanced.py
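- [x] 1.1.1 Add `pred_html` extraction to the bbox overlap matching branch (L527-534)
- [x] 1.1.2 Ensure `element['html']` is set correctly on every matching path
- [x] 1.1.3 Populate `extracted_text` with the plain text extracted from the HTML
- [x] 1.1.4 Log the HTML extraction status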
## 2. Fix PDF Coordinate System
### 2.1 pdf_generator_service.py
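- [x] 2.1.1 For the OCR Track, use the OCR coordinate system dimensions (e.g. 1275x1650) as the PDF page size
- [x] 2.1.2 Modify `_get_page_size_for_track()` to distinguish the OCR and Direct tracks
- [x] 2.1.3 Adjust the font size calculation so that scaling no longer makes text too small
- [x] 2.1.4 Ensure coordinate conversion applies no extra scaling on the OCR Track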
## 3. Improve Table Cell Quality Check
### 3.1 pdf_generator_service.py
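- [x] 3.1.1 Review the conditions in `_check_cell_boxes_quality()`
- [x] 3.1.2 Relax or adjust the thresholds to avoid filtering out valid tables (overlap threshold 10% → 25%)
- [x] 3.1.3 Add more detailed logging that explains why a table is judged "bad quality"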
### 3.2 Fix Table Content Rendering
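- [x] 3.2.1 Problem found: `_draw_table_with_cell_boxes` renders only the borders, not the text content
- [x] 3.2.2 Add a `cell_boxes_rendered` flag to track whether the borders have already been rendered
- [x] 3.2.3 Change the logic: after cell_boxes renders the borders, continue to render the text with a ReportLab Table
- [x] 3.2.4 Conditionally skip the GRID style when cell_boxes has already rendered the borders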
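The conditional GRID handling in 3.2.4 can be illustrated without importing ReportLab, since a `TableStyle` is built from a list of command tuples. The function name and the default font size are hypothetical; only the `cell_boxes_rendered` flag comes from the tasks above.

```python
from typing import Any, List, Tuple


def build_table_style_commands(
    cell_boxes_rendered: bool,
    grid_color: Any,     # a ReportLab color object in real use
    font_size: float = 8,  # assumed default, for illustration
) -> List[Tuple]:
    """Command tuples in ReportLab TableStyle format.

    The GRID command is left out when the borders were already drawn
    from cell_boxes, so the lines are not painted twice.
    """
    commands: List[Tuple] = [
        ("FONTSIZE", (0, 0), (-1, -1), font_size),
        ("VALIGN", (0, 0), (-1, -1), "MIDDLE"),
    ]
    if not cell_boxes_rendered:
        commands.append(("GRID", (0, 0), (-1, -1), 0.5, grid_color))
    return commands
```

In real use the returned list would be passed to `reportlab.platypus.TableStyle(...)` and applied to the `Table` that renders the cell text.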
## 4. Testing
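- [x] 4.1 Test the fixed OCR Track processing with edit.pdf
- [x] 4.2 Verify that table HTML is extracted and rendered correctly
- [x] 4.3 Verify that text size is consistent and legible
- [ ] 4.4 Confirm that other file types are unaffected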