fix: correct page-number errors and text overlap in PDF generation

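## Fixes

### 1. Incorrect page-number assignment
- **Problem**: page numbers in layout_data and images_metadata were overwritten by 1-based values, which left every entry at page 0
- **Fix**: add a current_page parameter to analyze_layout() so the correct 0-based page number is set at the source
- **Impact**: tables and images now appear on the correct pages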
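
A minimal sketch of setting the page index at the source instead of overwriting it afterwards. Only the names `analyze_layout` and `current_page` come from this commit; the function body, the `page` key, and `process_document` are hypothetical stand-ins for illustration.

```python
from typing import Any, Dict, List


def analyze_layout(regions: List[Dict[str, Any]], current_page: int = 0) -> List[Dict[str, Any]]:
    """Hypothetical stand-in: tag every region with the 0-based page index."""
    return [{**region, "page": current_page} for region in regions]


def process_document(pages: List[List[Dict[str, Any]]]) -> List[Dict[str, Any]]:
    """Collect regions from all pages; no later step rewrites the page number."""
    layout_data: List[Dict[str, Any]] = []
    for page_index, regions in enumerate(pages):  # enumerate() starts at 0
        layout_data.extend(analyze_layout(regions, current_page=page_index))
    return layout_data
```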

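### 2. Text overlapping tables/images
- **Problem**: filtering relied on the non-existent 'tables' and 'image_regions' fields, so the filter had no effect
- **Fix**: use images_metadata instead (it contains the bbox of every table/image)
- **Added**: _bbox_overlaps() detects any overlap (not only full containment)
- **Impact**: text no longer covers table and image regions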
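
The difference between "any overlap" and "full containment" can be sketched as follows. This is not the code from pdf_generator_service.py: it assumes axis-aligned boxes stored as `(x0, y0, x1, y1)`, and the function names and the `bbox` key are illustrative.

```python
from typing import Dict, List, Sequence

BBox = Sequence[float]  # assumed layout: (x0, y0, x1, y1)


def bbox_overlaps(a: BBox, b: BBox) -> bool:
    """True if the two rectangles share any interior area."""
    ax0, ay0, ax1, ay1 = a
    bx0, by0, bx1, by1 = b
    # Boxes overlap unless one lies entirely to the side of, above, or below the other.
    return ax0 < bx1 and bx0 < ax1 and ay0 < by1 and by0 < ay1


def filter_text_in_regions(text_regions: List[Dict], excluded: List[BBox]) -> List[Dict]:
    """Keep only text regions that overlap none of the excluded boxes."""
    return [
        region for region in text_regions
        if not any(bbox_overlaps(region["bbox"], box) for box in excluded)
    ]
```

Boxes that merely touch along an edge are treated as non-overlapping here; whether that is the right choice depends on how tightly the OCR boxes are drawn.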

### 3. Rendering order
- **Adjusted**: images (bottom layer) → tables (middle layer) → text (top layer)
- **Impact**: visual layering is now correct
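
One way to express this layering is to sort elements by a layer rank before drawing. The sketch below assumes each element carries a `type` key with one of three values; the actual drawing calls are omitted and all names are illustrative.

```python
from typing import Dict, List

# Lower rank is drawn first and therefore ends up underneath.
LAYER_ORDER = {"image": 0, "table": 1, "text": 2}


def in_draw_order(elements: List[Dict]) -> List[Dict]:
    """Return elements sorted bottom layer first; sorted() is stable,
    so elements within one layer keep their original relative order."""
    return sorted(elements, key=lambda element: LAYER_ORDER[element["type"]])
```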

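## Technical details

- ocr_service.py: pass the current_page parameter through and remove the page-number overwrite logic
- pdf_generator_service.py:
  - Add the _bbox_overlaps() method
  - Update _filter_text_in_regions() to use overlap detection
  - Switch the data source to images_metadata
  - Adjust the drawing order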

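## Known limitations

- 21.6% of the text is still lost to filtering (an inherent problem of the coordinate-based positioning approach)
- The full layout information from PP-StructureV3 (parsing_res_list, layout_bbox) is not yet used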

🤖 Generated with [Claude Code](https://claude.com/claude-code)

Co-Authored-By: Claude <noreply@anthropic.com>
egg
2025-11-18 18:57:01 +08:00
parent 5cf4010c9b
commit 0edc56b03f
6 changed files with 485 additions and 45 deletions


@@ -0,0 +1,57 @@
# Result Export - Delta Changes
## ADDED Requirements
### Requirement: Image Extraction and Persistence
The OCR system SHALL save extracted images to disk during layout analysis for later use in PDF generation.
#### Scenario: Images extracted by PP-StructureV3 are saved to disk
- **WHEN** OCR processes a document containing images (charts, tables, figures)
- **THEN** system SHALL extract image objects from `markdown_images` dictionary
- **AND** system SHALL create `imgs/` subdirectory in result folder
- **AND** system SHALL save each image object to disk using PIL Image.save()
- **AND** saved file paths SHALL match paths recorded in JSON `images_metadata`
- **AND** system SHALL log warnings for failed image saves but continue processing
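
A minimal sketch of this scenario, assuming `markdown_images` maps a relative path (for example under `imgs/`) to an object exposing a PIL-style `save(path)` method. The function name and return value are hypothetical and not taken from the implementation.

```python
import logging
from pathlib import Path
from typing import Any, Dict, List

logger = logging.getLogger(__name__)


def save_markdown_images(markdown_images: Dict[str, Any], result_dir: Path) -> List[str]:
    """Save each image under result_dir and return the relative paths that succeeded."""
    saved: List[str] = []
    for rel_path, image in markdown_images.items():
        target = Path(result_dir) / rel_path
        try:
            # Creates the imgs/ subdirectory (or any other parent) on demand.
            target.parent.mkdir(parents=True, exist_ok=True)
            image.save(target)
            saved.append(rel_path)
        except Exception as exc:
            # A failed save is logged as a warning; processing continues.
            logger.warning("Failed to save image %s: %s", rel_path, exc)
    return saved
```

Writing each file at the same relative path used as the dictionary key is one way to keep the saved paths aligned with those recorded in `images_metadata`.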
#### Scenario: Multi-page documents with images on different pages
- **WHEN** OCR processes multi-page PDF with images on multiple pages
- **THEN** system SHALL save images from all pages to same `imgs/` folder
- **AND** image filenames SHALL include bbox coordinates for uniqueness
- **AND** images SHALL be available for PDF generation after OCR completes
### Requirement: Layout-Preserving PDF Generation
The system SHALL generate PDF files that preserve the original document layout using OCR JSON data.
#### Scenario: PDF generated from JSON with accurate layout
- **WHEN** user requests PDF download for a completed task
- **THEN** system SHALL parse OCR JSON result file
- **AND** system SHALL extract bounding box coordinates for each text region
- **AND** system SHALL determine page dimensions from source file or bbox maximum values
- **AND** system SHALL generate PDF with text positioned at precise coordinates
- **AND** system SHALL use Chinese-compatible font (e.g., Noto Sans CJK)
- **AND** system SHALL embed images from `imgs/` folder using paths in `images_metadata`
- **AND** generated PDF SHALL visually resemble original document layout with images
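
The bbox-maximum fallback for page dimensions might look like the sketch below. It assumes boxes stored as `(x0, y0, x1, y1)` with a top-left origin and uses an A4-sized default in points; both are assumptions. Because PDF user space has its origin at the bottom-left, a y coordinate also has to be flipped before drawing.

```python
from typing import Iterable, Sequence, Tuple

A4_POINTS = (595.0, 842.0)  # assumed default when no boxes are available


def page_size_from_bboxes(
    bboxes: Iterable[Sequence[float]],
    default: Tuple[float, float] = A4_POINTS,
) -> Tuple[float, float]:
    """Estimate (width, height) from the largest right and bottom edges."""
    boxes = list(bboxes)
    if not boxes:
        return default
    width = max(box[2] for box in boxes)
    height = max(box[3] for box in boxes)
    return (width, height)


def to_pdf_y(y_from_top: float, page_height: float) -> float:
    """Convert a top-left-origin y coordinate to PDF's bottom-left origin."""
    return page_height - y_from_top
```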
#### Scenario: PDF download works correctly
- **WHEN** user clicks PDF download button
- **THEN** system SHALL return cached PDF if already generated
- **OR** system SHALL generate new PDF from JSON on first request
- **AND** system SHALL NOT return 403 Forbidden error
- **AND** downloaded PDF SHALL contain task OCR results with layout preserved
#### Scenario: Multi-page PDF generation
- **WHEN** OCR JSON contains results for multiple pages
- **THEN** generated PDF SHALL contain same number of pages
- **AND** each page SHALL display text regions for that page only
- **AND** page dimensions SHALL match original document pages
## MODIFIED Requirements
### Requirement: Export Interface
The Export page SHALL support downloading OCR results in multiple formats using V2 task APIs.
#### Scenario: PDF caching improves performance
- **WHEN** user downloads same PDF multiple times
- **THEN** system SHALL serve cached PDF file on subsequent requests
- **AND** system SHALL NOT regenerate PDF unless JSON changes
- **AND** download response time SHALL be faster than initial generation
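
One simple way to decide whether a cached PDF can be served is to compare modification times. This is an illustrative sketch, not the project's caching logic; the function name is hypothetical, and a content hash could replace the timestamp check where mtimes are unreliable.

```python
from pathlib import Path


def needs_regeneration(json_path: Path, pdf_path: Path) -> bool:
    """True if no cached PDF exists or the OCR JSON is newer than the PDF."""
    if not pdf_path.exists():
        return True
    return json_path.stat().st_mtime > pdf_path.stat().st_mtime
```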