egg/OCR

egg 0edc56b03f fix: Fix page-number errors and text overlap in PDF generation

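## Fixes

### 1. Incorrect page-number assignment
- **Problem**: The page numbers in layout_data and images_metadata were overwritten by 1-based page-number logic, which left every entry at 0
- **Fix**: Added a current_page parameter to analyze_layout() so the correct 0-based page number is set at the source
- **Impact**: Tables and images now appear on the correct pages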
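
The fix above can be sketched as follows. This is a minimal illustration, not the code in ocr_service.py: only the `current_page` parameter name comes from the commit; the simplified `analyze_layout()` body, the `process_document()` helper, and the `"page"` key are assumptions.

```python
from typing import Any


def analyze_layout(elements: list[dict[str, Any]], current_page: int = 0) -> list[dict[str, Any]]:
    """Hypothetical stand-in: tag every element with its 0-based page index."""
    return [{**element, "page": current_page} for element in elements]


def process_document(pages: list[list[dict[str, Any]]]) -> list[dict[str, Any]]:
    """Collect elements from all pages, setting the page index at the source."""
    layout_data: list[dict[str, Any]] = []
    for page_index, elements in enumerate(pages):  # enumerate() starts at 0
        layout_data.extend(analyze_layout(elements, current_page=page_index))
    # No later pass rewrites "page", so the values set above survive.
    return layout_data
```

Passing the index into the function that creates the elements avoids a second pass that could overwrite it with a different numbering convention.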

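### 2. Text overlapping tables/images
- **Problem**: Filtering relied on the non-existent 'tables' and 'image_regions' fields, so it had no effect
- **Fix**: Use images_metadata instead (it contains the bbox of every table/image)
- **Added**: _bbox_overlaps() detects any overlap (not only full containment)
- **Impact**: Text no longer covers table and image regions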
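
A rough sketch of overlap-based filtering, to make the difference from a containment check concrete. The bbox layout `(x0, y0, x1, y1)`, the `"page"` and `"bbox"` keys, and the function names without a leading underscore are assumptions; the actual methods in pdf_generator_service.py may differ.

```python
BBox = tuple[float, float, float, float]  # assumed layout: (x0, y0, x1, y1)


def bbox_overlaps(a: BBox, b: BBox) -> bool:
    """Return True if the rectangles share any area (touching edges do not count)."""
    ax0, ay0, ax1, ay1 = a
    bx0, by0, bx1, by1 = b
    return ax0 < bx1 and bx0 < ax1 and ay0 < by1 and by0 < ay1


def filter_text_in_regions(text_regions: list[dict], images_metadata: list[dict]) -> list[dict]:
    """Drop text regions that overlap any table/image bbox on the same page."""
    kept = []
    for text in text_regions:
        blocked = any(
            meta["page"] == text["page"] and bbox_overlaps(text["bbox"], meta["bbox"])
            for meta in images_metadata
        )
        if not blocked:
            kept.append(text)
    return kept
```

Because any partial overlap removes a text region, a filter of this kind can also discard text that only grazes a table or image.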

### 3. Rendering order optimization
- **Change**: Images (bottom layer) → tables (middle layer) → text (top layer)
- **Impact**: The visual layering is more correct
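
One way to express this layering is to sort elements by a layer rank before drawing. The `"type"` key and the names below are hypothetical; the commit does not show how the generator actually orders its draw calls.

```python
# Lower numbers are drawn first and therefore end up underneath later layers.
DRAW_ORDER = {"image": 0, "table": 1, "text": 2}


def sort_for_drawing(elements: list[dict]) -> list[dict]:
    """Return elements bottom layer first; sorted() keeps the input order within a layer."""
    return sorted(elements, key=lambda element: DRAW_ORDER[element["type"]])
```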

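## Technical details

- ocr_service.py: Pass the current_page parameter through and remove the page-number override logic
- pdf_generator_service.py:
  - Add the _bbox_overlaps() method
  - Update _filter_text_in_regions() to use overlap detection
  - Switch the data source to images_metadata
  - Adjust the drawing order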

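## Known limitations

- 21.6% of the text is still lost to filtering (an inherent problem of the coordinate-based positioning approach)
- The full layout information from PP-StructureV3 (parsing_res_list, layout_bbox) is not yet used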

🤖 Generated with [Claude Code](https://claude.com/claude-code)

Co-Authored-By: Claude <noreply@anthropic.com>

2025-11-18 18:57:01 +08:00

Result Export - Delta Changes

ADDED Requirements

Requirement: Image Extraction and Persistence

The OCR system SHALL save extracted images to disk during layout analysis for later use in PDF generation.

Scenario: Images extracted by PP-StructureV3 are saved to disk

WHEN OCR processes a document containing images (charts, tables, figures)
THEN system SHALL extract image objects from markdown_images dictionary
AND system SHALL create imgs/ subdirectory in result folder
AND system SHALL save each image object to disk using PIL Image.save()
AND saved file paths SHALL match paths recorded in JSON images_metadata
AND system SHALL log warnings for failed image saves but continue processing
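
The scenario above could be implemented along these lines. This is a sketch under stated assumptions: markdown_images is taken to map a relative path such as `imgs/<name>.jpg` to a PIL-style image object exposing `save(path)`, and `save_markdown_images` and `result_dir` are hypothetical names.

```python
import logging
from pathlib import Path

logger = logging.getLogger(__name__)


def save_markdown_images(markdown_images: dict, result_dir: Path) -> list[str]:
    """Save each image under result_dir and return the relative paths that were written."""
    saved = []
    for rel_path, image in markdown_images.items():
        target = result_dir / rel_path
        try:
            target.parent.mkdir(parents=True, exist_ok=True)  # creates imgs/ on demand
            image.save(target)  # PIL's Image.save accepts a filesystem path
            saved.append(rel_path)
        except Exception as exc:  # a failed save must not abort OCR processing
            logger.warning("Failed to save image %s: %s", rel_path, exc)
    return saved
```

Returning the list of paths that were actually written makes it possible to check them against the paths recorded in images_metadata.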

Scenario: Multi-page documents with images on different pages

WHEN OCR processes a multi-page PDF with images on multiple pages
THEN system SHALL save images from all pages to the same imgs/ folder
AND image filenames SHALL include bbox coordinates for uniqueness
AND images SHALL be available for PDF generation after OCR completes
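
As an illustration of bbox-based filenames, the helpers below build a name from integer coordinates and recover them again. The exact naming pattern is an assumption made for this sketch; the spec only requires that the coordinates appear in the filename.

```python
import re
from typing import Optional

# Hypothetical pattern: img_in_<kind>_box_<x0>_<y0>_<x1>_<y1>.<ext>
_BOX_PATTERN = re.compile(r"_box_(\d+)_(\d+)_(\d+)_(\d+)\.\w+$")


def image_filename(kind: str, bbox: tuple[int, int, int, int], ext: str = "jpg") -> str:
    """Build a filename that embeds the bbox so names differ per region."""
    x0, y0, x1, y1 = bbox
    return f"img_in_{kind}_box_{x0}_{y0}_{x1}_{y1}.{ext}"


def bbox_from_filename(name: str) -> Optional[tuple[int, ...]]:
    """Recover the bbox from a filename built by image_filename(), or None if absent."""
    match = _BOX_PATTERN.search(name)
    return tuple(int(group) for group in match.groups()) if match else None
```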

Requirement: Layout-Preserving PDF Generation

The system SHALL generate PDF files that preserve the original document layout using OCR JSON data.

Scenario: PDF generated from JSON with accurate layout

WHEN user requests PDF download for a completed task
THEN system SHALL parse OCR JSON result file
AND system SHALL extract bounding box coordinates for each text region
AND system SHALL determine page dimensions from source file or bbox maximum values
AND system SHALL generate PDF with text positioned at precise coordinates
AND system SHALL use a Chinese-compatible font (e.g., Noto Sans CJK)
AND system SHALL embed images from imgs/ folder using paths in images_metadata
AND generated PDF SHALL visually resemble original document layout with images
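
Two of the steps above, inferring page dimensions from bbox maxima and positioning text, can be sketched without a PDF library. The bbox layout `(x0, y0, x1, y1)` with a top-left origin and the A4 fallback are assumptions; PDF user space places its origin at the bottom-left, so the y coordinate has to be flipped.

```python
BBox = tuple[float, float, float, float]  # assumed (x0, y0, x1, y1), origin top-left


def infer_page_size(
    bboxes: list[BBox], fallback: tuple[float, float] = (595.0, 842.0)
) -> tuple[float, float]:
    """Use the largest right/bottom edge as page size; fall back to roughly A4 in points."""
    if not bboxes:
        return fallback
    width = max(bbox[2] for bbox in bboxes)
    height = max(bbox[3] for bbox in bboxes)
    return width, height


def to_pdf_origin(bbox: BBox, page_height: float) -> tuple[float, float]:
    """Map a top-left-origin bbox to its bottom-left corner in PDF coordinates."""
    x0, _y0, _x1, y1 = bbox
    return x0, page_height - y1
```

Bbox maxima only give a lower bound on the page size, which is why the requirement prefers dimensions read from the source file when they are available.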

Scenario: PDF download works correctly

WHEN user clicks PDF download button
THEN system SHALL return cached PDF if already generated
OR system SHALL generate new PDF from JSON on first request
AND system SHALL NOT return 403 Forbidden error
AND downloaded PDF SHALL contain task OCR results with layout preserved

Scenario: Multi-page PDF generation

WHEN OCR JSON contains results for multiple pages
THEN generated PDF SHALL contain the same number of pages
AND each page SHALL display text regions for that page only
AND page dimensions SHALL match original document pages
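
A small sketch of the per-page split, assuming each region carries a 0-based `"page"` key and that the page count is known separately; both are assumptions, as is the function name.

```python
from collections import defaultdict


def regions_per_page(regions: list[dict], page_count: int) -> list[list[dict]]:
    """Return one list of regions per page, keeping empty pages so the count matches."""
    grouped: dict[int, list[dict]] = defaultdict(list)
    for region in regions:
        grouped[region["page"]].append(region)
    return [grouped.get(index, []) for index in range(page_count)]
```

Iterating over `range(page_count)` rather than over the keys that happen to occur keeps pages without any text in the output.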

MODIFIED Requirements

Requirement: Export Interface

The Export page SHALL support downloading OCR results in multiple formats using V2 task APIs.

Scenario: PDF caching improves performance

WHEN user downloads the same PDF multiple times
THEN system SHALL serve cached PDF file on subsequent requests
AND system SHALL NOT regenerate PDF unless JSON changes
AND download response time SHALL be faster than initial generation
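
One possible reading of "unless JSON changes" is a modification-time comparison between the cached PDF and its source JSON. The sketch below assumes that interpretation; `get_or_generate_pdf` and the `generate` callback are hypothetical names, and a content hash would be an alternative freshness check.

```python
from pathlib import Path
from typing import Callable


def get_or_generate_pdf(
    json_path: Path, pdf_path: Path, generate: Callable[[Path, Path], None]
) -> Path:
    """Return pdf_path, regenerating only if it is missing or older than the JSON."""
    is_fresh = (
        pdf_path.exists()
        and pdf_path.stat().st_mtime >= json_path.stat().st_mtime
    )
    if not is_fresh:
        generate(json_path, pdf_path)  # expected to write the PDF to pdf_path
    return pdf_path
```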
