OCR/openspec/changes/fix-result-preview-and-pdf-download/proposal.md
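egg 0edc56b03f fix: correct page-number errors and text overlap in PDF generation
## Fixes

### 1. Incorrect page-number assignment
- **Problem**: Page numbers in layout_data and images_metadata were overwritten by the 1-based numbering logic, leaving every entry at 0
- **Fix**: Added a current_page parameter to analyze_layout() so the correct 0-based page number is set at the source
- **Impact**: Tables and images now appear on the correct pages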

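### 2. Text overlapping tables/images
- **Problem**: Filtering relied on the nonexistent 'tables' and 'image_regions' fields, so it had no effect
- **Fix**: Switched to images_metadata (which contains the bbox of every table/image)
- **Added**: _bbox_overlaps() detects any overlap (not only full containment)
- **Impact**: Text no longer covers table and image regions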
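
A minimal sketch of the "any overlap" test, assuming each bbox is an axis-aligned `(x0, y0, x1, y1)` tuple. The function names `bbox_overlaps` and `filter_text_outside_regions` and the `"bbox"` key are hypothetical stand-ins; the actual `_bbox_overlaps()` signature is not shown in this commit message.

```python
from typing import Sequence

BBox = Sequence[float]  # assumed layout: (x0, y0, x1, y1)


def bbox_overlaps(a: BBox, b: BBox) -> bool:
    """Return True if two axis-aligned boxes share any interior area."""
    ax0, ay0, ax1, ay1 = a
    bx0, by0, bx1, by1 = b
    # Boxes overlap only if their extents intersect on both axes.
    # Boxes that merely touch along an edge do not count as overlapping.
    return ax0 < bx1 and bx0 < ax1 and ay0 < by1 and by0 < ay1


def filter_text_outside_regions(text_regions, region_bboxes):
    """Keep only text regions that overlap none of the table/image boxes."""
    return [
        t for t in text_regions
        if not any(bbox_overlaps(t["bbox"], r) for r in region_bboxes)
    ]
```

A containment-only check would keep text that straddles a table border; the overlap check drops it, which is consistent with the text loss noted under the known limitations.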

### 3. Rendering-order improvement
- **Change**: Images (bottom layer) → tables (middle layer) → text (top layer)
- **Impact**: More accurate visual layering

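## Technical details

- ocr_service.py: passes the new current_page parameter through and removes the page-number override logic
- pdf_generator_service.py:
  - Added the _bbox_overlaps() method
  - Updated _filter_text_in_regions() to use overlap detection
  - Switched the data source to images_metadata
  - Adjusted the drawing order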

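## Known limitations

- 21.6% of text is still lost to filtering (an inherent problem of the coordinate-based positioning approach)
- The full layout information from PP-StructureV3 (parsing_res_list, layout_bbox) is not yet used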

🤖 Generated with [Claude Code](https://claude.com/claude-code)

Co-Authored-By: Claude <noreply@anthropic.com>
2025-11-18 18:57:01 +08:00

Implement Layout-Preserving PDF Generation and Preview

Problem

Testing revealed three critical issues affecting user experience:

1. PDF Download Returns 403 Forbidden

  • Endpoint: GET /api/v2/tasks/{task_id}/download/pdf
  • Error: Backend returns HTTP 403 Forbidden
  • Impact: Users cannot download PDF format results
  • Root Cause: PDF generation service not implemented

2. Result Preview Shows Placeholder Text Instead of Layout-Preserving Content

  • Affected Pages:
    • Results page (/results)
    • Task Detail page (/tasks/{taskId})
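  • Current Behavior: Both pages display the placeholder message "請使用上方下載按鈕下載 Markdown、JSON 或 PDF 格式查看完整結果" ("Please use the download buttons above to download the Markdown, JSON, or PDF format to view the full results")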
  • Problem: Users cannot preview OCR results with original document layout preserved
  • Impact: Poor user experience - users cannot verify OCR accuracy visually

3. Images Extracted by PP-StructureV3 Are Not Saved to Disk

  • Affected File: backend/app/services/ocr_service.py:554-561
  • Current Behavior:
    • PP-StructureV3 extracts images from documents (tables, charts, figures)
    • analyze_layout() receives image objects in markdown_images dictionary
    • Code only saves image path strings to JSON, never saves actual image files
    • Result directory contains no imgs/ folder with extracted images
  • Impact:
    • JSON references non-existent files (e.g., imgs/img_in_table_box_*.jpg)
    • Layout-preserving PDF cannot embed images because source files don't exist
    • Loss of critical visual content from original documents
  • Root Cause: Missing image file saving logic in analyze_layout() function

Proposed Changes

Change 0: Fix Image Extraction and Saving (PREREQUISITE)

Modify the OCR service to save extracted images to disk so that PDF generation can embed them.

Implementation approach:

  1. Update analyze_layout() Function

    • Locate the image-handling code at ocr_service.py:554-561 (it currently records only path strings)
    • Extract img_obj from markdown_images.items()
    • Create imgs/ subdirectory in result folder
    • Save each img_obj to disk using PIL Image.save()
    • Verify saved file path matches JSON images_metadata
  2. File Naming and Organization

    • PP-StructureV3 generates paths like imgs/img_in_table_box_145_1253_2329_2488.jpg
    • Create full path: {result_dir}/{img_path}
    • Ensure parent directories exist before saving
    • Handle image format conversion if needed (PNG, JPEG)
  3. Error Handling

    • Log warnings if image objects are missing or corrupt
    • Continue processing even if individual images fail
    • Include error info in images_metadata for debugging
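
The three steps above can be sketched roughly as follows. This is an illustration under stated assumptions, not the code in ocr_service.py: the helper name `save_extracted_images` and the metadata keys `image_path`, `saved`, and `error` are hypothetical, and each value in `markdown_images` is assumed to be a PIL-style image object exposing `mode`, `convert()`, and `save()`.

```python
import logging
from pathlib import Path
from typing import Any, Dict, List

logger = logging.getLogger(__name__)


def save_extracted_images(markdown_images: Dict[str, Any], result_dir: Path) -> List[dict]:
    """Write each extracted image under result_dir and return metadata entries."""
    metadata = []
    for img_path, img_obj in markdown_images.items():
        # e.g. {result_dir}/imgs/img_in_table_box_145_1253_2329_2488.jpg
        target = Path(result_dir) / img_path
        entry = {"image_path": img_path, "saved": False}  # hypothetical keys
        try:
            if img_obj is None:
                raise ValueError("image object is missing")
            # Ensure imgs/ (and any other parent directory) exists.
            target.parent.mkdir(parents=True, exist_ok=True)
            # JPEG has no alpha channel, so convert modes such as RGBA first.
            is_jpeg = target.suffix.lower() in {".jpg", ".jpeg"}
            if is_jpeg and getattr(img_obj, "mode", "RGB") not in ("RGB", "L", "CMYK"):
                img_obj = img_obj.convert("RGB")
            img_obj.save(target)
            entry["saved"] = True
        except Exception as exc:
            # One bad image should not abort the whole OCR task.
            logger.warning("Could not save %s: %s", img_path, exc)
            entry["error"] = str(exc)
        metadata.append(entry)
    return metadata
```

Recording the failure in the returned entry, rather than raising, keeps processing going while leaving a trace in images_metadata for debugging.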

Why This is Critical:

  • Without saved images, layout-preserving PDF cannot embed visual content
  • Images contain crucial information (charts, diagrams, table contents)
  • PP-StructureV3 already does the hard work of extraction; the remaining step is to write the results to disk

Change 1: Implement Layout-Preserving PDF Generation Service

Create a PDF generation service that reconstructs the original document layout from OCR JSON data.

Implementation approach:

  1. Parse JSON OCR Results

    • Read text_regions array containing text, bounding boxes, confidence scores
    • Extract page dimensions from original file or infer from bbox coordinates
    • Group elements by page number
  2. Generate PDF with ReportLab

    • Create PDF canvas with original page dimensions
    • Iterate through each text region
    • Draw text at precise coordinates from bbox
    • Support Chinese fonts (e.g., Noto Sans CJK, Source Han Sans)
    • Optionally draw bounding boxes for visualization
  3. Handle Complex Elements

    • Text: Draw at bbox coordinates with appropriate font size
    • Tables: Reconstruct from layout analysis (if available)
    • Images: Embed from images_metadata
    • Preserve rotation/skew from bbox geometry
  4. Caching Strategy

    • Generate PDF once per task completion
    • Store in task result directory as {filename}_layout.pdf
    • Serve cached version on subsequent requests
    • Regenerate only if JSON changes
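
The coordinate handling in steps 1 and 2 is where most of the risk sits, so here is a rough sketch of it. It assumes OCR bboxes are axis-aligned `(x0, y0, x1, y1)` tuples in image pixels with the origin at the top-left, and that each region carries a 0-based `page` field; both are assumptions about the JSON, and the helper names are hypothetical. PDF user space has its origin at the bottom-left, so the y axis must be flipped as well as scaled.

```python
from collections import defaultdict
from typing import Dict, List, Sequence, Tuple


def bbox_to_pdf_rect(
    bbox: Sequence[float],
    image_size: Tuple[float, float],
    page_size: Tuple[float, float],
) -> Tuple[float, float, float, float]:
    """Map a pixel bbox (origin top-left) to (x, y, width, height)
    in PDF points (origin bottom-left)."""
    x0, y0, x1, y1 = bbox
    img_w, img_h = image_size
    page_w, page_h = page_size
    sx, sy = page_w / img_w, page_h / img_h  # pixel -> point scale factors
    width = (x1 - x0) * sx
    height = (y1 - y0) * sy
    x = x0 * sx
    y = page_h - y1 * sy  # bottom edge of the box, measured from page bottom
    return x, y, width, height


def group_by_page(text_regions: List[dict]) -> Dict[int, List[dict]]:
    """Group regions by their (assumed) 0-based 'page' field."""
    pages = defaultdict(list)
    for region in text_regions:
        pages[region.get("page", 0)].append(region)
    return dict(pages)
```

With ReportLab, the returned `x` and `y` could then be passed to `canvas.drawString(x, y, text)` after `canvas.setFont(...)`, with the box height serving as a first guess for the font size.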

Technical stack:

  • ReportLab: PDF generation with precise coordinate control
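  • Pillow: Read dimensions from source images and embed extracted images (Pillow cannot open PDF files, so page sizes for PDF inputs need another source, such as inference from bbox coordinates)
  • Chinese fonts: Noto Sans CJK or Source Han Sans (must be installed)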
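
The caching strategy from step 4 of the implementation approach could be decided with a simple timestamp comparison. This is a sketch under assumptions: it treats `{filename}` as the stem of the result JSON file and uses modification times to detect a changed JSON. The helper names are hypothetical, and a content hash would be a stricter alternative.

```python
from pathlib import Path


def layout_pdf_path(json_path: Path) -> Path:
    """Return the cached PDF path next to the JSON, e.g. scan.json -> scan_layout.pdf."""
    return json_path.with_name(f"{json_path.stem}_layout.pdf")


def needs_regeneration(json_path: Path, pdf_path: Path) -> bool:
    """True if no cached PDF exists or the JSON is newer than the cached PDF."""
    if not pdf_path.exists():
        return True
    return json_path.stat().st_mtime > pdf_path.stat().st_mtime
```

The download endpoint would call `needs_regeneration()` first and serve the existing file whenever it returns False.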

Change 2: Implement In-Browser PDF Preview

Replace placeholder text with interactive PDF preview using react-pdf.

Implementation approach:

  1. Install react-pdf

    npm install react-pdf
    
  2. Create PDF Viewer Component

    • Fetch PDF from /api/v2/tasks/{task_id}/download/pdf
    • Render using <Document> and <Page> from react-pdf
    • Add zoom controls, page navigation
    • Show loading spinner while PDF loads
  3. Update ResultsPage and TaskDetailPage

    • Replace placeholder with PDF viewer
    • Add download button above viewer
    • Handle errors gracefully (show error if PDF unavailable)

Benefits:

  • Users see OCR results with original layout preserved
  • Visual verification of OCR accuracy
  • No download required for quick review
  • Professional presentation of results

Scope

In scope:

  • Fix image extraction to save extracted images to disk (PREREQUISITE)
  • Implement layout-preserving PDF generation service from JSON
  • Install and configure Chinese fonts (Noto Sans CJK)
  • Create PDF viewer component with react-pdf
  • Add PDF preview to Results page and Task Detail page
  • Cache generated PDFs for performance
  • Embed extracted images into layout-preserving PDF
  • Error handling for image saving, PDF generation and preview failures

Out of scope:

  • OCR result editing in preview
  • Advanced PDF features (annotations, search, highlights)
  • Excel/JSON inline preview
  • Real-time PDF regeneration (will use cached version)

Impact

  • User Experience: Major improvement - layout-preserving visual preview with images
  • Backend: Significant changes - image saving fix, new PDF generation service
  • Frontend: Medium changes - PDF viewer integration
  • Dependencies: New - ReportLab, react-pdf, Chinese fonts (Pillow already installed)
  • Performance: Medium - PDF generation cached after first request, minimal overhead for image saving
  • Risk: Medium - complex coordinate transformation, font rendering, image embedding
  • Data Integrity: High improvement - images now properly preserved alongside text