OCR/openspec/changes/archive/2025-12-12-fix-ocr-track-reflow-pdf/proposal.md
egg 1f18010040 fix: OCR Track reflow PDF and translation with image text filtering
- Add OCR Track support for reflow PDF generation using raw_ocr_regions.json
- Add OCR Track translation extraction from raw_ocr_regions instead of elements
- Add raw_ocr_translations output format for OCR Track documents
- Add exclusion zone filtering to remove text overlapping with images
- Update API validation to accept both translations and raw_ocr_translations
- Add page_number field to TranslatedItem for proper tracking

🤖 Generated with [Claude Code](https://claude.com/claude-code)

Co-Authored-By: Claude Opus 4.5 <noreply@anthropic.com>
2025-12-12 11:02:35 +08:00

Change: Fix OCR Track Reflow PDF

Why

OCR Track reflow PDF generation drops most of the document content because:

  1. PP-StructureV3 extracts tables as elements but stores content: "" (empty string) instead of structured content.cells data
  2. The generate_reflow_pdf method expects content.cells for tables, so tables are skipped
  3. Table text exists in raw_ocr_regions.json (59 text blocks) but is not used by reflow PDF generation
  4. This causes significant content loss: the reflow PDF renders only 6 text elements, while raw_ocr_regions.json holds 59 raw OCR regions
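
The failure mode described above can be illustrated with a minimal sketch. The helper name table_has_cells and the fields inside each cell are hypothetical; only the content: "" and content.cells shapes come from this proposal.

```python
from typing import Any


def table_has_cells(element: dict[str, Any]) -> bool:
    """Return True only when a table element carries a non-empty content.cells list."""
    content = element.get("content")
    return isinstance(content, dict) and bool(content.get("cells"))


# Shape reported for OCR Track tables: content is an empty string.
ocr_table = {"type": "table", "content": ""}

# Shape the reflow generator expects; the cell fields are illustrative assumptions.
direct_table = {
    "type": "table",
    "content": {"cells": [{"row": 0, "col": 0, "content": "Total"}]},
}

# A generator that requires content.cells would skip every element in this list.
skipped = [e for e in (ocr_table, direct_table) if not table_has_cells(e)]
```

Under these assumptions only the OCR Track table ends up in skipped, which is why its text never reaches the reflow PDF.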

The Layout PDF works correctly because it uses raw_ocr_regions.json via Simple Text Positioning mode, bypassing the need for structured table data.

What Changes

Reflow PDF Generation for OCR Track

Modify generate_reflow_pdf to use raw_ocr_regions.json as the primary text source for OCR Track documents:

  1. Detect processing track from JSON metadata
  2. For OCR Track: Load raw_ocr_regions.json and render all text blocks in reading order
  3. For Direct Track: Continue using content.cells for tables (already works)
  4. Images/Charts: Continue using content.saved_path from elements (works for both tracks)
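
The steps above could be sketched roughly as follows. This is not the actual generate_reflow_pdf code: the function names, the metadata.processing_track key, and the text field on each region are assumptions made for illustration, and image handling is left out.

```python
import json
from pathlib import Path
from typing import Any


def detect_track(result: dict[str, Any]) -> str:
    """Read the processing track from JSON metadata (key name is an assumption)."""
    # Fall back to "direct" when the flag is absent.
    track = result.get("metadata", {}).get("processing_track", "direct")
    return str(track).lower()


def collect_reflow_text(result: dict[str, Any], raw_regions_path: Path) -> list[str]:
    """Pick the text source for the reflow PDF based on the processing track."""
    if detect_track(result) == "ocr" and raw_regions_path.exists():
        # OCR Track: every non-empty raw OCR region becomes a text block.
        regions = json.loads(raw_regions_path.read_text(encoding="utf-8"))
        return [r["text"] for r in regions if r.get("text", "").strip()]

    # Direct Track (or OCR Track without a raw regions file): use the elements.
    texts: list[str] = []
    for element in result.get("elements", []):
        content = element.get("content")
        if isinstance(content, str) and content.strip():
            texts.append(content)
        elif isinstance(content, dict):
            # Tables: take the text of each non-empty cell from content.cells.
            texts.extend(
                str(cell["content"])
                for cell in content.get("cells", [])
                if cell.get("content")
            )
    return texts
```

Falling back to the element list when raw_ocr_regions.json is missing is a design choice of this sketch, not something the proposal specifies.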

Data Flow

OCR Track Reflow PDF (NEW):

raw_ocr_regions.json (59 text blocks)
  + scan_result.json (images/charts only)
  → Sort by Y coordinate (reading order)
  → Render text paragraphs + images
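
As an illustration of the "Sort by Y coordinate" step, here is a small sketch. It assumes each region has a page number and a bbox of the form [x0, y0, x1, y1] with the origin at the top-left; the real raw_ocr_regions.json layout may differ, and sort_reading_order is a hypothetical helper.

```python
from typing import Any


def sort_reading_order(regions: list[dict[str, Any]]) -> list[dict[str, Any]]:
    """Return regions ordered by page, then top edge, then left edge."""
    # bbox[1] is the assumed top edge (y0); bbox[0] is the assumed left edge (x0).
    return sorted(
        regions,
        key=lambda r: (r.get("page", 1), r["bbox"][1], r["bbox"][0]),
    )


regions = [
    {"text": "second line", "page": 1, "bbox": [40, 120, 300, 140]},
    {"text": "next page", "page": 2, "bbox": [40, 30, 300, 50]},
    {"text": "first line", "page": 1, "bbox": [40, 80, 300, 100]},
]
ordered = [r["text"] for r in sort_reading_order(regions)]
```

A plain top-to-bottom sort like this can interleave lines on multi-column pages, so it fits best where a single-column reading order is acceptable.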

Direct Track Reflow PDF (UNCHANGED):

*_result.json (elements with content.cells)
  → Render tables, text, images in order

Impact

  • Affected file: backend/app/services/pdf_generator_service.py
  • User experience: OCR Track reflow PDF will contain all text content (matching Layout PDF)
  • Translation: Translated reflow PDFs will also work correctly for OCR Track

Migration

  • No data migration required
  • Existing raw_ocr_regions.json files contain all necessary data
  • No API changes