- Add OCR Track support for reflow PDF generation using raw_ocr_regions.json
- Add OCR Track translation extraction from raw_ocr_regions instead of elements
- Add raw_ocr_translations output format for OCR Track documents
- Add exclusion zone filtering to remove text overlapping with images
- Update API validation to accept both translations and raw_ocr_translations
- Add page_number field to TranslatedItem for proper tracking

🤖 Generated with [Claude Code](https://claude.com/claude-code)

Co-Authored-By: Claude Opus 4.5 <noreply@anthropic.com>
Change: Fix OCR Track Reflow PDF
Why
The OCR Track reflow PDF generation is missing most content because:
- PP-StructureV3 extracts tables as elements but stores content: "" (empty string) instead of structured content.cells data
- The generate_reflow_pdf method expects content.cells for tables, so tables are skipped
- Table text exists in raw_ocr_regions.json (59 text blocks) but is not used by reflow PDF generation
- This causes significant content loss: only 6 text elements vs 59 raw OCR regions
The Layout PDF works correctly because it uses raw_ocr_regions.json via Simple Text Positioning mode, bypassing the need for structured table data.
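The failure mode described above can be illustrated with a minimal sketch. The element shape used here (a dict with type and content keys, and a cells list inside content) is an assumption made for illustration; the actual schema in pdf_generator_service.py may differ, and has_table_cells is a hypothetical helper name.

```python
from typing import Any


def has_table_cells(element: dict[str, Any]) -> bool:
    """Return True if a table element carries structured cell data.

    Assumed shape: {"type": "table", "content": {"cells": [...]}}.
    """
    content = element.get("content")
    # An OCR Track table stores content as "" (a string, not a dict),
    # so a renderer that requires content["cells"] has nothing to draw.
    return isinstance(content, dict) and bool(content.get("cells"))


# Hypothetical sample elements for the two tracks.
ocr_table = {"type": "table", "content": ""}
direct_table = {
    "type": "table",
    "content": {"cells": [{"row": 0, "col": 0, "content": "Total"}]},
}

skipped = [e for e in (ocr_table, direct_table) if not has_table_cells(e)]
```

Under these assumptions only the OCR Track table ends up in skipped, which is the content loss the reflow PDF currently shows.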
What Changes
Reflow PDF Generation for OCR Track
Modify generate_reflow_pdf to use raw_ocr_regions.json as the primary text source for OCR Track documents:
- Detect processing track from JSON metadata
- For OCR Track: Load raw_ocr_regions.json and render all text blocks in reading order
- For Direct Track: Continue using content.cells for tables (already works)
- Images/Charts: Continue using content.saved_path from elements (works for both tracks)
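The track-based selection listed above might look roughly like the following sketch. It is not the actual generate_reflow_pdf code: the metadata key processing_track, the value "ocr", the text field on each raw region, and the function name select_text_blocks are all assumptions. For brevity it handles only plain text elements on the Direct Track and leaves table and image rendering out.

```python
from typing import Any, Optional


def select_text_blocks(
    result: dict[str, Any],
    raw_regions: Optional[list[dict[str, Any]]],
) -> list[str]:
    """Pick the text source for reflow rendering based on the processing track."""
    # Assumed metadata layout: {"metadata": {"processing_track": "ocr" | "direct"}}.
    track = result.get("metadata", {}).get("processing_track", "direct")

    if track == "ocr" and raw_regions:
        # OCR Track: take every non-empty raw OCR region as a text block.
        return [r["text"] for r in raw_regions if r.get("text", "").strip()]

    # Direct Track (or no raw regions available): fall back to text elements.
    return [
        e["content"]
        for e in result.get("elements", [])
        if e.get("type") == "text"
        and isinstance(e.get("content"), str)
        and e["content"].strip()
    ]
```

Falling back to the element list when raw regions are missing keeps the Direct Track path unchanged, in line with the "already works" note above.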
Data Flow
OCR Track Reflow PDF (NEW):
raw_ocr_regions.json (59 text blocks)
+ scan_result.json (images/charts only)
→ Sort by Y coordinate (reading order)
→ Render text paragraphs + images
Direct Track Reflow PDF (UNCHANGED):
*_result.json (elements with content.cells)
→ Render tables, text, images in order
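The "Sort by Y coordinate" step in the OCR Track flow could be sketched as below. The bbox layout [x0, y0, x1, y1] with the origin at the top-left is an assumption about raw_ocr_regions.json. The line_tolerance parameter is also an addition that the proposal does not mention: it groups regions whose tops are nearly level into one row, so that words on the same line are ordered left to right.

```python
from typing import Any


def sort_reading_order(
    regions: list[dict[str, Any]],
    line_tolerance: float = 5.0,
) -> list[dict[str, Any]]:
    """Order OCR regions top-to-bottom, then left-to-right within a row.

    Assumes each region has "bbox": [x0, y0, x1, y1] with y growing downward.
    """
    by_top = sorted(regions, key=lambda r: r["bbox"][1])

    rows: list[list[dict[str, Any]]] = []
    for region in by_top:
        # Start a new row unless this region's top is close to the row's first top.
        if rows and abs(region["bbox"][1] - rows[-1][0]["bbox"][1]) <= line_tolerance:
            rows[-1].append(region)
        else:
            rows.append([region])

    # Flatten rows, sorting each one by its left edge.
    return [r for row in rows for r in sorted(row, key=lambda r: r["bbox"][0])]


# Hypothetical sample regions, deliberately out of order.
sample = [
    {"text": "c", "bbox": [10, 50, 60, 62]},
    {"text": "b", "bbox": [100, 12, 150, 24]},
    {"text": "a", "bbox": [10, 10, 60, 22]},
]
ordered_text = [r["text"] for r in sort_reading_order(sample)]
```

A plain sort on y0 alone would also satisfy the flow as written. The tolerance only matters when regions on the same visual line have slightly different top coordinates.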
Impact
- Affected file: backend/app/services/pdf_generator_service.py
- User experience: OCR Track reflow PDF will contain all text content (matching Layout PDF)
- Translation: Reflow translated PDF will also work correctly for OCR Track
Migration
- No data migration required
- Existing raw_ocr_regions.json files contain all necessary data
- No API changes