OCR/openspec/changes/archive/2025-12-12-fix-ocr-track-reflow-pdf/tasks.md
egg 1f18010040 fix: OCR Track reflow PDF and translation with image text filtering
- Add OCR Track support for reflow PDF generation using raw_ocr_regions.json
- Add OCR Track translation extraction from raw_ocr_regions instead of elements
- Add raw_ocr_translations output format for OCR Track documents
- Add exclusion zone filtering to remove text overlapping with images
- Update API validation to accept both translations and raw_ocr_translations
- Add page_number field to TranslatedItem for proper tracking

🤖 Generated with [Claude Code](https://claude.com/claude-code)

Co-Authored-By: Claude Opus 4.5 <noreply@anthropic.com>
2025-12-12 11:02:35 +08:00

Tasks: Fix OCR Track Reflow PDF

1. Modify generate_reflow_pdf Method

  • 1.1 Add processing track detection

    • File: backend/app/services/pdf_generator_service.py
    • Location: generate_reflow_pdf method (line ~4704)
    • Read metadata.processing_track from JSON data
    • Branch logic based on track type
  • 1.2 Add helper function to load raw OCR regions

    • File: backend/app/services/pdf_generator_service.py
    • Reuse existing: load_raw_ocr_regions from text_region_renderer.py
    • Pattern: {task_id}_*_page_{page_num}_raw_ocr_regions.json
    • Return: List of text regions with bbox and content
  • 1.3 Implement OCR Track reflow rendering

    • File: backend/app/services/pdf_generator_service.py
    • For OCR Track: Load raw OCR regions per page
    • Sort text blocks by Y coordinate (top to bottom reading order)
    • Render text blocks as paragraphs
    • Still render images/charts from elements
  • 1.4 Keep Direct Track logic unchanged

    • File: backend/app/services/pdf_generator_service.py
    • Direct Track continues using content.cells for tables
    • Existing rendering logic extracted into the _render_reflow_elements helper method
    • No changes to existing Direct Track flow
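
The track branching and reading-order sort from section 1 can be sketched roughly as below. This is an illustration only, not the code in pdf_generator_service.py: the function names, the "ocr"/"direct" track values, the fallback to "direct" when metadata is missing, and the region shape (bbox as [x0, y0, x1, y1] with the origin at the top left, plus a content string) are all assumptions.

```python
from typing import Any


def detect_track(result: dict[str, Any]) -> str:
    """Read metadata.processing_track; fall back to 'direct' (assumed default)."""
    metadata = result.get("metadata") or {}
    return str(metadata.get("processing_track", "direct")).lower()


def order_regions(regions: list[dict[str, Any]]) -> list[dict[str, Any]]:
    """Sort regions top to bottom by the top edge, then left to right."""
    return sorted(regions, key=lambda r: (r["bbox"][1], r["bbox"][0]))


def reflow_paragraphs(
    result: dict[str, Any], regions: list[dict[str, Any]]
) -> list[str]:
    """Return paragraph texts in reading order for the detected track."""
    if detect_track(result) == "ocr" and regions:
        # OCR Track: text comes from the raw OCR regions, not from elements.
        return [r["content"] for r in order_regions(regions) if r["content"].strip()]
    # Direct Track (or no raw regions available): keep using elements.
    return [
        e["content"]
        for e in result.get("elements", [])
        if isinstance(e.get("content"), str)
    ]
```

Sorting on the top edge alone is enough for single-column scans; a multi-column layout would need column grouping before the Y sort, which this sketch does not attempt.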

2. Handle Multi-page Documents

  • 2.1 Support per-page raw OCR files
    • Pattern: {task_id}_*_page_{page_num}_raw_ocr_regions.json
    • Iterate through pages and load corresponding raw OCR file
    • Handle missing files gracefully (fall back to elements)
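
A minimal sketch of the per-page lookup with a graceful fallback, assuming the raw OCR files sit in a single result directory and each contains a JSON list of regions. The helper names here are hypothetical and separate from the existing load_raw_ocr_regions in text_region_renderer.py, whose signature is not shown in this document.

```python
import json
from pathlib import Path
from typing import Any, Optional


def find_raw_ocr_file(result_dir: Path, task_id: str, page_num: int) -> Optional[Path]:
    """Locate {task_id}_*_page_{page_num}_raw_ocr_regions.json, if present."""
    pattern = f"{task_id}_*_page_{page_num}_raw_ocr_regions.json"
    matches = sorted(result_dir.glob(pattern))
    return matches[0] if matches else None


def load_page_regions(
    result_dir: Path, task_id: str, page_num: int
) -> Optional[list[dict[str, Any]]]:
    """Return the page's raw OCR regions, or None so the caller can use elements."""
    path = find_raw_ocr_file(result_dir, task_id, page_num)
    if path is None:
        return None
    try:
        with path.open(encoding="utf-8") as fh:
            data = json.load(fh)
    except (OSError, json.JSONDecodeError):
        # Unreadable or corrupt file: treat it the same as a missing file.
        return None
    return data if isinstance(data, list) else None
```

Returning None rather than an empty list lets the caller tell "no raw OCR file for this page" apart from "file present but the page has no text", so only the former triggers the fallback to elements.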

3. Testing

  • 3.1 Test OCR Track reflow PDF

    • Test with: a9259180-fc49-4890-8184-2e6d5f4edad3 (scan document)
    • Verify: All 59 text blocks appear in reflow PDF
    • Verify: Images are embedded correctly
  • 3.2 Test Direct Track reflow PDF

    • Test with: 1b32428d-0609-4cfd-bc52-56be6956ac2e (editable PDF)
    • Verify: Tables render with cells
    • Verify: No regression from changes
  • 3.3 Test translated reflow PDF

    • Test: Complete translation then download reflow PDF
    • Verify: Translated text appears correctly