OCR/openspec/changes/archive/2025-12-12-fix-ocr-track-translation/tasks.md
egg 1f18010040 fix: OCR Track reflow PDF and translation with image text filtering
- Add OCR Track support for reflow PDF generation using raw_ocr_regions.json
- Add OCR Track translation extraction from raw_ocr_regions instead of elements
- Add raw_ocr_translations output format for OCR Track documents
- Add exclusion zone filtering to remove text overlapping with images
- Update API validation to accept both translations and raw_ocr_translations
- Add page_number field to TranslatedItem for proper tracking

🤖 Generated with [Claude Code](https://claude.com/claude-code)

Co-Authored-By: Claude Opus 4.5 <noreply@anthropic.com>
2025-12-12 11:02:35 +08:00


Tasks: Fix OCR Track Translation

1. Modify Translation Service

  • 1.1 Add processing track detection

    • File: backend/app/services/translation_service.py
    • Location: translate_document method
    • Read metadata.processing_track from result JSON
    • Pass track type to extraction method
  • 1.2 Create helper to load raw OCR regions

    • File: backend/app/services/translation_service.py
    • Function: _load_raw_ocr_regions(result_dir, task_id, page_num)
    • Pattern: {task_id}_*_page_{page_num}_raw_ocr_regions.json
    • Return: List of text regions, each with its index and text content
  • 1.3 Modify extract_translatable_elements for OCR Track

    • File: backend/app/services/translation_service.py
    • Added: extract_translatable_elements_ocr_track method
    • Added parameters: result_dir: Path, task_id: str
    • For OCR Track: Extract from raw_ocr_regions.json
    • For Direct Track: Keep existing element-based extraction
  • 1.4 Update translation result format

    • File: backend/app/services/translation_service.py
    • Location: build_translation_result method
    • Added processing_track parameter
    • For OCR Track: Output raw_ocr_translations field
    • Structure: [{"page": 1, "index": 0, "original": "...", "translated": "..."}]
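
The steps in section 1 can be sketched roughly as follows. This is a minimal illustration, not the code in translation_service.py: the function names are hypothetical stand-ins, the "direct" default track value is an assumption, and the layout of raw_ocr_regions.json (a JSON list of objects with a "text" key) is assumed because the task list does not specify it.

```python
import json
from pathlib import Path
from typing import Any


def detect_processing_track(result: dict[str, Any]) -> str:
    """Return metadata.processing_track; the "direct" default is an assumption."""
    return result.get("metadata", {}).get("processing_track", "direct")


def load_raw_ocr_regions(result_dir: Path, task_id: str, page_num: int) -> list[dict[str, Any]]:
    """Load one page of raw OCR regions as [{"index": ..., "content": ...}]."""
    pattern = f"{task_id}_*_page_{page_num}_raw_ocr_regions.json"
    matches = sorted(result_dir.glob(pattern))
    if not matches:
        return []
    raw = json.loads(matches[0].read_text(encoding="utf-8"))
    # Assumption: the file holds a JSON list of objects with a "text" key.
    # The index is the region's position in the file, so it stays stable
    # even when blank regions are skipped.
    return [
        {"index": i, "content": region["text"]}
        for i, region in enumerate(raw)
        if region.get("text", "").strip()
    ]


def build_raw_ocr_translations(
    page_num: int,
    regions: list[dict[str, Any]],
    translated: dict[int, str],
) -> list[dict[str, Any]]:
    """Produce entries shaped like the raw_ocr_translations structure."""
    return [
        {
            "page": page_num,
            "index": region["index"],
            "original": region["content"],
            "translated": translated[region["index"]],
        }
        for region in regions
        if region["index"] in translated
    ]
```

Keeping the index tied to the region's position in the raw file, rather than its position in the filtered list, lets a later stage match each translation back to the same region by (page, index).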

2. Modify PDF Generation

  • 2.1 Update generate_translated_pdf for OCR Track

    • File: backend/app/services/pdf_generator_service.py
    • Detect processing_track and raw_ocr_translations from translation JSON
    • For OCR Track: Call _generate_translated_pdf_ocr_track
    • For Direct Track: Continue using apply_translations (element-based)
  • 2.2 Create helper to apply raw OCR translations

    • File: backend/app/services/pdf_generator_service.py
    • Function: _generate_translated_pdf_ocr_track
    • Build translation lookup: {(page, index): translated_text}
    • Load raw OCR regions, sort by Y coordinate
    • Render translated text, falling back to the original text when no translation exists
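
A minimal sketch of the lookup and fallback logic described in section 2. It makes several assumptions: processing_track is a top-level key of the translation JSON with the value "ocr" for OCR Track, each raw region carries a "text" key and a "bbox" of [x0, y0, x1, y1], and all function names here are hypothetical. Actual PDF rendering is left out.

```python
from typing import Any


def uses_ocr_track(translation_json: dict[str, Any]) -> bool:
    """Decide whether to take the OCR Track path."""
    # Assumption: processing_track is a top-level key whose OCR value is "ocr".
    return (
        translation_json.get("processing_track") == "ocr"
        and bool(translation_json.get("raw_ocr_translations"))
    )


def build_translation_lookup(
    raw_ocr_translations: list[dict[str, Any]],
) -> dict[tuple[int, int], str]:
    """Map (page, index) to translated text."""
    return {
        (item["page"], item["index"]): item["translated"]
        for item in raw_ocr_translations
    }


def texts_in_reading_order(
    page_num: int,
    regions: list[dict[str, Any]],
    lookup: dict[tuple[int, int], str],
) -> list[str]:
    """Return the text to render for one page, ordered by top Y coordinate."""
    # Pair each region with its original index before sorting, so the
    # (page, index) key still refers to the region's position in the raw file.
    indexed = sorted(enumerate(regions), key=lambda pair: pair[1]["bbox"][1])
    # Fall back to the original text when a translation is missing or empty.
    return [
        lookup.get((page_num, index)) or region["text"]
        for index, region in indexed
    ]
```

The point worth noting is the ordering of operations: if regions were sorted first and indexed afterwards, the lookup keys would no longer line up with the indices recorded at translation time.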

3. Additional Fixes

  • 3.1 Add page_number to TranslatedItem

    • File: backend/app/schemas/translation.py
    • Added page_number: int = 1 to TranslatedItem dataclass
    • Updated translate_batch and translate_item to pass page_number
  • 3.2 Update API endpoint validation

    • File: backend/app/routers/translate.py
    • Check for both translations (Direct Track) and raw_ocr_translations (OCR Track)
  • 3.3 Filter text overlapping with images

    • File: backend/app/services/pdf_generator_service.py
    • Added _collect_exclusion_zones, _is_region_overlapping_exclusion, _filter_regions_by_exclusion
    • Applied filtering in generate_reflow_pdf and _generate_translated_pdf_ocr_track
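
The validation check in 3.2 and the exclusion filtering in 3.3 might look something like the sketch below. The overlap criterion is an assumption: the task list does not say how overlap is measured, so this version drops a region when at least half of its area lies inside an image box. The function names, the 0.5 threshold, and the [x0, y0, x1, y1] box layout are all hypothetical.

```python
from typing import Any

Box = tuple[float, float, float, float]  # (x0, y0, x1, y1)


def has_translation_payload(data: dict[str, Any]) -> bool:
    """Accept either translations (Direct Track) or raw_ocr_translations (OCR Track)."""
    return bool(data.get("translations") or data.get("raw_ocr_translations"))


def overlap_ratio(region: Box, zone: Box) -> float:
    """Fraction of the region's area that falls inside the zone."""
    inter_w = min(region[2], zone[2]) - max(region[0], zone[0])
    inter_h = min(region[3], zone[3]) - max(region[1], zone[1])
    intersection = max(0.0, inter_w) * max(0.0, inter_h)
    area = (region[2] - region[0]) * (region[3] - region[1])
    return intersection / area if area > 0 else 0.0


def filter_regions_by_exclusion(
    regions: list[dict[str, Any]],
    exclusion_zones: list[Box],
    threshold: float = 0.5,  # assumed cutoff, not taken from the source
) -> list[dict[str, Any]]:
    """Keep only regions that do not substantially overlap any exclusion zone."""
    return [
        region
        for region in regions
        if all(
            overlap_ratio(tuple(region["bbox"]), zone) < threshold
            for zone in exclusion_zones
        )
    ]
```

Measuring overlap relative to the text region's own area, rather than the image's, means a small caption sitting fully inside a large image is removed while a long paragraph that merely touches the image edge is kept.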

4. Testing

  • 4.1 Test OCR Track translation

    • Test with: f8265449-6cb7-425d-a213-5d2e1af73955
    • Verify: All 59 text blocks are sent for translation
    • Verify: Translation JSON contains raw_ocr_translations
  • 4.2 Test OCR Track translated PDF

    • Generate translated reflow PDF
    • Verify: All translated text blocks appear correctly
    • Verify: Text inside images (like EWsenel) is filtered out
  • 4.3 Test Direct Track unchanged

    • Verify: Translation still uses element-based approach
    • Verify: No regression in Direct Track flow