- Add OCR Track support for reflow PDF generation using raw_ocr_regions.json
- Add OCR Track translation extraction from raw_ocr_regions instead of elements
- Add raw_ocr_translations output format for OCR Track documents
- Add exclusion zone filtering to remove text overlapping with images
- Update API validation to accept both translations and raw_ocr_translations
- Add page_number field to TranslatedItem for proper tracking

🤖 Generated with [Claude Code](https://claude.com/claude-code)

Co-Authored-By: Claude Opus 4.5 <noreply@anthropic.com>
Tasks: Fix OCR Track Translation
1. Modify Translation Service
- 1.1 Add processing track detection
  - File: `backend/app/services/translation_service.py`
  - Location: `translate_document` method
  - Read `metadata.processing_track` from result JSON
  - Pass track type to extraction method
- 1.2 Create helper to load raw OCR regions
  - File: `backend/app/services/translation_service.py`
  - Function: `_load_raw_ocr_regions(result_dir, task_id, page_num)`
  - Pattern: `{task_id}_*_page_{page_num}_raw_ocr_regions.json`
  - Return: List of text regions with index and content
- 1.3 Modify extract_translatable_elements for OCR Track
  - File: `backend/app/services/translation_service.py`
  - Added: `extract_translatable_elements_ocr_track` method
  - Added parameters: `result_dir: Path`, `task_id: str`
  - For OCR Track: Extract from raw_ocr_regions.json
  - For Direct Track: Keep existing element-based extraction
- 1.4 Update translation result format
  - File: `backend/app/services/translation_service.py`
  - Location: `build_translation_result` method
  - Added `processing_track` parameter
  - For OCR Track: Output `raw_ocr_translations` field
  - Structure: `[{"page": 1, "index": 0, "original": "...", "translated": "..."}]`
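Steps 1.2 and 1.4 can be sketched roughly as follows. This is a minimal illustration, not the project's implementation: it assumes each raw OCR regions file is a JSON list of objects with a `text` key (the real schema is not shown in this task list) and that a region's position in that list serves as its index. The function names and bodies here are hypothetical stand-ins for the methods named above.

```python
import json
from pathlib import Path
from typing import Dict, List


def load_raw_ocr_regions(result_dir: Path, task_id: str, page_num: int) -> List[Dict]:
    """Load one page's text regions as [{"index": ..., "content": ...}].

    Assumption: the file holds a JSON list of objects with a "text" key.
    """
    pattern = f"{task_id}_*_page_{page_num}_raw_ocr_regions.json"
    matches = sorted(result_dir.glob(pattern))
    if not matches:
        return []  # no OCR output for this page
    with matches[0].open(encoding="utf-8") as f:
        raw = json.load(f)
    return [
        {"index": i, "content": region.get("text", "")}
        for i, region in enumerate(raw)
    ]


def build_raw_ocr_translations(
    page_num: int, regions: List[Dict], translated: List[str]
) -> List[Dict]:
    """Pair each region with its translation in the raw_ocr_translations shape."""
    return [
        {
            "page": page_num,
            "index": region["index"],
            "original": region["content"],
            "translated": text,
        }
        for region, text in zip(regions, translated)
    ]
```

Keying each entry by page and index rather than by element ID is what lets the PDF step match translations back to raw OCR regions later.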
2. Modify PDF Generation
- 2.1 Update generate_translated_pdf for OCR Track
  - File: `backend/app/services/pdf_generator_service.py`
  - Detect `processing_track` and `raw_ocr_translations` from translation JSON
  - For OCR Track: Call `_generate_translated_pdf_ocr_track`
  - For Direct Track: Continue using `apply_translations` (element-based)
- 2.2 Create helper to apply raw OCR translations
  - File: `backend/app/services/pdf_generator_service.py`
  - Function: `_generate_translated_pdf_ocr_track`
  - Build translation lookup: `{(page, index): translated_text}`
  - Load raw OCR regions, sort by Y coordinate
  - Render translated text with original fallback
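The lookup, ordering, and fallback described in 2.2 might look like the sketch below. It assumes each region is a dict with `index`, `content`, and a `bbox` of `[x0, y0, x1, y1]`; those field names are illustrative guesses, and the helper names are hypothetical. Actual rendering into the PDF is left out.

```python
from typing import Dict, List, Tuple


def build_translation_lookup(
    raw_ocr_translations: List[Dict],
) -> Dict[Tuple[int, int], str]:
    """Map (page, index) to translated text."""
    return {
        (item["page"], item["index"]): item["translated"]
        for item in raw_ocr_translations
    }


def texts_in_reading_order(
    page: int, regions: List[Dict], lookup: Dict[Tuple[int, int], str]
) -> List[str]:
    """Return the text to render for one page, top to bottom.

    Assumption: bbox is [x0, y0, x1, y1], so bbox[1] is the top Y coordinate.
    A missing or empty translation falls back to the original text.
    """
    ordered = sorted(regions, key=lambda r: r["bbox"][1])
    return [lookup.get((page, r["index"])) or r["content"] for r in ordered]
```

Falling back to the original text keeps every region visible in the output even when a translation is missing for it.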
3. Additional Fixes
- 3.1 Add page_number to TranslatedItem
  - File: `backend/app/schemas/translation.py`
  - Added `page_number: int = 1` to TranslatedItem dataclass
  - Updated `translate_batch` and `translate_item` to pass page_number
- 3.2 Update API endpoint validation
  - File: `backend/app/routers/translate.py`
  - Check for both `translations` (Direct Track) and `raw_ocr_translations` (OCR Track)
- 3.3 Filter text overlapping with images
  - File: `backend/app/services/pdf_generator_service.py`
  - Added `_collect_exclusion_zones`, `_is_region_overlapping_exclusion`, `_filter_regions_by_exclusion`
  - Applied filtering in `generate_reflow_pdf` and `_generate_translated_pdf_ocr_track`
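The exclusion-zone filtering in 3.3 could work along these lines. This is a sketch under stated assumptions: boxes are `[x0, y0, x1, y1]`, a region is dropped when at least half of its area lies inside an image zone, and the `0.5` threshold is an illustrative choice rather than the value used in `pdf_generator_service.py`. The function names echo the helpers listed above but the bodies are hypothetical.

```python
from typing import Dict, List, Sequence

Box = Sequence[float]  # assumed layout: [x0, y0, x1, y1]


def overlap_ratio(region: Box, zone: Box) -> float:
    """Fraction of the region's area that lies inside the zone."""
    width = min(region[2], zone[2]) - max(region[0], zone[0])
    height = min(region[3], zone[3]) - max(region[1], zone[1])
    if width <= 0 or height <= 0:
        return 0.0  # boxes do not intersect
    area = (region[2] - region[0]) * (region[3] - region[1])
    return (width * height) / area if area > 0 else 0.0


def is_region_overlapping_exclusion(
    region: Box, zones: List[Box], threshold: float = 0.5
) -> bool:
    """True if any exclusion zone covers at least `threshold` of the region."""
    return any(overlap_ratio(region, zone) >= threshold for zone in zones)


def filter_regions_by_exclusion(
    regions: List[Dict], zones: List[Box], threshold: float = 0.5
) -> List[Dict]:
    """Keep only regions whose "bbox" is not mostly inside an exclusion zone."""
    return [
        r for r in regions
        if not is_region_overlapping_exclusion(r["bbox"], zones, threshold)
    ]
```

Measuring overlap against the text region's own area, rather than the image's, means a small caption fully inside a large image is removed while a long line that only clips the image edge is kept.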
4. Testing
- 4.1 Test OCR Track translation
  - Test with: `f8265449-6cb7-425d-a213-5d2e1af73955`
  - Verify: All 59 text blocks are sent for translation
  - Verify: Translation JSON contains `raw_ocr_translations`
- 4.2 Test OCR Track translated PDF
  - Generate translated reflow PDF
  - Verify: All translated text blocks appear correctly
  - Verify: Text inside images (like EWsenel) is filtered out
- 4.3 Test Direct Track unchanged
  - Verify: Translation still uses element-based approach
  - Verify: No regression in Direct Track flow
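The format checks in 4.1 and 4.3 could be automated with a small helper like the one below. It is a hypothetical sketch: it only inspects which top-level key a translation JSON carries, using the two field names from this task list, and the `"ocr"` / `"direct"` return labels are made up for illustration.

```python
def translation_track(data: dict) -> str:
    """Classify a translation JSON by the field it carries.

    Returns "ocr" when raw_ocr_translations is present and non-empty,
    "direct" when translations is present and non-empty.
    """
    if data.get("raw_ocr_translations"):
        return "ocr"
    if data.get("translations"):
        return "direct"
    raise ValueError(
        "translation JSON has neither 'translations' nor 'raw_ocr_translations'"
    )
```

A test for 4.1 could then assert that the result for the OCR Track task is `"ocr"` and that the list holds the expected number of entries.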