- Add OCR Track support for reflow PDF generation using raw_ocr_regions.json
- Add OCR Track translation extraction from raw_ocr_regions instead of elements
- Add raw_ocr_translations output format for OCR Track documents
- Add exclusion zone filtering to remove text overlapping with images
- Update API validation to accept both translations and raw_ocr_translations
- Add page_number field to TranslatedItem for proper tracking

🤖 Generated with [Claude Code](https://claude.com/claude-code)

Co-Authored-By: Claude Opus 4.5 <noreply@anthropic.com>

# Tasks: Fix OCR Track Translation

## 1. Modify Translation Service

- [x] 1.1 Add processing track detection
  - File: `backend/app/services/translation_service.py`
  - Location: `translate_document` method
  - Read `metadata.processing_track` from result JSON
  - Pass track type to extraction method

- [x] 1.2 Create helper to load raw OCR regions
  - File: `backend/app/services/translation_service.py`
  - Function: `_load_raw_ocr_regions(result_dir, task_id, page_num)`
  - Pattern: `{task_id}_*_page_{page_num}_raw_ocr_regions.json`
  - Return: List of text regions with index and content

- [x] 1.3 Modify extract_translatable_elements for OCR Track
  - File: `backend/app/services/translation_service.py`
  - Added: `extract_translatable_elements_ocr_track` method
  - Added parameters: `result_dir: Path`, `task_id: str`
  - For OCR Track: Extract from raw_ocr_regions.json
  - For Direct Track: Keep existing element-based extraction

- [x] 1.4 Update translation result format
  - File: `backend/app/services/translation_service.py`
  - Location: `build_translation_result` method
  - Added `processing_track` parameter
  - For OCR Track: Output `raw_ocr_translations` field
  - Structure: `[{"page": 1, "index": 0, "original": "...", "translated": "..."}]`

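The loading and output steps in 1.2 and 1.4 could look roughly like the sketch below. It is not the project's actual code: the function names without a leading underscore, the `translate` callback, and the assumption that each raw OCR region is a JSON object with a `text` key are all hypothetical. Only the file-name pattern and the output record shape come from the tasks above.

```python
import json
from pathlib import Path
from typing import Any, Callable


def load_raw_ocr_regions(result_dir: Path, task_id: str, page_num: int) -> list[dict[str, Any]]:
    """Load the text regions of one page, keeping each region's original list index."""
    # Pattern taken from task 1.2.
    pattern = f"{task_id}_*_page_{page_num}_raw_ocr_regions.json"
    matches = sorted(result_dir.glob(pattern))
    if not matches:
        return []
    regions = json.loads(matches[0].read_text(encoding="utf-8"))
    # Assumption: the file holds a JSON list of objects with a "text" key.
    # The index is assigned before filtering so it still points into the raw list.
    return [
        {"index": i, "content": region.get("text", "")}
        for i, region in enumerate(regions)
        if region.get("text", "").strip()
    ]


def build_raw_ocr_translations(
    regions_by_page: dict[int, list[dict[str, Any]]],
    translate: Callable[[str], str],
) -> list[dict[str, Any]]:
    """Produce records in the shape listed under task 1.4."""
    return [
        {
            "page": page,
            "index": region["index"],
            "original": region["content"],
            "translated": translate(region["content"]),
        }
        for page in sorted(regions_by_page)
        for region in regions_by_page[page]
    ]
```

Keeping the raw list index in every record is what lets the PDF generator later match a translation back to its region by `(page, index)`.
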
## 2. Modify PDF Generation

- [x] 2.1 Update generate_translated_pdf for OCR Track
  - File: `backend/app/services/pdf_generator_service.py`
  - Detect `processing_track` and `raw_ocr_translations` from translation JSON
  - For OCR Track: Call `_generate_translated_pdf_ocr_track`
  - For Direct Track: Continue using `apply_translations` (element-based)

- [x] 2.2 Create helper to apply raw OCR translations
  - File: `backend/app/services/pdf_generator_service.py`
  - Function: `_generate_translated_pdf_ocr_track`
  - Build translation lookup: `{(page, index): translated_text}`
  - Load raw OCR regions, sort by Y coordinate
  - Render translated text with original fallback

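The lookup, sort, and fallback steps from 2.2 can be sketched as follows. The helper names are hypothetical, and the region layout is an assumption: each region is taken to have a `text` key and a `bbox` of the form `[x0, y0, x1, y1]` with Y growing downward. Actual rendering is left out.

```python
from typing import Any


def build_translation_lookup(raw_ocr_translations: list[dict[str, Any]]) -> dict[tuple[int, int], str]:
    """Map (page, index) to the translated text, as described in task 2.2."""
    return {
        (item["page"], item["index"]): item["translated"]
        for item in raw_ocr_translations
    }


def resolve_page_text(
    page: int,
    regions: list[dict[str, Any]],
    lookup: dict[tuple[int, int], str],
) -> list[str]:
    """Return the strings to render for one page, ordered top to bottom."""
    # Pair each region with its raw index before sorting, so the lookup key
    # still matches the index recorded at translation time.
    indexed = list(enumerate(regions))
    # Assumption: bbox[1] is the top Y coordinate of the region.
    indexed.sort(key=lambda pair: pair[1]["bbox"][1])
    # Fall back to the original text when no (or an empty) translation exists.
    return [lookup.get((page, idx)) or region["text"] for idx, region in indexed]
```

Sorting after the index is captured matters: sorting first and enumerating afterwards would silently pair translations with the wrong regions.
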
## 3. Additional Fixes

- [x] 3.1 Add page_number to TranslatedItem
  - File: `backend/app/schemas/translation.py`
  - Added `page_number: int = 1` to TranslatedItem dataclass
  - Updated `translate_batch` and `translate_item` to pass page_number

- [x] 3.2 Update API endpoint validation
  - File: `backend/app/routers/translate.py`
  - Check for both `translations` (Direct Track) and `raw_ocr_translations` (OCR Track)

- [x] 3.3 Filter text overlapping with images
  - File: `backend/app/services/pdf_generator_service.py`
  - Added `_collect_exclusion_zones`, `_is_region_overlapping_exclusion`, `_filter_regions_by_exclusion`
  - Applied filtering in `generate_reflow_pdf` and `_generate_translated_pdf_ocr_track`

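One way the exclusion-zone filter from 3.3 might work is shown below. This is a sketch under assumptions, not the code in `pdf_generator_service.py`: the axis-aligned `(x0, y0, x1, y1)` box format, the `bbox` key, and the 0.5 overlap threshold are all illustrative choices.

```python
from typing import Any

Box = tuple[float, float, float, float]  # assumed layout: (x0, y0, x1, y1)


def overlap_ratio(region: Box, zone: Box) -> float:
    """Fraction of the region's area that lies inside the zone."""
    inter_w = min(region[2], zone[2]) - max(region[0], zone[0])
    inter_h = min(region[3], zone[3]) - max(region[1], zone[1])
    if inter_w <= 0 or inter_h <= 0:
        return 0.0
    area = (region[2] - region[0]) * (region[3] - region[1])
    return (inter_w * inter_h) / area if area > 0 else 0.0


def is_region_overlapping_exclusion(region: Box, zones: list[Box], threshold: float = 0.5) -> bool:
    """True when at least `threshold` of the region is covered by some zone."""
    return any(overlap_ratio(region, zone) >= threshold for zone in zones)


def filter_regions_by_exclusion(
    regions: list[dict[str, Any]],
    zones: list[Box],
    threshold: float = 0.5,
) -> list[dict[str, Any]]:
    """Drop text regions that sit mostly inside an image zone."""
    return [
        region
        for region in regions
        if not is_region_overlapping_exclusion(tuple(region["bbox"]), zones, threshold)
    ]
```

Measuring overlap relative to the text region's own area, rather than the image's, means a small caption that merely touches a large image edge is kept while text fully embedded in the image is removed.
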
## 4. Testing

- [x] 4.1 Test OCR Track translation
  - Test with: `f8265449-6cb7-425d-a213-5d2e1af73955`
  - Verify: All 59 text blocks are sent for translation
  - Verify: Translation JSON contains `raw_ocr_translations`

- [x] 4.2 Test OCR Track translated PDF
  - Generate translated reflow PDF
  - Verify: All translated text blocks appear correctly
  - Verify: Text inside images (like EWsenel) is filtered out

- [x] 4.3 Test Direct Track unchanged
  - Verify: Translation still uses element-based approach
  - Verify: No regression in Direct Track flow

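The payload checks in 3.2 and 4.1 could be scripted along these lines. The helper names are hypothetical, and checking `raw_ocr_translations` before `translations` is an assumption about precedence, not something the tasks specify.

```python
from typing import Any, Optional


def translation_payload_key(payload: dict[str, Any]) -> Optional[str]:
    """Return which translation field a payload carries, or None if neither is present."""
    # Assumption: a non-empty `raw_ocr_translations` marks an OCR Track result,
    # otherwise a non-empty `translations` marks a Direct Track result.
    for key in ("raw_ocr_translations", "translations"):
        if payload.get(key):
            return key
    return None


def count_translated_blocks(payload: dict[str, Any]) -> int:
    """Number of entries under whichever translation field is present."""
    key = translation_payload_key(payload)
    return len(payload[key]) if key else 0
```

For the test document in 4.1, a check such as `count_translated_blocks(payload) == 59` would confirm that every text block reached the translation step.
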