- Add OCR Track support for reflow PDF generation using raw_ocr_regions.json
- Add OCR Track translation extraction from raw_ocr_regions instead of elements
- Add raw_ocr_translations output format for OCR Track documents
- Add exclusion zone filtering to remove text overlapping with images
- Update API validation to accept both translations and raw_ocr_translations
- Add page_number field to TranslatedItem for proper tracking

🤖 Generated with [Claude Code](https://claude.com/claude-code)

Co-Authored-By: Claude Opus 4.5 <noreply@anthropic.com>
Change: Fix OCR Track Translation
Why
OCR Track translation is missing most content because:
- Translation service (extract_translatable_elements) only processes elements from scan_result.json
- OCR Track tables have content: "" (empty string) - no content.cells data
- All table text exists in raw_ocr_regions.json (59 text blocks) but translation service ignores it
- Result: Only 6 text elements translated vs 59 raw OCR regions available
Current Data Flow (OCR Track):
scan_result.json (10 elements, 6 text, 2 empty tables)
→ Translation extracts 6 text items
→ 53 text blocks in tables are NOT translated
Expected Data Flow (OCR Track):
raw_ocr_regions.json (59 text blocks)
→ Translation extracts ALL 59 text items
→ Complete translation coverage
What Changes
1. Translation Service Enhancement
Modify translate_document in translation_service.py to:
- Detect processing track from result JSON metadata
- For OCR Track: Load and translate raw_ocr_regions.json instead of elements
- For Direct Track: Continue using elements with content.cells (already works)
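The track-based selection described above could look roughly like the sketch below. It is illustrative only: the helper name select_translatable_items, the metadata key processing_track, the track values "ocr" and "direct", and the text and element_id fields are assumptions, not names confirmed by this proposal. Table cell handling for Direct Track is left out for brevity.

```python
from typing import Any, Dict, List, Optional


def select_translatable_items(
    result: Dict[str, Any],
    raw_ocr_regions: Optional[List[Dict[str, Any]]] = None,
) -> List[Dict[str, Any]]:
    """Pick the items to translate based on the processing track.

    Hypothetical helper: key names below are assumptions for illustration.
    """
    # Assumption: the track is recorded under metadata["processing_track"].
    track = result.get("metadata", {}).get("processing_track", "direct")

    if track == "ocr" and raw_ocr_regions:
        # OCR Track: every raw OCR region with non-blank text becomes an item.
        # "index" is the region's position in raw_ocr_regions.json.
        return [
            {"index": i, "text": region["text"]}
            for i, region in enumerate(raw_ocr_regions)
            if region.get("text", "").strip()
        ]

    # Direct Track (or no raw regions available): keep using elements.
    # Only plain string content is handled here; content.cells is omitted.
    items: List[Dict[str, Any]] = []
    for element in result.get("elements", []):
        content = element.get("content")
        if isinstance(content, str) and content.strip():
            items.append({"element_id": element.get("element_id"), "text": content})
    return items
```

Keeping the original region position as the index, even when blank regions are skipped, means a translation can later be matched back to its region without relying on list order.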
2. Translation Result Format for OCR Track
Add new field raw_ocr_translations to translation JSON for OCR Track:
{
  "translations": { ... },        // element-based (for Direct Track)
  "raw_ocr_translations": [       // NEW: for OCR Track
    {
      "index": 0,
      "original": "华天科技(宝鸡)有限公司",
      "translated": "Huatian Technology (Baoji) Co., Ltd."
    },
    ...
  ]
}
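As a rough illustration of how this structure might be assembled, the sketch below builds the raw_ocr_translations list from a sequence of region texts. The function names and the translate callable are hypothetical stand-ins for whatever the translation service actually uses; only the field names index, original, and translated come from the format above.

```python
from typing import Any, Callable, Dict, List, Optional


def build_raw_ocr_translations(
    texts: List[str],
    translate: Callable[[str], str],
) -> List[Dict[str, Any]]:
    """Build one entry per raw OCR text block (hypothetical helper)."""
    return [
        {"index": i, "original": text, "translated": translate(text)}
        for i, text in enumerate(texts)
    ]


def build_translation_json(
    element_translations: Dict[str, Any],
    raw_ocr_translations: Optional[List[Dict[str, Any]]] = None,
) -> Dict[str, Any]:
    """Combine both formats; the new field is added only when provided."""
    result: Dict[str, Any] = {"translations": element_translations}
    if raw_ocr_translations is not None:
        result["raw_ocr_translations"] = raw_ocr_translations
    return result
```

Omitting the new field for Direct Track output keeps existing files and readers unchanged, which is one way to satisfy the backward-compatibility goal.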
3. Translated PDF Generation
Modify generate_translated_pdf to use raw_ocr_translations when available for OCR Track documents.
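One possible shape for that lookup is sketched below. It is not the actual generate_translated_pdf logic: the helper name resolve_region_texts and the text field on each region are assumptions. The idea is to map each raw OCR region to its translated text by index and fall back to the original text when no translation exists.

```python
from typing import Any, Dict, List


def resolve_region_texts(
    raw_ocr_regions: List[Dict[str, Any]],
    translation_json: Dict[str, Any],
) -> List[str]:
    """Return the text to render for each raw OCR region, in region order."""
    # Older or Direct Track files have no raw_ocr_translations field.
    entries = translation_json.get("raw_ocr_translations") or []
    by_index = {entry["index"]: entry["translated"] for entry in entries}
    # Assumption: each region stores its recognized text under "text".
    return [
        by_index.get(i, region.get("text", ""))
        for i, region in enumerate(raw_ocr_regions)
    ]
```

With this fallback, a translation file produced before the change still yields a renderable result for OCR Track documents; untranslated regions simply keep their original text.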
Impact
- Affected files:
  - backend/app/services/translation_service.py - extraction and translation logic
  - backend/app/services/pdf_generator_service.py - translated PDF rendering
- User experience: OCR Track translations will include ALL text content
- API: Translation JSON format extended (backward compatible)
Migration
- No data migration required
- Existing translations continue to work (Direct Track unaffected)
- Re-translation needed for OCR Track documents to get full coverage
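To find the documents that need re-translation, a check along these lines might be enough. This is a minimal sketch: the function name and the "ocr" track value are assumptions, and how the track is obtained for a given document is left to the caller.

```python
from typing import Any, Dict


def needs_retranslation(translation_json: Dict[str, Any], track: str) -> bool:
    """True for OCR Track translations created before raw_ocr_translations existed."""
    # Direct Track translations are unaffected and never need re-running.
    if track != "ocr":
        return False
    # A missing or empty list means only element-based items were translated.
    return not translation_json.get("raw_ocr_translations")
```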