# Change: Fix OCR Track Translation

## Why

OCR Track translation misses most of a document's content because:

1. The translation service (`extract_translatable_elements`) only processes elements from `scan_result.json`.
2. OCR Track tables have `content: ""` (an empty string), with no `content.cells` data.
3. All table text exists in `raw_ocr_regions.json` (59 text blocks), but the translation service ignores it.
4. Result: only 6 text elements are translated, versus the 59 raw OCR regions available.

**Current Data Flow (OCR Track):**

```
scan_result.json (10 elements, 6 text, 2 empty tables)
  → Translation extracts 6 text items
  → 53 text blocks in tables are NOT translated
```

**Expected Data Flow (OCR Track):**

```
raw_ocr_regions.json (59 text blocks)
  → Translation extracts ALL 59 text items
  → Complete translation coverage
```

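
The coverage gap between the two data flows can be sketched as below. This is a minimal illustration, not the service's actual code: the helper names are hypothetical, and the `text` key on raw OCR regions and the string-valued `content` on elements are assumptions about the file layouts.

```python
from __future__ import annotations

from typing import Any


def extract_from_elements(elements: list[dict[str, Any]]) -> list[str]:
    """Element-based path: tables with content "" contribute nothing."""
    texts = []
    for element in elements:
        content = element.get("content")
        # Only non-empty string content is translatable on this path.
        if isinstance(content, str) and content.strip():
            texts.append(content)
    return texts


def extract_from_raw_regions(regions: list[dict[str, Any]]) -> list[str]:
    """Raw-region path: every non-blank OCR text block is picked up."""
    # ASSUMPTION: each region stores its recognized string under "text".
    return [r["text"] for r in regions if (r.get("text") or "").strip()]
```

With an empty-content table among the elements, the first helper returns only the standalone text items, while the second also returns the text blocks that sit inside the table.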
## What Changes

### 1. Translation Service Enhancement

Modify `translate_document` in `translation_service.py` to:

1. **Detect processing track** from result JSON metadata
2. **For OCR Track**: Load and translate `raw_ocr_regions.json` instead of elements
3. **For Direct Track**: Continue using elements with `content.cells` (already works)

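
The three steps above could be sketched roughly as follows. The metadata key `processing_track`, the values `"ocr"` and `"direct"`, and both function names are assumptions made for illustration; the real result JSON may name them differently.

```python
from __future__ import annotations

from typing import Any


def detect_track(result: dict[str, Any]) -> str:
    """Read the processing track from result metadata (assumed key name)."""
    metadata = result.get("metadata") or {}
    # ASSUMPTION: a missing track is treated as Direct Track.
    return str(metadata.get("processing_track", "direct")).lower()


def select_translation_source(
    result: dict[str, Any],
    raw_regions: list[dict[str, Any]] | None = None,
) -> tuple[str, list[Any]]:
    """Pick what to translate: raw OCR regions for OCR Track, else elements."""
    if detect_track(result) == "ocr" and raw_regions:
        return "raw_ocr_regions", raw_regions
    # Direct Track, or OCR Track without raw regions: keep the element path.
    return "elements", result.get("elements") or []
```

Falling back to elements when no raw regions are loaded keeps older OCR Track results translatable, at the cost of the reduced coverage described under Why.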
### 2. Translation Result Format for OCR Track

Add a new field, `raw_ocr_translations`, to the translation JSON for OCR Track documents:

```json
{
  "translations": { ... },           // element-based (for Direct Track)
  "raw_ocr_translations": [          // NEW: for OCR Track
    {
      "index": 0,
      "original": "华天科技(宝鸡)有限公司",
      "translated": "Huatian Technology (Baoji) Co., Ltd."
    },
    ...
  ]
}
```

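
One way to assemble this structure is sketched below. The function name is hypothetical, and pairing originals with translations by list position is an assumption; only the `index`, `original`, and `translated` keys come from the format shown above.

```python
from __future__ import annotations

from typing import Any


def build_translation_payload(
    originals: list[str],
    translated: list[str],
    element_translations: dict[str, Any] | None = None,
) -> dict[str, Any]:
    """Build translation JSON carrying both element-based and raw OCR results."""
    if len(originals) != len(translated):
        raise ValueError("originals and translated must have the same length")
    raw_items = [
        {"index": i, "original": src, "translated": dst}
        for i, (src, dst) in enumerate(zip(originals, translated))
    ]
    return {
        # The existing field stays in place so current readers keep working.
        "translations": element_translations or {},
        "raw_ocr_translations": raw_items,
    }
```

Keeping `index` equal to the region's position in `raw_ocr_regions.json` would let the PDF generator match each translation back to its region; that pairing is an assumption of this sketch.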
### 3. Translated PDF Generation

Modify `generate_translated_pdf` to use `raw_ocr_translations` when available for OCR Track documents.

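
The "when available" selection might look like the sketch below. `pick_translations` is a hypothetical helper, not part of `pdf_generator_service.py`, and it assumes `translations` is a flat mapping from element IDs to translated text.

```python
from __future__ import annotations

from typing import Any


def pick_translations(translation_json: dict[str, Any]) -> tuple[str, dict[Any, str]]:
    """Prefer raw OCR translations; otherwise use element-based ones."""
    raw_items = translation_json.get("raw_ocr_translations")
    if raw_items:
        # Key by region index so each region can look up its translated text.
        return "raw_ocr", {item["index"]: item["translated"] for item in raw_items}
    # Older translation files carry only "translations" and still work.
    return "elements", dict(translation_json.get("translations") or {})
```

A missing or empty `raw_ocr_translations` list routes the document through the existing element-based path, which is what keeps the format change backward compatible.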
## Impact

- **Affected files**:
  - `backend/app/services/translation_service.py` - extraction and translation logic
  - `backend/app/services/pdf_generator_service.py` - translated PDF rendering
- **User experience**: OCR Track translations will include ALL text content
- **API**: Translation JSON format extended (backward compatible)

## Migration

- No data migration required
- Existing translations continue to work (Direct Track unaffected)
- OCR Track documents must be re-translated to get full coverage