fix: OCR Track reflow PDF and translation with image text filtering

- Add OCR Track support for reflow PDF generation using raw_ocr_regions.json
- Add OCR Track translation extraction from raw_ocr_regions instead of elements
- Add raw_ocr_translations output format for OCR Track documents
- Add exclusion zone filtering to remove text overlapping with images
- Update API validation to accept both translations and raw_ocr_translations
- Add page_number field to TranslatedItem for proper tracking

🤖 Generated with [Claude Code](https://claude.com/claude-code)

Co-Authored-By: Claude Opus 4.5 <noreply@anthropic.com>
egg
2025-12-12 11:02:35 +08:00
parent 24253ac15e
commit 1f18010040
11 changed files with 1040 additions and 149 deletions


@@ -0,0 +1,70 @@
# Change: Fix OCR Track Translation
## Why
OCR Track translation is missing most content because:
1. The translation service (`extract_translatable_elements`) only processes elements from `scan_result.json`.
2. OCR Track tables have `content: ""` (empty string), so there is no `content.cells` data to extract.
3. All table text exists in `raw_ocr_regions.json` (59 text blocks), but the translation service ignores it.
4. Result: only 6 text elements are translated, out of 59 raw OCR regions available.
**Current Data Flow (OCR Track):**
```
scan_result.json (10 elements, 6 text, 2 empty tables)
→ Translation extracts 6 text items
→ 53 text blocks in tables are NOT translated
```
**Expected Data Flow (OCR Track):**
```
raw_ocr_regions.json (59 text blocks)
→ Translation extracts ALL 59 text items
→ Complete translation coverage
```
## What Changes
### 1. Translation Service Enhancement
Modify `translate_document` in `translation_service.py` to:
1. **Detect processing track** from result JSON metadata
2. **For OCR Track**: Load and translate `raw_ocr_regions.json` instead of elements
3. **For Direct Track**: Continue using elements with `content.cells` (already works)
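The track detection step above could look roughly like the following. This is a minimal sketch, not the code in `translation_service.py`: the function name `detect_processing_track` is hypothetical, and treating an unrecognised value as Direct Track is an assumption (the spec only says that a missing track means Direct Track).

```python
from typing import Any, Mapping


def detect_processing_track(result: Mapping[str, Any]) -> str:
    """Return "ocr" or "direct" for a parsed result JSON (hypothetical helper)."""
    # `metadata` may be missing or null in older result files.
    metadata = result.get("metadata") or {}
    track = metadata.get("processing_track")
    # Assumption: anything other than the exact string "ocr" is handled
    # as Direct Track, which also covers "no track specified".
    return "ocr" if track == "ocr" else "direct"
```

Defaulting to Direct Track keeps existing result files on the element-based path, which is the behaviour this change is meant to leave untouched.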
### 2. Translation Result Format for OCR Track
Add new field `raw_ocr_translations` to translation JSON for OCR Track:
```json
{
  "translations": { ... },           // element-based (for Direct Track)
  "raw_ocr_translations": [          // NEW: for OCR Track
    {
      "index": 0,
      "original": "华天科技(宝鸡)有限公司",
      "translated": "Huatian Technology (Baoji) Co., Ltd."
    },
    ...
  ]
}
```
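As an illustration of how such a payload might be assembled, here is a small sketch. The function name `build_ocr_track_result` is hypothetical, and it assumes the originals and their translations arrive as two lists in the same order; it emits only the three fields shown in the format above.

```python
from typing import Any, Dict, List


def build_ocr_track_result(
    originals: List[str], translated_texts: List[str]
) -> Dict[str, Any]:
    """Pair each raw OCR text block with its translation (hypothetical helper)."""
    if len(originals) != len(translated_texts):
        raise ValueError("originals and translated_texts must have the same length")
    raw_ocr_translations = [
        {"index": index, "original": original, "translated": translated}
        for index, (original, translated) in enumerate(
            zip(originals, translated_texts)
        )
    ]
    # `translations` stays present (and empty here) so element-based readers
    # of the format still find the key they expect.
    return {"translations": {}, "raw_ocr_translations": raw_ocr_translations}
```

When serialising with `json.dumps`, passing `ensure_ascii=False` keeps the original CJK text readable in the output file.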
### 3. Translated PDF Generation
Modify `generate_translated_pdf` to use `raw_ocr_translations` when available for OCR Track documents.
## Impact
- **Affected files**:
  - `backend/app/services/translation_service.py` - extraction and translation logic
  - `backend/app/services/pdf_generator_service.py` - translated PDF rendering
- **User experience**: OCR Track translations will include ALL text content
- **API**: Translation JSON format extended (backward compatible)
## Migration
- No data migration required
- Existing translations continue to work (Direct Track unaffected)
- Re-translation needed for OCR Track documents to get full coverage


@@ -0,0 +1,56 @@
# translation Specification Delta
## MODIFIED Requirements
### Requirement: Translation Content Extraction
The translation service SHALL extract content based on processing track type.
#### Scenario: OCR Track translation extraction
- **GIVEN** a document processed with OCR Track
- **AND** the result JSON has `metadata.processing_track = "ocr"`
- **WHEN** translation service extracts translatable content
- **THEN** it SHALL load `raw_ocr_regions.json` for each page
- **AND** it SHALL extract all text blocks from raw OCR regions
- **AND** it SHALL NOT rely on `content.cells` from table elements
#### Scenario: Direct Track translation extraction (unchanged)
- **GIVEN** a document processed with Direct Track
- **AND** the result JSON has `metadata.processing_track = "direct"` or no track specified
- **WHEN** translation service extracts translatable content
- **THEN** it SHALL extract from `pages[].elements[]` in result JSON
- **AND** it SHALL extract table cell content from `content.cells`
### Requirement: Translation Result Format
The translation result JSON SHALL support both element-based and raw OCR translations.
#### Scenario: OCR Track translation result format
- **GIVEN** an OCR Track document has been translated
- **WHEN** translation result is saved
- **THEN** the JSON SHALL include `raw_ocr_translations` array
- **AND** each item SHALL have `index`, `original`, and `translated` fields
- **AND** the `translations` object MAY be empty or contain header text translations
#### Scenario: Direct Track translation result format (unchanged)
- **GIVEN** a Direct Track document has been translated
- **WHEN** translation result is saved
- **THEN** the JSON SHALL use `translations` object mapping element_id to translated text
- **AND** `raw_ocr_translations` field SHALL NOT be present
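A consumer of this format has to accept either field. The check below is only a sketch of that idea: the function name `has_translation_content` is hypothetical, and treating an empty `translations` object or an empty `raw_ocr_translations` array as "no content" is an assumption, not something the requirement states.

```python
from typing import Any, Mapping


def has_translation_content(payload: Mapping[str, Any]) -> bool:
    """True if the translation JSON carries usable translations (hypothetical)."""
    # Direct Track: non-empty mapping of element_id -> translated text.
    element_based = bool(payload.get("translations"))
    # OCR Track: non-empty list of {index, original, translated} items.
    raw_ocr_based = bool(payload.get("raw_ocr_translations"))
    return element_based or raw_ocr_based
```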
### Requirement: Translated PDF Generation
The translated PDF generation SHALL use appropriate translation source based on processing track.
#### Scenario: OCR Track translated PDF generation
- **GIVEN** an OCR Track document with translations
- **AND** the translation JSON contains `raw_ocr_translations`
- **WHEN** generating translated reflow PDF
- **THEN** it SHALL apply translations from `raw_ocr_translations` by index
- **AND** it SHALL render all translated text blocks in reading order
#### Scenario: Direct Track translated PDF generation (unchanged)
- **GIVEN** a Direct Track document with translations
- **WHEN** generating translated reflow PDF
- **THEN** it SHALL apply translations from `translations` object by element_id
- **AND** existing behavior SHALL be unchanged


@@ -0,0 +1,76 @@
# Tasks: Fix OCR Track Translation
## 1. Modify Translation Service
- [x] 1.1 Add processing track detection
  - File: `backend/app/services/translation_service.py`
  - Location: `translate_document` method
  - Read `metadata.processing_track` from result JSON
  - Pass track type to extraction method
- [x] 1.2 Create helper to load raw OCR regions
  - File: `backend/app/services/translation_service.py`
  - Function: `_load_raw_ocr_regions(result_dir, task_id, page_num)`
  - Pattern: `{task_id}_*_page_{page_num}_raw_ocr_regions.json`
  - Return: List of text regions with index and content
- [x] 1.3 Modify extract_translatable_elements for OCR Track
  - File: `backend/app/services/translation_service.py`
  - Added: `extract_translatable_elements_ocr_track` method
  - Added parameters: `result_dir: Path`, `task_id: str`
  - For OCR Track: Extract from raw_ocr_regions.json
  - For Direct Track: Keep existing element-based extraction
- [x] 1.4 Update translation result format
  - File: `backend/app/services/translation_service.py`
  - Location: `build_translation_result` method
  - Added `processing_track` parameter
  - For OCR Track: Output `raw_ocr_translations` field
  - Structure: `[{"page": 1, "index": 0, "original": "...", "translated": "..."}]`
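The loader described in task 1.2 might be sketched as follows. This is not the actual `_load_raw_ocr_regions` implementation: the file schema is an assumption (a JSON array of objects with a `text` key), as is skipping blank regions and taking the first match when several files fit the pattern.

```python
import json
from pathlib import Path
from typing import Any, Dict, List


def load_raw_ocr_regions(
    result_dir: Path, task_id: str, page_num: int
) -> List[Dict[str, Any]]:
    """Load text regions for one page (sketch; file schema is assumed)."""
    pattern = f"{task_id}_*_page_{page_num}_raw_ocr_regions.json"
    matches = sorted(result_dir.glob(pattern))
    if not matches:
        return []  # no raw OCR output for this page
    with matches[0].open(encoding="utf-8") as fh:
        raw = json.load(fh)
    regions = []
    for index, region in enumerate(raw):
        # Assumption: each region stores its recognised text under "text".
        text = str(region.get("text", "")).strip()
        if text:
            # `index` is the position in the file, so it stays stable even
            # though blank regions are skipped.
            regions.append({"page": page_num, "index": index, "content": text})
    return regions
```

Keeping the file position as `index` means a later stage can match translations back to regions without re-deriving the order.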
## 2. Modify PDF Generation
- [x] 2.1 Update generate_translated_pdf for OCR Track
  - File: `backend/app/services/pdf_generator_service.py`
  - Detect `processing_track` and `raw_ocr_translations` from translation JSON
  - For OCR Track: Call `_generate_translated_pdf_ocr_track`
  - For Direct Track: Continue using `apply_translations` (element-based)
- [x] 2.2 Create helper to apply raw OCR translations
  - File: `backend/app/services/pdf_generator_service.py`
  - Function: `_generate_translated_pdf_ocr_track`
  - Build translation lookup: `{(page, index): translated_text}`
  - Load raw OCR regions, sort by Y coordinate
  - Render translated text with original fallback
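The lookup, sort, and fallback steps in task 2.2 can be sketched like this. The function name `apply_raw_ocr_translations` is hypothetical, and the region shape (`text` plus a `bbox` of `[x0, y0, x1, y1]` with Y growing downward) is an assumption; the real `_generate_translated_pdf_ocr_track` also handles rendering, which is left out here.

```python
from typing import Any, Dict, List, Tuple


def apply_raw_ocr_translations(
    regions: List[Dict[str, Any]],
    raw_ocr_translations: List[Dict[str, Any]],
    page: int,
) -> List[str]:
    """Return the text to render for one page, top to bottom (sketch)."""
    # Build the {(page, index): translated_text} lookup. Items without a
    # "page" key are assumed to belong to page 1.
    lookup: Dict[Tuple[int, int], str] = {
        (item.get("page", 1), item["index"]): item["translated"]
        for item in raw_ocr_translations
    }
    # Pair each region with its file position before sorting, because the
    # translation index refers to the original order, not the sorted one.
    indexed = list(enumerate(regions))
    indexed.sort(key=lambda pair: pair[1]["bbox"][1])  # top edge (assumed y0)
    # Fall back to the original text when a translation is missing or empty.
    return [lookup.get((page, idx)) or region["text"] for idx, region in indexed]
```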
## 3. Additional Fixes
- [x] 3.1 Add page_number to TranslatedItem
  - File: `backend/app/schemas/translation.py`
  - Added `page_number: int = 1` to TranslatedItem dataclass
  - Updated `translate_batch` and `translate_item` to pass page_number
- [x] 3.2 Update API endpoint validation
  - File: `backend/app/routers/translate.py`
  - Check for both `translations` (Direct Track) and `raw_ocr_translations` (OCR Track)
- [x] 3.3 Filter text overlapping with images
  - File: `backend/app/services/pdf_generator_service.py`
  - Added `_collect_exclusion_zones`, `_is_region_overlapping_exclusion`, `_filter_regions_by_exclusion`
  - Applied filtering in `generate_reflow_pdf` and `_generate_translated_pdf_ocr_track`
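The exclusion-zone filtering in task 3.3 could work along these lines. This is a sketch under stated assumptions, not the code added to `pdf_generator_service.py`: the function names are hypothetical, boxes are assumed to be `[x0, y0, x1, y1]`, and the rule "drop a region when at least half of its area lies inside an image zone" (`threshold=0.5`) is an illustrative choice.

```python
from typing import Any, Dict, List, Sequence

Box = Sequence[float]  # assumed layout: [x0, y0, x1, y1]


def overlap_ratio(region: Box, zone: Box) -> float:
    """Fraction of the region's area that lies inside the zone."""
    rx0, ry0, rx1, ry1 = region
    zx0, zy0, zx1, zy1 = zone
    width = min(rx1, zx1) - max(rx0, zx0)
    height = min(ry1, zy1) - max(ry0, zy0)
    area = (rx1 - rx0) * (ry1 - ry0)
    if width <= 0 or height <= 0 or area <= 0:
        return 0.0  # no intersection, or a degenerate region box
    return (width * height) / area


def filter_regions_by_exclusion(
    regions: List[Dict[str, Any]],
    zones: List[Box],
    threshold: float = 0.5,  # assumed cut-off, not taken from the source
) -> List[Dict[str, Any]]:
    """Keep only regions that do not substantially overlap any image zone."""
    return [
        region
        for region in regions
        if not any(overlap_ratio(region["bbox"], zone) >= threshold for zone in zones)
    ]
```

Measuring overlap relative to the text region rather than the image keeps small labels inside large images from slipping through, while text that merely touches an image edge is preserved.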
## 4. Testing
- [x] 4.1 Test OCR Track translation
  - Test with: `f8265449-6cb7-425d-a213-5d2e1af73955`
  - Verify: All 59 text blocks are sent for translation
  - Verify: Translation JSON contains `raw_ocr_translations`
- [x] 4.2 Test OCR Track translated PDF
  - Generate translated reflow PDF
  - Verify: All translated text blocks appear correctly
  - Verify: Text inside images (like EWsenel) is filtered out
- [x] 4.3 Test Direct Track unchanged
  - Verify: Translation still uses element-based approach
  - Verify: No regression in Direct Track flow