fix: OCR Track reflow PDF and translation with image text filtering

- Add OCR Track support for reflow PDF generation using raw_ocr_regions.json
- Add OCR Track translation extraction from raw_ocr_regions instead of elements
- Add raw_ocr_translations output format for OCR Track documents
- Add exclusion zone filtering to remove text overlapping with images
- Update API validation to accept both translations and raw_ocr_translations
- Add page_number field to TranslatedItem for proper tracking

🤖 Generated with [Claude Code](https://claude.com/claude-code)

Co-Authored-By: Claude Opus 4.5 <noreply@anthropic.com>
2025-12-12 11:02:35 +08:00
parent 24253ac15e
commit 1f18010040
11 changed files with 1040 additions and 149 deletions
--- a/openspec/changes/archive/2025-12-12-fix-ocr-track-translation/proposal.md
+++ b/openspec/changes/archive/2025-12-12-fix-ocr-track-translation/proposal.md
@@ -0,0 +1,70 @@
+# Change: Fix OCR Track Translation
+
+## Why
+
+OCR Track translation is missing most content because:
+
+1. Translation service (`extract_translatable_elements`) only processes elements from `scan_result.json`
+2. OCR Track tables have `content: ""` (empty string) - no `content.cells` data
+3. All table text exists in `raw_ocr_regions.json` (59 text blocks) but translation service ignores it
+4. Result: Only 6 text elements translated vs 59 raw OCR regions available
+
+**Current Data Flow (OCR Track):**
+```
+scan_result.json (10 elements, 6 text, 2 empty tables)
+  → Translation extracts 6 text items
+  → 53 text blocks in tables are NOT translated
+```
+
+**Expected Data Flow (OCR Track):**
+```
+raw_ocr_regions.json (59 text blocks)
+  → Translation extracts ALL 59 text items
+  → Complete translation coverage
+```
+
+## What Changes
+
+### 1. Translation Service Enhancement
+
+Modify `translate_document` in `translation_service.py` to:
+
+1. **Detect processing track** from result JSON metadata
+2. **For OCR Track**: Load and translate `raw_ocr_regions.json` instead of elements
+3. **For Direct Track**: Continue using elements with `content.cells` (already works)
+
+### 2. Translation Result Format for OCR Track
+
+Add new field `raw_ocr_translations` to translation JSON for OCR Track:
+
+```json
+{
+  "translations": { ... },  // element-based (for Direct Track)
+  "raw_ocr_translations": [  // NEW: for OCR Track
+    {
+      "index": 0,
+      "original": "华天科技（宝鸡）有限公司",
+      "translated": "Huatian Technology (Baoji) Co., Ltd."
+    },
+    ...
+  ]
+}
+```
+
+### 3. Translated PDF Generation
+
+Modify `generate_translated_pdf` to use `raw_ocr_translations` when available for OCR Track documents.
+
+## Impact
+
+- **Affected files**:
+  - `backend/app/services/translation_service.py` - extraction and translation logic
+  - `backend/app/services/pdf_generator_service.py` - translated PDF rendering
+- **User experience**: OCR Track translations will include ALL text content
+- **API**: Translation JSON format gains an optional `raw_ocr_translations` field; the existing `translations` field is unchanged (backward compatible)
+
+## Migration
+
+- No data migration required
+- Existing translations continue to work (Direct Track unaffected)
+- Existing OCR Track documents must be re-translated to get full coverage
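
Review note: the extraction path proposed under "Translation Service Enhancement" and the `raw_ocr_translations` format could be sketched roughly as below. This is a minimal illustration, not the code in `translation_service.py`. The metadata key `processing_track`, the region key `text`, and the helper names `detect_track`, `extract_ocr_items`, and `build_raw_ocr_translations` are all assumptions made for the example; only the output fields `index`, `original`, and `translated` come from the proposal.

```python
from typing import Any, Callable


def detect_track(result: dict[str, Any]) -> str:
    """Return the processing track recorded in the result JSON metadata.

    Assumption: the track lives at result["metadata"]["processing_track"];
    the real key name may differ. Falls back to "direct" when absent.
    """
    return result.get("metadata", {}).get("processing_track", "direct")


def extract_ocr_items(regions: list[dict[str, Any]]) -> list[dict[str, Any]]:
    """Collect translatable text from raw OCR regions.

    Assumption: each region carries its recognized string under "text".
    The index is the region's position in the raw list, so a translation
    can later be matched back to its region; blank regions are skipped.
    """
    items = []
    for index, region in enumerate(regions):
        text = (region.get("text") or "").strip()
        if text:
            items.append({"index": index, "original": text})
    return items


def build_raw_ocr_translations(
    items: list[dict[str, Any]],
    translate: Callable[[str], str],
) -> list[dict[str, Any]]:
    """Build the raw_ocr_translations list in the format shown in the proposal.

    `translate` is a hypothetical callable standing in for the real
    translation backend.
    """
    return [
        {
            "index": item["index"],
            "original": item["original"],
            "translated": translate(item["original"]),
        }
        for item in items
    ]
```

Keeping the original region position as `index`, rather than renumbering after blank regions are dropped, is one way to let the PDF generator look up a translation by region position; whether the actual change does this is not shown in this hunk.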