OCR/openspec/changes/archive/2025-12-12-fix-ocr-track-translation/proposal.md

# Change: Fix OCR Track Translation

## Why

OCR Track translation misses most of a document's content because:

1. The translation service (`extract_translatable_elements`) only processes elements from `scan_result.json`.
2. OCR Track tables have `content: ""` (an empty string), so there is no `content.cells` data to extract.
3. All table text exists in `raw_ocr_regions.json` (59 text blocks), but the translation service ignores that file.
4. Result: only 6 text elements are translated, although 59 raw OCR regions are available.

**Current Data Flow (OCR Track):**
```
scan_result.json (10 elements, 6 text, 2 empty tables)
  → Translation extracts 6 text items
  → 53 text blocks in tables are NOT translated
```

**Expected Data Flow (OCR Track):**
```
raw_ocr_regions.json (59 text blocks)
  → Translation extracts ALL 59 text items
  → Complete translation coverage
```

## What Changes

### 1. Translation Service Enhancement

Modify `translate_document` in `translation_service.py` to:

1. **Detect processing track** from result JSON metadata
2. **For OCR Track**: Load and translate `raw_ocr_regions.json` instead of elements
3. **For Direct Track**: Continue using elements with `content.cells` (already works)
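
The branching described above could be sketched roughly as follows. This is an illustration under assumptions, not the actual service code: the metadata key `processing_track`, the value `"ocr"`, the flat-list layout of `raw_ocr_regions.json`, and the helper name `select_translation_source` are all hypothetical.

```python
import json
from pathlib import Path


def select_translation_source(result: dict, raw_regions_path: Path) -> tuple[str, list]:
    """Return (source_name, items) that the translation step should process."""
    # Assumption: the track is recorded under metadata.processing_track.
    track = result.get("metadata", {}).get("processing_track", "direct")

    if track == "ocr" and raw_regions_path.exists():
        # Assumption: raw_ocr_regions.json holds a flat JSON list of regions.
        regions = json.loads(raw_regions_path.read_text(encoding="utf-8"))
        return "raw_ocr_regions", regions

    # Direct Track, or OCR Track without a raw regions file:
    # keep the existing element-based path.
    return "elements", result.get("elements", [])
```

Falling back to the element-based path when the raw regions file is missing keeps older OCR Track results translatable, at the reduced coverage they have today.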

### 2. Translation Result Format for OCR Track

Add a new field, `raw_ocr_translations`, to the translation JSON for OCR Track documents:

```jsonc
{
  "translations": { ... },  // element-based (for Direct Track)
  "raw_ocr_translations": [  // NEW: for OCR Track
    {
      "index": 0,
      "original": "华天科技（宝鸡）有限公司",
      "translated": "Huatian Technology (Baoji) Co., Ltd."
    },
    ...
  ]
}
```
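
A minimal sketch of how entries in this shape might be produced. It assumes each raw region stores its text under a `text` key and that `index` is the region's position in `raw_ocr_regions.json`; neither is confirmed by this proposal. The `translate` callable stands in for whatever translation backend the service uses, and `build_raw_ocr_translations` is a hypothetical name.

```python
from typing import Callable


def build_raw_ocr_translations(
    regions: list[dict], translate: Callable[[str], str]
) -> list[dict]:
    """Build raw_ocr_translations entries from raw OCR regions."""
    entries = []
    for index, region in enumerate(regions):
        # Assumption: the recognized text lives under the "text" key.
        original = str(region.get("text", "")).strip()
        if not original:
            # Skip blank regions but keep `index` tied to the region's
            # position, so entries can be matched back to regions later.
            continue
        entries.append(
            {"index": index, "original": original, "translated": translate(original)}
        )
    return entries
```

Keeping `index` tied to the position in the raw regions list, rather than to the position in the output list, lets a consumer look up a region's translation without re-running the blank-text filter.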

### 3. Translated PDF Generation

Modify `generate_translated_pdf` to use `raw_ocr_translations` when available for OCR Track documents.
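
One way the renderer could pick the text for each region is sketched below. The helper name `resolve_region_texts` and the `text` key on raw regions are assumptions; only the `raw_ocr_translations` field and its `index`/`translated` keys come from the format above.

```python
def resolve_region_texts(regions: list[dict], translation_json: dict) -> list[str]:
    """Return the text to draw for each raw OCR region, in region order."""
    # Map region position -> translated text; empty when the field is absent
    # (for example, a translation file produced before this change).
    by_index = {
        entry["index"]: entry["translated"]
        for entry in translation_json.get("raw_ocr_translations", [])
    }
    # Assumption: untranslated regions fall back to their original text.
    return [
        by_index.get(i, str(region.get("text", "")))
        for i, region in enumerate(regions)
    ]
```

Because a missing `raw_ocr_translations` field yields an empty lookup, translation files written before this change still render, just with the original text in place.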

## Impact

- **Affected files**:
  - `backend/app/services/translation_service.py` - extraction and translation logic
  - `backend/app/services/pdf_generator_service.py` - translated PDF rendering
- **User experience**: OCR Track translations will include ALL text content
- **API**: Translation JSON format extended (backward compatible)

## Migration

- No data migration required
- Existing translations continue to work (Direct Track unaffected)
- OCR Track documents must be re-translated to get full coverage
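
As a rough illustration of the last point, a check like the following could flag existing OCR Track translations that predate the new field. The function name and the `"ocr"` track value are hypothetical.

```python
def needs_retranslation(track: str, translation_json: dict) -> bool:
    """True when an OCR Track translation lacks raw_ocr_translations."""
    # Direct Track translations are unaffected by this change.
    if track != "ocr":
        return False
    # A missing or empty list means only element-based output exists.
    return not translation_json.get("raw_ocr_translations")
```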