OCR/openspec/changes/archive/2025-12-12-fix-ocr-track-translation/proposal.md
egg 1f18010040 fix: OCR Track reflow PDF and translation with image text filtering
- Add OCR Track support for reflow PDF generation using raw_ocr_regions.json
- Add OCR Track translation extraction from raw_ocr_regions instead of elements
- Add raw_ocr_translations output format for OCR Track documents
- Add exclusion zone filtering to remove text overlapping with images
- Update API validation to accept both translations and raw_ocr_translations
- Add page_number field to TranslatedItem for proper tracking

🤖 Generated with [Claude Code](https://claude.com/claude-code)

Co-Authored-By: Claude Opus 4.5 <noreply@anthropic.com>
2025-12-12 11:02:35 +08:00


Change: Fix OCR Track Translation

Why

OCR Track translation misses most of a document's content because:

  1. The translation service (extract_translatable_elements) only processes elements from scan_result.json
  2. OCR Track tables have content: "" (empty string), with no content.cells data
  3. All table text exists in raw_ocr_regions.json (59 text blocks), but the translation service ignores it
  4. Result: only 6 text elements are translated, versus 59 raw OCR regions available

Current Data Flow (OCR Track):

scan_result.json (10 elements, 6 text, 2 empty tables)
  → Translation extracts 6 text items
  → 53 text blocks in tables are NOT translated

Expected Data Flow (OCR Track):

raw_ocr_regions.json (59 text blocks)
  → Translation extracts ALL 59 text items
  → Complete translation coverage

What Changes

1. Translation Service Enhancement

Modify translate_document in translation_service.py to:

  1. Detect processing track from result JSON metadata
  2. For OCR Track: Load and translate raw_ocr_regions.json instead of elements
  3. For Direct Track: Continue using elements with content.cells (already works)
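
The three steps above can be sketched roughly as follows. This is a minimal illustration, not the actual code in translation_service.py: the function name select_translatable_items, the metadata key "processing_track", the track value "ocr", and the per-region "text" field are all assumptions made for the example.

```python
import json
from pathlib import Path
from typing import Any


def select_translatable_items(
    result_path: Path, raw_regions_path: Path
) -> list[dict[str, Any]]:
    """Pick the translation source based on the processing track (hypothetical helper)."""
    result = json.loads(result_path.read_text(encoding="utf-8"))
    # Assumption: the track is stored under metadata["processing_track"].
    track = result.get("metadata", {}).get("processing_track", "direct")

    if track == "ocr" and raw_regions_path.exists():
        regions = json.loads(raw_regions_path.read_text(encoding="utf-8"))
        # Assumption: raw_ocr_regions.json is a list of objects with a "text" field.
        # The index is the region's position in the file, so blank regions are
        # skipped without renumbering the rest.
        return [
            {"index": i, "text": region["text"]}
            for i, region in enumerate(regions)
            if (region.get("text") or "").strip()
        ]

    # Direct Track: keep using elements (simplified here to plain string content).
    return [
        {"element_id": el.get("element_id"), "text": el["content"]}
        for el in result.get("elements", [])
        if isinstance(el.get("content"), str) and el["content"].strip()
    ]
```

Keeping the original region index, rather than a running counter over non-empty regions, lets a later stage map each translation back to its position in raw_ocr_regions.json.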

2. Translation Result Format for OCR Track

Add a new field, raw_ocr_translations, to the translation JSON for OCR Track documents:

{
  "translations": { ... },  // element-based (for Direct Track)
  "raw_ocr_translations": [  // NEW: for OCR Track
    {
      "index": 0,
      "original": "华天科技(宝鸡)有限公司",
      "translated": "Huatian Technology (Baoji) Co., Ltd."
    },
    ...
  ]
}
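
As a rough sketch of how a result in this shape might be assembled, assuming the OCR text blocks are already available as a list of strings: the names RawOcrTranslation and build_translation_result are hypothetical, and the translate callable stands in for whatever translation backend the service uses.

```python
from typing import Any, Callable, TypedDict


class RawOcrTranslation(TypedDict):
    """One entry of raw_ocr_translations, mirroring the fields shown above."""
    index: int
    original: str
    translated: str


def build_translation_result(
    texts: list[str], translate: Callable[[str], str]
) -> dict[str, Any]:
    """Build the extended translation JSON for an OCR Track document (hypothetical helper)."""
    raw: list[RawOcrTranslation] = [
        {"index": i, "original": text, "translated": translate(text)}
        for i, text in enumerate(texts)
    ]
    # "translations" stays present (empty here) so existing readers keep working.
    return {"translations": {}, "raw_ocr_translations": raw}
```

For example, build_translation_result(["abc"], str.upper) yields a single entry with index 0, original "abc", and translated "ABC".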

3. Translated PDF Generation

Modify generate_translated_pdf to use raw_ocr_translations when available for OCR Track documents.
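
One way the lookup inside the PDF step could work is sketched below. It is illustrative only: resolve_region_texts is a hypothetical helper, not part of pdf_generator_service.py, and both the per-region "text" field and the fallback to the original text for untranslated regions are assumptions.

```python
from typing import Any


def resolve_region_texts(
    regions: list[dict[str, Any]], translation_json: dict[str, Any]
) -> list[str]:
    """Return the text to render for each raw OCR region (hypothetical helper)."""
    raw = translation_json.get("raw_ocr_translations") or []
    # Map region index -> translated text for quick lookup.
    by_index = {item["index"]: item["translated"] for item in raw}
    # Assumption: regions without a translation keep their original text.
    return [
        by_index.get(i, region.get("text") or "")
        for i, region in enumerate(regions)
    ]
```

When raw_ocr_translations is absent, as in older translation files, every region falls back to its original text, so the caller can still use the element-based path unchanged.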

Impact

  • Affected files:
    • backend/app/services/translation_service.py - extraction and translation logic
    • backend/app/services/pdf_generator_service.py - translated PDF rendering
  • User experience: OCR Track translations will include ALL text content
  • API: Translation JSON format extended (backward compatible)

Migration

  • No data migration required
  • Existing translations continue to work (Direct Track unaffected)
  • OCR Track documents must be re-translated to get full coverage