egg/OCR

egg 1f18010040 fix: OCR Track reflow PDF and translation with image text filtering

- Add OCR Track support for reflow PDF generation using raw_ocr_regions.json
- Add OCR Track translation extraction from raw_ocr_regions instead of elements
- Add raw_ocr_translations output format for OCR Track documents
- Add exclusion zone filtering to remove text overlapping with images
- Update API validation to accept both translations and raw_ocr_translations
- Add page_number field to TranslatedItem for proper tracking

🤖 Generated with [Claude Code](https://claude.com/claude-code)

Co-Authored-By: Claude Opus 4.5 <noreply@anthropic.com>

2025-12-12 11:02:35 +08:00

Change: Fix OCR Track Translation

Why

OCR Track translation is missing most content because:

- The translation service (extract_translatable_elements) only processes elements from scan_result.json
- OCR Track tables have content: "" (an empty string), so there is no content.cells data
- All table text exists in raw_ocr_regions.json (59 text blocks), but the translation service ignores it
- Result: only 6 text elements are translated, versus 59 raw OCR regions available

Current Data Flow (OCR Track):

scan_result.json (10 elements, 6 text, 2 empty tables)
  → Translation extracts 6 text items
  → 53 text blocks in tables are NOT translated

Expected Data Flow (OCR Track):

raw_ocr_regions.json (59 text blocks)
  → Translation extracts ALL 59 text items
  → Complete translation coverage

What Changes

1. Translation Service Enhancement

Modify translate_document in translation_service.py to:

- Detect processing track from result JSON metadata
- For OCR Track: Load and translate raw_ocr_regions.json instead of elements
- For Direct Track: Continue using elements with content.cells (already works)
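
The branching described above could be sketched roughly as follows. This is a minimal sketch under stated assumptions, not the actual code in translation_service.py: the metadata key `processing_track`, the track value `"ocr"`, the `text` field on each raw OCR region, and the helper name `select_translation_source` are all hypothetical. The Direct Track branch is simplified to plain string content and does not walk content.cells.

```python
import json
from pathlib import Path
from typing import Any


def select_translation_source(
    result: dict[str, Any], result_dir: Path
) -> tuple[str, list[tuple[int, str]]]:
    """Return (source_name, [(index, text), ...]) chosen by processing track."""
    # Assumption: the track is recorded under metadata["processing_track"].
    track = result.get("metadata", {}).get("processing_track", "direct")
    raw_path = result_dir / "raw_ocr_regions.json"

    if track == "ocr" and raw_path.exists():
        # Assumption: the file is a JSON list of regions with a "text" field.
        regions = json.loads(raw_path.read_text(encoding="utf-8"))
        return "raw_ocr_regions", [
            (i, region["text"])
            for i, region in enumerate(regions)
            if (region.get("text") or "").strip()
        ]

    # Direct Track (or missing raw OCR file): keep using elements.
    # Simplification: only plain string content is collected here.
    elements = result.get("elements", [])
    return "elements", [
        (i, el["content"])
        for i, el in enumerate(elements)
        if isinstance(el.get("content"), str) and el["content"].strip()
    ]
```

Keeping the original list position alongside each text means blank regions can be skipped without losing the mapping back to raw_ocr_regions.json.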

2. Translation Result Format for OCR Track

Add new field raw_ocr_translations to translation JSON for OCR Track:

{
  "translations": { ... },  // element-based (for Direct Track)
  "raw_ocr_translations": [  // NEW: for OCR Track
    {
      "index": 0,
      "original": "华天科技（宝鸡）有限公司",
      "translated": "Huatian Technology (Baoji) Co., Ltd."
    },
    ...
  ]
}
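
One way to assemble entries in this shape is sketched below. The function name `build_raw_ocr_translations` and the `translate` callable are hypothetical stand-ins for whatever the translation service actually calls; only the three fields shown in the format above are produced.

```python
from typing import Callable


def build_raw_ocr_translations(
    indexed_texts: list[tuple[int, str]],
    translate: Callable[[str], str],
) -> list[dict]:
    """Pair each raw OCR text with its translation, keeping the region index."""
    return [
        {"index": index, "original": text, "translated": translate(text)}
        for index, text in indexed_texts
    ]


def build_translation_json(
    indexed_texts: list[tuple[int, str]],
    translate: Callable[[str], str],
) -> dict:
    """Return a translation JSON body for an OCR Track document."""
    return {
        # Element-based translations stay present so existing readers still work.
        "translations": {},
        "raw_ocr_translations": build_raw_ocr_translations(indexed_texts, translate),
    }
```

For example, `build_translation_json([(0, "abc")], str.upper)` yields one entry with `index` 0, `original` "abc", and `translated` "ABC".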

3. Translated PDF Generation

Modify generate_translated_pdf to use raw_ocr_translations when available for OCR Track documents.
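
The selection step could look something like the sketch below. It assumes that `index` refers to a region's position in raw_ocr_regions.json and that each region carries a `text` field; the helper name `resolve_region_texts` is hypothetical, and falling back to the original text for untranslated regions is an illustrative choice rather than confirmed behavior of generate_translated_pdf.

```python
from typing import Any


def resolve_region_texts(
    regions: list[dict[str, Any]], translation_json: dict[str, Any]
) -> list[str]:
    """Return the text to render for each raw OCR region, in region order."""
    raw = translation_json.get("raw_ocr_translations") or []
    # Assumption: "index" is the region's position in raw_ocr_regions.json.
    by_index = {item["index"]: item["translated"] for item in raw}
    # Regions without a translation keep their original text.
    return [
        by_index.get(i, region.get("text", ""))
        for i, region in enumerate(regions)
    ]
```

When `raw_ocr_translations` is absent, as in translation files produced before this change, every region simply keeps its original text.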

Impact

- Affected files:
  - backend/app/services/translation_service.py - extraction and translation logic
  - backend/app/services/pdf_generator_service.py - translated PDF rendering
- User experience: OCR Track translations will include ALL text content
- API: Translation JSON format extended (backward compatible)

Migration

- No data migration required
- Existing translations continue to work (Direct Track unaffected)
- Re-translation needed for OCR Track documents to get full coverage
