fix: OCR Track reflow PDF and translation with image text filtering

- Add OCR Track support for reflow PDF generation using raw_ocr_regions.json
- Add OCR Track translation extraction from raw_ocr_regions instead of elements
- Add raw_ocr_translations output format for OCR Track documents
- Add exclusion zone filtering to remove text overlapping with images
- Update API validation to accept both translations and raw_ocr_translations
- Add page_number field to TranslatedItem for proper tracking

🤖 Generated with [Claude Code](https://claude.com/claude-code)

Co-Authored-By: Claude Opus 4.5 <noreply@anthropic.com>
2025-12-12 11:02:35 +08:00
parent 24253ac15e
commit 1f18010040
11 changed files with 1040 additions and 149 deletions
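
The exclusion zone filtering named in the commit message is not part of the hunk shown here. As a rough illustration only, it might look something like the sketch below. Every name in it is hypothetical (`overlap_ratio`, `filter_regions`, the `bbox` key, and the 0.5 threshold), and the real layout of raw_ocr_regions.json may differ.

```python
from typing import Dict, List, Tuple

# Hypothetical box layout: (x0, y0, x1, y1) in page coordinates.
BBox = Tuple[float, float, float, float]


def overlap_ratio(text_box: BBox, zone: BBox) -> float:
    """Return the fraction of text_box's area that lies inside zone."""
    ix0 = max(text_box[0], zone[0])
    iy0 = max(text_box[1], zone[1])
    ix1 = min(text_box[2], zone[2])
    iy1 = min(text_box[3], zone[3])
    inter_w = max(0.0, ix1 - ix0)
    inter_h = max(0.0, iy1 - iy0)
    area = (text_box[2] - text_box[0]) * (text_box[3] - text_box[1])
    if area <= 0:
        return 0.0
    return (inter_w * inter_h) / area


def filter_regions(
    regions: List[Dict],
    exclusion_zones: List[BBox],
    threshold: float = 0.5,  # assumed cutoff, not taken from the commit
) -> List[Dict]:
    """Keep only regions whose overlap with every zone stays below threshold."""
    return [
        region
        for region in regions
        if all(overlap_ratio(region["bbox"], zone) < threshold
               for zone in exclusion_zones)
    ]


if __name__ == "__main__":
    regions = [
        {"text": "caption", "bbox": (0.0, 0.0, 10.0, 10.0)},
        {"text": "inside image", "bbox": (100.0, 100.0, 120.0, 110.0)},
    ]
    image_zones = [(90.0, 90.0, 200.0, 200.0)]
    print([r["text"] for r in filter_regions(regions, image_zones)])
```

Measuring overlap relative to the text box's own area (rather than intersection over union) means a small label sitting fully inside a large image is dropped, while text that only grazes an image edge is kept.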
--- a/backend/app/routers/translate.py
+++ b/backend/app/routers/translate.py
@@ -578,7 +578,10 @@ async def download_translated_pdf(
        with open(translation_file, 'r', encoding='utf-8') as f:
            translation_data = json.load(f)

-        if not translation_data.get('translations'):
+        # Check for translations (Direct Track) or raw_ocr_translations (OCR Track)
+        has_translations = translation_data.get('translations')
+        has_raw_ocr_translations = translation_data.get('raw_ocr_translations')
+        if not has_translations and not has_raw_ocr_translations:
            raise HTTPException(
                status_code=status.HTTP_400_BAD_REQUEST,
                detail="Translation file is empty or incomplete"