fix: OCR Track reflow PDF and translation with image text filtering

- Add OCR Track support for reflow PDF generation using raw_ocr_regions.json
- Add OCR Track translation extraction from raw_ocr_regions instead of elements
- Add raw_ocr_translations output format for OCR Track documents
- Add exclusion zone filtering to remove text overlapping with images
- Update API validation to accept both translations and raw_ocr_translations
- Add page_number field to TranslatedItem for proper tracking

🤖 Generated with [Claude Code](https://claude.com/claude-code)

Co-Authored-By: Claude Opus 4.5 <noreply@anthropic.com>
2025-12-12 11:02:35 +08:00
parent 24253ac15e
commit 1f18010040
11 changed files with 1040 additions and 149 deletions
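
Reviewer note: the commit message mentions exclusion zone filtering that removes text overlapping with images, but the code for it is not part of this excerpt. The following is only a minimal sketch of how such a filter could work, not the commit's implementation. It assumes each region is a dict with a `bbox` in `[x0, y0, x1, y1]` form; the function names and the 0.5 overlap threshold are hypothetical.

```python
from typing import Dict, List, Sequence

Box = Sequence[float]  # assumed layout: [x0, y0, x1, y1]


def overlap_ratio(text_box: Box, zone: Box) -> float:
    """Return the fraction of text_box's area that lies inside zone."""
    x0 = max(text_box[0], zone[0])
    y0 = max(text_box[1], zone[1])
    x1 = min(text_box[2], zone[2])
    y1 = min(text_box[3], zone[3])
    intersection = max(0.0, x1 - x0) * max(0.0, y1 - y0)
    area = (text_box[2] - text_box[0]) * (text_box[3] - text_box[1])
    return intersection / area if area > 0 else 0.0


def filter_by_exclusion_zones(
    regions: List[Dict], zones: List[Box], threshold: float = 0.5
) -> List[Dict]:
    """Keep regions whose overlap with every image zone is below threshold."""
    return [
        region
        for region in regions
        if all(overlap_ratio(region["bbox"], zone) < threshold for zone in zones)
    ]
```

Measuring overlap relative to the text box (rather than intersection over union) means a small label sitting entirely inside a large image is dropped, while a caption that only touches the image edge is kept.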
--- a/openspec/changes/archive/2025-12-12-fix-ocr-track-reflow-pdf/proposal.md
+++ b/openspec/changes/archive/2025-12-12-fix-ocr-track-reflow-pdf/proposal.md
@@ -0,0 +1,51 @@
+# Change: Fix OCR Track Reflow PDF
+
+## Why
+
+The OCR Track reflow PDF generation is missing most content because:
+
+1. PP-StructureV3 extracts tables as elements but stores `content: ""` (empty string) instead of structured `content.cells` data
+2. The `generate_reflow_pdf` method expects `content.cells` for tables, so tables are skipped
+3. Table text exists in `raw_ocr_regions.json` (59 text blocks) but is not used by reflow PDF generation
+4. This causes significant content loss - only 6 text elements vs 59 raw OCR regions
+
+The Layout PDF works correctly because it uses `raw_ocr_regions.json` via Simple Text Positioning mode, bypassing the need for structured table data.
+
+## What Changes
+
+### Reflow PDF Generation for OCR Track
+
+Modify `generate_reflow_pdf` to use `raw_ocr_regions.json` as the primary text source for OCR Track documents:
+
+1. **Detect processing track** from JSON metadata
+2. **For OCR Track**: Load `raw_ocr_regions.json` and render all text blocks in reading order
+3. **For Direct Track**: Continue using `content.cells` for tables (already works)
+4. **Images/Charts**: Continue using `content.saved_path` from elements (works for both tracks)
+
+### Data Flow
+
+**OCR Track Reflow PDF (NEW):**
+```
+raw_ocr_regions.json (59 text blocks)
+  + scan_result.json (images/charts only)
+  → Sort by Y coordinate (reading order)
+  → Render text paragraphs + images
+```
+
+**Direct Track Reflow PDF (UNCHANGED):**
+```
+*_result.json (elements with content.cells)
+  → Render tables, text, images in order
+```
+
+## Impact
+
+- **Affected file**: `backend/app/services/pdf_generator_service.py`
+- **User experience**: OCR Track reflow PDF will contain all text content (matching Layout PDF)
+- **Translation**: Reflow translated PDF will also work correctly for OCR Track
+
+## Migration
+
+- No data migration required
+- Existing `raw_ocr_regions.json` files contain all necessary data
+- No API changes
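
Reviewer note on the proposal above: the track detection and Y-coordinate ordering it describes can be sketched as follows. This is an illustration under stated assumptions, not the code in `pdf_generator_service.py`. It assumes the track is stored at `metadata.processing_track` with the value `"ocr"`, and that a region's `bbox` is either `[x0, y0, x1, y1]` or a list of `[x, y]` polygon points; both helper names are hypothetical.

```python
from typing import Dict, List, Tuple


def is_ocr_track(result: Dict) -> bool:
    """Return True when the result JSON is marked as OCR Track (assumed value "ocr")."""
    metadata = result.get("metadata") or {}
    return str(metadata.get("processing_track", "")).lower() == "ocr"


def reading_order_key(region: Dict) -> Tuple[float, float]:
    """Sort key: top edge first, then left edge, for either bbox layout."""
    bbox = region["bbox"]
    if bbox and isinstance(bbox[0], (list, tuple)):
        # Polygon form: [[x, y], [x, y], ...]
        xs = [point[0] for point in bbox]
        ys = [point[1] for point in bbox]
        return (min(ys), min(xs))
    # Flat form: [x0, y0, x1, y1]
    return (bbox[1], bbox[0])


def sort_reading_order(regions: List[Dict]) -> List[Dict]:
    """Return regions ordered top to bottom, breaking ties left to right."""
    return sorted(regions, key=reading_order_key)
```

A pure Y sort treats the page as a single column; multi-column scans would need column grouping before sorting, which the proposal does not cover.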
--- a/openspec/changes/archive/2025-12-12-fix-ocr-track-reflow-pdf/specs/result-export/spec.md
+++ b/openspec/changes/archive/2025-12-12-fix-ocr-track-reflow-pdf/specs/result-export/spec.md
@@ -0,0 +1,23 @@
+## MODIFIED Requirements
+
+### Requirement: Enhanced PDF Export with Layout Preservation
+
+The PDF export SHALL accurately preserve document layout from both OCR and direct extraction tracks with correct coordinate transformation and multi-page support. For Direct Track, a background image rendering approach SHALL be used for visual fidelity.
+
+#### Scenario: OCR Track reflow PDF uses raw OCR regions
+- **WHEN** generating reflow PDF for an OCR Track document
+- **THEN** the system SHALL load text content from `raw_ocr_regions.json` files
+- **AND** text blocks SHALL be sorted by Y coordinate for reading order
+- **AND** all text content SHALL match the Layout PDF output
+- **AND** images and charts SHALL be embedded from element `saved_path`
+
+#### Scenario: Direct Track reflow PDF uses structured content
+- **WHEN** generating reflow PDF for a Direct Track document
+- **THEN** the system SHALL use `content.cells` for table rendering
+- **AND** text elements SHALL use `content` string directly
+- **AND** images and charts SHALL be embedded from element `saved_path`
+
+#### Scenario: Reflow PDF content consistency
+- **WHEN** comparing Layout PDF and Reflow PDF for the same document
+- **THEN** both PDFs SHALL contain the same text content
+- **AND** only the presentation format SHALL differ (positioned vs flowing)
--- a/openspec/changes/archive/2025-12-12-fix-ocr-track-reflow-pdf/tasks.md
+++ b/openspec/changes/archive/2025-12-12-fix-ocr-track-reflow-pdf/tasks.md
@@ -0,0 +1,51 @@
+# Tasks: Fix OCR Track Reflow PDF
+
+## 1. Modify generate_reflow_pdf Method
+
+- [x] 1.1 Add processing track detection
+  - File: `backend/app/services/pdf_generator_service.py`
+  - Location: `generate_reflow_pdf` method (line ~4704)
+  - Read `metadata.processing_track` from JSON data
+  - Branch logic based on track type
+
+- [x] 1.2 Add helper function to load raw OCR regions
+  - File: `backend/app/services/pdf_generator_service.py`
+  - Using existing: `load_raw_ocr_regions` from `text_region_renderer.py`
+  - Pattern: `{task_id}_*_page_{page_num}_raw_ocr_regions.json`
+  - Return: List of text regions with bbox and content
+
+- [x] 1.3 Implement OCR Track reflow rendering
+  - File: `backend/app/services/pdf_generator_service.py`
+  - For OCR Track: Load raw OCR regions per page
+  - Sort text blocks by Y coordinate (top to bottom reading order)
+  - Render text blocks as paragraphs
+  - Still render images/charts from elements
+
+- [x] 1.4 Keep Direct Track logic unchanged
+  - File: `backend/app/services/pdf_generator_service.py`
+  - Direct Track continues using `content.cells` for tables
+  - Extracted to `_render_reflow_elements` helper method
+  - No changes to existing Direct Track flow
+
+## 2. Handle Multi-page Documents
+
+- [x] 2.1 Support per-page raw OCR files
+  - Pattern: `{task_id}_*_page_{page_num}_raw_ocr_regions.json`
+  - Iterate through pages and load corresponding raw OCR file
+  - Handle missing files gracefully (fall back to elements)
+
+## 3. Testing
+
+- [x] 3.1 Test OCR Track reflow PDF
+  - Test with: `a9259180-fc49-4890-8184-2e6d5f4edad3` (scan document)
+  - Verify: All 59 text blocks appear in reflow PDF
+  - Verify: Images are embedded correctly
+
+- [x] 3.2 Test Direct Track reflow PDF
+  - Test with: `1b32428d-0609-4cfd-bc52-56be6956ac2e` (editable PDF)
+  - Verify: Tables render with cells
+  - Verify: No regression from changes
+
+- [x] 3.3 Test translated reflow PDF
+  - Test: Complete translation then download reflow PDF
+  - Verify: Translated text appears correctly
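
Reviewer note on task 2.1: the per-page lookup with fallback can be sketched as below. The project already has `load_raw_ocr_regions` in `text_region_renderer.py`, whose signature is not shown here, so this sketch uses a hypothetical stand-in named `load_page_regions`. It assumes each raw OCR file holds a JSON list of region dicts and that the files sit in a single result directory.

```python
import json
from pathlib import Path
from typing import Dict, List, Optional, Tuple


def load_page_regions(
    result_dir: Path, task_id: str, page_num: int
) -> Optional[List[Dict]]:
    """Load the raw OCR regions for one page, or None if no usable file exists."""
    pattern = f"{task_id}_*_page_{page_num}_raw_ocr_regions.json"
    matches = sorted(result_dir.glob(pattern))
    if not matches:
        return None
    with matches[0].open(encoding="utf-8") as handle:
        data = json.load(handle)
    # Assumed file shape: a top-level JSON list of region dicts.
    return data if isinstance(data, list) else None


def page_text_source(
    result_dir: Path, task_id: str, page_num: int, elements: List[Dict]
) -> Tuple[str, List[Dict]]:
    """Prefer raw OCR regions; fall back to the page's elements when missing."""
    regions = load_page_regions(result_dir, task_id, page_num)
    if regions is None:
        return ("elements", elements)
    return ("raw_ocr", regions)
```

Returning the source label alongside the data lets the caller log which pages fell back to elements, which helps when checking a count such as the 59 text blocks in test 3.1.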