# Change: Fix OCR Track Reflow PDF

## Why

The OCR Track reflow PDF generation is missing most content because:

1. PP-StructureV3 extracts tables as elements but stores `content: ""` (empty string) instead of structured `content.cells` data
2. The `generate_reflow_pdf` method expects `content.cells` for tables, so tables are skipped
3. Table text exists in `raw_ocr_regions.json` (59 text blocks) but is not used by reflow PDF generation
4. This causes significant content loss - only 6 text elements vs 59 raw OCR regions

The Layout PDF works correctly because it uses `raw_ocr_regions.json` via Simple Text Positioning mode, bypassing the need for structured table data.

## What Changes

### Reflow PDF Generation for OCR Track

Modify `generate_reflow_pdf` to use `raw_ocr_regions.json` as the primary text source for OCR Track documents:

1. **Detect processing track** from JSON metadata
2. **For OCR Track**: Load `raw_ocr_regions.json` and render all text blocks in reading order
3. **For Direct Track**: Continue using `content.cells` for tables (already works)
4. **Images/Charts**: Continue using `content.saved_path` from elements (works for both tracks)

### Data Flow

**OCR Track Reflow PDF (NEW):**

```
raw_ocr_regions.json (59 text blocks)
  + scan_result.json (images/charts only)
  → Sort by Y coordinate (reading order)
  → Render text paragraphs + images
```

**Direct Track Reflow PDF (UNCHANGED):**

```
*_result.json (elements with content.cells)
  → Render tables, text, images in order
```

## Impact

- **Affected file**: `backend/app/services/pdf_generator_service.py`
- **User experience**: OCR Track reflow PDF will contain all text content (matching Layout PDF)
- **Translation**: Reflow translated PDF will also work correctly for OCR Track

## Migration

- No data migration required
- Existing `raw_ocr_regions.json` files contain all necessary data
- No API changes
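
For reference, the OCR Track steps listed under "What Changes" (detect the track, load `raw_ocr_regions.json`, sort by Y coordinate) could look roughly like the sketch below. This is not the actual `generate_reflow_pdf` code. The helper names and the JSON keys (`metadata`, `processing_track`, `text`, `bbox`) are assumptions made for illustration, as is the `[x0, y0, x1, y1]` bounding-box layout. The real schema may differ.

```python
import json
from pathlib import Path
from typing import Any

# All key names below ("metadata", "processing_track", "text", "bbox") are
# assumptions for illustration; adjust them to the real JSON schema.


def detect_track(result: dict[str, Any]) -> str:
    """Read the processing track from result metadata, defaulting to 'direct'."""
    metadata = result.get("metadata") or {}
    return str(metadata.get("processing_track", "direct")).lower()


def sort_reading_order(regions: list[dict[str, Any]]) -> list[dict[str, Any]]:
    """Order regions top-to-bottom, then left-to-right.

    Assumes each bbox is [x0, y0, x1, y1] with the origin at the top-left.
    """
    return sorted(regions, key=lambda r: (r["bbox"][1], r["bbox"][0]))


def ocr_paragraphs(regions: list[dict[str, Any]]) -> list[str]:
    """Return non-empty text blocks in reading order."""
    ordered = sort_reading_order(regions)
    return [r["text"].strip() for r in ordered if r.get("text", "").strip()]


def load_ocr_paragraphs(raw_regions_path: Path) -> list[str]:
    """Load raw_ocr_regions.json (assumed to be a flat list of regions)."""
    with raw_regions_path.open(encoding="utf-8") as f:
        return ocr_paragraphs(json.load(f))
```

A caller would branch on `detect_track(result) == "ocr"` and pass the resulting paragraphs to the existing text renderer, leaving the Direct Track path untouched. A plain sort on the top Y coordinate can misorder blocks that share a visual line but differ by a few pixels, so a real implementation may want to group blocks into lines with a tolerance first.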