# Fix OCR Track Table Rendering

## Summary

OCR track PDF generation produces tables with incorrect format and layout. Tables appear without proper structure: cell content is misaligned and the visual format differs significantly from the original document. Image placement is correct, but table rendering is broken.

## Problem Statement

When generating a PDF from OCR track results (via `scan.pdf` processed by PP-StructureV3), the output tables have:

1. **Wrong cell alignment** - content not positioned in proper cells
2. **Missing table structure** - rows/columns don't match the original document layout
3. **Incorrect content distribution** - all content flows linearly instead of maintaining the grid structure

Reference: `backend/storage/results/af7c9ee8-60a0-4291-9f22-ef98d27eed52/`

- Original: `af7c9ee8-60a0-4291-9f22-ef98d27eed52_scan_page_1.png`
- Generated: `scan_layout.pdf`
- Result JSON: `scan_result.json` - tables have the correct `{rows, cols, cells}` structure

## Root Cause Analysis

### Issue 1: Table Content Not Converted to a TableData Object

In `_json_to_document_element` (pdf_generator_service.py:1952):

```python
element = DocumentElement(
    ...
    content=elem_dict.get('content', ''),  # Raw dict, not TableData
    ...
)
```

Table elements have `content` as a dict `{rows: 5, cols: 4, cells: [...]}`, but it is not converted to a `TableData` object.

### Issue 2: OCR Track HTML Conversion Fails

In `convert_unified_document_to_ocr_data` (pdf_generator_service.py:464-467):

```python
elif isinstance(element.content, dict):
    html_content = element.content.get('html', str(element.content))
```

Since the cells-based dict has no `'html'` key, this falls back to `str(element.content)`, i.e. `"{'rows': 5, 'cols': 4, ...}"`, which is not valid HTML.
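To make the failure concrete, here is a minimal sketch of the conversion the OCR path is missing: turning a `{rows, cols, cells}` dict into an HTML table string. The helper name and the cell shape (`{'row': int, 'col': int, 'text': str}`) are assumptions for illustration; the actual cell schema in `scan_result.json` may differ.

```python
# Sketch: convert a {rows, cols, cells} table dict to an HTML string.
# Assumes each cell entry looks like {'row': int, 'col': int, 'text': str};
# the real cell schema in scan_result.json may carry more fields.

def cells_dict_to_html(table: dict) -> str:
    """Build an HTML <table> string from a cells-based table dict."""
    rows, cols = table['rows'], table['cols']
    # Pre-fill an empty grid so missing cells still render as empty <td>s.
    grid = [['' for _ in range(cols)] for _ in range(rows)]
    for cell in table['cells']:
        grid[cell['row']][cell['col']] = cell.get('text', '')
    body = ''.join(
        '<tr>' + ''.join(f'<td>{text}</td>' for text in row) + '</tr>'
        for row in grid
    )
    return f'<table>{body}</table>'


example = {
    'rows': 2, 'cols': 2,
    'cells': [
        {'row': 0, 'col': 0, 'text': 'A'},
        {'row': 0, 'col': 1, 'text': 'B'},
        {'row': 1, 'col': 0, 'text': 'C'},
        {'row': 1, 'col': 1, 'text': 'D'},
    ],
}
print(cells_dict_to_html(example))
# -> <table><tr><td>A</td><td>B</td></tr><tr><td>C</td><td>D</td></tr></table>
```

By contrast, `str(example)` yields the dict's `repr`, which is what `draw_table_region` currently receives and cannot parse as HTML.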
### Issue 3: Different Table Rendering Paths

- **Direct track** uses `_draw_table_element_direct`, which properly handles a dict with cells via `_build_rows_from_cells_dict`
- **OCR track** uses `draw_table_region`, which expects HTML strings and fails on dict content

## Proposed Solution

### Option A: Convert dict to TableData during JSON loading (Recommended)

In `_json_to_document_element`, when the element type is TABLE and the content is a dict with cells, convert it to a `TableData` object:

```python
# For TABLE elements, convert the dict to TableData
if elem_type == ElementType.TABLE and isinstance(content, dict) and 'cells' in content:
    content = self._dict_to_table_data(content)
```

This ensures `element.content.to_html()` works correctly in `convert_unified_document_to_ocr_data`.

### Option B: Fix conversion in convert_unified_document_to_ocr_data

Handle a dict with cells properly by converting it to HTML:

```python
elif isinstance(element.content, dict):
    if 'cells' in element.content:
        # Convert the cells-based dict to HTML
        html_content = self._cells_dict_to_html(element.content)
    elif 'html' in element.content:
        html_content = element.content['html']
    else:
        html_content = str(element.content)
```

## Impact on Hybrid Mode

Hybrid mode uses Direct track rendering (`_generate_direct_track_pdf`), which already handles dict content properly via `_build_rows_from_cells_dict`, so the proposed fixes should not affect hybrid mode negatively. However, testing should verify that:

1. Hybrid mode continues to work with combined Direct + OCR elements
2. Table rendering quality is consistent across all tracks

## Success Criteria

1. OCR track tables render with the correct structure, matching the original document
2. Cell content is positioned in the proper grid locations
3. Table borders/grid lines are visible
4. No regression in Direct track or Hybrid mode table rendering
5. All test files (scan.pdf, img1.png, img2.png, img3.png) produce correct output

## Files to Modify

1. `backend/app/services/pdf_generator_service.py`
   - `_json_to_document_element`: convert the table dict to `TableData`
   - `convert_unified_document_to_ocr_data`: improve dict handling (if Option B)
2. `backend/app/models/unified_document.py` (optional)
   - Add a `TableData.from_dict()` class method for cleaner conversion

## Testing Plan

1. Test scan.pdf with the OCR track - verify table structure matches the original
2. Test img1.png, img2.png, img3.png with the OCR track
3. Test PDF files with the Direct track - verify no regression
4. Test Hybrid mode with files that trigger the OCR fallback
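The optional `TableData.from_dict()` class method mentioned under "Files to Modify" could be sketched as follows. This is a hypothetical shape for illustration only: the real `TableData` in `backend/app/models/unified_document.py` may have different fields, and the cell key names (`row`, `col`, `text`) are assumed.

```python
from dataclasses import dataclass, field

# Hypothetical sketch of TableData.from_dict(); the real TableData class in
# backend/app/models/unified_document.py may define different fields.

@dataclass
class TableCell:
    row: int
    col: int
    text: str = ''

@dataclass
class TableData:
    rows: int
    cols: int
    cells: list = field(default_factory=list)

    @classmethod
    def from_dict(cls, data: dict) -> 'TableData':
        """Build TableData from a {rows, cols, cells} result dict."""
        return cls(
            rows=data['rows'],
            cols=data['cols'],
            cells=[TableCell(c['row'], c['col'], c.get('text', ''))
                   for c in data['cells']],
        )

    def to_html(self) -> str:
        """Render the grid as an HTML table; unfilled cells stay empty."""
        grid = [['' for _ in range(self.cols)] for _ in range(self.rows)]
        for cell in self.cells:
            grid[cell.row][cell.col] = cell.text
        body = ''.join(
            '<tr>' + ''.join(f'<td>{t}</td>' for t in row) + '</tr>'
            for row in grid
        )
        return f'<table>{body}</table>'


table = TableData.from_dict({
    'rows': 1, 'cols': 2,
    'cells': [{'row': 0, 'col': 0, 'text': 'x'},
              {'row': 0, 'col': 1, 'text': 'y'}],
})
print(table.to_html())
# -> <table><tr><td>x</td><td>y</td></tr></table>
```

With a helper like this in place, Option A reduces to a single `TableData.from_dict(content)` call in `_json_to_document_element`, and the OCR track's `element.content.to_html()` path works unchanged.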