Add TableData.from_dict() and TableCell.from_dict() methods to convert JSON table dicts to proper TableData objects during UnifiedDocument parsing. Modified _json_to_document_element() to detect TABLE elements with dict content containing 'cells' key and convert to TableData.

Note: This fix ensures table elements have proper to_html() method available but the rendered output still needs investigation - tables may still render incorrectly in OCR track PDFs.

Files changed:
- unified_document.py: Add from_dict() class methods
- pdf_generator_service.py: Convert table dicts during JSON parsing
- Add fix-ocr-track-table-rendering proposal

🤖 Generated with [Claude Code](https://claude.com/claude-code)

Co-Authored-By: Claude <noreply@anthropic.com>
Fix OCR Track Table Rendering
Summary
OCR track PDF generation produces tables with incorrect format and layout. Tables appear without proper structure - cell content is misaligned and the visual format differs significantly from the original document. Image placement is correct, but table rendering is broken.
Problem Statement
When generating a PDF from OCR track results (via scan.pdf processed by PP-StructureV3), the output tables have:
- Wrong cell alignment - content not positioned in proper cells
- Missing table structure - rows/columns don't match original document layout
- Incorrect content distribution - all content seems to flow linearly instead of maintaining grid structure
Reference: backend/storage/results/af7c9ee8-60a0-4291-9f22-ef98d27eed52/
- Original: af7c9ee8-60a0-4291-9f22-ef98d27eed52_scan_page_1.png
- Generated: scan_layout.pdf
- Result JSON: scan_result.json - Tables have correct {rows, cols, cells} structure
Root Cause Analysis
Issue 1: Table Content Not Converted to TableData Object
In _json_to_document_element (pdf_generator_service.py:1952):
element = DocumentElement(
    ...
    content=elem_dict.get('content', ''),  # Raw dict, not TableData
    ...
)
Table elements have content as a dict {rows: 5, cols: 4, cells: [...]} but it's not converted to a TableData object.
Issue 2: OCR Track HTML Conversion Fails
In convert_unified_document_to_ocr_data (pdf_generator_service.py:464-467):
elif isinstance(element.content, dict):
    html_content = element.content.get('html', str(element.content))
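Since there is no 'html' key in the cells-based dict, the lookup falls back to str(element.content), which yields the dict's Python repr ("{'rows': 5, 'cols': 4, ...}"). That string is not valid HTML, so the table renderer has nothing to parse.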
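The fallback can be reproduced in isolation. The snippet below is a minimal sketch: the dict is a made-up stand-in for a cells-based table, and the per-cell keys (row, col, content) are assumptions rather than the confirmed schema of scan_result.json.

```python
# Hypothetical cells-based table content; the cell keys are assumed for illustration.
content = {"rows": 2, "cols": 2, "cells": [{"row": 0, "col": 0, "content": "A"}]}

# Same fallback expression as in convert_unified_document_to_ocr_data:
# with no 'html' key present, the default str(content) is returned.
html_content = content.get("html", str(content))

# The result is the dict's repr, so it contains no table markup at all.
has_table_tag = "<table" in html_content

print(html_content)
print(has_table_tag)
```

Any HTML-based renderer receiving this string sees plain text only, which matches the linear, unstructured output described in the Problem Statement.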
Issue 3: Different Table Rendering Paths
- Direct track uses _draw_table_element_direct which properly handles dict with cells via _build_rows_from_cells_dict
- OCR track uses draw_table_region which expects HTML strings and fails with dict content
Proposed Solution
Option A: Convert dict to TableData during JSON loading (Recommended)
In _json_to_document_element, when element type is TABLE and content is a dict with cells, convert it to a TableData object:
# For TABLE elements, convert dict to TableData
if elem_type == ElementType.TABLE and isinstance(content, dict) and 'cells' in content:
    content = self._dict_to_table_data(content)
This ensures element.content.to_html() works correctly in convert_unified_document_to_ocr_data.
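To make Option A concrete, here is a rough sketch of what dict-to-object conversion with from_dict() class methods could look like. The classes below are simplified stand-ins, not the real ones in unified_document.py: the field names (row, col, content, row_span, col_span) and the to_html() output format are assumptions made for illustration.

```python
from dataclasses import dataclass, field
from html import escape
from typing import Any, Dict, List


@dataclass
class TableCell:
    # Stand-in cell model; the real field names may differ.
    row: int
    col: int
    content: str = ""
    row_span: int = 1
    col_span: int = 1

    @classmethod
    def from_dict(cls, data: Dict[str, Any]) -> "TableCell":
        return cls(
            row=int(data.get("row", 0)),
            col=int(data.get("col", 0)),
            content=str(data.get("content", "")),
            row_span=int(data.get("row_span", 1)),
            col_span=int(data.get("col_span", 1)),
        )


@dataclass
class TableData:
    rows: int
    cols: int
    cells: List[TableCell] = field(default_factory=list)

    @classmethod
    def from_dict(cls, data: Dict[str, Any]) -> "TableData":
        cells = [TableCell.from_dict(c) for c in data.get("cells", [])]
        return cls(int(data.get("rows", 0)), int(data.get("cols", 0)), cells)

    def to_html(self) -> str:
        by_pos = {(c.row, c.col): c for c in self.cells}
        covered = set()  # grid slots hidden behind a spanning cell
        out = ["<table>"]
        for r in range(self.rows):
            out.append("<tr>")
            for c in range(self.cols):
                if (r, c) in covered:
                    continue
                cell = by_pos.get((r, c))
                if cell is None:
                    out.append("<td></td>")  # keep the grid rectangular
                    continue
                for dr in range(cell.row_span):
                    for dc in range(cell.col_span):
                        if (dr, dc) != (0, 0):
                            covered.add((r + dr, c + dc))
                attrs = ""
                if cell.row_span > 1:
                    attrs += f' rowspan="{cell.row_span}"'
                if cell.col_span > 1:
                    attrs += f' colspan="{cell.col_span}"'
                out.append(f"<td{attrs}>{escape(cell.content)}</td>")
            out.append("</tr>")
        out.append("</table>")
        return "".join(out)


table = TableData.from_dict({
    "rows": 2,
    "cols": 2,
    "cells": [
        {"row": 0, "col": 0, "content": "Name", "col_span": 2},
        {"row": 1, "col": 0, "content": "a"},
        {"row": 1, "col": 1, "content": "b & c"},
    ],
})
html = table.to_html()
```

Converting at load time keeps the downstream code unchanged: anything that already calls to_html() on table content receives an object that supports it.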
Option B: Fix conversion in convert_unified_document_to_ocr_data
Handle dict with cells properly by converting to HTML:
elif isinstance(element.content, dict):
    if 'cells' in element.content:
        # Convert cells-based dict to HTML
        html_content = self._cells_dict_to_html(element.content)
    elif 'html' in element.content:
        html_content = element.content['html']
    else:
        html_content = str(element.content)
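The _cells_dict_to_html helper referenced in Option B does not exist yet. One possible shape for it is sketched below as standalone functions; the per-cell keys (row, col, content) are assumed, and spanning cells are deliberately ignored to keep the example short.

```python
from html import escape
from typing import Any, Dict


def cells_dict_to_html(content: Dict[str, Any]) -> str:
    """Build a plain <table> from a {rows, cols, cells} dict (assumed schema)."""
    rows = int(content.get("rows", 0))
    cols = int(content.get("cols", 0))
    # Start from an empty grid so missing cells still occupy a slot.
    grid = [["" for _ in range(cols)] for _ in range(rows)]
    for cell in content.get("cells", []):
        r = int(cell.get("row", -1))
        c = int(cell.get("col", -1))
        if 0 <= r < rows and 0 <= c < cols:  # skip out-of-range cells
            grid[r][c] = escape(str(cell.get("content", "")))
    body = "".join(
        "<tr>" + "".join(f"<td>{text}</td>" for text in row) + "</tr>"
        for row in grid
    )
    return f"<table>{body}</table>"


def content_to_html(content: Any) -> str:
    """Mirror the Option B branch order: cells first, then html, then str()."""
    if isinstance(content, dict):
        if "cells" in content:
            return cells_dict_to_html(content)
        if "html" in content:
            return content["html"]
    return str(content)


from_cells = content_to_html(
    {"rows": 1, "cols": 2, "cells": [{"row": 0, "col": 1, "content": "x"}]}
)
from_html = content_to_html({"html": "<table></table>"})
```

Compared with Option A, this keeps the fix local to the OCR conversion path, at the cost of leaving raw dicts in the document model.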
Impact on Hybrid Mode
Hybrid mode uses Direct track rendering (_generate_direct_track_pdf) which already handles dict content properly via _build_rows_from_cells_dict. The proposed fixes should not affect hybrid mode negatively.
However, testing should verify:
- Hybrid mode continues to work with combined Direct + OCR elements
- Table rendering quality is consistent across all tracks
Success Criteria
- OCR track tables render with correct structure matching original document
- Cell content positioned in proper grid locations
- Table borders/grid lines visible
- No regression in Direct track or Hybrid mode table rendering
- All test files (scan.pdf, img1.png, img2.png, img3.png) produce correct output
Files to Modify
- backend/app/services/pdf_generator_service.py
  - _json_to_document_element: Convert table dict to TableData
  - convert_unified_document_to_ocr_data: Improve dict handling (if Option B)
- backend/app/models/unified_document.py (optional)
  - Add TableData.from_dict() class method for cleaner conversion
Testing Plan
- Test scan.pdf with OCR track - verify table structure matches original
- Test img1.png, img2.png, img3.png with OCR track
- Test PDF files with Direct track - verify no regression
- Test Hybrid mode with files that trigger OCR fallback
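Part of the structure check in the plan above could be automated by inspecting the HTML handed to the OCR track renderer. The helper below is a hypothetical sketch built only on the standard library; it counts cells per row and does not account for rowspan or colspan, so it is a coarse sanity check, not a substitute for visual comparison against the original page.

```python
from html.parser import HTMLParser
from typing import List


class TableShapeParser(HTMLParser):
    """Record how many <td>/<th> cells appear in each <tr>."""

    def __init__(self) -> None:
        super().__init__()
        self.row_lengths: List[int] = []

    def handle_starttag(self, tag, attrs):
        if tag == "tr":
            self.row_lengths.append(0)
        elif tag in ("td", "th") and self.row_lengths:
            self.row_lengths[-1] += 1


def table_shape(html: str) -> List[int]:
    parser = TableShapeParser()
    parser.feed(html)
    parser.close()
    return parser.row_lengths


good = table_shape(
    "<table><tr><td>a</td><td>b</td></tr><tr><td>c</td><td>d</td></tr></table>"
)
bad = table_shape("{'rows': 2, 'cols': 2, 'cells': []}")
```

An empty result, as produced by the stringified dict, flags the failure mode from Issue 2, while a list such as [2, 2] can be compared against the rows and cols recorded in the result JSON.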