Add TableData.from_dict() and TableCell.from_dict() methods to convert JSON table dicts to proper TableData objects during UnifiedDocument parsing. Modified _json_to_document_element() to detect TABLE elements with dict content containing 'cells' key and convert to TableData.

Note: This fix ensures table elements have proper to_html() method available but the rendered output still needs investigation - tables may still render incorrectly in OCR track PDFs.

Files changed:
- unified_document.py: Add from_dict() class methods
- pdf_generator_service.py: Convert table dicts during JSON parsing
- Add fix-ocr-track-table-rendering proposal

🤖 Generated with [Claude Code](https://claude.com/claude-code)

Co-Authored-By: Claude <noreply@anthropic.com>
PDF Generation - OCR Track Table Rendering Fix
MODIFIED Requirements
Requirement: OCR Track Table Content Conversion
The PDF generator MUST properly convert table content from JSON dict format to renderable structure when processing OCR track results.
Scenario: Table dict with cells array converts to proper HTML
Given an OCR track JSON with table element containing rows, cols, and cells array
When the PDF generator processes this element
Then the table content MUST be converted to a TableData object
And TableData.to_html() MUST produce valid HTML with proper tr/td structure
And the generated PDF table MUST have cells positioned in correct grid locations
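The conversion in this scenario can be sketched as follows. This is a minimal illustration, not the code in unified_document.py: the class names TableData and TableCell and the keys rows, cols, and cells come from the text above, while the per-cell keys (row, col, content, row_span, col_span) and the exact HTML layout are assumptions made for the example.

```python
from dataclasses import dataclass, field
from html import escape
from typing import Any, Dict, List


@dataclass
class TableCell:
    # Field names are assumed for illustration; the real schema may differ.
    row: int
    col: int
    content: str = ""
    row_span: int = 1
    col_span: int = 1

    @classmethod
    def from_dict(cls, data: Dict[str, Any]) -> "TableCell":
        # Missing keys fall back to a 1x1 cell at the origin with empty text.
        return cls(
            row=int(data.get("row", 0)),
            col=int(data.get("col", 0)),
            content=str(data.get("content", "")),
            row_span=int(data.get("row_span", 1)),
            col_span=int(data.get("col_span", 1)),
        )


@dataclass
class TableData:
    rows: int
    cols: int
    cells: List[TableCell] = field(default_factory=list)

    @classmethod
    def from_dict(cls, data: Dict[str, Any]) -> "TableData":
        cells = [TableCell.from_dict(c) for c in data.get("cells", [])]
        return cls(
            rows=int(data.get("rows", 0)),
            cols=int(data.get("cols", 0)),
            cells=cells,
        )

    def to_html(self) -> str:
        # Group cells by row so each <tr> lists its cells in column order.
        by_row: Dict[int, List[TableCell]] = {}
        for cell in self.cells:
            by_row.setdefault(cell.row, []).append(cell)

        parts = ["<table>"]
        for r in range(self.rows):
            parts.append("<tr>")
            for cell in sorted(by_row.get(r, []), key=lambda c: c.col):
                attrs = ""
                if cell.row_span > 1:
                    attrs += f' rowspan="{cell.row_span}"'
                if cell.col_span > 1:
                    attrs += f' colspan="{cell.col_span}"'
                parts.append(f"<td{attrs}>{escape(cell.content)}</td>")
            parts.append("</tr>")
        parts.append("</table>")
        return "".join(parts)
```

Cell text is escaped before it is placed in the markup, so OCR output containing characters such as & or < cannot break the tr/td structure.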
Scenario: Table with rowspan/colspan renders correctly
Given a table element with cells having rowspan > 1 or colspan > 1
When the PDF generator renders the table
Then merged cells MUST span the correct number of rows/columns
And content MUST appear in the merged cell position
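One way to check that merged cells land where they should is to build an occupancy grid before rendering. The sketch below is a hypothetical helper, not part of the change described here; it assumes each cell is given as a (row, col, row_span, col_span) tuple with zero-based indices.

```python
from typing import List, Optional, Sequence, Tuple

Span = Tuple[int, int, int, int]  # (row, col, row_span, col_span), zero-based


def build_occupancy(rows: int, cols: int, cells: Sequence[Span]) -> List[List[Optional[int]]]:
    """Return a rows x cols grid where each slot holds the index of the cell covering it."""
    grid: List[List[Optional[int]]] = [[None] * cols for _ in range(rows)]
    for index, (row, col, row_span, col_span) in enumerate(cells):
        if row_span < 1 or col_span < 1:
            raise ValueError(f"cell {index} has a span smaller than 1")
        if row < 0 or col < 0 or row + row_span > rows or col + col_span > cols:
            raise ValueError(f"cell {index} extends outside the {rows}x{cols} grid")
        # Mark every slot covered by this (possibly merged) cell.
        for r in range(row, row + row_span):
            for c in range(col, col + col_span):
                if grid[r][c] is not None:
                    raise ValueError(f"cell {index} overlaps cell {grid[r][c]} at ({r}, {c})")
                grid[r][c] = index
    return grid
```

A slot left as None after the pass indicates a gap in the OCR output, and an overlap raises early instead of producing a misplaced merged cell in the PDF.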
Requirement: Table Visual Fidelity
The PDF generator MUST render OCR track tables with visual structure matching the original document.
Scenario: Table renders with grid lines
Given an OCR track table element
When rendered to PDF
Then the table MUST have visible grid lines/borders
And cell boundaries MUST be clearly defined
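The source does not name the PDF library, so the following is only a sketch under the assumption that tables are drawn with ReportLab's platypus Table, whose TableStyle accepts command tuples such as GRID and SPAN addressed as (col, row). The helper name grid_style_commands and the span tuple layout are hypothetical.

```python
from typing import Any, List, Sequence, Tuple

Span = Tuple[int, int, int, int]  # (row, col, row_span, col_span), zero-based


def grid_style_commands(spans: Sequence[Span], line_width: float = 0.5,
                        color: Any = "black") -> List[tuple]:
    """Build TableStyle-like commands: one GRID for all cells, one SPAN per merged cell."""
    # GRID over the whole table draws both the outer box and the inner lines.
    commands: List[tuple] = [("GRID", (0, 0), (-1, -1), line_width, color)]
    for row, col, row_span, col_span in spans:
        if row_span > 1 or col_span > 1:
            # Coordinates follow the (col, row) order assumed for ReportLab.
            start = (col, row)
            end = (col + col_span - 1, row + row_span - 1)
            commands.append(("SPAN", start, end))
    return commands
```

The returned list would be passed to a TableStyle in that assumed setup, with color typically being a value from reportlab.lib.colors; the function itself imports nothing from ReportLab, so it can be checked in isolation.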
Scenario: Table text alignment preserved
Given an OCR track table with cell content
When rendered to PDF
Then text MUST be positioned within the correct cell boundaries
And text MUST NOT overflow into adjacent cells
Requirement: Backward Compatibility with Hybrid Mode
The table rendering fix MUST NOT break hybrid mode processing.
Scenario: Hybrid mode tables render correctly
Given a document processed with hybrid mode combining Direct and OCR tracks
When PDF is generated
Then Direct track tables MUST render with existing quality
And OCR track tables MUST render with improved quality
And there MUST be no regression in table positioning or content
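A narrow guard is one way to keep other inputs untouched: convert only when the element is a table and its content is a dict with a 'cells' key, as the commit message describes for _json_to_document_element(). The function names below and the string form of the element type are assumptions for illustration; the converter argument stands in for TableData.from_dict.

```python
from typing import Any, Callable, Dict


def needs_table_conversion(element_type: str, content: Any) -> bool:
    """True only for table elements whose content is a dict with a 'cells' key."""
    return (
        element_type.lower() == "table"
        and isinstance(content, dict)
        and "cells" in content
    )


def convert_table_content(element_type: str, content: Any,
                          converter: Callable[[Dict[str, Any]], Any]) -> Any:
    # Anything that does not match the guard is returned unchanged, so content
    # that is already an object, a string, or a non-table dict is left alone.
    if needs_table_conversion(element_type, content):
        return converter(content)
    return content
```

Because non-matching content passes through by identity, a regression test for hybrid mode can assert that such elements are the same object before and after parsing.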