Add TableData.from_dict() and TableCell.from_dict() methods to convert JSON table dicts to proper TableData objects during UnifiedDocument parsing. Modified _json_to_document_element() to detect TABLE elements with dict content containing 'cells' key and convert to TableData.

Note: This fix ensures table elements have proper to_html() method available but the rendered output still needs investigation - tables may still render incorrectly in OCR track PDFs.

Files changed:
- unified_document.py: Add from_dict() class methods
- pdf_generator_service.py: Convert table dicts during JSON parsing
- Add fix-ocr-track-table-rendering proposal

🤖 Generated with [Claude Code](https://claude.com/claude-code)

Co-Authored-By: Claude <noreply@anthropic.com>
Implementation Tasks
Phase 1: Core Fix - Table Content Conversion
1.1 Add TableData.from_dict() class method
- In unified_document.py, add from_dict() method to TableData class
- Handle conversion of cells list (list of dicts) to TableCell objects
- Preserve rows, cols, headers, caption fields
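As a rough illustration of task 1.1, the sketch below shows one way the two class methods could look. The dataclass layout and the cell field names (`row`, `col`, `content`, `row_span`, `col_span`) are assumptions made for this example; the real classes in unified_document.py may differ.

```python
from dataclasses import dataclass, field
from typing import Any, Dict, List, Optional


@dataclass
class TableCell:
    # Field names are assumed for illustration only.
    row: int
    col: int
    content: str = ""
    row_span: int = 1
    col_span: int = 1

    @classmethod
    def from_dict(cls, data: Dict[str, Any]) -> "TableCell":
        """Build a cell from a JSON dict, defaulting missing keys."""
        return cls(
            row=int(data.get("row", 0)),
            col=int(data.get("col", 0)),
            content=str(data.get("content", "")),
            row_span=int(data.get("row_span", 1)),
            col_span=int(data.get("col_span", 1)),
        )


@dataclass
class TableData:
    rows: int
    cols: int
    cells: List[TableCell] = field(default_factory=list)
    headers: Optional[List[str]] = None
    caption: Optional[str] = None

    @classmethod
    def from_dict(cls, data: Dict[str, Any]) -> "TableData":
        """Convert the 'cells' list of dicts to TableCell objects and
        carry rows, cols, headers and caption over unchanged."""
        cells = [
            TableCell.from_dict(cell)
            for cell in data.get("cells", [])
            if isinstance(cell, dict)  # ignore entries that are not dicts
        ]
        return cls(
            rows=int(data.get("rows", 0)),
            cols=int(data.get("cols", 0)),
            cells=cells,
            headers=data.get("headers"),
            caption=data.get("caption"),
        )
```

Calling `TableData.from_dict(json_table)` on a parsed JSON table would then yield an object whose `cells` are `TableCell` instances rather than plain dicts.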
1.2 Fix _json_to_document_element for TABLE elements
- In pdf_generator_service.py, modify _json_to_document_element
- When elem_type == ElementType.TABLE and content is dict with 'cells', convert to TableData
- Use TableData.from_dict() for clean conversion
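The detection rule in task 1.2 can be sketched as a small helper. `ElementType` and `TableData` here are simplified stand-ins, and `convert_table_content` is a hypothetical name; the actual logic would live inside `_json_to_document_element`.

```python
from dataclasses import dataclass
from enum import Enum
from typing import Any, Dict, List


class ElementType(Enum):
    # Stand-in for the project's enum; only two members are shown.
    TEXT = "text"
    TABLE = "table"


@dataclass
class TableData:
    # Simplified stand-in that keeps cells as raw dicts.
    rows: int
    cols: int
    cells: List[Dict[str, Any]]

    @classmethod
    def from_dict(cls, data: Dict[str, Any]) -> "TableData":
        return cls(
            rows=int(data.get("rows", 0)),
            cols=int(data.get("cols", 0)),
            cells=list(data.get("cells", [])),
        )


def convert_table_content(elem_type: ElementType, content: Any) -> Any:
    """Return TableData for TABLE elements whose content is a dict with
    a 'cells' key; return every other content value unchanged."""
    is_table_dict = (
        elem_type == ElementType.TABLE
        and isinstance(content, dict)
        and "cells" in content
    )
    return TableData.from_dict(content) if is_table_dict else content
```

Checking both the element type and the presence of 'cells' keeps non-table dicts and already-converted content untouched.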
1.3 Verify TableData.to_html() generates correct HTML
- Test that to_html() produces parseable HTML with proper row/cell structure
- Verify colspan/rowspan attributes are correctly generated
- Ensure empty cells are properly handled
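One possible way to run the checks in task 1.3 is to parse the generated HTML with the standard-library `HTMLParser` and compare the effective width of each row. `render_table_html` below is a hypothetical stand-in for `TableData.to_html()`, and the cell keys are assumed; only the checking approach is the point.

```python
from html import escape
from html.parser import HTMLParser


def render_table_html(rows, cols, cells):
    """Stand-in renderer: emits one <td> per grid position, skipping
    positions covered by another cell's rowspan/colspan."""
    by_pos = {(c["row"], c["col"]): c for c in cells}
    covered = set()
    for c in cells:
        for r in range(c["row"], c["row"] + c.get("row_span", 1)):
            for k in range(c["col"], c["col"] + c.get("col_span", 1)):
                if (r, k) != (c["row"], c["col"]):
                    covered.add((r, k))
    parts = ["<table>"]
    for r in range(rows):
        parts.append("<tr>")
        for k in range(cols):
            if (r, k) in covered:
                continue
            cell = by_pos.get((r, k), {})  # missing cell renders as empty <td>
            attrs = ""
            if cell.get("row_span", 1) > 1:
                attrs += f' rowspan="{cell["row_span"]}"'
            if cell.get("col_span", 1) > 1:
                attrs += f' colspan="{cell["col_span"]}"'
            parts.append(f"<td{attrs}>{escape(str(cell.get('content', '')))}</td>")
        parts.append("</tr>")
    parts.append("</table>")
    return "".join(parts)


class TableShapeParser(HTMLParser):
    """Records the effective width of each row (sum of colspans).
    Rowspans are not propagated to later rows in this simple check."""

    def __init__(self):
        super().__init__()
        self.row_widths = []

    def handle_starttag(self, tag, attrs):
        if tag == "tr":
            self.row_widths.append(0)
        elif tag in ("td", "th") and self.row_widths:
            self.row_widths[-1] += int(dict(attrs).get("colspan", 1))
```

A test can feed the HTML to `TableShapeParser` and assert that each entry of `row_widths` equals the expected column count.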
Phase 2: OCR Track Rendering Consistency
2.1 Review convert_unified_document_to_ocr_data
- Verify TableData objects are properly converted to HTML
- Add fallback handling for dict content with 'cells' key
- Log warning if content cannot be converted to HTML
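The fallback described in task 2.1 might look roughly like the helper below. `table_content_to_html` and `_cells_dict_to_html` are hypothetical names, the cell keys are assumed, and the dict renderer is deliberately minimal (it ignores spans).

```python
import logging
from html import escape
from typing import Any, Dict, Optional

logger = logging.getLogger("ocr_table_conversion")


def _cells_dict_to_html(table: Dict[str, Any]) -> str:
    """Minimal renderer for a dict with a 'cells' list of dicts."""
    rows: Dict[int, list] = {}
    for cell in table["cells"]:
        rows.setdefault(cell.get("row", 0), []).append(cell)
    html_rows = []
    for row_index in sorted(rows):
        ordered = sorted(rows[row_index], key=lambda c: c.get("col", 0))
        tds = "".join(
            f"<td>{escape(str(c.get('content', '')))}</td>" for c in ordered
        )
        html_rows.append(f"<tr>{tds}</tr>")
    return "<table>" + "".join(html_rows) + "</table>"


def table_content_to_html(content: Any) -> Optional[str]:
    """Prefer the object's own to_html(); fall back to dict content with
    a 'cells' key; otherwise log a warning and return None."""
    to_html = getattr(content, "to_html", None)
    if callable(to_html):
        return to_html()
    if isinstance(content, dict) and isinstance(content.get("cells"), list):
        return _cells_dict_to_html(content)
    logger.warning(
        "Table content of type %s could not be converted to HTML",
        type(content).__name__,
    )
    return None
```

Returning `None` instead of raising lets the caller decide whether to skip the table or render a placeholder.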
2.2 Review draw_table_region
- Verify HTMLTableParser correctly parses generated HTML
- Check that ReportLab Table is positioned at correct bbox
- Verify font and style application
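When checking the bbox position in task 2.2, one detail worth verifying is the coordinate origin. PDF pages (and ReportLab's `drawOn(canvas, x, y)`) place the origin at the bottom-left, with `x, y` marking the lower-left corner of the drawn object. The sketch below assumes the OCR bbox is `(x0, y0, x1, y1)` with a top-left origin and the same units as the page; both are assumptions, and `bbox_to_pdf_origin` is a hypothetical helper.

```python
from typing import Tuple

BBox = Tuple[float, float, float, float]


def bbox_to_pdf_origin(bbox: BBox, page_height: float) -> Tuple[float, float, float, float]:
    """Convert a top-left-origin bbox to (x, y, width, height), where
    (x, y) is the lower-left corner in bottom-left-origin coordinates."""
    x0, y0, x1, y1 = bbox
    width = x1 - x0
    height = y1 - y0
    # The bbox's bottom edge (y1, measured from the top) becomes the
    # lower-left y once measured from the bottom of the page.
    return x0, page_height - y1, width, height
```

The resulting `width` and `height` could be passed to the table's `wrapOn` call and `x, y` to `drawOn`, if the bbox really uses a top-left origin.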
Phase 3: Testing and Verification
3.1 Test OCR Track
- Test scan.pdf - verify tables have correct structure
- Test img1.png, img2.png, img3.png
- Compare generated PDF with original documents
3.2 Test Direct Track (Regression)
- Test PDF files with Direct track
- Verify table rendering unchanged
3.3 Test Hybrid Mode
- Test files that trigger hybrid processing
- Verify mixed Direct + OCR elements render correctly
Phase 4: Code Quality
4.1 Add logging
- Add debug logging for table content type detection
- Log conversion steps for troubleshooting
4.2 Error handling
- Handle malformed cell data gracefully
- Log warnings for unexpected content formats
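A minimal sketch of the Phase 4 items, assuming cells arrive as a list of dicts with `row`, `col` and `content` keys. `clean_cells` and the logger name are hypothetical; the idea is to skip bad entries with a warning rather than fail the whole table.

```python
import logging
from typing import Any, Dict, List

logger = logging.getLogger("table_conversion")


def clean_cells(raw_cells: Any) -> List[Dict[str, Any]]:
    """Return only well-formed cells, logging a warning for each
    entry that is skipped."""
    if not isinstance(raw_cells, list):
        logger.warning(
            "Expected a list of cells, got %s", type(raw_cells).__name__
        )
        return []
    cleaned = []
    for index, cell in enumerate(raw_cells):
        if not isinstance(cell, dict):
            logger.warning(
                "Skipping cell %d: expected dict, got %s",
                index, type(cell).__name__,
            )
            continue
        try:
            row, col = int(cell["row"]), int(cell["col"])
        except (KeyError, TypeError, ValueError):
            logger.warning(
                "Skipping cell %d: missing or non-integer row/col", index
            )
            continue
        content = cell.get("content")
        cleaned.append(
            {"row": row, "col": col, "content": "" if content is None else str(content)}
        )
    # Debug-level summary to help trace conversion steps.
    logger.debug("Kept %d of %d cells", len(cleaned), len(raw_cells))
    return cleaned
```

Running this before `TableData.from_dict()` would keep one malformed cell from discarding an otherwise usable table.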