OCR/openspec/changes/fix-ocr-track-table-rendering/tasks.md
egg c65df754cf wip: add TableData.from_dict() for OCR track table parsing (incomplete)
Add TableData.from_dict() and TableCell.from_dict() methods to convert
JSON table dicts to proper TableData objects during UnifiedDocument parsing.

Modified _json_to_document_element() to detect TABLE elements with dict
content containing 'cells' key and convert to TableData.

Note: This fix ensures table elements have proper to_html() method available
but the rendered output still needs investigation - tables may still render
incorrectly in OCR track PDFs.

Files changed:
- unified_document.py: Add from_dict() class methods
- pdf_generator_service.py: Convert table dicts during JSON parsing
- Add fix-ocr-track-table-rendering proposal

🤖 Generated with [Claude Code](https://claude.com/claude-code)

Co-Authored-By: Claude <noreply@anthropic.com>
2025-11-26 19:16:51 +08:00


Implementation Tasks

Phase 1: Core Fix - Table Content Conversion

1.1 Add TableData.from_dict() class method

  • In unified_document.py, add from_dict() method to TableData class
  • Handle conversion of cells list (list of dicts) to TableCell objects
  • Preserve rows, cols, headers, caption fields
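
A minimal sketch of what task 1.1 describes, assuming both classes are plain dataclasses. The field names on `TableCell` (`row`, `col`, `content`, `row_span`, `col_span`) and the defaults are assumptions for illustration; the real definitions in unified_document.py may differ.

```python
from dataclasses import dataclass, field
from typing import Any, Dict, List, Optional


@dataclass
class TableCell:
    # Field names are assumed for this sketch, not taken from the project.
    row: int
    col: int
    content: str = ""
    row_span: int = 1
    col_span: int = 1

    @classmethod
    def from_dict(cls, data: Dict[str, Any]) -> "TableCell":
        return cls(
            row=int(data.get("row", 0)),
            col=int(data.get("col", 0)),
            content=str(data.get("content") or ""),
            row_span=int(data.get("row_span", 1)),
            col_span=int(data.get("col_span", 1)),
        )


@dataclass
class TableData:
    rows: int
    cols: int
    cells: List[TableCell] = field(default_factory=list)
    headers: Optional[List[str]] = None
    caption: Optional[str] = None

    @classmethod
    def from_dict(cls, data: Dict[str, Any]) -> "TableData":
        # Convert the list of cell dicts into TableCell objects.
        cells = [TableCell.from_dict(c) for c in data.get("cells", [])]
        return cls(
            rows=int(data.get("rows", 0)),
            cols=int(data.get("cols", 0)),
            cells=cells,
            headers=data.get("headers"),
            caption=data.get("caption"),
        )


table = TableData.from_dict(
    {
        "rows": 1,
        "cols": 2,
        "caption": "Totals",
        "cells": [
            {"row": 0, "col": 0, "content": "Item"},
            {"row": 0, "col": 1, "content": "Price", "col_span": 1},
        ],
    }
)
```

Missing keys fall back to defaults here rather than raising; whether that is the right policy for the real parser is a design decision this sketch does not settle.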

1.2 Fix _json_to_document_element for TABLE elements

  • In pdf_generator_service.py, modify _json_to_document_element
  • When elem_type == ElementType.TABLE and content is a dict containing a 'cells' key, convert it to TableData
  • Use TableData.from_dict() for clean conversion
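
The detection step in task 1.2 could look roughly like the following. This is a standalone sketch: the `ElementType` members, the trimmed-down `TableData`, and the dict returned by `json_to_document_element` are stand-ins invented here, not the actual signatures in pdf_generator_service.py.

```python
from dataclasses import dataclass
from enum import Enum
from typing import Any, Dict, List


class ElementType(Enum):
    # Stand-in enum; the project's ElementType likely has more members.
    TEXT = "text"
    TABLE = "table"


@dataclass
class TableData:
    # Trimmed-down stand-in for the real TableData class.
    rows: int
    cols: int
    cells: List[Dict[str, Any]]

    @classmethod
    def from_dict(cls, data: Dict[str, Any]) -> "TableData":
        return cls(
            rows=data.get("rows", 0),
            cols=data.get("cols", 0),
            cells=list(data.get("cells", [])),
        )


def json_to_document_element(elem: Dict[str, Any]) -> Dict[str, Any]:
    """Convert table dicts to TableData; leave all other content untouched."""
    elem_type = ElementType(elem["type"])
    content = elem.get("content")
    if (
        elem_type == ElementType.TABLE
        and isinstance(content, dict)
        and "cells" in content
    ):
        content = TableData.from_dict(content)
    return {"type": elem_type, "content": content}


table_elem = json_to_document_element(
    {"type": "table", "content": {"rows": 1, "cols": 1, "cells": [{"row": 0, "col": 0}]}}
)
text_elem = json_to_document_element({"type": "text", "content": "hello"})
html_elem = json_to_document_element({"type": "table", "content": "<table></table>"})
```

Only a dict carrying a 'cells' key is converted, so table content that is already an HTML string passes through unchanged.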

1.3 Verify TableData.to_html() generates correct HTML

  • Test that to_html() produces parseable HTML with proper row/cell structure
  • Verify colspan/rowspan attributes are correctly generated
  • Ensure empty cells are properly handled
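
One way to run the checks in task 1.3 is to render a small table and parse the result back with the standard library. `cells_to_html` below is a hypothetical stand-in for `TableData.to_html()`, and the cell keys are assumed; the point is the shape of the test, not the project's real output.

```python
from html import escape
from html.parser import HTMLParser
from typing import Any, Dict, List


def cells_to_html(cells: List[Dict[str, Any]]) -> str:
    """Hypothetical stand-in for TableData.to_html()."""
    by_row: Dict[int, List[Dict[str, Any]]] = {}
    for cell in cells:
        by_row.setdefault(cell["row"], []).append(cell)

    parts = ["<table>"]
    for row_idx in sorted(by_row):
        parts.append("<tr>")
        for cell in sorted(by_row[row_idx], key=lambda c: c["col"]):
            row_span = cell.get("row_span", 1)
            col_span = cell.get("col_span", 1)
            attrs = ""
            # Emit span attributes only when they differ from the default of 1.
            if row_span > 1:
                attrs += f' rowspan="{row_span}"'
            if col_span > 1:
                attrs += f' colspan="{col_span}"'
            # Empty or None content becomes an empty <td>, not a missing cell.
            text = escape(str(cell.get("content") or ""))
            parts.append(f"<td{attrs}>{text}</td>")
        parts.append("</tr>")
    parts.append("</table>")
    return "".join(parts)


class TagCounter(HTMLParser):
    """Counts rows and cells and records each cell's attributes."""

    def __init__(self) -> None:
        super().__init__()
        self.rows = 0
        self.cell_attrs: List[Dict[str, Any]] = []

    def handle_starttag(self, tag, attrs):
        if tag == "tr":
            self.rows += 1
        elif tag == "td":
            self.cell_attrs.append(dict(attrs))


html_out = cells_to_html(
    [
        {"row": 0, "col": 0, "content": "Total", "col_span": 2},
        {"row": 1, "col": 0, "content": "A & B"},
        {"row": 1, "col": 1, "content": None},
    ]
)
counter = TagCounter()
counter.feed(html_out)
```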

Phase 2: OCR Track Rendering Consistency

2.1 Review convert_unified_document_to_ocr_data

  • Verify TableData objects are properly converted to HTML
  • Add fallback handling for dict content with 'cells' key
  • Log warning if content cannot be converted to HTML
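
The fallback chain in task 2.1 might be sketched as below. `table_content_to_html`, the logger name, and the cell keys are assumptions made for illustration; the real logic lives inside convert_unified_document_to_ocr_data and may be organized differently.

```python
import logging
from html import escape
from typing import Any, Dict, List, Optional

logger = logging.getLogger("ocr_track_sketch")  # hypothetical logger name


def _cells_dict_to_html(content: Dict[str, Any]) -> str:
    """Fallback renderer for raw dict content that carries a 'cells' key."""
    by_row: Dict[int, List[Dict[str, Any]]] = {}
    for cell in content["cells"]:
        by_row.setdefault(cell.get("row", 0), []).append(cell)
    rows = []
    for row_idx in sorted(by_row):
        cells = sorted(by_row[row_idx], key=lambda c: c.get("col", 0))
        tds = "".join(
            "<td>{}</td>".format(escape(str(c.get("content") or ""))) for c in cells
        )
        rows.append("<tr>" + tds + "</tr>")
    return "<table>" + "".join(rows) + "</table>"


def table_content_to_html(content: Any) -> Optional[str]:
    # 1. Preferred path: an object that exposes to_html().
    to_html = getattr(content, "to_html", None)
    if callable(to_html):
        return to_html()
    # 2. Fallback: an unconverted dict with a 'cells' key.
    if isinstance(content, dict) and "cells" in content:
        return _cells_dict_to_html(content)
    # 3. Content that is already an HTML string.
    if isinstance(content, str):
        return content
    # 4. Nothing usable: warn and let the caller decide what to do.
    logger.warning(
        "Cannot convert table content of type %s to HTML", type(content).__name__
    )
    return None


class FakeTable:
    """Stands in for a TableData object in this sketch."""

    def to_html(self) -> str:
        return "<table><tr><td>x</td></tr></table>"


from_object = table_content_to_html(FakeTable())
from_dict = table_content_to_html({"cells": [{"row": 0, "col": 0, "content": "y"}]})
from_unknown = table_content_to_html(42)
```

Returning `None` instead of raising keeps one bad table from aborting the whole page, at the cost of the caller having to check for it.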

2.2 Review draw_table_region

  • Verify HTMLTableParser correctly parses generated HTML
  • Check that the ReportLab Table is positioned at the correct bbox
  • Verify font and style application
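
For the review in task 2.2, a small reference parser and a coordinate helper can serve as a cross-check. `SimpleTableParser` is a hypothetical stdlib-only stand-in, not the project's HTMLTableParser. The bbox helper assumes OCR boxes are `(x0, y0, x1, y1)` with a top-left origin, which is an assumption about this pipeline; ReportLab's canvas origin is the bottom-left corner, so the y coordinate is flipped.

```python
from html.parser import HTMLParser
from typing import Any, Dict, List, Optional, Tuple


class SimpleTableParser(HTMLParser):
    """Collects table rows as lists of {'text', 'colspan', 'rowspan'} dicts."""

    def __init__(self) -> None:
        super().__init__()
        self.rows: List[List[Dict[str, Any]]] = []
        self._row: Optional[List[Dict[str, Any]]] = None
        self._cell: Optional[Dict[str, Any]] = None

    def handle_starttag(self, tag, attrs):
        if tag == "tr":
            self._row = []
        elif tag in ("td", "th") and self._row is not None:
            a = dict(attrs)
            self._cell = {
                "text": "",
                "colspan": int(a.get("colspan") or 1),
                "rowspan": int(a.get("rowspan") or 1),
            }

    def handle_data(self, data):
        if self._cell is not None:
            self._cell["text"] += data

    def handle_endtag(self, tag):
        if tag in ("td", "th") and self._cell is not None:
            self._cell["text"] = self._cell["text"].strip()
            self._row.append(self._cell)
            self._cell = None
        elif tag == "tr" and self._row is not None:
            self.rows.append(self._row)
            self._row = None


def bbox_to_draw_origin(
    bbox: Tuple[float, float, float, float], page_height: float
) -> Tuple[float, float, float, float]:
    """Map a top-left-origin bbox to (x, y, width, height) with a bottom-left origin."""
    x0, y0, x1, y1 = bbox
    return x0, page_height - y1, x1 - x0, y1 - y0


parser = SimpleTableParser()
parser.feed(
    '<table><tr><th colspan="2">Total</th></tr>'
    "<tr><td>A &amp; B</td><td></td></tr></table>"
)
origin = bbox_to_draw_origin((50, 100, 250, 180), page_height=800)
```

The returned `x, y` pair is the lower-left corner of the box, which is the kind of position a ReportLab flowable's `drawOn(canvas, x, y)` expects after `wrapOn` has computed its size.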

Phase 3: Testing and Verification

3.1 Test OCR Track

  • Test scan.pdf - verify tables have correct structure
  • Test img1.png, img2.png, img3.png
  • Compare generated PDF with original documents

3.2 Test Direct Track (Regression)

  • Test PDF files with Direct track
  • Verify table rendering is unchanged

3.3 Test Hybrid Mode

  • Test files that trigger hybrid processing
  • Verify mixed Direct + OCR elements render correctly

Phase 4: Code Quality

4.1 Add logging

  • Add debug logging for table content type detection
  • Log conversion steps for troubleshooting

4.2 Error handling

  • Handle malformed cell data gracefully
  • Log warnings for unexpected content formats
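
The logging and error-handling tasks in Phase 4 could be combined along these lines. The function names, the logger name, and the required `row`/`col` keys are assumptions for this sketch; it skips bad cells rather than failing the whole table, which is one possible reading of "gracefully".

```python
import logging
from typing import Any, Dict, List, Optional

logger = logging.getLogger("table_parsing_sketch")  # hypothetical logger name


def parse_cell(raw: Any) -> Optional[Dict[str, Any]]:
    """Return a normalized cell dict, or None if the input is malformed."""
    if not isinstance(raw, dict):
        logger.warning("Skipping cell with unexpected type %s", type(raw).__name__)
        return None
    try:
        return {
            "row": int(raw["row"]),
            "col": int(raw["col"]),
            "content": str(raw.get("content") or ""),
        }
    except (KeyError, TypeError, ValueError) as exc:
        # KeyError: missing row/col; TypeError/ValueError: non-numeric index.
        logger.warning("Skipping malformed cell %r: %s", raw, exc)
        return None


def parse_cells(raw_cells: Any) -> List[Dict[str, Any]]:
    logger.debug("Table cells container type: %s", type(raw_cells).__name__)
    if not isinstance(raw_cells, list):
        logger.warning(
            "Expected a list of cells, got %s", type(raw_cells).__name__
        )
        return []
    parsed = [c for c in map(parse_cell, raw_cells) if c is not None]
    logger.debug("Converted %d of %d cells", len(parsed), len(raw_cells))
    return parsed


good_and_bad = parse_cells(
    [
        {"row": 0, "col": 0, "content": "ok"},
        "not a dict",
        {"row": "x", "col": 1},
        {"col": 1},
        {"row": None, "col": 2},
    ]
)
not_a_list = parse_cells({"row": 0})
```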