OCR/tasks.md at c65df754cf755bb41bf93f1058d7b79243b9ce84

egg c65df754cf wip: add TableData.from_dict() for OCR track table parsing (incomplete)

Add TableData.from_dict() and TableCell.from_dict() methods to convert
JSON table dicts to proper TableData objects during UnifiedDocument parsing.

Modified _json_to_document_element() to detect TABLE elements whose dict
content contains a 'cells' key and convert that content to TableData.
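The conversion described above could look roughly like the following. This is a minimal sketch, not the code in unified_document.py: the field names (`row`, `col`, `content`, `row_span`, `col_span`, `rows`, `cols`) and the helper `convert_table_content` are assumptions made for illustration.

```python
from dataclasses import dataclass, field
from typing import Any, Dict, List


@dataclass
class TableCell:
    # Field names are hypothetical; the real class may differ.
    row: int
    col: int
    content: str = ""
    row_span: int = 1
    col_span: int = 1

    @classmethod
    def from_dict(cls, data: Dict[str, Any]) -> "TableCell":
        # Missing keys fall back to defaults instead of raising.
        return cls(
            row=int(data.get("row", 0)),
            col=int(data.get("col", 0)),
            content=str(data.get("content", "")),
            row_span=int(data.get("row_span", 1)),
            col_span=int(data.get("col_span", 1)),
        )


@dataclass
class TableData:
    rows: int
    cols: int
    cells: List[TableCell] = field(default_factory=list)

    @classmethod
    def from_dict(cls, data: Dict[str, Any]) -> "TableData":
        cells = [TableCell.from_dict(c) for c in data.get("cells", [])]
        # If the dict carries no dimensions, infer them from cell extents.
        inferred_rows = max((c.row + c.row_span for c in cells), default=0)
        inferred_cols = max((c.col + c.col_span for c in cells), default=0)
        return cls(
            rows=int(data.get("rows", inferred_rows)),
            cols=int(data.get("cols", inferred_cols)),
            cells=cells,
        )


def convert_table_content(content: Any) -> Any:
    """Hypothetical helper: convert dict content with a 'cells' key."""
    if isinstance(content, dict) and "cells" in content:
        return TableData.from_dict(content)
    return content  # anything else passes through unchanged
```

Content that is not a dict, or a dict without 'cells', is returned untouched, so non-table elements are unaffected.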

Note: This fix ensures table elements have a proper to_html() method
available, but the rendered output still needs investigation; tables may
still render incorrectly in OCR track PDFs.

Files changed:
- unified_document.py: Add from_dict() class methods
- pdf_generator_service.py: Convert table dicts during JSON parsing
- Add fix-ocr-track-table-rendering proposal

🤖 Generated with [Claude Code](https://claude.com/claude-code)

Co-Authored-By: Claude <noreply@anthropic.com>

Implementation Tasks

Phase 1: Core Fix - Table Content Conversion

1.1 Add TableData.from_dict() class method

1.2 Fix _json_to_document_element for TABLE elements

1.3 Verify TableData.to_html() generates correct HTML
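For task 1.3, a small reference renderer can make "correct HTML" concrete. The sketch below is an assumption about what the output should look like, not the project's `to_html()` implementation. The function name `table_to_html` and the cell keys are hypothetical, and grid positions with no cell are assumed to be covered by a neighbouring span.

```python
from html import escape
from typing import Any, Dict, List


def table_to_html(rows: int, cols: int, cells: List[Dict[str, Any]]) -> str:
    """Render cells as a plain HTML table (hypothetical reference renderer)."""
    grid = {(c["row"], c["col"]): c for c in cells}
    parts = ["<table>"]
    for r in range(rows):
        parts.append("<tr>")
        for c in range(cols):
            cell = grid.get((r, c))
            if cell is None:
                continue  # assumed to be covered by a row or column span
            attrs = ""
            if cell.get("row_span", 1) > 1:
                attrs += f' rowspan="{cell["row_span"]}"'
            if cell.get("col_span", 1) > 1:
                attrs += f' colspan="{cell["col_span"]}"'
            # Escape cell text so OCR output cannot break the markup.
            text = escape(str(cell.get("content", "")))
            parts.append(f"<td{attrs}>{text}</td>")
        parts.append("</tr>")
    parts.append("</table>")
    return "".join(parts)
```

A check for this task could then compare the output for a small table with one spanning cell against the expected string.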

Phase 2: OCR Track Rendering Consistency

2.1 Review convert_unified_document_to_ocr_data

2.2 Review draw_table_region

Phase 3: Testing and Verification

3.1 Test OCR Track

3.2 Test Direct Track (Regression)

3.3 Test Hybrid Mode

Phase 4: Code Quality

4.1 Add logging

4.2 Error handling
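Tasks 4.1 and 4.2 could be combined in a thin wrapper around the conversion step. This is a sketch under assumptions: `safe_convert_table`, its `converter` argument, and the `element_id` parameter are hypothetical names, and falling back to the raw dict on failure is one possible policy rather than the project's chosen one.

```python
import logging
from typing import Any, Callable

logger = logging.getLogger(__name__)


def safe_convert_table(
    content: Any,
    converter: Callable[[dict], Any],
    element_id: str = "?",
) -> Any:
    """Convert table dict content, logging and falling back on failure."""
    if not (isinstance(content, dict) and "cells" in content):
        return content  # not a table dict; nothing to convert
    try:
        table = converter(content)
    except (KeyError, TypeError, ValueError) as exc:
        # Keep the raw dict so later stages can still decide what to do.
        logger.warning(
            "Table conversion failed for element %s: %s", element_id, exc
        )
        return content
    logger.debug("Converted table content for element %s", element_id)
    return table
```

Passing the converter in as an argument keeps the wrapper independent of the concrete `TableData` class, which also makes the failure path easy to test.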
