egg/OCR

egg c65df754cf wip: add TableData.from_dict() for OCR track table parsing (incomplete)

Add TableData.from_dict() and TableCell.from_dict() methods to convert
JSON table dicts to proper TableData objects during UnifiedDocument parsing.

Modified _json_to_document_element() to detect TABLE elements with dict
content containing 'cells' key and convert to TableData.

Note: This fix ensures table elements have proper to_html() method available
but the rendered output still needs investigation - tables may still render
incorrectly in OCR track PDFs.

Files changed:
- unified_document.py: Add from_dict() class methods
- pdf_generator_service.py: Convert table dicts during JSON parsing
- Add fix-ocr-track-table-rendering proposal

🤖 Generated with [Claude Code](https://claude.com/claude-code)

Co-Authored-By: Claude <noreply@anthropic.com>

2025-11-26 19:16:51 +08:00

1.9 KiB

PDF Generation - OCR Track Table Rendering Fix

MODIFIED Requirements

Requirement: OCR Track Table Content Conversion

The PDF generator MUST properly convert table content from JSON dict format to renderable structure when processing OCR track results.

Scenario: Table dict with cells array converts to proper HTML

Given OCR track JSON with a table element containing rows, cols, and a cells array
When the PDF generator processes this element
Then the table content MUST be converted to a TableData object
And TableData.to_html() MUST produce valid HTML with proper tr/td structure
And the generated PDF table MUST have cells positioned in the correct grid locations
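
A minimal sketch of the conversion this scenario describes, assuming a simple cell schema. The class and method names (TableData, TableCell, from_dict) come from the commit message; the field names (row, col, content, row_span, col_span) and the helper convert_table_content() are hypothetical and may not match unified_document.py.

```python
from dataclasses import dataclass, field
from typing import Any, Dict, List


@dataclass
class TableCell:
    # Field names are assumptions for illustration, not the project's schema.
    row: int
    col: int
    content: str = ""
    row_span: int = 1
    col_span: int = 1

    @classmethod
    def from_dict(cls, data: Dict[str, Any]) -> "TableCell":
        # Missing keys fall back to a 1x1 empty cell at the origin.
        return cls(
            row=int(data.get("row", 0)),
            col=int(data.get("col", 0)),
            content=str(data.get("content", "")),
            row_span=int(data.get("row_span", 1)),
            col_span=int(data.get("col_span", 1)),
        )


@dataclass
class TableData:
    rows: int
    cols: int
    cells: List[TableCell] = field(default_factory=list)

    @classmethod
    def from_dict(cls, data: Dict[str, Any]) -> "TableData":
        cells = [TableCell.from_dict(c) for c in data.get("cells", [])]
        return cls(
            rows=int(data.get("rows", 0)),
            cols=int(data.get("cols", 0)),
            cells=cells,
        )


def convert_table_content(content: Any) -> Any:
    """Hypothetical helper: convert a dict with a 'cells' key to TableData.

    Anything else (for example an HTML string or an existing object) is
    returned unchanged.
    """
    if isinstance(content, dict) and "cells" in content:
        return TableData.from_dict(content)
    return content
```

Passing non-dict content through untouched is one way to avoid disturbing elements that are already in a renderable form.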

Scenario: Table with rowspan/colspan renders correctly

Given a table element with cells having rowspan > 1 or colspan > 1
When the PDF generator renders the table
Then merged cells MUST span the correct number of rows/columns
And content MUST appear in the merged cell position
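
The merged-cell behaviour can be sketched as follows. This is an illustration only: the function table_to_html() and the tuple layout for cells are hypothetical, not the project's to_html() implementation. Grid positions covered by a merged cell are skipped so that each row still adds up to the declared column count.

```python
from html import escape
from typing import Dict, List, Set, Tuple

# Each cell is (row, col, row_span, col_span, text); this layout is hypothetical.
Cell = Tuple[int, int, int, int, str]


def table_to_html(rows: int, cols: int, cells: List[Cell]) -> str:
    # Index cells by their top-left (anchor) grid position.
    anchors: Dict[Tuple[int, int], Cell] = {(c[0], c[1]): c for c in cells}

    # Collect positions hidden underneath a merged cell.
    covered: Set[Tuple[int, int]] = set()
    for r, c, rs, cs, _ in cells:
        for dr in range(rs):
            for dc in range(cs):
                if (dr, dc) != (0, 0):
                    covered.add((r + dr, c + dc))

    parts = ["<table>"]
    for r in range(rows):
        parts.append("<tr>")
        for c in range(cols):
            if (r, c) in covered:
                continue  # a merged cell already occupies this slot
            cell = anchors.get((r, c))
            if cell is None:
                parts.append("<td></td>")  # keep the grid rectangular
                continue
            _, _, rs, cs, text = cell
            attrs = ""
            if rs > 1:
                attrs += f' rowspan="{rs}"'
            if cs > 1:
                attrs += f' colspan="{cs}"'
            parts.append(f"<td{attrs}>{escape(text)}</td>")
        parts.append("</tr>")
    parts.append("</table>")
    return "".join(parts)
```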

Requirement: Table Visual Fidelity

The PDF generator MUST render OCR track tables with visual structure matching the original document.

Scenario: Table renders with grid lines

Given an OCR track table element
When rendered to PDF
Then the table MUST have visible grid lines/borders
And cell boundaries MUST be clearly defined

Scenario: Table text alignment preserved

Given an OCR track table with cell content
When rendered to PDF
Then text MUST be positioned within the correct cell boundaries
And text MUST NOT overflow into adjacent cells
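
One way to check the two visual-fidelity scenarios in a test is to derive each cell's rectangle from the column widths and row heights, then verify that the text box lies inside it. The sketch below assumes a coordinate system with the origin at the table's top-left corner; cell_rect() and fits_inside() are hypothetical helpers, not part of pdf_generator_service.py.

```python
from itertools import accumulate
from typing import List, Tuple

# (x0, y0, x1, y1) with the origin at the table's top-left corner (an assumption).
Rect = Tuple[float, float, float, float]


def cell_rect(col_widths: List[float], row_heights: List[float],
              row: int, col: int, row_span: int = 1, col_span: int = 1) -> Rect:
    # Cumulative sums give the position of every vertical and horizontal grid line.
    xs = [0.0, *accumulate(col_widths)]
    ys = [0.0, *accumulate(row_heights)]
    return (xs[col], ys[row], xs[col + col_span], ys[row + row_span])


def fits_inside(text_box: Rect, cell: Rect, padding: float = 0.0) -> bool:
    # True when the text box stays within the cell, leaving `padding` on every side.
    tx0, ty0, tx1, ty1 = text_box
    cx0, cy0, cx1, cy1 = cell
    return (tx0 >= cx0 + padding and ty0 >= cy0 + padding
            and tx1 <= cx1 - padding and ty1 <= cy1 - padding)
```

The same rectangles also give the edges along which grid lines would be drawn, so a merged cell has a border only around its outer boundary.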

Requirement: Backward Compatibility with Hybrid Mode

The table rendering fix MUST NOT break hybrid mode processing.

Scenario: Hybrid mode tables render correctly

Given a document processed with hybrid mode combining Direct and OCR tracks
When the PDF is generated
Then Direct track tables MUST render with existing quality
And OCR track tables MUST render with improved quality
And there MUST be no regression in table positioning or content
