egg/OCR

egg c65df754cf wip: add TableData.from_dict() for OCR track table parsing (incomplete)

Add TableData.from_dict() and TableCell.from_dict() methods to convert
JSON table dicts to proper TableData objects during UnifiedDocument parsing.

Modified _json_to_document_element() to detect TABLE elements with dict
content containing 'cells' key and convert to TableData.

Note: This fix ensures table elements have proper to_html() method available
but the rendered output still needs investigation - tables may still render
incorrectly in OCR track PDFs.

Files changed:
- unified_document.py: Add from_dict() class methods
- pdf_generator_service.py: Convert table dicts during JSON parsing
- Add fix-ocr-track-table-rendering proposal

🤖 Generated with [Claude Code](https://claude.com/claude-code)

Co-Authored-By: Claude <noreply@anthropic.com>

2025-11-26 19:16:51 +08:00

4.3 KiB


Fix OCR Track Table Rendering

Summary

OCR track PDF generation produces tables with incorrect format and layout. Tables appear without proper structure - cell content is misaligned and the visual format differs significantly from the original document. Image placement is correct, but table rendering is broken.

Problem Statement

When generating a PDF from OCR track results (scan.pdf processed by PP-StructureV3), the output tables show three problems:

- Wrong cell alignment: content is not positioned in the proper cells
- Missing table structure: rows and columns don't match the original document layout
- Incorrect content distribution: content appears to flow linearly instead of keeping its grid structure

Reference: backend/storage/results/af7c9ee8-60a0-4291-9f22-ef98d27eed52/

Original: af7c9ee8-60a0-4291-9f22-ef98d27eed52_scan_page_1.png
Generated: scan_layout.pdf
Result JSON: scan_result.json - Tables have correct {rows, cols, cells} structure

Root Cause Analysis

Issue 1: Table Content Not Converted to TableData Object

In _json_to_document_element (pdf_generator_service.py:1952):

element = DocumentElement(
    ...
    content=elem_dict.get('content', ''),  # Raw dict, not TableData
    ...
)

Table elements carry their content as a dict such as {rows: 5, cols: 4, cells: [...]}, but this dict is never converted to a TableData object, so TableData methods such as to_html() are not available on it.

Issue 2: OCR Track HTML Conversion Fails

In convert_unified_document_to_ocr_data (pdf_generator_service.py:464-467):

elif isinstance(element.content, dict):
    html_content = element.content.get('html', str(element.content))

Because the cells-based dict has no 'html' key, the lookup falls back to str(element.content), which yields the Python repr "{'rows': 5, 'cols': 4, ...}" rather than valid HTML.
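
The fallback can be reproduced in isolation. The following is a minimal sketch, not code from the service: the per-cell keys (row, col, content) are assumptions made for illustration, and only the dict.get() fallback mirrors the snippet above.

```python
# Hypothetical table content in the cells-based shape described above.
# The per-cell keys (row, col, content) are assumed for illustration only.
content = {
    "rows": 1,
    "cols": 2,
    "cells": [
        {"row": 0, "col": 0, "content": "Name"},
        {"row": 0, "col": 1, "content": "Qty"},
    ],
}

# Same fallback pattern as in convert_unified_document_to_ocr_data:
# there is no 'html' key, so the default str(content) is returned.
html_content = content.get("html", str(content))

# The result is the Python repr of the dict, and it contains no table markup.
is_repr = html_content.startswith("{'rows': 1, 'cols': 2")
has_table_tag = "<table" in html_content
print(is_repr, has_table_tag)
```

Any renderer that expects an HTML string receives the dict repr instead, which is consistent with the linear, unstructured output described in the problem statement.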

Issue 3: Different Table Rendering Paths

- Direct track uses _draw_table_element_direct, which properly handles a dict with cells via _build_rows_from_cells_dict
- OCR track uses draw_table_region, which expects HTML strings and fails with dict content

Proposed Solution

Option A: Convert dict to TableData during JSON loading (Recommended)

In _json_to_document_element, when element type is TABLE and content is a dict with cells, convert it to a TableData object:

# For TABLE elements, convert dict to TableData
if elem_type == ElementType.TABLE and isinstance(content, dict) and 'cells' in content:
    content = self._dict_to_table_data(content)

This ensures element.content.to_html() works correctly in convert_unified_document_to_ocr_data.
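
One way the conversion could look is sketched below. This is a simplified stand-in, not the actual classes in unified_document.py: the field names (row, col, content, row_span, col_span) and the to_html() output format are assumptions, and the real TableData and TableCell may carry more fields.

```python
from dataclasses import dataclass, field
from html import escape
from typing import Any, Dict, List


@dataclass
class TableCell:
    # Assumed fields; the real model may differ.
    row: int
    col: int
    content: str = ""
    row_span: int = 1
    col_span: int = 1

    @classmethod
    def from_dict(cls, data: Dict[str, Any]) -> "TableCell":
        # Missing keys fall back to neutral defaults instead of raising.
        return cls(
            row=int(data.get("row", 0)),
            col=int(data.get("col", 0)),
            content=str(data.get("content", "")),
            row_span=int(data.get("row_span", 1)),
            col_span=int(data.get("col_span", 1)),
        )


@dataclass
class TableData:
    rows: int
    cols: int
    cells: List[TableCell] = field(default_factory=list)

    @classmethod
    def from_dict(cls, data: Dict[str, Any]) -> "TableData":
        cells = [TableCell.from_dict(c) for c in data.get("cells", [])]
        return cls(
            rows=int(data.get("rows", 0)),
            cols=int(data.get("cols", 0)),
            cells=cells,
        )

    def to_html(self) -> str:
        # Group cells by row index, ordered by column within each row.
        by_row: Dict[int, List[TableCell]] = {}
        for cell in sorted(self.cells, key=lambda c: (c.row, c.col)):
            by_row.setdefault(cell.row, []).append(cell)

        parts = ["<table>"]
        for r in range(self.rows):
            parts.append("<tr>")
            for cell in by_row.get(r, []):
                attrs = ""
                if cell.row_span > 1:
                    attrs += f' rowspan="{cell.row_span}"'
                if cell.col_span > 1:
                    attrs += f' colspan="{cell.col_span}"'
                parts.append(f"<td{attrs}>{escape(cell.content)}</td>")
            parts.append("</tr>")
        parts.append("</table>")
        return "".join(parts)


raw = {
    "rows": 1,
    "cols": 2,
    "cells": [
        {"row": 0, "col": 0, "content": "Name"},
        {"row": 0, "col": 1, "content": "Qty"},
    ],
}
table = TableData.from_dict(raw)
print(table.to_html())
```

Converting at load time keeps the rest of the pipeline unchanged: code that already calls to_html() on a TableData needs no knowledge of the JSON shape.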

Option B: Fix conversion in convert_unified_document_to_ocr_data

Handle dict with cells properly by converting to HTML:

elif isinstance(element.content, dict):
    if 'cells' in element.content:
        # Convert cells-based dict to HTML
        html_content = self._cells_dict_to_html(element.content)
    elif 'html' in element.content:
        html_content = element.content['html']
    else:
        html_content = str(element.content)
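
The _cells_dict_to_html helper named above is not defined in this proposal. A minimal sketch of what it might do follows, written as free functions so it runs on its own. The per-cell keys (row, col, content) are assumptions, and row/column spans are deliberately ignored to keep the example short.

```python
from html import escape
from typing import Any, Dict


def cells_dict_to_html(table: Dict[str, Any]) -> str:
    """Build a plain HTML table from a {rows, cols, cells} dict (assumed schema)."""
    rows = int(table.get("rows", 0))
    cols = int(table.get("cols", 0))

    # Start from an empty grid so positions without a cell still produce a <td>.
    grid = [["" for _ in range(cols)] for _ in range(rows)]
    for cell in table.get("cells", []):
        r = int(cell.get("row", -1))
        c = int(cell.get("col", -1))
        if 0 <= r < rows and 0 <= c < cols:  # skip out-of-range cells
            grid[r][c] = escape(str(cell.get("content", "")))

    body = "".join(
        "<tr>" + "".join(f"<td>{text}</td>" for text in row) + "</tr>"
        for row in grid
    )
    return f"<table>{body}</table>"


def table_content_to_html(content: Dict[str, Any]) -> str:
    """Same three-way dispatch as the Option B snippet above."""
    if "cells" in content:
        return cells_dict_to_html(content)
    if "html" in content:
        return content["html"]
    return str(content)


example = {
    "rows": 2,
    "cols": 2,
    "cells": [
        {"row": 0, "col": 0, "content": "A"},
        {"row": 1, "col": 1, "content": "<b>"},
    ],
}
print(table_content_to_html(example))
```

Filling the grid first keeps every row at the declared column count even when the OCR result omits empty cells, which is the grid structure the OCR track currently loses.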

Impact on Hybrid Mode

Hybrid mode uses Direct track rendering (_generate_direct_track_pdf) which already handles dict content properly via _build_rows_from_cells_dict. The proposed fixes should not affect hybrid mode negatively.

However, testing should verify:

- Hybrid mode continues to work with combined Direct + OCR elements
- Table rendering quality is consistent across all tracks

Success Criteria

- OCR track tables render with the correct structure, matching the original document
- Cell content is positioned in the proper grid locations
- Table borders/grid lines are visible
- No regression in Direct track or Hybrid mode table rendering
- All test files (scan.pdf, img1.png, img2.png, img3.png) produce correct output

Files to Modify

backend/app/services/pdf_generator_service.py
- _json_to_document_element: Convert table dict to TableData
- convert_unified_document_to_ocr_data: Improve dict handling (if Option B)
backend/app/models/unified_document.py (optional)
- Add TableData.from_dict() class method for cleaner conversion

Testing Plan

- Test scan.pdf with OCR track and verify that the table structure matches the original
- Test img1.png, img2.png, img3.png with OCR track
- Test PDF files with Direct track and verify there is no regression
- Test Hybrid mode with files that trigger OCR fallback
