OCR/openspec/changes/archive/2025-11-27-fix-ocr-track-table-rendering/proposal.md
egg 59206a6ab8 feat: simplify layout model selection and archive proposals
Changes:
- Replace PP-Structure 7-slider parameter UI with simple 3-option layout model selector
- Add layout model mapping: chinese (PP-DocLayout-S), default (PubLayNet), cdla
- Add LayoutModelSelector component and zh-TW translations
- Fix "default" model behavior with sentinel value for PubLayNet
- Add gap filling service for OCR track coverage improvement
- Add PP-Structure debug utilities
- Archive completed/incomplete proposals:
  - add-ocr-track-gap-filling (complete)
  - fix-ocr-track-table-rendering (incomplete)
  - simplify-ppstructure-model-selection (22/25 tasks)
- Add new layout model tests, archive old PP-Structure param tests
- Update OpenSpec ocr-processing spec with layout model requirements

🤖 Generated with [Claude Code](https://claude.com/claude-code)

Co-Authored-By: Claude <noreply@anthropic.com>
2025-11-27 13:27:00 +08:00


Fix OCR Track Table Rendering

Summary

OCR track PDF generation produces tables with incorrect format and layout. Tables appear without proper structure - cell content is misaligned and the visual format differs significantly from the original document. Image placement is correct, but table rendering is broken.

Problem Statement

When generating a PDF from OCR track results (for example, scan.pdf processed by PP-StructureV3), the output tables show three defects:

  1. Wrong cell alignment - content not positioned in proper cells
  2. Missing table structure - rows/columns don't match original document layout
  3. Incorrect content distribution - all content seems to flow linearly instead of maintaining grid structure

Reference: backend/storage/results/af7c9ee8-60a0-4291-9f22-ef98d27eed52/

  • Original: af7c9ee8-60a0-4291-9f22-ef98d27eed52_scan_page_1.png
  • Generated: scan_layout.pdf
  • Result JSON: scan_result.json - tables carry the correct {rows, cols, cells} structure, so the defect is in rendering rather than in recognition

Root Cause Analysis

Issue 1: Table Content Not Converted to TableData Object

In _json_to_document_element (pdf_generator_service.py:1952):

element = DocumentElement(
    ...
    content=elem_dict.get('content', ''),  # Raw dict, not TableData
    ...
)

Table elements have content as a dict {rows: 5, cols: 4, cells: [...]} but it's not converted to a TableData object.

Issue 2: OCR Track HTML Conversion Fails

In convert_unified_document_to_ocr_data (pdf_generator_service.py:464-467):

elif isinstance(element.content, dict):
    html_content = element.content.get('html', str(element.content))

The cells-based dict has no 'html' key, so the lookup falls back to str(element.content), which yields the Python repr "{'rows': 5, 'cols': 4, ...}". That string is not HTML, so the downstream table renderer cannot parse it.
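The fallback can be reproduced in isolation. The sketch below uses a made-up cells-based dict (the key names inside each cell are assumptions, not the actual scan_result.json schema) and applies the same expression as the snippet above:

```python
# Hypothetical cells-based table content; per-cell key names are assumed.
content = {
    "rows": 1,
    "cols": 2,
    "cells": [
        {"row": 0, "col": 0, "content": "Name"},
        {"row": 0, "col": 1, "content": "Qty"},
    ],
}

# Same expression as in convert_unified_document_to_ocr_data: there is no
# 'html' key, so dict.get returns the default, which is the dict's repr.
html_content = content.get("html", str(content))

print(html_content.startswith("{'rows': 1"))  # True: a Python dict repr
print("<table" in html_content)               # False: no table markup at all
```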

Issue 3: Different Table Rendering Paths

  • Direct track uses _draw_table_element_direct which properly handles dict with cells via _build_rows_from_cells_dict
  • OCR track uses draw_table_region which expects HTML strings and fails with dict content

Proposed Solution

Option A: Convert dict to TableData in _json_to_document_element

In _json_to_document_element, when the element type is TABLE and the content is a dict with a cells key, convert it to a TableData object:

# For TABLE elements, convert dict to TableData
if elem_type == ElementType.TABLE and isinstance(content, dict) and 'cells' in content:
    content = self._dict_to_table_data(content)

This ensures element.content.to_html() works correctly in convert_unified_document_to_ocr_data.
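A minimal sketch of what the conversion helper might look like. The TableCell and TableData classes here are hypothetical stand-ins (the real models live in unified_document.py and may differ), and the per-cell keys row, col, content, row_span and col_span are assumed rather than taken from scan_result.json:

```python
from dataclasses import dataclass, field
from typing import Any, Dict, List


@dataclass
class TableCell:
    # Hypothetical stand-in for the project's cell model.
    row: int
    col: int
    content: str = ""
    row_span: int = 1
    col_span: int = 1


@dataclass
class TableData:
    # Hypothetical stand-in mirroring the {rows, cols, cells} shape.
    rows: int
    cols: int
    cells: List[TableCell] = field(default_factory=list)


def dict_to_table_data(content: Dict[str, Any]) -> TableData:
    """Build a TableData from a cells-based dict, tolerating missing keys."""
    cells = [
        TableCell(
            row=int(c.get("row", 0)),
            col=int(c.get("col", 0)),
            content=str(c.get("content", "")),
            row_span=int(c.get("row_span", 1)),
            col_span=int(c.get("col_span", 1)),
        )
        for c in content.get("cells", [])
    ]
    # If rows/cols are absent, derive them from the extent of the cells.
    rows = content.get("rows") or max((c.row + c.row_span for c in cells), default=0)
    cols = content.get("cols") or max((c.col + c.col_span for c in cells), default=0)
    return TableData(rows=int(rows), cols=int(cols), cells=cells)


table = dict_to_table_data(
    {"rows": 2, "cols": 2, "cells": [{"row": 0, "col": 0, "content": "A"},
                                     {"row": 1, "col": 1, "content": "D"}]}
)
print(table.rows, table.cols, len(table.cells))  # 2 2 2
```

Deriving rows and cols from the cells when they are missing is a design choice of this sketch; the real helper could equally reject incomplete dicts.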

Option B: Fix conversion in convert_unified_document_to_ocr_data

Handle dict with cells properly by converting to HTML:

elif isinstance(element.content, dict):
    if 'cells' in element.content:
        # Convert cells-based dict to HTML
        html_content = self._cells_dict_to_html(element.content)
    elif 'html' in element.content:
        html_content = element.content['html']
    else:
        html_content = str(element.content)
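The _cells_dict_to_html helper referenced above is not shown in this proposal. One possible shape for it, written as a standalone function, is sketched below. The per-cell keys (row, col, content, row_span, col_span) are assumptions about the result JSON, and the output is deliberately bare HTML with no styling:

```python
from html import escape
from typing import Any, Dict


def cells_dict_to_html(table: Dict[str, Any]) -> str:
    """Render a {rows, cols, cells} dict as an HTML table string."""
    rows = int(table.get("rows", 0))
    # Group cells by row index. Grid positions covered by a span have no cell
    # of their own, so each anchor cell is simply emitted in column order.
    by_row = {r: [] for r in range(rows)}
    for cell in table.get("cells", []):
        by_row.setdefault(int(cell.get("row", 0)), []).append(cell)

    parts = ["<table>"]
    for r in sorted(by_row):
        parts.append("<tr>")
        for cell in sorted(by_row[r], key=lambda c: int(c.get("col", 0))):
            attrs = ""
            if int(cell.get("row_span", 1)) > 1:
                attrs += f' rowspan="{int(cell["row_span"])}"'
            if int(cell.get("col_span", 1)) > 1:
                attrs += f' colspan="{int(cell["col_span"])}"'
            # Escape cell text so OCR output containing < or & stays valid.
            text = escape(str(cell.get("content", "")))
            parts.append(f"<td{attrs}>{text}</td>")
        parts.append("</tr>")
    parts.append("</table>")
    return "".join(parts)


example = {
    "rows": 2,
    "cols": 2,
    "cells": [
        {"row": 0, "col": 0, "content": "Total", "col_span": 2},
        {"row": 1, "col": 0, "content": "A & B"},
        {"row": 1, "col": 1, "content": "3"},
    ],
}
print(cells_dict_to_html(example))
# <table><tr><td colspan="2">Total</td></tr><tr><td>A &amp; B</td><td>3</td></tr></table>
```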

Impact on Hybrid Mode

Hybrid mode uses Direct track rendering (_generate_direct_track_pdf), which already handles dict content properly via _build_rows_from_cells_dict. The proposed fixes should therefore not affect hybrid mode negatively.

However, testing should verify:

  1. Hybrid mode continues to work with combined Direct + OCR elements
  2. Table rendering quality is consistent across all tracks

Success Criteria

  1. OCR track tables render with correct structure matching original document
  2. Cell content positioned in proper grid locations
  3. Table borders/grid lines visible
  4. No regression in Direct track or Hybrid mode table rendering
  5. All test files (scan.pdf, img1.png, img2.png, img3.png) produce correct output

Files to Modify

  1. backend/app/services/pdf_generator_service.py
    • _json_to_document_element: Convert table dict to TableData
    • convert_unified_document_to_ocr_data: Improve dict handling (if Option B)

  2. backend/app/models/unified_document.py (optional)
    • Add TableData.from_dict() class method for cleaner conversion

Testing Plan

  1. Test scan.pdf with OCR track - verify table structure matches original
  2. Test img1.png, img2.png, img3.png with OCR track
  3. Test PDF files with Direct track - verify no regression
  4. Test Hybrid mode with files that trigger OCR fallback
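Part of the structure check in step 1 could be automated by comparing the shape of the generated table HTML against the rows and cols values in the result JSON. The sketch below is one way to do that with the standard library only. The HTML string and expected values are made-up sample data, and the helper names are hypothetical:

```python
from html.parser import HTMLParser


class TableShape(HTMLParser):
    """Records, for each <tr>, the column extent of its cells (summing colspan)."""

    def __init__(self):
        super().__init__()
        self.row_widths = []

    def handle_starttag(self, tag, attrs):
        if tag == "tr":
            self.row_widths.append(0)
        elif tag in ("td", "th") and self.row_widths:
            span = dict(attrs).get("colspan") or "1"
            self.row_widths[-1] += int(span)


def table_shape(html: str):
    """Return (row count, widest row in columns) for a single HTML table."""
    parser = TableShape()
    parser.feed(html)
    return len(parser.row_widths), max(parser.row_widths, default=0)


# Made-up sample standing in for the HTML produced on the OCR track.
html = '<table><tr><td colspan="2">Total</td></tr><tr><td>A</td><td>3</td></tr></table>'
expected = {"rows": 2, "cols": 2}  # would come from scan_result.json
assert table_shape(html) == (expected["rows"], expected["cols"])
```

This check ignores rowspan, so a row whose leading columns are covered by a cell from the row above will report a narrower width. It also says nothing about visual placement in the PDF, which still needs a manual comparison against the original page image.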