egg/OCR

egg 59206a6ab8 feat: simplify layout model selection and archive proposals

Changes:
- Replace PP-Structure 7-slider parameter UI with simple 3-option layout model selector
- Add layout model mapping: chinese (PP-DocLayout-S), default (PubLayNet), cdla
- Add LayoutModelSelector component and zh-TW translations
- Fix "default" model behavior with sentinel value for PubLayNet
- Add gap filling service for OCR track coverage improvement
- Add PP-Structure debug utilities
- Archive completed/incomplete proposals:
  - add-ocr-track-gap-filling (complete)
  - fix-ocr-track-table-rendering (incomplete)
  - simplify-ppstructure-model-selection (22/25 tasks)
- Add new layout model tests, archive old PP-Structure param tests
- Update OpenSpec ocr-processing spec with layout model requirements

🤖 Generated with [Claude Code](https://claude.com/claude-code)

Co-Authored-By: Claude <noreply@anthropic.com>

2025-11-27 13:27:00 +08:00


Fix OCR Track Table Rendering

Summary

OCR track PDF generation produces tables with incorrect format and layout. The tables lose their structure: cell content is misaligned and the visual format differs significantly from the original document. Image placement is correct; only table rendering is broken.

Problem Statement

When generating a PDF from OCR track results (scan.pdf processed by PP-StructureV3), the output tables have:

- Wrong cell alignment - content not positioned in proper cells
- Missing table structure - rows/columns don't match original document layout
- Incorrect content distribution - all content seems to flow linearly instead of maintaining grid structure

Reference: backend/storage/results/af7c9ee8-60a0-4291-9f22-ef98d27eed52/

- Original: af7c9ee8-60a0-4291-9f22-ef98d27eed52_scan_page_1.png
- Generated: scan_layout.pdf
- Result JSON: scan_result.json - Tables have correct {rows, cols, cells} structure

Root Cause Analysis

Issue 1: Table Content Not Converted to TableData Object

In _json_to_document_element (pdf_generator_service.py:1952):

element = DocumentElement(
    ...
    content=elem_dict.get('content', ''),  # Raw dict, not TableData
    ...
)

Table elements carry their content as a dict ({rows: 5, cols: 4, cells: [...]}), but this dict is never converted to a TableData object.

Issue 2: OCR Track HTML Conversion Fails

In convert_unified_document_to_ocr_data (pdf_generator_service.py:464-467):

elif isinstance(element.content, dict):
    html_content = element.content.get('html', str(element.content))

Because the cells-based dict has no 'html' key, the lookup falls back to str(element.content), which yields "{'rows': 5, 'cols': 4, ...}". That is a Python dict repr, not valid HTML.
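The fallback can be reproduced in isolation. The following is a minimal sketch, not code from pdf_generator_service.py: the sample dict is made up, and the per-cell keys (row, col, content) are assumptions, since only the top-level {rows, cols, cells} shape is documented here.

```python
# Hypothetical table content shaped like the {rows, cols, cells} dict.
# The per-cell keys ("row", "col", "content") are assumed for illustration.
content = {
    "rows": 1,
    "cols": 2,
    "cells": [
        {"row": 0, "col": 0, "content": "Name"},
        {"row": 0, "col": 1, "content": "Qty"},
    ],
}

# Same expression as the OCR track branch: there is no 'html' key,
# so dict.get() returns the default, which is the dict's repr.
html_content = content.get("html", str(content))

# The result contains no table markup at all.
is_html = "<table" in html_content
print(html_content)
print("looks like HTML:", is_html)
```

Any renderer that parses html_content as HTML receives no table markup to work with, which is consistent with the missing grid structure described above.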

Issue 3: Different Table Rendering Paths

- Direct track uses _draw_table_element_direct, which properly handles dict with cells via _build_rows_from_cells_dict.
- OCR track uses draw_table_region, which expects HTML strings and fails with dict content.

Proposed Solution

Option A: Convert dict to TableData during JSON loading (Recommended)

In _json_to_document_element, when element type is TABLE and content is a dict with cells, convert it to a TableData object:

# For TABLE elements, convert dict to TableData
if elem_type == ElementType.TABLE and isinstance(content, dict) and 'cells' in content:
    content = self._dict_to_table_data(content)

This ensures element.content.to_html() works correctly in convert_unified_document_to_ocr_data.
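One possible shape for the conversion helper is sketched below. Everything here is an assumption made for illustration: the TableCell and TableData classes are simplified stand-ins for the real models in unified_document.py, the per-cell keys (row, col, content) are guessed, and spans are ignored.

```python
from dataclasses import dataclass, field
from html import escape
from typing import Any, Dict, List


@dataclass
class TableCell:
    """Stand-in cell model (hypothetical, not the project's class)."""
    row: int
    col: int
    content: str = ""


@dataclass
class TableData:
    """Stand-in table model with a minimal to_html() (hypothetical)."""
    rows: int
    cols: int
    cells: List[TableCell] = field(default_factory=list)

    def to_html(self) -> str:
        # Start from an empty grid so missing cells still render as <td></td>.
        grid = [["" for _ in range(self.cols)] for _ in range(self.rows)]
        for cell in self.cells:
            # Skip cells that fall outside the declared grid.
            if 0 <= cell.row < self.rows and 0 <= cell.col < self.cols:
                grid[cell.row][cell.col] = escape(cell.content)
        body = "".join(
            "<tr>" + "".join(f"<td>{text}</td>" for text in row) + "</tr>"
            for row in grid
        )
        return f"<table>{body}</table>"


def dict_to_table_data(content: Dict[str, Any]) -> TableData:
    """Build a TableData from a {rows, cols, cells} dict (assumed cell keys)."""
    cells = [
        TableCell(
            row=int(c.get("row", 0)),
            col=int(c.get("col", 0)),
            content=str(c.get("content", "")),
        )
        for c in content.get("cells", [])
    ]
    return TableData(
        rows=int(content.get("rows", 0)),
        cols=int(content.get("cols", 0)),
        cells=cells,
    )


table = dict_to_table_data(
    {"rows": 1, "cols": 2, "cells": [
        {"row": 0, "col": 0, "content": "A"},
        {"row": 0, "col": 1, "content": "B"},
    ]}
)
print(table.to_html())
```

Converting at load time keeps the OCR track's HTML path unchanged: it only ever sees an object that knows how to render itself.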

Option B: Fix conversion in convert_unified_document_to_ocr_data

Handle dict with cells properly by converting to HTML:

elif isinstance(element.content, dict):
    if 'cells' in element.content:
        # Convert cells-based dict to HTML
        html_content = self._cells_dict_to_html(element.content)
    elif 'html' in element.content:
        html_content = element.content['html']
    else:
        html_content = str(element.content)
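A minimal sketch of what a helper like _cells_dict_to_html might do is shown here as free functions. The function names and the per-cell keys (row, col, content) are assumptions, and row or column spans are not handled.

```python
from html import escape
from typing import Any, Dict


def cells_dict_to_html(content: Dict[str, Any]) -> str:
    """Render a {rows, cols, cells} dict as a plain HTML table (sketch)."""
    rows = int(content.get("rows", 0))
    cols = int(content.get("cols", 0))
    # Index cell text by (row, col) so each grid position is a direct lookup.
    by_pos = {
        (int(c["row"]), int(c["col"])): str(c.get("content", ""))
        for c in content.get("cells", [])
    }
    parts = ["<table>"]
    for r in range(rows):
        tds = "".join(
            f"<td>{escape(by_pos.get((r, c), ''))}</td>" for c in range(cols)
        )
        parts.append(f"<tr>{tds}</tr>")
    parts.append("</table>")
    return "".join(parts)


def table_content_to_html(content: Dict[str, Any]) -> str:
    """Mirror the three-way branch from Option B."""
    if "cells" in content:
        return cells_dict_to_html(content)
    if "html" in content:
        return content["html"]
    # Last resort, kept only to match the existing behaviour.
    return str(content)


print(table_content_to_html(
    {"rows": 1, "cols": 2, "cells": [{"row": 0, "col": 0, "content": "A"}]}
))
```

Cell text is escaped so that characters such as < or & coming out of OCR cannot corrupt the generated markup.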

Impact on Hybrid Mode

Hybrid mode uses Direct track rendering (_generate_direct_track_pdf) which already handles dict content properly via _build_rows_from_cells_dict. The proposed fixes should not affect hybrid mode negatively.

However, testing should verify:

- Hybrid mode continues to work with combined Direct + OCR elements
- Table rendering quality is consistent across all tracks

Success Criteria

- OCR track tables render with correct structure matching original document
- Cell content positioned in proper grid locations
- Table borders/grid lines visible
- No regression in Direct track or Hybrid mode table rendering
- All test files (scan.pdf, img1.png, img2.png, img3.png) produce correct output

Files to Modify

- backend/app/services/pdf_generator_service.py
  - _json_to_document_element: Convert table dict to TableData
  - convert_unified_document_to_ocr_data: Improve dict handling (if Option B)
- backend/app/models/unified_document.py (optional)
  - Add TableData.from_dict() class method for cleaner conversion

Testing Plan

- Test scan.pdf with OCR track - verify table structure matches original
- Test img1.png, img2.png, img3.png with OCR track
- Test PDF files with Direct track - verify no regression
- Test Hybrid mode with files that trigger OCR fallback
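Part of the structure check could be automated at the HTML stage, before any PDF is drawn. The sketch below is an assumption about how such a check might look, not an existing test in the repository: it counts cells per row in the intermediate HTML and compares the result with the rows and cols values from scan_result.json. The helper names are hypothetical, spans are ignored, and the final PDF still needs visual verification.

```python
from html.parser import HTMLParser
from typing import List


class TableShapeParser(HTMLParser):
    """Collect the number of <td>/<th> cells found in each <tr>."""

    def __init__(self) -> None:
        super().__init__()
        self.row_lengths: List[int] = []

    def handle_starttag(self, tag, attrs):
        if tag == "tr":
            # A new row starts with zero cells.
            self.row_lengths.append(0)
        elif tag in ("td", "th") and self.row_lengths:
            # Count the cell against the most recently opened row.
            self.row_lengths[-1] += 1


def table_shape(html: str) -> List[int]:
    """Return the cell count of every row in the given HTML."""
    parser = TableShapeParser()
    parser.feed(html)
    parser.close()
    return parser.row_lengths


def matches_grid(html: str, rows: int, cols: int) -> bool:
    """True if the HTML has exactly `rows` rows of `cols` cells each."""
    return table_shape(html) == [cols] * rows


good = "<table><tr><td>a</td><td>b</td></tr><tr><td>c</td><td>d</td></tr></table>"
bad = "{'rows': 2, 'cols': 2, 'cells': []}"  # dict repr, as produced by the current fallback
print(matches_grid(good, 2, 2), matches_grid(bad, 2, 2))
```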
