wip: add TableData.from_dict() for OCR track table parsing (incomplete)
Add TableData.from_dict() and TableCell.from_dict() methods to convert JSON table dicts to proper TableData objects during UnifiedDocument parsing. Modified _json_to_document_element() to detect TABLE elements whose dict content contains a 'cells' key and convert them to TableData.

Note: This fix ensures table elements have a proper to_html() method available, but the rendered output still needs investigation: tables may still render incorrectly in OCR track PDFs.

Files changed:
- unified_document.py: Add from_dict() class methods
- pdf_generator_service.py: Convert table dicts during JSON parsing
- Add fix-ocr-track-table-rendering proposal

🤖 Generated with [Claude Code](https://claude.com/claude-code)

Co-Authored-By: Claude <noreply@anthropic.com>
108
openspec/changes/fix-ocr-track-table-rendering/proposal.md
Normal file
@@ -0,0 +1,108 @@
# Fix OCR Track Table Rendering

## Summary

OCR track PDF generation produces tables with incorrect format and layout. Tables appear without proper structure - cell content is misaligned and the visual format differs significantly from the original document. Image placement is correct, but table rendering is broken.

## Problem Statement

When generating a PDF from OCR track results (via `scan.pdf` processed by PP-StructureV3), the output tables have:
1. **Wrong cell alignment** - content not positioned in proper cells
2. **Missing table structure** - rows/columns don't match original document layout
3. **Incorrect content distribution** - all content seems to flow linearly instead of maintaining grid structure

Reference: `backend/storage/results/af7c9ee8-60a0-4291-9f22-ef98d27eed52/`
- Original: `af7c9ee8-60a0-4291-9f22-ef98d27eed52_scan_page_1.png`
- Generated: `scan_layout.pdf`
- Result JSON: `scan_result.json` - Tables have correct `{rows, cols, cells}` structure

## Root Cause Analysis

### Issue 1: Table Content Not Converted to TableData Object

In `_json_to_document_element` (pdf_generator_service.py:1952):
```python
element = DocumentElement(
    ...
    content=elem_dict.get('content', ''),  # Raw dict, not TableData
    ...
)
```

Table elements have `content` as a dict `{rows: 5, cols: 4, cells: [...]}` but it's not converted to a `TableData` object.

### Issue 2: OCR Track HTML Conversion Fails

In `convert_unified_document_to_ocr_data` (pdf_generator_service.py:464-467):
```python
elif isinstance(element.content, dict):
    html_content = element.content.get('html', str(element.content))
```

Since the cells-based dict has no 'html' key, the call falls back to `str(element.content)`, which yields `"{'rows': 5, 'cols': 4, ...}"`. That string is not valid HTML.
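
The fallback can be reproduced in isolation. The snippet below is only a sketch: the `content` dict is a made-up stand-in for a cells-based table dict, not data taken from `scan_result.json`.

```python
# Hypothetical table content, shaped like the cells-based dict described above.
content = {"rows": 5, "cols": 4, "cells": []}

# There is no 'html' key, so dict.get() returns its default:
# the Python repr of the dict rather than any HTML markup.
html_content = content.get("html", str(content))

print(html_content)
```

The printed string starts with `{'rows': 5, ...` and contains no `<table>` tag, which is consistent with an HTML-based renderer being unable to recover rows and columns from it.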

### Issue 3: Different Table Rendering Paths

- **Direct track** uses `_draw_table_element_direct` which properly handles dict with cells via `_build_rows_from_cells_dict`
- **OCR track** uses `draw_table_region` which expects HTML strings and fails with dict content

## Proposed Solution

### Option A: Convert dict to TableData during JSON loading (Recommended)

In `_json_to_document_element`, when element type is TABLE and content is a dict with cells, convert it to a `TableData` object:

```python
# For TABLE elements, convert dict to TableData
if elem_type == ElementType.TABLE and isinstance(content, dict) and 'cells' in content:
    content = self._dict_to_table_data(content)
```

This ensures `element.content.to_html()` works correctly in `convert_unified_document_to_ocr_data`.
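
A minimal sketch of what the dict-to-object conversion might look like, written as `from_dict()` class methods. The field names (`row`, `col`, `content`, `row_span`, `col_span`) and the `coerce_table_content` helper are assumptions made for illustration; the real `TableData` and `TableCell` classes in `unified_document.py` may differ.

```python
from dataclasses import dataclass, field
from typing import Any, Dict, List


@dataclass
class TableCell:
    # Field names are assumptions, not the project's actual schema.
    row: int
    col: int
    content: str = ""
    row_span: int = 1
    col_span: int = 1

    @classmethod
    def from_dict(cls, data: Dict[str, Any]) -> "TableCell":
        # Missing keys fall back to defaults so partial OCR output still loads.
        return cls(
            row=int(data.get("row", 0)),
            col=int(data.get("col", 0)),
            content=str(data.get("content", "")),
            row_span=int(data.get("row_span", 1)),
            col_span=int(data.get("col_span", 1)),
        )


@dataclass
class TableData:
    rows: int
    cols: int
    cells: List[TableCell] = field(default_factory=list)

    @classmethod
    def from_dict(cls, data: Dict[str, Any]) -> "TableData":
        cells = [TableCell.from_dict(c) for c in data.get("cells", [])]
        return cls(
            rows=int(data.get("rows", 0)),
            cols=int(data.get("cols", 0)),
            cells=cells,
        )


def coerce_table_content(content: Any) -> Any:
    """Hypothetical helper: convert cells-based dicts, pass anything else through."""
    if isinstance(content, dict) and "cells" in content:
        return TableData.from_dict(content)
    return content


table = coerce_table_content(
    {
        "rows": 1,
        "cols": 2,
        "cells": [
            {"row": 0, "col": 0, "content": "A"},
            {"row": 0, "col": 1, "content": "B"},
        ],
    }
)
```

Passing non-dict content through unchanged keeps HTML-string tables, if any exist in stored results, working as before.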

### Option B: Fix conversion in convert_unified_document_to_ocr_data

Handle dict with cells properly by converting to HTML:

```python
elif isinstance(element.content, dict):
    if 'cells' in element.content:
        # Convert cells-based dict to HTML
        html_content = self._cells_dict_to_html(element.content)
    elif 'html' in element.content:
        html_content = element.content['html']
    else:
        html_content = str(element.content)
```
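
The `_cells_dict_to_html` helper referenced above is not defined in this proposal. One possible shape for it, written here as a standalone function, is sketched below. The cell keys (`row`, `col`, `content`, `row_span`, `col_span`) are assumptions about the JSON layout and should be checked against `scan_result.json`.

```python
from html import escape
from typing import Any, Dict


def cells_dict_to_html(table: Dict[str, Any]) -> str:
    """Render a {rows, cols, cells} dict as an HTML table (hypothetical helper)."""
    # Index cells by their top-left (row, col) position; key names are assumed.
    by_pos = {(c.get("row", 0), c.get("col", 0)): c for c in table.get("cells", [])}
    covered = set()  # grid positions hidden under another cell's span
    parts = ["<table>"]
    for r in range(table.get("rows", 0)):
        parts.append("<tr>")
        for c in range(table.get("cols", 0)):
            if (r, c) in covered:
                continue
            cell = by_pos.get((r, c), {})  # a missing cell becomes an empty <td>
            rs = cell.get("row_span", 1)
            cs = cell.get("col_span", 1)
            # Record every position this cell spans over, except its own anchor.
            for dr in range(rs):
                for dc in range(cs):
                    if (dr, dc) != (0, 0):
                        covered.add((r + dr, c + dc))
            attrs = ""
            if rs > 1:
                attrs += f' rowspan="{rs}"'
            if cs > 1:
                attrs += f' colspan="{cs}"'
            text = escape(str(cell.get("content", "")))
            parts.append(f"<td{attrs}>{text}</td>")
        parts.append("</tr>")
    parts.append("</table>")
    return "".join(parts)


html_out = cells_dict_to_html(
    {
        "rows": 2,
        "cols": 2,
        "cells": [
            {"row": 0, "col": 0, "content": "Header", "col_span": 2},
            {"row": 1, "col": 0, "content": "a"},
            {"row": 1, "col": 1, "content": "b & c"},
        ],
    }
)
```

Escaping cell text matters here because OCR output can contain `<` or `&`, which would otherwise corrupt the markup handed to the HTML-based renderer.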

## Impact on Hybrid Mode

Hybrid mode uses Direct track rendering (`_generate_direct_track_pdf`) which already handles dict content properly via `_build_rows_from_cells_dict`. The proposed fixes should not affect hybrid mode negatively.

However, testing should verify:
1. Hybrid mode continues to work with combined Direct + OCR elements
2. Table rendering quality is consistent across all tracks

## Success Criteria

1. OCR track tables render with correct structure matching original document
2. Cell content positioned in proper grid locations
3. Table borders/grid lines visible
4. No regression in Direct track or Hybrid mode table rendering
5. All test files (scan.pdf, img1.png, img2.png, img3.png) produce correct output

## Files to Modify

1. `backend/app/services/pdf_generator_service.py`
   - `_json_to_document_element`: Convert table dict to TableData
   - `convert_unified_document_to_ocr_data`: Improve dict handling (if Option B)

2. `backend/app/models/unified_document.py` (optional)
   - Add `TableData.from_dict()` class method for cleaner conversion

## Testing Plan

1. Test scan.pdf with OCR track - verify table structure matches original
2. Test img1.png, img2.png, img3.png with OCR track
3. Test PDF files with Direct track - verify no regression
4. Test Hybrid mode with files that trigger OCR fallback
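
Part of the structure check in step 1 could be automated by inspecting the intermediate HTML before it reaches the PDF renderer. The sketch below uses only the standard library's `html.parser`. The `table_shape` helper and the sample markup are illustrative assumptions, not existing project code, and the check deliberately ignores `rowspan`.

```python
from html.parser import HTMLParser
from typing import List


class TableShapeParser(HTMLParser):
    """Collect the effective column count of each <tr> in an HTML fragment."""

    def __init__(self) -> None:
        super().__init__()
        self.row_widths: List[int] = []

    def handle_starttag(self, tag, attrs):
        if tag == "tr":
            self.row_widths.append(0)
        elif tag in ("td", "th") and self.row_widths:
            # Count colspan so merged cells still contribute their full width.
            span = int(dict(attrs).get("colspan") or 1)
            self.row_widths[-1] += span


def table_shape(html: str) -> List[int]:
    """Return one width per row; an empty list means no table rows were found."""
    parser = TableShapeParser()
    parser.feed(html)
    return parser.row_widths


sample_html = (
    '<table><tr><td colspan="2">H</td></tr>'
    "<tr><td>a</td><td>b</td></tr></table>"
)
shape = table_shape(sample_html)
```

For a table with `rows: 5, cols: 4` and no row spans, a passing result would be five entries of width 4. A stringified dict such as `"{'rows': 5, 'cols': 4}"` yields an empty list, which flags the Issue 2 failure mode directly.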