wip: add TableData.from_dict() for OCR track table parsing (incomplete)

Add TableData.from_dict() and TableCell.from_dict() methods to convert
JSON table dicts to proper TableData objects during UnifiedDocument parsing.

Modify _json_to_document_element() to detect TABLE elements whose dict
content contains a 'cells' key and convert that content to TableData.
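
A minimal sketch of the conversion described above, not the actual code in
unified_document.py. Only `rows`, `cols`, and `cells` are confirmed by the
debug log in the diff; the TableCell field names (`row`, `col`, `content`)
and the helper `convert_table_content()` are hypothetical, chosen for
illustration.

```python
from dataclasses import dataclass, field
from typing import Any, Dict, List


@dataclass
class TableCell:
    # Field names are assumptions; the real TableCell may differ.
    row: int
    col: int
    content: str = ""

    @classmethod
    def from_dict(cls, data: Dict[str, Any]) -> "TableCell":
        return cls(
            row=int(data.get("row", 0)),
            col=int(data.get("col", 0)),
            content=str(data.get("content", "")),
        )


@dataclass
class TableData:
    rows: int
    cols: int
    cells: List[TableCell] = field(default_factory=list)

    @classmethod
    def from_dict(cls, data: Dict[str, Any]) -> "TableData":
        # Convert each nested cell dict before building the table.
        cells = [TableCell.from_dict(c) for c in data.get("cells", [])]
        return cls(
            rows=int(data.get("rows", 0)),
            cols=int(data.get("cols", 0)),
            cells=cells,
        )


def convert_table_content(content: Any) -> Any:
    """Hypothetical helper: return TableData for table-shaped dicts,
    otherwise return the content unchanged (the fallback path)."""
    if isinstance(content, dict) and "cells" in content:
        try:
            return TableData.from_dict(content)
        except (TypeError, ValueError, AttributeError):
            return content  # keep the original dict as fallback
    return content
```

The sketch catches a narrow set of exceptions rather than the broad
`Exception` used in the diff; either way, a malformed table dict is passed
through unchanged instead of aborting document parsing.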

Note: This change ensures table elements have a proper to_html() method
available, but the rendered output still needs investigation; tables may
still render incorrectly in OCR track PDFs.

Files changed:
- unified_document.py: Add from_dict() class methods
- pdf_generator_service.py: Convert table dicts during JSON parsing
- Add fix-ocr-track-table-rendering proposal

🤖 Generated with [Claude Code](https://claude.com/claude-code)

Co-Authored-By: Claude <noreply@anthropic.com>
egg
2025-11-26 19:16:51 +08:00
parent 6e050eb540
commit c65df754cf
5 changed files with 281 additions and 1 deletions


@@ -1945,11 +1945,23 @@ class PDFGeneratorService:
             if child:
                 children.append(child)
+        # Process content based on element type
+        content = elem_dict.get('content', '')
+        # For TABLE elements, convert dict content to TableData object
+        if elem_type == ElementType.TABLE and isinstance(content, dict) and 'cells' in content:
+            try:
+                content = TableData.from_dict(content)
+                logger.debug(f"Converted table dict to TableData: {content.rows}x{content.cols}, {len(content.cells)} cells")
+            except Exception as e:
+                logger.warning(f"Failed to convert table dict to TableData: {e}")
+                # Keep original dict as fallback
         # Create element
         element = DocumentElement(
             element_id=elem_dict.get('element_id', ''),
             type=elem_type,
-            content=elem_dict.get('content', ''),
+            content=content,
             bbox=bbox,
             confidence=elem_dict.get('confidence'),
             style=style,