fix: improve OCR track multi-line text rendering and HTML table detection
Multi-line text rendering (pdf_generator_service.py):
- Calculate font size by dividing bbox height by number of lines
- Start Y coordinate from bbox TOP instead of bottom
- Use non_empty_lines for proper line positioning

HTML table detection:
- pp_structure_enhanced.py: Detect HTML tables in 'text' type content and
  reclassify to TABLE when <table tag found
- pdf_generator_service.py: Content-based reclassification from TEXT to
  TABLE during UnifiedDocument parsing
- ocr_to_unified_converter.py: Fallback to check 'content' field for HTML
  tables when 'html' field is empty

Known issue: OCR processing still has quality issues that need further
investigation and fixes.

🤖 Generated with [Claude Code](https://claude.com/claude-code)

Co-Authored-By: Claude <noreply@anthropic.com>
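The multi-line rendering change is not part of the diff shown here, so the following is only a minimal sketch of the arithmetic the message describes. It assumes a PDF-style coordinate system with the origin at the bottom-left, and it assumes the font size equals the per-line height with no extra leading. The function name `layout_lines` and its parameters are hypothetical and do not come from pdf_generator_service.py.

```python
from typing import List, Tuple


def layout_lines(
    text: str, bbox_top: float, bbox_height: float
) -> Tuple[float, List[Tuple[str, float]]]:
    """Return (font_size, [(line, baseline_y), ...]) for a multi-line block.

    Hypothetical helper: assumes y grows upward, as in PDF user space.
    """
    # Only non-empty lines take part in sizing and positioning.
    non_empty_lines = [ln for ln in text.split("\n") if ln.strip()]
    if not non_empty_lines:
        return 0.0, []

    # Font size from bbox height divided by the number of lines
    # (assumption: no scale factor or leading is applied).
    line_height = bbox_height / len(non_empty_lines)
    font_size = line_height

    placed = []
    for i, line in enumerate(non_empty_lines):
        # Start at the bbox TOP and step downward one line at a time.
        baseline_y = bbox_top - (i + 1) * line_height
        placed.append((line, baseline_y))
    return font_size, placed
```

Starting from the top keeps the first line of text in the first slot of the box. Starting from the bottom would place the lines in reverse vertical order unless the loop were also reversed.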
@@ -543,6 +543,13 @@ class OCRToUnifiedConverter:
 html = elem_data.get('html', '')
 extracted_text = elem_data.get('extracted_text', '')
 
+# Fallback: check content field for HTML table if html field is empty
+if not html:
+    content = elem_data.get('content', '')
+    if isinstance(content, str) and '<table' in content.lower():
+        html = content
+        logger.debug("Using content field as HTML table source")
+
 # Try to parse HTML to get rows and columns
 rows = 0
 cols = 0
@@ -558,6 +565,10 @@ class OCRToUnifiedConverter:
 first_row = html[:first_row_end]
 cols = first_row.count('<td') + first_row.count('<th')
 
+# Return None if no valid table data found
+if rows == 0 and cols == 0 and not extracted_text:
+    return None
+
 # Note: TableData uses 'cols' not 'columns'
 # HTML content can be stored as caption or in element metadata
 return TableData(
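Taken together, the two hunks amount to "fall back to `content`, count rows and columns, bail out if nothing was found". A standalone sketch of that flow might look like the following. The function name `table_dimensions` is hypothetical, and the row count via `count('<tr')` is an assumption, since the diff does not show how `rows` is computed. The column count mirrors the `<td`/`<th` substring counting visible in the context lines.

```python
from typing import Optional, Tuple


def table_dimensions(elem_data: dict) -> Optional[Tuple[int, int]]:
    """Return (rows, cols) for a table element, or None if it has no data.

    Hypothetical helper, not the converter's actual method.
    """
    html = elem_data.get("html", "")
    extracted_text = elem_data.get("extracted_text", "")

    # Fallback: use the content field when html is empty and it holds a table.
    if not html:
        content = elem_data.get("content", "")
        if isinstance(content, str) and "<table" in content.lower():
            html = content

    rows = 0
    cols = 0
    if html:
        lowered = html.lower()
        rows = lowered.count("<tr")  # assumption: one <tr per row
        first_row_end = lowered.find("</tr>")
        if first_row_end > 0:
            first_row = lowered[:first_row_end]
            # Plain substring counts: a <thead> before the first </tr>
            # would also match '<th' and inflate the result.
            cols = first_row.count("<td") + first_row.count("<th")

    # No rows, no columns, and no text: nothing worth building a table from.
    if rows == 0 and cols == 0 and not extracted_text:
        return None
    return rows, cols
```

Under these assumptions, an element that carries only `extracted_text` still yields `(0, 0)` rather than `None`, which matches the guard in the second hunk.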