fix: improve OCR track multi-line text rendering and HTML table detection

Multi-line text rendering (pdf_generator_service.py):
- Calculate font size by dividing bbox height by number of lines
- Start Y coordinate from bbox TOP instead of bottom
- Use non_empty_lines for proper line positioning
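
The three points above can be sketched as follows. This is a minimal illustration, not the code in pdf_generator_service.py: the function name `layout_lines`, the bbox tuple order, and the assumption that Y grows upward (as in PDF user space, so the larger Y value is the TOP of the box) are all hypothetical.

```python
from typing import List, Tuple


def layout_lines(
    text: str, bbox: Tuple[float, float, float, float]
) -> Tuple[float, List[Tuple[str, float]]]:
    """Return (font_size, [(line, y), ...]) for text placed inside bbox.

    Assumption: bbox is (x0, y0, x1, y1) with Y growing upward,
    so y1 is the top edge and y0 is the bottom edge.
    """
    x0, y0, x1, y1 = bbox

    # Only non-empty lines take up a slot; blank lines are skipped.
    non_empty_lines = [ln for ln in text.split("\n") if ln.strip()]
    if not non_empty_lines:
        return 0.0, []

    # Font size is the bbox height divided by the number of lines.
    line_height = (y1 - y0) / len(non_empty_lines)
    font_size = line_height

    placed = []
    for i, line in enumerate(non_empty_lines):
        # Start from the TOP edge and step down one line height per line;
        # y is the bottom of the slot for line i.
        y = y1 - (i + 1) * line_height
        placed.append((line, y))
    return font_size, placed
```

With a 20-unit-tall box and two non-empty lines, each line gets a 10-unit slot, and the first line sits in the upper half of the box rather than the lower one.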

HTML table detection:
- pp_structure_enhanced.py: Detect HTML tables in 'text' type content
  and reclassify as TABLE when a '<table' tag is found
- pdf_generator_service.py: Content-based reclassification from TEXT
  to TABLE during UnifiedDocument parsing
- ocr_to_unified_converter.py: Fallback to check 'content' field for
  HTML tables when 'html' field is empty
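
A minimal sketch of the content-based check described above, assuming element types are plain strings. The helper name `reclassify_element_type` and the 'text'/'table' labels are hypothetical; only the case-insensitive '<table' substring test mirrors the diff below.

```python
def reclassify_element_type(element_type: str, content: object) -> str:
    """Return 'table' for a 'text' element whose content holds an HTML table.

    Any other element type, and any non-string content, is left unchanged.
    """
    if (
        element_type == "text"
        and isinstance(content, str)
        and "<table" in content.lower()
    ):
        return "table"
    return element_type
```

The substring test is deliberately loose: it accepts '<TABLE>' and tags with attributes, at the cost of also matching prose that merely mentions '<table'.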

Known issue: OCR processing quality still needs further investigation
and fixes.

🤖 Generated with [Claude Code](https://claude.com/claude-code)

Co-Authored-By: Claude <noreply@anthropic.com>
egg
2025-11-26 16:09:31 +08:00
parent 19bd5fd609
commit fa9b542b06
3 changed files with 50 additions and 16 deletions

@@ -543,6 +543,13 @@ class OCRToUnifiedConverter:
html = elem_data.get('html', '')
extracted_text = elem_data.get('extracted_text', '')
# Fallback: check content field for HTML table if html field is empty
if not html:
    content = elem_data.get('content', '')
    if isinstance(content, str) and '<table' in content.lower():
        html = content
        logger.debug("Using content field as HTML table source")
# Try to parse HTML to get rows and columns
rows = 0
cols = 0
@@ -558,6 +565,10 @@ class OCRToUnifiedConverter:
first_row = html[:first_row_end]
cols = first_row.count('<td') + first_row.count('<th')
# Return None if no valid table data found
if rows == 0 and cols == 0 and not extracted_text:
    return None
# Note: TableData uses 'cols' not 'columns'
# HTML content can be stored as caption or in element metadata
return TableData(