fix: improve OCR track multi-line text rendering and HTML table detection

Multi-line text rendering (pdf_generator_service.py):
- Calculate font size by dividing bbox height by number of lines
- Start Y coordinate from bbox TOP instead of bottom
- Use non_empty_lines for proper line positioning

HTML table detection:
- pp_structure_enhanced.py: Detect HTML tables in 'text' type content
  and reclassify to TABLE when <table tag found
- pdf_generator_service.py: Content-based reclassification from TEXT
  to TABLE during UnifiedDocument parsing
- ocr_to_unified_converter.py: Fallback to check 'content' field for
  HTML tables when 'html' field is empty

Known issue: OCR processing still has quality issues that need further
investigation and fixes.

🤖 Generated with [Claude Code](https://claude.com/claude-code)

Co-Authored-By: Claude <noreply@anthropic.com>
2025-11-26 16:09:31 +08:00
parent 19bd5fd609
commit fa9b542b06
3 changed files with 50 additions and 16 deletions
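
The multi-line layout described in the commit message can be sketched roughly as follows. This is not the code from pdf_generator_service.py (that file is not shown in this part of the diff); the function name `layout_lines`, the bbox order `(x0, y_bottom, x1, y_top)`, and the bottom-left PDF origin are all assumptions made for illustration.

```python
from typing import List, Tuple

BBox = Tuple[float, float, float, float]  # assumed order: x0, y_bottom, x1, y_top


def layout_lines(text: str, bbox: BBox) -> Tuple[float, List[Tuple[str, float, float]]]:
    """Hypothetical helper: split text into lines and place them top-down in bbox.

    Assumes PDF-style coordinates (origin at bottom-left, Y grows upward).
    Returns (font_size, [(line, x, baseline_y), ...]).
    """
    x0, y_bottom, _x1, y_top = bbox

    # Only non-empty lines take up vertical space.
    non_empty_lines = [ln for ln in text.split("\n") if ln.strip()]
    if not non_empty_lines:
        return 0.0, []

    # Font size = bbox height divided by the number of lines
    # (assumption: no extra leading between lines).
    line_height = (y_top - y_bottom) / len(non_empty_lines)
    font_size = line_height

    placed = []
    for i, line in enumerate(non_empty_lines):
        # Start from the TOP of the bbox and step downward one slot per line;
        # each baseline sits at the bottom edge of its slot.
        baseline_y = y_top - (i + 1) * line_height
        placed.append((line, x0, baseline_y))
    return font_size, placed
```

Placing the first line relative to the bottom edge instead would stack the lines upward from the wrong end of the box, which is presumably the behavior the commit corrects.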
--- a/backend/app/services/ocr_to_unified_converter.py
+++ b/backend/app/services/ocr_to_unified_converter.py
@@ -543,6 +543,13 @@ class OCRToUnifiedConverter:
            html = elem_data.get('html', '')
            extracted_text = elem_data.get('extracted_text', '')

+            # Fallback: check content field for HTML table if html field is empty
+            if not html:
+                content = elem_data.get('content', '')
+                if isinstance(content, str) and '<table' in content.lower():
+                    html = content
+                    logger.debug("Using content field as HTML table source")
+
            # Try to parse HTML to get rows and columns
            rows = 0
            cols = 0
@@ -558,6 +565,10 @@ class OCRToUnifiedConverter:
                        first_row = html[:first_row_end]
                        cols = first_row.count('<td') + first_row.count('<th')

+            # Return None if no valid table data found
+            if rows == 0 and cols == 0 and not extracted_text:
+                return None
+
            # Note: TableData uses 'cols' not 'columns'
            # HTML content can be stored as caption or in element metadata
            return TableData(