Table data format fixes (ocr_to_unified_converter.py):
- Fix ElementType string conversion using value-based lookup
- Add content-based HTML table detection (reclassify TEXT to TABLE)
- Use BeautifulSoup for robust HTML table parsing
- Generate TableData with fully populated cells arrays

Image cropping for OCR track (pp_structure_enhanced.py):
- Add _crop_and_save_image method for extracting image regions
- Pass source_image_path to _process_parsing_res_list
- Return relative filename (not full path) for saved_path
- Consistent with Direct Track image saving pattern

Also includes:
- Add beautifulsoup4 to requirements.txt
- Add architecture overview documentation
- Archive fix-ocr-track-table-data-format proposal (22/24 tasks)

Known issues: OCR track images are restored but still have quality issues
that will be addressed in a follow-up proposal.

🤖 Generated with [Claude Code](https://claude.com/claude-code)

Co-Authored-By: Claude <noreply@anthropic.com>
Change: Fix OCR Track Table Data Format to Match Direct Track
Why
OCR Track produces HTML strings for table content instead of structured TableData objects, so PDF generation renders the raw HTML code as plain text. Direct Track correctly produces TableData objects with a populated cells array, which results in proper table rendering. This inconsistency creates a poor user experience when OCR Track is used for documents containing tables.
What Changes
- Enhance the _extract_table_data method in ocr_to_unified_converter.py to properly parse HTML tables into structured TableData objects with populated TableCell arrays
- Add BeautifulSoup-based HTML table parsing to robustly extract cell content and row/column spans from OCR-generated HTML tables
- Ensure format consistency between OCR Track and Direct Track table output, allowing PDF Generator to handle a single standardized format
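As a rough illustration of the parsing step listed above, the sketch below turns an HTML table string into a structure with populated cells. It is not the project's implementation: the change uses BeautifulSoup, whereas this sketch swaps in the standard library's html.parser so that it runs without extra dependencies. The TableCell and TableData classes, their field names (row_span, col_span), and the parse_html_table function are hypothetical stand-ins for illustration only.

```python
from dataclasses import dataclass, field
from html.parser import HTMLParser
from typing import List, Optional


@dataclass
class TableCell:  # hypothetical stand-in, not the project's class
    row: int
    col: int
    content: str
    row_span: int = 1
    col_span: int = 1


@dataclass
class TableData:  # hypothetical stand-in, not the project's class
    rows: int
    cols: int
    cells: List[TableCell] = field(default_factory=list)


class _CellCollector(HTMLParser):
    """Collects the text and span attributes of each <td>/<th>, row by row."""

    def __init__(self) -> None:
        super().__init__()
        self.cells: List[TableCell] = []
        self.row = -1  # becomes 0 at the first <tr>
        self.col = 0
        self._current: Optional[TableCell] = None

    def handle_starttag(self, tag, attrs):
        if tag == "tr":
            self.row += 1
            self.col = 0
        elif tag in ("td", "th") and self.row >= 0:
            a = dict(attrs)
            self._current = TableCell(
                row=self.row,
                col=self.col,
                content="",
                row_span=int(a.get("rowspan") or 1),
                col_span=int(a.get("colspan") or 1),
            )

    def handle_data(self, data):
        # Only text that appears inside an open cell is kept.
        if self._current is not None:
            self._current.content += data

    def handle_endtag(self, tag):
        if tag in ("td", "th") and self._current is not None:
            self._current.content = self._current.content.strip()
            self.cells.append(self._current)
            self.col += self._current.col_span  # skip columns covered by colspan
            self._current = None


def parse_html_table(html: str) -> TableData:
    """Parse an HTML table string into a TableData with populated cells."""
    collector = _CellCollector()
    collector.feed(html)
    collector.close()
    cells = collector.cells
    rows = max((c.row + c.row_span for c in cells), default=0)
    cols = max((c.col + c.col_span for c in cells), default=0)
    return TableData(rows=rows, cols=cols, cells=cells)


html = ("<table><tr><th>Name</th><th>Qty</th></tr>"
        "<tr><td colspan='2'>Total</td></tr></table>")
table = parse_html_table(html)
print(table.rows, table.cols, [c.content for c in table.cells])
```

The sketch advances the column index by colspan only; it does not shift cells in later rows to account for a rowspan that began in an earlier row, which a production parser would likely need to handle.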
Impact
- Affected specs: ocr-processing
- Affected code:
  - backend/app/services/ocr_to_unified_converter.py (primary changes)
  - backend/app/services/pdf_generator_service.py (no changes needed - already handles TableData)
  - backend/app/services/direct_extraction_engine.py (no changes - serves as reference implementation)
Evidence
Direct Track (Reference - Correct Behavior)
direct_extraction_engine.py:846-850:
table_data = TableData(
    rows=len(data),
    cols=max(len(row) for row in data) if data else 0,
    cells=cells,  # Properly populated with TableCell objects
    headers=data[0] if data else None
)
OCR Track (Current - Problematic)
ocr_to_unified_converter.py:574-579:
return TableData(
    rows=rows,  # Only counts from html.count('<tr')
    cols=cols,  # Only counts from <td>/<th> in first row
    cells=cells,  # Always empty list []
    caption=extracted_text
)
The cells array is always empty because the current HTML parsing only counts tags but doesn't extract actual cell content.
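To make the failure mode concrete, here is a minimal sketch of a count-only approach. It is an assumption about the general shape of such code based on the comments in the excerpt, not a copy of the project's method; the count_only function is hypothetical.

```python
import re


def count_only(html: str):
    """Hypothetical count-only parse: derives dimensions but never reads cell text."""
    rows = html.count("<tr")
    first_row = re.search(r"<tr.*?</tr>", html, flags=re.DOTALL | re.IGNORECASE)
    cols = (
        len(re.findall(r"<t[dh]\b", first_row.group(0), flags=re.IGNORECASE))
        if first_row
        else 0
    )
    cells = []  # nothing above ever looks at the text between the tags
    return rows, cols, cells


sample = "<table><tr><td>A</td><td>B</td></tr><tr><td>1</td><td>2</td></tr></table>"
rows, cols, cells = count_only(sample)
print(rows, cols, cells)
```

The dimensions come out right, but a consumer that iterates over cells has nothing to render, which is consistent with the raw-HTML fallback described in the Why section.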