# Change: Fix OCR Track Table Data Format to Match Direct Track

## Why

OCR Track produces HTML strings for table content instead of structured `TableData` objects, so PDF generation renders the raw HTML as plain text. Direct Track correctly produces `TableData` objects with a populated `cells` array, which results in proper table rendering. This inconsistency creates a poor user experience when OCR Track is used for documents containing tables.

## What Changes

- **Enhance the `_extract_table_data` method** in `ocr_to_unified_converter.py` to parse HTML tables into structured `TableData` objects with populated `TableCell` arrays
- **Add BeautifulSoup-based HTML table parsing** to robustly extract cell content and row/column spans from OCR-generated HTML tables
- **Ensure format consistency** between OCR Track and Direct Track table output, so that PDF Generator handles a single standardized format

## Impact

- Affected specs: `ocr-processing`
- Affected code:
  - `backend/app/services/ocr_to_unified_converter.py` (primary changes)
  - `backend/app/services/pdf_generator_service.py` (no changes needed - already handles `TableData`)
  - `backend/app/services/direct_extraction_engine.py` (no changes - serves as reference implementation)

## Evidence

### Direct Track (Reference - Correct Behavior)

`direct_extraction_engine.py:846-850`:

```python
table_data = TableData(
    rows=len(data),
    cols=max(len(row) for row in data) if data else 0,
    cells=cells,  # Properly populated with TableCell objects
    headers=data[0] if data else None
)
```

### OCR Track (Current - Problematic)

`ocr_to_unified_converter.py:574-579`:

```python
return TableData(
    rows=rows,  # Only counts from html.count('