OCR/openspec/changes/archive/2025-11-26-fix-ocr-track-table-data-format/proposal.md

# Change: Fix OCR Track Table Data Format to Match Direct Track

## Why

OCR Track produces HTML strings for table content instead of structured `TableData` objects, so PDF generation renders the raw HTML markup as plain text. Direct Track correctly produces `TableData` objects with a populated `cells` array, so its tables render properly. This inconsistency gives users a poor experience when they run OCR Track on documents that contain tables.

## What Changes

- **Enhance the `_extract_table_data` method** in `ocr_to_unified_converter.py` to parse HTML tables into structured `TableData` objects with populated `TableCell` arrays
- **Add BeautifulSoup-based HTML table parsing** to robustly extract cell content and row/column spans from OCR-generated HTML tables
- **Ensure format consistency** between OCR Track and Direct Track table output, so that PDF Generator only has to handle a single standardized format
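
The parsing step described above could look roughly like the sketch below. It is an illustration only, not the project's implementation: it uses the standard library's `html.parser` instead of BeautifulSoup so that it runs without extra dependencies, and the `TableCell` fields (`row`, `col`, `content`, `row_span`, `col_span`), the stand-in `TableData` dataclass, and the function name `parse_html_table` are all assumptions made for this example.

```python
from dataclasses import dataclass, field
from html.parser import HTMLParser
from typing import List, Optional


@dataclass
class TableCell:  # hypothetical stand-in; the real field names may differ
    row: int
    col: int
    content: str = ""
    row_span: int = 1
    col_span: int = 1


@dataclass
class TableData:  # hypothetical stand-in mirroring the fields shown in Evidence
    rows: int
    cols: int
    cells: List[TableCell] = field(default_factory=list)
    headers: Optional[List[str]] = None
    caption: Optional[str] = None


def _span(value: Optional[str]) -> int:
    """Parse a rowspan/colspan attribute, falling back to 1 on bad input."""
    try:
        return max(1, int(value))
    except (TypeError, ValueError):
        return 1


class _TableParser(HTMLParser):
    """Collects <td>/<th> cells and tracks grid slots covered by spans."""

    def __init__(self) -> None:
        super().__init__()
        self.cells: List[TableCell] = []
        self.headers: List[str] = []
        self._row = -1
        self._col = 0
        self._occupied = set()  # (row, col) slots already taken by a cell
        self._cell: Optional[TableCell] = None
        self._is_th = False
        self._text: List[str] = []

    def finish_cell(self) -> None:
        """Store the cell being built, if any (tolerates missing end tags)."""
        if self._cell is None:
            return
        self._cell.content = " ".join("".join(self._text).split())
        self.cells.append(self._cell)
        if self._is_th and self._cell.row == 0:
            self.headers.append(self._cell.content)
        self._cell = None

    def handle_starttag(self, tag, attrs):
        if tag == "tr":
            self.finish_cell()
            self._row += 1
            self._col = 0
        elif tag in ("td", "th") and self._row >= 0:
            self.finish_cell()
            # Skip slots covered by a rowspan from an earlier row.
            while (self._row, self._col) in self._occupied:
                self._col += 1
            a = dict(attrs)
            rs, cs = _span(a.get("rowspan")), _span(a.get("colspan"))
            self._cell = TableCell(self._row, self._col, "", rs, cs)
            self._is_th = tag == "th"
            self._text = []
            for r in range(self._row, self._row + rs):
                for c in range(self._col, self._col + cs):
                    self._occupied.add((r, c))
            self._col += cs

    def handle_data(self, data):
        if self._cell is not None:
            self._text.append(data)

    def handle_endtag(self, tag):
        if tag in ("td", "th"):
            self.finish_cell()


def parse_html_table(html: str, caption: Optional[str] = None) -> TableData:
    """Turn an HTML table string into a TableData with populated cells."""
    parser = _TableParser()
    parser.feed(html)
    parser.close()
    parser.finish_cell()
    rows = max((c.row + c.row_span for c in parser.cells), default=0)
    cols = max((c.col + c.col_span for c in parser.cells), default=0)
    return TableData(rows, cols, parser.cells, parser.headers or None, caption)
```

Calling `parse_html_table(html)` on an OCR-generated table string returns a `TableData` whose `cells` list carries each cell's text and span, which is the shape Direct Track already produces. Nested tables are not handled in this sketch.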

## Impact

- Affected specs: `ocr-processing`
- Affected code:
  - `backend/app/services/ocr_to_unified_converter.py` (primary changes)
  - `backend/app/services/pdf_generator_service.py` (no changes needed - already handles `TableData`)
  - `backend/app/services/direct_extraction_engine.py` (no changes - serves as reference implementation)

## Evidence

### Direct Track (Reference - Correct Behavior)
`direct_extraction_engine.py:846-850`:
```python
table_data = TableData(
    rows=len(data),
    cols=max(len(row) for row in data) if data else 0,
    cells=cells,  # Properly populated with TableCell objects
    headers=data[0] if data else None
)
```

### OCR Track (Current - Problematic)
`ocr_to_unified_converter.py:574-579`:
```python
return TableData(
    rows=rows,           # Only counts from html.count('<tr')
    cols=cols,           # Only counts from <td>/<th> in first row
    cells=cells,         # Always empty list []
    caption=extracted_text
)
```

The `cells` array is always empty because the current HTML parsing only counts tags and never extracts the cell content itself.
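
To make the failure mode concrete, here is a hypothetical reconstruction of a count-only approach, based solely on the inline comments in the snippet above. The function name `count_only_dimensions` and the regular expressions are assumptions; the actual code in `ocr_to_unified_converter.py` may differ in detail.

```python
import re
from typing import List, Tuple


def count_only_dimensions(html: str) -> Tuple[int, int, List]:
    """Hypothetical reconstruction: derive table size by counting tags only."""
    # Row count comes from counting opening <tr tags.
    rows = html.count('<tr')

    # Column count comes from the <td>/<th> tags inside the first row.
    first_row = re.search(r'<tr[^>]*>(.*?)</tr>', html, re.DOTALL)
    cols = len(re.findall(r'<t[dh][\s>]', first_row.group(1))) if first_row else 0

    # Nothing ever reads the text between the tags, so cells stays empty.
    cells: List = []
    return rows, cols, cells
```

With this approach the dimensions can be right while the content is lost, which matches the symptom: the PDF generator receives a `TableData` with no cells to render.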