# Design: Fix OCR Track Table Data Format

## Context

The OCR processing pipeline has three modes:

1. **Direct Track**: Extracts structured data directly from native PDFs using `direct_extraction_engine.py`
2. **OCR Track**: Uses PP-StructureV3 for layout analysis and OCR, then converts results via `ocr_to_unified_converter.py`
3. **Hybrid Mode**: Uses Direct Track as primary, supplementing with OCR Track for missing images only

Both tracks produce `UnifiedDocument` containing `DocumentElement` objects. For tables, the `content` field should contain a `TableData` object with populated `cells` array. However, OCR Track currently produces `TableData` with empty `cells`, causing PDF generation failures.
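
For reference, a minimal sketch of the shape these models are assumed to have (field names follow this document's usage, not necessarily the actual module):

```python
from dataclasses import dataclass, field
from typing import List, Optional

# Hypothetical stand-ins for the shared models; the real definitions
# live in the codebase and may differ in detail.
@dataclass
class TableCell:
    row: int
    col: int
    content: str
    row_span: int = 1
    col_span: int = 1

@dataclass
class TableData:
    rows: int
    cols: int
    cells: List[TableCell] = field(default_factory=list)
    headers: Optional[List[str]] = None

# The failure mode described above: OCR Track sets the counts
# but leaves cells empty, which breaks PDF generation downstream.
broken = TableData(rows=2, cols=3)
assert broken.cells == []
```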

## Track Isolation Analysis (Safety Guarantee)

This section documents why the proposed changes will NOT affect Direct Track or Hybrid Mode.

### Code Flow Analysis

```
┌─────────────────────────────────────────────────────────────────────────┐
│ ocr_service.py                                                          │
├─────────────────────────────────────────────────────────────────────────┤
│                                                                         │
│ Direct Track ──► DirectExtractionEngine ──► UnifiedDocument             │
│ (direct_extraction_engine.py)               (tables: TableData ✓)       │
│                  [NOT MODIFIED]                                         │
│                                                                         │
│ OCR Track ────► PP-StructureV3 ──► OCRToUnifiedConverter ──► UnifiedDoc │
│                 (ocr_to_unified_converter.py)                           │
│                 [MODIFIED: _extract_table_data]                         │
│                                                                         │
│ Hybrid Mode ──► Direct Track (primary) + OCR Track (images only)        │
│                      │                      │                           │
│                      │                      └──► _merge_ocr_images_into_│
│                      │                           direct() merges ONLY:  │
│                      │                           - ElementType.FIGURE   │
│                      │                           - ElementType.IMAGE    │
│                      │                           - ElementType.LOGO     │
│                      │                           [Tables NOT merged]    │
│                      └──► Tables come from Direct Track (unchanged)     │
│                                                                         │
└─────────────────────────────────────────────────────────────────────────┘
```

### Evidence from ocr_service.py

**Line 1610** (hybrid mode merge logic):

```python
image_types = {ElementType.FIGURE, ElementType.IMAGE, ElementType.LOGO}
```

**Lines 1634-1635** (only image types are merged):

```python
for element in ocr_page.elements:
    if element.type in image_types:  # Tables excluded
```

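The filter these snippets describe can be sketched end to end; the `Element` dataclass and `merge_ocr_elements` function below are simplified stand-ins, not the real `ocr_service.py` code:

```python
from dataclasses import dataclass
from enum import Enum, auto
from typing import List

class ElementType(Enum):
    TEXT = auto()
    TABLE = auto()
    FIGURE = auto()
    IMAGE = auto()
    LOGO = auto()

@dataclass
class Element:
    type: ElementType

# Mirrors the filter above: only image-like elements cross from the
# OCR Track into the merged hybrid result; tables never do.
IMAGE_TYPES = {ElementType.FIGURE, ElementType.IMAGE, ElementType.LOGO}

def merge_ocr_elements(ocr_elements: List[Element]) -> List[Element]:
    return [e for e in ocr_elements if e.type in IMAGE_TYPES]

mixed = [Element(ElementType.TABLE), Element(ElementType.FIGURE), Element(ElementType.TEXT)]
assert [e.type for e in merge_ocr_elements(mixed)] == [ElementType.FIGURE]
```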
### Impact Matrix

| Mode | Table Source | Uses OCRToUnifiedConverter? | Affected by Change? |
|------|--------------|-----------------------------|---------------------|
| Direct Track | `DirectExtractionEngine` | No | **No** |
| OCR Track | `OCRToUnifiedConverter` | Yes | **Yes (fixed)** |
| Hybrid Mode | `DirectExtractionEngine` (tables) | Only for images | **No** |

### Conclusion
The fix is **isolated to the OCR Track**:

- Direct Track: uses a separate engine (`DirectExtractionEngine`), completely unaffected
- Hybrid Mode: tables come from the Direct Track; the OCR Track is used only for image extraction
- OCR Track: benefits from the fix with properly populated `TableData` output

## Goals / Non-Goals

### Goals

- OCR Track table output format matches Direct Track format exactly
- PDF Generator receives consistent `TableData` objects from both tracks
- Robust HTML table parsing that handles real-world OCR output

### Non-Goals

- Modifying Direct Track behavior (it's the reference implementation)
- Changing the `TableData` or `TableCell` data models
- Modifying the PDF Generator to handle HTML strings as a workaround

## Decisions

### Decision 1: Use BeautifulSoup for HTML Parsing

**Rationale**: The current regex/string-counting approach is fragile and cannot extract cell content. BeautifulSoup provides:

- Robust handling of malformed HTML (common in OCR output)
- Easy extraction of cell content and attributes (`rowspan`, `colspan`)
- A well-tested library already used in many Python projects

**Alternatives considered**:

- Manual regex parsing: too fragile for complex tables
- lxml: more complex API, overkill for this use case
- html.parser (stdlib): less tolerant of malformed HTML

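For illustration, the kind of extraction the rationale describes (the sample HTML here is invented, not actual OCR output):

```python
from bs4 import BeautifulSoup

# Invented OCR-style table HTML for demonstration purposes.
html = """
<table>
  <tr><th>Name</th><th colspan="2">Score</th></tr>
  <tr><td>Alice</td><td>10</td><td>12</td></tr>
</table>
"""

soup = BeautifulSoup(html, "html.parser")
rows = soup.find("table").find_all("tr")
assert len(rows) == 2
# Cell text and span attributes fall out directly.
assert rows[0].find("th").get_text(strip=True) == "Name"
assert int(rows[0].find_all("th")[1].get("colspan", 1)) == 2
assert [td.get_text(strip=True) for td in rows[1].find_all("td")] == ["Alice", "10", "12"]
```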
### Decision 2: Maintain Backward Compatibility

**Rationale**: If BeautifulSoup parsing fails, fall back to the current behavior (return `TableData` with basic row/column counts). This ensures existing functionality isn't broken.

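One possible shape for the fallback, sketched with a stand-in parser and plain dicts rather than the real `TableData` model:

```python
import logging
from typing import Optional

logger = logging.getLogger(__name__)

def _parse_with_beautifulsoup(html: str) -> dict:
    # Stand-in for the full BeautifulSoup path; here it always fails
    # so the fallback branch below is exercised.
    raise ValueError("parser unavailable")

def extract_table(html: str) -> Optional[dict]:
    """Try the rich parse first; on any failure, fall back to the
    pre-fix behavior of basic row counts with empty cells."""
    try:
        return _parse_with_beautifulsoup(html)
    except Exception as e:
        logger.warning("HTML table parse failed, using basic counts: %s", e)
        return {"rows": html.lower().count("<tr"), "cols": 0, "cells": []}

result = extract_table("<table><tr><td>x</td></tr></table>")
assert result == {"rows": 1, "cols": 0, "cells": []}
```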
### Decision 3: Single Point of Change

**Rationale**: Only modify `ocr_to_unified_converter.py`. This:

- Minimizes regression risk
- Keeps the Direct Track untouched as the reference
- Requires no changes to the downstream PDF Generator

## Implementation Approach

```python
def _extract_table_data(self, elem_data: Dict) -> Optional[TableData]:
    """Extract table data from element HTML using BeautifulSoup."""
    try:
        html = elem_data.get('html', '') or elem_data.get('content', '')
        if not html or '<table' not in html.lower():
            return None

        soup = BeautifulSoup(html, 'html.parser')
        table = soup.find('table')
        if not table:
            return None

        cells = []
        headers = []
        rows = table.find_all('tr')

        for row_idx, row in enumerate(rows):
            row_cells = row.find_all(['td', 'th'])
            for col_idx, cell in enumerate(row_cells):
                cell_content = cell.get_text(strip=True)
                rowspan = int(cell.get('rowspan', 1))
                colspan = int(cell.get('colspan', 1))

                cells.append(TableCell(
                    row=row_idx,
                    col=col_idx,
                    row_span=rowspan,
                    col_span=colspan,
                    content=cell_content,
                ))

                # Collect headers from the first row or <th> elements
                if row_idx == 0 or cell.name == 'th':
                    headers.append(cell_content)

        return TableData(
            rows=len(rows),
            cols=max(len(row.find_all(['td', 'th'])) for row in rows) if rows else 0,
            cells=cells,
            headers=headers if headers else None,
        )
    except Exception as e:
        logger.warning(f"Failed to parse HTML table: {e}")
        return None  # Fallback handled by caller
```

## Risks / Trade-offs

| Risk | Mitigation |
|------|------------|
| BeautifulSoup not installed | Add to requirements.txt; it's already a common dependency |
| Malformed HTML causes parsing errors | Use try/except with fallback to current behavior |
| Performance impact from HTML parsing | Minimal; tables are small and BeautifulSoup is fast |
| Complex rowspan/colspan calculations | Start with simple column tracking; enhance if needed |

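On the last row: if simple column tracking proves insufficient, a grid-occupancy pass is the usual next step. A sketch of that enhancement, not part of the current proposal:

```python
from typing import List, Tuple

def assign_grid_positions(rows: List[List[Tuple[str, int, int]]]) -> List[Tuple[int, int, str]]:
    """Assign (row, col) positions to cells given as (content, rowspan, colspan),
    skipping grid slots already occupied by spans from earlier rows."""
    occupied: set = set()
    placed = []
    for r, row in enumerate(rows):
        c = 0
        for content, rowspan, colspan in row:
            # Advance past slots claimed by a rowspan from above.
            while (r, c) in occupied:
                c += 1
            placed.append((r, c, content))
            # Claim every slot this cell covers.
            for dr in range(rowspan):
                for dc in range(colspan):
                    occupied.add((r + dr, c + dc))
            c += colspan
    return placed

# "A" spans two rows, so "C" in the second row must land at column 1.
table = [[("A", 2, 1), ("B", 1, 1)], [("C", 1, 1)]]
assert assign_grid_positions(table) == [(0, 0, "A"), (0, 1, "B"), (1, 1, "C")]
```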
## Dependencies

- `beautifulsoup4`: already commonly available; add to requirements.txt if not present

## Open Questions

- Q: Should we preserve the original HTML in metadata for debugging?
- A: Optional enhancement; not required for the initial fix