# Design: Fix OCR Track Table Data Format

## Context

The OCR processing pipeline has three modes:

1. **Direct Track**: Extracts structured data directly from native PDFs using `direct_extraction_engine.py`
2. **OCR Track**: Uses PP-StructureV3 for layout analysis and OCR, then converts results via `ocr_to_unified_converter.py`
3. **Hybrid Mode**: Uses Direct Track as primary, supplements with OCR Track for missing images only

Both tracks produce `UnifiedDocument` containing `DocumentElement` objects. For tables, the `content` field should contain a `TableData` object with populated `cells` array. However, OCR Track currently produces `TableData` with empty `cells`, causing PDF generation failures.

## Track Isolation Analysis (Safety Guarantee)

This section documents why the proposed changes will NOT affect Direct Track or Hybrid Mode.

### Code Flow Analysis

```
┌─────────────────────────────────────────────────────────────────────────┐
│                             ocr_service.py                              │
├─────────────────────────────────────────────────────────────────────────┤
│                                                                         │
│  Direct Track ──► DirectExtractionEngine ──► UnifiedDocument            │
│                   (direct_extraction_engine.py)  (tables: TableData ✓)  │
│                   [NOT MODIFIED]                                        │
│                                                                         │
│  OCR Track ────► PP-StructureV3 ──► OCRToUnifiedConverter ──► UnifiedDoc│
│                                     (ocr_to_unified_converter.py)       │
│                                     [MODIFIED: _extract_table_data]     │
│                                                                         │
│  Hybrid Mode ──► Direct Track (primary) + OCR Track (images only)       │
│                    │                       │                            │
│                    │                       └──► _merge_ocr_images_into_ │
│                    │                            direct() merges ONLY:   │
│                    │                            - ElementType.FIGURE    │
│                    │                            - ElementType.IMAGE     │
│                    │                            - ElementType.LOGO      │
│                    │                            [Tables NOT merged]     │
│                    └──► Tables come from Direct Track (unchanged)       │
└─────────────────────────────────────────────────────────────────────────┘
```

### Evidence from ocr_service.py

**Line 1610** (Hybrid mode merge logic):

```python
image_types = {ElementType.FIGURE, ElementType.IMAGE, ElementType.LOGO}
```

**Lines 1634-1635** (Only image types are merged):

```python
for element in ocr_page.elements:
    if element.type in image_types:  # Tables excluded
```

### Impact Matrix

| Mode | Table Source | Uses OCRToUnifiedConverter? | Affected by Change? |
|------|--------------|----------------------------|---------------------|
| Direct Track | `DirectExtractionEngine` | No | **No** |
| OCR Track | `OCRToUnifiedConverter` | Yes | **Yes (Fixed)** |
| Hybrid Mode | `DirectExtractionEngine` (tables) | Only for images | **No** |

### Conclusion

The fix is **isolated to OCR Track only**:

- Direct Track: Uses separate engine (`DirectExtractionEngine`), completely unaffected
- Hybrid Mode: Tables come from Direct Track; OCR Track is only used for image extraction
- OCR Track: Will benefit from the fix with proper `TableData` output

## Goals / Non-Goals

### Goals

- OCR Track table output format matches Direct Track format exactly
- PDF Generator receives consistent `TableData` objects from both tracks
- Robust HTML table parsing that handles real-world OCR output

### Non-Goals

- Modifying Direct Track behavior (it's the reference implementation)
- Changing the `TableData` or `TableCell` data models
- Modifying PDF Generator to handle HTML strings as a workaround

## Decisions

### Decision 1: Use BeautifulSoup for HTML Parsing

**Rationale**: The current regex/string-counting approach is fragile and cannot extract cell content. BeautifulSoup provides:

- Robust handling of malformed HTML (common in OCR output)
- Easy extraction of cell content, attributes (rowspan, colspan)
- Well-tested library already used in many Python projects

**Alternatives considered**:

- Manual regex parsing: Too fragile for complex tables
- lxml: More complex API, overkill for this use case
- html.parser (stdlib): Less tolerant of malformed HTML

### Decision 2: Maintain Backward Compatibility

**Rationale**: If BeautifulSoup parsing fails, fall back to current behavior (return `TableData` with basic row/col counts). This ensures existing functionality isn't broken.
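
The fallback described in Decision 2 could look roughly like the sketch below. It is illustrative only: `TableCell` and `TableData` here are simplified stand-ins for the project's models, and `count_based_table` and `table_with_fallback` are hypothetical names, not functions from `ocr_to_unified_converter.py`. The count-based path only approximates the "current behavior" mentioned above.

```python
import re
from dataclasses import dataclass, field
from typing import Callable, List, Optional


@dataclass
class TableCell:  # simplified stand-in for the project's model (assumed fields)
    row: int
    col: int
    content: str = ""


@dataclass
class TableData:  # simplified stand-in for the project's model (assumed fields)
    rows: int
    cols: int
    cells: List[TableCell] = field(default_factory=list)
    headers: Optional[List[str]] = None


def count_based_table(html: str) -> TableData:
    """Approximate the count-only behavior: row/col counts, no cell content."""
    lowered = html.lower()
    rows = len(re.findall(r"<tr[\s>]", lowered))
    # Count cells in the first row only; the [\s>] guard avoids matching <thead>.
    end = lowered.find("</tr>")
    first_row = lowered if end == -1 else lowered[:end]
    cols = len(re.findall(r"<t[dh][\s>]", first_row))
    return TableData(rows=rows, cols=cols)


def table_with_fallback(
    html: str, parse: Callable[[str], Optional[TableData]]
) -> TableData:
    """Use the structured parser; fall back to counts if it raises or returns None."""
    try:
        parsed = parse(html)
    except Exception:
        parsed = None
    return parsed if parsed is not None else count_based_table(html)
```

With this shape, a parser failure degrades to the count-only `TableData` instead of propagating an exception to the PDF Generator.
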
### Decision 3: Single Point of Change

**Rationale**: Only modify `ocr_to_unified_converter.py`. This:

- Minimizes regression risk
- Keeps Direct Track untouched as reference
- Requires no changes to downstream PDF Generator

## Implementation Approach

```python
from typing import Dict, Optional

from bs4 import BeautifulSoup


def _extract_table_data(self, elem_data: Dict) -> Optional[TableData]:
    """Extract table data from element using BeautifulSoup."""
    try:
        html = elem_data.get('html', '') or elem_data.get('content', '')
        if not html or '<table' not in html.lower():
            return None

        # NOTE: the parsing loop below was lost in the source and is reconstructed
        # from context; the parser choice and TableCell field names are assumptions.
        soup = BeautifulSoup(html, 'html.parser')
        rows = soup.find_all('tr')
        cells, headers = [], []
        for row_idx, row in enumerate(rows):
            col_idx = 0
            for cell in row.find_all(['td', 'th']):
                cell_content = cell.get_text(strip=True)
                cells.append(TableCell(row=row_idx, col=col_idx, content=cell_content))
                col_idx += int(cell.get('colspan', 1))  # simple column tracking
                # Header detection: first row or <th> elements
                if row_idx == 0 or cell.name == 'th':
                    headers.append(cell_content)

        return TableData(
            rows=len(rows),
            cols=max(len(row.find_all(['td', 'th'])) for row in rows) if rows else 0,
            cells=cells,
            headers=headers if headers else None
        )
    except Exception as e:
        logger.warning(f"Failed to parse HTML table: {e}")
        return None  # Fallback handled by caller
```

## Risks / Trade-offs

| Risk | Mitigation |
|------|------------|
| BeautifulSoup not installed | Add `beautifulsoup4` to requirements.txt |
| Malformed HTML causes parsing errors | Use try/except with fallback to current behavior |
| Performance impact from HTML parsing | Expected to be minimal, since OCR tables are typically small |
| Complex rowspan/colspan calculations | Start with simple col tracking; enhance if needed |

## Dependencies

- `beautifulsoup4`: Add to requirements.txt if not already present

## Open Questions

- Q: Should we preserve the original HTML in metadata for debugging?
  - A: Optional enhancement; not required for initial fix
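
The "enhance if needed" mitigation for rowspan/colspan in the risk table could take the form of an occupancy grid. The sketch below is a possible enhancement, not part of the proposed fix: `assign_columns` and `SpanRow` are hypothetical names, and the input is reduced to `(rowspan, colspan)` pairs so the column logic can be shown without any HTML parsing. Simple column tracking advances by `colspan` only, so a cell spanning several rows does not shift the cells beneath it; the grid corrects that.

```python
from typing import List, Set, Tuple

# Each row is a list of (rowspan, colspan) pairs in document order.
SpanRow = List[Tuple[int, int]]


def assign_columns(rows: List[SpanRow]) -> List[List[int]]:
    """Return the starting grid column of every cell, honoring spans."""
    occupied: Set[Tuple[int, int]] = set()  # (row, col) slots already claimed
    result: List[List[int]] = []
    for r, row in enumerate(rows):
        col = 0
        starts: List[int] = []
        for rowspan, colspan in row:
            # Skip slots claimed by a rowspan cell from an earlier row.
            while (r, col) in occupied:
                col += 1
            starts.append(col)
            # Claim every slot this cell covers.
            for dr in range(rowspan):
                for dc in range(colspan):
                    occupied.add((r + dr, col + dc))
            col += colspan
        result.append(starts)
    return result
```

For example, if the first cell of row 0 has `rowspan=2`, the first cell of row 1 starts at column 1 rather than column 0, which is the position a `col` index would need for correct placement.
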