OCR/openspec/changes/archive/2025-11-26-fix-ocr-track-table-data-format/design.md
egg 6e050eb540 fix: OCR track table data format and image cropping
Table data format fixes (ocr_to_unified_converter.py):
- Fix ElementType string conversion using value-based lookup
- Add content-based HTML table detection (reclassify TEXT to TABLE)
- Use BeautifulSoup for robust HTML table parsing
- Generate TableData with fully populated cells arrays

Image cropping for OCR track (pp_structure_enhanced.py):
- Add _crop_and_save_image method for extracting image regions
- Pass source_image_path to _process_parsing_res_list
- Return relative filename (not full path) for saved_path
- Consistent with Direct Track image saving pattern

Also includes:
- Add beautifulsoup4 to requirements.txt
- Add architecture overview documentation
- Archive fix-ocr-track-table-data-format proposal (22/24 tasks)

Known issues: OCR track images are restored but still have quality issues
that will be addressed in a follow-up proposal.

🤖 Generated with [Claude Code](https://claude.com/claude-code)

Co-Authored-By: Claude <noreply@anthropic.com>
2025-11-26 18:48:15 +08:00


Design: Fix OCR Track Table Data Format

Context

The OCR processing pipeline has three modes:

  1. Direct Track: Extracts structured data directly from native PDFs using direct_extraction_engine.py
  2. OCR Track: Uses PP-StructureV3 for layout analysis and OCR, then converts results via ocr_to_unified_converter.py
  3. Hybrid Mode: Uses Direct Track as primary, supplements with OCR Track for missing images only

Both tracks produce a UnifiedDocument containing DocumentElement objects. For tables, the content field should hold a TableData object with a populated cells array. However, OCR Track currently produces TableData with an empty cells array, which causes PDF generation to fail.
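
To make the mismatch concrete, here is a minimal sketch of the two shapes. The TableCell and TableData dataclasses are hypothetical stand-ins whose field names are borrowed from the implementation sketch in this document; the project's real models may define more fields or different defaults. The has_renderable_cells helper is likewise only illustrative.

```python
from dataclasses import dataclass, field
from typing import List, Optional


@dataclass
class TableCell:  # hypothetical stand-in for the project's model
    row: int
    col: int
    content: str
    row_span: int = 1
    col_span: int = 1


@dataclass
class TableData:  # hypothetical stand-in for the project's model
    rows: int
    cols: int
    cells: List[TableCell] = field(default_factory=list)
    headers: Optional[List[str]] = None


# Shape described as the current OCR Track output: counts only, no cells.
broken = TableData(rows=2, cols=2)

# Shape the document says tables should have: one TableCell per cell.
expected = TableData(
    rows=2,
    cols=2,
    cells=[
        TableCell(row=0, col=0, content="Item"),
        TableCell(row=0, col=1, content="Qty"),
        TableCell(row=1, col=0, content="Bolt"),
        TableCell(row=1, col=1, content="4"),
    ],
    headers=["Item", "Qty"],
)


def has_renderable_cells(table: TableData) -> bool:
    """Return True when the table carries at least one cell to draw."""
    return len(table.cells) > 0
```

A check of this kind could serve as a regression test for the converter, since the row and column counts alone do not reveal whether cell content was extracted.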

Track Isolation Analysis (Safety Guarantee)

This section documents why the proposed changes will NOT affect Direct Track or Hybrid Mode.

Code Flow Analysis

┌─────────────────────────────────────────────────────────────────────────┐
│                           ocr_service.py                                 │
├─────────────────────────────────────────────────────────────────────────┤
│                                                                          │
│  Direct Track ──► DirectExtractionEngine ──► UnifiedDocument            │
│                   (direct_extraction_engine.py)    (tables: TableData ✓) │
│                   [NOT MODIFIED]                                         │
│                                                                          │
│  OCR Track ────► PP-StructureV3 ──► OCRToUnifiedConverter ──► UnifiedDoc│
│                                     (ocr_to_unified_converter.py)        │
│                                     [MODIFIED: _extract_table_data]      │
│                                                                          │
│  Hybrid Mode ──► Direct Track (primary) + OCR Track (images only)       │
│                  │                         │                             │
│                  │                         └──► _merge_ocr_images_into_  │
│                  │                              direct() merges ONLY:    │
│                  │                              - ElementType.FIGURE     │
│                  │                              - ElementType.IMAGE      │
│                  │                              - ElementType.LOGO       │
│                  │                              [Tables NOT merged]      │
│                  └──► Tables come from Direct Track (unchanged)          │
└─────────────────────────────────────────────────────────────────────────┘

Evidence from ocr_service.py

Line 1610 (Hybrid mode merge logic):

image_types = {ElementType.FIGURE, ElementType.IMAGE, ElementType.LOGO}

Lines 1634-1635 (Only image types are merged):

for element in ocr_page.elements:
    if element.type in image_types:  # Tables excluded
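
The filtering behaviour quoted above can be illustrated in isolation. This is a sketch under stated assumptions, not the code in ocr_service.py: the ElementType subset, its string values, the Element dataclass, and select_mergeable are all hypothetical names introduced for the example.

```python
from dataclasses import dataclass
from enum import Enum
from typing import List


class ElementType(Enum):  # hypothetical subset of the project's enum
    TEXT = "text"
    TABLE = "table"
    FIGURE = "figure"
    IMAGE = "image"
    LOGO = "logo"


@dataclass
class Element:  # hypothetical stand-in for DocumentElement
    type: ElementType
    element_id: str


IMAGE_TYPES = {ElementType.FIGURE, ElementType.IMAGE, ElementType.LOGO}


def select_mergeable(ocr_elements: List[Element]) -> List[Element]:
    """Keep only OCR elements whose type is one of the image types."""
    return [e for e in ocr_elements if e.type in IMAGE_TYPES]


ocr_elements = [
    Element(ElementType.TABLE, "t1"),
    Element(ElementType.FIGURE, "f1"),
    Element(ElementType.TEXT, "p1"),
    Element(ElementType.LOGO, "l1"),
]
merged_ids = [e.element_id for e in select_mergeable(ocr_elements)]
print(merged_ids)  # ['f1', 'l1']
```

Because the OCR table element never passes the filter, a change to how OCR tables are built cannot reach the Hybrid Mode output through this merge step.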

Impact Matrix

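| Mode         | Table Source                    | Uses OCRToUnifiedConverter? | Affected by Change? |
|--------------|---------------------------------|-----------------------------|---------------------|
| Direct Track | DirectExtractionEngine          | No                          | No                  |
| OCR Track    | OCRToUnifiedConverter           | Yes                         | Yes (Fixed)         |
| Hybrid Mode  | DirectExtractionEngine (tables) | Only for images             | No                  |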

Conclusion

The fix is isolated to OCR Track only:

  • Direct Track: Uses separate engine (DirectExtractionEngine), completely unaffected
  • Hybrid Mode: Tables come from Direct Track; OCR Track is only used for image extraction
  • OCR Track: Will benefit from the fix with proper TableData output

Goals / Non-Goals

Goals

  • OCR Track table output format matches Direct Track format exactly
  • PDF Generator receives consistent TableData objects from both tracks
  • Robust HTML table parsing that handles real-world OCR output

Non-Goals

  • Modifying Direct Track behavior (it's the reference implementation)
  • Changing the TableData or TableCell data models
  • Modifying PDF Generator to handle HTML strings as a workaround

Decisions

Decision 1: Use BeautifulSoup for HTML Parsing

Rationale: The current regex/string-counting approach is fragile and cannot extract cell content. BeautifulSoup provides:

  • Robust handling of malformed HTML (common in OCR output)
  • Easy extraction of cell content, attributes (rowspan, colspan)
  • Well-tested library already used in many Python projects

Alternatives considered:

  • Manual regex parsing: Too fragile for complex tables
  • lxml: More complex API, overkill for this use case
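  • html.parser (stdlib) used on its own: Less tolerant of malformed HTML, and its event-based API would require hand-written tree building. It remains in use only as the parser backend passed to BeautifulSoup.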
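
As a small illustration of the tolerance argument, the sketch below feeds BeautifulSoup a truncated table in which the last row and the table itself are never closed. The sample markup is invented for the example and is not taken from real PP-StructureV3 output.

```python
from bs4 import BeautifulSoup

# Truncated markup: the second <tr> and the <table> are never closed.
malformed = (
    "<table><tr><td>Item</td><td>Qty</td></tr>"
    "<tr><td>Bolt</td><td>4</td>"
)

soup = BeautifulSoup(malformed, "html.parser")
table = soup.find("table")

# Collect the text of every cell, row by row.
grid = [
    [cell.get_text(strip=True) for cell in row.find_all(["td", "th"])]
    for row in table.find_all("tr")
]
print(grid)  # [['Item', 'Qty'], ['Bolt', '4']]
```

BeautifulSoup closes any tags still open at the end of the input, so both rows are recovered. Other kinds of damage (for example, cells that are never closed mid-row) may be repaired differently depending on the parser backend, so real OCR samples are worth testing before relying on this.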

Decision 2: Maintain Backward Compatibility

Rationale: If BeautifulSoup parsing fails, fall back to current behavior (return TableData with basic row/col counts). This ensures existing functionality isn't broken.
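
One possible shape for this fallback is sketched below. It is an assumption about how the caller might be structured, not the existing code: count_rows_and_cols is a hypothetical approximation of the string-counting behaviour, extract_with_fallback is a hypothetical wrapper, and plain dicts stand in for TableData.

```python
import re
from typing import Callable, Dict, Optional


def count_rows_and_cols(html: str) -> Dict[str, object]:
    """String-counting fallback: row/col counts only, no cell content."""
    rows = len(re.findall(r"<tr\b", html, flags=re.IGNORECASE))
    cell_tags = len(re.findall(r"<t[dh]\b", html, flags=re.IGNORECASE))
    cols = cell_tags // rows if rows else 0
    return {"rows": rows, "cols": cols, "cells": []}


def extract_with_fallback(
    html: str,
    parse: Callable[[str], Optional[Dict[str, object]]],
) -> Dict[str, object]:
    """Try the full parser first; use counts if it raises or returns None."""
    try:
        parsed = parse(html)
    except Exception:
        parsed = None
    return parsed if parsed is not None else count_rows_and_cols(html)
```

Passing the parser in as a callable keeps the fallback testable without the real converter: a parser that raises and a parser that returns None both lead to the count-only result.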

Decision 3: Single Point of Change

Rationale: Only modify ocr_to_unified_converter.py. This:

  • Minimizes regression risk
  • Keeps Direct Track untouched as reference
  • Requires no changes to downstream PDF Generator

Implementation Approach

import logging
from typing import Dict, Optional

from bs4 import BeautifulSoup

# TableData and TableCell are the existing project models (import omitted here).
logger = logging.getLogger(__name__)


def _extract_table_data(self, elem_data: Dict) -> Optional[TableData]:
    """Extract table data from element using BeautifulSoup."""
    try:
        html = elem_data.get('html', '') or elem_data.get('content', '')
        # Only string content containing a <table> tag is treated as a table.
        if not isinstance(html, str) or '<table' not in html.lower():
            return None

        soup = BeautifulSoup(html, 'html.parser')
        table = soup.find('table')
        if table is None:
            return None

        cells = []
        headers = []
        max_cols = 0
        rows = table.find_all('tr')

        for row_idx, row in enumerate(rows):
            row_cells = row.find_all(['td', 'th'])
            max_cols = max(max_cols, len(row_cells))
            for col_idx, cell in enumerate(row_cells):
                cell_content = cell.get_text(strip=True)
                # Non-numeric span attributes raise ValueError, caught below.
                rowspan = int(cell.get('rowspan', 1))
                colspan = int(cell.get('colspan', 1))

                # col_idx is the cell's position within its <tr>; spans from
                # earlier cells are not yet accounted for (see Risks).
                cells.append(TableCell(
                    row=row_idx,
                    col=col_idx,
                    row_span=rowspan,
                    col_span=colspan,
                    content=cell_content
                ))

                # Collect headers from first row or <th> elements
                if row_idx == 0 or cell.name == 'th':
                    headers.append(cell_content)

        return TableData(
            rows=len(rows),
            cols=max_cols,
            cells=cells,
            headers=headers if headers else None
        )
    except Exception as e:
        logger.warning(f"Failed to parse HTML table: {e}")
        return None  # Fallback handled by caller

Risks / Trade-offs

| Risk                                 | Mitigation                                               |
|--------------------------------------|----------------------------------------------------------|
| BeautifulSoup not installed          | Add to requirements.txt; it's already a common dependency |
| Malformed HTML causes parsing errors | Use try/except with fallback to current behavior         |
| Performance impact from HTML parsing | Minimal; tables are small; BeautifulSoup is fast         |
| Complex rowspan/colspan calculations | Start with simple col tracking; enhance if needed        |
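
If the simple column tracking proves insufficient, one possible enhancement is an occupancy grid that skips slots already covered by earlier spans. The sketch below is illustrative only and is not part of the proposed change: place_cells and its tuple-based input format are hypothetical, and it does not handle spans that overlap each other.

```python
from typing import List, Set, Tuple

# Each input cell is (content, rowspan, colspan), in source order per row.
SpanCell = Tuple[str, int, int]
# Each placed cell is (row, col, rowspan, colspan, content).
PlacedCell = Tuple[int, int, int, int, str]


def place_cells(rows: List[List[SpanCell]]) -> List[PlacedCell]:
    """Assign true grid columns, skipping slots covered by earlier spans."""
    occupied: Set[Tuple[int, int]] = set()
    placed: List[PlacedCell] = []
    for r, row in enumerate(rows):
        c = 0
        for content, rowspan, colspan in row:
            # Advance past slots already claimed by a span from above.
            while (r, c) in occupied:
                c += 1
            placed.append((r, c, rowspan, colspan, content))
            # Mark every grid slot this cell covers.
            for dr in range(rowspan):
                for dc in range(colspan):
                    occupied.add((r + dr, c + dc))
            c += colspan
    return placed


sample = [
    [("Name", 2, 1), ("Score", 1, 2)],
    [("Math", 1, 1), ("Art", 1, 1)],
    [("Ann", 1, 1), ("90", 1, 1), ("85", 1, 1)],
]
positions = [(r, c) for r, c, _rs, _cs, _text in place_cells(sample)]
print(positions)
# [(0, 0), (0, 1), (1, 1), (1, 2), (2, 0), (2, 1), (2, 2)]
```

With plain enumeration, "Math" and "Art" in the second row would be assigned columns 0 and 1; the grid places them at columns 1 and 2 because column 0 is still covered by the two-row "Name" cell.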

Dependencies

  • beautifulsoup4: Already commonly available, add to requirements.txt if not present

Open Questions

  • Q: Should we preserve the original HTML in metadata for debugging?
    • A: Optional enhancement; not required for initial fix