# Design: Fix OCR Track Table Data Format

## Context

The OCR processing pipeline has three modes:

- Direct Track: Extracts structured data directly from native PDFs using `direct_extraction_engine.py`
- OCR Track: Uses PP-StructureV3 for layout analysis and OCR, then converts results via `ocr_to_unified_converter.py`
- Hybrid Mode: Uses Direct Track as primary, supplements with OCR Track for missing images only

Both tracks produce a `UnifiedDocument` containing `DocumentElement` objects. For tables, the `content` field should contain a `TableData` object with a populated `cells` array. However, OCR Track currently produces `TableData` with empty `cells`, causing PDF generation failures.
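To make the format contract concrete, here is a minimal sketch using stand-in dataclasses. The field names mirror those used in the Implementation Approach section below; the real `TableData`/`TableCell` models live in the project's shared models module, so treat these definitions as illustrative only.

```python
from dataclasses import dataclass, field
from typing import List, Optional

# Stand-ins for the project's models (field names taken from the
# Implementation Approach section; actual definitions live elsewhere).
@dataclass
class TableCell:
    row: int
    col: int
    row_span: int = 1
    col_span: int = 1
    content: str = ""

@dataclass
class TableData:
    rows: int
    cols: int
    cells: List[TableCell] = field(default_factory=list)
    headers: Optional[List[str]] = None

# What Direct Track produces (and OCR Track should): populated cells.
good = TableData(rows=1, cols=2, cells=[
    TableCell(row=0, col=0, content="Name"),
    TableCell(row=0, col=1, content="Value"),
], headers=["Name", "Value"])

# What OCR Track currently produces: row/col counts only, empty cells.
bad = TableData(rows=1, cols=2, cells=[])
```

The PDF Generator iterates `cells` to lay out the table, which is why the empty-`cells` form fails downstream.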
## Track Isolation Analysis (Safety Guarantee)

This section documents why the proposed changes will NOT affect Direct Track or Hybrid Mode.

### Code Flow Analysis
```
ocr_service.py
│
├── Direct Track ──► DirectExtractionEngine ──► UnifiedDocument
│                    (direct_extraction_engine.py)  (tables: TableData ✓)
│                    [NOT MODIFIED]
│
├── OCR Track ─────► PP-StructureV3 ──► OCRToUnifiedConverter ──► UnifiedDocument
│                    (ocr_to_unified_converter.py)
│                    [MODIFIED: _extract_table_data]
│
└── Hybrid Mode ───► Direct Track (primary) + OCR Track (images only)
                     │
                     ├──► _merge_ocr_images_into_direct() merges ONLY:
                     │      - ElementType.FIGURE
                     │      - ElementType.IMAGE
                     │      - ElementType.LOGO
                     │      [Tables NOT merged]
                     └──► Tables come from Direct Track (unchanged)
```
### Evidence from ocr_service.py

Line 1610 (hybrid-mode merge logic):

```python
image_types = {ElementType.FIGURE, ElementType.IMAGE, ElementType.LOGO}
```

Lines 1634-1635 (only image types are merged):

```python
for element in ocr_page.elements:
    if element.type in image_types:  # Tables excluded
```
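The merge guard above can be reduced to a runnable sketch. `ElementType` here is a stand-in enum with illustrative values, and the filter operates on bare enum members rather than full elements, purely to show the set-membership check:

```python
from enum import Enum

# Stand-in for the project's ElementType enum (values illustrative).
class ElementType(Enum):
    TEXT = "text"
    TABLE = "table"
    FIGURE = "figure"
    IMAGE = "image"
    LOGO = "logo"

image_types = {ElementType.FIGURE, ElementType.IMAGE, ElementType.LOGO}

def merge_candidates(ocr_element_types):
    """Mirror of the hybrid-mode filter: only image-like elements pass."""
    return [t for t in ocr_element_types if t in image_types]
```

Because `ElementType.TABLE` is never in `image_types`, no table produced by OCR Track can reach the merged document in Hybrid Mode.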
### Impact Matrix

| Mode | Table Source | Uses OCRToUnifiedConverter? | Affected by Change? |
|---|---|---|---|
| Direct Track | `DirectExtractionEngine` | No | No |
| OCR Track | `OCRToUnifiedConverter` | Yes | Yes (fixed) |
| Hybrid Mode | `DirectExtractionEngine` (tables) | Only for images | No |
### Conclusion

The fix is isolated to OCR Track only:

- Direct Track: Uses a separate engine (`DirectExtractionEngine`); completely unaffected
- Hybrid Mode: Tables come from Direct Track; OCR Track is only used for image extraction
- OCR Track: Will benefit from the fix with proper `TableData` output
## Goals / Non-Goals

### Goals

- OCR Track table output format matches Direct Track format exactly
- PDF Generator receives consistent `TableData` objects from both tracks
- Robust HTML table parsing that handles real-world OCR output

### Non-Goals

- Modifying Direct Track behavior (it is the reference implementation)
- Changing the `TableData` or `TableCell` data models
- Modifying PDF Generator to handle HTML strings as a workaround
## Decisions

### Decision 1: Use BeautifulSoup for HTML Parsing

Rationale: The current regex/string-counting approach is fragile and cannot extract cell content. BeautifulSoup provides:

- Robust handling of malformed HTML (common in OCR output)
- Easy extraction of cell content and attributes (`rowspan`, `colspan`)
- A well-tested library already used in many Python projects

Alternatives considered:

- Manual regex parsing: too fragile for complex tables
- `lxml`: more complex API; overkill for this use case
- `html.parser` (stdlib, used directly): less tolerant of malformed HTML than BeautifulSoup's tree building
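To illustrate the tolerance point, here is a small example; the malformed snippet is invented, but the pattern (missing closing tags at the end of a table) is typical of OCR-generated HTML:

```python
from bs4 import BeautifulSoup

# OCR output often truncates closing tags; BeautifulSoup still builds a
# usable tree, so the cell text survives.
malformed = "<table><tr><td>Region</td><td>Q1"  # no </td></tr></table>
soup = BeautifulSoup(malformed, "html.parser")
cells = [td.get_text(strip=True) for td in soup.find_all("td")]
# cells == ["Region", "Q1"]
```

A strict parser would reject this input outright; BeautifulSoup closes the open elements at end of input and recovers both cells.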
### Decision 2: Maintain Backward Compatibility

Rationale: If BeautifulSoup parsing fails, fall back to the current behavior (return `TableData` with basic row/column counts only). This ensures existing functionality isn't broken.
### Decision 3: Single Point of Change

Rationale: Only modify `ocr_to_unified_converter.py`. This:

- Minimizes regression risk
- Keeps Direct Track untouched as the reference implementation
- Requires no changes to the downstream PDF Generator
## Implementation Approach

```python
from bs4 import BeautifulSoup  # beautifulsoup4

def _extract_table_data(self, elem_data: Dict) -> Optional[TableData]:
    """Extract table data from element HTML using BeautifulSoup."""
    try:
        html = elem_data.get('html', '') or elem_data.get('content', '')
        if not html or '<table' not in html.lower():
            return None

        soup = BeautifulSoup(html, 'html.parser')
        table = soup.find('table')
        if not table:
            return None

        cells = []
        headers = []
        rows = table.find_all('tr')
        for row_idx, row in enumerate(rows):
            row_cells = row.find_all(['td', 'th'])
            for col_idx, cell in enumerate(row_cells):
                cell_content = cell.get_text(strip=True)
                rowspan = int(cell.get('rowspan', 1))
                colspan = int(cell.get('colspan', 1))
                cells.append(TableCell(
                    row=row_idx,
                    col=col_idx,
                    row_span=rowspan,
                    col_span=colspan,
                    content=cell_content,
                ))
                # Collect headers from the first row or from <th> elements
                if row_idx == 0 or cell.name == 'th':
                    headers.append(cell_content)

        return TableData(
            rows=len(rows),
            cols=max(len(r.find_all(['td', 'th'])) for r in rows) if rows else 0,
            cells=cells,
            headers=headers if headers else None,
        )
    except Exception as e:
        logger.warning(f"Failed to parse HTML table: {e}")
        return None  # Fallback handled by the caller
```
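As a sanity check, the same parsing logic can be exercised standalone. Stand-in dataclasses replace the project models, the `self` parameter is dropped, and the sample HTML is invented for illustration:

```python
from dataclasses import dataclass, field
from typing import List, Optional
from bs4 import BeautifulSoup

@dataclass
class TableCell:            # stand-in for the project model
    row: int
    col: int
    row_span: int = 1
    col_span: int = 1
    content: str = ""

@dataclass
class TableData:            # stand-in for the project model
    rows: int
    cols: int
    cells: List[TableCell] = field(default_factory=list)
    headers: Optional[List[str]] = None

def extract_table(html: str) -> Optional[TableData]:
    """Standalone equivalent of _extract_table_data, for testing."""
    if not html or '<table' not in html.lower():
        return None
    table = BeautifulSoup(html, 'html.parser').find('table')
    if not table:
        return None
    cells, headers = [], []
    rows = table.find_all('tr')
    for r, row in enumerate(rows):
        for c, cell in enumerate(row.find_all(['td', 'th'])):
            text = cell.get_text(strip=True)
            cells.append(TableCell(r, c, int(cell.get('rowspan', 1)),
                                   int(cell.get('colspan', 1)), text))
            if r == 0 or cell.name == 'th':
                headers.append(text)
    return TableData(rows=len(rows),
                     cols=max(len(x.find_all(['td', 'th'])) for x in rows) if rows else 0,
                     cells=cells, headers=headers or None)

sample = "<table><tr><th>Name</th><th>Qty</th></tr><tr><td>Bolt</td><td>40</td></tr></table>"
result = extract_table(sample)
# result.rows == 2, result.cols == 2, four populated cells, headers ["Name", "Qty"]
```

This is the success criterion for the fix: every cell carries its content, so the PDF Generator never sees an empty `cells` array for a parseable table.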
## Risks / Trade-offs
| Risk | Mitigation |
|---|---|
| BeautifulSoup not installed | Add to requirements.txt; it's already a common dependency |
| Malformed HTML causes parsing errors | Use try/except with fallback to current behavior |
| Performance impact from HTML parsing | Minimal; tables are small; BeautifulSoup is fast |
| Complex rowspan/colspan calculations | Start with simple col tracking; enhance if needed |
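If the simple `col_idx` tracking in the implementation proves insufficient, a grid-occupancy approach can assign true column positions under spans. This is a sketch of the possible enhancement, not part of the proposed change; `assign_grid_columns` is a hypothetical helper operating on `(colspan, rowspan)` pairs per cell:

```python
def assign_grid_columns(rows):
    """Assign each cell its true starting column, accounting for spans.

    rows: list of rows, each a list of (colspan, rowspan) pairs.
    Returns a parallel list of starting column indices per cell.
    """
    occupied = {}  # (row, col) -> True for positions covered by an earlier span
    result = []
    for r, row in enumerate(rows):
        cols = []
        c = 0
        for colspan, rowspan in row:
            # Skip columns already claimed by a rowspan from a previous row.
            while occupied.get((r, c)):
                c += 1
            cols.append(c)
            # Mark every position this cell covers.
            for dr in range(rowspan):
                for dc in range(colspan):
                    occupied[(r + dr, c + dc)] = True
            c += colspan
        result.append(cols)
    return result
```

For example, a 2x2 table whose first cell has `rowspan=2` places the single cell of the second row in column 1, not column 0, which naive `enumerate` indexing would get wrong.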
## Dependencies

- `beautifulsoup4`: already a common dependency; add to `requirements.txt` if not present
## Open Questions

- Q: Should we preserve the original HTML in metadata for debugging?
  - A: Optional enhancement; not required for the initial fix