fix: OCR track table data format and image cropping

Table data format fixes (ocr_to_unified_converter.py):
- Fix ElementType string conversion using value-based lookup
- Add content-based HTML table detection (reclassify TEXT to TABLE)
- Use BeautifulSoup for robust HTML table parsing
- Generate TableData with fully populated cells arrays

Image cropping for OCR track (pp_structure_enhanced.py):
- Add _crop_and_save_image method for extracting image regions
- Pass source_image_path to _process_parsing_res_list
- Return relative filename (not full path) for saved_path
- Consistent with Direct Track image saving pattern

Also includes:
- Add beautifulsoup4 to requirements.txt
- Add architecture overview documentation
- Archive fix-ocr-track-table-data-format proposal (22/24 tasks)

Known issue: OCR track images are restored but still have quality problems;
these will be addressed in a follow-up proposal.

🤖 Generated with [Claude Code](https://claude.com/claude-code)

Co-Authored-By: Claude <noreply@anthropic.com>
egg
2025-11-26 18:48:15 +08:00
parent a227311b2d
commit 6e050eb540
8 changed files with 585 additions and 30 deletions

# Design: Fix OCR Track Table Data Format
## Context
The OCR processing pipeline has three modes:
1. **Direct Track**: Extracts structured data directly from native PDFs using `direct_extraction_engine.py`
2. **OCR Track**: Uses PP-StructureV3 for layout analysis and OCR, then converts results via `ocr_to_unified_converter.py`
3. **Hybrid Mode**: Uses Direct Track as primary, supplements with OCR Track for missing images only
Both tracks produce `UnifiedDocument` containing `DocumentElement` objects. For tables, the `content` field should contain a `TableData` object with populated `cells` array. However, OCR Track currently produces `TableData` with empty `cells`, causing PDF generation failures.
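For reference, a minimal sketch of the shared models as this document uses them. The field names are inferred from the implementation sketch later in this document; the authoritative definitions live in the project's data models.

```python
from dataclasses import dataclass, field
from typing import List, Optional

@dataclass
class TableCell:
    """One logical cell; spans default to 1 (no merging)."""
    row: int
    col: int
    content: str
    row_span: int = 1
    col_span: int = 1

@dataclass
class TableData:
    """What PDF generation expects: counts plus fully populated cells."""
    rows: int
    cols: int
    cells: List[TableCell] = field(default_factory=list)
    headers: Optional[List[str]] = None
```

The bug described above is a `TableData` whose `rows`/`cols` are set but whose `cells` list is empty, which is exactly the shape the fix eliminates for OCR Track.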
## Track Isolation Analysis (Safety Guarantee)
This section documents why the proposed changes will NOT affect Direct Track or Hybrid Mode.
### Code Flow Analysis
```
┌─────────────────────────────────────────────────────────────────────────┐
│ ocr_service.py │
├─────────────────────────────────────────────────────────────────────────┤
│ │
│ Direct Track ──► DirectExtractionEngine ──► UnifiedDocument │
│ (direct_extraction_engine.py) (tables: TableData ✓) │
│ [NOT MODIFIED] │
│ │
│ OCR Track ────► PP-StructureV3 ──► OCRToUnifiedConverter ──► UnifiedDoc│
│ (ocr_to_unified_converter.py) │
│ [MODIFIED: _extract_table_data] │
│ │
│ Hybrid Mode ──► Direct Track (primary) + OCR Track (images only) │
│ │ │ │
│ │ └──► _merge_ocr_images_into_ │
│ │ direct() merges ONLY: │
│ │ - ElementType.FIGURE │
│ │ - ElementType.IMAGE │
│ │ - ElementType.LOGO │
│ │ [Tables NOT merged] │
│ └──► Tables come from Direct Track (unchanged) │
└─────────────────────────────────────────────────────────────────────────┘
```
### Evidence from ocr_service.py
**Line 1610** (Hybrid mode merge logic):
```python
image_types = {ElementType.FIGURE, ElementType.IMAGE, ElementType.LOGO}
```
**Lines 1634-1635** (Only image types are merged):
```python
for element in ocr_page.elements:
if element.type in image_types: # Tables excluded
```
### Impact Matrix
| Mode | Table Source | Uses OCRToUnifiedConverter? | Affected by Change? |
|------|--------------|----------------------------|---------------------|
| Direct Track | `DirectExtractionEngine` | No | **No** |
| OCR Track | `OCRToUnifiedConverter` | Yes | **Yes (Fixed)** |
| Hybrid Mode | `DirectExtractionEngine` (tables) | Only for images | **No** |
### Conclusion
The fix is **isolated to OCR Track only**:
- Direct Track: Uses separate engine (`DirectExtractionEngine`), completely unaffected
- Hybrid Mode: Tables come from Direct Track; OCR Track is only used for image extraction
- OCR Track: Will benefit from the fix with proper `TableData` output
## Goals / Non-Goals
### Goals
- OCR Track table output format matches Direct Track format exactly
- PDF Generator receives consistent `TableData` objects from both tracks
- Robust HTML table parsing that handles real-world OCR output
### Non-Goals
- Modifying Direct Track behavior (it's the reference implementation)
- Changing the `TableData` or `TableCell` data models
- Modifying PDF Generator to handle HTML strings as a workaround
## Decisions
### Decision 1: Use BeautifulSoup for HTML Parsing
**Rationale**: The current regex/string-counting approach is fragile and cannot extract cell content. BeautifulSoup provides:
- Robust handling of malformed HTML (common in OCR output)
- Easy extraction of cell content, attributes (rowspan, colspan)
- Well-tested library already used in many Python projects
**Alternatives considered**:
- Manual regex parsing: Too fragile for complex tables
- lxml: More complex API, overkill for this use case
- html.parser (stdlib): Less tolerant of malformed HTML
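To illustrate the tolerance claim: even with `html.parser` as the backend, BeautifulSoup recovers row and cell structure from HTML with unclosed cells, an unquoted attribute, and a missing `</table>` -- all common in OCR output. A minimal sketch, not project code:

```python
from bs4 import BeautifulSoup

# Unclosed <th>/<td>, an unquoted colspan, and no closing </table>.
malformed = ('<table><tr><th>Name</th><th colspan=2>Size</tr>'
             '<tr><td>Bolt</td><td>M4</td><td>M5</td>')
soup = BeautifulSoup(malformed, 'html.parser')
rows = soup.find('table').find_all('tr')

# Stray </tr> end tags implicitly close the open cell, so the grid survives.
grid = [[c.get_text(strip=True) for c in r.find_all(['td', 'th'])]
        for r in rows]
# grid == [['Name', 'Size'], ['Bolt', 'M4', 'M5']]

# Unquoted attribute values are still parsed.
colspan = rows[0].find_all('th')[1].get('colspan')  # '2'
```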
### Decision 2: Maintain Backward Compatibility
**Rationale**: If BeautifulSoup parsing fails, fall back to current behavior (return `TableData` with basic row/col counts). This ensures existing functionality isn't broken.
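A sketch of the fallback shape, using plain dicts in place of the project's `TableData` model and an injected `full_parser` standing in for the BeautifulSoup path (both are illustrative assumptions, not project API):

```python
import logging
import re
from typing import Callable, Optional

logger = logging.getLogger(__name__)

def parse_table(html: str,
                full_parser: Callable[[str], Optional[dict]]) -> dict:
    """Prefer the rich parser; on any failure, degrade to the pre-fix
    behavior (row/col counts, no cell content) so one bad table never
    breaks the whole conversion."""
    try:
        parsed = full_parser(html)
        if parsed is not None:
            return parsed
    except Exception as exc:
        logger.warning("HTML table parse failed, falling back to counts: %s", exc)
    # Fallback: count <tr> tags, and <td>/<th> tags in the first row.
    row_tags = re.findall(r"<tr\b", html, re.IGNORECASE)
    first_row = html.split("</tr>")[0]
    cols = len(re.findall(r"<t[dh]\b", first_row, re.IGNORECASE))
    return {"rows": len(row_tags), "cols": cols, "cells": []}
```

The key property is that the fallback output is structurally identical to today's output, so callers that already tolerate empty `cells` keep working unchanged.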
### Decision 3: Single Point of Change
**Rationale**: Only modify `ocr_to_unified_converter.py`. This:
- Minimizes regression risk
- Keeps Direct Track untouched as reference
- Requires no changes to downstream PDF Generator
## Implementation Approach
```python
# Requires beautifulsoup4 (see Dependencies); imported at module level.
from bs4 import BeautifulSoup

def _extract_table_data(self, elem_data: Dict) -> Optional[TableData]:
    """Extract table data from an element's HTML using BeautifulSoup."""

    def _span(value) -> int:
        # A malformed rowspan/colspan attribute should not discard the
        # whole table; treat it as an unspanned cell instead.
        try:
            return max(1, int(value))
        except (TypeError, ValueError):
            return 1

    try:
        html = elem_data.get('html', '') or elem_data.get('content', '')
        if not html or '<table' not in html.lower():
            return None
        soup = BeautifulSoup(html, 'html.parser')
        table = soup.find('table')
        if not table:
            return None
        cells = []
        headers = []
        rows = table.find_all('tr')
        for row_idx, row in enumerate(rows):
            row_cells = row.find_all(['td', 'th'])
            for col_idx, cell in enumerate(row_cells):
                cell_content = cell.get_text(strip=True)
                cells.append(TableCell(
                    row=row_idx,
                    col=col_idx,
                    row_span=_span(cell.get('rowspan', 1)),
                    col_span=_span(cell.get('colspan', 1)),
                    content=cell_content
                ))
                # Collect headers from the first row or from <th> elements
                if row_idx == 0 or cell.name == 'th':
                    headers.append(cell_content)
        return TableData(
            rows=len(rows),
            cols=max((len(r.find_all(['td', 'th'])) for r in rows), default=0),
            cells=cells,
            headers=headers if headers else None
        )
    except Exception as e:
        logger.warning(f"Failed to parse HTML table: {e}")
        return None  # Fallback handled by caller
```
## Risks / Trade-offs
| Risk | Mitigation |
|------|------------|
| BeautifulSoup not installed | Add to requirements.txt; it's already a common dependency |
| Malformed HTML causes parsing errors | Use try/except with fallback to current behavior |
| Performance impact from HTML parsing | Minimal; tables are small; BeautifulSoup is fast |
| Complex rowspan/colspan calculations | Start with simple col tracking; enhance if needed |
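If the simple per-row `col_idx` tracking proves insufficient, the enhancement is a grid-occupancy pass that assigns true column indices, skipping slots already covered by a `rowspan` from an earlier row. A standalone sketch, independent of the project models:

```python
def assign_grid_positions(rows):
    """rows: list of rows, each a list of (content, row_span, col_span).
    Returns (row, col, row_span, col_span, content) tuples in which `col`
    accounts for cells spanning down from earlier rows."""
    occupied = set()  # (row, col) slots covered by some earlier cell's span
    placed = []
    for r, row in enumerate(rows):
        c = 0
        for content, rs, cs in row:
            # Skip columns filled by a rowspan from a row above.
            while (r, c) in occupied:
                c += 1
            placed.append((r, c, rs, cs, content))
            # Mark every slot this cell covers as occupied.
            for dr in range(rs):
                for dc in range(cs):
                    occupied.add((r + dr, c + dc))
            c += cs
    return placed
```

For example, a cell with `row_span=2` in column 0 pushes the first cell of the next row to column 1, which the simple enumerate-based indexing would miss.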
## Dependencies
- `beautifulsoup4`: Already commonly available, add to requirements.txt if not present
## Open Questions
- Q: Should we preserve the original HTML in metadata for debugging?
- A: Optional enhancement; not required for initial fix