# Design: Fix OCR Track Table Data Format

## Context

The OCR processing pipeline has three modes:

1. **Direct Track**: Extracts structured data directly from native PDFs using `direct_extraction_engine.py`
2. **OCR Track**: Uses PP-StructureV3 for layout analysis and OCR, then converts results via `ocr_to_unified_converter.py`
3. **Hybrid Mode**: Uses Direct Track as primary, supplementing with OCR Track for missing images only

Both tracks produce `UnifiedDocument` containing `DocumentElement` objects. For tables, the `content` field should contain a `TableData` object with populated `cells` array. However, OCR Track currently produces `TableData` with empty `cells`, causing PDF generation failures.
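
For reference, a minimal sketch of the shape these models are assumed to have (field names follow this document's usage, not necessarily the actual module):

```python
from dataclasses import dataclass, field
from typing import List, Optional

# Hypothetical stand-ins for the shared models; the real definitions
# live in the codebase and may differ in detail.
@dataclass
class TableCell:
    row: int
    col: int
    content: str
    row_span: int = 1
    col_span: int = 1

@dataclass
class TableData:
    rows: int
    cols: int
    cells: List[TableCell] = field(default_factory=list)
    headers: Optional[List[str]] = None

# The failure mode described above: OCR Track sets the counts
# but leaves cells empty, which breaks PDF generation downstream.
broken = TableData(rows=2, cols=3)
assert broken.cells == []
```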

## Track Isolation Analysis (Safety Guarantee)

This section documents why the proposed changes will NOT affect Direct Track or Hybrid Mode.

### Code Flow Analysis

```
┌─────────────────────────────────────────────────────────────────────────┐
│ ocr_service.py                                                          │
├─────────────────────────────────────────────────────────────────────────┤
│                                                                         │
│ Direct Track ──► DirectExtractionEngine ──► UnifiedDocument             │
│ (direct_extraction_engine.py)               (tables: TableData ✓)       │
│                  [NOT MODIFIED]                                         │
│                                                                         │
│ OCR Track ────► PP-StructureV3 ──► OCRToUnifiedConverter ──► UnifiedDoc │
│                 (ocr_to_unified_converter.py)                           │
│                 [MODIFIED: _extract_table_data]                         │
│                                                                         │
│ Hybrid Mode ──► Direct Track (primary) + OCR Track (images only)        │
│                      │                      │                           │
│                      │                      └──► _merge_ocr_images_into_│
│                      │                           direct() merges ONLY:  │
│                      │                           - ElementType.FIGURE   │
│                      │                           - ElementType.IMAGE    │
│                      │                           - ElementType.LOGO     │
│                      │                           [Tables NOT merged]    │
│                      └──► Tables come from Direct Track (unchanged)     │
│                                                                         │
└─────────────────────────────────────────────────────────────────────────┘
```

### Evidence from ocr_service.py

**Line 1610** (hybrid mode merge logic):

```python
image_types = {ElementType.FIGURE, ElementType.IMAGE, ElementType.LOGO}
```

**Lines 1634-1635** (only image types are merged):

```python
for element in ocr_page.elements:
    if element.type in image_types:  # Tables excluded
```

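The filter these snippets describe can be sketched end to end; the `Element` dataclass and `merge_ocr_elements` function below are simplified stand-ins, not the real `ocr_service.py` code:

```python
from dataclasses import dataclass
from enum import Enum, auto
from typing import List

class ElementType(Enum):
    TEXT = auto()
    TABLE = auto()
    FIGURE = auto()
    IMAGE = auto()
    LOGO = auto()

@dataclass
class Element:
    type: ElementType

# Mirrors the filter above: only image-like elements cross from the
# OCR Track into the merged hybrid result; tables never do.
IMAGE_TYPES = {ElementType.FIGURE, ElementType.IMAGE, ElementType.LOGO}

def merge_ocr_elements(ocr_elements: List[Element]) -> List[Element]:
    return [e for e in ocr_elements if e.type in IMAGE_TYPES]

mixed = [Element(ElementType.TABLE), Element(ElementType.FIGURE), Element(ElementType.TEXT)]
assert [e.type for e in merge_ocr_elements(mixed)] == [ElementType.FIGURE]
```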
### Impact Matrix

| Mode | Table Source | Uses OCRToUnifiedConverter? | Affected by Change? |
|------|--------------|-----------------------------|---------------------|
| Direct Track | `DirectExtractionEngine` | No | **No** |
| OCR Track | `OCRToUnifiedConverter` | Yes | **Yes (fixed)** |
| Hybrid Mode | `DirectExtractionEngine` (tables) | Only for images | **No** |

### Conclusion
The fix is **isolated to the OCR Track**:

- Direct Track: uses a separate engine (`DirectExtractionEngine`), completely unaffected
- Hybrid Mode: tables come from the Direct Track; the OCR Track is used only for image extraction
- OCR Track: benefits from the fix with properly populated `TableData` output

## Goals / Non-Goals

### Goals

- OCR Track table output format matches Direct Track format exactly
- PDF Generator receives consistent `TableData` objects from both tracks
- Robust HTML table parsing that handles real-world OCR output

### Non-Goals

- Modifying Direct Track behavior (it's the reference implementation)
- Changing the `TableData` or `TableCell` data models
- Modifying the PDF Generator to handle HTML strings as a workaround

## Decisions

### Decision 1: Use BeautifulSoup for HTML Parsing

**Rationale**: The current regex/string-counting approach is fragile and cannot extract cell content. BeautifulSoup provides:

- Robust handling of malformed HTML (common in OCR output)
- Easy extraction of cell content and attributes (`rowspan`, `colspan`)
- A well-tested library already used in many Python projects

**Alternatives considered**:

- Manual regex parsing: too fragile for complex tables
- lxml: more complex API, overkill for this use case
- html.parser (stdlib): less tolerant of malformed HTML

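For illustration, the kind of extraction the rationale describes (the sample HTML here is invented, not actual OCR output):

```python
from bs4 import BeautifulSoup

# Invented OCR-style table HTML for demonstration purposes.
html = """
<table>
  <tr><th>Name</th><th colspan="2">Score</th></tr>
  <tr><td>Alice</td><td>10</td><td>12</td></tr>
</table>
"""

soup = BeautifulSoup(html, "html.parser")
rows = soup.find("table").find_all("tr")
assert len(rows) == 2
# Cell text and span attributes fall out directly.
assert rows[0].find("th").get_text(strip=True) == "Name"
assert int(rows[0].find_all("th")[1].get("colspan", 1)) == 2
assert [td.get_text(strip=True) for td in rows[1].find_all("td")] == ["Alice", "10", "12"]
```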
### Decision 2: Maintain Backward Compatibility

**Rationale**: If BeautifulSoup parsing fails, fall back to the current behavior (return `TableData` with basic row/column counts). This ensures existing functionality isn't broken.

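One possible shape for the fallback, sketched with a stand-in parser and plain dicts rather than the real `TableData` model:

```python
import logging
from typing import Optional

logger = logging.getLogger(__name__)

def _parse_with_beautifulsoup(html: str) -> dict:
    # Stand-in for the full BeautifulSoup path; here it always fails
    # so the fallback branch below is exercised.
    raise ValueError("parser unavailable")

def extract_table(html: str) -> Optional[dict]:
    """Try the rich parse first; on any failure, fall back to the
    pre-fix behavior of basic row counts with empty cells."""
    try:
        return _parse_with_beautifulsoup(html)
    except Exception as e:
        logger.warning("HTML table parse failed, using basic counts: %s", e)
        return {"rows": html.lower().count("<tr"), "cols": 0, "cells": []}

result = extract_table("<table><tr><td>x</td></tr></table>")
assert result == {"rows": 1, "cols": 0, "cells": []}
```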
### Decision 3: Single Point of Change

**Rationale**: Only modify `ocr_to_unified_converter.py`. This:

- Minimizes regression risk
- Keeps the Direct Track untouched as the reference
- Requires no changes to the downstream PDF Generator

## Implementation Approach

```python
def _extract_table_data(self, elem_data: Dict) -> Optional[TableData]:
    """Extract table data from element HTML using BeautifulSoup."""
    try:
        html = elem_data.get('html', '') or elem_data.get('content', '')
        if not html or '<table' not in html.lower():
            return None

        soup = BeautifulSoup(html, 'html.parser')
        table = soup.find('table')
        if not table:
            return None

        cells = []
        headers = []
        rows = table.find_all('tr')

        for row_idx, row in enumerate(rows):
            row_cells = row.find_all(['td', 'th'])
            for col_idx, cell in enumerate(row_cells):
                cell_content = cell.get_text(strip=True)
                rowspan = int(cell.get('rowspan', 1))
                colspan = int(cell.get('colspan', 1))

                cells.append(TableCell(
                    row=row_idx,
                    col=col_idx,
                    row_span=rowspan,
                    col_span=colspan,
                    content=cell_content,
                ))

                # Collect headers from the first row or <th> elements
                if row_idx == 0 or cell.name == 'th':
                    headers.append(cell_content)

        return TableData(
            rows=len(rows),
            cols=max(len(row.find_all(['td', 'th'])) for row in rows) if rows else 0,
            cells=cells,
            headers=headers if headers else None,
        )
    except Exception as e:
        logger.warning(f"Failed to parse HTML table: {e}")
        return None  # Fallback handled by caller
```

## Risks / Trade-offs

| Risk | Mitigation |
|------|------------|
| BeautifulSoup not installed | Add to requirements.txt; it's already a common dependency |
| Malformed HTML causes parsing errors | Use try/except with fallback to current behavior |
| Performance impact from HTML parsing | Minimal; tables are small and BeautifulSoup is fast |
| Complex rowspan/colspan calculations | Start with simple column tracking; enhance if needed |

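On the last row: if simple column tracking proves insufficient, a grid-occupancy pass is the usual next step. A sketch of that enhancement, not part of the current proposal:

```python
from typing import List, Tuple

def assign_grid_positions(rows: List[List[Tuple[str, int, int]]]) -> List[Tuple[int, int, str]]:
    """Assign (row, col) positions to cells given as (content, rowspan, colspan),
    skipping grid slots already occupied by spans from earlier rows."""
    occupied: set = set()
    placed = []
    for r, row in enumerate(rows):
        c = 0
        for content, rowspan, colspan in row:
            # Advance past slots claimed by a rowspan from above.
            while (r, c) in occupied:
                c += 1
            placed.append((r, c, content))
            # Claim every slot this cell covers.
            for dr in range(rowspan):
                for dc in range(colspan):
                    occupied.add((r + dr, c + dc))
            c += colspan
    return placed

# "A" spans two rows, so "C" in the second row must land at column 1.
table = [[("A", 2, 1), ("B", 1, 1)], [("C", 1, 1)]]
assert assign_grid_positions(table) == [(0, 0, "A"), (0, 1, "B"), (1, 1, "C")]
```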
## Dependencies

- `beautifulsoup4`: already commonly available; add to requirements.txt if not present

## Open Questions

- Q: Should we preserve the original HTML in metadata for debugging?
- A: Optional enhancement; not required for the initial fix