# Design: Fix OCR Track Table Data Format

## Context

The OCR processing pipeline has three modes:

- Direct Track: Extracts structured data directly from native PDFs using `direct_extraction_engine.py`
- OCR Track: Uses PP-StructureV3 for layout analysis and OCR, then converts results via `ocr_to_unified_converter.py`
- Hybrid Mode: Uses Direct Track as primary, supplements with OCR Track for missing images only

Both tracks produce a `UnifiedDocument` containing `DocumentElement` objects. For tables, the `content` field should contain a `TableData` object with a populated `cells` array. However, OCR Track currently produces `TableData` with empty `cells`, causing PDF generation failures.
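To make the format contract concrete, here is a minimal sketch using stand-in dataclasses. The field names mirror those used in the Implementation Approach section below; the real `TableData`/`TableCell` models live in the project's shared models module, so treat these definitions as illustrative only.

```python
from dataclasses import dataclass, field
from typing import List, Optional

# Stand-ins for the project's models (field names taken from the
# Implementation Approach section; actual definitions live elsewhere).
@dataclass
class TableCell:
    row: int
    col: int
    row_span: int = 1
    col_span: int = 1
    content: str = ""

@dataclass
class TableData:
    rows: int
    cols: int
    cells: List[TableCell] = field(default_factory=list)
    headers: Optional[List[str]] = None

# What Direct Track produces (and OCR Track should): populated cells.
good = TableData(rows=1, cols=2, cells=[
    TableCell(row=0, col=0, content="Name"),
    TableCell(row=0, col=1, content="Value"),
], headers=["Name", "Value"])

# What OCR Track currently produces: row/col counts only, empty cells.
bad = TableData(rows=1, cols=2, cells=[])
```

The PDF Generator iterates `cells` to lay out the table, which is why the empty-`cells` form fails downstream.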
## Track Isolation Analysis (Safety Guarantee)

This section documents why the proposed changes will NOT affect Direct Track or Hybrid Mode.

### Code Flow Analysis
```
ocr_service.py
│
├── Direct Track ──► DirectExtractionEngine ──► UnifiedDocument
│                    (direct_extraction_engine.py)  (tables: TableData ✓)
│                    [NOT MODIFIED]
│
├── OCR Track ─────► PP-StructureV3 ──► OCRToUnifiedConverter ──► UnifiedDocument
│                    (ocr_to_unified_converter.py)
│                    [MODIFIED: _extract_table_data]
│
└── Hybrid Mode ───► Direct Track (primary) + OCR Track (images only)
                     │
                     ├──► _merge_ocr_images_into_direct() merges ONLY:
                     │      - ElementType.FIGURE
                     │      - ElementType.IMAGE
                     │      - ElementType.LOGO
                     │      [Tables NOT merged]
                     └──► Tables come from Direct Track (unchanged)
```
### Evidence from ocr_service.py

Line 1610 (hybrid-mode merge logic):

```python
image_types = {ElementType.FIGURE, ElementType.IMAGE, ElementType.LOGO}
```

Lines 1634-1635 (only image types are merged):

```python
for element in ocr_page.elements:
    if element.type in image_types:  # Tables excluded
```
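The merge guard above can be reduced to a runnable sketch. `ElementType` here is a stand-in enum with illustrative values, and the filter operates on bare enum members rather than full elements, purely to show the set-membership check:

```python
from enum import Enum

# Stand-in for the project's ElementType enum (values illustrative).
class ElementType(Enum):
    TEXT = "text"
    TABLE = "table"
    FIGURE = "figure"
    IMAGE = "image"
    LOGO = "logo"

image_types = {ElementType.FIGURE, ElementType.IMAGE, ElementType.LOGO}

def merge_candidates(ocr_element_types):
    """Mirror of the hybrid-mode filter: only image-like elements pass."""
    return [t for t in ocr_element_types if t in image_types]
```

Because `ElementType.TABLE` is never in `image_types`, no table produced by OCR Track can reach the merged document in Hybrid Mode.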
### Impact Matrix

| Mode | Table Source | Uses OCRToUnifiedConverter? | Affected by Change? |
|---|---|---|---|
| Direct Track | `DirectExtractionEngine` | No | No |
| OCR Track | `OCRToUnifiedConverter` | Yes | Yes (fixed) |
| Hybrid Mode | `DirectExtractionEngine` (tables) | Only for images | No |
### Conclusion

The fix is isolated to OCR Track only:

- Direct Track: Uses a separate engine (`DirectExtractionEngine`); completely unaffected
- Hybrid Mode: Tables come from Direct Track; OCR Track is only used for image extraction
- OCR Track: Will benefit from the fix with proper `TableData` output
## Goals / Non-Goals

### Goals

- OCR Track table output format matches Direct Track format exactly
- PDF Generator receives consistent `TableData` objects from both tracks
- Robust HTML table parsing that handles real-world OCR output

### Non-Goals

- Modifying Direct Track behavior (it is the reference implementation)
- Changing the `TableData` or `TableCell` data models
- Modifying PDF Generator to handle HTML strings as a workaround
## Decisions

### Decision 1: Use BeautifulSoup for HTML Parsing

Rationale: The current regex/string-counting approach is fragile and cannot extract cell content. BeautifulSoup provides:

- Robust handling of malformed HTML (common in OCR output)
- Easy extraction of cell content and attributes (`rowspan`, `colspan`)
- A well-tested library already used in many Python projects

Alternatives considered:

- Manual regex parsing: too fragile for complex tables
- `lxml`: more complex API; overkill for this use case
- `html.parser` (stdlib, used directly): less tolerant of malformed HTML than BeautifulSoup's tree building
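To illustrate the tolerance point, here is a small example; the malformed snippet is invented, but the pattern (missing closing tags at the end of a table) is typical of OCR-generated HTML:

```python
from bs4 import BeautifulSoup

# OCR output often truncates closing tags; BeautifulSoup still builds a
# usable tree, so the cell text survives.
malformed = "<table><tr><td>Region</td><td>Q1"  # no </td></tr></table>
soup = BeautifulSoup(malformed, "html.parser")
cells = [td.get_text(strip=True) for td in soup.find_all("td")]
# cells == ["Region", "Q1"]
```

A strict parser would reject this input outright; BeautifulSoup closes the open elements at end of input and recovers both cells.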
### Decision 2: Maintain Backward Compatibility

Rationale: If BeautifulSoup parsing fails, fall back to the current behavior (return `TableData` with basic row/column counts only). This ensures existing functionality isn't broken.
### Decision 3: Single Point of Change

Rationale: Only modify `ocr_to_unified_converter.py`. This:

- Minimizes regression risk
- Keeps Direct Track untouched as the reference implementation
- Requires no changes to the downstream PDF Generator
## Implementation Approach

```python
from bs4 import BeautifulSoup  # beautifulsoup4

def _extract_table_data(self, elem_data: Dict) -> Optional[TableData]:
    """Extract table data from element HTML using BeautifulSoup."""
    try:
        html = elem_data.get('html', '') or elem_data.get('content', '')
        if not html or '<table' not in html.lower():
            return None

        soup = BeautifulSoup(html, 'html.parser')
        table = soup.find('table')
        if not table:
            return None

        cells = []
        headers = []
        rows = table.find_all('tr')
        for row_idx, row in enumerate(rows):
            row_cells = row.find_all(['td', 'th'])
            for col_idx, cell in enumerate(row_cells):
                cell_content = cell.get_text(strip=True)
                rowspan = int(cell.get('rowspan', 1))
                colspan = int(cell.get('colspan', 1))
                cells.append(TableCell(
                    row=row_idx,
                    col=col_idx,
                    row_span=rowspan,
                    col_span=colspan,
                    content=cell_content,
                ))
                # Collect headers from the first row or from <th> elements
                if row_idx == 0 or cell.name == 'th':
                    headers.append(cell_content)

        return TableData(
            rows=len(rows),
            cols=max(len(r.find_all(['td', 'th'])) for r in rows) if rows else 0,
            cells=cells,
            headers=headers if headers else None,
        )
    except Exception as e:
        logger.warning(f"Failed to parse HTML table: {e}")
        return None  # Fallback handled by the caller
```
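As a sanity check, the same parsing logic can be exercised standalone. Stand-in dataclasses replace the project models, the `self` parameter is dropped, and the sample HTML is invented for illustration:

```python
from dataclasses import dataclass, field
from typing import List, Optional
from bs4 import BeautifulSoup

@dataclass
class TableCell:            # stand-in for the project model
    row: int
    col: int
    row_span: int = 1
    col_span: int = 1
    content: str = ""

@dataclass
class TableData:            # stand-in for the project model
    rows: int
    cols: int
    cells: List[TableCell] = field(default_factory=list)
    headers: Optional[List[str]] = None

def extract_table(html: str) -> Optional[TableData]:
    """Standalone equivalent of _extract_table_data, for testing."""
    if not html or '<table' not in html.lower():
        return None
    table = BeautifulSoup(html, 'html.parser').find('table')
    if not table:
        return None
    cells, headers = [], []
    rows = table.find_all('tr')
    for r, row in enumerate(rows):
        for c, cell in enumerate(row.find_all(['td', 'th'])):
            text = cell.get_text(strip=True)
            cells.append(TableCell(r, c, int(cell.get('rowspan', 1)),
                                   int(cell.get('colspan', 1)), text))
            if r == 0 or cell.name == 'th':
                headers.append(text)
    return TableData(rows=len(rows),
                     cols=max(len(x.find_all(['td', 'th'])) for x in rows) if rows else 0,
                     cells=cells, headers=headers or None)

sample = "<table><tr><th>Name</th><th>Qty</th></tr><tr><td>Bolt</td><td>40</td></tr></table>"
result = extract_table(sample)
# result.rows == 2, result.cols == 2, four populated cells, headers ["Name", "Qty"]
```

This is the success criterion for the fix: every cell carries its content, so the PDF Generator never sees an empty `cells` array for a parseable table.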
## Risks / Trade-offs
| Risk | Mitigation |
|---|---|
| BeautifulSoup not installed | Add to requirements.txt; it's already a common dependency |
| Malformed HTML causes parsing errors | Use try/except with fallback to current behavior |
| Performance impact from HTML parsing | Minimal; tables are small; BeautifulSoup is fast |
| Complex rowspan/colspan calculations | Start with simple col tracking; enhance if needed |
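If the simple `col_idx` tracking in the implementation proves insufficient, a grid-occupancy approach can assign true column positions under spans. This is a sketch of the possible enhancement, not part of the proposed change; `assign_grid_columns` is a hypothetical helper operating on `(colspan, rowspan)` pairs per cell:

```python
def assign_grid_columns(rows):
    """Assign each cell its true starting column, accounting for spans.

    rows: list of rows, each a list of (colspan, rowspan) pairs.
    Returns a parallel list of starting column indices per cell.
    """
    occupied = {}  # (row, col) -> True for positions covered by an earlier span
    result = []
    for r, row in enumerate(rows):
        cols = []
        c = 0
        for colspan, rowspan in row:
            # Skip columns already claimed by a rowspan from a previous row.
            while occupied.get((r, c)):
                c += 1
            cols.append(c)
            # Mark every position this cell covers.
            for dr in range(rowspan):
                for dc in range(colspan):
                    occupied[(r + dr, c + dc)] = True
            c += colspan
        result.append(cols)
    return result
```

For example, a 2x2 table whose first cell has `rowspan=2` places the single cell of the second row in column 1, not column 0, which naive `enumerate` indexing would get wrong.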
## Dependencies

- `beautifulsoup4`: already a common dependency; add to `requirements.txt` if not present
## Open Questions

- Q: Should we preserve the original HTML in metadata for debugging?
  - A: Optional enhancement; not required for the initial fix