fix: OCR track table data format and image cropping

Table data format fixes (ocr_to_unified_converter.py):
- Fix ElementType string conversion using value-based lookup
- Add content-based HTML table detection (reclassify TEXT to TABLE)
- Use BeautifulSoup for robust HTML table parsing
- Generate TableData with fully populated cells arrays

Image cropping for OCR track (pp_structure_enhanced.py):
- Add _crop_and_save_image method for extracting image regions
- Pass source_image_path to _process_parsing_res_list
- Return relative filename (not full path) for saved_path
- Consistent with Direct Track image saving pattern

Also includes:
- Add beautifulsoup4 to requirements.txt
- Add architecture overview documentation
- Archive fix-ocr-track-table-data-format proposal (22/24 tasks)

Known issues: OCR track images are restored but still have quality issues
that will be addressed in a follow-up proposal.

🤖 Generated with [Claude Code](https://claude.com/claude-code)

Co-Authored-By: Claude <noreply@anthropic.com>
egg
2025-11-26 18:48:15 +08:00
parent a227311b2d
commit 6e050eb540
8 changed files with 585 additions and 30 deletions

@@ -0,0 +1,173 @@
# Design: Fix OCR Track Table Data Format
## Context
The OCR processing pipeline has three modes:
1. **Direct Track**: Extracts structured data directly from native PDFs using `direct_extraction_engine.py`
2. **OCR Track**: Uses PP-StructureV3 for layout analysis and OCR, then converts results via `ocr_to_unified_converter.py`
3. **Hybrid Mode**: Uses Direct Track as primary, supplements with OCR Track for missing images only
Both tracks produce a `UnifiedDocument` containing `DocumentElement` objects. For tables, the `content` field should contain a `TableData` object with a populated `cells` array. However, OCR Track currently produces `TableData` with an empty `cells` array, causing PDF generation failures.
## Track Isolation Analysis (Safety Guarantee)
This section documents why the proposed changes will NOT affect Direct Track or Hybrid Mode.
### Code Flow Analysis
```
┌─────────────────────────────────────────────────────────────────────────┐
│ ocr_service.py │
├─────────────────────────────────────────────────────────────────────────┤
│ │
│ Direct Track ──► DirectExtractionEngine ──► UnifiedDocument │
│ (direct_extraction_engine.py) (tables: TableData ✓) │
│ [NOT MODIFIED] │
│ │
│ OCR Track ────► PP-StructureV3 ──► OCRToUnifiedConverter ──► UnifiedDoc│
│ (ocr_to_unified_converter.py) │
│ [MODIFIED: _extract_table_data] │
│ │
│ Hybrid Mode ──► Direct Track (primary) + OCR Track (images only) │
│ │ │ │
│ │ └──► _merge_ocr_images_into_ │
│ │ direct() merges ONLY: │
│ │ - ElementType.FIGURE │
│ │ - ElementType.IMAGE │
│ │ - ElementType.LOGO │
│ │ [Tables NOT merged] │
│ └──► Tables come from Direct Track (unchanged) │
└─────────────────────────────────────────────────────────────────────────┘
```
### Evidence from ocr_service.py
**Line 1610** (Hybrid mode merge logic):
```python
image_types = {ElementType.FIGURE, ElementType.IMAGE, ElementType.LOGO}
```
**Lines 1634-1635** (Only image types are merged):
```python
for element in ocr_page.elements:
    if element.type in image_types:  # Tables excluded
```
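
The filtering rule quoted above can be illustrated in isolation. The following is a minimal sketch, not the code from `ocr_service.py`: the `ElementType` members and the `Element` dataclass are simplified stand-ins assumed for illustration, and `select_mergeable` is a hypothetical helper name.

```python
from dataclasses import dataclass
from enum import Enum
from typing import List


class ElementType(Enum):
    # Hypothetical subset of the project's ElementType values.
    TEXT = "text"
    TABLE = "table"
    FIGURE = "figure"
    IMAGE = "image"
    LOGO = "logo"


@dataclass
class Element:
    # Stand-in for DocumentElement; only the fields used in this sketch.
    type: ElementType
    name: str


IMAGE_TYPES = {ElementType.FIGURE, ElementType.IMAGE, ElementType.LOGO}


def select_mergeable(ocr_elements: List[Element]) -> List[Element]:
    """Keep only the OCR Track elements that a Hybrid Mode merge would accept."""
    return [e for e in ocr_elements if e.type in IMAGE_TYPES]


ocr_page = [
    Element(ElementType.TABLE, "ocr-table"),
    Element(ElementType.FIGURE, "ocr-figure"),
    Element(ElementType.TEXT, "ocr-text"),
    Element(ElementType.LOGO, "ocr-logo"),
]
merged = select_mergeable(ocr_page)
print([e.name for e in merged])
```

Because membership in the image-type set is the only admission test, an OCR Track table never reaches the merged output regardless of how its `TableData` is built.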
### Impact Matrix
| Mode | Table Source | Uses OCRToUnifiedConverter? | Affected by Change? |
|------|--------------|----------------------------|---------------------|
| Direct Track | `DirectExtractionEngine` | No | **No** |
| OCR Track | `OCRToUnifiedConverter` | Yes | **Yes (Fixed)** |
| Hybrid Mode | `DirectExtractionEngine` (tables) | Only for images | **No** |
### Conclusion
The fix is **isolated to OCR Track only**:
- Direct Track: Uses separate engine (`DirectExtractionEngine`), completely unaffected
- Hybrid Mode: Tables come from Direct Track; OCR Track is only used for image extraction
- OCR Track: Will benefit from the fix with proper `TableData` output
## Goals / Non-Goals
### Goals
- OCR Track table output format matches Direct Track format exactly
- PDF Generator receives consistent `TableData` objects from both tracks
- Robust HTML table parsing that handles real-world OCR output
### Non-Goals
- Modifying Direct Track behavior (it's the reference implementation)
- Changing the `TableData` or `TableCell` data models
- Modifying PDF Generator to handle HTML strings as a workaround
## Decisions
### Decision 1: Use BeautifulSoup for HTML Parsing
**Rationale**: The current regex/string-counting approach is fragile and cannot extract cell content. BeautifulSoup provides:
- Robust handling of malformed HTML (common in OCR output)
- Easy extraction of cell content, attributes (rowspan, colspan)
- Well-tested library already used in many Python projects
**Alternatives considered**:
- Manual regex parsing: Too fragile for complex tables
- lxml: More complex API, overkill for this use case
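- html.parser (stdlib) used directly, without BeautifulSoup: Less tolerant of malformed HTML and requires hand-written tag handlers (BeautifulSoup can still use it as its parser backend)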
### Decision 2: Maintain Backward Compatibility
**Rationale**: If BeautifulSoup parsing fails, fall back to current behavior (return `TableData` with basic row/col counts). This ensures existing functionality isn't broken.
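
As an illustration of the fallback, the tag-counting behavior can be sketched with the standard library alone. This is a rough approximation under stated assumptions, not the existing converter code: `fallback_table_shape` is a hypothetical name, and the regular expressions are one possible way to count `<tr>` tags and the `<td>`/`<th>` tags of the first row.

```python
import re
from typing import Tuple


def fallback_table_shape(html: str) -> Tuple[int, int]:
    """Estimate (rows, cols) by counting tags; no cell content is extracted."""
    lowered = html.lower()
    # Rows: every opening <tr>, with or without attributes.
    rows = len(re.findall(r"<tr[\s>]", lowered))
    # Columns: <td>/<th> openings between the first <tr> and the next row boundary.
    first_row = re.search(r"<tr[\s>](.*?)(?:</tr|<tr[\s>]|$)", lowered, re.S)
    cols = len(re.findall(r"<t[dh][\s>]", first_row.group(1))) if first_row else 0
    return rows, cols


well_formed = "<table><tr><th>A</th><th>B</th></tr><tr><td>1</td><td>2</td></tr></table>"
unclosed = "<table><tr><td>x<td>y<tr><td>z</table>"
print(fallback_table_shape(well_formed))
print(fallback_table_shape(unclosed))
```

A result of this kind carries only the table's shape, which is why it is suitable as a last resort but cannot populate `cells`.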
### Decision 3: Single Point of Change
**Rationale**: Only modify `ocr_to_unified_converter.py`. This:
- Minimizes regression risk
- Keeps Direct Track untouched as reference
- Requires no changes to downstream PDF Generator
## Implementation Approach
```python
from typing import Dict, Optional

from bs4 import BeautifulSoup

# TableData, TableCell and logger come from the converter module itself.


def _extract_table_data(self, elem_data: Dict) -> Optional[TableData]:
    """Extract table data from element using BeautifulSoup."""
    try:
        html = elem_data.get('html', '') or elem_data.get('content', '')
        if not html or '<table' not in html.lower():
            return None
        soup = BeautifulSoup(html, 'html.parser')
        table = soup.find('table')
        if not table:
            return None
        cells = []
        headers = []
        rows = table.find_all('tr')
        for row_idx, row in enumerate(rows):
            row_cells = row.find_all(['td', 'th'])
            # col_idx is the cell's position within its own row; it is not
            # adjusted for spans from earlier cells (see Risks / Trade-offs).
            for col_idx, cell in enumerate(row_cells):
                cell_content = cell.get_text(strip=True)
                # A non-numeric span attribute raises ValueError, which the
                # outer handler turns into a warning and a None result.
                rowspan = int(cell.get('rowspan', 1))
                colspan = int(cell.get('colspan', 1))
                cells.append(TableCell(
                    row=row_idx,
                    col=col_idx,
                    row_span=rowspan,
                    col_span=colspan,
                    content=cell_content
                ))
                # Collect headers from first row or <th> elements
                if row_idx == 0 or cell.name == 'th':
                    headers.append(cell_content)
        return TableData(
            rows=len(rows),
            cols=max(len(row.find_all(['td', 'th'])) for row in rows) if rows else 0,
            cells=cells,
            headers=headers if headers else None
        )
    except Exception as e:
        logger.warning(f"Failed to parse HTML table: {e}")
        return None  # Fallback handled by caller
```
## Risks / Trade-offs
| Risk | Mitigation |
|------|------------|
| BeautifulSoup not installed | Add to requirements.txt; it's already a common dependency |
| Malformed HTML causes parsing errors | Use try/except with fallback to current behavior |
| Performance impact from HTML parsing | Expected to be minimal: OCR tables are typically small, so parsing cost should be negligible next to OCR itself |
| Complex rowspan/colspan calculations | Start with simple col tracking; enhance if needed |
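
If the simple column tracking proves insufficient, one possible enhancement is an occupancy grid that skips slots already covered by an earlier `rowspan` or `colspan`. The sketch below is an assumption about how that could look, not part of this change: `assign_grid_positions` is a hypothetical helper, and it takes plain `(row_span, col_span)` pairs rather than parsed HTML so that it stays self-contained.

```python
from typing import List, Set, Tuple

# Each row is a list of (row_span, col_span) pairs in document order.
RowSpec = List[Tuple[int, int]]


def assign_grid_positions(rows: List[RowSpec]) -> List[Tuple[int, int]]:
    """Return the (row, col) grid position of every cell, honoring spans."""
    occupied: Set[Tuple[int, int]] = set()
    positions: List[Tuple[int, int]] = []
    for row_idx, row in enumerate(rows):
        col = 0
        for row_span, col_span in row:
            # Skip grid slots already claimed by a span from an earlier cell.
            while (row_idx, col) in occupied:
                col += 1
            positions.append((row_idx, col))
            # Mark every slot this cell covers as occupied.
            for r in range(row_idx, row_idx + row_span):
                for c in range(col, col + col_span):
                    occupied.add((r, c))
            col += col_span
    return positions


# First cell spans two rows, so the only cell in row 1 lands in column 1.
spanning = [[(2, 1), (1, 1)], [(1, 1)]]
print(assign_grid_positions(spanning))
```

With plain per-row enumeration the last cell in this example would be placed at column 0, overlapping the cell that spans down from the first row.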
## Dependencies
- `beautifulsoup4`: Already commonly available, add to requirements.txt if not present
## Open Questions
- Q: Should we preserve the original HTML in metadata for debugging?
- A: Optional enhancement; not required for initial fix

@@ -0,0 +1,45 @@
# Change: Fix OCR Track Table Data Format to Match Direct Track
## Why
OCR Track produces HTML strings for table content instead of structured `TableData` objects, causing PDF generation to render raw HTML code as plain text. Direct Track correctly produces `TableData` objects with a populated `cells` array, resulting in proper table rendering. This inconsistency creates a poor user experience when using OCR Track for documents containing tables.
## What Changes
- **Enhance `_extract_table_data` method** in `ocr_to_unified_converter.py` to properly parse HTML tables into structured `TableData` objects with populated `TableCell` arrays
- **Add BeautifulSoup-based HTML table parsing** to robustly extract cell content and row/column spans from OCR-generated HTML tables
- **Ensure format consistency** between OCR Track and Direct Track table output, allowing PDF Generator to handle a single standardized format
## Impact
- Affected specs: `ocr-processing`
- Affected code:
  - `backend/app/services/ocr_to_unified_converter.py` (primary changes)
  - `backend/app/services/pdf_generator_service.py` (no changes needed - already handles `TableData`)
  - `backend/app/services/direct_extraction_engine.py` (no changes - serves as reference implementation)
## Evidence
### Direct Track (Reference - Correct Behavior)
`direct_extraction_engine.py:846-850`:
```python
table_data = TableData(
rows=len(data),
cols=max(len(row) for row in data) if data else 0,
cells=cells, # Properly populated with TableCell objects
headers=data[0] if data else None
)
```
### OCR Track (Current - Problematic)
`ocr_to_unified_converter.py:574-579`:
```python
return TableData(
rows=rows, # Only counts from html.count('<tr')
cols=cols, # Only counts from <td>/<th> in first row
cells=cells, # Always empty list []
caption=extracted_text
)
```
The `cells` array is always empty because the current HTML parsing only counts tags but doesn't extract actual cell content.

@@ -0,0 +1,51 @@
## ADDED Requirements
### Requirement: OCR Track Table Data Structure Consistency
The OCR Track SHALL produce `TableData` objects with fully populated `cells` arrays that match the format produced by Direct Track, ensuring consistent table rendering across both processing tracks.
#### Scenario: OCR Track produces structured TableData for HTML tables
- **GIVEN** a document with tables is processed via OCR Track
- **WHEN** PP-StructureV3 returns HTML table content in the `html` or `content` field
- **THEN** the `ocr_to_unified_converter` SHALL parse the HTML and produce a `TableData` object
- **AND** the `TableData.cells` array SHALL contain `TableCell` objects for each cell
- **AND** each `TableCell` SHALL have correct `row`, `col`, and `content` values
- **AND** the output format SHALL match Direct Track's `TableData` structure
#### Scenario: OCR Track handles tables with merged cells
- **GIVEN** an HTML table with `rowspan` or `colspan` attributes
- **WHEN** the table is converted to `TableData`
- **THEN** each `TableCell` SHALL have correct `row_span` and `col_span` values
- **AND** the cell content SHALL be correctly extracted
#### Scenario: OCR Track handles header rows
- **GIVEN** an HTML table with `<th>` elements or a header row
- **WHEN** the table is converted to `TableData`
- **THEN** the `TableData.headers` field SHALL contain the header cell contents
- **AND** header cells SHALL also be included in the `cells` array
#### Scenario: OCR Track gracefully handles malformed HTML tables
- **GIVEN** an HTML table with malformed markup (missing closing tags, invalid nesting)
- **WHEN** parsing is attempted
- **THEN** the system SHALL attempt best-effort parsing using a tolerant HTML parser
- **AND** if parsing fails completely, SHALL fall back to returning basic `TableData` with row/col counts
- **AND** SHALL log a warning for debugging purposes
#### Scenario: PDF Generator renders OCR Track tables correctly
- **GIVEN** a `UnifiedDocument` from OCR Track containing table elements
- **WHEN** the PDF Generator processes the document
- **THEN** tables SHALL be rendered as formatted tables (not as raw HTML text)
- **AND** the rendering SHALL be identical to Direct Track table rendering
#### Scenario: Direct Track table processing remains unchanged
- **GIVEN** a native PDF with embedded tables
- **WHEN** the document is processed via Direct Track
- **THEN** the `DirectExtractionEngine` SHALL continue to produce `TableData` objects as before
- **AND** the `ocr_to_unified_converter.py` changes SHALL NOT affect Direct Track processing
- **AND** table rendering in PDF output SHALL be identical to pre-fix behavior
#### Scenario: Hybrid Mode table source isolation
- **GIVEN** a document processed via Hybrid Mode (Direct Track primary + OCR Track for images)
- **WHEN** the system merges OCR Track results into Direct Track results
- **THEN** only image elements (FIGURE, IMAGE, LOGO) SHALL be merged from OCR Track
- **AND** table elements SHALL exclusively come from Direct Track
- **AND** no OCR Track table data SHALL contaminate the final output

@@ -0,0 +1,43 @@
# Tasks: Fix OCR Track Table Data Format
## 1. Implementation
- [x] 1.1 Add BeautifulSoup import and dependency check in `ocr_to_unified_converter.py`
- [x] 1.2 Rewrite `_extract_table_data` method to parse HTML using BeautifulSoup
- [x] 1.3 Extract cell content, row index, column index for each `<td>` and `<th>` element
- [x] 1.4 Handle `rowspan` and `colspan` attributes for merged cells
- [x] 1.5 Create `TableCell` objects with proper content and positioning
- [x] 1.6 Populate `TableData.cells` array with extracted `TableCell` objects
- [x] 1.7 Preserve header detection (`<th>` elements) and store in `TableData.headers`
## 2. Edge Case Handling
- [x] 2.1 Handle malformed HTML tables gracefully (missing closing tags, nested tables)
- [x] 2.2 Handle empty cells (create TableCell with empty string content)
- [x] 2.3 Handle tables without `<tr>` structure (fallback to current behavior)
- [x] 2.4 Log warnings for unparseable tables instead of failing silently
## 3. Testing
- [x] 3.1 Create unit tests for `_extract_table_data` with various HTML table formats
- [x] 3.2 Test simple tables (basic rows/columns)
- [x] 3.3 Test tables with merged cells (rowspan/colspan)
- [x] 3.4 Test tables with header rows (`<th>` elements)
- [x] 3.5 Test malformed HTML tables (handled via BeautifulSoup's tolerance)
- [ ] 3.6 Integration test: OCR Track PDF generation with tables
## 4. Verification (Track Isolation)
- [x] 4.1 Compare OCR Track table output format with Direct Track output format
- [ ] 4.2 Verify PDF Generator renders OCR Track tables correctly
- [x] 4.3 **Direct Track regression test**: `direct_extraction_engine.py` NOT modified (confirmed via git status)
- [x] 4.4 **Hybrid Mode regression test**: `ocr_service.py` NOT modified, image merge logic unchanged
- [x] 4.5 **OCR Track fix verification**: Unit tests confirm:
  - `TableData.cells` array is populated (6 cells in 3x2 table)
  - `TableCell` objects have correct row/col/content values
  - Headers extracted correctly
- [x] 4.6 Verify `DirectExtractionEngine` code is NOT modified (isolation check - confirmed)
## 5. Dependencies
- [x] 5.1 Add `beautifulsoup4>=4.12.0` to `requirements.txt`