OCR/openspec/changes/archive/2025-11-26-fix-ocr-track-table-data-format/proposal.md
egg 6e050eb540 fix: OCR track table data format and image cropping
Table data format fixes (ocr_to_unified_converter.py):
- Fix ElementType string conversion using value-based lookup
- Add content-based HTML table detection (reclassify TEXT to TABLE)
- Use BeautifulSoup for robust HTML table parsing
- Generate TableData with fully populated cells arrays

Image cropping for OCR track (pp_structure_enhanced.py):
- Add _crop_and_save_image method for extracting image regions
- Pass source_image_path to _process_parsing_res_list
- Return relative filename (not full path) for saved_path
- Consistent with Direct Track image saving pattern

Also includes:
- Add beautifulsoup4 to requirements.txt
- Add architecture overview documentation
- Archive fix-ocr-track-table-data-format proposal (22/24 tasks)

Known issues: OCR track images are restored but still have quality issues
that will be addressed in a follow-up proposal.

🤖 Generated with [Claude Code](https://claude.com/claude-code)

Co-Authored-By: Claude <noreply@anthropic.com>
2025-11-26 18:48:15 +08:00


# Change: Fix OCR Track Table Data Format to Match Direct Track
## Why
OCR Track produces HTML strings for table content instead of structured `TableData` objects, so PDF generation renders the raw HTML markup as plain text. Direct Track correctly produces `TableData` objects with a populated `cells` array, which results in proper table rendering. This inconsistency gives OCR Track users a poor experience on documents that contain tables.
## What Changes
- **Enhance `_extract_table_data` method** in `ocr_to_unified_converter.py` to properly parse HTML tables into structured `TableData` objects with populated `TableCell` arrays
- **Add BeautifulSoup-based HTML table parsing** to robustly extract cell content and row/column spans from OCR-generated HTML tables
- **Ensure format consistency** between OCR Track and Direct Track table output, allowing PDF Generator to handle a single standardized format
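
The parsing step described above could look roughly like the following. This is a minimal sketch, not the project's implementation: the `TableCell` and `TableData` dataclasses are hypothetical stand-ins whose field names (`row`, `col`, `content`, `row_span`, `col_span`) are assumptions, and `parse_html_table` is an illustrative name rather than the real `_extract_table_data` method. Nested tables are not handled.

```python
from dataclasses import dataclass, field
from typing import List, Optional

from bs4 import BeautifulSoup


@dataclass
class TableCell:  # hypothetical stand-in for the project's TableCell
    row: int
    col: int
    content: str
    row_span: int = 1
    col_span: int = 1


@dataclass
class TableData:  # hypothetical stand-in for the project's TableData
    rows: int
    cols: int
    cells: List[TableCell] = field(default_factory=list)
    caption: Optional[str] = None


def _span(tag, attr: str) -> int:
    """Read a rowspan/colspan attribute, falling back to 1 on bad input."""
    try:
        return max(1, int(tag.get(attr, 1)))
    except (TypeError, ValueError):
        return 1


def parse_html_table(html: str) -> Optional[TableData]:
    """Parse the first <table> in `html` into a TableData with populated cells."""
    soup = BeautifulSoup(html, "html.parser")
    table = soup.find("table")
    if table is None:
        return None

    cells: List[TableCell] = []
    occupied = set()  # grid slots already covered by a cell or its span
    table_rows = table.find_all("tr")
    for r, tr in enumerate(table_rows):
        c = 0
        for td in tr.find_all(["td", "th"], recursive=False):
            # Skip slots taken by a rowspan from an earlier row.
            while (r, c) in occupied:
                c += 1
            rs, cs = _span(td, "rowspan"), _span(td, "colspan")
            cells.append(TableCell(r, c, td.get_text(" ", strip=True), rs, cs))
            for dr in range(rs):
                for dc in range(cs):
                    occupied.add((r + dr, c + dc))
            c += cs

    cols = max((col for _, col in occupied), default=-1) + 1
    return TableData(rows=len(table_rows), cols=cols, cells=cells)
```

Tracking occupied grid slots is one way to keep column indices correct when an earlier row contains a `rowspan`; a simpler parser that ignores spans would shift later cells to the left.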
## Impact
- Affected specs: `ocr-processing`
- Affected code:
  - `backend/app/services/ocr_to_unified_converter.py` (primary changes)
  - `backend/app/services/pdf_generator_service.py` (no changes needed - already handles `TableData`)
  - `backend/app/services/direct_extraction_engine.py` (no changes - serves as reference implementation)
## Evidence
### Direct Track (Reference - Correct Behavior)
`direct_extraction_engine.py:846-850`:
```python
table_data = TableData(
    rows=len(data),
    cols=max(len(row) for row in data) if data else 0,
    cells=cells,  # Properly populated with TableCell objects
    headers=data[0] if data else None
)
```
### OCR Track (Current - Problematic)
`ocr_to_unified_converter.py:574-579`:
```python
return TableData(
    rows=rows,    # Only counts from html.count('<tr')
    cols=cols,    # Only counts from <td>/<th> in first row
    cells=cells,  # Always empty list []
    caption=extracted_text
)
```
The `cells` array is always empty because the current HTML parsing only counts tags but doesn't extract actual cell content.
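
To make the failure mode concrete, here is a small illustration of a tag-counting approach. It is reconstructed only from the comments in the snippet above, not copied from `ocr_to_unified_converter.py`, and `count_table_shape` is a hypothetical name. It recovers the table's dimensions but never reads the text inside any cell.

```python
import re


def count_table_shape(html: str):
    """Tag counting: yields row/column counts but no cell content."""
    rows = html.count("<tr")
    first_row = re.search(r"<tr[^>]*>(.*?)</tr>", html, re.DOTALL | re.IGNORECASE)
    # Count opening <td>/<th> tags in the first row only.
    cols = len(re.findall(r"<t[dh][\s>]", first_row.group(1))) if first_row else 0
    cells = []  # nothing is ever appended, so the list stays empty
    return rows, cols, cells


html = (
    "<table><tr><th>Name</th><th>Qty</th></tr>"
    "<tr><td>Apple</td><td>3</td></tr></table>"
)
rows, cols, cells = count_table_shape(html)
print(rows, cols, cells)
```

With dimensions but no cells, a downstream renderer has nothing structured to draw, which is consistent with the raw HTML ending up as plain text in the PDF.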