fix: OCR track table data format and image cropping

Table data format fixes (ocr_to_unified_converter.py):
- Fix ElementType string conversion using value-based lookup
- Add content-based HTML table detection (reclassify TEXT to TABLE)
- Use BeautifulSoup for robust HTML table parsing
- Generate TableData with fully populated cells arrays

Image cropping for OCR track (pp_structure_enhanced.py):
- Add _crop_and_save_image method for extracting image regions
- Pass source_image_path to _process_parsing_res_list
- Return relative filename (not full path) for saved_path
- Consistent with Direct Track image saving pattern

Also includes:
- Add beautifulsoup4 to requirements.txt
- Add architecture overview documentation
- Archive fix-ocr-track-table-data-format proposal (22/24 tasks)

Known issues: OCR track images are restored but still have quality issues
that will be addressed in a follow-up proposal.

🤖 Generated with [Claude Code](https://claude.com/claude-code)

Co-Authored-By: Claude <noreply@anthropic.com>
@@ -0,0 +1,45 @@
# Change: Fix OCR Track Table Data Format to Match Direct Track

## Why

OCR Track produces HTML strings for table content instead of structured `TableData` objects, so PDF generation renders the raw HTML as plain text. Direct Track correctly produces `TableData` objects with a populated `cells` array, which render as proper tables. This inconsistency makes for a poor user experience when OCR Track is used on documents containing tables.

## What Changes

- **Enhance `_extract_table_data` method** in `ocr_to_unified_converter.py` to properly parse HTML tables into structured `TableData` objects with populated `TableCell` arrays
- **Add BeautifulSoup-based HTML table parsing** to robustly extract cell content and row/column spans from OCR-generated HTML tables
- **Ensure format consistency** between OCR Track and Direct Track table output, allowing PDF Generator to handle a single standardized format
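
The parsing step described in the list above can be sketched roughly as follows. This is only an illustration under stated assumptions: it uses the standard-library `html.parser` instead of BeautifulSoup so that it runs without extra dependencies, and `Cell`, `TableHTMLParser`, and `parse_table` are hypothetical names, not the project's `TableCell` or `_extract_table_data`.

```python
from dataclasses import dataclass
from html.parser import HTMLParser


@dataclass
class Cell:
    """Hypothetical stand-in for the project's TableCell (fields assumed)."""
    row: int
    col: int
    content: str = ""
    row_span: int = 1
    col_span: int = 1


class TableHTMLParser(HTMLParser):
    """Collect one Cell per <td>/<th>, tracking row and column positions."""

    def __init__(self):
        super().__init__()
        self.cells = []
        self._row = -1
        self._col = 0
        self._current = None    # cell being filled, or None outside a cell
        self._occupied = set()  # (row, col) slots already covered by a span

    def handle_starttag(self, tag, attrs):
        if tag == "tr":
            self._row += 1
            self._col = 0
        elif tag in ("td", "th"):
            if self._row < 0:   # cell without an enclosing <tr>
                self._row = 0
            attrs = dict(attrs)
            # Skip slots taken by a rowspan that started in an earlier row.
            while (self._row, self._col) in self._occupied:
                self._col += 1
            row_span = int(attrs.get("rowspan") or 1)
            col_span = int(attrs.get("colspan") or 1)
            self._current = Cell(self._row, self._col, "", row_span, col_span)
            for r in range(row_span):
                for c in range(col_span):
                    self._occupied.add((self._row + r, self._col + c))
            self._col += col_span

    def handle_data(self, data):
        if self._current is not None:
            self._current.content += data

    def handle_endtag(self, tag):
        if tag in ("td", "th") and self._current is not None:
            self._current.content = self._current.content.strip()
            self.cells.append(self._current)
            self._current = None


def parse_table(html):
    """Return (rows, cols, cells) for the cells found in an HTML string."""
    parser = TableHTMLParser()
    parser.feed(html)
    parser.close()
    cells = parser.cells
    rows = max((c.row + c.row_span for c in cells), default=0)
    cols = max((c.col + c.col_span for c in cells), default=0)
    return rows, cols, cells
```

Unlike tag counting, this keeps the text and span of every cell, which is the information a renderer needs to draw the table. A BeautifulSoup version would follow the same shape but tolerate malformed OCR output better.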

## Impact

- Affected specs: `ocr-processing`
- Affected code:
  - `backend/app/services/ocr_to_unified_converter.py` (primary changes)
  - `backend/app/services/pdf_generator_service.py` (no changes needed - already handles `TableData`)
  - `backend/app/services/direct_extraction_engine.py` (no changes - serves as reference implementation)

## Evidence

### Direct Track (Reference - Correct Behavior)

`direct_extraction_engine.py:846-850`:

```python
table_data = TableData(
    rows=len(data),
    cols=max(len(row) for row in data) if data else 0,
    cells=cells,  # Properly populated with TableCell objects
    headers=data[0] if data else None
)
```

### OCR Track (Current - Problematic)

`ocr_to_unified_converter.py:574-579`:

```python
return TableData(
    rows=rows,    # Only counts from html.count('<tr')
    cols=cols,    # Only counts from <td>/<th> in first row
    cells=cells,  # Always empty list []
    caption=extracted_text
)
```

The `cells` array is always empty because the current HTML parsing only counts tags but doesn't extract actual cell content.
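
To make the failure mode concrete, here is a rough reconstruction of the counting approach implied by the comments in the snippet above. It is an assumption about the behavior, not the actual code in `ocr_to_unified_converter.py`; `count_only` and the sample HTML are hypothetical.

```python
import re


def count_only(html):
    """Derive table dimensions by counting tags; no cell content is read."""
    rows = html.count("<tr")
    first_row = re.search(r"<tr.*?</tr>", html, re.DOTALL)
    # Count <td>/<th> openings in the first row only.
    cols = len(re.findall(r"<t[dh][\s>]", first_row.group(0))) if first_row else 0
    cells = []  # never populated, so a renderer has nothing to draw
    return rows, cols, cells


html = (
    "<table><tr><th>Name</th><th>Qty</th></tr>"
    "<tr><td>Bolt</td><td>4</td></tr></table>"
)
rows, cols, cells = count_only(html)
print(rows, cols, cells)  # 2 2 []
```

The dimensions come out right, but "Name", "Qty", "Bolt", and "4" are never captured, which matches the empty `cells` list described above.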