Table data format fixes (ocr_to_unified_converter.py):
- Fix ElementType string conversion using value-based lookup
- Add content-based HTML table detection (reclassify TEXT to TABLE)
- Use BeautifulSoup for robust HTML table parsing
- Generate TableData with fully populated cells arrays

Image cropping for OCR track (pp_structure_enhanced.py):
- Add _crop_and_save_image method for extracting image regions
- Pass source_image_path to _process_parsing_res_list
- Return relative filename (not full path) for saved_path
- Consistent with Direct Track image saving pattern

Also includes:
- Add beautifulsoup4 to requirements.txt
- Add architecture overview documentation
- Archive fix-ocr-track-table-data-format proposal (22/24 tasks)

Known issues: OCR track images are restored but still have quality issues
that will be addressed in a follow-up proposal.

🤖 Generated with [Claude Code](https://claude.com/claude-code)

Co-Authored-By: Claude <noreply@anthropic.com>
# Change: Fix OCR Track Table Data Format to Match Direct Track
## Why
OCR Track produces HTML strings for table content instead of structured `TableData` objects, so PDF generation renders the raw HTML as plain text. Direct Track correctly produces `TableData` objects with a populated `cells` array, which render as proper tables. This inconsistency creates a poor user experience when OCR Track processes documents containing tables.
## What Changes
- **Enhance the `_extract_table_data` method** in `ocr_to_unified_converter.py` to parse HTML tables into structured `TableData` objects with populated `TableCell` arrays
- **Add BeautifulSoup-based HTML table parsing** to robustly extract cell content and row/column spans from OCR-generated HTML tables
- **Ensure format consistency** between OCR Track and Direct Track table output, so that PDF Generator handles a single standardized format
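
The parsing step described above could look roughly like the sketch below. It is illustrative only: the `TableCell` and `TableData` dataclasses are hypothetical stand-ins (the project's real models may use different field names), and `parse_html_table` and `_span` are names invented for this example, not the actual `_extract_table_data` implementation. Only standard BeautifulSoup calls (`find`, `find_all`, `get`, `get_text`) are used, and nested tables are not handled.

```python
from dataclasses import dataclass, field
from typing import List, Optional

from bs4 import BeautifulSoup


# Hypothetical stand-ins for the project's models; the real TableCell and
# TableData may have different field names.
@dataclass
class TableCell:
    row: int
    col: int
    content: str
    row_span: int = 1
    col_span: int = 1


@dataclass
class TableData:
    rows: int
    cols: int
    cells: List[TableCell] = field(default_factory=list)
    caption: Optional[str] = None


def _span(tag, attr: str) -> int:
    """Read rowspan/colspan, falling back to 1 on missing or malformed values."""
    try:
        return max(int(tag.get(attr, 1)), 1)
    except (TypeError, ValueError):
        return 1


def parse_html_table(html: str) -> Optional[TableData]:
    """Parse the first <table> in html into a TableData with populated cells."""
    soup = BeautifulSoup(html, "html.parser")
    table = soup.find("table")
    if table is None:
        return None

    cells: List[TableCell] = []
    occupied = set()  # (row, col) grid slots already claimed by a spanning cell
    n_rows = 0
    n_cols = 0
    for r, tr in enumerate(table.find_all("tr")):
        c = 0
        for tag in tr.find_all(["td", "th"], recursive=False):
            # Skip slots covered by a rowspan from an earlier row.
            while (r, c) in occupied:
                c += 1
            rs, cs = _span(tag, "rowspan"), _span(tag, "colspan")
            for dr in range(rs):
                for dc in range(cs):
                    occupied.add((r + dr, c + dc))
            cells.append(TableCell(r, c, tag.get_text(" ", strip=True), rs, cs))
            n_rows = max(n_rows, r + rs)
            n_cols = max(n_cols, c + cs)
            c += cs
    return TableData(rows=n_rows, cols=n_cols, cells=cells)


html = "<table><tr><th colspan='2'>Parts</th></tr><tr><td>Bolt</td><td>4</td></tr></table>"
table = parse_html_table(html)
print(table.rows, table.cols, len(table.cells))  # 2 2 3
```

Tracking occupied grid slots is one way to give each cell a stable column index when earlier cells span multiple rows or columns; a simpler per-row counter would misplace cells that sit beneath a `rowspan`.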
## Impact
- Affected specs: `ocr-processing`
- Affected code:
  - `backend/app/services/ocr_to_unified_converter.py` (primary changes)
  - `backend/app/services/pdf_generator_service.py` (no changes needed - already handles `TableData`)
  - `backend/app/services/direct_extraction_engine.py` (no changes - serves as reference implementation)
## Evidence
### Direct Track (Reference - Correct Behavior)
`direct_extraction_engine.py:846-850`:
```python
table_data = TableData(
    rows=len(data),
    cols=max(len(row) for row in data) if data else 0,
    cells=cells,  # Properly populated with TableCell objects
    headers=data[0] if data else None
)
```
### OCR Track (Current - Problematic)
`ocr_to_unified_converter.py:574-579`:
```python
return TableData(
    rows=rows,  # Only counts from html.count('<tr')
    cols=cols,  # Only counts from <td>/<th> in first row
    cells=cells,  # Always empty list []
    caption=extracted_text
)
```
The `cells` array is always empty because the current HTML parsing only counts tags but doesn't extract actual cell content.
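
To make the failure mode concrete, here is a small reconstruction of tag counting based only on the comments in the snippet above. `count_only` is a hypothetical name and this is not the converter's actual code; it only shows why counting substrings yields plausible dimensions while leaving `cells` empty.

```python
def count_only(html: str):
    """Hypothetical reconstruction of count-based parsing, per the comments above."""
    rows = html.count("<tr")
    end = html.find("</tr>")
    first_row = html if end == -1 else html[:end]
    # Substring counting is fragile: "<th" would also match "<thead".
    cols = first_row.count("<td") + first_row.count("<th")
    cells = []  # nothing ever reads the text between the tags
    return rows, cols, cells


html = (
    "<table><tr><th>Item</th><th>Qty</th></tr>"
    "<tr><td>Bolt</td><td>4</td></tr></table>"
)
rows, cols, cells = count_only(html)
print(rows, cols, cells)  # 2 2 []
```

The dimensions come out right for this simple input, but the cell text ("Item", "Qty", "Bolt", "4") never leaves the HTML string, so a renderer has nothing to place in the table.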