egg/OCR

egg 6e050eb540 fix: OCR track table data format and image cropping

Table data format fixes (ocr_to_unified_converter.py):
- Fix ElementType string conversion using value-based lookup
- Add content-based HTML table detection (reclassify TEXT to TABLE)
- Use BeautifulSoup for robust HTML table parsing
- Generate TableData with fully populated cells arrays

Image cropping for OCR track (pp_structure_enhanced.py):
- Add _crop_and_save_image method for extracting image regions
- Pass source_image_path to _process_parsing_res_list
- Return relative filename (not full path) for saved_path
- Consistent with Direct Track image saving pattern

Also includes:
- Add beautifulsoup4 to requirements.txt
- Add architecture overview documentation
- Archive fix-ocr-track-table-data-format proposal (22/24 tasks)

Known issues: OCR track images are restored but still have quality issues
that will be addressed in a follow-up proposal.

🤖 Generated with [Claude Code](https://claude.com/claude-code)

Co-Authored-By: Claude <noreply@anthropic.com>

2025-11-26 18:48:15 +08:00

Change: Fix OCR Track Table Data Format to Match Direct Track

Why

OCR Track emits table content as raw HTML strings rather than structured TableData objects, so PDF generation renders the HTML source as plain text. Direct Track correctly produces TableData objects with a populated cells array, which renders as a proper table. This inconsistency creates a poor user experience when OCR Track processes documents that contain tables.

What Changes

- Enhance the _extract_table_data method in ocr_to_unified_converter.py to parse HTML tables into structured TableData objects with populated TableCell arrays
- Add BeautifulSoup-based HTML table parsing to robustly extract cell content and row/column spans from OCR-generated HTML tables
- Ensure format consistency between OCR Track and Direct Track table output, so that the PDF Generator only has to handle a single standardized format
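
As a rough illustration of the parsing step described above, the sketch below turns an HTML table string into a table object with one cell entry per `<td>`/`<th>`, including row and column spans. It is not the project's code. To stay dependency-free it swaps in the standard library's `html.parser` for BeautifulSoup, and the `TableCell`/`TableData` dataclasses are hypothetical stand-ins whose field names (`row`, `col`, `content`, `row_span`, `col_span`) are assumptions rather than the real model definitions. Nested tables are not handled.

```python
from dataclasses import dataclass, field
from html.parser import HTMLParser
from typing import List, Optional, Set, Tuple


@dataclass
class TableCell:  # hypothetical stand-in for the project's TableCell
    row: int
    col: int
    content: str
    row_span: int = 1
    col_span: int = 1


@dataclass
class TableData:  # hypothetical stand-in for the project's TableData
    rows: int
    cols: int
    cells: List[TableCell] = field(default_factory=list)


def _span(value: Optional[str]) -> int:
    """Parse a rowspan/colspan attribute, falling back to 1 on bad input."""
    try:
        return max(int(value or 1), 1)
    except ValueError:
        return 1


class _TableParser(HTMLParser):
    """Collects one TableCell per <td>/<th> and tracks its grid position."""

    def __init__(self) -> None:
        super().__init__()
        self.cells: List[TableCell] = []
        self.occupied: Set[Tuple[int, int]] = set()  # grid slots already used
        self.row = -1
        self.col = 0
        self.current: Optional[TableCell] = None

    def _finish_cell(self) -> None:
        """Store the open cell (if any) and mark every slot it spans."""
        cell = self.current
        if cell is None:
            return
        cell.content = " ".join(cell.content.split())  # collapse whitespace
        for r in range(cell.row, cell.row + cell.row_span):
            for c in range(cell.col, cell.col + cell.col_span):
                self.occupied.add((r, c))
        self.cells.append(cell)
        self.col = cell.col + cell.col_span
        self.current = None

    def handle_starttag(self, tag, attrs):
        if tag == "tr":
            self._finish_cell()  # tolerate a missing </td> before a new row
            self.row += 1
            self.col = 0
        elif tag in ("td", "th") and self.row >= 0:
            self._finish_cell()  # tolerate a missing </td> before a new cell
            # Skip slots covered by a rowspan that started in an earlier row.
            while (self.row, self.col) in self.occupied:
                self.col += 1
            attr = dict(attrs)
            self.current = TableCell(
                row=self.row, col=self.col, content="",
                row_span=_span(attr.get("rowspan")),
                col_span=_span(attr.get("colspan")),
            )

    def handle_data(self, data):
        if self.current is not None:
            self.current.content += data

    def handle_endtag(self, tag):
        if tag in ("td", "th", "tr", "table"):
            self._finish_cell()


def parse_html_table(html: str) -> TableData:
    """Build a TableData with populated cells from an HTML table string."""
    parser = _TableParser()
    parser.feed(html)
    parser.close()
    parser._finish_cell()
    rows = max((r for r, _ in parser.occupied), default=-1) + 1
    cols = max((c for _, c in parser.occupied), default=-1) + 1
    return TableData(rows=rows, cols=cols, cells=parser.cells)
```

Computing `rows` and `cols` from the occupied grid slots, rather than from tag counts, keeps the dimensions consistent with the extracted cells when spans are present.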

Impact

Affected specs: ocr-processing
Affected code:
- backend/app/services/ocr_to_unified_converter.py (primary changes)
- backend/app/services/pdf_generator_service.py (no changes needed - already handles TableData)
- backend/app/services/direct_extraction_engine.py (no changes - serves as reference implementation)

Evidence

Direct Track (Reference - Correct Behavior)

direct_extraction_engine.py:846-850:

table_data = TableData(
    rows=len(data),
    cols=max(len(row) for row in data) if data else 0,
    cells=cells,  # Properly populated with TableCell objects
    headers=data[0] if data else None
)

OCR Track (Current - Problematic)

ocr_to_unified_converter.py:574-579:

return TableData(
    rows=rows,           # Only counts from html.count('<tr')
    cols=cols,           # Only counts from <td>/<th> in first row
    cells=cells,         # Always empty list []
    caption=extracted_text
)

The cells array is always empty because the current HTML parsing only counts tags but doesn't extract actual cell content.
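
To make the failure mode concrete, here is a small reconstruction of a count-only approach. It is an assumption based on the comments in the excerpt above, not the actual contents of ocr_to_unified_converter.py, and the function name `count_only_table_stats` is hypothetical. It shows that counting tags yields plausible dimensions while leaving nothing for the PDF generator to render.

```python
import re
from typing import List, Tuple


def count_only_table_stats(html: str) -> Tuple[int, int, List[str]]:
    """Derive table dimensions by counting tags, without reading any cell."""
    # Row count comes straight from the number of opening <tr tags.
    rows = html.count("<tr")

    # Column count comes from the <td>/<th> tags inside the first row only.
    first_row = re.search(r"<tr[^>]*>(.*?)</tr>", html, re.S | re.I)
    cols = 0
    if first_row:
        cols = len(re.findall(r"<t[dh][\s>]", first_row.group(1), re.I))

    # No cell text is ever extracted, so the cell list stays empty.
    cells: List[str] = []
    return rows, cols, cells


sample = "<table><tr><td>A</td><td>B</td></tr><tr><td>1</td><td>2</td></tr></table>"
rows, cols, cells = count_only_table_stats(sample)
print(rows, cols, cells)
```

The dimensions look correct for the sample, but the text of the four cells is discarded, which matches the symptom described in the Why section.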