fix: OCR track table data format and image cropping

Table data format fixes (ocr_to_unified_converter.py):
- Fix ElementType string conversion using value-based lookup
- Add content-based HTML table detection (reclassify TEXT to TABLE)
- Use BeautifulSoup for robust HTML table parsing
- Generate TableData with fully populated cells arrays

Image cropping for OCR track (pp_structure_enhanced.py):
- Add _crop_and_save_image method for extracting image regions
- Pass source_image_path to _process_parsing_res_list
- Return relative filename (not full path) for saved_path
- Consistent with Direct Track image saving pattern

Also includes:
- Add beautifulsoup4 to requirements.txt
- Add architecture overview documentation
- Archive fix-ocr-track-table-data-format proposal (22/24 tasks)

Known issues: OCR track images are restored but still have quality issues
that will be addressed in a follow-up proposal.

🤖 Generated with [Claude Code](https://claude.com/claude-code)

Co-Authored-By: Claude <noreply@anthropic.com>
2025-11-26 18:48:15 +08:00
parent a227311b2d
commit 6e050eb540
8 changed files with 585 additions and 30 deletions
--- a/openspec/changes/archive/2025-11-26-fix-ocr-track-table-data-format/proposal.md
+++ b/openspec/changes/archive/2025-11-26-fix-ocr-track-table-data-format/proposal.md
@@ -0,0 +1,45 @@
+# Change: Fix OCR Track Table Data Format to Match Direct Track
+
+## Why
+
+OCR Track produces HTML strings for table content instead of structured `TableData` objects, so PDF generation renders the raw HTML as plain text. Direct Track correctly produces `TableData` objects with a populated `cells` array, which renders as a proper table. This inconsistency gives a poor user experience when OCR Track is used on documents containing tables.
+
+## What Changes
+
+- **Enhance `_extract_table_data` method** in `ocr_to_unified_converter.py` to properly parse HTML tables into structured `TableData` objects with populated `TableCell` arrays
+- **Add BeautifulSoup-based HTML table parsing** to robustly extract cell content and row/column spans from OCR-generated HTML tables
+- **Ensure format consistency** between OCR Track and Direct Track table output, allowing PDF Generator to handle a single standardized format
+
+## Impact
+
+- Affected specs: `ocr-processing`
+- Affected code:
+  - `backend/app/services/ocr_to_unified_converter.py` (primary changes)
+  - `backend/app/services/pdf_generator_service.py` (no changes needed - already handles `TableData`)
+  - `backend/app/services/direct_extraction_engine.py` (no changes - serves as reference implementation)
+
+## Evidence
+
+### Direct Track (Reference - Correct Behavior)
+`direct_extraction_engine.py:846-850`:
+```python
+table_data = TableData(
+    rows=len(data),
+    cols=max(len(row) for row in data) if data else 0,
+    cells=cells,  # Properly populated with TableCell objects
+    headers=data[0] if data else None
+)
+```
+
+### OCR Track (Current - Problematic)
+`ocr_to_unified_converter.py:574-579`:
+```python
+return TableData(
+    rows=rows,           # Only counts from html.count('<tr')
+    cols=cols,           # Only counts from <td>/<th> in first row
+    cells=cells,         # Always empty list []
+    caption=extracted_text
+)
+```
+
+The `cells` array is always empty because the current HTML parsing only counts tags but doesn't extract actual cell content.
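
The gap described above (counting tags versus extracting cells) can be illustrated with a minimal sketch. It is not the code from this commit: the commit uses BeautifulSoup, whereas this sketch swaps in the standard library's `html.parser.HTMLParser` so it runs without extra dependencies. The `TableCell` and `TableData` dataclasses are hypothetical stand-ins whose fields are assumed, and `parse_html_table` is an illustrative name. Nested tables are not handled.

```python
from dataclasses import dataclass, field
from html.parser import HTMLParser
from typing import List, Optional, Set, Tuple


@dataclass
class TableCell:  # hypothetical stand-in; the project's real model may differ
    row: int
    col: int
    content: str
    row_span: int = 1
    col_span: int = 1


@dataclass
class TableData:  # hypothetical stand-in with only the fields used here
    rows: int
    cols: int
    cells: List[TableCell] = field(default_factory=list)
    caption: Optional[str] = None


def _span(value: Optional[str]) -> int:
    """Parse a rowspan/colspan attribute, falling back to 1 on bad input."""
    try:
        return max(1, int(value))
    except (TypeError, ValueError):
        return 1


class _TableHTMLParser(HTMLParser):
    def __init__(self) -> None:
        super().__init__()
        self.cells: List[TableCell] = []
        self._occupied: Set[Tuple[int, int]] = set()  # grid slots already taken
        self._row = -1
        self._col = 0
        self._current: Optional[TableCell] = None
        self._text: List[str] = []

    def _flush_cell(self) -> None:
        # Finish the open cell, collapsing runs of whitespace in its text.
        if self._current is not None:
            self._current.content = " ".join("".join(self._text).split())
            self.cells.append(self._current)
            self._current = None

    def handle_starttag(self, tag, attrs):
        if tag == "tr":
            self._flush_cell()
            self._row += 1
            self._col = 0
        elif tag in ("td", "th") and self._row >= 0:
            self._flush_cell()  # tolerate a missing </td> before the next cell
            attr = dict(attrs)
            row_span = _span(attr.get("rowspan"))
            col_span = _span(attr.get("colspan"))
            # Skip slots covered by a rowspan from an earlier row.
            while (self._row, self._col) in self._occupied:
                self._col += 1
            for r in range(row_span):
                for c in range(col_span):
                    self._occupied.add((self._row + r, self._col + c))
            self._current = TableCell(self._row, self._col, "", row_span, col_span)
            self._text = []
            self._col += col_span

    def handle_endtag(self, tag):
        if tag in ("td", "th", "tr", "table"):
            self._flush_cell()

    def handle_data(self, data):
        if self._current is not None:
            self._text.append(data)


def parse_html_table(html: str, caption: Optional[str] = None) -> TableData:
    """Build a TableData with populated cells from an HTML table string."""
    parser = _TableHTMLParser()
    parser.feed(html)
    parser.close()
    parser._flush_cell()
    cells = parser.cells
    rows = max((c.row + c.row_span for c in cells), default=0)
    cols = max((c.col + c.col_span for c in cells), default=0)
    return TableData(rows=rows, cols=cols, cells=cells, caption=caption)
```

Deriving `rows` and `cols` from the extent of the parsed cells, rather than from tag counts, keeps the dimensions consistent with the `cells` list even when a table uses `rowspan` or `colspan`.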