Table data format fixes (ocr_to_unified_converter.py):
- Fix ElementType string conversion using value-based lookup
- Add content-based HTML table detection (reclassify TEXT to TABLE)
- Use BeautifulSoup for robust HTML table parsing
- Generate TableData with fully populated cells arrays

Image cropping for OCR track (pp_structure_enhanced.py):
- Add _crop_and_save_image method for extracting image regions
- Pass source_image_path to _process_parsing_res_list
- Return relative filename (not full path) for saved_path
- Consistent with Direct Track image saving pattern

Also includes:
- Add beautifulsoup4 to requirements.txt
- Add architecture overview documentation
- Archive fix-ocr-track-table-data-format proposal (22/24 tasks)

Known issues: OCR track images are restored but still have quality issues
that will be addressed in a follow-up proposal.

🤖 Generated with [Claude Code](https://claude.com/claude-code)

Co-Authored-By: Claude <noreply@anthropic.com>
Tasks: Fix OCR Track Table Data Format
1. Implementation
- 1.1 Add BeautifulSoup import and dependency check in `ocr_to_unified_converter.py`
- 1.2 Rewrite `_extract_table_data` method to parse HTML using BeautifulSoup
- 1.3 Extract cell content, row index, column index for each `<td>` and `<th>` element
- 1.4 Handle `rowspan` and `colspan` attributes for merged cells
- 1.5 Create `TableCell` objects with proper content and positioning
- 1.6 Populate `TableData.cells` array with extracted `TableCell` objects
- 1.7 Preserve header detection (`<th>` elements) and store in `TableData.headers`
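
The implementation steps above can be sketched roughly as follows. This is a minimal illustration, not the project's actual `_extract_table_data`: the `TableCell` and `TableData` dataclasses are hypothetical stand-ins (only the `cells` and `headers` fields are named in the tasks), and the function name `extract_table_data` and its field layout are assumptions. It relies only on documented BeautifulSoup calls (`find`, `find_all`, `get`, `get_text`).

```python
from dataclasses import dataclass, field
from typing import List, Optional

try:  # dependency check (task 1.1): keep the module importable without bs4
    from bs4 import BeautifulSoup
except ImportError:
    BeautifulSoup = None


@dataclass
class TableCell:  # hypothetical stand-in for the project's model
    row: int
    col: int
    content: str
    row_span: int = 1
    col_span: int = 1


@dataclass
class TableData:  # hypothetical stand-in for the project's model
    rows: int = 0
    cols: int = 0
    cells: List[TableCell] = field(default_factory=list)
    headers: List[str] = field(default_factory=list)


def _span(tag, name: str) -> int:
    """Read rowspan/colspan, falling back to 1 on missing or invalid values."""
    try:
        return max(1, int(tag.get(name, 1)))
    except (TypeError, ValueError):
        return 1


def extract_table_data(html: str) -> Optional[TableData]:
    """Parse the first <table> in `html` into cells with grid positions."""
    if BeautifulSoup is None:
        raise RuntimeError("beautifulsoup4 is required for table parsing")
    table = BeautifulSoup(html, "html.parser").find("table")
    if table is None:
        return None
    data = TableData()
    occupied = set()  # grid slots already claimed by a cell or a span
    # Note: find_all("tr") also returns rows of nested tables; a fuller
    # implementation would filter those out.
    for r, tr in enumerate(table.find_all("tr")):
        c = 0
        for tag in tr.find_all(["td", "th"], recursive=False):
            while (r, c) in occupied:  # skip slots covered by an earlier rowspan
                c += 1
            rs, cs = _span(tag, "rowspan"), _span(tag, "colspan")
            text = tag.get_text(strip=True)  # empty cells yield ""
            data.cells.append(TableCell(r, c, text, rs, cs))
            if tag.name == "th":
                data.headers.append(text)
            for dr in range(rs):
                for dc in range(cs):
                    occupied.add((r + dr, c + dc))
            c += cs
    if occupied:
        data.rows = max(row for row, _ in occupied) + 1
        data.cols = max(col for _, col in occupied) + 1
    return data
```

Tracking occupied grid slots in a set is one simple way to place cells that follow a merged cell; for a 3x2 table without merges it yields six `TableCell` entries, one per slot.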
2. Edge Case Handling
- 2.1 Handle malformed HTML tables gracefully (missing closing tags, nested tables)
- 2.2 Handle empty cells (create TableCell with empty string content)
- 2.3 Handle tables without `<tr>` structure (fallback to current behavior)
- 2.4 Log warnings for unparseable tables instead of failing silently
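
One way to combine the fallback (2.3) with the warning logs (2.4) is a thin wrapper around the parser. The sketch below is an assumption about shape only: `extract_with_fallback`, the `parse` and `fallback` callables, and the logger name are all hypothetical, and the real converter may structure this differently.

```python
import logging
from typing import Any, Callable, List

# Logger name is an assumption; the project may use a different one.
logger = logging.getLogger("ocr_to_unified_converter")


def extract_with_fallback(
    html: str,
    parse: Callable[[str], List[Any]],
    fallback: Callable[[str], Any],
) -> Any:
    """Try the structured parser; log a warning and fall back if it yields nothing."""
    try:
        cells = parse(html)
    except Exception as exc:  # unparseable table: warn instead of failing silently
        logger.warning("Table HTML could not be parsed (%s); using fallback", exc)
        return fallback(html)
    if not cells:  # e.g. a table without any <tr>/<td> structure
        logger.warning("No row/cell structure found in table HTML; using fallback")
        return fallback(html)
    return cells
```

Passing the parser and the fallback as callables keeps the wrapper independent of the HTML library, so the warning paths can be exercised with stubs.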
3. Testing
- 3.1 Create unit tests for `_extract_table_data` with various HTML table formats
- 3.2 Test simple tables (basic rows/columns)
- 3.3 Test tables with merged cells (rowspan/colspan)
- 3.4 Test tables with header rows (`<th>` elements)
- 3.5 Test malformed HTML tables (handled via BeautifulSoup's tolerance)
- 3.6 Integration test: OCR Track PDF generation with tables
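
The unit tests in this section might look something like the sketch below. To keep it runnable without extra dependencies, it tests a small stand-in, `extract_cells`, built on the standard library's `html.parser` rather than BeautifulSoup; that substitution, the function name, and the tuple layout are all assumptions, and the stand-in does not handle merged cells. In the real suite the assertions would target `_extract_table_data`.

```python
import unittest
from html.parser import HTMLParser


class _CellCollector(HTMLParser):
    """Stand-in parser: collects (row, col, text, is_header) tuples."""

    def __init__(self):
        super().__init__()
        self.cells = []
        self._row = -1
        self._col = 0
        self._open = None  # tag name of the cell currently being read
        self._text = []

    def handle_starttag(self, tag, attrs):
        if tag == "tr":
            self._row += 1
            self._col = 0
        elif tag in ("td", "th"):
            self._open = tag
            self._text = []

    def handle_data(self, data):
        if self._open:
            self._text.append(data)

    def handle_endtag(self, tag):
        if tag in ("td", "th") and self._open == tag:
            text = "".join(self._text).strip()
            self.cells.append((self._row, self._col, text, tag == "th"))
            self._col += 1
            self._open = None


def extract_cells(html):
    parser = _CellCollector()
    parser.feed(html)
    parser.close()
    return parser.cells


class ExtractCellsTest(unittest.TestCase):
    def test_simple_table(self):
        html = "<table><tr><td>a</td><td>b</td></tr><tr><td>c</td><td>d</td></tr></table>"
        cells = extract_cells(html)
        self.assertEqual(len(cells), 4)
        self.assertEqual(cells[3], (1, 1, "d", False))

    def test_header_row(self):
        html = "<table><tr><th>Name</th></tr><tr><td>x</td></tr></table>"
        headers = [text for _, _, text, is_header in extract_cells(html) if is_header]
        self.assertEqual(headers, ["Name"])

    def test_empty_cell_keeps_empty_string(self):
        cells = extract_cells("<table><tr><td></td></tr></table>")
        self.assertEqual(cells, [(0, 0, "", False)])
```

Saved as a test module, this can be run with `python -m unittest`; cases for rowspan/colspan and malformed HTML would follow the same pattern against the real parser.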
4. Verification (Track Isolation)
- 4.1 Compare OCR Track table output format with Direct Track output format
- 4.2 Verify PDF Generator renders OCR Track tables correctly
- 4.3 Direct Track regression test: `direct_extraction_engine.py` NOT modified (confirmed via git status)
- 4.4 Hybrid Mode regression test: `ocr_service.py` NOT modified, image merge logic unchanged
- 4.5 OCR Track fix verification: unit tests confirm:
  - `TableData.cells` array is populated (6 cells in 3x2 table)
  - `TableCell` objects have correct row/col/content values
  - Headers extracted correctly
- 4.6 Verify `DirectExtractionEngine` code is NOT modified (isolation check - confirmed)
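
The isolation checks in 4.3, 4.4 and 4.6 were confirmed via git status; a scripted version of that check could look like the sketch below. The helper names are hypothetical, and only the two file names come from the tasks above. It parses the short `git status --porcelain` format, where each line is a two-character status, a space, and the path.

```python
import subprocess
from typing import Iterable, List

# File names taken from tasks 4.3 and 4.4; their directories are not assumed.
PROTECTED = ("direct_extraction_engine.py", "ocr_service.py")


def changed_paths(porcelain: str) -> List[str]:
    """Extract paths from `git status --porcelain` output."""
    paths = []
    for line in porcelain.splitlines():
        if len(line) > 3:
            # Renames are reported as "old -> new"; keep the new path.
            paths.append(line[3:].split(" -> ")[-1])
    return paths


def touched_protected(porcelain: str, protected: Iterable[str] = PROTECTED) -> List[str]:
    """Return the changed paths that end with a protected file name."""
    suffixes = tuple(protected)
    return [p for p in changed_paths(porcelain) if p.endswith(suffixes)]


def git_status_porcelain(repo_dir: str = ".") -> str:
    """Run git in `repo_dir` and return its porcelain status output."""
    result = subprocess.run(
        ["git", "status", "--porcelain"],
        cwd=repo_dir, check=True, capture_output=True, text=True,
    )
    return result.stdout
```

Calling `touched_protected(git_status_porcelain())` before committing should return an empty list if the Direct Track and Hybrid Mode files are untouched. Note that git status only reports uncommitted changes, so a check against already committed work would need a diff between revisions instead.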
5. Dependencies
- 5.1 Add `beautifulsoup4>=4.12.0` to `requirements.txt`
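
A runtime check against this pin could be sketched with the standard library's `importlib.metadata`. The helper names and the simplified version parsing are assumptions; a real project would more likely rely on pip resolving `requirements.txt`.

```python
import re
from importlib import metadata

MINIMUM = (4, 12, 0)  # mirrors the pin beautifulsoup4>=4.12.0


def parse_version(text: str) -> tuple:
    """Turn '4.12.3' into (4, 12, 3); non-numeric suffixes are ignored."""
    numbers = []
    for piece in text.split(".")[:3]:
        match = re.match(r"\d+", piece)
        numbers.append(int(match.group()) if match else 0)
    while len(numbers) < 3:  # pad short versions such as '4.12'
        numbers.append(0)
    return tuple(numbers)


def bs4_status(minimum: tuple = MINIMUM) -> str:
    """Report whether the installed beautifulsoup4 satisfies the minimum."""
    try:
        installed = metadata.version("beautifulsoup4")
    except metadata.PackageNotFoundError:
        return "missing"
    return "ok" if parse_version(installed) >= minimum else "too old"
```

`bs4_status()` returns `"missing"`, `"too old"` or `"ok"`, which a startup routine could log before the converter first parses a table.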