fix: OCR track table data format and image cropping
Table data format fixes (ocr_to_unified_converter.py):
- Fix ElementType string conversion using value-based lookup
- Add content-based HTML table detection (reclassify TEXT to TABLE)
- Use BeautifulSoup for robust HTML table parsing
- Generate TableData with fully populated cells arrays

Image cropping for OCR track (pp_structure_enhanced.py):
- Add _crop_and_save_image method for extracting image regions
- Pass source_image_path to _process_parsing_res_list
- Return relative filename (not full path) for saved_path
- Consistent with Direct Track image saving pattern

Also includes:
- Add beautifulsoup4 to requirements.txt
- Add architecture overview documentation
- Archive fix-ocr-track-table-data-format proposal (22/24 tasks)

Known issues: OCR track images are restored but still have quality issues
that will be addressed in a follow-up proposal.

🤖 Generated with [Claude Code](https://claude.com/claude-code)

Co-Authored-By: Claude <noreply@anthropic.com>
@@ -0,0 +1,43 @@
# Tasks: Fix OCR Track Table Data Format

## 1. Implementation

- [x] 1.1 Add BeautifulSoup import and dependency check in `ocr_to_unified_converter.py`
- [x] 1.2 Rewrite `_extract_table_data` method to parse HTML using BeautifulSoup
- [x] 1.3 Extract cell content, row index, column index for each `<td>` and `<th>` element
- [x] 1.4 Handle `rowspan` and `colspan` attributes for merged cells
- [x] 1.5 Create `TableCell` objects with proper content and positioning
- [x] 1.6 Populate `TableData.cells` array with extracted `TableCell` objects
- [x] 1.7 Preserve header detection (`<th>` elements) and store in `TableData.headers`
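
The positioning step in tasks 1.3 to 1.6 can be sketched roughly as follows. This is an illustration only, not the code in `ocr_to_unified_converter.py`: the `TableCell` dataclass here is a hypothetical stand-in (its `row_span`, `col_span`, and `is_header` field names are assumptions), and the input is assumed to be cells already pulled out of the HTML as `(content, rowspan, colspan, is_header)` tuples.

```python
from dataclasses import dataclass
from typing import List, Set, Tuple


@dataclass
class TableCell:
    # Hypothetical stand-in for the project's TableCell; field names are assumed.
    row: int
    col: int
    content: str
    row_span: int = 1
    col_span: int = 1
    is_header: bool = False


# One parsed cell: (content, rowspan, colspan, is_header)
CellSpec = Tuple[str, int, int, bool]


def place_cells(rows: List[List[CellSpec]]) -> List[TableCell]:
    """Assign grid coordinates to cells, skipping slots covered by earlier spans."""
    occupied: Set[Tuple[int, int]] = set()  # slots taken by a cell or its span
    cells: List[TableCell] = []
    for r, row in enumerate(rows):
        c = 0
        for content, rowspan, colspan, is_header in row:
            # Advance past slots claimed by a rowspan from a previous row.
            while (r, c) in occupied:
                c += 1
            cells.append(TableCell(r, c, content, rowspan, colspan, is_header))
            # Mark every slot this cell covers as taken.
            for dr in range(rowspan):
                for dc in range(colspan):
                    occupied.add((r + dr, c + dc))
            c += colspan
    return cells


rows = [
    [("Name", 1, 1, True), ("Score", 1, 1, True)],
    [("Alice", 2, 1, False), ("90", 1, 1, False)],
    [("85", 1, 1, False)],  # "Alice" spans down into this row
]
cells = place_cells(rows)
headers = [cell.content for cell in cells if cell.is_header]
```

In this example the `85` cell lands at row 2, column 1, because column 0 of that row is still covered by the `rowspan=2` cell above it. Tracking occupied slots in a set keeps the logic independent of how the HTML was parsed.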

## 2. Edge Case Handling

- [x] 2.1 Handle malformed HTML tables gracefully (missing closing tags, nested tables)
- [x] 2.2 Handle empty cells (create TableCell with empty string content)
- [x] 2.3 Handle tables without `<tr>` structure (fallback to current behavior)
- [x] 2.4 Log warnings for unparseable tables instead of failing silently
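
A minimal sketch of how these edge cases might be handled, assuming a guarded `bs4` import and a hypothetical helper named `extract_rows` (not a function from this commit). Returning `None` is used here as the signal for the caller to fall back to its previous behavior. How well missing closing tags are repaired depends on the parser backend BeautifulSoup is given, so the sketch makes no promise about that case.

```python
import logging
from typing import List, Optional

logger = logging.getLogger(__name__)

try:
    from bs4 import BeautifulSoup
    BS4_AVAILABLE = True
except ImportError:  # dependency check: degrade instead of crashing
    BeautifulSoup = None
    BS4_AVAILABLE = False


def extract_rows(html: str) -> Optional[List[List[str]]]:
    """Return cell text per row, or None when the caller should fall back."""
    if not BS4_AVAILABLE:
        logger.warning("beautifulsoup4 is not installed; skipping table parsing")
        return None
    soup = BeautifulSoup(html, "html.parser")
    table = soup.find("table")
    if table is None:
        logger.warning("No <table> element found in HTML content")
        return None
    rows: List[List[str]] = []
    for tr in table.find_all("tr"):
        if tr.find_parent("table") is not table:
            continue  # row belongs to a nested table, not the outer one
        # recursive=False keeps cells of nested tables out of the outer row.
        cells = tr.find_all(["td", "th"], recursive=False)
        # Empty cells yield an empty string rather than being dropped.
        rows.append([cell.get_text(strip=True) for cell in cells])
    if not rows:
        logger.warning("Table has no <tr> rows; falling back")
        return None
    return rows
```

Each failure path logs a warning before returning, so an unparseable table leaves a trace in the logs instead of silently producing an empty `cells` array.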

## 3. Testing

- [x] 3.1 Create unit tests for `_extract_table_data` with various HTML table formats
- [x] 3.2 Test simple tables (basic rows/columns)
- [x] 3.3 Test tables with merged cells (rowspan/colspan)
- [x] 3.4 Test tables with header rows (`<th>` elements)
- [x] 3.5 Test malformed HTML tables (handled via BeautifulSoup's tolerance)
- [ ] 3.6 Integration test: OCR Track PDF generation with tables

## 4. Verification (Track Isolation)

- [x] 4.1 Compare OCR Track table output format with Direct Track output format
- [ ] 4.2 Verify PDF Generator renders OCR Track tables correctly
- [x] 4.3 **Direct Track regression test**: `direct_extraction_engine.py` NOT modified (confirmed via git status)
- [x] 4.4 **Hybrid Mode regression test**: `ocr_service.py` NOT modified, image merge logic unchanged
- [x] 4.5 **OCR Track fix verification**: Unit tests confirm:
  - `TableData.cells` array is populated (6 cells in 3x2 table)
  - `TableCell` objects have correct row/col/content values
  - Headers extracted correctly
- [x] 4.6 Verify `DirectExtractionEngine` code is NOT modified (isolation check - confirmed)
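
The 4.5 check (6 cells in a 3x2 table, each with correct coordinates) could be expressed as a small validator along these lines. This is a sketch under assumptions: cells are represented as plain dicts with `row` and `col` keys, `check_cells` is a hypothetical name, and the check only applies to tables without merged cells, where every grid slot should hold exactly one cell.

```python
from typing import Dict, List


def check_cells(cells: List[Dict], n_rows: int, n_cols: int) -> List[str]:
    """Return a list of problems; an empty list means every slot is covered once."""
    problems: List[str] = []
    seen = set()
    for cell in cells:
        pos = (cell["row"], cell["col"])
        if pos in seen:
            problems.append(f"duplicate cell at {pos}")
        seen.add(pos)
        if not (0 <= pos[0] < n_rows and 0 <= pos[1] < n_cols):
            problems.append(f"cell out of bounds at {pos}")
    # Any grid slot never visited means the cells array is incomplete.
    missing = {(r, c) for r in range(n_rows) for c in range(n_cols)} - seen
    if missing:
        problems.append(f"missing cells: {sorted(missing)}")
    return problems
```

Returning a list of problems rather than raising on the first one makes it easier to report everything wrong with a table in a single test failure message.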

## 5. Dependencies

- [x] 5.1 Add `beautifulsoup4>=4.12.0` to `requirements.txt`