# Tasks: Fix OCR Track Table Data Format

## 1. Implementation

- [x] 1.1 Add BeautifulSoup import and dependency check in `ocr_to_unified_converter.py`
- [x] 1.2 Rewrite `_extract_table_data` method to parse HTML using BeautifulSoup
- [x] 1.3 Extract cell content, row index, and column index for each `<td>` and `<th>` element
- [x] 1.4 Handle `rowspan` and `colspan` attributes for merged cells
- [x] 1.5 Create `TableCell` objects with proper content and positioning
- [x] 1.6 Populate `TableData.cells` array with extracted `TableCell` objects
- [x] 1.7 Preserve header detection (`<th>` elements) and store in `TableData.headers`

## 2. Edge Case Handling

- [x] 2.1 Handle malformed HTML tables gracefully (missing closing tags, nested tables)
- [x] 2.2 Handle empty cells (create `TableCell` with empty string content)
- [x] 2.3 Handle tables without `` structure (fallback to current behavior)
- [x] 2.4 Log warnings for unparseable tables instead of failing silently

## 3. Testing

- [x] 3.1 Create unit tests for `_extract_table_data` with various HTML table formats
- [x] 3.2 Test simple tables (basic rows/columns)
- [x] 3.3 Test tables with merged cells (rowspan/colspan)
- [x] 3.4 Test tables with header rows (`<th>` elements)
- [x] 3.5 Test malformed HTML tables (handled via BeautifulSoup's tolerance)
- [ ] 3.6 Integration test: OCR Track PDF generation with tables

## 4. Verification (Track Isolation)

- [x] 4.1 Compare OCR Track table output format with Direct Track output format
- [ ] 4.2 Verify PDF Generator renders OCR Track tables correctly
- [x] 4.3 **Direct Track regression test**: `direct_extraction_engine.py` NOT modified (confirmed via git status)
- [x] 4.4 **Hybrid Mode regression test**: `ocr_service.py` NOT modified, image merge logic unchanged
- [x] 4.5 **OCR Track fix verification**: Unit tests confirm:
  - `TableData.cells` array is populated (6 cells in 3x2 table)
  - `TableCell` objects have correct row/col/content values
  - Headers extracted correctly
- [x] 4.6 Verify `DirectExtractionEngine` code is NOT modified (isolation check, confirmed)

## 5. Dependencies

- [x] 5.1 Add `beautifulsoup4>=4.12.0` to `requirements.txt`
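
The parsing approach described in tasks 1.2 to 1.7 and 2.2 can be sketched roughly as follows. This is a minimal illustration, not the code in `ocr_to_unified_converter.py`: the `TableCell` and `TableData` dataclasses are hypothetical stand-ins for the project's models (their real field names may differ), `extract_table_data` is a free function rather than the actual `_extract_table_data` method, and the occupied-slot grid is one assumed way to place cells that follow a `rowspan` or `colspan`.

```python
import logging
from dataclasses import dataclass, field
from typing import List, Optional

from bs4 import BeautifulSoup

logger = logging.getLogger(__name__)


@dataclass
class TableCell:
    # Hypothetical stand-in for the project's TableCell model.
    row: int
    col: int
    content: str = ""
    row_span: int = 1
    col_span: int = 1


@dataclass
class TableData:
    # Hypothetical stand-in for the project's TableData model.
    rows: int = 0
    cols: int = 0
    cells: List[TableCell] = field(default_factory=list)
    headers: List[str] = field(default_factory=list)


def _span(tag, name: str) -> int:
    """Read rowspan/colspan, falling back to 1 on missing or invalid values."""
    try:
        return max(1, int(tag.get(name, 1)))
    except (TypeError, ValueError):
        return 1


def extract_table_data(html: str) -> Optional[TableData]:
    """Parse the first <table> in `html` into positioned TableCell objects."""
    soup = BeautifulSoup(html, "html.parser")
    table = soup.find("table")
    if table is None:
        # Warn instead of failing silently (task 2.4).
        logger.warning("No <table> element found; skipping table extraction")
        return None

    data = TableData()
    occupied = set()  # (row, col) slots already covered by a cell or its span
    row_idx = 0
    for tr in table.find_all("tr"):
        # Ignore rows that belong to a nested table.
        if tr.find_parent("table") is not table:
            continue
        col_idx = 0
        for tag in tr.find_all(["td", "th"], recursive=False):
            # Skip slots claimed by a rowspan from an earlier row.
            while (row_idx, col_idx) in occupied:
                col_idx += 1
            rs, cs = _span(tag, "rowspan"), _span(tag, "colspan")
            text = tag.get_text(" ", strip=True)  # "" for empty cells
            data.cells.append(TableCell(row_idx, col_idx, text, rs, cs))
            if tag.name == "th":
                data.headers.append(text)
            for r in range(row_idx, row_idx + rs):
                for c in range(col_idx, col_idx + cs):
                    occupied.add((r, c))
            col_idx += cs
        row_idx += 1

    data.rows = max((r for r, _ in occupied), default=-1) + 1
    data.cols = max((c for _, c in occupied), default=-1) + 1
    return data


SAMPLE_HTML = (
    "<table>"
    "<tr><th>Name</th><th>Qty</th></tr>"
    "<tr><td>A</td><td>1</td></tr>"
    "<tr><td>B</td><td></td></tr>"
    "</table>"
)
result = extract_table_data(SAMPLE_HTML)
```

For the 3x2 sample, `result.cells` holds six `TableCell` objects and `result.headers` holds the two `<th>` texts, which mirrors the check listed under 4.5. The sketch uses the built-in `html.parser` backend so that it needs no dependency beyond `beautifulsoup4`.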