## 1. Core Algorithm Implementation

### 1.1 Table Column Corrector Module
- [x] 1.1.1 Create `table_column_corrector.py` service file
- [x] 1.1.2 Implement `ColumnAnchor` dataclass for header column ranges
- [x] 1.1.3 Implement `build_column_anchors()` to extract header column X-ranges
- [x] 1.1.4 Implement `calculate_x_overlap()` utility function
- [x] 1.1.5 Implement `correct_cell_column()` for single cell correction
- [x] 1.1.6 Implement `correct_table_columns()` main entry point

### 1.2 HTML Cell Extraction
- [x] 1.2.1 Implement `parse_table_html_with_positions()` to extract cells with row/col
- [x] 1.2.2 Implement cell-to-cellbox matching using IoU
- [x] 1.2.3 Handle colspan/rowspan in header detection

### 1.3 Vertical Fragment Merging
- [x] 1.3.1 Implement `detect_vertical_fragments()` to find narrow text blocks
- [x] 1.3.2 Implement `should_merge_blocks()` adjacency check
- [x] 1.3.3 Implement `merge_vertical_fragments()` main function
- [x] 1.3.4 Integrate merged blocks back into table structure

## 2. Configuration

### 2.1 Settings
- [x] 2.1.1 Add `table_column_correction_enabled: bool = True`
- [x] 2.1.2 Add `table_column_correction_threshold: float = 0.5`
- [x] 2.1.3 Add `vertical_fragment_merge_enabled: bool = True`
- [x] 2.1.4 Add `vertical_fragment_aspect_ratio: float = 0.3`

## 3. Integration

### 3.1 Pipeline Integration
- [x] 3.1.1 Add correction step in `pdf_generator_service.py` before table rendering
- [x] 3.1.2 Pass corrected HTML to existing table rendering logic
- [x] 3.1.3 Add diagnostic logging for corrections made

### 3.2 Error Handling
- [x] 3.2.1 Handle tables without headers gracefully
- [x] 3.2.2 Handle empty/malformed cell_boxes
- [x] 3.2.3 Fallback to original structure on correction failure

## 4. Testing

### 4.1 Unit Tests
- [ ] 4.1.1 Test `build_column_anchors()` with various header configurations
- [ ] 4.1.2 Test `correct_cell_column()` with known column shift cases
- [ ] 4.1.3 Test `merge_vertical_fragments()` with vertical text samples
- [ ] 4.1.4 Test edge cases: empty tables, single column, no headers

### 4.2 Integration Tests
- [ ] 4.2.1 Test with `scan.pdf` Table 7 (the problematic case)
- [ ] 4.2.2 Test with tables that have correct alignment (no regression)
- [ ] 4.2.3 Visual comparison of corrected vs original output

## 5. Documentation
- [x] 5.1 Add inline code comments explaining correction algorithm
- [x] 5.2 Update spec with new table column correction requirement
- [x] 5.3 Add logging messages for debugging
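
For reference when writing the unit tests in 4.1, the idea behind tasks 1.1.2, 1.1.4 and 1.1.5 can be sketched roughly as follows. This is a minimal illustration, not the code in `table_column_corrector.py`: only the names `ColumnAnchor`, `calculate_x_overlap()`, `correct_cell_column()` and the 0.5 default threshold come from the task list. The field names, function signatures, and the choice to measure overlap as a fraction of the cell's own width are assumptions.

```python
from dataclasses import dataclass
from typing import List


@dataclass
class ColumnAnchor:
    """X-range of one header column (field names are assumed)."""
    col_index: int
    x_min: float
    x_max: float


def calculate_x_overlap(a_min: float, a_max: float,
                        b_min: float, b_max: float) -> float:
    """Return the fraction of interval A's width that lies inside interval B."""
    width = a_max - a_min
    if width <= 0:
        return 0.0
    overlap = min(a_max, b_max) - max(a_min, b_min)
    return max(0.0, overlap) / width


def correct_cell_column(cell_x_min: float, cell_x_max: float,
                        current_col: int, anchors: List[ColumnAnchor],
                        threshold: float = 0.5) -> int:
    """Pick the header column that best overlaps the cell on the X axis.

    The current column is kept when no anchor reaches the threshold,
    which also covers tables with no usable header anchors.
    """
    best_col = current_col
    best_ratio = 0.0
    for anchor in anchors:
        ratio = calculate_x_overlap(cell_x_min, cell_x_max,
                                    anchor.x_min, anchor.x_max)
        if ratio > best_ratio:
            best_col, best_ratio = anchor.col_index, ratio
    return best_col if best_ratio >= threshold else current_col


# Hypothetical header layout: three columns, each 100 units wide.
anchors = [
    ColumnAnchor(0, 0.0, 100.0),
    ColumnAnchor(1, 100.0, 200.0),
    ColumnAnchor(2, 200.0, 300.0),
]

# A cell labelled as column 2 whose box sits under header column 1.
shifted = correct_cell_column(110.0, 190.0, current_col=2, anchors=anchors)

# A cell outside every header range keeps its original column.
untouched = correct_cell_column(350.0, 400.0, current_col=2, anchors=anchors)
```

Under these assumptions, a column-shift test case for 4.1.2 reduces to a header layout, a cell box, and the expected column index. The no-header case from 4.1.4 corresponds to passing an empty `anchors` list.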