## 1. Core Algorithm Implementation

### 1.1 Table Column Corrector Module

- [x] 1.1.1 Create `table_column_corrector.py` service file
- [x] 1.1.2 Implement `ColumnAnchor` dataclass for header column ranges
- [x] 1.1.3 Implement `build_column_anchors()` to extract header column X-ranges
- [x] 1.1.4 Implement `calculate_x_overlap()` utility function
- [x] 1.1.5 Implement `correct_cell_column()` for single cell correction
- [x] 1.1.6 Implement `correct_table_columns()` main entry point

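
As a rough illustration of how the pieces in 1.1 could fit together, the sketch below assumes boxes are `(x0, y0, x1, y1)` tuples and that a cell is reassigned to the header column covering the largest share of its width. The field names, function signatures, and the reading of the 0.5 threshold as a minimum overlap ratio are assumptions made for illustration; they are not taken from `table_column_corrector.py`.

```python
from dataclasses import dataclass


@dataclass(frozen=True)
class ColumnAnchor:
    """X-range of one header column (field names are hypothetical)."""
    col_index: int
    x_min: float
    x_max: float


def build_column_anchors(header_boxes):
    """Build one anchor per header box, ordered left to right."""
    ordered = sorted(header_boxes, key=lambda box: box[0])
    return [ColumnAnchor(i, box[0], box[2]) for i, box in enumerate(ordered)]


def calculate_x_overlap(cell_x0, cell_x1, anchor):
    """Fraction of the cell's width that lies inside the anchor's X-range."""
    width = cell_x1 - cell_x0
    if width <= 0:
        return 0.0
    overlap = min(cell_x1, anchor.x_max) - max(cell_x0, anchor.x_min)
    return max(0.0, overlap) / width


def correct_cell_column(cell_box, current_col, anchors, threshold=0.5):
    """Return the best-overlapping column, or current_col if none reaches the threshold."""
    x0, _, x1, _ = cell_box
    best_col, best_ratio = current_col, 0.0
    for anchor in anchors:
        ratio = calculate_x_overlap(x0, x1, anchor)
        if ratio > best_ratio:
            best_col, best_ratio = anchor.col_index, ratio
    return best_col if best_ratio >= threshold else current_col
```

Keeping the current column whenever no anchor reaches the threshold means an ambiguous cell is left alone rather than moved on weak evidence.
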
### 1.2 HTML Cell Extraction

- [x] 1.2.1 Implement `parse_table_html_with_positions()` to extract cells with row/col
- [x] 1.2.2 Implement cell-to-cellbox matching using IoU
- [x] 1.2.3 Handle colspan/rowspan in header detection

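
The following sketch shows one way the extraction in 1.2 might work, using only the standard-library `html.parser`. It tracks grid slots claimed by `rowspan`/`colspan` so later cells land in the right column, and includes a plain intersection-over-union helper. The cell dictionary layout, the `_CellCollector` helper, and the `iou()` name are hypothetical, and the sketch assumes span attributes hold valid integers.

```python
from html.parser import HTMLParser


class _CellCollector(HTMLParser):
    """Collects <td>/<th> cells with their grid position (illustrative helper)."""

    def __init__(self):
        super().__init__()
        self.cells = []
        self._occupied = set()  # grid slots claimed by earlier row/col spans
        self._row = -1
        self._col = 0
        self._current = None

    def handle_starttag(self, tag, attrs):
        if tag == "tr":
            self._row += 1
            self._col = 0
        elif tag in ("td", "th") and self._row >= 0:
            attr = dict(attrs)
            rowspan = max(1, int(attr.get("rowspan") or 1))
            colspan = max(1, int(attr.get("colspan") or 1))
            # Skip slots already covered by a rowspan from a previous row.
            while (self._row, self._col) in self._occupied:
                self._col += 1
            for r in range(self._row, self._row + rowspan):
                for c in range(self._col, self._col + colspan):
                    self._occupied.add((r, c))
            self._current = {
                "row": self._row, "col": self._col,
                "rowspan": rowspan, "colspan": colspan,
                "is_header": tag == "th", "text": "",
            }
            self.cells.append(self._current)
            self._col += colspan

    def handle_data(self, data):
        if self._current is not None:
            self._current["text"] += data

    def handle_endtag(self, tag):
        if tag in ("td", "th") and self._current is not None:
            self._current["text"] = self._current["text"].strip()
            self._current = None


def parse_table_html_with_positions(html):
    """Return a list of cell dicts with row/col positions and span sizes."""
    collector = _CellCollector()
    collector.feed(html)
    collector.close()
    return collector.cells


def iou(a, b):
    """Intersection over union of two (x0, y0, x1, y1) boxes."""
    inter_w = max(0.0, min(a[2], b[2]) - max(a[0], b[0]))
    inter_h = max(0.0, min(a[3], b[3]) - max(a[1], b[1]))
    inter = inter_w * inter_h
    union = (a[2] - a[0]) * (a[3] - a[1]) + (b[2] - b[0]) * (b[3] - b[1]) - inter
    return inter / union if union > 0 else 0.0
```

A matcher built on this could pair each parsed cell with the detected cell box that has the highest `iou()` score; how the real implementation derives a box for an HTML cell is not stated here, so that step is left out.
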
### 1.3 Vertical Fragment Merging

- [x] 1.3.1 Implement `detect_vertical_fragments()` to find narrow text blocks
- [x] 1.3.2 Implement `should_merge_blocks()` adjacency check
- [x] 1.3.3 Implement `merge_vertical_fragments()` main function
- [x] 1.3.4 Integrate merged blocks back into table structure

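
A minimal sketch of the merging idea in 1.3 follows. It assumes a "vertical fragment" is a block whose width-to-height ratio falls below the configured aspect ratio, and that two fragments are adjacent when they overlap vertically and sit within a small horizontal gap of each other. Both the adjacency rule and the `max_gap` parameter are assumptions; the real `should_merge_blocks()` criteria may differ.

```python
def is_vertical_fragment(box, max_aspect_ratio=0.3):
    """True when the (x0, y0, x1, y1) box is much taller than it is wide."""
    width, height = box[2] - box[0], box[3] - box[1]
    return height > 0 and width / height < max_aspect_ratio


def should_merge_blocks(a, b, max_gap=5.0):
    """Adjacent if the boxes overlap vertically and the horizontal gap is small."""
    gap = max(a[0], b[0]) - min(a[2], b[2])  # negative when the boxes overlap in X
    y_overlap = min(a[3], b[3]) - max(a[1], b[1])
    return gap <= max_gap and y_overlap > 0


def merge_vertical_fragments(boxes, max_aspect_ratio=0.3, max_gap=5.0):
    """Merge runs of adjacent narrow boxes; other boxes pass through unchanged."""
    narrow = sorted(
        (b for b in boxes if is_vertical_fragment(b, max_aspect_ratio)),
        key=lambda b: b[0],
    )
    others = [b for b in boxes if not is_vertical_fragment(b, max_aspect_ratio)]
    merged = []
    for box in narrow:
        if merged and should_merge_blocks(merged[-1], box, max_gap):
            last = merged[-1]
            # Grow the previous block to the bounding box of both fragments.
            merged[-1] = (
                min(last[0], box[0]), min(last[1], box[1]),
                max(last[2], box[2]), max(last[3], box[3]),
            )
        else:
            merged.append(box)
    return others + merged
```

Sorting the narrow boxes by their left edge lets a single left-to-right pass absorb each fragment into the block before it.
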
## 2. Configuration

### 2.1 Settings

- [x] 2.1.1 Add `table_column_correction_enabled: bool = True`
- [x] 2.1.2 Add `table_column_correction_threshold: float = 0.5`
- [x] 2.1.3 Add `vertical_fragment_merge_enabled: bool = True`
- [x] 2.1.4 Add `vertical_fragment_aspect_ratio: float = 0.3`

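
The four settings above could be grouped as shown below. This sketch uses a standard-library dataclass so it runs on its own; the project's actual settings class is not shown here and may use a different mechanism. The class name `TableCorrectionSettings` and the range checks are assumptions, based on reading the threshold as a ratio between 0 and 1.

```python
from dataclasses import dataclass


@dataclass(frozen=True)
class TableCorrectionSettings:
    """Hypothetical container for the four settings listed in 2.1."""
    table_column_correction_enabled: bool = True
    table_column_correction_threshold: float = 0.5
    vertical_fragment_merge_enabled: bool = True
    vertical_fragment_aspect_ratio: float = 0.3

    def __post_init__(self):
        # Assumed validation: the threshold is treated as a ratio in [0, 1].
        if not 0.0 <= self.table_column_correction_threshold <= 1.0:
            raise ValueError("table_column_correction_threshold must be in [0, 1]")
        if self.vertical_fragment_aspect_ratio <= 0:
            raise ValueError("vertical_fragment_aspect_ratio must be positive")
```

Defaulting both features to enabled matches the values listed above; the validation simply fails fast on values that would make the overlap and aspect-ratio checks meaningless.
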
## 3. Integration

### 3.1 Pipeline Integration

- [x] 3.1.1 Add correction step in `pdf_generator_service.py` before table rendering
- [x] 3.1.2 Pass corrected HTML to existing table rendering logic
- [x] 3.1.3 Add diagnostic logging for corrections made

### 3.2 Error Handling

- [x] 3.2.1 Handle tables without headers gracefully
- [x] 3.2.2 Handle empty/malformed cell_boxes
- [x] 3.2.3 Fallback to original structure on correction failure

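
The integration and error-handling tasks in section 3 amount to a guarded call placed before rendering. The sketch below illustrates that pattern under stated assumptions: `correct_with_fallback()`, `valid_boxes()`, the logger name, and the `corrector(table_html, boxes)` callable signature are all hypothetical and do not describe the code in `pdf_generator_service.py`.

```python
import logging

logger = logging.getLogger("table_column_corrector")  # hypothetical logger name


def valid_boxes(cell_boxes):
    """Keep only boxes shaped like (x0, y0, x1, y1) with positive width and height."""
    result = []
    for box in cell_boxes or []:
        try:
            x0, y0, x1, y1 = (float(v) for v in box)
        except (TypeError, ValueError):
            continue  # wrong length, None, or non-numeric entries
        if x1 > x0 and y1 > y0:
            result.append((x0, y0, x1, y1))
    return result


def correct_with_fallback(table_html, cell_boxes, corrector, enabled=True):
    """Run corrector(table_html, boxes); return the original HTML if it cannot be applied."""
    if not enabled:
        return table_html
    boxes = valid_boxes(cell_boxes)
    if not boxes:
        logger.debug("No usable cell boxes; skipping column correction")
        return table_html
    try:
        corrected = corrector(table_html, boxes)
    except Exception:
        # Any failure keeps the original structure so rendering can proceed.
        logger.exception("Column correction failed; keeping original table structure")
        return table_html
    if corrected != table_html:
        logger.info("Column correction modified the table HTML")
    return corrected
```

Because the wrapper always returns usable HTML, the existing rendering logic can consume its result without knowing whether a correction was applied.
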
## 4. Testing

### 4.1 Unit Tests

- [ ] 4.1.1 Test `build_column_anchors()` with various header configurations
- [ ] 4.1.2 Test `correct_cell_column()` with known column shift cases
- [ ] 4.1.3 Test `merge_vertical_fragments()` with vertical text samples
- [ ] 4.1.4 Test edge cases: empty tables, single column, no headers

### 4.2 Integration Tests

- [ ] 4.2.1 Test with `scan.pdf` Table 7 (the problematic case)
- [ ] 4.2.2 Test with tables that have correct alignment (no regression)
- [ ] 4.2.3 Visual comparison of corrected vs original output

## 5. Documentation

- [x] 5.1 Add inline code comments explaining correction algorithm
- [x] 5.2 Update spec with new table column correction requirement
- [x] 5.3 Add logging messages for debugging