1. Core Algorithm Implementation
1.1 Table Column Corrector Module
- 1.1.1 Create table_column_corrector.py service file
- 1.1.2 Implement ColumnAnchor dataclass for header column ranges
- 1.1.3 Implement build_column_anchors() to extract header column X-ranges
- 1.1.4 Implement calculate_x_overlap() utility function
- 1.1.5 Implement correct_cell_column() for single cell correction
- 1.1.6 Implement correct_table_columns() main entry point
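The anchor-and-overlap idea behind tasks 1.1.2 to 1.1.5 can be sketched as follows. This is a minimal illustration, not the module's actual code: the field names on ColumnAnchor, the (col, x_min, x_max) input shape, and every function signature are assumptions made for the example.

```python
from dataclasses import dataclass
from typing import List, Tuple


@dataclass
class ColumnAnchor:
    """X-range covered by one header column (field names are assumed)."""
    col: int
    x_min: float
    x_max: float


def calculate_x_overlap(a_min: float, a_max: float,
                        b_min: float, b_max: float) -> float:
    """Fraction of the span [a_min, a_max] that lies inside [b_min, b_max]."""
    width = a_max - a_min
    if width <= 0:
        return 0.0
    overlap = min(a_max, b_max) - max(a_min, b_min)
    return max(0.0, overlap) / width


def build_column_anchors(
    header_cells: List[Tuple[int, float, float]]
) -> List[ColumnAnchor]:
    """Build anchors from (col, x_min, x_max) tuples, ordered by column."""
    return [ColumnAnchor(c, x0, x1) for c, x0, x1 in sorted(header_cells)]


def correct_cell_column(x_min: float, x_max: float, current_col: int,
                        anchors: List[ColumnAnchor],
                        threshold: float = 0.5) -> int:
    """Return the best-overlapping anchor column, or current_col when no
    anchor reaches the threshold."""
    best_col, best_overlap = current_col, 0.0
    for anchor in anchors:
        overlap = calculate_x_overlap(x_min, x_max, anchor.x_min, anchor.x_max)
        if overlap > best_overlap:
            best_col, best_overlap = anchor.col, overlap
    return best_col if best_overlap >= threshold else current_col
```

With anchors built from `[(0, 0, 100), (1, 100, 200), (2, 200, 300)]`, a cell spanning X 210 to 290 that was labelled column 1 would be moved to column 2, while a cell that overlaps no anchor keeps its original column.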
1.2 HTML Cell Extraction
- 1.2.1 Implement parse_table_html_with_positions() to extract cells with row/col
- 1.2.2 Implement cell-to-cellbox matching using IoU
- 1.2.3 Handle colspan/rowspan in header detection
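One possible shape for the extraction and matching steps is sketched below, using only the standard library's html.parser. The returned dictionary keys, the helper names iou() and match_cell_to_box(), the (x0, y0, x1, y1) box layout, and the assumption that each HTML cell already has an estimated bounding box to compare against are all hypothetical. Malformed span attributes are not handled here.

```python
from html.parser import HTMLParser
from typing import Dict, List, Optional, Sequence


class _CellCollector(HTMLParser):
    """Collects <td>/<th> cells with their resolved grid position."""

    def __init__(self) -> None:
        super().__init__()
        self.cells: List[Dict] = []
        self._row = -1
        self._col = 0
        self._occupied = set()  # grid slots already claimed by a span
        self._current: Optional[Dict] = None

    def handle_starttag(self, tag, attrs):
        if tag == "tr":
            self._row += 1
            self._col = 0
        elif tag in ("td", "th"):
            attr = dict(attrs)
            rowspan = int(attr.get("rowspan") or 1)
            colspan = int(attr.get("colspan") or 1)
            # Skip slots taken by a rowspan that started in an earlier row.
            while (self._row, self._col) in self._occupied:
                self._col += 1
            for r in range(rowspan):
                for c in range(colspan):
                    self._occupied.add((self._row + r, self._col + c))
            self._current = {"row": self._row, "col": self._col,
                             "rowspan": rowspan, "colspan": colspan,
                             "is_header": tag == "th", "text": ""}
            self._col += colspan

    def handle_data(self, data):
        if self._current is not None:
            self._current["text"] += data

    def handle_endtag(self, tag):
        if tag in ("td", "th") and self._current is not None:
            self._current["text"] = self._current["text"].strip()
            self.cells.append(self._current)
            self._current = None


def parse_table_html_with_positions(html: str) -> List[Dict]:
    """Return one dict per cell with row/col resolved across spans."""
    collector = _CellCollector()
    collector.feed(html)
    collector.close()
    return collector.cells


def iou(a: Sequence[float], b: Sequence[float]) -> float:
    """Intersection over union of two (x0, y0, x1, y1) boxes."""
    inter_w = max(0.0, min(a[2], b[2]) - max(a[0], b[0]))
    inter_h = max(0.0, min(a[3], b[3]) - max(a[1], b[1]))
    inter = inter_w * inter_h
    union = ((a[2] - a[0]) * (a[3] - a[1])
             + (b[2] - b[0]) * (b[3] - b[1]) - inter)
    return inter / union if union > 0 else 0.0


def match_cell_to_box(estimated_box: Sequence[float],
                      cell_boxes: Sequence[Sequence[float]],
                      min_iou: float = 0.5) -> Optional[int]:
    """Index of the cell box with the highest IoU, or None below min_iou."""
    best_idx, best_score = None, 0.0
    for idx, box in enumerate(cell_boxes):
        score = iou(estimated_box, box)
        if score > best_score:
            best_idx, best_score = idx, score
    return best_idx if best_score >= min_iou else None
```

Tracking occupied grid slots is what lets a cell under a rowspan land in the right column: the parser advances past any slot already claimed before assigning a column index.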
1.3 Vertical Fragment Merging
- 1.3.1 Implement detect_vertical_fragments() to find narrow text blocks
- 1.3.2 Implement should_merge_blocks() adjacency check
- 1.3.3 Implement merge_vertical_fragments() main function
- 1.3.4 Integrate merged blocks back into table structure
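A rough sketch of the three fragment functions follows. It rests on several assumptions that the task list does not state: "narrow" is taken to mean a width/height ratio below the configured aspect ratio, "adjacent" is taken to mean a small horizontal gap with overlapping Y-ranges, fragments are joined in left-to-right order, and the TextBlock type and max_gap parameter are invented for the example.

```python
from dataclasses import dataclass
from typing import List


@dataclass
class TextBlock:
    """Hypothetical text block with an axis-aligned bounding box."""
    text: str
    x0: float
    y0: float
    x1: float
    y1: float


def detect_vertical_fragments(blocks: List[TextBlock],
                              max_aspect_ratio: float = 0.3) -> List[TextBlock]:
    """Return blocks whose width/height ratio is below max_aspect_ratio."""
    fragments = []
    for block in blocks:
        width, height = block.x1 - block.x0, block.y1 - block.y0
        if height > 0 and width / height < max_aspect_ratio:
            fragments.append(block)
    return fragments


def should_merge_blocks(a: TextBlock, b: TextBlock,
                        max_gap: float = 5.0) -> bool:
    """True when the blocks are horizontally close and overlap in Y."""
    gap = max(a.x0, b.x0) - min(a.x1, b.x1)
    y_overlap = min(a.y1, b.y1) - max(a.y0, b.y0)
    return gap <= max_gap and y_overlap > 0


def _combine(group: List[TextBlock]) -> TextBlock:
    """Join a run of fragments into one block covering all of them."""
    return TextBlock(
        text="".join(b.text for b in group),
        x0=min(b.x0 for b in group), y0=min(b.y0 for b in group),
        x1=max(b.x1 for b in group), y1=max(b.y1 for b in group),
    )


def merge_vertical_fragments(blocks: List[TextBlock],
                             max_aspect_ratio: float = 0.3,
                             max_gap: float = 5.0) -> List[TextBlock]:
    """Merge runs of adjacent narrow blocks; other blocks pass through."""
    narrow = detect_vertical_fragments(blocks, max_aspect_ratio)
    narrow_ids = {id(b) for b in narrow}
    merged = [b for b in blocks if id(b) not in narrow_ids]
    group: List[TextBlock] = []
    for fragment in sorted(narrow, key=lambda b: b.x0):
        if group and not should_merge_blocks(group[-1], fragment, max_gap):
            merged.append(_combine(group))
            group = []
        group.append(fragment)
    if group:
        merged.append(_combine(group))
    return merged
```

Note that this sketch returns untouched blocks first and merged blocks last, so a caller that depends on reading order would need to re-sort the result. Vertical scripts read right to left would also need the join order reversed.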
2. Configuration
2.1 Settings
- 2.1.1 Add table_column_correction_enabled: bool = True
- 2.1.2 Add table_column_correction_threshold: float = 0.5
- 2.1.3 Add vertical_fragment_merge_enabled: bool = True
- 2.1.4 Add vertical_fragment_aspect_ratio: float = 0.3
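For illustration, the four settings could be grouped as below. The field names and defaults come from the list above; the TableCorrectionSettings class name, the use of a plain dataclass (the project may use a different settings mechanism), and the validation ranges are assumptions.

```python
from dataclasses import dataclass


@dataclass(frozen=True)
class TableCorrectionSettings:
    """Hypothetical container for the settings in section 2.1."""
    table_column_correction_enabled: bool = True
    table_column_correction_threshold: float = 0.5
    vertical_fragment_merge_enabled: bool = True
    vertical_fragment_aspect_ratio: float = 0.3

    def __post_init__(self) -> None:
        # Assumed ranges: the threshold is an overlap fraction, the aspect
        # ratio a positive width/height bound.
        if not 0.0 <= self.table_column_correction_threshold <= 1.0:
            raise ValueError("table_column_correction_threshold must be in [0, 1]")
        if self.vertical_fragment_aspect_ratio <= 0:
            raise ValueError("vertical_fragment_aspect_ratio must be positive")
```

Validating at construction time means a bad value fails when the configuration loads, not in the middle of rendering a table.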
3. Integration
3.1 Pipeline Integration
- 3.1.1 Add correction step in pdf_generator_service.py before table rendering
- 3.1.2 Pass corrected HTML to existing table rendering logic
- 3.1.3 Add diagnostic logging for corrections made
3.2 Error Handling
- 3.2.1 Handle tables without headers gracefully
- 3.2.2 Handle empty/malformed cell_boxes
- 3.2.3 Fall back to original structure on correction failure
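The fallback behaviour could be wrapped around the correction step roughly as follows. This is a sketch under assumptions: apply_correction_safely() and _is_valid_box() are hypothetical names, the corrector is passed in as a callable taking the table HTML and the cell boxes, and boxes are assumed to be (x0, y0, x1, y1) sequences.

```python
import logging
from typing import Callable, List, Optional, Sequence

logger = logging.getLogger(__name__)


def _is_valid_box(box) -> bool:
    """True for a 4-number box with positive width and height."""
    try:
        x0, y0, x1, y1 = (float(v) for v in box)
    except (TypeError, ValueError):
        return False
    return x1 > x0 and y1 > y0


def apply_correction_safely(
    table_html: str,
    cell_boxes: Optional[Sequence],
    corrector: Callable[[str, List], str],
) -> str:
    """Run the corrector, returning the original HTML on any problem."""
    valid_boxes = [b for b in (cell_boxes or []) if _is_valid_box(b)]
    if not valid_boxes:
        logger.debug("No usable cell boxes; skipping column correction")
        return table_html
    try:
        corrected = corrector(table_html, valid_boxes)
    except Exception:
        # Never let a correction bug break PDF generation.
        logger.exception("Column correction failed; using original structure")
        return table_html
    if corrected != table_html:
        logger.info("Column correction changed the table structure")
    return corrected or table_html
```

Catching a broad Exception is deliberate here: correction is an optional improvement, so any failure should degrade to the uncorrected table instead of aborting the render.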
4. Testing
4.1 Unit Tests
- 4.1.1 Test build_column_anchors() with various header configurations
- 4.1.2 Test correct_cell_column() with known column shift cases
- 4.1.3 Test merge_vertical_fragments() with vertical text samples
- 4.1.4 Test edge cases: empty tables, single column, no headers
4.2 Integration Tests
- 4.2.1 Test with scan.pdf Table 7 (the problematic case)
- 4.2.2 Test with tables that have correct alignment (no regression)
- 4.2.3 Visual comparison of corrected vs original output
5. Documentation
- 5.1 Add inline code comments explaining correction algorithm
- 5.2 Update spec with new table column correction requirement
- 5.3 Add logging messages for debugging