feat: enable document orientation detection for scanned PDFs

- Enable PP-StructureV3's use_doc_orientation_classify feature
- Detect rotation angle from doc_preprocessor_res.angle
- Swap page dimensions (width <-> height) for 90°/270° rotations
- Output PDF now correctly displays landscape-scanned content

Also includes:
- Archive completed openspec proposals
- Add simplify-frontend-ocr-config proposal (pending)
- Code cleanup and frontend simplification

🤖 Generated with [Claude Code](https://claude.com/claude-code)

Co-Authored-By: Claude Opus 4.5 <noreply@anthropic.com>
2025-12-11 17:13:46 +08:00
parent 57070af307
commit cfe65158a3
58 changed files with 1271 additions and 3048 deletions
--- a/openspec/changes/archive/2025-12-11-fix-table-column-alignment/tasks.md
+++ b/openspec/changes/archive/2025-12-11-fix-table-column-alignment/tasks.md
@@ -0,0 +1,59 @@
+## 1. Core Algorithm Implementation
+
+### 1.1 Table Column Corrector Module
+- [x] 1.1.1 Create `table_column_corrector.py` service file
+- [x] 1.1.2 Implement `ColumnAnchor` dataclass for header column ranges
+- [x] 1.1.3 Implement `build_column_anchors()` to extract header column X-ranges
+- [x] 1.1.4 Implement `calculate_x_overlap()` utility function
+- [x] 1.1.5 Implement `correct_cell_column()` for single cell correction
+- [x] 1.1.6 Implement `correct_table_columns()` main entry point
+
+### 1.2 HTML Cell Extraction
+- [x] 1.2.1 Implement `parse_table_html_with_positions()` to extract cells with row/col
+- [x] 1.2.2 Implement cell-to-cellbox matching using IoU
+- [x] 1.2.3 Handle colspan/rowspan in header detection
+
+### 1.3 Vertical Fragment Merging
+- [x] 1.3.1 Implement `detect_vertical_fragments()` to find narrow text blocks
+- [x] 1.3.2 Implement `should_merge_blocks()` adjacency check
+- [x] 1.3.3 Implement `merge_vertical_fragments()` main function
+- [x] 1.3.4 Integrate merged blocks back into table structure
+
+## 2. Configuration
+
+### 2.1 Settings
+- [x] 2.1.1 Add `table_column_correction_enabled: bool = True`
+- [x] 2.1.2 Add `table_column_correction_threshold: float = 0.5`
+- [x] 2.1.3 Add `vertical_fragment_merge_enabled: bool = True`
+- [x] 2.1.4 Add `vertical_fragment_aspect_ratio: float = 0.3`
+
+## 3. Integration
+
+### 3.1 Pipeline Integration
+- [x] 3.1.1 Add correction step in `pdf_generator_service.py` before table rendering
+- [x] 3.1.2 Pass corrected HTML to existing table rendering logic
+- [x] 3.1.3 Add diagnostic logging for corrections made
+
+### 3.2 Error Handling
+- [x] 3.2.1 Handle tables without headers gracefully
+- [x] 3.2.2 Handle empty/malformed cell_boxes
+- [x] 3.2.3 Fall back to original structure on correction failure
+
+## 4. Testing
+
+### 4.1 Unit Tests
+- [ ] 4.1.1 Test `build_column_anchors()` with various header configurations
+- [ ] 4.1.2 Test `correct_cell_column()` with known column shift cases
+- [ ] 4.1.3 Test `merge_vertical_fragments()` with vertical text samples
+- [ ] 4.1.4 Test edge cases: empty tables, single column, no headers
+
+### 4.2 Integration Tests
+- [ ] 4.2.1 Test with `scan.pdf` Table 7 (the problematic case)
+- [ ] 4.2.2 Test with tables that have correct alignment (no regression)
+- [ ] 4.2.3 Visual comparison of corrected vs original output
+
+## 5. Documentation
+
+- [x] 5.1 Add inline code comments explaining correction algorithm
+- [x] 5.2 Update spec with new table column correction requirement
+- [x] 5.3 Add logging messages for debugging
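
For reviewers, here is a minimal sketch of how tasks 1.1.2 through 1.1.5 might fit together. The names `ColumnAnchor`, `calculate_x_overlap()`, and `correct_cell_column()` come from the task list, but the fields, signatures, and overlap formula below are assumptions for illustration and are not taken from `table_column_corrector.py`. The default threshold mirrors `table_column_correction_threshold: float = 0.5` from task 2.1.2.

```python
from dataclasses import dataclass
from typing import List


@dataclass
class ColumnAnchor:
    """X-range of one header column (field names are assumed)."""
    col_index: int
    x_min: float
    x_max: float


def calculate_x_overlap(cell_x_min: float, cell_x_max: float,
                        anchor: ColumnAnchor) -> float:
    """Return the fraction of the cell's width that lies inside the anchor."""
    cell_width = cell_x_max - cell_x_min
    if cell_width <= 0:
        return 0.0  # degenerate box: treat as no overlap
    overlap = min(cell_x_max, anchor.x_max) - max(cell_x_min, anchor.x_min)
    return max(0.0, overlap) / cell_width


def correct_cell_column(cell_x_min: float, cell_x_max: float,
                        current_col: int, anchors: List[ColumnAnchor],
                        threshold: float = 0.5) -> int:
    """Pick the header column with the largest overlap ratio.

    The current column is kept when no anchor reaches the threshold,
    which also covers tables without headers (empty anchor list).
    """
    best_col, best_ratio = current_col, 0.0
    for anchor in anchors:
        ratio = calculate_x_overlap(cell_x_min, cell_x_max, anchor)
        if ratio > best_ratio:
            best_col, best_ratio = anchor.col_index, ratio
    return best_col if best_ratio >= threshold else current_col


# Hypothetical three-column header, each column 100 units wide.
anchors = [
    ColumnAnchor(0, 0.0, 100.0),
    ColumnAnchor(1, 100.0, 200.0),
    ColumnAnchor(2, 200.0, 300.0),
]

# A cell labelled column 0 that sits entirely under column 1 is moved.
shifted = correct_cell_column(110.0, 190.0, current_col=0, anchors=anchors)

# A cell that overlaps no header column keeps its original index.
untouched = correct_cell_column(400.0, 450.0, current_col=2, anchors=anchors)
```

Normalizing by the cell width rather than the anchor width is one possible choice: it keeps narrow cells from being penalized under wide headers. The real module may normalize differently.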