feat: enable document orientation detection for scanned PDFs

- Enable PP-StructureV3's use_doc_orientation_classify feature
- Detect rotation angle from doc_preprocessor_res.angle
- Swap page dimensions (width <-> height) for 90°/270° rotations
- Output PDF now correctly displays landscape-scanned content
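
The dimension swap described above can be sketched as follows. This is a minimal illustration, not the code in this commit: `oriented_page_size` is a hypothetical helper name, and the angle is assumed to be a multiple of 90 degrees as read from `doc_preprocessor_res.angle`.

```python
from typing import Tuple


def oriented_page_size(width: float, height: float, angle: int) -> Tuple[float, float]:
    """Return (width, height) adjusted for a detected rotation angle.

    Hypothetical helper: assumes `angle` is the value read from
    doc_preprocessor_res.angle and is a multiple of 90 degrees.
    """
    normalized = angle % 360
    if normalized in (90, 270):
        # A quarter turn exchanges the page's width and height.
        return height, width
    # 0 and 180 degrees keep the original page dimensions.
    return width, height
```

For example, an A4 portrait page (595 x 842 points) reported at 90° would be emitted as 842 x 595.
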

Also includes:
- Archive completed openspec proposals
- Add simplify-frontend-ocr-config proposal (pending)
- Code cleanup and frontend simplification

🤖 Generated with [Claude Code](https://claude.com/claude-code)

Co-Authored-By: Claude Opus 4.5 <noreply@anthropic.com>
egg
2025-12-11 17:13:46 +08:00
parent 57070af307
commit cfe65158a3
58 changed files with 1271 additions and 3048 deletions

@@ -0,0 +1,59 @@
## 1. Core Algorithm Implementation
### 1.1 Table Column Corrector Module
- [x] 1.1.1 Create `table_column_corrector.py` service file
- [x] 1.1.2 Implement `ColumnAnchor` dataclass for header column ranges
- [x] 1.1.3 Implement `build_column_anchors()` to extract header column X-ranges
- [x] 1.1.4 Implement `calculate_x_overlap()` utility function
- [x] 1.1.5 Implement `correct_cell_column()` for single cell correction
- [x] 1.1.6 Implement `correct_table_columns()` main entry point
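
A minimal sketch of how the pieces in 1.1 might fit together. The function and class names come from the tasks above, but the field names, signatures, and the "move the cell to the header column it overlaps most" rule are assumptions for illustration only. The default threshold mirrors `table_column_correction_threshold` from 2.1.2.

```python
from dataclasses import dataclass
from typing import Sequence


@dataclass(frozen=True)
class ColumnAnchor:
    """X-range of one header column (field names are assumed)."""
    col_idx: int
    x_min: float
    x_max: float


def calculate_x_overlap(a_min: float, a_max: float, b_min: float, b_max: float) -> float:
    """Fraction of the interval [a_min, a_max] covered by [b_min, b_max]."""
    width = a_max - a_min
    if width <= 0:
        return 0.0
    intersection = min(a_max, b_max) - max(a_min, b_min)
    return max(0.0, intersection) / width


def correct_cell_column(
    cell_x_min: float,
    cell_x_max: float,
    current_col: int,
    anchors: Sequence[ColumnAnchor],
    threshold: float = 0.5,
) -> int:
    """Return the column whose anchor overlaps the cell most.

    The current column is kept when no anchor reaches `threshold`.
    """
    best_col, best_overlap = current_col, 0.0
    for anchor in anchors:
        overlap = calculate_x_overlap(cell_x_min, cell_x_max, anchor.x_min, anchor.x_max)
        if overlap > best_overlap:
            best_col, best_overlap = anchor.col_idx, overlap
    return best_col if best_overlap >= threshold else current_col
```

Under these assumptions, a cell spanning x = 110..190 that the HTML places in column 0 would be moved to column 1 when the header anchors are 0..100 and 100..200.
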
### 1.2 HTML Cell Extraction
- [x] 1.2.1 Implement `parse_table_html_with_positions()` to extract cells with row/column indices
- [x] 1.2.2 Implement matching of HTML cells to `cell_boxes` using IoU
- [x] 1.2.3 Handle colspan/rowspan in header detection
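
The IoU matching in 1.2.2 could look roughly like this. It is a sketch under assumptions: boxes are `(x0, y0, x1, y1)` tuples, each parsed cell already has an estimated box, and `match_cell_to_box` and `min_iou` are hypothetical names that do not appear in the task list.

```python
from typing import Optional, Sequence, Tuple

Box = Tuple[float, float, float, float]  # (x0, y0, x1, y1), assumed layout


def iou(a: Box, b: Box) -> float:
    """Intersection over union of two axis-aligned boxes."""
    inter_w = max(0.0, min(a[2], b[2]) - max(a[0], b[0]))
    inter_h = max(0.0, min(a[3], b[3]) - max(a[1], b[1]))
    intersection = inter_w * inter_h
    area_a = (a[2] - a[0]) * (a[3] - a[1])
    area_b = (b[2] - b[0]) * (b[3] - b[1])
    union = area_a + area_b - intersection
    return intersection / union if union > 0 else 0.0


def match_cell_to_box(cell_box: Box, cell_boxes: Sequence[Box], min_iou: float = 0.1) -> Optional[int]:
    """Return the index of the best-matching box, or None if none reaches min_iou."""
    best_idx, best_score = None, 0.0
    for idx, candidate in enumerate(cell_boxes):
        score = iou(cell_box, candidate)
        if score > best_score:
            best_idx, best_score = idx, score
    return best_idx if best_score >= min_iou else None
```

Returning `None` rather than a low-scoring match leaves the caller free to skip correction for that cell.
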
### 1.3 Vertical Fragment Merging
- [x] 1.3.1 Implement `detect_vertical_fragments()` to find narrow text blocks
- [x] 1.3.2 Implement `should_merge_blocks()` adjacency check
- [x] 1.3.3 Implement `merge_vertical_fragments()` main function
- [x] 1.3.4 Integrate merged blocks back into table structure
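
One possible shape for the fragment merging in 1.3, given as a sketch rather than the actual implementation. `TextBlock`, `_is_narrow`, `max_gap`, and the left-to-right concatenation order are all assumptions. The 0.3 default mirrors `vertical_fragment_aspect_ratio` from 2.1.4.

```python
from dataclasses import dataclass
from typing import List, Sequence


@dataclass
class TextBlock:
    """Hypothetical text block with an axis-aligned bounding box."""
    text: str
    x0: float
    y0: float
    x1: float
    y1: float


def _is_narrow(block: TextBlock, max_aspect_ratio: float) -> bool:
    width = block.x1 - block.x0
    height = block.y1 - block.y0
    return height > 0 and (width / height) < max_aspect_ratio


def detect_vertical_fragments(blocks: Sequence[TextBlock], max_aspect_ratio: float = 0.3) -> List[TextBlock]:
    """Blocks whose width/height ratio is below the threshold."""
    return [b for b in blocks if _is_narrow(b, max_aspect_ratio)]


def should_merge_blocks(left: TextBlock, right: TextBlock, max_gap: float = 5.0) -> bool:
    """True when the blocks are horizontally adjacent and share vertical extent."""
    gap = right.x0 - left.x1
    y_overlap = min(left.y1, right.y1) - max(left.y0, right.y0)
    return abs(gap) <= max_gap and y_overlap > 0


def merge_vertical_fragments(
    blocks: Sequence[TextBlock], max_aspect_ratio: float = 0.3, max_gap: float = 5.0
) -> List[TextBlock]:
    """Merge runs of adjacent narrow blocks; other blocks pass through unchanged."""
    others = [b for b in blocks if not _is_narrow(b, max_aspect_ratio)]
    fragments = sorted(detect_vertical_fragments(blocks, max_aspect_ratio), key=lambda b: b.x0)
    merged: List[TextBlock] = []
    for frag in fragments:
        if merged and should_merge_blocks(merged[-1], frag, max_gap):
            prev = merged[-1]
            # Assumed reading order: concatenate left to right.
            merged[-1] = TextBlock(
                prev.text + frag.text,
                min(prev.x0, frag.x0), min(prev.y0, frag.y0),
                max(prev.x1, frag.x1), max(prev.y1, frag.y1),
            )
        else:
            merged.append(frag)
    return others + merged
```

If the scanned text reads right to left, as vertical CJK text often does, the sort order and concatenation would need to be reversed.
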
## 2. Configuration
### 2.1 Settings
- [x] 2.1.1 Add `table_column_correction_enabled: bool = True`
- [x] 2.1.2 Add `table_column_correction_threshold: float = 0.5`
- [x] 2.1.3 Add `vertical_fragment_merge_enabled: bool = True`
- [x] 2.1.4 Add `vertical_fragment_aspect_ratio: float = 0.3`
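
For reference, the four settings above could be grouped as shown below. The field names and defaults are taken from the list. The `TableCorrectionSettings` dataclass itself is a hypothetical container, since the project's real settings class is not shown here.

```python
from dataclasses import dataclass


@dataclass(frozen=True)
class TableCorrectionSettings:
    """Hypothetical grouping of the settings listed in 2.1."""
    table_column_correction_enabled: bool = True
    # Minimum X-overlap ratio needed before a cell is reassigned.
    table_column_correction_threshold: float = 0.5
    vertical_fragment_merge_enabled: bool = True
    # Blocks with width/height below this ratio count as vertical fragments.
    vertical_fragment_aspect_ratio: float = 0.3
```
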
## 3. Integration
### 3.1 Pipeline Integration
- [x] 3.1.1 Add correction step in `pdf_generator_service.py` before table rendering
- [x] 3.1.2 Pass corrected HTML to existing table rendering logic
- [x] 3.1.3 Add diagnostic logging for corrections made
### 3.2 Error Handling
- [x] 3.2.1 Handle tables without headers gracefully
- [x] 3.2.2 Handle empty/malformed cell_boxes
- [x] 3.2.3 Fallback to original structure on correction failure
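
The fallback behaviour in 3.2 can be illustrated with a small wrapper. This is a sketch under assumptions: `correct_with_fallback` is a hypothetical name, and `corrector` stands in for whatever callable performs the correction (for instance `correct_table_columns()`, whose real signature is not shown here).

```python
import logging
from typing import Callable, Sequence

logger = logging.getLogger(__name__)


def correct_with_fallback(
    html: str,
    cell_boxes: Sequence,
    corrector: Callable[[str, Sequence], str],
) -> str:
    """Apply `corrector`, returning the original HTML if it cannot run or fails."""
    if not cell_boxes:
        # Nothing to anchor against: keep the original structure.
        logger.debug("No cell_boxes available; skipping column correction")
        return html
    try:
        return corrector(html, cell_boxes)
    except Exception:
        # Log the traceback for diagnostics, then fall back to the input.
        logger.exception("Table column correction failed; using original structure")
        return html
```

Catching a broad `Exception` is a deliberate choice in this sketch, so that a correction bug degrades output quality instead of aborting PDF generation.
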
## 4. Testing
### 4.1 Unit Tests
- [ ] 4.1.1 Test `build_column_anchors()` with various header configurations
- [ ] 4.1.2 Test `correct_cell_column()` with known column shift cases
- [ ] 4.1.3 Test `merge_vertical_fragments()` with vertical text samples
- [ ] 4.1.4 Test edge cases: empty tables, single column, no headers
### 4.2 Integration Tests
- [ ] 4.2.1 Test with `scan.pdf` Table 7 (the problematic case)
- [ ] 4.2.2 Test with tables that have correct alignment (no regression)
- [ ] 4.2.3 Visual comparison of corrected vs original output
## 5. Documentation
- [x] 5.1 Add inline code comments explaining correction algorithm
- [x] 5.2 Update spec with new table column correction requirement
- [x] 5.3 Add logging messages for debugging