chore: backup before code cleanup

Backup commit before executing remove-unused-code proposal.
This includes all pending changes and new features.

🤖 Generated with [Claude Code](https://claude.com/claude-code)

Co-Authored-By: Claude Opus 4.5 <noreply@anthropic.com>
egg
2025-12-11 11:55:39 +08:00
parent eff9b0bcd5
commit 940a406dce
58 changed files with 8226 additions and 175 deletions

@@ -0,0 +1,42 @@
# Simple Text Positioning from Raw OCR
## Summary
Simplify OCR track PDF generation by rendering raw OCR text at its detected bounding-box positions, without reconstructing table structure.
## Problem
Current OCR track processing has multiple failure points:
1. PP-Structure table structure recognition fails on borderless tables
2. Multi-column layouts are incorrectly merged into a single table
3. Table HTML reconstruction places cells at the wrong positions
4. Complex column-correction algorithms still cannot fix fundamental structure errors
Meanwhile, the raw OCR output (`raw_ocr_regions.json`) correctly identifies all text with accurate bounding boxes.
## Solution
Replace complex table reconstruction with simple text positioning:
1. Read raw OCR regions directly
2. Position each text region at its bbox coordinates
3. Calculate text rotation from the shape of the bbox quadrilateral
4. Estimate font size from the bbox height
5. Skip table HTML parsing entirely for the OCR track
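Steps 2 to 4 can be sketched as follows. This is a minimal illustration, not the project's implementation: it assumes each region's bbox is a quadrilateral of four `(x, y)` points ordered top-left, top-right, bottom-right, bottom-left in image coordinates. The names `rotation_degrees`, `text_height`, `place_region`, and the `FONT_SCALE` ratio are hypothetical.

```python
import math
from typing import Dict, Sequence, Tuple

Point = Tuple[float, float]

# Hypothetical ratio of font size to bbox height; tune against real output.
FONT_SCALE = 0.75


def rotation_degrees(quad: Sequence[Point]) -> float:
    """Angle of the top edge (top-left -> top-right), in image coordinates (y grows downward)."""
    (x0, y0), (x1, y1) = quad[0], quad[1]
    return math.degrees(math.atan2(y1 - y0, x1 - x0))


def text_height(quad: Sequence[Point]) -> float:
    """Length of the left edge (top-left -> bottom-left), which stays valid for rotated boxes."""
    (x0, y0), (x3, y3) = quad[0], quad[3]
    return math.hypot(x3 - x0, y3 - y0)


def place_region(text: str, quad: Sequence[Point]) -> Dict[str, object]:
    """Turn one raw OCR region into the values a PDF renderer would need."""
    return {
        "text": text,
        "origin": quad[3],  # bottom-left corner as the text anchor (an assumption)
        "rotation": rotation_degrees(quad),
        "font_size": text_height(quad) * FONT_SCALE,
    }
```

For an axis-aligned box such as `[(0, 0), (100, 0), (100, 20), (0, 20)]`, this yields a rotation of 0 degrees and a font size of 15. Because image coordinates have y pointing down while PDF coordinates have y pointing up, a renderer would likely need to flip the y values and negate the angle before drawing.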
## Benefits
- **Reliability**: Output depends only on raw OCR bounding boxes, which are already accurate
- **Simplicity**: Eliminates complex table parsing logic
- **Performance**: Faster processing, since structure analysis is skipped
- **Consistency**: Predictable output regardless of table type
## Trade-offs
- No table borders in output
- No cell structure (colspan, rowspan)
- Output approximates the visual layout rather than preserving semantic structure
## Scope
- OCR track PDF generation only
- Direct track remains unchanged (uses native PDF text extraction)