- Removed all test files and directories
- Deleted outdated documentation (will be rewritten)
- Cleaned up temporary files, logs, and uploads
- Archived 5 completed OpenSpec proposals
- Created new dual-track-document-processing proposal with complete OpenSpec structure
  - Dual-track architecture: OCR track (PaddleOCR) + Direct track (PyMuPDF)
  - UnifiedDocument model for consistent output
  - Support for structure-preserving translation
- Updated .gitignore to prevent future test/temp files

This is a major cleanup preparing for the complete refactoring of the document processing pipeline.

🤖 Generated with [Claude Code](https://claude.com/claude-code)

Co-Authored-By: Claude <noreply@anthropic.com>
Change: Dual-track Document Processing with Structure-Preserving Translation
Why
The current system processes all documents through PaddleOCR, causing unnecessary overhead for editable PDFs that already contain extractable text. Additionally, we're only using ~20% of PP-StructureV3's capabilities, missing out on comprehensive document structure extraction. The system needs to support structure-preserving document translation as a future goal.
What Changes
- ADDED Dual-track processing architecture with intelligent routing
  - OCR track for scanned documents, images, and Office files using PaddleOCR
  - Direct extraction track for editable PDFs using PyMuPDF
- ADDED UnifiedDocument model as common output format for both tracks
- ADDED DocumentTypeDetector service for automatic track selection
- MODIFIED OCR service to use PP-StructureV3's parsing_res_list instead of markdown
  - Now extracts all 23 element types with bbox coordinates
  - Preserves reading order and hierarchical structure
- MODIFIED PDF generator to handle UnifiedDocument format
  - Enhanced overlap detection to prevent text/image/table collisions
  - Improved coordinate transformation for accurate layout
- ADDED Foundation for structure-preserving translation system
- BREAKING JSON output structure will include new fields (existing consumers stay compatible because the new fields have defaults)
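The routing and the shared output model listed above can be illustrated with a minimal sketch. Everything in it is an assumption made for illustration: the field names, the `choose_track` helper, and the thresholds are hypothetical and are not taken from the proposal's actual DocumentTypeDetector or UnifiedDocument definitions. The per-page character counts are assumed to come from a cheap text-extraction pass over the PDF.

```python
from dataclasses import dataclass, field
from enum import Enum
from typing import List, Tuple


class Track(str, Enum):
    OCR = "ocr"        # PaddleOCR path: scans, images, Office files
    DIRECT = "direct"  # PyMuPDF path: editable PDFs


@dataclass
class Element:
    # Hypothetical element shape: a type label, a bounding box, and text.
    type: str
    bbox: Tuple[float, float, float, float]  # (x0, y0, x1, y1)
    content: str = ""


@dataclass
class Page:
    number: int
    elements: List[Element] = field(default_factory=list)


@dataclass
class UnifiedDocument:
    # Both tracks would emit this same structure.
    source: str
    track: Track
    pages: List[Page] = field(default_factory=list)


def choose_track(
    mime_type: str,
    chars_per_page: List[int],
    min_chars: int = 50,      # assumed threshold, not from the proposal
    min_ratio: float = 0.8,   # assumed threshold, not from the proposal
) -> Track:
    """Pick a track from the MIME type and per-page extractable text counts."""
    # Anything that is not a PDF, or a PDF with no pages, goes to OCR.
    if mime_type != "application/pdf" or not chars_per_page:
        return Track.OCR
    # A PDF counts as editable when most pages carry a real text layer.
    text_pages = sum(1 for n in chars_per_page if n >= min_chars)
    if text_pages / len(chars_per_page) >= min_ratio:
        return Track.DIRECT
    return Track.OCR


doc = UnifiedDocument(
    source="report.pdf",
    track=choose_track("application/pdf", [1200, 900, 1500]),
    pages=[Page(number=1, elements=[Element("title", (72, 72, 500, 100), "Report")])],
)
```

Keeping the decision in a pure function of already-extracted counts makes the routing rule easy to unit test without loading PaddleOCR or PyMuPDF.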
Impact
- Affected specs:
  - document-processing (new capability)
  - result-export (enhanced with track metadata and structure data)
  - task-management (tracks processing route and history)
- Affected code:
  - backend/app/services/ocr_service.py - Major refactoring for dual-track
  - backend/app/services/pdf_generator_service.py - UnifiedDocument support
  - backend/app/api/v2/tasks.py - New endpoints for track detection
  - frontend/src/pages/TaskDetailPage.tsx - Display processing track info
- Performance: expected 5-10x faster processing for editable PDFs, unchanged speed for scanned documents
- Dependencies: Adds PyMuPDF, pdfplumber, python-magic-bin
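One way the new JSON fields could stay compatible with older results is to give each new field a default when loading. The sketch below assumes a hypothetical `TaskResult` shape with made-up field names (`processing_track`, `elements`). The default of "ocr" reflects that the current system routes every document through PaddleOCR. It is not the project's actual export schema.

```python
import json
from dataclasses import dataclass, field
from typing import Any, Dict, List


@dataclass
class TaskResult:
    task_id: str
    text: str
    # Hypothetical new fields; defaults keep pre-change payloads loadable.
    processing_track: str = "ocr"
    elements: List[Dict[str, Any]] = field(default_factory=list)


def load_result(raw: str) -> TaskResult:
    """Parse a result payload, ignoring unknown keys and defaulting new ones."""
    data = json.loads(raw)
    known = {"task_id", "text", "processing_track", "elements"}
    return TaskResult(**{k: v for k, v in data.items() if k in known})


# A payload written before the change still loads.
old = load_result('{"task_id": "t1", "text": "hello"}')

# A payload written after the change carries the track metadata.
new = load_result(
    '{"task_id": "t2", "text": "hi", "processing_track": "direct",'
    ' "elements": [{"type": "title", "bbox": [0, 0, 10, 10]}]}'
)
```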