# Change: Dual-track Document Processing with Structure-Preserving Translation

## Why

The current system processes all documents through PaddleOCR, causing unnecessary overhead for editable PDFs that already contain extractable text. Additionally, we're only using ~20% of PP-StructureV3's capabilities, missing out on comprehensive document structure extraction. The system needs to support structure-preserving document translation as a future goal.

## What Changes

- **ADDED** Dual-track processing architecture with intelligent routing
  - OCR track for scanned documents, images, and Office files using PaddleOCR
  - Direct extraction track for editable PDFs using PyMuPDF
- **ADDED** UnifiedDocument model as common output format for both tracks
- **ADDED** DocumentTypeDetector service for automatic track selection
- **MODIFIED** OCR service to use PP-StructureV3's parsing_res_list instead of markdown
  - Now extracts all 23 element types with bbox coordinates
  - Preserves reading order and hierarchical structure
- **MODIFIED** PDF generator to handle UnifiedDocument format
  - Enhanced overlap detection to prevent text/image/table collisions
  - Improved coordinate transformation for accurate layout
- **ADDED** Foundation for structure-preserving translation system
- **BREAKING** JSON output structure will include new fields (backward compatible with defaults)

## Impact

- **Affected specs**:
  - `document-processing` (new capability)
  - `result-export` (enhanced with track metadata and structure data)
  - `task-management` (tracks processing route and history)
- **Affected code**:
  - `backend/app/services/ocr_service.py` - Major refactoring for dual-track
  - `backend/app/services/pdf_generator_service.py` - UnifiedDocument support
  - `backend/app/api/v2/tasks.py` - New endpoints for track detection
  - `frontend/src/pages/TaskDetailPage.tsx` - Display processing track info
- **Performance**: 5-10x faster for editable PDFs, same speed for scanned documents
- **Dependencies**: Adds PyMuPDF, pdfplumber, python-magic-bin
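
To make the routing idea concrete, here is a minimal sketch of how track selection and the shared output model might fit together. It is illustrative only: `Track`, `Element`, `select_track`, the field names on `UnifiedDocument`, and both thresholds are assumptions made for this example, not the proposed implementation. The per-page character counts are assumed to come from a cheap text-extraction pass (for example, the length of PyMuPDF's `page.get_text()` for each page), which keeps the decision logic itself free of any PDF dependency.

```python
from dataclasses import dataclass, field
from enum import Enum
from pathlib import Path
from typing import List, Sequence, Tuple


class Track(str, Enum):
    """Processing route recorded on the task (names are hypothetical)."""
    OCR = "ocr"        # scanned documents, images, Office files
    DIRECT = "direct"  # editable PDFs with extractable text


@dataclass
class Element:
    """One layout element; the real model would cover all element types."""
    type: str
    bbox: Tuple[float, float, float, float]  # (x0, y0, x1, y1)
    content: str = ""


@dataclass
class UnifiedDocument:
    """Common output format shared by both tracks (fields are assumed)."""
    track: Track
    elements: List[Element] = field(default_factory=list)  # in reading order


def select_track(
    filename: str,
    page_char_counts: Sequence[int] = (),
    min_chars_per_page: int = 50,        # assumed threshold
    min_text_page_ratio: float = 0.8,    # assumed threshold
) -> Track:
    """Pick a track from the file extension and per-page text volume."""
    # Anything that is not a PDF goes through OCR.
    if Path(filename).suffix.lower() != ".pdf":
        return Track.OCR
    # A PDF with no page statistics is treated as scanned.
    if not page_char_counts:
        return Track.OCR
    # Count pages that carry a meaningful amount of extractable text.
    text_pages = sum(1 for n in page_char_counts if n >= min_chars_per_page)
    if text_pages / len(page_char_counts) >= min_text_page_ratio:
        return Track.DIRECT
    return Track.OCR


if __name__ == "__main__":
    track = select_track("report.pdf", [1200, 980, 1500])
    doc = UnifiedDocument(
        track=track,
        elements=[Element("title", (72.0, 72.0, 540.0, 96.0), "Quarterly Report")],
    )
    print(doc.track.value, len(doc.elements))
```

Falling back to the OCR track whenever the evidence is weak (no page statistics, or too few text-bearing pages) is a deliberately conservative choice in this sketch: a misrouted scanned PDF would yield empty output on the direct track, whereas a misrouted editable PDF on the OCR track only costs time.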