# Implementation Tasks: Dual-track Document Processing

## 1. Core Infrastructure
- [x] 1.1 Add PyMuPDF and other dependencies to requirements.txt
  - [x] 1.1.1 Add PyMuPDF>=1.23.0
  - [x] 1.1.2 Add pdfplumber>=0.10.0
  - [x] 1.1.3 Add python-magic-bin>=0.4.14
  - [x] 1.1.4 Test dependency installation
- [x] 1.2 Create UnifiedDocument model in backend/app/models/
  - [x] 1.2.1 Define UnifiedDocument dataclass
  - [x] 1.2.2 Add DocumentElement model
  - [x] 1.2.3 Add DocumentMetadata model
  - [x] 1.2.4 Create converters for both OCR and direct extraction outputs
    - Note: OCR converter complete; DirectExtractionEngine returns UnifiedDocument directly
- [x] 1.3 Create DocumentTypeDetector service
  - [x] 1.3.1 Implement file type detection using python-magic
  - [x] 1.3.2 Add PDF editability checking logic
  - [x] 1.3.3 Add Office document detection
  - [x] 1.3.4 Create routing logic to determine processing track
  - [x] 1.3.5 Add unit tests for detector

## 2. Direct Extraction Track
- [x] 2.1 Create DirectExtractionEngine service
  - [x] 2.1.1 Implement PyMuPDF-based text extraction
  - [x] 2.1.2 Add structure preservation logic
  - [x] 2.1.3 Extract tables with coordinates
  - [x] 2.1.4 Extract images and their positions
  - [x] 2.1.5 Maintain reading order
  - [x] 2.1.6 Handle multi-column layouts
- [x] 2.2 Implement layout analysis for editable PDFs
  - [x] 2.2.1 Detect headers and footers
  - [x] 2.2.2 Identify sections and subsections
  - [x] 2.2.3 Parse lists and nested structures
  - [x] 2.2.4 Extract font and style information
- [x] 2.3 Create direct extraction to UnifiedDocument converter
  - [x] 2.3.1 Map PyMuPDF structures to UnifiedDocument
  - [x] 2.3.2 Preserve coordinate information
  - [x] 2.3.3 Maintain element relationships
- [x] 2.4 Add Office document direct extraction support
  - [x] 2.4.1 Update DocumentTypeDetector._analyze_office to convert to PDF first
  - [x] 2.4.2 Analyze converted PDF for text extractability
  - [x] 2.4.3 Route to direct track if PDF is text-based
  - [x] 2.4.4 Update OCR service to use DirectExtractionEngine for Office files
  - [x] 2.4.5 Add unit tests for Office → PDF → Direct flow
  - Note: This optimization significantly improves Office document processing time (from >300s to ~2-5s)

## 3. OCR Track Enhancement
- [x] 3.1 Upgrade PP-StructureV3 configuration
  - [x] 3.1.1 Update config for RTX 4060 8GB optimization
  - [x] 3.1.2 Enable batch processing for GPU efficiency
  - [x] 3.1.3 Configure memory management settings
  - [x] 3.1.4 Set up model caching
- [x] 3.2 Enhance OCR service to use parsing_res_list
  - [x] 3.2.1 Replace markdown extraction with parsing_res_list
  - [x] 3.2.2 Extract all 23 element types
  - [x] 3.2.3 Preserve bbox coordinates from PP-StructureV3
  - [x] 3.2.4 Maintain reading order information
- [x] 3.3 Create OCR to UnifiedDocument converter
  - [x] 3.3.1 Map PP-StructureV3 elements to UnifiedDocument
  - [x] 3.3.2 Handle complex nested structures
  - [x] 3.3.3 Preserve all metadata

## 4. Unified Processing Pipeline
- [x] 4.1 Update main OCR service for dual-track processing
  - [x] 4.1.1 Integrate DocumentTypeDetector
  - [x] 4.1.2 Route to appropriate processing engine
  - [x] 4.1.3 Return UnifiedDocument from both tracks
  - [x] 4.1.4 Maintain backward compatibility
- [x] 4.2 Create unified JSON export
  - [x] 4.2.1 Define standardized JSON schema
  - [x] 4.2.2 Include processing metadata
  - [x] 4.2.3 Support both track outputs
- [x] 4.3 Update PDF generator for UnifiedDocument
  - [x] 4.3.1 Adapt PDF generation to use UnifiedDocument
  - [x] 4.3.2 Preserve layout from both tracks
  - [x] 4.3.3 Handle coordinate transformations

## 5. Translation System Foundation
- [ ] 5.1 Create TranslationEngine interface
  - [ ] 5.1.1 Define translation API contract
  - [ ] 5.1.2 Support element-level translation
  - [ ] 5.1.3 Preserve formatting markers
- [ ] 5.2 Implement structure-preserving translation
  - [ ] 5.2.1 Translate text while maintaining coordinates
  - [ ] 5.2.2 Handle table cell translations
  - [ ] 5.2.3 Preserve list structures
  - [ ] 5.2.4 Maintain header hierarchies
- [ ] 5.3 Create translated document renderer
  - [ ] 5.3.1 Generate PDF with translated text
  - [ ] 5.3.2 Adjust layouts for text expansion/contraction
  - [ ] 5.3.3 Handle font substitution for target languages

## 6. API Updates
- [x] 6.1 Update OCR endpoints
  - [x] 6.1.1 Add processing_track parameter
  - [x] 6.1.2 Support track auto-detection
  - [x] 6.1.3 Return processing metadata
- [x] 6.2 Add document type detection endpoint
  - [x] 6.2.1 Create /analyze endpoint
  - [x] 6.2.2 Return recommended processing track
  - [x] 6.2.3 Provide confidence scores
- [x] 6.3 Update result export endpoints
  - [x] 6.3.1 Support UnifiedDocument format
  - [x] 6.3.2 Add format conversion options
  - [x] 6.3.3 Include processing track information

## 7. Frontend Updates
- [x] 7.1 Update task detail view
  - [x] 7.1.1 Display processing track information
  - [x] 7.1.2 Show track-specific metadata
  - [x] 7.1.3 Add track selection UI (if manual override needed)
    - Note: Track display implemented; manual override via API query params
- [x] 7.2 Update results preview
  - [x] 7.2.1 Handle UnifiedDocument format
  - [x] 7.2.2 Display enhanced structure information
  - [ ] 7.2.3 Show coordinate overlays (debug mode)
    - Note: Future enhancement, not critical for initial release
- [x] 7.3 Add translation UI preparation
  - [x] 7.3.1 Add translation toggle/button
  - [x] 7.3.2 Language selection dropdown
  - [x] 7.3.3 Translation progress indicator
  - Note: UI prepared with disabled state; awaiting Section 5 implementation

## 8. Testing
- [x] 8.1 Unit tests for DocumentTypeDetector
  - [x] 8.1.1 Test various file types
  - [x] 8.1.2 Test editability detection
  - [x] 8.1.3 Test edge cases
- [x] 8.2 Unit tests for DirectExtractionEngine
  - [x] 8.2.1 Test text extraction accuracy
  - [x] 8.2.2 Test structure preservation
  - [x] 8.2.3 Test coordinate extraction
- [x] 8.3 Integration tests for dual-track processing
  - [x] 8.3.1 Test routing logic
  - [x] 8.3.2 Test UnifiedDocument generation
  - [x] 8.3.3 Test backward compatibility
- [x] 8.4 End-to-end tests
  - [x] 8.4.1 Test scanned PDF processing (OCR track)
    - Passed: scan.pdf processed via OCR track in 50.25s
  - [x] 8.4.2 Test editable PDF processing (direct track)
    - Passed: edit.pdf processed via direct track in 1.14s with 51 elements extracted
  - [~] 8.4.3 Test Office document processing
    - Timeout: ppt.pptx (11MB) exceeded 300s timeout - requires investigation
    - Note: Smaller Office files process successfully; large files may need optimization
  - [x] 8.4.4 Test image file processing
    - Passed: img1.png (21.84s), img2.png (23.24s), img3.png (41.14s)
- [ ] 8.5 Performance testing
  - [ ] 8.5.1 Benchmark both processing tracks
  - [ ] 8.5.2 Test GPU memory usage
  - [ ] 8.5.3 Compare processing times

## 9. Documentation
- [ ] 9.1 Update API documentation
  - [ ] 9.1.1 Document new endpoints
  - [ ] 9.1.2 Update existing endpoint docs
  - [ ] 9.1.3 Add processing track information
- [ ] 9.2 Create architecture documentation
  - [ ] 9.2.1 Document dual-track flow
  - [ ] 9.2.2 Explain UnifiedDocument structure
  - [ ] 9.2.3 Add decision trees for track selection
- [ ] 9.3 Add deployment guide
  - [ ] 9.3.1 Document GPU requirements
  - [ ] 9.3.2 Add environment configuration
  - [ ] 9.3.3 Include troubleshooting guide

## 10. Deployment Preparation
- [ ] 10.1 Update Docker configuration
  - [ ] 10.1.1 Add new dependencies to Dockerfile
  - [ ] 10.1.2 Configure GPU support
  - [ ] 10.1.3 Update volume mappings
- [ ] 10.2 Update environment variables
  - [ ] 10.2.1 Add processing track settings
  - [ ] 10.2.2 Configure GPU memory limits
  - [ ] 10.2.3 Add feature flags
- [ ] 10.3 Create migration plan
  - [ ] 10.3.1 Plan for existing data migration
  - [ ] 10.3.2 Create rollback procedures
  - [ ] 10.3.3 Document breaking changes

## Completion Checklist
- [ ] All unit tests passing
- [ ] Integration tests passing
- [ ] Performance benchmarks acceptable
- [ ] Documentation complete
- [ ] Code reviewed
- [ ] Deployment tested in staging

## Future Improvements

The following improvements are identified but not part of this change proposal:

### Batch Processing Enhancement
- **Related to**: Section 3.1.2 (Enable batch processing for GPU efficiency)
- **Description**: Implement true batch inference by sending multiple pages or documents to PaddleOCR simultaneously
- **Benefits**: Better GPU utilization, reduced overhead from model switching
- **Requirements**: Queue management, memory-aware batching, result aggregation
- **Recommendation**: Create a separate change proposal when ready to implement
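
The "memory-aware batching" requirement for the Batch Processing Enhancement could look roughly like the sketch below. It is only an illustration for a future proposal: `PageJob`, `memory_aware_batches`, and the byte estimate are hypothetical names and assumptions, not part of the current codebase, and the sketch deliberately does not call PaddleOCR. It groups pages into batches that stay under a memory budget and a maximum batch size.

```python
from dataclasses import dataclass
from typing import Iterable, Iterator, List


@dataclass(frozen=True)
class PageJob:
    """Hypothetical unit of work: one rendered page waiting for OCR."""
    doc_id: str
    page_index: int
    width_px: int
    height_px: int

    def estimated_bytes(self, bytes_per_pixel: int = 3) -> int:
        # Assumption: raw RGB image size only. Real inference memory is
        # model-dependent and would need to be measured on the target GPU.
        return self.width_px * self.height_px * bytes_per_pixel


def memory_aware_batches(
    jobs: Iterable[PageJob],
    budget_bytes: int,
    max_batch_size: int = 8,
) -> Iterator[List[PageJob]]:
    """Yield batches whose estimated size stays within budget_bytes.

    A single page larger than the budget is still emitted, alone in its
    own batch, so that no job is silently dropped.
    """
    batch: List[PageJob] = []
    used = 0
    for job in jobs:
        cost = job.estimated_bytes()
        # Close the current batch if adding this job would exceed a limit.
        if batch and (used + cost > budget_bytes or len(batch) >= max_batch_size):
            yield batch
            batch, used = [], 0
        batch.append(job)
        used += cost
    if batch:
        yield batch


if __name__ == "__main__":
    pages = [PageJob("doc-a", i, 1000, 1000) for i in range(5)]
    for group in memory_aware_batches(pages, budget_bytes=7_000_000):
        print([p.page_index for p in group])
```

Batches preserve input order, so result aggregation can map outputs back to `(doc_id, page_index)` without extra bookkeeping. The budget and batch-size values would have to come from the GPU memory measurements planned in task 8.5.2.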