# Technical Design: Dual-track Document Processing

## Context

### Background

The current OCR tool processes all documents through PaddleOCR, even when dealing with editable PDFs that contain extractable text. This causes:

- Unnecessary processing overhead
- Potential quality degradation from re-OCRing already digital text
- Loss of precise formatting information
- Inefficient GPU usage on documents that don't need OCR

### Constraints

- RTX 4060 8GB GPU memory limitation
- Need to maintain backward compatibility with the existing API
- Must support future translation features
- Should handle mixed documents (partially scanned, partially digital)

### Stakeholders

- API consumers expecting consistent JSON/PDF output
- Translation system requiring structure preservation
- Performance-sensitive deployments

## Goals / Non-Goals

### Goals

- Intelligently route documents to the appropriate processing track
- Preserve document structure for translation
- Optimize GPU usage by avoiding unnecessary OCR
- Maintain a unified output format across tracks
- Reduce processing time for editable PDFs by 70%+

### Non-Goals

- Implementing the actual translation engine (future phase)
- Supporting video or audio transcription
- Real-time collaborative editing
- OCR model training or fine-tuning

## Decisions

### Decision 1: Dual-track Architecture

**What**: Implement two separate processing pipelines - an OCR track and a Direct extraction track

**Why**:

- Editable PDFs don't need OCR and can be processed 10-100x faster
- Direct extraction preserves exact formatting and fonts
- The OCR track remains optimal for scanned documents

**Alternatives considered**:

1. **Single enhanced OCR pipeline**: Would still waste resources on editable PDFs
2. **Hybrid approach per page**: Too complex; most documents are uniformly editable or scanned
3. **Multiple specialized pipelines**: Over-engineering for current requirements

### Decision 2: UnifiedDocument Model

**What**: Create a standardized intermediate representation for both tracks

**Why**:

- Provides a consistent API interface regardless of processing track
- Simplifies downstream processing (PDF generation, translation)
- Enables track switching without breaking changes

**Structure**:

```python
from __future__ import annotations

from dataclasses import dataclass
from typing import Dict, List, Literal, Optional, Union

# DocumentMetadata, Dimensions, ElementType, BoundingBox, and StyleInfo
# are supporting types defined alongside this model.


@dataclass
class UnifiedDocument:
    document_id: str
    metadata: DocumentMetadata
    pages: List[Page]
    processing_track: Literal["ocr", "direct"]


@dataclass
class Page:
    page_number: int
    elements: List[DocumentElement]
    dimensions: Dimensions


@dataclass
class DocumentElement:
    element_id: str
    type: ElementType  # text, table, image, header, etc.
    content: Union[str, Dict, bytes]
    bbox: BoundingBox
    style: Optional[StyleInfo]
    confidence: Optional[float]  # Only for OCR track
```

### Decision 3: PyMuPDF for Direct Extraction

**What**: Use the PyMuPDF (fitz) library for editable PDF processing

**Why**:

- Mature, well-maintained library
- Excellent coordinate preservation
- Fast C++ backend
- Supports text, table, and image extraction with positions

**Alternatives considered**:

1. **pdfplumber**: Good but slower, less precise coordinates
2. **PyPDF2**: Limited layout information
3. **PDFMiner**: Complex API, slower performance
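To make the direct track concrete, here is a minimal sketch of how PyMuPDF's block-level extraction could feed the `UnifiedDocument` model above. The helper name `extract_page_elements` and the intermediate dictionary shape are illustrative assumptions, not the final engine interface.

```python
import fitz  # PyMuPDF


def extract_page_elements(pdf_path: str, page_number: int) -> list[dict]:
    """Sketch: pull positioned text blocks from one page of an editable PDF
    so they can be mapped onto DocumentElement instances."""
    with fitz.open(pdf_path) as doc:
        page = doc[page_number]
        elements = []
        # get_text("blocks") yields (x0, y0, x1, y1, text, block_no, block_type);
        # block_type 0 is text, 1 is an image block.
        for x0, y0, x1, y1, text, block_no, block_type in page.get_text("blocks"):
            if block_type != 0 or not text.strip():
                continue
            elements.append({
                "element_id": f"p{page_number}-b{block_no}",
                "type": "text",
                "content": text.strip(),
                "bbox": (x0, y0, x1, y1),
                "confidence": None,  # Direct track: no OCR confidence score
            })
        return elements
```

Font and size details for `StyleInfo` would come from the richer `page.get_text("dict")` output, which exposes per-span font metadata; the block-level view keeps this sketch short.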
### Decision 4: Processing Track Auto-detection

**What**: Automatically determine the optimal track based on document analysis

**Detection logic**:

```python
from pathlib import Path

import fitz  # PyMuPDF
import magic


def detect_track(file_path: Path) -> str:
    file_type = magic.from_file(str(file_path), mime=True)

    if file_type.startswith('image/'):
        return "ocr"

    if file_type == 'application/pdf':
        # Check if the PDF has extractable text
        with fitz.open(file_path) as doc:
            for page in doc.pages(0, min(3, doc.page_count)):  # Sample first 3 pages
                text = page.get_text()
                if len(text.strip()) < 100:  # Minimal text
                    return "ocr"
        return "direct"

    if file_type in OFFICE_MIMES:
        # Convert Office to PDF first, then analyze
        pdf_path = convert_office_to_pdf(file_path)
        return detect_track(pdf_path)  # Recursive call on the PDF

    return "ocr"  # Default fallback
```

**Office Document Processing Strategy**:

1. Convert Office files (Word, PPT, Excel) to PDF using LibreOffice
2. Analyze the resulting PDF for text extractability
3. Route based on PDF analysis:
   - Text-based PDF → Direct track (faster, more accurate)
   - Image-based PDF → OCR track (for scanned content in Office docs)

This approach ensures:

- A consistent processing pipeline (all documents become PDF first)
- Optimal routing based on actual content
- Significant performance improvement for editable Office documents
- Better layout preservation (no OCR errors on text content)

### Decision 5: GPU Memory Management

**What**: Implement dynamic batch sizing and model caching for the RTX 4060 8GB

**Why**:

- Prevents OOM errors
- Maximizes throughput
- Enables concurrent request handling

**Strategy**:

```python
from functools import lru_cache

# Adaptive batch sizing based on available memory
batch_size = calculate_batch_size(
    available_memory=get_gpu_memory(),
    image_size=image.shape,
    model_size=MODEL_MEMORY_REQUIREMENTS,
)


# Model caching to avoid reload overhead
@lru_cache(maxsize=2)
def get_model(model_type: str):
    return load_model(model_type)
```

### Decision 6: Backward Compatibility

**What**: Maintain the existing API while adding new capabilities

**How**:

- Existing endpoints continue working unchanged
- The new `processing_track` parameter is optional
- Output format stays compatible with current consumers
- Gradual migration path for clients

## Risks / Trade-offs

### Risk 1: Mixed Content Documents

**Risk**: Documents with both scanned and digital pages

**Mitigation**:

- Page-level track detection as a fallback
- Confidence scoring to identify uncertain pages
- Manual override option via the API

### Risk 2: Direct Extraction Quality

**Risk**: Some PDFs have poor internal structure

**Mitigation**:

- Fall back to the OCR track if extraction quality is low (a heuristic sketch follows at the end of this section)
- Quality metrics: text density, structure coherence
- User-reportable quality issues

### Risk 3: Memory Pressure

**Risk**: RTX 4060 8GB limitation with concurrent requests

**Mitigation**:

- Request queuing system
- Dynamic batch adjustment
- CPU fallback for overflow

### Trade-off 1: Processing Time vs Accuracy

- Direct extraction: fast but depends on PDF quality
- OCR: slower but consistent quality
- **Decision**: Prioritize speed for editable PDFs, accuracy for scanned documents

### Trade-off 2: Complexity vs Flexibility

- Two tracks increase system complexity
- But they enable optimal processing per document type
- **Decision**: Accept the complexity for 10x+ performance gains
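For Risk 2, the text-density half of the quality metric could be expressed as a simple heuristic like the sketch below; structure coherence would need a separate check. The function assumes the per-page element dictionaries produced by a direct-extraction step such as the earlier PyMuPDF example, and both thresholds (100 characters, 50% of pages) are placeholders to be tuned against real documents.

```python
def should_fall_back_to_ocr(pages: list[list[dict]], min_chars_per_page: int = 100) -> bool:
    """Return True when most directly extracted pages carry too little text
    to trust the PDF's internal structure (i.e. they are probably scans)."""
    if not pages:
        return True
    sparse_pages = sum(
        1
        for elements in pages
        if sum(len(str(e.get("content", ""))) for e in elements) < min_chars_per_page
    )
    # Re-route the whole document to the OCR track if more than half
    # of its pages look sparse.
    return sparse_pages / len(pages) > 0.5
```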
## Migration Plan

### Phase 1: Infrastructure (Weeks 1-2)

1. Deploy the UnifiedDocument model
2. Implement the DocumentTypeDetector
3. Add the DirectExtractionEngine
4. Update logging and monitoring

### Phase 2: Integration (Week 3)

1. Update the OCR service with routing logic
2. Modify the PDF generator for the unified model
3. Add new API endpoints
4. Deploy to staging

### Phase 3: Validation (Week 4)

1. A/B testing with a subset of traffic
2. Performance benchmarking
3. Quality validation
4. Client integration testing

### Rollback Plan

1. Feature flag to disable dual-track processing
2. Fall back all requests to the OCR track
3. Maintain old code paths during the transition
4. Keep database migrations reversible

## Open Questions

### Resolved

- Q: Should we support page-level track mixing?
  - A: No; it adds complexity with minimal benefit. Document-level detection is sufficient.
- Q: How to handle Office documents?
  - A: Convert to PDF using LibreOffice, then analyze the PDF for text extractability.
    - Text-based PDF → Direct track (editable Office docs produce text PDFs)
    - Image-based PDF → OCR track (rare case of scanned content in Office)
  - This approach provides:
    - 10x+ faster processing for typical Office documents
    - Better layout preservation (no OCR errors)
    - A consistent pipeline (all documents normalized to PDF first)

### Pending

- Q: What translation services to integrate with?
  - Needs stakeholder input on cost/quality trade-offs
- Q: Should we cache extracted text for repeated processing?
  - Depends on storage costs vs reprocessing frequency
- Q: How to handle password-protected PDFs?
  - May need an API parameter for passwords

## Performance Targets

### Direct Extraction Track

- Latency: <500ms per page
- Throughput: 100+ pages/minute
- Memory: <500MB per document

### OCR Track (Optimized)

- Latency: 2-5s per page (GPU)
- Throughput: 20-30 pages/minute
- Memory: <2GB per batch

### API Response Times

- Document type detection: <100ms
- Processing initiation: <200ms
- Result retrieval: <100ms

## Technical Dependencies

### Python Packages

```
# Direct extraction
PyMuPDF==1.23.x
pdfplumber==0.10.x       # Fallback/validation
python-magic-bin==0.4.x

# OCR enhancement
paddlepaddle-gpu==2.5.2
paddleocr==2.7.3

# Infrastructure
pydantic==2.x
fastapi==0.100+
redis==5.x               # For caching
```

### System Requirements

- CUDA 11.8+ for PaddlePaddle
- libmagic for file detection
- 16GB RAM minimum
- 50GB disk for models and cache

## GPU Memory Management

### Background

With the RTX 4060 8GB GPU constraint and the large PP-StructureV3 models, GPU OOM (out-of-memory) errors can occur during intensive OCR processing. Proper memory management is critical for reliable operation.

### Implementation Strategy

#### 1. Memory Cleanup System

**Location**: `backend/app/services/ocr_service.py`

**Methods**:

- `cleanup_gpu_memory()`: Cleans GPU memory after processing
- `check_gpu_memory()`: Checks available memory before operations

**Cleanup Strategy**:

```python
# Method of the OCR service; relies on module-level `import gc`, `import paddle`,
# and the optional torch import shown in section 4 below.
def cleanup_gpu_memory(self):
    """Clean up GPU memory using PaddlePaddle and optionally torch"""
    # Clear PaddlePaddle GPU cache (primary)
    if paddle.device.is_compiled_with_cuda():
        paddle.device.cuda.empty_cache()

    # Clear torch GPU cache if available (optional)
    if TORCH_AVAILABLE and torch.cuda.is_available():
        torch.cuda.empty_cache()
        torch.cuda.synchronize()

    # Force Python garbage collection
    gc.collect()
```

#### 2. Cleanup Points

GPU memory cleanup is triggered at strategic points (the general wiring pattern is sketched after this list):

1. **After OCR processing** ([ocr_service.py:687](backend/app/services/ocr_service.py#L687))
   - After completing image OCR processing
2. **After layout analysis** ([ocr_service.py:807-808, 913-914](backend/app/services/ocr_service.py#L807-L914))
   - After enhanced PP-StructureV3 processing
   - After standard structure analysis
3. **After traditional processing** ([ocr_service.py:1105-1106](backend/app/services/ocr_service.py#L1105))
   - After processing all pages in traditional mode
4. **On error** ([pp_structure_enhanced.py:168-177](backend/app/services/pp_structure_enhanced.py#L168))
   - Clean up memory when PP-StructureV3 processing fails
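The call sites above live in the service code itself; the pattern they share is simply to run cleanup in a `finally` block so memory is released on both success and failure. The sketch below illustrates only that wiring: `OCRServiceSketch`, `process_document`, and the `process_page` callable are placeholder names, while `cleanup_gpu_memory()` stands in for the real method shown under Cleanup Strategy.

```python
class OCRServiceSketch:
    """Placeholder service: only the cleanup wiring is shown."""

    def cleanup_gpu_memory(self) -> None:
        # Stand-in for the real cleanup method (PaddlePaddle/torch cache clearing)
        print("GPU memory cleanup")

    def process_document(self, pages, process_page):
        """Run a per-page processing step, then always clean up GPU memory,
        matching cleanup points 1-4 above (success and error paths)."""
        results = []
        try:
            for page in pages:
                results.append(process_page(page))
            return results
        finally:
            # Runs whether the loop completed or raised
            self.cleanup_gpu_memory()
```

Keeping the cleanup call in `finally` covers the on-error case (point 4) without duplicating cleanup logic in every error handler.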
#### 3. Memory Monitoring

**Pre-processing checks** prevent OOM errors:

```python
def check_gpu_memory(self, required_mb: int = 2000) -> bool:
    """Check if sufficient GPU memory is available"""
    # Get free memory via torch if available
    if TORCH_AVAILABLE and torch.cuda.is_available():
        free_mb = torch.cuda.mem_get_info()[0] / 1024**2
        if free_mb < required_mb:
            # Try cleanup and re-check
            self.cleanup_gpu_memory()
            free_mb = torch.cuda.mem_get_info()[0] / 1024**2
            if free_mb < required_mb:
                # Log a warning if still insufficient; processing is not blocked
                pass
    return True  # Continue even if the check fails (graceful degradation)
```

**Memory checks before**:

- OCR processing: 1500MB required
- PP-StructureV3 processing: 2000MB required

#### 4. Optional torch Dependency

torch is **not required** for GPU memory management. The system uses PaddlePaddle's built-in `paddle.device.cuda.empty_cache()` as the primary method.

**Why optional**:

- The project uses PaddlePaddle, which has its own CUDA implementation
- torch provides additional memory monitoring via `mem_get_info()`
- The code degrades gracefully if torch is not installed

**Import pattern**:

```python
try:
    import torch
    TORCH_AVAILABLE = True
except ImportError:
    TORCH_AVAILABLE = False
```

#### 5. Benefits

- **Prevents OOM errors**: Regular cleanup prevents memory accumulation
- **Better GPU utilization**: Freed memory is available for subsequent operations
- **Graceful degradation**: Works without torch and continues on cleanup failures
- **Debug visibility**: Logs memory status for troubleshooting

#### 6. Performance Impact

- Cleanup overhead: <50ms per operation
- Memory recovery: typically 200-500MB per cleanup
- No impact on accuracy or output quality
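The overhead and recovery figures above can be spot-checked with a small harness such as the sketch below. It assumes torch with CUDA is installed (so `mem_get_info()` is available), accepts any object exposing `cleanup_gpu_memory()`, and the numbers it prints will vary with the workload and driver state.

```python
import time

import torch


def measure_cleanup(service) -> None:
    """Print how long one cleanup call takes and how much GPU memory it frees."""
    free_before = torch.cuda.mem_get_info()[0] / 1024**2  # MB free before cleanup

    start = time.perf_counter()
    service.cleanup_gpu_memory()
    elapsed_ms = (time.perf_counter() - start) * 1000

    free_after = torch.cuda.mem_get_info()[0] / 1024**2  # MB free after cleanup
    print(f"cleanup: {elapsed_ms:.1f} ms, recovered {free_after - free_before:.0f} MB")
```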