# Technical Design: Dual-track Document Processing

## Context

### Background

The current OCR tool processes all documents through PaddleOCR, even when dealing with editable PDFs that contain extractable text. This causes:

- Unnecessary processing overhead
- Potential quality degradation from re-OCRing already digital text
- Loss of precise formatting information
- Inefficient GPU usage on documents that don't need OCR

### Constraints

- RTX 4060 8GB GPU memory limitation
- Need to maintain backward compatibility with the existing API
- Must support future translation features
- Should handle mixed documents (partially scanned, partially digital)

### Stakeholders

- API consumers expecting consistent JSON/PDF output
- Translation system requiring structure preservation
- Performance-sensitive deployments

## Goals / Non-Goals

### Goals

- Intelligently route documents to the appropriate processing track
- Preserve document structure for translation
- Optimize GPU usage by avoiding unnecessary OCR
- Maintain a unified output format across tracks
- Reduce processing time for editable PDFs by 70%+

### Non-Goals

- Implementing the actual translation engine (future phase)
- Supporting video or audio transcription
- Real-time collaborative editing
- OCR model training or fine-tuning

## Decisions

### Decision 1: Dual-track Architecture

**What**: Implement two separate processing pipelines: an OCR track and a Direct extraction track.

**Why**:

- Editable PDFs don't need OCR and can be processed 10-100x faster
- Direct extraction preserves exact formatting and fonts
- The OCR track remains optimal for scanned documents

**Alternatives considered**:

1. **Single enhanced OCR pipeline**: Would still waste resources on editable PDFs
2. **Hybrid approach per page**: Too complex; most documents are uniformly editable or scanned
3. **Multiple specialized pipelines**: Over-engineering for current requirements

### Decision 2: UnifiedDocument Model

**What**: Create a standardized intermediate representation for both tracks.

**Why**:

- Provides a consistent API interface regardless of processing track
- Simplifies downstream processing (PDF generation, translation)
- Enables track switching without breaking changes

**Structure**:

```python
from dataclasses import dataclass
from typing import Dict, List, Literal, Optional, Union

@dataclass
class UnifiedDocument:
    document_id: str
    metadata: DocumentMetadata
    pages: List[Page]
    processing_track: Literal["ocr", "direct"]

@dataclass
class Page:
    page_number: int
    elements: List[DocumentElement]
    dimensions: Dimensions

@dataclass
class DocumentElement:
    element_id: str
    type: ElementType  # text, table, image, header, etc.
    content: Union[str, Dict, bytes]
    bbox: BoundingBox
    style: Optional[StyleInfo]
    confidence: Optional[float]  # Only for OCR track
```

### Decision 3: PyMuPDF for Direct Extraction

**What**: Use the PyMuPDF (fitz) library for editable PDF processing.

**Why**:

- Mature, well-maintained library
- Excellent coordinate preservation
- Fast C++ backend
- Supports text, table, and image extraction with positions

**Alternatives considered**:

1. **pdfplumber**: Good but slower, less precise coordinates
2. **PyPDF2**: Limited layout information
3. **PDFMiner**: Complex API, slower performance
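To make the direct-extraction step concrete, the sketch below shows how positioned text blocks could be pulled from an editable PDF with PyMuPDF and mapped onto the element shape from Decision 2. The function name `extract_page_elements` and the plain-dict element representation are illustrative only; the real implementation would construct `DocumentElement` and `Page` objects.

```python
# Minimal sketch: direct extraction of positioned text blocks with PyMuPDF.
# The helper name and the dict-based element shape are illustrative, not the
# existing codebase's API.
from pathlib import Path
from typing import Dict, List

import fitz  # PyMuPDF


def extract_page_elements(pdf_path: Path, page_number: int = 0) -> List[Dict]:
    """Return text blocks with bounding boxes for a single page."""
    elements: List[Dict] = []
    with fitz.open(pdf_path) as doc:
        page = doc[page_number]
        # "dict" mode returns blocks -> lines -> spans with coordinates and fonts
        for block in page.get_text("dict")["blocks"]:
            if block.get("type") != 0:  # 0 = text block; 1 = image block
                continue
            text = " ".join(
                span["text"]
                for line in block["lines"]
                for span in line["spans"]
            )
            elements.append({
                "type": "text",
                "content": text.strip(),
                "bbox": block["bbox"],  # (x0, y0, x1, y1) in points
                "confidence": None,     # not applicable on the direct track
            })
    return elements
```

PyMuPDF 1.23 also exposes `page.find_tables()`, which the table elements could be built on, so the direct track would not need a second library for simple tables.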
### Decision 4: Processing Track Auto-detection

**What**: Automatically determine the optimal track based on document analysis.

**Detection logic**:

```python
from pathlib import Path

import fitz   # PyMuPDF
import magic  # python-magic, backed by libmagic


def detect_track(file_path: Path) -> str:
    file_type = magic.from_file(str(file_path), mime=True)
    if file_type.startswith('image/'):
        return "ocr"
    if file_type == 'application/pdf':
        # Check whether the PDF has extractable text
        with fitz.open(file_path) as doc:
            for page_number in range(min(3, doc.page_count)):  # Sample the first 3 pages
                text = doc[page_number].get_text()
                if len(text.strip()) < 100:  # Minimal text -> likely scanned
                    return "ocr"
        return "direct"
    if file_type in OFFICE_MIMES:  # module-level set of Office MIME types
        return "ocr"  # For now; direct Office support may be added later
    return "ocr"  # Default fallback
```

### Decision 5: GPU Memory Management

**What**: Implement dynamic batch sizing and model caching for the RTX 4060's 8GB of memory.

**Why**:

- Prevents OOM errors
- Maximizes throughput
- Enables concurrent request handling

**Strategy**:

```python
from functools import lru_cache

# Adaptive batch sizing based on available GPU memory
# (calculate_batch_size, get_gpu_memory, load_model are service-level helpers)
batch_size = calculate_batch_size(
    available_memory=get_gpu_memory(),
    image_size=image.shape,
    model_size=MODEL_MEMORY_REQUIREMENTS,
)

# Model caching to avoid reload overhead
@lru_cache(maxsize=2)
def get_model(model_type: str):
    return load_model(model_type)
```

### Decision 6: Backward Compatibility

**What**: Maintain the existing API while adding new capabilities.

**How**:

- Existing endpoints continue working unchanged
- The new `processing_track` parameter is optional
- Output format stays compatible with current consumers
- Gradual migration path for clients
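As a sketch of how the optional parameter could be exposed without breaking current clients, the endpoint below accepts `processing_track` but falls back to auto-detection when it is omitted. The route path, response fields, and the `choose_track` helper are assumptions for illustration; existing endpoint definitions are not changed by this design.

```python
# Sketch only: route path, helper name, and response fields are hypothetical.
import uuid
from typing import Literal, Optional

from fastapi import FastAPI, File, UploadFile

app = FastAPI()


def choose_track(filename: str, requested: Optional[str]) -> str:
    """Placeholder: the real router would call the detect_track() logic from Decision 4."""
    if requested in ("ocr", "direct"):
        return requested
    return "direct" if filename.lower().endswith(".pdf") else "ocr"


@app.post("/v1/documents")
async def submit_document(
    file: UploadFile = File(...),
    processing_track: Optional[Literal["ocr", "direct"]] = None,  # omitted -> auto-detect
):
    contents = await file.read()  # handed off to the selected pipeline in the real service
    track = choose_track(file.filename or "", processing_track)
    # Existing consumers keep the same response shape; processing_track is additive.
    return {"job_id": str(uuid.uuid4()), "processing_track": track, "size_bytes": len(contents)}
```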
## Risks / Trade-offs

### Risk 1: Mixed Content Documents

**Risk**: Documents with both scanned and digital pages.

**Mitigation**:

- Page-level track detection as a fallback
- Confidence scoring to identify uncertain pages
- Manual override option via the API

### Risk 2: Direct Extraction Quality

**Risk**: Some PDFs have poor internal structure.

**Mitigation**:

- Fall back to the OCR track if extraction quality is low
- Quality metrics: text density, structure coherence
- User-reportable quality issues

### Risk 3: Memory Pressure

**Risk**: The RTX 4060's 8GB limit under concurrent requests.

**Mitigation**:

- Request queuing system
- Dynamic batch adjustment
- CPU fallback for overflow

### Trade-off 1: Processing Time vs Accuracy

- Direct extraction: fast, but depends on PDF quality
- OCR: slower, but consistent quality
- **Decision**: Prioritize speed for editable PDFs, accuracy for scanned documents

### Trade-off 2: Complexity vs Flexibility

- Two tracks increase system complexity
- But they enable optimal processing per document type
- **Decision**: Accept the complexity for 10x+ performance gains

## Migration Plan

### Phase 1: Infrastructure (Weeks 1-2)

1. Deploy the UnifiedDocument model
2. Implement DocumentTypeDetector
3. Add DirectExtractionEngine
4. Update logging and monitoring

### Phase 2: Integration (Week 3)

1. Update the OCR service with routing logic
2. Modify the PDF generator for the unified model
3. Add new API endpoints
4. Deploy to staging

### Phase 3: Validation (Week 4)

1. A/B testing with a subset of traffic
2. Performance benchmarking
3. Quality validation
4. Client integration testing

### Rollback Plan

1. Feature flag to disable dual-track processing
2. Fall back all requests to the OCR track
3. Maintain old code paths during the transition
4. Keep database migrations reversible

## Open Questions

### Resolved

- Q: Should we support page-level track mixing?
  - A: No; it adds complexity with minimal benefit. Document-level detection is sufficient.
- Q: How should Office documents be handled?
  - A: OCR track initially; consider python-docx/openpyxl later if needed.

### Pending

- Q: Which translation services should we integrate with?
  - Needs stakeholder input on cost/quality trade-offs.
- Q: Should we cache extracted text for repeated processing?
  - Depends on storage costs vs reprocessing frequency.
- Q: How should password-protected PDFs be handled?
  - May need an API parameter for passwords.

## Performance Targets

### Direct Extraction Track

- Latency: <500ms per page
- Throughput: 100+ pages/minute
- Memory: <500MB per document

### OCR Track (Optimized)

- Latency: 2-5s per page (GPU)
- Throughput: 20-30 pages/minute
- Memory: <2GB per batch

### API Response Times

- Document type detection: <100ms
- Processing initiation: <200ms
- Result retrieval: <100ms

## Technical Dependencies

### Python Packages

```python
# Direct extraction
PyMuPDF==1.23.x
pdfplumber==0.10.x  # Fallback/validation
python-magic-bin==0.4.x

# OCR enhancement
paddlepaddle-gpu==2.5.2
paddleocr==2.7.3

# Infrastructure
pydantic==2.x
fastapi==0.100+
redis==5.x  # For caching
```

### System Requirements

- CUDA 11.8+ for PaddlePaddle
- libmagic for file detection
- 16GB RAM minimum
- 50GB disk for models and cache
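A small preflight check along the lines below could verify these requirements before the service starts. The function name, the default models path, and the thresholds are assumptions that simply mirror this section; only the CUDA, libmagic, and disk checks are shown (a RAM check would need an extra dependency such as psutil).

```python
# Illustrative startup preflight check mirroring the requirements above.
# Function name, models_dir default, and thresholds are assumptions.
import shutil
from typing import List

import magic   # python-magic; needs libmagic installed on the host
import paddle


def preflight(models_dir: str = "/var/lib/ocr/models") -> List[str]:
    problems: List[str] = []
    if not paddle.is_compiled_with_cuda():
        problems.append("PaddlePaddle was built without CUDA support")
    # libmagic sanity check: identify a minimal PDF header
    if magic.from_buffer(b"%PDF-1.7\n", mime=True) != "application/pdf":
        problems.append("libmagic did not identify a PDF header correctly")
    free_gb = shutil.disk_usage(models_dir).free / 1024**3  # path must exist
    if free_gb < 50:
        problems.append(f"only {free_gb:.1f}GB free under {models_dir}; 50GB required")
    return problems
```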