OCR/openspec/changes/dual-track-document-processing/design.md

# Technical Design: Dual-track Document Processing

## Context

### Background
The current OCR tool processes all documents through PaddleOCR, even when dealing with editable PDFs that contain extractable text. This causes:
- Unnecessary processing overhead
- Potential quality degradation from re-OCRing already digital text
- Loss of precise formatting information
- Inefficient GPU usage on documents that don't need OCR

### Constraints
- RTX 4060 8GB GPU memory limitation
- Need to maintain backward compatibility with existing API
- Must support future translation features
- Should handle mixed documents (partially scanned, partially digital)

### Stakeholders
- API consumers expecting consistent JSON/PDF output
- Translation system requiring structure preservation
- Performance-sensitive deployments

## Goals / Non-Goals

### Goals
- Intelligently route documents to appropriate processing track
- Preserve document structure for translation
- Optimize GPU usage by avoiding unnecessary OCR
- Maintain unified output format across tracks
- Reduce processing time for editable PDFs by 70%+

### Non-Goals
- Implementing the actual translation engine (future phase)
- Supporting video or audio transcription
- Real-time collaborative editing
- OCR model training or fine-tuning

## Decisions

### Decision 1: Dual-track Architecture
**What**: Implement two separate processing pipelines: an OCR track and a direct extraction track

**Why**:
- Editable PDFs don't need OCR and can be processed 10-100x faster
- Direct extraction preserves exact formatting and fonts
- OCR track remains optimal for scanned documents

**Alternatives considered**:
1. **Single enhanced OCR pipeline**: Would still waste resources on editable PDFs
2. **Hybrid approach per page**: Too complex, most documents are uniformly editable or scanned
3. **Multiple specialized pipelines**: Over-engineering for current requirements

### Decision 2: UnifiedDocument Model
**What**: Create a standardized intermediate representation for both tracks

**Why**:
- Provides consistent API interface regardless of processing track
- Simplifies downstream processing (PDF generation, translation)
- Enables track switching without breaking changes

**Structure**:
```python
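from __future__ import annotations  # allow forward references such as List[Page]

from dataclasses import dataclass
from typing import Dict, List, Literal, Optional, Union

# DocumentMetadata, Dimensions, ElementType, BoundingBox, and StyleInfo
# are supporting types that are not shown here.


@dataclass
class UnifiedDocument:
    document_id: str
    metadata: DocumentMetadata
    pages: List[Page]
    processing_track: Literal["ocr", "direct"]


@dataclass
class Page:
    page_number: int
    elements: List[DocumentElement]
    dimensions: Dimensions


@dataclass
class DocumentElement:
    element_id: str
    type: ElementType  # text, table, image, header, etc.
    content: Union[str, Dict, bytes]
    bbox: BoundingBox
    style: Optional[StyleInfo] = None
    confidence: Optional[float] = None  # Only set on the OCR track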
```
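
To illustrate how a shared model keeps output uniform across tracks, here is a minimal, self-contained sketch. It uses simplified stand-ins (`SimpleElement`, `SimpleDocument`, a plain tuple for `bbox`) rather than the full types above, and the `to_json` helper is hypothetical.

```python
import json
from dataclasses import asdict, dataclass, field
from typing import List, Optional, Tuple


@dataclass
class SimpleElement:
    """Simplified stand-in for DocumentElement (illustration only)."""
    element_id: str
    type: str
    content: str
    bbox: Tuple[float, float, float, float]  # x0, y0, x1, y1
    confidence: Optional[float] = None  # stays None on the direct track


@dataclass
class SimpleDocument:
    """Simplified stand-in for UnifiedDocument (illustration only)."""
    document_id: str
    processing_track: str  # "ocr" or "direct"
    elements: List[SimpleElement] = field(default_factory=list)


def to_json(doc: SimpleDocument) -> str:
    """Serialize either track's result to the same JSON shape."""
    return json.dumps(asdict(doc), sort_keys=True)


ocr_doc = SimpleDocument("doc-1", "ocr", [
    SimpleElement("e1", "text", "Hello", (10.0, 10.0, 90.0, 24.0), confidence=0.97),
])
direct_doc = SimpleDocument("doc-2", "direct", [
    SimpleElement("e1", "text", "Hello", (10.0, 10.0, 90.0, 24.0)),
])

# Both tracks expose the same keys; only the values differ.
ocr_keys = set(json.loads(to_json(ocr_doc))["elements"][0])
direct_keys = set(json.loads(to_json(direct_doc))["elements"][0])
```

Consumers can rely on a single schema and treat a missing `confidence` as "not applicable" rather than branching on the track.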

### Decision 3: PyMuPDF for Direct Extraction
**What**: Use PyMuPDF (fitz) library for editable PDF processing

**Why**:
- Mature, well-maintained library
- Excellent coordinate preservation
- Fast C++ backend
- Supports text, tables, and image extraction with positions

**Alternatives considered**:
1. **pdfplumber**: Good but slower, less precise coordinates
2. **PyPDF2**: Limited layout information
3. **PDFMiner**: Complex API, slower performance

### Decision 4: Processing Track Auto-detection
**What**: Automatically determine optimal track based on document analysis

**Detection logic**:
```python
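from pathlib import Path
from typing import Literal

import fitz  # PyMuPDF
import magic  # python-magic

MIN_TEXT_CHARS = 100  # below this, a page is treated as scanned
SAMPLE_PAGES = 3      # only the first pages are inspected

# OFFICE_MIMES and convert_office_to_pdf are defined elsewhere in the service.


def detect_track(file_path: Path) -> Literal["ocr", "direct"]:
    file_type = magic.from_file(str(file_path), mime=True)

    if file_type.startswith("image/"):
        return "ocr"

    if file_type == "application/pdf":
        # Check whether the sampled pages have extractable text
        with fitz.open(str(file_path)) as doc:  # closes the file on exit
            for page_index in range(min(SAMPLE_PAGES, doc.page_count)):
                text = doc[page_index].get_text()
                if len(text.strip()) < MIN_TEXT_CHARS:  # Minimal text
                    return "ocr"
        return "direct"

    if file_type in OFFICE_MIMES:
        # Convert Office to PDF first, then analyze
        pdf_path = convert_office_to_pdf(file_path)
        return detect_track(pdf_path)  # Recursive call on PDF

    return "ocr"  # Default fallback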
```

**Office Document Processing Strategy**:
1. Convert Office files (Word, PPT, Excel) to PDF using LibreOffice
2. Analyze the resulting PDF for text extractability
3. Route based on PDF analysis:
   - Text-based PDF → Direct track (faster, more accurate)
   - Image-based PDF → OCR track (for scanned content in Office docs)

This approach ensures:
- Consistent processing pipeline (all documents become PDF first)
- Optimal routing based on actual content
- Significant performance improvement for editable Office documents
- Better layout preservation (no OCR errors on text content)
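
The conversion step could look roughly like the sketch below, which shells out to LibreOffice in headless mode. The binary name `soffice`, the `build_convert_command` helper, the optional `out_dir` parameter, and the timeout are assumptions for illustration, not part of the existing service.

```python
import shutil
import subprocess
import tempfile
from pathlib import Path
from typing import List, Optional


def build_convert_command(src: Path, out_dir: Path, soffice: str = "soffice") -> List[str]:
    """Build the headless LibreOffice command that converts `src` to PDF."""
    return [soffice, "--headless", "--convert-to", "pdf",
            "--outdir", str(out_dir), str(src)]


def convert_office_to_pdf(src: Path, out_dir: Optional[Path] = None,
                          timeout_s: int = 120) -> Path:
    """Convert an Office file to PDF and return the path of the result."""
    if shutil.which("soffice") is None:
        raise RuntimeError("LibreOffice (soffice) was not found on PATH")
    # Assumption: a fresh temporary directory is acceptable when none is given.
    out_dir = out_dir or Path(tempfile.mkdtemp(prefix="office2pdf-"))
    out_dir.mkdir(parents=True, exist_ok=True)
    subprocess.run(build_convert_command(src, out_dir), check=True, timeout=timeout_s)
    pdf_path = out_dir / (src.stem + ".pdf")
    if not pdf_path.exists():
        raise RuntimeError(f"LibreOffice did not produce {pdf_path}")
    return pdf_path
```

Keeping command construction in its own function makes the step testable without LibreOffice installed.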

### Decision 5: GPU Memory Management
**What**: Implement dynamic batch sizing and model caching for RTX 4060 8GB

**Why**:
- Prevents OOM errors
- Maximizes throughput
- Enables concurrent request handling

**Strategy**:
```python
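from functools import lru_cache

# calculate_batch_size, get_gpu_memory, load_model, and MODEL_MEMORY_REQUIREMENTS
# are service-level helpers not shown here; `image` is the page being processed.

# Adaptive batch sizing based on available memory
batch_size = calculate_batch_size(
    available_memory=get_gpu_memory(),
    image_size=image.shape,
    model_size=MODEL_MEMORY_REQUIREMENTS
)

# Model caching to avoid reload overhead (keeps at most two models loaded)
@lru_cache(maxsize=2)
def get_model(model_type: str):
    return load_model(model_type)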
```
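
One possible shape for `calculate_batch_size` is sketched below. The byte-based units, the `bytes_per_value`, `overhead_factor`, and `max_batch` parameters, and their default values are assumptions chosen for illustration; real values would need to be measured against the deployed models.

```python
from typing import Tuple


def calculate_batch_size(
    available_memory: int,
    image_size: Tuple[int, int, int],
    model_size: int,
    bytes_per_value: int = 4,
    overhead_factor: float = 3.0,
    max_batch: int = 16,
) -> int:
    """Estimate how many images fit in GPU memory alongside the model.

    available_memory and model_size are in bytes; image_size is (H, W, C).
    overhead_factor is an assumed multiplier for intermediate activations.
    Returns 0 when not even one image fits.
    """
    height, width, channels = image_size
    per_image = height * width * channels * bytes_per_value * overhead_factor
    free_for_batch = available_memory - model_size
    if free_for_batch <= 0:
        return 0  # caller should queue the request or fall back to CPU
    return min(max_batch, int(free_for_batch // per_image))


GIB = 1024 ** 3
MIB = 1024 ** 2
roomy = calculate_batch_size(6 * GIB, (1024, 1024, 3), 2 * GIB)
tight = calculate_batch_size(2 * GIB + 100 * MIB, (1024, 1024, 3), 2 * GIB)
```

Returning 0 instead of forcing a batch of 1 lets the caller decide between queuing and CPU fallback.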

### Decision 6: Backward Compatibility
**What**: Maintain existing API while adding new capabilities

**How**:
- Existing endpoints continue working unchanged
- New `processing_track` parameter is optional
- Output format compatible with current consumers
- Gradual migration path for clients

## Risks / Trade-offs

### Risk 1: Mixed Content Documents
**Risk**: Documents with both scanned and digital pages
**Mitigation**:
- Page-level track detection as fallback
- Confidence scoring to identify uncertain pages
- Manual override option via API
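
Page-level detection could be sketched as follows, working only from per-page character counts so it stays independent of the extraction library. The function names and the `"mixed"` label are hypothetical; the 100-character threshold mirrors the detection logic in Decision 4.

```python
from typing import List

MIN_CHARS_PER_PAGE = 100  # same threshold as the document-level detection


def classify_pages(page_char_counts: List[int]) -> List[str]:
    """Label each page by whether it has enough extractable text."""
    return ["direct" if n >= MIN_CHARS_PER_PAGE else "ocr" for n in page_char_counts]


def summarize_document(page_char_counts: List[int]) -> str:
    """Return "direct", "ocr", or "mixed" for the whole document."""
    tracks = classify_pages(page_char_counts)
    if not tracks:
        return "ocr"  # nothing extractable, use the default fallback
    ocr_pages = tracks.count("ocr")
    if ocr_pages == 0:
        return "direct"
    if ocr_pages == len(tracks):
        return "ocr"
    return "mixed"
```

A `"mixed"` result is the signal for the fallback path or for surfacing the manual override to the caller.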

### Risk 2: Direct Extraction Quality
**Risk**: Some PDFs have poor internal structure
**Mitigation**:
- Fallback to OCR track if extraction quality is low
- Quality metrics: text density, structure coherence
- User-reportable quality issues
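
As a rough illustration of a quality gate, the sketch below combines a minimum text length with the share of clean characters. The function names and both thresholds are assumptions, and this is only a proxy for text density; it does not measure structure coherence.

```python
def printable_ratio(text: str) -> float:
    """Share of characters that are printable or whitespace.

    The Unicode replacement character (U+FFFD) counts as bad because it
    usually signals a broken font encoding in the PDF.
    """
    if not text:
        return 0.0
    ok = sum(
        1 for ch in text
        if ch != "\ufffd" and (ch.isprintable() or ch.isspace())
    )
    return ok / len(text)


def should_fall_back_to_ocr(text: str, min_chars: int = 100,
                            min_printable: float = 0.9) -> bool:
    """Return True when extracted text is too short or too garbled to trust."""
    stripped = text.strip()
    return len(stripped) < min_chars or printable_ratio(stripped) < min_printable
```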

### Risk 3: Memory Pressure
**Risk**: RTX 4060 8GB limitation with concurrent requests
**Mitigation**:
- Request queuing system
- Dynamic batch adjustment
- CPU fallback for overflow
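
The queuing and overflow behaviour could be sketched with an `asyncio.Semaphore`, as below. The `GpuSlots` class, its limits, and the job callables are hypothetical; a production queue would more likely live in Redis or a task broker.

```python
import asyncio
from typing import Awaitable, Callable


class GpuSlots:
    """Limit concurrent GPU jobs; overflow goes to a CPU fallback."""

    def __init__(self, max_gpu_jobs: int = 1, max_waiting: int = 4) -> None:
        self._sem = asyncio.Semaphore(max_gpu_jobs)
        self._waiting = 0
        self._max_waiting = max_waiting

    async def run(self, gpu_job: Callable[[], Awaitable[str]],
                  cpu_job: Callable[[], Awaitable[str]]) -> str:
        if self._waiting >= self._max_waiting:
            return await cpu_job()  # queue is full, do not wait for the GPU
        self._waiting += 1
        try:
            await self._sem.acquire()
        finally:
            self._waiting -= 1
        try:
            return await gpu_job()
        finally:
            self._sem.release()


async def demo() -> list:
    # Created inside the coroutine so the semaphore binds to the running loop.
    slots = GpuSlots(max_gpu_jobs=1, max_waiting=1)

    async def gpu_job() -> str:
        await asyncio.sleep(0.01)  # stands in for OCR inference
        return "gpu"

    async def cpu_job() -> str:
        return "cpu"

    return await asyncio.gather(*(slots.run(gpu_job, cpu_job) for _ in range(3)))
```

With one GPU slot and one waiting slot, the third concurrent request in `demo` is served by the CPU path instead of growing the queue.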

### Trade-off 1: Processing Time vs Accuracy
- Direct extraction: Fast but depends on PDF quality
- OCR: Slower but consistent quality
- **Decision**: Prioritize speed for editable PDFs and accuracy for scanned documents

### Trade-off 2: Complexity vs Flexibility
- Two tracks increase system complexity
- But enable optimal processing per document type
- **Decision**: Accept complexity for 10x+ performance gains

## Migration Plan

### Phase 1: Infrastructure (Week 1-2)
1. Deploy UnifiedDocument model
2. Implement DocumentTypeDetector
3. Add DirectExtractionEngine
4. Update logging and monitoring

### Phase 2: Integration (Week 3)
1. Update OCR service with routing logic
2. Modify PDF generator for unified model
3. Add new API endpoints
4. Deploy to staging

### Phase 3: Validation (Week 4)
1. A/B testing with subset of traffic
2. Performance benchmarking
3. Quality validation
4. Client integration testing

### Rollback Plan
1. Feature flag to disable dual-track
2. Fallback all requests to OCR track
3. Maintain old code paths during transition
4. Keep database migrations reversible
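
The first two rollback steps could be wired together as in the sketch below. The environment variable name `DUAL_TRACK_ENABLED`, its default, and the helper names are assumptions for illustration.

```python
import os
from typing import Mapping, Optional


def dual_track_enabled(env: Optional[Mapping[str, str]] = None) -> bool:
    """Read the feature flag; the flag name and default are hypothetical."""
    env = os.environ if env is None else env
    value = env.get("DUAL_TRACK_ENABLED", "true")
    return value.strip().lower() in ("1", "true", "yes", "on")


def select_track(detected: str, env: Optional[Mapping[str, str]] = None) -> str:
    """Route everything to the OCR track when dual-track is switched off."""
    return detected if dual_track_enabled(env) else "ocr"
```

Flipping the flag restores the previous behaviour without a redeploy, since every request falls back to the OCR track.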

## Open Questions

### Resolved
- Q: Should we support page-level track mixing?
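  - A: No, not as the default. It adds complexity with minimal benefit, and document-level routing is sufficient. Page-level detection remains only as a fallback for mixed documents (see Risk 1).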

- Q: How to handle Office documents?
  - A: Convert to PDF using LibreOffice, then analyze the PDF for text extractability.
    - Text-based PDF → Direct track (editable Office docs produce text PDFs)
    - Image-based PDF → OCR track (rare case of scanned content in Office)
  - This approach provides:
    - 10x+ faster processing for typical Office documents
    - Better layout preservation (no OCR errors)
    - Consistent pipeline (all documents normalized to PDF first)

### Pending
- Q: What translation services to integrate with?
  - Needs stakeholder input on cost/quality trade-offs

- Q: Should we cache extracted text for repeated processing?
  - Depends on storage costs vs reprocessing frequency

- Q: How to handle password-protected PDFs?
  - May need API parameter for passwords

## Performance Targets

### Direct Extraction Track
- Latency: <500ms per page
- Throughput: 100+ pages/minute
- Memory: <500MB per document

### OCR Track (Optimized)
- Latency: 2-5s per page (GPU)
- Throughput: 20-30 pages/minute
- Memory: <2GB per batch

### API Response Times
- Document type detection: <100ms
- Processing initiation: <200ms
- Result retrieval: <100ms
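
The latency and throughput targets can be cross-checked with simple arithmetic. The sketch below assumes pages are processed one at a time per worker, which is an assumption; batching or parallel workers would raise throughput for the same per-page latency.

```python
def pages_per_minute(seconds_per_page: float, workers: int = 1) -> float:
    """Throughput implied by a per-page latency, assuming sequential pages."""
    return 60.0 / seconds_per_page * workers


direct_track = pages_per_minute(0.5)  # at the 500 ms latency ceiling
ocr_fast = pages_per_minute(2.0)      # best-case OCR latency
ocr_slow = pages_per_minute(5.0)      # worst-case OCR latency
```

Under this assumption the direct track's latency ceiling comfortably covers its 100+ pages/minute target, while the OCR track reaches 20-30 pages/minute only when per-page latency stays at or below 3 seconds or when pages are batched.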

## Technical Dependencies

### Python Packages
```python
# Direct extraction
PyMuPDF==1.23.*
pdfplumber==0.10.*  # Fallback/validation
python-magic-bin==0.4.*

# OCR enhancement
paddlepaddle-gpu==2.5.2
paddleocr==2.7.3

# Infrastructure
pydantic==2.*
fastapi>=0.100
redis==5.*  # For caching
```

### System Requirements
- CUDA 11.8+ for PaddlePaddle
- libmagic for file detection
- 16GB RAM minimum
- 50GB disk for models and cache