Technical Design: Dual-track Document Processing

Context

Background

The current OCR tool processes all documents through PaddleOCR, even when dealing with editable PDFs that contain extractable text. This causes:

  • Unnecessary processing overhead
  • Potential quality degradation from re-OCRing already digital text
  • Loss of precise formatting information
  • Inefficient GPU usage on documents that don't need OCR

Constraints

  • RTX 4060 8GB GPU memory limitation
  • Need to maintain backward compatibility with existing API
  • Must support future translation features
  • Should handle mixed documents (partially scanned, partially digital)

Stakeholders

  • API consumers expecting consistent JSON/PDF output
  • Translation system requiring structure preservation
  • Performance-sensitive deployments

Goals / Non-Goals

Goals

  • Intelligently route documents to appropriate processing track
  • Preserve document structure for translation
  • Optimize GPU usage by avoiding unnecessary OCR
  • Maintain unified output format across tracks
  • Reduce processing time for editable PDFs by 70%+

Non-Goals

  • Implementing the actual translation engine (future phase)
  • Supporting video or audio transcription
  • Real-time collaborative editing
  • OCR model training or fine-tuning

Decisions

Decision 1: Dual-track Architecture

What: Implement two separate processing pipelines - OCR track and Direct extraction track

Why:

  • Editable PDFs don't need OCR and can be processed 10-100x faster
  • Direct extraction preserves exact formatting and fonts
  • OCR track remains optimal for scanned documents

Alternatives considered:

  1. Single enhanced OCR pipeline: Would still waste resources on editable PDFs
  2. Hybrid approach per page: Too complex; most documents are uniformly editable or scanned
  3. Multiple specialized pipelines: Over-engineering for current requirements

Decision 2: UnifiedDocument Model

What: Create a standardized intermediate representation for both tracks

Why:

  • Provides consistent API interface regardless of processing track
  • Simplifies downstream processing (PDF generation, translation)
  • Enables track switching without breaking changes

Structure:

from dataclasses import dataclass
from typing import Dict, List, Literal, Optional, Union

# DocumentMetadata, Dimensions, BoundingBox, StyleInfo and ElementType
# are companion models defined alongside these classes

@dataclass
class UnifiedDocument:
    document_id: str
    metadata: DocumentMetadata
    pages: List[Page]
    processing_track: Literal["ocr", "direct"]

@dataclass
class Page:
    page_number: int
    elements: List[DocumentElement]
    dimensions: Dimensions

@dataclass
class DocumentElement:
    element_id: str
    type: ElementType  # text, table, image, header, etc.
    content: Union[str, Dict, bytes]
    bbox: BoundingBox
    style: Optional[StyleInfo]
    confidence: Optional[float]  # Only for OCR track

Decision 3: PyMuPDF for Direct Extraction

What: Use PyMuPDF (fitz) library for editable PDF processing

Why:

  • Mature, well-maintained library
  • Excellent coordinate preservation
  • Fast C++ backend
  • Supports text, tables, and image extraction with positions

Alternatives considered:

  1. pdfplumber: Good but slower, less precise coordinates
  2. PyPDF2: Limited layout information
  3. PDFMiner: Complex API, slower performance
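
As an illustration of what direct extraction looks like with PyMuPDF, here is a sketch that pulls span-level text with coordinates and font info; the span granularity and dict shape are illustrative, and mapping spans into DocumentElement objects is left to the extraction engine:

import fitz  # PyMuPDF

def extract_text_spans(page: "fitz.Page") -> list:
    """Extract text spans with positions and styles from one page."""
    spans = []
    # "dict" mode yields blocks -> lines -> spans with bboxes and fonts
    for block in page.get_text("dict")["blocks"]:
        if block["type"] != 0:  # 0 = text block, 1 = image block
            continue
        for line in block["lines"]:
            for span in line["spans"]:
                spans.append({
                    "text": span["text"],
                    "bbox": span["bbox"],  # (x0, y0, x1, y1) in points
                    "font": span["font"],
                    "size": span["size"],
                })
    return spans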

Decision 4: Processing Track Auto-detection

What: Automatically determine optimal track based on document analysis

Detection logic:

from pathlib import Path

import fitz  # PyMuPDF
import magic  # python-magic; requires libmagic

# Office formats stay on the OCR track for now (see Open Questions)
OFFICE_MIMES = {
    "application/msword",
    "application/vnd.openxmlformats-officedocument.wordprocessingml.document",
    "application/vnd.ms-excel",
    "application/vnd.openxmlformats-officedocument.spreadsheetml.sheet",
}

def detect_track(file_path: Path) -> str:
    file_type = magic.from_file(str(file_path), mime=True)

    if file_type.startswith('image/'):
        return "ocr"

    if file_type == 'application/pdf':
        # Check if PDF has extractable text
        with fitz.open(file_path) as doc:
            for page_number in range(min(3, doc.page_count)):  # Sample first 3 pages
                text = doc[page_number].get_text()
                if len(text.strip()) < 100:  # Minimal text -> likely scanned
                    return "ocr"
        return "direct"

    if file_type in OFFICE_MIMES:
        return "ocr"  # For now, may add direct Office support later

    return "ocr"  # Default fallback

Decision 5: GPU Memory Management

What: Implement dynamic batch sizing and model caching for RTX 4060 8GB

Why:

  • Prevents OOM errors
  • Maximizes throughput
  • Enables concurrent request handling

Strategy:

from functools import lru_cache

# Adaptive batch sizing based on available memory
# (calculate_batch_size and get_gpu_memory are project helpers; one
# possible shape is sketched below)
batch_size = calculate_batch_size(
    available_memory=get_gpu_memory(),
    image_size=image.shape,
    model_size=MODEL_MEMORY_REQUIREMENTS,
)

# Model caching to avoid reload overhead
@lru_cache(maxsize=2)
def get_model(model_type: str):
    return load_model(model_type)
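
The helpers above are not defined in this design. A minimal sketch of one possible shape, assuming pynvml is available for the memory query; the per-image headroom factor is a rough assumption, not a measured value:

import pynvml

def get_gpu_memory(device_index: int = 0) -> int:
    """Free GPU memory in bytes on the given device."""
    pynvml.nvmlInit()
    try:
        handle = pynvml.nvmlDeviceGetHandleByIndex(device_index)
        return pynvml.nvmlDeviceGetMemoryInfo(handle).free
    finally:
        pynvml.nvmlShutdown()

def calculate_batch_size(available_memory: int, image_size, model_size: int,
                         safety_margin: float = 0.8) -> int:
    """Estimate how many images fit after reserving model memory."""
    height, width = image_size[:2]
    per_image = height * width * 3 * 4 * 10  # float32, ~10x input for activations
    usable = int(available_memory * safety_margin) - model_size
    return max(1, usable // per_image)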

Decision 6: Backward Compatibility

What: Maintain existing API while adding new capabilities

How:

  • Existing endpoints continue working unchanged
  • New processing_track parameter is optional
  • Output format compatible with current consumers
  • Gradual migration path for clients
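
As a sketch of how the optional parameter could be exposed in FastAPI without breaking existing clients (the route and parameter names here are illustrative, not the service's actual API):

from typing import Literal, Optional

from fastapi import FastAPI, File, UploadFile

app = FastAPI()

@app.post("/process")  # hypothetical route; existing paths stay unchanged
async def process_document(
    file: UploadFile = File(...),
    processing_track: Optional[Literal["ocr", "direct", "auto"]] = None,
):
    # Omitting the parameter keeps today's behavior: auto-detection
    track = processing_track or "auto"
    return {"filename": file.filename, "processing_track": track}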

Risks / Trade-offs

Risk 1: Mixed Content Documents

Risk: Documents with both scanned and digital pages

Mitigation:

  • Page-level track detection as fallback
  • Confidence scoring to identify uncertain pages
  • Manual override option via API
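
A minimal sketch of the page-level detection fallback, reusing the same minimal-text heuristic and 100-character threshold as Decision 4:

import fitz  # PyMuPDF

def classify_pages(file_path) -> list:
    """Per-page fallback for mixed documents: tag each page 'direct' or 'ocr'."""
    tracks = []
    with fitz.open(file_path) as doc:
        for page in doc:
            # Same minimal-text heuristic as document-level detection
            has_text = len(page.get_text().strip()) >= 100
            tracks.append("direct" if has_text else "ocr")
    return tracks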

Risk 2: Direct Extraction Quality

Risk: Some PDFs have poor internal structure

Mitigation:

  • Fallback to OCR track if extraction quality is low
  • Quality metrics: text density, structure coherence
  • User-reportable quality issues
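
One way to express the text-density metric as a quality gate; the threshold is an assumption that would need tuning against real documents:

import fitz  # PyMuPDF

def extraction_quality_ok(page: "fitz.Page") -> bool:
    """Fall back to OCR when extracted text is too sparse for the page area."""
    text = page.get_text()
    area = page.rect.width * page.rect.height  # in PDF points
    density = len(text.strip()) / max(area, 1)
    # ~1,000 chars on an A4 page; assumed cutoff, tune empirically
    return density > 0.002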

Risk 3: Memory Pressure

Risk: RTX 4060 8GB limitation with concurrent requests

Mitigation:

  • Request queuing system
  • Dynamic batch adjustment
  • CPU fallback for overflow
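
A minimal sketch of the queuing idea using an asyncio semaphore; the limit of two concurrent GPU jobs is an assumption for 8GB of VRAM, not a benchmarked number:

import asyncio

# At most two OCR jobs hold the GPU at once; the rest wait in line
gpu_slots = asyncio.Semaphore(2)

async def run_ocr_job(process_fn, document):
    async with gpu_slots:
        # Run the blocking OCR call in a worker thread so the event
        # loop keeps accepting requests
        return await asyncio.to_thread(process_fn, document)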

Trade-off 1: Processing Time vs Accuracy

  • Direct extraction: Fast but depends on PDF quality
  • OCR: Slower but consistent quality
  • Decision: Prioritize speed for editable PDFs, accuracy for scanned documents

Trade-off 2: Complexity vs Flexibility

  • Two tracks increase system complexity
  • But enable optimal processing per document type
  • Decision: Accept complexity for 10x+ performance gains

Migration Plan

Phase 1: Infrastructure (Week 1-2)

  1. Deploy UnifiedDocument model
  2. Implement DocumentTypeDetector
  3. Add DirectExtractionEngine
  4. Update logging and monitoring

Phase 2: Integration (Week 3)

  1. Update OCR service with routing logic
  2. Modify PDF generator for unified model
  3. Add new API endpoints
  4. Deploy to staging

Phase 3: Validation (Week 4)

  1. A/B testing with subset of traffic
  2. Performance benchmarking
  3. Quality validation
  4. Client integration testing

Rollback Plan

  1. Feature flag to disable dual-track
  2. Fallback all requests to OCR track
  3. Maintain old code paths during transition
  4. Database migration reversible

Open Questions

Resolved

  • Q: Should we support page-level track mixing?

    • A: No, adds complexity with minimal benefit. Document-level is sufficient.
  • Q: How to handle Office documents?

    • A: OCR track initially, consider python-docx/openpyxl later if needed.

Pending

  • Q: What translation services to integrate with?

    • Needs stakeholder input on cost/quality trade-offs
  • Q: Should we cache extracted text for repeated processing?

    • Depends on storage costs vs reprocessing frequency
  • Q: How to handle password-protected PDFs?

    • May need API parameter for passwords

Performance Targets

Direct Extraction Track

  • Latency: <500ms per page
  • Throughput: 100+ pages/minute
  • Memory: <500MB per document

OCR Track (Optimized)

  • Latency: 2-5s per page (GPU)
  • Throughput: 20-30 pages/minute
  • Memory: <2GB per batch

API Response Times

  • Document type detection: <100ms
  • Processing initiation: <200ms
  • Result retrieval: <100ms

Technical Dependencies

Python Packages

# Direct extraction
PyMuPDF~=1.23.0         # fitz
pdfplumber~=0.10.0      # Fallback/validation
python-magic-bin~=0.4.0

# OCR enhancement
paddlepaddle-gpu==2.5.2
paddleocr==2.7.3

# Infrastructure
pydantic~=2.0           # i.e. any 2.x
fastapi>=0.100
redis~=5.0              # For caching

System Requirements

  • CUDA 11.8+ for PaddlePaddle
  • libmagic for file detection
  • 16GB RAM minimum
  • 50GB disk for models and cache