egg 0974fc3a54 fix: resolve E2E test failures and add Office direct extraction design

- Fix MySQL connection timeout by creating fresh DB session after OCR
- Fix /analyze endpoint attribute errors (detect vs analyze, metadata)
- Add processing_track field extraction to TaskDetailResponse
- Update E2E tests to use POST for /analyze endpoint
- Increase Office document timeout to 300s
- Add Section 2.4 tasks for Office document direct extraction
- Document Office → PDF → Direct track strategy in design.md

🤖 Generated with [Claude Code](https://claude.com/claude-code)

Co-Authored-By: Claude <noreply@anthropic.com>

2025-11-20 12:13:18 +08:00

Technical Design: Dual-track Document Processing

Context

Background

The current OCR tool processes all documents through PaddleOCR, even when dealing with editable PDFs that contain extractable text. This causes:

Unnecessary processing overhead
Potential quality degradation from re-OCRing already digital text
Loss of precise formatting information
Inefficient GPU usage on documents that don't need OCR

Constraints

RTX 4060 8GB GPU memory limitation
Need to maintain backward compatibility with existing API
Must support future translation features
Should handle mixed documents (partially scanned, partially digital)

Stakeholders

API consumers expecting consistent JSON/PDF output
Translation system requiring structure preservation
Performance-sensitive deployments

Goals / Non-Goals

Goals

Intelligently route documents to appropriate processing track
Preserve document structure for translation
Optimize GPU usage by avoiding unnecessary OCR
Maintain unified output format across tracks
Reduce processing time for editable PDFs by 70%+

Non-Goals

Implementing the actual translation engine (future phase)
Supporting video or audio transcription
Real-time collaborative editing
OCR model training or fine-tuning

Decisions

Decision 1: Dual-track Architecture

What: Implement two separate processing pipelines - OCR track and Direct extraction track

Why:

Editable PDFs don't need OCR and can be processed 10-100x faster
Direct extraction preserves exact formatting and fonts
OCR track remains optimal for scanned documents

Alternatives considered:

Single enhanced OCR pipeline: Would still waste resources on editable PDFs
Hybrid approach per page: Too complex; most documents are uniformly editable or uniformly scanned
Multiple specialized pipelines: Over-engineering for current requirements

Decision 2: UnifiedDocument Model

What: Create a standardized intermediate representation for both tracks

Why:

Provides consistent API interface regardless of processing track
Simplifies downstream processing (PDF generation, translation)
Enables track switching without breaking changes

Structure:

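from __future__ import annotations  # allows Page/DocumentElement to be referenced before definition

from dataclasses import dataclass
from typing import Dict, List, Literal, Optional, Union

# DocumentMetadata, Dimensions, ElementType, BoundingBox and StyleInfo
# are supporting types not shown in this document.

@dataclass
class UnifiedDocument:
    document_id: str
    metadata: DocumentMetadata
    pages: List[Page]
    processing_track: Literal["ocr", "direct"]

@dataclass
class Page:
    page_number: int
    elements: List[DocumentElement]
    dimensions: Dimensions

@dataclass
class DocumentElement:
    element_id: str
    type: ElementType  # text, table, image, header, etc.
    content: Union[str, Dict, bytes]
    bbox: BoundingBox
    style: Optional[StyleInfo]
    confidence: Optional[float]  # Only for OCR track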
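
To illustrate why a shared model keeps the output format consistent, here is a minimal sketch in which a Direct-track result and an OCR-track result serialize to the same JSON shape. The supporting types are simplified, hypothetical stand-ins (`BoundingBox` as four floats, `type` as a plain string, no `metadata`, `style`, or `dimensions`); the real definitions may differ.

```python
from __future__ import annotations

import json
from dataclasses import asdict, dataclass, field
from typing import List, Literal, Optional


# Hypothetical, simplified stand-ins for the supporting types.
@dataclass
class BoundingBox:
    x0: float
    y0: float
    x1: float
    y1: float


@dataclass
class DocumentElement:
    element_id: str
    type: str  # e.g. "text", "table", "image"
    content: str
    bbox: BoundingBox
    confidence: Optional[float] = None  # set only by the OCR track


@dataclass
class Page:
    page_number: int
    elements: List[DocumentElement] = field(default_factory=list)


@dataclass
class UnifiedDocument:
    document_id: str
    pages: List[Page]
    processing_track: Literal["ocr", "direct"]


def to_json(doc: UnifiedDocument) -> str:
    """Serialize a result from either track into the same JSON shape."""
    return json.dumps(asdict(doc), ensure_ascii=False)


direct_doc = UnifiedDocument(
    document_id="doc-1",
    pages=[Page(1, [DocumentElement("e1", "text", "Hello", BoundingBox(72, 72, 200, 90))])],
    processing_track="direct",
)
ocr_doc = UnifiedDocument(
    document_id="doc-2",
    pages=[Page(1, [DocumentElement("e1", "text", "Hello", BoundingBox(70, 71, 201, 92), confidence=0.97)])],
    processing_track="ocr",
)
```

Both documents expose the same keys; only `processing_track` and the optional `confidence` values differ, which is what lets downstream consumers ignore the track.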

Decision 3: PyMuPDF for Direct Extraction

What: Use PyMuPDF (fitz) library for editable PDF processing

Why:

Mature, well-maintained library
Excellent coordinate preservation
Fast C++ backend
Supports text, tables, and image extraction with positions

Alternatives considered:

pdfplumber: Good but slower, less precise coordinates
PyPDF2: Limited layout information
PDFMiner: Complex API, slower performance
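
As a rough sketch of what direct extraction with positions could look like, the function below flattens the dictionary returned by PyMuPDF's `page.get_text("dict")` into simple element records. The record layout and the names `text_blocks_to_elements` and `extract_page_elements` are assumptions for illustration, not the project's actual engine.

```python
from typing import Any, Dict, List


def text_blocks_to_elements(page_dict: Dict[str, Any]) -> List[Dict[str, Any]]:
    """Flatten the output of page.get_text("dict") into simple text element records."""
    elements: List[Dict[str, Any]] = []
    for index, block in enumerate(page_dict.get("blocks", [])):
        if block.get("type") != 0:  # 0 = text block, 1 = image block
            continue
        # Join spans within a line, then lines within the block.
        text = "\n".join(
            "".join(span["text"] for span in line["spans"])
            for line in block["lines"]
        )
        if not text.strip():
            continue
        spans = [span for line in block["lines"] for span in line["spans"]]
        elements.append({
            "element_id": f"block-{index}",  # hypothetical ID scheme
            "type": "text",
            "content": text,
            "bbox": tuple(block["bbox"]),  # (x0, y0, x1, y1) in PDF points
            "font": spans[0]["font"],      # style of the first span only
            "size": spans[0]["size"],
        })
    return elements


def extract_page_elements(pdf_path: str, page_number: int = 0) -> List[Dict[str, Any]]:
    """Read one page with PyMuPDF and convert its text blocks."""
    import fitz  # PyMuPDF; imported lazily so the mapping above works without it

    with fitz.open(pdf_path) as doc:
        return text_blocks_to_elements(doc.load_page(page_number).get_text("dict"))
```

Taking the style from the first span is a simplification; a block with mixed fonts would need per-span style records.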

Decision 4: Processing Track Auto-detection

What: Automatically determine optimal track based on document analysis

Detection logic:

from pathlib import Path

import fitz  # PyMuPDF
import magic

def detect_track(file_path: Path) -> str:
    """Return "ocr" or "direct" for the given file."""
    file_type = magic.from_file(str(file_path), mime=True)

    if file_type.startswith('image/'):
        return "ocr"

    if file_type == 'application/pdf':
        # Check if PDF has extractable text
        with fitz.open(str(file_path)) as doc:
            # Sample first 3 pages (fewer if the document is shorter)
            for page_number in range(min(3, doc.page_count)):
                text = doc.load_page(page_number).get_text()
                if len(text.strip()) < 100:  # Minimal text
                    return "ocr"
        return "direct"

    if file_type in OFFICE_MIMES:
        # Convert Office to PDF first, then analyze
        pdf_path = convert_office_to_pdf(file_path)
        return detect_track(pdf_path)  # Recursive call on PDF

    return "ocr"  # Default fallback

Office Document Processing Strategy:

Convert Office files (Word, PPT, Excel) to PDF using LibreOffice
Analyze the resulting PDF for text extractability
Route based on PDF analysis:
- Text-based PDF → Direct track (faster, more accurate)
- Image-based PDF → OCR track (for scanned content in Office docs)

This approach ensures:

Consistent processing pipeline (all documents become PDF first)
Optimal routing based on actual content
Significant performance improvement for editable Office documents
Better layout preservation (no OCR errors on text content)
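
The conversion step could be sketched as follows, using LibreOffice's headless `--convert-to pdf` mode. The optional `out_dir` parameter, the `soffice` binary name on `PATH`, and the 300-second default timeout are assumptions for illustration; the helper `build_convert_command` is hypothetical.

```python
import subprocess
import tempfile
from pathlib import Path
from typing import List, Optional


def build_convert_command(source: Path, out_dir: Path, binary: str = "soffice") -> List[str]:
    """Build the LibreOffice command line for a headless PDF conversion."""
    return [
        binary,
        "--headless",
        "--convert-to", "pdf",
        "--outdir", str(out_dir),
        str(source),
    ]


def convert_office_to_pdf(
    source: Path,
    out_dir: Optional[Path] = None,
    timeout_s: int = 300,  # assumed default; tune per deployment
) -> Path:
    """Convert an Office file to PDF and return the path of the result."""
    if out_dir is None:
        out_dir = Path(tempfile.mkdtemp(prefix="office2pdf-"))
    subprocess.run(
        build_convert_command(source, out_dir),
        check=True,           # raise CalledProcessError on a non-zero exit code
        timeout=timeout_s,    # raise TimeoutExpired if LibreOffice hangs
        capture_output=True,
    )
    # LibreOffice writes <input stem>.pdf into the output directory.
    pdf_path = out_dir / (source.stem + ".pdf")
    if not pdf_path.exists():
        raise RuntimeError(f"LibreOffice produced no PDF for {source}")
    return pdf_path
```

Checking that the output file exists is deliberate: LibreOffice can exit successfully without writing a PDF when it cannot open the input.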

Decision 5: GPU Memory Management

What: Implement dynamic batch sizing and model caching for RTX 4060 8GB

Why:

Prevents OOM errors
Maximizes throughput
Enables concurrent request handling

Strategy:

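from functools import lru_cache

# Adaptive batch sizing based on available memory
# (calculate_batch_size, get_gpu_memory and MODEL_MEMORY_REQUIREMENTS
# are not defined in this document)
batch_size = calculate_batch_size(
    available_memory=get_gpu_memory(),
    image_size=image.shape,
    model_size=MODEL_MEMORY_REQUIREMENTS
)

# Model caching to avoid reload overhead; at most two models stay loaded
@lru_cache(maxsize=2)
def get_model(model_type: str):
    return load_model(model_type)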
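
The document names `calculate_batch_size` without defining it. One possible shape, as a sketch only: estimate the per-image footprint from the image dimensions and divide the memory left after loading the model. The `bytes_per_value`, `overhead_factor`, and `max_batch` parameters are assumed tuning knobs, not measured values.

```python
from typing import Tuple


def calculate_batch_size(
    available_memory: int,            # free GPU memory in bytes
    image_size: Tuple[int, int, int],  # (height, width, channels)
    model_size: int,                  # model footprint in bytes
    bytes_per_value: int = 4,         # float32 tensors (assumption)
    overhead_factor: float = 3.0,     # rough allowance for activations (assumption)
    max_batch: int = 16,              # upper bound regardless of free memory (assumption)
) -> int:
    """Estimate how many images fit in GPU memory alongside the model."""
    height, width, channels = image_size
    per_image = int(height * width * channels * bytes_per_value * overhead_factor)
    usable = available_memory - model_size
    if usable < per_image:
        # Not even one image fits by this estimate; process one at a time
        # and let the caller decide whether to fall back to CPU.
        return 1
    return max(1, min(max_batch, usable // per_image))
```

The estimate is intentionally conservative; real activation memory depends on the model, so the overhead factor should be calibrated against observed peak usage.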

Decision 6: Backward Compatibility

What: Maintain existing API while adding new capabilities

How:

Existing endpoints continue working unchanged
New processing_track parameter is optional
Output format compatible with current consumers
Gradual migration path for clients
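
A minimal sketch of how an optional `processing_track` parameter can stay backward compatible: when a client omits it, the auto-detected track is used, so existing requests behave as before. The function name `resolve_track` and the `"auto"` value are assumptions for illustration.

```python
from typing import Optional

VALID_TRACKS = ("ocr", "direct")


def resolve_track(requested: Optional[str], detected: str) -> str:
    """Pick the track to run: an explicit client choice wins, otherwise auto-detection."""
    if requested is None or requested == "auto":
        # Existing clients never send the parameter and land here.
        return detected
    if requested not in VALID_TRACKS:
        raise ValueError(
            f"processing_track must be one of {VALID_TRACKS} or 'auto', got {requested!r}"
        )
    return requested
```

Rejecting unknown values with an explicit error, rather than silently falling back, makes client mistakes visible during migration.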

Risks / Trade-offs

Risk 1: Mixed Content Documents

Risk: Documents with both scanned and digital pages

Mitigation:

Page-level track detection as fallback
Confidence scoring to identify uncertain pages
Manual override option via API
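
Page-level detection as a fallback could be sketched like this: classify each page by its extractable character count, then flag the document as mixed when pages disagree. The 100-character threshold mirrors the detection logic above; the `"mixed"` label and the function name are assumptions.

```python
from typing import Dict, List


def summarize_page_tracks(char_counts: List[int], min_chars: int = 100) -> Dict[str, object]:
    """Classify each page and summarize whether the document is uniform or mixed."""
    page_tracks = ["direct" if count >= min_chars else "ocr" for count in char_counts]
    # 1-based page numbers that would need OCR.
    ocr_pages = [i + 1 for i, track in enumerate(page_tracks) if track == "ocr"]

    if not page_tracks or len(ocr_pages) == len(page_tracks):
        document_track = "ocr"      # empty or fully scanned
    elif not ocr_pages:
        document_track = "direct"   # every page has extractable text
    else:
        document_track = "mixed"    # candidates for review or manual override
    return {
        "document_track": document_track,
        "page_tracks": page_tracks,
        "ocr_pages": ocr_pages,
    }
```

A caller could route `"mixed"` documents to the OCR track by default and expose `ocr_pages` so that an API override can be applied per document.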

Risk 2: Direct Extraction Quality

Risk: Some PDFs have poor internal structure

Mitigation:

Fallback to OCR track if extraction quality is low
Quality metrics: text density, structure coherence
User-reportable quality issues
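
One way to turn "text density" into a fallback decision, as a sketch under assumed thresholds: measure characters per unit of page area and the share of Unicode replacement characters (U+FFFD), which often indicates a broken font encoding. Both thresholds and both function names are hypothetical and would need calibration on real documents.

```python
from typing import Dict


def extraction_quality(text: str, page_width: float, page_height: float) -> Dict[str, float]:
    """Compute simple quality signals for one page of directly extracted text."""
    stripped = text.strip()
    area = page_width * page_height
    density = len(stripped) / area if area > 0 else 0.0
    # Share of characters that could not be mapped to real text.
    bad_ratio = stripped.count("\ufffd") / len(stripped) if stripped else 1.0
    return {"density": density, "bad_ratio": bad_ratio}


def should_fall_back_to_ocr(
    text: str,
    page_width: float,
    page_height: float,
    min_density: float = 1e-4,    # characters per square point (assumption)
    max_bad_ratio: float = 0.05,  # tolerated share of U+FFFD (assumption)
) -> bool:
    """Return True when the extracted text looks too sparse or too garbled to trust."""
    quality = extraction_quality(text, page_width, page_height)
    return quality["density"] < min_density or quality["bad_ratio"] > max_bad_ratio
```

Structure coherence (reading order, table detection) is not covered here and would need a separate check.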

Risk 3: Memory Pressure

Risk: RTX 4060 8GB limitation with concurrent requests

Mitigation:

Request queuing system
Dynamic batch adjustment
CPU fallback for overflow
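
A request queue can be approximated in-process with a semaphore that caps how many OCR jobs touch the GPU at once; excess requests wait instead of causing out-of-memory errors. This is a sketch only: `run_with_gpu_limit` is a hypothetical name, jobs are assumed to be zero-argument callables returning awaitables, and a multi-process deployment would need an external queue instead.

```python
import asyncio
from typing import Any, Awaitable, Callable, List, Tuple


async def run_with_gpu_limit(
    jobs: List[Callable[[], Awaitable[Any]]],
    max_concurrent: int = 1,
) -> Tuple[List[Any], int]:
    """Run jobs with at most max_concurrent active; return results and observed peak."""
    semaphore = asyncio.Semaphore(max_concurrent)
    running = 0
    peak = 0

    async def guarded(job: Callable[[], Awaitable[Any]]) -> Any:
        nonlocal running, peak
        async with semaphore:  # waits here while the GPU slots are taken
            running += 1
            peak = max(peak, running)
            try:
                return await job()
            finally:
                running -= 1

    # gather preserves the order of the submitted jobs in its result list.
    results = await asyncio.gather(*(guarded(job) for job in jobs))
    return list(results), peak
```

Returning the observed peak is only for verification; production code would more likely export it as a metric.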

Trade-off 1: Processing Time vs Accuracy

Direct extraction: Fast but depends on PDF quality
OCR: Slower but consistent quality
Decision: Prioritize speed for editable PDFs, accuracy for scanned

Trade-off 2: Complexity vs Flexibility

Two tracks increase system complexity
But enable optimal processing per document type
Decision: Accept complexity for 10x+ performance gains

Migration Plan

Phase 1: Infrastructure (Week 1-2)

Deploy UnifiedDocument model
Implement DocumentTypeDetector
Add DirectExtractionEngine
Update logging and monitoring

Phase 2: Integration (Week 3)

Update OCR service with routing logic
Modify PDF generator for unified model
Add new API endpoints
Deploy to staging

Phase 3: Validation (Week 4)

A/B testing with subset of traffic
Performance benchmarking
Quality validation
Client integration testing

Rollback Plan

Feature flag to disable dual-track
Fallback all requests to OCR track
Maintain old code paths during transition
Database migration reversible
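
The feature-flag rollback could be as small as the sketch below: when the flag is off, every request is routed to the OCR track regardless of detection. The environment variable name `DUAL_TRACK_ENABLED` and both function names are assumptions; the project may use a different flag mechanism.

```python
import os
from typing import Mapping, Optional

TRUTHY = ("1", "true", "yes", "on")


def dual_track_enabled(env: Optional[Mapping[str, str]] = None) -> bool:
    """Read the (hypothetical) DUAL_TRACK_ENABLED flag; defaults to enabled."""
    env = os.environ if env is None else env
    return env.get("DUAL_TRACK_ENABLED", "true").strip().lower() in TRUTHY


def select_track(detected: str, env: Optional[Mapping[str, str]] = None) -> str:
    """Use the detected track when dual-track is on, otherwise force OCR."""
    return detected if dual_track_enabled(env) else "ocr"
```

Passing the environment as a parameter keeps the rollback path easy to test without mutating process state.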

Open Questions

Resolved

Q: Should we support page-level track mixing?
- A: No, adds complexity with minimal benefit. Document-level is sufficient.
Q: How to handle Office documents?
- A: Convert to PDF using LibreOffice, then analyze the PDF for text extractability.
  - Text-based PDF → Direct track (editable Office docs produce text PDFs)
  - Image-based PDF → OCR track (rare case of scanned content in Office)
- This approach provides:
  - 10x+ faster processing for typical Office documents
  - Better layout preservation (no OCR errors)
  - Consistent pipeline (all documents normalized to PDF first)

Pending

Q: What translation services to integrate with?
- Needs stakeholder input on cost/quality trade-offs
Q: Should we cache extracted text for repeated processing?
- Depends on storage costs vs reprocessing frequency
Q: How to handle password-protected PDFs?
- May need API parameter for passwords

Performance Targets

Direct Extraction Track

Latency: <500ms per page
Throughput: 100+ pages/minute
Memory: <500MB per document

OCR Track (Optimized)

Latency: 2-5s per page (GPU)
Throughput: 20-30 pages/minute
Memory: <2GB per batch
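
As a quick consistency check on the latency and throughput targets, per-page latency converts to pages per minute as shown below. The sketch assumes pages are processed one after another unless `parallelism` is raised; the helper name is hypothetical.

```python
def pages_per_minute(latency_s: float, parallelism: int = 1) -> float:
    """Convert per-page latency into throughput for a given number of parallel workers."""
    if latency_s <= 0:
        raise ValueError("latency_s must be positive")
    return 60.0 / latency_s * parallelism


direct_rate = pages_per_minute(0.5)    # Direct track at the 500 ms bound
ocr_fast_rate = pages_per_minute(2.0)  # OCR track at 2 s per page
ocr_slow_rate = pages_per_minute(5.0)  # OCR track at 5 s per page
```

Under this assumption the Direct track's 500 ms bound gives 120 pages/minute, above the 100+ target. The OCR track gives 30 pages/minute at 2 s per page but only 12 at 5 s, so the 20-30 pages/minute target implies latencies near the low end of the range or some batching across pages.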

API Response Times

Document type detection: <100ms
Processing initiation: <200ms
Result retrieval: <100ms

Technical Dependencies

Python Packages

# Direct extraction
PyMuPDF==1.23.*
pdfplumber==0.10.*  # Fallback/validation
python-magic-bin==0.4.*

# OCR enhancement
paddlepaddle-gpu==2.5.2
paddleocr==2.7.3

# Infrastructure
pydantic==2.*
fastapi>=0.100
redis==5.*  # For caching

System Requirements

CUDA 11.8+ for PaddlePaddle
libmagic for file detection
16GB RAM minimum
50GB disk for models and cache
