# Technical Design: Dual-track Document Processing

## Context

### Background

The current OCR tool processes all documents through PaddleOCR, even when dealing with editable PDFs that contain extractable text. This causes:

- Unnecessary processing overhead
- Potential quality degradation from re-OCRing already digital text
- Loss of precise formatting information
- Inefficient GPU usage on documents that don't need OCR

### Constraints

- RTX 4060 8GB GPU memory limitation
- Need to maintain backward compatibility with the existing API
- Must support future translation features
- Should handle mixed documents (partially scanned, partially digital)

### Stakeholders

- API consumers expecting consistent JSON/PDF output
- Translation system requiring structure preservation
- Performance-sensitive deployments

## Goals / Non-Goals

### Goals

- Intelligently route documents to the appropriate processing track
- Preserve document structure for translation
- Optimize GPU usage by avoiding unnecessary OCR
- Maintain a unified output format across tracks
- Reduce processing time for editable PDFs by 70%+

### Non-Goals

- Implementing the actual translation engine (future phase)
- Supporting video or audio transcription
- Real-time collaborative editing
- OCR model training or fine-tuning
## Decisions

### Decision 1: Dual-track Architecture

**What**: Implement two separate processing pipelines - an OCR track and a Direct extraction track (see the routing sketch below).

**Why**:

- Editable PDFs don't need OCR and can be processed 10-100x faster
- Direct extraction preserves exact formatting and fonts
- The OCR track remains optimal for scanned documents

**Alternatives considered**:

1. **Single enhanced OCR pipeline**: Would still waste resources on editable PDFs
2. **Hybrid approach per page**: Too complex; most documents are uniformly editable or scanned
3. **Multiple specialized pipelines**: Over-engineering for current requirements
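
A minimal sketch of the routing seam, assuming track-specific processors that share a `process()` interface. The names here (`DocumentRouter`, the processor arguments) are illustrative, not the final module layout; `detect_track` is defined under Decision 4:

```python
from pathlib import Path

class DocumentRouter:
    """Routes each document to the OCR or Direct track (illustrative sketch)."""

    def __init__(self, ocr_processor, direct_processor):
        self._processors = {"ocr": ocr_processor, "direct": direct_processor}

    def process(self, file_path: Path) -> "UnifiedDocument":
        track = detect_track(file_path)  # see Decision 4
        return self._processors[track].process(file_path)
```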

### Decision 2: UnifiedDocument Model

**What**: Create a standardized intermediate representation for both tracks.

**Why**:

- Provides a consistent API interface regardless of processing track
- Simplifies downstream processing (PDF generation, translation)
- Enables track switching without breaking changes

**Structure**:
```python
from __future__ import annotations  # allow forward references in annotations

from dataclasses import dataclass
from typing import Dict, List, Literal, Optional, Union

# DocumentMetadata, Dimensions, ElementType, BoundingBox, and StyleInfo
# are supporting types defined elsewhere in the model package.

@dataclass
class UnifiedDocument:
    document_id: str
    metadata: DocumentMetadata
    pages: List[Page]
    processing_track: Literal["ocr", "direct"]

@dataclass
class Page:
    page_number: int
    elements: List[DocumentElement]
    dimensions: Dimensions

@dataclass
class DocumentElement:
    element_id: str
    type: ElementType  # text, table, image, header, etc.
    content: Union[str, Dict, bytes]
    bbox: BoundingBox
    style: Optional[StyleInfo]
    confidence: Optional[float]  # Only populated on the OCR track
```
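
One payoff of the shared model is a single serialization path for both tracks. A minimal sketch, assuming element `content` values are JSON-serializable (binary image content would need e.g. base64 handling first):

```python
import json
from dataclasses import asdict

def to_api_json(document: UnifiedDocument) -> str:
    # Identical JSON shape whether the document came from OCR or direct extraction
    return json.dumps(asdict(document), default=str)
```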

### Decision 3: PyMuPDF for Direct Extraction

**What**: Use the PyMuPDF (fitz) library for editable PDF processing (a short extraction sketch follows the alternatives).

**Why**:

- Mature, well-maintained library
- Excellent coordinate preservation
- Fast C++ backend
- Supports text, table, and image extraction with positions

**Alternatives considered**:

1. **pdfplumber**: Good but slower, less precise coordinates
2. **PyPDF2**: Limited layout information
3. **PDFMiner**: Complex API, slower performance
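
For context, the core of direct extraction via PyMuPDF's `get_text("dict")` API, which returns blocks, lines, and spans with coordinates and font information (mapping the spans into `DocumentElement`s is omitted here):

```python
import fitz  # PyMuPDF

def extract_text_elements(pdf_path: str):
    """Sketch: yield text spans with coordinates and font info."""
    with fitz.open(pdf_path) as doc:
        for page in doc:
            for block in page.get_text("dict")["blocks"]:
                if block["type"] != 0:  # 0 = text block, 1 = image block
                    continue
                for line in block["lines"]:
                    for span in line["spans"]:
                        yield {
                            "text": span["text"],
                            "bbox": span["bbox"],  # (x0, y0, x1, y1)
                            "font": span["font"],
                            "size": span["size"],
                            "page": page.number,
                        }
```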

### Decision 4: Processing Track Auto-detection

**What**: Automatically determine the optimal track based on document analysis.

**Detection logic** (`OFFICE_MIMES` below is an illustrative constant):
```python
import magic  # python-magic; requires libmagic (see System Requirements)
import fitz  # PyMuPDF
from pathlib import Path

OFFICE_MIMES = {  # illustrative subset; the real list covers all Office types
    "application/vnd.openxmlformats-officedocument.wordprocessingml.document",
    "application/vnd.openxmlformats-officedocument.spreadsheetml.sheet",
}

def detect_track(file_path: Path) -> str:
    file_type = magic.from_file(str(file_path), mime=True)

    if file_type.startswith('image/'):
        return "ocr"

    if file_type == 'application/pdf':
        # Check if the PDF has extractable text
        with fitz.open(file_path) as doc:
            for page in doc.pages(0, min(3, doc.page_count)):  # Sample first 3 pages
                text = page.get_text()
                if len(text.strip()) < 100:  # Minimal text -> likely scanned
                    return "ocr"
        return "direct"

    if file_type in OFFICE_MIMES:
        return "ocr"  # For now; may add direct Office support later

    return "ocr"  # Default fallback
```

### Decision 5: GPU Memory Management

**What**: Implement dynamic batch sizing and model caching for the RTX 4060 (8GB).

**Why**:

- Prevents OOM errors
- Maximizes throughput
- Enables concurrent request handling

**Strategy**:
```python
from functools import lru_cache

# Adaptive batch sizing based on available memory.
# calculate_batch_size, get_gpu_memory, and MODEL_MEMORY_REQUIREMENTS are
# sketch-level names for helpers to be implemented.
batch_size = calculate_batch_size(
    available_memory=get_gpu_memory(),
    image_size=image.shape,
    model_size=MODEL_MEMORY_REQUIREMENTS,
)

# Model caching to avoid reload overhead
@lru_cache(maxsize=2)
def get_model(model_type: str):
    return load_model(model_type)
```
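
One possible shape for the sizing helper. This is a sketch: the activation-overhead multiplier and safety margin are placeholder assumptions to be replaced by profiling on the RTX 4060.

```python
def calculate_batch_size(available_memory: int, image_size: tuple,
                         model_size: int, safety_margin: float = 0.8) -> int:
    """Estimate how many images fit alongside the model in GPU memory (bytes)."""
    height, width = image_size[:2]
    # float32 RGB plus an assumed ~4x overhead for intermediate activations
    per_image = height * width * 3 * 4 * 4
    usable = int(available_memory * safety_margin) - model_size
    return max(1, usable // per_image)
```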

### Decision 6: Backward Compatibility

**What**: Maintain the existing API while adding new capabilities.

**How**:

- Existing endpoints continue working unchanged
- New `processing_track` parameter is optional (see the endpoint sketch below)
- Output format compatible with current consumers
- Gradual migration path for clients
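
How the optional parameter could look on a FastAPI endpoint. The path and response shape here are illustrative, not the final API contract:

```python
from typing import Literal, Optional

from fastapi import FastAPI, UploadFile

app = FastAPI()

@app.post("/documents")
async def process_document(
    file: UploadFile,
    # Optional: omitting it keeps today's behavior (auto-detection),
    # so existing clients need no changes.
    processing_track: Optional[Literal["ocr", "direct"]] = None,
):
    track = processing_track or "auto"
    return {"filename": file.filename, "processing_track": track, "status": "accepted"}
```
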
## Risks / Trade-offs

### Risk 1: Mixed Content Documents

**Risk**: Documents with both scanned and digital pages

**Mitigation**:

- Page-level track detection as a fallback
- Confidence scoring to identify uncertain pages
- Manual override option via the API

### Risk 2: Direct Extraction Quality

**Risk**: Some PDFs have poor internal structure

**Mitigation**:

- Fallback to the OCR track if extraction quality is low (sketched below)
- Quality metrics: text density, structure coherence
- User-reportable quality issues
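
A sketch of the quality gate that would trigger the OCR fallback; both thresholds are placeholder assumptions to be tuned against real documents:

```python
def extraction_quality_ok(doc: "UnifiedDocument",
                          min_chars_per_page: int = 200,
                          min_printable_ratio: float = 0.9) -> bool:
    """Reject direct-extraction output with too little text or too much mojibake."""
    for page in doc.pages:
        text = "".join(e.content for e in page.elements
                       if isinstance(e.content, str))
        if len(text) < min_chars_per_page:
            return False  # too sparse: likely a scanned or image-only page
        printable = sum(ch.isprintable() or ch.isspace() for ch in text)
        if printable / max(1, len(text)) < min_printable_ratio:
            return False  # garbled internal structure
    return True
```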

### Risk 3: Memory Pressure

**Risk**: RTX 4060 8GB limitation with concurrent requests

**Mitigation**:

- Request queuing system (see the sketch below)
- Dynamic batch adjustment
- CPU fallback for overflow
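
A minimal sketch of the queuing idea, assuming an async service layer; the single-slot limit and the `run_ocr` entry point are assumptions:

```python
import asyncio

# One OCR batch on the GPU at a time; excess requests wait in line.
# A limit of 1 is a conservative assumption for the 8GB card, to be tuned.
_gpu_slots = asyncio.Semaphore(1)

async def run_ocr_guarded(batch):
    async with _gpu_slots:
        return await run_ocr(batch)  # hypothetical OCR entry point
```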

### Trade-off 1: Processing Time vs Accuracy

- Direct extraction: fast, but depends on PDF quality
- OCR: slower, but consistent quality
- **Decision**: Prioritize speed for editable PDFs, accuracy for scanned documents

### Trade-off 2: Complexity vs Flexibility

- Two tracks increase system complexity
- But they enable optimal processing per document type
- **Decision**: Accept the complexity for 10x+ performance gains
## Migration Plan

### Phase 1: Infrastructure (Weeks 1-2)

1. Deploy the UnifiedDocument model
2. Implement the DocumentTypeDetector
3. Add the DirectExtractionEngine
4. Update logging and monitoring

### Phase 2: Integration (Week 3)

1. Update the OCR service with routing logic
2. Modify the PDF generator for the unified model
3. Add new API endpoints
4. Deploy to staging

### Phase 3: Validation (Week 4)

1. A/B testing with a subset of traffic
2. Performance benchmarking
3. Quality validation
4. Client integration testing

### Rollback Plan

1. Feature flag to disable dual-track (sketched below)
2. Fall back all requests to the OCR track
3. Maintain old code paths during the transition
4. Keep the database migration reversible
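
The kill switch from step 1 could be as small as an environment flag; this is a sketch, and the variable name is illustrative:

```python
import os

# When the flag is off, every request takes the legacy OCR path.
DUAL_TRACK_ENABLED = os.getenv("DUAL_TRACK_ENABLED", "true").lower() == "true"

def choose_track(file_path) -> str:
    if not DUAL_TRACK_ENABLED:
        return "ocr"  # full rollback: behave exactly like the old pipeline
    return detect_track(file_path)
```
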
## Open Questions

### Resolved

- Q: Should we support page-level track mixing?
- A: No; it adds complexity with minimal benefit. Document-level detection is sufficient.

- Q: How should we handle Office documents?
- A: OCR track initially; consider python-docx/openpyxl later if needed.

### Pending

- Q: Which translation services should we integrate with?
- Needs stakeholder input on cost/quality trade-offs

- Q: Should we cache extracted text for repeated processing?
- Depends on storage costs vs reprocessing frequency

- Q: How should we handle password-protected PDFs?
- May need an API parameter for passwords (a possible shape is sketched below)
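
For the password question, PyMuPDF already exposes what an API parameter would need: `needs_pass` and `authenticate()` are real PyMuPDF APIs, while the wrapper itself is a sketch:

```python
from typing import Optional

import fitz  # PyMuPDF

def open_pdf(path: str, password: Optional[str] = None) -> fitz.Document:
    doc = fitz.open(path)
    if doc.needs_pass:
        # authenticate() returns 0 on failure
        if password is None or not doc.authenticate(password):
            doc.close()
            raise ValueError("PDF is encrypted; a valid password is required")
    return doc
```
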
## Performance Targets

### Direct Extraction Track

- Latency: <500ms per page
- Throughput: 100+ pages/minute
- Memory: <500MB per document

### OCR Track (Optimized)

- Latency: 2-5s per page (GPU)
- Throughput: 20-30 pages/minute
- Memory: <2GB per batch

### API Response Times

- Document type detection: <100ms
- Processing initiation: <200ms
- Result retrieval: <100ms
## Technical Dependencies

### Python Packages
```text
# Direct extraction
PyMuPDF==1.23.x
pdfplumber==0.10.x       # Fallback/validation
python-magic-bin==0.4.x

# OCR enhancement
paddlepaddle-gpu==2.5.2
paddleocr==2.7.3

# Infrastructure
pydantic==2.x
fastapi==0.100+
redis==5.x               # For caching
```

### System Requirements

- CUDA 11.8+ for PaddlePaddle
- libmagic for file detection
- 16GB RAM minimum
- 50GB disk for models and cache