# Design Document: Enhanced Memory Management

## Architecture Overview

The enhanced memory management system introduces three core components that work together to prevent OOM crashes and optimize resource utilization:

```
┌─────────────────────────────────────────────────────────────┐
│                         Task Router                         │
│ ┌───────────────────────────────────────────────────────┐   │
│ │ Request → Queue → Acquire Service → Process → Release │   │
│ └───────────────────────────────────────────────────────┘   │
└─────────────────────────────────────────────────────────────┘
                               │
                               ▼
┌─────────────────────────────────────────────────────────────┐
│                       OCRServicePool                        │
│  ┌─────────┐  ┌─────────┐  ┌─────────┐  ┌─────────┐         │
│  │Service 1│  │Service 2│  │Service 3│  │Service 4│         │
│  │  GPU:0  │  │  GPU:0  │  │  GPU:1  │  │   CPU   │         │
│  └─────────┘  └─────────┘  └─────────┘  └─────────┘         │
└─────────────────────────────────────────────────────────────┘
                               │
                               ▼
┌─────────────────────────────────────────────────────────────┐
│                        ModelManager                         │
│ ┌───────────────────────────────────────────────────────┐   │
│ │ Models: {id → (instance, ref_count, last_used)}       │   │
│ │ Timeout Monitor → Unload Idle Models                  │   │
│ └───────────────────────────────────────────────────────┘   │
└─────────────────────────────────────────────────────────────┘
                               │
                               ▼
┌─────────────────────────────────────────────────────────────┐
│                         MemoryGuard                         │
│ ┌───────────────────────────────────────────────────────┐   │
│ │ Monitor: GPU/CPU Memory Usage                         │   │
│ │ Actions: Warn → Throttle → Fallback → Emergency       │   │
│ └───────────────────────────────────────────────────────┘   │
└─────────────────────────────────────────────────────────────┘
```

## Component Design

### 1. ModelManager

**Purpose**: Centralized model lifecycle management with reference counting and idle timeout.

**Key Design Decisions**:

- **Singleton Pattern**: One ModelManager instance per application
- **Reference Counting**: Track the active users of each model
- **LRU Cache**: Evict least recently used models under memory pressure
- **Lazy Loading**: Load models only when first requested

**Implementation**:

```python
import asyncio
import time
from typing import Dict


class ModelManager:
    def __init__(self, config: ModelConfig, memory_guard: MemoryGuard):
        self.models: Dict[str, ModelEntry] = {}
        self.lock = asyncio.Lock()
        self.config = config
        self.memory_guard = memory_guard  # used for pre-load checks below
        self._start_timeout_monitor()

    async def load_model(self, model_id: str, params: Dict) -> Model:
        async with self.lock:
            # Model already resident: bump the reference count and reuse it
            if model_id in self.models:
                entry = self.models[model_id]
                entry.ref_count += 1
                entry.last_used = time.time()
                return entry.model

            # Check memory before loading; evict idle models under pressure
            if not await self.memory_guard.check_memory(params['estimated_memory']):
                await self._evict_idle_models()

            model = await self._create_model(model_id, params)
            self.models[model_id] = ModelEntry(
                model=model,
                ref_count=1,
                last_used=time.time()
            )
            return model
```
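
Reference counting only works if every `load_model` is paired with a release; otherwise `ref_count` never reaches zero and the timeout monitor can never unload anything. A minimal sketch of the matching release method (a hypothetical counterpart, not shown above):

```python
class ModelManager:  # continued
    async def release_model(self, model_id: str) -> None:
        # Hypothetical counterpart to load_model: decrement the count
        # so the idle-timeout monitor can unload unreferenced models.
        async with self.lock:
            entry = self.models.get(model_id)
            if entry is None:
                return
            entry.ref_count = max(0, entry.ref_count - 1)
            entry.last_used = time.time()
            # Entries with ref_count == 0 become eligible for unloading
            # once model_idle_timeout_seconds has elapsed.
```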

### 2. OCRServicePool

**Purpose**: Manage a pool of OCRService instances to prevent duplicate model loading.

**Key Design Decisions**:

- **Per-Device Pools**: Separate pool for each GPU/CPU device
- **Semaphore Control**: Limit concurrent usage per service
- **Queue Management**: FIFO queue with timeout for waiting requests
- **Health Monitoring**: Periodic health checks on pooled services

**Implementation**:
```python
import asyncio
from typing import Dict, List


class OCRServicePool:
    def __init__(self, config: PoolConfig):
        self.pools: Dict[str, List[OCRService]] = {}
        self.semaphores: Dict[str, asyncio.Semaphore] = {}
        self.queues: Dict[str, asyncio.Queue] = {}
        self._initialize_pools()

    async def acquire(self, device: str = "GPU:0") -> OCRService:
        # Fast path: hand out an idle service from the device pool
        if device in self.pools and self.pools[device]:
            for service in self.pools[device]:
                if await service.try_acquire():
                    return service

        # Slow path: queue until a service is released (or timeout)
        return await self._wait_for_service(device)
```
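
The Task Router's acquire → process → release flow pairs naturally with an async context manager, which guarantees the release step even on errors. A minimal usage sketch (the `release` method and the `process` call are assumptions mirroring `acquire` above):

```python
from contextlib import asynccontextmanager

@asynccontextmanager
async def pooled_service(pool: OCRServicePool, device: str = "GPU:0"):
    # Guarantees release even if processing raises, so pooled
    # services are never leaked on errors.
    service = await pool.acquire(device)
    try:
        yield service
    finally:
        await pool.release(service)

# Usage inside the task router (sketch):
# async with pooled_service(pool) as ocr:
#     result = await ocr.process(document)
```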

### 3. MemoryGuard

**Purpose**: Monitor memory usage and trigger preventive actions.

**Key Design Decisions**:

- **Multi-Backend Support**: `paddle.device.cuda` first, falling back to `pynvml`, then `torch`
- **Threshold System**: Warning (80%), Critical (95%), Emergency (98%)
- **Predictive Allocation**: Estimate memory before operations
- **Progressive Actions**: Warn → Throttle → CPU Fallback → Reject

**Implementation**:
```python
class MemoryGuard:
    def __init__(self, config: MemoryConfig):
        self.config = config
        self.backend = self._detect_backend()
        self._start_monitor()

    async def check_memory(self, required_mb: int = 0) -> bool:
        stats = await self.get_memory_stats()
        available = stats['gpu_free_mb']

        # Hard check: the estimated allocation must fit in free memory
        if available < required_mb:
            return False

        usage_ratio = stats['gpu_used_ratio']
        if usage_ratio > self.config.critical_threshold:
            await self._trigger_emergency_cleanup()
            return False

        if usage_ratio > self.config.warning_threshold:
            await self._trigger_warning()

        return True
```
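
A sketch of the backend probe implied by the fallback order above; the method name `_detect_backend` comes from the constructor, but the probe bodies here are assumptions:

```python
class MemoryGuard:  # continued
    def _detect_backend(self) -> str:
        # Probe in preference order; any import may be absent in a
        # given deployment, so failures just fall through.
        try:
            import paddle
            if paddle.device.is_compiled_with_cuda():
                return "paddle"
        except ImportError:
            pass
        try:
            import pynvml
            pynvml.nvmlInit()
            return "pynvml"
        except Exception:
            pass
        try:
            import torch
            if torch.cuda.is_available():
                return "torch"
        except ImportError:
            pass
        return "none"  # CPU-only: GPU stats unavailable
```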

## Memory Optimization Strategies

### 1. PP-StructureV3 Specific Optimizations

**Problem**: PP-StructureV3 is permanently exempted from unloading (lines 255-267).

**Solution**:
```python
import paddle

# Remove the exemption
def should_unload_model(model_id: str) -> bool:
    # Old: if model_id == "ppstructure_v3": return False
    # New: apply the same idle-timeout rules to all models
    return True

# Add proper cleanup
def unload_ppstructure_v3(engine: PPStructureV3):
    # Drop sub-engine references so their weights become collectable
    engine.table_engine = None
    engine.text_detector = None
    engine.text_recognizer = None
    paddle.device.cuda.empty_cache()
```

### 2. Batch Processing for Large Documents

**Strategy**: Process documents in configurable batches to limit memory usage.
```python
import gc
from pathlib import Path

import paddle


async def process_large_document(doc_path: Path, batch_size: int = 10):
    total_pages = get_page_count(doc_path)

    for start_idx in range(0, total_pages, batch_size):
        end_idx = min(start_idx + batch_size, total_pages)

        # Process one batch of pages
        batch_results = await process_pages(doc_path, start_idx, end_idx)

        # Force cleanup between batches so peak memory stays bounded
        paddle.device.cuda.empty_cache()
        gc.collect()

        yield batch_results
```
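
Because results are yielded per batch, callers can stream them rather than holding the whole document in flight. A minimal consumption sketch (the result handling is an assumption):

```python
from pathlib import Path

async def collect_document_results(doc_path: Path) -> list:
    results = []
    # Consume batches as they complete; GPU memory is reclaimed
    # between iterations instead of at the end of the document.
    async for batch in process_large_document(doc_path, batch_size=10):
        results.extend(batch)
    return results
```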

### 3. Selective Feature Disabling

**Strategy**: Allow disabling memory-intensive features when under pressure.
```python
class AdaptiveProcessing:
    def __init__(self):
        self.features = {
            'charts': True,
            'formulas': True,
            'tables': True,
            'layout': True
        }

    async def adapt_to_memory(self, available_mb: int):
        # Shed the most memory-hungry features first
        if available_mb < 1000:
            self.features['charts'] = False
            self.features['formulas'] = False
        if available_mb < 500:
            self.features['tables'] = False
```
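
A sketch of how these flags might gate the pipeline; the per-stage functions here are hypothetical, not part of the current codebase:

```python
class AdaptiveProcessing:  # continued
    async def process_page(self, page_image):
        # Run only the stages whose feature flag is still enabled;
        # each disabled stage saves its model's GPU footprint.
        results = {}
        if self.features['layout']:
            results['layout'] = await detect_layout(page_image)
        if self.features['tables']:
            results['tables'] = await recognize_tables(page_image)
        if self.features['formulas']:
            results['formulas'] = await recognize_formulas(page_image)
        if self.features['charts']:
            results['charts'] = await recognize_charts(page_image)
        return results
```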

## Concurrency Management

### 1. Semaphore-Based Limiting
```python
import asyncio

# Global semaphores
prediction_semaphore = asyncio.Semaphore(2)   # Max 2 concurrent predictions
processing_semaphore = asyncio.Semaphore(4)   # Max 4 concurrent OCR tasks


async def predict_with_structure(image, params=None):
    async with prediction_semaphore:
        # Memory check before prediction
        required_mb = estimate_prediction_memory(image.shape)
        if not await memory_guard.check_memory(required_mb):
            raise MemoryError("Insufficient memory for prediction")

        return await pp_structure.predict(image, params)
```

### 2. Queue-Based Task Distribution
```python
import asyncio


class TaskDistributor:
    def __init__(self, config):
        self.config = config  # assumed to provide queue_timeout
        self.queues = {
            'high': asyncio.Queue(maxsize=10),
            'normal': asyncio.Queue(maxsize=50),
            'low': asyncio.Queue(maxsize=100)
        }

    async def distribute_task(self, task: Task):
        priority = self._calculate_priority(task)
        queue = self.queues[priority]

        try:
            # Back-pressure: block briefly if the queue is full
            await asyncio.wait_for(
                queue.put(task),
                timeout=self.config.queue_timeout
            )
        except asyncio.TimeoutError:
            raise QueueFullError(f"Queue {priority} is full")
```
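
The priority function is not specified here. One plausible heuristic, sketched below; the `interactive` and `page_count` task attributes are assumptions:

```python
class TaskDistributor:  # continued
    def _calculate_priority(self, task: Task) -> str:
        # Hypothetical heuristic: small interactive jobs jump the
        # queue, bulk re-processing yields to everything else.
        if getattr(task, 'interactive', False):
            return 'high'
        if getattr(task, 'page_count', 0) > 50:
            return 'low'
        return 'normal'
```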

## Monitoring and Metrics

### 1. Memory Metrics Collection
```python
import time
from collections import deque


class MemoryMetrics:
    def __init__(self):
        self.history = deque(maxlen=1000)  # rolling window of samples
        self.alerts = []

    async def collect(self):
        stats = {
            'timestamp': time.time(),
            'gpu_used_mb': get_gpu_memory_used(),
            'gpu_free_mb': get_gpu_memory_free(),
            'cpu_used_mb': get_cpu_memory_used(),
            'models_loaded': len(model_manager.models),
            'active_tasks': len(active_tasks),
            'pool_utilization': get_pool_utilization()
        }
        self.history.append(stats)
        await self._check_alerts(stats)
```
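
The alert check is not spelled out above; a plausible sketch that reuses the configured warning ratio (the alert record's shape is an assumption):

```python
class MemoryMetrics:  # continued
    async def _check_alerts(self, stats: dict):
        # Hypothetical sketch: record an alert when the GPU usage
        # ratio crosses the configured warning threshold.
        total = stats['gpu_used_mb'] + stats['gpu_free_mb']
        ratio = stats['gpu_used_mb'] / total if total else 0.0
        if ratio > settings.memory_warning_threshold:
            self.alerts.append({
                'timestamp': stats['timestamp'],
                'level': 'warning',
                'gpu_used_ratio': round(ratio, 3),
            })
```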

### 2. Monitoring Dashboard Endpoints
```python
@router.get("/admin/memory/stats")
async def get_memory_stats():
    return {
        'current': memory_metrics.get_current(),
        'history': memory_metrics.get_history(minutes=5),
        'alerts': memory_metrics.get_active_alerts(),
        'recommendations': memory_optimizer.get_recommendations()
    }


@router.post("/admin/memory/gc")
async def trigger_garbage_collection():
    """Manual garbage collection trigger"""
    results = await memory_manager.force_cleanup()
    return {'freed_mb': results['freed'], 'models_unloaded': results['models']}
```

## Error Recovery

### 1. OOM Recovery Strategy
```python
class OOMRecovery:
    async def recover(self, error: Exception, task: Task):
        logger.error(f"OOM detected for task {task.id}: {error}")

        # Step 1: Emergency cleanup
        await self.emergency_cleanup()

        # Step 2: Try CPU fallback
        if self.config.enable_cpu_fallback:
            task.device = "CPU"
            return await self.retry_on_cpu(task)

        # Step 3: Reduce batch size and retry
        if task.batch_size > 1:
            task.batch_size = max(1, task.batch_size // 2)
            return await self.retry_with_reduced_batch(task)

        # Step 4: Fail gracefully
        await self.mark_task_failed(task, "Insufficient memory")
```

### 2. Service Recovery
```python
class ServiceRecovery:
    async def restart_service(self, service_id: str):
        """Restart a failed service"""
        # Kill existing process
        await self.kill_service_process(service_id)

        # Clear service memory
        await self.clear_service_cache(service_id)

        # Restart with fresh state
        new_service = await self.create_service(service_id)
        await self.pool.replace_service(service_id, new_service)
```

## Testing Strategy

### 1. Memory Leak Detection
```python
import gc

import pytest


@pytest.mark.memory
async def test_no_memory_leak():
    initial_memory = get_memory_usage()

    # Process 100 tasks
    for _ in range(100):
        task = create_test_task()
        await process_task(task)

    # Force cleanup
    await cleanup_all()
    gc.collect()

    final_memory = get_memory_usage()
    leak = final_memory - initial_memory

    assert leak < 100  # Max 100 MB leak tolerance
```

### 2. Stress Testing
```python
import asyncio

import pytest


@pytest.mark.stress
async def test_concurrent_load():
    tasks = [create_large_task() for _ in range(50)]

    # Should be handled gracefully without OOM
    results = await asyncio.gather(
        *[process_task(t) for t in tasks],
        return_exceptions=True
    )

    # Some tasks may fail, but the system should remain stable
    successful = sum(1 for r in results if not isinstance(r, Exception))
    assert successful > 0
    assert await health_check() == "healthy"
```

## Performance Targets

| Metric | Current | Target | Improvement |
|--------|---------|--------|-------------|
| Memory per task | 2-4 GB | 0.5-1 GB | 75% reduction |
| Concurrent tasks | 1-2 | 4-8 | 4x increase |
| Model load time | 30-60s | 5-10s (cached) | 6x faster |
| OOM crashes/day | 5-10 | 0-1 | 90% reduction |
| Service uptime | 4-8 hours | 24+ hours | 3x improvement |
## Rollout Plan

### Phase 1: Foundation (Week 1)
- Implement ModelManager
- Integrate with existing OCRService
- Add basic memory monitoring

### Phase 2: Pooling (Week 2)
- Implement OCRServicePool
- Update task router
- Add concurrency limits

### Phase 3: Optimization (Week 3)
- Add MemoryGuard
- Implement adaptive processing
- Add batch processing

### Phase 4: Hardening (Week 4)
- Stress testing
- Performance tuning
- Documentation and monitoring
## Configuration Settings Reference

All memory management settings are defined in `backend/app/core/config.py` under the `Settings` class.

### Memory Thresholds

| Setting | Type | Default | Description |
|---------|------|---------|-------------|
| `memory_warning_threshold` | float | 0.80 | GPU memory usage ratio (0-1) to trigger warning alerts |
| `memory_critical_threshold` | float | 0.95 | GPU memory ratio to start throttling operations |
| `memory_emergency_threshold` | float | 0.98 | GPU memory ratio to trigger emergency cleanup |
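
For orientation, a minimal sketch of how such fields might be declared on the `Settings` class, assuming pydantic v1-style `BaseSettings` (the actual class has many more fields):

```python
from pydantic import BaseSettings  # pydantic v1-style settings


class Settings(BaseSettings):
    # Memory thresholds (ratios of total GPU memory)
    memory_warning_threshold: float = 0.80
    memory_critical_threshold: float = 0.95
    memory_emergency_threshold: float = 0.98

    # Monitoring
    memory_check_interval_seconds: int = 30
    gpu_memory_limit_mb: int = 6144

    class Config:
        env_file = ".env"  # uppercase env vars override these defaults
```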
### Memory Monitoring

| Setting | Type | Default | Description |
|---------|------|---------|-------------|
| `memory_check_interval_seconds` | int | 30 | Background check interval for memory monitoring |
| `enable_memory_alerts` | bool | True | Enable/disable memory threshold alerts |
| `gpu_memory_limit_mb` | int | 6144 | Maximum GPU memory to use (MB) |
| `gpu_memory_reserve_mb` | int | 512 | Memory reserved for CUDA overhead |
### Model Lifecycle Management

| Setting | Type | Default | Description |
|---------|------|---------|-------------|
| `enable_model_lifecycle_management` | bool | True | Use ModelManager for model lifecycle |
| `model_idle_timeout_seconds` | int | 300 | Unload models after idle time |
| `pp_structure_idle_timeout_seconds` | int | 300 | Unload PP-StructureV3 after idle |
| `structure_model_memory_mb` | int | 2000 | Estimated memory for PP-StructureV3 |
| `ocr_model_memory_mb` | int | 500 | Estimated memory per OCR language model |
| `enable_lazy_model_loading` | bool | True | Load models on demand |
| `auto_unload_unused_models` | bool | True | Auto-unload unused language models |
### Service Pool Configuration

| Setting | Type | Default | Description |
|---------|------|---------|-------------|
| `enable_service_pool` | bool | True | Use OCRServicePool |
| `max_services_per_device` | int | 1 | Max OCRService instances per GPU |
| `max_total_services` | int | 2 | Max total OCRService instances |
| `service_acquire_timeout_seconds` | float | 300.0 | Timeout for acquiring a service from the pool |
| `max_queue_size` | int | 50 | Max pending tasks per device queue |
### Concurrency Control

| Setting | Type | Default | Description |
|---------|------|---------|-------------|
| `max_concurrent_predictions` | int | 2 | Max concurrent PP-StructureV3 predictions |
| `max_concurrent_pages` | int | 2 | Max pages processed concurrently |
| `inference_batch_size` | int | 1 | Batch size for inference |
| `enable_batch_processing` | bool | True | Enable batch processing for large docs |
### Recovery Settings

| Setting | Type | Default | Description |
|---------|------|---------|-------------|
| `enable_cpu_fallback` | bool | True | Fall back to CPU when GPU memory is low |
| `enable_emergency_cleanup` | bool | True | Auto-cleanup on memory pressure |
| `enable_worker_restart` | bool | False | Restart workers on OOM (requires supervisor) |
### Feature Flags

| Setting | Type | Default | Description |
|---------|------|---------|-------------|
| `enable_chart_recognition` | bool | True | Enable chart/diagram recognition |
| `enable_formula_recognition` | bool | True | Enable math formula recognition |
| `enable_table_recognition` | bool | True | Enable table structure recognition |
| `enable_seal_recognition` | bool | True | Enable seal/stamp recognition |
| `enable_text_recognition` | bool | True | Enable general text recognition |
| `enable_memory_optimization` | bool | True | Enable memory optimizations |
### Environment Variable Override

All settings can be overridden via environment variables; the variable name is the setting name in uppercase:

```bash
# Example .env file
MEMORY_WARNING_THRESHOLD=0.75
MEMORY_CRITICAL_THRESHOLD=0.90
MAX_CONCURRENT_PREDICTIONS=1
GPU_MEMORY_LIMIT_MB=4096
ENABLE_CPU_FALLBACK=true
```
### Recommended Configurations

#### RTX 4060 8GB (Default)
```bash
GPU_MEMORY_LIMIT_MB=6144
MAX_CONCURRENT_PREDICTIONS=2
MAX_CONCURRENT_PAGES=2
INFERENCE_BATCH_SIZE=1
```

#### RTX 3090 24GB
```bash
GPU_MEMORY_LIMIT_MB=20480
MAX_CONCURRENT_PREDICTIONS=4
MAX_CONCURRENT_PAGES=4
INFERENCE_BATCH_SIZE=2
```

#### CPU-Only Mode
```bash
FORCE_CPU_MODE=true
MAX_CONCURRENT_PREDICTIONS=1
ENABLE_CPU_FALLBACK=false
```
## Prometheus Metrics

The system exports Prometheus-format metrics via the `PrometheusMetrics` class. Available metrics:

### GPU Metrics
- `tool_ocr_memory_gpu_total_bytes` - Total GPU memory
- `tool_ocr_memory_gpu_used_bytes` - Used GPU memory
- `tool_ocr_memory_gpu_free_bytes` - Free GPU memory
- `tool_ocr_memory_gpu_utilization_ratio` - GPU utilization (0-1)

### Model Metrics
- `tool_ocr_memory_models_loaded_total` - Number of loaded models
- `tool_ocr_memory_models_memory_bytes` - Total memory used by models
- `tool_ocr_memory_model_ref_count{model_id}` - Reference count per model

### Prediction Metrics
- `tool_ocr_memory_predictions_active` - Currently active predictions
- `tool_ocr_memory_predictions_queue_depth` - Predictions waiting in queue
- `tool_ocr_memory_predictions_total` - Total predictions processed (counter)
- `tool_ocr_memory_predictions_timeouts_total` - Total prediction timeouts (counter)

### Pool Metrics
- `tool_ocr_memory_pool_services_total` - Total services in pool
- `tool_ocr_memory_pool_services_available` - Available services
- `tool_ocr_memory_pool_services_in_use` - Services in use
- `tool_ocr_memory_pool_acquisitions_total` - Total acquisitions (counter)

### Recovery Metrics
- `tool_ocr_memory_recovery_count_total` - Total recovery attempts
- `tool_ocr_memory_recovery_in_cooldown` - In cooldown (0/1)
- `tool_ocr_memory_recovery_cooldown_remaining_seconds` - Remaining cooldown
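
A minimal sketch of how a couple of these gauges could be exported with the `prometheus_client` library; the endpoint path and stats wiring are assumptions, since the internals of `PrometheusMetrics` are not shown here:

```python
from fastapi import Response
from prometheus_client import CONTENT_TYPE_LATEST, Gauge, generate_latest

gpu_used = Gauge("tool_ocr_memory_gpu_used_bytes", "Used GPU memory")
models_loaded = Gauge("tool_ocr_memory_models_loaded_total", "Number of loaded models")


@router.get("/admin/memory/metrics")  # hypothetical scrape endpoint
async def prometheus_metrics():
    # Refresh gauge values from the live components before exposition
    stats = await memory_guard.get_memory_stats()
    gpu_used.set(stats['gpu_used_mb'] * 1024 * 1024)  # export bytes
    models_loaded.set(len(model_manager.models))
    return Response(generate_latest(), media_type=CONTENT_TYPE_LATEST)
```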
## Memory Dump API

The `MemoryDumper` class provides debugging capabilities:

```python
from app.services.memory_manager import get_memory_dumper

dumper = get_memory_dumper()

# Create a memory dump
dump = dumper.create_dump(include_python_objects=True)

# Get the dump as a dictionary for JSON serialization
dump_dict = dumper.to_dict(dump)

# Compare two dumps to detect memory growth
comparison = dumper.compare_dumps(dump1, dump2)
```

Memory dumps include:

- GPU/CPU memory usage
- Loaded models and reference counts
- Active predictions and queue state
- Service pool statistics
- Recovery manager state
- Python GC statistics
- Large Python objects (optional)