# Change: Enhanced Memory Management for OCR Services

## Why

The current OCR service architecture suffers from critical memory management issues that lead to GPU memory exhaustion, service instability, and degraded performance under load:

1. **Memory Leaks**: PP-StructureV3 models are permanently exempted from unloading (lines 255-267), causing VRAM to remain occupied indefinitely.
2. **Instance Proliferation**: Each task creates a new OCRService instance (`tasks.py`, lines 44-65), leading to duplicate model loading and memory fragmentation.
3. **Inadequate Memory Monitoring**: `check_gpu_memory()` always returns `True` in Paddle-only environments, providing no actual memory protection.
4. **Uncontrolled Concurrency**: There are no limits on simultaneous PP-StructureV3 predictions, causing memory spikes.
5. **No Resource Cleanup**: Tasks complete without releasing GPU memory, leading to accumulated memory usage.

These issues cause service crashes, require frequent restarts, and prevent the service from scaling to handle multiple concurrent requests.

## What Changes

### 1. Model Lifecycle Management

- **NEW**: `ModelManager` class to handle model loading/unloading with reference counting
- **NEW**: Idle timeout mechanism for PP-StructureV3 (same as language models)
- **NEW**: Explicit `teardown()` method for end-of-flow cleanup
- **MODIFIED**: OCRService to use managed model instances

### 2. Service Singleton Pattern

- **NEW**: `OCRServicePool` to manage OCRService instances (one per GPU/device)
- **NEW**: Queue-based task distribution with concurrency limits
- **MODIFIED**: Task router to use pooled services instead of creating new instances

### 3. Enhanced Memory Monitoring

- **NEW**: `MemoryGuard` class using `paddle.device.cuda` memory APIs
- **NEW**: Support for pynvml/torch as fallback memory query methods
- **NEW**: Memory threshold configuration (warning/critical levels)
- **MODIFIED**: Processing logic to degrade gracefully when memory is low

### 4. Concurrency Control

- **NEW**: Semaphore-based limits for PP-StructureV3 predictions
- **NEW**: Configuration to disable/delay chart/formula/table analysis
- **NEW**: Batch processing mode for large documents

### 5. Active Memory Management

- **NEW**: Background memory monitor thread with metrics collection
- **NEW**: Automatic cache clearing when thresholds are exceeded
- **NEW**: Model unloading based on an LRU policy
- **NEW**: Worker process restart capability when memory cannot be recovered

### 6. Cleanup Hooks

- **NEW**: Global shutdown handlers for graceful cleanup
- **NEW**: Task completion callbacks to release resources
- **MODIFIED**: Background task wrapper to ensure cleanup on success/failure

## Impact

**Affected specs**:

- `ocr-processing` - Model management and processing flow
- `task-management` - Task execution and resource management

**Affected code**:

- `backend/app/services/ocr_service.py` - Major refactoring for memory management
- `backend/app/routers/tasks.py` - Use service pool instead of new instances
- `backend/app/core/config.py` - New memory management settings
- `backend/app/services/memory_manager.py` - NEW file
- `backend/app/services/service_pool.py` - NEW file

**Breaking changes**: None. All changes are internal optimizations.

**Migration**: Existing deployments benefit immediately with no configuration changes required. Optional tuning parameters are available for further optimization.

## Testing Requirements

1. **Memory leak tests** - Verify models are properly unloaded
2. **Concurrency tests** - Validate that semaphore limits work correctly
3. **Stress tests** - Ensure the system degrades gracefully under memory pressure
4. **Integration tests** - Verify pooled services work correctly
5. **Performance benchmarks** - Measure memory usage improvements
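The `ModelManager` from section 1 could be sketched as below. This is a minimal illustration only, not the final interface: the method names (`register`, `acquire`, `release`, `evict_idle`) and the loader-callback pattern are assumptions made for the sketch.

```python
import threading
import time
from typing import Any, Callable, Dict, List


class ModelManager:
    """Loads models on demand, tracks reference counts, and unloads
    models that have sat idle longer than `idle_timeout` seconds."""

    def __init__(self, idle_timeout: float = 300.0):
        self._idle_timeout = idle_timeout
        self._lock = threading.Lock()
        self._models: Dict[str, Any] = {}
        self._refcounts: Dict[str, int] = {}
        self._last_used: Dict[str, float] = {}
        self._loaders: Dict[str, Callable[[], Any]] = {}

    def register(self, name: str, loader: Callable[[], Any]) -> None:
        """Associate a model name with a lazy loader callback."""
        self._loaders[name] = loader

    def acquire(self, name: str) -> Any:
        """Return the model, loading it on first use, and bump its refcount."""
        with self._lock:
            if name not in self._models:
                self._models[name] = self._loaders[name]()
            self._refcounts[name] = self._refcounts.get(name, 0) + 1
            self._last_used[name] = time.monotonic()
            return self._models[name]

    def release(self, name: str) -> None:
        """Drop one reference; the model becomes eligible for eviction at zero."""
        with self._lock:
            self._refcounts[name] = max(0, self._refcounts.get(name, 0) - 1)
            self._last_used[name] = time.monotonic()

    def evict_idle(self) -> List[str]:
        """Unload unreferenced models that exceeded the idle timeout."""
        evicted: List[str] = []
        now = time.monotonic()
        with self._lock:
            for name in list(self._models):
                idle = now - self._last_used.get(name, now)
                if self._refcounts.get(name, 0) == 0 and idle >= self._idle_timeout:
                    del self._models[name]
                    evicted.append(name)
        return evicted

    def teardown(self) -> None:
        """Explicit end-of-flow cleanup: drop everything regardless of idle time."""
        with self._lock:
            self._models.clear()
            self._refcounts.clear()
```

A real implementation would also clear Paddle's GPU caches after eviction; here the sketch only removes the Python references so the objects can be garbage-collected.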
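The service pool from section 2 amounts to one shared service per device plus a per-device concurrency cap. A possible shape, with the `factory` callback and method names invented for this sketch:

```python
import threading
from typing import Any, Callable, Dict


class OCRServicePool:
    """Holds at most one service instance per device, so concurrent tasks
    share loaded models instead of each task loading its own copy."""

    def __init__(self, factory: Callable[[str], Any], max_concurrent_per_device: int = 2):
        self._factory = factory
        self._lock = threading.Lock()
        self._services: Dict[str, Any] = {}
        self._semaphores: Dict[str, threading.BoundedSemaphore] = {}
        self._max = max_concurrent_per_device

    def get(self, device: str = "gpu:0") -> Any:
        """Return the shared service for `device`, creating it on first use."""
        with self._lock:
            if device not in self._services:
                self._services[device] = self._factory(device)
                self._semaphores[device] = threading.BoundedSemaphore(self._max)
            return self._services[device]

    def run(self, device: str, fn: Callable[[Any], Any]) -> Any:
        """Run `fn(service)` under the device's concurrency limit."""
        service = self.get(device)
        with self._semaphores[device]:
            return fn(service)

    def shutdown(self) -> None:
        """Tear down every pooled service (e.g. from a global shutdown handler)."""
        with self._lock:
            for svc in self._services.values():
                teardown = getattr(svc, "teardown", None)
                if callable(teardown):
                    teardown()
            self._services.clear()
```

The task router would then call `pool.run(device, lambda svc: svc.process(task))` instead of constructing a fresh `OCRService` per task.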
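For the `MemoryGuard` of section 3, making the memory probe injectable keeps the thresholds testable without a GPU. The probe below is a best-effort sketch: the Paddle and pynvml calls are the ones I believe these libraries expose, but both imports are guarded, and the exact calls should be verified against the installed versions.

```python
from typing import Callable, Tuple


def default_gpu_probe() -> Tuple[int, int]:
    """Best-effort (used_bytes, total_bytes) for GPU 0.
    Tries Paddle first, then pynvml; both are optional dependencies."""
    try:
        import paddle  # assumption: paddle built with CUDA support
        used = paddle.device.cuda.memory_reserved()
        total = paddle.device.cuda.get_device_properties().total_memory
        return used, total
    except Exception:
        pass
    try:
        import pynvml  # fallback: NVIDIA management library
        pynvml.nvmlInit()
        handle = pynvml.nvmlDeviceGetHandleByIndex(0)
        info = pynvml.nvmlDeviceGetMemoryInfo(handle)
        return info.used, info.total
    except Exception:
        return 0, 1  # no GPU info available: report empty rather than crash


class MemoryGuard:
    """Classifies current GPU memory pressure against configured thresholds,
    so processing logic can degrade gracefully instead of crashing."""

    def __init__(self, probe: Callable[[], Tuple[int, int]] = default_gpu_probe,
                 warning: float = 0.75, critical: float = 0.90):
        assert 0.0 < warning < critical <= 1.0
        self._probe = probe
        self._warning = warning
        self._critical = critical

    def usage(self) -> float:
        used, total = self._probe()
        return used / total if total else 0.0

    def level(self) -> str:
        """Return 'ok', 'warning', or 'critical'."""
        u = self.usage()
        if u >= self._critical:
            return "critical"
        if u >= self._warning:
            return "warning"
        return "ok"
```

Callers could, for example, skip chart/formula/table analysis at `"warning"` and refuse new predictions at `"critical"` — this fixes the current behavior where `check_gpu_memory()` always returns `True`.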
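The semaphore-based limit of section 4 is small enough to show in full. This sketch wraps `threading.BoundedSemaphore` in a context manager; the `peak` counter is an extra added here for metrics and tests, not part of the described design.

```python
import threading


class PredictionLimiter:
    """Context manager that caps the number of simultaneous predictions,
    preventing the memory spikes caused by unbounded concurrency."""

    def __init__(self, max_concurrent: int = 2):
        self._sem = threading.BoundedSemaphore(max_concurrent)
        self._lock = threading.Lock()
        self.current = 0   # predictions running right now
        self.peak = 0      # highest concurrency observed (for metrics/tests)

    def __enter__(self):
        self._sem.acquire()  # blocks when max_concurrent slots are taken
        with self._lock:
            self.current += 1
            self.peak = max(self.peak, self.current)
        return self

    def __exit__(self, exc_type, exc, tb):
        with self._lock:
            self.current -= 1
        self._sem.release()
        return False  # never swallow exceptions from the prediction
```

Usage would look like `with limiter: model.predict(image)`, so excess requests queue instead of piling onto the GPU at once.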
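The background-task wrapper from section 6 ("ensure cleanup on success/failure") is essentially a `try/finally` around the task. A minimal decorator-based sketch, with the name `with_cleanup` chosen only for illustration:

```python
import functools
from typing import Callable


def with_cleanup(cleanup: Callable[[], None]):
    """Wrap a background task so `cleanup` always runs, whether the task
    returns normally or raises — e.g. releasing models and GPU caches."""
    def decorate(task):
        @functools.wraps(task)
        def wrapper(*args, **kwargs):
            try:
                return task(*args, **kwargs)
            finally:
                cleanup()  # runs on success AND on exception
        return wrapper
    return decorate
```

Registering the same cleanup callback as a global shutdown handler (e.g. via `atexit` or the web framework's shutdown event) would cover the graceful-shutdown case as well.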