Change: Enhanced Memory Management for OCR Services
Why
The current OCR service architecture suffers from critical memory management issues that lead to GPU memory exhaustion, service instability, and degraded performance under load:
- Memory Leaks: PP-StructureV3 models are permanently exempted from unloading (lines 255-267), causing VRAM to remain occupied indefinitely.
- Instance Proliferation: Each task creates a new OCRService instance (tasks.py lines 44-65), leading to duplicate model loading and memory fragmentation.
- Inadequate Memory Monitoring: check_gpu_memory() always returns True in Paddle-only environments, providing no actual memory protection.
- Uncontrolled Concurrency: No limits on simultaneous PP-StructureV3 predictions, causing memory spikes.
- No Resource Cleanup: Tasks complete without releasing GPU memory, leading to accumulated memory usage.
These issues cause service crashes, require frequent restarts, and prevent scaling to handle multiple concurrent requests.
What Changes
1. Model Lifecycle Management
- NEW: ModelManager class to handle model loading/unloading with reference counting (sketched below)
- NEW: Idle timeout mechanism for PP-StructureV3 (same as language models)
- NEW: Explicit teardown() method for end-of-flow cleanup
- MODIFIED: OCRService to use managed model instances
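A minimal sketch of what ModelManager could look like. The class name comes from this proposal; the loader callable, method names, and default timeout are illustrative assumptions, not the final API.

```python
import threading
import time

import paddle


class ModelManager:
    """Sketch: reference-counted model holder with idle-timeout unloading."""

    def __init__(self, loader, idle_timeout: float = 300.0):
        self._loader = loader            # callable that loads and returns the model
        self._idle_timeout = idle_timeout
        self._model = None
        self._refs = 0
        self._lock = threading.Lock()
        self.last_used = time.monotonic()

    def acquire(self):
        """Load the model on first use and bump the reference count."""
        with self._lock:
            if self._model is None:
                self._model = self._loader()
            self._refs += 1
            self.last_used = time.monotonic()
            return self._model

    def release(self):
        """Drop one reference; the model stays cached until the idle timeout expires."""
        with self._lock:
            self._refs = max(0, self._refs - 1)
            self.last_used = time.monotonic()

    def maybe_unload(self):
        """Called periodically: unload only when unreferenced and idle past the timeout."""
        with self._lock:
            idle = time.monotonic() - self.last_used
            if self._model is not None and self._refs == 0 and idle >= self._idle_timeout:
                self._drop()

    def teardown(self):
        """Explicit end-of-flow cleanup, regardless of reference count or idle time."""
        with self._lock:
            self._drop()

    def _drop(self):
        self._model = None
        if paddle.device.is_compiled_with_cuda():
            paddle.device.cuda.empty_cache()  # return cached blocks to the driver
```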
2. Service Singleton Pattern
- NEW: OCRServicePool to manage OCRService instances (one per GPU/device), sketched below
- NEW: Queue-based task distribution with concurrency limits
- MODIFIED: Task router to use pooled services instead of creating new instances
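One way the pool could look, assuming one OCRService per device and treating OCRService as an opaque class with an assumed constructor signature. For brevity a per-device semaphore stands in for the queue-based distribution; asyncio.to_thread requires Python 3.9+.

```python
import asyncio


class OCRServicePool:
    """Sketch: one OCRService per device, handed out under a concurrency limit."""

    def __init__(self, service_factory, devices=("gpu:0",), max_concurrent_per_device: int = 2):
        self._services = {dev: service_factory(dev) for dev in devices}
        self._limits = {dev: asyncio.Semaphore(max_concurrent_per_device) for dev in devices}

    async def run(self, device, func, *args, **kwargs):
        """Run `func(service, ...)` on the pooled service for `device`."""
        async with self._limits[device]:
            service = self._services[device]
            # OCR inference is blocking, so keep it off the event loop.
            return await asyncio.to_thread(func, service, *args, **kwargs)
```

A task handler would then do something like `result = await pool.run("gpu:0", lambda svc: svc.process_document(path))` instead of constructing a fresh OCRService per request (the `process_document` name is a placeholder).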
3. Enhanced Memory Monitoring
- NEW: MemoryGuard class using the paddle.device.cuda memory APIs (sketched below)
- NEW: Support for pynvml/torch as fallback memory query methods
- NEW: Memory threshold configuration (warning/critical levels)
- MODIFIED: Processing logic to degrade gracefully when memory is low
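A sketch of the MemoryGuard idea, assuming the paddle.device.cuda memory APIs are available and falling back to pynvml when they are not; the threshold names and defaults are illustrative.

```python
import paddle


class MemoryGuard:
    """Sketch: report GPU memory pressure via Paddle first, pynvml as fallback."""

    def __init__(self, warning_ratio: float = 0.75, critical_ratio: float = 0.90, device_id: int = 0):
        self.warning_ratio = warning_ratio
        self.critical_ratio = critical_ratio
        self.device_id = device_id

    def used_ratio(self) -> float:
        try:
            # Memory Paddle has reserved on the device, relative to its capacity.
            reserved = paddle.device.cuda.memory_reserved(self.device_id)
            total = paddle.device.cuda.get_device_properties(self.device_id).total_memory
            return reserved / total
        except Exception:
            # Fallback: ask the driver directly through NVML.
            import pynvml
            pynvml.nvmlInit()
            handle = pynvml.nvmlDeviceGetHandleByIndex(self.device_id)
            info = pynvml.nvmlDeviceGetMemoryInfo(handle)
            return info.used / info.total

    def level(self) -> str:
        ratio = self.used_ratio()
        if ratio >= self.critical_ratio:
            return "critical"
        if ratio >= self.warning_ratio:
            return "warning"
        return "ok"
```

The processing path could then skip optional chart/formula/table analysis at "warning" and refuse or queue new PP-StructureV3 work at "critical", which is the graceful degradation listed above.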
4. Concurrency Control
- NEW: Semaphore-based limits for PP-StructureV3 predictions
- NEW: Configuration to disable/delay chart/formula/table analysis
- NEW: Batch processing mode for large documents
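The PP-StructureV3 limit could be as small as a module-level bounded semaphore. This sketch assumes a single slot and an illustrative timeout, and leaves the chart/formula/table toggles to the pipeline's own construction options.

```python
import threading
from contextlib import contextmanager

# Module-level cap on simultaneous PP-StructureV3 predictions (value is illustrative).
_STRUCTURE_SLOTS = threading.BoundedSemaphore(value=1)


@contextmanager
def structure_prediction_slot(timeout: float = 120.0):
    """Block until a prediction slot is free, or fail fast after `timeout` seconds."""
    if not _STRUCTURE_SLOTS.acquire(timeout=timeout):
        raise RuntimeError("PP-StructureV3 is saturated; retry later or queue the task")
    try:
        yield
    finally:
        _STRUCTURE_SLOTS.release()


def predict_structure(pipeline, image):
    """Run one structure prediction under the semaphore.

    Chart/formula/table analysis would be disabled or delayed via the pipeline's
    own construction options rather than here; batch mode would simply loop over
    page images inside a single acquired slot.
    """
    with structure_prediction_slot():
        return pipeline.predict(image)
```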
5. Active Memory Management
- NEW: Background memory monitor thread with metrics collection
- NEW: Automatic cache clearing when thresholds exceeded
- NEW: Model unloading based on LRU policy
- NEW: Worker process restart capability when memory cannot be recovered
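Building on the MemoryGuard and ModelManager sketches above, the background monitor could be a daemon thread along these lines; the interval, log wording, and restart hand-off are all illustrative.

```python
import logging
import threading
import time

import paddle

logger = logging.getLogger(__name__)


def start_memory_monitor(guard, model_managers, interval: float = 30.0):
    """Sketch: daemon thread that reacts to memory pressure between tasks."""

    def _loop():
        while True:
            level = guard.level()
            if level != "ok":
                logger.warning("GPU memory %s (used ratio %.2f)", level, guard.used_ratio())
                if paddle.device.is_compiled_with_cuda():
                    paddle.device.cuda.empty_cache()  # clear cached blocks first
            if level == "critical":
                # Unload idle models, least recently used first (LRU policy).
                for manager in sorted(model_managers, key=lambda m: m.last_used):
                    manager.maybe_unload()
                # If memory still cannot be recovered after this, the proposal's
                # last resort is restarting the worker process via its supervisor.
            time.sleep(interval)

    thread = threading.Thread(target=_loop, name="memory-monitor", daemon=True)
    thread.start()
    return thread
```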
6. Cleanup Hooks
- NEW: Global shutdown handlers for graceful cleanup
- NEW: Task completion callbacks to release resources
- MODIFIED: Background task wrapper to ensure cleanup on success/failure
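Cleanup could be as simple as a try/finally in the background-task wrapper plus a process-level shutdown hook. The function names below are placeholders; a FastAPI-style app would register the same teardown on its shutdown/lifespan event rather than atexit.

```python
import atexit

import paddle


def run_ocr_task(task_fn, *args, **kwargs):
    """Sketch: background-task wrapper that releases GPU memory on success or failure."""
    try:
        return task_fn(*args, **kwargs)
    finally:
        # Completion callback: give cached memory back regardless of outcome.
        if paddle.device.is_compiled_with_cuda():
            paddle.device.cuda.empty_cache()


def install_shutdown_hooks(model_managers):
    """Sketch: global shutdown handler that tears down every managed model."""

    def _shutdown():
        for manager in model_managers:
            manager.teardown()

    atexit.register(_shutdown)
```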
Impact
Affected specs:
- ocr-processing - Model management and processing flow
- task-management - Task execution and resource management
Affected code:
- backend/app/services/ocr_service.py - Major refactoring for memory management
- backend/app/routers/tasks.py - Use service pool instead of new instances
- backend/app/core/config.py - New memory management settings
- backend/app/services/memory_manager.py - NEW file
- backend/app/services/service_pool.py - NEW file
Breaking changes: None. All changes are internal optimizations.
Migration: Existing deployments benefit immediately with no configuration changes required. Optional tuning parameters are available for further optimization, for example the settings sketched below.
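The optional knobs might look like the following; the names, defaults, and environment variables are illustrative, and config.py's existing settings mechanism would be their real home.

```python
import os
from dataclasses import dataclass


@dataclass
class MemorySettings:
    """Sketch of the optional tuning parameters (names and defaults are illustrative)."""

    model_idle_timeout: float = float(os.getenv("OCR_MODEL_IDLE_TIMEOUT", "300"))
    max_concurrent_structure: int = int(os.getenv("OCR_MAX_CONCURRENT_STRUCTURE", "1"))
    gpu_memory_warning_ratio: float = float(os.getenv("OCR_GPU_WARNING_RATIO", "0.75"))
    gpu_memory_critical_ratio: float = float(os.getenv("OCR_GPU_CRITICAL_RATIO", "0.90"))
    memory_monitor_interval: float = float(os.getenv("OCR_MEMORY_MONITOR_INTERVAL", "30"))
```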
Testing Requirements
- Memory leak tests - Verify models are properly unloaded
- Concurrency tests - Validate semaphore limits work correctly
- Stress tests - Ensure system degrades gracefully under memory pressure
- Integration tests - Verify pooled services work correctly
- Performance benchmarks - Measure memory usage improvements
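As one concrete example for the concurrency tests, assuming the structure_prediction_slot() helper sketched earlier lands in the new backend/app/services/memory_manager.py module:

```python
import threading
import time

from backend.app.services.memory_manager import structure_prediction_slot  # assumed location


def test_structure_predictions_do_not_overlap():
    """With a single slot, concurrent fake predictions must run one at a time."""
    active = 0
    peak = 0
    lock = threading.Lock()

    def fake_prediction():
        nonlocal active, peak
        with structure_prediction_slot(timeout=5.0):
            with lock:
                active += 1
                peak = max(peak, active)
            time.sleep(0.05)  # simulate inference work
            with lock:
                active -= 1

    threads = [threading.Thread(target=fake_prediction) for _ in range(4)]
    for t in threads:
        t.start()
    for t in threads:
        t.join()

    assert peak == 1  # never more than one simultaneous PP-StructureV3 prediction
```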