feat: create OpenSpec proposal for enhanced memory management
- Create comprehensive proposal addressing OOM crashes and memory leaks
- Define 6 core areas: model lifecycle, service pooling, monitoring
- Add 58 implementation tasks across 8 sections
- Design ModelManager with reference counting and idle timeout
- Plan OCRServicePool for singleton service pattern
- Specify MemoryGuard for proactive memory monitoring
- Include concurrency controls and cleanup hooks
- Add spec deltas for ocr-processing and task-management
- Create detailed design document with architecture diagrams
- Define performance targets: 75% memory reduction, 4x concurrency

Critical improvements:
- Remove PP-StructureV3 permanent exemption from unloading
- Replace per-task OCRService instantiation with pooling
- Add real GPU memory monitoring (currently always returns True)
- Implement semaphore-based concurrency limits
- Add proper resource cleanup on task completion

🤖 Generated with [Claude Code](https://claude.com/claude-code)

Co-Authored-By: Claude <noreply@anthropic.com>
openspec/changes/enhance-memory-management/proposal.md (new file, 77 lines)
# Change: Enhanced Memory Management for OCR Services

## Why

The current OCR service architecture suffers from critical memory management issues that lead to GPU memory exhaustion, service instability, and degraded performance under load:

1. **Memory Leaks**: PP-StructureV3 models are permanently exempted from unloading (lines 255-267), causing VRAM to remain occupied indefinitely.

2. **Instance Proliferation**: Each task creates a new OCRService instance (tasks.py lines 44-65), leading to duplicate model loading and memory fragmentation.

3. **Inadequate Memory Monitoring**: `check_gpu_memory()` always returns True in Paddle-only environments, providing no actual memory protection.

4. **Uncontrolled Concurrency**: No limits on simultaneous PP-StructureV3 predictions, causing memory spikes.

5. **No Resource Cleanup**: Tasks complete without releasing GPU memory, leading to accumulated memory usage.

These issues cause service crashes, require frequent restarts, and prevent scaling to handle multiple concurrent requests.

## What Changes

### 1. Model Lifecycle Management

- **NEW**: `ModelManager` class to handle model loading/unloading with reference counting
- **NEW**: Idle timeout mechanism for PP-StructureV3 (same as language models)
- **NEW**: Explicit `teardown()` method for end-of-flow cleanup
- **MODIFIED**: OCRService to use managed model instances

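A minimal sketch of what a reference-counted `ModelManager` with an idle timeout could look like; the `acquire`/`release`/`unload_idle` names and the loader-callable design are illustrative assumptions, not the final API:

```python
import threading
import time


class ModelManager:
    """Loads models on demand, tracks reference counts, and unloads idle ones."""

    def __init__(self, loader, idle_timeout: float = 300.0):
        self._loader = loader              # callable: model name -> loaded model
        self._idle_timeout = idle_timeout  # seconds a model may sit unused
        self._lock = threading.Lock()
        self._models = {}                  # name -> (model, refcount, last_used)

    def acquire(self, name: str):
        """Return the named model, loading it if necessary, and bump its refcount."""
        with self._lock:
            model, refs, _ = self._models.get(name, (None, 0, 0.0))
            if model is None:
                model = self._loader(name)
            self._models[name] = (model, refs + 1, time.monotonic())
            return model

    def release(self, name: str) -> None:
        """Mark one user of the model as finished."""
        with self._lock:
            model, refs, _ = self._models[name]
            self._models[name] = (model, max(refs - 1, 0), time.monotonic())

    def unload_idle(self) -> None:
        """Drop models with no active users that exceeded the idle timeout."""
        now = time.monotonic()
        with self._lock:
            for name, (model, refs, last_used) in list(self._models.items()):
                if refs == 0 and now - last_used > self._idle_timeout:
                    del self._models[name]  # drop our reference so VRAM can be reclaimed

    def teardown(self) -> None:
        """Explicit end-of-flow cleanup: unload everything regardless of refcounts."""
        with self._lock:
            self._models.clear()
```

Reference counting keeps a model resident while any task is using it; once the count drops to zero and the idle timeout elapses, the manager releases its reference so the backend can reclaim VRAM.
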
### 2. Service Singleton Pattern

- **NEW**: `OCRServicePool` to manage OCRService instances (one per GPU/device)
- **NEW**: Queue-based task distribution with concurrency limits
- **MODIFIED**: Task router to use pooled services instead of creating new instances

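A sketch of how the pool could lend out long-lived per-device services; the queue-based layout and names are assumptions meant to illustrate the pattern, not the final interface:

```python
import asyncio


class OCRServicePool:
    """Keeps one long-lived OCRService per device and lends them out via a queue."""

    def __init__(self, service_factory, devices):
        self._queue: asyncio.Queue = asyncio.Queue()
        for device in devices:
            # One service per GPU/device instead of one per task.
            self._queue.put_nowait(service_factory(device))

    async def run(self, job):
        """Borrow a service, run the async job with it, and always return it to the pool."""
        service = await self._queue.get()
        try:
            return await job(service)
        finally:
            self._queue.put_nowait(service)
```

The task router would borrow an instance through the pool (e.g. `await pool.run(...)`) instead of constructing a fresh OCRService for every request.
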
### 3. Enhanced Memory Monitoring

- **NEW**: `MemoryGuard` class using `paddle.device.cuda` memory APIs
- **NEW**: Support for pynvml/torch as fallback memory query methods
- **NEW**: Memory threshold configuration (warning/critical levels)
- **MODIFIED**: Processing logic to degrade gracefully when memory is low

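One possible shape for `MemoryGuard`, assuming a recent Paddle release that exposes `paddle.device.cuda.memory_reserved()` and pynvml installed as a fallback; the threshold values, class layout, and method names are illustrative:

```python
class MemoryGuard:
    """Classifies GPU memory pressure using Paddle first, pynvml as a fallback."""

    def __init__(self, total_bytes: int, warning: float = 0.75, critical: float = 0.90):
        self.total_bytes = total_bytes  # total VRAM of the device being guarded
        self.warning = warning          # usage fraction that triggers degradation
        self.critical = critical        # usage fraction that blocks new heavy work

    def used_bytes(self, device_id: int = 0) -> int:
        try:
            import paddle
            # Memory Paddle has reserved from the CUDA driver for this process.
            return paddle.device.cuda.memory_reserved(device_id)
        except Exception:
            import pynvml
            pynvml.nvmlInit()
            handle = pynvml.nvmlDeviceGetHandleByIndex(device_id)
            return pynvml.nvmlDeviceGetMemoryInfo(handle).used

    def level(self, device_id: int = 0) -> str:
        usage = self.used_bytes(device_id) / self.total_bytes
        if usage >= self.critical:
            return "critical"
        if usage >= self.warning:
            return "warning"
        return "ok"
```

Unlike the current `check_gpu_memory()`, this returns a graded answer, so processing logic can degrade (skip chart/formula analysis, queue the task) instead of only allowing or denying work.
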
### 4. Concurrency Control

- **NEW**: Semaphore-based limits for PP-StructureV3 predictions
- **NEW**: Configuration to disable/delay chart/formula/table analysis
- **NEW**: Batch processing mode for large documents

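A sketch of the semaphore idea for capping concurrent PP-StructureV3 predictions; the limit value, function name, and `service.predict` call are assumptions:

```python
import asyncio

# At most N structure predictions run at once; the value is an illustrative default.
STRUCTURE_SEMAPHORE = asyncio.Semaphore(2)


async def predict_structure(service, page_image):
    """Run a PP-StructureV3 prediction without exceeding the concurrency cap."""
    async with STRUCTURE_SEMAPHORE:
        # The heavy inference runs in a worker thread so the event loop stays
        # responsive while the semaphore still bounds concurrent predictions.
        return await asyncio.to_thread(service.predict, page_image)
```
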
### 5. Active Memory Management

- **NEW**: Background memory monitor thread with metrics collection
- **NEW**: Automatic cache clearing when thresholds are exceeded
- **NEW**: Model unloading based on an LRU policy
- **NEW**: Worker process restart capability when memory cannot be recovered

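One way the background monitor could be wired up, reusing the hypothetical `MemoryGuard` and `ModelManager` sketched above; the polling interval and escalation order are assumptions:

```python
import logging
import threading
import time

logger = logging.getLogger(__name__)


def start_memory_monitor(guard, model_manager, interval: float = 10.0) -> threading.Thread:
    """Poll GPU usage, record it as a metric, and free memory when thresholds are hit."""

    def _loop() -> None:
        while True:
            level = guard.level()
            logger.info("gpu_memory_level=%s", level)  # metrics hook
            if level == "warning":
                model_manager.unload_idle()   # drop idle/LRU models first
            elif level == "critical":
                model_manager.teardown()      # last resort before a worker restart
            time.sleep(interval)

    thread = threading.Thread(target=_loop, daemon=True, name="memory-monitor")
    thread.start()
    return thread
```
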
### 6. Cleanup Hooks

- **NEW**: Global shutdown handlers for graceful cleanup
- **NEW**: Task completion callbacks to release resources
- **MODIFIED**: Background task wrapper to ensure cleanup on success/failure

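The background task wrapper can be little more than a try/finally around the existing task body; the function and callback names below are illustrative:

```python
import functools


def with_resource_cleanup(task_fn, cleanup_fn):
    """Wrap a background task so resources are released on success or failure."""

    @functools.wraps(task_fn)
    async def wrapper(*args, **kwargs):
        try:
            return await task_fn(*args, **kwargs)
        finally:
            # Runs even when the task raised: return the pooled service,
            # release model references, clear GPU caches.
            cleanup_fn()

    return wrapper
```
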
## Impact

**Affected specs**:

- `ocr-processing` - Model management and processing flow
- `task-management` - Task execution and resource management

**Affected code**:

- `backend/app/services/ocr_service.py` - Major refactoring for memory management
- `backend/app/routers/tasks.py` - Use service pool instead of new instances
- `backend/app/core/config.py` - New memory management settings
- `backend/app/services/memory_manager.py` - NEW file
- `backend/app/services/service_pool.py` - NEW file

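An illustrative shape for the new settings in `backend/app/core/config.py`, assuming a Pydantic-style settings class as is common in FastAPI projects; the field names and defaults are placeholders, not decided values:

```python
from pydantic import BaseSettings  # pydantic_settings.BaseSettings under Pydantic v2


class MemorySettings(BaseSettings):
    """Optional tuning knobs; field names and defaults are placeholders."""

    model_idle_timeout_seconds: float = 300.0   # unload models idle longer than this
    structure_max_concurrency: int = 2          # simultaneous PP-StructureV3 predictions
    gpu_memory_warning_fraction: float = 0.75   # start degrading above this usage
    gpu_memory_critical_fraction: float = 0.90  # refuse heavy work above this usage
```
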
**Breaking changes**: None; all changes are internal optimizations.

**Migration**: Existing deployments will benefit immediately with no configuration changes required. Optional tuning parameters are available for optimization.

## Testing Requirements

1. **Memory leak tests** - Verify models are properly unloaded
2. **Concurrency tests** - Validate semaphore limits work correctly
3. **Stress tests** - Ensure system degrades gracefully under memory pressure
4. **Integration tests** - Verify pooled services work correctly
5. **Performance benchmarks** - Measure memory usage improvements

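A hedged sketch of one memory-leak test, exercising the hypothetical `ModelManager` from the sketch above with pytest; the import path, timings, and access to the internal model table are assumptions:

```python
import time

# Hypothetical path for the proposed new module backend/app/services/memory_manager.py.
from app.services.memory_manager import ModelManager


def test_idle_models_are_unloaded():
    """Acquire and release a model, then confirm the idle sweep drops it."""
    manager = ModelManager(loader=lambda name: object(), idle_timeout=0.1)

    manager.acquire("pp_structure_v3")
    manager.release("pp_structure_v3")

    time.sleep(0.2)        # let the idle timeout elapse
    manager.unload_idle()

    assert "pp_structure_v3" not in manager._models
```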