feat: implement hybrid image extraction and memory management

Backend:
- Add hybrid image extraction for Direct track (inline image blocks)
- Add render_inline_image_regions() fallback when OCR doesn't find images
- Add check_document_for_missing_images() for detecting missing images
- Add memory management system (MemoryGuard, ModelManager, ServicePool)
- Update pdf_generator_service to handle HYBRID processing track
- Add ElementType.LOGO for logo extraction

Frontend:
- Fix PDF viewer re-rendering issues with memoization
- Add TaskNotFound component and useTaskValidation hook
- Disable StrictMode due to react-pdf incompatibility
- Fix task detail and results page loading states

🤖 Generated with [Claude Code](https://claude.com/claude-code)

Co-Authored-By: Claude <noreply@anthropic.com>
commit 1afdb822c3 (parent ba8ddf2b68)
Author: egg
Date: 2025-11-26 10:56:22 +08:00

26 changed files with 8273 additions and 366 deletions

@@ -415,4 +415,173 @@ async def test_concurrent_load():
### Phase 4: Hardening (Week 4)
- Stress testing
- Performance tuning
- Documentation and monitoring
## Configuration Settings Reference
All memory management settings are defined in `backend/app/core/config.py` under the `Settings` class.
### Memory Thresholds
| Setting | Type | Default | Description |
|---------|------|---------|-------------|
| `memory_warning_threshold` | float | 0.80 | GPU memory usage ratio (0-1) to trigger warning alerts |
| `memory_critical_threshold` | float | 0.95 | GPU memory ratio to start throttling operations |
| `memory_emergency_threshold` | float | 0.98 | GPU memory ratio to trigger emergency cleanup |
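The three thresholds form escalating tiers. A minimal sketch of how a guard component might map a usage ratio to an action tier (the function and constant names are illustrative, not the actual `MemoryGuard` API):

```python
# Illustrative sketch only: maps a GPU usage ratio to the escalation tier
# implied by the three thresholds above. Names are invented for the example.
WARNING, CRITICAL, EMERGENCY = 0.80, 0.95, 0.98

def pressure_level(used_bytes: int, total_bytes: int) -> str:
    """Classify GPU memory pressure into ok/warning/critical/emergency."""
    ratio = used_bytes / total_bytes
    if ratio >= EMERGENCY:
        return "emergency"   # trigger emergency cleanup
    if ratio >= CRITICAL:
        return "critical"    # throttle new operations
    if ratio >= WARNING:
        return "warning"     # emit alert only
    return "ok"
```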
### Memory Monitoring
| Setting | Type | Default | Description |
|---------|------|---------|-------------|
| `memory_check_interval_seconds` | int | 30 | Background check interval for memory monitoring |
| `enable_memory_alerts` | bool | True | Enable/disable memory threshold alerts |
| `gpu_memory_limit_mb` | int | 6144 | Maximum GPU memory to use (MB) |
| `gpu_memory_reserve_mb` | int | 512 | Memory reserved for CUDA overhead |
### Model Lifecycle Management
| Setting | Type | Default | Description |
|---------|------|---------|-------------|
| `enable_model_lifecycle_management` | bool | True | Use ModelManager for model lifecycle |
| `model_idle_timeout_seconds` | int | 300 | Unload models after idle time |
| `pp_structure_idle_timeout_seconds` | int | 300 | Unload PP-StructureV3 after idle |
| `structure_model_memory_mb` | int | 2000 | Estimated memory for PP-StructureV3 |
| `ocr_model_memory_mb` | int | 500 | Estimated memory per OCR language model |
| `enable_lazy_model_loading` | bool | True | Load models on demand |
| `auto_unload_unused_models` | bool | True | Auto-unload unused language models |
### Service Pool Configuration
| Setting | Type | Default | Description |
|---------|------|---------|-------------|
| `enable_service_pool` | bool | True | Use OCRServicePool |
| `max_services_per_device` | int | 1 | Max OCRService instances per GPU |
| `max_total_services` | int | 2 | Max total OCRService instances |
| `service_acquire_timeout_seconds` | float | 300.0 | Timeout for acquiring service from pool |
| `max_queue_size` | int | 50 | Max pending tasks per device queue |
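The pool semantics these settings describe (a bounded set of instances, blocking acquire with a timeout) can be sketched with a simple queue. The real `OCRServicePool` is async and per-device; `TinyPool` below is an invented stand-in:

```python
import queue

# Minimal sketch of pool acquisition semantics; not the actual OCRServicePool.
class TinyPool:
    def __init__(self, services, acquire_timeout: float = 300.0):
        self._q = queue.Queue()
        self._timeout = acquire_timeout
        for s in services:
            self._q.put(s)

    def acquire(self):
        """Block until a service is free, or raise after the timeout."""
        try:
            return self._q.get(timeout=self._timeout)
        except queue.Empty:
            raise TimeoutError("no OCR service available within timeout")

    def release(self, service):
        self._q.put(service)
```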
### Concurrency Control
| Setting | Type | Default | Description |
|---------|------|---------|-------------|
| `max_concurrent_predictions` | int | 2 | Max concurrent PP-StructureV3 predictions |
| `max_concurrent_pages` | int | 2 | Max pages processed concurrently |
| `inference_batch_size` | int | 1 | Batch size for inference |
| `enable_batch_processing` | bool | True | Enable batch processing for large docs |
### Recovery Settings
| Setting | Type | Default | Description |
|---------|------|---------|-------------|
| `enable_cpu_fallback` | bool | True | Fall back to CPU when GPU memory low |
| `enable_emergency_cleanup` | bool | True | Auto-cleanup on memory pressure |
| `enable_worker_restart` | bool | False | Restart workers on OOM (requires supervisor) |
### Feature Flags
| Setting | Type | Default | Description |
|---------|------|---------|-------------|
| `enable_chart_recognition` | bool | True | Enable chart/diagram recognition |
| `enable_formula_recognition` | bool | True | Enable math formula recognition |
| `enable_table_recognition` | bool | True | Enable table structure recognition |
| `enable_seal_recognition` | bool | True | Enable seal/stamp recognition |
| `enable_text_recognition` | bool | True | Enable general text recognition |
| `enable_memory_optimization` | bool | True | Enable memory optimizations |
### Environment Variable Override
All settings can be overridden via environment variables; the variable name is the setting name in uppercase (e.g. `memory_warning_threshold` becomes `MEMORY_WARNING_THRESHOLD`):
```bash
# Example .env file
MEMORY_WARNING_THRESHOLD=0.75
MEMORY_CRITICAL_THRESHOLD=0.90
MAX_CONCURRENT_PREDICTIONS=1
GPU_MEMORY_LIMIT_MB=4096
ENABLE_CPU_FALLBACK=true
```
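The override convention (uppercase environment variable wins over the coded default) boils down to the following; the actual `Settings` class lives in `backend/app/core/config.py`, and the `setting` helper here is invented for illustration:

```python
import os

# Minimal sketch of the env-override convention; not the real Settings class.
def setting(name: str, default, cast=float):
    """Read NAME from the environment, falling back to the coded default."""
    raw = os.environ.get(name.upper())
    return default if raw is None else cast(raw)

os.environ["MEMORY_WARNING_THRESHOLD"] = "0.75"        # simulate a .env entry
warning = setting("memory_warning_threshold", 0.80)    # overridden
critical = setting("memory_critical_threshold", 0.95)  # no override -> default
```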
### Recommended Configurations
#### RTX 4060 8GB (Default)
```bash
GPU_MEMORY_LIMIT_MB=6144
MAX_CONCURRENT_PREDICTIONS=2
MAX_CONCURRENT_PAGES=2
INFERENCE_BATCH_SIZE=1
```
#### RTX 3090 24GB
```bash
GPU_MEMORY_LIMIT_MB=20480
MAX_CONCURRENT_PREDICTIONS=4
MAX_CONCURRENT_PAGES=4
INFERENCE_BATCH_SIZE=2
```
#### CPU-Only Mode
```bash
FORCE_CPU_MODE=true
MAX_CONCURRENT_PREDICTIONS=1
ENABLE_CPU_FALLBACK=false
```
## Prometheus Metrics
The system exports Prometheus-format metrics via the `PrometheusMetrics` class. Available metrics:
### GPU Metrics
- `tool_ocr_memory_gpu_total_bytes` - Total GPU memory
- `tool_ocr_memory_gpu_used_bytes` - Used GPU memory
- `tool_ocr_memory_gpu_free_bytes` - Free GPU memory
- `tool_ocr_memory_gpu_utilization_ratio` - GPU utilization (0-1)
### Model Metrics
- `tool_ocr_memory_models_loaded_total` - Number of loaded models
- `tool_ocr_memory_models_memory_bytes` - Total memory used by models
- `tool_ocr_memory_model_ref_count{model_id}` - Reference count per model
### Prediction Metrics
- `tool_ocr_memory_predictions_active` - Currently active predictions
- `tool_ocr_memory_predictions_queue_depth` - Predictions waiting in queue
- `tool_ocr_memory_predictions_total` - Total predictions processed (counter)
- `tool_ocr_memory_predictions_timeouts_total` - Total prediction timeouts (counter)
### Pool Metrics
- `tool_ocr_memory_pool_services_total` - Total services in pool
- `tool_ocr_memory_pool_services_available` - Available services
- `tool_ocr_memory_pool_services_in_use` - Services in use
- `tool_ocr_memory_pool_acquisitions_total` - Total acquisitions (counter)
### Recovery Metrics
- `tool_ocr_memory_recovery_count_total` - Total recovery attempts
- `tool_ocr_memory_recovery_in_cooldown` - In cooldown (0/1)
- `tool_ocr_memory_recovery_cooldown_remaining_seconds` - Remaining cooldown
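These metrics are served in the Prometheus text exposition format, which at its simplest is one `name value` pair per line. A sketch of that rendering (the metric names come from the lists above; the values and the `render_metrics` helper are invented):

```python
# Illustrative sketch of the Prometheus text exposition format; the real
# PrometheusMetrics class also emits HELP/TYPE lines and labels.
def render_metrics(metrics: dict) -> str:
    """Render name->value pairs, one 'name value' line per metric."""
    return "".join(f"{name} {value}\n" for name, value in metrics.items())

sample = render_metrics({
    "tool_ocr_memory_predictions_active": 1,
    "tool_ocr_memory_pool_services_available": 2,
})
```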
## Memory Dump API
The `MemoryDumper` class provides debugging capabilities:
```python
from app.services.memory_manager import get_memory_dumper
dumper = get_memory_dumper()
# Create a memory dump
dump = dumper.create_dump(include_python_objects=True)
# Get dump as dictionary for JSON serialization
dump_dict = dumper.to_dict(dump)
# Compare two dumps to detect memory growth
comparison = dumper.compare_dumps(dump1, dump2)
```
Memory dumps include:
- GPU/CPU memory usage
- Loaded models and reference counts
- Active predictions and queue state
- Service pool statistics
- Recovery manager state
- Python GC statistics
- Large Python objects (optional)
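At its core, `compare_dumps`-style growth detection is a per-key diff between two snapshots. A hedged sketch, with the dump keys invented for the example (the real dump schema is the list above):

```python
# Illustrative sketch of dump comparison; not the real compare_dumps, and the
# keys below are made up for the example.
def diff_dumps(before: dict, after: dict) -> dict:
    """Return per-key growth between two memory dumps (positive = grew)."""
    return {k: after.get(k, 0.0) - before.get(k, 0.0)
            for k in set(before) | set(after)
            if after.get(k, 0.0) != before.get(k, 0.0)}
```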