chore: project cleanup and prepare for dual-track processing refactor

- Removed all test files and directories
- Deleted outdated documentation (will be rewritten)
- Cleaned up temporary files, logs, and uploads
- Archived 5 completed OpenSpec proposals
- Created new dual-track-document-processing proposal with complete OpenSpec structure
  - Dual-track architecture: OCR track (PaddleOCR) + Direct track (PyMuPDF)
  - UnifiedDocument model for consistent output
  - Support for structure-preserving translation
- Updated .gitignore to prevent future test/temp files
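As a rough illustration of the proposal's `UnifiedDocument` idea, a minimal sketch of the shared output model follows; the field names here are illustrative assumptions, not the schema from the proposal:

```python
from dataclasses import dataclass, field
from typing import List, Literal

# Hypothetical sketch of the UnifiedDocument model described in the
# proposal; field names are assumptions, not the final schema.

@dataclass
class TextBlock:
    text: str
    bbox: tuple                # (x0, y0, x1, y1) in page coordinates
    confidence: float = 1.0    # OCR confidence; 1.0 for direct extraction

@dataclass
class Page:
    number: int
    blocks: List[TextBlock] = field(default_factory=list)

@dataclass
class UnifiedDocument:
    source_track: Literal["ocr", "direct"]   # which pipeline produced it
    pages: List[Page] = field(default_factory=list)
```

Both tracks — PaddleOCR for scanned pages, PyMuPDF for born-digital ones — would emit this one shape, so downstream structure-preserving translation handles a single type.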

This is a major cleanup preparing for the complete refactoring of the document processing pipeline.

🤖 Generated with [Claude Code](https://claude.com/claude-code)

Co-Authored-By: Claude <noreply@anthropic.com>
This commit is contained in:
egg
2025-11-18 20:02:31 +08:00
parent 0edc56b03f
commit cd3cbea49d
64 changed files with 3573 additions and 8190 deletions


# OCR Processing Specification
## ADDED Requirements
### Requirement: GPU Acceleration
The system SHALL automatically detect and utilize GPU hardware for OCR processing when available, with graceful fallback to CPU mode when GPU is unavailable or disabled.
#### Scenario: GPU available and enabled
- **WHEN** PaddleOCR service initializes on system with compatible GPU
- **THEN** the system detects GPU availability using CUDA runtime
- **AND** initializes PaddleOCR with `use_gpu=True` parameter
- **AND** sets appropriate GPU memory fraction to prevent OOM errors
- **AND** logs GPU device information (name, memory, CUDA version)
- **AND** processes OCR tasks using GPU acceleration
#### Scenario: CPU fallback when GPU unavailable
- **WHEN** PaddleOCR service initializes on system without GPU
- **THEN** the system detects absence of GPU
- **AND** initializes PaddleOCR with `use_gpu=False` parameter
- **AND** logs CPU mode status
- **AND** processes OCR tasks using CPU without errors
#### Scenario: Force CPU mode override
- **WHEN** the `FORCE_CPU_MODE` environment variable is set to `true`
- **THEN** the system ignores GPU availability
- **AND** initializes PaddleOCR in CPU mode
- **AND** logs that CPU mode is forced by configuration
- **AND** processes OCR tasks using CPU
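The three initialization scenarios above reduce to a single decision rule. A sketch in plain Python — the env-variable name comes from the spec, while the GPU probe is stubbed out as a parameter rather than a real CUDA runtime query:

```python
import os

def decide_processing_mode(gpu_available: bool, env=None):
    """Return (use_gpu, reason) per the initialization scenarios.

    `gpu_available` stands in for a real CUDA runtime probe; the
    FORCE_CPU_MODE override takes precedence over GPU detection.
    """
    env = os.environ if env is None else env
    if env.get("FORCE_CPU_MODE", "").lower() == "true":
        return False, "CPU mode forced by configuration"
    if not gpu_available:
        return False, "No GPU detected"
    return True, "GPU detected"
```

The returned reason string is also what the health check and startup logging scenarios below would report.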
#### Scenario: GPU out-of-memory error handling
- **WHEN** GPU runs out of memory during OCR processing
- **THEN** the system catches CUDA OOM exception
- **AND** logs error with GPU memory information
- **AND** attempts to process the task using CPU mode
- **AND** continues batch processing without failure
- **AND** records GPU failure in task metadata
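The OOM fallback flow might look like the following; `CudaOOMError` and the two processing callables are placeholders for the real CUDA exception type and OCR entry points:

```python
class CudaOOMError(RuntimeError):
    """Placeholder for the CUDA out-of-memory exception."""

def run_ocr_task(task, ocr_gpu, ocr_cpu, log=print):
    """Try the task on GPU; on OOM, log the failure, record it in the
    task metadata, and retry on CPU so the batch keeps going."""
    try:
        return ocr_gpu(task)
    except CudaOOMError as exc:
        log(f"GPU OOM on task {task['id']}: {exc}; retrying on CPU")
        task.setdefault("metadata", {})["gpu_failure"] = str(exc)
        return ocr_cpu(task)
```

Because the exception is caught per task, one oversized page degrades to CPU processing without aborting the rest of the batch.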
#### Scenario: Multiple GPU devices available
- **WHEN** system has multiple CUDA devices
- **THEN** the system detects all available GPUs
- **AND** uses primary GPU (device 0) by default
- **AND** allows GPU device selection via configuration
- **AND** logs selected GPU device information
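Device selection could be a small validation step along these lines (the configuration parameter is an assumption, not a named setting from the spec):

```python
def select_gpu_device(num_devices: int, configured=None) -> int:
    """Pick a CUDA device index: the configured one if valid,
    otherwise device 0 as the default."""
    if num_devices < 1:
        raise ValueError("no CUDA devices available")
    if configured is None:
        return 0
    if not 0 <= configured < num_devices:
        raise ValueError(
            f"device {configured} out of range (0..{num_devices - 1})"
        )
    return configured
```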
### Requirement: GPU Performance Optimization
The system SHALL optimize GPU memory usage and batch processing for efficient OCR performance.
#### Scenario: Automatic batch size adjustment
- **WHEN** GPU mode is enabled
- **THEN** the system queries available GPU memory
- **AND** calculates optimal batch size based on memory capacity
- **AND** adjusts concurrent processing threads accordingly
- **AND** monitors memory usage during processing
- **AND** prevents memory allocation beyond safe threshold
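One plausible way to derive a batch size from free GPU memory — the per-page cost, safety fraction, and cap are made-up numbers for illustration, not measured values:

```python
def optimal_batch_size(free_mem_mb: float,
                       per_item_mb: float = 150.0,
                       safety_fraction: float = 0.8,
                       max_batch: int = 64) -> int:
    """Estimate how many pages fit in GPU memory, staying below a
    safety threshold of the free capacity and a hard cap."""
    usable = free_mem_mb * safety_fraction
    return max(1, min(max_batch, int(usable // per_item_mb)))
```

The floor of 1 keeps processing alive even on tiny GPUs, while the cap bounds latency per batch.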
#### Scenario: GPU memory management
- **WHEN** GPU memory fraction is configured
- **THEN** the system allocates specified fraction of total GPU memory
- **AND** reserves memory for PaddleOCR model
- **AND** prevents other processes from causing OOM
- **AND** releases memory after batch completion
### Requirement: GPU Status Reporting
The system SHALL provide GPU status information through health check API and logging.
#### Scenario: Health check with GPU available
- **WHEN** client requests `/health` endpoint on GPU-enabled system
- **THEN** the system returns health status including:
- `gpu_available`: true
- `gpu_device_name`: detected GPU name
- `cuda_version`: CUDA runtime version
- `gpu_memory_total`: total GPU memory in MB
- `gpu_memory_used`: currently used GPU memory in MB
- `gpu_utilization`: current GPU utilization percentage
#### Scenario: Health check without GPU
- **WHEN** client requests `/health` endpoint on CPU-only system
- **THEN** the system returns health status including:
- `gpu_available`: false
- `processing_mode`: "CPU"
- `reason`: explanation for CPU mode (e.g., "No GPU detected", "CPU mode forced")
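Both health-check shapes can come from one builder; the field names follow the spec above, while the `gpu_stats` dict is a stand-in for values a real query (e.g. via NVML) would supply:

```python
def build_health_payload(gpu_stats=None, reason="No GPU detected"):
    """Assemble the /health response body for GPU and CPU-only
    systems, matching the two scenarios in the spec."""
    if gpu_stats is None:
        return {
            "gpu_available": False,
            "processing_mode": "CPU",
            "reason": reason,
        }
    return {
        "gpu_available": True,
        "gpu_device_name": gpu_stats["name"],
        "cuda_version": gpu_stats["cuda_version"],
        "gpu_memory_total": gpu_stats["memory_total_mb"],
        "gpu_memory_used": gpu_stats["memory_used_mb"],
        "gpu_utilization": gpu_stats["utilization_pct"],
    }
```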
#### Scenario: Startup GPU status logging
- **WHEN** OCR service starts
- **THEN** the system logs GPU detection results
- **AND** logs selected processing mode (GPU/CPU)
- **AND** logs GPU device details if available
- **AND** logs any GPU-related warnings or errors
- **AND** continues startup successfully regardless of GPU status
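The startup-logging scenario might be sketched like this; message wording is illustrative, and returning the messages is just a convenience for inspection:

```python
import logging

def log_gpu_status(use_gpu: bool, device_info=None, logger=None):
    """Log the detection result and chosen mode without ever raising,
    so service startup continues regardless of GPU status."""
    logger = logger or logging.getLogger("ocr.startup")
    messages = [f"Processing mode: {'GPU' if use_gpu else 'CPU'}"]
    if use_gpu and device_info:
        messages.append(
            f"GPU: {device_info.get('name', 'unknown')} "
            f"(CUDA {device_info.get('cuda_version', '?')})"
        )
    elif use_gpu:
        messages.append("GPU mode selected but no device details available")
    for msg in messages:
        logger.info(msg)
    return messages
```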