openspec/changes/add-ocr-batch-processing/design.md
# Technical Design Document

## Context

Tool_OCR is a web-based batch OCR processing system with a separate frontend and backend. The system needs to handle large file uploads, long-running OCR tasks, and multiple export formats while keeping the UI responsive and resource usage efficient.

**Key stakeholders:**
- End users: Need simple, fast, reliable OCR processing
- Developers: Need maintainable, testable code architecture
- Operations: Need easy deployment via 1Panel, monitoring, and error tracking

**Constraints:**
- Development on Windows with Conda (Python 3.10)
- Deployment on Linux server via 1Panel (no Docker)
- Port range: 12010-12019
- External MySQL database (mysql.theaken.com:33306)
- PaddleOCR models (~100-200MB per language)
- Max file upload: 20MB per file, 100MB per batch

## Goals / Non-Goals

### Goals
- Process images and PDFs with multi-language OCR (Chinese, English, Japanese, Korean)
- Handle batch uploads with real-time progress tracking
- Provide flexible export formats (TXT, JSON, Excel) with custom rules
- Maintain responsive UI during long-running OCR tasks
- Enable easy deployment and maintenance via 1Panel

### Non-Goals
- Real-time OCR streaming (batch processing only)
- Cloud-based OCR services (local processing only)
- Mobile app support (web UI only, desktop/tablet optimized)
- Advanced image editing or annotation features
- Multi-tenant SaaS architecture (single deployment per organization)

## Decisions

### Decision 1: FastAPI for Backend Framework
**Choice:** Use FastAPI instead of Flask or Django

**Rationale:**
- Native async/await support for I/O-bound operations (file upload, database queries)
- Automatic OpenAPI documentation (Swagger UI)
- Built-in Pydantic validation for type safety
- Better performance for concurrent requests
- Modern Python 3.10+ features (type hints, async)

**Alternatives considered:**
- Flask: Simpler but lacks native async, requires extensions
- Django: Too heavyweight for API-only backend, includes unnecessary ORM features

### Decision 2: PaddleOCR as OCR Engine
**Choice:** Use PaddleOCR instead of Tesseract or cloud APIs

**Rationale:**
- Excellent Chinese/multilingual support (key requirement)
- Higher accuracy with deep learning models
- Offline operation (no API costs or internet dependency)
- Active development and good documentation
- GPU acceleration support (optional)

**Alternatives considered:**
- Tesseract: Lower accuracy for Chinese, older technology
- Google Cloud Vision / AWS Textract: Requires internet, ongoing costs, data privacy concerns

### Decision 3: React Query for API State Management
**Choice:** Use React Query (TanStack Query) instead of Redux

**Rationale:**
- Designed specifically for server state (API calls, caching, refetching)
- Built-in loading/error states
- Automatic background refetching and cache invalidation
- Reduces boilerplate compared to Redux
- Better for our API-heavy use case

**Alternatives considered:**
- Redux: Overkill for server state, more boilerplate
- Plain Axios: Requires manual loading/error state management

### Decision 4: Zustand for Client State
**Choice:** Use Zustand for global UI state (separate from React Query)

**Rationale:**
- Lightweight (roughly 1KB) and simple API
- No providers or context required
- TypeScript-friendly
- Works well alongside React Query
- Used only for UI state (selected files, filters, etc.)

### Decision 5: Background Task Processing
**Choice:** FastAPI BackgroundTasks for OCR processing (no external queue initially)

**Rationale:**
- Built-in FastAPI feature, no additional dependencies
- Sufficient for single-server deployment
- Simpler deployment and maintenance
- Can migrate to Redis/Celery later if needed

**Migration path:** If scale requires it, add Redis + Celery as a distributed task queue

**Alternatives considered:**
- Celery + Redis: More complex, overkill for initial deployment
- Threading: Unnecessary, since FastAPI BackgroundTasks already runs synchronous task functions in a thread pool
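
The pattern behind this decision (accept the batch, return an identifier at once, and process files on a bounded pool) can be sketched with the standard library alone. This is a minimal illustration, not the FastAPI `BackgroundTasks` wiring itself: `run_ocr`, `submit_batch`, the in-memory `batches` registry, and the worker limit of 4 are all assumed names and values.

```python
import threading
import uuid
from concurrent.futures import ThreadPoolExecutor

# Assumed limit; the risk section suggests something like 4 workers.
MAX_OCR_WORKERS = 4

executor = ThreadPoolExecutor(max_workers=MAX_OCR_WORKERS)
_lock = threading.Lock()
batches = {}  # batch_id -> {"total": int, "done": int, "failed": int}


def run_ocr(path):
    """Hypothetical stand-in for the real OCR call."""
    return f"text from {path}"


def _process_file(batch_id, path):
    # Record success or failure per file so a batch can fail partially.
    try:
        run_ocr(path)
        key = "done"
    except Exception:
        key = "failed"
    with _lock:
        batches[batch_id][key] += 1


def submit_batch(paths):
    """Register the batch, schedule its files, and return without waiting."""
    batch_id = uuid.uuid4().hex
    with _lock:
        batches[batch_id] = {"total": len(paths), "done": 0, "failed": 0}
    futures = [executor.submit(_process_file, batch_id, p) for p in paths]
    return batch_id, futures
```

Because the pool is capped, extra files wait in the executor's internal queue instead of starting more OCR work than the server can handle.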

### Decision 6: File Storage Strategy
**Choice:** Local filesystem with automatic cleanup (24-hour retention)

**Rationale:**
- Simple implementation, no S3/cloud storage costs
- OCR results stored in database (permanent)
- Original files temporary, only needed during processing
- Automatic cleanup prevents disk space issues

**Storage structure:**
```
uploads/
  {batch_id}/
    {file_id}_original.png
    {file_id}_preprocessed.png  (if preprocessing enabled)
```

**Cleanup:** Daily cron job or background task deletes files older than 24 hours
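
A cleanup pass over the structure above might look like the following sketch. It assumes that a batch's age is measured by the newest modification time inside its directory and that whole batch directories are removed together; `cleanup_uploads` and `RETENTION_SECONDS` are illustrative names, not part of the design.

```python
import shutil
import time
from pathlib import Path

RETENTION_SECONDS = 24 * 60 * 60  # 24-hour retention from the design


def cleanup_uploads(root, now=None):
    """Delete batch directories under `root` older than the retention window.

    Returns the sorted names of the removed directories.
    """
    now = time.time() if now is None else now
    removed = []
    for batch_dir in Path(root).iterdir():
        if not batch_dir.is_dir():
            continue
        # The newest mtime among the directory and its files decides the age.
        newest = max(
            [batch_dir.stat().st_mtime]
            + [p.stat().st_mtime for p in batch_dir.rglob("*") if p.is_file()]
        )
        if now - newest > RETENTION_SECONDS:
            shutil.rmtree(batch_dir)
            removed.append(batch_dir.name)
    return sorted(removed)
```

The function can be called from a cron-invoked script or from a periodic background task; both options named above would share the same logic.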

### Decision 7: Real-time Progress Updates
**Choice:** HTTP polling instead of WebSocket

**Rationale:**
- Simpler implementation and deployment
- Works better with Nginx reverse proxy and 1Panel
- Sufficient UX for batch processing (poll every 2 seconds)
- No need for persistent connections

**API:** `GET /api/v1/batch/{batch_id}/status` returns progress percentage

**Alternatives considered:**
- WebSocket: More complex, requires special Nginx config, overkill for this use case
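
As a rough sketch of what the status endpoint could compute, the function below derives a progress percentage from per-file statuses. The status names (`pending`, `processing`, `completed`, `failed`) and the payload fields are assumptions; the design only specifies that a progress percentage is returned.

```python
from collections import Counter

FINISHED = {"completed", "failed"}  # assumed terminal file statuses


def batch_status(batch_id, file_statuses):
    """Build the polling payload from the per-file statuses of one batch."""
    counts = Counter(file_statuses)
    total = len(file_statuses)
    finished = sum(counts[s] for s in FINISHED)
    # Failed files count as finished so the bar still reaches 100%.
    progress = round(100 * finished / total) if total else 0
    if total and finished == total:
        status = "completed"
    elif finished or counts["processing"]:
        status = "processing"
    else:
        status = "pending"
    return {
        "batch_id": batch_id,
        "status": status,
        "progress": progress,
        "completed": counts["completed"],
        "failed": counts["failed"],
        "total": total,
    }
```

Once `status` is `completed`, the client can stop polling.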

### Decision 8: Database Schema Design
**Choice:** Separate tables for batches, files, and results (normalized)

**Schema:**
```sql
-- Shorthand table(column, ...) listing, not executable DDL
users (id, username, password_hash, created_at)
ocr_batches (id, user_id, status, created_at, completed_at)
ocr_files (id, batch_id, filename, file_path, file_size, status)
ocr_results (id, file_id, text, bbox_json, confidence, language)
export_rules (id, user_id, rule_name, config_json)
```

**Rationale:**
- Normalized for data integrity
- Supports batch tracking and partial failures
- Easy to query individual file results or batch statistics
- Export rules are stored per user (`user_id`) and reusable across batches
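
To illustrate how the normalized layout supports batch statistics and partial failures, here is a small sketch that uses an in-memory SQLite database as a stand-in for the external MySQL server. The column types and the `batch_statistics` helper are assumptions, and the `users` and `export_rules` tables are left out for brevity.

```python
import sqlite3

# SQLite stands in for MySQL here; column types are assumptions.
SCHEMA = """
CREATE TABLE ocr_batches (
    id INTEGER PRIMARY KEY,
    user_id INTEGER NOT NULL,
    status TEXT NOT NULL,
    created_at TEXT,
    completed_at TEXT
);
CREATE TABLE ocr_files (
    id INTEGER PRIMARY KEY,
    batch_id INTEGER NOT NULL REFERENCES ocr_batches(id),
    filename TEXT NOT NULL,
    file_path TEXT,
    file_size INTEGER,
    status TEXT NOT NULL
);
CREATE TABLE ocr_results (
    id INTEGER PRIMARY KEY,
    file_id INTEGER NOT NULL REFERENCES ocr_files(id),
    text TEXT,
    bbox_json TEXT,
    confidence REAL,
    language TEXT
);
"""


def batch_statistics(conn, batch_id):
    """Count files per status for one batch, e.g. to report partial failures."""
    rows = conn.execute(
        "SELECT status, COUNT(*) FROM ocr_files WHERE batch_id = ? GROUP BY status",
        (batch_id,),
    ).fetchall()
    return dict(rows)


conn = sqlite3.connect(":memory:")
conn.executescript(SCHEMA)
conn.execute("INSERT INTO ocr_batches (id, user_id, status) VALUES (1, 1, 'processing')")
conn.executemany(
    "INSERT INTO ocr_files (batch_id, filename, status) VALUES (?, ?, ?)",
    [(1, "a.png", "completed"), (1, "b.png", "completed"), (1, "c.pdf", "failed")],
)
print(batch_statistics(conn, 1))
```

Keeping status on `ocr_files` rather than only on `ocr_batches` is what lets one failed file be reported without marking the whole batch as failed.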

### Decision 9: Export Rule Configuration Format
**Choice:** JSON-based rule configuration stored in database

**Example rule:**
```json
{
  "filters": {
    "min_confidence": 0.8,
    "filename_pattern": "^invoice_.*"
  },
  "formatting": {
    "add_line_numbers": true,
    "sort_by_position": true,
    "group_by_page": true
  },
  "output": {
    "format": "txt",
    "encoding": "utf-8",
    "line_separator": "\n"
  }
}
```

**Rationale:**
- Flexible and extensible
- Easy to validate with JSON schema
- Can be edited via UI or API
- Supports complex rules without database schema changes
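
A rule of this shape could be applied roughly as follows. The sketch covers only `min_confidence`, `filename_pattern`, `add_line_numbers`, and `line_separator`; `sort_by_position` and `group_by_page` are ignored. The shape of `results` and the `1: text` numbering format are assumptions made for illustration.

```python
import re


def apply_export_rule(rule, results):
    """Render OCR lines as text according to a rule dict shaped like the example.

    `results` is an assumed list of dicts with filename, text, and confidence.
    """
    filters = rule.get("filters", {})
    min_conf = filters.get("min_confidence", 0.0)
    # An empty pattern matches every filename.
    pattern = re.compile(filters.get("filename_pattern", ""))
    kept = [
        r["text"]
        for r in results
        if r["confidence"] >= min_conf and pattern.search(r["filename"])
    ]
    if rule.get("formatting", {}).get("add_line_numbers"):
        kept = [f"{i}: {text}" for i, text in enumerate(kept, start=1)]
    separator = rule.get("output", {}).get("line_separator", "\n")
    return separator.join(kept)
```

Reading every key with a default means that a new rule option can be added later without invalidating rules already stored in the database.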

### Decision 10: Deployment Architecture (1Panel)
**Choice:** Nginx (static files + reverse proxy) + Supervisor (backend process manager)

**Architecture:**
```
[Client Browser]
    ↓
[Nginx :80/443] (managed by 1Panel)
    ↓
    ├─ /        → Frontend static files (React build)
    ├─ /assets  → Static assets
    └─ /api     → Reverse proxy to backend :12010
        ↓
    [FastAPI Backend :12010] (managed by Supervisor)
        ↓
    [MySQL :33306] (external)
```

**Rationale:**
- 1Panel provides GUI for Nginx management
- Supervisor ensures backend auto-restart on failure
- No Docker simplifies deployment on existing infrastructure
- Standard Nginx config works without special 1Panel requirements

**Supervisor config:**
```ini
[program:tool_ocr_backend]
command=/home/user/.conda/envs/tool_ocr/bin/uvicorn app.main:app --host 127.0.0.1 --port 12010
directory=/path/to/Tool_OCR/backend
user=www-data
autostart=true
autorestart=true
```

## Risks / Trade-offs

### Risk 1: OCR Processing Time for Large Batches
**Risk:** Processing 50+ images may take 5-10 minutes, which risks an HTTP timeout if done within a single request

**Mitigation:**
- Use FastAPI BackgroundTasks to avoid HTTP timeout
- Return batch_id immediately, client polls for status
- Display progress bar with estimated time remaining
- Limit max batch size to 50 files (configurable)
- Add worker concurrency limit to prevent resource exhaustion
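
One simple way to produce the estimated time remaining is a linear extrapolation from the files finished so far. This is a sketch under the assumption that files take roughly equal time; `estimate_remaining_seconds` is an illustrative name.

```python
def estimate_remaining_seconds(elapsed_seconds, files_done, files_total):
    """Linear estimate: remaining files take the average time seen so far.

    Returns None until at least one file has finished.
    """
    if files_done <= 0:
        return None
    remaining = max(files_total - files_done, 0)
    return elapsed_seconds / files_done * remaining
```

For example, 10 of 50 files done after 60 seconds gives an average of 6 seconds per file and an estimate of 240 seconds for the remaining 40.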

### Risk 2: PaddleOCR Model Download on First Run
**Risk:** Models are 100-200MB, first-time download may fail or be slow

**Mitigation:**
- Pre-download models during deployment setup
- Provide manual download script for offline installation
- Cache models in shared directory for all users
- Include model version in deployment docs

### Risk 3: File Upload Size Limits
**Risk:** Users may try to upload very large PDFs (>20MB)

**Mitigation:**
- Enforce 20MB per file, 100MB per batch limits in frontend and backend
- Display clear error messages with limit information
- Provide guidance on compressing PDFs or splitting large files
- Consider adding image downsampling for huge images
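
The backend half of the limit check could be as small as the sketch below. It assumes 1MB means 1024 × 1024 bytes and that the caller already knows each file's size; `validate_upload` and the message wording are illustrative.

```python
MAX_FILE_BYTES = 20 * 1024 * 1024    # 20MB per file
MAX_BATCH_BYTES = 100 * 1024 * 1024  # 100MB per batch


def validate_upload(files):
    """Check (filename, size_in_bytes) pairs; return a list of error messages."""
    errors = []
    for name, size in files:
        if size > MAX_FILE_BYTES:
            errors.append(
                f"{name}: {size / 2**20:.1f}MB exceeds the 20MB per-file limit"
            )
    total = sum(size for _, size in files)
    if total > MAX_BATCH_BYTES:
        errors.append(
            f"batch: {total / 2**20:.1f}MB exceeds the 100MB per-batch limit"
        )
    return errors
```

Returning all violations at once, rather than stopping at the first, lets the UI show every oversized file in a single error message.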

### Risk 4: Concurrent User Scaling
**Risk:** Multiple users uploading simultaneously may overwhelm CPU/memory

**Mitigation:**
- Limit concurrent OCR workers (e.g., 4 workers max)
- Implement task queue with FastAPI BackgroundTasks
- Monitor resource usage and add throttling if needed
- Document recommended server specs (8GB RAM, 4 CPU cores)

### Risk 5: Database Connection Pool Exhaustion
**Risk:** External MySQL may have connection limits

**Mitigation:**
- Configure SQLAlchemy connection pool (max 20 connections)
- Use connection pooling with proper timeout settings
- Close connections properly in all API endpoints
- Add health check endpoint to monitor database connectivity

## Migration Plan

### Phase 1: Initial Deployment
1. Set up Conda environment on production server
2. Install Python dependencies and download OCR models
3. Configure MySQL database and create tables
4. Build frontend static files (`npm run build`)
5. Configure Nginx via 1Panel (upload nginx.conf)
6. Set up Supervisor for backend process
7. Test with sample images

### Phase 2: Production Rollout
1. Create admin user account
2. Import sample export rules
3. Perform smoke tests (upload, OCR, export)
4. Monitor logs for errors
5. Set up daily cleanup cron job for old files
6. Enable HTTPS via 1Panel (Let's Encrypt)

### Phase 3: Monitoring and Optimization
1. Add application logging (file + console)
2. Monitor resource usage (CPU, memory, disk)
3. Optimize slow queries if needed
4. Tune worker concurrency based on actual load
5. Collect user feedback and iterate

### Rollback Plan
- Keep previous version in separate directory
- Use Supervisor to stop current version and start previous
- Database migrations should be backward compatible
- If major issues occur, restore the database from backup

## Open Questions

1. **Should we add user registration, or use admin-created accounts only?**
   - Recommendation: Start with admin-created accounts for security, add registration later if needed

2. **Do we need audit logging for compliance?**
   - Recommendation: Add basic audit trail (who uploaded what, when) in database

3. **Should we support GPU acceleration for PaddleOCR?**
   - Recommendation: Optional, detect GPU on startup, fallback to CPU if unavailable

4. **What's the desired behavior for duplicate filenames in a batch?**
   - Recommendation: Auto-rename with suffix (e.g., `file.png`, `file_1.png`)

5. **Should export rules be shareable across users or private?**
   - Recommendation: Private by default, add "public templates" feature later
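
For question 4, the recommended auto-rename could work roughly like the sketch below. It assumes suffixes are numbered from 1 and that a generated name must also avoid clashing with names uploaded later in the same batch; `dedupe_filenames` is an illustrative name.

```python
from pathlib import PurePath


def dedupe_filenames(names):
    """Rename repeats as stem_1.ext, stem_2.ext, ... while keeping order."""
    used = set()
    result = []
    for name in names:
        path = PurePath(name)
        candidate, n = name, 0
        # Bump the numeric suffix until the name is unused in this batch.
        while candidate in used:
            n += 1
            candidate = f"{path.stem}_{n}{path.suffix}"
        used.add(candidate)
        result.append(candidate)
    return result
```

Checking against every name already taken, rather than only counting repeats, covers the case where a user uploads both `file.png` twice and a real `file_1.png`.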