# Technical Design Document

## Context
Tool_OCR is a web-based batch OCR processing system with a separate frontend and backend. The system needs to handle large file uploads, long-running OCR tasks, and multiple export formats while maintaining a responsive UI and efficient resource usage.

**Key stakeholders:**
- End users: Need simple, fast, reliable OCR processing
- Developers: Need maintainable, testable code architecture
- Operations: Need easy deployment via 1Panel, monitoring, and error tracking

**Constraints:**
- Development on Windows with Conda (Python 3.10)
- Deployment on Linux server via 1Panel (no Docker)
- Port range: 12010-12019
- External MySQL database (mysql.theaken.com:33306)
- PaddleOCR models (~100-200MB per language)
- Max file upload: 20MB per file, 100MB per batch

## Goals / Non-Goals

### Goals
- Process images and PDFs with multi-language OCR (Chinese, English, Japanese, Korean)
- Handle batch uploads with real-time progress tracking
- Provide flexible export formats (TXT, JSON, Excel) with custom rules
- Maintain responsive UI during long-running OCR tasks
- Enable easy deployment and maintenance via 1Panel

### Non-Goals
- Real-time OCR streaming (batch processing only)
- Cloud-based OCR services (local processing only)
- Mobile app support (web UI only, desktop/tablet optimized)
- Advanced image editing or annotation features
- Multi-tenant SaaS architecture (single deployment per organization)

## Decisions

### Decision 1: FastAPI for Backend Framework
**Choice:** Use FastAPI instead of Flask or Django

**Rationale:**
- Native async/await support for I/O-bound operations (file upload, database queries)
- Automatic OpenAPI documentation (Swagger UI)
- Built-in Pydantic validation for type safety
- Better performance for concurrent requests
- Modern Python 3.10+ features (type hints, async)

**Alternatives considered:**
- Flask: Simpler but lacks native async, requires extensions
- Django: Too heavyweight for API-only backend, includes unnecessary ORM features

### Decision 2: PaddleOCR as OCR Engine
**Choice:** Use PaddleOCR instead of Tesseract or cloud APIs

**Rationale:**
- Excellent Chinese/multilingual support (key requirement)
- Higher accuracy with deep learning models
- Offline operation (no API costs or internet dependency)
- Active development and good documentation
- GPU acceleration support (optional)

**Alternatives considered:**
- Tesseract: Lower accuracy for Chinese, older technology
- Google Cloud Vision / AWS Textract: Requires internet, ongoing costs, data privacy concerns

### Decision 3: React Query for API State Management
**Choice:** Use React Query (TanStack Query) instead of Redux

**Rationale:**
- Designed specifically for server state (API calls, caching, refetching)
- Built-in loading/error states
- Automatic background refetching and cache invalidation
- Reduces boilerplate compared to Redux
- Better for our API-heavy use case

**Alternatives considered:**
- Redux: Overkill for server state, more boilerplate
- Plain Axios: Requires manual loading/error state management

### Decision 4: Zustand for Client State
**Choice:** Use Zustand for global UI state (separate from React Query)

**Rationale:**
- Lightweight (1KB) and simple API
- No providers or context required
- TypeScript-friendly
- Works well alongside React Query
- Only for UI state (selected files, filters, etc.)

### Decision 5: Background Task Processing
**Choice:** FastAPI BackgroundTasks for OCR processing (no external queue initially)

**Rationale:**
- Built-in FastAPI feature, no additional dependencies
- Sufficient for single-server deployment
- Simpler deployment and maintenance
- Can migrate to Redis/Celery later if needed

**Migration path:** If scale requires it, add Redis + Celery as a distributed task queue

**Alternatives considered:**
- Celery + Redis: More complex, overkill for initial deployment
- Threading: Unnecessary, since FastAPI BackgroundTasks already runs synchronous task functions in a thread pool

### Decision 6: File Storage Strategy
**Choice:** Local filesystem with automatic cleanup (24-hour retention)

**Rationale:**
- Simple implementation, no S3/cloud storage costs
- OCR results stored in database (permanent)
- Original files temporary, only needed during processing
- Automatic cleanup prevents disk space issues

**Storage structure:**
```
uploads/
  {batch_id}/
    {file_id}_original.png
    {file_id}_preprocessed.png (if preprocessing enabled)
```

**Cleanup:** Daily cron job or background task deletes files older than 24 hours

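The cleanup step could look roughly like the sketch below. It is a minimal illustration, not the project's actual job: the function name `cleanup_uploads`, the constant `RETENTION_SECONDS`, and the use of file modification time as the age of an upload are all assumptions made for this example.

```python
import os
import time
from pathlib import Path
from typing import List, Optional

RETENTION_SECONDS = 24 * 60 * 60  # 24-hour retention from the decision above


def cleanup_uploads(root: Path, now: Optional[float] = None) -> List[Path]:
    """Delete files under root/{batch_id}/ older than the retention window.

    Hypothetical helper: age is judged by modification time (an assumption).
    Returns the list of removed files so the caller can log them.
    """
    now = time.time() if now is None else now
    removed: List[Path] = []
    if not root.is_dir():
        return removed
    for batch_dir in list(root.iterdir()):
        if not batch_dir.is_dir():
            continue
        for file_path in list(batch_dir.iterdir()):
            too_old = now - file_path.stat().st_mtime > RETENTION_SECONDS
            if file_path.is_file() and too_old:
                file_path.unlink()
                removed.append(file_path)
        # Drop the batch directory once nothing is left inside it.
        if not any(batch_dir.iterdir()):
            batch_dir.rmdir()
    return removed
```

A daily cron entry or a scheduled background task could call `cleanup_uploads(Path("uploads"))`; passing `now` explicitly makes the function easy to test.
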
### Decision 7: Real-time Progress Updates
**Choice:** HTTP polling instead of WebSocket

**Rationale:**
- Simpler implementation and deployment
- Works better with Nginx reverse proxy and 1Panel
- Sufficient UX for batch processing (poll every 2 seconds)
- No need for persistent connections

**API:** `GET /api/v1/batch/{batch_id}/status` returns progress percentage

**Alternatives considered:**
- WebSocket: More complex, requires special Nginx config, overkill for this use case

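As a rough illustration of what the status endpoint might compute, the sketch below derives a progress percentage from per-file statuses. The status values (`pending`, `processing`, `completed`, `failed`), the function name `batch_status`, and the response fields are assumptions for this example; the document does not define them.

```python
from collections import Counter
from typing import Dict, Iterable

# Hypothetical status values; files in either state count as finished.
FINISHED = {"completed", "failed"}


def batch_status(file_statuses: Iterable[str]) -> Dict[str, object]:
    """Summarise a batch as an overall state plus a 0-100 progress value."""
    counts = Counter(file_statuses)
    total = sum(counts.values())
    done = sum(counts[s] for s in FINISHED)
    percent = round(100 * done / total) if total else 0
    if total and done == total:
        state = "completed"
    elif done or counts["processing"]:
        state = "processing"
    else:
        state = "pending"
    return {
        "status": state,
        "progress": percent,
        "total": total,
        "completed": counts["completed"],
        "failed": counts["failed"],
    }
```

Counting failed files as finished lets the progress bar reach 100% even when a batch has partial failures, which the client can then report separately.
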
### Decision 8: Database Schema Design
**Choice:** Separate tables for tasks, files, and results (normalized)

**Schema:**
```sql
users (id, username, password_hash, created_at)
ocr_batches (id, user_id, status, created_at, completed_at)
ocr_files (id, batch_id, filename, file_path, file_size, status)
ocr_results (id, file_id, text, bbox_json, confidence, language)
export_rules (id, user_id, rule_name, config_json)
```

**Rationale:**
- Normalized for data integrity
- Supports batch tracking and partial failures
- Easy to query individual file results or batch statistics
- Export rules are stored per user and reusable across batches (sharing across users is still an open question)

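To make the "batch statistics" point concrete, here is a small sketch that builds two of the tables and counts files per status. It uses an in-memory SQLite database purely as a stand-in for the external MySQL server, and the column types, the sample rows, and the helper name `batch_statistics` are assumptions for illustration.

```python
import sqlite3

# In-memory SQLite stands in for MySQL here; column types are assumed.
conn = sqlite3.connect(":memory:")
conn.executescript(
    """
    CREATE TABLE ocr_batches (
        id INTEGER PRIMARY KEY, user_id INTEGER, status TEXT,
        created_at TEXT, completed_at TEXT
    );
    CREATE TABLE ocr_files (
        id INTEGER PRIMARY KEY,
        batch_id INTEGER REFERENCES ocr_batches(id),
        filename TEXT, file_path TEXT, file_size INTEGER, status TEXT
    );
    """
)
conn.execute(
    "INSERT INTO ocr_batches (id, user_id, status) VALUES (1, 1, 'processing')"
)
conn.executemany(
    "INSERT INTO ocr_files (batch_id, filename, status) VALUES (?, ?, ?)",
    [(1, "a.png", "completed"), (1, "b.png", "failed"), (1, "c.png", "pending")],
)


def batch_statistics(connection: sqlite3.Connection, batch_id: int) -> dict:
    """Return {status: file_count} for one batch."""
    rows = connection.execute(
        "SELECT status, COUNT(*) FROM ocr_files WHERE batch_id = ? GROUP BY status",
        (batch_id,),
    ).fetchall()
    return dict(rows)
```

Because each file row carries its own status, a single failed file shows up in the counts without marking the whole batch as failed.
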
### Decision 9: Export Rule Configuration Format
**Choice:** JSON-based rule configuration stored in database

**Example rule:**
```json
{
  "filters": {
    "min_confidence": 0.8,
    "filename_pattern": "^invoice_.*"
  },
  "formatting": {
    "add_line_numbers": true,
    "sort_by_position": true,
    "group_by_page": true
  },
  "output": {
    "format": "txt",
    "encoding": "utf-8",
    "line_separator": "\n"
  }
}
```

**Rationale:**
- Flexible and extensible
- Easy to validate with JSON schema
- Can be edited via UI or API
- Supports complex rules without database schema changes

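One possible way to apply such a rule to OCR results is sketched below. The shape of each result record (`filename`, `text`, `confidence`, `page`, `y`, `x`) and the function name `apply_export_rule` are assumptions for this example, and `group_by_page` is left out to keep the sketch short.

```python
import re
from typing import Any, Dict, List


def apply_export_rule(results: List[Dict[str, Any]], rule: Dict[str, Any]) -> str:
    """Filter, order, and join OCR text lines according to an export rule."""
    filters = rule.get("filters", {})
    formatting = rule.get("formatting", {})
    output = rule.get("output", {})

    min_conf = filters.get("min_confidence", 0.0)
    # An empty pattern matches every filename.
    pattern = re.compile(filters.get("filename_pattern", ""))

    kept = [
        r for r in results
        if r["confidence"] >= min_conf and pattern.search(r["filename"])
    ]

    if formatting.get("sort_by_position"):
        # Assumed: position fields were derived from each bounding box.
        kept.sort(key=lambda r: (r["page"], r["y"], r["x"]))

    lines = [r["text"] for r in kept]
    if formatting.get("add_line_numbers"):
        lines = [f"{i}: {line}" for i, line in enumerate(lines, start=1)]
    return output.get("line_separator", "\n").join(lines)
```

Reading every key with a default means a rule that omits a section still works, which fits the goal of extending rules without schema changes.
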
### Decision 10: Deployment Architecture (1Panel)
**Choice:** Nginx (static files + reverse proxy) + Supervisor (backend process manager)

**Architecture:**
```
[Client Browser]
    ↓
[Nginx :80/443] (managed by 1Panel)
    ↓
    ├─ / → Frontend static files (React build)
    ├─ /assets → Static assets
    └─ /api → Reverse proxy to backend :12010
        ↓
    [FastAPI Backend :12010] (managed by Supervisor)
        ↓
    [MySQL :33306] (external)
```

**Rationale:**
- 1Panel provides GUI for Nginx management
- Supervisor ensures backend auto-restart on failure
- No Docker simplifies deployment on existing infrastructure
- Standard Nginx config works without special 1Panel requirements

**Supervisor config:**
```ini
[program:tool_ocr_backend]
command=/home/user/.conda/envs/tool_ocr/bin/uvicorn app.main:app --host 127.0.0.1 --port 12010
directory=/path/to/Tool_OCR/backend
user=www-data
autostart=true
autorestart=true
```

## Risks / Trade-offs

### Risk 1: OCR Processing Time for Large Batches
**Risk:** Processing 50+ images may take 5-10 minutes, long enough to hit HTTP request timeouts

**Mitigation:**
- Use FastAPI BackgroundTasks to avoid HTTP timeout
- Return batch_id immediately, client polls for status
- Display progress bar with estimated time remaining
- Limit max batch size to 50 files (configurable)
- Add worker concurrency limit to prevent resource exhaustion

### Risk 2: PaddleOCR Model Download on First Run
**Risk:** Models are 100-200MB, first-time download may fail or be slow

**Mitigation:**
- Pre-download models during deployment setup
- Provide manual download script for offline installation
- Cache models in shared directory for all users
- Include model version in deployment docs

### Risk 3: File Upload Size Limits
**Risk:** Users may try to upload very large PDFs (>20MB)

**Mitigation:**
- Enforce 20MB per file, 100MB per batch limits in frontend and backend
- Display clear error messages with limit information
- Provide guidance on compressing PDFs or splitting large files
- Consider adding image downsampling for huge images

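A backend-side check for these limits might look like the sketch below. The function name `validate_batch`, the `(filename, size_in_bytes)` input shape, and the choice of 1 MB = 1024 * 1024 bytes are assumptions for this example; the document does not say whether the limits are binary or decimal megabytes.

```python
from typing import List, Tuple

MB = 1024 * 1024  # assumed binary megabytes
MAX_FILE_BYTES = 20 * MB
MAX_BATCH_BYTES = 100 * MB


def validate_batch(files: List[Tuple[str, int]]) -> List[str]:
    """Return human-readable errors; an empty list means the batch is acceptable."""
    errors: List[str] = []
    for name, size in files:
        if size > MAX_FILE_BYTES:
            errors.append(
                f"{name}: {size / MB:.1f} MB exceeds the 20 MB per-file limit"
            )
    total = sum(size for _, size in files)
    if total > MAX_BATCH_BYTES:
        errors.append(
            f"batch: {total / MB:.1f} MB exceeds the 100 MB per-batch limit"
        )
    return errors
```

Returning every violation at once, rather than stopping at the first, lets the UI show the user all files that need to be compressed or split.
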
### Risk 4: Concurrent User Scaling
**Risk:** Multiple users uploading simultaneously may overwhelm CPU/memory

**Mitigation:**
- Limit concurrent OCR workers (e.g., 4 workers max)
- Implement task queue with FastAPI BackgroundTasks
- Monitor resource usage and add throttling if needed
- Document recommended server specs (8GB RAM, 4 CPU cores)

### Risk 5: Database Connection Pool Exhaustion
**Risk:** External MySQL may have connection limits

**Mitigation:**
- Configure SQLAlchemy connection pool (max 20 connections)
- Use connection pooling with proper timeout settings
- Close connections properly in all API endpoints
- Add health check endpoint to monitor database connectivity

## Migration Plan

### Phase 1: Initial Deployment
1. Setup Conda environment on production server
2. Install Python dependencies and download OCR models
3. Configure MySQL database and create tables
4. Build frontend static files (`npm run build`)
5. Configure Nginx via 1Panel (upload nginx.conf)
6. Setup Supervisor for backend process
7. Test with sample images

### Phase 2: Production Rollout
1. Create admin user account
2. Import sample export rules
3. Perform smoke tests (upload, OCR, export)
4. Monitor logs for errors
5. Setup daily cleanup cron job for old files
6. Enable HTTPS via 1Panel (Let's Encrypt)

### Phase 3: Monitoring and Optimization
1. Add application logging (file + console)
2. Monitor resource usage (CPU, memory, disk)
3. Optimize slow queries if needed
4. Tune worker concurrency based on actual load
5. Collect user feedback and iterate

### Rollback Plan
- Keep previous version in separate directory
- Use Supervisor to stop current version and start previous
- Database migrations should be backward compatible
- If major issues occur, restore database from backup

## Open Questions

1. **Should we add user registration, or use admin-created accounts only?**
   - Recommendation: Start with admin-created accounts for security, add registration later if needed

2. **Do we need audit logging for compliance?**
   - Recommendation: Add basic audit trail (who uploaded what, when) in database

3. **Should we support GPU acceleration for PaddleOCR?**
   - Recommendation: Optional, detect GPU on startup, fallback to CPU if unavailable

4. **What's the desired behavior for duplicate filenames in a batch?**
   - Recommendation: Auto-rename with suffix (e.g., `file.png`, `file_1.png`)

5. **Should export rules be shareable across users or private?**
   - Recommendation: Private by default, add "public templates" feature later

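For open question 4, the auto-rename recommendation could be implemented along the lines of the sketch below. The function name `dedupe_filenames` and the exact collision handling (keep counting until a free name is found, case-sensitive comparison) are assumptions for this example.

```python
from pathlib import PurePath
from typing import Iterable, List


def dedupe_filenames(names: Iterable[str]) -> List[str]:
    """Rename repeated filenames in a batch with a numeric suffix.

    The first occurrence keeps its name; later ones become stem_1, stem_2, ...
    """
    used = set()
    result: List[str] = []
    for name in names:
        path = PurePath(name)
        candidate, counter = name, 0
        # Keep incrementing until the name collides with nothing seen so far.
        while candidate in used:
            counter += 1
            candidate = f"{path.stem}_{counter}{path.suffix}"
        used.add(candidate)
        result.append(candidate)
    return result
```

Checking each candidate against all names already used also covers the case where a user uploads both `file.png` twice and a real `file_1.png` in the same batch.
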