OCR/openspec/changes/add-ocr-batch-processing/design.md

# Technical Design Document

## Context
Tool_OCR is a web-based batch OCR processing system with a decoupled frontend and backend. The system must handle large file uploads, long-running OCR tasks, and multiple export formats while keeping the UI responsive and resource usage efficient.

**Key stakeholders:**
- End users: Need simple, fast, reliable OCR processing
- Developers: Need maintainable, testable code architecture
- Operations: Need easy deployment via 1Panel, monitoring, and error tracking

**Constraints:**
- Development on Windows with Conda (Python 3.10)
- Deployment on Linux server via 1Panel (no Docker)
- Port range: 12010-12019
- External MySQL database (mysql.theaken.com:33306)
- PaddleOCR models (~100-200MB per language)
- Max file upload: 20MB per file, 100MB per batch

## Goals / Non-Goals

### Goals
- Process images and PDFs with multi-language OCR (Chinese, English, Japanese, Korean)
- Handle batch uploads with real-time progress tracking
- Provide flexible export formats (TXT, JSON, Excel) with custom rules
- Maintain responsive UI during long-running OCR tasks
- Enable easy deployment and maintenance via 1Panel

### Non-Goals
- Real-time OCR streaming (batch processing only)
- Cloud-based OCR services (local processing only)
- Mobile app support (web UI only, desktop/tablet optimized)
- Advanced image editing or annotation features
- Multi-tenant SaaS architecture (single deployment per organization)

## Decisions

### Decision 1: FastAPI for Backend Framework
**Choice:** Use FastAPI instead of Flask or Django

**Rationale:**
- Native async/await support for I/O-bound operations (file upload, database queries)
- Automatic OpenAPI documentation (Swagger UI)
- Built-in Pydantic validation for type safety
- Better performance for concurrent requests
- Fits modern Python 3.10+ idioms (type hints, async/await)

**Alternatives considered:**
- Flask: Simpler but lacks native async, requires extensions
- Django: Too heavyweight for API-only backend, includes unnecessary ORM features

### Decision 2: PaddleOCR as OCR Engine
**Choice:** Use PaddleOCR instead of Tesseract or cloud APIs

**Rationale:**
- Excellent Chinese/multilingual support (key requirement)
- Higher accuracy with deep learning models
- Offline operation (no API costs or internet dependency)
- Active development and good documentation
- GPU acceleration support (optional)

**Alternatives considered:**
- Tesseract: Lower accuracy for Chinese, older technology
- Google Cloud Vision / AWS Textract: Requires internet, ongoing costs, data privacy concerns

### Decision 3: React Query for API State Management
**Choice:** Use React Query (TanStack Query) instead of Redux

**Rationale:**
- Designed specifically for server state (API calls, caching, refetching)
- Built-in loading/error states
- Automatic background refetching and cache invalidation
- Reduces boilerplate compared to Redux
- Better for our API-heavy use case

**Alternatives considered:**
- Redux: Overkill for server state, more boilerplate
- Plain Axios: Requires manual loading/error state management

### Decision 4: Zustand for Client State
**Choice:** Use Zustand for global UI state (separate from React Query)

**Rationale:**
- Lightweight (roughly 1KB gzipped) with a simple API
- No providers or context required
- TypeScript-friendly
- Works well alongside React Query
- Only for UI state (selected files, filters, etc.)

### Decision 5: Background Task Processing
**Choice:** FastAPI BackgroundTasks for OCR processing (no external queue initially)

**Rationale:**
- Built-in FastAPI feature, no additional dependencies
- Sufficient for single-server deployment
- Simpler deployment and maintenance
- Can migrate to Redis/Celery later if needed

**Migration path:** If scale requires, add Redis + Celery for distributed task queue

**Alternatives considered:**
- Celery + Redis: More complex, overkill for initial deployment
- Threading: Not needed separately; FastAPI BackgroundTasks already runs synchronous (`def`) task functions in a thread pool

### Decision 6: File Storage Strategy
**Choice:** Local filesystem with automatic cleanup (24-hour retention)

**Rationale:**
- Simple implementation, no S3/cloud storage costs
- OCR results stored in database (permanent)
- Original files temporary, only needed during processing
- Automatic cleanup prevents disk space issues

**Storage structure:**
```
uploads/
  {batch_id}/
    {file_id}_original.png
    {file_id}_preprocessed.png (if preprocessing enabled)
```

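**Cleanup:** A daily cron job or periodic background task deletes files older than 24 hours. With a once-daily run, a file can remain on disk for up to about 48 hours, so run the job more often if the 24-hour limit must be strict.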
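
The cleanup step could be sketched as below. This is a minimal illustration rather than the project's actual implementation: the function name `cleanup_old_batches` is hypothetical, and it assumes the `uploads/{batch_id}/` layout shown above with file modification time as the age signal.

```python
import shutil
import time
from pathlib import Path
from typing import List, Optional

RETENTION_SECONDS = 24 * 60 * 60  # 24-hour retention from this decision


def cleanup_old_batches(upload_root: Path, now: Optional[float] = None) -> List[str]:
    """Delete batch directories whose newest file is older than the retention window.

    Returns the sorted names of the removed batch directories.
    """
    now = time.time() if now is None else now
    removed: List[str] = []
    if not upload_root.is_dir():
        return removed
    for batch_dir in upload_root.iterdir():
        if not batch_dir.is_dir():
            continue
        # Use the most recent file modification time in the batch, so a batch
        # that is still receiving files is not deleted halfway through.
        mtimes = [p.stat().st_mtime for p in batch_dir.rglob("*") if p.is_file()]
        newest = max(mtimes, default=batch_dir.stat().st_mtime)
        if now - newest > RETENTION_SECONDS:
            shutil.rmtree(batch_dir)
            removed.append(batch_dir.name)
    return sorted(removed)
```

A cron entry or a periodic background task would call `cleanup_old_batches(Path("uploads"))`. Removing whole batch directories, rather than individual files, keeps the original and preprocessed images of a batch together.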

### Decision 7: Real-time Progress Updates
**Choice:** HTTP polling instead of WebSocket

**Rationale:**
- Simpler implementation and deployment
- Works through a standard Nginx reverse proxy (as managed by 1Panel) without WebSocket upgrade configuration
- Sufficient UX for batch processing (poll every 2 seconds)
- No need for persistent connections

**API:** `GET /api/v1/batch/{batch_id}/status` returns progress percentage
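
As a rough sketch of what the status endpoint might compute, the helper below derives a progress percentage and a naive time estimate from per-file statuses. The status strings (`"completed"`, `"failed"`, and so on), the function name, and the response keys are assumptions for illustration; the design does not fix them.

```python
from typing import Dict, List, Optional

TERMINAL_STATUSES = {"completed", "failed"}  # assumed status values


def batch_progress(file_statuses: List[str], elapsed_seconds: float) -> Dict[str, object]:
    """Summarise a batch for the polling client."""
    total = len(file_statuses)
    done = sum(1 for s in file_statuses if s in TERMINAL_STATUSES)
    failed = file_statuses.count("failed")
    percent = round(100 * done / total) if total else 0

    # Naive ETA: assume each remaining file takes the average time seen so far.
    eta: Optional[float] = None
    if 0 < done < total:
        eta = elapsed_seconds / done * (total - done)
    elif total and done == total:
        eta = 0.0

    return {
        "total": total,
        "completed": done - failed,
        "failed": failed,
        "progress_percent": percent,
        "eta_seconds": eta,
    }
```

Counting failed files as finished keeps the progress bar moving when individual files fail, which matches the partial-failure handling the schema is meant to support.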

**Alternatives considered:**
- WebSocket: More complex, requires special Nginx config, overkill for this use case

### Decision 8: Database Schema Design
**Choice:** Separate tables for tasks, files, and results (normalized)

**Schema:**
```sql
users (id, username, password_hash, created_at)
ocr_batches (id, user_id, status, created_at, completed_at)
ocr_files (id, batch_id, filename, file_path, file_size, status)
ocr_results (id, file_id, text, bbox_json, confidence, language)
export_rules (id, user_id, rule_name, config_json)
```

**Rationale:**
- Normalized for data integrity
- Supports batch tracking and partial failures
- Easy to query individual file results or batch statistics
- Export rules are stored per user (`user_id`) and reusable across that user's batches
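
To illustrate the batch-statistics query mentioned above, here is a small sketch that uses an in-memory SQLite database as a stand-in for the external MySQL server. The column types and the `batch_statistics` helper are assumptions; only the table and column names come from the schema.

```python
import sqlite3
from typing import Dict

# Simplified, SQLite-flavoured stand-in for two of the tables above.
# Column types are assumptions; the production database is MySQL.
SCHEMA = """
CREATE TABLE ocr_batches (
    id INTEGER PRIMARY KEY,
    user_id INTEGER NOT NULL,
    status TEXT NOT NULL
);
CREATE TABLE ocr_files (
    id INTEGER PRIMARY KEY,
    batch_id INTEGER NOT NULL REFERENCES ocr_batches(id),
    filename TEXT NOT NULL,
    status TEXT NOT NULL
);
"""


def batch_statistics(conn: sqlite3.Connection, batch_id: int) -> Dict[str, int]:
    """Count the files of one batch per status (e.g. completed vs failed)."""
    rows = conn.execute(
        "SELECT status, COUNT(*) FROM ocr_files WHERE batch_id = ? GROUP BY status",
        (batch_id,),
    ).fetchall()
    return dict(rows)


conn = sqlite3.connect(":memory:")
conn.executescript(SCHEMA)
conn.execute("INSERT INTO ocr_batches (id, user_id, status) VALUES (1, 1, 'processing')")
conn.executemany(
    "INSERT INTO ocr_files (batch_id, filename, status) VALUES (?, ?, ?)",
    [(1, "a.png", "completed"), (1, "b.png", "completed"), (1, "c.png", "failed")],
)
print(batch_statistics(conn, 1))
```

Because file status lives in `ocr_files` rather than on the batch row, a single `GROUP BY` is enough to report partial failures without touching `ocr_results`.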

### Decision 9: Export Rule Configuration Format
**Choice:** JSON-based rule configuration stored in database

**Example rule:**
```json
{
  "filters": {
    "min_confidence": 0.8,
    "filename_pattern": "^invoice_.*"
  },
  "formatting": {
    "add_line_numbers": true,
    "sort_by_position": true,
    "group_by_page": true
  },
  "output": {
    "format": "txt",
    "encoding": "utf-8",
    "line_separator": "\n"
  }
}
```

**Rationale:**
- Flexible and extensible
- Easy to validate with JSON schema
- Can be edited via UI or API
- Supports complex rules without database schema changes
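
A minimal sketch of how such a rule might be applied to OCR results for the `txt` format follows. The result fields (`filename`, `page`, `x`, `y`) and the function name are hypothetical, since the design does not define the in-memory shape of a result, and `group_by_page` is deliberately left out to keep the sketch short.

```python
import re
from typing import Any, Dict, List


def apply_export_rule(results: List[Dict[str, Any]], rule: Dict[str, Any]) -> str:
    """Filter, order, and join OCR text lines according to an export rule."""
    filters = rule.get("filters", {})
    formatting = rule.get("formatting", {})
    output = rule.get("output", {})

    min_conf = filters.get("min_confidence", 0.0)
    # An empty pattern matches every filename.
    pattern = re.compile(filters.get("filename_pattern", ""))

    kept = [
        r for r in results
        if r["confidence"] >= min_conf and pattern.search(r["filename"])
    ]

    if formatting.get("sort_by_position"):
        # Reading order: page first, then top-to-bottom, then left-to-right.
        kept.sort(key=lambda r: (r["page"], r["y"], r["x"]))

    lines = [r["text"] for r in kept]
    if formatting.get("add_line_numbers"):
        lines = [f"{i}: {text}" for i, text in enumerate(lines, start=1)]

    return output.get("line_separator", "\n").join(lines)
```

Reading every key with a default means a rule that omits a section still works, which is what allows new rule options to be added without a schema change.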

### Decision 10: Deployment Architecture (1Panel)
**Choice:** Nginx (static files + reverse proxy) + Supervisor (backend process manager)

**Architecture:**
```
[Client Browser]
      ↓
[Nginx :80/443] (managed by 1Panel)
      ↓
      ├─ /          → Frontend static files (React build)
      ├─ /assets    → Static assets
      └─ /api       → Reverse proxy to backend :12010
            ↓
      [FastAPI Backend :12010] (managed by Supervisor)
            ↓
      [MySQL :33306] (external)
```

**Rationale:**
- 1Panel provides GUI for Nginx management
- Supervisor ensures backend auto-restart on failure
- No Docker simplifies deployment on existing infrastructure
- Standard Nginx config works without special 1Panel requirements

**Supervisor config:**
```ini
[program:tool_ocr_backend]
command=/home/user/.conda/envs/tool_ocr/bin/uvicorn app.main:app --host 127.0.0.1 --port 12010
directory=/path/to/Tool_OCR/backend
user=www-data
autostart=true
autorestart=true
```

## Risks / Trade-offs

### Risk 1: OCR Processing Time for Large Batches
**Risk:** Processing 50+ images may take 5-10 minutes, which can exceed HTTP request and reverse-proxy timeouts

**Mitigation:**
- Use FastAPI BackgroundTasks to avoid HTTP timeout
- Return batch_id immediately, client polls for status
- Display progress bar with estimated time remaining
- Limit max batch size to 50 files (configurable)
- Add worker concurrency limit to prevent resource exhaustion

### Risk 2: PaddleOCR Model Download on First Run
**Risk:** Models are 100-200MB, first-time download may fail or be slow

**Mitigation:**
- Pre-download models during deployment setup
- Provide manual download script for offline installation
- Cache models in shared directory for all users
- Include model version in deployment docs

### Risk 3: File Upload Size Limits
**Risk:** Users may try to upload very large PDFs (>20MB)

**Mitigation:**
- Enforce 20MB per file, 100MB per batch limits in frontend and backend
- Display clear error messages with limit information
- Provide guidance on compressing PDFs or splitting large files
- Consider adding image downsampling for huge images
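
The backend side of the size check could look roughly like this. It is a sketch under assumptions: `validate_upload` is a hypothetical helper, 1MB is taken to mean 1024 × 1024 bytes, and the caller is assumed to know each file's size before processing starts.

```python
from typing import List, Tuple

MAX_FILE_BYTES = 20 * 1024 * 1024    # 20MB per file (assuming 1MB = 1024 * 1024 bytes)
MAX_BATCH_BYTES = 100 * 1024 * 1024  # 100MB per batch


def validate_upload(files: List[Tuple[str, int]]) -> List[str]:
    """Check (filename, size_in_bytes) pairs against the upload limits.

    Returns human-readable error messages; an empty list means the batch is accepted.
    """
    errors: List[str] = []
    for name, size in files:
        if size > MAX_FILE_BYTES:
            errors.append(
                f"{name}: {size / 1024 / 1024:.1f}MB exceeds the 20MB per-file limit"
            )
    total = sum(size for _, size in files)
    if total > MAX_BATCH_BYTES:
        errors.append(
            f"Batch total {total / 1024 / 1024:.1f}MB exceeds the 100MB per-batch limit"
        )
    return errors
```

Returning all violations at once, instead of stopping at the first, lets the UI show every offending file in a single message.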

### Risk 4: Concurrent User Scaling
**Risk:** Multiple users uploading simultaneously may overwhelm CPU/memory

**Mitigation:**
- Limit concurrent OCR workers (e.g., 4 workers max)
- Gate OCR jobs started via FastAPI BackgroundTasks behind a shared concurrency limit (BackgroundTasks itself does not queue or throttle work)
- Monitor resource usage and add throttling if needed
- Document recommended server specs (8GB RAM, 4 CPU cores)
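
One way to enforce the worker limit, sketched with only the standard library, is an `asyncio.Semaphore` in front of thread-offloaded OCR calls. The `run_limited` helper is hypothetical and takes plain callables in place of real PaddleOCR invocations.

```python
import asyncio
from typing import Callable, List

MAX_OCR_WORKERS = 4  # "4 workers max" from the mitigation above


async def run_limited(
    jobs: List[Callable[[], str]], limit: int = MAX_OCR_WORKERS
) -> List[str]:
    """Run blocking jobs in worker threads, with at most `limit` running at once."""
    semaphore = asyncio.Semaphore(limit)

    async def guarded(job: Callable[[], str]) -> str:
        async with semaphore:
            # Blocking OCR work goes to a worker thread so the event loop
            # stays free to answer status-polling requests.
            return await asyncio.to_thread(job)

    # gather() preserves input order, so results line up with the jobs list.
    return await asyncio.gather(*(guarded(job) for job in jobs))
```

In this sketch the semaphore is created per call, so it only limits jobs within one batch. To cap concurrency across all users, the service would create one semaphore at startup and share it between requests.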

### Risk 5: Database Connection Pool Exhaustion
**Risk:** External MySQL may have connection limits

**Mitigation:**
- Cap the SQLAlchemy connection pool at 20 connections in total (`pool_size` plus `max_overflow`)
- Set `pool_timeout` and `pool_recycle` so that stalled or stale connections are not held indefinitely
- Close connections properly in all API endpoints
- Add health check endpoint to monitor database connectivity

## Migration Plan

### Phase 1: Initial Deployment
1. Setup Conda environment on production server
2. Install Python dependencies and download OCR models
3. Configure MySQL database and create tables
4. Build frontend static files (`npm run build`)
5. Configure Nginx via 1Panel (upload nginx.conf)
6. Setup Supervisor for backend process
7. Test with sample images

### Phase 2: Production Rollout
1. Create admin user account
2. Import sample export rules
3. Perform smoke tests (upload, OCR, export)
4. Monitor logs for errors
5. Setup daily cleanup cron job for old files
6. Enable HTTPS via 1Panel (Let's Encrypt)

### Phase 3: Monitoring and Optimization
1. Add application logging (file + console)
2. Monitor resource usage (CPU, memory, disk)
3. Optimize slow queries if needed
4. Tune worker concurrency based on actual load
5. Collect user feedback and iterate

### Rollback Plan
- Keep previous version in separate directory
- Use Supervisor to stop current version and start previous
- Database migrations should be backward compatible
- If major issues, restore database from backup

## Open Questions

1. **Should we add user registration, or use admin-created accounts only?**
   - Recommendation: Start with admin-created accounts for security, add registration later if needed

2. **Do we need audit logging for compliance?**
   - Recommendation: Add basic audit trail (who uploaded what, when) in database

3. **Should we support GPU acceleration for PaddleOCR?**
   - Recommendation: Optional, detect GPU on startup, fallback to CPU if unavailable

4. **What's the desired behavior for duplicate filenames in a batch?**
   - Recommendation: Auto-rename with suffix (e.g., `file.png`, `file_1.png`)

5. **Should export rules be shareable across users or private?**
   - Recommendation: Private by default, add "public templates" feature later
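
The auto-rename recommendation for duplicate filenames (question 4) could be sketched as follows. The helper name is hypothetical, and the sketch assumes bare, case-sensitive filenames without directory components.

```python
from pathlib import PurePath
from typing import List, Set


def dedupe_filenames(names: List[str]) -> List[str]:
    """Give repeated filenames a numeric suffix: file.png, file_1.png, file_2.png, ..."""
    seen: Set[str] = set()
    result: List[str] = []
    for name in names:
        path = PurePath(name)
        candidate, n = name, 0
        # Keep incrementing the suffix until the name is unused in this batch,
        # which also avoids clashing with an uploaded file already named file_1.png.
        while candidate in seen:
            n += 1
            candidate = f"{path.stem}_{n}{path.suffix}"
        seen.add(candidate)
        result.append(candidate)
    return result
```

For example, `dedupe_filenames(["file.png", "file.png"])` yields `["file.png", "file_1.png"]`, matching the naming in the recommendation.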