Technical Design Document

Context

Tool_OCR is a web-based batch OCR processing system with a separated frontend and backend. It must handle large file uploads, long-running OCR tasks, and multiple export formats while keeping the UI responsive and resource usage efficient.

Key stakeholders:

  • End users: Need simple, fast, reliable OCR processing
  • Developers: Need maintainable, testable code architecture
  • Operations: Need easy deployment via 1Panel, monitoring, and error tracking

Constraints:

  • Development on Windows with Conda (Python 3.10)
  • Deployment on Linux server via 1Panel (no Docker)
  • Port range: 12010-12019
  • External MySQL database (mysql.theaken.com:33306)
  • PaddleOCR models (~100-200MB per language)
  • Max file upload: 20MB per file, 100MB per batch

Goals / Non-Goals

Goals

  • Process images and PDFs with multi-language OCR (Chinese, English, Japanese, Korean)
  • Handle batch uploads with real-time progress tracking
  • Provide flexible export formats (TXT, JSON, Excel) with custom rules
  • Maintain responsive UI during long-running OCR tasks
  • Enable easy deployment and maintenance via 1Panel

Non-Goals

  • Real-time OCR streaming (batch processing only)
  • Cloud-based OCR services (local processing only)
  • Mobile app support (web UI only, desktop/tablet optimized)
  • Advanced image editing or annotation features
  • Multi-tenant SaaS architecture (single deployment per organization)

Decisions

Decision 1: FastAPI for Backend Framework

Choice: Use FastAPI instead of Flask or Django

Rationale:

  • Native async/await support for I/O-bound operations (file upload, database queries)
  • Automatic OpenAPI documentation (Swagger UI)
  • Built-in Pydantic validation for type safety
  • Better performance for concurrent requests
  • Modern Python 3.10+ features (type hints, async)

Alternatives considered:

  • Flask: Simpler but lacks native async, requires extensions
  • Django: Too heavyweight for API-only backend, includes unnecessary ORM features
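
A minimal endpoint sketch illustrating these points; the route and model names are hypothetical:

from fastapi import FastAPI
from pydantic import BaseModel

app = FastAPI(title="Tool_OCR API")  # serves Swagger UI at /docs automatically

class BatchCreateRequest(BaseModel):
    language: str = "ch"          # validated by Pydantic before the handler runs
    preprocessing: bool = False

@app.post("/api/v1/batch")
async def create_batch(req: BatchCreateRequest):
    # async handlers let the event loop serve other requests
    # while this one awaits I/O (database writes, file saves)
    return {"status": "created", "language": req.language}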

Decision 2: PaddleOCR as OCR Engine

Choice: Use PaddleOCR instead of Tesseract or cloud APIs

Rationale:

  • Excellent Chinese/multilingual support (key requirement)
  • Higher accuracy with deep learning models
  • Offline operation (no API costs or internet dependency)
  • Active development and good documentation
  • GPU acceleration support (optional)

Alternatives considered:

  • Tesseract: Lower accuracy for Chinese, older technology
  • Google Cloud Vision / AWS Textract: Requires internet, ongoing costs, data privacy concerns
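
For reference, a minimal recognition sketch against PaddleOCR's Python API; the exact constructor arguments and result structure vary across PaddleOCR releases, so treat this as illustrative:

from paddleocr import PaddleOCR

# lang="ch" covers Chinese plus English; "japan" / "korean" select those models.
# Models are fetched into a local cache on first use (see Risk 2).
ocr = PaddleOCR(use_angle_cls=True, lang="ch")

result = ocr.ocr("sample.png", cls=True)
for bbox, (text, confidence) in result[0]:
    print(f"{confidence:.2f}  {text}")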

Decision 3: React Query for API State Management

Choice: Use React Query (TanStack Query) instead of Redux

Rationale:

  • Designed specifically for server state (API calls, caching, refetching)
  • Built-in loading/error states
  • Automatic background refetching and cache invalidation
  • Reduces boilerplate compared to Redux
  • Better for our API-heavy use case

Alternatives considered:

  • Redux: Overkill for server state, more boilerplate
  • Plain Axios: Requires manual loading/error state management

Decision 4: Zustand for Client State

Choice: Use Zustand for global UI state (separate from React Query)

Rationale:

  • Lightweight (~1KB) with a simple API
  • No providers or context required
  • TypeScript-friendly
  • Works well alongside React Query
  • Only for UI state (selected files, filters, etc.)

Decision 5: Background Task Processing

Choice: FastAPI BackgroundTasks for OCR processing (no external queue initially)

Rationale:

  • Built-in FastAPI feature, no additional dependencies
  • Sufficient for single-server deployment
  • Simpler deployment and maintenance
  • Can migrate to Redis/Celery later if needed

Migration path: If scale requires it, add Redis + Celery as a distributed task queue

Alternatives considered:

  • Celery + Redis: More complex, overkill for initial deployment
  • Threading: FastAPI BackgroundTasks already runs sync tasks in a thread pool
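
The intended pattern, sketched with a hypothetical process_batch function: the endpoint schedules the work and returns immediately:

from fastapi import BackgroundTasks, FastAPI

app = FastAPI()

def process_batch(batch_id: str) -> None:
    # Runs after the response has been sent; sync functions are
    # executed in the server's thread pool.
    # ... load files, run OCR, write results to MySQL ...
    pass

@app.post("/api/v1/batch/{batch_id}/start")
async def start_batch(batch_id: str, tasks: BackgroundTasks):
    tasks.add_task(process_batch, batch_id)
    # The client then polls GET /api/v1/batch/{batch_id}/status (Decision 7)
    return {"batch_id": batch_id, "status": "processing"}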

Decision 6: File Storage Strategy

Choice: Local filesystem with automatic cleanup (24-hour retention)

Rationale:

  • Simple implementation, no S3/cloud storage costs
  • OCR results stored in database (permanent)
  • Original files temporary, only needed during processing
  • Automatic cleanup prevents disk space issues

Storage structure:

uploads/
  {batch_id}/
    {file_id}_original.png
    {file_id}_preprocessed.png (if preprocessing enabled)

Cleanup: Daily cron job or background task deletes files older than 24 hours
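
A minimal cleanup sketch, assuming the uploads/ layout above:

import shutil
import time
from pathlib import Path

UPLOAD_ROOT = Path("uploads")
RETENTION_SECONDS = 24 * 3600

def cleanup_old_batches() -> None:
    # Delete batch directories whose last modification time
    # is older than the 24-hour retention window.
    cutoff = time.time() - RETENTION_SECONDS
    for batch_dir in UPLOAD_ROOT.iterdir():
        if batch_dir.is_dir() and batch_dir.stat().st_mtime < cutoff:
            shutil.rmtree(batch_dir, ignore_errors=True)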

Decision 7: Real-time Progress Updates

Choice: HTTP polling instead of WebSocket

Rationale:

  • Simpler implementation and deployment
  • Works better with Nginx reverse proxy and 1Panel
  • Sufficient UX for batch processing (poll every 2 seconds)
  • No need for persistent connections

API: GET /api/v1/batch/{batch_id}/status returns progress percentage

Alternatives considered:

  • WebSocket: More complex, requires special Nginx config, overkill for this use case
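
A sketch of the status endpoint; the response fields shown are assumptions pending the API spec:

from fastapi import FastAPI

app = FastAPI()

@app.get("/api/v1/batch/{batch_id}/status")
async def batch_status(batch_id: str):
    # In practice these counts come from the ocr_files table.
    done, total = 12, 50  # placeholder values
    return {
        "batch_id": batch_id,
        "status": "processing" if done < total else "completed",
        "progress": round(done / total * 100),
    }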

Decision 8: Database Schema Design

Choice: Separate tables for tasks, files, and results (normalized)

Schema:

users (id, username, password_hash, created_at)
ocr_batches (id, user_id, status, created_at, completed_at)
ocr_files (id, batch_id, filename, file_path, file_size, status)
ocr_results (id, file_id, text, bbox_json, confidence, language)
export_rules (id, user_id, rule_name, config_json)

Rationale:

  • Normalized for data integrity
  • Supports batch tracking and partial failures
  • Easy to query individual file results or batch statistics
  • Export rules stored per user and reusable across batches (cross-user sharing is an open question)
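
A partial SQLAlchemy sketch of two of these tables; column types and lengths are assumptions, and the full schema adds the remaining tables and indexes:

from sqlalchemy import Column, Float, ForeignKey, Integer, String, Text
from sqlalchemy.orm import declarative_base

Base = declarative_base()

class OcrFile(Base):
    __tablename__ = "ocr_files"
    id = Column(Integer, primary_key=True)
    batch_id = Column(Integer, ForeignKey("ocr_batches.id"), index=True)
    filename = Column(String(255), nullable=False)
    file_path = Column(String(512), nullable=False)
    file_size = Column(Integer)
    status = Column(String(20), default="pending")

class OcrResult(Base):
    __tablename__ = "ocr_results"
    id = Column(Integer, primary_key=True)
    file_id = Column(Integer, ForeignKey("ocr_files.id"), index=True)
    text = Column(Text)
    bbox_json = Column(Text)   # serialized bounding boxes
    confidence = Column(Float)
    language = Column(String(10))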

Decision 9: Export Rule Configuration Format

Choice: JSON-based rule configuration stored in database

Example rule:

{
  "filters": {
    "min_confidence": 0.8,
    "filename_pattern": "^invoice_.*"
  },
  "formatting": {
    "add_line_numbers": true,
    "sort_by_position": true,
    "group_by_page": true
  },
  "output": {
    "format": "txt",
    "encoding": "utf-8",
    "line_separator": "\n"
  }
}

Rationale:

  • Flexible and extensible
  • Easy to validate with JSON schema
  • Can be edited via UI or API
  • Supports complex rules without database schema changes
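
Validation could reuse Pydantic (v2 syntax below) rather than a separate JSON-schema library; these model names are hypothetical:

from pydantic import BaseModel, Field

class RuleFilters(BaseModel):
    min_confidence: float = Field(0.0, ge=0.0, le=1.0)
    filename_pattern: str | None = None

class RuleFormatting(BaseModel):
    add_line_numbers: bool = False
    sort_by_position: bool = False
    group_by_page: bool = False

class RuleOutput(BaseModel):
    format: str = "txt"        # txt | json | excel
    encoding: str = "utf-8"
    line_separator: str = "\n"

class ExportRule(BaseModel):
    filters: RuleFilters = RuleFilters()
    formatting: RuleFormatting = RuleFormatting()
    output: RuleOutput = RuleOutput()

# Parse and validate a config_json value loaded from the database:
# rule = ExportRule.model_validate_json(config_json)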

Decision 10: Deployment Architecture (1Panel)

Choice: Nginx (static files + reverse proxy) + Supervisor (backend process manager)

Architecture:

[Client Browser]
      ↓
[Nginx :80/443] (managed by 1Panel)
      ↓
      ├─ /          → Frontend static files (React build)
      ├─ /assets    → Static assets
      └─ /api       → Reverse proxy to backend :12010
            ↓
      [FastAPI Backend :12010] (managed by Supervisor)
            ↓
      [MySQL :33306] (external)

Rationale:

  • 1Panel provides GUI for Nginx management
  • Supervisor ensures backend auto-restart on failure
  • No Docker simplifies deployment on existing infrastructure
  • Standard Nginx config works without special 1Panel requirements
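
An illustrative Nginx location block for the /api route; the actual server block and paths come from the 1Panel site configuration:

location /api {
    proxy_pass http://127.0.0.1:12010;
    proxy_set_header Host $host;
    proxy_set_header X-Real-IP $remote_addr;
    # Raise limits to match the 100MB batch cap and long-running uploads
    client_max_body_size 100m;
    proxy_read_timeout 300s;
}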

Supervisor config:

[program:tool_ocr_backend]
command=/home/user/.conda/envs/tool_ocr/bin/uvicorn app.main:app --host 127.0.0.1 --port 12010
directory=/path/to/Tool_OCR/backend
user=www-data
autostart=true
autorestart=true

Risks / Trade-offs

Risk 1: OCR Processing Time for Large Batches

Risk: Processing 50+ images may take 5-10 minutes, risking HTTP request timeouts

Mitigation:

  • Use FastAPI BackgroundTasks to avoid HTTP timeout
  • Return batch_id immediately, client polls for status
  • Display progress bar with estimated time remaining
  • Limit max batch size to 50 files (configurable)
  • Add worker concurrency limit to prevent resource exhaustion

Risk 2: PaddleOCR Model Download on First Run

Risk: Models are 100-200MB, first-time download may fail or be slow

Mitigation:

  • Pre-download models during deployment setup
  • Provide manual download script for offline installation
  • Cache models in shared directory for all users
  • Include model version in deployment docs
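
Since PaddleOCR downloads models on first instantiation, the pre-download script can simply construct one engine per language during deployment setup (language codes follow PaddleOCR conventions):

from paddleocr import PaddleOCR

# Trigger model downloads into the local cache for each supported
# language so first user requests don't stall on a 100-200MB fetch.
for lang in ("ch", "en", "japan", "korean"):
    PaddleOCR(lang=lang)
    print(f"models ready: {lang}")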

Risk 3: File Upload Size Limits

Risk: Users may try to upload very large PDFs (>20MB)

Mitigation:

  • Enforce 20MB per file, 100MB per batch limits in frontend and backend
  • Display clear error messages with limit information
  • Provide guidance on compressing PDFs or splitting large files
  • Consider adding image downsampling for huge images
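
A backend enforcement sketch; in practice the limits would come from configuration rather than inline constants:

from fastapi import FastAPI, File, HTTPException, UploadFile

app = FastAPI()

MAX_FILE_BYTES = 20 * 1024 * 1024    # 20MB per file
MAX_BATCH_BYTES = 100 * 1024 * 1024  # 100MB per batch

@app.post("/api/v1/batch/upload")
async def upload_batch(files: list[UploadFile] = File(...)):
    total = 0
    for f in files:
        data = await f.read()  # files are capped at 20MB, so in-memory reads are fine
        if len(data) > MAX_FILE_BYTES:
            raise HTTPException(status_code=413, detail=f"{f.filename} exceeds the 20MB per-file limit")
        total += len(data)
        if total > MAX_BATCH_BYTES:
            raise HTTPException(status_code=413, detail="batch exceeds the 100MB total limit")
        # ... persist data under uploads/{batch_id}/ ...
    return {"accepted": len(files)}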

Risk 4: Concurrent User Scaling

Risk: Multiple users uploading simultaneously may overwhelm CPU/memory

Mitigation:

  • Limit concurrent OCR workers (e.g., 4 workers max)
  • Implement task queue with FastAPI BackgroundTasks
  • Monitor resource usage and add throttling if needed
  • Document recommended server specs (8GB RAM, 4 CPU cores)
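
One way to cap concurrency is a bounded thread pool shared by all background tasks; run_ocr here is a hypothetical per-file OCR wrapper:

from concurrent.futures import ThreadPoolExecutor

# At most 4 OCR jobs run at once, regardless of how many
# batches have been queued via BackgroundTasks.
ocr_pool = ThreadPoolExecutor(max_workers=4)

def run_ocr(path: str) -> None:
    # Hypothetical wrapper: load image, call PaddleOCR, store the result
    ...

def process_batch(file_paths: list[str]) -> None:
    futures = [ocr_pool.submit(run_ocr, p) for p in file_paths]
    for fut in futures:
        fut.result()  # re-raise worker errors so the batch can be marked failed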

Risk 5: Database Connection Pool Exhaustion

Risk: External MySQL may have connection limits

Mitigation:

  • Configure SQLAlchemy connection pool (max 20 connections)
  • Use connection pooling with proper timeout settings
  • Close connections properly in all API endpoints
  • Add health check endpoint to monitor database connectivity
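
An engine configuration sketch mirroring the points above; credentials and the database name are placeholders:

from sqlalchemy import create_engine

engine = create_engine(
    "mysql+pymysql://user:password@mysql.theaken.com:33306/tool_ocr",
    pool_size=20,        # steady-state connections
    max_overflow=0,      # hard cap: never exceed 20
    pool_timeout=30,     # seconds to wait for a free connection
    pool_recycle=1800,   # recycle before MySQL's wait_timeout closes it
    pool_pre_ping=True,  # validate connections before use
)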

Migration Plan

Phase 1: Initial Deployment

  1. Setup Conda environment on production server
  2. Install Python dependencies and download OCR models
  3. Configure MySQL database and create tables
  4. Build frontend static files (npm run build)
  5. Configure Nginx via 1Panel (upload nginx.conf)
  6. Setup Supervisor for backend process
  7. Test with sample images

Phase 2: Production Rollout

  1. Create admin user account
  2. Import sample export rules
  3. Perform smoke tests (upload, OCR, export)
  4. Monitor logs for errors
  5. Setup daily cleanup cron job for old files
  6. Enable HTTPS via 1Panel (Let's Encrypt)

Phase 3: Monitoring and Optimization

  1. Add application logging (file + console)
  2. Monitor resource usage (CPU, memory, disk)
  3. Optimize slow queries if needed
  4. Tune worker concurrency based on actual load
  5. Collect user feedback and iterate

Rollback Plan

  • Keep previous version in separate directory
  • Use Supervisor to stop current version and start previous
  • Database migrations should be backward compatible
  • If major issues, restore database from backup

Open Questions

  1. Should we add user registration, or use admin-created accounts only?
    • Recommendation: Start with admin-created accounts for security; add registration later if needed

  2. Do we need audit logging for compliance?
    • Recommendation: Add a basic audit trail in the database (who uploaded what, and when)

  3. Should we support GPU acceleration for PaddleOCR?
    • Recommendation: Optional; detect a GPU on startup and fall back to CPU if unavailable

  4. What's the desired behavior for duplicate filenames in a batch?
    • Recommendation: Auto-rename with a suffix (e.g., file.png, file_1.png); see the sketch after this list

  5. Should export rules be shareable across users or private?
    • Recommendation: Private by default; add a "public templates" feature later
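
A possible auto-rename helper for question 4, purely illustrative:

from pathlib import Path

def dedupe_filename(name: str, seen: set[str]) -> str:
    # file.png -> file_1.png -> file_2.png ... until unique within the batch
    stem, suffix = Path(name).stem, Path(name).suffix
    candidate, n = name, 0
    while candidate in seen:
        n += 1
        candidate = f"{stem}_{n}{suffix}"
    seen.add(candidate)
    return candidate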