Technical Design Document

Context

Tool_OCR is a web-based batch OCR processing system with a separated frontend and backend. It must handle large file uploads, long-running OCR tasks, and multiple export formats while keeping the UI responsive and resource usage efficient.

Key stakeholders:

  • End users: Need simple, fast, reliable OCR processing
  • Developers: Need maintainable, testable code architecture
  • Operations: Need easy deployment via 1Panel, monitoring, and error tracking

Constraints:

  • Development on Windows with Conda (Python 3.10)
  • Deployment on Linux server via 1Panel (no Docker)
  • Port range: 12010-12019
  • External MySQL database (mysql.theaken.com:33306)
  • PaddleOCR models (~100-200MB per language)
  • Max file upload: 20MB per file, 100MB per batch

Goals / Non-Goals

Goals

  • Process images and PDFs with multi-language OCR (Chinese, English, Japanese, Korean)
  • Handle batch uploads with real-time progress tracking
  • Provide flexible export formats (TXT, JSON, Excel) with custom rules
  • Maintain responsive UI during long-running OCR tasks
  • Enable easy deployment and maintenance via 1Panel

Non-Goals

  • Real-time OCR streaming (batch processing only)
  • Cloud-based OCR services (local processing only)
  • Mobile app support (web UI only, desktop/tablet optimized)
  • Advanced image editing or annotation features
  • Multi-tenant SaaS architecture (single deployment per organization)

Decisions

Decision 1: FastAPI for Backend Framework

Choice: Use FastAPI instead of Flask or Django

Rationale:

  • Native async/await support for I/O-bound operations (file upload, database queries)
  • Automatic OpenAPI documentation (Swagger UI)
  • Built-in Pydantic validation for type safety
  • Better performance for concurrent requests
  • Modern Python 3.10+ features (type hints, async)

Alternatives considered:

  • Flask: Simpler but lacks native async, requires extensions
  • Django: Too heavyweight for API-only backend, includes unnecessary ORM features
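
A minimal endpoint sketch illustrating these points; the route and model names are hypothetical:

from fastapi import FastAPI
from pydantic import BaseModel

app = FastAPI(title="Tool_OCR API")  # serves Swagger UI at /docs automatically

class BatchCreateRequest(BaseModel):
    language: str = "ch"          # validated by Pydantic before the handler runs
    preprocessing: bool = False

@app.post("/api/v1/batch")
async def create_batch(req: BatchCreateRequest):
    # async handlers let the event loop serve other requests
    # while this one awaits I/O (database writes, file saves)
    return {"status": "created", "language": req.language}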

Decision 2: PaddleOCR as OCR Engine

Choice: Use PaddleOCR instead of Tesseract or cloud APIs

Rationale:

  • Excellent Chinese/multilingual support (key requirement)
  • Higher accuracy with deep learning models
  • Offline operation (no API costs or internet dependency)
  • Active development and good documentation
  • GPU acceleration support (optional)

Alternatives considered:

  • Tesseract: Lower accuracy for Chinese, older technology
  • Google Cloud Vision / AWS Textract: Requires internet, ongoing costs, data privacy concerns
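
For reference, a minimal recognition sketch against PaddleOCR's Python API; the exact constructor arguments and result structure vary across PaddleOCR releases, so treat this as illustrative:

from paddleocr import PaddleOCR

# lang="ch" covers Chinese plus English; "japan" / "korean" select those models.
# Models are fetched into a local cache on first use (see Risk 2).
ocr = PaddleOCR(use_angle_cls=True, lang="ch")

result = ocr.ocr("sample.png", cls=True)
for bbox, (text, confidence) in result[0]:
    print(f"{confidence:.2f}  {text}")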

Decision 3: React Query for API State Management

Choice: Use React Query (TanStack Query) instead of Redux

Rationale:

  • Designed specifically for server state (API calls, caching, refetching)
  • Built-in loading/error states
  • Automatic background refetching and cache invalidation
  • Reduces boilerplate compared to Redux
  • Better for our API-heavy use case

Alternatives considered:

  • Redux: Overkill for server state, more boilerplate
  • Plain Axios: Requires manual loading/error state management

Decision 4: Zustand for Client State

Choice: Use Zustand for global UI state (separate from React Query)

Rationale:

  • Lightweight (~1KB) with a simple API
  • No providers or context required
  • TypeScript-friendly
  • Works well alongside React Query
  • Only for UI state (selected files, filters, etc.)

Decision 5: Background Task Processing

Choice: FastAPI BackgroundTasks for OCR processing (no external queue initially)

Rationale:

  • Built-in FastAPI feature, no additional dependencies
  • Sufficient for single-server deployment
  • Simpler deployment and maintenance
  • Can migrate to Redis/Celery later if needed

Migration path: If scale requires it, add Redis + Celery as a distributed task queue

Alternatives considered:

  • Celery + Redis: More complex, overkill for initial deployment
  • Threading: FastAPI BackgroundTasks already runs sync tasks in a thread pool
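
The intended pattern, sketched with a hypothetical process_batch function: the endpoint schedules the work and returns immediately:

from fastapi import BackgroundTasks, FastAPI

app = FastAPI()

def process_batch(batch_id: str) -> None:
    # Runs after the response has been sent; sync functions are
    # executed in the server's thread pool.
    # ... load files, run OCR, write results to MySQL ...
    pass

@app.post("/api/v1/batch/{batch_id}/start")
async def start_batch(batch_id: str, tasks: BackgroundTasks):
    tasks.add_task(process_batch, batch_id)
    # The client then polls GET /api/v1/batch/{batch_id}/status (Decision 7)
    return {"batch_id": batch_id, "status": "processing"}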

Decision 6: File Storage Strategy

Choice: Local filesystem with automatic cleanup (24-hour retention)

Rationale:

  • Simple implementation, no S3/cloud storage costs
  • OCR results stored in database (permanent)
  • Original files temporary, only needed during processing
  • Automatic cleanup prevents disk space issues

Storage structure:

uploads/
  {batch_id}/
    {file_id}_original.png
    {file_id}_preprocessed.png (if preprocessing enabled)

Cleanup: Daily cron job or background task deletes files older than 24 hours
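
A minimal cleanup sketch, assuming the uploads/ layout above:

import shutil
import time
from pathlib import Path

UPLOAD_ROOT = Path("uploads")
RETENTION_SECONDS = 24 * 3600

def cleanup_old_batches() -> None:
    # Delete batch directories whose last modification time
    # is older than the 24-hour retention window.
    cutoff = time.time() - RETENTION_SECONDS
    for batch_dir in UPLOAD_ROOT.iterdir():
        if batch_dir.is_dir() and batch_dir.stat().st_mtime < cutoff:
            shutil.rmtree(batch_dir, ignore_errors=True)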

Decision 7: Real-time Progress Updates

Choice: HTTP polling instead of WebSocket

Rationale:

  • Simpler implementation and deployment
  • Works better with Nginx reverse proxy and 1Panel
  • Sufficient UX for batch processing (poll every 2 seconds)
  • No need for persistent connections

API: GET /api/v1/batch/{batch_id}/status returns progress percentage

Alternatives considered:

  • WebSocket: More complex, requires special Nginx config, overkill for this use case
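
A sketch of the status endpoint; the response fields shown are assumptions pending the API spec:

from fastapi import FastAPI

app = FastAPI()

@app.get("/api/v1/batch/{batch_id}/status")
async def batch_status(batch_id: str):
    # In practice these counts come from the ocr_files table.
    done, total = 12, 50  # placeholder values
    return {
        "batch_id": batch_id,
        "status": "processing" if done < total else "completed",
        "progress": round(done / total * 100),
    }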

Decision 8: Database Schema Design

Choice: Separate tables for tasks, files, and results (normalized)

Schema:

users (id, username, password_hash, created_at)
ocr_batches (id, user_id, status, created_at, completed_at)
ocr_files (id, batch_id, filename, file_path, file_size, status)
ocr_results (id, file_id, text, bbox_json, confidence, language)
export_rules (id, user_id, rule_name, config_json)

Rationale:

  • Normalized for data integrity
  • Supports batch tracking and partial failures
  • Easy to query individual file results or batch statistics
  • Export rules stored per user and reusable across batches (cross-user sharing is an open question)
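
A partial SQLAlchemy sketch of two of these tables; column types and lengths are assumptions, and the full schema adds the remaining tables and indexes:

from sqlalchemy import Column, Float, ForeignKey, Integer, String, Text
from sqlalchemy.orm import declarative_base

Base = declarative_base()

class OcrFile(Base):
    __tablename__ = "ocr_files"
    id = Column(Integer, primary_key=True)
    batch_id = Column(Integer, ForeignKey("ocr_batches.id"), index=True)
    filename = Column(String(255), nullable=False)
    file_path = Column(String(512), nullable=False)
    file_size = Column(Integer)
    status = Column(String(20), default="pending")

class OcrResult(Base):
    __tablename__ = "ocr_results"
    id = Column(Integer, primary_key=True)
    file_id = Column(Integer, ForeignKey("ocr_files.id"), index=True)
    text = Column(Text)
    bbox_json = Column(Text)   # serialized bounding boxes
    confidence = Column(Float)
    language = Column(String(10))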

Decision 9: Export Rule Configuration Format

Choice: JSON-based rule configuration stored in database

Example rule:

{
  "filters": {
    "min_confidence": 0.8,
    "filename_pattern": "^invoice_.*"
  },
  "formatting": {
    "add_line_numbers": true,
    "sort_by_position": true,
    "group_by_page": true
  },
  "output": {
    "format": "txt",
    "encoding": "utf-8",
    "line_separator": "\n"
  }
}

Rationale:

  • Flexible and extensible
  • Easy to validate with JSON schema
  • Can be edited via UI or API
  • Supports complex rules without database schema changes
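
Validation could reuse Pydantic (v2 syntax below) rather than a separate JSON-schema library; these model names are hypothetical:

from pydantic import BaseModel, Field

class RuleFilters(BaseModel):
    min_confidence: float = Field(0.0, ge=0.0, le=1.0)
    filename_pattern: str | None = None

class RuleFormatting(BaseModel):
    add_line_numbers: bool = False
    sort_by_position: bool = False
    group_by_page: bool = False

class RuleOutput(BaseModel):
    format: str = "txt"        # txt | json | excel
    encoding: str = "utf-8"
    line_separator: str = "\n"

class ExportRule(BaseModel):
    filters: RuleFilters = RuleFilters()
    formatting: RuleFormatting = RuleFormatting()
    output: RuleOutput = RuleOutput()

# Parse and validate a config_json value loaded from the database:
# rule = ExportRule.model_validate_json(config_json)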

Decision 10: Deployment Architecture (1Panel)

Choice: Nginx (static files + reverse proxy) + Supervisor (backend process manager)

Architecture:

[Client Browser]
      ↓
[Nginx :80/443] (managed by 1Panel)
      ↓
      ├─ /          → Frontend static files (React build)
      ├─ /assets    → Static assets
      └─ /api       → Reverse proxy to backend :12010
            ↓
      [FastAPI Backend :12010] (managed by Supervisor)
            ↓
      [MySQL :33306] (external)

Rationale:

  • 1Panel provides GUI for Nginx management
  • Supervisor ensures backend auto-restart on failure
  • No Docker simplifies deployment on existing infrastructure
  • Standard Nginx config works without special 1Panel requirements
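
An illustrative Nginx location block for the /api route; the actual server block and paths come from the 1Panel site configuration:

location /api {
    proxy_pass http://127.0.0.1:12010;
    proxy_set_header Host $host;
    proxy_set_header X-Real-IP $remote_addr;
    # Raise limits to match the 100MB batch cap and long-running uploads
    client_max_body_size 100m;
    proxy_read_timeout 300s;
}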

Supervisor config:

[program:tool_ocr_backend]
command=/home/user/.conda/envs/tool_ocr/bin/uvicorn app.main:app --host 127.0.0.1 --port 12010
directory=/path/to/Tool_OCR/backend
user=www-data
autostart=true
autorestart=true

Risks / Trade-offs

Risk 1: OCR Processing Time for Large Batches

Risk: Processing 50+ images may take 5-10 minutes, risking HTTP request timeouts

Mitigation:

  • Use FastAPI BackgroundTasks to avoid HTTP timeout
  • Return batch_id immediately, client polls for status
  • Display progress bar with estimated time remaining
  • Limit max batch size to 50 files (configurable)
  • Add worker concurrency limit to prevent resource exhaustion

Risk 2: PaddleOCR Model Download on First Run

Risk: Models are 100-200MB, first-time download may fail or be slow

Mitigation:

  • Pre-download models during deployment setup
  • Provide manual download script for offline installation
  • Cache models in shared directory for all users
  • Include model version in deployment docs
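
Since PaddleOCR downloads models on first instantiation, the pre-download script can simply construct one engine per language during deployment setup (language codes follow PaddleOCR conventions):

from paddleocr import PaddleOCR

# Trigger model downloads into the local cache for each supported
# language so first user requests don't stall on a 100-200MB fetch.
for lang in ("ch", "en", "japan", "korean"):
    PaddleOCR(lang=lang)
    print(f"models ready: {lang}")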

Risk 3: File Upload Size Limits

Risk: Users may try to upload very large PDFs (>20MB)

Mitigation:

  • Enforce 20MB per file, 100MB per batch limits in frontend and backend
  • Display clear error messages with limit information
  • Provide guidance on compressing PDFs or splitting large files
  • Consider adding image downsampling for huge images
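
A backend enforcement sketch; in practice the limits would come from configuration rather than inline constants:

from fastapi import FastAPI, File, HTTPException, UploadFile

app = FastAPI()

MAX_FILE_BYTES = 20 * 1024 * 1024    # 20MB per file
MAX_BATCH_BYTES = 100 * 1024 * 1024  # 100MB per batch

@app.post("/api/v1/batch/upload")
async def upload_batch(files: list[UploadFile] = File(...)):
    total = 0
    for f in files:
        data = await f.read()  # files are capped at 20MB, so in-memory reads are fine
        if len(data) > MAX_FILE_BYTES:
            raise HTTPException(status_code=413, detail=f"{f.filename} exceeds the 20MB per-file limit")
        total += len(data)
        if total > MAX_BATCH_BYTES:
            raise HTTPException(status_code=413, detail="batch exceeds the 100MB total limit")
        # ... persist data under uploads/{batch_id}/ ...
    return {"accepted": len(files)}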

Risk 4: Concurrent User Scaling

Risk: Multiple users uploading simultaneously may overwhelm CPU/memory

Mitigation:

  • Limit concurrent OCR workers (e.g., 4 workers max)
  • Implement task queue with FastAPI BackgroundTasks
  • Monitor resource usage and add throttling if needed
  • Document recommended server specs (8GB RAM, 4 CPU cores)
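
One way to cap concurrency is a bounded thread pool shared by all background tasks; run_ocr here is a hypothetical per-file OCR wrapper:

from concurrent.futures import ThreadPoolExecutor

# At most 4 OCR jobs run at once, regardless of how many
# batches have been queued via BackgroundTasks.
ocr_pool = ThreadPoolExecutor(max_workers=4)

def run_ocr(path: str) -> None:
    # Hypothetical wrapper: load image, call PaddleOCR, store the result
    ...

def process_batch(file_paths: list[str]) -> None:
    futures = [ocr_pool.submit(run_ocr, p) for p in file_paths]
    for fut in futures:
        fut.result()  # re-raise worker errors so the batch can be marked failed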

Risk 5: Database Connection Pool Exhaustion

Risk: External MySQL may have connection limits

Mitigation:

  • Configure SQLAlchemy connection pool (max 20 connections)
  • Use connection pooling with proper timeout settings
  • Close connections properly in all API endpoints
  • Add health check endpoint to monitor database connectivity
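
An engine configuration sketch mirroring the points above; credentials and the database name are placeholders:

from sqlalchemy import create_engine

engine = create_engine(
    "mysql+pymysql://user:password@mysql.theaken.com:33306/tool_ocr",
    pool_size=20,        # steady-state connections
    max_overflow=0,      # hard cap: never exceed 20
    pool_timeout=30,     # seconds to wait for a free connection
    pool_recycle=1800,   # recycle before MySQL's wait_timeout closes it
    pool_pre_ping=True,  # validate connections before use
)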

Migration Plan

Phase 1: Initial Deployment

  1. Setup Conda environment on production server
  2. Install Python dependencies and download OCR models
  3. Configure MySQL database and create tables
  4. Build frontend static files (npm run build)
  5. Configure Nginx via 1Panel (upload nginx.conf)
  6. Setup Supervisor for backend process
  7. Test with sample images

Phase 2: Production Rollout

  1. Create admin user account
  2. Import sample export rules
  3. Perform smoke tests (upload, OCR, export)
  4. Monitor logs for errors
  5. Setup daily cleanup cron job for old files
  6. Enable HTTPS via 1Panel (Let's Encrypt)

Phase 3: Monitoring and Optimization

  1. Add application logging (file + console)
  2. Monitor resource usage (CPU, memory, disk)
  3. Optimize slow queries if needed
  4. Tune worker concurrency based on actual load
  5. Collect user feedback and iterate

Rollback Plan

  • Keep previous version in separate directory
  • Use Supervisor to stop current version and start previous
  • Database migrations should be backward compatible
  • If major issues, restore database from backup

Open Questions

  1. Should we add user registration, or use admin-created accounts only?
    • Recommendation: Start with admin-created accounts for security; add registration later if needed

  2. Do we need audit logging for compliance?
    • Recommendation: Add a basic audit trail in the database (who uploaded what, and when)

  3. Should we support GPU acceleration for PaddleOCR?
    • Recommendation: Optional; detect a GPU on startup and fall back to CPU if unavailable

  4. What's the desired behavior for duplicate filenames in a batch?
    • Recommendation: Auto-rename with a suffix (e.g., file.png, file_1.png); see the sketch after this list

  5. Should export rules be shareable across users or private?
    • Recommendation: Private by default; add a "public templates" feature later
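
A possible auto-rename helper for question 4, purely illustrative:

from pathlib import Path

def dedupe_filename(name: str, seen: set[str]) -> str:
    # file.png -> file_1.png -> file_2.png ... until unique within the batch
    stem, suffix = Path(name).stem, Path(name).suffix
    candidate, n = name, 0
    while candidate in seen:
        n += 1
        candidate = f"{stem}_{n}{suffix}"
    seen.add(candidate)
    return candidate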