Technical Design Document
Context
Tool_OCR is a web-based batch OCR processing system with a frontend-backend separated architecture. The system must handle large file uploads, long-running OCR tasks, and multiple export formats while keeping the UI responsive and resource usage efficient.
Key stakeholders:
- End users: Need simple, fast, reliable OCR processing
- Developers: Need maintainable, testable code architecture
- Operations: Need easy deployment via 1Panel, monitoring, and error tracking
Constraints:
- Development on Windows with Conda (Python 3.10)
- Deployment on Linux server via 1Panel (no Docker)
- Port range: 12010-12019
- External MySQL database (mysql.theaken.com:33306)
- PaddleOCR models (~100-200MB per language)
- Max file upload: 20MB per file, 100MB per batch
Goals / Non-Goals
Goals
- Process images and PDFs with multi-language OCR (Chinese, English, Japanese, Korean)
- Handle batch uploads with real-time progress tracking
- Provide flexible export formats (TXT, JSON, Excel) with custom rules
- Maintain responsive UI during long-running OCR tasks
- Enable easy deployment and maintenance via 1Panel
Non-Goals
- Real-time OCR streaming (batch processing only)
- Cloud-based OCR services (local processing only)
- Mobile app support (web UI only, desktop/tablet optimized)
- Advanced image editing or annotation features
- Multi-tenant SaaS architecture (single deployment per organization)
Decisions
Decision 1: FastAPI for Backend Framework
Choice: Use FastAPI instead of Flask or Django
Rationale:
- Native async/await support for I/O-bound operations (file upload, database queries)
- Automatic OpenAPI documentation (Swagger UI)
- Built-in Pydantic validation for type safety
- Better performance for concurrent requests
- Modern Python 3.10+ features (type hints, async)
Alternatives considered:
- Flask: Simpler but lacks native async, requires extensions
- Django: Too heavyweight for API-only backend, includes unnecessary ORM features
Decision 2: PaddleOCR as OCR Engine
Choice: Use PaddleOCR instead of Tesseract or cloud APIs
Rationale:
- Excellent Chinese/multilingual support (key requirement)
- Higher accuracy with deep learning models
- Offline operation (no API costs or internet dependency)
- Active development and good documentation
- GPU acceleration support (optional)
Alternatives considered:
- Tesseract: Lower accuracy for Chinese, older technology
- Google Cloud Vision / AWS Textract: Requires internet, ongoing costs, data privacy concerns
Decision 3: React Query for API State Management
Choice: Use React Query (TanStack Query) instead of Redux
Rationale:
- Designed specifically for server state (API calls, caching, refetching)
- Built-in loading/error states
- Automatic background refetching and cache invalidation
- Reduces boilerplate compared to Redux
- Better for our API-heavy use case
Alternatives considered:
- Redux: Overkill for server state, more boilerplate
- Plain Axios: Requires manual loading/error state management
Decision 4: Zustand for Client State
Choice: Use Zustand for global UI state (separate from React Query)
Rationale:
- Lightweight (1KB) and simple API
- No providers or context required
- TypeScript-friendly
- Works well alongside React Query
- Only for UI state (selected files, filters, etc.)
Decision 5: Background Task Processing
Choice: FastAPI BackgroundTasks for OCR processing (no external queue initially)
Rationale:
- Built-in FastAPI feature, no additional dependencies
- Sufficient for single-server deployment
- Simpler deployment and maintenance
- Can migrate to Redis/Celery later if needed
Migration path: If scale requires, add Redis + Celery for distributed task queue
Alternatives considered:
- Celery + Redis: More complex, overkill for initial deployment
- Threading: FastAPI BackgroundTasks already runs on a thread pool, so manual threading adds no benefit
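The bounded-worker idea behind this decision can be sketched with the standard library alone (MAX_WORKERS, process_file, and enqueue_batch are illustrative names, not part of the actual codebase):

```python
from concurrent.futures import ThreadPoolExecutor

# Illustrative sketch of bounded background processing: at most
# MAX_WORKERS OCR jobs run at once, which is the same back-pressure
# a dedicated task queue would provide.
MAX_WORKERS = 4  # cap concurrent OCR jobs to protect CPU/memory

executor = ThreadPoolExecutor(max_workers=MAX_WORKERS)

def process_file(file_id: str) -> str:
    # Placeholder for the real OCR step (PaddleOCR call, DB update, ...)
    return f"done:{file_id}"

def enqueue_batch(file_ids: list[str]) -> list[str]:
    # submit() returns immediately; results are gathered as workers finish.
    futures = [executor.submit(process_file, fid) for fid in file_ids]
    return [f.result() for f in futures]
```

If the migration to Celery + Redis ever happens, only the submission side of this sketch changes; callers still hand over a batch and poll for results.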
Decision 6: File Storage Strategy
Choice: Local filesystem with automatic cleanup (24-hour retention)
Rationale:
- Simple implementation, no S3/cloud storage costs
- OCR results stored in database (permanent)
- Original files temporary, only needed during processing
- Automatic cleanup prevents disk space issues
Storage structure:
uploads/
  {batch_id}/
    {file_id}_original.png
    {file_id}_preprocessed.png   (if preprocessing enabled)
Cleanup: Daily cron job or background task deletes files older than 24 hours
Decision 7: Real-time Progress Updates
Choice: HTTP polling instead of WebSocket
Rationale:
- Simpler implementation and deployment
- Works better with Nginx reverse proxy and 1Panel
- Sufficient UX for batch processing (poll every 2 seconds)
- No need for persistent connections
API: GET /api/v1/batch/{batch_id}/status returns progress percentage
Alternatives considered:
- WebSocket: More complex, requires special Nginx config, overkill for this use case
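The client-side half of this decision is just a loop that refetches the status endpoint until the batch reaches a terminal state; a sketch with the HTTP call stubbed out (`fetch_status` stands in for `GET /api/v1/batch/{batch_id}/status`, and the `state` values are illustrative):

```python
import time

def poll_batch(fetch_status, interval: float = 2.0, max_polls: int = 300):
    """Poll until the batch completes or fails; returns the final status dict.

    fetch_status is a zero-argument callable standing in for the
    GET /api/v1/batch/{batch_id}/status request.
    """
    for _ in range(max_polls):
        status = fetch_status()
        if status["state"] in ("completed", "failed"):
            return status
        time.sleep(interval)
    raise TimeoutError("batch did not finish within the polling budget")
```

In the React frontend the same loop falls out of React Query's `refetchInterval` option, so no hand-rolled polling code is needed there.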
Decision 8: Database Schema Design
Choice: Separate tables for tasks, files, and results (normalized)
Schema:
users (id, username, password_hash, created_at)
ocr_batches (id, user_id, status, created_at, completed_at)
ocr_files (id, batch_id, filename, file_path, file_size, status)
ocr_results (id, file_id, text, bbox_json, confidence, language)
export_rules (id, user_id, rule_name, config_json)
Rationale:
- Normalized for data integrity
- Supports batch tracking and partial failures
- Easy to query individual file results or batch statistics
- Export rules reusable across users
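The normalized layout can be illustrated as executable DDL (shown against SQLite so the sketch is self-contained; production types would be MySQL's, and the column types here are assumptions):

```python
import sqlite3

# SQLite stand-in for the MySQL schema in Decision 8; column lists
# mirror the tables above, types are illustrative.
DDL = """
CREATE TABLE users        (id INTEGER PRIMARY KEY, username TEXT, password_hash TEXT, created_at TEXT);
CREATE TABLE ocr_batches  (id INTEGER PRIMARY KEY, user_id INTEGER REFERENCES users(id),
                           status TEXT, created_at TEXT, completed_at TEXT);
CREATE TABLE ocr_files    (id INTEGER PRIMARY KEY, batch_id INTEGER REFERENCES ocr_batches(id),
                           filename TEXT, file_path TEXT, file_size INTEGER, status TEXT);
CREATE TABLE ocr_results  (id INTEGER PRIMARY KEY, file_id INTEGER REFERENCES ocr_files(id),
                           text TEXT, bbox_json TEXT, confidence REAL, language TEXT);
CREATE TABLE export_rules (id INTEGER PRIMARY KEY, user_id INTEGER REFERENCES users(id),
                           rule_name TEXT, config_json TEXT);
"""

conn = sqlite3.connect(":memory:")
conn.executescript(DDL)

# A typical batch-statistics query: file counts per status.
conn.execute("INSERT INTO ocr_files (batch_id, filename, status) VALUES (1, 'a.png', 'done')")
rows = conn.execute(
    "SELECT status, COUNT(*) FROM ocr_files WHERE batch_id = 1 GROUP BY status"
).fetchall()
```

The final query shows why the normalization pays off: batch progress and partial-failure counts are one GROUP BY away, with no JSON unpacking.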
Decision 9: Export Rule Configuration Format
Choice: JSON-based rule configuration stored in database
Example rule:
{
  "filters": {
    "min_confidence": 0.8,
    "filename_pattern": "^invoice_.*"
  },
  "formatting": {
    "add_line_numbers": true,
    "sort_by_position": true,
    "group_by_page": true
  },
  "output": {
    "format": "txt",
    "encoding": "utf-8",
    "line_separator": "\n"
  }
}
Rationale:
- Flexible and extensible
- Easy to validate with JSON schema
- Can be edited via UI or API
- Supports complex rules without database schema changes
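Applying such a rule is a small filter-then-format pass; a standard-library sketch, assuming each OCR result carries text, confidence, and filename keys (that shape, and `apply_rule` itself, are illustrative):

```python
import re

def apply_rule(rule: dict, results: list[dict]) -> str:
    """Filter results per the rule's `filters`, render per `formatting`/`output`.

    Only a sketch: handles min_confidence, filename_pattern,
    add_line_numbers, and line_separator from the example rule above.
    """
    filters = rule.get("filters", {})
    min_conf = filters.get("min_confidence", 0.0)
    pattern = re.compile(filters.get("filename_pattern", ""))
    kept = [r for r in results
            if r["confidence"] >= min_conf and pattern.match(r["filename"])]
    lines = [r["text"] for r in kept]
    if rule.get("formatting", {}).get("add_line_numbers"):
        lines = [f"{i + 1}: {line}" for i, line in enumerate(lines)]
    sep = rule.get("output", {}).get("line_separator", "\n")
    return sep.join(lines)
```

Because the rule arrives as plain JSON, new keys (say, a future `max_length` filter) extend this function without any schema migration, which is the point of Decision 9.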
Decision 10: Deployment Architecture (1Panel)
Choice: Nginx (static files + reverse proxy) + Supervisor (backend process manager)
Architecture:
[Client Browser]
↓
[Nginx :80/443] (managed by 1Panel)
↓
├─ / → Frontend static files (React build)
├─ /assets → Static assets
└─ /api → Reverse proxy to backend :12010
↓
[FastAPI Backend :12010] (managed by Supervisor)
↓
[MySQL :33306] (external)
Rationale:
- 1Panel provides GUI for Nginx management
- Supervisor ensures backend auto-restart on failure
- No Docker simplifies deployment on existing infrastructure
- Standard Nginx config works without special 1Panel requirements
Supervisor config:
[program:tool_ocr_backend]
command=/home/user/.conda/envs/tool_ocr/bin/uvicorn app.main:app --host 127.0.0.1 --port 12010
directory=/path/to/Tool_OCR/backend
user=www-data
autostart=true
autorestart=true
Risks / Trade-offs
Risk 1: OCR Processing Time for Large Batches
Risk: Processing 50+ images may take 5-10 minutes, potential timeout
Mitigation:
- Use FastAPI BackgroundTasks to avoid HTTP timeout
- Return batch_id immediately, client polls for status
- Display progress bar with estimated time remaining
- Limit max batch size to 50 files (configurable)
- Add worker concurrency limit to prevent resource exhaustion
Risk 2: PaddleOCR Model Download on First Run
Risk: Models are 100-200MB, first-time download may fail or be slow
Mitigation:
- Pre-download models during deployment setup
- Provide manual download script for offline installation
- Cache models in shared directory for all users
- Include model version in deployment docs
Risk 3: File Upload Size Limits
Risk: Users may try to upload very large PDFs (>20MB)
Mitigation:
- Enforce 20MB per file, 100MB per batch limits in frontend and backend
- Display clear error messages with limit information
- Provide guidance on compressing PDFs or splitting large files
- Consider adding image downsampling for huge images
Risk 4: Concurrent User Scaling
Risk: Multiple users uploading simultaneously may overwhelm CPU/memory
Mitigation:
- Limit concurrent OCR workers (e.g., 4 workers max)
- Implement task queue with FastAPI BackgroundTasks
- Monitor resource usage and add throttling if needed
- Document recommended server specs (8GB RAM, 4 CPU cores)
Risk 5: Database Connection Pool Exhaustion
Risk: External MySQL may have connection limits
Mitigation:
- Configure SQLAlchemy connection pool (max 20 connections)
- Use connection pooling with proper timeout settings
- Close connections properly in all API endpoints
- Add health check endpoint to monitor database connectivity
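The pool limits map directly onto SQLAlchemy's `create_engine` arguments; a sketch against SQLite so it runs standalone (production would use the external MySQL DSN, and the exact numbers are tuning assumptions within the 20-connection cap above):

```python
from sqlalchemy import create_engine, text
from sqlalchemy.pool import QueuePool

# SQLite URL keeps the sketch self-contained; swap in the MySQL DSN
# in production. pool_size + max_overflow bounds total connections.
engine = create_engine(
    "sqlite://",
    poolclass=QueuePool,
    pool_size=10,        # steady-state connections held open
    max_overflow=10,     # burst headroom -> 20 connections max
    pool_timeout=30,     # seconds to wait for a free connection
    pool_pre_ping=True,  # validate connections before handing them out
)

with engine.connect() as conn:
    ok = conn.execute(text("SELECT 1")).scalar()
```

`pool_pre_ping` also doubles as the health-check primitive: the same `SELECT 1` round-trip can back the monitoring endpoint.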
Migration Plan
Phase 1: Initial Deployment
- Setup Conda environment on production server
- Install Python dependencies and download OCR models
- Configure MySQL database and create tables
- Build frontend static files (npm run build)
- Configure Nginx via 1Panel (upload nginx.conf)
- Setup Supervisor for backend process
- Test with sample images
Phase 2: Production Rollout
- Create admin user account
- Import sample export rules
- Perform smoke tests (upload, OCR, export)
- Monitor logs for errors
- Setup daily cleanup cron job for old files
- Enable HTTPS via 1Panel (Let's Encrypt)
Phase 3: Monitoring and Optimization
- Add application logging (file + console)
- Monitor resource usage (CPU, memory, disk)
- Optimize slow queries if needed
- Tune worker concurrency based on actual load
- Collect user feedback and iterate
Rollback Plan
- Keep previous version in separate directory
- Use Supervisor to stop current version and start previous
- Database migrations should be backward compatible
- If major issues, restore database from backup
Open Questions
- Should we add user registration, or use admin-created accounts only?
  - Recommendation: Start with admin-created accounts for security; add registration later if needed
- Do we need audit logging for compliance?
  - Recommendation: Add a basic audit trail (who uploaded what, and when) in the database
- Should we support GPU acceleration for PaddleOCR?
  - Recommendation: Optional; detect GPU on startup, fall back to CPU if unavailable
- What's the desired behavior for duplicate filenames in a batch?
  - Recommendation: Auto-rename with a suffix (e.g., file.png, file_1.png)
- Should export rules be shareable across users or private?
  - Recommendation: Private by default; add a "public templates" feature later
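The auto-rename recommendation for duplicate filenames can be sketched as a pure function (`dedupe_name` is an illustrative name, not existing code):

```python
from pathlib import Path

def dedupe_name(filename: str, taken: set[str]) -> str:
    """Return filename unchanged if free, else append _1, _2, ... before the extension."""
    if filename not in taken:
        return filename
    stem, suffix = Path(filename).stem, Path(filename).suffix
    n = 1
    while f"{stem}_{n}{suffix}" in taken:
        n += 1
    return f"{stem}_{n}{suffix}"
```

Keeping it pure (names in, name out) makes the batch-upload handler trivial to unit-test against the `file.png` / `file_1.png` convention above.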