first

2025-11-12 22:53:17 +08:00
commit da700721fa
130 changed files with 23393 additions and 0 deletions
--- a/openspec/changes/add-ocr-batch-processing/design.md
+++ b/openspec/changes/add-ocr-batch-processing/design.md
@@ -0,0 +1,313 @@
+# Technical Design Document
+
+## Context
+Tool_OCR is a web-based batch OCR processing system with a separate frontend and backend. The system must handle large file uploads, long-running OCR tasks, and multiple export formats while keeping the UI responsive and resource usage efficient.
+
+**Key stakeholders:**
+- End users: Need simple, fast, reliable OCR processing
+- Developers: Need maintainable, testable code architecture
+- Operations: Need easy deployment via 1Panel, monitoring, and error tracking
+
+**Constraints:**
+- Development on Windows with Conda (Python 3.10)
+- Deployment on Linux server via 1Panel (no Docker)
+- Port range: 12010-12019
+- External MySQL database (mysql.theaken.com:33306)
+- PaddleOCR models (~100-200MB per language)
+- Max file upload: 20MB per file, 100MB per batch
+
+## Goals / Non-Goals
+
+### Goals
+- Process images and PDFs with multi-language OCR (Chinese, English, Japanese, Korean)
+- Handle batch uploads with real-time progress tracking
+- Provide flexible export formats (TXT, JSON, Excel) with custom rules
+- Maintain responsive UI during long-running OCR tasks
+- Enable easy deployment and maintenance via 1Panel
+
+### Non-Goals
+- Real-time OCR streaming (batch processing only)
+- Cloud-based OCR services (local processing only)
+- Mobile app support (web UI only, desktop/tablet optimized)
+- Advanced image editing or annotation features
+- Multi-tenant SaaS architecture (single deployment per organization)
+
+## Decisions
+
+### Decision 1: FastAPI for Backend Framework
+**Choice:** Use FastAPI instead of Flask or Django
+
+**Rationale:**
+- Native async/await support for I/O-bound operations (file upload, database queries)
+- Automatic OpenAPI documentation (Swagger UI)
+- Built-in Pydantic validation for type safety
+- Better performance for concurrent requests
+- Modern Python 3.10+ features (type hints, async)
+
+**Alternatives considered:**
+- Flask: Simpler but lacks native async, requires extensions
+- Django: Too heavyweight for an API-only backend; bundles templating, an admin site, and its own ORM that this project does not need
+
+### Decision 2: PaddleOCR as OCR Engine
+**Choice:** Use PaddleOCR instead of Tesseract or cloud APIs
+
+**Rationale:**
+- Excellent Chinese/multilingual support (key requirement)
+- Higher accuracy with deep learning models
+- Offline operation (no API costs or internet dependency)
+- Active development and good documentation
+- GPU acceleration support (optional)
+
+**Alternatives considered:**
+- Tesseract: Lower accuracy for Chinese, older technology
+- Google Cloud Vision / AWS Textract: Requires internet, ongoing costs, data privacy concerns
+
+### Decision 3: React Query for API State Management
+**Choice:** Use React Query (TanStack Query) instead of Redux
+
+**Rationale:**
+- Designed specifically for server state (API calls, caching, refetching)
+- Built-in loading/error states
+- Automatic background refetching and cache invalidation
+- Reduces boilerplate compared to Redux
+- Better for our API-heavy use case
+
+**Alternatives considered:**
+- Redux: Overkill for server state, more boilerplate
+- Plain Axios: Requires manual loading/error state management
+
+### Decision 4: Zustand for Client State
+**Choice:** Use Zustand for global UI state (separate from React Query)
+
+**Rationale:**
+- Lightweight (roughly 1KB gzipped) with a simple API
+- No providers or context required
+- TypeScript-friendly
+- Works well alongside React Query
+- Only for UI state (selected files, filters, etc.)
+
+### Decision 5: Background Task Processing
+**Choice:** FastAPI BackgroundTasks for OCR processing (no external queue initially)
+
+**Rationale:**
+- Built-in FastAPI feature, no additional dependencies
+- Sufficient for single-server deployment
+- Simpler deployment and maintenance
+- Can migrate to Redis/Celery later if needed
+
+**Migration path:** If scale requires, add Redis + Celery for distributed task queue
+
+**Alternatives considered:**
+- Celery + Redis: More complex, overkill for initial deployment
+- Threading: Unnecessary, since FastAPI BackgroundTasks already runs synchronous task functions in a thread pool
+
+### Decision 6: File Storage Strategy
+**Choice:** Local filesystem with automatic cleanup (24-hour retention)
+
+**Rationale:**
+- Simple implementation, no S3/cloud storage costs
+- OCR results stored in database (permanent)
+- Original files temporary, only needed during processing
+- Automatic cleanup prevents disk space issues
+
+**Storage structure:**
+```
+uploads/
+  {batch_id}/
+    {file_id}_original.png
+    {file_id}_preprocessed.png (if preprocessing enabled)
+```
+
+**Cleanup:** Daily cron job or background task deletes files older than 24 hours
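+
+A minimal sketch of the cleanup step, assuming each batch lives in its own directory under `uploads/` and that the directory's modification time is an acceptable proxy for upload time. The function name `cleanup_old_batches` and the `now` parameter are hypothetical and exist only for illustration; a real implementation might use `ocr_batches.created_at` instead.
+
+```python
+import shutil
+import time
+from pathlib import Path
+
+RETENTION_SECONDS = 24 * 60 * 60  # 24-hour retention from this decision
+
+
+def cleanup_old_batches(upload_root: Path, now: float | None = None) -> list[str]:
+    """Delete batch directories older than the retention window.
+
+    Assumption: directory mtime stands in for the upload time.
+    Returns the sorted names of the removed batch directories.
+    """
+    now = time.time() if now is None else now
+    removed = []
+    for batch_dir in upload_root.iterdir():
+        if not batch_dir.is_dir():
+            continue  # ignore stray files in the upload root
+        age = now - batch_dir.stat().st_mtime
+        if age > RETENTION_SECONDS:
+            shutil.rmtree(batch_dir)
+            removed.append(batch_dir.name)
+    return sorted(removed)
+```
+
+The same function could be called from a cron-driven script or from a periodic background task.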
+
+### Decision 7: Real-time Progress Updates
+**Choice:** HTTP polling instead of WebSocket
+
+**Rationale:**
+- Simpler implementation and deployment
+- Works better with Nginx reverse proxy and 1Panel
+- Sufficient UX for batch processing (poll every 2 seconds)
+- No need for persistent connections
+
+**API:** `GET /api/v1/batch/{batch_id}/status` returns progress percentage
+
+**Alternatives considered:**
+- WebSocket: More complex, requires special Nginx config, overkill for this use case
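+
+As an illustration of what the status endpoint might compute, the sketch below derives a progress percentage from per-file statuses. The status names (`pending`, `processing`, `completed`, `failed`) and the response keys are assumptions; the design only states that the endpoint returns a progress percentage.
+
+```python
+from collections import Counter
+
+# Hypothetical status values; a file counts toward progress once it is finished.
+FINISHED_STATUSES = ("completed", "failed")
+
+
+def batch_progress(file_statuses: list[str]) -> dict:
+    """Summarize a batch as counts plus an integer progress percentage."""
+    counts = Counter(file_statuses)
+    total = len(file_statuses)
+    done = sum(counts[status] for status in FINISHED_STATUSES)
+    percent = round(100 * done / total) if total else 0
+    return {
+        "total": total,
+        "completed": counts["completed"],
+        "failed": counts["failed"],
+        "progress": percent,
+    }
+```
+
+Counting failed files as finished keeps the progress bar moving to 100% even when some files cannot be processed, which matches the partial-failure support described in the schema decision.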
+
+### Decision 8: Database Schema Design
+**Choice:** Separate tables for batches, files, and results (normalized)
+
+**Schema:**
+```sql
+users (id, username, password_hash, created_at)
+ocr_batches (id, user_id, status, created_at, completed_at)
+ocr_files (id, batch_id, filename, file_path, file_size, status)
+ocr_results (id, file_id, text, bbox_json, confidence, language)
+export_rules (id, user_id, rule_name, config_json)
+```
+
+**Rationale:**
+- Normalized for data integrity
+- Supports batch tracking and partial failures
+- Easy to query individual file results or batch statistics
+- Export rules are stored per user and reusable across batches
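+
+The schema above lists only table and column names. The sketch below spells it out as DDL and runs it against an in-memory SQLite database so that it is self-contained; the column types, constraints, and default values are assumptions, and the production MySQL DDL would differ in details. The helper `batch_status_counts` is hypothetical and shows the kind of batch statistics query the rationale mentions.
+
+```python
+import sqlite3
+
+# Assumed column types and constraints; only the names come from the schema above.
+SCHEMA = """
+CREATE TABLE users (
+    id INTEGER PRIMARY KEY,
+    username TEXT NOT NULL UNIQUE,
+    password_hash TEXT NOT NULL,
+    created_at TEXT DEFAULT CURRENT_TIMESTAMP
+);
+CREATE TABLE ocr_batches (
+    id INTEGER PRIMARY KEY,
+    user_id INTEGER NOT NULL REFERENCES users(id),
+    status TEXT NOT NULL DEFAULT 'pending',
+    created_at TEXT DEFAULT CURRENT_TIMESTAMP,
+    completed_at TEXT
+);
+CREATE TABLE ocr_files (
+    id INTEGER PRIMARY KEY,
+    batch_id INTEGER NOT NULL REFERENCES ocr_batches(id),
+    filename TEXT NOT NULL,
+    file_path TEXT NOT NULL,
+    file_size INTEGER NOT NULL,
+    status TEXT NOT NULL DEFAULT 'pending'
+);
+CREATE TABLE ocr_results (
+    id INTEGER PRIMARY KEY,
+    file_id INTEGER NOT NULL REFERENCES ocr_files(id),
+    text TEXT NOT NULL,
+    bbox_json TEXT,
+    confidence REAL,
+    language TEXT
+);
+CREATE TABLE export_rules (
+    id INTEGER PRIMARY KEY,
+    user_id INTEGER NOT NULL REFERENCES users(id),
+    rule_name TEXT NOT NULL,
+    config_json TEXT NOT NULL
+);
+"""
+
+
+def create_schema(conn: sqlite3.Connection) -> None:
+    """Create all five tables on the given connection."""
+    conn.executescript(SCHEMA)
+
+
+def batch_status_counts(conn: sqlite3.Connection, batch_id: int) -> dict:
+    """Return {status: file_count} for one batch."""
+    rows = conn.execute(
+        "SELECT status, COUNT(*) FROM ocr_files WHERE batch_id = ? GROUP BY status",
+        (batch_id,),
+    ).fetchall()
+    return dict(rows)
+```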
+
+### Decision 9: Export Rule Configuration Format
+**Choice:** JSON-based rule configuration stored in database
+
+**Example rule:**
+```json
+{
+  "filters": {
+    "min_confidence": 0.8,
+    "filename_pattern": "^invoice_.*"
+  },
+  "formatting": {
+    "add_line_numbers": true,
+    "sort_by_position": true,
+    "group_by_page": true
+  },
+  "output": {
+    "format": "txt",
+    "encoding": "utf-8",
+    "line_separator": "\n"
+  }
+}
+```
+
+**Rationale:**
+- Flexible and extensible
+- Easy to validate with JSON schema
+- Can be edited via UI or API
+- Supports complex rules without database schema changes
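+
+To show how such a rule could be applied, here is a minimal sketch that handles a subset of the keys from the example (`min_confidence`, `filename_pattern`, `add_line_numbers`, `line_separator`). The function `apply_export_rule` and the shape of each result dict (`filename`, `text`, `confidence`) are assumptions for illustration, not the project's actual exporter.
+
+```python
+import re
+
+
+def apply_export_rule(results: list[dict], rule: dict) -> str:
+    """Filter OCR results by the rule's filters and render them as plain text.
+
+    Each result is assumed to be a dict with "filename", "text", and "confidence".
+    """
+    filters = rule.get("filters", {})
+    formatting = rule.get("formatting", {})
+    output = rule.get("output", {})
+
+    min_confidence = filters.get("min_confidence", 0.0)
+    # An empty pattern matches every filename.
+    pattern = re.compile(filters.get("filename_pattern", ""))
+
+    kept = [
+        r for r in results
+        if r["confidence"] >= min_confidence and pattern.search(r["filename"])
+    ]
+    lines = [r["text"] for r in kept]
+    if formatting.get("add_line_numbers"):
+        lines = [f"{i}: {line}" for i, line in enumerate(lines, start=1)]
+    return output.get("line_separator", "\n").join(lines)
+```
+
+Missing keys fall back to permissive defaults, so a rule can be extended with new options without breaking older stored rules.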
+
+### Decision 10: Deployment Architecture (1Panel)
+**Choice:** Nginx (static files + reverse proxy) + Supervisor (backend process manager)
+
+**Architecture:**
+```
+[Client Browser]
+      ↓
+[Nginx :80/443] (managed by 1Panel)
+      ↓
+      ├─ /          → Frontend static files (React build)
+      ├─ /assets    → Static assets
+      └─ /api       → Reverse proxy to backend :12010
+            ↓
+      [FastAPI Backend :12010] (managed by Supervisor)
+            ↓
+      [MySQL :33306] (external)
+```
+
+**Rationale:**
+- 1Panel provides GUI for Nginx management
+- Supervisor ensures backend auto-restart on failure
+- No Docker simplifies deployment on existing infrastructure
+- Standard Nginx config works without special 1Panel requirements
+
+**Supervisor config:**
+```ini
+[program:tool_ocr_backend]
+command=/home/user/.conda/envs/tool_ocr/bin/uvicorn app.main:app --host 127.0.0.1 --port 12010
+directory=/path/to/Tool_OCR/backend
+user=www-data
+autostart=true
+autorestart=true
+```
+
+## Risks / Trade-offs
+
+### Risk 1: OCR Processing Time for Large Batches
+**Risk:** Processing 50+ images may take 5-10 minutes, potential timeout
+
+**Mitigation:**
+- Use FastAPI BackgroundTasks to avoid HTTP timeout
+- Return batch_id immediately, client polls for status
+- Display progress bar with estimated time remaining
+- Limit max batch size to 50 files (configurable)
+- Add worker concurrency limit to prevent resource exhaustion
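+
+One simple way to produce the estimated time remaining is to extrapolate from the average time per finished file. The sketch below assumes files take roughly equal time to process, which will not hold for mixed image and PDF batches; the function name is hypothetical.
+
+```python
+def estimate_remaining_seconds(
+    elapsed_seconds: float, files_done: int, files_total: int
+) -> float | None:
+    """Extrapolate remaining time from the average time per finished file.
+
+    Returns None until at least one file has finished, since no rate is known yet.
+    """
+    if files_done <= 0:
+        return None
+    seconds_per_file = elapsed_seconds / files_done
+    return seconds_per_file * (files_total - files_done)
+```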
+
+### Risk 2: PaddleOCR Model Download on First Run
+**Risk:** Models are 100-200MB, first-time download may fail or be slow
+
+**Mitigation:**
+- Pre-download models during deployment setup
+- Provide manual download script for offline installation
+- Cache models in shared directory for all users
+- Include model version in deployment docs
+
+### Risk 3: File Upload Size Limits
+**Risk:** Users may try to upload very large PDFs (>20MB)
+
+**Mitigation:**
+- Enforce 20MB per file, 100MB per batch limits in frontend and backend
+- Display clear error messages with limit information
+- Provide guidance on compressing PDFs or splitting large files
+- Consider adding image downsampling for huge images
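+
+The size limits can be checked in one place on the backend. This sketch assumes 1 MB means 1024 * 1024 bytes and that the caller already knows each file's name and size; `validate_batch` is a hypothetical helper that returns human-readable error messages.
+
+```python
+MB = 1024 * 1024  # assumption: binary megabytes
+MAX_FILE_BYTES = 20 * MB    # per-file limit from the constraints
+MAX_BATCH_BYTES = 100 * MB  # per-batch limit from the constraints
+
+
+def validate_batch(files: list[tuple[str, int]]) -> list[str]:
+    """Return error messages for (filename, size_in_bytes) pairs; empty if valid."""
+    errors = []
+    for name, size in files:
+        if size > MAX_FILE_BYTES:
+            errors.append(f"{name}: {size / MB:.1f} MB exceeds the 20 MB per-file limit")
+    total = sum(size for _, size in files)
+    if total > MAX_BATCH_BYTES:
+        errors.append(f"batch: {total / MB:.1f} MB exceeds the 100 MB per-batch limit")
+    return errors
+```
+
+Returning every violation at once, instead of stopping at the first, lets the UI show the user all files that need compressing or splitting.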
+
+### Risk 4: Concurrent User Scaling
+**Risk:** Multiple users uploading simultaneously may overwhelm CPU/memory
+
+**Mitigation:**
+- Limit concurrent OCR workers (e.g., 4 workers max)
+- Run OCR jobs through FastAPI BackgroundTasks, gated by the worker limit above
+- Monitor resource usage and add throttling if needed
+- Document recommended server specs (8GB RAM, 4 CPU cores)
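+
+A worker limit can be sketched with a single process-wide thread pool shared by all batches, so that simultaneous uploads queue up instead of all running at once. The names `OCR_POOL` and `run_batch` are hypothetical, and `ocr_fn` stands in for whatever function performs OCR on one file.
+
+```python
+from concurrent.futures import ThreadPoolExecutor
+
+MAX_OCR_WORKERS = 4  # example limit from the mitigation list
+
+# One shared pool for the whole process: every batch submits to it,
+# so at most MAX_OCR_WORKERS files are processed at any moment.
+OCR_POOL = ThreadPoolExecutor(max_workers=MAX_OCR_WORKERS)
+
+
+def run_batch(file_paths: list[str], ocr_fn) -> list:
+    """Run ocr_fn on every path through the shared pool; results keep input order."""
+    return list(OCR_POOL.map(ocr_fn, file_paths))
+```
+
+Because `run_batch` blocks until the batch finishes, it would be called from a background task and not directly from a request handler.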
+
+### Risk 5: Database Connection Pool Exhaustion
+**Risk:** External MySQL may have connection limits
+
+**Mitigation:**
+- Configure the SQLAlchemy connection pool (max 20 connections) with explicit timeout settings
+- Close connections properly in all API endpoints
+- Add health check endpoint to monitor database connectivity
+
+## Migration Plan
+
+### Phase 1: Initial Deployment
+1. Set up Conda environment on production server
+2. Install Python dependencies and download OCR models
+3. Configure MySQL database and create tables
+4. Build frontend static files (`npm run build`)
+5. Configure Nginx via 1Panel (upload nginx.conf)
+6. Set up Supervisor for backend process
+7. Test with sample images
+
+### Phase 2: Production Rollout
+1. Create admin user account
+2. Import sample export rules
+3. Perform smoke tests (upload, OCR, export)
+4. Monitor logs for errors
+5. Set up daily cleanup cron job for old files
+6. Enable HTTPS via 1Panel (Let's Encrypt)
+
+### Phase 3: Monitoring and Optimization
+1. Add application logging (file + console)
+2. Monitor resource usage (CPU, memory, disk)
+3. Optimize slow queries if needed
+4. Tune worker concurrency based on actual load
+5. Collect user feedback and iterate
+
+### Rollback Plan
+- Keep previous version in separate directory
+- Use Supervisor to stop current version and start previous
+- Database migrations should be backward compatible
+- If major issues occur, restore the database from backup
+
+## Open Questions
+
+1. **Should we add user registration, or use admin-created accounts only?**
+   - Recommendation: Start with admin-created accounts for security, add registration later if needed
+
+2. **Do we need audit logging for compliance?**
+   - Recommendation: Add basic audit trail (who uploaded what, when) in database
+
+3. **Should we support GPU acceleration for PaddleOCR?**
+   - Recommendation: Optional; detect GPU on startup and fall back to CPU if unavailable
+
+4. **What's the desired behavior for duplicate filenames in a batch?**
+   - Recommendation: Auto-rename with suffix (e.g., `file.png`, `file_1.png`)
+
+5. **Should export rules be shareable across users or private?**
+   - Recommendation: Private by default, add "public templates" feature later
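+
+For question 4, the auto-rename recommendation could look like the sketch below. It assumes the suffix is inserted before the file extension and that generated names must not collide with names already in the batch; `dedupe_filenames` is a hypothetical helper.
+
+```python
+from pathlib import PurePath
+
+
+def dedupe_filenames(names: list[str]) -> list[str]:
+    """Rename duplicates within a batch: file.png, file_1.png, file_2.png, ..."""
+    seen = set()
+    result = []
+    for name in names:
+        path = PurePath(name)
+        candidate = name
+        counter = 1
+        # Keep bumping the suffix until the name is unused in this batch.
+        while candidate in seen:
+            candidate = f"{path.stem}_{counter}{path.suffix}"
+            counter += 1
+        seen.add(candidate)
+        result.append(candidate)
+    return result
+```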