first

2025-11-12 22:53:17 +08:00
commit da700721fa
130 changed files with 23393 additions and 0 deletions
--- a/SETUP.md
+++ b/SETUP.md
@@ -0,0 +1,395 @@
+# Tool_OCR Setup Guide
+
+Complete setup instructions for a macOS environment.
+
+## Prerequisites Check
+
+Before starting, verify you have:
+- ✅ macOS (Apple Silicon or Intel)
+- ✅ Terminal access (zsh or bash)
+- ✅ Internet connection for downloads
+
+## Step-by-Step Setup
+
+### Step 1: Install Conda Environment
+
+Run the automated setup script:
+
+```bash
+chmod +x setup_conda.sh
+./setup_conda.sh
+```
+
+**Expected output:**
+- If Conda is not installed: downloads and installs Miniconda for Apple Silicon
+- If Conda is already installed: creates the `tool_ocr` environment with Python 3.10
+
+**If Conda was just installed:**
+```bash
+# Reload your shell to activate Conda
+source ~/.zshrc       # if using zsh (default on macOS)
+source ~/.bashrc      # if using bash
+
+# Run setup script again to create environment
+./setup_conda.sh
+```
+
+### Step 2: Activate Environment
+
+```bash
+conda activate tool_ocr
+```
+
+You should see the `(tool_ocr)` prefix in your terminal prompt.
+
+### Step 3: Install Python Dependencies
+
+```bash
+pip install -r requirements.txt
+```
+
+**This will install:**
+- FastAPI and Uvicorn (web framework)
+- PaddleOCR and PaddlePaddle (OCR engine)
+- Image processing libraries (Pillow, OpenCV, pdf2image)
+- PDF generation tools (WeasyPrint, Markdown)
+- Database tools (SQLAlchemy, PyMySQL, Alembic)
+- Authentication libraries (python-jose, passlib)
+- Testing tools (pytest, pytest-asyncio)
+
+**Installation time:** ~5-10 minutes depending on your internet speed
+
+### Step 4: Install System Dependencies
+
+```bash
+# Install libmagic (required for python-magic file type detection)
+brew install libmagic
+
+# Install WeasyPrint dependencies (required for PDF generation)
+brew install pango gdk-pixbuf libffi
+
+# Install Pandoc (optional - for enhanced PDF generation)
+brew install pandoc
+
+# Install Chinese fonts for PDF output (optional - macOS has built-in Chinese fonts)
+brew install --cask font-noto-sans-cjk
+# Note: If above fails, skip it - macOS built-in fonts (PingFang SC, Heiti TC) work fine
+```
+
+**If Homebrew is not installed:**
+```bash
+/bin/bash -c "$(curl -fsSL https://raw.githubusercontent.com/Homebrew/install/HEAD/install.sh)"
+```
+
+### Step 5: Configure Environment Variables
+
+```bash
+# Copy template
+cp .env.example .env
+
+# Edit with your preferred editor
+nano .env
+# or
+code .env
+```
+
+**Important settings to verify in `.env`:**
+
+```bash
+# Database (pre-configured, should work as-is)
+MYSQL_HOST=mysql.theaken.com
+MYSQL_PORT=33306
+MYSQL_USER=A060
+MYSQL_PASSWORD=WLeSCi0yhtc7
+MYSQL_DATABASE=db_A060
+
+# Application ports
+BACKEND_PORT=12010
+FRONTEND_PORT=12011
+
+# Security (CHANGE THIS!)
+SECRET_KEY=your-secret-key-here-please-change-this-to-random-string
+```
+
+**Generate a secure SECRET_KEY:**
+```bash
+python -c "import secrets; print(secrets.token_urlsafe(32))"
+```
+
+Copy the output and paste it as your `SECRET_KEY` value.
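+
+A quick sanity check can catch a `.env` that still contains the template placeholder. The following is a minimal sketch, not part of Tool_OCR: `parse_env` and `check_env` are hypothetical helpers, and the parser assumes plain `KEY=VALUE` lines without quoting.
+
+```python
+from pathlib import Path
+
+# Placeholder value shipped in the template above
+PLACEHOLDER = "your-secret-key-here-please-change-this-to-random-string"
+
+
+def parse_env(text: str) -> dict:
+    """Parse simple KEY=VALUE lines, skipping blank lines and # comments."""
+    settings = {}
+    for line in text.splitlines():
+        line = line.strip()
+        if not line or line.startswith("#") or "=" not in line:
+            continue
+        key, _, value = line.partition("=")
+        settings[key.strip()] = value.strip()
+    return settings
+
+
+def check_env(settings: dict) -> list:
+    """Return a list of human-readable problems (empty if none were found)."""
+    problems = []
+    secret = settings.get("SECRET_KEY", "")
+    if not secret or secret == PLACEHOLDER:
+        problems.append("SECRET_KEY is missing or still set to the template placeholder")
+    backend = settings.get("BACKEND_PORT")
+    if backend is not None and backend == settings.get("FRONTEND_PORT"):
+        problems.append("BACKEND_PORT and FRONTEND_PORT must differ")
+    return problems
+
+
+if __name__ == "__main__":
+    env_file = Path(".env")
+    if env_file.exists():
+        for problem in check_env(parse_env(env_file.read_text())):
+            print("WARNING:", problem)
+```
+
+Run it from the project root after editing `.env`; it prints nothing when both checks pass.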
+
+### Step 6: Set Environment Variable for WeasyPrint
+
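+Add the following to your shell config (`~/.zshrc` or `~/.bash_profile`). The path shown is the Homebrew prefix on Apple Silicon; on Intel Macs Homebrew installs to `/usr/local`, so use `/usr/local/lib` instead: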
+
+```bash
+export DYLD_LIBRARY_PATH="/opt/homebrew/lib:$DYLD_LIBRARY_PATH"
+```
+
+Then reload:
+```bash
+source ~/.zshrc  # or source ~/.bash_profile
+```
+
+### Step 7: Run Service Layer Tests
+
+Verify all services are working:
+
+```bash
+cd backend
+python test_services.py
+```
+
+Expected output:
+```
+✓ PASS   - database
+✓ PASS   - preprocessor
+✓ PASS   - pdf_generator
+✓ PASS   - file_manager
+Total: 4-5/5 tests passed
+```
+
+**Note:** The OCR engine test may fail on the first run while PaddleOCR downloads its models (~900MB). This is normal.
+
+### Step 8: Create Directory Structure
+
+The directories should already exist, but verify:
+
+```bash
+ls -la
+```
+
+You should see:
+- `backend/` - FastAPI application
+- `frontend/` - React application (will be populated later)
+- `uploads/` - File upload storage
+- `storage/` - Processed results
+- `models/` - PaddleOCR models (empty until first run)
+- `logs/` - Application logs
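+
+The same check can be scripted instead of read off the `ls` output. This is a minimal sketch that assumes it is run from the project root; `missing_dirs` is a hypothetical helper, and the directory names are the ones listed above.
+
+```python
+from pathlib import Path
+
+# Directory names taken from the list in this step
+EXPECTED_DIRS = ["backend", "frontend", "uploads", "storage", "models", "logs"]
+
+
+def missing_dirs(root: Path) -> list:
+    """Return the expected directory names that do not exist under root."""
+    return [name for name in EXPECTED_DIRS if not (root / name).is_dir()]
+
+
+if __name__ == "__main__":
+    missing = missing_dirs(Path.cwd())
+    print("Missing:", ", ".join(missing) if missing else "none")
+```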
+
+### Step 9: Start Backend Server
+
+```bash
+cd backend
+python -m app.main
+```
+
+**Expected output:**
+```
+INFO:     Started server process
+INFO:     Waiting for application startup.
+INFO:     Application startup complete.
+INFO:     Uvicorn running on http://0.0.0.0:12010
+```
+
+**Test the server:**
+Open a browser and visit:
+- http://localhost:12010 - API root
+- http://localhost:12010/docs - Interactive API documentation
+- http://localhost:12010/health - Health check endpoint
+
+### Step 10: Download PaddleOCR Models
+
+On the first OCR request, PaddleOCR automatically downloads its models (~900MB).
+
+**To pre-download models manually:**
+
+```bash
+python -c "
+from paddleocr import PaddleOCR
+ocr = PaddleOCR(use_angle_cls=True, lang='ch', use_gpu=False)
+print('Models downloaded successfully')
+"
+```
+
+This will download:
+- Detection model: ch_PP-OCRv4_det
+- Recognition model: ch_PP-OCRv4_rec
+- Angle classifier: ch_ppocr_mobile_v2.0_cls
+
+Models are stored in: `./models/paddleocr/`
+
+## Troubleshooting
+
+### Issue: "conda: command not found"
+
+**Solution:**
+```bash
+# Reload shell configuration
+source ~/.zshrc  # or source ~/.bashrc
+
+# If still not working, manually add Conda to PATH
+export PATH="$HOME/miniconda3/bin:$PATH"
+```
+
+### Issue: PaddlePaddle installation fails
+
+**Solution:**
+```bash
+# For Apple Silicon Macs, ensure you're using ARM version
+pip uninstall paddlepaddle
+pip install paddlepaddle --no-cache-dir
+```
+
+### Issue: WeasyPrint fails to install
+
+**Solution:**
+```bash
+# Install required system libraries
+brew install cairo pango gdk-pixbuf libffi
+pip install --upgrade weasyprint
+```
+
+### Issue: Database connection fails
+
+**Solution:**
+```bash
+# Test database connection
+python -c "
+import pymysql
+conn = pymysql.connect(
+    host='mysql.theaken.com',
+    port=33306,
+    user='A060',
+    password='WLeSCi0yhtc7',
+    database='db_A060'
+)
+print('Database connection OK')
+conn.close()
+"
+```
+
+If this fails, verify:
+- Internet connection is active
+- Firewall is not blocking port 33306
+- Database credentials in `.env` are correct
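+
+Before suspecting the credentials, it can help to confirm that the port is reachable at all. The following is a minimal sketch using only the standard library; `port_reachable` is a hypothetical helper, not part of Tool_OCR, and the host and port are the ones from `.env` above.
+
+```python
+import socket
+
+
+def port_reachable(host: str, port: int, timeout: float = 5.0) -> bool:
+    """Return True if a TCP connection to host:port succeeds within timeout."""
+    try:
+        with socket.create_connection((host, port), timeout=timeout):
+            return True
+    except OSError:
+        # Covers refused connections, timeouts, and DNS resolution failures
+        return False
+
+
+if __name__ == "__main__":
+    print("reachable:", port_reachable("mysql.theaken.com", 33306))
+```
+
+If this reports `False`, the problem is likely network or firewall related rather than a wrong username or password.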
+
+### Issue: Port 12010 already in use
+
+**Solution:**
+```bash
+# Find what's using the port
+lsof -i :12010
+
+# Kill the process, or change the port in .env
+# e.g. BACKEND_PORT=12012 (any free port; 12011 is already assigned to FRONTEND_PORT)
+```
+
+## Next Steps
+
+After successful setup:
+
+1. ✅ Environment is ready
+2. ✅ Backend server can start
+3. ✅ Database connection configured
+
+**Ready to develop:**
+- Implement database models (`backend/app/models/`)
+- Create API endpoints (`backend/app/api/v1/`)
+- Build OCR service (`backend/app/services/ocr_service.py`)
+- Develop frontend UI (`frontend/src/`)
+
+**Start with Phase 1 tasks:**
+Refer to [openspec/changes/add-ocr-batch-processing/tasks.md](openspec/changes/add-ocr-batch-processing/tasks.md) for detailed implementation tasks.
+
+## Development Workflow
+
+```bash
+# Activate environment
+conda activate tool_ocr
+
+# Start backend in development mode (auto-reload)
+cd backend
+python -m app.main
+
+# Alternatively, start it from a fresh shell in one command (run from backend/)
+bash -c "source ~/.zshrc && conda activate tool_ocr && export DYLD_LIBRARY_PATH=/opt/homebrew/lib:$DYLD_LIBRARY_PATH && python -m app.main"
+
+# In another terminal, start frontend
+cd frontend
+npm run dev
+
+# Run tests
+cd backend
+pytest tests/ -v
+
+# Check code style
+black app/
+pylint app/
+```
+
+## Background Services
+
+### Automatic Cleanup Scheduler
+
+The application automatically runs a cleanup scheduler that:
+- **Runs every**: 1 hour (configurable via `BackgroundTaskManager.cleanup_interval`)
+- **Deletes files older than**: 24 hours (configurable via `BackgroundTaskManager.file_retention_hours`)
+- **Cleans up**:
+  - Physical files and directories
+  - Database records (results, files, batches)
+  - Expired batches in COMPLETED, FAILED, or PARTIAL status
+
+The cleanup scheduler starts automatically when the backend application starts and stops gracefully on shutdown.
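+
+The retention rule can be illustrated with a short sketch. This is not the project's implementation (which also removes database records); `expired_files` is a hypothetical helper that assumes file age is judged by modification time.
+
+```python
+import time
+from pathlib import Path
+from typing import Optional
+
+
+def expired_files(directory: Path, retention_hours: float = 24,
+                  now: Optional[float] = None) -> list:
+    """Return files under directory whose mtime is older than the retention window."""
+    now = time.time() if now is None else now
+    cutoff = now - retention_hours * 3600  # hours -> seconds
+    return sorted(
+        path for path in directory.rglob("*")
+        if path.is_file() and path.stat().st_mtime < cutoff
+    )
+
+
+if __name__ == "__main__":
+    # Dry run: list what a 24-hour retention window would select in uploads/
+    uploads = Path("uploads")
+    if uploads.is_dir():
+        for path in expired_files(uploads):
+            print("would delete:", path)
+```
+
+The sketch only lists candidates, so it is safe to run against a real upload directory.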
+
+**Monitor cleanup activity:**
+```bash
+# Watch cleanup logs in real-time
+tail -f /tmp/tool_ocr_startup.log | grep cleanup
+
+# Or check application logs
+tail -f backend/logs/app.log | grep cleanup
+```
+
+### Retry Logic
+
+OCR processing includes automatic retry logic:
+- **Maximum retries**: 3 attempts (configurable)
+- **Retry delay**: 5 seconds between attempts (configurable)
+- **Tracks**: `retry_count` field in database
+- **Error handling**: Detailed error messages with retry attempt information
+
+**Configuration** (in [backend/app/services/background_tasks.py](backend/app/services/background_tasks.py)):
+```python
+task_manager = BackgroundTaskManager(
+    max_retries=3,           # Number of retry attempts
+    retry_delay=5,            # Delay between retries (seconds)
+    cleanup_interval=3600,    # Cleanup runs every hour
+    file_retention_hours=24   # Keep files for 24 hours
+)
+```
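+
+How `max_retries` and `retry_delay` interact can be sketched as a plain retry loop. This is an illustration under assumptions, not the code in `background_tasks.py`: `run_with_retries` is a hypothetical function, and it treats `max_retries` as the total number of attempts.
+
+```python
+import time
+from typing import Callable, TypeVar
+
+T = TypeVar("T")
+
+
+def run_with_retries(task: Callable[[], T], max_retries: int = 3,
+                     retry_delay: float = 5,
+                     sleep: Callable[[float], None] = time.sleep) -> T:
+    """Call task until it succeeds or max_retries attempts have failed."""
+    last_error = None
+    for attempt in range(1, max_retries + 1):
+        try:
+            return task()
+        except Exception as exc:
+            last_error = exc
+            print(f"Attempt {attempt}/{max_retries} failed: {exc}")
+            if attempt < max_retries:
+                sleep(retry_delay)  # wait before the next attempt
+    raise RuntimeError(f"Task failed after {max_retries} attempts") from last_error
+```
+
+The `sleep` parameter exists only so the delay can be replaced in tests; in normal use the default `time.sleep` applies.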
+
+### Background Task Status
+
+Check if background services are running:
+```bash
+# Check health endpoint
+curl http://localhost:12010/health
+
+# Check application startup logs for cleanup scheduler
+grep "cleanup scheduler" /tmp/tool_ocr_startup.log
+# Expected output: "Started cleanup scheduler for expired files"
+# Expected output: "Starting cleanup scheduler (interval: 3600s, retention: 24h)"
+```
+
+## Deactivate Environment
+
+When done working:
+```bash
+conda deactivate
+```
+
+## Environment Management
+
+```bash
+# List Conda environments
+conda env list
+
+# Remove environment (if needed)
+conda env remove -n tool_ocr
+
+# Export environment
+conda env export > environment.yml
+
+# Create from exported environment
+conda env create -f environment.yml
+```