beabigegg da700721fa first

2025-11-12 22:53:17 +08:00

Tool_OCR Setup Guide

Complete setup instructions for a macOS environment.

Prerequisites Check

Before starting, verify you have:

✅ macOS (Apple Silicon or Intel)
✅ Terminal access (zsh or bash)
✅ Internet connection for downloads

Step-by-Step Setup

Step 1: Install Conda Environment

Run the automated setup script:

chmod +x setup_conda.sh
./setup_conda.sh

Expected behavior:

If Conda is not installed: downloads and installs Miniconda for Apple Silicon
If Conda is already installed: creates the tool_ocr environment with Python 3.10

If Conda was just installed:

# Reload your shell to activate Conda
source ~/.zshrc       # if using zsh (default on macOS)
source ~/.bashrc      # if using bash

# Run setup script again to create environment
./setup_conda.sh

Step 2: Activate Environment

conda activate tool_ocr

You should see a (tool_ocr) prefix in your terminal prompt.

Step 3: Install Python Dependencies

pip install -r requirements.txt

This will install:

FastAPI and Uvicorn (web framework)
PaddleOCR and PaddlePaddle (OCR engine)
Image processing libraries (Pillow, OpenCV, pdf2image)
PDF generation tools (WeasyPrint, Markdown)
Database tools (SQLAlchemy, PyMySQL, Alembic)
Authentication libraries (python-jose, passlib)
Testing tools (pytest, pytest-asyncio)

Installation time: ~5-10 minutes depending on your internet speed
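
To confirm the install, one option is to check that each package can be located without importing it. This is a minimal sketch, not part of the repository: `missing_modules` is a hypothetical helper, and the import names (for example `PIL` for Pillow, `cv2` for OpenCV, `jose` for python-jose) are assumed to match what requirements.txt installs.

```python
import importlib.util

# Import names for the packages listed above (assumed to match requirements.txt).
REQUIRED_MODULES = [
    "fastapi", "uvicorn", "paddleocr", "paddle", "PIL", "cv2", "pdf2image",
    "weasyprint", "markdown", "sqlalchemy", "pymysql", "alembic",
    "jose", "passlib", "pytest",
]


def missing_modules(names):
    """Return the names that cannot be located in the current environment."""
    return [name for name in names if importlib.util.find_spec(name) is None]


if __name__ == "__main__":
    missing = missing_modules(REQUIRED_MODULES)
    if missing:
        print("Missing: " + ", ".join(missing))
    else:
        print("All modules found")
```

Run it inside the activated tool_ocr environment. Any name it reports as missing points to a package that pip did not install.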

Step 4: Install System Dependencies

# Install libmagic (required for python-magic file type detection)
brew install libmagic

# Install WeasyPrint dependencies (required for PDF generation)
brew install pango gdk-pixbuf libffi

# Install Pandoc (optional - for enhanced PDF generation)
brew install pandoc

# Install Chinese fonts for PDF output (optional - macOS has built-in Chinese fonts)
brew install --cask font-noto-sans-cjk
# Note: If above fails, skip it - macOS built-in fonts (PingFang SC, Heiti TC) work fine

If Homebrew is not installed:

/bin/bash -c "$(curl -fsSL https://raw.githubusercontent.com/Homebrew/install/HEAD/install.sh)"

Step 5: Configure Environment Variables

# Copy template
cp .env.example .env

# Edit with your preferred editor
nano .env
# or
code .env

Important settings to verify in .env:

# Database (pre-configured, should work as-is)
MYSQL_HOST=mysql.theaken.com
MYSQL_PORT=33306
MYSQL_USER=A060
MYSQL_PASSWORD=WLeSCi0yhtc7
MYSQL_DATABASE=db_A060

# Application ports
BACKEND_PORT=12010
FRONTEND_PORT=12011

# Security (CHANGE THIS!)
SECRET_KEY=your-secret-key-here-please-change-this-to-random-string

Generate a secure SECRET_KEY:

python -c "import secrets; print(secrets.token_urlsafe(32))"

Copy the output and paste it as your SECRET_KEY value.
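
A small check can catch a forgotten placeholder before the server starts. The sketch below is illustrative only: `parse_env` and `secret_key_problem` are hypothetical helpers, the parser handles only simple KEY=VALUE lines, and the 32-character minimum is an assumption rather than a project requirement.

```python
# Placeholder value shipped in the .env template shown above.
PLACEHOLDER = "your-secret-key-here-please-change-this-to-random-string"


def parse_env(text):
    """Parse simple KEY=VALUE lines, skipping blank lines and # comments."""
    values = {}
    for line in text.splitlines():
        line = line.strip()
        if not line or line.startswith("#") or "=" not in line:
            continue
        key, _, value = line.partition("=")
        values[key.strip()] = value.strip().strip('"').strip("'")
    return values


def secret_key_problem(env, min_length=32):
    """Describe what is wrong with SECRET_KEY, or return None if it looks fine."""
    key = env.get("SECRET_KEY", "")
    if not key:
        return "SECRET_KEY is missing"
    if key == PLACEHOLDER:
        return "SECRET_KEY is still the template placeholder"
    if len(key) < min_length:
        return f"SECRET_KEY is shorter than {min_length} characters"
    return None
```

To use it, read the file with `pathlib.Path(".env").read_text()`, pass the text to `parse_env`, and print the result of `secret_key_problem`. A return value of `None` means the key passed these basic checks.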

Step 6: Set Environment Variable for WeasyPrint

Add to your shell config (~/.zshrc or ~/.bash_profile):

# /opt/homebrew is the Homebrew prefix on Apple Silicon; on Intel Macs it is /usr/local
export DYLD_LIBRARY_PATH="/opt/homebrew/lib:$DYLD_LIBRARY_PATH"

Then reload:

source ~/.zshrc  # or source ~/.bash_profile
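
After reloading the shell, it can be useful to check whether Python resolves the Homebrew libraries. This is a rough sketch: the library base names in `WEASYPRINT_LIBS` are assumptions about what WeasyPrint loads, and `locate_libraries` is a hypothetical helper built on the standard `ctypes.util.find_library`.

```python
import ctypes.util

# Library base names are assumptions about what WeasyPrint needs at import time.
WEASYPRINT_LIBS = ["gobject-2.0", "pango-1.0"]


def locate_libraries(names):
    """Map each library name to what ctypes resolves it to, or None if not found."""
    return {name: ctypes.util.find_library(name) for name in names}


if __name__ == "__main__":
    for name, location in locate_libraries(WEASYPRINT_LIBS).items():
        print(f"{name}: {location or 'NOT FOUND'}")
```

If a library shows as NOT FOUND, DYLD_LIBRARY_PATH probably does not point at the Homebrew lib directory in the current shell.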

Step 7: Run Service Layer Tests

Verify all services are working:

cd backend
python test_services.py

Expected output:

✓ PASS   - database
✓ PASS   - preprocessor
✓ PASS   - pdf_generator
✓ PASS   - file_manager
Total: 4-5/5 tests passed

Note: OCR engine test may fail on first run as PaddleOCR downloads models (~900MB). This is normal.

Step 8: Verify Directory Structure

The directories should already exist. From the project root, verify:

ls -la

You should see:

backend/ - FastAPI application
frontend/ - React application (will be populated later)
uploads/ - File upload storage
storage/ - Processed results
models/ - PaddleOCR models (empty until first run)
logs/ - Application logs

Step 9: Start Backend Server

cd backend
python -m app.main

Expected output:

INFO:     Started server process
INFO:     Waiting for application startup.
INFO:     Application startup complete.
INFO:     Uvicorn running on http://0.0.0.0:12010

Test the server by opening these URLs in a browser:

http://localhost:12010 - API root
http://localhost:12010/docs - Interactive API documentation
http://localhost:12010/health - Health check endpoint

Step 10: Download PaddleOCR Models

On first OCR request, PaddleOCR will automatically download models (~900MB).

To pre-download models manually:

python -c "
from paddleocr import PaddleOCR
ocr = PaddleOCR(use_angle_cls=True, lang='ch', use_gpu=False)
print('Models downloaded successfully')
"

This will download:

Detection model: ch_PP-OCRv4_det
Recognition model: ch_PP-OCRv4_rec
Angle classifier: ch_ppocr_mobile_v2.0_cls

Models are stored in: ./models/paddleocr/

Troubleshooting

Issue: "conda: command not found"

Solution:

# Reload shell configuration
source ~/.zshrc  # or source ~/.bashrc

# If still not working, manually add Conda to PATH
export PATH="$HOME/miniconda3/bin:$PATH"

Issue: PaddlePaddle installation fails

Solution:

# For Apple Silicon Macs, ensure you're using ARM version
pip uninstall paddlepaddle
pip install paddlepaddle --no-cache-dir

Issue: WeasyPrint fails to install

Solution:

# Install required system libraries
brew install cairo pango gdk-pixbuf libffi
pip install --upgrade weasyprint

Issue: Database connection fails

Solution:

# Test database connection
python -c "
import pymysql
conn = pymysql.connect(
    host='mysql.theaken.com',
    port=33306,
    user='A060',
    password='WLeSCi0yhtc7',
    database='db_A060'
)
print('Database connection OK')
conn.close()
"

If this fails, verify:

Internet connection is active
Firewall is not blocking port 33306
Database credentials in .env are correct
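
To separate a network or firewall problem from a credentials problem, a plain TCP check is often enough. This is a minimal sketch using only the standard library; `can_reach` is a hypothetical helper and does not log in to MySQL.

```python
import socket


def can_reach(host, port, timeout=5.0):
    """Return True if a TCP connection to host:port succeeds within timeout."""
    try:
        with socket.create_connection((host, port), timeout=timeout):
            return True
    except OSError:
        # Covers refused connections, timeouts, and DNS resolution failures.
        return False
```

Call it with the MYSQL_HOST and MYSQL_PORT values from .env. If it returns False, the problem is reachability (network, DNS, or firewall). If it returns True but the pymysql test still fails, check the credentials.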

Issue: Port 12010 already in use

Solution:

# Find what's using the port
lsof -i :12010

# Kill the process or change port in .env
# Edit BACKEND_PORT=12012 (or any available port; 12011 is already assigned to FRONTEND_PORT)
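
When picking a replacement port, skip ports already assigned elsewhere in .env (FRONTEND_PORT is 12011). The sketch below is one way to find a candidate. `port_is_free` and `first_free_port` are hypothetical helpers, and the bind test on 0.0.0.0 is assumed to mirror how the backend binds.

```python
import socket


def port_is_free(port, host="0.0.0.0"):
    """Return True if nothing is bound to host:port (checked by trying to bind)."""
    with socket.socket(socket.AF_INET, socket.SOCK_STREAM) as sock:
        try:
            sock.bind((host, port))
        except OSError:
            return False
    return True


def first_free_port(start, reserved=(), attempts=50):
    """Return the first free port at or after start that is not reserved, else None."""
    for port in range(start, start + attempts):
        if port not in reserved and port_is_free(port):
            return port
    return None
```

For example, `first_free_port(12010, reserved={12011})` returns a port that can be written to BACKEND_PORT without colliding with the frontend.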

Next Steps

After successful setup:

✅ Environment is ready
✅ Backend server can start
✅ Database connection configured

Ready to develop:

Implement database models (backend/app/models/)
Create API endpoints (backend/app/api/v1/)
Build OCR service (backend/app/services/ocr_service.py)
Develop frontend UI (frontend/src/)

Start with Phase 1 tasks: Refer to openspec/changes/add-ocr-batch-processing/tasks.md for detailed implementation tasks.

Development Workflow

# Activate environment
conda activate tool_ocr

# Start backend in development mode (auto-reload)
cd backend
python -m app.main

# Alternative: start the backend as a one-liner from a fresh shell (run from backend/)
bash -c "source ~/.zshrc && conda activate tool_ocr && export DYLD_LIBRARY_PATH=/opt/homebrew/lib:$DYLD_LIBRARY_PATH && python -m app.main"

# In another terminal, start frontend
cd frontend
npm run dev

# Run tests
cd backend
pytest tests/ -v

# Check code style
black app/
pylint app/

Background Services

Automatic Cleanup Scheduler

The application automatically runs a cleanup scheduler that:

Runs every: 1 hour (configurable via BackgroundTaskManager.cleanup_interval)
Deletes files older than: 24 hours (configurable via BackgroundTaskManager.file_retention_hours)
Cleans up:
- Physical files and directories
- Database records (results, files, batches)
- Expired batches in COMPLETED, FAILED, or PARTIAL status

The cleanup scheduler starts automatically when the backend application starts and stops gracefully on shutdown.
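
The file-age part of this cleanup can be pictured roughly as follows. This is a sketch under stated assumptions, not the project's implementation: `expired_files` and `delete_expired` are hypothetical names, age is judged by file modification time, and the database records and batch statuses handled by the real scheduler are left out.

```python
import time
from pathlib import Path


def expired_files(directory, retention_hours=24, now=None):
    """Return files under directory last modified more than retention_hours ago."""
    now = time.time() if now is None else now
    cutoff = now - retention_hours * 3600
    return sorted(
        path
        for path in Path(directory).rglob("*")
        if path.is_file() and path.stat().st_mtime < cutoff
    )


def delete_expired(directory, retention_hours=24, now=None):
    """Delete expired files and return the list of paths that were removed."""
    removed = expired_files(directory, retention_hours, now)
    for path in removed:
        path.unlink()
    return removed
```

Calling `expired_files("uploads")` first gives a dry run of what a 24-hour retention would remove, before anything is deleted.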

Monitor cleanup activity:

# Watch cleanup logs in real-time
tail -f /tmp/tool_ocr_startup.log | grep cleanup

# Or check application logs
tail -f backend/logs/app.log | grep cleanup

Retry Logic

OCR processing includes automatic retry logic:

Maximum retries: 3 attempts (configurable)
Retry delay: 5 seconds between attempts (configurable)
Tracks: retry_count field in database
Error handling: Detailed error messages with retry attempt information

Configuration (in backend/app/services/background_tasks.py):

task_manager = BackgroundTaskManager(
    max_retries=3,           # Number of retry attempts
    retry_delay=5,            # Delay between retries (seconds)
    cleanup_interval=3600,    # Cleanup runs every hour
    file_retention_hours=24   # Keep files for 24 hours
)
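
The way `max_retries` and `retry_delay` interact can be sketched as below. This is not the code in background_tasks.py: `run_with_retries` is a hypothetical function, and it assumes that "3 attempts" means three calls in total, with the delay applied only between attempts.

```python
import time


def run_with_retries(task, max_retries=3, retry_delay=5, sleep=time.sleep):
    """Call task() up to max_retries times; return (result, retry_count)."""
    last_error = None
    for attempt in range(1, max_retries + 1):
        try:
            # retry_count is the number of failed attempts before this success.
            return task(), attempt - 1
        except Exception as exc:
            last_error = exc
            if attempt < max_retries:
                sleep(retry_delay)  # wait before the next attempt
    raise RuntimeError(
        f"failed after {max_retries} attempts: {last_error}"
    ) from last_error
```

Passing `sleep` as a parameter keeps the sketch easy to test without waiting. A caller could store the returned `retry_count` in the database field mentioned above.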

Background Task Status

Check if background services are running:

# Check health endpoint
curl http://localhost:12010/health

# Check application startup logs for cleanup scheduler
grep "cleanup scheduler" /tmp/tool_ocr_startup.log
# Expected output: "Started cleanup scheduler for expired files"
# Expected output: "Starting cleanup scheduler (interval: 3600s, retention: 24h)"
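
The same health check can be scripted, for example in a startup or monitoring script. This is a minimal standard-library sketch: `check_health` is a hypothetical helper, and because the response format of /health is not documented here, the body is returned as raw text.

```python
import urllib.error
import urllib.request


def check_health(base_url="http://localhost:12010", timeout=5.0):
    """Return (ok, detail); ok is True when GET /health answers with HTTP 200."""
    url = base_url.rstrip("/") + "/health"
    try:
        with urllib.request.urlopen(url, timeout=timeout) as response:
            body = response.read().decode("utf-8", errors="replace")
            return response.status == 200, body
    except urllib.error.HTTPError as exc:
        # The server answered, but with an error status.
        return False, f"HTTP {exc.code}"
    except (urllib.error.URLError, OSError) as exc:
        # The server could not be reached at all.
        return False, str(exc)
```

A result of `(False, ...)` with a connection error usually means the backend is not running or is listening on a different port.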

Deactivate Environment

When done working:

conda deactivate

Environment Management

# List Conda environments
conda env list

# Remove environment (if needed)
conda env remove -n tool_ocr

# Export environment
conda env export > environment.yml

# Create from exported environment
conda env create -f environment.yml
