# Tool_OCR **OCR Batch Processing System with Structure Extraction** A web-based solution to extract text, images, and document structure from multiple files efficiently using PaddleOCR-VL. ## Features - 🔍 **Multi-Language OCR**: Support for 109 languages (Chinese, English, Japanese, Korean, etc.) - 📄 **Document Structure Analysis**: Intelligent layout analysis with PP-StructureV3 - 🖼️ **Image Extraction**: Preserve document images alongside text content - 📑 **Batch Processing**: Process multiple files concurrently with progress tracking - 📤 **Multiple Export Formats**: TXT, JSON, Excel, Markdown with images, searchable PDF - 📋 **Office Documents**: DOC, DOCX, PPT, PPTX support via LibreOffice conversion - 🚀 **GPU Acceleration**: Automatic CUDA GPU detection with graceful CPU fallback - 🔧 **Flexible Configuration**: Rule-based output formatting - 🌐 **Translation Ready**: Reserved architecture for future translation features ## Tech Stack ### Backend - **Framework**: FastAPI 0.115.0 - **OCR Engine**: PaddleOCR 3.0+ with PaddleOCR-VL - **Database**: MySQL via SQLAlchemy - **PDF Generation**: Pandoc + WeasyPrint - **Image Processing**: OpenCV, Pillow, pdf2image - **Office Conversion**: LibreOffice (headless mode) ### Frontend - **Framework**: React 19 with TypeScript - **Build Tool**: Vite 7 - **Styling**: Tailwind CSS v4 + shadcn/ui - **State Management**: React Query + Zustand - **HTTP Client**: Axios ## Prerequisites - **OS**: WSL2 Ubuntu 24.04 - **Python**: 3.12+ - **Node.js**: 24.x LTS - **MySQL**: External database server (provided) - **GPU** (Optional): NVIDIA GPU with CUDA 11.2+ for hardware acceleration ## Quick Start ### 1. Automated Setup (Recommended) ```bash # Run automated setup script ./setup_dev_env.sh ``` This script automatically: - Detects NVIDIA GPU and CUDA version (if available) - Installs Python development tools (pip, venv, build-essential) - Installs system dependencies (pandoc, LibreOffice, fonts, etc.) - Installs Node.js (via nvm) - Installs PaddlePaddle GPU version (if GPU detected) or CPU version - Installs other Python packages - Installs frontend dependencies - Verifies GPU functionality (if GPU detected) ### 2. Initialize Database ```bash source venv/bin/activate cd backend alembic upgrade head python create_test_user.py cd .. ``` Default test user: - Username: `admin` - Password: `admin123` ### 3. Start Development Servers **Backend (Terminal 1):** ```bash ./start_backend.sh ``` **Frontend (Terminal 2):** ```bash ./start_frontend.sh ``` ### 4. Access Application - **Frontend**: http://localhost:5173 - **API Docs**: http://localhost:8000/docs - **Health Check**: http://localhost:8000/health ## Project Structure ``` Tool_OCR/ ├── backend/ # FastAPI backend │ ├── app/ │ │ ├── api/v1/ # API endpoints │ │ ├── core/ # Configuration, database │ │ ├── models/ # Database models │ │ ├── services/ # Business logic │ │ └── main.py # Application entry point │ ├── alembic/ # Database migrations │ └── tests/ # Test suite ├── frontend/ # React frontend │ ├── src/ │ │ ├── components/ # UI components │ │ ├── pages/ # Page components │ │ ├── services/ # API services │ │ └── stores/ # State management │ └── public/ # Static assets ├── .env.local # Local development config ├── setup_dev_env.sh # Environment setup script ├── start_backend.sh # Backend startup script └── start_frontend.sh # Frontend startup script ``` ## Configuration Main config file: `.env.local` ```bash # Database MYSQL_HOST=mysql.theaken.com MYSQL_PORT=33306 # Application ports BACKEND_PORT=8000 FRONTEND_PORT=5173 # Token expiration (minutes) ACCESS_TOKEN_EXPIRE_MINUTES=1440 # 24 hours # Supported file formats ALLOWED_EXTENSIONS=png,jpg,jpeg,pdf,bmp,tiff,doc,docx,ppt,pptx # OCR settings OCR_LANGUAGES=ch,en,japan,korean MAX_OCR_WORKERS=4 # GPU acceleration (optional) FORCE_CPU_MODE=false # Set to true to disable GPU even if available GPU_MEMORY_FRACTION=0.8 # Fraction of GPU memory to use (0.0-1.0) GPU_DEVICE_ID=0 # GPU device ID to use (0 for primary GPU) ``` ### GPU Acceleration The system automatically detects and utilizes NVIDIA GPU hardware when available: - **Auto-detection**: Setup script detects GPU and installs appropriate PaddlePaddle version - **Graceful fallback**: If GPU is unavailable or fails, system automatically uses CPU mode - **Performance**: GPU acceleration provides 3-10x speedup for OCR processing - **Configuration**: Control GPU usage via `.env.local` environment variables Check GPU status at: http://localhost:8000/health ## API Endpoints ### Authentication - `POST /api/v1/auth/login` - User login ### File Management - `POST /api/v1/upload` - Upload files - `POST /api/v1/ocr/process` - Start OCR processing - `GET /api/v1/batch/{id}/status` - Get batch status ### Results & Export - `GET /api/v1/ocr/result/{id}` - Get OCR result - `GET /api/v1/export/pdf/{id}` - Export as PDF Full API documentation: http://localhost:8000/docs ## Supported File Formats - **Images**: PNG, JPG, JPEG, BMP, TIFF - **Documents**: PDF - **Office**: DOC, DOCX, PPT, PPTX Office files are automatically converted to PDF before OCR processing. ## Development ### Backend ```bash source venv/bin/activate cd backend # Run tests pytest # Database migration alembic revision --autogenerate -m "description" alembic upgrade head # Code formatting black app/ ``` ### Frontend ```bash cd frontend # Development server npm run dev # Build for production npm run build # Lint code npm run lint ``` ## OpenSpec Workflow This project follows OpenSpec for specification-driven development: ```bash # View current changes openspec list # Validate specifications openspec validate add-ocr-batch-processing # View implementation tasks cat openspec/changes/add-ocr-batch-processing/tasks.md ``` ## Roadmap - [x] **Phase 0**: Environment setup - [x] **Phase 1**: Core OCR backend (~98% complete) - [x] **Phase 2**: Frontend development (~92% complete) - [ ] **Phase 3**: Testing & optimization - [ ] **Phase 4**: Deployment automation - [ ] **Phase 5**: Translation feature (future) ## Documentation - Development specs: [openspec/project.md](openspec/project.md) - Implementation status: [openspec/changes/add-ocr-batch-processing/STATUS.md](openspec/changes/add-ocr-batch-processing/STATUS.md) - Agent instructions: [openspec/AGENTS.md](openspec/AGENTS.md) ## License Internal project use ## Notes - First OCR run will download PaddleOCR models (~900MB) - Token expiration is set to 24 hours by default - Office conversion requires LibreOffice (installed via setup script) - Development environment: WSL2 Ubuntu 24.04 with Python venv - **GPU acceleration**: Automatically detected and enabled if NVIDIA GPU with CUDA 11.2+ is available - **WSL GPU support**: Ensure NVIDIA CUDA drivers are installed in WSL for GPU acceleration - GPU status can be checked via `/health` API endpoint