Add orientation detection to handle cases where scanned documents have content in a different orientation than the image dimensions suggest.

When PP-StructureV3 processes rotated documents, it may return bounding boxes in the "corrected" orientation while the image remains in its scanned orientation. This causes content to extend beyond page boundaries.

The fix:
- Add _detect_content_orientation() method to detect when content bbox exceeds page dimensions significantly
- Automatically swap page dimensions when landscape content is detected in portrait-oriented images (and vice versa)
- Apply orientation detection for both single-page and multi-page documents

Fixes issue where horizontal delivery slips scanned vertically were generating PDFs with content cut off or incorrectly positioned.

🤖 Generated with [Claude Code](https://claude.com/claude-code)

Co-Authored-By: Claude <noreply@anthropic.com>
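The orientation check described in the commit note above can be sketched roughly as follows. This is an illustration under stated assumptions, not the project's actual `_detect_content_orientation()` method: the function name, the `(x0, y0, x1, y1)` box layout, and the `tolerance` threshold are all hypothetical.

```python
from typing import Iterable, Tuple

# Assumed box layout: (x0, y0, x1, y1) in the same units as the page size.
BBox = Tuple[float, float, float, float]


def detect_content_orientation(
    page_width: float,
    page_height: float,
    bboxes: Iterable[BBox],
    tolerance: float = 1.1,  # hypothetical threshold for "significantly exceeds"
) -> Tuple[float, float]:
    """Return (width, height), swapped when content only fits the rotated page."""
    boxes = list(bboxes)
    if not boxes:
        return page_width, page_height

    # Furthest extent of any content box along each axis.
    max_x = max(box[2] for box in boxes)
    max_y = max(box[3] for box in boxes)

    # Content spills past the page in its current orientation.
    overflows = max_x > page_width * tolerance or max_y > page_height * tolerance
    # Content would fit if width and height were exchanged.
    fits_swapped = (
        max_x <= page_height * tolerance and max_y <= page_width * tolerance
    )

    if overflows and fits_swapped:
        return page_height, page_width
    return page_width, page_height
```

Swapping only when the content also fits the rotated page avoids flipping dimensions for documents whose boxes are simply out of range for some other reason.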
Tool_OCR
OCR Batch Processing System with Structure Extraction
A web-based solution to extract text, images, and document structure from multiple files efficiently using PaddleOCR-VL.
Features
- 🔍 Multi-Language OCR: Support for 109 languages (Chinese, English, Japanese, Korean, etc.)
- 📄 Document Structure Analysis: Intelligent layout analysis with PP-StructureV3
- 🖼️ Image Extraction: Preserve document images alongside text content
- 📑 Batch Processing: Process multiple files concurrently with progress tracking
- 📤 Multiple Export Formats: TXT, JSON, Excel, Markdown with images, searchable PDF
- 📋 Office Documents: DOC, DOCX, PPT, PPTX support via LibreOffice conversion
- 🚀 GPU Acceleration: Automatic CUDA GPU detection with graceful CPU fallback
- 🔧 Flexible Configuration: Rule-based output formatting
- 🌐 Translation Ready: Reserved architecture for future translation features
Tech Stack
Backend
- Framework: FastAPI 0.115.0
- OCR Engine: PaddleOCR 3.0+ with PaddleOCR-VL and PP-StructureV3
- Deep Learning: PaddlePaddle 3.2.1+ (GPU/CPU support)
- Database: MySQL via SQLAlchemy
- PDF Generation: Pandoc + WeasyPrint
- Image Processing: OpenCV, Pillow, pdf2image
- Office Conversion: LibreOffice (headless mode)
Frontend
- Framework: React 19 with TypeScript
- Build Tool: Vite 7
- Styling: Tailwind CSS v4 + shadcn/ui
- State Management: React Query + Zustand
- HTTP Client: Axios
Prerequisites
- OS: WSL2 Ubuntu 24.04
- Python: 3.12+
- Node.js: 24.x LTS
- MySQL: External database server (provided)
- GPU (Optional): NVIDIA GPU with CUDA 11.8+ for hardware acceleration
  - PaddlePaddle 3.2.1+ requires CUDA 11.8, 12.3, or 12.6+
  - WSL2 users: Ensure NVIDIA CUDA drivers are installed
Quick Start
1. Automated Setup (Recommended)
# Run automated setup script
./setup_dev_env.sh
This script automatically:
- Detects NVIDIA GPU and CUDA version (if available)
- Installs Python development tools (pip, venv, build-essential)
- Installs system dependencies (pandoc, LibreOffice, fonts, etc.)
- Installs Node.js (via nvm)
- Installs PaddlePaddle 3.2.1+ GPU version (if GPU detected) or CPU version
- Configures WSL CUDA library paths (for WSL2 GPU users)
- Installs other Python packages (PaddleOCR, PaddleX, etc.)
- Installs frontend dependencies
- Verifies GPU functionality and chart recognition API availability
2. Initialize Database
source venv/bin/activate
cd backend
alembic upgrade head
python create_test_user.py
cd ..
Default test user:
- Username: admin
- Password: admin123
3. Start Development Servers
Backend (Terminal 1):
./start_backend.sh
Frontend (Terminal 2):
./start_frontend.sh
4. Access Application
- Frontend: http://localhost:5173
- API Docs: http://localhost:8000/docs
- Health Check: http://localhost:8000/health
Project Structure
Tool_OCR/
├── backend/ # FastAPI backend
│ ├── app/
│ │ ├── api/v1/ # API endpoints
│ │ ├── core/ # Configuration, database
│ │ ├── models/ # Database models
│ │ ├── services/ # Business logic
│ │ └── main.py # Application entry point
│ ├── alembic/ # Database migrations
│ └── tests/ # Test suite
├── frontend/ # React frontend
│ ├── src/
│ │ ├── components/ # UI components
│ │ ├── pages/ # Page components
│ │ ├── services/ # API services
│ │ └── stores/ # State management
│ └── public/ # Static assets
├── .env.local # Local development config
├── setup_dev_env.sh # Environment setup script
├── start_backend.sh # Backend startup script
└── start_frontend.sh # Frontend startup script
Configuration
Main config file: .env.local
# Database
MYSQL_HOST=mysql.theaken.com
MYSQL_PORT=33306
# Application ports
BACKEND_PORT=8000
FRONTEND_PORT=5173
# Token expiration (minutes)
ACCESS_TOKEN_EXPIRE_MINUTES=1440 # 24 hours
# Supported file formats
ALLOWED_EXTENSIONS=png,jpg,jpeg,pdf,bmp,tiff,doc,docx,ppt,pptx
# OCR settings
OCR_LANGUAGES=ch,en,japan,korean
MAX_OCR_WORKERS=4
# GPU acceleration (optional)
FORCE_CPU_MODE=false # Set to true to disable GPU even if available
GPU_MEMORY_FRACTION=0.8 # Fraction of GPU memory to use (0.0-1.0)
GPU_DEVICE_ID=0 # GPU device ID to use (0 for primary GPU)
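As an illustration of how the OCR and GPU settings above might be read and validated, here is a minimal sketch. The `OcrSettings` class, `load_settings`, and `is_allowed` are hypothetical names, not the backend's actual configuration code. It assumes the key/value pairs have already been loaded into a mapping with inline comments stripped, and it uses the sample values above as fallback defaults.

```python
from dataclasses import dataclass
from typing import Mapping, Tuple


@dataclass(frozen=True)
class OcrSettings:
    """Hypothetical container for the OCR/GPU keys shown in .env.local."""
    allowed_extensions: Tuple[str, ...]
    ocr_languages: Tuple[str, ...]
    max_ocr_workers: int
    force_cpu_mode: bool
    gpu_memory_fraction: float
    gpu_device_id: int


def _split_csv(value: str) -> Tuple[str, ...]:
    # "png, JPG,pdf" -> ("png", "jpg", "pdf")
    return tuple(item.strip().lower() for item in value.split(",") if item.strip())


def load_settings(env: Mapping[str, str]) -> OcrSettings:
    """Build settings from a mapping of already-parsed key/value strings."""
    fraction = float(env.get("GPU_MEMORY_FRACTION", "0.8"))
    if not 0.0 <= fraction <= 1.0:
        raise ValueError("GPU_MEMORY_FRACTION must be between 0.0 and 1.0")
    return OcrSettings(
        allowed_extensions=_split_csv(
            env.get("ALLOWED_EXTENSIONS", "png,jpg,jpeg,pdf,bmp,tiff,doc,docx,ppt,pptx")
        ),
        ocr_languages=_split_csv(env.get("OCR_LANGUAGES", "ch,en,japan,korean")),
        max_ocr_workers=int(env.get("MAX_OCR_WORKERS", "4")),
        force_cpu_mode=env.get("FORCE_CPU_MODE", "false").strip().lower()
        in {"1", "true", "yes"},
        gpu_memory_fraction=fraction,
        gpu_device_id=int(env.get("GPU_DEVICE_ID", "0")),
    )


def is_allowed(filename: str, settings: OcrSettings) -> bool:
    """Check a file name's extension against ALLOWED_EXTENSIONS (case-insensitive)."""
    _, dot, ext = filename.rpartition(".")
    return bool(dot) and ext.lower() in settings.allowed_extensions
```

Passing `os.environ` as the mapping would work in a process where the variables are already exported.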
GPU Acceleration
The system automatically detects and utilizes NVIDIA GPU hardware when available:
- Auto-detection: Setup script detects GPU and installs appropriate PaddlePaddle version
- Graceful fallback: If GPU is unavailable or fails, system automatically uses CPU mode
- Performance: GPU acceleration provides 3-10x speedup for OCR processing
- Configuration: Control GPU usage via .env.local environment variables
- WSL2 CUDA Setup: For WSL2 users, CUDA library paths are automatically configured in ~/.bashrc
Chart Recognition: Requires PaddlePaddle 3.2.0+ for full PP-StructureV3 chart recognition capabilities (chart type detection, data extraction, axis/legend parsing). The setup script installs PaddlePaddle 3.2.1+ which includes all required APIs.
Check GPU status and chart recognition availability at: http://localhost:8000/health
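The graceful-fallback behaviour described in this section could look something like the sketch below. It is not the project's implementation: `select_device` and the `gpu_probe` callable are hypothetical, and the probe is injected so the logic can be exercised without PaddlePaddle installed. The returned strings follow the `"gpu:N"` / `"cpu"` form that `paddle.set_device` accepts.

```python
from typing import Callable


def select_device(
    force_cpu: bool,
    device_id: int,
    gpu_probe: Callable[[], int],
) -> str:
    """Pick a device string, falling back to CPU whenever the GPU is unusable.

    gpu_probe is a hypothetical hook that returns the number of visible GPUs;
    it may raise if the CUDA runtime is missing or broken.
    """
    if force_cpu:
        # FORCE_CPU_MODE=true disables the GPU even if one is present.
        return "cpu"
    try:
        gpu_count = gpu_probe()
    except Exception:
        # Any probing failure is treated as "no GPU available".
        return "cpu"
    if device_id < 0 or device_id >= gpu_count:
        # Requested GPU_DEVICE_ID does not exist on this machine.
        return "cpu"
    return f"gpu:{device_id}"
```

In a real deployment the probe might wrap a call such as `paddle.device.cuda.device_count`; that wiring is an assumption and is left out here.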
API Endpoints
Authentication
- POST /api/v1/auth/login - User login
File Management
- POST /api/v1/upload - Upload files
- POST /api/v1/ocr/process - Start OCR processing
- GET /api/v1/batch/{id}/status - Get batch status
Results & Export
- GET /api/v1/ocr/result/{id} - Get OCR result
- GET /api/v1/export/pdf/{id} - Export as PDF
Full API documentation: http://localhost:8000/docs
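A client that uploads files and then waits on the batch status endpoint needs two small pieces: URL construction and a polling loop. The sketch below shows both without assuming anything about request or response payloads, which are documented only at /docs. The `endpoint` and `poll_until` helpers are hypothetical names; the caller supplies how to fetch a status and how to decide that a batch has finished.

```python
import time
from typing import Callable, TypeVar

T = TypeVar("T")

API_PREFIX = "/api/v1"


def endpoint(base_url: str, path: str, **params: object) -> str:
    """Build a full URL for one of the listed paths, filling {id}-style slots."""
    return base_url.rstrip("/") + API_PREFIX + path.format(**params)


def poll_until(
    fetch: Callable[[], T],
    is_done: Callable[[T], bool],
    interval: float = 1.0,
    max_attempts: int = 60,
    sleep: Callable[[float], None] = time.sleep,
) -> T:
    """Call fetch() repeatedly until is_done(result) is true or attempts run out."""
    for attempt in range(max_attempts):
        result = fetch()
        if is_done(result):
            return result
        if attempt < max_attempts - 1:
            sleep(interval)
    raise TimeoutError(f"not finished after {max_attempts} attempts")


# Example URL for the batch status endpoint of a hypothetical batch 42.
status_url = endpoint("http://localhost:8000", "/batch/{id}/status", id=42)
```

Here `fetch` would typically issue an authenticated GET against `status_url`, and `is_done` would inspect whatever completion field the real response carries.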
Supported File Formats
- Images: PNG, JPG, JPEG, BMP, TIFF
- Documents: PDF
- Office: DOC, DOCX, PPT, PPTX
Office files are automatically converted to PDF before OCR processing.
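The Office-to-PDF step can be approximated with LibreOffice's headless converter. The sketch below is an assumption about how such a step might be wired, not the project's actual service code: the helper names, the `soffice` binary name, and the timeout are illustrative. The `--headless`, `--convert-to`, and `--outdir` options are standard LibreOffice command-line flags.

```python
import subprocess
from pathlib import Path
from typing import List

# Office formats listed above that need conversion before OCR.
OFFICE_EXTENSIONS = {".doc", ".docx", ".ppt", ".pptx"}


def build_convert_command(source: Path, out_dir: Path, binary: str = "soffice") -> List[str]:
    """Assemble the headless LibreOffice command that writes a PDF into out_dir."""
    return [
        binary,
        "--headless",
        "--convert-to", "pdf",
        "--outdir", str(out_dir),
        str(source),
    ]


def convert_to_pdf(source: Path, out_dir: Path, timeout: float = 120.0) -> Path:
    """Convert an Office file to PDF and return the expected output path."""
    if source.suffix.lower() not in OFFICE_EXTENSIONS:
        raise ValueError(f"not an Office document: {source.name}")
    out_dir.mkdir(parents=True, exist_ok=True)
    # check=True raises CalledProcessError if LibreOffice exits with an error.
    subprocess.run(build_convert_command(source, out_dir), check=True, timeout=timeout)
    # LibreOffice names the output after the input file's stem.
    return out_dir / (source.stem + ".pdf")
```

Keeping command construction separate from execution makes the argument list easy to inspect without LibreOffice installed.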
Development
Backend
source venv/bin/activate
cd backend
# Run tests
pytest
# Database migration
alembic revision --autogenerate -m "description"
alembic upgrade head
# Code formatting
black app/
Frontend
cd frontend
# Development server
npm run dev
# Build for production
npm run build
# Lint code
npm run lint
OpenSpec Workflow
This project follows OpenSpec for specification-driven development:
# View current changes
openspec list
# Validate specifications
openspec validate add-ocr-batch-processing
# View implementation tasks
cat openspec/changes/add-ocr-batch-processing/tasks.md
Roadmap
- Phase 0: Environment setup
- Phase 1: Core OCR backend (~98% complete)
- Phase 2: Frontend development (~92% complete)
- Phase 3: Testing & optimization
- Phase 4: Deployment automation
- Phase 5: Translation feature (future)
Documentation
- Development specs: openspec/project.md
- Implementation status: openspec/changes/add-ocr-batch-processing/STATUS.md
- Agent instructions: openspec/AGENTS.md
License
Internal project use
Notes
- First OCR run will download PaddleOCR models (~900MB)
- Token expiration is set to 24 hours by default
- Office conversion requires LibreOffice (installed via setup script)
- Development environment: WSL2 Ubuntu 24.04 with Python venv
- GPU acceleration: Automatically detected and enabled if NVIDIA GPU with CUDA 11.8+ is available
- PaddlePaddle version: System uses PaddlePaddle 3.2.1+ which includes full chart recognition support
- WSL GPU support: WSL2 CUDA library paths (/usr/lib/wsl/lib) are automatically configured in ~/.bashrc
- Chart recognition: Fully enabled with PP-StructureV3 for chart type detection, data extraction, and structure analysis
- GPU status and chart recognition availability can be checked via the /health API endpoint