Files
OCR/README.md
egg b048f2d640 fix: disable chart recognition due to PaddlePaddle 3.0.0 API limitation
PaddleOCR-VL chart recognition model requires `fused_rms_norm_ext` API
which is not available in PaddlePaddle 3.0.0 stable release.

Changes:
- Set use_chart_recognition=False in PP-StructureV3 initialization
- Remove unsupported show_log parameter from PaddleOCR 3.x API calls
- Document known limitation in openspec proposal
- Add limitation documentation to README
- Update tasks.md with documentation task for known issues

Impact:
- Layout analysis still detects/extracts charts as images ✓
- Tables, formulas, and text recognition work normally ✓
- Deep chart understanding (type detection, data extraction) disabled ✗
- Chart to structured data conversion disabled ✗

Workaround: Charts saved as image files for manual review

🤖 Generated with [Claude Code](https://claude.com/claude-code)

Co-Authored-By: Claude <noreply@anthropic.com>
2025-11-14 13:16:17 +08:00

280 lines
8.1 KiB
Markdown

# Tool_OCR
**OCR Batch Processing System with Structure Extraction**
A web-based solution to extract text, images, and document structure from multiple files efficiently using PaddleOCR-VL.
## Features
- 🔍 **Multi-Language OCR**: Support for 109 languages (Chinese, English, Japanese, Korean, etc.)
- 📄 **Document Structure Analysis**: Intelligent layout analysis with PP-StructureV3
- 🖼️ **Image Extraction**: Preserve document images alongside text content
- 📑 **Batch Processing**: Process multiple files concurrently with progress tracking
- 📤 **Multiple Export Formats**: TXT, JSON, Excel, Markdown with images, searchable PDF
- 📋 **Office Documents**: DOC, DOCX, PPT, PPTX support via LibreOffice conversion
- 🚀 **GPU Acceleration**: Automatic CUDA GPU detection with graceful CPU fallback
- 🔧 **Flexible Configuration**: Rule-based output formatting
- 🌐 **Translation Ready**: Reserved architecture for future translation features
## Tech Stack
### Backend
- **Framework**: FastAPI 0.115.0
- **OCR Engine**: PaddleOCR 3.0+ with PaddleOCR-VL
- **Database**: MySQL via SQLAlchemy
- **PDF Generation**: Pandoc + WeasyPrint
- **Image Processing**: OpenCV, Pillow, pdf2image
- **Office Conversion**: LibreOffice (headless mode)
### Frontend
- **Framework**: React 19 with TypeScript
- **Build Tool**: Vite 7
- **Styling**: Tailwind CSS v4 + shadcn/ui
- **State Management**: React Query + Zustand
- **HTTP Client**: Axios
## Prerequisites
- **OS**: WSL2 Ubuntu 24.04
- **Python**: 3.12+
- **Node.js**: 24.x LTS
- **MySQL**: External database server (provided)
- **GPU** (Optional): NVIDIA GPU with CUDA 11.2+ for hardware acceleration
## Quick Start
### 1. Automated Setup (Recommended)
```bash
# Run automated setup script
./setup_dev_env.sh
```
This script automatically:
- Detects NVIDIA GPU and CUDA version (if available)
- Installs Python development tools (pip, venv, build-essential)
- Installs system dependencies (pandoc, LibreOffice, fonts, etc.)
- Installs Node.js (via nvm)
- Installs PaddlePaddle GPU version (if GPU detected) or CPU version
- Installs other Python packages
- Installs frontend dependencies
- Verifies GPU functionality (if GPU detected)
### 2. Initialize Database
```bash
source venv/bin/activate
cd backend
alembic upgrade head
python create_test_user.py
cd ..
```
Default test user:
- Username: `admin`
- Password: `admin123`
### 3. Start Development Servers
**Backend (Terminal 1):**
```bash
./start_backend.sh
```
**Frontend (Terminal 2):**
```bash
./start_frontend.sh
```
### 4. Access Application
- **Frontend**: http://localhost:5173
- **API Docs**: http://localhost:8000/docs
- **Health Check**: http://localhost:8000/health
## Project Structure
```
Tool_OCR/
├── backend/ # FastAPI backend
│ ├── app/
│ │ ├── api/v1/ # API endpoints
│ │ ├── core/ # Configuration, database
│ │ ├── models/ # Database models
│ │ ├── services/ # Business logic
│ │ └── main.py # Application entry point
│ ├── alembic/ # Database migrations
│ └── tests/ # Test suite
├── frontend/ # React frontend
│ ├── src/
│ │ ├── components/ # UI components
│ │ ├── pages/ # Page components
│ │ ├── services/ # API services
│ │ └── stores/ # State management
│ └── public/ # Static assets
├── .env.local # Local development config
├── setup_dev_env.sh # Environment setup script
├── start_backend.sh # Backend startup script
└── start_frontend.sh # Frontend startup script
```
## Configuration
Main config file: `.env.local`
```bash
# Database
MYSQL_HOST=mysql.theaken.com
MYSQL_PORT=33306
# Application ports
BACKEND_PORT=8000
FRONTEND_PORT=5173
# Token expiration (minutes)
ACCESS_TOKEN_EXPIRE_MINUTES=1440 # 24 hours
# Supported file formats
ALLOWED_EXTENSIONS=png,jpg,jpeg,pdf,bmp,tiff,doc,docx,ppt,pptx
# OCR settings
OCR_LANGUAGES=ch,en,japan,korean
MAX_OCR_WORKERS=4
# GPU acceleration (optional)
FORCE_CPU_MODE=false # Set to true to disable GPU even if available
GPU_MEMORY_FRACTION=0.8 # Fraction of GPU memory to use (0.0-1.0)
GPU_DEVICE_ID=0 # GPU device ID to use (0 for primary GPU)
```
### GPU Acceleration
The system automatically detects and utilizes NVIDIA GPU hardware when available:
- **Auto-detection**: Setup script detects GPU and installs appropriate PaddlePaddle version
- **Graceful fallback**: If GPU is unavailable or fails, system automatically uses CPU mode
- **Performance**: GPU acceleration provides 3-10x speedup for OCR processing
- **Configuration**: Control GPU usage via `.env.local` environment variables
Check GPU status at: http://localhost:8000/health
### Known Limitations
**Chart Recognition (PP-StructureV3)**
Due to API incompatibility between PaddleOCR 3.x and PaddlePaddle 3.0.0 stable, the chart recognition feature is currently disabled:
-**Works**: Layout analysis detects and extracts charts/figures as image files
-**Works**: Tables, formulas, and text recognition function normally
-**Disabled**: Deep chart content understanding (chart type, data extraction, axis/legend parsing)
-**Disabled**: Converting chart content to structured data
**Technical Details**:
- The PaddleOCR-VL chart recognition model requires `paddle.incubate.nn.functional.fused_rms_norm_ext` API
- PaddlePaddle 3.0.0 stable only provides the base `fused_rms_norm` function
- This limitation will be resolved when PaddlePaddle releases an update with the extended API
**Workaround**: Charts are saved as images and can be viewed manually. For chart data extraction, consider using specialized chart recognition tools separately.
## API Endpoints
### Authentication
- `POST /api/v1/auth/login` - User login
### File Management
- `POST /api/v1/upload` - Upload files
- `POST /api/v1/ocr/process` - Start OCR processing
- `GET /api/v1/batch/{id}/status` - Get batch status
### Results & Export
- `GET /api/v1/ocr/result/{id}` - Get OCR result
- `GET /api/v1/export/pdf/{id}` - Export as PDF
Full API documentation: http://localhost:8000/docs
## Supported File Formats
- **Images**: PNG, JPG, JPEG, BMP, TIFF
- **Documents**: PDF
- **Office**: DOC, DOCX, PPT, PPTX
Office files are automatically converted to PDF before OCR processing.
## Development
### Backend
```bash
source venv/bin/activate
cd backend
# Run tests
pytest
# Database migration
alembic revision --autogenerate -m "description"
alembic upgrade head
# Code formatting
black app/
```
### Frontend
```bash
cd frontend
# Development server
npm run dev
# Build for production
npm run build
# Lint code
npm run lint
```
## OpenSpec Workflow
This project follows OpenSpec for specification-driven development:
```bash
# View current changes
openspec list
# Validate specifications
openspec validate add-ocr-batch-processing
# View implementation tasks
cat openspec/changes/add-ocr-batch-processing/tasks.md
```
## Roadmap
- [x] **Phase 0**: Environment setup
- [x] **Phase 1**: Core OCR backend (~98% complete)
- [x] **Phase 2**: Frontend development (~92% complete)
- [ ] **Phase 3**: Testing & optimization
- [ ] **Phase 4**: Deployment automation
- [ ] **Phase 5**: Translation feature (future)
## Documentation
- Development specs: [openspec/project.md](openspec/project.md)
- Implementation status: [openspec/changes/add-ocr-batch-processing/STATUS.md](openspec/changes/add-ocr-batch-processing/STATUS.md)
- Agent instructions: [openspec/AGENTS.md](openspec/AGENTS.md)
## License
Internal project use
## Notes
- First OCR run will download PaddleOCR models (~900MB)
- Token expiration is set to 24 hours by default
- Office conversion requires LibreOffice (installed via setup script)
- Development environment: WSL2 Ubuntu 24.04 with Python venv
- **GPU acceleration**: Automatically detected and enabled if NVIDIA GPU with CUDA 11.2+ is available
- **WSL GPU support**: Ensure NVIDIA CUDA drivers are installed in WSL for GPU acceleration
- GPU status can be checked via `/health` API endpoint