
# Tool_OCR

**OCR Batch Processing System with Structure Extraction**

A web-based tool that uses PaddleOCR-VL to extract text, images, and document structure from batches of files.

## Features

- 🔍 **Multi-Language OCR**: Support for 109 languages (Chinese, English, Japanese, Korean, etc.)
- 📄 **Document Structure Analysis**: Intelligent layout analysis with PP-StructureV3
- 🖼️ **Image Extraction**: Preserve document images alongside text content
- 📑 **Batch Processing**: Process multiple files concurrently with progress tracking
- 📤 **Multiple Export Formats**: TXT, JSON, Excel, Markdown with images, searchable PDF
- 📋 **Office Documents**: DOC, DOCX, PPT, PPTX support via LibreOffice conversion
- 🔧 **Flexible Configuration**: Rule-based output formatting
- 🌐 **Translation Ready**: Architecture reserves extension points for a future translation feature

## Tech Stack

### Backend
- **Framework**: FastAPI 0.115.0
- **OCR Engine**: PaddleOCR 3.0+ with PaddleOCR-VL
- **Database**: MySQL via SQLAlchemy
- **PDF Generation**: Pandoc + WeasyPrint
- **Image Processing**: OpenCV, Pillow, pdf2image
- **Office Conversion**: LibreOffice (headless mode)

### Frontend
- **Framework**: React 19 with TypeScript
- **Build Tool**: Vite 7
- **Styling**: Tailwind CSS v4 + shadcn/ui
- **State Management**: React Query + Zustand
- **HTTP Client**: Axios

## Prerequisites

- **OS**: WSL2 Ubuntu 24.04
- **Python**: 3.12+
- **Node.js**: 24.x LTS
- **MySQL**: External database server (provided)

## Quick Start

### 1. Automated Setup (Recommended)

```bash
# Run automated setup script
./setup_dev_env.sh
```

This script automatically installs:
- Python development tools (pip, venv, build-essential)
- System dependencies (pandoc, LibreOffice, fonts, etc.)
- Node.js (via nvm)
- Python packages
- Frontend dependencies

### 2. Initialize Database

```bash
source venv/bin/activate
cd backend
alembic upgrade head
python create_test_user.py
cd ..
```

Default test user:
- Username: `admin`
- Password: `admin123`

### 3. Start Development Servers

**Backend (Terminal 1):**
```bash
./start_backend.sh
```

**Frontend (Terminal 2):**
```bash
./start_frontend.sh
```

### 4. Access Application

- **Frontend**: http://localhost:5173
- **API Docs**: http://localhost:8000/docs
- **Health Check**: http://localhost:8000/health
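
To confirm from a script that the backend is up, a minimal sketch like the one below can poll the health URL listed above. The `is_healthy` helper is hypothetical and not part of the project; it checks only the HTTP status code, since the response body of `/health` is not documented here.

```python
import urllib.request


def is_healthy(base_url: str = "http://localhost:8000", timeout: float = 3.0) -> bool:
    """Return True if GET <base_url>/health answers with HTTP 200 (hypothetical helper)."""
    url = base_url.rstrip("/") + "/health"
    try:
        with urllib.request.urlopen(url, timeout=timeout) as response:
            return response.status == 200
    except OSError:
        # URLError and HTTPError subclass OSError, so this covers refused
        # connections, timeouts, and non-2xx responses.
        return False
```

Calling `is_healthy()` after running `./start_backend.sh` should return `True` once the server has finished starting.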

## Project Structure

```
Tool_OCR/
├── backend/                 # FastAPI backend
│   ├── app/
│   │   ├── api/v1/         # API endpoints
│   │   ├── core/           # Configuration, database
│   │   ├── models/         # Database models
│   │   ├── services/       # Business logic
│   │   └── main.py         # Application entry point
│   ├── alembic/            # Database migrations
│   └── tests/              # Test suite
├── frontend/               # React frontend
│   ├── src/
│   │   ├── components/     # UI components
│   │   ├── pages/          # Page components
│   │   ├── services/       # API services
│   │   └── stores/         # State management
│   └── public/             # Static assets
├── .env.local              # Local development config
├── setup_dev_env.sh        # Environment setup script
├── start_backend.sh        # Backend startup script
└── start_frontend.sh       # Frontend startup script
```

## Configuration

Main config file: `.env.local`

```bash
# Database
MYSQL_HOST=mysql.theaken.com
MYSQL_PORT=33306

# Application ports
BACKEND_PORT=8000
FRONTEND_PORT=5173

# Token expiration (minutes)
ACCESS_TOKEN_EXPIRE_MINUTES=1440  # 24 hours

# Supported file formats
ALLOWED_EXTENSIONS=png,jpg,jpeg,pdf,bmp,tiff,doc,docx,ppt,pptx

# OCR settings
OCR_LANGUAGES=ch,en,japan,korean
MAX_OCR_WORKERS=4
```
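
As a rough illustration of how these settings could be read, the sketch below parses `.env`-style text into usable values. The `parse_env` and `split_csv` helpers are hypothetical and not taken from the project; the backend may load its configuration differently.

```python
from typing import Dict, List


def parse_env(text: str) -> Dict[str, str]:
    """Parse KEY=VALUE lines, skipping blanks and comment lines (hypothetical helper)."""
    values: Dict[str, str] = {}
    for raw in text.splitlines():
        line = raw.strip()
        if not line or line.startswith("#") or "=" not in line:
            continue
        key, _, value = line.partition("=")
        # Drop trailing inline comments such as "1440  # 24 hours".
        value = value.split(" #", 1)[0].strip()
        values[key.strip()] = value
    return values


def split_csv(value: str) -> List[str]:
    """Split a comma-separated setting into a list of lowercase items."""
    return [item.strip().lower() for item in value.split(",") if item.strip()]


# Sample text copied from the settings shown above.
SAMPLE = """
# Token expiration (minutes)
ACCESS_TOKEN_EXPIRE_MINUTES=1440  # 24 hours
ALLOWED_EXTENSIONS=png,jpg,jpeg,pdf,bmp,tiff,doc,docx,ppt,pptx
OCR_LANGUAGES=ch,en,japan,korean
MAX_OCR_WORKERS=4
"""

settings = parse_env(SAMPLE)
languages = split_csv(settings["OCR_LANGUAGES"])
max_workers = int(settings["MAX_OCR_WORKERS"])
```

The inline-comment handling matters for `ACCESS_TOKEN_EXPIRE_MINUTES`, whose value would otherwise include the `# 24 hours` suffix.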

## API Endpoints

### Authentication
- `POST /api/v1/auth/login` - User login

### File Management
- `POST /api/v1/upload` - Upload files
- `POST /api/v1/ocr/process` - Start OCR processing
- `GET /api/v1/batch/{id}/status` - Get batch status

### Results & Export
- `GET /api/v1/ocr/result/{id}` - Get OCR result
- `GET /api/v1/export/pdf/{id}` - Export as PDF

Full API documentation: http://localhost:8000/docs
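
For client scripts, the routes above can be collected in one place. The sketch below only builds URLs; the `ROUTES` table and `url_for` helper are hypothetical names, and request and response bodies are deliberately left out because they are not specified in this README (see the API docs for those).

```python
API_PREFIX = "/api/v1"

# Route templates copied from the endpoint list above.
ROUTES = {
    "login": "/auth/login",
    "upload": "/upload",
    "process": "/ocr/process",
    "batch_status": "/batch/{id}/status",
    "ocr_result": "/ocr/result/{id}",
    "export_pdf": "/export/pdf/{id}",
}


def url_for(name: str, base_url: str = "http://localhost:8000", **params) -> str:
    """Build the full URL for a named route, filling in path parameters."""
    path = ROUTES[name].format(**params)
    return base_url.rstrip("/") + API_PREFIX + path
```

For example, `url_for("batch_status", id=7)` yields the status URL for batch 7 on the local development server.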

## Supported File Formats

- **Images**: PNG, JPG, JPEG, BMP, TIFF
- **Documents**: PDF
- **Office**: DOC, DOCX, PPT, PPTX

Office files are automatically converted to PDF before OCR processing.
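
The routing described above can be sketched as follows. This is an illustration under assumptions, not the project's conversion service: `classify` and `libreoffice_command` are hypothetical helpers, and the executable may be named `soffice` or `libreoffice` depending on the installation.

```python
from pathlib import Path
from typing import List

IMAGE_EXTS = {"png", "jpg", "jpeg", "bmp", "tiff"}
OFFICE_EXTS = {"doc", "docx", "ppt", "pptx"}


def classify(filename: str) -> str:
    """Map a filename to 'image', 'pdf', 'office', or 'unsupported' by extension."""
    ext = Path(filename).suffix.lower().lstrip(".")
    if ext in IMAGE_EXTS:
        return "image"
    if ext == "pdf":
        return "pdf"
    if ext in OFFICE_EXTS:
        return "office"
    return "unsupported"


def libreoffice_command(src: str, out_dir: str) -> List[str]:
    """Build a headless LibreOffice command that converts src to PDF in out_dir."""
    # Assumes the binary is on PATH as "soffice".
    return ["soffice", "--headless", "--convert-to", "pdf", "--outdir", out_dir, src]
```

The command list can be passed to `subprocess.run(..., check=True)`; the resulting PDF then goes through the same OCR path as an uploaded PDF.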

## Development

### Backend

```bash
source venv/bin/activate
cd backend

# Run tests
pytest

# Database migration
alembic revision --autogenerate -m "description"
alembic upgrade head

# Code formatting
black app/
```

### Frontend

```bash
cd frontend

# Development server
npm run dev

# Build for production
npm run build

# Lint code
npm run lint
```

## OpenSpec Workflow

This project follows OpenSpec for specification-driven development:

```bash
# View current changes
openspec list

# Validate specifications
openspec validate add-ocr-batch-processing

# View implementation tasks
cat openspec/changes/add-ocr-batch-processing/tasks.md
```

## Roadmap

- [x] **Phase 0**: Environment setup
- [x] **Phase 1**: Core OCR backend (~98% complete)
- [x] **Phase 2**: Frontend development (~92% complete)
- [ ] **Phase 3**: Testing & optimization
- [ ] **Phase 4**: Deployment automation
- [ ] **Phase 5**: Translation feature (future)

## Documentation

- Development specs: [openspec/project.md](openspec/project.md)
- Implementation status: [openspec/changes/add-ocr-batch-processing/STATUS.md](openspec/changes/add-ocr-batch-processing/STATUS.md)
- Agent instructions: [openspec/AGENTS.md](openspec/AGENTS.md)

## License

For internal use only.

## Notes

- The first OCR run downloads the PaddleOCR models (~900MB), so expect it to take longer than later runs
- Token expiration defaults to 24 hours (`ACCESS_TOKEN_EXPIRE_MINUTES=1440`)
- Office conversion requires LibreOffice, which the setup script installs
- The development environment is WSL2 Ubuntu 24.04 with a Python venv