# Tool_OCR

**OCR Batch Processing System with Structure Extraction**

A web-based solution to extract text, images, and document structure from multiple files efficiently using PaddleOCR-VL.

## Features

- 🔍 **Multi-Language OCR**: Support for 109 languages (Chinese, English, Japanese, Korean, etc.)
- 📄 **Document Structure Analysis**: Intelligent layout analysis with PP-StructureV3
- 🖼️ **Image Extraction**: Preserve document images alongside text content
- 📑 **Batch Processing**: Process multiple files concurrently with progress tracking
- 📤 **Multiple Export Formats**: TXT, JSON, Excel, Markdown with images, searchable PDF
- 🔧 **Flexible Configuration**: Rule-based output formatting
- 🌐 **Translation Ready**: Reserved architecture for future translation features

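The batch-processing feature (concurrent processing with progress tracking) could look roughly like the sketch below. It is not the project's implementation: `process_file` is a hypothetical stand-in for the real OCR call, and `run_batch`, `max_workers`, and `on_progress` are illustrative names.

```python
from concurrent.futures import ThreadPoolExecutor, as_completed
from pathlib import Path
from typing import Callable, Dict, Iterable


def process_file(path: Path) -> str:
    """Hypothetical stand-in for the real OCR call; returns placeholder text."""
    return f"text extracted from {path.name}"


def run_batch(
    paths: Iterable[Path],
    worker: Callable[[Path], str] = process_file,
    max_workers: int = 4,
    on_progress: Callable[[int, int], None] = lambda done, total: None,
) -> Dict[str, str]:
    """Process files concurrently and report progress after each one finishes."""
    files = list(paths)
    results: Dict[str, str] = {}
    with ThreadPoolExecutor(max_workers=max_workers) as pool:
        # Map each future back to its file so results can be labelled.
        futures = {pool.submit(worker, p): p for p in files}
        for done, future in enumerate(as_completed(futures), start=1):
            path = futures[future]
            try:
                results[path.name] = future.result()
            except Exception as exc:  # one bad file should not abort the batch
                results[path.name] = f"ERROR: {exc}"
            on_progress(done, len(files))
    return results


if __name__ == "__main__":
    batch = [Path("a.png"), Path("b.pdf"), Path("c.jpg")]
    out = run_batch(batch, on_progress=lambda d, t: print(f"{d}/{t} files done"))
    print(sorted(out))
```

Threads are used here only to keep the sketch short; a CPU-bound OCR engine may be better served by a process pool or a task queue.
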
## Tech Stack

### Backend
- **Framework**: FastAPI 0.115.0
- **OCR Engine**: PaddleOCR 3.0+ with PaddleOCR-VL
- **Database**: MySQL via SQLAlchemy
- **PDF Generation**: Pandoc + WeasyPrint
- **Image Processing**: OpenCV, Pillow, pdf2image

### Frontend
- **Framework**: React 18 with Vite
- **Styling**: TailwindCSS + shadcn/ui
- **HTTP Client**: Axios with React Query

## Prerequisites

- **macOS**: Apple Silicon (M1/M2/M3) or Intel
- **Python**: 3.10+
- **Conda**: Miniconda or Anaconda (installed automatically by `setup_conda.sh` if missing)
- **Homebrew**: For system dependencies
- **MySQL**: External database server (provided)

## Installation

### 1. Automated Setup (Recommended)

```bash
# Change into your local checkout of the repository (adjust the path as needed)
cd /Users/egg/Projects/Tool_OCR

# Run automated setup script
chmod +x setup_conda.sh
./setup_conda.sh

# If Conda was just installed, reload your shell
source ~/.zshrc  # or source ~/.bash_profile

# Run the script again to create the environment
./setup_conda.sh
```

### 2. Install Dependencies

```bash
# Activate Conda environment
conda activate tool_ocr

# Install Python dependencies
pip install -r requirements.txt

# Install system dependencies (Pandoc for PDF generation)
brew install pandoc

# Install Chinese fonts for PDF generation (optional)
brew install --cask font-noto-sans-cjk
# Note: macOS built-in fonts work fine, so this step is optional
```

### 3. Download PaddleOCR Models

```bash
# Create models directory
mkdir -p models/paddleocr

# Models will be automatically downloaded on first run
# (~900MB total, includes PaddleOCR-VL 0.9B model)
```

### 4. Configure Environment

```bash
# Copy environment template
cp .env.example .env

# Edit .env with your settings
# Database credentials are pre-configured
nano .env
```

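After editing `.env`, a quick sanity check can catch missing settings before the server starts. The sketch below is a minimal stdlib parser for `KEY=VALUE` lines; the key names in `REQUIRED` are hypothetical, so check `.env.example` for the real ones.

```python
from pathlib import Path
from typing import Dict, List, Tuple


def load_env(path: Path) -> Dict[str, str]:
    """Parse simple KEY=VALUE lines, skipping blanks and # comments."""
    settings: Dict[str, str] = {}
    for raw in path.read_text(encoding="utf-8").splitlines():
        line = raw.strip()
        if not line or line.startswith("#") or "=" not in line:
            continue
        key, _, value = line.partition("=")
        # Drop surrounding quotes so KEY="value" and KEY=value match.
        settings[key.strip()] = value.strip().strip('"').strip("'")
    return settings


def missing_keys(settings: Dict[str, str], required: Tuple[str, ...]) -> List[str]:
    """Return the required keys that are absent or empty."""
    return [key for key in required if not settings.get(key)]


# Hypothetical key names; check .env.example for the real ones.
REQUIRED = ("MYSQL_HOST", "MYSQL_PORT", "MYSQL_DATABASE")

if __name__ == "__main__":
    env_file = Path(".env")
    if env_file.exists():
        print("Missing settings:", missing_keys(load_env(env_file), REQUIRED))
    else:
        print("No .env found; run `cp .env.example .env` first")
```
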
### 5. Initialize Database

```bash
# Database schema will be created automatically on first run
# Using: mysql.theaken.com:33306/db_A060
```

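For reference, the host, port, and database name above can be combined into a SQLAlchemy-style connection URL as sketched below. The `mysql+pymysql` driver prefix, the function name, and the example credentials are assumptions for illustration; the project's actual driver and settings may differ.

```python
from urllib.parse import quote_plus


def build_mysql_url(
    user: str,
    password: str,
    host: str = "mysql.theaken.com",
    port: int = 33306,
    database: str = "db_A060",
    driver: str = "mysql+pymysql",  # assumed driver; not confirmed by this README
) -> str:
    """Compose a SQLAlchemy-style URL, percent-encoding the credentials."""
    return (
        f"{driver}://{quote_plus(user)}:{quote_plus(password)}"
        f"@{host}:{port}/{database}"
    )


if __name__ == "__main__":
    # Placeholder credentials; real values belong in .env, not in code.
    print(build_mysql_url("ocr_user", "p@ss/word"))
```

Encoding the credentials matters because characters such as `@` or `/` in a password would otherwise be read as URL delimiters.
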
## Usage

### Start Backend Server

```bash
# Activate environment
conda activate tool_ocr

# Start FastAPI server
cd backend
python -m app.main

# Server runs at: http://localhost:12010
# API docs: http://localhost:12010/docs
```

### Start Frontend (Coming Soon)

```bash
# Install frontend dependencies
cd frontend
npm install

# Start development server
npm run dev

# Frontend runs at: http://localhost:12011
```

## Project Structure

```
Tool_OCR/
├── backend/
│   ├── app/
│   │   ├── api/v1/          # API endpoints
│   │   ├── core/            # Configuration, database
│   │   ├── models/          # Database models
│   │   ├── services/        # Business logic
│   │   ├── utils/           # Utilities
│   │   └── main.py          # Application entry point
│   └── tests/               # Test suite
├── frontend/
│   └── src/                 # React application
├── uploads/
│   ├── temp/                # Temporary uploads
│   ├── processed/           # Processed files
│   └── images/              # Extracted images
├── storage/
│   ├── markdown/            # Markdown outputs
│   ├── json/                # JSON results
│   └── exports/             # Export files
├── models/
│   └── paddleocr/           # PaddleOCR models
├── config/                  # Configuration files
├── templates/               # PDF templates
├── logs/                    # Application logs
├── requirements.txt         # Python dependencies
├── setup_conda.sh           # Environment setup script
├── .env.example             # Environment template
└── README.md
```

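The runtime directories in the tree above (`uploads/`, `storage/`, `models/paddleocr/`, `logs/`) may not exist in a fresh checkout. Whether the application or `setup_conda.sh` creates them is not stated here, so the following is only a sketch of one way to create them; `ensure_runtime_dirs` is an illustrative name.

```python
from pathlib import Path
from typing import List

# Directory names taken from the project tree above.
RUNTIME_DIRS = (
    "uploads/temp",
    "uploads/processed",
    "uploads/images",
    "storage/markdown",
    "storage/json",
    "storage/exports",
    "models/paddleocr",
    "logs",
)


def ensure_runtime_dirs(root: Path) -> List[Path]:
    """Create any missing runtime directories under root; return those created."""
    created: List[Path] = []
    for rel in RUNTIME_DIRS:
        target = root / rel
        if not target.is_dir():
            target.mkdir(parents=True, exist_ok=True)
            created.append(target)
    return created


if __name__ == "__main__":
    for path in ensure_runtime_dirs(Path.cwd()):
        print("created", path)
```

Running it a second time creates nothing, so it is safe to call on every startup.
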
## API Endpoints (Planned)

- `POST /api/v1/ocr/upload` - Upload files for OCR processing
- `GET /api/v1/ocr/tasks` - List all OCR tasks
- `GET /api/v1/ocr/tasks/{task_id}` - Get task details
- `POST /api/v1/ocr/batch` - Create batch processing task
- `GET /api/v1/export/{task_id}` - Export results (TXT/JSON/Excel/MD/PDF)
- `POST /api/v1/translate/document` - Translate document (reserved, returns 501)

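Since these endpoints are still planned, the following is only a sketch of how a client might address them once they exist. It uses the paths listed above and the backend address from the Usage section; the helper names are illustrative, and request and response bodies are left out because their schemas are not defined here.

```python
from urllib.error import HTTPError
from urllib.parse import quote
from urllib.request import Request, urlopen

BASE_URL = "http://localhost:12010"  # backend address from the Usage section


def build_request(method: str, path: str) -> Request:
    """Create a request object for one of the planned endpoints."""
    return Request(f"{BASE_URL}{path}", method=method)


def task_detail(task_id: str) -> Request:
    """GET /api/v1/ocr/tasks/{task_id}"""
    return build_request("GET", f"/api/v1/ocr/tasks/{quote(str(task_id), safe='')}")


def export_results(task_id: str) -> Request:
    """GET /api/v1/export/{task_id}"""
    return build_request("GET", f"/api/v1/export/{quote(str(task_id), safe='')}")


def send(request: Request, timeout: float = 10.0):
    """Send a request; return (status, body), including error statuses such as 501."""
    try:
        with urlopen(request, timeout=timeout) as resp:
            return resp.status, resp.read()
    except HTTPError as err:
        return err.code, err.read()
```

With the backend running, `send(task_detail("some-id"))` would return the status code and raw body. For the reserved translation endpoint, `send(build_request("POST", "/api/v1/translate/document"))` should yield a 501 status according to the list above.
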
## Development

### Run Tests

```bash
cd backend
pytest tests/ -v --cov=app
```

### Code Quality

```bash
# Run these from the backend/ directory

# Format code
black app/

# Lint code
pylint app/
```

## OpenSpec Workflow

This project follows OpenSpec for specification-driven development:

```bash
# View current changes
openspec list

# Validate specifications
openspec validate add-ocr-batch-processing

# View implementation tasks
cat openspec/changes/add-ocr-batch-processing/tasks.md
```

## Roadmap

- [x] **Phase 0**: Environment setup and configuration
- [ ] **Phase 1**: Core OCR with structure extraction
- [ ] **Phase 2**: Frontend development
- [ ] **Phase 3**: Testing & optimization
- [ ] **Phase 4**: Deployment
- [ ] **Phase 5**: Translation feature (future)

## License

[To be determined]

## Contributors

- Development environment: macOS Apple Silicon
- Database: MySQL external server
- OCR Engine: PaddleOCR-VL 0.9B with PP-StructureV3

## Support

For issues and questions, refer to:
- OpenSpec documentation: `openspec/AGENTS.md`
- Task breakdown: `openspec/changes/add-ocr-batch-processing/tasks.md`
- Specifications: `openspec/changes/add-ocr-batch-processing/specs/`