egg/OCR

egg 3f41a33877 docs: update documentation for chart recognition enablement

Updates all project documentation to reflect that chart recognition
is now fully enabled with PaddlePaddle 3.2.1+.

Changes:
- README.md: Remove Known Limitations section about chart recognition,
  update tech stack and prerequisites to include PaddlePaddle 3.2.1+,
  add WSL CUDA configuration notes
- openspec/project.md: Add comprehensive chart recognition feature
  descriptions, update system requirements for GPU/CUDA support
- openspec/changes/add-gpu-acceleration-support/tasks.md: Mark task
  5.4 as completed with resolution details
- openspec/changes/add-gpu-acceleration-support/proposal.md: Update
  Known Issues section to show chart recognition is now resolved
- setup_dev_env.sh: Upgrade PaddlePaddle from 3.0.0 to 3.2.1+, add
  WSL CUDA library path configuration, add chart recognition API
  verification

All documentation now accurately reflects:
✅ Chart recognition fully enabled
✅ PaddlePaddle 3.2.1+ with fused_rms_norm_ext API
✅ WSL CUDA path auto-configuration
✅ Comprehensive PP-StructureV3 capabilities

🤖 Generated with [Claude Code](https://claude.com/claude-code)

Co-Authored-By: Claude <noreply@anthropic.com>

2025-11-16 19:04:30 +08:00

14 KiB

Project Context

Purpose

Tool_OCR is a web-based application for batch image-to-text conversion with multi-language support and rule-based output formatting. It uses a separated frontend and backend, and is designed to process multiple images or PDFs at once, extract their text with OCR, and export the results in various formats according to user-defined rules.

Key Goals:

Batch processing of images and PDF files for text extraction via web interface
Multi-language OCR support (Chinese, English, and other languages)
Rule-based output formatting and organization
User-friendly web interface accessible via browser
Export flexibility (TXT, JSON, Excel, etc.)
RESTful API for OCR processing

Tech Stack

Development Environment

OS Platform: WSL2 Ubuntu 24.04
Python Version: 3.12
Environment Manager: Python venv
Virtual Environment Path: ./venv
Node.js: 24.x LTS (via nvm)
Recommended IDE: VS Code with Python + React extensions

Backend Technologies

Language: Python 3.10+
Web Framework: FastAPI (modern, async, auto API docs)
OCR Engine: PaddleOCR 3.0+ with PaddleOCR-VL (deep learning-based, excellent multi-language support)
Deep Learning Framework: PaddlePaddle 3.2.1+ (GPU/CPU support, CUDA 11.8/12.3/12.6+)
Structure Analysis: PP-StructureV3 (layout analysis, table recognition, formula extraction, chart recognition)
PDF Processing: PyPDF2 / pdf2image
Image Processing: Pillow (PIL), OpenCV
Data Export: pandas (Excel), json (JSON)
Database: MySQL (configuration storage, task history)
Cache: Redis (optional, for task queue)
Authentication: JWT

Frontend Technologies

Framework: React 18+
Build Tool: Vite
UI Library: Tailwind CSS + shadcn/ui
State Management: React Query (for API calls) + Zustand (for global state)
HTTP Client: Axios
File Upload: react-dropzone

Development Tools

Package Manager: Conda + pip (backend), npm/pnpm (frontend)
Deployment: 1Panel (web-based server management)
Process Manager: systemd / PM2 / Supervisor
Web Server: Nginx (reverse proxy)
Testing: pytest (backend), Vitest (frontend)
Code Style: Black + pylint (Python), ESLint + Prettier (JavaScript/TypeScript)
Version Control: Git

Key Libraries (Backend)

fastapi: Web framework
uvicorn: ASGI server
paddleocr: OCR processing
paddlepaddle: Deep learning framework (GPU/CPU)
paddlex[ocr]: PP-StructureV3 for layout analysis and chart recognition
pdf2image: PDF to image conversion
pillow: Image manipulation
opencv-python: Advanced image processing
pandas: Data export to Excel
pyyaml: Configuration management
python-jose: JWT authentication
sqlalchemy: Database ORM
pydantic: Data validation

Key Libraries (Frontend)

react: UI framework
vite: Build tool
tailwindcss: CSS framework
shadcn/ui: UI components
axios: HTTP client
react-query: Server state management
zustand: Client state management
react-dropzone: File upload

Project Conventions

Environment Setup (Backend)

# Run automated setup script (recommended)
./setup_dev_env.sh

# Or manually:
# Create Python virtual environment
python3 -m venv venv

# Activate environment
source venv/bin/activate

# Install dependencies
pip install -r requirements.txt

Environment Setup (Frontend)

# Navigate to frontend directory
cd frontend

# Install dependencies
npm install

# Run dev server
npm run dev

Code Style

Backend (Python)

Formatter: Black with line length 100
Naming Conventions:
- Classes: PascalCase (e.g., OcrProcessor, ImageService)
- Functions/Methods: snake_case (e.g., process_image, export_results)
- Constants: UPPER_SNAKE_CASE (e.g., MAX_BATCH_SIZE, DEFAULT_LANG)
- Private members: prefix with underscore (e.g., _internal_method)
Docstrings: Google style for all public functions and classes
Type Hints: Use type hints for function signatures (FastAPI relies on them for request validation and generated API docs)
Imports: Organized by standard library, third-party, local (separated by blank lines)
Encoding: UTF-8 for all Python files
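
The naming, docstring, and type-hint conventions above can be seen together in a short module. This is only an illustrative sketch: `OcrResult`, `process_batch`, the constant values, and the placeholder OCR step are hypothetical and are not taken from the project code.

```python
"""Illustrative module following the backend style conventions."""

from dataclasses import dataclass
from pathlib import Path

# Constants use UPPER_SNAKE_CASE (values here are hypothetical).
MAX_BATCH_SIZE = 50
DEFAULT_LANG = "ch"


@dataclass
class OcrResult:
    """Text recognized in one image.

    Attributes:
        source: Path of the processed image.
        text: Recognized text.
    """

    source: Path
    text: str


class OcrProcessor:
    """Processes images in batches (placeholder logic only)."""

    def __init__(self, lang: str = DEFAULT_LANG) -> None:
        self.lang = lang

    def process_batch(self, paths: list[Path]) -> list[OcrResult]:
        """Runs the placeholder OCR step on each path.

        Args:
            paths: Image files to process.

        Returns:
            One OcrResult per input path, in input order.

        Raises:
            ValueError: If more than MAX_BATCH_SIZE paths are given.
        """
        if len(paths) > MAX_BATCH_SIZE:
            raise ValueError(f"batch exceeds {MAX_BATCH_SIZE} files")
        return [self._process_one(path) for path in paths]

    def _process_one(self, path: Path) -> OcrResult:
        """Private helper; a real implementation would call the OCR engine."""
        return OcrResult(source=path, text=f"<text of {path.name} in {self.lang}>")
```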

Frontend (JavaScript/TypeScript)

Formatter: Prettier
Naming Conventions:
- Components: PascalCase (e.g., ImageUpload, ResultsTable)
- Functions/Variables: camelCase (e.g., processImage, ocrResults)
- Constants: UPPER_SNAKE_CASE (e.g., MAX_FILE_SIZE, API_BASE_URL)
- CSS Classes: kebab-case (Tailwind convention)
File Structure: One component per file
Imports: Group by external, internal, types

Architecture Patterns

Backend Architecture

Layered Architecture:
- Router Layer (FastAPI routes)
- Service Layer (business logic)
- Data Access Layer (database/file operations)
- Model Layer (Pydantic models)
Async/Await: Use async operations for I/O bound tasks
Dependency Injection: FastAPI's dependency injection for services
Error Handling: Custom exception handlers with proper HTTP status codes
Logging: Structured logging with log levels
Background Tasks: FastAPI BackgroundTasks for long-running OCR jobs
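
The layering and dependency injection described above can be sketched without any framework. The example below deliberately leaves out FastAPI wiring and uses an in-memory dictionary in place of MySQL; `TaskRecord`, `TaskRepository`, `OcrTaskService`, and `create_task_route` are hypothetical names chosen for illustration.

```python
import asyncio
from dataclasses import dataclass
from typing import Optional


@dataclass
class TaskRecord:
    """Model layer: the shape of a stored OCR task (hypothetical fields)."""

    task_id: int
    filename: str
    status: str = "pending"


class TaskRepository:
    """Data access layer: an in-memory stand-in for the database."""

    def __init__(self) -> None:
        self._rows: dict[int, TaskRecord] = {}

    async def add(self, filename: str) -> TaskRecord:
        record = TaskRecord(task_id=len(self._rows) + 1, filename=filename)
        self._rows[record.task_id] = record
        return record

    async def get(self, task_id: int) -> Optional[TaskRecord]:
        return self._rows.get(task_id)


class OcrTaskService:
    """Service layer: business rules; the repository is injected."""

    def __init__(self, repo: TaskRepository) -> None:
        self._repo = repo

    async def submit(self, filename: str) -> TaskRecord:
        if not filename:
            raise ValueError("filename must not be empty")
        return await self._repo.add(filename)


async def create_task_route(filename: str, service: OcrTaskService) -> dict:
    """Router layer: turns the service result into a response body."""
    record = await service.submit(filename)
    return {"task_id": record.task_id, "status": record.status}


if __name__ == "__main__":
    service = OcrTaskService(TaskRepository())
    print(asyncio.run(create_task_route("scan.pdf", service)))
```

Because the service receives its repository through the constructor, a test can pass in a fake repository without touching a real database.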

Frontend Architecture

Component-Based: Reusable React components
Atomic Design: atoms → molecules → organisms → templates → pages
API Layer: Centralized API client with React Query
State Management: Server state (React Query) + Client state (Zustand)
Routing: React Router for SPA navigation
Error Boundaries: Graceful error handling in UI

API Design

RESTful: Follow REST conventions
Versioning: API versioned as /api/v1/...
Documentation: Auto-generated via FastAPI (Swagger/OpenAPI)

Response Format: Consistent JSON structure

{
  "success": true,
  "data": {},
  "message": "Success",
  "timestamp": "2025-01-01T00:00:00Z"
}
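
One way to keep every endpoint on this envelope is a single helper that builds it. The function below is a minimal sketch using only the standard library; the name `make_response` and its defaults are assumptions, not part of the project code.

```python
from datetime import datetime, timezone
from typing import Any


def make_response(data: Any = None, message: str = "Success", success: bool = True) -> dict:
    """Wraps a payload in the envelope: success, data, message, timestamp."""
    return {
        "success": success,
        "data": {} if data is None else data,
        "message": message,
        # UTC timestamp formatted like 2025-01-01T00:00:00Z.
        "timestamp": datetime.now(timezone.utc).strftime("%Y-%m-%dT%H:%M:%SZ"),
    }
```

An error response would reuse the same helper, for example `make_response(message="File too large", success=False)`.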

Testing Strategy

Backend Testing

Unit Tests: Test services, utilities, data models
Integration Tests: Test API endpoints end-to-end
Test Framework: pytest with pytest-asyncio
Coverage Target: Minimum 70% code coverage
Test Command: pytest tests/ -v --cov=app

Frontend Testing

Component Tests: Test React components with Vitest + React Testing Library
Integration Tests: Test user workflows
E2E Tests: Optional with Playwright
Test Command: npm run test

Git Workflow

Branching: Feature branches from main (e.g., feature/add-pdf-support)
Commits: Conventional Commits format (e.g., feat:, fix:, docs:)
PRs: Require passing tests before merge
Versioning: Semantic versioning (MAJOR.MINOR.PATCH)

Domain Context

OCR Concepts

Recognition Accuracy: Depends on image quality, language, and font type
Preprocessing: Image enhancement (contrast, denoising) can improve OCR accuracy
Multi-Language: PaddleOCR supports Chinese, English, Japanese, Korean, and many others
Bounding Boxes: OCR engines detect text regions before recognition
Confidence Scores: Each recognized text segment has a confidence score between 0 and 1

Document Structure Analysis (PP-StructureV3)

Layout Analysis: Automatic detection of document regions (text, images, tables, charts, formulas)
Table Recognition: Extract table structure and content with support for nested formulas and images
Formula Recognition: Convert mathematical formulas to LaTeX format
Chart Recognition (✅ Enabled with PaddlePaddle 3.2.1+):
- Chart Type Detection: Identify bar charts, line charts, pie charts, scatter plots, etc.
- Data Extraction: Extract numerical data points from chart visualizations
- Axis & Legend Parsing: Recognize axis labels, tick values, and legend information
- Structured Output: Convert chart content to JSON or tabular format
- Performance: GPU acceleration recommended for best results (2-10 seconds per chart)
- Accuracy: >85% for simple charts, >70% for complex multi-axis charts
Image Extraction: Preserve and save embedded images from documents

Use Cases

Digitizing scanned documents and images via web upload
Extracting text from screenshots for archival
Processing receipts and invoices for data entry
Converting image-based PDFs to searchable text
Batch processing multiple files via drag-and-drop interface

Output Rules

Users can define custom rules for organizing extracted text
Examples: group by file name pattern, filter by confidence threshold, format as structured data
Export formats: plain text files, JSON with metadata, Excel spreadsheets
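
As a rough illustration of two of the example rules (filter by confidence threshold, group by file name pattern), the sketch below works on plain Python objects. The `TextLine` record, the `apply_rules` function, and the "other" fallback group are hypothetical; the actual rule format is not defined in this document.

```python
import fnmatch
from collections import defaultdict
from dataclasses import dataclass


@dataclass
class TextLine:
    """One recognized line of text (hypothetical record shape)."""

    filename: str
    text: str
    confidence: float  # between 0.0 and 1.0


def apply_rules(lines: list[TextLine], min_confidence: float, group_patterns: list[str]) -> dict:
    """Drops low-confidence lines, then groups the rest by file name pattern.

    A line is placed under the first pattern its file name matches;
    lines matching no pattern go under "other".
    """
    grouped = defaultdict(list)
    for line in lines:
        if line.confidence < min_confidence:
            continue
        key = next(
            (p for p in group_patterns if fnmatch.fnmatchcase(line.filename, p)),
            "other",
        )
        grouped[key].append(
            {"file": line.filename, "text": line.text, "confidence": line.confidence}
        )
    return dict(grouped)
```

The returned dictionary is already structured data, so it could be serialized with `json.dumps` or flattened into rows for an Excel export.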

Important Constraints

Technical Constraints

Platform: Windows 10/11 or WSL2 Ubuntu 24.04 (development), Linux server managed by 1Panel (deployment)
Web Application: Browser-based interface (Chrome, Firefox, Edge)
Local Processing: All OCR processing happens on backend server (no cloud dependencies)
Resource Intensive: OCR is CPU/GPU intensive; consider task queue for batch processing
File Size Limits: Set max upload size (e.g., 20MB per file, 100MB per batch)
Language Models: PaddleOCR models must be downloaded (~100MB+ per language)
Conda Environment: Backend development must be done within Conda virtual environment
Port Range: Web services must use ports 12010-12019
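
The file size limits above could be enforced before any OCR work starts. The sketch below uses the example values from this list (20MB per file, 100MB per batch) and assumes 1MB means 1024 × 1024 bytes; `check_upload_sizes` is a hypothetical helper.

```python
MAX_FILE_BYTES = 20 * 1024 * 1024    # 20MB per file (example value)
MAX_BATCH_BYTES = 100 * 1024 * 1024  # 100MB per batch (example value)


def check_upload_sizes(sizes: dict[str, int]) -> list[str]:
    """Returns a list of problems; an empty list means the batch is acceptable.

    Args:
        sizes: Mapping of file name to size in bytes.
    """
    problems = [
        f"{name}: {size} bytes exceeds the per-file limit"
        for name, size in sizes.items()
        if size > MAX_FILE_BYTES
    ]
    total = sum(sizes.values())
    if total > MAX_BATCH_BYTES:
        problems.append(f"batch total {total} bytes exceeds the batch limit")
    return problems
```

Returning every problem at once, rather than raising on the first, lets the UI show one complete and actionable error message.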

User Experience Constraints

Target Users: Non-technical users who need simple batch OCR via web
Browser Compatibility: Modern browsers (Chrome 90+, Firefox 88+, Edge 90+)
Performance: UI must show progress feedback during OCR processing
Error Messages: Clear, actionable error messages in Traditional Chinese
Responsive Design: UI should work on desktop and tablet (mobile optional)

Business Constraints

Open Source: Use only open-source libraries (no paid API dependencies)
Deployment: 1Panel-based deployment (no Docker required)
Offline Capable: Must work without internet after initial setup (except model downloads)
Authentication: JWT-based auth (optional LDAP integration for enterprise)

Security Constraints

File Upload: Validate file types, scan for malware (optional)
Authentication: JWT tokens with expiration
CORS: Configure CORS for frontend-backend communication
Input Validation: Strict validation on all API inputs
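
File type validation is more reliable when it inspects the first bytes of the upload instead of trusting the file extension. The sketch below checks the well-known signatures of PNG, JPEG, and PDF files; the choice of these three formats and the name `detect_file_type` are assumptions for illustration.

```python
from typing import Optional

# Leading bytes ("magic numbers") of the formats this sketch accepts.
SIGNATURES = {
    "png": b"\x89PNG\r\n\x1a\n",
    "jpeg": b"\xff\xd8\xff",
    "pdf": b"%PDF-",
}


def detect_file_type(head: bytes) -> Optional[str]:
    """Returns "png", "jpeg", or "pdf" based on leading bytes, else None.

    Args:
        head: The first bytes of the uploaded file (8 or more is enough).
    """
    for name, signature in SIGNATURES.items():
        if head.startswith(signature):
            return name
    return None
```

An upload handler could read the first few bytes, call `detect_file_type`, and reject the file when the result is `None` or disagrees with the declared extension.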

External Dependencies

Database Configuration

MySQL Host: mysql.theaken.com
MySQL Port: 33306
MySQL User: A060
MySQL Password: WLeSCi0yhtc7
MySQL Database: db_A060
MySQL Charset: utf8mb4
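
Since `.env` is listed among the backend configuration files, the connection settings can be read from environment variables at startup. The sketch below assembles a SQLAlchemy-style URL; the variable names (`MYSQL_HOST` and so on) and the `pymysql` driver are assumptions, not confirmed by this document.

```python
import os
from collections.abc import Mapping
from urllib.parse import quote_plus


def build_database_url(env: Mapping[str, str] = os.environ) -> str:
    """Assembles a MySQL connection URL from environment variables.

    The variable names are hypothetical. User and password are
    URL-escaped so characters such as "@" do not break the URL.
    """
    user = quote_plus(env["MYSQL_USER"])
    password = quote_plus(env["MYSQL_PASSWORD"])
    host = env["MYSQL_HOST"]
    port = env.get("MYSQL_PORT", "3306")
    database = env["MYSQL_DATABASE"]
    charset = env.get("MYSQL_CHARSET", "utf8mb4")
    # "mysql+pymysql" assumes the PyMySQL driver is installed.
    return f"mysql+pymysql://{user}:{password}@{host}:{port}/{database}?charset={charset}"
```

The resulting string could then be passed to SQLAlchemy's `create_engine`.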

SMTP Configuration (Optional)

SMTP Server: mail.panjit.com.tw
SMTP Port: 25
SMTP TLS: false
SMTP Auth: false
Sender Email: tool-ocr-system@panjit.com.tw

LDAP Configuration (Optional)

LDAP Server: panjit.com.tw
LDAP Port: 389

Conda Environment

Environment Name: tool_ocr
Python Version: 3.10
Base Path: C:\Users\lin46\.conda\envs\tool_ocr
Activation: Always activate environment before backend development

OCR Models

PaddleOCR Models: Downloaded automatically on first run or manually installed
Model Storage: Local cache directory or Docker volume
Supported Languages: Chinese (simplified/traditional), English, Japanese, Korean, etc.
Model Size: ~100-200MB per language pack

System Requirements

Python: 3.10+ (managed by Conda or venv)
Node.js: 18+ (for frontend development and build)
RAM: Minimum 4GB (8GB recommended for batch processing, 16GB+ for GPU usage)
Disk Space: ~2GB for application + models + dependencies
OS: Windows 10/11 (development), WSL2 Ubuntu 24.04 (development), Linux (1Panel deployment server)
GPU (Optional but recommended):
- NVIDIA GPU with CUDA 11.8, 12.3, or 12.6+ support
- GPU Memory: Minimum 4GB (8GB+ recommended for chart recognition)
- WSL2 GPU: NVIDIA CUDA drivers installed for WSL
- Performance: 3-10x speedup for OCR and chart recognition
Web Server: Nginx (for static files and reverse proxy)
Process Manager: Supervisor / PM2 / systemd (for backend service)

Port Configuration

Backend API: 12010 (FastAPI via uvicorn)
Frontend Dev Server: 12011 (Vite, development only)
Nginx: 80/443 (production, managed by 1Panel)
MySQL: 33306 (external)
Redis: 6379 (optional, local)
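
A small startup check can confirm that the application's own services stay inside the required 12010-12019 range. The sketch below assumes the range applies to the backend API and the Vite dev server only, not to Nginx, MySQL, or Redis; `ports_outside_range` is a hypothetical helper.

```python
WEB_PORT_RANGE = range(12010, 12020)  # 12010 through 12019 inclusive

WEB_SERVICE_PORTS = {
    "backend_api": 12010,
    "frontend_dev": 12011,
}


def ports_outside_range(ports: dict[str, int]) -> dict[str, int]:
    """Returns the services whose port is not within WEB_PORT_RANGE."""
    return {name: port for name, port in ports.items() if port not in WEB_PORT_RANGE}
```

An empty result from `ports_outside_range(WEB_SERVICE_PORTS)` means the configuration complies.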

Deployment Architecture (1Panel)

Development: Windows with Conda + local Node.js
Production: Linux server managed by 1Panel
Backend Deployment:
- Conda environment on production server
- uvicorn runs FastAPI on port 12010
- Managed by Supervisor/PM2/systemd for auto-restart
Frontend Deployment:
- Build static files with npm run build
- Served by Nginx (configured via 1Panel)
- Nginx reverse proxies /api to backend (12010)
1Panel Features:
- Website management (Nginx configuration)
- Process management (backend service)
- SSL certificate management (Let's Encrypt)
- File management and deployment

Configuration Files

Backend:
- environment.yml: Conda environment specification
- requirements.txt: Pip dependencies
- .env: Environment variables (database, JWT secret, etc.)
- config.yaml: Application configuration
- start.sh: Backend startup script
Frontend:
- package.json: npm dependencies
- .env.production: Production environment variables (API URL)
- vite.config.js: Vite configuration
- build.sh: Frontend build script
Deployment:
- nginx.conf: Nginx reverse proxy configuration
- supervisor.conf or pm2.config.js: Process manager configuration
- deploy.sh: Deployment automation script
