OCR/openspec/project.md
egg 3f41a33877 docs: update documentation for chart recognition enablement
Updates all project documentation to reflect that chart recognition
is now fully enabled with PaddlePaddle 3.2.1+.

Changes:
- README.md: Remove Known Limitations section about chart recognition,
  update tech stack and prerequisites to include PaddlePaddle 3.2.1+,
  add WSL CUDA configuration notes
- openspec/project.md: Add comprehensive chart recognition feature
  descriptions, update system requirements for GPU/CUDA support
- openspec/changes/add-gpu-acceleration-support/tasks.md: Mark task
  5.4 as completed with resolution details
- openspec/changes/add-gpu-acceleration-support/proposal.md: Update
  Known Issues section to show chart recognition is now resolved
- setup_dev_env.sh: Upgrade PaddlePaddle from 3.0.0 to 3.2.1+, add
  WSL CUDA library path configuration, add chart recognition API
  verification

All documentation now accurately reflects:
- Chart recognition fully enabled
- PaddlePaddle 3.2.1+ with fused_rms_norm_ext API
- WSL CUDA path auto-configuration
- Comprehensive PP-StructureV3 capabilities

🤖 Generated with [Claude Code](https://claude.com/claude-code)

Co-Authored-By: Claude <noreply@anthropic.com>
2025-11-16 19:04:30 +08:00

Project Context

Purpose

Tool_OCR is a web-based application for batch image-to-text conversion with multi-language support and rule-based output formatting. It uses a decoupled frontend/backend architecture to process multiple images and PDFs at once, extract text with OCR, and export the results in several formats according to user-defined rules.

Key Goals:

  • Batch processing of images and PDF files for text extraction via web interface
  • Multi-language OCR support (Chinese, English, and other languages)
  • Rule-based output formatting and organization
  • User-friendly web interface accessible via browser
  • Export flexibility (TXT, JSON, Excel, etc.)
  • RESTful API for OCR processing

Tech Stack

Development Environment

  • OS Platform: WSL2 Ubuntu 24.04
  • Python Version: 3.12
  • Environment Manager: Python venv
  • Virtual Environment Path: ./venv
  • Node.js: 24.x LTS (via nvm)
  • IDE Recommended: VS Code with Python + React extensions

Backend Technologies

  • Language: Python 3.10+
  • Web Framework: FastAPI (modern, async, auto API docs)
  • OCR Engine: PaddleOCR 3.0+ with PaddleOCR-VL (deep learning-based, excellent multi-language support)
  • Deep Learning Framework: PaddlePaddle 3.2.1+ (GPU/CPU support, CUDA 11.8/12.3/12.6+)
  • Structure Analysis: PP-StructureV3 (layout analysis, table recognition, formula extraction, chart recognition)
  • PDF Processing: PyPDF2 / pdf2image
  • Image Processing: Pillow (PIL), OpenCV
  • Data Export: pandas (Excel), json (JSON)
  • Database: MySQL (configuration storage, task history)
  • Cache: Redis (optional, for task queue)
  • Authentication: JWT

Frontend Technologies

  • Framework: React 18+
  • Build Tool: Vite
  • UI Library: Tailwind CSS + shadcn/ui
  • State Management: React Query (for API calls) + Zustand (for global state)
  • HTTP Client: Axios
  • File Upload: react-dropzone

Development Tools

  • Package Manager: Conda + pip (backend), npm/pnpm (frontend)
  • Deployment: 1Panel (web-based server management)
  • Process Manager: systemd / PM2 / Supervisor
  • Web Server: Nginx (reverse proxy)
  • Testing: pytest (backend), Vitest (frontend)
  • Code Style: Black + pylint (Python), ESLint + Prettier (JavaScript/TypeScript)
  • Version Control: Git

Key Libraries (Backend)

  • fastapi: Web framework
  • uvicorn: ASGI server
  • paddleocr: OCR processing
  • paddlepaddle: Deep learning framework (GPU/CPU)
  • paddlex[ocr]: PP-StructureV3 for layout analysis and chart recognition
  • pdf2image: PDF to image conversion
  • pillow: Image manipulation
  • opencv-python: Advanced image processing
  • pandas: Data export to Excel
  • pyyaml: Configuration management
  • python-jose: JWT authentication
  • sqlalchemy: Database ORM
  • pydantic: Data validation

Key Libraries (Frontend)

  • react: UI framework
  • vite: Build tool
  • tailwindcss: CSS framework
  • shadcn/ui: UI components
  • axios: HTTP client
  • react-query: Server state management
  • zustand: Client state management
  • react-dropzone: File upload

Project Conventions

Environment Setup (Backend)

# Run automated setup script (recommended)
./setup_dev_env.sh

# Or manually:
# Create Python virtual environment
python3 -m venv venv

# Activate environment
source venv/bin/activate

# Install dependencies
pip install -r requirements.txt

Environment Setup (Frontend)

# Navigate to frontend directory
cd frontend

# Install dependencies
npm install

# Run dev server
npm run dev

Code Style

Backend (Python)

  • Formatter: Black with line length 100
  • Naming Conventions:
    • Classes: PascalCase (e.g., OcrProcessor, ImageService)
    • Functions/Methods: snake_case (e.g., process_image, export_results)
    • Constants: UPPER_SNAKE_CASE (e.g., MAX_BATCH_SIZE, DEFAULT_LANG)
    • Private members: prefix with underscore (e.g., _internal_method)
  • Docstrings: Google style for all public functions and classes
  • Type Hints: Use type hints on all function signatures (FastAPI relies on them for request validation and API docs)
  • Imports: Organized by standard library, third-party, local (separated by blank lines)
  • Encoding: UTF-8 for all Python files
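
The backend conventions above can be illustrated with a small module. This is a sketch only: the names OcrProcessor, process_image, MAX_BATCH_SIZE, and DEFAULT_LANG come from the examples in the list, while OcrResult, the constant values, and the placeholder logic are hypothetical and perform no real OCR.

```python
"""Illustrative module layout following the backend style rules."""

from dataclasses import dataclass
from pathlib import Path

MAX_BATCH_SIZE = 50  # hypothetical value, for illustration only
DEFAULT_LANG = "ch"  # hypothetical default language code


@dataclass
class OcrResult:
    """Recognized text for one image (hypothetical structure).

    Attributes:
        source: Path of the processed image.
        text: Recognized text.
        lang: Language code used for recognition.
    """

    source: Path
    text: str
    lang: str


class OcrProcessor:
    """Processes images one at a time (stub; no real OCR is performed)."""

    def __init__(self, lang: str = DEFAULT_LANG) -> None:
        self.lang = lang

    def process_image(self, image_path: Path) -> OcrResult:
        """Return a placeholder result for a single image.

        Args:
            image_path: Location of the image to process.

        Returns:
            An OcrResult whose text is a normalized placeholder string.
        """
        raw_text = f"  text from {image_path.name}  "
        return OcrResult(source=image_path, text=self._normalize(raw_text), lang=self.lang)

    def _normalize(self, text: str) -> str:
        """Strip surrounding whitespace (private helper, underscore prefix)."""
        return text.strip()
```

Classes use PascalCase, functions use snake_case, constants use UPPER_SNAKE_CASE, and the private helper carries a leading underscore. Docstrings follow the Google layout with Args and Returns sections.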

Frontend (JavaScript/TypeScript)

  • Formatter: Prettier
  • Naming Conventions:
    • Components: PascalCase (e.g., ImageUpload, ResultsTable)
    • Functions/Variables: camelCase (e.g., processImage, ocrResults)
    • Constants: UPPER_SNAKE_CASE (e.g., MAX_FILE_SIZE, API_BASE_URL)
    • CSS Classes: kebab-case (Tailwind convention)
  • File Structure: One component per file
  • Imports: Group by external, internal, types

Architecture Patterns

Backend Architecture

  • Layered Architecture:
    • Router Layer (FastAPI routes)
    • Service Layer (business logic)
    • Data Access Layer (database/file operations)
    • Model Layer (Pydantic models)
  • Async/Await: Use async operations for I/O bound tasks
  • Dependency Injection: FastAPI's dependency injection for services
  • Error Handling: Custom exception handlers with proper HTTP status codes
  • Logging: Structured logging with log levels
  • Background Tasks: FastAPI BackgroundTasks for long-running OCR jobs

Frontend Architecture

  • Component-Based: Reusable React components
  • Atomic Design: atoms → molecules → organisms → templates → pages
  • API Layer: Centralized API client with React Query
  • State Management: Server state (React Query) + Client state (Zustand)
  • Routing: React Router for SPA navigation
  • Error Boundaries: Graceful error handling in UI

API Design

  • RESTful: Follow REST conventions
  • Versioning: API versioned as /api/v1/...
  • Documentation: Auto-generated via FastAPI (Swagger/OpenAPI)
  • Response Format: Consistent JSON structure
    {
      "success": true,
      "data": {},
      "message": "Success",
      "timestamp": "2025-01-01T00:00:00Z"
    }
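
One way to keep every endpoint on this envelope is a small helper that builds it. The sketch below uses only the standard library; the function name make_response is hypothetical, and the real project may build the envelope differently (for example with a Pydantic response model).

```python
from datetime import datetime, timezone
from typing import Any


def make_response(
    data: Any = None, message: str = "Success", success: bool = True
) -> dict[str, Any]:
    """Wrap a payload in the consistent JSON envelope."""
    # UTC timestamp in the same ISO 8601 form as the documented example.
    timestamp = datetime.now(timezone.utc).strftime("%Y-%m-%dT%H:%M:%SZ")
    return {
        "success": success,
        "data": {} if data is None else data,
        "message": message,
        "timestamp": timestamp,
    }
```

An error response would reuse the same shape, for example make_response(message="File too large", success=False).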
    

Testing Strategy

Backend Testing

  • Unit Tests: Test services, utilities, data models
  • Integration Tests: Test API endpoints end-to-end
  • Test Framework: pytest with pytest-asyncio
  • Coverage Target: Minimum 70% code coverage
  • Test Command: pytest tests/ -v --cov=app

Frontend Testing

  • Component Tests: Test React components with Vitest + React Testing Library
  • Integration Tests: Test user workflows
  • E2E Tests: Optional with Playwright
  • Test Command: npm run test

Git Workflow

  • Branching: Feature branches from main (e.g., feature/add-pdf-support)
  • Commits: Conventional Commits format (e.g., feat:, fix:, docs:)
  • PRs: Require passing tests before merge
  • Versioning: Semantic versioning (MAJOR.MINOR.PATCH)

Domain Context

OCR Concepts

  • Recognition Accuracy: Depends on image quality, language, and font type
  • Preprocessing: Image enhancement (contrast, denoising) can improve OCR accuracy
  • Multi-Language: PaddleOCR supports Chinese, English, Japanese, Korean, and many others
  • Bounding Boxes: OCR engines detect text regions before recognition
  • Confidence Scores: Each recognized text segment has a confidence score between 0 and 1

Document Structure Analysis (PP-StructureV3)

  • Layout Analysis: Automatic detection of document regions (text, images, tables, charts, formulas)
  • Table Recognition: Extract table structure and content with support for nested formulas and images
  • Formula Recognition: Convert mathematical formulas to LaTeX format
  • Chart Recognition (enabled with PaddlePaddle 3.2.1+):
    • Chart Type Detection: Identify bar charts, line charts, pie charts, scatter plots, etc.
    • Data Extraction: Extract numerical data points from chart visualizations
    • Axis & Legend Parsing: Recognize axis labels, tick values, and legend information
    • Structured Output: Convert chart content to JSON or tabular format
    • Performance: GPU acceleration recommended for best results (2-10 seconds per chart)
    • Accuracy: >85% for simple charts, >70% for complex multi-axis charts
  • Image Extraction: Preserve and save embedded images from documents

Use Cases

  • Digitizing scanned documents and images via web upload
  • Extracting text from screenshots for archival
  • Processing receipts and invoices for data entry
  • Converting image-based PDFs to searchable text
  • Batch processing multiple files via drag-and-drop interface

Output Rules

  • Users can define custom rules for organizing extracted text
  • Examples: group by file name pattern, filter by confidence threshold, format as structured data
  • Export formats: plain text files, JSON with metadata, Excel spreadsheets
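
Two of the example rules above, filtering by confidence threshold and grouping by file name pattern, could be sketched as follows. The TextLine structure and both function names are hypothetical; the document does not define the actual rule format.

```python
import re
from collections import defaultdict
from dataclasses import dataclass


@dataclass
class TextLine:
    """One recognized text segment (hypothetical structure)."""

    file_name: str
    text: str
    confidence: float  # between 0 and 1


def filter_by_confidence(lines: list[TextLine], threshold: float) -> list[TextLine]:
    """Keep only segments whose confidence is at least the threshold."""
    return [line for line in lines if line.confidence >= threshold]


def group_by_file_pattern(lines: list[TextLine], pattern: str) -> dict[str, list[str]]:
    """Group texts by the first capture group of a file name pattern.

    Files whose name does not match the pattern are collected under "other".
    """
    regex = re.compile(pattern)
    groups: dict[str, list[str]] = defaultdict(list)
    for line in lines:
        match = regex.match(line.file_name)
        key = match.group(1) if match else "other"
        groups[key].append(line.text)
    return dict(groups)


lines = [
    TextLine("invoice_001.png", "Total: 42", 0.97),
    TextLine("invoice_002.png", "???", 0.31),
    TextLine("receipt_001.png", "Paid", 0.88),
]
kept = filter_by_confidence(lines, threshold=0.8)
grouped = group_by_file_pattern(kept, r"([a-z]+)_\d+")
# grouped == {"invoice": ["Total: 42"], "receipt": ["Paid"]}
```

The grouped dictionary maps directly onto the JSON export format, and each group could equally become one sheet in an Excel export.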

Important Constraints

Technical Constraints

  • Platform: Windows 10/11 or WSL2 Ubuntu 24.04 (development), Linux server managed by 1Panel (deployment)
  • Web Application: Browser-based interface (Chrome, Firefox, Edge)
  • Local Processing: All OCR processing happens on backend server (no cloud dependencies)
  • Resource Intensive: OCR is CPU/GPU intensive; consider task queue for batch processing
  • File Size Limits: Set max upload size (e.g., 20MB per file, 100MB per batch)
  • Language Models: PaddleOCR models must be downloaded (~100MB+ per language)
  • Virtual Environment: Backend development must be done inside an isolated Python environment (Conda or venv)
  • Port Range: Web services must use ports 12010-12019
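
The file size limits above could be enforced before any OCR work starts. The sketch below uses the example figures from the list (20MB per file, 100MB per batch); the allow-list of extensions and the function name validate_batch are assumptions, not settings taken from the project.

```python
from pathlib import PurePath

MAX_FILE_SIZE = 20 * 1024 * 1024  # 20MB per file (example limit)
MAX_BATCH_BYTES = 100 * 1024 * 1024  # 100MB per batch (example limit)
ALLOWED_EXTENSIONS = {".png", ".jpg", ".jpeg", ".pdf"}  # hypothetical allow-list


def validate_batch(files: dict[str, int]) -> list[str]:
    """Check an upload batch given as {file name: size in bytes}.

    Returns a list of human-readable problems; an empty list means the batch is accepted.
    """
    problems: list[str] = []
    for name, size in files.items():
        if PurePath(name).suffix.lower() not in ALLOWED_EXTENSIONS:
            problems.append(f"{name}: unsupported file type")
        if size > MAX_FILE_SIZE:
            problems.append(f"{name}: exceeds the per-file limit")
    if sum(files.values()) > MAX_BATCH_BYTES:
        problems.append("batch exceeds the total size limit")
    return problems
```

Returning every problem at once, instead of raising on the first one, lets the UI show a complete and actionable error list for the whole batch.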

User Experience Constraints

  • Target Users: Non-technical users who need simple batch OCR via web
  • Browser Compatibility: Modern browsers (Chrome 90+, Firefox 88+, Edge 90+)
  • Performance: UI must show progress feedback during OCR processing
  • Error Messages: Clear, actionable error messages in Traditional Chinese
  • Responsive Design: UI should work on desktop and tablet (mobile optional)

Business Constraints

  • Open Source: Use only open-source libraries (no paid API dependencies)
  • Deployment: 1Panel-based deployment (no Docker required)
  • Offline Capable: Must work without internet after initial setup (except model downloads)
  • Authentication: JWT-based auth (optional LDAP integration for enterprise)

Security Constraints

  • File Upload: Validate file types, scan for malware (optional)
  • Authentication: JWT tokens with expiration
  • CORS: Configure CORS for frontend-backend communication
  • Input Validation: Strict validation on all API inputs

External Dependencies

Database Configuration

  • MySQL Host: mysql.theaken.com
  • MySQL Port: 33306
  • MySQL User: A060
  • MySQL Password: WLeSCi0yhtc7
  • MySQL Database: db_A060
  • MySQL Charset: utf8mb4

SMTP Configuration (Optional)

LDAP Configuration (Optional)

  • LDAP Server: panjit.com.tw
  • LDAP Port: 389

Conda Environment

  • Environment Name: tool_ocr
  • Python Version: 3.10
  • Base Path: C:\Users\lin46\.conda\envs\tool_ocr
  • Activation: Always activate environment before backend development

OCR Models

  • PaddleOCR Models: Downloaded automatically on first run or manually installed
  • Model Storage: Local cache directory or Docker volume
  • Supported Languages: Chinese (simplified/traditional), English, Japanese, Korean, etc.
  • Model Size: ~100-200MB per language pack

System Requirements

  • Python: 3.10+ (managed by Conda or venv)
  • Node.js: 18+ (for frontend development and build)
  • RAM: Minimum 4GB (8GB recommended for batch processing, 16GB+ for GPU usage)
  • Disk Space: ~2GB for application + models + dependencies
  • OS: Windows 10/11 (development), WSL2 Ubuntu 24.04 (development), Linux (1Panel deployment server)
  • GPU (Optional but recommended):
    • NVIDIA GPU with CUDA 11.8, 12.3, or 12.6+ support
    • GPU Memory: Minimum 4GB (8GB+ recommended for chart recognition)
    • WSL2 GPU: NVIDIA CUDA drivers installed for WSL
    • Performance: 3-10x speedup for OCR and chart recognition
  • Web Server: Nginx (for static files and reverse proxy)
  • Process Manager: Supervisor / PM2 / systemd (for backend service)

Port Configuration

  • Backend API: 12010 (FastAPI via uvicorn)
  • Frontend Dev Server: 12011 (Vite, development only)
  • Nginx: 80/443 (production, managed by 1Panel)
  • MySQL: 33306 (external)
  • Redis: 6379 (optional, local)
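
A small startup check can catch port misconfiguration early. The sketch below assumes the 12010-12019 rule applies to the application's own services (backend API and frontend dev server) and not to Nginx, MySQL, or Redis; that reading, and all names in the code, are assumptions.

```python
WEB_PORT_RANGE = range(12010, 12020)  # 12010-12019 inclusive

# Application-owned services from the list above.
APP_PORTS = {"backend_api": 12010, "frontend_dev": 12011}


def check_ports(ports: dict[str, int]) -> list[str]:
    """Return a list of problems: ports outside the allowed range or used twice."""
    problems: list[str] = []
    seen: dict[int, str] = {}
    for name, port in ports.items():
        if port not in WEB_PORT_RANGE:
            problems.append(f"{name}: port {port} is outside 12010-12019")
        if port in seen:
            problems.append(f"{name}: port {port} is already used by {seen[port]}")
        seen.setdefault(port, name)
    return problems
```

Calling check_ports(APP_PORTS) returns an empty list for the configuration documented here.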

Deployment Architecture (1Panel)

  • Development: Windows with Conda + local Node.js
  • Production: Linux server managed by 1Panel
  • Backend Deployment:
    • Conda environment on production server
    • uvicorn runs FastAPI on port 12010
    • Managed by Supervisor/PM2/systemd for auto-restart
  • Frontend Deployment:
    • Build static files with npm run build
    • Served by Nginx (configured via 1Panel)
    • Nginx reverse proxies /api to backend (12010)
  • 1Panel Features:
    • Website management (Nginx configuration)
    • Process management (backend service)
    • SSL certificate management (Let's Encrypt)
    • File management and deployment

Configuration Files

  • Backend:
    • environment.yml: Conda environment specification
    • requirements.txt: Pip dependencies
    • .env: Environment variables (database, JWT secret, etc.)
    • config.yaml: Application configuration
    • start.sh: Backend startup script
  • Frontend:
    • package.json: npm dependencies
    • .env.production: Production environment variables (API URL)
    • vite.config.js: Vite configuration
    • build.sh: Frontend build script
  • Deployment:
    • nginx.conf: Nginx reverse proxy configuration
    • supervisor.conf or pm2.config.js: Process manager configuration
    • deploy.sh: Deployment automation script