Project Context
Purpose
Tool_OCR is a web-based application for batch image-to-text conversion with multi-language support and rule-based output formatting. The tool uses a modern frontend-backend separation architecture, designed to process multiple images/PDFs simultaneously, extract text using OCR, and export results in various formats according to user-defined rules.
Key Goals:
- Batch processing of images and PDF files for text extraction via web interface
- Multi-language OCR support (Chinese, English, and other languages)
- Rule-based output formatting and organization
- User-friendly web interface accessible via browser
- Export flexibility (TXT, JSON, Excel, etc.)
- RESTful API for OCR processing
Tech Stack
Development Environment
- OS Platform: WSL2 Ubuntu 24.04
- Python Version: 3.12
- Environment Manager: Python venv
- Virtual Environment Path: ./venv
- Node.js: 24.x LTS (via nvm)
- IDE Recommended: VS Code with Python + React extensions
Backend Technologies
- Language: Python 3.10+
- Web Framework: FastAPI (modern, async, auto API docs)
- OCR Engine: PaddleOCR 3.0+ with PaddleOCR-VL (deep learning-based, excellent multi-language support)
- Deep Learning Framework: PaddlePaddle 3.2.1+ (GPU/CPU support, CUDA 11.8/12.3/12.6+)
- Structure Analysis: PP-StructureV3 (layout analysis, table recognition, formula extraction, chart recognition)
- PDF Processing: PyPDF2 / pdf2image
- Image Processing: Pillow (PIL), OpenCV
- Data Export: pandas (Excel), json (JSON)
- Database: MySQL (configuration storage, task history)
- Cache: Redis (optional, for task queue)
- Authentication: JWT
Frontend Technologies
- Framework: React 18+
- Build Tool: Vite
- UI Library: Tailwind CSS + shadcn/ui
- State Management: React Query (for API calls) + Zustand (for global state)
- HTTP Client: Axios
- File Upload: react-dropzone
Development Tools
- Package Manager: Conda + pip (backend), npm/pnpm (frontend)
- Deployment: 1Panel (web-based server management)
- Process Manager: systemd / PM2 / Supervisor
- Web Server: Nginx (reverse proxy)
- Testing: pytest (backend), Vitest (frontend)
- Code Style: Black + pylint (Python), ESLint + Prettier (JavaScript/TypeScript)
- Version Control: Git
Key Libraries (Backend)
- fastapi: Web framework
- uvicorn: ASGI server
- paddleocr: OCR processing
- paddlepaddle: Deep learning framework (GPU/CPU)
- paddlex[ocr]: PP-StructureV3 for layout analysis and chart recognition
- pdf2image: PDF to image conversion
- pillow: Image manipulation
- opencv-python: Advanced image processing
- pandas: Data export to Excel
- pyyaml: Configuration management
- python-jose: JWT authentication
- sqlalchemy: Database ORM
- pydantic: Data validation
Key Libraries (Frontend)
- react: UI framework
- vite: Build tool
- tailwindcss: CSS framework
- shadcn/ui: UI components
- axios: HTTP client
- react-query: Server state management
- zustand: Client state management
- react-dropzone: File upload
Project Conventions
Environment Setup (Backend)
# Run automated setup script (recommended)
./setup_dev_env.sh
# Or manually:
# Create Python virtual environment
python3 -m venv venv
# Activate environment
source venv/bin/activate
# Install dependencies
pip install -r requirements.txt
Environment Setup (Frontend)
# Navigate to frontend directory
cd frontend
# Install dependencies
npm install
# Run dev server
npm run dev
Code Style
Backend (Python)
- Formatter: Black with line length 100
- Naming Conventions:
- Classes: PascalCase (e.g., OcrProcessor, ImageService)
- Functions/Methods: snake_case (e.g., process_image, export_results)
- Constants: UPPER_SNAKE_CASE (e.g., MAX_BATCH_SIZE, DEFAULT_LANG)
- Private members: prefix with underscore (e.g., _internal_method)
- Docstrings: Google style for all public functions and classes
- Type Hints: Use type hints for function signatures (FastAPI requirement)
- Imports: Organized by standard library, third-party, local (separated by blank lines)
- Encoding: UTF-8 for all Python files
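As a rough illustration of these conventions, the sketch below puts the naming rules, Google-style docstrings, type hints, and import grouping together in one small module. The names reuse the examples from the list; the constant's value and the `_fake_recognize` helper are hypothetical placeholders, not project code.

```python
"""Illustrative module following the backend style conventions."""

# Standard library imports come first; third-party and local groups would follow,
# each separated by a blank line.
from dataclasses import dataclass
from pathlib import Path

MAX_BATCH_SIZE = 50  # UPPER_SNAKE_CASE constant (hypothetical value)


@dataclass
class OcrResult:
    """Recognized text for a single file.

    Attributes:
        file_name: Name of the processed file.
        text: Extracted text.
    """

    file_name: str
    text: str


class OcrProcessor:
    """Toy processor used only to demonstrate naming and docstring style."""

    def process_image(self, image_path: Path) -> OcrResult:
        """Return a placeholder result for one image.

        Args:
            image_path: Path to the image file.

        Returns:
            An OcrResult whose text comes from a stand-in helper.
        """
        return OcrResult(file_name=image_path.name, text=self._fake_recognize(image_path))

    def _fake_recognize(self, image_path: Path) -> str:
        """Stand-in for a real OCR call (private helper, underscore prefix)."""
        return f"<text of {image_path.stem}>"
```

Every line stays under the 100-character limit that Black is configured with above.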
Frontend (JavaScript/TypeScript)
- Formatter: Prettier
- Naming Conventions:
- Components: PascalCase (e.g., ImageUpload, ResultsTable)
- Functions/Variables: camelCase (e.g., processImage, ocrResults)
- Constants: UPPER_SNAKE_CASE (e.g., MAX_FILE_SIZE, API_BASE_URL)
- CSS Classes: kebab-case (Tailwind convention)
- File Structure: One component per file
- Imports: Group by external, internal, types
Architecture Patterns
Backend Architecture
- Layered Architecture:
- Router Layer (FastAPI routes)
- Service Layer (business logic)
- Data Access Layer (database/file operations)
- Model Layer (Pydantic models)
- Async/Await: Use async operations for I/O bound tasks
- Dependency Injection: FastAPI's dependency injection for services
- Error Handling: Custom exception handlers with proper HTTP status codes
- Logging: Structured logging with log levels
- Background Tasks: FastAPI BackgroundTasks for long-running OCR jobs
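The async and background-task points above can be sketched with the standard library alone. This is not FastAPI's BackgroundTasks API: it swaps in a plain `asyncio` task so the example runs without dependencies. The in-memory `_jobs` registry and the `run_ocr`, `submit_job`, and `demo` names are assumptions made for illustration.

```python
import asyncio
import uuid

# In-memory job registry; a real deployment would persist this (e.g. in the database).
_jobs: dict[str, dict] = {}


async def run_ocr(file_names: list[str]) -> list[str]:
    """Service-layer stand-in: pretend to OCR each file without blocking the event loop."""
    results = []
    for name in file_names:
        await asyncio.sleep(0)  # placeholder for real I/O-bound or offloaded work
        results.append(f"text from {name}")
    return results


async def _worker(job_id: str, file_names: list[str]) -> None:
    """Run the job and record its outcome so a status endpoint could report it."""
    try:
        _jobs[job_id]["result"] = await run_ocr(file_names)
        _jobs[job_id]["status"] = "done"
    except Exception as exc:  # keep failures visible as job state
        _jobs[job_id]["status"] = "failed"
        _jobs[job_id]["error"] = str(exc)


def submit_job(file_names: list[str]) -> tuple[str, asyncio.Task]:
    """Router-layer stand-in: register the job, schedule it, and return immediately."""
    job_id = uuid.uuid4().hex
    _jobs[job_id] = {"status": "running"}
    task = asyncio.create_task(_worker(job_id, file_names))
    return job_id, task


async def demo() -> dict:
    job_id, task = submit_job(["a.png", "b.pdf"])
    await task  # a real API would return job_id and let the client poll instead
    return _jobs[job_id]
```

Calling `asyncio.run(demo())` returns the final job record. Returning a job ID right away is also what lets the UI show progress while a long OCR batch is still running.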
Frontend Architecture
- Component-Based: Reusable React components
- Atomic Design: atoms → molecules → organisms → templates → pages
- API Layer: Centralized API client with React Query
- State Management: Server state (React Query) + Client state (Zustand)
- Routing: React Router for SPA navigation
- Error Boundaries: Graceful error handling in UI
API Design
- RESTful: Follow REST conventions
- Versioning: API versioned as
/api/v1/... - Documentation: Auto-generated via FastAPI (Swagger/OpenAPI)
- Response Format: Consistent JSON structure
{ "success": true, "data": {}, "message": "Success", "timestamp": "2025-01-01T00:00:00Z" }
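One way to keep this envelope consistent is to build every response through a single helper. The sketch below uses a standard-library dataclass so it runs as-is; in the actual backend a Pydantic model would be the natural fit. The `ApiResponse`, `ok`, and `fail` names are hypothetical.

```python
from dataclasses import asdict, dataclass, field
from datetime import datetime, timezone
from typing import Any


def _utc_now() -> str:
    """Current UTC time in the same shape as the example timestamp above."""
    return datetime.now(timezone.utc).strftime("%Y-%m-%dT%H:%M:%SZ")


@dataclass
class ApiResponse:
    """Envelope with the four fields from the documented response format."""

    success: bool
    data: Any = field(default_factory=dict)
    message: str = "Success"
    timestamp: str = field(default_factory=_utc_now)

    def to_dict(self) -> dict:
        return asdict(self)


def ok(data: Any) -> dict:
    """Build a success envelope around a payload."""
    return ApiResponse(success=True, data=data).to_dict()


def fail(message: str) -> dict:
    """Build an error envelope with an empty data field."""
    return ApiResponse(success=False, message=message).to_dict()
```

For example, `ok({"task_id": 1})` yields a dict with the keys `success`, `data`, `message`, and `timestamp` in that order, ready to be serialized as JSON.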
Testing Strategy
Backend Testing
- Unit Tests: Test services, utilities, data models
- Integration Tests: Test API endpoints end-to-end
- Test Framework: pytest with pytest-asyncio
- Coverage Target: Minimum 70% code coverage
- Test Command: pytest tests/ -v --cov=app
Frontend Testing
- Component Tests: Test React components with Vitest + React Testing Library
- Integration Tests: Test user workflows
- E2E Tests: Optional with Playwright
- Test Command: npm run test
Git Workflow
- Branching: Feature branches from main (e.g., feature/add-pdf-support)
- Commits: Conventional Commits format (e.g., feat:, fix:, docs:)
- PRs: Require passing tests before merge
- Versioning: Semantic versioning (MAJOR.MINOR.PATCH)
Domain Context
OCR Concepts
- Recognition Accuracy: Depends on image quality, language, and font type
- Preprocessing: Image enhancement (contrast, denoising) can improve OCR accuracy
- Multi-Language: PaddleOCR supports Chinese, English, Japanese, Korean, and many others
- Bounding Boxes: OCR engines detect text regions before recognition
- Confidence Scores: Each recognized text has a confidence score (0-1)
Document Structure Analysis (PP-StructureV3)
- Layout Analysis: Automatic detection of document regions (text, images, tables, charts, formulas)
- Table Recognition: Extract table structure and content with support for nested formulas and images
- Formula Recognition: Convert mathematical formulas to LaTeX format
- Chart Recognition (✅ Enabled with PaddlePaddle 3.2.1+):
- Chart Type Detection: Identify bar charts, line charts, pie charts, scatter plots, etc.
- Data Extraction: Extract numerical data points from chart visualizations
- Axis & Legend Parsing: Recognize axis labels, tick values, and legend information
- Structured Output: Convert chart content to JSON or tabular format
- Performance: GPU acceleration recommended for best results (2-10 seconds per chart)
- Accuracy: >85% for simple charts, >70% for complex multi-axis charts
- Image Extraction: Preserve and save embedded images from documents
Use Cases
- Digitizing scanned documents and images via web upload
- Extracting text from screenshots for archival
- Processing receipts and invoices for data entry
- Converting image-based PDFs to searchable text
- Batch processing multiple files via drag-and-drop interface
Output Rules
- Users can define custom rules for organizing extracted text
- Examples: group by file name pattern, filter by confidence threshold, format as structured data
- Export formats: plain text files, JSON with metadata, Excel spreadsheets
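As a minimal sketch of how two of the example rules might compose, the code below filters recognized lines by a confidence threshold and then groups the survivors by file name pattern. The `TextLine` record, the function names, and the glob-style patterns are assumptions for illustration; the real rule format is not specified here.

```python
from collections import defaultdict
from dataclasses import dataclass
from fnmatch import fnmatchcase


@dataclass
class TextLine:
    file_name: str
    text: str
    confidence: float  # 0-1, as reported by the OCR engine


def filter_by_confidence(lines: list[TextLine], threshold: float) -> list[TextLine]:
    """Keep only lines whose confidence is at or above the threshold."""
    return [ln for ln in lines if ln.confidence >= threshold]


def group_by_pattern(lines: list[TextLine], patterns: list[str]) -> dict[str, list[str]]:
    """Group text under the first glob pattern matching the file name, else 'other'."""
    groups: dict[str, list[str]] = defaultdict(list)
    for ln in lines:
        key = next((p for p in patterns if fnmatchcase(ln.file_name, p)), "other")
        groups[key].append(ln.text)
    return dict(groups)


lines = [
    TextLine("invoice_001.png", "Total: 42", 0.97),
    TextLine("invoice_002.png", "???", 0.31),
    TextLine("receipt_a.jpg", "Cash", 0.88),
]
kept = filter_by_confidence(lines, 0.8)
grouped = group_by_pattern(kept, ["invoice_*", "receipt_*"])
```

The resulting `grouped` dict can go straight to `json.dumps` for the JSON export, or be flattened into rows for the Excel export.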
Important Constraints
Technical Constraints
- Platform: Windows 10/11 (development), Docker-based deployment
- Web Application: Browser-based interface (Chrome, Firefox, Edge)
- Local Processing: All OCR processing happens on backend server (no cloud dependencies)
- Resource Intensive: OCR is CPU/GPU intensive; consider task queue for batch processing
- File Size Limits: Set max upload size (e.g., 20MB per file, 100MB per batch)
- Language Models: PaddleOCR models must be downloaded (~100MB+ per language)
- Conda Environment: Backend development must be done within Conda virtual environment
- Port Range: Web services must use ports 12010-12019
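The file size limits above could be enforced before any OCR work starts. The sketch below uses the example figures from the list (20MB per file, 100MB per batch); the constant and function names are hypothetical, and the error strings are placeholders rather than the project's actual messages.

```python
MB = 1024 * 1024
MAX_FILE_SIZE = 20 * MB    # example per-file limit from the constraints above
MAX_BATCH_BYTES = 100 * MB  # example per-batch limit from the constraints above


def check_batch(sizes: dict[str, int]) -> list[str]:
    """Return a list of limit violations for a batch of {file name: size in bytes}."""
    errors = [
        f"{name}: exceeds per-file limit"
        for name, size in sizes.items()
        if size > MAX_FILE_SIZE
    ]
    if sum(sizes.values()) > MAX_BATCH_BYTES:
        errors.append("batch: exceeds total limit")
    return errors
```

An empty list means the batch is acceptable. Collecting every violation instead of stopping at the first lets the UI report all problems in one response.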
User Experience Constraints
- Target Users: Non-technical users who need simple batch OCR via web
- Browser Compatibility: Modern browsers (Chrome 90+, Firefox 88+, Edge 90+)
- Performance: UI must show progress feedback during OCR processing
- Error Messages: Clear, actionable error messages in Traditional Chinese
- Responsive Design: UI should work on desktop and tablet (mobile optional)
Business Constraints
- Open Source: Use only open-source libraries (no paid API dependencies)
- Deployment: 1Panel-based deployment (no Docker required)
- Offline Capable: Must work without internet after initial setup (except model downloads)
- Authentication: JWT-based auth (optional LDAP integration for enterprise)
Security Constraints
- File Upload: Validate file types, scan for malware (optional)
- Authentication: JWT tokens with expiration
- CORS: Configure CORS for frontend-backend communication
- Input Validation: Strict validation on all API inputs
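File type validation is more robust when it inspects the file's leading bytes rather than trusting the extension. The sketch below checks the well-known PNG, JPEG, and PDF signatures; the function name and the choice of exactly these three formats are assumptions, not a statement of what the project supports.

```python
from typing import Optional

# Leading byte signatures ("magic numbers") for a few common upload formats.
_SIGNATURES = {
    "png": b"\x89PNG\r\n\x1a\n",
    "jpeg": b"\xff\xd8\xff",
    "pdf": b"%PDF-",
}


def sniff_file_type(head: bytes) -> Optional[str]:
    """Return the detected type for the first bytes of a file, or None if unknown."""
    for kind, signature in _SIGNATURES.items():
        if head.startswith(signature):
            return kind
    return None
```

Reading only the first 8 bytes of an upload is enough for these signatures, so the check can run before the whole file is stored.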
External Dependencies
Database Configuration
- MySQL Host: mysql.theaken.com
- MySQL Port: 33306
- MySQL User: A060
- MySQL Password: WLeSCi0yhtc7
- MySQL Database: db_A060
- MySQL Charset: utf8mb4
SMTP Configuration (Optional)
- SMTP Server: mail.panjit.com.tw
- SMTP Port: 25
- SMTP TLS: false
- SMTP Auth: false
- Sender Email: tool-ocr-system@panjit.com.tw
LDAP Configuration (Optional)
- LDAP Server: panjit.com.tw
- LDAP Port: 389
Conda Environment
- Environment Name: tool_ocr
- Python Version: 3.10
- Base Path: C:\Users\lin46\.conda\envs\tool_ocr
- Activation: Always activate environment before backend development
OCR Models
- PaddleOCR Models: Downloaded automatically on first run or manually installed
- Model Storage: Local cache directory or Docker volume
- Supported Languages: Chinese (simplified/traditional), English, Japanese, Korean, etc.
- Model Size: ~100-200MB per language pack
System Requirements
- Python: 3.10+ (managed by Conda or venv)
- Node.js: 18+ (for frontend development and build)
- RAM: Minimum 4GB (8GB recommended for batch processing, 16GB+ for GPU usage)
- Disk Space: ~2GB for application + models + dependencies
- OS: Windows 10/11 (development), WSL2 Ubuntu 24.04 (development), Linux (1Panel deployment server)
- GPU (Optional but recommended):
- NVIDIA GPU with CUDA 11.8, 12.3, or 12.6+ support
- GPU Memory: Minimum 4GB (8GB+ recommended for chart recognition)
- WSL2 GPU: NVIDIA CUDA drivers installed for WSL
- Performance: 3-10x speedup for OCR and chart recognition
- Web Server: Nginx (for static files and reverse proxy)
- Process Manager: Supervisor / PM2 / systemd (for backend service)
Port Configuration
- Backend API: 12010 (FastAPI via uvicorn)
- Frontend Dev Server: 12011 (Vite, development only)
- Nginx: 80/443 (production, managed by 1Panel)
- MySQL: 33306 (external)
- Redis: 6379 (optional, local)
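A small startup check can guard the 12010-12019 port constraint for the project's own web services. The sketch below assumes the constraint covers only the backend API and the frontend dev server, not Nginx, MySQL, or Redis; the names are hypothetical.

```python
# Allowed range for the project's own web services (12010-12019 inclusive).
WEB_PORT_RANGE = range(12010, 12020)

# Ports taken from the list above; only the services assumed to fall under the constraint.
SERVICE_PORTS = {"backend_api": 12010, "frontend_dev": 12011}


def out_of_range(ports: dict[str, int]) -> dict[str, int]:
    """Return the services whose port falls outside the allowed range."""
    return {name: port for name, port in ports.items() if port not in WEB_PORT_RANGE}
```

An empty result from `out_of_range(SERVICE_PORTS)` means the configuration respects the constraint.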
Deployment Architecture (1Panel)
- Development: Windows with Conda + local Node.js
- Production: Linux server managed by 1Panel
- Backend Deployment:
- Conda environment on production server
- uvicorn runs FastAPI on port 12010
- Managed by Supervisor/PM2/systemd for auto-restart
- Frontend Deployment:
- Build static files with npm run build
- Served by Nginx (configured via 1Panel)
- Nginx reverse proxies /api to backend (12010)
- 1Panel Features:
- Website management (Nginx configuration)
- Process management (backend service)
- SSL certificate management (Let's Encrypt)
- File management and deployment
Configuration Files
- Backend:
- environment.yml: Conda environment specification
- requirements.txt: Pip dependencies
- .env: Environment variables (database, JWT secret, etc.)
- config.yaml: Application configuration
- start.sh: Backend startup script
- Frontend:
- package.json: npm dependencies
- .env.production: Production environment variables (API URL)
- vite.config.js: Vite configuration
- build.sh: Frontend build script
- Deployment:
- nginx.conf: Nginx reverse proxy configuration
- supervisor.conf or pm2.config.js: Process manager configuration
- deploy.sh: Deployment automation script