egg/OCR

egg ad2b832fb6 feat: complete external auth V2 migration with advanced features

This commit implements comprehensive external Azure AD authentication
with complete task management, file download, and admin monitoring systems.

## Core Features Implemented (80% Complete)

### 1. Token Auto-Refresh Mechanism ✅
- Backend: POST /api/v2/auth/refresh endpoint
- Frontend: Auto-refresh 5 minutes before expiration
- Auto-retry on 401 errors with seamless token refresh
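
The refresh timing described above can be sketched as follows. This is a minimal illustration in Python, not the project's frontend code (which is TypeScript); the function names are hypothetical, and only the 5-minute lead time and the retry-on-401 behavior come from this commit message.

```python
from datetime import datetime, timedelta, timezone

# From the commit message: refresh 5 minutes before the token expires.
REFRESH_LEAD = timedelta(minutes=5)


def seconds_until_refresh(expires_at: datetime, now: datetime) -> float:
    """Return how long to wait before refreshing; 0.0 means refresh now."""
    wait = (expires_at - REFRESH_LEAD - now).total_seconds()
    return max(wait, 0.0)


def should_retry_after_refresh(status_code: int, already_retried: bool) -> bool:
    """Retry a request at most once after a 401 (hypothetical helper).

    Limiting to a single retry is an assumption made here so that a failing
    refresh cannot cause an endless request loop.
    """
    return status_code == 401 and not already_retried
```

A scheduler would call `seconds_until_refresh` after each login or refresh and then POST to `/api/v2/auth/refresh` once the wait has elapsed.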

### 2. File Download System ✅
- Three format support: JSON / Markdown / PDF
- Endpoints: GET /api/v2/tasks/{id}/download/{format}
- File access control with ownership validation
- Frontend download buttons in TaskHistoryPage
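
One way the `{format}` path segment could be validated before a file is served is sketched below. The segment values (`json`, `markdown`, `pdf`), the file-naming scheme, and the function name are assumptions for illustration; only the three output formats are stated in this commit message.

```python
# Hypothetical mapping from the {format} path segment to a media type and
# file extension. The segment names themselves are assumptions.
DOWNLOAD_FORMATS = {
    "json": ("application/json", ".json"),
    "markdown": ("text/markdown", ".md"),
    "pdf": ("application/pdf", ".pdf"),
}


def resolve_download(task_id: str, fmt: str) -> tuple[str, str]:
    """Return (media_type, download_filename) or raise for unknown formats."""
    key = fmt.strip().lower()
    if key not in DOWNLOAD_FORMATS:
        raise ValueError(f"unsupported download format: {fmt!r}")
    media_type, extension = DOWNLOAD_FORMATS[key]
    return media_type, f"task_{task_id}{extension}"
```

Rejecting unknown formats up front keeps the handler from ever building a file path out of unvalidated user input.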

### 3. Complete Task Management ✅
Backend Endpoints:
- POST /api/v2/tasks/{id}/start - Start task
- POST /api/v2/tasks/{id}/cancel - Cancel task
- POST /api/v2/tasks/{id}/retry - Retry failed task
- GET /api/v2/tasks - List with filters (status, filename, date range)
- GET /api/v2/tasks/stats - User statistics

Frontend Features:
- Status-based action buttons (Start/Cancel/Retry)
- Advanced search and filtering (status, filename, date range)
- Pagination and sorting
- Task statistics dashboard (5 stat cards)
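
The status-based buttons imply a small state table that both the backend and the UI can share. The sketch below is an assumption about how that table might look: the status names and the start/cancel rules are hypothetical, and only "retry applies to failed tasks" is taken from the endpoint list above.

```python
from enum import Enum


class TaskStatus(str, Enum):
    # Hypothetical status names; the real model may use different values.
    PENDING = "pending"
    PROCESSING = "processing"
    COMPLETED = "completed"
    FAILED = "failed"
    CANCELLED = "cancelled"


# Which actions are offered for each status (assumed, except retry-on-failed).
ALLOWED_ACTIONS = {
    TaskStatus.PENDING: {"start", "cancel"},
    TaskStatus.PROCESSING: {"cancel"},
    TaskStatus.COMPLETED: set(),
    TaskStatus.FAILED: {"retry"},
    TaskStatus.CANCELLED: set(),
}


def available_actions(status: TaskStatus) -> list[str]:
    """Return the actions valid for a task in the given status, sorted."""
    return sorted(ALLOWED_ACTIONS[status])
```

Keeping the rules in one table means the API can reject an invalid transition with the same logic the frontend uses to hide a button.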

### 4. Admin Monitoring System ✅ (Backend)
Admin APIs:
- GET /api/v2/admin/stats - System statistics
- GET /api/v2/admin/users - User list with stats
- GET /api/v2/admin/users/top - User leaderboard
- GET /api/v2/admin/audit-logs - Audit log query system
- GET /api/v2/admin/audit-logs/user/{id}/summary

Admin Features:
- Email-based admin check (ymirliu@panjit.com.tw)
- Comprehensive system metrics (users, tasks, sessions, activity)
- Audit logging service for security tracking

### 5. User Isolation & Security ✅
- Row-level security on all task queries
- File access control with ownership validation
- Strict user_id filtering on all operations
- Session validation and expiry checking
- Admin privilege verification

## New Files Created

Backend:
- backend/app/models/user_v2.py - User model for external auth
- backend/app/models/task.py - Task model with user isolation
- backend/app/models/session.py - Session management
- backend/app/models/audit_log.py - Audit log model
- backend/app/services/external_auth_service.py - External API client
- backend/app/services/task_service.py - Task CRUD with isolation
- backend/app/services/file_access_service.py - File access control
- backend/app/services/admin_service.py - Admin operations
- backend/app/services/audit_service.py - Audit logging
- backend/app/routers/auth_v2.py - V2 auth endpoints
- backend/app/routers/tasks.py - Task management endpoints
- backend/app/routers/admin.py - Admin endpoints
- backend/alembic/versions/5e75a59fb763_*.py - DB migration

Frontend:
- frontend/src/services/apiV2.ts - Complete V2 API client
- frontend/src/types/apiV2.ts - V2 type definitions
- frontend/src/pages/TaskHistoryPage.tsx - Task history UI

Modified Files:
- backend/app/core/deps.py - Added get_current_admin_user_v2
- backend/app/main.py - Registered admin router
- frontend/src/pages/LoginPage.tsx - V2 login integration
- frontend/src/components/Layout.tsx - User display and logout
- frontend/src/App.tsx - Added /tasks route

## Documentation
- openspec/changes/.../PROGRESS_UPDATE.md - Detailed progress report

## Pending Items (20%)
1. Database migration execution for audit_logs table
2. Frontend admin dashboard page
3. Frontend audit log viewer

## Testing Status
- Manual testing: ✅ Authentication flow verified
- Unit tests: ⏳ Pending
- Integration tests: ⏳ Pending

## Security Enhancements
- ✅ User isolation (row-level security)
- ✅ File access control
- ✅ Token expiry validation
- ✅ Admin privilege verification
- ✅ Audit logging infrastructure
- ⏳ Token encryption (noted, low priority)
- ⏳ Rate limiting (noted, low priority)

🤖 Generated with [Claude Code](https://claude.com/claude-code)

Co-Authored-By: Claude <noreply@anthropic.com>

2025-11-14 17:19:43 +08:00

| Name | Last commit | Date |
| --- | --- | --- |
| .claude | feat: complete external auth V2 migration with advanced features | 2025-11-14 17:19:43 +08:00 |
| backend | feat: complete external auth V2 migration with advanced features | 2025-11-14 17:19:43 +08:00 |
| demo_docs | first | 2025-11-12 22:53:17 +08:00 |
| frontend | feat: complete external auth V2 migration with advanced features | 2025-11-14 17:19:43 +08:00 |
| models | first | 2025-11-12 22:53:17 +08:00 |
| openspec | feat: complete external auth V2 migration with advanced features | 2025-11-14 17:19:43 +08:00 |
| .env | feat: Dockerized deployment - single-container architecture conversion | 2025-11-13 13:12:59 +08:00 |
| .env.example | first | 2025-11-12 22:53:17 +08:00 |
| .gitignore | 2nd | 2025-11-12 22:54:56 +08:00 |
| AGENTS.md | first | 2025-11-12 22:53:17 +08:00 |
| API_REFERENCE.md | fix: resolve 7 frontend-backend API inconsistencies and add comprehensive documentation | 2025-11-13 08:54:37 +08:00 |
| CLAUDE.md | first | 2025-11-12 22:53:17 +08:00 |
| FRONTEND_API.md | fix: resolve 7 frontend-backend API inconsistencies and add comprehensive documentation | 2025-11-13 08:54:37 +08:00 |
| README.md | fix: disable chart recognition due to PaddlePaddle 3.0.0 API limitation | 2025-11-14 13:16:17 +08:00 |
| requirements.txt | fix: update setup script to install PaddlePaddle GPU version from official source | 2025-11-14 09:35:12 +08:00 |
| setup_dev_env.sh | fix: disable chart recognition due to PaddlePaddle 3.0.0 API limitation | 2025-11-14 13:16:17 +08:00 |
| start_backend.sh | feat: migrate to WSL Ubuntu native development environment | 2025-11-13 21:00:42 +08:00 |
| start_frontend.sh | feat: migrate to WSL Ubuntu native development environment | 2025-11-13 21:00:42 +08:00 |

README.md

Tool_OCR

OCR Batch Processing System with Structure Extraction

A web-based solution to extract text, images, and document structure from multiple files efficiently using PaddleOCR-VL.

Features

🔍 Multi-Language OCR: Support for 109 languages (Chinese, English, Japanese, Korean, etc.)
📄 Document Structure Analysis: Intelligent layout analysis with PP-StructureV3
🖼️ Image Extraction: Preserve document images alongside text content
📑 Batch Processing: Process multiple files concurrently with progress tracking
📤 Multiple Export Formats: TXT, JSON, Excel, Markdown with images, searchable PDF
📋 Office Documents: DOC, DOCX, PPT, PPTX support via LibreOffice conversion
🚀 GPU Acceleration: Automatic CUDA GPU detection with graceful CPU fallback
🔧 Flexible Configuration: Rule-based output formatting
🌐 Translation Ready: Reserved architecture for future translation features
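
Concurrent batch processing with progress tracking, as listed above, can be sketched with the standard library. This is an illustrative pattern only: `ocr_one` stands in for the real OCR call, and the function and callback names are hypothetical.

```python
from concurrent.futures import ThreadPoolExecutor, as_completed
from typing import Callable, Optional


def process_batch(
    paths: list[str],
    ocr_one: Callable[[str], str],
    max_workers: int = 4,
    on_progress: Optional[Callable[[int, int], None]] = None,
) -> dict[str, str]:
    """Run ocr_one over every path concurrently and report progress.

    on_progress(done, total) is called once per finished file. An exception
    raised by ocr_one propagates to the caller via future.result().
    """
    results: dict[str, str] = {}
    with ThreadPoolExecutor(max_workers=max_workers) as pool:
        futures = {pool.submit(ocr_one, path): path for path in paths}
        for done, future in enumerate(as_completed(futures), start=1):
            results[futures[future]] = future.result()
            if on_progress is not None:
                on_progress(done, len(futures))
    return results
```

The default of 4 workers mirrors the `MAX_OCR_WORKERS=4` setting shown in the Configuration section; whether the backend uses threads, processes, or a task queue is not stated in this README.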

Tech Stack

Backend

Framework: FastAPI 0.115.0
OCR Engine: PaddleOCR 3.0+ with PaddleOCR-VL
Database: MySQL via SQLAlchemy
PDF Generation: Pandoc + WeasyPrint
Image Processing: OpenCV, Pillow, pdf2image
Office Conversion: LibreOffice (headless mode)

Frontend

Framework: React 19 with TypeScript
Build Tool: Vite 7
Styling: Tailwind CSS v4 + shadcn/ui
State Management: React Query + Zustand
HTTP Client: Axios

Prerequisites

OS: WSL2 Ubuntu 24.04
Python: 3.12+
Node.js: 24.x LTS
MySQL: External database server (provided)
GPU (Optional): NVIDIA GPU with CUDA 11.2+ for hardware acceleration

Quick Start

1. Automated Setup (Recommended)

# Run automated setup script
./setup_dev_env.sh

This script automatically:

Detects NVIDIA GPU and CUDA version (if available)
Installs Python development tools (pip, venv, build-essential)
Installs system dependencies (pandoc, LibreOffice, fonts, etc.)
Installs Node.js (via nvm)
Installs PaddlePaddle GPU version (if GPU detected) or CPU version
Installs other Python packages
Installs frontend dependencies
Verifies GPU functionality (if GPU detected)

2. Initialize Database

source venv/bin/activate
cd backend
alembic upgrade head
python create_test_user.py
cd ..

Default test user:

Username: admin
Password: admin123

3. Start Development Servers

Backend (Terminal 1):

./start_backend.sh

Frontend (Terminal 2):

./start_frontend.sh

4. Access Application

Frontend: http://localhost:5173
API Docs: http://localhost:8000/docs
Health Check: http://localhost:8000/health

Project Structure

Tool_OCR/
├── backend/                 # FastAPI backend
│   ├── app/
│   │   ├── api/v1/         # API endpoints
│   │   ├── core/           # Configuration, database
│   │   ├── models/         # Database models
│   │   ├── services/       # Business logic
│   │   └── main.py         # Application entry point
│   ├── alembic/            # Database migrations
│   └── tests/              # Test suite
├── frontend/               # React frontend
│   ├── src/
│   │   ├── components/     # UI components
│   │   ├── pages/          # Page components
│   │   ├── services/       # API services
│   │   └── stores/         # State management
│   └── public/             # Static assets
├── .env.local              # Local development config
├── setup_dev_env.sh        # Environment setup script
├── start_backend.sh        # Backend startup script
└── start_frontend.sh       # Frontend startup script

Configuration

Main config file: .env.local

# Database
MYSQL_HOST=mysql.theaken.com
MYSQL_PORT=33306

# Application ports
BACKEND_PORT=8000
FRONTEND_PORT=5173

# Token expiration (minutes)
ACCESS_TOKEN_EXPIRE_MINUTES=1440  # 24 hours

# Supported file formats
ALLOWED_EXTENSIONS=png,jpg,jpeg,pdf,bmp,tiff,doc,docx,ppt,pptx

# OCR settings
OCR_LANGUAGES=ch,en,japan,korean
MAX_OCR_WORKERS=4

# GPU acceleration (optional)
FORCE_CPU_MODE=false         # Set to true to disable GPU even if available
GPU_MEMORY_FRACTION=0.8      # Fraction of GPU memory to use (0.0-1.0)
GPU_DEVICE_ID=0              # GPU device ID to use (0 for primary GPU)
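
The settings above are plain strings once loaded, so the application has to parse and validate them. The sketch below shows one way to do that with only the standard library; the function name and the returned dictionary shape are assumptions, and the real backend may use a settings library instead.

```python
import os
from typing import Mapping, Optional


def _split_csv(value: str) -> list[str]:
    """Split a comma-separated value, dropping blanks and lowercasing items."""
    return [item.strip().lower() for item in value.split(",") if item.strip()]


def load_settings(env: Optional[Mapping[str, str]] = None) -> dict:
    """Parse OCR and GPU settings from an environment-like mapping."""
    env = os.environ if env is None else env
    fraction = float(env.get("GPU_MEMORY_FRACTION", "0.8"))
    if not 0.0 <= fraction <= 1.0:
        raise ValueError("GPU_MEMORY_FRACTION must be between 0.0 and 1.0")
    return {
        "allowed_extensions": set(_split_csv(env.get("ALLOWED_EXTENSIONS", ""))),
        "ocr_languages": _split_csv(env.get("OCR_LANGUAGES", "")),
        "max_ocr_workers": int(env.get("MAX_OCR_WORKERS", "4")),
        "force_cpu_mode": env.get("FORCE_CPU_MODE", "false").strip().lower()
        in {"1", "true", "yes"},
        "gpu_memory_fraction": fraction,
        "gpu_device_id": int(env.get("GPU_DEVICE_ID", "0")),
    }
```

Passing a plain dictionary instead of `os.environ` makes the parser easy to exercise in tests.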

GPU Acceleration

The system automatically detects and utilizes NVIDIA GPU hardware when available:

Auto-detection: Setup script detects GPU and installs appropriate PaddlePaddle version
Graceful fallback: If GPU is unavailable or fails, system automatically uses CPU mode
Performance: GPU acceleration provides 3-10x speedup for OCR processing
Configuration: Control GPU usage via .env.local environment variables

Check GPU status at: http://localhost:8000/health
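
The fallback rule described above reduces to a small decision function. This is a sketch, not the project's code: the function name is hypothetical, the GPU probe is passed in as a boolean so the example stays independent of PaddlePaddle, and the returned strings follow the `cpu` / `gpu:<id>` device naming that PaddlePaddle uses.

```python
def select_device(gpu_available: bool, force_cpu: bool = False, device_id: int = 0) -> str:
    """Choose the compute device, falling back to CPU when needed.

    force_cpu corresponds to FORCE_CPU_MODE and device_id to GPU_DEVICE_ID
    in the configuration shown earlier.
    """
    if force_cpu or not gpu_available:
        return "cpu"
    return f"gpu:{device_id}"


# Usage: a detected GPU is used unless CPU mode is forced.
assert select_device(gpu_available=True) == "gpu:0"
assert select_device(gpu_available=True, force_cpu=True) == "cpu"
```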

Known Limitations

Chart Recognition (PP-StructureV3)

Due to an API incompatibility between PaddleOCR 3.x and PaddlePaddle 3.0.0 stable, the chart recognition feature is currently disabled:

✅ Works: Layout analysis detects and extracts charts/figures as image files
✅ Works: Tables, formulas, and text recognition function normally
❌ Disabled: Deep chart content understanding (chart type, data extraction, axis/legend parsing)
❌ Disabled: Converting chart content to structured data

Technical Details:

The PaddleOCR-VL chart recognition model requires the paddle.incubate.nn.functional.fused_rms_norm_ext API.
PaddlePaddle 3.0.0 stable provides only the base fused_rms_norm function.
This limitation will be resolved when PaddlePaddle releases an update that includes the extended API.

Workaround: Charts are saved as images and can be viewed manually. For chart data extraction, consider using specialized chart recognition tools separately.
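
Whether the installed PaddlePaddle build exposes the extended API can be probed at startup rather than hard-coded. The helper below is a hypothetical sketch of such a feature gate; it only checks that the attribute exists and does not verify that chart recognition would actually work.

```python
import importlib


def api_available(module_path: str, attr: str) -> bool:
    """Return True if module_path imports and exposes an attribute named attr."""
    try:
        module = importlib.import_module(module_path)
    except ImportError:
        # Module (or PaddlePaddle itself) is not installed.
        return False
    return hasattr(module, attr)


# Gate chart recognition on the API named in the technical details above.
CHART_RECOGNITION_ENABLED = api_available(
    "paddle.incubate.nn.functional", "fused_rms_norm_ext"
)
```

With a gate like this, the feature could re-enable itself after a PaddlePaddle upgrade without a code change.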

API Endpoints

Authentication

POST /api/v1/auth/login - User login

File Management

POST /api/v1/upload - Upload files
POST /api/v1/ocr/process - Start OCR processing
GET /api/v1/batch/{id}/status - Get batch status

Results & Export

GET /api/v1/ocr/result/{id} - Get OCR result
GET /api/v1/export/pdf/{id} - Export as PDF

Full API documentation: http://localhost:8000/docs

Supported File Formats

Images: PNG, JPG, JPEG, BMP, TIFF
Documents: PDF
Office: DOC, DOCX, PPT, PPTX

Office files are automatically converted to PDF before OCR processing.
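
The conversion step can be sketched as a headless LibreOffice invocation. The helper names below are hypothetical and the sketch only builds the command; it assumes the `soffice` binary is on the PATH, which may differ from how the backend locates LibreOffice.

```python
from pathlib import PurePosixPath

# Office extensions listed above that must be converted before OCR.
OFFICE_EXTENSIONS = {"doc", "docx", "ppt", "pptx"}


def needs_conversion(filename: str) -> bool:
    """Return True for Office files, which are converted to PDF first."""
    suffix = PurePosixPath(filename).suffix.lower().lstrip(".")
    return suffix in OFFICE_EXTENSIONS


def libreoffice_command(src: str, out_dir: str) -> list[str]:
    """Build the headless LibreOffice command that converts src to PDF."""
    return [
        "soffice", "--headless",
        "--convert-to", "pdf",
        "--outdir", out_dir,
        src,
    ]
```

The resulting list can be passed to `subprocess.run(..., check=True)`; building it as a list rather than a shell string avoids quoting problems with uploaded filenames.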

Development

Backend

source venv/bin/activate
cd backend

# Run tests
pytest

# Database migration
alembic revision --autogenerate -m "description"
alembic upgrade head

# Code formatting
black app/

Frontend

cd frontend

# Development server
npm run dev

# Build for production
npm run build

# Lint code
npm run lint

OpenSpec Workflow

This project follows OpenSpec for specification-driven development:

# View current changes
openspec list

# Validate specifications
openspec validate add-ocr-batch-processing

# View implementation tasks
cat openspec/changes/add-ocr-batch-processing/tasks.md

Roadmap

Phase 0: Environment setup
Phase 1: Core OCR backend (~98% complete)
Phase 2: Frontend development (~92% complete)
Phase 3: Testing & optimization
Phase 4: Deployment automation
Phase 5: Translation feature (future)

Documentation

Development specs: openspec/project.md
Implementation status: openspec/changes/add-ocr-batch-processing/STATUS.md
Agent instructions: openspec/AGENTS.md

License

Internal project use

Notes

First OCR run will download PaddleOCR models (~900MB)
Token expiration is set to 24 hours by default
Office conversion requires LibreOffice (installed via setup script)
Development environment: WSL2 Ubuntu 24.04 with Python venv
GPU acceleration: Automatically detected and enabled if NVIDIA GPU with CUDA 11.2+ is available
WSL GPU support: Ensure NVIDIA CUDA drivers are installed in WSL for GPU acceleration
GPU status can be checked via /health API endpoint