egg/OCR

egg 5cf4010c9b fix: correct multi-page PDF page-number assignment and logging configuration issues

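Critical Bug #1: Wrong page-number assignment for multi-page PDFs
Problem:
- When processing a multi-page PDF, text_regions carried the correct page number
- But layout_data.elements (tables) and images_metadata (images) all kept page=0
- As a result, tables and images from every page were drawn on page 1
- This caused severe layout errors, overlapping elements, and wrong positions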

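Root cause:
- ocr_service.py (lines 359-372), when accumulating multi-page results,
- set the page number on text_regions: region['page'] = page_num
- But did not update the page number on images_metadata or layout_data.elements
- They kept the single-page default of page=0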

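Fix:
- backend/app/services/ocr_service.py (lines 359-372)
  - Set the correct page number on every element in layout_data.elements
  - Set the correct page number on every image in images_metadata
  - Ensure every element of a multi-page PDF carries the correct page tag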
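
The fix above can be sketched roughly as follows. This is a minimal illustration, not the actual code in ocr_service.py: the function names `tag_page` and `merge_pages`, the shape of the per-page result dict, and the zero-based page numbering are all assumptions made for the example.

```python
from typing import Any


def tag_page(page_result: dict[str, Any], page_num: int) -> dict[str, Any]:
    """Stamp page_num onto every text region, layout element, and image.

    Hypothetical helper: the real per-page result structure may differ.
    """
    for region in page_result.get("text_regions", []):
        region["page"] = page_num
    layout = page_result.get("layout_data") or {}
    for element in layout.get("elements", []):
        element["page"] = page_num  # previously left at the default page=0
    for image in page_result.get("images_metadata", []):
        image["page"] = page_num  # previously left at the default page=0
    return page_result


def merge_pages(page_results: list[dict[str, Any]]) -> dict[str, Any]:
    """Accumulate per-page results into one document-level result."""
    merged: dict[str, Any] = {
        "text_regions": [],
        "layout_data": {"elements": []},
        "images_metadata": [],
    }
    # Assumption: pages are numbered from 0, matching the old page=0 default.
    for page_num, result in enumerate(page_results):
        tag_page(result, page_num)
        merged["text_regions"].extend(result.get("text_regions", []))
        layout = result.get("layout_data") or {}
        merged["layout_data"]["elements"].extend(layout.get("elements", []))
        merged["images_metadata"].extend(result.get("images_metadata", []))
    return merged
```

Tagging each page before merging keeps the page number next to the data it describes, so later stages do not have to infer it from list order.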

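Critical Bug #2: Logging configuration overridden by uvicorn
Problem:
- uvicorn installs its own logging configuration at startup
- This overrides the application's logging.basicConfig()
- Application-level INFO/WARNING/ERROR logs disappeared entirely
- Only uvicorn's HTTP request logs and third-party DEBUG logs were visible
- Problems during PDF generation could not be diagnosed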

Fix:
- backend/app/main.py (lines 17-36)
  - Add force=True to force logging to be reconfigured (Python 3.8+)
  - Explicitly set the root logger's level
  - Configure app-specific loggers (app.services.pdf_generator_service, etc.)
  - Enable log propagation so messages reach the root logger
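
A rough sketch of the four steps listed above, using only the standard `logging` module. The function name `configure_logging`, the format string, and the exact list of logger names are assumptions for illustration; the real main.py may differ.

```python
import logging


def configure_logging(level: int = logging.INFO) -> None:
    """Reconfigure logging even if another component already set it up."""
    # force=True (Python 3.8+) removes handlers already attached to the
    # root logger before installing the new one.
    logging.basicConfig(
        level=level,
        format="%(asctime)s %(name)s %(levelname)s %(message)s",  # assumed format
        force=True,
    )
    # Set the root level explicitly rather than relying on basicConfig alone.
    logging.getLogger().setLevel(level)

    # App-specific loggers; the names besides pdf_generator_service are guesses.
    for name in ("app", "app.services.pdf_generator_service"):
        app_logger = logging.getLogger(name)
        app_logger.setLevel(level)
        app_logger.propagate = True  # let records reach the root handler
```

Calling `configure_logging()` once at application import time would then make INFO-level messages from the listed loggers visible again.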

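Other fixes:
- backend/app/services/pdf_generator_service.py
  - Raise important debug logging to info level (lines 371, 379, 490, 613)
    Reason: the default log level is INFO, so debug logs are not shown
  - Fix max_cols UnboundLocalError (lines 507-509)
    Move logger.info() after the definition of max_cols
  - Remove the dangerous .get('page', 0) default (line 762)
    Use .get('page') instead, so elements without a page are correctly skipped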
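
To illustrate why dropping the default matters, here is a small sketch of page filtering without a fallback value. The function `elements_for_page` is hypothetical and only mirrors the idea described above, not the code at line 762.

```python
from typing import Any


def elements_for_page(elements: list[dict[str, Any]], page_num: int) -> list[dict[str, Any]]:
    """Return only the elements explicitly tagged with page_num."""
    selected = []
    for element in elements:
        page = element.get("page")  # no default: a missing tag stays None
        if page is None:
            # With .get('page', 0) this element would silently land on page 0.
            continue
        if page == page_num:
            selected.append(element)
    return selected
```

With the old default, an untagged element is indistinguishable from one that belongs on the first page; returning `None` makes the missing tag visible and lets the caller skip it.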

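Impact:
✅ Tables and images in multi-page PDFs are now assigned to the correct pages
✅ Detailed PDF generation logs (coordinate conversion, scale factors, etc.) are now displayed
✅ Text crowding, spacing, and positioning problems can now be diagnosed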

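Testing suggestions:
1. Restart the backend to clear the Python cache
2. Upload a multi-page PDF for OCR processing
3. Check that every element in the generated JSON has the correct page tag
4. Check that the terminal log shows the detailed PDF generation process
5. Verify that element positions are correct on every page of the generated PDF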
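
Step 3 could be partly automated with a check like the one below. It is only a sketch: `missing_page_tags` is a hypothetical helper, and the key names are taken from the commit message rather than from a documented JSON schema.

```python
import json
from typing import Any


def missing_page_tags(result: dict[str, Any]) -> list[str]:
    """List the locations of items that lack an integer 'page' tag."""
    groups = {
        "text_regions": result.get("text_regions", []),
        "layout_data.elements": (result.get("layout_data") or {}).get("elements", []),
        "images_metadata": result.get("images_metadata", []),
    }
    problems = []
    for name, items in groups.items():
        for index, item in enumerate(items):
            if not isinstance(item.get("page"), int):
                problems.append(f"{name}[{index}]")
    return problems


# Example with an inline JSON document; a real check would load the OCR output file.
sample = json.loads(
    '{"text_regions": [{"page": 0}],'
    ' "layout_data": {"elements": [{"type": "table"}]},'
    ' "images_metadata": [{"page": 1}]}'
)
untagged = missing_page_tags(sample)
```

An empty list means every item carries a page tag; it does not prove the tags are correct, so the visual check in step 5 is still needed.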

🤖 Generated with [Claude Code](https://claude.com/claude-code)

Co-Authored-By: Claude <noreply@anthropic.com>

2025-11-18 12:13:25 +08:00

.claude

fix: correct multi-page PDF page-number assignment and logging configuration issues

2025-11-18 12:13:25 +08:00

backend

fix: correct multi-page PDF page-number assignment and logging configuration issues

2025-11-18 12:13:25 +08:00

demo_docs

first

2025-11-12 22:53:17 +08:00

frontend

feat: implement layout-preserving PDF generation with table reconstruction

2025-11-17 20:21:56 +08:00

models

first

2025-11-12 22:53:17 +08:00

openspec

fix: migrate UI to V2 API and fix admin dashboard

2025-11-17 08:55:50 +08:00

.env

feat: Dockerized deployment - convert to single-container architecture

2025-11-13 13:12:59 +08:00

.env.example

first

2025-11-12 22:53:17 +08:00

.gitignore

2nd

2025-11-12 22:54:56 +08:00

AGENTS.md

first

2025-11-12 22:53:17 +08:00

API_REFERENCE.md

fix: resolve 7 frontend-backend API inconsistencies and add comprehensive documentation

2025-11-13 08:54:37 +08:00

CHART_RECOGNITION.md

feat: enable chart recognition with PaddlePaddle 3.2.1

2025-11-16 18:57:38 +08:00

CLAUDE.md

first

2025-11-12 22:53:17 +08:00

FRONTEND_API.md

fix: resolve 7 frontend-backend API inconsistencies and add comprehensive documentation

2025-11-13 08:54:37 +08:00

README.md

docs: update documentation for chart recognition enablement

2025-11-16 19:04:30 +08:00

requirements.txt

fix: correct OCR coordinate scaling by inferring dimensions from bbox

2025-11-17 21:01:38 +08:00

setup_dev_env.sh

docs: update documentation for chart recognition enablement

2025-11-16 19:04:30 +08:00

start_backend.sh

feat: migrate to WSL Ubuntu native development environment

2025-11-13 21:00:42 +08:00

start_frontend.sh

feat: migrate to WSL Ubuntu native development environment

2025-11-13 21:00:42 +08:00

TESTING.md

feat: add admin dashboard, audit logs, token expiry check and test suite

2025-11-16 18:01:50 +08:00

README.md

Tool_OCR

OCR Batch Processing System with Structure Extraction

A web-based solution to extract text, images, and document structure from multiple files efficiently using PaddleOCR-VL.

Features

🔍 Multi-Language OCR: Support for 109 languages (Chinese, English, Japanese, Korean, etc.)
📄 Document Structure Analysis: Intelligent layout analysis with PP-StructureV3
🖼️ Image Extraction: Preserve document images alongside text content
📑 Batch Processing: Process multiple files concurrently with progress tracking
📤 Multiple Export Formats: TXT, JSON, Excel, Markdown with images, searchable PDF
📋 Office Documents: DOC, DOCX, PPT, PPTX support via LibreOffice conversion
🚀 GPU Acceleration: Automatic CUDA GPU detection with graceful CPU fallback
🔧 Flexible Configuration: Rule-based output formatting
🌐 Translation Ready: Reserved architecture for future translation features

Tech Stack

Backend

Framework: FastAPI 0.115.0
OCR Engine: PaddleOCR 3.0+ with PaddleOCR-VL and PP-StructureV3
Deep Learning: PaddlePaddle 3.2.1+ (GPU/CPU support)
Database: MySQL via SQLAlchemy
PDF Generation: Pandoc + WeasyPrint
Image Processing: OpenCV, Pillow, pdf2image
Office Conversion: LibreOffice (headless mode)

Frontend

Framework: React 19 with TypeScript
Build Tool: Vite 7
Styling: Tailwind CSS v4 + shadcn/ui
State Management: React Query + Zustand
HTTP Client: Axios

Prerequisites

OS: WSL2 Ubuntu 24.04
Python: 3.12+
Node.js: 24.x LTS
MySQL: External database server (provided)
GPU (Optional): NVIDIA GPU with CUDA 11.8+ for hardware acceleration
- PaddlePaddle 3.2.1+ requires CUDA 11.8, 12.3, or 12.6+
- WSL2 users: Ensure NVIDIA CUDA drivers are installed

Quick Start

1. Automated Setup (Recommended)

# Run automated setup script
./setup_dev_env.sh

This script automatically:

Detects NVIDIA GPU and CUDA version (if available)
Installs Python development tools (pip, venv, build-essential)
Installs system dependencies (pandoc, LibreOffice, fonts, etc.)
Installs Node.js (via nvm)
Installs PaddlePaddle 3.2.1+ GPU version (if GPU detected) or CPU version
Configures WSL CUDA library paths (for WSL2 GPU users)
Installs other Python packages (PaddleOCR, PaddleX, etc.)
Installs frontend dependencies
Verifies GPU functionality and chart recognition API availability

2. Initialize Database

source venv/bin/activate
cd backend
alembic upgrade head
python create_test_user.py
cd ..

Default test user:

Username: admin
Password: admin123

3. Start Development Servers

Backend (Terminal 1):

./start_backend.sh

Frontend (Terminal 2):

./start_frontend.sh

4. Access Application

Frontend: http://localhost:5173
API Docs: http://localhost:8000/docs
Health Check: http://localhost:8000/health

Project Structure

Tool_OCR/
├── backend/                 # FastAPI backend
│   ├── app/
│   │   ├── api/v1/         # API endpoints
│   │   ├── core/           # Configuration, database
│   │   ├── models/         # Database models
│   │   ├── services/       # Business logic
│   │   └── main.py         # Application entry point
│   ├── alembic/            # Database migrations
│   └── tests/              # Test suite
├── frontend/               # React frontend
│   ├── src/
│   │   ├── components/     # UI components
│   │   ├── pages/          # Page components
│   │   ├── services/       # API services
│   │   └── stores/         # State management
│   └── public/             # Static assets
├── .env.local              # Local development config
├── setup_dev_env.sh        # Environment setup script
├── start_backend.sh        # Backend startup script
└── start_frontend.sh       # Frontend startup script

Configuration

Main config file: .env.local

# Database
MYSQL_HOST=mysql.theaken.com
MYSQL_PORT=33306

# Application ports
BACKEND_PORT=8000
FRONTEND_PORT=5173

# Token expiration (minutes)
ACCESS_TOKEN_EXPIRE_MINUTES=1440  # 24 hours

# Supported file formats
ALLOWED_EXTENSIONS=png,jpg,jpeg,pdf,bmp,tiff,doc,docx,ppt,pptx

# OCR settings
OCR_LANGUAGES=ch,en,japan,korean
MAX_OCR_WORKERS=4

# GPU acceleration (optional)
FORCE_CPU_MODE=false         # Set to true to disable GPU even if available
GPU_MEMORY_FRACTION=0.8      # Fraction of GPU memory to use (0.0-1.0)
GPU_DEVICE_ID=0              # GPU device ID to use (0 for primary GPU)
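
As an illustration of how these values might be consumed, the sketch below parses a few of them from an already-loaded environment mapping. The `OcrSettings` class, the `load_settings` function, and the fallback defaults are assumptions for the example; the project's real configuration code lives under backend/app/core and may look quite different.

```python
from dataclasses import dataclass
from typing import Mapping


@dataclass(frozen=True)
class OcrSettings:
    allowed_extensions: frozenset[str]
    ocr_languages: tuple[str, ...]
    max_ocr_workers: int
    force_cpu_mode: bool
    gpu_memory_fraction: float
    gpu_device_id: int


def _split_csv(value: str) -> list[str]:
    """Split a comma-separated value, dropping blanks and surrounding spaces."""
    return [part.strip() for part in value.split(",") if part.strip()]


def load_settings(env: Mapping[str, str]) -> OcrSettings:
    """Build settings from a mapping such as os.environ (defaults are assumed)."""
    fraction = float(env.get("GPU_MEMORY_FRACTION", "0.8"))
    if not 0.0 <= fraction <= 1.0:
        raise ValueError("GPU_MEMORY_FRACTION must be between 0.0 and 1.0")
    return OcrSettings(
        allowed_extensions=frozenset(
            ext.lower() for ext in _split_csv(env.get("ALLOWED_EXTENSIONS", ""))
        ),
        ocr_languages=tuple(_split_csv(env.get("OCR_LANGUAGES", ""))),
        max_ocr_workers=int(env.get("MAX_OCR_WORKERS", "4")),
        force_cpu_mode=env.get("FORCE_CPU_MODE", "false").strip().lower()
        in {"1", "true", "yes"},
        gpu_memory_fraction=fraction,
        gpu_device_id=int(env.get("GPU_DEVICE_ID", "0")),
    )
```

Passing `os.environ` (after the `.env.local` file has been loaded by whatever mechanism the backend uses) would yield a validated, immutable settings object.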

GPU Acceleration

The system automatically detects and utilizes NVIDIA GPU hardware when available:

Auto-detection: Setup script detects GPU and installs appropriate PaddlePaddle version
Graceful fallback: If GPU is unavailable or fails, system automatically uses CPU mode
Performance: GPU acceleration provides 3-10x speedup for OCR processing
Configuration: Control GPU usage via .env.local environment variables
WSL2 CUDA Setup: For WSL2 users, CUDA library paths are automatically configured in ~/.bashrc
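
The detection and fallback behaviour described above might be structured like this. It is a sketch under assumptions: `select_device` is a hypothetical function, and the GPU probe is injected as a callable so the example does not depend on any particular PaddlePaddle API. The `gpu:<id>` and `cpu` strings follow the device naming PaddlePaddle commonly uses.

```python
from typing import Callable


def select_device(force_cpu: bool, device_id: int, gpu_count: Callable[[], int]) -> str:
    """Pick 'gpu:<id>' when usable, otherwise fall back to 'cpu'."""
    if force_cpu:
        # Mirrors FORCE_CPU_MODE=true: never touch the GPU probe.
        return "cpu"
    try:
        count = gpu_count()
    except Exception:
        # Any probe failure (missing driver, broken CUDA libs) means CPU mode.
        return "cpu"
    if 0 <= device_id < count:
        return f"gpu:{device_id}"
    # Requested device does not exist, so degrade gracefully.
    return "cpu"
```

Injecting the probe keeps the decision logic testable on machines without a GPU, which is also where the fallback path matters most.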

Chart Recognition: Requires PaddlePaddle 3.2.0+ for full PP-StructureV3 chart recognition capabilities (chart type detection, data extraction, axis/legend parsing). The setup script installs PaddlePaddle 3.2.1+, which includes all required APIs.

Check GPU status and chart recognition availability at: http://localhost:8000/health

API Endpoints

Authentication

POST /api/v1/auth/login - User login

File Management

POST /api/v1/upload - Upload files
POST /api/v1/ocr/process - Start OCR processing
GET /api/v1/batch/{id}/status - Get batch status

Results & Export

GET /api/v1/ocr/result/{id} - Get OCR result
GET /api/v1/export/pdf/{id} - Export as PDF

Full API documentation: http://localhost:8000/docs
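
A client might address these endpoints as sketched below, using only the standard library. The paths come from the list above; everything else is an assumption, including the helper names and especially the JSON login body with `username` and `password` fields. Check the API docs for the real request schemas before relying on this.

```python
import json
import urllib.request

BASE_URL = "http://localhost:8000"  # backend address from the Quick Start section


def login_request(username: str, password: str) -> urllib.request.Request:
    """Build (but do not send) a login request; the body shape is assumed."""
    body = json.dumps({"username": username, "password": password}).encode("utf-8")
    return urllib.request.Request(
        f"{BASE_URL}/api/v1/auth/login",
        data=body,
        headers={"Content-Type": "application/json"},
        method="POST",
    )


def batch_status_url(batch_id: int) -> str:
    """URL for GET /api/v1/batch/{id}/status."""
    return f"{BASE_URL}/api/v1/batch/{batch_id}/status"


def pdf_export_url(result_id: int) -> str:
    """URL for GET /api/v1/export/pdf/{id}."""
    return f"{BASE_URL}/api/v1/export/pdf/{result_id}"
```

Sending the request would be a matter of passing it to `urllib.request.urlopen` once the backend is running.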

Supported File Formats

Images: PNG, JPG, JPEG, BMP, TIFF
Documents: PDF
Office: DOC, DOCX, PPT, PPTX

Office files are automatically converted to PDF before OCR processing.
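
The conversion step could look roughly like this. It is a sketch, not the project's implementation: the helper names are hypothetical, and it assumes the LibreOffice binary is available on the PATH as `soffice`.

```python
import subprocess
from pathlib import Path

# Office extensions from the list above.
OFFICE_EXTENSIONS = {".doc", ".docx", ".ppt", ".pptx"}


def needs_conversion(path: Path) -> bool:
    """True for Office files that must become PDF before OCR."""
    return path.suffix.lower() in OFFICE_EXTENSIONS


def build_convert_command(source: Path, out_dir: Path) -> list[str]:
    """LibreOffice headless command line for converting one file to PDF."""
    return [
        "soffice",  # assumed binary name; some systems use 'libreoffice'
        "--headless",
        "--convert-to", "pdf",
        "--outdir", str(out_dir),
        str(source),
    ]


def convert_to_pdf(source: Path, out_dir: Path, timeout: int = 120) -> Path:
    """Run the conversion and return the expected output path."""
    subprocess.run(build_convert_command(source, out_dir), check=True, timeout=timeout)
    # LibreOffice names the output after the source file's stem.
    return out_dir / f"{source.stem}.pdf"
```

Keeping command construction separate from execution makes the logic easy to test without LibreOffice installed; the timeout guards against a hung headless process.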

Development

Backend

source venv/bin/activate
cd backend

# Run tests
pytest

# Database migration
alembic revision --autogenerate -m "description"
alembic upgrade head

# Code formatting
black app/

Frontend

cd frontend

# Development server
npm run dev

# Build for production
npm run build

# Lint code
npm run lint

OpenSpec Workflow

This project follows OpenSpec for specification-driven development:

# View current changes
openspec list

# Validate specifications
openspec validate add-ocr-batch-processing

# View implementation tasks
cat openspec/changes/add-ocr-batch-processing/tasks.md

Roadmap

Phase 0: Environment setup
Phase 1: Core OCR backend (~98% complete)
Phase 2: Frontend development (~92% complete)
Phase 3: Testing & optimization
Phase 4: Deployment automation
Phase 5: Translation feature (future)

Documentation

Development specs: openspec/project.md
Implementation status: openspec/changes/add-ocr-batch-processing/STATUS.md
Agent instructions: openspec/AGENTS.md

License

Internal project use

Notes

First OCR run will download PaddleOCR models (~900MB)
Token expiration is set to 24 hours by default
Office conversion requires LibreOffice (installed via setup script)
Development environment: WSL2 Ubuntu 24.04 with Python venv
GPU acceleration: Automatically detected and enabled if NVIDIA GPU with CUDA 11.8+ is available
PaddlePaddle version: System uses PaddlePaddle 3.2.1+ which includes full chart recognition support
WSL GPU support: WSL2 CUDA library paths (/usr/lib/wsl/lib) are automatically configured in ~/.bashrc
Chart recognition: Fully enabled with PP-StructureV3 for chart type detection, data extraction, and structure analysis
GPU status and chart recognition availability can be checked via /health API endpoint