feat: add translated PDF format selection (layout/reflow)

- Add generate_translated_layout_pdf() method for layout-preserving translated PDFs
- Add generate_translated_pdf() method for reflow translated PDFs
- Update translate router to accept format parameter (layout/reflow)
- Update frontend with dropdown to select translated PDF format
- Fix reflow PDF table cell extraction from content dict
- Add embedded images handling in reflow PDF tables
- Archive improve-translated-text-fitting openspec proposal

🤖 Generated with [Claude Code](https://claude.com/claude-code)

Co-Authored-By: Claude <noreply@anthropic.com>
2025-12-03 10:10:28 +08:00
parent 0dcea4a7e7
commit 08adf3d01d
15 changed files with 1384 additions and 1222 deletions
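
The format selection described in the commit message can be pictured as a small dispatch step. The sketch below is illustrative only: the two method names come from the commit message, while `TranslatedPdfFormat`, `PdfGeneratorStub`, `generate_translated`, the `task_id` argument, and the returned file names are hypothetical stand-ins, not the actual router or service code.

```python
from enum import Enum


class TranslatedPdfFormat(str, Enum):
    """Hypothetical enum covering the two formats named in the commit message."""

    LAYOUT = "layout"
    REFLOW = "reflow"


class PdfGeneratorStub:
    """Stand-in for the real PDF generator service.

    Only the two method names are taken from the commit message; the
    signatures and return values are assumptions made for this sketch.
    """

    def generate_translated_layout_pdf(self, task_id: str) -> str:
        return f"{task_id}_translated_layout.pdf"

    def generate_translated_pdf(self, task_id: str) -> str:
        return f"{task_id}_translated_reflow.pdf"


def generate_translated(generator: PdfGeneratorStub, task_id: str, fmt: str) -> str:
    """Validate the requested format and call the matching generator method."""
    try:
        chosen = TranslatedPdfFormat(fmt)
    except ValueError:
        allowed = ", ".join(f.value for f in TranslatedPdfFormat)
        raise ValueError(f"unsupported format {fmt!r}; expected one of: {allowed}") from None

    if chosen is TranslatedPdfFormat.LAYOUT:
        return generator.generate_translated_layout_pdf(task_id)
    return generator.generate_translated_pdf(task_id)
```

Validating the value before dispatch keeps an unknown format from silently falling through to the reflow branch.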
--- a/README.md
+++ b/README.md
@@ -1,270 +1,82 @@
 # Tool_OCR

-**OCR Batch Processing System with Structure Extraction**
-
-A web-based solution to extract text, images, and document structure from multiple files efficiently using PaddleOCR-VL.
-
-## Features
-
- 🔍 **Multi-Language OCR**: Support for 109 languages (Chinese, English, Japanese, Korean, etc.)
- 📄 **Document Structure Analysis**: Intelligent layout analysis with PP-StructureV3
- 🖼️ **Image Extraction**: Preserve document images alongside text content
- 📑 **Batch Processing**: Process multiple files concurrently with progress tracking
- 📤 **Multiple Export Formats**: TXT, JSON, Excel, Markdown with images, searchable PDF
- 📋 **Office Documents**: DOC, DOCX, PPT, PPTX support via LibreOffice conversion
- 🚀 **GPU Acceleration**: Automatic CUDA GPU detection with graceful CPU fallback
- 🔧 **Flexible Configuration**: Rule-based output formatting
- 🌐 **Translation Ready**: Reserved architecture for future translation features
-
-## Tech Stack
-
-### Backend
- **Framework**: FastAPI 0.115.0
- **OCR Engine**: PaddleOCR 3.0+ with PaddleOCR-VL and PP-StructureV3
- **Deep Learning**: PaddlePaddle 3.2.1+ (GPU/CPU support)
- **Database**: MySQL via SQLAlchemy
- **PDF Generation**: Pandoc + WeasyPrint
- **Image Processing**: OpenCV, Pillow, pdf2image
- **Office Conversion**: LibreOffice (headless mode)
-
-### Frontend
- **Framework**: React 19 with TypeScript
- **Build Tool**: Vite 7
- **Styling**: Tailwind CSS v4 + shadcn/ui
- **State Management**: React Query + Zustand
- **HTTP Client**: Axios
-
-## Prerequisites
-
- **OS**: WSL2 Ubuntu 24.04
- **Python**: 3.12+
- **Node.js**: 24.x LTS
- **MySQL**: External database server (provided)
- **GPU** (Optional): NVIDIA GPU with CUDA 11.8+ for hardware acceleration
-  - PaddlePaddle 3.2.1+ requires CUDA 11.8, 12.3, or 12.6+
-  - WSL2 users: Ensure NVIDIA CUDA drivers are installed
-
-## Quick Start
-
-### 1. Automated Setup (Recommended)
-
-```bash
-# Run automated setup script
-./setup_dev_env.sh
-```
-
-This script automatically:
- Detects NVIDIA GPU and CUDA version (if available)
- Installs Python development tools (pip, venv, build-essential)
- Installs system dependencies (pandoc, LibreOffice, fonts, etc.)
- Installs Node.js (via nvm)
- Installs PaddlePaddle 3.2.1+ GPU version (if GPU detected) or CPU version
- Configures WSL CUDA library paths (for WSL2 GPU users)
- Installs other Python packages (PaddleOCR, PaddleX, etc.)
- Installs frontend dependencies
- Verifies GPU functionality and chart recognition API availability
-
-### 2. Initialize Database
-
-```bash
-source venv/bin/activate
-cd backend
-alembic upgrade head
-python create_test_user.py
-cd ..
-```
-
-Default test user:
- Username: `admin`
- Password: `admin123`
-
-### 3. Start Development Servers
-
-**Backend (Terminal 1):**
-```bash
-./start_backend.sh
-```
-
-**Frontend (Terminal 2):**
-```bash
-./start_frontend.sh
-```
-
-### 4. Access Application
-
- **Frontend**: http://localhost:5173
- **API Docs**: http://localhost:8000/docs
- **Health Check**: http://localhost:8000/health
-
-## Project Structure
-
-```
-Tool_OCR/
-├── backend/                 # FastAPI backend
-│   ├── app/
-│   │   ├── api/v1/         # API endpoints
-│   │   ├── core/           # Configuration, database
-│   │   ├── models/         # Database models
-│   │   ├── services/       # Business logic
-│   │   └── main.py         # Application entry point
-│   ├── alembic/            # Database migrations
-│   └── tests/              # Test suite
-├── frontend/               # React frontend
-│   ├── src/
-│   │   ├── components/     # UI components
-│   │   ├── pages/          # Page components
-│   │   ├── services/       # API services
-│   │   └── stores/         # State management
-│   └── public/             # Static assets
-├── .env.local              # Local development config
-├── setup_dev_env.sh        # Environment setup script
-├── start_backend.sh        # Backend startup script
-└── start_frontend.sh       # Frontend startup script
-```
-
-## Configuration
-
-Main config file: `.env.local`
-
-```bash
-# Database
-MYSQL_HOST=mysql.theaken.com
-MYSQL_PORT=33306
-
-# Application ports
-BACKEND_PORT=8000
-FRONTEND_PORT=5173
-
-# Token expiration (minutes)
-ACCESS_TOKEN_EXPIRE_MINUTES=1440  # 24 hours
-
-# Supported file formats
-ALLOWED_EXTENSIONS=png,jpg,jpeg,pdf,bmp,tiff,doc,docx,ppt,pptx
-
-# OCR settings
-OCR_LANGUAGES=ch,en,japan,korean
-MAX_OCR_WORKERS=4
-
-# GPU acceleration (optional)
-FORCE_CPU_MODE=false         # Set to true to disable GPU even if available
-GPU_MEMORY_FRACTION=0.8      # Fraction of GPU memory to use (0.0-1.0)
-GPU_DEVICE_ID=0              # GPU device ID to use (0 for primary GPU)
-```
-
-### GPU Acceleration
-
-The system automatically detects and utilizes NVIDIA GPU hardware when available:
-
- **Auto-detection**: Setup script detects GPU and installs appropriate PaddlePaddle version
- **Graceful fallback**: If GPU is unavailable or fails, system automatically uses CPU mode
- **Performance**: GPU acceleration provides 3-10x speedup for OCR processing
- **Configuration**: Control GPU usage via `.env.local` environment variables
- **WSL2 CUDA Setup**: For WSL2 users, CUDA library paths are automatically configured in `~/.bashrc`
-
-**Chart Recognition**: Requires PaddlePaddle 3.2.0+ for full PP-StructureV3 chart recognition capabilities (chart type detection, data extraction, axis/legend parsing). The setup script installs PaddlePaddle 3.2.1+ which includes all required APIs.
-
-Check GPU status and chart recognition availability at: http://localhost:8000/health
-
-## API Endpoints
-
-### Authentication
- `POST /api/v1/auth/login` - User login
-
-### File Management
- `POST /api/v1/upload` - Upload files
- `POST /api/v1/ocr/process` - Start OCR processing
- `GET /api/v1/batch/{id}/status` - Get batch status
-
-### Results & Export
- `GET /api/v1/ocr/result/{id}` - Get OCR result
- `GET /api/v1/export/pdf/{id}` - Export as PDF
-
-Full API documentation: http://localhost:8000/docs
-
-## Supported File Formats
-
- **Images**: PNG, JPG, JPEG, BMP, TIFF
- **Documents**: PDF
- **Office**: DOC, DOCX, PPT, PPTX
-
-Office files are automatically converted to PDF before OCR processing.
-
-## Development
-
-### Backend
-
-```bash
-source venv/bin/activate
-cd backend
-
-# Run tests
-pytest
-
-# Database migration
-alembic revision --autogenerate -m "description"
-alembic upgrade head
-
-# Code formatting
-black app/
-```
-
-### Frontend
-
-```bash
-cd frontend
-
-# Development server
-npm run dev
-
-# Build for production
-npm run build
-
-# Lint code
-npm run lint
-```
-
-## OpenSpec Workflow
-
-This project follows OpenSpec for specification-driven development:
-
-```bash
-# View current changes
-openspec list
-
-# Validate specifications
-openspec validate add-ocr-batch-processing
-
-# View implementation tasks
-cat openspec/changes/add-ocr-batch-processing/tasks.md
-```
-
-## Roadmap
-
- [x] **Phase 0**: Environment setup
- [x] **Phase 1**: Core OCR backend (~98% complete)
- [x] **Phase 2**: Frontend development (~92% complete)
- [ ] **Phase 3**: Testing & optimization
- [ ] **Phase 4**: Deployment automation
- [ ] **Phase 5**: Translation feature (future)
-
-## Documentation
-
- Development specs: [openspec/project.md](openspec/project.md)
- Implementation status: [openspec/changes/add-ocr-batch-processing/STATUS.md](openspec/changes/add-ocr-batch-processing/STATUS.md)
- Agent instructions: [openspec/AGENTS.md](openspec/AGENTS.md)
-
-## License
-
-Internal project use
-
-## Notes
-
- First OCR run will download PaddleOCR models (~900MB)
- Token expiration is set to 24 hours by default
- Office conversion requires LibreOffice (installed via setup script)
- Development environment: WSL2 Ubuntu 24.04 with Python venv
- **GPU acceleration**: Automatically detected and enabled if NVIDIA GPU with CUDA 11.8+ is available
- **PaddlePaddle version**: System uses PaddlePaddle 3.2.1+ which includes full chart recognition support
- **WSL GPU support**: WSL2 CUDA library paths (`/usr/lib/wsl/lib`) are automatically configured in `~/.bashrc`
- **Chart recognition**: Fully enabled with PP-StructureV3 for chart type detection, data extraction, and structure analysis
- GPU status and chart recognition availability can be checked via `/health` API endpoint
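+A multilingual batch OCR and layout-restoration tool. It provides a dual-track pipeline (direct extraction and deep OCR), PP-StructureV3 structure analysis, and JSON/Markdown/layout-preserving PDF export, with a React frontend for task tracking and downloads.
+
+## Feature Highlights
+- Dual-track processing: DocumentTypeDetector selects Direct (PyMuPDF extraction) or OCR (PaddleOCR + PP-StructureV3), combining the two to fill in images when necessary.
+- Unified output: both OCR and Direct results are converted to UnifiedDocument, then exported as JSON/Markdown/layout-preserving PDF, with metadata written back.
+- Resource control: OCRServicePool, MemoryGuard, and a prediction semaphore limit GPU/CPU load, with support for automatic unloading and CPU fallback.
+- Tasks and permissions: JWT authentication, external login API, task history/statistics, and admin audit routes.
+- Frontend experience: React + Vite + shadcn/ui, with task polling, result preview, downloads, a settings page, and an admin panel.
+- Internationalization: the translation pipeline (translation_service) is retained and can connect to Dify or offline models.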
+
+## Architecture Overview
+- **Backend (FastAPI)**  
+  - `app/main.py`: lifespan initializes the service pool, memory manager, CORS, and /health; upload endpoint `/api/v2/upload`.  
+  - `routers/`: `auth.py` for login, `tasks.py` for task start/download/metadata, `admin.py` for auditing, `translate.py` for translation output.  
+  - `services/`: `ocr_service.py` for dual-track processing, `document_type_detector.py` for track selection, `direct_extraction_engine.py` for direct extraction, `pp_structure_enhanced.py` for layout analysis, `ocr_to_unified_converter.py` and `unified_document_exporter.py` for export, `pdf_generator_service.py` for layout-preserving PDF, `service_pool.py`/`memory_manager.py` for resource management.  
+  - `models/`, `schemas/`: SQLAlchemy models and Pydantic schemas; `core/config.py` consolidates environment settings.
+- **Frontend (React 18 + Vite)**  
+  - `src/pages`: Login, Upload, Processing, Results, Export, TaskHistory/TaskDetail, Settings, AdminDashboard, AuditLogs.  
+  - `src/services`: API client + React Query; `src/store`: task/user state; `src/components`: shared UI.  
+  - PDF preview uses react-pdf; i18n is managed by `src/i18n`.
+- **Processing Flow Summary**  
+  1. `/api/v2/upload` saves the file to `backend/uploads` and creates a Task.  
+  2. `/api/v2/tasks/{id}/start` triggers dual-track processing (optionally with `pp_structure_params`).  
+  3. Direct/OCR produces a UnifiedDocument and exports `_result.json`, `_output.md`, and a layout-preserving PDF to `backend/storage/results/<task_id>/`, recording metadata in the DB.  
+  4. `/api/v2/tasks/{id}/download/{json|markdown|pdf|unified}` and `/metadata` provide downloads and statistics.
+
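+## Repository Structure
+- `backend/app/`: FastAPI code (core, routers, services, schemas, models, main.py).
+- `backend/tests/`: test suite  
+  - `api/` API mock/integration, `services/` core logic, `e2e/` requires a running backend and a test account, `performance/` measurements, `archived/` legacy cases.  
+  - Test fixtures use the sample files in `demo_docs/` (gitignored, not uploaded).
+- `backend/uploads`, `backend/storage`, `backend/logs`, `backend/models/`: runtime input/output/model/log directories, created automatically at startup and pinned under the backend directory.
+- `frontend/`: React application code and configuration (vite.config.ts, eslint.config.js, etc.).
+- `docs/`: API, architecture, and risk notes.
+- `openspec/`: specification files and change records.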
+
+## Environment Setup
+- Requirements: Python 3.10+, Node 18+/20+, MySQL (or a compatible endpoint), optional NVIDIA GPU (CUDA 11.8+/12.x).  
+- One-step script: `./setup_dev_env.sh` (accepts `--cpu-only` and `--skip-db`).  
+- Manual:
+  1. `python3 -m venv venv && source venv/bin/activate`
+  2. `pip install -r requirements.txt`
+  3. `cp .env.example .env.local`, then fill in the DB/auth/path settings (ports 8000/5173 by default)
+  4. `cd frontend && npm install`
+
+## Development Startup
+- Backend (`.env` defaults to `BACKEND_PORT=8000`; the config default is 12010, overridden by environment variables):  
+  ```bash
+  source venv/bin/activate
+  cd backend
+  uvicorn app.main:app --reload --host 0.0.0.0 --port ${BACKEND_PORT:-8000}
+  # API docs: http://localhost:${BACKEND_PORT:-8000}/docs
+  ```
+  `Settings` normalizes paths such as `uploads`/`storage`/`logs`/`models` under `backend/`, so no stray folders are created when running from a different working directory.
+- Frontend:  
+  ```bash
+  cd frontend
+  npm run dev -- --host --port ${FRONTEND_PORT:-5173}
+  # http://localhost:${FRONTEND_PORT:-5173}
+  ```
+- Alternatively, use `./start.sh backend|frontend|--stop|--status` to manage background processes (PIDs are stored in `.pid/`).
+
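+## Testing
+- Unit/integration: `pytest backend/tests -m "not e2e"` (if needed).  
+- API mock tests: `pytest backend/tests/api` (depends only on mocked dependencies/SQLite).  
+- E2E: start the backend and prepare a test account first; calls `http://localhost:8000/api/v2` by default, with test files taken from the `demo_docs/` samples.  
+- Performance/archived cases: `backend/tests/performance` and `backend/tests/archived` can be run selectively.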
+
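+## Artifacts and Cleanup
+- Runtime inputs and outputs live in `backend/uploads`, `backend/storage/results|json|markdown|exports`, and `backend/logs`; the model cache is in `backend/models/`.  
+- The redundant `node_modules/`, `venv/`, old `pp_demo/`, and upload/output/log samples have been removed. To clean up again, run: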
+  ```bash
+  rm -rf backend/uploads/* backend/storage/results/* backend/logs/*.log .pytest_cache backend/.pytest_cache
+  ```
+  The directories are recreated automatically at startup.
+
+## Reference Documents
+- `docs/architecture-overview.md`: dual-track flow and component descriptions  
+- `docs/API.md`: main API interfaces  
+- `openspec/`: system specifications and change history
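
As a reading aid for the task endpoints listed in the new README's processing flow, the sketch below builds the request paths for a task. The helper names are hypothetical, and resolving `/metadata` to `/api/v2/tasks/{id}/metadata` is an assumption, since the README only gives the suffix.

```python
from urllib.parse import quote

API_PREFIX = "/api/v2"  # prefix used by the endpoints listed in the README
DOWNLOAD_KINDS = ("json", "markdown", "pdf", "unified")  # kinds named in the README


def _task_base(task_id: str) -> str:
    """Return the per-task path prefix, percent-encoding the task id."""
    return f"{API_PREFIX}/tasks/{quote(str(task_id), safe='')}"


def start_path(task_id: str) -> str:
    """Path that triggers processing for an uploaded task."""
    return f"{_task_base(task_id)}/start"


def download_path(task_id: str, kind: str) -> str:
    """Path for downloading one of the exported result kinds."""
    if kind not in DOWNLOAD_KINDS:
        raise ValueError(f"unknown download kind {kind!r}; expected one of {DOWNLOAD_KINDS}")
    return f"{_task_base(task_id)}/download/{kind}"


def metadata_path(task_id: str) -> str:
    """Path for task metadata (assumed to sit under the task prefix)."""
    return f"{_task_base(task_id)}/metadata"
```

For example, `download_path("abc", "pdf")` yields `/api/v2/tasks/abc/download/pdf`, which can be appended to the backend base URL (`http://localhost:8000` in the default setup).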