OCR/openspec/changes/archive/2025-11-18-add-ocr-batch-processing/tasks.md

# Implementation Tasks

## Phase 1: Core OCR with Layout Preservation

### 1. Environment Setup
- [x] 1.1 Create Conda environment with Python 3.10
- [x] 1.2 Install backend dependencies (FastAPI, PaddleOCR 3.0+, paddlepaddle, pandas, etc.)
- [x] 1.3 Install PDF generation tools (weasyprint, markdown, pandoc system package)
- [x] 1.4 Download PaddleOCR-VL model (~900MB) and language packs
- [ ] 1.5 Set up frontend project with Vite + React + TypeScript
- [ ] 1.6 Install frontend dependencies (Tailwind, shadcn/ui, axios, react-query)
- [x] 1.7 Configure MySQL database connection
- [x] 1.8 Install Chinese fonts (Noto Sans CJK) for PDF generation

### 2. Database Schema
- [x] 2.1 Create `paddle_ocr_users` table for JWT authentication (id, username, password_hash, etc.)
- [x] 2.2 Create `paddle_ocr_batches` table (id, user_id, status, created_at, completed_at)
- [x] 2.3 Create `paddle_ocr_files` table (id, batch_id, filename, file_path, file_size, status, format)
- [x] 2.4 Create `paddle_ocr_results` table (id, file_id, markdown_path, json_path, layout_data, confidence)
- [x] 2.5 Create `paddle_ocr_export_rules` table (id, user_id, rule_name, config_json, css_template)
- [x] 2.6 Create `paddle_ocr_translation_configs` table (RESERVED: id, user_id, source_lang, target_lang, engine_type, engine_config)
- [x] 2.7 Write database migration scripts (Alembic)
- [x] 2.8 Add indexes for performance optimization (batch_id, user_id, status)
- Note: All tables use `paddle_ocr_` prefix for namespace isolation
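
The prefix and index tasks above can be sketched as follows. This is a minimal illustration, not the project's Alembic migration: it uses the standard-library `sqlite3` module instead of MySQL, covers only two of the six tables, and the column types, defaults, and index names are assumptions.

```python
import sqlite3

# Illustrative DDL only. Column names come from tasks 2.2 and 2.3;
# the types, defaults, and index names are assumptions.
SCHEMA = """
CREATE TABLE paddle_ocr_batches (
    id INTEGER PRIMARY KEY,
    user_id INTEGER NOT NULL,
    status TEXT NOT NULL DEFAULT 'pending',
    created_at TEXT NOT NULL DEFAULT CURRENT_TIMESTAMP,
    completed_at TEXT
);
CREATE TABLE paddle_ocr_files (
    id INTEGER PRIMARY KEY,
    batch_id INTEGER NOT NULL REFERENCES paddle_ocr_batches(id),
    filename TEXT NOT NULL,
    file_path TEXT NOT NULL,
    file_size INTEGER,
    status TEXT NOT NULL DEFAULT 'pending',
    format TEXT
);
-- Task 2.8: indexes on the columns used for lookups and filtering.
CREATE INDEX idx_paddle_ocr_batches_user_id ON paddle_ocr_batches(user_id);
CREATE INDEX idx_paddle_ocr_batches_status ON paddle_ocr_batches(status);
CREATE INDEX idx_paddle_ocr_files_batch_id ON paddle_ocr_files(batch_id);
CREATE INDEX idx_paddle_ocr_files_status ON paddle_ocr_files(status);
"""


def create_schema(conn: sqlite3.Connection) -> None:
    """Create the sketch tables and indexes on the given connection."""
    conn.executescript(SCHEMA)


def list_objects(conn: sqlite3.Connection, kind: str) -> list:
    """Return the names of all schema objects of one kind ('table' or 'index')."""
    rows = conn.execute(
        "SELECT name FROM sqlite_master WHERE type = ? ORDER BY name", (kind,)
    ).fetchall()
    return [row[0] for row in rows]
```

Calling `create_schema(sqlite3.connect(":memory:"))` gives a throwaway database for checking that every table carries the `paddle_ocr_` prefix.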

### 3. Backend - Document Preprocessing
- [x] 3.1 Implement document preprocessor class for format standardization
- [x] 3.2 Add image format validator (PNG, JPG, JPEG)
- [x] 3.3 Add PDF validator and direct passthrough (PaddleOCR-VL native support)
- [x] 3.4 Implement Office document to PDF conversion (DOC, DOCX, PPT, PPTX via LibreOffice) ⬅️ **Completed via sub-proposal**
- [x] 3.5 Add file corruption detection
- [x] 3.6 Write unit tests for preprocessor
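
One way to approach the format validation and corruption check in tasks 3.2, 3.3, and 3.5 is to compare a file's leading bytes against its extension. The sketch below is an assumption about how such a check could look, not the preprocessor's actual code; the function names and return shape are hypothetical, and a header check catches only mislabeled or empty files, not every form of corruption.

```python
from pathlib import Path

# Well-known leading signatures for the formats named in tasks 3.2 and 3.3.
MAGIC_SIGNATURES = {
    "png": b"\x89PNG\r\n\x1a\n",
    "jpeg": b"\xff\xd8\xff",
    "pdf": b"%PDF-",
}

# Extensions accepted by this sketch, mapped to the expected format.
EXTENSION_TO_FORMAT = {
    ".png": "png",
    ".jpg": "jpeg",
    ".jpeg": "jpeg",
    ".pdf": "pdf",
}


def sniff_format(header: bytes):
    """Return the format whose signature starts the header, or None."""
    for name, signature in MAGIC_SIGNATURES.items():
        if header.startswith(signature):
            return name
    return None


def validate_file(path):
    """Return (ok, reason) after comparing the extension with the file header."""
    file_path = Path(path)
    expected = EXTENSION_TO_FORMAT.get(file_path.suffix.lower())
    if expected is None:
        return False, "unsupported extension"
    with file_path.open("rb") as handle:
        header = handle.read(8)
    if not header:
        return False, "empty file"
    if sniff_format(header) != expected:
        return False, "content does not match extension"
    return True, "ok"
```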

### 4. Backend - Core OCR Service with PaddleOCR-VL
- [x] 4.1 Implement OCR service class with PaddleOCR-VL initialization
- [x] 4.2 Configure layout detection (use_layout_detection=True)
- [x] 4.3 Implement single image/PDF OCR processing
- [x] 4.4 Parse OCR output to extract Markdown and JSON
- [x] 4.5 Store Markdown files with preserved layout structure
- [x] 4.6 Store JSON with detailed bounding boxes and layout metadata
- [x] 4.7 Add confidence threshold filtering
- [x] 4.8 Implement batch processing with worker queue (completed via Task 10: BackgroundTasks)
- [x] 4.9 Add progress tracking for batch jobs (completed via Task 8.4, 8.6: API endpoints)
- [x] 4.10 Write unit tests for OCR service
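
The confidence threshold filtering in task 4.7 might look like the sketch below. The block shape (dicts with `text` and `confidence` keys) is an assumption for illustration and is not PaddleOCR-VL's actual output format.

```python
def filter_by_confidence(blocks, threshold=0.5):
    """Keep only blocks whose confidence is at least the threshold.

    Blocks without a 'confidence' key are treated as confidence 0.0.
    """
    if not 0.0 <= threshold <= 1.0:
        raise ValueError("threshold must be between 0.0 and 1.0")
    return [b for b in blocks if b.get("confidence", 0.0) >= threshold]


def mean_confidence(blocks):
    """Average confidence over the blocks, or 0.0 for an empty list."""
    if not blocks:
        return 0.0
    return sum(b.get("confidence", 0.0) for b in blocks) / len(blocks)
```

A per-file average like `mean_confidence` is one possible source for the `confidence` column in `paddle_ocr_results`, though the tasks do not say how that value is computed.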

### 5. Backend - Layout-Preserved PDF Generation
- [x] 5.1 Create PDF generator service using Pandoc + WeasyPrint
- [x] 5.2 Implement Markdown to HTML conversion with extensions (tables, code, etc.)
- [x] 5.3 Create default CSS template for layout preservation
- [x] 5.4 Create additional CSS templates (academic, business, report)
- [x] 5.5 Add Chinese font configuration (Noto Sans CJK)
- [x] 5.6 Implement PDF generation via Pandoc command
- [x] 5.7 Add fallback: Python WeasyPrint direct generation
- [x] 5.8 Handle multi-page PDF merging
- [x] 5.9 Write unit tests for PDF generator
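
The Pandoc-first, WeasyPrint-fallback flow in tasks 5.6 and 5.7 can be sketched as below. This is a hedged outline rather than the generator service itself: `generate_pdf`, `build_pandoc_command`, and the `fallback` callable are hypothetical names, and the fallback is injected so the sketch runs without WeasyPrint installed. The `--pdf-engine=weasyprint` and `--css` options are standard Pandoc flags.

```python
import shutil
import subprocess


def build_pandoc_command(markdown_path, pdf_path, css_path):
    """Build a Pandoc invocation that renders Markdown to PDF via WeasyPrint."""
    return [
        "pandoc", markdown_path,
        "-o", pdf_path,
        "--pdf-engine=weasyprint",
        "--css", css_path,
    ]


def generate_pdf(markdown_path, pdf_path, css_path, fallback,
                 which=shutil.which, run=subprocess.run):
    """Try Pandoc first; use the fallback if Pandoc is missing or fails.

    Returns "pandoc" or "fallback" to show which path produced the PDF.
    `which` and `run` are parameters only so the flow can be tested
    without a real Pandoc binary.
    """
    if which("pandoc") is not None:
        command = build_pandoc_command(markdown_path, pdf_path, css_path)
        try:
            run(command, check=True, capture_output=True)
            return "pandoc"
        except (subprocess.CalledProcessError, OSError):
            pass  # Fall through to the direct generator.
    fallback(markdown_path, pdf_path, css_path)
    return "fallback"
```

In a real service the `fallback` callable would presumably convert the Markdown to HTML and hand it to WeasyPrint directly.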

### 6. Backend - File Management
- [x] 6.1 Implement file upload validation (type, size, corruption check)
- [x] 6.2 Create file storage service with temporary directory management
- [x] 6.3 Add batch upload handler with unique batch_id generation
- [x] 6.4 Implement file access control and ownership verification
- [x] 6.5 Add automatic cleanup job for expired files (24-hour retention)
- [x] 6.6 Store Markdown and JSON outputs in organized directory structure
- [x] 6.7 Write unit tests for file management
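
The 24-hour retention cleanup in task 6.5 could be implemented along these lines. The sketch assumes that a file's age is measured by its modification time and that expired files are simply deleted; the function names are hypothetical.

```python
import os
import time

RETENTION_SECONDS = 24 * 60 * 60  # 24-hour retention from task 6.5


def find_expired(root, now=None, retention=RETENTION_SECONDS):
    """Return sorted paths under root whose mtime is older than the retention window."""
    now = time.time() if now is None else now
    expired = []
    for dirpath, _dirnames, filenames in os.walk(root):
        for name in filenames:
            path = os.path.join(dirpath, name)
            if now - os.path.getmtime(path) > retention:
                expired.append(path)
    return sorted(expired)


def cleanup_expired(root, now=None, retention=RETENTION_SECONDS):
    """Delete expired files under root and return the paths that were removed."""
    removed = find_expired(root, now=now, retention=retention)
    for path in removed:
        os.remove(path)
    return removed
```

Passing `now` explicitly keeps the function deterministic in unit tests; a scheduler would call `cleanup_expired(storage_root)` with the default.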

### 7. Backend - Export Service
- [x] 7.1 Implement plain text export from Markdown
- [x] 7.2 Implement JSON export with full metadata
- [x] 7.3 Implement Excel export using pandas
- [x] 7.4 Implement Markdown export (direct from OCR output)
- [x] 7.5 Implement layout-preserved PDF export (using PDF generator service)
- [x] 7.6 Add ZIP file creation for batch exports
- [x] 7.7 Implement rule-based filtering (confidence threshold, filename pattern)
- [x] 7.8 Implement rule-based formatting (line numbers, sort by position)
- [x] 7.9 Create export rule CRUD operations (save, load, update, delete)
- [x] 7.10 Write unit tests for export service
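
To illustrate how the rule-based filtering and formatting in tasks 7.7 and 7.8 might combine, here is a minimal sketch. The rule keys (`min_confidence`, `filename_pattern`, `sort_by_position`, `line_numbers`) and the result fields (`filename`, `text`, `confidence`, `x`, `y`) are assumptions; the actual `config_json` layout is not specified in this task list.

```python
from fnmatch import fnmatch


def apply_export_rule(results, rule):
    """Filter and format OCR results according to one export rule.

    Filtering: confidence threshold and filename glob pattern.
    Formatting: optional top-to-bottom, left-to-right sort and line numbers.
    """
    min_confidence = rule.get("min_confidence", 0.0)
    pattern = rule.get("filename_pattern", "*")
    kept = [
        r for r in results
        if r["confidence"] >= min_confidence and fnmatch(r["filename"], pattern)
    ]
    if rule.get("sort_by_position"):
        kept = sorted(kept, key=lambda r: (r["y"], r["x"]))
    lines = [r["text"] for r in kept]
    if rule.get("line_numbers"):
        lines = [f"{number}: {text}" for number, text in enumerate(lines, start=1)]
    return "\n".join(lines)
```

A rule such as `{"min_confidence": 0.5, "filename_pattern": "scan_*", "sort_by_position": True, "line_numbers": True}` keeps confident text from matching files and numbers it in reading order.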

### 8. Backend - API Endpoints
- [x] 8.1 POST `/api/v1/auth/login` - JWT authentication
- [x] 8.2 POST `/api/v1/upload` - File upload with validation
- [x] 8.3 POST `/api/v1/ocr/process` - Trigger OCR processing (PaddleOCR-VL)
- [x] 8.4 GET `/api/v1/ocr/status/{task_id}` - Get task status with progress
- [x] 8.5 GET `/api/v1/ocr/result/{task_id}` - Get OCR results (JSON + Markdown)
- [x] 8.6 GET `/api/v1/batch/{batch_id}/status` - Get batch status
- [x] 8.7 POST `/api/v1/export` - Export results with format and rules
- [x] 8.8 GET `/api/v1/export/pdf/{file_id}` - Generate and download layout-preserved PDF
- [x] 8.9 GET `/api/v1/export/rules` - List saved export rules
- [x] 8.10 POST `/api/v1/export/rules` - Create new export rule
- [x] 8.11 PUT `/api/v1/export/rules/{rule_id}` - Update export rule
- [x] 8.12 DELETE `/api/v1/export/rules/{rule_id}` - Delete export rule
- [x] 8.13 GET `/api/v1/export/css-templates` - List available CSS templates
- [x] 8.14 Write API integration tests
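
The batch status endpoint in task 8.6 has to reduce per-file states to one batch state and a progress figure. The sketch below shows one possible reduction; the status names (`pending`, `processing`, `completed`, `failed`) and the response keys are assumptions, not the documented API schema.

```python
from collections import Counter


def summarize_batch(file_statuses):
    """Reduce a list of per-file statuses to a batch status and progress percentage."""
    counts = Counter(file_statuses)
    total = len(file_statuses)
    finished = counts["completed"] + counts["failed"]
    if total == 0:
        status = "pending"
    elif finished == total:
        # Every file is done; the batch fails only if no file succeeded.
        status = "failed" if counts["completed"] == 0 else "completed"
    elif finished > 0 or counts["processing"] > 0:
        status = "processing"
    else:
        status = "pending"
    progress = 100 * finished // total if total else 0
    return {
        "status": status,
        "progress": progress,
        "total": total,
        "completed": counts["completed"],
        "failed": counts["failed"],
    }
```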

### 9. Backend - Translation Architecture (RESERVED)
- [x] 9.1 Create translation service interface (abstract class)
- [x] 9.2 Implement stub endpoint POST `/api/v1/translate/document` (returns 501 Not Implemented)
- [x] 9.3 Document expected request/response format in OpenAPI spec
- [x] 9.4 Add translation_configs table migrations (completed in Task 2.6)
- [x] 9.5 Create placeholder for translation engine factory (Argos/ERNIE/Google)
- [ ] 9.6 Write unit tests for translation service interface (optional for stub)
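
The reserved interface and factory in tasks 9.1 and 9.5 might be shaped like the sketch below. All class and function names here are hypothetical, and each engine key maps to the same stub because no engine has been chosen yet.

```python
from abc import ABC, abstractmethod


class TranslationEngine(ABC):
    """Abstract interface that every future translation engine would implement."""

    @abstractmethod
    def translate(self, markdown: str, source_lang: str, target_lang: str) -> str:
        """Return the translated Markdown with its structure preserved."""


class StubEngine(TranslationEngine):
    """Placeholder engine used while the feature is reserved."""

    def translate(self, markdown: str, source_lang: str, target_lang: str) -> str:
        raise NotImplementedError("Translation is reserved for a future phase")


# Candidate engines from task 9.5; every key maps to the stub for now.
ENGINE_REGISTRY = {
    "argos": StubEngine,
    "ernie": StubEngine,
    "google": StubEngine,
}


def create_engine(engine_type: str) -> TranslationEngine:
    """Look up an engine class by name (case-insensitive) and instantiate it."""
    try:
        engine_class = ENGINE_REGISTRY[engine_type.lower()]
    except KeyError:
        raise ValueError(f"Unknown translation engine: {engine_type}") from None
    return engine_class()
```

The stub endpoint in task 9.2 could translate the `NotImplementedError` into its 501 response, so the HTTP behavior and the service interface stay consistent.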

### 10. Backend - Background Tasks
- [x] 10.1 Implement FastAPI BackgroundTasks for async OCR processing
- [ ] 10.2 Add task queue system (optional: Redis-based queue)
- [x] 10.3 Implement progress updates (polling endpoint)
- [x] 10.4 Add error handling and retry logic
- [x] 10.5 Implement cleanup scheduler for expired files
- [x] 10.6 Add PDF generation to background tasks (slower process)
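
For the error handling and retry logic in task 10.4, a common pattern is a bounded retry with exponential backoff. The sketch below is one such pattern under assumed defaults (three attempts, delays doubling from one second); the task list does not state the policy actually used.

```python
import time


def run_with_retry(task, max_attempts=3, base_delay=1.0, sleep=time.sleep):
    """Call task() until it succeeds or max_attempts is reached.

    Waits base_delay, then 2 * base_delay, and so on between attempts.
    `sleep` is a parameter so tests can avoid real waiting.
    """
    if max_attempts < 1:
        raise ValueError("max_attempts must be at least 1")
    last_error = None
    for attempt in range(1, max_attempts + 1):
        try:
            return task()
        except Exception as exc:  # Broad on purpose: any failure triggers a retry.
            last_error = exc
            if attempt < max_attempts:
                sleep(base_delay * 2 ** (attempt - 1))
    raise RuntimeError(f"task failed after {max_attempts} attempts") from last_error
```

A background OCR job could wrap its per-file call in `run_with_retry` and mark the file as failed only when the final `RuntimeError` is raised.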

## Phase 2: Frontend Development

### 11. Frontend - Project Structure
- [x] 11.1 Set up Vite project with TypeScript support
- [x] 11.2 Configure Tailwind CSS and shadcn/ui
- [x] 11.3 Set up React Router for navigation
- [x] 11.4 Configure Axios with base URL and interceptors
- [x] 11.5 Set up React Query for API state management
- [x] 11.6 Create Zustand store for global state
- [x] 11.7 Set up i18n for Traditional Chinese interface

### 12. Frontend - UI Components (shadcn/ui)
- [x] 12.1 Install and configure shadcn/ui components
- [x] 12.2 Create FileUpload component with drag-and-drop (react-dropzone)
- [x] 12.3 Create ProgressBar component for batch processing
- [x] 12.4 Create ResultsTable component for displaying OCR results
- [x] 12.5 Create MarkdownPreview component for viewing extracted content ⬅️ **Fixed: API schema alignment for filename display**
- [ ] 12.6 Create ExportDialog component for format and rule selection
- [ ] 12.7 Create CSSTemplateSelector component for PDF styling
- [ ] 12.8 Create RuleEditor component for creating custom rules
- [x] 12.9 Create Toast notifications for feedback
- [ ] 12.10 Create TranslationPanel component (DISABLED with "Coming Soon" label)

### 13. Frontend - Pages
- [x] 13.1 Create Login page with JWT authentication
- [x] 13.2 Create Upload page with file selection and batch management ⬅️ **Fixed: Upload response schema alignment**
- [x] 13.3 Create Processing page with real-time progress ⬅️ **Fixed: Error field mapping**
- [x] 13.4 Create Results page with Markdown/JSON preview ⬅️ **Fixed: OCR result detail flattening, null safety**
- [x] 13.5 Create Export page with format options (TXT, JSON, Excel, Markdown, PDF)
- [ ] 13.6 Create PDF Preview page (optional: embedded PDF viewer)
- [x] 13.7 Create Settings page for export rule management
- [x] 13.8 Add translation option placeholder in Results page (disabled state)

### 14. Frontend - API Integration
- [x] 14.1 Create API client service with typed interfaces ⬅️ **Updated: All endpoints verified working**
- [x] 14.2 Implement file upload with progress tracking ⬅️ **Fixed: UploadBatchResponse schema**
- [x] 14.3 Implement OCR task status polling ⬅️ **Fixed: BatchStatusResponse with files array**
- [x] 14.4 Implement results fetching (Markdown + JSON display) ⬅️ **Fixed: OCRResultDetailResponse with flattened structure**
- [x] 14.5 Implement export with file download ⬅️ **Fixed: ExportOptions schema added**
- [x] 14.6 Implement PDF generation request with loading indicator
- [x] 14.7 Implement rule CRUD operations
- [x] 14.8 Implement CSS template selection ⬅️ **Fixed: CSSTemplateResponse with filename field**
- [x] 14.9 Add error handling and user feedback ⬅️ **Fixed: Error field mapping with validation_alias**
- [x] 14.10 Create translation API client (stub, for future use)

## Phase 3: Testing & Optimization

### 15. Testing
- [ ] 15.1 Write backend unit tests (pytest) for all services
- [ ] 15.2 Write backend API integration tests
- [ ] 15.3 Test PaddleOCR-VL with various document types (scanned images, PDFs, mixed content)
- [ ] 15.4 Test layout preservation quality (Markdown structure correctness)
- [ ] 15.5 Test PDF generation with different CSS templates
- [ ] 15.6 Test Chinese font rendering in generated PDFs
- [ ] 15.7 Write frontend component tests (Vitest)
- [ ] 15.8 Perform manual end-to-end testing
- [ ] 15.9 Test with various image formats and languages
- [ ] 15.10 Test batch processing with large file sets (50+ files)
- [ ] 15.11 Test export with different formats and rules
- [x] 15.12 Verify translation UI placeholders are properly disabled

### 16. Documentation
- [ ] 16.1 Write API documentation (FastAPI auto-docs + additional notes)
- [ ] 16.2 Document PaddleOCR-VL model requirements and installation
- [ ] 16.3 Document Pandoc and WeasyPrint setup
- [ ] 16.4 Create CSS template customization guide
- [ ] 16.5 Write user guide for web interface
- [ ] 16.6 Write deployment guide for 1Panel
- [ ] 16.7 Create README.md with setup instructions
- [ ] 16.8 Document export rule syntax and examples
- [ ] 16.9 Document translation feature roadmap and architecture

## Phase 4: Deployment

### 17. Deployment Preparation
- [ ] 17.1 Create backend startup script (start.sh)
- [ ] 17.2 Create frontend build script (build.sh)
- [ ] 17.3 Create Nginx configuration file (static files + reverse proxy)
- [ ] 17.4 Create Supervisor configuration for backend process
- [ ] 17.5 Create environment variable templates (.env.example)
- [ ] 17.6 Create deployment automation script (deploy.sh)
- [ ] 17.7 Prepare CSS templates for production
- [ ] 17.8 Test deployment on staging environment

### 18. Production Deployment (1Panel)
- [ ] 18.1 Set up Conda environment on production server
- [ ] 18.2 Install system dependencies (pandoc, fonts-noto-cjk)
- [ ] 18.3 Install Python dependencies and download PaddleOCR-VL models
- [ ] 18.4 Configure MySQL database connection
- [ ] 18.5 Build frontend static files
- [ ] 18.6 Configure Nginx via 1Panel (static files + reverse proxy)
- [ ] 18.7 Set up Supervisor to manage backend process
- [ ] 18.8 Configure SSL certificate (Let's Encrypt via 1Panel)
- [ ] 18.9 Perform production smoke tests (upload, OCR, export PDF)
- [ ] 18.10 Set up monitoring and logging
- [ ] 18.11 Verify PDF generation works in production environment

## Phase 5: Translation Feature (FUTURE)

### 19. Translation Implementation (Post-Launch)
- [ ] 19.1 Decide on translation engine (Argos offline vs ERNIE API vs Google API)
- [ ] 19.2 Implement chosen translation engine integration
- [ ] 19.3 Implement Markdown translation with structure preservation
- [ ] 19.4 Update POST `/api/v1/translate/document` endpoint (remove 501 status)
- [ ] 19.5 Add translation configuration UI (enable TranslationPanel component)
- [ ] 19.6 Add source/target language selection
- [ ] 19.7 Implement translation progress tracking
- [ ] 19.8 Test translation with various document types
- [ ] 19.9 Optimize translation quality for technical documents
- [ ] 19.10 Update documentation with translation feature guide

## Summary

- **Phase 1 (Core OCR + Layout Preservation)**: Tasks 1-10 (basic OCR + layout-preserved PDF)
- **Phase 2 (Frontend)**: Tasks 11-14 (user interface)
- **Phase 3 (Testing)**: Tasks 15-16 (testing and documentation)
- **Phase 4 (Deployment)**: Tasks 17-18 (deployment)
- **Phase 5 (Translation)**: Task 19 (translation feature, future implementation)

- **Total Tasks**: 169 subtasks across 19 task groups
- **Priority**: Complete Phases 1-4 first; start Phase 5 after production deployment and user feedback