OCR/openspec/changes/add-ocr-batch-processing/tasks.md
beabigegg da700721fa first
2025-11-12 22:53:17 +08:00

Implementation Tasks

Phase 1: Core OCR with Layout Preservation

1. Environment Setup

  • 1.1 Create Conda environment with Python 3.10
  • 1.2 Install backend dependencies (FastAPI, PaddleOCR 3.0+, paddlepaddle, pandas, etc.)
  • 1.3 Install PDF generation tools (weasyprint, markdown, pandoc system package)
  • 1.4 Download PaddleOCR-VL model (~900MB) and language packs
  • 1.5 Set up frontend project with Vite + React + TypeScript
  • 1.6 Install frontend dependencies (Tailwind, shadcn/ui, axios, react-query)
  • 1.7 Configure MySQL database connection
  • 1.8 Install Chinese fonts (Noto Sans CJK) for PDF generation

2. Database Schema

  • 2.1 Create paddle_ocr_users table for JWT authentication (id, username, password_hash, etc.)
  • 2.2 Create paddle_ocr_batches table (id, user_id, status, created_at, completed_at)
  • 2.3 Create paddle_ocr_files table (id, batch_id, filename, file_path, file_size, status, format)
  • 2.4 Create paddle_ocr_results table (id, file_id, markdown_path, json_path, layout_data, confidence)
  • 2.5 Create paddle_ocr_export_rules table (id, user_id, rule_name, config_json, css_template)
  • 2.6 Create paddle_ocr_translation_configs table (RESERVED: id, user_id, source_lang, target_lang, engine_type, engine_config)
  • 2.7 Write database migration scripts (Alembic)
  • 2.8 Add indexes for performance optimization (batch_id, user_id, status)
  • Note: All tables use paddle_ocr_ prefix for namespace isolation
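
As a rough illustration of tasks 2.2, 2.3, and 2.8, the sketch below creates two of the prefixed tables and their indexes. It uses Python's built-in sqlite3 in memory only so that it runs anywhere; the project itself targets MySQL with Alembic migrations, and the column types, defaults, and index names shown here are assumptions rather than the actual schema.

```python
import sqlite3

# Assumption: column types, defaults, and index names are illustrative only.
# The real schema is expected to live in Alembic migrations against MySQL.
DDL = """
CREATE TABLE paddle_ocr_batches (
    id INTEGER PRIMARY KEY,
    user_id INTEGER NOT NULL,
    status TEXT NOT NULL DEFAULT 'pending',
    created_at TEXT NOT NULL DEFAULT CURRENT_TIMESTAMP,
    completed_at TEXT
);
CREATE TABLE paddle_ocr_files (
    id INTEGER PRIMARY KEY,
    batch_id INTEGER NOT NULL REFERENCES paddle_ocr_batches(id),
    filename TEXT NOT NULL,
    file_path TEXT NOT NULL,
    file_size INTEGER NOT NULL,
    status TEXT NOT NULL DEFAULT 'pending',
    format TEXT NOT NULL
);
CREATE INDEX idx_paddle_ocr_batches_user_id ON paddle_ocr_batches(user_id);
CREATE INDEX idx_paddle_ocr_batches_status ON paddle_ocr_batches(status);
CREATE INDEX idx_paddle_ocr_files_batch_id ON paddle_ocr_files(batch_id);
"""


def create_schema(conn: sqlite3.Connection) -> None:
    """Create the two example tables and their indexes."""
    conn.executescript(DDL)


def list_tables(conn: sqlite3.Connection) -> list[str]:
    """Return the names of all tables, sorted alphabetically."""
    rows = conn.execute(
        "SELECT name FROM sqlite_master WHERE type = 'table' ORDER BY name"
    )
    return [row[0] for row in rows]
```

Every table name carries the paddle_ocr_ prefix, so the tables can share a database with other applications without name collisions.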

3. Backend - Document Preprocessing

  • 3.1 Implement document preprocessor class for format standardization
  • 3.2 Add image format validator (PNG, JPG, JPEG)
  • 3.3 Add PDF validator and direct passthrough (PaddleOCR-VL native support)
  • 3.4 Implement Office document to PDF conversion (DOC, DOCX, PPT, PPTX via LibreOffice) ⬅️ Completed via sub-proposal
  • 3.5 Add file corruption detection
  • 3.6 Write unit tests for preprocessor
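
One possible shape for the validators in tasks 3.2 to 3.5 is sketched below. The function name `classify` and the returned labels are hypothetical; the byte signatures are the standard leading bytes of PNG, JPEG, and PDF files. A mismatch between extension and signature is treated as a simple corruption signal.

```python
from pathlib import Path

# Assumption: the extension sets mirror tasks 3.2-3.4; names are illustrative.
OFFICE_EXTS = {".doc", ".docx", ".ppt", ".pptx"}
MAGIC = {
    ".png": b"\x89PNG\r\n\x1a\n",
    ".jpg": b"\xff\xd8\xff",
    ".jpeg": b"\xff\xd8\xff",
    ".pdf": b"%PDF-",
}


def classify(filename: str, head: bytes) -> str:
    """Return 'image', 'pdf', or 'office'; raise ValueError otherwise.

    `head` is the first few bytes of the uploaded file.
    """
    ext = Path(filename).suffix.lower()
    if ext in OFFICE_EXTS:
        # Office files would be converted to PDF before OCR (task 3.4).
        return "office"
    signature = MAGIC.get(ext)
    if signature is None:
        raise ValueError(f"unsupported format: {ext or filename}")
    if not head.startswith(signature):
        raise ValueError(f"{filename}: content does not match {ext}")
    return "pdf" if ext == ".pdf" else "image"
```

Checking only the leading bytes is cheap but shallow; a truncated file with a valid header would still pass, so a fuller corruption check would need to open the file with an image or PDF library.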

4. Backend - Core OCR Service with PaddleOCR-VL

  • 4.1 Implement OCR service class with PaddleOCR-VL initialization
  • 4.2 Configure layout detection (use_layout_detection=True)
  • 4.3 Implement single image/PDF OCR processing
  • 4.4 Parse OCR output to extract Markdown and JSON
  • 4.5 Store Markdown files with preserved layout structure
  • 4.6 Store JSON with detailed bounding boxes and layout metadata
  • 4.7 Add confidence threshold filtering
  • 4.8 Implement batch processing with worker queue (completed via Task 10: BackgroundTasks)
  • 4.9 Add progress tracking for batch jobs (completed via Task 8.4, 8.6: API endpoints)
  • 4.10 Write unit tests for OCR service
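
The confidence threshold filtering in task 4.7 could look roughly like the sketch below. The `TextRegion` shape (text, confidence, bounding box) is a hypothetical stand-in for whatever the parsed OCR JSON actually contains; it is not the PaddleOCR-VL output format.

```python
from typing import TypedDict


class TextRegion(TypedDict):
    # Assumption: a simplified, hypothetical view of one parsed OCR region.
    text: str
    confidence: float
    bbox: list[int]  # [x1, y1, x2, y2]


def filter_by_confidence(
    regions: list[TextRegion], threshold: float = 0.5
) -> list[TextRegion]:
    """Keep regions whose confidence is at or above the threshold."""
    return [r for r in regions if r["confidence"] >= threshold]


def mean_confidence(regions: list[TextRegion]) -> float:
    """Average confidence of the given regions, or 0.0 for an empty list."""
    if not regions:
        return 0.0
    return sum(r["confidence"] for r in regions) / len(regions)
```

The mean could serve as the per-file value stored in the confidence column of paddle_ocr_results, though how that column is computed is not stated in this task list.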

5. Backend - Layout-Preserved PDF Generation

  • 5.1 Create PDF generator service using Pandoc + WeasyPrint
  • 5.2 Implement Markdown to HTML conversion with extensions (tables, code, etc.)
  • 5.3 Create default CSS template for layout preservation
  • 5.4 Create additional CSS templates (academic, business, report)
  • 5.5 Add Chinese font configuration (Noto Sans CJK)
  • 5.6 Implement PDF generation via Pandoc command
  • 5.7 Add fallback: Python WeasyPrint direct generation
  • 5.8 Handle multi-page PDF merging
  • 5.9 Write unit tests for PDF generator
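
For task 5.6, one way to drive Pandoc is to build the command as a list and hand it to `subprocess.run`. The sketch below only builds the command so that it can be inspected or tested without Pandoc installed. The function name and the file paths are hypothetical; `--pdf-engine=weasyprint` and `--css` are existing Pandoc options.

```python
from pathlib import Path
from typing import Optional, Union

PathLike = Union[str, Path]


def build_pandoc_command(
    markdown_path: PathLike,
    pdf_path: PathLike,
    css_path: Optional[PathLike] = None,
) -> list[str]:
    """Build a Pandoc command that renders Markdown to PDF via WeasyPrint."""
    cmd = [
        "pandoc",
        str(markdown_path),
        "-o",
        str(pdf_path),
        "--pdf-engine=weasyprint",
    ]
    if css_path is not None:
        # The CSS template would carry the layout rules and the CJK font stack.
        cmd += ["--css", str(css_path)]
    return cmd
```

A caller would run the result with `subprocess.run(cmd, check=True)` and, if that raises, fall back to direct WeasyPrint generation as described in task 5.7.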

6. Backend - File Management

  • 6.1 Implement file upload validation (type, size, corruption check)
  • 6.2 Create file storage service with temporary directory management
  • 6.3 Add batch upload handler with unique batch_id generation
  • 6.4 Implement file access control and ownership verification
  • 6.5 Add automatic cleanup job for expired files (24-hour retention)
  • 6.6 Store Markdown and JSON outputs in organized directory structure
  • 6.7 Write unit tests for file management
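
The 24-hour retention cleanup in task 6.5 can be sketched with the standard library alone. This version uses each file's modification time as its age, which is an assumption; the real job might instead rely on timestamps stored in the database.

```python
import time
from pathlib import Path
from typing import Optional

RETENTION_SECONDS = 24 * 60 * 60  # 24-hour retention from task 6.5


def find_expired(
    root: str, now: Optional[float] = None, retention: float = RETENTION_SECONDS
) -> list[Path]:
    """Return files under root whose mtime is older than the retention window."""
    now = time.time() if now is None else now
    return sorted(
        p
        for p in Path(root).rglob("*")
        if p.is_file() and now - p.stat().st_mtime > retention
    )


def cleanup_expired(
    root: str, now: Optional[float] = None, retention: float = RETENTION_SECONDS
) -> int:
    """Delete expired files under root and return how many were removed."""
    expired = find_expired(root, now, retention)
    for path in expired:
        path.unlink()
    return len(expired)
```

Passing `now` explicitly makes the function easy to unit test without waiting or patching the clock.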

7. Backend - Export Service

  • 7.1 Implement plain text export from Markdown
  • 7.2 Implement JSON export with full metadata
  • 7.3 Implement Excel export using pandas
  • 7.4 Implement Markdown export (direct from OCR output)
  • 7.5 Implement layout-preserved PDF export (using PDF generator service)
  • 7.6 Add ZIP file creation for batch exports
  • 7.7 Implement rule-based filtering (confidence threshold, filename pattern)
  • 7.8 Implement rule-based formatting (line numbers, sort by position)
  • 7.9 Create export rule CRUD operations (save, load, update, delete)
  • 7.10 Write unit tests for export service
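
Tasks 7.7 and 7.8 describe rule-based filtering and formatting. A minimal sketch follows, assuming a rule is a plain dict and each result row carries a filename, text, confidence, and x/y position; these key names are hypothetical and do not reflect the actual config_json format.

```python
from fnmatch import fnmatchcase


def apply_export_rule(results: list[dict], rule: dict) -> str:
    """Filter and format OCR rows into plain text according to a rule dict."""
    min_conf = rule.get("min_confidence", 0.0)
    pattern = rule.get("filename_pattern", "*")

    # Filtering: confidence threshold and filename pattern (task 7.7).
    kept = [
        r
        for r in results
        if r["confidence"] >= min_conf and fnmatchcase(r["filename"], pattern)
    ]

    # Formatting: optional top-to-bottom, left-to-right ordering (task 7.8).
    if rule.get("sort_by_position"):
        kept.sort(key=lambda r: (r["y"], r["x"]))

    lines = [r["text"] for r in kept]
    if rule.get("line_numbers"):
        lines = [f"{i}: {text}" for i, text in enumerate(lines, start=1)]
    return "\n".join(lines)
```

`fnmatchcase` is used instead of `fnmatch` so that pattern matching behaves the same on every operating system.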

8. Backend - API Endpoints

  • 8.1 POST /api/v1/auth/login - JWT authentication
  • 8.2 POST /api/v1/upload - File upload with validation
  • 8.3 POST /api/v1/ocr/process - Trigger OCR processing (PaddleOCR-VL)
  • 8.4 GET /api/v1/ocr/status/{task_id} - Get task status with progress
  • 8.5 GET /api/v1/ocr/result/{task_id} - Get OCR results (JSON + Markdown)
  • 8.6 GET /api/v1/batch/{batch_id}/status - Get batch status
  • 8.7 POST /api/v1/export - Export results with format and rules
  • 8.8 GET /api/v1/export/pdf/{file_id} - Generate and download layout-preserved PDF
  • 8.9 GET /api/v1/export/rules - List saved export rules
  • 8.10 POST /api/v1/export/rules - Create new export rule
  • 8.11 PUT /api/v1/export/rules/{rule_id} - Update export rule
  • 8.12 DELETE /api/v1/export/rules/{rule_id} - Delete export rule
  • 8.13 GET /api/v1/export/css-templates - List available CSS templates
  • 8.14 Write API integration tests
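
The status endpoints in tasks 8.4 and 8.6 need to turn per-file states into one batch status and a progress value. The sketch below shows one way to do that aggregation; the status names ("pending", "processing", "completed", "failed") and the response keys are assumptions, not the actual API schema.

```python
from collections import Counter


def summarize_batch(file_statuses: list[str]) -> dict:
    """Aggregate per-file statuses into a batch status and progress percent."""
    counts = Counter(file_statuses)
    total = len(file_statuses)
    finished = counts["completed"] + counts["failed"]

    if total == 0:
        status = "pending"
    elif finished == total:
        # A batch counts as failed only if every file in it failed.
        status = "failed" if counts["failed"] == total else "completed"
    elif finished > 0 or counts["processing"] > 0:
        status = "processing"
    else:
        status = "pending"

    return {
        "status": status,
        "progress": 100 * finished // total if total else 0,
        "total": total,
        "completed": counts["completed"],
        "failed": counts["failed"],
    }
```

Counting failed files as finished keeps the progress bar moving to 100% even when some files cannot be processed.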

9. Backend - Translation Architecture (RESERVED)

  • 9.1 Create translation service interface (abstract class)
  • 9.2 Implement stub endpoint POST /api/v1/translate/document (returns 501 Not Implemented)
  • 9.3 Document expected request/response format in OpenAPI spec
  • 9.4 Add paddle_ocr_translation_configs table migrations (completed in Task 2.6)
  • 9.5 Create placeholder for translation engine factory (Argos/ERNIE/Google)
  • 9.6 Write unit tests for translation service interface (optional for stub)
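
The reserved interface and factory from tasks 9.1 and 9.5 might be laid out as below. All class and function names are hypothetical, and only a stub engine is registered, matching the 501 Not Implemented behaviour of the stub endpoint.

```python
from abc import ABC, abstractmethod


class TranslationEngine(ABC):
    """Abstract interface that future engines would implement."""

    @abstractmethod
    def translate(self, markdown: str, source_lang: str, target_lang: str) -> str:
        """Translate Markdown text while preserving its structure."""


class StubEngine(TranslationEngine):
    """Placeholder engine used until a real one is chosen in Phase 5."""

    def translate(self, markdown: str, source_lang: str, target_lang: str) -> str:
        raise NotImplementedError("translation is reserved for a future phase")


# Assumption: real engines (Argos, ERNIE, Google) would be registered here later.
_ENGINES: dict[str, type[TranslationEngine]] = {"stub": StubEngine}


def create_engine(engine_type: str) -> TranslationEngine:
    """Factory that maps an engine_type string to an engine instance."""
    try:
        engine_cls = _ENGINES[engine_type]
    except KeyError:
        raise ValueError(f"unknown translation engine: {engine_type}") from None
    return engine_cls()
```

The stub endpoint could catch `NotImplementedError` and translate it into the 501 response, so the route itself would not need to change when a real engine is added.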

10. Backend - Background Tasks

  • 10.1 Implement FastAPI BackgroundTasks for async OCR processing
  • 10.2 Add task queue system (optional: Redis-based queue)
  • 10.3 Implement progress updates (polling endpoint)
  • 10.4 Add error handling and retry logic
  • 10.5 Implement cleanup scheduler for expired files
  • 10.6 Add PDF generation to background tasks (slower process)

Phase 2: Frontend Development

11. Frontend - Project Structure

  • 11.1 Set up Vite project with TypeScript support
  • 11.2 Configure Tailwind CSS and shadcn/ui
  • 11.3 Set up React Router for navigation
  • 11.4 Configure Axios with base URL and interceptors
  • 11.5 Set up React Query for API state management
  • 11.6 Create Zustand store for global state
  • 11.7 Set up i18n for Traditional Chinese interface

12. Frontend - UI Components (shadcn/ui)

  • 12.1 Install and configure shadcn/ui components
  • 12.2 Create FileUpload component with drag-and-drop (react-dropzone)
  • 12.3 Create ProgressBar component for batch processing
  • 12.4 Create ResultsTable component for displaying OCR results
  • 12.5 Create MarkdownPreview component for viewing extracted content ⬅️ Fixed: API schema alignment for filename display
  • 12.6 Create ExportDialog component for format and rule selection
  • 12.7 Create CSSTemplateSelector component for PDF styling
  • 12.8 Create RuleEditor component for creating custom rules
  • 12.9 Create Toast notifications for feedback
  • 12.10 Create TranslationPanel component (DISABLED with "Coming Soon" label)

13. Frontend - Pages

  • 13.1 Create Login page with JWT authentication
  • 13.2 Create Upload page with file selection and batch management ⬅️ Fixed: Upload response schema alignment
  • 13.3 Create Processing page with real-time progress ⬅️ Fixed: Error field mapping
  • 13.4 Create Results page with Markdown/JSON preview ⬅️ Fixed: OCR result detail flattening, null safety
  • 13.5 Create Export page with format options (TXT, JSON, Excel, Markdown, PDF)
  • 13.6 Create PDF Preview page (optional: embedded PDF viewer)
  • 13.7 Create Settings page for export rule management
  • 13.8 Add translation option placeholder in Results page (disabled state)

14. Frontend - API Integration

  • 14.1 Create API client service with typed interfaces ⬅️ Updated: All endpoints verified working
  • 14.2 Implement file upload with progress tracking ⬅️ Fixed: UploadBatchResponse schema
  • 14.3 Implement OCR task status polling ⬅️ Fixed: BatchStatusResponse with files array
  • 14.4 Implement results fetching (Markdown + JSON display) ⬅️ Fixed: OCRResultDetailResponse with flattened structure
  • 14.5 Implement export with file download ⬅️ Fixed: ExportOptions schema added
  • 14.6 Implement PDF generation request with loading indicator
  • 14.7 Implement rule CRUD operations
  • 14.8 Implement CSS template selection ⬅️ Fixed: CSSTemplateResponse with filename field
  • 14.9 Add error handling and user feedback ⬅️ Fixed: Error field mapping with validation_alias
  • 14.10 Create translation API client (stub, for future use)

Phase 3: Testing & Optimization

15. Testing

  • 15.1 Write backend unit tests (pytest) for all services
  • 15.2 Write backend API integration tests
  • 15.3 Test PaddleOCR-VL with various document types (scanned images, PDFs, mixed content)
  • 15.4 Test layout preservation quality (Markdown structure correctness)
  • 15.5 Test PDF generation with different CSS templates
  • 15.6 Test Chinese font rendering in generated PDFs
  • 15.7 Write frontend component tests (Vitest)
  • 15.8 Perform manual end-to-end testing
  • 15.9 Test with various image formats and languages
  • 15.10 Test batch processing with large file sets (50+ files)
  • 15.11 Test export with different formats and rules
  • 15.12 Verify translation UI placeholders are properly disabled

16. Documentation

  • 16.1 Write API documentation (FastAPI auto-docs + additional notes)
  • 16.2 Document PaddleOCR-VL model requirements and installation
  • 16.3 Document Pandoc and WeasyPrint setup
  • 16.4 Create CSS template customization guide
  • 16.5 Write user guide for web interface
  • 16.6 Write deployment guide for 1Panel
  • 16.7 Create README.md with setup instructions
  • 16.8 Document export rule syntax and examples
  • 16.9 Document translation feature roadmap and architecture

Phase 4: Deployment

17. Deployment Preparation

  • 17.1 Create backend startup script (start.sh)
  • 17.2 Create frontend build script (build.sh)
  • 17.3 Create Nginx configuration file (static files + reverse proxy)
  • 17.4 Create Supervisor configuration for backend process
  • 17.5 Create environment variable templates (.env.example)
  • 17.6 Create deployment automation script (deploy.sh)
  • 17.7 Prepare CSS templates for production
  • 17.8 Test deployment on staging environment

18. Production Deployment (1Panel)

  • 18.1 Set up Conda environment on production server
  • 18.2 Install system dependencies (pandoc, fonts-noto-cjk)
  • 18.3 Install Python dependencies and download PaddleOCR-VL models
  • 18.4 Configure MySQL database connection
  • 18.5 Build frontend static files
  • 18.6 Configure Nginx via 1Panel (static files + reverse proxy)
  • 18.7 Set up Supervisor to manage backend process
  • 18.8 Configure SSL certificate (Let's Encrypt via 1Panel)
  • 18.9 Perform production smoke tests (upload, OCR, export PDF)
  • 18.10 Set up monitoring and logging
  • 18.11 Verify PDF generation works in production environment

Phase 5: Translation Feature (FUTURE)

19. Translation Implementation (Post-Launch)

  • 19.1 Decide on translation engine (Argos offline vs ERNIE API vs Google API)
  • 19.2 Implement chosen translation engine integration
  • 19.3 Implement Markdown translation with structure preservation
  • 19.4 Update POST /api/v1/translate/document endpoint (remove 501 status)
  • 19.5 Add translation configuration UI (enable TranslationPanel component)
  • 19.6 Add source/target language selection
  • 19.7 Implement translation progress tracking
  • 19.8 Test translation with various document types
  • 19.9 Optimize translation quality for technical documents
  • 19.10 Update documentation with translation feature guide

Summary

  • Phase 1 (Core OCR + Layout Preservation): Tasks 1-10 (basic OCR + layout-preserved PDF)
  • Phase 2 (Frontend): Tasks 11-14 (user interface)
  • Phase 3 (Testing): Tasks 15-16 (testing and documentation)
  • Phase 4 (Deployment): Tasks 17-18 (deployment)
  • Phase 5 (Translation): Task 19 (translation feature, future implementation)

Total Tasks: 150+

Priority: Complete Phases 1-4 first; Phase 5 follows production deployment and user feedback.