- Removed all test files and directories
- Deleted outdated documentation (will be rewritten)
- Cleaned up temporary files, logs, and uploads
- Archived 5 completed OpenSpec proposals
- Created new dual-track-document-processing proposal with complete OpenSpec structure
  - Dual-track architecture: OCR track (PaddleOCR) + Direct track (PyMuPDF)
  - UnifiedDocument model for consistent output
  - Support for structure-preserving translation
- Updated .gitignore to prevent future test/temp files

This is a major cleanup preparing for the complete refactoring of the document processing pipeline.

🤖 Generated with [Claude Code](https://claude.com/claude-code)

Co-Authored-By: Claude <noreply@anthropic.com>
Implementation Tasks
Phase 1: Core OCR with Layout Preservation
1. Environment Setup
- 1.1 Create Conda environment with Python 3.10
- 1.2 Install backend dependencies (FastAPI, PaddleOCR 3.0+, paddlepaddle, pandas, etc.)
- 1.3 Install PDF generation tools (weasyprint, markdown, pandoc system package)
- 1.4 Download PaddleOCR-VL model (~900MB) and language packs
- 1.5 Set up frontend project with Vite + React + TypeScript
- 1.6 Install frontend dependencies (Tailwind, shadcn/ui, axios, react-query)
- 1.7 Configure MySQL database connection
- 1.8 Install Chinese fonts (Noto Sans CJK) for PDF generation
2. Database Schema
- 2.1 Create `paddle_ocr_users` table for JWT authentication (id, username, password_hash, etc.)
- 2.2 Create `paddle_ocr_batches` table (id, user_id, status, created_at, completed_at)
- 2.3 Create `paddle_ocr_files` table (id, batch_id, filename, file_path, file_size, status, format)
- 2.4 Create `paddle_ocr_results` table (id, file_id, markdown_path, json_path, layout_data, confidence)
- 2.5 Create `paddle_ocr_export_rules` table (id, user_id, rule_name, config_json, css_template)
- 2.6 Create `paddle_ocr_translation_configs` table (RESERVED: id, user_id, source_lang, target_lang, engine_type, engine_config)
- 2.7 Write database migration scripts (Alembic)
- 2.8 Add indexes for performance optimization (batch_id, user_id, status)
- Note: All tables use the `paddle_ocr_` prefix for namespace isolation
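As a rough illustration of tasks 2.2, 2.3 and 2.8, the sketch below creates two of the prefixed tables and their indexes. It uses Python's built-in `sqlite3` as a stand-in for MySQL so that it runs anywhere; the column types, defaults and index names are assumptions (the task list only names the columns), and the real schema would live in Alembic migrations.

```python
import sqlite3

# Assumed column types and defaults; only the column names come from the task list.
SCHEMA = """
CREATE TABLE paddle_ocr_batches (
    id INTEGER PRIMARY KEY,
    user_id INTEGER NOT NULL,
    status TEXT NOT NULL DEFAULT 'pending',
    created_at TEXT NOT NULL DEFAULT CURRENT_TIMESTAMP,
    completed_at TEXT
);
CREATE TABLE paddle_ocr_files (
    id INTEGER PRIMARY KEY,
    batch_id INTEGER NOT NULL REFERENCES paddle_ocr_batches(id),
    filename TEXT NOT NULL,
    file_path TEXT NOT NULL,
    file_size INTEGER NOT NULL,
    status TEXT NOT NULL DEFAULT 'pending',
    format TEXT
);
-- Indexes on the lookup columns named in task 2.8 (index names are hypothetical).
CREATE INDEX idx_paddle_ocr_batches_user_id ON paddle_ocr_batches(user_id);
CREATE INDEX idx_paddle_ocr_batches_status ON paddle_ocr_batches(status);
CREATE INDEX idx_paddle_ocr_files_batch_id ON paddle_ocr_files(batch_id);
"""


def create_schema(conn: sqlite3.Connection) -> None:
    """Create the example tables and indexes on the given connection."""
    conn.executescript(SCHEMA)


def list_names(conn: sqlite3.Connection, kind: str) -> list:
    """Return the names of all schema objects of one kind ('table' or 'index')."""
    rows = conn.execute(
        "SELECT name FROM sqlite_master WHERE type = ? ORDER BY name", (kind,)
    ).fetchall()
    return [row[0] for row in rows]


conn = sqlite3.connect(":memory:")
create_schema(conn)
```

Keeping the `paddle_ocr_` prefix on index names as well as table names keeps every object created by this project easy to spot in a shared database.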
3. Backend - Document Preprocessing
- 3.1 Implement document preprocessor class for format standardization
- 3.2 Add image format validator (PNG, JPG, JPEG)
- 3.3 Add PDF validator and direct passthrough (PaddleOCR-VL native support)
- 3.4 Implement Office document to PDF conversion (DOC, DOCX, PPT, PPTX via LibreOffice) ⬅️ Completed via sub-proposal
- 3.5 Add file corruption detection
- 3.6 Write unit tests for preprocessor
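The validators in tasks 3.2, 3.3 and 3.5 could start from file signatures rather than file extensions. The following is a minimal sketch using only the standard library; the function names are hypothetical, and the truncation check is a cheap heuristic, not a full integrity check of the file.

```python
from typing import Optional

# Leading "magic" bytes for the formats named in tasks 3.2 and 3.3.
SIGNATURES = {
    "png": b"\x89PNG\r\n\x1a\n",
    "jpeg": b"\xff\xd8\xff",
    "pdf": b"%PDF-",
}


def detect_format(data: bytes) -> Optional[str]:
    """Return 'png', 'jpeg' or 'pdf' based on the leading bytes, else None."""
    for name, magic in SIGNATURES.items():
        if data.startswith(magic):
            return name
    return None


def looks_truncated(data: bytes, fmt: str) -> bool:
    """Heuristic corruption check: is the expected end-of-file marker missing?"""
    if fmt == "png":
        # A PNG ends with the IEND chunk: length, b"IEND", CRC (12 bytes).
        return b"IEND" not in data[-12:]
    if fmt == "jpeg":
        # A JPEG stream normally ends with the EOI marker FF D9.
        return not data.endswith(b"\xff\xd9")
    if fmt == "pdf":
        # PDF readers look for %%EOF near the end of the file.
        return b"%%EOF" not in data[-1024:]
    return True  # unknown formats are treated as invalid
```

An upload handler might call `detect_format` on the first bytes, reject `None`, and then run `looks_truncated` before handing the file to OCR.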
4. Backend - Core OCR Service with PaddleOCR-VL
- 4.1 Implement OCR service class with PaddleOCR-VL initialization
- 4.2 Configure layout detection (use_layout_detection=True)
- 4.3 Implement single image/PDF OCR processing
- 4.4 Parse OCR output to extract Markdown and JSON
- 4.5 Store Markdown files with preserved layout structure
- 4.6 Store JSON with detailed bounding boxes and layout metadata
- 4.7 Add confidence threshold filtering
- 4.8 Implement batch processing with worker queue (completed via Task 10: BackgroundTasks)
- 4.9 Add progress tracking for batch jobs (completed via Task 8.4, 8.6: API endpoints)
- 4.10 Write unit tests for OCR service
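Task 4.7 (confidence threshold filtering) can be sketched independently of the OCR engine. The block structure below, a dict with `text` and `confidence` keys, is an assumption made for illustration and is not the actual PaddleOCR-VL output format.

```python
from typing import Dict, List, Tuple


def filter_by_confidence(
    blocks: List[Dict], threshold: float = 0.5
) -> Tuple[List[Dict], int]:
    """Keep blocks whose confidence meets the threshold.

    Returns the kept blocks and the number of blocks dropped. Blocks
    without a 'confidence' key are treated as confidence 0.0.
    """
    kept = [b for b in blocks if b.get("confidence", 0.0) >= threshold]
    return kept, len(blocks) - len(kept)


def mean_confidence(blocks: List[Dict]) -> float:
    """Average confidence of the given blocks, or 0.0 for an empty list."""
    if not blocks:
        return 0.0
    return sum(b.get("confidence", 0.0) for b in blocks) / len(blocks)


sample_blocks = [
    {"text": "Invoice", "confidence": 0.98},
    {"text": "T0tal", "confidence": 0.41},
    {"text": "2024-01-05", "confidence": 0.87},
]
kept_blocks, dropped_count = filter_by_confidence(sample_blocks, threshold=0.8)
```

Returning the dropped count alongside the kept blocks lets the service record how much text a given threshold discards.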
5. Backend - Layout-Preserved PDF Generation
- 5.1 Create PDF generator service using Pandoc + WeasyPrint
- 5.2 Implement Markdown to HTML conversion with extensions (tables, code, etc.)
- 5.3 Create default CSS template for layout preservation
- 5.4 Create additional CSS templates (academic, business, report)
- 5.5 Add Chinese font configuration (Noto Sans CJK)
- 5.6 Implement PDF generation via Pandoc command
- 5.7 Add fallback: Python WeasyPrint direct generation
- 5.8 Handle multi-page PDF merging
- 5.9 Write unit tests for PDF generator
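Tasks 5.6 and 5.7 describe a primary Pandoc path with a direct WeasyPrint fallback. A minimal sketch of that control flow follows, assuming Pandoc is driven through `subprocess` with WeasyPrint as its PDF engine; the function names and the choice of Markdown extensions are illustrative assumptions, and the fallback branch requires the `markdown` and `weasyprint` packages to be installed.

```python
import shutil
import subprocess
from pathlib import Path
from typing import List


def build_pandoc_command(markdown_path, pdf_path, css_path) -> List[str]:
    """Build a Pandoc command line that renders Markdown to PDF via WeasyPrint."""
    return [
        "pandoc",
        str(markdown_path),
        "--from=markdown",
        "--pdf-engine=weasyprint",
        f"--css={css_path}",
        f"--output={pdf_path}",
    ]


def generate_pdf(markdown_path, pdf_path, css_path) -> str:
    """Generate a PDF and report which path produced it: 'pandoc' or 'weasyprint'."""
    if shutil.which("pandoc"):
        result = subprocess.run(
            build_pandoc_command(markdown_path, pdf_path, css_path),
            capture_output=True,
        )
        if result.returncode == 0:
            return "pandoc"

    # Fallback (task 5.7): convert Markdown to HTML and render it directly.
    # Imported lazily so the Pandoc path works without these packages.
    import markdown
    from weasyprint import CSS, HTML

    text = Path(markdown_path).read_text(encoding="utf-8")
    html = markdown.markdown(text, extensions=["tables", "fenced_code"])
    HTML(string=html).write_pdf(
        str(pdf_path), stylesheets=[CSS(filename=str(css_path))]
    )
    return "weasyprint"
```

Returning the name of the path that succeeded makes it easy to log how often the fallback is used in production.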
6. Backend - File Management
- 6.1 Implement file upload validation (type, size, corruption check)
- 6.2 Create file storage service with temporary directory management
- 6.3 Add batch upload handler with unique batch_id generation
- 6.4 Implement file access control and ownership verification
- 6.5 Add automatic cleanup job for expired files (24-hour retention)
- 6.6 Store Markdown and JSON outputs in organized directory structure
- 6.7 Write unit tests for file management
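The 24-hour retention job from task 6.5 can be approximated by comparing file modification times against a cutoff. This is a sketch only: the function name is hypothetical, and a real job might use the `created_at` column in the database instead of the filesystem timestamp.

```python
import time
from pathlib import Path
from typing import List, Optional

RETENTION_SECONDS = 24 * 60 * 60  # 24-hour retention from task 6.5


def remove_expired_files(
    root, now: Optional[float] = None, retention: float = RETENTION_SECONDS
) -> List[Path]:
    """Delete files under root whose modification time is older than retention.

    Returns the paths that were removed. Directories are left in place.
    """
    now = time.time() if now is None else now
    removed = []
    for path in sorted(Path(root).rglob("*")):
        if path.is_file() and now - path.stat().st_mtime > retention:
            path.unlink()
            removed.append(path)
    return removed
```

Accepting `now` as a parameter keeps the function easy to unit test without waiting for real time to pass.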
7. Backend - Export Service
- 7.1 Implement plain text export from Markdown
- 7.2 Implement JSON export with full metadata
- 7.3 Implement Excel export using pandas
- 7.4 Implement Markdown export (direct from OCR output)
- 7.5 Implement layout-preserved PDF export (using PDF generator service)
- 7.6 Add ZIP file creation for batch exports
- 7.7 Implement rule-based filtering (confidence threshold, filename pattern)
- 7.8 Implement rule-based formatting (line numbers, sort by position)
- 7.9 Create export rule CRUD operations (save, load, update, delete)
- 7.10 Write unit tests for export service
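Tasks 7.7 and 7.8 combine filtering and formatting rules. One possible shape for applying a rule is sketched below; the rule keys (`min_confidence`, `filename_pattern`, `sort_by_position`, `line_numbers`) and the result fields are assumptions for illustration, not the stored `config_json` format.

```python
from fnmatch import fnmatch
from typing import Dict, List


def apply_export_rule(results: List[Dict], rule: Dict) -> List[str]:
    """Filter OCR results by a rule, then format them as text lines."""
    min_conf = rule.get("min_confidence", 0.0)
    pattern = rule.get("filename_pattern", "*")

    # Filtering (task 7.7): confidence threshold and filename pattern.
    selected = [
        r for r in results
        if r["confidence"] >= min_conf and fnmatch(r["filename"], pattern)
    ]

    # Formatting (task 7.8): reading order, then optional line numbers.
    if rule.get("sort_by_position"):
        selected.sort(key=lambda r: (r["top"], r["left"]))
    lines = [r["text"] for r in selected]
    if rule.get("line_numbers"):
        lines = [f"{i}: {text}" for i, text in enumerate(lines, start=1)]
    return lines


sample_results = [
    {"filename": "scan_01.png", "text": "Total", "confidence": 0.95, "top": 200, "left": 10},
    {"filename": "scan_01.png", "text": "Invoice", "confidence": 0.99, "top": 20, "left": 10},
    {"filename": "scan_01.png", "text": "smudge", "confidence": 0.30, "top": 90, "left": 10},
    {"filename": "notes.jpg", "text": "Memo", "confidence": 0.97, "top": 5, "left": 5},
]
sample_rule = {
    "min_confidence": 0.8,
    "filename_pattern": "scan_*",
    "sort_by_position": True,
    "line_numbers": True,
}
exported_lines = apply_export_rule(sample_results, sample_rule)
```

Because the rule is a plain dict, it can be serialized to JSON and stored per user without further conversion.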
8. Backend - API Endpoints
- 8.1 POST `/api/v1/auth/login` - JWT authentication
- 8.2 POST `/api/v1/upload` - File upload with validation
- 8.3 POST `/api/v1/ocr/process` - Trigger OCR processing (PaddleOCR-VL)
- 8.4 GET `/api/v1/ocr/status/{task_id}` - Get task status with progress
- 8.5 GET `/api/v1/ocr/result/{task_id}` - Get OCR results (JSON + Markdown)
- 8.6 GET `/api/v1/batch/{batch_id}/status` - Get batch status
- 8.7 POST `/api/v1/export` - Export results with format and rules
- 8.8 GET `/api/v1/export/pdf/{file_id}` - Generate and download layout-preserved PDF
- 8.9 GET `/api/v1/export/rules` - List saved export rules
- 8.10 POST `/api/v1/export/rules` - Create new export rule
- 8.11 PUT `/api/v1/export/rules/{rule_id}` - Update export rule
- 8.12 DELETE `/api/v1/export/rules/{rule_id}` - Delete export rule
- 8.13 GET `/api/v1/export/css-templates` - List available CSS templates
- 8.14 Write API integration tests
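The batch status endpoint in task 8.6 has to reduce per-file states to one batch-level answer. The sketch below shows one way to compute that payload in plain Python; the status values (`pending`, `processing`, `completed`, `failed`) and the response keys are assumptions, not the project's actual response schema.

```python
from collections import Counter
from typing import Dict, List


def summarize_batch(batch_id: int, file_statuses: List[str]) -> Dict:
    """Aggregate per-file statuses into a batch-level status and progress."""
    counts = Counter(file_statuses)
    total = len(file_statuses)
    finished = counts["completed"] + counts["failed"]

    if total == 0:
        status = "pending"
    elif finished == total:
        # Every file is done; the batch fails only if no file succeeded.
        status = "failed" if counts["failed"] == total else "completed"
    elif finished or counts["processing"]:
        status = "processing"
    else:
        status = "pending"

    return {
        "batch_id": batch_id,
        "status": status,
        "total_files": total,
        "completed_files": counts["completed"],
        "failed_files": counts["failed"],
        "progress_percent": 100 * finished // total if total else 0,
    }
```

Counting failed files as finished keeps the progress figure moving to 100 even when some files cannot be processed.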
9. Backend - Translation Architecture (RESERVED)
- 9.1 Create translation service interface (abstract class)
- 9.2 Implement stub endpoint POST `/api/v1/translate/document` (returns 501 Not Implemented)
- 9.3 Document expected request/response format in OpenAPI spec
- 9.4 Add translation_configs table migrations (completed in Task 2.6)
- 9.5 Create placeholder for translation engine factory (Argos/ERNIE/Google)
- 9.6 Write unit tests for translation service interface (optional for stub)
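The reserved translation architecture (tasks 9.1, 9.2 and 9.5) amounts to an abstract interface, a factory placeholder and a stub response. A framework-free sketch follows; the class and function names are hypothetical, and the stub returns a `(status_code, body)` pair in place of a real FastAPI response.

```python
from abc import ABC, abstractmethod
from typing import Dict, Tuple

# Engine candidates named in task 9.5.
SUPPORTED_ENGINES = ("argos", "ernie", "google")


class TranslationEngine(ABC):
    """Interface every future translation engine would implement."""

    @abstractmethod
    def translate(self, text: str, source_lang: str, target_lang: str) -> str:
        """Translate text from source_lang to target_lang."""


def create_engine(engine_type: str) -> TranslationEngine:
    """Factory placeholder: validates the name but builds nothing yet."""
    if engine_type not in SUPPORTED_ENGINES:
        raise ValueError(f"Unknown translation engine: {engine_type!r}")
    raise NotImplementedError(f"{engine_type} engine is reserved for a later phase")


def translate_document_stub(payload: Dict) -> Tuple[int, Dict]:
    """Return what the stub endpoint might send: 501 and a short message."""
    return 501, {"detail": "Translation is not implemented yet"}
```

Separating the unknown-engine error from the not-yet-implemented error lets callers tell a bad request apart from a feature that is simply not available yet.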
10. Backend - Background Tasks
- 10.1 Implement FastAPI BackgroundTasks for async OCR processing
- 10.2 Add task queue system (optional: Redis-based queue)
- 10.3 Implement progress updates (polling endpoint)
- 10.4 Add error handling and retry logic
- 10.5 Implement cleanup scheduler for expired files
- 10.6 Add PDF generation to background tasks (slower process)
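The retry logic in task 10.4 is commonly implemented with exponential backoff. A minimal sketch is shown below; the helper name, attempt count and delays are assumptions, and a real background task would likely retry only on specific exception types.

```python
import time
from typing import Callable, TypeVar

T = TypeVar("T")


def run_with_retry(
    task: Callable[[], T],
    max_attempts: int = 3,
    base_delay: float = 1.0,
    sleep: Callable[[float], None] = time.sleep,
) -> T:
    """Run task, retrying on any exception with exponential backoff.

    Waits base_delay, then 2 * base_delay, and so on between attempts.
    Re-raises the last exception once max_attempts is reached.
    """
    for attempt in range(1, max_attempts + 1):
        try:
            return task()
        except Exception:
            if attempt == max_attempts:
                raise
            sleep(base_delay * 2 ** (attempt - 1))
    raise ValueError("max_attempts must be at least 1")
```

Passing `sleep` as a parameter allows tests to record the delays instead of actually waiting.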
Phase 2: Frontend Development
11. Frontend - Project Structure
- 11.1 Set up Vite project with TypeScript support
- 11.2 Configure Tailwind CSS and shadcn/ui
- 11.3 Set up React Router for navigation
- 11.4 Configure Axios with base URL and interceptors
- 11.5 Set up React Query for API state management
- 11.6 Create Zustand store for global state
- 11.7 Set up i18n for Traditional Chinese interface
12. Frontend - UI Components (shadcn/ui)
- 12.1 Install and configure shadcn/ui components
- 12.2 Create FileUpload component with drag-and-drop (react-dropzone)
- 12.3 Create ProgressBar component for batch processing
- 12.4 Create ResultsTable component for displaying OCR results
- 12.5 Create MarkdownPreview component for viewing extracted content ⬅️ Fixed: API schema alignment for filename display
- 12.6 Create ExportDialog component for format and rule selection
- 12.7 Create CSSTemplateSelector component for PDF styling
- 12.8 Create RuleEditor component for creating custom rules
- 12.9 Create Toast notifications for feedback
- 12.10 Create TranslationPanel component (DISABLED with "Coming Soon" label)
13. Frontend - Pages
- 13.1 Create Login page with JWT authentication
- 13.2 Create Upload page with file selection and batch management ⬅️ Fixed: Upload response schema alignment
- 13.3 Create Processing page with real-time progress ⬅️ Fixed: Error field mapping
- 13.4 Create Results page with Markdown/JSON preview ⬅️ Fixed: OCR result detail flattening, null safety
- 13.5 Create Export page with format options (TXT, JSON, Excel, Markdown, PDF)
- 13.6 Create PDF Preview page (optional: embedded PDF viewer)
- 13.7 Create Settings page for export rule management
- 13.8 Add translation option placeholder in Results page (disabled state)
14. Frontend - API Integration
- 14.1 Create API client service with typed interfaces ⬅️ Updated: All endpoints verified working
- 14.2 Implement file upload with progress tracking ⬅️ Fixed: UploadBatchResponse schema
- 14.3 Implement OCR task status polling ⬅️ Fixed: BatchStatusResponse with files array
- 14.4 Implement results fetching (Markdown + JSON display) ⬅️ Fixed: OCRResultDetailResponse with flattened structure
- 14.5 Implement export with file download ⬅️ Fixed: ExportOptions schema added
- 14.6 Implement PDF generation request with loading indicator
- 14.7 Implement rule CRUD operations
- 14.8 Implement CSS template selection ⬅️ Fixed: CSSTemplateResponse with filename field
- 14.9 Add error handling and user feedback ⬅️ Fixed: Error field mapping with validation_alias
- 14.10 Create translation API client (stub, for future use)
Phase 3: Testing & Optimization
15. Testing
- 15.1 Write backend unit tests (pytest) for all services
- 15.2 Write backend API integration tests
- 15.3 Test PaddleOCR-VL with various document types (scanned images, PDFs, mixed content)
- 15.4 Test layout preservation quality (Markdown structure correctness)
- 15.5 Test PDF generation with different CSS templates
- 15.6 Test Chinese font rendering in generated PDFs
- 15.7 Write frontend component tests (Vitest)
- 15.8 Perform manual end-to-end testing
- 15.9 Test with various image formats and languages
- 15.10 Test batch processing with large file sets (50+ files)
- 15.11 Test export with different formats and rules
- 15.12 Verify translation UI placeholders are properly disabled
16. Documentation
- 16.1 Write API documentation (FastAPI auto-docs + additional notes)
- 16.2 Document PaddleOCR-VL model requirements and installation
- 16.3 Document Pandoc and WeasyPrint setup
- 16.4 Create CSS template customization guide
- 16.5 Write user guide for web interface
- 16.6 Write deployment guide for 1Panel
- 16.7 Create README.md with setup instructions
- 16.8 Document export rule syntax and examples
- 16.9 Document translation feature roadmap and architecture
Phase 4: Deployment
17. Deployment Preparation
- 17.1 Create backend startup script (start.sh)
- 17.2 Create frontend build script (build.sh)
- 17.3 Create Nginx configuration file (static files + reverse proxy)
- 17.4 Create Supervisor configuration for backend process
- 17.5 Create environment variable templates (.env.example)
- 17.6 Create deployment automation script (deploy.sh)
- 17.7 Prepare CSS templates for production
- 17.8 Test deployment on staging environment
18. Production Deployment (1Panel)
- 18.1 Set up Conda environment on production server
- 18.2 Install system dependencies (pandoc, fonts-noto-cjk)
- 18.3 Install Python dependencies and download PaddleOCR-VL models
- 18.4 Configure MySQL database connection
- 18.5 Build frontend static files
- 18.6 Configure Nginx via 1Panel (static files + reverse proxy)
- 18.7 Set up Supervisor to manage backend process
- 18.8 Configure SSL certificate (Let's Encrypt via 1Panel)
- 18.9 Perform production smoke tests (upload, OCR, export PDF)
- 18.10 Set up monitoring and logging
- 18.11 Verify PDF generation works in production environment
Phase 5: Translation Feature (FUTURE)
19. Translation Implementation (Post-Launch)
- 19.1 Decide on translation engine (Argos offline vs ERNIE API vs Google API)
- 19.2 Implement chosen translation engine integration
- 19.3 Implement Markdown translation with structure preservation
- 19.4 Update POST `/api/v1/translate/document` endpoint (remove 501 status)
- 19.5 Add translation configuration UI (enable TranslationPanel component)
- 19.6 Add source/target language selection
- 19.7 Implement translation progress tracking
- 19.8 Test translation with various document types
- 19.9 Optimize translation quality for technical documents
- 19.10 Update documentation with translation feature guide
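Task 19.3 (Markdown translation with structure preservation) could work line by line, translating only the text and leaving the Markdown markers untouched. The sketch below takes the translation function as a parameter, so no engine is assumed; it handles headings, list markers, blockquotes and fenced code only, while tables and inline code spans would need extra handling.

```python
import re
from typing import Callable

# Optional leading Markdown marker (heading, bullet, numbered item, quote),
# followed by the text that should be translated.
MARKER = re.compile(r"^(\s*(?:#{1,6}\s+|[-*+]\s+|\d+\.\s+|>\s*)?)(.*)$")


def translate_markdown(text: str, translate: Callable[[str], str]) -> str:
    """Translate Markdown line by line, keeping markers and code fences intact."""
    out = []
    in_fence = False
    for line in text.split("\n"):
        if line.lstrip().startswith("```"):
            # Fence lines toggle code mode and are copied unchanged.
            in_fence = not in_fence
            out.append(line)
        elif in_fence or not line.strip():
            # Code and blank lines are never sent to the translator.
            out.append(line)
        else:
            prefix, body = MARKER.match(line).groups()
            out.append(prefix + (translate(body) if body else ""))
    return "\n".join(out)


sample_markdown = "# Title\n\n- item one\n```\ncode stays\n```\nplain"
# str.upper stands in for a real translation engine in this example.
translated = translate_markdown(sample_markdown, str.upper)
```

Translating one line at a time loses cross-line context, so a production version might group lines into paragraphs before calling the engine.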
Summary
- Phase 1 (Core OCR + Layout Preservation): Tasks 1-10 (basic OCR + layout-preserved PDF)
- Phase 2 (Frontend): Tasks 11-14 (user interface)
- Phase 3 (Testing): Tasks 15-16 (testing and documentation)
- Phase 4 (Deployment): Tasks 17-18 (deployment)
- Phase 5 (Translation): Task 19 (translation feature, future implementation)

Total Tasks: 150+

Priority: Complete Phases 1-4 first; Phase 5 follows production deployment and user feedback