egg cd3cbea49d chore: project cleanup and prepare for dual-track processing refactor

- Removed all test files and directories
- Deleted outdated documentation (will be rewritten)
- Cleaned up temporary files, logs, and uploads
- Archived 5 completed OpenSpec proposals
- Created new dual-track-document-processing proposal with complete OpenSpec structure
  - Dual-track architecture: OCR track (PaddleOCR) + Direct track (PyMuPDF)
  - UnifiedDocument model for consistent output
  - Support for structure-preserving translation
- Updated .gitignore to prevent future test/temp files

This is a major cleanup preparing for the complete refactoring of the document processing pipeline.

🤖 Generated with [Claude Code](https://claude.com/claude-code)

Co-Authored-By: Claude <noreply@anthropic.com>

2025-11-18 20:02:31 +08:00


Implementation Tasks

Phase 1: Core OCR with Layout Preservation

1. Environment Setup

1.1 Create Conda environment with Python 3.10
1.2 Install backend dependencies (FastAPI, PaddleOCR 3.0+, paddlepaddle, pandas, etc.)
1.3 Install PDF generation tools (weasyprint, markdown, pandoc system package)
1.4 Download PaddleOCR-VL model (~900MB) and language packs
1.5 Setup frontend project with Vite + React + TypeScript
1.6 Install frontend dependencies (Tailwind, shadcn/ui, axios, react-query)
1.7 Configure MySQL database connection
1.8 Install Chinese fonts (Noto Sans CJK) for PDF generation

2. Database Schema

2.1 Create paddle_ocr_users table for JWT authentication (id, username, password_hash, etc.)
2.2 Create paddle_ocr_batches table (id, user_id, status, created_at, completed_at)
2.3 Create paddle_ocr_files table (id, batch_id, filename, file_path, file_size, status, format)
2.4 Create paddle_ocr_results table (id, file_id, markdown_path, json_path, layout_data, confidence)
2.5 Create paddle_ocr_export_rules table (id, user_id, rule_name, config_json, css_template)
2.6 Create paddle_ocr_translation_configs table (RESERVED: id, user_id, source_lang, target_lang, engine_type, engine_config)
2.7 Write database migration scripts (Alembic)
2.8 Add indexes for performance optimization (batch_id, user_id, status)
Note: All tables use paddle_ocr_ prefix for namespace isolation
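
As a rough illustration of tasks 2.2, 2.3, and 2.8, the sketch below creates two of the prefixed tables with their indexes. It uses Python's built-in sqlite3 as a stand-in for MySQL so that it runs anywhere. The column types, the 'pending' default, and the index names are assumptions; the task list only names the columns, and the real schema would live in the Alembic migrations from task 2.7.

```python
import sqlite3

# Assumed column types and defaults; only the column names come from the task list.
SCHEMA = """
CREATE TABLE paddle_ocr_batches (
    id INTEGER PRIMARY KEY,
    user_id INTEGER NOT NULL,
    status TEXT NOT NULL DEFAULT 'pending',
    created_at TEXT NOT NULL DEFAULT CURRENT_TIMESTAMP,
    completed_at TEXT
);
CREATE TABLE paddle_ocr_files (
    id INTEGER PRIMARY KEY,
    batch_id INTEGER NOT NULL REFERENCES paddle_ocr_batches(id),
    filename TEXT NOT NULL,
    file_path TEXT NOT NULL,
    file_size INTEGER NOT NULL,
    status TEXT NOT NULL DEFAULT 'pending',
    format TEXT
);
-- Indexes on the lookup columns named in task 2.8 (index names are hypothetical).
CREATE INDEX idx_paddle_ocr_batches_user_id ON paddle_ocr_batches(user_id);
CREATE INDEX idx_paddle_ocr_batches_status ON paddle_ocr_batches(status);
CREATE INDEX idx_paddle_ocr_files_batch_id ON paddle_ocr_files(batch_id);
CREATE INDEX idx_paddle_ocr_files_status ON paddle_ocr_files(status);
"""


def create_schema(conn):
    """Create the sketch tables and indexes on an open connection."""
    conn.executescript(SCHEMA)


def list_tables(conn):
    """Return the names of all tables, sorted alphabetically."""
    rows = conn.execute(
        "SELECT name FROM sqlite_master WHERE type = 'table' ORDER BY name"
    ).fetchall()
    return [row[0] for row in rows]


conn = sqlite3.connect(":memory:")
create_schema(conn)
```

Every table name shares the paddle_ocr_ prefix, so the tables can coexist with other applications in one MySQL database.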

3. Backend - Document Preprocessing

3.1 Implement document preprocessor class for format standardization
3.2 Add image format validator (PNG, JPG, JPEG)
3.3 Add PDF validator and direct passthrough (PaddleOCR-VL native support)
3.4 Implement Office document to PDF conversion (DOC, DOCX, PPT, PPTX via LibreOffice) ⬅️ Completed via sub-proposal
3.5 Add file corruption detection
3.6 Write unit tests for preprocessor
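
One possible shape for the validators in tasks 3.2 and 3.3 is a check of the file's leading "magic" bytes, sketched below. The function names are hypothetical. A header check only catches mislabeled or empty files; it is not a full corruption check (task 3.5), which would need to actually decode the file.

```python
# Leading signature bytes for the formats named in tasks 3.2 and 3.3.
MAGIC_BYTES = {
    "png": b"\x89PNG\r\n\x1a\n",
    "jpeg": b"\xff\xd8\xff",  # covers both .jpg and .jpeg
    "pdf": b"%PDF-",
}


def detect_format(header):
    """Return 'png', 'jpeg', or 'pdf' for a recognized header, else None."""
    for name, magic in MAGIC_BYTES.items():
        if header.startswith(magic):
            return name
    return None


def validate_upload(path):
    """Read the first bytes of a file and reject unsupported content."""
    with open(path, "rb") as fh:
        header = fh.read(16)
    fmt = detect_format(header)
    if fmt is None:
        raise ValueError(f"unsupported or unreadable file: {path}")
    return fmt
```

Checking content rather than the file extension means a renamed file is classified by what it actually contains.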

4. Backend - Core OCR Service with PaddleOCR-VL

4.1 Implement OCR service class with PaddleOCR-VL initialization
4.2 Configure layout detection (use_layout_detection=True)
4.3 Implement single image/PDF OCR processing
4.4 Parse OCR output to extract Markdown and JSON
4.5 Store Markdown files with preserved layout structure
4.6 Store JSON with detailed bounding boxes and layout metadata
4.7 Add confidence threshold filtering
4.8 Implement batch processing with worker queue (completed via Task 10: BackgroundTasks)
4.9 Add progress tracking for batch jobs (completed via Task 8.4, 8.6: API endpoints)
4.10 Write unit tests for OCR service
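
The confidence threshold filtering in task 4.7 could look roughly like the sketch below. The region structure (a dict with "text", "confidence", and "bbox" keys) is a hypothetical intermediate format for illustration, not the PaddleOCR-VL output format, and the default threshold of 0.5 is an assumption.

```python
def filter_by_confidence(regions, threshold=0.5):
    """Keep only regions whose confidence is at or above the threshold.

    `regions` is a list of dicts in an assumed intermediate format:
    {"text": str, "confidence": float in [0, 1], "bbox": [x0, y0, x1, y1]}.
    """
    if not 0.0 <= threshold <= 1.0:
        raise ValueError("threshold must be between 0 and 1")
    return [region for region in regions if region["confidence"] >= threshold]


sample_regions = [
    {"text": "Invoice", "confidence": 0.98, "bbox": [10, 10, 120, 40]},
    {"text": "l1l|", "confidence": 0.21, "bbox": [10, 50, 60, 70]},
    {"text": "Total", "confidence": 0.50, "bbox": [10, 80, 90, 110]},
]
kept = filter_by_confidence(sample_regions, threshold=0.5)
```

The comparison is inclusive, so a region sitting exactly on the threshold is kept.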

5. Backend - Layout-Preserved PDF Generation

5.1 Create PDF generator service using Pandoc + WeasyPrint
5.2 Implement Markdown to HTML conversion with extensions (tables, code, etc.)
5.3 Create default CSS template for layout preservation
5.4 Create additional CSS templates (academic, business, report)
5.5 Add Chinese font configuration (Noto Sans CJK)
5.6 Implement PDF generation via Pandoc command
5.7 Add fallback: Python WeasyPrint direct generation
5.8 Handle multi-page PDF merging
5.9 Write unit tests for PDF generator
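
Task 5.6 could be approached by shelling out to Pandoc with WeasyPrint as the PDF engine, as in the minimal sketch below. The function names and the 120-second timeout are assumptions. The command is built separately from its execution so that it can be unit tested without Pandoc installed.

```python
import shutil
import subprocess
from pathlib import Path


def build_pandoc_command(markdown_path, pdf_path, css_path=None):
    """Build the argument list for a Markdown-to-PDF conversion."""
    cmd = [
        "pandoc",
        str(markdown_path),
        "--from=markdown",
        "--pdf-engine=weasyprint",  # render through HTML + CSS
        f"--output={pdf_path}",
    ]
    if css_path is not None:
        cmd.append(f"--css={css_path}")  # e.g. one of the templates from task 5.4
    return cmd


def generate_pdf(markdown_path, pdf_path, css_path=None, timeout=120):
    """Run Pandoc; raises CalledProcessError if the conversion fails."""
    if shutil.which("pandoc") is None:
        raise RuntimeError("pandoc not found on PATH")
    subprocess.run(
        build_pandoc_command(markdown_path, pdf_path, css_path),
        check=True,
        timeout=timeout,
    )
    return Path(pdf_path)
```

A caller could catch the RuntimeError or CalledProcessError and fall back to direct WeasyPrint generation, which is the path task 5.7 describes.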

6. Backend - File Management

6.1 Implement file upload validation (type, size, corruption check)
6.2 Create file storage service with temporary directory management
6.3 Add batch upload handler with unique batch_id generation
6.4 Implement file access control and ownership verification
6.5 Add automatic cleanup job for expired files (24-hour retention)
6.6 Store Markdown and JSON outputs in organized directory structure
6.7 Write unit tests for file management
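
The 24-hour retention rule in task 6.5 might be implemented along the lines of the sketch below, which deletes files by modification time. The function name is hypothetical, and using mtime (rather than an expiry timestamp stored in the database) is an assumption made to keep the example self-contained.

```python
import time
from pathlib import Path

RETENTION_SECONDS = 24 * 60 * 60  # 24-hour retention from task 6.5


def cleanup_expired(root, now=None, retention=RETENTION_SECONDS):
    """Delete files under `root` older than `retention` seconds.

    Returns the list of removed paths. `now` can be injected for testing.
    """
    now = time.time() if now is None else now
    removed = []
    # Materialize the listing first so deletions do not disturb the walk.
    for path in sorted(Path(root).rglob("*")):
        if path.is_file() and now - path.stat().st_mtime > retention:
            path.unlink()
            removed.append(path)
    return removed
```

Empty batch directories are left in place here; a production job would probably remove them as well and update the matching database rows.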

7. Backend - Export Service

7.1 Implement plain text export from Markdown
7.2 Implement JSON export with full metadata
7.3 Implement Excel export using pandas
7.4 Implement Markdown export (direct from OCR output)
7.5 Implement layout-preserved PDF export (using PDF generator service)
7.6 Add ZIP file creation for batch exports
7.7 Implement rule-based filtering (confidence threshold, filename pattern)
7.8 Implement rule-based formatting (line numbers, sort by position)
7.9 Create export rule CRUD operations (save, load, update, delete)
7.10 Write unit tests for export service
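
Tasks 7.7 and 7.8 describe rule-based filtering and formatting. A minimal sketch of how the four named rules could combine is shown below. The result fields ("filename", "text", "confidence", "x", "y") and the function signature are assumptions for illustration, not the stored rule format from the paddle_ocr_export_rules table.

```python
from fnmatch import fnmatchcase


def apply_export_rules(results, min_confidence=0.0, filename_pattern="*",
                       sort_by_position=False, line_numbers=False):
    """Filter OCR results, then format them as plain text.

    Each result is an assumed dict with keys:
    filename, text, confidence, x, y (top-left corner of the text region).
    """
    # Filtering rules: confidence threshold and filename glob pattern.
    selected = [
        r for r in results
        if r["confidence"] >= min_confidence
        and fnmatchcase(r["filename"], filename_pattern)
    ]
    # Formatting rules: reading order (top to bottom, then left to right).
    if sort_by_position:
        selected.sort(key=lambda r: (r["y"], r["x"]))
    lines = [r["text"] for r in selected]
    if line_numbers:
        lines = [f"{i}: {text}" for i, text in enumerate(lines, start=1)]
    return "\n".join(lines)


sample_results = [
    {"filename": "invoice_01.png", "text": "Total: 42", "confidence": 0.95, "x": 10, "y": 200},
    {"filename": "invoice_01.png", "text": "Invoice", "confidence": 0.99, "x": 10, "y": 20},
    {"filename": "invoice_01.png", "text": "smudge", "confidence": 0.30, "x": 50, "y": 100},
    {"filename": "notes.jpg", "text": "Memo", "confidence": 0.90, "x": 0, "y": 0},
]
exported = apply_export_rules(
    sample_results,
    min_confidence=0.8,
    filename_pattern="invoice_*",
    sort_by_position=True,
    line_numbers=True,
)
```

Filtering runs before formatting so that line numbers count only the lines that are actually exported.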

8. Backend - API Endpoints

8.1 POST /api/v1/auth/login - JWT authentication
8.2 POST /api/v1/upload - File upload with validation
8.3 POST /api/v1/ocr/process - Trigger OCR processing (PaddleOCR-VL)
8.4 GET /api/v1/ocr/status/{task_id} - Get task status with progress
8.5 GET /api/v1/ocr/result/{task_id} - Get OCR results (JSON + Markdown)
8.6 GET /api/v1/batch/{batch_id}/status - Get batch status
8.7 POST /api/v1/export - Export results with format and rules
8.8 GET /api/v1/export/pdf/{file_id} - Generate and download layout-preserved PDF
8.9 GET /api/v1/export/rules - List saved export rules
8.10 POST /api/v1/export/rules - Create new export rule
8.11 PUT /api/v1/export/rules/{rule_id} - Update export rule
8.12 DELETE /api/v1/export/rules/{rule_id} - Delete export rule
8.13 GET /api/v1/export/css-templates - List available CSS templates
8.14 Write API integration tests
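
The batch status endpoint in task 8.6 has to aggregate per-file states into one response. The sketch below shows only that aggregation, without any FastAPI wiring. The status names ("pending", "processing", "completed", "failed") and the response keys are assumptions; the actual BatchStatusResponse schema may differ.

```python
from collections import Counter

# Assumed terminal states; a file in one of these needs no further work.
TERMINAL_STATES = ("completed", "failed")


def summarize_batch(file_statuses):
    """Collapse a list of per-file status strings into a batch summary dict."""
    counts = Counter(file_statuses)
    total = len(file_statuses)
    done = sum(counts[state] for state in TERMINAL_STATES)

    if total == 0:
        status = "pending"
    elif done < total:
        # Work has started if anything finished or is currently running.
        status = "processing" if done or counts["processing"] else "pending"
    elif counts["failed"] == total:
        status = "failed"
    else:
        status = "completed"

    return {
        "status": status,
        "total": total,
        "completed": counts["completed"],
        "failed": counts["failed"],
        "progress_percent": round(100 * done / total, 1) if total else 0.0,
    }
```

Failed files count toward progress in this sketch, so the frontend's polling loop terminates even when some files cannot be processed.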

9. Backend - Translation Architecture (RESERVED)

9.1 Create translation service interface (abstract class)
9.2 Implement stub endpoint POST /api/v1/translate/document (returns 501 Not Implemented)
9.3 Document expected request/response format in OpenAPI spec
9.4 Add translation_configs table migrations (completed in Task 2.6)
9.5 Create placeholder for translation engine factory (Argos/ERNIE/Google)
9.6 Write unit tests for translation service interface (optional for stub)
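
The reserved interface and factory placeholder from tasks 9.1 and 9.5 could be sketched as below. All class, function, and registry names are hypothetical. The registry is intentionally empty, so every lookup fails with an error that the stub endpoint from task 9.2 could map to a 501 response.

```python
from abc import ABC, abstractmethod


class TranslationNotAvailable(Exception):
    """Raised when no engine is registered; the API layer could map this to 501."""


class TranslationEngine(ABC):
    """Interface that a future Argos, ERNIE, or Google engine would implement."""

    @abstractmethod
    def translate(self, text, source_lang, target_lang):
        """Return `text` translated from source_lang to target_lang."""


# Engine name -> engine class. Empty until Phase 5 registers real engines.
ENGINE_REGISTRY = {}


def create_engine(engine_type):
    """Instantiate the engine registered under `engine_type`."""
    try:
        engine_cls = ENGINE_REGISTRY[engine_type]
    except KeyError:
        raise TranslationNotAvailable(
            f"translation engine '{engine_type}' is not implemented"
        ) from None
    return engine_cls()
```

Keeping the registry separate from the interface means a later engine can be added by registering one class, without touching the endpoint code.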

10. Backend - Background Tasks

10.1 Implement FastAPI BackgroundTasks for async OCR processing
10.2 Add task queue system (optional: Redis-based queue)
10.3 Implement progress updates (polling endpoint)
10.4 Add error handling and retry logic
10.5 Implement cleanup scheduler for expired files
10.6 Add PDF generation to background tasks (slower process)
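
For the retry logic in task 10.4, one common approach is a bounded retry with exponential backoff, sketched below. The function name, the attempt limit, and the delay values are assumptions; the task list does not specify a retry policy.

```python
import time


def run_with_retry(task, max_attempts=3, base_delay=1.0, sleep=time.sleep):
    """Call `task()` until it succeeds or `max_attempts` is reached.

    Waits base_delay, then 2x, 4x, ... seconds between attempts.
    The last exception is re-raised if every attempt fails.
    `sleep` is injectable so tests do not have to wait.
    """
    if max_attempts < 1:
        raise ValueError("max_attempts must be at least 1")
    for attempt in range(1, max_attempts + 1):
        try:
            return task()
        except Exception:
            if attempt == max_attempts:
                raise  # out of attempts: surface the error to the caller
            sleep(base_delay * 2 ** (attempt - 1))
```

In a background task this could wrap the OCR call for a single file, with the final exception recorded as that file's error state.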

Phase 2: Frontend Development

11. Frontend - Project Structure

11.1 Setup Vite project with TypeScript support
11.2 Configure Tailwind CSS and shadcn/ui
11.3 Setup React Router for navigation
11.4 Configure Axios with base URL and interceptors
11.5 Setup React Query for API state management
11.6 Create Zustand store for global state
11.7 Setup i18n for Traditional Chinese interface

12. Frontend - UI Components (shadcn/ui)

12.1 Install and configure shadcn/ui components
12.2 Create FileUpload component with drag-and-drop (react-dropzone)
12.3 Create ProgressBar component for batch processing
12.4 Create ResultsTable component for displaying OCR results
12.5 Create MarkdownPreview component for viewing extracted content ⬅️ Fixed: API schema alignment for filename display
12.6 Create ExportDialog component for format and rule selection
12.7 Create CSSTemplateSelector component for PDF styling
12.8 Create RuleEditor component for creating custom rules
12.9 Create Toast notifications for feedback
12.10 Create TranslationPanel component (DISABLED with "Coming Soon" label)

13. Frontend - Pages

13.1 Create Login page with JWT authentication
13.2 Create Upload page with file selection and batch management ⬅️ Fixed: Upload response schema alignment
13.3 Create Processing page with real-time progress ⬅️ Fixed: Error field mapping
13.4 Create Results page with Markdown/JSON preview ⬅️ Fixed: OCR result detail flattening, null safety
13.5 Create Export page with format options (TXT, JSON, Excel, Markdown, PDF)
13.6 Create PDF Preview page (optional: embedded PDF viewer)
13.7 Create Settings page for export rule management
13.8 Add translation option placeholder in Results page (disabled state)

14. Frontend - API Integration

14.1 Create API client service with typed interfaces ⬅️ Updated: All endpoints verified working
14.2 Implement file upload with progress tracking ⬅️ Fixed: UploadBatchResponse schema
14.3 Implement OCR task status polling ⬅️ Fixed: BatchStatusResponse with files array
14.4 Implement results fetching (Markdown + JSON display) ⬅️ Fixed: OCRResultDetailResponse with flattened structure
14.5 Implement export with file download ⬅️ Fixed: ExportOptions schema added
14.6 Implement PDF generation request with loading indicator
14.7 Implement rule CRUD operations
14.8 Implement CSS template selection ⬅️ Fixed: CSSTemplateResponse with filename field
14.9 Add error handling and user feedback ⬅️ Fixed: Error field mapping with validation_alias
14.10 Create translation API client (stub, for future use)

Phase 3: Testing & Optimization

15. Testing

15.1 Write backend unit tests (pytest) for all services
15.2 Write backend API integration tests
15.3 Test PaddleOCR-VL with various document types (scanned images, PDFs, mixed content)
15.4 Test layout preservation quality (Markdown structure correctness)
15.5 Test PDF generation with different CSS templates
15.6 Test Chinese font rendering in generated PDFs
15.7 Write frontend component tests (Vitest)
15.8 Perform manual end-to-end testing
15.9 Test with various image formats and languages
15.10 Test batch processing with large file sets (50+ files)
15.11 Test export with different formats and rules
15.12 Verify translation UI placeholders are properly disabled

16. Documentation

16.1 Write API documentation (FastAPI auto-docs + additional notes)
16.2 Document PaddleOCR-VL model requirements and installation
16.3 Document Pandoc and WeasyPrint setup
16.4 Create CSS template customization guide
16.5 Write user guide for web interface
16.6 Write deployment guide for 1Panel
16.7 Create README.md with setup instructions
16.8 Document export rule syntax and examples
16.9 Document translation feature roadmap and architecture

Phase 4: Deployment

17. Deployment Preparation

17.1 Create backend startup script (start.sh)
17.2 Create frontend build script (build.sh)
17.3 Create Nginx configuration file (static files + reverse proxy)
17.4 Create Supervisor configuration for backend process
17.5 Create environment variable templates (.env.example)
17.6 Create deployment automation script (deploy.sh)
17.7 Prepare CSS templates for production
17.8 Test deployment on staging environment

18. Production Deployment (1Panel)

18.1 Setup Conda environment on production server
18.2 Install system dependencies (pandoc, fonts-noto-cjk)
18.3 Install Python dependencies and download PaddleOCR-VL models
18.4 Configure MySQL database connection
18.5 Build frontend static files
18.6 Configure Nginx via 1Panel (static files + reverse proxy)
18.7 Setup Supervisor to manage backend process
18.8 Configure SSL certificate (Let's Encrypt via 1Panel)
18.9 Perform production smoke tests (upload, OCR, export PDF)
18.10 Setup monitoring and logging
18.11 Verify PDF generation works in production environment

Phase 5: Translation Feature (FUTURE)

19. Translation Implementation (Post-Launch)

19.1 Decide on translation engine (Argos offline vs ERNIE API vs Google API)
19.2 Implement chosen translation engine integration
19.3 Implement Markdown translation with structure preservation
19.4 Update POST /api/v1/translate/document endpoint (remove 501 status)
19.5 Add translation configuration UI (enable TranslationPanel component)
19.6 Add source/target language selection
19.7 Implement translation progress tracking
19.8 Test translation with various document types
19.9 Optimize translation quality for technical documents
19.10 Update documentation with translation feature guide
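
Task 19.3 calls for Markdown translation that preserves structure. The sketch below shows one simple line-based way to do that: it translates prose while leaving heading markers, list bullets, fenced code, and table rows untouched. The `translate` callable stands in for whichever engine is chosen in task 19.1, and the function name is hypothetical. Passing table rows through untranslated is a deliberate simplification; a fuller version would translate cell by cell.

```python
import re

# Leading structure to keep as-is: indentation, then an optional heading
# marker, bullet, numbered-list marker, or blockquote marker.
PREFIX_RE = re.compile(r"^(\s*(?:#{1,6}\s+|[-*+]\s+|\d+[.)]\s+|>\s*)?)(.*)$")


def translate_markdown(markdown, translate):
    """Translate prose lines of `markdown` with `translate(text) -> str`."""
    out = []
    in_fence = False
    for line in markdown.split("\n"):
        stripped = line.strip()
        if stripped.startswith("```"):
            in_fence = not in_fence  # entering or leaving a fenced code block
            out.append(line)
        elif in_fence or not stripped or stripped.startswith("|"):
            out.append(line)  # code, blank lines, and table rows pass through
        else:
            prefix, text = PREFIX_RE.match(line).groups()
            out.append(prefix + translate(text) if text else line)
    return "\n".join(out)


source = "# Title\n\n- item one\n```\ncode here\n```\n| a | b |\nplain"
translated = translate_markdown(source, str.upper)  # str.upper as a fake engine
```

Because the output has the same number of lines as the input, the translated Markdown can be fed to the same PDF generator without layout changes.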

Summary

Phase 1 (Core OCR + Layout Preservation): Tasks 1-10 (basic OCR + layout-preserved PDF)
Phase 2 (Frontend): Tasks 11-14 (user interface)
Phase 3 (Testing): Tasks 15-16 (testing and documentation)
Phase 4 (Deployment): Tasks 17-18 (deployment)
Phase 5 (Translation): Task 19 (translation feature, future implementation)

Total Tasks: 150+ tasks
Priority: Complete Phases 1-4 first; start Phase 5 after production deployment and user feedback
