egg cd3cbea49d chore: project cleanup and prepare for dual-track processing refactor

- Removed all test files and directories
- Deleted outdated documentation (will be rewritten)
- Cleaned up temporary files, logs, and uploads
- Archived 5 completed OpenSpec proposals
- Created new dual-track-document-processing proposal with complete OpenSpec structure
  - Dual-track architecture: OCR track (PaddleOCR) + Direct track (PyMuPDF)
  - UnifiedDocument model for consistent output
  - Support for structure-preserving translation
- Updated .gitignore to prevent future test/temp files

This is a major cleanup preparing for the complete refactoring of the document processing pipeline.

🤖 Generated with [Claude Code](https://claude.com/claude-code)

Co-Authored-By: Claude <noreply@anthropic.com>

2025-11-18 20:02:31 +08:00

Technical Design

Architecture Overview

User Upload (DOC/DOCX/PPT/PPTX)
    ↓
File Validation & Storage
    ↓
Format Detection
    ↓
Office Document Converter
    ↓
PDF Generation
    ↓
PDF to Images (existing)
    ↓
PaddleOCR Processing (existing)
    ↓
Results & Export

Component Design

1. Office Document Converter Service

# app/services/office_converter.py
from pathlib import Path


class OfficeConverter:
    """Convert Office documents to PDF for OCR processing"""

    def convert_to_pdf(self, file_path: Path) -> Path:
        """Main conversion dispatcher: route to a converter by file extension"""

    def convert_docx_to_pdf(self, docx_path: Path) -> Path:
        """Convert DOCX to PDF via pypandoc or LibreOffice (python-docx alone cannot render PDF)"""

    def convert_doc_to_pdf(self, doc_path: Path) -> Path:
        """Convert legacy DOC to PDF"""

    def convert_pptx_to_pdf(self, pptx_path: Path) -> Path:
        """Convert PPTX to PDF via LibreOffice (python-pptx reads PPTX but cannot render PDF)"""

    def convert_ppt_to_pdf(self, ppt_path: Path) -> Path:
        """Convert legacy PPT to PDF"""

2. OCR Service Integration

# Extend app/services/ocr_service.py

def process_image(self, image_path: Path, ...):
    # Check file type
    if is_office_document(image_path):
        # Convert to PDF first
        pdf_path = self.office_converter.convert_to_pdf(image_path)
        # Use existing PDF processing
        return self.process_pdf(pdf_path, ...)
    elif is_pdf:
        # Existing PDF processing
        ...
    else:
        # Existing image processing
        ...

3. File Format Detection

OFFICE_FORMATS = {
    '.doc': 'application/msword',
    '.docx': 'application/vnd.openxmlformats-officedocument.wordprocessingml.document',
    '.ppt': 'application/vnd.ms-powerpoint',
    '.pptx': 'application/vnd.openxmlformats-officedocument.presentationml.presentation'
}

def is_office_document(file_path: Path) -> bool:
    return file_path.suffix.lower() in OFFICE_FORMATS
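
The extension check above is what drives the three-way branch in the OCR service. A minimal sketch of that routing, assuming detection is by extension only; `classify_upload` and `OFFICE_EXTENSIONS` are hypothetical names used for illustration, not part of the existing service:

```python
from pathlib import Path

# Mirrors the extensions listed in OFFICE_FORMATS (assumed to stay in sync).
OFFICE_EXTENSIONS = {".doc", ".docx", ".ppt", ".pptx"}


def classify_upload(file_path: Path) -> str:
    """Return 'office', 'pdf', or 'image' based on the file extension only."""
    suffix = file_path.suffix.lower()
    if suffix in OFFICE_EXTENSIONS:
        return "office"
    if suffix == ".pdf":
        return "pdf"
    # Everything else falls through to the existing image path.
    return "image"
```

Lowercasing the suffix means `Report.DOCX` and `report.docx` are routed the same way. Extension checks say nothing about the actual file content, so they are best paired with content validation.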

Library Selection

For Word Documents

python-docx: Read/write DOCX files
doc2pdf: Simple conversion (requires LibreOffice)
Alternative: pypandoc with pandoc backend

For PowerPoint Documents

python-pptx: Read/write PPTX files
unoconv: Universal Office Converter (requires LibreOffice)

Recommended Approach

Use LibreOffice headless mode for universal conversion:

libreoffice --headless --convert-to pdf input.docx

This provides:

Support for all Office formats
High fidelity conversion
Maintained by active community
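
The command above could be wrapped from Python roughly as follows. This is a sketch under assumptions, not the project's implementation: `ConversionError`, `build_convert_command`, and the standalone `convert_to_pdf` function are hypothetical names, the binary is assumed to be on PATH as `libreoffice` (some installs expose it as `soffice`), and the 60-second default is taken from the timeout suggested under Error Handling.

```python
import shutil
import subprocess
from pathlib import Path
from typing import List


class ConversionError(RuntimeError):
    """Hypothetical error type covering any failed conversion."""


def build_convert_command(input_path: Path, out_dir: Path,
                          binary: str = "libreoffice") -> List[str]:
    """Build the headless LibreOffice command line for a PDF conversion."""
    return [
        binary, "--headless",
        "--convert-to", "pdf",
        "--outdir", str(out_dir),
        str(input_path),
    ]


def convert_to_pdf(input_path: Path, out_dir: Path, timeout: int = 60) -> Path:
    """Convert one Office file to PDF and return the path of the result."""
    if shutil.which("libreoffice") is None:
        raise ConversionError("LibreOffice not found on PATH")
    try:
        subprocess.run(
            build_convert_command(input_path, out_dir),
            check=True, capture_output=True, timeout=timeout,
        )
    except subprocess.TimeoutExpired as exc:
        raise ConversionError(f"conversion exceeded {timeout}s") from exc
    except subprocess.CalledProcessError as exc:
        raise ConversionError(exc.stderr.decode(errors="replace")) from exc
    # LibreOffice writes <input stem>.pdf into the output directory.
    pdf_path = out_dir / (input_path.stem + ".pdf")
    if not pdf_path.exists():
        # Guard against a zero exit code with no output file.
        raise ConversionError(f"no PDF produced for {input_path.name}")
    return pdf_path
```

Passing the command as a list avoids shell quoting problems with uploaded filenames, and checking that the output file exists catches conversions that exit cleanly without producing a PDF.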

Configuration Changes

Token Expiration

# app/core/config.py
class Settings(BaseSettings):
    # Change from 30 to 1440 (24 hours)
    access_token_expire_minutes: int = 1440

File Upload Limits

# Office files can be larger than images, so allow up to 100MB
max_file_size: int = 100 * 1024 * 1024  # 100MB
allowed_extensions: Set[str] = {
    '.png', '.jpg', '.jpeg', '.pdf',
    '.doc', '.docx', '.ppt', '.pptx'
}
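
A minimal sketch of how these two limits might be enforced at upload time. The function name `validate_upload` and the module-level constants are assumptions for illustration; in the real application the values would presumably come from `Settings`.

```python
from pathlib import Path

MAX_FILE_SIZE = 100 * 1024 * 1024  # 100MB, matching the limit above
ALLOWED_EXTENSIONS = {
    ".png", ".jpg", ".jpeg", ".pdf",
    ".doc", ".docx", ".ppt", ".pptx",
}


def validate_upload(filename: str, size_bytes: int) -> None:
    """Raise ValueError if the upload breaks the extension or size limit."""
    suffix = Path(filename).suffix.lower()
    if suffix not in ALLOWED_EXTENSIONS:
        raise ValueError(f"unsupported file type: {suffix or '(none)'}")
    if size_bytes > MAX_FILE_SIZE:
        raise ValueError(f"file too large: {size_bytes} bytes")
```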

Error Handling

Conversion Failures

- Corrupted Office files
- Unsupported Office features
- LibreOffice not installed

Performance Considerations

- Office conversion is CPU intensive
- Consider queuing for large files
- Add conversion timeout (60 seconds)

Security

- Validate Office files before processing
- Scan for macros/embedded objects
- Sandbox conversion process
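
One cheap form of the "validate before processing" step is to check that the file's leading bytes match the container format its extension implies: DOCX and PPTX are ZIP archives, while legacy DOC and PPT use the OLE2 compound file format. The sketch below assumes that check alone; the function names are hypothetical, and it does not detect macros or embedded objects.

```python
from typing import Optional

OLE2_MAGIC = bytes.fromhex("d0cf11e0a1b11ae1")  # legacy DOC/PPT container
ZIP_MAGIC = b"PK\x03\x04"                        # DOCX/PPTX are ZIP archives

# Container type each Office extension is expected to use.
EXPECTED_CONTAINER = {
    ".doc": "ole2", ".ppt": "ole2",
    ".docx": "zip", ".pptx": "zip",
}


def sniff_container(header: bytes) -> Optional[str]:
    """Identify the container type from the first bytes of a file."""
    if header.startswith(OLE2_MAGIC):
        return "ole2"
    if header.startswith(ZIP_MAGIC):
        return "zip"
    return None


def content_matches_extension(suffix: str, header: bytes) -> bool:
    """True only if suffix is an Office extension and the header agrees."""
    expected = EXPECTED_CONTAINER.get(suffix.lower())
    return expected is not None and expected == sniff_container(header)
```

A caller would read the first 8 bytes of the stored upload and pass them in together with the file's suffix. A mismatch (for example a `.docx` that is not a ZIP archive) can be rejected before LibreOffice is ever invoked.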

Dependencies

System Requirements

# macOS
brew install libreoffice

# Linux
apt-get install libreoffice

# Python packages
pip install python-docx python-pptx pypandoc

Alternative: Docker Container

Use a Docker container with LibreOffice pre-installed for consistent conversion across environments.

Testing Strategy

Unit Tests

- Test each conversion method
- Mock LibreOffice calls
- Test error handling

Integration Tests

- End-to-end Office → OCR pipeline
- Test with various Office versions
- Performance benchmarks

Sample Documents

- Simple text documents
- Documents with tables
- Documents with images
- Presentations with multiple slides
- Legacy formats (DOC, PPT)
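
Mocking the LibreOffice call lets the unit tests run on machines without LibreOffice installed. A sketch using `unittest.mock`, assuming the converter shells out through `subprocess.run`; `run_libreoffice` is a hypothetical stand-in for whatever wrapper the converter ends up using.

```python
import subprocess
import unittest
from pathlib import Path
from unittest import mock


def run_libreoffice(input_path: Path, out_dir: Path) -> None:
    """Hypothetical thin wrapper around the LibreOffice CLI call."""
    subprocess.run(
        ["libreoffice", "--headless", "--convert-to", "pdf",
         "--outdir", str(out_dir), str(input_path)],
        check=True, timeout=60,
    )


class RunLibreOfficeTest(unittest.TestCase):
    def test_invokes_headless_conversion(self):
        # Replace subprocess.run so no real process is started.
        with mock.patch("subprocess.run") as fake_run:
            run_libreoffice(Path("in.docx"), Path("out"))
        args, kwargs = fake_run.call_args
        self.assertEqual(args[0][:2], ["libreoffice", "--headless"])
        self.assertEqual(kwargs["timeout"], 60)

    def test_timeout_propagates(self):
        # Simulate a conversion that exceeds the 60-second limit.
        boom = subprocess.TimeoutExpired(cmd="libreoffice", timeout=60)
        with mock.patch("subprocess.run", side_effect=boom):
            with self.assertRaises(subprocess.TimeoutExpired):
                run_libreoffice(Path("in.docx"), Path("out"))
```

The same pattern extends to the other failure modes listed under Error Handling, for example setting `side_effect` to a `subprocess.CalledProcessError` to simulate a corrupted input file.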
