OCR/openspec/changes/archive/2025-11-18-add-office-document-support/design.md
egg cd3cbea49d chore: project cleanup and prepare for dual-track processing refactor
- Removed all test files and directories
- Deleted outdated documentation (will be rewritten)
- Cleaned up temporary files, logs, and uploads
- Archived 5 completed OpenSpec proposals
- Created new dual-track-document-processing proposal with complete OpenSpec structure
  - Dual-track architecture: OCR track (PaddleOCR) + Direct track (PyMuPDF)
  - UnifiedDocument model for consistent output
  - Support for structure-preserving translation
- Updated .gitignore to prevent future test/temp files

This is a major cleanup preparing for the complete refactoring of the document processing pipeline.

🤖 Generated with [Claude Code](https://claude.com/claude-code)

Co-Authored-By: Claude <noreply@anthropic.com>
2025-11-18 20:02:31 +08:00


Technical Design

Architecture Overview

User Upload (DOC/DOCX/PPT/PPTX)
    ↓
File Validation & Storage
    ↓
Format Detection
    ↓
Office Document Converter
    ↓
PDF Generation
    ↓
PDF to Images (existing)
    ↓
PaddleOCR Processing (existing)
    ↓
Results & Export

Component Design

1. Office Document Converter Service

# app/services/office_converter.py

class OfficeConverter:
    """Convert Office documents to PDF for OCR processing"""

    def convert_to_pdf(self, file_path: Path) -> Path:
        """Main conversion dispatcher"""

    def convert_docx_to_pdf(self, docx_path: Path) -> Path:
        """Convert DOCX to PDF using python-docx and pypandoc"""

    def convert_doc_to_pdf(self, doc_path: Path) -> Path:
        """Convert legacy DOC to PDF"""

    def convert_pptx_to_pdf(self, pptx_path: Path) -> Path:
        """Convert PPTX to PDF via LibreOffice headless (python-pptx reads/writes PPTX but cannot render PDF)"""

    def convert_ppt_to_pdf(self, ppt_path: Path) -> Path:
        """Convert legacy PPT to PDF"""

2. OCR Service Integration

# Extend app/services/ocr_service.py

def process_image(self, image_path: Path, ...):
    # Check file type
    if is_office_document(image_path):
        # Convert to PDF first
        pdf_path = self.office_converter.convert_to_pdf(image_path)
        # Use existing PDF processing
        return self.process_pdf(pdf_path, ...)
    elif image_path.suffix.lower() == '.pdf':
        # Existing PDF processing
        ...
    else:
        # Existing image processing
        ...
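
The `convert_to_pdf` call above depends on the dispatcher inside `OfficeConverter`. A minimal sketch of how that routing could look, assuming dispatch on the file suffix; `OfficeConverterSketch`, `UnsupportedFormatError`, and the `_placeholder` method are hypothetical names used only for illustration, and the placeholder computes a target path instead of performing a real conversion:

```python
from pathlib import Path


class UnsupportedFormatError(ValueError):
    """Hypothetical error for extensions the converter does not handle."""


class OfficeConverterSketch:
    """Illustrative dispatcher; the per-format methods are placeholders."""

    def convert_to_pdf(self, file_path: Path) -> Path:
        # Route on the lower-cased suffix so "Report.DOCX" is handled too.
        handlers = {
            ".doc": self.convert_doc_to_pdf,
            ".docx": self.convert_docx_to_pdf,
            ".ppt": self.convert_ppt_to_pdf,
            ".pptx": self.convert_pptx_to_pdf,
        }
        handler = handlers.get(file_path.suffix.lower())
        if handler is None:
            raise UnsupportedFormatError(f"Unsupported format: {file_path.suffix!r}")
        return handler(file_path)

    def _placeholder(self, source: Path) -> Path:
        # Stand-in for a real conversion: only computes the target path.
        return source.with_suffix(".pdf")

    # Assumption: every format shares one placeholder in this sketch.
    convert_doc_to_pdf = _placeholder
    convert_docx_to_pdf = _placeholder
    convert_ppt_to_pdf = _placeholder
    convert_pptx_to_pdf = _placeholder
```

Raising a dedicated error for unknown suffixes keeps the failure distinct from a conversion that was attempted and failed.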

3. File Format Detection

OFFICE_FORMATS = {
    '.doc': 'application/msword',
    '.docx': 'application/vnd.openxmlformats-officedocument.wordprocessingml.document',
    '.ppt': 'application/vnd.ms-powerpoint',
    '.pptx': 'application/vnd.openxmlformats-officedocument.presentationml.presentation'
}

def is_office_document(file_path: Path) -> bool:
    return file_path.suffix.lower() in OFFICE_FORMATS
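
An extension check alone trusts the uploader's file name. As a hedged complement, the leading bytes can be compared with the container signature each format uses: DOCX and PPTX are ZIP packages, while legacy DOC and PPT are OLE2 compound files. The helper name `signature_matches_extension` and the `EXPECTED_MAGIC` table are assumptions for this sketch, not part of the design:

```python
from pathlib import Path

ZIP_MAGIC = b"PK\x03\x04"                         # ZIP local file header
OLE2_MAGIC = bytes.fromhex("D0CF11E0A1B11AE1")    # OLE2 compound file header

# Hypothetical table: which container signature each extension should carry.
EXPECTED_MAGIC = {
    ".doc": OLE2_MAGIC,
    ".ppt": OLE2_MAGIC,
    ".docx": ZIP_MAGIC,
    ".pptx": ZIP_MAGIC,
}


def signature_matches_extension(file_path: Path) -> bool:
    """Return True when the file's first bytes fit its Office extension."""
    expected = EXPECTED_MAGIC.get(file_path.suffix.lower())
    if expected is None:
        return False  # Not an Office extension this sketch knows about.
    with file_path.open("rb") as handle:
        return handle.read(len(expected)) == expected
```

This only confirms the container type; a ZIP file renamed to `.docx` would still pass, so it narrows the input rather than fully validating it.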

Library Selection

For Word Documents

  • python-docx: Read/write DOCX files
  • doc2pdf: Simple conversion (requires LibreOffice)
  • Alternative: pypandoc with pandoc backend

For PowerPoint Documents

  • python-pptx: Read/write PPTX files
  • unoconv: Universal Office Converter (requires LibreOffice)

Use LibreOffice headless mode for universal conversion:

libreoffice --headless --convert-to pdf input.docx

This provides:

  • Support for all Office formats
  • High fidelity conversion
  • Active maintenance by its community
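
The command above could be wrapped from Python roughly as follows. This is a sketch under stated assumptions, not the project's implementation: `ConversionError`, `build_soffice_command`, and `convert_with_libreoffice` are hypothetical names, the binary is assumed to be on `PATH` as `libreoffice` or `soffice`, and the 60-second default mirrors the timeout suggested under Error Handling.

```python
import shutil
import subprocess
from pathlib import Path
from typing import List


class ConversionError(RuntimeError):
    """Hypothetical error type for failed or impossible conversions."""


def build_soffice_command(binary: str, src: Path, out_dir: Path) -> List[str]:
    """Assemble the headless conversion command as an argument list."""
    return [binary, "--headless", "--convert-to", "pdf",
            "--outdir", str(out_dir), str(src)]


def convert_with_libreoffice(src: Path, out_dir: Path, timeout: float = 60.0) -> Path:
    """Convert one Office file to PDF and return the path of the result."""
    binary = shutil.which("libreoffice") or shutil.which("soffice")
    if binary is None:
        raise ConversionError("LibreOffice not found on PATH")
    out_dir.mkdir(parents=True, exist_ok=True)
    try:
        result = subprocess.run(
            build_soffice_command(binary, src, out_dir),
            capture_output=True, text=True, timeout=timeout,
        )
    except subprocess.TimeoutExpired as exc:
        raise ConversionError(f"Conversion timed out after {timeout}s") from exc
    # LibreOffice names the output after the input's stem.
    pdf_path = out_dir / f"{src.stem}.pdf"
    if result.returncode != 0 or not pdf_path.exists():
        raise ConversionError(f"Conversion failed: {result.stderr.strip()}")
    return pdf_path
```

Checking that the PDF exists, in addition to the return code, guards against a run that exits cleanly without writing any output.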

Configuration Changes

Token Expiration

# app/core/config.py
class Settings(BaseSettings):
    # Change from 30 to 1440 (24 hours)
    access_token_expire_minutes: int = 1440
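
For illustration, this is how a 1440-minute setting translates into an expiry timestamp. It is a sketch only: `token_expiry` is a hypothetical helper, and the project's actual token code is not shown in this design.

```python
from datetime import datetime, timedelta, timezone

ACCESS_TOKEN_EXPIRE_MINUTES = 1440  # 24 hours, matching the setting above


def token_expiry(issued_at: datetime,
                 expire_minutes: int = ACCESS_TOKEN_EXPIRE_MINUTES) -> datetime:
    """Return the moment a token issued at `issued_at` stops being valid."""
    return issued_at + timedelta(minutes=expire_minutes)


issued = datetime(2025, 11, 18, 12, 0, tzinfo=timezone.utc)
expires = token_expiry(issued)  # exactly one day after `issued`
```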

File Upload Limits

# Office files can be larger than images or PDFs
max_file_size: int = 100 * 1024 * 1024  # 100MB
allowed_extensions: Set[str] = {
    '.png', '.jpg', '.jpeg', '.pdf',
    '.doc', '.docx', '.ppt', '.pptx'
}
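
A minimal sketch of how these two settings might be enforced at upload time. The function name `validate_upload` and the return convention (an error message, or `None` on success) are assumptions made for this example:

```python
from pathlib import Path
from typing import Optional

MAX_FILE_SIZE = 100 * 1024 * 1024  # 100 MB, as in the settings above
ALLOWED_EXTENSIONS = {
    ".png", ".jpg", ".jpeg", ".pdf",
    ".doc", ".docx", ".ppt", ".pptx",
}


def validate_upload(filename: str, size_bytes: int) -> Optional[str]:
    """Return an error message, or None when the upload is acceptable."""
    suffix = Path(filename).suffix.lower()
    if suffix not in ALLOWED_EXTENSIONS:
        return f"Extension {suffix or '(none)'} is not allowed"
    if size_bytes > MAX_FILE_SIZE:
        return f"File is {size_bytes} bytes; the limit is {MAX_FILE_SIZE}"
    return None
```

Lower-casing the suffix keeps names such as `Slides.PPTX` from being rejected by a case-sensitive set lookup.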

Error Handling

  1. Conversion Failures

    • Corrupted Office files
    • Unsupported Office features
    • LibreOffice not installed
  2. Performance Considerations

    • Office conversion is CPU-intensive
    • Consider queuing large files
    • Add a conversion timeout (60 seconds)
  3. Security

    • Validate Office files before processing
    • Scan for macros/embedded objects
    • Sandbox conversion process
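
The macro and embedded-object check can be approximated for DOCX/PPTX with the standard library, because both are ZIP packages: VBA code is stored in a part named `vbaProject.bin`, and embedded objects sit under an `embeddings/` folder. The sketch below is hedged accordingly: `inspect_ooxml` is a hypothetical helper, it does not cover legacy DOC/PPT (OLE2) files, and it is a screening step, not a substitute for sandboxing.

```python
import zipfile
from pathlib import Path
from typing import Dict, List


def inspect_ooxml(package_path: Path) -> Dict[str, List[str]]:
    """List macro and embedded-object parts found in a DOCX/PPTX package.

    Raises zipfile.BadZipFile when the file is not a readable ZIP archive,
    which can double as a corrupted-file check.
    """
    with zipfile.ZipFile(package_path) as package:
        names = package.namelist()
    return {
        # VBA projects are stored as e.g. word/vbaProject.bin.
        "macros": [n for n in names if n.lower().endswith("vbaproject.bin")],
        # Embedded objects live under word/embeddings/ or ppt/embeddings/.
        "embedded": [n for n in names if "/embeddings/" in n.lower()],
    }
```

A caller could reject the upload, or only log a warning, when either list is non-empty; which policy applies is left open here.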

Dependencies

System Requirements

# macOS
brew install libreoffice

# Linux (Debian/Ubuntu)
sudo apt-get install -y libreoffice

# Python packages
pip install python-docx python-pptx pypandoc

Alternative: Docker Container

Use a Docker container with LibreOffice pre-installed for consistent conversion across environments.

Testing Strategy

  1. Unit Tests

    • Test each conversion method
    • Mock LibreOffice calls
    • Test error handling
  2. Integration Tests

    • End-to-end Office → OCR pipeline
    • Test with various Office versions
    • Performance benchmarks
  3. Sample Documents

    • Simple text documents
    • Documents with tables
    • Documents with images
    • Presentations with multiple slides
    • Legacy formats (DOC, PPT)
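
As an illustration of the "Mock LibreOffice calls" item, the unit tests could patch `subprocess.run` so that no real LibreOffice process starts. This is a sketch: `run_conversion` is a hypothetical stand-in for whatever function the converter ends up exposing, not the project's code.

```python
import subprocess
import unittest
from pathlib import Path
from unittest import mock


def run_conversion(src: Path, out_dir: Path) -> Path:
    """Hypothetical converter under test: shells out to LibreOffice."""
    subprocess.run(
        ["libreoffice", "--headless", "--convert-to", "pdf",
         "--outdir", str(out_dir), str(src)],
        check=True, timeout=60,
    )
    return out_dir / f"{src.stem}.pdf"


class RunConversionTests(unittest.TestCase):
    def test_builds_expected_command(self):
        # Replace subprocess.run so the test never launches LibreOffice.
        with mock.patch("subprocess.run") as fake_run:
            result = run_conversion(Path("in/report.docx"), Path("out"))
        fake_run.assert_called_once()
        command = fake_run.call_args.args[0]
        self.assertIn("--headless", command)
        self.assertEqual(command[-1], str(Path("in/report.docx")))
        self.assertEqual(result, Path("out/report.pdf"))

    def test_failure_propagates(self):
        # Simulate a non-zero exit code from the external process.
        error = subprocess.CalledProcessError(1, "libreoffice")
        with mock.patch("subprocess.run", side_effect=error):
            with self.assertRaises(subprocess.CalledProcessError):
                run_conversion(Path("broken.doc"), Path("out"))
```

Saved in a test module, these cases can be run with `python -m unittest`; the integration tests would then cover the real binary separately.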