OCR/openspec/changes/archive/2025-11-18-add-office-document-support/design.md

# Technical Design

## Architecture Overview

```
User Upload (DOC/DOCX/PPT/PPTX)
    ↓
File Validation & Storage
    ↓
Format Detection
    ↓
Office Document Converter
    ↓
PDF Generation
    ↓
PDF to Images (existing)
    ↓
PaddleOCR Processing (existing)
    ↓
Results & Export
```

## Component Design

### 1. Office Document Converter Service

```python
# app/services/office_converter.py
from pathlib import Path


class OfficeConverter:
    """Convert Office documents to PDF for OCR processing."""

    def convert_to_pdf(self, file_path: Path) -> Path:
        """Dispatch to the format-specific converter based on file extension."""

    def convert_docx_to_pdf(self, docx_path: Path) -> Path:
        """Convert DOCX to PDF via LibreOffice headless mode."""

    def convert_doc_to_pdf(self, doc_path: Path) -> Path:
        """Convert legacy DOC to PDF via LibreOffice headless mode."""

    def convert_pptx_to_pdf(self, pptx_path: Path) -> Path:
        """Convert PPTX to PDF via LibreOffice headless mode."""

    def convert_ppt_to_pdf(self, ppt_path: Path) -> Path:
        """Convert legacy PPT to PDF via LibreOffice headless mode."""
```
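
The dispatcher in `convert_to_pdf` could route on the file extension with a lookup table. The following is a minimal sketch under stated assumptions, not the project's implementation: `OfficeConverterSketch`, `UnsupportedFormatError`, and `_convert` are hypothetical names, and `_convert` is a placeholder that only derives the expected output path instead of running a real conversion.

```python
from pathlib import Path


class UnsupportedFormatError(ValueError):
    """Raised when no converter is registered for a file extension (hypothetical)."""


class OfficeConverterSketch:
    """Illustrative dispatcher; not the project's actual implementation."""

    def __init__(self) -> None:
        # Map each supported extension to its format-specific method.
        self._handlers = {
            ".doc": self.convert_doc_to_pdf,
            ".docx": self.convert_docx_to_pdf,
            ".ppt": self.convert_ppt_to_pdf,
            ".pptx": self.convert_pptx_to_pdf,
        }

    def convert_to_pdf(self, file_path: Path) -> Path:
        # Extensions are matched case-insensitively.
        handler = self._handlers.get(file_path.suffix.lower())
        if handler is None:
            raise UnsupportedFormatError(f"No converter for {file_path.suffix!r}")
        return handler(file_path)

    def convert_doc_to_pdf(self, doc_path: Path) -> Path:
        return self._convert(doc_path)

    def convert_docx_to_pdf(self, docx_path: Path) -> Path:
        return self._convert(docx_path)

    def convert_ppt_to_pdf(self, ppt_path: Path) -> Path:
        return self._convert(ppt_path)

    def convert_pptx_to_pdf(self, pptx_path: Path) -> Path:
        return self._convert(pptx_path)

    def _convert(self, source: Path) -> Path:
        # Placeholder: a real implementation would invoke the conversion
        # backend here. This sketch only derives the expected output path.
        return source.with_suffix(".pdf")
```

Keeping the per-format methods separate leaves room for format-specific handling later, even if all four initially share one backend.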

### 2. OCR Service Integration

```python
# Extend app/services/ocr_service.py

def process_image(self, image_path: Path, ...):
    # Check file type
    if is_office_document(image_path):
        # Convert to PDF first
        pdf_path = self.office_converter.convert_to_pdf(image_path)
        # Use existing PDF processing
        return self.process_pdf(pdf_path, ...)
    elif is_pdf:
        # Existing PDF processing
        ...
    else:
        # Existing image processing
        ...
```

### 3. File Format Detection

```python
from pathlib import Path

OFFICE_FORMATS = {
    '.doc': 'application/msword',
    '.docx': 'application/vnd.openxmlformats-officedocument.wordprocessingml.document',
    '.ppt': 'application/vnd.ms-powerpoint',
    '.pptx': 'application/vnd.openxmlformats-officedocument.presentationml.presentation'
}

def is_office_document(file_path: Path) -> bool:
    return file_path.suffix.lower() in OFFICE_FORMATS
```
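
Extension checks alone trust the uploaded filename. One possible complement, sketched here as an assumption and not as part of the original design, is to compare the leading bytes of the file with the container signature each format uses: DOCX and PPTX are ZIP archives, while legacy DOC and PPT use the OLE2 compound file format. The names `EXPECTED_MAGIC` and `matches_declared_format` are hypothetical.

```python
from pathlib import Path

# Container signatures: OOXML files are ZIP archives, legacy binary
# Office files are OLE2 compound documents.
ZIP_MAGIC = b"PK\x03\x04"
OLE2_MAGIC = b"\xd0\xcf\x11\xe0\xa1\xb1\x1a\xe1"

EXPECTED_MAGIC = {
    ".docx": ZIP_MAGIC,
    ".pptx": ZIP_MAGIC,
    ".doc": OLE2_MAGIC,
    ".ppt": OLE2_MAGIC,
}


def matches_declared_format(file_path: Path) -> bool:
    """Return True if the file's leading bytes match its extension's container."""
    expected = EXPECTED_MAGIC.get(file_path.suffix.lower())
    if expected is None:
        return False  # Not an Office extension this sketch knows about.
    with file_path.open("rb") as handle:
        return handle.read(len(expected)) == expected
```

This only confirms the container type. It does not prove the file is a valid Word or PowerPoint document, so conversion errors still need handling.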

## Library Selection

### For Word Documents
- **python-docx**: Read/write DOCX files
- **doc2pdf**: Simple conversion (requires LibreOffice)
- Alternative: **pypandoc** with pandoc backend

### For PowerPoint Documents
- **python-pptx**: Read/write PPTX files
- **unoconv**: Universal Office Converter (requires LibreOffice; deprecated upstream in favor of unoserver)

### Recommended Approach
Use **LibreOffice** headless mode for universal conversion:
```bash
libreoffice --headless --convert-to pdf input.docx
```

This provides:
- Support for all Office formats
- High fidelity conversion
- Maintenance by an active community
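
From Python, the headless command could be wrapped with `subprocess`. This is a minimal sketch under stated assumptions: the function names, the `runner` parameter (added so tests can substitute a fake), and the 60-second default are illustrative, and the binary may be called `soffice` on some installations. Because LibreOffice may exit successfully without producing output, the sketch also checks that the PDF exists.

```python
import subprocess
from pathlib import Path
from typing import Callable, List


def build_convert_command(source: Path, out_dir: Path,
                          binary: str = "libreoffice") -> List[str]:
    """Build the headless conversion command; --outdir sets the output folder."""
    return [binary, "--headless", "--convert-to", "pdf",
            "--outdir", str(out_dir), str(source)]


def convert_with_libreoffice(source: Path, out_dir: Path, timeout: float = 60,
                             runner: Callable = subprocess.run) -> Path:
    """Convert `source` to PDF in `out_dir` and return the PDF path."""
    out_dir.mkdir(parents=True, exist_ok=True)
    # check=True raises CalledProcessError on a non-zero exit code;
    # timeout raises TimeoutExpired if the conversion hangs.
    runner(build_convert_command(source, out_dir),
           check=True, capture_output=True, timeout=timeout)
    pdf_path = out_dir / f"{source.stem}.pdf"
    if not pdf_path.exists():
        raise RuntimeError(f"No PDF was produced for {source}")
    return pdf_path
```

Passing the command as a list avoids shell interpolation of user-supplied filenames.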

## Configuration Changes

### Token Expiration
```python
# app/core/config.py
class Settings(BaseSettings):
    # Raised from 30 minutes to 1440 minutes (24 hours)
    access_token_expire_minutes: int = 1440
```

### File Upload Limits
```python
# Office files can be large, so allow uploads up to 100MB
max_file_size: int = 100 * 1024 * 1024  # 100MB
allowed_extensions: Set[str] = {
    '.png', '.jpg', '.jpeg', '.pdf',
    '.doc', '.docx', '.ppt', '.pptx'
}
```
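
These two settings could be enforced together at upload time. The check below is a sketch, assuming the limits are available as plain values; `upload_rejection_reason` and the module-level constants are hypothetical names that mirror the settings above.

```python
from pathlib import Path
from typing import Optional, Set

MAX_FILE_SIZE = 100 * 1024 * 1024  # 100MB, mirroring max_file_size
ALLOWED_EXTENSIONS: Set[str] = {
    ".png", ".jpg", ".jpeg", ".pdf",
    ".doc", ".docx", ".ppt", ".pptx",
}


def upload_rejection_reason(filename: str, size_bytes: int,
                            max_file_size: int = MAX_FILE_SIZE,
                            allowed_extensions: Set[str] = ALLOWED_EXTENSIONS,
                            ) -> Optional[str]:
    """Return None if the upload is acceptable, else a short reason string."""
    suffix = Path(filename).suffix.lower()
    if suffix not in allowed_extensions:
        return f"unsupported extension: {suffix or '(none)'}"
    if size_bytes > max_file_size:
        return f"file exceeds the {max_file_size} byte limit"
    return None
```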

## Error Handling

1. **Conversion Failures**
   - Corrupted Office files
   - Unsupported Office features
   - LibreOffice not installed

2. **Performance Considerations**
   - Office conversion is CPU-intensive
   - Consider queuing conversions for large files
   - Enforce a conversion timeout (60 seconds)

3. **Security**
   - Validate Office files before processing
   - Scan for macros/embedded objects
   - Sandbox conversion process
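
The failure modes above could be mapped onto a single exception type so that callers handle one error and inspect a reason code. This is a sketch under stated assumptions: `ConversionError`, its reason strings, and `run_conversion` are hypothetical, and the `runner` parameter exists only so the behavior can be exercised without LibreOffice installed.

```python
import subprocess
from typing import Callable, Sequence


class ConversionError(Exception):
    """Single error type for conversion failures (hypothetical)."""

    def __init__(self, reason: str, detail: str = "") -> None:
        super().__init__(f"{reason}: {detail}" if detail else reason)
        self.reason = reason


def run_conversion(command: Sequence[str], timeout: float = 60,
                   runner: Callable = subprocess.run) -> None:
    """Run a converter command and translate low-level failures."""
    try:
        runner(list(command), check=True, capture_output=True, timeout=timeout)
    except FileNotFoundError as exc:
        # The converter binary is not on PATH (e.g. LibreOffice not installed).
        raise ConversionError("converter_not_installed", str(exc)) from exc
    except subprocess.TimeoutExpired as exc:
        # The conversion ran longer than the configured timeout.
        raise ConversionError("conversion_timeout", f"exceeded {timeout}s") from exc
    except subprocess.CalledProcessError as exc:
        # Non-zero exit code, e.g. a corrupted or unsupported document.
        stderr = (exc.stderr or b"").decode(errors="replace").strip()
        raise ConversionError("conversion_failed", stderr) from exc
```

An API layer could then translate `reason` into an appropriate response for the user without parsing subprocess details.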

## Dependencies

### System Requirements
```bash
# macOS
brew install libreoffice

# Linux
apt-get install libreoffice

# Python packages
pip install python-docx python-pptx pypandoc
```

### Alternative: Docker Container
Use a Docker container with LibreOffice pre-installed for consistent conversion across environments.

## Testing Strategy

1. **Unit Tests**
   - Test each conversion method
   - Mock LibreOffice calls
   - Test error handling

2. **Integration Tests**
   - End-to-end Office → OCR pipeline
   - Test with various Office versions
   - Performance benchmarks

3. **Sample Documents**
   - Simple text documents
   - Documents with tables
   - Documents with images
   - Presentations with multiple slides
   - Legacy formats (DOC, PPT)
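
Unit tests that mock LibreOffice calls might look like the sketch below, which patches `subprocess.run` with `unittest.mock` so no real conversion runs. The `convert_to_pdf` function here is a hypothetical stand-in for the converter under test, not the project's code.

```python
import subprocess
import unittest
from pathlib import Path
from unittest import mock


def convert_to_pdf(source: Path, out_dir: Path) -> Path:
    """Hypothetical stand-in for the converter under test."""
    subprocess.run(
        ["libreoffice", "--headless", "--convert-to", "pdf",
         "--outdir", str(out_dir), str(source)],
        check=True, timeout=60,
    )
    return out_dir / f"{source.stem}.pdf"


class ConvertToPdfTest(unittest.TestCase):
    def test_invokes_libreoffice_headless(self):
        # Replace subprocess.run so LibreOffice is never started.
        with mock.patch("subprocess.run") as fake_run:
            result = convert_to_pdf(Path("slides.pptx"), Path("out"))
        fake_run.assert_called_once()
        command = fake_run.call_args.args[0]
        self.assertEqual(command[:2], ["libreoffice", "--headless"])
        self.assertEqual(result, Path("out") / "slides.pdf")

    def test_propagates_conversion_failure(self):
        # Simulate a non-zero exit code from the converter.
        error = subprocess.CalledProcessError(1, "libreoffice")
        with mock.patch("subprocess.run", side_effect=error):
            with self.assertRaises(subprocess.CalledProcessError):
                convert_to_pdf(Path("broken.docx"), Path("out"))
```

Integration tests with real sample documents would still be needed, since mocks cannot show whether LibreOffice renders a given file correctly.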