OCR/openspec/changes/add-office-document-support/design.md
beabigegg da700721fa first
2025-11-12 22:53:17 +08:00


# Technical Design
## Architecture Overview
```
User Upload (DOC/DOCX/PPT/PPTX)
    ↓
File Validation & Storage
    ↓
Format Detection
    ↓
Office Document Converter
    ↓
PDF Generation
    ↓
PDF to Images (existing)
    ↓
PaddleOCR Processing (existing)
    ↓
Results & Export
```
## Component Design
### 1. Office Document Converter Service
```python
# app/services/office_converter.py
from pathlib import Path


class OfficeConverter:
    """Convert Office documents to PDF for OCR processing."""

    def convert_to_pdf(self, file_path: Path) -> Path:
        """Main conversion dispatcher."""

    def convert_docx_to_pdf(self, docx_path: Path) -> Path:
        """Convert DOCX to PDF using python-docx and pypandoc."""

    def convert_doc_to_pdf(self, doc_path: Path) -> Path:
        """Convert legacy DOC to PDF."""

    def convert_pptx_to_pdf(self, pptx_path: Path) -> Path:
        """Convert PPTX to PDF using python-pptx."""

    def convert_ppt_to_pdf(self, ppt_path: Path) -> Path:
        """Convert legacy PPT to PDF."""
```
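One way the dispatcher could route on the file suffix is sketched below. The `CONVERTER_METHODS` table and the `pick_converter` helper are hypothetical names introduced for illustration; only the method names come from the class above.

```python
from pathlib import Path

# Hypothetical lookup table: file suffix -> OfficeConverter method name.
CONVERTER_METHODS = {
    ".docx": "convert_docx_to_pdf",
    ".doc": "convert_doc_to_pdf",
    ".pptx": "convert_pptx_to_pdf",
    ".ppt": "convert_ppt_to_pdf",
}


def pick_converter(file_path: Path) -> str:
    """Return the name of the method that should handle file_path."""
    suffix = file_path.suffix.lower()
    try:
        return CONVERTER_METHODS[suffix]
    except KeyError:
        # Unknown or missing suffix: refuse rather than guess a converter.
        raise ValueError(f"Unsupported Office format: {suffix or '(none)'}") from None
```

Inside `convert_to_pdf`, the dispatch could then be a single line such as `return getattr(self, pick_converter(file_path))(file_path)`.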
### 2. OCR Service Integration
```python
# Extend app/services/ocr_service.py
def process_image(self, image_path: Path, **kwargs):
    # **kwargs stands in for the method's existing parameters
    if is_office_document(image_path):
        # Convert to PDF first
        pdf_path = self.office_converter.convert_to_pdf(image_path)
        # Use existing PDF processing
        return self.process_pdf(pdf_path, **kwargs)
    elif image_path.suffix.lower() == '.pdf':
        # Existing PDF processing
        ...
    else:
        # Existing image processing
        ...
```
### 3. File Format Detection
```python
from pathlib import Path

# Supported Office extensions mapped to their MIME types
OFFICE_FORMATS = {
    '.doc': 'application/msword',
    '.docx': 'application/vnd.openxmlformats-officedocument.wordprocessingml.document',
    '.ppt': 'application/vnd.ms-powerpoint',
    '.pptx': 'application/vnd.openxmlformats-officedocument.presentationml.presentation'
}


def is_office_document(file_path: Path) -> bool:
    """Return True if the file extension is a supported Office format."""
    return file_path.suffix.lower() in OFFICE_FORMATS
```
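The extension check trusts the filename, so a cheap content check may be worth adding on top. DOCX and PPTX files are ZIP containers that begin with `PK\x03\x04`, while legacy DOC and PPT files are OLE2 compound files that begin with `D0 CF 11 E0 A1 B1 1A E1`. The sketch below compares the leading bytes against the signature expected for the extension; `EXPECTED_MAGIC` and `has_expected_signature` are hypothetical names, not part of the existing codebase.

```python
from pathlib import Path

ZIP_MAGIC = b"PK\x03\x04"                           # OOXML (DOCX/PPTX) container
OLE2_MAGIC = bytes.fromhex("D0CF11E0A1B11AE1")      # legacy DOC/PPT container

# Hypothetical table: extension -> signature the file should start with.
EXPECTED_MAGIC = {
    ".docx": ZIP_MAGIC,
    ".pptx": ZIP_MAGIC,
    ".doc": OLE2_MAGIC,
    ".ppt": OLE2_MAGIC,
}


def has_expected_signature(file_path: Path) -> bool:
    """Return True if the file's leading bytes match its Office extension."""
    expected = EXPECTED_MAGIC.get(file_path.suffix.lower())
    if expected is None:
        return False  # not an Office extension at all
    with file_path.open("rb") as fh:
        return fh.read(len(expected)) == expected
```

A matching signature only shows that the container type is plausible; it does not prove the document is well formed, so conversion errors still need handling.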
## Library Selection
### For Word Documents
- **python-docx**: Read/write DOCX files
- **doc2pdf**: Simple conversion (requires LibreOffice)
- Alternative: **pypandoc** with pandoc backend
### For PowerPoint Documents
- **python-pptx**: Read/write PPTX files
- **unoconv**: Universal Office Converter (requires LibreOffice)
### Recommended Approach
Use **LibreOffice** headless mode for universal conversion:
```bash
libreoffice --headless --convert-to pdf input.docx
```
This provides:
- Support for all four target formats (DOC, DOCX, PPT, PPTX) through a single tool
- High-fidelity conversion
- An actively maintained open-source codebase
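
From Python, the headless call could be wrapped roughly as follows. This is a sketch under assumptions: it expects a `libreoffice` executable on `PATH` (some installs expose it as `soffice` instead), uses the `--outdir` option to control where the PDF lands, and the function names are hypothetical.

```python
import subprocess
from pathlib import Path


def build_convert_command(input_path: Path, output_dir: Path) -> list[str]:
    """Build the LibreOffice headless command line for a PDF conversion."""
    return [
        "libreoffice",
        "--headless",
        "--convert-to", "pdf",
        "--outdir", str(output_dir),
        str(input_path),
    ]


def convert_with_libreoffice(
    input_path: Path, output_dir: Path, timeout: float = 60.0
) -> Path:
    """Run the conversion and return the path of the generated PDF."""
    output_dir.mkdir(parents=True, exist_ok=True)
    subprocess.run(
        build_convert_command(input_path, output_dir),
        check=True,            # raise CalledProcessError on a non-zero exit
        capture_output=True,
        timeout=timeout,       # raise TimeoutExpired if conversion hangs
    )
    # LibreOffice names the output after the input file's stem.
    pdf_path = output_dir / f"{input_path.stem}.pdf"
    if not pdf_path.exists():
        raise RuntimeError(f"No PDF was produced for {input_path.name}")
    return pdf_path
```

Because one command handles every format, the four format-specific methods could all delegate to this single function.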
## Configuration Changes
### Token Expiration
```python
# app/core/config.py
class Settings(BaseSettings):
    # Change from 30 to 1440 minutes (24 hours)
    access_token_expire_minutes: int = 1440
```
### File Upload Limits
```python
# Office files can be larger than images
max_file_size: int = 100 * 1024 * 1024  # 100 MB
allowed_extensions: Set[str] = {
    '.png', '.jpg', '.jpeg', '.pdf',
    '.doc', '.docx', '.ppt', '.pptx'
}
```
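These two settings could be enforced at upload time along the following lines. The constants mirror the values above; `validate_upload` is a hypothetical helper, and how it hooks into the existing upload endpoint is left open.

```python
from pathlib import Path

MAX_FILE_SIZE = 100 * 1024 * 1024  # 100 MB, mirrors max_file_size
ALLOWED_EXTENSIONS = {
    ".png", ".jpg", ".jpeg", ".pdf",
    ".doc", ".docx", ".ppt", ".pptx",
}


def validate_upload(filename: str, size_bytes: int) -> None:
    """Raise ValueError if the upload has a bad extension or is too large."""
    suffix = Path(filename).suffix.lower()
    if suffix not in ALLOWED_EXTENSIONS:
        raise ValueError(f"Extension not allowed: {suffix or '(none)'}")
    if size_bytes > MAX_FILE_SIZE:
        raise ValueError(
            f"File is {size_bytes} bytes; the limit is {MAX_FILE_SIZE} bytes"
        )
```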
## Error Handling
1. **Conversion Failures**
   - Corrupted Office files
   - Unsupported Office features
   - LibreOffice not installed
2. **Performance Considerations**
   - Office conversion is CPU intensive
   - Consider queuing for large files
   - Add conversion timeout (60 seconds)
3. **Security**
   - Validate Office files before processing
   - Scan for macros/embedded objects
   - Sandbox conversion process
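
The failure modes listed above could be mapped onto distinct exceptions so that callers can tell them apart. In this sketch the exception classes and `run_conversion` are hypothetical names; the `runner` parameter exists only so the subprocess call can be replaced in tests.

```python
import subprocess
from typing import Callable, Sequence


class ConversionError(Exception):
    """Base class for Office conversion failures."""


class ConverterNotInstalledError(ConversionError):
    """The converter executable could not be found."""


class ConversionTimeoutError(ConversionError):
    """The conversion exceeded the allowed time."""


class ConversionFailedError(ConversionError):
    """The converter exited with an error (e.g. corrupted input)."""


def run_conversion(
    cmd: Sequence[str],
    timeout: float = 60.0,
    runner: Callable[..., subprocess.CompletedProcess] = subprocess.run,
) -> None:
    """Run a conversion command, translating low-level errors."""
    try:
        result = runner(list(cmd), capture_output=True, timeout=timeout)
    except FileNotFoundError as exc:
        raise ConverterNotInstalledError(f"{cmd[0]} not found on PATH") from exc
    except subprocess.TimeoutExpired as exc:
        raise ConversionTimeoutError(f"Conversion exceeded {timeout}s") from exc
    if result.returncode != 0:
        stderr = result.stderr.decode(errors="replace") if result.stderr else ""
        raise ConversionFailedError(
            f"Converter exited with code {result.returncode}: {stderr.strip()}"
        )
```

A zero exit code alone may not guarantee that a PDF was written, so checking for the output file afterwards is a sensible extra step.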
## Dependencies
### System Requirements
```bash
# macOS
brew install libreoffice
# Linux
apt-get install libreoffice
# Python packages
pip install python-docx python-pptx pypandoc
```
### Alternative: Docker Container
Use a Docker container with LibreOffice pre-installed for consistent conversion across environments.
## Testing Strategy
1. **Unit Tests**
   - Test each conversion method
   - Mock LibreOffice calls
   - Test error handling
2. **Integration Tests**
   - End-to-end Office → OCR pipeline
   - Test with various Office versions
   - Performance benchmarks
3. **Sample Documents**
   - Simple text documents
   - Documents with tables
   - Documents with images
   - Presentations with multiple slides
   - Legacy formats (DOC, PPT)
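
Mocking the LibreOffice call in a unit test could look like the sketch below, so the test suite does not need LibreOffice installed. The `convert_to_pdf` function here is a hypothetical stand-in for the real converter, included only to keep the example self-contained.

```python
import subprocess
import unittest
from pathlib import Path
from unittest import mock


def convert_to_pdf(input_path: Path, output_dir: Path) -> Path:
    """Hypothetical stand-in for the converter under test."""
    subprocess.run(
        ["libreoffice", "--headless", "--convert-to", "pdf",
         "--outdir", str(output_dir), str(input_path)],
        check=True,
        timeout=60,
    )
    return output_dir / f"{input_path.stem}.pdf"


class ConvertToPdfTest(unittest.TestCase):
    def test_invokes_libreoffice_headless(self):
        # Replace subprocess.run so no real process is started.
        with mock.patch("subprocess.run") as fake_run:
            result = convert_to_pdf(Path("report.docx"), Path("out"))
        cmd = fake_run.call_args.args[0]
        self.assertIn("--headless", cmd)
        self.assertEqual(result, Path("out") / "report.pdf")

    def test_propagates_conversion_failure(self):
        # Simulate LibreOffice exiting with an error.
        err = subprocess.CalledProcessError(1, "libreoffice")
        with mock.patch("subprocess.run", side_effect=err):
            with self.assertRaises(subprocess.CalledProcessError):
                convert_to_pdf(Path("broken.doc"), Path("out"))
```

The integration tests would then cover the real LibreOffice binary against the sample documents.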