# Technical Design

## Architecture Overview

```
User Upload (DOC/DOCX/PPT/PPTX)
            ↓
File Validation & Storage
            ↓
Format Detection
            ↓
Office Document Converter
            ↓
PDF Generation
            ↓
PDF to Images (existing)
            ↓
PaddleOCR Processing (existing)
            ↓
Results & Export
```

## Component Design

### 1. Office Document Converter Service

```python
# app/services/office_converter.py
from pathlib import Path


class OfficeConverter:
    """Convert Office documents to PDF for OCR processing"""

    def convert_to_pdf(self, file_path: Path) -> Path:
        """Main conversion dispatcher"""

    def convert_docx_to_pdf(self, docx_path: Path) -> Path:
        """Convert DOCX to PDF using python-docx and pypandoc"""

    def convert_doc_to_pdf(self, doc_path: Path) -> Path:
        """Convert legacy DOC to PDF"""

    def convert_pptx_to_pdf(self, pptx_path: Path) -> Path:
        """Convert PPTX to PDF using python-pptx"""

    def convert_ppt_to_pdf(self, ppt_path: Path) -> Path:
        """Convert legacy PPT to PDF"""
```
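
The dispatcher in `convert_to_pdf` could route on the file suffix. The following is a minimal sketch of that step only, not the actual implementation: the class name `OfficeConverterSketch`, the `_handlers` table, and the `_placeholder` helper are hypothetical, and the per-format methods merely compute the target path instead of converting anything.

```python
from pathlib import Path
from typing import Callable, Dict


class OfficeConverterSketch:
    """Hypothetical illustration of suffix-based dispatch (no real conversion)."""

    def __init__(self) -> None:
        # Map each supported suffix to the method that would handle it.
        self._handlers: Dict[str, Callable[[Path], Path]] = {
            ".docx": self.convert_docx_to_pdf,
            ".doc": self.convert_doc_to_pdf,
            ".pptx": self.convert_pptx_to_pdf,
            ".ppt": self.convert_ppt_to_pdf,
        }

    def convert_to_pdf(self, file_path: Path) -> Path:
        """Pick a handler by lower-cased suffix, or fail loudly."""
        suffix = file_path.suffix.lower()
        handler = self._handlers.get(suffix)
        if handler is None:
            raise ValueError(f"Unsupported Office format: {suffix!r}")
        return handler(file_path)

    def convert_docx_to_pdf(self, docx_path: Path) -> Path:
        return self._placeholder(docx_path)

    def convert_doc_to_pdf(self, doc_path: Path) -> Path:
        return self._placeholder(doc_path)

    def convert_pptx_to_pdf(self, pptx_path: Path) -> Path:
        return self._placeholder(pptx_path)

    def convert_ppt_to_pdf(self, ppt_path: Path) -> Path:
        return self._placeholder(ppt_path)

    @staticmethod
    def _placeholder(path: Path) -> Path:
        # Stand-in for a real conversion: only derives the PDF path.
        return path.with_suffix(".pdf")
```

Keeping the suffix table in one place means a new format needs one new entry and one new method, and unsupported files are rejected before any conversion work starts.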

### 2. OCR Service Integration

```python
# Extend app/services/ocr_service.py

# *args/**kwargs stand in for the existing parameters, which are elided here
def process_image(self, image_path: Path, *args, **kwargs):
    # Check file type
    if is_office_document(image_path):
        # Convert to PDF first
        pdf_path = self.office_converter.convert_to_pdf(image_path)
        # Use existing PDF processing
        return self.process_pdf(pdf_path, *args, **kwargs)
    elif image_path.suffix.lower() == '.pdf':
        # Existing PDF processing
        ...
    else:
        # Existing image processing
        ...
```

### 3. File Format Detection

```python
from pathlib import Path

# Supported Office extensions mapped to their MIME types
OFFICE_FORMATS = {
    '.doc': 'application/msword',
    '.docx': 'application/vnd.openxmlformats-officedocument.wordprocessingml.document',
    '.ppt': 'application/vnd.ms-powerpoint',
    '.pptx': 'application/vnd.openxmlformats-officedocument.presentationml.presentation'
}


def is_office_document(file_path: Path) -> bool:
    # Case-insensitive match on the file extension
    return file_path.suffix.lower() in OFFICE_FORMATS
```

## Library Selection

### For Word Documents
- **python-docx**: Read/write DOCX files
- **doc2pdf**: Simple conversion (requires LibreOffice)
- Alternative: **pypandoc** with pandoc backend

### For PowerPoint Documents
- **python-pptx**: Read/write PPTX files
- **unoconv**: Universal Office Converter (requires LibreOffice)

### Recommended Approach
Use **LibreOffice** in headless mode for universal conversion:
```bash
libreoffice --headless --convert-to pdf input.docx
```

This provides:
- Support for all Office formats
- High-fidelity conversion
- Active maintenance by its community
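
The headless command above can be driven from Python with `subprocess`. This is a rough sketch under stated assumptions rather than the service's real code: the function names `build_convert_command` and `convert_with_libreoffice` are hypothetical, the binary is assumed to be on `PATH` as `libreoffice` (some installs expose it as `soffice`), and the 60-second default mirrors the conversion timeout proposed under Error Handling.

```python
import subprocess
from pathlib import Path
from typing import List


def build_convert_command(input_path: Path, output_dir: Path,
                          binary: str = "libreoffice") -> List[str]:
    """Build the argument list for a headless PDF conversion."""
    return [
        binary,
        "--headless",
        "--convert-to", "pdf",
        "--outdir", str(output_dir),  # write the PDF here, not the CWD
        str(input_path),
    ]


def convert_with_libreoffice(input_path: Path, output_dir: Path,
                             timeout_seconds: int = 60) -> Path:
    """Run the conversion and return the path of the generated PDF."""
    output_dir.mkdir(parents=True, exist_ok=True)
    # check=True raises CalledProcessError on a non-zero exit code;
    # timeout raises TimeoutExpired if the conversion hangs.
    subprocess.run(
        build_convert_command(input_path, output_dir),
        check=True,
        capture_output=True,
        timeout=timeout_seconds,
    )
    pdf_path = output_dir / (input_path.stem + ".pdf")
    if not pdf_path.exists():
        raise RuntimeError(f"No PDF was produced for {input_path}")
    return pdf_path
```

Passing the arguments as a list avoids shell quoting issues with user-supplied file names, and checking for the output file guards against a run that exits cleanly without writing anything.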

## Configuration Changes

### Token Expiration
```python
# app/core/config.py
class Settings(BaseSettings):
    # Change from 30 to 1440 (24 hours)
    access_token_expire_minutes: int = 1440
```

### File Upload Limits
```python
from typing import Set

# Office files can be larger than images, so allow up to 100 MB
max_file_size: int = 100 * 1024 * 1024  # 100 MB
allowed_extensions: Set[str] = {
    '.png', '.jpg', '.jpeg', '.pdf',
    '.doc', '.docx', '.ppt', '.pptx'
}
```

## Error Handling

1. **Conversion Failures**
   - Corrupted Office files
   - Unsupported Office features
   - LibreOffice not installed

2. **Performance Considerations**
   - Office conversion is CPU intensive
   - Consider queuing for large files
   - Add conversion timeout (60 seconds)

3. **Security**
   - Validate Office files before processing
   - Scan for macros/embedded objects
   - Sandbox conversion process
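
One cheap form of pre-conversion validation is to compare the first bytes of the upload with the container signature its extension implies: DOCX and PPTX are ZIP archives, while legacy DOC and PPT use the OLE2 compound file format. The sketch below is illustrative only; the names `signature_matches` and `has_expected_signature` are hypothetical, and the check neither detects macros nor replaces sandboxing.

```python
from pathlib import Path

ZIP_MAGIC = b"PK\x03\x04"                          # DOCX/PPTX (ZIP container)
OLE2_MAGIC = b"\xd0\xcf\x11\xe0\xa1\xb1\x1a\xe1"   # DOC/PPT (OLE2 container)

EXPECTED_MAGIC = {
    ".docx": ZIP_MAGIC,
    ".pptx": ZIP_MAGIC,
    ".doc": OLE2_MAGIC,
    ".ppt": OLE2_MAGIC,
}


def signature_matches(suffix: str, header: bytes) -> bool:
    """Return True if the header starts with the signature for this suffix."""
    expected = EXPECTED_MAGIC.get(suffix.lower())
    return expected is not None and header.startswith(expected)


def has_expected_signature(file_path: Path) -> bool:
    """Read only the first 8 bytes of the file and check its signature."""
    with file_path.open("rb") as handle:
        header = handle.read(8)
    return signature_matches(file_path.suffix, header)
```

A mismatch, such as a `.docx` upload that does not start with the ZIP signature, can be rejected before LibreOffice is ever invoked.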

## Dependencies

### System Requirements
```bash
# macOS
brew install --cask libreoffice

# Linux (Debian/Ubuntu)
apt-get install libreoffice

# Python packages
pip install python-docx python-pptx pypandoc
```

### Alternative: Docker Container
Use a Docker container with LibreOffice pre-installed for consistent conversion across environments.

## Testing Strategy

1. **Unit Tests**
   - Test each conversion method
   - Mock LibreOffice calls
   - Test error handling

2. **Integration Tests**
   - End-to-end Office → OCR pipeline
   - Test with various Office versions
   - Performance benchmarks

3. **Sample Documents**
   - Simple text documents
   - Documents with tables
   - Documents with images
   - Presentations with multiple slides
   - Legacy formats (DOC, PPT)
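
Mocking the LibreOffice call lets unit tests run on machines without LibreOffice installed. The sketch below assumes a hypothetical function under test, `run_conversion`, that shells out through `subprocess.run`; the real converter may be structured differently.

```python
import subprocess
from pathlib import Path
from unittest import mock


def run_conversion(input_path: Path, output_dir: Path) -> Path:
    """Hypothetical function under test: shells out to LibreOffice."""
    subprocess.run(
        ["libreoffice", "--headless", "--convert-to", "pdf",
         "--outdir", str(output_dir), str(input_path)],
        check=True,
        timeout=60,
    )
    return output_dir / (input_path.stem + ".pdf")


def test_conversion_invokes_libreoffice() -> None:
    # Replace subprocess.run so no real process is started.
    with mock.patch("subprocess.run") as fake_run:
        result = run_conversion(Path("slides.pptx"), Path("out"))
    fake_run.assert_called_once()
    command = fake_run.call_args.args[0]
    assert command[0] == "libreoffice"
    assert "--headless" in command
    assert result == Path("out") / "slides.pdf"


def test_timeout_is_propagated() -> None:
    # Simulate a hung conversion by raising TimeoutExpired from the mock.
    timeout_error = subprocess.TimeoutExpired(cmd="libreoffice", timeout=60)
    with mock.patch("subprocess.run", side_effect=timeout_error):
        try:
            run_conversion(Path("big.docx"), Path("out"))
        except subprocess.TimeoutExpired:
            return
    raise AssertionError("expected TimeoutExpired")
```

Both functions follow the `test_` naming convention, so a runner such as pytest would collect them without extra configuration.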