# Technical Design

## Architecture Overview

```
User Upload (DOC/DOCX/PPT/PPTX)
    ↓
File Validation & Storage
    ↓
Format Detection
    ↓
Office Document Converter
    ↓
PDF Generation
    ↓
PDF to Images (existing)
    ↓
PaddleOCR Processing (existing)
    ↓
Results & Export
```

## Component Design

### 1. Office Document Converter Service

```python
# app/services/office_converter.py
from pathlib import Path


class OfficeConverter:
    """Convert Office documents to PDF for OCR processing"""

    def convert_to_pdf(self, file_path: Path) -> Path:
        """Main conversion dispatcher"""

    def convert_docx_to_pdf(self, docx_path: Path) -> Path:
        """Convert DOCX to PDF using python-docx and pypandoc"""

    def convert_doc_to_pdf(self, doc_path: Path) -> Path:
        """Convert legacy DOC to PDF"""

    def convert_pptx_to_pdf(self, pptx_path: Path) -> Path:
        """Convert PPTX to PDF using python-pptx"""

    def convert_ppt_to_pdf(self, ppt_path: Path) -> Path:
        """Convert legacy PPT to PDF"""
```

### 2. OCR Service Integration

```python
# Extend app/services/ocr_service.py
def process_image(self, image_path: Path, ...):
    # Check file type
    if is_office_document(image_path):
        # Convert to PDF first
        pdf_path = self.office_converter.convert_to_pdf(image_path)
        # Use existing PDF processing
        return self.process_pdf(pdf_path, ...)
    elif is_pdf:
        # Existing PDF processing
        ...
    else:
        # Existing image processing
        ...
```

### 3. File Format Detection

```python
from pathlib import Path

# Supported Office extensions mapped to their MIME types
OFFICE_FORMATS = {
    '.doc': 'application/msword',
    '.docx': 'application/vnd.openxmlformats-officedocument.wordprocessingml.document',
    '.ppt': 'application/vnd.ms-powerpoint',
    '.pptx': 'application/vnd.openxmlformats-officedocument.presentationml.presentation'
}


def is_office_document(file_path: Path) -> bool:
    # Case-insensitive match on the file extension only
    return file_path.suffix.lower() in OFFICE_FORMATS
```

## Library Selection

### For Word Documents

- **python-docx**: Read/write DOCX files
- **doc2pdf**: Simple conversion (requires LibreOffice)
- Alternative: **pypandoc** with pandoc backend

### For PowerPoint Documents

- **python-pptx**: Read/write PPTX files
- **unoconv**: Universal Office Converter (requires LibreOffice)

### Recommended Approach

Use **LibreOffice** in headless mode as a single conversion path for all formats:

```bash
libreoffice --headless --convert-to pdf input.docx
```

This provides:

- Support for all four target formats (DOC, DOCX, PPT, PPTX)
- High-fidelity conversion
- Active community maintenance

## Configuration Changes

### Token Expiration

```python
# app/core/config.py
class Settings(BaseSettings):
    # Change from 30 to 1440 (24 hours)
    access_token_expire_minutes: int = 1440
```

### File Upload Limits

```python
from typing import Set

# Office files can be larger than images, so raise the limit
max_file_size: int = 100 * 1024 * 1024  # 100 MB
allowed_extensions: Set[str] = {
    '.png', '.jpg', '.jpeg', '.pdf',
    '.doc', '.docx', '.ppt', '.pptx'
}
```

## Error Handling

1. **Conversion Failures**
   - Corrupted Office files
   - Unsupported Office features
   - LibreOffice not installed

2. **Performance Considerations**
   - Office conversion is CPU intensive
   - Consider queuing for large files
   - Add conversion timeout (60 seconds)

3. **Security**
   - Validate Office files before processing
   - Scan for macros/embedded objects
   - Sandbox conversion process

## Dependencies

### System Requirements

```bash
# macOS
brew install --cask libreoffice

# Linux (Debian/Ubuntu)
apt-get install libreoffice

# Python packages
pip install python-docx python-pptx pypandoc
```

### Alternative: Docker Container

Use a Docker container with LibreOffice pre-installed for consistent conversion across environments.

## Testing Strategy

1. **Unit Tests**
   - Test each conversion method
   - Mock LibreOffice calls
   - Test error handling

2. **Integration Tests**
   - End-to-end Office → OCR pipeline
   - Test with various Office versions
   - Performance benchmarks

3. **Sample Documents**
   - Simple text documents
   - Documents with tables
   - Documents with images
   - Presentations with multiple slides
   - Legacy formats (DOC, PPT)
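
To make the "Mock LibreOffice calls" and "Test error handling" items more concrete, here is a minimal sketch of a converter helper and two unit tests that patch `subprocess.run`, so no LibreOffice installation is needed to run them. The names `convert_with_libreoffice` and `ConversionError` are hypothetical and not part of the existing codebase. The sketch also assumes the executable is on `PATH` as `libreoffice` (some installations expose it as `soffice`), and it takes the 60-second timeout from the Error Handling section.

```python
import subprocess
from pathlib import Path
from unittest import mock

CONVERSION_TIMEOUT_SECONDS = 60  # value suggested under Error Handling


class ConversionError(Exception):
    """Hypothetical error type for any failed Office-to-PDF conversion."""


def convert_with_libreoffice(src: Path, out_dir: Path) -> Path:
    """Convert an Office file to PDF via LibreOffice headless mode (sketch)."""
    cmd = [
        "libreoffice", "--headless",
        "--convert-to", "pdf",
        "--outdir", str(out_dir),
        str(src),
    ]
    try:
        subprocess.run(cmd, check=True, capture_output=True,
                       timeout=CONVERSION_TIMEOUT_SECONDS)
    except FileNotFoundError as exc:
        # Raised by subprocess when the executable cannot be found
        raise ConversionError("LibreOffice is not installed") from exc
    except subprocess.TimeoutExpired as exc:
        raise ConversionError("conversion timed out") from exc
    except subprocess.CalledProcessError as exc:
        raise ConversionError(
            f"LibreOffice exited with code {exc.returncode}") from exc

    # LibreOffice names the output after the input file's stem
    pdf_path = out_dir / (src.stem + ".pdf")
    if not pdf_path.exists():
        raise ConversionError("LibreOffice produced no output file")
    return pdf_path


def test_successful_conversion(tmp_path: Path) -> None:
    def fake_run(cmd, **kwargs):
        # Emulate LibreOffice by writing the expected output file
        (tmp_path / "report.pdf").write_bytes(b"%PDF-1.4")
        return subprocess.CompletedProcess(cmd, 0)

    with mock.patch("subprocess.run", side_effect=fake_run) as run:
        result = convert_with_libreoffice(tmp_path / "report.docx", tmp_path)

    assert result == tmp_path / "report.pdf"
    assert run.call_args.kwargs["timeout"] == CONVERSION_TIMEOUT_SECONDS


def test_timeout_is_reported(tmp_path: Path) -> None:
    err = subprocess.TimeoutExpired(cmd="libreoffice", timeout=60)
    with mock.patch("subprocess.run", side_effect=err):
        try:
            convert_with_libreoffice(tmp_path / "slides.pptx", tmp_path)
        except ConversionError as exc:
            assert "timed out" in str(exc)
        else:
            raise AssertionError("expected ConversionError")
```

The test functions take a `tmp_path` argument so that pytest can supply its built-in temporary-directory fixture, but they can also be called directly with any writable `Path`. Mapping every failure mode to one exception type is a design choice for this sketch, since it gives the OCR service a single error to catch; the real service may prefer distinct error types per failure.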