Technical Design
Architecture Overview
User Upload (DOC/DOCX/PPT/PPTX)
↓
File Validation & Storage
↓
Format Detection
↓
Office Document Converter
↓
PDF Generation
↓
PDF to Images (existing)
↓
PaddleOCR Processing (existing)
↓
Results & Export
Component Design
1. Office Document Converter Service
# app/services/office_converter.py
from pathlib import Path


class OfficeConverter:
    """Convert Office documents to PDF for OCR processing"""

    def convert_to_pdf(self, file_path: Path) -> Path:
        """Main conversion dispatcher"""

    def convert_docx_to_pdf(self, docx_path: Path) -> Path:
        """Convert DOCX to PDF using python-docx and pypandoc"""

    def convert_doc_to_pdf(self, doc_path: Path) -> Path:
        """Convert legacy DOC to PDF"""

    def convert_pptx_to_pdf(self, pptx_path: Path) -> Path:
        """Convert PPTX to PDF using python-pptx"""

    def convert_ppt_to_pdf(self, ppt_path: Path) -> Path:
        """Convert legacy PPT to PDF"""
2. OCR Service Integration
# Extend app/services/ocr_service.py
def process_image(self, image_path: Path, ...):
    # Check file type
    if is_office_document(image_path):
        # Convert to PDF first
        pdf_path = self.office_converter.convert_to_pdf(image_path)
        # Use existing PDF processing
        return self.process_pdf(pdf_path, ...)
    elif is_pdf:
        # Existing PDF processing
        ...
    else:
        # Existing image processing
        ...
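The integration above depends on convert_to_pdf routing each file to the right method. The following is a minimal sketch of how that dispatcher might work, not the project's implementation. The class name OfficeConverterSketch, the UnsupportedFormatError type, and the _stub helper are hypothetical; the stub only computes the expected output path and performs no conversion.

```python
from pathlib import Path
from typing import Callable, Dict


class UnsupportedFormatError(ValueError):
    """Hypothetical error type for extensions the converter does not handle."""


class OfficeConverterSketch:
    """Illustrative stand-in for OfficeConverter; real conversion is stubbed out."""

    def __init__(self) -> None:
        # Map each lowercase extension to the method that handles it.
        self._handlers: Dict[str, Callable[[Path], Path]] = {
            ".docx": self.convert_docx_to_pdf,
            ".doc": self.convert_doc_to_pdf,
            ".pptx": self.convert_pptx_to_pdf,
            ".ppt": self.convert_ppt_to_pdf,
        }

    def convert_to_pdf(self, file_path: Path) -> Path:
        """Pick a handler by extension (case-insensitive) and delegate to it."""
        handler = self._handlers.get(file_path.suffix.lower())
        if handler is None:
            raise UnsupportedFormatError(
                f"Unsupported Office format: {file_path.suffix!r}"
            )
        return handler(file_path)

    def _stub(self, source: Path) -> Path:
        # Placeholder: returns where the PDF would be written, converts nothing.
        return source.with_suffix(".pdf")

    def convert_docx_to_pdf(self, docx_path: Path) -> Path:
        return self._stub(docx_path)

    def convert_doc_to_pdf(self, doc_path: Path) -> Path:
        return self._stub(doc_path)

    def convert_pptx_to_pdf(self, pptx_path: Path) -> Path:
        return self._stub(pptx_path)

    def convert_ppt_to_pdf(self, ppt_path: Path) -> Path:
        return self._stub(ppt_path)
```

A lookup table keeps the dispatcher free of long if/elif chains and makes an unsupported extension fail loudly instead of falling through to image processing.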
3. File Format Detection
OFFICE_FORMATS = {
    '.doc': 'application/msword',
    '.docx': 'application/vnd.openxmlformats-officedocument.wordprocessingml.document',
    '.ppt': 'application/vnd.ms-powerpoint',
    '.pptx': 'application/vnd.openxmlformats-officedocument.presentationml.presentation'
}


def is_office_document(file_path: Path) -> bool:
    return file_path.suffix.lower() in OFFICE_FORMATS
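An extension check only trusts the file name. As a possible complement, the sketch below compares the first bytes of the file against the container signature its extension implies: DOCX and PPTX are ZIP archives, while legacy DOC and PPT use the OLE2 compound file format. The function name looks_like_office_file and the EXPECTED_MAGIC table are assumptions for illustration, and a matching signature does not prove the file is a valid document.

```python
from pathlib import Path

# Container signatures: OOXML files are ZIP archives, legacy files are OLE2.
ZIP_MAGIC = b"PK\x03\x04"
OLE2_MAGIC = b"\xd0\xcf\x11\xe0\xa1\xb1\x1a\xe1"

# Hypothetical lookup from extension to the signature we expect to see.
EXPECTED_MAGIC = {
    ".docx": ZIP_MAGIC,
    ".pptx": ZIP_MAGIC,
    ".doc": OLE2_MAGIC,
    ".ppt": OLE2_MAGIC,
}


def looks_like_office_file(file_path: Path) -> bool:
    """Return True if the leading bytes match the container for this extension."""
    expected = EXPECTED_MAGIC.get(file_path.suffix.lower())
    if expected is None:
        return False
    with file_path.open("rb") as handle:
        return handle.read(len(expected)) == expected
```

This catches the common case of a renamed file (for example a PDF uploaded as report.docx) before it reaches the converter.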
Library Selection
For Word Documents
- python-docx: Read/write DOCX files
- doc2pdf: Simple conversion (requires LibreOffice)
- Alternative: pypandoc with pandoc backend
For PowerPoint Documents
- python-pptx: Read/write PPTX files
- unoconv: Universal Office Converter (requires LibreOffice)
Recommended Approach
Use LibreOffice headless mode for universal conversion:
libreoffice --headless --convert-to pdf input.docx
This provides:
- Support for all Office formats
- High-fidelity conversion
- Maintained by an active community
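The command above could be wrapped in Python roughly as follows. This is a sketch under assumptions: the function names are hypothetical, the binary is assumed to be on PATH as libreoffice (some installs expose it as soffice), and --outdir is added so the output location is predictable. The 60-second default mirrors the conversion timeout proposed under Error Handling.

```python
import subprocess
from pathlib import Path
from typing import List


def build_convert_command(
    source: Path, out_dir: Path, binary: str = "libreoffice"
) -> List[str]:
    """Build the headless conversion command as an argument list (no shell)."""
    return [
        binary,
        "--headless",
        "--convert-to", "pdf",
        "--outdir", str(out_dir),
        str(source),
    ]


def convert_with_libreoffice(
    source: Path, out_dir: Path, timeout: float = 60.0
) -> Path:
    """Run the conversion and return the path of the generated PDF.

    Raises FileNotFoundError if the binary is missing,
    subprocess.CalledProcessError on a non-zero exit status, and
    subprocess.TimeoutExpired if the conversion exceeds `timeout` seconds.
    """
    command = build_convert_command(source, out_dir)
    subprocess.run(command, check=True, capture_output=True, timeout=timeout)
    pdf_path = out_dir / (source.stem + ".pdf")
    # Extra guard: confirm the output file was actually produced.
    if not pdf_path.exists():
        raise RuntimeError(f"No PDF was produced for {source}")
    return pdf_path
```

Passing the command as a list avoids shell quoting problems with user-supplied file names.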
Configuration Changes
Token Expiration
# app/core/config.py
class Settings(BaseSettings):
    # Change from 30 to 1440 (24 hours)
    access_token_expire_minutes: int = 1440
File Upload Limits
# Office files can be larger than images, so allow for that
max_file_size: int = 100 * 1024 * 1024  # 100 MB
allowed_extensions: Set[str] = {
    '.png', '.jpg', '.jpeg', '.pdf',
    '.doc', '.docx', '.ppt', '.pptx'
}
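To illustrate how these two settings might be enforced at upload time, here is a small sketch. The function validate_upload and the module-level constants are hypothetical; the values are copied from the settings above.

```python
from pathlib import Path
from typing import Optional, Set

# Values copied from the proposed settings; the constant names are illustrative.
MAX_FILE_SIZE = 100 * 1024 * 1024  # 100 MB
ALLOWED_EXTENSIONS: Set[str] = {
    ".png", ".jpg", ".jpeg", ".pdf",
    ".doc", ".docx", ".ppt", ".pptx",
}


def validate_upload(filename: str, size_bytes: int) -> Optional[str]:
    """Return an error message, or None if the upload is acceptable."""
    suffix = Path(filename).suffix.lower()
    if suffix not in ALLOWED_EXTENSIONS:
        return f"Unsupported file type: {suffix or '(none)'}"
    if size_bytes > MAX_FILE_SIZE:
        return f"File exceeds the {MAX_FILE_SIZE} byte limit"
    return None
```

Lowercasing the suffix keeps the check consistent with is_office_document, so REPORT.DOCX and report.docx are treated alike.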
Error Handling
- Conversion Failures
  - Corrupted Office files
  - Unsupported Office features
  - LibreOffice not installed
- Performance Considerations
  - Office conversion is CPU intensive
  - Consider queuing for large files
  - Add conversion timeout (60 seconds)
- Security
  - Validate Office files before processing
  - Scan for macros/embedded objects
  - Sandbox conversion process
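For the macro scan, one lightweight option for the ZIP-based formats is to look for a vbaProject.bin part inside the archive, which is where Office stores VBA code. The sketch below assumes that approach; the function name contains_vba_macros is hypothetical. It does not cover legacy DOC/PPT files, which would need an OLE parser (a tool such as oletools is one candidate, not something this design currently specifies), and it does not detect other embedded objects.

```python
import zipfile
from pathlib import Path


def contains_vba_macros(ooxml_path: Path) -> bool:
    """Return True if a ZIP-based Office file carries a vbaProject.bin part.

    Raises zipfile.BadZipFile if the file is not a readable ZIP archive,
    which callers can treat as the "corrupted Office file" failure case.
    """
    with zipfile.ZipFile(ooxml_path) as archive:
        return any(
            name.lower().endswith("vbaproject.bin")
            for name in archive.namelist()
        )
```

Because a corrupted archive raises before any conversion starts, the same call can double as a cheap validity check ahead of the LibreOffice step.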
Dependencies
System Requirements
# macOS
brew install libreoffice
# Linux (Debian/Ubuntu)
apt-get install libreoffice
# Python packages
pip install python-docx python-pptx pypandoc
Alternative: Docker Container
Use a Docker container with LibreOffice pre-installed for consistent conversion across environments.
Testing Strategy
- Unit Tests
  - Test each conversion method
  - Mock LibreOffice calls
  - Test error handling
- Integration Tests
  - End-to-end Office → OCR pipeline
  - Test with various Office versions
  - Performance benchmarks
- Sample Documents
  - Simple text documents
  - Documents with tables
  - Documents with images
  - Presentations with multiple slides
  - Legacy formats (DOC, PPT)
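As an illustration of the "Mock LibreOffice calls" item, the sketch below patches subprocess.run so the unit tests run without LibreOffice installed. The convert_to_pdf function here is a hypothetical stand-in for the code under test, not the project's converter.

```python
import subprocess
import unittest
from pathlib import Path
from unittest import mock


def convert_to_pdf(source: Path, out_dir: Path) -> Path:
    """Hypothetical function under test: shells out to LibreOffice."""
    subprocess.run(
        ["libreoffice", "--headless", "--convert-to", "pdf",
         "--outdir", str(out_dir), str(source)],
        check=True,
        timeout=60,
    )
    return out_dir / (source.stem + ".pdf")


class ConvertToPdfTests(unittest.TestCase):
    @mock.patch("subprocess.run")
    def test_invokes_libreoffice_headless(self, fake_run):
        result = convert_to_pdf(Path("report.docx"), Path("out"))
        self.assertEqual(result, Path("out") / "report.pdf")
        # First positional argument of the mocked call is the command list.
        command = fake_run.call_args[0][0]
        self.assertEqual(
            command[:4], ["libreoffice", "--headless", "--convert-to", "pdf"]
        )

    @mock.patch("subprocess.run", side_effect=FileNotFoundError)
    def test_missing_libreoffice_is_reported(self, fake_run):
        # Simulates the "LibreOffice not installed" failure case.
        with self.assertRaises(FileNotFoundError):
            convert_to_pdf(Path("report.docx"), Path("out"))
```

Saved as a test module, these cases can be run with python -m unittest; the integration tests would then exercise a real LibreOffice install separately.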