chore: project cleanup and prepare for dual-track processing refactor

- Removed all test files and directories
- Deleted outdated documentation (will be rewritten)
- Cleaned up temporary files, logs, and uploads
- Archived 5 completed OpenSpec proposals
- Created new dual-track-document-processing proposal with complete OpenSpec structure
  - Dual-track architecture: OCR track (PaddleOCR) + Direct track (PyMuPDF)
  - UnifiedDocument model for consistent output
  - Support for structure-preserving translation
- Updated .gitignore to prevent future test/temp files

This is a major cleanup preparing for the complete refactoring of the document processing pipeline.

🤖 Generated with [Claude Code](https://claude.com/claude-code)

Co-Authored-By: Claude <noreply@anthropic.com>
@@ -0,0 +1,176 @@
# Technical Design

## Architecture Overview

```
User Upload (DOC/DOCX/PPT/PPTX)
    ↓
File Validation & Storage
    ↓
Format Detection
    ↓
Office Document Converter
    ↓
PDF Generation
    ↓
PDF to Images (existing)
    ↓
PaddleOCR Processing (existing)
    ↓
Results & Export
```

## Component Design

### 1. Office Document Converter Service

```python
# app/services/office_converter.py
from pathlib import Path


class OfficeConverter:
    """Convert Office documents to PDF for OCR processing"""

    def convert_to_pdf(self, file_path: Path) -> Path:
        """Main conversion dispatcher"""

    def convert_docx_to_pdf(self, docx_path: Path) -> Path:
        """Convert DOCX to PDF using python-docx and pypandoc"""

    def convert_doc_to_pdf(self, doc_path: Path) -> Path:
        """Convert legacy DOC to PDF"""

    def convert_pptx_to_pdf(self, pptx_path: Path) -> Path:
        """Convert PPTX to PDF using python-pptx"""

    def convert_ppt_to_pdf(self, ppt_path: Path) -> Path:
        """Convert legacy PPT to PDF"""
```
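
The "main conversion dispatcher" above is only a stub. One way it could route by file suffix is sketched below, as an illustration rather than the project's actual implementation. The class name `OfficeConverterSketch`, the `UnsupportedFormatError` exception, and the `_placeholder` helper are hypothetical names introduced for this example, and the per-format methods do no real conversion.

```python
from pathlib import Path


class UnsupportedFormatError(ValueError):
    """Hypothetical error for suffixes the converter does not handle."""


class OfficeConverterSketch:
    """Illustrative dispatcher only; per-format methods are placeholders."""

    def __init__(self) -> None:
        # Map each supported suffix to the method that handles it.
        self._handlers = {
            ".docx": self.convert_docx_to_pdf,
            ".doc": self.convert_doc_to_pdf,
            ".pptx": self.convert_pptx_to_pdf,
            ".ppt": self.convert_ppt_to_pdf,
        }

    def convert_to_pdf(self, file_path: Path) -> Path:
        # Suffix matching is case-insensitive, so "Report.DOCX" is accepted.
        handler = self._handlers.get(file_path.suffix.lower())
        if handler is None:
            raise UnsupportedFormatError(f"Unsupported format: {file_path.suffix}")
        return handler(file_path)

    def _placeholder(self, path: Path) -> Path:
        # Assumption: a real method would run the conversion here.
        # This sketch only computes the expected output path.
        return path.with_suffix(".pdf")

    def convert_docx_to_pdf(self, docx_path: Path) -> Path:
        return self._placeholder(docx_path)

    def convert_doc_to_pdf(self, doc_path: Path) -> Path:
        return self._placeholder(doc_path)

    def convert_pptx_to_pdf(self, pptx_path: Path) -> Path:
        return self._placeholder(pptx_path)

    def convert_ppt_to_pdf(self, ppt_path: Path) -> Path:
        return self._placeholder(ppt_path)
```

A lookup table keeps the dispatcher free of `if`/`elif` chains, so adding a format would only need one new entry.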

### 2. OCR Service Integration

```python
# Extend app/services/ocr_service.py

def process_image(self, image_path: Path, ...):
    # Check file type
    if is_office_document(image_path):
        # Convert to PDF first
        pdf_path = self.office_converter.convert_to_pdf(image_path)
        # Use existing PDF processing
        return self.process_pdf(pdf_path, ...)
    elif is_pdf:
        # Existing PDF processing
        ...
    else:
        # Existing image processing
        ...
```
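
The three branches above can be read as a small routing decision. The sketch below makes that decision explicit, including one possible definition of the `is_pdf` check, which the snippet leaves undefined. The function name `route_for` and the string labels are assumptions made for illustration.

```python
from pathlib import Path

# Same four suffixes the design lists as Office formats.
OFFICE_SUFFIXES = {".doc", ".docx", ".ppt", ".pptx"}


def route_for(file_path: Path) -> str:
    """Return which processing branch a file would take (hypothetical helper)."""
    suffix = file_path.suffix.lower()
    if suffix in OFFICE_SUFFIXES:
        return "office"  # convert to PDF first, then reuse PDF processing
    if suffix == ".pdf":
        return "pdf"     # existing PDF processing
    return "image"       # existing image processing
```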

### 3. File Format Detection

```python
from pathlib import Path

OFFICE_FORMATS = {
    '.doc': 'application/msword',
    '.docx': 'application/vnd.openxmlformats-officedocument.wordprocessingml.document',
    '.ppt': 'application/vnd.ms-powerpoint',
    '.pptx': 'application/vnd.openxmlformats-officedocument.presentationml.presentation',
}

def is_office_document(file_path: Path) -> bool:
    return file_path.suffix.lower() in OFFICE_FORMATS
```
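
Suffix checks trust the file name alone. As a possible complement, the sketch below compares the file's leading bytes with the container signature each suffix implies: DOCX and PPTX are ZIP archives, while legacy DOC and PPT use the OLE2 compound file format. The function name `header_matches_extension` is hypothetical, and a matching header does not prove the file is a valid document.

```python
from pathlib import Path

ZIP_MAGIC = b"PK\x03\x04"                            # DOCX/PPTX are ZIP containers
OLE2_MAGIC = b"\xd0\xcf\x11\xe0\xa1\xb1\x1a\xe1"     # legacy DOC/PPT (OLE2 compound file)

EXPECTED_MAGIC = {
    ".docx": ZIP_MAGIC,
    ".pptx": ZIP_MAGIC,
    ".doc": OLE2_MAGIC,
    ".ppt": OLE2_MAGIC,
}


def header_matches_extension(file_path: Path) -> bool:
    """Check that the first bytes fit the container type the suffix implies."""
    expected = EXPECTED_MAGIC.get(file_path.suffix.lower())
    if expected is None:
        return False  # not an Office suffix this sketch knows about
    with file_path.open("rb") as fh:
        return fh.read(len(expected)) == expected
```

This only catches obvious mismatches, such as an image renamed to `.docx`. Any ZIP archive would still pass the DOCX check.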

## Library Selection

### For Word Documents
- **python-docx**: Read/write DOCX files
- **doc2pdf**: Simple conversion (requires LibreOffice)
- Alternative: **pypandoc** with pandoc backend

### For PowerPoint Documents
- **python-pptx**: Read/write PPTX files
- **unoconv**: Universal Office Converter (requires LibreOffice)

### Recommended Approach
Use **LibreOffice** in headless mode for universal conversion:
```bash
libreoffice --headless --convert-to pdf input.docx
```

This provides:
- Support for all Office formats
- High-fidelity conversion
- Maintenance by an active community
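
For reference, the headless command could be wrapped from Python roughly as follows. This is a minimal sketch: the function names are assumptions, the `--outdir` option is a LibreOffice flag added here so the output location is predictable (it is not part of the command shown above), and the 60-second default mirrors the timeout proposed under Error Handling.

```python
import subprocess
from pathlib import Path
from typing import List


def build_convert_command(input_path: Path, out_dir: Path,
                          binary: str = "libreoffice") -> List[str]:
    """Build the argument list for a headless PDF conversion."""
    return [
        binary, "--headless",
        "--convert-to", "pdf",
        "--outdir", str(out_dir),
        str(input_path),
    ]


def convert_with_libreoffice(input_path: Path, out_dir: Path,
                             timeout: float = 60.0) -> Path:
    """Run the conversion and return the path of the generated PDF."""
    cmd = build_convert_command(input_path, out_dir)
    # check=True raises CalledProcessError on a non-zero exit status;
    # timeout raises TimeoutExpired if the conversion runs too long.
    subprocess.run(cmd, check=True, capture_output=True, timeout=timeout)
    # The output file is named after the input's stem.
    pdf_path = out_dir / (input_path.stem + ".pdf")
    if not pdf_path.exists():
        # Defensive check: do not assume success from the exit status alone.
        raise RuntimeError(f"Conversion produced no output for {input_path}")
    return pdf_path
```

Passing the arguments as a list avoids shell interpretation of file names that contain spaces or special characters.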

## Configuration Changes

### Token Expiration
```python
# app/core/config.py
class Settings(BaseSettings):
    # Change from 30 to 1440 (24 hours)
    access_token_expire_minutes: int = 1440
```

### File Upload Limits
```python
from typing import Set

# Office files can be larger than images, so allow up to 100MB
max_file_size: int = 100 * 1024 * 1024  # 100MB
allowed_extensions: Set[str] = {
    '.png', '.jpg', '.jpeg', '.pdf',
    '.doc', '.docx', '.ppt', '.pptx'
}
```
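
To illustrate how these two settings might be enforced at upload time, here is a small sketch. The function `validate_upload` and the module-level constant names are hypothetical. In the real service the values would presumably come from `Settings`.

```python
from pathlib import Path

MAX_FILE_SIZE = 100 * 1024 * 1024  # 100MB, matching max_file_size above
ALLOWED_EXTENSIONS = {
    ".png", ".jpg", ".jpeg", ".pdf",
    ".doc", ".docx", ".ppt", ".pptx",
}


def validate_upload(filename: str, size_bytes: int) -> None:
    """Raise ValueError if the upload violates the extension or size limit."""
    suffix = Path(filename).suffix.lower()
    if suffix not in ALLOWED_EXTENSIONS:
        raise ValueError(f"Extension not allowed: {suffix or '(none)'}")
    if size_bytes > MAX_FILE_SIZE:
        raise ValueError(f"File too large: {size_bytes} bytes")
```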

## Error Handling

1. **Conversion Failures**
   - Corrupted Office files
   - Unsupported Office features
   - LibreOffice not installed

2. **Performance Considerations**
   - Office conversion is CPU intensive
   - Consider queuing for large files
   - Add conversion timeout (60 seconds)

3. **Security**
   - Validate Office files before processing
   - Scan for macros/embedded objects
   - Sandbox conversion process
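
Several of the failure modes listed above (a missing LibreOffice binary, a conversion that exceeds the timeout, a converter that exits with an error) map onto exceptions raised by Python's `subprocess` module. The sketch below shows one way to translate them into service-level errors. The exception class names and the injectable `runner` parameter are assumptions introduced for illustration and to make testing without LibreOffice possible.

```python
import subprocess
from typing import Callable, Sequence


class ConversionError(Exception):
    """Hypothetical base class for conversion failures."""


class ConverterNotInstalledError(ConversionError):
    """The converter binary could not be found."""


class ConversionTimeoutError(ConversionError):
    """The conversion exceeded the configured timeout."""


def run_conversion(cmd: Sequence[str], timeout: float = 60.0,
                   runner: Callable[..., object] = subprocess.run) -> None:
    """Run a converter command and translate low-level errors."""
    try:
        runner(list(cmd), check=True, capture_output=True, timeout=timeout)
    except FileNotFoundError as exc:
        raise ConverterNotInstalledError(
            f"Converter binary not found: {cmd[0]}") from exc
    except subprocess.TimeoutExpired as exc:
        raise ConversionTimeoutError(
            f"Conversion exceeded {timeout} seconds") from exc
    except subprocess.CalledProcessError as exc:
        # Covers corrupted input or anything else the converter rejects.
        raise ConversionError(
            f"Converter exited with status {exc.returncode}") from exc
```

Callers can then catch `ConversionError` once and still distinguish the specific causes when they need to report them.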

## Dependencies

### System Requirements
```bash
# macOS
brew install libreoffice

# Linux (Debian/Ubuntu)
apt-get install libreoffice

# Python packages
pip install python-docx python-pptx pypandoc
```

### Alternative: Docker Container
Use a Docker container with LibreOffice pre-installed for consistent conversion across environments.
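
Because LibreOffice is a system dependency rather than a Python package, a startup check can surface a missing install early. The sketch below is one possible approach: `find_libreoffice` is a hypothetical helper, and the candidate names `libreoffice` and `soffice` are assumed launcher names that may vary by platform and install method.

```python
import shutil
from typing import Callable, Iterable, Optional


def find_libreoffice(
    candidates: Iterable[str] = ("libreoffice", "soffice"),
    which: Callable[[str], Optional[str]] = shutil.which,
) -> Optional[str]:
    """Return the first converter binary found on PATH, or None."""
    for name in candidates:
        path = which(name)
        if path:
            return path
    return None
```

The service could call this once at startup and log a clear error, or disable Office uploads, when it returns `None`.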

## Testing Strategy

1. **Unit Tests**
   - Test each conversion method
   - Mock LibreOffice calls
   - Test error handling

2. **Integration Tests**
   - End-to-end Office → OCR pipeline
   - Test with various Office versions
   - Performance benchmarks

3. **Sample Documents**
   - Simple text documents
   - Documents with tables
   - Documents with images
   - Presentations with multiple slides
   - Legacy formats (DOC, PPT)
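
As an illustration of the "Mock LibreOffice calls" item, the sketch below uses `unittest.mock` to replace `subprocess.run`, so the tests run without LibreOffice installed. The `convert_to_pdf` function here is a hypothetical stand-in for the code under test, not the project's converter.

```python
import subprocess
import unittest
from pathlib import Path
from unittest import mock


def convert_to_pdf(input_path: Path, out_dir: Path) -> Path:
    """Hypothetical function under test: shells out to LibreOffice."""
    subprocess.run(
        ["libreoffice", "--headless", "--convert-to", "pdf",
         "--outdir", str(out_dir), str(input_path)],
        check=True, timeout=60,
    )
    return out_dir / (input_path.stem + ".pdf")


class ConvertToPdfTest(unittest.TestCase):
    def test_invokes_libreoffice_headless(self):
        # Patch subprocess.run so no real process is started.
        with mock.patch("subprocess.run") as fake_run:
            result = convert_to_pdf(Path("report.docx"), Path("out"))
        fake_run.assert_called_once()
        cmd = fake_run.call_args.args[0]
        self.assertEqual(cmd[0], "libreoffice")
        self.assertIn("--headless", cmd)
        self.assertEqual(result, Path("out") / "report.pdf")

    def test_propagates_conversion_failure(self):
        # Simulate LibreOffice exiting with a non-zero status.
        err = subprocess.CalledProcessError(1, "libreoffice")
        with mock.patch("subprocess.run", side_effect=err):
            with self.assertRaises(subprocess.CalledProcessError):
                convert_to_pdf(Path("broken.docx"), Path("out"))
```

These tests verify the command line and the error path only. Conversion fidelity still needs the integration tests and sample documents listed above.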