chore: project cleanup and prepare for dual-track processing refactor

- Removed all test files and directories
- Deleted outdated documentation (will be rewritten)
- Cleaned up temporary files, logs, and uploads
- Archived 5 completed OpenSpec proposals
- Created new dual-track-document-processing proposal with complete OpenSpec structure
  - Dual-track architecture: OCR track (PaddleOCR) + Direct track (PyMuPDF)
  - UnifiedDocument model for consistent output
  - Support for structure-preserving translation
- Updated .gitignore to prevent future test/temp files

This is a major cleanup preparing for the complete refactoring of the document processing pipeline.

🤖 Generated with [Claude Code](https://claude.com/claude-code)

Co-Authored-By: Claude <noreply@anthropic.com>
2025-11-18 20:02:31 +08:00
parent 0edc56b03f
commit cd3cbea49d
64 changed files with 3573 additions and 8190 deletions
--- a/openspec/changes/archive/2025-11-18-add-office-document-support/design.md
+++ b/openspec/changes/archive/2025-11-18-add-office-document-support/design.md
@@ -0,0 +1,176 @@
+# Technical Design
+
+## Architecture Overview
+
+```
+User Upload (DOC/DOCX/PPT/PPTX)
+    ↓
+File Validation & Storage
+    ↓
+Format Detection
+    ↓
+Office Document Converter
+    ↓
+PDF Generation
+    ↓
+PDF to Images (existing)
+    ↓
+PaddleOCR Processing (existing)
+    ↓
+Results & Export
+```
+
+## Component Design
+
+### 1. Office Document Converter Service
+
+```python
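+# app/services/office_converter.py
+from pathlib import Path
+
+
+class OfficeConverter:
+    """Convert Office documents to PDF for OCR processing."""
+
+    def convert_to_pdf(self, file_path: Path) -> Path:
+        """Main conversion dispatcher: route by file extension."""
+        raise NotImplementedError
+
+    def convert_docx_to_pdf(self, docx_path: Path) -> Path:
+        """Convert DOCX to PDF using python-docx and pypandoc."""
+        raise NotImplementedError
+
+    def convert_doc_to_pdf(self, doc_path: Path) -> Path:
+        """Convert legacy DOC to PDF."""
+        raise NotImplementedError
+
+    def convert_pptx_to_pdf(self, pptx_path: Path) -> Path:
+        """Convert PPTX to PDF.
+
+        python-pptx can read the file but has no PDF export, so rendering
+        needs a backend such as LibreOffice (see Recommended Approach).
+        """
+        raise NotImplementedError
+
+    def convert_ppt_to_pdf(self, ppt_path: Path) -> Path:
+        """Convert legacy PPT to PDF."""
+        raise NotImplementedError
+```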
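
The `convert_to_pdf` dispatcher above is declared but not spelled out. The following is a minimal sketch of how it might route by file suffix, assuming a handler table keyed by extension. The names `OfficeConverterSketch`, `_handlers`, and `_placeholder_convert` are hypothetical and not part of this commit; the placeholder only computes the expected output path and performs no conversion.

```python
from pathlib import Path
from typing import Callable, Dict


class OfficeConverterSketch:
    """Hypothetical dispatcher; the real conversion logic is stubbed out."""

    def __init__(self) -> None:
        # Map each supported suffix to the method that handles it.
        self._handlers: Dict[str, Callable[[Path], Path]] = {
            ".docx": self.convert_docx_to_pdf,
            ".doc": self.convert_doc_to_pdf,
            ".pptx": self.convert_pptx_to_pdf,
            ".ppt": self.convert_ppt_to_pdf,
        }

    def convert_to_pdf(self, file_path: Path) -> Path:
        """Pick a handler from the lowercased suffix, or fail loudly."""
        handler = self._handlers.get(file_path.suffix.lower())
        if handler is None:
            raise ValueError(f"Unsupported Office format: {file_path.suffix!r}")
        return handler(file_path)

    def _placeholder_convert(self, source: Path) -> Path:
        # Placeholder (assumption): returns where the PDF would be written.
        return source.with_suffix(".pdf")

    # In this sketch all four format methods share the placeholder.
    convert_docx_to_pdf = _placeholder_convert
    convert_doc_to_pdf = _placeholder_convert
    convert_pptx_to_pdf = _placeholder_convert
    convert_ppt_to_pdf = _placeholder_convert
```

A table lookup keeps the dispatcher flat and makes an unsupported suffix an explicit `ValueError` instead of a silent fall-through.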
+
+### 2. OCR Service Integration
+
+```python
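+# Extend app/services/ocr_service.py
+
+# Method of the existing OCR service class. *args/**kwargs stand in for
+# the existing parameters, which are elided in this excerpt.
+def process_image(self, image_path: Path, *args, **kwargs):
+    # Assumption: PDFs are recognized by file extension
+    is_pdf = image_path.suffix.lower() == '.pdf'
+
+    # Check file type
+    if is_office_document(image_path):
+        # Convert to PDF first
+        pdf_path = self.office_converter.convert_to_pdf(image_path)
+        # Reuse the existing PDF processing
+        return self.process_pdf(pdf_path, *args, **kwargs)
+    elif is_pdf:
+        # Existing PDF processing
+        ...
+    else:
+        # Existing image processing
+        ...
+```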
+
+### 3. File Format Detection
+
+```python
+from pathlib import Path
+
+OFFICE_FORMATS = {
+    '.doc': 'application/msword',
+    '.docx': 'application/vnd.openxmlformats-officedocument.wordprocessingml.document',
+    '.ppt': 'application/vnd.ms-powerpoint',
+    '.pptx': 'application/vnd.openxmlformats-officedocument.presentationml.presentation'
+}
+
+def is_office_document(file_path: Path) -> bool:
+    return file_path.suffix.lower() in OFFICE_FORMATS
+```
+
+## Library Selection
+
+### For Word Documents
+- **python-docx**: Read/write DOCX files
+- **doc2pdf**: Simple conversion (requires LibreOffice)
+- Alternative: **pypandoc** with pandoc backend
+
+### For PowerPoint Documents
+- **python-pptx**: Read/write PPTX files
+- **unoconv**: Universal Office Converter (requires LibreOffice)
+
+### Recommended Approach
+Use **LibreOffice** headless mode for universal conversion:
+```bash
+libreoffice --headless --convert-to pdf input.docx
+```
+
+This provides:
+- Support for all four target formats (DOC, DOCX, PPT, PPTX)
+- High-fidelity conversion
+- Active community maintenance
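
The headless command above can be driven from Python with `subprocess`. This is a rough sketch under stated assumptions, not the project's implementation: the function names `build_convert_command` and `convert_with_libreoffice` are hypothetical, the binary is assumed to be on `PATH` as `libreoffice`, and the 60-second default mirrors the timeout proposed under Error Handling. `--outdir` is the LibreOffice option that selects the output directory.

```python
import shutil
import subprocess
from pathlib import Path
from typing import List


def build_convert_command(binary: str, source: Path, out_dir: Path) -> List[str]:
    """Build the headless LibreOffice command line for a PDF conversion."""
    return [
        binary,
        "--headless",
        "--convert-to", "pdf",
        "--outdir", str(out_dir),
        str(source),
    ]


def convert_with_libreoffice(
    source: Path,
    out_dir: Path,
    binary: str = "libreoffice",  # assumption: executable name on PATH
    timeout_seconds: int = 60,
) -> Path:
    """Run the conversion and return the path of the generated PDF."""
    resolved = shutil.which(binary)
    if resolved is None:
        raise FileNotFoundError(f"{binary} not found on PATH")
    out_dir.mkdir(parents=True, exist_ok=True)
    # check=True raises CalledProcessError on a non-zero exit status;
    # timeout raises TimeoutExpired if the conversion hangs.
    subprocess.run(
        build_convert_command(resolved, source, out_dir),
        check=True,
        capture_output=True,
        timeout=timeout_seconds,
    )
    # LibreOffice names the output after the input file's stem.
    pdf_path = out_dir / (source.stem + ".pdf")
    if not pdf_path.exists():
        raise RuntimeError(f"No PDF was produced for {source}")
    return pdf_path
```

Passing the arguments as a list avoids shell quoting problems with uploaded file names, and the final existence check guards against a run that exits cleanly without writing a file.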
+
+## Configuration Changes
+
+### Token Expiration
+```python
+# app/core/config.py
+class Settings(BaseSettings):
+    # Change from 30 to 1440 (24 hours)
+    access_token_expire_minutes: int = 1440
+```
+
+### File Upload Limits
+```python
+from typing import Set
+
+# Office files can be larger than images; size the upload limit accordingly
+max_file_size: int = 100 * 1024 * 1024  # 100 MB
+allowed_extensions: Set[str] = {
+    '.png', '.jpg', '.jpeg', '.pdf',
+    '.doc', '.docx', '.ppt', '.pptx'
+}
+```
+
+## Error Handling
+
+1. **Conversion Failures**
+   - Corrupted Office files
+   - Unsupported Office features
+   - LibreOffice not installed
+
+2. **Performance Considerations**
+   - Office conversion is CPU intensive
+   - Consider queuing for large files
+   - Add conversion timeout (60 seconds)
+
+3. **Security**
+   - Validate Office files before processing
+   - Scan for macros/embedded objects
+   - Sandbox conversion process
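
Two of the security checks listed above can be approximated with the standard library alone. The sketch below is a heuristic under stated assumptions, not a complete scanner: the function names are hypothetical, the format check only compares the leading bytes with the container type implied by the suffix (OLE2 for `.doc`/`.ppt`, ZIP for `.docx`/`.pptx`), and the macro check only looks for a `vbaProject.bin` part inside ZIP-based files. It does not inspect legacy OLE2 files for macros or detect other embedded objects.

```python
import zipfile
from pathlib import Path

# Leading bytes of the two container formats.
OLE2_MAGIC = bytes.fromhex("D0CF11E0A1B11AE1")  # legacy .doc / .ppt
ZIP_MAGIC = b"PK\x03\x04"  # .docx / .pptx are ZIP archives


def matches_declared_format(path: Path) -> bool:
    """Check that the file header fits the container implied by the suffix."""
    suffix = path.suffix.lower()
    with path.open("rb") as handle:
        header = handle.read(8)
    if suffix in {".doc", ".ppt"}:
        return header.startswith(OLE2_MAGIC)
    if suffix in {".docx", ".pptx"}:
        return header.startswith(ZIP_MAGIC)
    return False


def has_vba_project(path: Path) -> bool:
    """Heuristic: ZIP-based Office files keep VBA macros in vbaProject.bin."""
    if not zipfile.is_zipfile(path):
        return False
    with zipfile.ZipFile(path) as archive:
        return any(
            name.lower().endswith("vbaproject.bin")
            for name in archive.namelist()
        )
```

A file that fails `matches_declared_format` can be rejected before it reaches the converter, which also catches uploads that were simply renamed to an allowed extension.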
+
+## Dependencies
+
+### System Requirements
+```bash
+# macOS
+brew install libreoffice
+
+# Linux (Debian/Ubuntu)
+apt-get install libreoffice
+
+# Python packages (pypandoc also requires the pandoc binary)
+pip install python-docx python-pptx pypandoc
+```
+
+### Alternative: Docker Container
+Use a Docker container with LibreOffice pre-installed for consistent conversion across environments.
+
+## Testing Strategy
+
+1. **Unit Tests**
+   - Test each conversion method
+   - Mock LibreOffice calls
+   - Test error handling
+
+2. **Integration Tests**
+   - End-to-end Office → OCR pipeline
+   - Test with various Office versions
+   - Performance benchmarks
+
+3. **Sample Documents**
+   - Simple text documents
+   - Documents with tables
+   - Documents with images
+   - Presentations with multiple slides
+   - Legacy formats (DOC, PPT)
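
The "Mock LibreOffice calls" item above can be illustrated with `unittest.mock`. This is a minimal sketch: `run_conversion` is a hypothetical thin wrapper written only for this example, and the tests patch `subprocess.run` so that LibreOffice does not need to be installed on the machine running them.

```python
import subprocess
import unittest
from pathlib import Path
from unittest import mock


def run_conversion(source: Path, out_dir: Path) -> Path:
    """Hypothetical thin wrapper around the LibreOffice CLI."""
    subprocess.run(
        ["libreoffice", "--headless", "--convert-to", "pdf",
         "--outdir", str(out_dir), str(source)],
        check=True,
        timeout=60,
    )
    return out_dir / (source.stem + ".pdf")


class RunConversionTest(unittest.TestCase):
    def test_builds_expected_command(self):
        # Replace subprocess.run so no real process is started.
        with mock.patch("subprocess.run") as fake_run:
            result = run_conversion(Path("deck.pptx"), Path("out"))
        fake_run.assert_called_once()
        command = fake_run.call_args[0][0]
        self.assertEqual(
            command[:4],
            ["libreoffice", "--headless", "--convert-to", "pdf"],
        )
        self.assertEqual(command[-1], "deck.pptx")
        self.assertEqual(result, Path("out") / "deck.pdf")

    def test_timeout_propagates(self):
        # Simulate a hung conversion by making the mock raise.
        timeout = subprocess.TimeoutExpired(cmd="libreoffice", timeout=60)
        with mock.patch("subprocess.run", side_effect=timeout):
            with self.assertRaises(subprocess.TimeoutExpired):
                run_conversion(Path("deck.pptx"), Path("out"))
```

Saved as a test module, this runs with `python -m unittest`; the end-to-end Office → OCR checks still need a real LibreOffice installation.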