chore: project cleanup and prepare for dual-track processing refactor

- Removed all test files and directories
- Deleted outdated documentation (will be rewritten)
- Cleaned up temporary files, logs, and uploads
- Archived 5 completed OpenSpec proposals
- Created new dual-track-document-processing proposal with complete OpenSpec structure
- Dual-track architecture: OCR track (PaddleOCR) + Direct track (PyMuPDF)
- UnifiedDocument model for consistent output
- Support for structure-preserving translation
- Updated .gitignore to prevent future test/temp files

This is a major cleanup preparing for the complete refactoring of the document processing pipeline.

🤖 Generated with [Claude Code](https://claude.com/claude-code)

Co-Authored-By: Claude <noreply@anthropic.com>
2025-11-18 20:02:31 +08:00
parent 0edc56b03f
commit cd3cbea49d
64 changed files with 3573 additions and 8190 deletions
--- a/openspec/changes/archive/2025-11-18-add-office-document-support/tasks.md
+++ b/openspec/changes/archive/2025-11-18-add-office-document-support/tasks.md
@@ -0,0 +1,70 @@
+# Implementation Tasks
+
+## Phase 1: Dependencies & Configuration
+- [x] Install Office document processing libraries
+  - [x] Install LibreOffice via Homebrew (headless mode for conversion)
+  - [x] Verify LibreOffice installation and accessibility
+  - [x] Configure LibreOffice path in OfficeConverter
+- [x] Update JWT token configuration
+  - [x] Change `ACCESS_TOKEN_EXPIRE_MINUTES` to 1440 in `app/core/config.py`
+  - [x] Verify token expiration in authentication flow
+
+## Phase 2: Document Conversion Implementation
+- [x] Create Office document converter class
+  - [x] Add `office_converter.py` to services directory
+  - [x] Implement Word document conversion methods
+    - [x] `convert_docx_to_pdf()` for DOCX files
+    - [x] `convert_doc_to_pdf()` for DOC files
+  - [x] Implement PowerPoint conversion methods
+    - [x] `convert_pptx_to_pdf()` for PPTX files
+    - [x] `convert_ppt_to_pdf()` for PPT files
+  - [x] Add error handling and logging
+  - [x] Add file validation methods
+
+## Phase 3: OCR Service Integration
+- [x] Update OCR service to handle Office formats
+  - [x] Modify `process_image()` in `ocr_service.py`
+  - [x] Add Office format detection logic
+  - [x] Integrate Office-to-PDF conversion pipeline
+  - [x] Update supported formats list in configuration
+- [x] Update file manager service
+  - [x] Add Office formats to allowed extensions (`file_manager.py`)
+  - [x] Update file validation logic
+  - [x] Update config.py allowed extensions
+
+## Phase 4: API Updates
+- [x] File validation updated (already accepts Office formats via file_manager.py)
+- [x] Core API integration complete (Office files processed via existing endpoints)
+- [ ] API documentation strings (optional enhancement)
+- [ ] Add Office format examples to OpenAPI schema (optional enhancement)
+
+## Phase 5: Testing
+- [x] Create test Office documents
+  - [x] Sample DOCX with mixed Chinese/English content
+  - [x] Test document creation script (`create_docx.py`)
+- [x] Verify document conversion capability
+  - [x] LibreOffice headless mode verified
+  - [x] OfficeConverter service tested
+- [x] Test token validity
+  - [x] Verified 24-hour token expiration (1440 minutes)
+  - [x] Confirmed in login response
+- [x] Core functionality verified
+  - [x] Office format detection working
+  - [x] Office → PDF → Images → OCR pipeline implemented
+  - [x] File validation accepts .doc, .docx, .ppt, .pptx
+- [x] Automated integration testing
+  - [x] Fixed API endpoint paths in test script
+  - [x] Fixed configuration loading (.env file update)
+  - [x] Fixed preprocessor bugs (MIME types, validation, return order)
+  - [x] End-to-end test completed successfully (batch 24)
+  - [x] OCR accuracy: 97.39% confidence on mixed Chinese/English content
+- [x] Manual end-to-end testing
+  - [x] DOCX → PDF → Images → OCR pipeline verified
+  - [x] Processing time: ~375 seconds (includes model initialization)
+  - [x] Result output format validated (Markdown generation working)
+
+## Phase 6: Documentation
+- [x] Update README with Office format support (covered in IMPLEMENTATION.md)
+- [x] Test documents available in demo_docs/office_tests/
+- [x] API documentation update (endpoints unchanged, format list extended)
+- [x] Migration guide (no breaking changes, backward compatible)
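
The archived task list above describes an Office → PDF conversion step that runs LibreOffice in headless mode, gated by a check on the `.doc`, `.docx`, `.ppt`, and `.pptx` extensions. The actual `office_converter.py` is not part of this diff, so the following is only a minimal sketch of how such a step might look. The function names (`is_office_format`, `build_convert_command`, `convert_to_pdf`), the default `soffice` binary name, and the timeout are assumptions for illustration, not the project's API.

```python
import shutil
import subprocess
from pathlib import Path

# Extensions named in the task list; the real allowed-extensions config may differ.
OFFICE_EXTENSIONS = {".doc", ".docx", ".ppt", ".pptx"}


def is_office_format(path):
    """Return True if the file extension is one of the Office formats (hypothetical helper)."""
    return Path(path).suffix.lower() in OFFICE_EXTENSIONS


def build_convert_command(src, out_dir, soffice="soffice"):
    """Build the LibreOffice headless command line that converts src to PDF in out_dir."""
    return [
        soffice,
        "--headless",
        "--convert-to", "pdf",
        "--outdir", str(out_dir),
        str(src),
    ]


def convert_to_pdf(src, out_dir, soffice="soffice", timeout=120):
    """Convert an Office document to PDF and return the path of the generated file.

    Assumes a LibreOffice binary is reachable under the name or path given in `soffice`.
    """
    if not is_office_format(src):
        raise ValueError(f"Not an Office document: {src}")
    if shutil.which(soffice) is None:
        raise FileNotFoundError(f"LibreOffice binary not found: {soffice}")

    out_dir = Path(out_dir)
    out_dir.mkdir(parents=True, exist_ok=True)

    # check=True raises CalledProcessError if LibreOffice exits with a non-zero status.
    subprocess.run(
        build_convert_command(src, out_dir, soffice),
        check=True,
        timeout=timeout,
        capture_output=True,
    )

    # LibreOffice writes <original stem>.pdf into the output directory.
    pdf_path = out_dir / (Path(src).stem + ".pdf")
    if not pdf_path.exists():
        raise RuntimeError(f"Conversion produced no output for {src}")
    return pdf_path
```

Separating command construction from execution keeps the extension check and the command line testable on machines without LibreOffice installed. The resulting PDF would then feed the existing PDF → Images → OCR path the task list mentions.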