chore: project cleanup and prepare for dual-track processing refactor
- Removed all test files and directories
- Deleted outdated documentation (will be rewritten)
- Cleaned up temporary files, logs, and uploads
- Archived 5 completed OpenSpec proposals
- Created new dual-track-document-processing proposal with complete OpenSpec structure
  - Dual-track architecture: OCR track (PaddleOCR) + Direct track (PyMuPDF)
  - UnifiedDocument model for consistent output
  - Support for structure-preserving translation
- Updated .gitignore to prevent future test/temp files

This is a major cleanup preparing for the complete refactoring of the document processing pipeline.

🤖 Generated with [Claude Code](https://claude.com/claude-code)

Co-Authored-By: Claude <noreply@anthropic.com>
@@ -0,0 +1,70 @@
# Implementation Tasks

## Phase 1: Dependencies & Configuration
- [x] Install Office document processing libraries
  - [x] Install LibreOffice via Homebrew (headless mode for conversion)
  - [x] Verify LibreOffice installation and accessibility
  - [x] Configure LibreOffice path in OfficeConverter
- [x] Update JWT token configuration
  - [x] Change `ACCESS_TOKEN_EXPIRE_MINUTES` to 1440 in `app/core/config.py`
  - [x] Verify token expiration in authentication flow

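The token setting in Phase 1 amounts to a 24-hour lifetime (1440 minutes). A minimal sketch of that arithmetic follows; the `Settings` class and the `access_token_expiry` helper are hypothetical stand-ins, and only the `ACCESS_TOKEN_EXPIRE_MINUTES` name and its value come from the task list.

```python
from dataclasses import dataclass
from datetime import datetime, timedelta, timezone


@dataclass(frozen=True)
class Settings:
    # Hypothetical stand-in for the settings object in app/core/config.py.
    ACCESS_TOKEN_EXPIRE_MINUTES: int = 1440  # 24 hours


def access_token_expiry(issued_at: datetime, settings: Settings) -> datetime:
    """Return the instant at which a token issued at `issued_at` expires."""
    return issued_at + timedelta(minutes=settings.ACCESS_TOKEN_EXPIRE_MINUTES)


settings = Settings()
issued = datetime(2025, 1, 1, tzinfo=timezone.utc)
expires = access_token_expiry(issued, settings)
```
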
## Phase 2: Document Conversion Implementation
- [x] Create Office document converter class
  - [x] Add `office_converter.py` to services directory
  - [x] Implement Word document conversion methods
    - [x] `convert_docx_to_pdf()` for DOCX files
    - [x] `convert_doc_to_pdf()` for DOC files
  - [x] Implement PowerPoint conversion methods
    - [x] `convert_pptx_to_pdf()` for PPTX files
    - [x] `convert_ppt_to_pdf()` for PPT files
  - [x] Add error handling and logging
  - [x] Add file validation methods

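The conversion step in Phase 2 can be sketched with LibreOffice's headless command line (`soffice --headless --convert-to pdf --outdir ...`). This is an illustration under stated assumptions, not the contents of `office_converter.py`: the function names, the `soffice` argument, and the timeout value are hypothetical.

```python
import subprocess
from pathlib import Path

# Extensions named in the task list.
OFFICE_EXTENSIONS = {".doc", ".docx", ".ppt", ".pptx"}


def build_convert_command(soffice: str, source: Path, out_dir: Path) -> list[str]:
    """Build the LibreOffice command line for a headless PDF conversion."""
    if source.suffix.lower() not in OFFICE_EXTENSIONS:
        raise ValueError(f"unsupported Office format: {source.suffix}")
    return [soffice, "--headless", "--convert-to", "pdf",
            "--outdir", str(out_dir), str(source)]


def convert_to_pdf(soffice: str, source: Path, out_dir: Path,
                   timeout: int = 120) -> Path:
    """Run the conversion and return the path of the PDF LibreOffice writes."""
    out_dir.mkdir(parents=True, exist_ok=True)
    cmd = build_convert_command(soffice, source, out_dir)
    # check=True raises CalledProcessError on a non-zero exit status.
    subprocess.run(cmd, check=True, capture_output=True, timeout=timeout)
    pdf_path = out_dir / (source.stem + ".pdf")
    if not pdf_path.exists():
        raise RuntimeError(f"LibreOffice produced no output for {source}")
    return pdf_path
```

Keeping the command construction separate from the subprocess call lets the validation and argument order be checked without LibreOffice installed.
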
## Phase 3: OCR Service Integration
- [x] Update OCR service to handle Office formats
  - [x] Modify `process_image()` in `ocr_service.py`
  - [x] Add Office format detection logic
  - [x] Integrate Office-to-PDF conversion pipeline
  - [x] Update supported formats list in configuration
- [x] Update file manager service
  - [x] Add Office formats to allowed extensions (`file_manager.py`)
  - [x] Update file validation logic
  - [x] Update config.py allowed extensions

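The format detection and routing described in Phase 3 could look roughly like the sketch below. It assumes detection by file extension; `plan_pipeline` and the step names are hypothetical and are not taken from `ocr_service.py`.

```python
from pathlib import Path

# Office extensions named in the task list.
OFFICE_EXTENSIONS = {".doc", ".docx", ".ppt", ".pptx"}


def plan_pipeline(filename: str) -> list[str]:
    """Return the ordered processing steps for a file, chosen by extension."""
    suffix = Path(filename).suffix.lower()
    steps = []
    if suffix in OFFICE_EXTENSIONS:
        # Office files are converted to PDF first, then follow the PDF path.
        steps.append("office_to_pdf")
        suffix = ".pdf"
    if suffix == ".pdf":
        steps.append("pdf_to_images")
    steps.append("ocr")
    return steps
```

For example, `plan_pipeline("slides.pptx")` yields the three conversion and OCR steps in order, while an image file goes straight to OCR.
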
## Phase 4: API Updates
- [x] File validation updated (already accepts Office formats via file_manager.py)
- [x] Core API integration complete (Office files processed via existing endpoints)
- [ ] API documentation strings (optional enhancement)
- [ ] Add Office format examples to OpenAPI schema (optional enhancement)

## Phase 5: Testing
- [x] Create test Office documents
  - [x] Sample DOCX with mixed Chinese/English content
  - [x] Test document creation script (`create_docx.py`)
- [x] Verify document conversion capability
  - [x] LibreOffice headless mode verified
  - [x] OfficeConverter service tested
- [x] Test token validity
  - [x] Verified 24-hour token expiration (1440 minutes)
  - [x] Confirmed in login response
- [x] Core functionality verified
  - [x] Office format detection working
  - [x] Office → PDF → Images → OCR pipeline implemented
  - [x] File validation accepts .doc, .docx, .ppt, .pptx
- [x] Automated integration testing
  - [x] Fixed API endpoint paths in test script
  - [x] Fixed configuration loading (.env file update)
  - [x] Fixed preprocessor bugs (MIME types, validation, return order)
  - [x] End-to-end test completed successfully (batch 24)
  - [x] OCR accuracy: 97.39% confidence on mixed Chinese/English content
- [x] Manual end-to-end testing
  - [x] DOCX → PDF → Images → OCR pipeline verified
  - [x] Processing time: ~375 seconds (includes model initialization)
  - [x] Result output format validated (Markdown generation working)

## Phase 6: Documentation
- [x] Update README with Office format support (covered in IMPLEMENTATION.md)
- [x] Test documents available in demo_docs/office_tests/
- [x] API documentation update (endpoints unchanged, format list extended)
- [x] Migration guide (no breaking changes, backward compatible)