chore: project cleanup and prepare for dual-track processing refactor

- Removed all test files and directories
- Deleted outdated documentation (will be rewritten)
- Cleaned up temporary files, logs, and uploads
- Archived 5 completed OpenSpec proposals
- Created new dual-track-document-processing proposal with complete OpenSpec structure
  - Dual-track architecture: OCR track (PaddleOCR) + Direct track (PyMuPDF)
  - UnifiedDocument model for consistent output
  - Support for structure-preserving translation
- Updated .gitignore to keep test and temp files out of future commits

This is a major cleanup in preparation for a complete refactor of the document processing pipeline.

🤖 Generated with [Claude Code](https://claude.com/claude-code)

Co-Authored-By: Claude <noreply@anthropic.com>
egg
2025-11-18 20:02:31 +08:00
parent 0edc56b03f
commit cd3cbea49d
64 changed files with 3573 additions and 8190 deletions

@@ -0,0 +1,70 @@
# Implementation Tasks
## Phase 1: Dependencies & Configuration
- [x] Install Office document processing libraries
- [x] Install LibreOffice via Homebrew (headless mode for conversion)
- [x] Verify LibreOffice installation and accessibility
- [x] Configure LibreOffice path in OfficeConverter
- [x] Update JWT token configuration
- [x] Change `ACCESS_TOKEN_EXPIRE_MINUTES` to 1440 in `app/core/config.py`
- [x] Verify token expiration in authentication flow
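
The Phase 1 settings could look roughly like the sketch below. It is an illustration only, not the project's actual `app/core/config.py`: the `Settings` dataclass, the `token_expiry` and `find_libreoffice` helpers, and the candidate binary names are assumptions. A Homebrew install may not put the binary on `PATH`, in which case an explicit path would have to be configured instead.

```python
import shutil
from dataclasses import dataclass
from datetime import datetime, timedelta, timezone
from typing import Optional


@dataclass(frozen=True)
class Settings:
    # 1440 minutes = 24 hours, the value set in Phase 1.
    ACCESS_TOKEN_EXPIRE_MINUTES: int = 1440


def token_expiry(issued_at: datetime, settings: Settings) -> datetime:
    """Return the moment a token issued at `issued_at` stops being valid."""
    return issued_at + timedelta(minutes=settings.ACCESS_TOKEN_EXPIRE_MINUTES)


def find_libreoffice() -> Optional[str]:
    """Look for a LibreOffice binary on PATH; return None if none is found."""
    for name in ("soffice", "libreoffice"):  # assumed binary names
        path = shutil.which(name)
        if path:
            return path
    return None


if __name__ == "__main__":
    issued = datetime(2025, 1, 1, tzinfo=timezone.utc)
    print(token_expiry(issued, Settings()))  # 2025-01-02 00:00:00+00:00
    print(find_libreoffice())  # path to the binary, or None
```

Keeping the expiry arithmetic in one small function makes the 24-hour claim easy to check without going through the login flow.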
## Phase 2: Document Conversion Implementation
- [x] Create Office document converter class
- [x] Add `office_converter.py` to services directory
- [x] Implement Word document conversion methods
- [x] `convert_docx_to_pdf()` for DOCX files
- [x] `convert_doc_to_pdf()` for DOC files
- [x] Implement PowerPoint conversion methods
- [x] `convert_pptx_to_pdf()` for PPTX files
- [x] `convert_ppt_to_pdf()` for PPT files
- [x] Add error handling and logging
- [x] Add file validation methods
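
A converter along the lines of Phase 2 might be sketched as follows. This is not the contents of `office_converter.py`: it collapses the four per-format methods into one hypothetical `convert_to_pdf` function, and `OfficeConversionError`, `build_convert_command`, and the 120-second timeout are assumptions. The only external dependency is LibreOffice's headless `--convert-to pdf --outdir` invocation.

```python
import subprocess
from pathlib import Path
from typing import List

# The four formats named in the task list.
OFFICE_EXTENSIONS = {".doc", ".docx", ".ppt", ".pptx"}


class OfficeConversionError(Exception):
    """Raised when a file cannot be converted to PDF."""


def build_convert_command(soffice: str, source: Path, out_dir: Path) -> List[str]:
    """Build the LibreOffice headless command line for a PDF conversion."""
    return [soffice, "--headless", "--convert-to", "pdf",
            "--outdir", str(out_dir), str(source)]


def convert_to_pdf(soffice: str, source: Path, out_dir: Path,
                   timeout: int = 120) -> Path:
    """Validate `source`, run LibreOffice, and return the produced PDF path."""
    # File validation: check the extension before touching the filesystem.
    if source.suffix.lower() not in OFFICE_EXTENSIONS:
        raise OfficeConversionError(f"Unsupported format: {source.suffix}")
    if not source.is_file():
        raise OfficeConversionError(f"File not found: {source}")

    out_dir.mkdir(parents=True, exist_ok=True)
    try:
        subprocess.run(build_convert_command(soffice, source, out_dir),
                       check=True, capture_output=True, timeout=timeout)
    except (subprocess.CalledProcessError,
            subprocess.TimeoutExpired, OSError) as exc:
        raise OfficeConversionError(f"Conversion failed: {exc}") from exc

    # LibreOffice names the output after the input file's stem.
    pdf_path = out_dir / (source.stem + ".pdf")
    if not pdf_path.exists():
        raise OfficeConversionError(f"No PDF produced for {source.name}")
    return pdf_path
```

Separating command construction from execution lets the command line be unit-tested on machines where LibreOffice is not installed.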
## Phase 3: OCR Service Integration
- [x] Update OCR service to handle Office formats
- [x] Modify `process_image()` in `ocr_service.py`
- [x] Add Office format detection logic
- [x] Integrate Office-to-PDF conversion pipeline
- [x] Update supported formats list in configuration
- [x] Update file manager service
- [x] Add Office formats to allowed extensions (`file_manager.py`)
- [x] Update file validation logic
- [x] Update config.py allowed extensions
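
The format detection and routing from Phase 3 can be illustrated with a small sketch. The function names and the step labels are hypothetical, and the image and PDF extensions are assumed for the example. Only `.doc`, `.docx`, `.ppt`, and `.pptx` come from the task list, and the step order follows the Office → PDF → Images → OCR pipeline it describes.

```python
from pathlib import Path
from typing import List

OFFICE_EXTENSIONS = {".doc", ".docx", ".ppt", ".pptx"}
IMAGE_EXTENSIONS = {".png", ".jpg", ".jpeg"}  # assumed for illustration
ALLOWED_EXTENSIONS = OFFICE_EXTENSIONS | IMAGE_EXTENSIONS | {".pdf"}


def is_office_document(path: Path) -> bool:
    """Detect Office formats by extension, case-insensitively."""
    return path.suffix.lower() in OFFICE_EXTENSIONS


def is_allowed(path: Path) -> bool:
    """Check the extension against the allowed list."""
    return path.suffix.lower() in ALLOWED_EXTENSIONS


def plan_steps(path: Path) -> List[str]:
    """Return the ordered processing steps for an uploaded file."""
    if not is_allowed(path):
        raise ValueError(f"Unsupported file type: {path.suffix}")
    steps: List[str] = []
    if is_office_document(path):
        steps.append("office_to_pdf")   # convert Office file to PDF first
    if is_office_document(path) or path.suffix.lower() == ".pdf":
        steps.append("pdf_to_images")   # rasterize PDF pages
    steps.append("ocr")                 # every input ends in OCR
    return steps


if __name__ == "__main__":
    print(plan_steps(Path("Report.DOCX")))  # ['office_to_pdf', 'pdf_to_images', 'ocr']
    print(plan_steps(Path("scan.png")))     # ['ocr']
```

Because Office files only add a step in front of the existing PDF path, the API endpoints do not need to change.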
## Phase 4: API Updates
- [x] File validation updated (already accepts Office formats via file_manager.py)
- [x] Core API integration complete (Office files processed via existing endpoints)
- [ ] API documentation strings (optional enhancement)
- [ ] Add Office format examples to OpenAPI schema (optional enhancement)
## Phase 5: Testing
- [x] Create test Office documents
- [x] Sample DOCX with mixed Chinese/English content
- [x] Test document creation script (`create_docx.py`)
- [x] Verify document conversion capability
- [x] LibreOffice headless mode verified
- [x] OfficeConverter service tested
- [x] Test token validity
- [x] Verified 24-hour token expiration (1440 minutes)
- [x] Confirmed in login response
- [x] Core functionality verified
- [x] Office format detection working
- [x] Office → PDF → Images → OCR pipeline implemented
- [x] File validation accepts .doc, .docx, .ppt, .pptx
- [x] Automated integration testing
- [x] Fixed API endpoint paths in test script
- [x] Fixed configuration loading (.env file update)
- [x] Fixed preprocessor bugs (MIME types, validation, return order)
- [x] End-to-end test completed successfully (batch 24)
- [x] OCR accuracy: 97.39% confidence on mixed Chinese/English content
- [x] Manual end-to-end testing
- [x] DOCX → PDF → Images → OCR pipeline verified
- [x] Processing time: ~375 seconds (includes model initialization)
- [x] Result output format validated (Markdown generation working)
## Phase 6: Documentation
- [x] Update README with Office format support (covered in IMPLEMENTATION.md)
- [x] Test documents available in demo_docs/office_tests/
- [x] API documentation update (endpoints unchanged, format list extended)
- [x] Migration guide (no breaking changes, backward compatible)