Implementation Tasks
Phase 1: Dependencies & Configuration
- Install Office document processing libraries
- Install LibreOffice via Homebrew (headless mode for conversion)
- Verify LibreOffice installation and accessibility
- Configure LibreOffice path in OfficeConverter
- Update JWT token configuration
- Change ACCESS_TOKEN_EXPIRE_MINUTES to 1440 in app/core/config.py
- Verify token expiration in authentication flow
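A minimal sketch of the LibreOffice verification step, assuming the binary is exposed as soffice. The helper name find_soffice and the candidate paths are illustrative assumptions, not taken from the project; the path it returns is what OfficeConverter would be configured with.

```python
import os
import shutil
from pathlib import Path
from typing import Callable, Iterable, Optional

# Assumed install locations (hypothetical defaults); adjust to the actual setup.
DEFAULT_CANDIDATES = (
    "/Applications/LibreOffice.app/Contents/MacOS/soffice",  # macOS app bundle
    "/opt/homebrew/bin/soffice",
    "/usr/local/bin/soffice",
)


def find_soffice(
    candidates: Iterable[str] = DEFAULT_CANDIDATES,
    which: Callable[[str], Optional[str]] = shutil.which,
) -> Optional[str]:
    """Return the first usable LibreOffice executable, or None if none is found."""
    # Prefer whatever is already reachable through PATH.
    for name in ("soffice", "libreoffice"):
        found = which(name)
        if found:
            return found
    # Fall back to the assumed install locations.
    for candidate in candidates:
        path = Path(candidate)
        if path.is_file() and os.access(path, os.X_OK):
            return str(path)
    return None


if __name__ == "__main__":
    location = find_soffice()
    print(location if location else "LibreOffice not found")
```

Passing `which` as a parameter keeps the lookup testable on machines where LibreOffice is not installed.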
Phase 2: Document Conversion Implementation
- Create Office document converter class
- Add office_converter.py to services directory
- Implement Word document conversion methods
- convert_docx_to_pdf() for DOCX files
- convert_doc_to_pdf() for DOC files
- Implement PowerPoint conversion methods
- convert_pptx_to_pdf() for PPTX files
- convert_ppt_to_pdf() for PPT files
- Add error handling and logging
- Add file validation methods
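The conversion tasks above could be sketched roughly as follows, assuming LibreOffice is driven through its headless command line (--headless, --convert-to, --outdir). The names ConversionError, build_convert_command, and convert_to_pdf are hypothetical; the per-format methods listed above (convert_docx_to_pdf() and so on) could be thin wrappers around one such helper.

```python
import logging
import subprocess
from pathlib import Path
from typing import List, Union

logger = logging.getLogger(__name__)

# Extensions taken from the task list; everything else here is an assumption.
OFFICE_SUFFIXES = {".doc", ".docx", ".ppt", ".pptx"}


class ConversionError(RuntimeError):
    """Raised when a document cannot be converted to PDF (hypothetical type)."""


def build_convert_command(soffice: str, source: Path, out_dir: Path) -> List[str]:
    """Build the headless LibreOffice command line for a PDF conversion."""
    return [soffice, "--headless", "--convert-to", "pdf",
            "--outdir", str(out_dir), str(source)]


def convert_to_pdf(soffice: str, source: Union[str, Path],
                   out_dir: Union[str, Path], timeout: int = 120) -> Path:
    """Validate the input, run LibreOffice, and return the produced PDF path."""
    source, out_dir = Path(source), Path(out_dir)
    # File validation: reject unsupported types before touching the disk.
    if source.suffix.lower() not in OFFICE_SUFFIXES:
        raise ConversionError(f"unsupported format: {source.suffix or '(none)'}")
    if not source.is_file():
        raise ConversionError(f"file not found: {source}")
    out_dir.mkdir(parents=True, exist_ok=True)
    cmd = build_convert_command(soffice, source, out_dir)
    try:
        result = subprocess.run(cmd, capture_output=True, text=True, timeout=timeout)
    except (OSError, subprocess.TimeoutExpired) as exc:
        logger.error("LibreOffice could not be run: %s", exc)
        raise ConversionError(str(exc)) from exc
    # LibreOffice names the output after the input file's stem.
    pdf_path = out_dir / (source.stem + ".pdf")
    if result.returncode != 0 or not pdf_path.is_file():
        logger.error("conversion failed: %s", result.stderr.strip())
        raise ConversionError(f"conversion failed for {source}")
    return pdf_path
```

Checking for the output file as well as the return code is a defensive choice in this sketch, so a run that exits cleanly without writing a PDF is still reported as a failure.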
Phase 3: OCR Service Integration
- Update OCR service to handle Office formats
- Modify process_image() in ocr_service.py
- Add Office format detection logic
- Integrate Office-to-PDF conversion pipeline
- Update supported formats list in configuration
- Update file manager service
- Add Office formats to allowed extensions (file_manager.py)
- Update file validation logic
- Update config.py allowed extensions
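One way the format detection and allowed-extension check could look, as a sketch only. The Office extensions come from this task list; the image and PDF extension sets, the stage names, and the functions is_office_file, is_allowed, and plan_stages are assumptions made for illustration.

```python
from pathlib import Path
from typing import List

OFFICE_EXTENSIONS = {".doc", ".docx", ".ppt", ".pptx"}  # from the task list
IMAGE_EXTENSIONS = {".png", ".jpg", ".jpeg"}            # assumed
PDF_EXTENSIONS = {".pdf"}                               # assumed
ALLOWED_EXTENSIONS = OFFICE_EXTENSIONS | IMAGE_EXTENSIONS | PDF_EXTENSIONS


def is_office_file(filename: str) -> bool:
    """True when the file extension marks a Word or PowerPoint document."""
    return Path(filename).suffix.lower() in OFFICE_EXTENSIONS


def is_allowed(filename: str) -> bool:
    """True when the upload has one of the accepted extensions."""
    return Path(filename).suffix.lower() in ALLOWED_EXTENSIONS


def plan_stages(filename: str) -> List[str]:
    """Return the processing stages a file would pass through (hypothetical names)."""
    if not is_allowed(filename):
        raise ValueError(f"unsupported file type: {filename}")
    suffix = Path(filename).suffix.lower()
    if suffix in OFFICE_EXTENSIONS:
        # Office documents are converted to PDF first, then handled like any PDF.
        return ["convert_to_pdf", "render_pages", "ocr"]
    if suffix in PDF_EXTENSIONS:
        return ["render_pages", "ocr"]
    return ["ocr"]
```

Keeping the Office set separate from the full allowed set lets the OCR service branch on is_office_file() while the file manager validates against ALLOWED_EXTENSIONS.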
Phase 4: API Updates
- File validation updated (already accepts Office formats via file_manager.py)
- Core API integration complete (Office files processed via existing endpoints)
- API documentation strings (optional enhancement)
- Add Office format examples to OpenAPI schema (optional enhancement)
Phase 5: Testing
- Create test Office documents
- Sample DOCX with mixed Chinese/English content
- Test document creation script (create_docx.py)
- Verify document conversion capability
- LibreOffice headless mode verified
- OfficeConverter service tested
- Test token validity
- Verified 24-hour token expiration (1440 minutes)
- Confirmed in login response
- Core functionality verified
- Office format detection working
- Office → PDF → Images → OCR pipeline implemented
- File validation accepts .doc, .docx, .ppt, .pptx
- Automated integration testing
- Fixed API endpoint paths in test script
- Fixed configuration loading (.env file update)
- Fixed preprocessor bugs (MIME types, validation, return order)
- End-to-end test completed successfully (batch 24)
- OCR accuracy: 97.39% confidence on mixed Chinese/English content
- Manual end-to-end testing
- DOCX → PDF → Images → OCR pipeline verified
- Processing time: ~375 seconds (includes model initialization)
- Result output format validated (Markdown generation working)
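The token validity check above rests on 1440 minutes being exactly 24 hours. A small sketch of that arithmetic, assuming the expiry is computed by adding the configured minutes to the issue time; token_expiry and exp_claim are hypothetical helpers, not the project's authentication code.

```python
from datetime import datetime, timedelta, timezone

ACCESS_TOKEN_EXPIRE_MINUTES = 1440  # value from the task list (24 hours)


def token_expiry(issued_at: datetime,
                 minutes: int = ACCESS_TOKEN_EXPIRE_MINUTES) -> datetime:
    """Return the moment a token issued at `issued_at` would expire."""
    return issued_at + timedelta(minutes=minutes)


def exp_claim(issued_at: datetime,
              minutes: int = ACCESS_TOKEN_EXPIRE_MINUTES) -> int:
    """Return the expiry as a Unix timestamp, the form a JWT exp claim uses."""
    return int(token_expiry(issued_at, minutes).timestamp())


if __name__ == "__main__":
    issued = datetime.now(timezone.utc)
    lifetime = token_expiry(issued) - issued
    print(f"token lifetime: {lifetime.total_seconds() / 3600:.0f} hours")
```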
Phase 6: Documentation
- Update README with Office format support (covered in IMPLEMENTATION.md)
- Test documents available in demo_docs/office_tests/
- API documentation update (endpoints unchanged, format list extended)
- Migration guide (no breaking changes, backward compatible)