OCR/openspec/changes/archive/2025-11-18-add-office-document-support/tasks.md
egg cd3cbea49d chore: project cleanup and prepare for dual-track processing refactor
- Removed all test files and directories
- Deleted outdated documentation (will be rewritten)
- Cleaned up temporary files, logs, and uploads
- Archived 5 completed OpenSpec proposals
- Created new dual-track-document-processing proposal with complete OpenSpec structure
  - Dual-track architecture: OCR track (PaddleOCR) + Direct track (PyMuPDF)
  - UnifiedDocument model for consistent output
  - Support for structure-preserving translation
- Updated .gitignore to prevent future test/temp files

This is a major cleanup preparing for the complete refactoring of the document processing pipeline.

🤖 Generated with [Claude Code](https://claude.com/claude-code)

Co-Authored-By: Claude <noreply@anthropic.com>
2025-11-18 20:02:31 +08:00


Implementation Tasks

Phase 1: Dependencies & Configuration

  • Install Office document processing libraries
    • Install LibreOffice via Homebrew (run in headless mode for conversion)
    • Verify LibreOffice installation and accessibility
    • Configure LibreOffice path in OfficeConverter
  • Update JWT token configuration
    • Change ACCESS_TOKEN_EXPIRE_MINUTES to 1440 (24 hours) in app/core/config.py
    • Verify token expiration in authentication flow
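
The two Phase 1 checks could be sketched roughly as below. This is an illustration under assumptions, not the project's code: `find_soffice`, `token_lifetime`, and the macOS application path are hypothetical names and locations; only `ACCESS_TOKEN_EXPIRE_MINUTES = 1440` comes from the task list.

```python
import shutil
from datetime import timedelta
from pathlib import Path
from typing import Iterable, Optional

# Value from the task list: 1440 minutes.
ACCESS_TOKEN_EXPIRE_MINUTES = 1440

# Assumed fallback location of the LibreOffice binary on macOS.
MACOS_APP_BINARY = Path("/Applications/LibreOffice.app/Contents/MacOS/soffice")


def find_soffice(extra_candidates: Iterable[Path] = (MACOS_APP_BINARY,)) -> Optional[str]:
    """Return a path to the LibreOffice binary, or None if it cannot be found."""
    # First look for the usual executable names on PATH.
    for name in ("soffice", "libreoffice"):
        found = shutil.which(name)
        if found:
            return found
    # Then try explicit fallback locations.
    for candidate in extra_candidates:
        if Path(candidate).is_file():
            return str(candidate)
    return None


def token_lifetime(minutes: int = ACCESS_TOKEN_EXPIRE_MINUTES) -> timedelta:
    """Express the configured token lifetime as a timedelta."""
    return timedelta(minutes=minutes)
```

Calling `find_soffice()` at startup and failing early when it returns `None` would surface a missing installation before the first conversion request.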

Phase 2: Document Conversion Implementation

  • Create Office document converter class
    • Add office_converter.py to services directory
    • Implement Word document conversion methods
      • convert_docx_to_pdf() for DOCX files
      • convert_doc_to_pdf() for DOC files
    • Implement PowerPoint conversion methods
      • convert_pptx_to_pdf() for PPTX files
      • convert_ppt_to_pdf() for PPT files
    • Add error handling and logging
    • Add file validation methods
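
A minimal sketch of what such a converter class might look like, assuming conversion is delegated to LibreOffice through its `--headless --convert-to pdf --outdir` command-line options. The four `convert_*_to_pdf` method names come from the task list; the constructor parameters, `build_command`, `validate`, and `_convert` are hypothetical.

```python
import logging
import subprocess
from pathlib import Path
from typing import List, Set

logger = logging.getLogger(__name__)


class OfficeConverter:
    """Hypothetical sketch: convert Office files to PDF with headless LibreOffice."""

    def __init__(self, soffice_path: str = "soffice", timeout: int = 120):
        self.soffice_path = soffice_path  # assumed to be configurable
        self.timeout = timeout            # seconds before giving up

    def build_command(self, source: Path, output_dir: Path) -> List[str]:
        """Assemble the LibreOffice command line for one conversion."""
        return [self.soffice_path, "--headless", "--convert-to", "pdf",
                "--outdir", str(output_dir), str(source)]

    def validate(self, source: Path, allowed: Set[str]) -> None:
        """Reject unexpected extensions first, then missing files."""
        if source.suffix.lower() not in allowed:
            raise ValueError(f"Unsupported extension: {source.suffix!r}")
        if not source.is_file():
            raise FileNotFoundError(str(source))

    def _convert(self, source, output_dir, allowed: Set[str]) -> Path:
        source, output_dir = Path(source), Path(output_dir)
        self.validate(source, allowed)
        output_dir.mkdir(parents=True, exist_ok=True)
        try:
            subprocess.run(self.build_command(source, output_dir),
                           check=True, capture_output=True, timeout=self.timeout)
        except (subprocess.CalledProcessError, subprocess.TimeoutExpired) as exc:
            logger.error("Conversion failed for %s: %s", source, exc)
            raise RuntimeError(f"Conversion failed for {source}") from exc
        pdf_path = output_dir / (source.stem + ".pdf")
        if not pdf_path.is_file():
            raise RuntimeError(f"No PDF was produced for {source}")
        return pdf_path

    def convert_docx_to_pdf(self, source, output_dir) -> Path:
        return self._convert(source, output_dir, {".docx"})

    def convert_doc_to_pdf(self, source, output_dir) -> Path:
        return self._convert(source, output_dir, {".doc"})

    def convert_pptx_to_pdf(self, source, output_dir) -> Path:
        return self._convert(source, output_dir, {".pptx"})

    def convert_ppt_to_pdf(self, source, output_dir) -> Path:
        return self._convert(source, output_dir, {".ppt"})
```

Routing all four public methods through one private helper keeps validation, logging, and error handling in a single place; the real class may be organized differently.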

Phase 3: OCR Service Integration

  • Update OCR service to handle Office formats
    • Modify process_image() in ocr_service.py
    • Add Office format detection logic
    • Integrate Office-to-PDF conversion pipeline
    • Update supported formats list in configuration
  • Update file manager service
    • Add Office formats to allowed extensions (file_manager.py)
    • Update file validation logic
    • Update config.py allowed extensions
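
The format detection step might look something like the sketch below. The four Office extensions are the ones listed under Phase 5; the image and PDF sets, the constant names, and `processing_route` are assumptions standing in for whatever config.py and ocr_service.py actually define.

```python
from pathlib import Path

# Office formats named in the task list.
OFFICE_EXTENSIONS = {".doc", ".docx", ".ppt", ".pptx"}

# Assumed pre-existing formats; the real list lives in the project's configuration.
IMAGE_EXTENSIONS = {".png", ".jpg", ".jpeg"}
PDF_EXTENSIONS = {".pdf"}

ALLOWED_EXTENSIONS = OFFICE_EXTENSIONS | IMAGE_EXTENSIONS | PDF_EXTENSIONS


def is_office_document(path) -> bool:
    """True when the file extension marks a Word or PowerPoint document."""
    return Path(path).suffix.lower() in OFFICE_EXTENSIONS


def processing_route(path) -> str:
    """Decide which branch of the pipeline a file should enter."""
    suffix = Path(path).suffix.lower()
    if suffix not in ALLOWED_EXTENSIONS:
        raise ValueError(f"Unsupported file type: {suffix!r}")
    if suffix in OFFICE_EXTENSIONS:
        return "office"  # convert to PDF first, then continue as a PDF
    if suffix in PDF_EXTENSIONS:
        return "pdf"
    return "image"
```

Extension checks are case-insensitive here, so `Slides.PPTX` is routed the same way as `slides.pptx`.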

Phase 4: API Updates

  • File validation updated (already accepts Office formats via file_manager.py)
  • Core API integration complete (Office files processed via existing endpoints)
  • API documentation strings (optional enhancement)
  • Add Office format examples to OpenAPI schema (optional enhancement)

Phase 5: Testing

  • Create test Office documents
    • Sample DOCX with mixed Chinese/English content
    • Test document creation script (create_docx.py)
  • Verify document conversion capability
    • LibreOffice headless mode verified
    • OfficeConverter service tested
  • Test token validity
    • Verified 24-hour token expiration (1440 minutes)
    • Confirmed in login response
  • Core functionality verified
    • Office format detection working
    • Office → PDF → Images → OCR pipeline implemented
    • File validation accepts .doc, .docx, .ppt, .pptx
  • Automated integration testing
    • Fixed API endpoint paths in test script
    • Fixed configuration loading (.env file update)
    • Fixed preprocessor bugs (MIME types, validation, return order)
    • End-to-end test completed successfully (batch 24)
    • OCR confidence: 97.39% on mixed Chinese/English content
  • Manual end-to-end testing
    • DOCX → PDF → Images → OCR pipeline verified
    • Processing time: ~375 seconds (includes model initialization)
    • Result output format validated (Markdown generation working)

Phase 6: Documentation

  • Update README with Office format support (covered in IMPLEMENTATION.md)
  • Test documents available in demo_docs/office_tests/
  • API documentation update (endpoints unchanged, format list extended)
  • Migration guide (no breaking changes, backward compatible)