# Implementation Tasks

## Phase 1: Dependencies & Configuration
- [x] Install Office document processing libraries
  - [x] Install LibreOffice via Homebrew (headless mode for conversion)
  - [x] Verify LibreOffice installation and accessibility
  - [x] Configure LibreOffice path in `OfficeConverter`
- [x] Update JWT token configuration
  - [x] Change `ACCESS_TOKEN_EXPIRE_MINUTES` to 1440 in `app/core/config.py`
  - [x] Verify token expiration in authentication flow
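
The token setting above can be sanity-checked with a small sketch. Only `ACCESS_TOKEN_EXPIRE_MINUTES` and its value come from the task list; `token_expiry` and `build_claims` are hypothetical helpers, and signing the claims (with whichever JWT library the project uses) is deliberately left out.

```python
from datetime import datetime, timedelta, timezone
from typing import Optional

# Value from the task list: 1440 minutes = 24 hours.
ACCESS_TOKEN_EXPIRE_MINUTES = 1440


def token_expiry(issued_at: datetime,
                 minutes: int = ACCESS_TOKEN_EXPIRE_MINUTES) -> datetime:
    """Return the moment a token issued at `issued_at` stops being valid."""
    return issued_at + timedelta(minutes=minutes)


def build_claims(subject: str, issued_at: Optional[datetime] = None) -> dict:
    """Build unsigned JWT-style claims (hypothetical helper, no signing)."""
    issued_at = issued_at or datetime.now(timezone.utc)
    return {
        "sub": subject,
        "iat": int(issued_at.timestamp()),
        "exp": int(token_expiry(issued_at).timestamp()),
    }
```

Comparing `exp` and `iat` in a decoded login token gives a quick check that the configured lifetime is the one actually applied.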

## Phase 2: Document Conversion Implementation
- [x] Create Office document converter class
  - [x] Add `office_converter.py` to services directory
  - [x] Implement Word document conversion methods
    - [x] `convert_docx_to_pdf()` for DOCX files
    - [x] `convert_doc_to_pdf()` for DOC files
  - [x] Implement PowerPoint conversion methods
    - [x] `convert_pptx_to_pdf()` for PPTX files
    - [x] `convert_ppt_to_pdf()` for PPT files
  - [x] Add error handling and logging
  - [x] Add file validation methods
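
A minimal sketch of how such a converter might be laid out, assuming it shells out to LibreOffice's `soffice --headless --convert-to pdf` command. The four method names come from the list above; `OfficeConversionError`, `validate_file`, `build_command`, `_convert`, the timeout, and the macOS fallback path are assumptions for illustration, not the project's actual code.

```python
import logging
import shutil
import subprocess
from pathlib import Path
from typing import Optional

logger = logging.getLogger(__name__)


class OfficeConversionError(RuntimeError):
    """Raised when validation or conversion fails (hypothetical name)."""


class OfficeConverter:
    SUPPORTED = {".doc", ".docx", ".ppt", ".pptx"}
    # Assumed default location of the macOS app bundle; adjust per machine.
    DEFAULT_SOFFICE = "/Applications/LibreOffice.app/Contents/MacOS/soffice"

    def __init__(self, soffice_path: Optional[str] = None, timeout: int = 120):
        self.soffice_path = (
            soffice_path or shutil.which("soffice") or self.DEFAULT_SOFFICE
        )
        self.timeout = timeout

    def validate_file(self, path: Path) -> None:
        """Reject unsupported extensions and missing files."""
        if path.suffix.lower() not in self.SUPPORTED:
            raise OfficeConversionError(f"Unsupported format: {path.suffix}")
        if not path.is_file():
            raise OfficeConversionError(f"File not found: {path}")

    def build_command(self, src: Path, out_dir: Path) -> list:
        """Assemble the headless LibreOffice command line."""
        return [self.soffice_path, "--headless", "--convert-to", "pdf",
                "--outdir", str(out_dir), str(src)]

    def _convert(self, src, out_dir) -> Path:
        src, out_dir = Path(src), Path(out_dir)
        self.validate_file(src)
        out_dir.mkdir(parents=True, exist_ok=True)
        try:
            subprocess.run(self.build_command(src, out_dir), check=True,
                           capture_output=True, timeout=self.timeout)
        except (OSError, subprocess.SubprocessError) as exc:
            logger.error("Conversion failed for %s: %s", src, exc)
            raise OfficeConversionError(str(exc)) from exc
        pdf = out_dir / (src.stem + ".pdf")
        if not pdf.exists():
            raise OfficeConversionError(f"No PDF produced for {src}")
        return pdf

    # LibreOffice picks the import filter from the file itself, so the
    # four public methods can share one code path in this sketch.
    def convert_docx_to_pdf(self, src, out_dir) -> Path:
        return self._convert(src, out_dir)

    def convert_doc_to_pdf(self, src, out_dir) -> Path:
        return self._convert(src, out_dir)

    def convert_pptx_to_pdf(self, src, out_dir) -> Path:
        return self._convert(src, out_dir)

    def convert_ppt_to_pdf(self, src, out_dir) -> Path:
        return self._convert(src, out_dir)
```

Keeping command construction separate from execution makes the command line testable without LibreOffice installed.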

## Phase 3: OCR Service Integration
- [x] Update OCR service to handle Office formats
  - [x] Modify `process_image()` in `ocr_service.py`
  - [x] Add Office format detection logic
  - [x] Integrate Office-to-PDF conversion pipeline
  - [x] Update supported formats list in configuration
- [x] Update file manager service
  - [x] Add Office formats to allowed extensions (`file_manager.py`)
  - [x] Update file validation logic
  - [x] Update allowed extensions in `config.py`
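
The detection and allow-list steps above might look roughly like the following. The Office extensions match those listed in the testing phase; the image and PDF extension sets, the constant names, and `classify_upload` are assumptions made for the example.

```python
from pathlib import Path

# Office formats named in this task list.
OFFICE_EXTENSIONS = {".doc", ".docx", ".ppt", ".pptx"}
# Assumed pre-existing formats; the real list lives in the project config.
IMAGE_EXTENSIONS = {".png", ".jpg", ".jpeg"}
PDF_EXTENSIONS = {".pdf"}

ALLOWED_EXTENSIONS = OFFICE_EXTENSIONS | IMAGE_EXTENSIONS | PDF_EXTENSIONS


def classify_upload(filename: str) -> str:
    """Return 'office', 'pdf', or 'image'; raise ValueError if not allowed."""
    ext = Path(filename).suffix.lower()
    if ext in OFFICE_EXTENSIONS:
        return "office"  # must be converted to PDF before OCR
    if ext in PDF_EXTENSIONS:
        return "pdf"
    if ext in IMAGE_EXTENSIONS:
        return "image"
    raise ValueError(f"Extension not allowed: {ext or '(none)'}")
```

Lower-casing the suffix means `Report.DOCX` and `report.docx` are treated the same way.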

## Phase 4: API Updates
- [x] File validation updated (already accepts Office formats via `file_manager.py`)
- [x] Core API integration complete (Office files processed via existing endpoints)
- [ ] Add API documentation strings (optional enhancement)
- [ ] Add Office format examples to OpenAPI schema (optional enhancement)

## Phase 5: Testing
- [x] Create test Office documents
  - [x] Sample DOCX with mixed Chinese/English content
  - [x] Test document creation script (`create_docx.py`)
- [x] Verify document conversion capability
  - [x] LibreOffice headless mode verified
  - [x] `OfficeConverter` service tested
- [x] Test token validity
  - [x] Verified 24-hour token expiration (1440 minutes)
  - [x] Confirmed in login response
- [x] Core functionality verified
  - [x] Office format detection working
  - [x] Office → PDF → Images → OCR pipeline implemented
  - [x] File validation accepts `.doc`, `.docx`, `.ppt`, `.pptx`
- [x] Automated integration testing
  - [x] Fixed API endpoint paths in test script
  - [x] Fixed configuration loading (`.env` file update)
  - [x] Fixed preprocessor bugs (MIME types, validation, return order)
  - [x] End-to-end test completed successfully (batch 24)
  - [x] OCR confidence: 97.39% on mixed Chinese/English content
- [x] Manual end-to-end testing
  - [x] DOCX → PDF → Images → OCR pipeline verified
  - [x] Processing time: ~375 seconds (includes model initialization)
  - [x] Result output format validated (Markdown generation working)
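
The Office → PDF → Images → OCR flow verified above can be pictured as a chain of three stages. This is a sketch only: `run_office_ocr` and its callable parameters are hypothetical stand-ins for the real converter, PDF rasterizer, and OCR engine, and joining pages with blank lines is an assumed output format.

```python
from pathlib import Path
from typing import Callable, Iterable


def run_office_ocr(
    src: Path,
    to_pdf: Callable[[Path], Path],
    pdf_to_images: Callable[[Path], Iterable[Path]],
    ocr_image: Callable[[Path], str],
) -> str:
    """Convert an Office file to PDF, rasterize it, and OCR each page."""
    pdf_path = to_pdf(src)                   # Office -> PDF
    page_images = pdf_to_images(pdf_path)    # PDF -> one image per page
    page_texts = [ocr_image(img) for img in page_images]  # image -> text
    # Pages are separated by a blank line in the combined output.
    return "\n\n".join(page_texts)
```

Passing the stages in as callables lets an integration test swap in stubs, so the control flow can be checked without loading the OCR model.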

## Phase 6: Documentation
- [x] Update README with Office format support (covered in IMPLEMENTATION.md)
- [x] Test documents available in `demo_docs/office_tests/`
- [x] API documentation update (endpoints unchanged, format list extended)
- [x] Migration guide (no breaking changes, backward compatible)