OCR/openspec/changes/archive/2025-11-18-add-office-document-support/proposal.md

Add Office Document Support

Status: IMPLEMENTED & TESTED

Summary

Add support for Microsoft Office document formats (DOC, DOCX, PPT, PPTX) in the OCR processing pipeline, and extend the JWT token validity period to 1 day (24 hours).

Motivation

Currently, the system supports only image formats (PNG, JPG, JPEG) and PDF files. Many users have documents in Microsoft Office formats that also need OCR processing. This change will:

  1. Enable processing of Word and PowerPoint documents
  2. Improve user experience by extending token validity
  3. Leverage existing PDF-to-image conversion infrastructure

Proposed Solution

1. Office Document Support

  • Add Python libraries for Office document conversion:
    • docx2pdf, or python-docx + pypandoc, for Word documents
    • python-pptx for PowerPoint documents
  • Implement conversion pipeline:
    • Option A: Office → PDF → Images → OCR
    • Option B: Office → Images → OCR (direct conversion)
  • Extend file validation to accept .doc, .docx, .ppt, .pptx formats
  • Add conversion methods to OCRService class
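
The validation and routing described above can be sketched roughly as follows. This is a minimal illustration of Option A, not the project's actual code: the set names, `is_allowed`, `conversion_steps`, and the stage labels are all hypothetical, and the real OCRService methods may be organized differently.

```python
from pathlib import Path

# Hypothetical names: illustrative only, not the project's actual API.
IMAGE_EXTENSIONS = {".png", ".jpg", ".jpeg"}
PDF_EXTENSIONS = {".pdf"}
OFFICE_EXTENSIONS = {".doc", ".docx", ".ppt", ".pptx"}  # newly accepted formats
ALLOWED_EXTENSIONS = IMAGE_EXTENSIONS | PDF_EXTENSIONS | OFFICE_EXTENSIONS


def is_allowed(filename: str) -> bool:
    """Return True if the file extension is accepted (case-insensitive)."""
    return Path(filename).suffix.lower() in ALLOWED_EXTENSIONS


def conversion_steps(filename: str) -> list[str]:
    """List the stages a file passes through under Option A."""
    suffix = Path(filename).suffix.lower()
    if suffix in OFFICE_EXTENSIONS:
        # Office files reuse the existing PDF-to-image path.
        return ["office_to_pdf", "pdf_to_images", "ocr"]
    if suffix in PDF_EXTENSIONS:
        return ["pdf_to_images", "ocr"]
    if suffix in IMAGE_EXTENSIONS:
        return ["ocr"]
    raise ValueError(f"Unsupported file type: {suffix or '(none)'}")
```

Under this sketch, Option A adds only one new stage in front of the existing PDF path, which matches the motivation of reusing the PDF-to-image infrastructure.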

2. Token Validity Extension

  • Update ACCESS_TOKEN_EXPIRE_MINUTES from 30 minutes to 1440 minutes (24 hours)
  • Ensure security measures are in place for longer-lived tokens
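
As a quick check of the arithmetic, 24 hours × 60 minutes gives the 1440 figure above. The sketch below shows how such a setting typically turns into an expiry timestamp; `token_expiry` is a hypothetical helper, and only the `ACCESS_TOKEN_EXPIRE_MINUTES` name comes from the proposal.

```python
from datetime import datetime, timedelta, timezone

# Setting named in the proposal; 24 hours * 60 minutes = 1440.
ACCESS_TOKEN_EXPIRE_MINUTES = 24 * 60


def token_expiry(issued_at: datetime) -> datetime:
    """Return the expiry time for a token issued at `issued_at` (hypothetical helper)."""
    return issued_at + timedelta(minutes=ACCESS_TOKEN_EXPIRE_MINUTES)


issued = datetime(2025, 11, 18, 12, 0, tzinfo=timezone.utc)
expires = token_expiry(issued)
print(expires.isoformat())  # 2025-11-19T12:00:00+00:00
```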

Impact Analysis

  • Backend Services: Minimal changes to existing OCR processing flow
  • Dependencies: New Python packages for Office document handling
  • Performance: Slight increase in processing time for document conversion
  • Security: Longer token validity requires careful consideration
  • Storage: Temporary files during conversion process
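
For the storage point, one common way to keep intermediate files from accumulating is to stage them in a temporary directory that is removed when conversion finishes. The sketch below assumes that pattern. `convert_in_temp_dir` is a hypothetical name, and the conversion step itself is deliberately left out.

```python
import tempfile
from pathlib import Path


def convert_in_temp_dir(source_bytes: bytes, suffix: str) -> tuple[int, Path]:
    """Stage an upload in a temporary directory that is removed afterwards.

    The conversion itself is omitted; this only shows the cleanup pattern.
    """
    with tempfile.TemporaryDirectory(prefix="office_convert_") as tmp:
        work_dir = Path(tmp)
        staged = work_dir / f"input{suffix}"
        staged.write_bytes(source_bytes)
        # A real converter (assumed, not shown) would write its PDF or
        # page images into work_dir here.
        size = staged.stat().st_size
    # Leaving the `with` block deletes work_dir and everything in it.
    return size, work_dir
```

The returned path is included only so a caller can confirm the directory no longer exists after the function returns.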

Success Criteria

  1. Successfully process Word documents (.doc, .docx) with OCR
  2. Successfully process PowerPoint documents (.ppt, .pptx) with OCR
  3. JWT tokens remain valid for 24 hours
  4. All existing functionality continues to work
  5. Conversion quality maintains text readability for OCR

Timeline

  • Implementation: 2-3 hours
  • Testing: 1 hour
  • Documentation: 30 minutes
  • Total: ~4 hours (COMPLETED)

Actual Time

  • Total development time: ~6 hours (including debugging and testing)
  • Primary issues resolved: Configuration loading, MIME type mapping, validation logic, API endpoint fixes
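
The MIME type mapping issue mentioned above is not detailed in this proposal. As a rough illustration of what such a mapping involves, the sketch below pairs each new extension with its standard registered media type. The dictionary and function names are hypothetical, and the project's actual fix may look different.

```python
from pathlib import Path

# Standard media types for the four Office formats.
# The dictionary and function names are hypothetical.
OFFICE_MIME_TYPES = {
    ".doc": "application/msword",
    ".docx": "application/vnd.openxmlformats-officedocument.wordprocessingml.document",
    ".ppt": "application/vnd.ms-powerpoint",
    ".pptx": "application/vnd.openxmlformats-officedocument.presentationml.presentation",
}


def mime_matches_extension(filename: str, mime_type: str) -> bool:
    """Check that a reported MIME type agrees with the file extension."""
    expected = OFFICE_MIME_TYPES.get(Path(filename).suffix.lower())
    return expected is not None and expected == mime_type.lower()
```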