OCR/openspec/changes/archive/2025-11-18-add-office-document-support/proposal.md
egg cd3cbea49d chore: project cleanup and prepare for dual-track processing refactor
- Removed all test files and directories
- Deleted outdated documentation (will be rewritten)
- Cleaned up temporary files, logs, and uploads
- Archived 5 completed OpenSpec proposals
- Created new dual-track-document-processing proposal with complete OpenSpec structure
  - Dual-track architecture: OCR track (PaddleOCR) + Direct track (PyMuPDF)
  - UnifiedDocument model for consistent output
  - Support for structure-preserving translation
- Updated .gitignore to prevent future test/temp files

This is a major cleanup preparing for the complete refactoring of the document processing pipeline.

🤖 Generated with [Claude Code](https://claude.com/claude-code)

Co-Authored-By: Claude <noreply@anthropic.com>
2025-11-18 20:02:31 +08:00


# Add Office Document Support
**Status**: ✅ IMPLEMENTED & TESTED
## Summary
Add support for Microsoft Office document formats (DOC, DOCX, PPT, PPTX) to the OCR processing pipeline, and extend the JWT access-token validity period from 30 minutes to 24 hours.
## Motivation
Currently, the system supports only image formats (PNG, JPG, JPEG) and PDF files. Many users have documents in Microsoft Office formats that require OCR processing. This change will:
1. Enable processing of Word and PowerPoint documents
2. Improve user experience by extending token validity, so users re-authenticate once a day instead of every 30 minutes
3. Leverage existing PDF-to-image conversion infrastructure
## Proposed Solution
### 1. Office Document Support
- Add Python libraries for Office document conversion:
  - `python-docx2pdf` or `python-docx` + `pypandoc` for Word documents
  - `python-pptx` for PowerPoint documents
- Implement conversion pipeline:
  - Option A: Office → PDF → Images → OCR
  - Option B: Office → Images → OCR (direct conversion)
- Extend file validation to accept `.doc`, `.docx`, `.ppt`, `.pptx` formats
- Add conversion methods to `OCRService` class
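
As a rough illustration of the validation step and of Option A, the sketch below extends the accepted extensions with the four Office formats and builds a command for the Office → PDF stage. It is not the implementation from this change: the function names are hypothetical, and the conversion shells out to LibreOffice in headless mode (`soffice`), which is swapped in here for the Python libraries listed above and is assumed to be installed on the host. The MIME types are the standard registered ones for these formats.

```python
from pathlib import Path
import subprocess

# Formats the proposal describes as already supported.
NATIVE_EXTENSIONS = {".png", ".jpg", ".jpeg", ".pdf"}

# Office formats added by this proposal, mapped to their registered MIME types.
OFFICE_MIME_TYPES = {
    ".doc": "application/msword",
    ".docx": "application/vnd.openxmlformats-officedocument.wordprocessingml.document",
    ".ppt": "application/vnd.ms-powerpoint",
    ".pptx": "application/vnd.openxmlformats-officedocument.presentationml.presentation",
}


def is_supported(filename: str) -> bool:
    """Hypothetical validator: accept native formats plus the Office formats."""
    ext = Path(filename).suffix.lower()
    return ext in NATIVE_EXTENSIONS or ext in OFFICE_MIME_TYPES


def needs_conversion(filename: str) -> bool:
    """Office files must be converted to PDF before the existing PDF path runs."""
    return Path(filename).suffix.lower() in OFFICE_MIME_TYPES


def build_convert_command(src: Path, out_dir: Path) -> list[str]:
    """Build the LibreOffice headless command (assumption: `soffice` is on PATH)."""
    return [
        "soffice", "--headless",
        "--convert-to", "pdf",
        "--outdir", str(out_dir),
        str(src),
    ]


def convert_to_pdf(src: Path, out_dir: Path, timeout: int = 120) -> Path:
    """Run the conversion and return the expected PDF path (Option A, first stage)."""
    subprocess.run(build_convert_command(src, out_dir), check=True, timeout=timeout)
    # LibreOffice writes <original stem>.pdf into the output directory.
    return out_dir / (src.stem + ".pdf")
```

The resulting PDF would then go through the existing PDF-to-image path, which is why Option A needs no new OCR code.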
### 2. Token Validity Extension
- Update `ACCESS_TOKEN_EXPIRE_MINUTES` from 30 minutes to 1440 minutes (24 hours)
- Ensure security measures are in place for longer-lived tokens
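
To make the effect of the new setting concrete, here is a minimal sketch of how the expiry claim could be derived from `ACCESS_TOKEN_EXPIRE_MINUTES`. Only the setting name and its values come from this proposal; `token_expiry` and `build_claims` are hypothetical helpers, and the signing step (whatever JWT library the project uses) is deliberately left out.

```python
from datetime import datetime, timedelta, timezone

# Value proposed in this change (previously 30).
ACCESS_TOKEN_EXPIRE_MINUTES = 1440


def token_expiry(issued_at: datetime,
                 minutes: int = ACCESS_TOKEN_EXPIRE_MINUTES) -> datetime:
    """Return the moment a token issued at `issued_at` stops being valid."""
    return issued_at + timedelta(minutes=minutes)


def build_claims(subject: str, issued_at: datetime) -> dict:
    """Hypothetical helper: assemble the standard `sub`, `iat` and `exp` claims."""
    return {
        "sub": subject,
        "iat": int(issued_at.timestamp()),
        "exp": int(token_expiry(issued_at).timestamp()),
    }


if __name__ == "__main__":
    now = datetime.now(timezone.utc)
    print(build_claims("example-user", now))
```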
## Impact Analysis
- **Backend Services**: Minimal changes to existing OCR processing flow
- **Dependencies**: New Python packages for Office document handling
- **Performance**: Slight increase in processing time for document conversion
- **Security**: Longer token validity requires careful consideration, since a leaked token stays usable for up to 24 hours instead of 30 minutes
- **Storage**: Temporary files during conversion process
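
One way to keep the temporary files from the Storage item bounded is to scope them to a scratch directory that is removed once conversion finishes. The sketch below assumes nothing about the project's actual cleanup logic; `conversion_workdir` is a hypothetical helper built on the standard library's `tempfile.TemporaryDirectory`.

```python
from contextlib import contextmanager
from pathlib import Path
from typing import Iterator
import tempfile


@contextmanager
def conversion_workdir(prefix: str = "office2pdf_") -> Iterator[Path]:
    """Yield a scratch directory that is deleted on exit, even after an error."""
    with tempfile.TemporaryDirectory(prefix=prefix) as tmp:
        yield Path(tmp)


if __name__ == "__main__":
    with conversion_workdir() as workdir:
        # Intermediate PDFs and page images would be written here.
        (workdir / "intermediate.pdf").write_bytes(b"%PDF-1.4")
        print(sorted(p.name for p in workdir.iterdir()))
    # The directory and everything in it no longer exist at this point.
```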
## Success Criteria
1. Successfully process Word documents (.doc, .docx) with OCR
2. Successfully process PowerPoint documents (.ppt, .pptx) with OCR
3. JWT tokens remain valid for 24 hours
4. All existing functionality continues to work
5. Conversion quality maintains text readability for OCR
## Timeline
- Implementation: 2-3 hours ✅
- Testing: 1 hour ✅
- Documentation: 30 mins ✅
- Total: ~4 hours ✅ COMPLETED
## Actual Time
- Total development time: ~6 hours (including debugging and testing)
- Primary issues resolved: Configuration loading, MIME type mapping, validation logic, API endpoint fixes