# Add Office Document Support

**Status**: ✅ IMPLEMENTED & TESTED

## Summary
Add support for Microsoft Office document formats (DOC, DOCX, PPT, PPTX) in the OCR processing pipeline and extend the JWT token validity period to 1 day.

## Motivation
Currently, the system only supports image formats (PNG, JPG, JPEG) and PDF files. Many users have documents in Microsoft Office formats that require OCR processing. This change will:
1. Enable processing of Word and PowerPoint documents
2. Improve user experience by extending token validity
3. Leverage existing PDF-to-image conversion infrastructure

## Proposed Solution

### 1. Office Document Support
- Add Python libraries for Office document conversion:
  - `python-docx2pdf` or `python-docx` + `pypandoc` for Word documents
  - `python-pptx` for PowerPoint documents
- Implement conversion pipeline:
  - Option A: Office → PDF → Images → OCR
  - Option B: Office → Images → OCR (direct conversion)
- Extend file validation to accept `.doc`, `.docx`, `.ppt`, `.pptx` formats
- Add conversion methods to `OCRService` class
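
The validation and Option A routing steps above can be sketched roughly as follows. This is a minimal illustration, not the project's actual code: the names `ALLOWED_EXTENSIONS`, `validate_extension`, `needs_office_conversion`, and `build_pdf_convert_command` are hypothetical. Using LibreOffice's headless `soffice` command as the Office → PDF converter is also an assumption, since the proposal does not name a converter binary.

```python
from __future__ import annotations

from pathlib import Path

# Formats the pipeline already accepts, per the Motivation section.
EXISTING_EXTENSIONS = {".png", ".jpg", ".jpeg", ".pdf"}
# Office formats this proposal adds.
OFFICE_EXTENSIONS = {".doc", ".docx", ".ppt", ".pptx"}
ALLOWED_EXTENSIONS = EXISTING_EXTENSIONS | OFFICE_EXTENSIONS


def validate_extension(filename: str) -> str:
    """Return the lowercased extension, or raise ValueError if unsupported."""
    ext = Path(filename).suffix.lower()
    if ext not in ALLOWED_EXTENSIONS:
        raise ValueError(f"Unsupported file type: {ext or '(none)'}")
    return ext


def needs_office_conversion(filename: str) -> bool:
    """True if the file must become a PDF first (Option A) before OCR."""
    return Path(filename).suffix.lower() in OFFICE_EXTENSIONS


def build_pdf_convert_command(source: str, out_dir: str) -> list[str]:
    """Build a LibreOffice headless conversion command.

    Assumption: LibreOffice is installed and `soffice` is on PATH.
    """
    return ["soffice", "--headless", "--convert-to", "pdf", "--outdir", out_dir, source]
```

Under these assumptions, passing the returned list to `subprocess.run(cmd, check=True)` would write a PDF into `out_dir`, which the existing PDF-to-image step could then pick up.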

### 2. Token Validity Extension
- Update `ACCESS_TOKEN_EXPIRE_MINUTES` from 30 minutes to 1440 minutes (24 hours)
- Ensure security measures are in place for longer-lived tokens
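
To make the effect of the new setting concrete, here is a small sketch of how the expiry claim might be derived from `ACCESS_TOKEN_EXPIRE_MINUTES`. The helpers `build_access_claims` and `is_expired` are hypothetical and not taken from the codebase. Only the standard JWT claim names `sub`, `iat`, and `exp` are assumed, and token signing is left out.

```python
from datetime import datetime, timedelta, timezone
from typing import Optional

ACCESS_TOKEN_EXPIRE_MINUTES = 1440  # 24 hours, up from 30 minutes


def build_access_claims(subject: str, now: Optional[datetime] = None) -> dict:
    """Return JWT-style claims whose `exp` is 24 hours after `iat`."""
    issued_at = now or datetime.now(timezone.utc)
    expires_at = issued_at + timedelta(minutes=ACCESS_TOKEN_EXPIRE_MINUTES)
    return {
        "sub": subject,
        "iat": int(issued_at.timestamp()),  # seconds since the epoch
        "exp": int(expires_at.timestamp()),
    }


def is_expired(claims: dict, now: Optional[datetime] = None) -> bool:
    """True once the current time has reached the `exp` claim."""
    current = now or datetime.now(timezone.utc)
    return int(current.timestamp()) >= claims["exp"]
```

Passing `now` explicitly keeps the helpers deterministic in tests, which is a convenient way to verify the 24-hour success criterion without waiting a day.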

## Impact Analysis
- **Backend Services**: Minimal changes to existing OCR processing flow
- **Dependencies**: New Python packages for Office document handling
- **Performance**: Slight increase in processing time for document conversion
- **Security**: Longer token validity widens the window in which a leaked token remains usable, so it requires careful consideration
- **Storage**: Temporary files during conversion process

## Success Criteria
1. Successfully process Word documents (.doc, .docx) with OCR
2. Successfully process PowerPoint documents (.ppt, .pptx) with OCR
3. JWT tokens remain valid for 24 hours
4. All existing functionality continues to work
5. Conversion quality maintains text readability for OCR

## Timeline
- Implementation: 2-3 hours ✅
- Testing: 1 hour ✅
- Documentation: 30 mins ✅
- Total: ~4 hours ✅ COMPLETED

## Actual Time
- Total development time: ~6 hours (including debugging and testing)
- Primary issues resolved: Configuration loading, MIME type mapping, validation logic, API endpoint fixes
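
The proposal does not show the MIME type mapping that needed fixing. As an illustration only, a mapping for the four new formats might look like the sketch below. The names `OFFICE_MIME_TYPES` and `mime_matches_extension` are hypothetical. The MIME strings themselves are the standard registered types for these formats.

```python
from pathlib import Path

# Standard MIME types for the four Office formats added by this proposal.
OFFICE_MIME_TYPES = {
    ".doc": "application/msword",
    ".docx": "application/vnd.openxmlformats-officedocument.wordprocessingml.document",
    ".ppt": "application/vnd.ms-powerpoint",
    ".pptx": "application/vnd.openxmlformats-officedocument.presentationml.presentation",
}


def mime_matches_extension(filename: str, content_type: str) -> bool:
    """Check that a declared Content-Type agrees with the file extension."""
    expected = OFFICE_MIME_TYPES.get(Path(filename).suffix.lower())
    # Drop parameters such as "; charset=binary" before comparing.
    declared = content_type.split(";", 1)[0].strip().lower()
    return expected is not None and declared == expected
```

A check like this only compares declared metadata. It does not inspect file contents, so it complements the extension validation without replacing it.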