egg/OCR

egg cd3cbea49d chore: project cleanup and prepare for dual-track processing refactor

- Removed all test files and directories
- Deleted outdated documentation (will be rewritten)
- Cleaned up temporary files, logs, and uploads
- Archived 5 completed OpenSpec proposals
- Created new dual-track-document-processing proposal with complete OpenSpec structure
  - Dual-track architecture: OCR track (PaddleOCR) + Direct track (PyMuPDF)
  - UnifiedDocument model for consistent output
  - Support for structure-preserving translation
- Updated .gitignore to prevent future test/temp files

This is a major cleanup preparing for the complete refactoring of the document processing pipeline.

🤖 Generated with [Claude Code](https://claude.com/claude-code)

Co-Authored-By: Claude <noreply@anthropic.com>

2025-11-18 20:02:31 +08:00


Add Office Document Support

Status: ✅ IMPLEMENTED & TESTED

Summary

Add support for Microsoft Office document formats (DOC, DOCX, PPT, PPTX) to the OCR processing pipeline, and extend the JWT access token validity period from 30 minutes to 1 day.

Motivation

Currently, the system only supports image formats (PNG, JPG, JPEG) and PDF files. Many users have documents in Microsoft Office formats that require OCR processing. This change will:

- Enable processing of Word and PowerPoint documents
- Improve user experience by extending token validity
- Leverage existing PDF-to-image conversion infrastructure

Proposed Solution

1. Office Document Support

- Add Python libraries for Office document conversion:
  - python-docx2pdf or python-docx + pypandoc for Word documents
  - python-pptx for PowerPoint documents
- Implement conversion pipeline:
  - Option A: Office → PDF → Images → OCR
  - Option B: Office → Images → OCR (direct conversion)
- Extend file validation to accept .doc, .docx, .ppt, .pptx formats
- Add conversion methods to OCRService class
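The validation and routing steps above can be sketched as follows. This is a minimal illustration, not the project's actual code: `ALLOWED_EXTENSIONS`, `is_allowed`, `plan_pipeline`, and the step names are hypothetical, and the sketch assumes Option A (Office → PDF → Images → OCR). It only decides which stages a file needs; it does not perform any conversion.

```python
from pathlib import Path

# Formats already supported, per the Motivation section.
IMAGE_EXTENSIONS = {".png", ".jpg", ".jpeg"}
PDF_EXTENSIONS = {".pdf"}
# Formats this proposal adds.
OFFICE_EXTENSIONS = {".doc", ".docx", ".ppt", ".pptx"}

ALLOWED_EXTENSIONS = IMAGE_EXTENSIONS | PDF_EXTENSIONS | OFFICE_EXTENSIONS


def is_allowed(filename: str) -> bool:
    """Return True if the file extension is accepted (case-insensitive)."""
    return Path(filename).suffix.lower() in ALLOWED_EXTENSIONS


def plan_pipeline(filename: str) -> list[str]:
    """Return the ordered stages a file passes through under Option A.

    The stage names are placeholders for whatever methods the real
    OCRService exposes.
    """
    suffix = Path(filename).suffix.lower()
    if suffix in OFFICE_EXTENSIONS:
        return ["office_to_pdf", "pdf_to_images", "ocr"]
    if suffix in PDF_EXTENSIONS:
        return ["pdf_to_images", "ocr"]
    if suffix in IMAGE_EXTENSIONS:
        return ["ocr"]
    raise ValueError(f"Unsupported file type: {suffix or '(none)'}")
```

Routing Office files through the existing PDF stage keeps the new code limited to a single conversion step, which matches the stated goal of reusing the PDF-to-image infrastructure.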

2. Token Validity Extension

- Update ACCESS_TOKEN_EXPIRE_MINUTES from 30 minutes to 1440 minutes (24 hours)
- Ensure security measures are in place for longer-lived tokens
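As a rough sketch of what the setting change means for token lifetime: only `ACCESS_TOKEN_EXPIRE_MINUTES` comes from the proposal. The `token_expiry` helper is hypothetical, and how the project actually builds the `exp` claim is not shown in the source.

```python
from datetime import datetime, timedelta, timezone

# New value proposed above: 24 hours * 60 minutes (previously 30).
ACCESS_TOKEN_EXPIRE_MINUTES = 1440


def token_expiry(issued_at: datetime) -> datetime:
    """Return the moment a token issued at `issued_at` stops being valid."""
    return issued_at + timedelta(minutes=ACCESS_TOKEN_EXPIRE_MINUTES)


issued = datetime(2025, 1, 1, 9, 0, tzinfo=timezone.utc)
expires = token_expiry(issued)  # 2025-01-02 09:00 UTC
```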

Impact Analysis

- Backend Services: Minimal changes to existing OCR processing flow
- Dependencies: New Python packages for Office document handling
- Performance: Slight increase in processing time for document conversion
- Security: Longer token validity requires careful consideration
- Storage: Temporary files during conversion process
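One way to keep the storage impact bounded is to run each conversion inside a scratch directory that is deleted as soon as the result has been read. The sketch below uses only the standard library. `convert_in_temp_dir` and the `convert` callback are hypothetical stand-ins for whichever conversion method the service ends up using.

```python
import tempfile
from pathlib import Path
from typing import Callable


def convert_in_temp_dir(
    source: Path,
    convert: Callable[[Path, Path], Path],
) -> bytes:
    """Run `convert(source, workdir)` in a scratch directory.

    `convert` must write its output inside `workdir` and return the
    output path. The directory and everything in it are removed when
    the `with` block exits, even if the conversion raises.
    """
    with tempfile.TemporaryDirectory(prefix="office_convert_") as workdir:
        output_path = convert(source, Path(workdir))
        return output_path.read_bytes()
```

Returning bytes keeps the example short. For large documents, the caller would more likely move the output file to permanent storage before the directory is cleaned up.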

Success Criteria

- Successfully process Word documents (.doc, .docx) with OCR
- Successfully process PowerPoint documents (.ppt, .pptx) with OCR
- JWT tokens remain valid for 24 hours
- All existing functionality continues to work
- Conversion quality maintains text readability for OCR

Timeline

- Implementation: 2-3 hours ✅
- Testing: 1 hour ✅
- Documentation: 30 mins ✅
- Total: ~4 hours ✅ COMPLETED

Actual Time

- Total development time: ~6 hours (including debugging and testing)
- Primary issues resolved: Configuration loading, MIME type mapping, validation logic, API endpoint fixes
