feat: implement Office document direct extraction (Section 2.4)
- Update DocumentTypeDetector._analyze_office to convert Office to PDF first - Analyze converted PDF for text extractability before routing - Route text-based Office documents to direct track (10x faster) - Update OCR service to convert Office files for DirectExtractionEngine - Add unit tests for Office → PDF → Direct extraction flow - Handle conversion failures with fallback to OCR track This optimization reduces Office document processing from >300s to ~2-5s for text-based documents by avoiding unnecessary OCR processing. 🤖 Generated with [Claude Code](https://claude.com/claude-code) Co-Authored-By: Claude <noreply@anthropic.com>
This commit is contained in:
@@ -36,12 +36,12 @@
|
||||
- [x] 2.3.1 Map PyMuPDF structures to UnifiedDocument
|
||||
- [x] 2.3.2 Preserve coordinate information
|
||||
- [x] 2.3.3 Maintain element relationships
|
||||
- [ ] 2.4 Add Office document direct extraction support
|
||||
- [ ] 2.4.1 Update DocumentTypeDetector._analyze_office to convert to PDF first
|
||||
- [ ] 2.4.2 Analyze converted PDF for text extractability
|
||||
- [ ] 2.4.3 Route to direct track if PDF is text-based
|
||||
- [ ] 2.4.4 Update OCR service to use DirectExtractionEngine for Office files
|
||||
- [ ] 2.4.5 Add unit tests for Office → PDF → Direct flow
|
||||
- [x] 2.4 Add Office document direct extraction support
|
||||
- [x] 2.4.1 Update DocumentTypeDetector._analyze_office to convert to PDF first
|
||||
- [x] 2.4.2 Analyze converted PDF for text extractability
|
||||
- [x] 2.4.3 Route to direct track if PDF is text-based
|
||||
- [x] 2.4.4 Update OCR service to use DirectExtractionEngine for Office files
|
||||
- [x] 2.4.5 Add unit tests for Office → PDF → Direct flow
|
||||
- Note: This optimization significantly improves Office document processing time (from >300s to ~2-5s)
|
||||
|
||||
## 3. OCR Track Enhancement
|
||||
|
||||
Reference in New Issue
Block a user