feat: implement Office document direct extraction (Section 2.4)

- Update DocumentTypeDetector._analyze_office to convert Office to PDF first
- Analyze converted PDF for text extractability before routing
- Route text-based Office documents to the direct track (at least 60x faster)
- Update OCR service to convert Office files for DirectExtractionEngine
- Add unit tests for Office → PDF → Direct extraction flow
- Handle conversion failures with fallback to OCR track

This optimization reduces Office document processing from >300s to ~2-5s
for text-based documents by avoiding unnecessary OCR processing.
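
For reference, the routing flow described above can be sketched as follows. This is a minimal illustration, not the code in this commit: `RoutingDecision`, `route_office_document`, `OFFICE_SUFFIXES`, and the two injected callables (`convert_to_pdf`, `is_text_extractable`) are hypothetical names chosen for the example. The real logic lives in `DocumentTypeDetector._analyze_office` and the OCR service, whose signatures are not shown here.

```python
from dataclasses import dataclass
from pathlib import Path
from typing import Callable, Optional

# Assumed set of Office extensions; the actual detector may accept others.
OFFICE_SUFFIXES = {".doc", ".docx", ".xls", ".xlsx", ".ppt", ".pptx"}


@dataclass
class RoutingDecision:
    track: str                        # "direct" or "ocr"
    reason: str                       # human-readable explanation
    pdf_path: Optional[Path] = None   # converted PDF, if conversion succeeded


def route_office_document(
    path: Path,
    convert_to_pdf: Callable[[Path], Path],
    is_text_extractable: Callable[[Path], bool],
) -> RoutingDecision:
    """Convert an Office file to PDF, then choose a processing track."""
    if path.suffix.lower() not in OFFICE_SUFFIXES:
        raise ValueError(f"not an Office document: {path.name}")

    # Step 1: Office -> PDF. Any conversion failure falls back to OCR.
    try:
        pdf_path = convert_to_pdf(path)
    except Exception as exc:
        return RoutingDecision("ocr", f"conversion failed: {exc}")

    # Step 2: route on whether the converted PDF carries a text layer.
    if is_text_extractable(pdf_path):
        return RoutingDecision("direct", "converted PDF is text-based", pdf_path)
    return RoutingDecision("ocr", "converted PDF is not text-based", pdf_path)


if __name__ == "__main__":
    # Stub callables stand in for the real converter and PDF analyzer.
    decision = route_office_document(
        Path("report.docx"),
        convert_to_pdf=lambda p: p.with_suffix(".pdf"),
        is_text_extractable=lambda p: True,
    )
    print(decision.track, "-", decision.reason)
```

Passing the converter and the analyzer in as callables keeps the sketch free of external dependencies and makes the OCR fallback easy to exercise in unit tests with stubs.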

🤖 Generated with [Claude Code](https://claude.com/claude-code)

Co-Authored-By: Claude <noreply@anthropic.com>
egg
2025-11-20 12:20:50 +08:00
parent 0974fc3a54
commit ef335cf3af
4 changed files with 284 additions and 28 deletions

@@ -36,12 +36,12 @@
 - [x] 2.3.1 Map PyMuPDF structures to UnifiedDocument
 - [x] 2.3.2 Preserve coordinate information
 - [x] 2.3.3 Maintain element relationships
-- [ ] 2.4 Add Office document direct extraction support
-- [ ] 2.4.1 Update DocumentTypeDetector._analyze_office to convert to PDF first
-- [ ] 2.4.2 Analyze converted PDF for text extractability
-- [ ] 2.4.3 Route to direct track if PDF is text-based
-- [ ] 2.4.4 Update OCR service to use DirectExtractionEngine for Office files
-- [ ] 2.4.5 Add unit tests for Office → PDF → Direct flow
+- [x] 2.4 Add Office document direct extraction support
+- [x] 2.4.1 Update DocumentTypeDetector._analyze_office to convert to PDF first
+- [x] 2.4.2 Analyze converted PDF for text extractability
+- [x] 2.4.3 Route to direct track if PDF is text-based
+- [x] 2.4.4 Update OCR service to use DirectExtractionEngine for Office files
+- [x] 2.4.5 Add unit tests for Office → PDF → Direct flow
 - Note: This optimization significantly improves Office document processing time (from >300s to ~2-5s)
 ## 3. OCR Track Enhancement