OCR/services at 3358d97624b3e4da162373db9dc68afebfbe0427 - OCR

egg ef335cf3af feat: implement Office document direct extraction (Section 2.4)

- Update DocumentTypeDetector._analyze_office to convert Office files to PDF first
- Analyze converted PDF for text extractability before routing
- Route text-based Office documents to direct track (10x faster)
- Update OCR service to convert Office files for DirectExtractionEngine
- Add unit tests for Office → PDF → Direct extraction flow
- Handle conversion failures with fallback to OCR track

This optimization reduces processing time for text-based Office documents
from >300s to ~2-5s by skipping unnecessary OCR.

🤖 Generated with [Claude Code](https://claude.com/claude-code)

Co-Authored-By: Claude <noreply@anthropic.com>

2025-11-20 12:20:50 +08:00
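
The routing described in the commit above (convert Office to PDF, check the PDF for extractable text, fall back to the OCR track on failure) could look roughly like the sketch below. It is illustrative only and is not taken from the repository: route_office_document, RoutingDecision, OFFICE_EXTENSIONS, and the two injected callables (convert_to_pdf, is_text_extractable) are hypothetical names, and the real DocumentTypeDetector._analyze_office may be structured differently.

```python
from dataclasses import dataclass
from pathlib import Path
from typing import Callable, Optional

# Hypothetical set of Office extensions; the repository's actual list is not shown.
OFFICE_EXTENSIONS = {".doc", ".docx", ".xls", ".xlsx", ".ppt", ".pptx"}


@dataclass
class RoutingDecision:
    """Result of routing a document (hypothetical structure)."""
    track: str                 # "direct" or "ocr"
    pdf_path: Optional[Path]   # converted PDF, or None if conversion failed
    reason: str


def route_office_document(
    path: Path,
    convert_to_pdf: Callable[[Path], Path],
    is_text_extractable: Callable[[Path], bool],
) -> RoutingDecision:
    """Pick a processing track for an Office file.

    Both callables are injected so the sketch stays independent of any
    particular converter or PDF library (assumption, not the repo's API).
    """
    if path.suffix.lower() not in OFFICE_EXTENSIONS:
        raise ValueError(f"not an Office document: {path}")

    # Step 1: convert to PDF; any conversion failure falls back to OCR.
    try:
        pdf_path = convert_to_pdf(path)
    except Exception as exc:
        return RoutingDecision("ocr", None, f"conversion failed: {exc}")

    # Step 2: analyze the converted PDF before routing.
    if is_text_extractable(pdf_path):
        return RoutingDecision("direct", pdf_path, "converted PDF has extractable text")
    return RoutingDecision("ocr", pdf_path, "converted PDF has no extractable text")
```

Passing the converter and the text check in as parameters also makes the fallback path easy to unit test with stubs, which matches the commit's mention of tests for the Office → PDF → Direct extraction flow.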

__init__.py

test: add unit tests for DocumentTypeDetector

2025-11-19 12:16:49 +08:00

test_direct_extraction_engine.py

test: add unit and integration tests for dual-track processing

2025-11-19 12:50:44 +08:00

test_document_type_detector.py

feat: implement Office document direct extraction (Section 2.4)

2025-11-20 12:20:50 +08:00

test_dual_track_integration.py

test: add unit and integration tests for dual-track processing

2025-11-19 12:50:44 +08:00