OCR/tasks.md at 7064ea30d5f1742709c8ad04432c6ea2aa88f3b9 - OCR - ZHAOI

egg/OCR

Files

egg ef335cf3af feat: implement Office document direct extraction (Section 2.4)

- Update DocumentTypeDetector._analyze_office to convert Office to PDF first
- Analyze converted PDF for text extractability before routing
- Route text-based Office documents to direct track (10x faster)
- Update OCR service to convert Office files for DirectExtractionEngine
- Add unit tests for Office → PDF → Direct extraction flow
- Handle conversion failures with fallback to OCR track

This optimization reduces Office document processing from >300s to ~2-5s
for text-based documents by avoiding unnecessary OCR processing.

🤖 Generated with [Claude Code](https://claude.com/claude-code)

Co-Authored-By: Claude <noreply@anthropic.com>

2025-11-20 12:20:50 +08:00

8.2 KiB

Raw Blame History