feat: implement Office document direct extraction (Section 2.4)

- Update DocumentTypeDetector._analyze_office to convert Office to PDF first
- Analyze converted PDF for text extractability before routing
- Route text-based Office documents to direct track (10x faster)
- Update OCR service to convert Office files for DirectExtractionEngine
- Add unit tests for Office → PDF → Direct extraction flow
- Handle conversion failures with fallback to OCR track

This optimization reduces Office document processing from >300s to ~2-5s for text-based documents by avoiding unnecessary OCR processing.

🤖 Generated with [Claude Code](https://claude.com/claude-code)

Co-Authored-By: Claude <noreply@anthropic.com>
2025-11-20 12:20:50 +08:00
parent 0974fc3a54
commit ef335cf3af
4 changed files with 284 additions and 28 deletions
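
The routing described in the commit message (convert to PDF, check for extractable text, fall back to OCR on failure) could look roughly like the sketch below. This is an illustration only, not the code from this commit: `route_office_document`, `RoutingDecision`, `OFFICE_SUFFIXES`, and the two injected callables are hypothetical names, and the actual converter and text check used by `DocumentTypeDetector._analyze_office` are not shown in this diff.

```python
from dataclasses import dataclass
from pathlib import Path
from typing import Callable, Optional

# Hypothetical set of extensions treated as Office documents.
OFFICE_SUFFIXES = {".doc", ".docx", ".ppt", ".pptx", ".xls", ".xlsx"}


@dataclass
class RoutingDecision:
    track: str                            # "direct" or "ocr"
    reason: str                           # human-readable explanation
    converted_pdf: Optional[Path] = None  # set only if conversion succeeded


def route_office_document(
    path: Path,
    convert_to_pdf: Callable[[Path], Path],
    pdf_has_text_layer: Callable[[Path], bool],
) -> RoutingDecision:
    """Pick a processing track for an Office file.

    Both callables are injected so the sketch stays independent of any
    particular converter or PDF library (assumption, not the real API).
    """
    if path.suffix.lower() not in OFFICE_SUFFIXES:
        raise ValueError(f"not an Office document: {path.name}")

    # Step 1: convert to PDF; any failure falls back to the OCR track.
    try:
        pdf_path = convert_to_pdf(path)
    except Exception as exc:
        return RoutingDecision("ocr", f"conversion failed: {exc}")

    # Step 2: route on whether the converted PDF has extractable text.
    if pdf_has_text_layer(pdf_path):
        return RoutingDecision("direct", "converted PDF is text-based", pdf_path)
    return RoutingDecision("ocr", "converted PDF has no extractable text", pdf_path)
```

In a real service, `convert_to_pdf` and `pdf_has_text_layer` would wrap whatever converter and PDF library the project uses; injecting them also makes the Office → PDF → Direct flow easy to unit test with fakes.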
--- a/openspec/changes/dual-track-document-processing/tasks.md
+++ b/openspec/changes/dual-track-document-processing/tasks.md
@@ -36,12 +36,12 @@
  - [x] 2.3.1 Map PyMuPDF structures to UnifiedDocument
  - [x] 2.3.2 Preserve coordinate information
  - [x] 2.3.3 Maintain element relationships
-- [ ] 2.4 Add Office document direct extraction support
-  - [ ] 2.4.1 Update DocumentTypeDetector._analyze_office to convert to PDF first
-  - [ ] 2.4.2 Analyze converted PDF for text extractability
-  - [ ] 2.4.3 Route to direct track if PDF is text-based
-  - [ ] 2.4.4 Update OCR service to use DirectExtractionEngine for Office files
-  - [ ] 2.4.5 Add unit tests for Office → PDF → Direct flow
+- [x] 2.4 Add Office document direct extraction support
+  - [x] 2.4.1 Update DocumentTypeDetector._analyze_office to convert to PDF first
+  - [x] 2.4.2 Analyze converted PDF for text extractability
+  - [x] 2.4.3 Route to direct track if PDF is text-based
+  - [x] 2.4.4 Update OCR service to use DirectExtractionEngine for Office files
+  - [x] 2.4.5 Add unit tests for Office → PDF → Direct flow
  - Note: This optimization significantly improves Office document processing time (from >300s to ~2-5s)

 ## 3. OCR Track Enhancement