fix: resolve E2E test failures and add Office direct extraction design

- Fix MySQL connection timeout by creating fresh DB session after OCR
- Fix /analyze endpoint attribute errors (detect vs analyze, metadata)
- Add processing_track field extraction to TaskDetailResponse
- Update E2E tests to use POST for /analyze endpoint
- Increase Office document timeout to 300s
- Add Section 2.4 tasks for Office document direct extraction
- Document Office → PDF → Direct track strategy in design.md

🤖 Generated with [Claude Code](https://claude.com/claude-code)

Co-Authored-By: Claude <noreply@anthropic.com>
2025-11-20 12:13:18 +08:00
parent c50a5e9d2b
commit 0974fc3a54
7 changed files with 746 additions and 9 deletions
--- a/openspec/changes/dual-track-document-processing/tasks.md
+++ b/openspec/changes/dual-track-document-processing/tasks.md
@@ -36,6 +36,13 @@
  - [x] 2.3.1 Map PyMuPDF structures to UnifiedDocument
  - [x] 2.3.2 Preserve coordinate information
  - [x] 2.3.3 Maintain element relationships
+- [ ] 2.4 Add Office document direct extraction support
+  - [ ] 2.4.1 Update DocumentTypeDetector._analyze_office to convert to PDF first
+  - [ ] 2.4.2 Analyze converted PDF for text extractability
+  - [ ] 2.4.3 Route to direct track if PDF is text-based
+  - [ ] 2.4.4 Update OCR service to use DirectExtractionEngine for Office files
+  - [ ] 2.4.5 Add unit tests for Office → PDF → Direct flow
+  - Note: This optimization significantly improves Office document processing time (from >300s to ~2-5s)

 ## 3. OCR Track Enhancement
 - [x] 3.1 Upgrade PP-StructureV3 configuration
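
The routing described in tasks 2.4.1 to 2.4.3 (convert the Office file to PDF, check whether the PDF is text-based, then pick a track) could look roughly like the sketch below. This is not the project's implementation: `Track`, `RoutingDecision`, `choose_track`, `route_office_document`, and both threshold values are hypothetical names and numbers chosen for illustration, and the PDF conversion and per-page text counting are passed in as callables rather than tied to any specific library.

```python
from dataclasses import dataclass
from enum import Enum
from typing import Callable, Sequence


class Track(Enum):
    """Hypothetical processing tracks, mirroring the direct/OCR split."""
    DIRECT = "direct"
    OCR = "ocr"


@dataclass(frozen=True)
class RoutingDecision:
    track: Track
    text_page_ratio: float  # fraction of pages with enough extractable text


def choose_track(
    page_char_counts: Sequence[int],
    min_chars_per_page: int = 50,      # assumed threshold, not from the source
    min_text_page_ratio: float = 0.8,  # assumed threshold, not from the source
) -> RoutingDecision:
    """Pick a track from the extractable character count of each PDF page."""
    if not page_char_counts:
        # Nothing to extract from an empty document: fall back to OCR.
        return RoutingDecision(Track.OCR, 0.0)
    text_pages = sum(1 for n in page_char_counts if n >= min_chars_per_page)
    ratio = text_pages / len(page_char_counts)
    track = Track.DIRECT if ratio >= min_text_page_ratio else Track.OCR
    return RoutingDecision(track, ratio)


def route_office_document(
    office_path: str,
    convert_to_pdf: Callable[[str], str],
    count_chars_per_page: Callable[[str], Sequence[int]],
) -> RoutingDecision:
    """Office -> PDF -> track decision, with both steps injected by the caller."""
    pdf_path = convert_to_pdf(office_path)               # task 2.4.1
    return choose_track(count_chars_per_page(pdf_path))  # tasks 2.4.2 and 2.4.3


if __name__ == "__main__":
    # Stub converter and counter stand in for the real services.
    decision = route_office_document(
        "report.docx",
        convert_to_pdf=lambda path: path + ".pdf",
        count_chars_per_page=lambda pdf: [500, 700],
    )
    print(decision.track.value, decision.text_page_ratio)
```

Injecting the converter and the page counter keeps the decision logic testable without a real Office converter or PDF parser, which fits the unit tests called for in task 2.4.5.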