fix: resolve E2E test failures and add Office direct extraction design

- Fix MySQL connection timeout by creating fresh DB session after OCR
- Fix /analyze endpoint attribute errors (detect vs analyze, metadata)
- Add processing_track field extraction to TaskDetailResponse
- Update E2E tests to use POST for /analyze endpoint
- Increase Office document timeout to 300s
- Add Section 2.4 tasks for Office document direct extraction
- Document Office → PDF → Direct track strategy in design.md

🤖 Generated with [Claude Code](https://claude.com/claude-code)

Co-Authored-By: Claude <noreply@anthropic.com>
@@ -36,6 +36,13 @@
- [x] 2.3.1 Map PyMuPDF structures to UnifiedDocument
- [x] 2.3.2 Preserve coordinate information
- [x] 2.3.3 Maintain element relationships
- [ ] 2.4 Add Office document direct extraction support
- [ ] 2.4.1 Update DocumentTypeDetector._analyze_office to convert to PDF first
- [ ] 2.4.2 Analyze converted PDF for text extractability
- [ ] 2.4.3 Route to direct track if PDF is text-based
- [ ] 2.4.4 Update OCR service to use DirectExtractionEngine for Office files
- [ ] 2.4.5 Add unit tests for Office → PDF → Direct flow
- Note: This optimization cuts Office document processing time from >300s to ~2-5s
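
The routing decision in tasks 2.4.2 and 2.4.3 can be sketched as follows. This is a minimal illustration, not the project's implementation: `choose_track`, `TrackDecision`, the extension set, and both thresholds are hypothetical names and values chosen for the example. It assumes the Office file has already been converted to PDF and that a per-page count of extractable characters has been collected from that PDF (for instance with PyMuPDF); neither step is shown here.

```python
from dataclasses import dataclass
from pathlib import Path
from typing import Sequence

# Assumption: extensions treated as Office documents that need PDF conversion first.
OFFICE_EXTENSIONS = {".doc", ".docx", ".xls", ".xlsx", ".ppt", ".pptx"}


@dataclass(frozen=True)
class TrackDecision:
    track: str                  # "direct" or "ocr"
    needs_pdf_conversion: bool  # True for Office inputs
    text_page_ratio: float      # share of pages with enough extractable text


def choose_track(
    filename: str,
    page_char_counts: Sequence[int],
    min_chars_per_page: int = 50,       # hypothetical threshold
    min_text_page_ratio: float = 0.8,   # hypothetical threshold
) -> TrackDecision:
    """Pick a processing track from per-page extractable character counts."""
    needs_conversion = Path(filename).suffix.lower() in OFFICE_EXTENSIONS

    # No pages analyzed: fall back to OCR rather than guess.
    if not page_char_counts:
        return TrackDecision("ocr", needs_conversion, 0.0)

    # A page counts as text-based if it carries enough extractable characters.
    text_pages = sum(1 for n in page_char_counts if n >= min_chars_per_page)
    ratio = text_pages / len(page_char_counts)

    track = "direct" if ratio >= min_text_page_ratio else "ocr"
    return TrackDecision(track, needs_conversion, ratio)
```

For example, `choose_track("report.docx", [1200, 900, 800])` selects the direct track and flags the file for PDF conversion, while a scanned PDF whose pages yield no text falls back to OCR.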
## 3. OCR Track Enhancement
- [x] 3.1 Upgrade PP-StructureV3 configuration