fix: resolve E2E test failures and add Office direct extraction design

- Fix MySQL connection timeout by creating fresh DB session after OCR
- Fix /analyze endpoint attribute errors (detect vs analyze, metadata)
- Add processing_track field extraction to TaskDetailResponse
- Update E2E tests to use POST for /analyze endpoint
- Increase Office document timeout to 300s
- Add Section 2.4 tasks for Office document direct extraction
- Document Office → PDF → Direct track strategy in design.md

🤖 Generated with [Claude Code](https://claude.com/claude-code)

Co-Authored-By: Claude <noreply@anthropic.com>
egg
2025-11-20 12:13:18 +08:00
parent c50a5e9d2b
commit 0974fc3a54
7 changed files with 746 additions and 9 deletions


@@ -118,11 +118,26 @@ def detect_track(file_path: Path) -> str:
         return "direct"
     if file_type in OFFICE_MIMES:
-        return "ocr"  # For now, may add direct Office support later
+        # Convert Office to PDF first, then analyze
+        pdf_path = convert_office_to_pdf(file_path)
+        return detect_track(pdf_path)  # Recursive call on PDF
     return "ocr"  # Default fallback
```
+
+**Office Document Processing Strategy**:
+1. Convert Office files (Word, PPT, Excel) to PDF using LibreOffice
+2. Analyze the resulting PDF for text extractability
+3. Route based on PDF analysis:
+   - Text-based PDF → Direct track (faster, more accurate)
+   - Image-based PDF → OCR track (for scanned content in Office docs)
+
+This approach ensures:
+- Consistent processing pipeline (all documents become PDF first)
+- Optimal routing based on actual content
+- Significant performance improvement for editable Office documents
+- Better layout preservation (no OCR errors on text content)
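
The routing steps above can be sketched as follows. This is a minimal illustration under stated assumptions, not the project's implementation: `route_document`, `OFFICE_SUFFIXES`, and the injected `pdf_has_text` callable are hypothetical names, and suffix matching stands in for the MIME check (`OFFICE_MIMES`) that `detect_track` uses.

```python
from pathlib import Path
from typing import Callable

# Hypothetical stand-in for OFFICE_MIMES: match on file suffix instead of MIME type.
OFFICE_SUFFIXES = {".doc", ".docx", ".ppt", ".pptx", ".xls", ".xlsx"}


def route_document(
    file_path: Path,
    convert_office_to_pdf: Callable[[Path], Path],
    pdf_has_text: Callable[[Path], bool],
) -> str:
    """Return "direct" or "ocr" for a document.

    Office files are converted to PDF first, so every document that can be
    analyzed is analyzed as a PDF. The converter and the text check are passed
    in so the routing logic can be exercised without LibreOffice installed.
    """
    suffix = file_path.suffix.lower()

    # Step 1: normalize Office documents to PDF.
    if suffix in OFFICE_SUFFIXES:
        file_path = convert_office_to_pdf(file_path)
        suffix = file_path.suffix.lower()

    # Steps 2 and 3: analyze the PDF and route on the result.
    if suffix == ".pdf":
        return "direct" if pdf_has_text(file_path) else "ocr"

    # Anything else (images, unknown types) falls back to OCR.
    return "ocr"
```

Passing the converter and analyzer as arguments also avoids the recursive call: the Office branch converts once and then falls through to the same PDF check.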
### Decision 5: GPU Memory Management
**What**: Implement dynamic batch sizing and model caching for RTX 4060 8GB
@@ -221,7 +236,13 @@ def get_model(model_type: str):
- A: No, adds complexity with minimal benefit. Document-level is sufficient.
- Q: How to handle Office documents?
-- A: OCR track initially, consider python-docx/openpyxl later if needed.
+- A: Convert to PDF using LibreOffice, then analyze the PDF for text extractability.
+  - Text-based PDF → Direct track (editable Office docs produce text PDFs)
+  - Image-based PDF → OCR track (rare case of scanned content in Office)
+  - This approach provides:
+    - 10x+ faster processing for typical Office documents
+    - Better layout preservation (no OCR errors)
+    - Consistent pipeline (all documents normalized to PDF first)
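
One way the LibreOffice conversion step could look is sketched below. It assumes the `soffice` binary is on `PATH` and uses LibreOffice's headless `--convert-to pdf` mode. The helper `build_convert_command`, the two-argument signature, and the 300-second default (borrowed from the timeout mentioned in the commit message) are illustrative assumptions, not the project's actual `convert_office_to_pdf`.

```python
from __future__ import annotations

import shutil
import subprocess
from pathlib import Path


def build_convert_command(src: Path, out_dir: Path, binary: str = "soffice") -> list[str]:
    """Build the headless LibreOffice command that writes <stem>.pdf into out_dir."""
    return [
        binary,
        "--headless",
        "--convert-to", "pdf",
        "--outdir", str(out_dir),
        str(src),
    ]


def convert_office_to_pdf(src: Path, out_dir: Path, timeout_s: int = 300) -> Path:
    """Convert an Office document to PDF and return the path of the PDF.

    Raises RuntimeError if LibreOffice is missing or no PDF was produced;
    subprocess errors (non-zero exit, timeout) propagate to the caller.
    """
    if shutil.which("soffice") is None:
        raise RuntimeError("LibreOffice (soffice) was not found on PATH")

    out_dir.mkdir(parents=True, exist_ok=True)
    subprocess.run(
        build_convert_command(src, out_dir),
        check=True,            # raise CalledProcessError on a non-zero exit code
        timeout=timeout_s,     # raise TimeoutExpired if the conversion hangs
        capture_output=True,
    )

    pdf_path = out_dir / f"{src.stem}.pdf"
    if not pdf_path.exists():
        raise RuntimeError(f"Conversion finished but {pdf_path} was not created")
    return pdf_path
```

Checking that the output file exists is deliberate: a zero exit code alone may not guarantee that a PDF was written.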
### Pending
- Q: What translation services to integrate with?