feat: implement core dual-track processing infrastructure

Added foundation for dual-track document processing:

1. UnifiedDocument Model (backend/app/models/unified_document.py)
   - Common output format for both OCR and direct extraction
   - Comprehensive element types (23+ types from PP-StructureV3)
   - BoundingBox, StyleInfo, TableData structures
   - Backward compatibility with legacy format

2. DocumentTypeDetector Service (backend/app/services/document_type_detector.py)
   - Intelligent document type detection using python-magic
   - PDF editability analysis using PyMuPDF
   - Processing track recommendation with confidence scores
   - Support for PDF, images, Office docs, and text files

3. DirectExtractionEngine Service (backend/app/services/direct_extraction_engine.py)
   - Fast extraction from editable PDFs using PyMuPDF
   - Preserves fonts, colors, and exact positioning
   - Native and positional table detection
   - Image extraction with coordinates
   - Hyperlink and metadata extraction

4. Dependencies
   - Added PyMuPDF>=1.23.0 for PDF extraction
   - Added pdfplumber>=0.10.0 as fallback
   - Added python-magic-bin>=0.4.14 for file detection

Next: Integrate with OCR service for complete dual-track processing

🤖 Generated with [Claude Code](https://claude.com/claude-code)

Co-Authored-By: Claude <noreply@anthropic.com>
2025-11-18 20:17:50 +08:00
parent cd3cbea49d
commit 2d50c128f7
4 changed files with 1729 additions and 0 deletions
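
The "processing track recommendation with confidence scores" mentioned in the commit message could work roughly as sketched below. This is not the code in document_type_detector.py, which is not shown here; `TrackRecommendation`, `recommend_track`, and both thresholds are hypothetical names and values chosen for illustration. The sketch assumes the caller has already counted the extractable text characters on each page (with PyMuPDF, one plausible source is `len(page.get_text())`).

```python
from dataclasses import dataclass
from typing import Sequence


@dataclass
class TrackRecommendation:
    """Hypothetical result type: which track to use and how sure we are."""
    track: str         # "direct" (editable PDF) or "ocr" (scanned/image content)
    confidence: float  # 0.0 to 1.0
    reason: str


def recommend_track(
    chars_per_page: Sequence[int],
    min_chars: int = 50,       # assumed: a page with fewer characters counts as image-only
    threshold: float = 0.8,    # assumed: share of text pages needed for the direct track
) -> TrackRecommendation:
    """Recommend a processing track from per-page extractable character counts."""
    if not chars_per_page:
        # Nothing to inspect: fall back to OCR with neutral confidence.
        return TrackRecommendation("ocr", 0.5, "no pages to inspect")

    text_pages = sum(1 for n in chars_per_page if n >= min_chars)
    ratio = text_pages / len(chars_per_page)

    if ratio >= threshold:
        return TrackRecommendation(
            "direct", ratio, f"{text_pages}/{len(chars_per_page)} pages have a text layer"
        )
    # The fewer pages carry text, the more confident the OCR recommendation.
    return TrackRecommendation(
        "ocr", 1.0 - ratio, f"only {text_pages}/{len(chars_per_page)} pages have a text layer"
    )


if __name__ == "__main__":
    print(recommend_track([1200, 900, 1100]))  # fully editable PDF
    print(recommend_track([0, 0, 10]))         # scanned PDF
```

Mixed documents (some scanned pages, some editable) land on the OCR track with a low confidence value, which a caller could use as a signal to process pages individually.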
--- a/requirements.txt
+++ b/requirements.txt
@@ -25,6 +25,11 @@ reportlab>=4.0.0  # Layout-preserving PDF generation with precise coordinate con
 PyPDF2>=3.0.0  # Extract dimensions from source PDF files
 # Note: pandoc needs to be installed via brew (brew install pandoc)

+# ===== Direct PDF Extraction (Dual-track Processing) =====
+PyMuPDF>=1.23.0  # Primary library for editable PDF text/structure extraction
+pdfplumber>=0.10.0  # Fallback for table extraction and validation
+python-magic-bin>=0.4.14  # Windows-compatible file type detection
+
 # ===== Data Export =====
 pandas>=2.1.0
 openpyxl>=3.1.0  # Excel support