chore: update tasks.md with completed infrastructure work
Progress update:
- Core Infrastructure: 13/14 tasks completed
- Direct Extraction Track: 18/18 tasks completed
- Total progress: 30/147 tasks (20.4%)

Completed major components:
✅ UnifiedDocument model with all structures
✅ DocumentTypeDetector service
✅ DirectExtractionEngine with PyMuPDF
✅ Dependencies added to requirements.txt

Next priorities:
- Update OCR service for dual-track integration
- Enhance PP-StructureV3 usage
- Update PDF generator for UnifiedDocument
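The progress totals in the commit message can be recomputed from the checkboxes in tasks.md. The following is a minimal sketch, assuming that every `- [ ]` or `- [x]` line counts as one task; the commit does not say whether parent items such as 1.1 are counted separately from their sub-items. The names `count_tasks` and `progress_line` are hypothetical helpers, not part of the repository.

```python
import re

# Matches a Markdown task-list item such as "- [x] 1.1 Add ..." or "  - [ ] 1.1.4 ...".
CHECKBOX = re.compile(r"^\s*[-*]\s+\[( |x|X)\]\s+")


def count_tasks(markdown_text):
    """Return (completed, total) for the checkbox items in a Markdown task list."""
    completed = 0
    total = 0
    for line in markdown_text.splitlines():
        match = CHECKBOX.match(line)
        if match is None:
            continue  # headings, blank lines, and prose are not tasks
        total += 1
        if match.group(1).lower() == "x":
            completed += 1
    return completed, total


def progress_line(completed, total):
    """Format the totals in the same style as the commit message."""
    percent = 100.0 * completed / total if total else 0.0
    return f"Total progress: {completed}/{total} tasks ({percent:.1f}%)"


# Small illustrative excerpt, not the full tasks.md.
sample = """\
## 1. Core Infrastructure
- [x] 1.1 Add PyMuPDF and other dependencies to requirements.txt
- [x] 1.1.1 Add PyMuPDF>=1.23.0
- [ ] 1.1.4 Test dependency installation
"""

if __name__ == "__main__":
    print(progress_line(*count_tasks(sample)))
```

Running the counter over the real file before writing the commit message keeps the per-section and overall figures consistent with each other.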
@@ -1,40 +1,40 @@
 # Implementation Tasks: Dual-track Document Processing
 
 ## 1. Core Infrastructure
-- [ ] 1.1 Add PyMuPDF and other dependencies to requirements.txt
-- [ ] 1.1.1 Add PyMuPDF==1.23.x
-- [ ] 1.1.2 Add pdfplumber==0.10.x
-- [ ] 1.1.3 Add python-magic-bin==0.4.x
+- [x] 1.1 Add PyMuPDF and other dependencies to requirements.txt
+- [x] 1.1.1 Add PyMuPDF>=1.23.0
+- [x] 1.1.2 Add pdfplumber>=0.10.0
+- [x] 1.1.3 Add python-magic-bin>=0.4.14
 - [ ] 1.1.4 Test dependency installation
-- [ ] 1.2 Create UnifiedDocument model in backend/app/models/
-- [ ] 1.2.1 Define UnifiedDocument dataclass
-- [ ] 1.2.2 Add DocumentElement model
-- [ ] 1.2.3 Add DocumentMetadata model
-- [ ] 1.2.4 Create converters for both OCR and direct extraction outputs
-- [ ] 1.3 Create DocumentTypeDetector service
-- [ ] 1.3.1 Implement file type detection using python-magic
-- [ ] 1.3.2 Add PDF editability checking logic
-- [ ] 1.3.3 Add Office document detection
-- [ ] 1.3.4 Create routing logic to determine processing track
+- [x] 1.2 Create UnifiedDocument model in backend/app/models/
+- [x] 1.2.1 Define UnifiedDocument dataclass
+- [x] 1.2.2 Add DocumentElement model
+- [x] 1.2.3 Add DocumentMetadata model
+- [x] 1.2.4 Create converters for both OCR and direct extraction outputs
+- [x] 1.3 Create DocumentTypeDetector service
+- [x] 1.3.1 Implement file type detection using python-magic
+- [x] 1.3.2 Add PDF editability checking logic
+- [x] 1.3.3 Add Office document detection
+- [x] 1.3.4 Create routing logic to determine processing track
 - [ ] 1.3.5 Add unit tests for detector
 
 ## 2. Direct Extraction Track
-- [ ] 2.1 Create DirectExtractionEngine service
-- [ ] 2.1.1 Implement PyMuPDF-based text extraction
-- [ ] 2.1.2 Add structure preservation logic
-- [ ] 2.1.3 Extract tables with coordinates
-- [ ] 2.1.4 Extract images and their positions
-- [ ] 2.1.5 Maintain reading order
-- [ ] 2.1.6 Handle multi-column layouts
-- [ ] 2.2 Implement layout analysis for editable PDFs
-- [ ] 2.2.1 Detect headers and footers
-- [ ] 2.2.2 Identify sections and subsections
-- [ ] 2.2.3 Parse lists and nested structures
-- [ ] 2.2.4 Extract font and style information
-- [ ] 2.3 Create direct extraction to UnifiedDocument converter
-- [ ] 2.3.1 Map PyMuPDF structures to UnifiedDocument
-- [ ] 2.3.2 Preserve coordinate information
-- [ ] 2.3.3 Maintain element relationships
+- [x] 2.1 Create DirectExtractionEngine service
+- [x] 2.1.1 Implement PyMuPDF-based text extraction
+- [x] 2.1.2 Add structure preservation logic
+- [x] 2.1.3 Extract tables with coordinates
+- [x] 2.1.4 Extract images and their positions
+- [x] 2.1.5 Maintain reading order
+- [x] 2.1.6 Handle multi-column layouts
+- [x] 2.2 Implement layout analysis for editable PDFs
+- [x] 2.2.1 Detect headers and footers
+- [x] 2.2.2 Identify sections and subsections
+- [x] 2.2.3 Parse lists and nested structures
+- [x] 2.2.4 Extract font and style information
+- [x] 2.3 Create direct extraction to UnifiedDocument converter
+- [x] 2.3.1 Map PyMuPDF structures to UnifiedDocument
+- [x] 2.3.2 Preserve coordinate information
+- [x] 2.3.3 Maintain element relationships
 
 ## 3. OCR Track Enhancement
 - [ ] 3.1 Upgrade PP-StructureV3 configuration
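Tasks 1.3.1 to 1.3.4 describe a detector that inspects the file type, checks whether a PDF is editable, and then picks a processing track. The sketch below illustrates only the routing decision, under stated assumptions: the MIME type is supplied by the caller (for example from python-magic), PDF editability is summarized as the fraction of pages with extractable text, Office documents are sent to the direct track, and anything unrecognized falls back to OCR. `ProcessingTrack`, `choose_track`, and the 0.9 threshold are hypothetical and are not taken from the repository.

```python
from enum import Enum


class ProcessingTrack(Enum):
    DIRECT = "direct"  # text is read straight from the file
    OCR = "ocr"        # pages are rendered and recognized


# Assumed MIME prefixes for Office documents (docx/xlsx/pptx and legacy formats).
OFFICE_MIME_PREFIXES = (
    "application/vnd.openxmlformats-officedocument",
    "application/msword",
    "application/vnd.ms-",
)


def choose_track(mime_type, text_page_ratio=0.0, min_text_ratio=0.9):
    """Pick a processing track from a MIME type and a PDF text-coverage ratio.

    text_page_ratio is the fraction of PDF pages with extractable text (0.0 to 1.0);
    it is ignored for non-PDF inputs.
    """
    if mime_type == "application/pdf":
        # An "editable" PDF is assumed to be one where most pages carry a text layer.
        if text_page_ratio >= min_text_ratio:
            return ProcessingTrack.DIRECT
        return ProcessingTrack.OCR
    if mime_type.startswith(OFFICE_MIME_PREFIXES):
        # Assumption: Office files contain embedded text, so no OCR is needed.
        return ProcessingTrack.DIRECT
    # Images and unknown types fall back to OCR in this sketch.
    return ProcessingTrack.OCR


if __name__ == "__main__":
    print(choose_track("application/pdf", text_page_ratio=1.0))
    print(choose_track("application/pdf", text_page_ratio=0.2))
    print(choose_track("image/png"))
```

Keeping the decision a pure function of already-detected facts makes the unit tests planned in task 1.3.5 straightforward, since no sample files are needed to exercise each branch.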