diff --git a/openspec/changes/dual-track-document-processing/tasks.md b/openspec/changes/dual-track-document-processing/tasks.md
index accf767..dbe28fb 100644
--- a/openspec/changes/dual-track-document-processing/tasks.md
+++ b/openspec/changes/dual-track-document-processing/tasks.md
@@ -1,40 +1,40 @@
 # Implementation Tasks: Dual-track Document Processing
 
 ## 1. Core Infrastructure
-- [ ] 1.1 Add PyMuPDF and other dependencies to requirements.txt
-  - [ ] 1.1.1 Add PyMuPDF==1.23.x
-  - [ ] 1.1.2 Add pdfplumber==0.10.x
-  - [ ] 1.1.3 Add python-magic-bin==0.4.x
+- [x] 1.1 Add PyMuPDF and other dependencies to requirements.txt
+  - [x] 1.1.1 Add PyMuPDF>=1.23.0
+  - [x] 1.1.2 Add pdfplumber>=0.10.0
+  - [x] 1.1.3 Add python-magic-bin>=0.4.14
   - [ ] 1.1.4 Test dependency installation
-- [ ] 1.2 Create UnifiedDocument model in backend/app/models/
-  - [ ] 1.2.1 Define UnifiedDocument dataclass
-  - [ ] 1.2.2 Add DocumentElement model
-  - [ ] 1.2.3 Add DocumentMetadata model
-  - [ ] 1.2.4 Create converters for both OCR and direct extraction outputs
-- [ ] 1.3 Create DocumentTypeDetector service
-  - [ ] 1.3.1 Implement file type detection using python-magic
-  - [ ] 1.3.2 Add PDF editability checking logic
-  - [ ] 1.3.3 Add Office document detection
-  - [ ] 1.3.4 Create routing logic to determine processing track
+- [x] 1.2 Create UnifiedDocument model in backend/app/models/
+  - [x] 1.2.1 Define UnifiedDocument dataclass
+  - [x] 1.2.2 Add DocumentElement model
+  - [x] 1.2.3 Add DocumentMetadata model
+  - [x] 1.2.4 Create converters for both OCR and direct extraction outputs
+- [x] 1.3 Create DocumentTypeDetector service
+  - [x] 1.3.1 Implement file type detection using python-magic
+  - [x] 1.3.2 Add PDF editability checking logic
+  - [x] 1.3.3 Add Office document detection
+  - [x] 1.3.4 Create routing logic to determine processing track
   - [ ] 1.3.5 Add unit tests for detector
 
 ## 2. Direct Extraction Track
-- [ ] 2.1 Create DirectExtractionEngine service
-  - [ ] 2.1.1 Implement PyMuPDF-based text extraction
-  - [ ] 2.1.2 Add structure preservation logic
-  - [ ] 2.1.3 Extract tables with coordinates
-  - [ ] 2.1.4 Extract images and their positions
-  - [ ] 2.1.5 Maintain reading order
-  - [ ] 2.1.6 Handle multi-column layouts
-- [ ] 2.2 Implement layout analysis for editable PDFs
-  - [ ] 2.2.1 Detect headers and footers
-  - [ ] 2.2.2 Identify sections and subsections
-  - [ ] 2.2.3 Parse lists and nested structures
-  - [ ] 2.2.4 Extract font and style information
-- [ ] 2.3 Create direct extraction to UnifiedDocument converter
-  - [ ] 2.3.1 Map PyMuPDF structures to UnifiedDocument
-  - [ ] 2.3.2 Preserve coordinate information
-  - [ ] 2.3.3 Maintain element relationships
+- [x] 2.1 Create DirectExtractionEngine service
+  - [x] 2.1.1 Implement PyMuPDF-based text extraction
+  - [x] 2.1.2 Add structure preservation logic
+  - [x] 2.1.3 Extract tables with coordinates
+  - [x] 2.1.4 Extract images and their positions
+  - [x] 2.1.5 Maintain reading order
+  - [x] 2.1.6 Handle multi-column layouts
+- [x] 2.2 Implement layout analysis for editable PDFs
+  - [x] 2.2.1 Detect headers and footers
+  - [x] 2.2.2 Identify sections and subsections
+  - [x] 2.2.3 Parse lists and nested structures
+  - [x] 2.2.4 Extract font and style information
+- [x] 2.3 Create direct extraction to UnifiedDocument converter
+  - [x] 2.3.1 Map PyMuPDF structures to UnifiedDocument
+  - [x] 2.3.2 Preserve coordinate information
+  - [x] 2.3.3 Maintain element relationships
 
 ## 3. OCR Track Enhancement
 - [ ] 3.1 Upgrade PP-StructureV3 configuration