GPU Optimization (Section 3.1):
- Add comprehensive memory management for RTX 4060 8GB
- Enable all recognition features (chart, formula, table, seal, text)
- Implement model cache with auto-unload for idle models
- Add memory monitoring and warning system
Bug Fix (Section 3.3):
- Fix TableData field inconsistency: 'columns' -> 'cols'
- Remove invalid 'html' and 'extracted_text' parameters
- Add proper TableCell conversion in _convert_table_data
Documentation:
- Add Future Improvements section for batch processing enhancement
🤖 Generated with [Claude Code](https://claude.com/claude-code)
Co-Authored-By: Claude <noreply@anthropic.com>
Implementation Tasks: Dual-track Document Processing
1. Core Infrastructure
- 1.1 Add PyMuPDF and other dependencies to requirements.txt
- 1.1.1 Add PyMuPDF>=1.23.0
- 1.1.2 Add pdfplumber>=0.10.0
- 1.1.3 Add python-magic-bin>=0.4.14
- 1.1.4 Test dependency installation
- 1.2 Create UnifiedDocument model in backend/app/models/
- 1.2.1 Define UnifiedDocument dataclass
- 1.2.2 Add DocumentElement model
- 1.2.3 Add DocumentMetadata model
- 1.2.4 Create converters for both OCR and direct extraction outputs
- Note: OCR converter complete; DirectExtractionEngine returns UnifiedDocument directly
- 1.3 Create DocumentTypeDetector service
- 1.3.1 Implement file type detection using python-magic
- 1.3.2 Add PDF editability checking logic
- 1.3.3 Add Office document detection
- 1.3.4 Create routing logic to determine processing track
- 1.3.5 Add unit tests for detector
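As a rough illustration of tasks 1.2.1 to 1.2.3, the three models could be plain dataclasses. This is only a sketch: every field name below (`element_id`, `element_type`, `bbox`, `processing_track`, `extra`, and so on) is an assumption made for the example, not the schema actually used in backend/app/models/.

```python
from dataclasses import asdict, dataclass, field
from typing import Any, Dict, List, Tuple

# Hypothetical bounding-box convention: (x0, y0, x1, y1).
BBox = Tuple[float, float, float, float]


@dataclass
class DocumentMetadata:
    filename: str
    processing_track: str  # assumed values: "ocr" or "direct"
    page_count: int = 0
    extra: Dict[str, Any] = field(default_factory=dict)


@dataclass
class DocumentElement:
    element_id: str
    element_type: str  # e.g. "text", "table", "image" (illustrative)
    page: int
    bbox: BBox
    content: str = ""
    # Nested elements, e.g. cells inside a table.
    children: List["DocumentElement"] = field(default_factory=list)


@dataclass
class UnifiedDocument:
    metadata: DocumentMetadata
    elements: List[DocumentElement] = field(default_factory=list)

    def to_dict(self) -> Dict[str, Any]:
        """Recursively convert the document into plain dicts and lists."""
        return asdict(self)
```

Because both tracks would produce the same structure, downstream consumers such as the JSON export and the PDF generator only need to understand one shape.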
2. Direct Extraction Track
- 2.1 Create DirectExtractionEngine service
- 2.1.1 Implement PyMuPDF-based text extraction
- 2.1.2 Add structure preservation logic
- 2.1.3 Extract tables with coordinates
- 2.1.4 Extract images and their positions
- 2.1.5 Maintain reading order
- 2.1.6 Handle multi-column layouts
- 2.2 Implement layout analysis for editable PDFs
- 2.2.1 Detect headers and footers
- 2.2.2 Identify sections and subsections
- 2.2.3 Parse lists and nested structures
- 2.2.4 Extract font and style information
- 2.3 Create direct extraction to UnifiedDocument converter
- 2.3.1 Map PyMuPDF structures to UnifiedDocument
- 2.3.2 Preserve coordinate information
- 2.3.3 Maintain element relationships
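Tasks 2.1.5 and 2.1.6 (reading order and multi-column layouts) can be approximated with a simple column-bucketing heuristic. The sketch below assumes each block is a tuple `(x0, y0, x1, y1, text)` with a top-left origin, similar in spirit to what PyMuPDF's block-level text extraction yields; the function name `sort_reading_order` and the fixed column count are hypothetical, and real layouts would need a more robust approach.

```python
from typing import List, Tuple

# Assumed block shape: (x0, y0, x1, y1, text), origin at the top-left.
Block = Tuple[float, float, float, float, str]


def sort_reading_order(
    blocks: List[Block], page_width: float, columns: int = 2
) -> List[Block]:
    """Order blocks column by column, top to bottom within each column."""
    col_width = page_width / columns

    def key(block: Block) -> Tuple[int, float, float]:
        x0, y0, x1, _y1, _text = block
        center_x = (x0 + x1) / 2
        # Clamp so a block touching the right edge stays in the last column.
        col = min(int(center_x // col_width), columns - 1)
        return (col, y0, x0)

    return sorted(blocks, key=key)
```

One known limitation of this heuristic: a block that spans the full page width (a title, for instance) is assigned to whichever column contains its center, so such blocks would need to be detected and handled separately.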
3. OCR Track Enhancement
- 3.1 Upgrade PP-StructureV3 configuration
- 3.1.1 Update config for RTX 4060 8GB optimization
- 3.1.2 Enable batch processing for GPU efficiency
- 3.1.3 Configure memory management settings
- 3.1.4 Set up model caching
- 3.2 Enhance OCR service to use parsing_res_list
- 3.2.1 Replace markdown extraction with parsing_res_list
- 3.2.2 Extract all 23 element types
- 3.2.3 Preserve bbox coordinates from PP-StructureV3
- 3.2.4 Maintain reading order information
- 3.3 Create OCR to UnifiedDocument converter
- 3.3.1 Map PP-StructureV3 elements to UnifiedDocument
- 3.3.2 Handle complex nested structures
- 3.3.3 Preserve all metadata
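For task 3.1.4 (model caching), one possible shape is a cache that records when each model was last used and drops entries that have been idle too long. The class below is a hypothetical sketch, not the project's implementation: `IdleModelCache`, `get`, and `evict_idle` are invented names, and the injectable `clock` exists only to make the behavior easy to test.

```python
import time
from typing import Any, Callable, Dict, List, Tuple


class IdleModelCache:
    """Keep loaded models by name and unload those idle beyond a timeout."""

    def __init__(
        self, idle_seconds: float, clock: Callable[[], float] = time.monotonic
    ) -> None:
        self._idle_seconds = idle_seconds
        self._clock = clock
        # name -> (model object, time of last use)
        self._models: Dict[str, Tuple[Any, float]] = {}

    def get(self, name: str, loader: Callable[[], Any]) -> Any:
        """Return the cached model, loading it on first use."""
        entry = self._models.get(name)
        model = entry[0] if entry is not None else loader()
        self._models[name] = (model, self._clock())  # refresh last-use time
        return model

    def evict_idle(self) -> List[str]:
        """Drop models not used within the timeout; return their names."""
        now = self._clock()
        stale = [
            name
            for name, (_model, last_used) in self._models.items()
            if now - last_used > self._idle_seconds
        ]
        for name in stale:
            del self._models[name]
        return stale
```

Dropping the Python reference alone does not necessarily return GPU memory to the device; a real implementation would likely also need to call whatever cache-clearing routine the inference framework provides.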
4. Unified Processing Pipeline
- 4.1 Update main OCR service for dual-track processing
- 4.1.1 Integrate DocumentTypeDetector
- 4.1.2 Route to appropriate processing engine
- 4.1.3 Return UnifiedDocument from both tracks
- 4.1.4 Maintain backward compatibility
- 4.2 Create unified JSON export
- 4.2.1 Define standardized JSON schema
- 4.2.2 Include processing metadata
- 4.2.3 Support both track outputs
- 4.3 Update PDF generator for UnifiedDocument
- 4.3.1 Adapt PDF generation to use UnifiedDocument
- 4.3.2 Preserve layout from both tracks
- 4.3.3 Handle coordinate transformations
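Task 4.3.3 (coordinate transformations) typically comes down to two steps when OCR boxes are measured in image pixels with a top-left origin and the PDF target uses points with a bottom-left origin: scale by 72/dpi, then flip the y axis. The helper below is a sketch under exactly that assumption; `image_bbox_to_pdf` is a hypothetical name, and the flip should be skipped if the PDF library in use already works with a top-left origin.

```python
from typing import Tuple

BBox = Tuple[float, float, float, float]


def image_bbox_to_pdf(bbox: BBox, image_height_px: float, dpi: float) -> BBox:
    """Convert a pixel bbox (top-left origin) to PDF points (bottom-left origin)."""
    x0, y0, x1, y1 = bbox
    scale = 72.0 / dpi  # one PDF point is 1/72 inch
    page_height_pt = image_height_px * scale
    # Flipping y swaps the roles of the top and bottom edges.
    return (
        x0 * scale,
        page_height_pt - y1 * scale,
        x1 * scale,
        page_height_pt - y0 * scale,
    )
```

For example, on a 2200 px tall page rendered at 200 dpi (792 pt), the pixel box (100, 200, 300, 400) maps to roughly (36, 648, 108, 720) in points.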
5. Translation System Foundation
- 5.1 Create TranslationEngine interface
- 5.1.1 Define translation API contract
- 5.1.2 Support element-level translation
- 5.1.3 Preserve formatting markers
- 5.2 Implement structure-preserving translation
- 5.2.1 Translate text while maintaining coordinates
- 5.2.2 Handle table cell translations
- 5.2.3 Preserve list structures
- 5.2.4 Maintain header hierarchies
- 5.3 Create translated document renderer
- 5.3.1 Generate PDF with translated text
- 5.3.2 Adjust layouts for text expansion/contraction
- 5.3.3 Handle font substitution for target languages
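The interface in 5.1 and the coordinate-preserving translation in 5.2.1 could look roughly like the sketch below. It assumes a minimal `Element` record and a `TranslationEngine` protocol with a single `translate` method; both are illustrative stand-ins, as is the set of element types treated as translatable.

```python
from dataclasses import dataclass, replace
from typing import Iterable, List, Protocol, Tuple


class TranslationEngine(Protocol):
    """Hypothetical contract: translate one string into the target language."""

    def translate(self, text: str, target_lang: str) -> str: ...


@dataclass(frozen=True)
class Element:
    element_type: str
    bbox: Tuple[float, float, float, float]
    text: str


def translate_elements(
    elements: Iterable[Element],
    engine: TranslationEngine,
    target_lang: str,
    translatable: Tuple[str, ...] = ("text", "title", "list_item", "table_cell"),
) -> List[Element]:
    """Translate text-bearing elements; type and bbox are left untouched."""
    result: List[Element] = []
    for el in elements:
        if el.element_type in translatable and el.text.strip():
            translated = engine.translate(el.text, target_lang)
            # replace() copies the element, changing only the text field.
            result.append(replace(el, text=translated))
        else:
            result.append(el)
    return result
```

Keeping the bounding box unchanged leaves layout adjustment for text expansion or contraction to the renderer, which matches the split between 5.2 and 5.3.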
6. API Updates
- 6.1 Update OCR endpoints
- 6.1.1 Add processing_track parameter
- 6.1.2 Support track auto-detection
- 6.1.3 Return processing metadata
- 6.2 Add document type detection endpoint
- 6.2.1 Create /analyze endpoint
- 6.2.2 Return recommended processing track
- 6.2.3 Provide confidence scores
- 6.3 Update result export endpoints
- 6.3.1 Support UnifiedDocument format
- 6.3.2 Add format conversion options
- 6.3.3 Include processing track information
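To make 6.2.2 and 6.2.3 concrete, the logic behind an /analyze response might resemble the sketch below. Everything here is an assumption for illustration: the `TrackRecommendation` fields, the 50-character threshold, and the way confidence is derived from the share of pages that carry a text layer. Office documents are deliberately left out because the task list does not state which track they take.

```python
from dataclasses import dataclass
from typing import List


@dataclass
class TrackRecommendation:
    recommended_track: str  # assumed values: "ocr" or "direct"
    confidence: float       # 0.0 to 1.0, heuristic only
    reason: str


def recommend_track(
    mime_type: str, chars_per_page: List[int], min_chars: int = 50
) -> TrackRecommendation:
    """Suggest a processing track from the MIME type and per-page text counts."""
    if mime_type.startswith("image/"):
        return TrackRecommendation("ocr", 1.0, "image input has no text layer")
    if mime_type == "application/pdf" and chars_per_page:
        # A page counts as editable if it has at least min_chars of text.
        text_pages = sum(1 for n in chars_per_page if n >= min_chars)
        ratio = text_pages / len(chars_per_page)
        if ratio >= 0.5:
            return TrackRecommendation(
                "direct", ratio, "most pages carry a text layer"
            )
        return TrackRecommendation(
            "ocr", 1.0 - ratio, "most pages lack a text layer"
        )
    return TrackRecommendation("ocr", 0.5, "undetermined input, falling back to OCR")
```

An endpoint handler would then serialize the returned object into the response body alongside any other processing metadata.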
7. Frontend Updates
- 7.1 Update task detail view
- 7.1.1 Display processing track information
- 7.1.2 Show track-specific metadata
- 7.1.3 Add track selection UI (if a manual override is needed)
- 7.2 Update results preview
- 7.2.1 Handle UnifiedDocument format
- 7.2.2 Display enhanced structure information
- 7.2.3 Show coordinate overlays (debug mode)
- 7.3 Prepare translation UI
- 7.3.1 Add translation toggle/button
- 7.3.2 Add language selection dropdown
- 7.3.3 Add translation progress indicator
8. Testing
- 8.1 Unit tests for DocumentTypeDetector
- 8.1.1 Test various file types
- 8.1.2 Test editability detection
- 8.1.3 Test edge cases
- 8.2 Unit tests for DirectExtractionEngine
- 8.2.1 Test text extraction accuracy
- 8.2.2 Test structure preservation
- 8.2.3 Test coordinate extraction
- 8.3 Integration tests for dual-track processing
- 8.3.1 Test routing logic
- 8.3.2 Test UnifiedDocument generation
- 8.3.3 Test backward compatibility
- 8.4 End-to-end tests
- 8.4.1 Test scanned PDF processing (OCR track)
- 8.4.2 Test editable PDF processing (direct track)
- 8.4.3 Test Office document processing
- 8.4.4 Test image file processing
- 8.5 Performance testing
- 8.5.1 Benchmark both processing tracks
- 8.5.2 Test GPU memory usage
- 8.5.3 Compare processing times
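For 8.5.1 and 8.5.3, a small timing harness is often enough to compare the two tracks on the same input. The helper below is a generic sketch; `benchmark` is a hypothetical name, and the callables passed to it would wrap whatever entry points the OCR and direct tracks expose.

```python
import statistics
import time
from typing import Callable, Dict


def benchmark(fn: Callable[[], object], runs: int = 5) -> Dict[str, float]:
    """Call fn repeatedly and summarize wall-clock durations in seconds."""
    durations = []
    for _ in range(runs):
        start = time.perf_counter()
        fn()
        durations.append(time.perf_counter() - start)
    return {
        "runs": runs,
        "mean_s": statistics.mean(durations),
        "min_s": min(durations),
        "max_s": max(durations),
    }
```

Running it once per track against the same document set gives directly comparable numbers. The first run usually includes model loading, so it may be worth discarding or reporting it separately.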
9. Documentation
- 9.1 Update API documentation
- 9.1.1 Document new endpoints
- 9.1.2 Update existing endpoint docs
- 9.1.3 Add processing track information
- 9.2 Create architecture documentation
- 9.2.1 Document dual-track flow
- 9.2.2 Explain UnifiedDocument structure
- 9.2.3 Add decision trees for track selection
- 9.3 Add deployment guide
- 9.3.1 Document GPU requirements
- 9.3.2 Add environment configuration
- 9.3.3 Include troubleshooting guide
10. Deployment Preparation
- 10.1 Update Docker configuration
- 10.1.1 Add new dependencies to Dockerfile
- 10.1.2 Configure GPU support
- 10.1.3 Update volume mappings
- 10.2 Update environment variables
- 10.2.1 Add processing track settings
- 10.2.2 Configure GPU memory limits
- 10.2.3 Add feature flags
- 10.3 Create migration plan
- 10.3.1 Plan migration of existing data
- 10.3.2 Create rollback procedures
- 10.3.3 Document breaking changes
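The settings in 10.2 (processing track, GPU memory limit, feature flags) could be read from the environment along the lines of the sketch below. The variable names `PROCESSING_TRACK`, `GPU_MEMORY_LIMIT_MB`, and `ENABLE_DIRECT_TRACK`, as well as the default values, are placeholders invented for this example and do not come from the project.

```python
import os
from dataclasses import dataclass
from typing import Mapping


def _as_bool(value: str) -> bool:
    """Interpret common truthy spellings; anything else is False."""
    return value.strip().lower() in {"1", "true", "yes", "on"}


@dataclass(frozen=True)
class ProcessingSettings:
    default_track: str        # "auto", "ocr", or "direct" (assumed values)
    gpu_memory_limit_mb: int
    enable_direct_track: bool

    @classmethod
    def from_env(cls, env: Mapping[str, str] = os.environ) -> "ProcessingSettings":
        """Build settings from environment variables (hypothetical names)."""
        track = env.get("PROCESSING_TRACK", "auto").lower()
        if track not in {"auto", "ocr", "direct"}:
            raise ValueError(f"unsupported PROCESSING_TRACK: {track!r}")
        return cls(
            default_track=track,
            # Placeholder default, not a tuned value.
            gpu_memory_limit_mb=int(env.get("GPU_MEMORY_LIMIT_MB", "6144")),
            enable_direct_track=_as_bool(env.get("ENABLE_DIRECT_TRACK", "true")),
        )
```

Validating the track name at startup surfaces a misconfigured deployment immediately instead of at the first request.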
Completion Checklist
- All unit tests passing
- Integration tests passing
- Performance benchmarks acceptable
- Documentation complete
- Code reviewed
- Deployment tested in staging
Future Improvements
The following improvements have been identified but are not part of this change proposal:
Batch Processing Enhancement
- Related to: Section 3.1.2 (Enable batch processing for GPU efficiency)
- Description: Implement true batch inference by sending multiple pages or documents to PaddleOCR simultaneously
- Benefits: Better GPU utilization, reduced overhead from model switching
- Requirements: Queue management, memory-aware batching, result aggregation
- Recommendation: Create a separate change proposal when ready to implement
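The "memory-aware batching" requirement above could start from a greedy grouping like the one sketched below. It assumes a per-page memory estimate is available, which the proposal does not specify; `make_batches` and its parameters are hypothetical names.

```python
from typing import List, Sequence


def make_batches(page_costs_mb: Sequence[float], budget_mb: float) -> List[List[int]]:
    """Group page indices, in order, so each batch stays within the budget.

    A single page whose estimate exceeds the budget still gets a batch of
    its own, so no page is ever dropped.
    """
    batches: List[List[int]] = []
    current: List[int] = []
    used = 0.0
    for index, cost in enumerate(page_costs_mb):
        # Close the current batch if adding this page would exceed the budget.
        if current and used + cost > budget_mb:
            batches.append(current)
            current, used = [], 0.0
        current.append(index)
        used += cost
    if current:
        batches.append(current)
    return batches
```

With estimates of 300, 300, 500, 900, and 100 MB and an 800 MB budget, the pages are grouped as [[0, 1], [2], [3], [4]]. Page order is preserved, which keeps result aggregation straightforward.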