Add TableData.from_dict() and TableCell.from_dict() methods to convert
JSON table dicts to proper TableData objects during UnifiedDocument parsing.
Modified _json_to_document_element() to detect TABLE elements with dict
content containing 'cells' key and convert to TableData.
Note: This fix ensures table elements have proper to_html() method available
but the rendered output still needs investigation - tables may still render
incorrectly in OCR track PDFs.
Files changed:
- unified_document.py: Add from_dict() class methods
- pdf_generator_service.py: Convert table dicts during JSON parsing
- Add fix-ocr-track-table-rendering proposal
🤖 Generated with [Claude Code](https://claude.com/claude-code)
Co-Authored-By: Claude <noreply@anthropic.com>
Add optional original_filename field to DocumentMetadata dataclass
to properly store the original filename when files are converted
(e.g., Office → PDF). This ensures the field is included in to_dict()
output for JSON serialization.
🤖 Generated with [Claude Code](https://claude.com/claude-code)
Co-Authored-By: Claude <noreply@anthropic.com>
Added foundation for dual-track document processing:
1. UnifiedDocument Model (backend/app/models/unified_document.py)
- Common output format for both OCR and direct extraction
- Comprehensive element types (23+ types from PP-StructureV3)
- BoundingBox, StyleInfo, TableData structures
- Backward compatibility with legacy format
2. DocumentTypeDetector Service (backend/app/services/document_type_detector.py)
- Intelligent document type detection using python-magic
- PDF editability analysis using PyMuPDF
- Processing track recommendation with confidence scores
- Support for PDF, images, Office docs, and text files
3. DirectExtractionEngine Service (backend/app/services/direct_extraction_engine.py)
- Fast extraction from editable PDFs using PyMuPDF
- Preserves fonts, colors, and exact positioning
- Native and positional table detection
- Image extraction with coordinates
- Hyperlink and metadata extraction
4. Dependencies
- Added PyMuPDF>=1.23.0 for PDF extraction
- Added pdfplumber>=0.10.0 as fallback
- Added python-magic-bin>=0.4.14 for file detection
Next: Integrate with OCR service for complete dual-track processing
🤖 Generated with [Claude Code](https://claude.com/claude-code)
Co-Authored-By: Claude <noreply@anthropic.com>