fix: complete layout analysis features for DirectExtractionEngine

Implements missing layout analysis capabilities:
- Add footer detection based on page position (bottom 10%)
- Build hierarchical section structure from font sizes
- Create nested list structure from indentation levels

All elements now have proper metadata for:
- section_level, parent_section, child_sections (headers)
- list_level, parent_item, children (list items)
- is_page_header, is_page_footer flags

Updates tasks.md to reflect accurate completion status.

🤖 Generated with [Claude Code](https://claude.com/claude-code)

Co-Authored-By: Claude <noreply@anthropic.com>
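
For readers without the diff of the engine itself, the three layout rules named in the message can be sketched roughly as follows. This is only an illustration under stated assumptions, not the code in DirectExtractionEngine: the `Element` class, the function names, the top-down y-coordinate convention, the 1-based section levels, and the 0-based list levels are all hypothetical. Only the metadata keys and the bottom-10% threshold come from the commit message.

```python
from dataclasses import dataclass, field


@dataclass
class Element:
    """Hypothetical stand-in for the project's DocumentElement."""
    text: str
    y0: float = 0.0          # top edge, measured downward from the page top (assumption)
    font_size: float = 12.0
    indent: float = 0.0      # left offset of the text (assumption)
    metadata: dict = field(default_factory=dict)


def flag_page_footer(el, page_height, footer_ratio=0.10):
    """Flag an element whose top edge lies in the bottom 10% of the page."""
    el.metadata["is_page_footer"] = el.y0 >= page_height * (1.0 - footer_ratio)


def _link_by_level(elements, levels, level_key, parent_key, children_key):
    """Attach each element to the nearest preceding element with a smaller level."""
    stack = []  # currently open ancestors, shallowest first
    for el, level in zip(elements, levels):
        # Close every ancestor that is at the same depth or deeper.
        while stack and stack[-1].metadata[level_key] >= level:
            stack.pop()
        parent = stack[-1] if stack else None
        el.metadata[level_key] = level
        el.metadata[parent_key] = parent.text if parent else None
        el.metadata[children_key] = []
        if parent is not None:
            parent.metadata[children_key].append(el.text)
        stack.append(el)


def build_section_hierarchy(headers):
    """Larger font size means a shallower section; level 1 is the largest font."""
    sizes = sorted({h.font_size for h in headers}, reverse=True)
    rank = {size: i + 1 for i, size in enumerate(sizes)}
    _link_by_level(headers, [rank[h.font_size] for h in headers],
                   "section_level", "parent_section", "child_sections")


def build_list_nesting(items):
    """Larger indentation means deeper nesting; level 0 is the smallest indent."""
    indents = sorted({i.indent for i in items})
    rank = {indent: n for n, indent in enumerate(indents)}
    _link_by_level(items, [rank[i.indent] for i in items],
                   "list_level", "parent_item", "children")
```

Both hierarchies use the same stack walk because they are the same problem: a flat, ordered sequence with a per-element depth. A real implementation would likely need a tolerance when comparing font sizes and indents, since extracted values are rarely exactly equal.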
@@ -11,6 +11,7 @@
- [x] 1.2.2 Add DocumentElement model
- [x] 1.2.3 Add DocumentMetadata model
- [x] 1.2.4 Create converters for both OCR and direct extraction outputs
- Note: OCR converter complete; DirectExtractionEngine returns UnifiedDocument directly
- [x] 1.3 Create DocumentTypeDetector service
- [x] 1.3.1 Implement file type detection using python-magic
- [x] 1.3.2 Add PDF editability checking logic