fix: complete layout analysis features for DirectExtractionEngine

Implements missing layout analysis capabilities:
- Add footer detection based on page position (bottom 10%)
- Build hierarchical section structure from font sizes
- Create nested list structure from indentation levels
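
A minimal sketch of the first two items, for illustration only. It is not the DirectExtractionEngine code: the `Header` class and the `is_page_footer` and `build_section_tree` helpers are hypothetical names, and the sketch assumes y-coordinates grow downward from the top of the page and that a larger font size means a shallower section level.

```python
from dataclasses import dataclass, field
from typing import List, Optional

FOOTER_ZONE = 0.10  # bottom 10% of the page, as stated in the commit message


@dataclass
class Header:
    """Hypothetical stand-in for a header element."""
    text: str
    font_size: float
    section_level: int = 0
    parent_section: Optional[str] = None
    child_sections: List[str] = field(default_factory=list)


def is_page_footer(y_top: float, page_height: float) -> bool:
    """Return True if the element starts inside the bottom 10% of the page.

    Assumes y is measured downward from the top edge of the page.
    """
    return y_top >= page_height * (1 - FOOTER_ZONE)


def build_section_tree(headers: List[Header]) -> List[Header]:
    """Assign section levels by font size and link parents to children."""
    # Rank the distinct font sizes from largest to smallest: largest is level 1.
    sizes = sorted({h.font_size for h in headers}, reverse=True)
    level_of = {size: rank + 1 for rank, size in enumerate(sizes)}

    stack: List[Header] = []  # currently open ancestor headers
    for h in headers:
        h.section_level = level_of[h.font_size]
        # Close every open section at the same or a deeper level.
        while stack and stack[-1].section_level >= h.section_level:
            stack.pop()
        if stack:
            h.parent_section = stack[-1].text
            stack[-1].child_sections.append(h.text)
        stack.append(h)
    return headers
```

Ranking the distinct font sizes avoids hard-coding point-size thresholds, at the cost of assuming every header size in the document maps to exactly one level.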

Elements now carry type-specific metadata:
- section_level, parent_section, child_sections (headers)
- list_level, parent_item, children (list items)
- is_page_header, is_page_footer flags
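
The list-item fields above could be populated from indentation roughly as follows. This is a sketch under assumptions, not the committed implementation: `ListItem`, `nest_list_items`, and the `tolerance` parameter are hypothetical, indentation is taken to be the left x-coordinate of each item, and parents and children are referenced by index.

```python
from dataclasses import dataclass, field
from typing import List, Optional


@dataclass
class ListItem:
    """Hypothetical stand-in for a list-item element."""
    text: str
    indent: float  # left x-coordinate of the item (assumed unit: points)
    list_level: int = 0
    parent_item: Optional[int] = None  # index of the parent item, if any
    children: List[int] = field(default_factory=list)


def nest_list_items(items: List[ListItem], tolerance: float = 2.0) -> List[ListItem]:
    """Derive list_level, parent_item and children from indentation.

    Items whose indents differ by no more than `tolerance` are treated as siblings.
    """
    stack: List[int] = []  # indices of currently open ancestor items
    for i, item in enumerate(items):
        # Drop ancestors that are not clearly less indented than this item.
        while stack and items[stack[-1]].indent >= item.indent - tolerance:
            stack.pop()
        if stack:
            parent = stack[-1]
            item.parent_item = parent
            item.list_level = items[parent].list_level + 1
            items[parent].children.append(i)
        stack.append(i)
    return items
```

The tolerance absorbs small jitter in extracted coordinates, so two items at 90.0 and 90.5 are treated as siblings instead of parent and child.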

Updates tasks.md to reflect accurate completion status.

🤖 Generated with [Claude Code](https://claude.com/claude-code)

Co-Authored-By: Claude <noreply@anthropic.com>
egg
2025-11-19 08:15:11 +08:00
parent a3a6fbe58b
commit 5bcf3dfd42
2 changed files with 134 additions and 0 deletions


@@ -11,6 +11,7 @@
- [x] 1.2.2 Add DocumentElement model
- [x] 1.2.3 Add DocumentMetadata model
- [x] 1.2.4 Create converters for both OCR and direct extraction outputs
- Note: OCR converter complete; DirectExtractionEngine returns UnifiedDocument directly
- [x] 1.3 Create DocumentTypeDetector service
- [x] 1.3.1 Implement file type detection using python-magic
- [x] 1.3.2 Add PDF editability checking logic