Commit Graph

3 Commits

Author SHA1 Message Date
egg
75c194fe2a feat: implement Task 7 span-level rendering for inline styling
Added support for preserving and rendering inline style variations
within text elements (e.g., bold/italic/color changes mid-line).

Span Extraction (direct_extraction_engine.py):
1. Parse PyMuPDF span data with font, size, flags, color per span
2. Create DocumentElement children for each span with StyleInfo
3. Store spans in element.children for downstream rendering
4. Extract span-specific bbox from PyMuPDF (lines 434-453)
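
The span extraction steps above can be sketched as follows. This is a minimal illustration, not the code in direct_extraction_engine.py: `SpanStyle` is a hypothetical stand-in for the project's StyleInfo, and `style_from_span` and `spans_of_block` are illustrative names. It assumes the input has the block/line/span layout that PyMuPDF returns from `page.get_text("dict")`, where `flags` is a bit field (bit 1 italic, bit 4 bold) and `color` is a packed sRGB integer.

```python
from dataclasses import dataclass
from typing import Any, Dict, List, Tuple

# PyMuPDF span flag bits, as used in get_text("dict") output.
FLAG_ITALIC = 1 << 1
FLAG_BOLD = 1 << 4


@dataclass
class SpanStyle:
    """Hypothetical stand-in for the project's StyleInfo."""
    font: str
    size: float
    bold: bool
    italic: bool
    color: Tuple[int, int, int]  # (r, g, b), each 0-255


def style_from_span(span: Dict[str, Any]) -> SpanStyle:
    """Decode one PyMuPDF-style span dict into a SpanStyle."""
    flags = span.get("flags", 0)
    rgb = span.get("color", 0)  # packed sRGB integer, 0xRRGGBB
    return SpanStyle(
        font=span.get("font", ""),
        size=float(span.get("size", 0.0)),
        bold=bool(flags & FLAG_BOLD),
        italic=bool(flags & FLAG_ITALIC),
        color=((rgb >> 16) & 0xFF, (rgb >> 8) & 0xFF, rgb & 0xFF),
    )


def spans_of_block(block: Dict[str, Any]) -> List[Tuple[str, SpanStyle, Tuple[float, ...]]]:
    """Flatten a text block into (text, style, bbox) triples, one per span."""
    out = []
    for line in block.get("lines", []):
        for span in line.get("spans", []):
            bbox = tuple(span.get("bbox", ()))
            out.append((span.get("text", ""), style_from_span(span), bbox))
    return out
```

Each triple would then become one child element, so the renderer can walk the children in reading order.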

Span Rendering (pdf_generator_service.py):
1. Implement _draw_text_with_spans() method (lines 1685-1734)
   - Iterate through span children
   - Apply per-span styling via _apply_text_style
   - Track X position and calculate widths
   - Return total rendered width
2. Integrate in _draw_text_element_direct() (lines 1822-1823, 1905-1914)
   - Check for element.children (has_spans flag)
   - Use span rendering for first line
   - Fall back to normal rendering for list items
3. Add span count to debug logging
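
The rendering loop described above (iterate spans, style each one, advance X, return the total width) might look roughly like this. It is a sketch under assumptions: the real `_draw_text_with_spans()` is not shown in this commit message, so the `draw` and `measure` callables and the `(text, font, size)` span shape are hypothetical placeholders for whatever the PDF backend provides.

```python
from typing import Callable, Iterable, Tuple

Span = Tuple[str, str, float]  # (text, font name, font size), hypothetical shape


def draw_spans(
    spans: Iterable[Span],
    x: float,
    y: float,
    draw: Callable[[float, float, str, str, float], None],
    measure: Callable[[str, str, float], float],
) -> float:
    """Draw spans left to right on one baseline and return the total width."""
    start_x = x
    for text, font, size in spans:
        draw(x, y, text, font, size)    # styling is applied per span
        x += measure(text, font, size)  # advance the pen by this span's width
    return x - start_x
```

Passing the drawing and measuring functions in keeps the loop independent of any particular PDF library and makes it easy to exercise with fakes.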

Features:
- Inline font changes (Arial → Times → Courier)
- Inline size changes (12pt → 14pt → 10pt)
- Inline style changes (normal → bold → italic)
- Inline color changes (black → red → blue)

Limitations (future work):
- Currently renders all spans on the first line only
- Multi-line span support requires line breaking logic
- List items use single-style rendering (compatibility)

Direct track only (OCR track has no span information).

🤖 Generated with [Claude Code](https://claude.com/claude-code)

Co-Authored-By: Claude <noreply@anthropic.com>
2025-11-24 11:44:05 +08:00
egg
5bcf3dfd42 fix: complete layout analysis features for DirectExtractionEngine
Implements missing layout analysis capabilities:
- Add footer detection based on page position (bottom 10%)
- Build hierarchical section structure from font sizes
- Create nested list structure from indentation levels
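
As an illustration of the first two items, here is one way footer detection and font-size-based section levels could work. The function names, the use of the element's top edge for the footer test, and the rule "larger font means shallower level" are assumptions made for this sketch; the commit states only the bottom-10% band and that the hierarchy comes from font sizes. Coordinates are assumed to grow downward from the top of the page, as in PyMuPDF.

```python
from typing import List, Optional, Tuple


def is_footer(y0: float, page_height: float, band: float = 0.10) -> bool:
    """True if an element's top edge lies in the bottom `band` of the page."""
    return y0 >= page_height * (1.0 - band)


def assign_section_levels(sizes: List[float]) -> List[Tuple[int, Optional[int]]]:
    """Given header font sizes in reading order, return (level, parent_index).

    Larger fonts get shallower levels. The parent is the nearest preceding
    header with a strictly shallower level, or None for a top-level header.
    """
    ranks = {s: i + 1 for i, s in enumerate(sorted(set(sizes), reverse=True))}
    result: List[Tuple[int, Optional[int]]] = []
    stack: List[Tuple[int, int]] = []  # (level, index) of currently open sections
    for idx, size in enumerate(sizes):
        level = ranks[size]
        # Close sections at the same or a deeper level than this header.
        while stack and stack[-1][0] >= level:
            stack.pop()
        parent = stack[-1][1] if stack else None
        result.append((level, parent))
        stack.append((level, idx))
    return result
```

The child lists can be derived afterwards by inverting the parent indices.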

All elements now have proper metadata for:
- section_level, parent_section, child_sections (headers)
- list_level, parent_item, children (list items)
- is_page_header, is_page_footer flags
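
For the list metadata, a stack over left indentation is one plausible way to derive a level and parent per item. This is a sketch only: `nest_list_items`, the 0-based level, and the 2-point tolerance are assumptions, not values taken from the engine.

```python
from typing import List, Optional, Tuple


def nest_list_items(
    indents: List[float], tolerance: float = 2.0
) -> List[Tuple[int, Optional[int]]]:
    """Return (list_level, parent_index) for items given their left x-offsets."""
    result: List[Tuple[int, Optional[int]]] = []
    stack: List[Tuple[float, int]] = []  # (indent, index) of open ancestors
    for idx, indent in enumerate(indents):
        # Close items that sit at the same or a deeper indent than this one;
        # the tolerance absorbs small positioning jitter.
        while stack and stack[-1][0] >= indent - tolerance:
            stack.pop()
        parent = stack[-1][1] if stack else None
        result.append((len(stack), parent))
        stack.append((indent, idx))
    return result
```

For example, indents of 72, 90, 90.5, 108, 72 give two top-level items, with the second and third items as children of the first and the fourth nested under the third.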

Updates tasks.md to reflect accurate completion status.

🤖 Generated with [Claude Code](https://claude.com/claude-code)

Co-Authored-By: Claude <noreply@anthropic.com>
2025-11-19 08:15:11 +08:00
egg
2d50c128f7 feat: implement core dual-track processing infrastructure
Added foundation for dual-track document processing:

1. UnifiedDocument Model (backend/app/models/unified_document.py)
   - Common output format for both OCR and direct extraction
   - Comprehensive element types (23+ types from PP-StructureV3)
   - BoundingBox, StyleInfo, TableData structures
   - Backward compatibility with legacy format

2. DocumentTypeDetector Service (backend/app/services/document_type_detector.py)
   - Intelligent document type detection using python-magic
   - PDF editability analysis using PyMuPDF
   - Processing track recommendation with confidence scores
   - Support for PDF, images, Office docs, and text files

3. DirectExtractionEngine Service (backend/app/services/direct_extraction_engine.py)
   - Fast extraction from editable PDFs using PyMuPDF
   - Preserves fonts, colors, and exact positioning
   - Native and positional table detection
   - Image extraction with coordinates
   - Hyperlink and metadata extraction

4. Dependencies
   - Added PyMuPDF>=1.23.0 for PDF extraction
   - Added pdfplumber>=0.10.0 as fallback
   - Added python-magic-bin>=0.4.14 for file detection
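
To make the track recommendation in item 2 concrete, here is a hypothetical sketch that decides between the direct and OCR tracks from per-page extractable character counts (which could be obtained, for instance, from the length of PyMuPDF's `page.get_text()` output). The function name, the 50-character minimum, the 90% threshold, and the confidence formula are all illustrative assumptions; the actual DocumentTypeDetector logic is not described in this commit.

```python
from typing import List, Tuple


def recommend_track(
    chars_per_page: List[int], min_chars: int = 50, threshold: float = 0.9
) -> Tuple[str, float]:
    """Recommend "direct" or "ocr" with a rough confidence in [0, 1].

    A page counts as editable if it yields at least `min_chars` characters.
    Both thresholds are illustrative assumptions.
    """
    if not chars_per_page:
        return "ocr", 0.0
    editable = sum(1 for n in chars_per_page if n >= min_chars)
    ratio = editable / len(chars_per_page)
    if ratio >= threshold:
        return "direct", ratio
    return "ocr", 1.0 - ratio
```

A scanned PDF with almost no text layer would fall to the OCR track, while a document with text on every page would go to direct extraction.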

Next: Integrate with OCR service for complete dual-track processing

🤖 Generated with [Claude Code](https://claude.com/claude-code)

Co-Authored-By: Claude <noreply@anthropic.com>
2025-11-18 20:17:50 +08:00