OCR/openspec/changes/pdf-layout-restoration/tasks.md
egg 75c194fe2a feat: implement Task 7 span-level rendering for inline styling
Added support for preserving and rendering inline style variations
within text elements (e.g., bold/italic/color changes mid-line).

Span Extraction (direct_extraction_engine.py):
1. Parse PyMuPDF span data with font, size, flags, color per span
2. Create DocumentElement children for each span with StyleInfo
3. Store spans in element.children for downstream rendering
4. Extract span-specific bbox from PyMuPDF (lines 434-453)

Span Rendering (pdf_generator_service.py):
1. Implement _draw_text_with_spans() method (lines 1685-1734)
   - Iterate through span children
   - Apply per-span styling via _apply_text_style
   - Track X position and calculate widths
   - Return total rendered width
2. Integrate in _draw_text_element_direct() (lines 1822-1823, 1905-1914)
   - Check for element.children (has_spans flag)
   - Use span rendering for first line
   - Fall back to normal rendering for list items
3. Add span count to debug logging

Features:
- Inline font changes (Arial → Times → Courier)
- Inline size changes (12pt → 14pt → 10pt)
- Inline style changes (normal → bold → italic)
- Inline color changes (black → red → blue)

Limitations (future work):
- Currently renders all spans on first line only
- Multi-line span support requires line breaking logic
- List items use single-style rendering (compatibility)

Direct track only (OCR track has no span information).

🤖 Generated with [Claude Code](https://claude.com/claude-code)

Co-Authored-By: Claude <noreply@anthropic.com>
2025-11-24 11:44:05 +08:00

# Implementation Tasks: PDF Layout Restoration
## Phase 1: Critical Fixes (P0 - Immediate)
### 1. Fix Image Handling
- [x] 1.1 Implement `_save_image()` in pp_structure_enhanced.py
- [x] 1.1.1 Create imgs subdirectory in result_dir
- [x] 1.1.2 Handle both file path and numpy array inputs
- [x] 1.1.3 Save with element_id as filename
- [x] 1.1.4 Return relative path for reference
- [x] 1.1.5 Add error handling and logging
- [x] 1.2 Fix path resolution in pdf_generator_service.py
- [x] 1.2.1 Create `_get_image_path()` helper with fallback logic
- [x] 1.2.2 Check saved_path, path, image_path keys
- [x] 1.2.3 Check metadata for path
- [x] 1.2.4 Update convert_unified_document_to_ocr_data to use helper
- [ ] 1.3 Test image rendering
- [ ] 1.3.1 Test with OCR track document
- [ ] 1.3.2 Test with Direct track document
- [ ] 1.3.3 Verify images appear in PDF output
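
The fallback order described in 1.2.1 to 1.2.3 can be sketched as follows. This is a minimal illustration, not the actual `_get_image_path()` implementation: it assumes the element is a plain dict and that the metadata uses the same three key names, neither of which is confirmed by this task list.

```python
from typing import Any, Mapping, Optional

# Keys checked in order, as listed in task 1.2.2.
PATH_KEYS = ("saved_path", "path", "image_path")


def get_image_path(element: Mapping[str, Any]) -> Optional[str]:
    """Return the first non-empty image path found on an element dict."""
    for key in PATH_KEYS:
        value = element.get(key)
        if value:
            return str(value)
    # Task 1.2.3: fall back to the element's metadata.
    # Assumption: metadata is a dict that reuses the same key names.
    metadata = element.get("metadata") or {}
    for key in PATH_KEYS:
        value = metadata.get(key)
        if value:
            return str(value)
    return None
```

Returning `None` instead of raising lets the caller decide whether a missing image is an error or something to skip and log.
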
### 2. Fix Table Rendering
- [x] 2.1 Remove dependency on fake image references
- [x] 2.1.1 Stop creating fake table_*.png references (changed to None)
- [x] 2.1.2 Remove image lookup fallback in draw_table_region
- [x] 2.2 Use direct bbox from table element
- [x] 2.2.1 Get bbox from table_element.get("bbox")
- [x] 2.2.2 Fallback to bbox_polygon if needed
- [x] 2.2.3 Implement _polygon_to_bbox converter (inline conversion implemented)
- [x] 2.3 Fix table HTML rendering
- [x] 2.3.1 Parse HTML content from table element
- [x] 2.3.2 Position table using normalized bbox
- [x] 2.3.3 Render with proper dimensions
- [ ] 2.4 Test table rendering
- [ ] 2.4.1 Test simple tables
- [ ] 2.4.2 Test complex multi-column tables
- [ ] 2.4.3 Test with both tracks
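
The bbox lookup in 2.2.1 to 2.2.3 can be illustrated with a small sketch. It assumes `bbox` is `[x0, y0, x1, y1]` and `bbox_polygon` is a list of `[x, y]` points; both shapes are assumptions. The task list notes that the real conversion was done inline, so the two helper functions here are hypothetical.

```python
from typing import Any, Mapping, Optional, Sequence, Tuple

BBox = Tuple[float, float, float, float]


def polygon_to_bbox(polygon: Sequence[Sequence[float]]) -> BBox:
    """Collapse a polygon (list of [x, y] points) to its bounding rectangle."""
    if not polygon:
        raise ValueError("polygon must contain at least one point")
    xs = [float(point[0]) for point in polygon]
    ys = [float(point[1]) for point in polygon]
    return (min(xs), min(ys), max(xs), max(ys))


def resolve_table_bbox(table_element: Mapping[str, Any]) -> Optional[BBox]:
    """Prefer the direct bbox, then fall back to bbox_polygon."""
    bbox = table_element.get("bbox")
    if bbox:
        x0, y0, x1, y1 = (float(v) for v in bbox)
        return (x0, y0, x1, y1)
    polygon = table_element.get("bbox_polygon")
    if polygon:
        return polygon_to_bbox(polygon)
    return None
```
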
## Phase 2: Basic Style Preservation (P1 - Week 1)
### 3. Implement Style Application System
- [x] 3.1 Create font mapping system
- [x] 3.1.1 Define FONT_MAPPING dictionary (20 common fonts mapped)
- [x] 3.1.2 Map common fonts to PDF standard fonts (Helvetica/Times/Courier)
- [x] 3.1.3 Add fallback to Helvetica for unknown fonts (with partial matching)
- [x] 3.2 Implement _apply_text_style() method
- [x] 3.2.1 Extract font family from StyleInfo (object and dict support)
- [x] 3.2.2 Handle bold/italic flags (compound variants like BoldOblique)
- [x] 3.2.3 Apply font size (with default fallback)
- [x] 3.2.4 Apply text color (using _parse_color)
- [x] 3.2.5 Handle errors gracefully (try-except with fallback to defaults)
- [x] 3.3 Create color parsing utilities
- [x] 3.3.1 Parse hex colors (#RRGGBB and #RGB)
- [x] 3.3.2 Parse RGB tuples (0-255 and 0-1 normalization)
- [x] 3.3.3 Convert to PDF color space (0-1 range for ReportLab)
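
Section 3 describes font mapping with bold/italic variants and color parsing; a rough sketch of both follows. The four-entry `FONT_MAPPING` is a hypothetical stand-in for the 20-font table in the codebase, and `resolve_font` and `parse_color` are illustrative names, not the project's actual helpers. The variant names are the standard PDF base-font names.

```python
from typing import Sequence, Tuple, Union

RGB = Tuple[float, float, float]

# Hypothetical subset; the real FONT_MAPPING is said to cover 20 fonts.
FONT_MAPPING = {
    "arial": "Helvetica",
    "helvetica": "Helvetica",
    "times": "Times",
    "courier": "Courier",
}

# Standard PDF font names, ordered: regular, bold, italic, bold+italic.
VARIANTS = {
    "Helvetica": ("Helvetica", "Helvetica-Bold", "Helvetica-Oblique", "Helvetica-BoldOblique"),
    "Times": ("Times-Roman", "Times-Bold", "Times-Italic", "Times-BoldItalic"),
    "Courier": ("Courier", "Courier-Bold", "Courier-Oblique", "Courier-BoldOblique"),
}


def resolve_font(family: str, bold: bool = False, italic: bool = False) -> str:
    """Map a source font family to a standard PDF font name."""
    key = family.strip().lower()
    base = FONT_MAPPING.get(key)
    if base is None:
        # Partial matching ("ArialMT" contains "arial"), else Helvetica.
        base = next((v for k, v in FONT_MAPPING.items() if k in key), "Helvetica")
    return VARIANTS[base][int(bold) + 2 * int(italic)]


def parse_color(value: Union[str, Sequence[float]]) -> RGB:
    """Convert '#RRGGBB', '#RGB', or an RGB tuple to 0-1 floats."""
    if isinstance(value, str):
        digits = value.lstrip("#")
        if len(digits) == 3:
            digits = "".join(ch * 2 for ch in digits)  # '#0f0' -> '00ff00'
        if len(digits) != 6:
            raise ValueError(f"unsupported color string: {value!r}")
        r, g, b = (int(digits[i:i + 2], 16) / 255 for i in (0, 2, 4))
        return (r, g, b)
    r, g, b = (float(c) for c in value)
    # Assumption: any component above 1 means the tuple is on the 0-255 scale.
    if max(r, g, b) > 1.0:
        r, g, b = r / 255, g / 255, b / 255
    return (r, g, b)
```

The tuple heuristic is ambiguous for values such as `(1, 1, 1)`, which this sketch reads as white on the 0-1 scale rather than near-black on the 0-255 scale.
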
### 4. Track-Specific Rendering
- [x] 4.1 Add track detection in generate_from_unified_document
- [x] 4.1.1 Check unified_doc.metadata.processing_track (object and dict support)
- [x] 4.1.2 Route to _generate_direct_track_pdf or _generate_ocr_track_pdf
- [x] 4.2 Implement _generate_direct_track_pdf
- [x] 4.2.1 Process each page directly from UnifiedDocument (no legacy conversion)
- [x] 4.2.2 Apply StyleInfo to text elements (_draw_text_element_direct)
- [x] 4.2.3 Use precise positioning from element.bbox
- [x] 4.2.4 Preserve line breaks (split on \n, render multi-line)
- [x] 4.2.5 Implement _draw_text_element_direct with line break handling
- [x] 4.2.6 Implement _draw_table_element_direct for tables
- [x] 4.2.7 Implement _draw_image_element_direct for images
- [x] 4.3 Implement _generate_ocr_track_pdf
- [x] 4.3.1 Use legacy OCR data conversion (convert_unified_document_to_ocr_data)
- [x] 4.3.2 Route to existing _generate_pdf_from_data pipeline
- [x] 4.3.3 Maintain backward compatibility with OCR track behavior
- [ ] 4.4 Test track-specific rendering
- [ ] 4.4.1 Compare Direct track with original
- [ ] 4.4.2 Verify OCR track maintains quality
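
Track detection with "object and dict support" (4.1.1) and routing (4.1.2) might look roughly like the sketch below. The track values `"direct"` and `"ocr"`, the choice to default to the OCR path, and the `renderers` table are all assumptions made for illustration; the real service routes to `_generate_direct_track_pdf` or `_generate_ocr_track_pdf`.

```python
from types import SimpleNamespace
from typing import Any, Callable, Dict


def _read(container: Any, name: str) -> Any:
    """Read a field from either a dict or an object; None if absent."""
    if isinstance(container, dict):
        return container.get(name)
    return getattr(container, name, None)


def get_processing_track(unified_doc: Any) -> str:
    """Return 'direct' or 'ocr' (assumed values) for a document."""
    metadata = _read(unified_doc, "metadata")
    track = _read(metadata, "processing_track")
    # Enum members expose .value; plain strings pass through unchanged.
    track = getattr(track, "value", track)
    # Assumption: anything that is not explicitly 'direct' uses the OCR path.
    return "direct" if str(track).lower() == "direct" else "ocr"


def generate(unified_doc: Any, renderers: Dict[str, Callable[[Any], Any]]) -> Any:
    """Dispatch to the renderer registered for the detected track."""
    return renderers[get_processing_track(unified_doc)](unified_doc)


# Usage with an object-style document.
doc = SimpleNamespace(metadata=SimpleNamespace(processing_track="direct"))
renderers = {"direct": lambda d: "direct-pdf", "ocr": lambda d: "ocr-pdf"}
result = generate(doc, renderers)
```
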
## Phase 3: Advanced Layout (P2 - Week 2)
### 5. Enhanced Text Rendering
- [x] 5.1 Implement line-by-line rendering (both tracks)
- [x] 5.1.1 Split text content by newlines (text.split('\n'))
- [x] 5.1.2 Calculate line height from font size (font_size * 1.2)
- [x] 5.1.3 Render each line with proper spacing (line_y = pdf_y - i * line_height)
- [x] 5.1.4 Direct track: _draw_text_element_direct (lines 1549-1693)
- [x] 5.1.5 OCR track: draw_text_region (lines 1113-1270, simplified)
- [x] 5.2 Add paragraph handling (Direct track only)
- [x] 5.2.1 Detect paragraph boundaries (via element.type PARAGRAPH)
- [x] 5.2.2 Apply spacing_before from metadata (line 1576, adjusts Y position)
- [x] 5.2.3 Handle indentation (indent/first_line_indent from metadata, lines 1564-1565)
- [x] 5.2.4 Record spacing_after for analysis (lines 1680-1689)
- [x] 5.2.5 Note: spacing_after is implicit in bbox-based layout (bbox_bottom_margin)
- [x] 5.2.6 OCR track: no paragraph handling (simple left-aligned rendering)
- [x] 5.3 Implement text alignment (Direct track only)
- [x] 5.3.1 Support left/right/center/justify (from StyleInfo.alignment)
- [x] 5.3.2 Calculate positioning based on alignment (line_x calculation)
- [x] 5.3.3 Apply to each text block (per-line alignment in _draw_text_element_direct)
- [x] 5.3.4 Justify alignment with word spacing distribution
- [x] 5.3.5 OCR track: left-aligned only (no StyleInfo available)
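
The line placement and alignment arithmetic from 5.1 and 5.3 can be sketched without a PDF library. `estimate_width` is a deliberately crude, hypothetical stand-in for real font metrics (in the service, a ReportLab string-width call would presumably take its place), and the function names are illustrative only.

```python
from typing import Callable, List, Tuple

Measure = Callable[[str, float], float]


def estimate_width(text: str, font_size: float) -> float:
    """Crude stand-in for font metrics: half an em per character."""
    return 0.5 * font_size * len(text)


def layout_lines(text: str, x: float, width: float, pdf_y: float,
                 font_size: float, alignment: str = "left",
                 measure: Measure = estimate_width) -> List[Tuple[float, float, str]]:
    """Return (line_x, line_y, line) for each newline-separated line."""
    line_height = font_size * 1.2  # task 5.1.2
    placed = []
    for i, line in enumerate(text.split("\n")):
        line_y = pdf_y - i * line_height  # task 5.1.3; PDF y grows upward
        slack = width - measure(line, font_size)
        if alignment == "center":
            line_x = x + slack / 2
        elif alignment == "right":
            line_x = x + slack
        else:  # 'left' and 'justify' both start at the left edge
            line_x = x
        placed.append((line_x, line_y, line))
    return placed


def justify_word_spacing(line: str, width: float, font_size: float,
                         measure: Measure = estimate_width) -> float:
    """Extra space to add to each gap so the line fills `width`."""
    words = line.split()
    if len(words) < 2:
        return 0.0
    slack = width - measure(" ".join(words), font_size)
    return max(slack, 0.0) / (len(words) - 1)
```

Justified text typically leaves the last line of a paragraph left-aligned; whether the service does this is not stated in the task list.
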
### 6. List Formatting (Direct track only)
- [x] 6.1 Detect list elements from Direct track
- [x] 6.1.1 Identify LIST_ITEM elements (separate from text_elements, lines 636-637)
- [x] 6.1.2 Fallback detection via metadata and text patterns (_is_list_item_fallback, lines 1528-1567)
  - [x] Check metadata for list_level, parent_item, children fields
  - [x] Pattern matching for ordered lists (`^\d+[\.\)]`) and unordered lists (`^[•·▪▫◦‣⁃\-\*]`)
  - [x] Auto-mark as LIST_ITEM if detected (lines 638-642)
- [x] 6.1.3 Group list items by proximity and level (_draw_list_elements_direct, lines 1589-1610)
- [x] 6.1.4 Determine list type via regex on first item (ordered/unordered, lines 1628-1636)
- [x] 6.1.5 Extract indent level from metadata (list_level)
- [x] 6.2 Render lists with proper formatting
- [x] 6.2.1 Sequential numbering across list items (list_counter, lines 1639-1665)
- [x] 6.2.2 Add bullets/numbers as list markers (stored in _list_marker metadata, lines 1649-1653)
- [x] 6.2.3 Apply indentation (20pt per level, lines 1738-1742)
- [x] 6.2.4 Multi-line list item alignment (marker_width calculation, lines 1755-1772)
  - [x] Calculate marker width before rendering (line 1758)
  - [x] Add marker_width to subsequent line indentation (lines 1770-1772)
- [x] 6.2.5 Remove original markers from text content (lines 1716-1723)
- [x] 6.2.6 Dedicated list item spacing (lines 1658-1683)
  - [x] Default 3pt spacing_after for list items (except last item)
  - [x] Calculate actual gap between adjacent items (line 1676)
  - [x] Apply cumulative Y offset to push items down if gap < desired (lines 1678-1683)
  - [x] Pass y_offset to _draw_text_element_direct (lines 1668, 1690, 1716)
- [x] 6.2.7 Maintain list grouping via proximity (max_gap=30pt, lines 1597-1607)
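
The detection, grouping, and spacing steps in section 6 can be sketched as below. The two regular expressions come from task 6.1.2 (with optional trailing whitespace added so the marker can be stripped cleanly). Everything else is an assumption: the function names, the `(top, bottom)` box tuples, and the convention that y grows downward with items sorted top to bottom. Grouping by list level is omitted for brevity.

```python
import re
from typing import List, Optional, Sequence, Tuple

ORDERED = re.compile(r"^\d+[\.\)]\s*")
UNORDERED = re.compile(r"^[•·▪▫◦‣⁃\-\*]\s*")

Box = Tuple[float, float]  # (top, bottom), y growing downward (assumed)


def detect_list_type(text: str) -> Optional[str]:
    """Classify a line as 'ordered', 'unordered', or not a list item."""
    stripped = text.lstrip()
    if ORDERED.match(stripped):
        return "ordered"
    if UNORDERED.match(stripped):
        return "unordered"
    return None


def strip_marker(text: str) -> str:
    """Remove the original marker so a generated one can replace it."""
    stripped = text.lstrip()
    for pattern in (ORDERED, UNORDERED):
        stripped, count = pattern.subn("", stripped, count=1)
        if count:
            break
    return stripped


def group_by_proximity(boxes: Sequence[Box], max_gap: float = 30.0) -> List[List[int]]:
    """Group item indices; a vertical gap above max_gap starts a new list."""
    groups: List[List[int]] = []
    for index, (top, _bottom) in enumerate(boxes):
        if groups and top - boxes[groups[-1][-1]][1] <= max_gap:
            groups[-1].append(index)
        else:
            groups.append([index])
    return groups


def spacing_offsets(boxes: Sequence[Box], desired_gap: float = 3.0) -> List[float]:
    """Cumulative downward offset per item so every gap is >= desired_gap."""
    if not boxes:
        return []
    offsets, cumulative = [0.0], 0.0
    for (_, prev_bottom), (top, _) in zip(boxes, boxes[1:]):
        gap = top - prev_bottom
        if gap < desired_gap:
            cumulative += desired_gap - gap
        offsets.append(cumulative)
    return offsets
```

Pattern-based detection has false positives: a line such as "-5 degrees" matches the unordered pattern, which is presumably why the task list treats it as a fallback behind element type and metadata.
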
### 7. Span-Level Rendering (Advanced, Direct track only)
- [x] 7.1 Extract span information from Direct track
- [x] 7.1.1 Parse PyMuPDF span data in _process_text_block (direct_extraction_engine.py:418-453)
- [x] 7.1.2 Create span DocumentElements with per-span StyleInfo (lines 434-453)
- [x] 7.1.3 Store spans in element.children for inline styling (line 476)
- [x] 7.1.4 Extract span bbox, font, size, flags, color from PyMuPDF (lines 435-450)
- [x] 7.2 Render mixed-style lines
- [x] 7.2.1 Implement _draw_text_with_spans method (pdf_generator_service.py:1685-1734)
- [x] 7.2.2 Switch styles mid-line by iterating spans (lines 1709-1732)
- [x] 7.2.3 Apply span-specific style via _apply_text_style (lines 1715-1716)
- [x] 7.2.4 Track X position and calculate span widths (lines 1706, 1730-1732)
- [x] 7.2.5 Integrate span rendering in _draw_text_element_direct (lines 1822-1823, 1905-1914)
- [x] 7.2.6 Handle inline formatting with per-span fonts, sizes, colors, bold/italic
- [ ] 7.3 Future enhancements
- [ ] 7.3.1 Multi-line span support with line breaking logic
- [ ] 7.3.2 Preserve exact span positioning from PyMuPDF bbox
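
Span extraction and single-line span layout (7.1 and 7.2) can be sketched as follows. The input dicts mirror the span entries PyMuPDF returns from `page.get_text("dict")`, where `flags` is a bit field (bit 1 italic, bit 4 bold) and `color` is a packed sRGB integer. `SpanStyle`, `style_from_span`, `layout_spans`, and the width callback are hypothetical names; the real code stores spans as `DocumentElement` children with `StyleInfo`.

```python
from dataclasses import dataclass
from typing import Any, Callable, Dict, List, Sequence, Tuple

# PyMuPDF span flag bits.
FLAG_ITALIC = 1 << 1
FLAG_BOLD = 1 << 4


@dataclass
class SpanStyle:
    font: str
    size: float
    bold: bool
    italic: bool
    color: Tuple[float, float, float]  # 0-1 RGB


def style_from_span(span: Dict[str, Any]) -> SpanStyle:
    """Derive per-span style from a PyMuPDF-style span dict."""
    flags = span["flags"]
    rgb = span["color"]  # packed as 0xRRGGBB
    color = (((rgb >> 16) & 0xFF) / 255, ((rgb >> 8) & 0xFF) / 255, (rgb & 0xFF) / 255)
    return SpanStyle(
        font=span["font"],
        size=float(span["size"]),
        bold=bool(flags & FLAG_BOLD),
        italic=bool(flags & FLAG_ITALIC),
        color=color,
    )


def layout_spans(spans: Sequence[Dict[str, Any]], start_x: float,
                 measure: Callable[[str, SpanStyle], float],
                 ) -> Tuple[List[Tuple[float, str, SpanStyle]], float]:
    """Place spans left to right on one line; return placements and total width."""
    x = start_x
    placed = []
    for span in spans:
        style = style_from_span(span)
        placed.append((x, span["text"], style))
        x += measure(span["text"], style)  # advance by this span's width
    return placed, x - start_x
```

Like the current implementation described above, this places every span on a single line; wrapping (7.3.1) would need a line-breaking step before the X positions are assigned.
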
## Phase 4: Testing and Optimization (P2 - Week 3)
### 8. Comprehensive Testing
- [ ] 8.1 Create test suite for layout preservation
- [ ] 8.1.1 Unit tests for each component
- [ ] 8.1.2 Integration tests for full pipeline
- [ ] 8.1.3 Visual regression tests
- [ ] 8.2 Test with various document types
- [ ] 8.2.1 Scientific papers (complex layout)
- [ ] 8.2.2 Business documents (tables/charts)
- [ ] 8.2.3 Books (chapters/paragraphs)
- [ ] 8.2.4 Forms (precise positioning)
- [ ] 8.3 Performance testing
- [ ] 8.3.1 Measure generation time
- [ ] 8.3.2 Profile memory usage
- [ ] 8.3.3 Identify bottlenecks
### 9. Performance Optimization
- [ ] 9.1 Implement caching
- [ ] 9.1.1 Cache font metrics
- [ ] 9.1.2 Cache parsed styles
- [ ] 9.1.3 Reuse computed layouts
- [ ] 9.2 Optimize image handling
- [ ] 9.2.1 Lazy load images
- [ ] 9.2.2 Compress when appropriate
- [ ] 9.2.3 Stream large images
- [ ] 9.3 Batch operations
- [ ] 9.3.1 Group similar rendering ops
- [ ] 9.3.2 Minimize context switches
- [ ] 9.3.3 Use efficient data structures
### 10. Documentation and Deployment
- [ ] 10.1 Update API documentation
- [ ] 10.1.1 Document new rendering capabilities
- [ ] 10.1.2 Add examples of improved output
- [ ] 10.1.3 Note performance characteristics
- [ ] 10.2 Create migration guide
- [ ] 10.2.1 Explain improvements
- [ ] 10.2.2 Note any breaking changes
- [ ] 10.2.3 Provide rollback instructions
- [ ] 10.3 Deployment preparation
- [ ] 10.3.1 Feature flag setup
- [ ] 10.3.2 Monitoring metrics
- [ ] 10.3.3 Rollback plan
## Success Criteria
### Must Have (Phase 1)
- [x] Images appear in generated PDFs
- [x] Tables render with correct layout
- [x] No regression in existing functionality
### Should Have (Phase 2)
- [ ] Text styling preserved in Direct track
- [ ] Font sizes and colors applied
- [ ] Line breaks maintained
### Nice to Have (Phase 3-4)
- [ ] Paragraph formatting
- [ ] List rendering
- [ ] Span-level styling
- [ ] <10% performance overhead
## Timeline
- **Week 0**: Phase 1 - Critical fixes (images, tables)
- **Week 1**: Phase 2 - Basic style preservation
- **Week 2**: Phase 3 - Advanced layout features
- **Week 3**: Phase 4 - Testing and optimization
- **Week 4**: Review, documentation, and deployment