OCR/openspec/changes/pdf-layout-restoration/tasks.md
egg ad879d48e5 feat: implement Phase 3 list formatting for Direct track
Add comprehensive list rendering with automatic detection and formatting:

**Task 6.1: List Element Detection**
- Detect LIST_ITEM elements by type (element.type == ElementType.LIST_ITEM)
- Extract list_level from element metadata (lines 1566-1567)
- Determine list type via regex pattern matching:
  - Ordered lists: ^\d+[\.\)]\s (e.g., "1. ", "2) ")
  - Unordered lists: ^[•·▪▫◦‣⁃]\s (various bullet symbols)
- Parse and extract list markers from text content (lines 1571-1588)

**Task 6.2: List Rendering**
- Add list markers to first line of each item:
  - Ordered: Preserve original numbering (e.g., "1. ")
  - Unordered: Standardize to bullet "• "
- Remove original markers from text content
- Apply list indentation: 20pt per nesting level (lines 1594-1598)
- Combine list indent with existing paragraph indent
- List spacing: Inherited from bbox-based layout (spacing_before/after)

**Implementation Details**
- Lines 1565-1598: List detection and indentation logic
- Lines 1629-1632: Prepend list marker to first line (rendered_line)
- Lines 1635-1676: Update all text width calculations to use rendered_line
- Lines 1688-1692: Enhanced logging with list type and level

**Technical Notes**
- Direct track only (OCR track has no list metadata)
- Integrates with existing alignment and indentation system
- Preserves line breaks and multi-line list items
- Works with all text alignment modes (left/center/right/justify)

**Modified Files**
- backend/app/services/pdf_generator_service.py
  - Added import re for regex pattern matching
  - Lines 1565-1598: List detection and indentation
  - Lines 1629-1676: List marker rendering
  - Lines 1688-1692: Enhanced debug logging
- openspec/changes/pdf-layout-restoration/tasks.md
  - Marked Task 6.1 (all subtasks) as completed
  - Marked Task 6.2 (all subtasks) as completed
  - Added implementation line references

🤖 Generated with [Claude Code](https://claude.com/claude-code)

Co-Authored-By: Claude <noreply@anthropic.com>
2025-11-24 09:54:15 +08:00

Implementation Tasks: PDF Layout Restoration

Phase 1: Critical Fixes (P0 - Immediate)

1. Fix Image Handling

  • 1.1 Implement _save_image() in pp_structure_enhanced.py
    • 1.1.1 Create imgs subdirectory in result_dir
    • 1.1.2 Handle both file path and numpy array inputs
    • 1.1.3 Save with element_id as filename
    • 1.1.4 Return relative path for reference
    • 1.1.5 Add error handling and logging
  • 1.2 Fix path resolution in pdf_generator_service.py
    • 1.2.1 Create _get_image_path() helper with fallback logic
    • 1.2.2 Check saved_path, path, image_path keys
    • 1.2.3 Check metadata for path
    • 1.2.4 Update convert_unified_document_to_ocr_data to use helper
  • 1.3 Test image rendering
    • 1.3.1 Test with OCR track document
    • 1.3.2 Test with Direct track document
    • 1.3.3 Verify images appear in PDF output
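
The fallback lookup described in 1.2 might be sketched as follows. This is a minimal illustration, not the code in pdf_generator_service.py: the function name `get_image_path` and the choice to check the same three keys inside `metadata` are assumptions, since the task list only says "Check metadata for path".

```python
from typing import Any, Mapping, Optional

# Keys checked in order, as listed in task 1.2.2.
PATH_KEYS = ("saved_path", "path", "image_path")


def get_image_path(element: Mapping[str, Any]) -> Optional[str]:
    """Return the first non-empty image path found on the element, else None."""
    # 1. Look at the element's own keys first.
    for key in PATH_KEYS:
        value = element.get(key)
        if value:
            return str(value)
    # 2. Fall back to the metadata dict (task 1.2.3).
    #    Assumption: metadata uses the same key names as the element.
    metadata = element.get("metadata") or {}
    if isinstance(metadata, Mapping):
        for key in PATH_KEYS:
            value = metadata.get(key)
            if value:
                return str(value)
    return None
```

Returning `None` instead of raising lets the caller skip an image that has no resolvable path and log the miss, which fits the error-handling intent of 1.1.5.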

2. Fix Table Rendering

  • 2.1 Remove dependency on fake image references
    • 2.1.1 Stop creating fake table_*.png references (changed to None)
    • 2.1.2 Remove image lookup fallback in draw_table_region
  • 2.2 Use direct bbox from table element
    • 2.2.1 Get bbox from table_element.get("bbox")
    • 2.2.2 Fallback to bbox_polygon if needed
    • 2.2.3 Implement _polygon_to_bbox converter (inline conversion implemented)
  • 2.3 Fix table HTML rendering
    • 2.3.1 Parse HTML content from table element
    • 2.3.2 Position table using normalized bbox
    • 2.3.3 Render with proper dimensions
  • 2.4 Test table rendering
    • 2.4.1 Test simple tables
    • 2.4.2 Test complex multi-column tables
    • 2.4.3 Test with both tracks
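
The bbox lookup in 2.2 can be illustrated with a small helper. This is a sketch under stated assumptions: `bbox` is taken to be `[x0, y0, x1, y1]` and `bbox_polygon` a sequence of `[x, y]` points; the names `polygon_to_bbox` and `table_bbox` are hypothetical (the task list notes the real conversion was done inline).

```python
from typing import Optional, Sequence, Tuple

BBox = Tuple[float, float, float, float]


def polygon_to_bbox(polygon: Sequence[Sequence[float]]) -> BBox:
    """Axis-aligned bounding box (x0, y0, x1, y1) enclosing all polygon points."""
    xs = [float(point[0]) for point in polygon]
    ys = [float(point[1]) for point in polygon]
    return (min(xs), min(ys), max(xs), max(ys))


def table_bbox(table_element: dict) -> Optional[BBox]:
    """Prefer the element's own bbox, then fall back to bbox_polygon."""
    bbox = table_element.get("bbox")
    if bbox and len(bbox) == 4:
        x0, y0, x1, y1 = (float(v) for v in bbox)
        return (x0, y0, x1, y1)
    polygon = table_element.get("bbox_polygon")
    if polygon:
        return polygon_to_bbox(polygon)
    # No usable geometry: the caller decides whether to skip the table.
    return None
```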

Phase 2: Basic Style Preservation (P1 - Week 1)

3. Implement Style Application System

  • 3.1 Create font mapping system
    • 3.1.1 Define FONT_MAPPING dictionary (20 common fonts mapped)
    • 3.1.2 Map common fonts to PDF standard fonts (Helvetica/Times/Courier)
    • 3.1.3 Add fallback to Helvetica for unknown fonts (with partial matching)
  • 3.2 Implement _apply_text_style() method
    • 3.2.1 Extract font family from StyleInfo (object and dict support)
    • 3.2.2 Handle bold/italic flags (compound variants like BoldOblique)
    • 3.2.3 Apply font size (with default fallback)
    • 3.2.4 Apply text color (using _parse_color)
    • 3.2.5 Handle errors gracefully (try-except with fallback to defaults)
  • 3.3 Create color parsing utilities
    • 3.3.1 Parse hex colors (#RRGGBB and #RGB)
    • 3.3.2 Parse RGB tuples (accepts both 0-255 and 0-1 ranges, normalized to 0-1)
    • 3.3.3 Convert to PDF color space (0-1 range for ReportLab)
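
A rough sketch of the font fallback (3.1) and color parsing (3.3) is shown below. The three-entry `FONT_MAPPING` is an illustrative subset, not the 20-font table from the service, and `resolve_font` and `parse_color` are hypothetical names; only the standard PDF font names themselves are fixed.

```python
from typing import Sequence, Tuple, Union

# Illustrative subset only; the real mapping is described as covering 20 fonts.
FONT_MAPPING = {
    "arial": "Helvetica",
    "times new roman": "Times-Roman",
    "courier new": "Courier",
}

# Standard PDF base fonts as (regular, bold, italic, bold+italic).
VARIANTS = {
    "Helvetica": ("Helvetica", "Helvetica-Bold", "Helvetica-Oblique", "Helvetica-BoldOblique"),
    "Times-Roman": ("Times-Roman", "Times-Bold", "Times-Italic", "Times-BoldItalic"),
    "Courier": ("Courier", "Courier-Bold", "Courier-Oblique", "Courier-BoldOblique"),
}

Color = Tuple[float, float, float]


def resolve_font(family: str, bold: bool = False, italic: bool = False) -> str:
    """Map a source font family to a standard PDF font, defaulting to Helvetica."""
    name = (family or "").strip().lower()
    base = FONT_MAPPING.get(name)
    if base is None:
        # Partial matching: "Arial Narrow" still resolves through "arial".
        base = next((v for k, v in FONT_MAPPING.items() if k in name), "Helvetica")
    index = (1 if bold else 0) + (2 if italic else 0)
    return VARIANTS[base][index]


def parse_color(value: Union[str, Sequence[float]]) -> Color:
    """Convert '#RRGGBB', '#RGB', or an RGB tuple to 0-1 floats."""
    if isinstance(value, str):
        digits = value.strip().lstrip("#")
        if len(digits) == 3:
            digits = "".join(ch * 2 for ch in digits)  # '#0f0' -> '00ff00'
        if len(digits) != 6:
            raise ValueError(f"unsupported color string: {value!r}")
        r, g, b = (int(digits[i:i + 2], 16) / 255 for i in (0, 2, 4))
        return (r, g, b)
    r, g, b = value[:3]
    # Heuristic (an assumption): any component above 1 means the 0-255 range.
    # A tuple such as (1, 1, 1) is therefore read as already normalized.
    if max(r, g, b) > 1:
        return (r / 255, g / 255, b / 255)
    return (float(r), float(g), float(b))
```

A caller would typically wrap both functions in a try-except and fall back to default font and black text, in line with 3.2.5.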

4. Track-Specific Rendering

  • 4.1 Add track detection in generate_from_unified_document
    • 4.1.1 Check unified_doc.metadata.processing_track (object and dict support)
    • 4.1.2 Route to _generate_direct_track_pdf or _generate_ocr_track_pdf
  • 4.2 Implement _generate_direct_track_pdf
    • 4.2.1 Process each page directly from UnifiedDocument (no legacy conversion)
    • 4.2.2 Apply StyleInfo to text elements (_draw_text_element_direct)
    • 4.2.3 Use precise positioning from element.bbox
    • 4.2.4 Preserve line breaks (split on \n, render multi-line)
    • 4.2.5 Implement _draw_text_element_direct with line break handling
    • 4.2.6 Implement _draw_table_element_direct for tables
    • 4.2.7 Implement _draw_image_element_direct for images
  • 4.3 Implement _generate_ocr_track_pdf
    • 4.3.1 Use legacy OCR data conversion (convert_unified_document_to_ocr_data)
    • 4.3.2 Route to existing _generate_pdf_from_data pipeline
    • 4.3.3 Maintain backward compatibility with OCR track behavior
  • 4.4 Test track-specific rendering
    • 4.4.1 Compare Direct track with original
    • 4.4.2 Verify OCR track maintains quality
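
The track detection in 4.1 might look roughly like the sketch below. The track value `"direct"`, the default to the OCR path when no track is recorded, and the helper names `get_processing_track` and `choose_renderer` are all assumptions made for illustration; the two renderer names come from 4.1.2.

```python
from types import SimpleNamespace
from typing import Any


def get_processing_track(unified_doc: Any) -> str:
    """Read metadata.processing_track from an object or a dict, lowercased."""
    if isinstance(unified_doc, dict):
        metadata = unified_doc.get("metadata")
    else:
        metadata = getattr(unified_doc, "metadata", None)
    if isinstance(metadata, dict):
        track = metadata.get("processing_track")
    else:
        track = getattr(metadata, "processing_track", None)
    # Enum-like values expose .value; plain strings pass through unchanged.
    track = getattr(track, "value", track)
    # Assumption: a missing track falls back to the legacy OCR pipeline.
    return str(track).lower() if track else "ocr"


def choose_renderer(unified_doc: Any) -> str:
    """Name of the method that should render this document."""
    if get_processing_track(unified_doc) == "direct":
        return "_generate_direct_track_pdf"
    return "_generate_ocr_track_pdf"


# Usage: both attribute-style and dict-style documents are accepted.
object_doc = SimpleNamespace(metadata=SimpleNamespace(processing_track="direct"))
dict_doc = {"metadata": {"processing_track": "ocr"}}
```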

Phase 3: Advanced Layout (P2 - Week 2)

5. Enhanced Text Rendering

  • 5.1 Implement line-by-line rendering (both tracks)
    • 5.1.1 Split text content by newlines (text.split('\n'))
    • 5.1.2 Calculate line height from font size (font_size * 1.2)
    • 5.1.3 Render each line with proper spacing (line_y = pdf_y - i * line_height)
    • 5.1.4 Direct track: _draw_text_element_direct (lines 1549-1693)
    • 5.1.5 OCR track: draw_text_region (lines 1113-1270, simplified)
  • 5.2 Add paragraph handling (Direct track only)
    • 5.2.1 Detect paragraph boundaries (via element.type PARAGRAPH)
    • 5.2.2 Apply spacing_before from metadata (line 1576, adjusts Y position)
    • 5.2.3 Handle indentation (indent/first_line_indent from metadata, lines 1564-1565)
    • 5.2.4 Record spacing_after for analysis (lines 1680-1689)
    • 5.2.5 Note: spacing_after is implicit in bbox-based layout (bbox_bottom_margin)
    • 5.2.6 OCR track: no paragraph handling (simple left-aligned rendering)
  • 5.3 Implement text alignment (Direct track only)
    • 5.3.1 Support left/right/center/justify (from StyleInfo.alignment)
    • 5.3.2 Calculate positioning based on alignment (line_x calculation)
    • 5.3.3 Apply to each text block (per-line alignment in _draw_text_element_direct)
    • 5.3.4 Justify alignment with word spacing distribution
    • 5.3.5 OCR track: left-aligned only (no StyleInfo available)
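
The line placement arithmetic from 5.1 and 5.3 can be sketched independently of the PDF library. This is an illustration only: `layout_lines` and `justify_word_spacing` are hypothetical names, and the default width estimate (half the font size per character) is a crude stand-in for a real font-metrics call.

```python
from typing import Callable, List, Optional, Tuple

LINE_HEIGHT_FACTOR = 1.2  # line_height = font_size * 1.2 (task 5.1.2)


def layout_lines(text: str, x: float, pdf_y: float, width: float,
                 font_size: float, alignment: str = "left",
                 measure: Optional[Callable[[str], float]] = None
                 ) -> List[Tuple[str, float, float]]:
    """Return (line, line_x, line_y) for each newline-separated line."""
    if measure is None:
        # Stand-in width estimate; a real renderer would query font metrics.
        def measure(s: str) -> float:
            return len(s) * font_size * 0.5
    line_height = font_size * LINE_HEIGHT_FACTOR
    placed = []
    for i, line in enumerate(text.split("\n")):
        line_width = measure(line)
        if alignment == "center":
            line_x = x + (width - line_width) / 2
        elif alignment == "right":
            line_x = x + width - line_width
        else:
            line_x = x  # left; justified lines also start at the left edge
        line_y = pdf_y - i * line_height  # PDF y axis grows upward
        placed.append((line, line_x, line_y))
    return placed


def justify_word_spacing(line: str, width: float,
                         measure: Callable[[str], float]) -> float:
    """Extra space per word gap so the line fills `width` (never negative)."""
    gaps = len(line.split()) - 1
    if gaps <= 0:
        return 0.0
    return max(0.0, (width - measure(line)) / gaps)
```

For justified text, the last line of a paragraph is usually left unstretched; whether the service does this is not stated in the task list.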

6. List Formatting (Direct track only)

  • 6.1 Detect list elements from Direct track
    • 6.1.1 Identify LIST_ITEM elements (element.type == ElementType.LIST_ITEM)
    • 6.1.2 Determine list type via regex (ordered: ^\d+[.)], unordered: ^[•·▪▫◦‣⁃])
    • 6.1.3 Extract indent level from metadata (list_level, lines 1566-1567)
  • 6.2 Render lists with proper formatting
    • 6.2.1 Add bullets/numbers as list markers (parsed at lines 1571-1588, prepended to the first line at lines 1629-1632)
    • 6.2.2 Apply indentation (20pt per level, lines 1594-1598)
    • 6.2.3 Maintain list spacing (inherent in bbox-based layout, spacing_before/after)
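
The detection and rendering steps in 6.1 and 6.2 can be sketched as below, using the two regex patterns from the commit message. The function names are hypothetical, and two behaviors are assumptions: indentation is computed as `list_level * 20pt` (so level 0 adds nothing), and an item with no recognizable marker is left unchanged.

```python
import re
from typing import List, Optional, Tuple

ORDERED_RE = re.compile(r"^(\d+[.)])\s")        # "1. ", "2) "
UNORDERED_RE = re.compile(r"^([•·▪▫◦‣⁃])\s")    # various bullet symbols
LIST_INDENT_PT = 20  # points per nesting level (task 6.2.2)


def split_list_marker(text: str) -> Tuple[Optional[str], str, str]:
    """Return (list_type, marker_to_render, text_without_marker)."""
    match = ORDERED_RE.match(text)
    if match:
        # Ordered lists keep their original numbering.
        return "ordered", match.group(1) + " ", text[match.end():]
    match = UNORDERED_RE.match(text)
    if match:
        # Unordered lists are standardized to a single bullet glyph.
        return "unordered", "• ", text[match.end():]
    return None, "", text


def render_list_item(text: str, list_level: int = 0,
                     base_indent: float = 0.0) -> Tuple[float, List[str]]:
    """Return (total_indent_pt, lines) with the marker on the first line only."""
    _, marker, body = split_list_marker(text)
    lines = body.split("\n")  # multi-line items keep their line breaks
    lines[0] = marker + lines[0]
    # List indent is combined with the existing paragraph indent.
    indent = base_indent + LIST_INDENT_PT * list_level
    return indent, lines
```

For example, `render_list_item("◦ a\nb", list_level=2, base_indent=5.0)` yields an indent of 45.0 and the lines `["• a", "b"]`.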

7. Span-Level Rendering (Advanced)

  • 7.1 Extract span information from Direct track
    • 7.1.1 Parse children elements for spans
    • 7.1.2 Get per-span styling
    • 7.1.3 Track position within line
  • 7.2 Render mixed-style lines
    • 7.2.1 Switch styles mid-line
    • 7.2.2 Handle inline formatting
    • 7.2.3 Preserve exact positioning

Phase 4: Testing and Optimization (P2 - Week 3)

8. Comprehensive Testing

  • 8.1 Create test suite for layout preservation
    • 8.1.1 Unit tests for each component
    • 8.1.2 Integration tests for full pipeline
    • 8.1.3 Visual regression tests
  • 8.2 Test with various document types
    • 8.2.1 Scientific papers (complex layout)
    • 8.2.2 Business documents (tables/charts)
    • 8.2.3 Books (chapters/paragraphs)
    • 8.2.4 Forms (precise positioning)
  • 8.3 Performance testing
    • 8.3.1 Measure generation time
    • 8.3.2 Profile memory usage
    • 8.3.3 Identify bottlenecks

9. Performance Optimization

  • 9.1 Implement caching
    • 9.1.1 Cache font metrics
    • 9.1.2 Cache parsed styles
    • 9.1.3 Reuse computed layouts
  • 9.2 Optimize image handling
    • 9.2.1 Lazy load images
    • 9.2.2 Compress when appropriate
    • 9.2.3 Stream large images
  • 9.3 Batch operations
    • 9.3.1 Group similar rendering ops
    • 9.3.2 Minimize context switches
    • 9.3.3 Use efficient data structures

10. Documentation and Deployment

  • 10.1 Update API documentation
    • 10.1.1 Document new rendering capabilities
    • 10.1.2 Add examples of improved output
    • 10.1.3 Note performance characteristics
  • 10.2 Create migration guide
    • 10.2.1 Explain improvements
    • 10.2.2 Note any breaking changes
    • 10.2.3 Provide rollback instructions
  • 10.3 Deployment preparation
    • 10.3.1 Feature flag setup
    • 10.3.2 Monitoring metrics
    • 10.3.3 Rollback plan

Success Criteria

Must Have (Phase 1)

  • Images appear in generated PDFs
  • Tables render with correct layout
  • No regression in existing functionality

Should Have (Phase 2)

  • Text styling preserved in Direct track
  • Font sizes and colors applied
  • Line breaks maintained

Nice to Have (Phase 3-4)

  • Paragraph formatting
  • List rendering
  • Span-level styling
  • <10% performance overhead

Timeline

  • Week 0: Phase 1 - Critical fixes (images, tables)
  • Week 1: Phase 2 - Basic style preservation
  • Week 2: Phase 3 - Advanced layout features
  • Week 3: Phase 4 - Testing and optimization
  • Week 4: Review, documentation, and deployment