Address Phase 3 accuracy issues identified in review:

**Issue 1: Invalid OCR Track Alignment Code**
- Removed alignment extraction from region style (lines 1179-1185)
- Removed alignment-based positioning logic (lines 1215-1240)
- Problem: OCR track has no StyleInfo (extracted from images without style data)
- Result: Alignment code was non-functional, always defaulted to left
- Solution: Simplified to explicit left-aligned rendering for OCR track

**Issue 2: Misleading Task Completion Markers**
- Updated 5.1: Clarified both tracks support line-by-line rendering
  - Direct: _draw_text_element_direct (lines 1549-1693)
  - OCR: draw_text_region (lines 1113-1270, simplified)
- Updated 5.2: Marked as "Direct track only"
  - spacing_before: Applied (adjusts Y position)
  - spacing_after: Implicit in bbox-based layout (recorded for analysis)
  - indent/first_line_indent: Direct track only
  - OCR: No paragraph handling
- Updated 5.3: Marked as "Direct track only"
  - Direct: Supports left/right/center/justify alignment
  - OCR: Left-aligned only (no StyleInfo available)

**Technical Clarifications**
- spacing_after cannot be "applied" in bbox-based layout
  - It is already reflected in element positions (bbox spacing)
  - bbox_bottom_margin shows the implicit spacing_after value
- OCR track uses simplified rendering (design decision per design.md)

**Modified Files**
- backend/app/services/pdf_generator_service.py
  - Removed lines 1179-1185: Invalid alignment extraction
  - Removed lines 1215-1240: Invalid alignment logic
  - Added comments clarifying OCR track limitations
- openspec/changes/pdf-layout-restoration/tasks.md
  - Added "(Direct track only)" markers to 5.2 and 5.3
  - Changed 5.3.5 from "Add OCR track alignment support" to "OCR track: left-aligned only"
  - Added 5.2.6 to note OCR has no paragraph handling

🤖 Generated with [Claude Code](https://claude.com/claude-code)

Co-Authored-By: Claude <noreply@anthropic.com>
Implementation Tasks: PDF Layout Restoration
Phase 1: Critical Fixes (P0 - Immediate)
1. Fix Image Handling
- 1.1 Implement _save_image() in pp_structure_enhanced.py
- 1.1.1 Create imgs subdirectory in result_dir
- 1.1.2 Handle both file path and numpy array inputs
- 1.1.3 Save with element_id as filename
- 1.1.4 Return relative path for reference
- 1.1.5 Add error handling and logging
- 1.2 Fix path resolution in pdf_generator_service.py
- 1.2.1 Create _get_image_path() helper with fallback logic
- 1.2.2 Check saved_path, path, image_path keys
- 1.2.3 Check metadata for path
- 1.2.4 Update convert_unified_document_to_ocr_data to use helper
- 1.3 Test image rendering
- 1.3.1 Test with OCR track document
- 1.3.2 Test with Direct track document
- 1.3.3 Verify images appear in PDF output
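The path fallback described in 1.2 could look roughly like the sketch below. It is not the code in pdf_generator_service.py: `resolve_image_path` is a hypothetical stand-in for `_get_image_path()`, the element is assumed to be a plain dict, and the metadata dict is assumed to reuse the same three key names.

```python
from typing import Any, Mapping, Optional

# Keys checked in the order listed in task 1.2.2.
PATH_KEYS = ("saved_path", "path", "image_path")


def resolve_image_path(element: Mapping[str, Any]) -> Optional[str]:
    """Return the first non-empty image path found on the element, else None."""
    # 1. Top-level keys take priority.
    for key in PATH_KEYS:
        value = element.get(key)
        if value:
            return str(value)
    # 2. Fall back to metadata (task 1.2.3). Assumption: same key names there.
    metadata = element.get("metadata") or {}
    for key in PATH_KEYS:
        value = metadata.get(key)
        if value:
            return str(value)
    # 3. Nothing usable: the caller decides whether to skip the image.
    return None
```

Returning `None` instead of raising keeps a missing image from aborting the whole page, which matches the "add error handling and logging" intent of 1.1.5.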
2. Fix Table Rendering
- 2.1 Remove dependency on fake image references
- 2.1.1 Stop creating fake table_*.png references (changed to None)
- 2.1.2 Remove image lookup fallback in draw_table_region
- 2.2 Use direct bbox from table element
- 2.2.1 Get bbox from table_element.get("bbox")
- 2.2.2 Fallback to bbox_polygon if needed
- 2.2.3 Implement _polygon_to_bbox converter (inline conversion implemented)
- 2.3 Fix table HTML rendering
- 2.3.1 Parse HTML content from table element
- 2.3.2 Position table using normalized bbox
- 2.3.3 Render with proper dimensions
- 2.4 Test table rendering
- 2.4.1 Test simple tables
- 2.4.2 Test complex multi-column tables
- 2.4.3 Test with both tracks
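As an illustration of 2.2, here is a minimal sketch of the bbox lookup with the polygon fallback. The function names are hypothetical, and two things are assumed rather than confirmed by this file: `bbox` is a four-value `[x0, y0, x1, y1]` sequence, and `bbox_polygon` is a sequence of `[x, y]` points.

```python
from typing import Any, Mapping, Optional, Sequence, Tuple

BBox = Tuple[float, float, float, float]


def polygon_to_bbox(polygon: Sequence[Sequence[float]]) -> BBox:
    """Collapse a polygon to its axis-aligned bounding box (x0, y0, x1, y1)."""
    xs = [float(point[0]) for point in polygon]
    ys = [float(point[1]) for point in polygon]
    return (min(xs), min(ys), max(xs), max(ys))


def get_table_bbox(table_element: Mapping[str, Any]) -> Optional[BBox]:
    """Prefer the element's own bbox; fall back to bbox_polygon; else None."""
    bbox = table_element.get("bbox")
    if bbox and len(bbox) == 4:
        x0, y0, x1, y1 = (float(v) for v in bbox)
        return (x0, y0, x1, y1)
    polygon = table_element.get("bbox_polygon")
    if polygon:
        return polygon_to_bbox(polygon)
    return None
```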
Phase 2: Basic Style Preservation (P1 - Week 1)
3. Implement Style Application System
- 3.1 Create font mapping system
- 3.1.1 Define FONT_MAPPING dictionary (20 common fonts mapped)
- 3.1.2 Map common fonts to PDF standard fonts (Helvetica/Times/Courier)
- 3.1.3 Add fallback to Helvetica for unknown fonts (with partial matching)
- 3.2 Implement _apply_text_style() method
- 3.2.1 Extract font family from StyleInfo (object and dict support)
- 3.2.2 Handle bold/italic flags (compound variants like BoldOblique)
- 3.2.3 Apply font size (with default fallback)
- 3.2.4 Apply text color (using _parse_color)
- 3.2.5 Handle errors gracefully (try-except with fallback to defaults)
- 3.3 Create color parsing utilities
- 3.3.1 Parse hex colors (#RRGGBB and #RGB)
- 3.3.2 Parse RGB tuples (0-255 and 0-1 normalization)
- 3.3.3 Convert to PDF color space (0-1 range for ReportLab)
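The font mapping (3.1) and color parsing (3.3) tasks can be sketched together as below. This is an illustration only: the mapping entries are a guessed subset and not the 20 fonts in the real FONT_MAPPING, and `map_font` and `parse_color` are hypothetical names (the source only names `_parse_color`). The variant names are the standard PDF base font names.

```python
from typing import Any, Tuple

# Illustrative subset only; the real table is said to map 20 common fonts.
FONT_MAPPING = {
    "arial": "Helvetica",
    "calibri": "Helvetica",
    "times new roman": "Times-Roman",
    "georgia": "Times-Roman",
    "courier new": "Courier",
    "consolas": "Courier",
}

# (bold, italic) -> standard PDF font name for each base family.
VARIANTS = {
    "Helvetica": {(False, False): "Helvetica", (True, False): "Helvetica-Bold",
                  (False, True): "Helvetica-Oblique", (True, True): "Helvetica-BoldOblique"},
    "Times-Roman": {(False, False): "Times-Roman", (True, False): "Times-Bold",
                    (False, True): "Times-Italic", (True, True): "Times-BoldItalic"},
    "Courier": {(False, False): "Courier", (True, False): "Courier-Bold",
                (False, True): "Courier-Oblique", (True, True): "Courier-BoldOblique"},
}

RGB = Tuple[float, float, float]


def map_font(family: str, bold: bool = False, italic: bool = False) -> str:
    """Map a document font family to a standard PDF font, defaulting to Helvetica."""
    name = (family or "").strip().lower()
    base = FONT_MAPPING.get(name)
    if base is None:
        # Partial match: "arial narrow" contains "arial".
        base = next((t for key, t in FONT_MAPPING.items() if key in name), "Helvetica")
    return VARIANTS[base][(bool(bold), bool(italic))]


def parse_color(value: Any, default: RGB = (0.0, 0.0, 0.0)) -> RGB:
    """Parse #RRGGBB, #RGB, or an RGB sequence into 0-1 floats."""
    try:
        if isinstance(value, str):
            digits = value.strip().lstrip("#")
            if len(digits) == 3:  # expand #RGB to #RRGGBB
                digits = "".join(ch * 2 for ch in digits)
            if len(digits) != 6:
                return default
            r, g, b = (int(digits[i:i + 2], 16) / 255 for i in (0, 2, 4))
            return (r, g, b)
        r, g, b = (float(c) for c in value[:3])
        # Heuristic: any component above 1 means the tuple is on a 0-255 scale.
        if max(r, g, b) > 1.0:
            r, g, b = r / 255, g / 255, b / 255
        return (r, g, b)
    except (TypeError, ValueError):
        return default
```

One caveat of the scale heuristic: a tuple such as `(1, 1, 1)` is read as white on the 0-1 scale, not as near-black on the 0-255 scale.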
4. Track-Specific Rendering
- 4.1 Add track detection in generate_from_unified_document
- 4.1.1 Check unified_doc.metadata.processing_track (object and dict support)
- 4.1.2 Route to _generate_direct_track_pdf or _generate_ocr_track_pdf
- 4.2 Implement _generate_direct_track_pdf
- 4.2.1 Process each page directly from UnifiedDocument (no legacy conversion)
- 4.2.2 Apply StyleInfo to text elements (_draw_text_element_direct)
- 4.2.3 Use precise positioning from element.bbox
- 4.2.4 Preserve line breaks (split on \n, render multi-line)
- 4.2.5 Implement _draw_text_element_direct with line break handling
- 4.2.6 Implement _draw_table_element_direct for tables
- 4.2.7 Implement _draw_image_element_direct for images
- 4.3 Implement _generate_ocr_track_pdf
- 4.3.1 Use legacy OCR data conversion (convert_unified_document_to_ocr_data)
- 4.3.2 Route to existing _generate_pdf_from_data pipeline
- 4.3.3 Maintain backward compatibility with OCR track behavior
- 4.4 Test track-specific rendering
- 4.4.1 Compare Direct track with original
- 4.4.2 Verify OCR track maintains quality
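The routing in 4.1 might be sketched as follows. The helper names are hypothetical, and several details are assumptions: the track values are taken to be the strings "direct" and "ocr" (or an enum whose `.value` is one of them), and anything unrecognized is assumed to fall back to the OCR path for backward compatibility.

```python
from types import SimpleNamespace
from typing import Any


def get_processing_track(unified_doc: Any) -> str:
    """Read metadata.processing_track from an object or a dict."""
    if isinstance(unified_doc, dict):
        metadata = unified_doc.get("metadata")
    else:
        metadata = getattr(unified_doc, "metadata", None)

    if isinstance(metadata, dict):
        track = metadata.get("processing_track")
    else:
        track = getattr(metadata, "processing_track", None)

    # Enum-like values expose .value; plain strings pass through unchanged.
    track = getattr(track, "value", track)
    # Assumption: a missing track means the legacy OCR pipeline.
    return str(track).lower() if track else "ocr"


def choose_generator(unified_doc: Any) -> str:
    """Return the name of the generator method the document would be routed to."""
    if get_processing_track(unified_doc) == "direct":
        return "_generate_direct_track_pdf"
    return "_generate_ocr_track_pdf"


# Usage with an object-style and a dict-style document.
doc_obj = SimpleNamespace(metadata=SimpleNamespace(processing_track="direct"))
doc_dict = {"metadata": {"processing_track": "ocr"}}
print(choose_generator(doc_obj), choose_generator(doc_dict))
```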
Phase 3: Advanced Layout (P2 - Week 2)
5. Enhanced Text Rendering
- 5.1 Implement line-by-line rendering (both tracks)
- 5.1.1 Split text content by newlines (text.split('\n'))
- 5.1.2 Calculate line height from font size (font_size * 1.2)
- 5.1.3 Render each line with proper spacing (line_y = pdf_y - i * line_height)
- 5.1.4 Direct track: _draw_text_element_direct (lines 1549-1693)
- 5.1.5 OCR track: draw_text_region (lines 1113-1270, simplified)
- 5.2 Add paragraph handling (Direct track only)
- 5.2.1 Detect paragraph boundaries (via element.type PARAGRAPH)
- 5.2.2 Apply spacing_before from metadata (line 1576, adjusts Y position)
- 5.2.3 Handle indentation (indent/first_line_indent from metadata, lines 1564-1565)
- 5.2.4 Record spacing_after for analysis (lines 1680-1689)
- 5.2.5 Note: spacing_after is implicit in bbox-based layout (bbox_bottom_margin)
- 5.2.6 OCR track: no paragraph handling (simple left-aligned rendering)
- 5.3 Implement text alignment (Direct track only)
- 5.3.1 Support left/right/center/justify (from StyleInfo.alignment)
- 5.3.2 Calculate positioning based on alignment (line_x calculation)
- 5.3.3 Apply to each text block (per-line alignment in _draw_text_element_direct)
- 5.3.4 Justify alignment with word spacing distribution
- 5.3.5 OCR track: left-aligned only (no StyleInfo available)
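The line placement in 5.1 and the alignment arithmetic in 5.3 can be illustrated with the sketch below. It only computes coordinates and draws nothing. `layout_lines` and `justify_word_space` are hypothetical names, and the `measure` callback is a stand-in for a real text-width function (in the service this would presumably be ReportLab's string-width lookup).

```python
from typing import Callable, List, Optional, Tuple

Measure = Callable[[str], float]


def layout_lines(text: str, x: float, pdf_y: float, box_width: float,
                 font_size: float, alignment: str = "left",
                 measure: Optional[Measure] = None) -> List[Tuple[str, float, float]]:
    """Return (line, line_x, line_y) for each newline-separated line."""
    if measure is None:
        # Crude stand-in: half the font size per character. Not a real metric.
        measure = lambda s: len(s) * font_size * 0.5
    line_height = font_size * 1.2  # task 5.1.2
    placed = []
    for i, line in enumerate(text.split("\n")):
        line_y = pdf_y - i * line_height  # task 5.1.3; PDF y grows upward
        width = measure(line)
        if alignment == "center":
            line_x = x + (box_width - width) / 2
        elif alignment == "right":
            line_x = x + box_width - width
        else:  # "left", and "justify" lines start at the left edge too
            line_x = x
        placed.append((line, line_x, line_y))
    return placed


def justify_word_space(line: str, box_width: float, measure: Measure) -> float:
    """Extra space to add per word gap so the line fills box_width."""
    words = line.split()
    gaps = len(words) - 1
    if gaps <= 0:
        return 0.0  # a single word cannot be justified
    extra = box_width - measure(" ".join(words))
    return max(extra, 0.0) / gaps
```

For the OCR track, the same function with `alignment="left"` reproduces the simplified behaviour noted in 5.3.5.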
6. List Formatting
- 6.1 Detect list elements
- 6.1.1 Identify list items from metadata
- 6.1.2 Determine list type (ordered/unordered)
- 6.1.3 Extract indent level
- 6.2 Render lists with proper formatting
- 6.2.1 Add bullets/numbers
- 6.2.2 Apply indentation
- 6.2.3 Maintain list spacing
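One possible shape for the list formatting in 6.2 is sketched below. Everything here is an assumption made for illustration: the function name, the `(text, level)` item shape, the 18-point indent step, and the choice to treat ordered/unordered as a property of the whole list.

```python
from typing import Dict, List, Sequence, Tuple


def format_list_items(items: Sequence[Tuple[str, int]], ordered: bool = False,
                      indent_step: float = 18.0,
                      bullet: str = "\u2022") -> List[Tuple[float, str]]:
    """Return (x_offset, rendered_text) for each (text, indent_level) item."""
    counters: Dict[int, int] = {}
    rendered = []
    for text, level in items:
        if ordered:
            counters[level] = counters.get(level, 0) + 1
            # Leaving a nested level resets its numbering for the next sublist.
            for deeper in [k for k in counters if k > level]:
                del counters[deeper]
            marker = f"{counters[level]}."
        else:
            marker = bullet
        rendered.append((level * indent_step, f"{marker} {text}"))
    return rendered
```

The returned x offset would be added to the element's left edge before drawing, so indentation and markers stay independent of the text renderer.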
7. Span-Level Rendering (Advanced)
- 7.1 Extract span information from Direct track
- 7.1.1 Parse children elements for spans
- 7.1.2 Get per-span styling
- 7.1.3 Track position within line
- 7.2 Render mixed-style lines
- 7.2.1 Switch styles mid-line
- 7.2.2 Handle inline formatting
- 7.2.3 Preserve exact positioning
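Tracking position within a line (7.1.3) while switching styles (7.2.1) amounts to advancing an x cursor by each span's measured width. The sketch below assumes a hypothetical `Span` type and a `measure(text, font_name, font_size)` callback; if Direct track spans carry their own bbox, that bbox should take precedence over this running offset.

```python
from dataclasses import dataclass
from typing import Callable, List, Sequence, Tuple


@dataclass
class Span:
    """Hypothetical per-span style record; not the project's actual type."""
    text: str
    font_name: str = "Helvetica"
    font_size: float = 10.0


Measure = Callable[[str, str, float], float]


def place_spans(spans: Sequence[Span], start_x: float,
                measure: Measure) -> List[Tuple[float, Span]]:
    """Assign each span the x position where its text should start."""
    x = start_x
    placed = []
    for span in spans:
        placed.append((x, span))
        # Advance the cursor by this span's width in its own font.
        x += measure(span.text, span.font_name, span.font_size)
    return placed
```

A renderer would then set the font for each span and draw its text at the returned x on the shared baseline.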
Phase 4: Testing and Optimization (P2 - Week 3)
8. Comprehensive Testing
- 8.1 Create test suite for layout preservation
- 8.1.1 Unit tests for each component
- 8.1.2 Integration tests for full pipeline
- 8.1.3 Visual regression tests
- 8.2 Test with various document types
- 8.2.1 Scientific papers (complex layout)
- 8.2.2 Business documents (tables/charts)
- 8.2.3 Books (chapters/paragraphs)
- 8.2.4 Forms (precise positioning)
- 8.3 Performance testing
- 8.3.1 Measure generation time
- 8.3.2 Profile memory usage
- 8.3.3 Identify bottlenecks
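For 8.3.1 and 8.3.2, a small standard-library harness is often enough to get first numbers. The sketch below is one way to do it, with hypothetical helper names; `overhead_percent` relates a new timing to a baseline, which is how the "<10% performance overhead" criterion could be checked.

```python
import time
import tracemalloc
from typing import Any, Callable, Tuple


def profile_call(func: Callable[..., Any], *args: Any,
                 **kwargs: Any) -> Tuple[Any, float, int]:
    """Run func once; return (result, elapsed_seconds, peak_traced_bytes)."""
    tracemalloc.start()
    start = time.perf_counter()
    try:
        result = func(*args, **kwargs)
    finally:
        elapsed = time.perf_counter() - start
        _, peak = tracemalloc.get_traced_memory()
        tracemalloc.stop()
    return result, elapsed, peak


def overhead_percent(baseline_seconds: float, candidate_seconds: float) -> float:
    """Relative slowdown of candidate versus baseline, in percent."""
    return (candidate_seconds - baseline_seconds) / baseline_seconds * 100.0
```

Note that tracemalloc only sees allocations made through Python's allocator and slows execution noticeably, so timing and memory runs are better taken separately when precise numbers matter.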
9. Performance Optimization
- 9.1 Implement caching
- 9.1.1 Cache font metrics
- 9.1.2 Cache parsed styles
- 9.1.3 Reuse computed layouts
- 9.2 Optimize image handling
- 9.2.1 Lazy load images
- 9.2.2 Compress when appropriate
- 9.2.3 Stream large images
- 9.3 Batch operations
- 9.3.1 Group similar rendering ops
- 9.3.2 Minimize context switches
- 9.3.3 Use efficient data structures
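Caching font metrics (9.1.1) could be as simple as memoizing the width lookup, as in this sketch. The measuring function here is a placeholder and not a real font metric; in the service it would be replaced by the actual text-width call.

```python
from functools import lru_cache


def _measure_uncached(text: str, font_name: str, font_size: float) -> float:
    """Placeholder width function; stands in for a real font-metric lookup."""
    return len(text) * font_size * 0.5


@lru_cache(maxsize=4096)
def cached_width(text: str, font_name: str, font_size: float) -> float:
    """Memoized width lookup keyed on (text, font_name, font_size)."""
    return _measure_uncached(text, font_name, font_size)


# Repeated strings (headers, table labels) hit the cache on later lookups.
cached_width("Total", "Helvetica", 10.0)
cached_width("Total", "Helvetica", 10.0)
print(cached_width.cache_info())
```

All arguments must be hashable for `lru_cache` to work, which is why the key uses plain strings and numbers and not a style object.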
10. Documentation and Deployment
- 10.1 Update API documentation
- 10.1.1 Document new rendering capabilities
- 10.1.2 Add examples of improved output
- 10.1.3 Note performance characteristics
- 10.2 Create migration guide
- 10.2.1 Explain improvements
- 10.2.2 Note any breaking changes
- 10.2.3 Provide rollback instructions
- 10.3 Deployment preparation
- 10.3.1 Feature flag setup
- 10.3.2 Monitoring metrics
- 10.3.3 Rollback plan
Success Criteria
Must Have (Phase 1)
- Images appear in generated PDFs
- Tables render with correct layout
- No regression in existing functionality
Should Have (Phase 2)
- Text styling preserved in Direct track
- Font sizes and colors applied
- Line breaks maintained
Nice to Have (Phase 3-4)
- Paragraph formatting
- List rendering
- Span-level styling
- <10% performance overhead
Timeline
- Week 0: Phase 1 - Critical fixes (images, tables)
- Week 1: Phase 2 - Basic style preservation
- Week 2: Phase 3 - Advanced layout features
- Week 3: Phase 4 - Testing and optimization
- Week 4: Review, documentation, and deployment