OCR/openspec/changes/pdf-layout-restoration/proposal.md
egg cf894b076e feat: create PDF layout restoration proposal
Create new OpenSpec change proposal to fix critical PDF generation issues:

**Problems Identified**:
1. Images never saved (empty _save_image implementation)
2. Image path mismatch (saved_path vs path lookup)
3. Tables never render (fake image dependency)
4. Text style completely lost (no font/color application)

**Solution Design**:
- Phase 1: Critical fixes (images, tables)
- Phase 2: Basic style preservation
- Phase 3: Advanced layout features
- Phase 4: Testing and optimization

**Key Improvements**:
- Implement actual image saving in pp_structure_enhanced
- Fix path resolution with fallback logic
- Use table's own bbox instead of fake images
- Track-specific rendering (rich for Direct, simple for OCR)
- Preserve StyleInfo (fonts, sizes, colors)

**Implementation Tasks**:
- 10 major task groups
- 4-week timeline
- No breaking changes
- Performance target: <10% overhead

Proposal validated: openspec validate pdf-layout-restoration ✓

🤖 Generated with [Claude Code](https://claude.com/claude-code)

Co-Authored-By: Claude <noreply@anthropic.com>
2025-11-20 19:00:49 +08:00


PDF Layout Restoration and Preservation

Problem

PDF generation from both the OCR and Direct extraction tracks currently produces documents that are severely degraded compared to the original. There are four critical issues:

1. Images Never Appear

  • OCR track: pp_structure_enhanced._save_image() is an empty implementation (lines 262, 414), so detected images are never saved
  • Direct track: Image paths are saved as content["saved_path"], but the converter looks them up with content.get("path"), so the lookup never finds them
  • Result: All PDFs are text-only, with no images whatsoever
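
One way to tolerate the key mismatch described above is a lookup that tries both keys and checks that the file exists. The following is a minimal sketch, not the project's actual converter code: the function name `resolve_image_path` and the `base_dir` parameter are hypothetical, and only the keys "saved_path" and "path" come from this proposal.

```python
from pathlib import Path
from typing import Optional


def resolve_image_path(content: dict, base_dir: Path) -> Optional[Path]:
    """Return the first existing image file referenced by `content`, or None.

    Hypothetical helper: the key order (prefer "saved_path", then fall back
    to "path") is an assumption, not confirmed behaviour of the converter.
    """
    for key in ("saved_path", "path"):
        value = content.get(key)
        if not value:
            continue
        candidate = Path(value)
        # Relative paths are assumed to be relative to the result directory.
        if not candidate.is_absolute():
            candidate = base_dir / candidate
        if candidate.is_file():
            return candidate
    return None
```

Returning None instead of raising lets the caller skip a missing image and continue rendering the rest of the page.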

2. Tables Never Render

  • Table elements use fake table_*.png references that don't exist as actual files
  • draw_table_region() tries to find these non-existent images to get bbox coordinates
  • When images aren't found, table rendering is skipped entirely
  • Result: No tables appear in generated PDFs
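
Positioning a table from its own bbox, rather than from a placeholder image, mostly comes down to a coordinate conversion. The sketch below assumes the bbox is stored as (x0, y0, x1, y1) with a top-left origin, which this proposal does not confirm; `table_rect_from_bbox` and the `scale` parameter are hypothetical names. PDF user space places the origin at the bottom-left, hence the vertical flip.

```python
from typing import Sequence, Tuple


def table_rect_from_bbox(
    bbox: Sequence[float], page_height: float, scale: float = 1.0
) -> Tuple[float, float, float, float]:
    """Convert a top-left-origin bbox into a PDF rectangle.

    Returns (x, y, width, height) where (x, y) is the lower-left corner
    in PDF coordinates. The input layout (x0, y0, x1, y1) is an assumption.
    """
    x0, y0, x1, y1 = bbox
    # Normalise in case the corners arrive in swapped order.
    left, right = sorted((x0, x1))
    top, bottom = sorted((y0, y1))
    width = (right - left) * scale
    height = (bottom - top) * scale
    x = left * scale
    # Flip the vertical axis: the bbox bottom edge becomes the rect's y.
    y = page_height - bottom * scale
    return x, y, width, height
```

With this in place, a function such as draw_table_region() would no longer need any image file to decide where the table goes.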

3. Text Layout is Broken

  • Each text block is drawn with a single drawString() call, so the entire block is rendered as one line
  • No line breaks, paragraph alignment, or text styling are preserved
  • Direct track extracts StyleInfo but it's completely ignored during PDF generation
  • Result: Text appears as unformatted blocks at wrong positions
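
The single-call problem can be illustrated by splitting a block into lines and drawing each on its own baseline. This is a rough sketch under stated assumptions: `layout_lines` and `draw_text_block` are hypothetical names, wrapping is by character count rather than measured string width, and `canvas` is any object exposing a drawString(x, y, text) method.

```python
import textwrap
from typing import List, Tuple


def layout_lines(
    text: str,
    x: float,
    top_y: float,
    font_size: float,
    max_chars: int,
    leading_ratio: float = 1.2,
) -> List[Tuple[float, float, str]]:
    """Return (x, y, line) for every wrapped line, moving down the page."""
    leading = font_size * leading_ratio
    placed = []
    # The first baseline sits roughly one font size below the block's top edge.
    y = top_y - font_size
    for paragraph in text.split("\n"):
        # textwrap.wrap returns [] for an empty string; keep the blank line.
        lines = textwrap.wrap(paragraph, width=max_chars) or [""]
        for line in lines:
            placed.append((x, y, line))
            y -= leading
    return placed


def draw_text_block(canvas, text, x, top_y, font_size=10.0, max_chars=80):
    """Draw one line per drawString() call instead of one call per block."""
    for line_x, line_y, line in layout_lines(text, x, top_y, font_size, max_chars):
        canvas.drawString(line_x, line_y, line)
```

Character-count wrapping is only an approximation. A width-aware version would measure each candidate line in the actual font before breaking it.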

4. Information Loss in Conversion

  • Direct track data is converted to the legacy OCR format, which drops its rich metadata
  • Span-level information (fonts, colors, styles) is discarded
  • Precise positioning information is reduced to simple bboxes

Solution

Implement proper layout preservation for PDF generation:

  1. Fix image handling: Actually save images and use correct path references
  2. Fix table rendering: Use element's own bbox instead of looking for fake images
  3. Preserve text formatting: Use StyleInfo and span-level data for accurate rendering
  4. Track-specific rendering: Different approaches for OCR vs Direct tracks
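
Points 3 and 4 can be combined in a small style-selection step: the Direct track keeps whatever style it extracted, while the OCR track always falls back to defaults. This is a sketch only. The fields on `StyleInfo` shown here, the track identifiers "direct" and "ocr", and the helper `effective_style` are all assumptions made for illustration; the real StyleInfo class may look different.

```python
from dataclasses import dataclass
from typing import Optional, Tuple


@dataclass(frozen=True)
class StyleInfo:
    """Hypothetical stand-in for the project's StyleInfo; fields are assumed."""
    font_name: Optional[str] = None
    font_size: Optional[float] = None
    color: Optional[Tuple[float, float, float]] = None  # RGB in 0..1


# Fallback used for the OCR track and for any missing Direct-track field.
DEFAULT_STYLE = StyleInfo(font_name="Helvetica", font_size=10.0, color=(0.0, 0.0, 0.0))


def effective_style(track: str, style: Optional[StyleInfo]) -> StyleInfo:
    """Pick the style to render with, depending on the extraction track."""
    # Simple rendering: OCR output carries no reliable style information.
    if track != "direct" or style is None:
        return DEFAULT_STYLE
    # Rich rendering: keep extracted values, fill gaps from the defaults.
    return StyleInfo(
        font_name=style.font_name or DEFAULT_STYLE.font_name,
        font_size=style.font_size or DEFAULT_STYLE.font_size,
        color=style.color if style.color is not None else DEFAULT_STYLE.color,
    )
```

Keeping the decision in one function means the drawing code does not need to know which track produced an element.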

Impact

  • User Experience: Output PDFs will actually be usable and readable
  • Functionality: Tables and images will finally appear in outputs
  • Quality: Direct track PDFs will closely match original formatting
  • Performance: No negative impact expected; generation may be faster because unnecessary conversions are avoided

Tasks

  • Fix image saving and path references (Critical)
  • Fix table rendering using actual bbox data (Critical)
  • Implement track-specific PDF generation (Important)
  • Preserve text styling and formatting (Important)
  • Add span-level text rendering (Nice-to-have)

Deltas

result-export

+ image_handling: Proper image saving and path resolution
+ table_rendering: Direct bbox usage for table positioning
+ text_formatting: StyleInfo preservation and application
+ track_specific_rendering: OCR vs Direct track differentiation