egg/OCR

egg cf894b076e feat: create PDF layout restoration proposal

Create new OpenSpec change proposal to fix critical PDF generation issues:

**Problems Identified**:
1. Images never saved (empty _save_image implementation)
2. Image path mismatch (saved_path vs path lookup)
3. Tables never render (fake image dependency)
4. Text style completely lost (no font/color application)

**Solution Design**:
- Phase 1: Critical fixes (images, tables)
- Phase 2: Basic style preservation
- Phase 3: Advanced layout features
- Phase 4: Testing and optimization

**Key Improvements**:
- Implement actual image saving in pp_structure_enhanced
- Fix path resolution with fallback logic
- Use table's own bbox instead of fake images
- Track-specific rendering (rich for Direct, simple for OCR)
- Preserve StyleInfo (fonts, sizes, colors)

**Implementation Tasks**:
- 10 major task groups
- 4-week timeline
- No breaking changes
- Performance target: <10% overhead

Proposal validated: openspec validate pdf-layout-restoration ✓

🤖 Generated with [Claude Code](https://claude.com/claude-code)

Co-Authored-By: Claude <noreply@anthropic.com>

2025-11-20 19:00:49 +08:00


PDF Layout Restoration and Preservation

Problem

PDF generation from both the OCR and Direct extraction tracks currently produces documents that are severely degraded compared to the original. Four critical issues account for this:

1. Images Never Appear

OCR track: pp_structure_enhanced._save_image() is an empty implementation (lines 262, 414), so detected images are never saved
Direct track: Image paths are saved as content["saved_path"], but the converter looks for content.get("path"), so the lookup never finds them
Result: All PDFs are text-only, with no images whatsoever
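
The path mismatch described above could be bridged with a small fallback lookup. The following is a minimal sketch, not the project's actual converter code: the helper name `resolve_image_path` and the `base_dir` parameter are hypothetical, and only the two keys `saved_path` and `path` come from the description above.

```python
from pathlib import Path
from typing import Optional


def resolve_image_path(content: dict, base_dir: Path) -> Optional[Path]:
    """Return the first existing image file referenced by `content`.

    "saved_path" (written by the Direct track) is tried before "path"
    (read by the converter). The key order and the treatment of
    relative paths are assumptions made for this sketch.
    """
    for key in ("saved_path", "path"):
        value = content.get(key)
        if not value:
            continue
        candidate = Path(value)
        if not candidate.is_absolute():
            # Relative paths are resolved against the result directory.
            candidate = base_dir / candidate
        if candidate.is_file():
            return candidate
    # Nothing usable: the caller can skip the image instead of crashing.
    return None
```

Returning `None` rather than raising lets the generator log the missing image and keep rendering the rest of the page.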

2. Tables Never Render

Table elements use fake table_*.png references that don't exist as actual files
draw_table_region() tries to find these non-existent images to get bbox coordinates
When images aren't found, table rendering is skipped entirely
Result: No tables appear in generated PDFs
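
Positioning a table from its own bbox needs only a coordinate conversion, with no image lookup. The sketch below assumes the bbox is `[x0, y0, x1, y1]` with a top-left origin (as is common for OCR output) and that the PDF canvas uses a bottom-left origin. The `Rect` type, the function name, and the `scale` parameter are hypothetical.

```python
from typing import NamedTuple, Sequence


class Rect(NamedTuple):
    x: float       # left edge in PDF coordinates
    y: float       # bottom edge in PDF coordinates
    width: float
    height: float


def bbox_to_pdf_rect(bbox: Sequence[float], page_height: float,
                     scale: float = 1.0) -> Rect:
    """Convert a top-left-origin bbox to a bottom-left-origin rectangle.

    `scale` maps source pixels to PDF points; `page_height` is given
    in PDF points. Both conventions are assumptions of this sketch.
    """
    x0, y0, x1, y1 = bbox
    # Sort so that swapped corners still yield a positive size.
    left, right = sorted((x0 * scale, x1 * scale))
    top, bottom = sorted((y0 * scale, y1 * scale))
    # Flip the y axis: the box's lower edge becomes the rect's y.
    return Rect(left, page_height - bottom, right - left, bottom - top)
```

With this, a table element carrying a valid bbox can be drawn even when no table_*.png file exists.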

3. Text Layout is Broken

All text is drawn with a single drawString() call, so each block is rendered as one line
Line breaks, paragraph alignment, and text styling are not preserved
Direct track extracts StyleInfo but it's completely ignored during PDF generation
Result: Text appears as unformatted blocks at wrong positions
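
One way to replace the single-line draw is to wrap each block to its bbox width and emit one baseline per line. This is a minimal sketch with hypothetical helper names. The `measure` callable is an assumption: if the generator uses ReportLab, as drawString() suggests, `pdfmetrics.stringWidth(text, font_name, font_size)` could fill that role.

```python
from typing import Callable, List


def wrap_text(text: str, max_width: float,
              measure: Callable[[str], float]) -> List[str]:
    """Greedy word wrap: break before the word that would overflow."""
    lines: List[str] = []
    for paragraph in text.splitlines():
        current = ""
        for word in paragraph.split():
            candidate = f"{current} {word}" if current else word
            if current and measure(candidate) > max_width:
                lines.append(current)
                current = word
            else:
                # A single over-long word is kept on its own line.
                current = candidate
        lines.append(current)  # empty string for a blank source line
    return lines


def line_baselines(top_y: float, font_size: float, count: int,
                   leading: float = 1.2) -> List[float]:
    """Baseline y for each line, moving down from the bbox top edge.

    Assumes a bottom-left PDF origin; the 1.2 leading factor is an
    illustrative default, not a value from the project.
    """
    return [top_y - font_size - i * font_size * leading
            for i in range(count)]
```

Each wrapped line can then be drawn with its own drawString() call at the matching baseline.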

4. Information Loss in Conversion

Direct track data gets converted to legacy OCR format, losing rich metadata
Span-level information (fonts, colors, styles) is discarded
Precise positioning information is reduced to simple bboxes

Solution

Implement proper layout preservation for PDF generation:

Fix image handling: Actually save images and use correct path references
Fix table rendering: Use element's own bbox instead of looking for fake images
Preserve text formatting: Use StyleInfo and span-level data for accurate rendering
Track-specific rendering: Different approaches for OCR vs Direct tracks
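
Track-specific rendering can be sketched as a small planning step that uses style data only when the Direct track supplies it. Everything named here is hypothetical: the element keys `text` and `style`, the style fields `font_name` and `font_size`, the track strings, and the Helvetica fallback are assumptions for illustration, not the project's actual StyleInfo schema.

```python
from dataclasses import dataclass

# Assumed fallback used when no style information is available.
DEFAULT_FONT_NAME = "Helvetica"
DEFAULT_FONT_SIZE = 10.0


@dataclass
class TextPlan:
    text: str
    font_name: str
    font_size: float
    mode: str  # "rich" for styled Direct output, "simple" otherwise


def plan_text(element: dict, track: str) -> TextPlan:
    """Choose rich or simple rendering for one text element."""
    # Only the Direct track is trusted to carry usable style data.
    style = element.get("style") if track == "direct" else None
    if style:
        return TextPlan(
            text=element["text"],
            font_name=style.get("font_name", DEFAULT_FONT_NAME),
            font_size=float(style.get("font_size", DEFAULT_FONT_SIZE)),
            mode="rich",
        )
    return TextPlan(element["text"], DEFAULT_FONT_NAME,
                    DEFAULT_FONT_SIZE, "simple")
```

Falling back to the simple plan when style is missing keeps OCR output working unchanged while Direct output gains formatting.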

Impact

User Experience: Output PDFs will actually be usable and readable
Functionality: Tables and images will finally appear in outputs
Quality: Direct track PDFs will closely match original formatting
Performance: No negative impact expected; skipping the conversion to the legacy OCR format may make generation slightly faster

Tasks

Fix image saving and path references (Critical)
Fix table rendering using actual bbox data (Critical)
Implement track-specific PDF generation (Important)
Preserve text styling and formatting (Important)
Add span-level text rendering (Nice-to-have)

Deltas

result-export

+ image_handling: Proper image saving and path resolution
+ table_rendering: Direct bbox usage for table positioning
+ text_formatting: StyleInfo preservation and application
+ track_specific_rendering: OCR vs Direct track differentiation
