feat: create PDF layout restoration proposal

Create new OpenSpec change proposal to fix critical PDF generation issues:

**Problems Identified**:
1. Images never saved (empty _save_image implementation)
2. Image path mismatch (saved_path vs path lookup)
3. Tables never render (fake image dependency)
4. Text style completely lost (no font/color application)

**Solution Design**:
- Phase 1: Critical fixes (images, tables)
- Phase 2: Basic style preservation
- Phase 3: Advanced layout features
- Phase 4: Testing and optimization

**Key Improvements**:
- Implement actual image saving in pp_structure_enhanced
- Fix path resolution with fallback logic
- Use table's own bbox instead of fake images
- Track-specific rendering (rich for Direct, simple for OCR)
- Preserve StyleInfo (fonts, sizes, colors)

**Implementation Tasks**:
- 10 major task groups
- 4-week timeline
- No breaking changes
- Performance target: <10% overhead

Proposal validated: openspec validate pdf-layout-restoration ✓

🤖 Generated with [Claude Code](https://claude.com/claude-code)

Co-Authored-By: Claude <noreply@anthropic.com>
2025-11-20 19:00:49 +08:00
parent a957f06588
commit cf894b076e
4 changed files with 685 additions and 0 deletions
--- a/openspec/changes/pdf-layout-restoration/proposal.md
+++ b/openspec/changes/pdf-layout-restoration/proposal.md
@@ -0,0 +1,57 @@
+# PDF Layout Restoration and Preservation
+
+## Problem
+PDFs generated from both the OCR and Direct extraction tracks are currently **severely degraded compared to the original**. The critical issues are:
+
+### 1. Images Never Appear
+- **OCR track**: `pp_structure_enhanced._save_image()` is an empty implementation (lines 262, 414), so detected images are never saved
+- **Direct track**: Image paths are saved as `content["saved_path"]`, but the converter looks for `content.get("path")`, so the lookup never matches
+- **Result**: All PDFs are text-only, with no images whatsoever
+
+### 2. Tables Never Render
+- Table elements use fake `table_*.png` references that don't exist as actual files
+- `draw_table_region()` tries to find these non-existent images to get bbox coordinates
+- When images aren't found, table rendering is skipped entirely
+- **Result**: No tables appear in generated PDFs
+
+### 3. Text Layout is Broken
+- Each text block is drawn with a single `drawString()` call, so the entire block is rendered as one line
+- No line breaks, paragraph alignment, or text styling are preserved
+- Direct track extracts `StyleInfo` but it's completely ignored during PDF generation
+- **Result**: Text appears as unformatted blocks at wrong positions
+
+### 4. Information Loss in Conversion
+- Direct track data gets converted to legacy OCR format, losing rich metadata
+- Span-level information (fonts, colors, styles) is discarded
+- Precise positioning information is reduced to simple bboxes
+
+## Solution
+Implement proper layout preservation for PDF generation:
+
+1. **Fix image handling**: Actually save images and use correct path references
+2. **Fix table rendering**: Use element's own bbox instead of looking for fake images
+3. **Preserve text formatting**: Use StyleInfo and span-level data for accurate rendering
+4. **Track-specific rendering**: Different approaches for OCR vs Direct tracks
+
+## Impact
+- **User Experience**: Output PDFs will actually be usable and readable
+- **Functionality**: Tables and images will finally appear in outputs
+- **Quality**: Direct track PDFs will closely match original formatting
+- **Performance**: No negative impact expected; generation may be faster because unnecessary conversions are avoided
+
+## Tasks
+- Fix image saving and path references (Critical)
+- Fix table rendering using actual bbox data (Critical)
+- Implement track-specific PDF generation (Important)
+- Preserve text styling and formatting (Important)
+- Add span-level text rendering (Nice-to-have)
+
+## Deltas
+
+### result-export
+```delta
+ image_handling: Proper image saving and path resolution
+ table_rendering: Direct bbox usage for table positioning
+ text_formatting: StyleInfo preservation and application
+ track_specific_rendering: OCR vs Direct track differentiation
+```
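
The two fixes marked Critical in the proposal (image path resolution with fallback, and using the table's own bbox) could be sketched roughly as below. This is a minimal illustration, not the project's code: the helper names `resolve_image_path` and `table_bbox`, the `base_dir` parameter, and the assumption that an element stores its bbox as four numbers `[x0, y0, x1, y1]` under a `"bbox"` key are all hypothetical. Only the `saved_path` and `path` keys come from the proposal.

```python
from pathlib import Path
from typing import Any, Mapping, Optional, Tuple

BBox = Tuple[float, float, float, float]


def resolve_image_path(content: Mapping[str, Any], base_dir: Path) -> Optional[Path]:
    """Return the first existing image file referenced by `content`, else None.

    Hypothetical helper: tries the key the Direct track writes ("saved_path")
    and falls back to the key the converter reads ("path").
    """
    for key in ("saved_path", "path"):
        raw = content.get(key)
        if not raw:
            continue
        candidate = Path(raw)
        # Relative paths are assumed to be relative to the result directory.
        if not candidate.is_absolute():
            candidate = base_dir / candidate
        if candidate.is_file():
            return candidate
    return None


def table_bbox(element: Mapping[str, Any]) -> Optional[BBox]:
    """Read the table's own bbox instead of looking up a placeholder image.

    Assumes (hypothetically) a list or tuple [x0, y0, x1, y1] under "bbox".
    Returns None for a missing, malformed, or zero-area box.
    """
    bbox = element.get("bbox")
    if not isinstance(bbox, (list, tuple)) or len(bbox) != 4:
        return None
    x0, y0, x1, y1 = (float(v) for v in bbox)
    if x1 <= x0 or y1 <= y0:
        return None
    return (x0, y0, x1, y1)
```

Returning `None` rather than raising lets a renderer skip a single broken element and keep drawing the rest of the page, which matches the proposal's goal of never silently dropping every image or table because one lookup failed.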