feat: create PDF layout restoration proposal

Create new OpenSpec change proposal to fix critical PDF generation issues:

**Problems Identified**:
1. Images never saved (empty _save_image implementation)
2. Image path mismatch (saved_path vs path lookup)
3. Tables never render (fake image dependency)
4. Text style completely lost (no font/color application)

**Solution Design**:
- Phase 1: Critical fixes (images, tables)
- Phase 2: Basic style preservation
- Phase 3: Advanced layout features
- Phase 4: Testing and optimization

**Key Improvements**:
- Implement actual image saving in pp_structure_enhanced
- Fix path resolution with fallback logic
- Use table's own bbox instead of fake images
- Track-specific rendering (rich for Direct, simple for OCR)
- Preserve StyleInfo (fonts, sizes, colors)

**Implementation Tasks**:
- 10 major task groups
- 4-week timeline
- No breaking changes
- Performance target: <10% overhead

Proposal validated: openspec validate pdf-layout-restoration ✓

🤖 Generated with [Claude Code](https://claude.com/claude-code)

Co-Authored-By: Claude <noreply@anthropic.com>
2025-11-20 19:00:49 +08:00
parent a957f06588
commit cf894b076e
4 changed files with 685 additions and 0 deletions
--- a/openspec/changes/pdf-layout-restoration/specs/result-export/spec.md
+++ b/openspec/changes/pdf-layout-restoration/specs/result-export/spec.md
@@ -0,0 +1,88 @@
+# Result Export Specification
+
+## ADDED Requirements
+
+### Requirement: Layout-Preserving PDF Generation
+The system MUST generate PDF files that preserve the original document layout including images, tables, and text formatting.
+
+#### Scenario: Generate PDF with images
+GIVEN a document processed through OCR or Direct track
+WHEN images are detected and extracted
+THEN the generated PDF MUST include all images at their original positions
+AND images MUST maintain their aspect ratios
+AND images MUST be saved to an imgs/ subdirectory
+
+#### Scenario: Generate PDF with tables
+GIVEN a document containing tables
+WHEN tables are detected and extracted
+THEN the generated PDF MUST render tables with proper structure
+AND tables MUST use their own bbox coordinates for positioning
+AND tables MUST NOT depend on fake image references
+
+#### Scenario: Generate PDF with styled text
+GIVEN a document processed through Direct track with StyleInfo
+WHEN text elements have style information
+THEN the generated PDF MUST apply font families (with mapping)
+AND the PDF MUST apply font sizes
+AND the PDF MUST apply text colors
+AND the PDF MUST apply bold/italic formatting
+
+### Requirement: Track-Specific Rendering
+The system MUST provide different rendering approaches based on the processing track.
+
+#### Scenario: Direct track rendering
+GIVEN a document processed through Direct extraction
+WHEN generating a PDF
+THEN the system MUST use rich formatting preservation
+AND maintain precise positioning from the original
+AND apply all available StyleInfo
+
+#### Scenario: OCR track rendering
+GIVEN a document processed through OCR
+WHEN generating a PDF
+THEN the system MUST use simplified rendering
+AND apply best-effort positioning based on bbox
+AND use estimated font sizes
+
+### Requirement: Image Path Resolution
+The system MUST correctly resolve image paths with fallback logic.
+
+#### Scenario: Resolve saved image paths
+GIVEN an element with image content
+WHEN looking for the image path
+THEN the system MUST check content["saved_path"] first
+AND fall back to content["path"] if not found
+AND fall back to content["image_path"] if not found
+AND finally check metadata["path"]
+
+## MODIFIED Requirements
+
+### Requirement: PDF Generation Pipeline
+The PDF generation pipeline MUST be enhanced to support layout preservation.
+
+#### Scenario: Enhanced PDF generation
+GIVEN a UnifiedDocument from either track
+WHEN generating a PDF
+THEN the system MUST detect the processing track
+AND route to the appropriate rendering method
+AND preserve as much layout information as available
+
+### Requirement: Image Handling in PP-Structure
+The PP-Structure enhanced module MUST actually save extracted images.
+
+#### Scenario: Save PP-Structure images
+GIVEN PP-Structure extracts an image with img_path
+WHEN processing the image element
+THEN the _save_image method MUST save the image to disk
+AND return a relative path for reference
+AND handle both file paths and numpy arrays
+
+### Requirement: Table Rendering Logic
+Table rendering MUST use the table element's own bbox instead of an image lookup.
+
+#### Scenario: Render table with direct bbox
+GIVEN a table element with bbox coordinates
+WHEN rendering the table in PDF
+THEN the system MUST use the element's own bbox
+AND MUST NOT look for non-existent table image files
+AND position the table accurately based on coordinates
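
The lookup order in the "Resolve saved image paths" scenario can be sketched in Python as below. This is a minimal illustration, not the project's implementation: the function name `resolve_image_path`, the constant `CONTENT_KEYS`, and the assumption that `content` and `metadata` are plain dict-like mappings are all hypothetical; only the four keys and their order come from the spec.

```python
from typing import Any, Mapping, Optional

# Key order taken from the spec scenario; the constant name is hypothetical.
CONTENT_KEYS = ("saved_path", "path", "image_path")


def resolve_image_path(
    content: Mapping[str, Any],
    metadata: Optional[Mapping[str, Any]] = None,
) -> Optional[str]:
    """Return the first non-empty image path, or None if no key is set."""
    # Check the content keys first, in the order the scenario lists them.
    for key in CONTENT_KEYS:
        value = content.get(key)
        if value:
            return str(value)
    # Last resort: the path recorded in the element's metadata.
    if metadata:
        value = metadata.get("path")
        if value:
            return str(value)
    return None


if __name__ == "__main__":
    print(resolve_image_path({"path": "imgs/figure_1.png"}))
    print(resolve_image_path({}, {"path": "imgs/from_metadata.png"}))
```

Treating empty strings as missing is a design choice in this sketch, so that an element carrying `saved_path: ""` still falls through to the later keys.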