feat: create PDF layout restoration proposal

Create a new OpenSpec change proposal to fix critical PDF generation issues:

**Problems Identified**:
1. Images never saved (empty _save_image implementation)
2. Image path mismatch (saved_path vs path lookup)
3. Tables never render (fake image dependency)
4. Text style completely lost (no font/color application)

**Solution Design**:
- Phase 1: Critical fixes (images, tables)
- Phase 2: Basic style preservation
- Phase 3: Advanced layout features
- Phase 4: Testing and optimization

**Key Improvements**:
- Implement actual image saving in pp_structure_enhanced
- Fix path resolution with fallback logic
- Use table's own bbox instead of fake images
- Track-specific rendering (rich for Direct, simple for OCR)
- Preserve StyleInfo (fonts, sizes, colors)

**Implementation Tasks**:
- 10 major task groups
- 4-week timeline
- No breaking changes
- Performance target: <10% overhead

Proposal validated: openspec validate pdf-layout-restoration ✓

🤖 Generated with [Claude Code](https://claude.com/claude-code)

Co-Authored-By: Claude <noreply@anthropic.com>
egg
2025-11-20 19:00:49 +08:00
parent a957f06588
commit cf894b076e
4 changed files with 685 additions and 0 deletions

@@ -0,0 +1,88 @@
# Result Export Specification
## ADDED Requirements
### Requirement: Layout-Preserving PDF Generation
The system MUST generate PDF files that preserve the original document layout including images, tables, and text formatting.
#### Scenario: Generate PDF with images
GIVEN a document processed through OCR or Direct track
WHEN images are detected and extracted
THEN the generated PDF MUST include all images at their original positions
AND images MUST maintain their aspect ratios
AND images MUST be saved to an imgs/ subdirectory
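The aspect-ratio requirement above can be illustrated with a small placement helper. This is a minimal sketch, not the project's implementation: the function name `fit_preserving_aspect` and the `(x0, y0, x1, y1)` bbox layout are assumptions made for illustration.

```python
def fit_preserving_aspect(img_w, img_h, bbox):
    """Return (x, y, w, h) for the largest rectangle that keeps the
    image's aspect ratio and is centered inside bbox.

    bbox is assumed to be (x0, y0, x1, y1); this layout is hypothetical.
    """
    x0, y0, x1, y1 = bbox
    box_w, box_h = x1 - x0, y1 - y0
    if min(img_w, img_h, box_w, box_h) <= 0:
        raise ValueError("image and bbox dimensions must be positive")

    # A single scale factor for both axes is what preserves the ratio.
    scale = min(box_w / img_w, box_h / img_h)
    w, h = img_w * scale, img_h * scale

    # Center the scaled image in whatever space is left over.
    return (x0 + (box_w - w) / 2, y0 + (box_h - h) / 2, w, h)
```

Using one scale factor for both axes means a wide image placed in a square bbox is letterboxed rather than stretched.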
#### Scenario: Generate PDF with tables
GIVEN a document containing tables
WHEN tables are detected and extracted
THEN the generated PDF MUST render tables with proper structure
AND tables MUST use their own bbox coordinates for positioning
AND tables MUST NOT depend on fake image references
#### Scenario: Generate PDF with styled text
GIVEN a document processed through Direct track with StyleInfo
WHEN text elements have style information
THEN the generated PDF MUST apply font families (with mapping)
AND the PDF MUST apply font sizes
AND the PDF MUST apply text colors
AND the PDF MUST apply bold/italic formatting
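The "font families (with mapping)" clause could be realized roughly as follows. This is a sketch under stated assumptions: the alias table and the `map_font` name are hypothetical, and the target names are the standard PDF base fonts, chosen here only because they need no embedding.

```python
# Variants are ordered: regular, bold, italic, bold+italic.
STANDARD_FAMILIES = {
    "helvetica": ("Helvetica", "Helvetica-Bold",
                  "Helvetica-Oblique", "Helvetica-BoldOblique"),
    "times": ("Times-Roman", "Times-Bold",
              "Times-Italic", "Times-BoldItalic"),
    "courier": ("Courier", "Courier-Bold",
                "Courier-Oblique", "Courier-BoldOblique"),
}

# Hypothetical alias table; a real mapping would likely be larger.
ALIASES = {
    "arial": "helvetica",
    "times new roman": "times",
    "courier new": "courier",
}


def map_font(family, bold=False, italic=False):
    """Map a source font family plus style flags to a base PDF font name."""
    key = (family or "").strip().lower()
    key = ALIASES.get(key, key)
    # Unknown families fall back to Helvetica so rendering never fails.
    variants = STANDARD_FAMILIES.get(key, STANDARD_FAMILIES["helvetica"])
    return variants[(1 if bold else 0) + (2 if italic else 0)]
```

Falling back to a default family keeps unknown fonts from aborting PDF generation, at the cost of some visual fidelity.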
### Requirement: Track-Specific Rendering
The system MUST provide different rendering approaches based on the processing track.
#### Scenario: Direct track rendering
GIVEN a document processed through Direct extraction
WHEN generating a PDF
THEN the system MUST use rich formatting preservation
AND maintain precise positioning from the original
AND apply all available StyleInfo
#### Scenario: OCR track rendering
GIVEN a document processed through OCR
WHEN generating a PDF
THEN the system MUST use simplified rendering
AND apply best-effort positioning based on bbox
AND use estimated font sizes
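One possible way to estimate font sizes for the OCR track is to derive them from the bbox height. The heuristic below is an assumption, not something the proposal specifies: the `fill_ratio` of 0.75 and the clamping bounds are illustrative values, and `estimate_font_size` is a hypothetical name.

```python
def estimate_font_size(bbox, line_count=1, fill_ratio=0.75,
                       min_size=4.0, max_size=72.0):
    """Estimate a font size from the height of a text bbox.

    bbox is assumed to be (x0, y0, x1, y1). fill_ratio approximates how
    much of a line's height the glyphs occupy (illustrative value).
    """
    if line_count < 1:
        raise ValueError("line_count must be at least 1")
    _, y0, _, y1 = bbox
    line_height = (y1 - y0) / line_count
    # Clamp so degenerate boxes do not produce unreadable or huge text.
    return max(min_size, min(max_size, line_height * fill_ratio))
```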
### Requirement: Image Path Resolution
The system MUST correctly resolve image paths with fallback logic.
#### Scenario: Resolve saved image paths
GIVEN an element with image content
WHEN looking for the image path
THEN the system MUST check content["saved_path"] first
AND fall back to content["path"] if not found
AND fall back to content["image_path"] if not found
AND finally check metadata["path"]
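The lookup order in this scenario can be sketched as a short helper. The key names come from the scenario itself; the function name `resolve_image_path` and the choice to return `None` when nothing matches are assumptions.

```python
def resolve_image_path(content, metadata=None):
    """Return the first usable image path, or None if none is present.

    Lookup order follows the scenario: content["saved_path"],
    content["path"], content["image_path"], then metadata["path"].
    """
    if isinstance(content, dict):
        for key in ("saved_path", "path", "image_path"):
            value = content.get(key)
            if value:  # skip missing keys and empty strings alike
                return value
    if isinstance(metadata, dict) and metadata.get("path"):
        return metadata["path"]
    return None
```

For example, `resolve_image_path({"path": "imgs/a.png"})` skips the missing `saved_path` key and returns `"imgs/a.png"`.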
## MODIFIED Requirements
### Requirement: PDF Generation Pipeline
The PDF generation pipeline MUST be enhanced to support layout preservation.
#### Scenario: Enhanced PDF generation
GIVEN a UnifiedDocument from either track
WHEN generating a PDF
THEN the system MUST detect the processing track
AND route to the appropriate rendering method
AND preserve as much layout information as available
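Track detection and routing might look like the dispatch table below. Everything here is a stand-in: the `Track` enum, the `Doc` dataclass, and the two renderer functions are hypothetical and do not reflect the actual `UnifiedDocument` API.

```python
from dataclasses import dataclass, field
from enum import Enum


class Track(Enum):
    DIRECT = "direct"
    OCR = "ocr"


@dataclass
class Doc:
    """Hypothetical stand-in for UnifiedDocument."""
    track: Track
    elements: list = field(default_factory=list)


def render_direct(doc):
    # Placeholder for rich rendering that applies all StyleInfo.
    return f"rich:{len(doc.elements)}"


def render_ocr(doc):
    # Placeholder for simplified, bbox-based rendering.
    return f"simple:{len(doc.elements)}"


RENDERERS = {Track.DIRECT: render_direct, Track.OCR: render_ocr}


def generate_pdf(doc):
    """Route the document to the renderer registered for its track."""
    renderer = RENDERERS.get(doc.track)
    if renderer is None:
        raise ValueError(f"unsupported processing track: {doc.track!r}")
    return renderer(doc)
```

A lookup table keeps the routing decision in one place, so adding a track means registering one more renderer rather than editing a chain of conditionals.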
### Requirement: Image Handling in PP-Structure
The PP-Structure enhanced module MUST actually save extracted images.
#### Scenario: Save PP-Structure images
GIVEN PP-Structure extracts an image with img_path
WHEN processing the image element
THEN the _save_image method MUST save the image to disk
AND return a relative path for reference
AND handle both file paths and numpy arrays
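A save routine satisfying this scenario could take the following shape. This is a sketch, not the actual `_save_image` method: the `save_image` name and signature are hypothetical, the array branch assumes Pillow is installed, and channel order (for example BGR versus RGB) is not handled.

```python
import shutil
from pathlib import Path


def save_image(image, output_dir, name):
    """Save an image under <output_dir>/imgs/ and return a relative path.

    image may be a file path (str or Path) or an array-like object
    that Pillow can convert, such as a numpy array.
    """
    imgs_dir = Path(output_dir) / "imgs"
    imgs_dir.mkdir(parents=True, exist_ok=True)
    target = imgs_dir / name

    if isinstance(image, (str, Path)):
        # The image already exists on disk: copy it into imgs/.
        shutil.copyfile(image, target)
    else:
        # Imported lazily so the path-only case has no Pillow dependency.
        from PIL import Image
        Image.fromarray(image).save(target)

    # Relative POSIX-style path, e.g. "imgs/figure_1.png".
    return target.relative_to(output_dir).as_posix()
```

Returning a path relative to the output directory keeps the stored reference valid if the whole result folder is moved.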
### Requirement: Table Rendering Logic
Table rendering MUST use the table element's own bbox instead of an image lookup.
#### Scenario: Render table with direct bbox
GIVEN a table element with bbox coordinates
WHEN rendering the table in PDF
THEN the system MUST use the element's own bbox
AND NOT look for non-existent table image files
AND position the table accurately based on coordinates
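Positioning a table from its own bbox usually involves a coordinate conversion, sketched below. The assumptions are that bboxes are `(x0, y0, x1, y1)` with a top-left origin, as is common for image coordinates, and that the PDF canvas uses a bottom-left origin; the `bbox_to_pdf_rect` name and the `scale` parameter are hypothetical.

```python
def bbox_to_pdf_rect(bbox, page_height, scale=1.0):
    """Convert a top-left-origin bbox to a bottom-left-origin PDF rect.

    Returns (x, y, width, height), where (x, y) is the lower-left
    corner of the table on the PDF page.
    """
    x0, y0, x1, y1 = bbox
    width = (x1 - x0) * scale
    height = (y1 - y0) * scale
    x = x0 * scale
    # Flip the vertical axis: the bbox bottom edge (y1) becomes the
    # rectangle's lower edge measured up from the bottom of the page.
    y = page_height - y1 * scale
    return (x, y, width, height)
```

For a bbox of `(50, 100, 250, 200)` on a page 800 units tall, the table's lower-left corner lands at `(50, 600)` with a size of 200 by 100.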