egg cf894b076e feat: create PDF layout restoration proposal

Create a new OpenSpec change proposal to fix critical PDF generation issues:

**Problems Identified**:
1. Images never saved (empty _save_image implementation)
2. Image path mismatch (saved_path vs path lookup)
3. Tables never render (fake image dependency)
4. Text style completely lost (no font/color application)

**Solution Design**:
- Phase 1: Critical fixes (images, tables)
- Phase 2: Basic style preservation
- Phase 3: Advanced layout features
- Phase 4: Testing and optimization

**Key Improvements**:
- Implement actual image saving in pp_structure_enhanced
- Fix path resolution with fallback logic
- Use table's own bbox instead of fake images
- Track-specific rendering (rich for Direct, simple for OCR)
- Preserve StyleInfo (fonts, sizes, colors)

**Implementation Tasks**:
- 10 major task groups
- 4-week timeline
- No breaking changes
- Performance target: <10% overhead

Proposal validated: openspec validate pdf-layout-restoration ✓

🤖 Generated with [Claude Code](https://claude.com/claude-code)

Co-Authored-By: Claude <noreply@anthropic.com>

2025-11-20 19:00:49 +08:00

Result Export Specification

ADDED Requirements

Requirement: Layout-Preserving PDF Generation

The system MUST generate PDF files that preserve the original document layout including images, tables, and text formatting.

Scenario: Generate PDF with images

- GIVEN a document processed through OCR or Direct track
- WHEN images are detected and extracted
- THEN the generated PDF MUST include all images at their original positions
- AND images MUST maintain their aspect ratios
- AND images MUST be saved to an imgs/ subdirectory
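
The aspect-ratio requirement can be illustrated with a minimal sketch. It assumes a bbox given as (x0, y0, x1, y1) and uses a hypothetical helper name, fit_image_in_bbox; neither is confirmed by the proposal. The image is scaled by a single factor and centered in its bbox:

```python
def fit_image_in_bbox(img_w, img_h, bbox):
    """Return (x, y, width, height) for drawing an image inside bbox
    without distorting it. Hypothetical helper, not part of the codebase."""
    x0, y0, x1, y1 = bbox  # assumed layout: (x0, y0, x1, y1)
    box_w, box_h = x1 - x0, y1 - y0
    if img_w <= 0 or img_h <= 0 or box_w <= 0 or box_h <= 0:
        raise ValueError("image and bbox dimensions must be positive")
    # One uniform scale factor keeps the aspect ratio intact.
    scale = min(box_w / img_w, box_h / img_h)
    draw_w, draw_h = img_w * scale, img_h * scale
    # Center the scaled image in whatever space is left over.
    x = x0 + (box_w - draw_w) / 2
    y = y0 + (box_h - draw_h) / 2
    return x, y, draw_w, draw_h
```

Using the smaller of the two scale factors guarantees the image never overflows its bbox in either direction.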

Scenario: Generate PDF with tables

- GIVEN a document containing tables
- WHEN tables are detected and extracted
- THEN the generated PDF MUST render tables with proper structure
- AND tables MUST use their own bbox coordinates for positioning
- AND tables MUST NOT depend on fake image references

Scenario: Generate PDF with styled text

- GIVEN a document processed through Direct track with StyleInfo
- WHEN text elements have style information
- THEN the generated PDF MUST apply font families (with mapping)
- AND the PDF MUST apply font sizes
- AND the PDF MUST apply text colors
- AND the PDF MUST apply bold/italic formatting
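
One way to read "font families (with mapping)" is a lookup from source font names to the standard PDF base fonts, combined with the bold/italic flags. The sketch below is an assumption about how that could look: the mapping table, the Helvetica fallback, and the name map_font are all hypothetical, and only the target font names are the standard PDF ones.

```python
# Assumed mapping from source families to standard PDF font groups.
_BASE_FONTS = {
    "arial": "Helvetica",
    "helvetica": "Helvetica",
    "times new roman": "Times",
    "times": "Times",
    "courier new": "Courier",
    "courier": "Courier",
}

# (regular, bold, italic, bold+italic) names of the standard PDF fonts.
_VARIANTS = {
    "Helvetica": ("Helvetica", "Helvetica-Bold",
                  "Helvetica-Oblique", "Helvetica-BoldOblique"),
    "Times": ("Times-Roman", "Times-Bold",
              "Times-Italic", "Times-BoldItalic"),
    "Courier": ("Courier", "Courier-Bold",
                "Courier-Oblique", "Courier-BoldOblique"),
}


def map_font(family, bold=False, italic=False):
    """Pick a standard PDF font name for a source family and style flags.
    Unknown or missing families fall back to Helvetica (an assumption)."""
    base = _BASE_FONTS.get((family or "").strip().lower(), "Helvetica")
    regular, bold_name, italic_name, bold_italic_name = _VARIANTS[base]
    if bold and italic:
        return bold_italic_name
    if bold:
        return bold_name
    if italic:
        return italic_name
    return regular
```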

Requirement: Track-Specific Rendering

The system MUST provide different rendering approaches based on the processing track.

Scenario: Direct track rendering

- GIVEN a document processed through Direct extraction
- WHEN generating a PDF
- THEN the system MUST use rich formatting preservation
- AND maintain precise positioning from the original
- AND apply all available StyleInfo

Scenario: OCR track rendering

- GIVEN a document processed through OCR
- WHEN generating a PDF
- THEN the system MUST use simplified rendering
- AND apply best-effort positioning based on bbox
- AND use estimated font sizes
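
The proposal does not say how font sizes are estimated for the OCR track. As a rough sketch only, a size can be derived from the bbox height, assuming the bbox is already in page points; the 0.75 ratio, the clamping limits, and the function name are illustrative assumptions:

```python
def estimate_font_size(bbox, line_count=1, ratio=0.75,
                       min_size=4.0, max_size=72.0):
    """Estimate a font size from a text bbox (x0, y0, x1, y1).
    Heuristic sketch: ratio and limits are assumed, not from the spec."""
    _, y0, _, y1 = bbox
    # Height available to each line of text in the region.
    line_height = abs(y1 - y0) / max(line_count, 1)
    # Glyphs occupy only part of the line height; clamp to a sane range.
    return max(min_size, min(max_size, line_height * ratio))
```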

Requirement: Image Path Resolution

The system MUST correctly resolve image paths with fallback logic.

Scenario: Resolve saved image paths

- GIVEN an element with image content
- WHEN looking for the image path
- THEN the system MUST check content["saved_path"] first
- AND fall back to content["path"] if not found
- AND fall back to content["image_path"] if not found
- AND finally check metadata["path"]
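
The lookup order in this scenario could be sketched as follows, assuming content and metadata are plain dicts. The function name resolve_image_path is hypothetical; only the four keys come from the requirement.

```python
def resolve_image_path(content, metadata=None):
    """Return the first usable image path, or None if there is none.
    Assumes content and metadata are dicts (or None)."""
    content = content or {}
    # Keys are tried in the order the requirement lists them.
    for key in ("saved_path", "path", "image_path"):
        value = content.get(key)
        if value:
            return value
    # Last resort: the element's metadata.
    return (metadata or {}).get("path") or None
```

Empty strings are treated the same as missing keys, so a blank saved_path does not block the later fallbacks.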

MODIFIED Requirements

Requirement: PDF Generation Pipeline

The PDF generation pipeline MUST be enhanced to support layout preservation.

Scenario: Enhanced PDF generation

- GIVEN a UnifiedDocument from either track
- WHEN generating a PDF
- THEN the system MUST detect the processing track
- AND route to the appropriate rendering method
- AND preserve as much layout information as available
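
Track detection and routing might look like the dispatcher below. This is a sketch under assumptions: the real UnifiedDocument interface is not shown in the proposal, so a stub class with a track attribute stands in for it, the renderer functions are placeholders, and falling back to simple rendering for an unknown track is an illustrative choice.

```python
from dataclasses import dataclass, field


@dataclass
class UnifiedDocumentStub:
    """Stand-in for UnifiedDocument; the real class may differ."""
    track: str
    elements: list = field(default_factory=list)


def render_direct(doc):
    # Placeholder for rich rendering (styles, precise positioning).
    return "rich"


def render_ocr(doc):
    # Placeholder for simplified, best-effort rendering.
    return "simple"


_RENDERERS = {"direct": render_direct, "ocr": render_ocr}


def generate_pdf(doc):
    """Route a document to the renderer for its processing track."""
    track = str(doc.track).lower()
    # Unknown tracks use the simple renderer (assumed fallback).
    renderer = _RENDERERS.get(track, render_ocr)
    return renderer(doc)
```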

Requirement: Image Handling in PP-Structure

The PP-Structure enhanced module MUST actually save extracted images.

Scenario: Save PP-Structure images

- GIVEN PP-Structure extracts an image with img_path
- WHEN processing the image element
- THEN the _save_image method MUST save the image to disk
- AND return a relative path for reference
- AND handle both file paths and numpy arrays
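
A standalone sketch of the saving behavior is shown below. It is not the project's _save_image: the function name, its parameters, and the use of Pillow for array input are assumptions. It copies file inputs, encodes array inputs, and returns a path relative to the output directory.

```python
import shutil
from pathlib import Path


def save_image(image, output_dir, name):
    """Save an image under output_dir/imgs and return its relative path.
    `image` is either a file path or an array-like of pixels.
    Hypothetical helper; `name` must include a file extension."""
    imgs_dir = Path(output_dir) / "imgs"
    imgs_dir.mkdir(parents=True, exist_ok=True)
    target = imgs_dir / name
    if isinstance(image, (str, Path)):
        # Already on disk: copy it next to the other outputs.
        shutil.copyfile(image, target)
    else:
        # Assumed to be a uint8 numpy array; Pillow is imported lazily
        # so the path branch works without it. Arrays loaded through
        # OpenCV are in BGR order and would need converting first.
        from PIL import Image
        Image.fromarray(image).save(target)
    # Relative POSIX-style path, e.g. "imgs/figure_1.png".
    return target.relative_to(output_dir).as_posix()
```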

Requirement: Table Rendering Logic

The table rendering MUST use direct bbox instead of image lookup.

Scenario: Render table with direct bbox

- GIVEN a table element with bbox coordinates
- WHEN rendering the table in PDF
- THEN the system MUST use the element's own bbox
- AND the system MUST NOT look for non-existent table image files
- AND position the table accurately based on coordinates
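
Positioning a table from its own bbox usually involves a coordinate conversion. The sketch below assumes the bbox is (x0, y0, x1, y1) in image pixels with a top-left origin, and converts it to PDF user space, whose default origin is the bottom-left corner of the page. The function name and the pixel-based input are assumptions, not details from the proposal.

```python
def bbox_to_pdf_rect(bbox, image_size, page_size):
    """Convert a top-left-origin pixel bbox into a PDF rectangle
    (left, bottom, width, height) in page units.
    Hypothetical helper; input conventions are assumed."""
    x0, y0, x1, y1 = bbox
    img_w, img_h = image_size
    page_w, page_h = page_size
    # Scale factors from image pixels to page units.
    sx, sy = page_w / img_w, page_h / img_h
    left = x0 * sx
    width = (x1 - x0) * sx
    height = (y1 - y0) * sy
    # Flip the y axis: the bbox's lower edge (y1) becomes the
    # rectangle's distance from the bottom of the page.
    bottom = page_h - y1 * sy
    return left, bottom, width, height
```

Because the rectangle is computed from the table element alone, no lookup of a companion image file is involved.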
