egg 4325d024a7 chore: cleanup test files and archive pdf-layout-restoration proposal

Remove obsolete test and utility scripts:
- backend/create_test_user.py
- backend/mark_migration_done.py
- backend/fix_alembic_version.py
- backend/RUN_TESTS.md (outdated test documentation)

Archive completed pdf-layout-restoration proposal:
- Moved from openspec/changes/pdf-layout-restoration/
- To openspec/changes/archive/2025-11-24-pdf-layout-restoration/

🤖 Generated with [Claude Code](https://claude.com/claude-code)

Co-Authored-By: Claude <noreply@anthropic.com>

2025-11-24 19:43:05 +08:00

Technical Design: PDF Layout Restoration and Preservation

Context

Background

The current PDF generation system loses critical layout information during conversion. Both the OCR and Direct tracks successfully extract images, tables, and styled text, but none of it reaches the final PDF output because of:

Empty implementations: Image saving functions are stubs
Path mismatches: Saved paths don't match expected lookup keys
Fake data dependencies: Table rendering relies on non-existent image files
Format degradation: Rich formatting is reduced to plain text blocks

Current Issues

Issue 1: OCR Track Image Loss

# backend/app/services/pp_structure_enhanced.py
def _save_image(self, img_data, element_id: str, result_dir: Path):
    """Save image data to file"""
    # TODO: Implement image saving
    pass  # Lines 262, 414 - NEVER SAVES ANYTHING!

Result: img_path from PP-Structure is ignored, so no image files are created.

Issue 2: Direct Track Path Mismatch

# Saves as:
element.content["saved_path"] = f"imgs/{element_id}.png"  # line 745

# But converter looks for:
image_path = content.get("path")  # line 180 - WRONG KEY!

Result: Direct track images are saved but never found.

Issue 3: Table Rendering Failure

# Creates fake reference:
images_metadata.append({
    "path": f"table_{element.element_id}.png",  # DOESN'T EXIST
    "bbox": element.bbox
})

# Then tries to find it:
table_image = next((img for img in images_metadata
                   if "table" in img.get("path", "")), None)
if not table_image:
    return  # ALWAYS HAPPENS - NO RENDERING!

Issue 4: Text Style Loss

# Has rich data:
StyleInfo(font='Arial', size=12, flags=BOLD|ITALIC, color='#000080')

# But only uses:
c.drawString(x, y, text)  # No font, size, or style applied!

Constraints

Must maintain backward compatibility with existing API
Cannot break current OCR/Direct track separation
Should work within existing UnifiedDocument model
Must handle both track types appropriately

Goals / Non-Goals

Goals

Restore image rendering: Save and correctly reference all images
Fix table layout: Render tables using actual bbox data
Preserve text formatting: Apply fonts, sizes, colors, and styles
Track-specific optimization: Different rendering for OCR vs Direct
Maintain positioning: Accurate spatial layout preservation

Non-Goals

Rewriting entire PDF generation system
Changing UnifiedDocument structure
Modifying extraction engines
Supporting complex vector graphics
Interactive PDF features (forms, annotations)

Decisions

Decision 1: Fix Image Handling Pipeline

What: Implement actual image saving and correct path resolution

Implementation:

# pp_structure_enhanced.py
import shutil
from pathlib import Path

from PIL import Image

def _save_image(self, img_data, element_id: str, result_dir: Path) -> str:
    """Save image data under result_dir/imgs and return its relative path."""
    img_dir = result_dir / "imgs"
    img_dir.mkdir(parents=True, exist_ok=True)
    dst_path = img_dir / f"{element_id}.png"

    if isinstance(img_data, (str, Path)):
        # img_data is a path to an existing file: copy it
        shutil.copy2(Path(img_data), dst_path)
    else:
        # img_data is in-memory pixel data (assumed to be a NumPy array)
        Image.fromarray(img_data).save(dst_path)

    return f"imgs/{element_id}.png"  # Relative to result_dir

Path Resolution:

# pdf_generator_service.py - convert_unified_document_to_ocr_data
def _get_image_path(element):
    """Get image path with fallback logic"""
    content = element.content

    # Try multiple path locations
    for key in ["saved_path", "path", "image_path"]:
        if isinstance(content, dict) and key in content:
            return content[key]

    # Check metadata
    if hasattr(element, 'metadata') and element.metadata:
        return element.metadata.get('path')

    return None
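
The fallback order above can be exercised in isolation. The following is a minimal sketch, assuming elements expose `content` and `metadata` attributes as in the snippet; the `SimpleNamespace` objects are hypothetical stand-ins for real UnifiedDocument elements, and `get_image_path` is a standalone copy of the lookup rather than the service method itself.

```python
from types import SimpleNamespace

def get_image_path(element):
    """Standalone copy of the lookup: content keys first, then metadata."""
    content = element.content
    for key in ("saved_path", "path", "image_path"):
        if isinstance(content, dict) and key in content:
            return content[key]
    metadata = getattr(element, "metadata", None)
    if metadata:
        return metadata.get("path")
    return None

# Hypothetical stand-ins for UnifiedDocument elements.
direct = SimpleNamespace(content={"saved_path": "imgs/fig_1.png"}, metadata=None)
legacy = SimpleNamespace(content={"path": "imgs/fig_2.png"}, metadata=None)
meta_only = SimpleNamespace(content="raw text", metadata={"path": "imgs/fig_3.png"})
missing = SimpleNamespace(content={}, metadata={})

for element in (direct, legacy, meta_only, missing):
    print(get_image_path(element))
```

Checking `saved_path` first means Direct track images (Issue 2) resolve without changing how they are stored.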

Decision 2: Direct Table Bbox Usage

What: Use table element's own bbox instead of fake image references

Current Problem:

# Creates fake image that doesn't exist
images_metadata.append({"path": f"table_{id}.png", "bbox": bbox})
# Later fails to find it and skips rendering

Solution:

def draw_table_region(self, c, table_element, page_width, page_height):
    """Draw table using its own bbox"""
    # Get bbox directly from element
    bbox = table_element.get("bbox")
    if not bbox:
        # Fallback to polygon
        bbox_polygon = table_element.get("bbox_polygon")
        if bbox_polygon:
            bbox = self._polygon_to_bbox(bbox_polygon)

    if not bbox:
        logger.warning(f"No bbox for table {table_element.get('element_id')}")
        return

    # Use bbox to position and render table
    x, y, width, height = self._normalize_bbox(bbox, page_width, page_height)
    self._render_table_html(c, table_element.get("content"), x, y, width, height)
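
The two helpers used above are not shown in this document. The sketch below is one possible shape for them, under stated assumptions: bboxes are `[x0, y0, x1, y1]` in PDF points with a top-left origin, polygons are lists of `(x, y)` points, and the output is `(x, y, width, height)` with the bottom-left origin that ReportLab canvases use. The function names mirror `_polygon_to_bbox` and `_normalize_bbox` but are written here as free functions for illustration.

```python
def polygon_to_bbox(polygon):
    """Axis-aligned bounding box [x0, y0, x1, y1] of a list of (x, y) points."""
    xs = [point[0] for point in polygon]
    ys = [point[1] for point in polygon]
    return [min(xs), min(ys), max(xs), max(ys)]

def normalize_bbox(bbox, page_width, page_height):
    """Clamp a top-left-origin box to the page and flip it to a bottom-left origin.

    Returns (x, y, width, height), where (x, y) is the lower-left corner.
    """
    x0, y0, x1, y1 = bbox
    x0, x1 = max(0, min(x0, x1)), min(page_width, max(x0, x1))
    y0, y1 = max(0, min(y0, y1)), min(page_height, max(y0, y1))
    return x0, page_height - y1, x1 - x0, y1 - y0

polygon = [(100, 50), (300, 50), (300, 150), (100, 150)]
bbox = polygon_to_bbox(polygon)
print(bbox)                            # [100, 50, 300, 150]
print(normalize_bbox(bbox, 595, 842))  # (100, 692, 200, 100)
```

If the extraction engines report coordinates in pixels rather than points, a scale factor would be needed before the flip; that is outside this sketch.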

Decision 3: Track-Specific Rendering

What: Different rendering approaches based on processing track

Implementation Strategy:

def generate_from_unified_document(self, unified_doc, output_path):
    """Generate PDF with track-specific rendering"""

    track = unified_doc.metadata.processing_track

    if track == "direct":
        return self._generate_direct_track_pdf(unified_doc, output_path)
    else:  # OCR track
        return self._generate_ocr_track_pdf(unified_doc, output_path)

def _generate_direct_track_pdf(self, unified_doc, output_path):
    """Rich rendering for direct track"""
    # Preserve:
    # - Font families, sizes, weights
    # - Text colors and backgrounds
    # - Precise positioning
    # - Line breaks and paragraph spacing

def _generate_ocr_track_pdf(self, unified_doc, output_path):
    """Simplified rendering for OCR track"""
    # Best effort with:
    # - Detected layout regions
    # - Estimated font sizes
    # - Basic positioning

Decision 4: Style Preservation System

What: Apply StyleInfo to text rendering

Text Rendering Enhancement:

def _apply_text_style(self, c, style_info):
    """Apply text styling from StyleInfo"""
    if not style_info:
        return

    # Font selection
    font_name = self._map_font(style_info.font)
    if style_info.flags:
        if style_info.flags & BOLD and style_info.flags & ITALIC:
            font_name = f"{font_name}-BoldOblique"
        elif style_info.flags & BOLD:
            font_name = f"{font_name}-Bold"
        elif style_info.flags & ITALIC:
            font_name = f"{font_name}-Oblique"

    # Apply font and size
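    # Fall back to Helvetica when the computed name is not a registered font
    # (for example, Times uses "-Italic"/"-BoldItalic" rather than "-Oblique")
    size = style_info.size or 12
    try:
        c.setFont(font_name, size)
    except Exception:
        c.setFont("Helvetica", size)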

    # Apply color
    if style_info.color:
        r, g, b = self._parse_color(style_info.color)
        c.setFillColorRGB(r, g, b)

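def draw_text_region_enhanced(self, c, element, page_width, page_height):
    """Enhanced text rendering with style preservation"""
    # Apply style
    style = getattr(element, 'style', None)
    if style:
        self._apply_text_style(c, style)

    # Position from the element's bbox. Assumes _normalize_bbox returns the
    # box's lower-left corner plus width and height, as used in Decision 2.
    x, y, width, height = self._normalize_bbox(element.bbox, page_width, page_height)
    font_size = style.size if style and style.size else 12
    line_height = font_size * 1.2  # assumed leading; not taken from the source data
    y += height - font_size  # first baseline sits near the top of the box

    # Render with line breaks
    for line in element.content.split('\n'):
        c.drawString(x, y, line)
        y -= line_height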
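
`_parse_color` is referenced above but not defined in this document. A minimal sketch, assuming colors arrive as `#RRGGBB` (or shorthand `#RGB`) hex strings like the `'#000080'` in Issue 4, and that the result feeds `setFillColorRGB`, which expects floats in the range 0 to 1:

```python
def parse_color(color):
    """Parse '#RRGGBB' or '#RGB' into an (r, g, b) tuple of floats in 0..1."""
    hex_digits = color.strip().lstrip("#")
    if len(hex_digits) == 3:
        # Expand shorthand: 'abc' -> 'aabbcc'
        hex_digits = "".join(ch * 2 for ch in hex_digits)
    if len(hex_digits) != 6:
        raise ValueError(f"Unsupported color format: {color!r}")
    # int(..., 16) raises ValueError on non-hex characters
    return tuple(int(hex_digits[i:i + 2], 16) / 255 for i in (0, 2, 4))

r, g, b = parse_color("#000080")
print(r, g, round(b, 3))  # 0.0 0.0 0.502
```

Other color encodings the extractors may emit (integer RGB, named colors) are not covered here.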

Implementation Phases

Phase 1: Critical Fixes (Immediate)

Implement _save_image() in pp_structure_enhanced.py
Fix path resolution in converter
Fix table bbox usage
Test with sample documents

Phase 2: Basic Style Preservation (Week 1)

Implement style application for Direct track
Add font mapping system
Handle text colors
Preserve line breaks

Phase 3: Advanced Layout (Week 2)

Implement span-level rendering
Add paragraph alignment
Handle text indentation
Preserve list formatting

Phase 4: Optimization (Week 3)

Cache font metrics
Optimize image handling
Batch rendering operations
Performance testing

Risks / Trade-offs

Risk 1: Font Availability

Risk: System fonts may not match document fonts
Mitigation: Font mapping table with fallbacks

FONT_MAPPING = {
    'Arial': 'Helvetica',
    'Times New Roman': 'Times-Roman',
    'Courier New': 'Courier',
    # ... more mappings
}
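
A mapping of base names alone does not cover bold and italic variants, because the standard PDF fonts name them differently (Helvetica and Courier use "Oblique", Times uses "Italic"). The sketch below shows one way to resolve both at once. `resolve_font`, `FONT_VARIANTS`, and the `BOLD`/`ITALIC` bit values are illustrative assumptions, not existing code.

```python
BOLD, ITALIC = 1, 2  # hypothetical flag bits; the real StyleInfo values may differ

FONT_MAPPING = {
    "Arial": "Helvetica",
    "Times New Roman": "Times-Roman",
    "Courier New": "Courier",
}

# Variant order: regular, bold, italic, bold+italic (standard PDF font names).
FONT_VARIANTS = {
    "Helvetica": ("Helvetica", "Helvetica-Bold",
                  "Helvetica-Oblique", "Helvetica-BoldOblique"),
    "Times-Roman": ("Times-Roman", "Times-Bold",
                    "Times-Italic", "Times-BoldItalic"),
    "Courier": ("Courier", "Courier-Bold",
                "Courier-Oblique", "Courier-BoldOblique"),
}

def resolve_font(name, flags=0):
    """Map a document font name plus style flags to a standard PDF font name."""
    base = FONT_MAPPING.get(name, name)
    if base not in FONT_VARIANTS:
        base = "Helvetica"  # safe fallback for unknown fonts
    index = (1 if flags & BOLD else 0) + (2 if flags & ITALIC else 0)
    return FONT_VARIANTS[base][index]

print(resolve_font("Times New Roman", BOLD | ITALIC))  # Times-BoldItalic
print(resolve_font("Arial", ITALIC))                   # Helvetica-Oblique
print(resolve_font("Comic Sans MS"))                   # Helvetica
```

Looking up the variant from a table avoids building names such as "Times-Roman-Bold", which no standard font provides.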

Risk 2: Complex Layouts

Risk: Some layouts are too complex to preserve perfectly
Mitigation: Graceful degradation with logging

Attempt full preservation
Fall back to simpler rendering if needed
Log what couldn't be preserved
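
The three steps above can be sketched as a small fallback chain. This is an illustration only: `render_with_fallback` and the logger name are assumptions, and the renderers are placeholder callables rather than the service's real drawing methods.

```python
import logging

logger = logging.getLogger("pdf_generator")  # hypothetical logger name

def render_with_fallback(element_id, renderers):
    """Try renderers in order, richest first; return the name of the one that worked."""
    for name, render in renderers:
        try:
            render()
            return name
        except Exception as exc:
            # Record what could not be preserved, then degrade to the next renderer.
            logger.warning("Element %s: %s rendering failed (%s); degrading",
                           element_id, name, exc)
    logger.error("Element %s could not be rendered at all", element_id)
    return None

def full_preservation():
    raise ValueError("unsupported nested layout")  # simulated failure

def plain_text():
    pass  # stands in for a simple drawString-based renderer

used = render_with_fallback("text_7", [("full", full_preservation),
                                       ("plain", plain_text)])
print(used)  # plain
```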

Risk 3: Performance Impact

Risk: Style processing may slow down PDF generation
Mitigation:

Cache computed styles
Batch similar operations
Lazy loading for images
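
As an illustration of the first mitigation, computed styles can be memoized so that repeated spans with identical attributes are resolved once. A minimal sketch using `functools.lru_cache`; `compute_style` and its argument list are hypothetical, and the arguments must be hashable for the cache to work.

```python
from functools import lru_cache

@lru_cache(maxsize=256)
def compute_style(font, size, flags, color):
    """Resolve a style once; identical (font, size, flags, color) calls hit the cache."""
    hex_digits = color.lstrip("#")
    rgb = tuple(int(hex_digits[i:i + 2], 16) / 255 for i in (0, 2, 4))
    return font, float(size), flags, rgb

compute_style("Helvetica", 12, 0, "#000080")  # computed
compute_style("Helvetica", 12, 0, "#000080")  # served from the cache
print(compute_style.cache_info())
```

Whether this pays off depends on how many distinct styles a document contains, which should be measured during Phase 4.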

Trade-off: Accuracy vs Speed

Direct track: Prioritize accuracy (users chose quality)
OCR track: Balance accuracy with processing time

Testing Strategy

Unit Tests

def test_image_saving():
    """Test that images are actually saved"""

def test_path_resolution():
    """Test path lookup with fallbacks"""

def test_table_bbox_rendering():
    """Test tables render without fake images"""

def test_style_application():
    """Test font/color/size application"""

Integration Tests

Process document with images → verify images in PDF
Process document with tables → verify table layout
Process styled document → verify formatting preserved
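
A building block for the image check is asserting that a saved image actually lands where the converter will look for it. The sketch below covers only the file-copy branch of image saving and uses the standard library alone. `save_image_from_file` is a simplified stand-in for `_save_image`, not the real method, and the source file contains just a PNG signature, which is enough for a copy test but is not a valid image.

```python
import shutil
import tempfile
from pathlib import Path

def save_image_from_file(src, element_id, result_dir):
    """File-copy branch of image saving; returns the path relative to result_dir."""
    img_dir = Path(result_dir) / "imgs"
    img_dir.mkdir(parents=True, exist_ok=True)
    shutil.copy2(src, img_dir / f"{element_id}.png")
    return f"imgs/{element_id}.png"

def test_image_saving():
    """Test that images are actually saved"""
    with tempfile.TemporaryDirectory() as tmp:
        tmp = Path(tmp)
        src = tmp / "source.png"
        src.write_bytes(b"\x89PNG\r\n\x1a\n")  # PNG signature only
        rel = save_image_from_file(src, "img_1", tmp / "result")
        assert rel == "imgs/img_1.png"
        assert (tmp / "result" / rel).read_bytes() == src.read_bytes()

test_image_saving()
```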

Visual Regression Tests

Generate PDFs for test documents
Compare with expected outputs
Flag visual differences

Success Metrics

Image Rendering: 100% of detected images appear in PDF
Table Rendering: 100% of tables rendered with correct layout
Style Preservation:
- Direct track: 90%+ style attributes preserved
- OCR track: Basic formatting maintained
Performance: <10% increase in generation time
Quality: User satisfaction with output appearance

Migration Plan

Rollout Strategy

Deploy fixes behind feature flag
Test with subset of documents
Gradual rollout while monitoring quality
Full deployment after validation

Rollback Plan

Feature flag to disable enhanced rendering
Fallback to current implementation
Keep old code paths during transition
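
One way to wire the flag is an environment variable read at generation time, so rollback needs no redeploy. A minimal sketch: the variable name `PDF_ENHANCED_RENDERING` and both function names are assumptions, and the returned strings stand in for calls into the enhanced and current code paths.

```python
import os

TRUTHY = {"1", "true", "yes", "on"}

def enhanced_rendering_enabled(env=None):
    """Read the (hypothetical) PDF_ENHANCED_RENDERING flag; off by default."""
    env = os.environ if env is None else env
    return env.get("PDF_ENHANCED_RENDERING", "false").strip().lower() in TRUTHY

def choose_pipeline(env=None):
    """Pick the code path; the old path stays available for rollback."""
    return "enhanced" if enhanced_rendering_enabled(env) else "legacy"

print(choose_pipeline({"PDF_ENHANCED_RENDERING": "true"}))  # enhanced
print(choose_pipeline({}))                                  # legacy
```

Defaulting to off keeps the current implementation as the behavior whenever the flag is unset.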

Open Questions

Resolved

Q: Should we modify UnifiedDocument structure?
A: No, work within existing model for compatibility

Q: How to handle missing fonts?
A: Font mapping table with safe fallbacks

Pending

Q: Should we support embedded fonts in PDFs?

Requires investigation of PDF font embedding

Q: How to handle RTL text and vertical writing?

May need specialized text layout engine

Q: Should we preserve hyperlinks and bookmarks?

Depends on user requirements
