OCR/openspec/changes/archive/2025-11-24-pdf-layout-restoration/design.md
egg 4325d024a7 chore: cleanup test files and archive pdf-layout-restoration proposal
Remove obsolete test and utility scripts:
- backend/create_test_user.py
- backend/mark_migration_done.py
- backend/fix_alembic_version.py
- backend/RUN_TESTS.md (outdated test documentation)

Archive completed pdf-layout-restoration proposal:
- Moved from openspec/changes/pdf-layout-restoration/
- To openspec/changes/archive/2025-11-24-pdf-layout-restoration/

🤖 Generated with [Claude Code](https://claude.com/claude-code)

Co-Authored-By: Claude <noreply@anthropic.com>
2025-11-24 19:43:05 +08:00

# Technical Design: PDF Layout Restoration and Preservation
## Context
### Background
The current PDF generation system loses critical layout information during the conversion process. Despite successfully extracting images, tables, and styled text in both OCR and Direct tracks, none of this information makes it to the final PDF output due to:
1. **Empty implementations**: Image saving functions are stubs
2. **Path mismatches**: Saved paths don't match expected lookup keys
3. **Fake data dependencies**: Table rendering relies on non-existent image files
4. **Format degradation**: Rich formatting is reduced to plain text blocks
### Current Issues
#### Issue 1: OCR Track Image Loss
```python
# backend/app/services/pp_structure_enhanced.py
def _save_image(self, img_data, element_id: str, result_dir: Path):
    """Save image data to file"""
    # TODO: Implement image saving
    pass  # Lines 262, 414 - NEVER SAVES ANYTHING!
```
Result: `img_path` from PP-Structure is ignored and no image files are created.
#### Issue 2: Direct Track Path Mismatch
```python
# Saves as:
element.content["saved_path"] = f"imgs/{element_id}.png" # line 745
# But converter looks for:
image_path = content.get("path") # line 180 - WRONG KEY!
```
Result: Direct track images are saved but never found.
#### Issue 3: Table Rendering Failure
```python
# Creates fake reference:
images_metadata.append({
    "path": f"table_{element.element_id}.png",  # DOESN'T EXIST
    "bbox": element.bbox
})
# Then tries to find it:
table_image = next((img for img in images_metadata
                    if "table" in img.get("path", "")), None)
if not table_image:
    return  # ALWAYS HAPPENS - NO RENDERING!
```
#### Issue 4: Text Style Loss
```python
# Has rich data:
StyleInfo(font='Arial', size=12, flags=BOLD|ITALIC, color='#000080')
# But only uses:
c.drawString(x, y, text) # No font, size, or style applied!
```
### Constraints
- Must maintain backward compatibility with existing API
- Cannot break current OCR/Direct track separation
- Should work within existing UnifiedDocument model
- Must handle both track types appropriately
## Goals / Non-Goals
### Goals
1. **Restore image rendering**: Save and correctly reference all images
2. **Fix table layout**: Render tables using actual bbox data
3. **Preserve text formatting**: Apply fonts, sizes, colors, and styles
4. **Track-specific optimization**: Different rendering for OCR vs Direct
5. **Maintain positioning**: Accurate spatial layout preservation
### Non-Goals
- Rewriting entire PDF generation system
- Changing UnifiedDocument structure
- Modifying extraction engines
- Supporting complex vector graphics
- Interactive PDF features (forms, annotations)
## Decisions
### Decision 1: Fix Image Handling Pipeline
**What**: Implement actual image saving and correct path resolution
**Implementation**:
```python
# pp_structure_enhanced.py
import shutil
from pathlib import Path

from PIL import Image


def _save_image(self, img_data, element_id: str, result_dir: Path):
    """Save image data to file and return its path relative to result_dir"""
    img_dir = result_dir / "imgs"
    img_dir.mkdir(parents=True, exist_ok=True)
    dst_path = img_dir / f"{element_id}.png"
    if isinstance(img_data, (str, Path)):
        # Copy existing file
        shutil.copy2(Path(img_data), dst_path)
    else:
        # Save image data (assumed to be a NumPy array)
        Image.fromarray(img_data).save(dst_path)
    return f"imgs/{element_id}.png"  # Relative path
```
**Path Resolution**:
```python
# pdf_generator_service.py - convert_unified_document_to_ocr_data
def _get_image_path(element):
    """Get image path with fallback logic"""
    content = element.content
    # Try multiple path locations
    for key in ["saved_path", "path", "image_path"]:
        if isinstance(content, dict) and key in content:
            return content[key]
    # Check metadata
    if hasattr(element, 'metadata') and element.metadata:
        return element.metadata.get('path')
    return None
```
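The key fallback above returns whatever string is stored, which may still point at a file that was never written. A minimal sketch of a stricter lookup follows, assuming paths are stored relative to the task's result directory; `resolve_image_file` and `IMAGE_PATH_KEYS` are hypothetical names, not part of the existing converter.

```python
from pathlib import Path
from typing import Optional

# Keys tried in order; mirrors the fallback list used by _get_image_path.
IMAGE_PATH_KEYS = ("saved_path", "path", "image_path")


def resolve_image_file(content: dict, result_dir: Path) -> Optional[Path]:
    """Return the first referenced image file that exists on disk, else None."""
    for key in IMAGE_PATH_KEYS:
        rel = content.get(key)
        if not rel:
            continue
        candidate = Path(rel)
        # Relative paths are assumed to be relative to result_dir.
        if not candidate.is_absolute():
            candidate = result_dir / candidate
        if candidate.is_file():
            return candidate
    return None
```

Checking `is_file()` at lookup time means a stale or fake reference is skipped in favor of the next key instead of failing later during rendering.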
### Decision 2: Direct Table Bbox Usage
**What**: Use table element's own bbox instead of fake image references
**Current Problem**:
```python
# Creates fake image that doesn't exist
images_metadata.append({"path": f"table_{id}.png", "bbox": bbox})
# Later fails to find it and skips rendering
```
**Solution**:
```python
def draw_table_region(self, c, table_element, page_width, page_height):
    """Draw table using its own bbox"""
    # Get bbox directly from element
    bbox = table_element.get("bbox")
    if not bbox:
        # Fallback to polygon
        bbox_polygon = table_element.get("bbox_polygon")
        if bbox_polygon:
            bbox = self._polygon_to_bbox(bbox_polygon)
    if not bbox:
        logger.warning(f"No bbox for table {table_element.get('element_id')}")
        return
    # Use bbox to position and render table
    x, y, width, height = self._normalize_bbox(bbox, page_width, page_height)
    self._render_table_html(c, table_element.get("content"), x, y, width, height)
```
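The two helpers called above, `_polygon_to_bbox` and `_normalize_bbox`, are not shown in this design. One possible shape is sketched below as standalone functions, under two assumptions that the source does not confirm: bboxes are `(x0, y0, x1, y1)` with a top-left origin, and they are already in the same units as the page size.

```python
from typing import Sequence, Tuple

BBox = Tuple[float, float, float, float]


def polygon_to_bbox(polygon: Sequence[Sequence[float]]) -> BBox:
    """Collapse a list of (x, y) points into an axis-aligned (x0, y0, x1, y1) box."""
    xs = [point[0] for point in polygon]
    ys = [point[1] for point in polygon]
    return (min(xs), min(ys), max(xs), max(ys))


def normalize_bbox(bbox: BBox, page_width: float, page_height: float) -> BBox:
    """Clamp a top-left-origin box to the page and return (x, y, width, height)
    in PDF coordinates, where the origin is the bottom-left corner."""
    x0, y0, x1, y1 = bbox
    x0, x1 = max(0, min(x0, page_width)), max(0, min(x1, page_width))
    y0, y1 = max(0, min(y0, page_height)), max(0, min(y1, page_height))
    # The box's lower edge (y1 from the top) becomes its PDF y coordinate.
    return (x0, page_height - y1, x1 - x0, y1 - y0)
```

For example, a table at `(10, 20, 110, 70)` on a 200 x 300 page maps to `x=10, y=230, width=100, height=50`.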
### Decision 3: Track-Specific Rendering
**What**: Different rendering approaches based on processing track
**Implementation Strategy**:
```python
def generate_from_unified_document(self, unified_doc, output_path):
    """Generate PDF with track-specific rendering"""
    track = unified_doc.metadata.processing_track
    if track == "direct":
        return self._generate_direct_track_pdf(unified_doc, output_path)
    else:  # OCR track
        return self._generate_ocr_track_pdf(unified_doc, output_path)

def _generate_direct_track_pdf(self, unified_doc, output_path):
    """Rich rendering for direct track"""
    # Preserve:
    # - Font families, sizes, weights
    # - Text colors and backgrounds
    # - Precise positioning
    # - Line breaks and paragraph spacing

def _generate_ocr_track_pdf(self, unified_doc, output_path):
    """Simplified rendering for OCR track"""
    # Best effort with:
    # - Detected layout regions
    # - Estimated font sizes
    # - Basic positioning
```
### Decision 4: Style Preservation System
**What**: Apply StyleInfo to text rendering
**Text Rendering Enhancement**:
```python
def _apply_text_style(self, c, style_info):
    """Apply text styling from StyleInfo"""
    if not style_info:
        return
    # Font selection
    font_name = self._map_font(style_info.font)
    # Note: suffixing works for Helvetica and Courier only. Times variants are
    # named Times-Bold / Times-Italic / Times-BoldItalic, so other names fall
    # through to the Helvetica fallback below.
    if style_info.flags:
        if style_info.flags & BOLD and style_info.flags & ITALIC:
            font_name = f"{font_name}-BoldOblique"
        elif style_info.flags & BOLD:
            font_name = f"{font_name}-Bold"
        elif style_info.flags & ITALIC:
            font_name = f"{font_name}-Oblique"
    # Apply font and size
    try:
        c.setFont(font_name, style_info.size or 12)
    except Exception:
        # Unknown font name: fall back to a font that is always available
        c.setFont("Helvetica", style_info.size or 12)
    # Apply color
    if style_info.color:
        r, g, b = self._parse_color(style_info.color)
        c.setFillColorRGB(r, g, b)

def draw_text_region_enhanced(self, c, element, page_width, page_height):
    """Enhanced text rendering with style preservation"""
    # Apply style
    if hasattr(element, 'style') and element.style:
        self._apply_text_style(c, element.style)
    # Render with line breaks
    # (x, y, and line_height are assumed to come from the element's bbox
    # and font size; their computation is not shown in this design)
    text_lines = element.content.split('\n')
    for line in text_lines:
        c.drawString(x, y, line)
        y -= line_height
```
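`_parse_color` is referenced above but not defined in this design. A minimal sketch is shown below as a standalone function, assuming `style_info.color` is a hex string such as `'#000080'` and that `c` is a ReportLab canvas, whose `setFillColorRGB` takes components in the 0 to 1 range. The black fallback for unparseable values is an assumption.

```python
from typing import Tuple

RGB = Tuple[float, float, float]


def parse_color(value: object, default: RGB = (0.0, 0.0, 0.0)) -> RGB:
    """Convert '#RRGGBB' or '#RGB' to (r, g, b) floats in the range 0..1.

    Returns `default` (black) when the value cannot be parsed.
    """
    if not isinstance(value, str):
        return default
    hex_part = value.strip().lstrip("#")
    if len(hex_part) == 3:
        # Expand shorthand: 'abc' -> 'aabbcc'
        hex_part = "".join(ch * 2 for ch in hex_part)
    if len(hex_part) != 6:
        return default
    try:
        r, g, b = (int(hex_part[i:i + 2], 16) for i in (0, 2, 4))
    except ValueError:
        return default
    return (r / 255.0, g / 255.0, b / 255.0)
```

Returning a default instead of raising keeps a single malformed color from aborting the whole page render.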
## Implementation Phases
### Phase 1: Critical Fixes (Immediate)
1. Implement `_save_image()` in pp_structure_enhanced.py
2. Fix path resolution in converter
3. Fix table bbox usage
4. Test with sample documents
### Phase 2: Basic Style Preservation (Week 1)
1. Implement style application for Direct track
2. Add font mapping system
3. Handle text colors
4. Preserve line breaks
### Phase 3: Advanced Layout (Week 2)
1. Implement span-level rendering
2. Add paragraph alignment
3. Handle text indentation
4. Preserve list formatting
### Phase 4: Optimization (Week 3)
1. Cache font metrics
2. Optimize image handling
3. Batch rendering operations
4. Performance testing
## Risks / Trade-offs
### Risk 1: Font Availability
**Risk**: System fonts may not match document fonts
**Mitigation**: Font mapping table with fallbacks
```python
FONT_MAPPING = {
    'Arial': 'Helvetica',
    'Times New Roman': 'Times-Roman',
    'Courier New': 'Courier',
    # ... more mappings
}
```
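The mapping covers family names only. Bold and italic variants of the standard PDF fonts do not share one suffix scheme (Helvetica and Courier use `-Oblique`, Times uses `-Italic` and drops `-Roman`), so a lookup table is safer than string concatenation. The sketch below illustrates this; `resolve_font_name`, `FAMILY_MAPPING`, and `VARIANTS` are hypothetical names, and falling back to Helvetica for unknown families is an assumption.

```python
# Family mapping taken from the table above.
FAMILY_MAPPING = {
    "Arial": "Helvetica",
    "Times New Roman": "Times-Roman",
    "Courier New": "Courier",
}

# Standard PDF font names, ordered as: regular, bold, italic, bold+italic.
VARIANTS = {
    "Helvetica": ("Helvetica", "Helvetica-Bold",
                  "Helvetica-Oblique", "Helvetica-BoldOblique"),
    "Times-Roman": ("Times-Roman", "Times-Bold",
                    "Times-Italic", "Times-BoldItalic"),
    "Courier": ("Courier", "Courier-Bold",
                "Courier-Oblique", "Courier-BoldOblique"),
}


def resolve_font_name(family: str, bold: bool = False, italic: bool = False) -> str:
    """Map a document font family and style flags to a standard PDF font name."""
    base = FAMILY_MAPPING.get(family, family)
    if base not in VARIANTS:
        base = "Helvetica"  # assumed safe fallback for unmapped families
    index = (1 if bold else 0) + (2 if italic else 0)
    return VARIANTS[base][index]
```

With this table, bold italic Times New Roman resolves to `Times-BoldItalic` instead of a name that does not exist.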
### Risk 2: Complex Layouts
**Risk**: Some layouts too complex to preserve perfectly
**Mitigation**: Graceful degradation with logging
- Attempt full preservation
- Fall back to simpler rendering if needed
- Log what couldn't be preserved
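The three mitigation steps can be sketched as a small fallback chain. This is only an illustration: `render_with_fallback` and the logger name are hypothetical, and each renderer is assumed to be a zero-argument callable that raises on failure.

```python
import logging
from typing import Callable, Optional, Sequence, Tuple

logger = logging.getLogger("pdf_layout")  # hypothetical logger name


def render_with_fallback(
    element_id: str,
    renderers: Sequence[Tuple[str, Callable[[], None]]],
) -> Optional[str]:
    """Try renderers from richest to simplest.

    Returns the name of the first renderer that succeeds, or None if all fail.
    Every failure is logged so that lost fidelity can be audited later.
    """
    for name, render in renderers:
        try:
            render()
            return name
        except Exception as exc:
            logger.warning("Renderer %r failed for %s: %s", name, element_id, exc)
    logger.error("No renderer succeeded for %s", element_id)
    return None
```

A caller might pass `[("styled", draw_styled), ("plain", draw_plain)]` and record the returned name to measure how often degradation happens.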
### Risk 3: Performance Impact
**Risk**: Style processing may slow down PDF generation
**Mitigation**:
- Cache computed styles
- Batch similar operations
- Lazy loading for images
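Caching computed styles could be as simple as memoizing on the hashable style fields. The sketch below uses `functools.lru_cache`; `resolve_style` is a hypothetical helper, the cache size is arbitrary, and the color is assumed to be a well-formed `#RRGGBB` string.

```python
from functools import lru_cache
from typing import Tuple


@lru_cache(maxsize=256)
def resolve_style(font: str, size: float, color: str) -> Tuple[str, float, Tuple[float, ...]]:
    """Resolve style fields into render-ready values; repeated styles hit the cache."""
    hex_part = color.lstrip("#")
    rgb = tuple(int(hex_part[i:i + 2], 16) / 255.0 for i in (0, 2, 4))
    return (font, size or 12, rgb)
```

Documents usually reuse a handful of styles across thousands of spans, so most calls return the cached tuple; `resolve_style.cache_info()` reports the hit rate.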
### Trade-off: Accuracy vs Speed
- Direct track: Prioritize accuracy (users chose quality)
- OCR track: Balance accuracy with processing time
## Testing Strategy
### Unit Tests
```python
def test_image_saving():
    """Test that images are actually saved"""

def test_path_resolution():
    """Test path lookup with fallbacks"""

def test_table_bbox_rendering():
    """Test tables render without fake images"""

def test_style_application():
    """Test font/color/size application"""
```
### Integration Tests
- Process document with images → verify images in PDF
- Process document with tables → verify table layout
- Process styled document → verify formatting preserved
### Visual Regression Tests
- Generate PDFs for test documents
- Compare with expected outputs
- Flag visual differences
## Success Metrics
1. **Image Rendering**: 100% of detected images appear in PDF
2. **Table Rendering**: 100% of tables rendered with correct layout
3. **Style Preservation**:
- Direct track: 90%+ style attributes preserved
- OCR track: Basic formatting maintained
4. **Performance**: <10% increase in generation time
5. **Quality**: User satisfaction with output appearance
## Migration Plan
### Rollout Strategy
1. Deploy fixes behind feature flag
2. Test with subset of documents
3. Gradual rollout monitoring quality
4. Full deployment after validation
### Rollback Plan
- Feature flag to disable enhanced rendering
- Fallback to current implementation
- Keep old code paths during transition
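A feature flag that keeps both code paths selectable might look like the sketch below. The environment variable name `PDF_ENHANCED_RENDERING` and both function names are hypothetical; the real flag mechanism is not specified in this design.

```python
import os
from typing import Callable, Mapping, Optional

FLAG_ENV_VAR = "PDF_ENHANCED_RENDERING"  # hypothetical flag name


def enhanced_rendering_enabled(environ: Optional[Mapping[str, str]] = None) -> bool:
    """Return True when the flag is set to a truthy value; default is off."""
    env = os.environ if environ is None else environ
    return env.get(FLAG_ENV_VAR, "").strip().lower() in {"1", "true", "yes", "on"}


def choose_renderer(
    enhanced: Callable,
    legacy: Callable,
    environ: Optional[Mapping[str, str]] = None,
) -> Callable:
    """Pick the enhanced renderer only when the flag is on, else the legacy one."""
    return enhanced if enhanced_rendering_enabled(environ) else legacy
```

Defaulting to the legacy path means an unset or mistyped flag behaves like a rollback.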
## Open Questions
### Resolved
Q: Should we modify UnifiedDocument structure?
A: No, work within existing model for compatibility
Q: How to handle missing fonts?
A: Font mapping table with safe fallbacks
### Pending
Q: Should we support embedded fonts in PDFs?
- Requires investigation of PDF font embedding
Q: How to handle RTL text and vertical writing?
- May need specialized text layout engine
Q: Should we preserve hyperlinks and bookmarks?
- Depends on user requirements