OCR/openspec/changes/archive/2025-11-18-fix-result-preview-and-pdf-download/proposal.md

# Implement Layout-Preserving PDF Generation and Preview

## Problem

Testing revealed three critical issues affecting user experience:

### 1. PDF Download Returns 403 Forbidden
- **Endpoint**: `GET /api/v2/tasks/{task_id}/download/pdf`
- **Error**: Backend returns HTTP 403 Forbidden
- **Impact**: Users cannot download PDF format results
- **Root Cause**: PDF generation service not implemented

### 2. Result Preview Shows Placeholder Text Instead of Layout-Preserving Content
- **Affected Pages**:
  - Results page (`/results`)
  - Task Detail page (`/tasks/{taskId}`)
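- **Current Behavior**: Both pages display the placeholder message "請使用上方下載按鈕下載 Markdown、JSON 或 PDF 格式查看完整結果" ("Please use the download buttons above to download the Markdown, JSON, or PDF format to view the full results")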
- **Problem**: Users cannot preview OCR results with original document layout preserved
- **Impact**: Poor user experience - users cannot verify OCR accuracy visually

### 3. Images Extracted by PP-StructureV3 Are Not Saved to Disk
- **Affected File**: `backend/app/services/ocr_service.py:554-561`
- **Current Behavior**:
  - PP-StructureV3 extracts images from documents (tables, charts, figures)
  - `analyze_layout()` receives image objects in `markdown_images` dictionary
  - The code writes only the image path strings to JSON; it never saves the image files themselves
  - Result directory contains no `imgs/` folder with extracted images
- **Impact**:
  - JSON references non-existent files (e.g., `imgs/img_in_table_box_*.jpg`)
  - Layout-preserving PDF cannot embed images because source files don't exist
  - Loss of critical visual content from original documents
- **Root Cause**: Missing image file saving logic in `analyze_layout()` function

## Proposed Changes

### Change 0: Fix Image Extraction and Saving (PREREQUISITE)
Modify OCR service to save extracted images to disk before PDF generation can embed them.

**Implementation approach:**
1. **Update `analyze_layout()` Function**
   - Locate the image-handling code at `ocr_service.py:554-561`
   - Extract `img_obj` from `markdown_images.items()`
   - Create `imgs/` subdirectory in result folder
   - Save each `img_obj` to disk using PIL `Image.save()`
   - Verify saved file path matches JSON `images_metadata`

2. **File Naming and Organization**
   - PP-StructureV3 generates paths like `imgs/img_in_table_box_145_1253_2329_2488.jpg`
   - Create full path: `{result_dir}/{img_path}`
   - Ensure parent directories exist before saving
   - Handle image format conversion if needed (PNG, JPEG)

3. **Error Handling**
   - Log warnings if image objects are missing or corrupt
   - Continue processing even if individual images fail
   - Include error info in images_metadata for debugging
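
The steps above could be sketched roughly as follows. This is a minimal illustration, not the actual `analyze_layout()` code: it assumes `markdown_images` maps relative paths (such as `imgs/...jpg`) to objects with a PIL-style `save(path)` method, and the helper name `save_markdown_images` and the `saved`/`error` metadata keys are hypothetical.

```python
import logging
from pathlib import Path
from typing import Any, Dict, List

logger = logging.getLogger(__name__)


def save_markdown_images(markdown_images: Dict[str, Any], result_dir: Path) -> List[dict]:
    """Save each extracted image under result_dir and return per-image metadata.

    Assumption: every value exposes a PIL-style ``save(path)`` method.
    """
    images_metadata = []
    for img_path, img_obj in markdown_images.items():
        entry = {"path": img_path, "saved": False}  # hypothetical metadata shape
        try:
            if img_obj is None:
                raise ValueError("image object is missing")
            target = Path(result_dir) / img_path
            # Ensure the imgs/ subdirectory (or any parent) exists before saving.
            target.parent.mkdir(parents=True, exist_ok=True)
            img_obj.save(target)
            entry["saved"] = True
        except Exception as exc:
            # A single bad image should not abort the whole OCR result.
            logger.warning("Failed to save %s: %s", img_path, exc)
            entry["error"] = str(exc)
        images_metadata.append(entry)
    return images_metadata
```

Recording the failure in the metadata entry, rather than raising, keeps processing going when an individual image fails while still leaving a trace for debugging.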

**Why This is Critical:**
- Without saved images, layout-preserving PDF cannot embed visual content
- Images contain crucial information (charts, diagrams, table contents)
- PP-StructureV3 already does the hard work of extraction - we just need to save them

### Change 1: Implement Layout-Preserving PDF Generation Service
Create a PDF generation service that reconstructs the original document layout from OCR JSON data.

**Implementation approach:**
1. **Parse JSON OCR Results**
   - Read `text_regions` array containing text, bounding boxes, confidence scores
   - Extract page dimensions from original file or infer from bbox coordinates
   - Group elements by page number

2. **Generate PDF with ReportLab**
   - Create PDF canvas with original page dimensions
   - Iterate through each text region
   - Draw text at precise coordinates from bbox
   - Support Chinese fonts (e.g., Noto Sans CJK, Source Han Sans)
   - Optionally draw bounding boxes for visualization

3. **Handle Complex Elements**
   - Text: Draw at bbox coordinates with appropriate font size
   - Tables: Reconstruct from layout analysis (if available)
   - Images: Embed from `images_metadata`
   - Preserve rotation/skew from bbox geometry

4. **Caching Strategy**
   - Generate PDF once per task completion
   - Store in task result directory as `{filename}_layout.pdf`
   - Serve cached version on subsequent requests
   - Regenerate only if JSON changes
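
The coordinate handling is the risky part of this change, because OCR bounding boxes normally use a top-left origin while ReportLab's canvas uses a bottom-left origin. The sketch below shows one way to do the conversion and the per-page grouping. It assumes each bbox is a list of four `[x, y]` points and that each region carries a `page` key; both are assumptions about the JSON shape, and the function names are hypothetical.

```python
from collections import defaultdict
from typing import Dict, List, Sequence, Tuple

Point = Sequence[float]


def bbox_to_pdf_rect(bbox: Sequence[Point], page_height: float) -> Tuple[float, float, float, float]:
    """Convert a polygon bbox (top-left origin) into (x, y, width, height)
    measured from the bottom-left corner of the page."""
    xs = [p[0] for p in bbox]
    ys = [p[1] for p in bbox]
    left, right = min(xs), max(xs)
    top, bottom = min(ys), max(ys)
    # Flip the vertical axis: the bottom edge of the box becomes the PDF y.
    return (left, page_height - bottom, right - left, bottom - top)


def group_by_page(text_regions: List[dict]) -> Dict[int, List[dict]]:
    """Group regions by their (assumed) 'page' key, defaulting to page 1."""
    pages: Dict[int, List[dict]] = defaultdict(list)
    for region in text_regions:
        pages[region.get("page", 1)].append(region)
    return dict(pages)
```

The resulting `x` and `y` could then be passed to ReportLab's `canvas.drawString(x, y, text)`, with the box height serving as a rough upper bound for the font size. This axis-aligned version discards rotation and skew, which would need separate handling.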

**Technical stack:**
- **ReportLab**: PDF generation with precise coordinate control
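- **Pillow**: Read dimensions from source images and embed extracted images (Pillow cannot open PDFs, so page sizes for PDF sources need another reader or inference from bbox coordinates)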
- **Chinese fonts**: Noto Sans CJK or Source Han Sans (must be installed)
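
The caching strategy from Change 1 could be implemented with a simple staleness check. The sketch below treats a newer modification time on the JSON file as "JSON changed", which is an assumption (a content hash would be stricter); the helper names are hypothetical.

```python
from pathlib import Path


def layout_pdf_path(result_dir: Path, filename: str) -> Path:
    """Return the cached PDF location, following the {filename}_layout.pdf convention."""
    return Path(result_dir) / f"{filename}_layout.pdf"


def needs_regeneration(json_path: Path, pdf_path: Path) -> bool:
    """True if no cached PDF exists or the OCR JSON is newer than the cached PDF."""
    if not pdf_path.exists():
        return True
    return json_path.stat().st_mtime > pdf_path.stat().st_mtime
```

The download endpoint could call `needs_regeneration()` before serving, generating the PDF only when it returns `True`.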

### Change 2: Implement In-Browser PDF Preview
Replace placeholder text with interactive PDF preview using react-pdf.

**Implementation approach:**
1. **Install react-pdf**
   ```bash
   npm install react-pdf
   ```

2. **Create PDF Viewer Component**
   - Fetch PDF from `/api/v2/tasks/{task_id}/download/pdf`
   - Render using `<Document>` and `<Page>` from react-pdf
   - Add zoom controls, page navigation
   - Show loading spinner while PDF loads

3. **Update ResultsPage and TaskDetailPage**
   - Replace placeholder with PDF viewer
   - Add download button above viewer
   - Handle errors gracefully (show error if PDF unavailable)

**Benefits:**
- Users see OCR results with original layout preserved
- Visual verification of OCR accuracy
- No download required for quick review
- Professional presentation of results

## Scope

**In scope:**
- Fix image extraction to save extracted images to disk (PREREQUISITE)
- Implement layout-preserving PDF generation service from JSON
- Install and configure Chinese fonts (Noto Sans CJK)
- Create PDF viewer component with react-pdf
- Add PDF preview to Results page and Task Detail page
- Cache generated PDFs for performance
- Embed extracted images into layout-preserving PDF
- Error handling for image saving, PDF generation and preview failures

**Out of scope:**
- OCR result editing in preview
- Advanced PDF features (annotations, search, highlights)
- Excel/JSON inline preview
- Real-time PDF regeneration (will use cached version)

## Impact

- **User Experience**: Major improvement - layout-preserving visual preview with images
- **Backend**: Significant changes - image saving fix, new PDF generation service
- **Frontend**: Medium changes - PDF viewer integration
- **Dependencies**: New - ReportLab, react-pdf, Chinese fonts (Pillow already installed)
- **Performance**: Medium - PDF generation cached after first request, minimal overhead for image saving
- **Risk**: Medium - complex coordinate transformation, font rendering, image embedding
- **Data Integrity**: High improvement - images now properly preserved alongside text