# Implement Layout-Preserving PDF Generation and Preview

## Problem

Testing revealed three critical issues affecting user experience:

### 1. PDF Download Returns 403 Forbidden

- **Endpoint**: `GET /api/v2/tasks/{task_id}/download/pdf`
- **Error**: Backend returns HTTP 403 Forbidden
- **Impact**: Users cannot download PDF format results
- **Root Cause**: PDF generation service not implemented

### 2. Result Preview Shows Placeholder Text Instead of Layout-Preserving Content

- **Affected Pages**:
  - Results page (`/results`)
  - Task Detail page (`/tasks/{taskId}`)
- **Current Behavior**: Both pages display the placeholder message "請使用上方下載按鈕下載 Markdown、JSON 或 PDF 格式查看完整結果" ("Please use the download buttons above to download the Markdown, JSON, or PDF format to view the full results")
- **Problem**: Users cannot preview OCR results with the original document layout preserved
- **Impact**: Poor user experience - users cannot verify OCR accuracy visually

### 3. Images Extracted by PP-StructureV3 Are Not Saved to Disk

- **Affected File**: `backend/app/services/ocr_service.py:554-561`
- **Current Behavior**:
  - PP-StructureV3 extracts images from documents (tables, charts, figures)
  - `analyze_layout()` receives image objects in the `markdown_images` dictionary
  - The code only saves image path strings to JSON and never saves the actual image files
  - The result directory contains no `imgs/` folder with extracted images
- **Impact**:
  - JSON references non-existent files (e.g., `imgs/img_in_table_box_*.jpg`)
  - Layout-preserving PDF cannot embed images because the source files don't exist
  - Loss of critical visual content from original documents
- **Root Cause**: Missing image file saving logic in the `analyze_layout()` function

## Proposed Changes

### Change 0: Fix Image Extraction and Saving (PREREQUISITE)

Modify the OCR service to save extracted images to disk so that PDF generation can embed them.

**Implementation approach:**

1. **Update `analyze_layout()` Function**
   - Locate image saving code at `ocr_service.py:554-561`
   - Extract `img_obj` from `markdown_images.items()`
   - Create `imgs/` subdirectory in the result folder
   - Save each `img_obj` to disk using PIL `Image.save()`
   - Verify the saved file path matches JSON `images_metadata`

2. **File Naming and Organization**
   - PP-StructureV3 generates paths like `imgs/img_in_table_box_145_1253_2329_2488.jpg`
   - Create full path: `{result_dir}/{img_path}`
   - Ensure parent directories exist before saving
   - Handle image format conversion if needed (PNG, JPEG)

3. **Error Handling**
   - Log warnings if image objects are missing or corrupt
   - Continue processing even if individual images fail
   - Include error info in `images_metadata` for debugging

**Why This is Critical:**

- Without saved images, layout-preserving PDF cannot embed visual content
- Images contain crucial information (charts, diagrams, table contents)
- PP-StructureV3 already does the hard work of extraction - we just need to save them

### Change 1: Implement Layout-Preserving PDF Generation Service

Create a PDF generation service that reconstructs the original document layout from OCR JSON data.

**Implementation approach:**

1. **Parse JSON OCR Results**
   - Read the `text_regions` array containing text, bounding boxes, confidence scores
   - Extract page dimensions from the original file or infer them from bbox coordinates
   - Group elements by page number

2. **Generate PDF with ReportLab**
   - Create PDF canvas with original page dimensions
   - Iterate through each text region
   - Draw text at precise coordinates from bbox
   - Support Chinese fonts (e.g., Noto Sans CJK, Source Han Sans)
   - Optionally draw bounding boxes for visualization

3. **Handle Complex Elements**
   - Text: Draw at bbox coordinates with appropriate font size
   - Tables: Reconstruct from layout analysis (if available)
   - Images: Embed from `images_metadata`
   - Preserve rotation/skew from bbox geometry

4. **Caching Strategy**
   - Generate PDF once per task completion
   - Store in task result directory as `{filename}_layout.pdf`
   - Serve cached version on subsequent requests
   - Regenerate only if JSON changes

**Technical stack:**

- **ReportLab**: PDF generation with precise coordinate control
- **Pillow**: Extract dimensions from source images/PDFs, embed extracted images
- **Chinese fonts**: Noto Sans CJK or Source Han Sans (must be installed)

### Change 2: Implement In-Browser PDF Preview

Replace placeholder text with interactive PDF preview using react-pdf.

**Implementation approach:**

1. **Install react-pdf**

   ```bash
   npm install react-pdf
   ```

2. **Create PDF Viewer Component**
   - Fetch PDF from `/api/v2/tasks/{task_id}/download/pdf`
   - Render using `<Document>` and `<Page>` from react-pdf
   - Add zoom controls, page navigation
   - Show loading spinner while PDF loads

3. **Update ResultsPage and TaskDetailPage**
   - Replace placeholder with PDF viewer
   - Add download button above viewer
   - Handle errors gracefully (show error if PDF unavailable)

**Benefits:**

- Users see OCR results with original layout preserved
- Visual verification of OCR accuracy
- No download required for quick review
- Professional presentation of results

## Scope

**In scope:**

- Fix image extraction to save extracted images to disk (PREREQUISITE)
- Implement layout-preserving PDF generation service from JSON
- Install and configure Chinese fonts (Noto Sans CJK)
- Create PDF viewer component with react-pdf
- Add PDF preview to Results page and Task Detail page
- Cache generated PDFs for performance
- Embed extracted images into layout-preserving PDF
- Error handling for image saving, PDF generation and preview failures

**Out of scope:**

- OCR result editing in preview
- Advanced PDF features (annotations, search, highlights)
- Excel/JSON inline preview
- Real-time PDF regeneration (will use cached version)

## Impact

- **User Experience**: Major improvement - layout-preserving visual preview with images
- **Backend**: Significant changes - image saving fix, new PDF generation service
- **Frontend**: Medium changes - PDF viewer integration
- **Dependencies**: New - ReportLab, react-pdf, Chinese fonts (Pillow already installed)
- **Performance**: Medium - PDF generation cached after first request, minimal overhead for image saving
- **Risk**: Medium - complex coordinate transformation, font rendering, image embedding
- **Data Integrity**: High improvement - images now properly preserved alongside text
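
To illustrate Change 0, here is a minimal sketch of the image-saving step. It is not the actual `analyze_layout()` code. It assumes `markdown_images` maps relative paths (such as `imgs/...jpg`) to objects with a PIL-style `save(path)` method. The helper name `save_markdown_images` and the metadata keys `saved` and `error` are hypothetical.

```python
import logging
from pathlib import Path
from typing import Any, Dict, List

logger = logging.getLogger(__name__)


def save_markdown_images(markdown_images: Dict[str, Any], result_dir: Path) -> List[dict]:
    """Save each extracted image under result_dir and return per-image metadata.

    Hypothetical helper: assumes keys are relative paths (e.g. "imgs/x.jpg")
    and values expose a PIL-style save(path) method.
    """
    images_metadata = []
    for img_path, img_obj in markdown_images.items():
        entry = {"path": img_path, "saved": False}
        target = Path(result_dir) / img_path
        try:
            if img_obj is None:
                raise ValueError("image object is missing")
            # Ensure the imgs/ subdirectory exists before saving.
            target.parent.mkdir(parents=True, exist_ok=True)
            img_obj.save(str(target))
            entry["saved"] = True
        except Exception as exc:
            # One bad image must not abort the whole task: log and record it.
            logger.warning("Failed to save image %s: %s", img_path, exc)
            entry["error"] = str(exc)
        images_metadata.append(entry)
    return images_metadata
```

Recording failures in the returned metadata, rather than raising, matches the error-handling goal of continuing when individual images fail.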
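
The coordinate transformation named under Risk in Change 1 can be sketched as follows. OCR bounding boxes are usually given in image pixels with the origin at the top-left, while a ReportLab canvas uses points with the origin at the bottom-left. This sketch assumes each bbox is a list of four `[x, y]` corner points; the real `text_regions` format may differ. The function name `bbox_to_pdf_rect` is hypothetical.

```python
from typing import Sequence, Tuple


def bbox_to_pdf_rect(
    bbox: Sequence[Sequence[float]],
    image_size: Tuple[float, float],
    page_size: Tuple[float, float],
) -> Tuple[float, float, float, float]:
    """Map a pixel-space polygon (top-left origin) to a PDF rectangle
    (bottom-left origin), returned as (x, y, width, height) in points."""
    img_w, img_h = image_size
    page_w, page_h = page_size
    scale_x = page_w / img_w
    scale_y = page_h / img_h

    # Axis-aligned bounds of the (possibly skewed) polygon.
    xs = [p[0] for p in bbox]
    ys = [p[1] for p in bbox]
    left, right = min(xs), max(xs)
    top, bottom = min(ys), max(ys)

    x = left * scale_x
    width = (right - left) * scale_x
    height = (bottom - top) * scale_y
    # Flip the vertical axis: PDF y grows upward from the page bottom.
    y = page_h - bottom * scale_y
    return x, y, width, height
```

The returned `(x, y)` is the lower-left corner of the region. It could serve as the anchor for `canvas.drawString(x, y, text)`, with the font size derived from `height`. This axis-aligned simplification discards rotation and skew, which would need separate handling.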
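
The "regenerate only if JSON changes" rule in the caching strategy could be approximated by comparing file modification times, as in this sketch. Using mtime as the change signal is an assumption (a content hash would also work), and `needs_regeneration` is a hypothetical name.

```python
from pathlib import Path


def needs_regeneration(json_path: Path, pdf_path: Path) -> bool:
    """Return True when the cached PDF is missing or older than the OCR JSON."""
    if not pdf_path.exists():
        return True
    return json_path.stat().st_mtime > pdf_path.stat().st_mtime
```

The download endpoint would call this check before serving `{filename}_layout.pdf` and rebuild the file only when it returns `True`.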