fix: correct page-number errors and text overlap in PDF generation

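## Fixes

### 1. Incorrect page-number assignment
- **Problem**: Page numbers in layout_data and images_metadata were overwritten by the 1-based logic, leaving all of them as 0
- **Fix**: Added a current_page parameter to analyze_layout() so the correct 0-based page number is set at the source
- **Impact**: Tables and images now appear on the correct pages

### 2. Text overlapping tables/images
- **Problem**: Filtering relied on the non-existent 'tables' and 'image_regions' fields, so the filter had no effect
- **Fix**: Use images_metadata instead (it contains the bbox of every table/image)
- **Added**: _bbox_overlaps() detects any overlap (not only full containment)
- **Impact**: Text no longer covers table and image regions

### 3. Rendering order
- **Change**: images (bottom layer) → tables (middle layer) → text (top layer)
- **Impact**: Visual layering is more correct

## Technical details
- ocr_service.py: pass the current_page parameter through; remove the page-number overwrite logic
- pdf_generator_service.py:
  - Add the _bbox_overlaps() method
  - Update _filter_text_in_regions() to use overlap detection
  - Switch the data source to images_metadata
  - Adjust the drawing order

## Known limitations
- 21.6% of the text is still lost to filtering (an inherent problem of the coordinate-based positioning approach)
- The full layout information from PP-StructureV3 (parsing_res_list, layout_bbox) is not used

🤖 Generated with [Claude Code](https://claude.com/claude-code)

Co-Authored-By: Claude <noreply@anthropic.com>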
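
For illustration, the "any overlap, not only full containment" check described in item 2 could look roughly like the sketch below. It assumes each bbox is an axis-aligned `(x0, y0, x1, y1)` tuple; the real `_bbox_overlaps()` signature and bbox format in pdf_generator_service.py are not shown in this commit message, so `bbox_overlaps` and `filter_text` are hypothetical names.

```python
from typing import List, Sequence, Tuple

# Assumed bbox layout: (x0, y0, x1, y1) with x0 <= x1 and y0 <= y1.
BBox = Tuple[float, float, float, float]


def bbox_overlaps(a: BBox, b: BBox) -> bool:
    """Return True if the two boxes share any interior area.

    Boxes that only touch along an edge are treated as non-overlapping.
    """
    ax0, ay0, ax1, ay1 = a
    bx0, by0, bx1, by1 = b
    return ax0 < bx1 and bx0 < ax1 and ay0 < by1 and by0 < ay1


def filter_text(text_boxes: Sequence[BBox], regions: Sequence[BBox]) -> List[BBox]:
    """Keep only the text boxes that overlap none of the table/image regions."""
    return [t for t in text_boxes if not any(bbox_overlaps(t, r) for r in regions)]
```

A partial overlap is enough to drop a text box here, which is consistent with some text being lost to filtering as noted under the known limitations.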
2025-11-18 18:57:01 +08:00
parent 5cf4010c9b
commit 0edc56b03f
6 changed files with 485 additions and 45 deletions
--- a/openspec/changes/fix-result-preview-and-pdf-download/proposal.md
+++ b/openspec/changes/fix-result-preview-and-pdf-download/proposal.md
@@ -0,0 +1,148 @@
+# Implement Layout-Preserving PDF Generation and Preview
+
+## Problem
+
+Testing revealed three critical issues affecting user experience:
+
+### 1. PDF Download Returns 403 Forbidden
+- **Endpoint**: `GET /api/v2/tasks/{task_id}/download/pdf`
+- **Error**: Backend returns HTTP 403 Forbidden
+- **Impact**: Users cannot download PDF format results
+- **Root Cause**: PDF generation service not implemented
+
+### 2. Result Preview Shows Placeholder Text Instead of Layout-Preserving Content
+- **Affected Pages**:
+  - Results page (`/results`)
+  - Task Detail page (`/tasks/{taskId}`)
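+- **Current Behavior**: Both pages display the placeholder message "請使用上方下載按鈕下載 Markdown、JSON 或 PDF 格式查看完整結果" ("Please use the download buttons above to download the Markdown, JSON, or PDF format to view the full results")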
+- **Problem**: Users cannot preview OCR results with original document layout preserved
+- **Impact**: Poor user experience - users cannot verify OCR accuracy visually
+
+### 3. Images Extracted by PP-StructureV3 Are Not Saved to Disk
+- **Affected File**: `backend/app/services/ocr_service.py:554-561`
+- **Current Behavior**:
+  - PP-StructureV3 extracts images from documents (tables, charts, figures)
+  - `analyze_layout()` receives image objects in `markdown_images` dictionary
+  - Code only saves image path strings to JSON, never saves actual image files
+  - Result directory contains no `imgs/` folder with extracted images
+- **Impact**:
+  - JSON references non-existent files (e.g., `imgs/img_in_table_box_*.jpg`)
+  - Layout-preserving PDF cannot embed images because source files don't exist
+  - Loss of critical visual content from original documents
+- **Root Cause**: Missing image file saving logic in `analyze_layout()` function
+
+## Proposed Changes
+
+### Change 0: Fix Image Extraction and Saving (PREREQUISITE)
+Modify OCR service to save extracted images to disk before PDF generation can embed them.
+
+**Implementation approach:**
+1. **Update `analyze_layout()` Function**
+   - Locate image saving code at `ocr_service.py:554-561`
+   - Extract `img_obj` from `markdown_images.items()`
+   - Create `imgs/` subdirectory in result folder
+   - Save each `img_obj` to disk using PIL `Image.save()`
+   - Verify saved file path matches JSON `images_metadata`
+
+2. **File Naming and Organization**
+   - PP-StructureV3 generates paths like `imgs/img_in_table_box_145_1253_2329_2488.jpg`
+   - Create full path: `{result_dir}/{img_path}`
+   - Ensure parent directories exist before saving
+   - Handle image format conversion if needed (PNG, JPEG)
+
+3. **Error Handling**
+   - Log warnings if image objects are missing or corrupt
+   - Continue processing even if individual images fail
+   - Include error info in images_metadata for debugging
+
+**Why This is Critical:**
+- Without saved images, layout-preserving PDF cannot embed visual content
+- Images contain crucial information (charts, diagrams, table contents)
+- PP-StructureV3 already does the hard work of extraction - we just need to save them
+
+### Change 1: Implement Layout-Preserving PDF Generation Service
+Create a PDF generation service that reconstructs the original document layout from OCR JSON data.
+
+**Implementation approach:**
+1. **Parse JSON OCR Results**
+   - Read `text_regions` array containing text, bounding boxes, confidence scores
+   - Extract page dimensions from original file or infer from bbox coordinates
+   - Group elements by page number
+
+2. **Generate PDF with ReportLab**
+   - Create PDF canvas with original page dimensions
+   - Iterate through each text region
+   - Draw text at precise coordinates from bbox
+   - Support Chinese fonts (e.g., Noto Sans CJK, Source Han Sans)
+   - Optionally draw bounding boxes for visualization
+
+3. **Handle Complex Elements**
+   - Text: Draw at bbox coordinates with appropriate font size
+   - Tables: Reconstruct from layout analysis (if available)
+   - Images: Embed from `images_metadata`
+   - Preserve rotation/skew from bbox geometry
+
+4. **Caching Strategy**
+   - Generate PDF once per task completion
+   - Store in task result directory as `{filename}_layout.pdf`
+   - Serve cached version on subsequent requests
+   - Regenerate only if JSON changes
+
+**Technical stack:**
+- **ReportLab**: PDF generation with precise coordinate control
+- **Pillow**: Extract dimensions from source images/PDFs, embed extracted images
+- **Chinese fonts**: Noto Sans CJK or Source Han Sans (must be installed)
+
+### Change 2: Implement In-Browser PDF Preview
+Replace placeholder text with interactive PDF preview using react-pdf.
+
+**Implementation approach:**
+1. **Install react-pdf**
+   ```bash
+   npm install react-pdf
+   ```
+
+2. **Create PDF Viewer Component**
+   - Fetch PDF from `/api/v2/tasks/{task_id}/download/pdf`
+   - Render using `<Document>` and `<Page>` from react-pdf
+   - Add zoom controls, page navigation
+   - Show loading spinner while PDF loads
+
+3. **Update ResultsPage and TaskDetailPage**
+   - Replace placeholder with PDF viewer
+   - Add download button above viewer
+   - Handle errors gracefully (show error if PDF unavailable)
+
+**Benefits:**
+- Users see OCR results with original layout preserved
+- Visual verification of OCR accuracy
+- No download required for quick review
+- Professional presentation of results
+
+## Scope
+
+**In scope:**
+- Fix image extraction to save extracted images to disk (PREREQUISITE)
+- Implement layout-preserving PDF generation service from JSON
+- Install and configure Chinese fonts (Noto Sans CJK)
+- Create PDF viewer component with react-pdf
+- Add PDF preview to Results page and Task Detail page
+- Cache generated PDFs for performance
+- Embed extracted images into layout-preserving PDF
+- Error handling for image saving, PDF generation and preview failures
+
+**Out of scope:**
+- OCR result editing in preview
+- Advanced PDF features (annotations, search, highlights)
+- Excel/JSON inline preview
+- Real-time PDF regeneration (will use cached version)
+
+## Impact
+
+- **User Experience**: Major improvement - layout-preserving visual preview with images
+- **Backend**: Significant changes - image saving fix, new PDF generation service
+- **Frontend**: Medium changes - PDF viewer integration
+- **Dependencies**: New - ReportLab, react-pdf, Chinese fonts (Pillow already installed)
+- **Performance**: Medium - PDF generation cached after first request, minimal overhead for image saving
+- **Risk**: Medium - complex coordinate transformation, font rendering, image embedding
+- **Data Integrity**: High improvement - images now properly preserved alongside text
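
The "complex coordinate transformation" risk listed under Impact can be made concrete with a small sketch. It assumes OCR bboxes are given as `(x0, y0, x1, y1)` in image pixels with the origin at the top-left, while the PDF canvas uses points with the origin at the bottom-left (ReportLab's default). The function name `to_pdf_rect` and its parameters are hypothetical and not taken from the actual service.

```python
from typing import Tuple


def to_pdf_rect(
    bbox: Tuple[float, float, float, float],
    image_size: Tuple[float, float],
    page_size: Tuple[float, float],
) -> Tuple[float, float, float, float]:
    """Map a top-left-origin pixel bbox to a bottom-left-origin PDF rectangle.

    Returns (x, y, width, height) in page units, where (x, y) is the
    lower-left corner of the rectangle on the PDF page.
    """
    x0, y0, x1, y1 = bbox
    img_w, img_h = image_size
    page_w, page_h = page_size

    # Scale factors from image pixels to page units (assumed uniform per axis).
    sx = page_w / img_w
    sy = page_h / img_h

    x = x0 * sx
    width = (x1 - x0) * sx
    height = (y1 - y0) * sy
    # Flip the y axis: the bottom edge of the box in image space (y1)
    # becomes the lower-left y coordinate in PDF space.
    y = page_h - y1 * sy
    return x, y, width, height
```

The resulting `(x, y, width, height)` tuple has the shape that a ReportLab call such as `canvas.drawImage(path, x, y, width, height)` expects; rotation and skew are not handled in this sketch.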