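## Fixes

### 1. Incorrect page number assignment
- **Problem**: The page numbers in layout_data and images_metadata were overwritten by 1-based numbering, which left all of them as 0
- **Fix**: Added a current_page parameter to analyze_layout() so the correct 0-based page number is set at the source
- **Impact**: Tables and images now appear on the correct pages

### 2. Text overlapping tables/images
- **Problem**: Filtering relied on the nonexistent 'tables' and 'image_regions' fields, so the filter had no effect
- **Fix**: Switched to images_metadata (which contains the bbox of every table/image)
- **Added**: _bbox_overlaps() detects any overlap (not only full containment)
- **Impact**: Text no longer covers table and image regions

### 3. Rendering order
- **Change**: Images (bottom layer) → tables (middle layer) → text (top layer)
- **Impact**: The visual layering is more correct

## Technical details
- ocr_service.py: pass the current_page parameter through; remove the page-number overwrite logic
- pdf_generator_service.py:
  - Added the _bbox_overlaps() method
  - Updated _filter_text_in_regions() to use overlap detection
  - Corrected the data source to images_metadata
  - Adjusted the drawing order

## Known limitations
- 21.6% of the text is still lost to filtering (an inherent problem of the coordinate-based positioning approach)
- The full layout information from PP-StructureV3 (parsing_res_list, layout_bbox) is not yet used

🤖 Generated with [Claude Code](https://claude.com/claude-code)

Co-Authored-By: Claude <noreply@anthropic.com>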
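The any-overlap check described in fix 2 can be illustrated with a minimal sketch. This is not the actual _bbox_overlaps() from pdf_generator_service.py: it assumes axis-aligned boxes given as (x1, y1, x2, y2), and the function names here are hypothetical.

```python
from typing import List, Tuple

# Assumed box layout: (x1, y1, x2, y2) with x1 < x2 and y1 < y2.
Box = Tuple[float, float, float, float]


def bbox_overlaps(a: Box, b: Box) -> bool:
    """Return True if the two boxes share any area (partial overlap counts)."""
    ax1, ay1, ax2, ay2 = a
    bx1, by1, bx2, by2 = b
    # Boxes are disjoint when one lies entirely left of, right of,
    # above, or below the other; boxes that only touch are not overlapping.
    return not (ax2 <= bx1 or bx2 <= ax1 or ay2 <= by1 or by2 <= ay1)


def filter_text_outside_regions(text_boxes: List[Box], region_boxes: List[Box]) -> List[Box]:
    """Keep only the text boxes that overlap no table/image region."""
    return [
        t for t in text_boxes
        if not any(bbox_overlaps(t, r) for r in region_boxes)
    ]
```

A containment-only test would keep a text box that straddles a table edge; the overlap test above drops it.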
Implement Layout-Preserving PDF Generation and Preview
Problem
Testing revealed three critical issues affecting user experience:
1. PDF Download Returns 403 Forbidden
- Endpoint: GET /api/v2/tasks/{task_id}/download/pdf
- Error: Backend returns HTTP 403 Forbidden
- Impact: Users cannot download PDF format results
- Root Cause: PDF generation service not implemented
2. Result Preview Shows Placeholder Text Instead of Layout-Preserving Content
- Affected Pages:
  - Results page (/results)
  - Task Detail page (/tasks/{taskId})
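- Current Behavior: Both pages display the placeholder message "請使用上方下載按鈕下載 Markdown、JSON 或 PDF 格式查看完整結果" ("Please use the download buttons above to download the Markdown, JSON, or PDF format to view the full results")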
- Problem: Users cannot preview OCR results with original document layout preserved
- Impact: Poor user experience - users cannot verify OCR accuracy visually
3. Images Extracted by PP-StructureV3 Are Not Saved to Disk
- Affected File: backend/app/services/ocr_service.py:554-561
- Current Behavior:
  - PP-StructureV3 extracts images from documents (tables, charts, figures)
  - analyze_layout() receives image objects in the markdown_images dictionary
  - Code only saves image path strings to JSON, never saves actual image files
  - Result directory contains no imgs/ folder with extracted images
- Impact:
  - JSON references non-existent files (e.g., imgs/img_in_table_box_*.jpg)
  - Layout-preserving PDF cannot embed images because source files don't exist
  - Loss of critical visual content from original documents
- Root Cause: Missing image file saving logic in the analyze_layout() function
Proposed Changes
Change 0: Fix Image Extraction and Saving (PREREQUISITE)
Modify OCR service to save extracted images to disk before PDF generation can embed them.
Implementation approach:
- Update analyze_layout() Function
  - Locate image saving code at ocr_service.py:554-561
  - Extract img_obj from markdown_images.items()
  - Create imgs/ subdirectory in result folder
  - Save each img_obj to disk using PIL Image.save()
  - Verify saved file path matches JSON images_metadata
- File Naming and Organization
  - PP-StructureV3 generates paths like imgs/img_in_table_box_145_1253_2329_2488.jpg
  - Create full path: {result_dir}/{img_path}
  - Ensure parent directories exist before saving
  - Handle image format conversion if needed (PNG, JPEG)
- Error Handling
  - Log warnings if image objects are missing or corrupt
  - Continue processing even if individual images fail
  - Include error info in images_metadata for debugging
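The saving steps above could look roughly like the following sketch. It is not the code at ocr_service.py:554-561. It assumes the keys of markdown_images are relative paths and the values expose a PIL-style save(path) method; the function name save_extracted_images and the metadata keys image_path, saved, and error are hypothetical.

```python
import logging
from pathlib import Path
from typing import Any, Dict, List

logger = logging.getLogger(__name__)


def save_extracted_images(markdown_images: Dict[str, Any], result_dir: Path) -> List[dict]:
    """Save each image object under result_dir and return per-image metadata.

    Assumption: keys look like "imgs/img_in_table_box_....jpg" and values
    behave like PIL images, i.e. they provide save(path).
    """
    metadata: List[dict] = []
    for img_path, img_obj in markdown_images.items():
        entry = {"image_path": img_path, "saved": False}
        try:
            if img_obj is None:
                raise ValueError("missing image object")
            full_path = Path(result_dir) / img_path
            # Create imgs/ (and any other parent folders) before saving.
            full_path.parent.mkdir(parents=True, exist_ok=True)
            img_obj.save(str(full_path))
            entry["saved"] = True
        except Exception as exc:
            # One bad image must not abort the rest of the page.
            logger.warning("Could not save %s: %s", img_path, exc)
            entry["error"] = str(exc)
        metadata.append(entry)
    return metadata
```

With real PIL images, an RGBA image cannot be written as JPEG directly, so the format-conversion step would likely need img_obj.convert("RGB") before saving to a .jpg path.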
Why This is Critical:
- Without saved images, layout-preserving PDF cannot embed visual content
- Images contain crucial information (charts, diagrams, table contents)
- PP-StructureV3 already does the hard work of extraction - we just need to save them
Change 1: Implement Layout-Preserving PDF Generation Service
Create a PDF generation service that reconstructs the original document layout from OCR JSON data.
Implementation approach:
- Parse JSON OCR Results
  - Read text_regions array containing text, bounding boxes, confidence scores
  - Extract page dimensions from original file or infer from bbox coordinates
  - Group elements by page number
- Generate PDF with ReportLab
  - Create PDF canvas with original page dimensions
  - Iterate through each text region
  - Draw text at precise coordinates from bbox
  - Support Chinese fonts (e.g., Noto Sans CJK, Source Han Sans)
  - Optionally draw bounding boxes for visualization
- Handle Complex Elements
  - Text: Draw at bbox coordinates with appropriate font size
  - Tables: Reconstruct from layout analysis (if available)
  - Images: Embed from images_metadata
  - Preserve rotation/skew from bbox geometry
- Caching Strategy
  - Generate PDF once per task completion
  - Store in task result directory as {filename}_layout.pdf
  - Serve cached version on subsequent requests
  - Regenerate only if JSON changes
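One possible reading of the caching strategy is sketched below. It treats "JSON changes" as "the JSON file is newer than the cached PDF", which is an assumption; a content hash would be an alternative. The names get_or_build_pdf and build are hypothetical.

```python
from pathlib import Path
from typing import Callable


def get_or_build_pdf(
    json_path: Path,
    pdf_path: Path,
    build: Callable[[Path, Path], None],
) -> Path:
    """Return pdf_path, regenerating it only when missing or older than the JSON."""
    stale = (
        not pdf_path.exists()
        or pdf_path.stat().st_mtime < json_path.stat().st_mtime
    )
    if stale:
        # build(json_path, pdf_path) is expected to write the PDF to pdf_path.
        build(json_path, pdf_path)
    return pdf_path
```

The download endpoint could call this on every request and serve the returned path, so only the first request after a change pays the generation cost.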
Technical stack:
- ReportLab: PDF generation with precise coordinate control
- Pillow: Extract dimensions from source images/PDFs, embed extracted images
- Chinese fonts: Noto Sans CJK or Source Han Sans (must be installed)
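Grouping by page and transforming coordinates are both plain arithmetic and can be sketched without ReportLab. The sketch assumes each text region is a dict with a page index and a bbox given as a list of [x, y] points in a top-left-origin coordinate system; the real JSON schema may differ, and all function names are hypothetical. ReportLab's canvas places its origin at the bottom-left, so the y axis has to be flipped.

```python
from collections import defaultdict
from typing import Dict, List, Sequence, Tuple

Polygon = Sequence[Sequence[float]]  # assumed: [[x, y], ...] with top-left origin


def group_by_page(text_regions: List[dict]) -> Dict[int, List[dict]]:
    """Bucket regions by their (assumed) 'page' key, defaulting to page 0."""
    pages: Dict[int, List[dict]] = defaultdict(list)
    for region in text_regions:
        pages[region.get("page", 0)].append(region)
    return dict(pages)


def infer_page_size(regions: List[dict]) -> Tuple[float, float]:
    """Fallback when the source file is unavailable: use the bbox extents."""
    xs = [p[0] for r in regions for p in r["bbox"]]
    ys = [p[1] for r in regions for p in r["bbox"]]
    return max(xs), max(ys)


def to_pdf_rect(bbox: Polygon, page_height: float) -> Tuple[float, float, float, float]:
    """Convert a top-left-origin polygon to (x, y, width, height), bottom-left origin."""
    xs = [p[0] for p in bbox]
    ys = [p[1] for p in bbox]
    left, right = min(xs), max(xs)
    top, bottom = min(ys), max(ys)
    # The lower edge in image space becomes the rectangle's y in PDF space.
    return left, page_height - bottom, right - left, bottom - top
```

The x and y from to_pdf_rect could then be passed to a ReportLab canvas call such as drawString(x, y, text), with the height serving as a starting point for the font size. Inferring the page size from bbox extents ignores right and bottom margins, so dimensions read from the original file are preferable when available.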
Change 2: Implement In-Browser PDF Preview
Replace placeholder text with interactive PDF preview using react-pdf.
Implementation approach:
- Install react-pdf
  npm install react-pdf
- Create PDF Viewer Component
  - Fetch PDF from /api/v2/tasks/{task_id}/download/pdf
  - Render using <Document> and <Page> from react-pdf
  - Add zoom controls, page navigation
  - Show loading spinner while PDF loads
- Update ResultsPage and TaskDetailPage
  - Replace placeholder with PDF viewer
  - Add download button above viewer
  - Handle errors gracefully (show error if PDF unavailable)
Benefits:
- Users see OCR results with original layout preserved
- Visual verification of OCR accuracy
- No download required for quick review
- Professional presentation of results
Scope
In scope:
- Fix image extraction to save extracted images to disk (PREREQUISITE)
- Implement layout-preserving PDF generation service from JSON
- Install and configure Chinese fonts (Noto Sans CJK)
- Create PDF viewer component with react-pdf
- Add PDF preview to Results page and Task Detail page
- Cache generated PDFs for performance
- Embed extracted images into layout-preserving PDF
- Error handling for image saving, PDF generation and preview failures
Out of scope:
- OCR result editing in preview
- Advanced PDF features (annotations, search, highlights)
- Excel/JSON inline preview
- Real-time PDF regeneration (will use cached version)
Impact
- User Experience: Major improvement - layout-preserving visual preview with images
- Backend: Significant changes - image saving fix, new PDF generation service
- Frontend: Medium changes - PDF viewer integration
- Dependencies: New - ReportLab, react-pdf, Chinese fonts (Pillow already installed)
- Performance: Medium - PDF generation cached after first request, minimal overhead for image saving
- Risk: Medium - complex coordinate transformation, font rendering, image embedding
- Data Integrity: High improvement - images now properly preserved alongside text