egg/OCR

egg cd3cbea49d chore: project cleanup and prepare for dual-track processing refactor

- Removed all test files and directories
- Deleted outdated documentation (will be rewritten)
- Cleaned up temporary files, logs, and uploads
- Archived 5 completed OpenSpec proposals
- Created new dual-track-document-processing proposal with complete OpenSpec structure
  - Dual-track architecture: OCR track (PaddleOCR) + Direct track (PyMuPDF)
  - UnifiedDocument model for consistent output
  - Support for structure-preserving translation
- Updated .gitignore to prevent future test/temp files

This is a major cleanup preparing for the complete refactoring of the document processing pipeline.

🤖 Generated with [Claude Code](https://claude.com/claude-code)

Co-Authored-By: Claude <noreply@anthropic.com>

2025-11-18 20:02:31 +08:00


Implement Layout-Preserving PDF Generation and Preview

Problem

Testing revealed three critical issues affecting user experience:

1. PDF Download Returns 403 Forbidden

Endpoint: GET /api/v2/tasks/{task_id}/download/pdf
Error: Backend returns HTTP 403 Forbidden
Impact: Users cannot download PDF format results
Root Cause: PDF generation service not implemented

2. Result Preview Shows Placeholder Text Instead of Layout-Preserving Content

Affected Pages:
- Results page (/results)
- Task Detail page (/tasks/{taskId})
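Current Behavior: Both pages display the placeholder message "請使用上方下載按鈕下載 Markdown、JSON 或 PDF 格式查看完整結果" ("Please use the download buttons above to download the Markdown, JSON, or PDF format to view the full results")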
Problem: Users cannot preview OCR results with original document layout preserved
Impact: Poor user experience - users cannot verify OCR accuracy visually

3. Images Extracted by PP-StructureV3 Are Not Saved to Disk

Affected File: backend/app/services/ocr_service.py:554-561
Current Behavior:
- PP-StructureV3 extracts images from documents (tables, charts, figures)
- analyze_layout() receives image objects in markdown_images dictionary
- Code only saves image path strings to JSON, never saves actual image files
- Result directory contains no imgs/ folder with extracted images
Impact:
- JSON references non-existent files (e.g., imgs/img_in_table_box_*.jpg)
- Layout-preserving PDF cannot embed images because source files don't exist
- Loss of critical visual content from original documents
Root Cause: Missing image file saving logic in analyze_layout() function

Proposed Changes

Change 0: Fix Image Extraction and Saving (PREREQUISITE)

Modify OCR service to save extracted images to disk before PDF generation can embed them.

Implementation approach:

1. Update analyze_layout() Function
   - Locate image saving code at ocr_service.py:554-561
   - Extract img_obj from markdown_images.items()
   - Create imgs/ subdirectory in result folder
   - Save each img_obj to disk using PIL Image.save()
   - Verify saved file path matches JSON images_metadata
2. File Naming and Organization
   - PP-StructureV3 generates paths like imgs/img_in_table_box_145_1253_2329_2488.jpg
   - Create full path: {result_dir}/{img_path}
   - Ensure parent directories exist before saving
   - Handle image format conversion if needed (PNG, JPEG)
3. Error Handling
   - Log warnings if image objects are missing or corrupt
   - Continue processing even if individual images fail
   - Include error info in images_metadata for debugging
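
The steps above could be sketched roughly as follows. This is a minimal illustration, not the actual ocr_service.py code: the helper name save_extracted_images and the shape of the returned metadata entries are hypothetical, and it assumes markdown_images maps a relative path to a PIL.Image.Image-like object exposing mode, convert() and save().

```python
import logging
from pathlib import Path
from typing import Any, Dict, List

logger = logging.getLogger(__name__)


def save_extracted_images(markdown_images: Dict[str, Any], result_dir: Path) -> List[dict]:
    """Save each extracted image under result_dir and return metadata entries.

    Assumption: markdown_images maps a relative path such as "imgs/<name>.jpg"
    to a PIL.Image.Image-like object (hypothetical helper, not the real service).
    """
    result_root = Path(result_dir).resolve()
    images_metadata: List[dict] = []

    for rel_path, img_obj in markdown_images.items():
        entry = {"path": rel_path, "saved": False, "error": None}
        try:
            if img_obj is None:
                raise ValueError("image object is missing")

            target = (result_root / rel_path).resolve()
            # Refuse paths that would escape the result directory.
            if result_root not in target.parents:
                raise ValueError(f"unsafe image path: {rel_path}")

            # Creates imgs/ (and any nested folders) on first use.
            target.parent.mkdir(parents=True, exist_ok=True)

            # JPEG cannot store alpha or palette data, so convert those modes first.
            if target.suffix.lower() in {".jpg", ".jpeg"} and img_obj.mode not in ("RGB", "L"):
                img_obj = img_obj.convert("RGB")

            img_obj.save(target)
            entry["saved"] = True
        except Exception as exc:
            # One bad image must not abort the whole OCR task.
            logger.warning("Failed to save image %s: %s", rel_path, exc)
            entry["error"] = str(exc)
        images_metadata.append(entry)

    return images_metadata
```

Recording the error on each entry keeps the JSON honest about which files actually exist, so the PDF generator can skip missing images instead of failing on a dangling path.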

Why This is Critical:

- Without saved images, layout-preserving PDF cannot embed visual content
- Images contain crucial information (charts, diagrams, table contents)
- PP-StructureV3 already does the hard work of extraction - we just need to save them

Change 1: Implement Layout-Preserving PDF Generation Service

Create a PDF generation service that reconstructs the original document layout from OCR JSON data.

Implementation approach:

1. Parse JSON OCR Results
   - Read text_regions array containing text, bounding boxes, confidence scores
   - Extract page dimensions from original file or infer from bbox coordinates
   - Group elements by page number
2. Generate PDF with ReportLab
   - Create PDF canvas with original page dimensions
   - Iterate through each text region
   - Draw text at precise coordinates from bbox
   - Support Chinese fonts (e.g., Noto Sans CJK, Source Han Sans)
   - Optionally draw bounding boxes for visualization
3. Handle Complex Elements
   - Text: Draw at bbox coordinates with appropriate font size
   - Tables: Reconstruct from layout analysis (if available)
   - Images: Embed from images_metadata
   - Preserve rotation/skew from bbox geometry
4. Caching Strategy
   - Generate PDF once per task completion
   - Store in task result directory as {filename}_layout.pdf
   - Serve cached version on subsequent requests
   - Regenerate only if JSON changes
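
One way to decide when the cached PDF is stale is to compare file modification times. The sketch below is only an illustration under assumptions: the helper names layout_pdf_path and needs_regeneration are hypothetical, {filename} is taken to be the JSON file's stem, and "JSON changes" is approximated by the JSON being newer than the PDF (a content hash would be stricter).

```python
from pathlib import Path


def layout_pdf_path(json_path: Path) -> Path:
    """Derive {filename}_layout.pdf next to the JSON result (assumed naming)."""
    return json_path.with_name(f"{json_path.stem}_layout.pdf")


def needs_regeneration(json_path: Path, pdf_path: Path) -> bool:
    """Return True when no cached PDF exists or the JSON is newer than the PDF."""
    if not pdf_path.exists():
        return True
    return json_path.stat().st_mtime > pdf_path.stat().st_mtime
```

The download endpoint would call needs_regeneration() first and only invoke the PDF generator when it returns True, serving the existing file otherwise.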

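Technical stack:

- ReportLab: PDF generation with precise coordinate control
- Pillow: Read dimensions from source images and embed extracted images (Pillow cannot open PDFs, so PDF page sizes must come from the bbox data or a PDF library)
- Chinese fonts: Noto Sans CJK or Source Han Sans (must be installed)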
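
The coordinate work is the main risk here: OCR boxes are usually expressed in image pixels with the origin at the top-left, while PDF user space (which ReportLab's canvas uses) has its origin at the bottom-left and measures in points. The sketch below shows the conversion and the per-page grouping. It assumes each bbox is a list of four [x, y] corner points and that each region carries a "page" key; both are assumptions about the JSON schema, and the function names are hypothetical.

```python
from collections import defaultdict
from typing import Dict, List, Sequence, Tuple

Point = Sequence[float]


def group_by_page(text_regions: List[dict]) -> Dict[int, List[dict]]:
    """Bucket regions by page number ("page" is an assumed field name)."""
    pages: Dict[int, List[dict]] = defaultdict(list)
    for region in text_regions:
        pages[region.get("page", 1)].append(region)
    return dict(pages)


def bbox_to_pdf_rect(
    bbox: Sequence[Point],
    image_size: Tuple[float, float],
    page_size: Tuple[float, float],
) -> Tuple[float, float, float, float]:
    """Map a pixel-space quad (top-left origin) to a PDF rectangle.

    Returns (x, y, width, height) in points, where (x, y) is the
    bottom-left corner of the axis-aligned box enclosing the quad.
    """
    img_w, img_h = image_size
    page_w, page_h = page_size
    scale_x = page_w / img_w
    scale_y = page_h / img_h

    xs = [p[0] for p in bbox]
    ys = [p[1] for p in bbox]
    left, right = min(xs), max(xs)
    top, bottom = min(ys), max(ys)

    x = left * scale_x
    width = (right - left) * scale_x
    height = (bottom - top) * scale_y
    # Flip the y axis: image y grows downward, PDF y grows upward.
    y = page_h - bottom * scale_y
    return x, y, width, height
```

The returned (x, y) could then be passed to ReportLab's canvas.drawString(x, y, text), with the font size derived from height. Rotation and skew are ignored in this sketch because it collapses the quad to its enclosing rectangle.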

Change 2: Implement In-Browser PDF Preview

Replace placeholder text with interactive PDF preview using react-pdf.

Implementation approach:

1. Install react-pdf
```
npm install react-pdf
```
2. Create PDF Viewer Component
   - Fetch PDF from /api/v2/tasks/{task_id}/download/pdf
   - Render using <Document> and <Page> from react-pdf
   - Add zoom controls, page navigation
   - Show loading spinner while PDF loads
3. Update ResultsPage and TaskDetailPage
   - Replace placeholder with PDF viewer
   - Add download button above viewer
   - Handle errors gracefully (show error if PDF unavailable)

Benefits:

- Users see OCR results with original layout preserved
- Visual verification of OCR accuracy
- No download required for quick review
- Professional presentation of results

Scope

In scope:

- Fix image extraction to save extracted images to disk (PREREQUISITE)
- Implement layout-preserving PDF generation service from JSON
- Install and configure Chinese fonts (Noto Sans CJK)
- Create PDF viewer component with react-pdf
- Add PDF preview to Results page and Task Detail page
- Cache generated PDFs for performance
- Embed extracted images into layout-preserving PDF
- Error handling for image saving, PDF generation and preview failures

Out of scope:

- OCR result editing in preview
- Advanced PDF features (annotations, search, highlights)
- Excel/JSON inline preview
- Real-time PDF regeneration (will use cached version)

Impact

- User Experience: Major improvement - layout-preserving visual preview with images
- Backend: Significant changes - image saving fix, new PDF generation service
- Frontend: Medium changes - PDF viewer integration
- Dependencies: New - ReportLab, react-pdf, Chinese fonts (Pillow already installed)
- Performance: Medium - PDF generation cached after first request, minimal overhead for image saving
- Risk: Medium - complex coordinate transformation, font rendering, image embedding
- Data Integrity: High improvement - images now properly preserved alongside text
