egg 24253ac15e feat: unify Direct Track PDF rendering and simplify export options

Backend changes:
- Apply background image + invisible text layer to all Direct Track PDFs
- Add CHART to regions_to_avoid for text extraction
- Improve visual fidelity for native PDFs and Office documents

Frontend changes:
- Remove JSON, UnifiedDocument, Markdown download buttons
- Simplify to 2-column layout with only Layout PDF and Reflow PDF
- Remove translation JSON download and Layout PDF option
- Keep only Reflow PDF for translated document downloads
- Clean up unused imports (FileJson, Database, FileOutput)

Archives two OpenSpec proposals:
- unify-direct-track-pdf-rendering
- simplify-frontend-export-options

🤖 Generated with [Claude Code](https://claude.com/claude-code)

Co-Authored-By: Claude Opus 4.5 <noreply@anthropic.com>

2025-12-12 07:50:43 +08:00


Design: Unify Direct Track PDF Rendering

Context

The Tool_OCR system generates "Layout PDF" files that preserve the original document appearance while maintaining extractable text. Currently, Direct Track (editable PDFs and Office documents) uses element-by-element rendering, which causes:

- Z-order conflicts (text behind images)
- Missing vector graphics (chart bars, gradients)
- White text becoming invisible on dark backgrounds

Goals / Non-Goals

Goals

- Visual fidelity: Layout PDF matches source document exactly
- Text extractability: All text remains searchable/selectable for translation
- Unified logic: Same rendering approach for all Direct Track documents
- Chart handling: Chart-internal text excluded from translation layer

Non-Goals

- Editable text in Layout PDF (translation creates a separate Reflow PDF)
- Reducing file size (trade-off for visual fidelity)
- OCR Track changes (this design only affects Direct Track)

Decisions

Decision 1: Use Background Image + Invisible Text Layer

What: Render each source PDF page as a full-page background image, then overlay invisible text.

Why:

- Preserves all visual content (vector graphics, gradients, complex layouts)
- Invisible text (PDF text rendering mode 3) stays selectable and searchable without painting over the background
- Simplifies z-order handling to one image layer plus one text layer

Implementation:

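import io

import fitz  # PyMuPDF
from reportlab.lib.utils import ImageReader

# Render source page as background
mat = fitz.Matrix(2.0, 2.0)  # 2x zoom (144 DPI instead of the 72 DPI default)
pix = source_page.get_pixmap(matrix=mat, alpha=False)
# Wrap the pixmap as PNG bytes so ReportLab can draw it (conversion step assumed)
bg_img = ImageReader(io.BytesIO(pix.tobytes("png")))
pdf_canvas.drawImage(bg_img, 0, 0, width=page_width, height=page_height)

# Set invisible text mode
pdf_canvas._code.append('3 Tr')  # Text render mode 3: invisible

# Draw text elements (invisible but selectable)
for elem in text_elements:
    if not is_inside_chart_region(elem):
        draw_text_element(elem)

pdf_canvas._code.append('0 Tr')  # Reset to normal (fill) rendering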
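
To make the effect of `3 Tr` concrete, here is a minimal sketch of the raw content-stream operators that an invisible text run boils down to. It uses only the standard library and is not the project's code: the helper names `escape_pdf_string` and `invisible_text_ops` and the font resource name `F1` are assumptions for illustration, and it only handles text that the selected font can encode as a simple literal string.

```python
def escape_pdf_string(text: str) -> str:
    """Escape backslashes and parentheses for a PDF literal string."""
    return (
        text.replace("\\", "\\\\")
        .replace("(", "\\(")
        .replace(")", "\\)")
    )


def invisible_text_ops(text: str, x: float, y: float,
                       font: str = "F1", size: float = 12) -> str:
    """Return content-stream operators for selectable but unpainted text.

    `font` must name a font resource on the page (hypothetical here).
    Coordinates are PDF user space: origin at bottom-left, units in points.
    """
    return "\n".join([
        "BT",                                # begin text object
        f"/{font} {size:g} Tf",              # select font and size
        "3 Tr",                              # render mode 3: no fill, no stroke
        f"{x:g} {y:g} Td",                   # move to the text position
        f"({escape_pdf_string(text)}) Tj",   # show the string
        "ET",                                # end text object
    ])


ops = invisible_text_ops("Revenue (2024)", 72, 720)
print(ops)
```

The render mode is part of the text state and persists after the text object ends, which is why the implementation above resets it with `0 Tr` once the text layer is drawn.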

Decision 2: Add CHART to regions_to_avoid

What: Chart-internal text elements are excluded from the invisible text layer.

Why:

- Chart axis labels and legends are already visible in the background image
- This text typically does not need translation
- Excluding it prevents duplicate text extraction for translation

Implementation:

# In element classification loop
if element.type == ElementType.CHART:
    image_elements.append(element)
    regions_to_avoid.append(element)  # Exclude chart region from text layer
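
The design does not spell out how `is_inside_chart_region` decides membership. One plausible approach, sketched below under stated assumptions, is to test whether the center of a text element's bounding box falls inside any region in `regions_to_avoid`. The `BBox` type and the two-argument signature are hypothetical and chosen to keep the example self-contained.

```python
from __future__ import annotations

from dataclasses import dataclass


@dataclass(frozen=True)
class BBox:
    """Axis-aligned bounding box in page coordinates (hypothetical type)."""
    x0: float
    y0: float
    x1: float
    y1: float

    def center(self) -> tuple[float, float]:
        return ((self.x0 + self.x1) / 2, (self.y0 + self.y1) / 2)

    def contains_point(self, x: float, y: float) -> bool:
        return self.x0 <= x <= self.x1 and self.y0 <= y <= self.y1


def is_inside_chart_region(text_box: BBox, regions_to_avoid: list[BBox]) -> bool:
    """True if the text box's center lies inside any avoided region."""
    cx, cy = text_box.center()
    return any(region.contains_point(cx, cy) for region in regions_to_avoid)


chart = BBox(100, 100, 400, 300)
texts = {
    "Quarterly results": BBox(100, 50, 300, 70),  # heading outside the chart
    "Q1": BBox(120, 280, 140, 295),               # axis label inside the chart
}
kept = [name for name, box in texts.items()
        if not is_inside_chart_region(box, [chart])]
print(kept)
```

A center-point test keeps text that merely touches the chart border; an overlap-ratio test would be stricter. Which one fits depends on how tightly the chart regions are detected.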

Decision 3: Apply to ALL Direct Track Documents

What: Use background image rendering for both Office documents and native PDFs.

Why:

- Consistent handling eliminates edge cases
- Chart text overlap affects both document types
- Office detection (LibreOffice producer) is unreliable for some PDFs

Detection logic removed:

# OLD: Only for Office documents
is_office_document = 'LibreOffice' in producer or filename.endswith('.pptx')

# NEW: All Direct Track uses background rendering
if self.current_processing_track == ProcessingTrack.DIRECT:
    render_background_image()
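
As a self-contained illustration of why the producer heuristic is fragile, the sketch below contrasts the old check with a track-based one. The enum values and the function names `old_is_office_document` and `uses_background_rendering` are assumptions; only `ProcessingTrack.DIRECT` and the old expression come from the design.

```python
from enum import Enum


class ProcessingTrack(Enum):
    # Member values are placeholders; only the member names matter here.
    DIRECT = "direct"
    OCR = "ocr"


def old_is_office_document(producer: str, filename: str) -> bool:
    """Old heuristic: guess Office origin from PDF metadata or extension."""
    return "LibreOffice" in producer or filename.endswith(".pptx")


def uses_background_rendering(track: ProcessingTrack) -> bool:
    """New rule: every Direct Track document gets a background image."""
    return track is ProcessingTrack.DIRECT


# A slide deck exported to PDF by another tool slips past the old check
# but is still covered by the track-based rule.
missed = old_is_office_document("Some Other Exporter 1.0", "slides.pdf")
covered = uses_background_rendering(ProcessingTrack.DIRECT)
print(missed, covered)
```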

Architecture

┌─────────────────────────────────────────────────────────────┐
│                    PDF Generation Flow                       │
├─────────────────────────────────────────────────────────────┤
│                                                              │
│  Source PDF ──► PyMuPDF ──► Page Pixmap (2x) ──► Background │
│                    │                                         │
│                    ▼                                         │
│              Extract Text ──► Filter Chart Regions           │
│                    │                                         │
│                    ▼                                         │
│         Invisible Text Layer (Mode 3) ──► Overlay            │
│                                                              │
│  Result: Background Image + Invisible Searchable Text        │
│                                                              │
└─────────────────────────────────────────────────────────────┘

Risks / Trade-offs

| Risk | Impact | Mitigation |
|------|--------|------------|
| Larger file size (~2 MB/page) | Storage, download time | Accept trade-off for visual fidelity |
| Slightly slower generation | User wait time | Acceptable for quality improvement |
| Chart text not translatable | Feature limitation | Document as expected behavior |
| Source PDF required | Can't regenerate without source | Store source PDF reference in task |

File Size Estimation

| Document | Pages | Current Size | New Size (est.) |
|----------|-------|--------------|-----------------|
| PPT (25 pages) | 25 | ~1.5 MB | ~43 MB |
| PDF (3 pages) | 3 | ~68 KB | ~6 MB |
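
The estimates above can be sanity-checked with simple arithmetic. The sketch below assumes a US Letter page (612 x 792 pt) and the 2x zoom from Decision 1; the function names are hypothetical, and the ~2 MB/page figure is taken from the risk table rather than measured here.

```python
def rendered_pixels(width_pt: float, height_pt: float,
                    zoom: float = 2.0) -> tuple[int, int]:
    """Pixel dimensions of a page rendered at the given zoom (72 DPI base)."""
    return round(width_pt * zoom), round(height_pt * zoom)


def raw_rgb_bytes(width_px: int, height_px: int) -> int:
    """Uncompressed size of an RGB image without alpha (3 bytes per pixel)."""
    return width_px * height_px * 3


def estimate_total_mb(pages: int, mb_per_page: float = 2.0) -> float:
    """Rough output size using a flat per-page figure."""
    return pages * mb_per_page


width_px, height_px = rendered_pixels(612, 792)   # US Letter at 2x zoom
print(width_px, height_px)                        # pixel dimensions
print(raw_rgb_bytes(width_px, height_px))         # bytes before compression
print(estimate_total_mb(3), estimate_total_mb(25))
print(round(43 / 25, 2))                          # per-page MB implied by the table
```

With a flat 2 MB/page, the 3-page PDF comes out at 6 MB, matching the table, while the 25-page deck would be 50 MB. The table's ~43 MB implies roughly 1.7 MB/page for that document, so the per-page figure is best read as an upper-end approximation that varies with how well each page compresses.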

Open Questions

- Should we provide a "lightweight" option that skips background rendering for simple PDFs?
  - Decision: No. Keep the unified approach for consistency.
- Should chart text be optionally included in translation?
  - Decision: No. Chart labels rarely need translation, and including them would require complex masking.
