OCR/design.md at 63ffa8f0e3a07f07059fd0c85b610016487245e8

egg 8265be1741 test

2025-12-04 18:00:37 +08:00

Context

The PDF generator currently uses layout preservation mode for all PDF output, placing text at its original coordinates. This works for document reconstruction, but it:

Fails for translated content where text length differs significantly
May not provide the best reading experience for flowing documents

Two PDF generation modes are needed:

Layout Preservation (existing): Maintains original coordinates
Reflow Layout (new): Prioritizes readability with flowing content

Goals / Non-Goals

Goals:

Translated and non-translated documents can use reflow layout
Both OCR and Direct tracks supported
Proper reading order preserved using available data
Consistent font sizes for readability
Images and tables embedded inline

Non-Goals:

Perfect visual matching with original document layout
Complex multi-column reflow (reflow output is a simple single-column flow)
Font style matching from original document

Decisions

Decision 1: Reading Order Strategy

Track	Reading Order Source	Implementation
OCR	Explicit `reading_order` array in JSON	Use array indices to order elements
Direct	Implicit in element list order	Use list iteration order (PyMuPDF sort=True)

OCR Track - reading_order array:

{
  "pages": [{
    "reading_order": [0, 1, 2, 3, 6, 7, 8, ...],
    "elements": [...]
  }]
}

Direct Track - implicit order:

PyMuPDF's get_text("dict", sort=True) sorts blocks by vertical, then horizontal position, which approximates reading order on most pages
Elements already sorted by extraction engine
Optional: Enable _sort_elements_for_reading_order() for multi-column detection
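
As a rough illustration of what a coordinate-based reading order means for the Direct track, the sketch below sorts elements top-to-bottom, then left-to-right. It is a stand-in for the idea, not PyMuPDF's implementation. The `bbox` field layout `[x0, y0, x1, y1]`, the function name, and the `row_tolerance` parameter are all assumptions made for this example.

```python
from typing import Dict, List


def sort_elements_spatially(elements: List[Dict], row_tolerance: float = 5.0) -> List[Dict]:
    """Order elements top-to-bottom, then left-to-right.

    Assumes each element carries a hypothetical 'bbox' of [x0, y0, x1, y1]
    with the origin at the top-left corner of the page.
    """
    def sort_key(element: Dict):
        x0, y0 = element["bbox"][0], element["bbox"][1]
        # Snap y0 to a coarse row so elements on the same line sort by x0.
        return (round(y0 / row_tolerance), x0)

    return sorted(elements, key=sort_key)


elements = [
    {"id": "b", "bbox": [300, 101, 500, 120]},
    {"id": "c", "bbox": [50, 200, 500, 220]},
    {"id": "a", "bbox": [50, 100, 250, 120]},
]
print([e["id"] for e in sort_elements_spatially(elements)])  # ['a', 'b', 'c']
```

A plain coordinate sort like this interleaves the columns of a multi-column page, which is the case the optional multi-column detection is meant to cover.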

Decision 2: Separate API Endpoints

# Layout preservation (existing)
GET /api/v2/tasks/{task_id}/download/pdf

# Reflow layout (new)
GET /api/v2/tasks/{task_id}/download/pdf?format=reflow

# Translated PDF (reflow only)
POST /api/v2/translate/{task_id}/pdf?lang={lang}
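
The following sketch shows how a client might build these three URLs. The paths and query parameters come from the list above; the helper names and the `API_PREFIX` constant are hypothetical and exist only for this example.

```python
from urllib.parse import quote, urlencode

API_PREFIX = "/api/v2"  # assumed common prefix, taken from the paths above


def pdf_download_url(task_id: str, reflow: bool = False) -> str:
    """URL for the layout-preserving PDF, or the reflow PDF when reflow=True."""
    path = f"{API_PREFIX}/tasks/{quote(task_id, safe='')}/download/pdf"
    if reflow:
        return f"{path}?{urlencode({'format': 'reflow'})}"
    return path


def translated_pdf_url(task_id: str, lang: str) -> str:
    """URL for the translated PDF (sent with POST, always reflow)."""
    path = f"{API_PREFIX}/translate/{quote(task_id, safe='')}/pdf"
    return f"{path}?{urlencode({'lang': lang})}"


print(pdf_download_url("abc123"))               # /api/v2/tasks/abc123/download/pdf
print(pdf_download_url("abc123", reflow=True))  # /api/v2/tasks/abc123/download/pdf?format=reflow
print(translated_pdf_url("abc123", "zh-TW"))    # /api/v2/translate/abc123/pdf?lang=zh-TW
```

Because the existing download URL is unchanged when `format` is omitted, current clients keep receiving the layout-preserving PDF.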

Decision 3: Unified Reflow Generation Method

from pathlib import Path
from typing import Optional

def generate_reflow_pdf(
    self,
    result_json_path: Path,
    output_path: Path,
    translation_json_path: Optional[Path] = None,  # None = no translation
    source_file_path: Optional[Path] = None,       # For embedded images
) -> bool:
    """
    Generate reflow layout PDF for either OCR or Direct track.
    Works with or without translation.
    """

Decision 4: Reading Order Application

from typing import List

def _get_elements_in_reading_order(self, page_data: dict) -> List[dict]:
    """Get elements sorted by reading order."""
    elements = page_data.get('elements', [])
    reading_order = page_data.get('reading_order')

    if reading_order:
        # OCR track: use explicit reading order
        ordered = []
        for idx in reading_order:
            # Skip indices that do not point at an existing element
            if 0 <= idx < len(elements):
                ordered.append(elements[idx])
        return ordered
    else:
        # Direct track (or missing/empty reading_order): keep list order
        return elements
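
A standalone version of the same logic makes the behaviour easy to try on sample data. The page dictionaries below are invented for illustration, and the `text` field on each element is an assumption.

```python
from typing import Dict, List


def elements_in_reading_order(page_data: Dict) -> List[Dict]:
    """Standalone copy of the ordering logic, for experimentation."""
    elements = page_data.get("elements", [])
    reading_order = page_data.get("reading_order")
    if not reading_order:
        # Direct track, or no usable reading_order: keep list order.
        return elements
    return [elements[i] for i in reading_order if 0 <= i < len(elements)]


ocr_page = {
    "reading_order": [2, 0, 9],  # 9 is out of range and is skipped
    "elements": [{"text": "Body"}, {"text": "Footer"}, {"text": "Title"}],
}
direct_page = {"elements": [{"text": "First"}, {"text": "Second"}]}

print([e["text"] for e in elements_in_reading_order(ocr_page)])     # ['Title', 'Body']
print([e["text"] for e in elements_in_reading_order(direct_page)])  # ['First', 'Second']
```

Note that an element whose index does not appear in `reading_order` (index 1, "Footer", in this sample) is left out of the result.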

Decision 5: Consistent Typography

Element Type	Font Size	Style
Title/H1	18pt	Bold
H2	16pt	Bold
H3	14pt	Bold
Body text	12pt	Normal
Table cell	10pt	Normal
Caption	10pt	Italic
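
One way to encode this table is a simple lookup keyed by element type, sketched below. The element type keys (`title`, `h2`, `table_cell`, and so on) and the Helvetica font names are assumptions for illustration; translated documents may need a different font family that covers the target script.

```python
from typing import Dict, NamedTuple


class TextStyle(NamedTuple):
    font_size: int  # in points
    font_name: str  # placeholder standard PDF font


# Hypothetical element type keys mapped to the sizes and styles in the table.
REFLOW_STYLES: Dict[str, TextStyle] = {
    "title": TextStyle(18, "Helvetica-Bold"),
    "h1": TextStyle(18, "Helvetica-Bold"),
    "h2": TextStyle(16, "Helvetica-Bold"),
    "h3": TextStyle(14, "Helvetica-Bold"),
    "text": TextStyle(12, "Helvetica"),
    "table_cell": TextStyle(10, "Helvetica"),
    "caption": TextStyle(10, "Helvetica-Oblique"),
}


def style_for(element_type: str) -> TextStyle:
    """Look up a style, falling back to body text for unknown types."""
    return REFLOW_STYLES.get(element_type.lower(), REFLOW_STYLES["text"])


print(style_for("H2").font_size)       # 16
print(style_for("footnote").font_size)  # 12 (falls back to body text)
```

Keeping the sizes in one mapping means every element of a given type renders identically, regardless of the font sizes found in the source document.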

Decision 6: Table Handling in Reflow

Tables use Platypus Table with auto-width columns:

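from reportlab.lib import colors
from reportlab.platypus import Paragraph, Table, TableStyle

def _create_reflow_table(self, table_data, translations=None):
    data = []
    for row in table_data['rows']:
        row_data = []
        for cell in row['cells']:
            text = cell.get('text', '')
            if translations:
                # Prefer the translated text; keep the original if none exists
                text = translations.get(cell.get('id'), text)
            row_data.append(Paragraph(text, self.styles['TableCell']))
        data.append(row_data)

    # No colWidths given, so ReportLab sizes the columns automatically
    table = Table(data)
    table.setStyle(TableStyle([
        ('GRID', (0, 0), (-1, -1), 0.5, colors.black),
        ('VALIGN', (0, 0), (-1, -1), 'TOP'),
        # Padding is set per side; 'PADDING' is not a documented cell command
        ('LEFTPADDING', (0, 0), (-1, -1), 6),
        ('RIGHTPADDING', (0, 0), (-1, -1), 6),
        ('TOPPADDING', (0, 0), (-1, -1), 6),
        ('BOTTOMPADDING', (0, 0), (-1, -1), 6),
    ]))
    return table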
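
To make the translation fallback in the loop above concrete, here is a small sketch without ReportLab that only resolves the text each cell would display. The sample table and cell ids are made up, and `resolve_cell_texts` is a hypothetical helper.

```python
from typing import Dict, List, Optional


def resolve_cell_texts(
    table_data: Dict, translations: Optional[Dict[str, str]] = None
) -> List[List[str]]:
    """Return the text each cell would display, row by row."""
    translations = translations or {}
    return [
        # Use the translation for this cell id if present, else the original.
        [translations.get(cell.get("id"), cell.get("text", "")) for cell in row["cells"]]
        for row in table_data["rows"]
    ]


table = {"rows": [
    {"cells": [{"id": "c1", "text": "Name"}, {"id": "c2", "text": "Price"}]},
    {"cells": [{"id": "c3", "text": "Tea"}, {"text": "4.50"}]},
]}
translations = {"c1": "Nombre", "c2": "Precio"}

print(resolve_cell_texts(table, translations))
# [['Nombre', 'Precio'], ['Tea', '4.50']]
```

Cells with no matching translation, or with no id at all, keep their original text, so a partially translated table still renders completely.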

Decision 7: Image Embedding

from reportlab.platypus import Image, Spacer

def _embed_image_reflow(self, element, max_width=450):
    img_path = self._resolve_image_path(element)
    if img_path and img_path.exists():
        img = Image(str(img_path))
        # Scale down proportionally so the image fits the page width
        if img.drawWidth > max_width:
            ratio = max_width / img.drawWidth
            img.drawWidth = max_width
            img.drawHeight *= ratio
        return img
    # Missing image: return an empty flowable so the flow continues
    return Spacer(1, 0)
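
The scaling step is plain proportional arithmetic. The sketch below isolates it in a hypothetical helper, `fit_to_width`, so the numbers can be checked without ReportLab.

```python
from typing import Tuple


def fit_to_width(width: float, height: float, max_width: float = 450) -> Tuple[float, float]:
    """Shrink (width, height) proportionally so width does not exceed max_width."""
    if width <= max_width:
        # Already fits: small images are never enlarged.
        return width, height
    ratio = max_width / width
    return max_width, height * ratio


print(fit_to_width(900, 600))  # (450, 300.0)
print(fit_to_width(200, 100))  # (200, 100)
```

Only the width is constrained here, matching the method above; a very tall image could still exceed the page height.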

Risks / Trade-offs

Risk: OCR reading_order may not be accurate for complex layouts
- Mitigation: Falls back to the element list order if reading_order is missing or empty
Risk: Direct track multi-column detection unused
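- Mitigation: PyMuPDF sort=True is generally reliable for single-column pages; _sort_elements_for_reading_order() can be enabled if multi-column documents need it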
Risk: Loss of visual fidelity compared to original
- Mitigation: This is acceptable; layout PDF still available

Migration Plan

No migration needed - new functionality, existing behavior unchanged.

Open Questions

None - design confirmed with user.
