Design: PDF Preprocessing Pipeline

Architecture Overview

┌─────────────────────────────────────────────────────────────────────────────┐
│                     DIRECT Track PDF Processing Pipeline                     │
├─────────────────────────────────────────────────────────────────────────────┤
│                                                                             │
│   Input PDF                                                                 │
│      │                                                                      │
│      ▼                                                                      │
│  ┌─────────────────────────────────────────────────────────────────────┐   │
│  │ Step 0: GS Distillation (Exception Handler)                         │   │
│  │ ───────────────────────────────────────────────────────────────────  │   │
│  │ Trigger: (cid:xxxx) garble detected OR mupdf structural errors      │   │
│  │ Action: gs -sDEVICE=pdfwrite -dDetectDuplicateImages=true           │   │
│  │ Status: DISABLED by default, auto-triggered on errors               │   │
│  └─────────────────────────────────────────────────────────────────────┘   │
│      │                                                                      │
│      ▼                                                                      │
│  ┌─────────────────────────────────────────────────────────────────────┐   │
│  │ Step 1: Object-level Cleaning (P0 - Core)                           │   │
│  │ ───────────────────────────────────────────────────────────────────  │   │
│  │ 1.1 clean_contents(sanitize=True) - Fix malformed content stream    │   │
│  │ 1.2 Remove hidden OCG layers                                        │   │
│  │ 1.3 White-out detection & removal (IoU >= 80%)                      │   │
│  └─────────────────────────────────────────────────────────────────────┘   │
│      │                                                                      │
│      ▼                                                                      │
│  ┌─────────────────────────────────────────────────────────────────────┐   │
│  │ Step 2: Layout Analysis (P1 - Rule-based)                           │   │
│  │ ───────────────────────────────────────────────────────────────────  │   │
│  │ 2.1 get_text("blocks", sort=True) - Column-aware sorting            │   │
│  │ 2.2 Classify elements (title/body/header/footer/page_number)        │   │
│  │ 2.3 Filter unwanted elements (page numbers, decorations)            │   │
│  └─────────────────────────────────────────────────────────────────────┘   │
│      │                                                                      │
│      ▼                                                                      │
│  ┌─────────────────────────────────────────────────────────────────────┐   │
│  │ Step 3: Text Extraction (Enhanced)                                  │   │
│  │ ───────────────────────────────────────────────────────────────────  │   │
│  │ 3.1 Extract text with bbox coordinates preserved                    │   │
│  │ 3.2 Garble rate detection (cid:xxxx count / total chars)            │   │
│  │ 3.3 Auto-fallback: garble_rate > 10% → trigger Paddle OCR           │   │
│  └─────────────────────────────────────────────────────────────────────┘   │
│      │                                                                      │
│      ▼                                                                      │
│   UnifiedDocument (with bbox for debugging)                                 │
│                                                                             │
└─────────────────────────────────────────────────────────────────────────────┘

Step 0: GS Distillation (Exception Handler)

Purpose

Repair structurally damaged PDFs that PyMuPDF cannot parse correctly.

Trigger Conditions

import re
from typing import List


def should_trigger_gs_repair(page_text: str, mupdf_warnings: List[str]) -> bool:
    """Decide whether a page is damaged enough to warrant Ghostscript repair."""
    # Condition 1: high garble rate, counted in characters so the result
    # is directly comparable to calculate_garble_rate() in Step 3
    cid_pattern = r'\(cid:\d+\)'
    cid_chars = sum(len(m) for m in re.findall(cid_pattern, page_text))
    garble_rate = cid_chars / max(len(page_text), 1)

    if garble_rate > 0.1:  # >10% garbled
        return True

    # Condition 2: Severe structural errors
    severe_errors = ['error', 'invalid', 'corrupt', 'damaged']
    for warning in mupdf_warnings:
        if any(err in warning.lower() for err in severe_errors):
            return True

    return False

GS Command

gs -dNOPAUSE -dBATCH -dSAFER \
   -sDEVICE=pdfwrite \
   -dPDFSETTINGS=/prepress \
   -dDetectDuplicateImages=true \
   -sOutputFile=repaired.pdf \
   input.pdf

Implementation Notes

  • Default: DISABLED
  • Execution: Only when triggered by error detection
  • Fallback: If GS also fails, route to Paddle OCR track
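
A minimal sketch of how the trigger and graceful fallback could be wired together. run_gs_repair is a hypothetical helper, not part of the design above; it assumes Ghostscript is invoked via subprocess and simply reports failure so the caller can route the file to the Paddle OCR track.

import logging
import shutil
import subprocess
from pathlib import Path

logger = logging.getLogger(__name__)


def run_gs_repair(input_pdf: Path, output_pdf: Path, timeout: int = 120) -> bool:
    """Attempt a Ghostscript repair pass; return True only on success."""
    if shutil.which("gs") is None:
        logger.warning("Ghostscript not available; skipping repair")
        return False  # Graceful fallback: caller routes to the OCR track

    cmd = [
        "gs", "-dNOPAUSE", "-dBATCH", "-dSAFER",
        "-sDEVICE=pdfwrite",
        "-dPDFSETTINGS=/prepress",
        "-dDetectDuplicateImages=true",
        f"-sOutputFile={output_pdf}",
        str(input_pdf),
    ]
    try:
        result = subprocess.run(cmd, capture_output=True, timeout=timeout)
        return result.returncode == 0 and output_pdf.exists()
    except (subprocess.TimeoutExpired, OSError):
        logger.exception("Ghostscript repair crashed or timed out")
        return False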

Step 1: Object-level Cleaning (P0)

1.1 Content Stream Sanitization

def sanitize_page(page: fitz.Page) -> None:
    """Fix malformed PDF content stream."""
    page.clean_contents(sanitize=True)

1.2 Hidden Layer (OCG) Removal

from typing import List

import fitz


def remove_hidden_layers(doc: fitz.Document) -> List[str]:
    """Collect the names of Optional Content Groups that are hidden by default."""
    removed_layers = []

    ocgs = doc.get_ocgs()  # All OCG definitions, keyed by xref
    for ocg_xref, ocg_info in ocgs.items():
        # A layer whose default state is OFF is invisible in viewers
        if ocg_info.get('on') is False:
            removed_layers.append(ocg_info.get('name', f'OCG_{ocg_xref}'))
            # Marked here for exclusion during extraction

    return removed_layers

1.3 White-out Detection (Core Algorithm)

from typing import List

import fitz


def detect_whiteout_covered_text(page: fitz.Page, iou_threshold: float = 0.8) -> List[dict]:
    """
    Detect text covered by white rectangles ("white-out" / "correction tape" effect).

    The threshold is applied to the coverage ratio (intersection area over
    the word's own area) rather than a strict IoU, so a small word fully
    inside a large white rectangle still counts as covered.

    Returns list of text words that should be excluded from extraction.
    """
    covered_words = []

    # Get all white-filled rectangles
    drawings = page.get_drawings()
    white_rects = []
    for d in drawings:
        # Only pure-white fills (RGB all 1.0) count as white-out;
        # near-white fills would need a tolerance check here
        fill_color = d.get('fill')
        if fill_color == (1, 1, 1):
            rect = d.get('rect')
            if rect:
                white_rects.append(fitz.Rect(rect))

    if not white_rects:
        return covered_words

    # Get all text words with bounding boxes
    words = page.get_text("words")  # Returns list of (x0, y0, x1, y1, word, block_no, line_no, word_no)

    for word_info in words:
        word_rect = fitz.Rect(word_info[:4])
        word_text = word_info[4]

        for white_rect in white_rects:
            # Coverage check: intersect the word box with the white rect
            intersection = word_rect & white_rect
            if intersection.is_empty:
                continue

            intersection_area = intersection.width * intersection.height
            word_area = word_rect.width * word_rect.height

            if word_area > 0:
                coverage_ratio = intersection_area / word_area
                if coverage_ratio >= iou_threshold:
                    covered_words.append({
                        'text': word_text,
                        'bbox': tuple(word_rect),
                        'coverage': coverage_ratio
                    })
                    break  # Word is covered, no need to check other rects

    return covered_words
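
A quick synthetic check of the detector, in the spirit of the synthetic-PDF unit tests planned in the Testing Strategy. It assumes only PyMuPDF; the coordinates are illustrative and chosen so the white rectangle fully covers the inserted text.

import fitz

doc = fitz.open()          # new, empty PDF
page = doc.new_page()
page.insert_text((72, 100), "hidden text")  # text baseline at y=100

# Paint a pure-white rectangle over the text ("correction tape")
page.draw_rect(fitz.Rect(70, 86, 150, 104), color=None, fill=(1, 1, 1))

for w in detect_whiteout_covered_text(page, iou_threshold=0.8):
    print(f"covered: {w['text']!r} at {w['bbox']} ({w['coverage']:.0%})")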

Step 2: Layout Analysis (P1)

2.1 Column-aware Text Extraction

def extract_with_reading_order(page: fitz.Page) -> List[dict]:
    """
    Extract text blocks with correct reading order.
    PyMuPDF's sort=True handles two-column layouts automatically.
    """
    # CRITICAL: sort=True enables column-aware sorting
    blocks = page.get_text("dict", sort=True)['blocks']
    return blocks

2.2 Element Classification

import re

import fitz


def classify_element(block: dict, page_rect: fitz.Rect) -> str:
    """
    Classify a text block by position and font size.

    Returns one of: 'title', 'body', 'header', 'footer', 'page_number',
    or 'image' for blocks without text lines.
    """
    if 'lines' not in block:
        return 'image'

    bbox = fitz.Rect(block['bbox'])
    page_height = page_rect.height
    page_width = page_rect.width

    # Relative position (0.0 = top, 1.0 = bottom)
    y_rel = bbox.y0 / page_height

    # Get average font size
    font_sizes = []
    for line in block.get('lines', []):
        for span in line.get('spans', []):
            font_sizes.append(span.get('size', 12))
    avg_font_size = sum(font_sizes) / len(font_sizes) if font_sizes else 12

    # Get text content for pattern matching
    text = ''.join(
        span.get('text', '')
        for line in block.get('lines', [])
        for span in line.get('spans', [])
    ).strip()

    # Classification rules (checked in priority order)

    # Header: top 5% of page
    if y_rel < 0.05:
        return 'header'

    # Page number: bottom 10% + numeric pattern
    # (checked before the footer rule, which would otherwise shadow it)
    if y_rel > 0.90 and _is_page_number(text):
        return 'page_number'

    # Footer: bottom 5% of page
    if y_rel > 0.95:
        return 'footer'

    # Title: large font (>14pt) or centered
    if avg_font_size > 14:
        return 'title'

    # Check if centered (for subtitles)
    x_center = (bbox.x0 + bbox.x1) / 2
    page_center = page_width / 2
    if abs(x_center - page_center) < page_width * 0.1 and len(text) < 100:
        if avg_font_size > 12:
            return 'title'

    return 'body'


def _is_page_number(text: str) -> bool:
    """Check if text is likely a page number."""
    text = text.strip()

    # Pure number
    if text.isdigit():
        return True

    # Common patterns: "Page 1", "- 1 -", "1/10"
    patterns = [
        r'^page\s*\d+$',
        r'^-?\s*\d+\s*-?$',
        r'^\d+\s*/\s*\d+$',
        r'^第\s*\d+\s*頁$',
        r'^第\s*\d+\s*页$',
    ]

    for pattern in patterns:
        if re.match(pattern, text, re.IGNORECASE):
            return True

    return False

2.3 Element Filtering

def filter_elements(
    blocks: List[dict],
    page_rect: fitz.Rect,
    filter_headers: bool = False,
    filter_footers: bool = False,
) -> List[dict]:
    """Filter out unwanted elements (page numbers, optionally headers/footers)."""
    filtered = []

    # Page numbers are always dropped; headers/footers only when configured
    skipped = {'page_number'}
    if filter_headers:
        skipped.add('header')
    if filter_footers:
        skipped.add('footer')

    for block in blocks:
        element_type = classify_element(block, page_rect)
        if element_type in skipped:
            continue

        # Keep with classification metadata
        block['_element_type'] = element_type
        filtered.append(block)

    return filtered
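
How the three Step 2 pieces could be chained on a single page; input.pdf is a placeholder path.

import fitz

doc = fitz.open("input.pdf")  # placeholder path
page = doc[0]

blocks = page.get_text("dict", sort=True)["blocks"]  # 2.1 column-aware order
kept = filter_elements(blocks, page.rect)            # 2.2 classify + 2.3 filter

for block in kept:
    print(block["_element_type"], fitz.Rect(block["bbox"]))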

Step 3: Text Extraction (Enhanced)

3.1 Garble Detection

import re


def calculate_garble_rate(text: str) -> float:
    """
    Calculate the rate of garbled characters: (cid:xxxx) patterns,
    replacement characters (U+FFFD), and Private Use Area characters.

    Returns: float between 0.0 and 1.0
    """
    if not text:
        return 0.0

    # Count (cid:xxxx) patterns
    cid_pattern = r'\(cid:\d+\)'
    cid_matches = re.findall(cid_pattern, text)
    cid_char_count = sum(len(m) for m in cid_matches)

    # Count other garble indicators
    # - Replacement character U+FFFD
    # - Private Use Area characters
    replacement_count = text.count('\ufffd')
    pua_count = sum(1 for c in text if 0xE000 <= ord(c) <= 0xF8FF)

    total_garble = cid_char_count + replacement_count + pua_count
    total_chars = len(text)

    return total_garble / total_chars if total_chars > 0 else 0.0
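
Two illustrative inputs showing where the 10% threshold bites; the strings are made up for the example.

clean = "Quarterly revenue grew by twelve percent."
garbled = "(cid:72)(cid:101)(cid:108)(cid:108)(cid:111) world \ufffd\ufffd"

print(f"{calculate_garble_rate(clean):.1%}")    # 0.0%
print(f"{calculate_garble_rate(garbled):.1%}")  # far above the 10% threshold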

3.2 Auto-fallback to OCR

import logging

logger = logging.getLogger(__name__)


def should_fallback_to_ocr(page_text: str, garble_threshold: float = 0.1) -> bool:
    """
    Determine if page should be processed with OCR instead of direct extraction.

    Args:
        page_text: Extracted text from page
        garble_threshold: Maximum acceptable garble rate (default 10%)

    Returns:
        True if OCR fallback is recommended
    """
    garble_rate = calculate_garble_rate(page_text)

    if garble_rate > garble_threshold:
        logger.warning(
            f"High garble rate detected: {garble_rate:.1%}. "
            f"Recommending OCR fallback."
        )
        return True

    return False

Integration Point

Modified DirectExtractionEngine._extract_page()

def _extract_page(self, page: fitz.Page, page_num: int, ...) -> Page:
    """Extract content from a single page with preprocessing pipeline."""
    page_metadata: dict = {}  # populated below (e.g. 'needs_ocr' flag)

    # === Step 1: Object-level Cleaning ===

    # 1.1 Sanitize content stream
    page.clean_contents(sanitize=True)

    # 1.2 Detect white-out covered text
    covered_words = detect_whiteout_covered_text(page, iou_threshold=0.8)
    covered_bboxes = [fitz.Rect(w['bbox']) for w in covered_words]

    # === Step 2: Layout Analysis ===

    # 2.1 Extract with column-aware sorting
    blocks = page.get_text("dict", sort=True)['blocks']

    # 2.2 & 2.3 Classify and filter
    filtered_blocks = filter_elements(blocks, page.rect)

    # === Step 3: Text Extraction ===

    elements = []
    full_text = ""

    for block in filtered_blocks:
        # Skip if block overlaps with covered areas
        block_rect = fitz.Rect(block['bbox'])
        if any(block_rect.intersects(cr) for cr in covered_bboxes):
            continue

        # Extract text with bbox preserved
        element = self._block_to_element(block, page_num)
        if element:
            elements.append(element)
            full_text += element.get_text() + " "

    # 3.2 Check garble rate
    if should_fallback_to_ocr(full_text):
        # Mark page for OCR processing
        page_metadata['needs_ocr'] = True

    return Page(
        page_number=page_num,
        elements=elements,
        metadata=page_metadata
    )

Configuration

from dataclasses import dataclass


@dataclass
class PreprocessingConfig:
    """Configuration for PDF preprocessing pipeline."""

    # Step 0: GS Distillation
    gs_enabled: bool = False  # Disabled by default
    gs_garble_threshold: float = 0.1  # Trigger on >10% garble
    gs_detect_duplicate_images: bool = True

    # Step 1: Object Cleaning
    sanitize_content: bool = True
    remove_hidden_layers: bool = True
    whiteout_detection: bool = True
    whiteout_iou_threshold: float = 0.8

    # Step 2: Layout Analysis
    column_aware_sort: bool = True  # Use sort=True
    filter_page_numbers: bool = True
    filter_headers: bool = False  # Keep headers by default
    filter_footers: bool = False  # Keep footers by default

    # Step 3: Text Extraction
    preserve_bbox: bool = True  # For debugging
    garble_detection: bool = True
    ocr_fallback_threshold: float = 0.1  # Fallback on >10% garble
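
One way the config could gate individual steps. preprocess_page is a hypothetical wrapper for illustration, not part of the integration code above.

def preprocess_page(page: fitz.Page, config: PreprocessingConfig) -> List[dict]:
    """Run the config-enabled parts of Step 1 on a page (illustrative only)."""
    if config.sanitize_content:
        page.clean_contents(sanitize=True)  # 1.1

    covered_words: List[dict] = []
    if config.whiteout_detection:           # 1.3
        covered_words = detect_whiteout_covered_text(
            page, iou_threshold=config.whiteout_iou_threshold
        )
    return covered_words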

Testing Strategy

  1. Unit Tests

    • White-out detection with synthetic PDFs
    • Garble rate calculation (see the test sketch after this list)
    • Element classification accuracy
  2. Integration Tests

    • Two-column document reading order
    • Hidden layer removal
    • GS fallback trigger conditions
  3. Regression Tests

    • Existing task outputs should not change for clean PDFs
    • Performance benchmarks (should add <100ms per page)
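
A possible starting point for the unit tests in item 1, assuming the functions above live in an importable module (pytest shown; the module name is a placeholder).

import pytest

# Placeholder import path for the pipeline module
from preprocessing import calculate_garble_rate, _is_page_number


def test_clean_text_has_zero_garble():
    assert calculate_garble_rate("plain English text") == 0.0


def test_cid_runs_exceed_threshold():
    assert calculate_garble_rate("(cid:12)(cid:34)(cid:56) ok") > 0.1


@pytest.mark.parametrize("text", ["7", "Page 3", "- 12 -", "3/10", "第 5 页"])
def test_page_number_patterns(text):
    assert _is_page_number(text)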