# Design: PDF Preprocessing Pipeline

## Architecture Overview

```
┌─────────────────────────────────────────────────────────────────────────┐
│                   DIRECT Track PDF Processing Pipeline                  │
├─────────────────────────────────────────────────────────────────────────┤
│                                                                         │
│                                Input PDF                                │
│                                    │                                    │
│                                    ▼                                    │
│ ┌─────────────────────────────────────────────────────────────────────┐ │
│ │ Step 0: GS Distillation (Exception Handler)                        │ │
│ │ ─────────────────────────────────────────────────────────────────── │ │
│ │ Trigger: (cid:xxxx) garble detected OR mupdf structural errors      │ │
│ │ Action: gs -sDEVICE=pdfwrite -dDetectDuplicateImages=true           │ │
│ │ Status: DISABLED by default, auto-triggered on errors               │ │
│ └─────────────────────────────────────────────────────────────────────┘ │
│                                    │                                    │
│                                    ▼                                    │
│ ┌─────────────────────────────────────────────────────────────────────┐ │
│ │ Step 1: Object-level Cleaning (P0 - Core)                           │ │
│ │ ─────────────────────────────────────────────────────────────────── │ │
│ │ 1.1 clean_contents(sanitize=True) - Fix malformed content stream    │ │
│ │ 1.2 Remove hidden OCG layers                                         │ │
│ │ 1.3 White-out detection & removal (IoU >= 80%)                       │ │
│ └─────────────────────────────────────────────────────────────────────┘ │
│                                    │                                    │
│                                    ▼                                    │
│ ┌─────────────────────────────────────────────────────────────────────┐ │
│ │ Step 2: Layout Analysis (P1 - Rule-based)                            │ │
│ │ ─────────────────────────────────────────────────────────────────── │ │
│ │ 2.1 get_text("blocks", sort=True) - Column-aware sorting            │ │
│ │ 2.2 Classify elements (title/body/header/footer/page_number)        │ │
│ │ 2.3 Filter unwanted elements (page numbers, decorations)            │ │
│ └─────────────────────────────────────────────────────────────────────┘ │
│                                    │                                    │
│                                    ▼                                    │
│ ┌─────────────────────────────────────────────────────────────────────┐ │
│ │ Step 3: Text Extraction (Enhanced)                                   │ │
│ │ ─────────────────────────────────────────────────────────────────── │ │
│ │ 3.1 Extract text with bbox coordinates preserved                     │ │
│ │ 3.2 Garble rate detection (cid:xxxx count / total chars)             │ │
│ │ 3.3 Auto-fallback: garble_rate > 10% → trigger Paddle OCR            │ │
│ └─────────────────────────────────────────────────────────────────────┘ │
│                                    │                                    │
│                                    ▼                                    │
│                UnifiedDocument (with bbox for debugging)                │
│                                                                         │
└─────────────────────────────────────────────────────────────────────────┘
```

---

## Step 0: GS Distillation (Exception Handler)

### Purpose
Repair structurally damaged PDFs that PyMuPDF cannot parse correctly.

### Trigger Conditions
```python
def should_trigger_gs_repair(page_text: str, mupdf_warnings: List[str]) -> bool:
    # Condition 1: High garble rate (cid:xxxx patterns)
    cid_pattern = r'\(cid:\d+\)'
    cid_count = len(re.findall(cid_pattern, page_text))
    total_chars = len(page_text)
    garble_rate = cid_count / max(total_chars, 1)

    if garble_rate > 0.1:  # >10% garbled
        return True

    # Condition 2: Severe structural errors
    severe_errors = ['error', 'invalid', 'corrupt', 'damaged']
    for warning in mupdf_warnings:
        if any(err in warning.lower() for err in severe_errors):
            return True

    return False
```

### GS Command
```bash
gs -dNOPAUSE -dBATCH -dSAFER \
   -sDEVICE=pdfwrite \
   -dPDFSETTINGS=/prepress \
   -dDetectDuplicateImages=true \
   -sOutputFile=repaired.pdf \
   input.pdf
```

### Implementation Notes
- **Default**: DISABLED
- **Execution**: Only when triggered by error detection
- **Fallback**: If GS also fails, route to Paddle OCR track
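
The repair step is only invoked when the trigger above fires. A minimal invocation sketch follows; the helper name `run_gs_repair` and the temp-file handling are illustrative (not part of the existing codebase) and assume a `gs` binary on PATH:

```python
import subprocess
import tempfile
from pathlib import Path


def run_gs_repair(input_pdf: Path) -> Path:
    """Rewrite a damaged PDF through Ghostscript's pdfwrite device (sketch only)."""
    output_pdf = Path(tempfile.mkdtemp()) / "repaired.pdf"
    cmd = [
        "gs", "-dNOPAUSE", "-dBATCH", "-dSAFER",
        "-sDEVICE=pdfwrite",
        "-dPDFSETTINGS=/prepress",
        "-dDetectDuplicateImages=true",
        f"-sOutputFile={output_pdf}",
        str(input_pdf),
    ]
    # Ghostscript exits non-zero on hard failures; let the exception propagate
    # so the caller can route the document to the Paddle OCR track instead.
    subprocess.run(cmd, check=True, capture_output=True, timeout=300)
    return output_pdf
```

If the subprocess raises, the caller catches the error and falls back to the OCR track, per the fallback note above.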

---

## Step 1: Object-level Cleaning (P0)

### 1.1 Content Stream Sanitization
```python
def sanitize_page(page: fitz.Page) -> None:
    """Fix malformed PDF content stream."""
    page.clean_contents(sanitize=True)
```

### 1.2 Hidden Layer (OCG) Removal
```python
def remove_hidden_layers(doc: fitz.Document) -> List[str]:
    """Remove content from hidden Optional Content Groups."""
    removed_layers = []

    ocgs = doc.get_ocgs()  # Get all OCG definitions
    for ocg_xref, ocg_info in ocgs.items():
        # Check if layer is hidden by default
        if ocg_info.get('on') is False:
            removed_layers.append(ocg_info.get('name', f'OCG_{ocg_xref}'))
            # Mark for removal during extraction

    return removed_layers
```
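
The helper above only identifies hidden layers; it does not yet remove their content. One possible follow-up, sketched below, pins those groups to OFF in the default optional-content configuration via `Document.set_layer()` before extraction. Whether `get_text()` then omits that content depends on the PyMuPDF/MuPDF version, so this behaviour should be verified on a known sample rather than assumed:

```python
import fitz  # PyMuPDF
from typing import List


def disable_hidden_layers(doc: fitz.Document) -> List[int]:
    """Pin default-hidden OCGs to OFF before extraction (sketch only)."""
    hidden_xrefs = [
        xref for xref, info in doc.get_ocgs().items()
        if info.get('on') is False
    ]
    if hidden_xrefs:
        # -1 addresses the default optional-content configuration.
        doc.set_layer(-1, off=hidden_xrefs)
    return hidden_xrefs
```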

### 1.3 White-out Detection (Core Algorithm)
```python
def detect_whiteout_covered_text(page: fitz.Page, iou_threshold: float = 0.8) -> List[dict]:
    """
    Detect text covered by white rectangles ("white-out" / "correction tape" effect).

    Returns list of text words that should be excluded from extraction.
    """
    covered_words = []

    # Get all white-filled rectangles
    drawings = page.get_drawings()
    white_rects = []
    for d in drawings:
        # Check for white fill (RGB all 1.0)
        fill_color = d.get('fill')
        if fill_color and fill_color == (1, 1, 1):
            rect = d.get('rect')
            if rect:
                white_rects.append(fitz.Rect(rect))

    if not white_rects:
        return covered_words

    # Get all text words with bounding boxes
    words = page.get_text("words")  # Returns list of (x0, y0, x1, y1, word, block_no, line_no, word_no)

    for word_info in words:
        word_rect = fitz.Rect(word_info[:4])
        word_text = word_info[4]

        for white_rect in white_rects:
            # Coverage ratio: intersected area over the word's own area (not a symmetric IoU)
            intersection = word_rect & white_rect  # Intersection
            if intersection.is_empty:
                continue

            intersection_area = intersection.width * intersection.height
            word_area = word_rect.width * word_rect.height

            if word_area > 0:
                coverage_ratio = intersection_area / word_area
                if coverage_ratio >= iou_threshold:
                    covered_words.append({
                        'text': word_text,
                        'bbox': tuple(word_rect),
                        'coverage': coverage_ratio
                    })
                    break  # Word is covered, no need to check other rects

    return covered_words
```
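
A quick way to sanity-check the detector is to build a synthetic page with PyMuPDF itself: insert a word, draw a white rectangle over it, and confirm it is reported. A minimal sketch (coordinates and strings are arbitrary):

```python
import fitz  # PyMuPDF

doc = fitz.open()          # new, empty PDF
page = doc.new_page()      # default page size

# One visible word and one word that we then cover with a white rectangle.
page.insert_text((72, 100), "visible")
page.insert_text((72, 140), "hidden")
page.draw_rect(fitz.Rect(70, 125, 140, 145), color=None, fill=(1, 1, 1))

covered = detect_whiteout_covered_text(page, iou_threshold=0.8)
print([w['text'] for w in covered])  # expected to contain "hidden"
```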

---

## Step 2: Layout Analysis (P1)

### 2.1 Column-aware Text Extraction
```python
def extract_with_reading_order(page: fitz.Page) -> List[dict]:
    """
    Extract text blocks with correct reading order.
    PyMuPDF's sort=True handles two-column layouts automatically.
    """
    # CRITICAL: sort=True enables column-aware sorting
    blocks = page.get_text("dict", sort=True)['blocks']
    return blocks
```

### 2.2 Element Classification
```python
def classify_element(block: dict, page_rect: fitz.Rect) -> str:
    """
    Classify text block by position and font size.

    Returns: 'title', 'body', 'header', 'footer', 'page_number',
    or 'image' for blocks without text lines.
    """
    if 'lines' not in block:
        return 'image'

    bbox = fitz.Rect(block['bbox'])
    page_height = page_rect.height
    page_width = page_rect.width

    # Relative position (0.0 = top, 1.0 = bottom)
    y_rel = bbox.y0 / page_height

    # Get average font size
    font_sizes = []
    for line in block.get('lines', []):
        for span in line.get('spans', []):
            font_sizes.append(span.get('size', 12))
    avg_font_size = sum(font_sizes) / len(font_sizes) if font_sizes else 12

    # Get text content for pattern matching
    text = ''.join(
        span.get('text', '')
        for line in block.get('lines', [])
        for span in line.get('spans', [])
    ).strip()

    # Classification rules

    # Header: top 5% of page
    if y_rel < 0.05:
        return 'header'

    # Footer: bottom 5% of page
    if y_rel > 0.95:
        return 'footer'

    # Page number: bottom 10% + numeric pattern
    if y_rel > 0.90 and _is_page_number(text):
        return 'page_number'

    # Title: large font (>14pt) or centered
    if avg_font_size > 14:
        return 'title'

    # Check if centered (for subtitles)
    x_center = (bbox.x0 + bbox.x1) / 2
    page_center = page_width / 2
    if abs(x_center - page_center) < page_width * 0.1 and len(text) < 100:
        if avg_font_size > 12:
            return 'title'

    return 'body'


def _is_page_number(text: str) -> bool:
    """Check if text is likely a page number."""
    text = text.strip()

    # Pure number
    if text.isdigit():
        return True

    # Common patterns: "Page 1", "- 1 -", "1/10"
    patterns = [
        r'^page\s*\d+$',
        r'^-?\s*\d+\s*-?$',
        r'^\d+\s*/\s*\d+$',
        r'^第\s*\d+\s*頁$',
        r'^第\s*\d+\s*页$',
    ]

    for pattern in patterns:
        if re.match(pattern, text, re.IGNORECASE):
            return True

    return False
```

### 2.3 Element Filtering
```python
def filter_elements(blocks: List[dict], page_rect: fitz.Rect) -> List[dict]:
    """Filter out unwanted elements (page numbers, headers, footers)."""
    filtered = []

    for block in blocks:
        element_type = classify_element(block, page_rect)

        # Skip page numbers and optionally headers/footers
        if element_type == 'page_number':
            continue

        # Keep with classification metadata
        block['_element_type'] = element_type
        filtered.append(block)

    return filtered
```

---

## Step 3: Text Extraction (Enhanced)

### 3.1 Garble Detection
```python
def calculate_garble_rate(text: str) -> float:
    """
    Calculate the rate of garbled characters: (cid:xxxx) patterns,
    U+FFFD replacement characters, and Private Use Area characters.

    Returns: float between 0.0 and 1.0
    """
    if not text:
        return 0.0

    # Count (cid:xxxx) patterns
    cid_pattern = r'\(cid:\d+\)'
    cid_matches = re.findall(cid_pattern, text)
    cid_char_count = sum(len(m) for m in cid_matches)

    # Count other garble indicators
    # - Replacement character U+FFFD
    # - Private Use Area characters
    replacement_count = text.count('\ufffd')
    pua_count = sum(1 for c in text if 0xE000 <= ord(c) <= 0xF8FF)

    total_garble = cid_char_count + replacement_count + pua_count
    total_chars = len(text)

    return total_garble / total_chars if total_chars > 0 else 0.0
```

### 3.2 Auto-fallback to OCR
```python
def should_fallback_to_ocr(page_text: str, garble_threshold: float = 0.1) -> bool:
    """
    Determine if page should be processed with OCR instead of direct extraction.

    Args:
        page_text: Extracted text from page
        garble_threshold: Maximum acceptable garble rate (default 10%)

    Returns:
        True if OCR fallback is recommended
    """
    garble_rate = calculate_garble_rate(page_text)

    if garble_rate > garble_threshold:
        logger.warning(
            f"High garble rate detected: {garble_rate:.1%}. "
            f"Recommending OCR fallback."
        )
        return True

    return False
```

---

## Integration Point

### Modified DirectExtractionEngine._extract_page()

```python
def _extract_page(self, page: fitz.Page, page_num: int, ...) -> Page:
    """Extract content from a single page with preprocessing pipeline."""
    page_metadata: dict = {}  # page-level metadata (full initialization elided here)

    # === Step 1: Object-level Cleaning ===

    # 1.1 Sanitize content stream
    page.clean_contents(sanitize=True)

    # 1.2 Detect white-out covered text
    covered_words = detect_whiteout_covered_text(page, iou_threshold=0.8)
    covered_bboxes = [fitz.Rect(w['bbox']) for w in covered_words]

    # === Step 2: Layout Analysis ===

    # 2.1 Extract with column-aware sorting
    blocks = page.get_text("dict", sort=True)['blocks']

    # 2.2 & 2.3 Classify and filter
    filtered_blocks = filter_elements(blocks, page.rect)

    # === Step 3: Text Extraction ===

    elements = []
    full_text = ""

    for block in filtered_blocks:
        # Skip if block overlaps with covered areas
        block_rect = fitz.Rect(block['bbox'])
        if any(block_rect.intersects(cr) for cr in covered_bboxes):
            continue

        # Extract text with bbox preserved
        element = self._block_to_element(block, page_num)
        if element:
            elements.append(element)
            full_text += element.get_text() + " "

    # 3.2 Check garble rate
    if should_fallback_to_ocr(full_text):
        # Mark page for OCR processing
        page_metadata['needs_ocr'] = True

    return Page(
        page_number=page_num,
        elements=elements,
        metadata=page_metadata
    )
```

---

## Configuration

```python
@dataclass
class PreprocessingConfig:
    """Configuration for PDF preprocessing pipeline."""

    # Step 0: GS Distillation
    gs_enabled: bool = False  # Disabled by default
    gs_garble_threshold: float = 0.1  # Trigger on >10% garble
    gs_detect_duplicate_images: bool = True

    # Step 1: Object Cleaning
    sanitize_content: bool = True
    remove_hidden_layers: bool = True
    whiteout_detection: bool = True
    whiteout_iou_threshold: float = 0.8

    # Step 2: Layout Analysis
    column_aware_sort: bool = True  # Use sort=True
    filter_page_numbers: bool = True
    filter_headers: bool = False  # Keep headers by default
    filter_footers: bool = False  # Keep footers by default

    # Step 3: Text Extraction
    preserve_bbox: bool = True  # For debugging
    garble_detection: bool = True
    ocr_fallback_threshold: float = 0.1  # Fallback on >10% garble
```
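
For illustration, overriding a few defaults might look like the following; `PdfPreprocessor` is a hypothetical consumer of the config, not a class defined in this design:

```python
# Enable the GS repair path and relax the white-out threshold for a noisy corpus.
config = PreprocessingConfig(
    gs_enabled=True,
    whiteout_iou_threshold=0.7,
    filter_headers=True,
)

# Hypothetical wiring: the preprocessing entry point would receive this config.
# preprocessor = PdfPreprocessor(config)
# document = preprocessor.process("input.pdf")
```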

---

## Testing Strategy

1. **Unit Tests**
   - White-out detection with synthetic PDFs
   - Garble rate calculation (see the sketch after this list)
   - Element classification accuracy

2. **Integration Tests**
   - Two-column document reading order
   - Hidden layer removal
   - GS fallback trigger conditions

3. **Regression Tests**
   - Existing task outputs should not change for clean PDFs
   - Performance benchmarks (should add <100ms per page)
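
As a concrete starting point for the garble-rate unit tests, a minimal pytest-style sketch; the `preprocessing` module path is a placeholder for wherever `calculate_garble_rate` ends up:

```python
from preprocessing import calculate_garble_rate  # placeholder import path


def test_clean_text_has_zero_garble():
    assert calculate_garble_rate("A perfectly normal sentence.") == 0.0


def test_cid_patterns_exceed_fallback_threshold():
    # Ten (cid:NNNN) tokens dominate a short string, so the rate clears 10%.
    text = "(cid:1234) " * 10 + "short tail"
    assert calculate_garble_rate(text) > 0.1


def test_empty_text_is_not_garbled():
    assert calculate_garble_rate("") == 0.0
```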