# Design: PDF Preprocessing Pipeline
## Architecture Overview
```
┌──────────────────────────────────────────────────────────────────────────┐
│                  DIRECT Track PDF Processing Pipeline                     │
├──────────────────────────────────────────────────────────────────────────┤
│                                                                          │
│  Input PDF                                                               │
│      │                                                                   │
│      ▼                                                                   │
│  ┌────────────────────────────────────────────────────────────────────┐  │
│  │ Step 0: GS Distillation (Exception Handler)                       │  │
│  │ ────────────────────────────────────────────────────────────────── │  │
│  │ Trigger: (cid:xxxx) garble detected OR mupdf structural errors    │  │
│  │ Action: gs -sDEVICE=pdfwrite -dDetectDuplicateImages=true         │  │
│  │ Status: DISABLED by default, auto-triggered on errors             │  │
│  └────────────────────────────────────────────────────────────────────┘  │
│      │                                                                   │
│      ▼                                                                   │
│  ┌────────────────────────────────────────────────────────────────────┐  │
│  │ Step 1: Object-level Cleaning (P0 - Core)                         │  │
│  │ ────────────────────────────────────────────────────────────────── │  │
│  │ 1.1 clean_contents(sanitize=True) - Fix malformed content stream  │  │
│  │ 1.2 Remove hidden OCG layers                                      │  │
│  │ 1.3 White-out detection & removal (IoU >= 80%)                    │  │
│  └────────────────────────────────────────────────────────────────────┘  │
│      │                                                                   │
│      ▼                                                                   │
│  ┌────────────────────────────────────────────────────────────────────┐  │
│  │ Step 2: Layout Analysis (P1 - Rule-based)                         │  │
│  │ ────────────────────────────────────────────────────────────────── │  │
│  │ 2.1 get_text("blocks", sort=True) - Column-aware sorting          │  │
│  │ 2.2 Classify elements (title/body/header/footer/page_number)      │  │
│  │ 2.3 Filter unwanted elements (page numbers, decorations)          │  │
│  └────────────────────────────────────────────────────────────────────┘  │
│      │                                                                   │
│      ▼                                                                   │
│  ┌────────────────────────────────────────────────────────────────────┐  │
│  │ Step 3: Text Extraction (Enhanced)                                │  │
│  │ ────────────────────────────────────────────────────────────────── │  │
│  │ 3.1 Extract text with bbox coordinates preserved                  │  │
│  │ 3.2 Garble rate detection (cid:xxxx count / total chars)          │  │
│  │ 3.3 Auto-fallback: garble_rate > 10% → trigger Paddle OCR         │  │
│  └────────────────────────────────────────────────────────────────────┘  │
│      │                                                                   │
│      ▼                                                                   │
│  UnifiedDocument (with bbox for debugging)                               │
│                                                                          │
└──────────────────────────────────────────────────────────────────────────┘
```
---
## Step 0: GS Distillation (Exception Handler)
### Purpose
Repair structurally damaged PDFs that PyMuPDF cannot parse correctly.
### Trigger Conditions
```python
import re
from typing import List

def should_trigger_gs_repair(page_text: str, mupdf_warnings: List[str]) -> bool:
# Condition 1: High garble rate (cid:xxxx patterns)
cid_pattern = r'\(cid:\d+\)'
cid_count = len(re.findall(cid_pattern, page_text))
total_chars = len(page_text)
garble_rate = cid_count / max(total_chars, 1)
if garble_rate > 0.1: # >10% garbled
return True
# Condition 2: Severe structural errors
severe_errors = ['error', 'invalid', 'corrupt', 'damaged']
for warning in mupdf_warnings:
if any(err in warning.lower() for err in severe_errors):
return True
return False
```
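A possible wiring of this probe at document-open time. `fitz.TOOLS.mupdf_warnings()` exists in recent PyMuPDF releases and returns the accumulated MuPDF messages as a string; the helper name `probe_pdf` and the three-page sample size are illustrative, not part of the design:

```python
import fitz  # PyMuPDF

def probe_pdf(path: str, sample_pages: int = 3) -> bool:
    """Return True if the file looks damaged enough to attempt GS repair."""
    fitz.TOOLS.mupdf_warnings()  # drain previously stored warnings
    with fitz.open(path) as doc:
        sample_text = "".join(
            page.get_text()
            for page in doc.pages(0, min(sample_pages, doc.page_count))
        )
    warnings = fitz.TOOLS.mupdf_warnings().splitlines()
    return should_trigger_gs_repair(sample_text, warnings)
```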
### GS Command
```bash
gs -dNOPAUSE -dBATCH -dSAFER \
-sDEVICE=pdfwrite \
-dPDFSETTINGS=/prepress \
-dDetectDuplicateImages=true \
-sOutputFile=repaired.pdf \
input.pdf
```
### Implementation Notes
- **Default**: DISABLED
- **Execution**: Only when triggered by error detection
- **Fallback**: If GS also fails, route to the Paddle OCR track (see the runner sketch below)
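
A minimal runner for the repair pass, standard library only. The flags mirror the GS Command section; the `gs_repair` name and the timeout value are illustrative:

```python
import shutil
import subprocess

def gs_repair(input_path: str, output_path: str, timeout: int = 300) -> bool:
    """Run the Ghostscript repair pass; return True on success."""
    if shutil.which("gs") is None:
        return False  # GS unavailable -> caller routes to the OCR track
    cmd = [
        "gs", "-dNOPAUSE", "-dBATCH", "-dSAFER",
        "-sDEVICE=pdfwrite",
        "-dPDFSETTINGS=/prepress",
        "-dDetectDuplicateImages=true",
        f"-sOutputFile={output_path}",
        input_path,
    ]
    try:
        result = subprocess.run(cmd, capture_output=True, timeout=timeout)
    except subprocess.TimeoutExpired:
        return False
    return result.returncode == 0
```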
---
## Step 1: Object-level Cleaning (P0)
### 1.1 Content Stream Sanitization
```python
import fitz  # PyMuPDF

def sanitize_page(page: fitz.Page) -> None:
"""Fix malformed PDF content stream."""
page.clean_contents(sanitize=True)
```
### 1.2 Hidden Layer (OCG) Removal
```python
def remove_hidden_layers(doc: fitz.Document) -> List[str]:
    """Collect the names of hidden Optional Content Groups (OCGs);
    their content is marked for exclusion during extraction."""
    removed_layers = []
    ocgs = doc.get_ocgs()  # Get all OCG definitions
    for ocg_xref, ocg_info in ocgs.items():
        # Layers that are off by default are treated as hidden
        if ocg_info.get('on') is False:
            removed_layers.append(ocg_info.get('name', f'OCG_{ocg_xref}'))
return removed_layers
```
### 1.3 White-out Detection (Core Algorithm)
```python
def detect_whiteout_covered_text(page: fitz.Page, iou_threshold: float = 0.8) -> List[dict]:
"""
Detect text covered by white rectangles ("white-out" / "correction tape" effect).
Returns list of text words that should be excluded from extraction.
"""
covered_words = []
# Get all white-filled rectangles
drawings = page.get_drawings()
white_rects = []
for d in drawings:
        # Check for a white (or near-white) fill; exact equality with
        # (1, 1, 1) would miss off-white fills, so allow a small tolerance
        fill_color = d.get('fill')
        if fill_color and all(c >= 0.99 for c in fill_color):
rect = d.get('rect')
if rect:
white_rects.append(fitz.Rect(rect))
if not white_rects:
return covered_words
# Get all text words with bounding boxes
words = page.get_text("words") # Returns list of (x0, y0, x1, y1, word, block_no, line_no, word_no)
for word_info in words:
word_rect = fitz.Rect(word_info[:4])
word_text = word_info[4]
for white_rect in white_rects:
            # Despite the parameter name, this computes coverage
            # (intersection area / word area), not a true IoU: the question
            # is "how much of the word does the rectangle hide?"
            intersection = word_rect & white_rect
if intersection.is_empty:
continue
intersection_area = intersection.width * intersection.height
word_area = word_rect.width * word_rect.height
if word_area > 0:
coverage_ratio = intersection_area / word_area
if coverage_ratio >= iou_threshold:
covered_words.append({
'text': word_text,
'bbox': tuple(word_rect),
'coverage': coverage_ratio
})
break # Word is covered, no need to check other rects
return covered_words
```
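A synthetic self-check for the algorithm, anticipating the unit tests planned below. `new_page`, `insert_text`, and `draw_rect` are standard PyMuPDF calls; the coordinates and helper name are illustrative:

```python
import fitz  # PyMuPDF

def demo_whiteout_detection() -> None:
    doc = fitz.open()  # new, empty PDF
    page = doc.new_page()
    page.insert_text((72, 100), "visible text")
    page.insert_text((72, 200), "hidden text")
    # Borderless white rectangle drawn over the second line
    page.draw_rect(fitz.Rect(70, 185, 210, 205), color=None, fill=(1, 1, 1))
    covered = detect_whiteout_covered_text(page, iou_threshold=0.8)
    # Expect the words of "hidden text" only, never "visible text"
    print([w['text'] for w in covered])
```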
---
## Step 2: Layout Analysis (P1)
### 2.1 Column-aware Text Extraction
```python
def extract_with_reading_order(page: fitz.Page) -> List[dict]:
"""
    Extract text blocks in reading order.
    PyMuPDF's sort=True orders blocks top-to-bottom, then left-to-right,
    which recovers the reading order of most two-column layouts.
"""
# CRITICAL: sort=True enables column-aware sorting
blocks = page.get_text("dict", sort=True)['blocks']
return blocks
```
### 2.2 Element Classification
```python
import re

def classify_element(block: dict, page_rect: fitz.Rect) -> str:
    """
    Classify a text block by position and font size.
    Returns: 'title', 'body', 'header', 'footer', 'page_number', or 'image'
    """
if 'lines' not in block:
return 'image'
bbox = fitz.Rect(block['bbox'])
page_height = page_rect.height
page_width = page_rect.width
# Relative position (0.0 = top, 1.0 = bottom)
y_rel = bbox.y0 / page_height
# Get average font size
font_sizes = []
for line in block.get('lines', []):
for span in line.get('spans', []):
font_sizes.append(span.get('size', 12))
avg_font_size = sum(font_sizes) / len(font_sizes) if font_sizes else 12
# Get text content for pattern matching
text = ''.join(
span.get('text', '')
for line in block.get('lines', [])
for span in line.get('spans', [])
).strip()
# Classification rules
# Header: top 5% of page
if y_rel < 0.05:
return 'header'
# Footer: bottom 5% of page
if y_rel > 0.95:
return 'footer'
# Page number: bottom 10% + numeric pattern
if y_rel > 0.90 and _is_page_number(text):
return 'page_number'
# Title: large font (>14pt) or centered
if avg_font_size > 14:
return 'title'
# Check if centered (for subtitles)
x_center = (bbox.x0 + bbox.x1) / 2
page_center = page_width / 2
if abs(x_center - page_center) < page_width * 0.1 and len(text) < 100:
if avg_font_size > 12:
return 'title'
return 'body'
def _is_page_number(text: str) -> bool:
"""Check if text is likely a page number."""
text = text.strip()
# Pure number
if text.isdigit():
return True
# Common patterns: "Page 1", "- 1 -", "1/10"
patterns = [
r'^page\s*\d+$',
r'^-?\s*\d+\s*-?$',
r'^\d+\s*/\s*\d+$',
        r'^第\s*\d+\s*頁$',  # Traditional Chinese: "第 N 頁" ("page N")
        r'^第\s*\d+\s*页$',  # Simplified Chinese: "第 N 页"
]
for pattern in patterns:
if re.match(pattern, text, re.IGNORECASE):
return True
return False
```
### 2.3 Element Filtering
```python
def filter_elements(blocks: List[dict], page_rect: fitz.Rect) -> List[dict]:
    """Filter out unwanted elements. Page numbers are always dropped;
    header/footer filtering is driven by configuration."""
filtered = []
for block in blocks:
element_type = classify_element(block, page_rect)
# Skip page numbers and optionally headers/footers
if element_type == 'page_number':
continue
# Keep with classification metadata
block['_element_type'] = element_type
filtered.append(block)
return filtered
```
---
## Step 3: Text Extraction (Enhanced)
### 3.1 Garble Detection
```python
import re

def calculate_garble_rate(text: str) -> float:
"""
Calculate the rate of garbled characters (cid:xxxx patterns).
Returns: float between 0.0 and 1.0
"""
if not text:
return 0.0
# Count (cid:xxxx) patterns
cid_pattern = r'\(cid:\d+\)'
cid_matches = re.findall(cid_pattern, text)
cid_char_count = sum(len(m) for m in cid_matches)
# Count other garble indicators
# - Replacement character U+FFFD
# - Private Use Area characters
replacement_count = text.count('\ufffd')
pua_count = sum(1 for c in text if 0xE000 <= ord(c) <= 0xF8FF)
total_garble = cid_char_count + replacement_count + pua_count
total_chars = len(text)
return total_garble / total_chars if total_chars > 0 else 0.0
```
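A quick sanity check of the arithmetic; note that every character of a `(cid:NN)` token is counted as garbled:

```python
text = "(cid:12) ab"  # the token contributes 8 garbled chars out of 11 total
assert abs(calculate_garble_rate(text) - 8 / 11) < 1e-9
assert calculate_garble_rate("clean text") == 0.0
assert calculate_garble_rate("") == 0.0
```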
### 3.2 Auto-fallback to OCR
```python
import logging

logger = logging.getLogger(__name__)

def should_fallback_to_ocr(page_text: str, garble_threshold: float = 0.1) -> bool:
"""
Determine if page should be processed with OCR instead of direct extraction.
Args:
page_text: Extracted text from page
garble_threshold: Maximum acceptable garble rate (default 10%)
Returns:
True if OCR fallback is recommended
"""
garble_rate = calculate_garble_rate(page_text)
if garble_rate > garble_threshold:
logger.warning(
f"High garble rate detected: {garble_rate:.1%}. "
f"Recommending OCR fallback."
)
return True
return False
```
---
## Integration Point
### Modified DirectExtractionEngine._extract_page()
```python
def _extract_page(self, page: fitz.Page, page_num: int, ...) -> Page:
    """Extract content from a single page with preprocessing pipeline."""
    page_metadata: dict = {}  # per-page flags, e.g. needs_ocr
    # === Step 1: Object-level Cleaning ===
# 1.1 Sanitize content stream
page.clean_contents(sanitize=True)
# 1.2 Detect white-out covered text
covered_words = detect_whiteout_covered_text(page, iou_threshold=0.8)
covered_bboxes = [fitz.Rect(w['bbox']) for w in covered_words]
# === Step 2: Layout Analysis ===
# 2.1 Extract with column-aware sorting
blocks = page.get_text("dict", sort=True)['blocks']
# 2.2 & 2.3 Classify and filter
filtered_blocks = filter_elements(blocks, page.rect)
# === Step 3: Text Extraction ===
elements = []
full_text = ""
for block in filtered_blocks:
# Skip if block overlaps with covered areas
block_rect = fitz.Rect(block['bbox'])
if any(block_rect.intersects(cr) for cr in covered_bboxes):
continue
# Extract text with bbox preserved
element = self._block_to_element(block, page_num)
if element:
elements.append(element)
full_text += element.get_text() + " "
# 3.2 Check garble rate
if should_fallback_to_ocr(full_text):
# Mark page for OCR processing
page_metadata['needs_ocr'] = True
return Page(
page_number=page_num,
elements=elements,
metadata=page_metadata
)
```
---
## Configuration
```python
from dataclasses import dataclass

@dataclass
class PreprocessingConfig:
"""Configuration for PDF preprocessing pipeline."""
# Step 0: GS Distillation
gs_enabled: bool = False # Disabled by default
gs_garble_threshold: float = 0.1 # Trigger on >10% garble
gs_detect_duplicate_images: bool = True
# Step 1: Object Cleaning
sanitize_content: bool = True
remove_hidden_layers: bool = True
whiteout_detection: bool = True
whiteout_iou_threshold: float = 0.8
# Step 2: Layout Analysis
column_aware_sort: bool = True # Use sort=True
filter_page_numbers: bool = True
filter_headers: bool = False # Keep headers by default
filter_footers: bool = False # Keep footers by default
# Step 3: Text Extraction
preserve_bbox: bool = True # For debugging
garble_detection: bool = True
ocr_fallback_threshold: float = 0.1 # Fallback on >10% garble
```
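Illustrative construction; the field names are exactly those defined above:

```python
# Defaults: GS repair off; cleaning, layout analysis, garble detection on
config = PreprocessingConfig()

# A more aggressive profile for a batch of known-damaged files
damaged_profile = PreprocessingConfig(
    gs_enabled=True,       # allow the Step 0 repair pass
    filter_headers=True,   # also drop running headers
    filter_footers=True,
)
```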
---
## Testing Strategy
1. **Unit Tests**
- White-out detection with synthetic PDFs
- Garble rate calculation
- Element classification accuracy (see the pytest sketch after this list)
2. **Integration Tests**
- Two-column document reading order
- Hidden layer removal
- GS fallback trigger conditions
3. **Regression Tests**
- Existing task outputs should not change for clean PDFs
- Performance benchmarks (should add <100ms per page)
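
As referenced in the unit-test list, a minimal pytest sketch for the page-number side of element classification, using `_is_page_number` from Step 2 (test values are illustrative):

```python
import pytest

@pytest.mark.parametrize("text,expected", [
    ("7", True),
    ("Page 12", True),
    ("- 3 -", True),
    ("12/48", True),
    ("第 5 页", True),
    ("Chapter 1", False),
    ("2023 Annual Report", False),
])
def test_is_page_number(text, expected):
    assert _is_page_number(text) == expected
```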