OCR/openspec/changes/archive/2025-12-04-pdf-preprocessing-pipeline/tasks.md

# Tasks: PDF Preprocessing Pipeline

## Phase 1: Object-level Cleaning (P0)

### Step 1.1: Content Sanitization
- [x] Add `page.clean_contents(sanitize=True)` to `_extract_page()`
- [x] Add error handling for malformed content streams
- [x] Add logging for sanitization actions
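
The sanitization step above can be sketched as a small wrapper. This is a minimal sketch, not the project's `_extract_page()` code: the helper name `sanitize_page` and its return value are hypothetical, and the only PyMuPDF call it relies on is `page.clean_contents(sanitize=True)` from the task list.

```python
import logging

logger = logging.getLogger(__name__)


def sanitize_page(page, page_number: int) -> bool:
    """Try to clean a page's content stream; return True on success.

    `page` is assumed to be a PyMuPDF page, i.e. any object exposing
    clean_contents(sanitize=...). The helper name is hypothetical.
    """
    try:
        page.clean_contents(sanitize=True)
    except Exception as exc:  # malformed streams can fail in several ways
        # Keep extracting from the unsanitized stream instead of aborting.
        logger.warning(
            "Page %d: sanitization failed, using raw content stream: %s",
            page_number,
            exc,
        )
        return False
    logger.debug("Page %d: content stream sanitized", page_number)
    return True
```

Returning a flag rather than raising lets the caller continue with the raw page when a stream is malformed, which matches the error-handling task above.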

### Step 1.2: Hidden Layer (OCG) Removal
- [x] Implement `get_hidden_ocg_layers()` function
- [ ] Add OCG content filtering during extraction (deferred - needs test case)
- [x] Add configuration option `remove_hidden_layers`
- [x] Add logging for removed layers
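
As an illustration of the hidden-layer lookup, the sketch below picks out layers that are switched off. It assumes the input has the shape returned by PyMuPDF's `Document.get_ocgs()` (a dict keyed by xref whose values carry `name` and `on`); the function name `hidden_layer_names` is hypothetical and is not the project's `get_hidden_ocg_layers()`.

```python
from typing import Any, Dict, List


def hidden_layer_names(ocgs: Dict[int, Dict[str, Any]]) -> List[str]:
    """Return the names of optional content groups that are switched off.

    `ocgs` is assumed to map xref -> {"name": str, "on": bool, ...}.
    Entries without an "on" key are treated as visible.
    """
    hidden = []
    for xref, info in sorted(ocgs.items()):
        if not info.get("on", True):
            # Fall back to the xref when a layer has no name.
            hidden.append(info.get("name", f"xref {xref}"))
    return hidden
```

A caller would typically pass `doc.get_ocgs()` and log the returned names when `remove_hidden_layers` is enabled.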

### Step 1.3: White-out Detection
- [x] Implement `detect_whiteout_covered_text()` with IoU calculation
- [x] Add white rectangle detection from `page.get_drawings()`
- [x] Integrate covered text filtering into extraction
- [x] Add configuration option `whiteout_iou_threshold` (default 0.8)
- [x] Add logging for detected white-out regions
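
The IoU check behind white-out detection can be sketched as follows. This is a sketch under stated assumptions: bounding boxes are plain `(x0, y0, x1, y1)` tuples, the white rectangles have already been collected (for example from `page.get_drawings()`), and the helper names are hypothetical. Only the 0.8 default comes from the task list.

```python
from typing import List, Sequence, Tuple

BBox = Tuple[float, float, float, float]  # (x0, y0, x1, y1)


def bbox_iou(a: BBox, b: BBox) -> float:
    """Intersection over union of two axis-aligned boxes."""
    inter_w = max(0.0, min(a[2], b[2]) - max(a[0], b[0]))
    inter_h = max(0.0, min(a[3], b[3]) - max(a[1], b[1]))
    inter = inter_w * inter_h
    area_a = (a[2] - a[0]) * (a[3] - a[1])
    area_b = (b[2] - b[0]) * (b[3] - b[1])
    union = area_a + area_b - inter
    return inter / union if union > 0 else 0.0


def drop_covered_text(
    text_boxes: Sequence[BBox],
    white_rects: Sequence[BBox],
    iou_threshold: float = 0.8,
) -> List[BBox]:
    """Keep only text boxes that no white rectangle covers."""
    return [
        t
        for t in text_boxes
        if not any(bbox_iou(t, w) >= iou_threshold for w in white_rects)
    ]
```

One design consequence worth noting: IoU penalizes a white rectangle that is much larger than the text it hides, so a large white-out over a small word scores low. Whether that matters depends on how the real documents are edited.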

## Phase 2: Layout Analysis (P1)

### Step 2.1: Column-aware Sorting
- [x] Change `get_text()` calls to use `sort=True` parameter (already implemented)
- [x] Verify reading order improvement on test documents
- [ ] Add configuration option `column_aware_sort` (deferred - low priority)

### Step 2.2: Element Classification
- [ ] Implement `classify_element()` function (deferred - existing detection sufficient)
- [x] Add position-based classification (header/footer/body) - via existing `_detect_headers_footers()`
- [x] Add font-size-based classification (title detection) - via existing logic
- [x] Add page number pattern detection - via `_is_page_number()`
- [ ] Preserve classification in element metadata `_element_type` (deferred)
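
Page number pattern detection of the kind listed above might look like the following. The patterns are illustrative guesses at common page-number formats, not the ones in the project's `_is_page_number()`, and the function name is hypothetical.

```python
import re

# Illustrative patterns only; the real detector may accept other formats.
_PAGE_NUMBER_PATTERNS = [
    re.compile(r"^\d{1,4}$"),                                    # "12"
    re.compile(r"^-\s*\d{1,4}\s*-$"),                            # "- 12 -"
    re.compile(r"^page\s+\d{1,4}(\s+of\s+\d{1,4})?$", re.IGNORECASE),
    re.compile(r"^\d{1,4}\s*/\s*\d{1,4}$"),                      # "3/10"
]


def looks_like_page_number(text: str) -> bool:
    """Return True when the stripped text matches a page-number pattern."""
    stripped = text.strip()
    return any(p.match(stripped) for p in _PAGE_NUMBER_PATTERNS)
```

A bare number such as a year would also match the first pattern, so a check like this is usually combined with the position-based header/footer classification before anything is filtered.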

### Step 2.3: Element Filtering
- [x] Implement `filter_elements()` function - implemented as `_filter_page_numbers()`
- [x] Add configuration options for filtering (page_numbers, headers, footers)
- [x] Add logging for filtered elements

## Phase 3: Enhanced Extraction (P1)

### Step 3.1: Bbox Preservation
- [x] Ensure all extracted elements retain bbox coordinates (already implemented)
- [x] Add bbox to UnifiedDocument element metadata
- [x] Verify bbox accuracy in generated output

### Step 3.2: Garble Detection
- [x] Implement `calculate_garble_rate()` function
- [x] Detect `(cid:xxxx)` patterns
- [x] Detect replacement characters (U+FFFD)
- [x] Detect Private Use Area characters
- [x] Add garble rate to page metadata
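
The three garble signals above can be combined into one ratio. The sketch below is an assumption about how such a rate could be computed (garbled characters divided by total characters); the project's `calculate_garble_rate()` may weight or count things differently, and the name `garble_rate` is hypothetical.

```python
import re

_CID_PATTERN = re.compile(r"\(cid:\d+\)")
_REPLACEMENT_CHAR = "\ufffd"


def _is_private_use(ch: str) -> bool:
    """True for code points in the three Unicode Private Use Areas."""
    cp = ord(ch)
    return (
        0xE000 <= cp <= 0xF8FF
        or 0xF0000 <= cp <= 0xFFFFD
        or 0x100000 <= cp <= 0x10FFFD
    )


def garble_rate(text: str) -> float:
    """Fraction of characters that look like extraction garbage."""
    if not text:
        return 0.0
    # Count every character of each "(cid:NNNN)" marker as garbled.
    cid_chars = sum(len(m) for m in _CID_PATTERN.findall(text))
    remaining = _CID_PATTERN.sub("", text)
    bad_chars = sum(
        1 for ch in remaining if ch == _REPLACEMENT_CHAR or _is_private_use(ch)
    )
    return (cid_chars + bad_chars) / len(text)
```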

### Step 3.3: OCR Fallback
- [x] Implement `should_fallback_to_ocr()` decision function
- [x] Add configuration option `ocr_fallback_threshold` (default 0.1)
- [x] Add `get_pages_needing_ocr()` interface for callers
- [x] Add `get_extraction_quality_report()` for quality metrics
- [x] Add logging for fallback decisions
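
The fallback decision and the caller-facing helpers can be illustrated with a threshold check over per-page garble rates. This is a sketch with hypothetical names and signatures; only the default threshold of 0.1 comes from the task list, and the strict `>` comparison is an assumption.

```python
from typing import Dict, List


def needs_ocr(garble_rate: float, threshold: float = 0.1) -> bool:
    """Assumed rule: fall back to OCR when the rate exceeds the threshold."""
    return garble_rate > threshold


def pages_needing_ocr(
    page_rates: Dict[int, float], threshold: float = 0.1
) -> List[int]:
    """Page numbers whose garble rate triggers the fallback, in order."""
    return sorted(p for p, r in page_rates.items() if needs_ocr(r, threshold))


def quality_report(
    page_rates: Dict[int, float], threshold: float = 0.1
) -> Dict[str, object]:
    """Summarize extraction quality; the field names are illustrative."""
    flagged = pages_needing_ocr(page_rates, threshold)
    mean = sum(page_rates.values()) / len(page_rates) if page_rates else 0.0
    return {
        "total_pages": len(page_rates),
        "pages_needing_ocr": flagged,
        "mean_garble_rate": mean,
    }
```

Keeping the decision in one small function means the threshold semantics are defined in a single place, which the other two helpers reuse.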

## Phase 4: GS Distillation - Exception Handler (P2)

### Step 0: GS Repair (Optional)
- [x] Implement `should_trigger_gs_repair()` trigger detection
- [x] Implement `repair_pdf_with_gs()` function
- [x] Add `-dDetectDuplicateImages=true` option
- [x] Add temporary file handling for repaired PDF
- [x] Implement `is_ghostscript_available()` check
- [x] Add `extract_with_repair()` method
- [x] Add fallback to normal extraction if GS not available
- [x] Add logging for GS repair actions
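
A Ghostscript repair pass with a graceful fallback could be structured as below. This is a minimal sketch, not the project's `repair_pdf_with_gs()`: the helper names, the executable names tried, and the 60-second timeout are assumptions. The `pdfwrite` device, `-o`, and `-dDetectDuplicateImages=true` are standard Ghostscript options.

```python
import os
import shutil
import subprocess
import tempfile
from pathlib import Path
from typing import List, Optional


def find_ghostscript() -> Optional[str]:
    """Return the path of a Ghostscript executable, or None if absent."""
    for name in ("gs", "gswin64c", "gswin32c"):
        path = shutil.which(name)
        if path:
            return path
    return None


def build_gs_command(gs: str, src: Path, dst: Path) -> List[str]:
    """Command that rewrites `src` through the pdfwrite device."""
    return [
        gs,
        "-o", str(dst),               # -o also implies -dBATCH -dNOPAUSE
        "-sDEVICE=pdfwrite",
        "-dDetectDuplicateImages=true",
        "-dQUIET",
        str(src),
    ]


def repair_or_original(src: Path, timeout: float = 60.0) -> Path:
    """Return a repaired temp copy, or `src` when repair is not possible."""
    gs = find_ghostscript()
    if gs is None:
        return src  # fall back to normal extraction
    fd, tmp_name = tempfile.mkstemp(suffix=".pdf")
    os.close(fd)
    dst = Path(tmp_name)
    try:
        subprocess.run(
            build_gs_command(gs, src, dst),
            check=True,
            timeout=timeout,
            capture_output=True,
        )
    except (subprocess.SubprocessError, OSError):
        dst.unlink(missing_ok=True)
        return src
    return dst
```

The caller is responsible for deleting the temporary file when the returned path differs from `src`.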

## Testing

### Unit Tests
- [ ] Test white-out detection with synthetic PDF
- [x] Test garble rate calculation
- [ ] Test element classification accuracy
- [x] Test page number pattern detection

### Integration Tests
- [x] Test with demo_docs/edit.pdf (3 pages)
- [x] Test with demo_docs/edit2.pdf (1 page)
- [x] Test with demo_docs/edit3.pdf (2 pages)
- [x] Test quality report generation
- [x] Test GS availability check
- [x] Test end-to-end pipeline with real documents

### Regression Tests
- [x] Verify existing clean PDFs produce same output
- [ ] Performance benchmark (<100ms overhead per page)
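
For the open benchmark task, one way to measure the per-page overhead is to time the extraction with and without preprocessing over the same pages. The harness below is a hypothetical sketch: `baseline` and `candidate` stand for whatever callables wrap the two extraction paths, and nothing here reflects measured results.

```python
import time
from typing import Callable, Sequence


def mean_overhead_ms(
    baseline: Callable[[object], object],
    candidate: Callable[[object], object],
    pages: Sequence[object],
    repeats: int = 5,
) -> float:
    """Mean extra milliseconds per page that `candidate` adds over `baseline`."""
    if not pages or repeats < 1:
        raise ValueError("need at least one page and one repeat")

    def mean_ms(fn: Callable[[object], object]) -> float:
        start = time.perf_counter()
        for _ in range(repeats):
            for page in pages:
                fn(page)
        elapsed = time.perf_counter() - start
        return elapsed * 1000.0 / (repeats * len(pages))

    return mean_ms(candidate) - mean_ms(baseline)
```

The result would then be compared against the 100 ms budget stated in the task, ideally on the same demo documents used by the integration tests.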