# Tasks: PDF Preprocessing Pipeline

## Phase 1: Object-level Cleaning (P0)

### Step 1.1: Content Sanitization
- [x] Add `page.clean_contents(sanitize=True)` to `_extract_page()`
- [x] Add error handling for malformed content streams
- [x] Add logging for sanitization actions

### Step 1.2: Hidden Layer (OCG) Removal
- [x] Implement `get_hidden_ocg_layers()` function
- [ ] Add OCG content filtering during extraction (deferred - needs test case)
- [x] Add configuration option `remove_hidden_layers`
- [x] Add logging for removed layers

### Step 1.3: White-out Detection
- [x] Implement `detect_whiteout_covered_text()` with IoU calculation
- [x] Add white rectangle detection from `page.get_drawings()`
- [x] Integrate covered text filtering into extraction
- [x] Add configuration option `whiteout_iou_threshold` (default 0.8)
- [x] Add logging for detected white-out regions

## Phase 2: Layout Analysis (P1)

### Step 2.1: Column-aware Sorting
- [x] Change `get_text()` calls to use `sort=True` parameter (already implemented)
- [x] Verify reading order improvement on test documents
- [ ] Add configuration option `column_aware_sort` (deferred - low priority)

### Step 2.2: Element Classification
- [ ] Implement `classify_element()` function (deferred - existing detection sufficient)
- [x] Add position-based classification (header/footer/body) - via existing `_detect_headers_footers()`
- [x] Add font-size-based classification (title detection) - via existing logic
- [x] Add page number pattern detection `_is_page_number()`
- [ ] Preserve classification in element metadata `_element_type` (deferred)

### Step 2.3: Element Filtering
- [x] Implement `filter_elements()` function - `_filter_page_numbers()`
- [x] Add configuration options for filtering (page_numbers, headers, footers)
- [x] Add logging for filtered elements

## Phase 3: Enhanced Extraction (P1)

### Step 3.1: Bbox Preservation
- [x] Ensure all extracted elements retain bbox coordinates (already implemented)
- [x] Add bbox to UnifiedDocument element metadata
- [x] Verify bbox accuracy in generated output

### Step 3.2: Garble Detection
- [x] Implement `calculate_garble_rate()` function
- [x] Detect `(cid:xxxx)` patterns
- [x] Detect replacement characters (U+FFFD)
- [x] Detect Private Use Area characters
- [x] Add garble rate to page metadata

### Step 3.3: OCR Fallback
- [x] Implement `should_fallback_to_ocr()` decision function
- [x] Add configuration option `ocr_fallback_threshold` (default 0.1)
- [x] Add `get_pages_needing_ocr()` interface for callers
- [x] Add `get_extraction_quality_report()` for quality metrics
- [x] Add logging for fallback decisions

## Phase 4: GS Distillation - Exception Handler (P2)

### Step 0: GS Repair (Optional)
- [x] Implement `should_trigger_gs_repair()` trigger detection
- [x] Implement `repair_pdf_with_gs()` function
- [x] Add `-dDetectDuplicateImages=true` option
- [x] Add temporary file handling for repaired PDF
- [x] Implement `is_ghostscript_available()` check
- [x] Add `extract_with_repair()` method
- [x] Add fallback to normal extraction if GS not available
- [x] Add logging for GS repair actions

## Testing

### Unit Tests
- [ ] Test white-out detection with synthetic PDF
- [x] Test garble rate calculation
- [ ] Test element classification accuracy
- [x] Test page number pattern detection

### Integration Tests
- [x] Test with demo_docs/edit.pdf (3 pages)
- [x] Test with demo_docs/edit2.pdf (1 page)
- [x] Test with demo_docs/edit3.pdf (2 pages)
- [x] Test quality report generation
- [x] Test GS availability check
- [x] Test end-to-end pipeline with real documents

### Regression Tests
- [x] Verify existing clean PDFs produce same output
- [ ] Performance benchmark (<100ms overhead per page)
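
## Notes: White-out Detection Sketch (Step 1.3)

The still-open unit test for white-out detection may be easier to write against a pure-Python core. The sketch below shows one way the IoU check could work. It assumes each drawing is a dict with `rect` and `fill` keys, as returned by PyMuPDF's `page.get_drawings()`. The `iou()` and `is_white_fill()` helpers, the `tolerance` parameter, and the choice to return indices are hypothetical and are not taken from the actual implementation.

```python
from typing import Any, Dict, List, Optional, Sequence

BBox = Sequence[float]  # (x0, y0, x1, y1)


def iou(a: BBox, b: BBox) -> float:
    """Intersection-over-union of two axis-aligned boxes."""
    ix0, iy0 = max(a[0], b[0]), max(a[1], b[1])
    ix1, iy1 = min(a[2], b[2]), min(a[3], b[3])
    inter = max(0.0, ix1 - ix0) * max(0.0, iy1 - iy0)
    area_a = (a[2] - a[0]) * (a[3] - a[1])
    area_b = (b[2] - b[0]) * (b[3] - b[1])
    union = area_a + area_b - inter
    return inter / union if union > 0 else 0.0


def is_white_fill(fill: Optional[Sequence[float]], tolerance: float = 0.02) -> bool:
    """True if an RGB fill (components in 0..1) is white within a tolerance.

    The tolerance value is an assumption made for this sketch.
    """
    return fill is not None and all(c >= 1.0 - tolerance for c in fill)


def detect_whiteout_covered_text(
    text_bboxes: Sequence[BBox],
    drawings: Sequence[Dict[str, Any]],
    whiteout_iou_threshold: float = 0.8,
) -> List[int]:
    """Return indices of text boxes that overlap a white-filled rectangle."""
    white_rects = [tuple(d["rect"]) for d in drawings if is_white_fill(d.get("fill"))]
    return [
        i
        for i, tb in enumerate(text_bboxes)
        if any(iou(tb, wr) >= whiteout_iou_threshold for wr in white_rects)
    ]
```

One property worth covering in the synthetic-PDF test: IoU drops when a white rectangle is much larger than the text it hides, so a single large white-out over a short word can fall below the 0.8 threshold.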
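
## Notes: Garble Rate and OCR Fallback Sketch (Steps 3.2 and 3.3)

The three garble signals listed in Step 3.2 can be combined into a single ratio, which Step 3.3 then compares against `ocr_fallback_threshold`. The sketch below is one plausible reading. Counting every character of a `(cid:xxxx)` token as garbled, and using a strict `>` comparison, are assumptions rather than documented behavior.

```python
import re

CID_PATTERN = re.compile(r"\(cid:\d+\)")
REPLACEMENT_CHAR = "\ufffd"


def _is_private_use(ch: str) -> bool:
    """True for code points in the Unicode Private Use Areas."""
    cp = ord(ch)
    return (
        0xE000 <= cp <= 0xF8FF          # BMP Private Use Area
        or 0xF0000 <= cp <= 0xFFFFD     # Supplementary Private Use Area-A
        or 0x100000 <= cp <= 0x10FFFD   # Supplementary Private Use Area-B
    )


def calculate_garble_rate(text: str) -> float:
    """Fraction of characters that look like extraction garbage (0.0 to 1.0)."""
    if not text:
        return 0.0
    # Assumption: every character of a "(cid:123)" token counts as garbled.
    cid_chars = sum(len(m) for m in CID_PATTERN.findall(text))
    other = sum(1 for ch in text if ch == REPLACEMENT_CHAR or _is_private_use(ch))
    return (cid_chars + other) / len(text)


def should_fallback_to_ocr(text: str, ocr_fallback_threshold: float = 0.1) -> bool:
    """Decide whether a page's text is garbled enough to warrant OCR."""
    return calculate_garble_rate(text) > ocr_fallback_threshold
```

Under these assumptions, a page whose text is mostly `(cid:...)` tokens exceeds the default threshold of 0.1, while a page with one stray replacement character among a hundred clean ones does not.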
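
## Notes: Ghostscript Availability Sketch (Phase 4)

A minimal sketch of how the availability check and the repair command line could be put together. `find_ghostscript()` and `build_gs_repair_command()` are hypothetical helpers introduced for illustration. The executable names and the flags other than `-dDetectDuplicateImages=true` are assumptions about how `repair_pdf_with_gs()` invokes Ghostscript.

```python
import shutil
from typing import List, Optional

# "gs" on Unix-like systems; "gswin64c"/"gswin32c" are the Windows console builds.
GS_CANDIDATES = ("gs", "gswin64c", "gswin32c")


def find_ghostscript() -> Optional[str]:
    """Return the path of the first Ghostscript executable found on PATH."""
    for name in GS_CANDIDATES:
        path = shutil.which(name)
        if path:
            return path
    return None


def is_ghostscript_available() -> bool:
    """True if a Ghostscript executable can be located on PATH."""
    return find_ghostscript() is not None


def build_gs_repair_command(gs_path: str, src: str, dst: str) -> List[str]:
    """Build an argument list that rewrites src to dst via the pdfwrite device."""
    return [
        gs_path,
        "-dNOPAUSE",
        "-dBATCH",
        "-dQUIET",
        "-sDEVICE=pdfwrite",
        "-dDetectDuplicateImages=true",
        f"-sOutputFile={dst}",
        src,
    ]
```

A caller would typically pass the resulting list to `subprocess.run(..., check=True)` with `dst` pointing into a temporary directory, and fall back to normal extraction when `is_ghostscript_available()` returns `False`.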