feat: add PDF preprocessing pipeline for Direct track

Implement multi-stage preprocessing pipeline to improve extraction quality:

Phase 1 - Object-level Cleaning:
- Content stream sanitization via clean_contents(sanitize=True)
- Hidden OCG layer detection
- White-out detection with IoU 80% threshold

Phase 2 - Layout Analysis:
- Column-aware sorting (sort=True)
- Page number pattern detection and filtering
- Position-based element classification

Phase 3 - Enhanced Extraction:
- Garble rate detection (cid:xxxx, U+FFFD, PUA characters)
- OCR fallback recommendation when garble >10%
- Quality report generation interface

Phase 4 - GS Distillation (Exception Handler):
- Ghostscript PDF repair for severely damaged files
- Auto-triggered on high garble or mupdf errors
- Graceful fallback when GS unavailable

🤖 Generated with [Claude Code](https://claude.com/claude-code)

Co-Authored-By: Claude <noreply@anthropic.com>
2025-12-03 16:11:00 +08:00
parent 1b5c7f39a8
commit 6a65c7617d
4 changed files with 1236 additions and 9 deletions
--- a/openspec/changes/pdf-preprocessing-pipeline/tasks.md
+++ b/openspec/changes/pdf-preprocessing-pipeline/tasks.md
@@ -0,0 +1,93 @@
+# Tasks: PDF Preprocessing Pipeline
+
+## Phase 1: Object-level Cleaning (P0)
+
+### Step 1.1: Content Sanitization
+- [x] Add `page.clean_contents(sanitize=True)` to `_extract_page()`
+- [x] Add error handling for malformed content streams
+- [x] Add logging for sanitization actions
+
+### Step 1.2: Hidden Layer (OCG) Removal
+- [x] Implement `get_hidden_ocg_layers()` function
+- [ ] Add OCG content filtering during extraction (deferred - needs test case)
+- [x] Add configuration option `remove_hidden_layers`
+- [x] Add logging for removed layers
+
+### Step 1.3: White-out Detection
+- [x] Implement `detect_whiteout_covered_text()` with IoU calculation
+- [x] Add white rectangle detection from `page.get_drawings()`
+- [x] Integrate covered text filtering into extraction
+- [x] Add configuration option `whiteout_iou_threshold` (default 0.8)
+- [x] Add logging for detected white-out regions
+
+## Phase 2: Layout Analysis (P1)
+
+### Step 2.1: Column-aware Sorting
+- [x] Change `get_text()` calls to use `sort=True` parameter (already implemented)
+- [x] Verify reading order improvement on test documents
+- [ ] Add configuration option `column_aware_sort` (deferred - low priority)
+
+### Step 2.2: Element Classification
+- [ ] Implement `classify_element()` function (deferred - existing detection sufficient)
+- [x] Add position-based classification (header/footer/body) - via existing `_detect_headers_footers()`
+- [x] Add font-size-based classification (title detection) - via existing logic
+- [x] Add page number pattern detection `_is_page_number()`
+- [ ] Preserve classification in element metadata `_element_type` (deferred)
+
+### Step 2.3: Element Filtering
+- [x] Implement `filter_elements()` function - `_filter_page_numbers()`
+- [x] Add configuration options for filtering (page_numbers, headers, footers)
+- [x] Add logging for filtered elements
+
+## Phase 3: Enhanced Extraction (P1)
+
+### Step 3.1: Bbox Preservation
+- [x] Ensure all extracted elements retain bbox coordinates (already implemented)
+- [x] Add bbox to UnifiedDocument element metadata
+- [x] Verify bbox accuracy in generated output
+
+### Step 3.2: Garble Detection
+- [x] Implement `calculate_garble_rate()` function
+- [x] Detect `(cid:xxxx)` patterns
+- [x] Detect replacement characters (U+FFFD)
+- [x] Detect Private Use Area characters
+- [x] Add garble rate to page metadata
+
+### Step 3.3: OCR Fallback
+- [x] Implement `should_fallback_to_ocr()` decision function
+- [x] Add configuration option `ocr_fallback_threshold` (default 0.1)
+- [x] Add `get_pages_needing_ocr()` interface for callers
+- [x] Add `get_extraction_quality_report()` for quality metrics
+- [x] Add logging for fallback decisions
+
+## Phase 4: GS Distillation - Exception Handler (P2)
+
+### Step 0: GS Repair (Optional)
+- [x] Implement `should_trigger_gs_repair()` trigger detection
+- [x] Implement `repair_pdf_with_gs()` function
+- [x] Add `-dDetectDuplicateImages=true` option
+- [x] Add temporary file handling for repaired PDF
+- [x] Implement `is_ghostscript_available()` check
+- [x] Add `extract_with_repair()` method
+- [x] Add fallback to normal extraction if GS not available
+- [x] Add logging for GS repair actions
+
+## Testing
+
+### Unit Tests
+- [ ] Test white-out detection with synthetic PDF
+- [x] Test garble rate calculation
+- [ ] Test element classification accuracy
+- [x] Test page number pattern detection
+
+### Integration Tests
+- [x] Test with demo_docs/edit.pdf (3 pages)
+- [x] Test with demo_docs/edit2.pdf (1 page)
+- [x] Test with demo_docs/edit3.pdf (2 pages)
+- [x] Test quality report generation
+- [x] Test GS availability check
+- [x] Test end-to-end pipeline with real documents
+
+### Regression Tests
+- [x] Verify existing clean PDFs produce same output
+- [ ] Performance benchmark (<100ms overhead per page)
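
The garble detection and OCR fallback tasks in Phase 3 can be illustrated with a minimal sketch. The function names `calculate_garble_rate()` and `should_fallback_to_ocr()` and the 0.1 default threshold come from the task list; the signatures, the character-weighted formula, and the helper `_is_private_use()` are assumptions made for illustration and may differ from what the commit actually implements.

```python
import re

# Assumption: unmapped glyphs show up as "(cid:1234)" placeholders in the text.
CID_PATTERN = re.compile(r"\(cid:\d+\)")

REPLACEMENT_CHAR = "\ufffd"  # U+FFFD, emitted when decoding fails


def _is_private_use(ch: str) -> bool:
    """Hypothetical helper: True if ch lies in a Unicode Private Use Area."""
    cp = ord(ch)
    return (
        0xE000 <= cp <= 0xF8FF          # BMP Private Use Area
        or 0xF0000 <= cp <= 0xFFFFD     # Supplementary Private Use Area-A
        or 0x100000 <= cp <= 0x10FFFD   # Supplementary Private Use Area-B
    )


def calculate_garble_rate(text: str) -> float:
    """Return the fraction of characters in text that look garbled.

    Assumed formula: characters inside (cid:N) placeholders, U+FFFD,
    and Private Use Area characters, divided by the total length.
    """
    if not text:
        return 0.0
    cid_chars = sum(len(m) for m in CID_PATTERN.findall(text))
    bad_chars = sum(
        1 for ch in text if ch == REPLACEMENT_CHAR or _is_private_use(ch)
    )
    return (cid_chars + bad_chars) / len(text)


def should_fallback_to_ocr(text: str, threshold: float = 0.1) -> bool:
    """Recommend OCR when the garble rate strictly exceeds the threshold."""
    return calculate_garble_rate(text) > threshold
```

A caller might run `should_fallback_to_ocr(page_text)` on each page's extracted text and collect the pages that return `True`, which is roughly what an interface like `get_pages_needing_ocr()` would expose. Whether the comparison is strict (`>`) or inclusive is an assumption here, following the ">10%" wording in the commit message.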