feat: add PDF preprocessing pipeline for Direct track

Implement a multi-stage preprocessing pipeline to improve extraction quality:

Phase 1 - Object-level Cleaning:
- Content stream sanitization via clean_contents(sanitize=True)
- Hidden OCG layer detection
- White-out detection with an IoU threshold of 80%

Phase 2 - Layout Analysis:
- Column-aware sorting (sort=True)
- Page number pattern detection and filtering
- Position-based element classification

Phase 3 - Enhanced Extraction:
- Garble rate detection (cid:xxxx, U+FFFD, PUA characters)
- OCR fallback recommendation when the garble rate exceeds 10%
- Quality report generation interface

Phase 4 - GS Distillation (Exception Handler):
- Ghostscript PDF repair for severely damaged files
- Auto-triggered on a high garble rate or MuPDF errors
- Graceful fallback when GS unavailable

🤖 Generated with [Claude Code](https://claude.com/claude-code)

Co-Authored-By: Claude <noreply@anthropic.com>
egg
2025-12-03 16:11:00 +08:00
parent 1b5c7f39a8
commit 6a65c7617d
4 changed files with 1236 additions and 9 deletions

@@ -0,0 +1,93 @@
# Tasks: PDF Preprocessing Pipeline
## Phase 1: Object-level Cleaning (P0)
### Step 1.1: Content Sanitization
- [x] Add `page.clean_contents(sanitize=True)` to `_extract_page()`
- [x] Add error handling for malformed content streams
- [x] Add logging for sanitization actions
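A minimal sketch of how Step 1.1 could wrap the sanitization call with error handling and logging. The helper name `sanitize_page` and its `page_num` parameter are hypothetical; only `page.clean_contents(sanitize=True)` comes from the task list, and the actual code in `_extract_page()` may differ.

```python
import logging

logger = logging.getLogger(__name__)


def sanitize_page(page, page_num: int) -> bool:
    """Sanitize one page's content stream; return True on success.

    `page` is expected to behave like a PyMuPDF page, i.e. to expose
    clean_contents(sanitize=True). Hypothetical helper for illustration.
    """
    try:
        page.clean_contents(sanitize=True)
    except Exception as exc:  # malformed content streams may raise here
        logger.warning("Page %d: content sanitization failed: %s", page_num, exc)
        return False
    logger.debug("Page %d: content stream sanitized", page_num)
    return True
```

Returning a boolean instead of re-raising lets extraction continue on the unsanitized page when a stream is malformed.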
### Step 1.2: Hidden Layer (OCG) Removal
- [x] Implement `get_hidden_ocg_layers()` function
- [ ] Add OCG content filtering during extraction (deferred - needs test case)
- [x] Add configuration option `remove_hidden_layers`
- [x] Add logging for removed layers
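One possible shape for `get_hidden_ocg_layers()`, assuming it receives the mapping returned by PyMuPDF's `Document.get_ocgs()` (xref mapped to a dict with `name` and `on` keys). The signature and return type are assumptions; the committed function is not shown here.

```python
from __future__ import annotations


def get_hidden_ocg_layers(ocgs: dict) -> list[str]:
    """Return the names of optional content groups that are switched off.

    `ocgs` is assumed to look like {xref: {"name": str, "on": bool, ...}}.
    Entries without an "on" flag are treated as visible.
    """
    return [
        info.get("name", f"xref-{xref}")
        for xref, info in ocgs.items()
        if not info.get("on", True)
    ]
```

With an open document this would be called as `get_hidden_ocg_layers(doc.get_ocgs())`.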
### Step 1.3: White-out Detection
- [x] Implement `detect_whiteout_covered_text()` with IoU calculation
- [x] Add white rectangle detection from `page.get_drawings()`
- [x] Integrate covered text filtering into extraction
- [x] Add configuration option `whiteout_iou_threshold` (default 0.8)
- [x] Add logging for detected white-out regions
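The IoU check in Step 1.3 can be sketched as follows. Rectangles are plain `(x0, y0, x1, y1)` tuples, and `is_covered` is a hypothetical helper name; the real `detect_whiteout_covered_text()` may organize this differently.

```python
from typing import Iterable, Tuple

Rect = Tuple[float, float, float, float]  # (x0, y0, x1, y1)


def iou(a: Rect, b: Rect) -> float:
    """Intersection over union of two axis-aligned rectangles."""
    inter_w = max(0.0, min(a[2], b[2]) - max(a[0], b[0]))
    inter_h = max(0.0, min(a[3], b[3]) - max(a[1], b[1]))
    inter = inter_w * inter_h
    area_a = (a[2] - a[0]) * (a[3] - a[1])
    area_b = (b[2] - b[0]) * (b[3] - b[1])
    union = area_a + area_b - inter
    return inter / union if union > 0 else 0.0


def is_covered(text_bbox: Rect, white_rects: Iterable[Rect],
               threshold: float = 0.8) -> bool:
    """True if any white rectangle overlaps the text bbox at or above threshold."""
    return any(iou(text_bbox, rect) >= threshold for rect in white_rects)
```

Note that strict IoU penalizes a white rectangle much larger than the text it hides, so the threshold is most meaningful when the rectangle and text bbox are of similar size.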
## Phase 2: Layout Analysis (P1)
### Step 2.1: Column-aware Sorting
- [x] Change `get_text()` calls to use `sort=True` parameter (already implemented)
- [x] Verify reading order improvement on test documents
- [ ] Add configuration option `column_aware_sort` (deferred - low priority)
### Step 2.2: Element Classification
- [ ] Implement `classify_element()` function (deferred - existing detection sufficient)
- [x] Add position-based classification (header/footer/body) - via existing `_detect_headers_footers()`
- [x] Add font-size-based classification (title detection) - via existing logic
- [x] Add page number pattern detection `_is_page_number()`
- [ ] Preserve classification in element metadata `_element_type` (deferred)
### Step 2.3: Element Filtering
- [x] Implement `filter_elements()` function - `_filter_page_numbers()`
- [x] Add configuration options for filtering (page_numbers, headers, footers)
- [x] Add logging for filtered elements
## Phase 3: Enhanced Extraction (P1)
### Step 3.1: Bbox Preservation
- [x] Ensure all extracted elements retain bbox coordinates (already implemented)
- [x] Add bbox to UnifiedDocument element metadata
- [x] Verify bbox accuracy in generated output
### Step 3.2: Garble Detection
- [x] Implement `calculate_garble_rate()` function
- [x] Detect `(cid:xxxx)` patterns
- [x] Detect replacement characters (U+FFFD)
- [x] Detect Private Use Area characters
- [x] Add garble rate to page metadata
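The three garble signals listed above can be combined into one rate, sketched below. Defining the rate as garbled characters divided by total characters is an assumption; the committed `calculate_garble_rate()` may weight the signals differently.

```python
import re

_CID_RE = re.compile(r"\(cid:\d+\)")


def _is_private_use(ch: str) -> bool:
    """True for code points in the BMP or supplementary Private Use Areas."""
    cp = ord(ch)
    return (
        0xE000 <= cp <= 0xF8FF
        or 0xF0000 <= cp <= 0xFFFFD
        or 0x100000 <= cp <= 0x10FFFD
    )


def calculate_garble_rate(text: str) -> float:
    """Fraction of characters that are (cid:N) markers, U+FFFD, or PUA."""
    if not text:
        return 0.0
    # Characters consumed by "(cid:1234)" style placeholders.
    cid_chars = sum(len(m) for m in _CID_RE.findall(text))
    remainder = _CID_RE.sub("", text)
    bad_chars = sum(
        1 for ch in remainder if ch == "\ufffd" or _is_private_use(ch)
    )
    return (cid_chars + bad_chars) / len(text)
```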
### Step 3.3: OCR Fallback
- [x] Implement `should_fallback_to_ocr()` decision function
- [x] Add configuration option `ocr_fallback_threshold` (default 0.1)
- [x] Add `get_pages_needing_ocr()` interface for callers
- [x] Add `get_extraction_quality_report()` for quality metrics
- [x] Add logging for fallback decisions
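The fallback decision reduces to a threshold comparison. The sketch below assumes per-page garble rates are available as a `{page_number: rate}` mapping, which is a hypothetical input shape; the strict `>` follows the ">10%" wording in the commit message.

```python
from __future__ import annotations


def should_fallback_to_ocr(garble_rate: float, threshold: float = 0.1) -> bool:
    """Recommend OCR when the garble rate strictly exceeds the threshold."""
    return garble_rate > threshold


def get_pages_needing_ocr(
    page_garble_rates: dict[int, float], threshold: float = 0.1
) -> list[int]:
    """Return sorted page numbers whose garble rate exceeds the threshold."""
    return sorted(
        page
        for page, rate in page_garble_rates.items()
        if should_fallback_to_ocr(rate, threshold)
    )
```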
## Phase 4: GS Distillation - Exception Handler (P2)
### Step 0: GS Repair (Optional)
- [x] Implement `should_trigger_gs_repair()` trigger detection
- [x] Implement `repair_pdf_with_gs()` function
- [x] Add `-dDetectDuplicateImages=true` option
- [x] Add temporary file handling for repaired PDF
- [x] Implement `is_ghostscript_available()` check
- [x] Add `extract_with_repair()` method
- [x] Add fallback to normal extraction if GS not available
- [x] Add logging for GS repair actions
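A rough sketch of the repair path with graceful fallback. It assumes the Ghostscript binary is named `gs` and is on `PATH` (on Windows it is usually `gswin64c`); `build_gs_command` and the `timeout` parameter are hypothetical, and the exact flags used in this commit beyond `-dDetectDuplicateImages=true` are not shown here.

```python
from __future__ import annotations

import os
import shutil
import subprocess
import tempfile
from pathlib import Path


def is_ghostscript_available() -> bool:
    """True if a `gs` executable can be found on PATH."""
    return shutil.which("gs") is not None


def build_gs_command(src: Path, dst: Path) -> list[str]:
    """Command line that rewrites `src` through the pdfwrite device."""
    return [
        "gs",
        "-o", str(dst),  # -o also implies -dBATCH -dNOPAUSE
        "-sDEVICE=pdfwrite",
        "-dDetectDuplicateImages=true",
        str(src),
    ]


def repair_pdf_with_gs(src: Path, timeout: float = 60.0) -> Path | None:
    """Return the path of a repaired temporary PDF, or None on any failure."""
    if not is_ghostscript_available():
        return None  # caller falls back to normal extraction
    fd, tmp_name = tempfile.mkstemp(suffix=".pdf")
    os.close(fd)
    dst = Path(tmp_name)
    try:
        subprocess.run(
            build_gs_command(src, dst),
            check=True,
            capture_output=True,
            timeout=timeout,
        )
    except (subprocess.SubprocessError, OSError):
        dst.unlink(missing_ok=True)  # do not leave a broken temp file behind
        return None
    return dst
```

The caller is responsible for deleting the returned temporary file once extraction from the repaired PDF has finished.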
## Testing
### Unit Tests
- [ ] Test white-out detection with synthetic PDF
- [x] Test garble rate calculation
- [ ] Test element classification accuracy
- [x] Test page number pattern detection
### Integration Tests
- [x] Test with demo_docs/edit.pdf (3 pages)
- [x] Test with demo_docs/edit2.pdf (1 page)
- [x] Test with demo_docs/edit3.pdf (2 pages)
- [x] Test quality report generation
- [x] Test GS availability check
- [x] Test end-to-end pipeline with real documents
### Regression Tests
- [x] Verify existing clean PDFs produce same output
- [ ] Performance benchmark (<100ms overhead per page)