# Tasks: PDF Preprocessing Pipeline

## Phase 1: Object-level Cleaning (P0)

### Step 1.1: Content Sanitization

- [x] Add `page.clean_contents(sanitize=True)` to `_extract_page()`
- [x] Add error handling for malformed content streams
- [x] Add logging for sanitization actions

### Step 1.2: Hidden Layer (OCG) Removal

- [x] Implement `get_hidden_ocg_layers()` function
- [ ] Add OCG content filtering during extraction (deferred - needs test case)
- [x] Add configuration option `remove_hidden_layers`
- [x] Add logging for removed layers
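
A minimal sketch of what the hidden-layer lookup could look like. It assumes the input is a dictionary shaped like the one PyMuPDF documents for `Document.get_ocgs()` (xref keys mapping to entries with `name` and `on` fields); the signature and the return shape here are illustrative assumptions, not the project's actual implementation.

```python
from typing import Any


def get_hidden_ocg_layers(ocgs: dict[int, dict[str, Any]]) -> list[dict[str, Any]]:
    """Return the OCG entries whose default state is off (hidden).

    `ocgs` is assumed to map xref -> {"name": ..., "on": bool, ...}.
    Entries without an "on" key are treated as visible.
    """
    hidden = []
    for xref, info in ocgs.items():
        if not info.get("on", True):
            hidden.append({"xref": xref, "name": info.get("name", "")})
    return hidden


# Hypothetical sample data for illustration only.
sample_ocgs = {
    12: {"name": "Watermark", "intent": ["View"], "on": False, "usage": "Artwork"},
    15: {"name": "Body", "intent": ["View"], "on": True, "usage": "Artwork"},
}
hidden_layers = get_hidden_ocg_layers(sample_ocgs)
```

Keeping the lookup separate from the filtering step means the deferred OCG content filtering can reuse the same list once a test case exists.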

### Step 1.3: White-out Detection

- [x] Implement `detect_whiteout_covered_text()` with IoU calculation
- [x] Add white rectangle detection from `page.get_drawings()`
- [x] Integrate covered text filtering into extraction
- [x] Add configuration option `whiteout_iou_threshold` (default 0.8)
- [x] Add logging for detected white-out regions
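
The IoU check in this step can be sketched in plain Python. This is an illustration under assumptions: boxes are `(x0, y0, x1, y1)` tuples, the white rectangles have already been extracted from the page drawings, and the function returns the indices of covered text boxes. The real `detect_whiteout_covered_text()` may take different arguments.

```python
Rect = tuple[float, float, float, float]  # (x0, y0, x1, y1)


def _area(r: Rect) -> float:
    """Area of a rectangle; degenerate or inverted boxes count as 0."""
    return max(0.0, r[2] - r[0]) * max(0.0, r[3] - r[1])


def bbox_iou(a: Rect, b: Rect) -> float:
    """Intersection over union of two axis-aligned rectangles."""
    inter = _area((max(a[0], b[0]), max(a[1], b[1]),
                   min(a[2], b[2]), min(a[3], b[3])))
    union = _area(a) + _area(b) - inter
    return inter / union if union > 0 else 0.0


def detect_whiteout_covered_text(
    text_boxes: list[Rect],
    white_rects: list[Rect],
    whiteout_iou_threshold: float = 0.8,
) -> list[int]:
    """Return indices of text boxes that some white rectangle covers."""
    return [
        i
        for i, box in enumerate(text_boxes)
        if any(bbox_iou(box, w) >= whiteout_iou_threshold for w in white_rects)
    ]


# Hypothetical boxes: the first text line sits under a white rectangle.
text_boxes = [(10, 10, 110, 30), (10, 50, 110, 70)]
white_rects = [(8, 9, 112, 31)]
covered = detect_whiteout_covered_text(text_boxes, white_rects)
```

One thing to keep in mind with a strict IoU: a white rectangle much larger than the text it hides yields a low score, so a coverage ratio (intersection divided by text area) may be worth comparing when the synthetic test PDF is built.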

## Phase 2: Layout Analysis (P1)

### Step 2.1: Column-aware Sorting

- [x] Change `get_text()` calls to use `sort=True` parameter (already implemented)
- [x] Verify reading order improvement on test documents
- [ ] Add configuration option `column_aware_sort` (deferred - low priority)

### Step 2.2: Element Classification

- [ ] Implement `classify_element()` function (deferred - existing detection sufficient)
- [x] Add position-based classification (header/footer/body) - via existing `_detect_headers_footers()`
- [x] Add font-size-based classification (title detection) - via existing logic
- [x] Add page number pattern detection `_is_page_number()`
- [ ] Preserve classification in element metadata `_element_type` (deferred)

### Step 2.3: Element Filtering

- [x] Implement `filter_elements()` function - via `_filter_page_numbers()`
- [x] Add configuration options for filtering (page_numbers, headers, footers)
- [x] Add logging for filtered elements

## Phase 3: Enhanced Extraction (P1)

### Step 3.1: Bbox Preservation

- [x] Ensure all extracted elements retain bbox coordinates (already implemented)
- [x] Add bbox to UnifiedDocument element metadata
- [x] Verify bbox accuracy in generated output

### Step 3.2: Garble Detection

- [x] Implement `calculate_garble_rate()` function
- [x] Detect `(cid:xxxx)` patterns
- [x] Detect replacement characters (U+FFFD)
- [x] Detect Private Use Area characters
- [x] Add garble rate to page metadata
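
The three signals listed above can be combined into a single rate. The sketch below assumes the rate is defined as the fraction of characters that belong to a `(cid:N)` token, are U+FFFD, or fall in a Unicode Private Use Area; the actual `calculate_garble_rate()` may weight these differently.

```python
import re

_CID_PATTERN = re.compile(r"\(cid:\d+\)")
_REPLACEMENT_CHAR = "\ufffd"


def _is_private_use(ch: str) -> bool:
    """True for code points in the BMP PUA or the two supplementary PUA planes."""
    cp = ord(ch)
    return (
        0xE000 <= cp <= 0xF8FF
        or 0xF0000 <= cp <= 0xFFFFD
        or 0x100000 <= cp <= 0x10FFFD
    )


def calculate_garble_rate(text: str) -> float:
    """Fraction of characters in `text` that look garbled (0.0 to 1.0)."""
    if not text:
        return 0.0
    # Every character of a "(cid:123)" token counts as garbled.
    cid_chars = sum(len(m) for m in _CID_PATTERN.findall(text))
    remainder = _CID_PATTERN.sub("", text)
    bad_chars = sum(
        1 for ch in remainder if ch == _REPLACEMENT_CHAR or _is_private_use(ch)
    )
    return (cid_chars + bad_chars) / len(text)
```

Counting whole `(cid:N)` tokens rather than one hit per token makes the rate comparable across the three signal types, since all are measured in characters.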

### Step 3.3: OCR Fallback

- [x] Implement `should_fallback_to_ocr()` decision function
- [x] Add configuration option `ocr_fallback_threshold` (default 0.1)
- [x] Add `get_pages_needing_ocr()` interface for callers
- [x] Add `get_extraction_quality_report()` for quality metrics
- [x] Add logging for fallback decisions
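
A sketch of how the fallback decision might tie the garble rate to `ocr_fallback_threshold`. The signatures, the strict `>` comparison, and the per-page dictionary input are assumptions made for illustration.

```python
def should_fallback_to_ocr(
    garble_rate: float, ocr_fallback_threshold: float = 0.1
) -> bool:
    """Decide whether a page is garbled enough to hand over to OCR.

    Assumption: a page exactly at the threshold is still accepted.
    """
    return garble_rate > ocr_fallback_threshold


def get_pages_needing_ocr(
    page_garble_rates: dict[int, float], ocr_fallback_threshold: float = 0.1
) -> list[int]:
    """Return the sorted page indices whose garble rate exceeds the threshold."""
    return sorted(
        page
        for page, rate in page_garble_rates.items()
        if should_fallback_to_ocr(rate, ocr_fallback_threshold)
    )


# Hypothetical per-page rates, e.g. taken from page metadata.
rates = {0: 0.02, 1: 0.35, 2: 0.10}
pages = get_pages_needing_ocr(rates)
```

Returning page indices instead of running OCR directly leaves the choice of OCR engine to the caller, which matches the "interface for callers" wording of the task.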

## Phase 4: GS Distillation - Exception Handler (P2)

### Step 0: GS Repair (Optional)

- [x] Implement `should_trigger_gs_repair()` trigger detection
- [x] Implement `repair_pdf_with_gs()` function
- [x] Add `-dDetectDuplicateImages=true` option
- [x] Add temporary file handling for repaired PDF
- [x] Implement `is_ghostscript_available()` check
- [x] Add `extract_with_repair()` method
- [x] Add fallback to normal extraction if GS not available
- [x] Add logging for GS repair actions
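
The repair path could be wired roughly as follows. This is a hedged sketch: the executable names searched, the helper `build_gs_repair_command()`, the timeout, and returning `None` on failure are assumptions; only `-dDetectDuplicateImages=true` and the function names `is_ghostscript_available()` and `repair_pdf_with_gs()` come from the task list.

```python
import os
import shutil
import subprocess
import tempfile
from pathlib import Path
from typing import Optional

# Assumed executable names: "gs" on Unix, "gswin64c"/"gswin32c" on Windows.
_GS_EXECUTABLES = ("gs", "gswin64c", "gswin32c")


def find_ghostscript() -> Optional[str]:
    """Return the path of the first Ghostscript executable found on PATH."""
    for name in _GS_EXECUTABLES:
        path = shutil.which(name)
        if path:
            return path
    return None


def is_ghostscript_available() -> bool:
    return find_ghostscript() is not None


def build_gs_repair_command(gs_path: str, src: Path, dst: Path) -> list[str]:
    """Build the argument list that rewrites `src` through the pdfwrite device."""
    return [
        gs_path, "-dBATCH", "-dNOPAUSE", "-dQUIET",
        "-sDEVICE=pdfwrite",
        "-dDetectDuplicateImages=true",
        f"-sOutputFile={dst}",
        str(src),
    ]


def repair_pdf_with_gs(src: Path, timeout: float = 60.0) -> Optional[Path]:
    """Write a repaired copy to a temp file; return None if repair is not possible."""
    gs_path = find_ghostscript()
    if gs_path is None:
        return None  # caller falls back to normal extraction
    fd, tmp_name = tempfile.mkstemp(suffix=".pdf")
    os.close(fd)
    dst = Path(tmp_name)
    try:
        subprocess.run(build_gs_repair_command(gs_path, src, dst),
                       check=True, timeout=timeout, capture_output=True)
    except (subprocess.SubprocessError, OSError):
        dst.unlink(missing_ok=True)  # do not leave a broken temp file behind
        return None
    return dst
```

In this sketch the caller owns the returned temporary file and should delete it once extraction has finished.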

## Testing

### Unit Tests

- [ ] Test white-out detection with synthetic PDF
- [x] Test garble rate calculation
- [ ] Test element classification accuracy
- [x] Test page number pattern detection

### Integration Tests

- [x] Test with demo_docs/edit.pdf (3 pages)
- [x] Test with demo_docs/edit2.pdf (1 page)
- [x] Test with demo_docs/edit3.pdf (2 pages)
- [x] Test quality report generation
- [x] Test GS availability check
- [x] Test end-to-end pipeline with real documents

### Regression Tests

- [x] Verify existing clean PDFs produce same output
- [ ] Performance benchmark (<100ms overhead per page)