Tasks: PDF Preprocessing Pipeline
Phase 1: Object-level Cleaning (P0)
Step 1.1: Content Sanitization
- Add page.clean_contents(sanitize=True) to _extract_page()
- Add error handling for malformed content streams
- Add logging for sanitization actions
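A minimal sketch of the sanitization step with error handling and logging. The helper name sanitize_page, the logger name, and the choice to continue after a failure are assumptions; the only library call used is PyMuPDF's page.clean_contents(sanitize=True).

```python
import logging

logger = logging.getLogger("pdf_preprocess")  # hypothetical logger name


def sanitize_page(page, page_number: int) -> bool:
    """Clean a page's content stream; return True if sanitization succeeded.

    `page` is expected to behave like a PyMuPDF Page object.
    """
    try:
        page.clean_contents(sanitize=True)
    except Exception as exc:  # malformed streams may raise library-specific errors
        logger.warning("page %d: sanitization failed: %s", page_number, exc)
        return False
    logger.debug("page %d: content stream sanitized", page_number)
    return True
```

Returning a flag rather than re-raising lets extraction carry on with the unsanitized page, which may or may not be the behaviour wanted here.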
Step 1.2: Hidden Layer (OCG) Removal
- Implement get_hidden_ocg_layers() function
- Add OCG content filtering during extraction (deferred - needs test case)
- Add configuration option remove_hidden_layers
- Add logging for removed layers
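One possible shape for the hidden-layer lookup, sketched under assumptions: the input is a mapping of xref to layer details in the form PyMuPDF's Document.get_ocgs() returns (keys such as "name" and "on"), and find_hidden_layers is a hypothetical helper, not the project's get_hidden_ocg_layers().

```python
import logging
from typing import Dict, List

logger = logging.getLogger("pdf_preprocess")  # hypothetical logger name


def find_hidden_layers(ocgs: Dict[int, dict]) -> List[str]:
    """Return the names of optional content groups whose "on" flag is False."""
    hidden = []
    for xref, info in ocgs.items():
        # Assumption: a missing "on" key means the layer is visible.
        if not info.get("on", True):
            name = info.get("name") or f"xref {xref}"
            hidden.append(name)
            logger.info("hidden layer found: %s", name)
    return hidden
```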
Step 1.3: White-out Detection
- Implement detect_whiteout_covered_text() with IoU calculation
- Add white rectangle detection from page.get_drawings()
- Integrate covered text filtering into extraction
- Add configuration option whiteout_iou_threshold (default 0.8)
- Add logging for detected white-out regions
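The IoU check can be sketched in plain Python. The signature below is an assumption: it takes bounding boxes as (x0, y0, x1, y1) tuples and returns the indices of covered text boxes. Only the function name and the 0.8 default come from the task list.

```python
from typing import List, Sequence, Tuple

Rect = Tuple[float, float, float, float]  # (x0, y0, x1, y1)


def _area(r: Rect) -> float:
    """Area of a rectangle; empty or inverted rectangles count as zero."""
    return max(0.0, r[2] - r[0]) * max(0.0, r[3] - r[1])


def iou(a: Rect, b: Rect) -> float:
    """Intersection over union of two axis-aligned rectangles."""
    inter = _area((max(a[0], b[0]), max(a[1], b[1]),
                   min(a[2], b[2]), min(a[3], b[3])))
    union = _area(a) + _area(b) - inter
    return inter / union if union > 0 else 0.0


def detect_whiteout_covered_text(text_boxes: Sequence[Rect],
                                 white_rects: Sequence[Rect],
                                 threshold: float = 0.8) -> List[int]:
    """Indices of text boxes overlapped by any white rectangle at IoU >= threshold."""
    return [i for i, box in enumerate(text_boxes)
            if any(iou(box, w) >= threshold for w in white_rects)]
```

IoU drops when a white rectangle is much larger than the text it hides, so a coverage ratio (intersection divided by text area) may be worth comparing against on the test documents.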
Phase 2: Layout Analysis (P1)
Step 2.1: Column-aware Sorting
- Change get_text() calls to use sort=True parameter (already implemented)
- Verify reading order improvement on test documents
- Add configuration option column_aware_sort (deferred - low priority)
Step 2.2: Element Classification
- Implement classify_element() function (deferred - existing detection sufficient)
- Add position-based classification (header/footer/body) - via existing _detect_headers_footers()
- Add font-size-based classification (title detection) - via existing logic
- Add page number pattern detection _is_page_number()
- Preserve classification in element metadata _element_type (deferred)
Step 2.3: Element Filtering
- Implement filter_elements() function - _filter_page_numbers()
- Add configuration options for filtering (page_numbers, headers, footers)
- Add logging for filtered elements
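A sketch of how the three filtering options might drive a single filter pass. FilterConfig, filter_by_type, the "type" key on element dicts, and the default values are all hypothetical; the task list only names the options page_numbers, headers, and footers.

```python
import logging
from dataclasses import dataclass
from typing import Dict, List

logger = logging.getLogger("pdf_preprocess")  # hypothetical logger name


@dataclass
class FilterConfig:
    """Hypothetical switches mirroring the three options named above."""
    page_numbers: bool = True
    headers: bool = False
    footers: bool = False


def filter_by_type(elements: List[Dict], config: FilterConfig) -> List[Dict]:
    """Drop elements whose assumed "type" value is switched on in the config."""
    drop = set()
    if config.page_numbers:
        drop.add("page_number")
    if config.headers:
        drop.add("header")
    if config.footers:
        drop.add("footer")
    kept = [e for e in elements if e.get("type") not in drop]
    logger.info("filtered %d of %d elements", len(elements) - len(kept), len(elements))
    return kept
```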
Phase 3: Enhanced Extraction (P1)
Step 3.1: Bbox Preservation
- Ensure all extracted elements retain bbox coordinates (already implemented)
- Add bbox to UnifiedDocument element metadata
- Verify bbox accuracy in generated output
Step 3.2: Garble Detection
- Implement calculate_garble_rate() function
- Detect (cid:xxxx) patterns
- Detect replacement characters (U+FFFD)
- Detect Private Use Area characters
- Add garble rate to page metadata
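The three detectors can be combined into one rate, sketched below. The exact definition is an assumption: each (cid:N) token counts as one garbled glyph, and the rate is garbled glyphs divided by total glyphs. The project's calculate_garble_rate() may weight things differently.

```python
import re

_CID_RE = re.compile(r"\(cid:\d+\)")


def _is_private_use(ch: str) -> bool:
    """True for code points in the BMP or supplementary Private Use Areas."""
    cp = ord(ch)
    return (0xE000 <= cp <= 0xF8FF
            or 0xF0000 <= cp <= 0xFFFFD
            or 0x100000 <= cp <= 0x10FFFD)


def calculate_garble_rate(text: str) -> float:
    """Fraction of glyphs that look garbled (assumed definition, see lead-in)."""
    cid_count = len(_CID_RE.findall(text))
    rest = _CID_RE.sub("", text)  # remaining characters after removing cid tokens
    bad = cid_count + sum(1 for ch in rest if ch == "\ufffd" or _is_private_use(ch))
    total = cid_count + len(rest)
    return bad / total if total else 0.0
```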
Step 3.3: OCR Fallback
- Implement should_fallback_to_ocr() decision function
- Add configuration option ocr_fallback_threshold (default 0.1)
- Add get_pages_needing_ocr() interface for callers
- Add get_extraction_quality_report() for quality metrics
- Add logging for fallback decisions
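The fallback decision reduces to a threshold comparison on the garble rate. In this sketch the signatures, the strict "greater than" comparison, and the page-number-to-rate mapping are assumptions; the function names and the 0.1 default come from the task list.

```python
from typing import Dict, List


def should_fallback_to_ocr(garble_rate: float, threshold: float = 0.1) -> bool:
    """Assumed rule: fall back to OCR when the garble rate exceeds the threshold."""
    return garble_rate > threshold


def get_pages_needing_ocr(rates: Dict[int, float], threshold: float = 0.1) -> List[int]:
    """Sorted page numbers whose garble rate triggers the fallback."""
    return sorted(page for page, rate in rates.items()
                  if should_fallback_to_ocr(rate, threshold))
```

Whether a page sitting exactly at the threshold should trigger OCR is a policy choice; the strict comparison here leaves such pages on the normal extraction path.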
Phase 4: GS Distillation - Exception Handler (P2)
Step 0: GS Repair (Optional)
- Implement should_trigger_gs_repair() trigger detection
- Implement repair_pdf_with_gs() function
- Add -dDetectDuplicateImages=true option
- Add temporary file handling for repaired PDF
- Implement is_ghostscript_available() check
- Add extract_with_repair() method
- Add fallback to normal extraction if GS not available
- Add logging for GS repair actions
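A rough sketch of the availability check and repair call using only the standard library. Assumptions: the executable is named gs (Windows builds typically ship gswin64c instead), the helper build_gs_repair_command is hypothetical, the 60-second timeout is arbitrary, and returning None signals the caller to fall back to normal extraction.

```python
import shutil
import subprocess
import tempfile
from pathlib import Path
from typing import List, Optional


def is_ghostscript_available(executable: str = "gs") -> bool:
    """True if the Ghostscript executable can be found on PATH."""
    return shutil.which(executable) is not None


def build_gs_repair_command(src: Path, dst: Path, executable: str = "gs") -> List[str]:
    """Command that rewrites src through the pdfwrite device into dst."""
    return [executable, "-o", str(dst), "-sDEVICE=pdfwrite",
            "-dDetectDuplicateImages=true", str(src)]


def repair_pdf_with_gs(src: Path, executable: str = "gs",
                       timeout: float = 60.0) -> Optional[Path]:
    """Return the path of a repaired temporary PDF, or None if repair is not possible."""
    if not is_ghostscript_available(executable):
        return None
    tmp = tempfile.NamedTemporaryFile(suffix=".pdf", delete=False)
    tmp.close()
    dst = Path(tmp.name)
    try:
        subprocess.run(build_gs_repair_command(src, dst, executable),
                       check=True, capture_output=True, timeout=timeout)
    except (subprocess.CalledProcessError, subprocess.TimeoutExpired, OSError):
        dst.unlink(missing_ok=True)  # do not leave a partial file behind
        return None
    return dst
```

The caller owns the returned temporary file and should delete it once extraction has finished.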
Testing
Unit Tests
- Test white-out detection with synthetic PDF
- Test garble rate calculation
- Test element classification accuracy
- Test page number pattern detection
Integration Tests
- Test with demo_docs/edit.pdf (3 pages)
- Test with demo_docs/edit2.pdf (1 page)
- Test with demo_docs/edit3.pdf (2 pages)
- Test quality report generation
- Test GS availability check
- Test end-to-end pipeline with real documents
Regression Tests
- Verify existing clean PDFs produce same output
- Performance benchmark (<100ms overhead per page)