OCR/openspec/changes/pdf-preprocessing-pipeline/tasks.md
egg 6a65c7617d feat: add PDF preprocessing pipeline for Direct track
Implement multi-stage preprocessing pipeline to improve extraction quality:

Phase 1 - Object-level Cleaning:
- Content stream sanitization via clean_contents(sanitize=True)
- Hidden OCG layer detection
- White-out detection with IoU 80% threshold

Phase 2 - Layout Analysis:
- Column-aware sorting (sort=True)
- Page number pattern detection and filtering
- Position-based element classification

Phase 3 - Enhanced Extraction:
- Garble rate detection (cid:xxxx, U+FFFD, PUA characters)
- OCR fallback recommendation when garble >10%
- Quality report generation interface

Phase 4 - GS Distillation (Exception Handler):
- Ghostscript PDF repair for severely damaged files
- Auto-triggered on a high garble rate or MuPDF errors
- Graceful fallback when GS unavailable

🤖 Generated with [Claude Code](https://claude.com/claude-code)

Co-Authored-By: Claude <noreply@anthropic.com>
2025-12-03 16:11:00 +08:00


Tasks: PDF Preprocessing Pipeline

Phase 1: Object-level Cleaning (P0)

Step 1.1: Content Sanitization

  • Add page.clean_contents(sanitize=True) to _extract_page()
  • Add error handling for malformed content streams
  • Add logging for sanitization actions
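
A minimal sketch of Step 1.1, assuming the page object is a PyMuPDF Page. The helper name sanitize_page and the logger name are hypothetical; only page.clean_contents(sanitize=True) comes from the task list. A failed sanitization is logged and extraction continues on the raw content stream.

```python
import logging

logger = logging.getLogger("pdf_preprocessing")  # hypothetical logger name


def sanitize_page(page, page_number: int) -> bool:
    """Try to clean a page's content stream; return True on success.

    `page` is expected to be a PyMuPDF Page. Any object exposing
    clean_contents(sanitize=...) works for this sketch.
    """
    try:
        page.clean_contents(sanitize=True)
    except Exception as exc:  # malformed streams can fail in several ways
        logger.warning(
            "Page %d: sanitization failed (%s); keeping raw content",
            page_number,
            exc,
        )
        return False
    logger.debug("Page %d: content stream sanitized", page_number)
    return True
```

Inside _extract_page() this would be called once per page, before any get_text() call, so that later steps see the cleaned stream.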

Step 1.2: Hidden Layer (OCG) Removal

  • Implement get_hidden_ocg_layers() function
  • Add OCG content filtering during extraction (deferred - needs test case)
  • Add configuration option remove_hidden_layers
  • Add logging for removed layers
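
One possible shape for get_hidden_ocg_layers(), sketched under the assumption that its input is the dictionary returned by PyMuPDF's Document.get_ocgs(), that is {xref: {"name": ..., "on": bool, ...}}. Treating "on" == False as "hidden" is an assumption of this sketch, not a confirmed rule of the pipeline.

```python
def get_hidden_ocg_layers(ocgs: dict) -> list:
    """Return the names of optional content groups that are switched off.

    `ocgs` is assumed to look like the result of Document.get_ocgs():
    {xref: {"name": str, "on": bool, ...}}. A missing "on" key is
    treated as visible.
    """
    hidden = []
    for xref, info in sorted(ocgs.items()):
        if not info.get("on", True):
            # Fall back to the xref when a layer has no name.
            hidden.append(info.get("name", f"xref {xref}"))
    return hidden
```

With a real document the call would be get_hidden_ocg_layers(doc.get_ocgs()). The returned names can be logged and, when remove_hidden_layers is enabled, used to decide which content to skip.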

Step 1.3: White-out Detection

  • Implement detect_whiteout_covered_text() with IoU calculation
  • Add white rectangle detection from page.get_drawings()
  • Integrate covered text filtering into extraction
  • Add configuration option whiteout_iou_threshold (default 0.8)
  • Add logging for detected white-out regions
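
The IoU check in Step 1.3 can be sketched in plain Python. Rectangles are assumed to be (x0, y0, x1, y1) tuples, and the signature of detect_whiteout_covered_text() shown here is hypothetical. Only the function name and the 0.8 default come from the task list.

```python
def rect_area(r) -> float:
    """Area of an (x0, y0, x1, y1) rectangle; 0 for empty or inverted ones."""
    return max(0.0, r[2] - r[0]) * max(0.0, r[3] - r[1])


def iou(a, b) -> float:
    """Intersection over union of two (x0, y0, x1, y1) rectangles."""
    inter = rect_area(
        (max(a[0], b[0]), max(a[1], b[1]), min(a[2], b[2]), min(a[3], b[3]))
    )
    union = rect_area(a) + rect_area(b) - inter
    return inter / union if union > 0 else 0.0


def detect_whiteout_covered_text(text_boxes, white_rects, whiteout_iou_threshold=0.8):
    """Return indices of text boxes that a white rectangle covers.

    A box counts as covered when its IoU with any white rectangle
    reaches the threshold.
    """
    return [
        i
        for i, box in enumerate(text_boxes)
        if any(iou(box, w) >= whiteout_iou_threshold for w in white_rects)
    ]
```

In PyMuPDF, candidate white rectangles could be collected from page.get_drawings() by keeping entries whose "fill" is (1.0, 1.0, 1.0) and reading their "rect". One caveat of strict IoU: a large white rectangle hiding a small text box scores low, so an intersection-over-text-area ratio may be worth testing as an alternative.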

Phase 2: Layout Analysis (P1)

Step 2.1: Column-aware Sorting

  • Change get_text() calls to use sort=True parameter (already implemented)
  • Verify reading order improvement on test documents
  • Add configuration option column_aware_sort (deferred - low priority)

Step 2.2: Element Classification

  • Implement classify_element() function (deferred - existing detection sufficient)
  • Add position-based classification (header/footer/body), via existing _detect_headers_footers()
  • Add font-size-based classification (title detection), via existing logic
  • Add page number pattern detection: _is_page_number()
  • Preserve classification in element metadata field _element_type (deferred)

Step 2.3: Element Filtering

  • Implement filter_elements() function, via _filter_page_numbers()
  • Add configuration options for filtering (page_numbers, headers, footers)
  • Add logging for filtered elements

Phase 3: Enhanced Extraction (P1)

Step 3.1: Bbox Preservation

  • Ensure all extracted elements retain bbox coordinates (already implemented)
  • Add bbox to UnifiedDocument element metadata
  • Verify bbox accuracy in generated output

Step 3.2: Garble Detection

  • Implement calculate_garble_rate() function
  • Detect (cid:xxxx) patterns
  • Detect replacement characters (U+FFFD)
  • Detect Private Use Area characters
  • Add garble rate to page metadata
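
The three signals in Step 3.2 can be combined into one ratio. This sketch assumes the rate is defined as garbled characters divided by total characters, with each (cid:N) marker counted by its full length so the result stays in [0, 1]. That definition is an assumption; the task list only names the signals.

```python
import re

_CID_PATTERN = re.compile(r"\(cid:\d+\)")


def _is_private_use(ch: str) -> bool:
    """True for code points in the Unicode Private Use Areas."""
    cp = ord(ch)
    return (
        0xE000 <= cp <= 0xF8FF          # BMP Private Use Area
        or 0xF0000 <= cp <= 0xFFFFD     # Supplementary Private Use Area-A
        or 0x100000 <= cp <= 0x10FFFD   # Supplementary Private Use Area-B
    )


def calculate_garble_rate(text: str) -> float:
    """Fraction of characters that look like failed glyph mappings."""
    if not text:
        return 0.0
    # Characters that belong to "(cid:N)" markers.
    cid_chars = sum(len(m) for m in _CID_PATTERN.findall(text))
    # Scan the rest for U+FFFD and Private Use Area characters.
    remainder = _CID_PATTERN.sub("", text)
    bad = sum(1 for ch in remainder if ch == "\ufffd" or _is_private_use(ch))
    return (cid_chars + bad) / len(text)
```

For example, a four-character string with one replacement character and one Private Use Area character yields 0.5.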

Step 3.3: OCR Fallback

  • Implement should_fallback_to_ocr() decision function
  • Add configuration option ocr_fallback_threshold (default 0.1)
  • Add get_pages_needing_ocr() interface for callers
  • Add get_extraction_quality_report() for quality metrics
  • Add logging for fallback decisions
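
The fallback decision in Step 3.3 reduces to a threshold comparison. In this sketch the input is assumed to be a mapping of page number to garble rate, and the report fields are hypothetical. The 0.1 default and the strict "greater than" comparison follow the ">10%" wording of the commit message.

```python
def should_fallback_to_ocr(garble_rate: float, ocr_fallback_threshold: float = 0.1) -> bool:
    """Recommend OCR when the garble rate exceeds the threshold."""
    return garble_rate > ocr_fallback_threshold


def get_pages_needing_ocr(page_garble_rates: dict, ocr_fallback_threshold: float = 0.1) -> list:
    """Return the sorted page numbers whose garble rate exceeds the threshold."""
    return sorted(
        page
        for page, rate in page_garble_rates.items()
        if should_fallback_to_ocr(rate, ocr_fallback_threshold)
    )


def get_extraction_quality_report(page_garble_rates: dict, ocr_fallback_threshold: float = 0.1) -> dict:
    """Summarize extraction quality; the field names are illustrative."""
    total = len(page_garble_rates)
    average = sum(page_garble_rates.values()) / total if total else 0.0
    return {
        "total_pages": total,
        "average_garble_rate": average,
        "pages_needing_ocr": get_pages_needing_ocr(
            page_garble_rates, ocr_fallback_threshold
        ),
    }
```

Callers would read pages_needing_ocr and route only those pages to the OCR track, leaving the rest on the Direct track.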

Phase 4: GS Distillation - Exception Handler (P2)

Step 0: GS Repair (Optional)

  • Implement should_trigger_gs_repair() trigger detection
  • Implement repair_pdf_with_gs() function
  • Add -dDetectDuplicateImages=true option
  • Add temporary file handling for repaired PDF
  • Implement is_ghostscript_available() check
  • Add extract_with_repair() method
  • Add fallback to normal extraction if GS not available
  • Add logging for GS repair actions
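
A sketch of the Ghostscript steps, assuming the binary is found on PATH under one of its usual names. find_ghostscript() and build_gs_command() are hypothetical helpers added for illustration, and returning None on failure is one way to model the "fall back to normal extraction" task.

```python
import os
import shutil
import subprocess
import tempfile
from typing import Optional

# Usual executable names on Unix and Windows.
_GS_CANDIDATES = ("gs", "gswin64c", "gswin32c")


def find_ghostscript() -> Optional[str]:
    """Return the path of the first Ghostscript binary on PATH, if any."""
    for name in _GS_CANDIDATES:
        path = shutil.which(name)
        if path:
            return path
    return None


def is_ghostscript_available() -> bool:
    return find_ghostscript() is not None


def build_gs_command(gs_path: str, src: str, dst: str) -> list:
    """Build the pdfwrite command that re-distills src into dst."""
    return [
        gs_path, "-dNOPAUSE", "-dBATCH", "-dQUIET",
        "-sDEVICE=pdfwrite",
        "-dDetectDuplicateImages=true",
        f"-sOutputFile={dst}",
        src,
    ]


def repair_pdf_with_gs(src: str, timeout: float = 60.0) -> Optional[str]:
    """Return the path of a repaired temporary PDF, or None on failure."""
    gs_path = find_ghostscript()
    if gs_path is None:
        return None  # caller falls back to normal extraction
    fd, dst = tempfile.mkstemp(suffix=".pdf")
    os.close(fd)
    try:
        subprocess.run(
            build_gs_command(gs_path, src, dst),
            check=True, capture_output=True, timeout=timeout,
        )
    except (subprocess.SubprocessError, OSError):
        os.remove(dst)
        return None
    return dst
```

The caller owns the returned temporary file and should delete it once extraction from the repaired PDF has finished.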

Testing

Unit Tests

  • Test white-out detection with synthetic PDF
  • Test garble rate calculation
  • Test element classification accuracy
  • Test page number pattern detection

Integration Tests

  • Test with demo_docs/edit.pdf (3 pages)
  • Test with demo_docs/edit2.pdf (1 page)
  • Test with demo_docs/edit3.pdf (2 pages)
  • Test quality report generation
  • Test GS availability check
  • Test end-to-end pipeline with real documents

Regression Tests

  • Verify existing clean PDFs produce same output
  • Performance benchmark (<100ms overhead per page)
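
The per-page overhead target could be checked with a small timing harness like the one below. The function names are hypothetical, and `baseline` and `with_pipeline` stand for any two callables that extract a single page without and with preprocessing.

```python
import time


def seconds_per_page(extract, pages, repeats: int = 3) -> float:
    """Best-of-N mean wall-clock time per page for one extraction callable."""
    if not pages:
        return 0.0
    best = float("inf")
    for _ in range(repeats):
        start = time.perf_counter()
        for page in pages:
            extract(page)
        best = min(best, time.perf_counter() - start)
    return best / len(pages)


def overhead_ms_per_page(baseline, with_pipeline, pages, repeats: int = 3) -> float:
    """Extra milliseconds per page that the pipeline adds over the baseline."""
    delta = seconds_per_page(with_pipeline, pages, repeats) - seconds_per_page(
        baseline, pages, repeats
    )
    return delta * 1000.0
```

A regression test would then assert that overhead_ms_per_page(...) stays below 100 on the demo documents. Taking the best of several runs reduces noise from other processes.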