Change Proposal: PDF Preprocessing Pipeline
Summary
Implement a multi-stage PDF preprocessing pipeline for Direct track extraction to improve layout accuracy, remove hidden/covered content, and ensure correct reading order.
Problem Statement
Current Direct track extraction has several issues:
- Hidden content pollution: OCG (Optional Content Groups) layers and "white-out" covered text leak into extraction
- Reading order chaos: Two-column layouts get interleaved incorrectly
- Vector graphics interference: Large decorative vector elements cover text content
- Corrupted PDF handling: No fallback for structurally damaged PDFs with (cid:xxxx) garbled text
Proposed Solution
Implement a 4-stage preprocessing pipeline:
Step 0: GS Distillation (Exception Handler - triggered on errors)
Step 1: Object-level Cleaning (P0 - Core)
Step 2: Layout Analysis (P1 - Rule-based with sort=True)
Step 3: Text Extraction (Existing, enhanced with garble detection)
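As a rough illustration of the garble detection mentioned in Step 3, the sketch below estimates how much of an extracted string consists of (cid:N) placeholders. The function names and the 0.1 threshold are assumptions made for this example; the proposal does not specify either.

```python
import re

# Matches placeholders of the form "(cid:1234)" that appear in garbled extraction output.
CID_PATTERN = re.compile(r"\(cid:\d+\)")

# Hypothetical cut-off; the proposal does not state the actual threshold.
GARBLE_THRESHOLD = 0.1


def garble_rate(text: str) -> float:
    """Return the fraction of non-whitespace characters that sit inside (cid:N) placeholders."""
    visible = re.sub(r"\s+", "", text)
    if not visible:
        return 0.0
    garbled_chars = sum(len(m.group(0)) for m in CID_PATTERN.finditer(visible))
    return garbled_chars / len(visible)


def should_fall_back(text: str, threshold: float = GARBLE_THRESHOLD) -> bool:
    """Decide whether a page is garbled enough to leave the Direct track."""
    return garble_rate(text) > threshold
```

A page made up entirely of placeholders scores 1.0 and clean text scores 0.0, so a single threshold can drive the switch to the fallback path.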
Key Features
- Smart Fallback: GS distillation only triggers on (cid:xxxx) garble or mupdf structural errors
- White-out Detection: IoU-based overlap detection (80% threshold) to remove covered text
- Column-aware Sorting: Leverage PyMuPDF's sort=True for automatic two-column handling
- Garble Rate Detection: Auto-switch to Paddle OCR when garble rate exceeds threshold
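The IoU-based white-out check can be sketched in plain Python as follows. The Rect type, the function names, and the choice to treat a score equal to the threshold as covered are assumptions for illustration; only the IoU measure and the 80% threshold come from the proposal.

```python
from typing import List, NamedTuple


class Rect(NamedTuple):
    """Axis-aligned box in page coordinates (hypothetical helper type)."""
    x0: float
    y0: float
    x1: float
    y1: float


def area(r: Rect) -> float:
    # Degenerate or inverted boxes count as zero area.
    return max(0.0, r.x1 - r.x0) * max(0.0, r.y1 - r.y0)


def iou(a: Rect, b: Rect) -> float:
    """Intersection over union of two boxes, in the range [0, 1]."""
    inter = area(Rect(max(a.x0, b.x0), max(a.y0, b.y0),
                      min(a.x1, b.x1), min(a.y1, b.y1)))
    union = area(a) + area(b) - inter
    return inter / union if union > 0 else 0.0


def filter_covered(text_boxes: List[Rect], cover_boxes: List[Rect],
                   threshold: float = 0.8) -> List[Rect]:
    """Keep only the text boxes that no cover box overlaps at or above the threshold."""
    return [t for t in text_boxes
            if all(iou(t, c) < threshold for c in cover_boxes)]
```

Because IoU divides by the union, a cover rectangle much larger than the text box scores low even when it hides the text completely, so this sketch assumes covers and text boxes are compared at a similar granularity.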
Impact
- Files Modified: backend/app/services/direct_extraction_engine.py
- New Dependencies: None (Ghostscript optional, already available on most systems)
- Risk Level: Medium (core extraction logic changes)
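Since Ghostscript is listed as optional, the distillation step presumably has to cope with the binary being absent. A minimal sketch, assuming the gs executable name (Windows builds typically ship it as gswin64c) and a hypothetical 60-second timeout; the function names are illustrative and not taken from the existing engine:

```python
import shutil
import subprocess
from pathlib import Path
from typing import List


def build_gs_command(src: Path, dst: Path, gs_binary: str = "gs") -> List[str]:
    """Build a Ghostscript call that rewrites src through the pdfwrite device."""
    return [
        gs_binary,
        "-dBATCH",            # exit after processing instead of entering the prompt
        "-dNOPAUSE",          # do not pause between pages
        "-dQUIET",            # suppress routine startup messages
        "-sDEVICE=pdfwrite",  # re-emit the document as a fresh PDF
        f"-sOutputFile={dst}",
        str(src),
    ]


def distill(src: Path, dst: Path, timeout: float = 60.0) -> bool:
    """Try to repair src into dst; return False if gs is missing, fails, or times out."""
    gs = shutil.which("gs")
    if gs is None:
        return False  # Ghostscript not installed: caller moves on to the next fallback
    try:
        result = subprocess.run(build_gs_command(src, dst, gs),
                                capture_output=True, timeout=timeout)
    except subprocess.TimeoutExpired:
        return False
    return result.returncode == 0 and dst.exists()
```

Returning a boolean rather than raising keeps the caller's fallback chain simple: a False result can route the document straight to OCR.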
Success Criteria
- Hidden OCG content no longer appears in extraction
- White-out covered text is correctly filtered
- Two-column documents maintain correct reading order
- Corrupted PDFs fall back gracefully to GS repair or OCR