feat: add PDF preprocessing pipeline for Direct track

Implement multi-stage preprocessing pipeline to improve extraction quality:

Phase 1 - Object-level Cleaning:
- Content stream sanitization via clean_contents(sanitize=True)
- Hidden OCG layer detection
- White-out detection with IoU 80% threshold

Phase 2 - Layout Analysis:
- Column-aware sorting (sort=True)
- Page number pattern detection and filtering
- Position-based element classification

Phase 3 - Enhanced Extraction:
- Garble rate detection (cid:xxxx, U+FFFD, PUA characters)
- OCR fallback recommendation when garble >10%
- Quality report generation interface

Phase 4 - GS Distillation (Exception Handler):
- Ghostscript PDF repair for severely damaged files
- Auto-triggered on high garble or mupdf errors
- Graceful fallback when GS unavailable

🤖 Generated with [Claude Code](https://claude.com/claude-code)

Co-Authored-By: Claude <noreply@anthropic.com>
2025-12-03 16:11:00 +08:00
parent 1b5c7f39a8
commit 6a65c7617d
4 changed files with 1236 additions and 9 deletions
--- a/openspec/changes/pdf-preprocessing-pipeline/proposal.md
+++ b/openspec/changes/pdf-preprocessing-pipeline/proposal.md
@@ -0,0 +1,44 @@
+# Change Proposal: PDF Preprocessing Pipeline
+
+## Summary
+
+Implement a multi-stage PDF preprocessing pipeline for Direct track extraction to improve layout accuracy, remove hidden/covered content, and ensure correct reading order.
+
+## Problem Statement
+
+Current Direct track extraction has several issues:
+1. **Hidden content pollution**: OCG (Optional Content Groups) layers and "white-out" covered text leak into extraction
+2. **Reading order chaos**: Two-column layouts get interleaved incorrectly
+3. **Vector graphics interference**: Large decorative vector elements cover text content
+4. **Corrupted PDF handling**: No fallback for structurally damaged PDFs with `(cid:xxxx)` garbled text
+
+## Proposed Solution
+
+Implement a 4-stage preprocessing pipeline:
+
+```
+Step 0: GS Distillation (Exception Handler - triggered on errors)
+Step 1: Object-level Cleaning (P0 - Core)
+Step 2: Layout Analysis (P1 - Rule-based with sort=True)
+Step 3: Text Extraction (Existing, enhanced with garble detection)
+```
+
+## Key Features
+
+1. **Smart Fallback**: GS distillation only triggers on `(cid:xxxx)` garble or mupdf structural errors
+2. **White-out Detection**: IoU-based overlap detection (80% threshold) to remove covered text
+3. **Column-aware Sorting**: Leverage PyMuPDF's `sort=True` for automatic two-column handling
+4. **Garble Rate Detection**: Auto-switch to Paddle OCR when the garble rate exceeds the threshold (10%)
+
+## Impact
+
+- **Files Modified**: `backend/app/services/direct_extraction_engine.py`
+- **New Dependencies**: None required (Ghostscript is optional; the pipeline falls back gracefully when it is unavailable)
+- **Risk Level**: Medium (core extraction logic changes)
+
+## Success Criteria
+
+- [ ] Hidden OCG content no longer appears in extraction
+- [ ] White-out covered text is correctly filtered
+- [ ] Two-column documents maintain correct reading order
+- [ ] Corrupted PDFs fall back gracefully to GS repair or OCR
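
The two numeric rules in the proposal (white-out detection at an 80% IoU threshold and an OCR fallback when the garble rate exceeds 10%) can be sketched in plain Python as follows. This is a minimal illustration, not the code in `direct_extraction_engine.py`: the function names, the `(x0, y0, x1, y1)` box layout, the `\(cid:\d+\)` pattern, and the choice to measure garble as a fraction of characters are all assumptions made for the example.

```python
import re
from typing import Tuple

# Hypothetical box layout: (x0, y0, x1, y1) in page coordinates.
Box = Tuple[float, float, float, float]

IOU_THRESHOLD = 0.80      # from the proposal: "IoU 80% threshold"
GARBLE_THRESHOLD = 0.10   # from the commit message: "garble >10%"

# Assumed textual form of unmapped glyphs, e.g. "(cid:1234)".
CID_PATTERN = re.compile(r"\(cid:\d+\)")


def iou(a: Box, b: Box) -> float:
    """Intersection over union of two axis-aligned boxes."""
    inter_w = max(0.0, min(a[2], b[2]) - max(a[0], b[0]))
    inter_h = max(0.0, min(a[3], b[3]) - max(a[1], b[1]))
    inter = inter_w * inter_h
    area_a = (a[2] - a[0]) * (a[3] - a[1])
    area_b = (b[2] - b[0]) * (b[3] - b[1])
    union = area_a + area_b - inter
    return inter / union if union > 0 else 0.0


def is_whited_out(text_box: Box, cover_box: Box,
                  threshold: float = IOU_THRESHOLD) -> bool:
    """Treat text as covered when its IoU with a white shape meets the threshold."""
    return iou(text_box, cover_box) >= threshold


def is_private_use(ch: str) -> bool:
    """True for code points in the Unicode Private Use Areas."""
    cp = ord(ch)
    return (0xE000 <= cp <= 0xF8FF
            or 0xF0000 <= cp <= 0xFFFFD
            or 0x100000 <= cp <= 0x10FFFD)


def garble_rate(text: str) -> float:
    """Fraction of characters that are cid tokens, U+FFFD, or PUA characters."""
    if not text:
        return 0.0
    cid_chars = sum(len(m) for m in CID_PATTERN.findall(text))
    remainder = CID_PATTERN.sub("", text)
    bad_chars = sum(1 for ch in remainder
                    if ch == "\ufffd" or is_private_use(ch))
    return (cid_chars + bad_chars) / len(text)


def should_recommend_ocr(text: str,
                         threshold: float = GARBLE_THRESHOLD) -> bool:
    """Recommend the OCR fallback when the garble rate exceeds the threshold."""
    return garble_rate(text) > threshold
```

One property of IoU worth keeping in mind: when a large white rectangle covers a much smaller text box, the union is dominated by the rectangle, so the IoU stays low even though the text is fully hidden. Dividing the intersection by the text box's own area is a common alternative for that case; whether the pipeline needs it depends on the shapes found in real documents.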