test
# Change Proposal: PDF Preprocessing Pipeline

## Summary

Implement a multi-stage PDF preprocessing pipeline for Direct track extraction to improve layout accuracy, remove hidden/covered content, and ensure correct reading order.

## Problem Statement

Current Direct track extraction has several issues:

1. **Hidden content pollution**: Optional Content Group (OCG) layers and "white-out" covered text leak into the extracted output
2. **Reading order chaos**: Two-column layouts get interleaved incorrectly
3. **Vector graphics interference**: Large decorative vector elements cover text content
4. **Corrupted PDF handling**: No fallback for structurally damaged PDFs that produce garbled `(cid:xxxx)` text

## Proposed Solution

Implement a 4-stage preprocessing pipeline:

```
Step 0: GS Distillation (Exception Handler - triggered on errors)
Step 1: Object-level Cleaning (P0 - Core)
Step 2: Layout Analysis (P1 - Rule-based with sort=True)
Step 3: Text Extraction (Existing, enhanced with garble detection)
```
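
As a rough illustration of Step 0, the sketch below builds a Ghostscript command that rewrites a PDF through the `pdfwrite` device, which is a common way to repair structural damage. The function names `build_gs_command` and `distill_pdf` are hypothetical and not taken from `direct_extraction_engine.py`; the binary name `gs` is an assumption (on Windows it is typically `gswin64c`).

```python
import shutil
import subprocess
from typing import List


def build_gs_command(src: str, dst: str, gs_binary: str = "gs") -> List[str]:
    """Return the argument list for re-distilling `src` into `dst`.

    `-o` sets the output file and implies batch, non-interactive mode;
    `-sDEVICE=pdfwrite` makes Ghostscript re-emit the document as a fresh PDF.
    """
    return [gs_binary, "-q", "-o", dst, "-sDEVICE=pdfwrite", src]


def distill_pdf(src: str, dst: str, gs_binary: str = "gs") -> bool:
    """Run the distillation if Ghostscript is installed.

    Returns False when the binary is missing so the caller can skip Step 0
    (Ghostscript is an optional dependency in this proposal).
    """
    if shutil.which(gs_binary) is None:
        return False
    # check=True raises CalledProcessError if Ghostscript itself fails.
    subprocess.run(build_gs_command(src, dst, gs_binary), check=True)
    return True
```

Keeping the command construction separate from the subprocess call makes it possible to unit-test the arguments without Ghostscript being installed.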

## Key Features

1. **Smart Fallback**: GS distillation only triggers on `(cid:xxxx)` garble or MuPDF structural errors
2. **White-out Detection**: IoU-based overlap detection (80% threshold) to remove covered text
3. **Column-aware Sorting**: Leverage PyMuPDF's `sort=True` for automatic two-column handling
4. **Garble Rate Detection**: Auto-switch to PaddleOCR when the garble rate exceeds a threshold
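
The white-out check in item 2 could look roughly like the following. This is a minimal sketch that assumes text spans and covering rectangles are both available as `(x0, y0, x1, y1)` tuples; the helper names (`iou`, `is_whited_out`, `filter_covered`) are illustrative and not part of the existing engine.

```python
from typing import Iterable, List, Tuple

Box = Tuple[float, float, float, float]  # (x0, y0, x1, y1)


def area(box: Box) -> float:
    """Area of a box; degenerate or inverted boxes count as zero."""
    x0, y0, x1, y1 = box
    return max(0.0, x1 - x0) * max(0.0, y1 - y0)


def iou(a: Box, b: Box) -> float:
    """Intersection over union of two axis-aligned boxes."""
    inter_box = (max(a[0], b[0]), max(a[1], b[1]), min(a[2], b[2]), min(a[3], b[3]))
    inter = area(inter_box)
    union = area(a) + area(b) - inter
    return inter / union if union > 0 else 0.0


def is_whited_out(text_box: Box, cover_boxes: Iterable[Box], threshold: float = 0.8) -> bool:
    """True if any covering rectangle overlaps the text box at or above the threshold."""
    return any(iou(text_box, cover) >= threshold for cover in cover_boxes)


def filter_covered(text_boxes: Iterable[Box], cover_boxes: List[Box], threshold: float = 0.8) -> List[Box]:
    """Keep only the text boxes that are not considered covered."""
    return [t for t in text_boxes if not is_whited_out(t, cover_boxes, threshold)]
```

One thing to keep in mind: IoU drops when a covering rectangle is much larger than the text it hides, so comparing it against an intersection-over-text-area ratio may be worthwhile during tuning. That comparison is a suggestion here, not something the proposal specifies.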

## Impact

- **Files Modified**: `backend/app/services/direct_extraction_engine.py`
- **New Dependencies**: None (Ghostscript optional, already available on most systems)
- **Risk Level**: Medium (core extraction logic changes)

## Success Criteria

- [ ] Hidden OCG content no longer appears in extraction
- [ ] White-out covered text is correctly filtered
- [ ] Two-column documents maintain correct reading order
- [ ] Corrupted PDFs fall back gracefully to GS repair or OCR
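
The last criterion depends on measuring how garbled the extracted text is. A minimal sketch of one way to do that is shown below: it counts `(cid:N)` placeholders relative to all non-whitespace units. The function names and the default threshold of `0.1` are assumptions made for illustration; the proposal does not fix a threshold value.

```python
import re

# Matches placeholders such as "(cid:1234)" emitted for unmapped glyphs.
CID_PATTERN = re.compile(r"\(cid:\d+\)")


def garble_rate(text: str) -> float:
    """Fraction of units that are (cid:N) placeholders.

    Each placeholder counts as one unit, as does every remaining
    non-whitespace character. Empty input yields 0.0.
    """
    cid_count = len(CID_PATTERN.findall(text))
    remaining = CID_PATTERN.sub("", text)
    normal_count = sum(1 for ch in remaining if not ch.isspace())
    total = cid_count + normal_count
    return cid_count / total if total else 0.0


def needs_fallback(text: str, threshold: float = 0.1) -> bool:
    """True when the garble rate exceeds the (hypothetical) threshold."""
    return garble_rate(text) > threshold
```

For example, `needs_fallback("(cid:1)(cid:2) a")` is true because two of the three units are placeholders, while ordinary prose with no placeholders stays on the Direct track.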