Commit Graph

1 Commit

Author SHA1 Message Date
egg
6a65c7617d feat: add PDF preprocessing pipeline for Direct track
Implement multi-stage preprocessing pipeline to improve extraction quality:

Phase 1 - Object-level Cleaning:
- Content stream sanitization via clean_contents(sanitize=True)
- Hidden OCG layer detection
- White-out detection with an 80% IoU threshold
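
The white-out check can be sketched as follows. This is a minimal illustration, not the code from this commit: it assumes boxes are `(x0, y0, x1, y1)` tuples and that white-filled rectangles have already been collected from the page drawings. The names `iou` and `is_whited_out` are hypothetical.

```python
from typing import Iterable, Tuple

Rect = Tuple[float, float, float, float]  # (x0, y0, x1, y1), assumed layout


def area(r: Rect) -> float:
    """Area of a rectangle; degenerate or inverted boxes count as 0."""
    return max(0.0, r[2] - r[0]) * max(0.0, r[3] - r[1])


def iou(a: Rect, b: Rect) -> float:
    """Intersection over union of two axis-aligned rectangles."""
    inter = area((max(a[0], b[0]), max(a[1], b[1]),
                  min(a[2], b[2]), min(a[3], b[3])))
    union = area(a) + area(b) - inter
    return inter / union if union > 0 else 0.0


def is_whited_out(text_bbox: Rect, white_rects: Iterable[Rect],
                  threshold: float = 0.8) -> bool:
    """True if any white rectangle overlaps the text box with IoU >= threshold."""
    return any(iou(text_bbox, w) >= threshold for w in white_rects)
```

Note that IoU also drops when the covering rectangle is much larger than the text box, so a containment ratio may be worth considering for oversized covers.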

Phase 2 - Layout Analysis:
- Column-aware sorting (sort=True)
- Page number pattern detection and filtering
- Position-based element classification
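
Page number filtering combined with position-based classification might look like the sketch below. The patterns, the 10% margin, and the function names are assumptions for illustration; the commit's actual rules are not shown here. Coordinates are assumed to have the origin at the top-left with y growing downward.

```python
import re

# Hypothetical patterns: "12", "- 12 -", "Page 3 of 10", "3 / 10"
PAGE_NUMBER_PATTERNS = [
    re.compile(r"^\d{1,4}$"),
    re.compile(r"^[-\u2013\u2014]\s*\d{1,4}\s*[-\u2013\u2014]$"),
    re.compile(r"^page\s+\d{1,4}(\s+of\s+\d{1,4})?$", re.IGNORECASE),
    re.compile(r"^\d{1,4}\s*/\s*\d{1,4}$"),
]


def looks_like_page_number(text: str) -> bool:
    """True if the stripped text matches one of the page number patterns."""
    s = text.strip()
    return any(p.match(s) for p in PAGE_NUMBER_PATTERNS)


def classify_element(text: str, y0: float, y1: float, page_height: float,
                     margin_ratio: float = 0.1) -> str:
    """Classify a text element as 'page_number', 'header', 'footer' or 'body'."""
    in_header = y1 <= page_height * margin_ratio
    in_footer = y0 >= page_height * (1 - margin_ratio)
    if (in_header or in_footer) and looks_like_page_number(text):
        return "page_number"
    if in_header:
        return "header"
    if in_footer:
        return "footer"
    return "body"
```

A bare number in the body of the page is kept as body text, since only elements inside the margin zones are candidates for filtering.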

Phase 3 - Enhanced Extraction:
- Garble rate detection (cid:xxxx, U+FFFD, PUA characters)
- OCR fallback recommendation when the garble rate exceeds 10%
- Quality report generation interface
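
One way to compute a garble rate over extracted text is sketched below. It assumes each `(cid:N)` token counts as one garbled character and that whitespace is excluded from the total; the commit may define the ratio differently. `garble_rate` and `needs_ocr` are hypothetical names.

```python
import re

CID_RE = re.compile(r"\(cid:\d+\)")  # unmapped glyph markers such as "(cid:1234)"


def is_pua(ch: str) -> bool:
    """True if the character lies in a Unicode Private Use Area."""
    cp = ord(ch)
    return (0xE000 <= cp <= 0xF8FF
            or 0xF0000 <= cp <= 0xFFFFD
            or 0x100000 <= cp <= 0x10FFFD)


def garble_rate(text: str) -> float:
    """Fraction of non-whitespace characters that look garbled."""
    cid_count = len(CID_RE.findall(text))
    rest = CID_RE.sub("", text)
    chars = [c for c in rest if not c.isspace()]
    bad = cid_count + sum(1 for c in chars if c == "\ufffd" or is_pua(c))
    total = cid_count + len(chars)
    return bad / total if total else 0.0


def needs_ocr(text: str, threshold: float = 0.10) -> bool:
    """Recommend OCR fallback when the garble rate is above the threshold."""
    return garble_rate(text) > threshold
```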

Phase 4 - GS Distillation (Exception Handler):
- Ghostscript PDF repair for severely damaged files
- Auto-triggered on high garble rate or MuPDF errors
- Graceful fallback when GS unavailable
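
The repair step with graceful fallback could be approximated as below. This is a sketch under assumptions: it re-distills the file through Ghostscript's `pdfwrite` device, looks for the executable names `gs`, `gswin64c` and `gswin32c`, and returns `None` instead of raising when Ghostscript is missing or fails. The function names and the 120-second timeout are hypothetical.

```python
import shutil
import subprocess
from pathlib import Path
from typing import List, Optional


def find_ghostscript() -> Optional[str]:
    """Return the path of a Ghostscript executable, or None if not installed."""
    for name in ("gs", "gswin64c", "gswin32c"):
        path = shutil.which(name)
        if path:
            return path
    return None


def build_gs_command(gs: str, src: Path, dst: Path) -> List[str]:
    """Command line that rewrites src into dst via the pdfwrite device."""
    return [gs, "-q", "-dNOPAUSE", "-dBATCH", "-dSAFER",
            "-sDEVICE=pdfwrite", f"-sOutputFile={dst}", str(src)]


def distill_pdf(src: Path, dst: Path, timeout: float = 120) -> Optional[Path]:
    """Try to repair src with Ghostscript; return dst on success, else None."""
    gs = find_ghostscript()
    if gs is None:
        return None  # graceful fallback: caller keeps using the original file
    try:
        subprocess.run(build_gs_command(gs, src, dst), check=True,
                       capture_output=True, timeout=timeout)
    except (subprocess.CalledProcessError, subprocess.TimeoutExpired, OSError):
        return None
    return dst if dst.exists() else None
```

A caller would typically invoke `distill_pdf` only after extraction reports a high garble rate or a parser error, and continue with the original file when it returns `None`.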

🤖 Generated with [Claude Code](https://claude.com/claude-code)

Co-Authored-By: Claude <noreply@anthropic.com>
2025-12-03 16:11:00 +08:00