egg
6a65c7617d
feat: add PDF preprocessing pipeline for Direct track

Implement multi-stage preprocessing pipeline to improve extraction quality:
Phase 1 - Object-level Cleaning:
- Content stream sanitization via clean_contents(sanitize=True)
- Hidden OCG layer detection
- White-out detection with IoU 80% threshold
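A minimal sketch of the IoU check behind white-out detection, not the code in this commit. It assumes rectangles are `(x0, y0, x1, y1)` tuples and that a text box counts as covered when its IoU with a white-filled rectangle is at least 0.8; the helper names `iou` and `find_whited_out` and the `>=` comparison are hypothetical.

```python
def rect_area(r):
    """Area of an (x0, y0, x1, y1) rectangle; degenerate rectangles count as 0."""
    x0, y0, x1, y1 = r
    return max(0.0, x1 - x0) * max(0.0, y1 - y0)


def iou(a, b):
    """Intersection over union of two axis-aligned rectangles."""
    inter = rect_area((max(a[0], b[0]), max(a[1], b[1]),
                       min(a[2], b[2]), min(a[3], b[3])))
    union = rect_area(a) + rect_area(b) - inter
    return inter / union if union > 0 else 0.0


def find_whited_out(text_boxes, white_rects, threshold=0.8):
    """Return the text boxes that some white rectangle covers at IoU >= threshold.

    The 0.8 default mirrors the 80% figure above; whether the real check
    uses >= or > is an assumption.
    """
    return [t for t in text_boxes
            if any(iou(t, w) >= threshold for w in white_rects)]
```

Text boxes returned by `find_whited_out` would be candidates for exclusion from the extracted output.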
Phase 2 - Layout Analysis:
- Column-aware sorting (sort=True)
- Page number pattern detection and filtering
- Position-based element classification
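One way the page-number filter and position-based classification could fit together is sketched below. The regex set, the 10% margin ratio, the `(text, y0, y1)` element shape, and all function names are illustrative assumptions, not taken from the pipeline.

```python
import re

# Hypothetical patterns for common page-number formats.
PAGE_NUMBER_PATTERNS = [
    re.compile(r"^\d{1,4}$"),                                  # "12"
    re.compile(r"^[-\u2013]\s*\d{1,4}\s*[-\u2013]$"),          # "- 12 -"
    re.compile(r"^page\s+\d{1,4}(\s+of\s+\d{1,4})?$", re.I),   # "Page 3 of 10"
    re.compile(r"^\d{1,4}\s*/\s*\d{1,4}$"),                    # "3 / 10"
]


def looks_like_page_number(text):
    """True if the stripped text matches one of the page-number patterns."""
    stripped = text.strip()
    return any(p.match(stripped) for p in PAGE_NUMBER_PATTERNS)


def classify_region(y0, y1, page_height, margin_ratio=0.1):
    """Classify an element as header, footer, or body by vertical position."""
    if y1 <= page_height * margin_ratio:
        return "header"
    if y0 >= page_height * (1 - margin_ratio):
        return "footer"
    return "body"


def filter_page_numbers(elements, page_height):
    """Drop elements that look like page numbers AND sit in a margin band.

    elements: list of (text, y0, y1) tuples (assumed shape).
    """
    return [
        e for e in elements
        if not (looks_like_page_number(e[0])
                and classify_region(e[1], e[2], page_height) != "body")
    ]
```

Requiring both a pattern match and a margin position keeps bare numbers in the body text (a year, a table cell) from being filtered.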
Phase 3 - Enhanced Extraction:
- Garble rate detection (cid:xxxx, U+FFFD, PUA characters)
- OCR fallback recommendation when garble rate exceeds 10%
- Quality report generation interface
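The garble-rate check could look roughly like the sketch below. It assumes unmapped glyphs appear in the extracted text as `(cid:N)` tokens, counts each token as one garbled character, and ignores whitespace; `garble_rate`, `recommend_ocr`, and the exact counting rule are hypothetical.

```python
import re

# Unmapped glyphs are assumed to surface as "(cid:123)" or "cid:123" tokens.
CID_PATTERN = re.compile(r"\(?cid:\d+\)?")


def is_private_use(ch):
    """True for code points in the BMP or supplementary Private Use Areas."""
    cp = ord(ch)
    return (0xE000 <= cp <= 0xF8FF
            or 0xF0000 <= cp <= 0xFFFFD
            or 0x100000 <= cp <= 0x10FFFD)


def garble_rate(text):
    """Fraction of non-whitespace characters that look garbled (0.0 to 1.0)."""
    cid_count = len(CID_PATTERN.findall(text))
    remainder = CID_PATTERN.sub("", text)
    visible = [ch for ch in remainder if not ch.isspace()]
    bad = sum(1 for ch in visible if ch == "\ufffd" or is_private_use(ch))
    total = len(visible) + cid_count
    return (bad + cid_count) / total if total else 0.0


def recommend_ocr(text, threshold=0.10):
    """Recommend OCR fallback when the garble rate exceeds the threshold."""
    return garble_rate(text) > threshold
```

With the strict `>` comparison, text that is exactly 10% garbled does not trigger the recommendation, which matches the ">10%" wording above.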
Phase 4 - GS Distillation (Exception Handler):
- Ghostscript PDF repair for severely damaged files
- Auto-triggered on high garble or mupdf errors
- Graceful fallback when GS unavailable
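A hedged sketch of the Ghostscript fallback path: rewrite the file through the `pdfwrite` device, and return `None` instead of raising when no Ghostscript binary is found or the run fails. The candidate binary names, the timeout, and the function names are assumptions; the options actually passed by the pipeline are not shown in this commit message.

```python
import shutil
import subprocess


def find_ghostscript(candidates=("gs", "gswin64c", "gswin32c")):
    """Return the path of the first Ghostscript binary on PATH, or None."""
    for name in candidates:
        path = shutil.which(name)
        if path:
            return path
    return None


def build_gs_command(gs_path, src, dst):
    """Build a command that rewrites src into dst via the pdfwrite device."""
    return [
        gs_path, "-dSAFER", "-dBATCH", "-dNOPAUSE", "-dQUIET",
        "-sDEVICE=pdfwrite", f"-sOutputFile={dst}", src,
    ]


def distill(src, dst, candidates=("gs", "gswin64c", "gswin32c"), timeout=120):
    """Try to repair src with Ghostscript; return dst on success, else None."""
    gs_path = find_ghostscript(candidates)
    if gs_path is None:
        return None  # graceful fallback: caller keeps using the original file
    try:
        subprocess.run(build_gs_command(gs_path, src, dst),
                       check=True, timeout=timeout, capture_output=True)
    except (subprocess.SubprocessError, OSError):
        return None
    return dst
```

A caller would invoke `distill` only after a high garble rate or a MuPDF error, and continue with the original PDF whenever it returns `None`.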
🤖 Generated with [Claude Code](https://claude.com/claude-code)
Co-Authored-By: Claude <noreply@anthropic.com>
2025-12-03 16:11:00 +08:00