OCR/tasks.md at 63b474f93a0649ba9c1c739597d15d1297aaa31c

egg 6a65c7617d feat: add PDF preprocessing pipeline for Direct track

Implement multi-stage preprocessing pipeline to improve extraction quality:

Phase 1 - Object-level Cleaning:
- Content stream sanitization via clean_contents(sanitize=True)
- Hidden OCG layer detection
- White-out detection with IoU 80% threshold
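
The white-out check above can be sketched as follows. This is a minimal illustration, not the commit's actual code: the commit message does not say which boxes are compared or how they are represented, so the `(x0, y0, x1, y1)` tuples and the names `iou` and `find_whited_out` are assumptions. It flags a text box when its intersection-over-union with any white rectangle reaches the 80% threshold.

```python
from typing import List, Tuple

# Hypothetical box representation: (x0, y0, x1, y1) in page coordinates.
Rect = Tuple[float, float, float, float]


def _area(r: Rect) -> float:
    """Area of a rectangle; degenerate boxes count as zero."""
    return max(0.0, r[2] - r[0]) * max(0.0, r[3] - r[1])


def iou(a: Rect, b: Rect) -> float:
    """Intersection-over-union of two axis-aligned rectangles."""
    inter: Rect = (max(a[0], b[0]), max(a[1], b[1]),
                   min(a[2], b[2]), min(a[3], b[3]))
    inter_area = _area(inter)
    union = _area(a) + _area(b) - inter_area
    return inter_area / union if union > 0 else 0.0


def find_whited_out(text_boxes: List[Rect],
                    white_rects: List[Rect],
                    threshold: float = 0.8) -> List[int]:
    """Return indices of text boxes covered by some white rectangle."""
    return [
        i for i, box in enumerate(text_boxes)
        if any(iou(box, w) >= threshold for w in white_rects)
    ]
```

IoU is strict about size: a large white rectangle that hides a small word scores low, so a real implementation might instead measure how much of the text box is covered.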

Phase 2 - Layout Analysis:
- Column-aware sorting (sort=True)
- Page number pattern detection and filtering
- Position-based element classification
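
A rough sketch of how page number detection and position-based classification might fit together. Everything here is an assumption for illustration: the regular expression, the 10% margin, the label names, and the `classify` helper are not taken from the commit. Coordinates are assumed to have y growing downward from the top of the page.

```python
import re
from typing import Tuple

# Hypothetical pattern: "12", "Page 3", "- 4 -", "3 / 10", "Page 3 of 10".
PAGE_NUMBER_RE = re.compile(
    r"^\s*(?:page\s+)?[-–]?\s*\d{1,4}\s*[-–]?(?:\s*(?:/|of)\s*\d{1,4})?\s*$",
    re.IGNORECASE,
)


def classify(text: str,
             bbox: Tuple[float, float, float, float],
             page_height: float,
             margin_ratio: float = 0.1) -> str:
    """Label an element as 'header', 'footer', 'page_number' or 'body'."""
    _, y0, _, y1 = bbox
    in_header = y1 <= page_height * margin_ratio
    in_footer = y0 >= page_height * (1.0 - margin_ratio)

    # A bare number only counts as a page number inside a margin zone,
    # so numeric table cells in the body are left alone.
    if (in_header or in_footer) and PAGE_NUMBER_RE.match(text):
        return "page_number"
    if in_header:
        return "header"
    if in_footer:
        return "footer"
    return "body"
```

Elements labelled `page_number` (and optionally `header` and `footer`) could then be dropped before extraction output is assembled.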

Phase 3 - Enhanced Extraction:
- Garble rate detection (cid:xxxx, U+FFFD, PUA characters)
- OCR fallback recommendation when garble >10%
- Quality report generation interface
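
The garble check can be approximated as below. This is a sketch under stated assumptions: the commit message names the three signals (`cid:xxxx` tokens, U+FFFD, PUA characters) and the 10% threshold, but not how the rate is computed. Here each `(cid:N)` token counts as one garbled character, whitespace is ignored, and the function names are hypothetical.

```python
import re

# Unmapped glyphs are often emitted as "(cid:1234)" by PDF text extractors.
CID_RE = re.compile(r"\(cid:\d+\)")


def is_private_use(ch: str) -> bool:
    """True if the character lies in a Unicode Private Use Area."""
    cp = ord(ch)
    return (0xE000 <= cp <= 0xF8FF
            or 0xF0000 <= cp <= 0xFFFFD
            or 0x100000 <= cp <= 0x10FFFD)


def garble_rate(text: str) -> float:
    """Fraction of non-whitespace characters that look garbled."""
    cid_count = len(CID_RE.findall(text))
    remaining = [c for c in CID_RE.sub("", text) if not c.isspace()]
    bad = cid_count + sum(
        1 for c in remaining if c == "\ufffd" or is_private_use(c)
    )
    total = cid_count + len(remaining)
    return bad / total if total else 0.0


def should_recommend_ocr(text: str, threshold: float = 0.10) -> bool:
    """Recommend OCR fallback when the garble rate exceeds the threshold."""
    return garble_rate(text) > threshold
```

For example, `garble_rate("(cid:12)abc")` counts one garbled token out of four characters, which is above the 10% threshold.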

Phase 4 - GS Distillation (Exception Handler):
- Ghostscript PDF repair for severely damaged files
- Auto-triggered on high garble or mupdf errors
- Graceful fallback when GS unavailable
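
One possible shape for the Ghostscript step with graceful fallback. The helper names, the timeout, and the choice to return `None` on failure are assumptions; the commit message does not show the actual invocation. The flags used (`-q`, `-dNOPAUSE`, `-dBATCH`, `-sDEVICE=pdfwrite`, `-sOutputFile=`) are standard Ghostscript options for rewriting a PDF.

```python
import shutil
import subprocess
from typing import List, Optional


def build_gs_command(gs_path: str, src: str, dst: str) -> List[str]:
    """Command line that re-distills src into dst via the pdfwrite device."""
    return [
        gs_path, "-q", "-dNOPAUSE", "-dBATCH",
        "-sDEVICE=pdfwrite", f"-sOutputFile={dst}", src,
    ]


def repair_with_ghostscript(src: str, dst: str,
                            executable: str = "gs",
                            timeout: float = 60.0) -> Optional[str]:
    """Return dst on success, or None if GS is missing or the repair fails."""
    gs_path = shutil.which(executable)
    if gs_path is None:
        # Ghostscript is not installed: skip the repair instead of raising.
        return None
    try:
        subprocess.run(
            build_gs_command(gs_path, src, dst),
            check=True, capture_output=True, timeout=timeout,
        )
    except (subprocess.CalledProcessError,
            subprocess.TimeoutExpired, OSError):
        return None
    return dst
```

A caller would fall back to the original file when `None` is returned. On Windows the executable is typically `gswin64c` rather than `gs`, which is why the name is a parameter.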

🤖 Generated with [Claude Code](https://claude.com/claude-code)

Co-Authored-By: Claude <noreply@anthropic.com>

Tasks: PDF Preprocessing Pipeline

Phase 1: Object-level Cleaning (P0)

Step 1.1: Content Sanitization

Step 1.2: Hidden Layer (OCG) Removal

Step 1.3: White-out Detection

Phase 2: Layout Analysis (P1)

Step 2.1: Column-aware Sorting

Step 2.2: Element Classification

Step 2.3: Element Filtering

Phase 3: Enhanced Extraction (P1)

Step 3.1: Bbox Preservation

Step 3.2: Garble Detection

Step 3.3: OCR Fallback

Phase 4: GS Distillation - Exception Handler (P2)

Step 0: GS Repair (Optional)

Testing

Unit Tests

Integration Tests

Regression Tests
