Change Proposal: PDF Preprocessing Pipeline

Summary

Implement a multi-stage PDF preprocessing pipeline for Direct track extraction to improve layout accuracy, remove hidden/covered content, and ensure correct reading order.

Problem Statement

Current Direct track extraction has several issues:

Hidden content pollution: OCG (Optional Content Groups) layers and "white-out" covered text leak into extraction
Reading order chaos: Two-column layouts get interleaved incorrectly
Vector graphics interference: Large decorative vector elements cover text content
Corrupted PDF handling: No fallback exists for structurally damaged PDFs whose text extracts as (cid:xxxx) placeholders

Proposed Solution

Implement a 4-stage preprocessing pipeline:

Step 0: GS Distillation (Exception Handler - triggered on errors)
Step 1: Object-level Cleaning (P0 - Core)
Step 2: Layout Analysis (P1 - Rule-based with sort=True)
Step 3: Text Extraction (Existing, enhanced with garble detection)
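
The control flow implied by these steps can be sketched as follows. This is a minimal illustration, not the actual engine code. The function names (`garble_rate`, `run_pipeline`), the injected callables (`extract`, `distill`, `ocr`), and the 0.05 threshold are all hypothetical. It also assumes structural errors surface as `RuntimeError` and that OCR is tried only after distillation has failed to help.

```python
import re
from typing import Callable, Optional

# Matches the (cid:1234) placeholders emitted when a glyph has no Unicode mapping.
CID_PATTERN = re.compile(r"\(cid:\d+\)")


def garble_rate(text: str) -> float:
    """Fraction of characters in `text` that belong to (cid:NNNN) placeholders."""
    if not text:
        return 0.0
    garbled = sum(len(m.group(0)) for m in CID_PATTERN.finditer(text))
    return garbled / len(text)


def run_pipeline(
    pdf_path: str,
    extract: Callable[[str], str],   # hypothetical: runs Steps 1-3 on a PDF path
    distill: Callable[[str], str],   # hypothetical: Step 0, returns a repaired PDF path
    ocr: Callable[[str], str],       # hypothetical: OCR fallback
    garble_threshold: float = 0.05,  # assumed value; the proposal does not specify one
) -> str:
    def try_extract(path: str) -> Optional[str]:
        try:
            return extract(path)
        except RuntimeError:         # assumption: structural errors raise RuntimeError
            return None

    def usable(text: Optional[str]) -> bool:
        return text is not None and garble_rate(text) <= garble_threshold

    text = try_extract(pdf_path)
    if not usable(text):
        # Step 0 runs only as an exception handler, never on the happy path.
        text = try_extract(distill(pdf_path))
    if not usable(text):
        text = ocr(pdf_path)
    return text
```

Passing the stages in as callables keeps the fallback logic testable without real PDFs, Ghostscript, or an OCR engine.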

Key Features

Smart Fallback: GS distillation triggers only on (cid:xxxx) garble or MuPDF structural errors
White-out Detection: IoU-based overlap detection (80% threshold) to remove covered text
Column-aware Sorting: Leverage PyMuPDF's sort=True for automatic two-column handling
Garble Rate Detection: Auto-switch to PaddleOCR when the garble rate exceeds a threshold
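
As an illustration of the white-out check, the sketch below computes IoU between a text box and a covering rectangle and drops text whose IoU meets the 80% threshold. The `Rect` type and the function names are hypothetical. The sketch also assumes that the covering rectangles (for example, filled white shapes drawn after the text) have already been collected by the object-level cleaning step.

```python
from typing import List, NamedTuple


class Rect(NamedTuple):
    """Axis-aligned box in page coordinates (hypothetical helper type)."""
    x0: float
    y0: float
    x1: float
    y1: float


def area(r: Rect) -> float:
    # Clamp to zero so inverted (non-overlapping) boxes have no area.
    return max(0.0, r.x1 - r.x0) * max(0.0, r.y1 - r.y0)


def iou(a: Rect, b: Rect) -> float:
    """Intersection over union of two boxes; 0.0 when they do not overlap."""
    inter = Rect(max(a.x0, b.x0), max(a.y0, b.y0),
                 min(a.x1, b.x1), min(a.y1, b.y1))
    inter_area = area(inter)
    union = area(a) + area(b) - inter_area
    return inter_area / union if union > 0 else 0.0


def filter_covered(text_boxes: List[Rect], cover_boxes: List[Rect],
                   threshold: float = 0.8) -> List[Rect]:
    """Keep only text boxes that no cover box overlaps at or above `threshold`."""
    return [t for t in text_boxes
            if not any(iou(t, c) >= threshold for c in cover_boxes)]
```

Note that IoU falls as the covering rectangle grows larger than the text box, so an 80% threshold mainly catches covers sized closely to the text they hide.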

Impact

Files Modified: backend/app/services/direct_extraction_engine.py
New Dependencies: None required (Ghostscript is an optional external tool, used only when it is installed)
Risk Level: Medium (core extraction logic changes)
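
Because Ghostscript is optional, the engine would need to detect it at runtime and skip distillation when it is missing. A minimal sketch of that check follows, assuming the common executable names `gs`, `gswin64c`, and `gswin32c`. The helper names are hypothetical, and the exact flags the implementation will pass are not specified in this proposal.

```python
import shutil
from typing import List, Optional


def find_ghostscript() -> Optional[str]:
    """Return the path of a Ghostscript executable on PATH, or None if absent."""
    for name in ("gs", "gswin64c", "gswin32c"):
        path = shutil.which(name)
        if path:
            return path
    return None


def build_distill_command(gs_path: str, src: str, dst: str) -> List[str]:
    """Build a command that rewrites `src` through Ghostscript's pdfwrite device."""
    return [
        gs_path,
        "-dNOPAUSE",           # do not prompt between pages
        "-dBATCH",             # exit after processing the input
        "-dQUIET",             # suppress startup messages
        "-sDEVICE=pdfwrite",   # re-emit the document as a new PDF
        f"-sOutputFile={dst}",
        src,
    ]
```

The resulting list can be passed to `subprocess.run(cmd, check=True)`. When `find_ghostscript()` returns `None`, the pipeline can skip Step 0 and go straight to the OCR fallback.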

Success Criteria

Hidden OCG content no longer appears in extraction
White-out covered text is correctly filtered
Two-column documents maintain correct reading order
Corrupted PDFs fall back gracefully to GS repair or OCR
