OCR/openspec/changes/archive/2025-12-04-pdf-preprocessing-pipeline/proposal.md
2025-12-04 18:00:37 +08:00

Change Proposal: PDF Preprocessing Pipeline

Summary

Implement a multi-stage PDF preprocessing pipeline for Direct track extraction to improve layout accuracy, remove hidden/covered content, and ensure correct reading order.

Problem Statement

The current Direct track extraction has four issues:

  1. Hidden content pollution: Text on OCG (Optional Content Group) layers and text covered by "white-out" shapes leaks into the extraction output
  2. Reading order chaos: Two-column layouts get interleaved incorrectly
  3. Vector graphics interference: Large decorative vector elements cover text content
  4. Corrupted PDF handling: No fallback exists for structurally damaged PDFs or for PDFs that yield (cid:xxxx) garbled text

Proposed Solution

Implement a 4-stage preprocessing pipeline:

Step 0: GS Distillation (Exception Handler - triggered on errors)
Step 1: Object-level Cleaning (P0 - Core)
Step 2: Layout Analysis (P1 - Rule-based with sort=True)
Step 3: Text Extraction (Existing, enhanced with garble detection)
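
The control flow implied by these steps can be sketched as follows. This is a minimal illustration under stated assumptions, not the engine's actual code: `run_pipeline`, `ExtractionError`, and the callables passed in are hypothetical names, and the stages are injected as plain functions so the sketch stays independent of PyMuPDF and Ghostscript.

```python
from typing import Any, Callable


class ExtractionError(Exception):
    """Hypothetical error type standing in for a structural PDF failure."""


def run_pipeline(
    pdf: Any,
    clean: Callable[[Any], Any],        # Step 1: object-level cleaning
    layout: Callable[[Any], Any],       # Step 2: layout analysis
    extract: Callable[[Any], str],      # Step 3: text extraction
    distill: Callable[[Any], Any],      # Step 0: GS distillation (fallback only)
    is_garbled: Callable[[str], bool],  # garble check on the extracted text
) -> str:
    def run_stages(doc: Any) -> str:
        return extract(layout(clean(doc)))

    try:
        text = run_stages(pdf)
        if not is_garbled(text):
            return text  # normal path: Step 0 never runs
    except ExtractionError:
        pass  # structural error: fall through to distillation

    # Step 0 runs only after an error or garbled output, then Steps 1-3 repeat.
    return run_stages(distill(pdf))
```

Keeping distillation out of the normal path means well-formed PDFs never pay the cost of a Ghostscript pass. The sketch omits the final switch to OCR when the distilled output is still garbled.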

Key Features

  1. Smart Fallback: GS distillation only triggers on (cid:xxxx) garble or mupdf structural errors
  2. White-out Detection: IoU-based overlap detection (80% threshold) to remove covered text
  3. Column-aware Sorting: Leverage PyMuPDF's sort=True for automatic two-column handling
  4. Garble Rate Detection: Auto-switch to PaddleOCR when the garble rate exceeds a threshold
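
The white-out check and the garble rate can be sketched as below. The proposal does not define either formula, so the details are assumptions: `Rect`, `iou`, `is_covered`, and `garble_rate` are hypothetical names, the 0.8 default mirrors the 80% threshold stated above, and the garble rate is taken to be the fraction of characters that belong to `(cid:N)` placeholders.

```python
import re
from typing import NamedTuple


class Rect(NamedTuple):
    x0: float
    y0: float
    x1: float
    y1: float


def area(r: Rect) -> float:
    # Inverted rectangles (no real overlap) count as zero area.
    return max(0.0, r.x1 - r.x0) * max(0.0, r.y1 - r.y0)


def iou(a: Rect, b: Rect) -> float:
    """Intersection over union of two axis-aligned rectangles."""
    inter = Rect(max(a.x0, b.x0), max(a.y0, b.y0),
                 min(a.x1, b.x1), min(a.y1, b.y1))
    inter_area = area(inter)
    union = area(a) + area(b) - inter_area
    return inter_area / union if union > 0 else 0.0


def is_covered(text_box: Rect, covers: list[Rect], threshold: float = 0.8) -> bool:
    """True if any covering shape overlaps the text box at or above the threshold."""
    return any(iou(text_box, c) >= threshold for c in covers)


CID_PATTERN = re.compile(r"\(cid:\d+\)")


def garble_rate(text: str) -> float:
    """Assumed definition: share of characters inside (cid:N) placeholders."""
    if not text:
        return 0.0
    garbled = sum(len(m.group()) for m in CID_PATTERN.finditer(text))
    return garbled / len(text)
```

One caveat with plain IoU: a large white rectangle that fully hides a small text box still scores low, because the union is dominated by the rectangle. If that case matters, dividing the intersection by the text box area alone may be a better fit, though that would be a departure from the IoU wording used here.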

Impact

  • Files Modified: backend/app/services/direct_extraction_engine.py
  • New Dependencies: None (Ghostscript optional, already available on most systems)
  • Risk Level: Medium (core extraction logic changes)

Success Criteria

  • Hidden OCG content no longer appears in extraction
  • White-out covered text is correctly filtered
  • Two-column documents maintain correct reading order
  • Corrupted PDFs fall back gracefully to GS repair or OCR