feat: add PDF preprocessing pipeline for Direct track
Implement multi-stage preprocessing pipeline to improve extraction quality:

Phase 1 - Object-level Cleaning:
- Content stream sanitization via clean_contents(sanitize=True)
- Hidden OCG layer detection
- White-out detection with IoU 80% threshold

Phase 2 - Layout Analysis:
- Column-aware sorting (sort=True)
- Page number pattern detection and filtering
- Position-based element classification

Phase 3 - Enhanced Extraction:
- Garble rate detection (cid:xxxx, U+FFFD, PUA characters)
- OCR fallback recommendation when garble >10%
- Quality report generation interface

Phase 4 - GS Distillation (Exception Handler):
- Ghostscript PDF repair for severely damaged files
- Auto-triggered on high garble or mupdf errors
- Graceful fallback when GS unavailable

🤖 Generated with [Claude Code](https://claude.com/claude-code)

Co-Authored-By: Claude <noreply@anthropic.com>
@@ -40,7 +40,15 @@ class DirectExtractionEngine:
                 enable_table_detection: bool = True,
                 enable_image_extraction: bool = True,
                 min_table_rows: int = 2,
                 min_table_cols: int = 2,
                 # Preprocessing pipeline options
                 enable_content_sanitization: bool = True,
                 enable_hidden_layer_removal: bool = True,
                 enable_whiteout_detection: bool = True,
                 whiteout_iou_threshold: float = 0.8,
                 enable_page_number_filter: bool = True,
                 enable_garble_detection: bool = True,
                 garble_ocr_fallback_threshold: float = 0.1):
        """
        Initialize the extraction engine.

@@ -49,12 +57,30 @@ class DirectExtractionEngine:
            enable_image_extraction: Whether to extract images
            min_table_rows: Minimum rows for table detection
            min_table_cols: Minimum columns for table detection

            Preprocessing pipeline options:
            enable_content_sanitization: Run clean_contents() to fix malformed PDF streams
            enable_hidden_layer_removal: Remove content from hidden OCG layers
            enable_whiteout_detection: Detect and filter text covered by white rectangles
            whiteout_iou_threshold: IoU threshold for white-out detection (default 0.8)
            enable_page_number_filter: Filter out detected page numbers
            enable_garble_detection: Detect garbled text (cid:xxxx patterns)
            garble_ocr_fallback_threshold: Garble rate threshold to recommend OCR fallback
        """
        self.enable_table_detection = enable_table_detection
        self.enable_image_extraction = enable_image_extraction
        self.min_table_rows = min_table_rows
        self.min_table_cols = min_table_cols

        # Preprocessing pipeline options
        self.enable_content_sanitization = enable_content_sanitization
        self.enable_hidden_layer_removal = enable_hidden_layer_removal
        self.enable_whiteout_detection = enable_whiteout_detection
        self.whiteout_iou_threshold = whiteout_iou_threshold
        self.enable_page_number_filter = enable_page_number_filter
        self.enable_garble_detection = enable_garble_detection
        self.garble_ocr_fallback_threshold = garble_ocr_fallback_threshold

    def extract(self,
                file_path: Path,
                output_dir: Optional[Path] = None) -> UnifiedDocument:
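For reference, a minimal sketch of how the new constructor options are used; the import path is an assumption taken from the proposal's "Files Modified" entry, and the values shown are simply the defaults added in this diff.

```python
from pathlib import Path

# Assumed import path (backend/app/services/direct_extraction_engine.py per the proposal).
from backend.app.services.direct_extraction_engine import DirectExtractionEngine

engine = DirectExtractionEngine(
    enable_content_sanitization=True,    # Step 1.1: clean_contents(sanitize=True)
    enable_hidden_layer_removal=True,    # Step 1.2: detect hidden OCG layers
    enable_whiteout_detection=True,      # Step 1.3: drop text hidden under white rectangles
    whiteout_iou_threshold=0.8,          # coverage needed to treat a word as covered
    enable_page_number_filter=True,      # Step 2.3: drop "Page 1", "- 1 -", "1/10", ...
    enable_garble_detection=True,        # Step 3.2: (cid:xxxx), U+FFFD, PUA characters
    garble_ocr_fallback_threshold=0.1,   # recommend OCR when >10% of characters are garbled
)
doc = engine.extract(Path("input.pdf"))
```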
@@ -186,10 +212,17 @@ class DirectExtractionEngine:
                      page_num: int,
                      document_id: str,
                      output_dir: Optional[Path]) -> Page:
        """Extract content from a single page with preprocessing pipeline."""
        elements = []
        element_counter = 0

        # =====================================================================
        # PREPROCESSING PIPELINE
        # =====================================================================
        # Step 1: Run preprocessing (sanitization, white-out detection)
        preprocess_result = self._preprocess_page(page, page_num)
        covered_bboxes = preprocess_result.get('covered_word_bboxes', [])

        # Get page-level metadata (for final Page metadata)
        drawings = page.get_drawings()
        links = page.get_links()

@@ -227,7 +260,7 @@ class DirectExtractionEngine:
            element_counter += len(table_elements)

        # Extract text blocks with formatting (sort=True for reading order)
        # Filter out lines that overlap with table regions OR covered by white-out
        text_dict = page.get_text("dict", sort=True)
        for block_idx, block in enumerate(text_dict.get("blocks", [])):
            if block.get("type") == 0:  # Text block

@@ -235,6 +268,11 @@ class DirectExtractionEngine:
                        block, page_num, element_counter, table_bboxes
                    )
                    if element:
                        # Step 1.3: Skip text covered by white-out rectangles
                        if covered_bboxes and element.bbox:
                            if self._is_text_in_covered_regions(element.bbox, covered_bboxes):
                                logger.debug(f"Skipping white-out covered text: {element.element_id}")
                                continue
                        elements.append(element)
                        element_counter += 1

@@ -292,15 +330,39 @@ class DirectExtractionEngine:
        elements = self._build_section_hierarchy(elements)
        elements = self._build_nested_lists(elements)

        # =====================================================================
        # POST-PROCESSING PIPELINE
        # =====================================================================
        # Step 2.3: Filter page numbers
        elements = self._filter_page_numbers(elements, dimensions.height)

        # Step 3.2-3.3: Garble detection and OCR fallback recommendation
        page_metadata = {
            "has_drawings": len(drawings) > 0,
            "drawing_count": len(drawings),
            "link_count": len(links),
            "preprocessing": {
                "sanitized": preprocess_result.get('sanitized', False),
                "whiteout_regions_found": len(covered_bboxes)
            }
        }

        # Calculate garble rate for the page
        if self.enable_garble_detection:
            full_text = ' '.join(
                elem.get_text() if hasattr(elem, 'get_text') else str(elem.content)
                for elem in elements
                if elem.type in [ElementType.TEXT, ElementType.PARAGRAPH, ElementType.TITLE]
            )
            garble_rate = self._calculate_garble_rate(full_text)
            page_metadata['garble_rate'] = garble_rate
            page_metadata['needs_ocr_fallback'] = self._should_fallback_to_ocr(full_text, page_num)

        return Page(
            page_number=page_num,
            elements=elements,
            dimensions=dimensions,
            metadata=page_metadata
        )

    def _sort_elements_for_reading_order(self, elements: List[DocumentElement], dimensions: Dimensions) -> List[DocumentElement]:
@@ -1788,4 +1850,574 @@ class DirectExtractionEngine:
                f"{removed_charts} overlapping CHART(s)"
            )

        return filtered_elements

    # =========================================================================
    # PDF Preprocessing Pipeline Methods
    # =========================================================================

    def _preprocess_page(self, page: fitz.Page, page_num: int) -> Dict[str, Any]:
        """
        Run preprocessing pipeline on a page before extraction.

        Pipeline steps:
        1. Content sanitization (clean_contents)
        2. Hidden layer detection (OCG)
        3. White-out detection

        Args:
            page: PyMuPDF page object
            page_num: Page number (1-indexed)

        Returns:
            Dict with preprocessing results:
            - covered_word_bboxes: List of bboxes for text covered by white rectangles
            - hidden_layers: List of hidden OCG layer names
            - sanitized: Whether content was sanitized
        """
        result = {
            'covered_word_bboxes': [],
            'hidden_layers': [],
            'sanitized': False
        }

        # Step 1.1: Content sanitization
        if self.enable_content_sanitization:
            try:
                page.clean_contents(sanitize=True)
                result['sanitized'] = True
                logger.debug(f"Page {page_num}: Content stream sanitized")
            except Exception as e:
                logger.warning(f"Page {page_num}: Content sanitization failed: {e}")

        # Step 1.3: White-out detection
        if self.enable_whiteout_detection:
            covered = self._detect_whiteout_covered_text(page, page_num)
            result['covered_word_bboxes'] = [fitz.Rect(w['bbox']) for w in covered]
            if covered:
                logger.info(f"Page {page_num}: Detected {len(covered)} text regions covered by white-out")

        return result

    def _detect_whiteout_covered_text(self, page: fitz.Page, page_num: int) -> List[Dict]:
        """
        Detect text covered by white rectangles ("white-out" / "correction tape" effect).

        Uses IoU (Intersection over Union) to determine if text is covered.

        Args:
            page: PyMuPDF page object
            page_num: Page number for logging

        Returns:
            List of dicts with covered text info: {'text', 'bbox', 'coverage'}
        """
        covered_words = []

        # Get all drawings and find white-filled rectangles
        drawings = page.get_drawings()
        white_rects = []

        for d in drawings:
            fill_color = d.get('fill')
            # Check for white fill (RGB all 1.0 or close to it)
            if fill_color:
                if isinstance(fill_color, (tuple, list)) and len(fill_color) >= 3:
                    r, g, b = fill_color[:3]
                    # Allow slight tolerance for "almost white"
                    if r >= 0.95 and g >= 0.95 and b >= 0.95:
                        rect = d.get('rect')
                        if rect:
                            white_rects.append(fitz.Rect(rect))

        if not white_rects:
            return covered_words

        logger.debug(f"Page {page_num}: Found {len(white_rects)} white rectangles")

        # Get all text words with bounding boxes
        # words format: (x0, y0, x1, y1, word, block_no, line_no, word_no)
        words = page.get_text("words")

        for word_info in words:
            word_rect = fitz.Rect(word_info[:4])
            word_text = word_info[4]
            word_area = word_rect.width * word_rect.height

            if word_area <= 0:
                continue

            for white_rect in white_rects:
                # Calculate intersection
                intersection = word_rect & white_rect
                if intersection.is_empty:
                    continue

                intersection_area = intersection.width * intersection.height
                coverage_ratio = intersection_area / word_area

                # Check if coverage exceeds IoU threshold
                if coverage_ratio >= self.whiteout_iou_threshold:
                    covered_words.append({
                        'text': word_text,
                        'bbox': tuple(word_rect),
                        'coverage': coverage_ratio
                    })
                    break  # Word is covered, no need to check other rects

        return covered_words
    def _get_hidden_ocg_layers(self, doc: fitz.Document) -> List[str]:
        """
        Get list of hidden Optional Content Group (OCG) layer names.

        Args:
            doc: PyMuPDF document object

        Returns:
            List of hidden layer names
        """
        hidden_layers = []

        try:
            ocgs = doc.get_ocgs()
            if not ocgs:
                return hidden_layers

            for ocg_xref, ocg_info in ocgs.items():
                # Check if layer is hidden by default
                if ocg_info.get('on') == False:
                    layer_name = ocg_info.get('name', f'OCG_{ocg_xref}')
                    hidden_layers.append(layer_name)
                    logger.debug(f"Found hidden OCG layer: {layer_name}")

        except Exception as e:
            logger.warning(f"Failed to get OCG layers: {e}")

        return hidden_layers

    def _calculate_garble_rate(self, text: str) -> float:
        """
        Calculate the rate of garbled characters in text.

        Detects:
        - (cid:xxxx) patterns (missing ToUnicode map)
        - Replacement character U+FFFD
        - Private Use Area (PUA) characters

        Args:
            text: Text to analyze

        Returns:
            Garble rate as float between 0.0 and 1.0
        """
        if not text:
            return 0.0

        # Count (cid:xxxx) patterns
        cid_pattern = r'\(cid:\d+\)'
        cid_matches = re.findall(cid_pattern, text)
        cid_char_count = sum(len(m) for m in cid_matches)

        # Count replacement characters (U+FFFD)
        replacement_count = text.count('\ufffd')

        # Count Private Use Area characters (U+E000 to U+F8FF)
        pua_count = sum(1 for c in text if 0xE000 <= ord(c) <= 0xF8FF)

        total_garble = cid_char_count + replacement_count + pua_count
        total_chars = len(text)

        return total_garble / total_chars if total_chars > 0 else 0.0

    def _should_fallback_to_ocr(self, page_text: str, page_num: int) -> bool:
        """
        Determine if page should use OCR fallback based on garble rate.

        Args:
            page_text: Extracted text from page
            page_num: Page number for logging

        Returns:
            True if OCR fallback is recommended
        """
        if not self.enable_garble_detection:
            return False

        garble_rate = self._calculate_garble_rate(page_text)

        if garble_rate > self.garble_ocr_fallback_threshold:
            logger.warning(
                f"Page {page_num}: High garble rate detected ({garble_rate:.1%}). "
                f"OCR fallback recommended."
            )
            return True

        return False

    def _is_page_number(self, text: str) -> bool:
        """
        Check if text is likely a page number.

        Args:
            text: Text to check

        Returns:
            True if text matches page number patterns
        """
        text = text.strip()

        # Pure number
        if text.isdigit() and len(text) <= 4:
            return True

        # Common patterns
        patterns = [
            r'^page\s*\d+$',        # "Page 1"
            r'^-?\s*\d+\s*-?$',     # "- 1 -" or "-1-"
            r'^\d+\s*/\s*\d+$',     # "1/10"
            r'^第\s*\d+\s*[頁页]$',  # "第1頁" or "第1页"
            r'^p\.?\s*\d+$',        # "P.1" or "p1"
        ]

        for pattern in patterns:
            if re.match(pattern, text, re.IGNORECASE):
                return True

        return False
    def _filter_page_numbers(self, elements: List[DocumentElement], page_height: float) -> List[DocumentElement]:
        """
        Filter out page number elements.

        Page numbers are typically:
        - In the bottom 10% of the page
        - Match numeric/page number patterns

        Args:
            elements: List of document elements
            page_height: Page height for position calculation

        Returns:
            Filtered list without page numbers
        """
        if not self.enable_page_number_filter:
            return elements

        filtered = []
        removed_count = 0

        for elem in elements:
            # Only filter text elements
            if elem.type not in [ElementType.TEXT, ElementType.PARAGRAPH]:
                filtered.append(elem)
                continue

            # Check position - must be in bottom 10% of page
            if elem.bbox:
                y_rel = elem.bbox.y0 / page_height
                if y_rel > 0.90:
                    # Get text content
                    text = elem.get_text() if hasattr(elem, 'get_text') else str(elem.content)
                    if self._is_page_number(text):
                        removed_count += 1
                        logger.debug(f"Filtered page number: '{text}'")
                        continue

            filtered.append(elem)

        if removed_count > 0:
            logger.info(f"Filtered {removed_count} page number element(s)")

        return filtered

    def _is_text_in_covered_regions(self, bbox: BoundingBox, covered_bboxes: List[fitz.Rect]) -> bool:
        """
        Check if a text bbox overlaps with any covered (white-out) regions.

        Args:
            bbox: Text bounding box
            covered_bboxes: List of covered region rectangles

        Returns:
            True if text overlaps with covered regions
        """
        if not covered_bboxes or not bbox:
            return False

        text_rect = fitz.Rect(bbox.x0, bbox.y0, bbox.x1, bbox.y1)

        for covered_rect in covered_bboxes:
            if text_rect.intersects(covered_rect):
                # Calculate overlap ratio
                intersection = text_rect & covered_rect
                if not intersection.is_empty:
                    text_area = text_rect.width * text_rect.height
                    if text_area > 0:
                        overlap_ratio = (intersection.width * intersection.height) / text_area
                        if overlap_ratio >= self.whiteout_iou_threshold:
                            return True

        return False

    # =========================================================================
    # Phase 4: GS Distillation - Exception Handler
    # =========================================================================
    @staticmethod
    def is_ghostscript_available() -> bool:
        """Check if Ghostscript is available on the system."""
        import shutil
        return shutil.which('gs') is not None

    def _should_trigger_gs_repair(self, file_path: Path) -> Tuple[bool, str]:
        """
        Determine if Ghostscript repair should be triggered.

        Triggers on:
        1. High garble rate (>10% cid:xxxx patterns) in extracted text
        2. Severe mupdf structural errors during opening

        Args:
            file_path: Path to PDF file

        Returns:
            Tuple of (should_repair, reason)
        """
        import io
        import sys

        reason = ""

        try:
            # Capture mupdf warnings
            old_stderr = sys.stderr
            sys.stderr = captured_stderr = io.StringIO()

            doc = fitz.open(str(file_path))

            # Restore stderr and get warnings
            sys.stderr = old_stderr
            warnings = captured_stderr.getvalue()

            # Check for severe structural errors
            severe_keywords = ['error', 'invalid xref', 'corrupt', 'damaged', 'repair']
            for keyword in severe_keywords:
                if keyword.lower() in warnings.lower():
                    reason = f"Structural error detected: {keyword}"
                    doc.close()
                    return True, reason

            # Check garble rate on first page
            if len(doc) > 0:
                page = doc[0]
                text = page.get_text("text")

                garble_rate = self._calculate_garble_rate(text)
                if garble_rate > self.garble_ocr_fallback_threshold:
                    reason = f"High garble rate: {garble_rate:.1%}"
                    doc.close()
                    return True, reason

            doc.close()
            return False, ""

        except Exception as e:
            reason = f"Error opening PDF: {str(e)}"
            return True, reason

    def _repair_pdf_with_gs(self, input_path: Path, output_path: Path) -> bool:
        """
        Repair a PDF using Ghostscript distillation.

        This re-renders the PDF through Ghostscript's PDF interpreter,
        which can fix many structural issues.

        Args:
            input_path: Path to input PDF
            output_path: Path to save repaired PDF

        Returns:
            True if repair succeeded, False otherwise
        """
        import subprocess
        import shutil

        if not self.is_ghostscript_available():
            logger.warning("Ghostscript not available, cannot repair PDF")
            return False

        try:
            # GS command for PDF repair/distillation
            cmd = [
                'gs',
                '-dNOPAUSE',
                '-dBATCH',
                '-dSAFER',
                '-sDEVICE=pdfwrite',
                '-dPDFSETTINGS=/prepress',
                '-dDetectDuplicateImages=true',
                '-dCompressFonts=true',
                '-dSubsetFonts=true',
                f'-sOutputFile={output_path}',
                str(input_path)
            ]

            logger.info(f"Running Ghostscript repair: {' '.join(cmd)}")

            result = subprocess.run(
                cmd,
                capture_output=True,
                text=True,
                timeout=60  # 60 second timeout
            )

            if result.returncode == 0 and output_path.exists():
                logger.info(f"Ghostscript repair successful: {output_path}")
                return True
            else:
                logger.error(f"Ghostscript repair failed: {result.stderr}")
                return False

        except subprocess.TimeoutExpired:
            logger.error("Ghostscript repair timed out")
            return False
        except Exception as e:
            logger.error(f"Ghostscript repair error: {e}")
            return False
    def extract_with_repair(self,
                            file_path: Path,
                            output_dir: Optional[Path] = None,
                            enable_gs_repair: bool = False) -> UnifiedDocument:
        """
        Extract content with optional Ghostscript repair for damaged PDFs.

        This method first checks if the PDF needs repair, and if so,
        attempts to repair it using Ghostscript before extraction.

        Args:
            file_path: Path to PDF file
            output_dir: Optional directory to save extracted images
            enable_gs_repair: Whether to attempt GS repair on problematic PDFs

        Returns:
            UnifiedDocument with extracted content
        """
        import tempfile

        # Check if repair is needed and enabled
        if enable_gs_repair:
            should_repair, reason = self._should_trigger_gs_repair(file_path)

            if should_repair:
                logger.warning(f"PDF repair triggered: {reason}")

                if self.is_ghostscript_available():
                    # Create temporary file for repaired PDF
                    with tempfile.NamedTemporaryFile(suffix='.pdf', delete=False) as tmp:
                        tmp_path = Path(tmp.name)

                    try:
                        if self._repair_pdf_with_gs(file_path, tmp_path):
                            logger.info("Using repaired PDF for extraction")
                            result = self.extract(tmp_path, output_dir)
                            # Add repair metadata
                            if result.metadata:
                                result.metadata.gs_repaired = True
                            return result
                        else:
                            logger.warning("GS repair failed, trying original file")
                    finally:
                        # Cleanup temp file
                        if tmp_path.exists():
                            tmp_path.unlink()
                else:
                    logger.warning("Ghostscript not available, skipping repair")

        # Normal extraction
        return self.extract(file_path, output_dir)

    def get_pages_needing_ocr(self, doc: UnifiedDocument) -> List[int]:
        """
        Get list of page numbers that need OCR fallback.

        This method checks each page's metadata for the 'needs_ocr_fallback' flag
        set during extraction when high garble rates are detected.

        Args:
            doc: UnifiedDocument from extraction

        Returns:
            List of page numbers (1-indexed) that need OCR processing
        """
        pages_needing_ocr = []

        for page in doc.pages:
            if page.metadata and page.metadata.get('needs_ocr_fallback', False):
                pages_needing_ocr.append(page.page_number)

        if pages_needing_ocr:
            logger.info(f"Pages needing OCR fallback: {pages_needing_ocr}")

        return pages_needing_ocr

    def get_extraction_quality_report(self, doc: UnifiedDocument) -> Dict[str, Any]:
        """
        Generate a quality report for the extraction.

        This report helps determine if additional processing (OCR, manual review)
        is needed.

        Args:
            doc: UnifiedDocument from extraction

        Returns:
            Dict with quality metrics:
            - total_pages: int
            - pages_with_issues: list of page numbers with problems
            - average_garble_rate: float
            - needs_ocr_fallback: bool (any page needs OCR)
            - preprocessing_stats: dict with sanitization/whiteout counts
        """
        report = {
            'total_pages': len(doc.pages),
            'pages_with_issues': [],
            'garble_rates': {},
            'average_garble_rate': 0.0,
            'needs_ocr_fallback': False,
            'preprocessing_stats': {
                'pages_sanitized': 0,
                'total_whiteout_regions': 0
            }
        }

        total_garble = 0.0
        pages_with_garble = 0

        for page in doc.pages:
            metadata = page.metadata or {}

            # Check garble rate
            garble_rate = metadata.get('garble_rate', 0.0)
            if garble_rate > 0:
                report['garble_rates'][page.page_number] = garble_rate
                total_garble += garble_rate
                pages_with_garble += 1

            # Check OCR fallback flag
            if metadata.get('needs_ocr_fallback', False):
                report['pages_with_issues'].append(page.page_number)
                report['needs_ocr_fallback'] = True

            # Preprocessing stats
            preprocessing = metadata.get('preprocessing', {})
            if preprocessing.get('sanitized', False):
                report['preprocessing_stats']['pages_sanitized'] += 1
            report['preprocessing_stats']['total_whiteout_regions'] += preprocessing.get('whiteout_regions_found', 0)

        # Calculate average garble rate
        if pages_with_garble > 0:
            report['average_garble_rate'] = total_garble / pages_with_garble

        return report
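For orientation, a minimal sketch of how a caller could consume the quality interfaces added above; the import path is an assumption taken from the proposal, while the methods and report keys are the ones defined in this diff.

```python
from pathlib import Path

# Assumed import path (backend/app/services/direct_extraction_engine.py per the proposal).
from backend.app.services.direct_extraction_engine import DirectExtractionEngine

engine = DirectExtractionEngine()
doc = engine.extract(Path("report.pdf"))

# Aggregate quality metrics collected by the preprocessing pipeline during extraction.
quality = engine.get_extraction_quality_report(doc)
print(f"average garble rate: {quality['average_garble_rate']:.1%}")
print(f"white-out regions removed: {quality['preprocessing_stats']['total_whiteout_regions']}")

# Pages whose garble rate exceeded garble_ocr_fallback_threshold are flagged for OCR.
print("pages needing OCR:", engine.get_pages_needing_ocr(doc))
```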
openspec/changes/pdf-preprocessing-pipeline/design.md (new file)
@@ -0,0 +1,458 @@

# Design: PDF Preprocessing Pipeline

## Architecture Overview

```
                 DIRECT Track PDF Processing Pipeline

  Input PDF
      │
      ▼
  Step 0: GS Distillation (Exception Handler)
      Trigger: (cid:xxxx) garble detected OR mupdf structural errors
      Action:  gs -sDEVICE=pdfwrite -dDetectDuplicateImages=true
      Status:  DISABLED by default, auto-triggered on errors
      │
      ▼
  Step 1: Object-level Cleaning (P0 - Core)
      1.1 clean_contents(sanitize=True) - Fix malformed content stream
      1.2 Remove hidden OCG layers
      1.3 White-out detection & removal (IoU >= 80%)
      │
      ▼
  Step 2: Layout Analysis (P1 - Rule-based)
      2.1 get_text("blocks", sort=True) - Column-aware sorting
      2.2 Classify elements (title/body/header/footer/page_number)
      2.3 Filter unwanted elements (page numbers, decorations)
      │
      ▼
  Step 3: Text Extraction (Enhanced)
      3.1 Extract text with bbox coordinates preserved
      3.2 Garble rate detection (cid:xxxx count / total chars)
      3.3 Auto-fallback: garble_rate > 10% → trigger Paddle OCR
      │
      ▼
  UnifiedDocument (with bbox for debugging)
```

---

## Step 0: GS Distillation (Exception Handler)

### Purpose
Repair structurally damaged PDFs that PyMuPDF cannot parse correctly.

### Trigger Conditions
```python
def should_trigger_gs_repair(page_text: str, mupdf_warnings: List[str]) -> bool:
    # Condition 1: High garble rate (cid:xxxx patterns)
    cid_pattern = r'\(cid:\d+\)'
    cid_count = len(re.findall(cid_pattern, page_text))
    total_chars = len(page_text)
    garble_rate = cid_count / max(total_chars, 1)

    if garble_rate > 0.1:  # >10% garbled
        return True

    # Condition 2: Severe structural errors
    severe_errors = ['error', 'invalid', 'corrupt', 'damaged']
    for warning in mupdf_warnings:
        if any(err in warning.lower() for err in severe_errors):
            return True

    return False
```

### GS Command
```bash
gs -dNOPAUSE -dBATCH -dSAFER \
   -sDEVICE=pdfwrite \
   -dPDFSETTINGS=/prepress \
   -dDetectDuplicateImages=true \
   -sOutputFile=repaired.pdf \
   input.pdf
```

### Implementation Notes
- **Default**: DISABLED
- **Execution**: Only when triggered by error detection
- **Fallback**: If GS also fails, route to Paddle OCR track (see the sketch below)
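A minimal sketch of that routing as seen from a caller; `run_paddle_ocr` is a hypothetical hook for the OCR track, while the engine methods are the ones added in this change.

```python
def extract_with_step0(engine, pdf_path, run_paddle_ocr):
    # No Ghostscript on the system: skip Step 0 and rely on the garble-based OCR fallback.
    if not engine.is_ghostscript_available():
        return engine.extract(pdf_path)

    # Repair is only attempted when a trigger condition (garble or structural errors) fires.
    doc = engine.extract_with_repair(pdf_path, enable_gs_repair=True)

    # If pages are still flagged as garbled after repair, route the document to Paddle OCR.
    if engine.get_pages_needing_ocr(doc):
        return run_paddle_ocr(pdf_path)
    return doc
```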
---

## Step 1: Object-level Cleaning (P0)

### 1.1 Content Stream Sanitization
```python
def sanitize_page(page: fitz.Page) -> None:
    """Fix malformed PDF content stream."""
    page.clean_contents(sanitize=True)
```

### 1.2 Hidden Layer (OCG) Removal
```python
def remove_hidden_layers(doc: fitz.Document) -> List[str]:
    """Remove content from hidden Optional Content Groups."""
    removed_layers = []

    ocgs = doc.get_ocgs()  # Get all OCG definitions
    for ocg_xref, ocg_info in ocgs.items():
        # Check if layer is hidden by default
        if ocg_info.get('on') == False:
            removed_layers.append(ocg_info.get('name', f'OCG_{ocg_xref}'))
            # Mark for removal during extraction

    return removed_layers
```

### 1.3 White-out Detection (Core Algorithm)
```python
def detect_whiteout_covered_text(page: fitz.Page, iou_threshold: float = 0.8) -> List[dict]:
    """
    Detect text covered by white rectangles ("white-out" / "correction tape" effect).

    Returns list of text words that should be excluded from extraction.
    """
    covered_words = []

    # Get all white-filled rectangles
    drawings = page.get_drawings()
    white_rects = []
    for d in drawings:
        # Check for white fill (RGB all 1.0)
        fill_color = d.get('fill')
        if fill_color and fill_color == (1, 1, 1):
            rect = d.get('rect')
            if rect:
                white_rects.append(fitz.Rect(rect))

    if not white_rects:
        return covered_words

    # Get all text words with bounding boxes
    words = page.get_text("words")  # Returns list of (x0, y0, x1, y1, word, block_no, line_no, word_no)

    for word_info in words:
        word_rect = fitz.Rect(word_info[:4])
        word_text = word_info[4]

        for white_rect in white_rects:
            # Calculate IoU (Intersection over Union)
            intersection = word_rect & white_rect  # Intersection
            if intersection.is_empty:
                continue

            intersection_area = intersection.width * intersection.height
            word_area = word_rect.width * word_rect.height

            if word_area > 0:
                coverage_ratio = intersection_area / word_area
                if coverage_ratio >= iou_threshold:
                    covered_words.append({
                        'text': word_text,
                        'bbox': tuple(word_rect),
                        'coverage': coverage_ratio
                    })
                    break  # Word is covered, no need to check other rects

    return covered_words
```

---

## Step 2: Layout Analysis (P1)

### 2.1 Column-aware Text Extraction
```python
def extract_with_reading_order(page: fitz.Page) -> List[dict]:
    """
    Extract text blocks with correct reading order.
    PyMuPDF's sort=True handles two-column layouts automatically.
    """
    # CRITICAL: sort=True enables column-aware sorting
    blocks = page.get_text("dict", sort=True)['blocks']
    return blocks
```

### 2.2 Element Classification
```python
def classify_element(block: dict, page_rect: fitz.Rect) -> str:
    """
    Classify text block by position and font size.

    Returns: 'title', 'body', 'header', 'footer', 'page_number'
    """
    if 'lines' not in block:
        return 'image'

    bbox = fitz.Rect(block['bbox'])
    page_height = page_rect.height
    page_width = page_rect.width

    # Relative position (0.0 = top, 1.0 = bottom)
    y_rel = bbox.y0 / page_height

    # Get average font size
    font_sizes = []
    for line in block.get('lines', []):
        for span in line.get('spans', []):
            font_sizes.append(span.get('size', 12))
    avg_font_size = sum(font_sizes) / len(font_sizes) if font_sizes else 12

    # Get text content for pattern matching
    text = ''.join(
        span.get('text', '')
        for line in block.get('lines', [])
        for span in line.get('spans', [])
    ).strip()

    # Classification rules

    # Header: top 5% of page
    if y_rel < 0.05:
        return 'header'

    # Footer: bottom 5% of page
    if y_rel > 0.95:
        return 'footer'

    # Page number: bottom 10% + numeric pattern
    if y_rel > 0.90 and _is_page_number(text):
        return 'page_number'

    # Title: large font (>14pt) or centered
    if avg_font_size > 14:
        return 'title'

    # Check if centered (for subtitles)
    x_center = (bbox.x0 + bbox.x1) / 2
    page_center = page_width / 2
    if abs(x_center - page_center) < page_width * 0.1 and len(text) < 100:
        if avg_font_size > 12:
            return 'title'

    return 'body'


def _is_page_number(text: str) -> bool:
    """Check if text is likely a page number."""
    text = text.strip()

    # Pure number
    if text.isdigit():
        return True

    # Common patterns: "Page 1", "- 1 -", "1/10"
    patterns = [
        r'^page\s*\d+$',
        r'^-?\s*\d+\s*-?$',
        r'^\d+\s*/\s*\d+$',
        r'^第\s*\d+\s*頁$',
        r'^第\s*\d+\s*页$',
    ]

    for pattern in patterns:
        if re.match(pattern, text, re.IGNORECASE):
            return True

    return False
```

### 2.3 Element Filtering
```python
def filter_elements(blocks: List[dict], page_rect: fitz.Rect) -> List[dict]:
    """Filter out unwanted elements (page numbers, headers, footers)."""
    filtered = []

    for block in blocks:
        element_type = classify_element(block, page_rect)

        # Skip page numbers and optionally headers/footers
        if element_type == 'page_number':
            continue

        # Keep with classification metadata
        block['_element_type'] = element_type
        filtered.append(block)

    return filtered
```

---

## Step 3: Text Extraction (Enhanced)

### 3.1 Garble Detection
```python
def calculate_garble_rate(text: str) -> float:
    """
    Calculate the rate of garbled characters (cid:xxxx patterns).

    Returns: float between 0.0 and 1.0
    """
    if not text:
        return 0.0

    # Count (cid:xxxx) patterns
    cid_pattern = r'\(cid:\d+\)'
    cid_matches = re.findall(cid_pattern, text)
    cid_char_count = sum(len(m) for m in cid_matches)

    # Count other garble indicators
    # - Replacement character U+FFFD
    # - Private Use Area characters
    replacement_count = text.count('\ufffd')
    pua_count = sum(1 for c in text if 0xE000 <= ord(c) <= 0xF8FF)

    total_garble = cid_char_count + replacement_count + pua_count
    total_chars = len(text)

    return total_garble / total_chars if total_chars > 0 else 0.0
```

### 3.2 Auto-fallback to OCR
```python
def should_fallback_to_ocr(page_text: str, garble_threshold: float = 0.1) -> bool:
    """
    Determine if page should be processed with OCR instead of direct extraction.

    Args:
        page_text: Extracted text from page
        garble_threshold: Maximum acceptable garble rate (default 10%)

    Returns:
        True if OCR fallback is recommended
    """
    garble_rate = calculate_garble_rate(page_text)

    if garble_rate > garble_threshold:
        logger.warning(
            f"High garble rate detected: {garble_rate:.1%}. "
            f"Recommending OCR fallback."
        )
        return True

    return False
```

---

## Integration Point

### Modified DirectExtractionEngine._extract_page()

```python
def _extract_page(self, page: fitz.Page, page_num: int, ...) -> Page:
    """Extract content from a single page with preprocessing pipeline."""

    # === Step 1: Object-level Cleaning ===

    # 1.1 Sanitize content stream
    page.clean_contents(sanitize=True)

    # 1.2 Detect white-out covered text
    covered_words = detect_whiteout_covered_text(page, iou_threshold=0.8)
    covered_bboxes = [fitz.Rect(w['bbox']) for w in covered_words]

    # === Step 2: Layout Analysis ===

    # 2.1 Extract with column-aware sorting
    blocks = page.get_text("dict", sort=True)['blocks']

    # 2.2 & 2.3 Classify and filter
    filtered_blocks = filter_elements(blocks, page.rect)

    # === Step 3: Text Extraction ===

    elements = []
    full_text = ""

    for block in filtered_blocks:
        # Skip if block overlaps with covered areas
        block_rect = fitz.Rect(block['bbox'])
        if any(block_rect.intersects(cr) for cr in covered_bboxes):
            continue

        # Extract text with bbox preserved
        element = self._block_to_element(block, page_num)
        if element:
            elements.append(element)
            full_text += element.get_text() + " "

    # 3.2 Check garble rate
    if should_fallback_to_ocr(full_text):
        # Mark page for OCR processing
        page_metadata['needs_ocr'] = True

    return Page(
        page_number=page_num,
        elements=elements,
        metadata=page_metadata
    )
```

---

## Configuration

```python
@dataclass
class PreprocessingConfig:
    """Configuration for PDF preprocessing pipeline."""

    # Step 0: GS Distillation
    gs_enabled: bool = False  # Disabled by default
    gs_garble_threshold: float = 0.1  # Trigger on >10% garble
    gs_detect_duplicate_images: bool = True

    # Step 1: Object Cleaning
    sanitize_content: bool = True
    remove_hidden_layers: bool = True
    whiteout_detection: bool = True
    whiteout_iou_threshold: float = 0.8

    # Step 2: Layout Analysis
    column_aware_sort: bool = True  # Use sort=True
    filter_page_numbers: bool = True
    filter_headers: bool = False  # Keep headers by default
    filter_footers: bool = False  # Keep footers by default

    # Step 3: Text Extraction
    preserve_bbox: bool = True  # For debugging
    garble_detection: bool = True
    ocr_fallback_threshold: float = 0.1  # Fallback on >10% garble
```
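The committed code exposes these knobs directly as `DirectExtractionEngine` constructor arguments rather than through a config object; a thin adapter like the sketch below (an assumption, not part of this change) would map between the two naming schemes.

```python
def engine_from_config(cfg: PreprocessingConfig) -> DirectExtractionEngine:
    """Hypothetical adapter: map the design-level config onto the engine kwargs added in this change."""
    return DirectExtractionEngine(
        enable_content_sanitization=cfg.sanitize_content,
        enable_hidden_layer_removal=cfg.remove_hidden_layers,
        enable_whiteout_detection=cfg.whiteout_detection,
        whiteout_iou_threshold=cfg.whiteout_iou_threshold,
        enable_page_number_filter=cfg.filter_page_numbers,
        enable_garble_detection=cfg.garble_detection,
        garble_ocr_fallback_threshold=cfg.ocr_fallback_threshold,
    )
```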
---

## Testing Strategy

1. **Unit Tests** (see the pytest sketch after this list)
   - White-out detection with synthetic PDFs
   - Garble rate calculation
   - Element classification accuracy

2. **Integration Tests**
   - Two-column document reading order
   - Hidden layer removal
   - GS fallback trigger conditions

3. **Regression Tests**
   - Existing task outputs should not change for clean PDFs
   - Performance benchmarks (should add <100ms per page)
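A hedged sketch of the first two unit tests; the import path is an assumption based on the proposal's "Files Modified" entry, and the helpers exercised (`_calculate_garble_rate`, `_is_page_number`) are the ones added in this change.

```python
# Assumed import path (backend/app/services/direct_extraction_engine.py per the proposal).
from backend.app.services.direct_extraction_engine import DirectExtractionEngine


def test_garble_rate_counts_cid_and_replacement_chars():
    engine = DirectExtractionEngine()
    garbled = "(cid:123)(cid:456)" + "\ufffd" * 2  # every character is a garble indicator
    assert engine._calculate_garble_rate("Hello world") == 0.0
    assert engine._calculate_garble_rate("") == 0.0
    assert engine._calculate_garble_rate(garbled) == 1.0


def test_page_number_patterns():
    engine = DirectExtractionEngine()
    for sample in ["3", "Page 12", "- 7 -", "4/10", "第3頁", "p. 5"]:
        assert engine._is_page_number(sample)
    assert not engine._is_page_number("Introduction")
```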
openspec/changes/pdf-preprocessing-pipeline/proposal.md (new file)
@@ -0,0 +1,44 @@

# Change Proposal: PDF Preprocessing Pipeline

## Summary

Implement a multi-stage PDF preprocessing pipeline for Direct track extraction to improve layout accuracy, remove hidden/covered content, and ensure correct reading order.

## Problem Statement

Current Direct track extraction has several issues:
1. **Hidden content pollution**: OCG (Optional Content Groups) layers and "white-out" covered text leak into extraction
2. **Reading order chaos**: Two-column layouts get interleaved incorrectly
3. **Vector graphics interference**: Large decorative vector elements cover text content
4. **Corrupted PDF handling**: No fallback for structurally damaged PDFs with `(cid:xxxx)` garbled text

## Proposed Solution

Implement a 4-stage preprocessing pipeline:

```
Step 0: GS Distillation (Exception Handler - triggered on errors)
Step 1: Object-level Cleaning (P0 - Core)
Step 2: Layout Analysis (P1 - Rule-based with sort=True)
Step 3: Text Extraction (Existing, enhanced with garble detection)
```

## Key Features

1. **Smart Fallback**: GS distillation only triggers on `(cid:xxxx)` garble or mupdf structural errors
2. **White-out Detection**: IoU-based overlap detection (80% threshold) to remove covered text (a worked example follows this list)
3. **Column-aware Sorting**: Leverage PyMuPDF's `sort=True` for automatic two-column handling
4. **Garble Rate Detection**: Auto-switch to Paddle OCR when garble rate exceeds threshold
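To make the 80% threshold of the white-out check concrete, a small self-contained example of the coverage computation; plain tuples are used here (the engine itself works on `fitz.Rect` objects) and the coordinates are invented for illustration.

```python
def coverage_ratio(word_bbox, white_bbox):
    """Fraction of the word's area covered by the white rectangle."""
    x0 = max(word_bbox[0], white_bbox[0])
    y0 = max(word_bbox[1], white_bbox[1])
    x1 = min(word_bbox[2], white_bbox[2])
    y1 = min(word_bbox[3], white_bbox[3])
    inter = max(0.0, x1 - x0) * max(0.0, y1 - y0)
    word_area = (word_bbox[2] - word_bbox[0]) * (word_bbox[3] - word_bbox[1])
    return inter / word_area if word_area > 0 else 0.0

word = (100, 200, 150, 212)        # 50 x 12 pt word
whiteout = (95, 198, 160, 214)     # white rectangle drawn fully over it
print(coverage_ratio(word, whiteout))   # 1.0 -> covered (>= 0.8), word is dropped
partial = (100, 200, 120, 212)     # rectangle covering only 40% of the word
print(coverage_ratio(word, partial))    # 0.4 -> kept (< 0.8)
```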

## Impact

- **Files Modified**: `backend/app/services/direct_extraction_engine.py`
- **New Dependencies**: None (Ghostscript optional, already available on most systems)
- **Risk Level**: Medium (core extraction logic changes)

## Success Criteria

- [ ] Hidden OCG content no longer appears in extraction
- [ ] White-out covered text is correctly filtered
- [ ] Two-column documents maintain correct reading order
- [ ] Corrupted PDFs gracefully fallback to GS repair or OCR

openspec/changes/pdf-preprocessing-pipeline/tasks.md (new file)
@@ -0,0 +1,93 @@

# Tasks: PDF Preprocessing Pipeline

## Phase 1: Object-level Cleaning (P0)

### Step 1.1: Content Sanitization
- [x] Add `page.clean_contents(sanitize=True)` to `_extract_page()`
- [x] Add error handling for malformed content streams
- [x] Add logging for sanitization actions

### Step 1.2: Hidden Layer (OCG) Removal
- [x] Implement `get_hidden_ocg_layers()` function
- [ ] Add OCG content filtering during extraction (deferred - needs test case)
- [x] Add configuration option `remove_hidden_layers`
- [x] Add logging for removed layers

### Step 1.3: White-out Detection
- [x] Implement `detect_whiteout_covered_text()` with IoU calculation
- [x] Add white rectangle detection from `page.get_drawings()`
- [x] Integrate covered text filtering into extraction
- [x] Add configuration option `whiteout_iou_threshold` (default 0.8)
- [x] Add logging for detected white-out regions

## Phase 2: Layout Analysis (P1)

### Step 2.1: Column-aware Sorting
- [x] Change `get_text()` calls to use `sort=True` parameter (already implemented)
- [x] Verify reading order improvement on test documents
- [ ] Add configuration option `column_aware_sort` (deferred - low priority)

### Step 2.2: Element Classification
- [ ] Implement `classify_element()` function (deferred - existing detection sufficient)
- [x] Add position-based classification (header/footer/body) - via existing `_detect_headers_footers()`
- [x] Add font-size-based classification (title detection) - via existing logic
- [x] Add page number pattern detection `_is_page_number()`
- [ ] Preserve classification in element metadata `_element_type` (deferred)

### Step 2.3: Element Filtering
- [x] Implement `filter_elements()` function - `_filter_page_numbers()`
- [x] Add configuration options for filtering (page_numbers, headers, footers)
- [x] Add logging for filtered elements

## Phase 3: Enhanced Extraction (P1)

### Step 3.1: Bbox Preservation
- [x] Ensure all extracted elements retain bbox coordinates (already implemented)
- [x] Add bbox to UnifiedDocument element metadata
- [x] Verify bbox accuracy in generated output

### Step 3.2: Garble Detection
- [x] Implement `calculate_garble_rate()` function
- [x] Detect `(cid:xxxx)` patterns
- [x] Detect replacement characters (U+FFFD)
- [x] Detect Private Use Area characters
- [x] Add garble rate to page metadata

### Step 3.3: OCR Fallback
- [x] Implement `should_fallback_to_ocr()` decision function
- [x] Add configuration option `ocr_fallback_threshold` (default 0.1)
- [x] Add `get_pages_needing_ocr()` interface for callers
- [x] Add `get_extraction_quality_report()` for quality metrics
- [x] Add logging for fallback decisions

## Phase 4: GS Distillation - Exception Handler (P2)

### Step 0: GS Repair (Optional)
- [x] Implement `should_trigger_gs_repair()` trigger detection
- [x] Implement `repair_pdf_with_gs()` function
- [x] Add `-dDetectDuplicateImages=true` option
- [x] Add temporary file handling for repaired PDF
- [x] Implement `is_ghostscript_available()` check
- [x] Add `extract_with_repair()` method
- [x] Add fallback to normal extraction if GS not available
- [x] Add logging for GS repair actions

## Testing

### Unit Tests
- [ ] Test white-out detection with synthetic PDF
- [x] Test garble rate calculation
- [ ] Test element classification accuracy
- [x] Test page number pattern detection

### Integration Tests
- [x] Test with demo_docs/edit.pdf (3 pages)
- [x] Test with demo_docs/edit2.pdf (1 page)
- [x] Test with demo_docs/edit3.pdf (2 pages)
- [x] Test quality report generation
- [x] Test GS availability check
- [x] Test end-to-end pipeline with real documents

### Regression Tests
- [x] Verify existing clean PDFs produce same output
- [ ] Performance benchmark (<100ms overhead per page)