# Design: PDF Preprocessing Pipeline

## Architecture Overview

```
┌─────────────────────────────────────────────────────────────────────────────┐
│                    DIRECT Track PDF Processing Pipeline                     │
├─────────────────────────────────────────────────────────────────────────────┤
│                                                                             │
│  Input PDF                                                                  │
│      │                                                                      │
│      ▼                                                                      │
│  ┌───────────────────────────────────────────────────────────────────────┐ │
│  │ Step 0: GS Distillation (Exception Handler)                           │ │
│  │ ───────────────────────────────────────────────────────────────────── │ │
│  │ Trigger: (cid:xxxx) garble detected OR mupdf structural errors        │ │
│  │ Action: gs -sDEVICE=pdfwrite -dDetectDuplicateImages=true             │ │
│  │ Status: DISABLED by default, auto-triggered on errors                 │ │
│  └───────────────────────────────────────────────────────────────────────┘ │
│      │                                                                      │
│      ▼                                                                      │
│  ┌───────────────────────────────────────────────────────────────────────┐ │
│  │ Step 1: Object-level Cleaning (P0 - Core)                             │ │
│  │ ───────────────────────────────────────────────────────────────────── │ │
│  │ 1.1 clean_contents(sanitize=True) - Fix malformed content stream     │ │
│  │ 1.2 Remove hidden OCG layers                                          │ │
│  │ 1.3 White-out detection & removal (IoU >= 80%)                        │ │
│  └───────────────────────────────────────────────────────────────────────┘ │
│      │                                                                      │
│      ▼                                                                      │
│  ┌───────────────────────────────────────────────────────────────────────┐ │
│  │ Step 2: Layout Analysis (P1 - Rule-based)                             │ │
│  │ ───────────────────────────────────────────────────────────────────── │ │
│  │ 2.1 get_text("blocks", sort=True) - Column-aware sorting             │ │
│  │ 2.2 Classify elements (title/body/header/footer/page_number)          │ │
│  │ 2.3 Filter unwanted elements (page numbers, decorations)              │ │
│  └───────────────────────────────────────────────────────────────────────┘ │
│      │                                                                      │
│      ▼                                                                      │
│  ┌───────────────────────────────────────────────────────────────────────┐ │
│  │ Step 3: Text Extraction (Enhanced)                                    │ │
│  │ ───────────────────────────────────────────────────────────────────── │ │
│  │ 3.1 Extract text with bbox coordinates preserved                      │ │
│  │ 3.2 Garble rate detection (cid:xxxx count / total chars)              │ │
│  │ 3.3 Auto-fallback: garble_rate > 10% → trigger Paddle OCR             │ │
│  └───────────────────────────────────────────────────────────────────────┘ │
│      │                                                                      │
│      ▼                                                                      │
│  UnifiedDocument (with bbox for debugging)                                  │
│                                                                             │
└─────────────────────────────────────────────────────────────────────────────┘
```

---

## Step 0: GS Distillation (Exception Handler)

### Purpose
Repair structurally damaged PDFs that PyMuPDF cannot parse correctly.

### Trigger Conditions
```python
import re
from typing import List

def should_trigger_gs_repair(page_text: str, mupdf_warnings: List[str]) -> bool:
    # Condition 1: high garble rate ((cid:xxxx) patterns)
    cid_pattern = r'\(cid:\d+\)'
    cid_count = len(re.findall(cid_pattern, page_text))
    total_chars = len(page_text)
    garble_rate = cid_count / max(total_chars, 1)

    if garble_rate > 0.1:  # >10% garbled
        return True

    # Condition 2: severe structural errors reported by MuPDF
    severe_errors = ['error', 'invalid', 'corrupt', 'damaged']
    for warning in mupdf_warnings:
        if any(err in warning.lower() for err in severe_errors):
            return True

    return False
```
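
A hedged usage sketch: the `mupdf_warnings` argument can be fed from PyMuPDF's accumulated message buffer, assuming `fitz.TOOLS.mupdf_warnings()` (behavior around resetting the buffer varies by version):

```python
import fitz  # PyMuPDF

doc = fitz.open("input.pdf")  # placeholder path
page_text = "".join(page.get_text() for page in doc)
warnings = fitz.TOOLS.mupdf_warnings().splitlines()  # buffered MuPDF messages

if should_trigger_gs_repair(page_text, warnings):
    # Route through Step 0 before re-extracting
    ...
```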

### GS Command
```bash
gs -dNOPAUSE -dBATCH -dSAFER \
   -sDEVICE=pdfwrite \
   -dPDFSETTINGS=/prepress \
   -dDetectDuplicateImages=true \
   -sOutputFile=repaired.pdf \
   input.pdf
```

### Implementation Notes
- **Default**: DISABLED
- **Execution**: Only when triggered by error detection
- **Fallback**: If GS also fails, route to Paddle OCR track

---

## Step 1: Object-level Cleaning (P0)

### 1.1 Content Stream Sanitization
```python
import fitz  # PyMuPDF

def sanitize_page(page: fitz.Page) -> None:
    """Fix malformed PDF content streams in place."""
    page.clean_contents(sanitize=True)
```
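
The task list also calls for error handling and logging around sanitization; a minimal sketch (the `safe_sanitize` name and module-level `logger` are ours; later snippets assume this logger):

```python
import logging

logger = logging.getLogger(__name__)

def safe_sanitize(page: fitz.Page) -> bool:
    """Sanitize one page; log instead of raising on malformed streams."""
    try:
        page.clean_contents(sanitize=True)
        return True
    except Exception as exc:
        logger.warning("Sanitization failed on page %d: %s", page.number, exc)
        return False
```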

### 1.2 Hidden Layer (OCG) Removal
```python
from typing import List

def remove_hidden_layers(doc: fitz.Document) -> List[str]:
    """Collect names of hidden Optional Content Groups so their content can be skipped."""
    removed_layers = []

    ocgs = doc.get_ocgs()  # All OCG definitions, keyed by xref
    for ocg_xref, ocg_info in ocgs.items():
        # A layer that is off by default is considered hidden
        if ocg_info.get('on') is False:
            removed_layers.append(ocg_info.get('name', f'OCG_{ocg_xref}'))
            # Content in these layers is marked for removal during extraction

    return removed_layers
```
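
A brief usage sketch ("input.pdf" is a placeholder path):

```python
doc = fitz.open("input.pdf")
hidden = remove_hidden_layers(doc)
if hidden:
    logger.info("Skipping hidden OCG layers: %s", ", ".join(hidden))
```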

### 1.3 White-out Detection (Core Algorithm)
```python
def detect_whiteout_covered_text(page: fitz.Page, iou_threshold: float = 0.8) -> List[dict]:
    """
    Detect text covered by white rectangles ("white-out" / "correction tape" effect).

    Returns a list of text words that should be excluded from extraction.
    """
    covered_words = []

    # Get all white-filled rectangles
    drawings = page.get_drawings()
    white_rects = []
    for d in drawings:
        # Check for white fill (RGB all 1.0)
        fill_color = d.get('fill')
        if fill_color and fill_color == (1, 1, 1):
            rect = d.get('rect')
            if rect:
                white_rects.append(fitz.Rect(rect))

    if not white_rects:
        return covered_words

    # Get all text words with bounding boxes
    words = page.get_text("words")  # List of (x0, y0, x1, y1, word, block_no, line_no, word_no)

    for word_info in words:
        word_rect = fitz.Rect(word_info[:4])
        word_text = word_info[4]

        for white_rect in white_rects:
            # Coverage ratio: intersection area over word area
            # (stricter than true IoU for small words under large rects)
            intersection = word_rect & white_rect
            if intersection.is_empty:
                continue

            intersection_area = intersection.width * intersection.height
            word_area = word_rect.width * word_rect.height

            if word_area > 0:
                coverage_ratio = intersection_area / word_area
                if coverage_ratio >= iou_threshold:
                    covered_words.append({
                        'text': word_text,
                        'bbox': tuple(word_rect),
                        'coverage': coverage_ratio
                    })
                    break  # Word is covered, no need to check other rects

    return covered_words
```
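
Note the metric is intersection over *word* area rather than true IoU: a small word fully under a large white rectangle scores 1.0 and is excluded, which is the intended behavior here. A usage sketch mirroring the integration code later in this document:

```python
covered = detect_whiteout_covered_text(page, iou_threshold=0.8)
for w in covered:
    logger.debug("Excluding covered word %r at %s (coverage %.0f%%)",
                 w['text'], w['bbox'], w['coverage'] * 100)
covered_bboxes = [fitz.Rect(w['bbox']) for w in covered]
```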

---

## Step 2: Layout Analysis (P1)

### 2.1 Column-aware Text Extraction
```python
def extract_with_reading_order(page: fitz.Page) -> List[dict]:
    """
    Extract text blocks with correct reading order.
    PyMuPDF's sort=True handles two-column layouts automatically.
    """
    # CRITICAL: sort=True enables column-aware sorting
    blocks = page.get_text("dict", sort=True)['blocks']
    return blocks
```

### 2.2 Element Classification
```python
def classify_element(block: dict, page_rect: fitz.Rect) -> str:
    """
    Classify a text block by position and font size.

    Returns: 'title', 'body', 'header', 'footer', 'page_number', or 'image'
    """
    if 'lines' not in block:
        return 'image'

    bbox = fitz.Rect(block['bbox'])
    page_height = page_rect.height
    page_width = page_rect.width

    # Relative position (0.0 = top, 1.0 = bottom)
    y_rel = bbox.y0 / page_height

    # Get average font size
    font_sizes = []
    for line in block.get('lines', []):
        for span in line.get('spans', []):
            font_sizes.append(span.get('size', 12))
    avg_font_size = sum(font_sizes) / len(font_sizes) if font_sizes else 12

    # Get text content for pattern matching
    text = ''.join(
        span.get('text', '')
        for line in block.get('lines', [])
        for span in line.get('spans', [])
    ).strip()

    # Classification rules

    # Page number: bottom 10% + numeric pattern (checked before the footer
    # rule so page numbers in the footer zone are still caught)
    if y_rel > 0.90 and _is_page_number(text):
        return 'page_number'

    # Header: top 5% of page
    if y_rel < 0.05:
        return 'header'

    # Footer: bottom 5% of page
    if y_rel > 0.95:
        return 'footer'

    # Title: large font (>14pt) or centered
    if avg_font_size > 14:
        return 'title'

    # Check if centered (for subtitles)
    x_center = (bbox.x0 + bbox.x1) / 2
    page_center = page_width / 2
    if abs(x_center - page_center) < page_width * 0.1 and len(text) < 100:
        if avg_font_size > 12:
            return 'title'

    return 'body'


def _is_page_number(text: str) -> bool:
    """Check if text is likely a page number."""
    text = text.strip()

    # Pure number
    if text.isdigit():
        return True

    # Common patterns: "Page 1", "- 1 -", "1/10", "第 1 頁", "第 1 页"
    patterns = [
        r'^page\s*\d+$',
        r'^-?\s*\d+\s*-?$',
        r'^\d+\s*/\s*\d+$',
        r'^第\s*\d+\s*頁$',
        r'^第\s*\d+\s*页$',
    ]

    for pattern in patterns:
        if re.match(pattern, text, re.IGNORECASE):
            return True

    return False
```
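
A quick sketch of how the classifier might be exercised on a page, for diagnostics only:

```python
from collections import Counter

blocks = page.get_text("dict", sort=True)['blocks']
type_counts = Counter(classify_element(b, page.rect) for b in blocks)
logger.debug("Element mix on page %d: %s", page.number, dict(type_counts))
```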

### 2.3 Element Filtering
```python
def filter_elements(blocks: List[dict], page_rect: fitz.Rect) -> List[dict]:
    """Filter out unwanted elements (page numbers, headers, footers)."""
    filtered = []

    for block in blocks:
        element_type = classify_element(block, page_rect)

        # Skip page numbers and optionally headers/footers
        if element_type == 'page_number':
            continue

        # Keep with classification metadata
        block['_element_type'] = element_type
        filtered.append(block)

    return filtered
```

---

## Step 3: Text Extraction (Enhanced)

### 3.1 Garble Detection
```python
def calculate_garble_rate(text: str) -> float:
    """
    Calculate the rate of garbled characters ((cid:xxxx) patterns and similar).

    Returns: float between 0.0 and 1.0
    """
    if not text:
        return 0.0

    # Count characters consumed by (cid:xxxx) patterns
    cid_pattern = r'\(cid:\d+\)'
    cid_matches = re.findall(cid_pattern, text)
    cid_char_count = sum(len(m) for m in cid_matches)

    # Count other garble indicators:
    # - Replacement character U+FFFD
    # - Private Use Area characters
    replacement_count = text.count('\ufffd')
    pua_count = sum(1 for c in text if 0xE000 <= ord(c) <= 0xF8FF)

    total_garble = cid_char_count + replacement_count + pua_count
    total_chars = len(text)

    return total_garble / total_chars if total_chars > 0 else 0.0
```
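
A worked example of the metric (values computed by hand): in `"(cid:123)(cid:456) hello"`, 18 of 24 characters sit inside `(cid:...)` patterns, so:

```python
>>> calculate_garble_rate("(cid:123)(cid:456) hello")
0.75
>>> calculate_garble_rate("hello world")
0.0
```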

### 3.2 Auto-fallback to OCR
```python
def should_fallback_to_ocr(page_text: str, garble_threshold: float = 0.1) -> bool:
    """
    Determine if a page should be processed with OCR instead of direct extraction.

    Args:
        page_text: Extracted text from the page
        garble_threshold: Maximum acceptable garble rate (default 10%)

    Returns:
        True if OCR fallback is recommended
    """
    garble_rate = calculate_garble_rate(page_text)

    if garble_rate > garble_threshold:
        logger.warning(
            f"High garble rate detected: {garble_rate:.1%}. "
            f"Recommending OCR fallback."
        )
        return True

    return False
```
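
The task list mentions a `get_pages_needing_ocr()` interface for callers; a minimal sketch of what it might look like (the shipped signature may differ):

```python
def get_pages_needing_ocr(doc: fitz.Document, threshold: float = 0.1) -> List[int]:
    """Return 0-based numbers of pages whose direct extraction looks too garbled."""
    return [
        page.number
        for page in doc
        if should_fallback_to_ocr(page.get_text(), garble_threshold=threshold)
    ]
```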

---

## Integration Point

### Modified DirectExtractionEngine._extract_page()

```python
def _extract_page(self, page: fitz.Page, page_num: int, ...) -> Page:
    """Extract content from a single page with the preprocessing pipeline."""
    page_metadata = {}

    # === Step 1: Object-level Cleaning ===

    # 1.1 Sanitize content stream
    page.clean_contents(sanitize=True)

    # 1.2 Detect white-out covered text
    covered_words = detect_whiteout_covered_text(page, iou_threshold=0.8)
    covered_bboxes = [fitz.Rect(w['bbox']) for w in covered_words]

    # === Step 2: Layout Analysis ===

    # 2.1 Extract with column-aware sorting
    blocks = page.get_text("dict", sort=True)['blocks']

    # 2.2 & 2.3 Classify and filter
    filtered_blocks = filter_elements(blocks, page.rect)

    # === Step 3: Text Extraction ===

    elements = []
    full_text = ""

    for block in filtered_blocks:
        # Skip if block overlaps with covered areas
        block_rect = fitz.Rect(block['bbox'])
        if any(block_rect.intersects(cr) for cr in covered_bboxes):
            continue

        # Extract text with bbox preserved
        element = self._block_to_element(block, page_num)
        if element:
            elements.append(element)
            full_text += element.get_text() + " "

    # 3.2 Check garble rate
    if should_fallback_to_ocr(full_text):
        # Mark page for OCR processing
        page_metadata['needs_ocr'] = True

    return Page(
        page_number=page_num,
        elements=elements,
        metadata=page_metadata
    )
```

---

## Configuration

```python
from dataclasses import dataclass

@dataclass
class PreprocessingConfig:
    """Configuration for the PDF preprocessing pipeline."""

    # Step 0: GS Distillation
    gs_enabled: bool = False              # Disabled by default
    gs_garble_threshold: float = 0.1      # Trigger on >10% garble
    gs_detect_duplicate_images: bool = True

    # Step 1: Object Cleaning
    sanitize_content: bool = True
    remove_hidden_layers: bool = True
    whiteout_detection: bool = True
    whiteout_iou_threshold: float = 0.8

    # Step 2: Layout Analysis
    column_aware_sort: bool = True        # Use sort=True
    filter_page_numbers: bool = True
    filter_headers: bool = False          # Keep headers by default
    filter_footers: bool = False          # Keep footers by default

    # Step 3: Text Extraction
    preserve_bbox: bool = True            # For debugging
    garble_detection: bool = True
    ocr_fallback_threshold: float = 0.1   # Fallback on >10% garble
```
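
Callers can override defaults per batch; a sketch using the fields above:

```python
config = PreprocessingConfig(
    gs_enabled=True,               # opt in to GS repair for this batch
    whiteout_iou_threshold=0.9,    # require stricter coverage before excluding text
    filter_footers=True,
)
```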

---

## Testing Strategy

1. **Unit Tests**
   - White-out detection with synthetic PDFs (see the sketch after this list)
   - Garble rate calculation
   - Element classification accuracy

2. **Integration Tests**
   - Two-column document reading order
   - Hidden layer removal
   - GS fallback trigger conditions

3. **Regression Tests**
   - Existing task outputs should not change for clean PDFs
   - Performance benchmarks (should add <100ms per page)
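
A minimal pytest sketch for the white-out unit test, assuming `detect_whiteout_covered_text` is importable from the engine module (the import path is hypothetical): type a word, paint an opaque white rectangle over it, and assert the word is flagged.

```python
import fitz

# Hypothetical import path; adjust to wherever the function lives:
# from app.services.direct_extraction_engine import detect_whiteout_covered_text

def test_whiteout_detection_on_synthetic_pdf():
    doc = fitz.open()                      # new, empty PDF
    page = doc.new_page()
    page.insert_text((72, 100), "secret")  # baseline at y=100
    # Paint an opaque white rectangle over the word ("correction tape")
    page.draw_rect(fitz.Rect(65, 85, 150, 105), color=(1, 1, 1), fill=(1, 1, 1))

    covered = detect_whiteout_covered_text(page, iou_threshold=0.8)

    assert any(w["text"] == "secret" for w in covered)
```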

---

# Change Proposal: PDF Preprocessing Pipeline

## Summary

Implement a multi-stage PDF preprocessing pipeline for Direct track extraction to improve layout accuracy, remove hidden/covered content, and ensure correct reading order.

## Problem Statement

Current Direct track extraction has several issues:
1. **Hidden content pollution**: OCG (Optional Content Groups) layers and "white-out" covered text leak into extraction
2. **Reading order chaos**: Two-column layouts get interleaved incorrectly
3. **Vector graphics interference**: Large decorative vector elements cover text content
4. **Corrupted PDF handling**: No fallback for structurally damaged PDFs with `(cid:xxxx)` garbled text

## Proposed Solution

Implement a 4-stage preprocessing pipeline:

```
Step 0: GS Distillation (Exception Handler - triggered on errors)
Step 1: Object-level Cleaning (P0 - Core)
Step 2: Layout Analysis (P1 - Rule-based with sort=True)
Step 3: Text Extraction (Existing, enhanced with garble detection)
```

## Key Features

1. **Smart Fallback**: GS distillation only triggers on `(cid:xxxx)` garble or MuPDF structural errors
2. **White-out Detection**: Coverage-based overlap detection (80% threshold) to remove covered text
3. **Column-aware Sorting**: Leverage PyMuPDF's `sort=True` for automatic two-column handling
4. **Garble Rate Detection**: Auto-switch to Paddle OCR when the garble rate exceeds the threshold

## Impact

- **Files Modified**: `backend/app/services/direct_extraction_engine.py`
- **New Dependencies**: None (Ghostscript optional, already available on most systems)
- **Risk Level**: Medium (core extraction logic changes)

## Success Criteria

- [ ] Hidden OCG content no longer appears in extraction
- [ ] White-out covered text is correctly filtered
- [ ] Two-column documents maintain correct reading order
- [ ] Corrupted PDFs gracefully fall back to GS repair or OCR

---

# Tasks: PDF Preprocessing Pipeline

## Phase 1: Object-level Cleaning (P0)

### Step 1.1: Content Sanitization
- [x] Add `page.clean_contents(sanitize=True)` to `_extract_page()`
- [x] Add error handling for malformed content streams
- [x] Add logging for sanitization actions

### Step 1.2: Hidden Layer (OCG) Removal
- [x] Implement `get_hidden_ocg_layers()` function
- [ ] Add OCG content filtering during extraction (deferred - needs test case)
- [x] Add configuration option `remove_hidden_layers`
- [x] Add logging for removed layers

### Step 1.3: White-out Detection
- [x] Implement `detect_whiteout_covered_text()` with IoU calculation
- [x] Add white rectangle detection from `page.get_drawings()`
- [x] Integrate covered text filtering into extraction
- [x] Add configuration option `whiteout_iou_threshold` (default 0.8)
- [x] Add logging for detected white-out regions

## Phase 2: Layout Analysis (P1)

### Step 2.1: Column-aware Sorting
- [x] Change `get_text()` calls to use `sort=True` parameter (already implemented)
- [x] Verify reading order improvement on test documents
- [ ] Add configuration option `column_aware_sort` (deferred - low priority)

### Step 2.2: Element Classification
- [ ] Implement `classify_element()` function (deferred - existing detection sufficient)
- [x] Add position-based classification (header/footer/body) - via existing `_detect_headers_footers()`
- [x] Add font-size-based classification (title detection) - via existing logic
- [x] Add page number pattern detection `_is_page_number()`
- [ ] Preserve classification in element metadata `_element_type` (deferred)

### Step 2.3: Element Filtering
- [x] Implement `filter_elements()` function - `_filter_page_numbers()`
- [x] Add configuration options for filtering (page_numbers, headers, footers)
- [x] Add logging for filtered elements

## Phase 3: Enhanced Extraction (P1)

### Step 3.1: Bbox Preservation
- [x] Ensure all extracted elements retain bbox coordinates (already implemented)
- [x] Add bbox to UnifiedDocument element metadata
- [x] Verify bbox accuracy in generated output

### Step 3.2: Garble Detection
- [x] Implement `calculate_garble_rate()` function
- [x] Detect `(cid:xxxx)` patterns
- [x] Detect replacement characters (U+FFFD)
- [x] Detect Private Use Area characters
- [x] Add garble rate to page metadata

### Step 3.3: OCR Fallback
- [x] Implement `should_fallback_to_ocr()` decision function
- [x] Add configuration option `ocr_fallback_threshold` (default 0.1)
- [x] Add `get_pages_needing_ocr()` interface for callers
- [x] Add `get_extraction_quality_report()` for quality metrics
- [x] Add logging for fallback decisions

## Phase 4: GS Distillation - Exception Handler (P2)

### Step 0: GS Repair (Optional)
- [x] Implement `should_trigger_gs_repair()` trigger detection
- [x] Implement `repair_pdf_with_gs()` function
- [x] Add `-dDetectDuplicateImages=true` option
- [x] Add temporary file handling for repaired PDF
- [x] Implement `is_ghostscript_available()` check
- [x] Add `extract_with_repair()` method
- [x] Add fallback to normal extraction if GS not available
- [x] Add logging for GS repair actions

## Testing

### Unit Tests
- [ ] Test white-out detection with synthetic PDF
- [x] Test garble rate calculation
- [ ] Test element classification accuracy
- [x] Test page number pattern detection

### Integration Tests
- [x] Test with demo_docs/edit.pdf (3 pages)
- [x] Test with demo_docs/edit2.pdf (1 page)
- [x] Test with demo_docs/edit3.pdf (2 pages)
- [x] Test quality report generation
- [x] Test GS availability check
- [x] Test end-to-end pipeline with real documents

### Regression Tests
- [x] Verify existing clean PDFs produce same output
- [ ] Performance benchmark (<100ms overhead per page)