egg
2025-12-04 18:00:37 +08:00
parent 9437387ef1
commit 8265be1741
22 changed files with 2672 additions and 196 deletions


@@ -0,0 +1,167 @@
## Context
The PDF generator currently uses layout preservation mode for all PDF output, placing text at original coordinates. This works for document reconstruction but:
1. Fails for translated content where text length differs significantly
2. May not provide the best reading experience for flowing documents
Two PDF generation modes are needed:
1. **Layout Preservation** (existing): Maintains original coordinates
2. **Reflow Layout** (new): Prioritizes readability with flowing content
## Goals / Non-Goals
**Goals:**
- Translated and non-translated documents can use reflow layout
- Both OCR and Direct tracks supported
- Proper reading order preserved using available data
- Consistent font sizes for readability
- Images and tables embedded inline
**Non-Goals:**
- Perfect visual matching with original document layout
- Complex multi-column reflow (reflow mode uses a simple single-column flow)
- Font style matching from original document
## Decisions
### Decision 1: Reading Order Strategy
| Track | Reading Order Source | Implementation |
|-------|---------------------|----------------|
| **OCR** | Explicit `reading_order` array in JSON | Use array indices to order elements |
| **Direct** | Implicit in element list order | Use list iteration order (PyMuPDF sort=True) |
**OCR Track - reading_order array:**
```json
{
  "pages": [{
    "reading_order": [0, 1, 2, 3, 6, 7, 8, ...],
    "elements": [...]
  }]
}
```
**Direct Track - implicit order:**
- PyMuPDF's `get_text("dict", sort=True)` provides spatial reading order
- Elements already sorted by extraction engine
- Optional: Enable `_sort_elements_for_reading_order()` for multi-column detection
### Decision 2: Separate API Endpoints
```
# Layout preservation (existing)
GET /api/v2/tasks/{task_id}/download/pdf
# Reflow layout (new)
GET /api/v2/tasks/{task_id}/download/pdf?format=reflow
# Translated PDF (reflow only)
POST /api/v2/translate/{task_id}/pdf?lang={lang}
```
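One way the optional `format` query value could be mapped to a generation mode is sketched below. The names `PdfMode` and `resolve_pdf_mode` are hypothetical and not taken from the codebase; only the `format=reflow` value and the default layout behavior come from the endpoint list above. Rejecting unknown values is an assumption.

```python
from enum import Enum
from typing import Optional


class PdfMode(Enum):
    """Hypothetical enum for the two generation modes."""
    LAYOUT = "layout"
    REFLOW = "reflow"


def resolve_pdf_mode(format_param: Optional[str]) -> PdfMode:
    """Map the optional `format` query value to a generation mode.

    A missing or empty value keeps the existing layout-preservation
    behavior. Treating unknown values as an error is an assumption.
    """
    if not format_param:
        return PdfMode.LAYOUT
    value = format_param.strip().lower()
    if value == "reflow":
        return PdfMode.REFLOW
    if value == "layout":
        return PdfMode.LAYOUT
    raise ValueError(f"unsupported PDF format: {format_param!r}")
```

Keeping the default at `LAYOUT` means existing clients that never send `format` see no change in behavior.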
### Decision 3: Unified Reflow Generation Method
```python
def generate_reflow_pdf(
    self,
    result_json_path: Path,
    output_path: Path,
    translation_json_path: Optional[Path] = None,  # None = no translation
    source_file_path: Optional[Path] = None,  # For embedded images
) -> bool:
    """
    Generate reflow layout PDF for either OCR or Direct track.
    Works with or without translation.
    """
```
### Decision 4: Reading Order Application
```python
from typing import List


def _get_elements_in_reading_order(self, page_data: dict) -> List[dict]:
    """Get elements sorted by reading order."""
    elements = page_data.get('elements', [])
    reading_order = page_data.get('reading_order')
    if reading_order:
        # OCR track: use explicit reading order, skipping out-of-range indices
        return [elements[idx] for idx in reading_order if 0 <= idx < len(elements)]
    # Direct track (or no reading_order): elements already in reading order
    return elements
```
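The Risks section mentions falling back to a spatial sort when `reading_order` is missing. A minimal sketch of such a fallback follows. It assumes each element carries a `bbox` of the form `[x0, y0, x1, y1]`, which is an assumption about the JSON schema, and `sort_elements_spatially` is a hypothetical helper name.

```python
from typing import List


def sort_elements_spatially(elements: List[dict]) -> List[dict]:
    """Order elements top-to-bottom, then left-to-right.

    Assumes each element has a `bbox` list `[x0, y0, x1, y1]`.
    Elements without a usable bbox keep their relative order at the end.
    """
    def key(item):
        index, element = item
        bbox = element.get("bbox")
        if isinstance(bbox, (list, tuple)) and len(bbox) >= 2:
            # Sort by y0 first, then x0; index breaks ties stably
            return (0, bbox[1], bbox[0], index)
        return (1, 0, 0, index)

    return [element for _, element in sorted(enumerate(elements), key=key)]
```

A plain (y, x) sort interleaves side-by-side columns, so it suits the single-column flow that reflow mode targets rather than complex layouts.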
### Decision 5: Consistent Typography
| Element Type | Font Size | Style |
|-------------|-----------|-------|
| Title/H1 | 18pt | Bold |
| H2 | 16pt | Bold |
| H3 | 14pt | Bold |
| Body text | 12pt | Normal|
| Table cell | 10pt | Normal|
| Caption | 10pt | Italic|
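
The table above could be held as a small lookup, sketched here with the standard library only. `TextStyle`, `REFLOW_STYLES`, and `style_for` are illustrative names, and the element-type keys are assumed rather than taken from the JSON schema; falling back to body text for unknown types is also an assumption.

```python
from dataclasses import dataclass


@dataclass(frozen=True)
class TextStyle:
    size_pt: int
    bold: bool = False
    italic: bool = False


# Sizes and styles copied from the table above; keys are assumed type names.
REFLOW_STYLES = {
    "title": TextStyle(18, bold=True),
    "h1": TextStyle(18, bold=True),
    "h2": TextStyle(16, bold=True),
    "h3": TextStyle(14, bold=True),
    "body": TextStyle(12),
    "table_cell": TextStyle(10),
    "caption": TextStyle(10, italic=True),
}


def style_for(element_type: str) -> TextStyle:
    """Return the reflow style, defaulting to body text for unknown types."""
    return REFLOW_STYLES.get(element_type.lower(), REFLOW_STYLES["body"])
```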
### Decision 6: Table Handling in Reflow
Tables use Platypus Table with auto-width columns:
```python
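from reportlab.lib import colors
from reportlab.platypus import Paragraph, Table, TableStyle


def _create_reflow_table(self, table_data, translations=None):
    data = []
    for row in table_data['rows']:
        row_data = []
        for cell in row['cells']:
            text = cell.get('text', '')
            if translations:
                # Use the translated text when one exists for this cell id
                text = translations.get(cell.get('id'), text)
            row_data.append(Paragraph(text, self.styles['TableCell']))
        data.append(row_data)
    table = Table(data)
    table.setStyle(TableStyle([
        ('GRID', (0, 0), (-1, -1), 0.5, colors.black),
        ('VALIGN', (0, 0), (-1, -1), 'TOP'),
        # TableStyle sets padding per side; a bare 'PADDING' command is not applied
        ('LEFTPADDING', (0, 0), (-1, -1), 6),
        ('RIGHTPADDING', (0, 0), (-1, -1), 6),
        ('TOPPADDING', (0, 0), (-1, -1), 6),
        ('BOTTOMPADDING', (0, 0), (-1, -1), 6),
    ]))
    return table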
```
### Decision 7: Image Embedding
```python
from reportlab.platypus import Image, Spacer


def _embed_image_reflow(self, element, max_width=450):
    img_path = self._resolve_image_path(element)
    if img_path and img_path.exists():
        img = Image(str(img_path))
        # Scale to fit page width, keeping the aspect ratio
        if img.drawWidth > max_width:
            ratio = max_width / img.drawWidth
            img.drawWidth = max_width
            img.drawHeight *= ratio
        return img
    # Missing image: return an empty spacer so the flow continues
    return Spacer(1, 0)
```
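The scaling arithmetic in the snippet above can be isolated into a pure function, which makes it easy to check without ReportLab. `scale_to_fit` is a hypothetical helper; the 450pt default mirrors the `max_width` default above.

```python
from typing import Tuple


def scale_to_fit(width: float, height: float, max_width: float = 450) -> Tuple[float, float]:
    """Shrink (width, height) to fit max_width, keeping the aspect ratio.

    Images that already fit are returned unchanged (no upscaling).
    """
    if width <= 0 or height <= 0:
        raise ValueError("width and height must be positive")
    if width <= max_width:
        return (width, height)
    ratio = max_width / width
    return (max_width, height * ratio)
```

For example, a 900x600 image is scaled by 0.5 to 450x300, while a 300x200 image is left as is.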
## Risks / Trade-offs
- **Risk**: OCR reading_order may not be accurate for complex layouts
  - **Mitigation**: Falls back to spatial sort if reading_order missing
- **Risk**: Direct track multi-column detection unused
  - **Mitigation**: PyMuPDF sort=True is generally reliable
- **Risk**: Loss of visual fidelity compared to original
  - **Mitigation**: This is acceptable; layout PDF still available
## Migration Plan
No migration needed - new functionality, existing behavior unchanged.
## Open Questions
None - design confirmed with user.


@@ -0,0 +1,41 @@
# Change: Reflow Layout PDF Export for All Tracks
## Why
When generating translated PDFs, text often doesn't fit within original bounding boxes due to language expansion/contraction differences. Additionally, users may want a readable flowing document format even without translation.
**Example from task c79df0ad-f9a6-4c04-8139-13eaef25fa83:**
- Original Chinese: "华天科技(宝鸡)有限公司设备版块报价单" (19 characters)
- Translated English: "Huatian Technology (Baoji) Co., Ltd. Equipment Division Quotation" (65+ characters)
- Same bounding box: 703×111 pixels
- Current result: Font reduced to minimum (3pt), text unreadable
## What Changes
- **NEW**: Add reflow layout PDF generation for both OCR and Direct tracks
- Preserve semantic structure (headings, tables, lists) in reflow mode
- Use consistent, readable font sizes (12pt body, 14-18pt headings)
- Embed images inline within flowing content
- **IMPORTANT**: Original layout preservation PDF generation remains unchanged
- Support both tracks with proper reading order:
  - **OCR track**: Use existing `reading_order` array from PP-StructureV3
  - **Direct track**: Use PyMuPDF's implicit order (with option for column detection)
- **FIX**: Remove outdated MADLAD-400 references from frontend (now uses Dify cloud translation)
## Download Options
| Scenario | Layout PDF | Reflow PDF |
|----------|------------|------------|
| **Without Translation** | Available | Available (NEW) |
| **With Translation** | - | Available (single option, unchanged) |
## Impact
- Affected specs: `specs/result-export/spec.md`
- Affected code:
  - `backend/app/services/pdf_generator_service.py` - add reflow generation method
  - `backend/app/routers/tasks.py` - add reflow PDF download endpoint
  - `backend/app/routers/translate.py` - use reflow mode for translated PDFs
  - `frontend/src/pages/TaskDetailPage.tsx`:
    - Add "Download Reflow PDF" button for original documents
    - Remove MADLAD-400 badge and outdated description text


@@ -0,0 +1,137 @@
## ADDED Requirements
### Requirement: Dual PDF Generation Modes
The system SHALL support two distinct PDF generation modes to serve different use cases for both OCR and Direct tracks.
#### Scenario: Download layout preservation PDF
- **WHEN** user requests PDF via `/api/v2/tasks/{task_id}/download/pdf`
- **THEN** PDF SHALL use layout preservation mode
- **AND** text positions SHALL match original document coordinates
- **AND** this option SHALL be available for both OCR and Direct tracks
- **AND** existing behavior SHALL remain unchanged
#### Scenario: Download reflow layout PDF without translation
- **WHEN** user requests PDF via `/api/v2/tasks/{task_id}/download/pdf?format=reflow`
- **THEN** PDF SHALL use reflow layout mode
- **AND** text SHALL flow naturally with consistent font sizes
- **AND** body text SHALL use approximately 12pt font size
- **AND** headings SHALL use larger font sizes (14-18pt)
- **AND** this option SHALL be available for both OCR and Direct tracks
#### Scenario: OCR track reading order in reflow mode
- **GIVEN** document processed via OCR track
- **WHEN** generating reflow PDF
- **THEN** system SHALL use explicit `reading_order` array from JSON
- **AND** elements SHALL appear in order specified by reading_order indices
- **AND** if `reading_order` is missing, system SHALL fall back to spatial sort (y, x)
#### Scenario: Direct track reading order in reflow mode
- **GIVEN** document processed via Direct track
- **WHEN** generating reflow PDF
- **THEN** system SHALL use implicit element order from extraction
- **AND** elements SHALL appear in list iteration order
- **AND** PyMuPDF's sort=True ordering SHALL be trusted
---
### Requirement: Reflow PDF Semantic Structure
The reflow PDF generation SHALL preserve document semantic structure.
#### Scenario: Headings in reflow mode
- **WHEN** original document contains headings (title, h1, h2, etc.)
- **THEN** headings SHALL be rendered with larger font sizes
- **AND** headings SHALL be visually distinguished from body text
- **AND** heading hierarchy SHALL be preserved
#### Scenario: Tables in reflow mode
- **WHEN** original document contains tables
- **THEN** tables SHALL render with visible cell borders
- **AND** column widths SHALL auto-adjust to content
- **AND** table content SHALL be fully visible
- **AND** tables SHALL use appropriate cell padding
#### Scenario: Images in reflow mode
- **WHEN** original document contains images
- **THEN** images SHALL be embedded inline in flowing content
- **AND** images SHALL be scaled to fit page width if necessary
- **AND** images SHALL maintain aspect ratio
#### Scenario: Lists in reflow mode
- **WHEN** original document contains numbered or bulleted lists
- **THEN** lists SHALL preserve their formatting
- **AND** list items SHALL flow naturally
---
## MODIFIED Requirements
### Requirement: Translated PDF Export API
The system SHALL expose an API endpoint for downloading translated documents as PDF files using reflow layout mode only.
#### Scenario: Download translated PDF via API
- **GIVEN** a task with completed translation
- **WHEN** POST request to `/api/v2/translate/{task_id}/pdf?lang={lang}`
- **THEN** system returns PDF file with translated content
- **AND** PDF SHALL use reflow layout mode (not layout preservation)
- **AND** Content-Type is `application/pdf`
- **AND** Content-Disposition suggests filename like `{task_id}_translated_{lang}.pdf`
#### Scenario: Translated PDF uses reflow layout
- **WHEN** user downloads translated PDF
- **THEN** the PDF SHALL use reflow layout mode
- **AND** text SHALL flow naturally with consistent font sizes
- **AND** body text SHALL use approximately 12pt font size
- **AND** headings SHALL use larger font sizes (14-18pt)
- **AND** content SHALL be readable without magnification
#### Scenario: Translated PDF for OCR track
- **GIVEN** document processed via OCR track with translation
- **WHEN** generating translated PDF
- **THEN** reading order SHALL follow `reading_order` array
- **AND** translated text SHALL replace original in correct positions
#### Scenario: Translated PDF for Direct track
- **GIVEN** document processed via Direct track with translation
- **WHEN** generating translated PDF
- **THEN** reading order SHALL follow implicit element order
- **AND** translated text SHALL replace original in correct positions
#### Scenario: Invalid language parameter
- **GIVEN** a task with translation only to English
- **WHEN** user requests PDF with `lang=ja` (Japanese)
- **THEN** system returns 404 Not Found
- **AND** response includes available languages in error message
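
A minimal sketch of the language check this scenario describes, written as a pure function so it does not depend on any web framework. `check_translation_language` is a hypothetical name and the message wording is illustrative only; the scenario requires just a 404 and a list of available languages.

```python
from typing import Iterable, Optional, Tuple


def check_translation_language(
    available: Iterable[str], requested: str
) -> Tuple[int, Optional[str]]:
    """Return (status_code, error_message) for a requested language.

    200 with no message when the translation exists; otherwise 404 with
    a message listing the available languages.
    """
    languages = sorted(set(available))
    if requested in languages:
        return 200, None
    listed = ", ".join(languages) if languages else "none"
    return 404, f"Translation '{requested}' not found. Available: {listed}"
```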
#### Scenario: Task not found
- **GIVEN** non-existent task_id
- **WHEN** user requests translated PDF
- **THEN** system returns 404 Not Found
---
### Requirement: Frontend Download Options
The frontend SHALL provide appropriate download options based on translation status.
#### Scenario: Download options without translation
- **GIVEN** a task without any completed translations
- **WHEN** user views TaskDetailPage
- **THEN** page SHALL display "Download Layout PDF" button (original coordinates)
- **AND** page SHALL display "Download Reflow PDF" button (flowing layout)
- **AND** both options SHALL be available in the download section
#### Scenario: Download options with translation
- **GIVEN** a task with completed translation
- **WHEN** user views TaskDetailPage
- **THEN** page SHALL display "Download Translated PDF" button for each language
- **AND** translated PDF button SHALL remain as single option (no Layout/Reflow choice)
- **AND** translated PDF SHALL automatically use reflow layout
#### Scenario: Remove outdated MADLAD-400 references
- **WHEN** displaying translation section
- **THEN** page SHALL NOT display "MADLAD-400" badge
- **AND** description text SHALL reflect cloud translation service (Dify)
- **AND** description SHALL NOT mention local model loading time


@@ -0,0 +1,30 @@
## 1. Backend Implementation
- [x] 1.1 Create `generate_reflow_pdf()` method in pdf_generator_service.py
- [x] 1.2 Implement `_get_elements_in_reading_order()` for both tracks
- [x] 1.3 Implement reflow text rendering with consistent font sizes
- [x] 1.4 Implement table rendering in reflow mode (Platypus Table)
- [x] 1.5 Implement inline image embedding
- [x] 1.6 Add `format=reflow` query parameter to tasks download endpoint
- [x] 1.7 Update `generate_translated_pdf()` to use reflow mode
## 2. Frontend Implementation
- [x] 2.1 Add "Download Reflow PDF" button for original documents
- [x] 2.2 Update download logic to support format parameter
- [x] 2.3 Remove MADLAD-400 badge (line 545)
- [x] 2.4 Update translation description text to reflect Dify cloud service (line 652)
## 3. Testing
- [x] 3.1 Test OCR track reflow PDF (with reading_order) - Basic smoke test passed
- [ ] 3.2 Test Direct track reflow PDF (implicit order) - No test data available
- [x] 3.3 Test translated PDF (reflow mode) - Basic smoke test passed
- [x] 3.4 Test documents with tables - SUCCESS (62294 bytes, 2 tables)
- [x] 3.5 Test documents with images - SUCCESS (embedded img_in_table)
- [x] 3.6 Test multi-page documents - SUCCESS (11451 bytes, 3 pages)
- [x] 3.7 Verify layout PDF still works correctly - SUCCESS (104543 bytes)
## 4. Documentation
- [x] 4.1 Update spec with reflow layout requirements


@@ -0,0 +1,458 @@
# Design: PDF Preprocessing Pipeline
## Architecture Overview
```
┌─────────────────────────────────────────────────────────────────────────────┐
│ DIRECT Track PDF Processing Pipeline │
├─────────────────────────────────────────────────────────────────────────────┤
│ │
│ Input PDF │
│ │ │
│ ▼ │
│ ┌─────────────────────────────────────────────────────────────────────┐ │
│ │ Step 0: GS Distillation (Exception Handler) │ │
│ │ ─────────────────────────────────────────────────────────────────── │ │
│ │ Trigger: (cid:xxxx) garble detected OR mupdf structural errors │ │
│ │ Action: gs -sDEVICE=pdfwrite -dDetectDuplicateImages=true │ │
│ │ Status: DISABLED by default, auto-triggered on errors │ │
│ └─────────────────────────────────────────────────────────────────────┘ │
│ │ │
│ ▼ │
│ ┌─────────────────────────────────────────────────────────────────────┐ │
│ │ Step 1: Object-level Cleaning (P0 - Core) │ │
│ │ ─────────────────────────────────────────────────────────────────── │ │
│ │ 1.1 clean_contents(sanitize=True) - Fix malformed content stream │ │
│ │ 1.2 Remove hidden OCG layers │ │
│ │ 1.3 White-out detection & removal (IoU >= 80%) │ │
│ └─────────────────────────────────────────────────────────────────────┘ │
│ │ │
│ ▼ │
│ ┌─────────────────────────────────────────────────────────────────────┐ │
│ │ Step 2: Layout Analysis (P1 - Rule-based) │ │
│ │ ─────────────────────────────────────────────────────────────────── │ │
│ │ 2.1 get_text("blocks", sort=True) - Column-aware sorting │ │
│ │ 2.2 Classify elements (title/body/header/footer/page_number) │ │
│ │ 2.3 Filter unwanted elements (page numbers, decorations) │ │
│ └─────────────────────────────────────────────────────────────────────┘ │
│ │ │
│ ▼ │
│ ┌─────────────────────────────────────────────────────────────────────┐ │
│ │ Step 3: Text Extraction (Enhanced) │ │
│ │ ─────────────────────────────────────────────────────────────────── │ │
│ │ 3.1 Extract text with bbox coordinates preserved │ │
│ │ 3.2 Garble rate detection (cid:xxxx count / total chars) │ │
│ │ 3.3 Auto-fallback: garble_rate > 10% → trigger Paddle OCR │ │
│ └─────────────────────────────────────────────────────────────────────┘ │
│ │ │
│ ▼ │
│ UnifiedDocument (with bbox for debugging) │
│ │
└─────────────────────────────────────────────────────────────────────────────┘
```
---
## Step 0: GS Distillation (Exception Handler)
### Purpose
Repair structurally damaged PDFs that PyMuPDF cannot parse correctly.
### Trigger Conditions
```python
import re
from typing import List


def should_trigger_gs_repair(page_text: str, mupdf_warnings: List[str]) -> bool:
    # Condition 1: High garble rate (cid:xxxx patterns)
    # Count garbled characters rather than matches, so the ratio is
    # comparable with calculate_garble_rate() in Step 3.
    cid_pattern = r'\(cid:\d+\)'
    cid_chars = sum(len(m) for m in re.findall(cid_pattern, page_text))
    garble_rate = cid_chars / max(len(page_text), 1)
    if garble_rate > 0.1:  # >10% garbled
        return True
    # Condition 2: Severe structural errors
    severe_errors = ['error', 'invalid', 'corrupt', 'damaged']
    for warning in mupdf_warnings:
        if any(err in warning.lower() for err in severe_errors):
            return True
    return False
```
### GS Command
```bash
gs -dNOPAUSE -dBATCH -dSAFER \
-sDEVICE=pdfwrite \
-dPDFSETTINGS=/prepress \
-dDetectDuplicateImages=true \
-sOutputFile=repaired.pdf \
input.pdf
```
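The command above could be wrapped for use from Python roughly as follows. This is a sketch under stated assumptions: the function signatures, the 60-second timeout, and the choice to return `False` when Ghostscript is missing are not taken from the codebase, even though the task list names a `repair_pdf_with_gs()` function.

```python
import shutil
import subprocess
from pathlib import Path
from typing import List


def build_gs_command(input_pdf: Path, output_pdf: Path, gs_binary: str = "gs") -> List[str]:
    """Build the argument list matching the command above."""
    return [
        gs_binary,
        "-dNOPAUSE",
        "-dBATCH",
        "-dSAFER",
        "-sDEVICE=pdfwrite",
        "-dPDFSETTINGS=/prepress",
        "-dDetectDuplicateImages=true",
        f"-sOutputFile={output_pdf}",
        str(input_pdf),
    ]


def repair_pdf_with_gs(input_pdf: Path, output_pdf: Path, timeout_s: int = 60) -> bool:
    """Run Ghostscript; return False if it is missing, fails, or times out."""
    if shutil.which("gs") is None:
        return False
    try:
        result = subprocess.run(
            build_gs_command(input_pdf, output_pdf),
            capture_output=True,
            timeout=timeout_s,  # assumed limit, not from the source
            check=False,
        )
    except (subprocess.TimeoutExpired, OSError):
        return False
    return result.returncode == 0 and output_pdf.exists()
```

Passing the arguments as a list avoids shell quoting issues with file names that contain spaces.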
### Implementation Notes
- **Default**: DISABLED
- **Execution**: Only when triggered by error detection
- **Fallback**: If GS also fails, route to Paddle OCR track
---
## Step 1: Object-level Cleaning (P0)
### 1.1 Content Stream Sanitization
```python
def sanitize_page(page: fitz.Page) -> None:
    """Fix malformed PDF content stream."""
    page.clean_contents(sanitize=True)
```
### 1.2 Hidden Layer (OCG) Removal
```python
def remove_hidden_layers(doc: fitz.Document) -> List[str]:
    """Collect names of Optional Content Groups that are hidden by default.

    Only the layer names are gathered here; the content itself is
    filtered later, during extraction.
    """
    removed_layers = []
    ocgs = doc.get_ocgs()  # Get all OCG definitions, keyed by xref
    for ocg_xref, ocg_info in ocgs.items():
        # Check if layer is hidden by default
        if ocg_info.get('on') is False:
            removed_layers.append(ocg_info.get('name', f'OCG_{ocg_xref}'))
            # Mark for removal during extraction
    return removed_layers
```
### 1.3 White-out Detection (Core Algorithm)
```python
def detect_whiteout_covered_text(page: fitz.Page, iou_threshold: float = 0.8) -> List[dict]:
    """
    Detect text covered by white rectangles ("white-out" / "correction tape" effect).
    Returns list of text words that should be excluded from extraction.
    """
    covered_words = []
    # Get all white-filled rectangles
    drawings = page.get_drawings()
    white_rects = []
    for d in drawings:
        # Check for white fill (RGB all 1.0)
        fill_color = d.get('fill')
        if fill_color and fill_color == (1, 1, 1):
            rect = d.get('rect')
            if rect:
                white_rects.append(fitz.Rect(rect))
    if not white_rects:
        return covered_words
    # Get all text words with bounding boxes:
    # (x0, y0, x1, y1, word, block_no, line_no, word_no)
    words = page.get_text("words")
    for word_info in words:
        word_rect = fitz.Rect(word_info[:4])
        word_text = word_info[4]
        for white_rect in white_rects:
            # Share of the word's area under the rectangle. This is a
            # coverage ratio, not a true IoU, despite the parameter name.
            intersection = word_rect & white_rect  # Intersection
            if intersection.is_empty:
                continue
            intersection_area = intersection.width * intersection.height
            word_area = word_rect.width * word_rect.height
            if word_area > 0:
                coverage_ratio = intersection_area / word_area
                if coverage_ratio >= iou_threshold:
                    covered_words.append({
                        'text': word_text,
                        'bbox': tuple(word_rect),
                        'coverage': coverage_ratio
                    })
                    break  # Word is covered, no need to check other rects
    return covered_words
```
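The check above divides the intersection by the word's own area, which is a coverage ratio and not a true intersection-over-union. The sketch below uses plain `(x0, y0, x1, y1)` tuples instead of `fitz.Rect` to show how the two measures differ; all helper names are illustrative.

```python
from typing import Tuple

Box = Tuple[float, float, float, float]  # (x0, y0, x1, y1)


def area(box: Box) -> float:
    """Area of a box; degenerate or inverted boxes count as zero."""
    return max(0.0, box[2] - box[0]) * max(0.0, box[3] - box[1])


def intersection_area(a: Box, b: Box) -> float:
    overlap = (max(a[0], b[0]), max(a[1], b[1]), min(a[2], b[2]), min(a[3], b[3]))
    return area(overlap)


def coverage(word: Box, cover: Box) -> float:
    """Fraction of the word's area hidden by the covering rectangle."""
    word_area = area(word)
    return intersection_area(word, cover) / word_area if word_area else 0.0


def iou(a: Box, b: Box) -> float:
    """Intersection over union of two boxes."""
    inter = intersection_area(a, b)
    union = area(a) + area(b) - inter
    return inter / union if union else 0.0


word = (10.0, 10.0, 50.0, 20.0)        # 40 x 10 word box
white_rect = (0.0, 0.0, 200.0, 100.0)  # large white rectangle over it
word_coverage = coverage(word, white_rect)
word_iou = iou(word, white_rect)
```

Here the word is fully hidden, so its coverage is 1.0, yet its IoU with the much larger rectangle is only 0.02. A 0.8 threshold on true IoU would miss this case, which is why coverage is the more useful measure for white-out detection.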
---
## Step 2: Layout Analysis (P1)
### 2.1 Column-aware Text Extraction
```python
def extract_with_reading_order(page: fitz.Page) -> List[dict]:
    """
    Extract text blocks in reading order.
    PyMuPDF's sort=True orders blocks by vertical, then horizontal position.
    The design relies on this for two-column layouts, so the resulting
    order should be verified on multi-column test documents.
    """
    # CRITICAL: sort=True enables position-based sorting
    blocks = page.get_text("dict", sort=True)['blocks']
    return blocks
```
### 2.2 Element Classification
```python
def classify_element(block: dict, page_rect: fitz.Rect) -> str:
    """
    Classify text block by position and font size.
    Returns: 'title', 'body', 'header', 'footer', 'page_number', or 'image'
    """
    if 'lines' not in block:
        return 'image'
    bbox = fitz.Rect(block['bbox'])
    page_height = page_rect.height
    page_width = page_rect.width
    # Relative position (0.0 = top, 1.0 = bottom)
    y_rel = bbox.y0 / page_height
    # Get average font size
    font_sizes = []
    for line in block.get('lines', []):
        for span in line.get('spans', []):
            font_sizes.append(span.get('size', 12))
    avg_font_size = sum(font_sizes) / len(font_sizes) if font_sizes else 12
    # Get text content for pattern matching
    text = ''.join(
        span.get('text', '')
        for line in block.get('lines', [])
        for span in line.get('spans', [])
    ).strip()
    # Classification rules
    # Header: top 5% of page
    if y_rel < 0.05:
        return 'header'
    # Page number: bottom 10% + numeric pattern.
    # Checked before the footer rule, which would otherwise shadow it
    # for page numbers in the bottom 5%.
    if y_rel > 0.90 and _is_page_number(text):
        return 'page_number'
    # Footer: bottom 5% of page
    if y_rel > 0.95:
        return 'footer'
    # Title: large font (>14pt) or centered
    if avg_font_size > 14:
        return 'title'
    # Check if centered (for subtitles)
    x_center = (bbox.x0 + bbox.x1) / 2
    page_center = page_width / 2
    if abs(x_center - page_center) < page_width * 0.1 and len(text) < 100:
        if avg_font_size > 12:
            return 'title'
    return 'body'


def _is_page_number(text: str) -> bool:
    """Check if text is likely a page number."""
    text = text.strip()
    # Pure number
    if text.isdigit():
        return True
    # Common patterns: "Page 1", "- 1 -", "1/10"
    patterns = [
        r'^page\s*\d+$',
        r'^-?\s*\d+\s*-?$',
        r'^\d+\s*/\s*\d+$',
        r'^第\s*\d+\s*頁$',
        r'^第\s*\d+\s*页$',
    ]
    for pattern in patterns:
        if re.match(pattern, text, re.IGNORECASE):
            return True
    return False
```
### 2.3 Element Filtering
```python
def filter_elements(blocks: List[dict], page_rect: fitz.Rect) -> List[dict]:
    """Filter out unwanted elements (page numbers, headers, footers)."""
    filtered = []
    for block in blocks:
        element_type = classify_element(block, page_rect)
        # Skip page numbers and optionally headers/footers
        if element_type == 'page_number':
            continue
        # Keep with classification metadata
        block['_element_type'] = element_type
        filtered.append(block)
    return filtered
```
---
## Step 3: Text Extraction (Enhanced)
### 3.1 Garble Detection
```python
import re


def calculate_garble_rate(text: str) -> float:
    """
    Calculate the rate of garbled characters (cid:xxxx patterns).
    Returns: float between 0.0 and 1.0
    """
    if not text:
        return 0.0
    # Count characters inside (cid:xxxx) patterns
    cid_pattern = r'\(cid:\d+\)'
    cid_matches = re.findall(cid_pattern, text)
    cid_char_count = sum(len(m) for m in cid_matches)
    # Count other garble indicators
    # - Replacement character U+FFFD
    # - Private Use Area characters
    replacement_count = text.count('\ufffd')
    pua_count = sum(1 for c in text if 0xE000 <= ord(c) <= 0xF8FF)
    total_garble = cid_char_count + replacement_count + pua_count
    total_chars = len(text)
    return total_garble / total_chars if total_chars > 0 else 0.0
```
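To make the arithmetic concrete, the sketch below breaks a sample string into the same three garble categories and computes the resulting rate. `garble_breakdown` is a hypothetical debugging helper and the sample text is invented for illustration.

```python
import re
from typing import Dict

CID_PATTERN = re.compile(r"\(cid:\d+\)")


def garble_breakdown(text: str) -> Dict[str, int]:
    """Count characters in each garble category used above."""
    return {
        "cid_chars": sum(len(m) for m in CID_PATTERN.findall(text)),
        "replacement": text.count("\ufffd"),
        "private_use": sum(1 for c in text if 0xE000 <= ord(c) <= 0xF8FF),
        "total": len(text),
    }


# Two cid patterns (8 and 9 characters) plus one replacement character
sample = "Total (cid:12)(cid:345) \ufffd due"
counts = garble_breakdown(sample)
garbled = counts["cid_chars"] + counts["replacement"] + counts["private_use"]
rate = garbled / counts["total"] if counts["total"] else 0.0
```

The sample has 29 characters, of which 17 belong to cid patterns and 1 is a replacement character, so the rate is 18/29 (about 62%), well above the 10% threshold.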
### 3.2 Auto-fallback to OCR
```python
import logging

logger = logging.getLogger(__name__)


def should_fallback_to_ocr(page_text: str, garble_threshold: float = 0.1) -> bool:
    """
    Determine if page should be processed with OCR instead of direct extraction.

    Args:
        page_text: Extracted text from page
        garble_threshold: Maximum acceptable garble rate (default 10%)

    Returns:
        True if OCR fallback is recommended
    """
    garble_rate = calculate_garble_rate(page_text)
    if garble_rate > garble_threshold:
        logger.warning(
            f"High garble rate detected: {garble_rate:.1%}. "
            "Recommending OCR fallback."
        )
        return True
    return False
```
---
## Integration Point
### Modified DirectExtractionEngine._extract_page()
```python
def _extract_page(self, page: fitz.Page, page_num: int, ...) -> Page:
    """Extract content from a single page with preprocessing pipeline."""
    # Assumption: page_metadata starts empty; the snippet never defines it
    page_metadata = {}
    # === Step 1: Object-level Cleaning ===
    # 1.1 Sanitize content stream
    page.clean_contents(sanitize=True)
    # 1.2 Detect white-out covered text
    covered_words = detect_whiteout_covered_text(page, iou_threshold=0.8)
    covered_bboxes = [fitz.Rect(w['bbox']) for w in covered_words]
    # === Step 2: Layout Analysis ===
    # 2.1 Extract with column-aware sorting
    blocks = page.get_text("dict", sort=True)['blocks']
    # 2.2 & 2.3 Classify and filter
    filtered_blocks = filter_elements(blocks, page.rect)
    # === Step 3: Text Extraction ===
    elements = []
    full_text = ""
    for block in filtered_blocks:
        # Skip if block overlaps with covered areas
        block_rect = fitz.Rect(block['bbox'])
        if any(block_rect.intersects(cr) for cr in covered_bboxes):
            continue
        # Extract text with bbox preserved
        element = self._block_to_element(block, page_num)
        if element:
            elements.append(element)
            full_text += element.get_text() + " "
    # 3.2 Check garble rate
    if should_fallback_to_ocr(full_text):
        # Mark page for OCR processing
        page_metadata['needs_ocr'] = True
    return Page(
        page_number=page_num,
        elements=elements,
        metadata=page_metadata
    )
```
---
## Configuration
```python
from dataclasses import dataclass


@dataclass
class PreprocessingConfig:
    """Configuration for PDF preprocessing pipeline."""
    # Step 0: GS Distillation
    gs_enabled: bool = False  # Disabled by default
    gs_garble_threshold: float = 0.1  # Trigger on >10% garble
    gs_detect_duplicate_images: bool = True
    # Step 1: Object Cleaning
    sanitize_content: bool = True
    remove_hidden_layers: bool = True
    whiteout_detection: bool = True
    whiteout_iou_threshold: float = 0.8
    # Step 2: Layout Analysis
    column_aware_sort: bool = True  # Use sort=True
    filter_page_numbers: bool = True
    filter_headers: bool = False  # Keep headers by default
    filter_footers: bool = False  # Keep footers by default
    # Step 3: Text Extraction
    preserve_bbox: bool = True  # For debugging
    garble_detection: bool = True
    ocr_fallback_threshold: float = 0.1  # Fallback on >10% garble
```
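The three filter flags above could drive element filtering along these lines. This is a sketch, not the implemented behavior: `FilterFlags` is a hypothetical standalone subset of the configuration, and `should_keep` is an illustrative helper name.

```python
from dataclasses import dataclass


@dataclass
class FilterFlags:
    """Subset of the filtering options above (hypothetical standalone class)."""
    filter_page_numbers: bool = True
    filter_headers: bool = False
    filter_footers: bool = False


def should_keep(element_type: str, flags: FilterFlags) -> bool:
    """Decide whether a classified element survives filtering."""
    dropped = {
        "page_number": flags.filter_page_numbers,
        "header": flags.filter_headers,
        "footer": flags.filter_footers,
    }
    # Types without a flag (title, body, image) are always kept
    return not dropped.get(element_type, False)
```

With the defaults, only page numbers are dropped; headers and footers are removed only when their flags are switched on.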
---
## Testing Strategy
1. **Unit Tests**
   - White-out detection with synthetic PDFs
   - Garble rate calculation
   - Element classification accuracy
2. **Integration Tests**
   - Two-column document reading order
   - Hidden layer removal
   - GS fallback trigger conditions
3. **Regression Tests**
   - Existing task outputs should not change for clean PDFs
   - Performance benchmarks (should add <100ms per page)


@@ -0,0 +1,44 @@
# Change Proposal: PDF Preprocessing Pipeline
## Summary
Implement a multi-stage PDF preprocessing pipeline for Direct track extraction to improve layout accuracy, remove hidden/covered content, and ensure correct reading order.
## Problem Statement
Current Direct track extraction has several issues:
1. **Hidden content pollution**: OCG (Optional Content Groups) layers and "white-out" covered text leak into extraction
2. **Reading order chaos**: Two-column layouts get interleaved incorrectly
3. **Vector graphics interference**: Large decorative vector elements cover text content
4. **Corrupted PDF handling**: No fallback for structurally damaged PDFs with `(cid:xxxx)` garbled text
## Proposed Solution
Implement a 4-stage preprocessing pipeline:
```
Step 0: GS Distillation (Exception Handler - triggered on errors)
Step 1: Object-level Cleaning (P0 - Core)
Step 2: Layout Analysis (P1 - Rule-based with sort=True)
Step 3: Text Extraction (Existing, enhanced with garble detection)
```
## Key Features
1. **Smart Fallback**: GS distillation only triggers on `(cid:xxxx)` garble or mupdf structural errors
2. **White-out Detection**: IoU-based overlap detection (80% threshold) to remove covered text
3. **Column-aware Sorting**: Leverage PyMuPDF's `sort=True` for automatic two-column handling
4. **Garble Rate Detection**: Auto-switch to Paddle OCR when garble rate exceeds threshold
## Impact
- **Files Modified**: `backend/app/services/direct_extraction_engine.py`
- **New Dependencies**: None (Ghostscript optional, already available on most systems)
- **Risk Level**: Medium (core extraction logic changes)
## Success Criteria
- [ ] Hidden OCG content no longer appears in extraction
- [ ] White-out covered text is correctly filtered
- [ ] Two-column documents maintain correct reading order
- [ ] Corrupted PDFs gracefully fall back to GS repair or OCR


@@ -0,0 +1,93 @@
# Tasks: PDF Preprocessing Pipeline
## Phase 1: Object-level Cleaning (P0)
### Step 1.1: Content Sanitization
- [x] Add `page.clean_contents(sanitize=True)` to `_extract_page()`
- [x] Add error handling for malformed content streams
- [x] Add logging for sanitization actions
### Step 1.2: Hidden Layer (OCG) Removal
- [x] Implement `get_hidden_ocg_layers()` function
- [ ] Add OCG content filtering during extraction (deferred - needs test case)
- [x] Add configuration option `remove_hidden_layers`
- [x] Add logging for removed layers
### Step 1.3: White-out Detection
- [x] Implement `detect_whiteout_covered_text()` with IoU calculation
- [x] Add white rectangle detection from `page.get_drawings()`
- [x] Integrate covered text filtering into extraction
- [x] Add configuration option `whiteout_iou_threshold` (default 0.8)
- [x] Add logging for detected white-out regions
## Phase 2: Layout Analysis (P1)
### Step 2.1: Column-aware Sorting
- [x] Change `get_text()` calls to use `sort=True` parameter (already implemented)
- [x] Verify reading order improvement on test documents
- [ ] Add configuration option `column_aware_sort` (deferred - low priority)
### Step 2.2: Element Classification
- [ ] Implement `classify_element()` function (deferred - existing detection sufficient)
- [x] Add position-based classification (header/footer/body) - via existing `_detect_headers_footers()`
- [x] Add font-size-based classification (title detection) - via existing logic
- [x] Add page number pattern detection `_is_page_number()`
- [ ] Preserve classification in element metadata `_element_type` (deferred)
### Step 2.3: Element Filtering
- [x] Implement `filter_elements()` function - `_filter_page_numbers()`
- [x] Add configuration options for filtering (page_numbers, headers, footers)
- [x] Add logging for filtered elements
## Phase 3: Enhanced Extraction (P1)
### Step 3.1: Bbox Preservation
- [x] Ensure all extracted elements retain bbox coordinates (already implemented)
- [x] Add bbox to UnifiedDocument element metadata
- [x] Verify bbox accuracy in generated output
### Step 3.2: Garble Detection
- [x] Implement `calculate_garble_rate()` function
- [x] Detect `(cid:xxxx)` patterns
- [x] Detect replacement characters (U+FFFD)
- [x] Detect Private Use Area characters
- [x] Add garble rate to page metadata
### Step 3.3: OCR Fallback
- [x] Implement `should_fallback_to_ocr()` decision function
- [x] Add configuration option `ocr_fallback_threshold` (default 0.1)
- [x] Add `get_pages_needing_ocr()` interface for callers
- [x] Add `get_extraction_quality_report()` for quality metrics
- [x] Add logging for fallback decisions
## Phase 4: GS Distillation - Exception Handler (P2)
### Step 0: GS Repair (Optional)
- [x] Implement `should_trigger_gs_repair()` trigger detection
- [x] Implement `repair_pdf_with_gs()` function
- [x] Add `-dDetectDuplicateImages=true` option
- [x] Add temporary file handling for repaired PDF
- [x] Implement `is_ghostscript_available()` check
- [x] Add `extract_with_repair()` method
- [x] Add fallback to normal extraction if GS not available
- [x] Add logging for GS repair actions
## Testing
### Unit Tests
- [ ] Test white-out detection with synthetic PDF
- [x] Test garble rate calculation
- [ ] Test element classification accuracy
- [x] Test page number pattern detection
### Integration Tests
- [x] Test with demo_docs/edit.pdf (3 pages)
- [x] Test with demo_docs/edit2.pdf (1 page)
- [x] Test with demo_docs/edit3.pdf (2 pages)
- [x] Test quality report generation
- [x] Test GS availability check
- [x] Test end-to-end pipeline with real documents
### Regression Tests
- [x] Verify existing clean PDFs produce same output
- [ ] Performance benchmark (<100ms overhead per page)