test

2025-12-04 18:00:37 +08:00
parent 9437387ef1
commit 8265be1741
22 changed files with 2672 additions and 196 deletions
--- a/openspec/changes/archive/2025-12-04-improve-translated-text-fitting/design.md
+++ b/openspec/changes/archive/2025-12-04-improve-translated-text-fitting/design.md
@@ -0,0 +1,167 @@
+## Context
+
+The PDF generator currently uses layout preservation mode for all PDF output, placing text at original coordinates. This works for document reconstruction but:
+1. Fails for translated content where text length differs significantly
+2. May not provide the best reading experience for flowing documents
+
+Two PDF generation modes are needed:
+1. **Layout Preservation** (existing): Maintains original coordinates
+2. **Reflow Layout** (new): Prioritizes readability with flowing content
+
+## Goals / Non-Goals
+
+**Goals:**
+- Translated and non-translated documents can use reflow layout
+- Both OCR and Direct tracks supported
+- Proper reading order preserved using available data
+- Consistent font sizes for readability
+- Images and tables embedded inline
+
+**Non-Goals:**
+- Perfect visual matching with original document layout
+- Complex multi-column reflow (reflow mode uses a simple single-column flow)
+- Font style matching from original document
+
+## Decisions
+
+### Decision 1: Reading Order Strategy
+
+| Track | Reading Order Source | Implementation |
+|-------|---------------------|----------------|
+| **OCR** | Explicit `reading_order` array in JSON | Use array indices to order elements |
+| **Direct** | Implicit in element list order | Use list iteration order (PyMuPDF sort=True) |
+
+**OCR Track - reading_order array:**
+```json
+{
+  "pages": [{
+    "reading_order": [0, 1, 2, 3, 6, 7, 8, ...],
+    "elements": [...]
+  }]
+}
+```
+
+**Direct Track - implicit order:**
+- PyMuPDF's `get_text("dict", sort=True)` returns blocks sorted by vertical, then horizontal position
+- Elements already sorted by extraction engine
+- Optional: Enable `_sort_elements_for_reading_order()` for multi-column detection
+
+### Decision 2: Separate API Endpoints
+
+```
+# Layout preservation (existing)
+GET /api/v2/tasks/{task_id}/download/pdf
+
+# Reflow layout (new)
+GET /api/v2/tasks/{task_id}/download/pdf?format=reflow
+
+# Translated PDF (reflow only)
+POST /api/v2/translate/{task_id}/pdf?lang={lang}
+```
+
+### Decision 3: Unified Reflow Generation Method
+
+```python
+def generate_reflow_pdf(
+    self,
+    result_json_path: Path,
+    output_path: Path,
+    translation_json_path: Optional[Path] = None,  # None = no translation
+    source_file_path: Optional[Path] = None,       # For embedded images
+) -> bool:
+    """
+    Generate reflow layout PDF for either OCR or Direct track.
+    Works with or without translation.
+    """
+```
+
+### Decision 4: Reading Order Application
+
+```python
+def _get_elements_in_reading_order(self, page_data: dict) -> List[dict]:
+    """Get elements sorted by reading order."""
+    elements = page_data.get('elements', [])
+    reading_order = page_data.get('reading_order')
+
+    if reading_order:
+        # OCR track: use explicit reading order
+        ordered = []
+        for idx in reading_order:
+            if 0 <= idx < len(elements):
+                ordered.append(elements[idx])
+        return ordered
+    else:
+        # Direct track: elements already in reading order
+        return elements
+```
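The fallback for a missing `reading_order` can be sketched as follows. This is a minimal illustration, not the committed implementation: the `track` argument and the `bbox` dict with `x0`/`y0` keys are assumptions about the element schema, and the function name is hypothetical.

```python
from typing import Any, Dict, List


def elements_in_reading_order(page_data: Dict[str, Any], track: str) -> List[dict]:
    """Order page elements for reflow output.

    Assumes each element carries a `bbox` dict with `x0` and `y0` keys;
    that schema is an assumption, not confirmed by the design document.
    """
    elements = page_data.get("elements", [])
    reading_order = page_data.get("reading_order")

    if reading_order:
        # Explicit order: keep valid indices, skip out-of-range ones.
        return [elements[i] for i in reading_order if 0 <= i < len(elements)]

    if track == "ocr":
        # OCR track without reading_order: spatial sort (top-to-bottom, then left-to-right).
        return sorted(elements, key=lambda e: (e["bbox"]["y0"], e["bbox"]["x0"]))

    # Direct track: trust the order produced by the extraction engine.
    return list(elements)


page = {
    "elements": [
        {"id": "b", "bbox": {"x0": 50, "y0": 200}},
        {"id": "a", "bbox": {"x0": 50, "y0": 100}},
    ]
}
ocr_ids = [e["id"] for e in elements_in_reading_order(page, track="ocr")]
direct_ids = [e["id"] for e in elements_in_reading_order(page, track="direct")]
```

Note that elements whose index is absent from `reading_order` are dropped in the explicit branch; whether that is intended (for example, to hide filtered elements) is not stated in the design.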
+
+### Decision 5: Consistent Typography
+
+| Element Type | Font Size | Style |
+|-------------|-----------|-------|
+| Title/H1    | 18pt      | Bold  |
+| H2          | 16pt      | Bold  |
+| H3          | 14pt      | Bold  |
+| Body text   | 12pt      | Normal|
+| Table cell  | 10pt      | Normal|
+| Caption     | 10pt      | Italic|
+
+### Decision 6: Table Handling in Reflow
+
+Tables use Platypus Table with auto-width columns:
+
+```python
+def _create_reflow_table(self, table_data, translations=None):
+    data = []
+    for row in table_data['rows']:
+        row_data = []
+        for cell in row['cells']:
+            text = cell.get('text', '')
+            if translations:
+                text = translations.get(cell.get('id'), text)
+            row_data.append(Paragraph(text, self.styles['TableCell']))
+        data.append(row_data)
+
+    table = Table(data)
+    table.setStyle(TableStyle([
+        ('GRID', (0, 0), (-1, -1), 0.5, colors.black),
+        ('VALIGN', (0, 0), (-1, -1), 'TOP'),
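+        # ReportLab has no combined 'PADDING' command; set each side explicitly
+        ('LEFTPADDING', (0, 0), (-1, -1), 6),
+        ('RIGHTPADDING', (0, 0), (-1, -1), 6),
+        ('TOPPADDING', (0, 0), (-1, -1), 6),
+        ('BOTTOMPADDING', (0, 0), (-1, -1), 6),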
+    ]))
+    return table
+```
+
+### Decision 7: Image Embedding
+
+```python
+def _embed_image_reflow(self, element, max_width=450):
+    img_path = self._resolve_image_path(element)
+    if img_path and img_path.exists():
+        img = Image(str(img_path))
+        # Scale to fit page width
+        if img.drawWidth > max_width:
+            ratio = max_width / img.drawWidth
+            img.drawWidth = max_width
+            img.drawHeight *= ratio
+        return img
+    return Spacer(1, 0)
+```
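The scaling step in the image embedding code is plain proportional arithmetic. A small sketch, independent of ReportLab, assuming a hypothetical helper name `fit_to_width` and the same 450pt default width:

```python
from typing import Tuple


def fit_to_width(width: float, height: float, max_width: float = 450.0) -> Tuple[float, float]:
    """Shrink (width, height) so width <= max_width, keeping the aspect ratio."""
    if width <= max_width:
        # Already fits: never upscale small images.
        return width, height
    ratio = max_width / width
    return max_width, height * ratio


scaled = fit_to_width(900, 600)     # wide image, halved in both dimensions
unchanged = fit_to_width(200, 100)  # small image, left as-is
```

Only the width is constrained here; a very tall image could still exceed the page height, which the design does not address.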
+
+## Risks / Trade-offs
+
+- **Risk**: OCR reading_order may not be accurate for complex layouts
+  - **Mitigation**: Falls back to spatial sort if reading_order missing
+
+- **Risk**: Direct track multi-column detection unused
+  - **Mitigation**: PyMuPDF sort=True is generally reliable
+
+- **Risk**: Loss of visual fidelity compared to original
+  - **Mitigation**: This is acceptable; layout PDF still available
+
+## Migration Plan
+
+No migration needed - new functionality, existing behavior unchanged.
+
+## Open Questions
+
+None - design confirmed with user.
--- a/openspec/changes/archive/2025-12-04-improve-translated-text-fitting/proposal.md
+++ b/openspec/changes/archive/2025-12-04-improve-translated-text-fitting/proposal.md
@@ -0,0 +1,41 @@
+# Change: Reflow Layout PDF Export for All Tracks
+
+## Why
+
+When generating translated PDFs, text often doesn't fit within original bounding boxes due to language expansion/contraction differences. Additionally, users may want a readable flowing document format even without translation.
+
+**Example from task c79df0ad-f9a6-4c04-8139-13eaef25fa83:**
+- Original Chinese: "华天科技（宝鸡）有限公司设备版块报价单" (19 characters)
+- Translated English: "Huatian Technology (Baoji) Co., Ltd. Equipment Division Quotation" (65+ characters)
+- Same bounding box: 703×111 pixels
+- Current result: Font reduced to minimum (3pt), text unreadable
+
+## What Changes
+
+- **NEW**: Add reflow layout PDF generation for both OCR and Direct tracks
+- Preserve semantic structure (headings, tables, lists) in reflow mode
+- Use consistent, readable font sizes (12pt body, 14-18pt headings)
+- Embed images inline within flowing content
+- **IMPORTANT**: Original layout preservation PDF generation remains unchanged
+- Support both tracks with proper reading order:
+  - **OCR track**: Use existing `reading_order` array from PP-StructureV3
+  - **Direct track**: Use PyMuPDF's implicit order (with option for column detection)
+- **FIX**: Remove outdated MADLAD-400 references from frontend (now uses Dify cloud translation)
+
+## Download Options
+
+| Scenario | Layout PDF | Reflow PDF |
+|----------|------------|------------|
+| **Without Translation** | Available | Available (NEW) |
+| **With Translation** | - | Available (single option, unchanged) |
+
+## Impact
+
+- Affected specs: `specs/result-export/spec.md`
+- Affected code:
+  - `backend/app/services/pdf_generator_service.py` - add reflow generation method
+  - `backend/app/routers/tasks.py` - add reflow PDF download endpoint
+  - `backend/app/routers/translate.py` - use reflow mode for translated PDFs
+  - `frontend/src/pages/TaskDetailPage.tsx`:
+    - Add "Download Reflow PDF" button for original documents
+    - Remove MADLAD-400 badge and outdated description text
--- a/openspec/changes/archive/2025-12-04-improve-translated-text-fitting/specs/result-export/spec.md
+++ b/openspec/changes/archive/2025-12-04-improve-translated-text-fitting/specs/result-export/spec.md
@@ -0,0 +1,137 @@
+## ADDED Requirements
+
+### Requirement: Dual PDF Generation Modes
+
+The system SHALL support two distinct PDF generation modes to serve different use cases for both OCR and Direct tracks.
+
+#### Scenario: Download layout preservation PDF
+- **WHEN** user requests PDF via `/api/v2/tasks/{task_id}/download/pdf`
+- **THEN** PDF SHALL use layout preservation mode
+- **AND** text positions SHALL match original document coordinates
+- **AND** this option SHALL be available for both OCR and Direct tracks
+- **AND** existing behavior SHALL remain unchanged
+
+#### Scenario: Download reflow layout PDF without translation
+- **WHEN** user requests PDF via `/api/v2/tasks/{task_id}/download/pdf?format=reflow`
+- **THEN** PDF SHALL use reflow layout mode
+- **AND** text SHALL flow naturally with consistent font sizes
+- **AND** body text SHALL use approximately 12pt font size
+- **AND** headings SHALL use larger font sizes (14-18pt)
+- **AND** this option SHALL be available for both OCR and Direct tracks
+
+#### Scenario: OCR track reading order in reflow mode
+- **GIVEN** document processed via OCR track
+- **WHEN** generating reflow PDF
+- **THEN** system SHALL use explicit `reading_order` array from JSON
+- **AND** elements SHALL appear in order specified by reading_order indices
+- **AND** if `reading_order` is missing, system SHALL fall back to spatial sort (y, x)
+
+#### Scenario: Direct track reading order in reflow mode
+- **GIVEN** document processed via Direct track
+- **WHEN** generating reflow PDF
+- **THEN** system SHALL use implicit element order from extraction
+- **AND** elements SHALL appear in list iteration order
+- **AND** PyMuPDF's sort=True ordering SHALL be trusted
+
+---
+
+### Requirement: Reflow PDF Semantic Structure
+
+The reflow PDF generation SHALL preserve document semantic structure.
+
+#### Scenario: Headings in reflow mode
+- **WHEN** original document contains headings (title, h1, h2, etc.)
+- **THEN** headings SHALL be rendered with larger font sizes
+- **AND** headings SHALL be visually distinguished from body text
+- **AND** heading hierarchy SHALL be preserved
+
+#### Scenario: Tables in reflow mode
+- **WHEN** original document contains tables
+- **THEN** tables SHALL render with visible cell borders
+- **AND** column widths SHALL auto-adjust to content
+- **AND** table content SHALL be fully visible
+- **AND** tables SHALL use appropriate cell padding
+
+#### Scenario: Images in reflow mode
+- **WHEN** original document contains images
+- **THEN** images SHALL be embedded inline in flowing content
+- **AND** images SHALL be scaled to fit page width if necessary
+- **AND** images SHALL maintain aspect ratio
+
+#### Scenario: Lists in reflow mode
+- **WHEN** original document contains numbered or bulleted lists
+- **THEN** lists SHALL preserve their formatting
+- **AND** list items SHALL flow naturally
+
+---
+
+## MODIFIED Requirements
+
+### Requirement: Translated PDF Export API
+
+The system SHALL expose an API endpoint for downloading translated documents as PDF files using reflow layout mode only.
+
+#### Scenario: Download translated PDF via API
+- **GIVEN** a task with completed translation
+- **WHEN** POST request to `/api/v2/translate/{task_id}/pdf?lang={lang}`
+- **THEN** system returns PDF file with translated content
+- **AND** PDF SHALL use reflow layout mode (not layout preservation)
+- **AND** Content-Type is `application/pdf`
+- **AND** Content-Disposition suggests filename like `{task_id}_translated_{lang}.pdf`
+
+#### Scenario: Translated PDF uses reflow layout
+- **WHEN** user downloads translated PDF
+- **THEN** the PDF SHALL use reflow layout mode
+- **AND** text SHALL flow naturally with consistent font sizes
+- **AND** body text SHALL use approximately 12pt font size
+- **AND** headings SHALL use larger font sizes (14-18pt)
+- **AND** content SHALL be readable without magnification
+
+#### Scenario: Translated PDF for OCR track
+- **GIVEN** document processed via OCR track with translation
+- **WHEN** generating translated PDF
+- **THEN** reading order SHALL follow `reading_order` array
+- **AND** translated text SHALL replace original in correct positions
+
+#### Scenario: Translated PDF for Direct track
+- **GIVEN** document processed via Direct track with translation
+- **WHEN** generating translated PDF
+- **THEN** reading order SHALL follow implicit element order
+- **AND** translated text SHALL replace original in correct positions
+
+#### Scenario: Invalid language parameter
+- **GIVEN** a task with translation only to English
+- **WHEN** user requests PDF with `lang=ja` (Japanese)
+- **THEN** system returns 404 Not Found
+- **AND** response includes available languages in error message
+
+#### Scenario: Task not found
+- **GIVEN** non-existent task_id
+- **WHEN** user requests translated PDF
+- **THEN** system returns 404 Not Found
+
+---
+
+### Requirement: Frontend Download Options
+
+The frontend SHALL provide appropriate download options based on translation status.
+
+#### Scenario: Download options without translation
+- **GIVEN** a task without any completed translations
+- **WHEN** user views TaskDetailPage
+- **THEN** page SHALL display "Download Layout PDF" button (original coordinates)
+- **AND** page SHALL display "Download Reflow PDF" button (flowing layout)
+- **AND** both options SHALL be available in the download section
+
+#### Scenario: Download options with translation
+- **GIVEN** a task with completed translation
+- **WHEN** user views TaskDetailPage
+- **THEN** page SHALL display "Download Translated PDF" button for each language
+- **AND** translated PDF button SHALL remain as single option (no Layout/Reflow choice)
+- **AND** translated PDF SHALL automatically use reflow layout
+
+#### Scenario: Remove outdated MADLAD-400 references
+- **WHEN** displaying translation section
+- **THEN** page SHALL NOT display "MADLAD-400" badge
+- **AND** description text SHALL reflect cloud translation service (Dify)
+- **AND** description SHALL NOT mention local model loading time
--- a/openspec/changes/archive/2025-12-04-improve-translated-text-fitting/tasks.md
+++ b/openspec/changes/archive/2025-12-04-improve-translated-text-fitting/tasks.md
@@ -0,0 +1,30 @@
+## 1. Backend Implementation
+
+- [x] 1.1 Create `generate_reflow_pdf()` method in pdf_generator_service.py
+- [x] 1.2 Implement `_get_elements_in_reading_order()` for both tracks
+- [x] 1.3 Implement reflow text rendering with consistent font sizes
+- [x] 1.4 Implement table rendering in reflow mode (Platypus Table)
+- [x] 1.5 Implement inline image embedding
+- [x] 1.6 Add `format=reflow` query parameter to tasks download endpoint
+- [x] 1.7 Update `generate_translated_pdf()` to use reflow mode
+
+## 2. Frontend Implementation
+
+- [x] 2.1 Add "Download Reflow PDF" button for original documents
+- [x] 2.2 Update download logic to support format parameter
+- [x] 2.3 Remove MADLAD-400 badge (line 545)
+- [x] 2.4 Update translation description text to reflect Dify cloud service (line 652)
+
+## 3. Testing
+
+- [x] 3.1 Test OCR track reflow PDF (with reading_order) - Basic smoke test passed
+- [ ] 3.2 Test Direct track reflow PDF (implicit order) - No test data available
+- [x] 3.3 Test translated PDF (reflow mode) - Basic smoke test passed
+- [x] 3.4 Test documents with tables - SUCCESS (62294 bytes, 2 tables)
+- [x] 3.5 Test documents with images - SUCCESS (embedded img_in_table)
+- [x] 3.6 Test multi-page documents - SUCCESS (11451 bytes, 3 pages)
+- [x] 3.7 Verify layout PDF still works correctly - SUCCESS (104543 bytes)
+
+## 4. Documentation
+
+- [x] 4.1 Update spec with reflow layout requirements
--- a/openspec/changes/archive/2025-12-04-pdf-preprocessing-pipeline/design.md
+++ b/openspec/changes/archive/2025-12-04-pdf-preprocessing-pipeline/design.md
@@ -0,0 +1,458 @@
+# Design: PDF Preprocessing Pipeline
+
+## Architecture Overview
+
+```
+┌─────────────────────────────────────────────────────────────────────────────┐
+│                     DIRECT Track PDF Processing Pipeline                     │
+├─────────────────────────────────────────────────────────────────────────────┤
+│                                                                             │
+│   Input PDF                                                                 │
+│      │                                                                      │
+│      ▼                                                                      │
+│  ┌─────────────────────────────────────────────────────────────────────┐   │
+│  │ Step 0: GS Distillation (Exception Handler)                         │   │
+│  │ ───────────────────────────────────────────────────────────────────  │   │
+│  │ Trigger: (cid:xxxx) garble detected OR mupdf structural errors      │   │
+│  │ Action: gs -sDEVICE=pdfwrite -dDetectDuplicateImages=true           │   │
+│  │ Status: DISABLED by default, auto-triggered on errors               │   │
+│  └─────────────────────────────────────────────────────────────────────┘   │
+│      │                                                                      │
+│      ▼                                                                      │
+│  ┌─────────────────────────────────────────────────────────────────────┐   │
+│  │ Step 1: Object-level Cleaning (P0 - Core)                           │   │
+│  │ ───────────────────────────────────────────────────────────────────  │   │
+│  │ 1.1 clean_contents(sanitize=True) - Fix malformed content stream    │   │
+│  │ 1.2 Remove hidden OCG layers                                        │   │
+│  │ 1.3 White-out detection & removal (IoU >= 80%)                      │   │
+│  └─────────────────────────────────────────────────────────────────────┘   │
+│      │                                                                      │
+│      ▼                                                                      │
+│  ┌─────────────────────────────────────────────────────────────────────┐   │
+│  │ Step 2: Layout Analysis (P1 - Rule-based)                           │   │
+│  │ ───────────────────────────────────────────────────────────────────  │   │
+│  │ 2.1 get_text("dict", sort=True) - Position-based sorting            │   │
+│  │ 2.2 Classify elements (title/body/header/footer/page_number)        │   │
+│  │ 2.3 Filter unwanted elements (page numbers, decorations)            │   │
+│  └─────────────────────────────────────────────────────────────────────┘   │
+│      │                                                                      │
+│      ▼                                                                      │
+│  ┌─────────────────────────────────────────────────────────────────────┐   │
+│  │ Step 3: Text Extraction (Enhanced)                                  │   │
+│  │ ───────────────────────────────────────────────────────────────────  │   │
+│  │ 3.1 Extract text with bbox coordinates preserved                    │   │
+│  │ 3.2 Garble rate detection (cid:xxxx count / total chars)            │   │
+│  │ 3.3 Auto-fallback: garble_rate > 10% → trigger Paddle OCR           │   │
+│  └─────────────────────────────────────────────────────────────────────┘   │
+│      │                                                                      │
+│      ▼                                                                      │
+│   UnifiedDocument (with bbox for debugging)                                 │
+│                                                                             │
+└─────────────────────────────────────────────────────────────────────────────┘
+```
+
+---
+
+## Step 0: GS Distillation (Exception Handler)
+
+### Purpose
+Repair structurally damaged PDFs that PyMuPDF cannot parse correctly.
+
+### Trigger Conditions
+```python
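+import re
+from typing import List
+
+
+def should_trigger_gs_repair(page_text: str, mupdf_warnings: List[str]) -> bool:
+    # Condition 1: High garble rate (cid:xxxx patterns)
+    cid_pattern = r'\(cid:\d+\)'
+    # Count garbled characters, not matches, so the rate is comparable
+    # to calculate_garble_rate() in Step 3
+    cid_chars = sum(len(m) for m in re.findall(cid_pattern, page_text))
+    total_chars = len(page_text)
+    garble_rate = cid_chars / max(total_chars, 1)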
+
+    if garble_rate > 0.1:  # >10% garbled
+        return True
+
+    # Condition 2: Severe structural errors
+    severe_errors = ['error', 'invalid', 'corrupt', 'damaged']
+    for warning in mupdf_warnings:
+        if any(err in warning.lower() for err in severe_errors):
+            return True
+
+    return False
+```
+
+### GS Command
+```bash
+gs -dNOPAUSE -dBATCH -dSAFER \
+   -sDEVICE=pdfwrite \
+   -dPDFSETTINGS=/prepress \
+   -dDetectDuplicateImages=true \
+   -sOutputFile=repaired.pdf \
+   input.pdf
+```
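Invoking the command above from Python might look like the sketch below. The function names, the timeout value, and the decision to treat a missing `gs` binary as a failed repair are assumptions; only the flags come from the command shown in the design.

```python
import subprocess
from pathlib import Path
from typing import List


def build_gs_command(src: Path, dst: Path) -> List[str]:
    """Build the Ghostscript argument list (same flags as the shell command)."""
    return [
        "gs", "-dNOPAUSE", "-dBATCH", "-dSAFER",
        "-sDEVICE=pdfwrite",
        "-dPDFSETTINGS=/prepress",
        "-dDetectDuplicateImages=true",
        f"-sOutputFile={dst}",
        str(src),
    ]


def repair_with_gs(src: Path, dst: Path, timeout_s: float = 60.0) -> bool:
    """Run Ghostscript; return True only if it exits cleanly and writes dst."""
    try:
        result = subprocess.run(
            build_gs_command(src, dst),
            capture_output=True,
            timeout=timeout_s,
            check=False,
        )
    except (FileNotFoundError, subprocess.TimeoutExpired):
        # gs not installed or hung: the caller can route to the OCR track.
        return False
    return result.returncode == 0 and dst.exists()


command = build_gs_command(Path("input.pdf"), Path("repaired.pdf"))
```

Passing the arguments as a list avoids shell quoting issues with file names that contain spaces.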
+
+### Implementation Notes
+- **Default**: DISABLED
+- **Execution**: Only when triggered by error detection
+- **Fallback**: If GS also fails, route to Paddle OCR track
+
+---
+
+## Step 1: Object-level Cleaning (P0)
+
+### 1.1 Content Stream Sanitization
+```python
+def sanitize_page(page: fitz.Page) -> None:
+    """Fix malformed PDF content stream."""
+    page.clean_contents(sanitize=True)
+```
+
+### 1.2 Hidden Layer (OCG) Removal
+```python
+def remove_hidden_layers(doc: fitz.Document) -> List[str]:
+    """List hidden Optional Content Groups (layers that are off by default).
+
+    This only collects layer names; filtering their content is a separate step.
+    """
+    removed_layers = []
+
+    ocgs = doc.get_ocgs()  # Maps each OCG xref to a dict with 'name', 'on', ...
+    for ocg_xref, ocg_info in ocgs.items():
+        # Check if layer is hidden by default
+        if ocg_info.get('on') is False:
+            removed_layers.append(ocg_info.get('name', f'OCG_{ocg_xref}'))
+            # Mark for removal during extraction
+
+    return removed_layers
+```
+
+### 1.3 White-out Detection (Core Algorithm)
+```python
+def detect_whiteout_covered_text(page: fitz.Page, iou_threshold: float = 0.8) -> List[dict]:
+    """
+    Detect text covered by white rectangles ("white-out" / "correction tape" effect).
+
+    Returns list of text words that should be excluded from extraction.
+    """
+    covered_words = []
+
+    # Get all white-filled rectangles
+    drawings = page.get_drawings()
+    white_rects = []
+    for d in drawings:
+        # Check for white fill (RGB all 1.0)
+        fill_color = d.get('fill')
+        if fill_color and fill_color == (1, 1, 1):
+            rect = d.get('rect')
+            if rect:
+                white_rects.append(fitz.Rect(rect))
+
+    if not white_rects:
+        return covered_words
+
+    # Get all text words with bounding boxes
+    words = page.get_text("words")  # Returns list of (x0, y0, x1, y1, word, block_no, line_no, word_no)
+
+    for word_info in words:
+        word_rect = fitz.Rect(word_info[:4])
+        word_text = word_info[4]
+
+        for white_rect in white_rects:
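+            # Coverage ratio: share of the word's area under the white rectangle
+            # (intersection / word area; not a true IoU, despite the parameter name)
+            intersection = word_rect & white_rect  # Intersection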
+            if intersection.is_empty:
+                continue
+
+            intersection_area = intersection.width * intersection.height
+            word_area = word_rect.width * word_rect.height
+
+            if word_area > 0:
+                coverage_ratio = intersection_area / word_area
+                if coverage_ratio >= iou_threshold:
+                    covered_words.append({
+                        'text': word_text,
+                        'bbox': tuple(word_rect),
+                        'coverage': coverage_ratio
+                    })
+                    break  # Word is covered, no need to check other rects
+
+    return covered_words
+```
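The difference between the coverage ratio used above and a true IoU matters when the white rectangle is much larger than the word. The sketch below works on plain `(x0, y0, x1, y1)` tuples so it runs without PyMuPDF; all helper names are illustrative.

```python
from typing import Tuple

Box = Tuple[float, float, float, float]  # (x0, y0, x1, y1)


def area(box: Box) -> float:
    """Area of a box; degenerate or inverted boxes count as zero."""
    return max(0.0, box[2] - box[0]) * max(0.0, box[3] - box[1])


def intersection(a: Box, b: Box) -> Box:
    """Overlap of two boxes (may be inverted if they do not overlap)."""
    return (max(a[0], b[0]), max(a[1], b[1]), min(a[2], b[2]), min(a[3], b[3]))


def coverage_ratio(word: Box, cover: Box) -> float:
    """Fraction of the word's area that lies under the covering box."""
    word_area = area(word)
    if word_area == 0:
        return 0.0
    return area(intersection(word, cover)) / word_area


def iou(a: Box, b: Box) -> float:
    """Intersection over union of two boxes."""
    inter = area(intersection(a, b))
    union = area(a) + area(b) - inter
    return inter / union if union > 0 else 0.0


word = (10.0, 10.0, 50.0, 20.0)    # area 400
patch = (0.0, 0.0, 200.0, 100.0)   # area 20000, fully covers the word
```

For this pair the coverage ratio is 1.0 while the IoU is only 0.02, so an 80% IoU threshold would miss a word hidden under a large patch. Neither measure considers drawing order, so a white background drawn beneath the text would also match.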
+
+---
+
+## Step 2: Layout Analysis (P1)
+
+### 2.1 Column-aware Text Extraction
+```python
+def extract_with_reading_order(page: fitz.Page) -> List[dict]:
+    """
+    Extract text blocks in reading order.
+    PyMuPDF's sort=True orders blocks top-to-bottom, then left-to-right;
+    verify multi-column pages, since side-by-side columns can interleave.
+    """
+    # sort=True sorts by vertical, then horizontal position
+    blocks = page.get_text("dict", sort=True)['blocks']
+    return blocks
+```
+
+### 2.2 Element Classification
+```python
+def classify_element(block: dict, page_rect: fitz.Rect) -> str:
+    """
+    Classify text block by position and font size.
+
+    Returns: 'title', 'body', 'header', 'footer', 'page_number', or 'image'
+    """
+    if 'lines' not in block:
+        return 'image'
+
+    bbox = fitz.Rect(block['bbox'])
+    page_height = page_rect.height
+    page_width = page_rect.width
+
+    # Relative position (0.0 = top, 1.0 = bottom)
+    y_rel = bbox.y0 / page_height
+
+    # Get average font size
+    font_sizes = []
+    for line in block.get('lines', []):
+        for span in line.get('spans', []):
+            font_sizes.append(span.get('size', 12))
+    avg_font_size = sum(font_sizes) / len(font_sizes) if font_sizes else 12
+
+    # Get text content for pattern matching
+    text = ''.join(
+        span.get('text', '')
+        for line in block.get('lines', [])
+        for span in line.get('spans', [])
+    ).strip()
+
+    # Classification rules
+
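+    # Page number: bottom 10% + numeric pattern.
+    # Checked before the footer rule; otherwise page numbers in the bottom 5%
+    # would be classified as 'footer' and never filtered.
+    if y_rel > 0.90 and _is_page_number(text):
+        return 'page_number'
+
+    # Header: top 5% of page
+    if y_rel < 0.05:
+        return 'header'
+
+    # Footer: bottom 5% of page
+    if y_rel > 0.95:
+        return 'footer'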
+
+    # Title: large font (>14pt) or centered
+    if avg_font_size > 14:
+        return 'title'
+
+    # Check if centered (for subtitles)
+    x_center = (bbox.x0 + bbox.x1) / 2
+    page_center = page_width / 2
+    if abs(x_center - page_center) < page_width * 0.1 and len(text) < 100:
+        if avg_font_size > 12:
+            return 'title'
+
+    return 'body'
+
+
+def _is_page_number(text: str) -> bool:
+    """Check if text is likely a page number."""
+    text = text.strip()
+
+    # Pure number
+    if text.isdigit():
+        return True
+
+    # Common patterns: "Page 1", "- 1 -", "1/10"
+    patterns = [
+        r'^page\s*\d+$',
+        r'^-?\s*\d+\s*-?$',
+        r'^\d+\s*/\s*\d+$',
+        r'^第\s*\d+\s*頁$',
+        r'^第\s*\d+\s*页$',
+    ]
+
+    for pattern in patterns:
+        if re.match(pattern, text, re.IGNORECASE):
+            return True
+
+    return False
+```
+
+### 2.3 Element Filtering
+```python
+def filter_elements(blocks: List[dict], page_rect: fitz.Rect) -> List[dict]:
+    """Filter out unwanted elements (page numbers, headers, footers)."""
+    filtered = []
+
+    for block in blocks:
+        element_type = classify_element(block, page_rect)
+
+        # Skip page numbers and optionally headers/footers
+        if element_type == 'page_number':
+            continue
+
+        # Keep with classification metadata
+        block['_element_type'] = element_type
+        filtered.append(block)
+
+    return filtered
+```
+
+---
+
+## Step 3: Text Extraction (Enhanced)
+
+### 3.1 Garble Detection
+```python
+def calculate_garble_rate(text: str) -> float:
+    """
+    Calculate the rate of garbled characters (cid:xxxx patterns).
+
+    Returns: float between 0.0 and 1.0
+    """
+    if not text:
+        return 0.0
+
+    # Count (cid:xxxx) patterns
+    cid_pattern = r'\(cid:\d+\)'
+    cid_matches = re.findall(cid_pattern, text)
+    cid_char_count = sum(len(m) for m in cid_matches)
+
+    # Count other garble indicators
+    # - Replacement character U+FFFD
+    # - Private Use Area characters
+    replacement_count = text.count('\ufffd')
+    pua_count = sum(1 for c in text if 0xE000 <= ord(c) <= 0xF8FF)
+
+    total_garble = cid_char_count + replacement_count + pua_count
+    total_chars = len(text)
+
+    return total_garble / total_chars if total_chars > 0 else 0.0
+```
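A short worked example shows how the character-based rate behaves. This is a compact restatement of the function above for illustration, with a hypothetical name and made-up sample strings.

```python
import re

CID_PATTERN = re.compile(r"\(cid:\d+\)")


def garble_rate(text: str) -> float:
    """Share of characters that belong to garble indicators."""
    if not text:
        return 0.0
    cid_chars = sum(len(m) for m in CID_PATTERN.findall(text))
    replacement = text.count("\ufffd")                        # U+FFFD
    pua = sum(1 for c in text if 0xE000 <= ord(c) <= 0xF8FF)  # Private Use Area
    return (cid_chars + replacement + pua) / len(text)


clean = "Quarterly report"
garbled = "(cid:12)(cid:345) ok"  # 17 of 20 characters are cid markers

clean_rate = garble_rate(clean)
garbled_rate = garble_rate(garbled)
```

Counting matches instead of characters would give 2 / 20 = 0.10 for the same garbled string, which does not exceed a 10% threshold, so the two ways of counting are not interchangeable.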
+
+### 3.2 Auto-fallback to OCR
+```python
+def should_fallback_to_ocr(page_text: str, garble_threshold: float = 0.1) -> bool:
+    """
+    Determine if page should be processed with OCR instead of direct extraction.
+
+    Args:
+        page_text: Extracted text from page
+        garble_threshold: Maximum acceptable garble rate (default 10%)
+
+    Returns:
+        True if OCR fallback is recommended
+    """
+    garble_rate = calculate_garble_rate(page_text)
+
+    if garble_rate > garble_threshold:
+        logger.warning(
+            f"High garble rate detected: {garble_rate:.1%}. "
+            f"Recommending OCR fallback."
+        )
+        return True
+
+    return False
+```
+
+---
+
+## Integration Point
+
+### Modified DirectExtractionEngine._extract_page()
+
+```python
+def _extract_page(self, page: fitz.Page, page_num: int, ...) -> Page:
+    """Extract content from a single page with preprocessing pipeline."""
+
+    # === Step 1: Object-level Cleaning ===
+
+    # 1.1 Sanitize content stream
+    page.clean_contents(sanitize=True)
+
+    # 1.2 Detect white-out covered text
+    covered_words = detect_whiteout_covered_text(page, iou_threshold=0.8)
+    covered_bboxes = [fitz.Rect(w['bbox']) for w in covered_words]
+
+    # === Step 2: Layout Analysis ===
+
+    # 2.1 Extract with column-aware sorting
+    blocks = page.get_text("dict", sort=True)['blocks']
+
+    # 2.2 & 2.3 Classify and filter
+    filtered_blocks = filter_elements(blocks, page.rect)
+
+    # === Step 3: Text Extraction ===
+
+    elements = []
+    full_text = ""
+    page_metadata = {}  # Must exist before the garble check below sets 'needs_ocr'
+
+    for block in filtered_blocks:
+        # Skip if block overlaps with covered areas
+        block_rect = fitz.Rect(block['bbox'])
+        if any(block_rect.intersects(cr) for cr in covered_bboxes):
+            continue
+
+        # Extract text with bbox preserved
+        element = self._block_to_element(block, page_num)
+        if element:
+            elements.append(element)
+            full_text += element.get_text() + " "
+
+    # 3.2 Check garble rate
+    if should_fallback_to_ocr(full_text):
+        # Mark page for OCR processing
+        page_metadata['needs_ocr'] = True
+
+    return Page(
+        page_number=page_num,
+        elements=elements,
+        metadata=page_metadata
+    )
+```
+
+---
+
+## Configuration
+
+```python
+@dataclass
+class PreprocessingConfig:
+    """Configuration for PDF preprocessing pipeline."""
+
+    # Step 0: GS Distillation
+    gs_enabled: bool = False  # Disabled by default
+    gs_garble_threshold: float = 0.1  # Trigger on >10% garble
+    gs_detect_duplicate_images: bool = True
+
+    # Step 1: Object Cleaning
+    sanitize_content: bool = True
+    remove_hidden_layers: bool = True
+    whiteout_detection: bool = True
+    whiteout_iou_threshold: float = 0.8
+
+    # Step 2: Layout Analysis
+    column_aware_sort: bool = True  # Use sort=True
+    filter_page_numbers: bool = True
+    filter_headers: bool = False  # Keep headers by default
+    filter_footers: bool = False  # Keep footers by default
+
+    # Step 3: Text Extraction
+    preserve_bbox: bool = True  # For debugging
+    garble_detection: bool = True
+    ocr_fallback_threshold: float = 0.1  # Fallback on >10% garble
+```
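Since every threshold in the configuration is a fraction, a range check at construction time can catch values such as `10` entered instead of `0.1`. The sketch below uses a hypothetical `ThresholdConfig` holding only the three threshold fields; the `__post_init__` validation is an addition for illustration and is not part of the configuration class shown above.

```python
from dataclasses import dataclass


@dataclass
class ThresholdConfig:
    """Hypothetical subset of the preprocessing options, with validation added."""

    gs_garble_threshold: float = 0.1
    whiteout_iou_threshold: float = 0.8
    ocr_fallback_threshold: float = 0.1

    def __post_init__(self) -> None:
        # All thresholds are ratios and must lie within [0, 1].
        for name in (
            "gs_garble_threshold",
            "whiteout_iou_threshold",
            "ocr_fallback_threshold",
        ):
            value = getattr(self, name)
            if not 0.0 <= value <= 1.0:
                raise ValueError(f"{name} must be within [0, 1], got {value}")


strict = ThresholdConfig(whiteout_iou_threshold=0.9)
```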
+
+---
+
+## Testing Strategy
+
+1. **Unit Tests**
+   - White-out detection with synthetic PDFs
+   - Garble rate calculation
+   - Element classification accuracy
+
+2. **Integration Tests**
+   - Two-column document reading order
+   - Hidden layer removal
+   - GS fallback trigger conditions
+
+3. **Regression Tests**
+   - Existing task outputs should not change for clean PDFs
+   - Performance benchmarks (should add <100ms per page)
--- a/openspec/changes/archive/2025-12-04-pdf-preprocessing-pipeline/proposal.md
+++ b/openspec/changes/archive/2025-12-04-pdf-preprocessing-pipeline/proposal.md
@@ -0,0 +1,44 @@
+# Change Proposal: PDF Preprocessing Pipeline
+
+## Summary
+
+Implement a multi-stage PDF preprocessing pipeline for Direct track extraction to improve layout accuracy, remove hidden/covered content, and ensure correct reading order.
+
+## Problem Statement
+
+Current Direct track extraction has several issues:
+1. **Hidden content pollution**: OCG (Optional Content Groups) layers and "white-out" covered text leak into extraction
+2. **Reading order chaos**: Two-column layouts get interleaved incorrectly
+3. **Vector graphics interference**: Large decorative vector elements cover text content
+4. **Corrupted PDF handling**: No fallback for structurally damaged PDFs with `(cid:xxxx)` garbled text
+
+## Proposed Solution
+
+Implement a 4-stage preprocessing pipeline:
+
+```
+Step 0: GS Distillation (Exception Handler - triggered on errors)
+Step 1: Object-level Cleaning (P0 - Core)
+Step 2: Layout Analysis (P1 - Rule-based with sort=True)
+Step 3: Text Extraction (Existing, enhanced with garble detection)
+```
+
+## Key Features
+
+1. **Smart Fallback**: GS distillation only triggers on `(cid:xxxx)` garble or mupdf structural errors
+2. **White-out Detection**: IoU-based overlap detection (80% threshold) to remove covered text
+3. **Column-aware Sorting**: Leverage PyMuPDF's `sort=True` for automatic two-column handling
+4. **Garble Rate Detection**: Auto-switch to Paddle OCR when garble rate exceeds threshold
+
+## Impact
+
+- **Files Modified**: `backend/app/services/direct_extraction_engine.py`
+- **New Dependencies**: None (Ghostscript optional, already available on most systems)
+- **Risk Level**: Medium (core extraction logic changes)
+
+## Success Criteria
+
+- [ ] Hidden OCG content no longer appears in extraction
+- [ ] White-out covered text is correctly filtered
+- [ ] Two-column documents maintain correct reading order
+- [ ] Corrupted PDFs fall back gracefully to GS repair or OCR
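The white-out detection described in the proposal above can be illustrated with a small, dependency-free sketch. It assumes bounding boxes are `(x0, y0, x1, y1)` tuples and that the white rectangles have already been collected (for example from `page.get_drawings()`); the function names `iou` and `filter_covered` are hypothetical and not taken from the implementation.

```python
from typing import Iterable, List, Tuple

BBox = Tuple[float, float, float, float]  # (x0, y0, x1, y1)


def area(box: BBox) -> float:
    """Area of a box; degenerate or inverted boxes count as zero."""
    return max(0.0, box[2] - box[0]) * max(0.0, box[3] - box[1])


def iou(a: BBox, b: BBox) -> float:
    """Intersection over union of two axis-aligned boxes."""
    inter_box = (max(a[0], b[0]), max(a[1], b[1]), min(a[2], b[2]), min(a[3], b[3]))
    inter = area(inter_box)
    union = area(a) + area(b) - inter
    return inter / union if union > 0 else 0.0


def filter_covered(
    text_boxes: Iterable[BBox],
    white_rects: Iterable[BBox],
    threshold: float = 0.8,
) -> List[BBox]:
    """Keep only text boxes that no white rectangle overlaps at or above the threshold."""
    rects = list(white_rects)
    return [t for t in text_boxes if not any(iou(t, w) >= threshold for w in rects)]
```

One design point worth noting: when the covering rectangle is much larger than the text box, IoU stays low even if the text is fully hidden, so a coverage ratio (intersection divided by the text box area) may be a better fit in that case. Which measure the implementation uses is not stated here.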
--- a/openspec/changes/archive/2025-12-04-pdf-preprocessing-pipeline/tasks.md
+++ b/openspec/changes/archive/2025-12-04-pdf-preprocessing-pipeline/tasks.md
@@ -0,0 +1,93 @@
+# Tasks: PDF Preprocessing Pipeline
+
+## Phase 1: Object-level Cleaning (P0)
+
+### Step 1.1: Content Sanitization
+- [x] Add `page.clean_contents(sanitize=True)` to `_extract_page()`
+- [x] Add error handling for malformed content streams
+- [x] Add logging for sanitization actions
+
+### Step 1.2: Hidden Layer (OCG) Removal
+- [x] Implement `get_hidden_ocg_layers()` function
+- [ ] Add OCG content filtering during extraction (deferred - needs test case)
+- [x] Add configuration option `remove_hidden_layers`
+- [x] Add logging for removed layers
+
+### Step 1.3: White-out Detection
+- [x] Implement `detect_whiteout_covered_text()` with IoU calculation
+- [x] Add white rectangle detection from `page.get_drawings()`
+- [x] Integrate covered text filtering into extraction
+- [x] Add configuration option `whiteout_iou_threshold` (default 0.8)
+- [x] Add logging for detected white-out regions
+
+## Phase 2: Layout Analysis (P1)
+
+### Step 2.1: Column-aware Sorting
+- [x] Change `get_text()` calls to use `sort=True` parameter (already implemented)
+- [x] Verify reading order improvement on test documents
+- [ ] Add configuration option `column_aware_sort` (deferred - low priority)
+
+### Step 2.2: Element Classification
+- [ ] Implement `classify_element()` function (deferred - existing detection sufficient)
+- [x] Add position-based classification (header/footer/body) - via existing `_detect_headers_footers()`
+- [x] Add font-size-based classification (title detection) - via existing logic
+- [x] Add page number pattern detection `_is_page_number()`
+- [ ] Preserve classification in element metadata `_element_type` (deferred)
+
+### Step 2.3: Element Filtering
+- [x] Implement `filter_elements()` function - `_filter_page_numbers()`
+- [x] Add configuration options for filtering (page_numbers, headers, footers)
+- [x] Add logging for filtered elements
+
+## Phase 3: Enhanced Extraction (P1)
+
+### Step 3.1: Bbox Preservation
+- [x] Ensure all extracted elements retain bbox coordinates (already implemented)
+- [x] Add bbox to UnifiedDocument element metadata
+- [x] Verify bbox accuracy in generated output
+
+### Step 3.2: Garble Detection
+- [x] Implement `calculate_garble_rate()` function
+- [x] Detect `(cid:xxxx)` patterns
+- [x] Detect replacement characters (U+FFFD)
+- [x] Detect Private Use Area characters
+- [x] Add garble rate to page metadata
+
+### Step 3.3: OCR Fallback
+- [x] Implement `should_fallback_to_ocr()` decision function
+- [x] Add configuration option `ocr_fallback_threshold` (default 0.1)
+- [x] Add `get_pages_needing_ocr()` interface for callers
+- [x] Add `get_extraction_quality_report()` for quality metrics
+- [x] Add logging for fallback decisions
+
+## Phase 4: GS Distillation - Exception Handler (P2)
+
+### Step 0: GS Repair (Optional)
+- [x] Implement `should_trigger_gs_repair()` trigger detection
+- [x] Implement `repair_pdf_with_gs()` function
+- [x] Add `-dDetectDuplicateImages=true` option
+- [x] Add temporary file handling for repaired PDF
+- [x] Implement `is_ghostscript_available()` check
+- [x] Add `extract_with_repair()` method
+- [x] Add fallback to normal extraction if GS not available
+- [x] Add logging for GS repair actions
+
+## Testing
+
+### Unit Tests
+- [ ] Test white-out detection with synthetic PDF
+- [x] Test garble rate calculation
+- [ ] Test element classification accuracy
+- [x] Test page number pattern detection
+
+### Integration Tests
+- [x] Test with demo_docs/edit.pdf (3 pages)
+- [x] Test with demo_docs/edit2.pdf (1 page)
+- [x] Test with demo_docs/edit3.pdf (2 pages)
+- [x] Test quality report generation
+- [x] Test GS availability check
+- [x] Test end-to-end pipeline with real documents
+
+### Regression Tests
+- [x] Verify existing clean PDFs produce same output
+- [ ] Performance benchmark (<100ms overhead per page)
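The three garble signals listed under Step 3.2 above (`(cid:xxxx)` patterns, U+FFFD replacement characters, and Private Use Area characters) can be combined into a single rate. The sketch below is one possible reading, not the `calculate_garble_rate()` from this change: it assumes the rate is the share of characters belonging to any of the three signals, only checks the Basic Multilingual Plane PUA range (U+E000 to U+F8FF), and uses the hypothetical names `estimate_garble_rate` and `needs_ocr`.

```python
import re

# Matches unmapped-glyph placeholders such as "(cid:1234)".
CID_PATTERN = re.compile(r"\(cid:\d+\)")


def estimate_garble_rate(text: str) -> float:
    """Fraction of characters that look garbled, in the range 0.0 to 1.0."""
    if not text:
        return 0.0
    # Count every character of each "(cid:N)" placeholder as garbled.
    cid_chars = sum(len(m.group(0)) for m in CID_PATTERN.finditer(text))
    replacement_chars = text.count("\ufffd")
    # Assumption: only the BMP Private Use Area is checked.
    pua_chars = sum(1 for ch in text if 0xE000 <= ord(ch) <= 0xF8FF)
    return min(1.0, (cid_chars + replacement_chars + pua_chars) / len(text))


def needs_ocr(text: str, threshold: float = 0.1) -> bool:
    """True when the garble rate exceeds the fallback threshold."""
    return estimate_garble_rate(text) > threshold
```

With the default threshold of 0.1, matching `ocr_fallback_threshold`, a page whose text is 20% replacement characters would be flagged for OCR, while one at 5% would not.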