# PaddleOCR Layout Extraction and PDF Reconstruction: A Complete Approach

## 1. Architecture Overview

### 1.1 Choosing the Core Components

#### PP-StructureV3 (recommended default)
- **Version**: ships with PaddleOCR 3.x
- **Core capabilities**:
  - Layout detection
  - Table recognition
  - Formula recognition
  - Chart recognition
  - Reading order recovery
  - Markdown/JSON output

#### PaddleOCR-VL (vision-language model)
- **Model size**: 0.9B parameters
- **Highlights**:
  - Supports 109 languages
  - End-to-end document parsing
  - Lower resource consumption
  - Native recognition of complex elements

#### PP-OCRv5 (base OCR)
- A single model covers Simplified Chinese, Traditional Chinese, English, Japanese, and Pinyin
- Improved handwriting recognition
- Available in server and mobile configurations

### 1.2 Layout Analysis Capabilities

PP-StructureV3 supports **23 common layout categories**, including:
- Document title
- Paragraph title
- Body text
- Page number
- Abstract
- Table
- References
- Footnotes
- Header
- Footer
- Algorithm
- Formula
- Formula number
- Image
- Seal
- Figure/table title
- Chart
- Sidebar text
- Lists of references
- and others (this list is not exhaustive)

---

## 2. End-to-End Workflow

### 2.1 PP-StructureV3 Pipeline Architecture

```
Input (PDF/Image)
    ↓
Preprocessing
├── Document orientation classification
├── Document unwarping
└── Text line orientation classification
    ↓
Layout detection
├── Region detection
├── Category classification
└── Coordinate localization
    ↓
Per-region processing (in parallel)
├── Text regions    → PP-OCRv5 (text detection + recognition)
├── Table regions   → Table recognition (structure + HTML)
├── Formula regions → Formula recognition (LaTeX)
├── Chart regions   → Chart recognition (converted to tables)
└── Seal regions    → Seal recognition
    ↓
Postprocessing
├── Reading order sorting
├── Structured organization
└── Output formatting
    ↓
Output (Markdown/JSON; DOCX via the recovery step in Section 5)
```
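
The per-region stage of the diagram can be pictured as a dispatch table keyed by region category. The sketch below is illustrative only: the handler functions and the `raw`/`order` fields are hypothetical stand-ins, not PaddleOCR APIs, and the real pipeline runs model inference where these stubs format strings.

```python
# Hypothetical handlers: each one stands in for a real recognizer.
def recognize_text(region):
    return {"type": "text", "value": region["raw"]}


def recognize_table(region):
    return {"type": "table", "value": f"<table>{region['raw']}</table>"}


def recognize_formula(region):
    return {"type": "formula", "value": f"${region['raw']}$"}


HANDLERS = {
    "text": recognize_text,
    "table": recognize_table,
    "formula": recognize_formula,
}


def process_page(regions):
    """Route each region to its handler, then sort by reading order."""
    processed = []
    for region in regions:
        # Unknown categories fall back to plain text handling.
        handler = HANDLERS.get(region["type"], recognize_text)
        processed.append((region["order"], handler(region)))
    processed.sort(key=lambda pair: pair[0])
    return [result for _, result in processed]
```

Keeping the routing in a dictionary means a new region category only needs one new handler and one new entry.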

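### 2.2 Layout Reconstruction Strategy

```python
# Core idea: the information to capture per region for layout reconstruction.
# This is a descriptive schema, not the output of any PaddleOCR API.
layout_info = {
    "bbox": "(x, y, width, height)",
    "category": "text / table / image / formula",
    "content": {
        "text": "OCR result + coordinates",
        "table": "HTML structure + cell text",
        "image": "binary image data",
        "formula": "LaTeX expression",
    },
    "style": {
        "font": "size, weight, color",
        "alignment": "left / center / right",
        "spacing": "line spacing, paragraph spacing",
    },
    "reading_order": "global sort index",
}
```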
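
The schema above describes a box as `(x, y, width, height)`, while the extraction code later stores corner coordinates `[x1, y1, x2, y2]`. A minimal sketch of a typed region record that converts between the two follows; the class name `LayoutRegion` and its fields are assumptions made for illustration.

```python
from dataclasses import dataclass, field


@dataclass
class LayoutRegion:
    """One layout region; the bbox is stored as corner coordinates."""
    category: str
    bbox_xyxy: tuple
    reading_order: int
    content: dict = field(default_factory=dict)

    @classmethod
    def from_xywh(cls, category, x, y, width, height, reading_order, content=None):
        """Build a region from an (x, y, width, height) box."""
        return cls(category, (x, y, x + width, y + height), reading_order, content or {})

    @property
    def bbox_xywh(self):
        """Return the box as (x, y, width, height)."""
        x1, y1, x2, y2 = self.bbox_xyxy
        return (x1, y1, x2 - x1, y2 - y1)
```

Settling on one internal representation and converting at the edges avoids mixing the two conventions during reconstruction.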

---

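## 3. Installation and Environment Setup

### 3.1 Base Installation

```bash
# Install PaddlePaddle (GPU build).
# GPU wheels are tied to a CUDA version; follow the official PaddlePaddle
# install guide to pick the matching package index for your environment.
python -m pip install paddlepaddle-gpu

# Or the CPU build
python -m pip install paddlepaddle

# Install PaddleOCR (all features)
python -m pip install "paddleocr[all]"

# Or install only the document parsing features
python -m pip install "paddleocr[doc-parser]"
```

### 3.2 Dependencies

```bash
# Needed for layout recovery
pip install python-docx  # Word document generation
pip install PyMuPDF      # PDF handling (check the supported Python versions for your release)
pip install pdf2docx     # PDF to DOCX conversion

# Needed for PDF generation
pip install reportlab    # Custom PDF generation
pip install markdown     # Markdown processing
```

### 3.3 Optional Dependencies

```bash
# Optional features
pip install opencv-python           # Image processing
pip install Pillow                  # Image loading and saving
pip install lxml                    # HTML/XML parsing
pip install beautifulsoup4          # HTML parsing (used for table HTML)
```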
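
Several of these packages are imported under a different name than the one used with `pip`, which makes missing dependencies easy to overlook. The sketch below checks importability with the standard library only; the helper name `missing_packages` is an assumption, while the pip-to-module mapping reflects each package's documented import name.

```python
from importlib.util import find_spec

# pip package name -> top-level module name used in `import`
PIP_TO_MODULE = {
    "python-docx": "docx",
    "PyMuPDF": "fitz",
    "pdf2docx": "pdf2docx",
    "reportlab": "reportlab",
    "markdown": "markdown",
    "opencv-python": "cv2",
    "Pillow": "PIL",
    "lxml": "lxml",
    "beautifulsoup4": "bs4",
}


def missing_packages(pip_to_module=PIP_TO_MODULE):
    """Return the pip names whose modules cannot be found."""
    return [pip for pip, module in pip_to_module.items() if find_spec(module) is None]


if __name__ == "__main__":
    missing = missing_packages()
    if missing:
        print("Missing packages: pip install " + " ".join(missing))
    else:
        print("All optional dependencies are importable.")
```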

---

## 4. Core Implementation

### 4.1 Using PP-StructureV3 (recommended)

```python
import json
from pathlib import Path

from paddleocr import PPStructureV3


class DocumentLayoutExtractor:
    """Extract layout information from a document with PP-StructureV3."""

    # Block labels follow the PaddleOCR 3.x result schema; verify them
    # against the version you have installed.
    TABLE_LABELS = {"table"}
    FORMULA_LABELS = {"formula"}
    IMAGE_LABELS = {"image", "figure", "chart"}

    def __init__(self, device="gpu"):
        """Initialize PP-StructureV3 (device is e.g. "gpu" or "cpu")."""
        self.engine = PPStructureV3(
            # Document preprocessing
            use_doc_orientation_classify=True,   # document orientation
            use_doc_unwarping=True,              # document unwarping
            use_textline_orientation=True,       # text line orientation

            # Feature switches
            use_seal_recognition=True,
            use_table_recognition=True,
            use_formula_recognition=True,
            use_chart_recognition=True,

            # OCR model (PaddleOCR 3.x model names carry no "ch_" prefix)
            text_recognition_model_name="PP-OCRv4_server_rec",
            # text_recognition_model_name="en_PP-OCRv4_mobile_rec",  # English

            # Layout detection tuning
            layout_threshold=0.5,                # score threshold
            layout_nms=True,                     # NMS is an on/off switch in 3.x
            layout_unclip_ratio=1.5,             # box expansion ratio

            device=device,
        )

    def extract_layout(self, input_path, output_dir="output"):
        """
        Extract layout information for every page.

        Args:
            input_path: path to a PDF or image
            output_dir: output directory

        Returns:
            list: structured result for each page
        """
        output_path = Path(output_dir)
        output_path.mkdir(parents=True, exist_ok=True)

        results = self.engine.predict(str(input_path))
        all_pages_data = []

        for page_idx, result in enumerate(results):
            # result.json is a dict; the payload may be wrapped in a "res" key.
            data = result.json
            data = data.get("res", data)

            page_data = {"page": page_idx + 1, "regions": [], "images": []}

            # parsing_res_list is already sorted in reading order.
            for order, block in enumerate(data.get("parsing_res_list", [])):
                label = block.get("block_label")
                text = block.get("block_content", "")

                if label in self.TABLE_LABELS:
                    content = {"html": text, "text": ""}
                elif label in self.FORMULA_LABELS:
                    content = {"latex": text}
                elif label in self.IMAGE_LABELS:
                    # Mapping a block to its cropped file is version-specific
                    # and is left out here; see page_data["images"].
                    content = {"image_path": None, "text": text}
                else:
                    content = {"text": text}

                page_data["regions"].append({
                    "type": label,
                    "bbox": block.get("block_bbox"),   # [x1, y1, x2, y2]
                    "content": content,
                    "reading_order": order,
                })

            # Save the cropped images referenced by the page's Markdown.
            md_images = result.markdown.get("markdown_images") or {}
            for rel_path, image in md_images.items():
                img_rel = Path(f"page_{page_idx + 1}") / rel_path
                img_path = output_path / img_rel
                img_path.parent.mkdir(parents=True, exist_ok=True)
                image.save(img_path)
                page_data["images"].append(str(img_rel))

            all_pages_data.append(page_data)

            # Save this page's JSON
            json_path = output_path / f"page_{page_idx + 1}.json"
            with open(json_path, "w", encoding="utf-8") as f:
                json.dump(page_data, f, ensure_ascii=False, indent=2)

            print(f"✓ Processed page {page_idx + 1}")

        # Save the full document structure
        full_json = output_path / "document_structure.json"
        with open(full_json, "w", encoding="utf-8") as f:
            json.dump(all_pages_data, f, ensure_ascii=False, indent=2)

        return all_pages_data

    def export_to_markdown(self, input_path, output_dir="output"):
        """Export Markdown using PP-StructureV3's built-in writer."""
        output_path = Path(output_dir)
        output_path.mkdir(parents=True, exist_ok=True)

        for result in self.engine.predict(str(input_path)):
            result.save_to_markdown(save_path=str(output_path))

        print(f"✓ Markdown written to: {output_path}")


# Usage example
if __name__ == "__main__":
    extractor = DocumentLayoutExtractor(device="gpu")

    # Extract layout information
    layout_data = extractor.extract_layout(
        input_path="document.pdf",
        output_dir="output/layout",
    )

    # Export Markdown (this runs inference a second time)
    extractor.export_to_markdown(
        input_path="document.pdf",
        output_dir="output/markdown",
    )
```
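
When reading-order information is missing or untrustworthy, a geometric fallback can order regions from their bounding boxes alone. The sketch below is a simple single-column heuristic (top to bottom, then left to right within a row) and is not the algorithm PP-StructureV3 uses; the function name and the `row_tolerance` parameter are assumptions.

```python
def sort_reading_order(regions, row_tolerance=10):
    """Order regions top-to-bottom, then left-to-right within a row.

    Each region is a dict with "bbox" = [x1, y1, x2, y2]. Regions whose top
    edges lie within row_tolerance pixels of a row's first region are
    treated as belonging to that row.
    """
    by_top = sorted(regions, key=lambda r: (r["bbox"][1], r["bbox"][0]))
    rows = []
    for region in by_top:
        if rows and abs(region["bbox"][1] - rows[-1][0]["bbox"][1]) <= row_tolerance:
            rows[-1].append(region)
        else:
            rows.append([region])
    return [r for row in rows for r in sorted(row, key=lambda r: r["bbox"][0])]
```

Multi-column pages need column detection before a sort like this, which is one reason to prefer the pipeline's own ordering when it is available.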

### 4.2 Using PaddleOCR-VL (more concise)

```python
from pathlib import Path

from paddleocr import PaddleOCRVL


class SimpleDocumentParser:
    """A simplified approach built on PaddleOCR-VL."""

    def __init__(self):
        self.pipeline = PaddleOCRVL(
            use_doc_orientation_classify=True,
            use_doc_unwarping=True,
            use_layout_detection=True,
            use_chart_recognition=True,
        )

    def parse_document(self, input_path, output_dir="output"):
        """Parse a document in one call."""
        output_path = Path(output_dir)
        output_path.mkdir(parents=True, exist_ok=True)

        # Run the parser
        results = self.pipeline.predict(input_path)

        for res in results:
            # Print the structured output
            res.print()

            # Save JSON
            res.save_to_json(save_path=str(output_path))

            # Save Markdown
            res.save_to_markdown(save_path=str(output_path))

        print(f"✓ Parsing finished, results saved to: {output_path}")

    def parse_pdf_to_single_markdown(self, pdf_path, output_path="output"):
        """Convert a whole PDF into a single Markdown file."""
        output_dir = Path(output_path)
        output_dir.mkdir(parents=True, exist_ok=True)

        # Parse the PDF
        results = self.pipeline.predict(pdf_path)

        markdown_list = []
        markdown_images = []

        # Collect the Markdown of every page
        for res in results:
            md_info = res.markdown
            markdown_list.append(md_info)
            markdown_images.append(md_info.get("markdown_images", {}))

        # Merge all pages
        markdown_text = self.pipeline.concatenate_markdown_pages(markdown_list)

        # Save the Markdown file
        md_file = output_dir / f"{Path(pdf_path).stem}.md"
        with open(md_file, "w", encoding="utf-8") as f:
            f.write(markdown_text)

        # Save the images the Markdown refers to
        for item in markdown_images:
            if item:
                for path, image in item.items():
                    img_path = output_dir / path
                    img_path.parent.mkdir(parents=True, exist_ok=True)
                    image.save(img_path)

        print(f"✓ Generated single Markdown file: {md_file}")
        return md_file


# Usage example
if __name__ == "__main__":
    parser = SimpleDocumentParser()

    # Simple parsing
    parser.parse_document("document.pdf", "output")

    # Generate a single Markdown file
    parser.parse_pdf_to_single_markdown("document.pdf", "output")
```

---

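## 5. PDF Reconstruction and Generation

### 5.1 Layout Recovery Strategies

The PP-Structure CLI in PaddleOCR 2.x provides two layout recovery methods. The `--type=structure --recovery` flags below belong to that older CLI; PaddleOCR 3.x replaced it with subcommands such as `paddleocr pp_structurev3`, so check `paddleocr --help` for the version you have installed.

#### Method 1: Standard PDF parsing (for PDFs with selectable text)
```bash
# Convert directly with pdf2docx
paddleocr --image_dir=document.pdf \
          --type=structure \
          --recovery=true \
          --use_pdf2docx_api=true
```

- **Pros**: fast, preserves the original formatting
- **Cons**: only works for standard PDFs; it does not work for scanned documents

#### Method 2: Image-based PDF parsing (general approach)
```bash
# Run the full OCR pipeline
paddleocr --image_dir=document.pdf \
          --type=structure \
          --recovery=true \
          --use_pdf2docx_api=false
```

- **Pros**: works for scanned documents and complex layouts
- **Cons**: slower, needs more compute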
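
Choosing between the two methods comes down to whether the PDF carries a usable text layer. A minimal sketch of that decision follows, assuming the caller has already extracted the text of each page; the function name and both thresholds are illustrative assumptions, not values taken from PaddleOCR.

```python
def classify_pdf(page_texts, min_chars_per_page=20, min_text_ratio=0.8):
    """Return "text" if most pages carry a text layer, else "scanned".

    page_texts: one extracted-text string per page.
    min_chars_per_page: a page counts as textual at or above this length.
    min_text_ratio: fraction of textual pages needed to call the PDF "text".
    """
    if not page_texts:
        return "scanned"
    textual = sum(1 for text in page_texts if len(text.strip()) >= min_chars_per_page)
    return "text" if textual / len(page_texts) >= min_text_ratio else "scanned"
```

With PyMuPDF, the input can be built as `[page.get_text() for page in fitz.open(path)]`. A result of `"text"` points to Method 1 and `"scanned"` to Method 2.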

### 5.2 Custom PDF Generation

```python
import json
from html import escape
from pathlib import Path

from bs4 import BeautifulSoup
from reportlab.lib import colors
from reportlab.lib.enums import TA_CENTER, TA_LEFT
from reportlab.lib.pagesizes import A4
from reportlab.lib.styles import ParagraphStyle, getSampleStyleSheet
from reportlab.lib.units import mm
from reportlab.pdfbase import pdfmetrics
from reportlab.pdfbase.cidfonts import UnicodeCIDFont
from reportlab.platypus import (
    Image, PageBreak, Paragraph, SimpleDocTemplate, Spacer, Table, TableStyle,
)

# ReportLab's default fonts cannot render CJK text, so register a built-in CID font.
# "STSong-Light" covers Simplified Chinese; use "MSung-Light" for Traditional Chinese.
CJK_FONT = "STSong-Light"
pdfmetrics.registerFont(UnicodeCIDFont(CJK_FONT))


class PDFLayoutRecovery:
    """Rebuild a PDF from extracted layout information."""

    def __init__(self, layout_json_path):
        """
        Args:
            layout_json_path: path to the layout JSON file
        """
        layout_json_path = Path(layout_json_path)
        with open(layout_json_path, "r", encoding="utf-8") as f:
            self.layout_data = json.load(f)

        # Image paths in the JSON are relative to the JSON file's directory.
        self.base_dir = layout_json_path.parent
        self.styles = getSampleStyleSheet()
        self._create_custom_styles()

    def _create_custom_styles(self):
        """Create the custom paragraph styles."""
        # Document title
        self.styles.add(ParagraphStyle(
            name="CustomTitle",
            parent=self.styles["Heading1"],
            fontName=CJK_FONT,
            fontSize=18,
            textColor=colors.HexColor("#1a1a1a"),
            spaceAfter=12,
            alignment=TA_CENTER,
        ))

        # Section heading
        self.styles.add(ParagraphStyle(
            name="CustomHeading",
            parent=self.styles["Heading2"],
            fontName=CJK_FONT,
            fontSize=14,
            textColor=colors.HexColor("#333333"),
            spaceAfter=8,
            spaceBefore=8,
        ))

        # Body text
        self.styles.add(ParagraphStyle(
            name="CustomBody",
            parent=self.styles["Normal"],
            fontName=CJK_FONT,
            fontSize=11,
            leading=16,
            textColor=colors.HexColor("#000000"),
            alignment=TA_LEFT,
        ))

    def _add_paragraph(self, story, content, style_name, gap_mm):
        """Append an escaped paragraph plus a spacer if there is any text."""
        text = self._extract_text_from_ocr(content)
        if text:
            # Escape &, < and > so OCR text is not parsed as markup.
            story.append(Paragraph(escape(text, quote=False), self.styles[style_name]))
            story.append(Spacer(1, gap_mm * mm))

    def generate_pdf(self, output_path="output.pdf"):
        """Generate the PDF."""
        doc = SimpleDocTemplate(
            str(output_path),
            pagesize=A4,
            rightMargin=20 * mm,
            leftMargin=20 * mm,
            topMargin=20 * mm,
            bottomMargin=20 * mm,
        )

        story = []
        last_index = len(self.layout_data) - 1

        for index, page_data in enumerate(self.layout_data):
            for region in page_data.get("regions", []):
                region_type = region.get("type")
                content = region.get("content", {})

                # "doc_title" and "image" are the PaddleOCR 3.x label names.
                if region_type in ["title", "document_title", "doc_title"]:
                    self._add_paragraph(story, content, "CustomTitle", 3)

                elif region_type in ["text", "paragraph"]:
                    self._add_paragraph(story, content, "CustomBody", 2)

                elif region_type in ["paragraph_title", "heading"]:
                    self._add_paragraph(story, content, "CustomHeading", 2)

                elif region_type == "table":
                    table_element = self._create_table_from_html(content)
                    if table_element:
                        story.append(table_element)
                        story.append(Spacer(1, 3 * mm))

                elif region_type in ["figure", "image"]:
                    img_path = content.get("image_path")
                    img_file = self.base_dir / img_path if img_path else None
                    if img_file and img_file.exists():
                        try:
                            story.append(Image(str(img_file), width=150 * mm,
                                               height=100 * mm, kind="proportional"))
                            story.append(Spacer(1, 3 * mm))
                        except Exception as e:
                            print(f"⚠ Skipped image {img_file}: {e}")

                elif region_type == "formula":
                    # Formulas are shown as raw LaTeX in a code-style paragraph.
                    latex = content.get("latex", "")
                    if latex:
                        story.append(Paragraph(
                            f"<font name='Courier'>{escape(latex, quote=False)}</font>",
                            self.styles["Code"]))
                        story.append(Spacer(1, 2 * mm))

            # Page break between pages, but not after the last one
            if index < last_index:
                story.append(PageBreak())

        doc.build(story)
        print(f"✓ PDF generated: {output_path}")

    def _extract_text_from_ocr(self, content):
        """Extract plain text from an OCR result."""
        value = content.get("text")
        if isinstance(value, str):
            return value
        if isinstance(value, list):
            texts = []
            for item in value:
                if isinstance(item, dict) and "text" in item:
                    texts.append(item["text"])
                elif isinstance(item, (list, tuple)) and len(item) >= 2:
                    texts.append(item[1])  # (bbox, text, confidence) format
            return " ".join(texts)
        return ""

    def _create_table_from_html(self, content):
        """Build a table from the HTML, falling back to tab-separated text."""
        rows = []
        html = content.get("html", "")
        if html:
            soup = BeautifulSoup(html, "html.parser")
            for tr in soup.find_all("tr"):
                cells = [c.get_text(strip=True) for c in tr.find_all(["td", "th"])]
                if cells:
                    rows.append(cells)
        if not rows and content.get("text"):
            rows = [r.split("\t") for r in content["text"].split("\n") if r.strip()]
        if not rows:
            return None

        # Pad ragged rows; merged cells (rowspan/colspan) are not reconstructed.
        width = max(len(r) for r in rows)
        rows = [r + [""] * (width - len(r)) for r in rows]

        table = Table(rows)
        table.setStyle(TableStyle([
            ("FONTNAME", (0, 0), (-1, -1), CJK_FONT),
            ("BACKGROUND", (0, 0), (-1, 0), colors.grey),
            ("TEXTCOLOR", (0, 0), (-1, 0), colors.whitesmoke),
            ("ALIGN", (0, 0), (-1, -1), "CENTER"),
            ("FONTSIZE", (0, 0), (-1, 0), 10),
            ("BOTTOMPADDING", (0, 0), (-1, 0), 8),
            ("BACKGROUND", (0, 1), (-1, -1), colors.beige),
            ("GRID", (0, 0), (-1, -1), 0.5, colors.black),
        ]))
        return table


# Usage example
if __name__ == "__main__":
    # Extract the layout first. DocumentLayoutExtractor is the class from
    # Section 4.1 and must be defined or imported in this module.
    extractor = DocumentLayoutExtractor()
    layout_data = extractor.extract_layout("document.pdf", "output/layout")

    # Generate the PDF
    pdf_recovery = PDFLayoutRecovery("output/layout/document_structure.json")
    pdf_recovery.generate_pdf("output/recovered_document.pdf")
```
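
The flow-based generator above ignores the stored `bbox` values. Placing elements at their original positions would require converting OCR pixel coordinates (origin at the top-left, y pointing down) into PDF points (origin at the bottom-left, y pointing up). A minimal sketch of that conversion follows; the function name is an assumption, and the default page size is A4 in points, rounded.

```python
def pixel_bbox_to_pdf(bbox, image_size, page_size=(595.28, 841.89)):
    """Convert [x1, y1, x2, y2] in image pixels to (x, y, width, height) in points.

    The returned (x, y) is the lower-left corner of the box in PDF space.
    """
    img_w, img_h = image_size
    page_w, page_h = page_size
    scale_x = page_w / img_w
    scale_y = page_h / img_h

    x1, y1, x2, y2 = bbox
    width = (x2 - x1) * scale_x
    height = (y2 - y1) * scale_y
    x = x1 * scale_x
    # Flip the vertical axis: the box bottom (y2) measured up from the page bottom.
    y = page_h - y2 * scale_y
    return (x, y, width, height)
```

The resulting tuple matches what ReportLab's canvas drawing calls such as `canvas.drawImage(path, x, y, width, height)` expect.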

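### 5.3 Generating a Word Document with python-docx

```python
import json
from pathlib import Path

from bs4 import BeautifulSoup
from docx import Document
from docx.enum.text import WD_ALIGN_PARAGRAPH
from docx.shared import Inches


class DOCXLayoutRecovery:
    """Rebuild a DOCX file from extracted layout information."""

    def __init__(self, layout_json_path):
        layout_json_path = Path(layout_json_path)
        with open(layout_json_path, "r", encoding="utf-8") as f:
            self.layout_data = json.load(f)

        # Image paths in the JSON are relative to the JSON file's directory.
        self.base_dir = layout_json_path.parent
        self.doc = Document()

    def generate_docx(self, output_path="output.docx"):
        """Generate the DOCX file."""
        last_index = len(self.layout_data) - 1

        for index, page_data in enumerate(self.layout_data):
            for region in page_data.get("regions", []):
                region_type = region.get("type")
                content = region.get("content", {})

                # "doc_title" and "image" are the PaddleOCR 3.x label names.
                if region_type in ["title", "document_title", "doc_title"]:
                    text = self._extract_text(content)
                    if text:
                        heading = self.doc.add_heading(text, level=1)
                        heading.alignment = WD_ALIGN_PARAGRAPH.CENTER

                elif region_type == "text":
                    text = self._extract_text(content)
                    if text:
                        para = self.doc.add_paragraph(text)
                        para.paragraph_format.first_line_indent = Inches(0.5)

                elif region_type == "paragraph_title":
                    text = self._extract_text(content)
                    if text:
                        self.doc.add_heading(text, level=2)

                elif region_type == "table":
                    # Simplified: merged cells are not reconstructed.
                    rows = self._table_rows(content)
                    if rows:
                        cols = max(len(r) for r in rows)
                        table = self.doc.add_table(rows=len(rows), cols=cols)
                        table.style = "Light Grid Accent 1"
                        for i, row_data in enumerate(rows):
                            for j, cell_data in enumerate(row_data):
                                table.rows[i].cells[j].text = cell_data

                elif region_type in ["figure", "image"]:
                    img_path = content.get("image_path")
                    img_file = self.base_dir / img_path if img_path else None
                    if img_file and img_file.exists():
                        try:
                            self.doc.add_picture(str(img_file), width=Inches(5))
                        except Exception as e:
                            print(f"⚠ Skipped image {img_file}: {e}")

            # Page break between pages, but not after the last one
            if index < last_index:
                self.doc.add_page_break()

        self.doc.save(output_path)
        print(f"✓ DOCX generated: {output_path}")

    def _table_rows(self, content):
        """Return table rows from HTML, falling back to tab-separated text."""
        rows = []
        html = content.get("html", "")
        if html:
            soup = BeautifulSoup(html, "html.parser")
            for tr in soup.find_all("tr"):
                cells = [c.get_text(strip=True) for c in tr.find_all(["td", "th"])]
                if cells:
                    rows.append(cells)
        if not rows and content.get("text"):
            rows = [r.split("\t") for r in content["text"].split("\n") if r.strip()]
        return rows

    def _extract_text(self, content):
        """Extract plain text from an OCR result."""
        value = content.get("text")
        if isinstance(value, str):
            return value
        if isinstance(value, list):
            texts = []
            for item in value:
                if isinstance(item, dict):
                    texts.append(item.get("text", ""))
                elif isinstance(item, (list, tuple)) and len(item) >= 2:
                    texts.append(item[1])
            return " ".join(texts)
        return ""
```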
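
Table recognition returns HTML, so rebuilding a table in DOCX or PDF starts with turning that HTML into a rectangular grid of strings. The sketch below uses only the standard library and expands `colspan` into empty filler cells; `rowspan` is deliberately not handled, and the names `TableGridParser` and `html_table_to_rows` are assumptions made for illustration.

```python
from html.parser import HTMLParser


class TableGridParser(HTMLParser):
    """Collect table cells row by row, expanding colspan into empty cells."""

    def __init__(self):
        super().__init__()
        self.rows = []
        self._row = None    # cells of the row being read, or None outside <tr>
        self._cell = None   # text fragments of the cell being read
        self._span = 1

    def handle_starttag(self, tag, attrs):
        if tag == "tr":
            self._row = []
        elif tag in ("td", "th") and self._row is not None:
            self._cell = []
            self._span = int(dict(attrs).get("colspan") or 1)

    def handle_data(self, data):
        if self._cell is not None:
            self._cell.append(data)

    def handle_endtag(self, tag):
        if tag in ("td", "th") and self._cell is not None:
            text = "".join(self._cell).strip()
            self._row.extend([text] + [""] * (self._span - 1))
            self._cell = None
        elif tag == "tr" and self._row is not None:
            self.rows.append(self._row)
            self._row = None


def html_table_to_rows(html):
    """Parse table HTML into equal-length rows of strings."""
    parser = TableGridParser()
    parser.feed(html)
    parser.close()
    width = max((len(row) for row in parser.rows), default=0)
    return [row + [""] * (width - len(row)) for row in parser.rows]
```

Equal-length rows matter because both `python-docx` and ReportLab index cells by row and column.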

---

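## 6. Advanced Configuration and Tuning

### 6.1 Performance Tuning

```python
from paddleocr import PPStructureV3

# Lightweight configuration (CPU)
extractor = PPStructureV3(
    device="cpu",

    # Use the mobile models (PaddleOCR 3.x names carry no "ch_" prefix)
    text_detection_model_name="PP-OCRv4_mobile_det",
    text_recognition_model_name="PP-OCRv4_mobile_rec",

    # Turn off the heavier modules
    use_chart_recognition=False,    # chart recognition is slow
    use_formula_recognition=False,  # formula recognition needs a larger model

    # Trade coverage for speed
    layout_threshold=0.6,           # keep only higher-confidence layout boxes
    text_det_limit_side_len=960,    # smaller detection input size
)

# High-accuracy configuration (GPU)
extractor = PPStructureV3(
    device="gpu",

    # Use the server models
    text_detection_model_name="PP-OCRv4_server_det",
    text_recognition_model_name="PP-OCRv4_server_rec",

    # Enable every module
    use_chart_recognition=True,
    use_formula_recognition=True,
    use_seal_recognition=True,

    # Favor recall
    layout_threshold=0.3,           # lower threshold keeps more candidate regions
    text_det_limit_side_len=1920,   # larger detection input size
)
```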

### 6.2 Batch Processing

```python
import threading
from concurrent.futures import ThreadPoolExecutor, as_completed
from pathlib import Path

from paddleocr import PPStructureV3


class BatchDocumentProcessor:
    """Batch document processor."""

    def __init__(self, max_workers=4):
        self.max_workers = max_workers
        # One pipeline per worker thread: sharing a single pipeline instance
        # across threads is not assumed to be safe. Memory use grows with
        # max_workers, so keep it small on limited hardware.
        self._local = threading.local()

    def _get_extractor(self):
        if not hasattr(self._local, "extractor"):
            self._local.extractor = PPStructureV3()
        return self._local.extractor

    def process_directory(self, input_dir, output_dir="output", file_types=None):
        """
        Process every supported document under a directory.

        Args:
            input_dir: input directory
            output_dir: output directory
            file_types: accepted suffixes, default ['.pdf', '.png', '.jpg', '.jpeg']
        """
        if file_types is None:
            file_types = ['.pdf', '.png', '.jpg', '.jpeg']
        wanted = {suffix.lower() for suffix in file_types}

        input_path = Path(input_dir)
        output_path = Path(output_dir)
        output_path.mkdir(parents=True, exist_ok=True)

        # Collect the files to process (suffix match is case-insensitive)
        files = sorted(
            p for p in input_path.rglob("*")
            if p.is_file() and p.suffix.lower() in wanted
        )

        print(f"Found {len(files)} files to process")

        # Process with a thread pool
        with ThreadPoolExecutor(max_workers=self.max_workers) as executor:
            futures = {
                executor.submit(self._process_single_file, file, output_path): file
                for file in files
            }

            for future in as_completed(futures):
                file = futures[future]
                try:
                    future.result()
                    print(f"✓ Done: {file.name}")
                except Exception as e:
                    print(f"✗ Failed: {file.name} - {e}")

    def _process_single_file(self, file_path, output_dir):
        """Process a single file."""
        file_output_dir = output_dir / file_path.stem
        file_output_dir.mkdir(parents=True, exist_ok=True)

        # Extract the layout
        results = self._get_extractor().predict(str(file_path))

        # Result objects are not plain dicts, so use their own writers
        # instead of json.dump.
        for result in results:
            result.save_to_json(save_path=str(file_output_dir))
            result.save_to_markdown(save_path=str(file_output_dir))

        return file_output_dir


# Usage example
if __name__ == "__main__":
    processor = BatchDocumentProcessor(max_workers=4)
    processor.process_directory(
        input_dir="documents",
        output_dir="output",
        file_types=['.pdf'],
    )
```
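
Two details of batch runs are easy to get wrong: matching file suffixes regardless of case, and keeping output folders distinct when files in different subdirectories (or with different extensions) share a stem. The following is a small standard-library sketch of both; the helper names are assumptions.

```python
from pathlib import Path


def collect_files(input_dir, suffixes=(".pdf", ".png", ".jpg", ".jpeg")):
    """Recursively list files whose suffix matches, ignoring case."""
    wanted = {s.lower() for s in suffixes}
    return sorted(
        p for p in Path(input_dir).rglob("*")
        if p.is_file() and p.suffix.lower() in wanted
    )


def output_dir_for(file_path, input_dir, output_dir):
    """Mirror the input tree and keep the extension in the folder name.

    documents/sub/a.pdf -> output/sub/a_pdf
    """
    rel = Path(file_path).relative_to(input_dir)
    folder = f"{rel.stem}_{rel.suffix.lstrip('.').lower()}"
    return Path(output_dir) / rel.parent / folder
```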

---

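## 7. Practical Scenarios

### 7.1 Academic Papers

```python
import json
from pathlib import Path

from paddleocr import PPStructureV3


def iter_blocks(result):
    """Yield (label, content) for each parsed block of one page result.

    Assumes the PaddleOCR 3.x schema (parsing_res_list entries carrying
    block_label and block_content); verify against your installed version.
    """
    data = result.json
    data = data.get("res", data)
    for block in data.get("parsing_res_list", []):
        yield block.get("block_label"), block.get("block_content", "")


class AcademicPaperProcessor:
    """Processor tailored to academic papers."""

    def __init__(self):
        self.extractor = PPStructureV3(
            use_formula_recognition=True,  # formula recognition
            use_table_recognition=True,    # table recognition
            use_chart_recognition=True,    # chart recognition
        )

    def process_paper(self, pdf_path, output_dir="output"):
        """
        Process an academic paper:
        - extract the title, abstract, and section headings
        - recognize formulas as LaTeX
        - extract tables and figures
        """
        results = self.extractor.predict(pdf_path)

        paper_structure = {
            "title": "",
            "abstract": "",
            "sections": [],
            "formulas": [],
            "tables": [],
            "figures": [],
        }

        for result in results:
            for label, content in iter_blocks(result):
                if label in ("doc_title", "document_title"):
                    # Keep the first title found
                    paper_structure["title"] = paper_structure["title"] or content

                elif label == "abstract":
                    paper_structure["abstract"] = content

                elif label == "paragraph_title":
                    paper_structure["sections"].append(content)

                elif label == "formula":
                    if content:
                        paper_structure["formulas"].append(content)

                elif label == "table":
                    paper_structure["tables"].append(content)

                elif label in ("image", "figure", "chart"):
                    paper_structure["figures"].append(content)

        # Save the structured result
        output_path = Path(output_dir)
        output_path.mkdir(parents=True, exist_ok=True)

        with open(output_path / "paper_structure.json", "w", encoding="utf-8") as f:
            json.dump(paper_structure, f, ensure_ascii=False, indent=2)

        return paper_structure
```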

### 7.2 Business Documents

```python
from paddleocr import PPStructureV3


class BusinessDocumentProcessor:
    """Processor for business documents."""

    def __init__(self):
        self.extractor = PPStructureV3(
            use_seal_recognition=True,   # seal recognition
            use_table_recognition=True,  # table recognition
        )

    def process_invoice(self, pdf_path):
        """Process invoices, contracts, and similar business documents."""
        results = self.extractor.predict(pdf_path)

        doc_info = {
            "text_content": [],
            "tables": [],
            "seals": [],
        }

        for result in results:
            # Assumes the PaddleOCR 3.x result schema; the payload may be
            # wrapped in a "res" key depending on the version.
            data = result.json
            data = data.get("res", data)

            for block in data.get("parsing_res_list", []):
                label = block.get("block_label")
                content = block.get("block_content", "")

                if label == "seal":
                    doc_info["seals"].append(content)

                elif label == "table":
                    doc_info["tables"].append(content)

                elif label == "text":
                    doc_info["text_content"].append(content)

        return doc_info
```

---

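## 8. Common Problems and Fixes

### 8.1 Running Out of Memory

```python
import os

import fitz  # PyMuPDF
from paddleocr import PPStructureV3


# Option 1: process the PDF in page batches
def process_large_pdf(pdf_path, batch_size=10):
    # Create the pipeline once; rebuilding it per batch reloads every model.
    extractor = PPStructureV3()

    doc = fitz.open(pdf_path)
    try:
        total_pages = len(doc)

        for start_idx in range(0, total_pages, batch_size):
            end_idx = min(start_idx + batch_size, total_pages)

            # Write this batch of pages to a temporary PDF
            temp_pdf = fitz.open()
            temp_pdf.insert_pdf(doc, from_page=start_idx, to_page=end_idx - 1)
            temp_path = f"temp_batch_{start_idx}_{end_idx}.pdf"
            temp_pdf.save(temp_path)
            temp_pdf.close()

            try:
                # predict_iter(), where available, yields results lazily
                # and can lower peak memory further.
                results = extractor.predict(temp_path)

                # Handle the results here...
            finally:
                # Clean up even if processing fails
                os.remove(temp_path)
    finally:
        doc.close()


# Option 2: use the lightweight models
extractor = PPStructureV3(
    text_detection_model_name="PP-OCRv4_mobile_det",
    text_recognition_model_name="PP-OCRv4_mobile_rec",
)
```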

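### 8.2 Improving Throughput

```python
from paddleocr import PPStructureV3

# Option 1: run only the modules you need
extractor = PPStructureV3(
    use_chart_recognition=False,     # turn off chart recognition
    use_formula_recognition=False,   # turn off formula recognition
    use_seal_recognition=False,      # turn off seal recognition
)

# Option 2: lower the detection input size
extractor = PPStructureV3(
    text_det_limit_side_len=960,  # the default differs between versions
)

# Option 3: enable MKL-DNN (CPU acceleration)
extractor = PPStructureV3(
    enable_mkldnn=True,  # speeds up inference on CPU
)
```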

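### 8.3 Recognition Accuracy

```python
from paddleocr import PPStructureV3

# Option 1: use the higher-accuracy server models
extractor = PPStructureV3(
    text_detection_model_name="PP-OCRv4_server_det",
    text_recognition_model_name="PP-OCRv4_server_rec",
)

# Option 2: tune the detection parameters
# (defaults differ between versions; compare against your installed config)
extractor = PPStructureV3(
    text_det_thresh=0.3,           # pixel threshold: lower detects fainter text
    text_det_box_thresh=0.5,       # box score threshold: lower keeps more boxes
    text_det_unclip_ratio=1.8,     # larger values expand each text box
)

# Option 3: enable preprocessing
extractor = PPStructureV3(
    use_doc_orientation_classify=True,  # correct document orientation
    use_doc_unwarping=True,             # unwarp the document image
    use_textline_orientation=True,      # correct text line orientation
)
```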

---

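## 9. Best Practices

### 9.1 Choosing the Right Approach

| Scenario | Recommended approach | Reason |
|------|---------|------|
| Standard PDF (selectable text) | pdf2docx | Fastest, preserves formatting well |
| Scanned documents | PP-StructureV3 | Full OCR plus layout analysis |
| Complex layouts | PaddleOCR-VL | End-to-end, high accuracy |
| Academic papers | PP-StructureV3 + formula recognition | Outputs formulas as LaTeX |
| Business contracts | PP-StructureV3 + seal recognition | Seal detection is required |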
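
The table can be encoded as a small decision function. This sketch is one possible reading of it: the precedence between rows (for example, treating a complex layout as overriding the other flags) is an assumption, and the returned strings are labels rather than API names.

```python
def recommend_pipeline(has_text_layer, complex_layout=False,
                       has_formulas=False, has_seals=False):
    """Pick an approach following the table; row precedence is assumed."""
    needs_recognition = complex_layout or has_formulas or has_seals
    if has_text_layer and not needs_recognition:
        return "pdf2docx"
    if complex_layout:
        return "PaddleOCR-VL"

    extras = []
    if has_formulas:
        extras.append("formula recognition")
    if has_seals:
        extras.append("seal recognition")
    return "PP-StructureV3" + "".join(f" + {extra}" for extra in extras)
```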

### 9.2 Quality Assurance for Layout Reconstruction

```python
class QualityAssurance:
    """Quality checks for layout reconstruction."""

    @staticmethod
    def check_text_coverage(ocr_results, min_confidence=0.7):
        """Check OCR confidence; items are (bbox, text, confidence) tuples."""
        low_confidence_items = []

        for item in ocr_results:
            if isinstance(item, (list, tuple)) and len(item) >= 3:
                text, confidence = item[1], item[2]
                if confidence < min_confidence:
                    low_confidence_items.append({
                        "text": text,
                        "confidence": confidence,
                    })

        if low_confidence_items:
            print(f"⚠ Found {len(low_confidence_items)} low-confidence results")
            return False
        return True

    @staticmethod
    def validate_layout_structure(layout_data):
        """Validate the completeness of the layout structure."""
        issues = []

        for page_idx, page in enumerate(layout_data):
            regions = page.get("regions", [])

            # Check that the page has a title
            if not any(r.get("type") in ["title", "document_title"] for r in regions):
                issues.append(f"Page {page_idx + 1} has no title")

            # Check the reading order for duplicates
            orders = [r.get("reading_order", 0) for r in regions]
            if len(orders) != len(set(orders)):
                issues.append(f"Page {page_idx + 1} has duplicate reading-order values")

        if issues:
            print("⚠ Layout structure issues:")
            for issue in issues:
                print(f"  - {issue}")
            return False
        return True
```

### 9.3 Error Handling

```python
import traceback
from pathlib import Path

from paddleocr import PPStructureV3


class RobustDocumentProcessor:
    """Document processor with basic fault tolerance."""

    def __init__(self):
        self.extractor = None
        self._init_extractor()

    def _init_extractor(self, retry=3):
        """Initialize the extractor, retrying on failure."""
        for attempt in range(retry):
            try:
                self.extractor = PPStructureV3()
                print("✓ Initialization succeeded")
                break
            except Exception as e:
                print(f"✗ Initialization failed (attempt {attempt + 1}/{retry}): {e}")
                if attempt == retry - 1:
                    raise

    def safe_process(self, input_path, output_dir="output"):
        """Process a document and report failure instead of raising."""
        try:
            # Check that the file exists
            if not Path(input_path).exists():
                raise FileNotFoundError(f"File not found: {input_path}")

            # Run the pipeline
            results = self.extractor.predict(str(input_path))

            # Validate the results
            if not results:
                raise ValueError("The pipeline returned no results")

            # Save the results. Result objects are not plain dicts, so use
            # their own writer instead of json.dump.
            output_path = Path(output_dir)
            output_path.mkdir(parents=True, exist_ok=True)

            for result in results:
                result.save_to_json(save_path=str(output_path))

            print(f"✓ Processed {len(results)} pages")
            return True

        except Exception as e:
            print(f"✗ Processing failed: {e}")
            traceback.print_exc()
            return False
```
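
The retry loop above tries again immediately, which rarely helps when the failure is transient (a model download or a busy GPU, for example). A generic retry helper with exponential backoff is sketched below; the `retry` function and its parameters are assumptions, not part of PaddleOCR.

```python
import time


def retry(func, attempts=3, base_delay=1.0, exceptions=(Exception,), sleep=time.sleep):
    """Call func(), retrying on failure with exponentially growing delays.

    Waits base_delay, then 2 * base_delay, then 4 * base_delay, and so on.
    The last failure is re-raised once the attempts are used up.
    """
    for attempt in range(1, attempts + 1):
        try:
            return func()
        except exceptions:
            if attempt == attempts:
                raise
            sleep(base_delay * 2 ** (attempt - 1))
```

For instance, `extractor = retry(PPStructureV3, attempts=3)` would wrap pipeline construction. The `sleep` argument exists so tests can substitute a stub for real waiting.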

---

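## 10. Summary and Recommendations

### 10.1 Key Points

1. **PP-StructureV3** is the most complete solution. It supports:
   - 23 layout element categories
   - Table, formula, and chart recognition
   - Reading order recovery
   - Markdown/JSON output

2. **PaddleOCR-VL** suits cases where simplicity matters:
   - End-to-end processing
   - Lower resource consumption
   - Support for 109 languages

3. **Layout reconstruction** follows one of two paths:
   - Standard PDF → pdf2docx (fast)
   - Scanned document → PP-StructureV3 + ReportLab/python-docx (complete)

### 10.2 Recommended Workflow

```
1. Document preprocessing
   ├── Check the PDF type (standard/scanned)
   ├── Assess image quality
   └── Split pages

2. Layout extraction
   ├── PP-StructureV3.predict()
   ├── Extract structured information (JSON)
   └── Validate completeness

3. Content processing
   ├── Correct OCR text
   ├── Structure tables
   ├── Convert formulas to LaTeX
   └── Extract images

4. Layout reconstruction
   ├── Parse the JSON structure
   ├── Rebuild layout elements
   └── Generate the target format (PDF/DOCX/MD)

5. Quality checks
   ├── Text coverage
   ├── Layout completeness
   └── Format consistency
```

### 10.3 Performance Reference

| Configuration | Hardware | Speed (pages/s) | Accuracy |
|------|------|-------------|--------|
| Mobile models + CPU | Intel 8350C | ~0.27 | 85% |
| Server models + V100 | V100 GPU | ~1.5 | 95% |
| Server models + A100 | A100 GPU | ~3.0 | 95% |

### 10.4 Next Steps

1. **Build a test set**: collect samples of different document types
2. **Tune parameters**: adjust detection thresholds to your actual documents
3. **Improve postprocessing**: write dedicated handling for specific formats
4. **Integrate an LLM**: use a large language model for intelligent correction
5. **Set up monitoring**: track processing quality and performance metrics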

---

## Appendix

### A. Full Dependency List

```txt
paddlepaddle-gpu>=3.0.0
paddleocr>=3.0.0
python-docx>=0.8.11
PyMuPDF>=1.23.0
pdf2docx>=0.5.6
reportlab>=4.0.0
markdown>=3.5.0
opencv-python>=4.8.0
Pillow>=10.0.0
lxml>=4.9.0
beautifulsoup4>=4.12.0
```

### B. Environment Variables

```bash
# Set the model download source
export PADDLE_PDX_MODEL_SOURCE=HuggingFace  # or BOS

# Select the GPU
export CUDA_VISIBLE_DEVICES=0

# Set the model cache directory (variable name as used by PaddleX;
# confirm it against your installed version)
export PADDLE_PDX_CACHE_HOME=/path/to/cache
```

### C. Resources

- **Official documentation**: https://paddlepaddle.github.io/PaddleOCR/
- **GitHub**: https://github.com/PaddlePaddle/PaddleOCR
- **Model list**: https://github.com/PaddlePaddle/PaddleOCR/blob/main/doc/doc_ch/models_list.md
- **Technical report**: https://arxiv.org/abs/2507.05595

---

**Document version**: v1.0
**Last updated**: 2025-11-18
**Author**: Claude + PaddleOCR technical team