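PaddleOCR Layout Extraction and PDF Reconstruction: A Complete Approach

1. Technical Architecture Overview

1.1 Core Package Choices

PP-StructureV3 (recommended primary engine)

Latest version: PaddleOCR 3.x
Core capabilities:
- Layout detection
- Table recognition
- Formula recognition
- Chart recognition
- Reading order recovery
- Markdown/JSON output

PaddleOCR-VL (vision-language model)

Model size: 0.9B parameters
Highlights:
- Supports 109 languages
- End-to-end document parsing
- Lower resource consumption
- Native support for recognizing complex elements

PP-OCRv5 (base OCR)

A single model covers Traditional Chinese, Simplified Chinese, English, Japanese, and Pinyin
Improved handwriting recognition
Two configurations: server and mobile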

1.2 Layout Analysis Capabilities

PP-StructureV3 supports 23 common categories, including:

Document title
Paragraph title
Body text (text)
Page number
Abstract
Table
References
Footnotes
Header
Footer
Algorithm
Formula
Formula number
Image
Seal
Figure/table title (figure_table title)
Chart
Sidebar text
Lists of references
and others
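Downstream code usually needs far fewer rendering cases than there are layout categories. The sketch below folds raw labels into a small set of render kinds. The mapping table, the kind names, and the `render_kind` helper are hypothetical illustrations and not part of PaddleOCR; adjust the keys to the labels your installed version actually emits.

```python
# Hypothetical mapping from raw layout labels to a few render kinds.
# The label spellings are assumptions; check them against real model output.
RENDER_KIND = {
    "document_title": "title",
    "title": "title",
    "paragraph_title": "heading",
    "heading": "heading",
    "text": "body",
    "abstract": "body",
    "references": "body",
    "footnotes": "body",
    "table": "table",
    "formula": "formula",
    "image": "image",
    "figure": "image",
    "chart": "image",
    "seal": "seal",
}


def render_kind(label: str) -> str:
    """Normalize a raw label and return its render kind ("other" if unknown)."""
    key = label.strip().lower().replace(" ", "_")
    return RENDER_KIND.get(key, "other")


print(render_kind("Document Title"))
print(render_kind("figure"))
print(render_kind("sidebar text"))
```

Unknown labels fall through to "other", so a new category degrades gracefully rather than raising an error.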

2. End-to-End Workflow

2.1 PP-StructureV3 Pipeline Architecture

Input (PDF/Image)
    ↓
Preprocessing
├── Document orientation classification
├── Document unwarping
└── Text line orientation classification
    ↓
Layout detection
├── Region detection
├── Category classification
└── Coordinate localization
    ↓
Parallel processing per region
├── OCR regions → PP-OCRv5 (text detection + recognition)
├── Table regions → Table recognition (structure + HTML)
├── Formula regions → Formula recognition (LaTeX)
├── Chart regions → Chart recognition (converted to tables)
└── Seal regions → Seal recognition
    ↓
Postprocessing
├── Reading order sorting
├── Structural organization
└── Output formatting
    ↓
Output (Markdown/JSON/DOCX)
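To make the reading order step of the postprocessing stage concrete, here is a deliberately simple heuristic: group regions into rows by their top edge, then read each row left to right. This is an illustrative sketch only and is not the algorithm PP-StructureV3 uses. The `bbox` key, the `[x1, y1, x2, y2]` layout, and the `row_tolerance` parameter are assumptions, and the heuristic does not handle multi-column pages.

```python
from typing import Dict, List


def sort_reading_order(regions: List[Dict], row_tolerance: float = 10.0) -> List[Dict]:
    """Order regions top-to-bottom, then left-to-right within a row.

    Each region is assumed to carry "bbox" as [x1, y1, x2, y2].
    """
    by_top = sorted(regions, key=lambda r: (r["bbox"][1], r["bbox"][0]))

    rows: List[List[Dict]] = []
    for region in by_top:
        top = region["bbox"][1]
        # A region joins the current row if its top edge is close to the row's first top edge.
        if rows and abs(top - rows[-1][0]["bbox"][1]) <= row_tolerance:
            rows[-1].append(region)
        else:
            rows.append([region])

    ordered: List[Dict] = []
    for row in rows:
        ordered.extend(sorted(row, key=lambda r: r["bbox"][0]))
    return ordered


sample = [
    {"name": "right", "bbox": [300, 8, 500, 40]},
    {"name": "left", "bbox": [10, 10, 200, 40]},
    {"name": "below", "bbox": [10, 100, 500, 200]},
]
print([r["name"] for r in sort_reading_order(sample)])
```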

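2.2 Layout Recovery Strategy

# Core concept: the information captured for each region so the layout can be rebuilt
layout_info = {
    "position": "(x, y, width, height)",
    "category": "text/table/image/formula",
    "content": {
        "text": "OCR result + coordinates",
        "table": "HTML structure + text content",
        "image": "binary image data",
        "formula": "LaTeX expression"
    },
    "style": {
        "font": "size, weight, color",
        "alignment": "left/center/right",
        "spacing": "line spacing, paragraph spacing"
    },
    "reading_order": "global sort index"
}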
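The conceptual record above can be expressed as a typed structure. The sketch below is one possible shape, not a PaddleOCR type: the `LayoutRegion` class and its field names are hypothetical. It also converts between the `(x, y, width, height)` form used above and the `[x1, y1, x2, y2]` form that the extraction code in section 4.1 reads.

```python
from dataclasses import dataclass, field
from typing import Any, Dict, Sequence, Tuple


@dataclass
class LayoutRegion:
    """One layout region: position, category, content, style, reading order."""
    x: float
    y: float
    width: float
    height: float
    category: str
    content: Dict[str, Any] = field(default_factory=dict)
    style: Dict[str, Any] = field(default_factory=dict)
    reading_order: int = 0

    def to_xyxy(self) -> Tuple[float, float, float, float]:
        """Return the box as (x1, y1, x2, y2)."""
        return (self.x, self.y, self.x + self.width, self.y + self.height)

    @classmethod
    def from_xyxy(cls, bbox: Sequence[float], category: str, **kwargs: Any) -> "LayoutRegion":
        """Build a region from an [x1, y1, x2, y2] box."""
        x1, y1, x2, y2 = bbox
        return cls(x1, y1, x2 - x1, y2 - y1, category, **kwargs)


region = LayoutRegion.from_xyxy([10, 20, 110, 70], "text", reading_order=3)
print(region.width, region.height, region.to_xyxy())
```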

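3. Installation and Environment Setup

3.1 Basic Installation

# Install PaddlePaddle (GPU build)
# Note: GPU wheels are tied to a CUDA version; follow the official install guide for the matching package index
python -m pip install paddlepaddle-gpu

# Or the CPU build
python -m pip install paddlepaddle

# Install PaddleOCR (all features)
python -m pip install "paddleocr[all]"

# Or install only the document parsing features
python -m pip install "paddleocr[doc-parser]"

3.2 Dependencies

# Required for layout recovery
pip install python-docx  # Word document generation
pip install PyMuPDF      # PDF processing (requires Python >= 3.7)
pip install pdf2docx     # PDF to DOCX conversion

# Required for PDF generation
pip install reportlab    # Custom PDF generation
pip install markdown     # Markdown processing

3.3 Optional Dependencies

# Optional features
pip install opencv-python           # Image processing
pip install Pillow                  # Image handling
pip install lxml                    # HTML/XML parsing
pip install beautifulsoup4          # HTML parsing and cleanup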

4. Core Implementation

4.1 Using PP-StructureV3 (Recommended)

from paddleocr import PPStructureV3
from pathlib import Path
import json

class DocumentLayoutExtractor:
    """Document layout extraction and reconstruction."""

    def __init__(self, use_gpu=True):
        """Initialize PP-StructureV3."""
        self.engine = PPStructureV3(
            # Document preprocessing
            use_doc_orientation_classify=True,   # document orientation classification
            use_doc_unwarping=True,              # document image unwarping
            use_textline_orientation=True,       # text line orientation

            # Feature module switches
            use_seal_recognition=True,           # seal recognition
            use_table_recognition=True,          # table recognition
            use_formula_recognition=True,        # formula recognition
            use_chart_recognition=True,          # chart recognition

            # OCR model configuration
            # (PaddleOCR 3.x model names carry no "ch_" prefix; confirm the exact
            # name against the model list for your installed version)
            text_recognition_model_name="PP-OCRv4_server_rec",      # Chinese recognition
            # text_recognition_model_name="en_PP-OCRv4_mobile_rec",  # English recognition

            # Layout detection tuning
            layout_threshold=0.5,                # layout detection score threshold
            layout_nms=True,                     # apply NMS to overlapping layout boxes
            layout_unclip_ratio=1.5,             # bounding box expansion ratio

            # Inference device (previously the unused use_gpu flag)
            device="gpu" if use_gpu else "cpu",
        )

    def extract_layout(self, input_path, output_dir="output"):
        """
        Extract the full layout information.

        Note: the region keys read below ("type", "bbox", "res", ...) assume that
        each page result is an iterable of region dicts. Result layouts differ
        between PaddleOCR versions, so verify the keys against the output of
        your installed version.

        Args:
            input_path: path to a PDF or image
            output_dir: output directory

        Returns:
            list: structured result for each page
        """
        output_path = Path(output_dir)
        output_path.mkdir(parents=True, exist_ok=True)

        # Run document parsing
        results = self.engine.predict(input_path)

        all_pages_data = []

        for page_idx, result in enumerate(results):
            page_data = {
                "page": page_idx + 1,
                "regions": []
            }

            # Walk through every region on the page
            for region in result:
                region_type = region.get("type")
                region_info = {
                    "type": region_type,                  # region category
                    "bbox": region.get("bbox"),           # bounding box [x1, y1, x2, y2]
                    "score": region.get("score", 0),      # confidence
                    "content": {},
                    "reading_order": region.get("layout_bbox_idx", 0)  # reading order
                }

                # Extract content according to the category
                if region_type in ["text", "title", "header", "footer"]:
                    # OCR text region
                    region_info["content"] = {
                        "text": region.get("res", []),
                        "ocr_boxes": region.get("text_region", [])
                    }

                elif region_type == "table":
                    # Table region
                    table_res = region.get("res", {})
                    region_info["content"] = {
                        "html": table_res.get("html", ""),
                        "text": table_res.get("text", ""),
                        "structure": table_res.get("cell_bbox", [])
                    }

                elif region_type == "formula":
                    # Formula region
                    region_info["content"] = {
                        "latex": region.get("res", "")
                    }

                elif region_type == "figure":
                    # Image region: store the full path so later stages can find the file
                    img_name = f"page_{page_idx+1}_figure_{len(page_data['regions'])}.png"
                    img_path = output_path / img_name
                    region_info["content"] = {"image_path": str(img_path)}
                    # Save the image
                    if "img" in region:
                        region["img"].save(img_path)

                elif region_type == "seal":
                    # Seal region
                    region_info["content"] = {
                        "text": region.get("res", ""),
                        "seal_bbox": region.get("seal_bbox", [])
                    }

                page_data["regions"].append(region_info)

            # Sort by reading order
            page_data["regions"].sort(key=lambda x: x["reading_order"])

            all_pages_data.append(page_data)

            # Save this page's JSON
            json_path = output_path / f"page_{page_idx+1}.json"
            with open(json_path, "w", encoding="utf-8") as f:
                json.dump(page_data, f, ensure_ascii=False, indent=2)

            print(f"✓ Processed page {page_idx+1}")

        # Save the full document structure
        full_json = output_path / "document_structure.json"
        with open(full_json, "w", encoding="utf-8") as f:
            json.dump(all_pages_data, f, ensure_ascii=False, indent=2)

        return all_pages_data

    def export_to_markdown(self, input_path, output_dir="output"):
        """
        Export directly to Markdown (built into PP-StructureV3).
        """
        output_path = Path(output_dir)
        output_path.mkdir(parents=True, exist_ok=True)

        results = self.engine.predict(input_path)

        for page_idx, result in enumerate(results):
            # Each result object writes its own Markdown (same result API as in section 4.2)
            result.save_to_markdown(save_path=str(output_path))
            print(f"✓ Generated Markdown for page {page_idx+1} in: {output_path}")


# Usage example
if __name__ == "__main__":
    extractor = DocumentLayoutExtractor(use_gpu=True)

    # Extract layout information
    layout_data = extractor.extract_layout(
        input_path="document.pdf",
        output_dir="output/layout"
    )

    # Export Markdown
    extractor.export_to_markdown(
        input_path="document.pdf",
        output_dir="output/markdown"
    )

4.2 Using PaddleOCR-VL (More Concise)

from paddleocr import PaddleOCRVL
from pathlib import Path

class SimpleDocumentParser:
    """A simplified approach based on PaddleOCR-VL."""

    def __init__(self):
        self.pipeline = PaddleOCRVL(
            use_doc_orientation_classify=True,
            use_doc_unwarping=True,
            use_layout_detection=True,
            use_chart_recognition=True
        )

    def parse_document(self, input_path, output_dir="output"):
        """Parse a document in one call."""
        output_path = Path(output_dir)
        output_path.mkdir(parents=True, exist_ok=True)

        # Run parsing
        results = self.pipeline.predict(input_path)

        for res in results:
            # Print the structured output
            res.print()

            # Save JSON
            res.save_to_json(save_path=str(output_path))

            # Save Markdown
            res.save_to_markdown(save_path=str(output_path))

        print(f"✓ Parsing finished, results saved to: {output_path}")

    def parse_pdf_to_single_markdown(self, pdf_path, output_path="output"):
        """Convert a whole PDF into a single Markdown file."""
        output_dir = Path(output_path)
        output_dir.mkdir(parents=True, exist_ok=True)

        # Parse the PDF
        results = self.pipeline.predict(pdf_path)

        markdown_list = []
        markdown_images = []

        # Collect the Markdown of every page
        for res in results:
            md_info = res.markdown
            markdown_list.append(md_info)
            markdown_images.append(md_info.get("markdown_images", {}))

        # Merge all pages
        markdown_text = self.pipeline.concatenate_markdown_pages(markdown_list)

        # Save the Markdown file
        md_file = output_dir / f"{Path(pdf_path).stem}.md"
        with open(md_file, "w", encoding="utf-8") as f:
            f.write(markdown_text)

        # Save the images the Markdown refers to
        for item in markdown_images:
            if item:
                for path, image in item.items():
                    img_path = output_dir / path
                    img_path.parent.mkdir(parents=True, exist_ok=True)
                    image.save(img_path)

        print(f"✓ Generated single Markdown file: {md_file}")
        return md_file


# Usage example
if __name__ == "__main__":
    parser = SimpleDocumentParser()

    # Simple parsing
    parser.parse_document("document.pdf", "output")

    # Generate a single Markdown file
    parser.parse_pdf_to_single_markdown("document.pdf", "output")

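5. PDF Reconstruction and Generation

5.1 Layout Recovery Strategy

PaddleOCR provides two layout recovery methods. The commands below use the PaddleOCR 2.x command-line flags; the 3.x CLI is organized differently, so check the flags for your installed version.

Method 1: Standard PDF parsing (for PDFs with selectable text)

# Convert directly with pdf2docx
paddleocr --image_dir=document.pdf \
          --type=structure \
          --recovery=true \
          --use_pdf2docx_api=true

Pros: fast; preserves the original formatting
Cons: works only on standard PDFs; has no effect on scanned documents

Method 2: Image-based PDF parsing (general approach)

# Use the full OCR pipeline
paddleocr --image_dir=document.pdf \
          --type=structure \
          --recovery=true \
          --use_pdf2docx_api=false

Pros: handles scanned documents and complex layouts
Cons: slower; needs more compute resources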
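Choosing between the two methods comes down to whether the PDF carries an extractable text layer. The sketch below shows one way to make that decision automatically. The function names, the track labels "direct" and "ocr", and the thresholds `min_chars` and `min_ratio` are assumptions for illustration and should be tuned on real documents. Counting characters relies on PyMuPDF, which is imported lazily so the decision function can be used on its own.

```python
from typing import List


def choose_track(chars_per_page: List[int], min_chars: int = 50, min_ratio: float = 0.8) -> str:
    """Return "direct" when most pages have a usable text layer, otherwise "ocr"."""
    if not chars_per_page:
        return "ocr"
    pages_with_text = sum(1 for n in chars_per_page if n >= min_chars)
    if pages_with_text / len(chars_per_page) >= min_ratio:
        return "direct"
    return "ocr"


def count_chars_per_page(pdf_path: str) -> List[int]:
    """Count extractable characters on each page using PyMuPDF."""
    import fitz  # PyMuPDF, imported here so choose_track works without it

    with fitz.open(pdf_path) as doc:
        return [len(page.get_text().strip()) for page in doc]


# Three pages, one of them effectively empty: fall back to the OCR pipeline.
print(choose_track([1200, 900, 30]))
# Four of five pages carry text: standard parsing is enough.
print(choose_track([1200, 900, 800, 700, 10]))
```

For a real file, the two helpers combine as `choose_track(count_chars_per_page("document.pdf"))`.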

5.2 Custom PDF Generation

import json
from pathlib import Path
from xml.sax.saxutils import escape

from reportlab.lib.pagesizes import A4
from reportlab.lib.units import mm
from reportlab.lib.styles import getSampleStyleSheet, ParagraphStyle
from reportlab.platypus import (
    SimpleDocTemplate, Paragraph, Spacer,
    Table, TableStyle, Image, PageBreak
)
from reportlab.lib import colors
from reportlab.lib.enums import TA_LEFT, TA_CENTER

class PDFLayoutRecovery:
    """PDF reconstruction based on layout information.

    Note: the built-in Helvetica fonts used by the sample styles cannot render
    CJK glyphs; register a CJK-capable font before rendering Chinese text.
    """

    def __init__(self, layout_json_path):
        """
        Args:
            layout_json_path: path to the layout JSON file
        """
        with open(layout_json_path, 'r', encoding='utf-8') as f:
            self.layout_data = json.load(f)

        self.styles = getSampleStyleSheet()
        self._create_custom_styles()

    def _create_custom_styles(self):
        """Create the custom styles."""
        # Title style
        self.styles.add(ParagraphStyle(
            name='CustomTitle',
            parent=self.styles['Heading1'],
            fontSize=18,
            textColor=colors.HexColor('#1a1a1a'),
            spaceAfter=12,
            alignment=TA_CENTER
        ))

        # Section heading style
        self.styles.add(ParagraphStyle(
            name='CustomHeading',
            parent=self.styles['Heading2'],
            fontSize=14,
            textColor=colors.HexColor('#333333'),
            spaceAfter=8,
            spaceBefore=8
        ))

        # Body text style
        self.styles.add(ParagraphStyle(
            name='CustomBody',
            parent=self.styles['Normal'],
            fontSize=11,
            leading=16,
            textColor=colors.HexColor('#000000'),
            alignment=TA_LEFT
        ))

    def generate_pdf(self, output_path="output.pdf"):
        """Generate the PDF."""
        doc = SimpleDocTemplate(
            output_path,
            pagesize=A4,
            rightMargin=20*mm,
            leftMargin=20*mm,
            topMargin=20*mm,
            bottomMargin=20*mm
        )

        story = []

        # Walk through all pages
        for page_idx, page_data in enumerate(self.layout_data):
            page_num = page_data.get("page", page_idx + 1)

            # Page marker (optional)
            # story.append(Paragraph(f"--- Page {page_num} ---", self.styles['CustomHeading']))
            # story.append(Spacer(1, 5*mm))

            # Walk through every region on the page
            for region in page_data.get("regions", []):
                region_type = region.get("type")
                content = region.get("content", {})

                if region_type in ["title", "document_title"]:
                    # Title
                    text = self._extract_text_from_ocr(content)
                    if text:
                        story.append(Paragraph(escape(text), self.styles['CustomTitle']))
                        story.append(Spacer(1, 3*mm))

                elif region_type in ["text", "paragraph"]:
                    # Body text
                    text = self._extract_text_from_ocr(content)
                    if text:
                        story.append(Paragraph(escape(text), self.styles['CustomBody']))
                        story.append(Spacer(1, 2*mm))

                elif region_type in ["paragraph_title", "heading"]:
                    # Section heading
                    text = self._extract_text_from_ocr(content)
                    if text:
                        story.append(Paragraph(escape(text), self.styles['CustomHeading']))
                        story.append(Spacer(1, 2*mm))

                elif region_type == "table":
                    # Table
                    table_element = self._create_table_from_html(content)
                    if table_element:
                        story.append(table_element)
                        story.append(Spacer(1, 3*mm))

                elif region_type == "figure":
                    # Image
                    img_path = content.get("image_path")
                    if img_path and Path(img_path).exists():
                        try:
                            img = Image(img_path, width=150*mm, height=100*mm, kind='proportional')
                            story.append(img)
                            story.append(Spacer(1, 3*mm))
                        except Exception as e:
                            # Skip unreadable images but report them
                            print(f"⚠ Skipped image {img_path}: {e}")

                elif region_type == "formula":
                    # Formula (shown as a code block); escape so LaTeX such as "a < b"
                    # is not parsed as paragraph markup
                    latex = content.get("latex", "")
                    if latex:
                        story.append(Paragraph(f"<font name='Courier'>{escape(latex)}</font>",
                                             self.styles['Code']))
                        story.append(Spacer(1, 2*mm))

            # Page break (except after the last page)
            if page_idx < len(self.layout_data) - 1:
                story.append(PageBreak())

        # Build the PDF
        doc.build(story)
        print(f"✓ PDF generated: {output_path}")

    def _extract_text_from_ocr(self, content):
        """Extract text from an OCR result."""
        if isinstance(content.get("text"), str):
            return content["text"]
        elif isinstance(content.get("text"), list):
            # The OCR result is a list
            texts = []
            for item in content["text"]:
                if isinstance(item, dict) and "text" in item:
                    texts.append(item["text"])
                elif isinstance(item, (list, tuple)) and len(item) >= 2:
                    texts.append(str(item[1]))  # (bbox, text, confidence) format
            return " ".join(texts)
        return ""

    def _create_table_from_html(self, content):
        """Create a table (simplified: built from the text field, not the HTML)."""
        text = content.get("text", "")
        if not text:
            return None

        # The HTML could be parsed here instead;
        # for simplicity only the basic structure is shown
        try:
            # Assume one row per line with tab-separated cells
            rows = [row.split("\t") for row in text.split("\n") if row.strip()]

            if not rows:
                return None

            table = Table(rows)
            table.setStyle(TableStyle([
                ('BACKGROUND', (0, 0), (-1, 0), colors.grey),
                ('TEXTCOLOR', (0, 0), (-1, 0), colors.whitesmoke),
                ('ALIGN', (0, 0), (-1, -1), 'CENTER'),
                ('FONTNAME', (0, 0), (-1, 0), 'Helvetica-Bold'),
                ('FONTSIZE', (0, 0), (-1, 0), 10),
                ('BOTTOMPADDING', (0, 0), (-1, 0), 8),
                ('BACKGROUND', (0, 1), (-1, -1), colors.beige),
                ('GRID', (0, 0), (-1, -1), 0.5, colors.black)
            ]))

            return table
        except Exception:
            return None


# Usage example
if __name__ == "__main__":
    # First extract the layout with PP-StructureV3 (DocumentLayoutExtractor from section 4.1)
    extractor = DocumentLayoutExtractor()
    layout_data = extractor.extract_layout("document.pdf", "output/layout")

    # Generate the PDF
    pdf_recovery = PDFLayoutRecovery("output/layout/document_structure.json")
    pdf_recovery.generate_pdf("output/recovered_document.pdf")
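The `_create_table_from_html` method above builds its rows from tab-separated text rather than from the recognized HTML. As a possible next step, the sketch below turns a table's HTML into a rectangular list of rows that could be passed to ReportLab's `Table`. It uses only the standard library; the class and function names are hypothetical, and `colspan`/`rowspan` attributes are ignored, so merged cells will not line up.

```python
from html.parser import HTMLParser
from typing import List, Optional


class TableHTMLParser(HTMLParser):
    """Collect the text of <td>/<th> cells, grouped by <tr>."""

    def __init__(self) -> None:
        super().__init__()
        self.rows: List[List[str]] = []
        self._row: Optional[List[str]] = None
        self._cell: Optional[List[str]] = None

    def handle_starttag(self, tag, attrs):
        if tag == "tr":
            self._row = []
        elif tag in ("td", "th") and self._row is not None:
            self._cell = []

    def handle_data(self, data):
        # Only text inside an open cell is kept.
        if self._cell is not None:
            self._cell.append(data)

    def handle_endtag(self, tag):
        if tag in ("td", "th") and self._cell is not None and self._row is not None:
            self._row.append("".join(self._cell).strip())
            self._cell = None
        elif tag == "tr" and self._row is not None:
            self.rows.append(self._row)
            self._row = None


def html_table_to_rows(html: str) -> List[List[str]]:
    """Parse table HTML into rows padded to the same number of columns."""
    parser = TableHTMLParser()
    parser.feed(html)
    parser.close()
    rows = [row for row in parser.rows if row]
    width = max((len(row) for row in rows), default=0)
    return [row + [""] * (width - len(row)) for row in rows]


html = "<table><tr><th>Item</th><th>Qty</th></tr><tr><td>Paper</td></tr></table>"
print(html_table_to_rows(html))
```

Padding short rows keeps the grid rectangular, which avoids ragged-row problems when the rows are handed to a table builder.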

5.3 Generating Word Documents with python-docx

import json
from pathlib import Path

from docx import Document
from docx.shared import Inches
from docx.enum.text import WD_ALIGN_PARAGRAPH

class DOCXLayoutRecovery:
    """DOCX reconstruction based on layout information."""

    def __init__(self, layout_json_path):
        with open(layout_json_path, 'r', encoding='utf-8') as f:
            self.layout_data = json.load(f)

        self.doc = Document()

    def generate_docx(self, output_path="output.docx"):
        """Generate the DOCX file."""
        for page_idx, page_data in enumerate(self.layout_data):
            for region in page_data.get("regions", []):
                region_type = region.get("type")
                content = region.get("content", {})

                if region_type in ["title", "document_title"]:
                    # Title
                    text = self._extract_text(content)
                    if text:
                        heading = self.doc.add_heading(text, level=1)
                        heading.alignment = WD_ALIGN_PARAGRAPH.CENTER

                elif region_type == "text":
                    # Body text
                    text = self._extract_text(content)
                    if text:
                        para = self.doc.add_paragraph(text)
                        para.paragraph_format.first_line_indent = Inches(0.5)

                elif region_type == "paragraph_title":
                    # Subheading
                    text = self._extract_text(content)
                    if text:
                        self.doc.add_heading(text, level=2)

                elif region_type == "table":
                    # Table (simplified: one row per line, tab-separated cells)
                    text = content.get("text", "")
                    rows = [row.split("\t") for row in text.split("\n") if row.strip()]
                    if rows:
                        # Size the table for the widest row so ragged rows do not overflow
                        n_cols = max(len(row) for row in rows)
                        table = self.doc.add_table(rows=len(rows), cols=n_cols)
                        table.style = 'Light Grid Accent 1'

                        for i, row_data in enumerate(rows):
                            for j, cell_data in enumerate(row_data):
                                table.rows[i].cells[j].text = cell_data

                elif region_type == "figure":
                    # Image
                    img_path = content.get("image_path")
                    if img_path and Path(img_path).exists():
                        try:
                            self.doc.add_picture(img_path, width=Inches(5))
                        except Exception as e:
                            # Skip unreadable images but report them
                            print(f"⚠ Skipped image {img_path}: {e}")

            # Page break (except after the last page)
            if page_idx < len(self.layout_data) - 1:
                self.doc.add_page_break()

        self.doc.save(output_path)
        print(f"✓ DOCX generated: {output_path}")

    def _extract_text(self, content):
        """Extract text."""
        if isinstance(content.get("text"), str):
            return content["text"]
        elif isinstance(content.get("text"), list):
            texts = []
            for item in content["text"]:
                if isinstance(item, dict):
                    texts.append(item.get("text", ""))
                elif isinstance(item, (list, tuple)) and len(item) >= 2:
                    texts.append(str(item[1]))
            return " ".join(texts)
        return ""

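6. Advanced Configuration and Optimization

6.1 Performance Tuning

from paddleocr import PPStructureV3

# Model names below follow the PaddleOCR 3.x naming (no "ch_" prefix);
# confirm them against the model list for your installed version.

# Lightweight configuration (CPU environments)
extractor = PPStructureV3(
    # Use the mobile models
    text_detection_model_name="PP-OCRv4_mobile_det",
    text_recognition_model_name="PP-OCRv4_mobile_rec",

    # Turn off some features
    use_chart_recognition=False,  # chart recognition is slow
    use_formula_recognition=False,  # formula recognition needs a larger model

    # Do less work per page
    layout_threshold=0.6,         # keep only higher-confidence layout regions
    text_det_limit_side_len=960,  # smaller detection input size
)

# High-accuracy configuration (GPU environments)
extractor = PPStructureV3(
    # Use the server models
    text_detection_model_name="PP-OCRv4_server_det",
    text_recognition_model_name="PP-OCRv4_server_rec",

    # Enable all features
    use_chart_recognition=True,
    use_formula_recognition=True,
    use_seal_recognition=True,

    # Capture more detail
    layout_threshold=0.3,          # lower threshold keeps more candidate regions
    text_det_limit_side_len=1920,  # larger detection input size
)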

6.2 Batch Processing

from pathlib import Path
from concurrent.futures import ThreadPoolExecutor, as_completed

from paddleocr import PPStructureV3

class BatchDocumentProcessor:
    """Batch document processor."""

    def __init__(self, max_workers=4):
        # One pipeline instance is shared by all worker threads. If the installed
        # version is not safe for concurrent predict() calls, set max_workers=1
        # or create one instance per worker.
        self.extractor = PPStructureV3()
        self.max_workers = max_workers

    def process_directory(self, input_dir, output_dir="output", file_types=None):
        """
        Process every document in a directory.

        Args:
            input_dir: input directory
            output_dir: output directory
            file_types: supported file extensions, default ['.pdf', '.png', '.jpg', '.jpeg']
        """
        if file_types is None:
            file_types = ['.pdf', '.png', '.jpg', '.jpeg']

        input_path = Path(input_dir)
        output_path = Path(output_dir)
        output_path.mkdir(parents=True, exist_ok=True)

        # Collect all files to process
        files = []
        for file_type in file_types:
            files.extend(input_path.glob(f"**/*{file_type}"))

        print(f"Found {len(files)} files to process")

        # Process with multiple threads
        with ThreadPoolExecutor(max_workers=self.max_workers) as executor:
            futures = {
                executor.submit(self._process_single_file, file, output_path): file
                for file in files
            }

            for future in as_completed(futures):
                file = futures[future]
                try:
                    future.result()
                    print(f"✓ Done: {file.name}")
                except Exception as e:
                    print(f"✗ Failed: {file.name} - {e}")

    def _process_single_file(self, file_path, output_dir):
        """Process a single file."""
        file_stem = file_path.stem
        file_output_dir = output_dir / file_stem
        file_output_dir.mkdir(parents=True, exist_ok=True)

        # Extract the layout
        results = self.extractor.predict(str(file_path))

        # Save each page with the result object's own writers
        # (the same result API used in section 4.2)
        for result in results:
            result.save_to_json(save_path=str(file_output_dir))
            result.save_to_markdown(save_path=str(file_output_dir))

        return file_output_dir


# Usage example
if __name__ == "__main__":
    processor = BatchDocumentProcessor(max_workers=4)
    processor.process_directory(
        input_dir="documents",
        output_dir="output",
        file_types=['.pdf']
    )

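7. Practical Application Scenarios

7.1 Academic Paper Processing

import json
from pathlib import Path

from paddleocr import PPStructureV3

class AcademicPaperProcessor:
    """Processor specialized for academic papers.

    Reads the same region keys ("type", "res") as the extractor in section 4.1.
    """

    def __init__(self):
        self.extractor = PPStructureV3(
            use_formula_recognition=True,  # formula recognition
            use_table_recognition=True,    # table recognition
            use_chart_recognition=True,    # chart recognition
        )

    @staticmethod
    def _extract_text(region):
        """Join the recognized text of a region."""
        res = region.get("res", "")
        if isinstance(res, str):
            return res
        if not isinstance(res, (list, tuple)):
            return ""
        texts = []
        for item in res:
            if isinstance(item, dict):
                texts.append(item.get("text", ""))
            elif isinstance(item, (list, tuple)) and len(item) >= 2:
                texts.append(str(item[1]))
        return " ".join(texts)

    def process_paper(self, pdf_path, output_dir="output"):
        """
        Process an academic paper:
        - extract the title, abstract, and section headings
        - recognize formulas and convert them to LaTeX
        - extract tables and figures
        """
        results = self.extractor.predict(pdf_path)

        paper_structure = {
            "title": "",
            "abstract": "",
            "sections": [],
            "formulas": [],
            "tables": [],
            "figures": []
        }

        for result in results:
            for region in result:
                region_type = region.get("type")

                if region_type == "document_title":
                    paper_structure["title"] = self._extract_text(region)

                elif region_type == "abstract":
                    paper_structure["abstract"] = self._extract_text(region)

                elif region_type == "paragraph_title":
                    paper_structure["sections"].append(self._extract_text(region))

                elif region_type == "formula":
                    latex = region.get("res", "")
                    if latex:
                        paper_structure["formulas"].append(latex)

                elif region_type == "table":
                    paper_structure["tables"].append(region.get("res"))

                elif region_type == "figure":
                    paper_structure["figures"].append(region.get("bbox"))

        # Save the structured result
        output_path = Path(output_dir)
        output_path.mkdir(parents=True, exist_ok=True)

        with open(output_path / "paper_structure.json", "w", encoding="utf-8") as f:
            json.dump(paper_structure, f, ensure_ascii=False, indent=2)

        return paper_structure

7.2 Business Document Processing

from paddleocr import PPStructureV3

class BusinessDocumentProcessor:
    """Business document processor."""

    def __init__(self):
        self.extractor = PPStructureV3(
            use_seal_recognition=True,   # seal recognition
            use_table_recognition=True,  # table recognition
        )

    @staticmethod
    def _extract_text(region):
        """Join the recognized text of a region (same keys as in section 4.1)."""
        res = region.get("res", "")
        if isinstance(res, str):
            return res
        if not isinstance(res, (list, tuple)):
            return ""
        return " ".join(
            item.get("text", "") if isinstance(item, dict) else str(item[1])
            for item in res
            if isinstance(item, dict) or (isinstance(item, (list, tuple)) and len(item) >= 2)
        )

    def process_invoice(self, pdf_path):
        """Process invoices, contracts, and similar business documents."""
        results = self.extractor.predict(pdf_path)

        doc_info = {
            "text_content": [],
            "tables": [],
            "seals": []
        }

        for result in results:
            for region in result:
                region_type = region.get("type")

                if region_type == "seal":
                    doc_info["seals"].append(region.get("res"))

                elif region_type == "table":
                    doc_info["tables"].append(region.get("res"))

                elif region_type == "text":
                    doc_info["text_content"].append(self._extract_text(region))

        return doc_info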

8. Common Problems and Solutions

8.1 Out of Memory

import os

import fitz  # PyMuPDF
from paddleocr import PPStructureV3

# Option 1: process the PDF pages in batches
def process_large_pdf(pdf_path, batch_size=10):
    extractor = PPStructureV3()  # create once and reuse for every batch
    doc = fitz.open(pdf_path)
    total_pages = len(doc)

    try:
        for start_idx in range(0, total_pages, batch_size):
            end_idx = min(start_idx + batch_size, total_pages)

            # Write this batch of pages to a temporary PDF
            temp_pdf = fitz.open()
            temp_pdf.insert_pdf(doc, from_page=start_idx, to_page=end_idx - 1)
            temp_path = f"temp_batch_{start_idx}_{end_idx}.pdf"
            temp_pdf.save(temp_path)
            temp_pdf.close()

            try:
                # Process the temporary PDF
                results = extractor.predict(temp_path)

                # Handle the results...
            finally:
                # Clean up even if processing fails
                os.remove(temp_path)
    finally:
        doc.close()

# Option 2: use lightweight models
# (PaddleOCR 3.x naming without the "ch_" prefix; confirm against the model list)
extractor = PPStructureV3(
    text_detection_model_name="PP-OCRv4_mobile_det",
    text_recognition_model_name="PP-OCRv4_mobile_rec",
)
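The batching arithmetic in Option 1 is easy to get wrong by one page, because `insert_pdf` takes an inclusive `to_page`. The helper below isolates that calculation so it can be checked without any PDF. The name `page_batches` is a hypothetical helper, not part of PyMuPDF or PaddleOCR.

```python
from typing import Iterator, Tuple


def page_batches(total_pages: int, batch_size: int) -> Iterator[Tuple[int, int]]:
    """Yield (from_page, to_page) pairs, zero-based and inclusive on both ends."""
    if batch_size <= 0:
        raise ValueError("batch_size must be positive")
    for start in range(0, total_pages, batch_size):
        end = min(start + batch_size, total_pages)
        yield start, end - 1


# 25 pages in batches of 10: two full batches and one partial batch.
print(list(page_batches(25, 10)))
```

Each pair can be passed straight to `insert_pdf(doc, from_page=..., to_page=...)`.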

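8.2 Processing Speed

# Option 1: process only what is needed
extractor = PPStructureV3(
    use_chart_recognition=False,     # disable chart recognition
    use_formula_recognition=False,   # disable formula recognition
    use_seal_recognition=False,      # disable seal recognition
)

# Option 2: lower the image resolution
extractor = PPStructureV3(
    text_det_limit_side_len=960,  # smaller limit means less work per page; check the default for your version
)

# Option 3: enable MKL-DNN (CPU acceleration)
extractor = PPStructureV3(
    enable_mkldnn=True  # speeds up inference on CPU
)

8.3 Recognition Accuracy

# Option 1: use the high-accuracy models
# (PaddleOCR 3.x naming without the "ch_" prefix; confirm against the model list)
extractor = PPStructureV3(
    text_detection_model_name="PP-OCRv4_server_det",
    text_recognition_model_name="PP-OCRv4_server_rec",
)

# Option 2: tune the detection parameters
# (defaults vary by version; compare with the pipeline's default configuration)
extractor = PPStructureV3(
    text_det_thresh=0.3,           # pixel threshold; lower values detect fainter text
    text_det_box_thresh=0.5,       # box score threshold; lower values keep more boxes
    text_det_unclip_ratio=1.8,     # larger values expand text boxes further
)

# Option 3: enable preprocessing
extractor = PPStructureV3(
    use_doc_orientation_classify=True,  # document orientation correction
    use_doc_unwarping=True,             # document image unwarping
    use_textline_orientation=True,      # text line orientation correction
)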

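9. Best Practices

9.1 Choosing the Right Approach

Scenario	Recommended approach	Rationale
Standard PDF (selectable text)	pdf2docx	Fastest; preserves formatting well
Scanned documents	PP-StructureV3	Full OCR plus layout analysis
Complex layouts	PaddleOCR-VL	End-to-end, high accuracy
Academic papers	PP-StructureV3 + formula recognition	Supports LaTeX formulas
Business contracts	PP-StructureV3 + seal recognition	Requires seal detection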

9.2 Quality Assurance for Layout Recovery

class QualityAssurance:
    """Quality checks for layout recovery."""

    @staticmethod
    def check_text_coverage(ocr_results, min_confidence=0.7):
        """Check OCR confidence; items are expected as (bbox, text, confidence)."""
        low_confidence_items = []

        for item in ocr_results:
            if isinstance(item, (list, tuple)) and len(item) >= 3:
                text, confidence = item[1], item[2]
                if confidence < min_confidence:
                    low_confidence_items.append({
                        "text": text,
                        "confidence": confidence
                    })

        if low_confidence_items:
            print(f"⚠ Found {len(low_confidence_items)} low-confidence recognitions")
            return False
        return True

    @staticmethod
    def validate_layout_structure(layout_data):
        """Validate the completeness of the layout structure."""
        issues = []

        for page_idx, page in enumerate(layout_data):
            regions = page.get("regions", [])

            # Check for a title
            if not any(r.get("type") in ["title", "document_title"] for r in regions):
                issues.append(f"Page {page_idx+1} has no title")

            # Check the reading order
            orders = [r.get("reading_order", 0) for r in regions]
            if len(orders) != len(set(orders)):
                issues.append(f"Page {page_idx+1} has duplicate reading order values")

        if issues:
            print("⚠ Layout structure issues:")
            for issue in issues:
                print(f"  - {issue}")
            return False
        return True

9.3 Error Handling

import traceback
from pathlib import Path

from paddleocr import PPStructureV3

class RobustDocumentProcessor:
    """Document processor with fault tolerance."""

    def __init__(self):
        self.extractor = None
        self._init_extractor()

    def _init_extractor(self, retry=3):
        """Initialize the extractor, retrying on failure."""
        for attempt in range(retry):
            try:
                self.extractor = PPStructureV3()
                print("✓ Initialization succeeded")
                break
            except Exception as e:
                print(f"✗ Initialization failed (attempt {attempt+1}/{retry}): {e}")
                if attempt == retry - 1:
                    raise

    def safe_process(self, input_path, output_dir="output"):
        """Process a document safely; returns True on success, False on failure."""
        try:
            # Check that the file exists
            if not Path(input_path).exists():
                raise FileNotFoundError(f"File not found: {input_path}")

            # Run processing (materialize the results so they can be counted)
            results = list(self.extractor.predict(input_path))

            # Validate the results
            if not results:
                raise ValueError("Processing returned no results")

            # Save the results
            output_path = Path(output_dir)
            output_path.mkdir(parents=True, exist_ok=True)

            for result in results:
                # Result objects write their own JSON (same API as in section 4.2)
                result.save_to_json(save_path=str(output_path))

            print(f"✓ Processing succeeded: {len(results)} pages")
            return True

        except Exception as e:
            print(f"✗ Processing failed: {e}")
            traceback.print_exc()
            return False

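10. Summary and Recommendations

10.1 Key Points

PP-StructureV3 is the most complete solution. It supports:
- 23 layout element categories
- Table, formula, and chart recognition
- Reading order recovery
- Markdown/JSON output
PaddleOCR-VL suits scenarios that favor simplicity:
- End-to-end processing
- Lower resource consumption
- Support for 109 languages
Layout recovery follows two paths:
- Standard PDF → pdf2docx (fast)
- Scanned documents → PP-StructureV3 + ReportLab/python-docx (complete)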

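10.2 Recommended Workflow

1. Document preprocessing
   ├── Check the PDF type (standard/scanned)
   ├── Assess image quality
   └── Split pages

2. Layout extraction
   ├── PP-StructureV3.predict()
   ├── Extract structured information (JSON)
   └── Validate completeness

3. Content processing
   ├── OCR text correction
   ├── Table structuring
   ├── Formula to LaTeX conversion
   └── Image extraction

4. Layout recovery
   ├── Parse the JSON structure
   ├── Rebuild layout elements
   └── Generate the target format (PDF/DOCX/MD)

5. Quality checks
   ├── Text coverage
   ├── Layout completeness
   └── Format consistency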
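The coverage check in step 5 is not defined further in this document. One rough interpretation, offered here purely as an assumption, is the share of the page area that detected regions occupy: a very low value on a dense page suggests missed regions. The function name and the `[x1, y1, x2, y2]` box format are illustrative, and overlapping boxes are counted twice, which is why the result is capped at 1.0.

```python
from typing import Iterable, Sequence


def region_area_coverage(
    bboxes: Iterable[Sequence[float]], page_width: float, page_height: float
) -> float:
    """Rough share of the page area covered by region boxes ([x1, y1, x2, y2])."""
    page_area = page_width * page_height
    if page_area <= 0:
        raise ValueError("page dimensions must be positive")

    covered = 0.0
    for x1, y1, x2, y2 in bboxes:
        # Clip each box to the page before measuring it.
        width = max(0.0, min(x2, page_width) - max(x1, 0.0))
        height = max(0.0, min(y2, page_height) - max(y1, 0.0))
        covered += width * height

    # Overlaps are double-counted, so cap the ratio.
    return min(1.0, covered / page_area)


boxes = [[0, 0, 50, 100], [50, 0, 100, 50]]
print(region_area_coverage(boxes, page_width=100, page_height=100))
```

A threshold for what counts as too little coverage has to come from the test set, since sparse pages such as title pages are legitimately low.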

10.3 Performance Reference

Configuration	Hardware	Speed (pages/s)	Accuracy
Mobile models + CPU	Intel 8350C	~0.27	85%
Server models + V100	V100 GPU	~1.5	95%
Server models + A100	A100 GPU	~3.0	95%
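The throughput figures translate directly into a time budget: processing time is the page count divided by pages per second. The short calculation below applies that to a 100-page document using the approximate rates from the table. The dictionary keys and the function name are illustrative, and real timings depend on page content and enabled modules.

```python
def estimated_seconds(pages: int, pages_per_second: float) -> float:
    """Estimate processing time as pages divided by throughput."""
    if pages_per_second <= 0:
        raise ValueError("pages_per_second must be positive")
    return pages / pages_per_second


# Approximate rates taken from the performance table.
THROUGHPUT = {
    "mobile + CPU": 0.27,
    "server + V100": 1.5,
    "server + A100": 3.0,
}

for name, rate in THROUGHPUT.items():
    print(f"{name}: about {estimated_seconds(100, rate):.0f} s for 100 pages")
```

At these rates a 100-page document takes roughly 370 s on the CPU configuration, 67 s on the V100, and 33 s on the A100.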

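10.4 Suggested Next Steps

Build a test set: prepare samples of different document types
Tune parameters: adjust detection thresholds to the actual documents
Improve postprocessing: develop dedicated handling for specific formats
Integrate an LLM: use a large language model for intelligent correction
Set up monitoring: track processing quality and performance metrics

Appendix

A. Full Dependency List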

paddlepaddle-gpu>=3.0.0
paddleocr>=3.0.0
python-docx>=0.8.11
PyMuPDF>=1.23.0
pdf2docx>=0.5.6
reportlab>=4.0.0
markdown>=3.5.0
opencv-python>=4.8.0
Pillow>=10.0.0
lxml>=4.9.0
beautifulsoup4>=4.12.0

B. Environment Variables

# Set the model download source
export PADDLE_PDX_MODEL_SOURCE=HuggingFace  # or BOS

# Select the GPU to use
export CUDA_VISIBLE_DEVICES=0

# Set the cache directory
export PADDLEX_CACHE_DIR=/path/to/cache

C. Related Resources

Official documentation: https://paddlepaddle.github.io/PaddleOCR/
GitHub: https://github.com/PaddlePaddle/PaddleOCR
Model list: https://github.com/PaddlePaddle/PaddleOCR/blob/main/doc/doc_ch/models_list.md
Technical report: https://arxiv.org/abs/2507.05595

Document version: v1.0
Last updated: 2025-11-18
Authors: Claude + PaddleOCR technical team
