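PP-StructureV3 Full Layout Information Utilization Plan

📋 Executive Summary

Problem Diagnosis

The current implementation seriously underuses PP-StructureV3: it reads only the page_result.markdown attribute and ignores page_result.json, which carries the core layout information.

Key Findings

PP-StructureV3 provides complete layout parsing information, including:
- parsing_res_list: the layout elements, listed in reading order
- layout_bbox: the precise coordinates of each element
- layout_det_res: layout detection results (region type, confidence score)
- overall_ocr_res: the complete OCR results (including the bbox of every piece of text)
- layout: the layout type (single-column / double-column / multi-column)

Shortcomings of the current implementation:

# ❌ Current approach (ocr_service.py:615-646)
markdown_dict = page_result.markdown  # only retrieves the markdown text and images
markdown_texts = markdown_dict.get('markdown_texts', '')
# Each element's bbox is then set to an empty list:
#     'bbox': [],  # PP-StructureV3 doesn't provide individual bbox in this format

What it should do instead:

# ✅ Recommended approach
json_data = page_result.json  # retrieve the complete structured information
parsing_list = json_data.get('parsing_res_list', [])  # reading order + bbox
layout_det = json_data.get('layout_det_res', {})  # layout detection
overall_ocr = json_data.get('overall_ocr_res', {})  # coordinates of all text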
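
To make the gap concrete, the following is a minimal sketch that counts what a page-level JSON dict offers beyond the markdown output. The function name `summarize_page_json` and the sample dict are hypothetical; the sample is hand-written for illustration and is not real PP-StructureV3 output.

```python
from typing import Any, Dict


def summarize_page_json(json_data: Dict[str, Any]) -> Dict[str, int]:
    """Count the layout information available in one page's JSON dict."""
    parsing_list = json_data.get('parsing_res_list', [])
    layout_boxes = json_data.get('layout_det_res', {}).get('boxes', [])
    rec_texts = json_data.get('overall_ocr_res', {}).get('rec_texts', [])
    return {
        'elements': len(parsing_list),
        'elements_with_bbox': sum(
            1 for item in parsing_list if item.get('layout_bbox') is not None
        ),
        'detected_regions': len(layout_boxes),
        'text_lines': len(rec_texts),
    }


# Hand-written sample (assumption: field names follow the structure in this plan).
sample = {
    'parsing_res_list': [
        {'layout_bbox': [10, 10, 200, 40], 'text': 'Title'},
        {'layout_bbox': [10, 60, 200, 300], 'table': '<table></table>'},
        {'text': 'paragraph without a box'},
    ],
    'layout_det_res': {
        'boxes': [{'label': 'text', 'score': 0.98, 'coordinate': [10, 10, 200, 40]}]
    },
    'overall_ocr_res': {'rec_texts': ['Title', 'cell']},
}

print(summarize_page_json(sample))
```

Running a summary like this over a few real pages is a quick way to confirm that the JSON actually carries bounding boxes before refactoring around it.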

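🎯 Goals

Phase 1: Extract complete layout information (high priority)

Goal: modify analyze_layout() to use the full capabilities of PP-StructureV3

Expected results:

✅ Every layout element has a precise layout_bbox
✅ The original reading order is preserved (the order of parsing_res_list)
✅ Layout type information is captured (single-column / double-column)
✅ Region classification is extracted (text/table/figure/title/formula)
✅ Zero information loss (no need to filter overlapping text)

Phase 2: Implement dual-mode PDF generation (medium priority)

Goal: provide two PDF generation modes

Mode A: Precise coordinate positioning

Positions each element precisely using layout_bbox
Preserves the visual appearance of the original document
Suited to scenarios that require faithful layout reproduction

Mode B: Flow layout

Lays out content sequentially in parsing_res_list order
Uses the high-level ReportLab Platypus API
Zero information loss; all content remains searchable
Suited to scenarios that require translation or content processing

Phase 3: Multi-column layout handling (low priority)

Goal: take advantage of PP-StructureV3's multi-column recognition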

📊 PP-StructureV3 Full Data Structure

1. Full structure of `page_result.json`

{
    # Basic information
    "input_path": str,  # source file path
    "page_index": int,  # page number (PDF only)

    # Layout detection results
    "layout_det_res": {
        "boxes": [
            {
                "cls_id": int,        # class ID
                "label": str,         # region type: text/table/figure/title/formula/seal
                "score": float,       # confidence, 0-1
                "coordinate": [x1, y1, x2, y2]  # rectangle coordinates
            },
            ...
        ]
    },

    # Complete OCR results
    "overall_ocr_res": {
        "dt_polys": np.ndarray,      # text detection polygons
        "rec_polys": np.ndarray,     # text recognition polygons
        "rec_boxes": np.ndarray,     # text recognition rectangles (n, 4) int16
        "rec_texts": List[str],      # recognized text
        "rec_scores": np.ndarray     # recognition confidence
    },

    # **Core layout parsing results (in reading order)**
    "parsing_res_list": [
        {
            "layout_bbox": np.ndarray,  # region bounding box [x1, y1, x2, y2]
            "layout": str,              # layout type: single/double/multi-column
            "text": str,                # text content (for text regions)
            "table": str,               # table HTML (for table regions)
            "image": str,               # image path (for image regions)
            "formula": str,             # formula LaTeX (for formula regions)
            # ... other region types
        },
        ...  # list order = reading order
    ],

    # Text paragraph OCR (in reading order)
    "text_paragraphs_ocr_res": {
        "rec_polys": np.ndarray,
        "rec_texts": List[str],
        "rec_scores": np.ndarray
    },

    # Results of optional modules
    "formula_res_region1": {...},  # formula recognition results
    "table_cell_img": {...},       # table cell images
    "seal_res_region1": {...}      # seal recognition results
}
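
As an illustration of how overall_ocr_res could be consumed, here is a minimal sketch that pairs each recognized text with its rectangle and score. The helper name `iter_text_lines` is hypothetical, and the sketch assumes rec_boxes holds one [x1, y1, x2, y2] row per text line. It accepts either numpy arrays or plain lists, so it can be tried without PaddleOCR installed.

```python
from typing import Any, Dict, List


def _to_list(value: Any) -> list:
    # numpy arrays expose tolist(); plain sequences are copied as-is.
    return value.tolist() if hasattr(value, 'tolist') else list(value)


def iter_text_lines(overall_ocr_res: Dict[str, Any]) -> List[Dict[str, Any]]:
    """Pair each recognized text with its rectangle and confidence score."""
    texts = _to_list(overall_ocr_res.get('rec_texts', []))
    boxes = _to_list(overall_ocr_res.get('rec_boxes', []))
    scores = _to_list(overall_ocr_res.get('rec_scores', []))
    return [
        {'text': text, 'bbox': [int(v) for v in box], 'score': float(score)}
        for text, box, score in zip(texts, boxes, scores)
    ]


# Hand-written sample data, not real OCR output.
ocr_sample = {
    'rec_texts': ['Invoice', 'Total'],
    'rec_boxes': [[12, 8, 140, 36], [12, 400, 90, 428]],
    'rec_scores': [0.99, 0.97],
}
for line in iter_text_lines(ocr_sample):
    print(line)
```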

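2. Key fields

Field	Purpose	Data format	Importance
`parsing_res_list`	Core data: every layout element, in reading order	List[Dict]	⭐⭐⭐⭐⭐
`layout_bbox`	Precise coordinates of each element	np.ndarray [x1,y1,x2,y2]	⭐⭐⭐⭐⭐
`layout`	Layout type (single-column / double-column / multi-column)	str: single/double/multi	⭐⭐⭐⭐
`layout_det_res`	Detailed layout detection results (including region classification)	Dict with boxes list	⭐⭐⭐⭐
`overall_ocr_res`	OCR results and coordinates for all text	Dict with np.ndarray	⭐⭐⭐⭐
`markdown`	Simplified Markdown output	Dict with texts/images	⭐⭐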

🔧 Implementation Plan

Task 1: Refactor the `analyze_layout()` function

File: /backend/app/services/ocr_service.py

Scope of changes: lines 590-710

Core changes:

# Assumed to already be imported at the top of ocr_service.py
from pathlib import Path
from typing import Dict, List, Optional, Tuple

def analyze_layout(self, image_path: Path, output_dir: Optional[Path] = None, current_page: int = 0) -> Tuple[Optional[Dict], List[Dict]]:
    """
    Analyze document layout using PP-StructureV3 (using the complete JSON information)
    """
    try:
        structure_engine = self.get_structure_engine()
        results = structure_engine.predict(str(image_path))

        layout_elements = []
        images_metadata = []

        for page_idx, page_result in enumerate(results):
            # ✅ Change 1: use the complete JSON data instead of markdown only
            json_data = page_result.json

            # ✅ Change 2: extract the layout detection results
            layout_det_res = json_data.get('layout_det_res', {})
            layout_boxes = layout_det_res.get('boxes', [])

            # ✅ Change 3: extract the core parsing_res_list (reading order + bbox)
            parsing_res_list = json_data.get('parsing_res_list', [])

            if parsing_res_list:
                # *** Core logic: use parsing_res_list ***
                for idx, item in enumerate(parsing_res_list):
                    # Extract the bbox (no longer an empty list!)
                    layout_bbox = item.get('layout_bbox')
                    if layout_bbox is not None:
                        # Convert the numpy array to a plain list
                        if hasattr(layout_bbox, 'tolist'):
                            bbox = layout_bbox.tolist()
                        else:
                            bbox = list(layout_bbox)

                        # Convert to 4-point format: [[x1,y1], [x2,y1], [x2,y2], [x1,y2]]
                        if len(bbox) == 4:  # [x1, y1, x2, y2]
                            x1, y1, x2, y2 = bbox
                            bbox = [[x1, y1], [x2, y1], [x2, y2], [x1, y2]]
                    else:
                        bbox = []

                    # Extract the layout type
                    layout_type = item.get('layout', 'single')

                    # Create the element (with all information)
                    element = {
                        'element_id': idx,
                        'page': current_page,
                        'bbox': bbox,  # ✅ no longer an empty list!
                        'layout_type': layout_type,  # ✅ new: layout type
                        'reading_order': idx,  # ✅ new: reading order
                    }

                    # Extract data according to the content type
                    if 'table' in item:
                        element['type'] = 'table'
                        element['content'] = item['table']
                        # Extract the table's plain text (for translation)
                        element['extracted_text'] = self._extract_table_text(item['table'])

                    elif 'text' in item:
                        element['type'] = 'text'
                        element['content'] = item['text']

                    elif 'figure' in item or 'image' in item:
                        element['type'] = 'image'
                        element['content'] = item.get('figure') or item.get('image')

                    elif 'formula' in item:
                        element['type'] = 'formula'
                        element['content'] = item['formula']

                    elif 'title' in item:
                        element['type'] = 'title'
                        element['content'] = item['title']

                    else:
                        # Unknown type: record the first non-system field, and fall
                        # back to an explicit 'unknown' type if there is none
                        element['type'] = 'unknown'
                        element['content'] = ''
                        for key, value in item.items():
                            if key not in ['layout_bbox', 'layout']:
                                element['type'] = key
                                element['content'] = value
                                break

                    layout_elements.append(element)

            else:
                # Fall back to markdown parsing (backward compatible)
                logger.warning("No parsing_res_list found, falling back to markdown parsing")
                markdown_dict = page_result.markdown
                # ... existing markdown parsing logic ...

            # ✅ Change 4: also handle the extracted images (they still need to be saved to disk)
            markdown_dict = page_result.markdown
            markdown_images = markdown_dict.get('markdown_images', {})

            for img_idx, (img_path, img_obj) in enumerate(markdown_images.items()):
                # Save the image to disk
                try:
                    base_dir = output_dir if output_dir else image_path.parent
                    full_img_path = base_dir / img_path
                    full_img_path.parent.mkdir(parents=True, exist_ok=True)

                    if hasattr(img_obj, 'save'):
                        img_obj.save(str(full_img_path))
                        logger.info(f"Saved extracted image to {full_img_path}")
                except Exception as e:
                    logger.warning(f"Failed to save image {img_path}: {e}")

                # Look up the bbox (from the filename, or by matching parsing_res_list)
                bbox = self._find_image_bbox(img_path, parsing_res_list, layout_boxes)

                images_metadata.append({
                    'element_id': len(layout_elements) + img_idx,
                    'image_path': img_path,
                    'type': 'image',
                    'page': current_page,
                    'bbox': bbox,
                })

        if layout_elements:
            layout_data = {
                'elements': layout_elements,
                'total_elements': len(layout_elements),
                'reading_order': [e['reading_order'] for e in layout_elements],  # ✅ preserve reading order
                'layout_types': list(set(e.get('layout_type') for e in layout_elements)),  # ✅ layout type summary
            }
            logger.info(f"Detected {len(layout_elements)} layout elements (with bbox and reading order)")
            return layout_data, images_metadata
        else:
            logger.warning("No layout elements detected")
            return None, []

    except Exception as e:
        import traceback
        logger.error(f"Layout analysis error: {str(e)}\n{traceback.format_exc()}")
        return None, []


def _find_image_bbox(self, img_path: str, parsing_res_list: List[Dict], layout_boxes: List[Dict]) -> List:
    """
    Look up an image's bbox in parsing_res_list or layout_det_res
    """
    # Method 1: parse it from the filename (the existing approach)
    import re
    match = re.search(r'box_(\d+)_(\d+)_(\d+)_(\d+)', img_path)
    if match:
        x1, y1, x2, y2 = map(int, match.groups())
        return [[x1, y1], [x2, y1], [x2, y2], [x1, y2]]

    # Method 2: match against parsing_res_list (if it carries the image path)
    for item in parsing_res_list:
        if 'image' in item or 'figure' in item:
            content = item.get('image') or item.get('figure')
            if img_path in str(content):
                bbox = item.get('layout_bbox')
                if bbox is not None:
                    if hasattr(bbox, 'tolist'):
                        bbox_list = bbox.tolist()
                    else:
                        bbox_list = list(bbox)
                    if len(bbox_list) == 4:
                        x1, y1, x2, y2 = bbox_list
                        return [[x1, y1], [x2, y1], [x2, y2], [x1, y2]]

    # Method 3: match against layout_det_res by label.
    # Best effort only: this returns the first figure/image region on the page,
    # which may be the wrong one when a page contains several images.
    for box in layout_boxes:
        if box.get('label') in ['figure', 'image']:
            coord = box.get('coordinate', [])
            if len(coord) == 4:
                x1, y1, x2, y2 = coord
                return [[x1, y1], [x2, y1], [x2, y2], [x1, y2]]

    logger.warning(f"Could not find bbox for image {img_path}")
    return []
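
The refactored analyze_layout() calls self._extract_table_text(), which this plan does not define. One possible shape for it, sketched here as a standalone function using only the standard library, is to collect the text of every table cell in document order. The names `extract_table_text` and `_CellTextParser` are hypothetical, and joining cells with single spaces is an assumption about what the translation step wants.

```python
from html.parser import HTMLParser
from typing import List


class _CellTextParser(HTMLParser):
    """Collects the text of every <td>/<th> cell in document order."""

    def __init__(self) -> None:
        super().__init__()
        self.cells: List[str] = []
        self._in_cell = False
        self._buffer: List[str] = []

    def handle_starttag(self, tag, attrs):
        if tag in ('td', 'th'):
            self._in_cell = True
            self._buffer = []

    def handle_data(self, data):
        if self._in_cell:
            self._buffer.append(data)

    def handle_endtag(self, tag):
        if tag in ('td', 'th') and self._in_cell:
            text = ''.join(self._buffer).strip()
            if text:  # skip empty cells
                self.cells.append(text)
            self._in_cell = False


def extract_table_text(table_html: str) -> str:
    """Return the table's cell text as one space-separated string."""
    parser = _CellTextParser()
    parser.feed(table_html)
    parser.close()
    return ' '.join(parser.cells)


html = '<table><tr><th>Name</th><th>Qty</th></tr><tr><td>Apple</td><td> 3 </td></tr></table>'
print(extract_table_text(html))
```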

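Task 2: Update the PDF generator to use the new information

File: /backend/app/services/pdf_generator_service.py

Core changes:

Remove the text-filtering logic (no longer needed!)
- parsing_res_list is already in reading order
- Tables and images occupy their own regions, and text occupies its own
- Overlap is therefore not expected

Render elements by reading_order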

def generate_layout_pdf(self, json_path: Path, output_path: Path, mode: str = 'coordinate') -> bool:
    """
    mode: 'coordinate' or 'flow'
    """
    # Load the data
    ocr_data = self.load_ocr_json(json_path)
    layout_data = ocr_data.get('layout_data') or {}
    elements = layout_data.get('elements', [])

    if mode == 'coordinate':
        # Mode A: coordinate positioning
        return self._generate_coordinate_pdf(elements, output_path, ocr_data)
    else:
        # Mode B: flow layout (image paths are resolved relative to the JSON file)
        return self._generate_flow_pdf(elements, output_path, ocr_data, base_dir=json_path.parent)

def _generate_coordinate_pdf(self, elements: List[Dict], output_path: Path, ocr_data: Dict) -> bool:
    """Coordinate positioning mode: faithfully reproduces the layout"""
    # NOTE: pdf_canvas, page_height, scale_w and scale_h are assumed to be set up
    # from the page dimensions beforehand; that setup is omitted from this plan.

    # Sort elements by reading_order
    sorted_elements = sorted(elements, key=lambda x: x.get('reading_order', 0))

    # Group by page number
    pages = {}
    for elem in sorted_elements:
        page = elem.get('page', 0)
        if page not in pages:
            pages[page] = []
        pages[page].append(elem)

    # Render each page
    for page_num, page_elements in sorted(pages.items()):
        for elem in page_elements:
            bbox = elem.get('bbox', [])
            elem_type = elem.get('type')
            content = elem.get('content', '')

            if not bbox:
                logger.warning(f"Element {elem['element_id']} has no bbox, skipping")
                continue

            # Render at the precise coordinates
            if elem_type == 'table':
                self.draw_table_at_bbox(pdf_canvas, content, bbox, page_height, scale_w, scale_h)
            elif elem_type == 'text':
                self.draw_text_at_bbox(pdf_canvas, content, bbox, page_height, scale_w, scale_h)
            elif elem_type == 'image':
                self.draw_image_at_bbox(pdf_canvas, content, bbox, page_height, scale_w, scale_h)
            # ... other types

        pdf_canvas.showPage()  # finish the current page

    pdf_canvas.save()
    return True

def _generate_flow_pdf(self, elements: List[Dict], output_path: Path, ocr_data: Dict, base_dir: Optional[Path] = None) -> bool:
    """Flow layout mode: zero information loss"""
    from xml.sax.saxutils import escape
    from reportlab.platypus import SimpleDocTemplate, Paragraph, Table, Image, Spacer
    from reportlab.lib.styles import getSampleStyleSheet

    # Directory that relative image paths are resolved against
    base_dir = base_dir or output_path.parent

    # Sort elements by reading_order
    sorted_elements = sorted(elements, key=lambda x: x.get('reading_order', 0))

    # Build the Story (flowing content)
    story = []
    styles = getSampleStyleSheet()

    for elem in sorted_elements:
        elem_type = elem.get('type')
        content = elem.get('content', '')

        if elem_type == 'title':
            # Escape &, < and > so OCR text is not parsed as Paragraph markup
            story.append(Paragraph(escape(content), styles['Title']))
        elif elem_type == 'text':
            story.append(Paragraph(escape(content), styles['Normal']))
        elif elem_type == 'table':
            # Parse the HTML table into a ReportLab Table
            table_obj = self._html_to_reportlab_table(content)
            story.append(table_obj)
        elif elem_type == 'image':
            # Embed the image
            img_path = base_dir / content
            if img_path.exists():
                story.append(Image(str(img_path), width=400, height=300))

        story.append(Spacer(1, 12))  # spacing

    # Generate the PDF
    doc = SimpleDocTemplate(str(output_path))
    doc.build(story)
    return True
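
The coordinate mode above passes page_height, scale_w and scale_h to the draw_*_at_bbox helpers without showing the conversion they imply. A minimal sketch of that conversion follows, assuming the bbox is in image pixels with a top-left origin, that scale_w and scale_h map pixels to PDF points, and that page_height is already in points. The function name `bbox_to_pdf_rect` is hypothetical; the bottom-left origin matches ReportLab's default canvas coordinate system.

```python
from typing import List, Tuple


def bbox_to_pdf_rect(
    bbox: List[List[float]],
    page_height: float,
    scale_w: float = 1.0,
    scale_h: float = 1.0,
) -> Tuple[float, float, float, float]:
    """Convert a 4-point image-space bbox to (x, y, width, height) in PDF space.

    Image coordinates grow downward from the top-left corner, while PDF
    coordinates grow upward from the bottom-left corner, so y is flipped.
    """
    xs = [point[0] for point in bbox]
    ys = [point[1] for point in bbox]
    left, right = min(xs) * scale_w, max(xs) * scale_w
    top, bottom = min(ys) * scale_h, max(ys) * scale_h
    width = right - left
    height = bottom - top
    y = page_height - bottom  # lower edge of the box, measured from the page bottom
    return left, y, width, height


table_bbox = [[770, 776], [1122, 776], [1122, 1058], [770, 1058]]
print(bbox_to_pdf_rect(table_bbox, page_height=1200))
print(bbox_to_pdf_rect(table_bbox, page_height=600, scale_w=0.5, scale_h=0.5))
```

Keeping this arithmetic in one helper means every draw function flips the y axis the same way, which is where positioning bugs tend to appear.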

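📈 Expected Results Compared

Current implementation vs. new implementation

Metric	Current implementation ❌	New implementation ✅	Improvement
bbox information	Empty list `[]`	Precise coordinates `[x1,y1,x2,y2]`	✅ 100%
Reading order	None (mixed HTML)	`reading_order` field	✅ 100%
Layout type	None	`layout_type` (single-column / double-column)	✅ 100%
Element classification	Simple check for `<table`	Precise classification (9+ types)	✅ 100%
Information loss	21.6% of text filtered out	0% loss (flow mode)	✅ 100%
Coordinate coverage	bbox for some images only	bbox for every element	✅ 100%
PDF modes	Coordinate positioning only	Dual mode (coordinate + flow)	✅ New feature
Translation support	Difficult (information loss)	Full (zero loss)	✅ 100%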

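Specific improvements

1. Zero information loss

# ❌ Current: 342 text regions → 268 after filtering = 74 lost (21.6%)
filtered_text_regions = self._filter_text_in_regions(text_regions, regions_to_avoid)

# ✅ New implementation: no filtering needed, use parsing_res_list directly
# Every element (text, table, image) sits in its own region, with no overlap
for elem in sorted(elements, key=lambda x: x['reading_order']):
    render_element(elem)  # render every element, zero loss

2. Precise bbox

# ❌ Current: bbox is an empty list
{
    'element_id': 0,
    'type': 'table',
    'bbox': [],  # ← cannot be positioned!
}

# ✅ New implementation: precise coordinates taken from layout_bbox
{
    'element_id': 0,
    'type': 'table',
    'bbox': [[770, 776], [1122, 776], [1122, 1058], [770, 1058]],  # ← precise position!
    'reading_order': 3,
    'layout_type': 'single'
}

3. Reading order

# ❌ Current: correct reading order cannot be guaranteed
# Tables, images, and text are mixed together in no reliable order

# ✅ New implementation: the order of parsing_res_list = reading order
elements = sorted(elements, key=lambda x: x['reading_order'])
# Elements are rendered in reading_order: 0, 1, 2, 3, ...
# This preserves the document's logical order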

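🚀 Implementation Steps

Stage 1: Core refactoring (2-3 hours)

Modify the analyze_layout() function
- Extract parsing_res_list from page_result.json
- Use layout_bbox as each element's bbox
- Preserve reading_order
- Extract layout_type
- Test the output JSON structure
Add helper functions
- _find_image_bbox(): look up an image's bbox from multiple sources
- _convert_bbox_format(): normalize the bbox format
- _extract_element_content(): extract content according to element type
Test and verify
- Re-run OCR on the existing test documents
- Check that the generated JSON contains bbox values
- Verify that reading_order is correct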
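
The two verification checks above can be automated. A minimal sketch follows, assuming the single-page layout_data structure returned by the refactored analyze_layout(), where reading_order is expected to run 0..n-1 and every bbox is in 4-point format. The function name `validate_layout_data` is hypothetical.

```python
from typing import Any, Dict, List


def validate_layout_data(layout_data: Dict[str, Any]) -> List[str]:
    """Return a list of human-readable problems; an empty list means OK."""
    problems: List[str] = []
    elements = layout_data.get('elements', [])

    for elem in elements:
        bbox = elem.get('bbox')
        if not bbox:
            problems.append(f"element {elem.get('element_id')} has no bbox")
        elif len(bbox) != 4 or any(
            not isinstance(point, (list, tuple)) or len(point) != 2 for point in bbox
        ):
            problems.append(f"element {elem.get('element_id')} bbox is not in 4-point format")

    # Assumption: one page per layout_data, so the order is exactly 0..n-1.
    orders = [elem.get('reading_order') for elem in elements]
    if orders != list(range(len(elements))):
        problems.append('reading_order is not the sequence 0..n-1')

    return problems


good = {'elements': [
    {'element_id': 0, 'bbox': [[0, 0], [1, 0], [1, 1], [0, 1]], 'reading_order': 0},
    {'element_id': 1, 'bbox': [[0, 2], [1, 2], [1, 3], [0, 3]], 'reading_order': 1},
]}
bad = {'elements': [{'element_id': 0, 'bbox': [], 'reading_order': 1}]}

print(validate_layout_data(good))
print(validate_layout_data(bad))
```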

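Stage 2: PDF generation improvements (2-3 hours)

Implement the coordinate positioning mode
- Remove the text-filtering logic
- Render each element precisely at its bbox
- Use reading_order to decide the rendering order (for elements on the same page)
Implement the flow layout mode
- Use ReportLab Platypus
- Build the Story in reading_order
- Implement flow rendering for each element type
Add API parameters
- /tasks/{id}/download/pdf?mode=coordinate (default)
- /tasks/{id}/download/pdf?mode=flow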
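
Whatever web framework serves these endpoints, the mode query parameter needs a default and a rejection path for unknown values. A framework-independent sketch, assuming only the two modes named above; `PdfMode` and `parse_pdf_mode` are hypothetical names.

```python
from enum import Enum
from typing import Optional


class PdfMode(str, Enum):
    COORDINATE = 'coordinate'
    FLOW = 'flow'


def parse_pdf_mode(raw: Optional[str]) -> PdfMode:
    """Map the raw ?mode= query value to a PdfMode, defaulting to coordinate."""
    if raw is None or raw.strip() == '':
        return PdfMode.COORDINATE
    try:
        return PdfMode(raw.strip().lower())
    except ValueError:
        raise ValueError(
            f"Unsupported PDF mode {raw!r}; expected 'coordinate' or 'flow'"
        ) from None


print(parse_pdf_mode(None))
print(parse_pdf_mode('flow'))
```

Rejecting unknown values explicitly avoids the silent behavior of the `else` branch in generate_layout_pdf(), which would otherwise treat any typo as flow mode.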

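Stage 3: Testing and optimization (1-2 hours)

Full testing
- Single-page document test
- Multi-page PDF test
- Multi-column layout test
- Complex table test
Performance optimization
- Reduce repeated computation
- Optimize bbox conversion
- Add caching
Documentation updates
- Update the API documentation
- Add usage examples
- Update the architecture diagram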

💡 Key Technical Details

1. Numpy array handling

# layout_bbox is a numpy.ndarray and must be converted to a plain list
layout_bbox = item.get('layout_bbox')
if hasattr(layout_bbox, 'tolist'):
    bbox = layout_bbox.tolist()  # [x1, y1, x2, y2]
else:
    bbox = list(layout_bbox)

# Convert to 4-point format
x1, y1, x2, y2 = bbox
bbox_4point = [[x1, y1], [x2, y1], [x2, y2], [x1, y2]]
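
The snippet above fails when layout_bbox is missing or does not contain exactly four values. A more defensive version, sketched as the _convert_bbox_format() helper listed in the implementation steps, could look like the following. Returning an empty list for anything unrecognized is an assumption that mirrors the `bbox = []` fallback in analyze_layout().

```python
from typing import Any, List


def convert_bbox_format(raw_bbox: Any) -> List[List[float]]:
    """Normalize a bbox to 4-point format, or return [] if it is unusable."""
    if raw_bbox is None:
        return []
    # numpy arrays expose tolist(); tuples and lists are copied.
    values = raw_bbox.tolist() if hasattr(raw_bbox, 'tolist') else list(raw_bbox)
    if len(values) != 4:
        return []

    # Case 1: flat rectangle [x1, y1, x2, y2]
    if all(isinstance(v, (int, float)) for v in values):
        x1, y1, x2, y2 = (float(v) for v in values)
        return [[x1, y1], [x2, y1], [x2, y2], [x1, y2]]

    # Case 2: already four [x, y] points
    if all(isinstance(p, (list, tuple)) and len(p) == 2 for p in values):
        return [[float(x), float(y)] for x, y in values]

    return []


print(convert_bbox_format([770, 776, 1122, 1058]))
print(convert_bbox_format(None))
```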

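2. Layout type handling

# Adjust the rendering strategy according to layout_type
layout_type = elem.get('layout_type', 'single')

if layout_type == 'double':
    # Double-column layout: may need special handling
    pass
elif layout_type == 'multi':
    # Multi-column layout: more complex handling
    pass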

3. Guaranteeing reading order

# Make sure elements are rendered in the correct order
elements = layout_data.get('elements', [])
sorted_elements = sorted(elements, key=lambda x: (
    x.get('page', 0),          # first by page number
    x.get('reading_order', 0)  # then by reading order
))

⚠️ Risks and Mitigations

Risk 1: Backward compatibility

Problem: old JSON files do not have the new fields

Mitigation:

# Add fallback logic to analyze_layout()
parsing_res_list = json_data.get('parsing_res_list', [])
if not parsing_res_list:
    logger.warning("No parsing_res_list, using markdown fallback")
    # use the old markdown parsing logic
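
The fallback above covers newly processed pages. For JSON files that were already written without the new fields, the reader side can fill in defaults when loading. A minimal sketch, assuming the layout_data / elements structure used elsewhere in this plan; the function names and the chosen defaults are hypothetical.

```python
from typing import Any, Dict, List


def normalize_legacy_element(elem: Dict[str, Any], position: int) -> Dict[str, Any]:
    """Return a copy of elem with the new fields filled in when missing."""
    normalized = dict(elem)
    normalized.setdefault('bbox', [])
    normalized.setdefault('page', 0)
    normalized.setdefault('layout_type', 'single')
    # Old files have no reading order, so fall back to the stored position.
    normalized.setdefault('reading_order', position)
    return normalized


def load_elements(ocr_data: Dict[str, Any]) -> List[Dict[str, Any]]:
    """Read elements from old or new JSON data in a uniform shape."""
    layout_data = ocr_data.get('layout_data') or {}
    elements = layout_data.get('elements', [])
    return [normalize_legacy_element(e, i) for i, e in enumerate(elements)]


old_json = {'layout_data': {'elements': [
    {'element_id': 0, 'type': 'table', 'content': '<table></table>', 'bbox': []},
    {'element_id': 1, 'type': 'text', 'content': 'hello'},
]}}
for element in load_elements(old_json):
    print(element)
```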

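Risk 2: PaddleOCR version differences

Problem: different PaddleOCR versions may produce different output formats

Mitigation:

Record the PaddleOCR version in the JSON
Add version detection logic
Support multiple versions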
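
Recording the version can be done with the standard library alone. The sketch below reads installed package versions through importlib.metadata and falls back to 'unknown' when a package is absent. The function names and the key names in the returned dict are hypothetical; 'paddleocr' and 'paddlex' are the PyPI distribution names.

```python
from importlib import metadata
from typing import Dict


def get_package_version(distribution: str) -> str:
    """Return the installed version of a distribution, or 'unknown'."""
    try:
        return metadata.version(distribution)
    except metadata.PackageNotFoundError:
        return 'unknown'


def build_engine_info() -> Dict[str, str]:
    """Version info to store alongside the OCR results in the output JSON."""
    return {
        'paddleocr_version': get_package_version('paddleocr'),
        'paddlex_version': get_package_version('paddlex'),
    }


print(build_engine_info())
```

With the version stored in each JSON file, the reader can later branch on it when field names differ between releases.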

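Risk 3: Performance impact

Problem: extracting more information may increase processing time

Mitigation:

Extract detailed information only when it is needed
Use caching
Process multiple pages in parallel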
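
As one illustration of the parallel option, the sketch below fans page images out to a thread pool and collects the results in page order. It takes the per-page function as a parameter because whether the PP-StructureV3 engine can safely be shared across threads is not established in this plan; `analyze_pages_in_parallel` and `fake_analyze` are hypothetical names, and the stand-in function does no real OCR.

```python
from concurrent.futures import ThreadPoolExecutor
from typing import Any, Callable, Dict, List, Sequence


def analyze_pages_in_parallel(
    page_paths: Sequence[str],
    analyze_page: Callable[[str], Dict[str, Any]],
    max_workers: int = 4,
) -> List[Dict[str, Any]]:
    """Run analyze_page over every page; results keep the input order."""
    with ThreadPoolExecutor(max_workers=max_workers) as pool:
        # Executor.map yields results in the order of its inputs.
        return list(pool.map(analyze_page, page_paths))


def fake_analyze(path: str) -> Dict[str, Any]:
    # Stand-in for a real per-page layout analysis call.
    return {'page': path, 'elements': []}


results = analyze_pages_in_parallel(['p1.png', 'p2.png', 'p3.png'], fake_analyze)
print([r['page'] for r in results])
```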

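📝 TODO Checklist

Phase 1: Core refactoring

Modify analyze_layout() to use page_result.json
Extract parsing_res_list
Extract layout_bbox and convert its format
Preserve reading_order
Extract layout_type
Implement _find_image_bbox()
Add fallback logic (backward compatibility)
Test the new JSON output structure

Phase 2: PDF generation improvements

Implement _generate_coordinate_pdf()
Implement _generate_flow_pdf()
Remove the old text-filtering logic
Add the mode parameter to the API
Implement an HTML table parser (for flow mode)
Test the PDF output of both modes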
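
For the HTML table parser item, one possible starting point is to turn the table HTML into a rectangular list of rows using the standard library. That is the data shape ReportLab's platypus Table takes as its first argument. The sketch below is an assumption about how _html_to_reportlab_table() might begin, not its actual implementation: the names are hypothetical, and colspan/rowspan are ignored.

```python
from html.parser import HTMLParser
from typing import List, Optional


class _TableRowsParser(HTMLParser):
    """Builds a list of rows, each a list of cell strings."""

    def __init__(self) -> None:
        super().__init__()
        self.rows: List[List[str]] = []
        self._row: Optional[List[str]] = None
        self._cell: Optional[List[str]] = None

    def handle_starttag(self, tag, attrs):
        if tag == 'tr':
            self._row = []
        elif tag in ('td', 'th') and self._row is not None:
            self._cell = []

    def handle_data(self, data):
        if self._cell is not None:
            self._cell.append(data)

    def handle_endtag(self, tag):
        if tag in ('td', 'th') and self._cell is not None and self._row is not None:
            self._row.append(''.join(self._cell).strip())
            self._cell = None
        elif tag == 'tr' and self._row is not None:
            self.rows.append(self._row)
            self._row = None


def html_table_to_rows(table_html: str) -> List[List[str]]:
    """Parse table HTML into rows padded to equal length."""
    parser = _TableRowsParser()
    parser.feed(table_html)
    parser.close()
    width = max((len(row) for row in parser.rows), default=0)
    return [row + [''] * (width - len(row)) for row in parser.rows]


table_html = '<table><tr><th>Item</th><th>Qty</th></tr><tr><td>Apple</td></tr></table>'
print(html_table_to_rows(table_html))
```

Padding short rows keeps the grid rectangular, so the result can be handed to a table flowable without a separate shape check.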

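Phase 3: Testing and documentation

Single-page document test
Multi-page PDF test
Complex layout test (multi-column, table-heavy)
Performance test
Update the API documentation
Update the usage instructions
Write a migration guide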

🎓 Learning Resources

PaddleOCR official documentation
- PP-StructureV3 Usage Tutorial
- PaddleX PP-StructureV3
ReportLab documentation
- Platypus User Guide
- Table Styling
Reference implementation
- PaddleOCR GitHub: /paddlex/inference/pipelines/layout_parsing/pipeline_v2.py

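🏁 Success Criteria

Must have

✅ Every layout element has a precise bbox
✅ Reading order is preserved correctly
✅ Zero information loss (flow mode)
✅ Backward compatible (old JSON files still work)

Should have

✅ Dual-mode PDF generation (coordinate + flow)
✅ Multi-column layouts handled correctly
✅ Translation support (table text can be extracted)
✅ No noticeable performance regression

Nice to have

✅ Support for more element types (formulas, seals)
✅ Layout type statistics and analysis
✅ Visualization of the layout structure

Planning completed: 2025-01-18
Estimated development time: 5-8 hours
Priority: P0 (highest)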
