OCR/openspec/changes/archive/2025-11-18-fix-result-preview-and-pdf-download/PP-STRUCTURE-ENHANCEMENT-PLAN.md

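# Plan for Using PP-StructureV3's Full Layout Information

## 📋 Executive Summary

### Problem Diagnosis
The current implementation **seriously underuses PP-StructureV3**: it reads only the `page_result.markdown` attribute and ignores `page_result.json`, which carries the core layout information.

### Key Findings
1. **PP-StructureV3 provides complete layout-parsing information**, including:
   - `parsing_res_list`: layout elements listed in reading order
   - `layout_bbox`: precise coordinates of each element
   - `layout_det_res`: layout detection results (region type, confidence)
   - `overall_ocr_res`: full OCR results (including the bbox of every piece of text)
   - `layout`: layout type (single/double/multi-column)

2. **Shortcomings of the current implementation**:
   ```python
   # ❌ Current approach (ocr_service.py:615-646)
   markdown_dict = page_result.markdown  # Only retrieves markdown and images
   markdown_texts = markdown_dict.get('markdown_texts', '')
   # bbox is set to an empty list
   'bbox': [],  # PP-StructureV3 doesn't provide individual bbox in this format
   ```

3. **What it should do instead**:
   ```python
   # ✅ Correct approach
   json_data = page_result.json  # Get the full structured information
   parsing_list = json_data.get('parsing_res_list', [])  # Reading order + bbox
   layout_det = json_data.get('layout_det_res', {})  # Layout detection
   overall_ocr = json_data.get('overall_ocr_res', {})  # Coordinates of all text
   ```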

---

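## 🎯 Goals

### Phase 1: Extract full layout information (high priority)
**Goal**: Modify `analyze_layout()` to use PP-StructureV3's full capabilities

**Expected results**:
- ✅ Every layout element has a precise `layout_bbox`
- ✅ The original reading order is preserved (the order of `parsing_res_list`)
- ✅ Layout type information is available (single/double column)
- ✅ Region classes are extracted (text/table/figure/title/formula)
- ✅ Zero information loss (no need to filter overlapping text)

### Phase 2: Implement dual-mode PDF generation (medium priority)
**Goal**: Offer two PDF generation modes

**Mode A: Coordinate positioning mode**
- Positions each element precisely using `layout_bbox`
- Preserves the visual appearance of the original document
- Suited to cases that need faithful layout reproduction

**Mode B: Flow layout mode**
- Lays out content as a flow, following the order of `parsing_res_list`
- Uses the high-level ReportLab Platypus API
- Zero information loss; all content is searchable
- Suited to translation or content-processing scenarios

### Phase 3: Multi-column layout handling (low priority)
**Goal**: Take advantage of PP-StructureV3's multi-column recognition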

---

## 📊 Full PP-StructureV3 Data Structure

### 1. Full structure of `page_result.json`

```python
{
    # Basic information
    "input_path": str,  # Source file path
    "page_index": int,  # Page number (PDF only)

    # Layout detection results
    "layout_det_res": {
        "boxes": [
            {
                "cls_id": int,        # Class ID
                "label": str,         # Region type: text/table/figure/title/formula/seal
                "score": float,       # Confidence, 0-1
                "coordinate": [x1, y1, x2, y2]  # Rectangle coordinates
            },
            ...
        ]
    },

    # Full OCR results
    "overall_ocr_res": {
        "dt_polys": np.ndarray,      # Text detection polygons
        "rec_polys": np.ndarray,     # Text recognition polygons
        "rec_boxes": np.ndarray,     # Text recognition rectangles, shape (n, 4), int16
        "rec_texts": List[str],      # Recognized text
        "rec_scores": np.ndarray     # Recognition confidence
    },

    # **Core layout parsing results (in reading order)**
    "parsing_res_list": [
        {
            "layout_bbox": np.ndarray,  # Region bounding box [x1, y1, x2, y2]
            "layout": str,              # Layout type: single/double/multi-column
            "text": str,                # Text content (for text regions)
            "table": str,               # Table HTML (for table regions)
            "image": str,               # Image path (for image regions)
            "formula": str,             # Formula LaTeX (for formula regions)
            # ... other region types
        },
        ...  # List order = reading order
    ],

    # Text paragraph OCR (in reading order)
    "text_paragraphs_ocr_res": {
        "rec_polys": np.ndarray,
        "rec_texts": List[str],
        "rec_scores": np.ndarray
    },

    # Results from optional modules
    "formula_res_region1": {...},  # Formula recognition results
    "table_cell_img": {...},       # Table cell images
    "seal_res_region1": {...}      # Seal recognition results
}
```

### 2. Key fields

| Field | Purpose | Data format | Importance |
|------|------|---------|--------|
| `parsing_res_list` | **Core data**: all layout elements in reading order | List[Dict] | ⭐⭐⭐⭐⭐ |
| `layout_bbox` | Precise coordinates of each element | np.ndarray [x1,y1,x2,y2] | ⭐⭐⭐⭐⭐ |
| `layout` | Layout type (single/double/multi-column) | str: single/double/multi | ⭐⭐⭐⭐ |
| `layout_det_res` | Detailed layout detection results (including region classes) | Dict with boxes list | ⭐⭐⭐⭐ |
| `overall_ocr_res` | OCR results and coordinates for all text | Dict with np.ndarray | ⭐⭐⭐⭐ |
| `markdown` | Simplified Markdown output | Dict with texts/images | ⭐⭐ |
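
A quick way to check which of these fields a given page actually contains is to summarize the dictionary before refactoring against it. The sketch below is only illustrative: `summarize_page_json` is a hypothetical helper, and it assumes the dictionary has the shape described in the structure listing above.

```python
from collections import Counter
from typing import Any, Dict


def summarize_page_json(json_data: Dict[str, Any]) -> Dict[str, Any]:
    """Summarize a page_result.json-style dict (hypothetical helper)."""
    boxes = json_data.get("layout_det_res", {}).get("boxes", [])
    parsing = json_data.get("parsing_res_list", [])
    return {
        # How many detected regions there are per label (text/table/figure/...)
        "region_labels": dict(Counter(box.get("label", "unknown") for box in boxes)),
        # Number of entries in the reading-order list
        "parsed_elements": len(parsing),
        # Entries that carry a usable bounding box
        "elements_with_bbox": sum(
            1 for item in parsing if item.get("layout_bbox") is not None
        ),
    }
```

Logging this summary for a handful of test documents shows early whether `parsing_res_list` and `layout_bbox` are populated in the installed version.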

---

## 🔧 Implementation Plan

### Task 1: Refactor the `analyze_layout()` function

**File**: `/backend/app/services/ocr_service.py`

**Scope of changes**: Lines 590-710

**Core changes**:

```python
def analyze_layout(self, image_path: Path, output_dir: Optional[Path] = None, current_page: int = 0) -> Tuple[Optional[Dict], List[Dict]]:
    """
    Analyze document layout using PP-StructureV3 (using the full JSON output)
    """
    try:
        structure_engine = self.get_structure_engine()
        results = structure_engine.predict(str(image_path))

        layout_elements = []
        images_metadata = []

        for page_idx, page_result in enumerate(results):
            # ✅ Change 1: use the full JSON data instead of markdown only
            json_data = page_result.json

            # ✅ Change 2: extract the layout detection results
            layout_det_res = json_data.get('layout_det_res', {})
            layout_boxes = layout_det_res.get('boxes', [])

            # ✅ Change 3: extract the core parsing_res_list (reading order + bbox)
            parsing_res_list = json_data.get('parsing_res_list', [])

            if parsing_res_list:
                # *** Core logic: use parsing_res_list ***
                for idx, item in enumerate(parsing_res_list):
                    # Extract the bbox (no longer an empty list!)
                    layout_bbox = item.get('layout_bbox')
                    if layout_bbox is not None:
                        # Convert a numpy array to a plain list
                        if hasattr(layout_bbox, 'tolist'):
                            bbox = layout_bbox.tolist()
                        else:
                            bbox = list(layout_bbox)

                        # Convert to 4-point format: [[x1,y1], [x2,y1], [x2,y2], [x1,y2]]
                        if len(bbox) == 4:  # [x1, y1, x2, y2]
                            x1, y1, x2, y2 = bbox
                            bbox = [[x1, y1], [x2, y1], [x2, y2], [x1, y2]]
                    else:
                        bbox = []

                    # Extract the layout type
                    layout_type = item.get('layout', 'single')

                    # Build the element (with all available information)
                    element = {
                        'element_id': idx,
                        'page': current_page,
                        'bbox': bbox,  # ✅ No longer an empty list!
                        'layout_type': layout_type,  # ✅ New: layout type
                        'reading_order': idx,  # ✅ New: reading order
                    }

                    # Extract data according to the content type
                    if 'table' in item:
                        element['type'] = 'table'
                        element['content'] = item['table']
                        # Extract the table's plain text (used for translation)
                        element['extracted_text'] = self._extract_table_text(item['table'])

                    elif 'text' in item:
                        element['type'] = 'text'
                        element['content'] = item['text']

                    elif 'figure' in item or 'image' in item:
                        element['type'] = 'image'
                        element['content'] = item.get('figure') or item.get('image')

                    elif 'formula' in item:
                        element['type'] = 'formula'
                        element['content'] = item['formula']

                    elif 'title' in item:
                        element['type'] = 'title'
                        element['content'] = item['title']

                    else:
                        # Unknown type: record the first non-system field
                        for key, value in item.items():
                            if key not in ['layout_bbox', 'layout']:
                                element['type'] = key
                                element['content'] = value
                                break

                    layout_elements.append(element)

            else:
                # Fall back to markdown parsing (backward compatibility)
                logger.warning("No parsing_res_list found, falling back to markdown parsing")
                markdown_dict = page_result.markdown
                # ... existing markdown parsing logic ...

            # ✅ Change 4: also handle extracted images (they still need to be saved to disk)
            markdown_dict = page_result.markdown
            markdown_images = markdown_dict.get('markdown_images', {})

            for img_idx, (img_path, img_obj) in enumerate(markdown_images.items()):
                # Save the image to disk
                try:
                    base_dir = output_dir if output_dir else image_path.parent
                    full_img_path = base_dir / img_path
                    full_img_path.parent.mkdir(parents=True, exist_ok=True)

                    if hasattr(img_obj, 'save'):
                        img_obj.save(str(full_img_path))
                        logger.info(f"Saved extracted image to {full_img_path}")
                except Exception as e:
                    logger.warning(f"Failed to save image {img_path}: {e}")

                # Look up the bbox (from the file name or by matching parsing_res_list)
                bbox = self._find_image_bbox(img_path, parsing_res_list, layout_boxes)

                images_metadata.append({
                    'element_id': len(layout_elements) + img_idx,
                    'image_path': img_path,
                    'type': 'image',
                    'page': current_page,
                    'bbox': bbox,
                })

        if layout_elements:
            layout_data = {
                'elements': layout_elements,
                'total_elements': len(layout_elements),
                'reading_order': [e['reading_order'] for e in layout_elements],  # ✅ Preserve reading order
                'layout_types': list(set(e.get('layout_type') for e in layout_elements)),  # ✅ Layout types present
            }
            logger.info(f"Detected {len(layout_elements)} layout elements (with bbox and reading order)")
            return layout_data, images_metadata
        else:
            logger.warning("No layout elements detected")
            return None, []

    except Exception as e:
        import traceback
        logger.error(f"Layout analysis error: {str(e)}\n{traceback.format_exc()}")
        return None, []


def _find_image_bbox(self, img_path: str, parsing_res_list: List[Dict], layout_boxes: List[Dict]) -> List:
    """
    Look up an image's bbox in parsing_res_list or layout_det_res
    """
    # Method 1: parse it from the file name (the existing approach)
    import re
    match = re.search(r'box_(\d+)_(\d+)_(\d+)_(\d+)', img_path)
    if match:
        x1, y1, x2, y2 = map(int, match.groups())
        return [[x1, y1], [x2, y1], [x2, y2], [x1, y2]]

    # Method 2: match against parsing_res_list (if it carries image path information)
    for item in parsing_res_list:
        if 'image' in item or 'figure' in item:
            content = item.get('image') or item.get('figure')
            if img_path in str(content):
                bbox = item.get('layout_bbox')
                if bbox is not None:
                    if hasattr(bbox, 'tolist'):
                        bbox_list = bbox.tolist()
                    else:
                        bbox_list = list(bbox)
                    if len(bbox_list) == 4:
                        x1, y1, x2, y2 = bbox_list
                        return [[x1, y1], [x2, y1], [x2, y2], [x1, y2]]

    # Method 3: match against layout_det_res by label
    # (returns the first figure/image box, so it is imprecise when a page has several)
    for box in layout_boxes:
        if box.get('label') in ['figure', 'image']:
            coord = box.get('coordinate', [])
            if len(coord) == 4:
                x1, y1, x2, y2 = coord
                return [[x1, y1], [x2, y1], [x2, y2], [x1, y2]]

    logger.warning(f"Could not find bbox for image {img_path}")
    return []
```
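
The refactored `analyze_layout()` calls `self._extract_table_text(item['table'])`, which this plan does not define. One possible shape for it, assuming `item['table']` is an HTML string, is sketched below with the standard-library `html.parser`. The standalone function name `extract_table_text` and the tab/newline output format are illustrative choices, not part of the existing code.

```python
from html.parser import HTMLParser
from typing import List


class _TableTextParser(HTMLParser):
    """Collects the text of each <td>/<th> cell, one list per <tr> row."""

    def __init__(self) -> None:
        super().__init__()
        self.rows: List[List[str]] = []
        self._cell_parts: List[str] = []
        self._in_cell = False

    def handle_starttag(self, tag, attrs):
        if tag == "tr":
            self.rows.append([])
        elif tag in ("td", "th"):
            self._in_cell = True
            self._cell_parts = []

    def handle_endtag(self, tag):
        if tag in ("td", "th") and self._in_cell:
            if not self.rows:  # tolerate cells that appear outside any <tr>
                self.rows.append([])
            # Collapse internal whitespace so each cell is a single line
            self.rows[-1].append(" ".join("".join(self._cell_parts).split()))
            self._in_cell = False

    def handle_data(self, data):
        if self._in_cell:
            self._cell_parts.append(data)


def extract_table_text(table_html: str) -> str:
    """Return table text with tab-separated cells and one row per line."""
    parser = _TableTextParser()
    parser.feed(table_html)
    parser.close()
    return "\n".join("\t".join(row) for row in parser.rows if row)
```

Keeping cell and row boundaries in the plain text makes it easier to map translated text back onto the table later.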

---

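### Task 2: Update the PDF generator to use the new information

**File**: `/backend/app/services/pdf_generator_service.py`

**Core changes**:

1. **Remove the text-filtering logic** (no longer needed!)
   - `parsing_res_list` is already in reading order
   - Tables and images have their own regions, and so does text
   - There is no overlap problem

2. **Render elements by `reading_order`**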
   ```python
   def generate_layout_pdf(self, json_path: Path, output_path: Path, mode: str = 'coordinate') -> bool:
       """
       mode: 'coordinate' or 'flow'
       """
       # Load the data
       ocr_data = self.load_ocr_json(json_path)
       layout_data = ocr_data.get('layout_data', {})
       elements = layout_data.get('elements', [])

       if mode == 'coordinate':
           # Mode A: coordinate positioning
           return self._generate_coordinate_pdf(elements, output_path, ocr_data)
       else:
           # Mode B: flow layout (image paths are resolved relative to the JSON file)
           return self._generate_flow_pdf(elements, output_path, ocr_data, json_path.parent)

   def _generate_coordinate_pdf(self, elements: List[Dict], output_path: Path, ocr_data: Dict) -> bool:
       """Coordinate positioning mode: reproduce the layout precisely"""
       # NOTE: pdf_canvas, page_height, scale_w and scale_h are assumed to come
       # from the generator's existing canvas setup, which is omitted here.

       # Sort elements by reading_order
       sorted_elements = sorted(elements, key=lambda x: x.get('reading_order', 0))

       # Group by page number
       pages = {}
       for elem in sorted_elements:
           page = elem.get('page', 0)
           if page not in pages:
               pages[page] = []
           pages[page].append(elem)

       # Render each page
       for page_num, page_elements in sorted(pages.items()):
           for elem in page_elements:
               bbox = elem.get('bbox', [])
               elem_type = elem.get('type')
               content = elem.get('content', '')

               if not bbox:
                   logger.warning(f"Element {elem['element_id']} has no bbox, skipping")
                   continue

               # Render at the precise coordinates
               if elem_type == 'table':
                   self.draw_table_at_bbox(pdf_canvas, content, bbox, page_height, scale_w, scale_h)
               elif elem_type == 'text':
                   self.draw_text_at_bbox(pdf_canvas, content, bbox, page_height, scale_w, scale_h)
               elif elem_type == 'image':
                   self.draw_image_at_bbox(pdf_canvas, content, bbox, page_height, scale_w, scale_h)
               # ... other types

       # ... finish pages and save the canvas as in the existing code ...
       return True

   def _generate_flow_pdf(self, elements: List[Dict], output_path: Path, ocr_data: Dict, base_dir: Path) -> bool:
       """Flow layout mode: zero information loss"""
       from xml.sax.saxutils import escape
       from reportlab.platypus import SimpleDocTemplate, Paragraph, Table, Image, Spacer
       from reportlab.lib.styles import getSampleStyleSheet

       # Sort by page first, then by reading_order within each page
       sorted_elements = sorted(elements, key=lambda x: (x.get('page', 0), x.get('reading_order', 0)))

       # Build the Story (flowing content)
       story = []
       styles = getSampleStyleSheet()

       for elem in sorted_elements:
           elem_type = elem.get('type')
           content = elem.get('content', '')

           if elem_type == 'title':
               # Paragraph parses XML-like markup, so escape raw OCR text
               story.append(Paragraph(escape(content), styles['Title']))
           elif elem_type == 'text':
               story.append(Paragraph(escape(content), styles['Normal']))
           elif elem_type == 'table':
               # Parse the HTML table into a ReportLab Table
               table_obj = self._html_to_reportlab_table(content)
               story.append(table_obj)
           elif elem_type == 'image':
               # Embed the image
               img_path = base_dir / content
               if img_path.exists():
                   story.append(Image(str(img_path), width=400, height=300))

           story.append(Spacer(1, 12))  # Spacing

       # Generate the PDF
       doc = SimpleDocTemplate(str(output_path))
       doc.build(story)
       return True
   ```
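
The `draw_*_at_bbox` calls above take `page_height`, `scale_w` and `scale_h`, which suggests a conversion from image pixel coordinates (origin at the top-left) to PDF coordinates (origin at the bottom-left). A minimal sketch of that conversion follows. It assumes `scale_w` and `scale_h` are multipliers from pixels to PDF points and that `page_height` is already in points; `bbox_to_pdf_rect` is a hypothetical helper name.

```python
from typing import Sequence, Tuple


def bbox_to_pdf_rect(
    bbox: Sequence[Sequence[float]],
    page_height: float,
    scale_w: float = 1.0,
    scale_h: float = 1.0,
) -> Tuple[float, float, float, float]:
    """Convert a 4-point image-space bbox to a PDF (x, y, width, height) rect.

    Image coordinates grow downward from the top-left corner, while PDF
    coordinates grow upward from the bottom-left, so the y axis is flipped.
    """
    xs = [point[0] for point in bbox]
    ys = [point[1] for point in bbox]
    x = min(xs) * scale_w
    width = (max(xs) - min(xs)) * scale_w
    height = (max(ys) - min(ys)) * scale_h
    # The PDF y of the rectangle's bottom edge is measured from the page bottom
    y = page_height - max(ys) * scale_h
    return x, y, width, height
```

For the table bbox used as an example in this plan, `bbox_to_pdf_rect([[770, 776], [1122, 776], [1122, 1058], [770, 1058]], page_height=1000, scale_w=0.5, scale_h=0.5)` yields a rectangle whose lower-left corner is at (385, 471) with size 176 x 141 points.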

---

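## 📈 Expected Results Compared

### Current implementation vs new implementation

| Metric | Current implementation ❌ | New implementation ✅ | Improvement |
|------|-----------|----------|------|
| **bbox information** | Empty list `[]` | Precise coordinates from `layout_bbox` (stored as a 4-point polygon) | ✅ 100% |
| **Reading order** | None (mixed HTML) | `reading_order` field | ✅ 100% |
| **Layout type** | None | `layout_type` (single/double column) | ✅ 100% |
| **Element classification** | Simple `<table` check | Precise classification (9+ types) | ✅ 100% |
| **Information loss** | 21.6% of text filtered out | 0% loss (flow mode) | ✅ 100% |
| **Coordinate precision** | bbox for some images only | bbox for every element | ✅ 100% |
| **PDF modes** | Coordinate positioning only | Dual mode (coordinate + flow) | ✅ New feature |
| **Translation support** | Difficult (information loss) | Full (zero loss) | ✅ 100% |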

### Specific Improvements

#### 1. Zero information loss
```python
# ❌ Current: 342 text regions → 268 after filtering = 74 lost (21.6%)
filtered_text_regions = self._filter_text_in_regions(text_regions, regions_to_avoid)

# ✅ New implementation: no filtering needed; use parsing_res_list directly
# Every element (text, table, image) sits in its own region, with no overlap
for elem in sorted(elements, key=lambda x: x['reading_order']):
    render_element(elem)  # Render every element, zero loss
```

#### 2. Precise bbox
```python
# ❌ Current: bbox is an empty list
{
    'element_id': 0,
    'type': 'table',
    'bbox': [],  # ← cannot be positioned!
}

# ✅ New implementation: precise coordinates taken from layout_bbox
{
    'element_id': 0,
    'type': 'table',
    'bbox': [[770, 776], [1122, 776], [1122, 1058], [770, 1058]],  # ← precisely positioned!
    'reading_order': 3,
    'layout_type': 'single'
}
```

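#### 3. Reading order
```python
# ❌ Current: correct reading order cannot be guaranteed
# Tables, images and text are mixed together in arbitrary order

# ✅ New implementation: the order of parsing_res_list is the reading order
elements = sorted(elements, key=lambda x: x['reading_order'])
# Elements render in reading_order: 0, 1, 2, 3, ...
# The document's logical order is preserved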
```

---

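## 🚀 Implementation Steps

### Phase 1: Core refactoring (2-3 hours)

1. **Modify the `analyze_layout()` function**
   - Extract `parsing_res_list` from `page_result.json`
   - Use `layout_bbox` as each element's bbox
   - Preserve `reading_order`
   - Extract `layout_type`
   - Test the output JSON structure

2. **Add helper functions**
   - `_find_image_bbox()`: look up an image's bbox from multiple sources
   - `_convert_bbox_format()`: unify the bbox format
   - `_extract_element_content()`: extract content by type

3. **Test and verify**
   - Re-run OCR on the existing test documents
   - Check that the generated JSON contains bboxes
   - Verify that reading_order is correct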
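
The verification step can be partly automated. The sketch below checks the two properties listed: every element has a 4-point bbox, and `reading_order` runs 0, 1, 2, ... without gaps on each page. `validate_layout_elements` is a hypothetical helper, and it assumes the element dictionaries produced by the refactored `analyze_layout()`.

```python
from typing import Any, Dict, List


def validate_layout_elements(elements: List[Dict[str, Any]]) -> List[str]:
    """Return a list of human-readable problems; an empty list means OK."""
    problems: List[str] = []
    orders_by_page: Dict[int, List[Any]] = {}

    for elem in elements:
        bbox = elem.get("bbox")
        is_four_point = (
            isinstance(bbox, list)
            and len(bbox) == 4
            and all(isinstance(p, (list, tuple)) and len(p) == 2 for p in bbox)
        )
        if not is_four_point:
            problems.append(
                f"element {elem.get('element_id')}: bbox is not a 4-point polygon"
            )
        orders_by_page.setdefault(elem.get("page", 0), []).append(
            elem.get("reading_order")
        )

    for page, orders in sorted(orders_by_page.items()):
        present = sorted(o for o in orders if o is not None)
        # A complete page has exactly the orders 0 .. n-1, each used once
        if present != list(range(len(orders))):
            problems.append(
                f"page {page}: reading_order is not 0..{len(orders) - 1}"
            )

    return problems
```

Running this over the regenerated JSON for each test document gives a quick pass/fail signal before any visual comparison of the PDFs.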

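### Phase 2: PDF generation improvements (2-3 hours)

1. **Implement coordinate positioning mode**
   - Remove the text-filtering logic
   - Render each element precisely at its bbox
   - Use reading_order to decide the render order (for elements on the same page)

2. **Implement flow layout mode**
   - Use ReportLab Platypus
   - Build the Story in reading_order
   - Implement flow rendering for each element type

3. **Add an API parameter**
   - `/tasks/{id}/download/pdf?mode=coordinate` (default)
   - `/tasks/{id}/download/pdf?mode=flow`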
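
Because the `mode` value arrives from a query string, it may be worth validating it before it reaches the generator. The sketch below is framework-independent; `resolve_pdf_mode` and `VALID_PDF_MODES` are hypothetical names, and the default follows the list above.

```python
from typing import Optional

VALID_PDF_MODES = ("coordinate", "flow")


def resolve_pdf_mode(mode: Optional[str] = None) -> str:
    """Normalize the `mode` query parameter, defaulting to 'coordinate'."""
    if mode is None or mode.strip() == "":
        return "coordinate"
    normalized = mode.strip().lower()
    if normalized not in VALID_PDF_MODES:
        raise ValueError(
            f"Unsupported PDF mode {mode!r}; expected one of {VALID_PDF_MODES}"
        )
    return normalized
```

Rejecting unknown values explicitly avoids a typo such as `mode=flwo` silently selecting whichever branch happens to be the fallback.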

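### Phase 3: Testing and optimization (1-2 hours)

1. **Full testing**
   - Single-page documents
   - Multi-page PDFs
   - Multi-column layouts
   - Complex tables

2. **Performance optimization**
   - Reduce repeated computation
   - Optimize bbox conversion
   - Caching

3. **Documentation updates**
   - Update the API documentation
   - Add usage examples
   - Update the architecture diagram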

---

## 💡 Key Technical Details

### 1. Numpy array handling
```python
# layout_bbox is a numpy.ndarray and must be converted to a plain list
layout_bbox = item.get('layout_bbox')
if hasattr(layout_bbox, 'tolist'):
    bbox = layout_bbox.tolist()  # [x1, y1, x2, y2]
else:
    bbox = list(layout_bbox)

# Convert to 4-point format
x1, y1, x2, y2 = bbox
bbox_4point = [[x1, y1], [x2, y1], [x2, y2], [x1, y2]]
```
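
The same conversion is needed in both `analyze_layout()` and `_find_image_bbox()`, so it could live in the `_convert_bbox_format()` helper that the implementation steps mention. The following is one possible sketch of that helper, written as a plain function. Returning an empty list for missing or unrecognized input is an assumption chosen to match the `bbox = []` convention used elsewhere in this plan.

```python
from typing import Any, List


def convert_bbox_format(raw: Any) -> List[List[float]]:
    """Normalize a bbox to 4-point form [[x1,y1],[x2,y1],[x2,y2],[x1,y2]].

    Accepts None, a flat [x1, y1, x2, y2] sequence, an existing 4-point
    polygon, or any object exposing tolist() (for example a numpy array).
    """
    if raw is None:
        return []

    values = raw.tolist() if hasattr(raw, "tolist") else list(raw)

    # Flat rectangle: expand to the four corners, clockwise from top-left
    if len(values) == 4 and all(isinstance(v, (int, float)) for v in values):
        x1, y1, x2, y2 = values
        return [[x1, y1], [x2, y1], [x2, y2], [x1, y2]]

    # Already a 4-point polygon: copy it into plain lists
    if len(values) == 4 and all(
        isinstance(p, (list, tuple)) and len(p) == 2 for p in values
    ):
        return [list(p) for p in values]

    # Anything else is treated as "no usable bbox"
    return []
```

Handling `None` inside the helper removes the need for the separate `is not None` checks at each call site.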

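### 2. Layout type handling
```python
# Adjust the rendering strategy according to layout_type
layout_type = elem.get('layout_type', 'single')

if layout_type == 'double':
    # Two-column layout: may need special handling
    pass
elif layout_type == 'multi':
    # Multi-column layout: more complex handling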
    pass
```

### 3. Guaranteeing reading order
```python
# Make sure elements render in the correct order
elements = layout_data.get('elements', [])
sorted_elements = sorted(elements, key=lambda x: (
    x.get('page', 0),          # First by page number
    x.get('reading_order', 0)  # Then by reading order
))
```

---

## ⚠️ Risks and Mitigations

### Risk 1: Backward compatibility
**Problem**: Old JSON files do not have the new fields

**Mitigation**:
```python
# Add fallback logic in analyze_layout()
parsing_res_list = json_data.get('parsing_res_list', [])
if not parsing_res_list:
    logger.warning("No parsing_res_list, using markdown fallback")
    # Use the old markdown parsing logic
```
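
The fallback above covers newly analyzed pages. Previously saved JSON files are a separate case: the PDF generator may load elements that lack `reading_order`, `layout_type` or `bbox`. A minimal sketch of filling in defaults at load time follows. `upgrade_legacy_elements` is a hypothetical helper, and using each element's position within its page as the reading order is an assumption about what the old files can offer.

```python
from typing import Any, Dict, List


def upgrade_legacy_elements(elements: List[Dict[str, Any]]) -> List[Dict[str, Any]]:
    """Return copies of the elements with the new fields filled in."""
    upgraded: List[Dict[str, Any]] = []
    position_on_page: Dict[int, int] = {}

    for elem in elements:
        item = dict(elem)  # shallow copy so the loaded data is not mutated
        page = item.get("page", 0)
        position = position_on_page.get(page, 0)

        item.setdefault("bbox", [])
        item.setdefault("layout_type", "single")
        # Old files have no reading order, so fall back to list position
        item.setdefault("reading_order", position)

        position_on_page[page] = position + 1
        upgraded.append(item)

    return upgraded
```

With this in place the generator can treat old and new files uniformly: coordinate mode skips elements whose bbox is empty, and flow mode still has an order to follow.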

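### Risk 2: PaddleOCR version differences
**Problem**: Different PaddleOCR versions may produce different output formats

**Mitigation**:
- Record the PaddleOCR version in the JSON
- Add version detection logic
- Support multiple versions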
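
Recording the version can be done with the standard-library `importlib.metadata`, without importing the OCR packages themselves. In the sketch below, the `engine_versions` key and the helper names are hypothetical, and the package names `paddleocr` and `paddlex` are assumed to be the installed distribution names.

```python
from importlib.metadata import PackageNotFoundError, version
from typing import Any, Dict, Iterable


def get_package_version(name: str) -> str:
    """Return the installed version of a distribution, or 'unknown'."""
    try:
        return version(name)
    except PackageNotFoundError:
        return "unknown"


def attach_engine_versions(
    ocr_data: Dict[str, Any],
    packages: Iterable[str] = ("paddleocr", "paddlex"),
) -> Dict[str, Any]:
    """Return a copy of ocr_data with an 'engine_versions' mapping added."""
    result = dict(ocr_data)
    result["engine_versions"] = {
        name: get_package_version(name) for name in packages
    }
    return result
```

Version detection logic can then branch on the stored string when an old JSON file is loaded, instead of guessing the format from which keys happen to be present.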

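### Risk 3: Performance impact
**Problem**: Extracting more information may increase processing time

**Mitigation**:
- Extract detailed information only when needed
- Use caching
- Process multiple pages in parallel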

---

## 📝 TODO Checklist

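### Phase 1: Core refactoring
- [ ] Modify `analyze_layout()` to use `page_result.json`
- [ ] Extract `parsing_res_list`
- [ ] Extract `layout_bbox` and convert its format
- [ ] Preserve `reading_order`
- [ ] Extract `layout_type`
- [ ] Implement `_find_image_bbox()`
- [ ] Add fallback logic (backward compatibility)
- [ ] Test the new JSON output structure

### Phase 2: PDF generation improvements
- [ ] Implement `_generate_coordinate_pdf()`
- [ ] Implement `_generate_flow_pdf()`
- [ ] Remove the old text-filtering logic
- [ ] Add the mode parameter to the API
- [ ] Implement an HTML table parser (for flow mode)
- [ ] Test the PDF output of both modes

### Phase 3: Testing and documentation
- [ ] Single-page document tests
- [ ] Multi-page PDF tests
- [ ] Complex layout tests (multi-column, table-heavy)
- [ ] Performance tests
- [ ] Update the API documentation
- [ ] Update the usage instructions
- [ ] Write a migration guide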

---

## 🎓 Learning Resources

1. **Official PaddleOCR documentation**
   - [PP-StructureV3 Usage Tutorial](http://www.paddleocr.ai/main/en/version3.x/pipeline_usage/PP-StructureV3.html)
   - [PaddleX PP-StructureV3](https://paddlepaddle.github.io/PaddleX/3.0/en/pipeline_usage/tutorials/ocr_pipelines/PP-StructureV3.html)

2. **ReportLab documentation**
   - [Platypus User Guide](https://www.reportlab.com/docs/reportlab-userguide.pdf)
   - [Table Styling](https://www.reportlab.com/docs/reportlab-userguide.pdf#page=80)

3. **Reference implementation**
   - PaddleX GitHub: `/paddlex/inference/pipelines/layout_parsing/pipeline_v2.py`

---

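## 🏁 Success Criteria

### Must have
- ✅ Every layout element has a precise bbox
- ✅ Reading order is preserved correctly
- ✅ Zero information loss (flow mode)
- ✅ Backward compatible (old JSON still works)

### Should have
- ✅ Dual-mode PDF generation (coordinate + flow)
- ✅ Multi-column layouts handled correctly
- ✅ Translation support (table text can be extracted)
- ✅ No noticeable performance regression

### Nice to have
- ✅ Support for more element types (formulas, seals)
- ✅ Layout type statistics and analysis
- ✅ Visualization of the layout structure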

---

**Plan completed**: 2025-01-18
**Estimated development time**: 5-8 hours
**Priority**: P0 (highest priority)