feat: create extract-table-cell-boxes proposal and archive old proposal

- Archive unify-image-scaling proposal to archive/2025-11-28
- Create new extract-table-cell-boxes proposal for supplementing PPStructureV3 with direct SLANeXt model calls to extract table cell bounding boxes
- Add debug logging to pp_structure_enhanced.py for table cell boxes investigation
- Discovered that PPStructureV3 high-level API filters out cell bbox data, but paddlex.create_model() can directly invoke underlying models

🤖 Generated with [Claude Code](https://claude.com/claude-code)

Co-Authored-By: Claude <noreply@anthropic.com>
2025-11-28 12:15:06 +08:00
parent dda9621e17
commit 801ee9c4b6
7 changed files with 393 additions and 4 deletions
--- a/backend/app/services/pp_structure_enhanced.py
+++ b/backend/app/services/pp_structure_enhanced.py
@@ -355,14 +355,54 @@ class PPStructureEnhanced:

            # Special handling for tables
            if mapped_type == ElementType.TABLE:
-                # Use HTML content from content-based detection or extract from 'res'
-                html_content = html_table_content  # From content-based detection
-                if not html_content and 'res' in item and isinstance(item['res'], dict):
-                    html_content = item['res'].get('html', '')
+                # 1. Extract HTML (existing logic)
+                html_content = html_table_content
+                res_data = {}
+
+                # Get the 'res' dict (contains html and boxes)
+                if 'res' in item and isinstance(item['res'], dict):
+                    res_data = item['res']
+                    logger.info(f"[TABLE] Found 'res' dict with keys: {list(res_data.keys())}")
+                    if not html_content:
+                        html_content = res_data.get('html', '')
+                else:
+                    logger.info(f"[TABLE] No 'res' key in item. Available keys: {list(item.keys())}")
+
                if html_content:
                    element['html'] = html_content
                    element['extracted_text'] = self._extract_text_from_html(html_content)

+                # 2. [New] Extract cell coordinates (boxes)
+                # SLANet typically returns boxes as [[x1, y1, x2, y2], ...]
+                if 'boxes' in res_data:
+                    cell_boxes = res_data['boxes']
+                    logger.info(f"[TABLE] Found {len(cell_boxes)} cell boxes in res_data")
+
+                    # Get the table's own offset (used to convert cell-relative coordinates to absolute ones)
+                    table_x, table_y = 0, 0
+                    if len(bbox) >= 2:  # bbox is [x1, y1, x2, y2]
+                        table_x, table_y = bbox[0], bbox[1]
+
+                    processed_cells = []
+                    for cell_box in cell_boxes:
+                        # Make sure the box has a valid format
+                        if isinstance(cell_box, (list, tuple)) and len(cell_box) >= 4:
+                            # Convert to absolute coordinates: cell x + table x, cell y + table y
+                            abs_cell_box = [
+                                cell_box[0] + table_x,
+                                cell_box[1] + table_y,
+                                cell_box[2] + table_x,
+                                cell_box[3] + table_y
+                            ]
+                            processed_cells.append(abs_cell_box)
+
+                    # Store the processed cell coordinates in the element
+                    element['cell_boxes'] = processed_cells
+                    element['raw_cell_boxes'] = cell_boxes
+                    logger.info(f"[TABLE] Processed {len(processed_cells)} cell boxes with table offset ({table_x}, {table_y})")
+                else:
+                    logger.info(f"[TABLE] No 'boxes' key in res_data. Available: {list(res_data.keys()) if res_data else 'empty'}")
+
            # Special handling for images/figures
            elif mapped_type in [ElementType.IMAGE, ElementType.FIGURE]:
                # Save image if path provided
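Note on the hunk above: the new loop keeps only the first four values of each cell box, while the proposal added in this same commit describes SLANeXt output as 8-value polygons. A minimal sketch of an offset helper that handles both layouts, assuming every box is a flat sequence of alternating x and y values. `offset_cell_box` is a hypothetical name and is not part of this commit.

```python
from typing import List, Sequence


def offset_cell_box(cell_box: Sequence[float], table_x: float, table_y: float) -> List[float]:
    """Shift a flat [x, y, x, y, ...] box from table-relative to page coordinates."""
    # Assumption: values alternate x, y; this covers both [x1, y1, x2, y2]
    # rectangles and [x1, y1, ..., x4, y4] polygons.
    if len(cell_box) < 4 or len(cell_box) % 2 != 0:
        raise ValueError(f"unexpected cell box length: {len(cell_box)}")
    # Even indices are x values, odd indices are y values.
    return [
        value + (table_x if index % 2 == 0 else table_y)
        for index, value in enumerate(cell_box)
    ]
```

With the sample values quoted in the proposal, `offset_cell_box([11, 4, 692, 5, 675, 57, 10, 56], 84, 269)` gives the global polygon listed as cell 0 in the spec.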
--- a/openspec/changes/archive/2025-11-28-unify-image-scaling/proposal.md
+++ b/openspec/changes/archive/2025-11-28-unify-image-scaling/proposal.md
--- a/openspec/changes/archive/2025-11-28-unify-image-scaling/specs/ocr-processing/spec.md
+++ b/openspec/changes/archive/2025-11-28-unify-image-scaling/specs/ocr-processing/spec.md
--- a/openspec/changes/archive/2025-11-28-unify-image-scaling/tasks.md
+++ b/openspec/changes/archive/2025-11-28-unify-image-scaling/tasks.md
--- a/openspec/changes/extract-table-cell-boxes/proposal.md
+++ b/openspec/changes/extract-table-cell-boxes/proposal.md
@@ -0,0 +1,134 @@
+# Change: Extract Table Cell Boxes via Direct Model Invocation
+
+## Why
+
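+When processing tables, the high-level PPStructureV3 (PaddleX 3.x) API outputs only the table content as HTML; it **does not return the coordinates (bbox) of each cell**.
+
+### Problem Analysis
+
+Testing confirmed the following: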
+
+```python
+# PPStructureV3 output (parsing_res_list)
+{
+    'block_label': 'table',
+    'block_content': '<html>...</html>',  # HTML only
+    'block_bbox': [84, 269, 1174, 1508],  # bbox of the whole table only
+    # ❌ no cell boxes
+}
+```
+
+However, the underlying model (SLANeXt) **does output cell boxes**:
+
+```python
+# Call the SLANeXt model directly
+from paddlex import create_model
+table_model = create_model('SLANeXt_wired')
+result = table_model.predict(table_img)
+# result.json['res']['bbox'] → 29 cell coordinates (8-value polygons)
+```
+
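+### Impact
+
+Missing cell boxes cause:
+- Inaccurate table rendering in the OCR Track's PDF layout reconstruction
+- No way to locate each cell precisely
+- Table content that may overlap or be misaligned
+
+## What Changes
+
+### Approach: Additionally Invoke the Underlying SLANeXt Model
+
+When `pp_structure_enhanced.py` processes a table, it additionally calls the underlying PaddleX model to obtain cell boxes: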
+
+```
+┌─────────────────────────────────────────────────────────────┐
+│                        Revised flow                         │
+├─────────────────────────────────────────────────────────────┤
+│                                                             │
+│  PPStructureV3.predict()                                    │
+│       │                                                     │
+│       ▼                                                     │
+│  parsing_res_list (HTML only)                               │
+│       │                                                     │
+│       ▼ (for TABLE elements)                                │
+│  ┌─────────────────────────────────────┐                    │
+│  │ Call the underlying model           │                    │
+│  │ 1. Crop the table region            │                    │
+│  │ 2. Call SLANeXt for cell boxes      │                    │
+│  │ 3. Convert to global coordinates    │                    │
+│  │ 4. Store in element['cell_boxes']   │                    │
+│  └─────────────────────────────────────┘                    │
+│       │                                                     │
+│       ▼                                                     │
+│  Complete table element (HTML + cell_boxes)                 │
+│                                                             │
+└─────────────────────────────────────────────────────────────┘
+```
+
+### Model Selection Logic
+
+Choose the SLANeXt model that matches the table type:
+
+| Table type | How determined | Model used |
+|---------|---------|---------|
+| Wired table | PP-LCNet classification | SLANeXt_wired |
+| Wireless table | PP-LCNet classification | SLANeXt_wireless |
+
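+### Cell Boxes Format
+
+SLANeXt outputs each bbox as an 8-value polygon:
+```python
+[x1, y1, x2, y2, x3, y3, x4, y4]  # coordinates of the four corner points
+# e.g. [11, 4, 692, 5, 675, 57, 10, 56]
+```
+
+These must be converted to global coordinates by adding the table offset.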
+
+## Impact
+
+### Affected Specs
+- `ocr-processing` - table processing enhancement
+
+### Affected Code
+- `backend/app/services/pp_structure_enhanced.py`
+  - Add a cache for the underlying models
+  - Modify the TABLE handling logic in `_process_parsing_res_list`
+  - Add cell box extraction and coordinate conversion
+
+- `backend/app/services/pdf_generator_service.py`
+  - Use cell_boxes to improve table rendering
+
+### Quality Impact
+
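+| Item | Before | After |
+|------|--------|--------|
+| Cell coordinates | ❌ None | ✅ Available (8-value polygons) |
+| Table rendering | Rows and columns divided evenly | Precisely positioned |
+| Layout reconstruction | Content may overlap | Accurate placement |
+
+### Performance Impact
+
+- Extra model call: each table requires one additional SLANeXt call
+- Caching: model instances can be cached to avoid reloading
+- Estimated overhead: about 0.5-1 second per table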
+
+## Risks
+
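+1. **Performance overhead**
+   - Risk: the extra model calls increase processing time
+   - Mitigation: cache model instances and call them only when needed
+
+2. **Model inconsistency**
+   - Risk: PPStructureV3 may internally run the model with different parameters
+   - Mitigation: use the same model configuration
+
+3. **Coordinate conversion errors**
+   - Risk: the bbox coordinate systems may differ
+   - Mitigation: test thoroughly to make sure coordinates are converted correctly
+
+## Not Included
+
+- Bypassing PPStructureV3 entirely (it is kept for layout analysis)
+- RT-DETR cell detection (a possible follow-up enhancement)
+- Enhanced processing of other element types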
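Note on the cell box format in the proposal above: a renderer that works with axis-aligned rectangles could reduce each 8-value polygon to its bounding rectangle. A minimal sketch, assuming the polygon is a flat list of four (x, y) corner pairs. `polygon_to_rect` is a hypothetical helper, not something defined in this commit.

```python
from typing import List, Sequence


def polygon_to_rect(polygon: Sequence[float]) -> List[float]:
    """Return the axis-aligned [x_min, y_min, x_max, y_max] around a flat polygon."""
    # Assumption: the input is [x1, y1, x2, y2, x3, y3, x4, y4].
    if len(polygon) != 8:
        raise ValueError(f"expected 8 values, got {len(polygon)}")
    xs = polygon[0::2]  # x1, x2, x3, x4
    ys = polygon[1::2]  # y1, y2, y3, y4
    return [min(xs), min(ys), max(xs), max(ys)]
```

Taking min and max over all four corners keeps the result valid even when the detected quadrilateral is slightly skewed, at the cost of a rectangle that is marginally larger than the cell.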
--- a/openspec/changes/extract-table-cell-boxes/specs/ocr-processing/spec.md
+++ b/openspec/changes/extract-table-cell-boxes/specs/ocr-processing/spec.md
@@ -0,0 +1,132 @@
+# Spec: OCR Processing - Table Cell Boxes Extraction
+
+## Overview
+
+When the OCR Track processes a table, additionally call the underlying PaddleX SLANeXt model to obtain the coordinates of each cell.
+
+## Requirements
+
+### 1. Model Management
+
+#### 1.1 Model Cache
+```python
+class PPStructureEnhanced:
+    def __init__(self, structure_engine):
+        self.structure_engine = structure_engine
+        # Cache for the underlying models
+        self._table_cls_model = None
+        self._wired_table_model = None
+        self._wireless_table_model = None
+```
+
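+#### 1.2 Lazy Loading
+- Models are loaded only when first needed
+- Use the `paddlex.create_model()` API
+- Model configuration is read from settings
+
+### 2. Cell Boxes Extraction Flow
+
+#### 2.1 Trigger Condition
+Triggered when `mapped_type == ElementType.TABLE` and a valid `block_bbox` is present.
+
+#### 2.2 Processing Steps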
+
+```
+1. Crop the table image
+   - Crop from the original image according to block_bbox
+   - Make sure the bounds stay inside the image
+
+2. Determine the table type (optional)
+   - Call PP-LCNet_x1_0_table_cls
+   - Get the wired/wireless classification result
+   - Or reuse the classification result inside PPStructureV3
+
+3. Call the matching SLANeXt model
+   - wired → SLANeXt_wired
+   - wireless → SLANeXt_wireless
+
+4. Extract cell boxes
+   - Read from result.json['res']['bbox']
+   - Format: [[x1,y1,x2,y2,x3,y3,x4,y4], ...]
+
+5. Convert coordinates
+   - Convert relative coordinates to global coordinates
+   - global_box = [box[i] + offset for each point]
+   - offset = (table_x, table_y) from block_bbox
+
+6. Store in element
+   - element['cell_boxes'] = processed_boxes
+   - element['cell_boxes_format'] = 'polygon_8'
+```
+
+### 3. Data Format
+
+#### 3.1 Cell Boxes Structure
+```python
+element = {
+    'element_id': 'pp3_0_3',
+    'type': ElementType.TABLE,
+    'bbox': [84, 269, 1174, 1508],  # bbox of the whole table
+    'content': '<html>...</html>',   # HTML content
+    'cell_boxes': [                  # New: cell coordinates
+        [95, 273, 776, 274, 759, 326, 94, 325],  # cell 0 (global coordinates)
+        [119, 296, 575, 295, 560, 399, 117, 401], # cell 1
+        # ...
+    ],
+    'cell_boxes_format': 'polygon_8',  # Describes the coordinate format
+    'table_type': 'wired',  # Optional: table type
+}
+```
+
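+#### 3.2 Coordinate Format
+- `polygon_8`: 8-value polygon `[x1,y1,x2,y2,x3,y3,x4,y4]`
+- Order: top-left → top-right → bottom-right → bottom-left
+
+### 4. Error Handling
+
+#### 4.1 Failure Cases
+- Model fails to load
+- Image cropping fails
+- Prediction returns an empty result
+
+#### 4.2 Handling
+- Log a warning
+- Continue processing; the element will not contain cell_boxes
+- The existing HTML extraction flow is not affected
+
+### 5. Configuration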
+
+```python
+# config.py
+class Settings:
+    # Whether to enable cell boxes extraction
+    enable_table_cell_boxes_extraction: bool = True
+
+    # Table structure recognition models (already exist)
+    wired_table_model_name: str = "SLANeXt_wired"
+    wireless_table_model_name: str = "SLANeXt_wireless"
+```
+
+## Implementation Notes
+
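+### Model Sharing
+PPStructureV3 already loads these models internally, but the high-level API does not expose them.
+Calling `paddlex.create_model()` directly loads the models again.
+Accessing PPStructureV3's internal model instances was considered (tested: not feasible).
+
+### Performance Optimization
+- Model instances are cached in PPStructureEnhanced
+- Avoid reloading the model every time a table is processed
+- Consider releasing the cache when memory is tight
+
+### Coordinate Scaling
+If the image was scaled before layout analysis (ScalingInfo),
+the cell box coordinates must also be scaled back to the original coordinate system.
+
+## Test Cases
+
+1. **Wired tables**: confirm that cell boxes are extracted correctly
+2. **Wireless tables**: confirm that model selection and extraction are correct
+3. **Complex tables**: tables with cells spanning rows and columns
+4. **Small tables**: simple tables with few cells
+5. **Error handling**: invalid bbox, model failure, and similar cases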
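Note on the spec above: the lazy loading, bounds clamping, and warn-and-continue requirements can be sketched without importing PaddleX by injecting the model factory. This is a minimal sketch under stated assumptions: `LazyModelCache` and `clamp_bbox` are hypothetical names, and in the real service the factory would presumably be `paddlex.create_model`.

```python
import logging
from typing import Any, Callable, Dict, Optional, Sequence, Tuple

logger = logging.getLogger(__name__)


class LazyModelCache:
    """Create each model on first use and reuse the instance afterwards."""

    def __init__(self, factory: Callable[[str], Any]):
        # Assumption: factory(name) returns a ready-to-use model object.
        self._factory = factory
        self._models: Dict[str, Any] = {}

    def get(self, name: str) -> Optional[Any]:
        if name not in self._models:
            try:
                self._models[name] = self._factory(name)
            except Exception as exc:
                # A load failure is logged and reported as None so that the
                # caller can skip cell boxes and keep the HTML result.
                logger.warning("Could not load model %s: %s", name, exc)
                return None
        return self._models[name]

    def release(self) -> None:
        """Drop all cached models, e.g. when memory is tight."""
        self._models.clear()


def clamp_bbox(
    bbox: Sequence[float], width: int, height: int
) -> Optional[Tuple[int, int, int, int]]:
    """Clamp [x1, y1, x2, y2] to the image; return None if nothing is left."""
    if len(bbox) < 4:
        return None
    x1, y1, x2, y2 = (int(round(v)) for v in bbox[:4])
    x1, x2 = max(0, min(x1, width)), max(0, min(x2, width))
    y1, y2 = max(0, min(y1, height)), max(0, min(y2, height))
    if x2 <= x1 or y2 <= y1:
        return None
    return x1, y1, x2, y2
```

A caller would fetch the model with `cache.get("SLANeXt_wired")`, crop with the clamped bbox, and omit `cell_boxes` from the element whenever either step returns `None`.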
--- a/openspec/changes/extract-table-cell-boxes/tasks.md
+++ b/openspec/changes/extract-table-cell-boxes/tasks.md
@@ -0,0 +1,83 @@
+# Tasks: Extract Table Cell Boxes
+
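+## Phase 1: Infrastructure
+
+### Task 1.1: Add configuration
+- [ ] Add the `enable_table_cell_boxes_extraction` setting to `config.py`
+- [ ] Confirm that the existing table model settings are usable
+
+### Task 1.2: Model cache
+- [ ] Add model cache attributes to `PPStructureEnhanced`
+- [ ] Implement lazy loading
+- [ ] Add a method to release models (optional)
+
+## Phase 2: Cell Boxes Extraction
+
+### Task 2.1: Modify the table handling logic
+- [ ] Add cell boxes extraction to `_process_parsing_res_list`
+- [ ] Implement image cropping
+- [ ] Call the SLANeXt model to get results
+
+### Task 2.2: Coordinate conversion
+- [ ] Implement conversion from relative to global coordinates
+- [ ] Handle coordinate scaling from ScalingInfo
+- [ ] Verify that coordinate conversion is correct
+
+### Task 2.3: Error handling
+- [ ] Wrap the extraction in try/except
+- [ ] Implement graceful fallback on failure
+- [ ] Add appropriate logging
+
+## Phase 3: PDF Generation Improvements
+
+### Task 3.1: Render tables using cell boxes
+- [ ] Modify `draw_table_region` to use cell_boxes
+- [ ] Compute row heights and column widths from actual cell positions
+- [ ] Test the rendering result
+
+### Task 3.2: Fallback
+- [ ] Use the existing logic when cell_boxes is unavailable
+- [ ] Ensure backward compatibility
+
+## Phase 4: Testing and Verification
+
+### Task 4.1: Unit tests
+- [ ] Test cell boxes extraction
+- [ ] Test coordinate conversion
+- [ ] Test error handling
+
+### Task 4.2: Integration tests
+- [ ] Test the OCR Track with real PDFs
+- [ ] Verify the PDF layout reconstruction result
+- [ ] Performance testing
+
+## Phase 5: Cleanup
+
+### Task 5.1: Remove old code
+- [ ] Evaluate and remove Paragraph wrapping code that is no longer needed
+- [ ] Clean up debug logging
+- [ ] Update documentation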
+
+---
+
+## Technical Details
+
+### Key Code Locations
+
+| File | Change |
+|------|---------|
+| `backend/app/core/config.py` | Add configuration |
+| `backend/app/services/pp_structure_enhanced.py` | Main implementation |
+| `backend/app/services/pdf_generator_service.py` | Use cell_boxes |
+
+### Dependencies
+
+```python
+from paddlex import create_model
+```
+
+### Test Data
+
+- Task ID: `79a3d256-88f6-41d4-a7e9-3e358c85db40`
+- Table bbox: `[84, 269, 1174, 1508]`
+- Expected cell count: 29 (SLANeXt_wired)
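Note on Task 3.1 above: one way to compute row heights and column widths from actual cell positions is to cluster the cell edges into shared grid lines. A minimal sketch, assuming the cells have already been reduced to axis-aligned `[x1, y1, x2, y2]` rectangles. `cluster_edges`, `grid_lines`, and the 5-pixel tolerance are illustrative assumptions, not part of this commit.

```python
from typing import Iterable, List, Sequence, Tuple


def cluster_edges(values: Iterable[float], tolerance: float = 5.0) -> List[float]:
    """Merge nearly equal edge positions; return one averaged position per cluster."""
    clusters: List[List[float]] = []
    for value in sorted(values):
        # Compare with the last member of the current cluster, so a run of
        # closely spaced edges is merged into a single grid line.
        if clusters and value - clusters[-1][-1] <= tolerance:
            clusters[-1].append(value)
        else:
            clusters.append([value])
    return [sum(cluster) / len(cluster) for cluster in clusters]


def grid_lines(
    rects: Sequence[Sequence[float]], tolerance: float = 5.0
) -> Tuple[List[float], List[float]]:
    """Return (vertical line x positions, horizontal line y positions)."""
    xs = [x for rect in rects for x in (rect[0], rect[2])]
    ys = [y for rect in rects for y in (rect[1], rect[3])]
    return cluster_edges(xs, tolerance), cluster_edges(ys, tolerance)
```

Column widths and row heights are then the differences between consecutive grid lines. Cells that span several rows or columns contribute only their outer edges, so the interior lines still come from the unmerged cells.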