feat: add table detection options and scan artifact removal

- Add TableDetectionSelector component for wired/wireless/region detection
- Add CV-based table line detector module (disabled due to poor performance)
- Add scan artifact removal preprocessing step (removes faint horizontal lines)
- Add PreprocessingConfig schema with remove_scan_artifacts option
- Update frontend PreprocessingSettings with scan artifact toggle
- Integrate table detection config into ProcessingPage
- Archive extract-table-cell-boxes proposal

🤖 Generated with [Claude Code](https://claude.com/claude-code)

Co-Authored-By: Claude <noreply@anthropic.com>
2025-11-30 13:21:50 +08:00
parent f5a2c8a750
commit 95ae1f1bdb
17 changed files with 1906 additions and 344 deletions
--- a/openspec/changes/extract-table-cell-boxes/tasks.md
+++ b/openspec/changes/extract-table-cell-boxes/tasks.md
@@ -1,62 +1,88 @@
 # Tasks: Extract Table Cell Boxes

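-## Phase 1: Infrastructure
+## Key Finding (2025-11-28)

-### Task 1.1: Add configuration option
- [x] Add the `enable_table_cell_boxes_extraction` option in `config.py`
- [x] Confirm the existing table model configuration is usable
+**PPStructureV3 (PaddleX 3.3.9) does provide `table_res_list`!**
+
+The previous implementation assumed that an extra SLANeXt model call was needed, but in-depth testing showed:
+- `result.json['res']['table_res_list']` contains the `cell_box_list` for every table
+- No additional model call is required
+- The redundant SLANeXt code has been removed
+
+## Phase 1: Infrastructure (done)
+
+### Task 1.1: Configuration option
+- [x] ~~Add the `enable_table_cell_boxes_extraction` option~~ (removed; no longer needed)
+- [x] Confirm that PPStructureV3 provides `table_res_list`

 ### Task 1.2: Model caching mechanism
- [x] Add model cache attributes to `PPStructureEnhanced`
- [x] Implement lazy loading
- [x] Add a model release method (optional)
+- [x] ~~Implement SLANeXt model caching~~ (removed; no longer needed)
+- [x] Use PPStructureV3's built-in `table_res_list` directly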

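-## Phase 2: Cell Boxes Extraction
+## Phase 2: Cell Boxes Extraction (done)

-### Task 2.1: Modify the table processing logic
- [x] Add cell boxes extraction in `_process_parsing_res_list`
- [x] Implement image cropping
- [x] Call the SLANeXt model to get results
+### Task 2.1: Extract from table_res_list
+- [x] Get `cell_box_list` from `result.json['res']['table_res_list']`
+- [x] Match tables by HTML content
+- [x] Verify the coordinate format (already absolute coordinates)

-### Task 2.2: Coordinate conversion
- [x] Implement conversion from relative to global coordinates
- [x] Handle coordinate scaling from ScalingInfo
- [x] Verify that the coordinate conversion is correct
+### Task 2.2: Image-in-Table handling
+- [x] Get image boxes from `layout_det_res`
+- [x] Detect images inside tables
+- [x] Crop and save the images
+- [x] Embed them into the table HTML

-### Task 2.3: Error handling
- [x] Add try-catch wrapping
- [x] Implement fallback handling on failure
- [x] Add appropriate logging
+## Phase 3: PDF Generation Optimization (done)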

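-## Phase 3: PDF Generation Optimization
+### Task 3.1: ~~Infer the grid from Cell Boxes~~ (deprecated)
+- [x] ~~Modify `draw_table_region` to use cell_boxes~~
+- [x] ~~Compute row heights and column widths from actual cell positions~~
+- [x] Test rendering → **Problem found: the HTML structure does not match cell_boxes**

-### Task 3.1: Render tables using Cell Boxes
- [x] Modify `draw_table_region` to use cell_boxes
- [x] Compute row heights and column widths from actual cell positions
- [ ] Test rendering
+### Task 3.2: Option B - Layered Rendering ✓ done

-### Task 3.2: Fallback
- [x] When cell_boxes is unavailable, use the existing logic
+**Problem analysis (2025-11-30)**:
+- The HTML table structure does not match cell_boxes, so the grid cannot be inferred correctly
+- Attempts to draw text inside cells failed (text overflowed borders, text was matched to the wrong cells)
+
+**Solution**: layered rendering, which separates table border drawing from text drawing
+- Layer 1: draw table borders from cell_boxes
+- Layer 2: draw text at raw OCR positions (independent of the table structure)
+- Layer 3: draw embedded_images
+
+**Implementation steps (2025-11-30)**:
+- [x] Modify `GapFillingService._is_region_covered()` to skip coverage checks for TABLE elements
+- [x] Simplify `_draw_table_with_cell_boxes()` to draw only borders + images
+- [x] Modify `regions_to_avoid` to exclude tables so that text passes through table regions
+- [x] Integration test: test_layered_rendering.py
+
+### Task 3.3: Fallback
+- [x] When cell_boxes is unavailable, use ReportLab Table
 - [x] Ensure backward compatibility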

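-## Phase 4: Testing and Verification
+## Phase 4: Testing and Verification (done)

 ### Task 4.1: Unit tests
- [ ] Test cell boxes extraction
- [ ] Test coordinate conversion
- [ ] Test error handling
+- [x] Test cell_box_list extraction (29 cells extracted)
+- [x] Test Image-in-Table handling (1 image embedded)
+- [x] Test error handling

 ### Task 4.2: Integration tests
- [ ] Test the OCR Track with real PDFs
- [ ] Verify PDF layout restoration
- [ ] Performance testing
+- [x] Test the OCR Track with a real PDF (test_layered_rendering.py)
+- [x] Verify PDF layout restoration
+- [x] Layered rendering test results:
+  - 50 text elements (supplemented from raw OCR; originally only 5)
+  - 31 cell_boxes (8 + 23)
+  - 1 embedded_image
+  - PDF generated successfully (57,290 bytes)

-## Phase 5: Cleanup
+## Phase 5: Cleanup (done)

 ### Task 5.1: Remove old code
- [ ] Evaluate and remove Paragraph wrapper code that is no longer needed
- [ ] Clean up debug logging
- [ ] Update documentation
+- [x] Remove the SLANeXt model caching code
+- [x] Remove `_get_slanet_model()`, `_get_table_classifier()`, `_extract_cell_boxes_with_slanet()`, `release_slanet_models()`
+- [x] Remove the `enable_table_cell_boxes_extraction` option
+- [x] Clean up debug logging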

 ---

@@ -66,32 +92,182 @@

 | File | Changes |
 |------|---------|
-| `backend/app/core/config.py` | Add configuration option |
-| `backend/app/services/pp_structure_enhanced.py` | Main implementation |
-| `backend/app/services/pdf_generator_service.py` | Use cell_boxes |
+| `backend/app/core/config.py` | Remove `enable_table_cell_boxes_extraction` |
+| `backend/app/services/pp_structure_enhanced.py` | Use `table_res_list`; add `_embed_images_in_table()` |
+| `backend/app/services/pdf_generator_service.py` | Layered rendering: draw borders only; stop filtering out text in table regions |
+| `backend/app/services/gap_filling_service.py` | `_is_region_covered()` skips TABLE elements |
+| `backend/tests/test_layered_rendering.py` | Layered rendering integration test |

-### Dependencies
+### PPStructureV3 data structure

 ```python
-from paddlex import create_model
+result.json = {
+    'res': {
+        'parsing_res_list': [...],      # parsing results
+        'layout_det_res': {...},        # layout detection results
+        'table_res_list': [             # table recognition results
+            {
+                'cell_box_list': [[x1,y1,x2,y2], ...],  # ← the key field!
+                'pred_html': '<html>...',
+                'table_ocr_pred': {...}
+            }
+        ],
+        'overall_ocr_res': {...}
+    }
+}
 ```
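
As a minimal sketch (not part of this commit), the structure above could be walked as follows to collect the `cell_box_list` of every table. The helper name `collect_cell_boxes` and the sample payload are hypothetical; only the key names are taken from the structure shown above.

```python
from typing import Any, Dict, List


def collect_cell_boxes(result_json: Dict[str, Any]) -> List[List[List[float]]]:
    """Return one list of [x1, y1, x2, y2] boxes per table.

    Assumes the dictionary layout documented above; missing keys
    yield empty lists instead of raising.
    """
    res = result_json.get("res") or {}
    tables = res.get("table_res_list") or []
    return [list(table.get("cell_box_list") or []) for table in tables]


# Hypothetical payload with two tables, the second lacking cell boxes.
sample = {
    "res": {
        "table_res_list": [
            {"cell_box_list": [[84, 269, 300, 320]], "pred_html": "<html>...</html>"},
            {"pred_html": "<html>...</html>"},
        ]
    }
}
boxes_per_table = collect_cell_boxes(sample)
```

Returning an empty list for a table without boxes keeps the table indices aligned with `table_res_list`, which matters if tables are later matched by position or by HTML content.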

-### Test data
+### Test results

- Task ID: `79a3d256-88f6-41d4-a7e9-3e358c85db40`
- Table bbox: `[84, 269, 1174, 1508]`
- Expected cell count: 29 (SLANeXt_wired)
+- Task ID: `442f9345-09ba-4a7d-949f-3bc88c2fa895`
+- cell_boxes: 29 cells (source: table_res_list)
+- embedded_images: 1 (img_in_table_935_838_1118_1031)

-### Implementation summary
+### Local vs cloud differences

-**Completed (715805b):**
-1. `config.py`: add the `enable_table_cell_boxes_extraction` option
-2. `pp_structure_enhanced.py`:
-   - Add `_slanet_wired_model`, `_slanet_wireless_model`, `_table_cls_model` cache attributes
-   - Implement lazy loading via `_get_slanet_model()` and `_get_table_classifier()`
-   - Implement `_extract_cell_boxes_with_slanet()` to extract cell boxes from cropped images
-   - Implement `release_slanet_models()` to free GPU memory
-   - Modify table processing to call SLANeXt when PPStructureV3 returns no boxes
-3. `pdf_generator_service.py`:
-   - Add `_compute_table_grid_from_cell_boxes()` to compute column widths and row heights
-   - Modify `draw_table_region()` to prefer cell_boxes when computing column widths
+| Feature | Local PaddleX 3.3.9 | Cloud pp_demo |
+|------|-------------------|--------------|
+| `table_res_list` | ✓ provided | ✓ provided |
+| `cell_box_list` | ✓ 29 cells | ✓ 27+8 cells |
+| Layout detection | 1 merged table | 2 separate tables |
+| Image-in-Table | Must be handled manually | Embedded in HTML automatically |
+
+### Open issues
+
+1. **Layout detection merges tables**: the local Layout model merges multiple tables into one large table
+   - As a result, `table_res_list` contains only 1 table
+   - The cloud version detects 2 separate tables
+   - May require tuning the Layout model parameters or the post-processing logic
+---
+
+## Layered Rendering Technical Design (2025-11-30)
+
+### Root cause
+
+ReportLab Table requires a regular rectangular grid, but PPStructureV3's cell_boxes reflect the actual visual positions and do not match the logical HTML structure. Drawing text inside cells leads to:
+- Text overflowing cell borders
+- Text matched to the wrong cells
+- Some text being lost
+
+### Solution: layered rendering
+
+Table rendering is decoupled into three independent layers:
+
+```
+┌─────────────────────────────────────────┐
+│  Layer 3: Embedded Images               │
+│  (from metadata['embedded_images'])     │
+├─────────────────────────────────────────┤
+│  Layer 2: Text at Raw OCR Positions     │
+│  (raw OCR added by GapFillingService)   │
+├─────────────────────────────────────────┤
+│  Layer 1: Table Cell Borders            │
+│  (drawn from metadata['cell_boxes'])    │
+└─────────────────────────────────────────┘
+```
+
+### Implementation details
+
+**1. GapFillingService change** (`_is_region_covered`):
+```python
+# Skip coverage checks for TABLE elements so that text inside tables passes through
+if skip_table_coverage and element.type == ElementType.TABLE:
+    continue
+```
+
+**2. PDF Generator change** (`regions_to_avoid`):
+```python
+# Exclude tables; only avoid overlapping with images
+regions_to_avoid = [img for img in images_metadata if img.get('type') != 'table']
+```
+
+**3. Simplified `_draw_table_with_cell_boxes`**:
+```python
+def _draw_table_with_cell_boxes(...):
+    """Draw only borders and images; text is not handled here."""
+    # 1. Draw the border of each cell
+    for box in cell_boxes:
+        pdf_canvas.rect(x, y, width, height, stroke=1, fill=0)
+
+    # 2. Draw embedded_images
+    for img in embedded_images:
+        self._draw_embedded_image(...)
+```
+
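+### Advantages
+
+1. **Decoupled**: border rendering and text rendering are fully independent
+2. **Precise**: text positions come directly from the OCR results, with no inference
+3. **Robust**: unaffected by mismatches between cell_boxes and the HTML
+4. **Compatible**: the look of overall_ocr_res.png in the visualization output can be reproduced directly
+
+### Test results
+
+- Task ID: `84899366-f361-44f1-b989-5aba72419ca5`
+- cell_boxes: 31 (8 + 23)
+- Original text elements: 5
+- Text elements after supplementing: 50 (supplemented from raw OCR)
+- PDF size: 57,290 bytes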
+
+---
+
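+## Hybrid Rendering Optimization (2025-11-30)
+
+### Problems found
+
+Issues remained after layered rendering:
+1. Skewed tables: cell_boxes coordinates deviate by 2-11 pixels
+2. Styles for Title and similar elements were not applied: the OCR track does not apply styles
+
+### Solution: hybrid rendering + grid alignment
+
+**1. Cell boxes grid alignment** (`_normalize_cell_boxes_to_grid`):
+```python
+def _normalize_cell_boxes_to_grid(self, cell_boxes, threshold=10.0):
+    """
+    Merge nearby coordinates into a single value to remove the 2-11 pixel deviation.
+    - Collect all X/Y coordinates
+    - Cluster coordinates that lie within threshold of each other
+    - Use the mean as the aligned coordinate
+    """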
+```
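
The docstring above only lists the steps, so here is a minimal standalone sketch of one way they could be realized; it is an illustration, not the code from this commit. The function names, the chained clustering rule (a new cluster starts when the gap to the previous sorted value exceeds `threshold`), and the use of an unweighted mean over distinct values are all assumptions.

```python
from typing import Dict, Iterable, List


def _cluster_means(values: Iterable[float], threshold: float) -> Dict[float, float]:
    """Map each distinct value to the mean of its cluster (assumed rule: chained gaps)."""
    mapping: Dict[float, float] = {}
    cluster: List[float] = []
    for value in sorted(set(values)):
        # Close the current cluster when the gap to the previous value is too large.
        if cluster and value - cluster[-1] > threshold:
            mean = sum(cluster) / len(cluster)
            mapping.update({member: mean for member in cluster})
            cluster = []
        cluster.append(value)
    if cluster:
        mean = sum(cluster) / len(cluster)
        mapping.update({member: mean for member in cluster})
    return mapping


def normalize_cell_boxes_to_grid(
    cell_boxes: List[List[float]], threshold: float = 10.0
) -> List[List[float]]:
    """Snap [x1, y1, x2, y2] boxes to shared grid lines."""
    xs = [coord for box in cell_boxes for coord in (box[0], box[2])]
    ys = [coord for box in cell_boxes for coord in (box[1], box[3])]
    x_map = _cluster_means(xs, threshold)
    y_map = _cluster_means(ys, threshold)
    return [
        [x_map[box[0]], y_map[box[1]], x_map[box[2]], y_map[box[3]]]
        for box in cell_boxes
    ]


# Two stacked cells whose edges disagree by 2-3 pixels.
aligned = normalize_cell_boxes_to_grid([[10, 10, 100, 50], [12, 52, 103, 90]])
```

With chained gaps, a long run of closely spaced coordinates can merge into one cluster wider than `threshold`; comparing against the cluster's first value instead would bound the cluster width, at the cost of splitting such runs.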
+
+**2. Element type styles** (OCR track):
+```python
+# Add an element type check in draw_text_region
+element_type = region.get('element_type', 'text')
+
+if element_type == 'title':
+    font_size = min(font_size * 1.3, 36)  # enlarge by 30%
+elif element_type == 'header':
+    font_size = min(font_size * 1.15, 24)  # enlarge by 15%
+elif element_type == 'caption':
+    font_size = max(font_size * 0.9, 6)  # shrink by 10%
+```
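
To make the scaling rules above easier to check in isolation, they can be wrapped in a small pure function. This is a sketch under the assumption that the factors and caps are exactly those shown in the snippet; the name `styled_font_size` is hypothetical and does not appear in the commit.

```python
def styled_font_size(font_size: float, element_type: str = "text") -> float:
    """Apply the per-element-type scaling shown above to a base font size."""
    if element_type == "title":
        return min(font_size * 1.3, 36)   # 30% larger, capped at 36
    if element_type == "header":
        return min(font_size * 1.15, 24)  # 15% larger, capped at 24
    if element_type == "caption":
        return max(font_size * 0.9, 6)    # 10% smaller, floor of 6
    return font_size                      # other element types are unchanged


title_size = styled_font_size(30, "title")      # 30 * 1.3 = 39, so the cap applies
caption_size = styled_font_size(5, "caption")   # 5 * 0.9 = 4.5, so the floor applies
body_size = styled_font_size(12)
```

One consequence of using `min` for the cap is that a title whose base size already exceeds 36 would be reduced to 36 rather than enlarged.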
+
+**3. Passing the element type**:
+```python
+# Added in convert_unified_document_to_ocr_data
+text_region = {
+    'text': text_content,
+    'bbox': bbox_polygon,
+    'element_type': element.type.value  # new
+}
+```
+
+### Results after the improvements
+
+| Item | Before | After |
+|------|--------|--------|
+| Table borders | Skewed (2-11px deviation) | Grid-aligned |
+| Title style | None (same as body text) | Enlarged font, 36pt |
+| Hybrid rendering | raw OCR only | PP-Structure + raw OCR |
+
+### Test results (2025-11-30)
+
+- Task ID: `3a3f350f-2d81-4af4-8a18-021ea09ac433`
+- Table 1: 8 cell_boxes → grid-aligned
+- Table 2: 23 cell_boxes → grid-aligned + 1 embedded image
+- Title: Applied title style: size=36.0
+- PDF size: 104,082 bytes