egg 6806fff1d5 chore: archive extract-table-cell-boxes proposal

Archived the extract-table-cell-boxes proposal which implemented:
- Table cell boxes extraction from PP-StructureV3 table_res_list
- Layered rendering for tables with cell borders
- CV-based table line detection (disabled)
- Scan artifact removal preprocessing
- PDF orientation detection for rotated documents

🤖 Generated with [Claude Code](https://claude.com/claude-code)

Co-Authored-By: Claude <noreply@anthropic.com>

2025-11-30 14:22:29 +08:00


Tasks: Extract Table Cell Boxes

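Key Finding (2025-11-28)

PPStructureV3 (PaddleX 3.3.9) does provide table_res_list!

The earlier implementation assumed that a separate call to the SLANeXt model was required, but thorough testing showed that:

result.json['res']['table_res_list'] contains the cell_box_list for every table
No additional model call is needed
The redundant SLANeXt code has been removed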

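Phase 1: Infrastructure (Completed)

Task 1.1: Configuration

~~Add the enable_table_cell_boxes_extraction setting~~ (removed, no longer needed)
Confirm that PPStructureV3 provides table_res_list

Task 1.2: Model Caching

~~Implement SLANeXt model caching~~ (removed, no longer needed)
Use PPStructureV3's built-in table_res_list directly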

Phase 2: Cell Boxes Extraction (Completed)

Task 2.1: Extract from table_res_list

Read cell_box_list from result.json['res']['table_res_list']
Match tables by their HTML content
Verify the coordinate format (already absolute coordinates)
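
The extraction and matching steps of Task 2.1 can be sketched roughly as follows. This is a minimal illustration, not the project's code: the function names `extract_cell_boxes` and `match_table_by_html` are hypothetical, and matching on whitespace-stripped `pred_html` is an assumption about how "match by HTML content" might work.

```python
from typing import Any, Optional


def extract_cell_boxes(result_json: dict[str, Any]) -> list[dict[str, Any]]:
    """Collect pred_html and cell_box_list for every entry in table_res_list."""
    tables = result_json.get("res", {}).get("table_res_list") or []
    extracted = []
    for table in tables:
        # Keep only well-formed [x1, y1, x2, y2] boxes and coerce to float.
        boxes = [
            [float(v) for v in box]
            for box in table.get("cell_box_list", [])
            if len(box) == 4
        ]
        extracted.append({"html": table.get("pred_html", ""), "cell_boxes": boxes})
    return extracted


def _normalize_html(html: str) -> str:
    """Strip all whitespace so formatting differences do not block a match."""
    return "".join(html.split())


def match_table_by_html(
    element_html: str, tables: list[dict[str, Any]]
) -> Optional[dict[str, Any]]:
    """Return the first extracted table whose HTML equals element_html (assumed rule)."""
    target = _normalize_html(element_html)
    if not target:
        return None
    for table in tables:
        if _normalize_html(table["html"]) == target:
            return table
    return None


# Hypothetical sample shaped like the data structure documented in this file.
sample = {
    "res": {
        "table_res_list": [
            {
                "cell_box_list": [[10, 20, 110, 60], [110, 20, 210, 60]],
                "pred_html": "<table><tr><td>A</td><td>B</td></tr></table>",
            }
        ]
    }
}
tables = extract_cell_boxes(sample)
match = match_table_by_html("<table> <tr><td>A</td><td>B</td></tr> </table>", tables)
```

When no table matches, the caller receives `None` and can fall back to a rendering path that does not need cell boxes.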

Task 2.2: Image-in-Table Handling

Get image boxes from layout_det_res
Detect images that lie inside a table
Crop and save the images
Embed them in the table HTML
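
One possible way to detect images inside a table is a bounding-box containment test, sketched below. The function names and the 0.9 containment ratio are assumptions for illustration. The id format mirrors the `img_in_table_935_838_1118_1031` value that appears in the test results, but how the project actually builds that id is not shown in this document. Cropping and HTML embedding are left out.

```python
Box = list[float]  # [x1, y1, x2, y2] in absolute image coordinates


def overlap_ratio(inner: Box, outer: Box) -> float:
    """Fraction of the area of `inner` that lies inside `outer`."""
    ix1, iy1 = max(inner[0], outer[0]), max(inner[1], outer[1])
    ix2, iy2 = min(inner[2], outer[2]), min(inner[3], outer[3])
    intersection = max(0.0, ix2 - ix1) * max(0.0, iy2 - iy1)
    area = (inner[2] - inner[0]) * (inner[3] - inner[1])
    return intersection / area if area > 0 else 0.0


def find_images_in_table(
    table_bbox: Box, image_boxes: list[Box], min_ratio: float = 0.9
) -> list[dict]:
    """Return the image boxes that are (almost) fully contained in the table."""
    found = []
    for box in image_boxes:
        if overlap_ratio(box, table_bbox) >= min_ratio:
            x1, y1, x2, y2 = (int(round(v)) for v in box)
            found.append(
                {"id": f"img_in_table_{x1}_{y1}_{x2}_{y2}", "bbox": [x1, y1, x2, y2]}
            )
    return found


# Hypothetical coordinates: one image inside the table, one outside it.
table = [900.0, 800.0, 1200.0, 1100.0]
images = [[935.0, 838.0, 1118.0, 1031.0], [0.0, 0.0, 50.0, 50.0]]
embedded = find_images_in_table(table, images)
```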

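Phase 3: PDF Generation Optimization (Completed)

Task 3.1: Infer the Grid from Cell Boxes (Deprecated)

~~Modify draw_table_region to use cell_boxes~~
~~Compute row heights and column widths from the actual cell positions~~
Test the rendering result → problem found: the HTML structure does not match cell_boxes

Task 3.2: Option B - Layered Rendering ✓ Completed

Problem analysis (2025-11-30):

The HTML table structure does not match cell_boxes, so the grid cannot be inferred correctly
Attempts to draw text inside cells failed (text overflowed the borders, cells were mismatched)

Solution: layered rendering, which separates drawing the table borders from drawing the text

Layer 1: draw the table borders from cell_boxes
Layer 2: draw the text at the raw OCR positions (independent of the table structure)
Layer 3: draw embedded_images

Implementation steps (2025-11-30):

Modify GapFillingService._is_region_covered() to skip the coverage check for TABLE elements
Simplify _draw_table_with_cell_boxes() to draw only borders and images
Modify regions_to_avoid to exclude tables so that text passes through table regions
Integration test: test_layered_rendering.py

Task 3.3: Fallback

When cell_boxes is unavailable, use a ReportLab Table
Ensure backward compatibility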

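Phase 4: Testing and Verification (Completed)

Task 4.1: Unit Tests

Test cell_box_list extraction (29 cells extracted successfully)
Test Image-in-Table handling (1 image embedded)
Test error handling

Task 4.2: Integration Tests

Test the OCR Track with a real PDF (test_layered_rendering.py)
Verify how well the PDF layout is restored
Layered rendering test results:
- 50 text elements (supplemented from raw OCR; originally only 5)
- 31 cell_boxes (8 + 23)
- 1 embedded_image
- PDF generated successfully (57,290 bytes)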

Phase 5: Cleanup (Completed)

Task 5.1: Remove Old Code

Remove the SLANeXt model caching code
Remove _get_slanet_model(), _get_table_classifier(), _extract_cell_boxes_with_slanet(), release_slanet_models()
Remove the enable_table_cell_boxes_extraction setting
Clean up debug logging

Technical Details

Key Code Locations

File	Change
`backend/app/core/config.py`	Removed `enable_table_cell_boxes_extraction`
`backend/app/services/pp_structure_enhanced.py`	Uses `table_res_list`; added `_embed_images_in_table()`
`backend/app/services/pdf_generator_service.py`	Layered rendering: draws borders only; table regions excluded from text filtering
`backend/app/services/gap_filling_service.py`	`_is_region_covered()` skips TABLE elements
`backend/tests/test_layered_rendering.py`	Layered rendering integration test

PPStructureV3 Data Structure

result.json = {
    'res': {
        'parsing_res_list': [...],      # parsing results
        'layout_det_res': {...},        # layout detection results
        'table_res_list': [             # table recognition results
            {
                'cell_box_list': [[x1,y1,x2,y2], ...],  # ← the key field!
                'pred_html': '<html>...',
                'table_ocr_pred': {...}
            }
        ],
        'overall_ocr_res': {...}
    }
}

Test Results

Task ID: 442f9345-09ba-4a7d-949f-3bc88c2fa895
cell_boxes: 29 cells (source: table_res_list)
embedded_images: 1 (img_in_table_935_838_1118_1031)

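Local vs. Cloud Differences

Feature	Local PaddleX 3.3.9	Cloud pp_demo
`table_res_list`	✓ Provided	✓ Provided
`cell_box_list`	✓ 29 cells	✓ 27+8 cells
Layout detection	1 merged table	2 separate tables
Image-in-Table	Must be handled manually	Embedded in HTML automatically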

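Outstanding Issues

Layout detection merges tables: the local layout model merges multiple tables into one large table
- As a result, table_res_list contains only 1 table
- The cloud version detects 2 separate tables
- The layout model parameters or the post-processing logic may need adjusting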

Layered Rendering Technical Design (2025-11-30)

Root Cause

A ReportLab Table requires a regular rectangular grid, but the cell_boxes from PPStructureV3 reflect actual visual positions and do not match the logical HTML structure. Attempting to draw text inside cells leads to:

Text overflowing the borders
Mismatched cells
Some text going missing

Solution: Layered Rendering

Table rendering is decoupled into three independent layers:

┌─────────────────────────────────────────┐
│  Layer 3: Embedded Images               │
│  (from metadata['embedded_images'])     │
├─────────────────────────────────────────┤
│  Layer 2: Text at Raw OCR Positions     │
│  (raw OCR added by GapFillingService)   │
├─────────────────────────────────────────┤
│  Layer 1: Table Cell Borders            │
│  (drawn from metadata['cell_boxes'])    │
└─────────────────────────────────────────┘

Implementation Details

1. GapFillingService change (_is_region_covered):

# Skip the coverage check for TABLE elements so that text inside tables passes through
if skip_table_coverage and element.type == ElementType.TABLE:
    continue
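
To show where such a `continue` might sit, here is a self-contained sketch of a coverage check. Everything in it is a stand-in: `ElementType`, `Element`, the free function `is_region_covered`, and the 0.5 coverage threshold are hypothetical and are not taken from `gap_filling_service.py`.

```python
from dataclasses import dataclass
from enum import Enum

BBox = tuple[float, float, float, float]  # (x1, y1, x2, y2)


class ElementType(Enum):  # hypothetical stand-in for the project's enum
    TEXT = "text"
    TABLE = "table"
    IMAGE = "image"


@dataclass
class Element:  # hypothetical stand-in for a layout element
    type: ElementType
    bbox: BBox


def _covered_fraction(region: BBox, cover: BBox) -> float:
    """Fraction of the area of `region` that lies inside `cover`."""
    w = min(region[2], cover[2]) - max(region[0], cover[0])
    h = min(region[3], cover[3]) - max(region[1], cover[1])
    area = (region[2] - region[0]) * (region[3] - region[1])
    if w <= 0 or h <= 0 or area <= 0:
        return 0.0
    return (w * h) / area


def is_region_covered(
    region: BBox,
    elements: list[Element],
    skip_table_coverage: bool = True,
    threshold: float = 0.5,  # assumed value
) -> bool:
    """Report whether a raw OCR region is already covered by a layout element."""
    for element in elements:
        # Tables never count as coverage, so OCR text inside them is kept.
        if skip_table_coverage and element.type == ElementType.TABLE:
            continue
        if _covered_fraction(region, element.bbox) >= threshold:
            return True
    return False


table = Element(ElementType.TABLE, (0.0, 0.0, 100.0, 100.0))
ocr_region = (10.0, 10.0, 20.0, 20.0)
kept = not is_region_covered(ocr_region, [table])
```

With `skip_table_coverage=False` the same region would be reported as covered and the gap-filling step would drop it, which matches the missing text the layered approach is meant to avoid.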

2. PDF Generator change (regions_to_avoid):

# Exclude tables; only avoid overlapping images
regions_to_avoid = [img for img in images_metadata if img.get('type') != 'table']

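3. Simplified _draw_table_with_cell_boxes:

def _draw_table_with_cell_boxes(...):
    """Draw only borders and images; text is not handled here."""
    # 1. Draw the border of each cell
    #    (x, y, width, height are derived from box; the conversion is elided here)
    for box in cell_boxes:
        pdf_canvas.rect(x, y, width, height, stroke=1, fill=0)

    # 2. Draw the embedded_images
    for img in embedded_images:
        self._draw_embedded_image(...)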
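
The snippet above leaves out how `x, y, width, height` are obtained from a cell box. A ReportLab canvas places its origin at the bottom-left of the page, whereas OCR boxes normally use a top-left origin, so a vertical flip is needed. The helper below is a sketch under that assumption. Its name and the `scale_x`/`scale_y` parameters (image pixels to PDF points) are hypothetical.

```python
def cell_box_to_pdf_rect(
    box: list[float],
    page_height: float,
    scale_x: float = 1.0,
    scale_y: float = 1.0,
) -> tuple[float, float, float, float]:
    """Convert [x1, y1, x2, y2] (top-left origin) to (x, y, width, height)
    with a bottom-left origin, as expected by canvas.rect()."""
    x1, y1, x2, y2 = box
    x = x1 * scale_x
    width = (x2 - x1) * scale_x
    height = (y2 - y1) * scale_y
    # The rectangle's lower edge is the box's bottom edge (y2) after flipping.
    y = page_height - y2 * scale_y
    return x, y, width, height


rect = cell_box_to_pdf_rect([100.0, 50.0, 300.0, 150.0], page_height=800.0)
half = cell_box_to_pdf_rect([100.0, 50.0, 300.0, 150.0], 800.0, 0.5, 0.5)
```

Each result could then be passed on as `pdf_canvas.rect(*rect, stroke=1, fill=0)`, which is the call already used in the snippet above.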

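Advantages

Decoupled: border rendering and text rendering are fully independent
Precise: text positions come directly from the OCR results, with no inference needed
Robust: unaffected by mismatches between cell_boxes and the HTML
Compatible: the look of overall_ocr_res.png in the visualization output can be reproduced directly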

Test Results

Task ID: 84899366-f361-44f1-b989-5aba72419ca5
cell_boxes: 31 (8 + 23)
Original text elements: 5
Text elements after supplementing: 50 (supplemented from raw OCR)
PDF size: 57,290 bytes

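Hybrid Rendering Optimization (2025-11-30)

Problems Found

Problems remained after layered rendering:

Skewed tables: cell_boxes coordinates deviate by 2-11 pixels
Styles not applied to Title and similar elements: the OCR track does not apply styles

Solution: Hybrid Rendering + Grid Alignment

1. Cell boxes grid alignment (_normalize_cell_boxes_to_grid):

def _normalize_cell_boxes_to_grid(self, cell_boxes, threshold=10.0):
    """
    Merge neighbouring coordinates into a single value to remove the 2-11 pixel deviation.
    - Collect all X/Y coordinates
    - Cluster coordinates that are close together (within threshold)
    - Use the cluster mean as the aligned coordinate
    """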
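
Only the docstring of this method is shown, so the following is one possible reading of the three steps, not the project's implementation. The clustering rule is an assumption: a sorted value joins the current cluster while it stays within `threshold` of the cluster's smallest member, and each cluster is replaced by the mean of its distinct values.

```python
def _cluster_means(values: list[float], threshold: float) -> dict[float, float]:
    """Map every distinct value to the mean of the cluster it falls into."""
    mapping: dict[float, float] = {}
    cluster: list[float] = []

    def flush() -> None:
        if cluster:
            mean = sum(cluster) / len(cluster)
            mapping.update({value: mean for value in cluster})
            cluster.clear()

    for value in sorted(set(values)):
        # Start a new cluster once the value drifts past the threshold.
        if cluster and value - cluster[0] > threshold:
            flush()
        cluster.append(value)
    flush()
    return mapping


def normalize_cell_boxes_to_grid(
    cell_boxes: list[list[float]], threshold: float = 10.0
) -> list[list[float]]:
    """Snap box edges that are nearly aligned onto shared grid lines."""
    xs = [v for box in cell_boxes for v in (box[0], box[2])]
    ys = [v for box in cell_boxes for v in (box[1], box[3])]
    x_map = _cluster_means(xs, threshold)
    y_map = _cluster_means(ys, threshold)
    return [
        [x_map[b[0]], y_map[b[1]], x_map[b[2]], y_map[b[3]]] for b in cell_boxes
    ]


# Two stacked cells whose shared edges are off by a few pixels.
aligned = normalize_cell_boxes_to_grid(
    [[100, 50, 200, 100], [104, 102, 202, 150]]
)
# aligned == [[102.0, 50.0, 201.0, 101.0], [102.0, 101.0, 201.0, 150.0]]
```

Measuring against the cluster's first member rather than the previous value prevents a long chain of small steps from merging lines that are far apart. With the default threshold of 10.0, a deviation at the upper end of the reported 2-11 pixel range would not be merged under this rule.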

2. Element type styles (OCR track):

# Check the element type in draw_text_region
element_type = region.get('element_type', 'text')

if element_type == 'title':
    font_size = min(font_size * 1.3, 36)  # 30% larger
elif element_type == 'header':
    font_size = min(font_size * 1.15, 24)  # 15% larger
elif element_type == 'caption':
    font_size = max(font_size * 0.9, 6)  # 10% smaller
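
The scaling and clamping in the branches above can be checked with a small table-driven helper. This is only a restatement for illustration: `STYLE_RULES` and `styled_font_size` are hypothetical names, and the numbers are copied from the snippet.

```python
# element type -> (scale factor, limit, clamp function); values from the snippet
STYLE_RULES = {
    "title": (1.3, 36.0, min),     # enlarge, but never above 36
    "header": (1.15, 24.0, min),   # enlarge, but never above 24
    "caption": (0.9, 6.0, max),    # shrink, but never below 6
}


def styled_font_size(font_size: float, element_type: str = "text") -> float:
    """Return the font size after applying the per-type scale and clamp."""
    rule = STYLE_RULES.get(element_type)
    if rule is None:
        return font_size  # plain text and unknown types are left unchanged
    scale, limit, clamp = rule
    return clamp(font_size * scale, limit)


capped_title = styled_font_size(30.0, "title")     # 39.0 before the cap
floored_caption = styled_font_size(6.0, "caption")  # 5.4 before the floor
```

Any title of about 27.7pt or more reaches the 36pt cap, and a 6pt caption stays at 6pt because the floor overrides the 10% reduction.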

3. Passing the element type:

# Added in convert_unified_document_to_ocr_data
text_region = {
    'text': text_content,
    'bbox': bbox_polygon,
    'element_type': element.type.value  # new
}

Results After the Improvements

Item	Before	After
Table borders	Skewed (2-11px deviation)	Grid-aligned
Title style	None (same as body text)	Enlarged 36pt font
Hybrid rendering	Raw OCR only	PP-Structure + raw OCR

Test Results (2025-11-30)

Task ID: 3a3f350f-2d81-4af4-8a18-021ea09ac433
Table 1: 8 cell_boxes → grid-aligned
Table 2: 23 cell_boxes → grid-aligned + 1 embedded image
Title: Applied title style: size=36.0
PDF size: 104,082 bytes
