egg 59206a6ab8 feat: simplify layout model selection and archive proposals

Changes:
- Replace PP-Structure 7-slider parameter UI with simple 3-option layout model selector
- Add layout model mapping: chinese (PP-DocLayout-S), default (PubLayNet), cdla
- Add LayoutModelSelector component and zh-TW translations
- Fix "default" model behavior with sentinel value for PubLayNet
- Add gap filling service for OCR track coverage improvement
- Add PP-Structure debug utilities
- Archive completed/incomplete proposals:
  - add-ocr-track-gap-filling (complete)
  - fix-ocr-track-table-rendering (incomplete)
  - simplify-ppstructure-model-selection (22/25 tasks)
- Add new layout model tests, archive old PP-Structure param tests
- Update OpenSpec ocr-processing spec with layout model requirements

🤖 Generated with [Claude Code](https://claude.com/claude-code)

Co-Authored-By: Claude <noreply@anthropic.com>

2025-11-27 13:27:00 +08:00

Design: OCR Track Gap Filling

Context

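The PP-StructureV3 layout analysis model misses a large share of the content on some scanned documents. In one measured case, Raw PaddleOCR detected 56 text regions while PP-StructureV3 output only 9 elements (84% missing).

The problem lies in the Layout Detection Model inside PP-StructureV3. It is a limitation of the PaddleOCR library and cannot be fixed from outside. The Raw OCR text_regions data, however, remains complete and usable.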

Stakeholders

End users: need complete OCR output without large amounts of missing text
OCR track: needs to merge Raw OCR and PP-StructureV3 results
Direct/Hybrid track: must not be affected by this change

Goals / Non-Goals

Goals

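Detect regions missed by PP-StructureV3 and fill them back in from Raw OCR results
Ensure filled-in text does not duplicate existing elements
Preserve the correct reading order
Affect only the OCR track; leave the behavior of the other tracks unchanged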

Non-Goals

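Do not modify PP-StructureV3 or PaddleOCR internals
Do not fill gaps for non-text elements such as images, tables, or charts
Do not implement complex layout analysis (gap filling only)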

Decisions

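Decision 1: Coverage Test Strategy

Choice: Use a center-point-inside test first, supplemented by an IoU threshold

Rationale:

The center-point test is simple to compute and fast
The IoU threshold supplements it for boundary cases
A low IoU threshold of 0.1 to 0.2 is recommended, so that regions with only modest overlap still count as covered instead of being misclassified as uncovered

Alternatives:

Pure IoU test: more computation, and partial overlaps are more complex to handle
Area-ratio test: treats regions of different sizes unevenly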
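
The two coverage criteria can be sketched as follows. This is a minimal illustration, not the project's implementation: it assumes bboxes are (x0, y0, x1, y1) tuples, the helper signatures are guesses at the helpers named in the algorithm section, and is_covered is a hypothetical wrapper.

```python
from typing import Tuple

# Assumption: a bbox is (x0, y0, x1, y1) with x0 <= x1 and y0 <= y1.
BBox = Tuple[float, float, float, float]


def get_center(bbox: BBox) -> Tuple[float, float]:
    """Return the center point of a bbox."""
    x0, y0, x1, y1 = bbox
    return ((x0 + x1) / 2.0, (y0 + y1) / 2.0)


def point_in_bbox(point: Tuple[float, float], bbox: BBox) -> bool:
    """True if the point lies inside the bbox (edges included)."""
    x, y = point
    x0, y0, x1, y1 = bbox
    return x0 <= x <= x1 and y0 <= y <= y1


def calculate_iou(a: BBox, b: BBox) -> float:
    """Intersection over union of two bboxes; 0.0 if they do not overlap."""
    inter_w = max(0.0, min(a[2], b[2]) - max(a[0], b[0]))
    inter_h = max(0.0, min(a[3], b[3]) - max(a[1], b[1]))
    inter = inter_w * inter_h
    area_a = (a[2] - a[0]) * (a[3] - a[1])
    area_b = (b[2] - b[0]) * (b[3] - b[1])
    union = area_a + area_b - inter
    return inter / union if union > 0 else 0.0


def is_covered(region: BBox, element: BBox, iou_threshold: float = 0.15) -> bool:
    """Hypothetical wrapper: center-point test first, IoU as the fallback."""
    if point_in_bbox(get_center(region), element):
        return True
    return calculate_iou(region, element) > iou_threshold


# A short text line inside a large paragraph block: the IoU is tiny,
# but the center-point test still reports it as covered.
line = (10, 10, 110, 30)
block = (0, 0, 500, 200)
print(calculate_iou(line, block), is_covered(line, block))
```

The example at the end shows why the center-point test comes first: a single OCR line inside a large layout block has a very small IoU with that block, so an IoU-only test would wrongly report it as uncovered.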

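Decision 2: Gap-Filling Trigger Condition

Choice: Trigger when PP-Structure coverage is < 70% or its element count is significantly lower than Raw OCR's

Rationale:

Avoids duplicated text on documents that are already processed correctly
The 70% threshold is an empirical value and can be adjusted via configuration
The element-count comparison serves as a quick check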
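
A possible shape for the trigger check is sketched below. It assumes coverage means the fraction of Raw OCR regions covered by PP-Structure elements, which the design does not define precisely. The function name and the min_element_ratio parameter are hypothetical; in particular, 0.5 is only a stand-in for "significantly lower".

```python
def should_fill_gaps(
    covered_count: int,
    raw_region_count: int,
    element_count: int,
    coverage_threshold: float = 0.7,
    min_element_ratio: float = 0.5,  # hypothetical stand-in for "significantly lower"
) -> bool:
    """Decide whether gap filling should run for one page."""
    if raw_region_count == 0:
        # Nothing to fill from.
        return False

    # Quick check: far fewer PP-Structure elements than Raw OCR regions.
    if element_count < raw_region_count * min_element_ratio:
        return True

    # Full check: share of Raw OCR regions that PP-Structure already covers.
    coverage = covered_count / raw_region_count
    return coverage < coverage_threshold


print(should_fill_gaps(covered_count=40, raw_region_count=50, element_count=30))
print(should_fill_gaps(covered_count=30, raw_region_count=50, element_count=30))
```

The first call has 80% coverage and a comparable element count, so it skips gap filling; the second has 60% coverage and triggers it.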

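Decision 3: Element Types to Fill

Choice: Fill only TEXT elements; skip TABLE/IMAGE/FIGURE/FLOWCHART/HEADER/FOOTER

Rationale:

PP-StructureV3 usually recognizes structured elements (tables, images) more accurately
Filling in raw OCR text could break table structure
These elements need to keep their structural integrity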
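
One way to express the type filter is shown below. The type names come from the decision above and SKIP_TYPES from the algorithm section, but the ElementType enum, its string values, and is_fillable are assumptions made for illustration.

```python
from enum import Enum


class ElementType(str, Enum):
    """Hypothetical element type enum; the string values are assumed."""
    TEXT = "text"
    TABLE = "table"
    IMAGE = "image"
    FIGURE = "figure"
    FLOWCHART = "flowchart"
    HEADER = "header"
    FOOTER = "footer"


# Types that gap filling never creates or touches.
SKIP_TYPES = frozenset({
    ElementType.TABLE,
    ElementType.IMAGE,
    ElementType.FIGURE,
    ElementType.FLOWCHART,
    ElementType.HEADER,
    ElementType.FOOTER,
})


def is_fillable(element_type: ElementType) -> bool:
    """True only for types that gap filling is allowed to handle."""
    return element_type not in SKIP_TYPES


print([t.name for t in ElementType if is_fillable(t)])
```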

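Decision 4: Duplicate Detection and Deduplication

Choice: A Raw OCR region with IoU > 0.5 against a PP-Structure TEXT element is treated as a duplicate and skipped

Rationale:

0.5 is a commonly used overlap threshold
Prevents the same text from appearing twice
Lightweight merging can be considered for fragmented Raw OCR boxes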
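
The deduplication rule could look roughly like this. It is a sketch under the assumption that bboxes are (x0, y0, x1, y1) tuples; drop_duplicates and iou are hypothetical names, and candidates are compared only against existing PP-Structure TEXT boxes, as the decision states.

```python
from typing import List, Tuple

BBox = Tuple[float, float, float, float]  # assumed (x0, y0, x1, y1)


def iou(a: BBox, b: BBox) -> float:
    """Intersection over union of two bboxes."""
    inter_w = max(0.0, min(a[2], b[2]) - max(a[0], b[0]))
    inter_h = max(0.0, min(a[3], b[3]) - max(a[1], b[1]))
    inter = inter_w * inter_h
    union = (a[2] - a[0]) * (a[3] - a[1]) + (b[2] - b[0]) * (b[3] - b[1]) - inter
    return inter / union if union > 0 else 0.0


def drop_duplicates(
    candidates: List[BBox],
    existing_text_bboxes: List[BBox],
    dedup_iou_threshold: float = 0.5,
) -> List[BBox]:
    """Keep candidates that do not heavily overlap any existing TEXT bbox."""
    return [
        c for c in candidates
        if not any(iou(c, e) > dedup_iou_threshold for e in existing_text_bboxes)
    ]


existing = [(0, 0, 100, 20)]
candidates = [(2, 1, 101, 21), (0, 40, 100, 60)]
print(drop_duplicates(candidates, existing))
```

In the example, the first candidate nearly coincides with the existing box and is dropped, while the second sits on a different line and is kept.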

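Decision 5: Coordinate Alignment

Choice: Use ocr_dimensions to convert bboxes

Rationale:

OCR may resize the image during processing
Ensures Raw OCR and PP-Structure coordinates are in the same space
Prevents coverage misjudgments caused by mismatched dimensions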
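
The bbox conversion amounts to a per-axis rescale. The sketch below assumes that ocr_dimensions provides the width and height of the image the OCR actually ran on; its real structure is not described here, so the sizes are passed as plain (width, height) tuples and scale_bbox is a hypothetical helper.

```python
from typing import Tuple

BBox = Tuple[float, float, float, float]  # assumed (x0, y0, x1, y1)
Size = Tuple[float, float]                # (width, height)


def scale_bbox(bbox: BBox, src_size: Size, dst_size: Size) -> BBox:
    """Map a bbox from the src image space into the dst image space."""
    src_w, src_h = src_size
    dst_w, dst_h = dst_size
    if src_w <= 0 or src_h <= 0:
        raise ValueError("source dimensions must be positive")
    sx = dst_w / src_w
    sy = dst_h / src_h
    x0, y0, x1, y1 = bbox
    return (x0 * sx, y0 * sy, x1 * sx, y1 * sy)


# OCR ran on a 1000x1400 resized image; the target space is 2000x2800.
print(scale_bbox((100, 50, 300, 150), (1000, 1400), (2000, 2800)))
```

Both result sets should be converted into one shared space before any center-point or IoU comparison.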

Data Flow

┌─────────────────┐     ┌──────────────────────┐
│  Raw OCR Result │     │ PP-StructureV3 Result│
│  (56 regions)   │     │    (9 elements)      │
└────────┬────────┘     └──────────┬───────────┘
         │                         │
         └────────────┬────────────┘
                      │
      ┌───────────────▼───────────────┐
      │ GapFillingService             │
      │ 1. Calculate coverage         │
      │ 2. Find uncovered regions     │
      │ 3. Filter by confidence       │
      │ 4. Deduplicate                │
      │ 5. Merge if needed            │
      └───────────────┬───────────────┘
                      │
      ┌───────────────▼───────────────┐
      │ OCRToUnifiedConverter         │
      │ - Combine elements            │
      │ - Recalculate reading order   │
      └───────────────┬───────────────┘
                      │
      ┌───────────────▼───────────────┐
      │ UnifiedDocument               │
      │ (complete content)            │
      └───────────────────────────────┘

Algorithm: Gap Detection

from typing import List

# TextRegion, Element, SKIP_TYPES, get_center, point_in_bbox, and
# calculate_iou are assumed to be defined elsewhere in the project.


def find_uncovered_regions(
    raw_ocr_regions: List[TextRegion],
    pp_structure_elements: List[Element],
    iou_threshold: float = 0.15,
) -> List[TextRegion]:
    """
    Find Raw OCR regions not covered by PP-Structure elements.

    A region counts as covered if either criterion holds:
    1. Its center point falls inside any PP-Structure bbox
    2. Its IoU with any PP-Structure bbox > iou_threshold
    """
    # Only TEXT-like elements take part in the coverage test;
    # TABLE/IMAGE/etc. (SKIP_TYPES) are ignored.
    text_elements = [e for e in pp_structure_elements
                     if e.type not in SKIP_TYPES]

    uncovered = []
    for region in raw_ocr_regions:
        center = get_center(region.bbox)
        # Cheap center-point test first, IoU as the fallback.
        is_covered = any(
            point_in_bbox(center, element.bbox)
            or calculate_iou(region.bbox, element.bbox) > iou_threshold
            for element in text_elements
        )
        if not is_covered:
            uncovered.append(region)

    return uncovered

Configuration Parameters

Parameter	Type	Default	Description
`gap_filling_enabled`	bool	True	Whether gap filling is enabled
`gap_filling_coverage_threshold`	float	0.7	Gap filling runs when coverage is below this value
`gap_filling_iou_threshold`	float	0.15	IoU threshold for the coverage test
`gap_filling_confidence_threshold`	float	0.3	Minimum Raw OCR confidence
`gap_filling_dedup_iou_threshold`	float	0.5	IoU threshold for deduplication
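
As an illustration, the parameters could be grouped as below. The field names and defaults are taken from the table; the GapFillingSettings class, its validation, and filter_by_confidence are hypothetical, and how the project actually loads its settings is not shown here. Whether the confidence threshold is inclusive is also an assumption.

```python
from dataclasses import dataclass
from typing import Dict, List


@dataclass(frozen=True)
class GapFillingSettings:
    """Hypothetical settings container mirroring the parameter table."""
    gap_filling_enabled: bool = True
    gap_filling_coverage_threshold: float = 0.7
    gap_filling_iou_threshold: float = 0.15
    gap_filling_confidence_threshold: float = 0.3
    gap_filling_dedup_iou_threshold: float = 0.5

    def __post_init__(self) -> None:
        # Every threshold is a ratio, so reject values outside [0, 1].
        for name in (
            "gap_filling_coverage_threshold",
            "gap_filling_iou_threshold",
            "gap_filling_confidence_threshold",
            "gap_filling_dedup_iou_threshold",
        ):
            value = getattr(self, name)
            if not 0.0 <= value <= 1.0:
                raise ValueError(f"{name} must be in [0, 1], got {value}")


def filter_by_confidence(regions: List[Dict], settings: GapFillingSettings) -> List[Dict]:
    """Keep regions whose confidence is at or above the threshold (assumed inclusive)."""
    threshold = settings.gap_filling_confidence_threshold
    return [r for r in regions if r["confidence"] >= threshold]


settings = GapFillingSettings()
regions = [{"text": "a", "confidence": 0.9}, {"text": "b", "confidence": 0.1}]
print([r["text"] for r in filter_by_confidence(regions, settings)])
```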

Risks / Trade-offs

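Risk 1: Gap filling causes duplicated text

Mitigation: Set dedup_iou_threshold and deduplicate highly overlapping regions

Risk 2: Reading order gets scrambled

Mitigation: After filling, recalculate reading_order for the whole page (sorted by y0, then x0)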
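
The reordering step can be sketched as a plain sort on (y0, x0). The example assumes elements are dicts with a "bbox" of (x0, y0, x1, y1); the function name and the dict shape are hypothetical.

```python
from typing import Dict, List


def recalculate_reading_order(elements: List[Dict]) -> List[Dict]:
    """Return copies of the elements sorted top to bottom, then left to right."""
    # bbox is assumed to be (x0, y0, x1, y1), so the key is (y0, x0).
    ordered = sorted(elements, key=lambda e: (e["bbox"][1], e["bbox"][0]))
    return [{**e, "reading_order": i} for i, e in enumerate(ordered)]


page = [
    {"id": "c", "bbox": (0, 100, 50, 120)},
    {"id": "b", "bbox": (200, 10, 260, 30)},
    {"id": "a", "bbox": (10, 10, 100, 30)},
]
print([e["id"] for e in recalculate_reading_order(page)])
```

A strict (y0, x0) sort treats boxes on the same visual line as different rows if their y0 values differ by even a pixel, so some row tolerance may be worth considering.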

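Risk 3: Performance impact

Mitigation:

Run a quick coverage check first and skip gap filling if coverage is 70% or higher
Use an R-tree or interval tree to speed up bbox lookups (if performance becomes a bottleneck)

Risk 4: Coordinate misalignment

Mitigation: Use ocr_dimensions to keep the coordinate spaces consistent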

Migration Plan

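The new feature is optional (enabled by default)
Gap filling can be turned off via configuration
Existing API interfaces are not affected
Backward compatible: default behavior applies when no parameters are passed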

Open Questions

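Is a UI toggle needed so users can enable or disable gap filling?
Should merge logic be implemented for fragmented Raw OCR boxes (same line, adjacent, with very small gaps)?
Should the output mark which elements came from gap filling (for debugging)?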
