feat: simplify layout model selection and archive proposals

Changes:
- Replace PP-Structure 7-slider parameter UI with simple 3-option layout model selector
- Add layout model mapping: chinese (PP-DocLayout-S), default (PubLayNet), cdla
- Add LayoutModelSelector component and zh-TW translations
- Fix "default" model behavior with sentinel value for PubLayNet
- Add gap filling service for OCR track coverage improvement
- Add PP-Structure debug utilities
- Archive completed/incomplete proposals:
  - add-ocr-track-gap-filling (complete)
  - fix-ocr-track-table-rendering (incomplete)
  - simplify-ppstructure-model-selection (22/25 tasks)
- Add new layout model tests, archive old PP-Structure param tests
- Update OpenSpec ocr-processing spec with layout model requirements

🤖 Generated with [Claude Code](https://claude.com/claude-code)

Co-Authored-By: Claude <noreply@anthropic.com>
2025-11-27 13:27:00 +08:00
parent c65df754cf
commit 59206a6ab8
35 changed files with 3621 additions and 658 deletions
--- a/openspec/changes/archive/2025-11-27-add-ocr-track-gap-filling/design.md
+++ b/openspec/changes/archive/2025-11-27-add-ocr-track-gap-filling/design.md
@@ -0,0 +1,183 @@
+# Design: OCR Track Gap Filling
+
+## Context
+
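+The PP-StructureV3 layout analysis model severely under-detects content on some scanned documents. In testing, Raw PaddleOCR detected 56 text regions, while PP-StructureV3 output only 9 elements (84% missing).
+
+The problem lies in the Layout Detection Model inside PP-StructureV3. This is a limitation of the PaddleOCR library and cannot be fixed from outside. The `text_regions` data from Raw OCR, however, remains complete and usable.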
+
+### Stakeholders
+- **End users**: need complete OCR output without large amounts of missing text
+- **OCR track**: needs to integrate Raw OCR results with PP-StructureV3 results
+- **Direct/Hybrid track**: should not be affected by this change
+
+## Goals / Non-Goals
+
+### Goals
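+- Detect regions missed by PP-StructureV3 and fill them in from Raw OCR results
+- Ensure the supplemented text does not duplicate existing elements
+- Maintain correct reading order
+- Affect only the OCR track; leave the behavior of other tracks unchanged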
+
+### Non-Goals
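+- Do not modify the internal logic of PP-StructureV3 or PaddleOCR
+- Do not fill gaps for non-text elements such as images, tables, or charts
+- Do not implement complex layout analysis (gap filling only)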
+
+## Decisions
+
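+### Decision 1: Coverage Determination Strategy
+**Choice**: Prefer the "center point falls inside" test, supplemented by an IoU threshold
+
+**Rationale**:
+- The center-point test is simple to compute and fast
+- The IoU threshold is a supplement that handles edge cases
+- Suggested IoU threshold: 0.1 to 0.2, so that regions with low IoU are not misjudged as uncovered
+
+**Alternatives**:
+- Pure IoU test: more computation, and partial overlaps are harder to handle
+- Area-ratio test: not fair across regions of different sizes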
+
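+### Decision 2: Gap Filling Trigger Condition
+**Choice**: Trigger when PP-Structure coverage is < 70% or the element count is significantly lower than Raw OCR's
+
+**Rationale**:
+- Avoids duplicate text on normal documents
+- The 70% threshold is an empirical value and can be adjusted via settings
+- The element count comparison serves as a quick check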
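
The trigger condition above could be sketched roughly as follows. This is a minimal illustration, not the code in `gap_filling_service.py`: bboxes are assumed to be `(x0, y0, x1, y1)` tuples, coverage uses only the center-point test for brevity, and `count_ratio` is a hypothetical stand-in for "significantly lower", which the design does not quantify.

```python
from typing import List, Tuple

BBox = Tuple[float, float, float, float]  # assumed layout: (x0, y0, x1, y1)


def center_inside(inner: BBox, outer: BBox) -> bool:
    """True if the center point of `inner` lies within `outer`."""
    cx = (inner[0] + inner[2]) / 2.0
    cy = (inner[1] + inner[3]) / 2.0
    return outer[0] <= cx <= outer[2] and outer[1] <= cy <= outer[3]


def coverage_ratio(raw_boxes: List[BBox], element_boxes: List[BBox]) -> float:
    """Fraction of raw OCR boxes whose center falls inside some element box."""
    if not raw_boxes:
        return 1.0  # nothing to cover, so treat the page as fully covered
    covered = sum(
        1 for raw in raw_boxes
        if any(center_inside(raw, element) for element in element_boxes)
    )
    return covered / len(raw_boxes)


def should_fill_gaps(
    raw_boxes: List[BBox],
    element_boxes: List[BBox],
    coverage_threshold: float = 0.7,
    count_ratio: float = 0.5,  # hypothetical: not defined in the design
) -> bool:
    """Decide whether gap filling should run for one page."""
    if not raw_boxes:
        return False
    # Quick check: far fewer structured elements than raw OCR boxes.
    if len(element_boxes) < len(raw_boxes) * count_ratio:
        return True
    return coverage_ratio(raw_boxes, element_boxes) < coverage_threshold
```

Note that a single paragraph element can legitimately cover many raw OCR lines, so the count comparison alone may fire on normal documents; setting `count_ratio=0.0` in this sketch disables it and leaves only the coverage check.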
+
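+### Decision 3: Supplemented Element Types
+**Choice**: Supplement only the TEXT type; skip TABLE/IMAGE/FIGURE/FLOWCHART/HEADER/FOOTER
+
+**Rationale**:
+- PP-StructureV3 is usually more accurate at recognizing structured elements (tables, images)
+- Filling in raw OCR text could break table structure
+- These elements need to keep their structural integrity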
+
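+### Decision 4: Duplicate Detection and Deduplication
+**Choice**: A Raw OCR region with IoU > 0.5 against a PP-Structure TEXT element is treated as a duplicate and skipped
+
+**Rationale**:
+- 0.5 is a commonly used overlap threshold
+- Prevents the same text from appearing twice
+- Lightweight merging can be considered for fragmented Raw OCR boxes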
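
One way to picture the deduplication rule is the sketch below. It assumes bboxes are `(x0, y0, x1, y1)` tuples; `iou` and `drop_duplicates` are illustrative names, not necessarily those used in `GapFillingService`.

```python
from typing import List, Tuple

BBox = Tuple[float, float, float, float]  # assumed layout: (x0, y0, x1, y1)


def iou(a: BBox, b: BBox) -> float:
    """Intersection over union of two axis-aligned boxes."""
    inter_w = max(0.0, min(a[2], b[2]) - max(a[0], b[0]))
    inter_h = max(0.0, min(a[3], b[3]) - max(a[1], b[1]))
    inter = inter_w * inter_h
    area_a = max(0.0, a[2] - a[0]) * max(0.0, a[3] - a[1])
    area_b = max(0.0, b[2] - b[0]) * max(0.0, b[3] - b[1])
    union = area_a + area_b - inter
    return inter / union if union > 0 else 0.0


def drop_duplicates(
    candidates: List[BBox],
    text_element_boxes: List[BBox],
    dedup_iou_threshold: float = 0.5,
) -> List[BBox]:
    """Keep only candidates that do not heavily overlap an existing TEXT box."""
    return [
        candidate for candidate in candidates
        if all(iou(candidate, existing) <= dedup_iou_threshold
               for existing in text_element_boxes)
    ]
```

With a threshold of 0.5, a candidate at `(0, 0, 10, 10)` is dropped against an existing TEXT box at `(0, 0, 10, 12)` (IoU is about 0.83), while a candidate elsewhere on the page is kept.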
+
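+### Decision 5: Coordinate Alignment
+**Choice**: Use `ocr_dimensions` to convert bboxes
+
+**Rationale**:
+- OCR may resize the image
+- Ensures Raw OCR and PP-Structure coordinates are in the same space
+- Avoids coverage misjudgments caused by inconsistent dimensions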
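
The coordinate conversion might look like the following sketch. The actual shape of `ocr_dimensions` is not shown in this design, so the example simply assumes that the source and target image sizes are available as `(width, height)` pairs; `scale_bbox` is a hypothetical helper.

```python
from typing import Tuple

BBox = Tuple[float, float, float, float]  # assumed layout: (x0, y0, x1, y1)
Size = Tuple[float, float]                # assumed layout: (width, height)


def scale_bbox(bbox: BBox, from_size: Size, to_size: Size) -> BBox:
    """Map a bbox from one image coordinate space to another by linear scaling."""
    from_w, from_h = from_size
    to_w, to_h = to_size
    if from_w <= 0 or from_h <= 0:
        raise ValueError("source dimensions must be positive")
    scale_x = to_w / from_w
    scale_y = to_h / from_h
    x0, y0, x1, y1 = bbox
    return (x0 * scale_x, y0 * scale_y, x1 * scale_x, y1 * scale_y)
```

For example, a box detected on a 100x200 resized image maps onto a 200x400 original by doubling every coordinate.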
+
+## Data Flow
+
+```
+┌─────────────────┐     ┌──────────────────────┐
+│  Raw OCR Result │     │ PP-StructureV3 Result│
+│  (56 regions)   │     │    (9 elements)      │
+└────────┬────────┘     └──────────┬───────────┘
+         │                         │
+         └────────────┬────────────┘
+                      │
+      ┌───────────────▼───────────────┐
+      │ GapFillingService             │
+      │ 1. Calculate coverage         │
+      │ 2. Find uncovered regions     │
+      │ 3. Filter by confidence       │
+      │ 4. Deduplicate                │
+      │ 5. Merge if needed            │
+      └───────────────┬───────────────┘
+                      │
+      ┌───────────────▼───────────────┐
+      │ OCRToUnifiedConverter         │
+      │ - Combine elements            │
+      │ - Recalculate reading order   │
+      └───────────────┬───────────────┘
+                      │
+      ┌───────────────▼───────────────┐
+      │ UnifiedDocument               │
+      │ (complete content)            │
+      └───────────────────────────────┘
+```
+
+## Algorithm: Gap Detection
+
+```python
+def find_uncovered_regions(
+    raw_ocr_regions: List[TextRegion],
+    pp_structure_elements: List[Element],
+    iou_threshold: float = 0.15
+) -> List[TextRegion]:
+    """
+    Find Raw OCR regions not covered by PP-Structure elements.
+
+    Coverage criteria (either one):
+    1. Center point of raw region falls inside any PP-Structure bbox
+    2. IoU with any PP-Structure bbox > iou_threshold
+    """
+    uncovered = []
+
+    # Filter PP-Structure elements: only consider TEXT, skip TABLE/IMAGE/etc.
+    text_elements = [e for e in pp_structure_elements
+                     if e.type not in SKIP_TYPES]
+
+    for region in raw_ocr_regions:
+        center = get_center(region.bbox)
+        is_covered = False
+
+        for element in text_elements:
+            # Check center point
+            if point_in_bbox(center, element.bbox):
+                is_covered = True
+                break
+
+            # Check IoU
+            if calculate_iou(region.bbox, element.bbox) > iou_threshold:
+                is_covered = True
+                break
+
+        if not is_covered:
+            uncovered.append(region)
+
+    return uncovered
+```
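
The algorithm above relies on `get_center`, `point_in_bbox`, and `calculate_iou`, which the design does not define. A minimal sketch of what they could look like, assuming bboxes are `(x0, y0, x1, y1)` tuples (the real `TextRegion` and `Element` bbox format may differ):

```python
from typing import Tuple

BBox = Tuple[float, float, float, float]  # assumed layout: (x0, y0, x1, y1)
Point = Tuple[float, float]


def get_center(bbox: BBox) -> Point:
    """Return the center point of a bbox."""
    x0, y0, x1, y1 = bbox
    return ((x0 + x1) / 2.0, (y0 + y1) / 2.0)


def point_in_bbox(point: Point, bbox: BBox) -> bool:
    """True if the point lies inside the bbox (edges inclusive)."""
    x, y = point
    x0, y0, x1, y1 = bbox
    return x0 <= x <= x1 and y0 <= y <= y1


def calculate_iou(a: BBox, b: BBox) -> float:
    """Intersection over union; returns 0.0 for disjoint or degenerate boxes."""
    inter_w = max(0.0, min(a[2], b[2]) - max(a[0], b[0]))
    inter_h = max(0.0, min(a[3], b[3]) - max(a[1], b[1]))
    inter = inter_w * inter_h
    area_a = max(0.0, a[2] - a[0]) * max(0.0, a[3] - a[1])
    area_b = max(0.0, b[2] - b[0]) * max(0.0, b[3] - b[1])
    union = area_a + area_b - inter
    return inter / union if union > 0 else 0.0
```

Two 10x10 boxes that overlap by half share 50 units of area out of a union of 150, giving an IoU of 1/3, which is above the default 0.15 coverage threshold.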
+
+## Configuration Parameters
+
+| Parameter | Type | Default | Description |
+|-----------|------|---------|-------------|
+| `gap_filling_enabled` | bool | True | Whether gap filling is enabled |
+| `gap_filling_coverage_threshold` | float | 0.7 | Gap filling activates when coverage falls below this value |
+| `gap_filling_iou_threshold` | float | 0.15 | IoU threshold for the coverage test |
+| `gap_filling_confidence_threshold` | float | 0.3 | Minimum Raw OCR confidence |
+| `gap_filling_dedup_iou_threshold` | float | 0.5 | IoU threshold for deduplication |
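
For illustration, the parameters in the table could be grouped as below. This is only a sketch using a standard-library dataclass; the class name `GapFillingConfig` and the range validation are assumptions, and the project's actual settings mechanism is not shown in this diff.

```python
from dataclasses import dataclass, fields


@dataclass(frozen=True)
class GapFillingConfig:
    """Hypothetical container mirroring the documented defaults."""
    gap_filling_enabled: bool = True
    gap_filling_coverage_threshold: float = 0.7
    gap_filling_iou_threshold: float = 0.15
    gap_filling_confidence_threshold: float = 0.3
    gap_filling_dedup_iou_threshold: float = 0.5

    def __post_init__(self) -> None:
        # All float parameters are ratios, so reject values outside [0, 1].
        for field in fields(self):
            value = getattr(self, field.name)
            if isinstance(value, bool):
                continue
            if not 0.0 <= value <= 1.0:
                raise ValueError(f"{field.name} must be in [0, 1], got {value}")
```

Overriding a single value, for example `GapFillingConfig(gap_filling_coverage_threshold=0.8)`, leaves the other defaults in place.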
+
+## Risks / Trade-offs
+
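+### Risk 1: Gap filling causes duplicate text
+**Mitigation**: Set dedup_iou_threshold and deduplicate regions with high overlap
+
+### Risk 2: Reading order becomes scrambled
+**Mitigation**: After supplementing elements, recalculate reading_order for the whole page (sorted by y0, x0)
+
+### Risk 3: Performance impact
+**Mitigation**:
+- Run the quick coverage check first; if coverage is at least 70%, skip gap filling
+- Use an R-tree or interval tree to speed up bbox queries (if performance becomes a bottleneck)
+
+### Risk 4: Coordinate misalignment
+**Mitigation**: Use `ocr_dimensions` to ensure a consistent coordinate space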
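
The reading-order recalculation could be sketched as a plain sort on the top-left corner. Elements are modeled here as dicts with a `bbox` key in `(x0, y0, x1, y1)` form, which is an assumption for illustration rather than the actual `DocumentElement` structure.

```python
from typing import Any, Dict, List


def recalculate_reading_order(elements: List[Dict[str, Any]]) -> List[Dict[str, Any]]:
    """Return new element dicts sorted top-to-bottom, then left-to-right."""
    ordered = sorted(elements, key=lambda e: (e["bbox"][1], e["bbox"][0]))
    # Copy each element so the caller's list is left untouched.
    return [{**element, "reading_order": index}
            for index, element in enumerate(ordered)]
```

A strict `(y0, x0)` sort treats boxes on the same visual line as different rows if their `y0` values differ by even a pixel, so a real implementation may want to group boxes into lines with some tolerance first.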
+
+## Migration Plan
+
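+1. The new feature is optional (enabled by default)
+2. Gap filling can be turned off via settings
+3. Existing API interfaces are not affected
+4. Backward compatible: the default behavior is used when no parameters are passed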
+
+## Open Questions
+
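+1. Is a UI toggle needed to let users enable or disable gap filling?
+2. For fragmented Raw OCR boxes, should merging logic be implemented? (Same line, adjacent, with very small gaps)
+3. Should the output mark which elements came from gap filling? (For debugging)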
--- a/openspec/changes/archive/2025-11-27-add-ocr-track-gap-filling/proposal.md
+++ b/openspec/changes/archive/2025-11-27-add-ocr-track-gap-filling/proposal.md
@@ -0,0 +1,30 @@
+# Change: Add OCR Track Gap Filling with Raw OCR Text Regions
+
+## Why
+
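+PP-StructureV3's layout analysis model severely under-detects content on some scanned documents, causing large amounts of text to be lost. Testing with scan.pdf showed:
+- Raw PaddleOCR text recognition: detected **56 text regions**
+- PP-StructureV3 layout analysis: output only **9 elements**
+- Loss ratio: about **84%** of the content was not recognized by PP-StructureV3
+
+The root cause is that the Layout Detection Model inside PP-StructureV3 does not adequately support this type of scanned document; it is not a problem in our code. Raw OCR correctly detects all text regions, but this information is lost during PP-StructureV3's structured processing.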
+
+## What Changes
+
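+Implement a "Hybrid Approach": use Raw OCR text regions to supplement the content that PP-StructureV3 misses.
+
+- **Add** a `GapFillingService` class responsible for detecting and filling in text regions missed by PP-StructureV3
+- **Add** coverage calculation logic (center point falls inside, or IoU threshold)
+- **Add** an automatic activation condition: PP-Structure coverage < 70% or element count significantly lower than the number of Raw OCR boxes
+- **Modify** `OCRToUnifiedConverter` to integrate the gap filling logic
+- **Add** reading_order recalculation logic (sorted by y0, x0)
+- **Add** test cases: a severe PP-Structure miss-detection case, and verification on a normal document with no missed content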
+
+## Impact
+
+- **Affected specs**: `ocr-processing`
+- **Affected code**:
+  - `backend/app/services/ocr_to_unified_converter.py` - integrates gap filling
+  - `backend/app/services/gap_filling_service.py` - new (core logic)
+  - `backend/tests/test_gap_filling.py` - new (tests)
+- **Track isolation**: applies only to the OCR track; Direct/Hybrid tracks are unaffected
--- a/openspec/changes/archive/2025-11-27-add-ocr-track-gap-filling/specs/ocr-processing/spec.md
+++ b/openspec/changes/archive/2025-11-27-add-ocr-track-gap-filling/specs/ocr-processing/spec.md
@@ -0,0 +1,111 @@
+## ADDED Requirements
+
+### Requirement: OCR Track Gap Filling with Raw OCR Regions
+
+The system SHALL detect and fill gaps in PP-StructureV3 output by supplementing with Raw OCR text regions when significant content loss is detected.
+
+#### Scenario: Gap filling activates when coverage is low
+- **GIVEN** an OCR track processing task
+- **WHEN** PP-StructureV3 outputs elements that cover less than 70% of Raw OCR text regions
+- **THEN** the system SHALL activate gap filling
+- **AND** identify Raw OCR regions not covered by any PP-StructureV3 element
+- **AND** supplement these regions as TEXT elements in the output
+
+#### Scenario: Coverage is determined by center-point and IoU
+- **GIVEN** a Raw OCR text region with a bounding box
+- **WHEN** checking if the region is covered by PP-StructureV3
+- **THEN** the region SHALL be considered covered if its center point falls inside any PP-StructureV3 element bbox
+- **OR** if its IoU with any PP-StructureV3 element exceeds the 0.15 threshold
+- **AND** regions not meeting either criterion SHALL be marked as uncovered
+
+#### Scenario: Only TEXT elements are supplemented
+- **GIVEN** uncovered Raw OCR regions identified for supplementation
+- **WHEN** PP-StructureV3 has detected TABLE, IMAGE, FIGURE, FLOWCHART, HEADER, or FOOTER elements
+- **THEN** the system SHALL NOT supplement regions that overlap with these structural elements
+- **AND** only supplement regions as TEXT type to preserve structural integrity
+
+#### Scenario: Supplemented regions meet confidence threshold
+- **GIVEN** Raw OCR regions to be supplemented
+- **WHEN** a region has confidence score below 0.3
+- **THEN** the system SHALL skip that region
+- **AND** only supplement regions with confidence >= 0.3
+
+#### Scenario: Deduplication prevents repeated text
+- **GIVEN** a Raw OCR region being considered for supplementation
+- **WHEN** the region has IoU > 0.5 with any existing PP-StructureV3 TEXT element
+- **THEN** the system SHALL skip that region to prevent duplicate text
+- **AND** the original PP-StructureV3 element SHALL be preserved
+
+#### Scenario: Reading order is recalculated after gap filling
+- **GIVEN** supplemented elements have been added to the page
+- **WHEN** assembling the final element list
+- **THEN** the system SHALL recalculate reading order for the entire page
+- **AND** sort elements by y0 coordinate (top to bottom) then x0 (left to right)
+- **AND** ensure logical document flow is maintained
+
+#### Scenario: Coordinate alignment with ocr_dimensions
+- **GIVEN** Raw OCR processing may involve image resizing
+- **WHEN** comparing Raw OCR bbox with PP-StructureV3 bbox
+- **THEN** the system SHALL use ocr_dimensions to normalize coordinates
+- **AND** ensure both sources reference the same coordinate space
+- **AND** prevent coverage misdetection due to scale differences
+
+#### Scenario: Supplemented elements have complete metadata
+- **GIVEN** a Raw OCR region being added as supplemented element
+- **WHEN** creating the DocumentElement
+- **THEN** the element SHALL include page_number
+- **AND** include confidence score from Raw OCR
+- **AND** include original bbox coordinates
+- **AND** optionally include source indicator for debugging
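
The metadata scenario above could be illustrated with a small conversion helper. All field names here (`text`, `content`, `metadata["source"]`) and the dict-based shapes are assumptions for the sketch; the real `DocumentElement` model is not part of this diff.

```python
from typing import Any, Dict


def to_supplemented_element(region: Dict[str, Any], page_number: int) -> Dict[str, Any]:
    """Build a TEXT element from a raw OCR region, keeping its metadata."""
    return {
        "type": "text",
        "content": region["text"],
        "bbox": tuple(region["bbox"]),       # original bbox coordinates
        "confidence": region["confidence"],  # confidence score from Raw OCR
        "page_number": page_number,
        # Optional source indicator, useful when debugging gap filling.
        "metadata": {"source": "gap_filling"},
    }
```

Keeping the source indicator under a separate `metadata` key leaves the main element fields identical in shape to elements that came from PP-StructureV3.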
+
+### Requirement: Gap Filling Track Isolation
+
+The gap filling feature SHALL only apply to OCR track processing and SHALL NOT affect Direct or Hybrid track outputs.
+
+#### Scenario: Gap filling only activates for OCR track
+- **GIVEN** a document processing task
+- **WHEN** the processing track is OCR
+- **THEN** the system SHALL evaluate and apply gap filling as needed
+- **AND** produce enhanced output with supplemented content
+
+#### Scenario: Direct track is unaffected
+- **GIVEN** a document processing task with Direct track
+- **WHEN** the task is processed
+- **THEN** the system SHALL NOT invoke any gap filling logic
+- **AND** produce output identical to current Direct track behavior
+
+#### Scenario: Hybrid track is unaffected
+- **GIVEN** a document processing task with Hybrid track
+- **WHEN** the task is processed
+- **THEN** the system SHALL NOT invoke gap filling logic
+- **AND** use existing Hybrid track processing pipeline
+
+### Requirement: Gap Filling Configuration
+
+The system SHALL provide configurable parameters for gap filling behavior.
+
+#### Scenario: Gap filling can be disabled via configuration
+- **GIVEN** gap_filling_enabled is set to false in configuration
+- **WHEN** OCR track processing runs
+- **THEN** the system SHALL skip all gap filling logic
+- **AND** output only PP-StructureV3 results as before
+
+#### Scenario: Coverage threshold is configurable
+- **GIVEN** gap_filling_coverage_threshold is set to 0.8
+- **WHEN** PP-StructureV3 coverage is 75%
+- **THEN** the system SHALL activate gap filling
+- **AND** supplement uncovered regions
+
+#### Scenario: IoU thresholds are configurable
+- **GIVEN** custom IoU thresholds configured:
+  - gap_filling_iou_threshold: 0.2
+  - gap_filling_dedup_iou_threshold: 0.6
+- **WHEN** evaluating coverage and deduplication
+- **THEN** the system SHALL use the configured values
+- **AND** apply them consistently throughout gap filling process
+
+#### Scenario: Confidence threshold is configurable
+- **GIVEN** gap_filling_confidence_threshold is set to 0.5
+- **WHEN** supplementing Raw OCR regions
+- **THEN** the system SHALL only include regions with confidence >= 0.5
+- **AND** filter out lower confidence regions
--- a/openspec/changes/archive/2025-11-27-add-ocr-track-gap-filling/tasks.md
+++ b/openspec/changes/archive/2025-11-27-add-ocr-track-gap-filling/tasks.md
@@ -0,0 +1,44 @@
+# Tasks: Add OCR Track Gap Filling
+
+## 1. Core Implementation
+
+- [x] 1.1 Create `gap_filling_service.py` with `GapFillingService` class
+- [x] 1.2 Implement bbox coverage calculation (center-point and IoU methods)
+- [x] 1.3 Implement gap detection logic (find uncovered raw OCR regions)
+- [x] 1.4 Implement confidence threshold filtering for supplemented regions
+- [x] 1.5 Implement element type filtering (only supplement TEXT, skip TABLE/IMAGE/FIGURE/etc.)
+- [x] 1.6 Implement reading order recalculation (sort by y0, x0)
+- [x] 1.7 Implement deduplication logic (skip high IoU overlaps with PP-Structure TEXT)
+- [x] 1.8 Implement optional text merging for fragmented adjacent regions
+
+## 2. Integration
+
+- [x] 2.1 Modify `OCRToUnifiedConverter` to accept raw OCR text_regions
+- [x] 2.2 Add gap filling activation condition check (coverage < 70% or element count disparity)
+- [x] 2.3 Ensure coordinate alignment between raw OCR and PP-Structure (ocr_dimensions handling)
+- [x] 2.4 Add page metadata (page_number, confidence, bbox) to supplemented elements
+- [x] 2.5 Ensure track isolation (only OCR track, not Direct/Hybrid)
+
+## 3. Configuration
+
+- [x] 3.1 Add configurable parameters to settings:
+  - `gap_filling_enabled`: bool (default: True)
+  - `gap_filling_coverage_threshold`: float (default: 0.7)
+  - `gap_filling_iou_threshold`: float (default: 0.15)
+  - `gap_filling_confidence_threshold`: float (default: 0.3)
+  - `gap_filling_dedup_iou_threshold`: float (default: 0.5)
+
+## 4. Testing (with env)
+
+- [x] 4.1 Create test fixtures with a severe PP-Structure miss-detection case (using scan.pdf / scan2.pdf)
+- [x] 4.2 Test gap detection correctly identifies uncovered regions
+- [x] 4.3 Test supplemented elements have correct metadata
+- [x] 4.4 Test reading order is correctly recalculated
+- [x] 4.5 Test deduplication prevents duplicate text
+- [x] 4.6 Test normal document without miss-detection has no duplicate/inflation
+- [x] 4.7 Test track isolation (Direct track unaffected)
+
+## 5. Documentation
+
+- [x] 5.1 Add inline documentation to GapFillingService
+- [x] 5.2 Update configuration documentation with new settings