feat: simplify layout model selection and archive proposals

Changes:
- Replace PP-Structure 7-slider parameter UI with simple 3-option layout model selector
- Add layout model mapping: chinese (PP-DocLayout-S), default (PubLayNet), cdla
- Add LayoutModelSelector component and zh-TW translations
- Fix "default" model behavior with sentinel value for PubLayNet
- Add gap filling service for OCR track coverage improvement
- Add PP-Structure debug utilities
- Archive completed/incomplete proposals:
  - add-ocr-track-gap-filling (complete)
  - fix-ocr-track-table-rendering (incomplete)
  - simplify-ppstructure-model-selection (22/25 tasks)
- Add new layout model tests, archive old PP-Structure param tests
- Update OpenSpec ocr-processing spec with layout model requirements

🤖 Generated with [Claude Code](https://claude.com/claude-code)

Co-Authored-By: Claude <noreply@anthropic.com>
2025-11-27 13:27:00 +08:00
parent c65df754cf
commit 59206a6ab8
35 changed files with 3621 additions and 658 deletions
--- a/openspec/changes/archive/2025-11-26-fix-ocr-table-empty-columns/proposal.md
+++ b/openspec/changes/archive/2025-11-26-fix-ocr-table-empty-columns/proposal.md
@@ -0,0 +1,28 @@
+# Change: Fix OCR Track Table Empty Columns and Alignment
+
+## Why
+
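+Tables generated by PP-Structure often contain empty columns (columns in which every row's cell is empty or whitespace-only), so the converted UnifiedDocument tables end up with blank columns and misaligned fields. The OCR Track currently uses the raw data without any cleanup, which degrades the quality of the PDF/JSON/Markdown output.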
+
+## What Changes
+
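+- Add a `trim_empty_columns()` function that removes empty columns from OCR Track tables
+- Call the cleanup logic at the entry of `_convert_table_data` so that TableData is clean
+- Recalculate col_span: if a span crosses a removed column, shrink the span
+- Update the columns/cols value and adjust each cell's col index
+- Optional: sort cells for column alignment by bbox x0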
+
+## Impact
+
+- Affected specs: `ocr-processing`
+- Affected code:
+  - `backend/app/services/ocr_to_unified_converter.py` (primary change)
+- The Direct/HYBRID paths are not affected
+- PDF/JSON/Markdown output will be cleaner
+
+## Constraints
+
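+- Keep the table bbox and page coordinates unchanged
+- Do not modify the Direct/HYBRID paths
+- Only remove columns that are empty in every row; if the header is empty but the data rows have values, the column must not be removed
+- Preserve the original bbox to avoid layout drift in the PDF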
--- a/openspec/changes/archive/2025-11-26-fix-ocr-table-empty-columns/specs/ocr-processing/spec.md
+++ b/openspec/changes/archive/2025-11-26-fix-ocr-table-empty-columns/specs/ocr-processing/spec.md
@@ -0,0 +1,61 @@
+## ADDED Requirements
+
+### Requirement: OCR Table Empty Column Cleanup
+
+The OCR Track converter SHALL clean up PP-Structure generated tables by removing columns where all rows have empty or whitespace-only content.
+
+The system SHALL:
+1. Identify columns where every cell's content is empty or contains only whitespace (using `.strip()` to determine emptiness)
+2. Remove identified empty columns from the table structure
+3. Update the `columns`/`cols` value to reflect the new column count
+4. Recalculate each cell's `col` index to maintain continuity
+5. Adjust `col_span` values when spans cross removed columns (shrink span size)
+6. Remove cells entirely when their complete span falls within removed columns
+7. Preserve original bbox and page coordinates (no layout drift)
+8. If `columns` is 0 or missing after cleanup, fill with the calculated column count
+
+The cleanup SHALL NOT:
+- Remove columns where the header is empty but data rows contain values
+- Modify tables in Direct or HYBRID track
+- Alter the original bbox coordinates
+
+#### Scenario: All rows in column are empty
+- **WHEN** a table has a column where all cells contain only empty or whitespace content
+- **THEN** that column is removed
+- **AND** remaining cells have their `col` indices decremented appropriately
+- **AND** `cols` count is reduced by 1
+
+#### Scenario: Column has empty header but data has values
+- **WHEN** a table has a column where the header cell is empty
+- **AND** at least one data row cell in that column contains non-whitespace content
+- **THEN** that column is NOT removed
+
+#### Scenario: Cell span crosses removed column
+- **WHEN** a cell has `col_span > 1`
+- **AND** one or more columns within the span are removed
+- **THEN** the `col_span` is reduced by the number of removed columns within the span
+
+#### Scenario: Cell span entirely within removed columns
+- **WHEN** a cell's entire span falls within columns that are all removed
+- **THEN** that cell is removed from the table
+
+#### Scenario: Missing columns metadata
+- **WHEN** the table dict has `columns` set to 0 or missing
+- **AFTER** cleanup is performed
+- **THEN** `columns` is set to the calculated number of remaining columns
+
+### Requirement: OCR Table Column Alignment by Bbox
+
+(Optional Enhancement) When bbox coordinates are available for table cells, the OCR Track converter SHALL use cell bbox x0 coordinates to improve column alignment accuracy.
+
+The system SHALL:
+1. Sort cells by bbox `x0` coordinate before assigning column indices
+2. Reassign `col` indices based on spatial position rather than HTML order
+
+This requirement is optional and implementation MAY be deferred if bbox data is not reliably available.
+
+#### Scenario: Cells reordered by bbox position
+- **WHEN** bbox coordinates are available for table cells
+- **AND** the original HTML order does not match spatial order
+- **THEN** cells are reordered by `x0` coordinate
+- **AND** `col` indices are reassigned to reflect spatial positioning
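The cleanup requirement above (items 1 to 8) can be illustrated with a minimal sketch. It is not the code from `ocr_to_unified_converter.py`, which this diff does not show. The table shape is an assumption: a dict with `cols` and a `cells` list whose entries carry `col`, an optional `col_span`, and `content`. The sketch writes only `cols`; a table that uses `columns` would need the same update under that key.

```python
from typing import Any, Dict, List, Optional


def trim_empty_columns(table: Dict[str, Any]) -> Dict[str, Any]:
    """Return a copy of `table` without columns whose cells are all blank."""
    cells: List[Dict[str, Any]] = table.get("cells", [])

    # Derive the column count from the cells so a 0 or missing `cols` still works.
    derived = max((c["col"] + c.get("col_span", 1) for c in cells), default=0)
    n_cols = max(derived, table.get("cols") or 0)

    # A column survives if any cell anchored in it has non-whitespace content.
    keep = [False] * n_cols
    for c in cells:
        if str(c.get("content", "")).strip():
            keep[c["col"]] = True

    # Map each original column index to its new index (None when removed).
    new_index: List[Optional[int]] = []
    kept = 0
    for flag in keep:
        new_index.append(kept if flag else None)
        kept += 1 if flag else 0

    new_cells = []
    for c in cells:
        span = range(c["col"], c["col"] + c.get("col_span", 1))
        surviving = [new_index[i] for i in span if new_index[i] is not None]
        if not surviving:
            continue  # the whole span fell inside removed columns
        # Copy the cell so bbox and any other fields are preserved untouched.
        new_cells.append({**c, "col": surviving[0], "col_span": len(surviving)})

    return {**table, "cells": new_cells, "cols": kept}
```

Because a column counts as non-empty as soon as any anchored cell has content, a column with an empty header but populated data rows is kept, which matches the second scenario.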
--- a/openspec/changes/archive/2025-11-26-fix-ocr-table-empty-columns/tasks.md
+++ b/openspec/changes/archive/2025-11-26-fix-ocr-table-empty-columns/tasks.md
@@ -0,0 +1,43 @@
+# Tasks: Fix OCR Track Table Empty Columns
+
+## 1. Core Implementation
+
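+- [x] 1.1 Implement `trim_empty_columns(table_dict: Dict[str, Any]) -> Dict[str, Any]` in `ocr_to_unified_converter.py`
+  - From the cells array, determine for each column whether the content of every row is empty/whitespace
+  - Use `.strip()` to detect whitespace-only content
+- [x] 1.2 Implement column removal logic
+  - Update the columns/cols value
+  - Adjust each cell's col index
+- [x] 1.3 Implement col_span recalculation logic
+  - If a span crosses a removed column, shrink the span
+  - If an entire span falls on removed columns, remove the cell
+- [x] 1.4 Call `trim_empty_columns` at the entry of `_convert_table_data`
+  - Run the cleanup before building TableData
+  - Also add the cleanup to `_extract_table_data` (HTML table parsing)
+- [ ] 1.5 (Optional) Sort cells for column alignment by bbox x0/x1
+  - If a bbox grid is available, sort by x0 first and then reassign col indices
+  - Deferred; to be implemented once the availability of bbox data is confirmed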
+
+## 2. Testing & Validation
+
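+- [x] 2.1 Unit tests pass
+  - Test basic empty-column removal
+  - Test empty header with data values (column not removed)
+  - Test col_span crossing a removed column (span shrinks)
+  - Test cell falling entirely within removed columns (cell removed)
+  - Test table with no empty columns (no change)
+- [x] 2.2 Check existing OCR results
+  - No existing result contains a table with a fully empty column
+  - The implementation is in place and will clean up empty columns when they occur
+- [x] 2.3 Confirm Direct/HYBRID tables are unchanged
+  - `OCRToUnifiedConverter` is used only in `ocr_service.py`
+  - The Direct track uses `DirectExtractionEngine` and is unaffected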
+
+## 3. Edge Cases & Validation
+
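+- [x] 3.1 Handle the case where the columns field is 0 or missing
+  - Backfill with the computed column count to avoid errors in downstream dependencies
+- [x] 3.2 Handle the case where the header is empty but the data has values
+  - Only remove columns that are empty in every row
+- [x] 3.3 Ensure `backend/storage/results/...` is not modified directly
+  - The converter is modified; tasks must be re-run to verify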
--- a/openspec/changes/archive/2025-11-27-add-ocr-track-gap-filling/design.md
+++ b/openspec/changes/archive/2025-11-27-add-ocr-track-gap-filling/design.md
@@ -0,0 +1,183 @@
+# Design: OCR Track Gap Filling
+
+## Context
+
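+The PP-StructureV3 layout analysis model severely under-detects content on some scanned documents. In testing, Raw PaddleOCR detected 56 text regions, while PP-StructureV3 output only 9 elements (84% lost).
+
+The problem lies in PP-StructureV3's internal Layout Detection Model. It is a limitation of the PaddleOCR library and cannot be fixed from the outside. The Raw OCR `text_regions` data, however, remains complete and usable.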
+
+### Stakeholders
+- **End users**: need complete OCR output without large amounts of missing text
+- **OCR track**: needs to integrate Raw OCR and PP-StructureV3 results
+- **Direct/Hybrid track**: should not be affected by this change
+
+## Goals / Non-Goals
+
+### Goals
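+- Detect regions missed by PP-StructureV3 and fill them in with Raw OCR results
+- Ensure the filled-in text does not duplicate existing elements
+- Maintain the correct reading order
+- Affect only the OCR track; leave the behavior of other tracks unchanged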
+
+### Non-Goals
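+- Do not modify the internal logic of PP-StructureV3 or PaddleOCR
+- Do not fill gaps for non-text elements such as images, tables, or charts
+- Do not implement complex layout analysis (gap filling only)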
+
+## Decisions
+
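+### Decision 1: Coverage Criterion
+**Choice**: Use the "center point falls inside" test first, supplemented by an IoU threshold
+
+**Rationale**:
+- The center-point test is simple to compute and fast
+- The IoU threshold acts as a supplement for boundary cases
+- Suggested IoU threshold of 0.1 to 0.2, so that regions with low IoU are not misclassified as uncovered
+
+**Alternatives**:
+- Pure IoU test: more computation, and handling of partial overlaps is more complex
+- Area-ratio test: treats regions of different sizes unevenly
+
+### Decision 2: Gap Filling Trigger Condition
+**Choice**: Trigger when PP-Structure coverage is < 70% or the element count is significantly lower than Raw OCR
+
+**Rationale**:
+- Avoids duplicate text on normal documents
+- The 70% threshold is an empirical value and can be adjusted via settings
+- The element count comparison serves as a quick check
+
+### Decision 3: Element Types to Fill
+**Choice**: Fill TEXT only; skip TABLE/IMAGE/FIGURE/FLOWCHART/HEADER/FOOTER
+
+**Rationale**:
+- PP-StructureV3 usually recognizes structured elements (tables, images) fairly accurately
+- Filling in raw OCR text could break table structure
+- These elements need to keep their structural integrity
+
+### Decision 4: Duplicate Detection and Deduplication
+**Choice**: Raw OCR regions with IoU > 0.5 are treated as duplicates of PP-Structure TEXT and skipped
+
+**Rationale**:
+- 0.5 is a common overlap threshold
+- Avoids the same text appearing twice
+- Lightweight merging may be considered for fragmented Raw OCR boxes
+
+### Decision 5: Coordinate Alignment
+**Choice**: Use `ocr_dimensions` for bbox conversion
+
+**Rationale**:
+- OCR may involve resizing
+- Ensures Raw OCR and PP-Structure coordinates are in the same space
+- Avoids coverage misjudgment caused by inconsistent dimensions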
+
+## Data Flow
+
+```
+┌─────────────────┐     ┌──────────────────────┐
+│  Raw OCR Result │     │ PP-StructureV3 Result│
+│  (56 regions)   │     │    (9 elements)      │
+└────────┬────────┘     └──────────┬───────────┘
+         │                         │
+         └────────────┬────────────┘
+                      │
+      ┌───────────────▼───────────────┐
+      │ GapFillingService             │
+      │ 1. Calculate coverage         │
+      │ 2. Find uncovered regions     │
+      │ 3. Filter by confidence       │
+      │ 4. Deduplicate                │
+      │ 5. Merge if needed            │
+      └───────────────┬───────────────┘
+                      │
+      ┌───────────────▼───────────────┐
+      │ OCRToUnifiedConverter         │
+      │ - Combine elements            │
+      │ - Recalculate reading order   │
+      └───────────────┬───────────────┘
+                      │
+      ┌───────────────▼───────────────┐
+      │ UnifiedDocument               │
+      │ (complete content)            │
+      └───────────────────────────────┘
+```
+
+## Algorithm: Gap Detection
+
+```python
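+from typing import List
+
+# TextRegion, Element, SKIP_TYPES, get_center, point_in_bbox, and calculate_iou
+# are assumed to be defined elsewhere in the gap filling module.
+def find_uncovered_regions(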
+    raw_ocr_regions: List[TextRegion],
+    pp_structure_elements: List[Element],
+    iou_threshold: float = 0.15
+) -> List[TextRegion]:
+    """
+    Find Raw OCR regions not covered by PP-Structure elements.
+
+    Coverage criteria (either one):
+    1. Center point of raw region falls inside any PP-Structure bbox
+    2. IoU with any PP-Structure bbox > iou_threshold
+    """
+    uncovered = []
+
+    # Filter PP-Structure elements: only consider TEXT, skip TABLE/IMAGE/etc.
+    text_elements = [e for e in pp_structure_elements
+                     if e.type not in SKIP_TYPES]
+
+    for region in raw_ocr_regions:
+        center = get_center(region.bbox)
+        is_covered = False
+
+        for element in text_elements:
+            # Check center point
+            if point_in_bbox(center, element.bbox):
+                is_covered = True
+                break
+
+            # Check IoU
+            if calculate_iou(region.bbox, element.bbox) > iou_threshold:
+                is_covered = True
+                break
+
+        if not is_covered:
+            uncovered.append(region)
+
+    return uncovered
+```
+
+## Configuration Parameters
+
+| Parameter | Type | Default | Description |
+|-----------|------|---------|-------------|
+| `gap_filling_enabled` | bool | True | Whether gap filling is enabled |
+| `gap_filling_coverage_threshold` | float | 0.7 | Gap filling activates when coverage falls below this value |
+| `gap_filling_iou_threshold` | float | 0.15 | IoU threshold for the coverage test |
+| `gap_filling_confidence_threshold` | float | 0.3 | Minimum Raw OCR confidence |
+| `gap_filling_dedup_iou_threshold` | float | 0.5 | IoU threshold for deduplication |
+
+## Risks / Trade-offs
+
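+### Risk 1: Gap filling causes duplicate text
+**Mitigation**: Set dedup_iou_threshold and deduplicate highly overlapping regions
+
+### Risk 2: Reading order becomes scrambled
+**Mitigation**: After adding the supplemented elements, recalculate reading_order for the whole page (sorted by y0, x0)
+
+### Risk 3: Performance impact
+**Mitigation**:
+- Run a quick coverage check first; skip gap filling if coverage is > 70%
+- Use an R-tree or interval tree to speed up bbox queries (if performance becomes a bottleneck)
+
+### Risk 4: Coordinate misalignment
+**Mitigation**: Use `ocr_dimensions` to keep the coordinate spaces consistent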
+
+## Migration Plan
+
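+1. The new feature is optional (enabled by default)
+2. Gap filling can be disabled via settings
+3. The existing API interface is not affected
+4. Backward compatible: the default behavior is used when no parameters are passed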
+
+## Open Questions
+
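+1. Is a UI toggle needed so that users can enable or disable gap filling?
+2. For fragmented Raw OCR boxes, should merge logic be implemented? (same line, adjacent, with very small gaps)
+3. Should the output mark which elements came from gap filling? (for debugging)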
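The gap detection algorithm in the design relies on `get_center`, `point_in_bbox`, and `calculate_iou` without showing them. The following is a minimal sketch of what such helpers might look like, plus a coverage ratio for the trigger condition in Decision 2. It assumes bboxes are `(x0, y0, x1, y1)` tuples; `is_covered` and `coverage_ratio` are illustrative names, not functions from `gap_filling_service.py`.

```python
from typing import Sequence, Tuple

BBox = Tuple[float, float, float, float]  # assumed (x0, y0, x1, y1) layout


def get_center(bbox: BBox) -> Tuple[float, float]:
    """Midpoint of the box."""
    return ((bbox[0] + bbox[2]) / 2, (bbox[1] + bbox[3]) / 2)


def point_in_bbox(point: Tuple[float, float], bbox: BBox) -> bool:
    """True if the point lies inside the box, edges included."""
    x, y = point
    return bbox[0] <= x <= bbox[2] and bbox[1] <= y <= bbox[3]


def calculate_iou(a: BBox, b: BBox) -> float:
    """Intersection over union of two axis-aligned boxes."""
    inter_w = max(0.0, min(a[2], b[2]) - max(a[0], b[0]))
    inter_h = max(0.0, min(a[3], b[3]) - max(a[1], b[1]))
    inter = inter_w * inter_h
    area_a = (a[2] - a[0]) * (a[3] - a[1])
    area_b = (b[2] - b[0]) * (b[3] - b[1])
    union = area_a + area_b - inter
    return inter / union if union > 0 else 0.0


def is_covered(region: BBox, elements: Sequence[BBox],
               iou_threshold: float = 0.15) -> bool:
    """Center-point test first, IoU as the fallback (Decision 1)."""
    center = get_center(region)
    return any(
        point_in_bbox(center, e) or calculate_iou(region, e) > iou_threshold
        for e in elements
    )


def coverage_ratio(regions: Sequence[BBox], elements: Sequence[BBox],
                   iou_threshold: float = 0.15) -> float:
    """Fraction of raw OCR regions covered by at least one element."""
    if not regions:
        return 1.0  # nothing to cover counts as fully covered
    covered = sum(is_covered(r, elements, iou_threshold) for r in regions)
    return covered / len(regions)
```

With the default `gap_filling_coverage_threshold` of 0.7, gap filling would activate whenever `coverage_ratio(...)` returns a value below 0.7.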
--- a/openspec/changes/archive/2025-11-27-add-ocr-track-gap-filling/proposal.md
+++ b/openspec/changes/archive/2025-11-27-add-ocr-track-gap-filling/proposal.md
@@ -0,0 +1,30 @@
+# Change: Add OCR Track Gap Filling with Raw OCR Text Regions
+
+## Why
+
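+PP-StructureV3's layout analysis model severely under-detects content on some scanned documents, causing large amounts of text to be lost. Testing with scan.pdf showed:
+- Raw PaddleOCR text recognition: detected **56 text regions**
+- PP-StructureV3 layout analysis: output only **9 elements**
+- Loss ratio: about **84%** of the content was not recognized by PP-StructureV3
+
+The root cause is that PP-StructureV3's internal Layout Detection Model has insufficient support for this type of scanned document; it is not a problem in our code. Raw OCR detects all text regions correctly, but that information is lost during PP-StructureV3's structural processing.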
+
+## What Changes
+
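+Implement a "Hybrid Approach": use Raw OCR text regions to supplement the content missed by PP-StructureV3.
+
+- **Added** a `GapFillingService` class that detects and fills in text regions missed by PP-StructureV3
+- **Added** coverage calculation logic (center-point test or IoU threshold)
+- **Added** an automatic activation condition: PP-Structure coverage < 70%, or element count significantly lower than the number of Raw OCR boxes
+- **Modified** `OCRToUnifiedConverter` to integrate the gap filling logic
+- **Added** reading_order recalculation logic (sorted by y0, x0)
+- **Added** test cases: a severe PP-Structure under-detection case, and verification on a normal document with no missed detections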
+
+## Impact
+
+- **Affected specs**: `ocr-processing`
+- **Affected code**:
+  - `backend/app/services/ocr_to_unified_converter.py` - integrates gap filling
+  - `backend/app/services/gap_filling_service.py` - new (core logic)
+  - `backend/tests/test_gap_filling.py` - new (tests)
+- **Track isolation**: applies only to the OCR track; Direct/Hybrid tracks are unaffected
--- a/openspec/changes/archive/2025-11-27-add-ocr-track-gap-filling/specs/ocr-processing/spec.md
+++ b/openspec/changes/archive/2025-11-27-add-ocr-track-gap-filling/specs/ocr-processing/spec.md
@@ -0,0 +1,111 @@
+## ADDED Requirements
+
+### Requirement: OCR Track Gap Filling with Raw OCR Regions
+
+The system SHALL detect and fill gaps in PP-StructureV3 output by supplementing with Raw OCR text regions when significant content loss is detected.
+
+#### Scenario: Gap filling activates when coverage is low
+- **GIVEN** an OCR track processing task
+- **WHEN** PP-StructureV3 outputs elements that cover less than 70% of Raw OCR text regions
+- **THEN** the system SHALL activate gap filling
+- **AND** identify Raw OCR regions not covered by any PP-StructureV3 element
+- **AND** supplement these regions as TEXT elements in the output
+
+#### Scenario: Coverage is determined by center-point and IoU
+- **GIVEN** a Raw OCR text region with bounding box
+- **WHEN** checking if the region is covered by PP-StructureV3
+- **THEN** the region SHALL be considered covered if its center point falls inside any PP-StructureV3 element bbox
+- **OR** if its IoU with any PP-StructureV3 element exceeds the 0.15 threshold
+- **AND** regions not meeting either criterion SHALL be marked as uncovered
+
+#### Scenario: Only TEXT elements are supplemented
+- **GIVEN** uncovered Raw OCR regions identified for supplementation
+- **WHEN** PP-StructureV3 has detected TABLE, IMAGE, FIGURE, FLOWCHART, HEADER, or FOOTER elements
+- **THEN** the system SHALL NOT supplement regions that overlap with these structural elements
+- **AND** only supplement regions as TEXT type to preserve structural integrity
+
+#### Scenario: Supplemented regions meet confidence threshold
+- **GIVEN** Raw OCR regions to be supplemented
+- **WHEN** a region has confidence score below 0.3
+- **THEN** the system SHALL skip that region
+- **AND** only supplement regions with confidence >= 0.3
+
+#### Scenario: Deduplication prevents repeated text
+- **GIVEN** a Raw OCR region being considered for supplementation
+- **WHEN** the region has IoU > 0.5 with any existing PP-StructureV3 TEXT element
+- **THEN** the system SHALL skip that region to prevent duplicate text
+- **AND** the original PP-StructureV3 element SHALL be preserved
+
+#### Scenario: Reading order is recalculated after gap filling
+- **GIVEN** supplemented elements have been added to the page
+- **WHEN** assembling the final element list
+- **THEN** the system SHALL recalculate reading order for the entire page
+- **AND** sort elements by y0 coordinate (top to bottom) then x0 (left to right)
+- **AND** ensure logical document flow is maintained
+
+#### Scenario: Coordinate alignment with ocr_dimensions
+- **GIVEN** Raw OCR processing may involve image resizing
+- **WHEN** comparing Raw OCR bbox with PP-StructureV3 bbox
+- **THEN** the system SHALL use ocr_dimensions to normalize coordinates
+- **AND** ensure both sources reference the same coordinate space
+- **AND** prevent coverage misdetection due to scale differences
+
+#### Scenario: Supplemented elements have complete metadata
+- **GIVEN** a Raw OCR region being added as supplemented element
+- **WHEN** creating the DocumentElement
+- **THEN** the element SHALL include page_number
+- **AND** include confidence score from Raw OCR
+- **AND** include original bbox coordinates
+- **AND** optionally include source indicator for debugging
+
+### Requirement: Gap Filling Track Isolation
+
+The gap filling feature SHALL only apply to OCR track processing and SHALL NOT affect Direct or Hybrid track outputs.
+
+#### Scenario: Gap filling only activates for OCR track
+- **GIVEN** a document processing task
+- **WHEN** the processing track is OCR
+- **THEN** the system SHALL evaluate and apply gap filling as needed
+- **AND** produce enhanced output with supplemented content
+
+#### Scenario: Direct track is unaffected
+- **GIVEN** a document processing task with Direct track
+- **WHEN** the task is processed
+- **THEN** the system SHALL NOT invoke any gap filling logic
+- **AND** produce output identical to current Direct track behavior
+
+#### Scenario: Hybrid track is unaffected
+- **GIVEN** a document processing task with Hybrid track
+- **WHEN** the task is processed
+- **THEN** the system SHALL NOT invoke gap filling logic
+- **AND** use existing Hybrid track processing pipeline
+
+### Requirement: Gap Filling Configuration
+
+The system SHALL provide configurable parameters for gap filling behavior.
+
+#### Scenario: Gap filling can be disabled via configuration
+- **GIVEN** gap_filling_enabled is set to false in configuration
+- **WHEN** OCR track processing runs
+- **THEN** the system SHALL skip all gap filling logic
+- **AND** output only PP-StructureV3 results as before
+
+#### Scenario: Coverage threshold is configurable
+- **GIVEN** gap_filling_coverage_threshold is set to 0.8
+- **WHEN** PP-StructureV3 coverage is 75%
+- **THEN** the system SHALL activate gap filling
+- **AND** supplement uncovered regions
+
+#### Scenario: IoU thresholds are configurable
+- **GIVEN** custom IoU thresholds configured:
+  - gap_filling_iou_threshold: 0.2
+  - gap_filling_dedup_iou_threshold: 0.6
+- **WHEN** evaluating coverage and deduplication
+- **THEN** the system SHALL use the configured values
+- **AND** apply them consistently throughout the gap filling process
+
+#### Scenario: Confidence threshold is configurable
+- **GIVEN** gap_filling_confidence_threshold is set to 0.5
+- **WHEN** supplementing Raw OCR regions
+- **THEN** the system SHALL only include regions with confidence >= 0.5
+- **AND** filter out lower confidence regions
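Three of the scenarios above (coordinate alignment, the confidence threshold, and reading order) reduce to small pure functions. The sketch below is illustrative only. The `Region` dataclass is hypothetical, and treating `ocr_dimensions` as a `(width, height)` pair is an assumption, since the spec does not define its shape.

```python
from dataclasses import dataclass
from typing import List, Tuple

BBox = Tuple[float, float, float, float]  # assumed (x0, y0, x1, y1) layout


@dataclass
class Region:
    """Hypothetical stand-in for a raw OCR text region."""
    text: str
    bbox: BBox
    confidence: float


def rescale_bbox(bbox: BBox, from_size: Tuple[float, float],
                 to_size: Tuple[float, float]) -> BBox:
    """Map a bbox from one (width, height) space into another."""
    sx = to_size[0] / from_size[0]
    sy = to_size[1] / from_size[1]
    x0, y0, x1, y1 = bbox
    return (x0 * sx, y0 * sy, x1 * sx, y1 * sy)


def filter_by_confidence(regions: List[Region],
                         threshold: float = 0.3) -> List[Region]:
    """Keep regions whose confidence is at or above the threshold."""
    return [r for r in regions if r.confidence >= threshold]


def sort_reading_order(regions: List[Region]) -> List[Region]:
    """Top to bottom by y0, then left to right by x0."""
    return sorted(regions, key=lambda r: (r.bbox[1], r.bbox[0]))
```

A plain `(y0, x0)` sort follows the spec literally. Boxes on the same visual line whose `y0` values differ by a pixel or two would be ordered by that small difference, so a real implementation may want to bucket rows first.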
--- a/openspec/changes/archive/2025-11-27-add-ocr-track-gap-filling/tasks.md
+++ b/openspec/changes/archive/2025-11-27-add-ocr-track-gap-filling/tasks.md
@@ -0,0 +1,44 @@
+# Tasks: Add OCR Track Gap Filling
+
+## 1. Core Implementation
+
+- [x] 1.1 Create `gap_filling_service.py` with `GapFillingService` class
+- [x] 1.2 Implement bbox coverage calculation (center-point and IoU methods)
+- [x] 1.3 Implement gap detection logic (find uncovered raw OCR regions)
+- [x] 1.4 Implement confidence threshold filtering for supplemented regions
+- [x] 1.5 Implement element type filtering (only supplement TEXT, skip TABLE/IMAGE/FIGURE/etc.)
+- [x] 1.6 Implement reading order recalculation (sort by y0, x0)
+- [x] 1.7 Implement deduplication logic (skip high IoU overlaps with PP-Structure TEXT)
+- [x] 1.8 Implement optional text merging for fragmented adjacent regions
+
+## 2. Integration
+
+- [x] 2.1 Modify `OCRToUnifiedConverter` to accept raw OCR text_regions
+- [x] 2.2 Add gap filling activation condition check (coverage < 70% or element count disparity)
+- [x] 2.3 Ensure coordinate alignment between raw OCR and PP-Structure (ocr_dimensions handling)
+- [x] 2.4 Add page metadata (page_number, confidence, bbox) to supplemented elements
+- [x] 2.5 Ensure track isolation (only OCR track, not Direct/Hybrid)
+
+## 3. Configuration
+
+- [x] 3.1 Add configurable parameters to settings:
+  - `gap_filling_enabled`: bool (default: True)
+  - `gap_filling_coverage_threshold`: float (default: 0.7)
+  - `gap_filling_iou_threshold`: float (default: 0.15)
+  - `gap_filling_confidence_threshold`: float (default: 0.3)
+  - `gap_filling_dedup_iou_threshold`: float (default: 0.5)
+
+## 4. Testing (with env)
+
+- [x] 4.1 Create test fixtures for the severe PP-Structure miss-detection case (with scan.pdf / scan2.pdf)
+- [x] 4.2 Test gap detection correctly identifies uncovered regions
+- [x] 4.3 Test supplemented elements have correct metadata
+- [x] 4.4 Test reading order is correctly recalculated
+- [x] 4.5 Test deduplication prevents duplicate text
+- [x] 4.6 Test normal document without miss-detection has no duplicate/inflation
+- [x] 4.7 Test track isolation (Direct track unaffected)
+
+## 5. Documentation
+
+- [x] 5.1 Add inline documentation to GapFillingService
+- [x] 5.2 Update configuration documentation with new settings
--- a/openspec/changes/archive/2025-11-27-fix-ocr-track-table-rendering/proposal.md
+++ b/openspec/changes/archive/2025-11-27-fix-ocr-track-table-rendering/proposal.md
@@ -0,0 +1,108 @@
+# Fix OCR Track Table Rendering
+
+## Summary
+
+OCR track PDF generation produces tables with incorrect format and layout. Tables appear without proper structure - cell content is misaligned and the visual format differs significantly from the original document. Image placement is correct, but table rendering is broken.
+
+## Problem Statement
+
+When generating PDF from OCR track results (via `scan.pdf` processed by PP-StructureV3), the output tables have:
+1. **Wrong cell alignment** - content not positioned in proper cells
+2. **Missing table structure** - rows/columns don't match original document layout
+3. **Incorrect content distribution** - all content seems to flow linearly instead of maintaining grid structure
+
+Reference: `backend/storage/results/af7c9ee8-60a0-4291-9f22-ef98d27eed52/`
+- Original: `af7c9ee8-60a0-4291-9f22-ef98d27eed52_scan_page_1.png`
+- Generated: `scan_layout.pdf`
+- Result JSON: `scan_result.json` - Tables have correct `{rows, cols, cells}` structure
+
+## Root Cause Analysis
+
+### Issue 1: Table Content Not Converted to TableData Object
+
+In `_json_to_document_element` (pdf_generator_service.py:1952):
+```python
+element = DocumentElement(
+    ...
+    content=elem_dict.get('content', ''),  # Raw dict, not TableData
+    ...
+)
+```
+
+Table elements have `content` as a dict `{rows: 5, cols: 4, cells: [...]}` but it's not converted to a `TableData` object.
+
+### Issue 2: OCR Track HTML Conversion Fails
+
+In `convert_unified_document_to_ocr_data` (pdf_generator_service.py:464-467):
+```python
+elif isinstance(element.content, dict):
+    html_content = element.content.get('html', str(element.content))
+```
+
+Since there's no 'html' key in the cells-based dict, it falls back to `str(element.content)` = `"{'rows': 5, 'cols': 4, ...}"` - invalid HTML.
+
+### Issue 3: Different Table Rendering Paths
+
+- **Direct track** uses `_draw_table_element_direct` which properly handles dict with cells via `_build_rows_from_cells_dict`
+- **OCR track** uses `draw_table_region` which expects HTML strings and fails with dict content
+
+## Proposed Solution
+
+### Option A: Convert dict to TableData during JSON loading (Recommended)
+
+In `_json_to_document_element`, when element type is TABLE and content is a dict with cells, convert it to a `TableData` object:
+
+```python
+# For TABLE elements, convert dict to TableData
+if elem_type == ElementType.TABLE and isinstance(content, dict) and 'cells' in content:
+    content = self._dict_to_table_data(content)
+```
+
+This ensures `element.content.to_html()` works correctly in `convert_unified_document_to_ocr_data`.
+
+### Option B: Fix conversion in convert_unified_document_to_ocr_data
+
+Handle dict with cells properly by converting to HTML:
+
+```python
+elif isinstance(element.content, dict):
+    if 'cells' in element.content:
+        # Convert cells-based dict to HTML
+        html_content = self._cells_dict_to_html(element.content)
+    elif 'html' in element.content:
+        html_content = element.content['html']
+    else:
+        html_content = str(element.content)
+```
+
+## Impact on Hybrid Mode
+
+Hybrid mode uses Direct track rendering (`_generate_direct_track_pdf`) which already handles dict content properly via `_build_rows_from_cells_dict`. The proposed fixes should not affect hybrid mode negatively.
+
+However, testing should verify:
+1. Hybrid mode continues to work with combined Direct + OCR elements
+2. Table rendering quality is consistent across all tracks
+
+## Success Criteria
+
+1. OCR track tables render with correct structure matching original document
+2. Cell content positioned in proper grid locations
+3. Table borders/grid lines visible
+4. No regression in Direct track or Hybrid mode table rendering
+5. All test files (scan.pdf, img1.png, img2.png, img3.png) produce correct output
+
+## Files to Modify
+
+1. `backend/app/services/pdf_generator_service.py`
+   - `_json_to_document_element`: Convert table dict to TableData
+   - `convert_unified_document_to_ocr_data`: Improve dict handling (if Option B)
+
+2. `backend/app/models/unified_document.py` (optional)
+   - Add `TableData.from_dict()` class method for cleaner conversion
+
+## Testing Plan
+
+1. Test scan.pdf with OCR track - verify table structure matches original
+2. Test img1.png, img2.png, img3.png with OCR track
+3. Test PDF files with Direct track - verify no regression
+4. Test Hybrid mode with files that trigger OCR fallback
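Option B refers to a `_cells_dict_to_html` helper without showing it. A minimal sketch of such a conversion follows, written as a standalone function. The cell keys (`row`, `col`, `content`, `row_span`, `col_span`) are assumptions, because the proposal only shows the outer `{rows, cols, cells}` shape.

```python
from html import escape
from typing import Any, Dict, List


def cells_dict_to_html(table: Dict[str, Any]) -> str:
    """Render a cells-based table dict as a simple HTML table."""
    # Group cells by their row index.
    rows: Dict[int, List[Dict[str, Any]]] = {}
    for cell in table.get("cells", []):
        rows.setdefault(cell["row"], []).append(cell)

    parts = ["<table>"]
    for r in sorted(rows):
        parts.append("<tr>")
        # Emit cells left to right regardless of their order in the list.
        for cell in sorted(rows[r], key=lambda c: c["col"]):
            attrs = ""
            if cell.get("row_span", 1) > 1:
                attrs += f' rowspan="{cell["row_span"]}"'
            if cell.get("col_span", 1) > 1:
                attrs += f' colspan="{cell["col_span"]}"'
            # Escape content so stray '<' or '&' cannot break the markup.
            text = escape(str(cell.get("content", "")))
            parts.append(f"<td{attrs}>{text}</td>")
        parts.append("</tr>")
    parts.append("</table>")
    return "".join(parts)
```

Compared with the `str(element.content)` fallback described in Issue 2, this produces markup that an HTML table parser can consume. Rows that contain no anchored cell, for example a row fully covered by a rowspan from above, are skipped in this sketch.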
--- a/openspec/changes/archive/2025-11-27-fix-ocr-track-table-rendering/specs/pdf-generation/spec.md
+++ b/openspec/changes/archive/2025-11-27-fix-ocr-track-table-rendering/specs/pdf-generation/spec.md
@@ -0,0 +1,52 @@
+# PDF Generation - OCR Track Table Rendering Fix
+
+## MODIFIED Requirements
+
+### Requirement: OCR Track Table Content Conversion
+
+The PDF generator MUST properly convert table content from JSON dict format to renderable structure when processing OCR track results.
+
+#### Scenario: Table dict with cells array converts to proper HTML
+
+Given an OCR track JSON with table element containing rows, cols, and cells array
+When the PDF generator processes this element
+Then the table content MUST be converted to a TableData object
+And TableData.to_html() MUST produce valid HTML with proper tr/td structure
+And the generated PDF table MUST have cells positioned in correct grid locations
+
+#### Scenario: Table with rowspan/colspan renders correctly
+
+Given a table element with cells having rowspan > 1 or colspan > 1
+When the PDF generator renders the table
+Then merged cells MUST span the correct number of rows/columns
+And content MUST appear in the merged cell position
+
+### Requirement: Table Visual Fidelity
+
+The PDF generator MUST render OCR track tables with visual structure matching the original document.
+
+#### Scenario: Table renders with grid lines
+
+Given an OCR track table element
+When rendered to PDF
+Then the table MUST have visible grid lines/borders
+And cell boundaries MUST be clearly defined
+
+#### Scenario: Table text alignment preserved
+
+Given an OCR track table with cell content
+When rendered to PDF
+Then text MUST be positioned within the correct cell boundaries
+And text MUST NOT overflow into adjacent cells
+
+### Requirement: Backward Compatibility with Hybrid Mode
+
+The table rendering fix MUST NOT break hybrid mode processing.
+
+#### Scenario: Hybrid mode tables render correctly
+
+Given a document processed with hybrid mode combining Direct and OCR tracks
+When PDF is generated
+Then Direct track tables MUST render with existing quality
+And OCR track tables MUST render with improved quality
+And there MUST be no regression in table positioning or content
--- a/openspec/changes/archive/2025-11-27-fix-ocr-track-table-rendering/tasks.md
+++ b/openspec/changes/archive/2025-11-27-fix-ocr-track-table-rendering/tasks.md
@@ -0,0 +1,55 @@
+# Implementation Tasks
+
+## Phase 1: Core Fix - Table Content Conversion
+
+### 1.1 Add TableData.from_dict() class method
+- [ ] In `unified_document.py`, add `from_dict()` method to `TableData` class
+- [ ] Handle conversion of cells list (list of dicts) to `TableCell` objects
+- [ ] Preserve rows, cols, headers, caption fields
+
+### 1.2 Fix _json_to_document_element for TABLE elements
+- [ ] In `pdf_generator_service.py`, modify `_json_to_document_element`
+- [ ] When `elem_type == ElementType.TABLE` and content is dict with 'cells', convert to `TableData`
+- [ ] Use `TableData.from_dict()` for clean conversion
+
+### 1.3 Verify TableData.to_html() generates correct HTML
+- [ ] Test that `to_html()` produces parseable HTML with proper row/cell structure
+- [ ] Verify colspan/rowspan attributes are correctly generated
+- [ ] Ensure empty cells are properly handled
+
+## Phase 2: OCR Track Rendering Consistency
+
+### 2.1 Review convert_unified_document_to_ocr_data
+- [ ] Verify TableData objects are properly converted to HTML
+- [ ] Add fallback handling for dict content with 'cells' key
+- [ ] Log warning if content cannot be converted to HTML
+
+### 2.2 Review draw_table_region
+- [ ] Verify HTMLTableParser correctly parses generated HTML
+- [ ] Check that ReportLab Table is positioned at correct bbox
+- [ ] Verify font and style application
+
+## Phase 3: Testing and Verification
+
+### 3.1 Test OCR Track
+- [ ] Test scan.pdf - verify tables have correct structure
+- [ ] Test img1.png, img2.png, img3.png
+- [ ] Compare generated PDF with original documents
+
+### 3.2 Test Direct Track (Regression)
+- [ ] Test PDF files with Direct track
+- [ ] Verify table rendering unchanged
+
+### 3.3 Test Hybrid Mode
+- [ ] Test files that trigger hybrid processing
+- [ ] Verify mixed Direct + OCR elements render correctly
+
+## Phase 4: Code Quality
+
+### 4.1 Add logging
+- [ ] Add debug logging for table content type detection
+- [ ] Log conversion steps for troubleshooting
+
+### 4.2 Error handling
+- [ ] Handle malformed cell data gracefully
+- [ ] Log warnings for unexpected content formats
--- a/openspec/changes/archive/2025-11-27-simplify-ppstructure-model-selection/proposal.md
+++ b/openspec/changes/archive/2025-11-27-simplify-ppstructure-model-selection/proposal.md
@@ -0,0 +1,40 @@
+# Change: Simplify PP-StructureV3 Configuration with Layout Model Selection
+
+## Why
+
+Current PP-StructureV3 parameter adjustment UI exposes 7 technical ML parameters (thresholds, ratios, merge modes) that are difficult for end users to understand. Meanwhile, switching to a different layout detection model (e.g., CDLA-trained models for Chinese documents) would have a much greater impact on OCR quality than fine-tuning these parameters.
+
+**Problems with current approach:**
+- Users don't understand what `layout_detection_threshold` or `text_det_unclip_ratio` mean
+- Wrong parameter values can make OCR results worse
+- The default model (PubLayNet-based) is optimized for English academic papers, not Chinese business documents
+- Model selection is far more impactful than parameter tuning
+
+## What Changes
+
+### Backend Changes
+- **REMOVED**: API parameter `pp_structure_params` from task start endpoint
+- **ADDED**: New API parameter `layout_model` with predefined options:
+  - `"default"` - Standard model (PubLayNet-based, for English documents)
+  - `"chinese"` - PP-DocLayout-S model (for Chinese documents, forms, contracts)
+  - `"cdla"` - CDLA model (alternative Chinese document layout model)
+- **MODIFIED**: PP-StructureV3 initialization uses `layout_detection_model_name` based on selection
+- Keep fine-tuning parameters in backend `config.py` with optimized defaults
+
+### Frontend Changes
+- **REMOVED**: `PPStructureParams.tsx` component (slider/dropdown UI for 7 parameters)
+- **ADDED**: Simple radio button/dropdown for layout model selection with clear descriptions
+- **MODIFIED**: Task start request body to send `layout_model` instead of `pp_structure_params`
+
+### API Changes
+- **BREAKING**: Remove `pp_structure_params` from `POST /api/v2/tasks/{task_id}/start`
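+- **ADDED**: New optional parameter `layout_model: "default" | "chinese" | "cdla"`; when omitted, the backend uses `"chinese"` (note that the `"default"` option selects the PubLayNet-based model and is not the fallback value)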
+
+## Impact
+
+- Affected specs: `ocr-processing`
+- Affected code:
+  - Backend: `app/routers/tasks.py`, `app/services/ocr_service.py`, `app/core/config.py`
+  - Frontend: `src/components/PPStructureParams.tsx` (remove), `src/types/apiV2.ts`, task start form
+- Breaking change: Clients using `pp_structure_params` will need to migrate to `layout_model`
+- User impact: Simpler UI, better default OCR quality for Chinese documents
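Review note: the mapping from `layout_model` to `layout_detection_model_name` described in the proposal could look roughly like the sketch below. The model names come from the proposal and spec text; the dictionary name, the helper `build_layout_kwargs`, and the use of `None` as the sentinel for the PubLayNet-based option are assumptions made for illustration, not the code in `ocr_service.py`.

```python
from typing import Dict, Optional

# Hypothetical mapping; the keys are the three API options from the proposal.
# None is an assumed sentinel meaning "do not override the engine's built-in model".
LAYOUT_MODEL_MAPPING: Dict[str, Optional[str]] = {
    "chinese": "PP-DocLayout-S",
    "default": None,
    "cdla": "picodet_lcnet_x1_0_fgd_layout_cdla",
}


def build_layout_kwargs(layout_model: str) -> Dict[str, str]:
    """Return the keyword arguments to pass when initializing the layout engine.

    An empty dict means the engine keeps its own default layout model.
    Raises KeyError for an option that is not in the mapping.
    """
    model_name = LAYOUT_MODEL_MAPPING[layout_model]
    if model_name is None:
        return {}
    return {"layout_detection_model_name": model_name}
```

Keeping the sentinel separate from a real model name means the `"default"` option never passes a model name to the engine, which is one way to read the "sentinel value for PubLayNet" fix mentioned in the commit message.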
--- a/openspec/changes/archive/2025-11-27-simplify-ppstructure-model-selection/specs/ocr-processing/spec.md
+++ b/openspec/changes/archive/2025-11-27-simplify-ppstructure-model-selection/specs/ocr-processing/spec.md
@@ -0,0 +1,86 @@
+# ocr-processing Specification Delta
+
+## REMOVED Requirements
+
+### Requirement: Frontend-Adjustable PP-StructureV3 Parameters
+**Reason**: Complex ML parameters are difficult for end users to understand and tune. Model selection provides better UX and more significant quality improvements.
+**Migration**: Replace `pp_structure_params` API parameter with `layout_model` parameter.
+
+### Requirement: PP-StructureV3 Parameter UI Controls
+**Reason**: Slider/dropdown UI for 7 technical parameters adds complexity without proportional benefit. Simple model selection is more user-friendly.
+**Migration**: Remove `PPStructureParams.tsx` component, add `LayoutModelSelector.tsx` component.
+
+## ADDED Requirements
+
+### Requirement: Layout Model Selection
+The system SHALL allow users to select a layout detection model optimized for their document type, providing a simple choice between pre-configured models instead of manual parameter tuning.
+
+#### Scenario: User selects Chinese document model
+- **GIVEN** a user is processing Chinese business documents (forms, contracts, invoices)
+- **WHEN** the user selects "Chinese Document Model" (PP-DocLayout-S)
+- **THEN** the OCR engine SHALL use the PP-DocLayout-S layout detection model
+- **AND** the model SHALL be optimized for 23 Chinese document element types
+- **AND** table and form detection accuracy SHALL be improved over the default model
+
+#### Scenario: User selects standard model for English documents
+- **GIVEN** a user is processing English academic papers or reports
+- **WHEN** the user selects "Standard Model" (PubLayNet-based)
+- **THEN** the OCR engine SHALL use the default PubLayNet-based layout detection model
+- **AND** the model SHALL be optimized for English document layouts
+
+#### Scenario: User selects CDLA model for specialized Chinese layout
+- **GIVEN** a user is processing Chinese documents with complex layouts
+- **WHEN** the user selects "CDLA Model"
+- **THEN** the OCR engine SHALL use the picodet_lcnet_x1_0_fgd_layout_cdla model
+- **AND** the model SHALL provide specialized Chinese document layout analysis
+
+#### Scenario: Layout model is sent via API request
+- **GIVEN** a frontend application with model selection UI
+- **WHEN** the user starts task processing with a selected model
+- **THEN** the frontend SHALL send the model choice in the request body:
+  ```json
+  POST /api/v2/tasks/{task_id}/start
+  {
+    "use_dual_track": true,
+    "force_track": "ocr",
+    "language": "ch",
+    "layout_model": "chinese"
+  }
+  ```
+- **AND** the backend SHALL configure PP-StructureV3 with the corresponding model
+
+#### Scenario: Default model when not specified
+- **GIVEN** an API request without `layout_model` parameter
+- **WHEN** the task is started
+- **THEN** the system SHALL use "chinese" (PP-DocLayout-S) as the default model
+- **AND** processing SHALL work correctly without requiring model selection
+
+#### Scenario: Invalid model name is rejected
+- **GIVEN** a request with an invalid `layout_model` value
+- **WHEN** the user sends `layout_model: "invalid_model"`
+- **THEN** the API SHALL return an HTTP 422 validation error
+- **AND** the response SHALL include a clear error message listing the valid model options
+
+### Requirement: Layout Model Selection UI
+The frontend SHALL provide a simple, user-friendly interface for selecting layout detection models with clear descriptions of each option.
+
+#### Scenario: Model options are displayed with descriptions
+- **GIVEN** the model selection UI is displayed
+- **WHEN** the user views the available options
+- **THEN** the UI SHALL show the following options:
+  - "Chinese Document Model (Recommended)" - for Chinese forms, contracts, invoices
+  - "Standard Model" - for English academic papers, reports
+  - "CDLA Model" - for specialized Chinese layout analysis
+- **AND** each option SHALL have a brief description of its use case
+
+#### Scenario: Chinese model is selected by default
+- **GIVEN** the user opens the task processing interface
+- **WHEN** the model selection is displayed
+- **THEN** "Chinese Document Model" SHALL be pre-selected as the default
+- **AND** the user MAY change the selection before starting processing
+
+#### Scenario: Model selection is visible only for OCR track
+- **GIVEN** a document processing interface
+- **WHEN** the user selects a processing track
+- **THEN** layout model selection SHALL be shown ONLY when the OCR track is selected or auto-detected
+- **AND** it SHALL be hidden for the Direct track (which does not use PP-StructureV3)
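Review note: the "default when not specified" and "invalid model name is rejected" scenarios can be sketched with a plain enum, as below. The names `LayoutModel` and `parse_layout_model` are hypothetical, and the sketch uses only the standard library. In the real backend the schema layer would presumably perform this check and turn the failure into the 422 response, which is not shown here.

```python
from enum import Enum
from typing import Any, Mapping


class LayoutModel(str, Enum):
    """The three options accepted by the `layout_model` request parameter."""

    DEFAULT = "default"
    CHINESE = "chinese"
    CDLA = "cdla"


def parse_layout_model(body: Mapping[str, Any]) -> LayoutModel:
    """Read `layout_model` from a request body, applying the spec's default.

    A missing value falls back to "chinese"; an unknown value raises
    ValueError with a message that lists the valid options.
    """
    raw = body.get("layout_model")
    if raw is None:
        return LayoutModel.CHINESE
    try:
        return LayoutModel(raw)
    except ValueError:
        valid = ", ".join(member.value for member in LayoutModel)
        raise ValueError(
            f"Invalid layout_model {raw!r}; valid options: {valid}"
        ) from None
```

For example, `parse_layout_model({"use_dual_track": True, "force_track": "ocr"})` falls back to `LayoutModel.CHINESE`, while a body containing `"layout_model": "invalid_model"` raises an error naming `default`, `chinese`, and `cdla`.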
--- a/openspec/changes/archive/2025-11-27-simplify-ppstructure-model-selection/tasks.md
+++ b/openspec/changes/archive/2025-11-27-simplify-ppstructure-model-selection/tasks.md
@@ -0,0 +1,56 @@
+# Implementation Tasks
+
+## 1. Backend API Changes
+
+- [x] 1.1 Update `app/schemas/task.py` to add `layout_model` enum type
+- [x] 1.2 Update `app/routers/tasks.py` to replace `pp_structure_params` with `layout_model` parameter
+- [x] 1.3 Update `app/services/ocr_service.py` to map `layout_model` to `layout_detection_model_name`
+- [x] 1.4 Remove custom PP-Structure engine creation logic (use model selection instead)
+- [x] 1.5 Add backward compatibility: default to "chinese" if no model specified
+
+## 2. Backend Configuration
+
+- [x] 2.1 Keep `layout_detection_model_name` in `config.py` as fallback default
+- [x] 2.2 Keep fine-tuning parameters in `config.py` (not exposed to API)
+- [x] 2.3 Document available layout models in config comments
+
+## 3. Frontend Changes
+
+- [x] 3.1 Remove `PPStructureParams.tsx` component
+- [x] 3.2 Update `src/types/apiV2.ts`:
+  - Remove `PPStructureV3Params` interface
+  - Add `LayoutModel` type: `"default" | "chinese" | "cdla"`
+  - Update `ProcessingOptions` to use `layout_model` instead of `pp_structure_params`
+- [x] 3.3 Create `LayoutModelSelector.tsx` component with:
+  - Radio buttons or dropdown for model selection
+  - Clear descriptions for each model option
+  - Default selection: "chinese"
+- [x] 3.4 Update task start form to use new `LayoutModelSelector`
+- [x] 3.5 Update API calls to send `layout_model` instead of `pp_structure_params`
+
+## 4. Internationalization
+
+- [x] 4.1 Add i18n strings for layout model options:
+  - `layoutModel.default`: "Standard Model (English documents)"
+  - `layoutModel.chinese`: "Chinese Document Model (Recommended)"
+  - `layoutModel.cdla`: "CDLA Model (Chinese layout analysis)"
+- [x] 4.2 Add i18n strings for model descriptions
+
+## 5. Testing
+
+- [x] 5.1 Create new tests for `layout_model` parameter (`test_layout_model_api.py`, `test_layout_model.py`)
+- [x] 5.2 Archive tests for `pp_structure_params` validation (moved to `tests/archived/`)
+- [x] 5.3 Add tests for layout model selection (19 tests passing)
+- [x] 5.4 Test backward compatibility (no model specified → use chinese default)
+
+## 6. Documentation
+
+- [ ] 6.1 Update API documentation for task start endpoint
+- [ ] 6.2 Remove PP-Structure parameter documentation
+- [ ] 6.3 Add layout model selection documentation
+
+## 7. Cleanup
+
+- [x] 7.1 Remove localStorage keys for PP-Structure params (`pp_structure_params_presets`, `pp_structure_params_last_used`)
+- [x] 7.2 Remove any unused imports/types related to PP-Structure params
+- [x] 7.3 Archive old PP-Structure params test files