feat: simplify layout model selection and archive proposals

Changes:
- Replace PP-Structure 7-slider parameter UI with simple 3-option layout model selector
- Add layout model mapping: chinese (PP-DocLayout-S), default (PubLayNet), cdla
- Add LayoutModelSelector component and zh-TW translations
- Fix "default" model behavior with sentinel value for PubLayNet
- Add gap filling service for OCR track coverage improvement
- Add PP-Structure debug utilities
- Archive completed/incomplete proposals:
  - add-ocr-track-gap-filling (complete)
  - fix-ocr-track-table-rendering (incomplete)
  - simplify-ppstructure-model-selection (22/25 tasks)
- Add new layout model tests, archive old PP-Structure param tests
- Update OpenSpec ocr-processing spec with layout model requirements

🤖 Generated with [Claude Code](https://claude.com/claude-code)

Co-Authored-By: Claude <noreply@anthropic.com>
egg
2025-11-27 13:27:00 +08:00
parent c65df754cf
commit 59206a6ab8
35 changed files with 3621 additions and 658 deletions


@@ -0,0 +1,28 @@
# Change: Fix OCR Track Table Empty Columns and Alignment
## Why
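Tables generated by PP-Structure often contain empty columns (columns whose cells are empty or whitespace-only in every row), which leaves the converted UnifiedDocument tables with blank columns and misaligned fields. The OCR Track currently uses the raw data without any cleanup, which degrades PDF/JSON/Markdown output quality.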
## What Changes
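- Add a `trim_empty_columns()` function that removes empty columns from OCR Track tables
- Call the cleanup logic at the entry of `_convert_table_data` so that the resulting TableData is clean
- Recalculate col_span: if a span crosses a removed column, shrink the span
- Update the columns/cols value and adjust each cell's col index
- Optional: sort columns for alignment by bbox x0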
## Impact
- Affected specs: `ocr-processing`
- Affected code:
- `backend/app/services/ocr_to_unified_converter.py` (primary change)
- Direct/HYBRID paths are not affected
- PDF/JSON/Markdown output will be cleaner
## Constraints
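- Keep the table bbox and page coordinates unchanged
- Do not modify the Direct/HYBRID paths
- Remove only columns that are empty in every row; a column with an empty header but non-empty data must not be removed
- Preserve the original bbox to avoid layout drift in the PDF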


@@ -0,0 +1,61 @@
## ADDED Requirements
### Requirement: OCR Table Empty Column Cleanup
The OCR Track converter SHALL clean up PP-Structure generated tables by removing columns where all rows have empty or whitespace-only content.
The system SHALL:
1. Identify columns where every cell's content is empty or contains only whitespace (using `.strip()` to determine emptiness)
2. Remove identified empty columns from the table structure
3. Update the `columns`/`cols` value to reflect the new column count
4. Recalculate each cell's `col` index to maintain continuity
5. Adjust `col_span` values when spans cross removed columns (shrink span size)
6. Remove cells entirely when their complete span falls within removed columns
7. Preserve original bbox and page coordinates (no layout drift)
8. If `columns` is 0 or missing after cleanup, fill with the calculated column count
The cleanup SHALL NOT:
- Remove columns where the header is empty but data rows contain values
- Modify tables in Direct or HYBRID track
- Alter the original bbox coordinates
#### Scenario: All rows in column are empty
- **WHEN** a table has a column where all cells contain only empty or whitespace content
- **THEN** that column is removed
- **AND** remaining cells have their `col` indices decremented appropriately
- **AND** `cols` count is reduced by 1
#### Scenario: Column has empty header but data has values
- **WHEN** a table has a column where the header cell is empty
- **AND** at least one data row cell in that column contains non-whitespace content
- **THEN** that column is NOT removed
#### Scenario: Cell span crosses removed column
- **WHEN** a cell has `col_span > 1`
- **AND** one or more columns within the span are removed
- **THEN** the `col_span` is reduced by the number of removed columns within the span
#### Scenario: Cell span entirely within removed columns
- **WHEN** a cell's entire span falls within columns that are all removed
- **THEN** that cell is removed from the table
#### Scenario: Missing columns metadata
- **WHEN** the table dict has `columns` set to 0 or missing
- **AND** cleanup is performed
- **THEN** `columns` is set to the calculated number of remaining columns
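
The cleanup rules above can be sketched roughly as follows. This is a minimal illustration, not the converter's actual code: the table shape (a dict with a `cols` count and a `cells` list whose items carry `row`, `col`, `col_span`, and `content`) is an assumption, as is the choice to treat a column as non-empty only when a cell that starts in it has non-whitespace content.

```python
from typing import Any, Dict, List


def _span(cell: Dict[str, Any]) -> range:
    """Column indices covered by a cell (col_span defaults to 1)."""
    start = cell.get("col", 0)
    return range(start, start + max(1, cell.get("col_span") or 1))


def trim_empty_columns(table: Dict[str, Any]) -> Dict[str, Any]:
    """Return a copy of `table` without columns that are empty in every row."""
    cells: List[Dict[str, Any]] = table.get("cells", [])
    if not cells:
        return dict(table)

    total = max(table.get("cols") or 0, max(_span(c).stop for c in cells))
    # Assumption: a column counts as non-empty when a cell *starting* in it
    # has non-whitespace content.
    non_empty = {c.get("col", 0) for c in cells
                 if str(c.get("content") or "").strip()}
    kept = [col for col in range(total) if col in non_empty]
    new_index = {old: new for new, old in enumerate(kept)}

    trimmed = []
    for cell in cells:
        surviving = [col for col in _span(cell) if col in new_index]
        if not surviving:
            continue  # entire span fell inside removed columns
        updated = dict(cell)  # shallow copy; bbox is carried over unchanged
        updated["col"] = new_index[surviving[0]]
        updated["col_span"] = len(surviving)
        trimmed.append(updated)

    result = dict(table)
    result["cells"] = trimmed
    result["cols"] = len(kept)  # also backfills a missing or zero count
    return result
```

Because the function works on copies, the input dict and each cell's bbox are left as they were, which matches the "no layout drift" constraint.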
### Requirement: OCR Table Column Alignment by Bbox
(Optional Enhancement) When bbox coordinates are available for table cells, the OCR Track converter SHALL use cell bbox x0 coordinates to improve column alignment accuracy.
The system SHALL:
1. Sort cells by bbox `x0` coordinate before assigning column indices
2. Reassign `col` indices based on spatial position rather than HTML order
This requirement is optional and implementation MAY be deferred if bbox data is not reliably available.
#### Scenario: Cells reordered by bbox position
- **WHEN** bbox coordinates are available for table cells
- **AND** the original HTML order does not match spatial order
- **THEN** cells are reordered by `x0` coordinate
- **AND** `col` indices are reassigned to reflect spatial positioning
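
As a rough illustration of the optional alignment step, the sketch below reorders the cells of each row by bbox `x0` and reassigns `col`. It assumes cells are dicts with a `row` index and a `bbox` of the form `(x0, y0, x1, y1)`, and it deliberately ignores `col_span` and missing cells, so it is a starting point rather than a full solution.

```python
from collections import defaultdict
from typing import Any, Dict, List


def reassign_cols_by_x0(cells: List[Dict[str, Any]]) -> List[Dict[str, Any]]:
    """Within each row, order cells left to right by bbox x0 and renumber col."""
    rows: Dict[int, List[Dict[str, Any]]] = defaultdict(list)
    for cell in cells:
        rows[cell["row"]].append(cell)

    reordered: List[Dict[str, Any]] = []
    for row in sorted(rows):
        by_x0 = sorted(rows[row], key=lambda c: c["bbox"][0])
        for col, cell in enumerate(by_x0):
            reordered.append({**cell, "col": col})  # copy; bbox unchanged
    return reordered
```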


@@ -0,0 +1,43 @@
# Tasks: Fix OCR Track Table Empty Columns
## 1. Core Implementation
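- [x] 1.1 Implement `trim_empty_columns(table_dict: Dict[str, Any]) -> Dict[str, Any]` in `ocr_to_unified_converter.py`
- From the cells array, determine for each column whether its content is empty or whitespace-only in every row
- Use `.strip()` to detect whitespace-only content
- [x] 1.2 Implement column removal logic
- Update the columns/cols value
- Adjust each cell's col index
- [x] 1.3 Implement col_span recalculation
- If a span crosses a removed column, shrink the span
- If the whole span falls within removed columns, remove the cell
- [x] 1.4 Call `trim_empty_columns` at the entry of `_convert_table_data`
- Run the cleanup before building TableData
- Also add the cleanup to `_extract_table_data` (HTML table parsing)
- [ ] 1.5 (Optional) Sort columns for alignment by bbox x0/x1
- If a bbox grid is available, sort by x0 first and then reassign col indices
- Deferred until the availability of bbox data is confirmed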
## 2. Testing & Validation
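- [x] 2.1 Unit tests pass
- Test basic empty column removal
- Test empty header with non-empty data (column is not removed)
- Test col_span crossing a removed column (span shrinks)
- Test a cell that falls entirely within removed columns (cell is removed)
- Test a table with no empty columns (no change)
- [x] 2.2 Check existing OCR results
- No tables with an entirely empty column were found in the existing results
- The implementation is in place and cleans up empty columns when they occur
- [x] 2.3 Confirm Direct/HYBRID tables are unchanged
- `OCRToUnifiedConverter` is used only in `ocr_service.py`
- The Direct track uses `DirectExtractionEngine` and is not affected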
## 3. Edge Cases & Validation
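- [x] 3.1 Handle the case where the columns field is 0 or missing
- Backfill with the calculated column count so downstream consumers do not break
- [x] 3.2 Handle an empty header with non-empty data
- Remove only columns that are empty in every row
- [x] 3.3 Ensure `backend/storage/results/...` is not modified directly
- Change the converter, then re-run the tasks to verify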


@@ -0,0 +1,183 @@
# Design: OCR Track Gap Filling
## Context
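The PP-StructureV3 layout analysis model misses a large share of content on some scanned documents. In a measured case, Raw PaddleOCR detected 56 text regions while PP-StructureV3 output only 9 elements (about 84% lost).
The problem lies in the Layout Detection Model inside PP-StructureV3. It is a limitation of the PaddleOCR library and cannot be fixed from outside. The Raw OCR `text_regions` data, however, remains complete and usable.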
### Stakeholders
- **End users**: need complete OCR output without large amounts of missing text
- **OCR track**: needs to integrate Raw OCR and PP-StructureV3 results
- **Direct/Hybrid track**: should not be affected by this change
## Goals / Non-Goals
### Goals
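- Detect regions missed by PP-StructureV3 and fill them in from Raw OCR results
- Ensure the filled-in text does not duplicate existing elements
- Maintain correct reading order
- Affect only the OCR track; leave the behavior of other tracks unchanged
### Non-Goals
- Do not modify PP-StructureV3 or PaddleOCR internals
- Do not fill gaps for non-text elements such as images, tables, and charts
- Do not implement complex layout analysis (gap filling only)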
## Decisions
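### Decision 1: Coverage Test Strategy
**Choice**: Use the center-point-inside test first, supplemented by an IoU threshold
**Rationale**:
- The center-point test is cheap to compute and performs well
- The IoU threshold acts as a supplement for boundary cases
- Recommended IoU threshold: 0.1 to 0.2, so that low-IoU overlaps are not misjudged as uncovered
**Alternatives**:
- Pure IoU test: more computation, and handling of partial overlaps is more complex
- Area-ratio test: not fair across regions of different sizes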
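### Decision 2: Gap Filling Trigger Condition
**Choice**: Trigger when PP-Structure coverage is < 70% or the element count is significantly lower than Raw OCR
**Rationale**:
- Avoids duplicated text in normal documents
- The 70% threshold is an empirical value and can be adjusted via settings
- The element count comparison serves as a quick check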
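
A minimal sketch of this trigger condition is shown below. The document does not say what "significantly lower" means numerically, so the `min_count_ratio` parameter (and its default of 0.5) is purely hypothetical; only the 0.7 coverage threshold comes from the design.

```python
def should_fill_gaps(
    covered_regions: int,
    raw_regions: int,
    pp_elements: int,
    coverage_threshold: float = 0.7,
    min_count_ratio: float = 0.5,  # hypothetical cut-off for "significantly lower"
) -> bool:
    """Decide whether gap filling should run for a page."""
    if raw_regions == 0:
        return False  # nothing to supplement from
    coverage = covered_regions / raw_regions
    too_few_elements = pp_elements < raw_regions * min_count_ratio
    return coverage < coverage_threshold or too_few_elements
```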
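### Decision 3: Supplemented Element Types
**Choice**: Supplement only TEXT; skip TABLE/IMAGE/FIGURE/FLOWCHART/HEADER/FOOTER
**Rationale**:
- PP-StructureV3 is usually more accurate at recognizing structured elements (tables, images)
- Filling in raw OCR text could break table structure
- These elements need to keep their structure intact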
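### Decision 4: Duplicate Detection and Deduplication
**Choice**: Treat Raw OCR regions with IoU > 0.5 against a PP-Structure TEXT element as duplicates and skip them
**Rationale**:
- 0.5 is a common overlap threshold
- Prevents the same text from appearing twice
- Lightweight merging may be considered for fragmented Raw OCR boxes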
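### Decision 5: Coordinate Alignment
**Choice**: Use `ocr_dimensions` to convert bboxes
**Rationale**:
- OCR may resize the image
- Ensures Raw OCR and PP-Structure coordinates are in the same space
- Avoids coverage misjudgment caused by mismatched dimensions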
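
The kind of conversion this decision refers to can be illustrated with a simple rescale. How `ocr_dimensions` is actually structured is not shown in this document, so the sketch takes explicit `(width, height)` pairs instead; the function name and signature are hypothetical.

```python
from typing import Tuple

BBox = Tuple[float, float, float, float]  # (x0, y0, x1, y1)
Size = Tuple[float, float]                # (width, height)


def scale_bbox(bbox: BBox, from_size: Size, to_size: Size) -> BBox:
    """Map a bbox from one image size to another by per-axis scaling."""
    sx = to_size[0] / from_size[0]
    sy = to_size[1] / from_size[1]
    x0, y0, x1, y1 = bbox
    return (x0 * sx, y0 * sy, x1 * sx, y1 * sy)
```

Both sources would be scaled into the same target space before any center-point or IoU comparison.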
## Data Flow
```
┌─────────────────┐     ┌───────────────────────┐
│ Raw OCR Result  │     │ PP-StructureV3 Result │
│  (56 regions)   │     │     (9 elements)      │
└────────┬────────┘     └───────────┬───────────┘
         │                          │
         └────────────┬─────────────┘
       ┌──────────────▼──────────────┐
       │ GapFillingService           │
       │ 1. Calculate coverage       │
       │ 2. Find uncovered regions   │
       │ 3. Filter by confidence     │
       │ 4. Deduplicate              │
       │ 5. Merge if needed          │
       └──────────────┬──────────────┘
       ┌──────────────▼──────────────┐
       │ OCRToUnifiedConverter       │
       │ - Combine elements          │
       │ - Recalculate reading order │
       └──────────────┬──────────────┘
       ┌──────────────▼──────────────┐
       │ UnifiedDocument             │
       │ (complete content)          │
       └─────────────────────────────┘
```
## Algorithm: Gap Detection
```python
from __future__ import annotations

from typing import List

# TextRegion, Element, SKIP_TYPES, get_center, point_in_bbox, and
# calculate_iou are assumed to be defined elsewhere in the service.


def find_uncovered_regions(
    raw_ocr_regions: List[TextRegion],
    pp_structure_elements: List[Element],
    iou_threshold: float = 0.15,
) -> List[TextRegion]:
    """
    Find Raw OCR regions not covered by PP-Structure elements.

    Coverage criteria (either one):
    1. Center point of raw region falls inside any PP-Structure bbox
    2. IoU with any PP-Structure bbox > iou_threshold
    """
    uncovered = []

    # Filter PP-Structure elements: only consider TEXT, skip TABLE/IMAGE/etc.
    text_elements = [e for e in pp_structure_elements
                     if e.type not in SKIP_TYPES]

    for region in raw_ocr_regions:
        center = get_center(region.bbox)
        is_covered = False

        for element in text_elements:
            # Check center point
            if point_in_bbox(center, element.bbox):
                is_covered = True
                break
            # Check IoU
            if calculate_iou(region.bbox, element.bbox) > iou_threshold:
                is_covered = True
                break

        if not is_covered:
            uncovered.append(region)

    return uncovered
```
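
The helpers that the algorithm above calls (`get_center`, `point_in_bbox`, `calculate_iou`) are not defined in this document. A minimal sketch of what they might look like follows, assuming bboxes are `(x0, y0, x1, y1)` tuples with `x0 <= x1` and `y0 <= y1`; the real service may represent bboxes differently.

```python
from typing import Tuple

BBox = Tuple[float, float, float, float]  # (x0, y0, x1, y1)
Point = Tuple[float, float]


def get_center(bbox: BBox) -> Point:
    """Center point of a bbox."""
    x0, y0, x1, y1 = bbox
    return ((x0 + x1) / 2, (y0 + y1) / 2)


def point_in_bbox(point: Point, bbox: BBox) -> bool:
    """True if the point lies inside the bbox (edges included)."""
    px, py = point
    x0, y0, x1, y1 = bbox
    return x0 <= px <= x1 and y0 <= py <= y1


def calculate_iou(a: BBox, b: BBox) -> float:
    """Intersection over union of two bboxes; 0.0 when they do not overlap."""
    inter_w = max(0.0, min(a[2], b[2]) - max(a[0], b[0]))
    inter_h = max(0.0, min(a[3], b[3]) - max(a[1], b[1]))
    inter = inter_w * inter_h
    area_a = (a[2] - a[0]) * (a[3] - a[1])
    area_b = (b[2] - b[0]) * (b[3] - b[1])
    union = area_a + area_b - inter
    return inter / union if union > 0 else 0.0
```

The center-point test is a pair of comparisons per element, which is why the design treats it as the cheap first check and IoU as the fallback.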
## Configuration Parameters
| Parameter | Type | Default | Description |
|-----------|------|---------|-------------|
| `gap_filling_enabled` | bool | True | Whether gap filling is enabled |
| `gap_filling_coverage_threshold` | float | 0.7 | Gap filling activates when coverage is below this value |
| `gap_filling_iou_threshold` | float | 0.15 | IoU threshold for the coverage test |
| `gap_filling_confidence_threshold` | float | 0.3 | Minimum Raw OCR confidence |
| `gap_filling_dedup_iou_threshold` | float | 0.5 | IoU threshold for deduplication |
## Risks / Trade-offs
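### Risk 1: Gap filling causes duplicated text
**Mitigation**: Set dedup_iou_threshold and deduplicate highly overlapping regions
### Risk 2: Reading order becomes scrambled
**Mitigation**: After supplementing elements, recalculate reading_order for the whole page (sorted by y0, then x0)
### Risk 3: Performance impact
**Mitigation**:
- Run a quick coverage check first; if coverage is > 70%, skip gap filling
- Use an R-tree or interval tree to speed up bbox lookups (if performance becomes a bottleneck)
### Risk 4: Misaligned coordinates
**Mitigation**: Use `ocr_dimensions` to keep the coordinate spaces consistent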
## Migration Plan
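1. The new feature is optional (enabled by default)
2. Gap filling can be turned off via settings
3. Existing API interfaces are not affected
4. Backward compatible: default behavior applies when no parameters are passed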
## Open Questions
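1. Is a UI toggle needed so users can enable or disable gap filling?
2. For fragmented Raw OCR boxes, should merging logic be implemented (same line, adjacent, with very small gaps)?
3. Should the output mark which elements came from gap filling (for debugging)?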


@@ -0,0 +1,30 @@
# Change: Add OCR Track Gap Filling with Raw OCR Text Regions
## Why
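PP-StructureV3's layout analysis model misses a large share of content on some scanned documents, causing significant text loss. Measurements on scan.pdf show:
- Raw PaddleOCR text recognition: detected **56 text regions**
- PP-StructureV3 layout analysis: output only **9 elements**
- Loss ratio: about **84%** of the content was not recognized by PP-StructureV3
The root cause is that the Layout Detection Model inside PP-StructureV3 does not adequately support this type of scanned document; it is not a problem in our code. Raw OCR detects all text regions correctly, but this information is lost during PP-StructureV3's structuring step.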
## What Changes
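Implement a "Hybrid Approach": use Raw OCR text regions to supplement content that PP-StructureV3 misses.
- **ADDED** `GapFillingService` class, responsible for detecting and filling in text regions missed by PP-StructureV3
- **ADDED** coverage calculation logic (center-point-inside or IoU threshold test)
- **ADDED** automatic activation condition: PP-Structure coverage < 70%, or element count significantly lower than the number of Raw OCR boxes
- **MODIFIED** `OCRToUnifiedConverter` to integrate gap filling logic
- **ADDED** reading_order recalculation (sorted by y0, then x0)
- **ADDED** test cases: a severe PP-Structure miss-detection case and validation on a normal document with no missed detections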
## Impact
- **Affected specs**: `ocr-processing`
- **Affected code**:
- `backend/app/services/ocr_to_unified_converter.py` - integrates gap filling
- `backend/app/services/gap_filling_service.py` - new (core logic)
- `backend/tests/test_gap_filling.py` - new (tests)
- **Track isolation**: applies only to the OCR track; Direct/Hybrid tracks are not affected


@@ -0,0 +1,111 @@
## ADDED Requirements
### Requirement: OCR Track Gap Filling with Raw OCR Regions
The system SHALL detect and fill gaps in PP-StructureV3 output by supplementing with Raw OCR text regions when significant content loss is detected.
#### Scenario: Gap filling activates when coverage is low
- **GIVEN** an OCR track processing task
- **WHEN** PP-StructureV3 outputs elements that cover less than 70% of Raw OCR text regions
- **THEN** the system SHALL activate gap filling
- **AND** identify Raw OCR regions not covered by any PP-StructureV3 element
- **AND** supplement these regions as TEXT elements in the output
#### Scenario: Coverage is determined by center-point and IoU
- **GIVEN** a Raw OCR text region with bounding box
- **WHEN** checking if the region is covered by PP-StructureV3
- **THEN** the region SHALL be considered covered if its center point falls inside any PP-StructureV3 element bbox
- **OR** if IoU with any PP-StructureV3 element exceeds 0.15 threshold
- **AND** regions not meeting either criterion SHALL be marked as uncovered
#### Scenario: Only TEXT elements are supplemented
- **GIVEN** uncovered Raw OCR regions identified for supplementation
- **WHEN** PP-StructureV3 has detected TABLE, IMAGE, FIGURE, FLOWCHART, HEADER, or FOOTER elements
- **THEN** the system SHALL NOT supplement regions that overlap with these structural elements
- **AND** only supplement regions as TEXT type to preserve structural integrity
#### Scenario: Supplemented regions meet confidence threshold
- **GIVEN** Raw OCR regions to be supplemented
- **WHEN** a region has confidence score below 0.3
- **THEN** the system SHALL skip that region
- **AND** only supplement regions with confidence >= 0.3
#### Scenario: Deduplication prevents repeated text
- **GIVEN** a Raw OCR region being considered for supplementation
- **WHEN** the region has IoU > 0.5 with any existing PP-StructureV3 TEXT element
- **THEN** the system SHALL skip that region to prevent duplicate text
- **AND** the original PP-StructureV3 element SHALL be preserved
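
The confidence and deduplication scenarios above could be combined into a single filtering pass, sketched below. The region shape (dicts with `bbox` and `confidence`) and the function name are assumptions made for illustration; the thresholds are the defaults named in the scenarios.

```python
from typing import Any, Dict, List, Tuple

BBox = Tuple[float, float, float, float]  # (x0, y0, x1, y1)


def _iou(a: BBox, b: BBox) -> float:
    """Intersection over union of two bboxes."""
    inter_w = max(0.0, min(a[2], b[2]) - max(a[0], b[0]))
    inter_h = max(0.0, min(a[3], b[3]) - max(a[1], b[1]))
    inter = inter_w * inter_h
    union = ((a[2] - a[0]) * (a[3] - a[1])
             + (b[2] - b[0]) * (b[3] - b[1]) - inter)
    return inter / union if union > 0 else 0.0


def select_supplements(
    candidates: List[Dict[str, Any]],
    text_bboxes: List[BBox],
    min_confidence: float = 0.3,
    dedup_iou: float = 0.5,
) -> List[Dict[str, Any]]:
    """Keep candidates that are confident enough and not duplicates."""
    kept = []
    for region in candidates:
        if region["confidence"] < min_confidence:
            continue  # below the confidence threshold
        if any(_iou(region["bbox"], box) > dedup_iou for box in text_bboxes):
            continue  # duplicates an existing PP-StructureV3 TEXT element
        kept.append(region)
    return kept
```

Existing PP-StructureV3 elements are never touched here; only the candidate list is narrowed.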
#### Scenario: Reading order is recalculated after gap filling
- **GIVEN** supplemented elements have been added to the page
- **WHEN** assembling the final element list
- **THEN** the system SHALL recalculate reading order for the entire page
- **AND** sort elements by y0 coordinate (top to bottom) then x0 (left to right)
- **AND** ensure logical document flow is maintained
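
The reading order rule in this scenario amounts to a two-key sort. A minimal sketch, assuming elements are dicts with a `bbox` of the form `(x0, y0, x1, y1)` and that the order is stored in a `reading_order` field:

```python
from typing import Any, Dict, List


def recalculate_reading_order(elements: List[Dict[str, Any]]) -> List[Dict[str, Any]]:
    """Sort by y0 (top to bottom), then x0 (left to right), and renumber."""
    ordered = sorted(elements, key=lambda e: (e["bbox"][1], e["bbox"][0]))
    return [{**e, "reading_order": i} for i, e in enumerate(ordered)]
```

A plain `(y0, x0)` sort treats boxes on the same visual line as different rows when their `y0` values differ by even a pixel, so a real implementation may need some row tolerance; that refinement is not described in this document.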
#### Scenario: Coordinate alignment with ocr_dimensions
- **GIVEN** Raw OCR processing may involve image resizing
- **WHEN** comparing Raw OCR bbox with PP-StructureV3 bbox
- **THEN** the system SHALL use ocr_dimensions to normalize coordinates
- **AND** ensure both sources reference the same coordinate space
- **AND** prevent coverage misdetection due to scale differences
#### Scenario: Supplemented elements have complete metadata
- **GIVEN** a Raw OCR region being added as supplemented element
- **WHEN** creating the DocumentElement
- **THEN** the element SHALL include page_number
- **AND** include confidence score from Raw OCR
- **AND** include original bbox coordinates
- **AND** optionally include source indicator for debugging
### Requirement: Gap Filling Track Isolation
The gap filling feature SHALL only apply to OCR track processing and SHALL NOT affect Direct or Hybrid track outputs.
#### Scenario: Gap filling only activates for OCR track
- **GIVEN** a document processing task
- **WHEN** the processing track is OCR
- **THEN** the system SHALL evaluate and apply gap filling as needed
- **AND** produce enhanced output with supplemented content
#### Scenario: Direct track is unaffected
- **GIVEN** a document processing task with Direct track
- **WHEN** the task is processed
- **THEN** the system SHALL NOT invoke any gap filling logic
- **AND** produce output identical to current Direct track behavior
#### Scenario: Hybrid track is unaffected
- **GIVEN** a document processing task with Hybrid track
- **WHEN** the task is processed
- **THEN** the system SHALL NOT invoke gap filling logic
- **AND** use existing Hybrid track processing pipeline
### Requirement: Gap Filling Configuration
The system SHALL provide configurable parameters for gap filling behavior.
#### Scenario: Gap filling can be disabled via configuration
- **GIVEN** gap_filling_enabled is set to false in configuration
- **WHEN** OCR track processing runs
- **THEN** the system SHALL skip all gap filling logic
- **AND** output only PP-StructureV3 results as before
#### Scenario: Coverage threshold is configurable
- **GIVEN** gap_filling_coverage_threshold is set to 0.8
- **WHEN** PP-StructureV3 coverage is 75%
- **THEN** the system SHALL activate gap filling
- **AND** supplement uncovered regions
#### Scenario: IoU thresholds are configurable
- **GIVEN** custom IoU thresholds configured:
- gap_filling_iou_threshold: 0.2
- gap_filling_dedup_iou_threshold: 0.6
- **WHEN** evaluating coverage and deduplication
- **THEN** the system SHALL use the configured values
- **AND** apply them consistently throughout gap filling process
#### Scenario: Confidence threshold is configurable
- **GIVEN** gap_filling_confidence_threshold is set to 0.5
- **WHEN** supplementing Raw OCR regions
- **THEN** the system SHALL only include regions with confidence >= 0.5
- **AND** filter out lower confidence regions


@@ -0,0 +1,44 @@
# Tasks: Add OCR Track Gap Filling
## 1. Core Implementation
- [x] 1.1 Create `gap_filling_service.py` with `GapFillingService` class
- [x] 1.2 Implement bbox coverage calculation (center-point and IoU methods)
- [x] 1.3 Implement gap detection logic (find uncovered raw OCR regions)
- [x] 1.4 Implement confidence threshold filtering for supplemented regions
- [x] 1.5 Implement element type filtering (only supplement TEXT, skip TABLE/IMAGE/FIGURE/etc.)
- [x] 1.6 Implement reading order recalculation (sort by y0, x0)
- [x] 1.7 Implement deduplication logic (skip high IoU overlaps with PP-Structure TEXT)
- [x] 1.8 Implement optional text merging for fragmented adjacent regions
## 2. Integration
- [x] 2.1 Modify `OCRToUnifiedConverter` to accept raw OCR text_regions
- [x] 2.2 Add gap filling activation condition check (coverage < 70% or element count disparity)
- [x] 2.3 Ensure coordinate alignment between raw OCR and PP-Structure (ocr_dimensions handling)
- [x] 2.4 Add page metadata (page_number, confidence, bbox) to supplemented elements
- [x] 2.5 Ensure track isolation (only OCR track, not Direct/Hybrid)
## 3. Configuration
- [x] 3.1 Add configurable parameters to settings:
- `gap_filling_enabled`: bool (default: True)
- `gap_filling_coverage_threshold`: float (default: 0.7)
- `gap_filling_iou_threshold`: float (default: 0.15)
- `gap_filling_confidence_threshold`: float (default: 0.3)
- `gap_filling_dedup_iou_threshold`: float (default: 0.5)
## 4. Testing (with env)
- [x] 4.1 Create test fixtures with PP-Structure severe miss-detection case (with scan.pdf / scan2.pdf)
- [x] 4.2 Test gap detection correctly identifies uncovered regions
- [x] 4.3 Test supplemented elements have correct metadata
- [x] 4.4 Test reading order is correctly recalculated
- [x] 4.5 Test deduplication prevents duplicate text
- [x] 4.6 Test normal document without miss-detection has no duplicate/inflation
- [x] 4.7 Test track isolation (Direct track unaffected)
## 5. Documentation
- [x] 5.1 Add inline documentation to GapFillingService
- [x] 5.2 Update configuration documentation with new settings


@@ -0,0 +1,40 @@
# Change: Simplify PP-StructureV3 Configuration with Layout Model Selection
## Why
The current PP-StructureV3 parameter adjustment UI exposes 7 technical ML parameters (thresholds, ratios, merge modes) that are difficult for end users to understand. Meanwhile, switching to a different layout detection model (e.g., CDLA-trained models for Chinese documents) would have a much greater impact on OCR quality than fine-tuning these parameters.
**Problems with current approach:**
- Users don't understand what `layout_detection_threshold` or `text_det_unclip_ratio` mean
- Wrong parameter values can make OCR results worse
- The default model (PubLayNet-based) is optimized for English academic papers, not Chinese business documents
- Model selection is far more impactful than parameter tuning
## What Changes
### Backend Changes
- **REMOVED**: API parameter `pp_structure_params` from task start endpoint
- **ADDED**: New API parameter `layout_model` with predefined options:
- `"default"` - Standard model (PubLayNet-based, for English documents)
- `"chinese"` - PP-DocLayout-S model (for Chinese documents, forms, contracts)
- `"cdla"` - CDLA model (alternative Chinese document layout model)
- **MODIFIED**: PP-StructureV3 initialization uses `layout_detection_model_name` based on selection
- Keep fine-tuning parameters in backend `config.py` with optimized defaults
### Frontend Changes
- **REMOVED**: `PPStructureParams.tsx` component (slider/dropdown UI for 7 parameters)
- **ADDED**: Simple radio button/dropdown for layout model selection with clear descriptions
- **MODIFIED**: Task start request body to send `layout_model` instead of `pp_structure_params`
### API Changes
- **BREAKING**: Remove `pp_structure_params` from `POST /api/v2/tasks/{task_id}/start`
- **ADDED**: New optional parameter `layout_model: "default" | "chinese" | "cdla"`
## Impact
- Affected specs: `ocr-processing`
- Affected code:
- Backend: `app/routers/tasks.py`, `app/services/ocr_service.py`, `app/core/config.py`
- Frontend: `src/components/PPStructureParams.tsx` (remove), `src/types/apiV2.ts`, task start form
- Breaking change: Clients using `pp_structure_params` will need to migrate to `layout_model`
- User impact: Simpler UI, better default OCR quality for Chinese documents


@@ -0,0 +1,86 @@
# ocr-processing Specification Delta
## REMOVED Requirements
### Requirement: Frontend-Adjustable PP-StructureV3 Parameters
**Reason**: Complex ML parameters are difficult for end users to understand and tune. Model selection provides better UX and more significant quality improvements.
**Migration**: Replace `pp_structure_params` API parameter with `layout_model` parameter.
### Requirement: PP-StructureV3 Parameter UI Controls
**Reason**: Slider/dropdown UI for 7 technical parameters adds complexity without proportional benefit. Simple model selection is more user-friendly.
**Migration**: Remove `PPStructureParams.tsx` component, add `LayoutModelSelector.tsx` component.
## ADDED Requirements
### Requirement: Layout Model Selection
The system SHALL allow users to select a layout detection model optimized for their document type, providing a simple choice between pre-configured models instead of manual parameter tuning.
#### Scenario: User selects Chinese document model
- **GIVEN** a user is processing Chinese business documents (forms, contracts, invoices)
- **WHEN** the user selects "Chinese Document Model" (PP-DocLayout-S)
- **THEN** the OCR engine SHALL use the PP-DocLayout-S layout detection model
- **AND** the model SHALL be optimized for 23 Chinese document element types
- **AND** table and form detection accuracy SHALL be improved over the default model
#### Scenario: User selects standard model for English documents
- **GIVEN** a user is processing English academic papers or reports
- **WHEN** the user selects "Standard Model" (PubLayNet-based)
- **THEN** the OCR engine SHALL use the default PubLayNet-based layout detection model
- **AND** the model SHALL be optimized for English document layouts
#### Scenario: User selects CDLA model for specialized Chinese layout
- **GIVEN** a user is processing Chinese documents with complex layouts
- **WHEN** the user selects "CDLA Model"
- **THEN** the OCR engine SHALL use the picodet_lcnet_x1_0_fgd_layout_cdla model
- **AND** the model SHALL provide specialized Chinese document layout analysis
#### Scenario: Layout model is sent via API request
- **GIVEN** a frontend application with model selection UI
- **WHEN** the user starts task processing with a selected model
- **THEN** the frontend SHALL send the model choice in the request body:
```json
POST /api/v2/tasks/{task_id}/start
{
"use_dual_track": true,
"force_track": "ocr",
"language": "ch",
"layout_model": "chinese"
}
```
- **AND** the backend SHALL configure PP-StructureV3 with the corresponding model
#### Scenario: Default model when not specified
- **GIVEN** an API request without `layout_model` parameter
- **WHEN** the task is started
- **THEN** the system SHALL use "chinese" (PP-DocLayout-S) as the default model
- **AND** processing SHALL work correctly without requiring model selection
#### Scenario: Invalid model name is rejected
- **GIVEN** a request with an invalid `layout_model` value
- **WHEN** the user sends `layout_model: "invalid_model"`
- **THEN** the API SHALL return 422 Validation Error
- **AND** provide a clear error message listing valid model options
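
The selection, default, and validation scenarios above might map onto code along these lines. This is a sketch under stated assumptions: the model names for `chinese` and `cdla` are the ones given in this spec, while using `None` as the stand-in for the PubLayNet default and raising `ValueError` (where the real API returns a 422 through its request validation) are illustrative choices, and `resolve_layout_model` is a hypothetical helper name.

```python
from enum import Enum
from typing import Dict, Optional


class LayoutModel(str, Enum):
    DEFAULT = "default"
    CHINESE = "chinese"
    CDLA = "cdla"


# Assumption: None means "do not override; use the engine's built-in default".
_MODEL_NAMES: Dict[LayoutModel, Optional[str]] = {
    LayoutModel.CHINESE: "PP-DocLayout-S",
    LayoutModel.CDLA: "picodet_lcnet_x1_0_fgd_layout_cdla",
    LayoutModel.DEFAULT: None,
}


def resolve_layout_model(value: Optional[str]) -> Optional[str]:
    """Map the API's layout_model value to a layout detection model name."""
    if value is None:
        model = LayoutModel.CHINESE  # default when the parameter is omitted
    else:
        try:
            model = LayoutModel(value)
        except ValueError:
            valid = ", ".join(m.value for m in LayoutModel)
            raise ValueError(
                f"Invalid layout_model {value!r}; valid options: {valid}"
            ) from None
    return _MODEL_NAMES[model]
```

Keeping the omitted-parameter case separate from the explicit `"default"` value is what lets an API client ask for the PubLayNet-based model even though `chinese` is the default.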
### Requirement: Layout Model Selection UI
The frontend SHALL provide a simple, user-friendly interface for selecting layout detection models with clear descriptions of each option.
#### Scenario: Model options are displayed with descriptions
- **GIVEN** the model selection UI is displayed
- **WHEN** the user views the available options
- **THEN** the UI SHALL show the following options:
- "Chinese Document Model (Recommended)" - for Chinese forms, contracts, invoices
- "Standard Model" - for English academic papers, reports
- "CDLA Model" - for specialized Chinese layout analysis
- **AND** each option SHALL have a brief description of its use case
#### Scenario: Chinese model is selected by default
- **GIVEN** the user opens the task processing interface
- **WHEN** the model selection is displayed
- **THEN** "Chinese Document Model" SHALL be pre-selected as the default
- **AND** the user MAY change the selection before starting processing
#### Scenario: Model selection is visible only for OCR track
- **GIVEN** a document processing interface
- **WHEN** the user selects processing track
- **THEN** layout model selection SHALL be shown ONLY when OCR track is selected or auto-detected
- **AND** SHALL be hidden for Direct track (which does not use PP-StructureV3)


@@ -0,0 +1,56 @@
# Implementation Tasks
## 1. Backend API Changes
- [x] 1.1 Update `app/schemas/task.py` to add `layout_model` enum type
- [x] 1.2 Update `app/routers/tasks.py` to replace `pp_structure_params` with `layout_model` parameter
- [x] 1.3 Update `app/services/ocr_service.py` to map `layout_model` to `layout_detection_model_name`
- [x] 1.4 Remove custom PP-Structure engine creation logic (use model selection instead)
- [x] 1.5 Add backward compatibility: default to "chinese" if no model specified
## 2. Backend Configuration
- [x] 2.1 Keep `layout_detection_model_name` in `config.py` as fallback default
- [x] 2.2 Keep fine-tuning parameters in `config.py` (not exposed to API)
- [x] 2.3 Document available layout models in config comments
## 3. Frontend Changes
- [x] 3.1 Remove `PPStructureParams.tsx` component
- [x] 3.2 Update `src/types/apiV2.ts`:
- Remove `PPStructureV3Params` interface
- Add `LayoutModel` type: `"default" | "chinese" | "cdla"`
- Update `ProcessingOptions` to use `layout_model` instead of `pp_structure_params`
- [x] 3.3 Create `LayoutModelSelector.tsx` component with:
- Radio buttons or dropdown for model selection
- Clear descriptions for each model option
- Default selection: "chinese"
- [x] 3.4 Update task start form to use new `LayoutModelSelector`
- [x] 3.5 Update API calls to send `layout_model` instead of `pp_structure_params`
## 4. Internationalization
- [x] 4.1 Add i18n strings for layout model options:
- `layoutModel.default`: "Standard Model (English documents)"
- `layoutModel.chinese`: "Chinese Document Model (Recommended)"
- `layoutModel.cdla`: "CDLA Model (Chinese layout analysis)"
- [x] 4.2 Add i18n strings for model descriptions
## 5. Testing
- [x] 5.1 Create new tests for `layout_model` parameter (`test_layout_model_api.py`, `test_layout_model.py`)
- [x] 5.2 Archive tests for `pp_structure_params` validation (moved to `tests/archived/`)
- [x] 5.3 Add tests for layout model selection (19 tests passing)
- [x] 5.4 Test backward compatibility (no model specified → use chinese default)
## 6. Documentation
- [ ] 6.1 Update API documentation for task start endpoint
- [ ] 6.2 Remove PP-Structure parameter documentation
- [ ] 6.3 Add layout model selection documentation
## 7. Cleanup
- [x] 7.1 Remove localStorage keys for PP-Structure params (`pp_structure_params_presets`, `pp_structure_params_last_used`)
- [x] 7.2 Remove any unused imports/types related to PP-Structure params
- [x] 7.3 Archive old PP-Structure params test files


@@ -3,100 +3,186 @@
## Purpose
TBD - created by archiving change frontend-adjustable-ppstructure-params. Update Purpose after archive.
## Requirements
### Requirement: Frontend-Adjustable PP-StructureV3 Parameters
The system SHALL allow frontend users to dynamically adjust PP-StructureV3 OCR parameters for fine-tuning document processing without backend configuration changes.
### Requirement: OCR Track Gap Filling with Raw OCR Regions
#### Scenario: User adjusts layout detection threshold
- **GIVEN** a user is processing a document with OCR track
- **WHEN** the user sets `layout_detection_threshold` to 0.1 (lower than default 0.2)
- **THEN** the OCR engine SHALL detect more layout blocks including weak signals
- **AND** the processing SHALL use the custom parameter instead of backend defaults
- **AND** the custom parameter SHALL NOT be cached for reuse
The system SHALL detect and fill gaps in PP-StructureV3 output by supplementing with Raw OCR text regions when significant content loss is detected.
#### Scenario: User selects high-quality preset configuration
- **GIVEN** a user wants to process a complex document with many small text elements
- **WHEN** the user selects "High Quality" preset mode
- **THEN** the system SHALL automatically set:
- `layout_detection_threshold` to 0.1
- `layout_nms_threshold` to 0.15
- `text_det_thresh` to 0.1
- `text_det_box_thresh` to 0.2
- **AND** process the document with these optimized parameters
#### Scenario: Gap filling activates when coverage is low
- **GIVEN** an OCR track processing task
- **WHEN** PP-StructureV3 outputs elements that cover less than 70% of Raw OCR text regions
- **THEN** the system SHALL activate gap filling
- **AND** identify Raw OCR regions not covered by any PP-StructureV3 element
- **AND** supplement these regions as TEXT elements in the output
#### Scenario: User adjusts text detection parameters
- **GIVEN** a document with low-contrast text
- **WHEN** the user sets:
- `text_det_thresh` to 0.05 (very low)
- `text_det_unclip_ratio` to 1.5 (larger boxes)
- **THEN** the OCR SHALL detect more small and low-contrast text
- **AND** text bounding boxes SHALL be expanded by the specified ratio
#### Scenario: Coverage is determined by center-point and IoU
- **GIVEN** a Raw OCR text region with bounding box
- **WHEN** checking if the region is covered by PP-StructureV3
- **THEN** the region SHALL be considered covered if its center point falls inside any PP-StructureV3 element bbox
- **OR** if IoU with any PP-StructureV3 element exceeds 0.15 threshold
- **AND** regions not meeting either criterion SHALL be marked as uncovered
#### Scenario: Parameters are sent via API request body
- **GIVEN** a frontend application with parameter adjustment UI
- **WHEN** the user starts task processing with custom parameters
- **THEN** the frontend SHALL send parameters in the request body (not query params):
#### Scenario: Only TEXT elements are supplemented
- **GIVEN** uncovered Raw OCR regions identified for supplementation
- **WHEN** PP-StructureV3 has detected TABLE, IMAGE, FIGURE, FLOWCHART, HEADER, or FOOTER elements
- **THEN** the system SHALL NOT supplement regions that overlap with these structural elements
- **AND** only supplement regions as TEXT type to preserve structural integrity
#### Scenario: Supplemented regions meet confidence threshold
- **GIVEN** Raw OCR regions to be supplemented
- **WHEN** a region has confidence score below 0.3
- **THEN** the system SHALL skip that region
- **AND** only supplement regions with confidence >= 0.3
#### Scenario: Deduplication prevents repeated text
- **GIVEN** a Raw OCR region being considered for supplementation
- **WHEN** the region has IoU > 0.5 with any existing PP-StructureV3 TEXT element
- **THEN** the system SHALL skip that region to prevent duplicate text
- **AND** the original PP-StructureV3 element SHALL be preserved
#### Scenario: Reading order is recalculated after gap filling
- **GIVEN** supplemented elements have been added to the page
- **WHEN** assembling the final element list
- **THEN** the system SHALL recalculate reading order for the entire page
- **AND** sort elements by y0 coordinate (top to bottom) then x0 (left to right)
- **AND** ensure logical document flow is maintained
#### Scenario: Coordinate alignment with ocr_dimensions
- **GIVEN** Raw OCR processing may involve image resizing
- **WHEN** comparing Raw OCR bbox with PP-StructureV3 bbox
- **THEN** the system SHALL use ocr_dimensions to normalize coordinates
- **AND** ensure both sources reference the same coordinate space
- **AND** prevent coverage misdetection due to scale differences
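
The spec does not define the shape of `ocr_dimensions`. The sketch below assumes it carries the width and height of the image that Raw OCR actually ran on, and rescales a Raw OCR bbox into the coordinate space used by the PP-StructureV3 elements. The function name `scale_bbox` and the `(width, height)` tuples are hypothetical.

```python
from typing import Tuple

BBox = Tuple[float, float, float, float]  # assumed (x0, y0, x1, y1)
Size = Tuple[float, float]  # assumed (width, height)


def scale_bbox(bbox: BBox, from_size: Size, to_size: Size) -> BBox:
    """Rescale a bbox from one image size to another."""
    if from_size[0] <= 0 or from_size[1] <= 0:
        raise ValueError("from_size must have positive width and height")
    sx = to_size[0] / from_size[0]
    sy = to_size[1] / from_size[1]
    x0, y0, x1, y1 = bbox
    return (x0 * sx, y0 * sy, x1 * sx, y1 * sy)
```

For example, if Raw OCR ran on a 1000x2000 downscaled image and the layout elements use the 2000x4000 original, every Raw OCR bbox is scaled by 2 on both axes before the coverage and IoU checks.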
#### Scenario: Supplemented elements have complete metadata
- **GIVEN** a Raw OCR region being added as supplemented element
- **WHEN** creating the DocumentElement
- **THEN** the element SHALL include page_number
- **AND** include confidence score from Raw OCR
- **AND** include original bbox coordinates
- **AND** optionally include a source indicator for debugging
### Requirement: Gap Filling Track Isolation
The gap filling feature SHALL only apply to OCR track processing and SHALL NOT affect Direct or Hybrid track outputs.
#### Scenario: Gap filling only activates for OCR track
- **GIVEN** a document processing task
- **WHEN** the processing track is OCR
- **THEN** the system SHALL evaluate and apply gap filling as needed
- **AND** produce enhanced output with supplemented content
#### Scenario: Direct track is unaffected
- **GIVEN** a document processing task with Direct track
- **WHEN** the task is processed
- **THEN** the system SHALL NOT invoke any gap filling logic
- **AND** produce output identical to current Direct track behavior
#### Scenario: Hybrid track is unaffected
- **GIVEN** a document processing task with Hybrid track
- **WHEN** the task is processed
- **THEN** the system SHALL NOT invoke gap filling logic
- **AND** use existing Hybrid track processing pipeline
### Requirement: Gap Filling Configuration
The system SHALL provide configurable parameters for gap filling behavior.
#### Scenario: Gap filling can be disabled via configuration
- **GIVEN** gap_filling_enabled is set to false in configuration
- **WHEN** OCR track processing runs
- **THEN** the system SHALL skip all gap filling logic
- **AND** output only PP-StructureV3 results as before
#### Scenario: Coverage threshold is configurable
- **GIVEN** gap_filling_coverage_threshold is set to 0.8
- **WHEN** PP-StructureV3 coverage is 75% (below the 0.8 threshold)
- **THEN** the system SHALL activate gap filling
- **AND** supplement uncovered regions
#### Scenario: IoU thresholds are configurable
- **GIVEN** custom IoU thresholds configured:
  - gap_filling_iou_threshold: 0.2
  - gap_filling_dedup_iou_threshold: 0.6
- **WHEN** evaluating coverage and deduplication
- **THEN** the system SHALL use the configured values
- **AND** apply them consistently throughout gap filling process
#### Scenario: Confidence threshold is configurable
- **GIVEN** gap_filling_confidence_threshold is set to 0.5
- **WHEN** supplementing Raw OCR regions
- **THEN** the system SHALL only include regions with confidence >= 0.5
- **AND** filter out lower confidence regions
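
The track isolation and configuration requirements above can be combined into one settings object and one gate function. The setting names come from the scenarios. The defaults of 0.15, 0.5, and 0.3 mirror the thresholds quoted earlier in this spec, while the 0.8 coverage default is only borrowed from the example scenario and is an assumption, as are the class and function names.

```python
from dataclasses import dataclass


@dataclass(frozen=True)
class GapFillingConfig:
    gap_filling_enabled: bool = True
    gap_filling_coverage_threshold: float = 0.8  # assumed default
    gap_filling_iou_threshold: float = 0.15
    gap_filling_dedup_iou_threshold: float = 0.5
    gap_filling_confidence_threshold: float = 0.3


def should_fill_gaps(track: str, coverage_ratio: float, cfg: GapFillingConfig) -> bool:
    """Gap filling runs only on the OCR track, when enabled, and when
    PP-StructureV3 coverage is below the configured threshold."""
    if track != "ocr" or not cfg.gap_filling_enabled:
        return False
    # Behavior at exact equality is not specified; this sketch uses strict <.
    return coverage_ratio < cfg.gap_filling_coverage_threshold
```

With the defaults, `should_fill_gaps("ocr", 0.75, GapFillingConfig())` activates gap filling, while the same call with `"direct"` or `"hybrid"` does not.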
### Requirement: Layout Model Selection
The system SHALL allow users to select a layout detection model optimized for their document type, providing a simple choice between pre-configured models instead of manual parameter tuning.
#### Scenario: User selects Chinese document model
- **GIVEN** a user is processing Chinese business documents (forms, contracts, invoices)
- **WHEN** the user selects "Chinese Document Model" (PP-DocLayout-S)
- **THEN** the OCR engine SHALL use the PP-DocLayout-S layout detection model
- **AND** the model SHALL be optimized for 23 Chinese document element types
- **AND** table and form detection accuracy SHALL be improved over the default model
#### Scenario: User selects standard model for English documents
- **GIVEN** a user is processing English academic papers or reports
- **WHEN** the user selects "Standard Model" (PubLayNet-based)
- **THEN** the OCR engine SHALL use the default PubLayNet-based layout detection model
- **AND** the model SHALL be optimized for English document layouts
#### Scenario: User selects CDLA model for specialized Chinese layout
- **GIVEN** a user is processing Chinese documents with complex layouts
- **WHEN** the user selects "CDLA Model"
- **THEN** the OCR engine SHALL use the picodet_lcnet_x1_0_fgd_layout_cdla model
- **AND** the model SHALL provide specialized Chinese document layout analysis
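
The three scenarios above describe a fixed mapping from a user-facing choice to a layout detection model. A minimal sketch of that mapping follows. Using `None` for `"default"` to mean "do not override the engine's built-in PubLayNet-based model" is an assumption for illustration, and the actual sentinel used by the backend may differ.

```python
from typing import Dict, Optional

# User-facing choice -> layout detection model name.
# None is a hypothetical sentinel: keep the engine's built-in
# PubLayNet-based model instead of overriding it.
LAYOUT_MODEL_MAPPING: Dict[str, Optional[str]] = {
    "chinese": "PP-DocLayout-S",
    "default": None,
    "cdla": "picodet_lcnet_x1_0_fgd_layout_cdla",
}


def resolve_layout_model(choice: str) -> Optional[str]:
    """Return the model name for a choice, or None for the built-in model."""
    return LAYOUT_MODEL_MAPPING[choice]
```

Keeping the mapping in one table means the API validator and the UI can both derive their list of valid options from its keys.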
#### Scenario: Layout model is sent via API request
- **GIVEN** a frontend application with model selection UI
- **WHEN** the user starts task processing with a selected model
- **THEN** the frontend SHALL send the model choice in the request body:
```json
POST /api/v2/tasks/{task_id}/start
{
  "use_dual_track": true,
  "force_track": "ocr",
  "language": "ch",
  "pp_structure_params": {
    "layout_detection_threshold": 0.15,
    "layout_merge_bboxes_mode": "small",
    "text_det_thresh": 0.1
  },
  "layout_model": "chinese"
}
```
- **AND** the backend SHALL parse and apply these parameters
- **AND** the backend SHALL configure PP-StructureV3 with the corresponding model
#### Scenario: Backward compatibility is maintained
- **GIVEN** existing API clients without PP-StructureV3 parameter support
- **WHEN** a task is started without `pp_structure_params`
- **THEN** the system SHALL use backend default settings
- **AND** processing SHALL work exactly as before
- **AND** no errors SHALL occur
#### Scenario: Default model when not specified
- **GIVEN** an API request without `layout_model` parameter
- **WHEN** the task is started
- **THEN** the system SHALL use "chinese" (PP-DocLayout-S) as the default model
- **AND** processing SHALL work correctly without requiring model selection
#### Scenario: Invalid parameters are rejected
- **GIVEN** a request with invalid parameter values
- **WHEN** the user sends:
  - `layout_detection_threshold` = 1.5 (exceeds max 1.0)
  - `layout_merge_bboxes_mode` = "invalid" (not in allowed values)
- **THEN** the API SHALL return 422 Validation Error
- **AND** provide clear error messages about invalid parameters
#### Scenario: Invalid model name is rejected
- **GIVEN** a request with an invalid `layout_model` value
- **WHEN** the user sends `layout_model: "invalid_model"`
- **THEN** the API SHALL return 422 Validation Error
- **AND** provide a clear error message listing valid model options
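
The default and rejection rules for `layout_model` can be sketched with the standard library alone. The function name `parse_layout_model` is hypothetical. In a real API layer, the raised error would presumably be translated into the 422 Validation Error response, which this sketch does not attempt.

```python
from typing import Any, Mapping

VALID_LAYOUT_MODELS = ("chinese", "default", "cdla")
DEFAULT_LAYOUT_MODEL = "chinese"  # PP-DocLayout-S when nothing is specified


def parse_layout_model(body: Mapping[str, Any]) -> str:
    """Read `layout_model` from a request body, applying the default
    and rejecting unknown values with a message listing valid options."""
    value = body.get("layout_model", DEFAULT_LAYOUT_MODEL)
    if value not in VALID_LAYOUT_MODELS:
        options = ", ".join(VALID_LAYOUT_MODELS)
        raise ValueError(
            f"Invalid layout_model {value!r}; valid options are: {options}"
        )
    return value
```

A request body without `layout_model` resolves to `"chinese"`, while `{"layout_model": "invalid_model"}` raises an error whose message names all three valid options.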
#### Scenario: Custom parameters affect only current processing
- **GIVEN** multiple concurrent OCR processing tasks
- **WHEN** Task A uses custom parameters and Task B uses defaults
- **THEN** Task A SHALL process with its custom parameters
- **AND** Task B SHALL process with default parameters
- **AND** no parameter interference SHALL occur between tasks
### Requirement: Layout Model Selection UI
The frontend SHALL provide a simple, user-friendly interface for selecting layout detection models with clear descriptions of each option.
#### Scenario: Model options are displayed with descriptions
- **GIVEN** the model selection UI is displayed
- **WHEN** the user views the available options
- **THEN** the UI SHALL show the following options:
  - "Chinese Document Model (Recommended)" - for Chinese forms, contracts, invoices
  - "Standard Model" - for English academic papers, reports
  - "CDLA Model" - for specialized Chinese layout analysis
- **AND** each option SHALL have a brief description of its use case
#### Scenario: Chinese model is selected by default
- **GIVEN** the user opens the task processing interface
- **WHEN** the model selection is displayed
- **THEN** "Chinese Document Model" SHALL be pre-selected as the default
- **AND** the user MAY change the selection before starting processing
#### Scenario: Model selection is visible only for OCR track
- **GIVEN** a document processing interface
- **WHEN** the user selects a processing track
- **THEN** layout model selection SHALL be shown ONLY when OCR track is selected or auto-detected
- **AND** SHALL be hidden for Direct track (which does not use PP-StructureV3)
### Requirement: PP-StructureV3 Parameter UI Controls
The frontend SHALL provide intuitive UI controls for adjusting PP-StructureV3 parameters with appropriate constraints and help text.
#### Scenario: Slider controls for numeric parameters
- **GIVEN** the parameter adjustment UI is displayed
- **WHEN** the user adjusts a numeric parameter slider
- **THEN** the slider SHALL enforce min/max constraints:
  - Threshold parameters: 0.0 to 1.0
  - Ratio parameters: > 0 (typically 0.5 to 3.0)
- **AND** display the current value in real time
- **AND** show help text explaining the parameter's effect
#### Scenario: Dropdown for merge mode selection
- **GIVEN** the layout merge mode parameter
- **WHEN** the user clicks the dropdown
- **THEN** the UI SHALL show exactly three options:
  - "small" (conservative merging)
  - "large" (aggressive merging)
  - "union" (middle ground)
- **AND** display a description for each option
#### Scenario: Parameters shown only for OCR track
- **GIVEN** a document processing interface
- **WHEN** the user selects a processing track
- **THEN** PP-StructureV3 parameters SHALL be shown ONLY when OCR track is selected
- **AND** SHALL be hidden for Direct track
- **AND** SHALL be disabled for Auto track until the track is determined