feat: simplify layout model selection and archive proposals

Changes:
- Replace PP-Structure 7-slider parameter UI with simple 3-option layout model selector
- Add layout model mapping: chinese (PP-DocLayout-S), default (PubLayNet), cdla
- Add LayoutModelSelector component and zh-TW translations
- Fix "default" model behavior with sentinel value for PubLayNet
- Add gap filling service for OCR track coverage improvement
- Add PP-Structure debug utilities
- Archive completed/incomplete proposals:
  - add-ocr-track-gap-filling (complete)
  - fix-ocr-track-table-rendering (incomplete)
  - simplify-ppstructure-model-selection (22/25 tasks)
- Add new layout model tests, archive old PP-Structure param tests
- Update OpenSpec ocr-processing spec with layout model requirements

🤖 Generated with [Claude Code](https://claude.com/claude-code)

Co-Authored-By: Claude <noreply@anthropic.com>
egg
2025-11-27 13:27:00 +08:00
parent c65df754cf
commit 59206a6ab8
35 changed files with 3621 additions and 658 deletions


@@ -0,0 +1,28 @@
# Change: Fix OCR Track Table Empty Columns and Alignment
## Why
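Tables generated by PP-Structure often contain empty columns (columns whose cell is empty or whitespace-only in every row), so the converted UnifiedDocument tables end up with empty and misaligned columns. The OCR Track currently uses the raw data as-is, without any cleanup, which degrades the quality of PDF/JSON/Markdown output.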
## What Changes
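- Add a `trim_empty_columns()` function that removes empty columns from OCR Track tables
- Call the cleanup logic at the entry of `_convert_table_data` so the resulting TableData is clean
- Recalculate col_span: if a span crosses a removed column, shrink the span
- Update the columns/cols value and adjust each cell's col index
- Optional: sort columns for alignment by bbox x0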
## Impact
- Affected specs: `ocr-processing`
- Affected code:
  - `backend/app/services/ocr_to_unified_converter.py` (main change)
- Direct/HYBRID paths are not affected
- PDF/JSON/Markdown output will be cleaner
## Constraints
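- Keep the table bbox and page coordinates unchanged
- Do not modify the Direct/HYBRID paths
- Remove only columns that are empty in every row; a column with an empty header but values in the data rows must not be removed
- Preserve the original bbox to avoid PDF layout drift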
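
The "empty in every row" rule above can be sketched roughly as follows. This is a minimal illustration, not the code in `ocr_to_unified_converter.py`: the helper name `find_empty_columns` and the `row`/`content` key names are assumptions, and spans are ignored here so that each cell counts only toward its own `col`.

```python
from typing import Any, Dict, List, Set


def find_empty_columns(cells: List[Dict[str, Any]], num_cols: int) -> Set[int]:
    """Return indices of columns whose every cell is empty or whitespace-only."""
    non_empty: Set[int] = set()
    for cell in cells:
        # "content" is an assumed key name; None is treated as empty.
        text = str(cell.get("content") or "")
        if text.strip():
            non_empty.add(cell.get("col", 0))
    return set(range(num_cols)) - non_empty


cells = [
    {"row": 0, "col": 0, "content": "Name"},
    {"row": 0, "col": 1, "content": ""},    # empty header, but data below
    {"row": 0, "col": 2, "content": "  "},  # whitespace only
    {"row": 1, "col": 0, "content": "A"},
    {"row": 1, "col": 1, "content": "42"},
    {"row": 1, "col": 2, "content": ""},
]
empty = find_empty_columns(cells, num_cols=3)
```

In this example only column 2 qualifies for removal; column 1 is kept because a data row holds a value even though its header is empty.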


@@ -0,0 +1,61 @@
## ADDED Requirements
### Requirement: OCR Table Empty Column Cleanup
The OCR Track converter SHALL clean up PP-Structure generated tables by removing columns where all rows have empty or whitespace-only content.
The system SHALL:
1. Identify columns where every cell's content is empty or contains only whitespace (using `.strip()` to determine emptiness)
2. Remove identified empty columns from the table structure
3. Update the `columns`/`cols` value to reflect the new column count
4. Recalculate each cell's `col` index to maintain continuity
5. Adjust `col_span` values when spans cross removed columns (shrink span size)
6. Remove cells entirely when their complete span falls within removed columns
7. Preserve original bbox and page coordinates (no layout drift)
8. If `columns` is 0 or missing after cleanup, fill with the calculated column count
The cleanup SHALL NOT:
- Remove columns where the header is empty but data rows contain values
- Modify tables in Direct or HYBRID track
- Alter the original bbox coordinates
#### Scenario: All rows in column are empty
- **WHEN** a table has a column where all cells contain only empty or whitespace content
- **THEN** that column is removed
- **AND** remaining cells have their `col` indices decremented appropriately
- **AND** `cols` count is reduced by 1
#### Scenario: Column has empty header but data has values
- **WHEN** a table has a column where the header cell is empty
- **AND** at least one data row cell in that column contains non-whitespace content
- **THEN** that column is NOT removed
#### Scenario: Cell span crosses removed column
- **WHEN** a cell has `col_span > 1`
- **AND** one or more columns within the span are removed
- **THEN** the `col_span` is reduced by the number of removed columns within the span
#### Scenario: Cell span entirely within removed columns
- **WHEN** a cell's entire span falls within columns that are all removed
- **THEN** that cell is removed from the table
#### Scenario: Missing columns metadata
- **WHEN** the table dict has `columns` set to 0 or missing
- **AND** cleanup has been performed
- **THEN** `columns` is set to the calculated number of remaining columns
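
The requirement and scenarios above could be realized along the following lines. This is a hedged sketch under stated assumptions, not the project's actual implementation: the `cells`, `row`, and `content` key names are assumed, a cell's content is counted only toward its starting column (so a non-empty spanning cell can still shrink), and the function returns a copy instead of mutating its input.

```python
from typing import Any, Dict, List


def trim_empty_columns(table: Dict[str, Any]) -> Dict[str, Any]:
    """Sketch: drop columns whose cells are all empty; returns a new dict."""
    cells: List[Dict[str, Any]] = table.get("cells") or []
    declared = table.get("columns") or table.get("cols") or 0
    # Column count implied by the cells (covers columns == 0 or missing).
    implied = max(
        (c.get("col", 0) + max(1, c.get("col_span", 1)) for c in cells),
        default=0,
    )
    num_cols = max(declared, implied)

    # Assumption: a column is kept if a cell starting in it has content.
    keep = {c.get("col", 0) for c in cells if str(c.get("content") or "").strip()}
    removed = set(range(num_cols)) - keep

    # Map old column index -> new contiguous index for surviving columns.
    new_index: Dict[int, int] = {}
    for old in range(num_cols):
        if old not in removed:
            new_index[old] = len(new_index)

    new_cells = []
    for cell in cells:
        start = cell.get("col", 0)
        span = max(1, cell.get("col_span", 1))
        surviving = [c for c in range(start, start + span) if c not in removed]
        if not surviving:
            continue  # the whole span lies in removed columns
        updated = dict(cell)  # shallow copy; bbox and other fields untouched
        updated["col"] = new_index[surviving[0]]
        updated["col_span"] = len(surviving)
        new_cells.append(updated)

    result = dict(table)
    result["cells"] = new_cells
    result["columns"] = len(new_index)
    if "cols" in result:
        result["cols"] = len(new_index)
    return result


table = {
    "columns": 0,  # missing/zero metadata is backfilled
    "cells": [
        {"row": 0, "col": 0, "col_span": 3, "content": "Title", "bbox": [0, 0, 300, 20]},
        {"row": 1, "col": 0, "content": "A"},
        {"row": 1, "col": 1, "content": " "},
        {"row": 1, "col": 2, "content": "B"},
    ],
}
cleaned = trim_empty_columns(table)
```

Here column 1 is removed, the title cell's `col_span` shrinks from 3 to 2, the whitespace-only cell is dropped, and the cell holding "B" moves from `col` 2 to `col` 1, while every `bbox` is left as it was.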
### Requirement: OCR Table Column Alignment by Bbox
(Optional Enhancement) When bbox coordinates are available for table cells, the OCR Track converter SHALL use cell bbox x0 coordinates to improve column alignment accuracy.
The system SHALL:
1. Sort cells by bbox `x0` coordinate before assigning column indices
2. Reassign `col` indices based on spatial position rather than HTML order
This requirement is optional and implementation MAY be deferred if bbox data is not reliably available.
#### Scenario: Cells reordered by bbox position
- **WHEN** bbox coordinates are available for table cells
- **AND** the original HTML order does not match spatial order
- **THEN** cells are reordered by `x0` coordinate
- **AND** `col` indices are reassigned to reflect spatial positioning
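
One possible shape for this optional step is sketched below. Everything here is an assumption made for illustration: the helper name `reorder_cells_by_x0`, a `bbox` stored as `[x0, y0, x1, y1]`, and the `row`/`content` keys. The sketch renumbers each row independently and ignores row spans, so it would not fix grids where rows disagree about column boundaries.

```python
from collections import defaultdict
from typing import Any, Dict, List


def reorder_cells_by_x0(cells: List[Dict[str, Any]]) -> List[Dict[str, Any]]:
    """Sketch: within each row, order cells by bbox x0 and renumber col."""
    rows: Dict[int, List[Dict[str, Any]]] = defaultdict(list)
    for cell in cells:
        rows[cell.get("row", 0)].append(cell)

    result: List[Dict[str, Any]] = []
    for row_index in sorted(rows):
        row_cells = rows[row_index]
        # Reorder only when every cell in the row carries a bbox.
        if all(c.get("bbox") for c in row_cells):
            next_col = 0
            for cell in sorted(row_cells, key=lambda c: c["bbox"][0]):
                updated = dict(cell)  # copy; bbox itself is not altered
                updated["col"] = next_col
                next_col += max(1, cell.get("col_span", 1))
                result.append(updated)
        else:
            # Without reliable bbox data, keep the original HTML order.
            result.extend(dict(c) for c in row_cells)
    return result


cells = [
    {"row": 0, "col": 0, "content": "B", "bbox": [120, 0, 200, 20]},
    {"row": 0, "col": 1, "content": "A", "bbox": [10, 0, 100, 20]},
    {"row": 1, "col": 0, "content": "x"},  # no bbox: row left untouched
    {"row": 1, "col": 1, "content": "y"},
]
ordered = reorder_cells_by_x0(cells)
```

Falling back to the original order when any cell in a row lacks a bbox matches the note that this step may be deferred when bbox data is not reliably available.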


@@ -0,0 +1,43 @@
# Tasks: Fix OCR Track Table Empty Columns
## 1. Core Implementation
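- [x] 1.1 Implement `trim_empty_columns(table_dict: Dict[str, Any]) -> Dict[str, Any]` in `ocr_to_unified_converter.py`
  - From the cells array, determine for each column whether its content is empty or whitespace-only in every row
  - Use `.strip()` to detect whitespace-only content
- [x] 1.2 Implement the column removal logic
  - Update the columns/cols value
  - Adjust each cell's col index
- [x] 1.3 Implement the col_span recalculation logic
  - If a span crosses a removed column, shrink the span
  - If the entire span falls on removed columns, remove the cell
- [x] 1.4 Call `trim_empty_columns` at the entry of `_convert_table_data`
  - Run the cleanup before building TableData
  - Also add the cleanup in `_extract_table_data` (HTML table parsing)
- [ ] 1.5 (Optional) Sort columns for alignment by bbox x0/x1
  - If a bbox grid is available, sort by x0 first, then reassign col indices
  - Deferred until the availability of bbox data is confirmed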
## 2. Testing & Validation
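- [x] 2.1 Unit tests pass
  - Basic empty-column removal
  - Empty header with values in the data rows (column not removed)
  - col_span crossing a removed column (span shrinks)
  - Cell entirely within removed columns (cell removed)
  - No empty columns (table unchanged)
- [x] 2.2 Check existing OCR results
  - No existing result contains a table with an entirely empty column
  - The implementation is in place and will clean up empty columns when they occur
- [x] 2.3 Confirm Direct/HYBRID tables are unchanged
  - `OCRToUnifiedConverter` is used only in `ocr_service.py`
  - The Direct track uses `DirectExtractionEngine` and is not affected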
## 3. Edge Cases & Validation
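- [x] 3.1 Handle `columns` being 0 or missing
  - Backfill with the calculated column count so downstream consumers do not break
- [x] 3.2 Handle an empty header with values in the data rows
  - Remove only columns that are empty in every row
- [x] 3.3 Ensure files under `backend/storage/results/...` are not modified directly
  - After changing the converter, re-run the tasks to verify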