OCR/openspec/changes/archive/2025-11-26-fix-ocr-table-empty-columns/tasks.md

# Tasks: Fix OCR Track Table Empty Columns
## 1. Core Implementation
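- [x] 1.1 Implement `trim_empty_columns(table_dict: Dict[str, Any]) -> Dict[str, Any]` in `ocr_to_unified_converter.py`
  - Using the cells array, determine for each column whether the content of every row is empty or whitespace
  - Use `.strip()` to detect whitespace-only content
- [x] 1.2 Implement the column removal logic
  - Update the columns/cols value
  - Adjust the col index of each cell
- [x] 1.3 Implement the col_span recalculation logic
  - If a span crosses a removed column, shrink the span
  - If the entire span falls on removed columns, remove the cell
- [x] 1.4 Call `trim_empty_columns` at the entry of `_convert_table_data`
  - Run the cleanup before building TableData
  - Also add the cleanup to `_extract_table_data` (HTML table parsing)
- [ ] 1.5 (Optional) Sort and align columns by bbox x0/x1
  - If a bbox grid is available, sort by x0 first, then reassign col indices
  - Deferred until the availability of bbox data is confirmed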
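
Tasks 1.1 to 1.3 can be illustrated with a minimal sketch. This is not the project's actual code: the table layout is an assumption, namely a dict with a `cells` list whose entries carry `row`, `col`, an optional `col_span`, and a `content` string (the `row` and `content` field names are hypothetical). A column is treated as non-empty only when a cell anchored in it has non-blank text, which is also an assumption.

```python
from typing import Any, Dict, List


def trim_empty_columns(table_dict: Dict[str, Any]) -> Dict[str, Any]:
    """Return a copy of table_dict without columns whose cells are all blank."""
    cells: List[Dict[str, Any]] = table_dict.get("cells") or []
    if not cells:
        return table_dict

    # Column count: the declared value, but never less than what the cells cover.
    declared = int(table_dict.get("cols") or table_dict.get("columns") or 0)
    covered = max(int(c.get("col", 0)) + int(c.get("col_span") or 1) for c in cells)
    n_cols = max(declared, covered)

    # 1.1: a column is kept if any cell anchored in it has non-blank content.
    keep = [False] * n_cols
    for c in cells:
        if str(c.get("content") or "").strip():
            keep[int(c.get("col", 0))] = True
    if all(keep):
        return table_dict  # nothing to trim

    # new_index[i] = number of kept columns to the left of original column i.
    new_index: List[int] = []
    kept = 0
    for flag in keep:
        new_index.append(kept)
        kept += int(flag)

    # 1.2 / 1.3: re-index cells, shrink spans, drop cells with no surviving column.
    new_cells: List[Dict[str, Any]] = []
    for c in cells:
        start = int(c.get("col", 0))
        span = int(c.get("col_span") or 1)
        surviving = [i for i in range(start, start + span) if keep[i]]
        if not surviving:
            continue
        new_cell = dict(c)  # shallow copy so the input is not mutated
        new_cell["col"] = new_index[surviving[0]]
        new_cell["col_span"] = len(surviving)
        new_cells.append(new_cell)

    result = dict(table_dict)
    result["cells"] = new_cells
    result["cols"] = kept
    if "columns" in result:
        result["columns"] = kept
    return result
```

The sketch returns the input object unchanged when there is nothing to trim, so callers can compare identity to tell whether a cleanup happened.
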
## 2. Testing & Validation
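- [x] 2.1 Unit tests pass
  - Basic empty-column removal
  - Header empty but data cells populated (column not removed)
  - col_span crossing a removed column (span shrinks)
  - Cell lying entirely within removed columns (cell removed)
  - No empty columns (table unchanged)
- [x] 2.2 Check existing OCR results
  - No table in the existing results has an entirely empty column
  - The implementation is in place and will clean up empty columns correctly when they occur
- [x] 2.3 Confirm Direct/HYBRID tables are unchanged
  - `OCRToUnifiedConverter` is used only in `ocr_service.py`
  - The Direct track uses `DirectExtractionEngine` and is not affected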
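
The audit in 2.2 could be done with a small read-only helper like the one below. It is a sketch under assumptions: `empty_column_indices` is a hypothetical name, and the table is assumed to be a dict with a `cells` list whose entries have `col` and a `content` string. It only reports which columns are blank in every row and changes nothing.

```python
from typing import Any, Dict, List


def empty_column_indices(table_dict: Dict[str, Any]) -> List[int]:
    """List the indices of columns in which no cell has non-blank content."""
    cells = table_dict.get("cells") or []
    if not cells:
        return []

    # Use the declared column count when it is larger than what the cells show.
    n_cols = max(int(c.get("col", 0)) + 1 for c in cells)
    n_cols = max(n_cols, int(table_dict.get("cols") or 0))

    has_text = [False] * n_cols
    for c in cells:
        # .strip() makes whitespace-only content count as empty.
        if str(c.get("content") or "").strip():
            has_text[int(c.get("col", 0))] = True

    return [i for i, filled in enumerate(has_text) if not filled]
```

An empty list for every table in a result set corresponds to the finding recorded in 2.2.
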
## 3. Edge Cases & Validation
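- [x] 3.1 Handle a columns field that is 0 or missing
  - Backfill with the computed column count so downstream consumers do not break
- [x] 3.2 Handle an empty header with populated data cells
  - Remove only columns that are empty in every row
- [x] 3.3 Ensure `backend/storage/results/...` is not modified directly
  - After changing the converter, re-run the task to verify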
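
The backfill in 3.1 might look roughly like this. The helper name `backfill_column_count` is hypothetical, and the sketch assumes the column count lives under a `cols` key and that each cell has `col` and an optional `col_span`; the real converter may store these differently.

```python
from typing import Any, Dict


def backfill_column_count(table_dict: Dict[str, Any]) -> Dict[str, Any]:
    """Return a copy whose 'cols' is filled in when it is 0 or missing."""
    cells = table_dict.get("cells") or []

    # The rightmost column touched by any cell, taking spans into account.
    computed = max(
        (int(c.get("col", 0)) + int(c.get("col_span") or 1) for c in cells),
        default=0,
    )

    result = dict(table_dict)
    if not result.get("cols"):  # covers a missing key, None, and 0
        result["cols"] = computed
    return result
```

A non-zero declared value is left as it is, so the helper only repairs tables that would otherwise reach downstream code without a usable column count.
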