OCR/openspec/changes/archive/2025-11-26-fix-ocr-table-empty-columns/proposal.md
egg 59206a6ab8 feat: simplify layout model selection and archive proposals
Changes:
- Replace PP-Structure 7-slider parameter UI with simple 3-option layout model selector
- Add layout model mapping: chinese (PP-DocLayout-S), default (PubLayNet), cdla
- Add LayoutModelSelector component and zh-TW translations
- Fix "default" model behavior with sentinel value for PubLayNet
- Add gap filling service for OCR track coverage improvement
- Add PP-Structure debug utilities
- Archive completed/incomplete proposals:
  - add-ocr-track-gap-filling (complete)
  - fix-ocr-track-table-rendering (incomplete)
  - simplify-ppstructure-model-selection (22/25 tasks)
- Add new layout model tests, archive old PP-Structure param tests
- Update OpenSpec ocr-processing spec with layout model requirements

🤖 Generated with [Claude Code](https://claude.com/claude-code)

Co-Authored-By: Claude <noreply@anthropic.com>
2025-11-27 13:27:00 +08:00


# Change: Fix OCR Track Table Empty Columns and Alignment
## Why
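Tables generated by PP-Structure often contain blank columns (the column is empty or whitespace in every row), which leaves the converted UnifiedDocument tables with empty columns and misaligned cells. The OCR Track currently uses the raw data without any cleanup, which degrades the quality of the PDF/JSON/Markdown output.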
## What Changes
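- Add a `trim_empty_columns()` function that removes empty columns from OCR Track tables
- Call the cleanup logic at the entry point of `_convert_table_data` so that TableData is clean
- Recompute col_span: if a span crosses a removed column, shrink the span
- Update the columns/cols count and adjust each cell's col index
- Optional: sort columns for alignment by bbox x0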
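
The trimming steps above can be sketched roughly as follows. This is a minimal illustration, not the project's implementation: the table is assumed to be a plain dict with hypothetical keys `cols`, `cells`, `row`, `col`, `col_span`, `content`, and `bbox`, which may not match the real `TableData` structure. The optional bbox-based column sort is left out.

```python
from copy import deepcopy
from typing import Any


def trim_empty_columns(table: dict[str, Any]) -> dict[str, Any]:
    """Return a copy of `table` without columns that are blank in every row.

    The dict layout (cols / cells / row / col / col_span / content / bbox)
    is a hypothetical stand-in for the real TableData structure.
    """
    cells = table["cells"]
    # A column is kept if any cell anchored in it has non-whitespace content.
    keep = sorted({c["col"] for c in cells if str(c.get("content") or "").strip()})
    if not keep or len(keep) == table["cols"]:
        return deepcopy(table)  # nothing to trim, or a fully blank table
    new_index = {old: new for new, old in enumerate(keep)}

    trimmed = []
    for cell in cells:
        span = cell.get("col_span", 1)
        # Surviving columns that this cell originally covered.
        covered = [c for c in range(cell["col"], cell["col"] + span) if c in new_index]
        if not covered:
            continue  # the cell lived entirely in removed columns
        new_cell = deepcopy(cell)  # bbox and other fields are left untouched
        new_cell["col"] = new_index[covered[0]]
        new_cell["col_span"] = len(covered)  # shrink spans that crossed removed columns
        trimmed.append(new_cell)

    result = {k: deepcopy(v) for k, v in table.items() if k != "cells"}
    result["cols"] = len(keep)
    result["cells"] = trimmed
    return result
```

The function returns a new table rather than mutating its input, so the caller can compare before and after while debugging. Table and cell bboxes are copied through unchanged.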
## Impact
- Affected specs: `ocr-processing`
- Affected code:
  - `backend/app/services/ocr_to_unified_converter.py` (primary change)
- Direct/HYBRID paths are not affected
- PDF/JSON/Markdown output will be cleaner
## Constraints
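- Keep the table bbox and page coordinates unchanged
- Do not modify the Direct/HYBRID paths
- Remove only columns that are empty in every row; a column whose header is empty but whose data cells have values must not be removed
- Preserve the original bbox to avoid PDF layout drift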
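
The "empty in every row" rule can be illustrated on a plain grid of strings. This is a hedged sketch: `removable_columns` is a hypothetical helper, and a list-of-lists grid is a simplification of whatever structure the converter actually uses.

```python
def removable_columns(grid: list[list[str]]) -> list[int]:
    """Indices of columns that are empty or whitespace-only in every row."""
    if not grid:
        return []
    width = max(len(row) for row in grid)
    return [
        col
        for col in range(width)
        # A missing cell in a short row counts as blank.
        if all(col >= len(row) or not (row[col] or "").strip() for row in grid)
    ]


grid = [
    ["Name", "", ""],   # header row: columns 1 and 2 have no header text
    ["a", "", "1"],
    ["b", " ", "2"],
]
print(removable_columns(grid))  # [1]
```

Column 2 has an empty header but carries data, so it is kept. Only column 1, which is blank in all rows, qualifies for removal.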