feat: simplify layout model selection and archive proposals

Changes:
- Replace PP-Structure 7-slider parameter UI with simple 3-option layout model selector
- Add layout model mapping: chinese (PP-DocLayout-S), default (PubLayNet), cdla
- Add LayoutModelSelector component and zh-TW translations
- Fix "default" model behavior with sentinel value for PubLayNet
- Add gap filling service for OCR track coverage improvement
- Add PP-Structure debug utilities
- Archive completed/incomplete proposals:
  - add-ocr-track-gap-filling (complete)
  - fix-ocr-track-table-rendering (incomplete)
  - simplify-ppstructure-model-selection (22/25 tasks)
- Add new layout model tests, archive old PP-Structure param tests
- Update OpenSpec ocr-processing spec with layout model requirements

🤖 Generated with [Claude Code](https://claude.com/claude-code)

Co-Authored-By: Claude <noreply@anthropic.com>
2025-11-27 13:27:00 +08:00
parent c65df754cf
commit 59206a6ab8
35 changed files with 3621 additions and 658 deletions
--- a/openspec/changes/archive/2025-11-26-fix-ocr-table-empty-columns/proposal.md
+++ b/openspec/changes/archive/2025-11-26-fix-ocr-table-empty-columns/proposal.md
@@ -0,0 +1,28 @@
+# Change: Fix OCR Track Table Empty Columns and Alignment
+
+## Why
+
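+Tables generated by PP-Structure often contain empty columns (every row's cell in that column is empty or whitespace-only), which leaves the converted UnifiedDocument tables with blank columns and misaligned cells. The OCR Track currently uses the raw data without any cleanup, which degrades the quality of the PDF/JSON/Markdown output.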
+
+## What Changes
+
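+- Add a `trim_empty_columns()` function that removes empty columns from OCR Track tables
+- Call the cleanup logic at the entry of `_convert_table_data` so that TableData is clean
+- Recalculate col_span: if a span crosses a removed column, shrink the span
+- Update the columns/cols value and adjust each cell's col index
+- Optional: sort and align columns by bbox x0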
+
+## Impact
+
+- Affected specs: `ocr-processing`
+- Affected code:
+  - `backend/app/services/ocr_to_unified_converter.py` (primary change)
+- Direct/HYBRID paths are not affected
+- PDF/JSON/Markdown output will be cleaner
+
+## Constraints
+
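+- Keep the table bbox and page coordinates unchanged
+- Do not modify the Direct/HYBRID paths
+- Remove only columns that are empty in every row; a column with an empty header but non-empty data must not be removed
+- Preserve the original bbox to avoid layout drift in the PDF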
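
The cleanup described in this proposal can be sketched as follows. This is a minimal illustration, not the code in `ocr_to_unified_converter.py`: the function name `trim_empty_columns_sketch` and the table layout (a `cells` list whose entries carry `row`, `col`, `col_span`, and `content`, plus a `cols` count) are assumptions made for the example. A spanning cell with content is assumed to mark every column it covers as non-empty.

```python
from typing import Any, Dict, List


def trim_empty_columns_sketch(table: Dict[str, Any]) -> Dict[str, Any]:
    """Return a copy of `table` with all-empty columns removed (hypothetical schema)."""
    cells: List[Dict[str, Any]] = table.get("cells", [])
    if not cells:
        return dict(table)

    # Derive the column count from the cells rather than trusting `cols`,
    # which may be 0 or missing.
    n_cols = max(c["col"] + c.get("col_span", 1) for c in cells)

    # A column is kept if any cell covering it has non-whitespace content.
    has_content = [False] * n_cols
    for c in cells:
        if str(c.get("content") or "").strip():
            for k in range(c["col"], c["col"] + c.get("col_span", 1)):
                has_content[k] = True

    # Map each kept old column index to its new, contiguous index.
    new_index: Dict[int, int] = {}
    for old, keep in enumerate(has_content):
        if keep:
            new_index[old] = len(new_index)

    new_cells = []
    for c in cells:
        span = range(c["col"], c["col"] + c.get("col_span", 1))
        kept = [k for k in span if k in new_index]
        if not kept:
            continue  # the whole span fell within removed columns
        # Shrink the span to the surviving columns; bbox and other keys are copied as-is.
        new_cells.append({**c, "col": new_index[kept[0]], "col_span": len(kept)})

    # Backfill the column count with the calculated value.
    return {**table, "cells": new_cells, "cols": len(new_index)}
```

The sketch builds new dicts instead of mutating the input, so the table bbox and each cell's bbox pass through untouched.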
--- a/openspec/changes/archive/2025-11-26-fix-ocr-table-empty-columns/specs/ocr-processing/spec.md
+++ b/openspec/changes/archive/2025-11-26-fix-ocr-table-empty-columns/specs/ocr-processing/spec.md
@@ -0,0 +1,61 @@
+## ADDED Requirements
+
+### Requirement: OCR Table Empty Column Cleanup
+
+The OCR Track converter SHALL clean up PP-Structure generated tables by removing columns where all rows have empty or whitespace-only content.
+
+The system SHALL:
+1. Identify columns where every cell's content is empty or contains only whitespace (using `.strip()` to determine emptiness)
+2. Remove identified empty columns from the table structure
+3. Update the `columns`/`cols` value to reflect the new column count
+4. Recalculate each cell's `col` index to maintain continuity
+5. Adjust `col_span` values when spans cross removed columns (shrink span size)
+6. Remove cells entirely when their complete span falls within removed columns
+7. Preserve original bbox and page coordinates (no layout drift)
+8. If `columns` is 0 or missing after cleanup, fill with the calculated column count
+
+The cleanup SHALL NOT:
+- Remove columns where the header is empty but data rows contain values
+- Modify tables in Direct or HYBRID track
+- Alter the original bbox coordinates
+
+#### Scenario: All rows in column are empty
+- **WHEN** a table has a column where all cells contain only empty or whitespace content
+- **THEN** that column is removed
+- **AND** remaining cells have their `col` indices decremented appropriately
+- **AND** `cols` count is reduced by 1
+
+#### Scenario: Column has empty header but data has values
+- **WHEN** a table has a column where the header cell is empty
+- **AND** at least one data row cell in that column contains non-whitespace content
+- **THEN** that column is NOT removed
+
+#### Scenario: Cell span crosses removed column
+- **WHEN** a cell has `col_span > 1`
+- **AND** one or more columns within the span are removed
+- **THEN** the `col_span` is reduced by the number of removed columns within the span
+
+#### Scenario: Cell span entirely within removed columns
+- **WHEN** a cell's entire span falls within columns that are all removed
+- **THEN** that cell is removed from the table
+
+#### Scenario: Missing columns metadata
+- **WHEN** the table dict has `columns` set to 0 or missing
+- **AND** cleanup has been performed
+- **THEN** `columns` is set to the calculated number of remaining columns
+
+### Requirement: OCR Table Column Alignment by Bbox
+
+(Optional Enhancement) When bbox coordinates are available for table cells, the OCR Track converter SHALL use cell bbox x0 coordinates to improve column alignment accuracy.
+
+The system SHALL:
+1. Sort cells by bbox `x0` coordinate before assigning column indices
+2. Reassign `col` indices based on spatial position rather than HTML order
+
+This requirement is optional and implementation MAY be deferred if bbox data is not reliably available.
+
+#### Scenario: Cells reordered by bbox position
+- **WHEN** bbox coordinates are available for table cells
+- **AND** the original HTML order does not match spatial order
+- **THEN** cells are reordered by `x0` coordinate
+- **AND** `col` indices are reassigned to reflect spatial positioning
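
The optional bbox alignment requirement above could look roughly like this. It is a sketch only: the task list marks this step as deferred, so nothing here reflects shipped code. The function name `realign_columns_by_x0` and the cell layout (`row`, `col`, `col_span`, and a `bbox` of `[x0, y0, x1, y1]`) are assumptions for illustration.

```python
from collections import defaultdict
from typing import Any, Dict, List


def realign_columns_by_x0(cells: List[Dict[str, Any]]) -> List[Dict[str, Any]]:
    """Reassign `col` per row from left-to-right bbox order (hypothetical schema)."""
    # Without bbox data for every cell, leave the original order alone.
    if any("bbox" not in c for c in cells):
        return [dict(c) for c in cells]

    rows: Dict[int, List[Dict[str, Any]]] = defaultdict(list)
    for c in cells:
        rows[c["row"]].append(c)

    out: List[Dict[str, Any]] = []
    for r in sorted(rows):
        next_col = 0
        # Sort by x0, then hand out column indices, advancing by each cell's span.
        for c in sorted(rows[r], key=lambda cell: cell["bbox"][0]):
            out.append({**c, "col": next_col})
            next_col += c.get("col_span", 1)
    return out
```

Sorting within each row is the simplest reading of the requirement; aligning cells across rows to a shared column grid would need extra logic that the spec does not describe.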
--- a/openspec/changes/archive/2025-11-26-fix-ocr-table-empty-columns/tasks.md
+++ b/openspec/changes/archive/2025-11-26-fix-ocr-table-empty-columns/tasks.md
@@ -0,0 +1,43 @@
+# Tasks: Fix OCR Track Table Empty Columns
+
+## 1. Core Implementation
+
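+- [x] 1.1 Implement `trim_empty_columns(table_dict: Dict[str, Any]) -> Dict[str, Any]` in `ocr_to_unified_converter.py`
+  - Use the cells array to determine, for each column, whether its content is empty/whitespace in every row
+  - Use `.strip()` to detect whitespace-only content
+- [x] 1.2 Implement column removal logic
+  - Update the columns/cols value
+  - Adjust each cell's col index
+- [x] 1.3 Implement col_span recalculation
+  - If a span crosses a removed column, shrink the span
+  - If the entire span falls within removed columns, remove the cell
+- [x] 1.4 Call `trim_empty_columns` at the entry of `_convert_table_data`
+  - Run the cleanup before building TableData
+  - Also add the cleanup to `_extract_table_data` (HTML table parsing)
+- [ ] 1.5 (Optional) Sort and align columns by bbox x0/x1
+  - If a bbox grid is available, sort by x0 first, then reassign col indices
+  - Deferred until the availability of bbox data is confirmed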
+
+## 2. Testing & Validation
+
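+- [x] 2.1 Unit tests pass
+  - Test basic empty-column removal
+  - Test empty header with non-empty data (column not removed)
+  - Test col_span crossing a removed column (span shrinks)
+  - Test cell falling entirely within removed columns (cell removed)
+  - Test tables with no empty columns (unchanged)
+- [x] 2.2 Check existing OCR results
+  - No table in the existing results has an entirely empty column
+  - The implementation is in place and will clean up empty columns correctly when they occur
+- [x] 2.3 Confirm Direct/HYBRID tables are unchanged
+  - `OCRToUnifiedConverter` is used only in `ocr_service.py`
+  - The Direct track uses `DirectExtractionEngine` and is not affected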
+
+## 3. Edge Cases & Validation
+
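+- [x] 3.1 Handle the case where the columns field is 0 or missing
+  - Backfill with the calculated column count so downstream consumers do not break
+- [x] 3.2 Handle the case where the header is empty but the data has values
+  - Remove only columns that are empty in every row
+- [x] 3.3 Ensure `backend/storage/results/...` is not modified directly
+  - The change is in the converter; tasks must be re-run to verify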