chore: backup before code cleanup

Backup commit before executing remove-unused-code proposal. This includes all pending changes and new features.

🤖 Generated with [Claude Code](https://claude.com/claude-code)

Co-Authored-By: Claude Opus 4.5 <noreply@anthropic.com>
2025-12-11 11:55:39 +08:00
parent eff9b0bcd5
commit 940a406dce
58 changed files with 8226 additions and 175 deletions
--- a/openspec/changes/fix-ocr-track-table-rendering/design.md
+++ b/openspec/changes/fix-ocr-track-table-rendering/design.md
@@ -0,0 +1,88 @@
+## Context
+
+The OCR Track processes documents with PP-StructureV3: it converts the PDF to PNG images (150 DPI) for OCR, converts the results to the UnifiedDocument format, and then generates the output PDF.
+
+Current problems:
+1. Table HTML content is not extracted in the bbox-overlap matching path.
+2. Coordinate scaling during PDF generation produces abnormal text sizes.
+
+## Goals / Non-Goals
+
+**Goals:**
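+- Fix table HTML extraction so that every table has correct `html` and `extracted_text` values
+- Fix the coordinate-system problem in PDF generation so that text is rendered at the correct size
+- Leave the Direct Track and Hybrid Track unaffected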
+
+**Non-Goals:**
+- No change to how PP-StructureV3 is invoked
+- No change to the UnifiedDocument data structure
+- No change to the frontend API
+
+## Decisions
+
+### Decision 1: Fix table HTML extraction
+
+**Location**: `pp_structure_enhanced.py` L527-534
+
+**Change**: when bbox-overlap matching succeeds, also extract `pred_html`:
+
+```python
+if best_match and best_overlap > 0.1:
+    cell_boxes = best_match['cell_box_list']
+    element['cell_boxes'] = [[float(c) for c in box] for box in cell_boxes]
+    element['cell_boxes_source'] = 'table_res_list'
+
+    # New: extract pred_html
+    if not html_content and 'pred_html' in best_match:
+        html_content = best_match['pred_html']
+        element['html'] = html_content
+        element['extracted_text'] = self._extract_text_from_html(html_content)
+        logger.info("[TABLE] Extracted HTML from table_res_list (bbox match)")
+```
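The snippet above calls `self._extract_text_from_html`, whose body is not shown in this change. As a rough illustration only, one way such a helper could work is sketched below with the standard-library `html.parser`. The class name `_TableTextExtractor`, the free function `extract_text_from_html`, and the output layout (one line per row, cells separated by spaces) are assumptions, not the project's actual implementation.

```python
from html.parser import HTMLParser


class _TableTextExtractor(HTMLParser):
    """Hypothetical helper: collect the text of each <td>/<th> cell, row by row."""

    def __init__(self):
        super().__init__()
        self.rows = []     # finished rows, each a list of cell strings
        self._row = None   # cells of the row currently being parsed
        self._cell = None  # text fragments of the cell currently being parsed

    def handle_starttag(self, tag, attrs):
        if tag == "tr":
            self._row = []
        elif tag in ("td", "th"):
            self._cell = []

    def handle_data(self, data):
        # Only keep text that appears inside a table cell.
        if self._cell is not None:
            self._cell.append(data)

    def handle_endtag(self, tag):
        if tag in ("td", "th") and self._cell is not None:
            # Collapse internal whitespace and trim the cell text.
            text = " ".join("".join(self._cell).split())
            if self._row is None:
                self._row = []
            self._row.append(text)
            self._cell = None
        elif tag == "tr" and self._row is not None:
            self.rows.append(self._row)
            self._row = None


def extract_text_from_html(html_content: str) -> str:
    """Return table text: one line per row, non-empty cells joined by spaces."""
    parser = _TableTextExtractor()
    parser.feed(html_content)
    parser.close()
    return "\n".join(
        " ".join(cell for cell in row if cell) for row in parser.rows
    )
```

For example, `extract_text_from_html("<table><tr><td>A</td><td>B</td></tr></table>")` yields a single line containing both cell texts. Rows without a closing `</tr>` are dropped in this sketch; a production helper would need to decide how to treat malformed HTML.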
+
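+### Decision 2: OCR Track PDF coordinate-system handling
+
+**Option A (recommended)**: the OCR Track uses the OCR coordinate-system dimensions as the PDF page size
+
+- The PDF page size is taken directly from the OCR coordinate-system dimensions (e.g. 1275x1650 pixels → 1275x1650 pts)
+- No coordinate scaling is applied: scale_x = scale_y = 1.0
+- The font size is taken directly from the bbox height, with no extra calculation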
+
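+**Pros**:
+- Coordinate conversion is trivial, with no loss of precision
+- Font sizes are computed accurately
+- The PDF page aspect ratio matches the original document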
+
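+**Cons**:
+- The PDF page is larger: about 2.08x Letter size in each dimension (1275x1650 pts versus 612x792 pts for a Letter page rendered at 150 DPI)
+- Users may need to zoom out to view it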
+
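+**Option B**: keep Letter size and improve the scaling calculation
+
+- Keep the PDF page at 612x792 pts
+- Compute the DPI conversion factor correctly (72/150 = 0.48)
+- Make sure font sizes remain readable after scaling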
+
+**Choice**: adopt Option A, because it simplifies the implementation and avoids scaling precision problems.
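To make the difference between the two options concrete, the sketch below computes the page size and scale factors each one implies. It is an illustration under stated assumptions: `PageMapping`, `option_a`, and `option_b` are hypothetical names, bboxes are assumed to be `(x0, y0, x1, y1)` in OCR pixels, and the y-axis flip that a PDF library with a bottom-left origin would require is deliberately left out.

```python
from dataclasses import dataclass

LETTER_PTS = (612.0, 792.0)  # US Letter in PDF points (72 pt per inch)


@dataclass(frozen=True)
class PageMapping:
    """Hypothetical container for a PDF page size and OCR-to-PDF scale factors."""
    page_width: float
    page_height: float
    scale_x: float
    scale_y: float

    def to_pdf(self, bbox):
        """Map an OCR bbox (x0, y0, x1, y1) in pixels to PDF points."""
        x0, y0, x1, y1 = bbox
        return (x0 * self.scale_x, y0 * self.scale_y,
                x1 * self.scale_x, y1 * self.scale_y)

    def font_size(self, bbox):
        """Font size taken as the bbox height after mapping to PDF points."""
        _, y0, _, y1 = self.to_pdf(bbox)
        return y1 - y0


def option_a(ocr_width, ocr_height):
    """Option A: the page size equals the OCR dimensions, so no scaling."""
    return PageMapping(float(ocr_width), float(ocr_height), 1.0, 1.0)


def option_b(ocr_width, ocr_height, page_size=LETTER_PTS):
    """Option B: fixed page size, so OCR coordinates are scaled to fit it."""
    page_w, page_h = page_size
    return PageMapping(page_w, page_h, page_w / ocr_width, page_h / ocr_height)
```

For a 1275x1650 pixel page, `option_b` gives scale factors of 612/1275 and 792/1650, both equal to the 72/150 = 0.48 factor quoted above, so a text box 30 px tall becomes 14.4 pt. Under `option_a` the same box stays at 30 pt on a 1275x1650 pt page.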
+
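+### Decision 3: Adjust the table quality check
+
+**Current problem**: `_check_cell_boxes_quality()` filters out too many valid tables
+
+**Change**:
+1. Raise the `cell_density` threshold (3.0 → 5.0 cells per 10,000 px²)
+2. Lower the `min_avg_cell_area` threshold (3000 → 2000 px²)
+3. Add detailed logging that states exactly which metric failed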
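The three changes above could look roughly like the sketch below. The body of `_check_cell_boxes_quality()` is not part of this change, so everything here is an assumption: the function signature, the metric definitions (cell count per 10,000 px² of table area, and mean cell area), and the reading of `cell_density` as an upper bound and `min_avg_cell_area` as a lower bound.

```python
import logging

logger = logging.getLogger(__name__)


def check_cell_boxes_quality(cell_boxes, table_bbox,
                             max_cell_density=5.0, min_avg_cell_area=2000.0):
    """Hypothetical quality check; returns (ok, reasons).

    cell_boxes: list of (x0, y0, x1, y1) boxes in pixels.
    table_bbox: (x0, y0, x1, y1) of the whole table in pixels.
    """
    x0, y0, x1, y1 = table_bbox
    table_area = max(x1 - x0, 0.0) * max(y1 - y0, 0.0)
    if not cell_boxes or table_area <= 0:
        return False, ["no cell boxes or empty table bbox"]

    # Assumed metric definitions, not taken from the original code.
    cell_density = len(cell_boxes) / table_area * 10_000
    avg_cell_area = sum(
        (bx1 - bx0) * (by1 - by0) for bx0, by0, bx1, by1 in cell_boxes
    ) / len(cell_boxes)

    reasons = []
    if cell_density > max_cell_density:
        reasons.append(
            f"cell_density {cell_density:.2f} > {max_cell_density:.2f} cells/10000px²"
        )
    if avg_cell_area < min_avg_cell_area:
        reasons.append(
            f"avg_cell_area {avg_cell_area:.0f} < {min_avg_cell_area:.0f} px²"
        )

    # Log each failing metric separately so rejections can be diagnosed.
    for reason in reasons:
        logger.info("[TABLE] cell_boxes rejected: %s", reason)
    return not reasons, reasons
```

Under these assumed definitions, a table with a density of 4.0 and an average cell area of 2500 px² passes with the new thresholds but would have failed both checks with the old ones (3.0 and 3000), which is the kind of case the relaxation targets.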
+
+## Risks / Trade-offs
+
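+- **Risk**: changing the coordinate system may affect the existing PDF output format
+- **Mitigation**: the change applies only to the OCR Track; the Direct Track keeps its existing logic
+
+- **Risk**: relaxing the table quality check may cause some genuinely low-quality tables to be rendered
+- **Mitigation**: adjust the thresholds gradually, validating the effect on test documents first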
+
+## Open Questions
+
+1. Will the larger OCR Track PDF page size affect the user experience?
+2. Should there be a configuration option that lets users choose the PDF output size?