chore: archive upgrade-ppstructure-models proposal

Archived as 2025-11-27-upgrade-ppstructure-models
Spec updated: ocr-processing (added PP-StructureV3 Configuration)

🤖 Generated with [Claude Code](https://claude.com/claude-code)

Co-Authored-By: Claude <noreply@anthropic.com>
2025-11-27 14:22:33 +08:00
parent 6235280c45
commit 5448a047ff
5 changed files with 9 additions and 0 deletions
--- a/openspec/changes/archive/2025-11-27-upgrade-ppstructure-models/MODEL_CLEANUP.md
+++ b/openspec/changes/archive/2025-11-27-upgrade-ppstructure-models/MODEL_CLEANUP.md
@@ -0,0 +1,141 @@
+# PP-StructureV3 Model Cache Cleanup Guide
+
+## Overview
+
+After upgrading PP-StructureV3 models, older unused models may remain in the cache directory. This guide explains how to safely remove them to free disk space.
+
+## Model Cache Location
+
+PaddleX/PaddleOCR 3.x stores downloaded models in:
+
+```
+~/.paddlex/official_models/
+```
+
+## Models After Upgrade
+
+### Current Active Models (DO NOT DELETE)
+
+| Model | Purpose | Approx. Size |
+|-------|---------|--------------|
+| `PP-DocLayout_plus-L` | Layout detection for Chinese documents | ~350MB |
+| `SLANeXt_wired` | Table structure recognition (bordered tables) | ~351MB |
+| `SLANeXt_wireless` | Table structure recognition (borderless tables) | ~351MB |
+| `PP-FormulaNet_plus-L` | Formula recognition (Chinese + English) | ~800MB |
+| `PP-OCRv5_*` | Text detection and recognition | ~150MB |
+| `picodet_lcnet_x1_0_fgd_layout_cdla` | CDLA layout model option | ~10MB |
+
+### Deprecated Models (Safe to Delete)
+
+| Model | Reason | Approx. Size |
+|-------|--------|--------------|
+| `PP-DocLayout-S` | Replaced by PP-DocLayout_plus-L | ~50MB |
+| `SLANet` | Replaced by SLANeXt_wired/wireless | ~7MB |
+| `SLANet_plus` | Replaced by SLANeXt_wired/wireless | ~7MB |
+| `PP-FormulaNet-S` | Replaced by PP-FormulaNet_plus-L | ~200MB |
+| `PP-FormulaNet-L` | Replaced by PP-FormulaNet_plus-L | ~400MB |
+
+## Cleanup Commands
+
+### List Current Cache
+
+```bash
+# List all cached models
+ls -la ~/.paddlex/official_models/
+
+# Show disk usage per model
+du -sh ~/.paddlex/official_models/*
+```
+
+### Delete Deprecated Models
+
+```bash
+# Remove deprecated layout model
+rm -rf ~/.paddlex/official_models/PP-DocLayout-S
+
+# Remove deprecated table models
+rm -rf ~/.paddlex/official_models/SLANet
+rm -rf ~/.paddlex/official_models/SLANet_plus
+
+# Remove deprecated formula models (if present)
+rm -rf ~/.paddlex/official_models/PP-FormulaNet-S
+rm -rf ~/.paddlex/official_models/PP-FormulaNet-L
+```
+
+### Cleanup Script
+
+```bash
+#!/bin/bash
+# cleanup_old_models.sh - Remove deprecated PP-StructureV3 models
+
+CACHE_DIR="$HOME/.paddlex/official_models"
+
+echo "PP-StructureV3 Model Cleanup"
+echo "============================"
+echo ""
+
+# Check if cache directory exists
+if [ ! -d "$CACHE_DIR" ]; then
+    echo "Cache directory not found: $CACHE_DIR"
+    exit 0
+fi
+
+# List deprecated models
+DEPRECATED_MODELS=(
+    "PP-DocLayout-S"
+    "SLANet"
+    "SLANet_plus"
+    "PP-FormulaNet-S"
+    "PP-FormulaNet-L"
+)
+
+echo "Checking for deprecated models..."
+echo ""
+
+FOUND_COUNT=0
+for model in "${DEPRECATED_MODELS[@]}"; do
+    MODEL_PATH="$CACHE_DIR/$model"
+    if [ -d "$MODEL_PATH" ]; then
+        SIZE=$(du -sh "$MODEL_PATH" 2>/dev/null | cut -f1)
+        echo "Found: $model ($SIZE)"
+        FOUND_COUNT=$((FOUND_COUNT + 1))  # counts matches, not bytes
+    fi
+done
+
+if [ "$FOUND_COUNT" -eq 0 ]; then
+    echo "No deprecated models found. Cache is clean."
+    exit 0
+fi
+
+echo ""
+read -r -p "Delete these models? [y/N]: " confirm
+
+if [ "$confirm" = "y" ] || [ "$confirm" = "Y" ]; then
+    for model in "${DEPRECATED_MODELS[@]}"; do
+        MODEL_PATH="$CACHE_DIR/$model"
+        if [ -d "$MODEL_PATH" ]; then
+            rm -rf "$MODEL_PATH"
+            echo "Deleted: $model"
+        fi
+    done
+    echo ""
+    echo "Cleanup complete."
+else
+    echo "Cleanup cancelled."
+fi
+```
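+
+The script above deletes only the names in its hard-coded list. If that list is ever edited by hand, a guard based on the "DO NOT DELETE" table can catch mistakes. A minimal sketch: `is_active_model` and `safe_remove` are hypothetical helpers, not part of PaddleX or of this change, and the name patterns are copied from the active-models table in this guide.
+
+```bash
+#!/bin/bash
+# is_active_model NAME: succeed if NAME matches the "DO NOT DELETE" table.
+is_active_model() {
+    case "$1" in
+        PP-DocLayout_plus-L|SLANeXt_wired|SLANeXt_wireless|PP-FormulaNet_plus-L|PP-OCRv5_*|picodet_lcnet_x1_0_fgd_layout_cdla)
+            return 0 ;;
+        *)
+            return 1 ;;
+    esac
+}
+
+# safe_remove CACHE_DIR NAME: delete a cached model unless it is active.
+safe_remove() {
+    local cache_dir="$1" name="$2"
+    if is_active_model "$name"; then
+        echo "Refusing to delete active model: $name" >&2
+        return 1
+    fi
+    # ${var:?} aborts instead of expanding to "/" when a variable is empty
+    rm -rf -- "${cache_dir:?}/${name:?}"
+}
+```
+
+For example, `safe_remove "$HOME/.paddlex/official_models" SLANet` removes the directory, while the same call with `SLANeXt_wired` prints a refusal and returns a non-zero status.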
+
+## Space Savings Estimate
+
+After cleanup, expect to free approximately:
+- **~50MB** from the deprecated layout model (`PP-DocLayout-S`)
+- **~14MB** from the deprecated table models (`SLANet`, `SLANet_plus`)
+- **~600MB** from the deprecated formula models (`PP-FormulaNet-S`, `PP-FormulaNet-L`, if present)
+
+Total potential savings: **~664MB**
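+
+These figures are estimates, so it can be worth measuring the real total before deleting anything. A minimal sketch, assuming a POSIX `du` that supports `-sk`; `deprecated_total_kb` is a hypothetical helper name, and the model list is the one from the deprecated-models table in this guide.
+
+```bash
+#!/bin/bash
+# deprecated_total_kb CACHE_DIR: print the combined size, in KiB, of the
+# deprecated model directories that exist under CACHE_DIR.
+deprecated_total_kb() {
+    local cache_dir="$1" total=0 size model
+    for model in PP-DocLayout-S SLANet SLANet_plus PP-FormulaNet-S PP-FormulaNet-L; do
+        if [ -d "$cache_dir/$model" ]; then
+            # du -sk prints "<KiB><TAB><path>"; keep only the number
+            size=$(du -sk "$cache_dir/$model" | cut -f1)
+            total=$((total + size))
+        fi
+    done
+    echo "$total"
+}
+```
+
+Usage: `echo "$(( $(deprecated_total_kb "$HOME/.paddlex/official_models") / 1024 )) MiB reclaimable"`.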
+
+## Notes
+
+1. Models are downloaded on first use. Deleting active models will trigger re-download.
+2. The cache directory may vary if `PADDLEX_HOME` environment variable is set.
+3. Always verify which models your configuration uses before deleting.
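+
+Note 2 can be folded into the commands above by resolving the cache path once instead of hard-coding it. A minimal sketch: `resolve_cache_dir` is a hypothetical helper, the variable name `PADDLEX_HOME` is taken from note 2 and has not been checked against PaddleX itself, and the assumption is that it replaces the `~/.paddlex` base directory. Confirm both for your installed version.
+
+```bash
+#!/bin/bash
+# resolve_cache_dir: print the model cache directory.
+# ASSUMPTION: PADDLEX_HOME, when set, replaces the ~/.paddlex base directory.
+resolve_cache_dir() {
+    local base="${PADDLEX_HOME:-$HOME/.paddlex}"
+    # Strip one trailing slash so the result never contains "//"
+    echo "${base%/}/official_models"
+}
+```
+
+Usage: `du -sh "$(resolve_cache_dir)"/*` lists per-model disk usage in whichever directory is in effect.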
--- a/openspec/changes/archive/2025-11-27-upgrade-ppstructure-models/proposal.md
+++ b/openspec/changes/archive/2025-11-27-upgrade-ppstructure-models/proposal.md
@@ -0,0 +1,134 @@
+# Upgrade PP-StructureV3 Models
+
+## Why
+
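+The PP-StructureV3 model configuration currently used by the project has the following problems:
+
+1. **Layout detection model is not accurate enough**: PP-DocLayout-S (70.9% mAP) cannot correctly handle complex tables and layouts
+2. **Table recognition accuracy is low**: SLANet (59.52%) produces incorrect HTML structure
+3. **Preprocessing modules are not enabled**: document orientation correction and unwarping are turned off
+4. **Models take up too much space**: unused models were downloaded, wasting storage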
+
+## What Changes
+
+### Stage 1: Preprocessing Modules - Enable All
+
+| Feature | Current | After Change |
+|-----|-----|-------|
+| `use_doc_orientation_classify` | False | **True** |
+| `use_doc_unwarping` | False | **True** |
+| `use_textline_orientation` | False | **True** |
+
+### Stage 2: OCR Module - No Change
+
+- Continue using PP-OCRv5 (default configuration)
+- No changes required
+
+### Stage 3: Layout Analysis Module - Upgrade Model Options
+
+| Option Name | Current Model | Model After Change | mAP |
+|---------|---------|-----------|-----|
+| `chinese` | PP-DocLayout-S (removed) | **PP-DocLayout_plus-L** | 83.2% |
+| `default` | PubLayNet | PubLayNet (unchanged) | ~94% |
+| `cdla` | CDLA | CDLA (unchanged) | ~86% |
+
+**Key changes**:
+- Remove PP-DocLayout-S (70.9% mAP)
+- Add PP-DocLayout_plus-L (83.2% mAP, 20 categories)
+- The frontend "Chinese document" option now uses PP-DocLayout_plus-L
+
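+### Stage 4: Element Recognition Module - Upgrade Table Recognition
+
+| Module | Current Model | Model After Change | Accuracy Change |
+|-----|---------|-----------|-----------|
+| Table recognition | SLANet (default) | **SLANeXt_wired + SLANeXt_wireless** | 59.52% → 69.65% |
+| Formula recognition | PP-FormulaNet (default) | **PP-FormulaNet_plus-L** | 45.78% → 90.64% (Chinese) |
+| Chart parsing | PP-Chart2Table | PP-Chart2Table (unchanged) | - |
+| Seal recognition | PP-OCRv4_seal | PP-OCRv4_seal (unchanged) | - |
+
+**Table recognition strategy**:
+- Use SLANeXt_wired and SLANeXt_wireless together
+- First use a classifier to determine whether a table is wired or wireless
+- Select the corresponding SLANeXt model based on the table type
+- Combined test accuracy reaches 69.65%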
+
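+### Storage Optimization - Delete Unused Models
+
+PaddleOCR 3.x model cache location: `~/.paddlex/official_models/`
+
+Model directories that can be deleted:
+- PP-DocLayout-S (replaced by PP-DocLayout_plus-L)
+- SLANet (replaced by SLANeXt)
+- Other unused legacy models
+
+**Note**: After deletion, the first use of a new model triggers its download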
+
+## Requirements
+
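+### REQ-1: Enable Preprocessing Modules
+The system **SHALL** enable all preprocessing features when initializing PP-StructureV3:
+- Document orientation classification (use_doc_orientation_classify=True)
+- Document unwarping (use_doc_unwarping=True)
+- Text line orientation detection (use_textline_orientation=True)
+
+**Scenario: Processing a rotated scanned document**
+- Given a PDF document rotated by 90 degrees
+- When it is processed using the OCR track
+- Then the system should automatically correct the orientation before running OCR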
+
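+### REQ-2: Layout Model Upgrade
+The system **SHALL** change the model mapped to the "chinese" option from PP-DocLayout-S to PP-DocLayout_plus-L
+
+**Scenario: Processing a complex Chinese document**
+- Given a Chinese document containing tables, images, and formulas
+- When the "chinese" layout model is selected for processing
+- Then PP-DocLayout_plus-L (83.2% mAP) should be used for layout analysis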
+
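+### REQ-3: Table Recognition Upgrade
+The system **SHALL** use SLANeXt_wired and SLANeXt_wireless together for table recognition
+
+**Scenario: Processing wired tables**
+- Given a document containing wired tables
+- When table structure recognition is performed
+- Then the SLANeXt_wired model should be used
+- And the correct HTML table structure is output
+
+**Scenario: Processing wireless tables**
+- Given a document containing wireless tables
+- When table structure recognition is performed
+- Then the SLANeXt_wireless model should be used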
+
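+### REQ-4: Formula Recognition Upgrade
+The system **SHALL** use PP-FormulaNet_plus-L for formula recognition to support Chinese formulas
+
+### REQ-5: Model Cache Cleanup
+The system **SHOULD** provide a tool or documentation explaining how to clean up unused model caches to save storage space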
+
+## Model Comparison Data
+
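+### Table Recognition Model Comparison
+
+| Model | Accuracy | Inference Time | Model Size | Use Case |
+|-----|-------|---------|---------|---------|
+| SLANet | 59.52% | 24ms | 6.9 MB | ❌ Insufficient accuracy |
+| SLANet_plus | 63.69% | 23ms | 6.9 MB | ❌ Still insufficient |
+| **SLANeXt_wired** | 69.65% | 86ms | 351 MB | ✅ Wired tables |
+| **SLANeXt_wireless** | 69.65% | - | 351 MB | ✅ Wireless tables |
+
+**Conclusion**: The SLANeXt series is about 10 percentage points more accurate than SLANet (about 6 points more than SLANet_plus), but the models are about 50 times larger. Since table recognition is a core feature, upgrading is recommended.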
+
+### Layout Detection Model Comparison
+
+| Model | Categories | mAP | Inference Time | Use Case |
+|-----|-------|-----|---------|---------|
+| PP-DocLayout-S | 23 | 70.9% | 12ms | ❌ Insufficient accuracy |
+| PP-DocLayout-L | 23 | 90.4% | 34ms | ✅ General-purpose, high accuracy |
+| **PP-DocLayout_plus-L** | 20 | 83.2% | 53ms | ✅ Recommended for complex documents |
+
+## References
+
+- [PaddleOCR Table Structure Recognition](http://www.paddleocr.ai/main/en/version3.x/module_usage/table_structure_recognition.html)
+- [SLANeXt_wired on HuggingFace](https://huggingface.co/PaddlePaddle/SLANeXt_wired)
+- [SLANeXt_wireless on HuggingFace](https://huggingface.co/PaddlePaddle/SLANeXt_wireless)
+- [PP-StructureV3 Technical Report](https://arxiv.org/html/2507.05595v1)
+- [PaddleOCR Model Cache Issue](https://github.com/PaddlePaddle/PaddleOCR/issues/10234)
--- a/openspec/changes/archive/2025-11-27-upgrade-ppstructure-models/specs/ocr-processing/spec.md
+++ b/openspec/changes/archive/2025-11-27-upgrade-ppstructure-models/specs/ocr-processing/spec.md
@@ -0,0 +1,56 @@
+## ADDED Requirements
+
+### Requirement: PP-StructureV3 Configuration
+
+The system SHALL configure PP-StructureV3 with the following settings:
+
+**Preprocessing (Stage 1):**
+- Document orientation classification MUST be enabled (`use_doc_orientation_classify=True`)
+- Document unwarping MUST be enabled (`use_doc_unwarping=True`)
+- Textline orientation detection MUST be enabled (`use_textline_orientation=True`)
+
+**Layout Detection (Stage 3):**
+- The `chinese` layout model option SHALL use PP-DocLayout_plus-L (83.2% mAP)
+- The `default` layout model option SHALL use PubLayNet for English documents
+- The `cdla` layout model option SHALL use picodet_lcnet_x1_0_fgd_layout_cdla
+
+**Element Recognition (Stage 4):**
+- Table structure recognition SHALL use SLANeXt_wired and SLANeXt_wireless models (69.65% combined accuracy)
+- Formula recognition SHALL use PP-FormulaNet_plus-L (92.22% English, 90.64% Chinese BLEU)
+- Chart parsing SHALL use PP-Chart2Table
+- Seal recognition SHALL use PP-OCRv4_seal
+
+#### Scenario: Processing rotated scanned document
+- **WHEN** a PDF document with rotated pages is processed using OCR track
+- **THEN** the system SHALL automatically detect and correct the orientation before OCR processing
+
+#### Scenario: Processing complex Chinese document with tables
+- **WHEN** a Chinese document containing tables, images, and formulas is processed
+- **AND** the user selects "chinese" layout model
+- **THEN** the system SHALL use PP-DocLayout_plus-L for layout detection (83.2% mAP)
+- **AND** the system SHALL correctly identify table regions
+
+#### Scenario: Table structure recognition with wired tables
+- **WHEN** a document contains wired (bordered) tables
+- **THEN** the system SHALL use SLANeXt_wired model for structure recognition
+- **AND** output correct HTML table structure with proper row/column spanning
+
+#### Scenario: Table structure recognition with wireless tables
+- **WHEN** a document contains wireless (borderless) tables
+- **THEN** the system SHALL use SLANeXt_wireless model for structure recognition
+
+#### Scenario: Chinese formula recognition
+- **WHEN** a document contains mathematical formulas with Chinese characters
+- **THEN** the system SHALL use PP-FormulaNet_plus-L for recognition
+- **AND** output LaTeX code with correct Chinese character representation
+
+## ADDED Requirements
+
+### Requirement: Model Cache Cleanup
+
+The system SHALL provide documentation for cleaning up unused model caches to optimize storage space.
+
+#### Scenario: User wants to free disk space after model upgrade
+- **WHEN** the user has upgraded from older models (PP-DocLayout-S, SLANet) to newer models
+- **THEN** the documentation SHALL explain how to delete unused cached models from `~/.paddlex/official_models/`
+- **AND** list which model directories can be safely removed
--- a/openspec/changes/archive/2025-11-27-upgrade-ppstructure-models/tasks.md
+++ b/openspec/changes/archive/2025-11-27-upgrade-ppstructure-models/tasks.md
@@ -0,0 +1,77 @@
+# Tasks: Upgrade PP-StructureV3 Models
+
+## 1. Backend Configuration Changes
+
+- [x] 1.1 Update `backend/app/core/config.py` - Enable preprocessing flags
+  - Set `use_doc_orientation_classify` default to True
+  - Set `use_doc_unwarping` default to True
+  - Set `use_textline_orientation` default to True
+  - Add `table_structure_model_name` configuration
+  - Add `formula_recognition_model_name` configuration
+
+- [x] 1.2 Update `backend/app/services/ocr_service.py` - Model mapping changes
+  - Update `LAYOUT_MODEL_MAPPING`:
+    - Change `"chinese"` from `"PP-DocLayout-S"` to `"PP-DocLayout_plus-L"`
+    - Keep `"default"` as PubLayNet
+    - Keep `"cdla"` as is
+  - Update `_ensure_structure_engine()`:
+    - Pass preprocessing flags to PPStructureV3
+    - Configure SLANeXt models for table recognition
+    - Configure PP-FormulaNet_plus-L for formula recognition
+
+- [x] 1.3 Update PPStructureV3 initialization kwargs
+  - Add `table_structure_model_name="SLANeXt_wired"` (or configure dual model)
+  - Add `formula_recognition_model_name="PP-FormulaNet_plus-L"`
+  - Verify preprocessing flags are passed correctly
+
+## 2. Schema Updates
+
+- [x] 2.1 Update `backend/app/schemas/task.py` - LayoutModelEnum
+  - Rename or update `CHINESE` description to reflect PP-DocLayout_plus-L
+  - Update docstrings to reflect new model capabilities
+
+## 3. Frontend Updates
+
+- [x] 3.1 Update `frontend/src/components/LayoutModelSelector.tsx`
+  - Update Chinese option description to mention PP-DocLayout_plus-L
+  - Update accuracy information displayed to users
+
+- [x] 3.2 Update `frontend/src/i18n/locales/zh-TW.json`
+  - Update `layoutModel.chinese.description` to reflect new model
+  - Update any accuracy percentages in descriptions
+
+## 4. Testing
+
+- [x] 4.1 Create unit tests for new model configuration
+  - Test preprocessing flags are correctly passed
+  - Test model mapping resolves correctly
+  - Test engine initialization with new models
+
+- [ ] 4.2 Integration testing with real documents
+  - Test rotated document handling (preprocessing)
+  - Test complex Chinese document layout detection
+  - Test table structure recognition accuracy
+  - Test formula recognition with Chinese formulas
+
+- [x] 4.3 Update existing tests
+  - Update `backend/tests/services/test_layout_model.py` for new mapping
+  - Update `backend/tests/api/test_layout_model_api.py` if needed
+
+## 5. Documentation
+
+- [x] 5.1 Create model cleanup documentation
+  - Document `~/.paddlex/official_models/` cache location
+  - List models that can be safely deleted after upgrade
+  - Provide cleanup script/commands
+  - See: [MODEL_CLEANUP.md](./MODEL_CLEANUP.md)
+
+- [x] 5.2 Update API documentation
+  - Document preprocessing feature behavior
+  - Update layout model descriptions
+
+## 6. Verification & Deployment
+
+- [ ] 6.1 Verify new models download correctly on first use
+- [ ] 6.2 Measure memory/GPU usage with new models
+- [ ] 6.3 Compare processing speed before/after upgrade
+- [ ] 6.4 Verify existing functionality not broken