# Tool_OCR Architecture Overhaul Plan
## Refactoring Plan Based on the Full Capabilities of PaddleOCR PP-StructureV3

**Planning date**: 2025-01-18
**Hardware**: RTX 4060 8GB VRAM
**Priority**: P0 (highest)

---

## 📊 Current State Analysis

### Problems with the Current Architecture

#### 1. **PP-StructureV3 capabilities are largely wasted**
```python
# ❌ Current implementation (ocr_service.py:614-646), excerpt
markdown_dict = page_result.markdown  # only the simplified Markdown output is used
markdown_texts = markdown_dict.get('markdown_texts', '')
'bbox': [],  # coordinates are always empty!
```

**Problems**:
- Only about 20% of PP-StructureV3's functionality is used
- `parsing_res_list` (the core data structure) is not used
- `layout_bbox` (precise coordinates) is not used
- `reading_order` (reading order) is not used
- The 23 layout element categories are not used

#### 2. **GPU configuration is not optimized**
```python
# Current configuration (ocr_service.py:211-219)
self.structure_engine = PPStructureV3(
    use_doc_orientation_classify=False,  # ❌ document orientation preprocessing disabled
    use_doc_unwarping=False,             # ❌ document unwarping disabled
    use_textline_orientation=False,      # ❌ text-line orientation correction disabled
    # ... everything else uses the defaults
)
```

**Problems**:
- The RTX 4060 8GB can run the server models, yet the default configuration is used
- Important preprocessing features are turned off
- GPU compute is underused

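#### 3. **Only one PDF generation strategy**
```python
# Only coordinate-positioning mode exists today,
# and it loses 21.6% of the text (text overlapping other regions is filtered out)
filtered_text_regions = self._filter_text_in_regions(text_regions, regions_to_avoid)
```

**Problems**:
- Only coordinate positioning is supported; there is no flow layout
- Zero information loss cannot be achieved
- Translation support is limited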
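
To make the loss mechanism concrete, here is a minimal sketch of an overlap filter. It is an illustration only: the real `_filter_text_in_regions()` is not shown in this plan, so the `overlaps` and `filter_text` helpers and the sample boxes below are hypothetical, and the actual rule may differ (for example, it may use an overlap threshold).

```python
from typing import List, Tuple

Box = Tuple[float, float, float, float]  # (x1, y1, x2, y2), top-left origin


def overlaps(a: Box, b: Box) -> bool:
    """True when two axis-aligned boxes share any interior area."""
    return a[0] < b[2] and b[0] < a[2] and a[1] < b[3] and b[1] < a[3]


def filter_text(text_boxes: List[Box], avoid: List[Box]) -> List[Box]:
    """Drop every text box that touches a region to avoid (hypothetical rule)."""
    return [t for t in text_boxes if not any(overlaps(t, r) for r in avoid)]


# Three text lines; the last two intersect a table region and are discarded.
text_boxes = [(0, 0, 100, 20), (0, 30, 100, 50), (0, 90, 100, 110)]
table_region = [(0, 40, 200, 100)]

kept = filter_text(text_boxes, table_region)
loss_ratio = 1 - len(kept) / len(text_boxes)
print(f"kept {len(kept)} of {len(text_boxes)} boxes, loss = {loss_ratio:.1%}")
```

Any text line that merely grazes a table or figure box is dropped entirely, which is why a flow mode that never filters is attractive for translation.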

---

## 🎯 Refactoring Goals

### Core Goals

1. **Make full use of PP-StructureV3**
   - Extract `parsing_res_list` (23 element categories + reading order)
   - Extract `layout_bbox` (precise coordinates)
   - Extract `layout_det_res` (layout detection details)
   - Extract `overall_ocr_res` (coordinates of all text)

2. **Dual-mode PDF generation**
   - Mode A: coordinate positioning (faithful reproduction of the layout)
   - Mode B: flow layout (zero information loss, translation-friendly)

3. **Optimized GPU configuration**
   - Configuration tuned for the RTX 4060 8GB
   - Server models + all feature modules
   - Sensible memory management

4. **Backward compatibility**
   - Keep the existing API
   - Old JSON files remain usable
   - Incremental upgrade

---

## 🏗️ New Architecture Design

### Architecture Layers

```
┌──────────────────────────────────────────────────────┐
│                      API Layer                       │
│  /tasks, /results, /download (backward compatible)   │
└────────────────┬─────────────────────────────────────┘
                 │
┌────────────────▼─────────────────────────────────────┐
│                    Service Layer                     │
├──────────────────────────────────────────────────────┤
│  OCRService (existing, kept)                         │
│  └─ analyze_layout() [upgraded] ──┐                  │
│                                   │                  │
│  AdvancedLayoutExtractor (new) ◄──┤ same engine      │
│  └─ extract_complete_layout() ────┘                  │
│                                                      │
│  PDFGeneratorService (refactored)                    │
│  ├─ _generate_coordinate_pdf() [Mode A]              │
│  └─ _generate_flow_pdf()       [Mode B]              │
└────────────────┬─────────────────────────────────────┘
                 │
┌────────────────▼─────────────────────────────────────┐
│                     Engine Layer                     │
├──────────────────────────────────────────────────────┤
│  PPStructureV3Engine (new, centralized management)   │
│  ├─ GPU configuration (tuned for RTX 4060 8GB)       │
│  ├─ Model configuration (server models)              │
│  └─ Feature switches (all enabled)                   │
└──────────────────────────────────────────────────────┘
```

### Core Class Design

#### 1. PPStructureV3Engine (new)
**Purpose**: manage the PP-StructureV3 engine in one place and avoid repeated initialization

```python
import logging

from paddleocr import PPStructureV3

logger = logging.getLogger(__name__)


class PPStructureV3Engine:
    """
    PP-StructureV3 engine manager (singleton),
    configured for an RTX 4060 with 8GB of VRAM.
    """
    _instance = None

    def __new__(cls):
        if cls._instance is None:
            instance = super().__new__(cls)
            instance._initialize()
            # Publish the instance only after a successful initialization so a
            # failed model load does not leave a half-built singleton behind.
            cls._instance = instance
        return cls._instance

    def _initialize(self):
        """Initialize the engine."""
        logger.info("Initializing PP-StructureV3 with RTX 4060 8GB optimized config")

        # NOTE: verify every argument below against the installed PaddleOCR
        # version. `use_gpu`, `gpu_mem`, `show_log`, and `use_angle_cls` follow
        # the PaddleOCR 2.x style; PaddleOCR 3.x selects the device with
        # `device="gpu"`, and model names and argument types may differ.
        self.engine = PPStructureV3(
            # ===== GPU configuration =====
            use_gpu=True,
            gpu_mem=6144,  # leave 2GB for the system (8GB - 2GB)

            # ===== Preprocessing modules (all enabled) =====
            use_doc_orientation_classify=True,  # document orientation correction
            use_doc_unwarping=True,             # document image unwarping
            use_textline_orientation=True,      # text-line orientation correction

            # ===== Feature modules (all enabled) =====
            use_table_recognition=True,    # table recognition
            use_formula_recognition=True,  # formula recognition
            use_chart_recognition=True,    # chart recognition
            use_seal_recognition=True,     # seal recognition

            # ===== OCR model configuration (server models) =====
            text_detection_model_name="ch_PP-OCRv4_server_det",
            text_recognition_model_name="ch_PP-OCRv4_server_rec",

            # ===== Layout detection parameters =====
            layout_threshold=0.5,     # layout detection threshold
            layout_nms=0.5,           # NMS threshold (check the expected type)
            layout_unclip_ratio=1.5,  # bounding box expansion ratio

            # ===== OCR parameters =====
            text_det_limit_side_len=1920,  # high-resolution detection
            text_det_thresh=0.3,           # detection threshold
            text_det_box_thresh=0.5,       # bounding box threshold

            # ===== Miscellaneous =====
            show_log=True,
            use_angle_cls=False,  # superseded by use_textline_orientation
        )

        logger.info("PP-StructureV3 engine initialized successfully")
        logger.info("  - GPU: Enabled (RTX 4060 8GB)")
        logger.info("  - Models: Server (High Accuracy)")
        logger.info("  - Features: All Enabled (Table/Formula/Chart/Seal)")

    def predict(self, image_path: str):
        """Run prediction on one image."""
        return self.engine.predict(image_path)

    def get_engine(self):
        """Return the underlying engine instance."""
        return self.engine
```
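
Because the plan also calls for concurrent-request testing, it may be worth guarding the first initialization with a lock so that two requests arriving at the same time cannot both load the models. The following is a minimal sketch of that idea, not the project's implementation: `LazySingleton` and `build_engine` are hypothetical names, and `build_engine` is a cheap stand-in for the expensive `PPStructureV3(...)` construction.

```python
import threading
from typing import Any, Callable, Optional


class LazySingleton:
    """Builds one shared object on first use, guarded by a lock."""

    def __init__(self, factory: Callable[[], Any]) -> None:
        self._factory = factory
        self._lock = threading.Lock()
        self._instance: Optional[Any] = None

    def get(self) -> Any:
        # Fast path: the object already exists, no locking needed.
        if self._instance is None:
            with self._lock:
                # Re-check inside the lock: another thread may have built it
                # while this one was waiting.
                if self._instance is None:
                    self._instance = self._factory()
        return self._instance


build_count = 0


def build_engine() -> dict:
    """Hypothetical stand-in for constructing the real OCR engine."""
    global build_count
    build_count += 1
    return {"name": "stub-engine"}


engine_holder = LazySingleton(build_engine)

# Eight threads race to obtain the engine; the factory should run once.
threads = [threading.Thread(target=engine_holder.get) for _ in range(8)]
for t in threads:
    t.start()
for t in threads:
    t.join()
```

The same double-checked pattern could be folded into `PPStructureV3Engine.__new__` with a class-level lock if the service is multi-threaded.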

#### 2. AdvancedLayoutExtractor (new)
**Purpose**: extract all of the layout information that PP-StructureV3 produces

```python
import logging
import re
from pathlib import Path
from typing import Dict, List, Optional, Tuple

logger = logging.getLogger(__name__)

# PPStructureV3Engine is the singleton defined in section 1.
# _enrich_elements_with_layout_det(), _save_image(), and _find_image_bbox()
# are helper methods that are not shown in this plan.


class AdvancedLayoutExtractor:
    """
    Advanced layout extractor.
    Makes full use of PP-StructureV3's parsing_res_list, layout_bbox, and layout_det_res.
    """

    def __init__(self):
        self.engine = PPStructureV3Engine()

    def extract_complete_layout(
        self,
        image_path: Path,
        output_dir: Optional[Path] = None,
        current_page: int = 0
    ) -> Tuple[Optional[Dict], List[Dict]]:
        """
        Extract the complete layout information (from page_result.json).

        Returns:
            (layout_data, images_metadata)

            layout_data = {
                "elements": [
                    {
                        "element_id": int,
                        "type": str,  # one of the 23 types
                        "bbox": [[x1,y1], [x2,y1], [x2,y2], [x1,y2]],  # ✅ no longer an empty list
                        "content": str,
                        "reading_order": int,  # ✅ reading order
                        "layout_type": str,  # ✅ single/double/multi-column
                        "confidence": float,  # ✅ confidence score
                        "page": int
                    },
                    ...
                ],
                "reading_order": [0, 1, 2, ...],
                "layout_types": ["single", "double"],
                "total_elements": int
            }
        """
        try:
            results = self.engine.predict(str(image_path))

            layout_elements = []
            images_metadata = []

            for page_result in results:
                # ✅ Core change: use page_result.json instead of page_result.markdown
                json_data = page_result.json
                # Some releases nest the payload under a top-level 'res' key;
                # unwrap it if present. Field names also vary between releases
                # (for example layout_bbox vs. block_bbox), so verify them
                # against the actual output.
                json_data = json_data.get('res', json_data)

                # ===== Source 1: parsing_res_list (primary) =====
                parsing_res_list = json_data.get('parsing_res_list', [])

                if parsing_res_list:
                    logger.info(f"Found {len(parsing_res_list)} elements in parsing_res_list")

                    for idx, item in enumerate(parsing_res_list):
                        element = self._create_element_from_parsing_res(
                            item, idx, current_page
                        )
                        if element:
                            layout_elements.append(element)

                # ===== Source 2: layout_det_res (supplementary) =====
                layout_det_res = json_data.get('layout_det_res', {})
                layout_boxes = layout_det_res.get('boxes', [])

                # Fills in element fields that parsing_res_list may be missing
                self._enrich_elements_with_layout_det(layout_elements, layout_boxes)

                # ===== Source 3: images (from markdown_images) =====
                markdown_dict = page_result.markdown
                markdown_images = markdown_dict.get('markdown_images', {})

                for img_idx, (img_path, img_obj) in enumerate(markdown_images.items()):
                    # Save the image to disk
                    self._save_image(img_obj, img_path, output_dir or image_path.parent)

                    # Look up the bbox in parsing_res_list or layout_det_res
                    bbox = self._find_image_bbox(
                        img_path, parsing_res_list, layout_boxes
                    )

                    images_metadata.append({
                        'element_id': len(layout_elements) + img_idx,
                        'image_path': img_path,
                        'type': 'image',
                        'page': current_page,
                        'bbox': bbox,
                    })

            if layout_elements:
                layout_data = {
                    'elements': layout_elements,
                    'total_elements': len(layout_elements),
                    'reading_order': [e['reading_order'] for e in layout_elements],
                    'layout_types': list(set(e.get('layout_type') for e in layout_elements)),
                }
                logger.info(f"✅ Extracted {len(layout_elements)} elements with complete info")
                return layout_data, images_metadata
            else:
                logger.warning("No layout elements found")
                return None, []

        except Exception:
            # logger.exception records the message together with the traceback
            logger.exception("Advanced layout extraction failed")
            return None, []

    def _create_element_from_parsing_res(
        self, item: Dict, idx: int, current_page: int
    ) -> Optional[Dict]:
        """Build an element from one item of parsing_res_list."""
        # Extract layout_bbox
        layout_bbox = item.get('layout_bbox')
        bbox = self._convert_bbox_to_4point(layout_bbox)

        # Extract the layout type
        layout_type = item.get('layout', 'single')

        # Build the base element
        element = {
            'element_id': idx,
            'page': current_page,
            'bbox': bbox,  # ✅ full coordinates
            'layout_type': layout_type,
            'reading_order': idx,
            'confidence': item.get('score', 0.0),
        }

        # Fill in type and content based on the kind of content.
        # Order matters! Priority: table > formula > image > title > text

        if 'table' in item and item['table']:
            element['type'] = 'table'
            element['content'] = item['table']
            # Plain text of the table (used for translation)
            element['extracted_text'] = self._extract_table_text(item['table'])

        elif 'formula' in item and item['formula']:
            element['type'] = 'formula'
            element['content'] = item['formula']  # LaTeX

        elif 'figure' in item or 'image' in item:
            element['type'] = 'image'
            element['content'] = item.get('figure') or item.get('image')

        elif 'title' in item and item['title']:
            element['type'] = 'title'
            element['content'] = item['title']

        elif 'text' in item and item['text']:
            element['type'] = 'text'
            element['content'] = item['text']

        else:
            # Unknown type: fall back to the first non-system field with a value
            for key, value in item.items():
                if key not in ['layout_bbox', 'layout', 'score'] and value:
                    element['type'] = key
                    element['content'] = value
                    break
            else:
                return None  # no content, skip

        return element

    def _convert_bbox_to_4point(self, layout_bbox) -> List:
        """Convert layout_bbox to the 4-point format."""
        if layout_bbox is None:
            return []

        # Handle numpy arrays
        if hasattr(layout_bbox, 'tolist'):
            bbox = layout_bbox.tolist()
        else:
            bbox = list(layout_bbox)

        if len(bbox) == 4:  # [x1, y1, x2, y2]
            x1, y1, x2, y2 = bbox
            return [[x1, y1], [x2, y1], [x2, y2], [x1, y2]]

        return []

    def _extract_table_text(self, html_content: str) -> str:
        """Extract plain text from an HTML table (used for translation)."""
        try:
            from bs4 import BeautifulSoup
            soup = BeautifulSoup(html_content, 'html.parser')

            # Collect the text of every cell
            cells = []
            for cell in soup.find_all(['td', 'th']):
                text = cell.get_text(strip=True)
                if text:
                    cells.append(text)

            return ' | '.join(cells)
        except Exception as e:
            logger.warning(f"Failed to extract table text: {e}")
            # Fallback: strip the HTML tags with a regular expression
            text = re.sub(r'<[^>]+>', ' ', html_content)
            text = re.sub(r'\s+', ' ', text)
            return text.strip()
```

#### 3. PDFGeneratorService (refactored)
**Purpose**: support dual-mode PDF generation

```python
import logging
from pathlib import Path
from typing import Dict, Optional
from xml.sax.saxutils import escape

from reportlab.pdfgen import canvas

logger = logging.getLogger(__name__)

# load_ocr_json(), calculate_page_dimensions(), _get_target_dimensions(), and the
# _draw_*_at_bbox() / _html_to_reportlab_table() helpers are not shown in this plan.


class PDFGeneratorService:
    """
    PDF generation service (refactored).
    Supports two modes:
    - coordinate: coordinate positioning (faithful reproduction of the layout)
    - flow: flow layout (zero information loss, translation-friendly)
    """

    def generate_pdf(
        self,
        json_path: Path,
        output_path: Path,
        mode: str = 'coordinate',  # 'coordinate' or 'flow'
        source_file_path: Optional[Path] = None
    ) -> bool:
        """
        Generate a PDF.

        Args:
            json_path: path to the OCR JSON file
            output_path: path of the PDF to write
            mode: generation mode ('coordinate' or 'flow')
            source_file_path: path to the original file (used to obtain page dimensions)

        Returns:
            True on success
        """
        try:
            # Load the OCR data
            ocr_data = self.load_ocr_json(json_path)
            if not ocr_data:
                return False

            # Image paths stored in the JSON are resolved relative to the JSON file
            # (assumes extracted images are saved next to it)
            image_dir = json_path.parent

            # Pick the generation strategy for the requested mode
            if mode == 'flow':
                return self._generate_flow_pdf(ocr_data, output_path, image_dir)
            else:
                return self._generate_coordinate_pdf(
                    ocr_data, output_path, source_file_path, image_dir
                )

        except Exception:
            # logger.exception records the message together with the traceback
            logger.exception("PDF generation failed")
            return False

    def _generate_coordinate_pdf(
        self,
        ocr_data: Dict,
        output_path: Path,
        source_file_path: Optional[Path],
        image_dir: Path
    ) -> bool:
        """
        Mode A: coordinate positioning
        - Positions every element precisely using layout_bbox
        - Preserves the visual appearance of the original document
        - Suited to cases that need a faithful reproduction of the layout
        """
        logger.info("Generating PDF in COORDINATE mode (layout-preserving)")

        # Extract the data
        layout_data = ocr_data.get('layout_data', {})
        elements = layout_data.get('elements', [])

        if not elements:
            logger.warning("No layout elements found")
            return False

        # Sort by page, then by reading_order
        sorted_elements = sorted(elements, key=lambda x: (
            x.get('page', 0),
            x.get('reading_order', 0)
        ))

        # Compute the page dimensions
        ocr_width, ocr_height = self.calculate_page_dimensions(ocr_data, source_file_path)
        target_width, target_height = self._get_target_dimensions(source_file_path, ocr_width, ocr_height)

        scale_w = target_width / ocr_width
        scale_h = target_height / ocr_height

        # Create the PDF canvas
        pdf_canvas = canvas.Canvas(str(output_path), pagesize=(target_width, target_height))

        # Group the elements by page number
        pages = {}
        for elem in sorted_elements:
            pages.setdefault(elem.get('page', 0), []).append(elem)

        # Render each page
        for page_num, page_elements in sorted(pages.items()):
            if page_num > 0:
                pdf_canvas.showPage()

            logger.info(f"Rendering page {page_num + 1} with {len(page_elements)} elements")

            # Render each element in reading_order
            for elem in page_elements:
                bbox = elem.get('bbox', [])
                elem_type = elem.get('type')
                content = elem.get('content', '')

                if not bbox:
                    logger.warning(f"Element {elem['element_id']} has no bbox, skipping")
                    continue

                # Render according to the element type
                try:
                    if elem_type == 'table':
                        self._draw_table_at_bbox(pdf_canvas, content, bbox, target_height, scale_w, scale_h)
                    elif elem_type == 'text':
                        self._draw_text_at_bbox(pdf_canvas, content, bbox, target_height, scale_w, scale_h)
                    elif elem_type == 'title':
                        self._draw_title_at_bbox(pdf_canvas, content, bbox, target_height, scale_w, scale_h)
                    elif elem_type == 'image':
                        img_path = image_dir / content
                        if img_path.exists():
                            self._draw_image_at_bbox(pdf_canvas, str(img_path), bbox, target_height, scale_w, scale_h)
                    elif elem_type == 'formula':
                        self._draw_formula_at_bbox(pdf_canvas, content, bbox, target_height, scale_w, scale_h)
                    # ... other types

                except Exception as e:
                    logger.warning(f"Failed to draw {elem_type} element: {e}")

        pdf_canvas.save()
        logger.info(f"✅ Coordinate PDF generated: {output_path}")
        return True

    def _generate_flow_pdf(
        self,
        ocr_data: Dict,
        output_path: Path,
        image_dir: Path
    ) -> bool:
        """
        Mode B: flow layout
        - Lays content out in reading_order
        - Zero information loss (nothing is filtered out)
        - Uses the high-level ReportLab Platypus API
        - Suited to translation and other content-processing scenarios
        """
        from reportlab.platypus import (
            SimpleDocTemplate, Paragraph, Spacer,
            Image as RLImage, PageBreak
        )
        from reportlab.lib.styles import getSampleStyleSheet, ParagraphStyle
        from reportlab.lib.enums import TA_CENTER

        logger.info("Generating PDF in FLOW mode (content-preserving)")

        # Extract the data
        layout_data = ocr_data.get('layout_data', {})
        elements = layout_data.get('elements', [])

        if not elements:
            logger.warning("No layout elements found")
            return False

        # Sort by page, then by reading_order
        sorted_elements = sorted(elements, key=lambda x: (
            x.get('page', 0),
            x.get('reading_order', 0)
        ))

        # Create the document
        doc = SimpleDocTemplate(str(output_path))
        story = []
        # Note: the sample styles use Latin fonts; CJK text needs a registered CJK font
        styles = getSampleStyleSheet()

        # Custom style
        styles.add(ParagraphStyle(
            name='CustomTitle',
            parent=styles['Heading1'],
            fontSize=18,
            alignment=TA_CENTER,
            spaceAfter=12
        ))

        current_page = -1

        # Add the elements in order
        for elem in sorted_elements:
            elem_type = elem.get('type')
            content = elem.get('content', '')
            page = elem.get('page', 0)

            # Page break when the source page changes
            if page != current_page and current_page != -1:
                story.append(PageBreak())
            current_page = page

            try:
                # Paragraph parses inline XML markup, so OCR text and LaTeX
                # are escaped before being handed to it.
                if elem_type == 'title':
                    story.append(Paragraph(escape(content), styles['CustomTitle']))
                    story.append(Spacer(1, 12))

                elif elem_type == 'text':
                    story.append(Paragraph(escape(content), styles['Normal']))
                    story.append(Spacer(1, 8))

                elif elem_type == 'table':
                    # Parse the HTML table into a ReportLab Table
                    table_obj = self._html_to_reportlab_table(content)
                    if table_obj:
                        story.append(table_obj)
                        story.append(Spacer(1, 12))

                elif elem_type == 'image':
                    # Embed the image
                    img_path = image_dir / content
                    if img_path.exists():
                        img = RLImage(str(img_path), width=400, height=300, kind='proportional')
                        story.append(img)
                        story.append(Spacer(1, 12))

                elif elem_type == 'formula':
                    # Show the formula source in a monospaced font
                    story.append(Paragraph(f"<font name='Courier'>{escape(content)}</font>", styles['Code']))
                    story.append(Spacer(1, 8))

            except Exception as e:
                logger.warning(f"Failed to add {elem_type} element to flow: {e}")

        # Build the PDF
        doc.build(story)
        logger.info(f"✅ Flow PDF generated: {output_path}")
        return True
```
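
The `_draw_*_at_bbox()` helpers all receive `bbox`, `target_height`, `scale_w`, and `scale_h`, which suggests they share one coordinate conversion: OCR boxes use a top-left origin, while a ReportLab canvas uses a bottom-left origin. The sketch below shows one way that conversion could look. The function name `bbox_to_pdf_rect` is hypothetical and the helpers' real behavior is not specified in this plan.

```python
from typing import List, Tuple


def bbox_to_pdf_rect(
    bbox: List[List[float]],
    page_height: float,
    scale_w: float,
    scale_h: float,
) -> Tuple[float, float, float, float]:
    """Map a 4-point OCR bbox (top-left origin, pixels) to (x, y, width, height)
    in PDF units with a bottom-left origin."""
    xs = [p[0] for p in bbox]
    ys = [p[1] for p in bbox]
    # Scale the extremes of the polygon into the target page size.
    left, right = min(xs) * scale_w, max(xs) * scale_w
    top, bottom = min(ys) * scale_h, max(ys) * scale_h
    width = right - left
    height = bottom - top
    # Flip the vertical axis: the box's bottom edge becomes its PDF y origin.
    y = page_height - bottom
    return left, y, width, height


# A 200x100 px box at (100, 50) on a page scaled by 0.5 into an 842-unit-high page.
rect = bbox_to_pdf_rect(
    [[100, 50], [300, 50], [300, 150], [100, 150]],
    page_height=842,
    scale_w=0.5,
    scale_h=0.5,
)
print(rect)  # (50.0, 767.0, 100.0, 50.0)
```

Keeping the conversion in one function means every element type is positioned with the same arithmetic, which makes the coordinate accuracy checks in the test phase easier to reason about.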

---

## 🔧 Implementation Steps

### Phase 1: Engine layer refactoring (2-3 hours)

1. **Create the PPStructureV3Engine singleton class**
   - File: `backend/app/engines/ppstructure_engine.py` (new)
   - Manages the PP-StructureV3 engine in one place
   - Configuration tuned for the RTX 4060 8GB

2. **Create the AdvancedLayoutExtractor class**
   - File: `backend/app/services/advanced_layout_extractor.py` (new)
   - Implement `extract_complete_layout()`
   - Extract `parsing_res_list`, `layout_bbox`, and `layout_det_res` in full

3. **Update OCRService**
   - Change `analyze_layout()` to use `AdvancedLayoutExtractor`
   - Stay backward compatible (fall back to the old logic)

### Phase 2: PDF generator refactoring (3-4 hours)

1. **Refactor PDFGeneratorService**
   - Add the `mode` parameter
   - Implement `_generate_coordinate_pdf()`
   - Implement `_generate_flow_pdf()`

2. **Add helper methods**
   - `_draw_table_at_bbox()`: draw a table at the given coordinates
   - `_draw_text_at_bbox()`: draw text at the given coordinates
   - `_draw_title_at_bbox()`: draw a title at the given coordinates
   - `_draw_formula_at_bbox()`: draw a formula at the given coordinates
   - `_html_to_reportlab_table()`: convert HTML to a ReportLab Table

3. **Update the API endpoints**
   - `/tasks/{id}/download/pdf?mode=coordinate` (default)
   - `/tasks/{id}/download/pdf?mode=flow`
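
Of the helpers listed in Phase 2, `_html_to_reportlab_table()` is the least obvious, so here is a minimal sketch of its first half: turning table HTML into a rectangular list of rows. It uses only the standard library, and the names `TableRowParser` and `html_table_to_rows` are hypothetical. It assumes simple tables and ignores `rowspan`/`colspan`.

```python
from html.parser import HTMLParser
from typing import List


class TableRowParser(HTMLParser):
    """Collects the text of each <td>/<th> cell, grouped by <tr>."""

    def __init__(self) -> None:
        super().__init__()
        self.rows: List[List[str]] = []
        self._cell: List[str] = []
        self._in_cell = False

    def handle_starttag(self, tag, attrs):
        if tag == "tr":
            self.rows.append([])
        elif tag in ("td", "th"):
            self._in_cell = True
            self._cell = []

    def handle_data(self, data):
        if self._in_cell:
            self._cell.append(data)

    def handle_endtag(self, tag):
        if tag in ("td", "th") and self._in_cell:
            if not self.rows:  # tolerate cells that appear outside any <tr>
                self.rows.append([])
            self.rows[-1].append("".join(self._cell).strip())
            self._in_cell = False


def html_table_to_rows(html: str) -> List[List[str]]:
    """Return the table as equally sized rows of cell text."""
    parser = TableRowParser()
    parser.feed(html)
    # Pad short rows so every row has the same number of columns.
    width = max((len(r) for r in parser.rows), default=0)
    return [r + [""] * (width - len(r)) for r in parser.rows if r]


rows = html_table_to_rows(
    "<table><tr><th>Item</th><th>Qty</th></tr><tr><td>Paper</td></tr></table>"
)
print(rows)  # [['Item', 'Qty'], ['Paper', '']]
```

The resulting list of lists is the shape that `reportlab.platypus.Table` accepts as its data argument, so the second half of the helper would mostly be styling.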

### Phase 3: Testing and optimization (2-3 hours)

1. **Unit tests**
   - Test AdvancedLayoutExtractor
   - Test both PDF modes
   - Test backward compatibility

2. **Performance tests**
   - GPU memory usage monitoring
   - Processing speed tests
   - Concurrent request tests

3. **Quality verification**
   - Coordinate accuracy
   - Reading order correctness
   - Table recognition accuracy
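
Part of the quality verification can be automated with structural checks on the extracted elements. The sketch below assumes the element shape described for `layout_data["elements"]` (4-point `bbox`, `reading_order`, `page`). `validate_layout` is a hypothetical helper, and it only catches malformed data, not recognition errors.

```python
from typing import Dict, List


def validate_layout(elements: List[Dict]) -> List[str]:
    """Return human-readable problems; an empty list means the checks passed."""
    problems = []

    # Check 1: every bbox is a 4-point polygon with positive area.
    for elem in elements:
        eid = elem.get("element_id")
        bbox = elem.get("bbox") or []
        if len(bbox) != 4 or any(len(p) != 2 for p in bbox):
            problems.append(f"element {eid}: bbox is not a 4-point polygon")
            continue
        xs = [p[0] for p in bbox]
        ys = [p[1] for p in bbox]
        if max(xs) <= min(xs) or max(ys) <= min(ys):
            problems.append(f"element {eid}: bbox has zero or negative area")

    # Check 2: reading_order values are unique within each page.
    seen = set()
    for elem in elements:
        key = (elem.get("page", 0), elem.get("reading_order"))
        if key in seen:
            problems.append(f"page {key[0]}: duplicate reading_order {key[1]}")
        seen.add(key)

    return problems


good = {"element_id": 0, "page": 0, "reading_order": 0,
        "bbox": [[0, 0], [10, 0], [10, 5], [0, 5]]}
legacy = {"element_id": 1, "page": 0, "reading_order": 0, "bbox": []}

report = validate_layout([good, legacy])
print(report)
```

Running a check like this over a sample set before and after the refactor gives a simple way to measure the "bbox availability" metric.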

---

## 📈 Expected Results

### Functional improvements

| Metric | Current | After refactoring | Change |
|------|-----|--------|------|
| bbox availability | 0% (all empty) | 100% | ✅ from none to all |
| Layout element categories | 2 | 23 | ✅ 11.5x |
| Reading order | none | fully preserved | ✅ newly available |
| Information loss | 21.6% | 0% (flow mode) | ✅ eliminated in flow mode |
| PDF modes | 1 | 2 | ✅ 2x |
| Translation support | difficult | supported via flow mode | ✅ newly practical |

### GPU usage optimization

Expected effect of the RTX 4060 8GB configuration (estimates, to be confirmed in the performance tests):

| Item | Current | After refactoring |
|------|-----|--------|
| GPU utilization | ~30% | ~70% |
| Processing speed | 0.5 pages/s | 1.2 pages/s |
| Preprocessing features | off | all on |
| Recognition accuracy | ~85% | ~95% |

---
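
## 🎯 Migration Strategy

### Backward compatibility guarantees

1. **API level**
   - Keep all existing API endpoints
   - Add the optional `mode` parameter
   - Default behavior is unchanged

2. **Data level**
   - Old JSON files remain usable
   - New fields do not affect the old logic
   - Incremental updates

3. **Deployment strategy**
   - Deploy the new engine and services first
   - Enable the new features gradually
   - Monitor performance and error rates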
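
At the data level, "old JSON files remain usable" implies the service must recognize JSON that predates the new extractor, where every `bbox` is an empty list. A minimal sketch of that decision follows, assuming the `layout_data.elements[*].bbox` shape used throughout this plan. The function names and the `"new"`/`"legacy"` labels are hypothetical.

```python
from typing import Dict


def has_usable_layout(ocr_data: Dict) -> bool:
    """True when every element carries a non-empty bbox (new-style JSON)."""
    elements = (ocr_data.get("layout_data") or {}).get("elements") or []
    return bool(elements) and all(e.get("bbox") for e in elements)


def choose_pdf_path(ocr_data: Dict) -> str:
    """Pick the rendering path: the new generator, or the old logic as fallback."""
    return "new" if has_usable_layout(ocr_data) else "legacy"


new_json = {"layout_data": {"elements": [
    {"element_id": 0, "bbox": [[0, 0], [10, 0], [10, 5], [0, 5]]},
]}}
old_json = {"layout_data": {"elements": [
    {"element_id": 0, "bbox": []},  # legacy output: coordinates were never filled in
]}}

print(choose_pdf_path(new_json), choose_pdf_path(old_json))  # new legacy
```

Detecting the format from the data itself, rather than from a version flag, keeps old files working without any migration step.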

---

## 📝 Configuration Files

### requirements.txt updates

```txt
# Existing dependencies
paddlepaddle-gpu>=3.0.0
paddleocr>=3.0.0

# New dependencies
python-docx>=0.8.11  # Word document generation (optional)
PyMuPDF>=1.23.0  # enhanced PDF handling
beautifulsoup4>=4.12.0  # HTML parsing
lxml>=4.9.0  # faster XML/HTML parsing
```

### Environment variables

```bash
# Additions to .env.local
# RTX 4060 8GB: leave 2GB for the system
PADDLE_GPU_MEMORY=6144
PADDLE_USE_SERVER_MODEL=true
PADDLE_ENABLE_ALL_FEATURES=true

# Default PDF generation mode: coordinate or flow
PDF_DEFAULT_MODE=coordinate
```
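
The `mode` query parameter and the `PDF_DEFAULT_MODE` variable need a single place where they are reconciled. The sketch below shows one possible rule: an explicit request wins but must be valid, otherwise the environment default applies, and anything unrecognized falls back to `coordinate` so the default behavior stays unchanged. `resolve_pdf_mode` is a hypothetical helper and is not part of the existing API.

```python
import os
from typing import Mapping, Optional

VALID_MODES = ("coordinate", "flow")


def resolve_pdf_mode(
    requested: Optional[str],
    env: Mapping[str, str] = os.environ,
) -> str:
    """Return the PDF mode to use for one download request."""
    if requested is not None:
        mode = requested.strip().lower()
        if mode not in VALID_MODES:
            # Reject bad client input explicitly instead of guessing.
            raise ValueError(f"mode must be one of {VALID_MODES}, got {requested!r}")
        return mode
    # No explicit request: use the configured default, or 'coordinate'
    # when the variable is missing or misconfigured.
    default = env.get("PDF_DEFAULT_MODE", "coordinate").strip().lower()
    return default if default in VALID_MODES else "coordinate"


print(resolve_pdf_mode("flow", env={}))                         # flow
print(resolve_pdf_mode(None, env={"PDF_DEFAULT_MODE": "flow"}))  # flow
print(resolve_pdf_mode(None, env={}))                           # coordinate
```

In an endpoint handler, the `ValueError` would typically be translated into a 4xx response.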

---

## 🚀 Implementation Priorities

### P0 (implement immediately)
1. ✅ PPStructureV3Engine unified engine
2. ✅ AdvancedLayoutExtractor full extraction
3. ✅ Coordinate-positioning PDF mode

### P1 (second phase)
4. ⭐ Flow-layout PDF mode
5. ⭐ API endpoint update (`mode` parameter)

### P2 (optimization phase)
6. Performance monitoring and optimization
7. Batch processing support
8. Quality checking tools

---

## ⚠️ Risks and Mitigations

### Risk 1: Insufficient GPU memory
**Mitigation**:
- Set `gpu_mem=6144` (leaving 2GB in reserve)
- Add memory monitoring
- Process large documents in batches
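
Batching large documents can be as simple as feeding the engine a fixed number of pages at a time, so that peak memory is bounded by the batch rather than by the document. A minimal sketch follows. The helper name `batched_pages`, the batch size, and the page file names are all placeholders to be tuned against real memory measurements.

```python
from typing import Iterator, List, Sequence


def batched_pages(pages: Sequence[str], batch_size: int) -> Iterator[List[str]]:
    """Yield consecutive slices of at most batch_size page paths."""
    if batch_size < 1:
        raise ValueError("batch_size must be at least 1")
    for start in range(0, len(pages), batch_size):
        yield list(pages[start:start + batch_size])


# Hypothetical page images rendered from one large PDF.
page_images = [f"page_{i:03d}.png" for i in range(5)]

batches = list(batched_pages(page_images, batch_size=2))
for batch in batches:
    # Each batch would be passed to the OCR engine, and its results written
    # out, before the next batch is loaded.
    print(batch)
```

Writing each batch's results to disk before starting the next one also gives a natural resume point if a long job is interrupted.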
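
### Risk 2: Slower processing
**Mitigation**:
- The server models are heavier than the mobile models, so rely on GPU acceleration to offset the extra cost and confirm the result with benchmarks
- Process multiple pages in parallel
- Cache results

### Risk 3: Backward compatibility problems
**Mitigation**:
- Keep the old logic as a fallback
- Migrate gradually
- Full test coverage

---

**Estimated total development time**: 7-10 hours
**Expected outcome**: full use of PP-StructureV3's capabilities + zero information loss (flow mode) + translation support

Which phase would you like me to start implementing?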