chore: project cleanup and prepare for dual-track processing refactor
- Removed all test files and directories
- Deleted outdated documentation (will be rewritten)
- Cleaned up temporary files, logs, and uploads
- Archived 5 completed OpenSpec proposals
- Created new dual-track-document-processing proposal with complete OpenSpec structure
  - Dual-track architecture: OCR track (PaddleOCR) + Direct track (PyMuPDF)
  - UnifiedDocument model for consistent output
  - Support for structure-preserving translation
- Updated .gitignore to prevent future test/temp files

This is a major cleanup preparing for the complete refactoring of the document processing pipeline.

🤖 Generated with [Claude Code](https://claude.com/claude-code)

Co-Authored-By: Claude <noreply@anthropic.com>
@@ -0,0 +1,817 @@
# Tool_OCR Architecture Overhaul Plan

## Refactoring Plan Based on the Full Capabilities of PaddleOCR PP-StructureV3

**Planning date**: 2025-01-18
**Hardware**: RTX 4060 8GB VRAM
**Priority**: P0 (highest)

---

## 📊 Current State Analysis

### Problems with the Current Architecture

#### 1. **PP-StructureV3 capabilities are severely underused**
```python
# ❌ Current implementation (ocr_service.py:614-646)
markdown_dict = page_result.markdown  # only the simplified view is used
markdown_texts = markdown_dict.get('markdown_texts', '')
'bbox': [],  # coordinates are all empty!
```

**Problems**:
- Only ~20% of PP-StructureV3's functionality is used
- `parsing_res_list` (the core data structure) is unused
- `layout_bbox` (precise coordinates) is unused
- `reading_order` (reading order) is unused
- The 23 layout element categories are unused

#### 2. **GPU configuration is not optimized**
```python
# Current configuration (ocr_service.py:211-219)
self.structure_engine = PPStructureV3(
    use_doc_orientation_classify=False,  # ❌ preprocessing disabled
    use_doc_unwarping=False,             # ❌ unwarping disabled
    use_textline_orientation=False,      # ❌ orientation correction disabled
    # ... default configuration used
)
```

**Problems**:
- An RTX 4060 8GB can run the server models, yet the default configuration is used
- Important preprocessing features are turned off
- GPU compute is underutilized

#### 3. **Single PDF generation strategy**
```python
# Currently only coordinate-placement mode exists,
# causing 21.6% text loss (overlap filtering)
filtered_text_regions = self._filter_text_in_regions(text_regions, regions_to_avoid)
```

**Problems**:
- Only coordinate placement is supported, no flow layout
- Zero information loss is impossible
- Translation support is limited

---

## 🎯 Refactoring Goals

### Core Goals

1. **Fully leverage PP-StructureV3's capabilities**
   - Extract `parsing_res_list` (23 element categories + reading order)
   - Extract `layout_bbox` (precise coordinates)
   - Extract `layout_det_res` (layout detection details)
   - Extract `overall_ocr_res` (coordinates for all text)

2. **Dual-mode PDF generation**
   - Mode A: coordinate placement (faithful layout reproduction)
   - Mode B: flow layout (zero information loss, translation-ready)

3. **GPU configuration tuning**
   - Optimal configuration for the RTX 4060 8GB
   - Server models + all feature modules
   - Sensible memory management

4. **Backward compatibility**
   - Keep the existing API
   - Old JSON files remain usable
   - Incremental upgrade

---

## 🏗️ New Architecture Design

### Architecture Layers

```
┌──────────────────────────────────────────────────────┐
│                     API Layer                        │
│  /tasks, /results, /download (backward compatible)   │
└────────────────┬─────────────────────────────────────┘
                 │
┌────────────────▼─────────────────────────────────────┐
│                   Service Layer                      │
├──────────────────────────────────────────────────────┤
│  OCRService (existing, kept)                         │
│    └─ analyze_layout() [upgraded] ──┐                │
│                                     │                │
│  AdvancedLayoutExtractor (new) ◄── shares one engine │
│    └─ extract_complete_layout() ──┘                  │
│                                                      │
│  PDFGeneratorService (refactored)                    │
│    ├─ generate_coordinate_pdf()  [Mode A]            │
│    └─ generate_flow_pdf()        [Mode B]            │
└────────────────┬─────────────────────────────────────┘
                 │
┌────────────────▼─────────────────────────────────────┐
│                   Engine Layer                       │
├──────────────────────────────────────────────────────┤
│  PPStructureV3Engine (new, unified management)       │
│    ├─ GPU config (tuned for RTX 4060 8GB)            │
│    ├─ Model config (server models)                   │
│    └─ Feature switches (all features enabled)        │
└──────────────────────────────────────────────────────┘
```

### Core Class Design

#### 1. PPStructureV3Engine (new)
**Purpose**: manage the PP-StructureV3 engine in one place and avoid repeated initialization

```python
class PPStructureV3Engine:
    """
    PP-StructureV3 engine manager (singleton),
    tuned for an RTX 4060 with 8GB VRAM
    """
    _instance = None

    def __new__(cls):
        if cls._instance is None:
            cls._instance = super().__new__(cls)
            cls._instance._initialize()
        return cls._instance

    def _initialize(self):
        """Initialize the engine"""
        logger.info("Initializing PP-StructureV3 with RTX 4060 8GB optimized config")

        self.engine = PPStructureV3(
            # ===== GPU configuration =====
            use_gpu=True,
            gpu_mem=6144,  # keep 2GB for the system (8GB - 2GB)

            # ===== Preprocessing modules (all enabled) =====
            use_doc_orientation_classify=True,  # document orientation correction
            use_doc_unwarping=True,             # document image unwarping
            use_textline_orientation=True,      # text-line orientation correction

            # ===== Feature modules (all enabled) =====
            use_table_recognition=True,    # table recognition
            use_formula_recognition=True,  # formula recognition
            use_chart_recognition=True,    # chart recognition
            use_seal_recognition=True,     # seal recognition

            # ===== OCR model configuration (server models) =====
            text_detection_model_name="ch_PP-OCRv4_server_det",
            text_recognition_model_name="ch_PP-OCRv4_server_rec",

            # ===== Layout detection parameters =====
            layout_threshold=0.5,     # layout detection threshold
            layout_nms=0.5,           # NMS threshold
            layout_unclip_ratio=1.5,  # bounding-box expansion ratio

            # ===== OCR parameters =====
            text_det_limit_side_len=1920,  # high-resolution detection
            text_det_thresh=0.3,           # detection threshold
            text_det_box_thresh=0.5,       # bounding-box threshold

            # ===== Misc =====
            show_log=True,
            use_angle_cls=False,  # superseded by textline_orientation
        )

        logger.info("PP-StructureV3 engine initialized successfully")
        logger.info("  - GPU: Enabled (RTX 4060 8GB)")
        logger.info("  - Models: Server (High Accuracy)")
        logger.info("  - Features: All Enabled (Table/Formula/Chart/Seal)")

    def predict(self, image_path: str):
        """Run prediction"""
        return self.engine.predict(image_path)

    def get_engine(self):
        """Return the engine instance"""
        return self.engine
```
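
For reference, a minimal usage sketch (the file path is hypothetical): because `__new__` caches the instance, a second construction returns the same object and `_initialize()` runs only once.

```python
# Hypothetical usage of the singleton engine
engine_a = PPStructureV3Engine()
engine_b = PPStructureV3Engine()
assert engine_a is engine_b  # singleton: the engine is initialized exactly once

results = engine_a.predict("uploads/sample_page.png")
for page_result in results:
    json_data = page_result.json  # complete structured layout information
```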

#### 2. AdvancedLayoutExtractor (new)
**Purpose**: extract all layout information that PP-StructureV3 provides (sketches of the unspecified helpers follow the class)

```python
class AdvancedLayoutExtractor:
    """
    Advanced layout extractor.
    Fully exploits PP-StructureV3's parsing_res_list, layout_bbox and layout_det_res
    """

    def __init__(self):
        self.engine = PPStructureV3Engine()

    def extract_complete_layout(
        self,
        image_path: Path,
        output_dir: Optional[Path] = None,
        current_page: int = 0
    ) -> Tuple[Optional[Dict], List[Dict]]:
        """
        Extract complete layout information (via page_result.json)

        Returns:
            (layout_data, images_metadata)

            layout_data = {
                "elements": [
                    {
                        "element_id": int,
                        "type": str,            # one of the 23 types
                        "bbox": [[x1,y1], [x2,y1], [x2,y2], [x1,y2]],  # ✅ no longer empty
                        "content": str,
                        "reading_order": int,   # ✅ reading order
                        "layout_type": str,     # ✅ single/double/multi-column
                        "confidence": float,    # ✅ confidence score
                        "page": int
                    },
                    ...
                ],
                "reading_order": [0, 1, 2, ...],
                "layout_types": ["single", "double"],
                "total_elements": int
            }
        """
        try:
            results = self.engine.predict(str(image_path))

            layout_elements = []
            images_metadata = []

            for page_idx, page_result in enumerate(results):
                # ✅ Core change: use page_result.json instead of page_result.markdown
                json_data = page_result.json

                # ===== Method 1: parsing_res_list (primary source) =====
                parsing_res_list = json_data.get('parsing_res_list', [])

                if parsing_res_list:
                    logger.info(f"Found {len(parsing_res_list)} elements in parsing_res_list")

                    for idx, item in enumerate(parsing_res_list):
                        element = self._create_element_from_parsing_res(
                            item, idx, current_page
                        )
                        if element:
                            layout_elements.append(element)

                # ===== Method 2: layout_det_res (supplementary info) =====
                layout_det_res = json_data.get('layout_det_res', {})
                layout_boxes = layout_det_res.get('boxes', [])

                # Enrich elements when parsing_res_list lacks certain fields
                self._enrich_elements_with_layout_det(layout_elements, layout_boxes)

                # ===== Method 3: images (from markdown_images) =====
                markdown_dict = page_result.markdown
                markdown_images = markdown_dict.get('markdown_images', {})

                for img_idx, (img_path, img_obj) in enumerate(markdown_images.items()):
                    # Persist the image to disk
                    self._save_image(img_obj, img_path, output_dir or image_path.parent)

                    # Look up the bbox in parsing_res_list or layout_det_res
                    bbox = self._find_image_bbox(
                        img_path, parsing_res_list, layout_boxes
                    )

                    images_metadata.append({
                        'element_id': len(layout_elements) + img_idx,
                        'image_path': img_path,
                        'type': 'image',
                        'page': current_page,
                        'bbox': bbox,
                    })

            if layout_elements:
                layout_data = {
                    'elements': layout_elements,
                    'total_elements': len(layout_elements),
                    'reading_order': [e['reading_order'] for e in layout_elements],
                    'layout_types': list(set(e.get('layout_type') for e in layout_elements)),
                }
                logger.info(f"✅ Extracted {len(layout_elements)} elements with complete info")
                return layout_data, images_metadata
            else:
                logger.warning("No layout elements found")
                return None, []

        except Exception as e:
            logger.error(f"Advanced layout extraction failed: {e}")
            import traceback
            traceback.print_exc()
            return None, []

    def _create_element_from_parsing_res(
        self, item: Dict, idx: int, current_page: int
    ) -> Optional[Dict]:
        """Create an element from one parsing_res_list item"""
        # Extract layout_bbox
        layout_bbox = item.get('layout_bbox')
        bbox = self._convert_bbox_to_4point(layout_bbox)

        # Extract the layout type
        layout_type = item.get('layout', 'single')

        # Create the base element
        element = {
            'element_id': idx,
            'page': current_page,
            'bbox': bbox,  # ✅ complete coordinates
            'layout_type': layout_type,
            'reading_order': idx,
            'confidence': item.get('score', 0.0),
        }

        # Fill type and content by content kind.
        # Order matters! Priority: table > formula > image > title > text

        if 'table' in item and item['table']:
            element['type'] = 'table'
            element['content'] = item['table']
            # Extract plain table text (for translation)
            element['extracted_text'] = self._extract_table_text(item['table'])

        elif 'formula' in item and item['formula']:
            element['type'] = 'formula'
            element['content'] = item['formula']  # LaTeX

        elif 'figure' in item or 'image' in item:
            element['type'] = 'image'
            element['content'] = item.get('figure') or item.get('image')

        elif 'title' in item and item['title']:
            element['type'] = 'title'
            element['content'] = item['title']

        elif 'text' in item and item['text']:
            element['type'] = 'text'
            element['content'] = item['text']

        else:
            # Unknown type: try any non-system field with a value
            for key, value in item.items():
                if key not in ['layout_bbox', 'layout', 'score'] and value:
                    element['type'] = key
                    element['content'] = value
                    break
            else:
                return None  # no content, skip

        return element

    def _convert_bbox_to_4point(self, layout_bbox) -> List:
        """Convert layout_bbox to 4-point format"""
        if layout_bbox is None:
            return []

        # Handle numpy arrays
        if hasattr(layout_bbox, 'tolist'):
            bbox = layout_bbox.tolist()
        else:
            bbox = list(layout_bbox)

        if len(bbox) == 4:  # [x1, y1, x2, y2]
            x1, y1, x2, y2 = bbox
            return [[x1, y1], [x2, y1], [x2, y2], [x1, y2]]

        return []

    def _extract_table_text(self, html_content: str) -> str:
        """Extract plain text from an HTML table (for translation)"""
        try:
            from bs4 import BeautifulSoup
            soup = BeautifulSoup(html_content, 'html.parser')

            # Extract the text of every cell
            cells = []
            for cell in soup.find_all(['td', 'th']):
                text = cell.get_text(strip=True)
                if text:
                    cells.append(text)

            return ' | '.join(cells)
        except Exception as e:
            logger.warning(f"Failed to extract table text: {e}")
            # Fallback: naively strip HTML tags
            import re
            text = re.sub(r'<[^>]+>', ' ', html_content)
            text = re.sub(r'\s+', ' ', text)
            return text.strip()
```
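
The class above calls two helpers that this plan does not spell out: `_enrich_elements_with_layout_det()` and `_save_image()`. A minimal sketch of one possible implementation, assuming `layout_det_res` boxes carry `label`/`score`/`coordinate` fields and that `markdown_images` values are PIL-like objects with a `save()` method:

```python
def _enrich_elements_with_layout_det(self, elements: List[Dict], layout_boxes: List[Dict]) -> None:
    """Hypothetical helper: fill in missing type/confidence from layout_det_res boxes."""
    for element in elements:
        bbox = element.get('bbox')
        if not bbox:
            continue
        ex1, ey1 = bbox[0]
        ex2, ey2 = bbox[2]
        best_box, best_overlap = None, 0.0
        for box in layout_boxes:
            coord = box.get('coordinate', [])
            if len(coord) != 4:
                continue
            x1, y1, x2, y2 = coord
            # Intersection area as a cheap matching score
            iw = max(0.0, min(ex2, x2) - max(ex1, x1))
            ih = max(0.0, min(ey2, y2) - max(ey1, y1))
            if iw * ih > best_overlap:
                best_box, best_overlap = box, iw * ih
        if best_box:
            element.setdefault('type', best_box.get('label'))
            if not element.get('confidence'):
                element['confidence'] = best_box.get('score', 0.0)

def _save_image(self, img_obj, img_path: str, base_dir: Path) -> None:
    """Hypothetical helper: persist an extracted PIL-like image object to disk."""
    try:
        full_path = Path(base_dir) / img_path
        full_path.parent.mkdir(parents=True, exist_ok=True)
        if hasattr(img_obj, 'save'):
            img_obj.save(str(full_path))
    except Exception as e:
        logger.warning(f"Failed to save image {img_path}: {e}")
```

`_find_image_bbox()` is specified in the companion plan below (see Task 1).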

#### 3. PDFGeneratorService (refactored)
**Purpose**: support dual-mode PDF generation (a sketch of `_html_to_reportlab_table()` follows the class)

```python
class PDFGeneratorService:
    """
    PDF generation service (refactored).
    Supports two modes:
    - coordinate: coordinate-placement mode (faithful layout reproduction)
    - flow: flow-layout mode (zero information loss, translation-ready)
    """

    def generate_pdf(
        self,
        json_path: Path,
        output_path: Path,
        mode: str = 'coordinate',  # 'coordinate' or 'flow'
        source_file_path: Optional[Path] = None
    ) -> bool:
        """
        Generate a PDF

        Args:
            json_path: path to the OCR JSON file
            output_path: output PDF path
            mode: generation mode ('coordinate' or 'flow')
            source_file_path: original file path (used to obtain dimensions)

        Returns:
            True on success
        """
        try:
            # Load the OCR data
            ocr_data = self.load_ocr_json(json_path)
            if not ocr_data:
                return False

            # Pick the generation strategy by mode
            # (json_path is threaded through so image paths can be resolved)
            if mode == 'flow':
                return self._generate_flow_pdf(ocr_data, output_path, json_path)
            else:
                return self._generate_coordinate_pdf(ocr_data, output_path, source_file_path, json_path)

        except Exception as e:
            logger.error(f"PDF generation failed: {e}")
            import traceback
            traceback.print_exc()
            return False

    def _generate_coordinate_pdf(
        self,
        ocr_data: Dict,
        output_path: Path,
        source_file_path: Optional[Path],
        json_path: Path
    ) -> bool:
        """
        Mode A: coordinate placement
        - Positions every element precisely via layout_bbox
        - Preserves the visual appearance of the original document
        - For scenarios that require faithful layout reproduction
        """
        logger.info("Generating PDF in COORDINATE mode (layout-preserving)")

        # Extract the data
        layout_data = ocr_data.get('layout_data', {})
        elements = layout_data.get('elements', [])

        if not elements:
            logger.warning("No layout elements found")
            return False

        # Sort by page and reading_order
        sorted_elements = sorted(elements, key=lambda x: (
            x.get('page', 0),
            x.get('reading_order', 0)
        ))

        # Compute page dimensions
        ocr_width, ocr_height = self.calculate_page_dimensions(ocr_data, source_file_path)
        target_width, target_height = self._get_target_dimensions(source_file_path, ocr_width, ocr_height)

        scale_w = target_width / ocr_width
        scale_h = target_height / ocr_height

        # Create the PDF canvas
        pdf_canvas = canvas.Canvas(str(output_path), pagesize=(target_width, target_height))

        # Group elements by page number
        pages = {}
        for elem in sorted_elements:
            page = elem.get('page', 0)
            if page not in pages:
                pages[page] = []
            pages[page].append(elem)

        # Render each page
        for page_num, page_elements in sorted(pages.items()):
            if page_num > 0:
                pdf_canvas.showPage()

            logger.info(f"Rendering page {page_num + 1} with {len(page_elements)} elements")

            # Render each element in reading order
            for elem in page_elements:
                bbox = elem.get('bbox', [])
                elem_type = elem.get('type')
                content = elem.get('content', '')

                if not bbox:
                    logger.warning(f"Element {elem['element_id']} has no bbox, skipping")
                    continue

                # Render by type
                try:
                    if elem_type == 'table':
                        self._draw_table_at_bbox(pdf_canvas, content, bbox, target_height, scale_w, scale_h)
                    elif elem_type == 'text':
                        self._draw_text_at_bbox(pdf_canvas, content, bbox, target_height, scale_w, scale_h)
                    elif elem_type == 'title':
                        self._draw_title_at_bbox(pdf_canvas, content, bbox, target_height, scale_w, scale_h)
                    elif elem_type == 'image':
                        img_path = json_path.parent / content
                        if img_path.exists():
                            self._draw_image_at_bbox(pdf_canvas, str(img_path), bbox, target_height, scale_w, scale_h)
                    elif elem_type == 'formula':
                        self._draw_formula_at_bbox(pdf_canvas, content, bbox, target_height, scale_w, scale_h)
                    # ... other types

                except Exception as e:
                    logger.warning(f"Failed to draw {elem_type} element: {e}")

        pdf_canvas.save()
        logger.info(f"✅ Coordinate PDF generated: {output_path}")
        return True

    def _generate_flow_pdf(
        self,
        ocr_data: Dict,
        output_path: Path,
        json_path: Path
    ) -> bool:
        """
        Mode B: flow layout
        - Flows content in reading_order
        - Zero information loss (nothing is filtered out)
        - Uses the high-level ReportLab Platypus API
        - For scenarios that need translation or content processing
        """
        from reportlab.platypus import (
            SimpleDocTemplate, Paragraph, Spacer,
            Table, TableStyle, Image as RLImage, PageBreak
        )
        from reportlab.lib.styles import getSampleStyleSheet, ParagraphStyle
        from reportlab.lib import colors
        from reportlab.lib.enums import TA_LEFT, TA_CENTER

        logger.info("Generating PDF in FLOW mode (content-preserving)")

        # Extract the data
        layout_data = ocr_data.get('layout_data', {})
        elements = layout_data.get('elements', [])

        if not elements:
            logger.warning("No layout elements found")
            return False

        # Sort by reading_order
        sorted_elements = sorted(elements, key=lambda x: (
            x.get('page', 0),
            x.get('reading_order', 0)
        ))

        # Create the document
        doc = SimpleDocTemplate(str(output_path))
        story = []
        styles = getSampleStyleSheet()

        # Custom style
        styles.add(ParagraphStyle(
            name='CustomTitle',
            parent=styles['Heading1'],
            fontSize=18,
            alignment=TA_CENTER,
            spaceAfter=12
        ))

        current_page = -1

        # Append elements in order
        for elem in sorted_elements:
            elem_type = elem.get('type')
            content = elem.get('content', '')
            page = elem.get('page', 0)

            # Page breaks
            if page != current_page and current_page != -1:
                story.append(PageBreak())
            current_page = page

            try:
                if elem_type == 'title':
                    story.append(Paragraph(content, styles['CustomTitle']))
                    story.append(Spacer(1, 12))

                elif elem_type == 'text':
                    story.append(Paragraph(content, styles['Normal']))
                    story.append(Spacer(1, 8))

                elif elem_type == 'table':
                    # Parse the HTML table into a ReportLab Table
                    table_obj = self._html_to_reportlab_table(content)
                    if table_obj:
                        story.append(table_obj)
                        story.append(Spacer(1, 12))

                elif elem_type == 'image':
                    # Embed the image (resolved relative to the OCR JSON)
                    img_path = json_path.parent / content
                    if img_path.exists():
                        img = RLImage(str(img_path), width=400, height=300, kind='proportional')
                        story.append(img)
                        story.append(Spacer(1, 12))

                elif elem_type == 'formula':
                    # Render formulas in a monospace font
                    story.append(Paragraph(f"<font name='Courier'>{content}</font>", styles['Code']))
                    story.append(Spacer(1, 8))

            except Exception as e:
                logger.warning(f"Failed to add {elem_type} element to flow: {e}")

        # Build the PDF
        doc.build(story)
        logger.info(f"✅ Flow PDF generated: {output_path}")
        return True
```
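
`_html_to_reportlab_table()` is referenced above but never specified. A minimal sketch of one possible conversion, built on the same BeautifulSoup dependency already used by `_extract_table_text()` (the styling choices are illustrative):

```python
def _html_to_reportlab_table(self, html_content: str):
    """Convert an HTML table string into a ReportLab Table flowable."""
    from bs4 import BeautifulSoup
    from reportlab.platypus import Table, TableStyle
    from reportlab.lib import colors

    soup = BeautifulSoup(html_content, 'html.parser')
    rows = []
    for tr in soup.find_all('tr'):
        cells = [cell.get_text(strip=True) for cell in tr.find_all(['td', 'th'])]
        if cells:
            rows.append(cells)
    if not rows:
        return None

    # Pad ragged rows so ReportLab receives a rectangular grid
    width = max(len(r) for r in rows)
    rows = [r + [''] * (width - len(r)) for r in rows]

    table = Table(rows)
    table.setStyle(TableStyle([
        ('GRID', (0, 0), (-1, -1), 0.5, colors.grey),
        ('FONTSIZE', (0, 0), (-1, -1), 8),
        ('VALIGN', (0, 0), (-1, -1), 'TOP'),
    ]))
    return table
```

Note that merged cells (`colspan`/`rowspan`) are flattened by this sketch; handling them would require mapping spans onto ReportLab `SPAN` style commands.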

---

## 🔧 Implementation Steps

### Phase 1: engine layer refactor (2-3 hours)

1. **Create the PPStructureV3Engine singleton class**
   - File: `backend/app/engines/ppstructure_engine.py` (new)
   - Manages the PP-StructureV3 engine in one place
   - Optimized configuration for the RTX 4060 8GB

2. **Create the AdvancedLayoutExtractor class**
   - File: `backend/app/services/advanced_layout_extractor.py` (new)
   - Implement `extract_complete_layout()`
   - Fully extract parsing_res_list, layout_bbox and layout_det_res

3. **Update OCRService**
   - Change `analyze_layout()` to use `AdvancedLayoutExtractor`
   - Stay backward compatible (fall back to the old logic)

### Phase 2: PDF generator refactor (3-4 hours)

1. **Refactor PDFGeneratorService**
   - Add a `mode` parameter
   - Implement `_generate_coordinate_pdf()`
   - Implement `_generate_flow_pdf()`

2. **Add helper methods**
   - `_draw_table_at_bbox()`: draw a table at given coordinates
   - `_draw_text_at_bbox()`: draw text at given coordinates
   - `_draw_title_at_bbox()`: draw a title at given coordinates
   - `_draw_formula_at_bbox()`: draw a formula at given coordinates
   - `_html_to_reportlab_table()`: convert HTML to a ReportLab Table

3. **Update the API endpoints** (see the sketch below)
   - `/tasks/{id}/download/pdf?mode=coordinate` (default)
   - `/tasks/{id}/download/pdf?mode=flow`
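
A possible shape for the endpoint change, assuming the FastAPI stack named in the companion design doc; `get_result_json_path()` is a hypothetical lookup helper standing in for the existing task-result resolution:

```python
from fastapi import APIRouter, HTTPException, Query
from fastapi.responses import FileResponse

router = APIRouter()

@router.get("/tasks/{task_id}/download/pdf")
def download_pdf(
    task_id: str,
    mode: str = Query("coordinate", pattern="^(coordinate|flow)$"),  # default unchanged
):
    json_path = get_result_json_path(task_id)  # assumed existing lookup helper
    output_path = json_path.with_suffix(f".{mode}.pdf")

    ok = PDFGeneratorService().generate_pdf(json_path, output_path, mode=mode)
    if not ok:
        raise HTTPException(status_code=500, detail="PDF generation failed")
    return FileResponse(output_path, media_type="application/pdf")
```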

### Phase 3: testing and tuning (2-3 hours)

1. **Unit tests**
   - Test AdvancedLayoutExtractor
   - Test both PDF modes
   - Test backward compatibility

2. **Performance tests**
   - Monitor GPU memory usage
   - Measure processing speed
   - Test concurrent requests

3. **Quality validation**
   - Coordinate accuracy
   - Reading-order correctness
   - Table recognition accuracy

---

## 📈 Expected Results

### Functional Improvements

| Metric | Current | After refactor | Gain |
|------|-----|--------|------|
| bbox availability | 0% (all empty) | 100% | ✅ ∞ |
| Layout element categories | 2 | 23 | ✅ 11.5x |
| Reading order | none | fully preserved | ✅ 100% |
| Information loss | 21.6% | 0% (flow mode) | ✅ 100% |
| PDF modes | 1 | 2 | ✅ 2x |
| Translation support | difficult | full | ✅ 100% |

### GPU Utilization Improvements

| Configuration item (RTX 4060 8GB) | Current | After refactor |
|----------------|--------|--------|
| GPU utilization | ~30% | ~70% |
| Processing speed | 0.5 pages/s | 1.2 pages/s |
| Preprocessing features | off | all on |
| Recognition accuracy | ~85% | ~95% |

---

## 🎯 Migration Strategy

### Backward Compatibility Guarantees

1. **API level**
   - Keep all existing API endpoints
   - Add an optional `mode` parameter
   - Default behavior unchanged

2. **Data level**
   - Old JSON files remain usable
   - New fields do not affect old logic
   - Incremental updates

3. **Deployment strategy**
   - Deploy the new engine and services first
   - Enable new features gradually
   - Monitor performance and error rates

---

## 📝 Configuration Files

### requirements.txt updates

```txt
# Existing dependencies
paddlepaddle-gpu>=3.0.0
paddleocr>=3.0.0

# New dependencies
python-docx>=0.8.11     # Word document generation (optional)
PyMuPDF>=1.23.0         # enhanced PDF processing
beautifulsoup4>=4.12.0  # HTML parsing
lxml>=4.9.0             # faster XML/HTML parsing
```

### Environment Variables

```bash
# Additions to .env.local
PADDLE_GPU_MEMORY=6144          # RTX 4060 8GB: keep 2GB for the system
PADDLE_USE_SERVER_MODEL=true
PADDLE_ENABLE_ALL_FEATURES=true

# Default PDF generation mode
PDF_DEFAULT_MODE=coordinate     # or flow
```
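
A minimal sketch of how the engine and generator could honor these variables (variable names as above; the mobile model names are assumptions mirroring the server naming used earlier):

```python
import os

# Read tuning knobs from the environment, with the documented defaults
gpu_mem = int(os.getenv("PADDLE_GPU_MEMORY", "6144"))
use_server = os.getenv("PADDLE_USE_SERVER_MODEL", "true").lower() == "true"
default_pdf_mode = os.getenv("PDF_DEFAULT_MODE", "coordinate")

# Switch model families based on the flag
det_model = "ch_PP-OCRv4_server_det" if use_server else "ch_PP-OCRv4_mobile_det"
rec_model = "ch_PP-OCRv4_server_rec" if use_server else "ch_PP-OCRv4_mobile_rec"
```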

---

## 🚀 Implementation Priorities

### P0 (implement now)
1. ✅ PPStructureV3Engine unified engine
2. ✅ AdvancedLayoutExtractor complete extraction
3. ✅ Coordinate-placement PDF mode

### P1 (second phase)
4. ⭐ Flow-layout PDF mode
5. ⭐ API endpoint updates (mode parameter)

### P2 (optimization phase)
6. Performance monitoring and tuning
7. Batch processing support
8. Quality-check tooling

---

## ⚠️ Risks and Mitigations

### Risk 1: insufficient GPU memory
**Mitigation**:
- Set `gpu_mem=6144` conservatively (keep 2GB in reserve)
- Add memory monitoring
- Process large documents in batches

### Risk 2: slower processing
**Mitigation**:
- On GPU, the server models outperform the mobile ones
- Process multiple pages in parallel
- Cache results

### Risk 3: backward-compatibility issues
**Mitigation**:
- Keep the old logic as a fallback
- Migrate incrementally
- Full test coverage

---

**Estimated total development time**: 7-10 hours
**Expected outcome**: 100% use of PP-StructureV3's capabilities + zero information loss + full translation support

Which phase would you like me to start implementing?

@@ -0,0 +1,691 @@

# PP-StructureV3 Full Layout Information Utilization Plan

## 📋 Executive Summary

### Problem Diagnosis
The current implementation **severely underestimates PP-StructureV3's capabilities**: it only reads the `page_result.markdown` property and entirely ignores the core layout information in `page_result.json`.

### Key Findings
1. **PP-StructureV3 provides complete layout-parsing information**, including:
   - `parsing_res_list`: layout elements ordered by reading order
   - `layout_bbox`: precise coordinates for every element
   - `layout_det_res`: layout detection results (region type, confidence)
   - `overall_ocr_res`: full OCR results (bbox for every piece of text)
   - `layout`: layout type (single/double/multi-column)

2. **Flaws in the current implementation**:
```python
# ❌ Current approach (ocr_service.py:615-646)
markdown_dict = page_result.markdown  # only fetches markdown and images
markdown_texts = markdown_dict.get('markdown_texts', '')
# bbox is set to an empty list
'bbox': [],  # PP-StructureV3 doesn't provide individual bbox in this format
```

3. **What it should do instead**:
```python
# ✅ Correct approach
json_data = page_result.json  # fetch the complete structured information
parsing_list = json_data.get('parsing_res_list', [])  # reading order + bbox
layout_det = json_data.get('layout_det_res', {})      # layout detection
overall_ocr = json_data.get('overall_ocr_res', {})    # coordinates of all text
```

---

## 🎯 Planning Goals

### Phase 1: extract complete layout information (high priority)
**Goal**: modify `analyze_layout()` to use PP-StructureV3's full capabilities

**Expected outcomes**:
- ✅ Every layout element has a precise `layout_bbox`
- ✅ The original reading order is preserved (the order of `parsing_res_list`)
- ✅ Layout type information is available (single/double column)
- ✅ Region classification is extracted (text/table/figure/title/formula)
- ✅ Zero information loss (no need to filter overlapping text)

### Phase 2: implement dual-mode PDF generation (medium priority)
**Goal**: provide two PDF generation modes

**Mode A: precise coordinate placement**
- Positions every element precisely via `layout_bbox`
- Preserves the visual appearance of the original document
- For scenarios that require faithful layout reproduction

**Mode B: flow layout**
- Flows content in `parsing_res_list` order
- Uses the high-level ReportLab Platypus API
- Zero information loss; all content is searchable
- For scenarios that need translation or content processing

### Phase 3: multi-column layout handling (low priority)
**Goal**: leverage PP-StructureV3's multi-column recognition

---

## 📊 PP-StructureV3 Complete Data Structures

### 1. Full structure of `page_result.json`

```python
{
    # Basic information
    "input_path": str,  # source file path
    "page_index": int,  # page number (PDF only)

    # Layout detection results
    "layout_det_res": {
        "boxes": [
            {
                "cls_id": int,  # class ID
                "label": str,   # region type: text/table/figure/title/formula/seal
                "score": float, # confidence, 0-1
                "coordinate": [x1, y1, x2, y2]  # rectangle coordinates
            },
            ...
        ]
    },

    # Full OCR results
    "overall_ocr_res": {
        "dt_polys": np.ndarray,   # text detection polygons
        "rec_polys": np.ndarray,  # text recognition polygons
        "rec_boxes": np.ndarray,  # text recognition boxes, (n, 4, 2) int16
        "rec_texts": List[str],   # recognized text
        "rec_scores": np.ndarray  # recognition confidence
    },

    # **Core layout parsing results (in reading order)**
    "parsing_res_list": [
        {
            "layout_bbox": np.ndarray,  # region bounding box [x1, y1, x2, y2]
            "layout": str,              # layout type: single/double/multi-column
            "text": str,                # text content (for text regions)
            "table": str,               # table HTML (for table regions)
            "image": str,               # image path (for image regions)
            "formula": str,             # formula LaTeX (for formula regions)
            # ... other region types
        },
        ...  # list order = reading order
    ],

    # Text paragraph OCR (in reading order)
    "text_paragraphs_ocr_res": {
        "rec_polys": np.ndarray,
        "rec_texts": List[str],
        "rec_scores": np.ndarray
    },

    # Optional module results
    "formula_res_region1": {...},  # formula recognition results
    "table_cell_img": {...},       # table cell images
    "seal_res_region1": {...}      # seal recognition results
}
```

### 2. Key Fields

| Field | Purpose | Format | Importance |
|------|------|---------|--------|
| `parsing_res_list` | **core data**: all layout elements in reading order | List[Dict] | ⭐⭐⭐⭐⭐ |
| `layout_bbox` | precise coordinates of each element | np.ndarray [x1,y1,x2,y2] | ⭐⭐⭐⭐⭐ |
| `layout` | layout type (single/double/multi-column) | str: single/double/multi | ⭐⭐⭐⭐ |
| `layout_det_res` | detailed layout detection results (region classes) | Dict with boxes list | ⭐⭐⭐⭐ |
| `overall_ocr_res` | OCR results and coordinates for all text | Dict with np.ndarray | ⭐⭐⭐⭐ |
| `markdown` | simplified Markdown output | Dict with texts/images | ⭐⭐ |
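
For orientation, a minimal access sketch over one page result, using only the fields from the structure and table above:

```python
json_data = page_result.json

# Core layout elements, already in reading order
for item in json_data.get('parsing_res_list', []):
    bbox = item.get('layout_bbox')         # np.ndarray [x1, y1, x2, y2]
    layout = item.get('layout', 'single')  # single/double/multi-column
    # exactly one content key (text/table/image/formula/...) is expected per region
    content_keys = [k for k in item if k not in ('layout_bbox', 'layout')]

# Coordinates for every recognized text line
ocr = json_data.get('overall_ocr_res', {})
texts = ocr.get('rec_texts', [])
boxes = ocr.get('rec_boxes')  # (n, 4, 2) array of text boxes
```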

---

## 🔧 Implementation Plan

### Task 1: refactor the `analyze_layout()` function

**File**: `/backend/app/services/ocr_service.py`

**Scope**: lines 590-710

**Core changes**:

```python
def analyze_layout(self, image_path: Path, output_dir: Optional[Path] = None, current_page: int = 0) -> Tuple[Optional[Dict], List[Dict]]:
    """
    Analyze document layout using PP-StructureV3 (using the complete JSON information)
    """
    try:
        structure_engine = self.get_structure_engine()
        results = structure_engine.predict(str(image_path))

        layout_elements = []
        images_metadata = []

        for page_idx, page_result in enumerate(results):
            # ✅ Change 1: use the full JSON data instead of markdown only
            json_data = page_result.json

            # ✅ Change 2: extract the layout detection results
            layout_det_res = json_data.get('layout_det_res', {})
            layout_boxes = layout_det_res.get('boxes', [])

            # ✅ Change 3: extract the core parsing_res_list (reading order + bbox)
            parsing_res_list = json_data.get('parsing_res_list', [])

            if parsing_res_list:
                # *** Core logic: use parsing_res_list ***
                for idx, item in enumerate(parsing_res_list):
                    # Extract the bbox (no longer an empty list!)
                    layout_bbox = item.get('layout_bbox')
                    if layout_bbox is not None:
                        # Convert numpy arrays to a standard format
                        if hasattr(layout_bbox, 'tolist'):
                            bbox = layout_bbox.tolist()
                        else:
                            bbox = list(layout_bbox)

                        # Convert to 4-point format: [[x1,y1], [x2,y1], [x2,y2], [x1,y2]]
                        if len(bbox) == 4:  # [x1, y1, x2, y2]
                            x1, y1, x2, y2 = bbox
                            bbox = [[x1, y1], [x2, y1], [x2, y2], [x1, y2]]
                        else:
                            bbox = []
                    else:
                        bbox = []

                    # Extract the layout type
                    layout_type = item.get('layout', 'single')

                    # Create the element (with all information)
                    element = {
                        'element_id': idx,
                        'page': current_page,
                        'bbox': bbox,                # ✅ no longer an empty list!
                        'layout_type': layout_type,  # ✅ new: layout type
                        'reading_order': idx,        # ✅ new: reading order
                    }

                    # Extract data by content type
                    if 'table' in item:
                        element['type'] = 'table'
                        element['content'] = item['table']
                        # Extract plain table text (for translation)
                        element['extracted_text'] = self._extract_table_text(item['table'])

                    elif 'text' in item:
                        element['type'] = 'text'
                        element['content'] = item['text']

                    elif 'figure' in item or 'image' in item:
                        element['type'] = 'image'
                        element['content'] = item.get('figure') or item.get('image')

                    elif 'formula' in item:
                        element['type'] = 'formula'
                        element['content'] = item['formula']

                    elif 'title' in item:
                        element['type'] = 'title'
                        element['content'] = item['title']

                    else:
                        # Unknown type: record any non-system field
                        for key, value in item.items():
                            if key not in ['layout_bbox', 'layout']:
                                element['type'] = key
                                element['content'] = value
                                break

                    layout_elements.append(element)

            else:
                # Fall back to the markdown approach (backward compatibility)
                logger.warning("No parsing_res_list found, falling back to markdown parsing")
                markdown_dict = page_result.markdown
                # ... existing markdown parsing logic ...

            # ✅ Change 4: also handle extracted images (still saved to disk)
            markdown_dict = page_result.markdown
            markdown_images = markdown_dict.get('markdown_images', {})

            for img_idx, (img_path, img_obj) in enumerate(markdown_images.items()):
                # Save the image to disk
                try:
                    base_dir = output_dir if output_dir else image_path.parent
                    full_img_path = base_dir / img_path
                    full_img_path.parent.mkdir(parents=True, exist_ok=True)

                    if hasattr(img_obj, 'save'):
                        img_obj.save(str(full_img_path))
                        logger.info(f"Saved extracted image to {full_img_path}")
                except Exception as e:
                    logger.warning(f"Failed to save image {img_path}: {e}")

                # Extract the bbox (from the file name, or matched in parsing_res_list)
                bbox = self._find_image_bbox(img_path, parsing_res_list, layout_boxes)

                images_metadata.append({
                    'element_id': len(layout_elements) + img_idx,
                    'image_path': img_path,
                    'type': 'image',
                    'page': current_page,
                    'bbox': bbox,
                })

        if layout_elements:
            layout_data = {
                'elements': layout_elements,
                'total_elements': len(layout_elements),
                'reading_order': [e['reading_order'] for e in layout_elements],  # ✅ keep the reading order
                'layout_types': list(set(e.get('layout_type') for e in layout_elements)),  # ✅ layout type statistics
            }
            logger.info(f"Detected {len(layout_elements)} layout elements (with bbox and reading order)")
            return layout_data, images_metadata
        else:
            logger.warning("No layout elements detected")
            return None, []

    except Exception as e:
        import traceback
        logger.error(f"Layout analysis error: {str(e)}\n{traceback.format_exc()}")
        return None, []


def _find_image_bbox(self, img_path: str, parsing_res_list: List[Dict], layout_boxes: List[Dict]) -> List:
    """
    Look up an image's bbox in parsing_res_list or layout_det_res
    """
    # Method 1: parse it from the file name (existing approach)
    import re
    match = re.search(r'box_(\d+)_(\d+)_(\d+)_(\d+)', img_path)
    if match:
        x1, y1, x2, y2 = map(int, match.groups())
        return [[x1, y1], [x2, y1], [x2, y2], [x1, y2]]

    # Method 2: match against parsing_res_list (if it carries image path info)
    for item in parsing_res_list:
        if 'image' in item or 'figure' in item:
            content = item.get('image') or item.get('figure')
            if img_path in str(content):
                bbox = item.get('layout_bbox')
                if bbox is not None:
                    if hasattr(bbox, 'tolist'):
                        bbox_list = bbox.tolist()
                    else:
                        bbox_list = list(bbox)
                    if len(bbox_list) == 4:
                        x1, y1, x2, y2 = bbox_list
                        return [[x1, y1], [x2, y1], [x2, y2], [x1, y2]]

    # Method 3: match against layout_det_res (by region type)
    for box in layout_boxes:
        if box.get('label') in ['figure', 'image']:
            coord = box.get('coordinate', [])
            if len(coord) == 4:
                x1, y1, x2, y2 = coord
                return [[x1, y1], [x2, y1], [x2, y2], [x1, y2]]

    logger.warning(f"Could not find bbox for image {img_path}")
    return []
```

---

### Task 2: update the PDF generator to use the new information

**File**: `/backend/app/services/pdf_generator_service.py`

**Core changes**:

1. **Remove the text-filtering logic** (no longer needed!)
   - `parsing_res_list` is already ordered by reading order
   - Tables/images have their own regions, and text has its own regions
   - So there is no overlap problem

2. **Render elements by `reading_order`** (excerpt; canvas setup is elided)
```python
def generate_layout_pdf(self, json_path: Path, output_path: Path, mode: str = 'coordinate') -> bool:
    """
    mode: 'coordinate' or 'flow'
    """
    # Load the data
    ocr_data = self.load_ocr_json(json_path)
    layout_data = ocr_data.get('layout_data', {})
    elements = layout_data.get('elements', [])

    if mode == 'coordinate':
        # Mode A: coordinate placement
        return self._generate_coordinate_pdf(elements, output_path, ocr_data)
    else:
        # Mode B: flow layout
        return self._generate_flow_pdf(elements, output_path, ocr_data, json_path)

def _generate_coordinate_pdf(self, elements: List[Dict], output_path: Path, ocr_data: Dict) -> bool:
    """Coordinate-placement mode: faithful layout reproduction"""
    # Sort elements by reading_order
    sorted_elements = sorted(elements, key=lambda x: x.get('reading_order', 0))

    # Group by page number
    pages = {}
    for elem in sorted_elements:
        page = elem.get('page', 0)
        if page not in pages:
            pages[page] = []
        pages[page].append(elem)

    # Render each page
    # (pdf_canvas, page_height, scale_w and scale_h are set up as in the
    #  existing generator; elided in this excerpt)
    for page_num, page_elements in sorted(pages.items()):
        for elem in page_elements:
            bbox = elem.get('bbox', [])
            elem_type = elem.get('type')
            content = elem.get('content', '')

            if not bbox:
                logger.warning(f"Element {elem['element_id']} has no bbox, skipping")
                continue

            # Render with precise coordinates
            if elem_type == 'table':
                self.draw_table_at_bbox(pdf_canvas, content, bbox, page_height, scale_w, scale_h)
            elif elem_type == 'text':
                self.draw_text_at_bbox(pdf_canvas, content, bbox, page_height, scale_w, scale_h)
            elif elem_type == 'image':
                self.draw_image_at_bbox(pdf_canvas, content, bbox, page_height, scale_w, scale_h)
            # ... other types

def _generate_flow_pdf(self, elements: List[Dict], output_path: Path, ocr_data: Dict, json_path: Path) -> bool:
    """Flow-layout mode: zero information loss"""
    from reportlab.platypus import SimpleDocTemplate, Paragraph, Table, Image, Spacer
    from reportlab.lib.styles import getSampleStyleSheet

    # Sort elements by reading_order
    sorted_elements = sorted(elements, key=lambda x: x.get('reading_order', 0))

    # Build the story (flowing content)
    story = []
    styles = getSampleStyleSheet()

    for elem in sorted_elements:
        elem_type = elem.get('type')
        content = elem.get('content', '')

        if elem_type == 'title':
            story.append(Paragraph(content, styles['Title']))
        elif elem_type == 'text':
            story.append(Paragraph(content, styles['Normal']))
        elif elem_type == 'table':
            # Parse the HTML table into a ReportLab Table
            table_obj = self._html_to_reportlab_table(content)
            story.append(table_obj)
        elif elem_type == 'image':
            # Embed the image (resolved relative to the OCR JSON)
            img_path = json_path.parent / content
            if img_path.exists():
                story.append(Image(str(img_path), width=400, height=300))

        story.append(Spacer(1, 12))  # spacing

    # Build the PDF
    doc = SimpleDocTemplate(str(output_path))
    doc.build(story)
    return True
```

---

## 📈 Expected Results Comparison

### Current vs. New Implementation

| Metric | Current ❌ | New ✅ | Improvement |
|------|-----------|----------|------|
| **bbox data** | empty list `[]` | precise coordinates `[x1,y1,x2,y2]` | ✅ 100% |
| **Reading order** | none (mixed HTML) | `reading_order` field | ✅ 100% |
| **Layout type** | none | `layout_type` (single/double column) | ✅ 100% |
| **Element classification** | naive `<table` check | precise classes (9+ types) | ✅ 100% |
| **Information loss** | 21.6% of text filtered out | 0% loss (flow mode) | ✅ 100% |
| **Coordinate coverage** | only some image bboxes | bbox for every element | ✅ 100% |
| **PDF modes** | coordinate placement only | dual mode (coordinate + flow) | ✅ new feature |
| **Translation support** | difficult (information loss) | full (zero loss) | ✅ 100% |

### Concrete Improvements

#### 1. Zero information loss
```python
# ❌ Current: 342 text regions → 268 after filtering = 74 lost (21.6%)
filtered_text_regions = self._filter_text_in_regions(text_regions, regions_to_avoid)

# ✅ New: no filtering needed; use parsing_res_list directly.
# All elements (text, tables, images) live in their own regions and do not overlap.
for elem in sorted(elements, key=lambda x: x['reading_order']):
    render_element(elem)  # render every element, zero loss
```

#### 2. Precise bboxes
```python
# ❌ Current: bbox is an empty list
{
    'element_id': 0,
    'type': 'table',
    'bbox': [],  # ← cannot be positioned!
}

# ✅ New: precise coordinates taken from layout_bbox
{
    'element_id': 0,
    'type': 'table',
    'bbox': [[770, 776], [1122, 776], [1122, 1058], [770, 1058]],  # ← precisely positioned!
    'reading_order': 3,
    'layout_type': 'single'
}
```

#### 3. Reading order
```python
# ❌ Current: correct reading order cannot be guaranteed;
# tables, images and text are mixed together in arbitrary order

# ✅ New: the order of parsing_res_list = the reading order
elements = sorted(elements, key=lambda x: x['reading_order'])
# Elements are rendered in reading_order: 0, 1, 2, 3, ...
# The document's logical order is fully preserved
```

---

## 🚀 Implementation Steps

### Phase 1: core refactor (2-3 hours)

1. **Modify the `analyze_layout()` function**
   - Extract `parsing_res_list` from `page_result.json`
   - Take each element's bbox from `layout_bbox`
   - Preserve the `reading_order`
   - Extract the `layout_type`
   - Test the output JSON structure

2. **Add helper functions**
   - `_find_image_bbox()`: look up image bboxes from multiple sources
   - `_convert_bbox_format()`: normalize bbox formats
   - `_extract_element_content()`: extract content by type

3. **Test and verify**
   - Re-run OCR on the existing test documents
   - Check that the generated JSON contains bboxes
   - Verify that the reading_order is correct

### Phase 2: PDF generation improvements (2-3 hours)

1. **Implement coordinate-placement mode**
   - Remove the text-filtering logic
   - Render each element precisely by bbox
   - Order same-page elements by reading_order

2. **Implement flow-layout mode**
   - Use ReportLab Platypus
   - Build the story in reading_order
   - Implement flow rendering for each element type

3. **Add an API parameter**
   - `/tasks/{id}/download/pdf?mode=coordinate` (default)
   - `/tasks/{id}/download/pdf?mode=flow`

### Phase 3: testing and tuning (1-2 hours)

1. **Full test pass**
   - Single-page documents
   - Multi-page PDFs
   - Multi-column layouts
   - Complex tables

2. **Performance tuning**
   - Avoid repeated computation
   - Optimize bbox conversion
   - Cache results

3. **Documentation updates**
   - Update the API docs
   - Add usage examples
   - Update the architecture diagram

---

## 💡 Key Technical Details

### 1. Numpy array handling
```python
# layout_bbox is a numpy.ndarray and must be converted to a standard format
layout_bbox = item.get('layout_bbox')
if hasattr(layout_bbox, 'tolist'):
    bbox = layout_bbox.tolist()  # [x1, y1, x2, y2]
else:
    bbox = list(layout_bbox)

# Convert to 4-point format
x1, y1, x2, y2 = bbox
bbox_4point = [[x1, y1], [x2, y1], [x2, y2], [x1, y2]]
```

### 2. Layout type handling (a column-splitting sketch follows)
```python
# Adjust the rendering strategy by layout_type
layout_type = elem.get('layout_type', 'single')

if layout_type == 'double':
    # Double-column layout: may need special handling
    pass
elif layout_type == 'multi':
    # Multi-column layout: more involved handling
    pass
```
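
The plan leaves double/multi-column handling open. One conceivable approach, sketched under the assumption that column boundaries roughly follow the page midline, is to partition elements by their horizontal center before rendering:

```python
def split_columns(elements, page_width):
    """Hypothetical helper: partition double-column elements by the page midline."""
    mid = page_width / 2
    left, right = [], []
    for elem in elements:
        bbox = elem.get('bbox') or []
        if not bbox:
            left.append(elem)  # elements without a bbox default to the first column
            continue
        x_center = (bbox[0][0] + bbox[2][0]) / 2
        (left if x_center < mid else right).append(elem)

    # Read the left column top-to-bottom, then the right column
    key = lambda e: (e.get('bbox') or [[0, 0]])[0][1]
    return sorted(left, key=key) + sorted(right, key=key)
```

In practice `parsing_res_list` already encodes the reading order, so a helper like this would only be needed as a cross-check or a fallback.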

### 3. Guaranteeing reading order
```python
# Make sure elements render in the right order
elements = layout_data.get('elements', [])
sorted_elements = sorted(elements, key=lambda x: (
    x.get('page', 0),           # first by page number
    x.get('reading_order', 0)   # then by reading order
))
```

---

## ⚠️ Risks and Mitigations

### Risk 1: backward compatibility
**Problem**: old JSON files lack the new fields

**Mitigation**:
```python
# Add fallback logic inside analyze_layout()
parsing_res_list = json_data.get('parsing_res_list', [])
if not parsing_res_list:
    logger.warning("No parsing_res_list, using markdown fallback")
    # use the old markdown parsing logic
```

### Risk 2: PaddleOCR version differences
**Problem**: different PaddleOCR versions may produce different output formats

**Mitigation**:
- Record the PaddleOCR version in the JSON
- Add version-detection logic
- Support multiple versions

### Risk 3: performance impact
**Problem**: extracting more information may increase processing time

**Mitigation**:
- Extract detailed information only when needed
- Use caching
- Process multiple pages in parallel

---

## 📝 TODO Checklist

### Phase 1: core refactor
- [ ] Modify `analyze_layout()` to use `page_result.json`
- [ ] Extract `parsing_res_list`
- [ ] Extract `layout_bbox` and convert its format
- [ ] Preserve `reading_order`
- [ ] Extract `layout_type`
- [ ] Implement `_find_image_bbox()`
- [ ] Add fallback logic (backward compatibility)
- [ ] Test the new JSON output structure

### Phase 2: PDF generation improvements
- [ ] Implement `_generate_coordinate_pdf()`
- [ ] Implement `_generate_flow_pdf()`
- [ ] Remove the old text-filtering logic
- [ ] Add the mode parameter to the API
- [ ] Implement the HTML table parser (for flow mode)
- [ ] Test PDF output in both modes

### Phase 3: testing and documentation
- [ ] Single-page document tests
- [ ] Multi-page PDF tests
- [ ] Complex layout tests (multi-column, table-dense)
- [ ] Performance tests
- [ ] Update the API docs
- [ ] Update the usage guide
- [ ] Write a migration guide

---

## 🎓 Learning Resources

1. **Official PaddleOCR documentation**
   - [PP-StructureV3 Usage Tutorial](http://www.paddleocr.ai/main/en/version3.x/pipeline_usage/PP-StructureV3.html)
   - [PaddleX PP-StructureV3](https://paddlepaddle.github.io/PaddleX/3.0/en/pipeline_usage/tutorials/ocr_pipelines/PP-StructureV3.html)

2. **ReportLab documentation**
   - [Platypus User Guide](https://www.reportlab.com/docs/reportlab-userguide.pdf)
   - [Table Styling](https://www.reportlab.com/docs/reportlab-userguide.pdf#page=80)

3. **Reference implementation**
   - PaddleOCR GitHub: `/paddlex/inference/pipelines/layout_parsing/pipeline_v2.py`

---

## 🏁 Success Criteria

### Must have
✅ Every layout element has a precise bbox
✅ Reading order is correctly preserved
✅ Zero information loss (flow mode)
✅ Backward compatible (old JSON still works)

### Should have
✅ Dual-mode PDF generation (coordinate + flow)
✅ Multi-column layouts handled correctly
✅ Translation support (table text is extractable)
✅ No noticeable performance regression

### Nice to have
✅ More element types supported (formulas, seals)
✅ Layout-type statistics and analysis
✅ Layout structure visualization

---

**Planning completed**: 2025-01-18
**Estimated development time**: 5-8 hours
**Priority**: P0 (highest priority)

(File diff suppressed because it is too large.)

openspec/changes/dual-track-document-processing/design.md (new file, 276 lines)
@@ -0,0 +1,276 @@

# Technical Design: Dual-track Document Processing

## Context

### Background
The current OCR tool processes all documents through PaddleOCR, even when dealing with editable PDFs that contain extractable text. This causes:
- Unnecessary processing overhead
- Potential quality degradation from re-OCRing already digital text
- Loss of precise formatting information
- Inefficient GPU usage on documents that don't need OCR

### Constraints
- RTX 4060 8GB GPU memory limitation
- Need to maintain backward compatibility with existing API
- Must support future translation features
- Should handle mixed documents (partially scanned, partially digital)

### Stakeholders
- API consumers expecting consistent JSON/PDF output
- Translation system requiring structure preservation
- Performance-sensitive deployments

## Goals / Non-Goals

### Goals
- Intelligently route documents to the appropriate processing track
- Preserve document structure for translation
- Optimize GPU usage by avoiding unnecessary OCR
- Maintain a unified output format across tracks
- Reduce processing time for editable PDFs by 70%+

### Non-Goals
- Implementing the actual translation engine (future phase)
- Supporting video or audio transcription
- Real-time collaborative editing
- OCR model training or fine-tuning

## Decisions

### Decision 1: Dual-track Architecture
**What**: Implement two separate processing pipelines - an OCR track and a Direct extraction track

**Why**:
- Editable PDFs don't need OCR and can be processed 10-100x faster
- Direct extraction preserves exact formatting and fonts
- The OCR track remains optimal for scanned documents

**Alternatives considered**:
1. **Single enhanced OCR pipeline**: Would still waste resources on editable PDFs
2. **Hybrid approach per page**: Too complex; most documents are uniformly editable or scanned
3. **Multiple specialized pipelines**: Over-engineering for current requirements

### Decision 2: UnifiedDocument Model
**What**: Create a standardized intermediate representation for both tracks

**Why**:
- Provides consistent API interface regardless of processing track
- Simplifies downstream processing (PDF generation, translation)
- Enables track switching without breaking changes

**Structure**:
```python
@dataclass
class UnifiedDocument:
    document_id: str
    metadata: DocumentMetadata
    pages: List[Page]
    processing_track: Literal["ocr", "direct"]

@dataclass
class Page:
    page_number: int
    elements: List[DocumentElement]
    dimensions: Dimensions

@dataclass
class DocumentElement:
    element_id: str
    type: ElementType  # text, table, image, header, etc.
    content: Union[str, Dict, bytes]
    bbox: BoundingBox
    style: Optional[StyleInfo]
    confidence: Optional[float]  # Only for OCR track
```
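
The supporting types (`DocumentMetadata`, `Dimensions`, `BoundingBox`, `StyleInfo`, `ElementType`) are referenced but not pinned down in this design. One possible set of definitions, purely illustrative:

```python
from dataclasses import dataclass
from enum import Enum
from typing import Optional

class ElementType(str, Enum):
    TEXT = "text"
    TITLE = "title"
    TABLE = "table"
    IMAGE = "image"
    HEADER = "header"
    FOOTER = "footer"
    FORMULA = "formula"

@dataclass
class BoundingBox:
    x1: float
    y1: float
    x2: float
    y2: float

@dataclass
class Dimensions:
    width: float
    height: float

@dataclass
class StyleInfo:
    font_name: Optional[str] = None
    font_size: Optional[float] = None
    bold: bool = False
    italic: bool = False

@dataclass
class DocumentMetadata:
    filename: str
    page_count: int
    source_format: str  # e.g. "pdf", "png", "docx"
```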

### Decision 3: PyMuPDF for Direct Extraction
**What**: Use the PyMuPDF (fitz) library for editable PDF processing

**Why**:
- Mature, well-maintained library
- Excellent coordinate preservation
- Fast C++ backend
- Supports text, table, and image extraction with positions

**Alternatives considered**:
1. **pdfplumber**: Good but slower, less precise coordinates
2. **PyPDF2**: Limited layout information
3. **PDFMiner**: Complex API, slower performance

### Decision 4: Processing Track Auto-detection
**What**: Automatically determine the optimal track based on document analysis

**Detection logic**:
```python
import fitz  # PyMuPDF
import magic

# MIME types routed to the OCR track for now (Office formats)
OFFICE_MIMES = {
    'application/vnd.openxmlformats-officedocument.wordprocessingml.document',
    'application/vnd.openxmlformats-officedocument.spreadsheetml.sheet',
    'application/vnd.openxmlformats-officedocument.presentationml.presentation',
}

def detect_track(file_path: Path) -> str:
    file_type = magic.from_file(str(file_path), mime=True)

    if file_type.startswith('image/'):
        return "ocr"

    if file_type == 'application/pdf':
        # Check if the PDF has extractable text
        doc = fitz.open(file_path)
        try:
            # Sample the first 3 pages
            for i in range(min(3, doc.page_count)):
                text = doc[i].get_text()
                if len(text.strip()) < 100:  # minimal text on this page
                    return "ocr"
        finally:
            doc.close()
        return "direct"

    if file_type in OFFICE_MIMES:
        return "ocr"  # For now; may add direct Office support later

    return "ocr"  # Default fallback
```
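
A usage sketch of how the router might sit in the service layer; the `extract()`/`process()` method names on `DirectExtractionEngine` and `OCRService` are assumptions, not settled API:

```python
path = Path("uploads/report.pdf")
track = detect_track(path)

if track == "direct":
    document = DirectExtractionEngine().extract(path)  # assumed method name
else:
    document = OCRService().process(path)              # assumed method name

document.processing_track = track  # recorded in the UnifiedDocument metadata
```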

### Decision 5: GPU Memory Management
**What**: Implement dynamic batch sizing and model caching for the RTX 4060 8GB

**Why**:
- Prevents OOM errors
- Maximizes throughput
- Enables concurrent request handling

**Strategy**:
```python
from functools import lru_cache

# Adaptive batch sizing based on available memory
batch_size = calculate_batch_size(
    available_memory=get_gpu_memory(),
    image_size=image.shape,
    model_size=MODEL_MEMORY_REQUIREMENTS
)

# Model caching to avoid reload overhead
@lru_cache(maxsize=2)
def get_model(model_type: str):
    return load_model(model_type)
```
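
`calculate_batch_size()` is left abstract above. One way it could be realized, assuming memory figures are in bytes and using a rough activation-overhead constant (both assumptions, not tuned values):

```python
def calculate_batch_size(available_memory: int, image_size: tuple, model_size: int) -> int:
    """Estimate how many images fit alongside the model in GPU memory.

    available_memory and model_size are in bytes; image_size is (h, w, c).
    """
    h, w, c = image_size
    # float32 activations, with an assumed 4x overhead for intermediate tensors
    per_image = h * w * c * 4 * 4
    headroom = available_memory - model_size
    if headroom <= per_image:
        return 1  # always process at least one image
    return max(1, min(16, headroom // per_image))  # cap to keep latency bounded
```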

### Decision 6: Backward Compatibility
**What**: Maintain the existing API while adding new capabilities

**How**:
- Existing endpoints continue working unchanged
- The new `processing_track` parameter is optional
- Output format compatible with current consumers
- Gradual migration path for clients

## Risks / Trade-offs

### Risk 1: Mixed Content Documents
**Risk**: Documents with both scanned and digital pages
**Mitigation**:
- Page-level track detection as fallback
- Confidence scoring to identify uncertain pages
- Manual override option via API

### Risk 2: Direct Extraction Quality
**Risk**: Some PDFs have poor internal structure
**Mitigation**:
- Fall back to the OCR track if extraction quality is low (see the sketch below)
- Quality metrics: text density, structure coherence
- User-reportable quality issues
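
A sketch of what this quality gate could look like over the UnifiedDocument model; the thresholds are placeholders, not tuned values:

```python
def extraction_quality_ok(document: "UnifiedDocument") -> bool:
    """Heuristic check that direct extraction produced usable text."""
    total_chars = 0
    garbage_chars = 0
    for page in document.pages:
        for element in page.elements:
            if isinstance(element.content, str):
                total_chars += len(element.content)
                # Replacement characters usually signal broken font encodings
                garbage_chars += element.content.count("\ufffd")
    if total_chars < 50:  # almost no text: document is likely image-only
        return False
    return garbage_chars / total_chars < 0.05

# Caller falls back: if not extraction_quality_ok(doc), reroute to the OCR track
```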

### Risk 3: Memory Pressure
**Risk**: RTX 4060 8GB limitation with concurrent requests
**Mitigation**:
- Request queuing system
- Dynamic batch adjustment
- CPU fallback for overflow

### Trade-off 1: Processing Time vs Accuracy
- Direct extraction: fast, but depends on PDF quality
- OCR: slower, but consistent quality
- **Decision**: Prioritize speed for editable PDFs, accuracy for scanned documents

### Trade-off 2: Complexity vs Flexibility
- Two tracks increase system complexity
- But they enable optimal processing per document type
- **Decision**: Accept the complexity for 10x+ performance gains

## Migration Plan

### Phase 1: Infrastructure (Week 1-2)
1. Deploy UnifiedDocument model
2. Implement DocumentTypeDetector
3. Add DirectExtractionEngine
4. Update logging and monitoring

### Phase 2: Integration (Week 3)
1. Update OCR service with routing logic
2. Modify PDF generator for unified model
3. Add new API endpoints
4. Deploy to staging

### Phase 3: Validation (Week 4)
1. A/B testing with a subset of traffic
2. Performance benchmarking
3. Quality validation
4. Client integration testing

### Rollback Plan
1. Feature flag to disable dual-track
2. Fall back all requests to the OCR track
3. Maintain old code paths during the transition
4. Keep database migrations reversible

## Open Questions

### Resolved
- Q: Should we support page-level track mixing?
- A: No, it adds complexity with minimal benefit. Document-level is sufficient.

- Q: How to handle Office documents?
- A: OCR track initially; consider python-docx/openpyxl later if needed.

### Pending
- Q: What translation services to integrate with?
  - Needs stakeholder input on cost/quality trade-offs

- Q: Should we cache extracted text for repeated processing?
  - Depends on storage costs vs reprocessing frequency

- Q: How to handle password-protected PDFs?
  - May need an API parameter for passwords

## Performance Targets

### Direct Extraction Track
- Latency: <500ms per page
- Throughput: 100+ pages/minute
- Memory: <500MB per document

### OCR Track (Optimized)
- Latency: 2-5s per page (GPU)
- Throughput: 20-30 pages/minute
- Memory: <2GB per batch

### API Response Times
- Document type detection: <100ms
- Processing initiation: <200ms
- Result retrieval: <100ms

## Technical Dependencies

### Python Packages
```python
# Direct extraction
PyMuPDF==1.23.x
pdfplumber==0.10.x  # Fallback/validation
python-magic-bin==0.4.x

# OCR enhancement
paddlepaddle-gpu==2.5.2
paddleocr==2.7.3

# Infrastructure
pydantic==2.x
fastapi==0.100+
redis==5.x  # For caching
```

### System Requirements
- CUDA 11.8+ for PaddlePaddle
- libmagic for file detection
- 16GB RAM minimum
- 50GB disk for models and cache
openspec/changes/dual-track-document-processing/proposal.md (new file, 35 lines)
@@ -0,0 +1,35 @@

# Change: Dual-track Document Processing with Structure-Preserving Translation

## Why

The current system processes all documents through PaddleOCR, causing unnecessary overhead for editable PDFs that already contain extractable text. Additionally, we're only using ~20% of PP-StructureV3's capabilities, missing out on comprehensive document structure extraction. The system needs to support structure-preserving document translation as a future goal.

## What Changes

- **ADDED** Dual-track processing architecture with intelligent routing
  - OCR track for scanned documents, images, and Office files using PaddleOCR
  - Direct extraction track for editable PDFs using PyMuPDF
- **ADDED** UnifiedDocument model as the common output format for both tracks
- **ADDED** DocumentTypeDetector service for automatic track selection
- **MODIFIED** OCR service to use PP-StructureV3's parsing_res_list instead of markdown
  - Now extracts all 23 element types with bbox coordinates
  - Preserves reading order and hierarchical structure
- **MODIFIED** PDF generator to handle the UnifiedDocument format
  - Enhanced overlap detection to prevent text/image/table collisions
  - Improved coordinate transformation for accurate layout
- **ADDED** Foundation for structure-preserving translation system
- **BREAKING** JSON output structure will include new fields (backward compatible with defaults)

## Impact

- **Affected specs**:
  - `document-processing` (new capability)
  - `result-export` (enhanced with track metadata and structure data)
  - `task-management` (tracks processing route and history)
- **Affected code**:
  - `backend/app/services/ocr_service.py` - major refactoring for dual-track
  - `backend/app/services/pdf_generator_service.py` - UnifiedDocument support
  - `backend/app/api/v2/tasks.py` - new endpoints for track detection
  - `frontend/src/pages/TaskDetailPage.tsx` - display processing track info
- **Performance**: 5-10x faster for editable PDFs, same speed for scanned documents
- **Dependencies**: adds PyMuPDF, pdfplumber, python-magic-bin

@@ -0,0 +1,108 @@
# Document Processing Spec Delta

## ADDED Requirements

### Requirement: Dual-track Processing
The system SHALL support two distinct processing tracks for documents: OCR track for scanned/image documents and Direct extraction track for editable PDFs.

#### Scenario: Process scanned PDF through OCR track
- **WHEN** a scanned PDF is uploaded
- **THEN** the system SHALL detect it requires OCR
- **AND** route it through PaddleOCR PP-StructureV3 pipeline
- **AND** return results in UnifiedDocument format

#### Scenario: Process editable PDF through direct extraction
- **WHEN** an editable PDF with extractable text is uploaded
- **THEN** the system SHALL detect it can be directly extracted
- **AND** route it through PyMuPDF extraction pipeline
- **AND** return results in UnifiedDocument format without OCR

#### Scenario: Auto-detect processing track
- **WHEN** a document is uploaded without explicit track specification
- **THEN** the system SHALL analyze the document type and content
- **AND** automatically select the optimal processing track
- **AND** include the selected track in processing metadata
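
A minimal sketch of this routing, assuming the detector returns a track name plus confidence; the callable signatures are stand-ins, not the actual service interfaces:

```python
# Sketch only: the detector/engine callables are assumptions, not the real services.
from typing import Any, Callable

def route_document(path: str,
                   detect: Callable[[str], dict],
                   ocr_engine: Callable[[str], Any],
                   direct_engine: Callable[[str], Any]) -> dict:
    detection = detect(path)  # e.g. {"track": "direct", "confidence": 0.97}
    engine = direct_engine if detection["track"] == "direct" else ocr_engine
    return {
        "document": engine(path),                        # UnifiedDocument either way
        "processing_track": detection["track"],          # recorded in metadata
        "detection_confidence": detection["confidence"],
    }
```
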
### Requirement: Document Type Detection
The system SHALL provide intelligent document type detection to determine the optimal processing track.

#### Scenario: Detect editable PDF
- **WHEN** analyzing a PDF document
- **THEN** the system SHALL check for extractable text content
- **AND** return confidence score for editability
- **AND** recommend "direct" track if text coverage > 90%

#### Scenario: Detect scanned document
- **WHEN** analyzing an image or scanned PDF
- **THEN** the system SHALL identify lack of extractable text
- **AND** recommend "ocr" track for processing
- **AND** configure appropriate OCR models

#### Scenario: Detect Office documents
- **WHEN** analyzing .docx, .xlsx, .pptx files
- **THEN** the system SHALL identify Office format
- **AND** route to OCR track for initial implementation
- **AND** preserve option for future direct Office extraction
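
For the editability check, a PyMuPDF-based sketch of the coverage heuristic; treating "text coverage" as the share of pages with extractable text is an assumption, as is the return shape:

```python
# Sketch of the >90% coverage heuristic described above.
import fitz  # PyMuPDF

def recommend_track(pdf_path: str) -> dict:
    """Recommend 'direct' when more than 90% of pages carry extractable text."""
    doc = fitz.open(pdf_path)
    pages_with_text = sum(1 for page in doc if page.get_text("text").strip())
    coverage = pages_with_text / max(doc.page_count, 1)
    doc.close()
    return {"track": "direct" if coverage > 0.9 else "ocr",
            "confidence": coverage}
```
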
### Requirement: Unified Document Model
The system SHALL use a standardized UnifiedDocument model as the common output format for both processing tracks.

#### Scenario: Generate UnifiedDocument from OCR
- **WHEN** OCR processing completes
- **THEN** the system SHALL convert PP-StructureV3 results to UnifiedDocument
- **AND** preserve all element types, coordinates, and confidence scores
- **AND** maintain reading order and hierarchical structure

#### Scenario: Generate UnifiedDocument from direct extraction
- **WHEN** direct extraction completes
- **THEN** the system SHALL convert PyMuPDF results to UnifiedDocument
- **AND** preserve text styling, fonts, and exact positioning
- **AND** extract tables with cell boundaries and content

#### Scenario: Consistent output regardless of track
- **WHEN** processing completes through either track
- **THEN** the output SHALL conform to UnifiedDocument schema
- **AND** include processing_track metadata field
- **AND** support identical downstream operations (PDF generation, translation)
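
A minimal pydantic 2.x sketch of what UnifiedDocument could look like; apart from processing_track, which the spec names, the field names are assumptions:

```python
# Hedged sketch of the shared output model; field names beyond
# processing_track are assumptions from this spec.
from typing import Literal, Optional
from pydantic import BaseModel

class DocumentElement(BaseModel):
    type: str                            # e.g. "title", "table" (one of 23 layout classes)
    bbox: list[float]                    # [x0, y0, x1, y1] in source-page coordinates
    content: str                         # plain text; tables may also carry HTML
    reading_order: int                   # sequential index from parsing_res_list
    confidence: Optional[float] = None   # OCR track only

class UnifiedDocument(BaseModel):
    processing_track: Literal["ocr", "direct"]
    pages: int
    elements: list[DocumentElement]
    metadata: dict = {}                  # track-specific details (models, timings, quality)
```
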
### Requirement: Enhanced OCR with Full PP-StructureV3
The system SHALL utilize the full capabilities of PP-StructureV3, extracting all 23 element types from parsing_res_list.

#### Scenario: Extract comprehensive document structure
- **WHEN** processing through OCR track
- **THEN** the system SHALL use page_result.json['parsing_res_list']
- **AND** extract all element types including headers, lists, tables, figures
- **AND** preserve layout_bbox coordinates for each element

#### Scenario: Maintain reading order
- **WHEN** extracting elements from PP-StructureV3
- **THEN** the system SHALL preserve the reading order from parsing_res_list
- **AND** assign sequential indices to elements
- **AND** support reordering for complex layouts

#### Scenario: Extract table structure
- **WHEN** PP-StructureV3 identifies a table
- **THEN** the system SHALL extract cell content and boundaries
- **AND** preserve table HTML for structure
- **AND** extract plain text for translation
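
A sketch of the parsing_res_list walk; the block_* key names follow recent PP-StructureV3 JSON output, but the exact result layout varies across PaddleOCR releases, so treat them as assumptions to verify against the installed version:

```python
# Sketch: flatten one page's parsing_res_list into ordered elements.
# Key names are assumptions; inspect page_result.json on your release.
def elements_from_page(page_result) -> list[dict]:
    parsing_res_list = page_result.json["parsing_res_list"]
    elements = []
    for order, block in enumerate(parsing_res_list):   # list is already in reading order
        elements.append({
            "type": block.get("block_label", "text"),  # one of the 23 layout classes
            "bbox": block.get("block_bbox", []),       # [x0, y0, x1, y1]
            "content": block.get("block_content", ""),
            "reading_order": order,
        })
    return elements
```
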
### Requirement: Structure-Preserving Translation Foundation
The system SHALL maintain document structure and layout information to support future translation features.

#### Scenario: Preserve coordinates for translation
- **WHEN** processing any document
- **THEN** the system SHALL retain bbox coordinates for all text elements
- **AND** calculate space requirements for text expansion/contraction
- **AND** maintain element relationships and groupings

#### Scenario: Extract translatable content
- **WHEN** processing tables and lists
- **THEN** the system SHALL extract plain text content
- **AND** maintain mapping to original structure
- **AND** preserve formatting markers for reconstruction

#### Scenario: Support layout adjustment
- **WHEN** preparing for translation
- **THEN** the system SHALL identify flexible vs fixed layout regions
- **AND** calculate maximum text expansion ratios
- **AND** preserve non-translatable elements (logos, signatures)
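
A toy sketch of the expansion-ratio bookkeeping; the geometry-only capacity estimate and the default glyph metrics are assumptions:

```python
# Sketch: how much longer translated text may get before it overflows its box.
# char_width/line_height defaults are placeholder metrics, not measured values.
def max_expansion_ratio(bbox: list[float], text: str,
                        char_width: float = 6.0, line_height: float = 12.0) -> float:
    x0, y0, x1, y1 = bbox
    # Characters that fit: columns per line times number of lines.
    capacity = ((x1 - x0) // char_width) * ((y1 - y0) // line_height)
    return capacity / max(len(text), 1)
```
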
@@ -0,0 +1,74 @@
# Result Export Spec Delta

## MODIFIED Requirements

### Requirement: Export Interface
The Export page SHALL support downloading OCR results in multiple formats using V2 task APIs, with processing track information and enhanced structure data.

#### Scenario: Export page uses V2 download endpoints
- **WHEN** user selects a format and clicks export button
- **THEN** frontend SHALL call V2 endpoint `/api/v2/tasks/{task_id}/download/{format}`
- **AND** frontend SHALL NOT call the legacy `/api/v2/export` endpoint (which returns 404)
- **AND** file SHALL download successfully

#### Scenario: Export supports multiple formats
- **WHEN** user exports a completed task
- **THEN** system SHALL support downloading as TXT, JSON, Excel, Markdown, and PDF
- **AND** each format SHALL use correct V2 download endpoint
- **AND** downloaded files SHALL contain task OCR results

#### Scenario: Export includes processing track metadata
- **WHEN** user exports a task processed through dual-track system
- **THEN** exported JSON SHALL include "processing_track" field indicating "ocr" or "direct"
- **AND** SHALL include "processing_metadata" with track-specific information
- **AND** SHALL maintain backward compatibility for clients not expecting these fields

#### Scenario: Export UnifiedDocument format
- **WHEN** user requests JSON export with unified=true parameter
- **THEN** system SHALL return UnifiedDocument structure
- **AND** include complete element hierarchy with coordinates
- **AND** preserve all PP-StructureV3 element types for OCR track
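
Calling the V2 download endpoint from a script could look like this sketch; the host, port, and the exact spelling of the unified query parameter are assumptions:

```python
# Hedged sketch of a V2 download client; base URL is an assumption.
import httpx

def download_result(task_id: str, fmt: str = "json", unified: bool = False) -> bytes:
    params = {"unified": "true"} if unified else {}
    resp = httpx.get(
        f"http://localhost:8000/api/v2/tasks/{task_id}/download/{fmt}",
        params=params,
        timeout=30,
    )
    resp.raise_for_status()
    return resp.content
```
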
## ADDED Requirements

### Requirement: Enhanced PDF Export with Layout Preservation
The PDF export SHALL accurately preserve document layout from both OCR and direct extraction tracks.

#### Scenario: Export PDF from direct extraction track
- **WHEN** exporting PDF from a direct-extraction processed document
- **THEN** the PDF SHALL maintain exact text positioning from source
- **AND** preserve original fonts and styles where possible
- **AND** include extracted images at correct positions

#### Scenario: Export PDF from OCR track with full structure
- **WHEN** exporting PDF from OCR-processed document
- **THEN** the PDF SHALL use all 23 PP-StructureV3 element types
- **AND** render tables with proper cell boundaries
- **AND** maintain reading order from parsing_res_list

#### Scenario: Handle coordinate transformations
- **WHEN** generating PDF from UnifiedDocument
- **THEN** system SHALL correctly transform bbox coordinates to PDF space
- **AND** handle page size variations
- **AND** prevent text overlap using enhanced overlap detection
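
One way to read the transformation scenario: OCR boxes arrive in rendered-image pixels with a top-left origin, while PDF points run 72 per inch from the bottom-left. A hedged sketch, assuming the page was rendered at a known DPI:

```python
# Sketch: map a bbox from image pixels (top-left origin, rendered at `dpi`)
# to PDF points (72 per inch). The 300-dpi default is an assumption.
def bbox_to_pdf_space(bbox: list[float], page_height_pt: float,
                      dpi: float = 300.0) -> list[float]:
    scale = 72.0 / dpi
    x0, y0, x1, y1 = (v * scale for v in bbox)
    # Flip the y-axis: PDF's native origin is the bottom-left corner.
    return [x0, page_height_pt - y1, x1, page_height_pt - y0]
```
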
### Requirement: Structure Data Export
The system SHALL provide export formats that preserve document structure for downstream processing.

#### Scenario: Export structured JSON with hierarchy
- **WHEN** user selects structured JSON format
- **THEN** export SHALL include element hierarchy and relationships
- **AND** preserve parent-child relationships (sections, lists)
- **AND** include style and formatting information

#### Scenario: Export for translation preparation
- **WHEN** user exports with translation_ready=true parameter
- **THEN** export SHALL include translatable text segments
- **AND** maintain coordinate mappings for each segment
- **AND** mark non-translatable regions

#### Scenario: Export with layout analysis
- **WHEN** user requests layout analysis export
- **THEN** system SHALL include reading order indices
- **AND** identify layout regions (header, body, footer, sidebar)
- **AND** provide confidence scores for layout detection
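
The scenarios above name translatable segments, coordinate mappings, and non-translatable regions; the sketch below shows one possible shape for a single translation-ready segment (field names are assumptions):

```python
# Hypothetical shape of one segment in a translation_ready export.
segment = {
    "segment_id": 17,
    "text": "Quarterly revenue grew 12%.",
    "bbox": [72.0, 540.5, 412.3, 556.0],  # coordinate mapping for re-layout
    "reading_order": 17,
    "translatable": True,                 # False for logos, signatures, etc.
    "max_expansion_ratio": 1.4,
}
```
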
@@ -0,0 +1,105 @@
# Task Management Spec Delta

## MODIFIED Requirements

### Requirement: Task Result Generation
The OCR service SHALL generate both JSON and Markdown result files for completed tasks with actual content, including processing track information and enhanced structure data.

#### Scenario: Markdown file contains OCR results
- **WHEN** a task completes OCR processing successfully
- **THEN** the generated `.md` file SHALL contain the extracted text in markdown format
- **AND** the file size SHALL be greater than 0 bytes
- **AND** the markdown SHALL include headings, paragraphs, and formatting based on OCR layout detection

#### Scenario: Result files stored in task directory
- **WHEN** OCR processing completes for task ID `88c6c2d2-37e1-48fd-a50f-406142987bdf`
- **THEN** result files SHALL be stored in `storage/results/88c6c2d2-37e1-48fd-a50f-406142987bdf/`
- **AND** both `<filename>_result.json` and `<filename>_result.md` SHALL exist
- **AND** both files SHALL contain valid OCR output data

#### Scenario: Include processing track in results
- **WHEN** a task completes through dual-track processing
- **THEN** the JSON result SHALL include "processing_track" field
- **AND** SHALL indicate whether "ocr" or "direct" track was used
- **AND** SHALL include track-specific metadata (confidence for OCR, extraction quality for direct)

#### Scenario: Store UnifiedDocument format
- **WHEN** processing completes through either track
- **THEN** system SHALL save results in UnifiedDocument format
- **AND** maintain backward-compatible JSON structure
- **AND** include enhanced structure from PP-StructureV3 or PyMuPDF
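
A sketch of the result-writing step, assuming the storage layout named above; serializing via a plain dict stands in for whatever UnifiedDocument actually exposes:

```python
# Sketch: persist both result files for a task under storage/results/{task_id}/.
import json
from pathlib import Path

def write_results(task_id: str, stem: str, unified: dict, markdown: str) -> None:
    out_dir = Path("storage/results") / task_id
    out_dir.mkdir(parents=True, exist_ok=True)
    (out_dir / f"{stem}_result.json").write_text(
        json.dumps(unified, ensure_ascii=False, indent=2), encoding="utf-8")
    (out_dir / f"{stem}_result.md").write_text(markdown, encoding="utf-8")
```
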
### Requirement: Task Detail View
The frontend SHALL provide a dedicated page for viewing individual task details with processing track information and enhanced preview capabilities.

#### Scenario: Navigate to task detail page
- **WHEN** user clicks "View Details" button on task in Task History page
- **THEN** browser SHALL navigate to `/tasks/{task_id}`
- **AND** TaskDetailPage component SHALL render

#### Scenario: Display task information
- **WHEN** TaskDetailPage loads for a valid task ID
- **THEN** page SHALL display task metadata (filename, status, processing time, confidence)
- **AND** page SHALL show markdown preview of OCR results
- **AND** page SHALL provide download buttons for JSON, Markdown, and PDF formats

#### Scenario: Download from task detail page
- **WHEN** user clicks download button for a specific format
- **THEN** browser SHALL download the file using `/api/v2/tasks/{task_id}/download/{format}` endpoint
- **AND** downloaded file SHALL contain the task's OCR results in requested format

#### Scenario: Display processing track information
- **WHEN** viewing task processed through dual-track system
- **THEN** page SHALL display processing track used (OCR or Direct)
- **AND** show track-specific metrics (OCR confidence or extraction quality)
- **AND** provide option to reprocess with alternate track if applicable

#### Scenario: Preview document structure
- **WHEN** user enables structure view
- **THEN** page SHALL display document element hierarchy
- **AND** show bounding boxes overlay on preview
- **AND** highlight different element types (headers, tables, lists) with distinct colors

## ADDED Requirements

### Requirement: Processing Track Management
The task management system SHALL track and display processing track information for all tasks.

#### Scenario: Track processing route selection
- **WHEN** a task begins processing
- **THEN** system SHALL record the selected processing track
- **AND** log the reason for track selection
- **AND** store auto-detection confidence score

#### Scenario: Allow track override
- **WHEN** user views a completed task
- **THEN** system SHALL offer option to reprocess with different track
- **AND** maintain both results for comparison
- **AND** track which result user prefers

#### Scenario: Display processing metrics
- **WHEN** task completes processing
- **THEN** system SHALL record track-specific metrics
- **AND** OCR track SHALL show confidence scores and character count
- **AND** Direct track SHALL show extraction coverage and structure quality

### Requirement: Task Processing History
The system SHALL maintain detailed processing history for tasks including track changes and reprocessing.

#### Scenario: Record reprocessing attempts
- **WHEN** a task is reprocessed with different track
- **THEN** system SHALL maintain processing history
- **AND** store results from each attempt
- **AND** allow comparison between different processing attempts

#### Scenario: Track quality improvements
- **WHEN** viewing task history
- **THEN** system SHALL show quality metrics over time
- **AND** indicate if reprocessing improved results
- **AND** suggest optimal track based on document characteristics

#### Scenario: Export processing analytics
- **WHEN** exporting task data
- **THEN** system SHALL include processing history
- **AND** provide track selection statistics
- **AND** include performance metrics for each processing attempt
170
openspec/changes/dual-track-document-processing/tasks.md
Normal file
@@ -0,0 +1,170 @@
# Implementation Tasks: Dual-track Document Processing

## 1. Core Infrastructure
- [ ] 1.1 Add PyMuPDF and other dependencies to requirements.txt
  - [ ] 1.1.1 Add PyMuPDF==1.23.x
  - [ ] 1.1.2 Add pdfplumber==0.10.x
  - [ ] 1.1.3 Add python-magic-bin==0.4.x
  - [ ] 1.1.4 Test dependency installation
- [ ] 1.2 Create UnifiedDocument model in backend/app/models/
  - [ ] 1.2.1 Define UnifiedDocument dataclass
  - [ ] 1.2.2 Add DocumentElement model
  - [ ] 1.2.3 Add DocumentMetadata model
  - [ ] 1.2.4 Create converters for both OCR and direct extraction outputs
- [ ] 1.3 Create DocumentTypeDetector service
  - [ ] 1.3.1 Implement file type detection using python-magic
  - [ ] 1.3.2 Add PDF editability checking logic
  - [ ] 1.3.3 Add Office document detection
  - [ ] 1.3.4 Create routing logic to determine processing track
  - [ ] 1.3.5 Add unit tests for detector

## 2. Direct Extraction Track
- [ ] 2.1 Create DirectExtractionEngine service
  - [ ] 2.1.1 Implement PyMuPDF-based text extraction
  - [ ] 2.1.2 Add structure preservation logic
  - [ ] 2.1.3 Extract tables with coordinates
  - [ ] 2.1.4 Extract images and their positions
  - [ ] 2.1.5 Maintain reading order
  - [ ] 2.1.6 Handle multi-column layouts
- [ ] 2.2 Implement layout analysis for editable PDFs
  - [ ] 2.2.1 Detect headers and footers
  - [ ] 2.2.2 Identify sections and subsections
  - [ ] 2.2.3 Parse lists and nested structures
  - [ ] 2.2.4 Extract font and style information
- [ ] 2.3 Create direct extraction to UnifiedDocument converter
  - [ ] 2.3.1 Map PyMuPDF structures to UnifiedDocument
  - [ ] 2.3.2 Preserve coordinate information
  - [ ] 2.3.3 Maintain element relationships

## 3. OCR Track Enhancement
- [ ] 3.1 Upgrade PP-StructureV3 configuration
  - [ ] 3.1.1 Update config for RTX 4060 8GB optimization
  - [ ] 3.1.2 Enable batch processing for GPU efficiency
  - [ ] 3.1.3 Configure memory management settings
  - [ ] 3.1.4 Set up model caching
- [ ] 3.2 Enhance OCR service to use parsing_res_list
  - [ ] 3.2.1 Replace markdown extraction with parsing_res_list
  - [ ] 3.2.2 Extract all 23 element types
  - [ ] 3.2.3 Preserve bbox coordinates from PP-StructureV3
  - [ ] 3.2.4 Maintain reading order information
- [ ] 3.3 Create OCR to UnifiedDocument converter
  - [ ] 3.3.1 Map PP-StructureV3 elements to UnifiedDocument
  - [ ] 3.3.2 Handle complex nested structures
  - [ ] 3.3.3 Preserve all metadata

## 4. Unified Processing Pipeline
- [ ] 4.1 Update main OCR service for dual-track processing
  - [ ] 4.1.1 Integrate DocumentTypeDetector
  - [ ] 4.1.2 Route to appropriate processing engine
  - [ ] 4.1.3 Return UnifiedDocument from both tracks
  - [ ] 4.1.4 Maintain backward compatibility
- [ ] 4.2 Create unified JSON export
  - [ ] 4.2.1 Define standardized JSON schema
  - [ ] 4.2.2 Include processing metadata
  - [ ] 4.2.3 Support both track outputs
- [ ] 4.3 Update PDF generator for UnifiedDocument
  - [ ] 4.3.1 Adapt PDF generation to use UnifiedDocument
  - [ ] 4.3.2 Preserve layout from both tracks
  - [ ] 4.3.3 Handle coordinate transformations

## 5. Translation System Foundation
- [ ] 5.1 Create TranslationEngine interface
  - [ ] 5.1.1 Define translation API contract
  - [ ] 5.1.2 Support element-level translation
  - [ ] 5.1.3 Preserve formatting markers
- [ ] 5.2 Implement structure-preserving translation
  - [ ] 5.2.1 Translate text while maintaining coordinates
  - [ ] 5.2.2 Handle table cell translations
  - [ ] 5.2.3 Preserve list structures
  - [ ] 5.2.4 Maintain header hierarchies
- [ ] 5.3 Create translated document renderer
  - [ ] 5.3.1 Generate PDF with translated text
  - [ ] 5.3.2 Adjust layouts for text expansion/contraction
  - [ ] 5.3.3 Handle font substitution for target languages

## 6. API Updates
- [ ] 6.1 Update OCR endpoints
  - [ ] 6.1.1 Add processing_track parameter
  - [ ] 6.1.2 Support track auto-detection
  - [ ] 6.1.3 Return processing metadata
- [ ] 6.2 Add document type detection endpoint
  - [ ] 6.2.1 Create /analyze endpoint
  - [ ] 6.2.2 Return recommended processing track
  - [ ] 6.2.3 Provide confidence scores
- [ ] 6.3 Update result export endpoints
  - [ ] 6.3.1 Support UnifiedDocument format
  - [ ] 6.3.2 Add format conversion options
  - [ ] 6.3.3 Include processing track information

## 7. Frontend Updates
- [ ] 7.1 Update task detail view
  - [ ] 7.1.1 Display processing track information
  - [ ] 7.1.2 Show track-specific metadata
  - [ ] 7.1.3 Add track selection UI (if manual override needed)
- [ ] 7.2 Update results preview
  - [ ] 7.2.1 Handle UnifiedDocument format
  - [ ] 7.2.2 Display enhanced structure information
  - [ ] 7.2.3 Show coordinate overlays (debug mode)
- [ ] 7.3 Add translation UI preparation
  - [ ] 7.3.1 Add translation toggle/button
  - [ ] 7.3.2 Language selection dropdown
  - [ ] 7.3.3 Translation progress indicator

## 8. Testing
- [ ] 8.1 Unit tests for DocumentTypeDetector
  - [ ] 8.1.1 Test various file types
  - [ ] 8.1.2 Test editability detection
  - [ ] 8.1.3 Test edge cases
- [ ] 8.2 Unit tests for DirectExtractionEngine
  - [ ] 8.2.1 Test text extraction accuracy
  - [ ] 8.2.2 Test structure preservation
  - [ ] 8.2.3 Test coordinate extraction
- [ ] 8.3 Integration tests for dual-track processing
  - [ ] 8.3.1 Test routing logic
  - [ ] 8.3.2 Test UnifiedDocument generation
  - [ ] 8.3.3 Test backward compatibility
- [ ] 8.4 End-to-end tests
  - [ ] 8.4.1 Test scanned PDF processing (OCR track)
  - [ ] 8.4.2 Test editable PDF processing (direct track)
  - [ ] 8.4.3 Test Office document processing
  - [ ] 8.4.4 Test image file processing
- [ ] 8.5 Performance testing
  - [ ] 8.5.1 Benchmark both processing tracks
  - [ ] 8.5.2 Test GPU memory usage
  - [ ] 8.5.3 Compare processing times

## 9. Documentation
- [ ] 9.1 Update API documentation
  - [ ] 9.1.1 Document new endpoints
  - [ ] 9.1.2 Update existing endpoint docs
  - [ ] 9.1.3 Add processing track information
- [ ] 9.2 Create architecture documentation
  - [ ] 9.2.1 Document dual-track flow
  - [ ] 9.2.2 Explain UnifiedDocument structure
  - [ ] 9.2.3 Add decision trees for track selection
- [ ] 9.3 Add deployment guide
  - [ ] 9.3.1 Document GPU requirements
  - [ ] 9.3.2 Add environment configuration
  - [ ] 9.3.3 Include troubleshooting guide

## 10. Deployment Preparation
- [ ] 10.1 Update Docker configuration
  - [ ] 10.1.1 Add new dependencies to Dockerfile
  - [ ] 10.1.2 Configure GPU support
  - [ ] 10.1.3 Update volume mappings
- [ ] 10.2 Update environment variables
  - [ ] 10.2.1 Add processing track settings
  - [ ] 10.2.2 Configure GPU memory limits
  - [ ] 10.2.3 Add feature flags
- [ ] 10.3 Create migration plan
  - [ ] 10.3.1 Plan for existing data migration
  - [ ] 10.3.2 Create rollback procedures
  - [ ] 10.3.3 Document breaking changes

## Completion Checklist
- [ ] All unit tests passing
- [ ] Integration tests passing
- [ ] Performance benchmarks acceptable
- [ ] Documentation complete
- [ ] Code reviewed
- [ ] Deployment tested in staging
@@ -1,226 +0,0 @@
#!/usr/bin/env python3
"""
Proof of Concept: External API Authentication Test
Tests the external authentication API at https://pj-auth-api.vercel.app
"""

import asyncio
import json
from datetime import datetime
from typing import Dict, Any, Optional

import httpx
from pydantic import BaseModel, Field


class UserInfo(BaseModel):
    """User information from external API"""
    id: str
    name: str
    email: str
    job_title: Optional[str] = Field(None, alias="jobTitle")
    office_location: Optional[str] = Field(None, alias="officeLocation")
    business_phones: list[str] = Field(default_factory=list, alias="businessPhones")


class AuthSuccessData(BaseModel):
    """Successful authentication response data"""
    access_token: str
    id_token: str
    expires_in: int
    token_type: str
    user_info: UserInfo = Field(alias="userInfo")
    issued_at: str = Field(alias="issuedAt")
    expires_at: str = Field(alias="expiresAt")


class AuthSuccessResponse(BaseModel):
    """Successful authentication response"""
    success: bool
    message: str
    data: AuthSuccessData
    timestamp: str


class AuthErrorResponse(BaseModel):
    """Failed authentication response"""
    success: bool
    error: str
    code: str
    timestamp: str


class ExternalAuthClient:
    """Client for external authentication API"""

    def __init__(self, base_url: str = "https://pj-auth-api.vercel.app", timeout: int = 30):
        self.base_url = base_url
        self.timeout = timeout
        self.endpoint = "/api/auth/login"

    async def authenticate(self, username: str, password: str) -> Dict[str, Any]:
        """
        Authenticate user with external API

        Args:
            username: User email/username
            password: User password

        Returns:
            Authentication result dictionary
        """
        url = f"{self.base_url}{self.endpoint}"

        print(f"ℹ Endpoint: POST {url}")
        print(f"ℹ Username: {username}")
        print(f"ℹ Timestamp: {datetime.now().isoformat()}")
        print()

        async with httpx.AsyncClient() as client:
            try:
                # Make authentication request
                start_time = datetime.now()
                response = await client.post(
                    url,
                    json={"username": username, "password": password},
                    timeout=self.timeout
                )
                elapsed = (datetime.now() - start_time).total_seconds()

                # Print response details
                print("Response Details:")
                print(f" Status Code: {response.status_code}")
                print(f" Response Time: {elapsed:.3f}s")
                print(f" Content-Type: {response.headers.get('content-type', 'N/A')}")
                print()

                # Parse response
                response_data = response.json()
                print("Response Body:")
                print(json.dumps(response_data, indent=2, ensure_ascii=False))
                print()

                # Handle success/failure
                if response.status_code == 200:
                    auth_response = AuthSuccessResponse(**response_data)
                    return {
                        "success": True,
                        "status_code": response.status_code,
                        "data": auth_response.dict(),
                        "user_display_name": auth_response.data.user_info.name,
                        "user_email": auth_response.data.user_info.email,
                        "token": auth_response.data.access_token,
                        "expires_in": auth_response.data.expires_in,
                        "expires_at": auth_response.data.expires_at
                    }
                elif response.status_code == 401:
                    error_response = AuthErrorResponse(**response_data)
                    return {
                        "success": False,
                        "status_code": response.status_code,
                        "error": error_response.error,
                        "code": error_response.code
                    }
                else:
                    return {
                        "success": False,
                        "status_code": response.status_code,
                        "error": f"Unexpected status code: {response.status_code}",
                        "response": response_data
                    }

            except httpx.TimeoutException:
                print(f"❌ Request timeout after {self.timeout} seconds")
                return {
                    "success": False,
                    "error": "Request timeout",
                    "code": "TIMEOUT"
                }
            except httpx.RequestError as e:
                print(f"❌ Request error: {e}")
                return {
                    "success": False,
                    "error": str(e),
                    "code": "REQUEST_ERROR"
                }
            except Exception as e:
                print(f"❌ Unexpected error: {e}")
                return {
                    "success": False,
                    "error": str(e),
                    "code": "UNKNOWN_ERROR"
                }


async def test_authentication():
    """Test authentication with different scenarios"""
    client = ExternalAuthClient()

    # Test scenarios
    test_cases = [
        {
            "name": "Valid Credentials (Example)",
            "username": "ymirliu@panjit.com.tw",
            "password": "correct_password",  # Replace with actual password for testing
            "expected": "success"
        },
        {
            "name": "Invalid Credentials",
            "username": "test@example.com",
            "password": "wrong_password",
            "expected": "failure"
        }
    ]

    for i, test_case in enumerate(test_cases, 1):
        print(f"{'='*60}")
        print(f"Test Case {i}: {test_case['name']}")
        print(f"{'='*60}")

        result = await client.authenticate(
            username=test_case["username"],
            password=test_case["password"]
        )

        # Analyze result
        print("\nAnalysis:")
        if result["success"]:
            print("✅ Authentication successful")
            print(f" User: {result.get('user_display_name', 'N/A')}")
            print(f" Email: {result.get('user_email', 'N/A')}")
            print(f" Token expires in: {result.get('expires_in', 0)} seconds")
            print(f" Expires at: {result.get('expires_at', 'N/A')}")
        else:
            print("❌ Authentication failed")
            print(f" Error: {result.get('error', 'Unknown error')}")
            print(f" Code: {result.get('code', 'N/A')}")

        print("\n")


async def test_token_validation():
    """Test token validation and refresh logic"""
    # This would be implemented when we have a valid token
    print("Token validation test - To be implemented with actual tokens")


def main():
    """Main entry point"""
    print("External Authentication API Test")
    print("================================\n")

    # Run tests
    asyncio.run(test_authentication())

    print("\nTest completed!")
    print("\nNotes for implementation:")
    print("1. Use httpx for async HTTP requests (already in requirements)")
    print("2. Store tokens securely (consider encryption)")
    print("3. Implement automatic token refresh before expiration")
    print("4. Handle network failures with retry logic")
    print("5. Map external user ID to local user records")
    print("6. Display user 'name' field in UI instead of username")


if __name__ == "__main__":
    main()