chore: project cleanup and prepare for dual-track processing refactor

- Removed all test files and directories
- Deleted outdated documentation (will be rewritten)
- Cleaned up temporary files, logs, and uploads
- Archived 5 completed OpenSpec proposals
- Created new dual-track-document-processing proposal with complete OpenSpec structure
  - Dual-track architecture: OCR track (PaddleOCR) + Direct track (PyMuPDF)
  - UnifiedDocument model for consistent output
  - Support for structure-preserving translation
- Updated .gitignore to prevent future test/temp files

This is a major cleanup preparing for the complete refactoring of the document processing pipeline.

🤖 Generated with [Claude Code](https://claude.com/claude-code)

Co-Authored-By: Claude <noreply@anthropic.com>
This commit is contained in:
egg
2025-11-18 20:02:31 +08:00
parent 0edc56b03f
commit cd3cbea49d
64 changed files with 3573 additions and 8190 deletions


@@ -0,0 +1,817 @@
# Tool_OCR Architecture Overhaul Plan
## A refactoring plan built on the full capabilities of PaddleOCR PP-StructureV3
**Planning date**: 2025-01-18
**Hardware**: RTX 4060 8GB VRAM
**Priority**: P0 (highest)
---
## 📊 Current State Analysis
### Problems with the current architecture
#### 1. **PP-StructureV3 capabilities are severely underused**
```python
# ❌ Current implementation (ocr_service.py:614-646)
markdown_dict = page_result.markdown  # only the simplified view
markdown_texts = markdown_dict.get('markdown_texts', '')
'bbox': [],  # coordinates are all empty!
```
**Problems**:
- Only ~20% of PP-StructureV3's functionality is used
- `parsing_res_list` (the core data structure) is never read
- `layout_bbox` (precise coordinates) is never read
- `reading_order` is never used
- The 23 layout element categories are never used
#### 2. **GPU configuration is not optimized**
```python
# Current configuration (ocr_service.py:211-219)
self.structure_engine = PPStructureV3(
    use_doc_orientation_classify=False,  # ❌ preprocessing disabled
    use_doc_unwarping=False,             # ❌ unwarping disabled
    use_textline_orientation=False,      # ❌ orientation correction disabled
    # ... default settings otherwise
)
```
**Problems**:
- An RTX 4060 8GB can run the server models, yet the defaults are used
- Important preprocessing features are turned off
- GPU compute is underutilized
#### 3. **Single PDF generation strategy**
```python
# Only the coordinate-placement mode exists today,
# causing 21.6% text loss (overlap filtering)
filtered_text_regions = self._filter_text_in_regions(text_regions, regions_to_avoid)
```
**Problems**:
- Only coordinate placement is supported; no flow layout
- Zero information loss is impossible
- Translation support is limited
---
## 🎯 Refactoring Goals
### Core goals
1. **Fully exploit PP-StructureV3**
   - Extract `parsing_res_list` (23 element categories + reading order)
   - Extract `layout_bbox` (precise coordinates)
   - Extract `layout_det_res` (layout detection details)
   - Extract `overall_ocr_res` (coordinates for all text)
2. **Dual-mode PDF generation**
   - Mode A: coordinate placement (faithful layout reproduction)
   - Mode B: flow layout (zero information loss, translation-ready)
3. **Optimized GPU configuration**
   - Tuned for an RTX 4060 with 8GB VRAM
   - Server models plus all feature modules
   - Sensible memory management
4. **Backward compatibility**
   - Keep the existing API
   - Old JSON files remain usable
   - Incremental upgrade path
---
## 🏗️ New Architecture Design
### Architecture layers
```
┌──────────────────────────────────────────────────────┐
│ API Layer                                            │
│ /tasks, /results, /download (backward compatible)    │
└────────────────┬─────────────────────────────────────┘
┌────────────────▼─────────────────────────────────────┐
│ Service Layer                                        │
├──────────────────────────────────────────────────────┤
│ OCRService (existing, kept)                          │
│   └─ analyze_layout() [upgraded] ───┐                │
│                                     │                │
│ AdvancedLayoutExtractor (new) ◄── shares same engine │
│   └─ extract_complete_layout() ─────┘                │
│                                                      │
│ PDFGeneratorService (refactored)                     │
│   ├─ generate_coordinate_pdf() [Mode A]              │
│   └─ generate_flow_pdf()       [Mode B]              │
└────────────────┬─────────────────────────────────────┘
┌────────────────▼─────────────────────────────────────┐
│ Engine Layer                                         │
├──────────────────────────────────────────────────────┤
│ PPStructureV3Engine (new, unified manager)           │
│   ├─ GPU config (tuned for RTX 4060 8GB)             │
│   ├─ Model config (server models)                    │
│   └─ Feature switches (all enabled)                  │
└──────────────────────────────────────────────────────┘
```
### Core class design
#### 1. PPStructureV3Engine (new)
**Purpose**: manage the PP-StructureV3 engine in one place and avoid duplicate initialization
```python
class PPStructureV3Engine:
    """
    PP-StructureV3 engine manager (singleton),
    configured for an RTX 4060 with 8GB VRAM.
    """
    _instance = None

    def __new__(cls):
        if cls._instance is None:
            cls._instance = super().__new__(cls)
            cls._instance._initialize()
        return cls._instance

    def _initialize(self):
        """Initialize the engine."""
        logger.info("Initializing PP-StructureV3 with RTX 4060 8GB optimized config")
        self.engine = PPStructureV3(
            # ===== GPU configuration =====
            use_gpu=True,
            gpu_mem=6144,  # keep 2GB for the system (8GB - 2GB)
            # ===== Preprocessing modules (all enabled) =====
            use_doc_orientation_classify=True,  # document orientation correction
            use_doc_unwarping=True,             # document image unwarping
            use_textline_orientation=True,      # text-line orientation correction
            # ===== Feature modules (all enabled) =====
            use_table_recognition=True,    # table recognition
            use_formula_recognition=True,  # formula recognition
            use_chart_recognition=True,    # chart recognition
            use_seal_recognition=True,     # seal recognition
            # ===== OCR model configuration (server models) =====
            text_detection_model_name="ch_PP-OCRv4_server_det",
            text_recognition_model_name="ch_PP-OCRv4_server_rec",
            # ===== Layout detection parameters =====
            layout_threshold=0.5,     # layout detection threshold
            layout_nms=0.5,           # NMS threshold
            layout_unclip_ratio=1.5,  # bounding-box expansion ratio
            # ===== OCR parameters =====
            text_det_limit_side_len=1920,  # high-resolution detection
            text_det_thresh=0.3,           # detection threshold
            text_det_box_thresh=0.5,       # box threshold
            # ===== Misc =====
            show_log=True,
            use_angle_cls=False,  # superseded by textline_orientation
        )
        logger.info("PP-StructureV3 engine initialized successfully")
        logger.info("  - GPU: Enabled (RTX 4060 8GB)")
        logger.info("  - Models: Server (High Accuracy)")
        logger.info("  - Features: All Enabled (Table/Formula/Chart/Seal)")

    def predict(self, image_path: str):
        """Run prediction."""
        return self.engine.predict(image_path)

    def get_engine(self):
        """Return the underlying engine instance."""
        return self.engine
```
#### 2. AdvancedLayoutExtractor (new)
**Purpose**: extract all of the layout information PP-StructureV3 produces
```python
class AdvancedLayoutExtractor:
    """
    Advanced layout extractor.
    Fully exploits PP-StructureV3's parsing_res_list, layout_bbox and layout_det_res.
    """

    def __init__(self):
        self.engine = PPStructureV3Engine()

    def extract_complete_layout(
        self,
        image_path: Path,
        output_dir: Optional[Path] = None,
        current_page: int = 0
    ) -> Tuple[Optional[Dict], List[Dict]]:
        """
        Extract the complete layout information (via page_result.json).

        Returns:
            (layout_data, images_metadata)
            layout_data = {
                "elements": [
                    {
                        "element_id": int,
                        "type": str,            # one of the 23 categories
                        "bbox": [[x1,y1], [x2,y1], [x2,y2], [x1,y2]],  # ✅ no longer empty
                        "content": str,
                        "reading_order": int,   # ✅ reading order
                        "layout_type": str,     # ✅ single/double/multi-column
                        "confidence": float,    # ✅ confidence score
                        "page": int
                    },
                    ...
                ],
                "reading_order": [0, 1, 2, ...],
                "layout_types": ["single", "double"],
                "total_elements": int
            }
        """
        try:
            results = self.engine.predict(str(image_path))
            layout_elements = []
            images_metadata = []
            for page_idx, page_result in enumerate(results):
                # ✅ Key change: use page_result.json instead of page_result.markdown
                json_data = page_result.json
                # ===== Source 1: parsing_res_list (primary) =====
                parsing_res_list = json_data.get('parsing_res_list', [])
                if parsing_res_list:
                    logger.info(f"Found {len(parsing_res_list)} elements in parsing_res_list")
                    for idx, item in enumerate(parsing_res_list):
                        element = self._create_element_from_parsing_res(
                            item, idx, current_page
                        )
                        if element:
                            layout_elements.append(element)
                # ===== Source 2: layout_det_res (supplementary) =====
                layout_det_res = json_data.get('layout_det_res', {})
                layout_boxes = layout_det_res.get('boxes', [])
                # Enrich elements when parsing_res_list lacks some fields
                self._enrich_elements_with_layout_det(layout_elements, layout_boxes)
                # ===== Source 3: images (from markdown_images) =====
                markdown_dict = page_result.markdown
                markdown_images = markdown_dict.get('markdown_images', {})
                for img_idx, (img_path, img_obj) in enumerate(markdown_images.items()):
                    # Save the image to disk
                    self._save_image(img_obj, img_path, output_dir or image_path.parent)
                    # Look up the bbox in parsing_res_list or layout_det_res
                    bbox = self._find_image_bbox(
                        img_path, parsing_res_list, layout_boxes
                    )
                    images_metadata.append({
                        'element_id': len(layout_elements) + img_idx,
                        'image_path': img_path,
                        'type': 'image',
                        'page': current_page,
                        'bbox': bbox,
                    })
            if layout_elements:
                layout_data = {
                    'elements': layout_elements,
                    'total_elements': len(layout_elements),
                    'reading_order': [e['reading_order'] for e in layout_elements],
                    'layout_types': list(set(e.get('layout_type') for e in layout_elements)),
                }
                logger.info(f"✅ Extracted {len(layout_elements)} elements with complete info")
                return layout_data, images_metadata
            else:
                logger.warning("No layout elements found")
                return None, []
        except Exception as e:
            logger.error(f"Advanced layout extraction failed: {e}")
            import traceback
            traceback.print_exc()
            return None, []

    def _create_element_from_parsing_res(
        self, item: Dict, idx: int, current_page: int
    ) -> Optional[Dict]:
        """Create one element from a parsing_res_list item."""
        # Extract layout_bbox
        layout_bbox = item.get('layout_bbox')
        bbox = self._convert_bbox_to_4point(layout_bbox)
        # Extract the layout type
        layout_type = item.get('layout', 'single')
        # Build the base element
        element = {
            'element_id': idx,
            'page': current_page,
            'bbox': bbox,  # ✅ full coordinates
            'layout_type': layout_type,
            'reading_order': idx,
            'confidence': item.get('score', 0.0),
        }
        # Fill type and content based on the content kind.
        # Order matters! Priority: table > formula > image > title > text
        if 'table' in item and item['table']:
            element['type'] = 'table'
            element['content'] = item['table']
            # Extract plain table text (for translation)
            element['extracted_text'] = self._extract_table_text(item['table'])
        elif 'formula' in item and item['formula']:
            element['type'] = 'formula'
            element['content'] = item['formula']  # LaTeX
        elif 'figure' in item or 'image' in item:
            element['type'] = 'image'
            element['content'] = item.get('figure') or item.get('image')
        elif 'title' in item and item['title']:
            element['type'] = 'title'
            element['content'] = item['title']
        elif 'text' in item and item['text']:
            element['type'] = 'text'
            element['content'] = item['text']
        else:
            # Unknown type: take the first non-system field with a value
            for key, value in item.items():
                if key not in ['layout_bbox', 'layout', 'score'] and value:
                    element['type'] = key
                    element['content'] = value
                    break
            else:
                return None  # no content, skip
        return element

    def _convert_bbox_to_4point(self, layout_bbox) -> List:
        """Convert layout_bbox into the 4-point format."""
        if layout_bbox is None:
            return []
        # Handle numpy arrays
        if hasattr(layout_bbox, 'tolist'):
            bbox = layout_bbox.tolist()
        else:
            bbox = list(layout_bbox)
        if len(bbox) == 4:  # [x1, y1, x2, y2]
            x1, y1, x2, y2 = bbox
            return [[x1, y1], [x2, y1], [x2, y2], [x1, y2]]
        return []

    def _extract_table_text(self, html_content: str) -> str:
        """Extract plain text from an HTML table (for translation)."""
        try:
            from bs4 import BeautifulSoup
            soup = BeautifulSoup(html_content, 'html.parser')
            # Collect the text of every cell
            cells = []
            for cell in soup.find_all(['td', 'th']):
                text = cell.get_text(strip=True)
                if text:
                    cells.append(text)
            return ' | '.join(cells)
        except Exception as e:
            logger.warning(f"Failed to extract table text: {e}")
            # Fallback: strip HTML tags naively
            import re
            text = re.sub(r'<[^>]+>', ' ', html_content)
            text = re.sub(r'\s+', ' ', text)
            return text.strip()
```
#### 3. PDFGeneratorService (refactored)
**Purpose**: support dual-mode PDF generation
```python
class PDFGeneratorService:
    """
    PDF generation service (refactored).
    Supports two modes:
    - coordinate: coordinate-placement mode (faithful layout reproduction)
    - flow: flow-layout mode (zero information loss, translation-ready)
    """

    def generate_pdf(
        self,
        json_path: Path,
        output_path: Path,
        mode: str = 'coordinate',  # 'coordinate' or 'flow'
        source_file_path: Optional[Path] = None
    ) -> bool:
        """
        Generate a PDF.

        Args:
            json_path: path to the OCR JSON file
            output_path: output PDF path
            mode: generation mode ('coordinate' or 'flow')
            source_file_path: original file path (used for page dimensions)
        Returns:
            True on success
        """
        try:
            # Load the OCR data
            ocr_data = self.load_ocr_json(json_path)
            if not ocr_data:
                return False
            # Pick the generation strategy by mode
            if mode == 'flow':
                return self._generate_flow_pdf(ocr_data, output_path)
            else:
                return self._generate_coordinate_pdf(ocr_data, output_path, source_file_path, json_path)
        except Exception as e:
            logger.error(f"PDF generation failed: {e}")
            import traceback
            traceback.print_exc()
            return False

    def _generate_coordinate_pdf(
        self,
        ocr_data: Dict,
        output_path: Path,
        source_file_path: Optional[Path],
        json_path: Path  # needed to resolve image paths stored relative to the JSON
    ) -> bool:
        """
        Mode A: coordinate placement.
        - Positions each element precisely via layout_bbox
        - Preserves the original document's visual appearance
        - For scenarios that need a faithful layout reproduction
        """
        from reportlab.pdfgen import canvas

        logger.info("Generating PDF in COORDINATE mode (layout-preserving)")
        # Extract the data
        layout_data = ocr_data.get('layout_data', {})
        elements = layout_data.get('elements', [])
        if not elements:
            logger.warning("No layout elements found")
            return False
        # Sort by page, then reading_order
        sorted_elements = sorted(elements, key=lambda x: (
            x.get('page', 0),
            x.get('reading_order', 0)
        ))
        # Compute page dimensions
        ocr_width, ocr_height = self.calculate_page_dimensions(ocr_data, source_file_path)
        target_width, target_height = self._get_target_dimensions(source_file_path, ocr_width, ocr_height)
        scale_w = target_width / ocr_width
        scale_h = target_height / ocr_height
        # Create the PDF canvas
        pdf_canvas = canvas.Canvas(str(output_path), pagesize=(target_width, target_height))
        # Group elements by page number
        pages = {}
        for elem in sorted_elements:
            page = elem.get('page', 0)
            if page not in pages:
                pages[page] = []
            pages[page].append(elem)
        # Render each page
        for page_num, page_elements in sorted(pages.items()):
            if page_num > 0:
                pdf_canvas.showPage()
            logger.info(f"Rendering page {page_num + 1} with {len(page_elements)} elements")
            # Render each element in reading order
            for elem in page_elements:
                bbox = elem.get('bbox', [])
                elem_type = elem.get('type')
                content = elem.get('content', '')
                if not bbox:
                    logger.warning(f"Element {elem['element_id']} has no bbox, skipping")
                    continue
                # Render by element type
                try:
                    if elem_type == 'table':
                        self._draw_table_at_bbox(pdf_canvas, content, bbox, target_height, scale_w, scale_h)
                    elif elem_type == 'text':
                        self._draw_text_at_bbox(pdf_canvas, content, bbox, target_height, scale_w, scale_h)
                    elif elem_type == 'title':
                        self._draw_title_at_bbox(pdf_canvas, content, bbox, target_height, scale_w, scale_h)
                    elif elem_type == 'image':
                        img_path = json_path.parent / content  # images live next to the JSON
                        if img_path.exists():
                            self._draw_image_at_bbox(pdf_canvas, str(img_path), bbox, target_height, scale_w, scale_h)
                    elif elem_type == 'formula':
                        self._draw_formula_at_bbox(pdf_canvas, content, bbox, target_height, scale_w, scale_h)
                    # ... other types
                except Exception as e:
                    logger.warning(f"Failed to draw {elem_type} element: {e}")
        pdf_canvas.save()
        logger.info(f"✅ Coordinate PDF generated: {output_path}")
        return True

    def _generate_flow_pdf(
        self,
        ocr_data: Dict,
        output_path: Path
    ) -> bool:
        """
        Mode B: flow layout.
        - Flows content in reading_order
        - Zero information loss (nothing is filtered out)
        - Uses the high-level ReportLab Platypus API
        - For scenarios that need translation or content processing
        """
        from reportlab.platypus import (
            SimpleDocTemplate, Paragraph, Spacer,
            Table, TableStyle, Image as RLImage, PageBreak
        )
        from reportlab.lib.styles import getSampleStyleSheet, ParagraphStyle
        from reportlab.lib import colors
        from reportlab.lib.enums import TA_LEFT, TA_CENTER

        logger.info("Generating PDF in FLOW mode (content-preserving)")
        # Extract the data
        layout_data = ocr_data.get('layout_data', {})
        elements = layout_data.get('elements', [])
        if not elements:
            logger.warning("No layout elements found")
            return False
        # Sort by reading_order
        sorted_elements = sorted(elements, key=lambda x: (
            x.get('page', 0),
            x.get('reading_order', 0)
        ))
        # Build the document
        doc = SimpleDocTemplate(str(output_path))
        story = []
        styles = getSampleStyleSheet()
        # Custom styles
        styles.add(ParagraphStyle(
            name='CustomTitle',
            parent=styles['Heading1'],
            fontSize=18,
            alignment=TA_CENTER,
            spaceAfter=12
        ))
        current_page = -1
        # Append elements in order
        for elem in sorted_elements:
            elem_type = elem.get('type')
            content = elem.get('content', '')
            page = elem.get('page', 0)
            # Page breaks
            if page != current_page and current_page != -1:
                story.append(PageBreak())
            current_page = page
            try:
                if elem_type == 'title':
                    story.append(Paragraph(content, styles['CustomTitle']))
                    story.append(Spacer(1, 12))
                elif elem_type == 'text':
                    story.append(Paragraph(content, styles['Normal']))
                    story.append(Spacer(1, 8))
                elif elem_type == 'table':
                    # Parse the HTML table into a ReportLab Table
                    table_obj = self._html_to_reportlab_table(content)
                    if table_obj:
                        story.append(table_obj)
                        story.append(Spacer(1, 12))
                elif elem_type == 'image':
                    # Embed the image
                    img_path = output_path.parent.parent / content
                    if img_path.exists():
                        img = RLImage(str(img_path), width=400, height=300, kind='proportional')
                        story.append(img)
                        story.append(Spacer(1, 12))
                elif elem_type == 'formula':
                    # Render formulas in a monospaced font
                    story.append(Paragraph(f"<font name='Courier'>{content}</font>", styles['Code']))
                    story.append(Spacer(1, 8))
            except Exception as e:
                logger.warning(f"Failed to add {elem_type} element to flow: {e}")
        # Build the PDF
        doc.build(story)
        logger.info(f"✅ Flow PDF generated: {output_path}")
        return True
```
---
## 🔧 Implementation Steps
### Phase 1: Engine layer refactor (2-3 hours)
1. **Create the PPStructureV3Engine singleton class**
   - File: `backend/app/engines/ppstructure_engine.py` (new)
   - Manages the PP-StructureV3 engine in one place
   - RTX 4060 8GB optimized configuration
2. **Create the AdvancedLayoutExtractor class**
   - File: `backend/app/services/advanced_layout_extractor.py` (new)
   - Implement `extract_complete_layout()`
   - Fully extract parsing_res_list, layout_bbox, layout_det_res
3. **Update OCRService**
   - Change `analyze_layout()` to use `AdvancedLayoutExtractor`
   - Stay backward compatible (fall back to the old logic)
### Phase 2: PDF generator refactor (3-4 hours)
1. **Refactor PDFGeneratorService**
   - Add a `mode` parameter
   - Implement `_generate_coordinate_pdf()`
   - Implement `_generate_flow_pdf()`
2. **Add helper methods**
   - `_draw_table_at_bbox()`: draw a table at given coordinates
   - `_draw_text_at_bbox()`: draw text at given coordinates
   - `_draw_title_at_bbox()`: draw a title at given coordinates
   - `_draw_formula_at_bbox()`: draw a formula at given coordinates
   - `_html_to_reportlab_table()`: convert HTML to a ReportLab Table
3. **Update the API endpoints**
   - `/tasks/{id}/download/pdf?mode=coordinate` (default)
   - `/tasks/{id}/download/pdf?mode=flow`
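A minimal sketch of validating the `mode` query parameter before dispatching to the two generators. The endpoint path and parameter name come from the plan above; the route decorator appears only as a comment because the web framework is not specified here, and `resolve_pdf_mode` is a hypothetical helper name.

```python
from typing import Optional

VALID_PDF_MODES = {"coordinate", "flow"}

def resolve_pdf_mode(mode: Optional[str]) -> str:
    """Return a safe PDF generation mode, defaulting to 'coordinate'."""
    if mode is None:
        return "coordinate"
    mode = mode.strip().lower()
    if mode not in VALID_PDF_MODES:
        raise ValueError(f"unsupported pdf mode: {mode!r}")
    return mode

# In the route handler this could be used roughly as:
# @router.get("/tasks/{task_id}/download/pdf")
# def download_pdf(task_id: str, mode: Optional[str] = None):
#     pdf_mode = resolve_pdf_mode(mode)
#     ...
```

Rejecting unknown values early keeps the default behavior intact for old clients that never send `mode`.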
### Phase 3: Testing and tuning (2-3 hours)
1. **Unit tests**
   - Test AdvancedLayoutExtractor
   - Test both PDF modes
   - Test backward compatibility
2. **Performance tests**
   - GPU memory monitoring
   - Throughput tests
   - Concurrent request tests
3. **Quality validation**
   - Coordinate accuracy
   - Reading-order correctness
   - Table recognition accuracy
---
## 📈 Expected Results
### Functional improvements
| Metric | Current | After refactor | Gain |
|------|-----|--------|------|
| bbox availability | 0% (all empty) | 100% | ✅ ∞ |
| Layout element categories | 2 | 23 | ✅ 11.5x |
| Reading order | None | Fully preserved | ✅ 100% |
| Information loss | 21.6% | 0% (flow mode) | ✅ 100% |
| PDF modes | 1 | 2 | ✅ 2x |
| Translation support | Difficult | Full | ✅ 100% |
### GPU usage optimization
Expected effect of the RTX 4060 8GB configuration:

| Config item | Current | After refactor |
|----------------|--------|--------|
| GPU utilization | ~30% | ~70% |
| Processing speed | 0.5 | 1.2 |
| Preprocessing features | off | all on |
| Recognition accuracy | ~85% | ~95% |
---
## 🎯 Migration Strategy
### Backward compatibility guarantees
1. **API level**
   - Keep every existing API endpoint
   - Add an optional `mode` parameter
   - Default behavior unchanged
2. **Data level**
   - Old JSON files remain usable
   - New fields do not affect old logic
   - Incremental updates
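One way to honor the data-level guarantee is to normalize elements on load, so that JSON written before the refactor (which lacks `bbox`, `reading_order`, or `layout_type`) flows through the new code paths unchanged. A sketch; `normalize_element` is a hypothetical helper:

```python
from typing import Dict

def normalize_element(elem: Dict, fallback_order: int) -> Dict:
    """Fill fields that pre-refactor JSON files lack, so newer rendering
    code can treat both formats uniformly. Existing values are preserved."""
    out = dict(elem)
    out.setdefault("bbox", [])              # old files had no per-element bbox
    out.setdefault("reading_order", fallback_order)
    out.setdefault("layout_type", "single")
    out.setdefault("page", 0)
    return out
```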
3. **Deployment strategy**
   - Deploy the new engine and services first
   - Enable new features gradually
   - Monitor performance and error rates
---
## 📝 Configuration Files
### requirements.txt updates
```txt
# Existing dependencies
paddlepaddle-gpu>=3.0.0
paddleocr>=3.0.0
# New dependencies
python-docx>=0.8.11     # Word document generation (optional)
PyMuPDF>=1.23.0         # enhanced PDF handling
beautifulsoup4>=4.12.0  # HTML parsing
lxml>=4.9.0             # faster XML/HTML parsing
```
### Environment variables
```bash
# Additions to .env.local
PADDLE_GPU_MEMORY=6144          # RTX 4060 8GB: keep 2GB for the system
PADDLE_USE_SERVER_MODEL=true
PADDLE_ENABLE_ALL_FEATURES=true
# Default PDF generation mode
PDF_DEFAULT_MODE=coordinate     # or flow
```
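A sketch of reading these variables with the documented defaults. The variable names are the ones listed above, while `load_paddle_settings` itself is a hypothetical helper:

```python
import os

def load_paddle_settings() -> dict:
    """Read the environment variables from .env.local with their
    documented defaults; values arrive as strings and are parsed here."""
    return {
        "gpu_memory": int(os.getenv("PADDLE_GPU_MEMORY", "6144")),
        "use_server_model": os.getenv("PADDLE_USE_SERVER_MODEL", "true").lower() == "true",
        "enable_all_features": os.getenv("PADDLE_ENABLE_ALL_FEATURES", "true").lower() == "true",
        "pdf_default_mode": os.getenv("PDF_DEFAULT_MODE", "coordinate"),
    }
```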
---
## 🚀 Implementation Priorities
### P0 (implement immediately)
1. ✅ PPStructureV3Engine unified engine
2. ✅ AdvancedLayoutExtractor complete extraction
3. ✅ Coordinate-placement PDF mode
### P1 (second stage)
4. ⭐ Flow-layout PDF mode
5. ⭐ API endpoint updates (`mode` parameter)
### P2 (optimization stage)
6. Performance monitoring and tuning
7. Batch processing support
8. Quality-check tooling
---
## ⚠️ Risks and Mitigations
### Risk 1: GPU out-of-memory
**Mitigation**:
- Set `gpu_mem=6144` conservatively (keep 2GB free)
- Add memory monitoring
- Process large documents in batches
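For the monitoring bullet, one framework-agnostic option is to poll `nvidia-smi`, which ships with the RTX 4060 driver. The parsing is split out so it can be tested without a GPU; treat this as a sketch, not the plan's chosen mechanism:

```python
import subprocess

def gpu_memory_used_mib(smi_output: str) -> int:
    """Parse the used-memory figure from
    `nvidia-smi --query-gpu=memory.used --format=csv,noheader,nounits`,
    which prints one integer per GPU in MiB. Only the first GPU is read."""
    first_line = smi_output.strip().splitlines()[0]
    return int(first_line.strip())

def query_gpu_memory_used() -> int:
    """Run nvidia-smi and return used VRAM in MiB for the first GPU."""
    out = subprocess.check_output(
        ["nvidia-smi", "--query-gpu=memory.used", "--format=csv,noheader,nounits"],
        text=True,
    )
    return gpu_memory_used_mib(out)
```

A periodic call to `query_gpu_memory_used()` could feed a warning log when usage approaches the 6144 MiB budget.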
### Risk 2: Slower processing
**Mitigation**:
- Server models are faster than Mobile models on GPU
- Process pages in parallel
- Cache results
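The result-cache bullet could be as simple as keying OCR output by a content hash, so re-uploads of the same file skip the pipeline entirely. A sketch, with a hypothetical `run_ocr` callable standing in for the real engine:

```python
import hashlib
from pathlib import Path
from typing import Callable, Dict

_ocr_cache: Dict[str, dict] = {}

def file_digest(path: Path) -> str:
    """SHA-256 of the file contents, used as the cache key."""
    return hashlib.sha256(path.read_bytes()).hexdigest()

def cached_ocr(path: Path, run_ocr: Callable[[Path], dict]) -> dict:
    """Return a cached OCR result for unchanged files; run the pipeline
    only when this content hash has not been seen before."""
    key = file_digest(path)
    if key not in _ocr_cache:
        _ocr_cache[key] = run_ocr(path)
    return _ocr_cache[key]
```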
### Risk 3: Backward compatibility breakage
**Mitigation**:
- Keep the old logic as a fallback
- Migrate gradually
- Full test coverage
---
**Estimated total development time**: 7-10 hours
**Expected outcome**: 100% utilization of PP-StructureV3 + zero information loss + full translation support
Which stage would you like implemented first?


@@ -0,0 +1,691 @@
# PP-StructureV3 Complete Layout Information Utilization Plan
## 📋 Executive Summary
### Problem diagnosis
The current implementation **severely underuses PP-StructureV3**: it reads only the `page_result.markdown` attribute and ignores the core layout information in `page_result.json`.
### Key findings
1. **PP-StructureV3 provides complete layout parsing information**, including:
   - `parsing_res_list`: layout elements in reading order
   - `layout_bbox`: precise coordinates for every element
   - `layout_det_res`: layout detection results (region type, confidence)
   - `overall_ocr_res`: full OCR results (a bbox for every text line)
   - `layout`: layout type (single/double/multi-column)
2. **Flaws in the current implementation**
```python
# ❌ Current approach (ocr_service.py:615-646)
markdown_dict = page_result.markdown  # only fetches markdown and images
markdown_texts = markdown_dict.get('markdown_texts', '')
# bbox is set to an empty list
'bbox': [],  # PP-StructureV3 doesn't provide individual bbox in this format
```
3. **What it should do instead**
```python
# ✅ Correct approach
json_data = page_result.json  # fetch the full structured information
parsing_list = json_data.get('parsing_res_list', [])  # reading order + bbox
layout_det = json_data.get('layout_det_res', {})      # layout detection
overall_ocr = json_data.get('overall_ocr_res', {})    # coordinates for all text
```
---
## 🎯 Planning Goals
### Phase 1: Extract the complete layout information (high priority)
**Goal**: change `analyze_layout()` to use PP-StructureV3's full capabilities
**Expected results**:
- ✅ Every layout element has a precise `layout_bbox`
- ✅ The original reading order is preserved (the order of `parsing_res_list`)
- ✅ Layout type information is captured (single/double column)
- ✅ Region categories are extracted (text/table/figure/title/formula)
- ✅ Zero information loss (no need to filter overlapping text)
### Phase 2: Implement dual-mode PDF generation (medium priority)
**Goal**: provide two PDF generation modes
**Mode A: precise coordinate placement**
- Positions each element precisely via `layout_bbox`
- Preserves the original document's visual appearance
- For scenarios that need a faithful layout reproduction
**Mode B: flow layout**
- Flows content in `parsing_res_list` order
- Uses the high-level ReportLab Platypus API
- Zero information loss; all content is searchable
- For scenarios that need translation or content processing
### Phase 3: Multi-column layout handling (low priority)
**Goal**: exploit PP-StructureV3's multi-column detection
---
## 📊 The Complete PP-StructureV3 Data Structure
### 1. Full structure of `page_result.json`
```python
{
    # Basic information
    "input_path": str,   # source file path
    "page_index": int,   # page number (PDF only)

    # Layout detection result
    "layout_det_res": {
        "boxes": [
            {
                "cls_id": int,    # class ID
                "label": str,     # region type: text/table/figure/title/formula/seal
                "score": float,   # confidence 0-1
                "coordinate": [x1, y1, x2, y2]  # rectangle coordinates
            },
            ...
        ]
    },

    # Full OCR result
    "overall_ocr_res": {
        "dt_polys": np.ndarray,   # text detection polygons
        "rec_polys": np.ndarray,  # text recognition polygons
        "rec_boxes": np.ndarray,  # text recognition boxes (n, 4, 2) int16
        "rec_texts": List[str],   # recognized text
        "rec_scores": np.ndarray  # recognition confidence
    },

    # **Core layout parsing result (in reading order)**
    "parsing_res_list": [
        {
            "layout_bbox": np.ndarray,  # region bounding box [x1, y1, x2, y2]
            "layout": str,              # layout type: single/double/multi-column
            "text": str,                # text content (if a text region)
            "table": str,               # table HTML (if a table region)
            "image": str,               # image path (if an image region)
            "formula": str,             # formula LaTeX (if a formula region)
            # ... other region types
        },
        ...  # list order == reading order
    ],

    # Text-paragraph OCR (in reading order)
    "text_paragraphs_ocr_res": {
        "rec_polys": np.ndarray,
        "rec_texts": List[str],
        "rec_scores": np.ndarray
    },

    # Optional module results
    "formula_res_region1": {...},  # formula recognition result
    "table_cell_img": {...},       # table cell images
    "seal_res_region1": {...}      # seal recognition result
}
```
### 2. Key fields
| Field | Purpose | Format | Importance |
|------|------|---------|--------|
| `parsing_res_list` | **Core data**: all layout elements in reading order | List[Dict] | ⭐⭐⭐⭐⭐ |
| `layout_bbox` | Precise coordinates for each element | np.ndarray [x1,y1,x2,y2] | ⭐⭐⭐⭐⭐ |
| `layout` | Layout type (single/double/multi-column) | str: single/double/multi | ⭐⭐⭐⭐ |
| `layout_det_res` | Detailed layout detection (region classification) | Dict with boxes list | ⭐⭐⭐⭐ |
| `overall_ocr_res` | OCR results and coordinates for all text | Dict with np.ndarray | ⭐⭐⭐⭐ |
| `markdown` | Simplified Markdown output | Dict with texts/images | ⭐⭐ |
---
## 🔧 Implementation Plan
### Task 1: Refactor the `analyze_layout()` function
**File**: `/backend/app/services/ocr_service.py`
**Scope**: lines 590-710
**Core changes**:
```python
def analyze_layout(self, image_path: Path, output_dir: Optional[Path] = None, current_page: int = 0) -> Tuple[Optional[Dict], List[Dict]]:
    """
    Analyze document layout using PP-StructureV3 (using the full JSON output).
    """
    try:
        structure_engine = self.get_structure_engine()
        results = structure_engine.predict(str(image_path))
        layout_elements = []
        images_metadata = []
        for page_idx, page_result in enumerate(results):
            # ✅ Change 1: use the full JSON data instead of just markdown
            json_data = page_result.json
            # ✅ Change 2: extract the layout detection result
            layout_det_res = json_data.get('layout_det_res', {})
            layout_boxes = layout_det_res.get('boxes', [])
            # ✅ Change 3: extract the core parsing_res_list (reading order + bbox)
            parsing_res_list = json_data.get('parsing_res_list', [])
            if parsing_res_list:
                # *** Core logic: consume parsing_res_list ***
                for idx, item in enumerate(parsing_res_list):
                    # Extract the bbox (no longer an empty list!)
                    bbox = []
                    layout_bbox = item.get('layout_bbox')
                    if layout_bbox is not None:
                        # Convert numpy arrays to a standard list
                        bbox_vals = layout_bbox.tolist() if hasattr(layout_bbox, 'tolist') else list(layout_bbox)
                        # Convert to 4-point format: [[x1,y1], [x2,y1], [x2,y2], [x1,y2]]
                        if len(bbox_vals) == 4:  # [x1, y1, x2, y2]
                            x1, y1, x2, y2 = bbox_vals
                            bbox = [[x1, y1], [x2, y1], [x2, y2], [x1, y2]]
                    # Extract the layout type
                    layout_type = item.get('layout', 'single')
                    # Create the element (with all the information)
                    element = {
                        'element_id': idx,
                        'page': current_page,
                        'bbox': bbox,               # ✅ no longer an empty list!
                        'layout_type': layout_type, # ✅ new: layout type
                        'reading_order': idx,       # ✅ new: reading order
                    }
                    # Fill content based on element kind
                    if 'table' in item:
                        element['type'] = 'table'
                        element['content'] = item['table']
                        # Extract plain table text (for translation)
                        element['extracted_text'] = self._extract_table_text(item['table'])
                    elif 'text' in item:
                        element['type'] = 'text'
                        element['content'] = item['text']
                    elif 'figure' in item or 'image' in item:
                        element['type'] = 'image'
                        element['content'] = item.get('figure') or item.get('image')
                    elif 'formula' in item:
                        element['type'] = 'formula'
                        element['content'] = item['formula']
                    elif 'title' in item:
                        element['type'] = 'title'
                        element['content'] = item['title']
                    else:
                        # Unknown type: record the first non-system field
                        for key, value in item.items():
                            if key not in ['layout_bbox', 'layout']:
                                element['type'] = key
                                element['content'] = value
                                break
                    layout_elements.append(element)
            else:
                # Fall back to markdown parsing (backward compatibility)
                logger.warning("No parsing_res_list found, falling back to markdown parsing")
                markdown_dict = page_result.markdown
                # ... existing markdown parsing logic ...
            # ✅ Change 4: still save extracted images to disk
            markdown_dict = page_result.markdown
            markdown_images = markdown_dict.get('markdown_images', {})
            for img_idx, (img_path, img_obj) in enumerate(markdown_images.items()):
                # Save the image to disk
                try:
                    base_dir = output_dir if output_dir else image_path.parent
                    full_img_path = base_dir / img_path
                    full_img_path.parent.mkdir(parents=True, exist_ok=True)
                    if hasattr(img_obj, 'save'):
                        img_obj.save(str(full_img_path))
                        logger.info(f"Saved extracted image to {full_img_path}")
                except Exception as e:
                    logger.warning(f"Failed to save image {img_path}: {e}")
                # Extract the bbox (from the filename, or by matching parsing_res_list)
                bbox = self._find_image_bbox(img_path, parsing_res_list, layout_boxes)
                images_metadata.append({
                    'element_id': len(layout_elements) + img_idx,
                    'image_path': img_path,
                    'type': 'image',
                    'page': current_page,
                    'bbox': bbox,
                })
        if layout_elements:
            layout_data = {
                'elements': layout_elements,
                'total_elements': len(layout_elements),
                'reading_order': [e['reading_order'] for e in layout_elements],  # ✅ keep reading order
                'layout_types': list(set(e.get('layout_type') for e in layout_elements)),  # ✅ layout type stats
            }
            logger.info(f"Detected {len(layout_elements)} layout elements (with bbox and reading order)")
            return layout_data, images_metadata
        else:
            logger.warning("No layout elements detected")
            return None, []
    except Exception as e:
        import traceback
        logger.error(f"Layout analysis error: {str(e)}\n{traceback.format_exc()}")
        return None, []

def _find_image_bbox(self, img_path: str, parsing_res_list: List[Dict], layout_boxes: List[Dict]) -> List:
    """
    Look up an image's bbox in parsing_res_list or layout_det_res.
    """
    # Method 1: parse it out of the filename (current approach)
    import re
    match = re.search(r'box_(\d+)_(\d+)_(\d+)_(\d+)', img_path)
    if match:
        x1, y1, x2, y2 = map(int, match.groups())
        return [[x1, y1], [x2, y1], [x2, y2], [x1, y2]]
    # Method 2: match against parsing_res_list (if it carries image paths)
    for item in parsing_res_list:
        if 'image' in item or 'figure' in item:
            content = item.get('image') or item.get('figure')
            if img_path in str(content):
                bbox = item.get('layout_bbox')
                if bbox is not None:
                    bbox_list = bbox.tolist() if hasattr(bbox, 'tolist') else list(bbox)
                    if len(bbox_list) == 4:
                        x1, y1, x2, y2 = bbox_list
                        return [[x1, y1], [x2, y1], [x2, y2], [x1, y2]]
    # Method 3: match against layout_det_res (by label)
    for box in layout_boxes:
        if box.get('label') in ['figure', 'image']:
            coord = box.get('coordinate', [])
            if len(coord) == 4:
                x1, y1, x2, y2 = coord
                return [[x1, y1], [x2, y1], [x2, y2], [x1, y2]]
    logger.warning(f"Could not find bbox for image {img_path}")
    return []
```
---
### Task 2: Update the PDF generator to use the new information
**File**: `/backend/app/services/pdf_generator_service.py`
**Core changes**:
1. **Remove the text-filtering logic** (no longer needed!)
   - `parsing_res_list` is already in reading order
   - Tables/images and text each occupy their own regions
   - There is no overlap problem
2. **Render elements by `reading_order`**
```python
def generate_layout_pdf(self, json_path: Path, output_path: Path, mode: str = 'coordinate') -> bool:
    """
    mode: 'coordinate' or 'flow'
    """
    # Load the data
    ocr_data = self.load_ocr_json(json_path)
    layout_data = ocr_data.get('layout_data', {})
    elements = layout_data.get('elements', [])
    if mode == 'coordinate':
        # Mode A: coordinate placement
        return self._generate_coordinate_pdf(elements, output_path, ocr_data)
    else:
        # Mode B: flow layout
        return self._generate_flow_pdf(elements, output_path, ocr_data)

def _generate_coordinate_pdf(self, elements: List[Dict], output_path: Path, ocr_data: Dict) -> bool:
    """Coordinate-placement mode: faithful layout reproduction."""
    # Sort elements by reading_order
    sorted_elements = sorted(elements, key=lambda x: x.get('reading_order', 0))
    # Group by page number
    pages = {}
    for elem in sorted_elements:
        page = elem.get('page', 0)
        if page not in pages:
            pages[page] = []
        pages[page].append(elem)
    # Render each page
    # (canvas creation, page sizing and scale computation elided in this sketch)
    for page_num, page_elements in sorted(pages.items()):
        for elem in page_elements:
            bbox = elem.get('bbox', [])
            elem_type = elem.get('type')
            content = elem.get('content', '')
            if not bbox:
                logger.warning(f"Element {elem['element_id']} has no bbox, skipping")
                continue
            # Render at the exact coordinates
            if elem_type == 'table':
                self.draw_table_at_bbox(pdf_canvas, content, bbox, page_height, scale_w, scale_h)
            elif elem_type == 'text':
                self.draw_text_at_bbox(pdf_canvas, content, bbox, page_height, scale_w, scale_h)
            elif elem_type == 'image':
                self.draw_image_at_bbox(pdf_canvas, content, bbox, page_height, scale_w, scale_h)
            # ... other types
    return True

def _generate_flow_pdf(self, elements: List[Dict], output_path: Path, ocr_data: Dict) -> bool:
    """Flow-layout mode: zero information loss."""
    from reportlab.platypus import SimpleDocTemplate, Paragraph, Table, Image, Spacer
    from reportlab.lib.styles import getSampleStyleSheet
    # Sort elements by reading_order
    sorted_elements = sorted(elements, key=lambda x: x.get('reading_order', 0))
    # Build the story (flowable content)
    story = []
    styles = getSampleStyleSheet()
    for elem in sorted_elements:
        elem_type = elem.get('type')
        content = elem.get('content', '')
        if elem_type == 'title':
            story.append(Paragraph(content, styles['Title']))
        elif elem_type == 'text':
            story.append(Paragraph(content, styles['Normal']))
        elif elem_type == 'table':
            # Parse the HTML table into a ReportLab Table
            table_obj = self._html_to_reportlab_table(content)
            story.append(table_obj)
        elif elem_type == 'image':
            # Embed the image (resolved relative to the output directory in this sketch)
            img_path = output_path.parent / content
            if img_path.exists():
                story.append(Image(str(img_path), width=400, height=300))
        story.append(Spacer(1, 12))  # spacing
    # Build the PDF
    doc = SimpleDocTemplate(str(output_path))
    doc.build(story)
    return True
```
---
## 📈 Expected Improvement
### Current vs new implementation
| Metric | Current ❌ | New ✅ | Improvement |
|------|-----------|----------|------|
| **bbox info** | empty list `[]` | precise coords `[x1,y1,x2,y2]` | ✅ 100% |
| **Reading order** | none (mixed HTML) | `reading_order` field | ✅ 100% |
| **Layout type** | none | `layout_type` (single/double) | ✅ 100% |
| **Element classification** | naive `<table` check | precise classification (9+ types) | ✅ 100% |
| **Information loss** | 21.6% of text filtered out | 0% loss (flow mode) | ✅ 100% |
| **Coordinate precision** | only some image bboxes | bbox for every element | ✅ 100% |
| **PDF modes** | coordinate only | dual (coordinate + flow) | ✅ new feature |
| **Translation support** | hard (lossy) | full (lossless) | ✅ 100% |
### Concrete improvements
#### 1. Zero information loss
```python
# ❌ Current: 342 text regions → 268 after filtering = 74 lost (21.6%)
filtered_text_regions = self._filter_text_in_regions(text_regions, regions_to_avoid)
# ✅ New: no filtering needed, consume parsing_res_list directly.
# Every element (text, table, image) has its own region; nothing overlaps.
for elem in sorted(elements, key=lambda x: x['reading_order']):
    render_element(elem)  # render everything, zero loss
```
#### 2. Precise bboxes
```python
# ❌ Current: bbox is an empty list
{
    'element_id': 0,
    'type': 'table',
    'bbox': [],  # ← cannot be positioned!
}
# ✅ New: precise coordinates taken from layout_bbox
{
    'element_id': 0,
    'type': 'table',
    'bbox': [[770, 776], [1122, 776], [1122, 1058], [770, 1058]],  # ← precisely positioned!
    'reading_order': 3,
    'layout_type': 'single'
}
```
#### 3. Reading order
```python
# ❌ Current: no guarantee of correct reading order;
# tables, images and text are jumbled together.
# ✅ New: the order of parsing_res_list == the reading order
elements = sorted(elements, key=lambda x: x['reading_order'])
# Elements render as reading_order 0, 1, 2, 3, ...
# perfectly preserving the document's logical order.
```
---
## 🚀 Implementation Steps
### Phase 1: Core refactor (2-3 hours)
1. **Modify the `analyze_layout()` function**
   - Extract `parsing_res_list` from `page_result.json`
   - Use `layout_bbox` as each element's bbox
   - Preserve `reading_order`
   - Extract `layout_type`
   - Verify the output JSON structure
2. **Add helper functions**
   - `_find_image_bbox()`: look up an image bbox from multiple sources
   - `_convert_bbox_format()`: normalize the bbox format
   - `_extract_element_content()`: extract content by element type
3. **Test and validate**
   - Re-run OCR on the existing test documents
   - Check that the generated JSON contains bboxes
   - Verify that reading_order is correct
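Step 3's checks can be automated with a small validator run over the generated JSON. A sketch: the field names match the element schema above, while `validate_layout_data` is a hypothetical helper:

```python
from typing import Dict, List

def validate_layout_data(layout_data: Dict) -> List[str]:
    """Return a list of problems found in extracted layout JSON:
    missing bboxes and out-of-order reading_order values.
    An empty list means the checks passed."""
    problems = []
    elements = layout_data.get("elements", [])
    for elem in elements:
        if not elem.get("bbox"):
            problems.append(f"element {elem.get('element_id')} has empty bbox")
    orders = [e.get("reading_order", 0) for e in elements]
    if orders != sorted(orders):
        problems.append("reading_order is not monotonically increasing")
    return problems
```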
### Phase 2: PDF generation improvements (2-3 hours)
1. **Implement the coordinate-placement mode**
   - Remove the text-filtering logic
   - Render each element precisely by its bbox
   - Order same-page elements by reading_order
2. **Implement the flow-layout mode**
   - Use ReportLab Platypus
   - Build the story in reading_order
   - Implement flow rendering for each element type
3. **Add the API parameter**
   - `/tasks/{id}/download/pdf?mode=coordinate` (default)
   - `/tasks/{id}/download/pdf?mode=flow`
### Phase 3: Testing and polish (1-2 hours)
1. **Full testing**
   - Single-page documents
   - Multi-page PDFs
   - Multi-column layouts
   - Complex tables
2. **Performance tuning**
   - Avoid redundant computation
   - Optimize bbox conversion
   - Cache results
3. **Documentation updates**
   - Update the API docs
   - Add usage examples
   - Update the architecture diagram
---
## 💡 Key Technical Details
### 1. Numpy array handling
```python
# layout_bbox is a numpy.ndarray and must be converted to a standard format
layout_bbox = item.get('layout_bbox')
if hasattr(layout_bbox, 'tolist'):
    bbox = layout_bbox.tolist()  # [x1, y1, x2, y2]
else:
    bbox = list(layout_bbox)
# Convert to 4-point format
x1, y1, x2, y2 = bbox
bbox_4point = [[x1, y1], [x2, y1], [x2, y2], [x1, y2]]
```
### 2. Layout type handling
```python
# Adjust the rendering strategy based on layout_type
layout_type = elem.get('layout_type', 'single')
if layout_type == 'double':
    # Two-column layout: may need special handling
    pass
elif layout_type == 'multi':
    # Multi-column layout: more complex handling
    pass
```
### 3. Reading-order guarantee
```python
# Make sure elements render in the correct order
elements = layout_data.get('elements', [])
sorted_elements = sorted(elements, key=lambda x: (
    x.get('page', 0),          # page number first
    x.get('reading_order', 0)  # then reading order
))
```
---
## ⚠️ Risks and Mitigations
### Risk 1: Backward compatibility
**Problem**: old JSON files lack the new fields
**Mitigation**:
```python
# Add fallback logic inside analyze_layout()
parsing_res_list = json_data.get('parsing_res_list', [])
if not parsing_res_list:
    logger.warning("No parsing_res_list, using markdown fallback")
    # use the old markdown parsing logic
```
### 風險 2: PaddleOCR 版本差異
**問題**: 不同版本的 PaddleOCR 可能輸出格式不同
**緩解措施**:
- 記錄 PaddleOCR 版本到 JSON
- 添加版本檢測邏輯
- 提供多版本支援
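記錄版本的部分可以是一個小工具(示意;以 `importlib.metadata` 讀取已安裝的 paddleocr 版本,函數名稱為假設):

```python
import json
from importlib import metadata


def save_result_with_version(result: dict, path: str) -> None:
    """將 PaddleOCR 版本一併寫入結果 JSON供日後的版本偵測邏輯使用。"""
    try:
        version = metadata.version("paddleocr")
    except metadata.PackageNotFoundError:
        version = "unknown"
    result["paddleocr_version"] = version
    with open(path, "w", encoding="utf-8") as f:
        json.dump(result, f, ensure_ascii=False)
```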
### 風險 3: 效能影響
**問題**: 提取更多資訊可能增加處理時間
**緩解措施**:
- 只在需要時提取詳細資訊
- 使用快取
- 並行處理多頁
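多頁並行可以用 `concurrent.futures` 簡單示意(草稿;實際上 GPU 推論通常仍需序列化執行,這裡較適合 bbox 轉換等 CPU 後處理):

```python
from concurrent.futures import ThreadPoolExecutor


def process_pages_parallel(pages, process_page, max_workers: int = 4):
    """並行處理多頁,輸出順序與輸入一致。"""
    with ThreadPoolExecutor(max_workers=max_workers) as pool:
        return list(pool.map(process_page, pages))
```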
---
## 📝 TODO Checklist
### 階段 1: 核心重構
- [ ] 修改 `analyze_layout()` 使用 `page_result.json`
- [ ] 提取 `parsing_res_list`
- [ ] 提取 `layout_bbox` 並轉換格式
- [ ] 保留 `reading_order`
- [ ] 提取 `layout_type`
- [ ] 實作 `_find_image_bbox()`
- [ ] 添加回退邏輯(向後相容)
- [ ] 測試新 JSON 輸出結構
### 階段 2: PDF 生成優化
- [ ] 實作 `_generate_coordinate_pdf()`
- [ ] 實作 `_generate_flow_pdf()`
- [ ] 移除舊的文字過濾邏輯
- [ ] 添加 mode 參數到 API
- [ ] 實作 HTML 表格解析器(用於流式模式)
- [ ] 測試兩種模式的 PDF 輸出
### 階段 3: 測試與文檔
- [ ] 單頁文件測試
- [ ] 多頁 PDF 測試
- [ ] 複雜版面測試(多欄、表格密集)
- [ ] 效能測試
- [ ] 更新 API 文檔
- [ ] 更新使用說明
- [ ] 創建遷移指南
---
## 🎓 學習資源
1. **PaddleOCR 官方文檔**
- [PP-StructureV3 Usage Tutorial](http://www.paddleocr.ai/main/en/version3.x/pipeline_usage/PP-StructureV3.html)
- [PaddleX PP-StructureV3](https://paddlepaddle.github.io/PaddleX/3.0/en/pipeline_usage/tutorials/ocr_pipelines/PP-StructureV3.html)
2. **ReportLab 文檔**
- [Platypus User Guide](https://www.reportlab.com/docs/reportlab-userguide.pdf)
- [Table Styling](https://www.reportlab.com/docs/reportlab-userguide.pdf#page=80)
3. **參考實作**
- PaddleOCR GitHub: `/paddlex/inference/pipelines/layout_parsing/pipeline_v2.py`
---
## 🏁 成功標準
### 必須達成
- 所有版面元素都有精確的 bbox
- 閱讀順序正確保留
- 零資訊損失(流式模式)
- 向後相容(舊 JSON 仍可用)
### 期望達成
- 雙模式 PDF 生成(座標 + 流式)
- 多欄版面正確處理
- 翻譯功能支援(表格文字可提取)
- 效能無明顯下降
### 附加目標
- 支援更多元素類型(公式、印章)
- 版面類型統計和分析
- 視覺化版面結構
---
**規劃完成時間**: 2025-01-18
**預計開發時間**: 5-8 小時
**優先級**: P0 (最高優先級)

View File

@@ -0,0 +1,276 @@
# Technical Design: Dual-track Document Processing
## Context
### Background
The current OCR tool processes all documents through PaddleOCR, even when dealing with editable PDFs that contain extractable text. This causes:
- Unnecessary processing overhead
- Potential quality degradation from re-OCRing already digital text
- Loss of precise formatting information
- Inefficient GPU usage on documents that don't need OCR
### Constraints
- RTX 4060 8GB GPU memory limitation
- Need to maintain backward compatibility with existing API
- Must support future translation features
- Should handle mixed documents (partially scanned, partially digital)
### Stakeholders
- API consumers expecting consistent JSON/PDF output
- Translation system requiring structure preservation
- Performance-sensitive deployments
## Goals / Non-Goals
### Goals
- Intelligently route documents to appropriate processing track
- Preserve document structure for translation
- Optimize GPU usage by avoiding unnecessary OCR
- Maintain unified output format across tracks
- Reduce processing time for editable PDFs by 70%+
### Non-Goals
- Implementing the actual translation engine (future phase)
- Supporting video or audio transcription
- Real-time collaborative editing
- OCR model training or fine-tuning
## Decisions
### Decision 1: Dual-track Architecture
**What**: Implement two separate processing pipelines - OCR track and Direct extraction track
**Why**:
- Editable PDFs don't need OCR, can be processed 10-100x faster
- Direct extraction preserves exact formatting and fonts
- OCR track remains optimal for scanned documents
**Alternatives considered**:
1. **Single enhanced OCR pipeline**: Would still waste resources on editable PDFs
2. **Hybrid approach per page**: Too complex, most documents are uniformly editable or scanned
3. **Multiple specialized pipelines**: Over-engineering for current requirements
### Decision 2: UnifiedDocument Model
**What**: Create a standardized intermediate representation for both tracks
**Why**:
- Provides consistent API interface regardless of processing track
- Simplifies downstream processing (PDF generation, translation)
- Enables track switching without breaking changes
**Structure**:
```python
@dataclass
class UnifiedDocument:
document_id: str
metadata: DocumentMetadata
pages: List[Page]
processing_track: Literal["ocr", "direct"]
@dataclass
class Page:
page_number: int
elements: List[DocumentElement]
dimensions: Dimensions
@dataclass
class DocumentElement:
element_id: str
type: ElementType # text, table, image, header, etc.
content: Union[str, Dict, bytes]
bbox: BoundingBox
style: Optional[StyleInfo]
confidence: Optional[float] # Only for OCR track
```
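A minimal runnable sketch of the model, with the referenced helper types (`BoundingBox`, `ElementType`, `DocumentMetadata`, etc.) simplified to plain fields for illustration:

```python
from dataclasses import dataclass, field, asdict
from typing import List, Literal, Optional


@dataclass
class BoundingBox:
    x1: float
    y1: float
    x2: float
    y2: float


@dataclass
class DocumentElement:
    element_id: str
    type: str            # simplified stand-in for ElementType
    content: str
    bbox: BoundingBox
    confidence: Optional[float] = None  # only meaningful on the OCR track


@dataclass
class Page:
    page_number: int
    elements: List[DocumentElement] = field(default_factory=list)


@dataclass
class UnifiedDocument:
    document_id: str
    pages: List[Page]
    processing_track: Literal["ocr", "direct"]


doc = UnifiedDocument(
    document_id="doc-001",
    pages=[Page(1, [DocumentElement("e0", "text", "Hello", BoundingBox(0, 0, 100, 20), 0.98)])],
    processing_track="ocr",
)
print(asdict(doc)["processing_track"])  # → ocr
```

Because every layer is a dataclass, `asdict()` gives the JSON-serializable form for free, which keeps the export path identical for both tracks.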
### Decision 3: PyMuPDF for Direct Extraction
**What**: Use PyMuPDF (fitz) library for editable PDF processing
**Why**:
- Mature, well-maintained library
- Excellent coordinate preservation
- Fast C++ backend
- Supports text, tables, and image extraction with positions
**Alternatives considered**:
1. **pdfplumber**: Good but slower, less precise coordinates
2. **PyPDF2**: Limited layout information
3. **PDFMiner**: Complex API, slower performance
### Decision 4: Processing Track Auto-detection
**What**: Automatically determine optimal track based on document analysis
**Detection logic**:
```python
from pathlib import Path

import fitz  # PyMuPDF
import magic

def detect_track(file_path: Path) -> str:
    file_type = magic.from_file(str(file_path), mime=True)
    if file_type.startswith('image/'):
        return "ocr"
    if file_type == 'application/pdf':
        # Check whether the PDF has enough extractable text
        doc = fitz.open(file_path)
        try:
            sample_text = "".join(
                doc[i].get_text() for i in range(min(3, doc.page_count))
            )
        finally:
            doc.close()
        if len(sample_text.strip()) < 100:  # effectively no embedded text
            return "ocr"
        return "direct"
    if file_type in OFFICE_MIMES:
        return "ocr"  # for now, may add direct Office support later
    return "ocr"  # default fallback
```
### Decision 5: GPU Memory Management
**What**: Implement dynamic batch sizing and model caching for RTX 4060 8GB
**Why**:
- Prevents OOM errors
- Maximizes throughput
- Enables concurrent request handling
**Strategy**:
```python
# Adaptive batch sizing based on available memory
batch_size = calculate_batch_size(
available_memory=get_gpu_memory(),
image_size=image.shape,
model_size=MODEL_MEMORY_REQUIREMENTS
)
# Model caching to avoid reload overhead
@lru_cache(maxsize=2)
def get_model(model_type: str):
return load_model(model_type)
```
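The adaptive batch sizing above could be sketched as a simple heuristic (illustrative only; the safety margin, per-image cost, and `max_batch` cap are assumptions, not measured values):

```python
def calculate_batch_size(available_memory_mb: float,
                         per_image_mb: float,
                         model_mb: float,
                         safety_margin: float = 0.8,
                         max_batch: int = 16) -> int:
    """Estimate a safe OCR batch size for an 8GB-class GPU.

    Reserves a safety margin, subtracts resident model memory, then
    divides the remainder by the estimated per-image footprint.
    """
    usable = available_memory_mb * safety_margin - model_mb
    if usable <= 0:
        return 1  # fall back to single-image processing
    return max(1, min(max_batch, int(usable // per_image_mb)))
```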
### Decision 6: Backward Compatibility
**What**: Maintain existing API while adding new capabilities
**How**:
- Existing endpoints continue working unchanged
- New `processing_track` parameter is optional
- Output format compatible with current consumers
- Gradual migration path for clients
## Risks / Trade-offs
### Risk 1: Mixed Content Documents
**Risk**: Documents with both scanned and digital pages
**Mitigation**:
- Page-level track detection as fallback
- Confidence scoring to identify uncertain pages
- Manual override option via API
### Risk 2: Direct Extraction Quality
**Risk**: Some PDFs have poor internal structure
**Mitigation**:
- Fallback to OCR track if extraction quality is low
- Quality metrics: text density, structure coherence
- User-reportable quality issues
### Risk 3: Memory Pressure
**Risk**: RTX 4060 8GB limitation with concurrent requests
**Mitigation**:
- Request queuing system
- Dynamic batch adjustment
- CPU fallback for overflow
### Trade-off 1: Processing Time vs Accuracy
- Direct extraction: Fast but depends on PDF quality
- OCR: Slower but consistent quality
- **Decision**: Prioritize speed for editable PDFs, accuracy for scanned
### Trade-off 2: Complexity vs Flexibility
- Two tracks increase system complexity
- But enable optimal processing per document type
- **Decision**: Accept complexity for 10x+ performance gains
## Migration Plan
### Phase 1: Infrastructure (Week 1-2)
1. Deploy UnifiedDocument model
2. Implement DocumentTypeDetector
3. Add DirectExtractionEngine
4. Update logging and monitoring
### Phase 2: Integration (Week 3)
1. Update OCR service with routing logic
2. Modify PDF generator for unified model
3. Add new API endpoints
4. Deploy to staging
### Phase 3: Validation (Week 4)
1. A/B testing with subset of traffic
2. Performance benchmarking
3. Quality validation
4. Client integration testing
### Rollback Plan
1. Feature flag to disable dual-track
2. Fallback all requests to OCR track
3. Maintain old code paths during transition
4. Database migration reversible
## Open Questions
### Resolved
- Q: Should we support page-level track mixing?
- A: No, adds complexity with minimal benefit. Document-level is sufficient.
- Q: How to handle Office documents?
- A: OCR track initially, consider python-docx/openpyxl later if needed.
### Pending
- Q: What translation services to integrate with?
- Needs stakeholder input on cost/quality trade-offs
- Q: Should we cache extracted text for repeated processing?
- Depends on storage costs vs reprocessing frequency
- Q: How to handle password-protected PDFs?
- May need API parameter for passwords
## Performance Targets
### Direct Extraction Track
- Latency: <500ms per page
- Throughput: 100+ pages/minute
- Memory: <500MB per document
### OCR Track (Optimized)
- Latency: 2-5s per page (GPU)
- Throughput: 20-30 pages/minute
- Memory: <2GB per batch
### API Response Times
- Document type detection: <100ms
- Processing initiation: <200ms
- Result retrieval: <100ms
## Technical Dependencies
### Python Packages
```python
# Direct extraction
PyMuPDF==1.23.x
pdfplumber==0.10.x # Fallback/validation
python-magic-bin==0.4.x
# OCR enhancement
paddlepaddle-gpu==2.5.2
paddleocr==2.7.3
# Infrastructure
pydantic==2.x
fastapi==0.100+
redis==5.x # For caching
```
### System Requirements
- CUDA 11.8+ for PaddlePaddle
- libmagic for file detection
- 16GB RAM minimum
- 50GB disk for models and cache

View File

@@ -0,0 +1,35 @@
# Change: Dual-track Document Processing with Structure-Preserving Translation
## Why
The current system processes all documents through PaddleOCR, causing unnecessary overhead for editable PDFs that already contain extractable text. Additionally, we're only using ~20% of PP-StructureV3's capabilities, missing out on comprehensive document structure extraction. The system needs to support structure-preserving document translation as a future goal.
## What Changes
- **ADDED** Dual-track processing architecture with intelligent routing
- OCR track for scanned documents, images, and Office files using PaddleOCR
- Direct extraction track for editable PDFs using PyMuPDF
- **ADDED** UnifiedDocument model as common output format for both tracks
- **ADDED** DocumentTypeDetector service for automatic track selection
- **MODIFIED** OCR service to use PP-StructureV3's parsing_res_list instead of markdown
- Now extracts all 23 element types with bbox coordinates
- Preserves reading order and hierarchical structure
- **MODIFIED** PDF generator to handle UnifiedDocument format
- Enhanced overlap detection to prevent text/image/table collisions
- Improved coordinate transformation for accurate layout
- **ADDED** Foundation for structure-preserving translation system
- **BREAKING** JSON output structure will include new fields (backward compatible with defaults)
## Impact
- **Affected specs**:
- `document-processing` (new capability)
- `result-export` (enhanced with track metadata and structure data)
- `task-management` (tracks processing route and history)
- **Affected code**:
- `backend/app/services/ocr_service.py` - Major refactoring for dual-track
- `backend/app/services/pdf_generator_service.py` - UnifiedDocument support
- `backend/app/api/v2/tasks.py` - New endpoints for track detection
- `frontend/src/pages/TaskDetailPage.tsx` - Display processing track info
- **Performance**: 5-10x faster for editable PDFs, same speed for scanned documents
- **Dependencies**: Adds PyMuPDF, pdfplumber, python-magic-bin

View File

@@ -0,0 +1,108 @@
# Document Processing Spec Delta
## ADDED Requirements
### Requirement: Dual-track Processing
The system SHALL support two distinct processing tracks for documents: OCR track for scanned/image documents and Direct extraction track for editable PDFs.
#### Scenario: Process scanned PDF through OCR track
- **WHEN** a scanned PDF is uploaded
- **THEN** the system SHALL detect it requires OCR
- **AND** route it through PaddleOCR PP-StructureV3 pipeline
- **AND** return results in UnifiedDocument format
#### Scenario: Process editable PDF through direct extraction
- **WHEN** an editable PDF with extractable text is uploaded
- **THEN** the system SHALL detect it can be directly extracted
- **AND** route it through PyMuPDF extraction pipeline
- **AND** return results in UnifiedDocument format without OCR
#### Scenario: Auto-detect processing track
- **WHEN** a document is uploaded without explicit track specification
- **THEN** the system SHALL analyze the document type and content
- **AND** automatically select the optimal processing track
- **AND** include the selected track in processing metadata
### Requirement: Document Type Detection
The system SHALL provide intelligent document type detection to determine the optimal processing track.
#### Scenario: Detect editable PDF
- **WHEN** analyzing a PDF document
- **THEN** the system SHALL check for extractable text content
- **AND** return confidence score for editability
- **AND** recommend "direct" track if text coverage > 90%
#### Scenario: Detect scanned document
- **WHEN** analyzing an image or scanned PDF
- **THEN** the system SHALL identify lack of extractable text
- **AND** recommend "ocr" track for processing
- **AND** configure appropriate OCR models
#### Scenario: Detect Office documents
- **WHEN** analyzing .docx, .xlsx, .pptx files
- **THEN** the system SHALL identify Office format
- **AND** route to OCR track for initial implementation
- **AND** preserve option for future direct Office extraction
### Requirement: Unified Document Model
The system SHALL use a standardized UnifiedDocument model as the common output format for both processing tracks.
#### Scenario: Generate UnifiedDocument from OCR
- **WHEN** OCR processing completes
- **THEN** the system SHALL convert PP-StructureV3 results to UnifiedDocument
- **AND** preserve all element types, coordinates, and confidence scores
- **AND** maintain reading order and hierarchical structure
#### Scenario: Generate UnifiedDocument from direct extraction
- **WHEN** direct extraction completes
- **THEN** the system SHALL convert PyMuPDF results to UnifiedDocument
- **AND** preserve text styling, fonts, and exact positioning
- **AND** extract tables with cell boundaries and content
#### Scenario: Consistent output regardless of track
- **WHEN** processing completes through either track
- **THEN** the output SHALL conform to UnifiedDocument schema
- **AND** include processing_track metadata field
- **AND** support identical downstream operations (PDF generation, translation)
### Requirement: Enhanced OCR with Full PP-StructureV3
The system SHALL utilize the full capabilities of PP-StructureV3, extracting all 23 element types from parsing_res_list.
#### Scenario: Extract comprehensive document structure
- **WHEN** processing through OCR track
- **THEN** the system SHALL use page_result.json['parsing_res_list']
- **AND** extract all element types including headers, lists, tables, figures
- **AND** preserve layout_bbox coordinates for each element
#### Scenario: Maintain reading order
- **WHEN** extracting elements from PP-StructureV3
- **THEN** the system SHALL preserve the reading order from parsing_res_list
- **AND** assign sequential indices to elements
- **AND** support reordering for complex layouts
#### Scenario: Extract table structure
- **WHEN** PP-StructureV3 identifies a table
- **THEN** the system SHALL extract cell content and boundaries
- **AND** preserve table HTML for structure
- **AND** extract plain text for translation
### Requirement: Structure-Preserving Translation Foundation
The system SHALL maintain document structure and layout information to support future translation features.
#### Scenario: Preserve coordinates for translation
- **WHEN** processing any document
- **THEN** the system SHALL retain bbox coordinates for all text elements
- **AND** calculate space requirements for text expansion/contraction
- **AND** maintain element relationships and groupings
#### Scenario: Extract translatable content
- **WHEN** processing tables and lists
- **THEN** the system SHALL extract plain text content
- **AND** maintain mapping to original structure
- **AND** preserve formatting markers for reconstruction
#### Scenario: Support layout adjustment
- **WHEN** preparing for translation
- **THEN** the system SHALL identify flexible vs fixed layout regions
- **AND** calculate maximum text expansion ratios
- **AND** preserve non-translatable elements (logos, signatures)

View File

@@ -0,0 +1,74 @@
# Result Export Spec Delta
## MODIFIED Requirements
### Requirement: Export Interface
The Export page SHALL support downloading OCR results in multiple formats using V2 task APIs, with processing track information and enhanced structure data.
#### Scenario: Export page uses V2 download endpoints
- **WHEN** user selects a format and clicks export button
- **THEN** frontend SHALL call V2 endpoint `/api/v2/tasks/{task_id}/download/{format}`
- **AND** frontend SHALL NOT call V1 `/api/v2/export` endpoint (which returns 404)
- **AND** file SHALL download successfully
#### Scenario: Export supports multiple formats
- **WHEN** user exports a completed task
- **THEN** system SHALL support downloading as TXT, JSON, Excel, Markdown, and PDF
- **AND** each format SHALL use correct V2 download endpoint
- **AND** downloaded files SHALL contain task OCR results
#### Scenario: Export includes processing track metadata
- **WHEN** user exports a task processed through dual-track system
- **THEN** exported JSON SHALL include "processing_track" field indicating "ocr" or "direct"
- **AND** SHALL include "processing_metadata" with track-specific information
- **AND** SHALL maintain backward compatibility for clients not expecting these fields
#### Scenario: Export UnifiedDocument format
- **WHEN** user requests JSON export with unified=true parameter
- **THEN** system SHALL return UnifiedDocument structure
- **AND** include complete element hierarchy with coordinates
- **AND** preserve all PP-StructureV3 element types for OCR track
## ADDED Requirements
### Requirement: Enhanced PDF Export with Layout Preservation
The PDF export SHALL accurately preserve document layout from both OCR and direct extraction tracks.
#### Scenario: Export PDF from direct extraction track
- **WHEN** exporting PDF from a direct-extraction processed document
- **THEN** the PDF SHALL maintain exact text positioning from source
- **AND** preserve original fonts and styles where possible
- **AND** include extracted images at correct positions
#### Scenario: Export PDF from OCR track with full structure
- **WHEN** exporting PDF from OCR-processed document
- **THEN** the PDF SHALL use all 23 PP-StructureV3 element types
- **AND** render tables with proper cell boundaries
- **AND** maintain reading order from parsing_res_list
#### Scenario: Handle coordinate transformations
- **WHEN** generating PDF from UnifiedDocument
- **THEN** system SHALL correctly transform bbox coordinates to PDF space
- **AND** handle page size variations
- **AND** prevent text overlap using enhanced overlap detection
### Requirement: Structure Data Export
The system SHALL provide export formats that preserve document structure for downstream processing.
#### Scenario: Export structured JSON with hierarchy
- **WHEN** user selects structured JSON format
- **THEN** export SHALL include element hierarchy and relationships
- **AND** preserve parent-child relationships (sections, lists)
- **AND** include style and formatting information
#### Scenario: Export for translation preparation
- **WHEN** user exports with translation_ready=true parameter
- **THEN** export SHALL include translatable text segments
- **AND** maintain coordinate mappings for each segment
- **AND** mark non-translatable regions
#### Scenario: Export with layout analysis
- **WHEN** user requests layout analysis export
- **THEN** system SHALL include reading order indices
- **AND** identify layout regions (header, body, footer, sidebar)
- **AND** provide confidence scores for layout detection

View File

@@ -0,0 +1,105 @@
# Task Management Spec Delta
## MODIFIED Requirements
### Requirement: Task Result Generation
The OCR service SHALL generate both JSON and Markdown result files for completed tasks with actual content, including processing track information and enhanced structure data.
#### Scenario: Markdown file contains OCR results
- **WHEN** a task completes OCR processing successfully
- **THEN** the generated `.md` file SHALL contain the extracted text in markdown format
- **AND** the file size SHALL be greater than 0 bytes
- **AND** the markdown SHALL include headings, paragraphs, and formatting based on OCR layout detection
#### Scenario: Result files stored in task directory
- **WHEN** OCR processing completes for task ID `88c6c2d2-37e1-48fd-a50f-406142987bdf`
- **THEN** result files SHALL be stored in `storage/results/88c6c2d2-37e1-48fd-a50f-406142987bdf/`
- **AND** both `<filename>_result.json` and `<filename>_result.md` SHALL exist
- **AND** both files SHALL contain valid OCR output data
#### Scenario: Include processing track in results
- **WHEN** a task completes through dual-track processing
- **THEN** the JSON result SHALL include "processing_track" field
- **AND** SHALL indicate whether "ocr" or "direct" track was used
- **AND** SHALL include track-specific metadata (confidence for OCR, extraction quality for direct)
#### Scenario: Store UnifiedDocument format
- **WHEN** processing completes through either track
- **THEN** system SHALL save results in UnifiedDocument format
- **AND** maintain backward-compatible JSON structure
- **AND** include enhanced structure from PP-StructureV3 or PyMuPDF
### Requirement: Task Detail View
The frontend SHALL provide a dedicated page for viewing individual task details with processing track information and enhanced preview capabilities.
#### Scenario: Navigate to task detail page
- **WHEN** user clicks "View Details" button on task in Task History page
- **THEN** browser SHALL navigate to `/tasks/{task_id}`
- **AND** TaskDetailPage component SHALL render
#### Scenario: Display task information
- **WHEN** TaskDetailPage loads for a valid task ID
- **THEN** page SHALL display task metadata (filename, status, processing time, confidence)
- **AND** page SHALL show markdown preview of OCR results
- **AND** page SHALL provide download buttons for JSON, Markdown, and PDF formats
#### Scenario: Download from task detail page
- **WHEN** user clicks download button for a specific format
- **THEN** browser SHALL download the file using `/api/v2/tasks/{task_id}/download/{format}` endpoint
- **AND** downloaded file SHALL contain the task's OCR results in requested format
#### Scenario: Display processing track information
- **WHEN** viewing task processed through dual-track system
- **THEN** page SHALL display processing track used (OCR or Direct)
- **AND** show track-specific metrics (OCR confidence or extraction quality)
- **AND** provide option to reprocess with alternate track if applicable
#### Scenario: Preview document structure
- **WHEN** user enables structure view
- **THEN** page SHALL display document element hierarchy
- **AND** show bounding boxes overlay on preview
- **AND** highlight different element types (headers, tables, lists) with distinct colors
## ADDED Requirements
### Requirement: Processing Track Management
The task management system SHALL track and display processing track information for all tasks.
#### Scenario: Track processing route selection
- **WHEN** a task begins processing
- **THEN** system SHALL record the selected processing track
- **AND** log the reason for track selection
- **AND** store auto-detection confidence score
#### Scenario: Allow track override
- **WHEN** user views a completed task
- **THEN** system SHALL offer option to reprocess with different track
- **AND** maintain both results for comparison
- **AND** track which result user prefers
#### Scenario: Display processing metrics
- **WHEN** task completes processing
- **THEN** system SHALL record track-specific metrics
- **AND** OCR track SHALL show confidence scores and character count
- **AND** Direct track SHALL show extraction coverage and structure quality
### Requirement: Task Processing History
The system SHALL maintain detailed processing history for tasks including track changes and reprocessing.
#### Scenario: Record reprocessing attempts
- **WHEN** a task is reprocessed with different track
- **THEN** system SHALL maintain processing history
- **AND** store results from each attempt
- **AND** allow comparison between different processing attempts
#### Scenario: Track quality improvements
- **WHEN** viewing task history
- **THEN** system SHALL show quality metrics over time
- **AND** indicate if reprocessing improved results
- **AND** suggest optimal track based on document characteristics
#### Scenario: Export processing analytics
- **WHEN** exporting task data
- **THEN** system SHALL include processing history
- **AND** provide track selection statistics
- **AND** include performance metrics for each processing attempt

View File

@@ -0,0 +1,170 @@
# Implementation Tasks: Dual-track Document Processing
## 1. Core Infrastructure
- [ ] 1.1 Add PyMuPDF and other dependencies to requirements.txt
- [ ] 1.1.1 Add PyMuPDF==1.23.x
- [ ] 1.1.2 Add pdfplumber==0.10.x
- [ ] 1.1.3 Add python-magic-bin==0.4.x
- [ ] 1.1.4 Test dependency installation
- [ ] 1.2 Create UnifiedDocument model in backend/app/models/
- [ ] 1.2.1 Define UnifiedDocument dataclass
- [ ] 1.2.2 Add DocumentElement model
- [ ] 1.2.3 Add DocumentMetadata model
- [ ] 1.2.4 Create converters for both OCR and direct extraction outputs
- [ ] 1.3 Create DocumentTypeDetector service
- [ ] 1.3.1 Implement file type detection using python-magic
- [ ] 1.3.2 Add PDF editability checking logic
- [ ] 1.3.3 Add Office document detection
- [ ] 1.3.4 Create routing logic to determine processing track
- [ ] 1.3.5 Add unit tests for detector
## 2. Direct Extraction Track
- [ ] 2.1 Create DirectExtractionEngine service
- [ ] 2.1.1 Implement PyMuPDF-based text extraction
- [ ] 2.1.2 Add structure preservation logic
- [ ] 2.1.3 Extract tables with coordinates
- [ ] 2.1.4 Extract images and their positions
- [ ] 2.1.5 Maintain reading order
- [ ] 2.1.6 Handle multi-column layouts
- [ ] 2.2 Implement layout analysis for editable PDFs
- [ ] 2.2.1 Detect headers and footers
- [ ] 2.2.2 Identify sections and subsections
- [ ] 2.2.3 Parse lists and nested structures
- [ ] 2.2.4 Extract font and style information
- [ ] 2.3 Create direct extraction to UnifiedDocument converter
- [ ] 2.3.1 Map PyMuPDF structures to UnifiedDocument
- [ ] 2.3.2 Preserve coordinate information
- [ ] 2.3.3 Maintain element relationships
## 3. OCR Track Enhancement
- [ ] 3.1 Upgrade PP-StructureV3 configuration
- [ ] 3.1.1 Update config for RTX 4060 8GB optimization
- [ ] 3.1.2 Enable batch processing for GPU efficiency
- [ ] 3.1.3 Configure memory management settings
- [ ] 3.1.4 Set up model caching
- [ ] 3.2 Enhance OCR service to use parsing_res_list
- [ ] 3.2.1 Replace markdown extraction with parsing_res_list
- [ ] 3.2.2 Extract all 23 element types
- [ ] 3.2.3 Preserve bbox coordinates from PP-StructureV3
- [ ] 3.2.4 Maintain reading order information
- [ ] 3.3 Create OCR to UnifiedDocument converter
- [ ] 3.3.1 Map PP-StructureV3 elements to UnifiedDocument
- [ ] 3.3.2 Handle complex nested structures
- [ ] 3.3.3 Preserve all metadata
## 4. Unified Processing Pipeline
- [ ] 4.1 Update main OCR service for dual-track processing
- [ ] 4.1.1 Integrate DocumentTypeDetector
- [ ] 4.1.2 Route to appropriate processing engine
- [ ] 4.1.3 Return UnifiedDocument from both tracks
- [ ] 4.1.4 Maintain backward compatibility
- [ ] 4.2 Create unified JSON export
- [ ] 4.2.1 Define standardized JSON schema
- [ ] 4.2.2 Include processing metadata
- [ ] 4.2.3 Support both track outputs
- [ ] 4.3 Update PDF generator for UnifiedDocument
- [ ] 4.3.1 Adapt PDF generation to use UnifiedDocument
- [ ] 4.3.2 Preserve layout from both tracks
- [ ] 4.3.3 Handle coordinate transformations
## 5. Translation System Foundation
- [ ] 5.1 Create TranslationEngine interface
- [ ] 5.1.1 Define translation API contract
- [ ] 5.1.2 Support element-level translation
- [ ] 5.1.3 Preserve formatting markers
- [ ] 5.2 Implement structure-preserving translation
- [ ] 5.2.1 Translate text while maintaining coordinates
- [ ] 5.2.2 Handle table cell translations
- [ ] 5.2.3 Preserve list structures
- [ ] 5.2.4 Maintain header hierarchies
- [ ] 5.3 Create translated document renderer
- [ ] 5.3.1 Generate PDF with translated text
- [ ] 5.3.2 Adjust layouts for text expansion/contraction
- [ ] 5.3.3 Handle font substitution for target languages
## 6. API Updates
- [ ] 6.1 Update OCR endpoints
- [ ] 6.1.1 Add processing_track parameter
- [ ] 6.1.2 Support track auto-detection
- [ ] 6.1.3 Return processing metadata
- [ ] 6.2 Add document type detection endpoint
- [ ] 6.2.1 Create /analyze endpoint
- [ ] 6.2.2 Return recommended processing track
- [ ] 6.2.3 Provide confidence scores
- [ ] 6.3 Update result export endpoints
- [ ] 6.3.1 Support UnifiedDocument format
- [ ] 6.3.2 Add format conversion options
- [ ] 6.3.3 Include processing track information
## 7. Frontend Updates
- [ ] 7.1 Update task detail view
- [ ] 7.1.1 Display processing track information
- [ ] 7.1.2 Show track-specific metadata
- [ ] 7.1.3 Add track selection UI (if manual override needed)
- [ ] 7.2 Update results preview
- [ ] 7.2.1 Handle UnifiedDocument format
- [ ] 7.2.2 Display enhanced structure information
- [ ] 7.2.3 Show coordinate overlays (debug mode)
- [ ] 7.3 Add translation UI preparation
- [ ] 7.3.1 Add translation toggle/button
- [ ] 7.3.2 Language selection dropdown
- [ ] 7.3.3 Translation progress indicator
## 8. Testing
- [ ] 8.1 Unit tests for DocumentTypeDetector
- [ ] 8.1.1 Test various file types
- [ ] 8.1.2 Test editability detection
- [ ] 8.1.3 Test edge cases
- [ ] 8.2 Unit tests for DirectExtractionEngine
- [ ] 8.2.1 Test text extraction accuracy
- [ ] 8.2.2 Test structure preservation
- [ ] 8.2.3 Test coordinate extraction
- [ ] 8.3 Integration tests for dual-track processing
- [ ] 8.3.1 Test routing logic
- [ ] 8.3.2 Test UnifiedDocument generation
- [ ] 8.3.3 Test backward compatibility
- [ ] 8.4 End-to-end tests
- [ ] 8.4.1 Test scanned PDF processing (OCR track)
- [ ] 8.4.2 Test editable PDF processing (direct track)
- [ ] 8.4.3 Test Office document processing
- [ ] 8.4.4 Test image file processing
- [ ] 8.5 Performance testing
- [ ] 8.5.1 Benchmark both processing tracks
- [ ] 8.5.2 Test GPU memory usage
- [ ] 8.5.3 Compare processing times
## 9. Documentation
- [ ] 9.1 Update API documentation
- [ ] 9.1.1 Document new endpoints
- [ ] 9.1.2 Update existing endpoint docs
- [ ] 9.1.3 Add processing track information
- [ ] 9.2 Create architecture documentation
- [ ] 9.2.1 Document dual-track flow
- [ ] 9.2.2 Explain UnifiedDocument structure
- [ ] 9.2.3 Add decision trees for track selection
- [ ] 9.3 Add deployment guide
- [ ] 9.3.1 Document GPU requirements
- [ ] 9.3.2 Add environment configuration
- [ ] 9.3.3 Include troubleshooting guide
## 10. Deployment Preparation
- [ ] 10.1 Update Docker configuration
- [ ] 10.1.1 Add new dependencies to Dockerfile
- [ ] 10.1.2 Configure GPU support
- [ ] 10.1.3 Update volume mappings
- [ ] 10.2 Update environment variables
- [ ] 10.2.1 Add processing track settings
- [ ] 10.2.2 Configure GPU memory limits
- [ ] 10.2.3 Add feature flags
- [ ] 10.3 Create migration plan
- [ ] 10.3.1 Plan for existing data migration
- [ ] 10.3.2 Create rollback procedures
- [ ] 10.3.3 Document breaking changes
## Completion Checklist
- [ ] All unit tests passing
- [ ] Integration tests passing
- [ ] Performance benchmarks acceptable
- [ ] Documentation complete
- [ ] Code reviewed
- [ ] Deployment tested in staging

View File

@@ -1,226 +0,0 @@
#!/usr/bin/env python3
"""
Proof of Concept: External API Authentication Test
Tests the external authentication API at https://pj-auth-api.vercel.app
"""
import asyncio
import json
from datetime import datetime
from typing import Dict, Any, Optional
import httpx
from pydantic import BaseModel, Field
class UserInfo(BaseModel):
"""User information from external API"""
id: str
name: str
email: str
job_title: Optional[str] = Field(None, alias="jobTitle")
office_location: Optional[str] = Field(None, alias="officeLocation")
business_phones: list[str] = Field(default_factory=list, alias="businessPhones")
class AuthSuccessData(BaseModel):
"""Successful authentication response data"""
access_token: str
id_token: str
expires_in: int
token_type: str
user_info: UserInfo = Field(alias="userInfo")
issued_at: str = Field(alias="issuedAt")
expires_at: str = Field(alias="expiresAt")
class AuthSuccessResponse(BaseModel):
"""Successful authentication response"""
success: bool
message: str
data: AuthSuccessData
timestamp: str
class AuthErrorResponse(BaseModel):
"""Failed authentication response"""
success: bool
error: str
code: str
timestamp: str
class ExternalAuthClient:
"""Client for external authentication API"""
def __init__(self, base_url: str = "https://pj-auth-api.vercel.app", timeout: int = 30):
self.base_url = base_url
self.timeout = timeout
self.endpoint = "/api/auth/login"
async def authenticate(self, username: str, password: str) -> Dict[str, Any]:
"""
Authenticate user with external API
Args:
username: User email/username
password: User password
Returns:
Authentication result dictionary
"""
url = f"{self.base_url}{self.endpoint}"
print(f" Endpoint: POST {url}")
print(f" Username: {username}")
print(f" Timestamp: {datetime.now().isoformat()}")
print()
async with httpx.AsyncClient() as client:
try:
# Make authentication request
start_time = datetime.now()
response = await client.post(
url,
json={"username": username, "password": password},
timeout=self.timeout
)
elapsed = (datetime.now() - start_time).total_seconds()
# Print response details
print("Response Details:")
print(f" Status Code: {response.status_code}")
print(f" Response Time: {elapsed:.3f}s")
print(f" Content-Type: {response.headers.get('content-type', 'N/A')}")
print()
# Parse response
response_data = response.json()
print("Response Body:")
print(json.dumps(response_data, indent=2, ensure_ascii=False))
print()
# Handle success/failure
if response.status_code == 200:
auth_response = AuthSuccessResponse(**response_data)
return {
"success": True,
"status_code": response.status_code,
"data": auth_response.dict(),
"user_display_name": auth_response.data.user_info.name,
"user_email": auth_response.data.user_info.email,
"token": auth_response.data.access_token,
"expires_in": auth_response.data.expires_in,
"expires_at": auth_response.data.expires_at
}
elif response.status_code == 401:
error_response = AuthErrorResponse(**response_data)
return {
"success": False,
"status_code": response.status_code,
"error": error_response.error,
"code": error_response.code
}
else:
return {
"success": False,
"status_code": response.status_code,
"error": f"Unexpected status code: {response.status_code}",
"response": response_data
}
except httpx.TimeoutException:
print(f"❌ Request timeout after {self.timeout} seconds")
return {
"success": False,
"error": "Request timeout",
"code": "TIMEOUT"
}
except httpx.RequestError as e:
print(f"❌ Request error: {e}")
return {
"success": False,
"error": str(e),
"code": "REQUEST_ERROR"
}
except Exception as e:
print(f"❌ Unexpected error: {e}")
return {
"success": False,
"error": str(e),
"code": "UNKNOWN_ERROR"
}
async def test_authentication():
"""Test authentication with different scenarios"""
client = ExternalAuthClient()
# Test scenarios
test_cases = [
{
"name": "Valid Credentials (Example)",
"username": "ymirliu@panjit.com.tw",
"password": "correct_password", # Replace with actual password for testing
"expected": "success"
},
{
"name": "Invalid Credentials",
"username": "test@example.com",
"password": "wrong_password",
"expected": "failure"
}
]
for i, test_case in enumerate(test_cases, 1):
print(f"{'='*60}")
print(f"Test Case {i}: {test_case['name']}")
print(f"{'='*60}")
result = await client.authenticate(
username=test_case["username"],
password=test_case["password"]
)
# Analyze result
print("\nAnalysis:")
if result["success"]:
print("✅ Authentication successful")
print(f" User: {result.get('user_display_name', 'N/A')}")
print(f" Email: {result.get('user_email', 'N/A')}")
print(f" Token expires in: {result.get('expires_in', 0)} seconds")
print(f" Expires at: {result.get('expires_at', 'N/A')}")
else:
print("❌ Authentication failed")
print(f" Error: {result.get('error', 'Unknown error')}")
print(f" Code: {result.get('code', 'N/A')}")
print("\n")
async def test_token_validation():
"""Test token validation and refresh logic"""
# This would be implemented when we have a valid token
print("Token validation test - To be implemented with actual tokens")
pass
def main():
"""Main entry point"""
print("External Authentication API Test")
print("================================\n")
# Run tests
asyncio.run(test_authentication())
print("\nTest completed!")
print("\nNotes for implementation:")
print("1. Use httpx for async HTTP requests (already in requirements)")
print("2. Store tokens securely (consider encryption)")
print("3. Implement automatic token refresh before expiration")
print("4. Handle network failures with retry logic")
print("5. Map external user ID to local user records")
print("6. Display user 'name' field in UI instead of username")
if __name__ == "__main__":
main()