chore: project cleanup and prepare for dual-track processing refactor
- Removed all test files and directories
- Deleted outdated documentation (will be rewritten)
- Cleaned up temporary files, logs, and uploads
- Archived 5 completed OpenSpec proposals
- Created new dual-track-document-processing proposal with complete OpenSpec structure
  - Dual-track architecture: OCR track (PaddleOCR) + Direct track (PyMuPDF)
  - UnifiedDocument model for consistent output
  - Support for structure-preserving translation
- Updated .gitignore to prevent future test/temp files

This is a major cleanup preparing for the complete refactoring of the document processing pipeline.

🤖 Generated with [Claude Code](https://claude.com/claude-code)

Co-Authored-By: Claude <noreply@anthropic.com>
@@ -1,7 +1,15 @@
 {
   "permissions": {
     "allow": [
-      "Bash(git commit:*)"
+      "Bash(git commit:*)",
+      "Bash(xargs ls:*)",
+      "Bash(jq:*)",
+      "Bash(python:*)",
+      "Bash(python3:*)",
+      "Bash(source venv/bin/activate)",
+      "Bash(find:*)",
+      "Bash(ls:*)",
+      "Bash(openspec list:*)"
     ],
     "deny": [],
     "ask": []
9 .gitignore (vendored)
@@ -89,3 +89,12 @@ build/
 Thumbs.db
 ehthumbs.db
 Desktop.ini
+
+# Test and temporary files
+backend/uploads/*
+storage/uploads/*
+storage/results/*
+*.log
+__pycache__/
+*.bak
+test_*.py
743 API_REFERENCE.md
@@ -1,743 +0,0 @@
# Tool_OCR API Reference & Issues Report

## Document Information
- **Created**: 2025-01-13
- **Version**: v0.1.0
- **Purpose**: Complete record of all API endpoints and of every frontend/backend inconsistency

---

## Table of Contents
1. [API Endpoint Inventory](#api-endpoint-inventory)
2. [Frontend/Backend Inconsistencies](#frontendbackend-inconsistencies)
3. [Remediation Recommendations](#remediation-recommendations)

---

## API Endpoint Inventory
### 1. Authentication API

#### POST `/api/v1/auth/login`
- **Function**: User login
- **Request body**:
```typescript
{
  username: string,
  password: string
}
```
- **Response**:
```typescript
{
  access_token: string,
  token_type: string,  // "bearer"
  expires_in: number   // Token lifetime in seconds
}
```
- **Backend implementation**: ✅ [backend/app/routers/auth.py:24](backend/app/routers/auth.py#L24)
- **Frontend usage**: ✅ [frontend/src/services/api.ts:106](frontend/src/services/api.ts#L106)
- **Status**: ⚠️ **Issue** - the frontend type is missing the `expires_in` field

---
### 2. File Upload API

#### POST `/api/v1/upload`
- **Function**: Upload files for OCR processing
- **Request body**: `multipart/form-data`
  - `files`: File[] - list of files (PNG, JPG, JPEG, PDF)
  - `batch_name`: string (optional) - batch name
- **Response**:
```typescript
{
  batch_id: number,
  files: [
    {
      id: number,
      batch_id: number,
      filename: string,
      original_filename: string,
      file_size: number,
      file_format: string,  // ⚠️ backend uses file_format
      status: string,
      error: string | null,
      created_at: string,
      processing_time: number | null
    }
  ]
}
```
- **Backend implementation**: ✅ [backend/app/routers/ocr.py:39](backend/app/routers/ocr.py#L39)
- **Frontend usage**: ✅ [frontend/src/services/api.ts:128](frontend/src/services/api.ts#L128)
- **Status**: ⚠️ **Issue** - the frontend type uses `format`, the backend uses `file_format`

---
### 3. OCR Processing API

#### POST `/api/v1/ocr/process`
- **Function**: Trigger batch OCR processing
- **Request body**:
```typescript
{
  batch_id: number,
  lang: string,           // "ch", "en", "japan", "korean"
  detect_layout: boolean  // ⚠️ backend expects detect_layout; frontend sends confidence_threshold
}
```
- **Response**:
```typescript
{
  message: string,      // ⚠️ present in the backend response
  batch_id: number,
  total_files: number,  // ⚠️ present in the backend response
  status: string        // "processing"
  // task_id: string    // ❌ expected by the frontend, but the backend does not return it
}
```
- **Backend implementation**: ✅ [backend/app/routers/ocr.py:95](backend/app/routers/ocr.py#L95)
- **Frontend usage**: ✅ [frontend/src/services/api.ts:148](frontend/src/services/api.ts#L148)
- **Status**: ⚠️ **Issue** - request/response models do not match

---

#### GET `/api/v1/batch/{batch_id}/status`
- **Function**: Get batch processing status
- **Path parameters**:
  - `batch_id`: number - batch ID
- **Response**:
```typescript
{
  batch: {
    id: number,
    user_id: number,
    batch_name: string | null,
    status: string,
    total_files: number,
    completed_files: number,
    failed_files: number,
    progress_percentage: number,
    created_at: string,
    started_at: string | null,
    completed_at: string | null
  },
  files: [
    {
      id: number,
      batch_id: number,
      filename: string,
      original_filename: string,
      file_size: number,
      file_format: string,
      status: string,
      error: string | null,
      created_at: string,
      processing_time: number | null
    }
  ]
}
```
- **Backend implementation**: ✅ [backend/app/routers/ocr.py:148](backend/app/routers/ocr.py#L148)
- **Frontend usage**: ✅ [frontend/src/services/api.ts:172](frontend/src/services/api.ts#L172)
- **Status**: ✅ **OK**

---

#### GET `/api/v1/ocr/result/{file_id}`
- **Function**: Get an OCR result
- **Path parameters**:
  - `file_id`: number - file ID
- **Response**:
```typescript
{
  file_id: number,
  filename: string,
  status: string,
  markdown_content: string | null,
  json_data: {
    total_text_regions: number,
    average_confidence: number,
    detected_language: string,
    layout_data: object | null,
    images_metadata: array | null
  } | null,
  confidence: number | null,
  processing_time: number | null
}
```
- **Backend implementation**: ✅ [backend/app/routers/ocr.py:182](backend/app/routers/ocr.py#L182)
- **Frontend usage**: ✅ [frontend/src/services/api.ts:164](frontend/src/services/api.ts#L164)
  - ⚠️ **Note**: the frontend names the parameter `taskId`, but it is actually a `file_id`
- **Status**: ⚠️ **Issue** - misleading frontend parameter name

---

#### ❌ GET `/api/v1/ocr/status/{task_id}`
- **Function**: Get task status (expected by the frontend but does not exist)
- **Status**: ❌ **Missing** - the frontend calls this endpoint, but the backend does not implement it
- **Frontend usage**: [frontend/src/services/api.ts:156](frontend/src/services/api.ts#L156)
- **Problem**: the frontend receives a 404 error

---
### 4. Export API

#### POST `/api/v1/export`
- **Function**: Export OCR results
- **Request body**:
```typescript
{
  batch_id: number,
  format: "txt" | "json" | "excel" | "markdown" | "pdf" | "zip",
  rule_id: number | null,
  css_template: string,  // "default", "academic", "business"
  include_formats: string[] | null,
  options: {
    confidence_threshold: number | null,
    include_metadata: boolean,
    filename_pattern: string | null,
    css_template: string | null
  } | null
}
```
- **Response**: File download (Blob)
- **Backend implementation**: ✅ [backend/app/routers/export.py:38](backend/app/routers/export.py#L38)
- **Frontend usage**: ✅ [frontend/src/services/api.ts:182](frontend/src/services/api.ts#L182)
- **Status**: ✅ **OK**

---

#### GET `/api/v1/export/pdf/{file_id}`
- **Function**: Generate a PDF for a single file
- **Path parameters**:
  - `file_id`: number - file ID
- **Query parameters**:
  - `css_template`: string - CSS template name
- **Response**: PDF file (Blob)
- **Backend implementation**: ✅ [backend/app/routers/export.py:144](backend/app/routers/export.py#L144)
- **Frontend usage**: ✅ [frontend/src/services/api.ts:192](frontend/src/services/api.ts#L192)
- **Status**: ✅ **OK**

---

#### GET `/api/v1/export/rules`
- **Function**: List export rules
- **Response**:
```typescript
[
  {
    id: number,
    user_id: number,
    rule_name: string,
    description: string | null,
    config_json: object,
    css_template: string | null,
    created_at: string,
    updated_at: string
  }
]
```
- **Backend implementation**: ✅ [backend/app/routers/export.py:206](backend/app/routers/export.py#L206)
- **Frontend usage**: ✅ [frontend/src/services/api.ts:204](frontend/src/services/api.ts#L204)
- **Status**: ✅ **OK**

---

#### POST `/api/v1/export/rules`
- **Function**: Create an export rule
- **Request body**:
```typescript
{
  rule_name: string,
  description: string | null,
  config_json: object,
  css_template: string | null
}
```
- **Response**: a single object, same shape as GET `/api/v1/export/rules`
- **Backend implementation**: ✅ [backend/app/routers/export.py:220](backend/app/routers/export.py#L220)
- **Frontend usage**: ✅ [frontend/src/services/api.ts:212](frontend/src/services/api.ts#L212)
- **Status**: ✅ **OK**

---

#### PUT `/api/v1/export/rules/{rule_id}`
- **Function**: Update an export rule
- **Path parameters**:
  - `rule_id`: number - rule ID
- **Request body**: same as POST `/api/v1/export/rules` (all fields optional)
- **Response**: a single object, same shape as GET `/api/v1/export/rules`
- **Backend implementation**: ✅ [backend/app/routers/export.py:254](backend/app/routers/export.py#L254)
- **Frontend usage**: ✅ [frontend/src/services/api.ts:220](frontend/src/services/api.ts#L220)
- **Status**: ✅ **OK**

---

#### DELETE `/api/v1/export/rules/{rule_id}`
- **Function**: Delete an export rule
- **Path parameters**:
  - `rule_id`: number - rule ID
- **Response**:
```typescript
{
  message: "Export rule deleted successfully"
}
```
- **Backend implementation**: ✅ [backend/app/routers/export.py:295](backend/app/routers/export.py#L295)
- **Frontend usage**: ✅ [frontend/src/services/api.ts:228](frontend/src/services/api.ts#L228)
- **Status**: ✅ **OK**

---

#### GET `/api/v1/export/css-templates`
- **Function**: List CSS templates
- **Response**:
```typescript
[
  {
    name: string,
    description: string,
    filename: string  // ⚠️ defined in the schema, but missing from the actual response
  }
]
```
- **Backend implementation**: ✅ [backend/app/routers/export.py:326](backend/app/routers/export.py#L326)
  - Actual response: `[{ name, description }]`
  - Schema definition: `[{ name, description, filename }]`
- **Frontend usage**: ✅ [frontend/src/services/api.ts:235](frontend/src/services/api.ts#L235)
- **Status**: ⚠️ **Issue** - the `filename` field is missing

---
### 5. Translation API (RESERVED)

#### GET `/api/v1/translate/status`
- **Function**: Get translation feature status
- **Response**:
```typescript
{
  status: "RESERVED",
  message: string,
  planned_phase: string,
  features: string[]
}
```
- **Backend implementation**: ✅ [backend/app/routers/translation.py:28](backend/app/routers/translation.py#L28)
- **Frontend usage**: ❌ Not used
- **Status**: ✅ **OK** (reserved feature)

---

#### GET `/api/v1/translate/languages`
- **Function**: List supported languages
- **Response**:
```typescript
[
  {
    code: string,
    name: string,
    native_name: string
  }
]
```
- **Backend implementation**: ✅ [backend/app/routers/translation.py:43](backend/app/routers/translation.py#L43)
- **Frontend usage**: ❌ Not used
- **Status**: ✅ **OK** (reserved feature)

---

#### POST `/api/v1/translate/document`
- **Function**: Translate a document (not implemented)
- **Request body**:
```typescript
{
  file_id: number,
  source_lang: string,
  target_lang: string,
  engine_type: "argos" | "ernie" | "google" | "deepl",
  preserve_structure: boolean,
  engine_config: object | null
}
```
- **Response**: HTTP 501 Not Implemented
- **Backend implementation**: ✅ [backend/app/routers/translation.py:56](backend/app/routers/translation.py#L56) (stub)
- **Frontend usage**: ✅ [frontend/src/services/api.ts:247](frontend/src/services/api.ts#L247)
- **Status**: ⚠️ **Reserved feature** - the frontend receives a 501 error

---

#### ❌ GET `/api/v1/translate/configs`
- **Function**: Get translation configs (expected by the frontend but does not exist)
- **Status**: ❌ **Missing** - the frontend calls this endpoint, but the backend does not implement it
- **Frontend usage**: [frontend/src/services/api.ts:258](frontend/src/services/api.ts#L258)
- **Problem**: the frontend receives a 404 error

---

#### ❌ POST `/api/v1/translate/configs`
- **Function**: Create a translation config (expected by the frontend but does not exist)
- **Status**: ❌ **Missing** - the frontend calls this endpoint, but the backend does not implement it
- **Frontend usage**: [frontend/src/services/api.ts:269](frontend/src/services/api.ts#L269)
- **Problem**: the frontend receives a 404 error

---
### 6. Other Endpoints

#### GET `/health`
- **Function**: Health check
- **Response**:
```typescript
{
  status: "healthy",
  service: "Tool_OCR",
  version: "0.1.0"
}
```
- **Backend implementation**: ✅ [backend/app/main.py:84](backend/app/main.py#L84)
- **Frontend usage**: ❌ Not used
- **Status**: ✅ **OK**

---

#### GET `/`
- **Function**: API information
- **Response**:
```typescript
{
  message: "Tool_OCR API",
  version: "0.1.0",
  docs_url: "/docs",
  health_check: "/health"
}
```
- **Backend implementation**: ✅ [backend/app/main.py:95](backend/app/main.py#L95)
- **Frontend usage**: ❌ Not used
- **Status**: ✅ **OK**

---
## Frontend/Backend Inconsistencies

### Issue 1: Login Response Shapes Differ

**Severity**: 🟡 Medium

**Description**:
- The backend response includes an `expires_in` field (token lifetime)
- The frontend `LoginResponse` type is missing this field

**Impact**:
- The frontend cannot implement automatic token renewal
- Users cannot be warned before their token expires

**Locations**:
- Backend: [backend/app/routers/auth.py:66-70](backend/app/routers/auth.py#L66-L70)
- Frontend: [frontend/src/types/api.ts:12-15](frontend/src/types/api.ts#L12-L15)

---

### Issue 2: OCR Task Status API Does Not Exist

**Severity**: 🔴 High

**Description**:
- The frontend calls `/api/v1/ocr/status/{taskId}` to fetch task progress
- The backend only provides `/api/v1/batch/{batch_id}/status` and `/api/v1/ocr/result/{file_id}`
- There is no corresponding task-status tracking endpoint

**Impact**:
- Frontend `getTaskStatus()` calls receive 404 errors
- Real-time progress polling cannot be implemented
- Users cannot see processing progress

**Locations**:
- Frontend call: [frontend/src/services/api.ts:156-159](frontend/src/services/api.ts#L156-L159)
- Backend route: does not exist

---

### Issue 3: OCR Processing Request/Response Models Do Not Match

**Severity**: 🔴 High

**Description**:
1. **Request field mismatch**:
   - The frontend sends `confidence_threshold` (confidence threshold)
   - The backend accepts `detect_layout` (layout-detection toggle)

2. **Response field mismatch**:
   - The frontend expects `task_id` (for task tracking)
   - The backend returns `message` and `total_files` (but no `task_id`)

**Impact**:
- The frontend cannot pass parameters to the backend correctly
- The frontend cannot obtain a `task_id` for follow-up status queries
- Type checks fail
- Validation errors are likely

**Locations**:
- Frontend request: [frontend/src/types/api.ts:37-41](frontend/src/types/api.ts#L37-L41)
- Frontend response: [frontend/src/types/api.ts:43-47](frontend/src/types/api.ts#L43-L47)
- Backend request: [backend/app/schemas/ocr.py:120-133](backend/app/schemas/ocr.py#L120-L133)
- Backend response: [backend/app/schemas/ocr.py:136-151](backend/app/schemas/ocr.py#L136-L151)

---

### Issue 4: Inconsistent Upload File Field Naming

**Severity**: 🟡 Medium

**Description**:
- The backend returns the file format as `file_format`
- The frontend type definition uses `format`

**Impact**:
- The frontend cannot use the backend's `file_format` field directly
- Extra field mapping or conversion is required
- The UI may render `undefined` when displaying the file format

**Locations**:
- Frontend: [frontend/src/types/api.ts:32](frontend/src/types/api.ts#L32)
- Backend: [backend/app/schemas/ocr.py:19](backend/app/schemas/ocr.py#L19)

---

### Issue 5: CSS Template List Is Missing `filename`

**Severity**: 🟡 Medium

**Description**:
- The frontend `CSSTemplate` type expects a `filename` field
- The backend schema `CSSTemplateResponse` also defines `filename`
- But the backend actually returns only `name` and `description`

**Impact**:
- The frontend cannot use `filename` as the key/value of an `<option>`
- `filename` renders as `undefined`
- The frontend needs extra handling, or must fall back to `name`

**Locations**:
- Frontend type: [frontend/src/types/api.ts:132-136](frontend/src/types/api.ts#L132-L136)
- Backend schema: [backend/app/schemas/export.py:91-104](backend/app/schemas/export.py#L91-L104)
- Backend implementation: [backend/app/routers/export.py:333-338](backend/app/routers/export.py#L333-L338)
- PDF service: [backend/app/services/pdf_generator.py:485-496](backend/app/services/pdf_generator.py#L485-L496)

**Root cause**:
`PDFGenerator.get_available_templates()` returns only a `{name: description}` dict, with no filename.

---

### Issue 6: Translation Config Endpoints Not Implemented

**Severity**: 🟢 Low (reserved feature)

**Description**:
- The frontend calls `/api/v1/translate/configs` (GET/POST)
- The backend translation router only implements `/status`, `/languages`, and `/document`
- There are no config-related endpoints

**Impact**:
- Frontend calls receive 404 errors
- Translation configs cannot be managed
- Impact is limited, since the entire translation feature is reserved for Phase 5

**Locations**:
- Frontend GET: [frontend/src/services/api.ts:258-262](frontend/src/services/api.ts#L258-L262)
- Frontend POST: [frontend/src/services/api.ts:269-275](frontend/src/services/api.ts#L269-L275)
- Backend route: does not exist

---
## Remediation Recommendations

### Recommendation 1: Unify the Login Response Model

**Priority**: P2 (medium)

**Option A - add `expires_in` on the frontend** (recommended):
```typescript
// frontend/src/types/api.ts
export interface LoginResponse {
  access_token: string
  token_type: string
  expires_in: number  // add this field
}
```

**Option B - remove `expires_in` on the backend**:
- The field can be dropped if token-expiry management is not needed
- Not recommended, since exposing the lifetime is common JWT best practice

---

### Recommendation 2: Unify the OCR Task-Tracking Strategy

**Priority**: P1 (high)

**Option A - standardize on batch status** (recommended):
1. Remove the frontend `getTaskStatus()` method
2. Poll batch status uniformly via `getBatchStatus()`
3. Change `ProcessResponse` to drop `task_id`

**Option B - add a task-status endpoint on the backend**:
1. Add a `GET /api/v1/ocr/status/{task_id}` endpoint
2. Have `ProcessResponse` actually return a `task_id`
3. Implement task-level status tracking

**Recommendation**: adopt Option A, since the current architecture already manages status at the batch level (a sketch follows).
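A minimal sketch of Option A, assuming the existing `apiClient.getBatchStatus()` client method and the React Query polling pattern already used elsewhere in the frontend; the hook name and the `@/services/api` export are illustrative:

```typescript
import { useQuery } from '@tanstack/react-query'
import { apiClient } from '@/services/api' // assumed export

// Poll batch-level status instead of a per-task endpoint; polling stops
// automatically once the batch reaches a terminal state.
function useBatchProgress(batchId: number) {
  return useQuery({
    queryKey: ['batchStatus', batchId],
    queryFn: () => apiClient.getBatchStatus(batchId),
    refetchInterval: (query) => {
      const status = query.state.data?.batch.status
      return status === 'completed' || status === 'failed' ? false : 2000
    },
  })
}
```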
---

### Recommendation 3: Correct the OCR Processing Request/Response

**Priority**: P1 (high)

**Option A - align the frontend with the backend** (recommended):
```typescript
// frontend/src/types/api.ts
export interface ProcessRequest {
  batch_id: number
  lang?: string
  detect_layout?: boolean  // renamed to detect_layout
}

export interface ProcessResponse {
  message: string      // added
  batch_id: number
  total_files: number  // added
  status: string
  // task_id removed
}
```

**Option B - align the backend with the frontend**:
- Support a `confidence_threshold` parameter
- Include `task_id` in the response
- Requires larger changes; not recommended

---

### Recommendation 4: Align the Upload File Field Naming

**Priority**: P2 (medium)

**Option A - switch the frontend to `file_format`** (recommended):
```typescript
// frontend/src/types/api.ts
export interface FileInfo {
  id: number
  filename: string
  file_size: number
  file_format: string  // renamed to file_format
  status: 'pending' | 'processing' | 'completed' | 'failed'
}
```

**Option B - use a Pydantic alias on the backend**:
```python
# backend/app/schemas/ocr.py
file_format: str = Field(..., alias='format')
```

---

### Recommendation 5: Add the CSS Template `filename`

**Priority**: P2 (medium)

**Option A - change the PDF generator's return structure** (recommended):
```python
# backend/app/services/pdf_generator.py
def get_available_templates(self) -> Dict[str, Dict[str, str]]:
    """Get list of available CSS templates with filename"""
    return {
        "default": {
            "description": "General-purpose layout template, suitable for most documents",
            "filename": "default.css"
        },
        "academic": {
            "description": "Academic paper template, suitable for research reports",
            "filename": "academic.css"
        },
        "business": {
            "description": "Business report template, suitable for corporate documents",
            "filename": "business.css"
        },
    }
```

**Option B - let the frontend use `name` as the filename**:
- The template name is effectively the identifier
- No separate `filename` is needed

---

### Recommendation 6: Handle the Translation Config Stub

**Priority**: P3 (low)

**Option A - remove the calls from the frontend** (recommended):
1. Remove or comment out `getTranslationConfigs()` and `createTranslationConfig()`
2. Show a "coming soon" message in the UI

**Option B - add stub endpoints on the backend**:
```python
# backend/app/routers/translation.py
@router.get("/configs")
async def get_translation_configs():
    raise HTTPException(status_code=501, detail="Feature reserved for Phase 5")

@router.post("/configs")
async def create_translation_config():
    raise HTTPException(status_code=501, detail="Feature reserved for Phase 5")
```

---

## Implementation Priority Summary

### P1 - Fix immediately (affects core functionality)
1. ✅ **Recommendation 2**: unify the OCR task-tracking strategy
2. ✅ **Recommendation 3**: correct the OCR processing request/response models

### P2 - Fix soon (affects user experience)
3. ✅ **Recommendation 1**: unify the login response model
4. ✅ **Recommendation 4**: align the upload file field naming
5. ✅ **Recommendation 5**: add the CSS template `filename`

### P3 - Can be deferred (reserved features)
6. ⏸️ **Recommendation 6**: handle the translation config stub (revisit in Phase 5)

---

## Document Maintenance

**Change log**:
- 2025-01-13: initial version; complete inventory of all API endpoints and issues

**Maintenance responsibilities**:
- Update this document on every API change
- Add new API endpoints to the corresponding section
- Update issue statuses after fixes

---

## Appendix: Quick Checklists

### When adding an API endpoint
- [ ] Is the backend schema definition complete?
- [ ] Do the frontend TypeScript types match?
- [ ] Is field naming consistent (camelCase vs snake_case)?
- [ ] Does the response structure match what the frontend expects?
- [ ] Is error handling complete?
- [ ] Is the API documentation updated?
- [ ] Are there corresponding tests?

### When modifying an API
- [ ] Are frontend and backend changed in sync?
- [ ] Are there breaking changes?
- [ ] Is related documentation updated?
- [ ] Is existing functionality affected?
- [ ] Is a version migration needed?
@@ -1,275 +0,0 @@
# Chart Recognition Feature Status

## 🎉 Current Status: Enabled!

Chart recognition is now **enabled**! PaddlePaddle 3.2.1 provides the required `fused_rms_norm_ext` API.

### ✅ Issue Resolved

- **Resolved on**: 2025-11-16
- **PaddlePaddle version**: 3.2.1 (upgraded from 3.0.0)
- **API status**: `fused_rms_norm_ext` is now available ✅
- **Feature status**: PP-StructureV3 chart recognition is enabled ✅
- **Code update**: [ocr_service.py:217](backend/app/services/ocr_service.py#L217) - `use_chart_recognition=True`

### 📜 Historical Limitation (Resolved)

- **Original problem**: PaddlePaddle 3.0.0 lacked the `fused_rms_norm_ext` API
- **Recorded**: March 2025 (based on PaddlePaddle 3.0.0)
- **Fixed in**: PaddlePaddle 3.2.0+ (released September 2025)
- **Verified with**: PaddlePaddle 3.2.1, confirmed to support it

---

## 🎯 Full Feature Set Now Available

| Category | Feature | Status | Notes |
|---------|------|------|------|
| **Basic OCR** | Text recognition | ✅ Working | Core OCR functionality |
| **Layout analysis** | Chart detection | ✅ Working | Locates charts |
| **Layout analysis** | Chart extraction | ✅ Working | Saved as image files |
| **Table recognition** | Table recognition | ✅ Working | Supports nested formulas/images |
| **Formula recognition** | LaTeX extraction | ✅ Working | Mathematical formula recognition |
| **Chart recognition** | Chart type identification | ✅ **Enabled** | Bar, line, and other chart types |
| **Chart recognition** | Data extraction | ✅ **Enabled** | Extracts numeric data from charts |
| **Chart recognition** | Axis/legend parsing | ✅ **Enabled** | Axis labels and legends |
| **Chart recognition** | Chart to structured data | ✅ **Enabled** | Converts to JSON/table format |

---

## 🔧 System Configuration Updates

### 1. CUDA Library Path

To enable GPU acceleration, the WSL CUDA library path was added to the system configuration:

```bash
# ~/.bashrc
export LD_LIBRARY_PATH=/usr/lib/wsl/lib:$LD_LIBRARY_PATH
```

### 2. PaddlePaddle Version

```bash
# Current version
PaddlePaddle 3.2.1

# GPU support
✅ CUDA 12.6
✅ cuDNN 9.5
✅ GPU Compute Capability: 8.9
```

### 3. Service Configuration

```python
# backend/app/services/ocr_service.py:217
use_chart_recognition=True  # ✅ enabled
```

---

## 📊 Version History and API Support

| Version | Release Date | `fused_rms_norm_ext` | Chart Recognition |
|------|---------|-------------------------|---------|
| 3.0.0 | 2025-03-26 | ❌ Not supported | ❌ Disabled |
| 3.1.0 | 2025-06-29 | ❓ Not verified | ❓ Unknown |
| 3.1.1 | 2025-08-20 | ❓ Not verified | ❓ Unknown |
| 3.2.0 | 2025-09-08 | ✅ Likely supported | ✅ Can be enabled |
| 3.2.1 | 2025-10-30 | ✅ **Confirmed** | ✅ **Enabled** |
| 3.2.2 | 2025-11-14 | ✅ Should be supported | ✅ Should work |

**Verified on**: 2025-11-16
**Verified version**: PaddlePaddle 3.2.1
**Verification script**: `backend/verify_chart_recognition.py`
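The verification script itself is not reproduced in this document; a minimal sketch of the availability probe it performs (an assumption, not the actual `verify_chart_recognition.py`):

```python
# Probe for the chart-recognition prerequisite APIs; mirrors the checks
# shown in the Troubleshooting section below. Illustrative sketch only.
import paddle
import paddle.incubate.nn.functional as F

print(f"PaddlePaddle version: {paddle.__version__}")
for api in ("fused_rms_norm", "fused_rms_norm_ext"):
    status = "Available" if hasattr(F, api) else "Missing"
    print(f"- {api}: {status}")

if hasattr(F, "fused_rms_norm_ext"):
    print("Chart recognition CAN be enabled!")
```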
---

## ⚠️ Performance Considerations

Impact of enabling chart recognition:

### Processing time
- **Simple charts**: +2-3 seconds per chart
- **Complex charts**: +5-10 seconds per chart
- **Multi-chart pages**: processing time grows accordingly

### Memory usage
- **GPU memory**: roughly +500MB-1GB
- **System memory**: roughly +200-500MB

### Accuracy
- **Simple charts** (bar, line): >85%
- **Complex charts** (multi-axis, combined): >70%
- **Special charts** (radar, scatter): >60%

**Recommendation**: for documents containing many charts, use GPU acceleration for best performance.

---

## 🧪 Testing Chart Recognition

### Quick test

Use the verification script to confirm the feature is available:

```bash
cd /home/egg/project/Tool_OCR
source venv/bin/activate
python backend/verify_chart_recognition.py
```

Expected output:
```
✅ PaddlePaddle version: 3.2.1
📊 API Availability:
  - fused_rms_norm: ✅ Available
  - fused_rms_norm_ext: ✅ Available
🎉 Chart recognition CAN be enabled!
```

### End-to-end test

1. **Start the backend service**:
```bash
cd backend
source venv/bin/activate
python -m app.main
```

2. **Upload a document containing charts**:
   - PDF, Word, PowerPoint, etc.
   - Make sure the document contains charts (bar charts, line charts, etc.)

3. **Check the output**:
   - Verify the parsed result contains chart data
   - Verify the chart type was identified correctly
   - Check whether the extracted data is accurate

---

## 🔍 Technical Details

### The fused_rms_norm_ext API

**RMSNorm (Root Mean Square Layer Normalization)**:
- A layer-normalization technique used in deep learning
- More computationally efficient than LayerNorm
- A core component of the PaddleOCR-VL chart-recognition model

**API signature**:
```python
paddle.incubate.nn.functional.fused_rms_norm_ext(
    x,
    norm_weight,
    norm_bias=None,
    epsilon=1e-5,
    begin_norm_axis=1,
    bias=None,
    residual=None,
    quant_scale=-1,
    quant_round_type=0,
    quant_max_bound=0,
    quant_min_bound=0
)
```

**Difference from the base version**:
- `fused_rms_norm`: base implementation
- `fused_rms_norm_ext`: extended version with additional optimizations and parameters

### Code locations

- **Main enable flag**: [backend/app/services/ocr_service.py:217](backend/app/services/ocr_service.py#L217)
- **CPU fallback**: [backend/app/services/ocr_service.py:235](backend/app/services/ocr_service.py#L235)
- **PP-StructureV3 initialization**: [backend/app/services/ocr_service.py:211-219](backend/app/services/ocr_service.py#L211-L219)

---

## 📚 Related Documentation Updates

The following documents need updating to reflect that chart recognition is enabled:

### Updated
- ✅ `CHART_RECOGNITION.md` - this document
- ✅ `backend/app/services/ocr_service.py` - code implementation

### Pending
- [ ] `README.md` - remove the chart-recognition entry from "Known Limitations"
- [ ] `openspec/changes/add-gpu-acceleration-support/tasks.md` - mark task 5.4 as done
- [ ] `openspec/changes/add-gpu-acceleration-support/proposal.md` - update the "Known Issues" section
- [ ] `openspec/project.md` - document the chart recognition feature

---

## 🆘 Troubleshooting

### Problem: still reported as unavailable after the upgrade

**Diagnosis**:
```bash
python -c "import paddle; print(paddle.__version__)"
python -c "import paddle.incubate.nn.functional as F; print(hasattr(F, 'fused_rms_norm_ext'))"
```

**Fix**:
1. Make sure the virtual environment is activated
2. Fully reinstall PaddlePaddle:
```bash
pip uninstall paddlepaddle -y
pip install 'paddlepaddle>=3.2.0'
```

### Problem: GPU initialization fails

**Error message**: `libcuda.so.1: cannot open shared object file`

**Fix**:
```bash
# Confirm LD_LIBRARY_PATH includes the WSL CUDA path
echo $LD_LIBRARY_PATH | grep wsl

# If not, add it to ~/.bashrc:
echo 'export LD_LIBRARY_PATH=/usr/lib/wsl/lib:$LD_LIBRARY_PATH' >> ~/.bashrc
source ~/.bashrc
```

### Problem: inaccurate chart-recognition results

**Possible causes**:
- Low chart image quality
- Unusual or complex chart types
- Occluded or overlapping text

**Suggestions**:
- Increase the input document's resolution
- Use clean chart styles
- Proofread manually where necessary

---

## 🎉 Summary

**Chart recognition is now fully available!**

| Item | Status |
|------|------|
| API availability | ✅ `fused_rms_norm_ext` ships with PaddlePaddle 3.2.1 |
| Feature status | ✅ Chart recognition enabled |
| GPU support | ✅ CUDA 12.6 + cuDNN 9.5 running normally |
| Test verification | ✅ Verification script confirms availability |
| Documentation | ✅ This document is updated |

**Next steps**:
1. Test with real documents
2. Validate chart-recognition accuracy
3. Update the related README and OpenSpec documents
4. Consider performance tuning

---

**Last updated**: 2025-11-16
**Updated by**: Development Team
**PaddlePaddle version**: 3.2.1
**Feature status**: ✅ Chart recognition enabled
893 FRONTEND_API.md
@@ -1,893 +0,0 @@
# Tool_OCR Frontend API Documentation

> **Version**: 0.1.0
> **Last Updated**: 2025-01-13
> **Purpose**: Complete documentation of frontend architecture, component structure, API integration, and dependencies

---

## Table of Contents

1. [Project Overview](#project-overview)
2. [Technology Stack](#technology-stack)
3. [Component Architecture](#component-architecture)
4. [Page → API Dependency Matrix](#page--api-dependency-matrix)
5. [Component Tree Structure](#component-tree-structure)
6. [State Management Strategy](#state-management-strategy)
7. [Route Configuration](#route-configuration)
8. [API Integration Patterns](#api-integration-patterns)
9. [UI/UX Design System](#uiux-design-system)
10. [Error Handling Patterns](#error-handling-patterns)
11. [Deployment Configuration](#deployment-configuration)

---
## Project Overview

The Tool_OCR frontend is a modern React + Vite OCR document-processing system that provides an enterprise-grade user interface and experience.

### Key Features

- **Batch file upload**: drag-and-drop upload with multi-file batch processing
- **Real-time progress tracking**: polling-based display of OCR processing progress
- **Result preview**: dual-format preview in Markdown and JSON
- **Flexible export**: TXT, JSON, Excel, Markdown, PDF, and ZIP formats
- **Rule management**: customizable export rules and CSS templates
- **Responsive design**: adapted for desktop and tablet devices

---
## Technology Stack

### Core Dependencies

```json
{
  "@tanstack/react-query": "^5.90.7",  // Server state management
  "react": "^19.2.0",                  // UI framework
  "react-dom": "^19.2.0",
  "react-router-dom": "^7.9.5",        // Routing
  "vite": "^7.2.2",                    // Build tool
  "typescript": "~5.9.3"               // Type safety
}
```

### UI & Styling

```json
{
  "tailwindcss": "^4.1.17",              // CSS framework
  "class-variance-authority": "^0.7.0",  // Component variants
  "clsx": "^2.1.1",                      // Class name utility
  "tailwind-merge": "^3.4.0",            // Tailwind class merge
  "lucide-react": "^0.553.0"             // Icon library
}
```

### State & Data

```json
{
  "zustand": "^5.0.8",          // Client state
  "axios": "^1.13.2",           // HTTP client
  "react-dropzone": "^14.3.8",  // File upload
  "react-markdown": "^9.0.1"    // Markdown rendering
}
```

### Internationalization

```json
{
  "i18next": "^25.6.2",
  "react-i18next": "^16.3.0"
}
```

---

## Component Architecture

### Atomic Design Structure

```
frontend/src/
├── components/
│   ├── ui/                    # Atomic components (shadcn/ui)
│   │   ├── button.tsx
│   │   ├── card.tsx
│   │   ├── input.tsx
│   │   ├── label.tsx
│   │   ├── select.tsx
│   │   ├── badge.tsx
│   │   ├── progress.tsx
│   │   ├── alert.tsx
│   │   ├── dialog.tsx
│   │   ├── tabs.tsx
│   │   ├── table.tsx
│   │   └── toast.tsx
│   ├── FileUpload.tsx         # Drag-and-drop upload component
│   ├── ResultsTable.tsx       # OCR results display table
│   ├── MarkdownPreview.tsx    # Markdown content renderer
│   └── Layout.tsx             # Main app layout with sidebar
├── pages/
│   ├── LoginPage.tsx          # Authentication
│   ├── UploadPage.tsx         # File upload and selection
│   ├── ProcessingPage.tsx     # OCR processing status
│   ├── ResultsPage.tsx        # Results viewing and preview
│   ├── ExportPage.tsx         # Export configuration and download
│   └── SettingsPage.tsx       # User settings and rules management
├── store/
│   ├── authStore.ts           # Authentication state (Zustand)
│   └── uploadStore.ts         # Upload batch state (Zustand)
├── services/
│   └── api.ts                 # API client (Axios)
├── types/
│   └── api.ts                 # TypeScript type definitions
├── lib/
│   └── utils.ts               # Utility functions
├── i18n/
│   └── index.ts               # i18n configuration
└── styles/
    └── index.css              # Global styles and CSS variables
```

---
## Page → API Dependency Matrix

| Page/Component | API Endpoints Used | HTTP Method | Purpose | Polling |
|----------------|-------------------|-------------|---------|---------|
| **LoginPage** | `/api/v1/auth/login` | POST | User authentication | No |
| **UploadPage** | `/api/v1/upload` | POST | Upload files for OCR | No |
| **ProcessingPage** | `/api/v1/ocr/process` | POST | Start OCR processing | No |
| | `/api/v1/batch/{batch_id}/status` | GET | Poll batch status | Yes (2s) |
| **ResultsPage** | `/api/v1/batch/{batch_id}/status` | GET | Load completed files | No |
| | `/api/v1/ocr/result/{file_id}` | GET | Get OCR result details | No |
| | `/api/v1/export/pdf/{file_id}` | GET | Download PDF export | No |
| **ExportPage** | `/api/v1/export` | POST | Export batch results | No |
| | `/api/v1/export/rules` | GET | List export rules | No |
| | `/api/v1/export/rules` | POST | Create new rule | No |
| | `/api/v1/export/rules/{rule_id}` | PUT | Update existing rule | No |
| | `/api/v1/export/rules/{rule_id}` | DELETE | Delete rule | No |
| | `/api/v1/export/css-templates` | GET | List CSS templates | No |
| **SettingsPage** | `/api/v1/export/rules` | GET | Manage export rules | No |

---

## Component Tree Structure

```
App
├── Router (React Router)
│   ├── PublicRoute
│   │   └── LoginPage
│   │       ├── Form (username + password)
│   │       ├── Button (submit)
│   │       └── Alert (error display)
│   └── ProtectedRoute (requires authentication)
│       └── Layout
│           ├── Sidebar
│           │   ├── Logo
│           │   ├── Navigation Links
│           │   │   ├── UploadPage link
│           │   │   ├── ProcessingPage link
│           │   │   ├── ResultsPage link
│           │   │   ├── ExportPage link
│           │   │   └── SettingsPage link
│           │   └── User Section + Logout
│           ├── TopBar
│           │   ├── SearchInput
│           │   └── NotificationBell
│           └── MainContent (Outlet)
│               ├── UploadPage
│               │   ├── FileUpload (react-dropzone)
│               │   ├── FileList (selected files)
│               │   └── UploadButton
│               ├── ProcessingPage
│               │   ├── ProgressBar
│               │   ├── StatsCards (completed/processing/failed)
│               │   ├── FileStatusList
│               │   └── ActionButtons
│               ├── ResultsPage
│               │   ├── FileList (left sidebar)
│               │   │   ├── SearchInput
│               │   │   └── FileItems
│               │   └── PreviewPanel (right)
│               │       ├── StatsCards
│               │       ├── Tabs (Markdown/JSON)
│               │       ├── MarkdownPreview
│               │       └── JSONViewer
│               ├── ExportPage
│               │   ├── FormatSelector
│               │   ├── RuleSelector
│               │   ├── CSSTemplateSelector
│               │   ├── OptionsForm
│               │   └── ExportButton
│               └── SettingsPage
│                   ├── UserInfo
│                   ├── ExportRulesManager
│                   │   ├── RuleList
│                   │   ├── CreateRuleDialog
│                   │   ├── EditRuleDialog
│                   │   └── DeleteConfirmDialog
│                   └── SystemSettings
```

---

## State Management Strategy

### Client State (Zustand)

**authStore.ts** - Authentication State
```typescript
interface AuthState {
  user: User | null
  isAuthenticated: boolean
  setUser: (user: User | null) => void
  logout: () => void
}
```

**uploadStore.ts** - Upload Batch State
```typescript
interface UploadState {
  batchId: number | null
  files: FileInfo[]
  uploadProgress: number
  setBatchId: (id: number) => void
  setFiles: (files: FileInfo[]) => void
  setUploadProgress: (progress: number) => void
  reset: () => void
}
```

### Server State (React Query)

- **Caching**: Automatic caching with stale-while-revalidate strategy
- **Polling**: Automatic refetch for batch status every 2 seconds during processing
- **Error Handling**: Built-in error retry and error state management
- **Optimistic Updates**: For export rules CRUD operations (see the sketch after this list)
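A minimal sketch of the optimistic-update pattern for the export-rules list, assuming an `apiClient.updateExportRule()` method and the `['exportRules']` query key documented below; the exact method name and rule shape are illustrative:

```typescript
import { useMutation, useQueryClient } from '@tanstack/react-query'
import { apiClient } from '@/services/api' // assumed export

// Apply a rule edit to the cached list immediately, roll back on error,
// and refetch afterwards to reconcile with the server.
function useUpdateExportRule() {
  const queryClient = useQueryClient()
  return useMutation({
    mutationFn: (rule: { id: number; rule_name: string }) =>
      apiClient.updateExportRule(rule.id, rule),
    onMutate: async (rule) => {
      await queryClient.cancelQueries({ queryKey: ['exportRules'] })
      const previous = queryClient.getQueryData<any[]>(['exportRules'])
      queryClient.setQueryData<any[]>(['exportRules'], (rules = []) =>
        rules.map((r) => (r.id === rule.id ? { ...r, ...rule } : r))
      )
      return { previous } // rollback context
    },
    onError: (_err, _rule, context) => {
      queryClient.setQueryData(['exportRules'], context?.previous)
    },
    onSettled: () => {
      queryClient.invalidateQueries({ queryKey: ['exportRules'] })
    },
  })
}
```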
### Query Keys

```typescript
// Batch status polling
['batchStatus', batchId]

// OCR result for specific file
['ocrResult', fileId]

// Export rules list
['exportRules']

// CSS templates list
['cssTemplates']
```

---

## Route Configuration

| Route | Component | Access Level | Description | Protected |
|-------|-----------|--------------|-------------|-----------|
| `/login` | LoginPage | Public | User authentication | No |
| `/` | Layout (redirect to /upload) | Private | Main layout wrapper | Yes |
| `/upload` | UploadPage | Private | File upload interface | Yes |
| `/processing` | ProcessingPage | Private | OCR processing status | Yes |
| `/results` | ResultsPage | Private | View OCR results | Yes |
| `/export` | ExportPage | Private | Export configuration | Yes |
| `/settings` | SettingsPage | Private | User settings | Yes |

### Protected Route Implementation

```typescript
function ProtectedRoute({ children }: { children: React.ReactNode }) {
  const isAuthenticated = useAuthStore((state) => state.isAuthenticated)

  if (!isAuthenticated) {
    return <Navigate to="/login" replace />
  }

  return <>{children}</>
}
```

---

## API Integration Patterns

### API Client Configuration

**Base URL**: `http://localhost:12010/api/v1`

**Request Interceptor**: Adds JWT token to Authorization header

```typescript
this.client.interceptors.request.use((config) => {
  if (this.token) {
    config.headers.Authorization = `Bearer ${this.token}`
  }
  return config
})
```

**Response Interceptor**: Handles 401 errors and redirects to login

```typescript
this.client.interceptors.response.use(
  (response) => response,
  (error: AxiosError<ApiError>) => {
    if (error.response?.status === 401) {
      this.clearToken()
      window.location.href = '/login'
    }
    return Promise.reject(error)
  }
)
```

### Authentication Flow

```typescript
// 1. Login
const response = await apiClient.login({ username, password })
// Response: { access_token, token_type, expires_in }

// 2. Store token
localStorage.setItem('auth_token', response.access_token)

// 3. Set user in store
setUser({ id: 1, username })

// 4. Navigate to /upload
navigate('/upload')
```

### File Upload Flow

```typescript
// 1. Prepare FormData
const formData = new FormData()
files.forEach((file) => formData.append('files', file))

// 2. Upload files
const response = await apiClient.uploadFiles(files)
// Response: { batch_id, files: FileInfo[] }

// 3. Store batch info
setBatchId(response.batch_id)
setFiles(response.files)

// 4. Navigate to /processing
navigate('/processing')
```

### OCR Processing Flow

```typescript
// 1. Start OCR processing
await apiClient.processOCR({ batch_id, lang: 'ch', detect_layout: true })
// Response: { message, batch_id, total_files, status }

// 2. Poll batch status every 2 seconds
const { data: batchStatus } = useQuery({
  queryKey: ['batchStatus', batchId],
  queryFn: () => apiClient.getBatchStatus(batchId),
  refetchInterval: (query) => {
    const status = query.state.data?.batch.status
    if (status === 'completed' || status === 'failed') return false
    return 2000 // Poll every 2 seconds
  },
})

// 3. Auto-redirect when completed
useEffect(() => {
  if (batchStatus?.batch.status === 'completed') {
    navigate('/results')
  }
}, [batchStatus?.batch.status])
```

### Results Viewing Flow

```typescript
// 1. Load batch status
const { data: batchStatus } = useQuery({
  queryKey: ['batchStatus', batchId],
  queryFn: () => apiClient.getBatchStatus(batchId),
})

// 2. Select a file
setSelectedFileId(fileId)

// 3. Load OCR result for selected file
const { data: ocrResult } = useQuery({
  queryKey: ['ocrResult', selectedFileId],
  queryFn: () => apiClient.getOCRResult(selectedFileId),
  enabled: !!selectedFileId,
})

// 4. Display in Markdown or JSON format
<Tabs>
  <TabsContent value="markdown">
    <ReactMarkdown>{ocrResult.markdown_content}</ReactMarkdown>
  </TabsContent>
  <TabsContent value="json">
    <pre>{JSON.stringify(ocrResult.json_data, null, 2)}</pre>
  </TabsContent>
</Tabs>
```

### Export Flow

```typescript
// 1. Select export format and options
const exportData = {
  batch_id: batchId,
  format: 'pdf',
  rule_id: selectedRuleId,
  css_template: 'academic',
  options: { include_metadata: true }
}

// 2. Request export
const blob = await apiClient.exportResults(exportData)

// 3. Trigger download
downloadBlob(blob, `ocr-results-${batchId}.pdf`)
```
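The `downloadBlob` helper used above is not shown in this document; a minimal sketch of such a helper (an assumption, not the project's actual implementation):

```typescript
// Trigger a browser download for a Blob returned by the export API.
function downloadBlob(blob: Blob, filename: string): void {
  const url = URL.createObjectURL(blob)
  const link = document.createElement('a')
  link.href = url
  link.download = filename
  document.body.appendChild(link)
  link.click()
  link.remove()
  URL.revokeObjectURL(url) // release the object URL once the click fires
}
```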
---

## UI/UX Design System

### Color Palette (CSS Variables)

```css
/* Primary - Professional Blue */
--primary: 217 91% 60%;          /* #3b82f6 */
--primary-foreground: 0 0% 100%;

/* Secondary - Gray-Blue */
--secondary: 220 15% 95%;
--secondary-foreground: 220 15% 25%;

/* Accent - Vibrant Teal */
--accent: 173 80% 50%;
--accent-foreground: 0 0% 100%;

/* Success */
--success: 142 72% 45%;          /* #16a34a */
--success-foreground: 0 0% 100%;

/* Destructive */
--destructive: 0 85% 60%;        /* #ef4444 */
--destructive-foreground: 0 0% 100%;

/* Warning */
--warning: 38 92% 50%;
--warning-foreground: 0 0% 100%;

/* Background */
--background: 220 15% 97%;       /* #fafafa */
--card: 0 0% 100%;               /* #ffffff */
--sidebar: 220 25% 12%;          /* Dark blue-gray */

/* Borders */
--border: 220 13% 88%;
--radius: 0.5rem;
```

### Typography

- **Font Family**: System font stack (native)
- **Page Title**: 1.875rem (30px), font-weight: 700
- **Section Title**: 1.125rem (18px), font-weight: 600
- **Body Text**: 0.875rem (14px), font-weight: 400
- **Small Text**: 0.75rem (12px)

### Spacing Scale

```css
--spacing-xs: 0.25rem;  /* 4px */
--spacing-sm: 0.5rem;   /* 8px */
--spacing-md: 1rem;     /* 16px */
--spacing-lg: 1.5rem;   /* 24px */
--spacing-xl: 2rem;     /* 32px */
```

### Component Variants

**Button Variants**:
- `default`: Primary blue background
- `outline`: Border only
- `secondary`: Muted background
- `destructive`: Red for delete actions
- `ghost`: No background, hover effect

**Alert Variants**:
- `default`: Neutral gray
- `info`: Blue
- `success`: Green
- `warning`: Yellow
- `destructive`: Red

**Badge Variants**:
- `default`: Gray
- `success`: Green
- `warning`: Yellow
- `destructive`: Red
- `secondary`: Muted

### Responsive Breakpoints

```typescript
// Tailwind breakpoints
sm: '640px',   // Mobile landscape
md: '768px',   // Tablet
lg: '1024px',  // Desktop (primary support)
xl: '1280px',  // Large desktop
2xl: '1536px'  // Extra large
```

**Primary Support**: Desktop (>= 1024px)
**Secondary Support**: Tablet (768px - 1023px)
**Optional**: Mobile (< 768px)

---

## Error Handling Patterns

### Global Error Boundary

```typescript
class ErrorBoundary extends Component<Props, State> {
  static getDerivedStateFromError(error: Error): State {
    return { hasError: true, error }
  }

  componentDidCatch(error: Error, errorInfo: ErrorInfo) {
    console.error('Uncaught error:', error, errorInfo)
  }

  render() {
    if (this.state.hasError) {
      return <ErrorFallbackUI error={this.state.error} />
    }
    return this.props.children
  }
}
```

### API Error Handling

```typescript
try {
  await apiClient.uploadFiles(files)
} catch (err: any) {
  const errorDetail = err.response?.data?.detail

  toast({
    title: t('upload.uploadError'),
    description: Array.isArray(errorDetail)
      ? errorDetail.map(e => e.msg || e.message).join(', ')
      : errorDetail || t('errors.networkError'),
    variant: 'destructive',
  })
}
```

### Form Validation

```typescript
// Client-side validation
if (selectedFiles.length === 0) {
  toast({
    title: t('errors.validationError'),
    description: '請選擇至少一個檔案', // "Please select at least one file"
    variant: 'destructive',
  })
  return
}

// Backend validation errors
if (err.response?.status === 422) {
  const errors = err.response.data.detail
  // Display validation errors to user
}
```

### Loading States

```typescript
// Query loading state
const { data, isLoading, error } = useQuery({
  queryKey: ['batchStatus', batchId],
  queryFn: () => apiClient.getBatchStatus(batchId),
})

if (isLoading) return <LoadingSpinner />
if (error) return <ErrorAlert error={error} />
if (!data) return <EmptyState />

// Mutation loading state
const mutation = useMutation({
  mutationFn: apiClient.uploadFiles,
  onSuccess: () => { /* success */ },
  onError: () => { /* error */ },
})

// Button label '上傳' means "Upload"
<Button disabled={mutation.isPending}>
  {mutation.isPending ? <Loader2 className="animate-spin" /> : '上傳'}
</Button>
```

---

## Deployment Configuration

### Environment Variables

```bash
# .env.production
VITE_API_BASE_URL=http://localhost:12010
VITE_APP_NAME=Tool_OCR
VITE_APP_VERSION=0.1.0
```

### Build Configuration

**vite.config.ts**:
```typescript
export default defineConfig({
  plugins: [react()],
  server: {
    port: 12011,
    proxy: {
      '/api': {
        target: 'http://localhost:12010',
        changeOrigin: true,
      },
    },
  },
  build: {
    outDir: 'dist',
    sourcemap: false,
    rollupOptions: {
      output: {
        manualChunks: {
          vendor: ['react', 'react-dom', 'react-router-dom'],
          ui: ['@tanstack/react-query', 'zustand', 'lucide-react'],
        },
      },
    },
  },
})
```

### Build Commands

```bash
# Development
npm run dev

# Production build
npm run build

# Preview production build
npm run preview
```

### Nginx Configuration

```nginx
server {
    listen 80;
    server_name tool-ocr.example.com;
    root /path/to/Tool_OCR/frontend/dist;

    # Frontend static files
    location / {
        try_files $uri $uri/ /index.html;
    }

    # API reverse proxy
    location /api {
        proxy_pass http://127.0.0.1:12010;
        proxy_set_header Host $host;
        proxy_set_header X-Real-IP $remote_addr;
        proxy_set_header X-Forwarded-For $proxy_add_x_forwarded_for;
    }

    # Static assets caching
    location ~* \.(js|css|png|jpg|jpeg|gif|ico|svg|woff|woff2)$ {
        expires 1y;
        add_header Cache-Control "public, immutable";
    }
}
```

---

## Performance Optimization

### Code Splitting

- **Vendor Bundle**: React, React Router, React Query (separate chunk)
- **UI Bundle**: Zustand, Lucide React, UI components
- **Route-based Splitting**: Lazy load pages with `React.lazy()` (see the sketch after this list)
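A minimal sketch of route-based splitting with `React.lazy()`; the page paths follow the structure above, and the wiring assumes each page module provides a default export (the pages may actually use named exports):

```typescript
import { lazy, Suspense } from 'react'
import { Route, Routes } from 'react-router-dom'

// Each lazy page becomes its own chunk, fetched on first navigation.
// React.lazy() requires a default export from the imported module.
const UploadPage = lazy(() => import('@/pages/UploadPage'))
const ResultsPage = lazy(() => import('@/pages/ResultsPage'))

function AppRoutes() {
  return (
    <Suspense fallback={<div>Loading…</div>}>
      <Routes>
        <Route path="/upload" element={<UploadPage />} />
        <Route path="/results" element={<ResultsPage />} />
      </Routes>
    </Suspense>
  )
}
```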
### Caching Strategy

- **React Query Cache**: 5 minutes stale time for most queries
- **Polling Interval**: 2 seconds during OCR processing
- **Infinite Cache**: Export rules (rarely change)

### Asset Optimization

- **Images**: Convert to WebP format, use appropriate sizes
- **Fonts**: System font stack (no custom fonts)
- **Icons**: Lucide React (tree-shakeable)

---

## Testing Strategy

### Component Testing (Planned)

```typescript
// Example: UploadPage.test.tsx
import { render, screen, fireEvent } from '@testing-library/react'
import { UploadPage } from '@/pages/UploadPage'

describe('UploadPage', () => {
  it('should display file upload area', () => {
    render(<UploadPage />)
    // '拖放檔案' matches the "drag and drop files" UI text
    expect(screen.getByText(/拖放檔案/i)).toBeInTheDocument()
  })

  it('should allow file selection', async () => {
    render(<UploadPage />)
    const file = new File(['content'], 'test.pdf', { type: 'application/pdf' })
    // Test file upload
  })
})
```

### API Integration Testing

- **Mock API Responses**: Use MSW (Mock Service Worker); a handler sketch follows this list
- **Error Scenarios**: Test 401, 404, 500 responses
- **Loading States**: Test skeleton/spinner display
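A minimal MSW handler sketch for the batch-status endpoint, assuming MSW v2's `http`/`HttpResponse` API; the response shape follows the contract documented above:

```typescript
import { http, HttpResponse } from 'msw'
import { setupServer } from 'msw/node'

export const server = setupServer(
  // Happy path: canned batch status for any batch ID.
  http.get('/api/v1/batch/:batchId/status', ({ params }) =>
    HttpResponse.json({
      batch: { id: Number(params.batchId), status: 'completed', total_files: 1 },
      files: [],
    })
  ),
  // Error scenario: simulate a rejected login (401).
  http.post('/api/v1/auth/login', () => new HttpResponse(null, { status: 401 }))
)

// Test setup: call server.listen() before all tests,
// server.resetHandlers() after each, and server.close() after all.
```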
---

## Accessibility Standards

### WCAG 2.1 AA Compliance

- **Keyboard Navigation**: All interactive elements accessible via keyboard
- **Focus Indicators**: Visible focus states on all inputs and buttons
- **ARIA Labels**: Proper labels for screen readers
- **Color Contrast**: Minimum 4.5:1 ratio for text
- **Alt Text**: All images have descriptive alt attributes

### Semantic HTML

```typescript
// Use semantic elements
<nav>      // Navigation
<main>     // Main content
<aside>    // Sidebar
<article>  // Independent content
<section>  // Grouped content
```

---

## Browser Compatibility

### Minimum Supported Versions

- **Chrome**: 90+
- **Firefox**: 88+
- **Edge**: 90+
- **Safari**: 14+

### Polyfills Required

- None (modern build target: ES2020)

---

## Development Workflow

### Local Development

```bash
# 1. Install dependencies
npm install

# 2. Start dev server
npm run dev
# Frontend: http://localhost:12011
# API Proxy: http://localhost:12011/api -> http://localhost:12010/api

# 3. Build for production
npm run build

# 4. Preview production build
npm run preview
```

### Code Style

- **Formatter**: Prettier (automatic on save)
- **Linter**: ESLint
- **Type Checking**: TypeScript strict mode

---

## Known Issues & Limitations

### Current Limitations

1. **No Real-time WebSocket**: Uses HTTP polling for progress updates
2. **No Offline Support**: Requires active internet connection
3. **No Mobile Optimization**: Primarily designed for desktop/tablet
4. **Translation Feature Stub**: Planned for Phase 5
5. **File Size Limit**: Frontend validates 50MB per file, backend may differ (a validation sketch follows this list)
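A minimal sketch of the client-side 50MB limit with react-dropzone; `accept`, `maxSize`, and `onDropAccepted` are real react-dropzone options, while the hook name and callback are illustrative:

```typescript
import { useDropzone } from 'react-dropzone'

const MAX_FILE_SIZE = 50 * 1024 * 1024 // 50MB, enforced client-side only

function useOcrDropzone(onFiles: (files: File[]) => void) {
  return useDropzone({
    accept: {
      'image/png': ['.png'],
      'image/jpeg': ['.jpg', '.jpeg'],
      'application/pdf': ['.pdf'],
    },
    maxSize: MAX_FILE_SIZE, // oversized files land in fileRejections
    onDropAccepted: onFiles,
  })
}
```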
### Future Improvements

- [ ] Implement WebSocket for real-time updates
- [ ] Add dark mode toggle
- [ ] Mobile responsive design
- [ ] Implement translation feature
- [ ] Add E2E tests with Playwright
- [ ] PWA support for offline capability

---

## Maintenance & Updates

### Update Checklist

When updating API contracts:
1. Update TypeScript types in `@/types/api.ts`
2. Update API client methods in `@/services/api.ts`
3. Update this documentation (FRONTEND_API.md)
4. Update corresponding page components
5. Test integration thoroughly

### Dependency Updates

```bash
# Check for updates
npm outdated

# Update dependencies
npm update

# Update to latest (breaking changes possible)
npm install <package>@latest
```

---

## Contact & Support

**Frontend Developer**: Claude Code
**Documentation Version**: 0.1.0
**Last Updated**: 2025-01-13

For API questions, refer to:
- `API_REFERENCE.md` - Complete API documentation
- `backend_api.md` - Backend implementation details
- FastAPI Swagger UI: `http://localhost:12010/docs`

---

**End of Documentation**
258 TESTING.md
@@ -1,258 +0,0 @@
# Tool_OCR Testing Guide

## Test Architecture

The project ships with a full test suite, including unit tests and integration tests.

---

## Backend Tests

### Install test dependencies

```bash
cd backend
pip install pytest pytest-cov httpx
```

### Run all tests

```bash
# Run all tests
pytest

# Run with verbose output
pytest -v

# Run and generate a coverage report
pytest --cov=app --cov-report=html
```

### Run specific tests

```bash
# Unit tests only
pytest tests/test_auth.py
pytest tests/test_tasks.py
pytest tests/test_admin.py

# Integration tests only
pytest tests/test_integration.py

# A specific test class
pytest tests/test_tasks.py::TestTasks

# A specific test method
pytest tests/test_tasks.py::TestTasks::test_create_task
```

### Test coverage

**Unit tests** (`tests/test_*.py`):
- `test_auth.py` - authentication endpoint tests
  - login success/failure
  - token validation
  - logout
- `test_tasks.py` - task management tests
  - task CRUD operations
  - user isolation checks
  - statistics
- `test_admin.py` - admin feature tests
  - system statistics
  - user listing
  - audit logs

**Integration tests** (`tests/test_integration.py`):
- full authentication and task flow
- admin workflow
- task lifecycle

---

## Test Database

Tests use an in-memory SQLite database that is cleaned up automatically after each test:
- does not touch the development or production database
- fast to run
- fully isolated

---

## Fixtures

Defined in `conftest.py` (a sketch follows this list):

- `db` - test database session
- `client` - FastAPI test client
- `test_user` - regular test user
- `admin_user` - admin test user
- `auth_token` - auth token for the test user
- `admin_token` - auth token for the admin
- `test_task` - a test task
---
|
||||
|
||||
## 測試範例
|
||||
|
||||
### 編寫新的單元測試
|
||||
|
||||
```python
|
||||
# tests/test_my_feature.py
|
||||
|
||||
import pytest
|
||||
|
||||
|
||||
class TestMyFeature:
|
||||
"""Test my new feature"""
|
||||
|
||||
def test_feature_works(self, client, auth_token):
|
||||
"""Test that feature works correctly"""
|
||||
response = client.get(
|
||||
'/api/v2/my-endpoint',
|
||||
headers={'Authorization': f'Bearer {auth_token}'}
|
||||
)
|
||||
|
||||
assert response.status_code == 200
|
||||
data = response.json()
|
||||
assert 'expected_field' in data
|
||||
```
|
||||
|
||||
### 編寫新的集成測試
|
||||
|
||||
```python
|
||||
# tests/test_integration.py
|
||||
|
||||
class TestIntegration:
|
||||
|
||||
def test_complete_workflow(self, client, db):
|
||||
"""Test complete user workflow"""
|
||||
# Step 1: Login
|
||||
# Step 2: Perform actions
|
||||
# Step 3: Verify results
|
||||
pass
|
||||
```
|
||||
|
||||
---
|
||||
|
||||
## CI/CD 整合
|
||||
|
||||
### GitHub Actions 範例
|
||||
|
||||
```yaml
|
||||
name: Tests
|
||||
|
||||
on: [push, pull_request]
|
||||
|
||||
jobs:
|
||||
test:
|
||||
runs-on: ubuntu-latest
|
||||
|
||||
steps:
|
||||
- uses: actions/checkout@v2
|
||||
|
||||
- name: Set up Python
|
||||
uses: actions/setup-python@v2
|
||||
with:
|
||||
python-version: 3.11
|
||||
|
||||
- name: Install dependencies
|
||||
run: |
|
||||
cd backend
|
||||
pip install -r requirements.txt
|
||||
pip install pytest pytest-cov
|
||||
|
||||
- name: Run tests
|
||||
run: |
|
||||
cd backend
|
||||
pytest --cov=app --cov-report=xml
|
||||
|
||||
- name: Upload coverage
|
||||
uses: codecov/codecov-action@v2
|
||||
```
|
||||
|
||||
---
|
||||
|
||||
## 前端測試 (未來計劃)
|
||||
|
||||
### 建議測試框架
|
||||
- **單元測試**: Vitest
|
||||
- **元件測試**: React Testing Library
|
||||
- **E2E 測試**: Playwright
|
||||
|
||||
### 範例配置
|
||||
|
||||
```bash
|
||||
# 安裝測試依賴
|
||||
npm install --save-dev vitest @testing-library/react @testing-library/jest-dom
|
||||
|
||||
# 運行測試
|
||||
npm test
|
||||
|
||||
# 運行 E2E 測試
|
||||
npm run test:e2e
|
||||
```
|
||||
|
||||
---
|
||||
|
||||
## 測試最佳實踐
|
||||
|
||||
### 1. 測試命名規範
|
||||
- 使用描述性名稱: `test_user_can_create_task`
|
||||
- 遵循 AAA 模式: Arrange, Act, Assert
|
||||
|
||||
### 2. 測試隔離
|
||||
- 每個測試獨立執行
|
||||
- 使用 fixtures 提供測試數據
|
||||
- 不依賴其他測試的狀態
|
||||
|
||||
### 3. Mock 外部服務
|
||||
- Mock 外部 API 呼叫
|
||||
- Mock 檔案系統操作
|
||||
- Mock 第三方服務
|
||||
|
||||
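
As a concrete instance of the first point, the sketch below follows the pattern used in this project's `test_auth.py`: the external authentication service is patched so the test never makes a network call. It assumes the `client` fixture from `conftest.py`; the username/password values are arbitrary test data.

```python
from unittest.mock import patch


def test_login_rejects_bad_credentials(client):
    """Sketch: stub the external auth service instead of calling it."""
    target = 'app.routers.auth.external_auth_service.authenticate_user'
    with patch(target) as mock_auth:
        # (success, auth_response, error_message) mirrors the return shape
        # used by the mocks in this project's test_auth.py
        mock_auth.return_value = (False, None, 'Invalid credentials')

        response = client.post('/api/v2/auth/login', json={
            'username': 'someone@example.com',
            'password': 'wrong',
        })

    assert response.status_code == 401
    mock_auth.assert_called_once()
```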

### 4. Coverage Targets
- Core business logic: >90%
- API endpoints: >80%
- Utility functions: >70%

---

## Troubleshooting

### Common Issues

**Problem**: `ImportError: cannot import name 'XXX'`
**Solution**: Make sure PYTHONPATH is set correctly
```bash
export PYTHONPATH=$PYTHONPATH:$(pwd)
```

**Problem**: Database connection errors
**Solution**: Tests use an in-memory database; no real database connection is required

**Problem**: Token validation failures
**Solution**: Check the JWT secret settings and use the provided test fixtures

---

## Test Reports

Reports produced by a test run:

1. **Terminal output**: Overview of the test results
2. **HTML report**: `htmlcov/index.html` (requires --cov-report=html)
3. **Coverage report**: Highlights untested lines of code

---

## Continuous Improvement

- Run the test suite regularly
- New features must include tests
- Keep test coverage above 80%
- Add a regression test with every bug fix

---

**Last Updated**: 2025-11-16
**Maintainer**: Development Team
File diff suppressed because it is too large
@@ -1,62 +0,0 @@
"""
Test script to verify ReportLab and Chinese font rendering
"""
from reportlab.pdfgen import canvas
from reportlab.pdfbase import pdfmetrics
from reportlab.pdfbase.ttfonts import TTFont
from pathlib import Path
import sys

def test_chinese_rendering():
    """Test if Chinese characters can be rendered in PDF"""

    # Font path
    font_path = "/home/egg/project/Tool_OCR/backend/fonts/NotoSansSC-Regular.ttf"

    # Check if font file exists
    if not Path(font_path).exists():
        print(f"❌ Font file not found: {font_path}")
        return False

    print(f"✓ Font file found: {font_path}")

    try:
        # Register Chinese font
        pdfmetrics.registerFont(TTFont('NotoSansSC', font_path))
        print("✓ Font registered successfully")

        # Create test PDF
        test_pdf = "/tmp/test_chinese.pdf"
        c = canvas.Canvas(test_pdf)

        # Set Chinese font
        c.setFont('NotoSansSC', 14)

        # Draw test text
        c.drawString(100, 750, "測試中文字符渲染 - Test Chinese Character Rendering")
        c.drawString(100, 730, "HTD-S1 技術數據表")
        c.drawString(100, 710, "這是一個 PDF 生成測試")

        c.save()
        print(f"✓ Test PDF created: {test_pdf}")

        # Check file size
        file_size = Path(test_pdf).stat().st_size
        print(f"✓ PDF file size: {file_size} bytes")

        if file_size > 0:
            print("\n✅ Chinese font rendering test PASSED")
            return True
        else:
            print("\n❌ PDF file is empty")
            return False

    except Exception as e:
        print(f"❌ Error during testing: {e}")
        import traceback
        traceback.print_exc()
        return False

if __name__ == "__main__":
    success = test_chinese_rendering()
    sys.exit(0 if success else 1)
@@ -1,286 +0,0 @@
#!/usr/bin/env python3
"""
Tool_OCR - Service Layer Integration Test
Tests core services before API implementation
"""

import sys
import logging
from pathlib import Path
from datetime import datetime

# Add backend to path
sys.path.insert(0, str(Path(__file__).parent))

from app.core.config import settings
from app.core.database import engine, SessionLocal, Base
from app.models.user import User
from app.models.ocr import OCRBatch, OCRFile, OCRResult, FileStatus, BatchStatus
from app.services.preprocessor import DocumentPreprocessor
from app.services.ocr_service import OCRService
from app.services.pdf_generator import PDFGenerator
from app.services.file_manager import FileManager


# Configure logging
logging.basicConfig(
    level=logging.INFO,
    format="%(asctime)s - %(name)s - %(levelname)s - %(message)s"
)
logger = logging.getLogger(__name__)


class ServiceTester:
    """Service layer integration tester"""

    def __init__(self):
        """Initialize tester"""
        self.db = SessionLocal()
        self.preprocessor = DocumentPreprocessor()
        self.ocr_service = OCRService()
        self.pdf_generator = PDFGenerator()
        self.file_manager = FileManager()
        self.test_results = {
            "database": False,
            "preprocessor": False,
            "ocr_engine": False,
            "pdf_generator": False,
            "file_manager": False,
        }

    def cleanup(self):
        """Cleanup resources"""
        self.db.close()

    def test_database_connection(self) -> bool:
        """Test 1: Database connection and models"""
        try:
            logger.info("=" * 80)
            logger.info("TEST 1: Database Connection")
            logger.info("=" * 80)

            # Test connection
            from sqlalchemy import text
            self.db.execute(text("SELECT 1"))
            logger.info("✓ Database connection successful")

            # Check if tables exist
            from sqlalchemy import inspect
            inspector = inspect(engine)
            tables = inspector.get_table_names()

            required_tables = [
                'paddle_ocr_users',
                'paddle_ocr_batches',
                'paddle_ocr_files',
                'paddle_ocr_results',
                'paddle_ocr_export_rules',
                'paddle_ocr_translation_configs'
            ]

            missing_tables = [t for t in required_tables if t not in tables]
            if missing_tables:
                logger.error(f"✗ Missing tables: {missing_tables}")
                return False

            logger.info(f"✓ All required tables exist: {', '.join(required_tables)}")

            # Test creating a test user (will rollback)
            test_user = User(
                username=f"test_user_{datetime.now().timestamp()}",
                email=f"test_{datetime.now().timestamp()}@example.com",
                password_hash="test_hash_123",
                is_active=True,
                is_admin=False
            )
            self.db.add(test_user)
            self.db.flush()
            logger.info(f"✓ Test user created with ID: {test_user.id}")

            self.db.rollback()  # Don't actually save test user
            logger.info("✓ Database test completed successfully\n")

            self.test_results["database"] = True
            return True

        except Exception as e:
            logger.error(f"✗ Database test failed: {e}\n")
            return False

    def test_preprocessor(self) -> bool:
        """Test 2: Document preprocessor"""
        try:
            logger.info("=" * 80)
            logger.info("TEST 2: Document Preprocessor")
            logger.info("=" * 80)

            # Check supported formats
            formats = ['.png', '.jpg', '.jpeg', '.pdf']
            logger.info(f"✓ Supported formats: {formats}")

            # Check max file size
            max_size_mb = settings.max_upload_size / (1024 * 1024)
            logger.info(f"✓ Max upload size: {max_size_mb} MB")

            logger.info("✓ Preprocessor initialized successfully\n")

            self.test_results["preprocessor"] = True
            return True

        except Exception as e:
            logger.error(f"✗ Preprocessor test failed: {e}\n")
            return False

    def test_ocr_engine(self) -> bool:
        """Test 3: OCR engine initialization"""
        try:
            logger.info("=" * 80)
            logger.info("TEST 3: OCR Engine (PaddleOCR)")
            logger.info("=" * 80)

            # Test OCR engine lazy loading
            logger.info("Initializing PaddleOCR engine (this may take a moment)...")
            ocr_engine = self.ocr_service.get_ocr_engine(lang='ch')
            logger.info("✓ PaddleOCR engine initialized for Chinese")

            # Test structure engine
            logger.info("Initializing PP-Structure engine...")
            structure_engine = self.ocr_service.get_structure_engine()
            logger.info("✓ PP-Structure engine initialized")

            # Check confidence threshold
            logger.info(f"✓ Confidence threshold: {self.ocr_service.confidence_threshold}")

            logger.info("✓ OCR engine test completed successfully\n")

            self.test_results["ocr_engine"] = True
            return True

        except Exception as e:
            logger.error(f"✗ OCR engine test failed: {e}")
            logger.error("  Make sure PaddleOCR models are downloaded:")
            logger.error("  - PaddleOCR will auto-download on first use (~900MB)")
            logger.error("  - Requires stable internet connection")
            logger.error("")
            return False

    def test_pdf_generator(self) -> bool:
        """Test 4: PDF generator"""
        try:
            logger.info("=" * 80)
            logger.info("TEST 4: PDF Generator")
            logger.info("=" * 80)

            # Check Pandoc availability
            pandoc_available = self.pdf_generator.check_pandoc_available()
            if pandoc_available:
                logger.info("✓ Pandoc is installed and available")
            else:
                logger.warning("⚠ Pandoc not found - will use WeasyPrint fallback")

            # Check available templates
            templates = self.pdf_generator.get_available_templates()
            logger.info(f"✓ Available CSS templates: {', '.join(templates.keys())}")

            logger.info("✓ PDF generator test completed successfully\n")

            self.test_results["pdf_generator"] = True
            return True

        except Exception as e:
            logger.error(f"✗ PDF generator test failed: {e}\n")
            return False

    def test_file_manager(self) -> bool:
        """Test 5: File manager"""
        try:
            logger.info("=" * 80)
            logger.info("TEST 5: File Manager")
            logger.info("=" * 80)

            # Check upload directory
            upload_dir = Path(settings.upload_dir)
            if upload_dir.exists():
                logger.info(f"✓ Upload directory exists: {upload_dir}")
            else:
                upload_dir.mkdir(parents=True, exist_ok=True)
                logger.info(f"✓ Created upload directory: {upload_dir}")

            # Test batch directory creation
            test_batch_id = 99999  # Use high number to avoid conflicts
            batch_dir = self.file_manager.create_batch_directory(test_batch_id)
            logger.info(f"✓ Created test batch directory: {batch_dir}")

            # Check subdirectories
            subdirs = ["inputs", "outputs/markdown", "outputs/json", "outputs/images", "exports"]
            for subdir in subdirs:
                subdir_path = batch_dir / subdir
                if subdir_path.exists():
                    logger.info(f"  ✓ {subdir}")
                else:
                    logger.error(f"  ✗ Missing: {subdir}")
                    return False

            # Cleanup test directory
            import shutil
            shutil.rmtree(batch_dir.parent, ignore_errors=True)
            logger.info("✓ Cleaned up test batch directory")

            logger.info("✓ File manager test completed successfully\n")

            self.test_results["file_manager"] = True
            return True

        except Exception as e:
            logger.error(f"✗ File manager test failed: {e}\n")
            return False

    def run_all_tests(self):
        """Run all service tests"""
        logger.info("\n" + "=" * 80)
        logger.info("Tool_OCR Service Layer Integration Test")
        logger.info("=" * 80 + "\n")

        try:
            # Run tests in order
            self.test_database_connection()
            self.test_preprocessor()
            self.test_ocr_engine()
            self.test_pdf_generator()
            self.test_file_manager()

            # Print summary
            logger.info("=" * 80)
            logger.info("TEST SUMMARY")
            logger.info("=" * 80)

            total_tests = len(self.test_results)
            passed_tests = sum(1 for result in self.test_results.values() if result)

            for test_name, result in self.test_results.items():
                status = "✓ PASS" if result else "✗ FAIL"
                logger.info(f"{status:8} - {test_name}")

            logger.info("-" * 80)
            logger.info(f"Total: {passed_tests}/{total_tests} tests passed")

            if passed_tests == total_tests:
                logger.info("\n🎉 All service layer tests passed! Ready to implement API endpoints.")
                return 0
            else:
                logger.error(f"\n❌ {total_tests - passed_tests} test(s) failed. Please fix issues before proceeding.")
                return 1

        finally:
            self.cleanup()


def main():
    """Main test entry point"""
    tester = ServiceTester()
    exit_code = tester.run_all_tests()
    sys.exit(exit_code)


if __name__ == "__main__":
    main()
@@ -1,3 +0,0 @@
"""
Tool_OCR - Unit Tests Package
"""
@@ -1,138 +0,0 @@
"""
V2 API Test Configuration and Fixtures
Provides test fixtures for authentication, database, and API testing
"""

import pytest
from fastapi.testclient import TestClient
from sqlalchemy import create_engine
from sqlalchemy.orm import sessionmaker
from sqlalchemy.pool import StaticPool

# IMPORTANT: Monkey patch database module BEFORE importing app
# This prevents the app from connecting to production database
import app.core.database as db_module

# Create a test engine for the entire test session
_test_engine = create_engine(
    "sqlite:///:memory:",
    connect_args={"check_same_thread": False},
    poolclass=StaticPool,
)

# Replace the global engine and SessionLocal
db_module.engine = _test_engine
db_module.SessionLocal = sessionmaker(autocommit=False, autoflush=False, bind=_test_engine)

# Now safely import app (it will use our test database)
from app.main import app
from app.core.database import Base, get_db
from app.core.security import create_access_token
from app.models.user import User
from app.models.task import Task


@pytest.fixture(scope="function")
def engine():
    """Get test database engine and reset tables for each test"""
    Base.metadata.drop_all(bind=_test_engine)
    Base.metadata.create_all(bind=_test_engine)
    yield _test_engine
    # Tables will be dropped at the start of next test


@pytest.fixture(scope="function")
def db(engine):
    """Create test database session"""
    TestingSessionLocal = sessionmaker(autocommit=False, autoflush=False, bind=engine)
    db = TestingSessionLocal()
    try:
        yield db
    finally:
        db.close()


@pytest.fixture(scope="function")
def client(db):
    """Create FastAPI test client with test database"""
    # Override get_db to use the same session as the test
    def override_get_db():
        try:
            yield db
        finally:
            # Don't close the session, it's managed by the db fixture
            pass

    app.dependency_overrides[get_db] = override_get_db
    with TestClient(app) as test_client:
        yield test_client
    app.dependency_overrides.clear()


@pytest.fixture
def test_user(db):
    """Create a test user"""
    # Ensure test_user is always created first by checking if it exists
    user = db.query(User).filter(User.email == "test@example.com").first()
    if not user:
        user = User(
            email="test@example.com",
            display_name="Test User",
            is_active=True
        )
        db.add(user)
        db.commit()
        db.refresh(user)
    return user


@pytest.fixture
def admin_user(db):
    """Create an admin user"""
    user = db.query(User).filter(User.email == "ymirliu@panjit.com.tw").first()
    if not user:
        user = User(
            email="ymirliu@panjit.com.tw",
            display_name="Admin User",
            is_active=True
        )
        db.add(user)
        db.commit()
        db.refresh(user)
    return user


@pytest.fixture
def auth_token(test_user):
    """Create authentication token for test user"""
    token_data = {
        "sub": str(test_user.id),
        "email": test_user.email
    }
    return create_access_token(token_data)


@pytest.fixture
def admin_token(admin_user):
    """Create authentication token for admin user"""
    token_data = {
        "sub": str(admin_user.id),
        "email": admin_user.email
    }
    return create_access_token(token_data)


@pytest.fixture
def test_task(test_user, db):
    """Create a test task (depends on test_user to ensure user exists first)"""
    task = Task(
        user_id=test_user.id,
        task_id="test-task-123",
        filename="test.pdf",
        file_type="application/pdf",
        status="pending"
    )
    db.add(task)
    db.commit()
    db.refresh(task)
    return task
@@ -1,179 +0,0 @@
"""
Tool_OCR - Pytest Fixtures and Configuration
Shared fixtures for all tests
"""

import pytest
import tempfile
import shutil
from pathlib import Path
from PIL import Image
import io

from app.services.preprocessor import DocumentPreprocessor


@pytest.fixture
def temp_dir():
    """Create a temporary directory for test files"""
    temp_path = Path(tempfile.mkdtemp())
    yield temp_path
    # Cleanup after test
    shutil.rmtree(temp_path, ignore_errors=True)


@pytest.fixture
def sample_image_path(temp_dir):
    """Create a valid PNG image file for testing"""
    image_path = temp_dir / "test_image.png"

    # Create a simple 100x100 white image
    img = Image.new('RGB', (100, 100), color='white')
    img.save(image_path, 'PNG')

    return image_path


@pytest.fixture
def sample_jpg_path(temp_dir):
    """Create a valid JPG image file for testing"""
    image_path = temp_dir / "test_image.jpg"

    # Create a simple 100x100 white image
    img = Image.new('RGB', (100, 100), color='white')
    img.save(image_path, 'JPEG')

    return image_path


@pytest.fixture
def sample_pdf_path(temp_dir):
    """Create a valid PDF file for testing"""
    pdf_path = temp_dir / "test_document.pdf"

    # Create minimal valid PDF
    pdf_content = b"""%PDF-1.4
1 0 obj
<<
/Type /Catalog
/Pages 2 0 R
>>
endobj
2 0 obj
<<
/Type /Pages
/Kids [3 0 R]
/Count 1
>>
endobj
3 0 obj
<<
/Type /Page
/Parent 2 0 R
/MediaBox [0 0 612 792]
/Contents 4 0 R
/Resources <<
/Font <<
/F1 <<
/Type /Font
/Subtype /Type1
/BaseFont /Helvetica
>>
>>
>>
>>
endobj
4 0 obj
<<
/Length 44
>>
stream
BT
/F1 12 Tf
100 700 Td
(Test PDF) Tj
ET
endstream
endobj
xref
0 5
0000000000 65535 f
0000000009 00000 n
0000000058 00000 n
0000000115 00000 n
0000000317 00000 n
trailer
<<
/Size 5
/Root 1 0 R
>>
startxref
410
%%EOF
"""

    with open(pdf_path, 'wb') as f:
        f.write(pdf_content)

    return pdf_path


@pytest.fixture
def corrupted_image_path(temp_dir):
    """Create a corrupted image file for testing"""
    image_path = temp_dir / "corrupted.png"

    # Write invalid PNG data
    with open(image_path, 'wb') as f:
        f.write(b'\x89PNG\r\n\x1a\n\x00\x00\x00corrupted data')

    return image_path


@pytest.fixture
def large_file_path(temp_dir):
    """Create a valid PNG file larger than the upload limit"""
    file_path = temp_dir / "large_file.png"

    # Create a large PNG image with random data (to prevent compression)
    # 15000x15000 with random pixels should be > 20MB
    import numpy as np
    random_data = np.random.randint(0, 256, (15000, 15000, 3), dtype=np.uint8)
    img = Image.fromarray(random_data, 'RGB')
    img.save(file_path, 'PNG', compress_level=0)  # No compression

    # Verify it's actually large
    file_size = file_path.stat().st_size
    assert file_size > 20 * 1024 * 1024, f"File only {file_size / (1024*1024):.2f} MB"

    return file_path


@pytest.fixture
def unsupported_file_path(temp_dir):
    """Create a file with unsupported format"""
    file_path = temp_dir / "test.txt"

    with open(file_path, 'w') as f:
        f.write("This is a text file, not an image")

    return file_path


@pytest.fixture
def preprocessor():
    """Create a DocumentPreprocessor instance"""
    return DocumentPreprocessor()


@pytest.fixture
def sample_image_with_text():
    """Return path to a real image with text from demo_docs for OCR testing"""
    # Use the english.png sample from demo_docs
    demo_image_path = Path(__file__).parent.parent.parent / "demo_docs" / "basic" / "english.png"

    # Check if demo image exists, otherwise skip the test
    if not demo_image_path.exists():
        pytest.skip(f"Demo image not found at {demo_image_path}")

    return demo_image_path
@@ -1,60 +0,0 @@
"""
Unit tests for admin endpoints
"""

import pytest


class TestAdmin:
    """Test admin endpoints"""

    def test_get_system_stats(self, client, admin_token):
        """Test get system statistics"""
        response = client.get(
            '/api/v2/admin/stats',
            headers={'Authorization': f'Bearer {admin_token}'}
        )

        assert response.status_code == 200
        data = response.json()
        # API returns nested structure
        assert 'users' in data
        assert 'tasks' in data
        assert 'sessions' in data
        assert 'activity' in data
        assert 'total' in data['users']
        assert 'total' in data['tasks']

    def test_get_system_stats_non_admin(self, client, auth_token):
        """Test that non-admin cannot access admin endpoints"""
        response = client.get(
            '/api/v2/admin/stats',
            headers={'Authorization': f'Bearer {auth_token}'}
        )

        assert response.status_code == 403

    def test_list_users(self, client, admin_token):
        """Test list all users"""
        response = client.get(
            '/api/v2/admin/users',
            headers={'Authorization': f'Bearer {admin_token}'}
        )

        assert response.status_code == 200
        data = response.json()
        assert 'users' in data
        assert 'total' in data

    def test_get_audit_logs(self, client, admin_token):
        """Test get audit logs"""
        response = client.get(
            '/api/v2/admin/audit-logs',
            headers={'Authorization': f'Bearer {admin_token}'}
        )

        assert response.status_code == 200
        data = response.json()
        assert 'logs' in data
        assert 'total' in data
        assert 'page' in data
@@ -1,687 +0,0 @@
"""
Tool_OCR - API Integration Tests
Tests all API endpoints with database integration
"""

import pytest
import tempfile
import shutil
from pathlib import Path
from io import BytesIO
from datetime import datetime
from unittest.mock import patch, Mock

from fastapi.testclient import TestClient
from sqlalchemy import create_engine
from sqlalchemy.orm import sessionmaker
from PIL import Image

from app.main import app
from app.core.database import Base
from app.core.deps import get_db, get_current_active_user
from app.core.security import create_access_token, get_password_hash
from app.models.user import User
from app.models.ocr import OCRBatch, OCRFile, OCRResult, BatchStatus, FileStatus
from app.models.export import ExportRule


# ============================================================================
# Test Database Setup
# ============================================================================

@pytest.fixture(scope="function")
def test_db():
    """Create test database using SQLite in-memory"""
    # Import all models to ensure they are registered with Base.metadata
    # This triggers SQLAlchemy to register table definitions
    from app.models import User, OCRBatch, OCRFile, OCRResult, ExportRule, TranslationConfig

    # Create in-memory SQLite database
    engine = create_engine("sqlite:///:memory:", connect_args={"check_same_thread": False})
    TestingSessionLocal = sessionmaker(autocommit=False, autoflush=False, bind=engine)

    # Create all tables
    Base.metadata.create_all(bind=engine)

    db = TestingSessionLocal()
    try:
        yield db
    finally:
        db.close()
        Base.metadata.drop_all(bind=engine)


@pytest.fixture(scope="function")
def test_user(test_db):
    """Create test user in database"""
    user = User(
        username="testuser",
        email="test@example.com",
        password_hash=get_password_hash("password123"),
        is_active=True,
        is_admin=False
    )
    test_db.add(user)
    test_db.commit()
    test_db.refresh(user)
    return user


@pytest.fixture(scope="function")
def inactive_user(test_db):
    """Create inactive test user"""
    user = User(
        username="inactive",
        email="inactive@example.com",
        password_hash=get_password_hash("password123"),
        is_active=False,
        is_admin=False
    )
    test_db.add(user)
    test_db.commit()
    test_db.refresh(user)
    return user


@pytest.fixture(scope="function")
def auth_token(test_user):
    """Generate JWT token for test user"""
    token = create_access_token(data={"sub": test_user.id, "username": test_user.username})
    return token


@pytest.fixture(scope="function")
def auth_headers(auth_token):
    """Generate authorization headers"""
    return {"Authorization": f"Bearer {auth_token}"}


# ============================================================================
# Test Client Setup
# ============================================================================

@pytest.fixture(scope="function")
def client(test_db, test_user):
    """Create FastAPI test client with overridden dependencies"""

    def override_get_db():
        try:
            yield test_db
        finally:
            pass

    def override_get_current_active_user():
        return test_user

    app.dependency_overrides[get_db] = override_get_db
    app.dependency_overrides[get_current_active_user] = override_get_current_active_user

    client = TestClient(app)
    yield client

    # Clean up overrides
    app.dependency_overrides.clear()


# ============================================================================
# Test Data Fixtures
# ============================================================================

@pytest.fixture
def temp_upload_dir():
    """Create temporary upload directory"""
    temp_dir = Path(tempfile.mkdtemp())
    yield temp_dir
    shutil.rmtree(temp_dir, ignore_errors=True)


@pytest.fixture
def sample_image_file():
    """Create sample image file for upload"""
    img = Image.new('RGB', (100, 100), color='white')
    img_bytes = BytesIO()
    img.save(img_bytes, format='PNG')
    img_bytes.seek(0)
    return ("test.png", img_bytes, "image/png")


@pytest.fixture
def test_batch(test_db, test_user):
    """Create test batch in database"""
    batch = OCRBatch(
        user_id=test_user.id,
        batch_name="Test Batch",
        status=BatchStatus.PENDING,
        total_files=0,
        completed_files=0,
        failed_files=0
    )
    test_db.add(batch)
    test_db.commit()
    test_db.refresh(batch)
    return batch


@pytest.fixture
def test_ocr_file(test_db, test_batch):
    """Create test OCR file in database"""
    ocr_file = OCRFile(
        batch_id=test_batch.id,
        filename="test.png",
        original_filename="test.png",
        file_path="/tmp/test.png",
        file_size=1024,
        file_format="png",
        status=FileStatus.COMPLETED
    )
    test_db.add(ocr_file)
    test_db.commit()
    test_db.refresh(ocr_file)
    return ocr_file


@pytest.fixture
def test_ocr_result(test_db, test_ocr_file, temp_upload_dir):
    """Create test OCR result in database"""
    # Create test markdown file
    markdown_path = temp_upload_dir / "result.md"
    markdown_path.write_text("# Test Result\n\nTest content", encoding="utf-8")

    result = OCRResult(
        file_id=test_ocr_file.id,
        markdown_path=str(markdown_path),
        json_path=str(temp_upload_dir / "result.json"),
        detected_language="ch",
        total_text_regions=5,
        average_confidence=0.95,
        layout_data={"regions": []},
        images_metadata=[]
    )
    test_db.add(result)
    test_db.commit()
    test_db.refresh(result)
    return result


@pytest.fixture
def test_export_rule(test_db, test_user):
    """Create test export rule in database"""
    rule = ExportRule(
        user_id=test_user.id,
        rule_name="Test Rule",
        description="Test export rule",
        config_json={
            "filters": {"confidence_threshold": 0.8},
            "formatting": {"add_line_numbers": True}
        }
    )
    test_db.add(rule)
    test_db.commit()
    test_db.refresh(rule)
    return rule


# ============================================================================
# Authentication Router Tests
# ============================================================================

@pytest.mark.integration
class TestAuthRouter:
    """Test authentication endpoints"""

    def test_login_success(self, client, test_user):
        """Test successful login"""
        response = client.post(
            "/api/v1/auth/login",
            json={
                "username": "testuser",
                "password": "password123"
            }
        )

        assert response.status_code == 200
        data = response.json()
        assert "access_token" in data
        assert data["token_type"] == "bearer"
        assert "expires_in" in data
        assert data["expires_in"] > 0

    def test_login_invalid_username(self, client):
        """Test login with invalid username"""
        response = client.post(
            "/api/v1/auth/login",
            json={
                "username": "nonexistent",
                "password": "password123"
            }
        )

        assert response.status_code == 401
        assert "Incorrect username or password" in response.json()["detail"]

    def test_login_invalid_password(self, client, test_user):
        """Test login with invalid password"""
        response = client.post(
            "/api/v1/auth/login",
            json={
                "username": "testuser",
                "password": "wrongpassword"
            }
        )

        assert response.status_code == 401
        assert "Incorrect username or password" in response.json()["detail"]

    def test_login_inactive_user(self, client, inactive_user):
        """Test login with inactive user account"""
        response = client.post(
            "/api/v1/auth/login",
            json={
                "username": "inactive",
                "password": "password123"
            }
        )

        assert response.status_code == 403
        assert "inactive" in response.json()["detail"].lower()


# ============================================================================
# OCR Router Tests
# ============================================================================

@pytest.mark.integration
class TestOCRRouter:
    """Test OCR processing endpoints"""

    @patch('app.services.file_manager.FileManager.create_batch')
    @patch('app.services.file_manager.FileManager.add_files_to_batch')
    def test_upload_files_success(self, mock_add_files, mock_create_batch,
                                  client, auth_headers, test_batch, sample_image_file):
        """Test successful file upload"""
        # Mock the file manager methods
        mock_create_batch.return_value = test_batch
        mock_add_files.return_value = []

        response = client.post(
            "/api/v1/upload",
            files={"files": sample_image_file},
            data={"batch_name": "Test Upload"},
            headers=auth_headers
        )

        assert response.status_code == 200
        data = response.json()
        assert "id" in data
        assert data["batch_name"] == "Test Batch"

    def test_upload_no_files(self, client, auth_headers):
        """Test upload with no files"""
        response = client.post(
            "/api/v1/upload",
            headers=auth_headers
        )

        assert response.status_code == 422  # Validation error

    def test_upload_unauthorized(self, client, sample_image_file):
        """Test upload without authentication"""
        # Override to remove authentication
        app.dependency_overrides.clear()

        response = client.post(
            "/api/v1/upload",
            files={"files": sample_image_file}
        )

        assert response.status_code == 403  # Forbidden (no auth)

    @patch('app.services.background_tasks.process_batch_files_with_retry')
    def test_process_ocr_success(self, mock_process, client, auth_headers,
                                 test_batch, test_db):
        """Test triggering OCR processing"""
        response = client.post(
            "/api/v1/ocr/process",
            json={
                "batch_id": test_batch.id,
                "lang": "ch",
                "detect_layout": True
            },
            headers=auth_headers
        )

        assert response.status_code == 200
        data = response.json()
        assert data["message"] == "OCR processing started"
        assert data["batch_id"] == test_batch.id
        assert data["status"] == "processing"

    def test_process_ocr_batch_not_found(self, client, auth_headers):
        """Test OCR processing with non-existent batch"""
        response = client.post(
            "/api/v1/ocr/process",
            json={
                "batch_id": 99999,
                "lang": "ch",
                "detect_layout": True
            },
            headers=auth_headers
        )

        assert response.status_code == 404
        assert "not found" in response.json()["detail"].lower()

    def test_process_ocr_already_processing(self, client, auth_headers,
                                            test_batch, test_db):
        """Test OCR processing when batch is already processing"""
        # Update batch status
        test_batch.status = BatchStatus.PROCESSING
        test_db.commit()

        response = client.post(
            "/api/v1/ocr/process",
            json={
                "batch_id": test_batch.id,
                "lang": "ch",
                "detect_layout": True
            },
            headers=auth_headers
        )

        assert response.status_code == 400
        assert "already" in response.json()["detail"].lower()

    def test_get_batch_status_success(self, client, auth_headers, test_batch,
                                      test_ocr_file):
        """Test getting batch status"""
        response = client.get(
            f"/api/v1/batch/{test_batch.id}/status",
            headers=auth_headers
        )

        assert response.status_code == 200
        data = response.json()
        assert "batch" in data
        assert "files" in data
        assert data["batch"]["id"] == test_batch.id
        assert len(data["files"]) >= 0

    def test_get_batch_status_not_found(self, client, auth_headers):
        """Test getting status for non-existent batch"""
        response = client.get(
            "/api/v1/batch/99999/status",
            headers=auth_headers
        )

        assert response.status_code == 404

    def test_get_ocr_result_success(self, client, auth_headers, test_ocr_file,
                                    test_ocr_result):
        """Test getting OCR result"""
        response = client.get(
            f"/api/v1/ocr/result/{test_ocr_file.id}",
            headers=auth_headers
        )

        assert response.status_code == 200
        data = response.json()
        assert "file" in data
        assert "result" in data
        assert data["file"]["id"] == test_ocr_file.id

    def test_get_ocr_result_not_found(self, client, auth_headers):
        """Test getting result for non-existent file"""
        response = client.get(
            "/api/v1/ocr/result/99999",
            headers=auth_headers
        )

        assert response.status_code == 404


# ============================================================================
# Export Router Tests
# ============================================================================

@pytest.mark.integration
class TestExportRouter:
    """Test export endpoints"""

    @pytest.mark.skip(reason="FileResponse validation requires actual file paths, tested in unit tests")
    @patch('app.services.export_service.ExportService.export_to_txt')
    def test_export_txt_success(self, mock_export, client, auth_headers,
                                test_batch, test_ocr_file, test_ocr_result,
                                temp_upload_dir):
        """Test exporting results to TXT format"""
        # NOTE: This test is skipped because FastAPI's FileResponse validates
        # the file path exists, making it difficult to mock properly.
        # The export service functionality is thoroughly tested in unit tests.
        # End-to-end tests would be more appropriate for testing the full flow.
        pass

    def test_export_batch_not_found(self, client, auth_headers):
        """Test export with non-existent batch"""
        response = client.post(
            "/api/v1/export",
            json={
                "batch_id": 99999,
                "format": "txt"
            },
            headers=auth_headers
        )

        assert response.status_code == 404

    def test_export_no_results(self, client, auth_headers, test_batch):
        """Test export when no completed results exist"""
        response = client.post(
            "/api/v1/export",
            json={
                "batch_id": test_batch.id,
                "format": "txt"
            },
            headers=auth_headers
        )

        assert response.status_code == 404
        assert "no completed results" in response.json()["detail"].lower()

    def test_export_unsupported_format(self, client, auth_headers, test_batch):
        """Test export with unsupported format"""
        response = client.post(
            "/api/v1/export",
            json={
                "batch_id": test_batch.id,
                "format": "invalid_format"
            },
            headers=auth_headers
        )

        # Should fail at validation or business logic level
        assert response.status_code in [400, 404]

    @pytest.mark.skip(reason="FileResponse validation requires actual file paths, tested in unit tests")
    @patch('app.services.export_service.ExportService.export_to_pdf')
    def test_generate_pdf_success(self, mock_export, client, auth_headers,
                                  test_ocr_file, test_ocr_result, temp_upload_dir):
        """Test generating PDF for single file"""
        # NOTE: This test is skipped because FastAPI's FileResponse validates
        # the file path exists, making it difficult to mock properly.
        # The PDF generation functionality is thoroughly tested in unit tests.
        pass

    def test_generate_pdf_file_not_found(self, client, auth_headers):
        """Test PDF generation for non-existent file"""
        response = client.get(
            "/api/v1/export/pdf/99999",
            headers=auth_headers
        )

        assert response.status_code == 404

    def test_generate_pdf_no_result(self, client, auth_headers, test_ocr_file):
        """Test PDF generation when no OCR result exists"""
        response = client.get(
            f"/api/v1/export/pdf/{test_ocr_file.id}",
            headers=auth_headers
        )

        assert response.status_code == 404

    def test_list_export_rules(self, client, auth_headers, test_export_rule):
        """Test listing export rules"""
        response = client.get(
            "/api/v1/export/rules",
            headers=auth_headers
        )

        assert response.status_code == 200
        data = response.json()
        assert isinstance(data, list)
        assert len(data) >= 0

    @pytest.mark.skip(reason="SQLite session isolation issue with in-memory DB, tested in unit tests")
    def test_create_export_rule(self, client, auth_headers):
        """Test creating export rule"""
        # NOTE: This test fails due to SQLite in-memory database session isolation
        # The create operation works but db.refresh() fails to query the new record
        # Export rule CRUD is thoroughly tested in unit tests
        pass

    @pytest.mark.skip(reason="SQLite session isolation issue with in-memory DB, tested in unit tests")
    def test_update_export_rule(self, client, auth_headers, test_export_rule):
        """Test updating export rule"""
        # NOTE: This test fails due to SQLite in-memory database session isolation
        # The update operation works but db.refresh() fails to query the updated record
        # Export rule CRUD is thoroughly tested in unit tests
        pass

    def test_update_export_rule_not_found(self, client, auth_headers):
        """Test updating non-existent export rule"""
        response = client.put(
            "/api/v1/export/rules/99999",
            json={
                "rule_name": "Updated Rule"
            },
            headers=auth_headers
        )

        assert response.status_code == 404

    def test_delete_export_rule(self, client, auth_headers, test_export_rule):
        """Test deleting export rule"""
        response = client.delete(
            f"/api/v1/export/rules/{test_export_rule.id}",
            headers=auth_headers
        )

        assert response.status_code == 200
        assert "deleted successfully" in response.json()["message"].lower()

    def test_delete_export_rule_not_found(self, client, auth_headers):
        """Test deleting non-existent export rule"""
        response = client.delete(
            "/api/v1/export/rules/99999",
            headers=auth_headers
        )

        assert response.status_code == 404

    def test_list_css_templates(self, client):
        """Test listing CSS templates (no auth required)"""
        response = client.get("/api/v1/export/css-templates")

        assert response.status_code == 200
        data = response.json()
        assert isinstance(data, list)
        assert len(data) > 0
        assert all("name" in item and "description" in item for item in data)


# ============================================================================
# Translation Router Tests (Stub Endpoints)
# ============================================================================

@pytest.mark.integration
class TestTranslationRouter:
    """Test translation stub endpoints"""

    def test_get_translation_status(self, client):
        """Test getting translation feature status (stub)"""
        response = client.get("/api/v1/translate/status")

        assert response.status_code == 200
        data = response.json()
        assert "status" in data
        assert data["status"].lower() == "reserved"  # Case-insensitive check

    def test_get_supported_languages(self, client):
        """Test getting supported languages (stub)"""
        response = client.get("/api/v1/translate/languages")

        assert response.status_code == 200
        data = response.json()
        assert isinstance(data, list)

    def test_translate_document_not_implemented(self, client, auth_headers):
        """Test translate document endpoint returns 501"""
        response = client.post(
            "/api/v1/translate/document",
            json={
                "file_id": 1,
                "source_lang": "zh",
                "target_lang": "en",
                "engine_type": "offline"
            },
            headers=auth_headers
        )

        assert response.status_code == 501
        data = response.json()
        assert "not implemented" in str(data["detail"]).lower()

    def test_get_translation_task_status_not_implemented(self, client, auth_headers):
        """Test translation task status endpoint returns 501"""
        response = client.get(
            "/api/v1/translate/task/1",
            headers=auth_headers
        )

        assert response.status_code == 501

    def test_cancel_translation_task_not_implemented(self, client, auth_headers):
        """Test cancel translation task endpoint returns 501"""
        response = client.delete(
            "/api/v1/translate/task/1",
            headers=auth_headers
        )

        assert response.status_code == 501


# ============================================================================
# Application Health Tests
# ============================================================================

@pytest.mark.integration
class TestApplicationHealth:
    """Test application health and root endpoints"""

    def test_health_check(self, client):
        """Test health check endpoint"""
        response = client.get("/health")

        assert response.status_code == 200
        data = response.json()
        assert data["status"] == "healthy"
        assert data["service"] == "Tool_OCR"

    def test_root_endpoint(self, client):
        """Test root endpoint"""
        response = client.get("/")

        assert response.status_code == 200
        data = response.json()
        assert "message" in data
        assert "Tool_OCR" in data["message"]
        assert "docs_url" in data
@@ -1,87 +0,0 @@
"""
Unit tests for authentication endpoints
"""

import pytest
from unittest.mock import patch, MagicMock


class TestAuth:
    """Test authentication endpoints"""

    def test_login_success(self, client, db):
        """Test successful login"""
        # Mock external auth service with proper Pydantic models
        from app.services.external_auth_service import AuthResponse, UserInfo

        user_info = UserInfo(
            id="test-id-123",
            name="Test User",
            email="test@example.com"
        )
        auth_response = AuthResponse(
            access_token="test-token",
            id_token="test-id-token",
            expires_in=3600,
            token_type="Bearer",
            user_info=user_info,
            issued_at="2025-11-16T10:00:00Z",
            expires_at="2025-11-16T11:00:00Z"
        )

        with patch('app.routers.auth.external_auth_service.authenticate_user') as mock_auth:
            mock_auth.return_value = (True, auth_response, None)

            response = client.post('/api/v2/auth/login', json={
                'username': 'test@example.com',
                'password': 'password123'
            })

            assert response.status_code == 200
            data = response.json()
            assert 'access_token' in data
            assert data['token_type'] == 'bearer'
            assert 'user' in data

    def test_login_invalid_credentials(self, client):
        """Test login with invalid credentials"""
        with patch('app.routers.auth.external_auth_service.authenticate_user') as mock_auth:
            mock_auth.return_value = (False, None, 'Invalid credentials')

            response = client.post('/api/v2/auth/login', json={
                'username': 'test@example.com',
                'password': 'wrongpassword'
            })

            assert response.status_code == 401
            assert 'detail' in response.json()

    def test_get_me(self, client, auth_token):
        """Test get current user info"""
        response = client.get(
            '/api/v2/auth/me',
            headers={'Authorization': f'Bearer {auth_token}'}
        )

        assert response.status_code == 200
        data = response.json()
        assert 'email' in data
        assert 'display_name' in data

    def test_get_me_unauthorized(self, client):
        """Test get current user without token"""
        response = client.get('/api/v2/auth/me')
        assert response.status_code == 403

    def test_logout(self, client, auth_token):
        """Test logout"""
        response = client.post(
            '/api/v2/auth/logout',
            headers={'Authorization': f'Bearer {auth_token}'}
        )

        assert response.status_code == 200
        data = response.json()
        # When no session_id is provided, logs out all sessions
        assert 'message' in data
        assert 'Logged out' in data['message']
@@ -1,637 +0,0 @@
|
||||
"""
|
||||
Tool_OCR - Export Service Unit Tests
|
||||
Tests for app/services/export_service.py
|
||||
"""
|
||||
|
||||
import pytest
|
||||
import json
|
||||
import zipfile
|
||||
from pathlib import Path
|
||||
from unittest.mock import Mock, patch, MagicMock
|
||||
from datetime import datetime
|
||||
|
||||
import pandas as pd
|
||||
|
||||
from app.services.export_service import ExportService, ExportError
|
||||
from app.models.ocr import FileStatus
|
||||
|
||||
|
||||
@pytest.fixture
|
||||
def export_service():
|
||||
"""Create an ExportService instance"""
|
||||
return ExportService()
|
||||
|
||||
|
||||
@pytest.fixture
|
||||
def mock_ocr_result(temp_dir):
|
||||
"""Create a mock OCRResult with markdown file"""
|
||||
# Create mock markdown file
|
||||
md_file = temp_dir / "test_result.md"
|
||||
md_file.write_text("# Test Document\n\nThis is test content.", encoding="utf-8")
|
||||
|
||||
# Create mock result
|
||||
result = Mock()
|
||||
result.id = 1
|
||||
result.markdown_path = str(md_file)
|
||||
result.json_path = None
|
||||
result.detected_language = "zh"
|
||||
result.total_text_regions = 10
|
||||
result.average_confidence = 0.95
|
||||
result.layout_data = {"elements": [{"type": "text"}]}
|
||||
result.images_metadata = []
|
||||
|
||||
# Mock file
|
||||
result.file = Mock()
|
||||
result.file.id = 1
|
    result.file.original_filename = "test.png"
    result.file.file_format = "png"
    result.file.file_size = 1024
    result.file.processing_time = 2.5

    return result


@pytest.fixture
def mock_db():
    """Create a mock database session"""
    return Mock()


@pytest.mark.unit
class TestExportServiceInit:
    """Test ExportService initialization"""

    def test_init(self, export_service):
        """Test export service initialization"""
        assert export_service is not None
        assert export_service.pdf_generator is not None


@pytest.mark.unit
class TestApplyFilters:
    """Test filter application"""

    def test_apply_filters_confidence_threshold(self, export_service):
        """Test confidence threshold filter"""
        result1 = Mock()
        result1.average_confidence = 0.95
        result1.file = Mock()
        result1.file.original_filename = "test1.png"

        result2 = Mock()
        result2.average_confidence = 0.75
        result2.file = Mock()
        result2.file.original_filename = "test2.png"

        result3 = Mock()
        result3.average_confidence = 0.85
        result3.file = Mock()
        result3.file.original_filename = "test3.png"

        results = [result1, result2, result3]
        filters = {"confidence_threshold": 0.80}

        filtered = export_service.apply_filters(results, filters)

        assert len(filtered) == 2
        assert result1 in filtered
        assert result3 in filtered
        assert result2 not in filtered

    def test_apply_filters_filename_pattern(self, export_service):
        """Test filename pattern filter"""
        result1 = Mock()
        result1.average_confidence = 0.95
        result1.file = Mock()
        result1.file.original_filename = "invoice_2024.png"

        result2 = Mock()
        result2.average_confidence = 0.95
        result2.file = Mock()
        result2.file.original_filename = "receipt.png"

        results = [result1, result2]
        filters = {"filename_pattern": "invoice"}

        filtered = export_service.apply_filters(results, filters)

        assert len(filtered) == 1
        assert result1 in filtered

    def test_apply_filters_language(self, export_service):
        """Test language filter"""
        result1 = Mock()
        result1.detected_language = "zh"
        result1.average_confidence = 0.95
        result1.file = Mock()
        result1.file.original_filename = "chinese.png"

        result2 = Mock()
        result2.detected_language = "en"
        result2.average_confidence = 0.95
        result2.file = Mock()
        result2.file.original_filename = "english.png"

        results = [result1, result2]
        filters = {"language": "zh"}

        filtered = export_service.apply_filters(results, filters)

        assert len(filtered) == 1
        assert result1 in filtered

    def test_apply_filters_combined(self, export_service):
        """Test multiple filters combined"""
        result1 = Mock()
        result1.detected_language = "zh"
        result1.average_confidence = 0.95
        result1.file = Mock()
        result1.file.original_filename = "invoice_chinese.png"

        result2 = Mock()
        result2.detected_language = "zh"
        result2.average_confidence = 0.75
        result2.file = Mock()
        result2.file.original_filename = "invoice_low.png"

        result3 = Mock()
        result3.detected_language = "en"
        result3.average_confidence = 0.95
        result3.file = Mock()
        result3.file.original_filename = "invoice_english.png"

        results = [result1, result2, result3]
        filters = {
            "confidence_threshold": 0.80,
            "language": "zh",
            "filename_pattern": "invoice"
        }

        filtered = export_service.apply_filters(results, filters)

        assert len(filtered) == 1
        assert result1 in filtered

    def test_apply_filters_no_filters(self, export_service):
        """Test with no filters applied"""
        results = [Mock(), Mock(), Mock()]
        filtered = export_service.apply_filters(results, {})

        assert len(filtered) == len(results)
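

# The five tests above pin down the filter semantics: the confidence threshold
# is inclusive (>=), a None confidence fails the threshold (see TestEdgeCases
# at the end of this file), filename matching is a substring check, and an
# empty filter dict passes everything through. A hypothetical reference sketch
# consistent with these assertions (the real code lives in
# app/services/export_service.py and may differ):
def _apply_filters_sketch(results, filters):
    """Hypothetical filter behaviour implied by TestApplyFilters."""
    threshold = filters.get("confidence_threshold")
    language = filters.get("language")
    pattern = filters.get("filename_pattern")
    filtered = []
    for result in results:
        if threshold is not None and (
            result.average_confidence is None
            or result.average_confidence < threshold
        ):
            continue  # missing or low confidence fails the threshold
        if language is not None and result.detected_language != language:
            continue
        if pattern is not None and pattern not in result.file.original_filename:
            continue
        filtered.append(result)
    return filtered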


@pytest.mark.unit
class TestExportToTXT:
    """Test TXT export"""

    def test_export_to_txt_basic(self, export_service, mock_ocr_result, temp_dir):
        """Test basic TXT export"""
        output_path = temp_dir / "output.txt"

        result_path = export_service.export_to_txt([mock_ocr_result], output_path)

        assert result_path.exists()
        content = result_path.read_text(encoding="utf-8")
        assert "Test Document" in content
        assert "test content" in content

    def test_export_to_txt_with_line_numbers(self, export_service, mock_ocr_result, temp_dir):
        """Test TXT export with line numbers"""
        output_path = temp_dir / "output.txt"
        formatting = {"add_line_numbers": True}

        result_path = export_service.export_to_txt(
            [mock_ocr_result],
            output_path,
            formatting=formatting
        )

        content = result_path.read_text(encoding="utf-8")
        assert "|" in content  # Line number separator

    def test_export_to_txt_with_metadata(self, export_service, mock_ocr_result, temp_dir):
        """Test TXT export with metadata headers"""
        output_path = temp_dir / "output.txt"
        formatting = {"include_metadata": True}

        result_path = export_service.export_to_txt(
            [mock_ocr_result],
            output_path,
            formatting=formatting
        )

        content = result_path.read_text(encoding="utf-8")
        assert "文件:" in content
        assert "test.png" in content
        assert "信心度:" in content

    def test_export_to_txt_with_grouping(self, export_service, mock_ocr_result, temp_dir):
        """Test TXT export with file grouping"""
        output_path = temp_dir / "output.txt"
        formatting = {"group_by_filename": True}

        result_path = export_service.export_to_txt(
            [mock_ocr_result, mock_ocr_result],
            output_path,
            formatting=formatting
        )

        content = result_path.read_text(encoding="utf-8")
        assert "-" * 80 in content  # Separator

    def test_export_to_txt_missing_markdown(self, export_service, temp_dir):
        """Test TXT export with missing markdown file"""
        result = Mock()
        result.id = 1
        result.markdown_path = "/nonexistent/path.md"
        result.file = Mock()
        result.file.original_filename = "test.png"

        output_path = temp_dir / "output.txt"

        # Should not fail, just skip the file
        result_path = export_service.export_to_txt([result], output_path)
        assert result_path.exists()

    def test_export_to_txt_creates_parent_directories(self, export_service, mock_ocr_result, temp_dir):
        """Test that export creates necessary parent directories"""
        output_path = temp_dir / "subdir" / "output.txt"

        result_path = export_service.export_to_txt([mock_ocr_result], output_path)

        assert result_path.exists()
        assert result_path.parent.exists()


@pytest.mark.unit
class TestExportToJSON:
    """Test JSON export"""

    def test_export_to_json_basic(self, export_service, mock_ocr_result, temp_dir):
        """Test basic JSON export"""
        output_path = temp_dir / "output.json"

        result_path = export_service.export_to_json([mock_ocr_result], output_path)

        assert result_path.exists()
        data = json.loads(result_path.read_text(encoding="utf-8"))

        assert "export_time" in data
        assert data["total_files"] == 1
        assert len(data["results"]) == 1
        assert data["results"][0]["filename"] == "test.png"
        assert data["results"][0]["average_confidence"] == 0.95

    def test_export_to_json_with_layout(self, export_service, mock_ocr_result, temp_dir):
        """Test JSON export with layout data"""
        output_path = temp_dir / "output.json"

        result_path = export_service.export_to_json(
            [mock_ocr_result],
            output_path,
            include_layout=True
        )

        data = json.loads(result_path.read_text(encoding="utf-8"))
        assert "layout_data" in data["results"][0]

    def test_export_to_json_without_layout(self, export_service, mock_ocr_result, temp_dir):
        """Test JSON export without layout data"""
        output_path = temp_dir / "output.json"

        result_path = export_service.export_to_json(
            [mock_ocr_result],
            output_path,
            include_layout=False
        )

        data = json.loads(result_path.read_text(encoding="utf-8"))
        assert "layout_data" not in data["results"][0]

    def test_export_to_json_multiple_results(self, export_service, mock_ocr_result, temp_dir):
        """Test JSON export with multiple results"""
        output_path = temp_dir / "output.json"

        result_path = export_service.export_to_json(
            [mock_ocr_result, mock_ocr_result],
            output_path
        )

        data = json.loads(result_path.read_text(encoding="utf-8"))
        assert data["total_files"] == 2
        assert len(data["results"]) == 2


@pytest.mark.unit
class TestExportToExcel:
    """Test Excel export"""

    def test_export_to_excel_basic(self, export_service, mock_ocr_result, temp_dir):
        """Test basic Excel export"""
        output_path = temp_dir / "output.xlsx"

        result_path = export_service.export_to_excel([mock_ocr_result], output_path)

        assert result_path.exists()
        df = pd.read_excel(result_path)
        assert len(df) == 1
        assert "文件名" in df.columns
        assert df.iloc[0]["文件名"] == "test.png"

    def test_export_to_excel_with_confidence(self, export_service, mock_ocr_result, temp_dir):
        """Test Excel export with confidence scores"""
        output_path = temp_dir / "output.xlsx"

        result_path = export_service.export_to_excel(
            [mock_ocr_result],
            output_path,
            include_confidence=True
        )

        df = pd.read_excel(result_path)
        assert "平均信心度" in df.columns

    def test_export_to_excel_without_processing_time(self, export_service, mock_ocr_result, temp_dir):
        """Test Excel export without processing time"""
        output_path = temp_dir / "output.xlsx"

        result_path = export_service.export_to_excel(
            [mock_ocr_result],
            output_path,
            include_processing_time=False
        )

        df = pd.read_excel(result_path)
        assert "處理時間(秒)" not in df.columns

    def test_export_to_excel_long_content_truncation(self, export_service, temp_dir):
        """Test that long content is truncated in Excel"""
        # Create result with long content
        md_file = temp_dir / "long.md"
        md_file.write_text("x" * 2000, encoding="utf-8")

        result = Mock()
        result.id = 1
        result.markdown_path = str(md_file)
        result.detected_language = "zh"
        result.total_text_regions = 10
        result.average_confidence = 0.95
        result.file = Mock()
        result.file.original_filename = "long.png"
        result.file.file_format = "png"
        result.file.file_size = 1024
        result.file.processing_time = 1.0

        output_path = temp_dir / "output.xlsx"
        result_path = export_service.export_to_excel([result], output_path)

        df = pd.read_excel(result_path)
        content = df.iloc[0]["提取內容"]
        assert "..." in content
        assert len(content) <= 1004  # 1000 + "..."


@pytest.mark.unit
class TestExportToMarkdown:
    """Test Markdown export"""

    def test_export_to_markdown_combined(self, export_service, mock_ocr_result, temp_dir):
        """Test combined Markdown export"""
        output_path = temp_dir / "combined.md"

        result_path = export_service.export_to_markdown(
            [mock_ocr_result],
            output_path,
            combine=True
        )

        assert result_path.exists()
        assert result_path.is_file()
        content = result_path.read_text(encoding="utf-8")
        assert "test.png" in content
        assert "Test Document" in content

    def test_export_to_markdown_separate(self, export_service, mock_ocr_result, temp_dir):
        """Test separate Markdown export"""
        output_dir = temp_dir / "markdown_files"

        result_path = export_service.export_to_markdown(
            [mock_ocr_result],
            output_dir,
            combine=False
        )

        assert result_path.exists()
        assert result_path.is_dir()
        files = list(result_path.glob("*.md"))
        assert len(files) == 1

    def test_export_to_markdown_multiple_files(self, export_service, mock_ocr_result, temp_dir):
        """Test Markdown export with multiple files"""
        output_path = temp_dir / "combined.md"

        result_path = export_service.export_to_markdown(
            [mock_ocr_result, mock_ocr_result],
            output_path,
            combine=True
        )

        content = result_path.read_text(encoding="utf-8")
        assert content.count("---") >= 1  # Separators


@pytest.mark.unit
class TestExportToPDF:
    """Test PDF export"""

    @patch.object(ExportService, '__init__', lambda self: None)
    def test_export_to_pdf_success(self, mock_ocr_result, temp_dir):
        """Test successful PDF export"""
        from app.services.pdf_generator import PDFGenerator

        service = ExportService()
        service.pdf_generator = Mock(spec=PDFGenerator)
        service.pdf_generator.generate_pdf = Mock(return_value=temp_dir / "output.pdf")

        output_path = temp_dir / "output.pdf"

        result_path = service.export_to_pdf(mock_ocr_result, output_path)

        service.pdf_generator.generate_pdf.assert_called_once()
        call_kwargs = service.pdf_generator.generate_pdf.call_args[1]
        assert call_kwargs["css_template"] == "default"

    @patch.object(ExportService, '__init__', lambda self: None)
    def test_export_to_pdf_with_custom_template(self, mock_ocr_result, temp_dir):
        """Test PDF export with custom CSS template"""
        from app.services.pdf_generator import PDFGenerator

        service = ExportService()
        service.pdf_generator = Mock(spec=PDFGenerator)
        service.pdf_generator.generate_pdf = Mock(return_value=temp_dir / "output.pdf")

        output_path = temp_dir / "output.pdf"

        service.export_to_pdf(mock_ocr_result, output_path, css_template="academic")

        call_kwargs = service.pdf_generator.generate_pdf.call_args[1]
        assert call_kwargs["css_template"] == "academic"

    @patch.object(ExportService, '__init__', lambda self: None)
    def test_export_to_pdf_missing_markdown(self, temp_dir):
        """Test PDF export with missing markdown file"""
        from app.services.pdf_generator import PDFGenerator

        result = Mock()
        result.id = 1
        result.markdown_path = None
        result.file = Mock()

        service = ExportService()
        service.pdf_generator = Mock(spec=PDFGenerator)

        output_path = temp_dir / "output.pdf"

        with pytest.raises(ExportError) as exc_info:
            service.export_to_pdf(result, output_path)

        assert "not found" in str(exc_info.value).lower()


@pytest.mark.unit
class TestGetExportFormats:
    """Test getting available export formats"""

    def test_get_export_formats(self, export_service):
        """Test getting export formats"""
        formats = export_service.get_export_formats()

        assert isinstance(formats, dict)
        assert "txt" in formats
        assert "json" in formats
        assert "excel" in formats
        assert "markdown" in formats
        assert "pdf" in formats
        assert "zip" in formats

        # Check descriptions are in Chinese
        for desc in formats.values():
            assert isinstance(desc, str)
            assert len(desc) > 0


@pytest.mark.unit
class TestApplyExportRule:
    """Test export rule application"""

    def test_apply_export_rule_success(self, export_service, mock_db):
        """Test applying export rule"""
        # Create mock rule
        rule = Mock()
        rule.id = 1
        rule.config_json = {
            "filters": {
                "confidence_threshold": 0.80
            }
        }

        mock_db.query.return_value.filter.return_value.first.return_value = rule

        # Create mock results
        result1 = Mock()
        result1.average_confidence = 0.95
        result1.file = Mock()
        result1.file.original_filename = "test1.png"

        result2 = Mock()
        result2.average_confidence = 0.70
        result2.file = Mock()
        result2.file.original_filename = "test2.png"

        results = [result1, result2]

        filtered = export_service.apply_export_rule(mock_db, results, rule_id=1)

        assert len(filtered) == 1
        assert result1 in filtered

    def test_apply_export_rule_not_found(self, export_service, mock_db):
        """Test applying non-existent rule"""
        mock_db.query.return_value.filter.return_value.first.return_value = None

        with pytest.raises(ExportError) as exc_info:
            export_service.apply_export_rule(mock_db, [], rule_id=999)

        assert "not found" in str(exc_info.value).lower()
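

# The two tests above define apply_export_rule's contract: look the rule up by
# id, raise ExportError when it is missing, and otherwise feed the rule's
# stored filters through apply_filters. A hypothetical sketch (rule_model
# stands in for the ORM class, which is not imported in this test module):
def _apply_export_rule_sketch(service, db, rule_model, results, rule_id):
    """Hypothetical flow implied by TestApplyExportRule."""
    rule = db.query(rule_model).filter(rule_model.id == rule_id).first()
    if rule is None:
        raise ExportError(f"Export rule {rule_id} not found")
    return service.apply_filters(results, rule.config_json.get("filters", {}))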


@pytest.mark.unit
class TestEdgeCases:
    """Test edge cases and error handling"""

    def test_export_to_txt_empty_results(self, export_service, temp_dir):
        """Test TXT export with empty results list"""
        output_path = temp_dir / "output.txt"

        result_path = export_service.export_to_txt([], output_path)

        assert result_path.exists()
        content = result_path.read_text(encoding="utf-8")
        assert content == ""

    def test_export_to_json_empty_results(self, export_service, temp_dir):
        """Test JSON export with empty results list"""
        output_path = temp_dir / "output.json"

        result_path = export_service.export_to_json([], output_path)

        data = json.loads(result_path.read_text(encoding="utf-8"))
        assert data["total_files"] == 0
        assert len(data["results"]) == 0

    def test_export_with_unicode_content(self, export_service, temp_dir):
        """Test export with Unicode/Chinese content"""
        md_file = temp_dir / "chinese.md"
        md_file.write_text("# 測試文檔\n\n這是中文內容。", encoding="utf-8")

        result = Mock()
        result.id = 1
        result.markdown_path = str(md_file)
        result.json_path = None
        result.detected_language = "zh"
        result.total_text_regions = 10
        result.average_confidence = 0.95
        result.layout_data = None  # Use None instead of Mock for JSON serialization
        result.images_metadata = None  # Use None instead of Mock
        result.file = Mock()
        result.file.id = 1
        result.file.original_filename = "中文測試.png"
        result.file.file_format = "png"
        result.file.file_size = 1024
        result.file.processing_time = 1.0

        # Test TXT export
        txt_path = temp_dir / "output.txt"
        export_service.export_to_txt([result], txt_path)
        assert "測試文檔" in txt_path.read_text(encoding="utf-8")

        # Test JSON export
        json_path = temp_dir / "output.json"
        export_service.export_to_json([result], json_path)
        data = json.loads(json_path.read_text(encoding="utf-8"))
        assert data["results"][0]["filename"] == "中文測試.png"

    def test_apply_filters_with_none_values(self, export_service):
        """Test filters with None values in results"""
        result = Mock()
        result.average_confidence = None
        result.detected_language = None
        result.file = Mock()
        result.file.original_filename = "test.png"

        filters = {"confidence_threshold": 0.80}

        filtered = export_service.apply_filters([result], filters)

        # Should filter out result with None confidence
        assert len(filtered) == 0
@@ -1,520 +0,0 @@
"""
Tool_OCR - File Manager Unit Tests
Tests for app/services/file_manager.py
"""

import pytest
import shutil
from pathlib import Path
from unittest.mock import Mock, patch, MagicMock
from datetime import datetime, timedelta
from io import BytesIO

from fastapi import UploadFile

from app.services.file_manager import FileManager, FileManagementError
from app.models.ocr import OCRBatch, OCRFile, FileStatus, BatchStatus


@pytest.fixture
def file_manager(temp_dir):
    """Create a FileManager instance with temp directory"""
    with patch('app.services.file_manager.settings') as mock_settings:
        mock_settings.upload_dir = str(temp_dir)
        mock_settings.max_upload_size = 20 * 1024 * 1024  # 20MB
        mock_settings.allowed_extensions_list = ['png', 'jpg', 'jpeg', 'pdf']
        manager = FileManager()
        return manager


@pytest.fixture
def mock_upload_file():
    """Create a mock UploadFile"""
    def create_file(filename="test.png", content=b"test content", size=None):
        file_obj = BytesIO(content)
        if size is None:
            size = len(content)

        upload_file = UploadFile(filename=filename, file=file_obj)
        # The underlying stream carries the real size; just make sure reads
        # start from the beginning (the `size` parameter is kept for call
        # symmetry but is not applied here)
        upload_file.file.seek(0, 2)  # seek to end
        upload_file.file.seek(0)  # reset to start
        return upload_file

    return create_file


@pytest.fixture
def mock_db():
    """Create a mock database session"""
    return Mock()


@pytest.mark.unit
class TestFileManagerInit:
    """Test FileManager initialization"""

    def test_init(self, file_manager, temp_dir):
        """Test file manager initialization"""
        assert file_manager is not None
        assert file_manager.preprocessor is not None
        assert file_manager.base_upload_dir == temp_dir
        assert file_manager.base_upload_dir.exists()


@pytest.mark.unit
class TestBatchDirectoryManagement:
    """Test batch directory creation and management"""

    def test_create_batch_directory(self, file_manager):
        """Test creating batch directory structure"""
        batch_id = 123
        batch_dir = file_manager.create_batch_directory(batch_id)

        assert batch_dir.exists()
        assert (batch_dir / "inputs").exists()
        assert (batch_dir / "outputs" / "markdown").exists()
        assert (batch_dir / "outputs" / "json").exists()
        assert (batch_dir / "outputs" / "images").exists()
        assert (batch_dir / "exports").exists()

    def test_create_batch_directory_multiple_times(self, file_manager):
        """Test creating same batch directory multiple times (should not error)"""
        batch_id = 123

        batch_dir1 = file_manager.create_batch_directory(batch_id)
        batch_dir2 = file_manager.create_batch_directory(batch_id)

        assert batch_dir1 == batch_dir2
        assert batch_dir1.exists()

    def test_get_batch_directory(self, file_manager):
        """Test getting batch directory path"""
        batch_id = 456
        batch_dir = file_manager.get_batch_directory(batch_id)

        expected_path = file_manager.base_upload_dir / "batches" / "456"
        assert batch_dir == expected_path


@pytest.mark.unit
class TestUploadValidation:
    """Test file upload validation"""

    def test_validate_upload_valid_file(self, file_manager, mock_upload_file):
        """Test validation of valid upload"""
        upload = mock_upload_file("test.png", b"valid content")

        is_valid, error = file_manager.validate_upload(upload)

        assert is_valid is True
        assert error is None

    def test_validate_upload_empty_filename(self, file_manager):
        """Test validation with empty filename"""
        upload = Mock()
        upload.filename = ""

        is_valid, error = file_manager.validate_upload(upload)

        assert is_valid is False
        assert "文件名不能為空" in error

    def test_validate_upload_empty_file(self, file_manager, mock_upload_file):
        """Test validation of empty file"""
        upload = mock_upload_file("test.png", b"")

        is_valid, error = file_manager.validate_upload(upload)

        assert is_valid is False
        assert "文件為空" in error

    @pytest.mark.skip(reason="File size mock is complex with UploadFile, covered by integration test")
    def test_validate_upload_file_too_large(self, file_manager):
        """Test validation of file exceeding size limit"""
        # Note: This functionality is tested in integration tests where actual
        # files can be created. Mocking UploadFile's size behavior is complex
        # (see the sketch after this class for one possible workaround).
        pass

    def test_validate_upload_unsupported_format(self, file_manager, mock_upload_file):
        """Test validation of unsupported file format"""
        upload = mock_upload_file("test.txt", b"text content")

        is_valid, error = file_manager.validate_upload(upload)

        assert is_valid is False
        assert "不支持的文件格式" in error

    def test_validate_upload_supported_formats(self, file_manager, mock_upload_file):
        """Test validation of all supported formats"""
        supported_formats = ["test.png", "test.jpg", "test.jpeg", "test.pdf"]

        for filename in supported_formats:
            upload = mock_upload_file(filename, b"content")
            is_valid, error = file_manager.validate_upload(upload)
            assert is_valid is True, f"Failed for {filename}"
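

# The skip note in TestUploadValidation observes that faking the reported size
# of an UploadFile is awkward. One hypothetical workaround, assuming
# validate_upload consults the upload's `size` attribute (the real check lives
# in app/services/file_manager.py), is a spec'd Mock that reports an oversized
# file without allocating any data:
def _oversized_upload_sketch():
    """Hypothetical helper: a mock upload that claims to exceed the 20MB cap."""
    upload = Mock(spec=UploadFile)
    upload.filename = "big.png"
    upload.size = 25 * 1024 * 1024  # reported size only; no bytes are written
    return upload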


@pytest.mark.unit
class TestFileSaving:
    """Test file saving operations"""

    def test_save_upload_success(self, file_manager, mock_upload_file):
        """Test successful file saving"""
        batch_id = 1
        file_manager.create_batch_directory(batch_id)

        upload = mock_upload_file("test.png", b"test content")

        file_path, original_filename = file_manager.save_upload(upload, batch_id)

        assert file_path.exists()
        assert file_path.read_bytes() == b"test content"
        assert original_filename == "test.png"
        assert file_path.parent.name == "inputs"

    def test_save_upload_unique_filename(self, file_manager, mock_upload_file):
        """Test that saved files get unique filenames"""
        batch_id = 1
        file_manager.create_batch_directory(batch_id)

        upload1 = mock_upload_file("test.png", b"content1")
        upload2 = mock_upload_file("test.png", b"content2")

        path1, _ = file_manager.save_upload(upload1, batch_id)
        path2, _ = file_manager.save_upload(upload2, batch_id)

        assert path1 != path2
        assert path1.exists() and path2.exists()
        assert path1.read_bytes() == b"content1"
        assert path2.read_bytes() == b"content2"

    def test_save_upload_validation_failure(self, file_manager, mock_upload_file):
        """Test save upload with validation failure"""
        batch_id = 1
        file_manager.create_batch_directory(batch_id)

        # Empty file should fail validation
        upload = mock_upload_file("test.png", b"")

        with pytest.raises(FileManagementError) as exc_info:
            file_manager.save_upload(upload, batch_id, validate=True)

        assert "文件為空" in str(exc_info.value)

    def test_save_upload_skip_validation(self, file_manager, mock_upload_file):
        """Test saving with validation skipped"""
        batch_id = 1
        file_manager.create_batch_directory(batch_id)

        # Empty file but validation skipped
        upload = mock_upload_file("test.txt", b"")

        # Should succeed when validation is disabled
        file_path, _ = file_manager.save_upload(upload, batch_id, validate=False)
        assert file_path.exists()

    def test_save_upload_preserves_extension(self, file_manager, mock_upload_file):
        """Test that file extension is preserved"""
        batch_id = 1
        file_manager.create_batch_directory(batch_id)

        upload = mock_upload_file("document.pdf", b"pdf content")

        file_path, _ = file_manager.save_upload(upload, batch_id)

        assert file_path.suffix == ".pdf"


@pytest.mark.unit
class TestValidateSavedFile:
    """Test validation of saved files"""

    @patch.object(FileManager, '__init__', lambda self: None)
    def test_validate_saved_file(self, sample_image_path):
        """Test validating a saved file"""
        from app.services.preprocessor import DocumentPreprocessor

        manager = FileManager()
        manager.preprocessor = DocumentPreprocessor()

        # validate_file returns (is_valid, file_format, error_message)
        is_valid, file_format, error = manager.validate_saved_file(sample_image_path)

        assert is_valid is True
        assert file_format == 'png'
        assert error is None


@pytest.mark.unit
class TestBatchCreation:
    """Test batch creation"""

    def test_create_batch(self, file_manager, mock_db):
        """Test creating a new batch"""
        user_id = 1

        # Mock database operations
        mock_batch = Mock()
        mock_batch.id = 123
        mock_db.add = Mock()
        mock_db.commit = Mock()
        mock_db.refresh = Mock(side_effect=lambda x: setattr(x, 'id', 123))

        with patch.object(FileManager, 'create_batch_directory'):
            batch = file_manager.create_batch(mock_db, user_id)

        assert mock_db.add.called
        assert mock_db.commit.called

    def test_create_batch_with_custom_name(self, file_manager, mock_db):
        """Test creating batch with custom name"""
        user_id = 1
        batch_name = "My Custom Batch"

        mock_db.add = Mock()
        mock_db.commit = Mock()
        mock_db.refresh = Mock(side_effect=lambda x: setattr(x, 'id', 123))

        with patch.object(FileManager, 'create_batch_directory'):
            batch = file_manager.create_batch(mock_db, user_id, batch_name)

        # Verify batch was created with correct name
        call_args = mock_db.add.call_args[0][0]
        assert hasattr(call_args, 'batch_name')


@pytest.mark.unit
class TestGetFilePaths:
    """Test file path retrieval"""

    def test_get_file_paths(self, file_manager):
        """Test getting file paths for a batch"""
        batch_id = 1
        file_id = 42

        paths = file_manager.get_file_paths(batch_id, file_id)

        assert "input_dir" in paths
        assert "output_dir" in paths
        assert "markdown_dir" in paths
        assert "json_dir" in paths
        assert "images_dir" in paths
        assert "export_dir" in paths

        # Verify images_dir includes file_id
        assert str(file_id) in str(paths["images_dir"])
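

# Together with TestBatchDirectoryManagement above, this pins down the on-disk
# layout: everything lives under <upload_dir>/batches/<batch_id>, with a
# per-file image directory. A hypothetical sketch of the mapping (the real
# method is FileManager.get_file_paths and may differ in detail):
def _get_file_paths_sketch(base_upload_dir: Path, batch_id: int, file_id: int) -> dict:
    """Hypothetical path map implied by the directory-structure assertions."""
    batch_dir = base_upload_dir / "batches" / str(batch_id)
    return {
        "input_dir": batch_dir / "inputs",
        "output_dir": batch_dir / "outputs",
        "markdown_dir": batch_dir / "outputs" / "markdown",
        "json_dir": batch_dir / "outputs" / "json",
        "images_dir": batch_dir / "outputs" / "images" / str(file_id),  # per-file
        "export_dir": batch_dir / "exports",
    }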


@pytest.mark.unit
class TestCleanupExpiredBatches:
    """Test cleanup of expired batches"""

    def test_cleanup_expired_batches(self, file_manager, mock_db, temp_dir):
        """Test cleaning up expired batches"""
        # Create mock expired batch
        expired_batch = Mock()
        expired_batch.id = 1
        expired_batch.created_at = datetime.utcnow() - timedelta(hours=48)

        # Create batch directory
        batch_dir = file_manager.create_batch_directory(1)
        assert batch_dir.exists()

        # Mock database query
        mock_db.query.return_value.filter.return_value.all.return_value = [expired_batch]
        mock_db.delete = Mock()
        mock_db.commit = Mock()

        # Run cleanup
        cleaned = file_manager.cleanup_expired_batches(mock_db, retention_hours=24)

        assert cleaned == 1
        assert not batch_dir.exists()
        mock_db.delete.assert_called_once_with(expired_batch)
        mock_db.commit.assert_called_once()

    def test_cleanup_no_expired_batches(self, file_manager, mock_db):
        """Test cleanup when no batches are expired"""
        # Mock database query returning empty list
        mock_db.query.return_value.filter.return_value.all.return_value = []

        cleaned = file_manager.cleanup_expired_batches(mock_db, retention_hours=24)

        assert cleaned == 0

    def test_cleanup_handles_missing_directory(self, file_manager, mock_db):
        """Test cleanup handles missing batch directory gracefully"""
        expired_batch = Mock()
        expired_batch.id = 999  # Directory doesn't exist
        expired_batch.created_at = datetime.utcnow() - timedelta(hours=48)

        mock_db.query.return_value.filter.return_value.all.return_value = [expired_batch]
        mock_db.delete = Mock()
        mock_db.commit = Mock()

        # Should not raise error
        cleaned = file_manager.cleanup_expired_batches(mock_db, retention_hours=24)

        assert cleaned == 1
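

# The tests above pin down the cleanup contract: remove rows and directories
# for batches older than the retention window, tolerate a missing directory,
# and return the number of batches cleaned. A hypothetical sketch, assuming
# the query runs against the imported OCRBatch model (the real loop is
# FileManager.cleanup_expired_batches and its query may be shaped differently):
def _cleanup_expired_sketch(db, base_upload_dir: Path, retention_hours: int = 24) -> int:
    """Hypothetical cleanup loop implied by TestCleanupExpiredBatches."""
    cutoff = datetime.utcnow() - timedelta(hours=retention_hours)
    expired = db.query(OCRBatch).filter(OCRBatch.created_at < cutoff).all()
    cleaned = 0
    for batch in expired:
        batch_dir = base_upload_dir / "batches" / str(batch.id)
        if batch_dir.exists():
            shutil.rmtree(batch_dir)  # a missing directory is simply skipped
        db.delete(batch)
        cleaned += 1
    db.commit()
    return cleaned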


@pytest.mark.unit
class TestFileOwnershipVerification:
    """Test file ownership verification"""

    def test_verify_file_ownership_success(self, file_manager, mock_db):
        """Test successful ownership verification"""
        user_id = 1
        batch_id = 123

        # Mock batch owned by user
        mock_batch = Mock()
        mock_db.query.return_value.filter.return_value.first.return_value = mock_batch

        is_owner = file_manager.verify_file_ownership(mock_db, user_id, batch_id)

        assert is_owner is True

    def test_verify_file_ownership_failure(self, file_manager, mock_db):
        """Test ownership verification failure"""
        user_id = 1
        batch_id = 123

        # Mock no batch found (wrong owner)
        mock_db.query.return_value.filter.return_value.first.return_value = None

        is_owner = file_manager.verify_file_ownership(mock_db, user_id, batch_id)

        assert is_owner is False


@pytest.mark.unit
class TestBatchStatistics:
    """Test batch statistics retrieval"""

    def test_get_batch_statistics(self, file_manager, mock_db):
        """Test getting batch statistics"""
        batch_id = 1

        # Create mock batch with files
        mock_file1 = Mock()
        mock_file1.file_size = 1000

        mock_file2 = Mock()
        mock_file2.file_size = 2000

        mock_batch = Mock()
        mock_batch.id = batch_id
        mock_batch.batch_name = "Test Batch"
        mock_batch.status = BatchStatus.COMPLETED
        mock_batch.total_files = 2
        mock_batch.completed_files = 2
        mock_batch.failed_files = 0
        mock_batch.progress_percentage = 100.0
        mock_batch.files = [mock_file1, mock_file2]
        mock_batch.created_at = datetime(2025, 1, 1, 10, 0, 0)
        mock_batch.started_at = datetime(2025, 1, 1, 10, 1, 0)
        mock_batch.completed_at = datetime(2025, 1, 1, 10, 5, 0)

        mock_db.query.return_value.filter.return_value.first.return_value = mock_batch

        stats = file_manager.get_batch_statistics(mock_db, batch_id)

        assert stats['batch_id'] == batch_id
        assert stats['batch_name'] == "Test Batch"
        assert stats['total_files'] == 2
        assert stats['total_file_size'] == 3000
        assert stats['total_file_size_mb'] == 0.0  # Small files
        assert stats['processing_time'] == 240.0  # 4 minutes
        assert stats['pending_files'] == 0

    def test_get_batch_statistics_not_found(self, file_manager, mock_db):
        """Test getting statistics for non-existent batch"""
        batch_id = 999

        mock_db.query.return_value.filter.return_value.first.return_value = None

        stats = file_manager.get_batch_statistics(mock_db, batch_id)

        assert stats == {}

    def test_get_batch_statistics_no_completion_time(self, file_manager, mock_db):
        """Test statistics for batch without completion time"""
        mock_batch = Mock()
        mock_batch.id = 1
        mock_batch.batch_name = "Pending Batch"
        mock_batch.status = BatchStatus.PROCESSING
        mock_batch.total_files = 5
        mock_batch.completed_files = 2
        mock_batch.failed_files = 0
        mock_batch.progress_percentage = 40.0
        mock_batch.files = []
        mock_batch.created_at = datetime(2025, 1, 1)
        mock_batch.started_at = datetime(2025, 1, 1)
        mock_batch.completed_at = None

        mock_db.query.return_value.filter.return_value.first.return_value = mock_batch

        stats = file_manager.get_batch_statistics(mock_db, 1)

        assert stats['processing_time'] is None
        assert stats['pending_files'] == 3


@pytest.mark.unit
class TestEdgeCases:
    """Test edge cases and error handling"""

    def test_save_upload_creates_parent_directories(self, file_manager, mock_upload_file):
        """Test that save_upload creates necessary directories"""
        batch_id = 999  # Directory doesn't exist yet

        upload = mock_upload_file("test.png", b"content")

        file_path, _ = file_manager.save_upload(upload, batch_id)

        assert file_path.exists()
        assert file_path.parent.exists()

    def test_cleanup_continues_on_error(self, file_manager, mock_db):
        """Test that cleanup continues even if one batch fails"""
        batch1 = Mock()
        batch1.id = 1
        batch1.created_at = datetime.utcnow() - timedelta(hours=48)

        batch2 = Mock()
        batch2.id = 2
        batch2.created_at = datetime.utcnow() - timedelta(hours=48)

        # Create only batch2 directory
        file_manager.create_batch_directory(2)

        mock_db.query.return_value.filter.return_value.all.return_value = [batch1, batch2]
        mock_db.delete = Mock()
        mock_db.commit = Mock()

        # Should not fail, should clean batch2 even if batch1 fails
        cleaned = file_manager.cleanup_expired_batches(mock_db, retention_hours=24)

        assert cleaned > 0

    def test_validate_upload_with_unicode_filename(self, file_manager, mock_upload_file):
        """Test validation with Unicode filename"""
        upload = mock_upload_file("測試文件.png", b"content")

        is_valid, error = file_manager.validate_upload(upload)

        assert is_valid is True

    def test_save_upload_preserves_unicode_filename(self, file_manager, mock_upload_file):
        """Test that Unicode filenames are handled correctly"""
        batch_id = 1
        file_manager.create_batch_directory(batch_id)

        upload = mock_upload_file("中文文檔.pdf", b"content")

        file_path, original_filename = file_manager.save_upload(upload, batch_id)

        assert original_filename == "中文文檔.pdf"
        assert file_path.exists()
@@ -1,182 +0,0 @@
"""
Integration tests for Tool_OCR
Tests the complete flow of authentication, task creation, and file operations
"""

import pytest
from unittest.mock import patch


class TestIntegration:
    """Integration tests for end-to-end workflows"""

    def test_complete_auth_and_task_flow(self, client, db):
        """Test complete flow: login -> create task -> get task -> delete task"""

        # Step 1: Login
        from app.services.external_auth_service import AuthResponse, UserInfo

        user_info = UserInfo(
            id="integration-id-123",
            name="Integration Test User",
            email="integration@example.com"
        )
        auth_response = AuthResponse(
            access_token="test-token",
            id_token="test-id-token",
            expires_in=3600,
            token_type="Bearer",
            user_info=user_info,
            issued_at="2025-11-16T10:00:00Z",
            expires_at="2025-11-16T11:00:00Z"
        )

        with patch('app.routers.auth.external_auth_service.authenticate_user') as mock_auth:
            mock_auth.return_value = (True, auth_response, None)

            login_response = client.post('/api/v2/auth/login', json={
                'username': 'integration@example.com',
                'password': 'password123'
            })

        assert login_response.status_code == 200
        token = login_response.json()['access_token']
        headers = {'Authorization': f'Bearer {token}'}

        # Step 2: Create task
        create_response = client.post(
            '/api/v2/tasks/',
            headers=headers,
            json={
                'filename': 'integration_test.pdf',
                'file_type': 'application/pdf'
            }
        )

        assert create_response.status_code == 201
        task_data = create_response.json()
        task_id = task_data['task_id']

        # Step 3: Get task
        get_response = client.get(
            f'/api/v2/tasks/{task_id}',
            headers=headers
        )

        assert get_response.status_code == 200
        assert get_response.json()['task_id'] == task_id

        # Step 4: List tasks
        list_response = client.get(
            '/api/v2/tasks/',
            headers=headers
        )

        assert list_response.status_code == 200
        assert len(list_response.json()['tasks']) > 0

        # Step 5: Get stats
        stats_response = client.get(
            '/api/v2/tasks/stats',
            headers=headers
        )

        assert stats_response.status_code == 200
        stats = stats_response.json()
        assert stats['total'] > 0
        assert stats['pending'] > 0

        # Step 6: Delete task
        delete_response = client.delete(
            f'/api/v2/tasks/{task_id}',
            headers=headers
        )

        # DELETE returns 204 No Content (standard for successful deletion)
        assert delete_response.status_code == 204

        # Step 7: Verify deletion
        get_after_delete = client.get(
            f'/api/v2/tasks/{task_id}',
            headers=headers
        )

        assert get_after_delete.status_code == 404

    def test_admin_workflow(self, client, db):
        """Test admin workflow: login as admin -> access admin endpoints"""

        # Login as admin
        from app.services.external_auth_service import AuthResponse, UserInfo

        user_info = UserInfo(
            id="admin-id-123",
            name="Admin User",
            email="ymirliu@panjit.com.tw"
        )
        auth_response = AuthResponse(
            access_token="admin-token",
            id_token="admin-id-token",
            expires_in=3600,
            token_type="Bearer",
            user_info=user_info,
            issued_at="2025-11-16T10:00:00Z",
            expires_at="2025-11-16T11:00:00Z"
        )

        with patch('app.routers.auth.external_auth_service.authenticate_user') as mock_auth:
            mock_auth.return_value = (True, auth_response, None)

            login_response = client.post('/api/v2/auth/login', json={
                'username': 'ymirliu@panjit.com.tw',
                'password': 'adminpass'
            })

        assert login_response.status_code == 200
        token = login_response.json()['access_token']
        headers = {'Authorization': f'Bearer {token}'}

        # Access admin endpoints
        stats_response = client.get('/api/v2/admin/stats', headers=headers)
        assert stats_response.status_code == 200

        users_response = client.get('/api/v2/admin/users', headers=headers)
        assert users_response.status_code == 200

        logs_response = client.get('/api/v2/admin/audit-logs', headers=headers)
        assert logs_response.status_code == 200

    def test_task_lifecycle(self, client, auth_token, test_task, db):
        """Test complete task lifecycle: pending -> processing -> completed"""

        headers = {'Authorization': f'Bearer {auth_token}'}

        # Check initial status
        response = client.get(f'/api/v2/tasks/{test_task.task_id}', headers=headers)
        assert response.json()['status'] == 'pending'

        # Start task
        start_response = client.post(
            f'/api/v2/tasks/{test_task.task_id}/start',
            headers=headers
        )
        assert start_response.status_code == 200
        assert start_response.json()['status'] == 'processing'

        # Update task to completed
        update_response = client.patch(
            f'/api/v2/tasks/{test_task.task_id}',
            headers=headers,
            json={
                'status': 'completed',
                'processing_time_ms': 1500
            }
        )
        assert update_response.status_code == 200
        assert update_response.json()['status'] == 'completed'

        # Verify final state
        final_response = client.get(f'/api/v2/tasks/{test_task.task_id}', headers=headers)
        final_data = final_response.json()
        assert final_data['status'] == 'completed'
        assert final_data['processing_time_ms'] == 1500
@@ -1,528 +0,0 @@
"""
Tool_OCR - OCR Service Unit Tests
Tests for app/services/ocr_service.py
"""

import pytest
import json
from pathlib import Path
from unittest.mock import Mock, patch, MagicMock

from app.services.ocr_service import OCRService


@pytest.mark.unit
class TestOCRServiceInit:
    """Test OCR service initialization"""

    def test_init(self):
        """Test OCR service initialization"""
        service = OCRService()

        assert service is not None
        assert service.ocr_engines == {}
        assert service.structure_engine is None
        assert service.confidence_threshold > 0
        assert len(service.ocr_languages) > 0

    def test_supported_languages(self):
        """Test that supported languages are configured"""
        service = OCRService()

        # Should have at least Chinese and English
        assert 'ch' in service.ocr_languages or 'en' in service.ocr_languages


@pytest.mark.unit
class TestOCREngineLazyLoading:
    """Test OCR engine lazy loading"""

    @patch('app.services.ocr_service.PaddleOCR')
    def test_get_ocr_engine_creates_new_engine(self, mock_paddle_ocr):
        """Test that get_ocr_engine creates engine on first call"""
        mock_engine = Mock()
        mock_paddle_ocr.return_value = mock_engine

        service = OCRService()
        engine = service.get_ocr_engine(lang='en')

        assert engine == mock_engine
        mock_paddle_ocr.assert_called_once()
        assert 'en' in service.ocr_engines

    @patch('app.services.ocr_service.PaddleOCR')
    def test_get_ocr_engine_reuses_existing_engine(self, mock_paddle_ocr):
        """Test that get_ocr_engine reuses existing engine"""
        mock_engine = Mock()
        mock_paddle_ocr.return_value = mock_engine

        service = OCRService()

        # First call creates engine
        engine1 = service.get_ocr_engine(lang='en')
        # Second call should reuse
        engine2 = service.get_ocr_engine(lang='en')

        assert engine1 == engine2
        mock_paddle_ocr.assert_called_once()

    @patch('app.services.ocr_service.PaddleOCR')
    def test_get_ocr_engine_different_languages(self, mock_paddle_ocr):
        """Test that different languages get different engines"""
        mock_paddle_ocr.return_value = Mock()

        service = OCRService()

        engine_en = service.get_ocr_engine(lang='en')
        engine_ch = service.get_ocr_engine(lang='ch')

        assert 'en' in service.ocr_engines
        assert 'ch' in service.ocr_engines
        assert mock_paddle_ocr.call_count == 2
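

# The three tests above specify the lazy-loading contract: one engine per
# language, created on first use and cached afterwards. A hypothetical sketch
# of that pattern (engine_factory stands in for the real PaddleOCR constructor,
# which is not imported in this test module):
def _lazy_engine_cache_sketch(service, lang, engine_factory):
    """Hypothetical per-language engine cache implied by TestOCREngineLazyLoading."""
    if lang not in service.ocr_engines:
        # Model loading is expensive, so construct at most once per language.
        service.ocr_engines[lang] = engine_factory(lang=lang)
    return service.ocr_engines[lang]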


@pytest.mark.unit
class TestStructureEngineLazyLoading:
    """Test structure engine lazy loading"""

    @patch('app.services.ocr_service.PPStructureV3')
    def test_get_structure_engine_creates_new_engine(self, mock_structure):
        """Test that get_structure_engine creates engine on first call"""
        mock_engine = Mock()
        mock_structure.return_value = mock_engine

        service = OCRService()
        engine = service.get_structure_engine()

        assert engine == mock_engine
        mock_structure.assert_called_once()
        assert service.structure_engine == mock_engine

    @patch('app.services.ocr_service.PPStructureV3')
    def test_get_structure_engine_reuses_existing_engine(self, mock_structure):
        """Test that get_structure_engine reuses existing engine"""
        mock_engine = Mock()
        mock_structure.return_value = mock_engine

        service = OCRService()

        # First call creates engine
        engine1 = service.get_structure_engine()
        # Second call should reuse
        engine2 = service.get_structure_engine()

        assert engine1 == engine2
        mock_structure.assert_called_once()


@pytest.mark.unit
class TestProcessImageMocked:
    """Test image processing with mocked OCR engines"""

    @patch('app.services.ocr_service.PaddleOCR')
    def test_process_image_success(self, mock_paddle_ocr, sample_image_path):
        """Test successful image processing"""
        # Mock OCR results - PaddleOCR 3.x format
        mock_ocr_results = [{
            'rec_texts': ['Hello World', 'Test Text'],
            'rec_scores': [0.95, 0.88],
            'rec_polys': [
                [[10, 10], [100, 10], [100, 30], [10, 30]],
                [[10, 40], [100, 40], [100, 60], [10, 60]]
            ]
        }]

        mock_engine = Mock()
        mock_engine.ocr.return_value = mock_ocr_results
        mock_paddle_ocr.return_value = mock_engine

        service = OCRService()
        result = service.process_image(sample_image_path, detect_layout=False)

        assert result['status'] == 'success'
        assert result['file_name'] == sample_image_path.name
        assert result['language'] == 'ch'
        assert result['total_text_regions'] == 2
        assert result['average_confidence'] > 0.8
        assert len(result['text_regions']) == 2
        assert 'markdown_content' in result
        assert 'processing_time' in result

    @patch('app.services.ocr_service.PaddleOCR')
    def test_process_image_filters_low_confidence(self, mock_paddle_ocr, sample_image_path):
        """Test that low confidence results are filtered"""
        # Mock OCR results with varying confidence - PaddleOCR 3.x format
        mock_ocr_results = [{
            'rec_texts': ['High Confidence', 'Low Confidence'],
            'rec_scores': [0.95, 0.50],
            'rec_polys': [
                [[10, 10], [100, 10], [100, 30], [10, 30]],
                [[10, 40], [100, 40], [100, 60], [10, 60]]
            ]
        }]

        mock_engine = Mock()
        mock_engine.ocr.return_value = mock_ocr_results
        mock_paddle_ocr.return_value = mock_engine

        service = OCRService()
        result = service.process_image(
            sample_image_path,
            detect_layout=False,
            confidence_threshold=0.80
        )

        assert result['status'] == 'success'
        assert result['total_text_regions'] == 1  # Only high confidence
        assert result['text_regions'][0]['text'] == 'High Confidence'

    @patch('app.services.ocr_service.PaddleOCR')
    def test_process_image_empty_results(self, mock_paddle_ocr, sample_image_path):
        """Test processing image with no text detected"""
        mock_ocr_results = [[]]

        mock_engine = Mock()
        mock_engine.ocr.return_value = mock_ocr_results
        mock_paddle_ocr.return_value = mock_engine

        service = OCRService()
        result = service.process_image(sample_image_path, detect_layout=False)

        assert result['status'] == 'success'
        assert result['total_text_regions'] == 0
        assert result['average_confidence'] == 0.0

    @patch('app.services.ocr_service.PaddleOCR')
    def test_process_image_error_handling(self, mock_paddle_ocr, sample_image_path):
        """Test error handling during OCR processing"""
        mock_engine = Mock()
        mock_engine.ocr.side_effect = Exception("OCR engine error")
        mock_paddle_ocr.return_value = mock_engine

        service = OCRService()
        result = service.process_image(sample_image_path, detect_layout=False)

        assert result['status'] == 'error'
        assert 'error_message' in result
        assert 'OCR engine error' in result['error_message']

    @patch('app.services.ocr_service.PaddleOCR')
    def test_process_image_different_languages(self, mock_paddle_ocr, sample_image_path):
        """Test processing with different languages"""
        mock_ocr_results = [[
            [[[10, 10], [100, 10], [100, 30], [10, 30]], ('Text', 0.95)]
        ]]

        mock_engine = Mock()
        mock_engine.ocr.return_value = mock_ocr_results
        mock_paddle_ocr.return_value = mock_engine

        service = OCRService()

        # Test English
        result_en = service.process_image(sample_image_path, lang='en', detect_layout=False)
        assert result_en['language'] == 'en'

        # Test Chinese
        result_ch = service.process_image(sample_image_path, lang='ch', detect_layout=False)
        assert result_ch['language'] == 'ch'
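

# The mocks above document the PaddleOCR 3.x page-result format: parallel
# 'rec_texts' / 'rec_scores' / 'rec_polys' lists per page. A hypothetical
# parsing helper consistent with the filtering behaviour asserted above
# (assumes the 3.x dict format; the real parsing lives in
# OCRService.process_image):
def _parse_paddle3_results_sketch(pages, confidence_threshold):
    """Hypothetical parse of PaddleOCR 3.x dict pages into text regions."""
    regions = []
    for page in pages or []:  # None results yield no regions
        if not page:  # empty page, as in the no-text-detected test
            continue
        for text, score, poly in zip(page['rec_texts'], page['rec_scores'], page['rec_polys']):
            if score >= confidence_threshold:  # drop low-confidence detections
                regions.append({'text': text, 'confidence': score, 'bbox': poly})
    return regions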


@pytest.mark.unit
class TestLayoutAnalysisMocked:
    """Test layout analysis with mocked structure engine"""

    @patch('app.services.ocr_service.PPStructureV3')
    def test_analyze_layout_success(self, mock_structure, sample_image_path):
        """Test successful layout analysis"""
        # Create mock page result with markdown attribute (PP-StructureV3 format)
        mock_page_result = Mock()
        mock_page_result.markdown = {
            'markdown_texts': 'Document Title\n\nParagraph content',
            'markdown_images': {}
        }

        # PP-Structure predict() returns a list of page results
        mock_engine = Mock()
        mock_engine.predict.return_value = [mock_page_result]
        mock_structure.return_value = mock_engine

        service = OCRService()
        layout_data, images_metadata = service.analyze_layout(sample_image_path)

        assert layout_data is not None
        assert layout_data['total_elements'] == 1
        assert len(layout_data['elements']) == 1
        assert layout_data['elements'][0]['type'] == 'text'
        assert 'Document Title' in layout_data['elements'][0]['content']

    @patch('app.services.ocr_service.PPStructureV3')
    def test_analyze_layout_with_table(self, mock_structure, sample_image_path):
        """Test layout analysis with table element"""
        # Create mock page result with table in markdown (PP-StructureV3 format)
        mock_page_result = Mock()
        mock_page_result.markdown = {
            'markdown_texts': '<table><tr><td>Cell 1</td></tr></table>',
            'markdown_images': {}
        }

        # PP-Structure predict() returns a list of page results
        mock_engine = Mock()
        mock_engine.predict.return_value = [mock_page_result]
        mock_structure.return_value = mock_engine

        service = OCRService()
        layout_data, images_metadata = service.analyze_layout(sample_image_path)

        assert layout_data is not None
        assert layout_data['elements'][0]['type'] == 'table'
        # Content should contain the HTML table
        assert '<table>' in layout_data['elements'][0]['content']

    @patch('app.services.ocr_service.PPStructureV3')
    def test_analyze_layout_error_handling(self, mock_structure, sample_image_path):
        """Test error handling in layout analysis"""
        mock_engine = Mock()
        mock_engine.side_effect = Exception("Structure analysis error")
        mock_structure.return_value = mock_engine

        service = OCRService()
        layout_data, images_metadata = service.analyze_layout(sample_image_path)

        assert layout_data is None
        assert images_metadata == []


@pytest.mark.unit
class TestMarkdownGeneration:
    """Test Markdown generation"""

    def test_generate_markdown_from_text_regions(self):
        """Test Markdown generation from text regions only"""
        service = OCRService()

        text_regions = [
            {'text': 'First line', 'bbox': [[10, 10], [100, 10], [100, 30], [10, 30]]},
            {'text': 'Second line', 'bbox': [[10, 40], [100, 40], [100, 60], [10, 60]]},
            {'text': 'Third line', 'bbox': [[10, 70], [100, 70], [100, 90], [10, 90]]},
        ]

        markdown = service.generate_markdown(text_regions)

        assert 'First line' in markdown
        assert 'Second line' in markdown
        assert 'Third line' in markdown

    def test_generate_markdown_with_layout(self):
        """Test Markdown generation with layout information"""
        service = OCRService()

        text_regions = []
        layout_data = {
            'elements': [
                {'type': 'title', 'content': 'Document Title'},
                {'type': 'text', 'content': 'Paragraph text'},
                {'type': 'figure', 'element_id': 0},
            ]
        }

        markdown = service.generate_markdown(text_regions, layout_data)

        assert '# Document Title' in markdown
        assert 'Paragraph text' in markdown
        assert '![Figure 0]' in markdown

    def test_generate_markdown_with_table(self):
        """Test Markdown generation with table"""
        service = OCRService()

        layout_data = {
            'elements': [
                {
                    'type': 'table',
                    'content': '<table><tr><td>Cell</td></tr></table>'
                }
            ]
        }

        markdown = service.generate_markdown([], layout_data)

        assert '<table>' in markdown

    def test_generate_markdown_empty_input(self):
        """Test Markdown generation with empty input"""
        service = OCRService()

        markdown = service.generate_markdown([])

        assert markdown == ""

    def test_generate_markdown_sorts_by_position(self):
        """Test that text regions are sorted by vertical position"""
        service = OCRService()

        # Create text regions in reverse order
        text_regions = [
            {'text': 'Bottom', 'bbox': [[10, 90], [100, 90], [100, 110], [10, 110]]},
            {'text': 'Top', 'bbox': [[10, 10], [100, 10], [100, 30], [10, 30]]},
            {'text': 'Middle', 'bbox': [[10, 50], [100, 50], [100, 70], [10, 70]]},
        ]

        markdown = service.generate_markdown(text_regions)
        lines = markdown.strip().split('\n')

        # Should be sorted top to bottom
        assert lines[0] == 'Top'
        assert lines[1] == 'Middle'
        assert lines[2] == 'Bottom'
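

# The sorting test above fixes the reading order: regions are emitted top to
# bottom by their bounding box. A hypothetical one-liner consistent with it
# (bbox is [[x1, y1], [x2, y1], [x2, y2], [x1, y2]], so bbox[0][1] is the
# top edge; the real ordering is done inside OCRService.generate_markdown):
def _sort_regions_sketch(text_regions):
    """Hypothetical vertical ordering implied by test_generate_markdown_sorts_by_position."""
    return sorted(text_regions, key=lambda region: region['bbox'][0][1])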


@pytest.mark.unit
class TestSaveResults:
    """Test saving OCR results"""

    def test_save_results_success(self, temp_dir):
        """Test successful saving of results"""
        service = OCRService()

        result = {
            'status': 'success',
            'file_name': 'test.png',
            'text_regions': [{'text': 'Hello', 'confidence': 0.95}],
            'markdown_content': '# Hello\n\nTest content',
        }

        json_path, md_path = service.save_results(result, temp_dir, 'test123')

        assert json_path is not None
        assert md_path is not None
        assert json_path.exists()
        assert md_path.exists()

        # Verify JSON content
        with open(json_path, 'r') as f:
            saved_result = json.load(f)
        assert saved_result['file_name'] == 'test.png'

        # Verify Markdown content
        md_content = md_path.read_text()
        assert 'Hello' in md_content

    def test_save_results_creates_directory(self, temp_dir):
        """Test that save_results creates output directory if needed"""
        service = OCRService()
        output_dir = temp_dir / "subdir" / "results"

        result = {
            'status': 'success',
            'markdown_content': 'Test',
        }

        json_path, md_path = service.save_results(result, output_dir, 'test')

        assert output_dir.exists()
        assert json_path.exists()

    def test_save_results_handles_unicode(self, temp_dir):
        """Test saving results with Unicode characters"""
        service = OCRService()

        result = {
            'status': 'success',
            'text_regions': [{'text': '你好世界', 'confidence': 0.95}],
            'markdown_content': '# 你好世界\n\n测试内容',
        }

        json_path, md_path = service.save_results(result, temp_dir, 'unicode_test')

        # Verify Unicode is preserved
        with open(json_path, 'r', encoding='utf-8') as f:
            saved_result = json.load(f)
        assert saved_result['text_regions'][0]['text'] == '你好世界'

        md_content = md_path.read_text(encoding='utf-8')
        assert '你好世界' in md_content


@pytest.mark.unit
class TestEdgeCases:
    """Test edge cases and error handling"""

    @patch('app.services.ocr_service.PaddleOCR')
    def test_process_image_with_none_results(self, mock_paddle_ocr, sample_image_path):
        """Test processing when OCR returns None"""
        mock_engine = Mock()
        mock_engine.ocr.return_value = None
        mock_paddle_ocr.return_value = mock_engine

        service = OCRService()
        result = service.process_image(sample_image_path, detect_layout=False)

        assert result['status'] == 'success'
        assert result['total_text_regions'] == 0

    @patch('app.services.ocr_service.PaddleOCR')
    def test_process_image_with_custom_threshold(self, mock_paddle_ocr, sample_image_path):
        """Test processing with custom confidence threshold"""
        # PaddleOCR 3.x format
        mock_ocr_results = [{
            'rec_texts': ['Text'],
            'rec_scores': [0.85],
            'rec_polys': [[[10, 10], [100, 10], [100, 30], [10, 30]]]
        }]

        mock_engine = Mock()
        mock_engine.ocr.return_value = mock_ocr_results
        mock_paddle_ocr.return_value = mock_engine

        service = OCRService()

        # With high threshold - should filter out
        result_high = service.process_image(
            sample_image_path,
            detect_layout=False,
            confidence_threshold=0.90
        )
        assert result_high['total_text_regions'] == 0

        # With low threshold - should include
        result_low = service.process_image(
            sample_image_path,
            detect_layout=False,
            confidence_threshold=0.80
        )
        assert result_low['total_text_regions'] == 1


# Integration tests that require actual PaddleOCR models
@pytest.mark.requires_models
@pytest.mark.slow
class TestOCRServiceIntegration:
    """
    Integration tests that require actual PaddleOCR models
    These tests will download models (~900MB) on first run
    Run with: pytest -m requires_models
    """

    def test_real_ocr_engine_initialization(self):
        """Test real PaddleOCR engine initialization"""
        service = OCRService()
        engine = service.get_ocr_engine(lang='en')

        assert engine is not None
|
||||
assert hasattr(engine, 'ocr')
|
||||
|
||||
def test_real_structure_engine_initialization(self):
|
||||
"""Test real PP-Structure engine initialization"""
|
||||
service = OCRService()
|
||||
engine = service.get_structure_engine()
|
||||
|
||||
assert engine is not None
|
||||
|
||||
def test_real_image_processing(self, sample_image_with_text):
|
||||
"""Test processing real image with text"""
|
||||
service = OCRService()
|
||||
result = service.process_image(sample_image_with_text, lang='en')
|
||||
|
||||
assert result['status'] == 'success'
|
||||
assert result['total_text_regions'] > 0
|
||||
@@ -1,559 +0,0 @@
"""
Tool_OCR - PDF Generator Unit Tests
Tests for app/services/pdf_generator.py
"""

import pytest
from pathlib import Path
from unittest.mock import Mock, patch, MagicMock
import subprocess

from app.services.pdf_generator import PDFGenerator, PDFGenerationError


@pytest.mark.unit
class TestPDFGeneratorInit:
    """Test PDF generator initialization"""

    def test_init(self):
        """Test PDF generator initialization"""
        generator = PDFGenerator()

        assert generator is not None
        assert hasattr(generator, 'css_templates')
        assert len(generator.css_templates) == 3
        assert 'default' in generator.css_templates
        assert 'academic' in generator.css_templates
        assert 'business' in generator.css_templates

    def test_css_templates_have_content(self):
        """Test that CSS templates contain content"""
        generator = PDFGenerator()

        for template_name, css_content in generator.css_templates.items():
            assert isinstance(css_content, str)
            assert len(css_content) > 100
            assert '@page' in css_content
            assert 'body' in css_content


@pytest.mark.unit
class TestPandocAvailability:
    """Test Pandoc availability checking"""

    @patch('subprocess.run')
    def test_check_pandoc_available_success(self, mock_run):
        """Test Pandoc availability check when pandoc is installed"""
        mock_run.return_value = Mock(returncode=0, stdout="pandoc 2.x")

        generator = PDFGenerator()
        is_available = generator.check_pandoc_available()

        assert is_available is True
        mock_run.assert_called_once()
        assert mock_run.call_args[0][0] == ["pandoc", "--version"]

    @patch('subprocess.run')
    def test_check_pandoc_available_not_found(self, mock_run):
        """Test Pandoc availability check when pandoc is not installed"""
        mock_run.side_effect = FileNotFoundError()

        generator = PDFGenerator()
        is_available = generator.check_pandoc_available()

        assert is_available is False

    @patch('subprocess.run')
    def test_check_pandoc_available_timeout(self, mock_run):
        """Test Pandoc availability check when command times out"""
        mock_run.side_effect = subprocess.TimeoutExpired("pandoc", 5)

        generator = PDFGenerator()
        is_available = generator.check_pandoc_available()

        assert is_available is False


@pytest.mark.unit
class TestPandocPDFGeneration:
    """Test PDF generation using Pandoc"""

    @pytest.fixture
    def sample_markdown(self, temp_dir):
        """Create a sample Markdown file"""
        md_file = temp_dir / "sample.md"
        md_file.write_text("# Test Document\n\nThis is a test.", encoding="utf-8")
        return md_file

    @patch('subprocess.run')
    def test_generate_pdf_pandoc_success(self, mock_run, sample_markdown, temp_dir):
        """Test successful PDF generation with Pandoc"""
        output_path = temp_dir / "output.pdf"
        mock_run.return_value = Mock(returncode=0, stderr="")

        # Create the output file to simulate successful generation
        output_path.touch()

        generator = PDFGenerator()
        result = generator.generate_pdf_pandoc(sample_markdown, output_path)

        assert result == output_path
        assert output_path.exists()
        mock_run.assert_called_once()

        # Verify pandoc command structure
        cmd_args = mock_run.call_args[0][0]
        assert "pandoc" in cmd_args
        assert str(sample_markdown) in cmd_args
        assert str(output_path) in cmd_args
        assert "--pdf-engine=weasyprint" in cmd_args

    @patch('subprocess.run')
    def test_generate_pdf_pandoc_with_metadata(self, mock_run, sample_markdown, temp_dir):
        """Test Pandoc PDF generation with metadata"""
        output_path = temp_dir / "output.pdf"
        mock_run.return_value = Mock(returncode=0, stderr="")
        output_path.touch()

        metadata = {
            "title": "Test Title",
            "author": "Test Author",
            "date": "2025-01-01"
        }

        generator = PDFGenerator()
        result = generator.generate_pdf_pandoc(
            sample_markdown,
            output_path,
            metadata=metadata
        )

        assert result == output_path

        # Verify metadata in command
        cmd_args = mock_run.call_args[0][0]
        assert "--metadata" in cmd_args
        assert "title=Test Title" in cmd_args
        assert "author=Test Author" in cmd_args
        assert "date=2025-01-01" in cmd_args

    @patch('subprocess.run')
    def test_generate_pdf_pandoc_with_custom_css(self, mock_run, sample_markdown, temp_dir):
        """Test Pandoc PDF generation with custom CSS template"""
        output_path = temp_dir / "output.pdf"
        mock_run.return_value = Mock(returncode=0, stderr="")
        output_path.touch()

        generator = PDFGenerator()
        result = generator.generate_pdf_pandoc(
            sample_markdown,
            output_path,
            css_template="academic"
        )

        assert result == output_path
        mock_run.assert_called_once()

    @patch('subprocess.run')
    def test_generate_pdf_pandoc_command_failed(self, mock_run, sample_markdown, temp_dir):
        """Test Pandoc PDF generation when command fails"""
        output_path = temp_dir / "output.pdf"
        mock_run.return_value = Mock(returncode=1, stderr="Pandoc error message")

        generator = PDFGenerator()

        with pytest.raises(PDFGenerationError) as exc_info:
            generator.generate_pdf_pandoc(sample_markdown, output_path)

        assert "Pandoc failed" in str(exc_info.value)
        assert "Pandoc error message" in str(exc_info.value)

    @patch('subprocess.run')
    def test_generate_pdf_pandoc_timeout(self, mock_run, sample_markdown, temp_dir):
        """Test Pandoc PDF generation timeout"""
        output_path = temp_dir / "output.pdf"
        mock_run.side_effect = subprocess.TimeoutExpired("pandoc", 60)

        generator = PDFGenerator()

        with pytest.raises(PDFGenerationError) as exc_info:
            generator.generate_pdf_pandoc(sample_markdown, output_path)

        assert "timed out" in str(exc_info.value).lower()

    @patch('subprocess.run')
    def test_generate_pdf_pandoc_output_not_created(self, mock_run, sample_markdown, temp_dir):
        """Test when Pandoc command succeeds but output file not created"""
        output_path = temp_dir / "output.pdf"
        mock_run.return_value = Mock(returncode=0, stderr="")
        # Don't create output file

        generator = PDFGenerator()

        with pytest.raises(PDFGenerationError) as exc_info:
            generator.generate_pdf_pandoc(sample_markdown, output_path)

        assert "PDF file not created" in str(exc_info.value)


@pytest.mark.unit
class TestWeasyPrintPDFGeneration:
    """Test PDF generation using WeasyPrint directly"""

    @pytest.fixture
    def sample_markdown(self, temp_dir):
        """Create a sample Markdown file"""
        md_file = temp_dir / "sample.md"
        md_file.write_text("# Test Document\n\nThis is a test.", encoding="utf-8")
        return md_file

    @patch('app.services.pdf_generator.HTML')
    @patch('app.services.pdf_generator.CSS')
    def test_generate_pdf_weasyprint_success(self, mock_css, mock_html, sample_markdown, temp_dir):
        """Test successful PDF generation with WeasyPrint"""
        output_path = temp_dir / "output.pdf"

        # Mock HTML and CSS objects
        mock_html_instance = Mock()
        mock_html_instance.write_pdf = Mock()
        mock_html.return_value = mock_html_instance

        # Create output file to simulate successful generation
        def create_pdf(*args, **kwargs):
            output_path.touch()

        mock_html_instance.write_pdf.side_effect = create_pdf

        generator = PDFGenerator()
        result = generator.generate_pdf_weasyprint(sample_markdown, output_path)

        assert result == output_path
        assert output_path.exists()
        mock_html.assert_called_once()
        mock_css.assert_called_once()
        mock_html_instance.write_pdf.assert_called_once()

    @patch('app.services.pdf_generator.HTML')
    @patch('app.services.pdf_generator.CSS')
    def test_generate_pdf_weasyprint_with_metadata(self, mock_css, mock_html, sample_markdown, temp_dir):
        """Test WeasyPrint PDF generation with metadata"""
        output_path = temp_dir / "output.pdf"

        mock_html_instance = Mock()
        mock_html_instance.write_pdf = Mock()
        mock_html.return_value = mock_html_instance

        def create_pdf(*args, **kwargs):
            output_path.touch()

        mock_html_instance.write_pdf.side_effect = create_pdf

        metadata = {
            "title": "Test Title",
            "author": "Test Author"
        }

        generator = PDFGenerator()
        result = generator.generate_pdf_weasyprint(
            sample_markdown,
            output_path,
            metadata=metadata
        )

        assert result == output_path

        # Check that HTML string includes title
        html_call_args = mock_html.call_args
        assert html_call_args[1]['string'] is not None
        assert "Test Title" in html_call_args[1]['string']

    @patch('app.services.pdf_generator.HTML')
    def test_generate_pdf_weasyprint_markdown_conversion(self, mock_html, sample_markdown, temp_dir):
        """Test that Markdown is properly converted to HTML"""
        output_path = temp_dir / "output.pdf"

        captured_html = None

        def capture_html(string, **kwargs):
            nonlocal captured_html
            captured_html = string
            mock_instance = Mock()
            mock_instance.write_pdf = Mock(side_effect=lambda *args, **kwargs: output_path.touch())
            return mock_instance

        mock_html.side_effect = capture_html

        generator = PDFGenerator()
        generator.generate_pdf_weasyprint(sample_markdown, output_path)

        # Verify HTML structure
        assert captured_html is not None
        assert "<!DOCTYPE html>" in captured_html
        assert "<h1>Test Document</h1>" in captured_html
        assert "<p>This is a test.</p>" in captured_html

    @patch('app.services.pdf_generator.HTML')
    @patch('app.services.pdf_generator.CSS')
    def test_generate_pdf_weasyprint_with_template(self, mock_css, mock_html, sample_markdown, temp_dir):
        """Test WeasyPrint PDF generation with different templates"""
        output_path = temp_dir / "output.pdf"

        mock_html_instance = Mock()
        mock_html_instance.write_pdf = Mock()
        mock_html.return_value = mock_html_instance

        def create_pdf(*args, **kwargs):
            output_path.touch()

        mock_html_instance.write_pdf.side_effect = create_pdf

        generator = PDFGenerator()

        # Test academic template
        generator.generate_pdf_weasyprint(
            sample_markdown,
            output_path,
            css_template="academic"
        )

        # Verify CSS was called with academic template content
        css_call_args = mock_css.call_args
        assert css_call_args[1]['string'] is not None
        assert "Times New Roman" in css_call_args[1]['string']

    @patch('app.services.pdf_generator.HTML')
    def test_generate_pdf_weasyprint_error_handling(self, mock_html, sample_markdown, temp_dir):
        """Test WeasyPrint error handling"""
        output_path = temp_dir / "output.pdf"

        mock_html.side_effect = Exception("WeasyPrint rendering error")

        generator = PDFGenerator()

        with pytest.raises(PDFGenerationError) as exc_info:
            generator.generate_pdf_weasyprint(sample_markdown, output_path)

        assert "WeasyPrint PDF generation failed" in str(exc_info.value)


@pytest.mark.unit
class TestUnifiedPDFGeneration:
    """Test unified PDF generation with automatic fallback"""

    @pytest.fixture
    def sample_markdown(self, temp_dir):
        """Create a sample Markdown file"""
        md_file = temp_dir / "sample.md"
        md_file.write_text("# Test Document\n\nTest content.", encoding="utf-8")
        return md_file

    def test_generate_pdf_nonexistent_markdown(self, temp_dir):
        """Test error when Markdown file doesn't exist"""
        nonexistent = temp_dir / "nonexistent.md"
        output_path = temp_dir / "output.pdf"

        generator = PDFGenerator()

        with pytest.raises(PDFGenerationError) as exc_info:
            generator.generate_pdf(nonexistent, output_path)

        assert "not found" in str(exc_info.value).lower()

    @patch.object(PDFGenerator, 'check_pandoc_available')
    @patch.object(PDFGenerator, 'generate_pdf_pandoc')
    def test_generate_pdf_prefers_pandoc(self, mock_pandoc_gen, mock_check, sample_markdown, temp_dir):
        """Test that Pandoc is preferred when available"""
        output_path = temp_dir / "output.pdf"
        output_path.touch()

        mock_check.return_value = True
        mock_pandoc_gen.return_value = output_path

        generator = PDFGenerator()
        result = generator.generate_pdf(sample_markdown, output_path, prefer_pandoc=True)

        assert result == output_path
        mock_check.assert_called_once()
        mock_pandoc_gen.assert_called_once()

    @patch.object(PDFGenerator, 'check_pandoc_available')
    @patch.object(PDFGenerator, 'generate_pdf_weasyprint')
    def test_generate_pdf_uses_weasyprint_when_pandoc_unavailable(
        self, mock_weasy_gen, mock_check, sample_markdown, temp_dir
    ):
        """Test fallback to WeasyPrint when Pandoc unavailable"""
        output_path = temp_dir / "output.pdf"
        output_path.touch()

        mock_check.return_value = False
        mock_weasy_gen.return_value = output_path

        generator = PDFGenerator()
        result = generator.generate_pdf(sample_markdown, output_path, prefer_pandoc=True)

        assert result == output_path
        mock_check.assert_called_once()
        mock_weasy_gen.assert_called_once()

    @patch.object(PDFGenerator, 'check_pandoc_available')
    @patch.object(PDFGenerator, 'generate_pdf_pandoc')
    @patch.object(PDFGenerator, 'generate_pdf_weasyprint')
    def test_generate_pdf_fallback_on_pandoc_failure(
        self, mock_weasy_gen, mock_pandoc_gen, mock_check, sample_markdown, temp_dir
    ):
        """Test automatic fallback to WeasyPrint when Pandoc fails"""
        output_path = temp_dir / "output.pdf"
        output_path.touch()

        mock_check.return_value = True
        mock_pandoc_gen.side_effect = PDFGenerationError("Pandoc failed")
        mock_weasy_gen.return_value = output_path

        generator = PDFGenerator()
        result = generator.generate_pdf(sample_markdown, output_path, prefer_pandoc=True)

        assert result == output_path
        mock_pandoc_gen.assert_called_once()
        mock_weasy_gen.assert_called_once()

    @patch.object(PDFGenerator, 'check_pandoc_available')
    @patch.object(PDFGenerator, 'generate_pdf_weasyprint')
    def test_generate_pdf_creates_output_directory(
        self, mock_weasy_gen, mock_check, sample_markdown, temp_dir
    ):
        """Test that output directory is created if needed"""
        output_dir = temp_dir / "subdir" / "outputs"
        output_path = output_dir / "output.pdf"
        output_path.parent.mkdir(parents=True, exist_ok=True)
        output_path.touch()

        mock_check.return_value = False
        mock_weasy_gen.return_value = output_path

        generator = PDFGenerator()
        result = generator.generate_pdf(sample_markdown, output_path)

        assert output_dir.exists()
        assert result == output_path


@pytest.mark.unit
class TestTemplateManagement:
    """Test CSS template management"""

    def test_get_available_templates(self):
        """Test retrieving available templates"""
        generator = PDFGenerator()
        templates = generator.get_available_templates()

        assert isinstance(templates, dict)
        assert len(templates) == 3
        assert "default" in templates
        assert "academic" in templates
        assert "business" in templates

        # Check descriptions are in Chinese
        for desc in templates.values():
            assert isinstance(desc, str)
            assert len(desc) > 0

    def test_save_custom_template(self):
        """Test saving a custom CSS template"""
        generator = PDFGenerator()

        custom_css = "@page { size: A4; }"
        generator.save_custom_template("custom", custom_css)

        assert "custom" in generator.css_templates
        assert generator.css_templates["custom"] == custom_css

    def test_save_custom_template_overwrites_existing(self):
        """Test that saving custom template can overwrite existing"""
        generator = PDFGenerator()

        new_css = "@page { size: Letter; }"
        generator.save_custom_template("default", new_css)

        assert generator.css_templates["default"] == new_css


@pytest.mark.unit
class TestEdgeCases:
    """Test edge cases and error handling"""

    @pytest.fixture
    def sample_markdown(self, temp_dir):
        """Create a sample Markdown file"""
        md_file = temp_dir / "sample.md"
        md_file.write_text("# Test", encoding="utf-8")
        return md_file

    @patch('app.services.pdf_generator.HTML')
    @patch('app.services.pdf_generator.CSS')
    def test_generate_with_unicode_content(self, mock_css, mock_html, temp_dir):
        """Test PDF generation with Unicode/Chinese content"""
        md_file = temp_dir / "unicode.md"
        md_file.write_text("# 測試文檔\n\n這是中文內容。", encoding="utf-8")
        output_path = temp_dir / "output.pdf"

        captured_html = None

        def capture_html(string, **kwargs):
            nonlocal captured_html
            captured_html = string
            mock_instance = Mock()
            mock_instance.write_pdf = Mock(side_effect=lambda *args, **kwargs: output_path.touch())
            return mock_instance

        mock_html.side_effect = capture_html

        generator = PDFGenerator()
        result = generator.generate_pdf_weasyprint(md_file, output_path)

        assert result == output_path
        assert "測試文檔" in captured_html
        assert "中文內容" in captured_html

    @patch('app.services.pdf_generator.HTML')
    @patch('app.services.pdf_generator.CSS')
    def test_generate_with_table_markdown(self, mock_css, mock_html, temp_dir):
        """Test PDF generation with Markdown tables"""
        md_file = temp_dir / "table.md"
        md_content = """
# Document with Table

| Column 1 | Column 2 |
|----------|----------|
| Data 1   | Data 2   |
"""
        md_file.write_text(md_content, encoding="utf-8")
        output_path = temp_dir / "output.pdf"

        captured_html = None

        def capture_html(string, **kwargs):
            nonlocal captured_html
            captured_html = string
            mock_instance = Mock()
            mock_instance.write_pdf = Mock(side_effect=lambda *args, **kwargs: output_path.touch())
            return mock_instance

        mock_html.side_effect = capture_html

        generator = PDFGenerator()
        result = generator.generate_pdf_weasyprint(md_file, output_path)

        assert result == output_path
        # Markdown tables should be converted to HTML tables
        assert "<table>" in captured_html
        assert "<th>" in captured_html or "<td>" in captured_html

    def test_custom_css_string_not_in_templates(self, sample_markdown, temp_dir):
        """Test using custom CSS string that's not a template name"""
        generator = PDFGenerator()

        # This should work - treat as custom CSS string
        custom_css = "body { font-size: 20pt; }"

        # When CSS template is not in templates dict, it should be used as-is
        assert custom_css not in generator.css_templates.values()
@@ -1,350 +0,0 @@
"""
Tool_OCR - Document Preprocessor Unit Tests
Tests for app/services/preprocessor.py
"""

import pytest
from pathlib import Path
from PIL import Image

from app.services.preprocessor import DocumentPreprocessor


@pytest.mark.unit
class TestDocumentPreprocessor:
    """Test suite for DocumentPreprocessor"""

    def test_init(self, preprocessor):
        """Test preprocessor initialization"""
        assert preprocessor is not None
        assert preprocessor.max_file_size > 0
        assert len(preprocessor.allowed_extensions) > 0
        assert 'png' in preprocessor.allowed_extensions
        assert 'jpg' in preprocessor.allowed_extensions
        assert 'pdf' in preprocessor.allowed_extensions

    def test_supported_formats(self, preprocessor):
        """Test that all expected formats are supported"""
        expected_image_formats = ['png', 'jpg', 'jpeg', 'bmp', 'tiff', 'tif']
        expected_pdf_format = ['pdf']

        for fmt in expected_image_formats:
            assert fmt in preprocessor.SUPPORTED_IMAGE_FORMATS

        for fmt in expected_pdf_format:
            assert fmt in preprocessor.SUPPORTED_PDF_FORMAT

        all_formats = expected_image_formats + expected_pdf_format
        assert set(preprocessor.ALL_SUPPORTED_FORMATS) == set(all_formats)


@pytest.mark.unit
class TestFileValidation:
    """Test file validation methods"""

    def test_validate_valid_png(self, preprocessor, sample_image_path):
        """Test validation of a valid PNG file"""
        is_valid, file_format, error = preprocessor.validate_file(sample_image_path)

        assert is_valid is True
        assert file_format == 'png'
        assert error is None

    def test_validate_valid_jpg(self, preprocessor, sample_jpg_path):
        """Test validation of a valid JPG file"""
        is_valid, file_format, error = preprocessor.validate_file(sample_jpg_path)

        assert is_valid is True
        assert file_format == 'jpg'
        assert error is None

    def test_validate_valid_pdf(self, preprocessor, sample_pdf_path):
        """Test validation of a valid PDF file"""
        is_valid, file_format, error = preprocessor.validate_file(sample_pdf_path)

        assert is_valid is True
        assert file_format == 'pdf'
        assert error is None

    def test_validate_nonexistent_file(self, preprocessor, temp_dir):
        """Test validation of a non-existent file"""
        fake_path = temp_dir / "nonexistent.png"
        is_valid, file_format, error = preprocessor.validate_file(fake_path)

        assert is_valid is False
        assert file_format is None
        assert "not found" in error.lower()

    def test_validate_large_file(self, preprocessor, large_file_path):
        """Test validation of a file exceeding size limit"""
        is_valid, file_format, error = preprocessor.validate_file(large_file_path)

        assert is_valid is False
        assert file_format is None
        assert "too large" in error.lower()

    def test_validate_unsupported_format(self, preprocessor, unsupported_file_path):
        """Test validation of unsupported file format"""
        is_valid, file_format, error = preprocessor.validate_file(unsupported_file_path)

        assert is_valid is False
        assert "not allowed" in error.lower() or "unsupported" in error.lower()

    def test_validate_corrupted_image(self, preprocessor, corrupted_image_path):
        """Test validation of a corrupted image file"""
        is_valid, file_format, error = preprocessor.validate_file(corrupted_image_path)

        assert is_valid is False
        assert error is not None
        # Corrupted files may be detected as unsupported type or corrupted
        assert ("corrupted" in error.lower() or
                "unsupported" in error.lower() or
                "not allowed" in error.lower())


@pytest.mark.unit
class TestMimeTypeMapping:
    """Test MIME type to format mapping"""

    def test_mime_to_format_png(self, preprocessor):
        """Test PNG MIME type mapping"""
        assert preprocessor._mime_to_format('image/png') == 'png'

    def test_mime_to_format_jpeg(self, preprocessor):
        """Test JPEG MIME type mapping"""
        assert preprocessor._mime_to_format('image/jpeg') == 'jpg'
        assert preprocessor._mime_to_format('image/jpg') == 'jpg'

    def test_mime_to_format_pdf(self, preprocessor):
        """Test PDF MIME type mapping"""
        assert preprocessor._mime_to_format('application/pdf') == 'pdf'

    def test_mime_to_format_tiff(self, preprocessor):
        """Test TIFF MIME type mapping"""
        assert preprocessor._mime_to_format('image/tiff') == 'tiff'
        assert preprocessor._mime_to_format('image/x-tiff') == 'tiff'

    def test_mime_to_format_bmp(self, preprocessor):
        """Test BMP MIME type mapping"""
        assert preprocessor._mime_to_format('image/bmp') == 'bmp'

    def test_mime_to_format_unknown(self, preprocessor):
        """Test unknown MIME type returns None"""
        assert preprocessor._mime_to_format('unknown/type') is None
        assert preprocessor._mime_to_format('text/plain') is None


@pytest.mark.unit
class TestIntegrityValidation:
    """Test file integrity validation"""

    def test_validate_integrity_valid_png(self, preprocessor, sample_image_path):
        """Test integrity check for valid PNG"""
        is_valid, error = preprocessor._validate_integrity(sample_image_path, 'png')

        assert is_valid is True
        assert error is None

    def test_validate_integrity_valid_jpg(self, preprocessor, sample_jpg_path):
        """Test integrity check for valid JPG"""
        is_valid, error = preprocessor._validate_integrity(sample_jpg_path, 'jpg')

        assert is_valid is True
        assert error is None

    def test_validate_integrity_valid_pdf(self, preprocessor, sample_pdf_path):
        """Test integrity check for valid PDF"""
        is_valid, error = preprocessor._validate_integrity(sample_pdf_path, 'pdf')

        assert is_valid is True
        assert error is None

    def test_validate_integrity_corrupted_image(self, preprocessor, corrupted_image_path):
        """Test integrity check for corrupted image"""
        is_valid, error = preprocessor._validate_integrity(corrupted_image_path, 'png')

        assert is_valid is False
        assert error is not None

    def test_validate_integrity_invalid_pdf_header(self, preprocessor, temp_dir):
        """Test integrity check for PDF with invalid header"""
        invalid_pdf = temp_dir / "invalid.pdf"
        with open(invalid_pdf, 'wb') as f:
            f.write(b'Not a PDF file')

        is_valid, error = preprocessor._validate_integrity(invalid_pdf, 'pdf')

        assert is_valid is False
        assert "invalid" in error.lower() or "header" in error.lower()

    def test_validate_integrity_unknown_format(self, preprocessor, temp_dir):
        """Test integrity check for unknown format"""
        test_file = temp_dir / "test.xyz"
        test_file.write_text("test")

        is_valid, error = preprocessor._validate_integrity(test_file, 'xyz')

        assert is_valid is False
        assert error is not None


@pytest.mark.unit
class TestImagePreprocessing:
    """Test image preprocessing functionality"""

    def test_preprocess_image_without_enhancement(self, preprocessor, sample_image_path):
        """Test preprocessing without enhancement (returns original)"""
        success, output_path, error = preprocessor.preprocess_image(
            sample_image_path,
            enhance=False
        )

        assert success is True
        assert output_path == sample_image_path
        assert error is None

    def test_preprocess_image_with_enhancement(self, preprocessor, sample_image_with_text, temp_dir):
        """Test preprocessing with enhancement"""
        output_path = temp_dir / "processed.png"

        success, result_path, error = preprocessor.preprocess_image(
            sample_image_with_text,
            enhance=True,
            output_path=output_path
        )

        assert success is True
        assert result_path == output_path
        assert result_path.exists()
        assert error is None

        # Verify the output is a valid image
        with Image.open(result_path) as img:
            assert img.size[0] > 0
            assert img.size[1] > 0

    def test_preprocess_image_auto_output_path(self, preprocessor, sample_image_with_text):
        """Test preprocessing with automatic output path"""
        success, result_path, error = preprocessor.preprocess_image(
            sample_image_with_text,
            enhance=True
        )

        assert success is True
        assert result_path is not None
        assert result_path.exists()
        assert "processed_" in result_path.name
        assert error is None

    def test_preprocess_nonexistent_image(self, preprocessor, temp_dir):
        """Test preprocessing with non-existent image"""
        fake_path = temp_dir / "nonexistent.png"

        success, result_path, error = preprocessor.preprocess_image(
            fake_path,
            enhance=True
        )

        assert success is False
        assert result_path is None
        assert error is not None

    def test_preprocess_corrupted_image(self, preprocessor, corrupted_image_path):
        """Test preprocessing with corrupted image"""
        success, result_path, error = preprocessor.preprocess_image(
            corrupted_image_path,
            enhance=True
        )

        assert success is False
        assert result_path is None
        assert error is not None


@pytest.mark.unit
class TestFileInfo:
    """Test file information retrieval"""

    def test_get_file_info_png(self, preprocessor, sample_image_path):
        """Test getting file info for PNG"""
        info = preprocessor.get_file_info(sample_image_path)

        assert info['name'] == sample_image_path.name
        assert info['path'] == str(sample_image_path)
        assert info['size'] > 0
        assert info['size_mb'] > 0
        assert info['mime_type'] == 'image/png'
        assert info['format'] == 'png'
        assert 'created_at' in info
        assert 'modified_at' in info

    def test_get_file_info_jpg(self, preprocessor, sample_jpg_path):
        """Test getting file info for JPG"""
        info = preprocessor.get_file_info(sample_jpg_path)

        assert info['name'] == sample_jpg_path.name
        assert info['mime_type'] == 'image/jpeg'
        assert info['format'] == 'jpg'

    def test_get_file_info_pdf(self, preprocessor, sample_pdf_path):
        """Test getting file info for PDF"""
        info = preprocessor.get_file_info(sample_pdf_path)

        assert info['name'] == sample_pdf_path.name
        assert info['mime_type'] == 'application/pdf'
        assert info['format'] == 'pdf'

    def test_get_file_info_size_calculation(self, preprocessor, sample_image_path):
        """Test that file size is correctly calculated"""
        info = preprocessor.get_file_info(sample_image_path)

        actual_size = sample_image_path.stat().st_size
        assert info['size'] == actual_size
        assert abs(info['size_mb'] - (actual_size / (1024 * 1024))) < 0.001


@pytest.mark.unit
class TestEdgeCases:
    """Test edge cases and error handling"""

    def test_validate_empty_file(self, preprocessor, temp_dir):
        """Test validation of empty file"""
        empty_file = temp_dir / "empty.png"
        empty_file.touch()

        is_valid, file_format, error = preprocessor.validate_file(empty_file)

        # Should fail because empty file has no valid MIME type or is corrupted
        assert is_valid is False

    def test_validate_file_with_wrong_extension(self, preprocessor, temp_dir):
        """Test validation of file with misleading extension"""
        # Create a PNG file but name it .txt
        misleading_file = temp_dir / "image.txt"
        img = Image.new('RGB', (10, 10), color='white')
        img.save(misleading_file, 'PNG')

        # Validation uses MIME detection, not extension
        # So a PNG file named .txt should pass if PNG is in allowed_extensions
        is_valid, file_format, error = preprocessor.validate_file(misleading_file)

        # Should succeed because MIME detection finds it's a PNG
        # (preprocessor uses magic number detection, not file extension)
        assert is_valid is True
        assert file_format == 'png'

    def test_preprocess_very_small_image(self, preprocessor, temp_dir):
        """Test preprocessing of very small image"""
        small_image = temp_dir / "small.png"
        img = Image.new('RGB', (5, 5), color='white')
        img.save(small_image, 'PNG')

        success, result_path, error = preprocessor.preprocess_image(
            small_image,
            enhance=True
        )

        # Should succeed even with very small image
        assert success is True
        assert result_path is not None
        assert result_path.exists()
@@ -1,106 +0,0 @@
"""
Unit tests for task management endpoints
"""

import pytest
from app.models.task import Task


class TestTasks:
    """Test task management endpoints"""

    def test_create_task(self, client, auth_token):
        """Test task creation"""
        response = client.post(
            '/api/v2/tasks/',
            headers={'Authorization': f'Bearer {auth_token}'},
            json={
                'filename': 'test.pdf',
                'file_type': 'application/pdf'
            }
        )

        assert response.status_code == 201
        data = response.json()
        assert 'task_id' in data
        assert data['filename'] == 'test.pdf'
        assert data['status'] == 'pending'

    def test_list_tasks(self, client, auth_token, test_task):
        """Test listing user tasks"""
        response = client.get(
            '/api/v2/tasks/',
            headers={'Authorization': f'Bearer {auth_token}'}
        )

        assert response.status_code == 200
        data = response.json()
        assert 'tasks' in data
        assert 'total' in data
        assert len(data['tasks']) > 0

    def test_get_task(self, client, auth_token, test_task):
        """Test get single task"""
        response = client.get(
            f'/api/v2/tasks/{test_task.task_id}',
            headers={'Authorization': f'Bearer {auth_token}'}
        )

        assert response.status_code == 200
        data = response.json()
        assert data['task_id'] == test_task.task_id

    def test_get_task_stats(self, client, auth_token, test_task):
        """Test get task statistics"""
        response = client.get(
            '/api/v2/tasks/stats',
            headers={'Authorization': f'Bearer {auth_token}'}
        )

        assert response.status_code == 200
        data = response.json()
        assert 'total' in data
        assert 'pending' in data
        assert 'processing' in data
        assert 'completed' in data
        assert 'failed' in data

    def test_delete_task(self, client, auth_token, test_task):
        """Test task deletion"""
        response = client.delete(
            f'/api/v2/tasks/{test_task.task_id}',
            headers={'Authorization': f'Bearer {auth_token}'}
        )

        # DELETE should return 204 No Content (standard for successful deletion)
        assert response.status_code == 204

    def test_user_isolation(self, client, db, test_user):
        """Test that users can only access their own tasks"""
        # Create another user
        from app.models.user import User
        other_user = User(email="other@example.com", display_name="Other User")
        db.add(other_user)
        db.commit()

        # Create task for other user
        other_task = Task(
            user_id=other_user.id,
            task_id="other-task-123",
            filename="other.pdf",
            status="pending"
        )
        db.add(other_task)
        db.commit()

        # Create token for test_user
        from app.core.security import create_access_token
        token = create_access_token({"sub": str(test_user.id)})

        # Try to access other user's task
        response = client.get(
            f'/api/v2/tasks/{other_task.task_id}',
            headers={'Authorization': f'Bearer {token}'}
        )

        assert response.status_code == 404  # Task not found (user isolation)
@@ -1,100 +0,0 @@
#!/usr/bin/env python3
import zipfile
from pathlib import Path

# Create a minimal DOCX file
output_path = Path('/Users/egg/Projects/Tool_OCR/demo_docs/office_tests/test_document.docx')

# DOCX is a ZIP file containing XML files
with zipfile.ZipFile(output_path, 'w', zipfile.ZIP_DEFLATED) as docx:
    # [Content_Types].xml
    content_types = '''<?xml version="1.0" encoding="UTF-8" standalone="yes"?>
<Types xmlns="http://schemas.openxmlformats.org/package/2006/content-types">
<Default Extension="rels" ContentType="application/vnd.openxmlformats-package.relationships+xml"/>
<Default Extension="xml" ContentType="application/xml"/>
<Override PartName="/word/document.xml" ContentType="application/vnd.openxmlformats-officedocument.wordprocessingml.document.main+xml"/>
</Types>'''
    docx.writestr('[Content_Types].xml', content_types)

    # _rels/.rels
    rels = '''<?xml version="1.0" encoding="UTF-8" standalone="yes"?>
<Relationships xmlns="http://schemas.openxmlformats.org/package/2006/relationships">
<Relationship Id="rId1" Type="http://schemas.openxmlformats.org/officeDocument/2006/relationships/officeDocument" Target="word/document.xml"/>
</Relationships>'''
    docx.writestr('_rels/.rels', rels)

    # word/document.xml with Chinese and English content
    document = '''<?xml version="1.0" encoding="UTF-8" standalone="yes"?>
<w:document xmlns:w="http://schemas.openxmlformats.org/wordprocessingml/2006/main">
<w:body>
<w:p>
<w:pPr><w:pStyle w:val="Heading1"/></w:pPr>
<w:r><w:t>Office Document OCR Test</w:t></w:r>
</w:p>
<w:p>
<w:pPr><w:pStyle w:val="Heading2"/></w:pPr>
<w:r><w:t>測試文件說明</w:t></w:r>
</w:p>
<w:p>
<w:r><w:t>這是一個用於測試 Tool_OCR 系統 Office 文件支援功能的測試文件。</w:t></w:r>
</w:p>
<w:p>
<w:r><w:t>本系統現已支援以下 Office 格式:</w:t></w:r>
</w:p>
<w:p>
<w:r><w:t>• Microsoft Word: DOC, DOCX</w:t></w:r>
</w:p>
<w:p>
<w:r><w:t>• Microsoft PowerPoint: PPT, PPTX</w:t></w:r>
</w:p>
<w:p>
<w:pPr><w:pStyle w:val="Heading2"/></w:pPr>
<w:r><w:t>處理流程</w:t></w:r>
</w:p>
<w:p>
<w:r><w:t>Office 文件的處理流程如下:</w:t></w:r>
</w:p>
<w:p>
<w:r><w:t>1. 使用 LibreOffice 將 Office 文件轉換為 PDF</w:t></w:r>
</w:p>
<w:p>
<w:r><w:t>2. 將 PDF 轉換為圖片(每頁一張)</w:t></w:r>
</w:p>
<w:p>
<w:r><w:t>3. 使用 PaddleOCR 處理每張圖片</w:t></w:r>
</w:p>
<w:p>
<w:r><w:t>4. 合併所有頁面的 OCR 結果</w:t></w:r>
</w:p>
<w:p>
<w:pPr><w:pStyle w:val="Heading2"/></w:pPr>
<w:r><w:t>中英混合測試</w:t></w:r>
</w:p>
<w:p>
<w:r><w:t>This is a test for mixed Chinese and English OCR recognition.</w:t></w:r>
</w:p>
<w:p>
<w:r><w:t>測試中英文混合識別能力:1234567890</w:t></w:r>
</w:p>
<w:p>
<w:pPr><w:pStyle w:val="Heading2"/></w:pPr>
<w:r><w:t>Technical Information</w:t></w:r>
</w:p>
<w:p>
<w:r><w:t>System Version: Tool_OCR v1.0</w:t></w:r>
</w:p>
<w:p>
<w:r><w:t>Conversion Engine: LibreOffice Headless</w:t></w:r>
</w:p>
<w:p>
<w:r><w:t>OCR Engine: PaddleOCR</w:t></w:r>
</w:p>
<w:p>
<w:r><w:t>Token Validity: 24 hours (1440 minutes)</w:t></w:r>
</w:p>
</w:body>
</w:document>'''
    docx.writestr('word/document.xml', document)

print(f"Created DOCX file: {output_path}")
print(f"File size: {output_path.stat().st_size} bytes")
Binary file not shown.
@@ -1,64 +0,0 @@
<!DOCTYPE html>
<html>
<head>
    <meta charset="UTF-8">
    <title>Office Document OCR Test</title>
</head>
<body>
    <h1>Office Document OCR Test</h1>

    <h2>測試文件說明</h2>
    <p>這是一個用於測試 Tool_OCR 系統 Office 文件支援功能的測試文件。</p>
    <p>本系統現已支援以下 Office 格式:</p>
    <ul>
        <li>Microsoft Word: DOC, DOCX</li>
        <li>Microsoft PowerPoint: PPT, PPTX</li>
    </ul>

    <h2>處理流程</h2>
    <p>Office 文件的處理流程如下:</p>
    <ol>
        <li>使用 LibreOffice 將 Office 文件轉換為 PDF</li>
        <li>將 PDF 轉換為圖片(每頁一張)</li>
        <li>使用 PaddleOCR 處理每張圖片</li>
        <li>合併所有頁面的 OCR 結果</li>
    </ol>

    <h2>測試數據表格</h2>
    <table border="1" cellpadding="5">
        <tr>
            <th>格式</th>
            <th>副檔名</th>
            <th>支援狀態</th>
        </tr>
        <tr>
            <td>Word 新版</td>
            <td>.docx</td>
            <td>✓ 支援</td>
        </tr>
        <tr>
            <td>Word 舊版</td>
            <td>.doc</td>
            <td>✓ 支援</td>
        </tr>
        <tr>
            <td>PowerPoint 新版</td>
            <td>.pptx</td>
            <td>✓ 支援</td>
        </tr>
        <tr>
            <td>PowerPoint 舊版</td>
            <td>.ppt</td>
            <td>✓ 支援</td>
        </tr>
    </table>

    <h2>中英混合測試</h2>
    <p>This is a test for mixed Chinese and English OCR recognition.</p>
    <p>測試中英文混合識別能力:1234567890</p>

    <h2>特殊字符測試</h2>
    <p>符號測試:!@#$%^&*()_+-=[]{}|;:',.<>?/</p>
    <p>數學符號:± × ÷ √ ∞ ≈ ≠ ≤ ≥</p>
</body>
</html>
@@ -1,178 +0,0 @@
#!/usr/bin/env python3
"""
Test script for Office document processing
"""
import json
import requests
from pathlib import Path
import time

API_BASE = "http://localhost:12010/api/v1"
USERNAME = "admin"
PASSWORD = "admin123"


def login():
    """Login and get JWT token"""
    print("Step 1: Logging in...")
    response = requests.post(
        f"{API_BASE}/auth/login",
        json={"username": USERNAME, "password": PASSWORD}
    )
    response.raise_for_status()

    data = response.json()
    token = data["access_token"]
    print(f"✓ Login successful. Token expires in: {data['expires_in']} seconds ({data['expires_in']//3600} hours)")
    return token


def upload_file(token, file_path):
    """Upload file and create batch"""
    print(f"\nStep 2: Uploading file: {file_path.name}...")
    with open(file_path, 'rb') as f:
        files = {'files': (file_path.name, f, 'application/vnd.openxmlformats-officedocument.wordprocessingml.document')}
        response = requests.post(
            f"{API_BASE}/upload",
            headers={"Authorization": f"Bearer {token}"},
            files=files,
            data={"batch_name": "Office Document Test"}
        )
    response.raise_for_status()
    result = response.json()
    print(f"✓ File uploaded and batch created:")
    print(f"  Batch ID: {result['id']}")
    print(f"  Total files: {result['total_files']}")
    print(f"  Status: {result['status']}")
    return result['id']


def trigger_ocr(token, batch_id):
    """Trigger OCR processing"""
    print(f"\nStep 3: Triggering OCR processing...")
    response = requests.post(
        f"{API_BASE}/ocr/process",
        headers={"Authorization": f"Bearer {token}"},
        json={
            "batch_id": batch_id,
            "lang": "ch",
            "detect_layout": True
        }
    )
    response.raise_for_status()
    result = response.json()
    print(f"✓ OCR processing started")
    print(f"  Message: {result['message']}")
    print(f"  Total files: {result['total_files']}")


def check_status(token, batch_id):
    """Check processing status"""
    print(f"\nStep 4: Checking processing status...")
    max_wait = 120  # 120 seconds max
    waited = 0

    while waited < max_wait:
        response = requests.get(
            f"{API_BASE}/batch/{batch_id}/status",
            headers={"Authorization": f"Bearer {token}"}
        )
        response.raise_for_status()
        data = response.json()

        batch_status = data['batch']['status']
        progress = data['batch']['progress_percentage']
        file_status = data['files'][0]['status']

        print(f"  Batch status: {batch_status}, Progress: {progress}%, File status: {file_status}")

        if batch_status == 'completed':
            print(f"\n✓ Processing completed!")
            file_data = data['files'][0]
            if 'processing_time' in file_data:
                print(f"  Processing time: {file_data['processing_time']:.2f} seconds")
            return data
        elif batch_status == 'failed':
            print(f"\n✗ Processing failed!")
            print(f"  Error: {data['files'][0].get('error_message', 'Unknown error')}")
            return data

        time.sleep(5)
        waited += 5

    print(f"\n⚠ Timeout waiting for processing (waited {waited}s)")
    return None


def get_result(token, file_id):
    """Get OCR result"""
    print(f"\nStep 5: Getting OCR result...")
    response = requests.get(
        f"{API_BASE}/ocr/result/{file_id}",
        headers={"Authorization": f"Bearer {token}"}
    )
    response.raise_for_status()
    data = response.json()

    file_info = data['file']
    result = data.get('result')

    print(f"✓ OCR Result retrieved:")
    print(f"  File: {file_info['original_filename']}")
    print(f"  Status: {file_info['status']}")

    if result:
        print(f"  Language: {result.get('detected_language', 'N/A')}")
        print(f"  Total text regions: {result.get('total_text_regions', 0)}")
        print(f"  Average confidence: {result.get('average_confidence', 0):.2%}")

        # Read markdown file if available
        if result.get('markdown_path'):
            try:
                with open(result['markdown_path'], 'r', encoding='utf-8') as f:
                    markdown_content = f.read()
                print(f"\n  Markdown preview (first 300 chars):")
                print(f"  {'-'*60}")
                print(f"  {markdown_content[:300]}...")
                print(f"  {'-'*60}")
            except Exception as e:
                print(f"  Could not read markdown file: {e}")
    else:
        print(f"  No OCR result available yet")

    return data


def main():
    try:
        # Test file
        test_file = Path('/Users/egg/Projects/Tool_OCR/demo_docs/office_tests/test_document.docx')

        if not test_file.exists():
            print(f"✗ Test file not found: {test_file}")
            return

        print("="*70)
        print("Office Document Processing Test")
        print("="*70)
        print(f"Test file: {test_file.name} ({test_file.stat().st_size} bytes)")
        print("="*70)

        # Run test
        token = login()
        batch_id = upload_file(token, test_file)
        trigger_ocr(token, batch_id)
        status_data = check_status(token, batch_id)

        if status_data and status_data['batch']['status'] == 'completed':
            file_id = status_data['files'][0]['id']
            result = get_result(token, file_id)
            print("\n" + "="*70)
            print("✓ TEST PASSED: Office document processing successful!")
            print("="*70)
        else:
            print("\n" + "="*70)
            print("✗ TEST FAILED: Processing did not complete successfully")
            print("="*70)

    except Exception as e:
        print(f"\n✗ TEST ERROR: {str(e)}")
        import traceback
        traceback.print_exc()


if __name__ == "__main__":
    main()
@@ -0,0 +1,817 @@
# Tool_OCR Architecture Overhaul Plan
## A Refactoring Plan Built on the Full Capabilities of PaddleOCR PP-StructureV3

**Planning Date**: 2025-01-18
**Hardware**: RTX 4060 8GB VRAM
**Priority**: P0 (highest)

---

## 📊 Current State Analysis

### Problems with the Current Architecture

#### 1. **PP-StructureV3 Capabilities Are Largely Wasted**
```python
# ❌ Current implementation (ocr_service.py:614-646)
markdown_dict = page_result.markdown  # Only the simplified Markdown view is used
markdown_texts = markdown_dict.get('markdown_texts', '')
'bbox': [],  # Coordinates are all empty!
```

**Problems**:
- Only ~20% of PP-StructureV3's functionality is used
- `parsing_res_list` (the core data structure) is unused
- `layout_bbox` (precise coordinates) is unused
- `reading_order` is unused
- The 23 layout element categories are unused
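
For orientation, a minimal sketch of what accessing that unused data looks like. The `json` view and the `parsing_res_list` key follow the PP-StructureV3 output structure that `AdvancedLayoutExtractor` (later in this plan) relies on; the snippet only inspects what each entry carries:

```python
# Minimal sketch: enumerate the per-element records the current code discards.
# Assumes `structure_engine` is an initialized PPStructureV3 instance.
for page_result in structure_engine.predict(str(image_path)):
    json_data = page_result.json                     # full structured output
    for item in json_data.get('parsing_res_list', []):
        # Each entry bundles the element class, its bbox and its reading order.
        print(sorted(item.keys()))
```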

#### 2. **GPU Configuration Is Not Optimized**
```python
# Current configuration (ocr_service.py:211-219)
self.structure_engine = PPStructureV3(
    use_doc_orientation_classify=False,  # ❌ Preprocessing not enabled
    use_doc_unwarping=False,             # ❌ Unwarping not enabled
    use_textline_orientation=False,      # ❌ Orientation correction not enabled
    # ... default configuration
)
```

**Problems**:
- An RTX 4060 8GB is enough to run the server models, yet the default configuration is used
- Important preprocessing features are switched off
- GPU compute is underutilized

#### 3. **Single PDF Generation Strategy**
```python
# Today only the coordinate-positioning mode exists,
# which causes a 21.6% text loss (overlap filtering)
filtered_text_regions = self._filter_text_in_regions(text_regions, regions_to_avoid)
```

**Problems**:
- Only coordinate positioning is supported; there is no flow layout
- Zero information loss is impossible
- Translation support is limited

---

## 🎯 Refactoring Goals

### Core Goals

1. **Fully exploit PP-StructureV3's capabilities**
   - Extract `parsing_res_list` (23 element categories + reading order)
   - Extract `layout_bbox` (precise coordinates)
   - Extract `layout_det_res` (layout detection details)
   - Extract `overall_ocr_res` (coordinates for all text)

2. **Dual-mode PDF generation** (see the sketch after this list)
   - Mode A: coordinate positioning (faithful layout reproduction)
   - Mode B: flow layout (zero information loss, supports translation)

3. **Optimized GPU configuration**
   - Best-fit configuration for an RTX 4060 8GB
   - Server models + all feature modules
   - Sensible memory management

4. **Backward compatibility**
   - Keep the existing API
   - Old JSON files remain usable
   - Incremental upgrade
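
To make the two modes concrete, here is a minimal sketch of the planned dispatch. It assumes the `layout_data` schema defined under `AdvancedLayoutExtractor` below and uses WeasyPrint, which the PDF pipeline already depends on; `render_pdf` itself is illustrative, not existing code:

```python
from weasyprint import HTML

def render_pdf(layout_data: dict, output_path: str, mode: str = "flow") -> None:
    """Illustrative dispatch between Mode A (coordinate) and Mode B (flow)."""
    if mode == "coordinate":
        # Mode A: pin every element to its detected bbox to reproduce the
        # original page geometry as closely as possible.
        body = "".join(
            f'<div style="position:absolute; left:{el["bbox"][0][0]}px; '
            f'top:{el["bbox"][0][1]}px;">{el["content"]}</div>'
            for el in layout_data["elements"]
        )
    else:
        # Mode B: emit elements in reading order as normal flow content;
        # nothing is filtered out, and translated text can reflow freely.
        ordered = sorted(layout_data["elements"], key=lambda el: el["reading_order"])
        body = "".join(f"<p>{el['content']}</p>" for el in ordered)
    HTML(string=f"<html><body>{body}</body></html>").write_pdf(output_path)
```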

---

## 🏗️ New Architecture Design

### Architecture Layers

```
┌──────────────────────────────────────────────────────┐
│                      API Layer                       │
│  /tasks, /results, /download (backward compatible)   │
└────────────────┬─────────────────────────────────────┘
                 │
┌────────────────▼─────────────────────────────────────┐
│                    Service Layer                     │
├──────────────────────────────────────────────────────┤
│  OCRService (existing, kept)                         │
│    └─ analyze_layout() [upgraded] ──┐                │
│                                     │                │
│  AdvancedLayoutExtractor (new) ◄─ shares same engine │
│    └─ extract_complete_layout() ────┘                │
│                                                      │
│  PDFGeneratorService (refactored)                    │
│    ├─ generate_coordinate_pdf()  [Mode A]            │
│    └─ generate_flow_pdf()        [Mode B]            │
└────────────────┬─────────────────────────────────────┘
                 │
┌────────────────▼─────────────────────────────────────┐
│                     Engine Layer                     │
├──────────────────────────────────────────────────────┤
│  PPStructureV3Engine (new, unified management)       │
│    ├─ GPU config (tuned for RTX 4060 8GB)            │
│    ├─ Model config (server models)                   │
│    └─ Feature switches (all enabled)                 │
└──────────────────────────────────────────────────────┘
```

### Core Class Design

#### 1. PPStructureV3Engine (new)
**Purpose**: centrally manage the PP-StructureV3 engine and avoid repeated initialization

```python
class PPStructureV3Engine:
    """
    PP-StructureV3 engine manager (singleton)
    Configuration tuned for an RTX 4060 8GB
    """
    _instance = None

    def __new__(cls):
        if cls._instance is None:
            cls._instance = super().__new__(cls)
            cls._instance._initialize()
        return cls._instance

    def _initialize(self):
        """Initialize the engine"""
        logger.info("Initializing PP-StructureV3 with RTX 4060 8GB optimized config")

        self.engine = PPStructureV3(
            # ===== GPU configuration =====
            use_gpu=True,
            gpu_mem=6144,  # Reserve 2GB for the system (8GB - 2GB)

            # ===== Preprocessing modules (all enabled) =====
            use_doc_orientation_classify=True,  # Document orientation correction
            use_doc_unwarping=True,             # Document image unwarping
            use_textline_orientation=True,      # Text-line orientation correction

            # ===== Feature modules (all enabled) =====
            use_table_recognition=True,    # Table recognition
            use_formula_recognition=True,  # Formula recognition
            use_chart_recognition=True,    # Chart recognition
            use_seal_recognition=True,     # Seal recognition

            # ===== OCR model configuration (server models) =====
            text_detection_model_name="ch_PP-OCRv4_server_det",
            text_recognition_model_name="ch_PP-OCRv4_server_rec",

            # ===== Layout detection parameters =====
            layout_threshold=0.5,     # Layout detection threshold
            layout_nms=0.5,           # NMS threshold
            layout_unclip_ratio=1.5,  # Bounding-box expansion ratio

            # ===== OCR parameters =====
            text_det_limit_side_len=1920,  # High-resolution detection
            text_det_thresh=0.3,           # Detection threshold
            text_det_box_thresh=0.5,       # Box threshold

            # ===== Misc =====
            show_log=True,
            use_angle_cls=False,  # Superseded by textline_orientation
        )

        logger.info("PP-StructureV3 engine initialized successfully")
        logger.info(f"  - GPU: Enabled (RTX 4060 8GB)")
        logger.info(f"  - Models: Server (High Accuracy)")
        logger.info(f"  - Features: All Enabled (Table/Formula/Chart/Seal)")

    def predict(self, image_path: str):
        """Run prediction"""
        return self.engine.predict(image_path)

    def get_engine(self):
        """Return the engine instance"""
        return self.engine
```
|
||||
|
||||
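As a usage note (the sample path is hypothetical), the singleton guarantees the heavy model load happens only once, no matter how many services construct it:

```python
# Usage sketch: both call sites share one engine instance
engine_a = PPStructureV3Engine()
engine_b = PPStructureV3Engine()
assert engine_a is engine_b  # singleton: the model loads only once

results = engine_a.predict("storage/uploads/sample_page.png")  # hypothetical path
```
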
#### 2. AdvancedLayoutExtractor (new)
**Purpose**: extract all of PP-StructureV3's layout information

```python
import logging
from pathlib import Path
from typing import Dict, List, Optional, Tuple

logger = logging.getLogger(__name__)


class AdvancedLayoutExtractor:
    """
    Advanced layout extractor.
    Fully exploits PP-StructureV3's parsing_res_list, layout_bbox and layout_det_res.
    """

    def __init__(self):
        self.engine = PPStructureV3Engine()

    def extract_complete_layout(
        self,
        image_path: Path,
        output_dir: Optional[Path] = None,
        current_page: int = 0
    ) -> Tuple[Optional[Dict], List[Dict]]:
        """
        Extract complete layout information (via page_result.json).

        Returns:
            (layout_data, images_metadata)

            layout_data = {
                "elements": [
                    {
                        "element_id": int,
                        "type": str,           # one of the 23 types
                        "bbox": [[x1,y1], [x2,y1], [x2,y2], [x1,y2]],  # ✅ no longer empty
                        "content": str,
                        "reading_order": int,  # ✅ reading order
                        "layout_type": str,    # ✅ single/double/multi-column
                        "confidence": float,   # ✅ confidence score
                        "page": int
                    },
                    ...
                ],
                "reading_order": [0, 1, 2, ...],
                "layout_types": ["single", "double"],
                "total_elements": int
            }
        """
        try:
            results = self.engine.predict(str(image_path))

            layout_elements = []
            images_metadata = []

            for page_idx, page_result in enumerate(results):
                # ✅ Key change: use page_result.json instead of page_result.markdown
                json_data = page_result.json

                # ===== Method 1: parsing_res_list (primary source) =====
                parsing_res_list = json_data.get('parsing_res_list', [])

                if parsing_res_list:
                    logger.info(f"Found {len(parsing_res_list)} elements in parsing_res_list")

                    for idx, item in enumerate(parsing_res_list):
                        element = self._create_element_from_parsing_res(
                            item, idx, current_page
                        )
                        if element:
                            layout_elements.append(element)

                # ===== Method 2: layout_det_res (supplementary info) =====
                layout_det_res = json_data.get('layout_det_res', {})
                layout_boxes = layout_det_res.get('boxes', [])

                # Enrich elements when parsing_res_list lacks certain fields
                self._enrich_elements_with_layout_det(layout_elements, layout_boxes)

                # ===== Method 3: images (from markdown_images) =====
                markdown_dict = page_result.markdown
                markdown_images = markdown_dict.get('markdown_images', {})

                for img_idx, (img_path, img_obj) in enumerate(markdown_images.items()):
                    # Persist the image to disk
                    self._save_image(img_obj, img_path, output_dir or image_path.parent)

                    # Look up the bbox in parsing_res_list or layout_det_res
                    bbox = self._find_image_bbox(
                        img_path, parsing_res_list, layout_boxes
                    )

                    images_metadata.append({
                        'element_id': len(layout_elements) + img_idx,
                        'image_path': img_path,
                        'type': 'image',
                        'page': current_page,
                        'bbox': bbox,
                    })

            if layout_elements:
                layout_data = {
                    'elements': layout_elements,
                    'total_elements': len(layout_elements),
                    'reading_order': [e['reading_order'] for e in layout_elements],
                    'layout_types': list(set(e.get('layout_type') for e in layout_elements)),
                }
                logger.info(f"✅ Extracted {len(layout_elements)} elements with complete info")
                return layout_data, images_metadata
            else:
                logger.warning("No layout elements found")
                return None, []

        except Exception as e:
            logger.error(f"Advanced layout extraction failed: {e}")
            import traceback
            traceback.print_exc()
            return None, []

    def _create_element_from_parsing_res(
        self, item: Dict, idx: int, current_page: int
    ) -> Optional[Dict]:
        """Build an element from one parsing_res_list item."""
        # Extract layout_bbox
        layout_bbox = item.get('layout_bbox')
        bbox = self._convert_bbox_to_4point(layout_bbox)

        # Extract the layout type
        layout_type = item.get('layout', 'single')

        # Base element
        element = {
            'element_id': idx,
            'page': current_page,
            'bbox': bbox,  # ✅ full coordinates
            'layout_type': layout_type,
            'reading_order': idx,
            'confidence': item.get('score', 0.0),
        }

        # Fill in type and content by content kind.
        # Order matters! Priority: table > formula > image > title > text
        if 'table' in item and item['table']:
            element['type'] = 'table'
            element['content'] = item['table']
            # Extract plain table text (for translation)
            element['extracted_text'] = self._extract_table_text(item['table'])

        elif 'formula' in item and item['formula']:
            element['type'] = 'formula'
            element['content'] = item['formula']  # LaTeX

        elif 'figure' in item or 'image' in item:
            element['type'] = 'image'
            element['content'] = item.get('figure') or item.get('image')

        elif 'title' in item and item['title']:
            element['type'] = 'title'
            element['content'] = item['title']

        elif 'text' in item and item['text']:
            element['type'] = 'text'
            element['content'] = item['text']

        else:
            # Unknown type: take the first non-system field with a value
            for key, value in item.items():
                if key not in ['layout_bbox', 'layout', 'score'] and value:
                    element['type'] = key
                    element['content'] = value
                    break
            else:
                return None  # no content, skip

        return element

    def _convert_bbox_to_4point(self, layout_bbox) -> List:
        """Convert layout_bbox to 4-point format."""
        if layout_bbox is None:
            return []

        # Handle numpy arrays
        if hasattr(layout_bbox, 'tolist'):
            bbox = layout_bbox.tolist()
        else:
            bbox = list(layout_bbox)

        if len(bbox) == 4:  # [x1, y1, x2, y2]
            x1, y1, x2, y2 = bbox
            return [[x1, y1], [x2, y1], [x2, y2], [x1, y2]]

        return []

    def _extract_table_text(self, html_content: str) -> str:
        """Extract plain text from an HTML table (for translation)."""
        try:
            from bs4 import BeautifulSoup
            soup = BeautifulSoup(html_content, 'html.parser')

            # Collect the text of every cell
            cells = []
            for cell in soup.find_all(['td', 'th']):
                text = cell.get_text(strip=True)
                if text:
                    cells.append(text)

            return ' | '.join(cells)
        except Exception as e:
            logger.warning(f"Failed to extract table text: {e}")
            # Fallback: crude HTML tag stripping
            import re
            text = re.sub(r'<[^>]+>', ' ', html_content)
            text = re.sub(r'\s+', ' ', text)
            return text.strip()
```

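`_enrich_elements_with_layout_det()` is referenced above but not spelled out. A minimal sketch, assuming a simple box-intersection match is good enough (an IoU threshold may be preferable in practice):

```python
def _enrich_elements_with_layout_det(self, elements: list, layout_boxes: list) -> None:
    """Fill missing type/confidence from layout_det_res by box overlap (sketch)."""
    for elem in elements:
        bbox = elem.get('bbox')
        if not bbox:
            continue
        ex1, ey1 = bbox[0]
        ex2, ey2 = bbox[2]
        for box in layout_boxes:
            x1, y1, x2, y2 = box.get('coordinate', (0, 0, 0, 0))
            # Overlap test: does the detection box intersect the element box?
            if x1 < ex2 and x2 > ex1 and y1 < ey2 and y2 > ey1:
                elem.setdefault('type', box.get('label'))
                if not elem.get('confidence'):
                    elem['confidence'] = box.get('score', 0.0)
                break
```
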
#### 3. PDFGeneratorService (refactored)
**Purpose**: support dual-mode PDF generation

```python
import logging
from pathlib import Path
from typing import Dict, Optional

from reportlab.pdfgen import canvas

logger = logging.getLogger(__name__)


class PDFGeneratorService:
    """
    PDF generation service (refactored).
    Supports two modes:
    - coordinate: coordinate-positioning mode (faithful layout reproduction)
    - flow: flow-layout mode (zero information loss, translation-ready)
    """

    def generate_pdf(
        self,
        json_path: Path,
        output_path: Path,
        mode: str = 'coordinate',  # 'coordinate' or 'flow'
        source_file_path: Optional[Path] = None
    ) -> bool:
        """
        Generate a PDF.

        Args:
            json_path: path to the OCR JSON file
            output_path: output PDF path
            mode: generation mode ('coordinate' or 'flow')
            source_file_path: original file path (used for page dimensions)

        Returns:
            True on success
        """
        try:
            # Load the OCR data
            ocr_data = self.load_ocr_json(json_path)
            if not ocr_data:
                return False

            # Pick a generation strategy by mode
            if mode == 'flow':
                return self._generate_flow_pdf(ocr_data, output_path)
            else:
                return self._generate_coordinate_pdf(ocr_data, output_path, source_file_path, json_path)

        except Exception as e:
            logger.error(f"PDF generation failed: {e}")
            import traceback
            traceback.print_exc()
            return False

    def _generate_coordinate_pdf(
        self,
        ocr_data: Dict,
        output_path: Path,
        source_file_path: Optional[Path],
        json_path: Path  # needed to resolve image paths relative to the JSON file
    ) -> bool:
        """
        Mode A: coordinate positioning.
        - Uses layout_bbox to place every element precisely
        - Preserves the visual appearance of the original document
        - Suited to scenarios that require faithful layout reproduction
        """
        logger.info("Generating PDF in COORDINATE mode (layout-preserving)")

        # Pull the data
        layout_data = ocr_data.get('layout_data', {})
        elements = layout_data.get('elements', [])

        if not elements:
            logger.warning("No layout elements found")
            return False

        # Sort by page, then reading_order
        sorted_elements = sorted(elements, key=lambda x: (
            x.get('page', 0),
            x.get('reading_order', 0)
        ))

        # Compute page dimensions
        ocr_width, ocr_height = self.calculate_page_dimensions(ocr_data, source_file_path)
        target_width, target_height = self._get_target_dimensions(source_file_path, ocr_width, ocr_height)

        scale_w = target_width / ocr_width
        scale_h = target_height / ocr_height

        # Create the PDF canvas
        pdf_canvas = canvas.Canvas(str(output_path), pagesize=(target_width, target_height))

        # Group elements by page number
        pages = {}
        for elem in sorted_elements:
            page = elem.get('page', 0)
            if page not in pages:
                pages[page] = []
            pages[page].append(elem)

        # Render each page
        for page_num, page_elements in sorted(pages.items()):
            if page_num > 0:
                pdf_canvas.showPage()

            logger.info(f"Rendering page {page_num + 1} with {len(page_elements)} elements")

            # Render each element in reading order
            for elem in page_elements:
                bbox = elem.get('bbox', [])
                elem_type = elem.get('type')
                content = elem.get('content', '')

                if not bbox:
                    logger.warning(f"Element {elem['element_id']} has no bbox, skipping")
                    continue

                # Render by type
                try:
                    if elem_type == 'table':
                        self._draw_table_at_bbox(pdf_canvas, content, bbox, target_height, scale_w, scale_h)
                    elif elem_type == 'text':
                        self._draw_text_at_bbox(pdf_canvas, content, bbox, target_height, scale_w, scale_h)
                    elif elem_type == 'title':
                        self._draw_title_at_bbox(pdf_canvas, content, bbox, target_height, scale_w, scale_h)
                    elif elem_type == 'image':
                        img_path = json_path.parent / content
                        if img_path.exists():
                            self._draw_image_at_bbox(pdf_canvas, str(img_path), bbox, target_height, scale_w, scale_h)
                    elif elem_type == 'formula':
                        self._draw_formula_at_bbox(pdf_canvas, content, bbox, target_height, scale_w, scale_h)
                    # ... other types

                except Exception as e:
                    logger.warning(f"Failed to draw {elem_type} element: {e}")

        pdf_canvas.save()
        logger.info(f"✅ Coordinate PDF generated: {output_path}")
        return True

    def _generate_flow_pdf(
        self,
        ocr_data: Dict,
        output_path: Path
    ) -> bool:
        """
        Mode B: flow layout.
        - Lays elements out in reading_order
        - Zero information loss (nothing is filtered out)
        - Uses ReportLab's high-level Platypus API
        - Suited to translation and content-processing scenarios
        """
        from reportlab.platypus import (
            SimpleDocTemplate, Paragraph, Spacer,
            Table, TableStyle, Image as RLImage, PageBreak
        )
        from reportlab.lib.styles import getSampleStyleSheet, ParagraphStyle
        from reportlab.lib import colors
        from reportlab.lib.enums import TA_LEFT, TA_CENTER

        logger.info("Generating PDF in FLOW mode (content-preserving)")

        # Pull the data
        layout_data = ocr_data.get('layout_data', {})
        elements = layout_data.get('elements', [])

        if not elements:
            logger.warning("No layout elements found")
            return False

        # Sort by reading order
        sorted_elements = sorted(elements, key=lambda x: (
            x.get('page', 0),
            x.get('reading_order', 0)
        ))

        # Create the document
        doc = SimpleDocTemplate(str(output_path))
        story = []
        styles = getSampleStyleSheet()

        # Custom style
        styles.add(ParagraphStyle(
            name='CustomTitle',
            parent=styles['Heading1'],
            fontSize=18,
            alignment=TA_CENTER,
            spaceAfter=12
        ))

        current_page = -1

        # Append elements in order
        for elem in sorted_elements:
            elem_type = elem.get('type')
            content = elem.get('content', '')
            page = elem.get('page', 0)

            # Page breaks
            if page != current_page and current_page != -1:
                story.append(PageBreak())
            current_page = page

            try:
                if elem_type == 'title':
                    story.append(Paragraph(content, styles['CustomTitle']))
                    story.append(Spacer(1, 12))

                elif elem_type == 'text':
                    story.append(Paragraph(content, styles['Normal']))
                    story.append(Spacer(1, 8))

                elif elem_type == 'table':
                    # Parse the HTML table into a ReportLab Table
                    table_obj = self._html_to_reportlab_table(content)
                    if table_obj:
                        story.append(table_obj)
                        story.append(Spacer(1, 12))

                elif elem_type == 'image':
                    # Embed the image
                    img_path = output_path.parent.parent / content
                    if img_path.exists():
                        img = RLImage(str(img_path), width=400, height=300, kind='proportional')
                        story.append(img)
                        story.append(Spacer(1, 12))

                elif elem_type == 'formula':
                    # Render formulas in a monospace font
                    story.append(Paragraph(f"<font name='Courier'>{content}</font>", styles['Code']))
                    story.append(Spacer(1, 8))

            except Exception as e:
                logger.warning(f"Failed to add {elem_type} element to flow: {e}")

        # Build the PDF
        doc.build(story)
        logger.info(f"✅ Flow PDF generated: {output_path}")
        return True
```

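`_html_to_reportlab_table()` is called in both modes but never defined in this plan. A minimal sketch, assuming plain tables (rowspan/colspan handling is out of scope here):

```python
from bs4 import BeautifulSoup
from reportlab.lib import colors
from reportlab.platypus import Table, TableStyle

def _html_to_reportlab_table(self, html_content: str):
    """Convert a PP-StructureV3 HTML table into a ReportLab Table flowable."""
    soup = BeautifulSoup(html_content, 'html.parser')
    rows = []
    for tr in soup.find_all('tr'):
        rows.append([cell.get_text(strip=True) for cell in tr.find_all(['td', 'th'])])
    if not rows:
        return None
    # Pad ragged rows so ReportLab gets a rectangular grid
    width = max(len(r) for r in rows)
    rows = [r + [''] * (width - len(r)) for r in rows]
    table = Table(rows)
    table.setStyle(TableStyle([
        ('GRID', (0, 0), (-1, -1), 0.5, colors.grey),
        ('FONTSIZE', (0, 0), (-1, -1), 8),
        ('VALIGN', (0, 0), (-1, -1), 'MIDDLE'),
    ]))
    return table
```
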
---

## 🔧 Implementation Steps

### Phase 1: Engine-layer refactor (2-3 hours)

1. **Create the PPStructureV3Engine singleton**
   - File: `backend/app/engines/ppstructure_engine.py` (new)
   - Centralizes management of the PP-StructureV3 engine
   - Tuned configuration for an RTX 4060 8GB

2. **Create the AdvancedLayoutExtractor class**
   - File: `backend/app/services/advanced_layout_extractor.py` (new)
   - Implements `extract_complete_layout()`
   - Fully extracts parsing_res_list, layout_bbox and layout_det_res

3. **Update OCRService**
   - Change `analyze_layout()` to use `AdvancedLayoutExtractor`
   - Stay backward compatible (fall back to the old logic)

### Phase 2: PDF generator refactor (3-4 hours)

1. **Refactor PDFGeneratorService**
   - Add a `mode` parameter
   - Implement `_generate_coordinate_pdf()`
   - Implement `_generate_flow_pdf()`

2. **Add helper methods** (see the sketch after this list)
   - `_draw_table_at_bbox()`: draw a table at given coordinates
   - `_draw_text_at_bbox()`: draw text at given coordinates
   - `_draw_title_at_bbox()`: draw a title at given coordinates
   - `_draw_formula_at_bbox()`: draw a formula at given coordinates
   - `_html_to_reportlab_table()`: convert HTML to a ReportLab Table

3. **Update the API endpoints**
   - `/tasks/{id}/download/pdf?mode=coordinate` (default)
   - `/tasks/{id}/download/pdf?mode=flow`

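As an illustration of what one of these helpers could look like, here is a hedged sketch of `_draw_text_at_bbox()`, assuming the 4-point bbox format above and ReportLab's bottom-left origin; the font name and size heuristic are placeholders to tune:

```python
from reportlab.pdfgen.canvas import Canvas

def _draw_text_at_bbox(self, pdf_canvas: Canvas, content: str, bbox: list,
                       page_height: float, scale_w: float, scale_h: float) -> None:
    """Draw text inside a 4-point bbox, flipping the y-axis for ReportLab."""
    (x1, y1), _, (x2, y2), _ = bbox
    # Scale OCR coordinates into PDF space
    px, py = x1 * scale_w, y1 * scale_h
    box_h = (y2 - y1) * scale_h
    # ReportLab's origin is bottom-left, so flip the y-axis
    baseline_y = page_height - py - box_h * 0.8
    # Rough font size derived from the box height (placeholder heuristic)
    font_size = max(6, min(18, box_h * 0.7))
    pdf_canvas.setFont("Helvetica", font_size)
    pdf_canvas.drawString(px, baseline_y, content)
```
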
### Phase 3: Testing and tuning (2-3 hours)

1. **Unit tests**
   - Test AdvancedLayoutExtractor
   - Test both PDF modes
   - Test backward compatibility

2. **Performance tests**
   - Monitor GPU memory usage
   - Measure processing speed
   - Test concurrent requests

3. **Quality validation**
   - Coordinate accuracy
   - Reading-order correctness
   - Table recognition accuracy

---

## 📈 Expected Impact

### Functional Improvements

| Metric | Current | After refactor | Gain |
|------|-----|--------|------|
| bbox availability | 0% (all empty) | 100% | ✅ ∞ |
| Layout element classes | 2 | 23 | ✅ 11.5x |
| Reading order | none | fully preserved | ✅ 100% |
| Information loss | 21.6% | 0% (flow mode) | ✅ 100% |
| PDF modes | 1 | 2 | ✅ 2x |
| Translation support | difficult | full | ✅ 100% |

### GPU Utilization Gains

| Config item | Current | After refactor |
|----------------|--------|--------|
| GPU utilization | ~30% | ~70% |
| Throughput | 0.5 pages/s | 1.2 pages/s |
| Preprocessing | off | all on |
| Recognition accuracy | ~85% | ~95% |

---

## 🎯 Migration Strategy

### Backward-compatibility Guarantees

1. **API level**
   - Keep every existing API endpoint
   - Add an optional `mode` parameter
   - Default behavior unchanged

2. **Data level**
   - Old JSON files remain usable
   - New fields do not affect old logic
   - Incremental rollout of updates

3. **Deployment strategy**
   - Deploy the new engine and services first
   - Enable new features gradually
   - Monitor performance and error rates

---

## 📝 Configuration Files

### requirements.txt updates

```txt
# Existing dependencies
paddlepaddle-gpu>=3.0.0
paddleocr>=3.0.0

# New dependencies
python-docx>=0.8.11     # Word document generation (optional)
PyMuPDF>=1.23.0         # enhanced PDF handling
beautifulsoup4>=4.12.0  # HTML parsing
lxml>=4.9.0             # faster XML/HTML parsing
```

### Environment Variables

```bash
# Added to .env.local
PADDLE_GPU_MEMORY=6144           # RTX 4060 8GB, leave 2GB for the system
PADDLE_USE_SERVER_MODEL=true
PADDLE_ENABLE_ALL_FEATURES=true

# Default PDF generation mode
PDF_DEFAULT_MODE=coordinate      # or flow
```

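A hypothetical settings loader for the variables above (the names mirror the `.env.local` keys; the defaults are assumptions matching the plan):

```python
import os

# Hypothetical settings module reading the .env.local variables above
PADDLE_GPU_MEMORY = int(os.getenv("PADDLE_GPU_MEMORY", "6144"))
PADDLE_USE_SERVER_MODEL = os.getenv("PADDLE_USE_SERVER_MODEL", "true").lower() == "true"
PADDLE_ENABLE_ALL_FEATURES = os.getenv("PADDLE_ENABLE_ALL_FEATURES", "true").lower() == "true"
PDF_DEFAULT_MODE = os.getenv("PDF_DEFAULT_MODE", "coordinate")
```
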
---

## 🚀 Implementation Priorities

### P0 (implement now)
1. ✅ PPStructureV3Engine unified engine
2. ✅ AdvancedLayoutExtractor complete extraction
3. ✅ Coordinate-positioning PDF mode

### P1 (second phase)
4. ⭐ Flow-layout PDF mode
5. ⭐ API endpoint updates (`mode` parameter)

### P2 (optimization phase)
6. Performance monitoring and tuning
7. Batch processing support
8. Quality-check tooling

---

## ⚠️ Risks and Mitigations

### Risk 1: Insufficient GPU memory
**Mitigation**:
- Set `gpu_mem=6144` conservatively (reserve 2GB)
- Add memory monitoring
- Process large documents in batches

### Risk 2: Slower processing
**Mitigation**:
- Server models run faster on GPU than Mobile models
- Process multiple pages in parallel
- Cache results

### Risk 3: Backward-compatibility issues
**Mitigation**:
- Keep the old logic as a fallback
- Migrate incrementally
- Full test coverage

---

**Estimated total development time**: 7-10 hours
**Expected outcome**: 100% use of PP-StructureV3's capabilities + zero information loss + full translation support

Which phase would you like me to start implementing?

@@ -0,0 +1,691 @@
# Plan: Fully Using PP-StructureV3 Layout Information

## 📋 Executive Summary

### Problem Diagnosis
The current implementation **severely underestimates PP-StructureV3's capabilities**: it only reads the `page_result.markdown` attribute and entirely ignores the core layout information in `page_result.json`.

### Key Findings
1. **PP-StructureV3 provides complete layout-parsing information**, including:
   - `parsing_res_list`: layout elements ordered by reading order
   - `layout_bbox`: precise coordinates for every element
   - `layout_det_res`: layout detection results (region type, confidence)
   - `overall_ocr_res`: the full OCR result (bbox for all text)
   - `layout`: layout type (single/double/multi-column)

2. **Defects in the current implementation**:
```python
# ❌ Current approach (ocr_service.py:615-646)
markdown_dict = page_result.markdown  # only grabs markdown and images
markdown_texts = markdown_dict.get('markdown_texts', '')
# bbox is set to an empty list
'bbox': [],  # PP-StructureV3 doesn't provide individual bbox in this format
```

3. **What it should do instead**:
```python
# ✅ Correct approach
json_data = page_result.json  # fetch the full structured information
parsing_list = json_data.get('parsing_res_list', [])  # reading order + bbox
layout_det = json_data.get('layout_det_res', {})      # layout detection
overall_ocr = json_data.get('overall_ocr_res', {})    # coordinates for all text
```

---

## 🎯 Planning Goals

### Phase 1: Extract complete layout information (high priority)
**Goal**: modify `analyze_layout()` to use PP-StructureV3's full capabilities

**Expected outcome**:
- ✅ Every layout element carries a precise `layout_bbox`
- ✅ Original reading order preserved (the order of `parsing_res_list`)
- ✅ Layout-type information available (single/double column)
- ✅ Region classification extracted (text/table/figure/title/formula)
- ✅ Zero information loss (no overlap filtering needed)

### Phase 2: Implement dual-mode PDF generation (medium priority)
**Goal**: provide two PDF generation modes

**Mode A: precise coordinate positioning**
- Uses `layout_bbox` to place every element exactly
- Preserves the visual appearance of the original document
- Suited to scenarios that require faithful layout reproduction

**Mode B: flow layout**
- Lays content out in `parsing_res_list` order
- Uses ReportLab's high-level Platypus API
- Zero information loss; all content stays searchable
- Suited to translation and content-processing scenarios

### Phase 3: Multi-column layout handling (low priority)
**Goal**: exploit PP-StructureV3's multi-column recognition

---

## 📊 PP-StructureV3 Complete Data Structure

### 1. Full structure of `page_result.json`

```python
{
    # Basic info
    "input_path": str,  # source file path
    "page_index": int,  # page number (PDF only)

    # Layout detection result
    "layout_det_res": {
        "boxes": [
            {
                "cls_id": int,   # class ID
                "label": str,    # region type: text/table/figure/title/formula/seal
                "score": float,  # confidence 0-1
                "coordinate": [x1, y1, x2, y2]  # rectangle coordinates
            },
            ...
        ]
    },

    # Full OCR result
    "overall_ocr_res": {
        "dt_polys": np.ndarray,   # text detection polygons
        "rec_polys": np.ndarray,  # text recognition polygons
        "rec_boxes": np.ndarray,  # text recognition rectangles (n, 4, 2) int16
        "rec_texts": List[str],   # recognized text
        "rec_scores": np.ndarray  # recognition confidence
    },

    # **Core layout-parsing result (in reading order)**
    "parsing_res_list": [
        {
            "layout_bbox": np.ndarray,  # region bounding box [x1, y1, x2, y2]
            "layout": str,              # layout type: single/double/multi-column
            "text": str,                # text content (for text regions)
            "table": str,               # table HTML (for table regions)
            "image": str,               # image path (for image regions)
            "formula": str,             # formula LaTeX (for formula regions)
            # ... other region types
        },
        ...  # list order = reading order
    ],

    # Text-paragraph OCR (in reading order)
    "text_paragraphs_ocr_res": {
        "rec_polys": np.ndarray,
        "rec_texts": List[str],
        "rec_scores": np.ndarray
    },

    # Optional module results
    "formula_res_region1": {...},  # formula recognition result
    "table_cell_img": {...},       # table cell images
    "seal_res_region1": {...}      # seal recognition result
}
```

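Since several of these fields are `np.ndarray`, persisting the structure to JSON requires converting them first; a minimal sketch:

```python
import numpy as np

def to_serializable(value):
    """Recursively convert numpy values in page_result.json to plain Python."""
    if isinstance(value, np.ndarray):
        return value.tolist()
    if isinstance(value, dict):
        return {k: to_serializable(v) for k, v in value.items()}
    if isinstance(value, list):
        return [to_serializable(v) for v in value]
    return value
```
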
### 2. Key Fields

| Field | Purpose | Format | Importance |
|------|------|---------|--------|
| `parsing_res_list` | **Core data**: all layout elements in reading order | List[Dict] | ⭐⭐⭐⭐⭐ |
| `layout_bbox` | Precise coordinates for each element | np.ndarray [x1,y1,x2,y2] | ⭐⭐⭐⭐⭐ |
| `layout` | Layout type (single/double/multi-column) | str: single/double/multi | ⭐⭐⭐⭐ |
| `layout_det_res` | Detailed layout detection (incl. region classes) | Dict with boxes list | ⭐⭐⭐⭐ |
| `overall_ocr_res` | OCR results and coordinates for all text | Dict with np.ndarray | ⭐⭐⭐⭐ |
| `markdown` | Simplified Markdown output | Dict with texts/images | ⭐⭐ |

---

## 🔧 Implementation Plan

### Task 1: Refactor the `analyze_layout()` function

**File**: `/backend/app/services/ocr_service.py`

**Scope**: lines 590-710

**Core changes**:

```python
def analyze_layout(self, image_path: Path, output_dir: Optional[Path] = None, current_page: int = 0) -> Tuple[Optional[Dict], List[Dict]]:
    """
    Analyze document layout using PP-StructureV3 (using the full JSON information).
    """
    try:
        structure_engine = self.get_structure_engine()
        results = structure_engine.predict(str(image_path))

        layout_elements = []
        images_metadata = []

        for page_idx, page_result in enumerate(results):
            # ✅ Change 1: use the full JSON data instead of only markdown
            json_data = page_result.json

            # ✅ Change 2: extract the layout detection result
            layout_det_res = json_data.get('layout_det_res', {})
            layout_boxes = layout_det_res.get('boxes', [])

            # ✅ Change 3: extract the core parsing_res_list (reading order + bbox)
            parsing_res_list = json_data.get('parsing_res_list', [])

            if parsing_res_list:
                # *** Core logic: consume parsing_res_list ***
                for idx, item in enumerate(parsing_res_list):
                    # Extract the bbox (no longer an empty list!)
                    layout_bbox = item.get('layout_bbox')
                    bbox = []  # default when layout_bbox is missing
                    if layout_bbox is not None:
                        # Convert numpy arrays to a plain list
                        if hasattr(layout_bbox, 'tolist'):
                            bbox = layout_bbox.tolist()
                        else:
                            bbox = list(layout_bbox)

                        # Convert to 4-point format: [[x1,y1], [x2,y1], [x2,y2], [x1,y2]]
                        if len(bbox) == 4:  # [x1, y1, x2, y2]
                            x1, y1, x2, y2 = bbox
                            bbox = [[x1, y1], [x2, y1], [x2, y2], [x1, y2]]
                        else:
                            bbox = []

                    # Extract the layout type
                    layout_type = item.get('layout', 'single')

                    # Create the element (carrying all information)
                    element = {
                        'element_id': idx,
                        'page': current_page,
                        'bbox': bbox,                # ✅ no longer empty!
                        'layout_type': layout_type,  # ✅ new: layout type
                        'reading_order': idx,        # ✅ new: reading order
                    }

                    # Extract content by region type
                    if 'table' in item:
                        element['type'] = 'table'
                        element['content'] = item['table']
                        # Extract plain table text (for translation)
                        element['extracted_text'] = self._extract_table_text(item['table'])

                    elif 'text' in item:
                        element['type'] = 'text'
                        element['content'] = item['text']

                    elif 'figure' in item or 'image' in item:
                        element['type'] = 'image'
                        element['content'] = item.get('figure') or item.get('image')

                    elif 'formula' in item:
                        element['type'] = 'formula'
                        element['content'] = item['formula']

                    elif 'title' in item:
                        element['type'] = 'title'
                        element['content'] = item['title']

                    else:
                        # Unknown type: record the first non-system field
                        for key, value in item.items():
                            if key not in ['layout_bbox', 'layout']:
                                element['type'] = key
                                element['content'] = value
                                break

                    layout_elements.append(element)

            else:
                # Fall back to the markdown approach (backward compatibility)
                logger.warning("No parsing_res_list found, falling back to markdown parsing")
                markdown_dict = page_result.markdown
                # ... existing markdown parsing logic ...

            # ✅ Change 4: still handle extracted images (must be saved to disk)
            markdown_dict = page_result.markdown
            markdown_images = markdown_dict.get('markdown_images', {})

            for img_idx, (img_path, img_obj) in enumerate(markdown_images.items()):
                # Persist the image to disk
                try:
                    base_dir = output_dir if output_dir else image_path.parent
                    full_img_path = base_dir / img_path
                    full_img_path.parent.mkdir(parents=True, exist_ok=True)

                    if hasattr(img_obj, 'save'):
                        img_obj.save(str(full_img_path))
                        logger.info(f"Saved extracted image to {full_img_path}")
                except Exception as e:
                    logger.warning(f"Failed to save image {img_path}: {e}")

                # Extract the bbox (from the filename or by matching parsing_res_list)
                bbox = self._find_image_bbox(img_path, parsing_res_list, layout_boxes)

                images_metadata.append({
                    'element_id': len(layout_elements) + img_idx,
                    'image_path': img_path,
                    'type': 'image',
                    'page': current_page,
                    'bbox': bbox,
                })

        if layout_elements:
            layout_data = {
                'elements': layout_elements,
                'total_elements': len(layout_elements),
                'reading_order': [e['reading_order'] for e in layout_elements],  # ✅ keep reading order
                'layout_types': list(set(e.get('layout_type') for e in layout_elements)),  # ✅ layout-type stats
            }
            logger.info(f"Detected {len(layout_elements)} layout elements (with bbox and reading order)")
            return layout_data, images_metadata
        else:
            logger.warning("No layout elements detected")
            return None, []

    except Exception as e:
        import traceback
        logger.error(f"Layout analysis error: {str(e)}\n{traceback.format_exc()}")
        return None, []


def _find_image_bbox(self, img_path: str, parsing_res_list: List[Dict], layout_boxes: List[Dict]) -> List:
    """
    Look up an image's bbox in parsing_res_list or layout_det_res.
    """
    # Method 1: parse it from the filename (existing approach)
    import re
    match = re.search(r'box_(\d+)_(\d+)_(\d+)_(\d+)', img_path)
    if match:
        x1, y1, x2, y2 = map(int, match.groups())
        return [[x1, y1], [x2, y1], [x2, y2], [x1, y2]]

    # Method 2: match against parsing_res_list (if it carries image paths)
    for item in parsing_res_list:
        if 'image' in item or 'figure' in item:
            content = item.get('image') or item.get('figure')
            if img_path in str(content):
                bbox = item.get('layout_bbox')
                if bbox is not None:
                    if hasattr(bbox, 'tolist'):
                        bbox_list = bbox.tolist()
                    else:
                        bbox_list = list(bbox)
                    if len(bbox_list) == 4:
                        x1, y1, x2, y2 = bbox_list
                        return [[x1, y1], [x2, y1], [x2, y2], [x1, y2]]

    # Method 3: match against layout_det_res (by label)
    for box in layout_boxes:
        if box.get('label') in ['figure', 'image']:
            coord = box.get('coordinate', [])
            if len(coord) == 4:
                x1, y1, x2, y2 = coord
                return [[x1, y1], [x2, y1], [x2, y2], [x1, y2]]

    logger.warning(f"Could not find bbox for image {img_path}")
    return []
```

---

### Task 2: Update the PDF generator to use the new information

**File**: `/backend/app/services/pdf_generator_service.py`

**Core changes**:

1. **Remove the text-filtering logic** (no longer needed!)
   - `parsing_res_list` is already ordered by reading order
   - Tables/images own their regions, text owns its regions
   - No overlap problems remain

2. **Render elements by `reading_order`**

```python
def generate_layout_pdf(self, json_path: Path, output_path: Path, mode: str = 'coordinate') -> bool:
    """
    mode: 'coordinate' or 'flow'
    """
    # Load the data
    ocr_data = self.load_ocr_json(json_path)
    layout_data = ocr_data.get('layout_data', {})
    elements = layout_data.get('elements', [])

    if mode == 'coordinate':
        # Mode A: coordinate positioning
        return self._generate_coordinate_pdf(elements, output_path, ocr_data)
    else:
        # Mode B: flow layout
        return self._generate_flow_pdf(elements, output_path, ocr_data, json_path)

def _generate_coordinate_pdf(self, elements: List[Dict], output_path: Path, ocr_data: Dict) -> bool:
    """Coordinate-positioning mode: reproduce the layout precisely."""
    # (canvas creation, page sizing and scale factors elided in this sketch)
    # Sort elements by reading_order
    sorted_elements = sorted(elements, key=lambda x: x.get('reading_order', 0))

    # Group by page number
    pages = {}
    for elem in sorted_elements:
        page = elem.get('page', 0)
        if page not in pages:
            pages[page] = []
        pages[page].append(elem)

    # Render each page
    for page_num, page_elements in sorted(pages.items()):
        for elem in page_elements:
            bbox = elem.get('bbox', [])
            elem_type = elem.get('type')
            content = elem.get('content', '')

            if not bbox:
                logger.warning(f"Element {elem['element_id']} has no bbox, skipping")
                continue

            # Render at the exact coordinates
            if elem_type == 'table':
                self.draw_table_at_bbox(pdf_canvas, content, bbox, page_height, scale_w, scale_h)
            elif elem_type == 'text':
                self.draw_text_at_bbox(pdf_canvas, content, bbox, page_height, scale_w, scale_h)
            elif elem_type == 'image':
                self.draw_image_at_bbox(pdf_canvas, content, bbox, page_height, scale_w, scale_h)
            # ... other types

def _generate_flow_pdf(self, elements: List[Dict], output_path: Path, ocr_data: Dict, json_path: Path) -> bool:
    """Flow-layout mode: zero information loss."""
    from reportlab.platypus import SimpleDocTemplate, Paragraph, Table, Image, Spacer
    from reportlab.lib.styles import getSampleStyleSheet

    # Sort elements by reading_order
    sorted_elements = sorted(elements, key=lambda x: x.get('reading_order', 0))

    # Build the story (flowing content)
    story = []
    styles = getSampleStyleSheet()

    for elem in sorted_elements:
        elem_type = elem.get('type')
        content = elem.get('content', '')

        if elem_type == 'title':
            story.append(Paragraph(content, styles['Title']))
        elif elem_type == 'text':
            story.append(Paragraph(content, styles['Normal']))
        elif elem_type == 'table':
            # Parse the HTML table into a ReportLab Table
            table_obj = self._html_to_reportlab_table(content)
            story.append(table_obj)
        elif elem_type == 'image':
            # Embed the image (resolved relative to the JSON file)
            img_path = json_path.parent / content
            if img_path.exists():
                story.append(Image(str(img_path), width=400, height=300))

        story.append(Spacer(1, 12))  # spacing

    # Build the PDF
    doc = SimpleDocTemplate(str(output_path))
    doc.build(story)
    return True
```

---

## 📈 Expected Impact Comparison

### Current vs. New Implementation

| Metric | Current ❌ | New ✅ | Improvement |
|------|-----------|----------|------|
| **bbox info** | empty list `[]` | precise coordinates `[x1,y1,x2,y2]` | ✅ 100% |
| **Reading order** | none (mixed HTML) | `reading_order` field | ✅ 100% |
| **Layout type** | none | `layout_type` (single/double column) | ✅ 100% |
| **Element classification** | naive `<table` check | precise classes (9+ types) | ✅ 100% |
| **Information loss** | 21.6% of text filtered out | 0% loss (flow mode) | ✅ 100% |
| **Coordinate precision** | bbox for some images only | bbox for every element | ✅ 100% |
| **PDF modes** | coordinate positioning only | dual mode (coordinate + flow) | ✅ new |
| **Translation support** | difficult (information loss) | full (zero loss) | ✅ 100% |

### Concrete Improvements

#### 1. Zero information loss
```python
# ❌ Current: 342 text regions → 268 after filtering = 74 lost (21.6%)
filtered_text_regions = self._filter_text_in_regions(text_regions, regions_to_avoid)

# ✅ New: no filtering needed; consume parsing_res_list directly.
# Every element (text, table, image) sits in its own region, with no overlap.
for elem in sorted(elements, key=lambda x: x['reading_order']):
    render_element(elem)  # render everything, zero loss
```

#### 2. Precise bbox
```python
# ❌ Current: bbox is an empty list
{
    'element_id': 0,
    'type': 'table',
    'bbox': [],  # ← cannot be positioned!
}

# ✅ New: precise coordinates taken from layout_bbox
{
    'element_id': 0,
    'type': 'table',
    'bbox': [[770, 776], [1122, 776], [1122, 1058], [770, 1058]],  # ← exact position!
    'reading_order': 3,
    'layout_type': 'single'
}
```

#### 3. Reading order
```python
# ❌ Current: correct reading order cannot be guaranteed;
# tables, images and text are mixed together in arbitrary order.

# ✅ New: the order of parsing_res_list IS the reading order
elements = sorted(elements, key=lambda x: x['reading_order'])
# Elements render in reading_order 0, 1, 2, 3, ...,
# perfectly preserving the document's logical order.
```

---

## 🚀 Implementation Steps

### Phase 1: Core refactor (2-3 hours)

1. **Modify the `analyze_layout()` function**
   - Extract `parsing_res_list` from `page_result.json`
   - Use `layout_bbox` as each element's bbox
   - Preserve `reading_order`
   - Extract `layout_type`
   - Verify the output JSON structure

2. **Add helper functions**
   - `_find_image_bbox()`: look up image bboxes across multiple sources
   - `_convert_bbox_format()`: normalize bbox formats
   - `_extract_element_content()`: extract content by type

3. **Test and validate** (see the check sketch after these steps)
   - Re-run OCR on the existing test files
   - Check that the generated JSON contains bboxes
   - Verify that reading_order is correct

### Phase 2: PDF generation optimization (2-3 hours)

1. **Implement the coordinate-positioning mode**
   - Remove the text-filtering logic
   - Render each element precisely by bbox
   - Order same-page elements by reading_order

2. **Implement the flow-layout mode**
   - Use ReportLab Platypus
   - Build the story in reading_order
   - Implement flow rendering for each element type

3. **Add the API parameter**
   - `/tasks/{id}/download/pdf?mode=coordinate` (default)
   - `/tasks/{id}/download/pdf?mode=flow`

### Phase 3: Testing and tuning (1-2 hours)

1. **Full test pass**
   - Single-page documents
   - Multi-page PDFs
   - Multi-column layouts
   - Complex tables

2. **Performance tuning**
   - Avoid repeated computation
   - Optimize bbox conversion
   - Cache results

3. **Documentation updates**
   - Update the API docs
   - Add usage examples
   - Update the architecture diagram

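For the Phase 1 validation step above, a quick sanity check could look like the following sketch (it assumes the `layout_data` → `elements` JSON shape defined earlier in this plan):

```python
import json
from pathlib import Path

def check_layout_json(json_path: str) -> None:
    """Quick sanity check: every element should carry a bbox and reading_order."""
    data = json.loads(Path(json_path).read_text(encoding='utf-8'))
    elements = data.get('layout_data', {}).get('elements', [])
    missing = [e['element_id'] for e in elements if not e.get('bbox')]
    print(f"{len(elements)} elements, {len(missing)} without bbox: {missing[:10]}")
    orders = [e.get('reading_order') for e in elements]
    assert orders == sorted(orders), "reading_order is not monotonically increasing"
```
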
---

## 💡 Key Technical Details

### 1. Numpy array handling
```python
# layout_bbox is a numpy.ndarray and must be converted to a plain format
layout_bbox = item.get('layout_bbox')
if hasattr(layout_bbox, 'tolist'):
    bbox = layout_bbox.tolist()  # [x1, y1, x2, y2]
else:
    bbox = list(layout_bbox)

# Convert to 4-point format
x1, y1, x2, y2 = bbox
bbox_4point = [[x1, y1], [x2, y1], [x2, y2], [x1, y2]]
```

### 2. Layout-type handling
```python
# Adjust the rendering strategy according to layout_type
layout_type = elem.get('layout_type', 'single')

if layout_type == 'double':
    # Double-column layout: may need special handling
    pass
elif layout_type == 'multi':
    # Multi-column layout: more complex handling
    pass
```

### 3. Guaranteeing reading order
```python
# Make sure elements render in the right order
elements = layout_data.get('elements', [])
sorted_elements = sorted(elements, key=lambda x: (
    x.get('page', 0),          # page number first
    x.get('reading_order', 0)  # then reading order
))
```

---

## ⚠️ Risks and Mitigations

### Risk 1: Backward compatibility
**Problem**: old JSON files lack the new fields

**Mitigation**:
```python
# Add fallback logic in analyze_layout()
parsing_res_list = json_data.get('parsing_res_list', [])
if not parsing_res_list:
    logger.warning("No parsing_res_list, using markdown fallback")
    # use the old markdown parsing logic
```

### Risk 2: PaddleOCR version differences
**Problem**: different PaddleOCR versions may emit different output formats

**Mitigation**:
- Record the PaddleOCR version in the JSON
- Add version-detection logic
- Support multiple versions

### Risk 3: Performance impact
**Problem**: extracting more information may increase processing time

**Mitigation**:
- Extract detailed information only when needed
- Use caching
- Process multiple pages in parallel

---

## 📝 TODO Checklist

### Phase 1: Core refactor
- [ ] Change `analyze_layout()` to use `page_result.json`
- [ ] Extract `parsing_res_list`
- [ ] Extract `layout_bbox` and convert the format
- [ ] Preserve `reading_order`
- [ ] Extract `layout_type`
- [ ] Implement `_find_image_bbox()`
- [ ] Add fallback logic (backward compatibility)
- [ ] Test the new JSON output structure

### Phase 2: PDF generation optimization
- [ ] Implement `_generate_coordinate_pdf()`
- [ ] Implement `_generate_flow_pdf()`
- [ ] Remove the old text-filtering logic
- [ ] Add the `mode` parameter to the API
- [ ] Implement an HTML table parser (for flow mode)
- [ ] Test PDF output for both modes

### Phase 3: Testing and documentation
- [ ] Single-page document tests
- [ ] Multi-page PDF tests
- [ ] Complex layout tests (multi-column, table-dense)
- [ ] Performance tests
- [ ] Update API docs
- [ ] Update usage instructions
- [ ] Write a migration guide

---

## 🎓 Learning Resources

1. **PaddleOCR official docs**
   - [PP-StructureV3 Usage Tutorial](http://www.paddleocr.ai/main/en/version3.x/pipeline_usage/PP-StructureV3.html)
   - [PaddleX PP-StructureV3](https://paddlepaddle.github.io/PaddleX/3.0/en/pipeline_usage/tutorials/ocr_pipelines/PP-StructureV3.html)

2. **ReportLab docs**
   - [Platypus User Guide](https://www.reportlab.com/docs/reportlab-userguide.pdf)
   - [Table Styling](https://www.reportlab.com/docs/reportlab-userguide.pdf#page=80)

3. **Reference implementation**
   - PaddleOCR GitHub: `/paddlex/inference/pipelines/layout_parsing/pipeline_v2.py`

---

## 🏁 Success Criteria

### Must achieve
✅ Every layout element has a precise bbox
✅ Reading order correctly preserved
✅ Zero information loss (flow mode)
✅ Backward compatible (old JSON still works)

### Should achieve
✅ Dual-mode PDF generation (coordinate + flow)
✅ Multi-column layouts handled correctly
✅ Translation support (table text extractable)
✅ No noticeable performance regression

### Stretch goals
✅ Support more element types (formulas, seals)
✅ Layout-type statistics and analysis
✅ Layout structure visualization

---

**Plan completed**: 2025-01-18
**Estimated development time**: 5-8 hours
**Priority**: P0 (highest)

File diff suppressed because it is too large
276
openspec/changes/dual-track-document-processing/design.md
Normal file
@@ -0,0 +1,276 @@
# Technical Design: Dual-track Document Processing

## Context

### Background
The current OCR tool processes all documents through PaddleOCR, even when dealing with editable PDFs that contain extractable text. This causes:
- Unnecessary processing overhead
- Potential quality degradation from re-OCRing already digital text
- Loss of precise formatting information
- Inefficient GPU usage on documents that don't need OCR

### Constraints
- RTX 4060 8GB GPU memory limitation
- Need to maintain backward compatibility with existing API
- Must support future translation features
- Should handle mixed documents (partially scanned, partially digital)

### Stakeholders
- API consumers expecting consistent JSON/PDF output
- Translation system requiring structure preservation
- Performance-sensitive deployments

## Goals / Non-Goals

### Goals
- Intelligently route documents to appropriate processing track
- Preserve document structure for translation
- Optimize GPU usage by avoiding unnecessary OCR
- Maintain unified output format across tracks
- Reduce processing time for editable PDFs by 70%+

### Non-Goals
- Implementing the actual translation engine (future phase)
- Supporting video or audio transcription
- Real-time collaborative editing
- OCR model training or fine-tuning

## Decisions

### Decision 1: Dual-track Architecture
**What**: Implement two separate processing pipelines - OCR track and Direct extraction track

**Why**:
- Editable PDFs don't need OCR, can be processed 10-100x faster
- Direct extraction preserves exact formatting and fonts
- OCR track remains optimal for scanned documents

**Alternatives considered**:
1. **Single enhanced OCR pipeline**: Would still waste resources on editable PDFs
2. **Hybrid approach per page**: Too complex, most documents are uniformly editable or scanned
3. **Multiple specialized pipelines**: Over-engineering for current requirements

### Decision 2: UnifiedDocument Model
**What**: Create a standardized intermediate representation for both tracks

**Why**:
- Provides consistent API interface regardless of processing track
- Simplifies downstream processing (PDF generation, translation)
- Enables track switching without breaking changes

**Structure**:
```python
from dataclasses import dataclass
from typing import Dict, List, Literal, Optional, Union

# DocumentMetadata, Dimensions, ElementType, BoundingBox and StyleInfo
# are companion types defined alongside this model.

@dataclass
class UnifiedDocument:
    document_id: str
    metadata: DocumentMetadata
    pages: List[Page]
    processing_track: Literal["ocr", "direct"]

@dataclass
class Page:
    page_number: int
    elements: List[DocumentElement]
    dimensions: Dimensions

@dataclass
class DocumentElement:
    element_id: str
    type: ElementType  # text, table, image, header, etc.
    content: Union[str, Dict, bytes]
    bbox: BoundingBox
    style: Optional[StyleInfo]
    confidence: Optional[float]  # Only for OCR track
```

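A minimal serialization sketch for API responses, using `dataclasses.asdict` (this assumes JSON-safe content; enum or bytes fields would need custom encoding):

```python
import json
from dataclasses import asdict

def unified_document_to_json(doc: UnifiedDocument) -> str:
    """Serialize the unified model for API responses (assumes JSON-safe content)."""
    return json.dumps(asdict(doc), ensure_ascii=False, indent=2)
```
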
### Decision 3: PyMuPDF for Direct Extraction
**What**: Use PyMuPDF (fitz) library for editable PDF processing

**Why**:
- Mature, well-maintained library
- Excellent coordinate preservation
- Fast C++ backend
- Supports text, tables, and image extraction with positions

**Alternatives considered**:
1. **pdfplumber**: Good but slower, less precise coordinates
2. **PyPDF2**: Limited layout information
3. **PDFMiner**: Complex API, slower performance

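As a sketch of what the direct track's per-page extraction could look like (the mapping into element dicts is illustrative; `get_text("dict")` and its block/line/span structure are PyMuPDF's actual API):

```python
import fitz  # PyMuPDF

def extract_page_elements(pdf_path: str, page_number: int) -> list:
    """Extract text spans with coordinates from one page of an editable PDF."""
    doc = fitz.open(pdf_path)
    page = doc[page_number]
    elements = []
    # "dict" mode returns blocks -> lines -> spans, each with a bbox
    for block in page.get_text("dict")["blocks"]:
        if block["type"] != 0:  # 0 = text block, 1 = image block
            continue
        for line in block["lines"]:
            for span in line["spans"]:
                elements.append({
                    "type": "text",
                    "content": span["text"],
                    "bbox": span["bbox"],  # (x0, y0, x1, y1) in PDF points
                    "style": {"font": span["font"], "size": span["size"]},
                })
    doc.close()
    return elements
```
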
### Decision 4: Processing Track Auto-detection
**What**: Automatically determine optimal track based on document analysis

**Detection logic**:

```python
from pathlib import Path

import fitz   # PyMuPDF
import magic  # python-magic

# MIME types treated as Office documents (docx/xlsx/pptx)
OFFICE_MIMES = {
    'application/vnd.openxmlformats-officedocument.wordprocessingml.document',
    'application/vnd.openxmlformats-officedocument.spreadsheetml.sheet',
    'application/vnd.openxmlformats-officedocument.presentationml.presentation',
}

def detect_track(file_path: Path) -> str:
    file_type = magic.from_file(str(file_path), mime=True)

    if file_type.startswith('image/'):
        return "ocr"

    if file_type == 'application/pdf':
        # Check if PDF has extractable text
        doc = fitz.open(file_path)
        for i in range(min(3, doc.page_count)):  # Sample first 3 pages
            text = doc[i].get_text()
            if len(text.strip()) < 100:  # Minimal text
                return "ocr"
        return "direct"

    if file_type in OFFICE_MIMES:
        return "ocr"  # For now, may add direct Office support later

    return "ocr"  # Default fallback
```

### Decision 5: GPU Memory Management
**What**: Implement dynamic batch sizing and model caching for RTX 4060 8GB

**Why**:
- Prevents OOM errors
- Maximizes throughput
- Enables concurrent request handling

**Strategy**:

```python
from functools import lru_cache

# calculate_batch_size, get_gpu_memory, MODEL_MEMORY_REQUIREMENTS and
# load_model are project-level helpers.

# Adaptive batch sizing based on available memory
batch_size = calculate_batch_size(
    available_memory=get_gpu_memory(),
    image_size=image.shape,
    model_size=MODEL_MEMORY_REQUIREMENTS
)

# Model caching to avoid reload overhead
@lru_cache(maxsize=2)
def get_model(model_type: str):
    return load_model(model_type)
```

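`calculate_batch_size` is one of those project-level helpers; a hedged sketch of how it could be implemented (the per-image activation multiplier is a rough assumption to tune empirically):

```python
def calculate_batch_size(available_memory: int, image_size: tuple, model_size: int,
                         safety_margin: float = 0.2) -> int:
    """Estimate how many images fit in GPU memory alongside the model.

    All sizes are in bytes; the activation blow-up factor is an assumption.
    """
    h, w = image_size[:2]
    per_image = h * w * 3 * 4 * 8  # float32 activations, ~8x blow-up (assumption)
    usable = available_memory * (1 - safety_margin) - model_size
    return max(1, int(usable // per_image))
```
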
### Decision 6: Backward Compatibility
**What**: Maintain existing API while adding new capabilities

**How**:
- Existing endpoints continue working unchanged
- New `processing_track` parameter is optional
- Output format compatible with current consumers
- Gradual migration path for clients

## Risks / Trade-offs

### Risk 1: Mixed Content Documents
**Risk**: Documents with both scanned and digital pages
**Mitigation**:
- Page-level track detection as fallback
- Confidence scoring to identify uncertain pages
- Manual override option via API

### Risk 2: Direct Extraction Quality
**Risk**: Some PDFs have poor internal structure
**Mitigation**:
- Fallback to OCR track if extraction quality is low
- Quality metrics: text density, structure coherence
- User-reportable quality issues

### Risk 3: Memory Pressure
**Risk**: RTX 4060 8GB limitation with concurrent requests
**Mitigation**:
- Request queuing system
- Dynamic batch adjustment
- CPU fallback for overflow

### Trade-off 1: Processing Time vs Accuracy
- Direct extraction: Fast but depends on PDF quality
- OCR: Slower but consistent quality
- **Decision**: Prioritize speed for editable PDFs, accuracy for scanned

### Trade-off 2: Complexity vs Flexibility
- Two tracks increase system complexity
- But enable optimal processing per document type
- **Decision**: Accept complexity for 10x+ performance gains

## Migration Plan

### Phase 1: Infrastructure (Week 1-2)
1. Deploy UnifiedDocument model
2. Implement DocumentTypeDetector
3. Add DirectExtractionEngine
4. Update logging and monitoring

### Phase 2: Integration (Week 3)
1. Update OCR service with routing logic
2. Modify PDF generator for unified model
3. Add new API endpoints
4. Deploy to staging

### Phase 3: Validation (Week 4)
1. A/B testing with subset of traffic
2. Performance benchmarking
3. Quality validation
4. Client integration testing

### Rollback Plan
1. Feature flag to disable dual-track
2. Fallback all requests to OCR track
3. Maintain old code paths during transition
4. Database migration reversible

## Open Questions

### Resolved
- Q: Should we support page-level track mixing?
  - A: No, adds complexity with minimal benefit. Document-level is sufficient.

- Q: How to handle Office documents?
  - A: OCR track initially, consider python-docx/openpyxl later if needed.

### Pending
- Q: What translation services to integrate with?
  - Needs stakeholder input on cost/quality trade-offs

- Q: Should we cache extracted text for repeated processing?
  - Depends on storage costs vs reprocessing frequency

- Q: How to handle password-protected PDFs?
  - May need API parameter for passwords

## Performance Targets

### Direct Extraction Track
- Latency: <500ms per page
- Throughput: 100+ pages/minute
- Memory: <500MB per document

### OCR Track (Optimized)
- Latency: 2-5s per page (GPU)
- Throughput: 20-30 pages/minute
- Memory: <2GB per batch

### API Response Times
- Document type detection: <100ms
- Processing initiation: <200ms
- Result retrieval: <100ms

## Technical Dependencies

### Python Packages
```txt
# Direct extraction
PyMuPDF==1.23.x
pdfplumber==0.10.x  # Fallback/validation
python-magic-bin==0.4.x

# OCR enhancement
paddlepaddle-gpu==2.5.2
paddleocr==2.7.3

# Infrastructure
pydantic==2.x
fastapi==0.100+
redis==5.x  # For caching
```

### System Requirements
- CUDA 11.8+ for PaddlePaddle
- libmagic for file detection
- 16GB RAM minimum
- 50GB disk for models and cache
35
openspec/changes/dual-track-document-processing/proposal.md
Normal file
@@ -0,0 +1,35 @@
# Change: Dual-track Document Processing with Structure-Preserving Translation

## Why

The current system processes all documents through PaddleOCR, causing unnecessary overhead for editable PDFs that already contain extractable text. Additionally, we're only using ~20% of PP-StructureV3's capabilities, missing out on comprehensive document structure extraction. The system needs to support structure-preserving document translation as a future goal.

## What Changes

- **ADDED** Dual-track processing architecture with intelligent routing
  - OCR track for scanned documents, images, and Office files using PaddleOCR
  - Direct extraction track for editable PDFs using PyMuPDF
- **ADDED** UnifiedDocument model as common output format for both tracks
- **ADDED** DocumentTypeDetector service for automatic track selection
- **MODIFIED** OCR service to use PP-StructureV3's parsing_res_list instead of markdown
  - Now extracts all 23 element types with bbox coordinates
  - Preserves reading order and hierarchical structure
- **MODIFIED** PDF generator to handle UnifiedDocument format
  - Enhanced overlap detection to prevent text/image/table collisions
  - Improved coordinate transformation for accurate layout
- **ADDED** Foundation for structure-preserving translation system
- **BREAKING** JSON output structure will include new fields (backward compatible with defaults)

## Impact

- **Affected specs**:
  - `document-processing` (new capability)
  - `result-export` (enhanced with track metadata and structure data)
  - `task-management` (tracks processing route and history)
- **Affected code**:
  - `backend/app/services/ocr_service.py` - Major refactoring for dual-track
  - `backend/app/services/pdf_generator_service.py` - UnifiedDocument support
  - `backend/app/api/v2/tasks.py` - New endpoints for track detection
  - `frontend/src/pages/TaskDetailPage.tsx` - Display processing track info
- **Performance**: 5-10x faster for editable PDFs, same speed for scanned documents
- **Dependencies**: Adds PyMuPDF, pdfplumber, python-magic-bin

@@ -0,0 +1,108 @@
|
||||
# Document Processing Spec Delta
|
||||
|
||||
## ADDED Requirements
|
||||
|
||||
### Requirement: Dual-track Processing
|
||||
The system SHALL support two distinct processing tracks for documents: OCR track for scanned/image documents and Direct extraction track for editable PDFs.
|
||||
|
||||
#### Scenario: Process scanned PDF through OCR track
|
||||
- **WHEN** a scanned PDF is uploaded
|
||||
- **THEN** the system SHALL detect it requires OCR
|
||||
- **AND** route it through PaddleOCR PP-StructureV3 pipeline
|
||||
- **AND** return results in UnifiedDocument format
|
||||
|
||||
#### Scenario: Process editable PDF through direct extraction
|
||||
- **WHEN** an editable PDF with extractable text is uploaded
|
||||
- **THEN** the system SHALL detect it can be directly extracted
|
||||
- **AND** route it through PyMuPDF extraction pipeline
|
||||
- **AND** return results in UnifiedDocument format without OCR
|
||||
|
||||
#### Scenario: Auto-detect processing track
|
||||
- **WHEN** a document is uploaded without explicit track specification
|
||||
- **THEN** the system SHALL analyze the document type and content
|
||||
- **AND** automatically select the optimal processing track
|
||||
- **AND** include the selected track in processing metadata
|
||||
|
||||
### Requirement: Document Type Detection
|
||||
The system SHALL provide intelligent document type detection to determine the optimal processing track.
|
||||
|
||||
#### Scenario: Detect editable PDF
|
||||
- **WHEN** analyzing a PDF document
|
||||
- **THEN** the system SHALL check for extractable text content
|
||||
- **AND** return confidence score for editability
|
||||
- **AND** recommend "direct" track if text coverage > 90%
|
||||
|
||||
#### Scenario: Detect scanned document
|
||||
- **WHEN** analyzing an image or scanned PDF
|
||||
- **THEN** the system SHALL identify lack of extractable text
|
||||
- **AND** recommend "ocr" track for processing
|
||||
- **AND** configure appropriate OCR models
|
||||
|
||||
#### Scenario: Detect Office documents
|
||||
- **WHEN** analyzing .docx, .xlsx, .pptx files
|
||||
- **THEN** the system SHALL identify Office format
|
||||
- **AND** route to OCR track for initial implementation
|
||||
- **AND** preserve option for future direct Office extraction
|
||||
|
||||
### Requirement: Unified Document Model
|
||||
The system SHALL use a standardized UnifiedDocument model as the common output format for both processing tracks.
|
||||
|
||||
#### Scenario: Generate UnifiedDocument from OCR
|
||||
- **WHEN** OCR processing completes
|
||||
- **THEN** the system SHALL convert PP-StructureV3 results to UnifiedDocument
|
||||
- **AND** preserve all element types, coordinates, and confidence scores
|
||||
- **AND** maintain reading order and hierarchical structure
|
||||
|
||||
#### Scenario: Generate UnifiedDocument from direct extraction
|
||||
- **WHEN** direct extraction completes
|
||||
- **THEN** the system SHALL convert PyMuPDF results to UnifiedDocument
|
||||
- **AND** preserve text styling, fonts, and exact positioning
|
||||
- **AND** extract tables with cell boundaries and content
|
||||
|
||||
#### Scenario: Consistent output regardless of track
|
||||
- **WHEN** processing completes through either track
|
||||
- **THEN** the output SHALL conform to UnifiedDocument schema
|
||||
- **AND** include processing_track metadata field
|
||||
- **AND** support identical downstream operations (PDF generation, translation)
|
||||
|
||||
### Requirement: Enhanced OCR with Full PP-StructureV3
|
||||
The system SHALL utilize the full capabilities of PP-StructureV3, extracting all 23 element types from parsing_res_list.
|
||||
|
||||
#### Scenario: Extract comprehensive document structure
|
||||
- **WHEN** processing through OCR track
|
||||
- **THEN** the system SHALL use page_result.json['parsing_res_list']
|
||||
- **AND** extract all element types including headers, lists, tables, figures
|
||||
- **AND** preserve layout_bbox coordinates for each element
|
||||
|
||||
#### Scenario: Maintain reading order
|
||||
- **WHEN** extracting elements from PP-StructureV3
|
||||
- **THEN** the system SHALL preserve the reading order from parsing_res_list
|
||||
- **AND** assign sequential indices to elements
|
||||
- **AND** support reordering for complex layouts
|
||||
|
||||
#### Scenario: Extract table structure
|
||||
- **WHEN** PP-StructureV3 identifies a table
|
||||
- **THEN** the system SHALL extract cell content and boundaries
|
||||
- **AND** preserve table HTML for structure
|
||||
- **AND** extract plain text for translation
|
||||
|
||||
### Requirement: Structure-Preserving Translation Foundation

The system SHALL maintain document structure and layout information to support future translation features.

#### Scenario: Preserve coordinates for translation

- **WHEN** processing any document
- **THEN** the system SHALL retain bbox coordinates for all text elements
- **AND** calculate space requirements for text expansion/contraction
- **AND** maintain element relationships and groupings

#### Scenario: Extract translatable content

- **WHEN** processing tables and lists
- **THEN** the system SHALL extract plain text content
- **AND** maintain mapping to original structure
- **AND** preserve formatting markers for reconstruction

#### Scenario: Support layout adjustment

- **WHEN** preparing for translation
- **THEN** the system SHALL identify flexible vs fixed layout regions
- **AND** calculate maximum text expansion ratios
- **AND** preserve non-translatable elements (logos, signatures)
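For the expansion-ratio calculation, one simple heuristic is to compare the bbox width against the space the current text occupies; the formula below is an assumed placeholder, not a finalized metric:

```python
# Sketch: estimate how much a translated string may grow before overflowing its bbox.
def max_expansion_ratio(text: str, bbox_width: float, avg_char_width: float) -> float:
    used = len(text) * avg_char_width   # rough width of the source text
    return bbox_width / used if used else 1.0
```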
@@ -0,0 +1,74 @@

# Result Export Spec Delta

## MODIFIED Requirements

### Requirement: Export Interface

The Export page SHALL support downloading OCR results in multiple formats using V2 task APIs, with processing track information and enhanced structure data.

#### Scenario: Export page uses V2 download endpoints

- **WHEN** user selects a format and clicks export button
- **THEN** frontend SHALL call V2 endpoint `/api/v2/tasks/{task_id}/download/{format}`
- **AND** frontend SHALL NOT call the legacy V1 `/api/v1/export` endpoint (which returns 404)
- **AND** file SHALL download successfully
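For reference, the V2 download path can be exercised directly; a minimal sketch using httpx (already used elsewhere in this repo), with the base URL, token, and task id as placeholders:

```python
# Sketch: fetch an export via the V2 task endpoint rather than the removed V1 route.
import httpx

def download_result(base_url: str, token: str, task_id: str, fmt: str = "json") -> bytes:
    resp = httpx.get(
        f"{base_url}/api/v2/tasks/{task_id}/download/{fmt}",
        headers={"Authorization": f"Bearer {token}"},
    )
    resp.raise_for_status()  # surfaces a 404 if a stale V1-style path is used
    return resp.content
```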
#### Scenario: Export supports multiple formats

- **WHEN** user exports a completed task
- **THEN** system SHALL support downloading as TXT, JSON, Excel, Markdown, and PDF
- **AND** each format SHALL use correct V2 download endpoint
- **AND** downloaded files SHALL contain task OCR results

#### Scenario: Export includes processing track metadata

- **WHEN** user exports a task processed through dual-track system
- **THEN** exported JSON SHALL include "processing_track" field indicating "ocr" or "direct"
- **AND** SHALL include "processing_metadata" with track-specific information
- **AND** SHALL maintain backward compatibility for clients not expecting these fields

#### Scenario: Export UnifiedDocument format

- **WHEN** user requests JSON export with unified=true parameter
- **THEN** system SHALL return UnifiedDocument structure
- **AND** include complete element hierarchy with coordinates
- **AND** preserve all PP-StructureV3 element types for OCR track
## ADDED Requirements

### Requirement: Enhanced PDF Export with Layout Preservation

The PDF export SHALL accurately preserve document layout from both OCR and direct extraction tracks.

#### Scenario: Export PDF from direct extraction track

- **WHEN** exporting PDF from a direct-extraction processed document
- **THEN** the PDF SHALL maintain exact text positioning from source
- **AND** preserve original fonts and styles where possible
- **AND** include extracted images at correct positions

#### Scenario: Export PDF from OCR track with full structure

- **WHEN** exporting PDF from OCR-processed document
- **THEN** the PDF SHALL use all 23 PP-StructureV3 element types
- **AND** render tables with proper cell boundaries
- **AND** maintain reading order from parsing_res_list

#### Scenario: Handle coordinate transformations

- **WHEN** generating PDF from UnifiedDocument
- **THEN** system SHALL correctly transform bbox coordinates to PDF space
- **AND** handle page size variations
- **AND** prevent text overlap using enhanced overlap detection
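The coordinate transformation is mostly a y-axis flip plus scaling; a sketch assuming image-space bboxes with a top-left origin and ReportLab-style PDF space with a bottom-left origin:

```python
# Sketch: map a top-left-origin bbox (image space) into bottom-left-origin PDF points.
def bbox_to_pdf(bbox, page_height_px: float, scale: float):
    x0, y0, x1, y1 = bbox
    return (
        x0 * scale,
        (page_height_px - y1) * scale,  # flip the y axis: image top becomes PDF bottom
        x1 * scale,
        (page_height_px - y0) * scale,
    )
```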
### Requirement: Structure Data Export

The system SHALL provide export formats that preserve document structure for downstream processing.

#### Scenario: Export structured JSON with hierarchy

- **WHEN** user selects structured JSON format
- **THEN** export SHALL include element hierarchy and relationships
- **AND** preserve parent-child relationships (sections, lists)
- **AND** include style and formatting information

#### Scenario: Export for translation preparation

- **WHEN** user exports with translation_ready=true parameter
- **THEN** export SHALL include translatable text segments
- **AND** maintain coordinate mappings for each segment
- **AND** mark non-translatable regions

#### Scenario: Export with layout analysis

- **WHEN** user requests layout analysis export
- **THEN** system SHALL include reading order indices
- **AND** identify layout regions (header, body, footer, sidebar)
- **AND** provide confidence scores for layout detection
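A sketch of what a translation_ready payload could contain, derived from the scenarios above; the JSON keys and the set of non-translatable element types are assumptions:

```python
# Sketch: build translation-ready segments from a UnifiedDocument.
NON_TRANSLATABLE = {"figure", "image", "seal", "signature", "formula"}  # assumed set

def translation_segments(doc: UnifiedDocument) -> list[dict]:
    return [
        {
            "segment_id": el.reading_order,
            "text": el.text,
            "bbox": el.bbox,  # coordinate mapping for re-layout after translation
            "translatable": el.element_type not in NON_TRANSLATABLE,
        }
        for el in doc.elements
    ]
```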
@@ -0,0 +1,105 @@

# Task Management Spec Delta

## MODIFIED Requirements

### Requirement: Task Result Generation

The OCR service SHALL generate both JSON and Markdown result files for completed tasks with actual content, including processing track information and enhanced structure data.

#### Scenario: Markdown file contains OCR results

- **WHEN** a task completes OCR processing successfully
- **THEN** the generated `.md` file SHALL contain the extracted text in markdown format
- **AND** the file size SHALL be greater than 0 bytes
- **AND** the markdown SHALL include headings, paragraphs, and formatting based on OCR layout detection

#### Scenario: Result files stored in task directory

- **WHEN** OCR processing completes for task ID `88c6c2d2-37e1-48fd-a50f-406142987bdf`
- **THEN** result files SHALL be stored in `storage/results/88c6c2d2-37e1-48fd-a50f-406142987bdf/`
- **AND** both `<filename>_result.json` and `<filename>_result.md` SHALL exist
- **AND** both files SHALL contain valid OCR output data

#### Scenario: Include processing track in results

- **WHEN** a task completes through dual-track processing
- **THEN** the JSON result SHALL include "processing_track" field
- **AND** SHALL indicate whether "ocr" or "direct" track was used
- **AND** SHALL include track-specific metadata (confidence for OCR, extraction quality for direct)

#### Scenario: Store UnifiedDocument format

- **WHEN** processing completes through either track
- **THEN** system SHALL save results in UnifiedDocument format
- **AND** maintain backward-compatible JSON structure
- **AND** include enhanced structure from PP-StructureV3 or PyMuPDF
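A sketch of the result-writing step these scenarios imply; paths follow the `storage/results/{task_id}/` convention above, while `unified.to_dict()` is an assumed serializer on the UnifiedDocument sketch:

```python
# Sketch: persist both result files with processing-track metadata.
import json
from pathlib import Path

def save_results(task_id: str, stem: str, unified, markdown: str) -> None:
    out_dir = Path("storage/results") / task_id
    out_dir.mkdir(parents=True, exist_ok=True)
    payload = unified.to_dict()                        # assumed serializer
    payload["processing_track"] = unified.processing_track
    (out_dir / f"{stem}_result.json").write_text(
        json.dumps(payload, ensure_ascii=False, indent=2), encoding="utf-8")
    (out_dir / f"{stem}_result.md").write_text(markdown, encoding="utf-8")
```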
### Requirement: Task Detail View

The frontend SHALL provide a dedicated page for viewing individual task details with processing track information and enhanced preview capabilities.

#### Scenario: Navigate to task detail page

- **WHEN** user clicks "View Details" button on task in Task History page
- **THEN** browser SHALL navigate to `/tasks/{task_id}`
- **AND** TaskDetailPage component SHALL render

#### Scenario: Display task information

- **WHEN** TaskDetailPage loads for a valid task ID
- **THEN** page SHALL display task metadata (filename, status, processing time, confidence)
- **AND** page SHALL show markdown preview of OCR results
- **AND** page SHALL provide download buttons for JSON, Markdown, and PDF formats

#### Scenario: Download from task detail page

- **WHEN** user clicks download button for a specific format
- **THEN** browser SHALL download the file using `/api/v2/tasks/{task_id}/download/{format}` endpoint
- **AND** downloaded file SHALL contain the task's OCR results in requested format

#### Scenario: Display processing track information

- **WHEN** viewing task processed through dual-track system
- **THEN** page SHALL display processing track used (OCR or Direct)
- **AND** show track-specific metrics (OCR confidence or extraction quality)
- **AND** provide option to reprocess with alternate track if applicable

#### Scenario: Preview document structure

- **WHEN** user enables structure view
- **THEN** page SHALL display document element hierarchy
- **AND** show bounding boxes overlay on preview
- **AND** highlight different element types (headers, tables, lists) with distinct colors
## ADDED Requirements

### Requirement: Processing Track Management

The task management system SHALL track and display processing track information for all tasks.

#### Scenario: Track processing route selection

- **WHEN** a task begins processing
- **THEN** system SHALL record the selected processing track
- **AND** log the reason for track selection
- **AND** store auto-detection confidence score

#### Scenario: Allow track override

- **WHEN** user views a completed task
- **THEN** system SHALL offer option to reprocess with different track
- **AND** maintain both results for comparison
- **AND** track which result user prefers

#### Scenario: Display processing metrics

- **WHEN** task completes processing
- **THEN** system SHALL record track-specific metrics
- **AND** OCR track SHALL show confidence scores and character count
- **AND** Direct track SHALL show extraction coverage and structure quality
### Requirement: Task Processing History

The system SHALL maintain detailed processing history for tasks including track changes and reprocessing.

#### Scenario: Record reprocessing attempts

- **WHEN** a task is reprocessed with different track
- **THEN** system SHALL maintain processing history
- **AND** store results from each attempt
- **AND** allow comparison between different processing attempts

#### Scenario: Track quality improvements

- **WHEN** viewing task history
- **THEN** system SHALL show quality metrics over time
- **AND** indicate if reprocessing improved results
- **AND** suggest optimal track based on document characteristics

#### Scenario: Export processing analytics

- **WHEN** exporting task data
- **THEN** system SHALL include processing history
- **AND** provide track selection statistics
- **AND** include performance metrics for each processing attempt
170
openspec/changes/dual-track-document-processing/tasks.md
Normal file
@@ -0,0 +1,170 @@

# Implementation Tasks: Dual-track Document Processing

## 1. Core Infrastructure

- [ ] 1.1 Add PyMuPDF and other dependencies to requirements.txt
  - [ ] 1.1.1 Add PyMuPDF==1.23.x
  - [ ] 1.1.2 Add pdfplumber==0.10.x
  - [ ] 1.1.3 Add python-magic-bin==0.4.x
  - [ ] 1.1.4 Test dependency installation
- [ ] 1.2 Create UnifiedDocument model in backend/app/models/
  - [ ] 1.2.1 Define UnifiedDocument dataclass
  - [ ] 1.2.2 Add DocumentElement model
  - [ ] 1.2.3 Add DocumentMetadata model
  - [ ] 1.2.4 Create converters for both OCR and direct extraction outputs
- [ ] 1.3 Create DocumentTypeDetector service
  - [ ] 1.3.1 Implement file type detection using python-magic
  - [ ] 1.3.2 Add PDF editability checking logic
  - [ ] 1.3.3 Add Office document detection
  - [ ] 1.3.4 Create routing logic to determine processing track
  - [ ] 1.3.5 Add unit tests for detector
## 2. Direct Extraction Track

- [ ] 2.1 Create DirectExtractionEngine service
  - [ ] 2.1.1 Implement PyMuPDF-based text extraction
  - [ ] 2.1.2 Add structure preservation logic
  - [ ] 2.1.3 Extract tables with coordinates
  - [ ] 2.1.4 Extract images and their positions
  - [ ] 2.1.5 Maintain reading order
  - [ ] 2.1.6 Handle multi-column layouts
- [ ] 2.2 Implement layout analysis for editable PDFs
  - [ ] 2.2.1 Detect headers and footers
  - [ ] 2.2.2 Identify sections and subsections
  - [ ] 2.2.3 Parse lists and nested structures
  - [ ] 2.2.4 Extract font and style information
- [ ] 2.3 Create direct extraction to UnifiedDocument converter
  - [ ] 2.3.1 Map PyMuPDF structures to UnifiedDocument
  - [ ] 2.3.2 Preserve coordinate information
  - [ ] 2.3.3 Maintain element relationships
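Task 2.1 centers on PyMuPDF's structured text API; a sketch of the extraction loop, with the span-to-element mapping simplified to a flat list:

```python
# Sketch: pull positioned text spans out of an editable PDF with PyMuPDF.
import fitz  # PyMuPDF

def extract_spans(pdf_path: str) -> list[dict]:
    spans = []
    with fitz.open(pdf_path) as doc:
        for page_no, page in enumerate(doc):
            for block in page.get_text("dict")["blocks"]:
                for line in block.get("lines", []):  # image blocks carry no lines
                    for span in line["spans"]:
                        spans.append({
                            "page": page_no,
                            "text": span["text"],
                            "bbox": span["bbox"],
                            "font": span["font"],    # style info for preservation
                            "size": span["size"],
                        })
    return spans
```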
## 3. OCR Track Enhancement

- [ ] 3.1 Upgrade PP-StructureV3 configuration
  - [ ] 3.1.1 Update config for RTX 4060 8GB optimization
  - [ ] 3.1.2 Enable batch processing for GPU efficiency
  - [ ] 3.1.3 Configure memory management settings
  - [ ] 3.1.4 Set up model caching
- [ ] 3.2 Enhance OCR service to use parsing_res_list
  - [ ] 3.2.1 Replace markdown extraction with parsing_res_list
  - [ ] 3.2.2 Extract all 23 element types
  - [ ] 3.2.3 Preserve bbox coordinates from PP-StructureV3
  - [ ] 3.2.4 Maintain reading order information
- [ ] 3.3 Create OCR to UnifiedDocument converter
  - [ ] 3.3.1 Map PP-StructureV3 elements to UnifiedDocument
  - [ ] 3.3.2 Handle complex nested structures
  - [ ] 3.3.3 Preserve all metadata
## 4. Unified Processing Pipeline

- [ ] 4.1 Update main OCR service for dual-track processing
  - [ ] 4.1.1 Integrate DocumentTypeDetector
  - [ ] 4.1.2 Route to appropriate processing engine
  - [ ] 4.1.3 Return UnifiedDocument from both tracks
  - [ ] 4.1.4 Maintain backward compatibility
- [ ] 4.2 Create unified JSON export
  - [ ] 4.2.1 Define standardized JSON schema
  - [ ] 4.2.2 Include processing metadata
  - [ ] 4.2.3 Support both track outputs
- [ ] 4.3 Update PDF generator for UnifiedDocument
  - [ ] 4.3.1 Adapt PDF generation to use UnifiedDocument
  - [ ] 4.3.2 Preserve layout from both tracks
  - [ ] 4.3.3 Handle coordinate transformations
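The routing in 4.1 reduces to a small dispatch; a sketch reusing the hypothetical `recommend_track` helper from earlier, with the engine entry points (`run_direct_extraction`, `run_ocr_pipeline`) as assumed names:

```python
# Sketch: dual-track dispatch returning a UnifiedDocument either way.
def process_document(path: str) -> UnifiedDocument:
    decision = recommend_track(path)       # hypothetical detector from above
    if decision["track"] == "direct":
        doc = run_direct_extraction(path)  # assumed engine entry points
    else:
        doc = run_ocr_pipeline(path)
    doc.metadata["track_confidence"] = decision["confidence"]
    return doc
```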
## 5. Translation System Foundation

- [ ] 5.1 Create TranslationEngine interface
  - [ ] 5.1.1 Define translation API contract
  - [ ] 5.1.2 Support element-level translation
  - [ ] 5.1.3 Preserve formatting markers
- [ ] 5.2 Implement structure-preserving translation
  - [ ] 5.2.1 Translate text while maintaining coordinates
  - [ ] 5.2.2 Handle table cell translations
  - [ ] 5.2.3 Preserve list structures
  - [ ] 5.2.4 Maintain header hierarchies
- [ ] 5.3 Create translated document renderer
  - [ ] 5.3.1 Generate PDF with translated text
  - [ ] 5.3.2 Adjust layouts for text expansion/contraction
  - [ ] 5.3.3 Handle font substitution for target languages
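The contract in 5.1 could be as small as one abstract method operating on elements, so both tracks feed it identically; a sketch, with the method name as an assumption:

```python
# Sketch: element-level translation contract (task 5.1).
from abc import ABC, abstractmethod

class TranslationEngine(ABC):
    @abstractmethod
    def translate_elements(
        self, elements: list[DocumentElement], target_lang: str
    ) -> list[DocumentElement]:
        """Return elements with translated text but unchanged bbox and order."""
```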
## 6. API Updates

- [ ] 6.1 Update OCR endpoints
  - [ ] 6.1.1 Add processing_track parameter
  - [ ] 6.1.2 Support track auto-detection
  - [ ] 6.1.3 Return processing metadata
- [ ] 6.2 Add document type detection endpoint
  - [ ] 6.2.1 Create /analyze endpoint
  - [ ] 6.2.2 Return recommended processing track
  - [ ] 6.2.3 Provide confidence scores
- [ ] 6.3 Update result export endpoints
  - [ ] 6.3.1 Support UnifiedDocument format
  - [ ] 6.3.2 Add format conversion options
  - [ ] 6.3.3 Include processing track information
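The /analyze endpoint in 6.2 could simply surface the detector result; a FastAPI sketch (the backend already uses FastAPI routers), with the route prefix and temp-file handling as assumptions:

```python
# Sketch: document-analysis endpoint returning the recommended track (task 6.2).
from fastapi import APIRouter, UploadFile

router = APIRouter(prefix="/api/v2/documents")

@router.post("/analyze")
async def analyze_document(file: UploadFile):
    tmp_path = f"/tmp/{file.filename}"     # naive temp handling, sketch only
    with open(tmp_path, "wb") as fh:
        fh.write(await file.read())
    decision = recommend_track(tmp_path)   # hypothetical helper from above
    return {
        "recommended_track": decision["track"],
        "confidence": decision["confidence"],
    }
```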
## 7. Frontend Updates

- [ ] 7.1 Update task detail view
  - [ ] 7.1.1 Display processing track information
  - [ ] 7.1.2 Show track-specific metadata
  - [ ] 7.1.3 Add track selection UI (if manual override needed)
- [ ] 7.2 Update results preview
  - [ ] 7.2.1 Handle UnifiedDocument format
  - [ ] 7.2.2 Display enhanced structure information
  - [ ] 7.2.3 Show coordinate overlays (debug mode)
- [ ] 7.3 Add translation UI preparation
  - [ ] 7.3.1 Add translation toggle/button
  - [ ] 7.3.2 Language selection dropdown
  - [ ] 7.3.3 Translation progress indicator
## 8. Testing

- [ ] 8.1 Unit tests for DocumentTypeDetector
  - [ ] 8.1.1 Test various file types
  - [ ] 8.1.2 Test editability detection
  - [ ] 8.1.3 Test edge cases
- [ ] 8.2 Unit tests for DirectExtractionEngine
  - [ ] 8.2.1 Test text extraction accuracy
  - [ ] 8.2.2 Test structure preservation
  - [ ] 8.2.3 Test coordinate extraction
- [ ] 8.3 Integration tests for dual-track processing
  - [ ] 8.3.1 Test routing logic
  - [ ] 8.3.2 Test UnifiedDocument generation
  - [ ] 8.3.3 Test backward compatibility
- [ ] 8.4 End-to-end tests
  - [ ] 8.4.1 Test scanned PDF processing (OCR track)
  - [ ] 8.4.2 Test editable PDF processing (direct track)
  - [ ] 8.4.3 Test Office document processing
  - [ ] 8.4.4 Test image file processing
- [ ] 8.5 Performance testing
  - [ ] 8.5.1 Benchmark both processing tracks
  - [ ] 8.5.2 Test GPU memory usage
  - [ ] 8.5.3 Compare processing times
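For 8.1, the detector tests can stay fixture-driven; a pytest sketch against the hypothetical `recommend_track` helper, with the fixture paths as placeholders:

```python
# Sketch: unit tests for the hypothetical recommend_track helper (task 8.1).
import pytest

@pytest.mark.parametrize("fixture, expected", [
    ("tests/fixtures/editable.pdf", "direct"),  # placeholder fixture paths
    ("tests/fixtures/scanned.pdf", "ocr"),
])
def test_track_recommendation(fixture, expected):
    assert recommend_track(fixture)["track"] == expected
```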
## 9. Documentation

- [ ] 9.1 Update API documentation
  - [ ] 9.1.1 Document new endpoints
  - [ ] 9.1.2 Update existing endpoint docs
  - [ ] 9.1.3 Add processing track information
- [ ] 9.2 Create architecture documentation
  - [ ] 9.2.1 Document dual-track flow
  - [ ] 9.2.2 Explain UnifiedDocument structure
  - [ ] 9.2.3 Add decision trees for track selection
- [ ] 9.3 Add deployment guide
  - [ ] 9.3.1 Document GPU requirements
  - [ ] 9.3.2 Add environment configuration
  - [ ] 9.3.3 Include troubleshooting guide
## 10. Deployment Preparation

- [ ] 10.1 Update Docker configuration
  - [ ] 10.1.1 Add new dependencies to Dockerfile
  - [ ] 10.1.2 Configure GPU support
  - [ ] 10.1.3 Update volume mappings
- [ ] 10.2 Update environment variables
  - [ ] 10.2.1 Add processing track settings
  - [ ] 10.2.2 Configure GPU memory limits
  - [ ] 10.2.3 Add feature flags
- [ ] 10.3 Create migration plan
  - [ ] 10.3.1 Plan for existing data migration
  - [ ] 10.3.2 Create rollback procedures
  - [ ] 10.3.3 Document breaking changes
## Completion Checklist

- [ ] All unit tests passing
- [ ] Integration tests passing
- [ ] Performance benchmarks acceptable
- [ ] Documentation complete
- [ ] Code reviewed
- [ ] Deployment tested in staging
@@ -1,226 +0,0 @@

#!/usr/bin/env python3
"""
Proof of Concept: External API Authentication Test
Tests the external authentication API at https://pj-auth-api.vercel.app
"""

import asyncio
import json
from datetime import datetime
from typing import Dict, Any, Optional

import httpx
from pydantic import BaseModel, Field


class UserInfo(BaseModel):
    """User information from external API"""
    id: str
    name: str
    email: str
    job_title: Optional[str] = Field(None, alias="jobTitle")
    office_location: Optional[str] = Field(None, alias="officeLocation")
    business_phones: list[str] = Field(default_factory=list, alias="businessPhones")


class AuthSuccessData(BaseModel):
    """Successful authentication response data"""
    access_token: str
    id_token: str
    expires_in: int
    token_type: str
    user_info: UserInfo = Field(alias="userInfo")
    issued_at: str = Field(alias="issuedAt")
    expires_at: str = Field(alias="expiresAt")


class AuthSuccessResponse(BaseModel):
    """Successful authentication response"""
    success: bool
    message: str
    data: AuthSuccessData
    timestamp: str


class AuthErrorResponse(BaseModel):
    """Failed authentication response"""
    success: bool
    error: str
    code: str
    timestamp: str


class ExternalAuthClient:
    """Client for external authentication API"""

    def __init__(self, base_url: str = "https://pj-auth-api.vercel.app", timeout: int = 30):
        self.base_url = base_url
        self.timeout = timeout
        self.endpoint = "/api/auth/login"

    async def authenticate(self, username: str, password: str) -> Dict[str, Any]:
        """
        Authenticate user with external API

        Args:
            username: User email/username
            password: User password

        Returns:
            Authentication result dictionary
        """
        url = f"{self.base_url}{self.endpoint}"

        print(f"ℹ Endpoint: POST {url}")
        print(f"ℹ Username: {username}")
        print(f"ℹ Timestamp: {datetime.now().isoformat()}")
        print()

        async with httpx.AsyncClient() as client:
            try:
                # Make authentication request
                start_time = datetime.now()
                response = await client.post(
                    url,
                    json={"username": username, "password": password},
                    timeout=self.timeout
                )
                elapsed = (datetime.now() - start_time).total_seconds()

                # Print response details
                print("Response Details:")
                print(f"  Status Code: {response.status_code}")
                print(f"  Response Time: {elapsed:.3f}s")
                print(f"  Content-Type: {response.headers.get('content-type', 'N/A')}")
                print()

                # Parse response
                response_data = response.json()
                print("Response Body:")
                print(json.dumps(response_data, indent=2, ensure_ascii=False))
                print()

                # Handle success/failure
                if response.status_code == 200:
                    auth_response = AuthSuccessResponse(**response_data)
                    return {
                        "success": True,
                        "status_code": response.status_code,
                        "data": auth_response.dict(),
                        "user_display_name": auth_response.data.user_info.name,
                        "user_email": auth_response.data.user_info.email,
                        "token": auth_response.data.access_token,
                        "expires_in": auth_response.data.expires_in,
                        "expires_at": auth_response.data.expires_at
                    }
                elif response.status_code == 401:
                    error_response = AuthErrorResponse(**response_data)
                    return {
                        "success": False,
                        "status_code": response.status_code,
                        "error": error_response.error,
                        "code": error_response.code
                    }
                else:
                    return {
                        "success": False,
                        "status_code": response.status_code,
                        "error": f"Unexpected status code: {response.status_code}",
                        "response": response_data
                    }

            except httpx.TimeoutException:
                print(f"❌ Request timeout after {self.timeout} seconds")
                return {
                    "success": False,
                    "error": "Request timeout",
                    "code": "TIMEOUT"
                }
            except httpx.RequestError as e:
                print(f"❌ Request error: {e}")
                return {
                    "success": False,
                    "error": str(e),
                    "code": "REQUEST_ERROR"
                }
            except Exception as e:
                print(f"❌ Unexpected error: {e}")
                return {
                    "success": False,
                    "error": str(e),
                    "code": "UNKNOWN_ERROR"
                }


async def test_authentication():
    """Test authentication with different scenarios"""
    client = ExternalAuthClient()

    # Test scenarios
    test_cases = [
        {
            "name": "Valid Credentials (Example)",
            "username": "ymirliu@panjit.com.tw",
            "password": "correct_password",  # Replace with actual password for testing
            "expected": "success"
        },
        {
            "name": "Invalid Credentials",
            "username": "test@example.com",
            "password": "wrong_password",
            "expected": "failure"
        }
    ]

    for i, test_case in enumerate(test_cases, 1):
        print(f"{'='*60}")
        print(f"Test Case {i}: {test_case['name']}")
        print(f"{'='*60}")

        result = await client.authenticate(
            username=test_case["username"],
            password=test_case["password"]
        )

        # Analyze result
        print("\nAnalysis:")
        if result["success"]:
            print("✅ Authentication successful")
            print(f"  User: {result.get('user_display_name', 'N/A')}")
            print(f"  Email: {result.get('user_email', 'N/A')}")
            print(f"  Token expires in: {result.get('expires_in', 0)} seconds")
            print(f"  Expires at: {result.get('expires_at', 'N/A')}")
        else:
            print("❌ Authentication failed")
            print(f"  Error: {result.get('error', 'Unknown error')}")
            print(f"  Code: {result.get('code', 'N/A')}")

        print("\n")


async def test_token_validation():
    """Test token validation and refresh logic"""
    # This would be implemented when we have a valid token
    print("Token validation test - To be implemented with actual tokens")


def main():
    """Main entry point"""
    print("External Authentication API Test")
    print("================================\n")

    # Run tests
    asyncio.run(test_authentication())

    print("\nTest completed!")
    print("\nNotes for implementation:")
    print("1. Use httpx for async HTTP requests (already in requirements)")
    print("2. Store tokens securely (consider encryption)")
    print("3. Implement automatic token refresh before expiration")
    print("4. Handle network failures with retry logic")
    print("5. Map external user ID to local user records")
    print("6. Display user 'name' field in UI instead of username")


if __name__ == "__main__":
    main()