chore: project cleanup and prepare for dual-track processing refactor

- Removed all test files and directories
- Deleted outdated documentation (will be rewritten)
- Cleaned up temporary files, logs, and uploads
- Archived 5 completed OpenSpec proposals
- Created new dual-track-document-processing proposal with complete OpenSpec structure
  - Dual-track architecture: OCR track (PaddleOCR) + Direct track (PyMuPDF)
  - UnifiedDocument model for consistent output
  - Support for structure-preserving translation
- Updated .gitignore to prevent future test/temp files

This is a major cleanup preparing for the complete refactoring of the document processing pipeline.

🤖 Generated with [Claude Code](https://claude.com/claude-code)

Co-Authored-By: Claude <noreply@anthropic.com>
egg
2025-11-18 20:02:31 +08:00
parent 0edc56b03f
commit cd3cbea49d
64 changed files with 3573 additions and 8190 deletions


@@ -1,7 +1,15 @@
 {
   "permissions": {
     "allow": [
-      "Bash(git commit:*)"
+      "Bash(git commit:*)",
+      "Bash(xargs ls:*)",
+      "Bash(jq:*)",
+      "Bash(python:*)",
+      "Bash(python3:*)",
+      "Bash(source venv/bin/activate)",
+      "Bash(find:*)",
+      "Bash(ls:*)",
+      "Bash(openspec list:*)"
     ],
     "deny": [],
     "ask": []

.gitignore

@@ -89,3 +89,12 @@ build/
 Thumbs.db
 ehthumbs.db
 Desktop.ini
+
+# Test and temporary files
+backend/uploads/*
+storage/uploads/*
+storage/results/*
+*.log
+__pycache__/
+*.bak
+test_*.py


@@ -1,743 +0,0 @@
# Tool_OCR API Reference & Issues Report
## Document Info
- **Created**: 2025-01-13
- **Version**: v0.1.0
- **Purpose**: Complete record of all API endpoints and all frontend/backend inconsistencies
---
## Table of Contents
1. [API Endpoint List](#api-endpoint-list)
2. [Frontend/Backend Inconsistencies](#frontendbackend-inconsistencies)
3. [Fix Recommendations](#fix-recommendations)
---
## API Endpoint List
### 1. Authentication
#### POST `/api/v1/auth/login`
- **Function**: User login
- **Request Body**:
```typescript
{
  username: string,
  password: string
}
```
- **Response**:
```typescript
{
  access_token: string,
  token_type: string,  // "bearer"
  expires_in: number   // Token lifetime in seconds
}
```
- **Backend implementation**: ✅ [backend/app/routers/auth.py:24](backend/app/routers/auth.py#L24)
- **Frontend usage**: ✅ [frontend/src/services/api.ts:106](frontend/src/services/api.ts#L106)
- **Status**: ⚠️ **Issue**: the frontend type is missing the `expires_in` field
---
### 2. File Upload
#### POST `/api/v1/upload`
- **Function**: Upload files for OCR processing
- **Request Body**: `multipart/form-data`
  - `files`: File[] - list of files (PNG, JPG, JPEG, PDF)
  - `batch_name`: string (optional) - batch name
- **Response**:
```typescript
{
  batch_id: number,
  files: [
    {
      id: number,
      batch_id: number,
      filename: string,
      original_filename: string,
      file_size: number,
      file_format: string,  // ⚠️ the backend uses file_format
      status: string,
      error: string | null,
      created_at: string,
      processing_time: number | null
    }
  ]
}
```
- **Backend implementation**: ✅ [backend/app/routers/ocr.py:39](backend/app/routers/ocr.py#L39)
- **Frontend usage**: ✅ [frontend/src/services/api.ts:128](frontend/src/services/api.ts#L128)
- **Status**: ⚠️ **Issue**: the frontend type uses `format`, the backend uses `file_format`
---
### 3. OCR Processing
#### POST `/api/v1/ocr/process`
- **Function**: Trigger OCR batch processing
- **Request Body**:
```typescript
{
  batch_id: number,
  lang: string,          // "ch", "en", "japan", "korean"
  detect_layout: boolean // ⚠️ the backend expects detect_layout; the frontend sends confidence_threshold
}
```
- **Response**:
```typescript
{
  message: string,     // ⚠️ returned by the backend
  batch_id: number,
  total_files: number, // ⚠️ returned by the backend
  status: string       // "processing"
  // task_id: string   // ❌ expected by the frontend, but not returned by the backend
}
```
- **Backend implementation**: ✅ [backend/app/routers/ocr.py:95](backend/app/routers/ocr.py#L95)
- **Frontend usage**: ✅ [frontend/src/services/api.ts:148](frontend/src/services/api.ts#L148)
- **Status**: ⚠️ **Issue**: request/response models do not match
---
#### GET `/api/v1/batch/{batch_id}/status`
- **Function**: Get batch processing status
- **Path parameters**:
  - `batch_id`: number - batch ID
- **Response**:
```typescript
{
  batch: {
    id: number,
    user_id: number,
    batch_name: string | null,
    status: string,
    total_files: number,
    completed_files: number,
    failed_files: number,
    progress_percentage: number,
    created_at: string,
    started_at: string | null,
    completed_at: string | null
  },
  files: [
    {
      id: number,
      batch_id: number,
      filename: string,
      original_filename: string,
      file_size: number,
      file_format: string,
      status: string,
      error: string | null,
      created_at: string,
      processing_time: number | null
    }
  ]
}
```
- **Backend implementation**: ✅ [backend/app/routers/ocr.py:148](backend/app/routers/ocr.py#L148)
- **Frontend usage**: ✅ [frontend/src/services/api.ts:172](frontend/src/services/api.ts#L172)
- **Status**: ✅ **OK**
---
#### GET `/api/v1/ocr/result/{file_id}`
- **Function**: Get OCR result
- **Path parameters**:
  - `file_id`: number - file ID
- **Response**:
```typescript
{
  file_id: number,
  filename: string,
  status: string,
  markdown_content: string | null,
  json_data: {
    total_text_regions: number,
    average_confidence: number,
    detected_language: string,
    layout_data: object | null,
    images_metadata: array | null
  } | null,
  confidence: number | null,
  processing_time: number | null
}
```
- **Backend implementation**: ✅ [backend/app/routers/ocr.py:182](backend/app/routers/ocr.py#L182)
- **Frontend usage**: ✅ [frontend/src/services/api.ts:164](frontend/src/services/api.ts#L164)
  - ⚠️ **Note**: the frontend uses `taskId` as the parameter name; it is actually a `file_id`
- **Status**: ⚠️ **Issue**: misleading frontend parameter name
---
#### ❌ GET `/api/v1/ocr/status/{task_id}`
- **Function**: Get task status (expected by the frontend but does not exist)
- **Status**: ❌ **Missing**: the frontend calls this endpoint, but the backend does not implement it
- **Frontend usage**: [frontend/src/services/api.ts:156](frontend/src/services/api.ts#L156)
- **Problem**: the frontend receives a 404 error
---
### 4. Export
#### POST `/api/v1/export`
- **Function**: Export OCR results
- **Request Body**:
```typescript
{
  batch_id: number,
  format: "txt" | "json" | "excel" | "markdown" | "pdf" | "zip",
  rule_id: number | null,
  css_template: string, // "default", "academic", "business"
  include_formats: string[] | null,
  options: {
    confidence_threshold: number | null,
    include_metadata: boolean,
    filename_pattern: string | null,
    css_template: string | null
  } | null
}
```
- **Response**: File download (Blob)
- **Backend implementation**: ✅ [backend/app/routers/export.py:38](backend/app/routers/export.py#L38)
- **Frontend usage**: ✅ [frontend/src/services/api.ts:182](frontend/src/services/api.ts#L182)
- **Status**: ✅ **OK**
---
#### GET `/api/v1/export/pdf/{file_id}`
- **Function**: Generate a PDF for a single file
- **Path parameters**:
  - `file_id`: number - file ID
- **Query parameters**:
  - `css_template`: string - CSS template name
- **Response**: PDF file (Blob)
- **Backend implementation**: ✅ [backend/app/routers/export.py:144](backend/app/routers/export.py#L144)
- **Frontend usage**: ✅ [frontend/src/services/api.ts:192](frontend/src/services/api.ts#L192)
- **Status**: ✅ **OK**
---
#### GET `/api/v1/export/rules`
- **Function**: List export rules
- **Response**:
```typescript
[
  {
    id: number,
    user_id: number,
    rule_name: string,
    description: string | null,
    config_json: object,
    css_template: string | null,
    created_at: string,
    updated_at: string
  }
]
```
- **Backend implementation**: ✅ [backend/app/routers/export.py:206](backend/app/routers/export.py#L206)
- **Frontend usage**: ✅ [frontend/src/services/api.ts:204](frontend/src/services/api.ts#L204)
- **Status**: ✅ **OK**
---
#### POST `/api/v1/export/rules`
- **Function**: Create an export rule
- **Request Body**:
```typescript
{
  rule_name: string,
  description: string | null,
  config_json: object,
  css_template: string | null
}
```
- **Response**: a single object with the same shape as GET `/api/v1/export/rules`
- **Backend implementation**: ✅ [backend/app/routers/export.py:220](backend/app/routers/export.py#L220)
- **Frontend usage**: ✅ [frontend/src/services/api.ts:212](frontend/src/services/api.ts#L212)
- **Status**: ✅ **OK**
---
#### PUT `/api/v1/export/rules/{rule_id}`
- **Function**: Update an export rule
- **Path parameters**:
  - `rule_id`: number - rule ID
- **Request Body**: same as POST `/api/v1/export/rules` (all fields optional)
- **Response**: a single object with the same shape as GET `/api/v1/export/rules`
- **Backend implementation**: ✅ [backend/app/routers/export.py:254](backend/app/routers/export.py#L254)
- **Frontend usage**: ✅ [frontend/src/services/api.ts:220](frontend/src/services/api.ts#L220)
- **Status**: ✅ **OK**
---
#### DELETE `/api/v1/export/rules/{rule_id}`
- **Function**: Delete an export rule
- **Path parameters**:
  - `rule_id`: number - rule ID
- **Response**:
```typescript
{
  message: "Export rule deleted successfully"
}
```
- **Backend implementation**: ✅ [backend/app/routers/export.py:295](backend/app/routers/export.py#L295)
- **Frontend usage**: ✅ [frontend/src/services/api.ts:228](frontend/src/services/api.ts#L228)
- **Status**: ✅ **OK**
---
#### GET `/api/v1/export/css-templates`
- **Function**: List CSS templates
- **Response**:
```typescript
[
  {
    name: string,
    description: string,
    filename: string // ⚠️ defined in the schema, but missing from the actual response
  }
]
```
- **Backend implementation**: ✅ [backend/app/routers/export.py:326](backend/app/routers/export.py#L326)
  - Actual response: `[{ name, description }]`
  - Schema definition: `[{ name, description, filename }]`
- **Frontend usage**: ✅ [frontend/src/services/api.ts:235](frontend/src/services/api.ts#L235)
- **Status**: ⚠️ **Issue**: the `filename` field is missing
---
### 5. Translation (RESERVED)
#### GET `/api/v1/translate/status`
- **Function**: Get translation feature status
- **Response**:
```typescript
{
  status: "RESERVED",
  message: string,
  planned_phase: string,
  features: string[]
}
```
- **Backend implementation**: ✅ [backend/app/routers/translation.py:28](backend/app/routers/translation.py#L28)
- **Frontend usage**: ❌ Not used
- **Status**: ✅ **OK** (reserved feature)
---
#### GET `/api/v1/translate/languages`
- **Function**: List supported languages
- **Response**:
```typescript
[
  {
    code: string,
    name: string,
    native_name: string
  }
]
```
- **Backend implementation**: ✅ [backend/app/routers/translation.py:43](backend/app/routers/translation.py#L43)
- **Frontend usage**: ❌ Not used
- **Status**: ✅ **OK** (reserved feature)
---
#### POST `/api/v1/translate/document`
- **Function**: Translate a document (not implemented)
- **Request Body**:
```typescript
{
  file_id: number,
  source_lang: string,
  target_lang: string,
  engine_type: "argos" | "ernie" | "google" | "deepl",
  preserve_structure: boolean,
  engine_config: object | null
}
```
- **Response**: HTTP 501 Not Implemented
- **Backend implementation**: ✅ [backend/app/routers/translation.py:56](backend/app/routers/translation.py#L56) (stub)
- **Frontend usage**: ✅ [frontend/src/services/api.ts:247](frontend/src/services/api.ts#L247)
- **Status**: ⚠️ **Reserved**: the frontend receives a 501 error
---
#### ❌ GET `/api/v1/translate/configs`
- **Function**: Get translation configs (expected by the frontend but does not exist)
- **Status**: ❌ **Missing**: the frontend calls this endpoint, but the backend does not implement it
- **Frontend usage**: [frontend/src/services/api.ts:258](frontend/src/services/api.ts#L258)
- **Problem**: the frontend receives a 404 error
---
#### ❌ POST `/api/v1/translate/configs`
- **Function**: Create a translation config (expected by the frontend but does not exist)
- **Status**: ❌ **Missing**: the frontend calls this endpoint, but the backend does not implement it
- **Frontend usage**: [frontend/src/services/api.ts:269](frontend/src/services/api.ts#L269)
- **Problem**: the frontend receives a 404 error
---
### 6. Miscellaneous
#### GET `/health`
- **Function**: Health check
- **Response**:
```typescript
{
  status: "healthy",
  service: "Tool_OCR",
  version: "0.1.0"
}
```
- **Backend implementation**: ✅ [backend/app/main.py:84](backend/app/main.py#L84)
- **Frontend usage**: ❌ Not used
- **Status**: ✅ **OK**
---
#### GET `/`
- **Function**: API info
- **Response**:
```typescript
{
  message: "Tool_OCR API",
  version: "0.1.0",
  docs_url: "/docs",
  health_check: "/health"
}
```
- **Backend implementation**: ✅ [backend/app/main.py:95](backend/app/main.py#L95)
- **Frontend usage**: ❌ Not used
- **Status**: ✅ **OK**
---
## Frontend/Backend Inconsistencies
### Issue 1: Login response shapes differ
**Severity**: 🟡 Medium
**Description**:
- The backend response includes an `expires_in` field (token lifetime)
- The frontend `LoginResponse` type is missing this field
**Impact**:
- The frontend cannot implement automatic token renewal
- Users cannot be warned that their token is about to expire
**Locations**:
- Backend: [backend/app/routers/auth.py:66-70](backend/app/routers/auth.py#L66-L70)
- Frontend: [frontend/src/types/api.ts:12-15](frontend/src/types/api.ts#L12-L15)
---
### Issue 2: OCR task status API does not exist
**Severity**: 🔴 High
**Description**:
- The frontend tries to call `/api/v1/ocr/status/{taskId}` to get task progress
- The backend only provides `/api/v1/batch/{batch_id}/status` and `/api/v1/ocr/result/{file_id}`
- There is no corresponding task-status endpoint
**Impact**:
- The frontend's `getTaskStatus()` calls receive 404 errors
- Real-time progress polling cannot be implemented
- Users cannot see processing progress
**Locations**:
- Frontend call: [frontend/src/services/api.ts:156-159](frontend/src/services/api.ts#L156-L159)
- Backend route: does not exist
---
### Issue 3: OCR process request/response models do not match
**Severity**: 🔴 High
**Description**:
1. **Request field mismatch**:
   - The frontend sends `confidence_threshold` (confidence cutoff)
   - The backend accepts `detect_layout` (layout-detection flag)
2. **Response field mismatch**:
   - The frontend expects `task_id` (for task tracking)
   - The backend returns `message` and `total_files` (but no `task_id`)
**Impact**:
- The frontend cannot pass parameters to the backend correctly
- The frontend cannot obtain a `task_id` for follow-up status queries
- Type checks fail
- Validation errors are likely
**Locations**:
- Frontend request: [frontend/src/types/api.ts:37-41](frontend/src/types/api.ts#L37-L41)
- Frontend response: [frontend/src/types/api.ts:43-47](frontend/src/types/api.ts#L43-L47)
- Backend request: [backend/app/schemas/ocr.py:120-133](backend/app/schemas/ocr.py#L120-L133)
- Backend response: [backend/app/schemas/ocr.py:136-151](backend/app/schemas/ocr.py#L136-L151)
---
### Issue 4: Upload file-format field names differ
**Severity**: 🟡 Medium
**Description**:
- The backend returns the file format as `file_format`
- The frontend type definition uses `format`
**Impact**:
- The frontend cannot use the backend's `file_format` field directly
- Extra field mapping or conversion is required
- The file format may render as undefined in the UI
**Locations**:
- Frontend: [frontend/src/types/api.ts:32](frontend/src/types/api.ts#L32)
- Backend: [backend/app/schemas/ocr.py:19](backend/app/schemas/ocr.py#L19)
---
### Issue 5: CSS template list is missing filename
**Severity**: 🟡 Medium
**Description**:
- The frontend `CSSTemplate` type expects a `filename` field
- The backend schema `CSSTemplateResponse` also defines `filename`
- But the actual backend response contains only `name` and `description`
**Impact**:
- The frontend cannot use `filename` as the `<option>` key/value
- `filename` is undefined at render time
- The frontend needs extra handling or must fall back to `name`
**Locations**:
- Frontend type: [frontend/src/types/api.ts:132-136](frontend/src/types/api.ts#L132-L136)
- Backend schema: [backend/app/schemas/export.py:91-104](backend/app/schemas/export.py#L91-L104)
- Backend implementation: [backend/app/routers/export.py:333-338](backend/app/routers/export.py#L333-L338)
- PDF service: [backend/app/services/pdf_generator.py:485-496](backend/app/services/pdf_generator.py#L485-L496)
**Root cause**:
`PDFGenerator.get_available_templates()` returns only a `{name: description}` dict and does not include a filename.
---
### Issue 6: Translation config endpoints not implemented
**Severity**: 🟢 Low (reserved feature)
**Description**:
- The frontend tries to call `/api/v1/translate/configs` (GET/POST)
- The backend translation router only implements `/status`, `/languages`, and `/document`
- There are no config endpoints
**Impact**:
- Frontend calls receive 404 errors
- Translation configs cannot be managed
- Low impact overall, since the entire translation feature is reserved for Phase 5
**Locations**:
- Frontend GET: [frontend/src/services/api.ts:258-262](frontend/src/services/api.ts#L258-L262)
- Frontend POST: [frontend/src/services/api.ts:269-275](frontend/src/services/api.ts#L269-L275)
- Backend route: does not exist
---
## Fix Recommendations
### Recommendation 1: Unify the login response model
**Priority**: P2 (medium)
**Option A: add expires_in on the frontend** (recommended):
```typescript
// frontend/src/types/api.ts
export interface LoginResponse {
  access_token: string
  token_type: string
  expires_in: number // new field
}
```
**Option B: remove expires_in from the backend**:
- If token-expiry management is not needed, the field can be dropped
- Not recommended: exposing the lifetime is common JWT practice
---
### Recommendation 2: Unify the OCR task-tracking strategy
**Priority**: P1 (high)
**Option A: track batch status only** (recommended):
1. Remove the frontend's `getTaskStatus()` method
2. Poll batch status everywhere via `getBatchStatus()`
3. Remove `task_id` from `ProcessResponse`
**Option B: add a task-status endpoint on the backend**:
1. Add a `GET /api/v1/ocr/status/{task_id}` endpoint
2. Make `ProcessResponse` actually return a `task_id`
3. Implement task-level status tracking
**Recommendation**: take Option A, since the current architecture already manages status at the batch level.
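Option A's polling loop is straightforward. A sketch of the control flow (Python for brevity; the `fetch_status` callable stands in for the real `getBatchStatus()` API call, and `poll_batch` is an illustrative name, not an existing utility):

```python
import time
from typing import Callable

TERMINAL_STATES = {"completed", "failed"}


def poll_batch(fetch_status: Callable[[], dict],
               interval: float = 2.0, max_polls: int = 100) -> dict:
    """Poll batch status until it reaches a terminal state (sketch only)."""
    for _ in range(max_polls):
        status = fetch_status()
        if status["batch"]["status"] in TERMINAL_STATES:
            return status
        time.sleep(interval)
    raise TimeoutError("batch did not finish within the polling budget")
```

The 2-second default mirrors the interval the frontend already uses for batch-status polling.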
---
### Recommendation 3: Align the OCR process request/response
**Priority**: P1 (high)
**Option A: frontend follows the backend** (recommended):
```typescript
// frontend/src/types/api.ts
export interface ProcessRequest {
  batch_id: number
  lang?: string
  detect_layout?: boolean // renamed to detect_layout
}
export interface ProcessResponse {
  message: string     // added
  batch_id: number
  total_files: number // added
  status: string
  // task_id removed
}
```
**Option B: backend follows the frontend**:
- Accept a `confidence_threshold` parameter
- Include `task_id` in the response
- Requires much larger changes; not recommended
---
### Recommendation 4: Align the upload file-format field name
**Priority**: P2 (medium)
**Option A: frontend switches to file_format** (recommended):
```typescript
// frontend/src/types/api.ts
export interface FileInfo {
  id: number
  filename: string
  file_size: number
  file_format: string // renamed to file_format
  status: 'pending' | 'processing' | 'completed' | 'failed'
}
```
**Option B: backend uses a Pydantic alias**:
```python
# backend/app/schemas/ocr.py
file_format: str = Field(..., alias='format')
```
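Either option amounts to a single key rename at the frontend/backend boundary. If neither type can change immediately, a thin adapter in the frontend's API layer papers over the mismatch; this sketch uses Python for brevity, and the helper name `adapt_file_info` is purely illustrative:

```python
def adapt_file_info(raw: dict) -> dict:
    """Map the backend's `file_format` key to the `format` key
    that the current frontend FileInfo type expects."""
    adapted = dict(raw)  # copy so the original response is untouched
    if "file_format" in adapted:
        adapted["format"] = adapted.pop("file_format")
    return adapted
```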
---
### Recommendation 5: Include the CSS template filename
**Priority**: P2 (medium)
**Option A: change the PDF generator's return structure** (recommended):
```python
# backend/app/services/pdf_generator.py
def get_available_templates(self) -> Dict[str, Dict[str, str]]:
    """Get available CSS templates, including filenames"""
    return {
        "default": {
            "description": "General-purpose layout, suitable for most documents",
            "filename": "default.css"
        },
        "academic": {
            "description": "Academic-paper layout, suitable for research reports",
            "filename": "academic.css"
        },
        "business": {
            "description": "Business-report layout, suitable for corporate documents",
            "filename": "business.css"
        },
    }
```
**Option B: frontend uses name as the filename**:
- The template name is effectively the identifier anyway
- No separate filename needed
---
### Recommendation 6: Handle the translation-config stubs
**Priority**: P3 (low)
**Option A: remove the calls on the frontend** (recommended):
1. Remove or comment out `getTranslationConfigs()` and `createTranslationConfig()`
2. Show a "coming soon" message in the UI
**Option B: add stub endpoints on the backend**:
```python
# backend/app/routers/translation.py
@router.get("/configs")
async def get_translation_configs():
    raise HTTPException(status_code=501, detail="Feature reserved for Phase 5")

@router.post("/configs")
async def create_translation_config():
    raise HTTPException(status_code=501, detail="Feature reserved for Phase 5")
```
---
## Implementation Priority Summary
### P1: Fix immediately (affects core functionality)
1. **Recommendation 2**: unify the OCR task-tracking strategy
2. **Recommendation 3**: align the OCR process request/response models
### P2: Fix soon (affects user experience)
3. **Recommendation 1**: unify the login response model
4. **Recommendation 4**: align the upload file-format field name
5. **Recommendation 5**: include the CSS template filename
### P3: Can be deferred (reserved features)
6. ⏸️ **Recommendation 6**: handle the translation-config stubs (revisit in Phase 5)
---
## Document Maintenance
**Change log**:
- 2025-01-13: initial version; full inventory of all API endpoints and issues
**Maintenance policy**:
- Update this document with every API change
- Add new endpoints to the matching section
- Update issue statuses once fixed
---
## Appendix: Quick Checklists
### When adding an API endpoint
- [ ] Is the backend schema definition complete?
- [ ] Do the frontend TypeScript types match?
- [ ] Is field naming consistent (camelCase vs snake_case)?
- [ ] Does the response shape match what the frontend expects?
- [ ] Is error handling complete?
- [ ] Is the API documentation updated?
- [ ] Are there corresponding tests?
### When modifying an API
- [ ] Are frontend and backend changed in sync?
- [ ] Is this a breaking change?
- [ ] Are related documents updated?
- [ ] Are existing features affected?
- [ ] Is a version migration needed?
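The naming-consistency check above is easy to automate. A small sketch (the function names and the simple camelCase-to-snake_case rule are assumptions for illustration, not an existing project utility):

```python
import re


def to_snake_case(name: str) -> str:
    """Convert camelCase to snake_case, e.g. batchId -> batch_id."""
    return re.sub(r"(?<!^)(?=[A-Z])", "_", name).lower()


def mismatched_keys(frontend_keys: set, backend_keys: set) -> set:
    """Frontend keys whose snake_case form has no backend counterpart."""
    return {k for k in frontend_keys if to_snake_case(k) not in backend_keys}
```

Run against the types in `frontend/src/types/api.ts` and the backend schemas, this would have flagged the `format` vs `file_format` mismatch automatically.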


@@ -1,275 +0,0 @@
# Chart Recognition Feature Status
## 🎉 Current Status: Enabled!
Chart recognition is now **enabled**: PaddlePaddle 3.2.1 provides the required `fused_rms_norm_ext` API.
### ✅ Issue Resolved
- **Resolved on**: 2025-11-16
- **PaddlePaddle version**: 3.2.1 (upgraded from 3.0.0)
- **API status**: `fused_rms_norm_ext` is now available ✅
- **Feature status**: PP-StructureV3 chart recognition enabled ✅
- **Code change**: [ocr_service.py:217](backend/app/services/ocr_service.py#L217) - `use_chart_recognition=True`
### 📜 Historical Limitation (resolved)
- **Original problem**: PaddlePaddle 3.0.0 lacked the `fused_rms_norm_ext` API
- **Recorded**: March 2025 (against PaddlePaddle 3.0.0)
- **Fixed in**: PaddlePaddle 3.2.0+ (released September 2025)
- **Verified with**: PaddlePaddle 3.2.1
---
## 🎯 Full Feature Set Now Available
| Category | Feature | Status | Notes |
|---------|------|------|------|
| **Core OCR** | Text recognition | ✅ Working | Core OCR functionality |
| **Layout analysis** | Chart detection | ✅ Working | Locates charts on the page |
| **Layout analysis** | Chart extraction | ✅ Working | Saves charts as image files |
| **Table recognition** | Table recognition | ✅ Working | Supports nested formulas/images |
| **Formula recognition** | LaTeX extraction | ✅ Working | Recognizes mathematical formulas |
| **Chart recognition** | Chart-type classification | ✅ **Enabled** | Bar, line, and other chart types |
| **Chart recognition** | Data extraction | ✅ **Enabled** | Extracts numeric data from charts |
| **Chart recognition** | Axis/legend parsing | ✅ **Enabled** | Axis labels and legends |
| **Chart recognition** | Chart-to-structured | ✅ **Enabled** | Converts charts to JSON/table form |
---
## 🔧 System Configuration Updates
### 1. CUDA library path
To enable GPU acceleration, the WSL CUDA library path was added to the shell configuration:
```bash
# ~/.bashrc
export LD_LIBRARY_PATH=/usr/lib/wsl/lib:$LD_LIBRARY_PATH
```
### 2. PaddlePaddle version
```bash
# Current version
PaddlePaddle 3.2.1
# GPU support
✅ CUDA 12.6
✅ cuDNN 9.5
✅ GPU Compute Capability: 8.9
```
### 3. Service configuration
```python
# backend/app/services/ocr_service.py:217
use_chart_recognition=True  # ✅ enabled
```
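The service currently enables the flag unconditionally. A defensive variant would probe for the API first, so the code keeps working on older PaddlePaddle builds; this is a sketch assuming only that availability can be tested with `hasattr`, the same check the verification script relies on:

```python
def chart_recognition_available(functional_module) -> bool:
    """Return True when the extended fused RMSNorm kernel is present.

    Pass `paddle.incubate.nn.functional` in the real service; accepting
    any object here keeps the check testable without PaddlePaddle.
    """
    return hasattr(functional_module, "fused_rms_norm_ext")


# Usage sketch (hypothetical wiring):
#   import paddle.incubate.nn.functional as F
#   pipeline = PPStructureV3(use_chart_recognition=chart_recognition_available(F))
```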
---
## 📊 Version History and API Support
| Version | Release Date | `fused_rms_norm_ext` | Chart Recognition |
|------|---------|-------------------------|---------|
| 3.0.0 | 2025-03-26 | ❌ Not supported | ❌ Disabled |
| 3.1.0 | 2025-06-29 | ❓ Unverified | ❓ Unknown |
| 3.1.1 | 2025-08-20 | ❓ Unverified | ❓ Unknown |
| 3.2.0 | 2025-09-08 | ✅ Likely supported | ✅ Can be enabled |
| 3.2.1 | 2025-10-30 | ✅ **Confirmed** | ✅ **Enabled** |
| 3.2.2 | 2025-11-14 | ✅ Expected to work | ✅ Expected to work |
**Verified on**: 2025-11-16
**Verified version**: PaddlePaddle 3.2.1
**Verification script**: `backend/verify_chart_recognition.py`
---
## ⚠️ Performance Considerations
Impact of enabling chart recognition:
### Processing time
- **Simple charts**: +2-3 seconds per chart
- **Complex charts**: +5-10 seconds per chart
- **Multi-chart pages**: processing time scales accordingly
### Memory usage
- **GPU memory**: roughly +500 MB to 1 GB
- **System memory**: roughly +200-500 MB
### Accuracy
- **Simple charts** (bar, line): >85%
- **Complex charts** (multi-axis, combined): >70%
- **Special charts** (radar, scatter): >60%
**Recommendation**: for chart-heavy documents, use GPU acceleration for best performance.
---
## 🧪 Testing Chart Recognition
### Quick test
Run the verification script to confirm availability:
```bash
cd /home/egg/project/Tool_OCR
source venv/bin/activate
python backend/verify_chart_recognition.py
```
Expected output:
```
✅ PaddlePaddle version: 3.2.1
📊 API Availability:
  - fused_rms_norm: ✅ Available
  - fused_rms_norm_ext: ✅ Available
🎉 Chart recognition CAN be enabled!
```
### End-to-end test
1. **Start the backend service**:
```bash
cd backend
source venv/bin/activate
python -m app.main
```
2. **Upload a document containing charts**:
   - PDF, Word, PowerPoint, etc.
   - Make sure the document includes charts (bar charts, line charts, ...)
3. **Inspect the output**:
   - Check whether the parsed result contains chart data
   - Verify that chart types are classified correctly
   - Check whether data extraction is accurate
---
## 🔍 Technical Details
### The fused_rms_norm_ext API
**RMSNorm (Root Mean Square Layer Normalization)**:
- A layer-normalization technique used in deep learning
- Computationally cheaper than LayerNorm
- A core building block of the PaddleOCR-VL chart-recognition model
**API signature**:
```python
paddle.incubate.nn.functional.fused_rms_norm_ext(
    x,
    norm_weight,
    norm_bias=None,
    epsilon=1e-5,
    begin_norm_axis=1,
    bias=None,
    residual=None,
    quant_scale=-1,
    quant_round_type=0,
    quant_max_bound=0,
    quant_min_bound=0
)
```
**Difference from the base version**:
- `fused_rms_norm`: base implementation
- `fused_rms_norm_ext`: extended version with additional optimizations and parameters
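For intuition, RMSNorm scales each element of a vector by the reciprocal of the vector's root mean square: `y_i = x_i / sqrt(mean(x**2) + eps) * w_i`. A dependency-free reference implementation (for illustration only; the fused kernel computes the same thing on-device, far faster):

```python
import math


def rms_norm(x: list, weight: list, epsilon: float = 1e-5) -> list:
    """Reference RMSNorm: y_i = x_i / sqrt(mean(x**2) + eps) * w_i."""
    mean_sq = sum(v * v for v in x) / len(x)
    inv_rms = 1.0 / math.sqrt(mean_sq + epsilon)
    return [v * inv_rms * w for v, w in zip(x, weight)]
```

Unlike LayerNorm, there is no mean subtraction and no bias by default, which is where the computational saving comes from.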
### Code locations
- **Main toggle**: [backend/app/services/ocr_service.py:217](backend/app/services/ocr_service.py#L217)
- **CPU fallback**: [backend/app/services/ocr_service.py:235](backend/app/services/ocr_service.py#L235)
- **PP-StructureV3 initialization**: [backend/app/services/ocr_service.py:211-219](backend/app/services/ocr_service.py#L211-L219)
---
## 📚 Related Documentation Updates
The following documents need updating to reflect that chart recognition is enabled:
### Updated
- ✅ `CHART_RECOGNITION.md` - this document
- ✅ `backend/app/services/ocr_service.py` - code change
### Pending
- [ ] `README.md` - remove the chart-recognition entry from "Known Limitations"
- [ ] `openspec/changes/add-gpu-acceleration-support/tasks.md` - mark task 5.4 as done
- [ ] `openspec/changes/add-gpu-acceleration-support/proposal.md` - update the "Known Issues" section
- [ ] `openspec/project.md` - document the chart-recognition feature
---
## 🆘 Troubleshooting
### Problem: still reported unavailable after upgrading
**Diagnosis**:
```bash
python -c "import paddle; print(paddle.__version__)"
python -c "import paddle.incubate.nn.functional as F; print(hasattr(F, 'fused_rms_norm_ext'))"
```
**Fix**:
1. Make sure the virtual environment is activated
2. Reinstall PaddlePaddle from scratch:
```bash
pip uninstall paddlepaddle -y
pip install 'paddlepaddle>=3.2.0'
```
### Problem: GPU initialization fails
**Error message**: `libcuda.so.1: cannot open shared object file`
**Fix**:
```bash
# Confirm LD_LIBRARY_PATH includes the WSL CUDA path
echo $LD_LIBRARY_PATH | grep wsl
# If missing, add it to ~/.bashrc:
echo 'export LD_LIBRARY_PATH=/usr/lib/wsl/lib:$LD_LIBRARY_PATH' >> ~/.bashrc
source ~/.bashrc
```
### Problem: inaccurate chart-recognition results
**Possible causes**:
- Low-quality chart images
- Unusual or complex chart types
- Occluded or overlapping text
**Suggestions**:
- Increase the input document's resolution
- Use clean chart styles
- Proofread manually where necessary
---
## 🎉 Summary
**Chart recognition is now fully available!**
| Item | Status |
|------|------|
| API availability | ✅ `fused_rms_norm_ext` ships with PaddlePaddle 3.2.1 |
| Feature status | ✅ Chart recognition enabled |
| GPU support | ✅ CUDA 12.6 + cuDNN 9.5 running normally |
| Test verification | ✅ Verification script confirms availability |
| Documentation | ✅ This document is up to date |
**Next steps**:
1. Test processing of real documents
2. Validate chart-recognition accuracy
3. Update related README and OpenSpec documents
4. Consider performance tuning
---
**Last updated**: 2025-11-16
**Updated by**: Development Team
**PaddlePaddle version**: 3.2.1
**Feature status**: ✅ Chart recognition enabled


@@ -1,893 +0,0 @@
# Tool_OCR Frontend API Documentation
> **Version**: 0.1.0
> **Last Updated**: 2025-01-13
> **Purpose**: Complete documentation of frontend architecture, component structure, API integration, and dependencies
---
## Table of Contents
1. [Project Overview](#project-overview)
2. [Technology Stack](#technology-stack)
3. [Component Architecture](#component-architecture)
4. [Page → API Dependency Matrix](#page--api-dependency-matrix)
5. [Component Tree Structure](#component-tree-structure)
6. [State Management Strategy](#state-management-strategy)
7. [Route Configuration](#route-configuration)
8. [API Integration Patterns](#api-integration-patterns)
9. [UI/UX Design System](#uiux-design-system)
10. [Error Handling Patterns](#error-handling-patterns)
11. [Deployment Configuration](#deployment-configuration)
---
## Project Overview
The Tool_OCR frontend is a modern OCR document-processing UI built on React 19 + Vite (see the dependency list below), offering an enterprise-grade user interface and experience.
### Key Features
- **Batch file upload**: drag-and-drop upload with multi-file batch processing
- **Real-time progress tracking**: polling-based display of OCR processing progress
- **Result preview**: dual-format preview in Markdown and JSON
- **Flexible export**: TXT, JSON, Excel, Markdown, PDF, and ZIP formats
- **Rule management**: customizable export rules and CSS templates
- **Responsive design**: adapts to desktop and tablet devices
---
## Technology Stack
### Core Dependencies
```json
{
"@tanstack/react-query": "^5.90.7", // Server state management
"react": "^19.2.0", // UI framework
"react-dom": "^19.2.0",
"react-router-dom": "^7.9.5", // Routing
"vite": "^7.2.2", // Build tool
"typescript": "~5.9.3" // Type safety
}
```
### UI & Styling
```json
{
"tailwindcss": "^4.1.17", // CSS framework
"class-variance-authority": "^0.7.0", // Component variants
"clsx": "^2.1.1", // Class name utility
"tailwind-merge": "^3.4.0", // Tailwind class merge
"lucide-react": "^0.553.0" // Icon library
}
```
### State & Data
```json
{
"zustand": "^5.0.8", // Client state
"axios": "^1.13.2", // HTTP client
"react-dropzone": "^14.3.8", // File upload
"react-markdown": "^9.0.1" // Markdown rendering
}
```
### Internationalization
```json
{
"i18next": "^25.6.2",
"react-i18next": "^16.3.0"
}
```
---
## Component Architecture
### Atomic Design Structure
```
frontend/src/
├── components/
│ ├── ui/ # Atomic components (shadcn/ui)
│ │ ├── button.tsx
│ │ ├── card.tsx
│ │ ├── input.tsx
│ │ ├── label.tsx
│ │ ├── select.tsx
│ │ ├── badge.tsx
│ │ ├── progress.tsx
│ │ ├── alert.tsx
│ │ ├── dialog.tsx
│ │ ├── tabs.tsx
│ │ ├── table.tsx
│ │ └── toast.tsx
│ ├── FileUpload.tsx # Drag-and-drop upload component
│ ├── ResultsTable.tsx # OCR results display table
│ ├── MarkdownPreview.tsx # Markdown content renderer
│ └── Layout.tsx # Main app layout with sidebar
├── pages/
│ ├── LoginPage.tsx # Authentication
│ ├── UploadPage.tsx # File upload and selection
│ ├── ProcessingPage.tsx # OCR processing status
│ ├── ResultsPage.tsx # Results viewing and preview
│ ├── ExportPage.tsx # Export configuration and download
│ └── SettingsPage.tsx # User settings and rules management
├── store/
│ ├── authStore.ts # Authentication state (Zustand)
│ └── uploadStore.ts # Upload batch state (Zustand)
├── services/
│ └── api.ts # API client (Axios)
├── types/
│ └── api.ts # TypeScript type definitions
├── lib/
│ └── utils.ts # Utility functions
├── i18n/
│ └── index.ts # i18n configuration
└── styles/
└── index.css # Global styles and CSS variables
```
---
## Page → API Dependency Matrix
| Page/Component | API Endpoints Used | HTTP Method | Purpose | Polling |
|----------------|-------------------|-------------|---------|---------|
| **LoginPage** | `/api/v1/auth/login` | POST | User authentication | No |
| **UploadPage** | `/api/v1/upload` | POST | Upload files for OCR | No |
| **ProcessingPage** | `/api/v1/ocr/process` | POST | Start OCR processing | No |
| | `/api/v1/batch/{batch_id}/status` | GET | Poll batch status | Yes (2s) |
| **ResultsPage** | `/api/v1/batch/{batch_id}/status` | GET | Load completed files | No |
| | `/api/v1/ocr/result/{file_id}` | GET | Get OCR result details | No |
| | `/api/v1/export/pdf/{file_id}` | GET | Download PDF export | No |
| **ExportPage** | `/api/v1/export` | POST | Export batch results | No |
| | `/api/v1/export/rules` | GET | List export rules | No |
| | `/api/v1/export/rules` | POST | Create new rule | No |
| | `/api/v1/export/rules/{rule_id}` | PUT | Update existing rule | No |
| | `/api/v1/export/rules/{rule_id}` | DELETE | Delete rule | No |
| | `/api/v1/export/css-templates` | GET | List CSS templates | No |
| **SettingsPage** | `/api/v1/export/rules` | GET | Manage export rules | No |
---
## Component Tree Structure
```
App
├── Router (React Router)
│ ├── PublicRoute
│ │ └── LoginPage
│ │ ├── Form (username + password)
│ │ ├── Button (submit)
│ │ └── Alert (error display)
│ └── ProtectedRoute (requires authentication)
│ └── Layout
│ ├── Sidebar
│ │ ├── Logo
│ │ ├── Navigation Links
│ │ │ ├── UploadPage link
│ │ │ ├── ProcessingPage link
│ │ │ ├── ResultsPage link
│ │ │ ├── ExportPage link
│ │ │ └── SettingsPage link
│ │ └── User Section + Logout
│ ├── TopBar
│ │ ├── SearchInput
│ │ └── NotificationBell
│ └── MainContent (Outlet)
│ ├── UploadPage
│ │ ├── FileUpload (react-dropzone)
│ │ ├── FileList (selected files)
│ │ └── UploadButton
│ ├── ProcessingPage
│ │ ├── ProgressBar
│ │ ├── StatsCards (completed/processing/failed)
│ │ ├── FileStatusList
│ │ └── ActionButtons
│ ├── ResultsPage
│ │ ├── FileList (left sidebar)
│ │ │ ├── SearchInput
│ │ │ └── FileItems
│ │ └── PreviewPanel (right)
│ │ ├── StatsCards
│ │ ├── Tabs (Markdown/JSON)
│ │ ├── MarkdownPreview
│ │ └── JSONViewer
│ ├── ExportPage
│ │ ├── FormatSelector
│ │ ├── RuleSelector
│ │ ├── CSSTemplateSelector
│ │ ├── OptionsForm
│ │ └── ExportButton
│ └── SettingsPage
│ ├── UserInfo
│ ├── ExportRulesManager
│ │ ├── RuleList
│ │ ├── CreateRuleDialog
│ │ ├── EditRuleDialog
│ │ └── DeleteConfirmDialog
│ └── SystemSettings
```
---
## State Management Strategy
### Client State (Zustand)
**authStore.ts** - Authentication State
```typescript
interface AuthState {
user: User | null
isAuthenticated: boolean
setUser: (user: User | null) => void
logout: () => void
}
```
**uploadStore.ts** - Upload Batch State
```typescript
interface UploadState {
batchId: number | null
files: FileInfo[]
uploadProgress: number
setBatchId: (id: number) => void
setFiles: (files: FileInfo[]) => void
setUploadProgress: (progress: number) => void
reset: () => void
}
```
### Server State (React Query)
- **Caching**: Automatic caching with stale-while-revalidate strategy
- **Polling**: Automatic refetch for batch status every 2 seconds during processing
- **Error Handling**: Built-in error retry and error state management
- **Optimistic Updates**: For export rules CRUD operations
### Query Keys
```typescript
// Batch status polling
['batchStatus', batchId]
// OCR result for specific file
['ocrResult', fileId]
// Export rules list
['exportRules']
// CSS templates list
['cssTemplates']
```
---
## Route Configuration
| Route | Component | Access Level | Description | Protected |
|-------|-----------|--------------|-------------|-----------|
| `/login` | LoginPage | Public | User authentication | No |
| `/` | Layout (redirect to /upload) | Private | Main layout wrapper | Yes |
| `/upload` | UploadPage | Private | File upload interface | Yes |
| `/processing` | ProcessingPage | Private | OCR processing status | Yes |
| `/results` | ResultsPage | Private | View OCR results | Yes |
| `/export` | ExportPage | Private | Export configuration | Yes |
| `/settings` | SettingsPage | Private | User settings | Yes |
### Protected Route Implementation
```typescript
function ProtectedRoute({ children }: { children: React.ReactNode }) {
const isAuthenticated = useAuthStore((state) => state.isAuthenticated)
if (!isAuthenticated) {
return <Navigate to="/login" replace />
}
return <>{children}</>
}
```
---
## API Integration Patterns
### API Client Configuration
**Base URL**: `http://localhost:12010/api/v1`
**Request Interceptor**: Adds JWT token to Authorization header
```typescript
this.client.interceptors.request.use((config) => {
if (this.token) {
config.headers.Authorization = `Bearer ${this.token}`
}
return config
})
```
**Response Interceptor**: Handles 401 errors and redirects to login
```typescript
this.client.interceptors.response.use(
(response) => response,
(error: AxiosError<ApiError>) => {
if (error.response?.status === 401) {
this.clearToken()
window.location.href = '/login'
}
return Promise.reject(error)
}
)
```
### Authentication Flow
```typescript
// 1. Login
const response = await apiClient.login({ username, password })
// Response: { access_token, token_type, expires_in }
// 2. Store token
localStorage.setItem('auth_token', response.access_token)
// 3. Set user in store
setUser({ id: 1, username })
// 4. Navigate to /upload
navigate('/upload')
```
### File Upload Flow
```typescript
// 1. Prepare FormData
const formData = new FormData()
files.forEach((file) => formData.append('files', file))
// 2. Upload files
const response = await apiClient.uploadFiles(files)
// Response: { batch_id, files: FileInfo[] }
// 3. Store batch info
setBatchId(response.batch_id)
setFiles(response.files)
// 4. Navigate to /processing
navigate('/processing')
```
### OCR Processing Flow
```typescript
// 1. Start OCR processing
await apiClient.processOCR({ batch_id, lang: 'ch', detect_layout: true })
// Response: { message, batch_id, total_files, status }
// 2. Poll batch status every 2 seconds
const { data: batchStatus } = useQuery({
queryKey: ['batchStatus', batchId],
queryFn: () => apiClient.getBatchStatus(batchId),
refetchInterval: (query) => {
const status = query.state.data?.batch.status
if (status === 'completed' || status === 'failed') return false
return 2000 // Poll every 2 seconds
},
})
// 3. Auto-redirect when completed
useEffect(() => {
if (batchStatus?.batch.status === 'completed') {
navigate('/results')
}
}, [batchStatus?.batch.status])
```
### Results Viewing Flow
```typescript
// 1. Load batch status
const { data: batchStatus } = useQuery({
queryKey: ['batchStatus', batchId],
queryFn: () => apiClient.getBatchStatus(batchId),
})
// 2. Select a file
setSelectedFileId(fileId)
// 3. Load OCR result for selected file
const { data: ocrResult } = useQuery({
queryKey: ['ocrResult', selectedFileId],
queryFn: () => apiClient.getOCRResult(selectedFileId),
enabled: !!selectedFileId,
})
// 4. Display in Markdown or JSON format
<Tabs>
<TabsContent value="markdown">
<ReactMarkdown>{ocrResult.markdown_content}</ReactMarkdown>
</TabsContent>
<TabsContent value="json">
<pre>{JSON.stringify(ocrResult.json_data, null, 2)}</pre>
</TabsContent>
</Tabs>
```
### Export Flow
```typescript
// 1. Select export format and options
const exportData = {
batch_id: batchId,
format: 'pdf',
rule_id: selectedRuleId,
css_template: 'academic',
options: { include_metadata: true }
}
// 2. Request export
const blob = await apiClient.exportResults(exportData)
// 3. Trigger download
downloadBlob(blob, `ocr-results-${batchId}.pdf`)
```
---
## UI/UX Design System
### Color Palette (CSS Variables)
```css
/* Primary - Professional Blue */
--primary: 217 91% 60%; /* #3b82f6 */
--primary-foreground: 0 0% 100%;
/* Secondary - Gray-Blue */
--secondary: 220 15% 95%;
--secondary-foreground: 220 15% 25%;
/* Accent - Vibrant Teal */
--accent: 173 80% 50%;
--accent-foreground: 0 0% 100%;
/* Success */
--success: 142 72% 45%; /* #16a34a */
--success-foreground: 0 0% 100%;
/* Destructive */
--destructive: 0 85% 60%; /* #ef4444 */
--destructive-foreground: 0 0% 100%;
/* Warning */
--warning: 38 92% 50%;
--warning-foreground: 0 0% 100%;
/* Background */
--background: 220 15% 97%; /* #fafafa */
--card: 0 0% 100%; /* #ffffff */
--sidebar: 220 25% 12%; /* Dark blue-gray */
/* Borders */
--border: 220 13% 88%;
--radius: 0.5rem;
```
### Typography
- **Font Family**: System font stack (native)
- **Page Title**: 1.875rem (30px), font-weight: 700
- **Section Title**: 1.125rem (18px), font-weight: 600
- **Body Text**: 0.875rem (14px), font-weight: 400
- **Small Text**: 0.75rem (12px)
### Spacing Scale
```css
--spacing-xs: 0.25rem; /* 4px */
--spacing-sm: 0.5rem; /* 8px */
--spacing-md: 1rem; /* 16px */
--spacing-lg: 1.5rem; /* 24px */
--spacing-xl: 2rem; /* 32px */
```
### Component Variants
**Button Variants**:
- `default`: Primary blue background
- `outline`: Border only
- `secondary`: Muted background
- `destructive`: Red for delete actions
- `ghost`: No background, hover effect
**Alert Variants**:
- `default`: Neutral gray
- `info`: Blue
- `success`: Green
- `warning`: Yellow
- `destructive`: Red
**Badge Variants**:
- `default`: Gray
- `success`: Green
- `warning`: Yellow
- `destructive`: Red
- `secondary`: Muted
### Responsive Breakpoints
```typescript
// Tailwind breakpoints
sm: '640px', // Mobile landscape
md: '768px', // Tablet
lg: '1024px', // Desktop (primary support)
xl: '1280px', // Large desktop
'2xl': '1536px' // Extra large (quoted: not a valid bare identifier)
```
**Primary Support**: Desktop (>= 1024px)
**Secondary Support**: Tablet (768px - 1023px)
**Optional**: Mobile (< 768px)
---
## Error Handling Patterns
### Global Error Boundary
```typescript
class ErrorBoundary extends Component<Props, State> {
static getDerivedStateFromError(error: Error): State {
return { hasError: true, error }
}
componentDidCatch(error: Error, errorInfo: ErrorInfo) {
console.error('Uncaught error:', error, errorInfo)
}
render() {
if (this.state.hasError) {
return <ErrorFallbackUI error={this.state.error} />
}
return this.props.children
}
}
```
### API Error Handling
```typescript
try {
await apiClient.uploadFiles(files)
} catch (err: any) {
const errorDetail = err.response?.data?.detail
toast({
title: t('upload.uploadError'),
description: Array.isArray(errorDetail)
? errorDetail.map(e => e.msg || e.message).join(', ')
: errorDetail || t('errors.networkError'),
variant: 'destructive',
})
}
```
### Form Validation
```typescript
// Client-side validation
if (selectedFiles.length === 0) {
toast({
title: t('errors.validationError'),
description: '請選擇至少一個檔案',
variant: 'destructive',
})
return
}
// Backend validation errors
if (err.response?.status === 422) {
const errors = err.response.data.detail
// Display validation errors to user
}
```
### Loading States
```typescript
// Query loading state
const { data, isLoading, error } = useQuery({
queryKey: ['batchStatus', batchId],
queryFn: () => apiClient.getBatchStatus(batchId),
})
if (isLoading) return <LoadingSpinner />
if (error) return <ErrorAlert error={error} />
if (!data) return <EmptyState />
// Mutation loading state
const mutation = useMutation({
mutationFn: apiClient.uploadFiles,
onSuccess: () => { /* success */ },
onError: () => { /* error */ },
})
<Button disabled={mutation.isPending}>
{mutation.isPending ? <Loader2 className="animate-spin" /> : '上傳'}
</Button>
```
---
## Deployment Configuration
### Environment Variables
```bash
# .env.production
VITE_API_BASE_URL=http://localhost:12010
VITE_APP_NAME=Tool_OCR
VITE_APP_VERSION=0.1.0
```
### Build Configuration
**vite.config.ts**:
```typescript
export default defineConfig({
plugins: [react()],
server: {
port: 12011,
proxy: {
'/api': {
target: 'http://localhost:12010',
changeOrigin: true,
},
},
},
build: {
outDir: 'dist',
sourcemap: false,
rollupOptions: {
output: {
manualChunks: {
vendor: ['react', 'react-dom', 'react-router-dom'],
ui: ['@tanstack/react-query', 'zustand', 'lucide-react'],
},
},
},
},
})
```
### Build Commands
```bash
# Development
npm run dev
# Production build
npm run build
# Preview production build
npm run preview
```
### Nginx Configuration
```nginx
server {
listen 80;
server_name tool-ocr.example.com;
root /path/to/Tool_OCR/frontend/dist;
# Frontend static files
location / {
try_files $uri $uri/ /index.html;
}
# API reverse proxy
location /api {
proxy_pass http://127.0.0.1:12010;
proxy_set_header Host $host;
proxy_set_header X-Real-IP $remote_addr;
proxy_set_header X-Forwarded-For $proxy_add_x_forwarded_for;
}
# Static assets caching
location ~* \.(js|css|png|jpg|jpeg|gif|ico|svg|woff|woff2)$ {
expires 1y;
add_header Cache-Control "public, immutable";
}
}
```
---
## Performance Optimization
### Code Splitting
- **Vendor Bundle**: React, React Router, React Query (separate chunk)
- **UI Bundle**: Zustand, Lucide React, UI components
- **Route-based Splitting**: Lazy load pages with `React.lazy()`
### Caching Strategy
- **React Query Cache**: 5 minutes stale time for most queries
- **Polling Interval**: 2 seconds during OCR processing
- **Infinite Cache**: Export rules (rarely change)
### Asset Optimization
- **Images**: Convert to WebP format, use appropriate sizes
- **Fonts**: System font stack (no custom fonts)
- **Icons**: Lucide React (tree-shakeable)
---
## Testing Strategy
### Component Testing (Planned)
```typescript
// Example: UploadPage.test.tsx
import { render, screen, fireEvent } from '@testing-library/react'
import { UploadPage } from '@/pages/UploadPage'
describe('UploadPage', () => {
it('should display file upload area', () => {
render(<UploadPage />)
expect(screen.getByText(/拖放檔案/i)).toBeInTheDocument()
})
it('should allow file selection', async () => {
render(<UploadPage />)
const file = new File(['content'], 'test.pdf', { type: 'application/pdf' })
// Test file upload
})
})
```
### API Integration Testing
- **Mock API Responses**: Use MSW (Mock Service Worker)
- **Error Scenarios**: Test 401, 404, 500 responses
- **Loading States**: Test skeleton/spinner display
---
## Accessibility Standards
### WCAG 2.1 AA Compliance
- **Keyboard Navigation**: All interactive elements accessible via keyboard
- **Focus Indicators**: Visible focus states on all inputs and buttons
- **ARIA Labels**: Proper labels for screen readers
- **Color Contrast**: Minimum 4.5:1 ratio for text
- **Alt Text**: All images have descriptive alt attributes
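The 4.5:1 requirement comes from the WCAG 2.1 relative-luminance formula; below is a minimal sketch of the computation, assuming 8-bit sRGB color tuples:

```python
def _linear(channel: int) -> float:
    """Convert an 8-bit sRGB channel to linear light (WCAG 2.1 definition)."""
    s = channel / 255
    return s / 12.92 if s <= 0.03928 else ((s + 0.055) / 1.055) ** 2.4

def contrast_ratio(fg: tuple, bg: tuple) -> float:
    """Return the WCAG contrast ratio between two colors (1.0 to 21.0)."""
    def luminance(rgb):
        r, g, b = (_linear(c) for c in rgb)
        return 0.2126 * r + 0.7152 * g + 0.0722 * b
    lighter, darker = sorted((luminance(fg), luminance(bg)), reverse=True)
    return (lighter + 0.05) / (darker + 0.05)

# Black on white reaches the maximum possible ratio of 21:1
ratio = contrast_ratio((0, 0, 0), (255, 255, 255))
```

Text passes AA when `contrast_ratio(...) >= 4.5` (3.0 for large text).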
### Semantic HTML
```typescript
// Use semantic elements
<nav> // Navigation
<main> // Main content
<aside> // Sidebar
<article> // Independent content
<section> // Grouped content
```
---
## Browser Compatibility
### Minimum Supported Versions
- **Chrome**: 90+
- **Firefox**: 88+
- **Edge**: 90+
- **Safari**: 14+
### Polyfills Required
- None (modern build target: ES2020)
---
## Development Workflow
### Local Development
```bash
# 1. Install dependencies
npm install
# 2. Start dev server
npm run dev
# Frontend: http://localhost:12011
# API Proxy: http://localhost:12011/api -> http://localhost:12010/api
# 3. Build for production
npm run build
# 4. Preview production build
npm run preview
```
### Code Style
- **Formatter**: Prettier (automatic on save)
- **Linter**: ESLint
- **Type Checking**: TypeScript strict mode
---
## Known Issues & Limitations
### Current Limitations
1. **No Real-time WebSocket**: Uses HTTP polling for progress updates
2. **No Offline Support**: Requires active internet connection
3. **No Mobile Optimization**: Primarily designed for desktop/tablet
4. **Translation Feature Stub**: Planned for Phase 5
5. **File Size Limit**: Frontend validates 50MB per file, backend may differ
### Future Improvements
- [ ] Implement WebSocket for real-time updates
- [ ] Add dark mode toggle
- [ ] Mobile responsive design
- [ ] Implement translation feature
- [ ] Add E2E tests with Playwright
- [ ] PWA support for offline capability
---
## Maintenance & Updates
### Update Checklist
When updating API contracts:
1. Update TypeScript types in `@/types/api.ts`
2. Update API client methods in `@/services/api.ts`
3. Update this documentation (FRONTEND_API.md)
4. Update corresponding page components
5. Test integration thoroughly
### Dependency Updates
```bash
# Check for updates
npm outdated
# Update dependencies
npm update
# Update to latest (breaking changes possible)
npm install <package>@latest
```
---
## Contact & Support
**Frontend Developer**: Claude Code
**Documentation Version**: 0.1.0
**Last Updated**: 2025-01-13
For API questions, refer to:
- `API_REFERENCE.md` - Complete API documentation
- `backend_api.md` - Backend implementation details
- FastAPI Swagger UI: `http://localhost:12010/docs`
---
**End of Documentation**

View File

@@ -1,258 +0,0 @@
# Tool_OCR Testing Guide
## Test Architecture
This project includes a complete test suite covering both unit and integration tests.
---
## Backend Tests
### Install Test Dependencies
```bash
cd backend
pip install pytest pytest-cov httpx
```
### Run All Tests
```bash
# Run all tests
pytest
# Run with verbose output
pytest -v
# Generate a coverage report
pytest --cov=app --cov-report=html
```
### Run Specific Tests
```bash
# Unit tests only
pytest tests/test_auth.py
pytest tests/test_tasks.py
pytest tests/test_admin.py
# Integration tests only
pytest tests/test_integration.py
# Run a specific test class
pytest tests/test_tasks.py::TestTasks
# Run a specific test method
pytest tests/test_tasks.py::TestTasks::test_create_task
```
### Test Coverage
**Unit tests** (`tests/test_*.py`):
- `test_auth.py` - authentication endpoints
  - login success / failure
  - token validation
  - logout
- `test_tasks.py` - task management
  - task CRUD operations
  - per-user isolation
  - statistics
- `test_admin.py` - admin features
  - system statistics
  - user listing
  - audit logs
**Integration tests** (`tests/test_integration.py`):
- full authentication and task flow
- admin workflow
- task lifecycle
---
## Test Database
Tests run against an in-memory SQLite database that is cleaned up after every test:
- never touches the development or production database
- fast execution
- complete isolation
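The isolation property can be sketched with only the stdlib `sqlite3` module (the real suite wires the same idea through SQLAlchemy with `StaticPool`):

```python
import sqlite3

def make_test_db() -> sqlite3.Connection:
    """Each call returns a brand-new in-memory database."""
    conn = sqlite3.connect(":memory:")
    conn.execute("CREATE TABLE tasks (id INTEGER PRIMARY KEY, name TEXT)")
    return conn

# Simulate two tests: writes in the first never leak into the second.
db1 = make_test_db()
db1.execute("INSERT INTO tasks (name) VALUES ('demo')")

db2 = make_test_db()
leftover = db2.execute("SELECT COUNT(*) FROM tasks").fetchone()[0]
# leftover is 0: db2 starts empty regardless of what db1 did
```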
---
## Fixtures
Defined in `conftest.py`:
- `db` - test database session
- `client` - FastAPI test client
- `test_user` - regular test user
- `admin_user` - admin test user
- `auth_token` - auth token for the test user
- `admin_token` - auth token for the admin user
- `test_task` - a test task
---
## Examples
### Writing a New Unit Test
```python
# tests/test_my_feature.py
import pytest
class TestMyFeature:
"""Test my new feature"""
def test_feature_works(self, client, auth_token):
"""Test that feature works correctly"""
response = client.get(
'/api/v2/my-endpoint',
headers={'Authorization': f'Bearer {auth_token}'}
)
assert response.status_code == 200
data = response.json()
assert 'expected_field' in data
```
### Writing a New Integration Test
```python
# tests/test_integration.py
class TestIntegration:
def test_complete_workflow(self, client, db):
"""Test complete user workflow"""
# Step 1: Login
# Step 2: Perform actions
# Step 3: Verify results
pass
```
---
## CI/CD Integration
### GitHub Actions Example
```yaml
name: Tests
on: [push, pull_request]
jobs:
test:
runs-on: ubuntu-latest
steps:
- uses: actions/checkout@v2
- name: Set up Python
uses: actions/setup-python@v2
with:
python-version: 3.11
- name: Install dependencies
run: |
cd backend
pip install -r requirements.txt
pip install pytest pytest-cov
- name: Run tests
run: |
cd backend
pytest --cov=app --cov-report=xml
- name: Upload coverage
uses: codecov/codecov-action@v2
```
---
## Frontend Tests (Planned)
### Recommended Frameworks
- **Unit tests**: Vitest
- **Component tests**: React Testing Library
- **E2E tests**: Playwright
### Example Setup
```bash
# Install test dependencies
npm install --save-dev vitest @testing-library/react @testing-library/jest-dom
# Run tests
npm test
# Run E2E tests
npm run test:e2e
```
---
## Testing Best Practices
### 1. Naming Conventions
- Use descriptive names: `test_user_can_create_task`
- Follow the AAA pattern: Arrange, Act, Assert
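A self-contained illustration of the AAA structure (the `create_task` helper is hypothetical, standing in for the real task API; plain asserts so the sketch runs without pytest):

```python
def create_task(tasks: list, name: str) -> dict:
    """Hypothetical helper standing in for the real task-creation endpoint."""
    task = {"id": len(tasks) + 1, "name": name, "status": "pending"}
    tasks.append(task)
    return task

def test_user_can_create_task():
    # Arrange: start from a known, empty state
    tasks = []
    # Act: exercise exactly one behavior
    created = create_task(tasks, "scan invoice")
    # Assert: verify the observable outcome
    assert created["status"] == "pending"
    assert len(tasks) == 1

test_user_can_create_task()
```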
### 2. Test Isolation
- Every test runs independently
- Use fixtures to provide test data
- Never rely on state left behind by other tests
### 3. Mock External Services
- Mock external API calls
- Mock file system operations
- Mock third-party services
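A minimal `unittest.mock` sketch (the `fetch_exchange_rate` function is hypothetical, standing in for any network call):

```python
from unittest.mock import patch

def fetch_exchange_rate(currency: str) -> float:
    """Pretend external API call; tests must never hit the network."""
    raise RuntimeError("network access is not allowed in tests")

def price_in_twd(usd: float) -> float:
    return usd * fetch_exchange_rate("TWD")

# Patch the dependency so the test stays fast and deterministic.
with patch(f"{__name__}.fetch_exchange_rate", return_value=31.5):
    converted = price_in_twd(2.0)
# converted == 63.0 and no real request was made
```

Patching at the module attribute (`__name__` here) keeps the production code untouched while the test runs.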
### 4. Coverage Targets
- Core business logic: >90%
- API endpoints: >80%
- Utility functions: >70%
---
## Troubleshooting
### Common Issues
**Issue**: `ImportError: cannot import name 'XXX'`
**Fix**: Make sure PYTHONPATH is set correctly
```bash
export PYTHONPATH=$PYTHONPATH:$(pwd)
```
**Issue**: Database connection errors
**Fix**: Tests use an in-memory database; no real database connection is required
**Issue**: Token validation failures
**Fix**: Check the JWT secret settings and use the provided test fixtures
---
## Test Reports
Reports produced after a test run:
1. **Terminal output**: summary of test results
2. **HTML report**: `htmlcov/index.html` (requires --cov-report=html)
3. **Coverage report**: highlights untested lines of code
---
## Continuous Improvement
- Run the test suite regularly
- Every new feature must include tests
- Keep test coverage above 80%
- Add a regression test with every bug fix
---
**Last Updated**: 2025-11-16
**Maintainer**: Development Team

View File

@@ -1,62 +0,0 @@
"""
Test script to verify ReportLab and Chinese font rendering
"""
from reportlab.pdfgen import canvas
from reportlab.pdfbase import pdfmetrics
from reportlab.pdfbase.ttfonts import TTFont
from pathlib import Path
import sys
def test_chinese_rendering():
"""Test if Chinese characters can be rendered in PDF"""
# Font path
font_path = "/home/egg/project/Tool_OCR/backend/fonts/NotoSansSC-Regular.ttf"
# Check if font file exists
if not Path(font_path).exists():
print(f"❌ Font file not found: {font_path}")
return False
print(f"✓ Font file found: {font_path}")
try:
# Register Chinese font
pdfmetrics.registerFont(TTFont('NotoSansSC', font_path))
print("✓ Font registered successfully")
# Create test PDF
test_pdf = "/tmp/test_chinese.pdf"
c = canvas.Canvas(test_pdf)
# Set Chinese font
c.setFont('NotoSansSC', 14)
# Draw test text
c.drawString(100, 750, "測試中文字符渲染 - Test Chinese Character Rendering")
c.drawString(100, 730, "HTD-S1 技術數據表")
c.drawString(100, 710, "這是一個 PDF 生成測試")
c.save()
print(f"✓ Test PDF created: {test_pdf}")
# Check file size
file_size = Path(test_pdf).stat().st_size
print(f"✓ PDF file size: {file_size} bytes")
if file_size > 0:
print("\n✅ Chinese font rendering test PASSED")
return True
else:
print("\n❌ PDF file is empty")
return False
except Exception as e:
print(f"❌ Error during testing: {e}")
import traceback
traceback.print_exc()
return False
if __name__ == "__main__":
success = test_chinese_rendering()
sys.exit(0 if success else 1)

View File

@@ -1,286 +0,0 @@
#!/usr/bin/env python3
"""
Tool_OCR - Service Layer Integration Test
Tests core services before API implementation
"""
import sys
import logging
from pathlib import Path
from datetime import datetime
# Add backend to path
sys.path.insert(0, str(Path(__file__).parent))
from app.core.config import settings
from app.core.database import engine, SessionLocal, Base
from app.models.user import User
from app.models.ocr import OCRBatch, OCRFile, OCRResult, FileStatus, BatchStatus
from app.services.preprocessor import DocumentPreprocessor
from app.services.ocr_service import OCRService
from app.services.pdf_generator import PDFGenerator
from app.services.file_manager import FileManager
# Configure logging
logging.basicConfig(
level=logging.INFO,
format="%(asctime)s - %(name)s - %(levelname)s - %(message)s"
)
logger = logging.getLogger(__name__)
class ServiceTester:
"""Service layer integration tester"""
def __init__(self):
"""Initialize tester"""
self.db = SessionLocal()
self.preprocessor = DocumentPreprocessor()
self.ocr_service = OCRService()
self.pdf_generator = PDFGenerator()
self.file_manager = FileManager()
self.test_results = {
"database": False,
"preprocessor": False,
"ocr_engine": False,
"pdf_generator": False,
"file_manager": False,
}
def cleanup(self):
"""Cleanup resources"""
self.db.close()
def test_database_connection(self) -> bool:
"""Test 1: Database connection and models"""
try:
logger.info("=" * 80)
logger.info("TEST 1: Database Connection")
logger.info("=" * 80)
# Test connection
from sqlalchemy import text
self.db.execute(text("SELECT 1"))
logger.info("✓ Database connection successful")
# Check if tables exist
from sqlalchemy import inspect
inspector = inspect(engine)
tables = inspector.get_table_names()
required_tables = [
'paddle_ocr_users',
'paddle_ocr_batches',
'paddle_ocr_files',
'paddle_ocr_results',
'paddle_ocr_export_rules',
'paddle_ocr_translation_configs'
]
missing_tables = [t for t in required_tables if t not in tables]
if missing_tables:
logger.error(f"✗ Missing tables: {missing_tables}")
return False
logger.info(f"✓ All required tables exist: {', '.join(required_tables)}")
# Test creating a test user (will rollback)
test_user = User(
username=f"test_user_{datetime.now().timestamp()}",
email=f"test_{datetime.now().timestamp()}@example.com",
password_hash="test_hash_123",
is_active=True,
is_admin=False
)
self.db.add(test_user)
self.db.flush()
logger.info(f"✓ Test user created with ID: {test_user.id}")
self.db.rollback() # Don't actually save test user
logger.info("✓ Database test completed successfully\n")
self.test_results["database"] = True
return True
except Exception as e:
logger.error(f"✗ Database test failed: {e}\n")
return False
def test_preprocessor(self) -> bool:
"""Test 2: Document preprocessor"""
try:
logger.info("=" * 80)
logger.info("TEST 2: Document Preprocessor")
logger.info("=" * 80)
# Check supported formats
formats = ['.png', '.jpg', '.jpeg', '.pdf']
logger.info(f"✓ Supported formats: {formats}")
# Check max file size
max_size_mb = settings.max_upload_size / (1024 * 1024)
logger.info(f"✓ Max upload size: {max_size_mb} MB")
logger.info("✓ Preprocessor initialized successfully\n")
self.test_results["preprocessor"] = True
return True
except Exception as e:
logger.error(f"✗ Preprocessor test failed: {e}\n")
return False
def test_ocr_engine(self) -> bool:
"""Test 3: OCR engine initialization"""
try:
logger.info("=" * 80)
logger.info("TEST 3: OCR Engine (PaddleOCR)")
logger.info("=" * 80)
# Test OCR engine lazy loading
logger.info("Initializing PaddleOCR engine (this may take a moment)...")
ocr_engine = self.ocr_service.get_ocr_engine(lang='ch')
logger.info("✓ PaddleOCR engine initialized for Chinese")
# Test structure engine
logger.info("Initializing PP-Structure engine...")
structure_engine = self.ocr_service.get_structure_engine()
logger.info("✓ PP-Structure engine initialized")
# Check confidence threshold
logger.info(f"✓ Confidence threshold: {self.ocr_service.confidence_threshold}")
logger.info("✓ OCR engine test completed successfully\n")
self.test_results["ocr_engine"] = True
return True
except Exception as e:
logger.error(f"✗ OCR engine test failed: {e}")
logger.error(" Make sure PaddleOCR models are downloaded:")
logger.error(" - PaddleOCR will auto-download on first use (~900MB)")
logger.error(" - Requires stable internet connection")
logger.error("")
return False
def test_pdf_generator(self) -> bool:
"""Test 4: PDF generator"""
try:
logger.info("=" * 80)
logger.info("TEST 4: PDF Generator")
logger.info("=" * 80)
# Check Pandoc availability
pandoc_available = self.pdf_generator.check_pandoc_available()
if pandoc_available:
logger.info("✓ Pandoc is installed and available")
else:
logger.warning("⚠ Pandoc not found - will use WeasyPrint fallback")
# Check available templates
templates = self.pdf_generator.get_available_templates()
logger.info(f"✓ Available CSS templates: {', '.join(templates.keys())}")
logger.info("✓ PDF generator test completed successfully\n")
self.test_results["pdf_generator"] = True
return True
except Exception as e:
logger.error(f"✗ PDF generator test failed: {e}\n")
return False
def test_file_manager(self) -> bool:
"""Test 5: File manager"""
try:
logger.info("=" * 80)
logger.info("TEST 5: File Manager")
logger.info("=" * 80)
# Check upload directory
upload_dir = Path(settings.upload_dir)
if upload_dir.exists():
logger.info(f"✓ Upload directory exists: {upload_dir}")
else:
upload_dir.mkdir(parents=True, exist_ok=True)
logger.info(f"✓ Created upload directory: {upload_dir}")
# Test batch directory creation
test_batch_id = 99999 # Use high number to avoid conflicts
batch_dir = self.file_manager.create_batch_directory(test_batch_id)
logger.info(f"✓ Created test batch directory: {batch_dir}")
# Check subdirectories
subdirs = ["inputs", "outputs/markdown", "outputs/json", "outputs/images", "exports"]
for subdir in subdirs:
subdir_path = batch_dir / subdir
if subdir_path.exists():
logger.info(f"{subdir}")
else:
logger.error(f" ✗ Missing: {subdir}")
return False
# Cleanup test directory
import shutil
            shutil.rmtree(batch_dir, ignore_errors=True)  # remove only the test batch, not its parent
logger.info("✓ Cleaned up test batch directory")
logger.info("✓ File manager test completed successfully\n")
self.test_results["file_manager"] = True
return True
except Exception as e:
logger.error(f"✗ File manager test failed: {e}\n")
return False
def run_all_tests(self):
"""Run all service tests"""
logger.info("\n" + "=" * 80)
logger.info("Tool_OCR Service Layer Integration Test")
logger.info("=" * 80 + "\n")
try:
# Run tests in order
self.test_database_connection()
self.test_preprocessor()
self.test_ocr_engine()
self.test_pdf_generator()
self.test_file_manager()
# Print summary
logger.info("=" * 80)
logger.info("TEST SUMMARY")
logger.info("=" * 80)
total_tests = len(self.test_results)
passed_tests = sum(1 for result in self.test_results.values() if result)
for test_name, result in self.test_results.items():
status = "✓ PASS" if result else "✗ FAIL"
logger.info(f"{status:8} - {test_name}")
logger.info("-" * 80)
logger.info(f"Total: {passed_tests}/{total_tests} tests passed")
if passed_tests == total_tests:
logger.info("\n🎉 All service layer tests passed! Ready to implement API endpoints.")
return 0
else:
logger.error(f"\n{total_tests - passed_tests} test(s) failed. Please fix issues before proceeding.")
return 1
finally:
self.cleanup()
def main():
"""Main test entry point"""
tester = ServiceTester()
exit_code = tester.run_all_tests()
sys.exit(exit_code)
if __name__ == "__main__":
main()

View File

@@ -1,3 +0,0 @@
"""
Tool_OCR - Unit Tests Package
"""

View File

@@ -1,138 +0,0 @@
"""
V2 API Test Configuration and Fixtures
Provides test fixtures for authentication, database, and API testing
"""
import pytest
from fastapi.testclient import TestClient
from sqlalchemy import create_engine
from sqlalchemy.orm import sessionmaker
from sqlalchemy.pool import StaticPool
# IMPORTANT: Monkey patch database module BEFORE importing app
# This prevents the app from connecting to production database
import app.core.database as db_module
# Create a test engine for the entire test session
_test_engine = create_engine(
"sqlite:///:memory:",
connect_args={"check_same_thread": False},
poolclass=StaticPool,
)
# Replace the global engine and SessionLocal
db_module.engine = _test_engine
db_module.SessionLocal = sessionmaker(autocommit=False, autoflush=False, bind=_test_engine)
# Now safely import app (it will use our test database)
from app.main import app
from app.core.database import Base, get_db
from app.core.security import create_access_token
from app.models.user import User
from app.models.task import Task
@pytest.fixture(scope="function")
def engine():
"""Get test database engine and reset tables for each test"""
Base.metadata.drop_all(bind=_test_engine)
Base.metadata.create_all(bind=_test_engine)
yield _test_engine
# Tables will be dropped at the start of next test
@pytest.fixture(scope="function")
def db(engine):
"""Create test database session"""
TestingSessionLocal = sessionmaker(autocommit=False, autoflush=False, bind=engine)
db = TestingSessionLocal()
try:
yield db
finally:
db.close()
@pytest.fixture(scope="function")
def client(db):
"""Create FastAPI test client with test database"""
# Override get_db to use the same session as the test
def override_get_db():
try:
yield db
finally:
# Don't close the session, it's managed by the db fixture
pass
app.dependency_overrides[get_db] = override_get_db
with TestClient(app) as test_client:
yield test_client
app.dependency_overrides.clear()
@pytest.fixture
def test_user(db):
"""Create a test user"""
# Ensure test_user is always created first by checking if it exists
user = db.query(User).filter(User.email == "test@example.com").first()
if not user:
user = User(
email="test@example.com",
display_name="Test User",
is_active=True
)
db.add(user)
db.commit()
db.refresh(user)
return user
@pytest.fixture
def admin_user(db):
"""Create an admin user"""
user = db.query(User).filter(User.email == "ymirliu@panjit.com.tw").first()
if not user:
user = User(
email="ymirliu@panjit.com.tw",
display_name="Admin User",
is_active=True
)
db.add(user)
db.commit()
db.refresh(user)
return user
@pytest.fixture
def auth_token(test_user):
"""Create authentication token for test user"""
token_data = {
"sub": str(test_user.id),
"email": test_user.email
}
return create_access_token(token_data)
@pytest.fixture
def admin_token(admin_user):
"""Create authentication token for admin user"""
token_data = {
"sub": str(admin_user.id),
"email": admin_user.email
}
return create_access_token(token_data)
@pytest.fixture
def test_task(test_user, db):
"""Create a test task (depends on test_user to ensure user exists first)"""
task = Task(
user_id=test_user.id,
task_id="test-task-123",
filename="test.pdf",
file_type="application/pdf",
status="pending"
)
db.add(task)
db.commit()
db.refresh(task)
return task

View File

@@ -1,179 +0,0 @@
"""
Tool_OCR - Pytest Fixtures and Configuration
Shared fixtures for all tests
"""
import pytest
import tempfile
import shutil
from pathlib import Path
from PIL import Image
import io
from app.services.preprocessor import DocumentPreprocessor
@pytest.fixture
def temp_dir():
"""Create a temporary directory for test files"""
temp_path = Path(tempfile.mkdtemp())
yield temp_path
# Cleanup after test
shutil.rmtree(temp_path, ignore_errors=True)
@pytest.fixture
def sample_image_path(temp_dir):
"""Create a valid PNG image file for testing"""
image_path = temp_dir / "test_image.png"
# Create a simple 100x100 white image
img = Image.new('RGB', (100, 100), color='white')
img.save(image_path, 'PNG')
return image_path
@pytest.fixture
def sample_jpg_path(temp_dir):
"""Create a valid JPG image file for testing"""
image_path = temp_dir / "test_image.jpg"
# Create a simple 100x100 white image
img = Image.new('RGB', (100, 100), color='white')
img.save(image_path, 'JPEG')
return image_path
@pytest.fixture
def sample_pdf_path(temp_dir):
"""Create a valid PDF file for testing"""
pdf_path = temp_dir / "test_document.pdf"
# Create minimal valid PDF
pdf_content = b"""%PDF-1.4
1 0 obj
<<
/Type /Catalog
/Pages 2 0 R
>>
endobj
2 0 obj
<<
/Type /Pages
/Kids [3 0 R]
/Count 1
>>
endobj
3 0 obj
<<
/Type /Page
/Parent 2 0 R
/MediaBox [0 0 612 792]
/Contents 4 0 R
/Resources <<
/Font <<
/F1 <<
/Type /Font
/Subtype /Type1
/BaseFont /Helvetica
>>
>>
>>
>>
endobj
4 0 obj
<<
/Length 44
>>
stream
BT
/F1 12 Tf
100 700 Td
(Test PDF) Tj
ET
endstream
endobj
xref
0 5
0000000000 65535 f
0000000009 00000 n
0000000058 00000 n
0000000115 00000 n
0000000317 00000 n
trailer
<<
/Size 5
/Root 1 0 R
>>
startxref
410
%%EOF
"""
with open(pdf_path, 'wb') as f:
f.write(pdf_content)
return pdf_path
@pytest.fixture
def corrupted_image_path(temp_dir):
"""Create a corrupted image file for testing"""
image_path = temp_dir / "corrupted.png"
# Write invalid PNG data
with open(image_path, 'wb') as f:
f.write(b'\x89PNG\r\n\x1a\n\x00\x00\x00corrupted data')
return image_path
@pytest.fixture
def large_file_path(temp_dir):
"""Create a valid PNG file larger than the upload limit"""
file_path = temp_dir / "large_file.png"
# Create a large PNG image with random data (to prevent compression)
# 15000x15000 with random pixels should be > 20MB
import numpy as np
random_data = np.random.randint(0, 256, (15000, 15000, 3), dtype=np.uint8)
img = Image.fromarray(random_data, 'RGB')
img.save(file_path, 'PNG', compress_level=0) # No compression
# Verify it's actually large
file_size = file_path.stat().st_size
assert file_size > 20 * 1024 * 1024, f"File only {file_size / (1024*1024):.2f} MB"
return file_path
@pytest.fixture
def unsupported_file_path(temp_dir):
"""Create a file with unsupported format"""
file_path = temp_dir / "test.txt"
with open(file_path, 'w') as f:
f.write("This is a text file, not an image")
return file_path
@pytest.fixture
def preprocessor():
"""Create a DocumentPreprocessor instance"""
return DocumentPreprocessor()
@pytest.fixture
def sample_image_with_text():
"""Return path to a real image with text from demo_docs for OCR testing"""
# Use the english.png sample from demo_docs
demo_image_path = Path(__file__).parent.parent.parent / "demo_docs" / "basic" / "english.png"
# Check if demo image exists, otherwise skip the test
if not demo_image_path.exists():
pytest.skip(f"Demo image not found at {demo_image_path}")
return demo_image_path

View File

@@ -1,60 +0,0 @@
"""
Unit tests for admin endpoints
"""
import pytest
class TestAdmin:
"""Test admin endpoints"""
def test_get_system_stats(self, client, admin_token):
"""Test get system statistics"""
response = client.get(
'/api/v2/admin/stats',
headers={'Authorization': f'Bearer {admin_token}'}
)
assert response.status_code == 200
data = response.json()
# API returns nested structure
assert 'users' in data
assert 'tasks' in data
assert 'sessions' in data
assert 'activity' in data
assert 'total' in data['users']
assert 'total' in data['tasks']
def test_get_system_stats_non_admin(self, client, auth_token):
"""Test that non-admin cannot access admin endpoints"""
response = client.get(
'/api/v2/admin/stats',
headers={'Authorization': f'Bearer {auth_token}'}
)
assert response.status_code == 403
def test_list_users(self, client, admin_token):
"""Test list all users"""
response = client.get(
'/api/v2/admin/users',
headers={'Authorization': f'Bearer {admin_token}'}
)
assert response.status_code == 200
data = response.json()
assert 'users' in data
assert 'total' in data
def test_get_audit_logs(self, client, admin_token):
"""Test get audit logs"""
response = client.get(
'/api/v2/admin/audit-logs',
headers={'Authorization': f'Bearer {admin_token}'}
)
assert response.status_code == 200
data = response.json()
assert 'logs' in data
assert 'total' in data
assert 'page' in data

View File

@@ -1,687 +0,0 @@
"""
Tool_OCR - API Integration Tests
Tests all API endpoints with database integration
"""
import pytest
import tempfile
import shutil
from pathlib import Path
from io import BytesIO
from datetime import datetime
from unittest.mock import patch, Mock
from fastapi.testclient import TestClient
from sqlalchemy import create_engine
from sqlalchemy.orm import sessionmaker
from PIL import Image
from app.main import app
from app.core.database import Base
from app.core.deps import get_db, get_current_active_user
from app.core.security import create_access_token, get_password_hash
from app.models.user import User
from app.models.ocr import OCRBatch, OCRFile, OCRResult, BatchStatus, FileStatus
from app.models.export import ExportRule
# ============================================================================
# Test Database Setup
# ============================================================================
@pytest.fixture(scope="function")
def test_db():
"""Create test database using SQLite in-memory"""
# Import all models to ensure they are registered with Base.metadata
# This triggers SQLAlchemy to register table definitions
from app.models import User, OCRBatch, OCRFile, OCRResult, ExportRule, TranslationConfig
# Create in-memory SQLite database
engine = create_engine("sqlite:///:memory:", connect_args={"check_same_thread": False})
TestingSessionLocal = sessionmaker(autocommit=False, autoflush=False, bind=engine)
# Create all tables
Base.metadata.create_all(bind=engine)
db = TestingSessionLocal()
try:
yield db
finally:
db.close()
Base.metadata.drop_all(bind=engine)
@pytest.fixture(scope="function")
def test_user(test_db):
"""Create test user in database"""
user = User(
username="testuser",
email="test@example.com",
password_hash=get_password_hash("password123"),
is_active=True,
is_admin=False
)
test_db.add(user)
test_db.commit()
test_db.refresh(user)
return user
@pytest.fixture(scope="function")
def inactive_user(test_db):
"""Create inactive test user"""
user = User(
username="inactive",
email="inactive@example.com",
password_hash=get_password_hash("password123"),
is_active=False,
is_admin=False
)
test_db.add(user)
test_db.commit()
test_db.refresh(user)
return user
@pytest.fixture(scope="function")
def auth_token(test_user):
"""Generate JWT token for test user"""
token = create_access_token(data={"sub": test_user.id, "username": test_user.username})
return token
@pytest.fixture(scope="function")
def auth_headers(auth_token):
"""Generate authorization headers"""
return {"Authorization": f"Bearer {auth_token}"}
# ============================================================================
# Test Client Setup
# ============================================================================
@pytest.fixture(scope="function")
def client(test_db, test_user):
"""Create FastAPI test client with overridden dependencies"""
def override_get_db():
try:
yield test_db
finally:
pass
def override_get_current_active_user():
return test_user
app.dependency_overrides[get_db] = override_get_db
app.dependency_overrides[get_current_active_user] = override_get_current_active_user
client = TestClient(app)
yield client
# Clean up overrides
app.dependency_overrides.clear()
# ============================================================================
# Test Data Fixtures
# ============================================================================
@pytest.fixture
def temp_upload_dir():
"""Create temporary upload directory"""
temp_dir = Path(tempfile.mkdtemp())
yield temp_dir
shutil.rmtree(temp_dir, ignore_errors=True)
@pytest.fixture
def sample_image_file():
"""Create sample image file for upload"""
img = Image.new('RGB', (100, 100), color='white')
img_bytes = BytesIO()
img.save(img_bytes, format='PNG')
img_bytes.seek(0)
return ("test.png", img_bytes, "image/png")
@pytest.fixture
def test_batch(test_db, test_user):
"""Create test batch in database"""
batch = OCRBatch(
user_id=test_user.id,
batch_name="Test Batch",
status=BatchStatus.PENDING,
total_files=0,
completed_files=0,
failed_files=0
)
test_db.add(batch)
test_db.commit()
test_db.refresh(batch)
return batch
@pytest.fixture
def test_ocr_file(test_db, test_batch):
"""Create test OCR file in database"""
ocr_file = OCRFile(
batch_id=test_batch.id,
filename="test.png",
original_filename="test.png",
file_path="/tmp/test.png",
file_size=1024,
file_format="png",
status=FileStatus.COMPLETED
)
test_db.add(ocr_file)
test_db.commit()
test_db.refresh(ocr_file)
return ocr_file
@pytest.fixture
def test_ocr_result(test_db, test_ocr_file, temp_upload_dir):
"""Create test OCR result in database"""
# Create test markdown file
markdown_path = temp_upload_dir / "result.md"
markdown_path.write_text("# Test Result\n\nTest content", encoding="utf-8")
result = OCRResult(
file_id=test_ocr_file.id,
markdown_path=str(markdown_path),
json_path=str(temp_upload_dir / "result.json"),
detected_language="ch",
total_text_regions=5,
average_confidence=0.95,
layout_data={"regions": []},
images_metadata=[]
)
test_db.add(result)
test_db.commit()
test_db.refresh(result)
return result
@pytest.fixture
def test_export_rule(test_db, test_user):
"""Create test export rule in database"""
rule = ExportRule(
user_id=test_user.id,
rule_name="Test Rule",
description="Test export rule",
config_json={
"filters": {"confidence_threshold": 0.8},
"formatting": {"add_line_numbers": True}
}
)
test_db.add(rule)
test_db.commit()
test_db.refresh(rule)
return rule
# ============================================================================
# Authentication Router Tests
# ============================================================================
@pytest.mark.integration
class TestAuthRouter:
"""Test authentication endpoints"""
def test_login_success(self, client, test_user):
"""Test successful login"""
response = client.post(
"/api/v1/auth/login",
json={
"username": "testuser",
"password": "password123"
}
)
assert response.status_code == 200
data = response.json()
assert "access_token" in data
assert data["token_type"] == "bearer"
assert "expires_in" in data
assert data["expires_in"] > 0
def test_login_invalid_username(self, client):
"""Test login with invalid username"""
response = client.post(
"/api/v1/auth/login",
json={
"username": "nonexistent",
"password": "password123"
}
)
assert response.status_code == 401
assert "Incorrect username or password" in response.json()["detail"]
def test_login_invalid_password(self, client, test_user):
"""Test login with invalid password"""
response = client.post(
"/api/v1/auth/login",
json={
"username": "testuser",
"password": "wrongpassword"
}
)
assert response.status_code == 401
assert "Incorrect username or password" in response.json()["detail"]
def test_login_inactive_user(self, client, inactive_user):
"""Test login with inactive user account"""
response = client.post(
"/api/v1/auth/login",
json={
"username": "inactive",
"password": "password123"
}
)
assert response.status_code == 403
assert "inactive" in response.json()["detail"].lower()
# ============================================================================
# OCR Router Tests
# ============================================================================
@pytest.mark.integration
class TestOCRRouter:
"""Test OCR processing endpoints"""
@patch('app.services.file_manager.FileManager.create_batch')
@patch('app.services.file_manager.FileManager.add_files_to_batch')
def test_upload_files_success(self, mock_add_files, mock_create_batch,
client, auth_headers, test_batch, sample_image_file):
"""Test successful file upload"""
# Mock the file manager methods
mock_create_batch.return_value = test_batch
mock_add_files.return_value = []
response = client.post(
"/api/v1/upload",
files={"files": sample_image_file},
data={"batch_name": "Test Upload"},
headers=auth_headers
)
assert response.status_code == 200
data = response.json()
assert "id" in data
assert data["batch_name"] == "Test Batch"
def test_upload_no_files(self, client, auth_headers):
"""Test upload with no files"""
response = client.post(
"/api/v1/upload",
headers=auth_headers
)
assert response.status_code == 422 # Validation error
def test_upload_unauthorized(self, client, sample_image_file):
"""Test upload without authentication"""
# Override to remove authentication
app.dependency_overrides.clear()
response = client.post(
"/api/v1/upload",
files={"files": sample_image_file}
)
assert response.status_code == 403 # Forbidden (no auth)
@patch('app.services.background_tasks.process_batch_files_with_retry')
def test_process_ocr_success(self, mock_process, client, auth_headers,
test_batch, test_db):
"""Test triggering OCR processing"""
response = client.post(
"/api/v1/ocr/process",
json={
"batch_id": test_batch.id,
"lang": "ch",
"detect_layout": True
},
headers=auth_headers
)
assert response.status_code == 200
data = response.json()
assert data["message"] == "OCR processing started"
assert data["batch_id"] == test_batch.id
assert data["status"] == "processing"
def test_process_ocr_batch_not_found(self, client, auth_headers):
"""Test OCR processing with non-existent batch"""
response = client.post(
"/api/v1/ocr/process",
json={
"batch_id": 99999,
"lang": "ch",
"detect_layout": True
},
headers=auth_headers
)
assert response.status_code == 404
assert "not found" in response.json()["detail"].lower()
def test_process_ocr_already_processing(self, client, auth_headers,
test_batch, test_db):
"""Test OCR processing when batch is already processing"""
# Update batch status
test_batch.status = BatchStatus.PROCESSING
test_db.commit()
response = client.post(
"/api/v1/ocr/process",
json={
"batch_id": test_batch.id,
"lang": "ch",
"detect_layout": True
},
headers=auth_headers
)
assert response.status_code == 400
assert "already" in response.json()["detail"].lower()
def test_get_batch_status_success(self, client, auth_headers, test_batch,
test_ocr_file):
"""Test getting batch status"""
response = client.get(
f"/api/v1/batch/{test_batch.id}/status",
headers=auth_headers
)
assert response.status_code == 200
data = response.json()
assert "batch" in data
assert "files" in data
assert data["batch"]["id"] == test_batch.id
assert len(data["files"]) >= 0
def test_get_batch_status_not_found(self, client, auth_headers):
"""Test getting status for non-existent batch"""
response = client.get(
"/api/v1/batch/99999/status",
headers=auth_headers
)
assert response.status_code == 404
def test_get_ocr_result_success(self, client, auth_headers, test_ocr_file,
test_ocr_result):
"""Test getting OCR result"""
response = client.get(
f"/api/v1/ocr/result/{test_ocr_file.id}",
headers=auth_headers
)
assert response.status_code == 200
data = response.json()
assert "file" in data
assert "result" in data
assert data["file"]["id"] == test_ocr_file.id
def test_get_ocr_result_not_found(self, client, auth_headers):
"""Test getting result for non-existent file"""
response = client.get(
"/api/v1/ocr/result/99999",
headers=auth_headers
)
assert response.status_code == 404
# ============================================================================
# Export Router Tests
# ============================================================================
@pytest.mark.integration
class TestExportRouter:
"""Test export endpoints"""
@pytest.mark.skip(reason="FileResponse validation requires actual file paths, tested in unit tests")
@patch('app.services.export_service.ExportService.export_to_txt')
def test_export_txt_success(self, mock_export, client, auth_headers,
test_batch, test_ocr_file, test_ocr_result,
temp_upload_dir):
"""Test exporting results to TXT format"""
# NOTE: This test is skipped because FastAPI's FileResponse validates
# the file path exists, making it difficult to mock properly.
# The export service functionality is thoroughly tested in unit tests.
# End-to-end tests would be more appropriate for testing the full flow.
pass
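The NOTE above says `FileResponse` rejects mocked paths because FastAPI checks that the file exists. A possible workaround (a sketch, assuming the mocked export method's return value is handed directly to `FileResponse`) is to have the mock write a real temporary file instead of returning a bare `Mock`:

```python
import tempfile
from pathlib import Path


def fake_export(*args, **kwargs):
    # Write a real file so FastAPI's FileResponse existence check passes.
    out = Path(tempfile.mkdtemp()) / "export.txt"
    out.write_text("exported content", encoding="utf-8")
    return out


# Hypothetical usage inside the skipped test, via unittest.mock.patch:
#
# with patch("app.services.export_service.ExportService.export_to_txt",
#            side_effect=fake_export):
#     response = client.post(
#         "/api/v1/export",
#         json={"batch_id": test_batch.id, "format": "txt"},
#         headers=auth_headers,
#     )
#     assert response.status_code == 200
```

This keeps the export logic mocked while still exercising the real response path; whether it fits here depends on how the router builds the `FileResponse`, which the skip note leaves open.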
def test_export_batch_not_found(self, client, auth_headers):
"""Test export with non-existent batch"""
response = client.post(
"/api/v1/export",
json={
"batch_id": 99999,
"format": "txt"
},
headers=auth_headers
)
assert response.status_code == 404
def test_export_no_results(self, client, auth_headers, test_batch):
"""Test export when no completed results exist"""
response = client.post(
"/api/v1/export",
json={
"batch_id": test_batch.id,
"format": "txt"
},
headers=auth_headers
)
assert response.status_code == 404
assert "no completed results" in response.json()["detail"].lower()
def test_export_unsupported_format(self, client, auth_headers, test_batch):
"""Test export with unsupported format"""
response = client.post(
"/api/v1/export",
json={
"batch_id": test_batch.id,
"format": "invalid_format"
},
headers=auth_headers
)
# Should fail at validation or business logic level
assert response.status_code in [400, 404]
@pytest.mark.skip(reason="FileResponse validation requires actual file paths, tested in unit tests")
@patch('app.services.export_service.ExportService.export_to_pdf')
def test_generate_pdf_success(self, mock_export, client, auth_headers,
test_ocr_file, test_ocr_result, temp_upload_dir):
"""Test generating PDF for single file"""
# NOTE: This test is skipped because FastAPI's FileResponse validates
# the file path exists, making it difficult to mock properly.
# The PDF generation functionality is thoroughly tested in unit tests.
pass
def test_generate_pdf_file_not_found(self, client, auth_headers):
"""Test PDF generation for non-existent file"""
response = client.get(
"/api/v1/export/pdf/99999",
headers=auth_headers
)
assert response.status_code == 404
def test_generate_pdf_no_result(self, client, auth_headers, test_ocr_file):
"""Test PDF generation when no OCR result exists"""
response = client.get(
f"/api/v1/export/pdf/{test_ocr_file.id}",
headers=auth_headers
)
assert response.status_code == 404
def test_list_export_rules(self, client, auth_headers, test_export_rule):
"""Test listing export rules"""
response = client.get(
"/api/v1/export/rules",
headers=auth_headers
)
assert response.status_code == 200
data = response.json()
assert isinstance(data, list)
assert len(data) >= 0
@pytest.mark.skip(reason="SQLite session isolation issue with in-memory DB, tested in unit tests")
def test_create_export_rule(self, client, auth_headers):
"""Test creating export rule"""
# NOTE: This test fails due to SQLite in-memory database session isolation
# The create operation works but db.refresh() fails to query the new record
# Export rule CRUD is thoroughly tested in unit tests
pass
@pytest.mark.skip(reason="SQLite session isolation issue with in-memory DB, tested in unit tests")
def test_update_export_rule(self, client, auth_headers, test_export_rule):
"""Test updating export rule"""
# NOTE: This test fails due to SQLite in-memory database session isolation
# The update operation works but db.refresh() fails to query the updated record
# Export rule CRUD is thoroughly tested in unit tests
pass
def test_update_export_rule_not_found(self, client, auth_headers):
"""Test updating non-existent export rule"""
response = client.put(
"/api/v1/export/rules/99999",
json={
"rule_name": "Updated Rule"
},
headers=auth_headers
)
assert response.status_code == 404
def test_delete_export_rule(self, client, auth_headers, test_export_rule):
"""Test deleting export rule"""
response = client.delete(
f"/api/v1/export/rules/{test_export_rule.id}",
headers=auth_headers
)
assert response.status_code == 200
assert "deleted successfully" in response.json()["message"].lower()
def test_delete_export_rule_not_found(self, client, auth_headers):
"""Test deleting non-existent export rule"""
response = client.delete(
"/api/v1/export/rules/99999",
headers=auth_headers
)
assert response.status_code == 404
def test_list_css_templates(self, client):
"""Test listing CSS templates (no auth required)"""
response = client.get("/api/v1/export/css-templates")
assert response.status_code == 200
data = response.json()
assert isinstance(data, list)
assert len(data) > 0
assert all("name" in item and "description" in item for item in data)
# ============================================================================
# Translation Router Tests (Stub Endpoints)
# ============================================================================
@pytest.mark.integration
class TestTranslationRouter:
"""Test translation stub endpoints"""
def test_get_translation_status(self, client):
"""Test getting translation feature status (stub)"""
response = client.get("/api/v1/translate/status")
assert response.status_code == 200
data = response.json()
assert "status" in data
assert data["status"].lower() == "reserved" # Case-insensitive check
def test_get_supported_languages(self, client):
"""Test getting supported languages (stub)"""
response = client.get("/api/v1/translate/languages")
assert response.status_code == 200
data = response.json()
assert isinstance(data, list)
def test_translate_document_not_implemented(self, client, auth_headers):
"""Test translate document endpoint returns 501"""
response = client.post(
"/api/v1/translate/document",
json={
"file_id": 1,
"source_lang": "zh",
"target_lang": "en",
"engine_type": "offline"
},
headers=auth_headers
)
assert response.status_code == 501
data = response.json()
assert "not implemented" in str(data["detail"]).lower()
def test_get_translation_task_status_not_implemented(self, client, auth_headers):
"""Test translation task status endpoint returns 501"""
response = client.get(
"/api/v1/translate/task/1",
headers=auth_headers
)
assert response.status_code == 501
def test_cancel_translation_task_not_implemented(self, client, auth_headers):
"""Test cancel translation task endpoint returns 501"""
response = client.delete(
"/api/v1/translate/task/1",
headers=auth_headers
)
assert response.status_code == 501
# ============================================================================
# Application Health Tests
# ============================================================================
@pytest.mark.integration
class TestApplicationHealth:
"""Test application health and root endpoints"""
def test_health_check(self, client):
"""Test health check endpoint"""
response = client.get("/health")
assert response.status_code == 200
data = response.json()
assert data["status"] == "healthy"
assert data["service"] == "Tool_OCR"
def test_root_endpoint(self, client):
"""Test root endpoint"""
response = client.get("/")
assert response.status_code == 200
data = response.json()
assert "message" in data
assert "Tool_OCR" in data["message"]
assert "docs_url" in data


@@ -1,87 +0,0 @@
"""
Unit tests for authentication endpoints
"""
import pytest
from unittest.mock import patch, MagicMock
class TestAuth:
"""Test authentication endpoints"""
def test_login_success(self, client, db):
"""Test successful login"""
# Mock external auth service with proper Pydantic models
from app.services.external_auth_service import AuthResponse, UserInfo
user_info = UserInfo(
id="test-id-123",
name="Test User",
email="test@example.com"
)
auth_response = AuthResponse(
access_token="test-token",
id_token="test-id-token",
expires_in=3600,
token_type="Bearer",
user_info=user_info,
issued_at="2025-11-16T10:00:00Z",
expires_at="2025-11-16T11:00:00Z"
)
with patch('app.routers.auth.external_auth_service.authenticate_user') as mock_auth:
mock_auth.return_value = (True, auth_response, None)
response = client.post('/api/v2/auth/login', json={
'username': 'test@example.com',
'password': 'password123'
})
assert response.status_code == 200
data = response.json()
assert 'access_token' in data
assert data['token_type'] == 'bearer'
assert 'user' in data
def test_login_invalid_credentials(self, client):
"""Test login with invalid credentials"""
with patch('app.routers.auth.external_auth_service.authenticate_user') as mock_auth:
mock_auth.return_value = (False, None, 'Invalid credentials')
response = client.post('/api/v2/auth/login', json={
'username': 'test@example.com',
'password': 'wrongpassword'
})
assert response.status_code == 401
assert 'detail' in response.json()
def test_get_me(self, client, auth_token):
"""Test get current user info"""
response = client.get(
'/api/v2/auth/me',
headers={'Authorization': f'Bearer {auth_token}'}
)
assert response.status_code == 200
data = response.json()
assert 'email' in data
assert 'display_name' in data
def test_get_me_unauthorized(self, client):
"""Test get current user without token"""
response = client.get('/api/v2/auth/me')
assert response.status_code == 403
def test_logout(self, client, auth_token):
"""Test logout"""
response = client.post(
'/api/v2/auth/logout',
headers={'Authorization': f'Bearer {auth_token}'}
)
assert response.status_code == 200
data = response.json()
# When no session_id is provided, logs out all sessions
assert 'message' in data
assert 'Logged out' in data['message']


@@ -1,637 +0,0 @@
"""
Tool_OCR - Export Service Unit Tests
Tests for app/services/export_service.py
"""
import pytest
import json
import zipfile
from pathlib import Path
from unittest.mock import Mock, patch, MagicMock
from datetime import datetime
import pandas as pd
from app.services.export_service import ExportService, ExportError
from app.models.ocr import FileStatus
@pytest.fixture
def export_service():
"""Create an ExportService instance"""
return ExportService()
@pytest.fixture
def mock_ocr_result(temp_dir):
"""Create a mock OCRResult with markdown file"""
# Create mock markdown file
md_file = temp_dir / "test_result.md"
md_file.write_text("# Test Document\n\nThis is test content.", encoding="utf-8")
# Create mock result
result = Mock()
result.id = 1
result.markdown_path = str(md_file)
result.json_path = None
result.detected_language = "zh"
result.total_text_regions = 10
result.average_confidence = 0.95
result.layout_data = {"elements": [{"type": "text"}]}
result.images_metadata = []
# Mock file
result.file = Mock()
result.file.id = 1
result.file.original_filename = "test.png"
result.file.file_format = "png"
result.file.file_size = 1024
result.file.processing_time = 2.5
return result
@pytest.fixture
def mock_db():
"""Create a mock database session"""
return Mock()
@pytest.mark.unit
class TestExportServiceInit:
"""Test ExportService initialization"""
def test_init(self, export_service):
"""Test export service initialization"""
assert export_service is not None
assert export_service.pdf_generator is not None
@pytest.mark.unit
class TestApplyFilters:
"""Test filter application"""
def test_apply_filters_confidence_threshold(self, export_service):
"""Test confidence threshold filter"""
result1 = Mock()
result1.average_confidence = 0.95
result1.file = Mock()
result1.file.original_filename = "test1.png"
result2 = Mock()
result2.average_confidence = 0.75
result2.file = Mock()
result2.file.original_filename = "test2.png"
result3 = Mock()
result3.average_confidence = 0.85
result3.file = Mock()
result3.file.original_filename = "test3.png"
results = [result1, result2, result3]
filters = {"confidence_threshold": 0.80}
filtered = export_service.apply_filters(results, filters)
assert len(filtered) == 2
assert result1 in filtered
assert result3 in filtered
assert result2 not in filtered
def test_apply_filters_filename_pattern(self, export_service):
"""Test filename pattern filter"""
result1 = Mock()
result1.average_confidence = 0.95
result1.file = Mock()
result1.file.original_filename = "invoice_2024.png"
result2 = Mock()
result2.average_confidence = 0.95
result2.file = Mock()
result2.file.original_filename = "receipt.png"
results = [result1, result2]
filters = {"filename_pattern": "invoice"}
filtered = export_service.apply_filters(results, filters)
assert len(filtered) == 1
assert result1 in filtered
def test_apply_filters_language(self, export_service):
"""Test language filter"""
result1 = Mock()
result1.detected_language = "zh"
result1.average_confidence = 0.95
result1.file = Mock()
result1.file.original_filename = "chinese.png"
result2 = Mock()
result2.detected_language = "en"
result2.average_confidence = 0.95
result2.file = Mock()
result2.file.original_filename = "english.png"
results = [result1, result2]
filters = {"language": "zh"}
filtered = export_service.apply_filters(results, filters)
assert len(filtered) == 1
assert result1 in filtered
def test_apply_filters_combined(self, export_service):
"""Test multiple filters combined"""
result1 = Mock()
result1.detected_language = "zh"
result1.average_confidence = 0.95
result1.file = Mock()
result1.file.original_filename = "invoice_chinese.png"
result2 = Mock()
result2.detected_language = "zh"
result2.average_confidence = 0.75
result2.file = Mock()
result2.file.original_filename = "invoice_low.png"
result3 = Mock()
result3.detected_language = "en"
result3.average_confidence = 0.95
result3.file = Mock()
result3.file.original_filename = "invoice_english.png"
results = [result1, result2, result3]
filters = {
"confidence_threshold": 0.80,
"language": "zh",
"filename_pattern": "invoice"
}
filtered = export_service.apply_filters(results, filters)
assert len(filtered) == 1
assert result1 in filtered
def test_apply_filters_no_filters(self, export_service):
"""Test with no filters applied"""
results = [Mock(), Mock(), Mock()]
filtered = export_service.apply_filters(results, {})
assert len(filtered) == len(results)
@pytest.mark.unit
class TestExportToTXT:
"""Test TXT export"""
def test_export_to_txt_basic(self, export_service, mock_ocr_result, temp_dir):
"""Test basic TXT export"""
output_path = temp_dir / "output.txt"
result_path = export_service.export_to_txt([mock_ocr_result], output_path)
assert result_path.exists()
content = result_path.read_text(encoding="utf-8")
assert "Test Document" in content
assert "test content" in content
def test_export_to_txt_with_line_numbers(self, export_service, mock_ocr_result, temp_dir):
"""Test TXT export with line numbers"""
output_path = temp_dir / "output.txt"
formatting = {"add_line_numbers": True}
result_path = export_service.export_to_txt(
[mock_ocr_result],
output_path,
formatting=formatting
)
content = result_path.read_text(encoding="utf-8")
assert "|" in content # Line number separator
def test_export_to_txt_with_metadata(self, export_service, mock_ocr_result, temp_dir):
"""Test TXT export with metadata headers"""
output_path = temp_dir / "output.txt"
formatting = {"include_metadata": True}
result_path = export_service.export_to_txt(
[mock_ocr_result],
output_path,
formatting=formatting
)
content = result_path.read_text(encoding="utf-8")
assert "文件:" in content
assert "test.png" in content
assert "信心度:" in content
def test_export_to_txt_with_grouping(self, export_service, mock_ocr_result, temp_dir):
"""Test TXT export with file grouping"""
output_path = temp_dir / "output.txt"
formatting = {"group_by_filename": True}
result_path = export_service.export_to_txt(
[mock_ocr_result, mock_ocr_result],
output_path,
formatting=formatting
)
content = result_path.read_text(encoding="utf-8")
assert "-" * 80 in content # Separator
def test_export_to_txt_missing_markdown(self, export_service, temp_dir):
"""Test TXT export with missing markdown file"""
result = Mock()
result.id = 1
result.markdown_path = "/nonexistent/path.md"
result.file = Mock()
result.file.original_filename = "test.png"
output_path = temp_dir / "output.txt"
# Should not fail, just skip the file
result_path = export_service.export_to_txt([result], output_path)
assert result_path.exists()
def test_export_to_txt_creates_parent_directories(self, export_service, mock_ocr_result, temp_dir):
"""Test that export creates necessary parent directories"""
output_path = temp_dir / "subdir" / "output.txt"
result_path = export_service.export_to_txt([mock_ocr_result], output_path)
assert result_path.exists()
assert result_path.parent.exists()
@pytest.mark.unit
class TestExportToJSON:
"""Test JSON export"""
def test_export_to_json_basic(self, export_service, mock_ocr_result, temp_dir):
"""Test basic JSON export"""
output_path = temp_dir / "output.json"
result_path = export_service.export_to_json([mock_ocr_result], output_path)
assert result_path.exists()
data = json.loads(result_path.read_text(encoding="utf-8"))
assert "export_time" in data
assert data["total_files"] == 1
assert len(data["results"]) == 1
assert data["results"][0]["filename"] == "test.png"
assert data["results"][0]["average_confidence"] == 0.95
def test_export_to_json_with_layout(self, export_service, mock_ocr_result, temp_dir):
"""Test JSON export with layout data"""
output_path = temp_dir / "output.json"
result_path = export_service.export_to_json(
[mock_ocr_result],
output_path,
include_layout=True
)
data = json.loads(result_path.read_text(encoding="utf-8"))
assert "layout_data" in data["results"][0]
def test_export_to_json_without_layout(self, export_service, mock_ocr_result, temp_dir):
"""Test JSON export without layout data"""
output_path = temp_dir / "output.json"
result_path = export_service.export_to_json(
[mock_ocr_result],
output_path,
include_layout=False
)
data = json.loads(result_path.read_text(encoding="utf-8"))
assert "layout_data" not in data["results"][0]
def test_export_to_json_multiple_results(self, export_service, mock_ocr_result, temp_dir):
"""Test JSON export with multiple results"""
output_path = temp_dir / "output.json"
result_path = export_service.export_to_json(
[mock_ocr_result, mock_ocr_result],
output_path
)
data = json.loads(result_path.read_text(encoding="utf-8"))
assert data["total_files"] == 2
assert len(data["results"]) == 2
@pytest.mark.unit
class TestExportToExcel:
"""Test Excel export"""
def test_export_to_excel_basic(self, export_service, mock_ocr_result, temp_dir):
"""Test basic Excel export"""
output_path = temp_dir / "output.xlsx"
result_path = export_service.export_to_excel([mock_ocr_result], output_path)
assert result_path.exists()
df = pd.read_excel(result_path)
assert len(df) == 1
assert "文件名" in df.columns
assert df.iloc[0]["文件名"] == "test.png"
def test_export_to_excel_with_confidence(self, export_service, mock_ocr_result, temp_dir):
"""Test Excel export with confidence scores"""
output_path = temp_dir / "output.xlsx"
result_path = export_service.export_to_excel(
[mock_ocr_result],
output_path,
include_confidence=True
)
df = pd.read_excel(result_path)
assert "平均信心度" in df.columns
def test_export_to_excel_without_processing_time(self, export_service, mock_ocr_result, temp_dir):
"""Test Excel export without processing time"""
output_path = temp_dir / "output.xlsx"
result_path = export_service.export_to_excel(
[mock_ocr_result],
output_path,
include_processing_time=False
)
df = pd.read_excel(result_path)
assert "處理時間(秒)" not in df.columns
def test_export_to_excel_long_content_truncation(self, export_service, temp_dir):
"""Test that long content is truncated in Excel"""
# Create result with long content
md_file = temp_dir / "long.md"
md_file.write_text("x" * 2000, encoding="utf-8")
result = Mock()
result.id = 1
result.markdown_path = str(md_file)
result.detected_language = "zh"
result.total_text_regions = 10
result.average_confidence = 0.95
result.file = Mock()
result.file.original_filename = "long.png"
result.file.file_format = "png"
result.file.file_size = 1024
result.file.processing_time = 1.0
output_path = temp_dir / "output.xlsx"
result_path = export_service.export_to_excel([result], output_path)
df = pd.read_excel(result_path)
content = df.iloc[0]["提取內容"]
assert "..." in content
        assert len(content) <= 1003  # 1000 chars + len("...") == 3
@pytest.mark.unit
class TestExportToMarkdown:
"""Test Markdown export"""
def test_export_to_markdown_combined(self, export_service, mock_ocr_result, temp_dir):
"""Test combined Markdown export"""
output_path = temp_dir / "combined.md"
result_path = export_service.export_to_markdown(
[mock_ocr_result],
output_path,
combine=True
)
assert result_path.exists()
assert result_path.is_file()
content = result_path.read_text(encoding="utf-8")
assert "test.png" in content
assert "Test Document" in content
def test_export_to_markdown_separate(self, export_service, mock_ocr_result, temp_dir):
"""Test separate Markdown export"""
output_dir = temp_dir / "markdown_files"
result_path = export_service.export_to_markdown(
[mock_ocr_result],
output_dir,
combine=False
)
assert result_path.exists()
assert result_path.is_dir()
files = list(result_path.glob("*.md"))
assert len(files) == 1
def test_export_to_markdown_multiple_files(self, export_service, mock_ocr_result, temp_dir):
"""Test Markdown export with multiple files"""
output_path = temp_dir / "combined.md"
result_path = export_service.export_to_markdown(
[mock_ocr_result, mock_ocr_result],
output_path,
combine=True
)
content = result_path.read_text(encoding="utf-8")
assert content.count("---") >= 1 # Separators
@pytest.mark.unit
class TestExportToPDF:
"""Test PDF export"""
@patch.object(ExportService, '__init__', lambda self: None)
def test_export_to_pdf_success(self, mock_ocr_result, temp_dir):
"""Test successful PDF export"""
from app.services.pdf_generator import PDFGenerator
service = ExportService()
service.pdf_generator = Mock(spec=PDFGenerator)
service.pdf_generator.generate_pdf = Mock(return_value=temp_dir / "output.pdf")
output_path = temp_dir / "output.pdf"
result_path = service.export_to_pdf(mock_ocr_result, output_path)
service.pdf_generator.generate_pdf.assert_called_once()
call_kwargs = service.pdf_generator.generate_pdf.call_args[1]
assert call_kwargs["css_template"] == "default"
@patch.object(ExportService, '__init__', lambda self: None)
def test_export_to_pdf_with_custom_template(self, mock_ocr_result, temp_dir):
"""Test PDF export with custom CSS template"""
from app.services.pdf_generator import PDFGenerator
service = ExportService()
service.pdf_generator = Mock(spec=PDFGenerator)
service.pdf_generator.generate_pdf = Mock(return_value=temp_dir / "output.pdf")
output_path = temp_dir / "output.pdf"
service.export_to_pdf(mock_ocr_result, output_path, css_template="academic")
call_kwargs = service.pdf_generator.generate_pdf.call_args[1]
assert call_kwargs["css_template"] == "academic"
@patch.object(ExportService, '__init__', lambda self: None)
def test_export_to_pdf_missing_markdown(self, temp_dir):
"""Test PDF export with missing markdown file"""
from app.services.pdf_generator import PDFGenerator
result = Mock()
result.id = 1
result.markdown_path = None
result.file = Mock()
service = ExportService()
service.pdf_generator = Mock(spec=PDFGenerator)
output_path = temp_dir / "output.pdf"
with pytest.raises(ExportError) as exc_info:
service.export_to_pdf(result, output_path)
assert "not found" in str(exc_info.value).lower()
@pytest.mark.unit
class TestGetExportFormats:
"""Test getting available export formats"""
def test_get_export_formats(self, export_service):
"""Test getting export formats"""
formats = export_service.get_export_formats()
assert isinstance(formats, dict)
assert "txt" in formats
assert "json" in formats
assert "excel" in formats
assert "markdown" in formats
assert "pdf" in formats
assert "zip" in formats
# Check descriptions are in Chinese
for desc in formats.values():
assert isinstance(desc, str)
assert len(desc) > 0
@pytest.mark.unit
class TestApplyExportRule:
"""Test export rule application"""
def test_apply_export_rule_success(self, export_service, mock_db):
"""Test applying export rule"""
# Create mock rule
rule = Mock()
rule.id = 1
rule.config_json = {
"filters": {
"confidence_threshold": 0.80
}
}
mock_db.query.return_value.filter.return_value.first.return_value = rule
# Create mock results
result1 = Mock()
result1.average_confidence = 0.95
result1.file = Mock()
result1.file.original_filename = "test1.png"
result2 = Mock()
result2.average_confidence = 0.70
result2.file = Mock()
result2.file.original_filename = "test2.png"
results = [result1, result2]
filtered = export_service.apply_export_rule(mock_db, results, rule_id=1)
assert len(filtered) == 1
assert result1 in filtered
def test_apply_export_rule_not_found(self, export_service, mock_db):
"""Test applying non-existent rule"""
mock_db.query.return_value.filter.return_value.first.return_value = None
with pytest.raises(ExportError) as exc_info:
export_service.apply_export_rule(mock_db, [], rule_id=999)
assert "not found" in str(exc_info.value).lower()
@pytest.mark.unit
class TestEdgeCases:
"""Test edge cases and error handling"""
def test_export_to_txt_empty_results(self, export_service, temp_dir):
"""Test TXT export with empty results list"""
output_path = temp_dir / "output.txt"
result_path = export_service.export_to_txt([], output_path)
assert result_path.exists()
content = result_path.read_text(encoding="utf-8")
assert content == ""
def test_export_to_json_empty_results(self, export_service, temp_dir):
"""Test JSON export with empty results list"""
output_path = temp_dir / "output.json"
result_path = export_service.export_to_json([], output_path)
data = json.loads(result_path.read_text(encoding="utf-8"))
assert data["total_files"] == 0
assert len(data["results"]) == 0
def test_export_with_unicode_content(self, export_service, temp_dir):
"""Test export with Unicode/Chinese content"""
md_file = temp_dir / "chinese.md"
md_file.write_text("# 測試文檔\n\n這是中文內容。", encoding="utf-8")
result = Mock()
result.id = 1
result.markdown_path = str(md_file)
result.json_path = None
result.detected_language = "zh"
result.total_text_regions = 10
result.average_confidence = 0.95
result.layout_data = None # Use None instead of Mock for JSON serialization
result.images_metadata = None # Use None instead of Mock
result.file = Mock()
result.file.id = 1
result.file.original_filename = "中文測試.png"
result.file.file_format = "png"
result.file.file_size = 1024
result.file.processing_time = 1.0
# Test TXT export
txt_path = temp_dir / "output.txt"
export_service.export_to_txt([result], txt_path)
assert "測試文檔" in txt_path.read_text(encoding="utf-8")
# Test JSON export
json_path = temp_dir / "output.json"
export_service.export_to_json([result], json_path)
data = json.loads(json_path.read_text(encoding="utf-8"))
assert data["results"][0]["filename"] == "中文測試.png"
def test_apply_filters_with_none_values(self, export_service):
"""Test filters with None values in results"""
result = Mock()
result.average_confidence = None
result.detected_language = None
result.file = Mock()
result.file.original_filename = "test.png"
filters = {"confidence_threshold": 0.80}
filtered = export_service.apply_filters([result], filters)
# Should filter out result with None confidence
assert len(filtered) == 0


@@ -1,520 +0,0 @@
"""
Tool_OCR - File Manager Unit Tests
Tests for app/services/file_manager.py
"""
import pytest
import shutil
from pathlib import Path
from unittest.mock import Mock, patch, MagicMock
from datetime import datetime, timedelta
from io import BytesIO
from fastapi import UploadFile
from app.services.file_manager import FileManager, FileManagementError
from app.models.ocr import OCRBatch, OCRFile, FileStatus, BatchStatus
@pytest.fixture
def file_manager(temp_dir):
"""Create a FileManager instance with temp directory"""
with patch('app.services.file_manager.settings') as mock_settings:
mock_settings.upload_dir = str(temp_dir)
mock_settings.max_upload_size = 20 * 1024 * 1024 # 20MB
mock_settings.allowed_extensions_list = ['png', 'jpg', 'jpeg', 'pdf']
manager = FileManager()
return manager
@pytest.fixture
def mock_upload_file():
"""Create a mock UploadFile"""
def create_file(filename="test.png", content=b"test content", size=None):
file_obj = BytesIO(content)
if size is None:
size = len(content)
upload_file = UploadFile(filename=filename, file=file_obj)
# Reset the underlying stream so reads start from the beginning
upload_file.file.seek(0)
return upload_file
return create_file
@pytest.fixture
def mock_db():
"""Create a mock database session"""
return Mock()
@pytest.mark.unit
class TestFileManagerInit:
"""Test FileManager initialization"""
def test_init(self, file_manager, temp_dir):
"""Test file manager initialization"""
assert file_manager is not None
assert file_manager.preprocessor is not None
assert file_manager.base_upload_dir == temp_dir
assert file_manager.base_upload_dir.exists()
@pytest.mark.unit
class TestBatchDirectoryManagement:
"""Test batch directory creation and management"""
def test_create_batch_directory(self, file_manager):
"""Test creating batch directory structure"""
batch_id = 123
batch_dir = file_manager.create_batch_directory(batch_id)
assert batch_dir.exists()
assert (batch_dir / "inputs").exists()
assert (batch_dir / "outputs" / "markdown").exists()
assert (batch_dir / "outputs" / "json").exists()
assert (batch_dir / "outputs" / "images").exists()
assert (batch_dir / "exports").exists()
def test_create_batch_directory_multiple_times(self, file_manager):
"""Test creating same batch directory multiple times (should not error)"""
batch_id = 123
batch_dir1 = file_manager.create_batch_directory(batch_id)
batch_dir2 = file_manager.create_batch_directory(batch_id)
assert batch_dir1 == batch_dir2
assert batch_dir1.exists()
def test_get_batch_directory(self, file_manager):
"""Test getting batch directory path"""
batch_id = 456
batch_dir = file_manager.get_batch_directory(batch_id)
expected_path = file_manager.base_upload_dir / "batches" / "456"
assert batch_dir == expected_path
@pytest.mark.unit
class TestUploadValidation:
"""Test file upload validation"""
def test_validate_upload_valid_file(self, file_manager, mock_upload_file):
"""Test validation of valid upload"""
upload = mock_upload_file("test.png", b"valid content")
is_valid, error = file_manager.validate_upload(upload)
assert is_valid is True
assert error is None
def test_validate_upload_empty_filename(self, file_manager):
"""Test validation with empty filename"""
upload = Mock()
upload.filename = ""
is_valid, error = file_manager.validate_upload(upload)
assert is_valid is False
assert "文件名不能為空" in error
def test_validate_upload_empty_file(self, file_manager, mock_upload_file):
"""Test validation of empty file"""
upload = mock_upload_file("test.png", b"")
is_valid, error = file_manager.validate_upload(upload)
assert is_valid is False
assert "文件為空" in error
@pytest.mark.skip(reason="File size mock is complex with UploadFile, covered by integration test")
def test_validate_upload_file_too_large(self, file_manager):
"""Test validation of file exceeding size limit"""
# Note: This functionality is tested in integration tests where actual
# files can be created. Mocking UploadFile's size behavior is complex.
pass
def test_validate_upload_unsupported_format(self, file_manager, mock_upload_file):
"""Test validation of unsupported file format"""
upload = mock_upload_file("test.txt", b"text content")
is_valid, error = file_manager.validate_upload(upload)
assert is_valid is False
assert "不支持的文件格式" in error
def test_validate_upload_supported_formats(self, file_manager, mock_upload_file):
"""Test validation of all supported formats"""
supported_formats = ["test.png", "test.jpg", "test.jpeg", "test.pdf"]
for filename in supported_formats:
upload = mock_upload_file(filename, b"content")
is_valid, error = file_manager.validate_upload(upload)
assert is_valid is True, f"Failed for {filename}"
@pytest.mark.unit
class TestFileSaving:
"""Test file saving operations"""
def test_save_upload_success(self, file_manager, mock_upload_file):
"""Test successful file saving"""
batch_id = 1
file_manager.create_batch_directory(batch_id)
upload = mock_upload_file("test.png", b"test content")
file_path, original_filename = file_manager.save_upload(upload, batch_id)
assert file_path.exists()
assert file_path.read_bytes() == b"test content"
assert original_filename == "test.png"
assert file_path.parent.name == "inputs"
def test_save_upload_unique_filename(self, file_manager, mock_upload_file):
"""Test that saved files get unique filenames"""
batch_id = 1
file_manager.create_batch_directory(batch_id)
upload1 = mock_upload_file("test.png", b"content1")
upload2 = mock_upload_file("test.png", b"content2")
path1, _ = file_manager.save_upload(upload1, batch_id)
path2, _ = file_manager.save_upload(upload2, batch_id)
assert path1 != path2
assert path1.exists() and path2.exists()
assert path1.read_bytes() == b"content1"
assert path2.read_bytes() == b"content2"
def test_save_upload_validation_failure(self, file_manager, mock_upload_file):
"""Test save upload with validation failure"""
batch_id = 1
file_manager.create_batch_directory(batch_id)
# Empty file should fail validation
upload = mock_upload_file("test.png", b"")
with pytest.raises(FileManagementError) as exc_info:
file_manager.save_upload(upload, batch_id, validate=True)
assert "文件為空" in str(exc_info.value)
def test_save_upload_skip_validation(self, file_manager, mock_upload_file):
"""Test saving with validation skipped"""
batch_id = 1
file_manager.create_batch_directory(batch_id)
# Empty file but validation skipped
upload = mock_upload_file("test.txt", b"")
# Should succeed when validation is disabled
file_path, _ = file_manager.save_upload(upload, batch_id, validate=False)
assert file_path.exists()
def test_save_upload_preserves_extension(self, file_manager, mock_upload_file):
"""Test that file extension is preserved"""
batch_id = 1
file_manager.create_batch_directory(batch_id)
upload = mock_upload_file("document.pdf", b"pdf content")
file_path, _ = file_manager.save_upload(upload, batch_id)
assert file_path.suffix == ".pdf"
@pytest.mark.unit
class TestValidateSavedFile:
"""Test validation of saved files"""
@patch.object(FileManager, '__init__', lambda self: None)
def test_validate_saved_file(self, sample_image_path):
"""Test validating a saved file"""
from app.services.preprocessor import DocumentPreprocessor
manager = FileManager()
manager.preprocessor = DocumentPreprocessor()
# validate_file returns (is_valid, file_format, error_message)
is_valid, file_format, error = manager.validate_saved_file(sample_image_path)
assert is_valid is True
assert file_format == 'png'
assert error is None
@pytest.mark.unit
class TestBatchCreation:
"""Test batch creation"""
def test_create_batch(self, file_manager, mock_db):
"""Test creating a new batch"""
user_id = 1
# Mock database operations
mock_batch = Mock()
mock_batch.id = 123
mock_db.add = Mock()
mock_db.commit = Mock()
mock_db.refresh = Mock(side_effect=lambda x: setattr(x, 'id', 123))
with patch.object(FileManager, 'create_batch_directory'):
batch = file_manager.create_batch(mock_db, user_id)
assert mock_db.add.called
assert mock_db.commit.called
def test_create_batch_with_custom_name(self, file_manager, mock_db):
"""Test creating batch with custom name"""
user_id = 1
batch_name = "My Custom Batch"
mock_db.add = Mock()
mock_db.commit = Mock()
mock_db.refresh = Mock(side_effect=lambda x: setattr(x, 'id', 123))
with patch.object(FileManager, 'create_batch_directory'):
batch = file_manager.create_batch(mock_db, user_id, batch_name)
# Verify batch was created with correct name
call_args = mock_db.add.call_args[0][0]
assert hasattr(call_args, 'batch_name')
@pytest.mark.unit
class TestGetFilePaths:
"""Test file path retrieval"""
def test_get_file_paths(self, file_manager):
"""Test getting file paths for a batch"""
batch_id = 1
file_id = 42
paths = file_manager.get_file_paths(batch_id, file_id)
assert "input_dir" in paths
assert "output_dir" in paths
assert "markdown_dir" in paths
assert "json_dir" in paths
assert "images_dir" in paths
assert "export_dir" in paths
# Verify images_dir includes file_id
assert str(file_id) in str(paths["images_dir"])
@pytest.mark.unit
class TestCleanupExpiredBatches:
"""Test cleanup of expired batches"""
def test_cleanup_expired_batches(self, file_manager, mock_db, temp_dir):
"""Test cleaning up expired batches"""
# Create mock expired batch
expired_batch = Mock()
expired_batch.id = 1
expired_batch.created_at = datetime.utcnow() - timedelta(hours=48)
# Create batch directory
batch_dir = file_manager.create_batch_directory(1)
assert batch_dir.exists()
# Mock database query
mock_db.query.return_value.filter.return_value.all.return_value = [expired_batch]
mock_db.delete = Mock()
mock_db.commit = Mock()
# Run cleanup
cleaned = file_manager.cleanup_expired_batches(mock_db, retention_hours=24)
assert cleaned == 1
assert not batch_dir.exists()
mock_db.delete.assert_called_once_with(expired_batch)
mock_db.commit.assert_called_once()
def test_cleanup_no_expired_batches(self, file_manager, mock_db):
"""Test cleanup when no batches are expired"""
# Mock database query returning empty list
mock_db.query.return_value.filter.return_value.all.return_value = []
cleaned = file_manager.cleanup_expired_batches(mock_db, retention_hours=24)
assert cleaned == 0
def test_cleanup_handles_missing_directory(self, file_manager, mock_db):
"""Test cleanup handles missing batch directory gracefully"""
expired_batch = Mock()
expired_batch.id = 999 # Directory doesn't exist
expired_batch.created_at = datetime.utcnow() - timedelta(hours=48)
mock_db.query.return_value.filter.return_value.all.return_value = [expired_batch]
mock_db.delete = Mock()
mock_db.commit = Mock()
# Should not raise error
cleaned = file_manager.cleanup_expired_batches(mock_db, retention_hours=24)
assert cleaned == 1
@pytest.mark.unit
class TestFileOwnershipVerification:
"""Test file ownership verification"""
def test_verify_file_ownership_success(self, file_manager, mock_db):
"""Test successful ownership verification"""
user_id = 1
batch_id = 123
# Mock batch owned by user
mock_batch = Mock()
mock_db.query.return_value.filter.return_value.first.return_value = mock_batch
is_owner = file_manager.verify_file_ownership(mock_db, user_id, batch_id)
assert is_owner is True
def test_verify_file_ownership_failure(self, file_manager, mock_db):
"""Test ownership verification failure"""
user_id = 1
batch_id = 123
# Mock no batch found (wrong owner)
mock_db.query.return_value.filter.return_value.first.return_value = None
is_owner = file_manager.verify_file_ownership(mock_db, user_id, batch_id)
assert is_owner is False
@pytest.mark.unit
class TestBatchStatistics:
"""Test batch statistics retrieval"""
def test_get_batch_statistics(self, file_manager, mock_db):
"""Test getting batch statistics"""
batch_id = 1
# Create mock batch with files
mock_file1 = Mock()
mock_file1.file_size = 1000
mock_file2 = Mock()
mock_file2.file_size = 2000
mock_batch = Mock()
mock_batch.id = batch_id
mock_batch.batch_name = "Test Batch"
mock_batch.status = BatchStatus.COMPLETED
mock_batch.total_files = 2
mock_batch.completed_files = 2
mock_batch.failed_files = 0
mock_batch.progress_percentage = 100.0
mock_batch.files = [mock_file1, mock_file2]
mock_batch.created_at = datetime(2025, 1, 1, 10, 0, 0)
mock_batch.started_at = datetime(2025, 1, 1, 10, 1, 0)
mock_batch.completed_at = datetime(2025, 1, 1, 10, 5, 0)
mock_db.query.return_value.filter.return_value.first.return_value = mock_batch
stats = file_manager.get_batch_statistics(mock_db, batch_id)
assert stats['batch_id'] == batch_id
assert stats['batch_name'] == "Test Batch"
assert stats['total_files'] == 2
assert stats['total_file_size'] == 3000
assert stats['total_file_size_mb'] == 0.0 # Small files
assert stats['processing_time'] == 240.0 # 4 minutes
assert stats['pending_files'] == 0
def test_get_batch_statistics_not_found(self, file_manager, mock_db):
"""Test getting statistics for non-existent batch"""
batch_id = 999
mock_db.query.return_value.filter.return_value.first.return_value = None
stats = file_manager.get_batch_statistics(mock_db, batch_id)
assert stats == {}
def test_get_batch_statistics_no_completion_time(self, file_manager, mock_db):
"""Test statistics for batch without completion time"""
mock_batch = Mock()
mock_batch.id = 1
mock_batch.batch_name = "Pending Batch"
mock_batch.status = BatchStatus.PROCESSING
mock_batch.total_files = 5
mock_batch.completed_files = 2
mock_batch.failed_files = 0
mock_batch.progress_percentage = 40.0
mock_batch.files = []
mock_batch.created_at = datetime(2025, 1, 1)
mock_batch.started_at = datetime(2025, 1, 1)
mock_batch.completed_at = None
mock_db.query.return_value.filter.return_value.first.return_value = mock_batch
stats = file_manager.get_batch_statistics(mock_db, 1)
assert stats['processing_time'] is None
assert stats['pending_files'] == 3
@pytest.mark.unit
class TestEdgeCases:
"""Test edge cases and error handling"""
def test_save_upload_creates_parent_directories(self, file_manager, mock_upload_file):
"""Test that save_upload creates necessary directories"""
batch_id = 999 # Directory doesn't exist yet
upload = mock_upload_file("test.png", b"content")
file_path, _ = file_manager.save_upload(upload, batch_id)
assert file_path.exists()
assert file_path.parent.exists()
def test_cleanup_continues_on_error(self, file_manager, mock_db):
"""Test that cleanup continues even if one batch fails"""
batch1 = Mock()
batch1.id = 1
batch1.created_at = datetime.utcnow() - timedelta(hours=48)
batch2 = Mock()
batch2.id = 2
batch2.created_at = datetime.utcnow() - timedelta(hours=48)
# Create only batch2 directory
file_manager.create_batch_directory(2)
mock_db.query.return_value.filter.return_value.all.return_value = [batch1, batch2]
mock_db.delete = Mock()
mock_db.commit = Mock()
# Should not fail, should clean batch2 even if batch1 fails
cleaned = file_manager.cleanup_expired_batches(mock_db, retention_hours=24)
assert cleaned > 0
def test_validate_upload_with_unicode_filename(self, file_manager, mock_upload_file):
"""Test validation with Unicode filename"""
upload = mock_upload_file("測試文件.png", b"content")
is_valid, error = file_manager.validate_upload(upload)
assert is_valid is True
def test_save_upload_preserves_unicode_filename(self, file_manager, mock_upload_file):
"""Test that Unicode filenames are handled correctly"""
batch_id = 1
file_manager.create_batch_directory(batch_id)
upload = mock_upload_file("中文文檔.pdf", b"content")
file_path, original_filename = file_manager.save_upload(upload, batch_id)
assert original_filename == "中文文檔.pdf"
assert file_path.exists()


@@ -1,182 +0,0 @@
"""
Integration tests for Tool_OCR
Tests the complete flow of authentication, task creation, and file operations
"""
import pytest
from unittest.mock import patch
class TestIntegration:
"""Integration tests for end-to-end workflows"""
def test_complete_auth_and_task_flow(self, client, db):
"""Test complete flow: login -> create task -> get task -> delete task"""
# Step 1: Login
from app.services.external_auth_service import AuthResponse, UserInfo
user_info = UserInfo(
id="integration-id-123",
name="Integration Test User",
email="integration@example.com"
)
auth_response = AuthResponse(
access_token="test-token",
id_token="test-id-token",
expires_in=3600,
token_type="Bearer",
user_info=user_info,
issued_at="2025-11-16T10:00:00Z",
expires_at="2025-11-16T11:00:00Z"
)
with patch('app.routers.auth.external_auth_service.authenticate_user') as mock_auth:
mock_auth.return_value = (True, auth_response, None)
login_response = client.post('/api/v2/auth/login', json={
'username': 'integration@example.com',
'password': 'password123'
})
assert login_response.status_code == 200
token = login_response.json()['access_token']
headers = {'Authorization': f'Bearer {token}'}
# Step 2: Create task
create_response = client.post(
'/api/v2/tasks/',
headers=headers,
json={
'filename': 'integration_test.pdf',
'file_type': 'application/pdf'
}
)
assert create_response.status_code == 201
task_data = create_response.json()
task_id = task_data['task_id']
# Step 3: Get task
get_response = client.get(
f'/api/v2/tasks/{task_id}',
headers=headers
)
assert get_response.status_code == 200
assert get_response.json()['task_id'] == task_id
# Step 4: List tasks
list_response = client.get(
'/api/v2/tasks/',
headers=headers
)
assert list_response.status_code == 200
assert len(list_response.json()['tasks']) > 0
# Step 5: Get stats
stats_response = client.get(
'/api/v2/tasks/stats',
headers=headers
)
assert stats_response.status_code == 200
stats = stats_response.json()
assert stats['total'] > 0
assert stats['pending'] > 0
# Step 6: Delete task
delete_response = client.delete(
f'/api/v2/tasks/{task_id}',
headers=headers
)
# DELETE returns 204 No Content (standard for successful deletion)
assert delete_response.status_code == 204
# Step 7: Verify deletion
get_after_delete = client.get(
f'/api/v2/tasks/{task_id}',
headers=headers
)
assert get_after_delete.status_code == 404
def test_admin_workflow(self, client, db):
"""Test admin workflow: login as admin -> access admin endpoints"""
# Login as admin
from app.services.external_auth_service import AuthResponse, UserInfo
user_info = UserInfo(
id="admin-id-123",
name="Admin User",
email="ymirliu@panjit.com.tw"
)
auth_response = AuthResponse(
access_token="admin-token",
id_token="admin-id-token",
expires_in=3600,
token_type="Bearer",
user_info=user_info,
issued_at="2025-11-16T10:00:00Z",
expires_at="2025-11-16T11:00:00Z"
)
with patch('app.routers.auth.external_auth_service.authenticate_user') as mock_auth:
mock_auth.return_value = (True, auth_response, None)
login_response = client.post('/api/v2/auth/login', json={
'username': 'ymirliu@panjit.com.tw',
'password': 'adminpass'
})
assert login_response.status_code == 200
token = login_response.json()['access_token']
headers = {'Authorization': f'Bearer {token}'}
# Access admin endpoints
stats_response = client.get('/api/v2/admin/stats', headers=headers)
assert stats_response.status_code == 200
users_response = client.get('/api/v2/admin/users', headers=headers)
assert users_response.status_code == 200
logs_response = client.get('/api/v2/admin/audit-logs', headers=headers)
assert logs_response.status_code == 200
def test_task_lifecycle(self, client, auth_token, test_task, db):
"""Test complete task lifecycle: pending -> processing -> completed"""
headers = {'Authorization': f'Bearer {auth_token}'}
# Check initial status
response = client.get(f'/api/v2/tasks/{test_task.task_id}', headers=headers)
assert response.json()['status'] == 'pending'
# Start task
start_response = client.post(
f'/api/v2/tasks/{test_task.task_id}/start',
headers=headers
)
assert start_response.status_code == 200
assert start_response.json()['status'] == 'processing'
# Update task to completed
update_response = client.patch(
f'/api/v2/tasks/{test_task.task_id}',
headers=headers,
json={
'status': 'completed',
'processing_time_ms': 1500
}
)
assert update_response.status_code == 200
assert update_response.json()['status'] == 'completed'
# Verify final state
final_response = client.get(f'/api/v2/tasks/{test_task.task_id}', headers=headers)
final_data = final_response.json()
assert final_data['status'] == 'completed'
assert final_data['processing_time_ms'] == 1500


@@ -1,528 +0,0 @@
"""
Tool_OCR - OCR Service Unit Tests
Tests for app/services/ocr_service.py
"""
import pytest
import json
from pathlib import Path
from unittest.mock import Mock, patch, MagicMock
from app.services.ocr_service import OCRService
@pytest.mark.unit
class TestOCRServiceInit:
"""Test OCR service initialization"""
def test_init(self):
"""Test OCR service initialization"""
service = OCRService()
assert service is not None
assert service.ocr_engines == {}
assert service.structure_engine is None
assert service.confidence_threshold > 0
assert len(service.ocr_languages) > 0
def test_supported_languages(self):
"""Test that supported languages are configured"""
service = OCRService()
# Should have at least Chinese and English
assert 'ch' in service.ocr_languages or 'en' in service.ocr_languages
@pytest.mark.unit
class TestOCREngineLazyLoading:
"""Test OCR engine lazy loading"""
@patch('app.services.ocr_service.PaddleOCR')
def test_get_ocr_engine_creates_new_engine(self, mock_paddle_ocr):
"""Test that get_ocr_engine creates engine on first call"""
mock_engine = Mock()
mock_paddle_ocr.return_value = mock_engine
service = OCRService()
engine = service.get_ocr_engine(lang='en')
assert engine == mock_engine
mock_paddle_ocr.assert_called_once()
assert 'en' in service.ocr_engines
@patch('app.services.ocr_service.PaddleOCR')
def test_get_ocr_engine_reuses_existing_engine(self, mock_paddle_ocr):
"""Test that get_ocr_engine reuses existing engine"""
mock_engine = Mock()
mock_paddle_ocr.return_value = mock_engine
service = OCRService()
# First call creates engine
engine1 = service.get_ocr_engine(lang='en')
# Second call should reuse
engine2 = service.get_ocr_engine(lang='en')
assert engine1 == engine2
mock_paddle_ocr.assert_called_once()
@patch('app.services.ocr_service.PaddleOCR')
def test_get_ocr_engine_different_languages(self, mock_paddle_ocr):
"""Test that different languages get different engines"""
mock_paddle_ocr.return_value = Mock()
service = OCRService()
engine_en = service.get_ocr_engine(lang='en')
engine_ch = service.get_ocr_engine(lang='ch')
assert 'en' in service.ocr_engines
assert 'ch' in service.ocr_engines
assert mock_paddle_ocr.call_count == 2
@pytest.mark.unit
class TestStructureEngineLazyLoading:
"""Test structure engine lazy loading"""
@patch('app.services.ocr_service.PPStructureV3')
def test_get_structure_engine_creates_new_engine(self, mock_structure):
"""Test that get_structure_engine creates engine on first call"""
mock_engine = Mock()
mock_structure.return_value = mock_engine
service = OCRService()
engine = service.get_structure_engine()
assert engine == mock_engine
mock_structure.assert_called_once()
assert service.structure_engine == mock_engine
@patch('app.services.ocr_service.PPStructureV3')
def test_get_structure_engine_reuses_existing_engine(self, mock_structure):
"""Test that get_structure_engine reuses existing engine"""
mock_engine = Mock()
mock_structure.return_value = mock_engine
service = OCRService()
# First call creates engine
engine1 = service.get_structure_engine()
# Second call should reuse
engine2 = service.get_structure_engine()
assert engine1 == engine2
mock_structure.assert_called_once()
@pytest.mark.unit
class TestProcessImageMocked:
"""Test image processing with mocked OCR engines"""
@patch('app.services.ocr_service.PaddleOCR')
def test_process_image_success(self, mock_paddle_ocr, sample_image_path):
"""Test successful image processing"""
# Mock OCR results - PaddleOCR 3.x format
mock_ocr_results = [{
'rec_texts': ['Hello World', 'Test Text'],
'rec_scores': [0.95, 0.88],
'rec_polys': [
[[10, 10], [100, 10], [100, 30], [10, 30]],
[[10, 40], [100, 40], [100, 60], [10, 60]]
]
}]
mock_engine = Mock()
mock_engine.ocr.return_value = mock_ocr_results
mock_paddle_ocr.return_value = mock_engine
service = OCRService()
result = service.process_image(sample_image_path, detect_layout=False)
assert result['status'] == 'success'
assert result['file_name'] == sample_image_path.name
assert result['language'] == 'ch'
assert result['total_text_regions'] == 2
assert result['average_confidence'] > 0.8
assert len(result['text_regions']) == 2
assert 'markdown_content' in result
assert 'processing_time' in result
@patch('app.services.ocr_service.PaddleOCR')
def test_process_image_filters_low_confidence(self, mock_paddle_ocr, sample_image_path):
"""Test that low confidence results are filtered"""
# Mock OCR results with varying confidence - PaddleOCR 3.x format
mock_ocr_results = [{
'rec_texts': ['High Confidence', 'Low Confidence'],
'rec_scores': [0.95, 0.50],
'rec_polys': [
[[10, 10], [100, 10], [100, 30], [10, 30]],
[[10, 40], [100, 40], [100, 60], [10, 60]]
]
}]
mock_engine = Mock()
mock_engine.ocr.return_value = mock_ocr_results
mock_paddle_ocr.return_value = mock_engine
service = OCRService()
result = service.process_image(
sample_image_path,
detect_layout=False,
confidence_threshold=0.80
)
assert result['status'] == 'success'
assert result['total_text_regions'] == 1 # Only high confidence
assert result['text_regions'][0]['text'] == 'High Confidence'
@patch('app.services.ocr_service.PaddleOCR')
def test_process_image_empty_results(self, mock_paddle_ocr, sample_image_path):
"""Test processing image with no text detected"""
mock_ocr_results = [[]]
mock_engine = Mock()
mock_engine.ocr.return_value = mock_ocr_results
mock_paddle_ocr.return_value = mock_engine
service = OCRService()
        result = service.process_image(sample_image_path, detect_layout=False)

        assert result['status'] == 'success'
        assert result['total_text_regions'] == 0
        assert result['average_confidence'] == 0.0

    @patch('app.services.ocr_service.PaddleOCR')
    def test_process_image_error_handling(self, mock_paddle_ocr, sample_image_path):
        """Test error handling during OCR processing"""
        mock_engine = Mock()
        mock_engine.ocr.side_effect = Exception("OCR engine error")
        mock_paddle_ocr.return_value = mock_engine

        service = OCRService()
        result = service.process_image(sample_image_path, detect_layout=False)

        assert result['status'] == 'error'
        assert 'error_message' in result
        assert 'OCR engine error' in result['error_message']

    @patch('app.services.ocr_service.PaddleOCR')
    def test_process_image_different_languages(self, mock_paddle_ocr, sample_image_path):
        """Test processing with different languages"""
        mock_ocr_results = [[
            [[[10, 10], [100, 10], [100, 30], [10, 30]], ('Text', 0.95)]
        ]]
        mock_engine = Mock()
        mock_engine.ocr.return_value = mock_ocr_results
        mock_paddle_ocr.return_value = mock_engine

        service = OCRService()

        # Test English
        result_en = service.process_image(sample_image_path, lang='en', detect_layout=False)
        assert result_en['language'] == 'en'

        # Test Chinese
        result_ch = service.process_image(sample_image_path, lang='ch', detect_layout=False)
        assert result_ch['language'] == 'ch'


@pytest.mark.unit
class TestLayoutAnalysisMocked:
    """Test layout analysis with mocked structure engine"""

    @patch('app.services.ocr_service.PPStructureV3')
    def test_analyze_layout_success(self, mock_structure, sample_image_path):
        """Test successful layout analysis"""
        # Create mock page result with markdown attribute (PP-StructureV3 format)
        mock_page_result = Mock()
        mock_page_result.markdown = {
            'markdown_texts': 'Document Title\n\nParagraph content',
            'markdown_images': {}
        }
        # PP-Structure predict() returns a list of page results
        mock_engine = Mock()
        mock_engine.predict.return_value = [mock_page_result]
        mock_structure.return_value = mock_engine

        service = OCRService()
        layout_data, images_metadata = service.analyze_layout(sample_image_path)

        assert layout_data is not None
        assert layout_data['total_elements'] == 1
        assert len(layout_data['elements']) == 1
        assert layout_data['elements'][0]['type'] == 'text'
        assert 'Document Title' in layout_data['elements'][0]['content']

    @patch('app.services.ocr_service.PPStructureV3')
    def test_analyze_layout_with_table(self, mock_structure, sample_image_path):
        """Test layout analysis with table element"""
        # Create mock page result with table in markdown (PP-StructureV3 format)
        mock_page_result = Mock()
        mock_page_result.markdown = {
            'markdown_texts': '<table><tr><td>Cell 1</td></tr></table>',
            'markdown_images': {}
        }
        # PP-Structure predict() returns a list of page results
        mock_engine = Mock()
        mock_engine.predict.return_value = [mock_page_result]
        mock_structure.return_value = mock_engine

        service = OCRService()
        layout_data, images_metadata = service.analyze_layout(sample_image_path)

        assert layout_data is not None
        assert layout_data['elements'][0]['type'] == 'table'
        # Content should contain the HTML table
        assert '<table>' in layout_data['elements'][0]['content']

    @patch('app.services.ocr_service.PPStructureV3')
    def test_analyze_layout_error_handling(self, mock_structure, sample_image_path):
        """Test error handling in layout analysis"""
        mock_engine = Mock()
        mock_engine.side_effect = Exception("Structure analysis error")
        mock_structure.return_value = mock_engine

        service = OCRService()
        layout_data, images_metadata = service.analyze_layout(sample_image_path)

        assert layout_data is None
        assert images_metadata == []


@pytest.mark.unit
class TestMarkdownGeneration:
    """Test Markdown generation"""

    def test_generate_markdown_from_text_regions(self):
        """Test Markdown generation from text regions only"""
        service = OCRService()
        text_regions = [
            {'text': 'First line', 'bbox': [[10, 10], [100, 10], [100, 30], [10, 30]]},
            {'text': 'Second line', 'bbox': [[10, 40], [100, 40], [100, 60], [10, 60]]},
            {'text': 'Third line', 'bbox': [[10, 70], [100, 70], [100, 90], [10, 90]]},
        ]

        markdown = service.generate_markdown(text_regions)

        assert 'First line' in markdown
        assert 'Second line' in markdown
        assert 'Third line' in markdown

    def test_generate_markdown_with_layout(self):
        """Test Markdown generation with layout information"""
        service = OCRService()
        text_regions = []
        layout_data = {
            'elements': [
                {'type': 'title', 'content': 'Document Title'},
                {'type': 'text', 'content': 'Paragraph text'},
                {'type': 'figure', 'element_id': 0},
            ]
        }

        markdown = service.generate_markdown(text_regions, layout_data)

        assert '# Document Title' in markdown
        assert 'Paragraph text' in markdown
        assert '![Figure 0]' in markdown

    def test_generate_markdown_with_table(self):
        """Test Markdown generation with table"""
        service = OCRService()
        layout_data = {
            'elements': [
                {
                    'type': 'table',
                    'content': '<table><tr><td>Cell</td></tr></table>'
                }
            ]
        }

        markdown = service.generate_markdown([], layout_data)

        assert '<table>' in markdown

    def test_generate_markdown_empty_input(self):
        """Test Markdown generation with empty input"""
        service = OCRService()
        markdown = service.generate_markdown([])
        assert markdown == ""

    def test_generate_markdown_sorts_by_position(self):
        """Test that text regions are sorted by vertical position"""
        service = OCRService()
        # Create text regions in reverse order
        text_regions = [
            {'text': 'Bottom', 'bbox': [[10, 90], [100, 90], [100, 110], [10, 110]]},
            {'text': 'Top', 'bbox': [[10, 10], [100, 10], [100, 30], [10, 30]]},
            {'text': 'Middle', 'bbox': [[10, 50], [100, 50], [100, 70], [10, 70]]},
        ]

        markdown = service.generate_markdown(text_regions)
        lines = markdown.strip().split('\n')

        # Should be sorted top to bottom
        assert lines[0] == 'Top'
        assert lines[1] == 'Middle'
        assert lines[2] == 'Bottom'


@pytest.mark.unit
class TestSaveResults:
    """Test saving OCR results"""

    def test_save_results_success(self, temp_dir):
        """Test successful saving of results"""
        service = OCRService()
        result = {
            'status': 'success',
            'file_name': 'test.png',
            'text_regions': [{'text': 'Hello', 'confidence': 0.95}],
            'markdown_content': '# Hello\n\nTest content',
        }

        json_path, md_path = service.save_results(result, temp_dir, 'test123')

        assert json_path is not None
        assert md_path is not None
        assert json_path.exists()
        assert md_path.exists()

        # Verify JSON content
        with open(json_path, 'r') as f:
            saved_result = json.load(f)
        assert saved_result['file_name'] == 'test.png'

        # Verify Markdown content
        md_content = md_path.read_text()
        assert 'Hello' in md_content

    def test_save_results_creates_directory(self, temp_dir):
        """Test that save_results creates output directory if needed"""
        service = OCRService()
        output_dir = temp_dir / "subdir" / "results"
        result = {
            'status': 'success',
            'markdown_content': 'Test',
        }

        json_path, md_path = service.save_results(result, output_dir, 'test')

        assert output_dir.exists()
        assert json_path.exists()

    def test_save_results_handles_unicode(self, temp_dir):
        """Test saving results with Unicode characters"""
        service = OCRService()
        result = {
            'status': 'success',
            'text_regions': [{'text': '你好世界', 'confidence': 0.95}],
            'markdown_content': '# 你好世界\n\n测试内容',
        }

        json_path, md_path = service.save_results(result, temp_dir, 'unicode_test')

        # Verify Unicode is preserved
        with open(json_path, 'r', encoding='utf-8') as f:
            saved_result = json.load(f)
        assert saved_result['text_regions'][0]['text'] == '你好世界'

        md_content = md_path.read_text(encoding='utf-8')
        assert '你好世界' in md_content


@pytest.mark.unit
class TestEdgeCases:
    """Test edge cases and error handling"""

    @patch('app.services.ocr_service.PaddleOCR')
    def test_process_image_with_none_results(self, mock_paddle_ocr, sample_image_path):
        """Test processing when OCR returns None"""
        mock_engine = Mock()
        mock_engine.ocr.return_value = None
        mock_paddle_ocr.return_value = mock_engine

        service = OCRService()
        result = service.process_image(sample_image_path, detect_layout=False)

        assert result['status'] == 'success'
        assert result['total_text_regions'] == 0

    @patch('app.services.ocr_service.PaddleOCR')
    def test_process_image_with_custom_threshold(self, mock_paddle_ocr, sample_image_path):
        """Test processing with custom confidence threshold"""
        # PaddleOCR 3.x format
        mock_ocr_results = [{
            'rec_texts': ['Text'],
            'rec_scores': [0.85],
            'rec_polys': [[[10, 10], [100, 10], [100, 30], [10, 30]]]
        }]
        mock_engine = Mock()
        mock_engine.ocr.return_value = mock_ocr_results
        mock_paddle_ocr.return_value = mock_engine

        service = OCRService()

        # With high threshold - should filter out
        result_high = service.process_image(
            sample_image_path,
            detect_layout=False,
            confidence_threshold=0.90
        )
        assert result_high['total_text_regions'] == 0

        # With low threshold - should include
        result_low = service.process_image(
            sample_image_path,
            detect_layout=False,
            confidence_threshold=0.80
        )
        assert result_low['total_text_regions'] == 1


# Integration tests that require actual PaddleOCR models
@pytest.mark.requires_models
@pytest.mark.slow
class TestOCRServiceIntegration:
    """
    Integration tests that require actual PaddleOCR models

    These tests will download models (~900MB) on first run
    Run with: pytest -m requires_models
    """

    def test_real_ocr_engine_initialization(self):
        """Test real PaddleOCR engine initialization"""
        service = OCRService()
        engine = service.get_ocr_engine(lang='en')

        assert engine is not None
        assert hasattr(engine, 'ocr')

    def test_real_structure_engine_initialization(self):
        """Test real PP-Structure engine initialization"""
        service = OCRService()
        engine = service.get_structure_engine()

        assert engine is not None

    def test_real_image_processing(self, sample_image_with_text):
        """Test processing real image with text"""
        service = OCRService()
        result = service.process_image(sample_image_with_text, lang='en')

        assert result['status'] == 'success'
        assert result['total_text_regions'] > 0


@@ -1,559 +0,0 @@
"""
Tool_OCR - PDF Generator Unit Tests
Tests for app/services/pdf_generator.py
"""
import pytest
from pathlib import Path
from unittest.mock import Mock, patch, MagicMock
import subprocess
from app.services.pdf_generator import PDFGenerator, PDFGenerationError
@pytest.mark.unit
class TestPDFGeneratorInit:
"""Test PDF generator initialization"""
def test_init(self):
"""Test PDF generator initialization"""
generator = PDFGenerator()
assert generator is not None
assert hasattr(generator, 'css_templates')
assert len(generator.css_templates) == 3
assert 'default' in generator.css_templates
assert 'academic' in generator.css_templates
assert 'business' in generator.css_templates
def test_css_templates_have_content(self):
"""Test that CSS templates contain content"""
generator = PDFGenerator()
for template_name, css_content in generator.css_templates.items():
assert isinstance(css_content, str)
assert len(css_content) > 100
assert '@page' in css_content
assert 'body' in css_content
@pytest.mark.unit
class TestPandocAvailability:
"""Test Pandoc availability checking"""
@patch('subprocess.run')
def test_check_pandoc_available_success(self, mock_run):
"""Test Pandoc availability check when pandoc is installed"""
mock_run.return_value = Mock(returncode=0, stdout="pandoc 2.x")
generator = PDFGenerator()
is_available = generator.check_pandoc_available()
assert is_available is True
mock_run.assert_called_once()
assert mock_run.call_args[0][0] == ["pandoc", "--version"]
@patch('subprocess.run')
def test_check_pandoc_available_not_found(self, mock_run):
"""Test Pandoc availability check when pandoc is not installed"""
mock_run.side_effect = FileNotFoundError()
generator = PDFGenerator()
is_available = generator.check_pandoc_available()
assert is_available is False
@patch('subprocess.run')
def test_check_pandoc_available_timeout(self, mock_run):
"""Test Pandoc availability check when command times out"""
mock_run.side_effect = subprocess.TimeoutExpired("pandoc", 5)
generator = PDFGenerator()
is_available = generator.check_pandoc_available()
assert is_available is False
@pytest.mark.unit
class TestPandocPDFGeneration:
"""Test PDF generation using Pandoc"""
@pytest.fixture
def sample_markdown(self, temp_dir):
"""Create a sample Markdown file"""
md_file = temp_dir / "sample.md"
md_file.write_text("# Test Document\n\nThis is a test.", encoding="utf-8")
return md_file
@patch('subprocess.run')
def test_generate_pdf_pandoc_success(self, mock_run, sample_markdown, temp_dir):
"""Test successful PDF generation with Pandoc"""
output_path = temp_dir / "output.pdf"
mock_run.return_value = Mock(returncode=0, stderr="")
# Create the output file to simulate successful generation
output_path.touch()
generator = PDFGenerator()
result = generator.generate_pdf_pandoc(sample_markdown, output_path)
assert result == output_path
assert output_path.exists()
mock_run.assert_called_once()
# Verify pandoc command structure
cmd_args = mock_run.call_args[0][0]
assert "pandoc" in cmd_args
assert str(sample_markdown) in cmd_args
assert str(output_path) in cmd_args
assert "--pdf-engine=weasyprint" in cmd_args
@patch('subprocess.run')
def test_generate_pdf_pandoc_with_metadata(self, mock_run, sample_markdown, temp_dir):
"""Test Pandoc PDF generation with metadata"""
output_path = temp_dir / "output.pdf"
mock_run.return_value = Mock(returncode=0, stderr="")
output_path.touch()
metadata = {
"title": "Test Title",
"author": "Test Author",
"date": "2025-01-01"
}
generator = PDFGenerator()
result = generator.generate_pdf_pandoc(
sample_markdown,
output_path,
metadata=metadata
)
assert result == output_path
# Verify metadata in command
cmd_args = mock_run.call_args[0][0]
assert "--metadata" in cmd_args
assert "title=Test Title" in cmd_args
assert "author=Test Author" in cmd_args
assert "date=2025-01-01" in cmd_args
@patch('subprocess.run')
def test_generate_pdf_pandoc_with_custom_css(self, mock_run, sample_markdown, temp_dir):
"""Test Pandoc PDF generation with custom CSS template"""
output_path = temp_dir / "output.pdf"
mock_run.return_value = Mock(returncode=0, stderr="")
output_path.touch()
generator = PDFGenerator()
result = generator.generate_pdf_pandoc(
sample_markdown,
output_path,
css_template="academic"
)
assert result == output_path
mock_run.assert_called_once()
@patch('subprocess.run')
def test_generate_pdf_pandoc_command_failed(self, mock_run, sample_markdown, temp_dir):
"""Test Pandoc PDF generation when command fails"""
output_path = temp_dir / "output.pdf"
mock_run.return_value = Mock(returncode=1, stderr="Pandoc error message")
generator = PDFGenerator()
with pytest.raises(PDFGenerationError) as exc_info:
generator.generate_pdf_pandoc(sample_markdown, output_path)
assert "Pandoc failed" in str(exc_info.value)
assert "Pandoc error message" in str(exc_info.value)
@patch('subprocess.run')
def test_generate_pdf_pandoc_timeout(self, mock_run, sample_markdown, temp_dir):
"""Test Pandoc PDF generation timeout"""
output_path = temp_dir / "output.pdf"
mock_run.side_effect = subprocess.TimeoutExpired("pandoc", 60)
generator = PDFGenerator()
with pytest.raises(PDFGenerationError) as exc_info:
generator.generate_pdf_pandoc(sample_markdown, output_path)
assert "timed out" in str(exc_info.value).lower()
@patch('subprocess.run')
def test_generate_pdf_pandoc_output_not_created(self, mock_run, sample_markdown, temp_dir):
"""Test when Pandoc command succeeds but output file not created"""
output_path = temp_dir / "output.pdf"
mock_run.return_value = Mock(returncode=0, stderr="")
# Don't create output file
generator = PDFGenerator()
with pytest.raises(PDFGenerationError) as exc_info:
generator.generate_pdf_pandoc(sample_markdown, output_path)
assert "PDF file not created" in str(exc_info.value)
@pytest.mark.unit
class TestWeasyPrintPDFGeneration:
"""Test PDF generation using WeasyPrint directly"""
@pytest.fixture
def sample_markdown(self, temp_dir):
"""Create a sample Markdown file"""
md_file = temp_dir / "sample.md"
md_file.write_text("# Test Document\n\nThis is a test.", encoding="utf-8")
return md_file
@patch('app.services.pdf_generator.HTML')
@patch('app.services.pdf_generator.CSS')
def test_generate_pdf_weasyprint_success(self, mock_css, mock_html, sample_markdown, temp_dir):
"""Test successful PDF generation with WeasyPrint"""
output_path = temp_dir / "output.pdf"
# Mock HTML and CSS objects
mock_html_instance = Mock()
mock_html_instance.write_pdf = Mock()
mock_html.return_value = mock_html_instance
# Create output file to simulate successful generation
def create_pdf(*args, **kwargs):
output_path.touch()
mock_html_instance.write_pdf.side_effect = create_pdf
generator = PDFGenerator()
result = generator.generate_pdf_weasyprint(sample_markdown, output_path)
assert result == output_path
assert output_path.exists()
mock_html.assert_called_once()
mock_css.assert_called_once()
mock_html_instance.write_pdf.assert_called_once()
@patch('app.services.pdf_generator.HTML')
@patch('app.services.pdf_generator.CSS')
def test_generate_pdf_weasyprint_with_metadata(self, mock_css, mock_html, sample_markdown, temp_dir):
"""Test WeasyPrint PDF generation with metadata"""
output_path = temp_dir / "output.pdf"
mock_html_instance = Mock()
mock_html_instance.write_pdf = Mock()
mock_html.return_value = mock_html_instance
def create_pdf(*args, **kwargs):
output_path.touch()
mock_html_instance.write_pdf.side_effect = create_pdf
metadata = {
"title": "Test Title",
"author": "Test Author"
}
generator = PDFGenerator()
result = generator.generate_pdf_weasyprint(
sample_markdown,
output_path,
metadata=metadata
)
assert result == output_path
# Check that HTML string includes title
html_call_args = mock_html.call_args
assert html_call_args[1]['string'] is not None
assert "Test Title" in html_call_args[1]['string']
@patch('app.services.pdf_generator.HTML')
def test_generate_pdf_weasyprint_markdown_conversion(self, mock_html, sample_markdown, temp_dir):
"""Test that Markdown is properly converted to HTML"""
output_path = temp_dir / "output.pdf"
captured_html = None
def capture_html(string, **kwargs):
nonlocal captured_html
captured_html = string
mock_instance = Mock()
mock_instance.write_pdf = Mock(side_effect=lambda *args, **kwargs: output_path.touch())
return mock_instance
mock_html.side_effect = capture_html
generator = PDFGenerator()
generator.generate_pdf_weasyprint(sample_markdown, output_path)
# Verify HTML structure
assert captured_html is not None
assert "<!DOCTYPE html>" in captured_html
assert "<h1>Test Document</h1>" in captured_html
assert "<p>This is a test.</p>" in captured_html
@patch('app.services.pdf_generator.HTML')
@patch('app.services.pdf_generator.CSS')
def test_generate_pdf_weasyprint_with_template(self, mock_css, mock_html, sample_markdown, temp_dir):
"""Test WeasyPrint PDF generation with different templates"""
output_path = temp_dir / "output.pdf"
mock_html_instance = Mock()
mock_html_instance.write_pdf = Mock()
mock_html.return_value = mock_html_instance
def create_pdf(*args, **kwargs):
output_path.touch()
mock_html_instance.write_pdf.side_effect = create_pdf
generator = PDFGenerator()
# Test academic template
generator.generate_pdf_weasyprint(
sample_markdown,
output_path,
css_template="academic"
)
# Verify CSS was called with academic template content
css_call_args = mock_css.call_args
assert css_call_args[1]['string'] is not None
assert "Times New Roman" in css_call_args[1]['string']
@patch('app.services.pdf_generator.HTML')
def test_generate_pdf_weasyprint_error_handling(self, mock_html, sample_markdown, temp_dir):
"""Test WeasyPrint error handling"""
output_path = temp_dir / "output.pdf"
mock_html.side_effect = Exception("WeasyPrint rendering error")
generator = PDFGenerator()
with pytest.raises(PDFGenerationError) as exc_info:
generator.generate_pdf_weasyprint(sample_markdown, output_path)
assert "WeasyPrint PDF generation failed" in str(exc_info.value)
@pytest.mark.unit
class TestUnifiedPDFGeneration:
"""Test unified PDF generation with automatic fallback"""
@pytest.fixture
def sample_markdown(self, temp_dir):
"""Create a sample Markdown file"""
md_file = temp_dir / "sample.md"
md_file.write_text("# Test Document\n\nTest content.", encoding="utf-8")
return md_file
def test_generate_pdf_nonexistent_markdown(self, temp_dir):
"""Test error when Markdown file doesn't exist"""
nonexistent = temp_dir / "nonexistent.md"
output_path = temp_dir / "output.pdf"
generator = PDFGenerator()
with pytest.raises(PDFGenerationError) as exc_info:
generator.generate_pdf(nonexistent, output_path)
assert "not found" in str(exc_info.value).lower()
@patch.object(PDFGenerator, 'check_pandoc_available')
@patch.object(PDFGenerator, 'generate_pdf_pandoc')
def test_generate_pdf_prefers_pandoc(self, mock_pandoc_gen, mock_check, sample_markdown, temp_dir):
"""Test that Pandoc is preferred when available"""
output_path = temp_dir / "output.pdf"
output_path.touch()
mock_check.return_value = True
mock_pandoc_gen.return_value = output_path
generator = PDFGenerator()
result = generator.generate_pdf(sample_markdown, output_path, prefer_pandoc=True)
assert result == output_path
mock_check.assert_called_once()
mock_pandoc_gen.assert_called_once()
@patch.object(PDFGenerator, 'check_pandoc_available')
@patch.object(PDFGenerator, 'generate_pdf_weasyprint')
def test_generate_pdf_uses_weasyprint_when_pandoc_unavailable(
self, mock_weasy_gen, mock_check, sample_markdown, temp_dir
):
"""Test fallback to WeasyPrint when Pandoc unavailable"""
output_path = temp_dir / "output.pdf"
output_path.touch()
mock_check.return_value = False
mock_weasy_gen.return_value = output_path
generator = PDFGenerator()
result = generator.generate_pdf(sample_markdown, output_path, prefer_pandoc=True)
assert result == output_path
mock_check.assert_called_once()
mock_weasy_gen.assert_called_once()
@patch.object(PDFGenerator, 'check_pandoc_available')
@patch.object(PDFGenerator, 'generate_pdf_pandoc')
@patch.object(PDFGenerator, 'generate_pdf_weasyprint')
def test_generate_pdf_fallback_on_pandoc_failure(
self, mock_weasy_gen, mock_pandoc_gen, mock_check, sample_markdown, temp_dir
):
"""Test automatic fallback to WeasyPrint when Pandoc fails"""
output_path = temp_dir / "output.pdf"
output_path.touch()
mock_check.return_value = True
mock_pandoc_gen.side_effect = PDFGenerationError("Pandoc failed")
mock_weasy_gen.return_value = output_path
generator = PDFGenerator()
result = generator.generate_pdf(sample_markdown, output_path, prefer_pandoc=True)
assert result == output_path
mock_pandoc_gen.assert_called_once()
mock_weasy_gen.assert_called_once()
@patch.object(PDFGenerator, 'check_pandoc_available')
@patch.object(PDFGenerator, 'generate_pdf_weasyprint')
def test_generate_pdf_creates_output_directory(
self, mock_weasy_gen, mock_check, sample_markdown, temp_dir
):
"""Test that output directory is created if needed"""
output_dir = temp_dir / "subdir" / "outputs"
output_path = output_dir / "output.pdf"
output_path.parent.mkdir(parents=True, exist_ok=True)
output_path.touch()
mock_check.return_value = False
mock_weasy_gen.return_value = output_path
generator = PDFGenerator()
result = generator.generate_pdf(sample_markdown, output_path)
assert output_dir.exists()
assert result == output_path
@pytest.mark.unit
class TestTemplateManagement:
"""Test CSS template management"""
def test_get_available_templates(self):
"""Test retrieving available templates"""
generator = PDFGenerator()
templates = generator.get_available_templates()
assert isinstance(templates, dict)
assert len(templates) == 3
assert "default" in templates
assert "academic" in templates
assert "business" in templates
# Check descriptions are in Chinese
for desc in templates.values():
assert isinstance(desc, str)
assert len(desc) > 0
def test_save_custom_template(self):
"""Test saving a custom CSS template"""
generator = PDFGenerator()
custom_css = "@page { size: A4; }"
generator.save_custom_template("custom", custom_css)
assert "custom" in generator.css_templates
assert generator.css_templates["custom"] == custom_css
def test_save_custom_template_overwrites_existing(self):
"""Test that saving custom template can overwrite existing"""
generator = PDFGenerator()
new_css = "@page { size: Letter; }"
generator.save_custom_template("default", new_css)
assert generator.css_templates["default"] == new_css
@pytest.mark.unit
class TestEdgeCases:
"""Test edge cases and error handling"""
@pytest.fixture
def sample_markdown(self, temp_dir):
"""Create a sample Markdown file"""
md_file = temp_dir / "sample.md"
md_file.write_text("# Test", encoding="utf-8")
return md_file
@patch('app.services.pdf_generator.HTML')
@patch('app.services.pdf_generator.CSS')
def test_generate_with_unicode_content(self, mock_css, mock_html, temp_dir):
"""Test PDF generation with Unicode/Chinese content"""
md_file = temp_dir / "unicode.md"
md_file.write_text("# 測試文檔\n\n這是中文內容。", encoding="utf-8")
output_path = temp_dir / "output.pdf"
captured_html = None
def capture_html(string, **kwargs):
nonlocal captured_html
captured_html = string
mock_instance = Mock()
mock_instance.write_pdf = Mock(side_effect=lambda *args, **kwargs: output_path.touch())
return mock_instance
mock_html.side_effect = capture_html
generator = PDFGenerator()
result = generator.generate_pdf_weasyprint(md_file, output_path)
assert result == output_path
assert "測試文檔" in captured_html
assert "中文內容" in captured_html
@patch('app.services.pdf_generator.HTML')
@patch('app.services.pdf_generator.CSS')
def test_generate_with_table_markdown(self, mock_css, mock_html, temp_dir):
"""Test PDF generation with Markdown tables"""
md_file = temp_dir / "table.md"
md_content = """
# Document with Table
| Column 1 | Column 2 |
|----------|----------|
| Data 1 | Data 2 |
"""
md_file.write_text(md_content, encoding="utf-8")
output_path = temp_dir / "output.pdf"
captured_html = None
def capture_html(string, **kwargs):
nonlocal captured_html
captured_html = string
mock_instance = Mock()
mock_instance.write_pdf = Mock(side_effect=lambda *args, **kwargs: output_path.touch())
return mock_instance
mock_html.side_effect = capture_html
generator = PDFGenerator()
result = generator.generate_pdf_weasyprint(md_file, output_path)
assert result == output_path
# Markdown tables should be converted to HTML tables
assert "<table>" in captured_html
assert "<th>" in captured_html or "<td>" in captured_html
def test_custom_css_string_not_in_templates(self, sample_markdown, temp_dir):
"""Test using custom CSS string that's not a template name"""
generator = PDFGenerator()
# This should work - treat as custom CSS string
custom_css = "body { font-size: 20pt; }"
# When CSS template is not in templates dict, it should be used as-is
assert custom_css not in generator.css_templates.values()


@@ -1,350 +0,0 @@
"""
Tool_OCR - Document Preprocessor Unit Tests
Tests for app/services/preprocessor.py
"""
import pytest
from pathlib import Path
from PIL import Image
from app.services.preprocessor import DocumentPreprocessor
@pytest.mark.unit
class TestDocumentPreprocessor:
"""Test suite for DocumentPreprocessor"""
def test_init(self, preprocessor):
"""Test preprocessor initialization"""
assert preprocessor is not None
assert preprocessor.max_file_size > 0
assert len(preprocessor.allowed_extensions) > 0
assert 'png' in preprocessor.allowed_extensions
assert 'jpg' in preprocessor.allowed_extensions
assert 'pdf' in preprocessor.allowed_extensions
def test_supported_formats(self, preprocessor):
"""Test that all expected formats are supported"""
expected_image_formats = ['png', 'jpg', 'jpeg', 'bmp', 'tiff', 'tif']
expected_pdf_format = ['pdf']
for fmt in expected_image_formats:
assert fmt in preprocessor.SUPPORTED_IMAGE_FORMATS
for fmt in expected_pdf_format:
assert fmt in preprocessor.SUPPORTED_PDF_FORMAT
all_formats = expected_image_formats + expected_pdf_format
assert set(preprocessor.ALL_SUPPORTED_FORMATS) == set(all_formats)
@pytest.mark.unit
class TestFileValidation:
"""Test file validation methods"""
def test_validate_valid_png(self, preprocessor, sample_image_path):
"""Test validation of a valid PNG file"""
is_valid, file_format, error = preprocessor.validate_file(sample_image_path)
assert is_valid is True
assert file_format == 'png'
assert error is None
def test_validate_valid_jpg(self, preprocessor, sample_jpg_path):
"""Test validation of a valid JPG file"""
is_valid, file_format, error = preprocessor.validate_file(sample_jpg_path)
assert is_valid is True
assert file_format == 'jpg'
assert error is None
def test_validate_valid_pdf(self, preprocessor, sample_pdf_path):
"""Test validation of a valid PDF file"""
is_valid, file_format, error = preprocessor.validate_file(sample_pdf_path)
assert is_valid is True
assert file_format == 'pdf'
assert error is None
def test_validate_nonexistent_file(self, preprocessor, temp_dir):
"""Test validation of a non-existent file"""
fake_path = temp_dir / "nonexistent.png"
is_valid, file_format, error = preprocessor.validate_file(fake_path)
assert is_valid is False
assert file_format is None
assert "not found" in error.lower()
def test_validate_large_file(self, preprocessor, large_file_path):
"""Test validation of a file exceeding size limit"""
is_valid, file_format, error = preprocessor.validate_file(large_file_path)
assert is_valid is False
assert file_format is None
assert "too large" in error.lower()
def test_validate_unsupported_format(self, preprocessor, unsupported_file_path):
"""Test validation of unsupported file format"""
is_valid, file_format, error = preprocessor.validate_file(unsupported_file_path)
assert is_valid is False
assert "not allowed" in error.lower() or "unsupported" in error.lower()
def test_validate_corrupted_image(self, preprocessor, corrupted_image_path):
"""Test validation of a corrupted image file"""
is_valid, file_format, error = preprocessor.validate_file(corrupted_image_path)
assert is_valid is False
assert error is not None
# Corrupted files may be detected as unsupported type or corrupted
assert ("corrupted" in error.lower() or
"unsupported" in error.lower() or
"not allowed" in error.lower())
@pytest.mark.unit
class TestMimeTypeMapping:
"""Test MIME type to format mapping"""
def test_mime_to_format_png(self, preprocessor):
"""Test PNG MIME type mapping"""
assert preprocessor._mime_to_format('image/png') == 'png'
def test_mime_to_format_jpeg(self, preprocessor):
"""Test JPEG MIME type mapping"""
assert preprocessor._mime_to_format('image/jpeg') == 'jpg'
assert preprocessor._mime_to_format('image/jpg') == 'jpg'
def test_mime_to_format_pdf(self, preprocessor):
"""Test PDF MIME type mapping"""
assert preprocessor._mime_to_format('application/pdf') == 'pdf'
def test_mime_to_format_tiff(self, preprocessor):
"""Test TIFF MIME type mapping"""
assert preprocessor._mime_to_format('image/tiff') == 'tiff'
assert preprocessor._mime_to_format('image/x-tiff') == 'tiff'
def test_mime_to_format_bmp(self, preprocessor):
"""Test BMP MIME type mapping"""
assert preprocessor._mime_to_format('image/bmp') == 'bmp'
def test_mime_to_format_unknown(self, preprocessor):
"""Test unknown MIME type returns None"""
assert preprocessor._mime_to_format('unknown/type') is None
assert preprocessor._mime_to_format('text/plain') is None
@pytest.mark.unit
class TestIntegrityValidation:
"""Test file integrity validation"""
def test_validate_integrity_valid_png(self, preprocessor, sample_image_path):
"""Test integrity check for valid PNG"""
is_valid, error = preprocessor._validate_integrity(sample_image_path, 'png')
assert is_valid is True
assert error is None
def test_validate_integrity_valid_jpg(self, preprocessor, sample_jpg_path):
"""Test integrity check for valid JPG"""
is_valid, error = preprocessor._validate_integrity(sample_jpg_path, 'jpg')
assert is_valid is True
assert error is None
def test_validate_integrity_valid_pdf(self, preprocessor, sample_pdf_path):
"""Test integrity check for valid PDF"""
is_valid, error = preprocessor._validate_integrity(sample_pdf_path, 'pdf')
assert is_valid is True
assert error is None
def test_validate_integrity_corrupted_image(self, preprocessor, corrupted_image_path):
"""Test integrity check for corrupted image"""
is_valid, error = preprocessor._validate_integrity(corrupted_image_path, 'png')
assert is_valid is False
assert error is not None
def test_validate_integrity_invalid_pdf_header(self, preprocessor, temp_dir):
"""Test integrity check for PDF with invalid header"""
invalid_pdf = temp_dir / "invalid.pdf"
with open(invalid_pdf, 'wb') as f:
f.write(b'Not a PDF file')
is_valid, error = preprocessor._validate_integrity(invalid_pdf, 'pdf')
assert is_valid is False
assert "invalid" in error.lower() or "header" in error.lower()
def test_validate_integrity_unknown_format(self, preprocessor, temp_dir):
"""Test integrity check for unknown format"""
test_file = temp_dir / "test.xyz"
test_file.write_text("test")
is_valid, error = preprocessor._validate_integrity(test_file, 'xyz')
assert is_valid is False
assert error is not None
@pytest.mark.unit
class TestImagePreprocessing:
"""Test image preprocessing functionality"""
def test_preprocess_image_without_enhancement(self, preprocessor, sample_image_path):
"""Test preprocessing without enhancement (returns original)"""
success, output_path, error = preprocessor.preprocess_image(
sample_image_path,
enhance=False
)
assert success is True
assert output_path == sample_image_path
assert error is None
def test_preprocess_image_with_enhancement(self, preprocessor, sample_image_with_text, temp_dir):
"""Test preprocessing with enhancement"""
output_path = temp_dir / "processed.png"
success, result_path, error = preprocessor.preprocess_image(
sample_image_with_text,
enhance=True,
output_path=output_path
)
assert success is True
assert result_path == output_path
assert result_path.exists()
assert error is None
# Verify the output is a valid image
with Image.open(result_path) as img:
assert img.size[0] > 0
assert img.size[1] > 0
def test_preprocess_image_auto_output_path(self, preprocessor, sample_image_with_text):
"""Test preprocessing with automatic output path"""
success, result_path, error = preprocessor.preprocess_image(
sample_image_with_text,
enhance=True
)
assert success is True
assert result_path is not None
assert result_path.exists()
assert "processed_" in result_path.name
assert error is None
def test_preprocess_nonexistent_image(self, preprocessor, temp_dir):
"""Test preprocessing with non-existent image"""
fake_path = temp_dir / "nonexistent.png"
success, result_path, error = preprocessor.preprocess_image(
fake_path,
enhance=True
)
assert success is False
assert result_path is None
assert error is not None
def test_preprocess_corrupted_image(self, preprocessor, corrupted_image_path):
"""Test preprocessing with corrupted image"""
success, result_path, error = preprocessor.preprocess_image(
corrupted_image_path,
enhance=True
)
assert success is False
assert result_path is None
assert error is not None
@pytest.mark.unit
class TestFileInfo:
"""Test file information retrieval"""
def test_get_file_info_png(self, preprocessor, sample_image_path):
"""Test getting file info for PNG"""
info = preprocessor.get_file_info(sample_image_path)
assert info['name'] == sample_image_path.name
assert info['path'] == str(sample_image_path)
assert info['size'] > 0
assert info['size_mb'] > 0
assert info['mime_type'] == 'image/png'
assert info['format'] == 'png'
assert 'created_at' in info
assert 'modified_at' in info
def test_get_file_info_jpg(self, preprocessor, sample_jpg_path):
"""Test getting file info for JPG"""
info = preprocessor.get_file_info(sample_jpg_path)
assert info['name'] == sample_jpg_path.name
assert info['mime_type'] == 'image/jpeg'
assert info['format'] == 'jpg'
def test_get_file_info_pdf(self, preprocessor, sample_pdf_path):
"""Test getting file info for PDF"""
info = preprocessor.get_file_info(sample_pdf_path)
assert info['name'] == sample_pdf_path.name
assert info['mime_type'] == 'application/pdf'
assert info['format'] == 'pdf'
def test_get_file_info_size_calculation(self, preprocessor, sample_image_path):
"""Test that file size is correctly calculated"""
info = preprocessor.get_file_info(sample_image_path)
actual_size = sample_image_path.stat().st_size
assert info['size'] == actual_size
assert abs(info['size_mb'] - (actual_size / (1024 * 1024))) < 0.001
@pytest.mark.unit
class TestEdgeCases:
"""Test edge cases and error handling"""
def test_validate_empty_file(self, preprocessor, temp_dir):
"""Test validation of empty file"""
empty_file = temp_dir / "empty.png"
empty_file.touch()
is_valid, file_format, error = preprocessor.validate_file(empty_file)
# Should fail because empty file has no valid MIME type or is corrupted
assert is_valid is False
def test_validate_file_with_wrong_extension(self, preprocessor, temp_dir):
"""Test validation of file with misleading extension"""
# Create a PNG file but name it .txt
misleading_file = temp_dir / "image.txt"
img = Image.new('RGB', (10, 10), color='white')
img.save(misleading_file, 'PNG')
# Validation uses MIME detection, not extension
# So a PNG file named .txt should pass if PNG is in allowed_extensions
is_valid, file_format, error = preprocessor.validate_file(misleading_file)
# Should succeed because MIME detection finds it's a PNG
# (preprocessor uses magic number detection, not file extension)
assert is_valid is True
assert file_format == 'png'
def test_preprocess_very_small_image(self, preprocessor, temp_dir):
"""Test preprocessing of very small image"""
small_image = temp_dir / "small.png"
img = Image.new('RGB', (5, 5), color='white')
img.save(small_image, 'PNG')
success, result_path, error = preprocessor.preprocess_image(
small_image,
enhance=True
)
# Should succeed even with very small image
assert success is True
assert result_path is not None
assert result_path.exists()

View File

@@ -1,106 +0,0 @@
"""
Unit tests for task management endpoints
"""
import pytest
from app.models.task import Task
class TestTasks:
"""Test task management endpoints"""
def test_create_task(self, client, auth_token):
"""Test task creation"""
response = client.post(
'/api/v2/tasks/',
headers={'Authorization': f'Bearer {auth_token}'},
json={
'filename': 'test.pdf',
'file_type': 'application/pdf'
}
)
assert response.status_code == 201
data = response.json()
assert 'task_id' in data
assert data['filename'] == 'test.pdf'
assert data['status'] == 'pending'
def test_list_tasks(self, client, auth_token, test_task):
"""Test listing user tasks"""
response = client.get(
'/api/v2/tasks/',
headers={'Authorization': f'Bearer {auth_token}'}
)
assert response.status_code == 200
data = response.json()
assert 'tasks' in data
assert 'total' in data
assert len(data['tasks']) > 0
def test_get_task(self, client, auth_token, test_task):
"""Test get single task"""
response = client.get(
f'/api/v2/tasks/{test_task.task_id}',
headers={'Authorization': f'Bearer {auth_token}'}
)
assert response.status_code == 200
data = response.json()
assert data['task_id'] == test_task.task_id
def test_get_task_stats(self, client, auth_token, test_task):
"""Test get task statistics"""
response = client.get(
'/api/v2/tasks/stats',
headers={'Authorization': f'Bearer {auth_token}'}
)
assert response.status_code == 200
data = response.json()
assert 'total' in data
assert 'pending' in data
assert 'processing' in data
assert 'completed' in data
assert 'failed' in data
def test_delete_task(self, client, auth_token, test_task):
"""Test task deletion"""
response = client.delete(
f'/api/v2/tasks/{test_task.task_id}',
headers={'Authorization': f'Bearer {auth_token}'}
)
# DELETE should return 204 No Content (standard for successful deletion)
assert response.status_code == 204
def test_user_isolation(self, client, db, test_user):
"""Test that users can only access their own tasks"""
# Create another user
from app.models.user import User
other_user = User(email="other@example.com", display_name="Other User")
db.add(other_user)
db.commit()
# Create task for other user
other_task = Task(
user_id=other_user.id,
task_id="other-task-123",
filename="other.pdf",
status="pending"
)
db.add(other_task)
db.commit()
# Create token for test_user
from app.core.security import create_access_token
token = create_access_token({"sub": str(test_user.id)})
# Try to access other user's task
response = client.get(
f'/api/v2/tasks/{other_task.task_id}',
headers={'Authorization': f'Bearer {token}'}
)
assert response.status_code == 404 # Task not found (user isolation)

View File

@@ -1,100 +0,0 @@
#!/usr/bin/env python3
import zipfile
from pathlib import Path
# Create a minimal DOCX file
output_path = Path('/Users/egg/Projects/Tool_OCR/demo_docs/office_tests/test_document.docx')
# DOCX is a ZIP file containing XML files
with zipfile.ZipFile(output_path, 'w', zipfile.ZIP_DEFLATED) as docx:
# [Content_Types].xml
content_types = '''<?xml version="1.0" encoding="UTF-8" standalone="yes"?>
<Types xmlns="http://schemas.openxmlformats.org/package/2006/content-types">
<Default Extension="rels" ContentType="application/vnd.openxmlformats-package.relationships+xml"/>
<Default Extension="xml" ContentType="application/xml"/>
<Override PartName="/word/document.xml" ContentType="application/vnd.openxmlformats-officedocument.wordprocessingml.document.main+xml"/>
</Types>'''
docx.writestr('[Content_Types].xml', content_types)
# _rels/.rels
rels = '''<?xml version="1.0" encoding="UTF-8" standalone="yes"?>
<Relationships xmlns="http://schemas.openxmlformats.org/package/2006/relationships">
<Relationship Id="rId1" Type="http://schemas.openxmlformats.org/officeDocument/2006/relationships/officeDocument" Target="word/document.xml"/>
</Relationships>'''
docx.writestr('_rels/.rels', rels)
# word/document.xml with Chinese and English content
document = '''<?xml version="1.0" encoding="UTF-8" standalone="yes"?>
<w:document xmlns:w="http://schemas.openxmlformats.org/wordprocessingml/2006/main">
<w:body>
<w:p>
<w:pPr><w:pStyle w:val="Heading1"/></w:pPr>
<w:r><w:t>Office Document OCR Test</w:t></w:r>
</w:p>
<w:p>
<w:pPr><w:pStyle w:val="Heading2"/></w:pPr>
<w:r><w:t>測試文件說明</w:t></w:r>
</w:p>
<w:p>
<w:r><w:t>這是一個用於測試 Tool_OCR 系統 Office 文件支援功能的測試文件。</w:t></w:r>
</w:p>
<w:p>
<w:r><w:t>本系統現已支援以下 Office 格式:</w:t></w:r>
</w:p>
<w:p>
<w:r><w:t>• Microsoft Word: DOC, DOCX</w:t></w:r>
</w:p>
<w:p>
<w:r><w:t>• Microsoft PowerPoint: PPT, PPTX</w:t></w:r>
</w:p>
<w:p>
<w:pPr><w:pStyle w:val="Heading2"/></w:pPr>
<w:r><w:t>處理流程</w:t></w:r>
</w:p>
<w:p>
<w:r><w:t>Office 文件的處理流程如下:</w:t></w:r>
</w:p>
<w:p>
<w:r><w:t>1. 使用 LibreOffice 將 Office 文件轉換為 PDF</w:t></w:r>
</w:p>
<w:p>
<w:r><w:t>2. 將 PDF 轉換為圖片(每頁一張)</w:t></w:r>
</w:p>
<w:p>
<w:r><w:t>3. 使用 PaddleOCR 處理每張圖片</w:t></w:r>
</w:p>
<w:p>
<w:r><w:t>4. 合併所有頁面的 OCR 結果</w:t></w:r>
</w:p>
<w:p>
<w:pPr><w:pStyle w:val="Heading2"/></w:pPr>
<w:r><w:t>中英混合測試</w:t></w:r>
</w:p>
<w:p>
<w:r><w:t>This is a test for mixed Chinese and English OCR recognition.</w:t></w:r>
</w:p>
<w:p>
<w:r><w:t>測試中英文混合識別能力1234567890</w:t></w:r>
</w:p>
<w:p>
<w:pPr><w:pStyle w:val="Heading2"/></w:pPr>
<w:r><w:t>Technical Information</w:t></w:r>
</w:p>
<w:p>
<w:r><w:t>System Version: Tool_OCR v1.0</w:t></w:r>
</w:p>
<w:p>
<w:r><w:t>Conversion Engine: LibreOffice Headless</w:t></w:r>
</w:p>
<w:p>
<w:r><w:t>OCR Engine: PaddleOCR</w:t></w:r>
</w:p>
<w:p>
<w:r><w:t>Token Validity: 24 hours (1440 minutes)</w:t></w:r>
</w:p>
</w:body>
</w:document>'''
docx.writestr('word/document.xml', document)
print(f"Created DOCX file: {output_path}")
print(f"File size: {output_path.stat().st_size} bytes")

View File

@@ -1,64 +0,0 @@
<!DOCTYPE html>
<html>
<head>
<meta charset="UTF-8">
<title>Office Document OCR Test</title>
</head>
<body>
<h1>Office Document OCR Test</h1>
<h2>測試文件說明</h2>
<p>這是一個用於測試 Tool_OCR 系統 Office 文件支援功能的測試文件。</p>
<p>本系統現已支援以下 Office 格式:</p>
<ul>
<li>Microsoft Word: DOC, DOCX</li>
<li>Microsoft PowerPoint: PPT, PPTX</li>
</ul>
<h2>處理流程</h2>
<p>Office 文件的處理流程如下:</p>
<ol>
<li>使用 LibreOffice 將 Office 文件轉換為 PDF</li>
<li>將 PDF 轉換為圖片(每頁一張)</li>
<li>使用 PaddleOCR 處理每張圖片</li>
<li>合併所有頁面的 OCR 結果</li>
</ol>
<h2>測試數據表格</h2>
<table border="1" cellpadding="5">
<tr>
<th>格式</th>
<th>副檔名</th>
<th>支援狀態</th>
</tr>
<tr>
<td>Word 新版</td>
<td>.docx</td>
<td>✓ 支援</td>
</tr>
<tr>
<td>Word 舊版</td>
<td>.doc</td>
<td>✓ 支援</td>
</tr>
<tr>
<td>PowerPoint 新版</td>
<td>.pptx</td>
<td>✓ 支援</td>
</tr>
<tr>
<td>PowerPoint 舊版</td>
<td>.ppt</td>
<td>✓ 支援</td>
</tr>
</table>
<h2>中英混合測試</h2>
<p>This is a test for mixed Chinese and English OCR recognition.</p>
<p>測試中英文混合識別能力1234567890</p>
<h2>特殊字符測試</h2>
<p>符號測試:!@#$%^&*()_+-=[]{}|;:',.<>?/</p>
<p>數學符號:± × ÷ √ ∞ ≈ ≠ ≤ ≥</p>
</body>
</html>

View File

@@ -1,178 +0,0 @@
#!/usr/bin/env python3
"""
Test script for Office document processing
"""
import json
import requests
from pathlib import Path
import time
API_BASE = "http://localhost:12010/api/v1"
USERNAME = "admin"
PASSWORD = "admin123"
def login():
"""Login and get JWT token"""
print("Step 1: Logging in...")
response = requests.post(
f"{API_BASE}/auth/login",
json={"username": USERNAME, "password": PASSWORD}
)
response.raise_for_status()
data = response.json()
token = data["access_token"]
print(f"✓ Login successful. Token expires in: {data['expires_in']} seconds ({data['expires_in']//3600} hours)")
return token
def upload_file(token, file_path):
"""Upload file and create batch"""
print(f"\nStep 2: Uploading file: {file_path.name}...")
with open(file_path, 'rb') as f:
files = {'files': (file_path.name, f, 'application/vnd.openxmlformats-officedocument.wordprocessingml.document')}
response = requests.post(
f"{API_BASE}/upload",
headers={"Authorization": f"Bearer {token}"},
files=files,
data={"batch_name": "Office Document Test"}
)
response.raise_for_status()
result = response.json()
print(f"✓ File uploaded and batch created:")
print(f" Batch ID: {result['id']}")
print(f" Total files: {result['total_files']}")
print(f" Status: {result['status']}")
return result['id']
def trigger_ocr(token, batch_id):
"""Trigger OCR processing"""
print(f"\nStep 3: Triggering OCR processing...")
response = requests.post(
f"{API_BASE}/ocr/process",
headers={"Authorization": f"Bearer {token}"},
json={
"batch_id": batch_id,
"lang": "ch",
"detect_layout": True
}
)
response.raise_for_status()
result = response.json()
print(f"✓ OCR processing started")
print(f" Message: {result['message']}")
print(f" Total files: {result['total_files']}")
def check_status(token, batch_id):
"""Check processing status"""
print(f"\nStep 4: Checking processing status...")
max_wait = 120 # 120 seconds max
waited = 0
while waited < max_wait:
response = requests.get(
f"{API_BASE}/batch/{batch_id}/status",
headers={"Authorization": f"Bearer {token}"}
)
response.raise_for_status()
data = response.json()
batch_status = data['batch']['status']
progress = data['batch']['progress_percentage']
file_status = data['files'][0]['status']
print(f" Batch status: {batch_status}, Progress: {progress}%, File status: {file_status}")
if batch_status == 'completed':
print(f"\n✓ Processing completed!")
file_data = data['files'][0]
if 'processing_time' in file_data:
print(f" Processing time: {file_data['processing_time']:.2f} seconds")
return data
elif batch_status == 'failed':
print(f"\n✗ Processing failed!")
print(f" Error: {data['files'][0].get('error_message', 'Unknown error')}")
return data
time.sleep(5)
waited += 5
print(f"\n⚠ Timeout waiting for processing (waited {waited}s)")
return None
def get_result(token, file_id):
"""Get OCR result"""
print(f"\nStep 5: Getting OCR result...")
response = requests.get(
f"{API_BASE}/ocr/result/{file_id}",
headers={"Authorization": f"Bearer {token}"}
)
response.raise_for_status()
data = response.json()
file_info = data['file']
result = data.get('result')
print(f"✓ OCR Result retrieved:")
print(f" File: {file_info['original_filename']}")
print(f" Status: {file_info['status']}")
if result:
print(f" Language: {result.get('detected_language', 'N/A')}")
print(f" Total text regions: {result.get('total_text_regions', 0)}")
print(f" Average confidence: {result.get('average_confidence', 0):.2%}")
# Read markdown file if available
if result.get('markdown_path'):
try:
with open(result['markdown_path'], 'r', encoding='utf-8') as f:
markdown_content = f.read()
print(f"\n Markdown preview (first 300 chars):")
print(f" {'-'*60}")
print(f" {markdown_content[:300]}...")
print(f" {'-'*60}")
except Exception as e:
print(f" Could not read markdown file: {e}")
else:
print(f" No OCR result available yet")
return data
def main():
try:
# Test file
test_file = Path('/Users/egg/Projects/Tool_OCR/demo_docs/office_tests/test_document.docx')
if not test_file.exists():
print(f"✗ Test file not found: {test_file}")
return
print("="*70)
print("Office Document Processing Test")
print("="*70)
print(f"Test file: {test_file.name} ({test_file.stat().st_size} bytes)")
print("="*70)
# Run test
token = login()
batch_id = upload_file(token, test_file)
trigger_ocr(token, batch_id)
status_data = check_status(token, batch_id)
if status_data and status_data['batch']['status'] == 'completed':
file_id = status_data['files'][0]['id']
result = get_result(token, file_id)
print("\n" + "="*70)
print("✓ TEST PASSED: Office document processing successful!")
print("="*70)
else:
print("\n" + "="*70)
print("✗ TEST FAILED: Processing did not complete successfully")
print("="*70)
except Exception as e:
print(f"\n✗ TEST ERROR: {str(e)}")
import traceback
traceback.print_exc()
if __name__ == "__main__":
main()

View File

@@ -0,0 +1,817 @@
# Tool_OCR Architecture Overhaul Plan
## Refactoring Plan Based on the Full Capabilities of PaddleOCR PP-StructureV3
**Planning date**: 2025-01-18
**Hardware**: RTX 4060, 8GB VRAM
**Priority**: P0 (highest)
---
## 📊 Current State Analysis
### Problems with the Current Architecture
#### 1. **PP-StructureV3 capabilities are largely wasted**
```python
# ❌ Current implementation (ocr_service.py:614-646)
markdown_dict = page_result.markdown  # only the simplified view is used
markdown_texts = markdown_dict.get('markdown_texts', '')
'bbox': [],  # coordinates are all empty!
```
**Problems**:
- Only ~20% of PP-StructureV3's functionality is used
- `parsing_res_list` (the core data structure) is unused
- `layout_bbox` (precise coordinates) is unused
- `reading_order` is unused
- The 23 layout element classes are unused
#### 2. **GPU configuration is not optimized**
```python
# Current configuration (ocr_service.py:211-219)
self.structure_engine = PPStructureV3(
    use_doc_orientation_classify=False,  # ❌ preprocessing disabled
    use_doc_unwarping=False,             # ❌ unwarping disabled
    use_textline_orientation=False,      # ❌ text-line orientation disabled
    # ... default settings otherwise
)
```
**Problems**:
- An RTX 4060 8GB can run the server models, yet the defaults are used
- Important preprocessing features are turned off
- GPU compute is underutilized
#### 3. **Single PDF generation strategy**
```python
# Only the coordinate-positioning mode exists today,
# which loses 21.6% of the text (overlap filtering)
filtered_text_regions = self._filter_text_in_regions(text_regions, regions_to_avoid)
```
**Problems**:
- Only coordinate positioning is supported, no flow layout
- Zero information loss is impossible
- Translation features are constrained
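The text loss comes from the overlap filter itself: any OCR text region whose box intersects a detected layout region is dropped wholesale. A minimal stdlib sketch of that failure mode (helper names and data here are illustrative, not the project's actual `_filter_text_in_regions`):

```python
def boxes_overlap(a, b):
    """Axis-aligned overlap test for [x1, y1, x2, y2] boxes."""
    return not (a[2] <= b[0] or b[2] <= a[0] or a[3] <= b[1] or b[3] <= a[1])

def filter_text_in_regions(text_regions, regions_to_avoid):
    """Drop any text region that touches an avoided region -- this is how text gets lost."""
    return [t for t in text_regions
            if not any(boxes_overlap(t["bbox"], r) for r in regions_to_avoid)]

texts = [
    {"text": "caption inside the figure", "bbox": [10, 10, 90, 20]},
    {"text": "body paragraph", "bbox": [10, 200, 300, 230]},
]
figure_boxes = [[0, 0, 100, 100]]  # one detected figure region
kept = filter_text_in_regions(texts, figure_boxes)
# the caption overlaps the figure box and is silently discarded
print([t["text"] for t in kept])
```

The flow-layout mode proposed below avoids this entirely by never filtering: every element is emitted in reading order instead of being placed (and deduplicated) by coordinates.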
---
## 🎯 Refactoring Goals
### Core Goals
1. **Fully exploit PP-StructureV3's capabilities**
   - Extract `parsing_res_list` (23 element classes + reading order)
   - Extract `layout_bbox` (precise coordinates)
   - Extract `layout_det_res` (layout-detection details)
   - Extract `overall_ocr_res` (coordinates for all text)
2. **Dual-mode PDF generation**
   - Mode A: coordinate positioning (faithful layout reproduction)
   - Mode B: flow layout (zero information loss, translation-friendly)
3. **GPU configuration tuning**
   - Optimal settings for an RTX 4060 8GB
   - Server models + all feature modules
   - Sensible memory management
4. **Backward compatibility**
   - Keep the existing API
   - Old JSON files remain usable
   - Incremental upgrade
## 🏗️ New Architecture Design
### Layers
```
┌──────────────────────────────────────────────────────┐
│                     API Layer                        │
│   /tasks, /results, /download (backward compatible)  │
└────────────────┬─────────────────────────────────────┘
┌────────────────▼─────────────────────────────────────┐
│                   Service Layer                      │
├──────────────────────────────────────────────────────┤
│  OCRService (existing, kept)                         │
│    └─ analyze_layout() [upgraded] ──┐                │
│                                     │                │
│  AdvancedLayoutExtractor (new) ◄─ shares one engine  │
│    └─ extract_complete_layout() ─┘                   │
│                                                      │
│  PDFGeneratorService (refactored)                    │
│    ├─ generate_coordinate_pdf()  [Mode A]            │
│    └─ generate_flow_pdf()        [Mode B]            │
└────────────────┬─────────────────────────────────────┘
┌────────────────▼─────────────────────────────────────┐
│                   Engine Layer                       │
├──────────────────────────────────────────────────────┤
│  PPStructureV3Engine (new, single point of management)│
│    ├─ GPU config (tuned for RTX 4060 8GB)            │
│    ├─ Model config (server models)                   │
│    └─ Feature switches (all enabled)                 │
└──────────────────────────────────────────────────────┘
```
### Core Class Design
#### 1. PPStructureV3Engine (new)
**Purpose**: manage the PP-StructureV3 engine in one place and avoid repeated initialization
```python
class PPStructureV3Engine:
    """
    PP-StructureV3 engine manager (singleton),
    configured for an RTX 4060 with 8GB of VRAM
    """
    _instance = None
    def __new__(cls):
        if cls._instance is None:
            cls._instance = super().__new__(cls)
            cls._instance._initialize()
        return cls._instance
    def _initialize(self):
        """Initialize the engine"""
        logger.info("Initializing PP-StructureV3 with RTX 4060 8GB optimized config")
        self.engine = PPStructureV3(
            # ===== GPU configuration =====
            use_gpu=True,
            gpu_mem=6144,  # leave 2GB for the system (8GB - 2GB)
            # ===== Preprocessing modules (all enabled) =====
            use_doc_orientation_classify=True,  # document orientation correction
            use_doc_unwarping=True,             # document image unwarping
            use_textline_orientation=True,      # text-line orientation correction
            # ===== Feature modules (all enabled) =====
            use_table_recognition=True,    # table recognition
            use_formula_recognition=True,  # formula recognition
            use_chart_recognition=True,    # chart recognition
            use_seal_recognition=True,     # seal recognition
            # ===== OCR model configuration (server models) =====
            text_detection_model_name="ch_PP-OCRv4_server_det",
            text_recognition_model_name="ch_PP-OCRv4_server_rec",
            # ===== Layout-detection parameters =====
            layout_threshold=0.5,     # layout detection threshold
            layout_nms=0.5,           # NMS threshold
            layout_unclip_ratio=1.5,  # bounding-box expansion ratio
            # ===== OCR parameters =====
            text_det_limit_side_len=1920,  # high-resolution detection
            text_det_thresh=0.3,           # detection threshold
            text_det_box_thresh=0.5,       # box threshold
            # ===== Misc =====
            show_log=True,
            use_angle_cls=False,  # superseded by textline_orientation
        )
        logger.info("PP-StructureV3 engine initialized successfully")
        logger.info(f"  - GPU: Enabled (RTX 4060 8GB)")
        logger.info(f"  - Models: Server (High Accuracy)")
        logger.info(f"  - Features: All Enabled (Table/Formula/Chart/Seal)")
    def predict(self, image_path: str):
        """Run prediction"""
        return self.engine.predict(image_path)
    def get_engine(self):
        """Return the engine instance"""
        return self.engine
```
#### 2. AdvancedLayoutExtractor (new)
**Purpose**: extract the full set of layout information PP-StructureV3 produces
```python
class AdvancedLayoutExtractor:
    """
    Advanced layout extractor.
    Makes full use of PP-StructureV3's parsing_res_list, layout_bbox, and layout_det_res
    """
    def __init__(self):
        self.engine = PPStructureV3Engine()
    def extract_complete_layout(
        self,
        image_path: Path,
        output_dir: Optional[Path] = None,
        current_page: int = 0
    ) -> Tuple[Optional[Dict], List[Dict]]:
        """
        Extract complete layout information (via page_result.json)
        Returns:
            (layout_data, images_metadata)
            layout_data = {
                "elements": [
                    {
                        "element_id": int,
                        "type": str,              # one of the 23 types
                        "bbox": [[x1,y1], [x2,y1], [x2,y2], [x1,y2]],  # ✅ no longer empty
                        "content": str,
                        "reading_order": int,     # ✅ reading order
                        "layout_type": str,       # ✅ single/double/multi-column
                        "confidence": float,      # ✅ confidence score
                        "page": int
                    },
                    ...
                ],
                "reading_order": [0, 1, 2, ...],
                "layout_types": ["single", "double"],
                "total_elements": int
            }
        """
        try:
            results = self.engine.predict(str(image_path))
            layout_elements = []
            images_metadata = []
            for page_idx, page_result in enumerate(results):
                # ✅ Key change: use page_result.json instead of page_result.markdown
                json_data = page_result.json
                # ===== Method 1: parsing_res_list (primary source) =====
                parsing_res_list = json_data.get('parsing_res_list', [])
                if parsing_res_list:
                    logger.info(f"Found {len(parsing_res_list)} elements in parsing_res_list")
                    for idx, item in enumerate(parsing_res_list):
                        element = self._create_element_from_parsing_res(
                            item, idx, current_page
                        )
                        if element:
                            layout_elements.append(element)
                # ===== Method 2: layout_det_res (supplementary info) =====
                layout_det_res = json_data.get('layout_det_res', {})
                layout_boxes = layout_det_res.get('boxes', [])
                # Enrich elements where parsing_res_list lacks certain fields
                self._enrich_elements_with_layout_det(layout_elements, layout_boxes)
                # ===== Method 3: images (from markdown_images) =====
                markdown_dict = page_result.markdown
                markdown_images = markdown_dict.get('markdown_images', {})
                for img_idx, (img_path, img_obj) in enumerate(markdown_images.items()):
                    # Save the image to disk
                    self._save_image(img_obj, img_path, output_dir or image_path.parent)
                    # Look up the bbox in parsing_res_list or layout_det_res
                    bbox = self._find_image_bbox(
                        img_path, parsing_res_list, layout_boxes
                    )
                    images_metadata.append({
                        'element_id': len(layout_elements) + img_idx,
                        'image_path': img_path,
                        'type': 'image',
                        'page': current_page,
                        'bbox': bbox,
                    })
            if layout_elements:
                layout_data = {
                    'elements': layout_elements,
                    'total_elements': len(layout_elements),
                    'reading_order': [e['reading_order'] for e in layout_elements],
                    'layout_types': list(set(e.get('layout_type') for e in layout_elements)),
                }
                logger.info(f"✅ Extracted {len(layout_elements)} elements with complete info")
                return layout_data, images_metadata
            else:
                logger.warning("No layout elements found")
                return None, []
        except Exception as e:
            logger.error(f"Advanced layout extraction failed: {e}")
            import traceback
            traceback.print_exc()
            return None, []
    def _create_element_from_parsing_res(
        self, item: Dict, idx: int, current_page: int
    ) -> Optional[Dict]:
        """Build an element from one parsing_res_list item"""
        # Extract layout_bbox
        layout_bbox = item.get('layout_bbox')
        bbox = self._convert_bbox_to_4point(layout_bbox)
        # Extract the layout type
        layout_type = item.get('layout', 'single')
        # Build the base element
        element = {
            'element_id': idx,
            'page': current_page,
            'bbox': bbox,  # ✅ full coordinates
            'layout_type': layout_type,
            'reading_order': idx,
            'confidence': item.get('score', 0.0),
        }
        # Fill type and content based on the payload.
        # Order matters! Priority: table > formula > image > title > text
        if 'table' in item and item['table']:
            element['type'] = 'table'
            element['content'] = item['table']
            # Extract plain table text (for translation)
            element['extracted_text'] = self._extract_table_text(item['table'])
        elif 'formula' in item and item['formula']:
            element['type'] = 'formula'
            element['content'] = item['formula']  # LaTeX
        elif 'figure' in item or 'image' in item:
            element['type'] = 'image'
            element['content'] = item.get('figure') or item.get('image')
        elif 'title' in item and item['title']:
            element['type'] = 'title'
            element['content'] = item['title']
        elif 'text' in item and item['text']:
            element['type'] = 'text'
            element['content'] = item['text']
        else:
            # Unknown type: take any non-system field that has a value
            for key, value in item.items():
                if key not in ['layout_bbox', 'layout', 'score'] and value:
                    element['type'] = key
                    element['content'] = value
                    break
            else:
                return None  # no content, skip
        return element
    def _convert_bbox_to_4point(self, layout_bbox) -> List:
        """Convert layout_bbox to the 4-point format"""
        if layout_bbox is None:
            return []
        # Handle numpy arrays
        if hasattr(layout_bbox, 'tolist'):
            bbox = layout_bbox.tolist()
        else:
            bbox = list(layout_bbox)
        if len(bbox) == 4:  # [x1, y1, x2, y2]
            x1, y1, x2, y2 = bbox
            return [[x1, y1], [x2, y1], [x2, y2], [x1, y2]]
        return []
    def _extract_table_text(self, html_content: str) -> str:
        """Extract plain text from an HTML table (for translation)"""
        try:
            from bs4 import BeautifulSoup
            soup = BeautifulSoup(html_content, 'html.parser')
            # Collect the text of every cell
            cells = []
            for cell in soup.find_all(['td', 'th']):
                text = cell.get_text(strip=True)
                if text:
                    cells.append(text)
            return ' | '.join(cells)
        except Exception as e:
            logger.warning(f"Failed to extract table text: {e}")
            # Fallback: crude HTML tag stripping
            import re
            text = re.sub(r'<[^>]+>', ' ', html_content)
            text = re.sub(r'\s+', ' ', text)
            return text.strip()
```
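The `_enrich_elements_with_layout_det()` and `_find_image_bbox()` helpers referenced above are not spelled out in this plan. A stdlib-only sketch of the enrichment step, matching elements to detection boxes by center containment (the `coordinate`/`label`/`score` field names follow the layout_det_res structure described here; the center-containment matching rule is an assumption, not the final design):

```python
def _center(bbox4):
    """Center point of a 4-point bbox [[x1,y1],[x2,y1],[x2,y2],[x1,y2]]."""
    xs = [p[0] for p in bbox4]
    ys = [p[1] for p in bbox4]
    return (sum(xs) / len(xs), sum(ys) / len(ys))

def enrich_elements_with_layout_det(elements, layout_boxes):
    """Copy label/score from layout_det_res boxes onto elements whose
    bbox center falls inside the detected box (assumed matching rule)."""
    for elem in elements:
        if not elem.get("bbox"):
            continue
        cx, cy = _center(elem["bbox"])
        for box in layout_boxes:
            x1, y1, x2, y2 = box["coordinate"]  # assumed layout_det_res box format
            if x1 <= cx <= x2 and y1 <= cy <= y2:
                elem.setdefault("det_label", box.get("label"))
                if not elem.get("confidence"):
                    elem["confidence"] = box.get("score", 0.0)
                break
    return elements

elements = [{"element_id": 0, "bbox": [[10, 10], [90, 10], [90, 40], [10, 40]],
             "type": "text", "confidence": 0.0}]
layout_boxes = [{"coordinate": [0, 0, 100, 50], "label": "text", "score": 0.98}]
enrich_elements_with_layout_det(elements, layout_boxes)
print(elements[0]["det_label"], elements[0]["confidence"])
```

`_find_image_bbox()` could reuse the same containment test in reverse: find the detected `image`/`figure` box whose region contains the saved image's placement.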
#### 3. PDFGeneratorService (refactored)
**Purpose**: support dual-mode PDF generation
```python
from reportlab.pdfgen import canvas

class PDFGeneratorService:
    """
    PDF generation service (refactored).
    Two modes:
    - coordinate: coordinate-positioning mode (faithful layout reproduction)
    - flow: flow-layout mode (zero information loss, translation-friendly)
    """
    def generate_pdf(
        self,
        json_path: Path,
        output_path: Path,
        mode: str = 'coordinate',  # 'coordinate' or 'flow'
        source_file_path: Optional[Path] = None
    ) -> bool:
        """
        Generate a PDF.
        Args:
            json_path: path to the OCR JSON file
            output_path: output PDF path
            mode: generation mode ('coordinate' or 'flow')
            source_file_path: original file path (used to read dimensions)
        Returns:
            True on success
        """
        try:
            # Load the OCR data
            ocr_data = self.load_ocr_json(json_path)
            if not ocr_data:
                return False
            # Pick the generation strategy by mode
            if mode == 'flow':
                return self._generate_flow_pdf(ocr_data, output_path)
            else:
                return self._generate_coordinate_pdf(ocr_data, output_path, source_file_path)
        except Exception as e:
            logger.error(f"PDF generation failed: {e}")
            import traceback
            traceback.print_exc()
            return False
    def _generate_coordinate_pdf(
        self,
        ocr_data: Dict,
        output_path: Path,
        source_file_path: Optional[Path]
    ) -> bool:
        """
        Mode A: coordinate positioning
        - Uses layout_bbox to place every element precisely
        - Preserves the visual appearance of the original document
        - For scenarios that need faithful layout reproduction
        """
        logger.info("Generating PDF in COORDINATE mode (layout-preserving)")
        # Extract the data
        layout_data = ocr_data.get('layout_data', {})
        elements = layout_data.get('elements', [])
        if not elements:
            logger.warning("No layout elements found")
            return False
        # Sort by page and reading_order
        sorted_elements = sorted(elements, key=lambda x: (
            x.get('page', 0),
            x.get('reading_order', 0)
        ))
        # Compute page dimensions
        ocr_width, ocr_height = self.calculate_page_dimensions(ocr_data, source_file_path)
        target_width, target_height = self._get_target_dimensions(source_file_path, ocr_width, ocr_height)
        scale_w = target_width / ocr_width
        scale_h = target_height / ocr_height
        # Create the PDF canvas
        pdf_canvas = canvas.Canvas(str(output_path), pagesize=(target_width, target_height))
        # Group elements by page
        pages = {}
        for elem in sorted_elements:
            page = elem.get('page', 0)
            if page not in pages:
                pages[page] = []
            pages[page].append(elem)
        # Render each page
        for page_num, page_elements in sorted(pages.items()):
            if page_num > 0:
                pdf_canvas.showPage()
            logger.info(f"Rendering page {page_num + 1} with {len(page_elements)} elements")
            # Render each element in reading order
            for elem in page_elements:
                bbox = elem.get('bbox', [])
                elem_type = elem.get('type')
                content = elem.get('content', '')
                if not bbox:
                    logger.warning(f"Element {elem['element_id']} has no bbox, skipping")
                    continue
                # Render by type
                try:
                    if elem_type == 'table':
                        self._draw_table_at_bbox(pdf_canvas, content, bbox, target_height, scale_w, scale_h)
                    elif elem_type == 'text':
                        self._draw_text_at_bbox(pdf_canvas, content, bbox, target_height, scale_w, scale_h)
                    elif elem_type == 'title':
                        self._draw_title_at_bbox(pdf_canvas, content, bbox, target_height, scale_w, scale_h)
                    elif elem_type == 'image':
                        # Image paths are stored relative to the OCR output directory
                        img_path = output_path.parent / content
                        if img_path.exists():
                            self._draw_image_at_bbox(pdf_canvas, str(img_path), bbox, target_height, scale_w, scale_h)
                    elif elem_type == 'formula':
                        self._draw_formula_at_bbox(pdf_canvas, content, bbox, target_height, scale_w, scale_h)
                    # ... other types
                except Exception as e:
                    logger.warning(f"Failed to draw {elem_type} element: {e}")
        pdf_canvas.save()
        logger.info(f"✅ Coordinate PDF generated: {output_path}")
        return True
    def _generate_flow_pdf(
        self,
        ocr_data: Dict,
        output_path: Path
    ) -> bool:
        """
        Mode B: flow layout
        - Flows content in reading_order
        - Zero information loss (nothing is filtered out)
        - Uses the high-level ReportLab Platypus API
        - For translation and content-processing scenarios
        """
        from reportlab.platypus import (
            SimpleDocTemplate, Paragraph, Spacer,
            Table, TableStyle, Image as RLImage, PageBreak
        )
        from reportlab.lib.styles import getSampleStyleSheet, ParagraphStyle
        from reportlab.lib import colors
        from reportlab.lib.enums import TA_LEFT, TA_CENTER
        logger.info("Generating PDF in FLOW mode (content-preserving)")
        # Extract the data
        layout_data = ocr_data.get('layout_data', {})
        elements = layout_data.get('elements', [])
        if not elements:
            logger.warning("No layout elements found")
            return False
        # Sort by reading_order
        sorted_elements = sorted(elements, key=lambda x: (
            x.get('page', 0),
            x.get('reading_order', 0)
        ))
        # Build the document
        doc = SimpleDocTemplate(str(output_path))
        story = []
        styles = getSampleStyleSheet()
        # Custom styles
        styles.add(ParagraphStyle(
            name='CustomTitle',
            parent=styles['Heading1'],
            fontSize=18,
            alignment=TA_CENTER,
            spaceAfter=12
        ))
        current_page = -1
        # Append elements in order
        for elem in sorted_elements:
            elem_type = elem.get('type')
            content = elem.get('content', '')
            page = elem.get('page', 0)
            # Page breaks
            if page != current_page and current_page != -1:
                story.append(PageBreak())
            current_page = page
            try:
                if elem_type == 'title':
                    story.append(Paragraph(content, styles['CustomTitle']))
                    story.append(Spacer(1, 12))
                elif elem_type == 'text':
                    story.append(Paragraph(content, styles['Normal']))
                    story.append(Spacer(1, 8))
                elif elem_type == 'table':
                    # Parse the HTML table into a ReportLab Table
                    table_obj = self._html_to_reportlab_table(content)
                    if table_obj:
                        story.append(table_obj)
                        story.append(Spacer(1, 12))
                elif elem_type == 'image':
                    # Embed the image
                    img_path = output_path.parent.parent / content
                    if img_path.exists():
                        img = RLImage(str(img_path), width=400, height=300, kind='proportional')
                        story.append(img)
                        story.append(Spacer(1, 12))
                elif elem_type == 'formula':
                    # Render formulas in a monospaced font
                    story.append(Paragraph(f"<font name='Courier'>{content}</font>", styles['Code']))
                    story.append(Spacer(1, 8))
            except Exception as e:
                logger.warning(f"Failed to add {elem_type} element to flow: {e}")
        # Build the PDF
        doc.build(story)
        logger.info(f"✅ Flow PDF generated: {output_path}")
        return True
```
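The `_html_to_reportlab_table()` helper called in the flow-mode code above still needs a body. Its core job is turning `<tr>`/`<td>` markup into the list-of-rows structure that `reportlab.platypus.Table` accepts; a stdlib sketch of that parsing step using `html.parser` instead of BeautifulSoup (the actual helper would then wrap the rows in a `Table` with a `TableStyle`):

```python
from html.parser import HTMLParser

class TableRowParser(HTMLParser):
    """Collect HTML table cells into a list-of-rows structure."""
    def __init__(self):
        super().__init__()
        self.rows, self._row, self._cell, self._in_cell = [], [], [], False
    def handle_starttag(self, tag, attrs):
        if tag == "tr":
            self._row = []
        elif tag in ("td", "th"):
            self._in_cell, self._cell = True, []
    def handle_endtag(self, tag):
        if tag in ("td", "th"):
            self._in_cell = False
            self._row.append("".join(self._cell).strip())
        elif tag == "tr" and self._row:
            self.rows.append(self._row)
    def handle_data(self, data):
        if self._in_cell:
            self._cell.append(data)

def html_table_to_rows(html_content):
    """Parse an HTML table into rows suitable for reportlab's Table(rows)."""
    parser = TableRowParser()
    parser.feed(html_content)
    return parser.rows

rows = html_table_to_rows(
    "<table><tr><th>Format</th><th>Status</th></tr>"
    "<tr><td>docx</td><td>supported</td></tr></table>")
print(rows)
```

Note this sketch does not handle `rowspan`/`colspan`, which PP-StructureV3 table HTML can contain; the real helper would need to pad merged cells or fall back to `_extract_table_text()`.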
---
## 🔧 Implementation Steps
### Phase 1: Engine-Layer Refactor (2-3 hours)
1. **Create the PPStructureV3Engine singleton**
   - File: `backend/app/engines/ppstructure_engine.py` (new)
   - Single point of management for the PP-StructureV3 engine
   - Optimized configuration for RTX 4060 8GB
2. **Create the AdvancedLayoutExtractor class**
   - File: `backend/app/services/advanced_layout_extractor.py` (new)
   - Implement `extract_complete_layout()`
   - Fully extract parsing_res_list, layout_bbox, layout_det_res
3. **Update OCRService**
   - Change `analyze_layout()` to use `AdvancedLayoutExtractor`
   - Stay backward compatible (fall back to the old logic)
### Phase 2: PDF-Generator Refactor (3-4 hours)
1. **Refactor PDFGeneratorService**
   - Add the `mode` parameter
   - Implement `_generate_coordinate_pdf()`
   - Implement `_generate_flow_pdf()`
2. **Add helper methods**
   - `_draw_table_at_bbox()`: draw a table at given coordinates
   - `_draw_text_at_bbox()`: draw text at given coordinates
   - `_draw_title_at_bbox()`: draw a title at given coordinates
   - `_draw_formula_at_bbox()`: draw a formula at given coordinates
   - `_html_to_reportlab_table()`: convert HTML to a ReportLab Table
3. **Update the API endpoints**
   - `/tasks/{id}/download/pdf?mode=coordinate` (default)
   - `/tasks/{id}/download/pdf?mode=flow`
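A minimal sketch of how the download endpoint could validate the new `mode` query parameter before dispatching to the generator (framework-agnostic; the function name, default, and error shape are assumptions, not the project's actual handler):

```python
from typing import Optional

VALID_PDF_MODES = ("coordinate", "flow")

def resolve_pdf_mode(mode: Optional[str]) -> str:
    """Normalize and validate the ?mode= query parameter.

    Falls back to 'coordinate' when the parameter is absent (matching the
    proposed PDF_DEFAULT_MODE) and rejects unknown values instead of
    silently picking a default.
    """
    if mode is None:
        return "coordinate"
    mode = mode.strip().lower()
    if mode not in VALID_PDF_MODES:
        raise ValueError(f"mode must be one of {VALID_PDF_MODES}, got {mode!r}")
    return mode

print(resolve_pdf_mode(None), resolve_pdf_mode(" Flow "))
```

In a FastAPI route this would typically map the `ValueError` to a 422 response so callers get a clear message rather than an unexpected PDF variant.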
### Phase 3: Testing and Tuning (2-3 hours)
1. **Unit tests**
   - Test AdvancedLayoutExtractor
   - Test both PDF modes
   - Test backward compatibility
2. **Performance tests**
   - Monitor GPU memory usage
   - Measure processing speed
   - Test concurrent requests
3. **Quality validation**
   - Coordinate accuracy
   - Reading-order correctness
   - Table-recognition accuracy
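For the backward-compatibility tests, the key case is an old OCR JSON whose elements carry empty `bbox` lists: loading must not crash, and coordinate mode should simply find nothing to place. A plain-assert sketch of that check (helper names are assumptions for illustration):

```python
def load_elements(ocr_data):
    """Read elements from the layout_data shape; missing keys yield []."""
    return ocr_data.get("layout_data", {}).get("elements", [])

def can_generate_coordinate_pdf(ocr_data):
    """Coordinate mode needs at least one element with a non-empty bbox."""
    return any(e.get("bbox") for e in load_elements(ocr_data))

old_json = {"layout_data": {"elements": [{"type": "text", "bbox": []}]}}
new_json = {"layout_data": {"elements": [
    {"type": "text", "bbox": [[0, 0], [10, 0], [10, 10], [0, 10]]}]}}

# Old files must still load without crashing, even though coordinate mode
# has nothing to place; new files unlock coordinate mode.
print(can_generate_coordinate_pdf(old_json), can_generate_coordinate_pdf(new_json))
```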
---
## 📈 Expected Impact
### Feature Improvements
| Metric | Today | After refactor | Gain |
|------|-----|--------|------|
| bbox availability | 0% (all empty) | 100% | ✅ ∞ |
| Layout element classes | 2 | 23 | ✅ 11.5x |
| Reading order | none | fully preserved | ✅ 100% |
| Information loss | 21.6% | 0% (flow mode) | ✅ 100% |
| PDF modes | 1 | 2 | ✅ 2x |
| Translation support | difficult | complete | ✅ 100% |
### GPU Utilization
```text
# Effect of the RTX 4060 8GB configuration
Item                 | Today       | After
---------------------|-------------|------------
GPU utilization      | ~30%        | ~70%
Throughput           | 0.5 pages/s | 1.2 pages/s
Preprocessing        | off         | all on
Recognition accuracy | ~85%        | ~95%
```
---
## 🎯 Migration Strategy
### Backward-Compatibility Guarantees
1. **API level**
   - Keep every existing endpoint
   - Add an optional `mode` parameter
   - Default behavior unchanged
2. **Data level**
   - Old JSON files remain usable
   - New fields do not affect old logic
   - Incremental updates
3. **Deployment**
   - Ship the new engine and services first
   - Enable new features gradually
   - Monitor performance and error rates
---
## 📝 Configuration
### requirements.txt additions
```txt
# Existing dependencies
paddlepaddle-gpu>=3.0.0
paddleocr>=3.0.0
# New dependencies
python-docx>=0.8.11     # Word document generation (optional)
PyMuPDF>=1.23.0         # enhanced PDF handling
beautifulsoup4>=4.12.0  # HTML parsing
lxml>=4.9.0             # faster XML/HTML parsing
```
### Environment variables
```bash
# Additions to .env.local
PADDLE_GPU_MEMORY=6144          # RTX 4060 8GB: leave 2GB for the system
PADDLE_USE_SERVER_MODEL=true
PADDLE_ENABLE_ALL_FEATURES=true
# Default PDF generation mode
PDF_DEFAULT_MODE=coordinate     # or flow
```
---
## 🚀 Implementation Priority
### P0 (implement now)
1. ✅ PPStructureV3Engine unified engine
2. ✅ AdvancedLayoutExtractor full extraction
3. ✅ Coordinate-mode PDF
### P1 (second phase)
4. ⭐ Flow-mode PDF
5. ⭐ API endpoint update (`mode` parameter)
### P2 (optimization phase)
6. Performance monitoring and tuning
7. Batch-processing support
8. Quality-check tooling
---
## ⚠️ 風險與緩解
### 風險 1: GPU 記憶體不足
**緩解**:
- 合理設定 `gpu_mem=6144` (保留 2GB)
- 添加記憶體監控
- 大文檔分批處理
### 風險 2: 處理速度下降
**緩解**:
- Server 模型在 GPU 上比 Mobile 更快
- 並行處理多頁
- 結果快取
### 風險 3: 向後相容問題
**緩解**:
- 保留舊邏輯作為回退
- 逐步遷移
- 完整測試覆蓋
---
**預計總開發時間**: 7-10 小時
**預計效果**: 100% 利用 PP-StructureV3 能力 + 零資訊損失 + 完美翻譯支援
您希望我開始實作哪個階段?
View File
@@ -0,0 +1,691 @@
# PP-StructureV3 Full Layout Information Utilization Plan
## 📋 Executive Summary
### Problem Diagnosis
The current implementation **severely underuses PP-StructureV3**: it only reads the `page_result.markdown` property and completely ignores the core layout information in `page_result.json`.
### Key Findings
1. **PP-StructureV3 provides complete layout-parsing information**, including:
   - `parsing_res_list`: layout elements ordered by reading order
   - `layout_bbox`: precise coordinates for each element
   - `layout_det_res`: layout detection results (region type, confidence)
   - `overall_ocr_res`: the full OCR result (bbox for every text region)
   - `layout`: layout type (single/double/multi-column)
2. **Flaws in the current implementation**:
```python
# ❌ Current approach (ocr_service.py:615-646)
markdown_dict = page_result.markdown  # only fetches markdown and images
markdown_texts = markdown_dict.get('markdown_texts', '')
# bbox is set to an empty list
'bbox': [],  # PP-StructureV3 doesn't provide individual bbox in this format
```
3. **What it should do instead**:
```python
# ✅ Correct approach
json_data = page_result.json  # full structured information
parsing_list = json_data.get('parsing_res_list', [])  # reading order + bbox
layout_det = json_data.get('layout_det_res', {})      # layout detection
overall_ocr = json_data.get('overall_ocr_res', {})    # coordinates for all text
```
---
## 🎯 Planning Goals
### Phase 1: Extract Full Layout Information (high priority)
**Goal**: Modify `analyze_layout()` to use PP-StructureV3's full capabilities.
**Expected results**:
- ✅ Every layout element has a precise `layout_bbox`
- ✅ Original reading order preserved (the order of `parsing_res_list`)
- ✅ Layout-type information available (single/double column)
- ✅ Region classification extracted (text/table/figure/title/formula)
- ✅ Zero information loss (no need to filter overlapping text)
### Phase 2: Implement Dual-mode PDF Generation (medium priority)
**Goal**: Offer two PDF generation modes.
**Mode A: Precise coordinate positioning**
- Use `layout_bbox` to place each element exactly
- Preserves the visual appearance of the source document
- Suited to scenarios requiring faithful layout reproduction
**Mode B: Flow layout**
- Lay elements out in `parsing_res_list` order
- Use the high-level ReportLab Platypus API
- Zero information loss; all content is searchable
- Suited to translation and content-processing scenarios
### Phase 3: Multi-column Layout Handling (low priority)
**Goal**: Exploit PP-StructureV3's multi-column recognition capabilities.
---
## 📊 Full PP-StructureV3 Data Structure
### 1. Complete structure of `page_result.json`
```python
{
    # Basic information
    "input_path": str,              # source file path
    "page_index": int,              # page number (PDF only)
    # Layout detection results
    "layout_det_res": {
        "boxes": [
            {
                "cls_id": int,      # class ID
                "label": str,       # region type: text/table/figure/title/formula/seal
                "score": float,     # confidence, 0-1
                "coordinate": [x1, y1, x2, y2]  # rectangle coordinates
            },
            ...
        ]
    },
    # Full OCR results
    "overall_ocr_res": {
        "dt_polys": np.ndarray,     # text detection polygons
        "rec_polys": np.ndarray,    # text recognition polygons
        "rec_boxes": np.ndarray,    # text recognition boxes (n, 4, 2) int16
        "rec_texts": List[str],     # recognized text
        "rec_scores": np.ndarray    # recognition confidence
    },
    # **Core layout parsing results (in reading order)**
    "parsing_res_list": [
        {
            "layout_bbox": np.ndarray,  # region bounding box [x1, y1, x2, y2]
            "layout": str,              # layout type: single/double/multi-column
            "text": str,                # text content (text regions)
            "table": str,               # table HTML (table regions)
            "image": str,               # image path (image regions)
            "formula": str,             # formula LaTeX (formula regions)
            # ... other region types
        },
        ...  # list order = reading order
    ],
    # Text-paragraph OCR (in reading order)
    "text_paragraphs_ocr_res": {
        "rec_polys": np.ndarray,
        "rec_texts": List[str],
        "rec_scores": np.ndarray
    },
    # Optional module results
    "formula_res_region1": {...},   # formula recognition results
    "table_cell_img": {...},        # table cell images
    "seal_res_region1": {...}       # seal recognition results
}
```
### 2. Key fields
| Field | Purpose | Format | Importance |
|------|------|---------|--------|
| `parsing_res_list` | **Core data**: all layout elements in reading order | List[Dict] | ⭐⭐⭐⭐⭐ |
| `layout_bbox` | Precise coordinates for each element | np.ndarray [x1,y1,x2,y2] | ⭐⭐⭐⭐⭐ |
| `layout` | Layout type (single/double/multi-column) | str: single/double/multi | ⭐⭐⭐⭐ |
| `layout_det_res` | Detailed layout detection (incl. region classes) | Dict with boxes list | ⭐⭐⭐⭐ |
| `overall_ocr_res` | OCR results and coordinates for all text | Dict with np.ndarray | ⭐⭐⭐⭐ |
| `markdown` | Simplified Markdown output | Dict with texts/images | ⭐⭐ |
---
## 🔧 Implementation Plan
### Task 1: Refactor the `analyze_layout()` Function
**File**: `/backend/app/services/ocr_service.py`
**Scope**: Lines 590-710
**Core changes**:
```python
def analyze_layout(self, image_path: Path, output_dir: Optional[Path] = None, current_page: int = 0) -> Tuple[Optional[Dict], List[Dict]]:
    """
    Analyze document layout using PP-StructureV3 (using the full JSON information)
    """
    try:
        structure_engine = self.get_structure_engine()
        results = structure_engine.predict(str(image_path))
        layout_elements = []
        images_metadata = []
        for page_idx, page_result in enumerate(results):
            # ✅ Change 1: use the full JSON data instead of markdown only
            json_data = page_result.json
            # ✅ Change 2: extract the layout detection results
            layout_det_res = json_data.get('layout_det_res', {})
            layout_boxes = layout_det_res.get('boxes', [])
            # ✅ Change 3: extract the core parsing_res_list (reading order + bbox)
            parsing_res_list = json_data.get('parsing_res_list', [])
            if parsing_res_list:
                # *** Core logic: use parsing_res_list ***
                for idx, item in enumerate(parsing_res_list):
                    # Extract bbox (no longer an empty list)
                    layout_bbox = item.get('layout_bbox')
                    if layout_bbox is not None:
                        # Convert numpy array to a standard format
                        if hasattr(layout_bbox, 'tolist'):
                            bbox = layout_bbox.tolist()
                        else:
                            bbox = list(layout_bbox)
                        # Convert to 4-point format: [[x1,y1], [x2,y1], [x2,y2], [x1,y2]]
                        if len(bbox) == 4:  # [x1, y1, x2, y2]
                            x1, y1, x2, y2 = bbox
                            bbox = [[x1, y1], [x2, y1], [x2, y2], [x1, y2]]
                    else:
                        bbox = []
                    # Extract the layout type
                    layout_type = item.get('layout', 'single')
                    # Create the element (with all information)
                    element = {
                        'element_id': idx,
                        'page': current_page,
                        'bbox': bbox,                # ✅ no longer an empty list!
                        'layout_type': layout_type,  # ✅ new: layout type
                        'reading_order': idx,        # ✅ new: reading order
                    }
                    # Extract data according to content type
                    if 'table' in item:
                        element['type'] = 'table'
                        element['content'] = item['table']
                        # Extract plain table text (for translation)
                        element['extracted_text'] = self._extract_table_text(item['table'])
                    elif 'text' in item:
                        element['type'] = 'text'
                        element['content'] = item['text']
                    elif 'figure' in item or 'image' in item:
                        element['type'] = 'image'
                        element['content'] = item.get('figure') or item.get('image')
                    elif 'formula' in item:
                        element['type'] = 'formula'
                        element['content'] = item['formula']
                    elif 'title' in item:
                        element['type'] = 'title'
                        element['content'] = item['title']
                    else:
                        # Unknown type: record the first non-system field
                        for key, value in item.items():
                            if key not in ['layout_bbox', 'layout']:
                                element['type'] = key
                                element['content'] = value
                                break
                    layout_elements.append(element)
            else:
                # Fall back to the markdown approach (backward compatible)
                logger.warning("No parsing_res_list found, falling back to markdown parsing")
                markdown_dict = page_result.markdown
                # ... existing markdown parsing logic ...
            # ✅ Change 4: also handle extracted images (still need saving to disk)
            markdown_dict = page_result.markdown
            markdown_images = markdown_dict.get('markdown_images', {})
            for img_idx, (img_path, img_obj) in enumerate(markdown_images.items()):
                # Save the image to disk
                try:
                    base_dir = output_dir if output_dir else image_path.parent
                    full_img_path = base_dir / img_path
                    full_img_path.parent.mkdir(parents=True, exist_ok=True)
                    if hasattr(img_obj, 'save'):
                        img_obj.save(str(full_img_path))
                        logger.info(f"Saved extracted image to {full_img_path}")
                except Exception as e:
                    logger.warning(f"Failed to save image {img_path}: {e}")
                # Extract bbox (from the filename, or by matching parsing_res_list)
                bbox = self._find_image_bbox(img_path, parsing_res_list, layout_boxes)
                images_metadata.append({
                    'element_id': len(layout_elements) + img_idx,
                    'image_path': img_path,
                    'type': 'image',
                    'page': current_page,
                    'bbox': bbox,
                })
        if layout_elements:
            layout_data = {
                'elements': layout_elements,
                'total_elements': len(layout_elements),
                'reading_order': [e['reading_order'] for e in layout_elements],  # ✅ preserve reading order
                'layout_types': list(set(e.get('layout_type') for e in layout_elements)),  # ✅ layout-type stats
            }
            logger.info(f"Detected {len(layout_elements)} layout elements (with bbox and reading order)")
            return layout_data, images_metadata
        else:
            logger.warning("No layout elements detected")
            return None, []
    except Exception as e:
        import traceback
        logger.error(f"Layout analysis error: {str(e)}\n{traceback.format_exc()}")
        return None, []

def _find_image_bbox(self, img_path: str, parsing_res_list: List[Dict], layout_boxes: List[Dict]) -> List:
    """
    Look up an image's bbox in parsing_res_list or layout_det_res
    """
    # Method 1: extract from the filename (existing approach)
    import re
    match = re.search(r'box_(\d+)_(\d+)_(\d+)_(\d+)', img_path)
    if match:
        x1, y1, x2, y2 = map(int, match.groups())
        return [[x1, y1], [x2, y1], [x2, y2], [x1, y2]]
    # Method 2: match against parsing_res_list (if it carries image path info)
    for item in parsing_res_list:
        if 'image' in item or 'figure' in item:
            content = item.get('image') or item.get('figure')
            if img_path in str(content):
                bbox = item.get('layout_bbox')
                if bbox is not None:
                    if hasattr(bbox, 'tolist'):
                        bbox_list = bbox.tolist()
                    else:
                        bbox_list = list(bbox)
                    if len(bbox_list) == 4:
                        x1, y1, x2, y2 = bbox_list
                        return [[x1, y1], [x2, y1], [x2, y2], [x1, y2]]
    # Method 3: match against layout_det_res (by label)
    for box in layout_boxes:
        if box.get('label') in ['figure', 'image']:
            coord = box.get('coordinate', [])
            if len(coord) == 4:
                x1, y1, x2, y2 = coord
                return [[x1, y1], [x2, y1], [x2, y2], [x1, y2]]
    logger.warning(f"Could not find bbox for image {img_path}")
    return []
```
---
### Task 2: Update the PDF Generator to Use the New Information
**File**: `/backend/app/services/pdf_generator_service.py`
**Core changes**:
1. **Remove the text-filtering logic** (no longer needed!)
   - `parsing_res_list` is already in reading order
   - Tables/images have their own regions, and so does text
   - No overlap problems remain
2. **Render elements by `reading_order`**
```python
def generate_layout_pdf(self, json_path: Path, output_path: Path, mode: str = 'coordinate') -> bool:
    """
    mode: 'coordinate' or 'flow'
    """
    # Load the data
    ocr_data = self.load_ocr_json(json_path)
    layout_data = ocr_data.get('layout_data', {})
    elements = layout_data.get('elements', [])
    if mode == 'coordinate':
        # Mode A: coordinate positioning
        return self._generate_coordinate_pdf(elements, output_path, ocr_data)
    else:
        # Mode B: flow layout
        return self._generate_flow_pdf(elements, output_path, ocr_data)

def _generate_coordinate_pdf(self, elements: List[Dict], output_path: Path, ocr_data: Dict) -> bool:
    """Coordinate positioning mode - faithful layout reproduction"""
    # Sort elements by reading_order
    sorted_elements = sorted(elements, key=lambda x: x.get('reading_order', 0))
    # Group by page
    pages = {}
    for elem in sorted_elements:
        page = elem.get('page', 0)
        if page not in pages:
            pages[page] = []
        pages[page].append(elem)
    # Render each page
    for page_num, page_elements in sorted(pages.items()):
        for elem in page_elements:
            bbox = elem.get('bbox', [])
            elem_type = elem.get('type')
            content = elem.get('content', '')
            if not bbox:
                logger.warning(f"Element {elem['element_id']} has no bbox, skipping")
                continue
            # Render at the exact coordinates
            if elem_type == 'table':
                self.draw_table_at_bbox(pdf_canvas, content, bbox, page_height, scale_w, scale_h)
            elif elem_type == 'text':
                self.draw_text_at_bbox(pdf_canvas, content, bbox, page_height, scale_w, scale_h)
            elif elem_type == 'image':
                self.draw_image_at_bbox(pdf_canvas, content, bbox, page_height, scale_w, scale_h)
            # ... other types

def _generate_flow_pdf(self, elements: List[Dict], output_path: Path, ocr_data: Dict) -> bool:
    """Flow layout mode - zero information loss"""
    from reportlab.platypus import SimpleDocTemplate, Paragraph, Table, Image, Spacer
    from reportlab.lib.styles import getSampleStyleSheet
    # Sort elements by reading_order
    sorted_elements = sorted(elements, key=lambda x: x.get('reading_order', 0))
    # Build the Story (flowable content)
    story = []
    styles = getSampleStyleSheet()
    for elem in sorted_elements:
        elem_type = elem.get('type')
        content = elem.get('content', '')
        if elem_type == 'title':
            story.append(Paragraph(content, styles['Title']))
        elif elem_type == 'text':
            story.append(Paragraph(content, styles['Normal']))
        elif elem_type == 'table':
            # Parse the HTML table into a ReportLab Table
            table_obj = self._html_to_reportlab_table(content)
            story.append(table_obj)
        elif elem_type == 'image':
            # Embed the image
            img_path = json_path.parent / content
            if img_path.exists():
                story.append(Image(str(img_path), width=400, height=300))
        story.append(Spacer(1, 12))  # spacing
    # Generate the PDF
    doc = SimpleDocTemplate(str(output_path))
    doc.build(story)
    return True
```
---
## 📈 Expected Improvements
### Current vs. New Implementation
| Metric | Current ❌ | New ✅ | Improvement |
|------|-----------|----------|------|
| **bbox info** | empty list `[]` | precise coordinates `[x1,y1,x2,y2]` | ✅ 100% |
| **Reading order** | none (mixed HTML) | `reading_order` field | ✅ 100% |
| **Layout type** | none | `layout_type` (single/double column) | ✅ 100% |
| **Element classification** | naive `<table` check | precise classes (9+ types) | ✅ 100% |
| **Information loss** | 21.6% of text filtered out | 0% loss (flow mode) | ✅ 100% |
| **Coordinate precision** | bbox for some images only | bbox for every element | ✅ 100% |
| **PDF modes** | coordinate positioning only | dual mode (coordinate + flow) | ✅ new feature |
| **Translation support** | difficult (information loss) | full (zero loss) | ✅ 100% |
### Concrete Improvements
#### 1. Zero information loss
```python
# ❌ Current: 342 text regions → 268 after filtering = 74 lost (21.6%)
filtered_text_regions = self._filter_text_in_regions(text_regions, regions_to_avoid)
# ✅ New: no filtering needed, use parsing_res_list directly
# All elements (text, tables, images) live in their own regions with no overlap
for elem in sorted(elements, key=lambda x: x['reading_order']):
    render_element(elem)  # render every element, zero loss
```
#### 2. Precise bbox
```python
# ❌ Current: bbox is an empty list
{
    'element_id': 0,
    'type': 'table',
    'bbox': [],  # ← cannot be positioned!
}
# ✅ New: precise coordinates from layout_bbox
{
    'element_id': 0,
    'type': 'table',
    'bbox': [[770, 776], [1122, 776], [1122, 1058], [770, 1058]],  # ← precisely positioned!
    'reading_order': 3,
    'layout_type': 'single'
}
```
#### 3. Reading order
```python
# ❌ Current: correct reading order is not guaranteed
# Tables, images, and text are mixed together in arbitrary order
# ✅ New: the order of parsing_res_list = reading order
elements = sorted(elements, key=lambda x: x['reading_order'])
# Elements render in reading_order: 0, 1, 2, 3, ...
# The document's logical order is fully preserved
```
---
## 🚀 Implementation Steps
### Stage 1: Core Refactoring (2-3 hours)
1. **Modify the `analyze_layout()` function**
   - Extract `parsing_res_list` from `page_result.json`
   - Extract `layout_bbox` as each element's bbox
   - Preserve `reading_order`
   - Extract `layout_type`
   - Test the output JSON structure
2. **Add helper functions**
   - `_find_image_bbox()`: look up image bboxes from multiple sources
   - `_convert_bbox_format()`: normalize bbox formats
   - `_extract_element_content()`: extract content by type
3. **Test and verify**
   - Re-run OCR on the existing test documents
   - Check that the generated JSON contains bboxes
   - Verify reading_order is correct
### Stage 2: PDF Generation Optimization (2-3 hours)
1. **Implement coordinate positioning mode**
   - Remove the text-filtering logic
   - Render each element exactly at its bbox
   - Use reading_order to decide render order (for same-page elements)
2. **Implement flow layout mode**
   - Use ReportLab Platypus
   - Build the Story in reading_order
   - Implement flow rendering for each element type
3. **Add API parameter**
   - `/tasks/{id}/download/pdf?mode=coordinate` (default)
   - `/tasks/{id}/download/pdf?mode=flow`
### Stage 3: Testing and Optimization (1-2 hours)
1. **Full testing**
   - Single-page documents
   - Multi-page PDFs
   - Multi-column layouts
   - Complex tables
2. **Performance optimization**
   - Reduce duplicate computation
   - Optimize bbox conversion
   - Cache processing
3. **Documentation updates**
   - Update the API docs
   - Add usage examples
   - Update architecture diagrams
---
## 💡 Key Technical Details
### 1. Handling numpy arrays
```python
# layout_bbox is a numpy.ndarray and needs converting to a standard format
layout_bbox = item.get('layout_bbox')
if hasattr(layout_bbox, 'tolist'):
    bbox = layout_bbox.tolist()  # [x1, y1, x2, y2]
else:
    bbox = list(layout_bbox)
# Convert to 4-point format
x1, y1, x2, y2 = bbox
bbox_4point = [[x1, y1], [x2, y1], [x2, y2], [x1, y2]]
```
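The conversion above can be wrapped as a single tested helper (`to_four_point` is a hypothetical name) that also guards against unexpected shapes:

```python
def to_four_point(bbox):
    """Convert [x1, y1, x2, y2] (list, tuple, or np.ndarray) to 4-point form."""
    if hasattr(bbox, "tolist"):  # numpy array from PP-StructureV3
        bbox = bbox.tolist()
    else:
        bbox = list(bbox)
    if len(bbox) != 4:
        return []  # unknown shape: leave bbox empty, as the fallback path does
    x1, y1, x2, y2 = bbox
    return [[x1, y1], [x2, y1], [x2, y2], [x1, y2]]
```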
### 2. Handling layout types
```python
# Adjust the rendering strategy by layout_type
layout_type = elem.get('layout_type', 'single')
if layout_type == 'double':
    # Double-column layout: may need special handling
    pass
elif layout_type == 'multi':
    # Multi-column layout: more complex handling
    pass
```
### 3. Guaranteeing reading order
```python
# Ensure elements render in the correct order
elements = layout_data.get('elements', [])
sorted_elements = sorted(elements, key=lambda x: (
    x.get('page', 0),           # first by page number
    x.get('reading_order', 0)   # then by reading order
))
```
---
## ⚠️ Risks and Mitigations
### Risk 1: Backward compatibility
**Problem**: old JSON files lack the new fields
**Mitigation**:
```python
# Add fallback logic in analyze_layout()
parsing_res_list = json_data.get('parsing_res_list', [])
if not parsing_res_list:
    logger.warning("No parsing_res_list, using markdown fallback")
    # use the old markdown parsing logic
```
### Risk 2: PaddleOCR version differences
**Problem**: different PaddleOCR versions may emit different output formats
**Mitigation**:
- Record the PaddleOCR version in the JSON
- Add version-detection logic
- Support multiple versions
### Risk 3: Performance impact
**Problem**: extracting more information may increase processing time
**Mitigation**:
- Extract detailed information only when needed
- Use caching
- Process pages in parallel
## 📝 TODO Checklist
### Phase 1: Core refactoring
- [ ] Modify `analyze_layout()` to use `page_result.json`
- [ ] Extract `parsing_res_list`
- [ ] Extract `layout_bbox` and convert its format
- [ ] Preserve `reading_order`
- [ ] Extract `layout_type`
- [ ] Implement `_find_image_bbox()`
- [ ] Add fallback logic (backward compatibility)
- [ ] Test the new JSON output structure
### Phase 2: PDF generation optimization
- [ ] Implement `_generate_coordinate_pdf()`
- [ ] Implement `_generate_flow_pdf()`
- [ ] Remove the old text-filtering logic
- [ ] Add the mode parameter to the API
- [ ] Implement an HTML table parser (for flow mode)
- [ ] Test PDF output for both modes
### Phase 3: Testing and docs
- [ ] Single-page document tests
- [ ] Multi-page PDF tests
- [ ] Complex layout tests (multi-column, table-dense)
- [ ] Performance tests
- [ ] Update the API docs
- [ ] Update the usage guide
- [ ] Write a migration guide
---
## 🎓 Reference Material
1. **PaddleOCR official docs**
   - [PP-StructureV3 Usage Tutorial](http://www.paddleocr.ai/main/en/version3.x/pipeline_usage/PP-StructureV3.html)
   - [PaddleX PP-StructureV3](https://paddlepaddle.github.io/PaddleX/3.0/en/pipeline_usage/tutorials/ocr_pipelines/PP-StructureV3.html)
2. **ReportLab docs**
   - [Platypus User Guide](https://www.reportlab.com/docs/reportlab-userguide.pdf)
   - [Table Styling](https://www.reportlab.com/docs/reportlab-userguide.pdf#page=80)
3. **Reference implementation**
   - PaddleOCR GitHub: `/paddlex/inference/pipelines/layout_parsing/pipeline_v2.py`
---
## 🏁 Success Criteria
### Must achieve
- All layout elements have precise bboxes
- Reading order correctly preserved
- Zero information loss (flow mode)
- Backward compatible (old JSON files still usable)
### Expected
- Dual-mode PDF generation (coordinate + flow)
- Multi-column layouts handled correctly
- Translation support (table text extractable)
- No noticeable performance regression
### Stretch goals
- More element types supported (formulas, seals)
- Layout-type statistics and analysis
- Layout-structure visualization
---
**Planning completed**: 2025-01-18
**Estimated development time**: 5-8 hours
**Priority**: P0 (highest)
View File
@@ -0,0 +1,276 @@
# Technical Design: Dual-track Document Processing
## Context
### Background
The current OCR tool processes all documents through PaddleOCR, even when dealing with editable PDFs that contain extractable text. This causes:
- Unnecessary processing overhead
- Potential quality degradation from re-OCRing already digital text
- Loss of precise formatting information
- Inefficient GPU usage on documents that don't need OCR
### Constraints
- RTX 4060 8GB GPU memory limitation
- Need to maintain backward compatibility with existing API
- Must support future translation features
- Should handle mixed documents (partially scanned, partially digital)
### Stakeholders
- API consumers expecting consistent JSON/PDF output
- Translation system requiring structure preservation
- Performance-sensitive deployments
## Goals / Non-Goals
### Goals
- Intelligently route documents to appropriate processing track
- Preserve document structure for translation
- Optimize GPU usage by avoiding unnecessary OCR
- Maintain unified output format across tracks
- Reduce processing time for editable PDFs by 70%+
### Non-Goals
- Implementing the actual translation engine (future phase)
- Supporting video or audio transcription
- Real-time collaborative editing
- OCR model training or fine-tuning
## Decisions
### Decision 1: Dual-track Architecture
**What**: Implement two separate processing pipelines - OCR track and Direct extraction track
**Why**:
- Editable PDFs don't need OCR, can be processed 10-100x faster
- Direct extraction preserves exact formatting and fonts
- OCR track remains optimal for scanned documents
**Alternatives considered**:
1. **Single enhanced OCR pipeline**: Would still waste resources on editable PDFs
2. **Hybrid approach per page**: Too complex, most documents are uniformly editable or scanned
3. **Multiple specialized pipelines**: Over-engineering for current requirements
### Decision 2: UnifiedDocument Model
**What**: Create a standardized intermediate representation for both tracks
**Why**:
- Provides consistent API interface regardless of processing track
- Simplifies downstream processing (PDF generation, translation)
- Enables track switching without breaking changes
**Structure**:
```python
@dataclass
class UnifiedDocument:
    document_id: str
    metadata: DocumentMetadata
    pages: List[Page]
    processing_track: Literal["ocr", "direct"]

@dataclass
class Page:
    page_number: int
    elements: List[DocumentElement]
    dimensions: Dimensions

@dataclass
class DocumentElement:
    element_id: str
    type: ElementType  # text, table, image, header, etc.
    content: Union[str, Dict, bytes]
    bbox: BoundingBox
    style: Optional[StyleInfo]
    confidence: Optional[float]  # Only for OCR track
```
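A minimal runnable sketch of how these dataclasses might serialize to the JSON the API returns. The supporting types (`DocumentMetadata`, `ElementType`, `StyleInfo`, etc.) are not yet defined, so simplified stand-ins are used here; only the field names from the model above are carried over:

```python
from dataclasses import dataclass, field, asdict
from typing import List, Optional

@dataclass
class BoundingBox:
    x1: float
    y1: float
    x2: float
    y2: float

@dataclass
class DocumentElement:
    element_id: str
    type: str                           # "text", "table", "image", ...
    content: str
    bbox: BoundingBox
    confidence: Optional[float] = None  # only set on the OCR track

@dataclass
class Page:
    page_number: int
    elements: List[DocumentElement] = field(default_factory=list)

@dataclass
class UnifiedDocument:
    document_id: str
    processing_track: str               # "ocr" or "direct"
    pages: List[Page] = field(default_factory=list)

    def to_dict(self) -> dict:
        # asdict recurses into nested dataclasses, giving JSON-ready output
        return asdict(self)
```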
### Decision 3: PyMuPDF for Direct Extraction
**What**: Use PyMuPDF (fitz) library for editable PDF processing
**Why**:
- Mature, well-maintained library
- Excellent coordinate preservation
- Fast C++ backend
- Supports text, tables, and image extraction with positions
**Alternatives considered**:
1. **pdfplumber**: Good but slower, less precise coordinates
2. **PyPDF2**: Limited layout information
3. **PDFMiner**: Complex API, slower performance
### Decision 4: Processing Track Auto-detection
**What**: Automatically determine optimal track based on document analysis
**Detection logic**:
```python
def detect_track(file_path: Path) -> str:
    file_type = magic.from_file(file_path, mime=True)
    if file_type.startswith('image/'):
        return "ocr"
    if file_type == 'application/pdf':
        # Check if PDF has extractable text
        doc = fitz.open(file_path)
        for page in doc.pages(0, 3):  # Sample first 3 pages
            text = page.get_text()
            if len(text.strip()) < 100:  # Minimal text
                return "ocr"
        return "direct"
    if file_type in OFFICE_MIMES:
        return "ocr"  # For now, may add direct Office support later
    return "ocr"  # Default fallback
```
### Decision 5: GPU Memory Management
**What**: Implement dynamic batch sizing and model caching for RTX 4060 8GB
**Why**:
- Prevents OOM errors
- Maximizes throughput
- Enables concurrent request handling
**Strategy**:
```python
# Adaptive batch sizing based on available memory
batch_size = calculate_batch_size(
    available_memory=get_gpu_memory(),
    image_size=image.shape,
    model_size=MODEL_MEMORY_REQUIREMENTS
)

# Model caching to avoid reload overhead
@lru_cache(maxsize=2)
def get_model(model_type: str):
    return load_model(model_type)
```
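One possible concrete form of the `calculate_batch_size` helper sketched above. The memory accounting is deliberately simple and the figures in the tests are illustrative assumptions, not measured model footprints:

```python
def calculate_batch_size(available_mb: int, per_image_mb: int, model_mb: int,
                         max_batch: int = 16) -> int:
    """Largest batch that fits in the memory left after loading the model."""
    free_mb = available_mb - model_mb
    if free_mb < per_image_mb:
        return 1  # degrade to single-image processing rather than OOM
    return max(1, min(max_batch, free_mb // per_image_mb))
```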
### Decision 6: Backward Compatibility
**What**: Maintain existing API while adding new capabilities
**How**:
- Existing endpoints continue working unchanged
- New `processing_track` parameter is optional
- Output format compatible with current consumers
- Gradual migration path for clients
## Risks / Trade-offs
### Risk 1: Mixed Content Documents
**Risk**: Documents with both scanned and digital pages
**Mitigation**:
- Page-level track detection as fallback
- Confidence scoring to identify uncertain pages
- Manual override option via API
### Risk 2: Direct Extraction Quality
**Risk**: Some PDFs have poor internal structure
**Mitigation**:
- Fallback to OCR track if extraction quality is low
- Quality metrics: text density, structure coherence
- User-reportable quality issues
### Risk 3: Memory Pressure
**Risk**: RTX 4060 8GB limitation with concurrent requests
**Mitigation**:
- Request queuing system
- Dynamic batch adjustment
- CPU fallback for overflow
### Trade-off 1: Processing Time vs Accuracy
- Direct extraction: Fast but depends on PDF quality
- OCR: Slower but consistent quality
- **Decision**: Prioritize speed for editable PDFs, accuracy for scanned
### Trade-off 2: Complexity vs Flexibility
- Two tracks increase system complexity
- But enable optimal processing per document type
- **Decision**: Accept complexity for 10x+ performance gains
## Migration Plan
### Phase 1: Infrastructure (Week 1-2)
1. Deploy UnifiedDocument model
2. Implement DocumentTypeDetector
3. Add DirectExtractionEngine
4. Update logging and monitoring
### Phase 2: Integration (Week 3)
1. Update OCR service with routing logic
2. Modify PDF generator for unified model
3. Add new API endpoints
4. Deploy to staging
### Phase 3: Validation (Week 4)
1. A/B testing with subset of traffic
2. Performance benchmarking
3. Quality validation
4. Client integration testing
### Rollback Plan
1. Feature flag to disable dual-track
2. Fallback all requests to OCR track
3. Maintain old code paths during transition
4. Database migration reversible
## Open Questions
### Resolved
- Q: Should we support page-level track mixing?
- A: No, adds complexity with minimal benefit. Document-level is sufficient.
- Q: How to handle Office documents?
- A: OCR track initially, consider python-docx/openpyxl later if needed.
### Pending
- Q: What translation services to integrate with?
- Needs stakeholder input on cost/quality trade-offs
- Q: Should we cache extracted text for repeated processing?
- Depends on storage costs vs reprocessing frequency
- Q: How to handle password-protected PDFs?
- May need API parameter for passwords
## Performance Targets
### Direct Extraction Track
- Latency: <500ms per page
- Throughput: 100+ pages/minute
- Memory: <500MB per document
### OCR Track (Optimized)
- Latency: 2-5s per page (GPU)
- Throughput: 20-30 pages/minute
- Memory: <2GB per batch
### API Response Times
- Document type detection: <100ms
- Processing initiation: <200ms
- Result retrieval: <100ms
## Technical Dependencies
### Python Packages
```python
# Direct extraction
PyMuPDF==1.23.x
pdfplumber==0.10.x # Fallback/validation
python-magic-bin==0.4.x
# OCR enhancement
paddlepaddle-gpu==2.5.2
paddleocr==2.7.3
# Infrastructure
pydantic==2.x
fastapi==0.100+
redis==5.x # For caching
```
### System Requirements
- CUDA 11.8+ for PaddlePaddle
- libmagic for file detection
- 16GB RAM minimum
- 50GB disk for models and cache
View File
@@ -0,0 +1,35 @@
# Change: Dual-track Document Processing with Structure-Preserving Translation
## Why
The current system processes all documents through PaddleOCR, causing unnecessary overhead for editable PDFs that already contain extractable text. Additionally, we're only using ~20% of PP-StructureV3's capabilities, missing out on comprehensive document structure extraction. The system needs to support structure-preserving document translation as a future goal.
## What Changes
- **ADDED** Dual-track processing architecture with intelligent routing
- OCR track for scanned documents, images, and Office files using PaddleOCR
- Direct extraction track for editable PDFs using PyMuPDF
- **ADDED** UnifiedDocument model as common output format for both tracks
- **ADDED** DocumentTypeDetector service for automatic track selection
- **MODIFIED** OCR service to use PP-StructureV3's parsing_res_list instead of markdown
- Now extracts all 23 element types with bbox coordinates
- Preserves reading order and hierarchical structure
- **MODIFIED** PDF generator to handle UnifiedDocument format
- Enhanced overlap detection to prevent text/image/table collisions
- Improved coordinate transformation for accurate layout
- **ADDED** Foundation for structure-preserving translation system
- **BREAKING** JSON output structure will include new fields (backward compatible with defaults)
## Impact
- **Affected specs**:
- `document-processing` (new capability)
- `result-export` (enhanced with track metadata and structure data)
- `task-management` (tracks processing route and history)
- **Affected code**:
- `backend/app/services/ocr_service.py` - Major refactoring for dual-track
- `backend/app/services/pdf_generator_service.py` - UnifiedDocument support
- `backend/app/api/v2/tasks.py` - New endpoints for track detection
- `frontend/src/pages/TaskDetailPage.tsx` - Display processing track info
- **Performance**: 5-10x faster for editable PDFs, same speed for scanned documents
- **Dependencies**: Adds PyMuPDF, pdfplumber, python-magic-bin
View File
@@ -0,0 +1,108 @@
# Document Processing Spec Delta
## ADDED Requirements
### Requirement: Dual-track Processing
The system SHALL support two distinct processing tracks for documents: OCR track for scanned/image documents and Direct extraction track for editable PDFs.
#### Scenario: Process scanned PDF through OCR track
- **WHEN** a scanned PDF is uploaded
- **THEN** the system SHALL detect it requires OCR
- **AND** route it through PaddleOCR PP-StructureV3 pipeline
- **AND** return results in UnifiedDocument format
#### Scenario: Process editable PDF through direct extraction
- **WHEN** an editable PDF with extractable text is uploaded
- **THEN** the system SHALL detect it can be directly extracted
- **AND** route it through PyMuPDF extraction pipeline
- **AND** return results in UnifiedDocument format without OCR
#### Scenario: Auto-detect processing track
- **WHEN** a document is uploaded without explicit track specification
- **THEN** the system SHALL analyze the document type and content
- **AND** automatically select the optimal processing track
- **AND** include the selected track in processing metadata
### Requirement: Document Type Detection
The system SHALL provide intelligent document type detection to determine the optimal processing track.
#### Scenario: Detect editable PDF
- **WHEN** analyzing a PDF document
- **THEN** the system SHALL check for extractable text content
- **AND** return confidence score for editability
- **AND** recommend "direct" track if text coverage > 90%
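The >90% coverage rule in the scenario above could look like this; the function name and the page-sampling inputs are assumptions, with the threshold taken from the scenario:

```python
def recommend_track(pages_with_text: int, pages_sampled: int,
                    threshold: float = 0.9) -> tuple:
    """Return (recommended track, text-coverage ratio) for a sampled PDF."""
    if pages_sampled == 0:
        return "ocr", 0.0  # nothing sampled: fall back to the safe default
    coverage = pages_with_text / pages_sampled
    return ("direct" if coverage > threshold else "ocr"), coverage
```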
#### Scenario: Detect scanned document
- **WHEN** analyzing an image or scanned PDF
- **THEN** the system SHALL identify lack of extractable text
- **AND** recommend "ocr" track for processing
- **AND** configure appropriate OCR models
#### Scenario: Detect Office documents
- **WHEN** analyzing .docx, .xlsx, .pptx files
- **THEN** the system SHALL identify Office format
- **AND** route to OCR track for initial implementation
- **AND** preserve option for future direct Office extraction
### Requirement: Unified Document Model
The system SHALL use a standardized UnifiedDocument model as the common output format for both processing tracks.
#### Scenario: Generate UnifiedDocument from OCR
- **WHEN** OCR processing completes
- **THEN** the system SHALL convert PP-StructureV3 results to UnifiedDocument
- **AND** preserve all element types, coordinates, and confidence scores
- **AND** maintain reading order and hierarchical structure
#### Scenario: Generate UnifiedDocument from direct extraction
- **WHEN** direct extraction completes
- **THEN** the system SHALL convert PyMuPDF results to UnifiedDocument
- **AND** preserve text styling, fonts, and exact positioning
- **AND** extract tables with cell boundaries and content
#### Scenario: Consistent output regardless of track
- **WHEN** processing completes through either track
- **THEN** the output SHALL conform to UnifiedDocument schema
- **AND** include processing_track metadata field
- **AND** support identical downstream operations (PDF generation, translation)
### Requirement: Enhanced OCR with Full PP-StructureV3
The system SHALL utilize the full capabilities of PP-StructureV3, extracting all 23 element types from parsing_res_list.
#### Scenario: Extract comprehensive document structure
- **WHEN** processing through OCR track
- **THEN** the system SHALL use page_result.json['parsing_res_list']
- **AND** extract all element types including headers, lists, tables, figures
- **AND** preserve layout_bbox coordinates for each element
#### Scenario: Maintain reading order
- **WHEN** extracting elements from PP-StructureV3
- **THEN** the system SHALL preserve the reading order from parsing_res_list
- **AND** assign sequential indices to elements
- **AND** support reordering for complex layouts
#### Scenario: Extract table structure
- **WHEN** PP-StructureV3 identifies a table
- **THEN** the system SHALL extract cell content and boundaries
- **AND** preserve table HTML for structure
- **AND** extract plain text for translation
### Requirement: Structure-Preserving Translation Foundation
The system SHALL maintain document structure and layout information to support future translation features.
#### Scenario: Preserve coordinates for translation
- **WHEN** processing any document
- **THEN** the system SHALL retain bbox coordinates for all text elements
- **AND** calculate space requirements for text expansion/contraction
- **AND** maintain element relationships and groupings
#### Scenario: Extract translatable content
- **WHEN** processing tables and lists
- **THEN** the system SHALL extract plain text content
- **AND** maintain mapping to original structure
- **AND** preserve formatting markers for reconstruction
#### Scenario: Support layout adjustment
- **WHEN** preparing for translation
- **THEN** the system SHALL identify flexible vs fixed layout regions
- **AND** calculate maximum text expansion ratios
- **AND** preserve non-translatable elements (logos, signatures)
View File
@@ -0,0 +1,74 @@
# Result Export Spec Delta
## MODIFIED Requirements
### Requirement: Export Interface
The Export page SHALL support downloading OCR results in multiple formats using V2 task APIs, with processing track information and enhanced structure data.
#### Scenario: Export page uses V2 download endpoints
- **WHEN** user selects a format and clicks export button
- **THEN** frontend SHALL call V2 endpoint `/api/v2/tasks/{task_id}/download/{format}`
- **AND** frontend SHALL NOT call V1 `/api/v2/export` endpoint (which returns 404)
- **AND** file SHALL download successfully
#### Scenario: Export supports multiple formats
- **WHEN** user exports a completed task
- **THEN** system SHALL support downloading as TXT, JSON, Excel, Markdown, and PDF
- **AND** each format SHALL use correct V2 download endpoint
- **AND** downloaded files SHALL contain task OCR results
#### Scenario: Export includes processing track metadata
- **WHEN** user exports a task processed through dual-track system
- **THEN** exported JSON SHALL include "processing_track" field indicating "ocr" or "direct"
- **AND** SHALL include "processing_metadata" with track-specific information
- **AND** SHALL maintain backward compatibility for clients not expecting these fields
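The backward-compatibility clause above implies the new fields must be purely additive. A sketch of enriching the legacy export payload with the track fields named in the scenario (the helper name is hypothetical):

```python
def with_track_metadata(legacy_export: dict, track: str = "ocr",
                        metadata=None) -> dict:
    """Add dual-track fields without disturbing existing keys."""
    enriched = dict(legacy_export)  # never mutate the legacy payload
    enriched.setdefault("processing_track", track)
    enriched.setdefault("processing_metadata", metadata or {})
    return enriched
```

Clients that ignore unknown keys keep working unchanged; new clients read the extra fields.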
#### Scenario: Export UnifiedDocument format
- **WHEN** user requests JSON export with unified=true parameter
- **THEN** system SHALL return UnifiedDocument structure
- **AND** include complete element hierarchy with coordinates
- **AND** preserve all PP-StructureV3 element types for OCR track
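A minimal client-side sketch of the V2 download path described above; the set of format slugs mirrors the scenario list, and the `unified=true` query parameter follows the UnifiedDocument scenario, but both are assumptions about the final route shape:

```python
SUPPORTED_FORMATS = {"txt", "json", "excel", "markdown", "pdf"}

def v2_download_path(task_id: str, fmt: str, unified: bool = False) -> str:
    """Build the V2 download path for a completed task.

    Rejects formats outside the spec's list so callers fail fast
    instead of hitting a 404 on the server.
    """
    if fmt not in SUPPORTED_FORMATS:
        raise ValueError(f"unsupported export format: {fmt}")
    path = f"/api/v2/tasks/{task_id}/download/{fmt}"
    if unified and fmt == "json":
        path += "?unified=true"
    return path
```
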
## ADDED Requirements
### Requirement: Enhanced PDF Export with Layout Preservation
The PDF export SHALL accurately preserve document layout from both OCR and direct extraction tracks.
#### Scenario: Export PDF from direct extraction track
- **WHEN** exporting PDF from a direct-extraction processed document
- **THEN** the PDF SHALL maintain exact text positioning from source
- **AND** preserve original fonts and styles where possible
- **AND** include extracted images at correct positions
#### Scenario: Export PDF from OCR track with full structure
- **WHEN** exporting PDF from OCR-processed document
- **THEN** the PDF SHALL use all 23 PP-StructureV3 element types
- **AND** render tables with proper cell boundaries
- **AND** maintain reading order from parsing_res_list
#### Scenario: Handle coordinate transformations
- **WHEN** generating PDF from UnifiedDocument
- **THEN** system SHALL correctly transform bbox coordinates to PDF space
- **AND** handle page size variations
- **AND** prevent text overlap using enhanced overlap detection
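The coordinate transformation scenario reduces to a y-axis flip plus unit scaling: OCR bboxes are typically top-left origin with y increasing downward, while PDF user space is bottom-left origin with y increasing upward. A sketch, assuming `(x0, y0, x1, y1)` bboxes and a uniform scale factor:

```python
def bbox_to_pdf(bbox, page_height_pt, scale=1.0):
    """Transform a top-left-origin bbox into PDF user space.

    scale converts source units to points (e.g. 72 / dpi for
    pixel coordinates from a rendered page image).
    """
    x0, y0, x1, y1 = (v * scale for v in bbox)
    # Flip the y axis: the bbox top edge becomes the larger PDF y value.
    return (x0, page_height_pt - y1, x1, page_height_pt - y0)
```
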
### Requirement: Structure Data Export
The system SHALL provide export formats that preserve document structure for downstream processing.
#### Scenario: Export structured JSON with hierarchy
- **WHEN** user selects structured JSON format
- **THEN** export SHALL include element hierarchy and relationships
- **AND** preserve parent-child relationships (sections, lists)
- **AND** include style and formatting information
#### Scenario: Export for translation preparation
- **WHEN** user exports with translation_ready=true parameter
- **THEN** export SHALL include translatable text segments
- **AND** maintain coordinate mappings for each segment
- **AND** mark non-translatable regions
#### Scenario: Export with layout analysis
- **WHEN** user requests layout analysis export
- **THEN** system SHALL include reading order indices
- **AND** identify layout regions (header, body, footer, sidebar)
- **AND** provide confidence scores for layout detection
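The translation-preparation export above can be sketched as a pass over UnifiedDocument-style elements; the element dict keys and the set of non-translatable type names are assumptions for illustration:

```python
NON_TRANSLATABLE = {"image", "formula", "seal", "chart", "number"}

def translation_segments(elements):
    """Split document elements into translation segments.

    Each element is assumed to be a dict with "id", "type", "text"
    and "bbox" keys; coordinate mappings are preserved per segment
    so translated text can be placed back at the source position.
    """
    segments = []
    for el in elements:
        text = el.get("text", "")
        segments.append({
            "element_id": el["id"],
            "bbox": el["bbox"],
            "text": text,
            "translatable": el["type"] not in NON_TRANSLATABLE
                            and bool(text.strip()),
        })
    return segments
```
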

View File

@@ -0,0 +1,105 @@
# Task Management Spec Delta
## MODIFIED Requirements
### Requirement: Task Result Generation
The OCR service SHALL generate both JSON and Markdown result files for completed tasks with actual content, including processing track information and enhanced structure data.
#### Scenario: Markdown file contains OCR results
- **WHEN** a task completes OCR processing successfully
- **THEN** the generated `.md` file SHALL contain the extracted text in markdown format
- **AND** the file size SHALL be greater than 0 bytes
- **AND** the markdown SHALL include headings, paragraphs, and formatting based on OCR layout detection
#### Scenario: Result files stored in task directory
- **WHEN** OCR processing completes for task ID `88c6c2d2-37e1-48fd-a50f-406142987bdf`
- **THEN** result files SHALL be stored in `storage/results/88c6c2d2-37e1-48fd-a50f-406142987bdf/`
- **AND** both `<filename>_result.json` and `<filename>_result.md` SHALL exist
- **AND** both files SHALL contain valid OCR output data
#### Scenario: Include processing track in results
- **WHEN** a task completes through dual-track processing
- **THEN** the JSON result SHALL include "processing_track" field
- **AND** SHALL indicate whether "ocr" or "direct" track was used
- **AND** SHALL include track-specific metadata (confidence for OCR, extraction quality for direct)
#### Scenario: Store UnifiedDocument format
- **WHEN** processing completes through either track
- **THEN** system SHALL save results in UnifiedDocument format
- **AND** maintain backward-compatible JSON structure
- **AND** include enhanced structure from PP-StructureV3 or PyMuPDF
### Requirement: Task Detail View
The frontend SHALL provide a dedicated page for viewing individual task details with processing track information and enhanced preview capabilities.
#### Scenario: Navigate to task detail page
- **WHEN** user clicks "View Details" button on task in Task History page
- **THEN** browser SHALL navigate to `/tasks/{task_id}`
- **AND** TaskDetailPage component SHALL render
#### Scenario: Display task information
- **WHEN** TaskDetailPage loads for a valid task ID
- **THEN** page SHALL display task metadata (filename, status, processing time, confidence)
- **AND** page SHALL show markdown preview of OCR results
- **AND** page SHALL provide download buttons for JSON, Markdown, and PDF formats
#### Scenario: Download from task detail page
- **WHEN** user clicks download button for a specific format
- **THEN** browser SHALL download the file using `/api/v2/tasks/{task_id}/download/{format}` endpoint
- **AND** downloaded file SHALL contain the task's OCR results in requested format
#### Scenario: Display processing track information
- **WHEN** viewing task processed through dual-track system
- **THEN** page SHALL display processing track used (OCR or Direct)
- **AND** show track-specific metrics (OCR confidence or extraction quality)
- **AND** provide option to reprocess with alternate track if applicable
#### Scenario: Preview document structure
- **WHEN** user enables structure view
- **THEN** page SHALL display document element hierarchy
- **AND** show bounding boxes overlay on preview
- **AND** highlight different element types (headers, tables, lists) with distinct colors
## ADDED Requirements
### Requirement: Processing Track Management
The task management system SHALL track and display processing track information for all tasks.
#### Scenario: Track processing route selection
- **WHEN** a task begins processing
- **THEN** system SHALL record the selected processing track
- **AND** log the reason for track selection
- **AND** store auto-detection confidence score
#### Scenario: Allow track override
- **WHEN** user views a completed task
- **THEN** system SHALL offer option to reprocess with different track
- **AND** maintain both results for comparison
- **AND** track which result user prefers
#### Scenario: Display processing metrics
- **WHEN** task completes processing
- **THEN** system SHALL record track-specific metrics
- **AND** OCR track SHALL show confidence scores and character count
- **AND** Direct track SHALL show extraction coverage and structure quality
### Requirement: Task Processing History
The system SHALL maintain detailed processing history for tasks including track changes and reprocessing.
#### Scenario: Record reprocessing attempts
- **WHEN** a task is reprocessed with different track
- **THEN** system SHALL maintain processing history
- **AND** store results from each attempt
- **AND** allow comparison between different processing attempts
#### Scenario: Track quality improvements
- **WHEN** viewing task history
- **THEN** system SHALL show quality metrics over time
- **AND** indicate if reprocessing improved results
- **AND** suggest optimal track based on document characteristics
#### Scenario: Export processing analytics
- **WHEN** exporting task data
- **THEN** system SHALL include processing history
- **AND** provide track selection statistics
- **AND** include performance metrics for each processing attempt

View File

@@ -0,0 +1,170 @@
# Implementation Tasks: Dual-track Document Processing
## 1. Core Infrastructure
- [ ] 1.1 Add PyMuPDF and other dependencies to requirements.txt
- [ ] 1.1.1 Add PyMuPDF==1.23.x
- [ ] 1.1.2 Add pdfplumber==0.10.x
- [ ] 1.1.3 Add python-magic-bin==0.4.x
- [ ] 1.1.4 Test dependency installation
- [ ] 1.2 Create UnifiedDocument model in backend/app/models/
- [ ] 1.2.1 Define UnifiedDocument dataclass
- [ ] 1.2.2 Add DocumentElement model
- [ ] 1.2.3 Add DocumentMetadata model
- [ ] 1.2.4 Create converters for both OCR and direct extraction outputs
- [ ] 1.3 Create DocumentTypeDetector service
- [ ] 1.3.1 Implement file type detection using python-magic
- [ ] 1.3.2 Add PDF editability checking logic
- [ ] 1.3.3 Add Office document detection
- [ ] 1.3.4 Create routing logic to determine processing track
- [ ] 1.3.5 Add unit tests for detector
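Task 1.2's model might look like the following minimal sketch; every field name here is a placeholder until the dataclasses in 1.2.1 through 1.2.3 are actually defined:

```python
from dataclasses import dataclass, field
from typing import Optional

@dataclass
class DocumentElement:
    """One layout element from either track (field names illustrative)."""
    type: str                      # e.g. "text", "title", "table", "image"
    bbox: tuple                    # (x0, y0, x1, y1) in page coordinates
    text: str = ""
    page: int = 0
    children: list = field(default_factory=list)

@dataclass
class DocumentMetadata:
    filename: str
    page_count: int
    processing_track: str          # "ocr" | "direct"
    confidence: Optional[float] = None  # OCR track only

@dataclass
class UnifiedDocument:
    metadata: DocumentMetadata
    elements: list                 # DocumentElement, in reading order

    def plain_text(self) -> str:
        return "\n".join(e.text for e in self.elements if e.text)
```

Keeping both tracks' converters targeting this one shape is what lets the export and translation layers stay track-agnostic.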
## 2. Direct Extraction Track
- [ ] 2.1 Create DirectExtractionEngine service
- [ ] 2.1.1 Implement PyMuPDF-based text extraction
- [ ] 2.1.2 Add structure preservation logic
- [ ] 2.1.3 Extract tables with coordinates
- [ ] 2.1.4 Extract images and their positions
- [ ] 2.1.5 Maintain reading order
- [ ] 2.1.6 Handle multi-column layouts
- [ ] 2.2 Implement layout analysis for editable PDFs
- [ ] 2.2.1 Detect headers and footers
- [ ] 2.2.2 Identify sections and subsections
- [ ] 2.2.3 Parse lists and nested structures
- [ ] 2.2.4 Extract font and style information
- [ ] 2.3 Create direct extraction to UnifiedDocument converter
- [ ] 2.3.1 Map PyMuPDF structures to UnifiedDocument
- [ ] 2.3.2 Preserve coordinate information
- [ ] 2.3.3 Maintain element relationships
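Task 2.3's converter can be sketched against the dict structure PyMuPDF documents for `page.get_text("dict")` (text blocks have `type` 0, image blocks `type` 1); the output element shape is our own placeholder, not a PyMuPDF API:

```python
def blocks_to_elements(page_dict, page_no=0):
    """Flatten a PyMuPDF page.get_text("dict") result into elements.

    Preserves each block's bbox and the block order, which PyMuPDF
    emits roughly in reading order for simple layouts.
    """
    elements = []
    for block in page_dict.get("blocks", []):
        if block["type"] == 0:  # text block: join spans line by line
            text = "\n".join(
                "".join(span["text"] for span in line["spans"])
                for line in block["lines"]
            )
            elements.append({"type": "text", "text": text,
                             "bbox": tuple(block["bbox"]), "page": page_no})
        elif block["type"] == 1:  # embedded image
            elements.append({"type": "image", "text": "",
                             "bbox": tuple(block["bbox"]), "page": page_no})
    return elements
```
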
## 3. OCR Track Enhancement
- [ ] 3.1 Upgrade PP-StructureV3 configuration
- [ ] 3.1.1 Update config for RTX 4060 8GB optimization
- [ ] 3.1.2 Enable batch processing for GPU efficiency
- [ ] 3.1.3 Configure memory management settings
- [ ] 3.1.4 Set up model caching
- [ ] 3.2 Enhance OCR service to use parsing_res_list
- [ ] 3.2.1 Replace markdown extraction with parsing_res_list
- [ ] 3.2.2 Extract all 23 element types
- [ ] 3.2.3 Preserve bbox coordinates from PP-StructureV3
- [ ] 3.2.4 Maintain reading order information
- [ ] 3.3 Create OCR to UnifiedDocument converter
- [ ] 3.3.1 Map PP-StructureV3 elements to UnifiedDocument
- [ ] 3.3.2 Handle complex nested structures
- [ ] 3.3.3 Preserve all metadata
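Task 3.3's mapping might look like the sketch below. PP-StructureV3 entries are assumed to expose `block_label`, `block_content`, and `block_bbox` (field names vary across PaddleOCR releases, so verify against the installed version); list position is taken as reading order per 3.2.4:

```python
def parsing_res_to_elements(parsing_res_list, page_no=0):
    """Map PP-StructureV3 parsing_res_list entries to unified elements.

    Falls back to safe defaults when a field is missing so a single
    malformed block does not abort conversion of the whole page.
    """
    elements = []
    for order, block in enumerate(parsing_res_list):
        elements.append({
            "type": block.get("block_label", "text"),
            "text": block.get("block_content", ""),
            "bbox": tuple(block.get("block_bbox", (0, 0, 0, 0))),
            "page": page_no,
            "reading_order": order,
        })
    return elements
```
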
## 4. Unified Processing Pipeline
- [ ] 4.1 Update main OCR service for dual-track processing
- [ ] 4.1.1 Integrate DocumentTypeDetector
- [ ] 4.1.2 Route to appropriate processing engine
- [ ] 4.1.3 Return UnifiedDocument from both tracks
- [ ] 4.1.4 Maintain backward compatibility
- [ ] 4.2 Create unified JSON export
- [ ] 4.2.1 Define standardized JSON schema
- [ ] 4.2.2 Include processing metadata
- [ ] 4.2.3 Support both track outputs
- [ ] 4.3 Update PDF generator for UnifiedDocument
- [ ] 4.3.1 Adapt PDF generation to use UnifiedDocument
- [ ] 4.3.2 Preserve layout from both tracks
- [ ] 4.3.3 Handle coordinate transformations
## 5. Translation System Foundation
- [ ] 5.1 Create TranslationEngine interface
- [ ] 5.1.1 Define translation API contract
- [ ] 5.1.2 Support element-level translation
- [ ] 5.1.3 Preserve formatting markers
- [ ] 5.2 Implement structure-preserving translation
- [ ] 5.2.1 Translate text while maintaining coordinates
- [ ] 5.2.2 Handle table cell translations
- [ ] 5.2.3 Preserve list structures
- [ ] 5.2.4 Maintain header hierarchies
- [ ] 5.3 Create translated document renderer
- [ ] 5.3.1 Generate PDF with translated text
- [ ] 5.3.2 Adjust layouts for text expansion/contraction
- [ ] 5.3.3 Handle font substitution for target languages
## 6. API Updates
- [ ] 6.1 Update OCR endpoints
- [ ] 6.1.1 Add processing_track parameter
- [ ] 6.1.2 Support track auto-detection
- [ ] 6.1.3 Return processing metadata
- [ ] 6.2 Add document type detection endpoint
- [ ] 6.2.1 Create /analyze endpoint
- [ ] 6.2.2 Return recommended processing track
- [ ] 6.2.3 Provide confidence scores
- [ ] 6.3 Update result export endpoints
- [ ] 6.3.1 Support UnifiedDocument format
- [ ] 6.3.2 Add format conversion options
- [ ] 6.3.3 Include processing track information
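The `/analyze` endpoint in 6.2 needs a recommendation heuristic at its core; a sketch under assumed thresholds, where `text_coverage` is the fraction of PDF pages with an extractable text layer (near 0.0 for scans, near 1.0 for born-digital PDFs):

```python
def recommend_track(extension: str, text_coverage: float = 0.0):
    """Heuristic processing-track recommendation for /analyze.

    Returns a track name plus a confidence score; the extension
    lists and the 0.8 coverage threshold are illustrative defaults.
    """
    ext = extension.lower().lstrip(".")
    if ext in {"png", "jpg", "jpeg", "tif", "tiff", "bmp"}:
        return {"track": "ocr", "confidence": 0.99}      # images must be OCRed
    if ext in {"docx", "xlsx", "pptx"}:
        return {"track": "direct", "confidence": 0.95}   # native structure available
    if ext == "pdf":
        if text_coverage >= 0.8:
            return {"track": "direct", "confidence": round(text_coverage, 2)}
        return {"track": "ocr", "confidence": round(1.0 - text_coverage, 2)}
    return {"track": "ocr", "confidence": 0.5}           # unknown type: OCR fallback
```

The same function can log its inputs alongside the decision, which covers the track-selection audit trail the task-management delta asks for.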
## 7. Frontend Updates
- [ ] 7.1 Update task detail view
- [ ] 7.1.1 Display processing track information
- [ ] 7.1.2 Show track-specific metadata
- [ ] 7.1.3 Add track selection UI (if manual override needed)
- [ ] 7.2 Update results preview
- [ ] 7.2.1 Handle UnifiedDocument format
- [ ] 7.2.2 Display enhanced structure information
- [ ] 7.2.3 Show coordinate overlays (debug mode)
- [ ] 7.3 Add translation UI preparation
- [ ] 7.3.1 Add translation toggle/button
- [ ] 7.3.2 Language selection dropdown
- [ ] 7.3.3 Translation progress indicator
## 8. Testing
- [ ] 8.1 Unit tests for DocumentTypeDetector
- [ ] 8.1.1 Test various file types
- [ ] 8.1.2 Test editability detection
- [ ] 8.1.3 Test edge cases
- [ ] 8.2 Unit tests for DirectExtractionEngine
- [ ] 8.2.1 Test text extraction accuracy
- [ ] 8.2.2 Test structure preservation
- [ ] 8.2.3 Test coordinate extraction
- [ ] 8.3 Integration tests for dual-track processing
- [ ] 8.3.1 Test routing logic
- [ ] 8.3.2 Test UnifiedDocument generation
- [ ] 8.3.3 Test backward compatibility
- [ ] 8.4 End-to-end tests
- [ ] 8.4.1 Test scanned PDF processing (OCR track)
- [ ] 8.4.2 Test editable PDF processing (direct track)
- [ ] 8.4.3 Test Office document processing
- [ ] 8.4.4 Test image file processing
- [ ] 8.5 Performance testing
- [ ] 8.5.1 Benchmark both processing tracks
- [ ] 8.5.2 Test GPU memory usage
- [ ] 8.5.3 Compare processing times
## 9. Documentation
- [ ] 9.1 Update API documentation
- [ ] 9.1.1 Document new endpoints
- [ ] 9.1.2 Update existing endpoint docs
- [ ] 9.1.3 Add processing track information
- [ ] 9.2 Create architecture documentation
- [ ] 9.2.1 Document dual-track flow
- [ ] 9.2.2 Explain UnifiedDocument structure
- [ ] 9.2.3 Add decision trees for track selection
- [ ] 9.3 Add deployment guide
- [ ] 9.3.1 Document GPU requirements
- [ ] 9.3.2 Add environment configuration
- [ ] 9.3.3 Include troubleshooting guide
## 10. Deployment Preparation
- [ ] 10.1 Update Docker configuration
- [ ] 10.1.1 Add new dependencies to Dockerfile
- [ ] 10.1.2 Configure GPU support
- [ ] 10.1.3 Update volume mappings
- [ ] 10.2 Update environment variables
- [ ] 10.2.1 Add processing track settings
- [ ] 10.2.2 Configure GPU memory limits
- [ ] 10.2.3 Add feature flags
- [ ] 10.3 Create migration plan
- [ ] 10.3.1 Plan for existing data migration
- [ ] 10.3.2 Create rollback procedures
- [ ] 10.3.3 Document breaking changes
## Completion Checklist
- [ ] All unit tests passing
- [ ] Integration tests passing
- [ ] Performance benchmarks acceptable
- [ ] Documentation complete
- [ ] Code reviewed
- [ ] Deployment tested in staging

View File

@@ -1,226 +0,0 @@
#!/usr/bin/env python3
"""
Proof of Concept: External API Authentication Test
Tests the external authentication API at https://pj-auth-api.vercel.app
"""
import asyncio
import json
from datetime import datetime
from typing import Dict, Any, Optional
import httpx
from pydantic import BaseModel, Field
class UserInfo(BaseModel):
"""User information from external API"""
id: str
name: str
email: str
job_title: Optional[str] = Field(None, alias="jobTitle")
office_location: Optional[str] = Field(None, alias="officeLocation")
business_phones: list[str] = Field(default_factory=list, alias="businessPhones")
class AuthSuccessData(BaseModel):
"""Successful authentication response data"""
access_token: str
id_token: str
expires_in: int
token_type: str
user_info: UserInfo = Field(alias="userInfo")
issued_at: str = Field(alias="issuedAt")
expires_at: str = Field(alias="expiresAt")
class AuthSuccessResponse(BaseModel):
"""Successful authentication response"""
success: bool
message: str
data: AuthSuccessData
timestamp: str
class AuthErrorResponse(BaseModel):
"""Failed authentication response"""
success: bool
error: str
code: str
timestamp: str
class ExternalAuthClient:
"""Client for external authentication API"""
def __init__(self, base_url: str = "https://pj-auth-api.vercel.app", timeout: int = 30):
self.base_url = base_url
self.timeout = timeout
self.endpoint = "/api/auth/login"
async def authenticate(self, username: str, password: str) -> Dict[str, Any]:
"""
Authenticate user with external API
Args:
username: User email/username
password: User password
Returns:
Authentication result dictionary
"""
url = f"{self.base_url}{self.endpoint}"
print(f" Endpoint: POST {url}")
print(f" Username: {username}")
print(f" Timestamp: {datetime.now().isoformat()}")
print()
async with httpx.AsyncClient() as client:
try:
# Make authentication request
start_time = datetime.now()
response = await client.post(
url,
json={"username": username, "password": password},
timeout=self.timeout
)
elapsed = (datetime.now() - start_time).total_seconds()
# Print response details
print("Response Details:")
print(f" Status Code: {response.status_code}")
print(f" Response Time: {elapsed:.3f}s")
print(f" Content-Type: {response.headers.get('content-type', 'N/A')}")
print()
# Parse response
response_data = response.json()
print("Response Body:")
print(json.dumps(response_data, indent=2, ensure_ascii=False))
print()
# Handle success/failure
if response.status_code == 200:
auth_response = AuthSuccessResponse(**response_data)
return {
"success": True,
"status_code": response.status_code,
"data": auth_response.dict(),
"user_display_name": auth_response.data.user_info.name,
"user_email": auth_response.data.user_info.email,
"token": auth_response.data.access_token,
"expires_in": auth_response.data.expires_in,
"expires_at": auth_response.data.expires_at
}
elif response.status_code == 401:
error_response = AuthErrorResponse(**response_data)
return {
"success": False,
"status_code": response.status_code,
"error": error_response.error,
"code": error_response.code
}
else:
return {
"success": False,
"status_code": response.status_code,
"error": f"Unexpected status code: {response.status_code}",
"response": response_data
}
except httpx.TimeoutException:
print(f"❌ Request timeout after {self.timeout} seconds")
return {
"success": False,
"error": "Request timeout",
"code": "TIMEOUT"
}
except httpx.RequestError as e:
print(f"❌ Request error: {e}")
return {
"success": False,
"error": str(e),
"code": "REQUEST_ERROR"
}
except Exception as e:
print(f"❌ Unexpected error: {e}")
return {
"success": False,
"error": str(e),
"code": "UNKNOWN_ERROR"
}
async def test_authentication():
"""Test authentication with different scenarios"""
client = ExternalAuthClient()
# Test scenarios
test_cases = [
{
"name": "Valid Credentials (Example)",
"username": "ymirliu@panjit.com.tw",
"password": "correct_password", # Replace with actual password for testing
"expected": "success"
},
{
"name": "Invalid Credentials",
"username": "test@example.com",
"password": "wrong_password",
"expected": "failure"
}
]
for i, test_case in enumerate(test_cases, 1):
print(f"{'='*60}")
print(f"Test Case {i}: {test_case['name']}")
print(f"{'='*60}")
result = await client.authenticate(
username=test_case["username"],
password=test_case["password"]
)
# Analyze result
print("\nAnalysis:")
if result["success"]:
print("✅ Authentication successful")
print(f" User: {result.get('user_display_name', 'N/A')}")
print(f" Email: {result.get('user_email', 'N/A')}")
print(f" Token expires in: {result.get('expires_in', 0)} seconds")
print(f" Expires at: {result.get('expires_at', 'N/A')}")
else:
print("❌ Authentication failed")
print(f" Error: {result.get('error', 'Unknown error')}")
print(f" Code: {result.get('code', 'N/A')}")
print("\n")
async def test_token_validation():
"""Test token validation and refresh logic"""
# This would be implemented when we have a valid token
print("Token validation test - To be implemented with actual tokens")
pass
def main():
"""Main entry point"""
print("External Authentication API Test")
print("================================\n")
# Run tests
asyncio.run(test_authentication())
print("\nTest completed!")
print("\nNotes for implementation:")
print("1. Use httpx for async HTTP requests (already in requirements)")
print("2. Store tokens securely (consider encryption)")
print("3. Implement automatic token refresh before expiration")
print("4. Handle network failures with retry logic")
print("5. Map external user ID to local user records")
print("6. Display user 'name' field in UI instead of username")
if __name__ == "__main__":
main()