feat: add translated PDF format selection (layout/reflow)

- Add generate_translated_layout_pdf() method for layout-preserving translated PDFs
- Add generate_translated_pdf() method for reflow translated PDFs
- Update translate router to accept format parameter (layout/reflow)
- Update frontend with dropdown to select translated PDF format
- Fix reflow PDF table cell extraction from content dict
- Add embedded images handling in reflow PDF tables
- Archive improve-translated-text-fitting openspec proposal

🤖 Generated with [Claude Code](https://claude.com/claude-code)

Co-Authored-By: Claude <noreply@anthropic.com>
919
docs/API.md
@@ -1,842 +1,97 @@
|
||||
# Tool_OCR V2 API Documentation
|
||||
|
||||
## Overview
|
||||
|
||||
Tool_OCR V2 provides a comprehensive OCR service with dual-track document processing. The API supports intelligent routing between OCR track (for scanned documents) and Direct Extraction track (for editable PDFs and Office documents).
|
||||
|
||||
**Base URL**: `http://localhost:8000/api/v2`
|
||||
|
||||
**Authentication**: Bearer token (JWT)
|
||||
|
||||
---
|
||||
|
||||
## Table of Contents
|
||||
|
||||
1. [Authentication](#authentication)
|
||||
2. [Task Management](#task-management)
|
||||
3. [Document Processing](#document-processing)
|
||||
4. [Document Analysis](#document-analysis)
|
||||
5. [File Downloads](#file-downloads)
|
||||
6. [Processing Tracks](#processing-tracks)
|
||||
7. [Response Models](#response-models)
|
||||
8. [Error Handling](#error-handling)
|
||||
|
||||
---
|
||||
|
||||
## Authentication
|
||||
|
||||
All endpoints require authentication via Bearer token.
|
||||
|
||||
### Headers
|
||||
```http
|
||||
Authorization: Bearer <access_token>
|
||||
```
|
||||
|
||||
### Login
|
||||
```http
|
||||
POST /api/auth/login
|
||||
Content-Type: application/json
|
||||
|
||||
{
|
||||
"email": "user@example.com",
|
||||
"password": "password123"
|
||||
}
|
||||
```
|
||||
|
||||
**Response**:
|
||||
```json
|
||||
{
|
||||
"access_token": "eyJhbGc...",
|
||||
"token_type": "bearer",
|
||||
"user": {
|
||||
"id": 1,
|
||||
"email": "user@example.com",
|
||||
"username": "user"
|
||||
}
|
||||
}
|
||||
```
|
||||
|
||||
---
|
||||
|
||||
## Task Management
|
||||
|
||||
### Create Task
|
||||
|
||||
Create a new OCR processing task by uploading a document.
|
||||
|
||||
```http
|
||||
POST /tasks/
|
||||
Content-Type: multipart/form-data
|
||||
```
|
||||
|
||||
**Request Body**:
|
||||
- `file` (required): Document file to process
|
||||
- Supported formats: PDF, PNG, JPG, JPEG, GIF, BMP, TIFF, DOCX, PPTX, XLSX
|
||||
- `language` (optional): OCR language code (default: 'ch')
|
||||
- Options: 'ch', 'en', 'japan', 'korean', etc.
|
||||
- `detect_layout` (optional): Enable layout detection (default: true)
|
||||
- `force_track` (optional): Force specific processing track
|
||||
- Options: 'ocr', 'direct', 'auto' (default: 'auto')
|
||||
|
||||
**Response** `201 Created`:
|
||||
```json
|
||||
{
|
||||
"task_id": "550e8400-e29b-41d4-a716-446655440000",
|
||||
"filename": "document.pdf",
|
||||
"status": "pending",
|
||||
"language": "ch",
|
||||
"created_at": "2025-11-20T10:00:00Z"
|
||||
}
|
||||
```
|
||||
|
||||
**Processing Track Selection**:
|
||||
- `auto` (default): Automatically select optimal track based on document analysis
|
||||
- Editable PDFs → Direct track (faster, ~1-2s/page)
|
||||
- Scanned documents/images → OCR track (slower, ~2-5s/page)
|
||||
- Office documents → Convert to PDF, then route based on content
|
||||
- `ocr`: Force OCR processing (PaddleOCR PP-StructureV3)
|
||||
- `direct`: Force direct extraction (PyMuPDF) - only for editable PDFs
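
The routing rules above can be sketched as a small function. This is a minimal illustration, not the service's implementation: `select_track` and `Track` are hypothetical names, and the fallback to OCR for mixed or unknown document types is an assumption that the text above does not state.

```python
from enum import Enum


class Track(str, Enum):
    OCR = "ocr"
    DIRECT = "direct"


def select_track(document_type: str, force_track: str = "auto") -> Track:
    """Pick a processing track from a detected document type (illustrative only)."""
    # An explicit force_track overrides automatic routing.
    if force_track in (Track.OCR.value, Track.DIRECT.value):
        return Track(force_track)
    if force_track != "auto":
        raise ValueError(f"unsupported force_track: {force_track!r}")
    # Editable PDFs carry extractable text, so direct extraction applies.
    if document_type == "pdf_editable":
        return Track.DIRECT
    # Scanned PDFs and images need OCR. Assumption: mixed or unknown
    # types also fall back to OCR; the documentation does not say.
    return Track.OCR
```

Office documents are converted to PDF first, so under this sketch the `document_type` passed in would be the type detected after conversion.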
|
||||
|
||||
---
|
||||
|
||||
### List Tasks
|
||||
|
||||
Get a paginated, filterable list of the current user's tasks.
|
||||
|
||||
```http
|
||||
GET /tasks/?status={status}&filename={search}&skip={skip}&limit={limit}
|
||||
```
|
||||
|
||||
**Query Parameters**:
|
||||
- `status` (optional): Filter by task status
|
||||
- Options: `pending`, `processing`, `completed`, `failed`
|
||||
- `filename` (optional): Search by filename (partial match)
|
||||
- `skip` (optional): Pagination offset (default: 0)
|
||||
- `limit` (optional): Page size (default: 10, max: 100)
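
A client might assemble the query string for this endpoint as follows. This is a sketch under the parameter rules listed above; `list_tasks_path` is a hypothetical helper, and rejecting out-of-range values on the client side is an assumption rather than documented server behavior.

```python
from typing import Optional
from urllib.parse import urlencode

VALID_STATUSES = {"pending", "processing", "completed", "failed"}
MAX_LIMIT = 100  # documented maximum page size


def list_tasks_path(status: Optional[str] = None,
                    filename: Optional[str] = None,
                    skip: int = 0,
                    limit: int = 10) -> str:
    """Build the relative path and query string for GET /tasks/."""
    if skip < 0:
        raise ValueError("skip must be >= 0")
    if not 1 <= limit <= MAX_LIMIT:
        raise ValueError(f"limit must be between 1 and {MAX_LIMIT}")
    if status is not None and status not in VALID_STATUSES:
        raise ValueError(f"unknown status: {status!r}")
    params = {"skip": skip, "limit": limit}
    # Optional filters are sent only when the caller provides them.
    if status is not None:
        params["status"] = status
    if filename is not None:
        params["filename"] = filename
    return "/tasks/?" + urlencode(params)
```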
|
||||
|
||||
**Response** `200 OK`:
|
||||
```json
|
||||
{
|
||||
"tasks": [
|
||||
{
|
||||
"task_id": "550e8400-e29b-41d4-a716-446655440000",
|
||||
"filename": "document.pdf",
|
||||
"status": "completed",
|
||||
"language": "ch",
|
||||
"processing_track": "direct",
|
||||
"processing_time": 1.14,
|
||||
"created_at": "2025-11-20T10:00:00Z",
|
||||
"completed_at": "2025-11-20T10:00:02Z"
|
||||
}
|
||||
],
|
||||
"total": 42,
|
||||
"skip": 0,
|
||||
"limit": 10
|
||||
}
|
||||
```
|
||||
|
||||
---
|
||||
|
||||
### Get Task Details
|
||||
|
||||
Retrieve detailed information about a specific task.
|
||||
|
||||
```http
|
||||
GET /tasks/{task_id}
|
||||
```
|
||||
|
||||
**Response** `200 OK`:
|
||||
```json
|
||||
{
|
||||
"task_id": "550e8400-e29b-41d4-a716-446655440000",
|
||||
"filename": "document.pdf",
|
||||
"status": "completed",
|
||||
"language": "ch",
|
||||
"processing_track": "direct",
|
||||
"document_type": "pdf_editable",
|
||||
"processing_time": 1.14,
|
||||
"page_count": 3,
|
||||
"element_count": 51,
|
||||
"character_count": 10592,
|
||||
"confidence": 0.95,
|
||||
"created_at": "2025-11-20T10:00:00Z",
|
||||
"completed_at": "2025-11-20T10:00:02Z",
|
||||
"result_files": {
|
||||
"json": "/tasks/550e8400-e29b-41d4-a716-446655440000/download/json",
|
||||
"markdown": "/tasks/550e8400-e29b-41d4-a716-446655440000/download/markdown",
|
||||
"pdf": "/tasks/550e8400-e29b-41d4-a716-446655440000/download/pdf"
|
||||
},
|
||||
"metadata": {
|
||||
"file_size": 524288,
|
||||
"mime_type": "application/pdf",
|
||||
"text_coverage": 0.95,
|
||||
"processing_track_reason": "PDF has extractable text on 100% of sampled pages"
|
||||
}
|
||||
}
|
||||
```
|
||||
|
||||
**New Fields** (Dual-Track):
|
||||
- `processing_track`: Track used for processing (`ocr`, `direct`, or `null`)
|
||||
- `document_type`: Detected document type
|
||||
- `pdf_editable`: Editable PDF with text
|
||||
- `pdf_scanned`: Scanned/image-based PDF
|
||||
- `pdf_mixed`: Mixed content PDF
|
||||
- `image`: Image file
|
||||
- `office_word`, `office_excel`, `office_ppt`: Office documents
|
||||
- `page_count`: Number of pages extracted
|
||||
- `element_count`: Total elements (text, tables, images) extracted
|
||||
- `character_count`: Total characters extracted
|
||||
- `metadata.text_coverage`: Fraction of pages with extractable text (0.0-1.0)
|
||||
- `metadata.processing_track_reason`: Explanation of track selection
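
To make `text_coverage` concrete, the sketch below derives it from per-page flags. How the service actually computes the value is not shown in this document; the function assumes entries shaped like the `page_details` items of the analyze response, each with a boolean `has_text`.

```python
def text_coverage(page_details: list[dict]) -> float:
    """Fraction of pages whose has_text flag is true (0.0 when no pages are given)."""
    if not page_details:
        return 0.0
    pages_with_text = sum(1 for page in page_details if page.get("has_text"))
    return pages_with_text / len(page_details)
```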
|
||||
|
||||
---
|
||||
|
||||
### Get Task Statistics
|
||||
|
||||
Get aggregated statistics for user's tasks.
|
||||
|
||||
```http
|
||||
GET /tasks/stats
|
||||
```
|
||||
|
||||
**Response** `200 OK`:
|
||||
```json
|
||||
{
|
||||
"total_tasks": 150,
|
||||
"by_status": {
|
||||
"pending": 5,
|
||||
"processing": 3,
|
||||
"completed": 140,
|
||||
"failed": 2
|
||||
},
|
||||
"by_processing_track": {
|
||||
"ocr": 80,
|
||||
"direct": 60,
|
||||
"unknown": 10
|
||||
},
|
||||
"total_pages_processed": 4250,
|
||||
"average_processing_time": 3.5,
|
||||
"success_rate": 0.987
|
||||
}
|
||||
```
|
||||
|
||||
---
|
||||
|
||||
### Delete Task
|
||||
|
||||
Delete a task and all associated files.
|
||||
|
||||
```http
|
||||
DELETE /tasks/{task_id}
|
||||
```
|
||||
|
||||
**Response** `204 No Content`
|
||||
|
||||
---
|
||||
|
||||
## Document Processing
|
||||
|
||||
### Processing Workflow
|
||||
|
||||
1. **Upload Document** → `POST /tasks/` → Returns `task_id`
|
||||
2. **Background Processing** → Task status changes to `processing`
|
||||
3. **Complete** → Task status changes to `completed` or `failed`
|
||||
4. **Download Results** → Use download endpoints
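
Steps 2 and 3 of the workflow imply that a client polls the task until it reaches a terminal status. A minimal sketch of that loop follows; `wait_for_task` is a hypothetical helper, and `fetch_status` stands in for whatever call returns the `status` field of `GET /tasks/{task_id}`. The interval and timeout defaults are arbitrary.

```python
import time
from typing import Callable

TERMINAL_STATUSES = {"completed", "failed"}


def wait_for_task(fetch_status: Callable[[], str],
                  interval: float = 1.0,
                  timeout: float = 60.0,
                  sleep: Callable[[float], None] = time.sleep) -> str:
    """Poll fetch_status until the task is completed or failed, or time runs out."""
    deadline = time.monotonic() + timeout
    while True:
        status = fetch_status()
        if status in TERMINAL_STATUSES:
            return status
        if time.monotonic() >= deadline:
            raise TimeoutError(f"task still {status!r} after {timeout} seconds")
        sleep(interval)
```

Passing `sleep` as a parameter keeps the loop testable without real delays.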
|
||||
|
||||
### Track Selection Flow
|
||||
|
||||
```
Document Upload
       ↓
Document Type Detection
       ↓
┌──────────────┐
│ Auto Routing │
└──────┬───────┘
       ↓
  ┌────┴─────┐
  ↓          ↓
[Direct]   [OCR]
  ↓          ↓
PyMuPDF  PaddleOCR
  ↓          ↓
  UnifiedDocument
       ↓
Export (JSON/MD/PDF)
```
|
||||
|
||||
**Direct Track** (Fast - 1-2s/page):
|
||||
- Editable PDFs with extractable text
|
||||
- Office documents (converted to text-based PDF)
|
||||
- Uses PyMuPDF for direct text extraction
|
||||
- Preserves exact layout and fonts
|
||||
|
||||
**OCR Track** (Slower - 2-5s/page):
|
||||
- Scanned PDFs and images
|
||||
- Documents without extractable text
|
||||
- Uses PaddleOCR PP-StructureV3
|
||||
- Handles complex layouts with 23 element types
|
||||
|
||||
---
|
||||
|
||||
## Document Analysis
|
||||
|
||||
### Analyze Document Type
|
||||
|
||||
Analyze a document to determine optimal processing track **before** processing.
|
||||
|
||||
**NEW ENDPOINT**
|
||||
|
||||
```http
|
||||
POST /tasks/{task_id}/analyze
|
||||
```
|
||||
|
||||
**Response** `200 OK`:
|
||||
```json
|
||||
{
|
||||
"task_id": "550e8400-e29b-41d4-a716-446655440000",
|
||||
"filename": "document.pdf",
|
||||
"analysis": {
|
||||
"recommended_track": "direct",
|
||||
"confidence": 0.95,
|
||||
"reason": "PDF has extractable text on 100% of sampled pages",
|
||||
"document_type": "pdf_editable",
|
||||
"metadata": {
|
||||
"total_pages": 3,
|
||||
"sampled_pages": 3,
|
||||
"text_coverage": 1.0,
|
||||
"mime_type": "application/pdf",
|
||||
"file_size": 524288,
|
||||
"page_details": [
|
||||
{
|
||||
"page": 1,
|
||||
"text_length": 3520,
|
||||
"has_text": true,
|
||||
"image_count": 2,
|
||||
"image_coverage": 0.15
|
||||
}
|
||||
]
|
||||
}
|
||||
}
|
||||
}
|
||||
```
|
||||
|
||||
**Use Cases**:
|
||||
- Preview processing track before starting
|
||||
- Validate document type for batch processing
|
||||
- Provide user feedback on processing method
|
||||
|
||||
---
|
||||
|
||||
### Get Processing Metadata
|
||||
|
||||
Get detailed metadata about how a document was processed.
|
||||
|
||||
**NEW ENDPOINT**
|
||||
|
||||
```http
|
||||
GET /tasks/{task_id}/metadata
|
||||
```
|
||||
|
||||
**Response** `200 OK`:
|
||||
```json
|
||||
{
|
||||
"task_id": "550e8400-e29b-41d4-a716-446655440000",
|
||||
"processing_track": "direct",
|
||||
"document_type": "pdf_editable",
|
||||
"confidence": 0.95,
|
||||
"reason": "PDF has extractable text on 100% of sampled pages",
|
||||
"statistics": {
|
||||
"page_count": 3,
|
||||
"element_count": 51,
|
||||
"total_tables": 2,
|
||||
"total_images": 3,
|
||||
"element_type_counts": {
|
||||
"text": 45,
|
||||
"table": 2,
|
||||
"image": 3,
|
||||
"header": 1
|
||||
},
|
||||
"text_stats": {
|
||||
"total_characters": 10592,
|
||||
"total_words": 1842,
|
||||
"average_confidence": 1.0
|
||||
}
|
||||
},
|
||||
"processing_info": {
|
||||
"processing_time": 1.14,
|
||||
"track_description": "PyMuPDF Direct Extraction - Used for editable PDFs",
|
||||
"schema_version": "1.0.0"
|
||||
},
|
||||
"file_metadata": {
|
||||
"filename": "document.pdf",
|
||||
"file_size": 524288,
|
||||
"mime_type": "application/pdf",
|
||||
"created_at": "2025-11-20T10:00:00Z"
|
||||
}
|
||||
}
|
||||
```
|
||||
|
||||
---
|
||||
|
||||
## File Downloads
|
||||
|
||||
### Download JSON Result
|
||||
|
||||
Download structured JSON output with full document structure.
|
||||
|
||||
```http
|
||||
GET /tasks/{task_id}/download/json
|
||||
```
|
||||
|
||||
**Response** `200 OK`:
|
||||
- Content-Type: `application/json`
|
||||
- Content-Disposition: `attachment; filename="{filename}_result.json"`
|
||||
|
||||
**JSON Structure**:
|
||||
```json
|
||||
{
|
||||
"schema_version": "1.0.0",
|
||||
"document_id": "d8bea84d-a4ea-4455-b219-243624b5518e",
|
||||
"export_timestamp": "2025-11-20T10:00:02Z",
|
||||
"metadata": {
|
||||
"filename": "document.pdf",
|
||||
"file_type": ".pdf",
|
||||
"file_size": 524288,
|
||||
"created_at": "2025-11-20T10:00:00Z",
|
||||
"processing_track": "direct",
|
||||
"processing_time": 1.14,
|
||||
"language": "ch",
|
||||
"processing_info": {
|
||||
"track_description": "PyMuPDF Direct Extraction",
|
||||
"schema_version": "1.0.0",
|
||||
"export_format": "unified_document_v1"
|
||||
}
|
||||
},
|
||||
"pages": [
|
||||
{
|
||||
"page_number": 1,
|
||||
"dimensions": {
|
||||
"width": 595.32,
|
||||
"height": 841.92
|
||||
},
|
||||
"elements": [
|
||||
{
|
||||
"element_id": "text_1_0",
|
||||
"type": "text",
|
||||
"bbox": {
|
||||
"x0": 72.0,
|
||||
"y0": 72.0,
|
||||
"x1": 200.0,
|
||||
"y1": 90.0
|
||||
},
|
||||
"content": "Document Title",
|
||||
"confidence": 1.0,
|
||||
"style": {
|
||||
"font": "Helvetica-Bold",
|
||||
"size": 18.0
|
||||
}
|
||||
}
|
||||
]
|
||||
}
|
||||
],
|
||||
"statistics": {
|
||||
"page_count": 3,
|
||||
"total_elements": 51,
|
||||
"total_tables": 2,
|
||||
"total_images": 3,
|
||||
"element_type_counts": {
|
||||
"text": 45,
|
||||
"table": 2,
|
||||
"image": 3,
|
||||
"header": 1
|
||||
},
|
||||
"text_stats": {
|
||||
"total_characters": 10592,
|
||||
"total_words": 1842,
|
||||
"average_confidence": 1.0
|
||||
}
|
||||
}
|
||||
}
|
||||
```
|
||||
|
||||
**Element Types**:
|
||||
- `text`: Text blocks
|
||||
- `header`: Headers (H1-H6)
|
||||
- `paragraph`: Paragraphs
|
||||
- `list`: Lists
|
||||
- `table`: Tables with cell structure
|
||||
- `image`: Images with position
|
||||
- `figure`: Figures with captions
|
||||
- `footer`: Page footers
|
||||
|
||||
---
|
||||
|
||||
### Download Markdown Result
|
||||
|
||||
Download Markdown formatted output.
|
||||
|
||||
```http
|
||||
GET /tasks/{task_id}/download/markdown
|
||||
```
|
||||
|
||||
**Response** `200 OK`:
|
||||
- Content-Type: `text/markdown`
|
||||
- Content-Disposition: `attachment; filename="{filename}_output.md"`
|
||||
|
||||
**Example Output**:
|
||||
```markdown
|
||||
# Document Title
|
||||
|
||||
This is the extracted content from the document.
|
||||
|
||||
## Section 1
|
||||
|
||||
Content of section 1...
|
||||
|
||||
| Column 1 | Column 2 |
|
||||
|----------|----------|
|
||||
| Data 1 | Data 2 |
|
||||
|
||||

|
||||
```
|
||||
|
||||
---
|
||||
|
||||
### Download Layout-Preserving PDF
|
||||
|
||||
Download reconstructed PDF with layout preservation.
|
||||
|
||||
```http
|
||||
GET /tasks/{task_id}/download/pdf
|
||||
```
|
||||
|
||||
**Response** `200 OK`:
|
||||
- Content-Type: `application/pdf`
|
||||
- Content-Disposition: `attachment; filename="{filename}_layout.pdf"`
|
||||
|
||||
**Features**:
|
||||
- Preserves original layout and coordinates
|
||||
- Maintains text positioning
|
||||
- Includes extracted images
|
||||
- Renders tables with proper structure
|
||||
|
||||
---
|
||||
|
||||
## Processing Tracks
|
||||
|
||||
### Track Comparison
|
||||
|
||||
| Feature | OCR Track | Direct Track |
|---------|-----------|--------------|
| **Speed** | 2-5 seconds/page | 0.5-1 second/page |
| **Best For** | Scanned documents, images | Editable PDFs, Office docs |
| **Technology** | PaddleOCR PP-StructureV3 | PyMuPDF |
| **Accuracy** | 92-98% (content-dependent) | 100% (text is extracted, not recognized) |
| **Layout Preservation** | Good (23 element types) | Excellent (exact coordinates) |
| **GPU Required** | Yes (8GB recommended) | No |
| **Supported Formats** | PDF, PNG, JPG, TIFF, etc. | PDF (with text), converted Office docs |
|
||||
|
||||
### Processing Track Enum
|
||||
|
||||
```python
from enum import Enum


class ProcessingTrackEnum(str, Enum):
    AUTO = "auto"      # Automatic selection (default)
    OCR = "ocr"        # Force OCR processing
    DIRECT = "direct"  # Force direct extraction
```
|
||||
|
||||
### Document Type Enum
|
||||
|
||||
```python
from enum import Enum


class DocumentType(str, Enum):
    PDF_EDITABLE = "pdf_editable"     # PDF with extractable text
    PDF_SCANNED = "pdf_scanned"       # Scanned/image-based PDF
    PDF_MIXED = "pdf_mixed"           # Mixed content PDF
    IMAGE = "image"                   # Image files
    OFFICE_WORD = "office_word"       # Word documents
    OFFICE_EXCEL = "office_excel"     # Excel spreadsheets
    OFFICE_POWERPOINT = "office_ppt"  # PowerPoint presentations
    TEXT = "text"                     # Plain text files
    UNKNOWN = "unknown"               # Unknown format
```
|
||||
|
||||
---
|
||||
|
||||
## Response Models
|
||||
|
||||
### TaskResponse
|
||||
|
||||
```typescript
|
||||
interface TaskResponse {
|
||||
task_id: string;
|
||||
filename: string;
|
||||
status: "pending" | "processing" | "completed" | "failed";
|
||||
language: string;
|
||||
processing_track?: "ocr" | "direct" | null;
|
||||
created_at: string; // ISO 8601
|
||||
completed_at?: string | null;
|
||||
}
|
||||
```
|
||||
|
||||
### TaskDetailResponse
|
||||
|
||||
Extends `TaskResponse` with:
|
||||
```typescript
|
||||
interface TaskDetailResponse extends TaskResponse {
|
||||
document_type?: string;
|
||||
processing_time?: number; // seconds
|
||||
page_count?: number;
|
||||
element_count?: number;
|
||||
character_count?: number;
|
||||
confidence?: number; // 0.0-1.0
|
||||
result_files?: {
|
||||
json?: string;
|
||||
markdown?: string;
|
||||
pdf?: string;
|
||||
};
|
||||
metadata?: {
|
||||
file_size?: number;
|
||||
mime_type?: string;
|
||||
text_coverage?: number; // 0.0-1.0
|
||||
processing_track_reason?: string;
|
||||
[key: string]: any;
|
||||
};
|
||||
}
|
||||
```
|
||||
|
||||
### DocumentAnalysisResponse
|
||||
|
||||
```typescript
|
||||
interface DocumentAnalysisResponse {
|
||||
task_id: string;
|
||||
filename: string;
|
||||
analysis: {
|
||||
recommended_track: "ocr" | "direct";
|
||||
confidence: number; // 0.0-1.0
|
||||
reason: string;
|
||||
document_type: string;
|
||||
metadata: {
|
||||
total_pages?: number;
|
||||
sampled_pages?: number;
|
||||
text_coverage?: number;
|
||||
mime_type?: string;
|
||||
file_size?: number;
|
||||
page_details?: Array<{
|
||||
page: number;
|
||||
text_length: number;
|
||||
has_text: boolean;
|
||||
image_count: number;
|
||||
image_coverage: number;
|
||||
}>;
|
||||
};
|
||||
};
|
||||
}
|
||||
```
|
||||
|
||||
### ProcessingMetadata
|
||||
|
||||
```typescript
|
||||
interface ProcessingMetadata {
|
||||
task_id: string;
|
||||
processing_track: "ocr" | "direct";
|
||||
document_type: string;
|
||||
confidence: number;
|
||||
reason: string;
|
||||
statistics: {
|
||||
page_count: number;
|
||||
element_count: number;
|
||||
total_tables: number;
|
||||
total_images: number;
|
||||
element_type_counts: {
|
||||
[type: string]: number;
|
||||
};
|
||||
text_stats: {
|
||||
total_characters: number;
|
||||
total_words: number;
|
||||
average_confidence: number | null;
|
||||
};
|
||||
};
|
||||
processing_info: {
|
||||
processing_time: number;
|
||||
track_description: string;
|
||||
schema_version: string;
|
||||
};
|
||||
file_metadata: {
|
||||
filename: string;
|
||||
file_size: number;
|
||||
mime_type: string;
|
||||
created_at: string;
|
||||
};
|
||||
}
|
||||
```
|
||||
|
||||
---
|
||||
|
||||
## Error Handling
|
||||
|
||||
### HTTP Status Codes
|
||||
|
||||
- `200 OK`: Successful request
|
||||
- `201 Created`: Resource created successfully
|
||||
- `204 No Content`: Successful deletion
|
||||
- `400 Bad Request`: Invalid request parameters
|
||||
- `401 Unauthorized`: Missing or invalid authentication
|
||||
- `403 Forbidden`: Insufficient permissions
|
||||
- `404 Not Found`: Resource not found
|
||||
- `422 Unprocessable Entity`: Validation error
|
||||
- `500 Internal Server Error`: Server error
|
||||
|
||||
### Error Response Format
|
||||
|
||||
```json
|
||||
{
|
||||
"detail": "Error message describing the issue",
|
||||
"error_code": "ERROR_CODE",
|
||||
"timestamp": "2025-11-20T10:00:00Z"
|
||||
}
|
||||
```
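
A client can normalize this payload into a small record. The sketch below is illustrative: `ApiError` and `parse_error` are hypothetical names. It treats `error_code` and `timestamp` as optional, since the example payloads under Common Errors omit `timestamp`.

```python
from typing import NamedTuple, Optional


class ApiError(NamedTuple):
    detail: str
    error_code: Optional[str]
    timestamp: Optional[str]


def parse_error(payload: dict) -> ApiError:
    """Extract the documented error fields, tolerating missing optional ones."""
    return ApiError(
        detail=str(payload.get("detail", "")),
        error_code=payload.get("error_code"),
        timestamp=payload.get("timestamp"),
    )
```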
|
||||
|
||||
### Common Errors
|
||||
|
||||
**Invalid File Format**:
|
||||
```json
|
||||
{
|
||||
"detail": "Unsupported file format. Supported: PDF, PNG, JPG, DOCX, PPTX, XLSX",
|
||||
"error_code": "INVALID_FILE_FORMAT"
|
||||
}
|
||||
```
|
||||
|
||||
**Task Not Found**:
|
||||
```json
|
||||
{
|
||||
"detail": "Task not found or access denied",
|
||||
"error_code": "TASK_NOT_FOUND"
|
||||
}
|
||||
```
|
||||
|
||||
**Processing Failed**:
|
||||
```json
|
||||
{
|
||||
"detail": "OCR processing failed: GPU memory insufficient",
|
||||
"error_code": "PROCESSING_FAILED"
|
||||
}
|
||||
```
|
||||
|
||||
**File Too Large**:
|
||||
```json
|
||||
{
|
||||
"detail": "File size exceeds maximum limit of 50MB",
|
||||
"error_code": "FILE_TOO_LARGE"
|
||||
}
|
||||
```
|
||||
|
||||
---
|
||||
|
||||
## Usage Examples
|
||||
|
||||
### Example 1: Auto-Route Processing
|
||||
|
||||
Upload a document and let the system choose the optimal track:
|
||||
|
||||
# Tool_OCR V2 API (Current State)

Base URL: `http://localhost:${BACKEND_PORT:-8000}/api/v2`

Authentication: all business endpoints require a Bearer token (JWT).

## Authentication

- `POST /auth/login`: `{ username, password }` → `access_token`, `expires_in`, `user`.
- `POST /auth/logout`: accepts an optional `session_id`; if omitted, all sessions are logged out.
- `GET /auth/me`: returns the current user's information.
- `GET /auth/sessions`: lists login sessions.
- `POST /auth/refresh`: refreshes the access token.

## Task Flow Summary

1) Upload a file → `POST /upload` (multipart file) returns a `task_id`.
2) Start processing → `POST /tasks/{task_id}/start` (`ProcessingOptions` controls dual track, `force_track`, and layout/preprocessing/table detection).
3) Query status and metadata → `GET /tasks/{task_id}`, `/metadata`.
4) Download results → `/download/json | /markdown | /pdf | /unified`.
5) Advanced: `/analyze` shows the recommended track beforehand; `/preview/preprocessing` returns before/after preprocessing previews.

## Core Endpoints

- `POST /upload`
  - Form field: `file` (required); the file extension is validated against the allow list.
  - Returns: `task_id`, `filename`, `file_size`, `file_type`, `status` (pending).
- `POST /tasks/`
  - Creates only the Task metadata (no file); usually not needed.
- `POST /tasks/{task_id}/start`
  - Body `ProcessingOptions`: `use_dual_track` (default true), `force_track` (ocr|direct), `language` (default ch), `layout_model` (chinese|default|cdla), `preprocessing_mode` (auto|manual|disabled) + `preprocessing_config`, `table_detection`.
- `POST /tasks/{task_id}/cancel`, `POST /tasks/{task_id}/retry`.
- `GET /tasks`
  - Query parameters: `status` (pending|processing|completed|failed), `filename`, `date_from`/`date_to`, `page`, `page_size`, `order_by`, `order_desc`.
- `GET /tasks/{task_id}`: task details, including paths, processing track, and statistics.
- `GET /tasks/stats`: task statistics for the current user.
- `POST /tasks/{task_id}/analyze`: analyzes the document in advance and returns the recommended track, confidence, document type, and sampling statistics.
- `GET /tasks/{task_id}/metadata`: statistics and explanation of the processing result.
- Downloads:
  - `GET /tasks/{task_id}/download/json`
  - `GET /tasks/{task_id}/download/markdown`
  - `GET /tasks/{task_id}/download/pdf` (generated on the fly if no PDF exists)
  - `GET /tasks/{task_id}/download/unified` (UnifiedDocument JSON)
- Preprocessing preview:
  - `POST /tasks/{task_id}/preview/preprocessing` (body: page/mode/config)
  - `GET /tasks/{task_id}/preview/image?type=original|preprocessed&page=1`

## Translation (requires completed OCR)

Prefix: `/translate`

- `POST /{task_id}`: starts translation; body `{ target_lang, source_lang }`; returns 202. If the translation already exists, it returns Completed directly.
- `GET /{task_id}/status`: translation progress.
- `GET /{task_id}/result?lang=xx`: translation JSON.
- `GET /{task_id}/translations`: lists the translations generated so far.
- `DELETE /{task_id}/translations/{lang}`: deletes a translation.
- `POST /{task_id}/pdf?lang=xx`: downloads the translated, layout-preserving PDF.
|
||||
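
The commit message for this change says the translate router now accepts a format parameter (layout/reflow). The sketch below shows how a client might build the translated-PDF URL with it. Treat the details as assumptions: the query parameter name `format`, the default of `layout`, and the helper name `translated_pdf_url` are not confirmed by the endpoint list above, which documents only `lang`.

```python
from urllib.parse import urlencode

ALLOWED_FORMATS = ("layout", "reflow")  # values named in the commit message


def translated_pdf_url(base_url: str, task_id: str, lang: str,
                       pdf_format: str = "layout") -> str:
    """Build the translated-PDF URL; the `format` query key is an assumption."""
    if pdf_format not in ALLOWED_FORMATS:
        raise ValueError(f"format must be one of {ALLOWED_FORMATS}")
    query = urlencode({"lang": lang, "format": pdf_format})
    return f"{base_url.rstrip('/')}/translate/{task_id}/pdf?{query}"
```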
## Admin (administrator required)

Prefix: `/admin`

- `GET /stats`: system-level statistics.
- `GET /users`, `GET /users/top`.
- `GET /audit-logs`, `GET /audit-logs/user/{user_id}/summary`.

## Health Check

- `/health`: service status and GPU/memory management information.
- `/`: brief API entry description.

## Response Structure Summary

- Common Task response fields: `task_id`, `status`, `processing_track`, `document_type`, `processing_time_ms`, `page_count`, `element_count`, `file_size`, `mime_type`, `result_json_path`, and others.
- All download endpoints respond with a file (Content-Disposition attachment with a filename).
- Error format: `{ "detail": "...", "error_code": "...", "timestamp": "..." }` (some errors contain only `detail`).

## Usage Examples

Upload and start:
|
||||
```bash
|
||||
# 1. Upload document
|
||||
curl -X POST "http://localhost:8000/api/v2/tasks/" \
|
||||
# Upload
|
||||
curl -X POST "http://localhost:8000/api/v2/upload" \
|
||||
-H "Authorization: Bearer $TOKEN" \
|
||||
-F "file=@document.pdf" \
|
||||
-F "language=ch"
|
||||
-F "file=@demo_docs/edit.pdf"
|
||||
|
||||
# Response: {"task_id": "550e8400..."}
|
||||
|
||||
# 2. Check status
|
||||
curl -X GET "http://localhost:8000/api/v2/tasks/550e8400..." \
|
||||
-H "Authorization: Bearer $TOKEN"
|
||||
|
||||
# 3. Download results (when completed)
|
||||
curl -X GET "http://localhost:8000/api/v2/tasks/550e8400.../download/json" \
|
||||
# Start processing (force_track=ocr as an example)
|
||||
curl -X POST "http://localhost:8000/api/v2/tasks/$TASK_ID/start" \
|
||||
-H "Authorization: Bearer $TOKEN" \
|
||||
-o result.json
|
||||
-H "Content-Type: application/json" \
|
||||
-d '{"force_track":"ocr","language":"ch"}'
|
||||
|
||||
# Query and download
|
||||
curl -X GET "http://localhost:8000/api/v2/tasks/$TASK_ID/metadata" -H "Authorization: Bearer $TOKEN"
|
||||
curl -L "http://localhost:8000/api/v2/tasks/$TASK_ID/download/json" -H "Authorization: Bearer $TOKEN" -o result.json
|
||||
```
|
||||
|
||||
### Example 2: Analyze Before Processing
|
||||
|
||||
Analyze document type before processing:
|
||||
|
||||
Translate and download the translated PDF:
|
||||
```bash
|
||||
# 1. Upload document
|
||||
curl -X POST "http://localhost:8000/api/v2/tasks/" \
|
||||
curl -X POST "http://localhost:8000/api/v2/translate/$TASK_ID" \
|
||||
-H "Authorization: Bearer $TOKEN" \
|
||||
-F "file=@document.pdf"
|
||||
-H "Content-Type: application/json" \
|
||||
-d '{"target_lang":"en","source_lang":"auto"}'
|
||||
|
||||
# Response: {"task_id": "550e8400..."}
|
||||
|
||||
# 2. Analyze document (NEW)
|
||||
curl -X POST "http://localhost:8000/api/v2/tasks/550e8400.../analyze" \
|
||||
-H "Authorization: Bearer $TOKEN"
|
||||
|
||||
# Response shows recommended track and confidence
|
||||
|
||||
# 3. Start processing (automatic based on analysis)
|
||||
# Processing happens in background after upload
|
||||
curl -X GET "http://localhost:8000/api/v2/translate/$TASK_ID/status" -H "Authorization: Bearer $TOKEN"
|
||||
curl -L "http://localhost:8000/api/v2/translate/$TASK_ID/pdf?lang=en" \
|
||||
-H "Authorization: Bearer $TOKEN" -o translated.pdf
|
||||
```
|
||||
|
||||
### Example 3: Force Specific Track
|
||||
|
||||
Force OCR processing for an editable PDF:
|
||||
|
||||
```bash
|
||||
curl -X POST "http://localhost:8000/api/v2/tasks/" \
|
||||
-H "Authorization: Bearer $TOKEN" \
|
||||
-F "file=@document.pdf" \
|
||||
-F "force_track=ocr"
|
||||
```
|
||||
|
||||
### Example 4: Get Processing Metadata
|
||||
|
||||
Get detailed processing information:
|
||||
|
||||
```bash
|
||||
curl -X GET "http://localhost:8000/api/v2/tasks/550e8400.../metadata" \
|
||||
-H "Authorization: Bearer $TOKEN"
|
||||
```
|
||||
|
||||
---
|
||||
|
||||
## Version History
|
||||
|
||||
### V2.0.0 (2025-11-20) - Dual-Track Processing
|
||||
|
||||
**New Features**:
|
||||
- ✨ Dual-track processing (OCR + Direct Extraction)
|
||||
- ✨ Automatic document type detection
|
||||
- ✨ Office document support (Word, PowerPoint, Excel)
|
||||
- ✨ Processing track metadata
|
||||
- ✨ Enhanced layout analysis (23 element types)
|
||||
- ✨ GPU memory management
|
||||
|
||||
**New Endpoints**:
|
||||
- `POST /tasks/{task_id}/analyze` - Analyze document type
|
||||
- `GET /tasks/{task_id}/metadata` - Get processing metadata
|
||||
|
||||
**Enhanced Endpoints**:
|
||||
- `POST /tasks/` - Added `force_track` parameter
|
||||
- `GET /tasks/{task_id}` - Added `processing_track`, `document_type`, element counts
|
||||
- All download endpoints now include processing track information
|
||||
|
||||
**Performance Improvements**:
|
||||
- 10x faster processing for editable PDFs (1-2s vs 10-20s per page)
|
||||
- Optimized GPU memory usage for RTX 4060 8GB
|
||||
- Office documents: 2-5s vs >300s (60x improvement)
|
||||
|
||||
---
|
||||
|
||||
## Support
|
||||
|
||||
For issues, questions, or feature requests:
|
||||
- GitHub Issues: https://github.com/your-repo/Tool_OCR/issues
|
||||
- Documentation: https://your-docs-site.com
|
||||
- API Status: http://localhost:8000/health
|
||||
|
||||
---
|
||||
|
||||
*Generated by Tool_OCR V2.0.0 - Dual-Track Document Processing*
|
||||
|
||||
@@ -10,6 +10,7 @@
|
||||
- **OCR parsing**: PaddleOCR + `PPStructureEnhanced` extract 23 element types; `OCRToUnifiedConverter` converts the result into the unified `UnifiedDocument` format.
- **Export/presentation**: `UnifiedDocumentExporter` produces JSON/Markdown; `pdf_generator_service.py` generates layout-preserving PDFs; the frontend fetches them via `/api/v2/tasks/{id}/download/*`.
- **Resource control**: `memory_manager.py` (MemoryGuard, prediction semaphore, model lifecycle) and `service_pool.py` (`OCRService` pool) prevent duplicate model loading and GPU exhaustion.
- **Translation and preview**: `translation_service` provides asynchronous translation for completed tasks (`/api/v2/translate/*`); `layout_preprocessing_service` provides preprocessing previews and quality metrics (`/preview/preprocessing` → `/preview/image`).

## Processing Flow (task level)

1. **Upload**: `POST /api/v2/upload` creates a Task and writes the file to `uploads/` (with SHA256 and file information).
|
||||
|
||||
@@ -1,31 +0,0 @@
|
||||
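# Tool_OCR Commit History Review (2025-11-12 ~ 2025-11-26)

This report is compiled from all 97 commits in `git log` and covers development context, milestones, testing/quality signals, and remaining risks. Commit type counts: 35 `feat` / 37 `fix` / 9 `chore` / 5 `test` / 4 `docs` / 2 `refactor`, concentrated in intensive development on 2025-11-18, 11-19, 11-20, and 11-24.

## Timeline and Milestones

- **Early foundation and frontend modernization (11-12~11-13)**: `21bc2f9` and `57cf912` moved the frontend to Tailwind v4 with a professional UI; `0f81d5e` single-container Dockerization; `d7e6473` WSL Ubuntu development environment.
- **GPU acceleration and compatibility (11-14)**: `6452797` proposal + `7536f43` implementation of GPU OCR; `d80d60f`/`3694411`/`80c091b` fixed the Paddle 3.x API and installation source; `b048f2d` temporarily disabled chart recognition to avoid an API gap.
- **External Auth V2 and admin console (11-14~11-16)**: `28e419f`~`fd98018` completed external authentication V2, table prefixes, and removal of V1 from the architecture; `8f94191` added the admin console, auditing, and token checks; `90fca50`/`6bb5b76` brought all 18/18 tests to pass.
- **V2 UI integration and first layout-preserving PDF (11-16~11-18)**: frontend and backend fully switched to the V2 API (after `ad5c8be`); `fa1abcd` layout-preserving PDF plus multiple coordinate/overlap fixes (`d33f605`~`0edc56b`); logging strengthened (`d99d37d`).
- **Dual-track processing architecture (11-18~11-20)**: `2d50c12` + `82139c8` introduced the OCR/Direct dual track and UnifiedDocument; `a3a6fbe`/`ab89a40`/`ecdce96` completed conversion, JSON export, and PDF support; `1d0b638` backend API, `c2288ba` frontend support, `c50a5e9` unit/integration tests; `0974fc3` E2E fixes, `ef335cf` Office direct extraction, `b997f93`/`9f449e8` GPU memory management and documentation, `2ecd022` E2E tests completed.
- **PDF layout restoration plan (proposed 11-20, implementation peak 11-24)**: after the `cf894b0` proposal, `0aff468` Phase 1 image/table fixes, `3fc32bc` Phase 2 style preservation, `77fe4cc`/`ad879d4`/`75c194f` and others completed alignment, list, span-level rendering, and multi-column support; the series `93bd9f5`~`3358d97` fixed position/overlap/missing-image issues; `4325d02` cleaned up the project and archived the proposal.
- **PP-Structure V3 tuning (11-25)**: `a659e7a` improved structure preservation for complex diagrams; `2312b4c` frontend-adjustable `pp_structure` parameters + tests; `0999898` multi-page PDF coordinate correction.
- **Memory management and hybrid image extraction (11-25~11-26)**: `ba8ddf2` proposal; `1afdb82` landed hybrid image extraction + memory management; the `b997f93` series covers GPU release/optional torch and introduces ModelManager, ServicePool, and MemoryGuard (see `openspec/changes/archive/2025-11-26-enhance-memory-management`); `a227311` archived the proposal with only 75/80 tasks completed (documentation remaining); several follow-up fixes (`79cffe6`~`fa9b542`) addressed PDF regressions and text rendering; `6e050eb` is the latest fix for OCR-track table format/cropping.

## Quality and Testing Signals

- On 11-16 the V2 API tests passed 18/18 (`6bb5b76`), establishing initial confidence.
- Unit/integration/E2E tests were added when the dual track was introduced (`0fcb249`, `c50a5e9`, `2ecd022`), but the later PDF layout restoration relied heavily on manual verification, and Phase 4 testing is still incomplete (see below).
- The memory management change came with 57+18+10 test files (task 8.1 completed), but the missing documentation may affect handover and tuning.
- The many consecutive PDF fix commits on 11-24 indicate iterative bug fixing; regression test coverage should be increased (especially for tables/multi-column/lists and cross-track PDFs).

## Open Items and Risks

- **Memory management documentation gap**: `openspec/changes/archive/2025-11-26-enhance-memory-management/tasks.md` has not completed Section 8.2 (architecture description, tuning guide, troubleshooting, monitoring, migration guide), which may affect deployment operability.
- **Insufficient verification of PDF layout restoration**: Phase 4 testing/performance/documentation and multi-document-type verification for the same change are all unchecked; quality currently depends on manual testing.
- **Recent fixes concentrate on PDF and tables** (`79cffe6`, `5c561f4`, `19bd5fd`, `fa9b542`, `6e050eb`), indicating that the Direct/OCR track PDF paths remain fragile; without automated regression tests, regressions are likely to recur.
- **Main branch status**: `main` is 1 commit ahead of `origin/main` (`6e050eb`); confirm CI/tests before pushing.

## Recommended Next Steps

1) Complete the memory management documentation (architecture, tuning, troubleshooting, Prometheus monitoring guide) and add a sanity check.
2) Build a minimal regression set for PDF layout restoration: multi-column documents, Direct/OCR tracks with charts/tables, and mixed lists and spans.
3) Add tests around edge cases of `processing_track` routing and UnifiedDocument/PDF generation (LOGO/unknown elements, cross-page tables, OCR/Direct hybrid images).
4) Run the existing unit/integration/E2E tests before pushing, and add scripts for scenarios introduced in the past two weeks to reduce regression risk.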
|
||||
@@ -1,24 +0,0 @@
|
||||
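# Project Risk & Issue Outlook

This document summarizes the project's foreseeable issues, potential issues, and recommended fix directions (ordered by risk and feasibility). Sources: `git log` (97 commits, 2025-11-12~11-26), `docs/architecture-overview.md`, `openspec/changes/archive/2025-11-26-enhance-memory-management/tasks.md`, and others.

## Foreseeable Issues

- **Memory management documentation gap**: the Section 8.2 documentation in `openspec/changes/archive/2025-11-26-enhance-memory-management/tasks.md` is incomplete; tuning and failure handling for ModelManager/ServicePool/MemoryGuard lack a runbook, so deployment or scaling is error-prone. Direction: complete the architecture description, tuning guide, troubleshooting, and monitoring examples (Prometheus metrics and alert thresholds).
- **High regression risk in PDF generation**: layout preservation and table/image rendering were fixed many times after `fa1abcd` (for example `d33f605`→`92e326b`, `108784a`→`3358d97`, `6e050eb`), indicating a lack of automated regression tests. Direction: build a minimal regression set (multi-column text, charts/tables, mixed lists/spans) with golden PDF/JSON comparison, covering both the Direct and OCR tracks.
- **Latest OCR table format fix has no regression coverage**: `6e050eb` fixed the OCR-track table data format and cropping without corresponding tests. Direction: add integration tests for OCR-track table parsing and PDF output, ensuring consistency with frontend download/display.
- **PP-Structure parameter tuning may affect resources**: `frontend` supports frontend-adjustable `pp_structure_params` (`2312b4c`); without guards this can amplify GPU/memory pressure. Direction: apply a whitelist and upper-bound checks to the hyperparameters on the backend and include them in the MemoryGuard estimate.
- **Chart capability enable/disable strategy lacks validation**: `b048f2d` disabled it → `7e12f16` re-enabled it; coverage and performance data are missing. Direction: build health checks and A/B test data collection for enabling/disabling the chart model.

## Potential Issues

- **UnifiedDocument structure drift risk**: both tracks share the output, which was adjusted several times recently (lists, spans, multi-column, LOGO elements) without structural validation or a locked schema. This may cause inconsistencies among the frontend, the exporters, and PDF generation. Direction: define a JSON Schema or pydantic validation and add contract tests.
- **Long-running behavior of the service pool and memory guard is unverified**: unit/integration tests exist, but long-running soak/stress tests are missing (GPU memory fragmentation, model unload/reload, signal handling). Direction: add a 24h soak test and memory-trend alerts, and verify SIGTERM/SIGINT cleanup.
- **Low observability of the LibreOffice conversion chain**: Office direct extraction and PDF conversion (`ef335cf`) depend on the system LibreOffice and lack failure monitoring and a retry strategy. Direction: add metrics/alerts to the conversion stage and provide fallback/retry.
- **Frontend/backend API contract lacks checks**: after multiple V1→V2 migrations and new parameters (`pp_structure_params` etc.), only E2E tests guard the contract; type/contract checks are missing. Direction: add OpenAPI contract tests or generated type validation (align the ts-sdk with the FastAPI schema).
- **Edge cases in hybrid image extraction and image save paths**: Direct/OCR hybrid image extraction and the `_save_image` implementation were fixed several times and still lack defenses against None/missing file paths. Direction: strengthen assertions and fallbacks for PDF generation with missing files or no images.

## Recommended Fixes and Direction

1) **Complete the memory management documentation and sample configuration**: add a MemoryGuard/ServicePool tuning and troubleshooting guide under `docs/`, with a `.env` example and Prometheus rules, matching the tasks 8.2 checklist.
2) **Build a PDF/UnifiedDocument regression suite**: collect representative samples (multi-column, tables, lists, images/LOGO, OCR/Direct dual track), produce golden JSON/PDF, add the comparison to CI, and add tests for the table paths related to `6e050eb`.
3) **Add UnifiedDocument schema validation**: define a schema (pydantic/JSON Schema) and validate before export/PDF generation; also generate the frontend types from OpenAPI to prevent drift.
4) **PP-Structure parameter guards and resource estimation**: implement a whitelist/upper bounds and MemoryGuard estimation on the backend so that free-form frontend tuning cannot cause GPU OOM; add rejection/degradation feedback.
5) **Long-running stability and conversion observability**: add a soak/stress pipeline that tracks GPU/CPU/memory fragmentation; add metrics, retries, and error-classified alerts to the LibreOffice/conversion stage.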
|
||||