# Tool_OCR V2 API Documentation

## Overview

Tool_OCR V2 provides an OCR service with dual-track document processing. The API routes each document to one of two tracks: an OCR track for scanned documents and images, and a Direct Extraction track for editable PDFs and Office documents.

**Base URL**: `http://localhost:8000/api/v2`

**Authentication**: Bearer token (JWT)

---

## Table of Contents

1. [Authentication](#authentication)
2. [Task Management](#task-management)
3. [Document Processing](#document-processing)
4. [Document Analysis](#document-analysis)
5. [File Downloads](#file-downloads)
6. [Processing Tracks](#processing-tracks)
7. [Response Models](#response-models)
8. [Error Handling](#error-handling)
9. [Usage Examples](#usage-examples)
10. [Version History](#version-history)
11. [Support](#support)

---

## Authentication

All endpoints except `POST /api/auth/login` require a Bearer token in the `Authorization` header.

### Headers
```http
Authorization: Bearer <access_token>
```

### Login
```http
POST /api/auth/login
Content-Type: application/json

{
  "email": "user@example.com",
  "password": "password123"
}
```

**Response**:
```json
{
  "access_token": "eyJhbGc...",
  "token_type": "bearer",
  "user": {
    "id": 1,
    "email": "user@example.com",
    "username": "user"
  }
}
```
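
The login exchange above can be sketched in Python as follows. This is a minimal sketch, not the project's client: it assumes the login path is served from the host root rather than under `/api/v2` (the documented path starts with `/api/auth`), and the helper names `build_login_request` and `auth_header` are illustrative only.

```python
import json
import urllib.request

# Assumption: login lives at the host root, outside the /api/v2 prefix.
HOST = "http://localhost:8000"


def build_login_request(email: str, password: str) -> urllib.request.Request:
    """Build (but do not send) the JSON login request described above."""
    body = json.dumps({"email": email, "password": password}).encode("utf-8")
    return urllib.request.Request(
        f"{HOST}/api/auth/login",
        data=body,
        headers={"Content-Type": "application/json"},
        method="POST",
    )


def auth_header(login_response: dict) -> dict:
    """Turn a parsed login response into the header used by later calls."""
    return {"Authorization": f"Bearer {login_response['access_token']}"}
```

To send the request, pass it to `urllib.request.urlopen` and parse the body with `json.load`; the resulting dict can then be handed to `auth_header`.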

---

## Task Management

### Create Task

Create a new OCR processing task by uploading a document.

```http
POST /tasks/
Content-Type: multipart/form-data
```

**Request Body**:
- `file` (required): Document file to process
  - Supported formats: PDF, PNG, JPG, JPEG, GIF, BMP, TIFF, DOCX, PPTX, XLSX
- `language` (optional): OCR language code (default: 'ch')
  - Options: 'ch', 'en', 'japan', 'korean', etc.
- `detect_layout` (optional): Enable layout detection (default: true)
- `force_track` (optional): Force specific processing track
  - Options: 'ocr', 'direct', 'auto' (default: 'auto')
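
A client can reject bad input before uploading. The sketch below validates the parameters listed above and returns the non-file form fields. The function name `build_task_fields` is hypothetical, and two details are assumptions: that the supported formats map directly to these file extensions, and that booleans are sent as lowercase strings in the form.

```python
from pathlib import Path

# Values copied from the parameter list above.
SUPPORTED_EXTENSIONS = {
    ".pdf", ".png", ".jpg", ".jpeg", ".gif", ".bmp", ".tiff",
    ".docx", ".pptx", ".xlsx",
}
TRACKS = {"auto", "ocr", "direct"}


def build_task_fields(filename, language="ch", detect_layout=True, force_track="auto"):
    """Validate upload parameters and return the non-file form fields."""
    ext = Path(filename).suffix.lower()
    if ext not in SUPPORTED_EXTENSIONS:
        raise ValueError(f"unsupported file format: {ext or filename}")
    if force_track not in TRACKS:
        raise ValueError(f"force_track must be one of {sorted(TRACKS)}")
    return {
        "language": language,
        # Assumption: booleans travel as lowercase strings in the form body.
        "detect_layout": "true" if detect_layout else "false",
        "force_track": force_track,
    }
```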

**Response** `201 Created`:
```json
{
  "task_id": "550e8400-e29b-41d4-a716-446655440000",
  "filename": "document.pdf",
  "status": "pending",
  "language": "ch",
  "created_at": "2025-11-20T10:00:00Z"
}
```

**Processing Track Selection**:
- `auto` (default): Automatically select optimal track based on document analysis
  - Editable PDFs → Direct track (faster, ~1-2s/page)
  - Scanned documents/images → OCR track (slower, ~2-5s/page)
  - Office documents → Convert to PDF, then route based on content
- `ocr`: Force OCR processing (PaddleOCR PP-StructureV3)
- `direct`: Force direct extraction (PyMuPDF) - only for editable PDFs

---

### List Tasks

Get a paginated list of the current user's tasks, with optional filtering.

```http
GET /tasks/?status={status}&filename={search}&skip={skip}&limit={limit}
```

**Query Parameters**:
- `status` (optional): Filter by task status
  - Options: `pending`, `processing`, `completed`, `failed`
- `filename` (optional): Search by filename (partial match)
- `skip` (optional): Pagination offset (default: 0)
- `limit` (optional): Page size (default: 10, max: 100)

**Response** `200 OK`:
```json
{
  "tasks": [
    {
      "task_id": "550e8400-e29b-41d4-a716-446655440000",
      "filename": "document.pdf",
      "status": "completed",
      "language": "ch",
      "processing_track": "direct",
      "processing_time": 1.14,
      "created_at": "2025-11-20T10:00:00Z",
      "completed_at": "2025-11-20T10:00:02Z"
    }
  ],
  "total": 42,
  "skip": 0,
  "limit": 10
}
```
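
The query parameters and the `total`/`skip`/`limit` fields in the response are enough to page through all tasks. The following sketch builds the request URL and computes the `skip` offsets for each page; the helper names are illustrative, and clamping `limit` on the client side is a design choice of this sketch, not documented server behavior.

```python
from urllib.parse import urlencode

BASE_URL = "http://localhost:8000/api/v2"
MAX_LIMIT = 100  # documented maximum page size


def list_tasks_url(status=None, filename=None, skip=0, limit=10):
    """Build the List Tasks URL, omitting filters that are not set."""
    params = {"skip": max(skip, 0), "limit": min(max(limit, 1), MAX_LIMIT)}
    if status is not None:
        params["status"] = status
    if filename is not None:
        params["filename"] = filename
    return f"{BASE_URL}/tasks/?{urlencode(params)}"


def page_offsets(total, limit):
    """Return the `skip` value for every page needed to cover `total` tasks."""
    return list(range(0, total, limit))
```

With the sample response above (`total` of 42, `limit` of 10), `page_offsets(42, 10)` yields five offsets, one per page.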

---

### Get Task Details

Retrieve detailed information about a specific task.

```http
GET /tasks/{task_id}
```

**Response** `200 OK`:
```json
{
  "task_id": "550e8400-e29b-41d4-a716-446655440000",
  "filename": "document.pdf",
  "status": "completed",
  "language": "ch",
  "processing_track": "direct",
  "document_type": "pdf_editable",
  "processing_time": 1.14,
  "page_count": 3,
  "element_count": 51,
  "character_count": 10592,
  "confidence": 0.95,
  "created_at": "2025-11-20T10:00:00Z",
  "completed_at": "2025-11-20T10:00:02Z",
  "result_files": {
    "json": "/tasks/550e8400-e29b-41d4-a716-446655440000/download/json",
    "markdown": "/tasks/550e8400-e29b-41d4-a716-446655440000/download/markdown",
    "pdf": "/tasks/550e8400-e29b-41d4-a716-446655440000/download/pdf"
  },
  "metadata": {
    "file_size": 524288,
    "mime_type": "application/pdf",
    "text_coverage": 1.0,
    "processing_track_reason": "PDF has extractable text on 100% of sampled pages"
  }
}
```

**New Fields** (Dual-Track):
- `processing_track`: Track used for processing (`ocr`, `direct`, or `null`)
- `document_type`: Detected document type
  - `pdf_editable`: Editable PDF with text
  - `pdf_scanned`: Scanned/image-based PDF
  - `pdf_mixed`: Mixed content PDF
  - `image`: Image file
  - `office_word`, `office_excel`, `office_ppt`: Office documents
- `page_count`: Number of pages extracted
- `element_count`: Total elements (text, tables, images) extracted
- `character_count`: Total characters extracted
- `metadata.text_coverage`: Fraction of pages with extractable text, from 0.0 to 1.0
- `metadata.processing_track_reason`: Explanation of track selection

---

### Get Task Statistics

Get aggregated statistics for the current user's tasks.

```http
GET /tasks/stats
```

**Response** `200 OK`:
```json
{
  "total_tasks": 150,
  "by_status": {
    "pending": 5,
    "processing": 3,
    "completed": 140,
    "failed": 2
  },
  "by_processing_track": {
    "ocr": 80,
    "direct": 60,
    "unknown": 10
  },
  "total_pages_processed": 4250,
  "average_processing_time": 3.5,
  "success_rate": 0.987
}
```

---

### Delete Task

Delete a task and all associated files.

```http
DELETE /tasks/{task_id}
```

**Response** `204 No Content`

---

## Document Processing

### Processing Workflow

1. **Upload Document** → `POST /tasks/` → Returns `task_id`
2. **Background Processing** → Task status changes to `processing`
3. **Complete** → Task status changes to `completed` or `failed`
4. **Download Results** → Use download endpoints
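
Because processing runs in the background, a client has to poll the task until it reaches `completed` or `failed`. The sketch below shows one way to do that. The function name, the default interval, and the timeout are assumptions of this example; `fetch_task` is a caller-supplied function that returns the parsed JSON of `GET /tasks/{task_id}`.

```python
import time

TERMINAL_STATUSES = {"completed", "failed"}


def wait_for_task(fetch_task, task_id, interval=1.0, timeout=300.0, sleep=time.sleep):
    """Poll until the task reaches a terminal status or the timeout elapses.

    `fetch_task(task_id)` must return the parsed JSON of GET /tasks/{task_id}.
    """
    waited = 0.0
    while True:
        task = fetch_task(task_id)
        if task["status"] in TERMINAL_STATUSES:
            return task
        if waited >= timeout:
            raise TimeoutError(
                f"task {task_id} still {task['status']} after {waited:.0f}s"
            )
        sleep(interval)
        waited += interval
```

Passing `fetch_task` and `sleep` as arguments keeps the loop independent of any HTTP library and makes it easy to exercise without a running server.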

### Track Selection Flow

```
Document Upload
     ↓
Document Type Detection
     ↓
  ┌──────────────┐
  │ Auto Routing │
  └──────┬───────┘
         ↓
    ┌────┴─────┐
    ↓          ↓
 [Direct]   [OCR]
    ↓          ↓
  PyMuPDF   PaddleOCR
    ↓          ↓
  UnifiedDocument
    ↓
 Export (JSON/MD/PDF)
```

**Direct Track** (Fast - 1-2s/page):
- Editable PDFs with extractable text
- Office documents (converted to text-based PDF)
- Uses PyMuPDF for direct text extraction
- Preserves exact layout and fonts

**OCR Track** (Slower - 2-5s/page):
- Scanned PDFs and images
- Documents without extractable text
- Uses PaddleOCR PP-StructureV3
- Handles complex layouts with 23 element types
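
The per-page figures above give a rough way to estimate how long a document will take. The sketch below multiplies the page count by the quoted ranges; `estimate_seconds` is a hypothetical helper, and the result is only as reliable as those approximate figures.

```python
# Per-page ranges quoted above, in seconds: (fastest, slowest).
SECONDS_PER_PAGE = {"direct": (1.0, 2.0), "ocr": (2.0, 5.0)}


def estimate_seconds(page_count, track):
    """Return a (low, high) processing-time estimate for a document."""
    if page_count < 0:
        raise ValueError("page_count must be non-negative")
    low, high = SECONDS_PER_PAGE[track]
    return (page_count * low, page_count * high)
```

For example, a 10-page scan on the OCR track gives `(20.0, 50.0)`. The sample task shown earlier finished 3 pages in 1.14 s on the direct track, faster than this range suggests, so treat the estimate as coarse guidance.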

---

## Document Analysis

### Analyze Document Type

Analyze an uploaded document, identified by its `task_id`, to determine the optimal processing track **before** processing.

**NEW ENDPOINT**

```http
POST /tasks/{task_id}/analyze
```

**Response** `200 OK`:
```json
{
  "task_id": "550e8400-e29b-41d4-a716-446655440000",
  "filename": "document.pdf",
  "analysis": {
    "recommended_track": "direct",
    "confidence": 0.95,
    "reason": "PDF has extractable text on 100% of sampled pages",
    "document_type": "pdf_editable",
    "metadata": {
      "total_pages": 3,
      "sampled_pages": 3,
      "text_coverage": 1.0,
      "mime_type": "application/pdf",
      "file_size": 524288,
      "page_details": [
        {
          "page": 1,
          "text_length": 3520,
          "has_text": true,
          "image_count": 2,
          "image_coverage": 0.15
        }
      ]
    }
  }
}
```

**Use Case**:
- Preview processing track before starting
- Validate document type for batch processing
- Provide user feedback on processing method
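
For batch validation, analysis responses can be grouped by recommended track so that low-confidence documents are flagged for a closer look. This is a minimal sketch: the function name, the 0.9 confidence threshold, and the `"review"` bucket are assumptions of this example, not part of the API.

```python
from collections import defaultdict


def group_by_recommended_track(analyses, min_confidence=0.9):
    """Group filenames by recommended track.

    Responses whose confidence is below `min_confidence` go into a
    separate "review" bucket (an assumption of this sketch).
    """
    groups = defaultdict(list)
    for item in analyses:
        analysis = item["analysis"]
        if analysis["confidence"] >= min_confidence:
            key = analysis["recommended_track"]
        else:
            key = "review"
        groups[key].append(item["filename"])
    return dict(groups)
```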

---

### Get Processing Metadata

Get detailed metadata about how a document was processed.

**NEW ENDPOINT**

```http
GET /tasks/{task_id}/metadata
```

**Response** `200 OK`:
```json
{
  "task_id": "550e8400-e29b-41d4-a716-446655440000",
  "processing_track": "direct",
  "document_type": "pdf_editable",
  "confidence": 0.95,
  "reason": "PDF has extractable text on 100% of sampled pages",
  "statistics": {
    "page_count": 3,
    "element_count": 51,
    "total_tables": 2,
    "total_images": 3,
    "element_type_counts": {
      "text": 45,
      "table": 2,
      "image": 3,
      "header": 1
    },
    "text_stats": {
      "total_characters": 10592,
      "total_words": 1842,
      "average_confidence": 1.0
    }
  },
  "processing_info": {
    "processing_time": 1.14,
    "track_description": "PyMuPDF Direct Extraction - Used for editable PDFs",
    "schema_version": "1.0.0"
  },
  "file_metadata": {
    "filename": "document.pdf",
    "file_size": 524288,
    "mime_type": "application/pdf",
    "created_at": "2025-11-20T10:00:00Z"
  }
}
```
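
In the sample response, the per-type counts add up to `element_count` (45 + 2 + 3 + 1 = 51), and the `table` and `image` counts match `total_tables` and `total_images`. Assuming that relationship holds in general, which this document does not state explicitly, a client could sanity-check a metadata response like this (`check_statistics` is a hypothetical helper):

```python
def check_statistics(metadata):
    """Return a list of inconsistencies found in a metadata response."""
    stats = metadata["statistics"]
    counts = stats["element_type_counts"]
    problems = []
    # Assumption: every element is counted under exactly one type.
    if sum(counts.values()) != stats["element_count"]:
        problems.append("element_type_counts does not sum to element_count")
    if counts.get("table", 0) != stats["total_tables"]:
        problems.append("table count does not match total_tables")
    if counts.get("image", 0) != stats["total_images"]:
        problems.append("image count does not match total_images")
    return problems
```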

---

## File Downloads

### Download JSON Result

Download structured JSON output with full document structure.

```http
GET /tasks/{task_id}/download/json
```

**Response** `200 OK`:
- Content-Type: `application/json`
- Content-Disposition: `attachment; filename="{filename}_result.json"`

**JSON Structure**:
```json
{
  "schema_version": "1.0.0",
  "document_id": "d8bea84d-a4ea-4455-b219-243624b5518e",
  "export_timestamp": "2025-11-20T10:00:02Z",
  "metadata": {
    "filename": "document.pdf",
    "file_type": ".pdf",
    "file_size": 524288,
    "created_at": "2025-11-20T10:00:00Z",
    "processing_track": "direct",
    "processing_time": 1.14,
    "language": "ch",
    "processing_info": {
      "track_description": "PyMuPDF Direct Extraction",
      "schema_version": "1.0.0",
      "export_format": "unified_document_v1"
    }
  },
  "pages": [
    {
      "page_number": 1,
      "dimensions": {
        "width": 595.32,
        "height": 841.92
      },
      "elements": [
        {
          "element_id": "text_1_0",
          "type": "text",
          "bbox": {
            "x0": 72.0,
            "y0": 72.0,
            "x1": 200.0,
            "y1": 90.0
          },
          "content": "Document Title",
          "confidence": 1.0,
          "style": {
            "font": "Helvetica-Bold",
            "size": 18.0
          }
        }
      ]
    }
  ],
  "statistics": {
    "page_count": 3,
    "total_elements": 51,
    "total_tables": 2,
    "total_images": 3,
    "element_type_counts": {
      "text": 45,
      "table": 2,
      "image": 3,
      "header": 1
    },
    "text_stats": {
      "total_characters": 10592,
      "total_words": 1842,
      "average_confidence": 1.0
    }
  }
}
```

**Element Types**:
- `text`: Text blocks
- `header`: Headers (H1-H6)
- `paragraph`: Paragraphs
- `list`: Lists
- `table`: Tables with cell structure
- `image`: Images with position
- `figure`: Figures with captions
- `footer`: Page footers
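
The `pages[].elements[]` layout makes it straightforward to walk the exported JSON. The sketch below counts elements by type and joins the text of one page in reading order. It assumes that `y0` grows downward from the top of the page and that text-bearing elements carry a string `content`; elements with other content shapes are skipped. Both helper names are illustrative.

```python
from collections import Counter


def count_element_types(document):
    """Count elements by their `type` across all pages."""
    return Counter(
        element["type"]
        for page in document["pages"]
        for element in page["elements"]
    )


def page_text(document, page_number):
    """Join string content on one page, top-to-bottom then left-to-right."""
    for page in document["pages"]:
        if page["page_number"] == page_number:
            elements = [
                el for el in page["elements"] if isinstance(el.get("content"), str)
            ]
            # Assumption: smaller y0 means higher on the page.
            elements.sort(key=lambda el: (el["bbox"]["y0"], el["bbox"]["x0"]))
            return "\n".join(el["content"] for el in elements)
    raise KeyError(f"page {page_number} not found")
```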

---

### Download Markdown Result

Download Markdown formatted output.

```http
GET /tasks/{task_id}/download/markdown
```

**Response** `200 OK`:
- Content-Type: `text/markdown`
- Content-Disposition: `attachment; filename="{filename}_output.md"`

**Example Output**:
```markdown
# Document Title

This is the extracted content from the document.

## Section 1

Content of section 1...

| Column 1 | Column 2 |
|----------|----------|
| Data 1   | Data 2   |

![Image](imgs/img_in_image_box_100_200_500_600.jpg)
```

---

### Download Layout-Preserving PDF

Download reconstructed PDF with layout preservation.

```http
GET /tasks/{task_id}/download/pdf
```

**Response** `200 OK`:
- Content-Type: `application/pdf`
- Content-Disposition: `attachment; filename="{filename}_layout.pdf"`

**Features**:
- Preserves original layout and coordinates
- Maintains text positioning
- Includes extracted images
- Renders tables with proper structure
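
The three download endpoints differ only in the format segment, the content type, and the filename suffix. A small lookup table captures that, as sketched below. `download_info` is a hypothetical helper, and it assumes the server substitutes `{filename}` exactly as the caller supplies it here.

```python
BASE_URL = "http://localhost:8000/api/v2"

# format -> (Content-Type, filename suffix), as documented above.
DOWNLOADS = {
    "json": ("application/json", "_result.json"),
    "markdown": ("text/markdown", "_output.md"),
    "pdf": ("application/pdf", "_layout.pdf"),
}


def download_info(task_id, fmt, filename):
    """Return the URL, expected content type, and local name for a download."""
    if fmt not in DOWNLOADS:
        raise ValueError(f"format must be one of {sorted(DOWNLOADS)}")
    content_type, suffix = DOWNLOADS[fmt]
    return {
        "url": f"{BASE_URL}/tasks/{task_id}/download/{fmt}",
        "content_type": content_type,
        "save_as": f"{filename}{suffix}",
    }
```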

---

## Processing Tracks

### Track Comparison

| Feature | OCR Track | Direct Track |
|---------|-----------|--------------|
| **Speed** | 2-5 seconds/page | 1-2 seconds/page |
| **Best For** | Scanned documents, images | Editable PDFs, Office docs |
| **Technology** | PaddleOCR PP-StructureV3 | PyMuPDF |
| **Accuracy** | 92-98% (content-dependent) | 100% (text is extracted, not recognized) |
| **Layout Preservation** | Good (23 element types) | Excellent (exact coordinates) |
| **GPU Required** | Yes (8GB recommended) | No |
| **Supported Formats** | PDF, PNG, JPG, TIFF, etc. | PDF (with text), converted Office docs |

### Processing Track Enum

```python
from enum import Enum


class ProcessingTrackEnum(str, Enum):
    AUTO = "auto"      # Automatic selection (default)
    OCR = "ocr"        # Force OCR processing
    DIRECT = "direct"  # Force direct extraction
```

### Document Type Enum

```python
from enum import Enum


class DocumentType(str, Enum):
    PDF_EDITABLE = "pdf_editable"      # PDF with extractable text
    PDF_SCANNED = "pdf_scanned"        # Scanned/image-based PDF
    PDF_MIXED = "pdf_mixed"            # Mixed content PDF
    IMAGE = "image"                    # Image files
    OFFICE_WORD = "office_word"        # Word documents
    OFFICE_EXCEL = "office_excel"      # Excel spreadsheets
    OFFICE_POWERPOINT = "office_ppt"   # PowerPoint presentations
    TEXT = "text"                      # Plain text files
    UNKNOWN = "unknown"                # Unknown format
```

---

## Response Models

### TaskResponse

```typescript
interface TaskResponse {
  task_id: string;
  filename: string;
  status: "pending" | "processing" | "completed" | "failed";
  language: string;
  processing_track?: "ocr" | "direct" | null;
  created_at: string;  // ISO 8601
  completed_at?: string | null;
}
```

### TaskDetailResponse

Extends `TaskResponse` with:
```typescript
interface TaskDetailResponse extends TaskResponse {
  document_type?: string;
  processing_time?: number;  // seconds
  page_count?: number;
  element_count?: number;
  character_count?: number;
  confidence?: number;  // 0.0-1.0
  result_files?: {
    json?: string;
    markdown?: string;
    pdf?: string;
  };
  metadata?: {
    file_size?: number;
    mime_type?: string;
    text_coverage?: number;  // 0.0-1.0
    processing_track_reason?: string;
    [key: string]: any;
  };
}
```

### DocumentAnalysisResponse

```typescript
interface DocumentAnalysisResponse {
  task_id: string;
  filename: string;
  analysis: {
    recommended_track: "ocr" | "direct";
    confidence: number;  // 0.0-1.0
    reason: string;
    document_type: string;
    metadata: {
      total_pages?: number;
      sampled_pages?: number;
      text_coverage?: number;
      mime_type?: string;
      file_size?: number;
      page_details?: Array<{
        page: number;
        text_length: number;
        has_text: boolean;
        image_count: number;
        image_coverage: number;
      }>;
    };
  };
}
```

### ProcessingMetadata

```typescript
interface ProcessingMetadata {
  task_id: string;
  processing_track: "ocr" | "direct";
  document_type: string;
  confidence: number;
  reason: string;
  statistics: {
    page_count: number;
    element_count: number;
    total_tables: number;
    total_images: number;
    element_type_counts: {
      [type: string]: number;
    };
    text_stats: {
      total_characters: number;
      total_words: number;
      average_confidence: number | null;
    };
  };
  processing_info: {
    processing_time: number;
    track_description: string;
    schema_version: string;
  };
  file_metadata: {
    filename: string;
    file_size: number;
    mime_type: string;
    created_at: string;
  };
}
```

---

## Error Handling

### HTTP Status Codes

- `200 OK`: Successful request
- `201 Created`: Resource created successfully
- `204 No Content`: Successful deletion
- `400 Bad Request`: Invalid request parameters
- `401 Unauthorized`: Missing or invalid authentication
- `403 Forbidden`: Insufficient permissions
- `404 Not Found`: Resource not found
- `422 Unprocessable Entity`: Validation error
- `500 Internal Server Error`: Server error

### Error Response Format

```json
{
  "detail": "Error message describing the issue",
  "error_code": "ERROR_CODE",
  "timestamp": "2025-11-20T10:00:00Z"
}
```

### Common Errors

**Invalid File Format**:
```json
{
  "detail": "Unsupported file format. Supported: PDF, PNG, JPG, DOCX, PPTX, XLSX",
  "error_code": "INVALID_FILE_FORMAT"
}
```

**Task Not Found**:
```json
{
  "detail": "Task not found or access denied",
  "error_code": "TASK_NOT_FOUND"
}
```

**Processing Failed**:
```json
{
  "detail": "OCR processing failed: GPU memory insufficient",
  "error_code": "PROCESSING_FAILED"
}
```

**File Too Large**:
```json
{
  "detail": "File size exceeds maximum limit of 50MB",
  "error_code": "FILE_TOO_LARGE"
}
```
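
On the client side, the error format above maps naturally onto an exception type. The sketch below is one way to surface `error_code` to calling code. `ApiError` and `raise_for_error` are names invented for this example, and treating every status of 400 or above as an error is an assumption.

```python
class ApiError(Exception):
    """Client-side wrapper for the documented error response format."""

    def __init__(self, status, detail, error_code=None, timestamp=None):
        super().__init__(f"{status} {error_code or 'ERROR'}: {detail}")
        self.status = status
        self.detail = detail
        self.error_code = error_code
        self.timestamp = timestamp


def raise_for_error(status, body):
    """Raise ApiError for HTTP status >= 400; return None otherwise."""
    if status < 400:
        return None
    raise ApiError(
        status,
        body.get("detail", ""),
        body.get("error_code"),   # absent in some responses
        body.get("timestamp"),    # the common-error samples omit this field
    )
```

Callers can then branch on `exc.error_code`, for example retrying later on `PROCESSING_FAILED` but reporting `INVALID_FILE_FORMAT` straight to the user.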

---

## Usage Examples

### Example 1: Auto-Route Processing

Upload a document and let the system choose the optimal track:

```bash
# 1. Upload document
curl -X POST "http://localhost:8000/api/v2/tasks/" \
  -H "Authorization: Bearer $TOKEN" \
  -F "file=@document.pdf" \
  -F "language=ch"

# Response: {"task_id": "550e8400..."}

# 2. Check status
curl -X GET "http://localhost:8000/api/v2/tasks/550e8400..." \
  -H "Authorization: Bearer $TOKEN"

# 3. Download results (when completed)
curl -X GET "http://localhost:8000/api/v2/tasks/550e8400.../download/json" \
  -H "Authorization: Bearer $TOKEN" \
  -o result.json
```

### Example 2: Analyze Before Processing

Analyze document type before processing:

```bash
# 1. Upload document
curl -X POST "http://localhost:8000/api/v2/tasks/" \
  -H "Authorization: Bearer $TOKEN" \
  -F "file=@document.pdf"

# Response: {"task_id": "550e8400..."}

# 2. Analyze document (NEW)
curl -X POST "http://localhost:8000/api/v2/tasks/550e8400.../analyze" \
  -H "Authorization: Bearer $TOKEN"

# Response shows recommended track and confidence

# 3. No separate start call is needed:
# processing runs in the background after upload
```

### Example 3: Force Specific Track

Force OCR processing for an editable PDF:

```bash
curl -X POST "http://localhost:8000/api/v2/tasks/" \
  -H "Authorization: Bearer $TOKEN" \
  -F "file=@document.pdf" \
  -F "force_track=ocr"
```

### Example 4: Get Processing Metadata

Get detailed processing information:

```bash
curl -X GET "http://localhost:8000/api/v2/tasks/550e8400.../metadata" \
  -H "Authorization: Bearer $TOKEN"
```

---

## Version History

### V2.0.0 (2025-11-20) - Dual-Track Processing

**New Features**:
- ✨ Dual-track processing (OCR + Direct Extraction)
- ✨ Automatic document type detection
- ✨ Office document support (Word, PowerPoint, Excel)
- ✨ Processing track metadata
- ✨ Enhanced layout analysis (23 element types)
- ✨ GPU memory management

**New Endpoints**:
- `POST /tasks/{task_id}/analyze` - Analyze document type
- `GET /tasks/{task_id}/metadata` - Get processing metadata

**Enhanced Endpoints**:
- `POST /tasks/` - Added `force_track` parameter
- `GET /tasks/{task_id}` - Added `processing_track`, `document_type`, element counts
- All download endpoints now include processing track information

**Performance Improvements**:
- 10x faster processing for editable PDFs (1-2s vs 10-20s per page)
- Optimized GPU memory usage for RTX 4060 8GB
- Office documents: 2-5s vs >300s (at least 60x faster)

---

## Support

For issues, questions, or feature requests:
- GitHub Issues: https://github.com/your-repo/Tool_OCR/issues
- Documentation: https://your-docs-site.com
- API Status: http://localhost:8000/health

---

*Generated by Tool_OCR V2.0.0 - Dual-Track Document Processing*