# Tool_OCR V2 API Documentation ## Overview Tool_OCR V2 provides a comprehensive OCR service with dual-track document processing. The API supports intelligent routing between OCR track (for scanned documents) and Direct Extraction track (for editable PDFs and Office documents). **Base URL**: `http://localhost:8000/api/v2` **Authentication**: Bearer token (JWT) --- ## Table of Contents 1. [Authentication](#authentication) 2. [Task Management](#task-management) 3. [Document Processing](#document-processing) 4. [Document Analysis](#document-analysis) 5. [File Downloads](#file-downloads) 6. [Processing Tracks](#processing-tracks) 7. [Response Models](#response-models) 8. [Error Handling](#error-handling) --- ## Authentication All endpoints require authentication via Bearer token. ### Headers ```http Authorization: Bearer ``` ### Login ```http POST /api/auth/login Content-Type: application/json { "email": "user@example.com", "password": "password123" } ``` **Response**: ```json { "access_token": "eyJhbGc...", "token_type": "bearer", "user": { "id": 1, "email": "user@example.com", "username": "user" } } ``` --- ## Task Management ### Create Task Create a new OCR processing task by uploading a document. ```http POST /tasks/ Content-Type: multipart/form-data ``` **Request Body**: - `file` (required): Document file to process - Supported formats: PDF, PNG, JPG, JPEG, GIF, BMP, TIFF, DOCX, PPTX, XLSX - `language` (optional): OCR language code (default: 'ch') - Options: 'ch', 'en', 'japan', 'korean', etc. - `detect_layout` (optional): Enable layout detection (default: true) - `force_track` (optional): Force specific processing track - Options: 'ocr', 'direct', 'auto' (default: 'auto') **Response** `201 Created`: ```json { "task_id": "550e8400-e29b-41d4-a716-446655440000", "filename": "document.pdf", "status": "pending", "language": "ch", "created_at": "2025-11-20T10:00:00Z" } ``` **Processing Track Selection**: - `auto` (default): Automatically select optimal track based on document analysis - Editable PDFs → Direct track (faster, ~1-2s/page) - Scanned documents/images → OCR track (slower, ~2-5s/page) - Office documents → Convert to PDF, then route based on content - `ocr`: Force OCR processing (PaddleOCR PP-StructureV3) - `direct`: Force direct extraction (PyMuPDF) - only for editable PDFs --- ### List Tasks Get a paginated list of user's tasks with filtering. ```http GET /tasks/?status={status}&filename={search}&skip={skip}&limit={limit} ``` **Query Parameters**: - `status` (optional): Filter by task status - Options: `pending`, `processing`, `completed`, `failed` - `filename` (optional): Search by filename (partial match) - `skip` (optional): Pagination offset (default: 0) - `limit` (optional): Page size (default: 10, max: 100) **Response** `200 OK`: ```json { "tasks": [ { "task_id": "550e8400-e29b-41d4-a716-446655440000", "filename": "document.pdf", "status": "completed", "language": "ch", "processing_track": "direct", "processing_time": 1.14, "created_at": "2025-11-20T10:00:00Z", "completed_at": "2025-11-20T10:00:02Z" } ], "total": 42, "skip": 0, "limit": 10 } ``` --- ### Get Task Details Retrieve detailed information about a specific task. ```http GET /tasks/{task_id} ``` **Response** `200 OK`: ```json { "task_id": "550e8400-e29b-41d4-a716-446655440000", "filename": "document.pdf", "status": "completed", "language": "ch", "processing_track": "direct", "document_type": "pdf_editable", "processing_time": 1.14, "page_count": 3, "element_count": 51, "character_count": 10592, "confidence": 0.95, "created_at": "2025-11-20T10:00:00Z", "completed_at": "2025-11-20T10:00:02Z", "result_files": { "json": "/tasks/550e8400-e29b-41d4-a716-446655440000/download/json", "markdown": "/tasks/550e8400-e29b-41d4-a716-446655440000/download/markdown", "pdf": "/tasks/550e8400-e29b-41d4-a716-446655440000/download/pdf" }, "metadata": { "file_size": 524288, "mime_type": "application/pdf", "text_coverage": 0.95, "processing_track_reason": "PDF has extractable text on 100% of sampled pages" } } ``` **New Fields** (Dual-Track): - `processing_track`: Track used for processing (`ocr`, `direct`, or `null`) - `document_type`: Detected document type - `pdf_editable`: Editable PDF with text - `pdf_scanned`: Scanned/image-based PDF - `pdf_mixed`: Mixed content PDF - `image`: Image file - `office_word`, `office_excel`, `office_ppt`: Office documents - `page_count`: Number of pages extracted - `element_count`: Total elements (text, tables, images) extracted - `character_count`: Total characters extracted - `metadata.text_coverage`: Percentage of pages with extractable text (0.0-1.0) - `metadata.processing_track_reason`: Explanation of track selection --- ### Get Task Statistics Get aggregated statistics for user's tasks. ```http GET /tasks/stats ``` **Response** `200 OK`: ```json { "total_tasks": 150, "by_status": { "pending": 5, "processing": 3, "completed": 140, "failed": 2 }, "by_processing_track": { "ocr": 80, "direct": 60, "unknown": 10 }, "total_pages_processed": 4250, "average_processing_time": 3.5, "success_rate": 0.987 } ``` --- ### Delete Task Delete a task and all associated files. ```http DELETE /tasks/{task_id} ``` **Response** `204 No Content` --- ## Document Processing ### Processing Workflow 1. **Upload Document** → `POST /tasks/` → Returns `task_id` 2. **Background Processing** → Task status changes to `processing` 3. **Complete** → Task status changes to `completed` or `failed` 4. **Download Results** → Use download endpoints ### Track Selection Flow ``` Document Upload ↓ Document Type Detection ↓ ┌──────────────┐ │ Auto Routing │ └──────┬───────┘ ↓ ┌────┴─────┐ ↓ ↓ [Direct] [OCR] ↓ ↓ PyMuPDF PaddleOCR ↓ ↓ UnifiedDocument ↓ Export (JSON/MD/PDF) ``` **Direct Track** (Fast - 1-2s/page): - Editable PDFs with extractable text - Office documents (converted to text-based PDF) - Uses PyMuPDF for direct text extraction - Preserves exact layout and fonts **OCR Track** (Slower - 2-5s/page): - Scanned PDFs and images - Documents without extractable text - Uses PaddleOCR PP-StructureV3 - Handles complex layouts with 23 element types --- ## Document Analysis ### Analyze Document Type Analyze a document to determine optimal processing track **before** processing. **NEW ENDPOINT** ```http POST /tasks/{task_id}/analyze ``` **Response** `200 OK`: ```json { "task_id": "550e8400-e29b-41d4-a716-446655440000", "filename": "document.pdf", "analysis": { "recommended_track": "direct", "confidence": 0.95, "reason": "PDF has extractable text on 100% of sampled pages", "document_type": "pdf_editable", "metadata": { "total_pages": 3, "sampled_pages": 3, "text_coverage": 1.0, "mime_type": "application/pdf", "file_size": 524288, "page_details": [ { "page": 1, "text_length": 3520, "has_text": true, "image_count": 2, "image_coverage": 0.15 } ] } } } ``` **Use Case**: - Preview processing track before starting - Validate document type for batch processing - Provide user feedback on processing method --- ### Get Processing Metadata Get detailed metadata about how a document was processed. **NEW ENDPOINT** ```http GET /tasks/{task_id}/metadata ``` **Response** `200 OK`: ```json { "task_id": "550e8400-e29b-41d4-a716-446655440000", "processing_track": "direct", "document_type": "pdf_editable", "confidence": 0.95, "reason": "PDF has extractable text on 100% of sampled pages", "statistics": { "page_count": 3, "element_count": 51, "total_tables": 2, "total_images": 3, "element_type_counts": { "text": 45, "table": 2, "image": 3, "header": 1 }, "text_stats": { "total_characters": 10592, "total_words": 1842, "average_confidence": 1.0 } }, "processing_info": { "processing_time": 1.14, "track_description": "PyMuPDF Direct Extraction - Used for editable PDFs", "schema_version": "1.0.0" }, "file_metadata": { "filename": "document.pdf", "file_size": 524288, "mime_type": "application/pdf", "created_at": "2025-11-20T10:00:00Z" } } ``` --- ## File Downloads ### Download JSON Result Download structured JSON output with full document structure. ```http GET /tasks/{task_id}/download/json ``` **Response** `200 OK`: - Content-Type: `application/json` - Content-Disposition: `attachment; filename="{filename}_result.json"` **JSON Structure**: ```json { "schema_version": "1.0.0", "document_id": "d8bea84d-a4ea-4455-b219-243624b5518e", "export_timestamp": "2025-11-20T10:00:02Z", "metadata": { "filename": "document.pdf", "file_type": ".pdf", "file_size": 524288, "created_at": "2025-11-20T10:00:00Z", "processing_track": "direct", "processing_time": 1.14, "language": "ch", "processing_info": { "track_description": "PyMuPDF Direct Extraction", "schema_version": "1.0.0", "export_format": "unified_document_v1" } }, "pages": [ { "page_number": 1, "dimensions": { "width": 595.32, "height": 841.92 }, "elements": [ { "element_id": "text_1_0", "type": "text", "bbox": { "x0": 72.0, "y0": 72.0, "x1": 200.0, "y1": 90.0 }, "content": "Document Title", "confidence": 1.0, "style": { "font": "Helvetica-Bold", "size": 18.0 } } ] } ], "statistics": { "page_count": 3, "total_elements": 51, "total_tables": 2, "total_images": 3, "element_type_counts": { "text": 45, "table": 2, "image": 3, "header": 1 }, "text_stats": { "total_characters": 10592, "total_words": 1842, "average_confidence": 1.0 } } } ``` **Element Types**: - `text`: Text blocks - `header`: Headers (H1-H6) - `paragraph`: Paragraphs - `list`: Lists - `table`: Tables with cell structure - `image`: Images with position - `figure`: Figures with captions - `footer`: Page footers --- ### Download Markdown Result Download Markdown formatted output. ```http GET /tasks/{task_id}/download/markdown ``` **Response** `200 OK`: - Content-Type: `text/markdown` - Content-Disposition: `attachment; filename="{filename}_output.md"` **Example Output**: ```markdown # Document Title This is the extracted content from the document. ## Section 1 Content of section 1... | Column 1 | Column 2 | |----------|----------| | Data 1 | Data 2 | ![Image](imgs/img_in_image_box_100_200_500_600.jpg) ``` --- ### Download Layout-Preserving PDF Download reconstructed PDF with layout preservation. ```http GET /tasks/{task_id}/download/pdf ``` **Response** `200 OK`: - Content-Type: `application/pdf` - Content-Disposition: `attachment; filename="{filename}_layout.pdf"` **Features**: - Preserves original layout and coordinates - Maintains text positioning - Includes extracted images - Renders tables with proper structure --- ## Processing Tracks ### Track Comparison | Feature | OCR Track | Direct Track | |---------|-----------|--------------| | **Speed** | 2-5 seconds/page | 0.5-1 second/page | | **Best For** | Scanned documents, images | Editable PDFs, Office docs | | **Technology** | PaddleOCR PP-StructureV3 | PyMuPDF | | **Accuracy** | 92-98% (content-dependent) | 100% (text is extracted, not recognized) | | **Layout Preservation** | Good (23 element types) | Excellent (exact coordinates) | | **GPU Required** | Yes (8GB recommended) | No | | **Supported Formats** | PDF, PNG, JPG, TIFF, etc. | PDF (with text), converted Office docs | ### Processing Track Enum ```python class ProcessingTrackEnum(str, Enum): AUTO = "auto" # Automatic selection (default) OCR = "ocr" # Force OCR processing DIRECT = "direct" # Force direct extraction ``` ### Document Type Enum ```python class DocumentType(str, Enum): PDF_EDITABLE = "pdf_editable" # PDF with extractable text PDF_SCANNED = "pdf_scanned" # Scanned/image-based PDF PDF_MIXED = "pdf_mixed" # Mixed content PDF IMAGE = "image" # Image files OFFICE_WORD = "office_word" # Word documents OFFICE_EXCEL = "office_excel" # Excel spreadsheets OFFICE_POWERPOINT = "office_ppt" # PowerPoint presentations TEXT = "text" # Plain text files UNKNOWN = "unknown" # Unknown format ``` --- ## Response Models ### TaskResponse ```typescript interface TaskResponse { task_id: string; filename: string; status: "pending" | "processing" | "completed" | "failed"; language: string; processing_track?: "ocr" | "direct" | null; created_at: string; // ISO 8601 completed_at?: string | null; } ``` ### TaskDetailResponse Extends `TaskResponse` with: ```typescript interface TaskDetailResponse extends TaskResponse { document_type?: string; processing_time?: number; // seconds page_count?: number; element_count?: number; character_count?: number; confidence?: number; // 0.0-1.0 result_files?: { json?: string; markdown?: string; pdf?: string; }; metadata?: { file_size?: number; mime_type?: string; text_coverage?: number; // 0.0-1.0 processing_track_reason?: string; [key: string]: any; }; } ``` ### DocumentAnalysisResponse ```typescript interface DocumentAnalysisResponse { task_id: string; filename: string; analysis: { recommended_track: "ocr" | "direct"; confidence: number; // 0.0-1.0 reason: string; document_type: string; metadata: { total_pages?: number; sampled_pages?: number; text_coverage?: number; mime_type?: string; file_size?: number; page_details?: Array<{ page: number; text_length: number; has_text: boolean; image_count: number; image_coverage: number; }>; }; }; } ``` ### ProcessingMetadata ```typescript interface ProcessingMetadata { task_id: string; processing_track: "ocr" | "direct"; document_type: string; confidence: number; reason: string; statistics: { page_count: number; element_count: number; total_tables: number; total_images: number; element_type_counts: { [type: string]: number; }; text_stats: { total_characters: number; total_words: number; average_confidence: number | null; }; }; processing_info: { processing_time: number; track_description: string; schema_version: string; }; file_metadata: { filename: string; file_size: number; mime_type: string; created_at: string; }; } ``` --- ## Error Handling ### HTTP Status Codes - `200 OK`: Successful request - `201 Created`: Resource created successfully - `204 No Content`: Successful deletion - `400 Bad Request`: Invalid request parameters - `401 Unauthorized`: Missing or invalid authentication - `403 Forbidden`: Insufficient permissions - `404 Not Found`: Resource not found - `422 Unprocessable Entity`: Validation error - `500 Internal Server Error`: Server error ### Error Response Format ```json { "detail": "Error message describing the issue", "error_code": "ERROR_CODE", "timestamp": "2025-11-20T10:00:00Z" } ``` ### Common Errors **Invalid File Format**: ```json { "detail": "Unsupported file format. Supported: PDF, PNG, JPG, DOCX, PPTX, XLSX", "error_code": "INVALID_FILE_FORMAT" } ``` **Task Not Found**: ```json { "detail": "Task not found or access denied", "error_code": "TASK_NOT_FOUND" } ``` **Processing Failed**: ```json { "detail": "OCR processing failed: GPU memory insufficient", "error_code": "PROCESSING_FAILED" } ``` **File Too Large**: ```json { "detail": "File size exceeds maximum limit of 50MB", "error_code": "FILE_TOO_LARGE" } ``` --- ## Usage Examples ### Example 1: Auto-Route Processing Upload a document and let the system choose the optimal track: ```bash # 1. Upload document curl -X POST "http://localhost:8000/api/v2/tasks/" \ -H "Authorization: Bearer $TOKEN" \ -F "file=@document.pdf" \ -F "language=ch" # Response: {"task_id": "550e8400..."} # 2. Check status curl -X GET "http://localhost:8000/api/v2/tasks/550e8400..." \ -H "Authorization: Bearer $TOKEN" # 3. Download results (when completed) curl -X GET "http://localhost:8000/api/v2/tasks/550e8400.../download/json" \ -H "Authorization: Bearer $TOKEN" \ -o result.json ``` ### Example 2: Analyze Before Processing Analyze document type before processing: ```bash # 1. Upload document curl -X POST "http://localhost:8000/api/v2/tasks/" \ -H "Authorization: Bearer $TOKEN" \ -F "file=@document.pdf" # Response: {"task_id": "550e8400..."} # 2. Analyze document (NEW) curl -X POST "http://localhost:8000/api/v2/tasks/550e8400.../analyze" \ -H "Authorization: Bearer $TOKEN" # Response shows recommended track and confidence # 3. Start processing (automatic based on analysis) # Processing happens in background after upload ``` ### Example 3: Force Specific Track Force OCR processing for an editable PDF: ```bash curl -X POST "http://localhost:8000/api/v2/tasks/" \ -H "Authorization: Bearer $TOKEN" \ -F "file=@document.pdf" \ -F "force_track=ocr" ``` ### Example 4: Get Processing Metadata Get detailed processing information: ```bash curl -X GET "http://localhost:8000/api/v2/tasks/550e8400.../metadata" \ -H "Authorization: Bearer $TOKEN" ``` --- ## Version History ### V2.0.0 (2025-11-20) - Dual-Track Processing **New Features**: - ✨ Dual-track processing (OCR + Direct Extraction) - ✨ Automatic document type detection - ✨ Office document support (Word, PowerPoint, Excel) - ✨ Processing track metadata - ✨ Enhanced layout analysis (23 element types) - ✨ GPU memory management **New Endpoints**: - `POST /tasks/{task_id}/analyze` - Analyze document type - `GET /tasks/{task_id}/metadata` - Get processing metadata **Enhanced Endpoints**: - `POST /tasks/` - Added `force_track` parameter - `GET /tasks/{task_id}` - Added `processing_track`, `document_type`, element counts - All download endpoints now include processing track information **Performance Improvements**: - 10x faster processing for editable PDFs (1-2s vs 10-20s per page) - Optimized GPU memory usage for RTX 4060 8GB - Office documents: 2-5s vs >300s (60x improvement) --- ## Support For issues, questions, or feature requests: - GitHub Issues: https://github.com/your-repo/Tool_OCR/issues - Documentation: https://your-docs-site.com - API Status: http://localhost:8000/health --- *Generated by Tool_OCR V2.0.0 - Dual-Track Document Processing*