feat: complete external auth V2 migration with advanced features

This commit implements comprehensive external Azure AD authentication
with complete task management, file download, and admin monitoring systems.

## Core Features Implemented (80% Complete)

### 1. Token Auto-Refresh Mechanism
- Backend: POST /api/v2/auth/refresh endpoint
- Frontend: Auto-refresh 5 minutes before expiration
- Auto-retry on 401 errors with seamless token refresh
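
The refresh timing above can be sketched as a small decision function. This is a minimal Python illustration, not the project's frontend code (which lives in TypeScript); `REFRESH_MARGIN`, `should_refresh` and `seconds_until_refresh` are hypothetical names, and only the 5-minute margin comes from this commit message.

```python
from datetime import datetime, timedelta, timezone

# Margin taken from the commit message: refresh 5 minutes before expiration.
REFRESH_MARGIN = timedelta(minutes=5)


def should_refresh(expires_at: datetime, now: datetime) -> bool:
    """Return True once `now` has entered the refresh window before expiry."""
    return now >= expires_at - REFRESH_MARGIN


def seconds_until_refresh(expires_at: datetime, now: datetime) -> int:
    """Seconds to wait before calling the refresh endpoint (0 if already due)."""
    delta = (expires_at - REFRESH_MARGIN) - now
    return max(0, int(delta.total_seconds()))


if __name__ == "__main__":
    now = datetime.now(timezone.utc)
    expiry = now + timedelta(minutes=30)
    print(should_refresh(expiry, now), seconds_until_refresh(expiry, now))
```

Passing `now` in explicitly keeps the rule easy to test; a client would typically feed the result into a timer and fall back to the 401 retry path if the timer is missed.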

### 2. File Download System
- Three supported formats: JSON, Markdown, PDF
- Endpoints: GET /api/v2/tasks/{id}/download/{format}
- File access control with ownership validation
- Frontend download buttons in TaskHistoryPage
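
A rough sketch of how a download request might be resolved, assuming the `{format}` path segment maps onto the `result_*_path` columns of the Task model. `TaskRecord`, `FORMAT_TO_ATTR` and `resolve_download` are hypothetical stand-ins, not the contents of `file_access_service.py`.

```python
from dataclasses import dataclass
from typing import Optional

# Assumed mapping from the {format} path segment to a Task result column.
FORMAT_TO_ATTR = {
    "json": "result_json_path",
    "markdown": "result_markdown_path",
    "pdf": "result_pdf_path",
}


@dataclass
class TaskRecord:
    """Stand-in for a task row; only the fields needed for this sketch."""
    user_id: int
    result_json_path: Optional[str] = None
    result_markdown_path: Optional[str] = None
    result_pdf_path: Optional[str] = None


def resolve_download(task: TaskRecord, requester_id: int, fmt: str) -> str:
    """Return the path to serve, or raise if the request is not allowed."""
    # Ownership validation comes first, so other users learn nothing about the task.
    if task.user_id != requester_id:
        raise PermissionError("task belongs to another user")
    attr = FORMAT_TO_ATTR.get(fmt)
    if attr is None:
        raise ValueError(f"unsupported format: {fmt}")
    path = getattr(task, attr)
    if not path:
        raise FileNotFoundError(f"no {fmt} result for this task")
    return path
```

In a web framework each exception would be translated into the matching HTTP status; which status codes the real endpoints return is not stated in this commit.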

### 3. Complete Task Management
Backend Endpoints:
- POST /api/v2/tasks/{id}/start - Start task
- POST /api/v2/tasks/{id}/cancel - Cancel task
- POST /api/v2/tasks/{id}/retry - Retry failed task
- GET /api/v2/tasks - List with filters (status, filename, date range)
- GET /api/v2/tasks/stats - User statistics

Frontend Features:
- Status-based action buttons (Start/Cancel/Retry)
- Advanced search and filtering (status, filename, date range)
- Pagination and sorting
- Task statistics dashboard (5 stat cards)
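
The status-based buttons could be driven by a lookup like the one below. The status values match the `TaskStatus` enum in this commit, but the mapping itself is an assumption made for illustration: the commit message only states that retry applies to failed tasks.

```python
import enum
from typing import Tuple


class TaskStatus(str, enum.Enum):
    """Mirror of the TaskStatus enum in backend/app/models/task.py."""
    PENDING = "pending"
    PROCESSING = "processing"
    COMPLETED = "completed"
    FAILED = "failed"


# Assumed mapping of status to offered actions (hypothetical, not from the source).
ALLOWED_ACTIONS = {
    TaskStatus.PENDING: ("start", "cancel"),
    TaskStatus.PROCESSING: ("cancel",),
    TaskStatus.COMPLETED: (),
    TaskStatus.FAILED: ("retry",),
}


def available_actions(status: TaskStatus) -> Tuple[str, ...]:
    """Return the action names to offer for a task in the given status."""
    return ALLOWED_ACTIONS[status]
```

Keeping the table in one place lets the backend reject an action with the same rule the UI uses to hide its button.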

### 4. Admin Monitoring System (Backend)
Admin APIs:
- GET /api/v2/admin/stats - System statistics
- GET /api/v2/admin/users - User list with stats
- GET /api/v2/admin/users/top - User leaderboard
- GET /api/v2/admin/audit-logs - Audit log query system
- GET /api/v2/admin/audit-logs/user/{id}/summary

Admin Features:
- Email-based admin check (ymirliu@panjit.com.tw)
- Comprehensive system metrics (users, tasks, sessions, activity)
- Audit logging service for security tracking
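
An email-based admin check can be as small as a set lookup. The sketch below is illustrative only: `ADMIN_EMAILS` holds a placeholder address, `is_admin` is a hypothetical name, and the case-insensitive comparison is an assumption rather than documented behavior of `get_current_admin_user_v2`.

```python
from typing import FrozenSet, Optional

# Placeholder value; a real deployment would load this from configuration.
ADMIN_EMAILS: FrozenSet[str] = frozenset({"admin@example.com"})


def is_admin(email: Optional[str], admin_emails: FrozenSet[str] = ADMIN_EMAILS) -> bool:
    """Return True if the email belongs to a configured administrator."""
    if not email:
        return False
    # Assumption: addresses are compared case-insensitively after trimming.
    return email.strip().lower() in admin_emails
```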

### 5. User Isolation & Security
- Row-level security on all task queries
- File access control with ownership validation
- Strict user_id filtering on all operations
- Session validation and expiry checking
- Admin privilege verification
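
The user isolation described above amounts to adding a `user_id` condition to every query. The sketch below shows the idea with the standard-library `sqlite3` module and a simplified `tasks` table; the table layout and the function names are assumptions, and the real `task_service.py` uses SQLAlchemy models instead.

```python
import sqlite3
from typing import List, Optional, Tuple


def list_tasks_for_user(conn: sqlite3.Connection, user_id: int) -> List[Tuple[str, str]]:
    """Return (task_id, filename) rows owned by the given user only."""
    return conn.execute(
        "SELECT task_id, filename FROM tasks WHERE user_id = ? ORDER BY task_id",
        (user_id,),
    ).fetchall()


def get_task_for_user(
    conn: sqlite3.Connection, user_id: int, task_id: str
) -> Optional[Tuple[str, str]]:
    """Fetch one task; returns None when it is missing or owned by someone else."""
    return conn.execute(
        "SELECT task_id, filename FROM tasks WHERE task_id = ? AND user_id = ?",
        (task_id, user_id),
    ).fetchone()
```

Returning `None` for both "not found" and "not yours" avoids revealing whether another user's task ID exists.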

## New Files Created

Backend:
- backend/app/models/user_v2.py - User model for external auth
- backend/app/models/task.py - Task model with user isolation
- backend/app/models/session.py - Session management
- backend/app/models/audit_log.py - Audit log model
- backend/app/services/external_auth_service.py - External API client
- backend/app/services/task_service.py - Task CRUD with isolation
- backend/app/services/file_access_service.py - File access control
- backend/app/services/admin_service.py - Admin operations
- backend/app/services/audit_service.py - Audit logging
- backend/app/routers/auth_v2.py - V2 auth endpoints
- backend/app/routers/tasks.py - Task management endpoints
- backend/app/routers/admin.py - Admin endpoints
- backend/alembic/versions/5e75a59fb763_*.py - DB migration

Frontend:
- frontend/src/services/apiV2.ts - Complete V2 API client
- frontend/src/types/apiV2.ts - V2 type definitions
- frontend/src/pages/TaskHistoryPage.tsx - Task history UI

Modified Files:
- backend/app/core/deps.py - Added get_current_admin_user_v2
- backend/app/main.py - Registered admin router
- frontend/src/pages/LoginPage.tsx - V2 login integration
- frontend/src/components/Layout.tsx - User display and logout
- frontend/src/App.tsx - Added /tasks route

## Documentation
- openspec/changes/.../PROGRESS_UPDATE.md - Detailed progress report

## Pending Items (20%)
1. Database migration execution for audit_logs table
2. Frontend admin dashboard page
3. Frontend audit log viewer

## Testing Status
- Manual testing: Authentication flow verified
- Unit tests: Pending
- Integration tests: Pending

## Security Enhancements
- User isolation (row-level security)
- File access control
- Token expiry validation
- Admin privilege verification
- Audit logging infrastructure
- Token encryption (noted, low priority)
- Rate limiting (noted, low priority)

🤖 Generated with [Claude Code](https://claude.com/claude-code)

Co-Authored-By: Claude <noreply@anthropic.com>
egg
2025-11-14 17:19:43 +08:00
parent 470fa96428
commit ad2b832fb6
32 changed files with 6450 additions and 26 deletions


@@ -1,14 +1,28 @@
"""
Tool_OCR - Database Models
New schema with external API authentication and user task isolation.
All tables use 'tool_ocr_' prefix for namespace separation.
"""
from app.models.user import User
# New models for external authentication system
from app.models.user_v2 import User
from app.models.task import Task, TaskFile, TaskStatus
from app.models.session import Session
# Legacy models (will be deprecated after migration)
from app.models.ocr import OCRBatch, OCRFile, OCRResult
from app.models.export import ExportRule
from app.models.translation import TranslationConfig
__all__ = [
# New authentication and task models
"User",
"Task",
"TaskFile",
"TaskStatus",
"Session",
# Legacy models (deprecated)
"OCRBatch",
"OCRFile",
"OCRResult",


@@ -0,0 +1,95 @@
"""
Tool_OCR - Audit Log Model
Security audit logging for authentication and task operations
"""

from sqlalchemy import Column, Integer, String, DateTime, Text, ForeignKey
from sqlalchemy.orm import relationship
from datetime import datetime

from app.core.database import Base


class AuditLog(Base):
    """
    Audit log model for security tracking

    Records all important events including:
    - Authentication events (login, logout, failures)
    - Task operations (create, update, delete)
    - Admin operations
    """

    __tablename__ = "tool_ocr_audit_logs"

    id = Column(Integer, primary_key=True, index=True, autoincrement=True)
    user_id = Column(
        Integer,
        ForeignKey("tool_ocr_users.id", ondelete="SET NULL"),
        nullable=True,
        index=True,
        comment="User who performed the action (NULL for system events)"
    )
    event_type = Column(
        String(50),
        nullable=False,
        index=True,
        comment="Event type: auth_login, auth_logout, auth_failed, task_create, etc."
    )
    event_category = Column(
        String(20),
        nullable=False,
        index=True,
        comment="Category: authentication, task, admin, system"
    )
    description = Column(
        Text,
        nullable=False,
        comment="Human-readable event description"
    )
    ip_address = Column(String(45), nullable=True, comment="Client IP address (IPv4/IPv6)")
    user_agent = Column(String(500), nullable=True, comment="Client user agent")
    resource_type = Column(
        String(50),
        nullable=True,
        comment="Type of resource affected (task, user, session)"
    )
    resource_id = Column(
        String(255),
        nullable=True,
        index=True,
        comment="ID of affected resource"
    )
    success = Column(
        Integer,
        default=1,
        nullable=False,
        comment="1 for success, 0 for failure"
    )
    error_message = Column(Text, nullable=True, comment="Error details if failed")
    # "metadata" is a reserved attribute name on SQLAlchemy declarative classes,
    # so the Python attribute is renamed while the DB column stays "metadata".
    extra_metadata = Column("metadata", Text, nullable=True, comment="Additional JSON metadata")
    created_at = Column(DateTime, default=datetime.utcnow, nullable=False, index=True)

    # Relationships
    user = relationship("User", back_populates="audit_logs")

    def __repr__(self):
        return f"<AuditLog(id={self.id}, type='{self.event_type}', user_id={self.user_id})>"

    def to_dict(self):
        """Convert audit log to dictionary"""
        return {
            "id": self.id,
            "user_id": self.user_id,
            "event_type": self.event_type,
            "event_category": self.event_category,
            "description": self.description,
            "ip_address": self.ip_address,
            "user_agent": self.user_agent,
            "resource_type": self.resource_type,
            "resource_id": self.resource_id,
            "success": bool(self.success),
            "error_message": self.error_message,
            "metadata": self.extra_metadata,
            "created_at": self.created_at.isoformat() if self.created_at else None
        }
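
Because the `metadata` column stores JSON as plain text, callers need to serialize on write and parse on read. A minimal sketch of such helpers follows; `encode_metadata` and `decode_metadata` are hypothetical names and are not part of `audit_service.py` as shown in this commit.

```python
import json
from typing import Any, Dict, Optional


def encode_metadata(data: Optional[Dict[str, Any]]) -> Optional[str]:
    """Serialize a metadata dict to a JSON string suitable for a Text column."""
    if data is None:
        return None
    # default=str keeps values such as datetimes from raising TypeError.
    return json.dumps(data, ensure_ascii=False, sort_keys=True, default=str)


def decode_metadata(raw: Optional[str]) -> Optional[Dict[str, Any]]:
    """Parse the stored JSON text back into a dict (None for empty values)."""
    if not raw:
        return None
    return json.loads(raw)
```

Sorting keys makes stored values stable across runs, which helps when comparing audit entries in tests.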


@@ -0,0 +1,82 @@
"""
Tool_OCR - Session Model
Secure token storage and session management for external authentication
"""

from sqlalchemy import Column, Integer, String, DateTime, Text, ForeignKey
from sqlalchemy.orm import relationship
from datetime import datetime

from app.core.database import Base


class Session(Base):
    """
    User session model for external API token management

    Stores encrypted tokens from external authentication API
    and tracks session metadata for security auditing.
    """

    __tablename__ = "tool_ocr_sessions"

    id = Column(Integer, primary_key=True, index=True, autoincrement=True)
    user_id = Column(Integer, ForeignKey("tool_ocr_users.id", ondelete="CASCADE"),
                     nullable=False, index=True,
                     comment="Foreign key to users table")
    access_token = Column(Text, nullable=True,
                          comment="Encrypted JWT access token from external API")
    id_token = Column(Text, nullable=True,
                      comment="Encrypted JWT ID token from external API")
    refresh_token = Column(Text, nullable=True,
                           comment="Encrypted refresh token (if provided by API)")
    token_type = Column(String(50), default="Bearer", nullable=False,
                        comment="Token type (typically 'Bearer')")
    expires_at = Column(DateTime, nullable=False, index=True,
                        comment="Token expiration timestamp from API")
    issued_at = Column(DateTime, nullable=False,
                       comment="Token issue timestamp from API")

    # Session metadata for security
    ip_address = Column(String(45), nullable=True,
                        comment="Client IP address (IPv4/IPv6)")
    user_agent = Column(String(500), nullable=True,
                        comment="Client user agent string")

    # Timestamps
    created_at = Column(DateTime, default=datetime.utcnow, nullable=False, index=True)
    last_accessed_at = Column(DateTime, default=datetime.utcnow,
                              onupdate=datetime.utcnow, nullable=False,
                              comment="Last time this session was used")

    # Relationships
    user = relationship("User", back_populates="sessions")

    def __repr__(self):
        return f"<Session(id={self.id}, user_id={self.user_id}, expires_at='{self.expires_at}')>"

    def to_dict(self):
        """Convert session to dictionary (excluding sensitive tokens)"""
        return {
            "id": self.id,
            "user_id": self.user_id,
            "token_type": self.token_type,
            "expires_at": self.expires_at.isoformat() if self.expires_at else None,
            "issued_at": self.issued_at.isoformat() if self.issued_at else None,
            "ip_address": self.ip_address,
            "created_at": self.created_at.isoformat() if self.created_at else None,
            "last_accessed_at": self.last_accessed_at.isoformat() if self.last_accessed_at else None
        }

    @property
    def is_expired(self) -> bool:
        """Check if session token is expired"""
        return datetime.utcnow() >= self.expires_at if self.expires_at else True

    @property
    def time_until_expiry(self) -> int:
        """Get seconds until token expiration"""
        if not self.expires_at:
            return 0
        delta = self.expires_at - datetime.utcnow()
        return max(0, int(delta.total_seconds()))
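
`datetime.utcnow()` is deprecated as of Python 3.12. The expiry checks above could be written against a timezone-aware clock instead, as in this sketch. It assumes the `DateTime` columns hold naive UTC values; `utcnow_naive` and the standalone function forms are hypothetical and shown outside the model for clarity.

```python
from datetime import datetime, timedelta, timezone
from typing import Optional


def utcnow_naive() -> datetime:
    """Current UTC time as a naive datetime, comparable with stored column values."""
    return datetime.now(timezone.utc).replace(tzinfo=None)


def is_expired(expires_at: Optional[datetime], now: Optional[datetime] = None) -> bool:
    """A session with no expiry recorded is treated as expired."""
    if expires_at is None:
        return True
    return (now or utcnow_naive()) >= expires_at


def time_until_expiry(expires_at: Optional[datetime], now: Optional[datetime] = None) -> int:
    """Whole seconds until expiry, never negative."""
    if expires_at is None:
        return 0
    delta = expires_at - (now or utcnow_naive())
    return max(0, int(delta.total_seconds()))


if __name__ == "__main__":
    print(time_until_expiry(utcnow_naive() + timedelta(minutes=5)))
```

The optional `now` argument exists so tests can pin the clock; production callers would omit it.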

backend/app/models/task.py Normal file

@@ -0,0 +1,126 @@
"""
Tool_OCR - Task Model
OCR task management with user isolation
"""

from sqlalchemy import Column, Integer, String, DateTime, Boolean, Text, ForeignKey, Enum as SQLEnum
from sqlalchemy.orm import relationship
from datetime import datetime
import enum

from app.core.database import Base


class TaskStatus(str, enum.Enum):
    """Task status enumeration"""
    PENDING = "pending"
    PROCESSING = "processing"
    COMPLETED = "completed"
    FAILED = "failed"


class Task(Base):
    """
    OCR Task model with user association

    Each task belongs to a specific user and stores
    processing status and result file paths.
    """

    __tablename__ = "tool_ocr_tasks"

    id = Column(Integer, primary_key=True, index=True, autoincrement=True)
    user_id = Column(Integer, ForeignKey("tool_ocr_users.id", ondelete="CASCADE"),
                     nullable=False, index=True,
                     comment="Foreign key to users table")
    task_id = Column(String(255), unique=True, nullable=False, index=True,
                     comment="Unique task identifier (UUID)")
    filename = Column(String(255), nullable=True, index=True)
    file_type = Column(String(50), nullable=True)
    status = Column(SQLEnum(TaskStatus), default=TaskStatus.PENDING, nullable=False,
                    index=True)
    result_json_path = Column(String(500), nullable=True,
                              comment="Path to JSON result file")
    result_markdown_path = Column(String(500), nullable=True,
                                  comment="Path to Markdown result file")
    result_pdf_path = Column(String(500), nullable=True,
                             comment="Path to searchable PDF file")
    error_message = Column(Text, nullable=True,
                           comment="Error details if task failed")
    processing_time_ms = Column(Integer, nullable=True,
                                comment="Processing time in milliseconds")
    created_at = Column(DateTime, default=datetime.utcnow, nullable=False, index=True)
    updated_at = Column(DateTime, default=datetime.utcnow, onupdate=datetime.utcnow,
                        nullable=False)
    completed_at = Column(DateTime, nullable=True)
    file_deleted = Column(Boolean, default=False, nullable=False,
                          comment="Track if files were auto-deleted")

    # Relationships
    user = relationship("User", back_populates="tasks")
    files = relationship("TaskFile", back_populates="task", cascade="all, delete-orphan")

    def __repr__(self):
        # status is None until the column default is applied at flush time,
        # so guard the .value access the same way to_dict() does.
        status = self.status.value if self.status else None
        return f"<Task(id={self.id}, task_id='{self.task_id}', status='{status}')>"

    def to_dict(self):
        """Convert task to dictionary"""
        return {
            "id": self.id,
            "task_id": self.task_id,
            "filename": self.filename,
            "file_type": self.file_type,
            "status": self.status.value if self.status else None,
            "result_json_path": self.result_json_path,
            "result_markdown_path": self.result_markdown_path,
            "result_pdf_path": self.result_pdf_path,
            "error_message": self.error_message,
            "processing_time_ms": self.processing_time_ms,
            "created_at": self.created_at.isoformat() if self.created_at else None,
            "updated_at": self.updated_at.isoformat() if self.updated_at else None,
            "completed_at": self.completed_at.isoformat() if self.completed_at else None,
            "file_deleted": self.file_deleted
        }


class TaskFile(Base):
    """
    Task file model

    Stores information about files associated with a task.
    """

    __tablename__ = "tool_ocr_task_files"

    id = Column(Integer, primary_key=True, index=True, autoincrement=True)
    task_id = Column(Integer, ForeignKey("tool_ocr_tasks.id", ondelete="CASCADE"),
                     nullable=False, index=True,
                     comment="Foreign key to tasks table")
    original_name = Column(String(255), nullable=True)
    stored_path = Column(String(500), nullable=True,
                         comment="Actual file path on server")
    file_size = Column(Integer, nullable=True,
                       comment="File size in bytes")
    mime_type = Column(String(100), nullable=True)
    file_hash = Column(String(64), nullable=True, index=True,
                       comment="SHA256 hash for deduplication")
    created_at = Column(DateTime, default=datetime.utcnow, nullable=False)

    # Relationships
    task = relationship("Task", back_populates="files")

    def __repr__(self):
        return f"<TaskFile(id={self.id}, task_id={self.task_id}, original_name='{self.original_name}')>"

    def to_dict(self):
        """Convert task file to dictionary"""
        return {
            "id": self.id,
            "task_id": self.task_id,
            "original_name": self.original_name,
            "stored_path": self.stored_path,
            "file_size": self.file_size,
            "mime_type": self.mime_type,
            "file_hash": self.file_hash,
            "created_at": self.created_at.isoformat() if self.created_at else None
        }
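
The `file_hash` column is sized for a SHA256 hex digest (64 characters). One way to compute such a digest without loading a whole upload into memory is sketched below; `sha256_of_stream` is a hypothetical helper and the chunk size is an arbitrary choice.

```python
import hashlib
import io
from typing import BinaryIO


def sha256_of_stream(stream: BinaryIO, chunk_size: int = 65536) -> str:
    """Return the SHA256 hex digest of a binary stream, read in chunks."""
    digest = hashlib.sha256()
    # iter() with a sentinel stops as soon as read() returns empty bytes.
    for chunk in iter(lambda: stream.read(chunk_size), b""):
        digest.update(chunk)
    return digest.hexdigest()


if __name__ == "__main__":
    print(sha256_of_stream(io.BytesIO(b"example upload")))
```

Two uploads with the same digest can then be treated as candidates for deduplication by querying the indexed `file_hash` column.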


@@ -0,0 +1,49 @@
"""
Tool_OCR - User Model v2.0
External API authentication with simplified schema
"""

from sqlalchemy import Column, Integer, String, DateTime, Boolean
from sqlalchemy.orm import relationship
from datetime import datetime

from app.core.database import Base


class User(Base):
    """
    User model for external API authentication

    Uses email as primary identifier from Azure AD.
    No password storage - authentication via external API only.
    """

    __tablename__ = "tool_ocr_users"

    id = Column(Integer, primary_key=True, index=True, autoincrement=True)
    email = Column(String(255), unique=True, nullable=False, index=True,
                   comment="Primary identifier from Azure AD")
    display_name = Column(String(255), nullable=True,
                          comment="Display name from API response")
    created_at = Column(DateTime, default=datetime.utcnow, nullable=False)
    last_login = Column(DateTime, nullable=True)
    is_active = Column(Boolean, default=True, nullable=False, index=True)

    # Relationships
    tasks = relationship("Task", back_populates="user", cascade="all, delete-orphan")
    sessions = relationship("Session", back_populates="user", cascade="all, delete-orphan")
    audit_logs = relationship("AuditLog", back_populates="user")

    def __repr__(self):
        return f"<User(id={self.id}, email='{self.email}', display_name='{self.display_name}')>"

    def to_dict(self):
        """Convert user to dictionary"""
        return {
            "id": self.id,
            "email": self.email,
            "display_name": self.display_name,
            "created_at": self.created_at.isoformat() if self.created_at else None,
            "last_login": self.last_login.isoformat() if self.last_login else None,
            "is_active": self.is_active
        }