# Change: Migrate to External API Authentication

## Why

The current local database authentication system has several limitations:
- User credentials are managed locally, requiring manual user creation and password management
- No centralized authentication with enterprise identity systems
- Cannot leverage existing enterprise authentication infrastructure (e.g., Microsoft Azure AD)
- No single sign-on (SSO) capability
- Increased maintenance overhead for user management

By migrating to the external API authentication service at https://pj-auth-api.vercel.app, the system will:
- Integrate with enterprise Microsoft Azure AD authentication
- Enable single sign-on (SSO) for users
- Eliminate local password management
- Leverage existing enterprise user management and security policies
- Reduce maintenance overhead
- Provide consistent authentication across multiple applications

## What Changes

### Authentication Flow
- **Current**: Local database authentication using username/password stored in MySQL
- **New**: External API authentication via POST to `https://pj-auth-api.vercel.app/api/auth/login`
- **Token Management**: Use JWT tokens from external API instead of locally generated tokens
- **User Display**: Use `name` field from API response for user display instead of local username

### API Integration
**Endpoint**: `POST https://pj-auth-api.vercel.app/api/auth/login`

**Request Format**:
```json
{
  "username": "user@domain.com",
  "password": "user_password"
}
```

**Success Response (200)**:
```json
{
  "success": true,
  "message": "認證成功",
  "data": {
    "access_token": "eyJ0eXAiOiJKV1Q...",
    "id_token": "eyJ0eXAiOiJKV1Q...",
    "expires_in": 4999,
    "token_type": "Bearer",
    "userInfo": {
      "id": "42cf0b98-f598-47dd-ae2a-f33803f87d41",
      "name": "ymirliu 劉念萱",
      "email": "ymirliu@panjit.com.tw",
      "jobTitle": null,
      "officeLocation": "高雄",
      "businessPhones": ["1580"]
    },
    "issuedAt": "2025-11-14T07:09:15.203Z",
    "expiresAt": "2025-11-14T08:32:34.203Z"
  },
  "timestamp": "2025-11-14T07:09:15.203Z"
}
```
The `message` value translates to "Authentication successful", and the `officeLocation` value is "Kaohsiung".

**Failure Response (401)**:
```json
{
  "success": false,
  "error": "用戶名或密碼錯誤",
  "code": "INVALID_CREDENTIALS",
  "timestamp": "2025-11-14T07:10:02.585Z"
}
```
The `error` value translates to "Invalid username or password".

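The request and response shapes above could be handled on the backend roughly as follows. This is a minimal sketch using only the standard library; `LoginResult`, `build_login_request`, and `parse_login_response` are hypothetical names, and the `Content-Type` header is an assumption that the proposal does not state.

```python
import json
import urllib.request
from dataclasses import dataclass
from typing import Optional

# Endpoint taken from the proposal.
AUTH_URL = "https://pj-auth-api.vercel.app/api/auth/login"


@dataclass
class LoginResult:
    """Hypothetical container for the fields the application needs."""
    ok: bool
    display_name: Optional[str] = None
    email: Optional[str] = None
    access_token: Optional[str] = None
    expires_at: Optional[str] = None
    error_code: Optional[str] = None


def build_login_request(username: str, password: str) -> urllib.request.Request:
    """Build the POST request; the body follows the Request Format above."""
    body = json.dumps({"username": username, "password": password}).encode("utf-8")
    return urllib.request.Request(
        AUTH_URL,
        data=body,
        headers={"Content-Type": "application/json"},  # assumed header
        method="POST",
    )


def parse_login_response(status: int, payload: dict) -> LoginResult:
    """Map a success (200) or failure body onto a LoginResult."""
    if status == 200 and payload.get("success"):
        data = payload["data"]
        user = data["userInfo"]
        return LoginResult(
            ok=True,
            display_name=user["name"],
            email=user["email"],
            access_token=data["access_token"],
            expires_at=data["expiresAt"],
        )
    # Failure bodies carry a machine-readable `code` such as INVALID_CREDENTIALS.
    return LoginResult(ok=False, error_code=payload.get("code", "UNKNOWN"))
```

Keeping the parsing separate from the network call lets the mapping be unit-tested against the sample payloads without contacting the external service.
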
### Database Schema Changes

**Complete Redesign (No backward compatibility needed)**:

**Table Prefix**: `tool_ocr_` (for clear separation from other systems in the same database)

1. **tool_ocr_users table (redesigned)**:
   ```sql
   CREATE TABLE tool_ocr_users (
       id INT PRIMARY KEY AUTO_INCREMENT,
       email VARCHAR(255) UNIQUE NOT NULL,  -- Primary identifier from Azure AD
       display_name VARCHAR(255),           -- Display name from API response
       created_at TIMESTAMP DEFAULT CURRENT_TIMESTAMP,
       last_login TIMESTAMP NULL,           -- NULL until the first login
       is_active BOOLEAN DEFAULT TRUE
   );
   ```
   Note: The Azure AD ID is not stored; the email address is sufficient as the unique identifier.

2. **tool_ocr_tasks table (new - for task history)**:
   ```sql
   CREATE TABLE tool_ocr_tasks (
       id INT PRIMARY KEY AUTO_INCREMENT,
       user_id INT NOT NULL,               -- Foreign key to users table
       task_id VARCHAR(255) UNIQUE,        -- Unique task identifier
       filename VARCHAR(255),
       file_type VARCHAR(50),
       status ENUM('pending', 'processing', 'completed', 'failed'),
       result_json_path VARCHAR(500),
       result_markdown_path VARCHAR(500),
       error_message TEXT,
       created_at TIMESTAMP DEFAULT CURRENT_TIMESTAMP,
       updated_at TIMESTAMP DEFAULT CURRENT_TIMESTAMP ON UPDATE CURRENT_TIMESTAMP,
       completed_at TIMESTAMP NULL,
       file_deleted BOOLEAN DEFAULT FALSE, -- Track if files were auto-deleted
       FOREIGN KEY (user_id) REFERENCES tool_ocr_users(id),
       INDEX idx_user_status (user_id, status),
       INDEX idx_created (created_at)
   );
   ```

3. **tool_ocr_task_files table (for multiple files per task)**:
   ```sql
   CREATE TABLE tool_ocr_task_files (
       id INT PRIMARY KEY AUTO_INCREMENT,
       task_id INT NOT NULL,
       original_name VARCHAR(255),
       stored_path VARCHAR(500),
       file_size BIGINT,
       mime_type VARCHAR(100),
       FOREIGN KEY (task_id) REFERENCES tool_ocr_tasks(id) ON DELETE CASCADE
   );
   ```

4. **tool_ocr_sessions table (for token management)**:
   ```sql
   CREATE TABLE tool_ocr_sessions (
       id INT PRIMARY KEY AUTO_INCREMENT,
       user_id INT NOT NULL,
       access_token TEXT,
       id_token TEXT,
       expires_at TIMESTAMP NULL,  -- Explicit NULL avoids implicit TIMESTAMP defaults on older MySQL settings
       created_at TIMESTAMP DEFAULT CURRENT_TIMESTAMP,
       FOREIGN KEY (user_id) REFERENCES tool_ocr_users(id) ON DELETE CASCADE,
       INDEX idx_user (user_id),
       INDEX idx_expires (expires_at)
   );
   ```

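To illustrate how a successful login response might populate these tables, the sketch below maps the `data` object onto column values for `tool_ocr_users` and `tool_ocr_sessions`. `rows_from_login` and `parse_api_timestamp` are hypothetical helpers; resolving `user_id` (for example by upserting on `email`) is left out.

```python
from datetime import datetime, timezone


def parse_api_timestamp(value: str) -> datetime:
    """Parse a timestamp such as "2025-11-14T08:32:34.203Z" as aware UTC."""
    # Replacing "Z" keeps this working on Python versions before 3.11.
    return datetime.fromisoformat(value.replace("Z", "+00:00"))


def rows_from_login(data: dict, now: datetime) -> tuple:
    """Return (user_row, session_row) dicts keyed by column name."""
    user = data["userInfo"]
    user_row = {
        "email": user["email"],        # unique identifier
        "display_name": user["name"],  # shown in the UI
        "last_login": now,
    }
    session_row = {
        "access_token": data["access_token"],
        "id_token": data["id_token"],
        "expires_at": parse_api_timestamp(data["expiresAt"]),
    }
    return user_row, session_row
```
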
### Session Management
- Store external API tokens in session/cache instead of local JWT
- Implement token refresh mechanism based on `expires_in` field
- Use `expiresAt` timestamp for token expiration validation

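The expiry checks described above might look like the following sketch. The function names are hypothetical, and the 300-second buffer mirrors the `TOKEN_REFRESH_BUFFER` value proposed under Configuration. In the sample response, `issuedAt` plus `expires_in` (4999 seconds) equals `expiresAt`, so either field can drive the check.

```python
from datetime import datetime, timedelta, timezone

TOKEN_REFRESH_BUFFER = 300  # seconds before expiry at which a refresh is due


def parse_api_timestamp(value: str) -> datetime:
    """Parse timestamps such as "2025-11-14T08:32:34.203Z" as aware UTC."""
    return datetime.fromisoformat(value.replace("Z", "+00:00"))


def expiry_from_issue(issued_at: str, expires_in: int) -> datetime:
    """Derive the expiry time from `issuedAt` and `expires_in` (seconds)."""
    return parse_api_timestamp(issued_at) + timedelta(seconds=expires_in)


def needs_refresh(expires_at: str, now: datetime,
                  buffer_seconds: int = TOKEN_REFRESH_BUFFER) -> bool:
    """True once `now` is within the buffer window before `expiresAt`."""
    return now >= parse_api_timestamp(expires_at) - timedelta(seconds=buffer_seconds)


def is_expired(expires_at: str, now: datetime) -> bool:
    """True once the token can no longer be used."""
    return now >= parse_api_timestamp(expires_at)
```

Passing `now` explicitly keeps the functions deterministic in tests; production code would typically pass `datetime.now(timezone.utc)`.
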
## New Features: User Task Isolation and History

### Task Isolation
- **Principle**: Each user can only see and access their own tasks
- **Implementation**: All task queries filtered by `user_id` at API level
- **Security**: Enforce user context validation in all task-related endpoints

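A minimal sketch of the isolation rule: every lookup carries the caller's `user_id` in the `WHERE` clause, so another user's task looks the same as a missing task. For a self-contained example this uses an in-memory SQLite database rather than MySQL, with a reduced `tool_ocr_tasks` table; the helper names are hypothetical.

```python
import sqlite3


def make_demo_db() -> sqlite3.Connection:
    """Create a reduced tool_ocr_tasks table with tasks for two users."""
    conn = sqlite3.connect(":memory:")
    conn.execute(
        "CREATE TABLE tool_ocr_tasks ("
        "id INTEGER PRIMARY KEY, user_id INTEGER NOT NULL, "
        "task_id TEXT UNIQUE, filename TEXT, status TEXT)"
    )
    conn.executemany(
        "INSERT INTO tool_ocr_tasks (user_id, task_id, filename, status) "
        "VALUES (?, ?, ?, ?)",
        [
            (1, "t-1", "doc1.pdf", "completed"),
            (2, "t-2", "doc2.pdf", "processing"),
            (1, "t-3", "doc3.pdf", "failed"),
        ],
    )
    return conn


def list_tasks_for_user(conn: sqlite3.Connection, user_id: int) -> list:
    """Return only the task IDs owned by user_id."""
    cur = conn.execute(
        "SELECT task_id FROM tool_ocr_tasks WHERE user_id = ? ORDER BY id",
        (user_id,),
    )
    return [row[0] for row in cur.fetchall()]


def get_task_for_user(conn: sqlite3.Connection, user_id: int, task_id: str):
    """Return the task row if user_id owns it, otherwise None."""
    cur = conn.execute(
        "SELECT task_id, filename, status FROM tool_ocr_tasks "
        "WHERE task_id = ? AND user_id = ?",
        (task_id, user_id),
    )
    return cur.fetchone()
```

Returning `None` for a task owned by someone else lets the endpoint answer with the same "not found" response in both cases, which avoids revealing that the task exists.
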
### Task History Features
1. **Task Status Tracking**:
   - View pending tasks (waiting to process)
   - View processing tasks (currently running)
   - View completed tasks (with results available)
   - View failed tasks (with error messages)

2. **Historical Query Capabilities**:
   - Search tasks by filename
   - Filter by date range
   - Filter by status
   - Sort by creation/completion time
   - Pagination for large result sets

3. **Task Management**:
   - Download original files (if not auto-deleted)
   - Download results (JSON, Markdown, PDF exports)
   - Re-process failed tasks
   - Delete old tasks manually

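The historical query capabilities listed above could be combined into one parameterized query. The sketch below only builds the SQL text and parameter list; `build_history_query` is a hypothetical helper, and the `%s` placeholder style is an assumption that matches drivers such as PyMySQL. The sort column is checked against a whitelist because column names cannot be passed as bound parameters.

```python
SORTABLE_COLUMNS = {"created_at", "completed_at"}
STATUSES = {"pending", "processing", "completed", "failed"}


def build_history_query(user_id, filename=None, status=None,
                        created_from=None, created_to=None,
                        sort_by="created_at", descending=True,
                        page=1, page_size=20):
    """Return (sql, params) for one page of a user's task history."""
    if sort_by not in SORTABLE_COLUMNS:
        raise ValueError(f"unsupported sort column: {sort_by}")
    if status is not None and status not in STATUSES:
        raise ValueError(f"unknown status: {status}")
    if page < 1 or page_size < 1:
        raise ValueError("page and page_size must be positive")

    # The user_id filter is always present, which enforces task isolation.
    clauses = ["user_id = %s"]
    params = [user_id]
    if filename:
        clauses.append("filename LIKE %s")
        params.append(f"%{filename}%")
    if status:
        clauses.append("status = %s")
        params.append(status)
    if created_from is not None:
        clauses.append("created_at >= %s")
        params.append(created_from)
    if created_to is not None:
        clauses.append("created_at <= %s")
        params.append(created_to)

    direction = "DESC" if descending else "ASC"
    sql = (
        "SELECT task_id, filename, status, created_at, completed_at "
        "FROM tool_ocr_tasks WHERE " + " AND ".join(clauses) +
        f" ORDER BY {sort_by} {direction} LIMIT %s OFFSET %s"
    )
    params += [page_size, (page - 1) * page_size]
    return sql, params
```

For example, `build_history_query(7, filename="doc", status="failed", page=3, page_size=10)` yields the parameters `[7, "%doc%", "failed", 10, 20]`. The sketch does not escape `%` or `_` inside the filename search term.
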
### Frontend UI Changes
1. **New Components**:
   - Task History page/tab
   - Task filters and search bar
   - Task status badges
   - Batch action controls

2. **Task List View**:
   ```
   | Filename | Status | Created | Completed | Actions |
   |----------|--------|---------|-----------|---------|
   | doc1.pdf | ✅ Completed | 2025-11-14 10:00 | 2025-11-14 10:05 | [Download] [View] |
   | doc2.pdf | 🔄 Processing | 2025-11-14 10:10 | - | [Cancel] |
   | doc3.pdf | ❌ Failed | 2025-11-14 09:00 | - | [Retry] [View Error] |
   ```

3. **User Information Display**:
   - Show user display name in header
   - Show last login time
   - Show task statistics (total, completed, failed)

## Impact

### Affected Capabilities
- `authentication`: Complete replacement of authentication mechanism
- `user-management`: Simplified to read-only user information from external API
- `session-management`: Modified to handle external tokens
- `task-management`: NEW - User-specific task isolation and history
- `file-access-control`: NEW - User-based file access restrictions

### Affected Code
- **Backend Authentication**:
  - `backend/app/api/v1/endpoints/auth.py`: Replace login logic with external API call
  - `backend/app/core/security.py`: Modify token validation to use external tokens
  - `backend/app/core/auth.py`: Update authentication dependencies
  - `backend/app/services/auth_service.py`: New service for external API integration

- **Database Models**:
  - `backend/app/models/user.py`: Complete redesign with new schema
  - `backend/app/models/task.py`: NEW - Task model with user association
  - `backend/app/models/task_file.py`: NEW - Task file model
  - `backend/alembic/versions/`: Complete database recreation

- **Task Management APIs** (NEW):
  - `backend/app/api/v1/endpoints/tasks.py`: Task CRUD operations with user isolation
  - `backend/app/api/v1/endpoints/task_history.py`: Historical query endpoints
  - `backend/app/services/task_service.py`: Task business logic
  - `backend/app/services/file_access_service.py`: User-based file access control

- **Frontend**:
  - `frontend/src/services/authService.ts`: Update to handle new token format
  - `frontend/src/stores/authStore.ts`: Modify to store/display user info from API
  - `frontend/src/components/Header.tsx`: Display `name` field and user menu
  - `frontend/src/pages/TaskHistory.tsx`: NEW - Task history page
  - `frontend/src/components/TaskList.tsx`: NEW - Task list component with filters
  - `frontend/src/components/TaskFilters.tsx`: NEW - Search and filter UI
  - `frontend/src/stores/taskStore.ts`: NEW - Task state management
  - `frontend/src/services/taskService.ts`: NEW - Task API client

### Dependencies
- Use `httpx` or `aiohttp` (already present in the project) for async HTTP requests to the external API
- No new package dependencies required

### Configuration
- New environment variables:
  - `EXTERNAL_AUTH_API_URL` = "https://pj-auth-api.vercel.app"
  - `EXTERNAL_AUTH_ENDPOINT` = "/api/auth/login"
  - `EXTERNAL_AUTH_TIMEOUT` = 30 (seconds)
  - `TOKEN_REFRESH_BUFFER` = 300 (refresh tokens 5 minutes before expiry)
  - `TASK_RETENTION_DAYS` = 30 (auto-delete old tasks)
  - `MAX_TASKS_PER_USER` = 1000 (limit per user)
  - `ENABLE_TASK_HISTORY` = true (enable history feature)
  - `DATABASE_TABLE_PREFIX` = "tool_ocr_" (table naming prefix)

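One way to read these variables is a small typed settings object, sketched below with the standard library only. `AuthSettings` and `load_settings` are hypothetical names, and the defaults are the values listed above; the project may prefer a settings library instead.

```python
import os
from dataclasses import dataclass
from typing import Mapping


@dataclass(frozen=True)
class AuthSettings:
    external_auth_api_url: str
    external_auth_endpoint: str
    external_auth_timeout: int   # seconds
    token_refresh_buffer: int    # seconds
    task_retention_days: int
    max_tasks_per_user: int
    enable_task_history: bool
    database_table_prefix: str

    @property
    def login_url(self) -> str:
        """Full login URL built from the base URL and endpoint path."""
        return self.external_auth_api_url.rstrip("/") + self.external_auth_endpoint


def load_settings(env: Mapping[str, str] = os.environ) -> AuthSettings:
    """Read the variables from `env`, falling back to the proposed defaults."""
    truthy = {"1", "true", "yes"}
    return AuthSettings(
        external_auth_api_url=env.get("EXTERNAL_AUTH_API_URL", "https://pj-auth-api.vercel.app"),
        external_auth_endpoint=env.get("EXTERNAL_AUTH_ENDPOINT", "/api/auth/login"),
        external_auth_timeout=int(env.get("EXTERNAL_AUTH_TIMEOUT", "30")),
        token_refresh_buffer=int(env.get("TOKEN_REFRESH_BUFFER", "300")),
        task_retention_days=int(env.get("TASK_RETENTION_DAYS", "30")),
        max_tasks_per_user=int(env.get("MAX_TASKS_PER_USER", "1000")),
        enable_task_history=env.get("ENABLE_TASK_HISTORY", "true").strip().lower() in truthy,
        database_table_prefix=env.get("DATABASE_TABLE_PREFIX", "tool_ocr_"),
    )
```
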
### Security Considerations
- HTTPS required for all authentication requests
- Token storage must be secure (HTTPOnly cookies or secure session storage)
- Implement rate limiting for authentication attempts
- Log all authentication events for audit trail
- Validate SSL certificates for external API calls
- Handle network failures gracefully with appropriate error messages
- **User Isolation**: Enforce user context in all database queries
- **File Access Control**: Validate user ownership before file access
- **API Security**: Add user_id validation in all task-related endpoints

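As an illustration of the rate-limiting item, the sketch below implements a sliding-window limiter held in process memory. `LoginRateLimiter` and its default limits are assumptions, not values from this proposal, and a multi-process deployment would need a shared store instead.

```python
import time
from collections import defaultdict, deque


class LoginRateLimiter:
    """Allow at most `max_attempts` login attempts per key per window."""

    def __init__(self, max_attempts: int = 5, window_seconds: float = 60.0,
                 clock=time.monotonic):
        self.max_attempts = max_attempts
        self.window_seconds = window_seconds
        self._clock = clock  # injectable so tests can control time
        self._attempts = defaultdict(deque)

    def allow(self, key: str) -> bool:
        """Record an attempt for `key` and report whether it is permitted."""
        now = self._clock()
        attempts = self._attempts[key]
        # Drop attempts that have aged out of the window.
        while attempts and now - attempts[0] >= self.window_seconds:
            attempts.popleft()
        if len(attempts) >= self.max_attempts:
            return False
        attempts.append(now)
        return True
```

The key could be the submitted username, the client IP address, or both; rejected attempts are not recorded, so a blocked client is admitted again once its earlier attempts leave the window.
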
### Migration Plan (Simplified - No Rollback Needed)
1. **Phase 1**: Backup existing database (for reference only)
2. **Phase 2**: Drop old tables and create new schema
3. **Phase 3**: Deploy new authentication and task management system
4. **Phase 4**: Test with initial users
5. **Phase 5**: Full deployment

Note: Since this is a test system with no production data to preserve, we can perform a clean migration without rollback concerns.

## Risks and Mitigations

### Risks
1. **External API Unavailability**: Authentication service downtime blocks all logins
   - *Mitigation*: Implement fallback to local auth, cache tokens, implement retry logic

2. **Token Expiration Handling**: Users may be logged out unexpectedly
   - *Mitigation*: Implement automatic token refresh before expiration

3. **Network Latency**: Slower authentication due to external API calls
   - *Mitigation*: Implement proper timeout handling, async requests, response caching

4. **Data Consistency**: User information mismatch between local DB and external system
   - *Mitigation*: Regular sync jobs, use external system as single source of truth

5. **Breaking Change**: Existing sessions will be invalidated
   - *Mitigation*: Provide migration window, clear communication to users

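The retry mitigation for external API unavailability could be sketched as exponential backoff around the login call. `call_with_retry` and `AuthServiceUnavailable` are hypothetical names, and the attempt count and delays are illustrative assumptions rather than values from this proposal.

```python
import time
from typing import Callable, TypeVar

T = TypeVar("T")


class AuthServiceUnavailable(Exception):
    """Hypothetical error raised when the external API cannot be reached."""


def call_with_retry(operation: Callable[[], T], attempts: int = 3,
                    base_delay: float = 0.5,
                    sleep: Callable[[float], None] = time.sleep) -> T:
    """Run `operation`, retrying on AuthServiceUnavailable with backoff."""
    if attempts < 1:
        raise ValueError("attempts must be at least 1")
    for attempt in range(attempts):
        try:
            return operation()
        except AuthServiceUnavailable:
            if attempt == attempts - 1:
                raise  # out of retries: surface the failure to the caller
            # Delays double each time: base_delay, 2 * base_delay, ...
            sleep(base_delay * (2 ** attempt))
    raise AssertionError("unreachable")
```

Only availability errors are retried here. A 401 response with `INVALID_CREDENTIALS` should be returned to the user immediately, since repeating it would not help and would add to the failed-attempt count.
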
## Success Criteria
- All users can authenticate via external API
- Authentication response time < 2 seconds (95th percentile)
- Zero data loss during migration
- Automatic token refresh works without user intervention
- Proper error messages for all failure scenarios
- Audit logs capture all authentication events
- Rollback procedure tested and documented
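
The response-time criterion can be checked against measured login latencies. The sketch below uses the nearest-rank definition of a percentile, which is one common choice and an assumption here; the function names are hypothetical.

```python
def percentile_nearest_rank(samples, percent: int):
    """Return the nearest-rank percentile of `samples` (percent in 1..100)."""
    if not samples:
        raise ValueError("samples must not be empty")
    if not 0 < percent <= 100:
        raise ValueError("percent must be between 1 and 100")
    ordered = sorted(samples)
    # Integer form of ceil(percent * n / 100), avoiding float rounding.
    rank = (percent * len(ordered) + 99) // 100
    return ordered[rank - 1]


def meets_latency_target(samples_seconds, target_seconds: float = 2.0) -> bool:
    """True if the 95th-percentile latency is below the target."""
    return percentile_nearest_rank(samples_seconds, 95) < target_seconds
```

With 20 samples the 95th percentile is the 19th smallest value, so a single slow login out of 20 does not fail the criterion, while two slow logins do.
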