refactor: enhance auth migration proposal with user task isolation

Major updates based on feedback:
1. Remove Azure AD ID storage - use email as primary identifier
2. Complete database redesign - no backward compatibility needed
3. Add comprehensive user task isolation and history features

Database changes:
- Simplified users table (email-based)
- New ocr_tasks table with user association
- New task_files table for file tracking
- Proper indexes for performance

New features:
- User task isolation (A cannot see B's tasks)
- Task history with status tracking (pending/processing/completed/failed)
- Historical query capabilities with filters
- Download support for completed tasks
- Task management UI with search and filters

Security enhancements:
- User context validation in all endpoints
- File access control based on ownership
- Row-level filtering by user_id in database queries
- API-level authorization checks

Implementation approach:
- Clean migration without rollback concerns
- Drop old tables and start fresh
- Simplified deployment process
- Comprehensive task management system

🤖 Generated with [Claude Code](https://claude.com/claude-code)

Co-Authored-By: Claude <noreply@anthropic.com>
egg
2025-11-14 15:33:18 +08:00
parent 28e419f5fa
commit 88f9fef2d4
2 changed files with 301 additions and 104 deletions

@@ -72,25 +72,118 @@ By migrating to the external API authentication service at https://pj-auth-api.v
```
### Database Schema Changes
- **users table modifications**:
- Remove/deprecate `hashed_password` column (keep for rollback)
- Add `external_user_id` (VARCHAR 255) - Store Azure AD user ID
- Add `display_name` (VARCHAR 255) - Store user display name from API
- Add `azure_email` (VARCHAR 255) - Store Azure AD email
- Add `last_token_refresh` (DATETIME) - Track token refresh timing
- Keep `username` for backward compatibility (can be email)
**Complete Redesign (No backward compatibility needed)**:
1. **users table (redesigned)**:
```sql
CREATE TABLE users (
id INT PRIMARY KEY AUTO_INCREMENT,
email VARCHAR(255) UNIQUE NOT NULL, -- Primary identifier from Azure AD
display_name VARCHAR(255), -- Display name from API response
created_at TIMESTAMP DEFAULT CURRENT_TIMESTAMP,
last_login TIMESTAMP NULL, -- NULL until the first successful login
is_active BOOLEAN DEFAULT TRUE
);
```
Note: Azure AD IDs are not stored - email is sufficient as the unique identifier
2. **ocr_tasks table (new - for task history)**:
```sql
CREATE TABLE ocr_tasks (
id INT PRIMARY KEY AUTO_INCREMENT,
user_id INT NOT NULL, -- Foreign key to users table
task_id VARCHAR(255) NOT NULL UNIQUE, -- Unique task identifier
filename VARCHAR(255),
file_type VARCHAR(50),
status ENUM('pending', 'processing', 'completed', 'failed') NOT NULL DEFAULT 'pending',
result_json_path VARCHAR(500),
result_markdown_path VARCHAR(500),
error_message TEXT,
created_at TIMESTAMP DEFAULT CURRENT_TIMESTAMP,
updated_at TIMESTAMP DEFAULT CURRENT_TIMESTAMP ON UPDATE CURRENT_TIMESTAMP,
completed_at TIMESTAMP NULL,
file_deleted BOOLEAN DEFAULT FALSE, -- Track if files were auto-deleted
FOREIGN KEY (user_id) REFERENCES users(id),
INDEX idx_user_status (user_id, status),
INDEX idx_created (created_at)
);
```
3. **task_files table (for multiple files per task)**:
```sql
CREATE TABLE task_files (
id INT PRIMARY KEY AUTO_INCREMENT,
task_id INT NOT NULL, -- References ocr_tasks.id (the numeric key, not ocr_tasks.task_id)
original_name VARCHAR(255),
stored_path VARCHAR(500),
file_size BIGINT,
mime_type VARCHAR(100),
FOREIGN KEY (task_id) REFERENCES ocr_tasks(id) ON DELETE CASCADE
);
```
### Session Management
- Store external API tokens in session/cache instead of local JWT
- Implement token refresh mechanism based on `expires_in` field
- Use `expiresAt` timestamp for token expiration validation
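The refresh and expiry checks described above could be sketched roughly as follows. This is a minimal illustration, not the project's implementation: it assumes `expiresAt` has already been parsed into a timezone-aware `datetime`, it assumes a 300-second buffer (the `TOKEN_REFRESH_BUFFER` value), and the function names `is_expired` and `needs_refresh` are hypothetical.

```python
from datetime import datetime, timedelta, timezone

# Assumed to mirror the TOKEN_REFRESH_BUFFER setting (seconds).
TOKEN_REFRESH_BUFFER = 300


def is_expired(expires_at: datetime, now: datetime) -> bool:
    """Return True once the token's expiresAt timestamp has passed."""
    return now >= expires_at


def needs_refresh(expires_at: datetime, now: datetime,
                  buffer_seconds: int = TOKEN_REFRESH_BUFFER) -> bool:
    """Return True when the token is inside the refresh window before expiry."""
    return now >= expires_at - timedelta(seconds=buffer_seconds)


if __name__ == "__main__":
    now = datetime.now(timezone.utc)
    # A token with 2 minutes left falls inside the 5-minute refresh window.
    print(needs_refresh(now + timedelta(minutes=2), now))
```

Passing `now` explicitly keeps both checks deterministic and easy to unit test.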
## New Features: User Task Isolation and History
### Task Isolation
- **Principle**: Each user can only see and access their own tasks
- **Implementation**: All task queries are filtered by `user_id` at the API level
- **Security**: Enforce user context validation in all task-related endpoints
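One way to picture the isolation rule is a data-access layer in which every task query takes the caller's `user_id` as a required argument. The sketch below uses the standard-library `sqlite3` module and a trimmed-down `ocr_tasks` table purely for illustration; the real backend would use its own database driver and models, and the helper names here are hypothetical.

```python
import sqlite3


def list_tasks_for_user(conn: sqlite3.Connection, user_id: int) -> list:
    """Return only the rows owned by user_id; the ownership filter is not optional."""
    cur = conn.execute(
        "SELECT task_id, filename, status FROM ocr_tasks "
        "WHERE user_id = ? ORDER BY id",
        (user_id,),
    )
    return cur.fetchall()


def get_task_for_user(conn: sqlite3.Connection, user_id: int, task_id: str):
    """Fetch one task; returns None if it is missing or owned by another user."""
    cur = conn.execute(
        "SELECT task_id, filename, status FROM ocr_tasks "
        "WHERE task_id = ? AND user_id = ?",
        (task_id, user_id),
    )
    return cur.fetchone()


def demo_connection() -> sqlite3.Connection:
    """Build an in-memory table holding tasks for two users (illustrative data)."""
    conn = sqlite3.connect(":memory:")
    conn.execute(
        "CREATE TABLE ocr_tasks ("
        "id INTEGER PRIMARY KEY, user_id INTEGER NOT NULL, "
        "task_id TEXT UNIQUE, filename TEXT, status TEXT)"
    )
    conn.executemany(
        "INSERT INTO ocr_tasks (user_id, task_id, filename, status) "
        "VALUES (?, ?, ?, ?)",
        [
            (1, "t-a1", "doc1.pdf", "completed"),
            (1, "t-a2", "doc2.pdf", "processing"),
            (2, "t-b1", "doc3.pdf", "failed"),
        ],
    )
    return conn
```

With this shape, user 1 asking for `t-b1` gets the same result as asking for a task that does not exist.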
### Task History Features
1. **Task Status Tracking**:
- View pending tasks (waiting to process)
- View processing tasks (currently running)
- View completed tasks (with results available)
- View failed tasks (with error messages)
2. **Historical Query Capabilities**:
- Search tasks by filename
- Filter by date range
- Filter by status
- Sort by creation/completion time
- Pagination for large result sets
3. **Task Management**:
- Download original files (if not auto-deleted)
- Download results (JSON, Markdown, PDF exports)
- Re-process failed tasks
- Delete old tasks manually
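The historical query capabilities listed above (filename search, date range, status filter, sorting, pagination) could be combined into a single parameterized query. The following is a sketch under assumptions: the function name and its parameters are hypothetical, the column names follow the `ocr_tasks` schema, and the `?` placeholders follow the `sqlite3` style (MySQL drivers typically use `%s`).

```python
ALLOWED_STATUSES = {"pending", "processing", "completed", "failed"}
# Whitelist of sortable columns, so user input never reaches ORDER BY directly.
SORT_COLUMNS = {"created": "created_at", "completed": "completed_at"}


def build_history_query(user_id, filename=None, status=None,
                        created_from=None, created_to=None,
                        sort_by="created", descending=True,
                        page=1, page_size=20):
    """Return (sql, params) for a user-scoped, filtered, paginated task query."""
    if status is not None and status not in ALLOWED_STATUSES:
        raise ValueError(f"unknown status: {status!r}")
    if sort_by not in SORT_COLUMNS:
        raise ValueError(f"unknown sort key: {sort_by!r}")
    if page < 1 or page_size < 1:
        raise ValueError("page and page_size must be positive")

    # The ownership filter always comes first and cannot be skipped.
    clauses = ["user_id = ?"]
    params = [user_id]
    if filename:
        # Substring match; '%' and '_' in the input act as wildcards here.
        clauses.append("filename LIKE ?")
        params.append(f"%{filename}%")
    if status:
        clauses.append("status = ?")
        params.append(status)
    if created_from:
        clauses.append("created_at >= ?")
        params.append(created_from)
    if created_to:
        clauses.append("created_at < ?")
        params.append(created_to)

    direction = "DESC" if descending else "ASC"
    sql = (
        "SELECT task_id, filename, status, created_at, completed_at "
        "FROM ocr_tasks WHERE " + " AND ".join(clauses)
        + f" ORDER BY {SORT_COLUMNS[sort_by]} {direction} LIMIT ? OFFSET ?"
    )
    params.extend([page_size, (page - 1) * page_size])
    return sql, params
```

Only values travel as bound parameters. The sort column and direction are chosen from fixed strings, which keeps the query safe from injection through the sort option.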
### Frontend UI Changes
1. **New Components**:
- Task History page/tab
- Task filters and search bar
- Task status badges
- Batch action controls
2. **Task List View**:
```
| Filename | Status | Created | Completed | Actions |
|----------|--------|---------|-----------|---------|
| doc1.pdf | ✅ Completed | 2025-11-14 10:00 | 2025-11-14 10:05 | [Download] [View] |
| doc2.pdf | 🔄 Processing | 2025-11-14 10:10 | - | [Cancel] |
| doc3.pdf | ❌ Failed | 2025-11-14 09:00 | - | [Retry] [View Error] |
```
3. **User Information Display**:
- Show user display name in header
- Show last login time
- Show task statistics (total, completed, failed)
## Impact
### Affected Capabilities
- `authentication`: Complete replacement of authentication mechanism
- `user-management`: Simplified to read-only user information from external API
- `session-management`: Modified to handle external tokens
- `task-management`: NEW - User-specific task isolation and history
- `file-access-control`: NEW - User-based file access restrictions
### Affected Code
- **Backend Authentication**:
@@ -100,13 +193,26 @@ By migrating to the external API authentication service at https://pj-auth-api.v
- `backend/app/services/auth_service.py`: New service for external API integration
- **Database Models**:
- `backend/app/models/user.py`: Update User model with new fields
- `backend/alembic/versions/`: New migration for schema changes
- `backend/app/models/user.py`: Complete redesign with new schema
- `backend/app/models/task.py`: NEW - Task model with user association
- `backend/app/models/task_file.py`: NEW - Task file model
- `backend/alembic/versions/`: Complete database recreation
- **Task Management APIs** (NEW):
- `backend/app/api/v1/endpoints/tasks.py`: Task CRUD operations with user isolation
- `backend/app/api/v1/endpoints/task_history.py`: Historical query endpoints
- `backend/app/services/task_service.py`: Task business logic
- `backend/app/services/file_access_service.py`: User-based file access control
- **Frontend**:
- `frontend/src/services/authService.ts`: Update to handle new token format
- `frontend/src/stores/authStore.ts`: Modify to store/display user info from API
- `frontend/src/components/Header.tsx`: Display `name` field instead of username
- `frontend/src/components/Header.tsx`: Display `name` field and user menu
- `frontend/src/pages/TaskHistory.tsx`: NEW - Task history page
- `frontend/src/components/TaskList.tsx`: NEW - Task list component with filters
- `frontend/src/components/TaskFilters.tsx`: NEW - Search and filter UI
- `frontend/src/stores/taskStore.ts`: NEW - Task state management
- `frontend/src/services/taskService.ts`: NEW - Task API client
### Dependencies
- Use `httpx` or `aiohttp` for async HTTP requests to the external API (already present, so no new dependency is needed)
@@ -117,8 +223,10 @@ By migrating to the external API authentication service at https://pj-auth-api.v
- `EXTERNAL_AUTH_API_URL` = "https://pj-auth-api.vercel.app"
- `EXTERNAL_AUTH_ENDPOINT` = "/api/auth/login"
- `EXTERNAL_AUTH_TIMEOUT` = 30 (seconds)
- `USE_EXTERNAL_AUTH` = true (feature flag for gradual rollout)
- `TOKEN_REFRESH_BUFFER` = 300 (refresh tokens 5 minutes before expiry)
- `TASK_RETENTION_DAYS` = 30 (auto-delete old tasks)
- `MAX_TASKS_PER_USER` = 1000 (limit per user)
- `ENABLE_TASK_HISTORY` = true (enable history feature)
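The settings above could be loaded into a typed object at startup. This is only a sketch: the `Settings` class, the `load_settings` helper, and the parsing rules for booleans are assumptions, and the defaults simply repeat the values listed above.

```python
import os
from dataclasses import dataclass


def _env_bool(env, name: str, default: bool) -> bool:
    """Parse a boolean flag; accepts 1/true/yes/on in any letter case."""
    raw = env.get(name)
    if raw is None:
        return default
    return raw.strip().lower() in {"1", "true", "yes", "on"}


@dataclass(frozen=True)
class Settings:
    external_auth_api_url: str
    external_auth_endpoint: str
    external_auth_timeout: int   # seconds
    token_refresh_buffer: int    # seconds before expiry
    task_retention_days: int
    max_tasks_per_user: int
    enable_task_history: bool


def load_settings(env=None) -> Settings:
    """Build Settings from a mapping (defaults to os.environ)."""
    env = os.environ if env is None else env
    return Settings(
        external_auth_api_url=env.get(
            "EXTERNAL_AUTH_API_URL", "https://pj-auth-api.vercel.app"),
        external_auth_endpoint=env.get(
            "EXTERNAL_AUTH_ENDPOINT", "/api/auth/login"),
        external_auth_timeout=int(env.get("EXTERNAL_AUTH_TIMEOUT", "30")),
        token_refresh_buffer=int(env.get("TOKEN_REFRESH_BUFFER", "300")),
        task_retention_days=int(env.get("TASK_RETENTION_DAYS", "30")),
        max_tasks_per_user=int(env.get("MAX_TASKS_PER_USER", "1000")),
        enable_task_history=_env_bool(env, "ENABLE_TASK_HISTORY", True),
    )
```

Accepting a plain mapping instead of reading `os.environ` directly makes the loader easy to exercise in tests.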
### Security Considerations
- HTTPS required for all authentication requests
@@ -127,21 +235,18 @@ By migrating to the external API authentication service at https://pj-auth-api.v
- Log all authentication events for audit trail
- Validate SSL certificates for external API calls
- Handle network failures gracefully with appropriate error messages
- **User Isolation**: Enforce user context in all database queries
- **File Access Control**: Validate user ownership before file access
- **API Security**: Add user_id validation in all task-related endpoints
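The ownership check before file access might look like the sketch below. Everything in it is hypothetical: `TaskRecord` is a stand-in for the real task model, and treating "not yours" the same as "not found" is one possible design choice, made so that a caller cannot probe which task IDs exist.

```python
from dataclasses import dataclass
from typing import Optional


class TaskNotFound(Exception):
    """Raised when a task is missing or is not owned by the caller."""


@dataclass
class TaskRecord:
    task_id: str
    user_id: int
    result_json_path: str
    file_deleted: bool = False


def resolve_download_path(task: Optional[TaskRecord],
                          current_user_id: int) -> str:
    """Return the result path only if the caller owns the task."""
    # Same error for "missing" and "owned by someone else".
    if task is None or task.user_id != current_user_id:
        raise TaskNotFound("task not found")
    # Files may have been removed by the retention job.
    if task.file_deleted:
        raise FileNotFoundError("task files were deleted")
    return task.result_json_path
```

An endpoint would call this after loading the task and before opening any file, so the check cannot be bypassed by guessing a path.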
### Rollback Strategy
- Keep existing authentication code with feature flag
- Maintain password column in database (don't drop immediately)
- Implement dual authentication mode during transition:
- If `USE_EXTERNAL_AUTH=true`: Use external API
- If `USE_EXTERNAL_AUTH=false`: Use local authentication
- Provide migration script to sync existing users with external system
### Migration Plan (Simplified - No Rollback Needed)
1. **Phase 1**: Back up the existing database (for reference only)
2. **Phase 2**: Drop old tables and create new schema
3. **Phase 3**: Deploy new authentication and task management system
4. **Phase 4**: Test with initial users
5. **Phase 5**: Full deployment
### Migration Plan
1. **Phase 1**: Implement external API authentication alongside existing system
2. **Phase 2**: Test with subset of users (based on domain or user flag)
3. **Phase 3**: Gradual rollout to all users
4. **Phase 4**: Deprecate local authentication (keep code for emergency)
5. **Phase 5**: Remove local authentication code (after stable period)
Note: Since this is a test system with no production data to preserve, we can perform a clean migration without rollback concerns.
## Risks and Mitigations