egg/OCR - OCR

egg/OCR

Author	SHA1	Message	Date
egg	c65df754cf	wip: add TableData.from_dict() for OCR track table parsing (incomplete) Add TableData.from_dict() and TableCell.from_dict() methods to convert JSON table dicts to proper TableData objects during UnifiedDocument parsing. Modified _json_to_document_element() to detect TABLE elements with dict content containing 'cells' key and convert to TableData. Note: This fix ensures table elements have proper to_html() method available but the rendered output still needs investigation - tables may still render incorrectly in OCR track PDFs. Files changed: - unified_document.py: Add from_dict() class methods - pdf_generator_service.py: Convert table dicts during JSON parsing - Add fix-ocr-track-table-rendering proposal 🤖 Generated with [Claude Code](https://claude.com/claude-code) Co-Authored-By: Claude <noreply@anthropic.com>	2025-11-26 19:16:51 +08:00
egg	1afdb822c3	feat: implement hybrid image extraction and memory management Backend: - Add hybrid image extraction for Direct track (inline image blocks) - Add render_inline_image_regions() fallback when OCR doesn't find images - Add check_document_for_missing_images() for detecting missing images - Add memory management system (MemoryGuard, ModelManager, ServicePool) - Update pdf_generator_service to handle HYBRID processing track - Add ElementType.LOGO for logo extraction Frontend: - Fix PDF viewer re-rendering issues with memoization - Add TaskNotFound component and useTaskValidation hook - Disable StrictMode due to react-pdf incompatibility - Fix task detail and results page loading states 🤖 Generated with [Claude Code](https://claude.com/claude-code) Co-Authored-By: Claude <noreply@anthropic.com>	2025-11-26 10:56:22 +08:00
egg	7064ea30d5	fix: add original_filename field to DocumentMetadata Add optional original_filename field to DocumentMetadata dataclass to properly store the original filename when files are converted (e.g., Office → PDF). This ensures the field is included in to_dict() output for JSON serialization. 🤖 Generated with [Claude Code](https://claude.com/claude-code) Co-Authored-By: Claude <noreply@anthropic.com>	2025-11-20 12:26:41 +08:00
egg	2d50c128f7	feat: implement core dual-track processing infrastructure Added foundation for dual-track document processing: 1. UnifiedDocument Model (backend/app/models/unified_document.py) - Common output format for both OCR and direct extraction - Comprehensive element types (23+ types from PP-StructureV3) - BoundingBox, StyleInfo, TableData structures - Backward compatibility with legacy format 2. DocumentTypeDetector Service (backend/app/services/document_type_detector.py) - Intelligent document type detection using python-magic - PDF editability analysis using PyMuPDF - Processing track recommendation with confidence scores - Support for PDF, images, Office docs, and text files 3. DirectExtractionEngine Service (backend/app/services/direct_extraction_engine.py) - Fast extraction from editable PDFs using PyMuPDF - Preserves fonts, colors, and exact positioning - Native and positional table detection - Image extraction with coordinates - Hyperlink and metadata extraction 4. Dependencies - Added PyMuPDF>=1.23.0 for PDF extraction - Added pdfplumber>=0.10.0 as fallback - Added python-magic-bin>=0.4.14 for file detection Next: Integrate with OCR service for complete dual-track processing 🤖 Generated with [Claude Code](https://claude.com/claude-code) Co-Authored-By: Claude <noreply@anthropic.com>	2025-11-18 20:17:50 +08:00
egg	fd98018ddd	refactor: complete V1 to V2 migration and remove legacy architecture Remove all V1 architecture components and promote V2 to primary: - Delete all paddle_ocr_* table models (export, ocr, translation, user) - Delete legacy routers (auth, export, ocr, translation) - Delete legacy schemas and services - Promote user_v2.py to user.py as primary user model - Update all imports and dependencies to use V2 models only - Update main.py version to 2.0.0 Database changes: - Fix SQLAlchemy reserved word: rename audit_log.metadata to extra_data - Add migration to drop all paddle_ocr_* tables - Update alembic env to only import V2 models Frontend fixes: - Fix Select component exports in TaskHistoryPage.tsx - Update to use simplified Select API with options prop - Fix AxiosInstance TypeScript import syntax 🤖 Generated with [Claude Code](https://claude.com/claude-code) Co-Authored-By: Claude <noreply@anthropic.com>	2025-11-14 21:27:39 +08:00
egg	ad2b832fb6	feat: complete external auth V2 migration with advanced features This commit implements comprehensive external Azure AD authentication with complete task management, file download, and admin monitoring systems. ## Core Features Implemented (80% Complete) ### 1. Token Auto-Refresh Mechanism ✅ - Backend: POST /api/v2/auth/refresh endpoint - Frontend: Auto-refresh 5 minutes before expiration - Auto-retry on 401 errors with seamless token refresh ### 2. File Download System ✅ - Three format support: JSON / Markdown / PDF - Endpoints: GET /api/v2/tasks/{id}/download/{format} - File access control with ownership validation - Frontend download buttons in TaskHistoryPage ### 3. Complete Task Management ✅ Backend Endpoints: - POST /api/v2/tasks/{id}/start - Start task - POST /api/v2/tasks/{id}/cancel - Cancel task - POST /api/v2/tasks/{id}/retry - Retry failed task - GET /api/v2/tasks - List with filters (status, filename, date range) - GET /api/v2/tasks/stats - User statistics Frontend Features: - Status-based action buttons (Start/Cancel/Retry) - Advanced search and filtering (status, filename, date range) - Pagination and sorting - Task statistics dashboard (5 stat cards) ### 4. Admin Monitoring System ✅ (Backend) Admin APIs: - GET /api/v2/admin/stats - System statistics - GET /api/v2/admin/users - User list with stats - GET /api/v2/admin/users/top - User leaderboard - GET /api/v2/admin/audit-logs - Audit log query system - GET /api/v2/admin/audit-logs/user/{id}/summary Admin Features: - Email-based admin check (ymirliu@panjit.com.tw) - Comprehensive system metrics (users, tasks, sessions, activity) - Audit logging service for security tracking ### 5. User Isolation & Security ✅ - Row-level security on all task queries - File access control with ownership validation - Strict user_id filtering on all operations - Session validation and expiry checking - Admin privilege verification ## New Files Created Backend: - backend/app/models/user_v2.py - User model for external auth - backend/app/models/task.py - Task model with user isolation - backend/app/models/session.py - Session management - backend/app/models/audit_log.py - Audit log model - backend/app/services/external_auth_service.py - External API client - backend/app/services/task_service.py - Task CRUD with isolation - backend/app/services/file_access_service.py - File access control - backend/app/services/admin_service.py - Admin operations - backend/app/services/audit_service.py - Audit logging - backend/app/routers/auth_v2.py - V2 auth endpoints - backend/app/routers/tasks.py - Task management endpoints - backend/app/routers/admin.py - Admin endpoints - backend/alembic/versions/5e75a59fb763_*.py - DB migration Frontend: - frontend/src/services/apiV2.ts - Complete V2 API client - frontend/src/types/apiV2.ts - V2 type definitions - frontend/src/pages/TaskHistoryPage.tsx - Task history UI Modified Files: - backend/app/core/deps.py - Added get_current_admin_user_v2 - backend/app/main.py - Registered admin router - frontend/src/pages/LoginPage.tsx - V2 login integration - frontend/src/components/Layout.tsx - User display and logout - frontend/src/App.tsx - Added /tasks route ## Documentation - openspec/changes/.../PROGRESS_UPDATE.md - Detailed progress report ## Pending Items (20%) 1. Database migration execution for audit_logs table 2. Frontend admin dashboard page 3. Frontend audit log viewer ## Testing Status - Manual testing: ✅ Authentication flow verified - Unit tests: ⏳ Pending - Integration tests: ⏳ Pending ## Security Enhancements - ✅ User isolation (row-level security) - ✅ File access control - ✅ Token expiry validation - ✅ Admin privilege verification - ✅ Audit logging infrastructure - ⏳ Token encryption (noted, low priority) - ⏳ Rate limiting (noted, low priority) 🤖 Generated with [Claude Code](https://claude.com/claude-code) Co-Authored-By: Claude <noreply@anthropic.com>	2025-11-14 17:19:43 +08:00
beabigegg	da700721fa	first	2025-11-12 22:53:17 +08:00

7 Commits