egg/OCR - OCR

egg/OCR

Author	SHA1	Message	Date
egg	ee49751c38	fix: add UTC timezone indicator to all datetime serialization Database stores times in UTC but serialized without timezone info, causing frontend to misinterpret as local time. Now all datetime fields include 'Z' suffix to indicate UTC, enabling proper timezone conversion in the browser. - Add UTCDatetimeBaseModel base class for Pydantic schemas - Update model to_dict() methods to append 'Z' suffix - Affects: tasks, users, sessions, audit logs, translations 🤖 Generated with [Claude Code](https://claude.com/claude-code) Co-Authored-By: Claude Opus 4.5 <noreply@anthropic.com>	2025-12-14 15:48:17 +08:00
egg	73112db055	feat: add storage cleanup mechanism with soft delete and auto scheduler - Add soft delete (deleted_at column) to preserve task records for statistics - Implement cleanup service to delete old files while keeping DB records - Add automatic cleanup scheduler (configurable interval, default 24h) - Add admin endpoints: storage stats, cleanup trigger, scheduler status - Update task service with admin views (include deleted/files_deleted) - Add frontend storage management UI in admin dashboard - Add i18n translations for storage management 🤖 Generated with [Claude Code](https://claude.com/claude-code) Co-Authored-By: Claude Opus 4.5 <noreply@anthropic.com>	2025-12-14 12:41:01 +08:00
egg	65abd51d60	feat: add translation billing stats and remove Export/Settings pages - Add TranslationLog model to track translation API usage per task - Integrate Dify API actual price (total_price) into translation stats - Display translation statistics in admin dashboard with per-task costs - Remove unused Export and Settings pages to simplify frontend - Add GET /api/v2/admin/translation-stats endpoint 🤖 Generated with [Claude Code](https://claude.com/claude-code) Co-Authored-By: Claude Opus 4.5 <noreply@anthropic.com>	2025-12-12 17:38:12 +08:00
egg	87dc97d951	fix: improve Office document processing with Direct track - Force Office documents (PPTX, DOCX, XLSX) to use Direct track after LibreOffice conversion, since converted PDFs always have extractable text - Fix PDF generator to not exclude text in image regions for Direct track, allowing text to render on top of background images (critical for PPT) - Increase file_type column from VARCHAR(50) to VARCHAR(100) to support long MIME types like PPTX - Remove reference to non-existent total_images metadata attribute This significantly improves processing time for Office documents (from ~170s OCR to ~10s Direct) while preserving text quality. 🤖 Generated with [Claude Code](https://claude.com/claude-code) Co-Authored-By: Claude <noreply@anthropic.com>	2025-11-30 16:22:04 +08:00
egg	c65df754cf	wip: add TableData.from_dict() for OCR track table parsing (incomplete) Add TableData.from_dict() and TableCell.from_dict() methods to convert JSON table dicts to proper TableData objects during UnifiedDocument parsing. Modified _json_to_document_element() to detect TABLE elements with dict content containing 'cells' key and convert to TableData. Note: This fix ensures table elements have proper to_html() method available but the rendered output still needs investigation - tables may still render incorrectly in OCR track PDFs. Files changed: - unified_document.py: Add from_dict() class methods - pdf_generator_service.py: Convert table dicts during JSON parsing - Add fix-ocr-track-table-rendering proposal 🤖 Generated with [Claude Code](https://claude.com/claude-code) Co-Authored-By: Claude <noreply@anthropic.com>	2025-11-26 19:16:51 +08:00
egg	1afdb822c3	feat: implement hybrid image extraction and memory management Backend: - Add hybrid image extraction for Direct track (inline image blocks) - Add render_inline_image_regions() fallback when OCR doesn't find images - Add check_document_for_missing_images() for detecting missing images - Add memory management system (MemoryGuard, ModelManager, ServicePool) - Update pdf_generator_service to handle HYBRID processing track - Add ElementType.LOGO for logo extraction Frontend: - Fix PDF viewer re-rendering issues with memoization - Add TaskNotFound component and useTaskValidation hook - Disable StrictMode due to react-pdf incompatibility - Fix task detail and results page loading states 🤖 Generated with [Claude Code](https://claude.com/claude-code) Co-Authored-By: Claude <noreply@anthropic.com>	2025-11-26 10:56:22 +08:00
egg	7064ea30d5	fix: add original_filename field to DocumentMetadata Add optional original_filename field to DocumentMetadata dataclass to properly store the original filename when files are converted (e.g., Office → PDF). This ensures the field is included in to_dict() output for JSON serialization. 🤖 Generated with [Claude Code](https://claude.com/claude-code) Co-Authored-By: Claude <noreply@anthropic.com>	2025-11-20 12:26:41 +08:00
egg	2d50c128f7	feat: implement core dual-track processing infrastructure Added foundation for dual-track document processing: 1. UnifiedDocument Model (backend/app/models/unified_document.py) - Common output format for both OCR and direct extraction - Comprehensive element types (23+ types from PP-StructureV3) - BoundingBox, StyleInfo, TableData structures - Backward compatibility with legacy format 2. DocumentTypeDetector Service (backend/app/services/document_type_detector.py) - Intelligent document type detection using python-magic - PDF editability analysis using PyMuPDF - Processing track recommendation with confidence scores - Support for PDF, images, Office docs, and text files 3. DirectExtractionEngine Service (backend/app/services/direct_extraction_engine.py) - Fast extraction from editable PDFs using PyMuPDF - Preserves fonts, colors, and exact positioning - Native and positional table detection - Image extraction with coordinates - Hyperlink and metadata extraction 4. Dependencies - Added PyMuPDF>=1.23.0 for PDF extraction - Added pdfplumber>=0.10.0 as fallback - Added python-magic-bin>=0.4.14 for file detection Next: Integrate with OCR service for complete dual-track processing 🤖 Generated with [Claude Code](https://claude.com/claude-code) Co-Authored-By: Claude <noreply@anthropic.com>	2025-11-18 20:17:50 +08:00
egg	fd98018ddd	refactor: complete V1 to V2 migration and remove legacy architecture Remove all V1 architecture components and promote V2 to primary: - Delete all paddle_ocr_* table models (export, ocr, translation, user) - Delete legacy routers (auth, export, ocr, translation) - Delete legacy schemas and services - Promote user_v2.py to user.py as primary user model - Update all imports and dependencies to use V2 models only - Update main.py version to 2.0.0 Database changes: - Fix SQLAlchemy reserved word: rename audit_log.metadata to extra_data - Add migration to drop all paddle_ocr_* tables - Update alembic env to only import V2 models Frontend fixes: - Fix Select component exports in TaskHistoryPage.tsx - Update to use simplified Select API with options prop - Fix AxiosInstance TypeScript import syntax 🤖 Generated with [Claude Code](https://claude.com/claude-code) Co-Authored-By: Claude <noreply@anthropic.com>	2025-11-14 21:27:39 +08:00
egg	ad2b832fb6	feat: complete external auth V2 migration with advanced features This commit implements comprehensive external Azure AD authentication with complete task management, file download, and admin monitoring systems. ## Core Features Implemented (80% Complete) ### 1. Token Auto-Refresh Mechanism ✅ - Backend: POST /api/v2/auth/refresh endpoint - Frontend: Auto-refresh 5 minutes before expiration - Auto-retry on 401 errors with seamless token refresh ### 2. File Download System ✅ - Three format support: JSON / Markdown / PDF - Endpoints: GET /api/v2/tasks/{id}/download/{format} - File access control with ownership validation - Frontend download buttons in TaskHistoryPage ### 3. Complete Task Management ✅ Backend Endpoints: - POST /api/v2/tasks/{id}/start - Start task - POST /api/v2/tasks/{id}/cancel - Cancel task - POST /api/v2/tasks/{id}/retry - Retry failed task - GET /api/v2/tasks - List with filters (status, filename, date range) - GET /api/v2/tasks/stats - User statistics Frontend Features: - Status-based action buttons (Start/Cancel/Retry) - Advanced search and filtering (status, filename, date range) - Pagination and sorting - Task statistics dashboard (5 stat cards) ### 4. Admin Monitoring System ✅ (Backend) Admin APIs: - GET /api/v2/admin/stats - System statistics - GET /api/v2/admin/users - User list with stats - GET /api/v2/admin/users/top - User leaderboard - GET /api/v2/admin/audit-logs - Audit log query system - GET /api/v2/admin/audit-logs/user/{id}/summary Admin Features: - Email-based admin check (ymirliu@panjit.com.tw) - Comprehensive system metrics (users, tasks, sessions, activity) - Audit logging service for security tracking ### 5. User Isolation & Security ✅ - Row-level security on all task queries - File access control with ownership validation - Strict user_id filtering on all operations - Session validation and expiry checking - Admin privilege verification ## New Files Created Backend: - backend/app/models/user_v2.py - User model for external auth - backend/app/models/task.py - Task model with user isolation - backend/app/models/session.py - Session management - backend/app/models/audit_log.py - Audit log model - backend/app/services/external_auth_service.py - External API client - backend/app/services/task_service.py - Task CRUD with isolation - backend/app/services/file_access_service.py - File access control - backend/app/services/admin_service.py - Admin operations - backend/app/services/audit_service.py - Audit logging - backend/app/routers/auth_v2.py - V2 auth endpoints - backend/app/routers/tasks.py - Task management endpoints - backend/app/routers/admin.py - Admin endpoints - backend/alembic/versions/5e75a59fb763_*.py - DB migration Frontend: - frontend/src/services/apiV2.ts - Complete V2 API client - frontend/src/types/apiV2.ts - V2 type definitions - frontend/src/pages/TaskHistoryPage.tsx - Task history UI Modified Files: - backend/app/core/deps.py - Added get_current_admin_user_v2 - backend/app/main.py - Registered admin router - frontend/src/pages/LoginPage.tsx - V2 login integration - frontend/src/components/Layout.tsx - User display and logout - frontend/src/App.tsx - Added /tasks route ## Documentation - openspec/changes/.../PROGRESS_UPDATE.md - Detailed progress report ## Pending Items (20%) 1. Database migration execution for audit_logs table 2. Frontend admin dashboard page 3. Frontend audit log viewer ## Testing Status - Manual testing: ✅ Authentication flow verified - Unit tests: ⏳ Pending - Integration tests: ⏳ Pending ## Security Enhancements - ✅ User isolation (row-level security) - ✅ File access control - ✅ Token expiry validation - ✅ Admin privilege verification - ✅ Audit logging infrastructure - ⏳ Token encryption (noted, low priority) - ⏳ Rate limiting (noted, low priority) 🤖 Generated with [Claude Code](https://claude.com/claude-code) Co-Authored-By: Claude <noreply@anthropic.com>	2025-11-14 17:19:43 +08:00
beabigegg	da700721fa	first	2025-11-12 22:53:17 +08:00

11 Commits