Database stores times in UTC but serialized without timezone info,
causing frontend to misinterpret as local time. Now all datetime
fields include 'Z' suffix to indicate UTC, enabling proper timezone
conversion in the browser.
- Add UTCDatetimeBaseModel base class for Pydantic schemas
- Update model to_dict() methods to append 'Z' suffix
- Affects: tasks, users, sessions, audit logs, translations
🤖 Generated with [Claude Code](https://claude.com/claude-code)
Co-Authored-By: Claude Opus 4.5 <noreply@anthropic.com>
- Add soft delete (deleted_at column) to preserve task records for statistics
- Implement cleanup service to delete old files while keeping DB records
- Add automatic cleanup scheduler (configurable interval, default 24h)
- Add admin endpoints: storage stats, cleanup trigger, scheduler status
- Update task service with admin views (include deleted/files_deleted)
- Add frontend storage management UI in admin dashboard
- Add i18n translations for storage management
🤖 Generated with [Claude Code](https://claude.com/claude-code)
Co-Authored-By: Claude Opus 4.5 <noreply@anthropic.com>
- Add TranslationLog model to track translation API usage per task
- Integrate Dify API actual price (total_price) into translation stats
- Display translation statistics in admin dashboard with per-task costs
- Remove unused Export and Settings pages to simplify frontend
- Add GET /api/v2/admin/translation-stats endpoint
🤖 Generated with [Claude Code](https://claude.com/claude-code)
Co-Authored-By: Claude Opus 4.5 <noreply@anthropic.com>
- Force Office documents (PPTX, DOCX, XLSX) to use Direct track after
LibreOffice conversion, since converted PDFs always have extractable text
- Fix PDF generator to not exclude text in image regions for Direct track,
allowing text to render on top of background images (critical for PPT)
- Increase file_type column from VARCHAR(50) to VARCHAR(100) to support
long MIME types like PPTX
- Remove reference to non-existent total_images metadata attribute
This significantly improves processing time for Office documents
(from ~170s OCR to ~10s Direct) while preserving text quality.
🤖 Generated with [Claude Code](https://claude.com/claude-code)
Co-Authored-By: Claude <noreply@anthropic.com>
Add TableData.from_dict() and TableCell.from_dict() methods to convert
JSON table dicts to proper TableData objects during UnifiedDocument parsing.
Modified _json_to_document_element() to detect TABLE elements with dict
content containing 'cells' key and convert to TableData.
Note: This fix ensures table elements have proper to_html() method available
but the rendered output still needs investigation - tables may still render
incorrectly in OCR track PDFs.
Files changed:
- unified_document.py: Add from_dict() class methods
- pdf_generator_service.py: Convert table dicts during JSON parsing
- Add fix-ocr-track-table-rendering proposal
🤖 Generated with [Claude Code](https://claude.com/claude-code)
Co-Authored-By: Claude <noreply@anthropic.com>
Add optional original_filename field to DocumentMetadata dataclass
to properly store the original filename when files are converted
(e.g., Office → PDF). This ensures the field is included in to_dict()
output for JSON serialization.
🤖 Generated with [Claude Code](https://claude.com/claude-code)
Co-Authored-By: Claude <noreply@anthropic.com>
Added foundation for dual-track document processing:
1. UnifiedDocument Model (backend/app/models/unified_document.py)
- Common output format for both OCR and direct extraction
- Comprehensive element types (23+ types from PP-StructureV3)
- BoundingBox, StyleInfo, TableData structures
- Backward compatibility with legacy format
2. DocumentTypeDetector Service (backend/app/services/document_type_detector.py)
- Intelligent document type detection using python-magic
- PDF editability analysis using PyMuPDF
- Processing track recommendation with confidence scores
- Support for PDF, images, Office docs, and text files
3. DirectExtractionEngine Service (backend/app/services/direct_extraction_engine.py)
- Fast extraction from editable PDFs using PyMuPDF
- Preserves fonts, colors, and exact positioning
- Native and positional table detection
- Image extraction with coordinates
- Hyperlink and metadata extraction
4. Dependencies
- Added PyMuPDF>=1.23.0 for PDF extraction
- Added pdfplumber>=0.10.0 as fallback
- Added python-magic-bin>=0.4.14 for file detection
Next: Integrate with OCR service for complete dual-track processing
🤖 Generated with [Claude Code](https://claude.com/claude-code)
Co-Authored-By: Claude <noreply@anthropic.com>
Remove all V1 architecture components and promote V2 to primary:
- Delete all paddle_ocr_* table models (export, ocr, translation, user)
- Delete legacy routers (auth, export, ocr, translation)
- Delete legacy schemas and services
- Promote user_v2.py to user.py as primary user model
- Update all imports and dependencies to use V2 models only
- Update main.py version to 2.0.0
Database changes:
- Fix SQLAlchemy reserved word: rename audit_log.metadata to extra_data
- Add migration to drop all paddle_ocr_* tables
- Update alembic env to only import V2 models
Frontend fixes:
- Fix Select component exports in TaskHistoryPage.tsx
- Update to use simplified Select API with options prop
- Fix AxiosInstance TypeScript import syntax
🤖 Generated with [Claude Code](https://claude.com/claude-code)
Co-Authored-By: Claude <noreply@anthropic.com>