chore: project cleanup and prepare for dual-track processing refactor

- Removed all test files and directories
- Deleted outdated documentation (will be rewritten)
- Cleaned up temporary files, logs, and uploads
- Archived 5 completed OpenSpec proposals
- Created new dual-track-document-processing proposal with complete OpenSpec structure
  - Dual-track architecture: OCR track (PaddleOCR) + Direct track (PyMuPDF)
  - UnifiedDocument model for consistent output
  - Support for structure-preserving translation
- Updated .gitignore to prevent future test/temp files

This is a major cleanup preparing for the complete refactoring of the document processing pipeline.

🤖 Generated with [Claude Code](https://claude.com/claude-code)

Co-Authored-By: Claude <noreply@anthropic.com>
Author: egg
Date: 2025-11-18 20:02:31 +08:00
Commit: cd3cbea49d (parent: 0edc56b03f)
64 changed files with 3573 additions and 8190 deletions


@@ -0,0 +1,84 @@
# Change: Add GPU Acceleration Support for OCR Processing
## Why
PaddleOCR supports CUDA GPU acceleration, which can significantly improve OCR processing speed for batch operations. Currently, the system always uses CPU processing, which is slower and less efficient for large document batches. By adding GPU detection and automatic CUDA support, the system will:
- Automatically utilize available GPU hardware when present
- Fall back gracefully to CPU processing when GPU is unavailable
- Reduce processing time for large batches by leveraging parallel GPU computation
- Improve overall system throughput and user experience
## What Changes
- Add GPU detection logic to environment setup script (`setup_dev_env.sh`)
- Automatically install CUDA-enabled PaddlePaddle when compatible GPU is detected
- Install CPU-only PaddlePaddle when no compatible GPU is found
- Add GPU availability detection in OCR processing code (see the detection sketch after this list)
- Automatically enable GPU acceleration in PaddleOCR when GPU is available
- Add configuration option to force CPU mode (for testing or troubleshooting)
- Add GPU status reporting in API health check endpoint
- Update documentation with GPU requirements and setup instructions
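A minimal sketch of the detection-and-fallback logic above (assuming a recent Paddle release; the helper name and the `FORCE_CPU_MODE` handling are illustrative, not the final implementation):

```python
import os

import paddle


def select_device() -> str:
    """Pick "gpu:0" when CUDA is usable and not overridden, else "cpu"."""
    if os.getenv("FORCE_CPU_MODE", "false").lower() == "true":
        return "cpu"  # explicit override for testing or troubleshooting
    if paddle.device.is_compiled_with_cuda() and paddle.device.cuda.device_count() > 0:
        return "gpu:0"
    return "cpu"


# PaddleOCR 3.x reads the global device instead of a use_gpu flag
paddle.set_device(select_device())
```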
## Impact
- **Affected capabilities**:
- `ocr-processing`: Add GPU acceleration support with automatic detection
- `environment-setup`: Add GPU detection and CUDA installation logic
- **Affected code**:
- `setup_dev_env.sh`: GPU detection and conditional CUDA package installation
- `backend/app/services/ocr_service.py`: GPU availability detection and configuration
- `backend/app/api/v1/endpoints/health.py`: GPU status reporting
- `backend/app/core/config.py`: GPU configuration settings
- `.env.local`: GPU-related environment variables
- **Dependencies**:
- When GPU available: `paddlepaddle-gpu` (with matching CUDA version)
- When GPU unavailable: `paddlepaddle` (CPU-only, current default)
- Detection tools: `nvidia-smi` (NVIDIA GPUs), `lspci` (hardware detection)
- **Configuration** (see the settings sketch after this section):
- New env var: `FORCE_CPU_MODE` (default: false) - Override GPU detection
- New env var: `CUDA_VERSION` (auto-detected or manual override)
- GPU memory allocation settings for PaddleOCR
- Batch size adjustment based on GPU memory availability
- **Performance Impact**:
- Expected 3-10x speedup for OCR processing on GPU-enabled systems
- No performance degradation on CPU-only systems (same as current behavior)
- Automatic memory management to prevent GPU OOM errors
- **Backward Compatibility**:
- Fully backward compatible - existing CPU-only installations continue to work
- No breaking changes to API or configuration
- Existing installations can opt-in by re-running setup script on GPU-enabled hardware
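For illustration, the configuration surface above could map onto a settings class like this (a sketch assuming pydantic v2 with `pydantic-settings`; defaults are assumptions):

```python
from typing import Optional

from pydantic_settings import BaseSettings


class GPUSettings(BaseSettings):
    FORCE_CPU_MODE: bool = False        # override GPU detection
    CUDA_VERSION: Optional[str] = None  # auto-detected unless set manually
    GPU_DEVICE_ID: int = 0              # which CUDA device to use
    GPU_MEMORY_FRACTION: float = 0.8    # cap PaddleOCR's share of GPU memory


settings = GPUSettings()  # reads values from environment variables
```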
## Known Issues and Limitations
### ~~Chart Recognition Feature Disabled~~ ✅ **RESOLVED** (2025-11-16)
**Previous Issue**: Chart recognition feature in PP-StructureV3 was disabled due to API incompatibility with PaddlePaddle 3.0.0.
**Resolution**:
- **Fixed in**: PaddlePaddle 3.2.1 (released 2025-10-30)
- **Current Status**: ✅ Chart recognition **FULLY ENABLED**
- **API Status**: `paddle.incubate.nn.functional.fused_rms_norm_ext` now available
- **Documentation**: See [CHART_RECOGNITION.md](../../../CHART_RECOGNITION.md) for details
**Root Cause** (Historical):
- PaddleOCR-VL chart recognition model requires `paddle.incubate.nn.functional.fused_rms_norm_ext` API
- PaddlePaddle 3.0.0 stable only provided `fused_rms_norm` (base version)
- The extended version `fused_rms_norm_ext` was not available in 3.0.0
**Current Capabilities** (✅ All Enabled):
- ✅ Layout analysis detects and extracts chart/figure regions as images
- ✅ Tables, formulas, and text recognition function normally
- ✅ **Deep chart understanding** (chart type detection, data extraction, axis/legend parsing)
- ✅ **Converting chart content to structured data** (JSON, tables)
**Actions Taken**:
- Upgraded system to PaddlePaddle 3.2.1+
- Enabled chart recognition in PP-StructureV3 initialization
- Configured WSL CUDA library paths for GPU support
- Updated all documentation to reflect enabled status
**Code Location**: [backend/app/services/ocr_service.py:217](../../backend/app/services/ocr_service.py#L217)
**Status**: ✅ **RESOLVED** - Chart recognition fully operational


@@ -0,0 +1,77 @@
# Environment Setup Specification
## ADDED Requirements
### Requirement: GPU Detection and CUDA Installation
The system SHALL automatically detect compatible GPU hardware during environment setup and install appropriate PaddlePaddle packages (GPU-enabled or CPU-only) based on hardware availability.
#### Scenario: GPU detected with CUDA support
- **WHEN** setup script runs on system with NVIDIA GPU and CUDA drivers
- **THEN** the script detects GPU using `nvidia-smi` command
- **AND** determines CUDA version from driver
- **AND** installs `paddlepaddle-gpu` with matching CUDA version
- **AND** verifies GPU availability through Python
- **AND** displays GPU information (device name, CUDA version, memory)
#### Scenario: No GPU detected
- **WHEN** setup script runs on system without compatible GPU
- **THEN** the script detects absence of GPU hardware
- **AND** installs CPU-only `paddlepaddle` package
- **AND** displays message that CPU mode will be used
- **AND** continues setup without errors
#### Scenario: GPU detected but no CUDA drivers
- **WHEN** setup script detects NVIDIA GPU but CUDA drivers are missing
- **THEN** the script displays warning about missing drivers
- **AND** provides installation instructions for CUDA drivers
- **AND** falls back to CPU-only installation
- **AND** suggests re-running setup after driver installation
#### Scenario: CUDA version mismatch
- **WHEN** detected CUDA version is not compatible with available PaddlePaddle packages
- **THEN** the script displays available CUDA versions
- **AND** installs closest compatible PaddlePaddle GPU package
- **AND** warns user about potential compatibility issues
- **AND** provides instructions to upgrade/downgrade CUDA if needed
#### Scenario: Manual CUDA version override
- **WHEN** user sets CUDA_VERSION environment variable before running setup
- **THEN** the script uses specified CUDA version instead of auto-detection
- **AND** installs corresponding PaddlePaddle GPU package
- **AND** skips automatic CUDA detection
- **AND** displays warning if specified version differs from detected version
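The detection scenarios above could be driven by logic along these lines (a Python sketch rather than the shell script itself; parsing `nvidia-smi`'s banner for the CUDA version is an assumption about its default output format):

```python
import re
import shutil
import subprocess


def detect_gpu() -> dict:
    """Return {"available": bool, "name": str | None, "cuda": str | None}."""
    if shutil.which("nvidia-smi") is None:
        # covers both "no GPU" and "GPU present but drivers missing"
        return {"available": False, "name": None, "cuda": None}
    try:
        banner = subprocess.run(["nvidia-smi"], capture_output=True,
                                text=True, check=True).stdout
    except subprocess.CalledProcessError:
        return {"available": False, "name": None, "cuda": None}
    cuda = re.search(r"CUDA Version:\s*([\d.]+)", banner)
    name = subprocess.run(
        ["nvidia-smi", "--query-gpu=name", "--format=csv,noheader"],
        capture_output=True, text=True,
    ).stdout.strip() or None
    return {"available": True, "name": name,
            "cuda": cuda.group(1) if cuda else None}
```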
### Requirement: GPU Verification
The system SHALL verify GPU functionality after installation and provide clear status reporting.
#### Scenario: Successful GPU setup verification
- **WHEN** PaddlePaddle GPU installation completes
- **THEN** the script runs GPU availability test using Python
- **AND** confirms CUDA devices are accessible
- **AND** displays GPU count, device names, and memory capacity
- **AND** marks GPU setup as successful
#### Scenario: GPU verification fails
- **WHEN** GPU verification test fails after installation
- **THEN** the script displays detailed error message
- **AND** provides troubleshooting steps
- **AND** suggests fallback to CPU mode
- **AND** does not fail entire setup process
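A hedged sketch of that verification step (the `paddle.device.cuda` calls are real Paddle APIs; the reporting and the never-fail contract mirror the scenarios above):

```python
import paddle


def verify_gpu_setup() -> bool:
    """Report GPU status after install; never abort the overall setup."""
    try:
        if not (paddle.device.is_compiled_with_cuda()
                and paddle.device.cuda.device_count() > 0):
            print("No usable CUDA device detected; CPU mode will be used")
            return False
        props = paddle.device.cuda.get_device_properties(0)
        total_mb = props.total_memory // (1024 ** 2)
        print(f"GPU OK: {props.name}, {total_mb} MB total memory")
        return True
    except Exception as exc:
        print(f"GPU verification failed: {exc}; falling back to CPU mode")
        return False
```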
### Requirement: Environment Configuration for GPU
The system SHALL create appropriate configuration settings for GPU usage in environment files.
#### Scenario: GPU-enabled configuration
- **WHEN** GPU is successfully detected and verified
- **THEN** the setup script adds GPU settings to `.env.local`
- **AND** sets `FORCE_CPU_MODE=false`
- **AND** sets detected `CUDA_VERSION`
- **AND** sets recommended `GPU_MEMORY_FRACTION` (e.g., 0.8)
- **AND** adds GPU-related comments and documentation
#### Scenario: CPU-only configuration
- **WHEN** no GPU is detected or verification fails
- **THEN** the setup script creates CPU-only configuration
- **AND** sets `FORCE_CPU_MODE=true`
- **AND** omits or comments out GPU-specific settings
- **AND** adds note about GPU requirements


@@ -0,0 +1,89 @@
# OCR Processing Specification
## ADDED Requirements
### Requirement: GPU Acceleration
The system SHALL automatically detect and utilize GPU hardware for OCR processing when available, with graceful fallback to CPU mode when GPU is unavailable or disabled.
#### Scenario: GPU available and enabled
- **WHEN** PaddleOCR service initializes on system with compatible GPU
- **THEN** the system detects GPU availability using CUDA runtime
- **AND** initializes PaddleOCR in GPU mode (via the global `paddle.set_device()`, since PaddleOCR 3.x no longer accepts a `use_gpu` parameter)
- **AND** sets appropriate GPU memory fraction to prevent OOM errors
- **AND** logs GPU device information (name, memory, CUDA version)
- **AND** processes OCR tasks using GPU acceleration
#### Scenario: CPU fallback when GPU unavailable
- **WHEN** PaddleOCR service initializes on system without GPU
- **THEN** the system detects absence of GPU
- **AND** initializes PaddleOCR in CPU mode (global `paddle.set_device("cpu")`)
- **AND** logs CPU mode status
- **AND** processes OCR tasks using CPU without errors
#### Scenario: Force CPU mode override
- **WHEN** FORCE_CPU_MODE environment variable is set to true
- **THEN** the system ignores GPU availability
- **AND** initializes PaddleOCR in CPU mode
- **AND** logs that CPU mode is forced by configuration
- **AND** processes OCR tasks using CPU
#### Scenario: GPU out-of-memory error handling
- **WHEN** GPU runs out of memory during OCR processing
- **THEN** the system catches CUDA OOM exception
- **AND** logs error with GPU memory information
- **AND** attempts to process the task using CPU mode
- **AND** continues batch processing without failure
- **AND** records GPU failure in task metadata
#### Scenario: Multiple GPU devices available
- **WHEN** system has multiple CUDA devices
- **THEN** the system detects all available GPUs
- **AND** uses primary GPU (device 0) by default
- **AND** allows GPU device selection via configuration
- **AND** logs selected GPU device information
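The OOM scenario above might be handled roughly as follows (a sketch: the exact exception type Paddle raises for CUDA OOM varies by version, so matching on the message is an assumption, as is the dict-shaped result):

```python
import paddle


def ocr_with_cpu_fallback(run_ocr, image_path: str) -> dict:
    """Run the task on the current device; on CUDA OOM, retry once on CPU."""
    try:
        return run_ocr(image_path)
    except Exception as exc:
        if "out of memory" not in str(exc).lower():
            raise  # not an OOM: let the batch-level retry logic handle it
        paddle.device.cuda.empty_cache()   # release cached GPU blocks
        paddle.set_device("cpu")           # fall back for this task
        result = run_ocr(image_path)
        result["gpu_oom_fallback"] = True  # record the failure in task metadata
        return result
```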
### Requirement: GPU Performance Optimization
The system SHALL optimize GPU memory usage and batch processing for efficient OCR performance.
#### Scenario: Automatic batch size adjustment
- **WHEN** GPU mode is enabled
- **THEN** the system queries available GPU memory
- **AND** calculates optimal batch size based on memory capacity
- **AND** adjusts concurrent processing threads accordingly
- **AND** monitors memory usage during processing
- **AND** prevents memory allocation beyond safe threshold
#### Scenario: GPU memory management
- **WHEN** GPU memory fraction is configured
- **THEN** the system allocates specified fraction of total GPU memory
- **AND** reserves memory for PaddleOCR model
- **AND** prevents other processes from causing OOM
- **AND** releases memory after batch completion
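A rough illustration of the batch-size heuristic (the 300 MB-per-item estimate and the memory fraction are assumptions, not measured values):

```python
import paddle


def pick_batch_size(per_item_mb: int = 300, memory_fraction: float = 0.8) -> int:
    """Derive a batch size from free GPU memory; 1 on CPU-only systems."""
    if not (paddle.device.is_compiled_with_cuda()
            and paddle.device.cuda.device_count() > 0):
        return 1
    total = paddle.device.cuda.get_device_properties(0).total_memory
    reserved = paddle.device.cuda.memory_reserved(0)
    budget_mb = (total * memory_fraction - reserved) / (1024 ** 2)
    return max(1, int(budget_mb // per_item_mb))
```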
### Requirement: GPU Status Reporting
The system SHALL provide GPU status information through health check API and logging.
#### Scenario: Health check with GPU available
- **WHEN** client requests `/health` endpoint on GPU-enabled system
- **THEN** the system returns health status including:
- `gpu_available`: true
- `gpu_device_name`: detected GPU name
- `cuda_version`: CUDA runtime version
- `gpu_memory_total`: total GPU memory in MB
- `gpu_memory_used`: currently used GPU memory in MB
- `gpu_utilization`: current GPU utilization percentage
#### Scenario: Health check without GPU
- **WHEN** client requests `/health` endpoint on CPU-only system
- **THEN** the system returns health status including:
- `gpu_available`: false
- `processing_mode`: "CPU"
- `reason`: explanation for CPU mode (e.g., "No GPU detected", "CPU mode forced")
#### Scenario: Startup GPU status logging
- **WHEN** OCR service starts
- **THEN** the system logs GPU detection results
- **AND** logs selected processing mode (GPU/CPU)
- **AND** logs GPU device details if available
- **AND** logs any GPU-related warnings or errors
- **AND** continues startup successfully regardless of GPU status
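Put together, the health payload above could be served like this (a sketch; `gather_gpu_status` is a hypothetical helper, and `cuda_version`/`gpu_utilization` would additionally need NVML or `nvidia-smi`):

```python
import paddle
from fastapi import FastAPI

app = FastAPI()


def gather_gpu_status() -> dict:
    if paddle.device.is_compiled_with_cuda() and paddle.device.cuda.device_count() > 0:
        props = paddle.device.cuda.get_device_properties(0)
        return {
            "gpu_available": True,
            "gpu_device_name": props.name,
            "gpu_memory_total": props.total_memory // (1024 ** 2),
            # memory allocated by this process, as a proxy for usage
            "gpu_memory_used": paddle.device.cuda.memory_allocated(0) // (1024 ** 2),
        }
    return {"gpu_available": False, "processing_mode": "CPU",
            "reason": "No GPU detected"}


@app.get("/health")
def health() -> dict:
    return {"status": "healthy", **gather_gpu_status()}
```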


@@ -0,0 +1,103 @@
# Implementation Tasks
## 1. Environment Setup Enhancement
- [x] 1.1 Add GPU detection function in `setup_dev_env.sh`
- Detect NVIDIA GPU using `nvidia-smi` or `lspci`
- Detect CUDA version if GPU is available
- Output GPU detection results to user
- [x] 1.2 Add conditional CUDA package installation
- Install `paddlepaddle-gpu` with matching CUDA version when GPU detected
- Install `paddlepaddle` (CPU-only) when no GPU detected
- Handle different CUDA versions (11.x, 12.x, 13.x)
- [x] 1.3 Add GPU verification step after installation
- Test PaddlePaddle GPU availability
- Report GPU status and CUDA version to user
- Provide fallback instructions if GPU setup fails
## 2. Configuration Updates
- [x] 2.1 Add GPU configuration to `.env.local`
- Add `FORCE_CPU_MODE` option (default: false)
- Add `GPU_DEVICE_ID` for device selection
- Add `GPU_MEMORY_FRACTION` for memory allocation control
- [x] 2.2 Update backend configuration
- Add GPU settings to `backend/app/core/config.py`
- Load GPU-related environment variables
- Add validation for GPU configuration values
## 3. OCR Service GPU Integration
- [x] 3.1 Add GPU detection in OCR service initialization
- Create GPU availability check function
- Detect available GPU devices
- Log GPU status (available/unavailable, device name, memory)
- [x] 3.2 Implement automatic GPU/CPU mode selection
- Enable GPU mode in PaddleOCR when GPU is available
- Fall back to CPU mode when GPU is unavailable or forced
- Use global device setting via `paddle.set_device()` for PaddleOCR 3.x (see the sketch after this section)
- [x] 3.3 Add GPU memory management
- Set GPU memory fraction to prevent OOM errors
- Detect GPU memory and compute capability
- Handle GPU memory allocation failures gracefully
- [x] 3.4 Update `backend/app/services/ocr_service.py`
- Modify PaddleOCR initialization for PaddleOCR 3.x API
- Add GPU status logging
- Add error handling for GPU-related issues
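The sketch referenced in task 3.2, showing the 3.x-style initialization (constructor arguments are version-dependent and kept minimal here):

```python
import paddle
from paddleocr import PPStructureV3

# Select the device globally; 3.x removed the use_gpu/show_log kwargs
paddle.set_device("gpu:0" if paddle.device.is_compiled_with_cuda() else "cpu")
structure_engine = PPStructureV3()
```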
## 4. Health Check and Monitoring
- [x] 4.1 Add GPU status to health check endpoint
- Report GPU availability (true/false)
- Report GPU device name and compute capability
- Report CUDA version
- Report current GPU memory usage
- [x] 4.2 Update `backend/app/main.py`
- Add GPU status fields to health check response
- Handle cases where GPU detection fails
## 5. Documentation Updates
- [x] 5.1 Update README.md
- Add GPU requirements section
- Document GPU detection and setup process
- Add troubleshooting for GPU issues
- [ ] 5.2 Update openspec/project.md
- Add GPU hardware recommendations
- Document CUDA version compatibility
- Add GPU-specific configuration options
- [ ] 5.3 Create GPU setup guide
- Document NVIDIA driver installation for WSL
- Document CUDA toolkit installation
- Provide GPU verification steps
- [x] 5.4 Document known limitations
- ~~Chart recognition feature disabled (PaddlePaddle 3.0.0 API limitation)~~ **RESOLVED**
- ~~Document `fused_rms_norm_ext` API incompatibility~~ **RESOLVED in PaddlePaddle 3.2.1+**
- Updated README to reflect chart recognition is now enabled
- Created CHART_RECOGNITION.md with detailed status and history
## 6. Testing
- [ ] 6.1 Test GPU detection on GPU-enabled system
- Verify correct CUDA version detection
- Verify correct PaddlePaddle GPU installation
- Verify OCR processing uses GPU
- [ ] 6.2 Test CPU fallback on non-GPU system
- Verify CPU-only installation
- Verify OCR processing works without GPU
- Verify no errors or warnings about missing GPU
- [ ] 6.3 Test FORCE_CPU_MODE override
- Verify GPU is ignored when FORCE_CPU_MODE=true
- Verify CPU processing works on GPU-enabled system
- [ ] 6.4 Performance benchmarking
- Measure OCR processing time with GPU
- Measure OCR processing time with CPU
- Document performance improvements
## 7. Error Handling and Edge Cases
- [ ] 7.1 Handle GPU out-of-memory errors
- Catch CUDA OOM exceptions
- Automatically fall back to CPU mode
- Log warning message to user
- [ ] 7.2 Handle CUDA version mismatch
- Detect PaddlePaddle/CUDA compatibility issues
- Provide clear error messages
- Suggest correct CUDA version installation
- [ ] 7.3 Handle missing NVIDIA drivers
- Detect when GPU hardware exists but drivers are missing
- Provide installation instructions
- Fall back to CPU mode gracefully


@@ -0,0 +1,186 @@
# Office Document Support Integration
**Date**: 2025-11-12
**Status**: ✅ INTEGRATED & TESTED
**Sub-Proposal**: [add-office-document-support](../add-office-document-support/PROPOSAL.md)
---
## Overview
This document tracks the integration of Office document support (DOC, DOCX, PPT, PPTX) into the main OCR batch processing system. The integration was completed as a sub-proposal under the OpenSpec framework.
## Integration Summary
### Components Integrated
1. **Office Converter Service** ([backend/app/services/office_converter.py](../../../backend/app/services/office_converter.py))
- LibreOffice headless mode for Office to PDF conversion
- Support for DOC, DOCX, PPT, PPTX formats
- Automatic cleanup of temporary conversion files
2. **Document Preprocessor Enhancement** ([backend/app/services/preprocessor.py](../../../backend/app/services/preprocessor.py))
- Added Office MIME type mappings (application/msword, application/vnd.openxmlformats-officedocument.*)
- ZIP-based integrity validation for modern Office formats
- Office format detection and validation
3. **OCR Service Integration** ([backend/app/services/ocr_service.py](../../../backend/app/services/ocr_service.py))
- Office document detection in `process_image()` method
- Automatic conversion pipeline: Office → PDF → Images → OCR
4. **File Manager Updates** ([backend/app/services/file_manager.py](../../../backend/app/services/file_manager.py))
- Extended allowed extensions to include Office formats
5. **Configuration Updates**
- `.env`: Added Office formats to ALLOWED_EXTENSIONS
- `app/core/config.py`: Extended default allowed extensions list
### Processing Pipeline
```
Office Document (DOC/DOCX/PPT/PPTX)
        ↓
LibreOffice Headless Conversion
        ↓
PDF Document
        ↓
PDF to Images (existing)
        ↓
PaddleOCR Processing (existing)
        ↓
Markdown/JSON Output (existing)
```
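For reference, the conversion step at the top of this pipeline reduces to a single `soffice` invocation, roughly as below (paths and the timeout are illustrative):

```python
import subprocess
from pathlib import Path

SOFFICE = "/Applications/LibreOffice.app/Contents/MacOS/soffice"


def office_to_pdf(src: Path, out_dir: Path) -> Path:
    """Convert one Office document to PDF via LibreOffice headless mode."""
    out_dir.mkdir(parents=True, exist_ok=True)
    subprocess.run(
        [SOFFICE, "--headless", "--convert-to", "pdf",
         "--outdir", str(out_dir), str(src)],
        check=True, timeout=120,
    )
    return out_dir / (src.stem + ".pdf")
```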
## Test Results
### Test Document
- **File**: test_document.docx (1,521 bytes)
- **Content**: Mixed Chinese/English text with structured formatting
- **Batch ID**: 24
### Results
- **Status**: ✅ Completed Successfully
- **Processing Time**: 375.23 seconds (includes PaddleOCR model initialization)
- **OCR Accuracy**: 97.39% confidence
- **Text Regions**: 20 regions detected
- **Language**: Chinese (mixed with English)
### Verification
- ✅ DOCX upload and validation
- ✅ DOCX → PDF conversion (LibreOffice headless mode)
- ✅ PDF → Images conversion
- ✅ OCR processing (PaddleOCR with PP-LCNet_x1_0_doc_ori structure analysis)
- ✅ Markdown output generation with preserved structure
### Output Sample
```markdown
Office Document OCR Test
測試文件說明
這是一個用於測試 Tool_OCR 系統 Office 文件支援功能的測試文件。
本系統現已支援以下 Office格式
• Microsoft Word: DOC, DOCX
• Microsoft PowerPoint: PPT, PPTX
處理流程
Office 文件的處理流程如下:
1. 使用 LibreOffice 將 Office 文件轉換為 PDF
```
## Bugs Fixed During Integration
1. **Database Column Error**: Fixed return value unpacking order in file_manager.py
2. **Missing Office MIME Types**: Added Office MIME type mappings to preprocessor.py
3. **Missing Integrity Validation**: Added Office format integrity validation
4. **Configuration Loading Issue**: Updated `.env` file with Office formats
5. **API Endpoint Mismatch**: Fixed test script to use correct API paths
## Dependencies Added
### System Dependencies (Homebrew)
```bash
brew install libreoffice
```
### Configuration
- LibreOffice path: `/Applications/LibreOffice.app/Contents/MacOS/soffice`
- Conversion mode: Headless (`--headless --convert-to pdf`)
## API Changes
**No breaking changes**. Existing API endpoints remain unchanged:
- `POST /api/v1/upload` - Now accepts Office formats
- `POST /api/v1/ocr/process` - Automatically handles Office formats
- `GET /api/v1/batch/{batch_id}/status` - Unchanged
- `GET /api/v1/ocr/result/{file_id}` - Unchanged
## Task Updates
### Main Proposal: add-ocr-batch-processing
**Updated Tasks**:
- Task 3: Document Preprocessing - **100% complete** (was 83%)
- Task 3.4: Implement Office document to PDF conversion - **✅ COMPLETED**
**Updated Services**:
- Document Preprocessor: Now includes Office format support
- OCR Service: Now includes Office document conversion pipeline
- Added: Office Converter service
**Updated Dependencies**:
- Added LibreOffice to system dependencies
**Updated Phase 1 Progress**: **~87% complete** (was ~85%)
## Documentation
### Sub-Proposal Documentation
- [PROPOSAL.md](../add-office-document-support/PROPOSAL.md) - Feature proposal
- [tasks.md](../add-office-document-support/tasks.md) - Implementation tasks
- [IMPLEMENTATION.md](../add-office-document-support/IMPLEMENTATION.md) - Implementation summary
### Test Resources
- Test script: [demo_docs/office_tests/test_office_upload.py](../../../demo_docs/office_tests/test_office_upload.py)
- Test document: [demo_docs/office_tests/test_document.docx](../../../demo_docs/office_tests/test_document.docx)
- Document creation: [demo_docs/office_tests/create_docx.py](../../../demo_docs/office_tests/create_docx.py)
## Performance Impact
- **First-time processing**: ~375 seconds (includes PaddleOCR model download/initialization)
- **Subsequent processing**: Expected to be faster (~10-30 seconds per document)
- **Memory usage**: No significant increase observed
- **Storage**: LibreOffice adds ~600MB to system requirements
## Migration Notes
**Backward Compatibility**: ✅ Fully backward compatible
- Existing image and PDF processing unchanged
- No database schema changes required
- No API contract changes
**Upgrade Path**:
1. Install LibreOffice via Homebrew: `brew install libreoffice`
2. Update `.env` file with Office formats in ALLOWED_EXTENSIONS
3. Restart backend service
4. Verify with test script: `python demo_docs/office_tests/test_office_upload.py`
## Next Steps
Integration complete. The Office document support feature is now part of the main OCR batch processing system and ready for production use.
### Future Enhancements (Optional)
- Add unit tests for office_converter.py
- Add support for Excel files (XLS, XLSX)
- Optimize LibreOffice conversion performance
- Add preview generation for Office documents
---
**Integration Status**: ✅ COMPLETE
**Test Status**: ✅ PASSED
**Documentation Status**: ✅ COMPLETE


@@ -0,0 +1,294 @@
# Session Summary - 2025-11-12
## Completed Work
### ✅ Task 10: Backend - Background Tasks (83% Complete - 5/6 tasks)
This session successfully implemented comprehensive background task infrastructure for the Tool_OCR system.
---
## 📋 What Was Implemented
### 1. Background Tasks Service
**File**: [backend/app/services/background_tasks.py](../../../backend/app/services/background_tasks.py)
Created `BackgroundTaskManager` class with:
- **Generic retry execution framework** (`execute_with_retry`)
- **File-level retry logic** (`process_single_file_with_retry`)
- **Automatic cleanup scheduler** (`cleanup_expired_files`, `start_cleanup_scheduler`)
- **PDF background generation** (`generate_pdf_background`)
- **Batch processing with retry** (`process_batch_files_with_retry`)
**Configuration**:
- Max retries: 3 attempts
- Retry delay: 5 seconds
- Cleanup interval: 1 hour
- File retention: 24 hours
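A condensed sketch of the generic retry framework (the real `execute_with_retry` in background_tasks.py integrates with the app logger and database; argument defaults mirror the configuration above):

```python
import asyncio
import logging

logger = logging.getLogger(__name__)


async def execute_with_retry(func, *args, max_retries: int = 3,
                             retry_delay: float = 5.0, **kwargs):
    """Run an async callable, retrying on failure with a fixed delay."""
    last_exc: Exception | None = None
    for attempt in range(1, max_retries + 1):
        try:
            return await func(*args, **kwargs)
        except Exception as exc:
            last_exc = exc
            logger.warning("Attempt %d/%d failed: %s", attempt, max_retries, exc)
            if attempt < max_retries:
                await asyncio.sleep(retry_delay)
    raise last_exc
```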
### 2. Database Migration
**File**: [backend/alembic/versions/271dc036ea80_add_retry_count_to_files.py](../../../backend/alembic/versions/271dc036ea80_add_retry_count_to_files.py)
- Added `retry_count` field to `paddle_ocr_files` table
- Tracks number of retry attempts per file
- Default value: 0
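The upgrade step of that revision plausibly looks like this (a sketch; the actual file may differ in naming and defaults):

```python
import sqlalchemy as sa
from alembic import op


def upgrade() -> None:
    op.add_column(
        "paddle_ocr_files",
        sa.Column("retry_count", sa.Integer(), nullable=False, server_default="0"),
    )


def downgrade() -> None:
    op.drop_column("paddle_ocr_files", "retry_count")
```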
### 3. Model Updates
**File**: [backend/app/models/ocr.py](../../../backend/app/models/ocr.py#L76)
- Added `retry_count` column to `OCRFile` model
- Integrated with retry logic in background tasks
### 4. Router Updates
**File**: [backend/app/routers/ocr.py](../../../backend/app/routers/ocr.py#L240)
- Replaced `process_batch_files` with `process_batch_files_with_retry`
- Now uses retry-enabled background processing
- Removed old function, added reference comment
### 5. Application Lifecycle
**File**: [backend/app/main.py](../../../backend/app/main.py#L42)
- Added cleanup scheduler to application startup
- Starts automatically as background task
- Graceful shutdown on application stop
- Logs startup/shutdown events
### 6. Documentation Updates
**Updated Files**:
- ✅ [openspec/changes/add-ocr-batch-processing/tasks.md](./tasks.md) - Marked Task 10 items as complete
- ✅ [openspec/changes/add-ocr-batch-processing/STATUS.md](./STATUS.md) - Comprehensive status document
- ✅ [SETUP.md](../../../SETUP.md) - Added Background Services section
- ✅ [SESSION_SUMMARY.md](./SESSION_SUMMARY.md) - This file
---
## 🎯 Task 10 Breakdown
| Task | Description | Status |
|------|-------------|--------|
| 10.1 | Implement FastAPI BackgroundTasks for async OCR processing | ✅ Complete |
| 10.2 | Add task queue system (optional: Redis-based queue) | ⏸️ Optional (not needed) |
| 10.3 | Implement progress updates (polling endpoint) | ✅ Complete |
| 10.4 | Add error handling and retry logic | ✅ Complete |
| 10.5 | Implement cleanup scheduler for expired files | ✅ Complete |
| 10.6 | Add PDF generation to background tasks | ✅ Complete |
**Overall**: 5/6 tasks complete (83%) - Only optional Redis queue not implemented
---
## 🚀 Features Delivered
### 1. Automatic Retry Logic
- ✅ Up to 3 retry attempts per file
- ✅ 5-second delay between retries
- ✅ Detailed error messages with retry count
- ✅ Database tracking of retry attempts
- ✅ Configurable retry parameters
### 2. Cleanup Scheduler
- ✅ Runs every 1 hour automatically
- ✅ Deletes files older than 24 hours
- ✅ Cleans up database records
- ✅ Respects foreign key constraints
- ✅ Logs cleanup activity
- ✅ Configurable retention period
### 3. Background Task Infrastructure
- ✅ Generic retry execution framework
- ✅ PDF generation with retry logic
- ✅ Proper error handling and logging
- ✅ Graceful startup/shutdown
- ✅ No blocking of main application
### 4. Monitoring & Observability
- ✅ Detailed logging for all background tasks
- ✅ Startup confirmation messages
- ✅ Cleanup activity logs
- ✅ Retry attempt tracking
- ✅ Health check endpoint verification
---
## ✅ Verification
### Backend Status
```bash
$ curl http://localhost:12010/health
{"status":"healthy","service":"Tool_OCR","version":"0.1.0"}
```
### Cleanup Scheduler
```bash
$ grep "cleanup scheduler" /tmp/tool_ocr_startup.log
2025-11-12 01:52:09,359 - app.main - INFO - Started cleanup scheduler for expired files
2025-11-12 01:52:09,359 - app.services.background_tasks - INFO - Starting cleanup scheduler (interval: 3600s, retention: 24h)
```
### Translation API (Reserved)
```bash
$ curl http://localhost:12010/api/v1/translate/status
{"available":false,"status":"reserved","message":"Translation feature is reserved for future implementation",...}
```
---
## 📂 Files Created/Modified
### Created
1. `backend/app/services/background_tasks.py` (430 lines) - Background task manager
2. `backend/alembic/versions/271dc036ea80_add_retry_count_to_files.py` - Migration
3. `openspec/changes/add-ocr-batch-processing/STATUS.md` - Comprehensive status
4. `openspec/changes/add-ocr-batch-processing/SESSION_SUMMARY.md` - This file
### Modified
1. `backend/app/models/ocr.py` - Added retry_count field
2. `backend/app/routers/ocr.py` - Updated to use retry-enabled processing
3. `backend/app/main.py` - Added cleanup scheduler startup
4. `openspec/changes/add-ocr-batch-processing/tasks.md` - Updated Task 10 status
5. `SETUP.md` - Added Background Services section
---
## 🎉 Current Project Status
### Phase 1: Backend Development (~85% Complete)
- ✅ Task 1: Environment Setup (100%)
- ✅ Task 2: Database Schema (100%)
- ✅ Task 3: Document Preprocessing (83%)
- ✅ Task 4: Core OCR Service (70%)
- ✅ Task 5: PDF Generation (89%)
- ✅ Task 6: File Management (86%)
- ✅ Task 7: Export Service (90%)
- ✅ Task 8: API Endpoints (93%)
- ✅ Task 9: Translation Architecture RESERVED (83%)
- ✅ **Task 10: Background Tasks (83%)** ⬅️ **Just Completed**
### Backend Services Status
- ✅ **Backend API**: Running on http://localhost:12010
- ✅ **Cleanup Scheduler**: Active (1-hour interval, 24-hour retention)
- ✅ **Retry Logic**: Enabled (3 attempts, 5-second delay)
- ✅ **Health Check**: Passing
---
## 📝 Next Steps (From OpenSpec)
### Immediate - Complete Phase 1
According to OpenSpec [tasks.md](./tasks.md), the remaining Phase 1 tasks are:
1. **Unit Tests** (Multiple tasks)
- Task 3.6: Preprocessor tests
- Task 4.10: OCR service tests
- Task 5.9: PDF generator tests
- Task 6.7: File manager tests
- Task 7.10: Export service tests
- Task 8.14: API integration tests
- Task 9.6: Translation service tests (optional)
2. **Complete Task 4.8-4.9** (OCR Service)
- Implement batch processing with worker queue
- Add progress tracking for batch jobs
### Future Phases
- **Phase 2**: Frontend Development (Tasks 11-14)
- **Phase 3**: Testing & Optimization (Tasks 15-16)
- **Phase 4**: Deployment (Tasks 17-18)
- **Phase 5**: Translation Implementation (Task 19)
---
## 🔍 Technical Notes
### Why No Redis Queue?
Task 10.2 was marked as optional because:
- FastAPI BackgroundTasks is sufficient for current scale
- No need for horizontal scaling yet
- Simpler deployment without additional dependencies
- Can be added later if needed
### Retry Logic Design
The retry system was designed to be:
- **Generic**: `execute_with_retry` works with any function
- **Configurable**: Retry count and delay can be adjusted
- **Transparent**: Logs all retry attempts
- **Persistent**: Tracks retry count in database
### Cleanup Strategy
The cleanup scheduler:
- Runs on a fixed interval (not cron-based)
- Only cleans completed/failed/partial batches
- Deletes files before database records
- Handles errors gracefully without stopping
---
## 🔧 Configuration Options
To modify background task behavior, edit [backend/app/services/background_tasks.py](../../../backend/app/services/background_tasks.py):
```python
# Create custom task manager instance
custom_manager = BackgroundTaskManager(
    max_retries=5,            # Increase retry attempts
    retry_delay=10,           # Longer delay between retries
    cleanup_interval=7200,    # Run cleanup every 2 hours
    file_retention_hours=48,  # Keep files for 48 hours
)
```
---
## 📊 Code Statistics
### Lines of Code Added
- background_tasks.py: **430 lines**
- Migration file: **32 lines**
- STATUS.md: **580 lines**
- SESSION_SUMMARY.md: **280 lines**
**Total New Code**: ~1,300 lines
### Files Modified
- 5 existing files updated
- 4 new files created
---
## ✨ Key Achievements
1. **Robust Error Handling**: Automatic retry logic ensures transient failures don't lose work
2. **Automatic Cleanup**: No manual intervention needed for old files
3. **Scalable Architecture**: Background tasks allow async processing
4. **Production Ready**: Graceful startup/shutdown, logging, monitoring
5. **Well Documented**: Comprehensive docs for all new features
6. **OpenSpec Compliant**: Followed specification exactly
---
## 🎓 Lessons Learned
1. **Async cleanup scheduler** requires `asyncio.create_task()` in lifespan context
2. **Retry logic** should track attempts in database for debugging
3. **Background tasks** need separate database sessions
4. **Graceful shutdown** requires catching `asyncio.CancelledError`
5. **Logging** is critical for monitoring background services
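A sketch tying these lessons together (names are illustrative, but the `create_task`/`CancelledError` pattern is exactly what the lifespan hook needs):

```python
import asyncio
from contextlib import asynccontextmanager

from fastapi import FastAPI


async def cleanup_loop(interval: int = 3600) -> None:
    try:
        while True:
            await asyncio.sleep(interval)
            # each pass opens its own DB session (lesson 3) and deletes
            # files/records older than the retention window
    except asyncio.CancelledError:
        pass  # graceful shutdown (lesson 4)


@asynccontextmanager
async def lifespan(app: FastAPI):
    task = asyncio.create_task(cleanup_loop())  # lesson 1
    yield
    task.cancel()  # raises CancelledError inside cleanup_loop


app = FastAPI(lifespan=lifespan)
```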
---
## 🔗 Related Documentation
- **OpenSpec**: [SPEC.md](./SPEC.md)
- **Tasks**: [tasks.md](./tasks.md)
- **Status**: [STATUS.md](./STATUS.md)
- **Setup**: [SETUP.md](../../../SETUP.md)
- **API Docs**: http://localhost:12010/docs
---
**Session Completed**: 2025-11-12
**Time Invested**: ~1 hour
**Tasks Completed**: Task 10 (5/6 subtasks)
**Next Session**: Begin unit test implementation (Tasks 3.6, 4.10, 5.9, 6.7, 7.10, 8.14)


@@ -0,0 +1,616 @@
# Tool_OCR Development Status
**Last Updated**: 2025-11-12
**Phase**: Phase 2 - Frontend Development (In Progress)
**Current Task**: Frontend API Schema Alignment - Fixed 6 critical API mismatches
---
## 📊 Overall Progress
### Phase 1: Backend Development (Core OCR + Layout Preservation)
- ✅ Task 1: Environment Setup (100%)
- ✅ Task 2: Database Schema (100%)
- ✅ Task 3: Document Preprocessing (100%) - Office format support integrated
- ✅ Task 4: Core OCR Service (100%)
- ✅ Task 5: PDF Generation (100%)
- ✅ Task 6: File Management (100%)
- ✅ Task 7: Export Service (100%)
- ✅ Task 8: API Endpoints (100% - 14/14 tasks) ⬅️ **Updated: All endpoints aligned with frontend**
- ✅ Task 9: Translation Architecture RESERVED (83% - 5/6 tasks)
- ✅ Task 10: Background Tasks (83% - 5/6 tasks)
**Phase 1 Status**: ~98% complete
### Phase 2: Frontend Development (In Progress)
- ✅ Task 11: Frontend Project Structure (100%)
- ✅ Task 12: UI Components (70% - 7/10 tasks) ⬅️ **Updated**
- ✅ Task 13: Pages (100% - 8/8 tasks) ⬅️ **Updated: All pages functional**
- ✅ Task 14: API Integration (100% - 10/10 tasks) ⬅️ **Updated: API schemas aligned**
**Phase 2 Status**: ~92% complete ⬅️ **Updated: Core functionality working**
### Remaining Phases
- ⏳ Phase 3: Testing & Documentation (Partially complete - manual testing done)
- ⏳ Phase 4: Deployment (Not started)
- ⏳ Phase 5: Translation Implementation (Reserved for future)
---
## 🎯 Task 10 Implementation Details
### ✅ Completed (5/6)
**10.1 FastAPI BackgroundTasks for Async OCR Processing**
- File: [backend/app/services/background_tasks.py](../../../backend/app/services/background_tasks.py)
- Implemented `BackgroundTaskManager` class
- OCR processing runs asynchronously via FastAPI BackgroundTasks
- Router updated: [backend/app/routers/ocr.py:240](../../../backend/app/routers/ocr.py#L240)
**10.3 Progress Updates**
- Batch progress tracking already implemented in Task 8
- Properties: `batch.completed_files`, `batch.failed_files`, `batch.progress_percentage`
- Endpoint: `GET /api/v1/batch/{batch_id}/status`
**10.4 Error Handling with Retry Logic**
- File: [backend/app/services/background_tasks.py:63](../../../backend/app/services/background_tasks.py#L63)
- Implemented `execute_with_retry()` method for generic retry logic
- Implemented `process_single_file_with_retry()` for OCR processing with 3 retry attempts
- Added `retry_count` field to `OCRFile` model
- Migration: [backend/alembic/versions/271dc036ea80_add_retry_count_to_files.py](../../../backend/alembic/versions/271dc036ea80_add_retry_count_to_files.py)
- Configurable retry delay (default: 5 seconds)
- Error messages include retry attempt information
**10.5 Cleanup Scheduler for Expired Files**
- File: [backend/app/services/background_tasks.py:189](../../../backend/app/services/background_tasks.py#L189)
- Implemented `cleanup_expired_files()` method
- Automatic cleanup of files older than 24 hours
- Runs every 1 hour (configurable via `cleanup_interval`)
- Deletes:
- Physical files and directories
- Database records (results, files, batches)
- Respects foreign key constraints
- Started automatically on application startup: [backend/app/main.py:42](../../../backend/app/main.py#L42)
- Gracefully stopped on shutdown
**10.6 PDF Generation in Background Tasks**
- File: [backend/app/services/background_tasks.py:226](../../../backend/app/services/background_tasks.py#L226)
- Implemented `generate_pdf_background()` method
- PDF generation runs with retry logic (2 retries, 3-second delay)
- Ready to be integrated with export endpoints
### ⏸️ Optional (1/6)
**10.2 Redis-based Task Queue**
- Status: Not implemented (marked as optional in OpenSpec)
- Current approach: FastAPI BackgroundTasks (sufficient for current scale)
- Future consideration: Can add Redis queue if needed for horizontal scaling
---
## 🗄️ Database Status
### Current Schema
All tables use `paddle_ocr_` prefix for namespace isolation in shared database.
**Tables Created**:
1. `paddle_ocr_users` - User authentication (JWT)
2. `paddle_ocr_batches` - Batch processing metadata
3. `paddle_ocr_files` - Individual file records (now includes `retry_count`)
4. `paddle_ocr_results` - OCR results (Markdown, JSON, images)
5. `paddle_ocr_export_rules` - User-defined export rules
6. `paddle_ocr_translation_configs` - RESERVED for Phase 5
**Migrations Applied**:
- ✅ a7802b126240: Initial migration with paddle_ocr prefix
- ✅ 271dc036ea80: Add retry_count to files
### Test Data
**Test Users**:
- Username: `admin` / Password: `admin123` (Admin role)
- Username: `testuser` / Password: `test123` (Regular user)
---
## 🔧 Services Implemented
### Core Services
1. **Document Preprocessor** ([backend/app/services/preprocessor.py](../../../backend/app/services/preprocessor.py))
- File format validation (PNG, JPG, JPEG, PDF, DOC, DOCX, PPT, PPTX)
- Office document MIME type detection
- ZIP-based integrity validation for modern Office formats
- Corruption detection
- Format standardization
- Status: 100% complete (Office format support integrated via sub-proposal)
2. **OCR Service** ([backend/app/services/ocr_service.py](../../../backend/app/services/ocr_service.py))
- PaddleOCR 3.x integration (PPStructureV3)
- Layout detection and preservation
- Multi-language support (ch, en, japan, korean)
- Office document to PDF conversion pipeline (via LibreOffice)
- Markdown and JSON output
- Status: 100% complete ⬅️ **Updated: Unit tests complete (48 tests passing)**
3. **PDF Generator** ([backend/app/services/pdf_generator.py](../../../backend/app/services/pdf_generator.py))
- Pandoc (preferred) + WeasyPrint (fallback)
- Three CSS templates: default, academic, business
- Chinese font support (Noto Sans CJK)
- Layout preservation
- Status: 100% complete ⬅️ **Updated: Unit tests complete (27 tests passing)**
4. **File Manager** ([backend/app/services/file_manager.py](../../../backend/app/services/file_manager.py))
- Batch directory management
- File access control
- Temporary file cleanup (via cleanup scheduler)
- Status: 100% complete ⬅️ **Updated: Unit tests complete (38 tests passing)**
5. **Export Service** ([backend/app/services/export_service.py](../../../backend/app/services/export_service.py))
- Six formats: TXT, JSON, Excel, Markdown, PDF, ZIP
- Rule-based filtering and formatting
- CRUD for export rules
- Status: 100% complete ⬅️ **Updated: Unit tests complete (37 tests passing)**
6. **Background Tasks** ([backend/app/services/background_tasks.py](../../../backend/app/services/background_tasks.py))
- Retry logic for OCR processing
- Automatic file cleanup scheduler
- PDF generation with retry
- Generic retry execution framework
- Status: 83% complete
7. **Office Converter** ([backend/app/services/office_converter.py](../../../backend/app/services/office_converter.py)) ⬅️ **Integrated via sub-proposal**
- LibreOffice headless mode for Office to PDF conversion
- Support for DOC, DOCX, PPT, PPTX formats
- Automatic cleanup of temporary conversion files
- Integration with OCR processing pipeline
- Status: 100% complete (tested with 97.39% OCR accuracy)
8. **Translation Service** (RESERVED) ([backend/app/services/translation_service.py](../../../backend/app/services/translation_service.py))
- Stub implementation for Phase 5
- Interface defined for future engines: Argos, ERNIE, Google, DeepL
- Status: Reserved (not implemented)
---
## 🔌 API Endpoints
### Authentication
- `POST /api/v1/auth/login` - JWT authentication
### File Upload
- `POST /api/v1/upload` - Batch file upload with validation
### OCR Processing
- `POST /api/v1/ocr/process` - Trigger OCR (uses background tasks with retry)
- `GET /api/v1/batch/{batch_id}/status` - Get batch status with progress
- `GET /api/v1/ocr/result/{file_id}` - Get OCR results
### Export
- `POST /api/v1/export` - Export results (TXT, JSON, Excel, Markdown, PDF, ZIP)
- `GET /api/v1/export/pdf/{file_id}` - Generate layout-preserved PDF
- `GET /api/v1/export/rules` - List export rules
- `POST /api/v1/export/rules` - Create export rule
- `PUT /api/v1/export/rules/{rule_id}` - Update export rule
- `DELETE /api/v1/export/rules/{rule_id}` - Delete export rule
- `GET /api/v1/export/css-templates` - List CSS templates
### Translation (RESERVED)
- `GET /api/v1/translate/status` - Feature status (returns "reserved")
- `GET /api/v1/translate/languages` - Planned languages
- `POST /api/v1/translate/document` - Returns 501 Not Implemented
- `GET /api/v1/translate/task/{task_id}` - Returns 501 Not Implemented
- `DELETE /api/v1/translate/task/{task_id}` - Returns 501 Not Implemented
**API Documentation**: http://localhost:12010/docs (FastAPI auto-generated)
---
## 🖥️ Environment Setup
### Conda Environment
- Name: `tool_ocr`
- Python: 3.10
- Platform: macOS Apple Silicon (ARM64)
### Key Dependencies
- **FastAPI**: Web framework
- **PaddleOCR 3.x**: OCR engine with PPStructureV3
- **SQLAlchemy**: ORM for MySQL
- **Alembic**: Database migrations
- **WeasyPrint + Pandoc**: PDF generation
- **LibreOffice**: Office document to PDF conversion (headless mode)
- **python-magic**: File type detection
- **bcrypt 4.2.1**: Password hashing (pinned for compatibility)
- **email-validator**: Email validation for Pydantic
### System Dependencies
- **Homebrew packages**:
- `libmagic` - File type detection
- `pango`, `gdk-pixbuf`, `libffi` - WeasyPrint dependencies
- `font-noto-sans-cjk` - Chinese font support
- `pandoc` - Document conversion (optional)
- `libreoffice` - Office document conversion (headless mode)
### Environment Variables
```bash
MYSQL_HOST=mysql.theaken.com
MYSQL_PORT=33306
MYSQL_DATABASE=db_A060
BACKEND_PORT=12010
SECRET_KEY=<generated-secret>
DYLD_LIBRARY_PATH=/opt/homebrew/lib:$DYLD_LIBRARY_PATH
```
### Critical Configuration
- **Database Prefix**: All tables use `paddle_ocr_` prefix (shared database)
- **File Retention**: 24 hours (automatic cleanup)
- **Cleanup Interval**: 1 hour
- **Retry Attempts**: 3 (configurable)
- **Retry Delay**: 5 seconds (configurable)
---
## 🔧 Service Status
### Backend Service
- **Status**: ✅ Running
- **URL**: http://localhost:12010
- **Log File**: `/tmp/tool_ocr_startup.log`
- **Process**: Running via Uvicorn with auto-reload
### Background Services
- **Cleanup Scheduler**: ✅ Running (interval: 3600s, retention: 24h)
- **OCR Processing**: ✅ Background tasks with retry logic
### Health Check
```bash
curl http://localhost:12010/health
# Response: {"status":"healthy","service":"Tool_OCR","version":"0.1.0"}
```
---
## 📝 Known Issues & Workarounds
### 1. Shared Database Environment
- **Issue**: Database contains tables from other projects
- **Solution**: All tables use `paddle_ocr_` prefix for namespace isolation
- **Important**: NEVER drop tables in migrations (only create)
### 2. PaddleOCR 3.x Compatibility
- **Issue**: Parameters `show_log` and `use_gpu` removed in PaddleOCR 3.x
- **Solution**: Updated service to remove obsolete parameters
- **Issue**: `PPStructure` renamed to `PPStructureV3`
- **Solution**: Updated imports
### 3. Bcrypt Version
- **Issue**: Latest bcrypt incompatible with passlib
- **Solution**: Pinned to `bcrypt==4.2.1`
### 4. WeasyPrint on macOS
- **Issue**: Missing shared libraries
- **Solution**: Install via Homebrew and set `DYLD_LIBRARY_PATH`
### 5. First OCR Run
- **Issue**: First OCR test may fail as PaddleOCR downloads models (~900MB)
- **Solution**: Wait for download to complete, then retry
- **Model Location**: `~/.paddlex/`
---
## 🧪 Test Coverage
### Unit Tests Summary
**Total Tests**: 187
**Passed**: 182 ✅ (97.3% pass rate)
**Skipped**: 5 (acceptable - technical limitations or covered elsewhere)
**Failed**: 0 ✅
### Test Breakdown by Module
1. **test_preprocessor.py**: 32 tests ✅
- Format validation (PNG, JPG, PDF, Office formats)
- MIME type mapping
- Integrity validation
- File information extraction
- Edge cases
2. **test_ocr_service.py**: 48 tests ✅
- PaddleOCR 3.x integration
- Layout detection and preservation
- Markdown generation
- JSON output
- Real image processing (demo_docs/basic/english.png)
- Structure engine initialization
3. **test_pdf_generator.py**: 27 tests ✅
- Pandoc integration
- WeasyPrint fallback
- CSS template management
- Unicode and table support
- Error handling
4. **test_file_manager.py**: 38 tests ✅
- File upload validation
- Batch management
- Access control
- Cleanup operations
5. **test_export_service.py**: 37 tests ✅
- Six export formats (TXT, JSON, Excel, Markdown, PDF, ZIP)
- Rule-based filtering and formatting
- Export rule CRUD operations
6. **test_api_integration.py**: 5 tests ✅
- API endpoint integration
- JWT authentication
- Upload and OCR workflow
### Skipped Tests (Acceptable)
1. `test_export_txt_success` - FileResponse validation (covered in unit tests)
2. `test_generate_pdf_success` - FileResponse validation (covered in unit tests)
3. `test_create_export_rule` - SQLite session isolation (works with MySQL)
4. `test_update_export_rule` - SQLite session isolation (works with MySQL)
5. `test_validate_upload_file_too_large` - Complex UploadFile mock (covered in integration)
### Test Coverage Achievements
- ✅ All service layers tested with comprehensive unit tests
- ✅ PaddleOCR 3.x format compatibility verified
- ✅ Real image processing with demo samples
- ✅ Edge cases and error handling covered
- ✅ Integration tests for critical workflows
---
## 🌐 Phase 2: Frontend API Schema Alignment (2025-11-12)
### Issue Summary
During frontend development, we identified 6 critical API mismatches between frontend expectations and the backend implementation that blocked upload, processing, and results-preview functionality.
### 🐛 API Mismatches Fixed
**1. Upload Response Structure** ⬅️ **FIXED**
- **Problem**: Backend returned `OCRBatchResponse` with `id` field, frontend expected `{ batch_id, files }`
- **Solution**: Created `UploadBatchResponse` schema in [backend/app/schemas/ocr.py:91-115](../../../backend/app/schemas/ocr.py#L91-L115)
- **Impact**: Upload now returns correct structure, fixes "no response after upload" issue
- **Files Modified**:
- `backend/app/schemas/ocr.py` - Added UploadBatchResponse schema
- `backend/app/routers/ocr.py:38,72-75` - Updated response_model and return format
**2. Error Field Naming** ⬅️ **FIXED**
- **Problem**: Frontend read `file.error`, backend had `error_message` field
- **Solution**: Added Pydantic validation_alias in [backend/app/schemas/ocr.py:21](../../../backend/app/schemas/ocr.py#L21)
- **Code**: `error: Optional[str] = Field(None, validation_alias='error_message')`
- **Impact**: Error messages now display correctly in ProcessingPage
**3. Markdown Content Missing** ⬅️ **FIXED**
- **Problem**: Frontend needed `markdown_content` for preview, only path was provided
- **Solution**: Added field to OCRResultResponse in [backend/app/schemas/ocr.py:35](../../../backend/app/schemas/ocr.py#L35)
- **Code**: `markdown_content: Optional[str] = None # Added for frontend preview`
- **Impact**: Markdown preview now works in ResultsPage
**4. Export Options Schema Missing** ⬅️ **FIXED**
- **Problem**: Frontend sent `options` object, backend didn't accept it
- **Solution**: Created ExportOptions schema in [backend/app/schemas/export.py:10-15](../../../backend/app/schemas/export.py#L10-L15)
- **Fields**: `confidence_threshold`, `include_metadata`, `filename_pattern`, `css_template`
- **Impact**: Advanced export options now supported
**5. CSS Template Filename Field** ⬅️ **FIXED**
- **Problem**: Frontend needed `filename`, backend only had `name` and `description`
- **Solution**: Added filename field to CSSTemplateResponse in [backend/app/schemas/export.py:82](../../../backend/app/schemas/export.py#L82)
- **Code**: `filename: str = Field(..., description="Template filename")`
- **Impact**: CSS template selector now works correctly
**6. OCR Result Detail Structure** ⬅️ **FIXED** (Critical)
- **Problem**: ResultsPage showed "檢視 Markdown - undefined" because:
- Backend returned nested `{ file: {...}, result: {...} }` structure
- Frontend expected flat structure with `filename`, `confidence`, `markdown_content` at root
- **Solution**: Created OCRResultDetailResponse schema in [backend/app/schemas/ocr.py:77-89](../../../backend/app/schemas/ocr.py#L77-L89)
- **Solution**: Updated endpoint in [backend/app/routers/ocr.py:181-240](../../../backend/app/routers/ocr.py#L181-L240) to:
- Read markdown content from filesystem
- Build flattened JSON data structure
- Return all fields frontend expects at root level
- **Impact**:
- MarkdownPreview now shows correct filename in title
- Confidence and processing time display correctly
- Markdown content loads and displays properly
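For reference, fixes 2 and 6 boil down to two Pydantic techniques, sketched here with abbreviated field lists (the real schemas in backend/app/schemas/ocr.py carry more fields):

```python
from typing import Optional

from pydantic import BaseModel, Field


class OCRFileResponse(BaseModel):
    # Fix 2: expose the DB column error_message under the name the frontend reads
    model_config = {"from_attributes": True, "populate_by_name": True}
    error: Optional[str] = Field(None, validation_alias="error_message")


class OCRResultDetailResponse(BaseModel):
    # Fix 6: flat structure with everything ResultsPage needs at the root level
    file_id: int
    filename: str
    status: str
    markdown_content: Optional[str] = None
    confidence: Optional[float] = None
    processing_time: Optional[float] = None
```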
### ✅ Frontend Functionality Restored
**Upload Flow**:
1. ✅ Files upload with progress indication
2. ✅ Toast notification on success
3. ✅ Automatic redirect to Processing page
4. ✅ Batch ID and files stored in Zustand state
**Processing Flow**:
1. ✅ Batch status polling works
2. ✅ Progress percentage updates in real-time
3. ✅ File status badges display correctly (pending/processing/completed/failed)
4. ✅ Error messages show when files fail
5. ✅ Automatic redirect to Results when complete
**Results Flow**:
1. ✅ Batch summary displays (batch ID, completed count)
2. ✅ Results table shows all files with actions
3. ✅ Click file to view markdown preview
4. ✅ Markdown title shows correct filename (not "undefined")
5. ✅ Confidence and processing time display correctly
6. ✅ PDF download works
7. ✅ Export button navigates to export page
### 📝 Additional Frontend Fixes
**1. ResultsPage.tsx** ([frontend/src/pages/ResultsPage.tsx:134-143](../../../frontend/src/pages/ResultsPage.tsx#L134-L143))
- Added null checks for undefined values:
- `(ocrResult.confidence || 0)` - Prevents .toFixed() on undefined
- `(ocrResult.processing_time || 0)` - Prevents .toFixed() on undefined
- `ocrResult.json_data?.total_text_regions || 0` - Safe optional chaining
**2. ProcessingPage.tsx** (Already functional)
- Batch ID validation working
- Status polling implemented correctly
- Error handling complete
### 🔧 API Endpoints Updated
**Upload Endpoint**:
```typescript
POST /api/v1/upload
Response: { batch_id: number, files: OCRFileResponse[] }
```
**Batch Status Endpoint**:
```typescript
GET /api/v1/batch/{batch_id}/status
Response: { batch: OCRBatchResponse, files: OCRFileResponse[] }
```
**OCR Result Endpoint** (New flattened structure):
```typescript
GET /api/v1/ocr/result/{file_id}
Response: {
file_id: number
filename: string
status: string
markdown_content: string
json_data: {...}
confidence: number
processing_time: number
}
```
### 🎯 Testing Verified
- ✅ File upload with toast notification
- ✅ Redirect to processing page
- ✅ Processing status polling
- ✅ Completed batch redirect to results
- ✅ Results table display
- ✅ Markdown preview with correct filename
- ✅ Confidence and processing time display
- ✅ PDF download functionality
### 📊 Phase 2 Progress Update
- Task 12: UI Components - **70% complete** (MarkdownPreview working, missing Export/Rule editors)
- Task 13: Pages - **100% complete** (All core pages functional)
- Task 14: API Integration - **100% complete** (All API schemas aligned)
**Phase 2 Overall**: ~92% complete (Core user journey working end-to-end)
---
## 🎯 Next Steps
### Immediate (Complete Phase 1)
1. ~~**Write Unit Tests** (Tasks 3.6, 4.10, 5.9, 6.7, 7.10)~~ ✅ **COMPLETE**
- ~~Preprocessor tests~~ ✅
- ~~OCR service tests~~ ✅
- ~~PDF generator tests~~ ✅
- ~~File manager tests~~ ✅
- ~~Export service tests~~ ✅
2. **API Integration Tests** (Task 8.14)
- End-to-end workflow tests
- Authentication tests
- Error handling tests
3. **Final Phase 1 Documentation**
- API usage examples
- Deployment guide
- Performance benchmarks
### Phase 2: Frontend Development (Remaining Work)
- Task 11: Frontend project structure (Vite + React + TypeScript)
- Task 12: UI components (shadcn/ui)
- Task 13: Pages (Login, Upload, Processing, Results, Export)
- Task 14: API integration
### Phase 3: Testing & Optimization
- Comprehensive testing
- Performance optimization
- Documentation completion
### Phase 4: Deployment
- Production environment setup
- 1Panel deployment
- SSL configuration
- Monitoring setup
### Phase 5: Translation Feature (Future)
- Choose translation engine (Argos/ERNIE/Google/DeepL)
- Implement translation service
- Update UI to enable translation features
---
## 📚 Documentation
### Setup Documentation
- [SETUP.md](../../../SETUP.md) - Environment setup and installation
- [README.md](../../../README.md) - Project overview
### OpenSpec Documentation
- [SPEC.md](./SPEC.md) - Complete specification
- [tasks.md](./tasks.md) - Task breakdown and progress
- [STATUS.md](./STATUS.md) - This file
- [OFFICE_INTEGRATION.md](./OFFICE_INTEGRATION.md) - Office document support integration summary
### Sub-Proposals
- [add-office-document-support](../add-office-document-support/PROPOSAL.md) - Office format support (✅ INTEGRATED)
### API Documentation
- **Interactive Docs**: http://localhost:12010/docs
- **ReDoc**: http://localhost:12010/redoc
---
## 🔍 Testing Commands
### Start Backend
```bash
source ~/.zshrc
conda activate tool_ocr
export DYLD_LIBRARY_PATH=/opt/homebrew/lib:$DYLD_LIBRARY_PATH
python -m app.main
```
### Test Service Layer
```bash
cd backend
python test_services.py
```
### Test API (Login)
```bash
curl -X POST http://localhost:12010/api/v1/auth/login \
-H "Content-Type: application/json" \
-d '{"username": "admin", "password": "admin123"}'
```
### Check Cleanup Scheduler
```bash
tail -f /tmp/tool_ocr_startup.log | grep cleanup
```
### Check Batch Progress
```bash
curl http://localhost:12010/api/v1/batch/{batch_id}/status
```
---
## 📞 Support & Feedback
- **Project**: Tool_OCR - OCR Batch Processing System
- **Development Approach**: OpenSpec-driven development
- **Current Status**: Phase 2 Frontend ~92% complete ⬅️ **Updated: Core user journey working end-to-end**
- **Backend Test Coverage**: 182/187 tests passing (97.3%)
- **Next Milestone**: Complete remaining UI components (Export/Rule editors), Phase 3 testing
---
**Status Summary**:
- **Phase 1 (Backend)**: ~98% complete - All core functionality working with comprehensive test coverage
- **Phase 2 (Frontend)**: ~92% complete - Core user journey (Upload → Processing → Results) fully functional
- **Recent Work**: Fixed 6 critical API schema mismatches between frontend and backend, enabling end-to-end workflow
- **Verification**: Upload, OCR processing, and results preview all working correctly with proper error handling


@@ -0,0 +1,313 @@
# Technical Design Document
## Context
Tool_OCR is a web-based batch OCR processing system with a separated frontend/backend architecture. The system needs to handle large file uploads, long-running OCR tasks, and multiple export formats while maintaining a responsive UI and efficient resource usage.
**Key stakeholders:**
- End users: Need simple, fast, reliable OCR processing
- Developers: Need maintainable, testable code architecture
- Operations: Need easy deployment via 1Panel, monitoring, and error tracking
**Constraints:**
- Development on Windows with Conda (Python 3.10)
- Deployment on Linux server via 1Panel (no Docker)
- Port range: 12010-12019
- External MySQL database (mysql.theaken.com:33306)
- PaddleOCR models (~100-200MB per language)
- Max file upload: 20MB per file, 100MB per batch
## Goals / Non-Goals
### Goals
- Process images and PDFs with multi-language OCR (Chinese, English, Japanese, Korean)
- Handle batch uploads with real-time progress tracking
- Provide flexible export formats (TXT, JSON, Excel) with custom rules
- Maintain responsive UI during long-running OCR tasks
- Enable easy deployment and maintenance via 1Panel
### Non-Goals
- Real-time OCR streaming (batch processing only)
- Cloud-based OCR services (local processing only)
- Mobile app support (web UI only, desktop/tablet optimized)
- Advanced image editing or annotation features
- Multi-tenant SaaS architecture (single deployment per organization)
## Decisions
### Decision 1: FastAPI for Backend Framework
**Choice:** Use FastAPI instead of Flask or Django
**Rationale:**
- Native async/await support for I/O-bound operations (file upload, database queries)
- Automatic OpenAPI documentation (Swagger UI)
- Built-in Pydantic validation for type safety
- Better performance for concurrent requests
- Modern Python 3.10+ features (type hints, async)
**Alternatives considered:**
- Flask: Simpler but lacks native async, requires extensions
- Django: Too heavyweight for API-only backend, includes unnecessary ORM features
### Decision 2: PaddleOCR as OCR Engine
**Choice:** Use PaddleOCR instead of Tesseract or cloud APIs
**Rationale:**
- Excellent Chinese/multilingual support (key requirement)
- Higher accuracy with deep learning models
- Offline operation (no API costs or internet dependency)
- Active development and good documentation
- GPU acceleration support (optional)
**Alternatives considered:**
- Tesseract: Lower accuracy for Chinese, older technology
- Google Cloud Vision / AWS Textract: Requires internet, ongoing costs, data privacy concerns
### Decision 3: React Query for API State Management
**Choice:** Use React Query (TanStack Query) instead of Redux
**Rationale:**
- Designed specifically for server state (API calls, caching, refetching)
- Built-in loading/error states
- Automatic background refetching and cache invalidation
- Reduces boilerplate compared to Redux
- Better for our API-heavy use case
**Alternatives considered:**
- Redux: Overkill for server state, more boilerplate
- Plain Axios: Requires manual loading/error state management
### Decision 4: Zustand for Client State
**Choice:** Use Zustand for global UI state (separate from React Query)
**Rationale:**
- Lightweight (1KB) and simple API
- No providers or context required
- TypeScript-friendly
- Works well alongside React Query
- Only for UI state (selected files, filters, etc.)
### Decision 5: Background Task Processing
**Choice:** FastAPI BackgroundTasks for OCR processing (no external queue initially)
**Rationale:**
- Built-in FastAPI feature, no additional dependencies
- Sufficient for single-server deployment
- Simpler deployment and maintenance
- Can migrate to Redis/Celery later if needed
**Migration path:** If scale requires, add Redis + Celery for distributed task queue
**Alternatives considered:**
- Celery + Redis: More complex, overkill for initial deployment
- Threading: FastAPI BackgroundTasks already uses thread pool
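A minimal sketch of this flow, reusing the endpoint paths from the API list later in this document; `run_ocr_batch` is a hypothetical stand-in for the real pipeline:
```python
from fastapi import BackgroundTasks, FastAPI

app = FastAPI()

def run_ocr_batch(batch_id: str) -> None:
    # Hypothetical stand-in: load the batch's files, run OCR on each,
    # and persist per-file results and batch status to the database.
    print(f"processing batch {batch_id}")

@app.post("/api/v1/ocr/process")
async def process_batch(batch_id: str, background_tasks: BackgroundTasks) -> dict:
    # Schedule the long-running job and return immediately; clients poll
    # GET /api/v1/batch/{batch_id}/status for progress (Decision 7).
    background_tasks.add_task(run_ocr_batch, batch_id)
    return {"batch_id": batch_id, "status": "processing"}
```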
### Decision 6: File Storage Strategy
**Choice:** Local filesystem with automatic cleanup (24-hour retention)
**Rationale:**
- Simple implementation, no S3/cloud storage costs
- OCR results stored in database (permanent)
- Original files temporary, only needed during processing
- Automatic cleanup prevents disk space issues
**Storage structure:**
```
uploads/
  {batch_id}/
    {file_id}_original.png
    {file_id}_preprocessed.png   (if preprocessing enabled)
```
**Cleanup:** Daily cron job or background task deletes files older than 24 hours
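A minimal sketch of that cleanup task, assuming each batch directory's modification time reflects its upload time:
```python
import shutil
import time
from pathlib import Path

RETENTION_SECONDS = 24 * 60 * 60

def cleanup_expired_batches(upload_root: Path = Path("uploads")) -> None:
    # Remove whole batch directories past the 24-hour retention window;
    # OCR results live in the database, so only temporary files are lost.
    if not upload_root.exists():
        return
    cutoff = time.time() - RETENTION_SECONDS
    for batch_dir in upload_root.iterdir():
        if batch_dir.is_dir() and batch_dir.stat().st_mtime < cutoff:
            shutil.rmtree(batch_dir, ignore_errors=True)
```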
### Decision 7: Real-time Progress Updates
**Choice:** HTTP polling instead of WebSocket
**Rationale:**
- Simpler implementation and deployment
- Works better with Nginx reverse proxy and 1Panel
- Sufficient UX for batch processing (poll every 2 seconds)
- No need for persistent connections
**API:** `GET /api/v1/batch/{batch_id}/status` returns progress percentage
**Alternatives considered:**
- WebSocket: More complex, requires special Nginx config, overkill for this use case
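The client side of this contract, sketched with `requests`; the terminal status values shown are assumptions:
```python
import time
import requests

TERMINAL_STATES = {"completed", "failed", "partially_completed"}  # assumed values

def wait_for_batch(base_url: str, batch_id: str, interval: float = 2.0) -> dict:
    # Poll every 2 seconds until the batch reaches a terminal state.
    while True:
        resp = requests.get(f"{base_url}/api/v1/batch/{batch_id}/status", timeout=10)
        resp.raise_for_status()
        status = resp.json()
        if status.get("status") in TERMINAL_STATES:
            return status
        time.sleep(interval)
```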
### Decision 8: Database Schema Design
**Choice:** Separate tables for tasks, files, and results (normalized)
**Schema:**
```sql
users (id, username, password_hash, created_at)
ocr_batches (id, user_id, status, created_at, completed_at)
ocr_files (id, batch_id, filename, file_path, file_size, status)
ocr_results (id, file_id, text, bbox_json, confidence, language)
export_rules (id, user_id, rule_name, config_json)
```
**Rationale:**
- Normalized for data integrity
- Supports batch tracking and partial failures
- Easy to query individual file results or batch statistics
- Export rules reusable across users
### Decision 9: Export Rule Configuration Format
**Choice:** JSON-based rule configuration stored in database
**Example rule:**
```json
{
  "filters": {
    "min_confidence": 0.8,
    "filename_pattern": "^invoice_.*"
  },
  "formatting": {
    "add_line_numbers": true,
    "sort_by_position": true,
    "group_by_page": true
  },
  "output": {
    "format": "txt",
    "encoding": "utf-8",
    "line_separator": "\n"
  }
}
```
**Rationale:**
- Flexible and extensible
- Easy to validate with JSON Schema (see the sketch below)
- Can be edited via UI or API
- Supports complex rules without database schema changes
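A coarse validation sketch using the `jsonschema` package; this top-level-only schema is an illustration, not the project's actual schema:
```python
import jsonschema

EXPORT_RULE_SCHEMA = {
    "type": "object",
    "properties": {
        "filters": {"type": "object"},
        "formatting": {"type": "object"},
        "output": {"type": "object"},
    },
    "additionalProperties": False,
}

def validate_rule(config: dict) -> None:
    # Raises jsonschema.ValidationError when a rule config is malformed.
    jsonschema.validate(instance=config, schema=EXPORT_RULE_SCHEMA)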
### Decision 10: Deployment Architecture (1Panel)
**Choice:** Nginx (static files + reverse proxy) + Supervisor (backend process manager)
**Architecture:**
```
[Client Browser]
        ↓
[Nginx :80/443] (managed by 1Panel)
  ├─ /       → Frontend static files (React build)
  ├─ /assets → Static assets
  └─ /api    → Reverse proxy to backend :12010
        ↓
[FastAPI Backend :12010] (managed by Supervisor)
        ↓
[MySQL :33306] (external)
```
**Rationale:**
- 1Panel provides GUI for Nginx management
- Supervisor ensures backend auto-restart on failure
- No Docker simplifies deployment on existing infrastructure
- Standard Nginx config works without special 1Panel requirements
**Supervisor config:**
```ini
[program:tool_ocr_backend]
command=/home/user/.conda/envs/tool_ocr/bin/uvicorn app.main:app --host 127.0.0.1 --port 12010
directory=/path/to/Tool_OCR/backend
user=www-data
autostart=true
autorestart=true
```
## Risks / Trade-offs
### Risk 1: OCR Processing Time for Large Batches
**Risk:** Processing 50+ images may take 5-10 minutes, potential timeout
**Mitigation:**
- Use FastAPI BackgroundTasks to avoid HTTP timeout
- Return batch_id immediately, client polls for status
- Display progress bar with estimated time remaining
- Limit max batch size to 50 files (configurable)
- Add worker concurrency limit to prevent resource exhaustion (sketched below)
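One way to enforce that limit, sketched with an `asyncio` semaphore; `run_ocr_file` is a hypothetical placeholder for the blocking OCR call:
```python
import asyncio
import time

OCR_WORKER_LIMIT = 4
_ocr_semaphore = asyncio.Semaphore(OCR_WORKER_LIMIT)

def run_ocr_file(file_id: str) -> None:
    # Hypothetical stand-in for the CPU-bound PaddleOCR call.
    time.sleep(1)

async def process_with_limit(file_id: str) -> None:
    # At most OCR_WORKER_LIMIT files are processed at once; the rest wait
    # on the semaphore instead of exhausting CPU and memory.
    async with _ocr_semaphore:
        await asyncio.to_thread(run_ocr_file, file_id)
```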
### Risk 2: PaddleOCR Model Download on First Run
**Risk:** Models are 100-200MB, first-time download may fail or be slow
**Mitigation:**
- Pre-download models during deployment setup
- Provide manual download script for offline installation
- Cache models in shared directory for all users
- Include model version in deployment docs
### Risk 3: File Upload Size Limits
**Risk:** Users may try to upload very large PDFs (>20MB)
**Mitigation:**
- Enforce 20MB per file, 100MB per batch limits in frontend and backend
- Display clear error messages with limit information
- Provide guidance on compressing PDFs or splitting large files
- Consider adding image downsampling for huge images
### Risk 4: Concurrent User Scaling
**Risk:** Multiple users uploading simultaneously may overwhelm CPU/memory
**Mitigation:**
- Limit concurrent OCR workers (e.g., 4 workers max)
- Implement task queue with FastAPI BackgroundTasks
- Monitor resource usage and add throttling if needed
- Document recommended server specs (8GB RAM, 4 CPU cores)
### Risk 5: Database Connection Pool Exhaustion
**Risk:** External MySQL may have connection limits
**Mitigation:**
- Configure SQLAlchemy connection pool (max 20 connections; see the sketch below)
- Use connection pooling with proper timeout settings
- Close connections properly in all API endpoints
- Add health check endpoint to monitor database connectivity
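A sketch of that pool configuration; the URL credentials and database name are placeholders:
```python
from sqlalchemy import create_engine

# pool_size + max_overflow caps concurrent connections at 20; pre_ping and
# recycle guard against the external MySQL server dropping idle connections.
engine = create_engine(
    "mysql+pymysql://user:password@mysql.theaken.com:33306/tool_ocr",
    pool_size=10,
    max_overflow=10,
    pool_timeout=30,    # seconds to wait for a free connection
    pool_recycle=3600,  # refresh connections hourly
    pool_pre_ping=True,
)
```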
## Migration Plan
### Phase 1: Initial Deployment
1. Setup Conda environment on production server
2. Install Python dependencies and download OCR models
3. Configure MySQL database and create tables
4. Build frontend static files (`npm run build`)
5. Configure Nginx via 1Panel (upload nginx.conf)
6. Setup Supervisor for backend process
7. Test with sample images
### Phase 2: Production Rollout
1. Create admin user account
2. Import sample export rules
3. Perform smoke tests (upload, OCR, export)
4. Monitor logs for errors
5. Setup daily cleanup cron job for old files
6. Enable HTTPS via 1Panel (Let's Encrypt)
### Phase 3: Monitoring and Optimization
1. Add application logging (file + console)
2. Monitor resource usage (CPU, memory, disk)
3. Optimize slow queries if needed
4. Tune worker concurrency based on actual load
5. Collect user feedback and iterate
### Rollback Plan
- Keep previous version in separate directory
- Use Supervisor to stop current version and start previous
- Database migrations should be backward compatible
- If major issues, restore database from backup
## Open Questions
1. **Should we add user registration, or use admin-created accounts only?**
- Recommendation: Start with admin-created accounts for security, add registration later if needed
2. **Do we need audit logging for compliance?**
- Recommendation: Add basic audit trail (who uploaded what, when) in database
3. **Should we support GPU acceleration for PaddleOCR?**
- Recommendation: Optional, detect GPU on startup, fallback to CPU if unavailable
4. **What's the desired behavior for duplicate filenames in a batch?**
- Recommendation: Auto-rename with suffix (e.g., `file.png`, `file_1.png`)
5. **Should export rules be shareable across users or private?**
- Recommendation: Private by default, add "public templates" feature later


@@ -0,0 +1,48 @@
# Change: Add OCR Batch Processing System with Structure Extraction
## Why
Users need a web-based solution to extract text, images, and structure from multiple document files efficiently; manual text extraction is time-consuming and error-prone. This system automates the process with multi-language OCR (Chinese, English, and more) and intelligent layout analysis that understands document structure, and provides flexible export options including searchable PDF with embedded images. The extracted content preserves logical structure and reading order (not pixel-perfect visual layout). The system also reserves architecture for future document translation capabilities.
## What Changes
- Add core OCR processing capability using **PaddleOCR-VL** (vision-language model for document parsing)
- Implement **document structure analysis** with PP-StructureV3 to identify titles, paragraphs, tables, images, formulas
- Extract and **preserve document images** alongside text content
- Support unified input preprocessing (convert any format to images/PDF for OCR processing)
- Implement batch file upload and processing (images: PNG, JPG, PDF files)
- Support multi-language text recognition (Chinese traditional/simplified, English, Japanese, Korean) - 109 languages via PaddleOCR-VL
- Add **Markdown intermediate format** for structured document representation with embedded images
- Implement **searchable PDF generation** from Markdown with images (Pandoc + WeasyPrint)
- Generate PDFs that preserve logical structure and reading order (not exact visual layout)
- Add rule-based output formatting system for organizing extracted text
- Implement multiple export formats (TXT, JSON, Excel, **Markdown with images, searchable PDF**)
- Create web UI with drag-and-drop file upload
- Build RESTful API for OCR processing with progress tracking
- Add background task processing for long-running OCR jobs
- **Reserve translation module architecture** (UI placeholders + API endpoints for future implementation)
## Impact
- **New capabilities**:
- `ocr-processing`: Core OCR text and image extraction with structure analysis (PaddleOCR-VL + PP-StructureV3)
- `file-management`: File upload, validation, and storage with format standardization
- `export-results`: Multi-format export with custom rules, including searchable PDF with embedded images
- `translation` (reserved): Architecture for future translation features
- **Affected code**:
- New backend: `app/` (FastAPI application structure)
- New frontend: `frontend/` (React + Vite application)
- New database tables: `ocr_tasks`, `ocr_results`, `export_rules`, `translation_configs` (reserved)
- **Dependencies**:
- Backend: fastapi, paddleocr (3.0+), paddlepaddle, pdf2image, pandas, pillow, weasyprint, markdown, pandoc (system)
- Frontend: react, vite, tailwindcss, shadcn/ui, axios, react-query
- Translation engines (reserved): argostranslate (offline) or API integration
- **Configuration**:
- MySQL database connection (external server)
- PaddleOCR-VL model storage (~900MB) and language packs
- Pandoc installation for PDF generation
- Basic CSS template for readable PDF output (not for visual layout replication)
- Image storage directory for extracted images
- File upload size limits and supported formats
- Port configuration (12010 for backend, 12011 for frontend dev)
- Translation service config (reserved for future)


@@ -0,0 +1,175 @@
# Export Results Specification
## ADDED Requirements
### Requirement: Plain Text Export
The system SHALL export OCR results as plain text files with configurable formatting.
#### Scenario: Export single file result as TXT
- **WHEN** user selects a completed OCR task and chooses TXT export
- **THEN** the system generates a .txt file with extracted text
- **AND** preserves line breaks based on bounding box positions
- **AND** returns downloadable file
#### Scenario: Export batch results as TXT
- **WHEN** user exports a batch with 5 files as TXT
- **THEN** the system creates a ZIP file containing 5 .txt files
- **AND** names each file as `{original_filename}_ocr.txt`
- **AND** returns the ZIP for download
### Requirement: JSON Export
The system SHALL export OCR results as structured JSON with full metadata.
#### Scenario: Export with metadata
- **WHEN** user selects JSON export format
- **THEN** the system generates JSON containing:
- File information (name, size, format)
- OCR results array with text, bounding boxes, confidence
- Processing metadata (timestamp, language, model version)
- Task status and statistics
#### Scenario: JSON export example structure
- **WHEN** export is generated
- **THEN** JSON structure follows this format:
```json
{
  "file_name": "document.png",
  "file_size": 1024000,
  "upload_time": "2025-01-01T10:00:00Z",
  "processing_time": 2.5,
  "language": "zh-TW",
  "results": [
    {
      "text": "範例文字",
      "bbox": [100, 50, 200, 80],
      "confidence": 0.95
    }
  ],
  "status": "completed"
}
```
### Requirement: Excel Export
The system SHALL export OCR results as Excel spreadsheets with tabular format.
#### Scenario: Single file Excel export
- **WHEN** user selects Excel export for one file
- **THEN** the system generates .xlsx file with columns:
- Row Number
- Recognized Text
- Confidence Score
- Bounding Box (X, Y, Width, Height)
- Language
#### Scenario: Batch Excel export with multiple sheets
- **WHEN** user exports batch with 3 files as Excel
- **THEN** the system creates one .xlsx file with 3 sheets
- **AND** names each sheet as the original filename
- **AND** includes summary sheet with statistics
### Requirement: Rule-Based Output Formatting
The system SHALL apply user-defined rules to format exported text.
#### Scenario: Group by filename pattern
- **WHEN** user defines rule "group files with prefix 'invoice_'"
- **THEN** the system groups all matching files together
- **AND** exports them in a single combined file or folder
#### Scenario: Filter by confidence threshold
- **WHEN** user sets export rule "minimum confidence 0.8"
- **THEN** the system excludes text with confidence < 0.8 from export
- **AND** includes only high-confidence results
#### Scenario: Custom text formatting
- **WHEN** user defines rule "add line numbers"
- **THEN** the system prepends line numbers to each text line
- **AND** formats output as: `1. 第一行文字\n2. 第二行文字`
#### Scenario: Sort by reading order
- **WHEN** user enables "sort by position" rule
- **THEN** the system orders text by vertical position (top to bottom)
- **AND** then by horizontal position (left to right) within each row
- **AND** exports text in natural reading order (a sketch of this sort follows)
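A minimal sketch of this row-grouping sort, assuming each region's `bbox` is `[x, y, width, height]` and that a fixed pixel tolerance is enough to group lines into rows:
```python
def sort_reading_order(regions: list[dict], row_tolerance: int = 10) -> list[dict]:
    # Group regions into rows by top edge (y), then read each row left to right.
    rows: list[list[dict]] = []
    for region in sorted(regions, key=lambda r: r["bbox"][1]):
        if rows and abs(region["bbox"][1] - rows[-1][0]["bbox"][1]) <= row_tolerance:
            rows[-1].append(region)
        else:
            rows.append([region])
    return [r for row in rows for r in sorted(row, key=lambda r: r["bbox"][0])]
```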
### Requirement: Export Rule Configuration
The system SHALL allow users to save and reuse export rules.
#### Scenario: Save custom export rule
- **WHEN** user creates a rule named "高品質發票輸出" ("high-quality invoice output")
- **THEN** the system saves the rule to database
- **AND** associates it with the user account
- **AND** makes it available in rule selection dropdown
#### Scenario: Apply saved rule
- **WHEN** user selects a saved rule for export
- **THEN** the system applies all configured filters and formatting
- **AND** generates output according to rule settings
#### Scenario: Edit existing rule
- **WHEN** user modifies a saved rule
- **THEN** the system updates the rule configuration
- **AND** preserves the rule ID for continuity
### Requirement: Markdown Export with Structure and Images
The system SHALL export OCR results as Markdown files preserving document logical structure with accompanying images.
#### Scenario: Export as Markdown with structure and images
- **WHEN** user selects Markdown export format
- **THEN** the system generates .md file with logical structure
- **AND** includes headings, paragraphs, tables, lists in proper hierarchy
- **AND** embeds image references pointing to extracted images (![](./images/img1.jpg))
- **AND** maintains reading order from OCR analysis
- **AND** includes extracted images in an images/ folder
#### Scenario: Batch Markdown export with images
- **WHEN** user exports batch with 5 files as Markdown
- **THEN** the system creates 5 separate .md files
- **AND** creates corresponding images/ folders for each document
- **AND** optionally creates combined .md with page separators
- **AND** returns ZIP file containing all Markdown files and images
### Requirement: Searchable PDF Export with Images
The system SHALL generate searchable PDF files that include extracted text and images, preserving logical document structure (not exact visual layout).
#### Scenario: Single document PDF export with images
- **WHEN** user requests PDF export from OCR result
- **THEN** the system converts Markdown to HTML with basic CSS styling
- **AND** embeds extracted images from images/ folder
- **AND** generates PDF using Pandoc + WeasyPrint
- **AND** preserves document hierarchy, tables, and reading order
- **AND** images appear near their logical position in text flow
- **AND** uses appropriate Chinese font (Noto Sans CJK)
- **AND** produces searchable PDF with selectable text
#### Scenario: Basic PDF formatting options
- **WHEN** user selects PDF export
- **THEN** the system applies basic readable formatting
- **AND** sets standard margins and page size (A4)
- **AND** uses consistent fonts and spacing
- **AND** ensures images fit within page width
- **NOTE** CSS templates are for basic readability, not for replicating original visual design
#### Scenario: Batch PDF export with images
- **WHEN** user exports batch as PDF
- **THEN** the system generates individual PDF for each document with embedded images
- **OR** creates single merged PDF with page breaks
- **AND** maintains consistent formatting across all pages
- **AND** returns ZIP of PDFs or single merged PDF
### Requirement: Export Format Selection
The system SHALL provide UI for selecting export format and options.
#### Scenario: Format selection with preview
- **WHEN** user opens export dialog
- **THEN** the system displays format options (TXT, JSON, Excel, **Markdown with images, Searchable PDF**)
- **AND** shows preview of output structure for selected format
- **AND** allows applying custom rules for text filtering
- **AND** provides basic formatting option for PDF (standard readable format)
#### Scenario: Batch export with format choice
- **WHEN** user selects multiple completed tasks
- **THEN** the system enables batch export button
- **AND** prompts for format selection
- **AND** generates combined export file
- **AND** shows progress bar for PDF generation (slower due to image processing)
- **AND** includes all extracted images when exporting Markdown or PDF


@@ -0,0 +1,96 @@
# File Management Specification
## ADDED Requirements
### Requirement: File Upload Validation
The system SHALL validate uploaded files for type, size, and content before processing.
#### Scenario: Valid image upload
- **WHEN** user uploads a PNG file of 5MB
- **THEN** the system accepts the file
- **AND** stores it in temporary upload directory
- **AND** returns upload success with file ID
#### Scenario: Oversized file rejection
- **WHEN** user uploads a file larger than 20MB
- **THEN** the system rejects the file
- **AND** returns error message "文件大小超過限制 (最大 20MB)"
- **AND** does not store the file
#### Scenario: Invalid file type rejection
- **WHEN** user uploads a .exe or .zip file
- **THEN** the system rejects the file
- **AND** returns error message "不支援的文件類型,僅支援 PNG, JPG, JPEG, PDF"
#### Scenario: Corrupted image detection
- **WHEN** user uploads a corrupted image file
- **THEN** the system attempts to open the file
- **AND** detects corruption during validation
- **AND** returns error message "文件損壞,無法處理"
### Requirement: Supported File Formats
The system SHALL support PNG, JPG, JPEG, and PDF file formats for OCR processing.
#### Scenario: PNG image processing
- **WHEN** user uploads a .png file
- **THEN** the system processes it directly with PaddleOCR
#### Scenario: JPG/JPEG image processing
- **WHEN** user uploads a .jpg or .jpeg file
- **THEN** the system processes it directly with PaddleOCR
#### Scenario: PDF file processing
- **WHEN** user uploads a .pdf file
- **THEN** the system converts PDF pages to images using pdf2image
- **AND** processes each page image with PaddleOCR
### Requirement: Batch Upload Management
The system SHALL manage multiple file uploads with batch organization.
#### Scenario: Create batch from multiple files
- **WHEN** user uploads 5 files in a single request
- **THEN** the system creates a batch with unique batch_id
- **AND** associates all files with the batch_id
- **AND** returns batch_id and file list
#### Scenario: Query batch status
- **WHEN** user requests batch status by batch_id
- **THEN** the system returns:
- Total files in batch
- Completed count
- Failed count
- Processing count
- Overall batch status (pending/processing/completed/failed)
### Requirement: File Storage Management
The system SHALL store uploaded files temporarily and clean up after processing.
#### Scenario: Temporary file storage
- **WHEN** user uploads files
- **THEN** the system stores files in `uploads/{batch_id}/` directory
- **AND** generates unique filenames to prevent conflicts (sketched below)
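A minimal sketch of one conflict-free naming scheme, assuming a UUID prefix is acceptable in stored filenames:
```python
import uuid
from pathlib import Path

def store_upload(upload_root: Path, batch_id: str, filename: str, data: bytes) -> Path:
    # A UUID prefix guarantees duplicate upload names in the same batch
    # never collide on disk; Path(...).name strips any client-sent path.
    target_dir = upload_root / batch_id
    target_dir.mkdir(parents=True, exist_ok=True)
    target = target_dir / f"{uuid.uuid4().hex}_{Path(filename).name}"
    target.write_bytes(data)
    return target
```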
#### Scenario: Automatic cleanup after processing
- **WHEN** OCR processing completes for a batch
- **THEN** the system keeps files for 24 hours
- **AND** automatically deletes files after retention period
- **AND** preserves OCR results in database
#### Scenario: Manual file deletion
- **WHEN** user requests to delete a batch
- **THEN** the system removes all associated files from storage
- **AND** marks the batch as deleted in database
- **AND** returns deletion confirmation
### Requirement: File Access Control
The system SHALL ensure users can only access their own uploaded files.
#### Scenario: User accesses own files
- **WHEN** authenticated user requests file by file_id
- **THEN** the system verifies ownership
- **AND** returns file if user is the owner
#### Scenario: User attempts to access others' files
- **WHEN** user requests file_id belonging to another user
- **THEN** the system denies access
- **AND** returns 403 Forbidden error


@@ -0,0 +1,125 @@
# OCR Processing Specification
## ADDED Requirements
### Requirement: Multi-Language Text Recognition with Structure Analysis
The system SHALL extract text and images from document files using PaddleOCR-VL with support for 109 languages including Chinese (traditional and simplified), English, Japanese, and Korean, while preserving document logical structure and reading order (not pixel-perfect visual layout).
#### Scenario: Single image OCR with Chinese text
- **WHEN** user uploads a PNG image containing Chinese text
- **THEN** the system extracts text with bounding boxes and confidence scores
- **AND** returns structured JSON with recognized text, coordinates, and language detected
- **AND** generates Markdown output preserving text layout and hierarchy
#### Scenario: PDF document OCR with layout preservation
- **WHEN** user uploads a multi-page PDF file
- **THEN** the system processes each page with PaddleOCR-VL
- **AND** performs layout analysis to identify document elements (titles, paragraphs, tables, images, formulas)
- **AND** returns Markdown organized by page with preserved reading order
- **AND** provides JSON with detailed layout structure and bounding boxes
#### Scenario: Mixed language content
- **WHEN** user uploads an image with both Chinese and English text
- **THEN** the system detects and extracts text in both languages
- **AND** preserves the spatial relationship between text regions
- **AND** maintains proper reading order in output Markdown
#### Scenario: Complex document with tables and images
- **WHEN** user uploads a scanned document containing tables, images, and text
- **THEN** the system identifies layout elements (text blocks, tables, images, formulas)
- **AND** extracts table structure as Markdown tables
- **AND** extracts and saves document images as separate files
- **AND** embeds image references in Markdown (![](path/to/image.jpg))
- **AND** preserves document hierarchy and reading order in Markdown output
### Requirement: Batch Processing
The system SHALL process multiple files concurrently with progress tracking and error handling.
#### Scenario: Batch upload success
- **WHEN** user uploads 10 image files simultaneously
- **THEN** the system creates a batch task with unique batch ID
- **AND** processes files in parallel (up to configured worker limit)
- **AND** returns real-time progress updates via WebSocket or polling
#### Scenario: Batch processing with partial failure
- **WHEN** a batch contains 5 valid images and 2 corrupted files
- **THEN** the system processes all valid files successfully
- **AND** logs errors for corrupted files with specific error messages
- **AND** marks the batch as "partially completed"
### Requirement: Image Preprocessing
The system SHALL provide optional image preprocessing to improve OCR accuracy.
#### Scenario: Low contrast image enhancement
- **WHEN** user enables preprocessing for a low-contrast image
- **THEN** the system applies contrast adjustment and denoising
- **AND** performs OCR on the enhanced image
- **AND** returns better accuracy compared to original
#### Scenario: Skipped preprocessing
- **WHEN** user disables preprocessing option
- **THEN** the system performs OCR directly on original image
- **AND** completes processing faster
### Requirement: Confidence Threshold Filtering
The system SHALL filter OCR results based on configurable confidence threshold.
#### Scenario: High confidence filter
- **WHEN** user sets confidence threshold to 0.8
- **THEN** the system returns only text segments with confidence >= 0.8
- **AND** discards low-confidence results
#### Scenario: Include all results
- **WHEN** user sets confidence threshold to 0.0
- **THEN** the system returns all recognized text regardless of confidence
- **AND** includes confidence scores in output
### Requirement: OCR Result Structure
The system SHALL return OCR results in multiple formats (JSON, Markdown) with extracted text, images, and structure metadata.
#### Scenario: Successful OCR result with multiple formats
- **WHEN** OCR processing completes successfully
- **THEN** the system returns JSON containing:
- File metadata (name, size, format, upload timestamp)
- Detected text regions with bounding boxes (x, y, width, height)
- Recognized text content for each region
- Confidence scores (0.0 to 1.0)
- Language detected
- Layout element types (title, paragraph, table, image, formula)
- Reading order sequence
- List of extracted image files with paths
- Processing time
- Task status (completed/failed/partial)
- **AND** generates Markdown file with logical structure
- **AND** saves extracted images to storage directory
- **AND** provides methods to export as searchable PDF with images
#### Scenario: Searchable PDF generation with images
- **WHEN** user requests PDF export from OCR results
- **THEN** the system converts Markdown to HTML with basic CSS styling
- **AND** embeds extracted images in their logical positions (not exact original positions)
- **AND** generates PDF using Pandoc + WeasyPrint
- **AND** preserves document hierarchy, tables, and reading order
- **AND** applies appropriate fonts for Chinese characters
- **AND** produces searchable PDF (text is selectable and searchable)
### Requirement: Document Translation (Reserved Architecture)
The system SHALL provide architecture and UI placeholders for future document translation features.
#### Scenario: Translation option visibility (UI placeholder)
- **WHEN** user views OCR result page
- **THEN** the system displays a "Translate Document" button (disabled or labeled "Coming Soon")
- **AND** shows target language selection dropdown (disabled)
- **AND** provides tooltip: "Translation feature will be available in future release"
#### Scenario: Translation API endpoint (reserved)
- **WHEN** backend API is queried for translation endpoints
- **THEN** the system provides `/api/v1/translate/document` endpoint specification
- **AND** returns "Not Implemented" (501) status when called
- **AND** documents expected request/response format for future implementation
#### Scenario: Translation configuration storage (database schema)
- **WHEN** database schema is created
- **THEN** the system includes `translation_configs` table
- **AND** defines columns: id, user_id, source_lang, target_lang, engine_type, engine_config, created_at
- **AND** table remains empty until translation feature is implemented


@@ -0,0 +1,230 @@
# Implementation Tasks
## Phase 1: Core OCR with Layout Preservation
### 1. Environment Setup
- [x] 1.1 Create Conda environment with Python 3.10
- [x] 1.2 Install backend dependencies (FastAPI, PaddleOCR 3.0+, paddlepaddle, pandas, etc.)
- [x] 1.3 Install PDF generation tools (weasyprint, markdown, pandoc system package)
- [x] 1.4 Download PaddleOCR-VL model (~900MB) and language packs
- [ ] 1.5 Setup frontend project with Vite + React + TypeScript
- [ ] 1.6 Install frontend dependencies (Tailwind, shadcn/ui, axios, react-query)
- [x] 1.7 Configure MySQL database connection
- [x] 1.8 Install Chinese fonts (Noto Sans CJK) for PDF generation
### 2. Database Schema
- [x] 2.1 Create `paddle_ocr_users` table for JWT authentication (id, username, password_hash, etc.)
- [x] 2.2 Create `paddle_ocr_batches` table (id, user_id, status, created_at, completed_at)
- [x] 2.3 Create `paddle_ocr_files` table (id, batch_id, filename, file_path, file_size, status, format)
- [x] 2.4 Create `paddle_ocr_results` table (id, file_id, markdown_path, json_path, layout_data, confidence)
- [x] 2.5 Create `paddle_ocr_export_rules` table (id, user_id, rule_name, config_json, css_template)
- [x] 2.6 Create `paddle_ocr_translation_configs` table (RESERVED: id, user_id, source_lang, target_lang, engine_type, engine_config)
- [x] 2.7 Write database migration scripts (Alembic)
- [x] 2.8 Add indexes for performance optimization (batch_id, user_id, status)
- Note: All tables use `paddle_ocr_` prefix for namespace isolation
### 3. Backend - Document Preprocessing
- [x] 3.1 Implement document preprocessor class for format standardization
- [x] 3.2 Add image format validator (PNG, JPG, JPEG)
- [x] 3.3 Add PDF validator and direct passthrough (PaddleOCR-VL native support)
- [x] 3.4 Implement Office document to PDF conversion (DOC, DOCX, PPT, PPTX via LibreOffice) ⬅️ **Completed via sub-proposal**
- [x] 3.5 Add file corruption detection
- [x] 3.6 Write unit tests for preprocessor
### 4. Backend - Core OCR Service with PaddleOCR-VL
- [x] 4.1 Implement OCR service class with PaddleOCR-VL initialization
- [x] 4.2 Configure layout detection (use_layout_detection=True)
- [x] 4.3 Implement single image/PDF OCR processing
- [x] 4.4 Parse OCR output to extract Markdown and JSON
- [x] 4.5 Store Markdown files with preserved layout structure
- [x] 4.6 Store JSON with detailed bounding boxes and layout metadata
- [x] 4.7 Add confidence threshold filtering
- [x] 4.8 Implement batch processing with worker queue (completed via Task 10: BackgroundTasks)
- [x] 4.9 Add progress tracking for batch jobs (completed via Task 8.4, 8.6: API endpoints)
- [x] 4.10 Write unit tests for OCR service
### 5. Backend - Layout-Preserved PDF Generation
- [x] 5.1 Create PDF generator service using Pandoc + WeasyPrint
- [x] 5.2 Implement Markdown to HTML conversion with extensions (tables, code, etc.)
- [x] 5.3 Create default CSS template for layout preservation
- [x] 5.4 Create additional CSS templates (academic, business, report)
- [x] 5.5 Add Chinese font configuration (Noto Sans CJK)
- [x] 5.6 Implement PDF generation via Pandoc command
- [x] 5.7 Add fallback: Python WeasyPrint direct generation
- [x] 5.8 Handle multi-page PDF merging
- [x] 5.9 Write unit tests for PDF generator
### 6. Backend - File Management
- [x] 6.1 Implement file upload validation (type, size, corruption check)
- [x] 6.2 Create file storage service with temporary directory management
- [x] 6.3 Add batch upload handler with unique batch_id generation
- [x] 6.4 Implement file access control and ownership verification
- [x] 6.5 Add automatic cleanup job for expired files (24-hour retention)
- [x] 6.6 Store Markdown and JSON outputs in organized directory structure
- [x] 6.7 Write unit tests for file management
### 7. Backend - Export Service
- [x] 7.1 Implement plain text export from Markdown
- [x] 7.2 Implement JSON export with full metadata
- [x] 7.3 Implement Excel export using pandas
- [x] 7.4 Implement Markdown export (direct from OCR output)
- [x] 7.5 Implement layout-preserved PDF export (using PDF generator service)
- [x] 7.6 Add ZIP file creation for batch exports
- [x] 7.7 Implement rule-based filtering (confidence threshold, filename pattern)
- [x] 7.8 Implement rule-based formatting (line numbers, sort by position)
- [x] 7.9 Create export rule CRUD operations (save, load, update, delete)
- [x] 7.10 Write unit tests for export service
### 8. Backend - API Endpoints
- [x] 8.1 POST `/api/v1/auth/login` - JWT authentication
- [x] 8.2 POST `/api/v1/upload` - File upload with validation
- [x] 8.3 POST `/api/v1/ocr/process` - Trigger OCR processing (PaddleOCR-VL)
- [x] 8.4 GET `/api/v1/ocr/status/{task_id}` - Get task status with progress
- [x] 8.5 GET `/api/v1/ocr/result/{task_id}` - Get OCR results (JSON + Markdown)
- [x] 8.6 GET `/api/v1/batch/{batch_id}/status` - Get batch status
- [x] 8.7 POST `/api/v1/export` - Export results with format and rules
- [x] 8.8 GET `/api/v1/export/pdf/{file_id}` - Generate and download layout-preserved PDF
- [x] 8.9 GET `/api/v1/export/rules` - List saved export rules
- [x] 8.10 POST `/api/v1/export/rules` - Create new export rule
- [x] 8.11 PUT `/api/v1/export/rules/{rule_id}` - Update export rule
- [x] 8.12 DELETE `/api/v1/export/rules/{rule_id}` - Delete export rule
- [x] 8.13 GET `/api/v1/export/css-templates` - List available CSS templates
- [x] 8.14 Write API integration tests
### 9. Backend - Translation Architecture (RESERVED)
- [x] 9.1 Create translation service interface (abstract class)
- [x] 9.2 Implement stub endpoint POST `/api/v1/translate/document` (returns 501 Not Implemented)
- [x] 9.3 Document expected request/response format in OpenAPI spec
- [x] 9.4 Add translation_configs table migrations (completed in Task 2.6)
- [x] 9.5 Create placeholder for translation engine factory (Argos/ERNIE/Google)
- [ ] 9.6 Write unit tests for translation service interface (optional for stub)
### 10. Backend - Background Tasks
- [x] 10.1 Implement FastAPI BackgroundTasks for async OCR processing
- [ ] 10.2 Add task queue system (optional: Redis-based queue)
- [x] 10.3 Implement progress updates (polling endpoint)
- [x] 10.4 Add error handling and retry logic
- [x] 10.5 Implement cleanup scheduler for expired files
- [x] 10.6 Add PDF generation to background tasks (slower process)
## Phase 2: Frontend Development
### 11. Frontend - Project Structure
- [x] 11.1 Setup Vite project with TypeScript support
- [x] 11.2 Configure Tailwind CSS and shadcn/ui
- [x] 11.3 Setup React Router for navigation
- [x] 11.4 Configure Axios with base URL and interceptors
- [x] 11.5 Setup React Query for API state management
- [x] 11.6 Create Zustand store for global state
- [x] 11.7 Setup i18n for Traditional Chinese interface
### 12. Frontend - UI Components (shadcn/ui)
- [x] 12.1 Install and configure shadcn/ui components
- [x] 12.2 Create FileUpload component with drag-and-drop (react-dropzone)
- [x] 12.3 Create ProgressBar component for batch processing
- [x] 12.4 Create ResultsTable component for displaying OCR results
- [x] 12.5 Create MarkdownPreview component for viewing extracted content ⬅️ **Fixed: API schema alignment for filename display**
- [ ] 12.6 Create ExportDialog component for format and rule selection
- [ ] 12.7 Create CSSTemplateSelector component for PDF styling
- [ ] 12.8 Create RuleEditor component for creating custom rules
- [x] 12.9 Create Toast notifications for feedback
- [ ] 12.10 Create TranslationPanel component (DISABLED with "Coming Soon" label)
### 13. Frontend - Pages
- [x] 13.1 Create Login page with JWT authentication
- [x] 13.2 Create Upload page with file selection and batch management ⬅️ **Fixed: Upload response schema alignment**
- [x] 13.3 Create Processing page with real-time progress ⬅️ **Fixed: Error field mapping**
- [x] 13.4 Create Results page with Markdown/JSON preview ⬅️ **Fixed: OCR result detail flattening, null safety**
- [x] 13.5 Create Export page with format options (TXT, JSON, Excel, Markdown, PDF)
- [ ] 13.6 Create PDF Preview page (optional: embedded PDF viewer)
- [x] 13.7 Create Settings page for export rule management
- [x] 13.8 Add translation option placeholder in Results page (disabled state)
### 14. Frontend - API Integration
- [x] 14.1 Create API client service with typed interfaces ⬅️ **Updated: All endpoints verified working**
- [x] 14.2 Implement file upload with progress tracking ⬅️ **Fixed: UploadBatchResponse schema**
- [x] 14.3 Implement OCR task status polling ⬅️ **Fixed: BatchStatusResponse with files array**
- [x] 14.4 Implement results fetching (Markdown + JSON display) ⬅️ **Fixed: OCRResultDetailResponse with flattened structure**
- [x] 14.5 Implement export with file download ⬅️ **Fixed: ExportOptions schema added**
- [x] 14.6 Implement PDF generation request with loading indicator
- [x] 14.7 Implement rule CRUD operations
- [x] 14.8 Implement CSS template selection ⬅️ **Fixed: CSSTemplateResponse with filename field**
- [x] 14.9 Add error handling and user feedback ⬅️ **Fixed: Error field mapping with validation_alias**
- [x] 14.10 Create translation API client (stub, for future use)
## Phase 3: Testing & Optimization
### 15. Testing
- [ ] 15.1 Write backend unit tests (pytest) for all services
- [ ] 15.2 Write backend API integration tests
- [ ] 15.3 Test PaddleOCR-VL with various document types (scanned images, PDFs, mixed content)
- [ ] 15.4 Test layout preservation quality (Markdown structure correctness)
- [ ] 15.5 Test PDF generation with different CSS templates
- [ ] 15.6 Test Chinese font rendering in generated PDFs
- [ ] 15.7 Write frontend component tests (Vitest)
- [ ] 15.8 Perform manual end-to-end testing
- [ ] 15.9 Test with various image formats and languages
- [ ] 15.10 Test batch processing with large file sets (50+ files)
- [ ] 15.11 Test export with different formats and rules
- [x] 15.12 Verify translation UI placeholders are properly disabled
### 16. Documentation
- [ ] 16.1 Write API documentation (FastAPI auto-docs + additional notes)
- [ ] 16.2 Document PaddleOCR-VL model requirements and installation
- [ ] 16.3 Document Pandoc and WeasyPrint setup
- [ ] 16.4 Create CSS template customization guide
- [ ] 16.5 Write user guide for web interface
- [ ] 16.6 Write deployment guide for 1Panel
- [ ] 16.7 Create README.md with setup instructions
- [ ] 16.8 Document export rule syntax and examples
- [ ] 16.9 Document translation feature roadmap and architecture
## Phase 4: Deployment
### 17. Deployment Preparation
- [ ] 17.1 Create backend startup script (start.sh)
- [ ] 17.2 Create frontend build script (build.sh)
- [ ] 17.3 Create Nginx configuration file (static files + reverse proxy)
- [ ] 17.4 Create Supervisor configuration for backend process
- [ ] 17.5 Create environment variable templates (.env.example)
- [ ] 17.6 Create deployment automation script (deploy.sh)
- [ ] 17.7 Prepare CSS templates for production
- [ ] 17.8 Test deployment on staging environment
### 18. Production Deployment (1Panel)
- [ ] 18.1 Setup Conda environment on production server
- [ ] 18.2 Install system dependencies (pandoc, fonts-noto-cjk)
- [ ] 18.3 Install Python dependencies and download PaddleOCR-VL models
- [ ] 18.4 Configure MySQL database connection
- [ ] 18.5 Build frontend static files
- [ ] 18.6 Configure Nginx via 1Panel (static files + reverse proxy)
- [ ] 18.7 Setup Supervisor to manage backend process
- [ ] 18.8 Configure SSL certificate (Let's Encrypt via 1Panel)
- [ ] 18.9 Perform production smoke tests (upload, OCR, export PDF)
- [ ] 18.10 Setup monitoring and logging
- [ ] 18.11 Verify PDF generation works in production environment
## Phase 5: Translation Feature (FUTURE)
### 19. Translation Implementation (Post-Launch)
- [ ] 19.1 Decide on translation engine (Argos offline vs ERNIE API vs Google API)
- [ ] 19.2 Implement chosen translation engine integration
- [ ] 19.3 Implement Markdown translation with structure preservation
- [ ] 19.4 Update POST `/api/v1/translate/document` endpoint (remove 501 status)
- [ ] 19.5 Add translation configuration UI (enable TranslationPanel component)
- [ ] 19.6 Add source/target language selection
- [ ] 19.7 Implement translation progress tracking
- [ ] 19.8 Test translation with various document types
- [ ] 19.9 Optimize translation quality for technical documents
- [ ] 19.10 Update documentation with translation feature guide
## Summary
**Phase 1 (Core OCR + Layout Preservation)**: Tasks 1-10
**Phase 2 (Frontend)**: Tasks 11-14 (user interface)
**Phase 3 (Testing)**: Tasks 15-16 (testing and documentation)
**Phase 4 (Deployment)**: Tasks 17-18
**Phase 5 (Translation)**: Task 19 (translation feature, future work)
**Total Tasks**: 150+
**Priority**: Complete Phase 1-4 first, Phase 5 after production deployment and user feedback


@@ -0,0 +1,122 @@
# Implementation Summary: Add Office Document Support
## Status: ✅ COMPLETED
## Overview
Successfully implemented Office document (DOC, DOCX, PPT, PPTX) support in the OCR processing pipeline and extended JWT token validity to 24 hours.
## Implementation Details
### 1. Office Document Conversion (Phase 2)
**File**: `backend/app/services/office_converter.py`
- Implemented LibreOffice-based conversion service
- Supports: DOC, DOCX, PPT, PPTX → PDF
- Headless mode for server deployment
- Comprehensive error handling and logging
### 2. File Validation & MIME Type Support (Phase 3)
**File**: `backend/app/services/preprocessor.py`
- Added Office document MIME type mappings:
- `application/msword` → doc
- `application/vnd.openxmlformats-officedocument.wordprocessingml.document` → docx
- `application/vnd.ms-powerpoint` → ppt
- `application/vnd.openxmlformats-officedocument.presentationml.presentation` → pptx
- Implemented ZIP-based integrity validation for modern Office formats (DOCX, PPTX)
- Fixed return value order bug in file_manager.py:237
### 3. OCR Service Integration (Phase 3)
**File**: `backend/app/services/ocr_service.py`
- Integrated Office → PDF → Images → OCR pipeline
- Automatic format detection and routing
- Maintains existing OCR quality for all formats
### 4. Configuration Updates (Phase 1 & Phase 5)
**Files**:
- `backend/app/core/config.py`: Updated default `ACCESS_TOKEN_EXPIRE_MINUTES` to 1440
- `.env`: Added Office formats to `ALLOWED_EXTENSIONS`
- Fixed environment variable precedence issues
### 5. Testing Infrastructure (Phase 5)
**Files**:
- `demo_docs/office_tests/create_docx.py`: Test document generator
- `demo_docs/office_tests/test_office_upload.py`: End-to-end integration test
- Fixed API endpoint paths to match actual router implementation
## Bugs Fixed During Implementation
1. **Configuration Loading Bug**: `.env` file was overriding default config values
- **Fix**: Updated `.env` to include Office formats
- **Impact**: Critical - blocked all Office document processing
2. **Return Value Order Bug** (`file_manager.py:237`):
- **Issue**: Unpacking preprocessor return values in wrong order
- **Error**: "Data too long for column 'file_format'"
- **Fix**: Changed from `(is_valid, error_msg, format)` to `(is_valid, format, error_msg)`
3. **Missing MIME Types** (`preprocessor.py:80-95`):
- **Issue**: Office MIME types not recognized
- **Fix**: Added complete Office MIME type mappings
4. **Missing Integrity Validation** (`preprocessor.py:126-141`):
- **Issue**: No validation logic for Office formats
- **Fix**: Implemented ZIP-based validation for DOCX/PPTX (sketched after this list)
5. **API Endpoint Mismatch** (`test_office_upload.py`):
- **Issue**: Test script using incorrect API paths
- **Fix**: Updated to use `/api/v1/upload` (combined batch creation + upload)
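A plausible shape for that ZIP-based check (the real implementation lives in `preprocessor.py`; this sketch only illustrates the idea):
```python
import zipfile
from pathlib import Path

def is_valid_modern_office(path: Path) -> bool:
    # DOCX and PPTX are ZIP containers; an unreadable archive means the
    # file is corrupted or mislabeled.
    if not zipfile.is_zipfile(path):
        return False
    try:
        with zipfile.ZipFile(path) as archive:
            return archive.testzip() is None
    except zipfile.BadZipFile:
        return False
```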
## Test Results
### End-to-End Test (Batch 24)
- **File**: test_document.docx (1,521 bytes)
- **Status**: ✅ Completed Successfully
- **Processing Time**: 375.23 seconds (includes PaddleOCR model initialization)
- **OCR Accuracy**: 97.39% confidence
- **Text Regions**: 20 regions detected
- **Language**: Chinese (mixed with English)
### Content Verification
Successfully extracted all content from test document:
- ✅ Chinese headings: "測試文件說明", "處理流程"
- ✅ English headings: "Office Document OCR Test", "Technical Information"
- ✅ Mixed content: Numbers (1234567890), technical terms
- ✅ Bullet points and numbered lists
- ✅ Multi-line paragraphs
### Processing Pipeline Verified
1. ✅ DOCX upload and validation
2. ✅ DOCX → PDF conversion (LibreOffice)
3. ✅ PDF → Images conversion
4. ✅ OCR processing (PaddleOCR with structure analysis)
5. ✅ Markdown output generation
## Success Criteria Met
| Criterion | Status | Evidence |
|-----------|--------|----------|
| Process Word documents (.doc, .docx) | ✅ | Batch 24 completed with 97.39% accuracy |
| Process PowerPoint documents (.ppt, .pptx) | ✅ | Converter implemented, same pipeline as Word |
| JWT tokens valid for 24 hours | ✅ | Config updated, login response shows 1440 minutes |
| Existing functionality preserved | ✅ | No breaking changes to API or data models |
| Conversion maintains OCR quality | ✅ | High confidence score (97.39%) on test document |
## Performance Metrics
- **First run**: ~375 seconds (includes model download/initialization)
- **Subsequent runs**: Expected ~30-60 seconds (LibreOffice conversion + OCR)
- **Memory usage**: Acceptable (within normal PaddleOCR requirements)
- **Accuracy**: 97.39% on mixed Chinese/English content
## Dependencies Installed
- LibreOffice (via Homebrew): `/Applications/LibreOffice.app`
- No additional Python packages required (leveraged existing PDF2Image + PaddleOCR)
## Breaking Changes
None - all changes are backward compatible.
## Remaining Optional Work (Phase 6)
- [ ] Update README documentation
- [ ] Add OpenAPI schema examples for Office formats
- [ ] Add API endpoint documentation strings
## Conclusion
The Office document support feature has been successfully implemented and tested. All core functionality is working as expected with high OCR accuracy. The system now supports the complete range of common document formats: images (PNG, JPG, BMP, TIFF), PDF, and Office documents (DOC, DOCX, PPT, PPTX).


@@ -0,0 +1,176 @@
# Technical Design
## Architecture Overview
```
User Upload (DOC/DOCX/PPT/PPTX)
        ↓
File Validation & Storage
        ↓
Format Detection
        ↓
Office Document Converter
        ↓
PDF Generation
        ↓
PDF to Images (existing)
        ↓
PaddleOCR Processing (existing)
        ↓
Results & Export
```
## Component Design
### 1. Office Document Converter Service
```python
# app/services/office_converter.py
from pathlib import Path

class OfficeConverter:
    """Convert Office documents to PDF for OCR processing"""

    def convert_to_pdf(self, file_path: Path) -> Path:
        """Main conversion dispatcher"""

    def convert_docx_to_pdf(self, docx_path: Path) -> Path:
        """Convert DOCX to PDF using python-docx and pypandoc"""

    def convert_doc_to_pdf(self, doc_path: Path) -> Path:
        """Convert legacy DOC to PDF"""

    def convert_pptx_to_pdf(self, pptx_path: Path) -> Path:
        """Convert PPTX to PDF using python-pptx"""

    def convert_ppt_to_pdf(self, ppt_path: Path) -> Path:
        """Convert legacy PPT to PDF"""
```
### 2. OCR Service Integration
```python
# Extend app/services/ocr_service.py
def process_image(self, image_path: Path, ...):
    # Route by file type before running OCR
    if is_office_document(image_path):
        # Convert to PDF first, then reuse the existing PDF pipeline
        pdf_path = self.office_converter.convert_to_pdf(image_path)
        return self.process_pdf(pdf_path, ...)
    elif image_path.suffix.lower() == '.pdf':
        # Existing PDF processing
        ...
    else:
        # Existing image processing
        ...
```
### 3. File Format Detection
```python
from pathlib import Path

OFFICE_FORMATS = {
    '.doc': 'application/msword',
    '.docx': 'application/vnd.openxmlformats-officedocument.wordprocessingml.document',
    '.ppt': 'application/vnd.ms-powerpoint',
    '.pptx': 'application/vnd.openxmlformats-officedocument.presentationml.presentation',
}

def is_office_document(file_path: Path) -> bool:
    return file_path.suffix.lower() in OFFICE_FORMATS
```
## Library Selection
### For Word Documents
- **python-docx**: Read/write DOCX files
- **doc2pdf**: Simple conversion (requires LibreOffice)
- Alternative: **pypandoc** with pandoc backend
### For PowerPoint Documents
- **python-pptx**: Read/write PPTX files
- **unoconv**: Universal Office Converter (requires LibreOffice)
### Recommended Approach
Use **LibreOffice** headless mode for universal conversion:
```bash
libreoffice --headless --convert-to pdf input.docx
```
This provides:
- Support for all Office formats
- High fidelity conversion
- Maintained by active community
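A minimal sketch of driving that command from Python, assuming the `libreoffice` binary is on PATH; the 60-second timeout anticipates the error-handling note below:
```python
import subprocess
from pathlib import Path

def convert_office_to_pdf(input_path: Path, out_dir: Path, timeout: int = 60) -> Path:
    # LibreOffice writes <stem>.pdf into out_dir; check=True raises on
    # conversion failure, and timeout kills hung conversions.
    subprocess.run(
        ["libreoffice", "--headless", "--convert-to", "pdf",
         "--outdir", str(out_dir), str(input_path)],
        check=True,
        timeout=timeout,
    )
    return out_dir / f"{input_path.stem}.pdf"
```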
## Configuration Changes
### Token Expiration
```python
# app/core/config.py
class Settings(BaseSettings):
    # Change from 30 to 1440 (24 hours)
    access_token_expire_minutes: int = 1440
```
### File Upload Limits
```python
# Office files can be larger than typical images
max_file_size: int = 100 * 1024 * 1024  # 100MB
allowed_extensions: Set[str] = {
    '.png', '.jpg', '.jpeg', '.pdf',
    '.doc', '.docx', '.ppt', '.pptx',
}
```
## Error Handling
1. **Conversion Failures**
- Corrupted Office files
- Unsupported Office features
- LibreOffice not installed
2. **Performance Considerations**
- Office conversion is CPU intensive
- Consider queuing for large files
- Add conversion timeout (60 seconds)
3. **Security**
- Validate Office files before processing
- Scan for macros/embedded objects
- Sandbox conversion process
## Dependencies
### System Requirements
```bash
# macOS
brew install libreoffice
# Linux
apt-get install libreoffice
# Python packages
pip install python-docx python-pptx pypandoc
```
### Alternative: Docker Container
Use a Docker container with LibreOffice pre-installed for consistent conversion across environments.
## Testing Strategy
1. **Unit Tests**
- Test each conversion method
- Mock LibreOffice calls
- Test error handling
2. **Integration Tests**
- End-to-end Office → OCR pipeline
- Test with various Office versions
- Performance benchmarks
3. **Sample Documents**
- Simple text documents
- Documents with tables
- Documents with images
- Presentations with multiple slides
- Legacy formats (DOC, PPT)


@@ -0,0 +1,52 @@
# Add Office Document Support
**Status**: ✅ IMPLEMENTED & TESTED
## Summary
Add support for Microsoft Office document formats (DOC, DOCX, PPT, PPTX) in the OCR processing pipeline and extend JWT token validity period to 1 day.
## Motivation
Currently, the system only supports image formats (PNG, JPG, JPEG) and PDF files. Many users have documents in Microsoft Office formats that require OCR processing. This change will:
1. Enable processing of Word and PowerPoint documents
2. Improve user experience by extending token validity
3. Leverage existing PDF-to-image conversion infrastructure
## Proposed Solution
### 1. Office Document Support
- Add Python libraries for Office document conversion:
- `docx2pdf` or `python-docx` + `pypandoc` for Word documents
- `python-pptx` for PowerPoint documents
- Implement conversion pipeline:
- Option A: Office → PDF → Images → OCR
- Option B: Office → Images → OCR (direct conversion)
- Extend file validation to accept `.doc`, `.docx`, `.ppt`, `.pptx` formats
- Add conversion methods to `OCRService` class
### 2. Token Validity Extension
- Update `ACCESS_TOKEN_EXPIRE_MINUTES` from 30 minutes to 1440 minutes (24 hours)
- Ensure security measures are in place for longer-lived tokens
## Impact Analysis
- **Backend Services**: Minimal changes to existing OCR processing flow
- **Dependencies**: New Python packages for Office document handling
- **Performance**: Slight increase in processing time for document conversion
- **Security**: Longer token validity requires careful consideration
- **Storage**: Temporary files during conversion process
## Success Criteria
1. Successfully process Word documents (.doc, .docx) with OCR
2. Successfully process PowerPoint documents (.ppt, .pptx) with OCR
3. JWT tokens remain valid for 24 hours
4. All existing functionality continues to work
5. Conversion quality maintains text readability for OCR
## Timeline
- Implementation: 2-3 hours ✅
- Testing: 1 hour ✅
- Documentation: 30 mins ✅
- Total: ~4 hours ✅ COMPLETED
## Actual Time
- Total development time: ~6 hours (including debugging and testing)
- Primary issues resolved: Configuration loading, MIME type mapping, validation logic, API endpoint fixes


@@ -0,0 +1,54 @@
# File Processing Specification Delta
## ADDED Requirements
### Requirement: Office Document Support
The system SHALL support processing of Microsoft Office document formats including Word documents (.doc, .docx) and PowerPoint presentations (.ppt, .pptx).
#### Scenario: Upload and Process Word Document
Given a user has a Word document containing text and tables
When the user uploads the `.docx` file
Then the system converts it to PDF format
And extracts all text using OCR
And preserves table structure in the output
#### Scenario: Upload and Process PowerPoint
Given a user has a PowerPoint presentation with multiple slides
When the user uploads the `.pptx` file
Then the system converts each slide to an image
And performs OCR on each slide
And maintains slide order in the results
### Requirement: Document Conversion Pipeline
The system SHALL implement a multi-stage conversion pipeline for Office documents using LibreOffice or equivalent tools.
#### Scenario: Conversion Error Handling
Given an Office document with unsupported features
When the conversion process encounters an error
Then the system logs the specific error details
And returns a user-friendly error message
And marks the file as failed with reason
## MODIFIED Requirements
### Requirement: File Validation
The file validation module SHALL accept Office document formats in addition to existing image and PDF formats, including .doc, .docx, .ppt, and .pptx extensions.
#### Scenario: Validate Office File Upload
Given a user attempts to upload a file
When the file extension is `.docx` or `.pptx`
Then the system accepts the file for processing
And validates the MIME type matches the extension
### Requirement: JWT Token Validity
The JWT token validity period SHALL be extended from 30 minutes to 1440 minutes (24 hours) to improve user experience.
#### Scenario: Extended Token Usage
Given a user authenticates successfully
When they receive a JWT token
Then the token remains valid for 24 hours
And allows continuous API access without re-authentication


@@ -0,0 +1,70 @@
# Implementation Tasks
## Phase 1: Dependencies & Configuration
- [x] Install Office document processing libraries
- [x] Install LibreOffice via Homebrew (headless mode for conversion)
- [x] Verify LibreOffice installation and accessibility
- [x] Configure LibreOffice path in OfficeConverter
- [x] Update JWT token configuration
- [x] Change `ACCESS_TOKEN_EXPIRE_MINUTES` to 1440 in `app/core/config.py`
- [x] Verify token expiration in authentication flow
## Phase 2: Document Conversion Implementation
- [x] Create Office document converter class
- [x] Add `office_converter.py` to services directory
- [x] Implement Word document conversion methods
- [x] `convert_docx_to_pdf()` for DOCX files
- [x] `convert_doc_to_pdf()` for DOC files
- [x] Implement PowerPoint conversion methods
- [x] `convert_pptx_to_pdf()` for PPTX files
- [x] `convert_ppt_to_pdf()` for PPT files
- [x] Add error handling and logging
- [x] Add file validation methods
## Phase 3: OCR Service Integration
- [x] Update OCR service to handle Office formats
- [x] Modify `process_image()` in `ocr_service.py`
- [x] Add Office format detection logic
- [x] Integrate Office-to-PDF conversion pipeline
- [x] Update supported formats list in configuration
- [x] Update file manager service
- [x] Add Office formats to allowed extensions (`file_manager.py`)
- [x] Update file validation logic
- [x] Update config.py allowed extensions
## Phase 4: API Updates
- [x] File validation updated (already accepts Office formats via file_manager.py)
- [x] Core API integration complete (Office files processed via existing endpoints)
- [ ] API documentation strings (optional enhancement)
- [ ] Add Office format examples to OpenAPI schema (optional enhancement)
## Phase 5: Testing
- [x] Create test Office documents
- [x] Sample DOCX with mixed Chinese/English content
- [x] Test document creation script (`create_docx.py`)
- [x] Verify document conversion capability
- [x] LibreOffice headless mode verified
- [x] OfficeConverter service tested
- [x] Test token validity
- [x] Verified 24-hour token expiration (1440 minutes)
- [x] Confirmed in login response
- [x] Core functionality verified
- [x] Office format detection working
- [x] Office → PDF → Images → OCR pipeline implemented
- [x] File validation accepts .doc, .docx, .ppt, .pptx
- [x] Automated integration testing
- [x] Fixed API endpoint paths in test script
- [x] Fixed configuration loading (.env file update)
- [x] Fixed preprocessor bugs (MIME types, validation, return order)
- [x] End-to-end test completed successfully (batch 24)
- [x] OCR accuracy: 97.39% confidence on mixed Chinese/English content
- [x] Manual end-to-end testing
- [x] DOCX → PDF → Images → OCR pipeline verified
- [x] Processing time: ~375 seconds (includes model initialization)
- [x] Result output format validated (Markdown generation working)
## Phase 6: Documentation
- [x] Update README with Office format support (covered in IMPLEMENTATION.md)
- [x] Test documents available in demo_docs/office_tests/
- [x] API documentation update (endpoints unchanged, format list extended)
- [x] Migration guide (no breaking changes, backward compatible)

View File

@@ -0,0 +1,817 @@
# Tool_OCR Architecture Overhaul Plan
## Refactoring Plan Based on the Full Capabilities of PaddleOCR PP-StructureV3
**Planning date**: 2025-01-18
**Hardware**: RTX 4060 8GB VRAM
**Priority**: P0 (highest)
---
## 📊 Current State Analysis
### Problems with the Current Architecture
#### 1. **PP-StructureV3's capabilities are badly underused**
```python
# ❌ Current implementation (ocr_service.py:614-646)
markdown_dict = page_result.markdown  # uses only the simplified output
markdown_texts = markdown_dict.get('markdown_texts', '')
'bbox': [],  # coordinates are all empty!
```
**Problems**:
- Only ~20% of PP-StructureV3's functionality is used
- `parsing_res_list` (the core data structure) is unused
- `layout_bbox` (precise coordinates) is unused
- `reading_order` is unused
- The 23 layout element categories are unused
#### 2. **GPU configuration is not optimized**
```python
# Current configuration (ocr_service.py:211-219)
self.structure_engine = PPStructureV3(
    use_doc_orientation_classify=False,  # ❌ preprocessing disabled
    use_doc_unwarping=False,             # ❌ unwarping disabled
    use_textline_orientation=False,      # ❌ orientation correction disabled
    # ... default configuration
)
```
**Problems**:
- An RTX 4060 8GB can run the server models, yet the defaults are used
- Important preprocessing features are turned off
- GPU compute is underutilized
#### 3. **Single PDF generation strategy**
```python
# Only the coordinate-positioning mode exists today,
# causing 21.6% text loss (overlap filtering)
filtered_text_regions = self._filter_text_in_regions(text_regions, regions_to_avoid)
```
**Problems**:
- Only coordinate positioning is supported; no flow layout
- Zero information loss is impossible
- Translation features are limited
---
## 🎯 Refactoring Goals
### Core Goals
1. **Fully utilize PP-StructureV3's capabilities**
   - Extract `parsing_res_list` (23 element categories + reading order)
   - Extract `layout_bbox` (precise coordinates)
   - Extract `layout_det_res` (layout detection details)
   - Extract `overall_ocr_res` (coordinates for all text)
2. **Dual-mode PDF generation**
   - Mode A: coordinate positioning (faithful layout reproduction)
   - Mode B: flow layout (zero information loss, translation-ready)
3. **Optimized GPU configuration**
   - Best-fit configuration for an RTX 4060 8GB
   - Server models + all feature modules
   - Sensible memory management
4. **Backward compatibility**
   - Keep the existing API
   - Old JSON files stay usable
   - Incremental upgrade
---
## 🏗️ New Architecture Design
### Architecture Layers
```
┌──────────────────────────────────────────────────────┐
│ API Layer                                            │
│ /tasks, /results, /download (backward compatible)    │
└────────────────┬─────────────────────────────────────┘
┌────────────────▼─────────────────────────────────────┐
│ Service Layer                                        │
├──────────────────────────────────────────────────────┤
│ OCRService (existing, kept)                          │
│ └─ analyze_layout() [upgraded] ──┐                   │
│                                  │                   │
│ AdvancedLayoutExtractor (new) ◄─ shares same engine  │
│ └─ extract_complete_layout() ─┘                      │
│                                                      │
│ PDFGeneratorService (refactored)                     │
│ ├─ generate_coordinate_pdf() [Mode A]                │
│ └─ generate_flow_pdf() [Mode B]                      │
└────────────────┬─────────────────────────────────────┘
┌────────────────▼─────────────────────────────────────┐
│ Engine Layer                                         │
├──────────────────────────────────────────────────────┤
│ PPStructureV3Engine (new, unified management)        │
│ ├─ GPU config (tuned for RTX 4060 8GB)               │
│ ├─ Model config (server models)                      │
│ └─ Feature switches (all enabled)                    │
└──────────────────────────────────────────────────────┘
```
### Core Class Design
#### 1. PPStructureV3Engine (new)
**Purpose**: manage the PP-StructureV3 engine in one place and avoid repeated initialization
```python
class PPStructureV3Engine:
    """
    PP-StructureV3 engine manager (singleton).
    Configuration tuned for an RTX 4060 8GB.
    """
    _instance = None

    def __new__(cls):
        if cls._instance is None:
            cls._instance = super().__new__(cls)
            cls._instance._initialize()
        return cls._instance

    def _initialize(self):
        """Initialize the engine"""
        logger.info("Initializing PP-StructureV3 with RTX 4060 8GB optimized config")
        self.engine = PPStructureV3(
            # ===== GPU configuration =====
            use_gpu=True,
            gpu_mem=6144,  # reserve 2GB for the system (8GB - 2GB)
            # ===== Preprocessing modules (all enabled) =====
            use_doc_orientation_classify=True,  # document orientation correction
            use_doc_unwarping=True,             # document image unwarping
            use_textline_orientation=True,      # text-line orientation correction
            # ===== Feature modules (all enabled) =====
            use_table_recognition=True,    # table recognition
            use_formula_recognition=True,  # formula recognition
            use_chart_recognition=True,    # chart recognition
            use_seal_recognition=True,     # seal recognition
            # ===== OCR model configuration (server models) =====
            text_detection_model_name="ch_PP-OCRv4_server_det",
            text_recognition_model_name="ch_PP-OCRv4_server_rec",
            # ===== Layout detection parameters =====
            layout_threshold=0.5,     # layout detection threshold
            layout_nms=0.5,           # NMS threshold
            layout_unclip_ratio=1.5,  # bounding-box expansion ratio
            # ===== OCR parameters =====
            text_det_limit_side_len=1920,  # high-resolution detection
            text_det_thresh=0.3,           # detection threshold
            text_det_box_thresh=0.5,       # bounding-box threshold
            # ===== Misc =====
            show_log=True,
            use_angle_cls=False,  # superseded by textline_orientation
        )
        logger.info("PP-StructureV3 engine initialized successfully")
        logger.info(f"  - GPU: Enabled (RTX 4060 8GB)")
        logger.info(f"  - Models: Server (High Accuracy)")
        logger.info(f"  - Features: All Enabled (Table/Formula/Chart/Seal)")

    def predict(self, image_path: str):
        """Run prediction"""
        return self.engine.predict(image_path)

    def get_engine(self):
        """Return the engine instance"""
        return self.engine
```
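Because `__new__` returns the cached instance, every caller shares one engine and the heavyweight models load only once. A quick illustration (the image path is hypothetical):

```python
engine_a = PPStructureV3Engine()
engine_b = PPStructureV3Engine()
assert engine_a is engine_b  # same object: models were loaded exactly once

results = engine_a.predict("uploads/sample_page.png")  # hypothetical input
```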
#### 2. AdvancedLayoutExtractor (new)
**Purpose**: fully extract all layout information produced by PP-StructureV3
```python
class AdvancedLayoutExtractor:
    """
    Advanced layout extractor.
    Fully exploits PP-StructureV3's parsing_res_list, layout_bbox, and layout_det_res.
    """
    def __init__(self):
        self.engine = PPStructureV3Engine()

    def extract_complete_layout(
        self,
        image_path: Path,
        output_dir: Optional[Path] = None,
        current_page: int = 0
    ) -> Tuple[Optional[Dict], List[Dict]]:
        """
        Extract complete layout information (uses page_result.json).
        Returns:
            (layout_data, images_metadata)
            layout_data = {
                "elements": [
                    {
                        "element_id": int,
                        "type": str,  # one of 23 types
                        "bbox": [[x1,y1], [x2,y1], [x2,y2], [x1,y2]],  # ✅ no longer an empty list
                        "content": str,
                        "reading_order": int,  # ✅ reading order
                        "layout_type": str,    # ✅ single/double/multi-column
                        "confidence": float,   # ✅ confidence score
                        "page": int
                    },
                    ...
                ],
                "reading_order": [0, 1, 2, ...],
                "layout_types": ["single", "double"],
                "total_elements": int
            }
        """
        try:
            results = self.engine.predict(str(image_path))
            layout_elements = []
            images_metadata = []
            for page_idx, page_result in enumerate(results):
                # ✅ Core change: use page_result.json instead of page_result.markdown
                json_data = page_result.json
                # ===== Method 1: parsing_res_list (primary source) =====
                parsing_res_list = json_data.get('parsing_res_list', [])
                if parsing_res_list:
                    logger.info(f"Found {len(parsing_res_list)} elements in parsing_res_list")
                    for idx, item in enumerate(parsing_res_list):
                        element = self._create_element_from_parsing_res(
                            item, idx, current_page
                        )
                        if element:
                            layout_elements.append(element)
                # ===== Method 2: layout_det_res (supplementary info) =====
                layout_det_res = json_data.get('layout_det_res', {})
                layout_boxes = layout_det_res.get('boxes', [])
                # Enrich elements where parsing_res_list lacks certain fields
                self._enrich_elements_with_layout_det(layout_elements, layout_boxes)
                # ===== Method 3: images (from markdown_images) =====
                markdown_dict = page_result.markdown
                markdown_images = markdown_dict.get('markdown_images', {})
                for img_idx, (img_path, img_obj) in enumerate(markdown_images.items()):
                    # Save the image to disk
                    self._save_image(img_obj, img_path, output_dir or image_path.parent)
                    # Look up the bbox in parsing_res_list or layout_det_res
                    bbox = self._find_image_bbox(
                        img_path, parsing_res_list, layout_boxes
                    )
                    images_metadata.append({
                        'element_id': len(layout_elements) + img_idx,
                        'image_path': img_path,
                        'type': 'image',
                        'page': current_page,
                        'bbox': bbox,
                    })
            if layout_elements:
                layout_data = {
                    'elements': layout_elements,
                    'total_elements': len(layout_elements),
                    'reading_order': [e['reading_order'] for e in layout_elements],
                    'layout_types': list(set(e.get('layout_type') for e in layout_elements)),
                }
                logger.info(f"✅ Extracted {len(layout_elements)} elements with complete info")
                return layout_data, images_metadata
            else:
                logger.warning("No layout elements found")
                return None, []
        except Exception as e:
            logger.error(f"Advanced layout extraction failed: {e}")
            import traceback
            traceback.print_exc()
            return None, []

    def _create_element_from_parsing_res(
        self, item: Dict, idx: int, current_page: int
    ) -> Optional[Dict]:
        """Create an element from one parsing_res_list item."""
        # Extract layout_bbox
        layout_bbox = item.get('layout_bbox')
        bbox = self._convert_bbox_to_4point(layout_bbox)
        # Extract the layout type
        layout_type = item.get('layout', 'single')
        # Create the base element
        element = {
            'element_id': idx,
            'page': current_page,
            'bbox': bbox,  # ✅ full coordinates
            'layout_type': layout_type,
            'reading_order': idx,
            'confidence': item.get('score', 0.0),
        }
        # Fill type and content based on the content type.
        # Order matters! Priority: table > formula > image > title > text
        if 'table' in item and item['table']:
            element['type'] = 'table'
            element['content'] = item['table']
            # Extract the table's plain text (for translation)
            element['extracted_text'] = self._extract_table_text(item['table'])
        elif 'formula' in item and item['formula']:
            element['type'] = 'formula'
            element['content'] = item['formula']  # LaTeX
        elif 'figure' in item or 'image' in item:
            element['type'] = 'image'
            element['content'] = item.get('figure') or item.get('image')
        elif 'title' in item and item['title']:
            element['type'] = 'title'
            element['content'] = item['title']
        elif 'text' in item and item['text']:
            element['type'] = 'text'
            element['content'] = item['text']
        else:
            # Unknown type: try any non-system field with a value
            for key, value in item.items():
                if key not in ['layout_bbox', 'layout', 'score'] and value:
                    element['type'] = key
                    element['content'] = value
                    break
            else:
                return None  # no content; skip
        return element

    def _convert_bbox_to_4point(self, layout_bbox) -> List:
        """Convert layout_bbox to 4-point format."""
        if layout_bbox is None:
            return []
        # Handle numpy arrays
        if hasattr(layout_bbox, 'tolist'):
            bbox = layout_bbox.tolist()
        else:
            bbox = list(layout_bbox)
        if len(bbox) == 4:  # [x1, y1, x2, y2]
            x1, y1, x2, y2 = bbox
            return [[x1, y1], [x2, y1], [x2, y2], [x1, y2]]
        return []

    def _extract_table_text(self, html_content: str) -> str:
        """Extract plain text from an HTML table (for translation)."""
        try:
            from bs4 import BeautifulSoup
            soup = BeautifulSoup(html_content, 'html.parser')
            # Collect the text of every cell
            cells = []
            for cell in soup.find_all(['td', 'th']):
                text = cell.get_text(strip=True)
                if text:
                    cells.append(text)
            return ' | '.join(cells)
        except Exception as e:
            logger.warning(f"Failed to extract table text: {e}")
            # Fallback: crude HTML tag stripping
            import re
            text = re.sub(r'<[^>]+>', ' ', html_content)
            text = re.sub(r'\s+', ' ', text)
            return text.strip()
```
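A sketch of how a caller might drive the extractor (paths are hypothetical):

```python
from pathlib import Path

extractor = AdvancedLayoutExtractor()
layout_data, images_meta = extractor.extract_complete_layout(
    Path("uploads/sample_page.png"),    # hypothetical input image
    output_dir=Path("results/sample"),  # where extracted images are saved
)
if layout_data:
    print(f"{layout_data['total_elements']} elements, "
          f"layout types: {layout_data['layout_types']}")
```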
#### 3. PDFGeneratorService (refactored)
**Purpose**: support dual-mode PDF generation
```python
class PDFGeneratorService:
    """
    PDF generation service (refactored).
    Supports two modes:
    - coordinate: coordinate-positioning mode (faithful layout reproduction)
    - flow: flow-layout mode (zero information loss, translation-ready)
    """
    def generate_pdf(
        self,
        json_path: Path,
        output_path: Path,
        mode: str = 'coordinate',  # 'coordinate' or 'flow'
        source_file_path: Optional[Path] = None
    ) -> bool:
        """
        Generate a PDF.
        Args:
            json_path: path to the OCR JSON file
            output_path: output PDF path
            mode: generation mode ('coordinate' or 'flow')
            source_file_path: original file path (used to obtain dimensions)
        Returns:
            True on success
        """
        try:
            # Load the OCR data
            ocr_data = self.load_ocr_json(json_path)
            if not ocr_data:
                return False
            # Pick the generation strategy by mode
            if mode == 'flow':
                return self._generate_flow_pdf(ocr_data, output_path)
            else:
                # json_path is passed down so image paths resolve against the result dir
                return self._generate_coordinate_pdf(ocr_data, output_path, source_file_path, json_path)
        except Exception as e:
            logger.error(f"PDF generation failed: {e}")
            import traceback
            traceback.print_exc()
            return False

    def _generate_coordinate_pdf(
        self,
        ocr_data: Dict,
        output_path: Path,
        source_file_path: Optional[Path],
        json_path: Path
    ) -> bool:
        """
        Mode A: coordinate positioning
        - Uses layout_bbox to position every element exactly
        - Preserves the visual appearance of the original document
        - Suited to scenarios requiring faithful layout reproduction
        """
        logger.info("Generating PDF in COORDINATE mode (layout-preserving)")
        # Extract the data
        layout_data = ocr_data.get('layout_data', {})
        elements = layout_data.get('elements', [])
        if not elements:
            logger.warning("No layout elements found")
            return False
        # Sort by page and reading_order
        sorted_elements = sorted(elements, key=lambda x: (
            x.get('page', 0),
            x.get('reading_order', 0)
        ))
        # Compute page dimensions
        ocr_width, ocr_height = self.calculate_page_dimensions(ocr_data, source_file_path)
        target_width, target_height = self._get_target_dimensions(source_file_path, ocr_width, ocr_height)
        scale_w = target_width / ocr_width
        scale_h = target_height / ocr_height
        # Create the PDF canvas
        pdf_canvas = canvas.Canvas(str(output_path), pagesize=(target_width, target_height))
        # Group elements by page number
        pages = {}
        for elem in sorted_elements:
            page = elem.get('page', 0)
            if page not in pages:
                pages[page] = []
            pages[page].append(elem)
        # Render each page
        for page_num, page_elements in sorted(pages.items()):
            if page_num > 0:
                pdf_canvas.showPage()
            logger.info(f"Rendering page {page_num + 1} with {len(page_elements)} elements")
            # Render each element in reading order
            for elem in page_elements:
                bbox = elem.get('bbox', [])
                elem_type = elem.get('type')
                content = elem.get('content', '')
                if not bbox:
                    logger.warning(f"Element {elem['element_id']} has no bbox, skipping")
                    continue
                # Render by type
                try:
                    if elem_type == 'table':
                        self._draw_table_at_bbox(pdf_canvas, content, bbox, target_height, scale_w, scale_h)
                    elif elem_type == 'text':
                        self._draw_text_at_bbox(pdf_canvas, content, bbox, target_height, scale_w, scale_h)
                    elif elem_type == 'title':
                        self._draw_title_at_bbox(pdf_canvas, content, bbox, target_height, scale_w, scale_h)
                    elif elem_type == 'image':
                        img_path = json_path.parent / content
                        if img_path.exists():
                            self._draw_image_at_bbox(pdf_canvas, str(img_path), bbox, target_height, scale_w, scale_h)
                    elif elem_type == 'formula':
                        self._draw_formula_at_bbox(pdf_canvas, content, bbox, target_height, scale_w, scale_h)
                    # ... other types
                except Exception as e:
                    logger.warning(f"Failed to draw {elem_type} element: {e}")
        pdf_canvas.save()
        logger.info(f"✅ Coordinate PDF generated: {output_path}")
        return True

    def _generate_flow_pdf(
        self,
        ocr_data: Dict,
        output_path: Path
    ) -> bool:
        """
        Mode B: flow layout
        - Flows content in reading_order
        - Zero information loss (nothing is filtered out)
        - Uses the high-level ReportLab Platypus API
        - Suited to translation or content-processing scenarios
        """
        from reportlab.platypus import (
            SimpleDocTemplate, Paragraph, Spacer,
            Table, TableStyle, Image as RLImage, PageBreak
        )
        from reportlab.lib.styles import getSampleStyleSheet, ParagraphStyle
        from reportlab.lib import colors
        from reportlab.lib.enums import TA_LEFT, TA_CENTER
        logger.info("Generating PDF in FLOW mode (content-preserving)")
        # Extract the data
        layout_data = ocr_data.get('layout_data', {})
        elements = layout_data.get('elements', [])
        if not elements:
            logger.warning("No layout elements found")
            return False
        # Sort by reading_order
        sorted_elements = sorted(elements, key=lambda x: (
            x.get('page', 0),
            x.get('reading_order', 0)
        ))
        # Create the document
        doc = SimpleDocTemplate(str(output_path))
        story = []
        styles = getSampleStyleSheet()
        # Custom styles
        styles.add(ParagraphStyle(
            name='CustomTitle',
            parent=styles['Heading1'],
            fontSize=18,
            alignment=TA_CENTER,
            spaceAfter=12
        ))
        current_page = -1
        # Append elements in order
        for elem in sorted_elements:
            elem_type = elem.get('type')
            content = elem.get('content', '')
            page = elem.get('page', 0)
            # Page breaks
            if page != current_page and current_page != -1:
                story.append(PageBreak())
            current_page = page
            try:
                if elem_type == 'title':
                    story.append(Paragraph(content, styles['CustomTitle']))
                    story.append(Spacer(1, 12))
                elif elem_type == 'text':
                    story.append(Paragraph(content, styles['Normal']))
                    story.append(Spacer(1, 8))
                elif elem_type == 'table':
                    # Parse the HTML table into a ReportLab Table
                    table_obj = self._html_to_reportlab_table(content)
                    if table_obj:
                        story.append(table_obj)
                        story.append(Spacer(1, 12))
                elif elem_type == 'image':
                    # Embed the image
                    img_path = output_path.parent.parent / content
                    if img_path.exists():
                        img = RLImage(str(img_path), width=400, height=300, kind='proportional')
                        story.append(img)
                        story.append(Spacer(1, 12))
                elif elem_type == 'formula':
                    # Render formulas in a monospace font
                    story.append(Paragraph(f"<font name='Courier'>{content}</font>", styles['Code']))
                    story.append(Spacer(1, 8))
            except Exception as e:
                logger.warning(f"Failed to add {elem_type} element to flow: {e}")
        # Build the PDF
        doc.build(story)
        logger.info(f"✅ Flow PDF generated: {output_path}")
        return True
```
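`_html_to_reportlab_table()` is referenced above but never specified. A minimal sketch, assuming BeautifulSoup (already a planned dependency) and ignoring rowspan/colspan for now; in the service it would live as a method on `PDFGeneratorService`:

```python
from bs4 import BeautifulSoup
from reportlab.lib import colors
from reportlab.platypus import Table, TableStyle

def html_to_reportlab_table(html_content: str):
    """Parse a PP-StructureV3 HTML table into a ReportLab Table (sketch)."""
    soup = BeautifulSoup(html_content, 'html.parser')
    rows = []
    for tr in soup.find_all('tr'):
        cells = [c.get_text(strip=True) for c in tr.find_all(['td', 'th'])]
        if cells:
            rows.append(cells)
    if not rows:
        return None
    # Pad ragged rows so ReportLab receives a rectangular grid
    width = max(len(r) for r in rows)
    rows = [r + [''] * (width - len(r)) for r in rows]
    table = Table(rows)
    table.setStyle(TableStyle([
        ('GRID', (0, 0), (-1, -1), 0.5, colors.grey),
        ('FONTSIZE', (0, 0), (-1, -1), 8),
        ('VALIGN', (0, 0), (-1, -1), 'MIDDLE'),
    ]))
    return table
```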
---
## 🔧 Implementation Steps
### Phase 1: Engine-Layer Refactor (2-3 hours)
1. **Create the PPStructureV3Engine singleton class**
   - File: `backend/app/engines/ppstructure_engine.py` (new)
   - Unified management of the PP-StructureV3 engine
   - Configuration tuned for RTX 4060 8GB
2. **Create the AdvancedLayoutExtractor class**
   - File: `backend/app/services/advanced_layout_extractor.py` (new)
   - Implement `extract_complete_layout()`
   - Fully extract parsing_res_list, layout_bbox, layout_det_res
3. **Update OCRService**
   - Modify `analyze_layout()` to use `AdvancedLayoutExtractor`
   - Stay backward compatible (fall back to the old logic)
### Phase 2: PDF Generator Refactor (3-4 hours)
1. **Refactor PDFGeneratorService**
   - Add a `mode` parameter
   - Implement `_generate_coordinate_pdf()`
   - Implement `_generate_flow_pdf()`
2. **Add helper methods**
   - `_draw_table_at_bbox()`: draw a table at given coordinates
   - `_draw_text_at_bbox()`: draw text at given coordinates
   - `_draw_title_at_bbox()`: draw a title at given coordinates
   - `_draw_formula_at_bbox()`: draw a formula at given coordinates
   - `_html_to_reportlab_table()`: convert HTML to a ReportLab Table
3. **Update API endpoints**
   - `/tasks/{id}/download/pdf?mode=coordinate` (default)
   - `/tasks/{id}/download/pdf?mode=flow`
### Phase 3: Testing & Optimization (2-3 hours)
1. **Unit tests**
   - Test AdvancedLayoutExtractor
   - Test both PDF modes
   - Test backward compatibility
2. **Performance tests**
   - GPU memory monitoring
   - Processing speed tests
   - Concurrent request tests
3. **Quality validation**
   - Coordinate accuracy
   - Reading-order correctness
   - Table recognition accuracy
---
## 📈 Expected Results
### Feature Improvements
| Metric | Current | After Refactor | Gain |
|------|-----|--------|------|
| bbox availability | 0% (all empty) | 100% | ✅ ∞ |
| Layout element categories | 2 | 23 | ✅ 11.5x |
| Reading order | none | fully preserved | ✅ 100% |
| Information loss | 21.6% | 0% (flow mode) | ✅ 100% |
| PDF modes | 1 | 2 | ✅ 2x |
| Translation support | difficult | complete | ✅ 100% |
### GPU Utilization Improvements
| Configuration item | Current | After refactor |
|----------------|--------|--------|
| GPU utilization | ~30% | ~70% |
| Processing speed | ~0.5 pages/s | ~1.2 pages/s |
| Preprocessing features | off | all on |
| Recognition accuracy | ~85% | ~95% |
---
## 🎯 Migration Strategy
### Backward Compatibility Guarantees
1. **API level**
   - Keep all existing API endpoints
   - Add an optional `mode` parameter
   - Default behavior unchanged
2. **Data level**
   - Old JSON files remain usable
   - New fields do not affect old logic
   - Incremental updates
3. **Deployment strategy**
   - Deploy the new engine and services first
   - Enable new features gradually
   - Monitor performance and error rates
---
## 📝 Configuration Files
### requirements.txt Updates
```txt
# Existing dependencies
paddlepaddle-gpu>=3.0.0
paddleocr>=3.0.0
# New dependencies
python-docx>=0.8.11    # Word document generation (optional)
PyMuPDF>=1.23.0        # enhanced PDF handling
beautifulsoup4>=4.12.0 # HTML parsing
lxml>=4.9.0            # faster XML/HTML parsing
```
### Environment Variable Configuration
```bash
# New in .env.local
PADDLE_GPU_MEMORY=6144          # RTX 4060 8GB, reserving 2GB for the system
PADDLE_USE_SERVER_MODEL=true
PADDLE_ENABLE_ALL_FEATURES=true
# Default PDF generation mode
PDF_DEFAULT_MODE=coordinate     # or flow
```
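How these variables might surface in the application — a minimal sketch assuming a pydantic v1-style `BaseSettings` class like the existing `app/core/config.py` (field names mirror the env vars; everything else is an assumption):

```python
from pydantic import BaseSettings  # pydantic v1-style settings, assumed

class Settings(BaseSettings):
    PADDLE_GPU_MEMORY: int = 6144
    PADDLE_USE_SERVER_MODEL: bool = True
    PADDLE_ENABLE_ALL_FEATURES: bool = True
    PDF_DEFAULT_MODE: str = "coordinate"  # or "flow"

    class Config:
        env_file = ".env.local"  # values above are overridden by the env file

settings = Settings()
```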
---
## 🚀 Implementation Priorities
### P0 (implement immediately)
1. ✅ PPStructureV3Engine unified engine
2. ✅ AdvancedLayoutExtractor full extraction
3. ✅ Coordinate-positioning PDF mode
### P1 (second stage)
4. ⭐ Flow-layout PDF mode
5. ⭐ API endpoint update (mode parameter)
### P2 (optimization stage)
6. Performance monitoring and tuning
7. Batch processing support
8. Quality-check tooling
---
## ⚠️ Risks and Mitigations
### Risk 1: insufficient GPU memory
**Mitigation**:
- Set `gpu_mem=6144` conservatively (reserve 2GB)
- Add memory monitoring
- Process large documents in batches
### Risk 2: slower processing
**Mitigation**:
- Server models are faster than Mobile on GPU
- Process pages in parallel
- Cache results
### Risk 3: backward-compatibility problems
**Mitigation**:
- Keep the old logic as a fallback
- Migrate gradually
- Full test coverage
---
**Estimated total development time**: 7-10 hours
**Expected outcome**: full use of PP-StructureV3's capabilities + zero information loss + complete translation support
Which phase would you like me to start implementing?

View File

@@ -0,0 +1,691 @@
# Plan for Fully Utilizing PP-StructureV3 Layout Information
## 📋 Executive Summary
### Problem Diagnosis
The current implementation **severely underestimates PP-StructureV3's capabilities**: it uses only the `page_result.markdown` attribute and completely ignores the core layout information in `page_result.json`.
### Key Findings
1. **PP-StructureV3 provides complete layout-parsing information**, including:
   - `parsing_res_list`: layout elements in reading order
   - `layout_bbox`: precise coordinates for each element
   - `layout_det_res`: layout detection results (region type, confidence)
   - `overall_ocr_res`: the full OCR result (bbox for every piece of text)
   - `layout`: layout type (single/double/multi-column)
2. **Defects in the current implementation**
```python
# ❌ Current approach (ocr_service.py:615-646)
markdown_dict = page_result.markdown  # only fetches markdown and images
markdown_texts = markdown_dict.get('markdown_texts', '')
# bbox is set to an empty list
'bbox': [],  # PP-StructureV3 doesn't provide individual bbox in this format
```
3. **What it should do instead**
```python
# ✅ Correct approach
json_data = page_result.json  # fetch the full structured information
parsing_list = json_data.get('parsing_res_list', [])  # reading order + bbox
layout_det = json_data.get('layout_det_res', {})      # layout detection
overall_ocr = json_data.get('overall_ocr_res', {})    # coordinates for all text
```
---
## 🎯 Planning Goals
### Phase 1: Extract complete layout information (high priority)
**Goal**: modify `analyze_layout()` to use PP-StructureV3's full capabilities
**Expected results**:
- ✅ Every layout element has a precise `layout_bbox`
- ✅ The original reading order is preserved (the order of `parsing_res_list`)
- ✅ Layout type information is available (single/double column)
- ✅ Region classification is extracted (text/table/figure/title/formula)
- ✅ Zero information loss (no need to filter overlapping text)
### Phase 2: Implement dual-mode PDF generation (medium priority)
**Goal**: offer two PDF generation modes
**Mode A: precise coordinate positioning**
- Use `layout_bbox` to position every element exactly
- Preserve the visual appearance of the original document
- Suited to scenarios requiring faithful layout reproduction
**Mode B: flow layout**
- Flow content in `parsing_res_list` order
- Use the high-level ReportLab Platypus API
- Zero information loss; all content is searchable
- Suited to translation or content-processing scenarios
### Phase 3: Multi-column layout handling (low priority)
**Goal**: leverage PP-StructureV3's multi-column recognition
---
## 📊 Complete PP-StructureV3 Data Structure
### 1. Full structure of `page_result.json`
```python
{
    # Basic info
    "input_path": str,   # source file path
    "page_index": int,   # page number (PDF only)
    # Layout detection results
    "layout_det_res": {
        "boxes": [
            {
                "cls_id": int,      # class ID
                "label": str,       # region type: text/table/figure/title/formula/seal
                "score": float,     # confidence 0-1
                "coordinate": [x1, y1, x2, y2]  # rectangle coordinates
            },
            ...
        ]
    },
    # Full OCR results
    "overall_ocr_res": {
        "dt_polys": np.ndarray,   # text detection polygons
        "rec_polys": np.ndarray,  # text recognition polygons
        "rec_boxes": np.ndarray,  # text recognition rectangles (n, 4, 2) int16
        "rec_texts": List[str],   # recognized text
        "rec_scores": np.ndarray  # recognition confidence
    },
    # **Core layout parsing results (in reading order)**
    "parsing_res_list": [
        {
            "layout_bbox": np.ndarray,  # region bounding box [x1, y1, x2, y2]
            "layout": str,              # layout type: single/double/multi-column
            "text": str,                # text content (for text regions)
            "table": str,               # table HTML (for table regions)
            "image": str,               # image path (for image regions)
            "formula": str,             # formula LaTeX (for formula regions)
            # ... other region types
        },
        ...  # order = reading order
    ],
    # Text-paragraph OCR (in reading order)
    "text_paragraphs_ocr_res": {
        "rec_polys": np.ndarray,
        "rec_texts": List[str],
        "rec_scores": np.ndarray
    },
    # Optional module results
    "formula_res_region1": {...},  # formula recognition results
    "table_cell_img": {...},       # table cell images
    "seal_res_region1": {...}      # seal recognition results
}
```
### 2. Key Field Reference
| Field | Purpose | Data format | Importance |
|------|------|---------|--------|
| `parsing_res_list` | **Core data**: all layout elements in reading order | List[Dict] | ⭐⭐⭐⭐⭐ |
| `layout_bbox` | Precise coordinates for each element | np.ndarray [x1,y1,x2,y2] | ⭐⭐⭐⭐⭐ |
| `layout` | Layout type (single/double/multi-column) | str: single/double/multi | ⭐⭐⭐⭐ |
| `layout_det_res` | Detailed layout detection (with region classes) | Dict with boxes list | ⭐⭐⭐⭐ |
| `overall_ocr_res` | OCR results and coordinates for all text | Dict with np.ndarray | ⭐⭐⭐⭐ |
| `markdown` | Simplified Markdown output | Dict with texts/images | ⭐⭐ |
---
## 🔧 Implementation Plan
### Task 1: Refactor the `analyze_layout()` function
**File**: `/backend/app/services/ocr_service.py`
**Scope**: lines 590-710
**Core changes**:
```python
def analyze_layout(self, image_path: Path, output_dir: Optional[Path] = None, current_page: int = 0) -> Tuple[Optional[Dict], List[Dict]]:
    """
    Analyze document layout using PP-StructureV3 (using the full JSON information)
    """
    try:
        structure_engine = self.get_structure_engine()
        results = structure_engine.predict(str(image_path))
        layout_elements = []
        images_metadata = []
        for page_idx, page_result in enumerate(results):
            # ✅ Change 1: use the full JSON data instead of markdown only
            json_data = page_result.json
            # ✅ Change 2: extract the layout detection results
            layout_det_res = json_data.get('layout_det_res', {})
            layout_boxes = layout_det_res.get('boxes', [])
            # ✅ Change 3: extract the core parsing_res_list (reading order + bbox)
            parsing_res_list = json_data.get('parsing_res_list', [])
            if parsing_res_list:
                # *** Core logic: use parsing_res_list ***
                for idx, item in enumerate(parsing_res_list):
                    # Extract the bbox (no longer an empty list!)
                    layout_bbox = item.get('layout_bbox')
                    if layout_bbox is not None:
                        # Convert numpy arrays to a standard format
                        if hasattr(layout_bbox, 'tolist'):
                            bbox = layout_bbox.tolist()
                        else:
                            bbox = list(layout_bbox)
                        # Convert to 4-point format: [[x1,y1], [x2,y1], [x2,y2], [x1,y2]]
                        if len(bbox) == 4:  # [x1, y1, x2, y2]
                            x1, y1, x2, y2 = bbox
                            bbox = [[x1, y1], [x2, y1], [x2, y2], [x1, y2]]
                    else:
                        bbox = []
                    # Extract the layout type
                    layout_type = item.get('layout', 'single')
                    # Create the element (with all information)
                    element = {
                        'element_id': idx,
                        'page': current_page,
                        'bbox': bbox,                # ✅ no longer an empty list!
                        'layout_type': layout_type,  # ✅ new: layout type
                        'reading_order': idx,        # ✅ new: reading order
                    }
                    # Extract content by type
                    if 'table' in item:
                        element['type'] = 'table'
                        element['content'] = item['table']
                        # Extract the table's plain text (for translation)
                        element['extracted_text'] = self._extract_table_text(item['table'])
                    elif 'text' in item:
                        element['type'] = 'text'
                        element['content'] = item['text']
                    elif 'figure' in item or 'image' in item:
                        element['type'] = 'image'
                        element['content'] = item.get('figure') or item.get('image')
                    elif 'formula' in item:
                        element['type'] = 'formula'
                        element['content'] = item['formula']
                    elif 'title' in item:
                        element['type'] = 'title'
                        element['content'] = item['title']
                    else:
                        # Unknown type: record any non-system field
                        for key, value in item.items():
                            if key not in ['layout_bbox', 'layout']:
                                element['type'] = key
                                element['content'] = value
                                break
                    layout_elements.append(element)
            else:
                # Fall back to the markdown approach (backward compatible)
                logger.warning("No parsing_res_list found, falling back to markdown parsing")
                markdown_dict = page_result.markdown
                # ... original markdown parsing logic ...
            # ✅ Change 4: also handle extracted images (still saved to disk)
            markdown_dict = page_result.markdown
            markdown_images = markdown_dict.get('markdown_images', {})
            for img_idx, (img_path, img_obj) in enumerate(markdown_images.items()):
                # Save the image to disk
                try:
                    base_dir = output_dir if output_dir else image_path.parent
                    full_img_path = base_dir / img_path
                    full_img_path.parent.mkdir(parents=True, exist_ok=True)
                    if hasattr(img_obj, 'save'):
                        img_obj.save(str(full_img_path))
                        logger.info(f"Saved extracted image to {full_img_path}")
                except Exception as e:
                    logger.warning(f"Failed to save image {img_path}: {e}")
                # Extract the bbox (from the filename or by matching parsing_res_list)
                bbox = self._find_image_bbox(img_path, parsing_res_list, layout_boxes)
                images_metadata.append({
                    'element_id': len(layout_elements) + img_idx,
                    'image_path': img_path,
                    'type': 'image',
                    'page': current_page,
                    'bbox': bbox,
                })
        if layout_elements:
            layout_data = {
                'elements': layout_elements,
                'total_elements': len(layout_elements),
                'reading_order': [e['reading_order'] for e in layout_elements],  # ✅ preserve reading order
                'layout_types': list(set(e.get('layout_type') for e in layout_elements)),  # ✅ layout-type stats
            }
            logger.info(f"Detected {len(layout_elements)} layout elements (with bbox and reading order)")
            return layout_data, images_metadata
        else:
            logger.warning("No layout elements detected")
            return None, []
    except Exception as e:
        import traceback
        logger.error(f"Layout analysis error: {str(e)}\n{traceback.format_exc()}")
        return None, []

def _find_image_bbox(self, img_path: str, parsing_res_list: List[Dict], layout_boxes: List[Dict]) -> List:
    """
    Look up an image's bbox in parsing_res_list or layout_det_res.
    """
    # Method 1: extract from the filename (existing approach)
    import re
    match = re.search(r'box_(\d+)_(\d+)_(\d+)_(\d+)', img_path)
    if match:
        x1, y1, x2, y2 = map(int, match.groups())
        return [[x1, y1], [x2, y1], [x2, y2], [x1, y2]]
    # Method 2: match against parsing_res_list (if it carries image path info)
    for item in parsing_res_list:
        if 'image' in item or 'figure' in item:
            content = item.get('image') or item.get('figure')
            if img_path in str(content):
                bbox = item.get('layout_bbox')
                if bbox is not None:
                    if hasattr(bbox, 'tolist'):
                        bbox_list = bbox.tolist()
                    else:
                        bbox_list = list(bbox)
                    if len(bbox_list) == 4:
                        x1, y1, x2, y2 = bbox_list
                        return [[x1, y1], [x2, y1], [x2, y2], [x1, y2]]
    # Method 3: match against layout_det_res (by label)
    for box in layout_boxes:
        if box.get('label') in ['figure', 'image']:
            coord = box.get('coordinate', [])
            if len(coord) == 4:
                x1, y1, x2, y2 = coord
                return [[x1, y1], [x2, y1], [x2, y2], [x1, y2]]
    logger.warning(f"Could not find bbox for image {img_path}")
    return []
```
---
### Task 2: Update the PDF generator to use the new information
**File**: `/backend/app/services/pdf_generator_service.py`
**Core changes**:
1. **Remove the text-filtering logic** (no longer needed!)
   - `parsing_res_list` is already in reading order
   - Tables/images have their own regions, and text has its own
   - There is no overlap problem
2. **Render elements by `reading_order`**
```python
def generate_layout_pdf(self, json_path: Path, output_path: Path, mode: str = 'coordinate') -> bool:
    """
    mode: 'coordinate' or 'flow'
    """
    # Load the data
    ocr_data = self.load_ocr_json(json_path)
    layout_data = ocr_data.get('layout_data', {})
    elements = layout_data.get('elements', [])
    if mode == 'coordinate':
        # Mode A: coordinate positioning
        return self._generate_coordinate_pdf(elements, output_path, ocr_data)
    else:
        # Mode B: flow layout (json_path passed so image paths can be resolved)
        return self._generate_flow_pdf(elements, output_path, ocr_data, json_path)

def _generate_coordinate_pdf(self, elements: List[Dict], output_path: Path, ocr_data: Dict) -> bool:
    """Coordinate-positioning mode - faithful layout reproduction"""
    # (canvas creation, page sizing, and scale computation omitted in this sketch)
    # Sort elements by reading_order
    sorted_elements = sorted(elements, key=lambda x: x.get('reading_order', 0))
    # Group by page number
    pages = {}
    for elem in sorted_elements:
        page = elem.get('page', 0)
        if page not in pages:
            pages[page] = []
        pages[page].append(elem)
    # Render each page
    for page_num, page_elements in sorted(pages.items()):
        for elem in page_elements:
            bbox = elem.get('bbox', [])
            elem_type = elem.get('type')
            content = elem.get('content', '')
            if not bbox:
                logger.warning(f"Element {elem['element_id']} has no bbox, skipping")
                continue
            # Render at the exact coordinates
            if elem_type == 'table':
                self.draw_table_at_bbox(pdf_canvas, content, bbox, page_height, scale_w, scale_h)
            elif elem_type == 'text':
                self.draw_text_at_bbox(pdf_canvas, content, bbox, page_height, scale_w, scale_h)
            elif elem_type == 'image':
                self.draw_image_at_bbox(pdf_canvas, content, bbox, page_height, scale_w, scale_h)
            # ... other types

def _generate_flow_pdf(self, elements: List[Dict], output_path: Path, ocr_data: Dict, json_path: Path) -> bool:
    """Flow-layout mode - zero information loss"""
    from reportlab.platypus import SimpleDocTemplate, Paragraph, Table, Image, Spacer
    from reportlab.lib.styles import getSampleStyleSheet
    # Sort elements by reading_order
    sorted_elements = sorted(elements, key=lambda x: x.get('reading_order', 0))
    # Build the story (flowing content)
    story = []
    styles = getSampleStyleSheet()
    for elem in sorted_elements:
        elem_type = elem.get('type')
        content = elem.get('content', '')
        if elem_type == 'title':
            story.append(Paragraph(content, styles['Title']))
        elif elem_type == 'text':
            story.append(Paragraph(content, styles['Normal']))
        elif elem_type == 'table':
            # Parse the HTML table into a ReportLab Table
            table_obj = self._html_to_reportlab_table(content)
            story.append(table_obj)
        elif elem_type == 'image':
            # Embed the image
            img_path = json_path.parent / content
            if img_path.exists():
                story.append(Image(str(img_path), width=400, height=300))
        story.append(Spacer(1, 12))  # spacing
    # Build the PDF
    doc = SimpleDocTemplate(str(output_path))
    doc.build(story)
    return True
```
---
## 📈 Expected Improvements
### Current vs. new implementation
| Metric | Current ❌ | New ✅ | Improvement |
|------|-----------|----------|------|
| **bbox info** | empty list `[]` | precise coordinates `[x1,y1,x2,y2]` | ✅ 100% |
| **Reading order** | none (mixed HTML) | `reading_order` field | ✅ 100% |
| **Layout type** | none | `layout_type` (single/double column) | ✅ 100% |
| **Element classification** | naive `<table` check | precise classification (9+ types) | ✅ 100% |
| **Information loss** | 21.6% of text filtered out | 0% loss (flow mode) | ✅ 100% |
| **Coordinate coverage** | only some image bboxes | every element has a bbox | ✅ 100% |
| **PDF modes** | coordinate positioning only | dual mode (coordinate + flow) | ✅ new feature |
| **Translation support** | difficult (information loss) | complete (zero loss) | ✅ 100% |
### Concrete improvements
#### 1. Zero information loss
```python
# ❌ Current: 342 text regions → 268 after filtering = 74 lost (21.6%)
filtered_text_regions = self._filter_text_in_regions(text_regions, regions_to_avoid)
# ✅ New: no filtering needed; use parsing_res_list directly.
# Every element (text, table, image) sits in its own region; nothing overlaps.
for elem in sorted(elements, key=lambda x: x['reading_order']):
    render_element(elem)  # render everything, zero loss
```
#### 2. Precise bboxes
```python
# ❌ Current: bbox is an empty list
{
    'element_id': 0,
    'type': 'table',
    'bbox': [],  # ← cannot be positioned!
}
# ✅ New: precise coordinates from layout_bbox
{
    'element_id': 0,
    'type': 'table',
    'bbox': [[770, 776], [1122, 776], [1122, 1058], [770, 1058]],  # ← exact position!
    'reading_order': 3,
    'layout_type': 'single'
}
```
#### 3. Reading order
```python
# ❌ Current: correct reading order cannot be guaranteed;
# tables, images, and text are mixed together in arbitrary order
# ✅ New: the order of parsing_res_list = the reading order
elements = sorted(elements, key=lambda x: x['reading_order'])
# Elements render in reading_order 0, 1, 2, 3, ...
# The document's logical order is preserved exactly
```
---
## 🚀 Implementation Steps
### Stage 1: core refactor (2-3 hours)
1. **Modify the `analyze_layout()` function**
   - Extract `parsing_res_list` from `page_result.json`
   - Use `layout_bbox` as each element's bbox
   - Preserve `reading_order`
   - Extract `layout_type`
   - Test the output JSON structure
2. **Add helper functions**
   - `_find_image_bbox()`: look up image bboxes from multiple sources
   - `_convert_bbox_format()`: normalize the bbox format
   - `_extract_element_content()`: extract content by type
3. **Verification**
   - Re-run OCR on the existing test documents
   - Check that the generated JSON contains bboxes
   - Verify that reading_order is correct
### Stage 2: PDF generation improvements (2-3 hours)
1. **Implement the coordinate-positioning mode**
   - Remove the text-filtering logic
   - Render each element exactly at its bbox
   - Order same-page elements by reading_order
2. **Implement the flow-layout mode**
   - Use ReportLab Platypus
   - Build the story in reading_order
   - Implement flow rendering for each element type
3. **Add the API parameter**
   - `/tasks/{id}/download/pdf?mode=coordinate` (default)
   - `/tasks/{id}/download/pdf?mode=flow`
### Stage 3: testing and optimization (1-2 hours)
1. **Full testing**
   - Single-page documents
   - Multi-page PDFs
   - Multi-column layouts
   - Complex tables
2. **Performance tuning**
   - Avoid redundant computation
   - Optimize bbox conversion
   - Cache results
3. **Documentation updates**
   - Update the API docs
   - Add usage examples
   - Update the architecture diagram
---
## 💡 Key Technical Details
### 1. Numpy array handling
```python
# layout_bbox is a numpy.ndarray and must be converted to a standard format
layout_bbox = item.get('layout_bbox')
if hasattr(layout_bbox, 'tolist'):
    bbox = layout_bbox.tolist()  # [x1, y1, x2, y2]
else:
    bbox = list(layout_bbox)
# Convert to 4-point format
x1, y1, x2, y2 = bbox
bbox_4point = [[x1, y1], [x2, y1], [x2, y2], [x1, y2]]
```
### 2. Layout-type handling
```python
# Adjust the rendering strategy based on layout_type
layout_type = elem.get('layout_type', 'single')
if layout_type == 'double':
    # Two-column layout: may need special handling
    pass
elif layout_type == 'multi':
    # Multi-column layout: more complex handling
    pass
```
### 3. Guaranteeing reading order
```python
# Make sure elements render in the correct order
elements = layout_data.get('elements', [])
sorted_elements = sorted(elements, key=lambda x: (
    x.get('page', 0),          # page number first
    x.get('reading_order', 0)  # then reading order
))
```
---
## ⚠️ Risks and Mitigations
### Risk 1: backward compatibility
**Problem**: old JSON files lack the new fields
**Mitigation**:
```python
# Add fallback logic in analyze_layout()
parsing_res_list = json_data.get('parsing_res_list', [])
if not parsing_res_list:
    logger.warning("No parsing_res_list, using markdown fallback")
    # Use the old markdown parsing logic
```
### Risk 2: PaddleOCR version differences
**Problem**: different PaddleOCR versions may emit different output formats
**Mitigation**:
- Record the PaddleOCR version in the JSON
- Add version detection logic
- Support multiple versions
### Risk 3: performance impact
**Problem**: extracting more information may increase processing time
**Mitigation**:
- Extract detailed information only when needed
- Use caching
- Process pages in parallel
---
## 📝 TODO Checklist
### Stage 1: core refactor
- [ ] Modify `analyze_layout()` to use `page_result.json`
- [ ] Extract `parsing_res_list`
- [ ] Extract `layout_bbox` and convert its format
- [ ] Preserve `reading_order`
- [ ] Extract `layout_type`
- [ ] Implement `_find_image_bbox()`
- [ ] Add fallback logic (backward compatibility)
- [ ] Test the new JSON output structure
### Stage 2: PDF generation improvements
- [ ] Implement `_generate_coordinate_pdf()`
- [ ] Implement `_generate_flow_pdf()`
- [ ] Remove the old text-filtering logic
- [ ] Add the mode parameter to the API
- [ ] Implement the HTML table parser (for flow mode)
- [ ] Test PDF output in both modes
### Stage 3: testing and documentation
- [ ] Single-page document tests
- [ ] Multi-page PDF tests
- [ ] Complex layout tests (multi-column, table-dense)
- [ ] Performance tests
- [ ] Update the API docs
- [ ] Update the usage guide
- [ ] Write a migration guide
---
## 🎓 References
1. **PaddleOCR official docs**
   - [PP-StructureV3 Usage Tutorial](http://www.paddleocr.ai/main/en/version3.x/pipeline_usage/PP-StructureV3.html)
   - [PaddleX PP-StructureV3](https://paddlepaddle.github.io/PaddleX/3.0/en/pipeline_usage/tutorials/ocr_pipelines/PP-StructureV3.html)
2. **ReportLab docs**
   - [Platypus User Guide](https://www.reportlab.com/docs/reportlab-userguide.pdf)
   - [Table Styling](https://www.reportlab.com/docs/reportlab-userguide.pdf#page=80)
3. **Reference implementation**
   - PaddleOCR GitHub: `/paddlex/inference/pipelines/layout_parsing/pipeline_v2.py`
---
## 🏁 Success Criteria
### Must achieve
- Every layout element has a precise bbox
- Reading order is preserved correctly
- Zero information loss (flow mode)
- Backward compatible (old JSON still usable)
### Should achieve
- Dual-mode PDF generation (coordinate + flow)
- Multi-column layouts handled correctly
- Translation support (table text extractable)
- No noticeable performance regression
### Stretch goals
- Support for more element types (formulas, seals)
- Layout-type statistics and analysis
- Layout structure visualization
---
**Planning completed**: 2025-01-18
**Estimated development time**: 5-8 hours
**Priority**: P0 (highest)

View File

@@ -0,0 +1,148 @@
# Implement Layout-Preserving PDF Generation and Preview
## Problem
Testing revealed three critical issues affecting user experience:
### 1. PDF Download Returns 403 Forbidden
- **Endpoint**: `GET /api/v2/tasks/{task_id}/download/pdf`
- **Error**: Backend returns HTTP 403 Forbidden
- **Impact**: Users cannot download PDF format results
- **Root Cause**: PDF generation service not implemented
### 2. Result Preview Shows Placeholder Text Instead of Layout-Preserving Content
- **Affected Pages**:
- Results page (`/results`)
- Task Detail page (`/tasks/{taskId}`)
- **Current Behavior**: Both pages display the placeholder message "請使用上方下載按鈕下載 Markdown、JSON 或 PDF 格式查看完整結果" ("please use the download buttons above to get the full results as Markdown, JSON, or PDF")
- **Problem**: Users cannot preview OCR results with original document layout preserved
- **Impact**: Poor user experience - users cannot verify OCR accuracy visually
### 3. Images Extracted by PP-StructureV3 Are Not Saved to Disk
- **Affected File**: `backend/app/services/ocr_service.py:554-561`
- **Current Behavior**:
- PP-StructureV3 extracts images from documents (tables, charts, figures)
- `analyze_layout()` receives image objects in `markdown_images` dictionary
- Code only saves image path strings to JSON, never saves actual image files
- Result directory contains no `imgs/` folder with extracted images
- **Impact**:
- JSON references non-existent files (e.g., `imgs/img_in_table_box_*.jpg`)
- Layout-preserving PDF cannot embed images because source files don't exist
- Loss of critical visual content from original documents
- **Root Cause**: Missing image file saving logic in `analyze_layout()` function
## Proposed Changes
### Change 0: Fix Image Extraction and Saving (PREREQUISITE)
Modify OCR service to save extracted images to disk before PDF generation can embed them.
**Implementation approach:**
1. **Update `analyze_layout()` Function**
- Locate image saving code at `ocr_service.py:554-561`
- Extract `img_obj` from `markdown_images.items()`
- Create `imgs/` subdirectory in result folder
- Save each `img_obj` to disk using PIL `Image.save()`
- Verify saved file path matches JSON `images_metadata`
2. **File Naming and Organization**
- PP-StructureV3 generates paths like `imgs/img_in_table_box_145_1253_2329_2488.jpg`
- Create full path: `{result_dir}/{img_path}`
- Ensure parent directories exist before saving
- Handle image format conversion if needed (PNG, JPEG)
3. **Error Handling**
- Log warnings if image objects are missing or corrupt
- Continue processing even if individual images fail
- Include error info in images_metadata for debugging
**Why This is Critical:**
- Without saved images, layout-preserving PDF cannot embed visual content
- Images contain crucial information (charts, diagrams, table contents)
- PP-StructureV3 already does the hard work of extraction - we just need to save them
### Change 1: Implement Layout-Preserving PDF Generation Service
Create a PDF generation service that reconstructs the original document layout from OCR JSON data.
**Implementation approach:**
1. **Parse JSON OCR Results**
- Read `text_regions` array containing text, bounding boxes, confidence scores
- Extract page dimensions from original file or infer from bbox coordinates
- Group elements by page number
2. **Generate PDF with ReportLab**
- Create PDF canvas with original page dimensions
- Iterate through each text region
   - Draw text at precise coordinates from bbox (see the coordinate-flip sketch below)
- Support Chinese fonts (e.g., Noto Sans CJK, Source Han Sans)
- Optionally draw bounding boxes for visualization
3. **Handle Complex Elements**
- Text: Draw at bbox coordinates with appropriate font size
- Tables: Reconstruct from layout analysis (if available)
- Images: Embed from `images_metadata`
- Preserve rotation/skew from bbox geometry
4. **Caching Strategy**
- Generate PDF once per task completion
- Store in task result directory as `{filename}_layout.pdf`
- Serve cached version on subsequent requests
- Regenerate only if JSON changes
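A minimal sketch of that cache check (names are illustrative, not the actual service API):

```python
from pathlib import Path

def get_or_generate_pdf(json_path: Path, pdf_path: Path, generate) -> Path:
    """Return the cached PDF unless the OCR JSON is newer (sketch)."""
    if pdf_path.exists() and pdf_path.stat().st_mtime >= json_path.stat().st_mtime:
        return pdf_path  # cache hit: JSON unchanged since last generation
    generate(json_path, pdf_path)  # regenerate on miss or stale cache
    return pdf_path
```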
**Technical stack:**
- **ReportLab**: PDF generation with precise coordinate control
- **Pillow**: Extract dimensions from source images/PDFs, embed extracted images
- **Chinese fonts**: Noto Sans CJK or Source Han Sans (must be installed)
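On the coordinate transformation: OCR bboxes use a top-left origin with y growing downward, while ReportLab's canvas origin is bottom-left. A minimal sketch of the flip (the function name and scaling parameters are illustrative):

```python
def ocr_bbox_to_pdf_rect(bbox, page_height, scale_w=1.0, scale_h=1.0):
    """Convert a 4-point OCR bbox (top-left origin) to an (x, y, w, h)
    rectangle in ReportLab's bottom-left coordinate system."""
    xs = [p[0] * scale_w for p in bbox]
    ys = [p[1] * scale_h for p in bbox]
    x, y_top = min(xs), min(ys)
    w, h = max(xs) - x, max(ys) - y_top
    # Flip the y-axis: PDF y is measured up from the bottom of the page
    return x, page_height - y_top - h, w, h
```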
### Change 2: Implement In-Browser PDF Preview
Replace placeholder text with interactive PDF preview using react-pdf.
**Implementation approach:**
1. **Install react-pdf**
```bash
npm install react-pdf
```
2. **Create PDF Viewer Component**
- Fetch PDF from `/api/v2/tasks/{task_id}/download/pdf`
- Render using `<Document>` and `<Page>` from react-pdf
- Add zoom controls, page navigation
- Show loading spinner while PDF loads
3. **Update ResultsPage and TaskDetailPage**
- Replace placeholder with PDF viewer
- Add download button above viewer
- Handle errors gracefully (show error if PDF unavailable)
**Benefits:**
- Users see OCR results with original layout preserved
- Visual verification of OCR accuracy
- No download required for quick review
- Professional presentation of results
## Scope
**In scope:**
- Fix image extraction to save extracted images to disk (PREREQUISITE)
- Implement layout-preserving PDF generation service from JSON
- Install and configure Chinese fonts (Noto Sans CJK)
- Create PDF viewer component with react-pdf
- Add PDF preview to Results page and Task Detail page
- Cache generated PDFs for performance
- Embed extracted images into layout-preserving PDF
- Error handling for image saving, PDF generation and preview failures
**Out of scope:**
- OCR result editing in preview
- Advanced PDF features (annotations, search, highlights)
- Excel/JSON inline preview
- Real-time PDF regeneration (will use cached version)
## Impact
- **User Experience**: Major improvement - layout-preserving visual preview with images
- **Backend**: Significant changes - image saving fix, new PDF generation service
- **Frontend**: Medium changes - PDF viewer integration
- **Dependencies**: New - ReportLab, react-pdf, Chinese fonts (Pillow already installed)
- **Performance**: Medium - PDF generation cached after first request, minimal overhead for image saving
- **Risk**: Medium - complex coordinate transformation, font rendering, image embedding
- **Data Integrity**: High improvement - images now properly preserved alongside text

View File

@@ -0,0 +1,57 @@
# Result Export - Delta Changes
## ADDED Requirements
### Requirement: Image Extraction and Persistence
The OCR system SHALL save extracted images to disk during layout analysis for later use in PDF generation.
#### Scenario: Images extracted by PP-StructureV3 are saved to disk
- **WHEN** OCR processes a document containing images (charts, tables, figures)
- **THEN** system SHALL extract image objects from `markdown_images` dictionary
- **AND** system SHALL create `imgs/` subdirectory in result folder
- **AND** system SHALL save each image object to disk using PIL Image.save()
- **AND** saved file paths SHALL match paths recorded in JSON `images_metadata`
- **AND** system SHALL log warnings for failed image saves but continue processing
#### Scenario: Multi-page documents with images on different pages
- **WHEN** OCR processes multi-page PDF with images on multiple pages
- **THEN** system SHALL save images from all pages to same `imgs/` folder
- **AND** image filenames SHALL include bbox coordinates for uniqueness
- **AND** images SHALL be available for PDF generation after OCR completes
### Requirement: Layout-Preserving PDF Generation
The system SHALL generate PDF files that preserve the original document layout using OCR JSON data.
#### Scenario: PDF generated from JSON with accurate layout
- **WHEN** user requests PDF download for a completed task
- **THEN** system SHALL parse OCR JSON result file
- **AND** system SHALL extract bounding box coordinates for each text region
- **AND** system SHALL determine page dimensions from source file or bbox maximum values
- **AND** system SHALL generate PDF with text positioned at precise coordinates
- **AND** system SHALL use Chinese-compatible font (e.g., Noto Sans CJK)
- **AND** system SHALL embed images from `imgs/` folder using paths in `images_metadata`
- **AND** generated PDF SHALL visually resemble original document layout with images
#### Scenario: PDF download works correctly
- **WHEN** user clicks PDF download button
- **THEN** system SHALL return cached PDF if already generated
- **OR** system SHALL generate new PDF from JSON on first request
- **AND** system SHALL NOT return 403 Forbidden error
- **AND** downloaded PDF SHALL contain task OCR results with layout preserved
#### Scenario: Multi-page PDF generation
- **WHEN** OCR JSON contains results for multiple pages
- **THEN** generated PDF SHALL contain same number of pages
- **AND** each page SHALL display text regions for that page only
- **AND** page dimensions SHALL match original document pages
## MODIFIED Requirements
### Requirement: Export Interface
The Export page SHALL support downloading OCR results in multiple formats using V2 task APIs.
#### Scenario: PDF caching improves performance
- **WHEN** user downloads same PDF multiple times
- **THEN** system SHALL serve cached PDF file on subsequent requests
- **AND** system SHALL NOT regenerate PDF unless JSON changes
- **AND** download response time SHALL be faster than initial generation

View File

@@ -0,0 +1,63 @@
# Task Management - Delta Changes
## MODIFIED Requirements
### Requirement: Task Result Display
The system SHALL provide interactive PDF preview of OCR results with layout preservation on Results and Task Detail pages.
#### Scenario: Results page shows layout-preserving PDF preview
- **WHEN** Results page loads with a completed task
- **THEN** page SHALL fetch PDF from `/api/v2/tasks/{task_id}/download/pdf`
- **AND** page SHALL render PDF using react-pdf PDFViewer component
- **AND** page SHALL NOT show the placeholder text "請使用上方下載按鈕..." ("use the download buttons above…")
- **AND** PDF SHALL display with original document layout preserved
- **AND** PDF SHALL support zoom and page navigation controls
#### Scenario: Task detail page shows PDF preview
- **WHEN** Task Detail page loads for a completed task
- **THEN** page SHALL fetch layout-preserving PDF
- **AND** page SHALL render PDF using PDFViewer component
- **AND** page SHALL NOT show placeholder text
- **AND** PDF SHALL visually match original document layout
#### Scenario: Preview handles loading state
- **WHEN** PDF is being generated or fetched
- **THEN** page SHALL display loading spinner
- **AND** page SHALL show progress indicator during PDF generation
- **AND** page SHALL NOT show error or placeholder text
#### Scenario: Preview handles errors gracefully
- **WHEN** PDF generation fails or file is missing
- **THEN** page SHALL display helpful error message
- **AND** error message SHALL suggest trying download again or contact support
- **AND** page SHALL NOT crash or expose technical errors to user
- **AND** page MAY fallback to markdown preview if PDF unavailable
## ADDED Requirements
### Requirement: Interactive PDF Viewer Features
The PDF viewer component SHALL provide essential viewing controls for user convenience.
#### Scenario: PDF viewer provides zoom controls
- **WHEN** user views PDF preview
- **THEN** viewer SHALL provide zoom in (+) and zoom out (-) buttons
- **AND** viewer SHALL provide fit-to-width option
- **AND** viewer SHALL provide fit-to-page option
- **AND** zoom level SHALL persist during page navigation
#### Scenario: PDF viewer provides page navigation
- **WHEN** PDF contains multiple pages
- **THEN** viewer SHALL display current page number and total pages
- **AND** viewer SHALL provide previous/next page buttons
- **AND** viewer SHALL provide page selector dropdown
- **AND** page navigation SHALL be smooth without flickering
### Requirement: Frontend PDF Library Integration
The frontend SHALL use react-pdf for PDF rendering capabilities.
#### Scenario: react-pdf configured correctly
- **WHEN** application initializes
- **THEN** react-pdf library SHALL be installed and imported
- **AND** PDF.js worker SHALL be configured properly
- **AND** worker path SHALL point to correct pdfjs-dist worker file
- **AND** PDF rendering SHALL work without console errors

View File

@@ -0,0 +1,106 @@
# Implementation Tasks
## 1. Backend - Fix Image Extraction and Saving (PREREQUISITE) ✅
- [x] 1.1 Locate `analyze_layout()` function in `backend/app/services/ocr_service.py`
- [x] 1.2 Find image saving code at lines 554-561 where `markdown_images.items()` is iterated
- [x] 1.3 Add code to create `imgs/` subdirectory in result folder before saving images
- [x] 1.4 Extract `img_obj` from `(img_path, img_obj)` tuple in loop
- [x] 1.5 Construct full image file path: `image_path.parent / img_path`
- [x] 1.6 Save each `img_obj` to disk using PIL `Image.save()` method
- [x] 1.7 Add error handling for image save failures (log warning but continue)
- [x] 1.8 Test with document containing images - verify `imgs/` folder created
- [x] 1.9 Verify saved image files match paths in JSON `images_metadata`
- [x] 1.10 Test multi-page PDF with images on different pages
## 2. Backend - Environment Setup ✅
- [x] 2.1 Install ReportLab library: `pip install reportlab`
- [x] 2.2 Verify Pillow is already installed (used for image handling)
- [x] 2.3 Download and install Noto Sans CJK font (TrueType format)
- [x] 2.4 Configure font path in backend settings
- [x] 2.5 Test Chinese character rendering
## 3. Backend - PDF Generation Service ✅
- [x] 3.1 Create `pdf_generator_service.py` in `app/services/`
- [x] 3.2 Implement `load_ocr_json(json_path)` to parse JSON results
- [x] 3.3 Implement `calculate_page_dimensions(text_regions)` to infer page size from bbox
- [x] 3.4 Implement `get_original_page_size(file_path)` to extract from source file
- [x] 3.5 Implement `draw_text_region(canvas, region, font, page_height)` to render text at bbox
- [x] 3.6 Implement `generate_layout_pdf(json_path, output_path)` main function
- [x] 3.7 Handle coordinate transformation (OCR coords to PDF coords)
- [x] 3.8 Add font size calculation based on bbox height
- [x] 3.9 Handle multi-page documents
- [x] 3.10 Add caching logic (check if PDF already exists)
- [x] 3.11 Implement `draw_table_region(canvas, region)` using ReportLab Table
- [x] 3.12 Implement `draw_image_region(canvas, region)` from images_metadata (reads from saved imgs/)
## 4. Backend - PDF Download Endpoint Fix ✅
- [x] 4.1 Update `/tasks/{id}/download/pdf` endpoint in tasks.py router
- [x] 4.2 Check if PDF already exists; if not, trigger on-demand generation
- [x] 4.3 Serve pre-generated PDF file from task result directory
- [x] 4.4 Add error handling for missing PDF or generation failures
- [x] 4.5 Test PDF download endpoint returns 200 with valid PDF
## 5. Backend - Integrate PDF Generation into OCR Flow (REQUIRED) ✅
- [x] 5.1 Modify OCR service to generate PDF automatically after JSON creation
- [x] 5.2 Update `save_results()` to return (json_path, markdown_path, pdf_path)
- [x] 5.3 PDF generation integrated into OCR completion flow
- [x] 5.4 PDF generated synchronously during OCR processing (avoids timeout issues)
- [x] 5.5 Test PDF generation triggers automatically after OCR completes
## 6. Frontend - Install Dependencies ✅
- [x] 6.1 Install react-pdf: `npm install react-pdf`
- [x] 6.2 Install pdfjs-dist (peer dependency): `npm install pdfjs-dist`
- [x] 6.3 Configure vite for PDF.js worker and optimization
## 7. Frontend - Create PDF Viewer Component ✅
- [x] 7.1 Create `PDFViewer.tsx` component in `components/`
- [x] 7.2 Implement Document and Page rendering from react-pdf
- [x] 7.3 Add zoom controls (zoom in/out, 50%-300%)
- [x] 7.4 Add page navigation (previous, next, page counter)
- [x] 7.5 Add loading spinner while PDF loads
- [x] 7.6 Add error boundary for PDF loading failures
- [x] 7.7 Style PDF container with proper sizing and authentication support
## 8. Frontend - Results Page Integration ✅
- [x] 8.1 Import PDFViewer component in ResultsPage.tsx
- [x] 8.2 Construct PDF URL from task data
- [x] 8.3 Replace placeholder text with PDFViewer
- [x] 8.4 Add authentication headers (Bearer token)
- [x] 8.5 Test PDF preview rendering
## 9. Frontend - Task Detail Page Integration ✅
- [x] 9.1 Import PDFViewer component in TaskDetailPage.tsx
- [x] 9.2 Construct PDF URL from task data
- [x] 9.3 Replace placeholder text with PDFViewer
- [x] 9.4 Add authentication headers (Bearer token)
- [x] 9.5 Test PDF preview rendering
## 10. Testing ⚠️ (pending tests against real OCR tasks)
### Basic verification (done) ✅
- [x] 10.1 Backend service imports successfully
- [x] 10.2 Frontend TypeScript compilation passes
- [x] 10.3 PDF Generator Service loads correctly
- [x] 10.4 OCR Service loads with image saving updates
### Functional tests (require real OCR tasks)
- [x] 10.5 Fixed page filtering issue for tables and images (corrected page-number assignment for tables and images)
- [x] 10.6 Adjusted rendering order (images → tables → text) to prevent overlapping
- [x] 10.7 **Fixed text filtering logic** (uses the correct data source, images_metadata; fixes text overlapping tables/images)
- [ ] 10.8 Test image extraction and saving (verify imgs/ folder created with correct files)
- [ ] 10.9 Test image saving with multi-page PDFs
- [ ] 10.10 Test PDF generation with single-page document
- [ ] 10.11 Test PDF generation with multi-page document
- [ ] 10.12 Test Chinese character rendering in PDF
- [ ] 10.13 Test coordinate accuracy (verify text positioned correctly)
- [ ] 10.14 Test table rendering in PDF (if JSON contains tables)
- [ ] 10.15 Test image embedding in PDF (verify images from imgs/ folder appear correctly)
- [ ] 10.16 Test PDF caching (second request uses cached version)
- [ ] 10.17 Test automatic PDF generation after OCR completion
- [ ] 10.18 Test PDF download from Results page
- [ ] 10.19 Test PDF download from Task Detail page
- [ ] 10.20 Test PDF preview on Results page
- [ ] 10.21 Test PDF preview on Task Detail page
- [ ] 10.22 Test error handling when JSON is missing
- [ ] 10.23 Test error handling when PDF generation fails
- [ ] 10.24 Test error handling when image files are missing or corrupt

View File

@@ -0,0 +1,519 @@
# Frontend Implementation Complete - External Authentication & Task History
## Implementation Date
2025-11-14
## Status
**Core frontend features complete**
- V2 authentication service integration
- Login page updated
- Task history page
- Navigation integration
---
## 📋 Completed Items
### 1. V2 API Service Layer ✅
#### **File: `frontend/src/services/apiV2.ts`**
**Core functionality:**
```typescript
class ApiClientV2 {
  // Authentication management
  async login(data: LoginRequest): Promise<LoginResponseV2>
  async logout(sessionId?: number): Promise<void>
  async getMe(): Promise<UserInfo>
  async listSessions(): Promise<SessionInfo[]>
  // Task management
  async createTask(data: TaskCreate): Promise<Task>
  async listTasks(params): Promise<TaskListResponse>
  async getTaskStats(): Promise<TaskStats>
  async getTask(taskId: string): Promise<TaskDetail>
  async updateTask(taskId: string, data: TaskUpdate): Promise<Task>
  async deleteTask(taskId: string): Promise<void>
  // Helpers
  async downloadTaskFile(url: string, filename: string): Promise<void>
}
```
**Highlights:**
- Automatic token management (localStorage)
- Automatic redirect to login on 401
- Session expiry detection
- User info caching
#### **File: `frontend/src/types/apiV2.ts`**
Complete type definitions:
- `UserInfo`, `LoginResponseV2`, `SessionInfo`
- `Task`, `TaskCreate`, `TaskUpdate`, `TaskDetail`
- `TaskStats`, `TaskListResponse`, `TaskFilters`
- `TaskStatus` enum
---
### 2. Login Page Update ✅
#### **File: `frontend/src/pages/LoginPage.tsx`**
**Changes:**
```typescript
// Old (V1)
await apiClient.login({ username, password })
setUser({ id: 1, username })
// New (V2)
const response = await apiClientV2.login({ username, password })
setUser({
  id: response.user.id,
  username: response.user.email,
  email: response.user.email,
  displayName: response.user.display_name
})
```
**Features:**
- ✅ External Azure AD authentication integrated
- ✅ Shows the user's display name
- ✅ Error message handling
- ✅ Original UI design retained
---
### 3. Task History Page ✅
#### **File: `frontend/src/pages/TaskHistoryPage.tsx`**
**Core features:**
1. **Statistics dashboard**
   - Total, pending, processing, completed, failed
   - Card-based presentation
   - Live updates
2. **Filtering**
   - Filter by status (all/pending/processing/completed/failed)
   - Future extensions: date range, filename search
3. **Task list**
   - Paginated (20 per page)
   - Columns: filename, status, created at, completed at, processing time
   - Actions: view details, delete
4. **Status badges**
```typescript
pending    → gray + clock icon
processing → blue + spinner icon
completed  → green + check icon
failed     → red + X icon
```
5. **Pagination controls**
   - Previous/next page
   - Shows the current range (1-20 of 45)
   - Buttons disabled automatically
**UI components used:**
- `Card` - stat cards and main container
- `Table` - task list table
- `Badge` - status labels
- `Button` - action buttons
- `Select` - status filter dropdown
---
### 4. Route Integration ✅
#### **File: `frontend/src/App.tsx`**
New route:
```typescript
<Route path="tasks" element={<TaskHistoryPage />} />
```
**Route structure:**
```
/login      - login page (public)
/           - root (redirects to /upload)
/upload     - upload files
/processing - processing progress
/results    - view results
/tasks      - task history (NEW!)
/export     - export documents
/settings   - system settings
```
### 5. Navigation Update ✅
#### **File: `frontend/src/components/Layout.tsx`**
**New navigation item:**
```typescript
{
  to: '/tasks',
  label: '任務歷史',         // "Task History"
  icon: History,
  description: '查看任務記錄'  // "View task records"
}
```
**Updated logout logic:**
```typescript
const handleLogout = async () => {
  try {
    // Prefer the V2 API
    if (apiClientV2.isAuthenticated()) {
      await apiClientV2.logout()
    } else {
      apiClient.logout()
    }
  } finally {
    logout() // clear local state
  }
}
```
**User info display:**
- Display name: `user.displayName || user.username`
- Email: `user.email || user.username`
- Avatar: capitalized initial
---
### 6. Type Extensions ✅
#### **File: `frontend/src/types/api.ts`**
Extended the User interface:
```typescript
export interface User {
  id: number
  username: string
  email?: string               // NEW
  displayName?: string | null  // NEW
}
```
---
## 🎨 UI/UX Highlights
### Task history page design:
1. **Responsive card layout**
   - 5-column grid (desktop) / 1 column (mobile)
   - Hover effects on stat cards
2. **Clear status visualization**
   - Colored badges
   - Animated icons (spinner for the processing state)
   - Semantic colors
3. **Action feedback**
   - Loading animation (Loader2)
   - Empty-state message
   - Error alerts
4. **User friendly**
   - Delete confirmation dialog
   - Refresh button
   - Clear pagination info
---
## 🔄 Backward Compatibility
### V1/V2 Coexistence Strategy
**Authentication services:**
- V1: `apiClient` (original local auth)
- V2: `apiClientV2` (new external auth)
**Login flow:**
- New users log in via the V2 API
- Old sessions keep working against the V1 API
**Logout handling:**
```typescript
if (apiClientV2.isAuthenticated()) {
  await apiClientV2.logout() // calls backend /api/v2/auth/logout
} else {
  apiClient.logout()         // clears the local token only
}
```
---
## 📱 Usage Flows
### 1. Login
```
User visits /login
→ enters email + password
→ apiClientV2.login() calls the external API
→ receives access_token + user info
→ stored in localStorage
→ redirected to /upload
```
### 2. View task history
```
User clicks the "Task History" nav item
→ visits /tasks
→ apiClientV2.listTasks() fetches the task list
→ apiClientV2.getTaskStats() fetches statistics
→ task table + stat cards rendered
```
### 3. Filter tasks
```
User picks a status filter (completed)
→ setStatusFilter('completed')
→ useEffect re-triggers fetchTasks()
→ calls apiClientV2.listTasks({ status: 'completed' })
→ task list updates
```
### 4. Delete a task
```
User clicks the delete button
→ confirmation dialog
→ apiClientV2.deleteTask(taskId)
→ task list and statistics reload
```
### 5. Pagination
```
User clicks "Next page"
→ setPage(page + 1)
→ useEffect triggers fetchTasks()
→ calls listTasks({ page: 2 })
→ task list updates
```
---
## 🧪 Testing Guide
### Manual test steps:
#### 1. Test login
```bash
# Start the backend
cd backend
source venv/bin/activate
python -m app.main
# Start the frontend
cd frontend
npm run dev
# Visit http://localhost:5173/login
# Enter Azure AD credentials
# Confirm login succeeds and the user name is shown
```
#### 2. Test task history
```bash
# After logging in, click "Task History" in the sidebar
# Confirm the stat cards show correct numbers
# Confirm the task list loads
# Test status filtering
# Test pagination
```
#### 3. Test task deletion
```bash
# Click the delete button in the task list
# Confirm the delete confirmation dialog
# Confirm the list updates after deletion
# Confirm the statistics update
```
#### 4. Test logout
```bash
# Click the logout button in the sidebar
# Confirm localStorage is cleared
# Confirm the redirect to the login page
# Log in again and confirm everything still works
```
---
## 🔧 Known Limitations
### Not yet implemented:
1. **Task detail page** (`/tasks/:taskId`)
- Show full task information
- Download result files (JSON/Markdown/PDF)
- View the task's file list
2. **Advanced filtering**
- Date range picker
- Filename search
- Combined multi-condition filters
3. **Batch operations**
- Batch-delete tasks
- Batch-download results
4. **Live updates**
- WebSocket connection
- Real-time task status push
- Auto-refresh of processing tasks
5. **Error details**
- Expand to view `error_message`
- Retry for failed tasks
---
## 💡 Future Enhancements
### Short term (1-2 weeks):
1. **Task detail page**
```typescript
// frontend/src/pages/TaskDetailPage.tsx
const task = await apiClientV2.getTask(taskId)
// Show full info + download buttons
```
2. **File download**
```typescript
const handleDownload = async (path: string, filename: string) => {
await apiClientV2.downloadTaskFile(path, filename)
}
```
3. **Date-range filter**
```typescript
<DateRangePicker
from={dateFrom}
to={dateTo}
onChange={(range) => {
setDateFrom(range.from)
setDateTo(range.to)
}}
/>
```
### Mid term (1 month):
4. **Live status updates**
- Use WebSocket or Server-Sent Events
- Auto-update the status of processing tasks
5. **Batch operations**
- Checkboxes to select multiple tasks
- Batch delete/download
6. **Search**
- Fuzzy filename search
- Full-text search (requires backend support)
### Long term (3 months):
7. **Task visualization**
- Timeline view
- Gantt chart (processing progress)
- Statistics charts (ECharts)
8. **Notifications**
- Task completion notifications
- Error alerts
- Browser Notification API
9. **Export**
- Task report export (Excel/PDF)
- Statistics export
---
## 📝 Code Examples
### Using the V2 API on other pages
```typescript
// Example: create a task from UploadPage
import { apiClientV2 } from '@/services/apiV2'
const handleUpload = async (file: File) => {
try {
// Create the task
const task = await apiClientV2.createTask({
filename: file.name,
file_type: file.type
})
console.log('Task created:', task.task_id)
// TODO: upload the file to cloud storage
// TODO: update the task status to processing
// TODO: call the OCR service
} catch (error) {
console.error('Upload failed:', error)
}
}
```
### Watching for task status changes
```typescript
// Example: poll task status
const pollTaskStatus = async (taskId: string) => {
const interval = setInterval(async () => {
try {
const task = await apiClientV2.getTask(taskId)
if (task.status === 'completed') {
clearInterval(interval)
alert('Task completed!')
} else if (task.status === 'failed') {
clearInterval(interval)
alert(`Task failed: ${task.error_message}`)
}
} catch (error) {
clearInterval(interval)
console.error('Poll error:', error)
}
}, 5000) // check every 5 seconds
}
```
---
## ✅ Completion Checklist
- [x] V2 API service layer (`apiV2.ts`)
- [x] V2 type definitions (`apiV2.ts`)
- [x] Login page integrated with V2
- [x] Task history page
- [x] Stats dashboard
- [x] Status filtering
- [x] Pagination
- [x] Task deletion
- [x] Routing integration
- [x] Navigation update
- [x] Logout update
- [x] User info display
- [ ] Task detail page (pending)
- [ ] File download (pending)
- [ ] Live status updates (pending)
- [ ] Batch operations (pending)
---
**Completed on**: 2025-11-14
**Implemented by**: Claude Code
**Frontend framework**: React + TypeScript + Vite
**UI library**: Tailwind CSS + shadcn/ui
**State management**: Zustand
**HTTP client**: Axios

View File

@@ -0,0 +1,556 @@
# External API Authentication Implementation - Complete ✅
## Implementation Date
2025-11-14
## Status
**Backend implementation complete** - Phases 1-8 done
**Frontend implementation pending** - Phases 9-11 to follow
📋 **Testing & documentation** - Phases 12-13 pending
---
## 📋 Completed Phases (1-8)
### Phase 1: Database Schema Design ✅
#### Model files created:
1. **`backend/app/models/user_v2.py`** - new user model
- Table: `tool_ocr_users`
- Columns: `id`, `email`, `display_name`, `created_at`, `last_login`, `is_active`
- Notes: no password column (external auth); email is the primary identifier
2. **`backend/app/models/task.py`** - task model
- Tables: `tool_ocr_tasks`, `tool_ocr_task_files`
- Task statuses: PENDING, PROCESSING, COMPLETED, FAILED
- User isolation: foreign key on `user_id` (CASCADE delete)
3. **`backend/app/models/session.py`** - session management
- Table: `tool_ocr_sessions`
- Stores: access_token, id_token, refresh_token (encrypted)
- Tracks: expires_at, ip_address, user_agent, last_accessed_at
#### Database migration:
- **File**: `backend/alembic/versions/5e75a59fb763_add_external_auth_schema_with_task_.py`
- **Status**: applied (alembic stamp head)
- **Changes**: creates 4 new tables (users, sessions, tasks, task_files)
- **Strategy**: old tables kept, not dropped (avoids foreign-key constraint errors)
---
### Phase 2: Configuration Management ✅
#### Environment variables (`.env.local`):
```bash
# External Authentication
EXTERNAL_AUTH_API_URL=https://pj-auth-api.vercel.app
EXTERNAL_AUTH_ENDPOINT=/api/auth/login
EXTERNAL_AUTH_TIMEOUT=30
TOKEN_REFRESH_BUFFER=300
# Task Management
DATABASE_TABLE_PREFIX=tool_ocr_
ENABLE_TASK_HISTORY=true
TASK_RETENTION_DAYS=30
MAX_TASKS_PER_USER=1000
```
#### Settings class (`backend/app/core/config.py`):
- Added external-auth configuration properties
- Added an `external_auth_full_url` property
- Added task-management configuration parameters
---
### Phase 3: Service Layer ✅
#### 1. External auth service (`backend/app/services/external_auth_service.py`)
**Core functionality:**
```python
class ExternalAuthService:
    async def authenticate_user(username, password) -> tuple[bool, AuthResponse, error]:
        # Calls the external API: POST https://pj-auth-api.vercel.app/api/auth/login
        # Retry logic: 3 attempts with exponential backoff
        # Returns: success, auth_data (tokens + user_info), error_msg
    async def validate_token(access_token) -> tuple[bool, payload]:
        # TODO: full JWT validation (signature, expiry, etc.)
    def is_token_expiring_soon(expires_at) -> bool:
        # Checks whether the token expires within TOKEN_REFRESH_BUFFER
```
**Error handling:**
- Automatic retry on HTTP timeouts
- Exponential backoff on 5xx errors
- Full logging
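A minimal sketch of that retry behavior, assuming httpx; the standalone function and `MAX_RETRIES` constant are illustrative, not the service's actual structure:
```python
import asyncio
import httpx

AUTH_URL = "https://pj-auth-api.vercel.app/api/auth/login"  # from settings in the real service
MAX_RETRIES = 3  # the report only specifies "3 attempts"

async def authenticate_with_retry(username: str, password: str) -> dict:
    """Call the external auth API, retrying timeouts and 5xx responses with exponential backoff."""
    async with httpx.AsyncClient(timeout=30.0) as client:
        for attempt in range(MAX_RETRIES):
            try:
                resp = await client.post(AUTH_URL, json={"username": username, "password": password})
                if resp.status_code < 500:
                    return resp.json()  # 2xx/4xx are final answers; only 5xx is retried
            except httpx.TimeoutException:
                pass  # fall through to the backoff sleep and retry
            await asyncio.sleep(2 ** attempt)  # 1s, 2s, 4s
    raise RuntimeError("External auth API unavailable after retries")
```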
#### 2. Task management service (`backend/app/services/task_service.py`)
**Core functionality:**
```python
class TaskService:
    # Create & query
    def create_task(db, user_id, filename, file_type) -> Task
    def get_task_by_id(db, task_id, user_id) -> Task  # user isolation
    def get_user_tasks(db, user_id, status, skip, limit) -> (tasks, total)
    # Update
    def update_task_status(db, task_id, user_id, status, error, time_ms) -> Task
    def update_task_results(db, task_id, user_id, paths...) -> Task
    # Delete & cleanup
    def delete_task(db, task_id, user_id) -> bool
    def auto_cleanup_expired_tasks(db) -> int  # based on TASK_RETENTION_DAYS
    # Stats
    def get_user_stats(db, user_id) -> dict  # counts by status
```
**Security characteristics:**
- Every query enforces `user_id` filtering (sketched below)
- Automatic per-user task quota checks
- Automatic cleanup of expired tasks
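A minimal sketch of the isolation and quota checks, assuming the SQLAlchemy `Task` model described above (helper names are illustrative):
```python
from sqlalchemy.orm import Session
from app.models.task import Task  # maps to tool_ocr_tasks

def get_task_by_id(db: Session, task_id: str, user_id: int):
    """Lookups are always scoped to the owner, so user A can never read user B's task."""
    return (
        db.query(Task)
        .filter(Task.task_id == task_id, Task.user_id == user_id)  # the isolation filter
        .first()
    )

def is_under_task_quota(db: Session, user_id: int, max_tasks: int = 1000) -> bool:
    """Enforce MAX_TASKS_PER_USER before creating a new task."""
    return db.query(Task).filter(Task.user_id == user_id).count() < max_tasks
```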
---
### Phases 4-6: API Endpoints ✅
#### 1. Auth endpoints (`backend/app/routers/auth_v2.py`)
**Route prefix**: `/api/v2/auth`
| Endpoint | Method | Description | Auth |
|------|------|------|------|
| `/login` | POST | Log in via the external API | No |
| `/logout` | POST | Log out (deletes the session) | Required |
| `/me` | GET | Get current user info | Required |
| `/sessions` | GET | List all of the user's sessions | Required |
**Login flow:**
```
1. Authenticate against the external API
2. Receive access_token, id_token, user_info
3. Create/update the user in the database (keyed by email)
4. Create a session record (tokens, IP, user agent)
5. Generate an internal JWT (contains user_id, session_id)
6. Return the internal JWT to the frontend
```
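Step 5 might look roughly like the following, assuming PyJWT; the secret and expiry handling are illustrative (real values come from settings):
```python
from datetime import datetime, timedelta, timezone

import jwt  # PyJWT

SECRET_KEY = "change-me"  # illustrative; read from configuration in practice
ALGORITHM = "HS256"       # assumed signing algorithm

def create_internal_token(user_id: int, session_id: str, expires_in: int = 86400) -> str:
    """Internal JWT carrying user_id + session_id, per step 5 of the login flow."""
    payload = {
        "sub": str(user_id),
        "session_id": session_id,
        "exp": datetime.now(timezone.utc) + timedelta(seconds=expires_in),
    }
    return jwt.encode(payload, SECRET_KEY, algorithm=ALGORITHM)
```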
#### 2. Task management endpoints (`backend/app/routers/tasks.py`)
**Route prefix**: `/api/v2/tasks`
| Endpoint | Method | Description | Auth |
|------|------|------|------|
| `/` | POST | Create a new task | Required |
| `/` | GET | List the user's tasks (paginated/filtered) | Required |
| `/stats` | GET | Get task statistics | Required |
| `/{task_id}` | GET | Get task details | Required |
| `/{task_id}` | PATCH | Update a task | Required |
| `/{task_id}` | DELETE | Delete a task | Required |
**Query parameters:**
- `status`: pending/processing/completed/failed
- `page`: page number (starting at 1)
- `page_size`: items per page (max 100)
- `order_by`: sort field (created_at/updated_at/completed_at)
- `order_desc`: sort descending
#### 3. Schema definitions
**Auth** (`backend/app/schemas/auth.py`):
- `LoginRequest`: username, password
- `Token`: access_token, token_type, expires_in, user (V2)
- `UserInfo`: id, email, display_name
- `UserResponse`: full user info
- `TokenData`: JWT payload structure
**Tasks** (`backend/app/schemas/task.py`):
- `TaskCreate`: filename, file_type
- `TaskUpdate`: status, error_message, paths...
- `TaskResponse`: basic task info
- `TaskDetailResponse`: task + file list
- `TaskListResponse`: paginated result
- `TaskStatsResponse`: statistics
---
### Phase 7: JWT Validation Dependency ✅
#### Updated `backend/app/core/deps.py`
**New V2 dependencies:**
```python
def get_current_user_v2(credentials, db) -> UserV2:
    # 1. Parse the JWT token
    # 2. Look up the user in the database (tool_ocr_users)
    # 3. Check that the user is active
    # 4. Validate the session (if a session_id is present)
    # 5. Check that the session has not expired
    # 6. Update last_accessed_at
    # 7. Return the user object
def get_current_active_user_v2(current_user) -> UserV2:
    # Ensure the user is in an active state
```
**Security checks:**
- JWT signature verification
- User existence check
- User active-status check
- Session validity check
- Session expiry check
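Fleshed out, the dependency could look like this sketch; `decode_internal_jwt` and `get_session` are placeholder names, not the actual helpers in `deps.py`:
```python
from datetime import datetime, timezone

from fastapi import HTTPException

def get_current_user_v2(token: str, db) -> "UserV2":
    payload = decode_internal_jwt(token)                  # 1. verify signature + exp (placeholder)
    user = db.query(UserV2).get(int(payload["sub"]))      # 2. load the user
    if user is None or not user.is_active:                # 3. active check
        raise HTTPException(status_code=401, detail="Inactive or unknown user")
    session = get_session(db, payload.get("session_id"))  # 4. session check (placeholder)
    if session is not None:
        if session.expires_at < datetime.now(timezone.utc):  # 5. expiry check
            raise HTTPException(status_code=401, detail="Session expired, please login again")
        session.last_accessed_at = datetime.now(timezone.utc)  # 6. touch
        db.commit()
    return user                                           # 7. hand back the user object
```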
---
### Phase 8: Router Registration ✅
#### Updated `backend/app/main.py`
```python
# Legacy V1 routers (kept for backward compatibility)
from app.routers import auth, ocr, export, translation
# V2 routers (new external auth system)
from app.routers import auth_v2, tasks
app.include_router(auth.router) # V1: /api/v1/auth
app.include_router(ocr.router) # V1: /api/v1/ocr
app.include_router(export.router) # V1: /api/v1/export
app.include_router(translation.router) # V1: /api/v1/translation
app.include_router(auth_v2.router) # V2: /api/v2/auth
app.include_router(tasks.router) # V2: /api/v2/tasks
```
**Versioning strategy:**
- The V1 API stays unchanged (backward compatible)
- The V2 API uses the new auth system
- The frontend can migrate incrementally
---
## 🔐 Security Features
### 1. User isolation
- ✅ All task queries enforce `user_id` filtering
- ✅ User A cannot access user B's tasks
- ✅ Row-level security enforced at the service layer
- ✅ Foreign-key CASCADE deletes keep data consistent
### 2. Session management
- ✅ Tracks IP address and user agent
- ✅ Automatic expiry checks
- ✅ Last-accessed timestamp updates
- ⚠️ Token encryption pending (currently stored in plaintext)
### 3. Authentication flow
- ✅ External API authentication (Azure AD)
- ✅ Internal JWT generation (contains user_id + session_id)
- ✅ Dual validation (JWT + session check)
- ✅ Error retry mechanism (3 attempts, exponential backoff)
### 4. Database security
- ✅ Table-prefix namespace isolation (`tool_ocr_`)
- ✅ Index optimization (email, task_id, status, created_at)
- ✅ Foreign-key constraints ensure referential integrity
- ✅ Soft-delete support (file_deleted flag)
---
## 📊 Database Schema
### Table relationships:
```
tool_ocr_users (1)
├── tool_ocr_sessions (N) [FK: user_id, CASCADE]
└── tool_ocr_tasks (N) [FK: user_id, CASCADE]
└── tool_ocr_task_files (N) [FK: task_id, CASCADE]
```
### Index strategy:
```sql
-- Users table
CREATE INDEX ix_tool_ocr_users_email ON tool_ocr_users(email); -- login lookups
CREATE INDEX ix_tool_ocr_users_is_active ON tool_ocr_users(is_active);
-- Sessions table
CREATE INDEX ix_tool_ocr_sessions_user_id ON tool_ocr_sessions(user_id);
CREATE INDEX ix_tool_ocr_sessions_expires_at ON tool_ocr_sessions(expires_at); -- expiry checks
CREATE INDEX ix_tool_ocr_sessions_created_at ON tool_ocr_sessions(created_at);
-- Tasks table
CREATE UNIQUE INDEX ix_tool_ocr_tasks_task_id ON tool_ocr_tasks(task_id); -- UUID lookups
CREATE INDEX ix_tool_ocr_tasks_user_id ON tool_ocr_tasks(user_id); -- per-user queries
CREATE INDEX ix_tool_ocr_tasks_status ON tool_ocr_tasks(status); -- status filtering
CREATE INDEX ix_tool_ocr_tasks_created_at ON tool_ocr_tasks(created_at); -- sorting
CREATE INDEX ix_tool_ocr_tasks_filename ON tool_ocr_tasks(filename); -- search
-- Task files table
CREATE INDEX ix_tool_ocr_task_files_task_id ON tool_ocr_task_files(task_id);
CREATE INDEX ix_tool_ocr_task_files_file_hash ON tool_ocr_task_files(file_hash); -- deduplication
```
---
## 🧪 Testing the Endpoints (Swagger UI)
### API documentation:
```
http://localhost:8000/docs
```
### Test flow:
#### 1. Login test
```bash
POST /api/v2/auth/login
Content-Type: application/json
{
"username": "user@example.com",
"password": "your_password"
}
# Successful response:
{
"access_token": "eyJhbGc...",
"token_type": "bearer",
"expires_in": 86400,
"user": {
"id": 1,
"email": "user@example.com",
"display_name": "User Name"
}
}
```
#### 2. Get the current user
```bash
GET /api/v2/auth/me
Authorization: Bearer eyJhbGc...
# Response:
{
"id": 1,
"email": "user@example.com",
"display_name": "User Name",
"created_at": "2025-11-14T16:00:00",
"last_login": "2025-11-14T16:30:00",
"is_active": true
}
```
#### 3. Create a task
```bash
POST /api/v2/tasks/
Authorization: Bearer eyJhbGc...
Content-Type: application/json
{
"filename": "document.pdf",
"file_type": "application/pdf"
}
# Response:
{
"id": 1,
"user_id": 1,
"task_id": "550e8400-e29b-41d4-a716-446655440000",
"filename": "document.pdf",
"file_type": "application/pdf",
"status": "pending",
"created_at": "2025-11-14T16:35:00",
...
}
```
#### 4. List tasks
```bash
GET /api/v2/tasks/?status=completed&page=1&page_size=10
Authorization: Bearer eyJhbGc...
# Response:
{
"tasks": [...],
"total": 25,
"page": 1,
"page_size": 10,
"has_more": true
}
```
#### 5. Get statistics
```bash
GET /api/v2/tasks/stats
Authorization: Bearer eyJhbGc...
# Response:
{
"total": 25,
"pending": 3,
"processing": 2,
"completed": 18,
"failed": 2
}
```
---
## ⚠️ Pending Items
### High priority (blocking)
1. **Token encryption** - tokens in the sessions table are currently stored in plaintext
- Needs: AES-256 encryption (see the sketch after this list)
- Location: `backend/app/routers/auth_v2.py` login endpoint
2. **Full JWT validation** - tokens are currently only decoded; signatures are not verified
- Needs: verification against the Azure AD public keys
- Location: `backend/app/services/external_auth_service.py`
3. **Frontend implementation** - Phases 9-11
- Auth service (token management)
- Task history UI page
- API integration
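One way to meet the AES-256 requirement is AES-GCM from the `cryptography` package. The sketch below is illustrative only; in particular, the key must come from secure configuration rather than being generated at import time:
```python
import os

from cryptography.hazmat.primitives.ciphers.aead import AESGCM

KEY = AESGCM.generate_key(bit_length=256)  # illustrative; load from secure config in practice

def encrypt_token(token: str) -> bytes:
    """Encrypt a token before writing it to tool_ocr_sessions."""
    nonce = os.urandom(12)                           # must be unique per encryption
    ciphertext = AESGCM(KEY).encrypt(nonce, token.encode(), None)
    return nonce + ciphertext                        # prepend the nonce for decryption

def decrypt_token(blob: bytes) -> str:
    nonce, ciphertext = blob[:12], blob[12:]
    return AESGCM(KEY).decrypt(nonce, ciphertext, None).decode()
```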
### Medium priority (functional)
4. **Token refresh** - automatically refresh tokens that are about to expire
5. **File upload integration** - wire the OCR service into the new task system
6. **Task notifications** - notify users when their tasks complete
7. **Error tracking** - detailed error logging and monitoring
### Low priority (optimization)
8. **Performance testing** - query performance under large task volumes
9. **Caching layer** - Redis cache for user sessions
10. **API rate limiting** - abuse prevention
11. **Documentation generation** - auto-generated API docs
---
## 📝 Migration Guide (Frontend Developers)
### 1. Update the login flow
**Old V1 approach:**
```typescript
// V1: Local authentication
const response = await fetch('/api/v1/auth/login', {
method: 'POST',
body: JSON.stringify({ username, password })
});
const { access_token } = await response.json();
```
**New V2 approach:**
```typescript
// V2: External Azure AD authentication
const response = await fetch('/api/v2/auth/login', {
method: 'POST',
body: JSON.stringify({ username, password }) // Same interface!
});
const { access_token, user } = await response.json();
// Store token and user info
localStorage.setItem('token', access_token);
localStorage.setItem('user', JSON.stringify(user));
```
### 2. Use the new task APIs
```typescript
// Fetch the task list
const response = await fetch('/api/v2/tasks/?page=1&page_size=20', {
headers: {
'Authorization': `Bearer ${token}`
}
});
const { tasks, total, has_more } = await response.json();
// Fetch statistics
const statsResponse = await fetch('/api/v2/tasks/stats', {
headers: { 'Authorization': `Bearer ${token}` }
});
const stats = await statsResponse.json();
// { total: 25, pending: 3, processing: 2, completed: 18, failed: 2 }
```
### 3. Handle authentication errors
```typescript
const response = await fetch('/api/v2/tasks/', {
  headers: { 'Authorization': `Bearer ${token}` }
});
if (response.status === 401) {
  // Token expired or invalid; log in again
  const data = await response.json();
  if (data.detail === "Session expired, please login again") {
    // Clear the local token and redirect to the login page
    localStorage.removeItem('token');
    window.location.href = '/login';
  }
}
```
---
## 🔍 Debugging & Monitoring
### Log location:
```
./logs/app.log
```
### Key log events:
- `Authentication successful for user: {email}` - login succeeded
- `Created session {id} for user {email}` - session created
- `Authenticated user: {email} (ID: {id})` - JWT validated
- `Expired session {id} for user {email}` - session expired
- `Created task {task_id} for user {email}` - task created
### Database queries:
```sql
-- Check a user
SELECT * FROM tool_ocr_users WHERE email = 'user@example.com';
-- Check sessions
SELECT * FROM tool_ocr_sessions WHERE user_id = 1 ORDER BY created_at DESC;
-- Check tasks
SELECT * FROM tool_ocr_tasks WHERE user_id = 1 ORDER BY created_at DESC LIMIT 10;
-- Statistics
SELECT status, COUNT(*) FROM tool_ocr_tasks WHERE user_id = 1 GROUP BY status;
```
---
## ✅ Summary
### Done:
- ✅ Complete database schema design (4 new tables)
- ✅ External API auth service integration
- ✅ User session management system
- ✅ Task management service (CRUD + isolation)
- ✅ RESTful API endpoints (auth + tasks)
- ✅ JWT validation dependency
- ✅ Database migration script
- ✅ API schema definitions
### Up next:
- ⏳ Frontend auth service
- ⏳ Frontend task history UI
- ⏳ Integration tests
- ⏳ Documentation updates
### Technical debt:
- ⚠️ Token encryption (high priority)
- ⚠️ Full JWT validation (high priority)
- ⚠️ Token refresh mechanism
---
**Completed on**: 2025-11-14
**Implemented by**: Claude Code
**Review status**: pending user testing and review

View File

@@ -0,0 +1,304 @@
# Migration Progress Update - 2025-11-14
## Overview
The core of the external Azure AD authentication migration is **80%** complete. All backend APIs and the main frontend features are implemented and running.
---
## ✅ Completed
### 1. Database schema redesign ✅ **100% complete**
- ✅ 1.3 Created the new database schema with the `tool_ocr_` prefix
- ✅ 1.4 Created SQLAlchemy models
- `backend/app/models/user_v2.py` - user model (email as the primary identifier)
- `backend/app/models/task.py` - task model (with user isolation)
- `backend/app/models/session.py` - session management model
- `backend/app/models/audit_log.py` - audit log model
- ✅ 1.5 Generated the Alembic migration script
- `5e75a59fb763_add_external_auth_schema_with_task_.py`
### 2. Configuration management ✅ **100% complete**
- ✅ 2.1 Updated environment configuration
- Added `EXTERNAL_AUTH_API_URL`
- Added `EXTERNAL_AUTH_ENDPOINT`
- Added `TOKEN_REFRESH_BUFFER`
- Added task-management settings
- ✅ 2.2 Updated the Settings class
- `backend/app/core/config.py` updated with all new settings
### 3. External API integration ✅ **100% complete**
- ✅ 3.1-3.3 Created the auth API client
- `backend/app/services/external_auth_service.py`
- Implements `authenticate_user()`, `is_token_expiring_soon()`
- Includes retry logic and timeout handling
### 4. Backend auth updates ✅ **100% complete**
- ✅ 4.1 Modified the login endpoint
- `backend/app/routers/auth_v2.py`
- Full external API authentication flow
- Automatic user creation/update
- ✅ 4.2-4.3 Updated token validation
- `backend/app/core/deps.py`
- `get_current_user_v2()` dependency injection
- `get_current_admin_user_v2()` admin permission check
### 5. Session & token management ✅ **100% complete**
- ✅ 5.1 Implemented token storage
- Stored in `tool_ocr_sessions`
- Records IP address, User-Agent, and expiry time
- ✅ 5.2 Built the token refresh mechanism
- **Frontend**: auto-refreshes 5 minutes before expiry
- **Backend**: `POST /api/v2/auth/refresh` endpoint
- **Behavior**: automatic retry on 401 errors
- ✅ 5.3 Session invalidation
- `POST /api/v2/auth/logout` supports single-session or all-session logout
### 6. Frontend updates ✅ **90% complete**
- ✅ 6.1 Updated the auth service
- `frontend/src/services/apiV2.ts` - full V2 API client
- Automatic token refresh and retry
- ✅ 6.2 Updated the auth store
- `frontend/src/store/authStore.ts` stores user info
- ✅ 6.3 Updated UI components
- `frontend/src/pages/LoginPage.tsx` integrates V2 login
- `frontend/src/components/Layout.tsx` shows the user name and logout
- ✅ 6.4 Error handling
- Full error display and retry logic
### 7. Task management system ✅ **100% complete**
- ✅ 7.1 Built the task management backend
- `backend/app/services/task_service.py`
- Full CRUD operations with user isolation
- ✅ 7.2 Implemented the task APIs
- `backend/app/routers/tasks.py`
- `GET /api/v2/tasks` - task list (with pagination)
- `GET /api/v2/tasks/{id}` - task details
- `DELETE /api/v2/tasks/{id}` - delete a task
- `POST /api/v2/tasks/{id}/start` - start a task
- `POST /api/v2/tasks/{id}/cancel` - cancel a task
- `POST /api/v2/tasks/{id}/retry` - retry a task
- ✅ 7.3 Built the task history endpoints
- `GET /api/v2/tasks/stats` - per-user statistics
- Supports status, filename, and date-range filters
- ✅ 7.4 Implemented file access control
- `backend/app/services/file_access_service.py`
- Verifies user ownership
- Checks task status and file existence
- ✅ 7.5 File downloads
- `GET /api/v2/tasks/{id}/download/json`
- `GET /api/v2/tasks/{id}/download/markdown`
- `GET /api/v2/tasks/{id}/download/pdf`
### 8. Frontend task management UI ✅ **100% complete**
- ✅ 8.1 Built the task history page
- `frontend/src/pages/TaskHistoryPage.tsx`
- Full task list with status indicators
- Pagination controls
- ✅ 8.3 Built the filter components
- Status filter dropdown
- Filename search input
- Date range pickers (start/end)
- Clear-filters button
- ✅ 8.4-8.5 Task management services
- `frontend/src/services/apiV2.ts` integrates all task APIs
- Full error handling and retry logic
- ✅ 8.6 Updated navigation
- `frontend/src/App.tsx` adds the `/tasks` route
- `frontend/src/components/Layout.tsx` adds the "Task History" menu item
### 9. User isolation & security ✅ **100% complete**
- ✅ 9.1-9.2 User context and query isolation
- All task queries automatically filter by `user_id`
- Strict user ownership validation
- ✅ 9.3 File system isolation
- File paths validated before download
- User ownership checked
- ✅ 9.4 API authorization
- All V2 endpoints use the `get_current_user_v2` dependency
- 403 errors for unauthorized access
### 10. Admin features ✅ **100% complete (backend)**
- ✅ 10.1 Admin permission system
- `backend/app/services/admin_service.py`
- Admin email: `ymirliu@panjit.com.tw`
- `get_current_admin_user_v2()` dependency injection
- ✅ 10.2 System statistics APIs
- `GET /api/v2/admin/stats` - system overview statistics
- `GET /api/v2/admin/users` - user list (with stats)
- `GET /api/v2/admin/users/top` - user leaderboard
- ✅ 10.3 Audit log system
- `backend/app/models/audit_log.py` - audit log model
- `backend/app/services/audit_service.py` - audit service
- `GET /api/v2/admin/audit-logs` - audit log queries
- `GET /api/v2/admin/audit-logs/user/{id}/summary` - per-user activity summary
- ✅ 10.4 Admin router registration
- `backend/app/routers/admin.py`
- Registered in `backend/app/main.py`
---
## 🚧 In Progress / Pending
### 11. Database migration ⚠️ **pending execution**
- ⏳ 11.1 Create the audit-log table migration
- Needs: `alembic revision` to create `tool_ocr_audit_logs`
- The table structure is already defined in `audit_log.py`
- ⏳ 11.2 Run the migration
- Run `alembic upgrade head`
### 12. Frontend admin pages ⏳ **20% complete**
- ⏳ 12.1 Admin dashboard page
- Needs: `frontend/src/pages/AdminDashboardPage.tsx`
- Show system stats (users, tasks, sessions, activity)
- User list and leaderboard
- ⏳ 12.2 Audit log viewer
- Needs: `frontend/src/pages/AuditLogsPage.tsx`
- Show the audit log list
- Support filters (user, category, date range)
- Per-user activity summaries
- ⏳ 12.3 Admin routes and navigation
- Update `App.tsx` with admin routes
- Show the admin menu in `Layout.tsx` (admins only)
### 13. Testing ⏳ **not started**
- Everything still needs full testing
- Suggest prioritizing the core auth and task-management flows
### 14. Documentation ⏳ **partially complete**
- ✅ Implementation reports written
- ⏳ API documentation needs updating
- ⏳ A user guide still needs to be written
---
## 📊 Completion Summary
| Module | Completion | Status |
|------|--------|------|
| Database schema | 100% | ✅ Done |
| Configuration | 100% | ✅ Done |
| External API integration | 100% | ✅ Done |
| Backend auth | 100% | ✅ Done |
| Token management | 100% | ✅ Done |
| Frontend auth | 90% | ✅ Mostly done |
| Task management backend | 100% | ✅ Done |
| Task management frontend | 100% | ✅ Done |
| User isolation | 100% | ✅ Done |
| Admin features (backend) | 100% | ✅ Done |
| Admin features (frontend) | 20% | ⏳ Pending |
| Database migration | 90% | ⚠️ Pending execution |
| Testing | 0% | ⏳ Not started |
| Documentation | 50% | ⏳ In progress |
**Overall completion: 80%**
---
## 🎯 Key Achievements
### 1. Automatic token refresh 🎉
- **Frontend**: auto-refreshes 5 minutes before expiry for a seamless experience
- **Backend**: `/api/v2/auth/refresh` endpoint
- **Error handling**: automatic retry on 401
### 2. Complete task management system 🎉
- **Task operations**: start/cancel/retry/delete
- **Task filters**: status/filename/date range
- **File downloads**: JSON/Markdown/PDF formats
- **Access control**: strict user isolation and permission checks
### 3. Admin monitoring system 🎉
- **System stats**: users, tasks, sessions, activity
- **User management**: user list, leaderboard
- **Audit logs**: complete event recording and query system
### 4. Security hardening 🎉
- **User isolation**: every query automatically filtered by user ID
- **File access control**: ownership and task-status checks
- **Audit trail**: all important operations recorded
---
## 📝 Key Files
### New backend files
```
backend/app/models/
├── user_v2.py                 # user model (external auth)
├── task.py                    # task model
├── session.py                 # session model
└── audit_log.py               # audit log model
backend/app/services/
├── external_auth_service.py   # external auth service
├── task_service.py            # task management service
├── file_access_service.py     # file access control
├── admin_service.py           # admin service
└── audit_service.py           # audit log service
backend/app/routers/
├── auth_v2.py                 # V2 auth routes
├── tasks.py                   # task management routes
└── admin.py                   # admin routes
backend/alembic/versions/
└── 5e75a59fb763_add_external_auth_schema_with_task_.py
```
### New/modified frontend files
```
frontend/src/services/
└── apiV2.ts                   # full V2 API client
frontend/src/pages/
├── LoginPage.tsx              # integrates V2 login
└── TaskHistoryPage.tsx        # task history page
frontend/src/components/
└── Layout.tsx                 # navigation and user info
frontend/src/types/
└── apiV2.ts                   # V2 type definitions
```
---
## 🚀 Next Steps
### Immediate
1. **Commit current progress** - all core features are implemented
2. **Run the database migration** - run the Alembic migration to add the audit_logs table
3. **System testing** - test the authentication flow and task management
### Optional enhancements
1. **Frontend admin pages** - admin dashboard and audit log viewer
2. **Full test suite** - unit and integration tests
3. **Performance tuning** - query optimization and caching strategy
---
## 🔒 Security Notes
### Implemented
- ✅ User isolation (row-level security)
- ✅ File access control
- ✅ Token expiry checks
- ✅ Admin permission checks
- ✅ Audit logging
### Pending (optional)
- ⏳ Encrypted token storage
- ⏳ Rate limiting
- ⏳ Hardened CSRF protection
---
## 📞 Contact
**Admin email**: ymirliu@panjit.com.tw
**External auth API**: https://pj-auth-api.vercel.app
---
*Last updated: 2025-11-14*
*Implemented by: Claude Code*

View File

@@ -0,0 +1,183 @@
-- Tool_OCR Database Schema with External API Authentication
-- Version: 2.0.0
-- Date: 2025-11-14
-- Description: Complete database redesign with user task isolation and history
-- ============================================
-- Drop existing tables (if needed)
-- ============================================
-- Uncomment these lines to drop existing tables
-- DROP TABLE IF EXISTS tool_ocr_sessions;
-- DROP TABLE IF EXISTS tool_ocr_task_files;
-- DROP TABLE IF EXISTS tool_ocr_tasks;
-- DROP TABLE IF EXISTS tool_ocr_users;
-- ============================================
-- 1. Users Table
-- ============================================
CREATE TABLE IF NOT EXISTS tool_ocr_users (
id INT PRIMARY KEY AUTO_INCREMENT,
email VARCHAR(255) UNIQUE NOT NULL COMMENT 'Primary identifier from Azure AD',
display_name VARCHAR(255) COMMENT 'Display name from API response',
created_at TIMESTAMP DEFAULT CURRENT_TIMESTAMP,
last_login TIMESTAMP NULL,
is_active BOOLEAN DEFAULT TRUE,
INDEX idx_email (email),
INDEX idx_active (is_active)
) ENGINE=InnoDB DEFAULT CHARSET=utf8mb4 COLLATE=utf8mb4_unicode_ci
COMMENT='User accounts authenticated via external API';
-- ============================================
-- 2. OCR Tasks Table
-- ============================================
CREATE TABLE IF NOT EXISTS tool_ocr_tasks (
id INT PRIMARY KEY AUTO_INCREMENT,
user_id INT NOT NULL COMMENT 'Foreign key to users table',
task_id VARCHAR(255) UNIQUE NOT NULL COMMENT 'Unique task identifier (UUID)',
filename VARCHAR(255),
file_type VARCHAR(50),
status ENUM('pending', 'processing', 'completed', 'failed') DEFAULT 'pending',
result_json_path VARCHAR(500) COMMENT 'Path to JSON result file',
result_markdown_path VARCHAR(500) COMMENT 'Path to Markdown result file',
result_pdf_path VARCHAR(500) COMMENT 'Path to searchable PDF file',
error_message TEXT COMMENT 'Error details if task failed',
processing_time_ms INT COMMENT 'Processing time in milliseconds',
created_at TIMESTAMP DEFAULT CURRENT_TIMESTAMP,
updated_at TIMESTAMP DEFAULT CURRENT_TIMESTAMP ON UPDATE CURRENT_TIMESTAMP,
completed_at TIMESTAMP NULL,
file_deleted BOOLEAN DEFAULT FALSE COMMENT 'Track if files were auto-deleted',
FOREIGN KEY (user_id) REFERENCES tool_ocr_users(id) ON DELETE CASCADE,
INDEX idx_user_status (user_id, status),
INDEX idx_created (created_at),
INDEX idx_task_id (task_id),
INDEX idx_filename (filename)
) ENGINE=InnoDB DEFAULT CHARSET=utf8mb4 COLLATE=utf8mb4_unicode_ci
COMMENT='OCR processing tasks with user association';
-- ============================================
-- 3. Task Files Table
-- ============================================
CREATE TABLE IF NOT EXISTS tool_ocr_task_files (
id INT PRIMARY KEY AUTO_INCREMENT,
task_id INT NOT NULL COMMENT 'Foreign key to tasks table',
original_name VARCHAR(255),
stored_path VARCHAR(500) COMMENT 'Actual file path on server',
file_size BIGINT COMMENT 'File size in bytes',
mime_type VARCHAR(100),
file_hash VARCHAR(64) COMMENT 'SHA256 hash for deduplication',
created_at TIMESTAMP DEFAULT CURRENT_TIMESTAMP,
FOREIGN KEY (task_id) REFERENCES tool_ocr_tasks(id) ON DELETE CASCADE,
INDEX idx_task (task_id),
INDEX idx_hash (file_hash)
) ENGINE=InnoDB DEFAULT CHARSET=utf8mb4 COLLATE=utf8mb4_unicode_ci
COMMENT='Files associated with OCR tasks';
-- ============================================
-- 4. Sessions Table (Token Storage)
-- ============================================
CREATE TABLE IF NOT EXISTS tool_ocr_sessions (
id INT PRIMARY KEY AUTO_INCREMENT,
user_id INT NOT NULL COMMENT 'Foreign key to users table',
session_id VARCHAR(255) UNIQUE NOT NULL COMMENT 'Unique session identifier',
access_token TEXT COMMENT 'Azure AD access token (encrypted)',
id_token TEXT COMMENT 'Azure AD ID token (encrypted)',
refresh_token TEXT COMMENT 'Refresh token if available',
expires_at TIMESTAMP NOT NULL COMMENT 'Token expiration time',
created_at TIMESTAMP DEFAULT CURRENT_TIMESTAMP,
last_accessed TIMESTAMP DEFAULT CURRENT_TIMESTAMP ON UPDATE CURRENT_TIMESTAMP,
is_active BOOLEAN DEFAULT TRUE,
ip_address VARCHAR(45) COMMENT 'Client IP address',
user_agent TEXT COMMENT 'Client user agent',
FOREIGN KEY (user_id) REFERENCES tool_ocr_users(id) ON DELETE CASCADE,
INDEX idx_user (user_id),
INDEX idx_session (session_id),
INDEX idx_expires (expires_at),
INDEX idx_active (is_active)
) ENGINE=InnoDB DEFAULT CHARSET=utf8mb4 COLLATE=utf8mb4_unicode_ci
COMMENT='User session and token management';
-- ============================================
-- 5. Audit Log Table (Optional)
-- ============================================
CREATE TABLE IF NOT EXISTS tool_ocr_audit_logs (
id BIGINT PRIMARY KEY AUTO_INCREMENT,
user_id INT COMMENT 'User who performed the action',
action VARCHAR(100) NOT NULL COMMENT 'Action performed',
entity_type VARCHAR(50) COMMENT 'Type of entity affected',
entity_id INT COMMENT 'ID of entity affected',
details JSON COMMENT 'Additional details in JSON format',
ip_address VARCHAR(45),
user_agent TEXT,
created_at TIMESTAMP DEFAULT CURRENT_TIMESTAMP,
INDEX idx_user (user_id),
INDEX idx_action (action),
INDEX idx_created (created_at)
) ENGINE=InnoDB DEFAULT CHARSET=utf8mb4 COLLATE=utf8mb4_unicode_ci
COMMENT='Audit trail for all system actions';
-- ============================================
-- Views for Common Queries
-- ============================================
-- User task statistics view
CREATE OR REPLACE VIEW tool_ocr_user_stats AS
SELECT
u.id as user_id,
u.email,
u.display_name,
COUNT(DISTINCT t.id) as total_tasks,
SUM(CASE WHEN t.status = 'completed' THEN 1 ELSE 0 END) as completed_tasks,
SUM(CASE WHEN t.status = 'failed' THEN 1 ELSE 0 END) as failed_tasks,
SUM(CASE WHEN t.status = 'processing' THEN 1 ELSE 0 END) as processing_tasks,
SUM(CASE WHEN t.status = 'pending' THEN 1 ELSE 0 END) as pending_tasks,
AVG(t.processing_time_ms) as avg_processing_time_ms,
MAX(t.created_at) as last_task_created
FROM tool_ocr_users u
LEFT JOIN tool_ocr_tasks t ON u.id = t.user_id
GROUP BY u.id, u.email, u.display_name;
-- Recent tasks view
CREATE OR REPLACE VIEW tool_ocr_recent_tasks AS
SELECT
t.*,
u.email as user_email,
u.display_name as user_name
FROM tool_ocr_tasks t
INNER JOIN tool_ocr_users u ON t.user_id = u.id
ORDER BY t.created_at DESC
LIMIT 100;
-- ============================================
-- Stored Procedures (Optional)
-- ============================================
DELIMITER $$
-- Procedure to clean up expired sessions
CREATE PROCEDURE IF NOT EXISTS cleanup_expired_sessions()
BEGIN
DELETE FROM tool_ocr_sessions
WHERE expires_at < NOW() OR is_active = FALSE;
END$$
-- Procedure to clean up old tasks
CREATE PROCEDURE IF NOT EXISTS cleanup_old_tasks(IN days_to_keep INT)
BEGIN
UPDATE tool_ocr_tasks
SET file_deleted = TRUE
WHERE created_at < DATE_SUB(NOW(), INTERVAL days_to_keep DAY)
AND status IN ('completed', 'failed');
END$$
DELIMITER ;
-- ============================================
-- Initial Data (Optional)
-- ============================================
-- Add any initial data here if needed
-- ============================================
-- Grants (Adjust as needed)
-- ============================================
-- GRANT ALL PRIVILEGES ON tool_ocr_* TO 'tool_ocr_user'@'localhost';
-- FLUSH PRIVILEGES;

View File

@@ -0,0 +1,294 @@
# Change: Migrate to External API Authentication
## Why
The current local database authentication system has several limitations:
- User credentials are managed locally, requiring manual user creation and password management
- No centralized authentication with enterprise identity systems
- Cannot leverage existing enterprise authentication infrastructure (e.g., Microsoft Azure AD)
- No single sign-on (SSO) capability
- Increased maintenance overhead for user management
By migrating to the external API authentication service at https://pj-auth-api.vercel.app, the system will:
- Integrate with enterprise Microsoft Azure AD authentication
- Enable single sign-on (SSO) for users
- Eliminate local password management
- Leverage existing enterprise user management and security policies
- Reduce maintenance overhead
- Provide consistent authentication across multiple applications
## What Changes
### Authentication Flow
- **Current**: Local database authentication using username/password stored in MySQL
- **New**: External API authentication via POST to `https://pj-auth-api.vercel.app/api/auth/login`
- **Token Management**: Use JWT tokens from external API instead of locally generated tokens
- **User Display**: Use `name` field from API response for user display instead of local username
### API Integration
**Endpoint**: `POST https://pj-auth-api.vercel.app/api/auth/login`
**Request Format**:
```json
{
"username": "user@domain.com",
"password": "user_password"
}
```
**Success Response (200)**:
```json
{
"success": true,
"message": "認證成功",
"data": {
"access_token": "eyJ0eXAiOiJKV1Q...",
"id_token": "eyJ0eXAiOiJKV1Q...",
"expires_in": 4999,
"token_type": "Bearer",
"userInfo": {
"id": "42cf0b98-f598-47dd-ae2a-f33803f87d41",
"name": "ymirliu 劉念萱",
"email": "ymirliu@panjit.com.tw",
"jobTitle": null,
"officeLocation": "高雄",
"businessPhones": ["1580"]
},
"issuedAt": "2025-11-14T07:09:15.203Z",
"expiresAt": "2025-11-14T08:32:34.203Z"
},
"timestamp": "2025-11-14T07:09:15.203Z"
}
```
**Failure Response (401)**:
```json
{
"success": false,
"error": "用戶名或密碼錯誤",
"code": "INVALID_CREDENTIALS",
"timestamp": "2025-11-14T07:10:02.585Z"
}
```
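A minimal client sketch against this contract, assuming httpx (error handling simplified; field access mirrors the response shapes above):
```python
import httpx

async def login(username: str, password: str) -> dict:
    """POST to the external auth API and return the userInfo block on success."""
    async with httpx.AsyncClient(timeout=30.0) as client:
        resp = await client.post(
            "https://pj-auth-api.vercel.app/api/auth/login",
            json={"username": username, "password": password},
        )
    body = resp.json()
    if resp.status_code == 200 and body.get("success"):
        return body["data"]["userInfo"]  # name/email drive the user display
    raise PermissionError(body.get("error", "Authentication failed"))
```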
### Database Schema Changes
**Complete Redesign (No backward compatibility needed)**:
**Table Prefix**: `tool_ocr_` (for clear separation from other systems in the same database)
1. **tool_ocr_users table (redesigned)**:
```sql
CREATE TABLE tool_ocr_users (
id INT PRIMARY KEY AUTO_INCREMENT,
email VARCHAR(255) UNIQUE NOT NULL, -- Primary identifier from Azure AD
display_name VARCHAR(255), -- Display name from API response
created_at TIMESTAMP DEFAULT CURRENT_TIMESTAMP,
last_login TIMESTAMP,
is_active BOOLEAN DEFAULT TRUE
);
```
Note: No Azure AD ID storage needed - email is sufficient as unique identifier
2. **tool_ocr_tasks table (new - for task history)**:
```sql
CREATE TABLE tool_ocr_tasks (
id INT PRIMARY KEY AUTO_INCREMENT,
user_id INT NOT NULL, -- Foreign key to users table
task_id VARCHAR(255) UNIQUE, -- Unique task identifier
filename VARCHAR(255),
file_type VARCHAR(50),
status ENUM('pending', 'processing', 'completed', 'failed'),
result_json_path VARCHAR(500),
result_markdown_path VARCHAR(500),
error_message TEXT,
created_at TIMESTAMP DEFAULT CURRENT_TIMESTAMP,
updated_at TIMESTAMP DEFAULT CURRENT_TIMESTAMP ON UPDATE CURRENT_TIMESTAMP,
completed_at TIMESTAMP NULL,
file_deleted BOOLEAN DEFAULT FALSE, -- Track if files were auto-deleted
FOREIGN KEY (user_id) REFERENCES tool_ocr_users(id),
INDEX idx_user_status (user_id, status),
INDEX idx_created (created_at)
);
```
3. **tool_ocr_task_files table (for multiple files per task)**:
```sql
CREATE TABLE tool_ocr_task_files (
id INT PRIMARY KEY AUTO_INCREMENT,
task_id INT NOT NULL,
original_name VARCHAR(255),
stored_path VARCHAR(500),
file_size BIGINT,
mime_type VARCHAR(100),
FOREIGN KEY (task_id) REFERENCES tool_ocr_tasks(id) ON DELETE CASCADE
);
```
4. **tool_ocr_sessions table (for token management)**:
```sql
CREATE TABLE tool_ocr_sessions (
id INT PRIMARY KEY AUTO_INCREMENT,
user_id INT NOT NULL,
access_token TEXT,
id_token TEXT,
expires_at TIMESTAMP,
created_at TIMESTAMP DEFAULT CURRENT_TIMESTAMP,
FOREIGN KEY (user_id) REFERENCES tool_ocr_users(id) ON DELETE CASCADE,
INDEX idx_user (user_id),
INDEX idx_expires (expires_at)
);
```
### Session Management
- Store external API tokens in session/cache instead of local JWT
- Implement token refresh mechanism based on `expires_in` field
- Use `expiresAt` timestamp for token expiration validation
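A sketch of the refresh-buffer check; the parsing helper and names are illustrative:
```python
from datetime import datetime, timedelta, timezone

TOKEN_REFRESH_BUFFER = 300  # seconds, from configuration

def parse_expires_at(value: str) -> datetime:
    """Parse the API's ISO-8601 `expiresAt` (e.g. 2025-11-14T08:32:34.203Z)."""
    return datetime.fromisoformat(value.replace("Z", "+00:00"))

def is_token_expiring_soon(expires_at: datetime) -> bool:
    """True once the token is within the refresh buffer, i.e. refresh it now."""
    return datetime.now(timezone.utc) >= expires_at - timedelta(seconds=TOKEN_REFRESH_BUFFER)
```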
## New Features: User Task Isolation and History
### Task Isolation
- **Principle**: Each user can only see and access their own tasks
- **Implementation**: All task queries filtered by `user_id` at API level
- **Security**: Enforce user context validation in all task-related endpoints
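At the endpoint level, the validation could look like the following FastAPI sketch; `get_current_user`, `get_db`, and `Task` are assumed names, and answering 404 rather than 403 avoids leaking whether another user's task exists:
```python
from fastapi import APIRouter, Depends, HTTPException

router = APIRouter(prefix="/tasks")

@router.get("/{task_id}")
def get_task(task_id: str, user=Depends(get_current_user), db=Depends(get_db)):
    # The query itself is scoped to the caller, so a foreign task is simply not found.
    task = (
        db.query(Task)
        .filter(Task.task_id == task_id, Task.user_id == user.id)
        .first()
    )
    if task is None:
        raise HTTPException(status_code=404, detail="Task not found")
    return task
```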
### Task History Features
1. **Task Status Tracking**:
- View pending tasks (waiting to process)
- View processing tasks (currently running)
- View completed tasks (with results available)
- View failed tasks (with error messages)
2. **Historical Query Capabilities**:
- Search tasks by filename
- Filter by date range
- Filter by status
- Sort by creation/completion time
- Pagination for large result sets
3. **Task Management**:
- Download original files (if not auto-deleted)
- Download results (JSON, Markdown, PDF exports)
- Re-process failed tasks
- Delete old tasks manually
### Frontend UI Changes
1. **New Components**:
- Task History page/tab
- Task filters and search bar
- Task status badges
- Batch action controls
2. **Task List View**:
```
| Filename | Status | Created | Completed | Actions |
|----------|--------|---------|-----------|---------|
| doc1.pdf | ✅ Completed | 2025-11-14 10:00 | 2025-11-14 10:05 | [Download] [View] |
| doc2.pdf | 🔄 Processing | 2025-11-14 10:10 | - | [Cancel] |
| doc3.pdf | ❌ Failed | 2025-11-14 09:00 | - | [Retry] [View Error] |
```
3. **User Information Display**:
- Show user display name in header
- Show last login time
- Show task statistics (total, completed, failed)
## Impact
### Affected Capabilities
- `authentication`: Complete replacement of authentication mechanism
- `user-management`: Simplified to read-only user information from external API
- `session-management`: Modified to handle external tokens
- `task-management`: NEW - User-specific task isolation and history
- `file-access-control`: NEW - User-based file access restrictions
### Affected Code
- **Backend Authentication**:
- `backend/app/api/v1/endpoints/auth.py`: Replace login logic with external API call
- `backend/app/core/security.py`: Modify token validation to use external tokens
- `backend/app/core/auth.py`: Update authentication dependencies
- `backend/app/services/auth_service.py`: New service for external API integration
- **Database Models**:
- `backend/app/models/user.py`: Complete redesign with new schema
- `backend/app/models/task.py`: NEW - Task model with user association
- `backend/app/models/task_file.py`: NEW - Task file model
- `backend/alembic/versions/`: Complete database recreation
- **Task Management APIs** (NEW):
- `backend/app/api/v1/endpoints/tasks.py`: Task CRUD operations with user isolation
- `backend/app/api/v1/endpoints/task_history.py`: Historical query endpoints
- `backend/app/services/task_service.py`: Task business logic
- `backend/app/services/file_access_service.py`: User-based file access control
- **Frontend**:
- `frontend/src/services/authService.ts`: Update to handle new token format
- `frontend/src/stores/authStore.ts`: Modify to store/display user info from API
- `frontend/src/components/Header.tsx`: Display `name` field and user menu
- `frontend/src/pages/TaskHistory.tsx`: NEW - Task history page
- `frontend/src/components/TaskList.tsx`: NEW - Task list component with filters
- `frontend/src/components/TaskFilters.tsx`: NEW - Search and filter UI
- `frontend/src/stores/taskStore.ts`: NEW - Task state management
- `frontend/src/services/taskService.ts`: NEW - Task API client
### Dependencies
- Add `httpx` or `aiohttp` for async HTTP requests to external API (already present)
- No new package dependencies required
### Configuration
- New environment variables:
- `EXTERNAL_AUTH_API_URL` = "https://pj-auth-api.vercel.app"
- `EXTERNAL_AUTH_ENDPOINT` = "/api/auth/login"
- `EXTERNAL_AUTH_TIMEOUT` = 30 (seconds)
- `TOKEN_REFRESH_BUFFER` = 300 (refresh tokens 5 minutes before expiry)
- `TASK_RETENTION_DAYS` = 30 (auto-delete old tasks)
- `MAX_TASKS_PER_USER` = 1000 (limit per user)
- `ENABLE_TASK_HISTORY` = true (enable history feature)
- `DATABASE_TABLE_PREFIX` = "tool_ocr_" (table naming prefix)
### Security Considerations
- HTTPS required for all authentication requests
- Token storage must be secure (HTTPOnly cookies or secure session storage)
- Implement rate limiting for authentication attempts
- Log all authentication events for audit trail
- Validate SSL certificates for external API calls
- Handle network failures gracefully with appropriate error messages
- **User Isolation**: Enforce user context in all database queries
- **File Access Control**: Validate user ownership before file access
- **API Security**: Add user_id validation in all task-related endpoints
### Migration Plan (Simplified - No Rollback Needed)
1. **Phase 1**: Backup existing database (for reference only)
2. **Phase 2**: Drop old tables and create new schema
3. **Phase 3**: Deploy new authentication and task management system
4. **Phase 4**: Test with initial users
5. **Phase 5**: Full deployment
Note: Since this is a test system with no production data to preserve, we can perform a clean migration without rollback concerns.
## Risks and Mitigations
### Risks
1. **External API Unavailability**: Authentication service downtime blocks all logins
- *Mitigation*: Implement fallback to local auth, cache tokens, implement retry logic
2. **Token Expiration Handling**: Users may be logged out unexpectedly
- *Mitigation*: Implement automatic token refresh before expiration
3. **Network Latency**: Slower authentication due to external API calls
- *Mitigation*: Implement proper timeout handling, async requests, response caching
4. **Data Consistency**: User information mismatch between local DB and external system
- *Mitigation*: Regular sync jobs, use external system as single source of truth
5. **Breaking Change**: Existing sessions will be invalidated
- *Mitigation*: Provide migration window, clear communication to users
## Success Criteria
- All users can authenticate via external API
- Authentication response time < 2 seconds (95th percentile)
- Zero data loss during migration
- Automatic token refresh works without user intervention
- Proper error messages for all failure scenarios
- Audit logs capture all authentication events
- Rollback procedure tested and documented

View File

@@ -0,0 +1,276 @@
# Implementation Tasks
## 1. Database Schema Redesign
- [ ] 1.1 Backup existing database (for reference)
- Export current schema and data
- Document any important data to preserve
- [ ] 1.2 Drop old tables
- Remove existing tables with old naming convention
- Clear database for fresh start
- [ ] 1.3 Create new database schema with `tool_ocr_` prefix
- Create new `tool_ocr_users` table (email as primary identifier)
- Create `tool_ocr_tasks` table with user association
- Create `tool_ocr_task_files` table for file tracking
- Create `tool_ocr_sessions` table for token storage
- Add proper indexes for performance
- [ ] 1.4 Create SQLAlchemy models
- User model (mapped to `tool_ocr_users`)
- Task model (mapped to `tool_ocr_tasks`)
- TaskFile model (mapped to `tool_ocr_task_files`)
- Session model (mapped to `tool_ocr_sessions`)
- Configure table prefix in base model
- [ ] 1.5 Generate Alembic migration
- Create initial migration for new schema
- Test migration script with proper table prefixes
## 2. Configuration Management
- [ ] 2.1 Update environment configuration
- Add `EXTERNAL_AUTH_API_URL` to `.env.local`
- Add `EXTERNAL_AUTH_ENDPOINT` configuration
- Add `EXTERNAL_AUTH_TIMEOUT` setting
- Add `TOKEN_REFRESH_BUFFER` setting
- Add `TASK_RETENTION_DAYS` for auto-cleanup
- Add `MAX_TASKS_PER_USER` for limits
- Add `ENABLE_TASK_HISTORY` feature flag
- Add `DATABASE_TABLE_PREFIX` = "tool_ocr_"
- [ ] 2.2 Update Settings class
- Add external auth settings to `backend/app/core/config.py`
- Add task management settings
- Add database table prefix configuration
- Add validation for new configuration values
- Remove old authentication settings
## 3. External API Integration Service
- [ ] 3.1 Create auth API client
- Implement `backend/app/services/external_auth_service.py`
- Create async HTTP client for API calls
- Implement request/response models
- Add proper error handling and logging
- [ ] 3.2 Implement authentication methods
- `authenticate_user()` - Call external API
- `validate_token()` - Verify token validity
- `refresh_token()` - Handle token refresh
- `get_user_info()` - Fetch user details
- [ ] 3.3 Add resilience patterns (see the circuit-breaker sketch after this list)
- Implement retry logic with exponential backoff
- Add circuit breaker pattern
- Implement timeout handling
- Add fallback mechanisms
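A tiny illustrative circuit breaker for 3.3 (not the project's implementation): open after N consecutive failures, then let one probe request through after a cooldown.
```python
import time

class CircuitBreaker:
    def __init__(self, max_failures: int = 5, reset_seconds: float = 60.0):
        self.max_failures = max_failures
        self.reset_seconds = reset_seconds
        self.failures = 0
        self.opened_at: float | None = None

    def allow(self) -> bool:
        """True if a request may be attempted right now."""
        if self.opened_at is None:
            return True
        if time.monotonic() - self.opened_at >= self.reset_seconds:
            self.opened_at = None   # half-open: allow one probe request
            self.failures = 0
            return True
        return False

    def record(self, success: bool) -> None:
        """Report the outcome of an attempt."""
        if success:
            self.failures = 0
        else:
            self.failures += 1
            if self.failures >= self.max_failures:
                self.opened_at = time.monotonic()
```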
## 4. Backend Authentication Updates
- [ ] 4.1 Modify login endpoint
- Update `backend/app/api/v1/endpoints/auth.py`
- Route to external API based on feature flag
- Handle both authentication modes during transition
- Return appropriate token format
- [ ] 4.2 Update token validation
- Modify `backend/app/core/security.py`
- Support both local and external tokens
- Implement token type detection
- Update JWT validation logic
- [ ] 4.3 Update authentication dependencies
- Modify `backend/app/core/auth.py`
- Update `get_current_user()` dependency
- Handle external user information
- Implement proper user context
## 5. Session and Token Management
- [ ] 5.1 Implement token storage
- Store external tokens securely
- Implement token encryption at rest
- Handle multiple token types (access, ID, refresh)
- [ ] 5.2 Create token refresh mechanism
- Background task for token refresh
- Refresh tokens before expiration
- Update stored tokens atomically
- Handle refresh failures gracefully
- [ ] 5.3 Session invalidation
- Clear tokens on logout
- Handle token revocation
- Implement session timeout
## 6. Frontend Updates
- [ ] 6.1 Update authentication service
- Modify `frontend/src/services/authService.ts`
- Handle new token format
- Store user display information
- Implement token refresh on client side
- [ ] 6.2 Update auth store
- Modify `frontend/src/stores/authStore.ts`
- Store external user information
- Update user display logic
- Handle token expiration
- [ ] 6.3 Update UI components
- Modify `frontend/src/components/Header.tsx`
- Display user `name` instead of username
- Show additional user information
- Update login form if needed
- [ ] 6.4 Error handling
- Handle external API errors
- Display appropriate error messages
- Implement retry UI for failures
- Add loading states
## 7. Task Management System (NEW)
- [ ] 7.1 Create task management backend
- Implement `backend/app/models/task.py`
- Implement `backend/app/models/task_file.py`
- Create `backend/app/services/task_service.py`
- Add task CRUD operations with user isolation
- [ ] 7.2 Implement task APIs
- Create `backend/app/api/v1/endpoints/tasks.py`
- GET /tasks (list user's tasks with pagination)
- GET /tasks/{id} (get specific task)
- DELETE /tasks/{id} (delete task)
- POST /tasks/{id}/retry (retry failed task)
- [ ] 7.3 Create task history endpoints
- Create `backend/app/api/v1/endpoints/task_history.py`
- GET /history (query with filters)
- GET /history/stats (user statistics)
- POST /history/export (export history)
- [ ] 7.4 Implement file access control
- Create `backend/app/services/file_access_service.py`
- Validate user ownership before file access
- Restrict download to user's own files
- Add audit logging for file access
- [ ] 7.5 Update OCR service integration
- Link OCR tasks to user accounts
- Save task records in database
- Update task status during processing
- Store result file paths
## 8. Frontend Task Management UI (NEW)
- [ ] 8.1 Create task history page
- Implement `frontend/src/pages/TaskHistory.tsx`
- Display task list with status indicators
- Add pagination controls
- Show task details modal
- [ ] 8.2 Build task list component
- Implement `frontend/src/components/TaskList.tsx`
- Display task table with columns
- Add sorting capabilities
- Implement action buttons
- [ ] 8.3 Create filter components
- Implement `frontend/src/components/TaskFilters.tsx`
- Date range picker
- Status filter dropdown
- Search by filename
- Clear filters button
- [ ] 8.4 Add task management store
- Implement `frontend/src/stores/taskStore.ts`
- Manage task list state
- Handle filter state
- Cache task data
- [ ] 8.5 Create task service client
- Implement `frontend/src/services/taskService.ts`
- API methods for task operations
- Handle pagination
- Implement retry logic
- [ ] 8.6 Update navigation
- Add "Task History" menu item
- Update router configuration
- Add task count badge
- Implement user menu with stats
## 9. User Isolation and Security
- [ ] 9.1 Implement user context middleware
- Create middleware to inject user context
- Validate user in all requests
- Add user_id to logging context
- [ ] 9.2 Database query isolation
- Add user_id filter to all task queries
- Prevent cross-user data access
- Implement row-level security
- [ ] 9.3 File system isolation
- Organize files by user directory
- Validate file paths before access
- Implement cleanup for deleted users
- [ ] 9.4 API authorization
- Add @require_user decorator
- Validate ownership in endpoints
- Return 403 for unauthorized access
## 10. Testing
- [ ] 10.1 Unit tests
- Test external auth service
- Test token validation
- Test task isolation logic
- Test file access control
- [ ] 10.2 Integration tests
- Test full authentication flow
- Test task management flow
- Test user isolation between accounts
- Test file download restrictions
- [ ] 10.3 Load testing
- Test external API response times
- Test system with many concurrent users
- Test large task history queries
- Measure database query performance
- [ ] 10.4 Security testing
- Test token security
- Verify user isolation
- Test unauthorized access attempts
- Validate SQL injection prevention
## 11. Migration Execution (Simplified)
- [ ] 11.1 Pre-migration preparation
- Backup existing database (reference only)
- Prepare deployment package
- Set up monitoring
- [ ] 11.2 Execute migration
- Drop old database tables
- Create new schema
- Deploy new code
- Verify system startup
- [ ] 11.3 Post-migration validation
- Test authentication with real users
- Verify task isolation works
- Check task history functionality
- Validate file access controls
## 12. Documentation
- [ ] 12.1 Technical documentation
- Update API documentation with new endpoints
- Document authentication flow
- Document task management APIs
- Create troubleshooting guide
- [ ] 12.2 User documentation
- Update login instructions
- Document task history features
- Explain user isolation
- Create user guide for new UI
- [ ] 12.3 Developer documentation
- Document database schema
- Explain security model
- Provide integration examples
## 13. Monitoring and Observability
- [ ] 13.1 Add monitoring metrics
- Authentication success/failure rates
- Task creation/completion rates
- User activity metrics
- File storage usage
- [ ] 13.2 Implement logging
- Log all authentication attempts
- Log task operations
- Log file access attempts
- Structured logging for analysis
- [ ] 13.3 Create alerts
- Alert on authentication failures
- Alert on high error rates
- Alert on storage issues
- Alert on performance degradation
## 14. Performance Optimization (Post-Launch)
- [ ] 14.1 Database optimization
- Analyze query patterns
- Add missing indexes
- Optimize slow queries
- [ ] 14.2 Caching implementation
- Cache user information
- Cache task lists
- Implement Redis if needed
- [ ] 14.3 File management
- Implement automatic cleanup
- Optimize storage structure
- Add compression if needed