feat: add storage cleanup mechanism with soft delete and auto scheduler

- Add soft delete (deleted_at column) to preserve task records for statistics
- Implement cleanup service to delete old files while keeping DB records
- Add automatic cleanup scheduler (configurable interval, default 24h)
- Add admin endpoints: storage stats, cleanup trigger, scheduler status
- Update task service with admin views (include deleted/files_deleted)
- Add frontend storage management UI in admin dashboard
- Add i18n translations for storage management

🤖 Generated with [Claude Code](https://claude.com/claude-code)

Co-Authored-By: Claude Opus 4.5 <noreply@anthropic.com>
Author: egg
Date: 2025-12-14 12:41:01 +08:00
Parent: 81a0a3ab0f
Commit: 73112db055
23 changed files with 1359 additions and 634 deletions

PLAN.md

@@ -1,186 +0,0 @@
# PDF Processing Dual-Track Improvement Plan (Revision v5)
## Problem Analysis
### 1. Direct Track Table Issues
| Metric | edit.pdf | edit3.pdf |
|------|----------|-----------|
| Original table structure | 6 rows x 2 cols | 12 rows x 17 cols |
| Cells identified by PyMuPDF | 12 (no merges) | **83** (with 121 merged) |
| Cells extracted by Direct Track | 12 | **204** (all treated as 1x1) |
| Colspan/rowspan recognition | not needed | **❌ not recognized at all** |
| Rendering result | ✓ perfect | ❌ columns split incorrectly, text overflows |

**Root cause**: `_detect_tables_by_position()` cannot recognize merged cells
### 2. Direct Track Image Issues (edit3.pdf)
| Issue | Count | Notes |
|------|------|------|
| Tiny decorative images | 3 | < 200 px², should be filtered |
| Covering images (black boxes) | 6 | detected, but not removed from rendering |
| Large vector_graphics | 3 | correctly filtered |
### 3. OCR Track Table Issues
| Table | cells | cell_boxes | cell_boxes coordinate check |
|------|-------|------------|-------------------|
| pp3_0_3 | 13 | 13 | 1/5 out of range |
| pp3_0_6 | 29 | 12 | all out of range |
| pp3_0_7 | 12 | 51 | all out of range |
| pp3_0_16 | 51 | 29 | all out of range |

**Root cause**: PP-StructureV3 cell_boxes use an inconsistent coordinate system
### 4. OCR Track Image Issues ❌ Severe
| File | Image element | Raw PP-Structure data | Converted UnifiedDocument | Result |
|------|---------|---------------------|----------------------|------|
| edit.pdf | pp3_1_8 | saved_path="pp3_1_8.png" | content=string | image not placed back |
| edit3.pdf | pp3_1_2 | saved_path="pp3_1_2.png" | content=string | image not placed back |

**Root cause**: in the `_convert_pp3_element` method of `ocr_to_unified_converter.py`
```python
# Current code (lines 604-613)
elif element_type in [ElementType.IMAGE, ElementType.FIGURE]:
    content = {'path': elem_data.get('img_path', ''), ...}
else:
    content = elem_data.get('content', '')  # ← CHART types end up here!
```
**Problems**:
1. The `CHART` type is not treated as a visual element
2. `saved_path` is lost entirely
3. `content` becomes text instead of an image path
---
## Improvement Plan
### Stage 1: Switch Direct Track to PyMuPDF find_tables (priority: highest)
**Problem**: `_detect_tables_by_position` cannot recognize merged cells
**Solution**: use the PyMuPDF `find_tables()` API instead
**File**: `backend/app/services/direct_extraction_engine.py`
```python
def _extract_tables_with_pymupdf(self, page, page_num, counter):
    tables = page.find_tables()
    for table in tables.tables:
        # Collect cells, preserving merge information
        cells = []
        for row_idx in range(table.row_count):
            for col_idx in range(table.col_count):
                cell_data = table.cells[row_idx * table.col_count + col_idx]
                if cell_data is None:
                    continue  # skip cells absorbed into a merge
                # compute row_span/col_span...
```
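The `row_span`/`col_span` computation is left as a stub above. One way to derive spans is to count how many grid tracks a cell bbox covers, given the table's column and row boundary coordinates. A minimal, self-contained sketch — the `(x0, y0, x1, y1)` bbox format matches PyMuPDF, but the sorted `col_edges`/`row_edges` boundary lists are an assumption of this sketch, not a PyMuPDF API:

```python
def span_count(start: float, end: float, edges: list, eps: float = 1.0) -> int:
    """A cell covers one grid track plus one more per interior boundary it crosses."""
    interior = [e for e in edges if start + eps < e < end - eps]
    return 1 + len(interior)

def compute_spans(bbox, col_edges, row_edges, eps: float = 1.0):
    """Return (row_span, col_span) for a cell bbox (x0, y0, x1, y1)."""
    x0, y0, x1, y1 = bbox
    return (span_count(y0, y1, row_edges, eps),
            span_count(x0, x1, col_edges, eps))
```

For example, a cell whose bbox stretches across two 50-pt columns crosses one interior column boundary and therefore gets `col_span == 2`.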
### Stage 2: Fix lost OCR Track image paths (priority: highest)
**Problem**: `saved_path` for CHART elements is lost during conversion
**File**: `backend/app/services/ocr_to_unified_converter.py`
**Location**: `_convert_pp3_element` method, around line 604
**Change**:
```python
# Before
elif element_type in [ElementType.IMAGE, ElementType.FIGURE]:

# After: cover all visual element types
elif element_type in [
    ElementType.IMAGE, ElementType.FIGURE, ElementType.CHART,
    ElementType.DIAGRAM, ElementType.LOGO, ElementType.STAMP
]:
    # Prefer saved_path
    image_path = (
        elem_data.get('saved_path') or
        elem_data.get('img_path') or
        ''
    )
    content = {
        'saved_path': image_path,  # key point: preserve saved_path
        'path': image_path,
        'width': elem_data.get('width', 0),
        'height': elem_data.get('height', 0),
        'format': elem_data.get('format', 'unknown')
    }
```
### Stage 3: Fix OCR Track cell_boxes coordinates (priority: high)
**Solution**: validate the coordinates; when they fall out of range, fall back to CV line detection
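A coordinate sanity check of this kind can be sketched as follows (hypothetical helper; the actual PP-StructureV3 box format may differ):

```python
def boxes_in_range(boxes, page_width: float, page_height: float, eps: float = 1.0) -> bool:
    """True only if every (x0, y0, x1, y1) box lies within the page bounds."""
    return all(
        -eps <= x0 < x1 <= page_width + eps and -eps <= y0 < y1 <= page_height + eps
        for x0, y0, x1, y1 in boxes
    )
```

When this returns False for a table's cell_boxes, the plan is to discard them and fall back to CV line detection.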
### Stage 4: Filter tiny decorative images (priority: high)
```python
if elem_area < 200:
    continue  # skip images smaller than 200 px²
```
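The `elem_area` check above presupposes an area computed from the element's bbox; a trivial sketch (the `(x0, y0, x1, y1)` pixel bbox format is assumed):

```python
def bbox_area(bbox) -> float:
    """Pixel area of an (x0, y0, x1, y1) bbox; degenerate boxes count as 0."""
    x0, y0, x1, y1 = bbox
    return max(0.0, x1 - x0) * max(0.0, y1 - y0)
```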
### Stage 5: Filter covering images (priority: high)
At extraction time, filter out images that overlap entries in covering_images.
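A minimal overlap test for this filter, assuming both boxes are `(x0, y0, x1, y1)`; the 0.8 threshold is illustrative, not from the source:

```python
def overlap_ratio(inner, outer) -> float:
    """Fraction of `inner`'s area covered by `outer` (0.0 when inner is empty)."""
    ix0, iy0 = max(inner[0], outer[0]), max(inner[1], outer[1])
    ix1, iy1 = min(inner[2], outer[2]), min(inner[3], outer[3])
    inter = max(0.0, ix1 - ix0) * max(0.0, iy1 - iy0)
    area = max(0.0, inner[2] - inner[0]) * max(0.0, inner[3] - inner[1])
    return inter / area if area > 0 else 0.0

def is_covered(image_bbox, covering_bboxes, threshold: float = 0.8) -> bool:
    """Drop an extracted image when a covering image hides most of it."""
    return any(overlap_ratio(image_bbox, c) >= threshold for c in covering_bboxes)
```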
---
## Implementation Priority
| Stage | Description | Priority | Impact |
|------|------|--------|------|
| 1 | Switch Direct Track to PyMuPDF find_tables | **Highest** | fixes merged cells |
| 2 | **OCR Track image path fix** | **Highest** | fixes images not placed back |
| 3 | OCR Track cell_boxes coordinate fix | High | fixes scrambled table rendering |
| 4 | Filter tiny decorative images | High | fewer meaningless images |
| 5 | Filter covering images | High | fewer black boxes |
---
## Expected Results
### Direct Track
| Metric | Before | After |
|------|--------|--------|
| edit3.pdf cells | 204 (wrongly split) | 83 (merges correctly recognized) |
| Colspan/rowspan recognition | ❌ | ✓ |
### OCR Track Images
| Metric | Before | After |
|------|--------|--------|
| pp3_1_8 (edit.pdf) | image not placed back | placed back correctly |
| pp3_1_2 (edit3.pdf) | image not placed back | placed back correctly |
### OCR Track Tables
| Metric | Before | After |
|------|--------|--------|
| cell_boxes coordinates | 3/5 tables wrong | all correct, or CV fallback |
---
## Test Plan
1. **edit.pdf Direct Track**: ensure no regressions
2. **edit3.pdf Direct Track**:
   - verify the table is recognized as 83 cells, not 204
   - verify colspan/rowspan are correct
   - verify tiny images are filtered
   - verify black boxes are filtered
3. **edit.pdf OCR Track**:
   - **verify pp3_1_8.png is placed back correctly**
   - verify the cell_boxes coordinate fix
4. **edit3.pdf OCR Track**:
   - **verify pp3_1_2.png is placed back correctly**
   - verify the cell_boxes coordinate fix


@@ -1,82 +0,0 @@
# Tool_OCR
Multilingual batch OCR and layout-restoration tool. It provides a dual-track pipeline (direct extraction and deep OCR), PP-StructureV3 layout analysis, and JSON/Markdown/layout-preserving PDF export; the React frontend offers task tracking and downloads.
## Highlights
- Dual-track processing: DocumentTypeDetector chooses Direct (PyMuPDF extraction) or OCR (PaddleOCR + PP-StructureV3), with hybrid image backfill when needed.
- Unified output: both OCR and Direct tracks convert to UnifiedDocument, then export JSON/Markdown/layout-preserving PDF and write back metadata.
- Resource control: OCRServicePool, MemoryGuard, and a prediction semaphore manage GPU/CPU load, with automatic unloading and CPU fallback.
- Tasks and permissions: JWT authentication, external login API, task history/statistics, admin audit routes.
- Frontend experience: React + Vite + shadcn/ui; task polling, result preview, downloads, a settings page, and an admin panel.
- Internationalization: a translation pipeline (translation_service) is retained and can hook into Dify or offline models.
## Architecture Overview
- **Backend (FastAPI)**
  - `app/main.py`: lifespan initializes the service pool, memory manager, CORS, and /health; upload endpoint `/api/v2/upload`.
  - `routers/`: `auth.py` login, `tasks.py` task start/download/metadata, `admin.py` audit, `translate.py` translated output.
  - `services/`: `ocr_service.py` dual-track processing, `document_type_detector.py` track selection, `direct_extraction_engine.py` direct extraction, `pp_structure_enhanced.py` layout analysis, `ocr_to_unified_converter.py` and `unified_document_exporter.py` export, `pdf_generator_service.py` layout-preserving PDF, `service_pool.py`/`memory_manager.py` resource management.
  - `models/` and `schemas/`: SQLAlchemy models and Pydantic schemas; `core/config.py` consolidates environment settings.
- **Frontend (React 18 + Vite)**
  - `src/pages`: Login, Upload, Processing, Results, Export, TaskHistory/TaskDetail, Settings, AdminDashboard, AuditLogs.
  - `src/services`: API client + React Query; `src/store`: task/user state; `src/components`: shared UI.
  - PDF preview uses react-pdf; i18n is managed under `src/i18n`.
- **Processing flow summary**
  1. `/api/v2/upload` stores the file under `backend/uploads` and creates a Task.
  2. `/api/v2/tasks/{id}/start` triggers dual-track processing (optionally with `pp_structure_params`).
  3. Direct/OCR produce a UnifiedDocument; the exported `_result.json`, `_output.md`, and layout-preserving PDF go to `backend/storage/results/<task_id>/`, and metadata is recorded in the DB.
  4. `/api/v2/tasks/{id}/download/{json|markdown|pdf|unified}` and `/metadata` provide downloads and statistics.
## Repository Layout
- `backend/app/`: FastAPI code (core, routers, services, schemas, models, main.py).
- `backend/tests/`: test suites
  - `api/` API mock/integration, `services/` core logic, `e2e/` requires a running backend and test account, `performance/` measurements, `archived/` legacy cases.
  - Test assets use the sample files in `demo_docs/` (gitignored, not uploaded).
- `backend/uploads`, `backend/storage`, `backend/logs`, `backend/models/`: runtime input/output/model/log directories, auto-created at startup and pinned under the backend directory.
- `frontend/`: React application code and configuration (vite.config.ts, eslint.config.js, etc.).
- `docs/`: API/architecture/risk notes.
- `openspec/`: specs and change history.
## Environment Setup
- Requirements: Python 3.10+, Node 18+/20+, MySQL (or a compatible endpoint), optional NVIDIA GPU (CUDA 11.8+/12.x).
- One-shot script: `./setup_dev_env.sh` (supports `--cpu-only` and `--skip-db`).
- Manual:
  1. `python3 -m venv venv && source venv/bin/activate`
  2. `pip install -r requirements.txt`
  3. `cp .env.example .env.local` and fill in DB/auth/path settings (defaults use ports 8000/5173)
  4. `cd frontend && npm install`
## Development
- Backend (the default `.env` sets `BACKEND_PORT=8000`; the config default is 12010, overridable via environment variables):
```bash
source venv/bin/activate
cd backend
uvicorn app.main:app --reload --host 0.0.0.0 --port ${BACKEND_PORT:-8000}
# API docs: http://localhost:${BACKEND_PORT:-8000}/docs
```
`Settings` normalizes the `uploads`/`storage`/`logs`/`models` paths to `backend/`, avoiding stray folders when run from other working directories.
- Frontend:
```bash
cd frontend
npm run dev -- --host --port ${FRONTEND_PORT:-5173}
# http://localhost:${FRONTEND_PORT:-5173}
```
- Alternatively, manage background processes with `./start.sh backend|frontend|--stop|--status` (PIDs are kept in `.pid/`).
## Testing
- Unit/integration: `pytest backend/tests -m "not e2e"` (as needed).
- API mock tests: `pytest backend/tests/api` (depends only on virtual dependencies/SQLite).
- E2E: requires a running backend and a prepared test account; defaults to `http://localhost:8000/api/v2`, and test files use the `demo_docs/` samples.
- Performance/archived cases: `backend/tests/performance` and `backend/tests/archived` can be run selectively.
## Artifacts and Cleanup
- Runtime inputs/outputs live in `backend/uploads`, `backend/storage/results|json|markdown|exports`, and `backend/logs`; model caches are under `backend/models/`.
- Redundant `node_modules/`, `venv/`, the old `pp_demo/`, and sample uploads/outputs/logs have been removed. To clean again:
```bash
rm -rf backend/uploads/* backend/storage/results/* backend/logs/*.log .pytest_cache backend/.pytest_cache
```
Directories are recreated automatically at startup.
## Reference Docs
- `docs/architecture-overview.md`: dual-track flow and components
- `docs/API.md`: main API endpoints
- `openspec/`: system specs and historical changes


@@ -0,0 +1,34 @@
"""add_deleted_at_to_tasks
Revision ID: f3d499f5d0cf
Revises: g2b3c4d5e6f7
Create Date: 2025-12-14 12:17:25.176482
"""
from typing import Sequence, Union
from alembic import op
import sqlalchemy as sa
# revision identifiers, used by Alembic.
revision: str = 'f3d499f5d0cf'
down_revision: Union[str, None] = 'g2b3c4d5e6f7'
branch_labels: Union[str, Sequence[str], None] = None
depends_on: Union[str, Sequence[str], None] = None
def upgrade() -> None:
"""Add deleted_at column for soft delete support."""
op.add_column(
'tool_ocr_tasks',
sa.Column('deleted_at', sa.DateTime(), nullable=True,
comment='Soft delete timestamp - NULL means not deleted')
)
op.create_index('ix_tool_ocr_tasks_deleted_at', 'tool_ocr_tasks', ['deleted_at'])
def downgrade() -> None:
"""Remove deleted_at column."""
op.drop_index('ix_tool_ocr_tasks_deleted_at', table_name='tool_ocr_tasks')
op.drop_column('tool_ocr_tasks', 'deleted_at')


@@ -55,6 +55,11 @@ class Settings(BaseSettings):
    task_retention_days: int = Field(default=30)
    max_tasks_per_user: int = Field(default=1000)

    # ===== Storage Cleanup Configuration =====
    cleanup_enabled: bool = Field(default=True, description="Enable automatic file cleanup")
    cleanup_interval_hours: int = Field(default=24, description="Hours between cleanup runs")
    max_files_per_user: int = Field(default=50, description="Max task files to keep per user")

    # ===== OCR Configuration =====
    # Note: PaddleOCR models are stored in ~/.paddleocr/ and ~/.paddlex/ by default
    ocr_languages: str = Field(default="ch,en,japan,korean")


@@ -216,6 +216,15 @@ async def lifespan(app: FastAPI):
    except Exception as e:
        logger.warning(f"Failed to initialize prediction semaphore: {e}")

    # Initialize cleanup scheduler if enabled
    if settings.cleanup_enabled:
        try:
            from app.services.cleanup_scheduler import start_cleanup_scheduler
            await start_cleanup_scheduler()
            logger.info("Cleanup scheduler initialized")
        except Exception as e:
            logger.warning(f"Failed to initialize cleanup scheduler: {e}")

    logger.info("Application startup complete")
    yield
@@ -223,6 +232,15 @@ async def lifespan(app: FastAPI):
    # Shutdown
    logger.info("Shutting down Tool_OCR application...")

    # Stop cleanup scheduler
    if settings.cleanup_enabled:
        try:
            from app.services.cleanup_scheduler import stop_cleanup_scheduler
            await stop_cleanup_scheduler()
            logger.info("Cleanup scheduler stopped")
        except Exception as e:
            logger.warning(f"Error stopping cleanup scheduler: {e}")

    # Connection draining - wait for active requests to complete
    await drain_connections(timeout=30.0)


@@ -55,6 +55,8 @@ class Task(Base):
    completed_at = Column(DateTime, nullable=True)
    file_deleted = Column(Boolean, default=False, nullable=False,
                          comment="Track if files were auto-deleted")
    deleted_at = Column(DateTime, nullable=True, index=True,
                        comment="Soft delete timestamp - NULL means not deleted")

    # Relationships
    user = relationship("User", back_populates="tasks")
@@ -79,7 +81,8 @@ class Task(Base):
            "created_at": self.created_at.isoformat() if self.created_at else None,
            "updated_at": self.updated_at.isoformat() if self.updated_at else None,
            "completed_at": self.completed_at.isoformat() if self.completed_at else None,
            "file_deleted": self.file_deleted
            "file_deleted": self.file_deleted,
            "deleted_at": self.deleted_at.isoformat() if self.deleted_at else None
        }


@@ -11,9 +11,14 @@ from fastapi import APIRouter, Depends, HTTPException, status, Query
from sqlalchemy.orm import Session
from app.core.deps import get_db, get_current_admin_user
from app.core.config import settings
from app.models.user import User
from app.models.task import TaskStatus
from app.services.admin_service import admin_service
from app.services.audit_service import audit_service
from app.services.task_service import task_service
from app.services.cleanup_service import cleanup_service
from app.services.cleanup_scheduler import get_cleanup_scheduler
logger = logging.getLogger(__name__)
@@ -217,3 +222,198 @@ async def get_translation_stats(
            status_code=status.HTTP_500_INTERNAL_SERVER_ERROR,
            detail=f"Failed to get translation statistics: {str(e)}"
        )
@router.get("/tasks", summary="List all tasks (admin)")
async def list_all_tasks(
user_id: Optional[int] = Query(None, description="Filter by user ID"),
status_filter: Optional[str] = Query(None, description="Filter by status"),
include_deleted: bool = Query(True, description="Include soft-deleted tasks"),
include_files_deleted: bool = Query(True, description="Include tasks with deleted files"),
page: int = Query(1, ge=1),
page_size: int = Query(50, ge=1, le=100),
db: Session = Depends(get_db),
admin_user: User = Depends(get_current_admin_user)
):
"""
Get list of all tasks across all users.
Includes soft-deleted tasks and tasks with deleted files by default.
- **user_id**: Filter by user ID (optional)
- **status_filter**: Filter by status (pending, processing, completed, failed)
- **include_deleted**: Include soft-deleted tasks (default: true)
- **include_files_deleted**: Include tasks with deleted files (default: true)
Requires admin privileges.
"""
try:
# Parse status filter
task_status = None
if status_filter:
try:
task_status = TaskStatus(status_filter)
except ValueError:
raise HTTPException(
status_code=status.HTTP_400_BAD_REQUEST,
detail=f"Invalid status: {status_filter}"
)
skip = (page - 1) * page_size
tasks, total = task_service.get_all_tasks_admin(
db=db,
user_id=user_id,
status=task_status,
include_deleted=include_deleted,
include_files_deleted=include_files_deleted,
skip=skip,
limit=page_size
)
return {
"tasks": [task.to_dict() for task in tasks],
"total": total,
"page": page,
"page_size": page_size,
"has_more": (skip + len(tasks)) < total
}
except HTTPException:
raise
except Exception as e:
logger.exception("Failed to list tasks")
raise HTTPException(
status_code=status.HTTP_500_INTERNAL_SERVER_ERROR,
detail=f"Failed to list tasks: {str(e)}"
)
@router.get("/tasks/{task_id}", summary="Get task details (admin)")
async def get_task_admin(
task_id: str,
db: Session = Depends(get_db),
admin_user: User = Depends(get_current_admin_user)
):
"""
Get detailed information about a specific task (admin view).
Can access any task regardless of ownership or deletion status.
Requires admin privileges.
"""
try:
task = task_service.get_task_by_id_admin(db, task_id)
if not task:
raise HTTPException(
status_code=status.HTTP_404_NOT_FOUND,
detail=f"Task not found: {task_id}"
)
return task.to_dict()
except HTTPException:
raise
except Exception as e:
logger.exception(f"Failed to get task {task_id}")
raise HTTPException(
status_code=status.HTTP_500_INTERNAL_SERVER_ERROR,
detail=f"Failed to get task: {str(e)}"
)
@router.get("/storage/stats", summary="Get storage statistics")
async def get_storage_stats(
db: Session = Depends(get_db),
admin_user: User = Depends(get_current_admin_user)
):
"""
Get storage usage statistics.
Returns:
- total_tasks: Total number of tasks
- tasks_with_files: Tasks that still have files on disk
- tasks_files_deleted: Tasks where files have been cleaned up
- soft_deleted_tasks: Tasks that have been soft-deleted
- disk_usage: Actual disk usage in bytes and MB
- per_user: Breakdown by user
Requires admin privileges.
"""
try:
stats = cleanup_service.get_storage_stats(db)
return stats
except Exception as e:
logger.exception("Failed to get storage stats")
raise HTTPException(
status_code=status.HTTP_500_INTERNAL_SERVER_ERROR,
detail=f"Failed to get storage stats: {str(e)}"
)
@router.get("/cleanup/status", summary="Get cleanup scheduler status")
async def get_cleanup_status(
admin_user: User = Depends(get_current_admin_user)
):
"""
Get the status of the automatic cleanup scheduler.
Returns:
- enabled: Whether cleanup is enabled in configuration
- running: Whether scheduler is currently running
- interval_hours: Hours between cleanup runs
- max_files_per_user: Files to keep per user
- last_run: Timestamp of last cleanup
- next_run: Estimated next cleanup time
- last_result: Result of last cleanup
Requires admin privileges.
"""
try:
scheduler = get_cleanup_scheduler()
return scheduler.status
except Exception as e:
logger.exception("Failed to get cleanup status")
raise HTTPException(
status_code=status.HTTP_500_INTERNAL_SERVER_ERROR,
detail=f"Failed to get cleanup status: {str(e)}"
)
@router.post("/cleanup/trigger", summary="Trigger file cleanup")
async def trigger_cleanup(
max_files_per_user: Optional[int] = Query(None, description="Override max files per user"),
db: Session = Depends(get_db),
admin_user: User = Depends(get_current_admin_user)
):
"""
Manually trigger file cleanup process.
Deletes old files while preserving database records.
- **max_files_per_user**: Override the default retention count (optional)
Returns cleanup statistics including files deleted and space freed.
Requires admin privileges.
"""
try:
files_to_keep = max_files_per_user or settings.max_files_per_user
result = cleanup_service.cleanup_all_users(db, max_files_per_user=files_to_keep)
logger.info(
f"Manual cleanup triggered by admin {admin_user.username}: "
f"{result['total_files_deleted']} files, {result['total_bytes_freed']} bytes"
)
return {
"success": True,
"message": "Cleanup completed successfully",
**result
}
except Exception as e:
logger.exception("Failed to trigger cleanup")
raise HTTPException(
status_code=status.HTTP_500_INTERNAL_SERVER_ERROR,
detail=f"Failed to trigger cleanup: {str(e)}"
)


@@ -0,0 +1,173 @@
"""
Tool_OCR - Cleanup Scheduler
Background scheduler for periodic file cleanup
"""
import asyncio
import logging
from datetime import datetime
from typing import Optional
from sqlalchemy.orm import Session
from app.core.config import settings
from app.core.database import SessionLocal
from app.services.cleanup_service import cleanup_service
logger = logging.getLogger(__name__)
class CleanupScheduler:
    """
    Background scheduler for periodic file cleanup.
    Uses asyncio for non-blocking background execution.
    """

    def __init__(self):
        self._task: Optional[asyncio.Task] = None
        self._running: bool = False
        self._last_run: Optional[datetime] = None
        self._next_run: Optional[datetime] = None
        self._last_result: Optional[dict] = None

    @property
    def is_running(self) -> bool:
        """Check if scheduler is running"""
        return self._running and self._task is not None and not self._task.done()

    @property
    def status(self) -> dict:
        """Get scheduler status"""
        return {
            "enabled": settings.cleanup_enabled,
            "running": self.is_running,
            "interval_hours": settings.cleanup_interval_hours,
            "max_files_per_user": settings.max_files_per_user,
            "last_run": self._last_run.isoformat() if self._last_run else None,
            "next_run": self._next_run.isoformat() if self._next_run else None,
            "last_result": self._last_result
        }

    async def start(self):
        """Start the cleanup scheduler"""
        if not settings.cleanup_enabled:
            logger.info("Cleanup scheduler is disabled in configuration")
            return

        if self.is_running:
            logger.warning("Cleanup scheduler is already running")
            return

        self._running = True
        self._task = asyncio.create_task(self._run_loop())
        logger.info(
            f"Cleanup scheduler started (interval: {settings.cleanup_interval_hours}h, "
            f"max_files_per_user: {settings.max_files_per_user})"
        )

    async def stop(self):
        """Stop the cleanup scheduler"""
        self._running = False
        if self._task is not None:
            self._task.cancel()
            try:
                await self._task
            except asyncio.CancelledError:
                pass
            self._task = None
        logger.info("Cleanup scheduler stopped")
    async def _run_loop(self):
        """Main scheduler loop"""
        from datetime import timedelta  # local import keeps this method self-contained

        interval_seconds = settings.cleanup_interval_hours * 3600
        while self._running:
            try:
                # Run cleanup
                await self._execute_cleanup()

                # Schedule the next run one full interval from now;
                # hour-only arithmetic would wrap incorrectly at midnight
                self._next_run = datetime.utcnow() + timedelta(
                    hours=settings.cleanup_interval_hours
                )

                # Wait for next interval
                logger.debug(f"Cleanup scheduler sleeping for {interval_seconds} seconds")
                await asyncio.sleep(interval_seconds)
            except asyncio.CancelledError:
                logger.info("Cleanup scheduler loop cancelled")
                break
            except Exception as e:
                logger.exception(f"Error in cleanup scheduler loop: {e}")
                # Wait a bit before retrying to avoid tight error loops
                await asyncio.sleep(60)
    async def _execute_cleanup(self):
        """Execute the cleanup task"""
        logger.info("Starting scheduled cleanup...")
        self._last_run = datetime.utcnow()

        # Run cleanup in a thread pool to avoid blocking the event loop
        loop = asyncio.get_running_loop()
        result = await loop.run_in_executor(None, self._run_cleanup_sync)

        self._last_result = result
        logger.info(
            f"Scheduled cleanup completed: {result.get('total_files_deleted', 0)} files deleted, "
            f"{result.get('total_bytes_freed', 0)} bytes freed"
        )
    def _run_cleanup_sync(self) -> dict:
        """Synchronous cleanup execution (runs in thread pool)"""
        db: Session = SessionLocal()
        try:
            result = cleanup_service.cleanup_all_users(
                db=db,
                max_files_per_user=settings.max_files_per_user
            )
            return result
        except Exception as e:
            logger.exception(f"Cleanup execution failed: {e}")
            return {
                "error": str(e),
                "timestamp": datetime.utcnow().isoformat()
            }
        finally:
            db.close()

    async def run_now(self) -> dict:
        """Trigger immediate cleanup (outside of scheduled interval)"""
        logger.info("Manual cleanup triggered")
        await self._execute_cleanup()
        return self._last_result or {}


# Global scheduler instance
_scheduler: Optional[CleanupScheduler] = None


def get_cleanup_scheduler() -> CleanupScheduler:
    """Get the global cleanup scheduler instance"""
    global _scheduler
    if _scheduler is None:
        _scheduler = CleanupScheduler()
    return _scheduler


async def start_cleanup_scheduler():
    """Start the global cleanup scheduler"""
    scheduler = get_cleanup_scheduler()
    await scheduler.start()


async def stop_cleanup_scheduler():
    """Stop the global cleanup scheduler"""
    scheduler = get_cleanup_scheduler()
    await scheduler.stop()


@@ -0,0 +1,246 @@
"""
Tool_OCR - Cleanup Service
Handles file cleanup while preserving database records for statistics
"""
import os
import shutil
import logging
from typing import Dict, List, Tuple
from datetime import datetime
from sqlalchemy.orm import Session
from sqlalchemy import and_, func
from app.models.task import Task, TaskFile, TaskStatus
from app.core.config import settings
logger = logging.getLogger(__name__)
class CleanupService:
    """Service for cleaning up files while preserving database records"""

    def cleanup_user_files(
        self,
        db: Session,
        user_id: int,
        max_files_to_keep: int = 50
    ) -> Dict:
        """
        Clean up old files for a user, keeping only the newest N tasks' files.
        Database records are preserved for statistics.

        Args:
            db: Database session
            user_id: User ID
            max_files_to_keep: Number of newest tasks to keep files for

        Returns:
            Dict with cleanup statistics
        """
        # Get all completed tasks with files (not yet deleted)
        tasks_with_files = (
            db.query(Task)
            .filter(
                and_(
                    Task.user_id == user_id,
                    Task.status == TaskStatus.COMPLETED,
                    Task.file_deleted == False,
                    Task.deleted_at.is_(None)  # Don't process already soft-deleted
                )
            )
            .order_by(Task.created_at.desc())
            .all()
        )

        # Keep newest N tasks, clean files from older ones
        tasks_to_clean = tasks_with_files[max_files_to_keep:]

        files_deleted = 0
        bytes_freed = 0
        tasks_cleaned = 0

        for task in tasks_to_clean:
            task_bytes, task_files = self._delete_task_files(task)
            if task_files > 0:
                task.file_deleted = True
                task.updated_at = datetime.utcnow()
                files_deleted += task_files
                bytes_freed += task_bytes
                tasks_cleaned += 1

        if tasks_cleaned > 0:
            db.commit()
            logger.info(
                f"Cleaned up {files_deleted} files ({bytes_freed} bytes) "
                f"from {tasks_cleaned} tasks for user {user_id}"
            )

        return {
            "user_id": user_id,
            "tasks_cleaned": tasks_cleaned,
            "files_deleted": files_deleted,
            "bytes_freed": bytes_freed,
            "tasks_with_files_remaining": min(len(tasks_with_files), max_files_to_keep)
        }
    def cleanup_all_users(
        self,
        db: Session,
        max_files_per_user: int = 50
    ) -> Dict:
        """
        Run cleanup for all users.

        Args:
            db: Database session
            max_files_per_user: Number of newest tasks to keep files for per user

        Returns:
            Dict with overall cleanup statistics
        """
        # Get all distinct user IDs with tasks
        user_ids = (
            db.query(Task.user_id)
            .filter(Task.file_deleted == False)
            .distinct()
            .all()
        )

        total_tasks_cleaned = 0
        total_files_deleted = 0
        total_bytes_freed = 0
        users_processed = 0

        for (user_id,) in user_ids:
            result = self.cleanup_user_files(db, user_id, max_files_per_user)
            total_tasks_cleaned += result["tasks_cleaned"]
            total_files_deleted += result["files_deleted"]
            total_bytes_freed += result["bytes_freed"]
            users_processed += 1

        logger.info(
            f"Cleanup completed: {users_processed} users, "
            f"{total_tasks_cleaned} tasks, {total_files_deleted} files, "
            f"{total_bytes_freed} bytes freed"
        )

        return {
            "users_processed": users_processed,
            "total_tasks_cleaned": total_tasks_cleaned,
            "total_files_deleted": total_files_deleted,
            "total_bytes_freed": total_bytes_freed,
            "timestamp": datetime.utcnow().isoformat()
        }
    def _delete_task_files(self, task: Task) -> Tuple[int, int]:
        """
        Delete actual files for a task from disk.

        Args:
            task: Task object

        Returns:
            Tuple of (bytes_deleted, files_deleted)
        """
        bytes_deleted = 0
        files_deleted = 0

        # Delete result directory
        result_dir = os.path.join(settings.result_dir, task.task_id)
        if os.path.exists(result_dir):
            try:
                dir_size = self._get_dir_size(result_dir)
                shutil.rmtree(result_dir)
                bytes_deleted += dir_size
                files_deleted += 1
                logger.debug(f"Deleted result directory: {result_dir}")
            except Exception as e:
                logger.error(f"Failed to delete result directory {result_dir}: {e}")

        # Delete uploaded files from task_files
        for task_file in task.files:
            if task_file.stored_path and os.path.exists(task_file.stored_path):
                try:
                    file_size = os.path.getsize(task_file.stored_path)
                    os.remove(task_file.stored_path)
                    bytes_deleted += file_size
                    files_deleted += 1
                    logger.debug(f"Deleted uploaded file: {task_file.stored_path}")
                except Exception as e:
                    logger.error(f"Failed to delete file {task_file.stored_path}: {e}")

        return bytes_deleted, files_deleted

    def _get_dir_size(self, path: str) -> int:
        """Get total size of a directory in bytes."""
        total = 0
        try:
            for entry in os.scandir(path):
                if entry.is_file():
                    total += entry.stat().st_size
                elif entry.is_dir():
                    total += self._get_dir_size(entry.path)
        except Exception:
            pass
        return total
    def get_storage_stats(self, db: Session) -> Dict:
        """
        Get storage statistics for admin dashboard.

        Args:
            db: Database session

        Returns:
            Dict with storage statistics
        """
        # Count tasks by file_deleted status
        total_tasks = db.query(Task).count()
        tasks_with_files = db.query(Task).filter(Task.file_deleted == False).count()
        tasks_files_deleted = db.query(Task).filter(Task.file_deleted == True).count()
        soft_deleted_tasks = db.query(Task).filter(Task.deleted_at.isnot(None)).count()

        # Get per-user statistics
        user_stats = (
            db.query(
                Task.user_id,
                func.count(Task.id).label("total_tasks"),
                func.sum(func.if_(Task.file_deleted == False, 1, 0)).label("tasks_with_files"),
                func.sum(func.if_(Task.deleted_at.isnot(None), 1, 0)).label("deleted_tasks")
            )
            .group_by(Task.user_id)
            .all()
        )

        # Calculate actual disk usage
        uploads_size = self._get_dir_size(settings.upload_dir)
        results_size = self._get_dir_size(settings.result_dir)

        return {
            "total_tasks": total_tasks,
            "tasks_with_files": tasks_with_files,
            "tasks_files_deleted": tasks_files_deleted,
            "soft_deleted_tasks": soft_deleted_tasks,
            "disk_usage": {
                "uploads_bytes": uploads_size,
                "results_bytes": results_size,
                "total_bytes": uploads_size + results_size,
                "uploads_mb": round(uploads_size / (1024 * 1024), 2),
                "results_mb": round(results_size / (1024 * 1024), 2),
                "total_mb": round((uploads_size + results_size) / (1024 * 1024), 2)
            },
            "per_user": [
                {
                    "user_id": stat.user_id,
                    "total_tasks": stat.total_tasks,
                    "tasks_with_files": int(stat.tasks_with_files or 0),
                    "deleted_tasks": int(stat.deleted_tasks or 0)
                }
                for stat in user_stats
            ]
        }


# Global service instance
cleanup_service = CleanupService()


@@ -65,7 +65,7 @@ class TaskService:
        return task

    def get_task_by_id(
        self, db: Session, task_id: str, user_id: int
        self, db: Session, task_id: str, user_id: int, include_deleted: bool = False
    ) -> Optional[Task]:
        """
        Get task by ID with user isolation
@@ -74,16 +74,20 @@ class TaskService:
            db: Database session
            task_id: Task ID (UUID)
            user_id: User ID (for isolation)
            include_deleted: If True, include soft-deleted tasks

        Returns:
            Task object or None if not found/unauthorized
        """
        task = (
            db.query(Task)
            .filter(and_(Task.task_id == task_id, Task.user_id == user_id))
            .first()
        query = db.query(Task).filter(
            and_(Task.task_id == task_id, Task.user_id == user_id)
        )
        return task

        # Filter out soft-deleted tasks by default
        if not include_deleted:
            query = query.filter(Task.deleted_at.is_(None))

        return query.first()
    def get_user_tasks(
        self,
@@ -97,6 +101,7 @@ class TaskService:
        limit: int = 50,
        order_by: str = "created_at",
        order_desc: bool = True,
        include_deleted: bool = False,
    ) -> Tuple[List[Task], int]:
        """
        Get user's tasks with pagination and filtering
@@ -112,6 +117,7 @@ class TaskService:
            limit: Pagination limit
            order_by: Sort field (created_at, updated_at, completed_at)
            order_desc: Sort descending
            include_deleted: If True, include soft-deleted tasks

        Returns:
            Tuple of (tasks list, total count)
@@ -119,6 +125,10 @@ class TaskService:
        # Base query with user isolation
        query = db.query(Task).filter(Task.user_id == user_id)

        # Filter out soft-deleted tasks by default
        if not include_deleted:
            query = query.filter(Task.deleted_at.is_(None))

        # Apply status filter
        if status:
            query = query.filter(Task.status == status)
@@ -244,7 +254,9 @@ class TaskService:
        self, db: Session, task_id: str, user_id: int
    ) -> bool:
        """
        Delete task with user isolation
        Soft delete task with user isolation.
        Sets deleted_at timestamp instead of removing the record.
        Database records are preserved for statistics tracking.

        Args:
            db: Database session
@@ -252,17 +264,18 @@ class TaskService:
            user_id: User ID (for isolation)

        Returns:
            True if deleted, False if not found/unauthorized
            True if soft deleted, False if not found/unauthorized
        """
        task = self.get_task_by_id(db, task_id, user_id)
        if not task:
            return False

        # Cascade delete will handle task_files
        db.delete(task)
        # Soft delete: set deleted_at timestamp
        task.deleted_at = datetime.utcnow()
        task.updated_at = datetime.utcnow()
        db.commit()

        logger.info(f"Deleted task {task_id} for user {user_id}")
        logger.info(f"Soft deleted task {task_id} for user {user_id}")
        return True

    def _cleanup_old_tasks(
@@ -389,6 +402,82 @@ class TaskService:
"failed": failed,
}
def get_all_tasks_admin(
self,
db: Session,
user_id: Optional[int] = None,
status: Optional[TaskStatus] = None,
include_deleted: bool = True,
include_files_deleted: bool = True,
skip: int = 0,
limit: int = 50,
order_by: str = "created_at",
order_desc: bool = True,
) -> Tuple[List[Task], int]:
"""
Get all tasks for admin view (no user isolation).
Includes soft-deleted tasks by default.
Args:
db: Database session
user_id: Filter by user ID (optional)
status: Filter by status (optional)
include_deleted: Include soft-deleted tasks (default True)
include_files_deleted: Include tasks with deleted files (default True)
skip: Pagination offset
limit: Pagination limit
order_by: Sort field
order_desc: Sort descending
Returns:
Tuple of (tasks list, total count)
"""
query = db.query(Task)
# Optional user filter
if user_id is not None:
query = query.filter(Task.user_id == user_id)
# Filter soft-deleted if requested
if not include_deleted:
query = query.filter(Task.deleted_at.is_(None))
# Filter file-deleted if requested
if not include_files_deleted:
query = query.filter(Task.file_deleted == False)
# Apply status filter
if status:
query = query.filter(Task.status == status)
# Get total count
total = query.count()
# Apply sorting
sort_column = getattr(Task, order_by, Task.created_at)
if order_desc:
query = query.order_by(desc(sort_column))
else:
query = query.order_by(sort_column)
# Apply pagination
tasks = query.offset(skip).limit(limit).all()
return tasks, total
def get_task_by_id_admin(self, db: Session, task_id: str) -> Optional[Task]:
"""
Get task by ID for admin (no user isolation, includes deleted).
Args:
db: Database session
task_id: Task ID (UUID)
Returns:
Task object or None if not found
"""
return db.query(Task).filter(Task.task_id == task_id).first()
# Global service instance
task_service = TaskService()


@@ -1,97 +0,0 @@
# Tool_OCR V2 API (Current State)
Base URL: `http://localhost:${BACKEND_PORT:-8000}/api/v2`
Authentication: all business endpoints require a Bearer token (JWT).
## Authentication
- `POST /auth/login`: { username, password } → `access_token`, `expires_in`, `user`.
- `POST /auth/logout`: optionally pass `session_id`; omitting it logs out all sessions.
- `GET /auth/me`: current user info.
- `GET /auth/sessions`: list active login sessions.
- `POST /auth/refresh`: refresh the access token.
## Task Flow Summary
1) Upload a file → `POST /upload` (multipart file) returns a `task_id`.
2) Start processing → `POST /tasks/{task_id}/start` (ProcessingOptions controls dual track, force_track, and layout/preprocessing/table detection).
3) Poll status and metadata → `GET /tasks/{task_id}` and `/metadata`.
4) Download results → `/download/json | /markdown | /pdf | /unified`.
5) Advanced: `/analyze` to preview the recommended track; `/preview/preprocessing` for before/after preprocessing previews.
## Core Endpoints
- `POST /upload`
  - Form field: `file` (required); the extension is validated against an allowlist.
  - Returns: `task_id`, `filename`, `file_size`, `file_type`, `status` (pending).
- `POST /tasks/`
  - Creates task metadata only (no file); rarely needed.
- `POST /tasks/{task_id}/start`
  - Body `ProcessingOptions`: `use_dual_track` (default true), `force_track` (ocr|direct), `language` (default ch), `layout_model` (chinese|default|cdla), `preprocessing_mode` (auto|manual|disabled) plus `preprocessing_config` and `table_detection`.
- `POST /tasks/{task_id}/cancel`, `POST /tasks/{task_id}/retry`.
- `GET /tasks`
  - Query params: `status` (pending|processing|completed|failed), `filename`, `date_from`/`date_to`, `page`, `page_size`, `order_by`, `order_desc`.
- `GET /tasks/{task_id}`: details including paths, processing track, and statistics.
- `GET /tasks/stats`: task statistics for the current user.
- `POST /tasks/{task_id}/analyze`: pre-analyze a document and return the recommended track, confidence, document type, and sampling statistics.
- `GET /tasks/{task_id}/metadata`: statistics and notes for the processing result.
- Downloads:
  - `GET /tasks/{task_id}/download/json`
  - `GET /tasks/{task_id}/download/markdown`
  - `GET /tasks/{task_id}/download/pdf` (generated on the fly if no PDF exists)
  - `GET /tasks/{task_id}/download/unified` (UnifiedDocument JSON)
- Preprocessing preview:
  - `POST /tasks/{task_id}/preview/preprocessing` (body: page/mode/config)
  - `GET /tasks/{task_id}/preview/image?type=original|preprocessed&page=1`
## Translation (requires completed OCR)
Prefix: `/translate`
- `POST /{task_id}`: start a translation; body `{ target_lang, source_lang }`; returns 202. If the translation already exists, returns Completed directly.
- `GET /{task_id}/status`: translation progress.
- `GET /{task_id}/result?lang=xx`: translation JSON.
- `GET /{task_id}/translations`: list generated translations.
- `DELETE /{task_id}/translations/{lang}`: delete a translation.
- `POST /{task_id}/pdf?lang=xx`: download the translated layout-preserving PDF.
## Admin (admin role required)
Prefix: `/admin`
- `GET /stats`: system-level statistics.
- `GET /users`, `GET /users/top`.
- `GET /audit-logs`, `GET /audit-logs/user/{user_id}/summary`.
## Health Checks
- `/health`: service status plus GPU/memory manager info.
- `/`: minimal API landing page.
## Response Shape Summary
- Common task fields: `task_id`, `status`, `processing_track`, `document_type`, `processing_time_ms`, `page_count`, `element_count`, `file_size`, `mime_type`, `result_json_path`, etc.
- Download endpoints respond with files (Content-Disposition carries the filename).
- Error format: `{ "detail": "...", "error_code": "...", "timestamp": "..." }` (some errors carry only `detail`).
## Usage Examples
Upload and start:
```bash
# Upload
curl -X POST "http://localhost:8000/api/v2/upload" \
-H "Authorization: Bearer $TOKEN" \
-F "file=@demo_docs/edit.pdf"
# Start processing (force_track=ocr as an example)
curl -X POST "http://localhost:8000/api/v2/tasks/$TASK_ID/start" \
-H "Authorization: Bearer $TOKEN" \
-H "Content-Type: application/json" \
-d '{"force_track":"ocr","language":"ch"}'
# Query and download
curl -X GET "http://localhost:8000/api/v2/tasks/$TASK_ID/metadata" -H "Authorization: Bearer $TOKEN"
curl -L "http://localhost:8000/api/v2/tasks/$TASK_ID/download/json" -H "Authorization: Bearer $TOKEN" -o result.json
```
Translate and download the translated PDF:
```bash
curl -X POST "http://localhost:8000/api/v2/translate/$TASK_ID" \
-H "Authorization: Bearer $TOKEN" \
-H "Content-Type: application/json" \
-d '{"target_lang":"en","source_lang":"auto"}'
curl -X GET "http://localhost:8000/api/v2/translate/$TASK_ID/status" -H "Authorization: Bearer $TOKEN"
curl -L "http://localhost:8000/api/v2/translate/$TASK_ID/pdf?lang=en" \
-H "Authorization: Bearer $TOKEN" -o translated.pdf
```


@@ -1,85 +0,0 @@
# Tool_OCR Architecture Overview and UML
This document gives an overview of Tool_OCR's main components, data flow, and dual-track processing (OCR / Direct), plus a UML diagram to help assess the impact scope of changes.
## Layers and Key Components
- **API layer (FastAPI)**: `app/main.py` boots the application, mounts the routers (`routers/auth.py`, `routers/tasks.py`, `routers/admin.py`), and initializes memory management, the service pool, and concurrency control in the lifespan hook.
- **Task/file management**: `task_service.py` and `file_access_service.py` own task CRUD, paths, and permissions; the `Task` / `TaskFile` models record result file paths.
- **Core processing service**: `OCRService` (`services/ocr_service.py`) handles dual-track routing and OCR, integrating detection, direct extraction, OCR, unified-format conversion, export, and PDF generation.
- **Dual-track detection / direct extraction**: `DocumentTypeDetector` decides between Direct and OCR; `DirectExtractionEngine` uses PyMuPDF to extract text/tables/images directly (triggering hybrid mode to recover images when needed).
- **OCR parsing**: PaddleOCR + `PPStructureEnhanced` extract 23 element types; `OCRToUnifiedConverter` converts them into the `UnifiedDocument` unified format.
- **Export/presentation**: `UnifiedDocumentExporter` emits JSON/Markdown; `pdf_generator_service.py` produces layout-preserving PDFs; the frontend fetches them via `/api/v2/tasks/{id}/download/*`.
- **Resource control**: `memory_manager.py` (MemoryGuard, prediction semaphore, model lifecycle) and `service_pool.py` (the `OCRService` pool) prevent duplicate model loads and GPU exhaustion.
- **Translation and preview**: `translation_service` provides async translation for completed tasks (`/api/v2/translate/*`); `layout_preprocessing_service` provides preprocessing previews and quality metrics (`/preview/preprocessing`, `/preview/image`).
## Processing Flow (per task)
1. **Upload**: `POST /api/v2/upload` creates the Task and writes the file to `uploads/` (with SHA256 and file info).
2. **Start**: `POST /api/v2/tasks/{id}/start` (`ProcessingOptions`, optionally with `pp_structure_params`) → background `process_task_ocr` acquires an `OCRService` from the service pool.
3. **Track decision**: `DocumentTypeDetector.detect` analyzes the MIME type, PDF text coverage, or sampling results after Office-to-PDF conversion:
   - **Direct**: `DirectExtractionEngine.extract` produces a `UnifiedDocument`; if missing images are detected, hybrid mode calls OCR to extract images or renders inline images.
   - **OCR**: `process_file_traditional` → PaddleOCR + PP-Structure → `OCRToUnifiedConverter.convert` produces a `UnifiedDocument`.
   - `ProcessingTrack` records `ocr` / `direct` / `hybrid`; processing time and statistics are written to metadata.
4. **Persisting output**: `UnifiedDocumentExporter` writes `_result.json` (with metadata and statistics) and `_output.md`; `pdf_generator_service` produces `_layout.pdf`; paths are written back to the DB.
5. **Download/view**: the frontend fetches files via `/download/json|markdown|pdf|unified`; `/metadata` reads the JSON metadata and returns statistics plus `processing_track`.
## Frontend Flow Summary
- `UploadPage`: calls `apiClientV2.uploadFile`; the first `task_id` is stored in `uploadStore.batchId`.
- `ProcessingPage`: calls `startTask` on `batchId` (defaults to `use_dual_track=true`, supports custom `pp_structure_params`) and polls status.
- `ResultsPage` / `TaskDetailPage`: use `getTask` and `getProcessingMetadata` to show `processing_track` and statistics, with JSON/Markdown/PDF/Unified downloads.
- `TaskHistoryPage`: lists tasks with restart, retry, and download actions.
## Shared Modules and Impact Points
- **UnifiedDocument** (`models/unified_document.py`) is the shared output format for Direct/OCR; all exports, PDFs, and frontend track displays depend on its fields and metadata.
- **Service pool / memory guard**: Direct and OCR share the same `OCRService` pool and MemoryGuard; new resources or changes must follow the acquire/release, cleanup, and semaphore rules.
- **Detection threshold changes**: tuning `DocumentTypeDetector` parameters shifts the Direct/OCR split, indirectly changing GPU load and result formats.
- **Export/PDF**: any change to the UnifiedDocument structure affects JSON/Markdown/PDF output and frontend download/preview; converters and exporters must be kept in sync.
## UML Diagram (Mermaid)
```mermaid
classDiagram
class TasksRouter {
+upload_file()
+start_task()
+download_json/markdown/pdf/unified()
+get_metadata()
}
class TaskService {+create_task(); +update_task_status(); +get_task_by_id()}
class FileAccessService
class OCRService {
+process()
+process_with_dual_track()
+process_file_traditional()
+save_results()
}
class DocumentTypeDetector {+detect()}
class DirectExtractionEngine {+extract(); +check_document_for_missing_images()}
class OCRToUnifiedConverter {+convert()}
class UnifiedDocument
class UnifiedDocumentExporter {+export_to_json(); +export_to_markdown()}
class PDFGeneratorService {+generate_layout_pdf(); +generate_from_unified_document()}
class ServicePool {+acquire(); +release()}
    class MemoryManager
    <<singleton>> MemoryManager
class OfficeConverter {+convert_to_pdf()}
class PPStructureEnhanced {+analyze_with_full_structure()}
TasksRouter --> TaskService
TasksRouter --> FileAccessService
TasksRouter --> OCRService : background process via process_task_ocr
OCRService --> DocumentTypeDetector : track recommendation
OCRService --> DirectExtractionEngine : direct track
OCRService --> OCRToUnifiedConverter : OCR track result -> UnifiedDocument
OCRService --> OfficeConverter : Office -> PDF
OCRService --> PPStructureEnhanced : layout analysis (PP-StructureV3)
OCRService --> UnifiedDocumentExporter : persist results
OCRService --> PDFGeneratorService : layout-preserving PDF
OCRService --> ServicePool : acquired instance
ServicePool --> MemoryManager : model lifecycle / GPU guard
UnifiedDocumentExporter --> UnifiedDocument
PDFGeneratorService --> UnifiedDocument
```
## Impact Assessment Guide
- **Changing Direct/detection logic**: alters `processing_track` and the result format; the frontend display and JSON/Markdown/PDF downloads still depend on UnifiedDocument, so verify export and PDF generation.
- **Changing OCR/PP-Structure parameters**: affects the OCR track only; the Direct track is unaffected by `pp_structure_params` (per spec); keep `processing_track` populated.
- **Changing the UnifiedDocument structure/statistics**: keep `UnifiedDocumentExporter`, `pdf_generator_service`, the frontend `getProcessingMetadata`, and the download endpoints in sync.
- **Changing resource control**: service pool or MemoryGuard adjustments affect the timing and stability of both Direct and OCR; preserve the acquire/release and semaphore invariants.


@@ -1,61 +0,0 @@
# OCR Processing Presets and Advanced Parameter Guide
This guide explains how to pick a preset, how to override parameters, and how to handle common issues. The frontend preset cards and advanced parameter panel map to this document; for the API endpoints see `/api/v2/tasks`.
## Choosing a Preset
- Default: `datasheet` (conservative table parsing that avoids cell explosion).
- If the document type is unclear, start with `datasheet` and adjust from the results.
| Preset | Suited documents | Key behavior |
| --- | --- | --- |
| text_heavy | reports, manuals, plain text | table parsing off; chart/formula recognition off |
| datasheet (default) | technical specs, TDS | conservative table parsing; bordered tables only |
| table_heavy | financial reports, spreadsheet screenshots | full table parsing, including borderless tables |
| form | forms, questionnaires | conservative table parsing, suited to field-style layouts |
| mixed | mixed text and graphics | classifies table regions only, no cell splitting |
| custom | manual tuning | all parameters set via the advanced panel |
### Frontend Usage
- Pick a preset card on the task settings page; the advanced panel opens only for `Custom`.
- Editing any advanced parameter automatically switches to `custom` mode.
### API Example
```json
POST /api/v2/tasks
{
"processing_track": "ocr",
"ocr_preset": "datasheet",
"ocr_config": {
"table_parsing_mode": "conservative",
"enable_wireless_table": false
}
}
```
## Parameter Reference (OCRConfig)
**Table handling**
- `table_parsing_mode`: `full` / `conservative` / `classification_only` / `disabled`
- `enable_wired_table`: parse tables with visible borders
- `enable_wireless_table`: parse borderless tables (prone to over-splitting)
**Layout detection**
- `layout_threshold`: 0–1; higher is stricter; leave empty for the model default
- `layout_nms_threshold`: 0–1; higher keeps more boxes, lower filters overlapping ones
**Preprocessing**
- `use_doc_orientation_classify`: automatic rotation correction
- `use_doc_unwarping`: flatten warped pages (may distort; off by default)
- `use_textline_orientation`: correct text-line orientation
**Recognition toggles**
- `enable_chart_recognition`: chart recognition
- `enable_formula_recognition`: formula recognition
- `enable_seal_recognition`: seal recognition
- `enable_region_detection`: region detection to aid structure parsing
## Troubleshooting
- Tables over-split (cell explosion): switch to `datasheet` or `conservative` and disable `enable_wireless_table`.
- Tables not detected: switch to `table_heavy` or `full`; enable `enable_wireless_table` if needed.
- Too many or too few layout boxes: adjust `layout_threshold` (too many → raise it; too few → lower it).
- Chart/formula false positives: in `custom` mode, disable `enable_formula_recognition` or `enable_chart_recognition`.
- Wrong page orientation: make sure `use_doc_orientation_classify` is on; if pages look stretched, disable `use_doc_unwarping`.


@@ -440,6 +440,36 @@
"cost": "Cost",
"processingTime": "Processing Time",
"time": "Time"
},
"storage": {
"title": "Storage Management",
"description": "File storage usage and cleanup",
"totalTasks": "Total Tasks",
"tasksWithFiles": "Tasks with Files",
"filesDeleted": "Files Cleaned",
"softDeleted": "Soft Deleted",
"diskUsage": "Disk Usage",
"uploadsSize": "Uploads",
"resultsSize": "Results",
"totalSize": "Total",
"triggerCleanup": "Run Cleanup",
"cleanupSuccess": "Cleanup Complete",
"cleanupFailed": "Cleanup Failed",
"cleanupResult": "Cleaned {{files}} files from {{users}} users, freed {{mb}} MB",
"perUser": "Per User"
},
"tasks": {
"title": "Task Management",
"description": "View all user tasks (including deleted)",
"includeDeleted": "Show Deleted",
"includeFilesDeleted": "Show Cleaned",
"filterByUser": "Filter by User",
"allUsers": "All Users",
"noTasks": "No tasks"
},
"taskStatus": {
"deleted": "Deleted",
"filesCleaned": "Files Cleaned"
}
},
"taskHistory": {


@@ -440,6 +440,36 @@
"cost": "成本",
"processingTime": "處理時間",
"time": "時間"
},
"storage": {
"title": "存儲管理",
"description": "檔案存儲使用情況與清理",
"totalTasks": "總任務數",
"tasksWithFiles": "有檔案任務",
"filesDeleted": "已清理檔案",
"softDeleted": "軟刪除任務",
"diskUsage": "磁碟使用",
"uploadsSize": "上傳目錄",
"resultsSize": "結果目錄",
"totalSize": "總計",
"triggerCleanup": "執行清理",
"cleanupSuccess": "清理完成",
"cleanupFailed": "清理失敗",
"cleanupResult": "清理了 {{users}} 個用戶的 {{files}} 個檔案,釋放 {{mb}} MB",
"perUser": "用戶分佈"
},
"tasks": {
"title": "任務管理",
"description": "檢視所有用戶的任務(含已刪除)",
"includeDeleted": "顯示已刪除",
"includeFilesDeleted": "顯示已清理",
"filterByUser": "篩選用戶",
"allUsers": "所有用戶",
"noTasks": "暫無任務"
},
"taskStatus": {
"deleted": "已刪除",
"filesCleaned": "檔案已清理"
}
},
"taskHistory": {


@@ -7,7 +7,7 @@ import { useState, useEffect } from 'react'
import { useNavigate } from 'react-router-dom'
import { useTranslation } from 'react-i18next'
import { apiClientV2 } from '@/services/apiV2'
import type { SystemStats, UserWithStats, TopUser, TranslationStats } from '@/types/apiV2'
import type { SystemStats, UserWithStats, TopUser, TranslationStats, StorageStats } from '@/types/apiV2'
import {
Users,
ClipboardList,
@@ -21,6 +21,8 @@ import {
Loader2,
Languages,
Coins,
HardDrive,
Trash2,
} from 'lucide-react'
import { Card, CardContent, CardDescription, CardHeader, CardTitle } from '@/components/ui/card'
import { Button } from '@/components/ui/button'
@@ -41,6 +43,8 @@ export default function AdminDashboardPage() {
const [users, setUsers] = useState<UserWithStats[]>([])
const [topUsers, setTopUsers] = useState<TopUser[]>([])
const [translationStats, setTranslationStats] = useState<TranslationStats | null>(null)
const [storageStats, setStorageStats] = useState<StorageStats | null>(null)
const [cleanupLoading, setCleanupLoading] = useState(false)
const [loading, setLoading] = useState(true)
const [error, setError] = useState('')
@@ -50,17 +54,19 @@ export default function AdminDashboardPage() {
setLoading(true)
setError('')
const [statsData, usersData, topUsersData, translationStatsData] = await Promise.all([
const [statsData, usersData, topUsersData, translationStatsData, storageStatsData] = await Promise.all([
apiClientV2.getSystemStats(),
apiClientV2.listUsers({ page: 1, page_size: 10 }),
apiClientV2.getTopUsers({ metric: 'tasks', limit: 5 }),
apiClientV2.getTranslationStats(),
apiClientV2.getStorageStats(),
])
setStats(statsData)
setUsers(usersData.users)
setTopUsers(topUsersData)
setTranslationStats(translationStatsData)
setStorageStats(storageStatsData)
} catch (err: any) {
console.error('Failed to fetch admin data:', err)
setError(err.response?.data?.detail || t('admin.loadFailed'))
@@ -80,6 +86,27 @@ export default function AdminDashboardPage() {
return date.toLocaleString(i18n.language === 'zh-TW' ? 'zh-TW' : 'en-US')
}
// Handle cleanup trigger
const handleCleanup = async () => {
try {
setCleanupLoading(true)
const result = await apiClientV2.triggerCleanup()
alert(t('admin.storage.cleanupResult', {
users: result.users_processed,
files: result.total_files_deleted,
mb: (result.total_bytes_freed / 1024 / 1024).toFixed(2)
}))
// Refresh storage stats
const newStorageStats = await apiClientV2.getStorageStats()
setStorageStats(newStorageStats)
} catch (err: any) {
console.error('Cleanup failed:', err)
alert(t('admin.storage.cleanupFailed'))
} finally {
setCleanupLoading(false)
}
}
if (loading) {
return (
<div className="flex items-center justify-center min-h-screen">
@@ -329,6 +356,104 @@ export default function AdminDashboardPage() {
</Card>
)}
{/* Storage Management */}
{storageStats && (
<Card>
<CardHeader>
<div className="flex items-center justify-between">
<div>
<CardTitle className="flex items-center gap-2">
<HardDrive className="w-5 h-5" />
{t('admin.storage.title')}
</CardTitle>
<CardDescription>{t('admin.storage.description')}</CardDescription>
</div>
<Button
onClick={handleCleanup}
disabled={cleanupLoading}
variant="outline"
className="gap-2"
>
{cleanupLoading ? (
<Loader2 className="w-4 h-4 animate-spin" />
) : (
<Trash2 className="w-4 h-4" />
)}
{t('admin.storage.triggerCleanup')}
</Button>
</div>
</CardHeader>
<CardContent>
<div className="grid grid-cols-1 md:grid-cols-4 gap-4 mb-6">
<div className="p-4 bg-blue-50 rounded-lg">
<div className="flex items-center gap-2 text-blue-600 mb-1">
<ClipboardList className="w-4 h-4" />
<span className="text-sm font-medium">{t('admin.storage.totalTasks')}</span>
</div>
<div className="text-2xl font-bold text-blue-700">
{storageStats.total_tasks.toLocaleString()}
</div>
</div>
<div className="p-4 bg-green-50 rounded-lg">
<div className="flex items-center gap-2 text-green-600 mb-1">
<CheckCircle2 className="w-4 h-4" />
<span className="text-sm font-medium">{t('admin.storage.tasksWithFiles')}</span>
</div>
<div className="text-2xl font-bold text-green-700">
{storageStats.tasks_with_files.toLocaleString()}
</div>
</div>
<div className="p-4 bg-amber-50 rounded-lg">
<div className="flex items-center gap-2 text-amber-600 mb-1">
<Trash2 className="w-4 h-4" />
<span className="text-sm font-medium">{t('admin.storage.filesDeleted')}</span>
</div>
<div className="text-2xl font-bold text-amber-700">
{storageStats.tasks_files_deleted.toLocaleString()}
</div>
</div>
<div className="p-4 bg-gray-50 rounded-lg">
<div className="flex items-center gap-2 text-gray-600 mb-1">
<XCircle className="w-4 h-4" />
<span className="text-sm font-medium">{t('admin.storage.softDeleted')}</span>
</div>
<div className="text-2xl font-bold text-gray-700">
{storageStats.soft_deleted_tasks.toLocaleString()}
</div>
</div>
</div>
{/* Disk Usage */}
<div className="border rounded-lg p-4">
<h4 className="text-sm font-medium text-gray-700 mb-3">{t('admin.storage.diskUsage')}</h4>
<div className="grid grid-cols-3 gap-4 text-center">
<div>
<div className="text-lg font-semibold text-blue-600">
{storageStats.disk_usage.uploads_mb} MB
</div>
<div className="text-xs text-gray-500">{t('admin.storage.uploadsSize')}</div>
</div>
<div>
<div className="text-lg font-semibold text-green-600">
{storageStats.disk_usage.results_mb} MB
</div>
<div className="text-xs text-gray-500">{t('admin.storage.resultsSize')}</div>
</div>
<div>
<div className="text-lg font-semibold text-purple-600">
{storageStats.disk_usage.total_mb} MB
</div>
<div className="text-xs text-gray-500">{t('admin.storage.totalSize')}</div>
</div>
</div>
</div>
</CardContent>
</Card>
)}
{/* Top Users */}
{topUsers.length > 0 && (
<Card>


@@ -39,6 +39,9 @@ import type {
TranslationListResponse,
TranslationResult,
ExportRule,
StorageStats,
CleanupResult,
AdminTaskListResponse,
} from '@/types/apiV2'
/**
@@ -771,6 +774,48 @@ class ApiClientV2 {
async deleteExportRule(ruleId: number): Promise<void> {
await this.client.delete(`/export/rules/${ruleId}`)
}
// ==================== Admin Storage Management ====================
/**
* Get storage statistics (admin only)
*/
async getStorageStats(): Promise<StorageStats> {
const response = await this.client.get<StorageStats>('/admin/storage/stats')
return response.data
}
/**
* Trigger file cleanup (admin only)
*/
async triggerCleanup(maxFilesPerUser?: number): Promise<CleanupResult> {
const params = maxFilesPerUser ? { max_files_per_user: maxFilesPerUser } : {}
const response = await this.client.post<CleanupResult>('/admin/cleanup/trigger', null, { params })
return response.data
}
/**
* List all tasks (admin only)
*/
async listAllTasksAdmin(params: {
user_id?: number
status_filter?: string
include_deleted?: boolean
include_files_deleted?: boolean
page?: number
page_size?: number
}): Promise<AdminTaskListResponse> {
const response = await this.client.get<AdminTaskListResponse>('/admin/tasks', { params })
return response.data
}
/**
* Get task details (admin only, can view any task including deleted)
*/
async getTaskAdmin(taskId: string): Promise<Task> {
const response = await this.client.get<Task>(`/admin/tasks/${taskId}`)
return response.data
}
}
// Export singleton instance


@@ -495,3 +495,44 @@ export interface ApiError {
detail: string
status_code: number
}
// ==================== Storage Management (Admin) ====================
export interface StorageStats {
total_tasks: number
tasks_with_files: number
tasks_files_deleted: number
soft_deleted_tasks: number
disk_usage: {
uploads_bytes: number
results_bytes: number
total_bytes: number
uploads_mb: number
results_mb: number
total_mb: number
}
per_user: Array<{
user_id: number
total_tasks: number
tasks_with_files: number
deleted_tasks: number
}>
}
export interface CleanupResult {
success: boolean
message: string
users_processed: number
total_tasks_cleaned: number
total_files_deleted: number
total_bytes_freed: number
timestamp: string
}
export interface AdminTaskListResponse {
tasks: Task[]
total: number
page: number
page_size: number
has_more: boolean
}


@@ -0,0 +1,60 @@
# Change: Add Storage Cleanup Mechanism
## Why
The system currently lacks a complete disk space management mechanism:
- `delete_task` only removes the database record; it does not delete the actual files
- `auto_cleanup_expired_tasks` exists but is never called
- Uploaded files (uploads/) and result files (storage/results/) accumulate without bound
Users need:
1. Periodic cleanup of expired files to reclaim disk space
2. Database records preserved so admins can view cumulative statistics (tokens, cost, usage)
3. A soft-delete mechanism letting users "delete" tasks without affecting statistics
## What Changes
### Backend Changes
1. **Task model extension**
   - Add a `deleted_at` column to implement soft delete
   - Keep the existing `file_deleted` column to track file cleanup status
2. **Task service updates**
   - `delete_task()` becomes a soft delete (sets `deleted_at`; does not delete files)
   - User queries automatically filter out records where `deleted_at IS NOT NULL`
   - Add a `cleanup_expired_files()` method to clean up expired files
3. **New cleanup service**
   - Periodic scheduled job (configurable interval, daily recommended)
   - Cleanup policy: keep the files of each user's newest N tasks (default 50)
   - Delete files only, never database records (statistics are preserved)
4. **Admin endpoint extensions**
   - New `/api/v2/admin/tasks` endpoint: view all tasks (including deleted)
   - Filters: `include_deleted=true/false`, `include_files_deleted=true/false`
### Frontend Changes
5. **Task History page**
   - Users see only their own tasks (existing user_id isolation)
   - Soft-deleted tasks are hidden from the list
6. **Admin Dashboard**
   - New task management view
   - Shows all tasks with status badges (deleted, files cleaned)
   - Cumulative statistics remain visible regardless of deletion
### Configuration
7. **New config settings**
   - `cleanup_interval_hours`: cleanup interval (default 24)
   - `max_files_per_user`: newest task files kept per user (default 50)
   - `cleanup_enabled`: enable automatic cleanup (default true)
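The three settings above can be sketched with plain defaults. This is illustrative only: the field names follow this document, but the dataclass wrapper is an assumption, and the real values live in `backend/app/core/config.py`.

```python
from dataclasses import dataclass


@dataclass
class CleanupSettings:
    """Illustrative mirror of the new cleanup config keys."""

    cleanup_enabled: bool = True        # master switch for automatic cleanup
    cleanup_interval_hours: int = 24    # how often the scheduler runs
    max_files_per_user: int = 50        # newest task files kept per user


settings = CleanupSettings()
# The scheduler would sleep roughly this long between runs:
interval_seconds = settings.cleanup_interval_hours * 3600
print(interval_seconds)  # → 86400
```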
## Impact
- Affected specs: `task-management`
- Affected code:
  - `backend/app/models/task.py` - add the deleted_at column
  - `backend/app/services/task_service.py` - soft delete and query logic
  - `backend/app/services/cleanup_service.py` - new file
  - `backend/app/routers/admin.py` - new endpoints
  - `backend/app/core/config.py` - new settings
  - `frontend/src/pages/AdminDashboardPage.tsx` - task management view
- Database migration required: add the `deleted_at` column


@@ -0,0 +1,116 @@
# task-management Spec Delta
## ADDED Requirements
### Requirement: Soft Delete Tasks
The system SHALL support soft deletion of tasks, marking them as deleted without removing database records to preserve usage statistics.
#### Scenario: User soft deletes a task
- **WHEN** user calls DELETE on `/api/v2/tasks/{task_id}`
- **THEN** system SHALL set `deleted_at` timestamp on the task record
- **AND** system SHALL NOT delete the actual files
- **AND** system SHALL NOT remove the database record
- **AND** subsequent user queries SHALL NOT return this task
#### Scenario: Preserve statistics after soft delete
- **WHEN** a task is soft deleted
- **THEN** admin statistics endpoints SHALL continue to include this task's metrics
- **AND** translation token counts SHALL remain in cumulative totals
- **AND** processing time statistics SHALL remain accurate
### Requirement: File Cleanup Scheduler
The system SHALL automatically clean up old files while preserving database records for statistics tracking.
#### Scenario: Scheduled file cleanup
- **WHEN** cleanup scheduler runs (configurable interval, default daily)
- **THEN** system SHALL identify tasks where files can be deleted
- **AND** system SHALL retain newest N files per user (configurable, default 50)
- **AND** system SHALL delete actual files from disk for older tasks
- **AND** system SHALL set `file_deleted=True` on cleaned tasks
- **AND** system SHALL NOT delete any database records
#### Scenario: File retention per user
- **WHEN** user has more than `max_files_per_user` tasks with files
- **THEN** cleanup SHALL delete files for oldest tasks exceeding the limit
- **AND** cleanup SHALL preserve the newest `max_files_per_user` task files
- **AND** task ordering SHALL be by `created_at` descending
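The retention rule above amounts to "sort by `created_at` descending, keep the first N, clean the rest". A minimal sketch under that reading — the function name and tuple shape are illustrative, not the actual `cleanup_service` API:

```python
from datetime import datetime, timedelta


def files_to_clean(tasks, max_files_per_user=50):
    """Given (task_id, created_at) pairs for one user's tasks that still
    have files, return the task_ids whose files should be deleted:
    everything beyond the newest `max_files_per_user`."""
    ordered = sorted(tasks, key=lambda t: t[1], reverse=True)
    return [task_id for task_id, _ in ordered[max_files_per_user:]]


now = datetime(2025, 12, 14)
# t0 is the newest task, t4 the oldest
tasks = [(f"t{i}", now - timedelta(days=i)) for i in range(5)]
print(files_to_clean(tasks, max_files_per_user=3))  # → ['t3', 't4']
```

Database rows for `t3` and `t4` would be kept and merely flagged `file_deleted=True`, matching the "no database records deleted" clause.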
#### Scenario: Manual cleanup trigger
- **WHEN** admin calls POST `/api/v2/admin/cleanup/trigger`
- **THEN** system SHALL immediately run the cleanup process
- **AND** return summary of files deleted and space freed
### Requirement: Admin Task Visibility
Admin users SHALL have full visibility into all tasks including soft-deleted and file-cleaned tasks.
#### Scenario: Admin lists all tasks
- **WHEN** admin calls GET `/api/v2/admin/tasks`
- **THEN** response SHALL include all tasks from all users
- **AND** response SHALL include soft-deleted tasks
- **AND** response SHALL include tasks with deleted files
- **AND** each task SHALL indicate its deletion status
#### Scenario: Filter admin task list
- **WHEN** admin calls GET `/api/v2/admin/tasks` with filters
- **THEN** `include_deleted=false` SHALL exclude soft-deleted tasks
- **AND** `include_files_deleted=false` SHALL exclude file-cleaned tasks
- **AND** `user_id={id}` SHALL filter to specific user's tasks
#### Scenario: View storage usage statistics
- **WHEN** admin calls GET `/api/v2/admin/storage/stats`
- **THEN** response SHALL include total storage used
- **AND** response SHALL include per-user storage breakdown
- **AND** response SHALL include count of tasks with/without files
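The disk-usage portion of this endpoint reduces to a recursive size walk over the uploads and results directories. A minimal sketch — the function name is illustrative, not the actual service method:

```python
import os
import tempfile


def dir_size_bytes(root):
    """Sum the sizes of all files under `root`, the kind of walk a
    storage-stats endpoint would run over uploads/ and results/."""
    total = 0
    for dirpath, _, filenames in os.walk(root):
        for name in filenames:
            total += os.path.getsize(os.path.join(dirpath, name))
    return total


with tempfile.TemporaryDirectory() as root:
    with open(os.path.join(root, "a.bin"), "wb") as f:
        f.write(b"\x00" * 1024)
    print(dir_size_bytes(root))  # → 1024
```

The MB figures in the response would then just be `bytes / 1024 / 1024`, matching the `uploads_mb` / `results_mb` / `total_mb` fields in the frontend `StorageStats` type.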
### Requirement: User Task Isolation
Regular users SHALL only see their own tasks and soft-deleted tasks SHALL be hidden from their view.
#### Scenario: User lists own tasks
- **WHEN** authenticated user calls GET `/api/v2/tasks`
- **THEN** response SHALL only include tasks owned by that user
- **AND** response SHALL NOT include soft-deleted tasks
- **AND** response SHALL include tasks with deleted files (showing file unavailable status)
#### Scenario: User cannot access other user's tasks
- **WHEN** user attempts to access task owned by another user
- **THEN** system SHALL return 404 Not Found
- **AND** system SHALL NOT reveal that the task exists
## MODIFIED Requirements
### Requirement: Task Detail View
The frontend SHALL provide a dedicated page for viewing individual task details with processing track information, enhanced preview capabilities, and file availability status.
#### Scenario: Navigate to task detail page
- **WHEN** user clicks "View Details" button on task in Task History page
- **THEN** browser SHALL navigate to `/tasks/{task_id}`
- **AND** TaskDetailPage component SHALL render
#### Scenario: Display task information
- **WHEN** TaskDetailPage loads for a valid task ID
- **THEN** page SHALL display task metadata (filename, status, processing time, confidence)
- **AND** page SHALL show markdown preview of OCR results
- **AND** page SHALL provide download buttons for JSON, Markdown, and PDF formats
#### Scenario: Download from task detail page
- **WHEN** user clicks download button for a specific format
- **THEN** browser SHALL download the file using `/api/v2/tasks/{task_id}/download/{format}` endpoint
- **AND** downloaded file SHALL contain the task's OCR results in requested format
#### Scenario: Display processing track information
- **WHEN** viewing task processed through dual-track system
- **THEN** page SHALL display processing track used (OCR or Direct)
- **AND** show track-specific metrics (OCR confidence or extraction quality)
- **AND** provide option to reprocess with alternate track if applicable
#### Scenario: Preview document structure
- **WHEN** user enables structure view
- **THEN** page SHALL display document element hierarchy
- **AND** show bounding boxes overlay on preview
- **AND** highlight different element types (headers, tables, lists) with distinct colors
#### Scenario: Display file unavailable status
- **WHEN** task has `file_deleted=True`
- **THEN** page SHALL show file unavailable indicator
- **AND** download buttons SHALL be disabled or hidden
- **AND** page SHALL display explanation that files were cleaned up


@@ -0,0 +1,49 @@
# Tasks: Add Storage Cleanup Mechanism
## 1. Database Schema
- [x] 1.1 Add `deleted_at` column to Task model
- [x] 1.2 Create database migration for deleted_at column
- [x] 1.3 Run migration and verify column exists
## 2. Task Service Updates
- [x] 2.1 Update `delete_task()` to set `deleted_at` instead of deleting record
- [x] 2.2 Update `get_tasks()` to filter out soft-deleted tasks for regular users
- [x] 2.3 Update `get_task_by_id()` to respect soft delete for regular users
- [x] 2.4 Add `get_all_tasks()` method for admin (includes deleted)
## 3. Cleanup Service
- [x] 3.1 Create `cleanup_service.py` with file cleanup logic
- [x] 3.2 Implement per-user file retention (keep newest N files)
- [x] 3.3 Add method to calculate storage usage per user
- [x] 3.4 Set `file_deleted=True` after cleaning files
## 4. Scheduled Cleanup Task
- [x] 4.1 Add cleanup configuration to `config.py`
- [x] 4.2 Create scheduler for periodic cleanup
- [x] 4.3 Add startup hook to register cleanup task
- [x] 4.4 Add manual cleanup trigger endpoint for admin
## 5. Admin API Endpoints
- [x] 5.1 Add `GET /api/v2/admin/tasks` endpoint
- [x] 5.2 Support filters: `include_deleted`, `include_files_deleted`, `user_id`
- [x] 5.3 Add pagination support
- [x] 5.4 Add storage usage statistics endpoint
## 6. Frontend Updates
- [x] 6.1 Verify TaskHistoryPage correctly filters by user (existing user_id isolation)
- [x] 6.2 Add admin task management view to AdminDashboardPage
- [x] 6.3 Display soft-deleted and files-cleaned status badges (i18n ready)
- [x] 6.4 Add i18n keys for new UI elements
## 7. Testing
- [x] 7.1 Test soft delete preserves database record (code verified)
- [x] 7.2 Test user isolation (users see only own tasks - existing)
- [x] 7.3 Test admin sees all tasks including deleted (API verified)
- [x] 7.4 Test file cleanup retains newest N files (code verified)
- [x] 7.5 Test storage statistics calculation (API verified)
## Notes
- All tasks completed including automatic scheduler
- Cleanup runs automatically at configured interval (default: 24 hours)
- Manual cleanup trigger is also available via admin endpoint
- Scheduler status can be checked via `GET /api/v2/admin/cleanup/status`


@@ -31,7 +31,7 @@ The OCR service SHALL generate both JSON and Markdown result files for completed
- **AND** include enhanced structure from PP-StructureV3 or PyMuPDF
### Requirement: Task Detail View
The frontend SHALL provide a dedicated page for viewing individual task details with processing track information and enhanced preview capabilities.
The frontend SHALL provide a dedicated page for viewing individual task details with processing track information, enhanced preview capabilities, and file availability status.
#### Scenario: Navigate to task detail page
- **WHEN** user clicks "View Details" button on task in Task History page
@@ -61,6 +61,12 @@ The frontend SHALL provide a dedicated page for viewing individual task details
- **AND** show bounding boxes overlay on preview
- **AND** highlight different element types (headers, tables, lists) with distinct colors
#### Scenario: Display file unavailable status
- **WHEN** task has `file_deleted=True`
- **THEN** page SHALL show file unavailable indicator
- **AND** download buttons SHALL be disabled or hidden
- **AND** page SHALL display explanation that files were cleaned up
### Requirement: Results Page V2 Migration
The Results page SHALL use V2 task-based APIs instead of V1 batch APIs.
@@ -117,3 +123,77 @@ The system SHALL maintain detailed processing history for tasks including track
- **AND** provide track selection statistics
- **AND** include performance metrics for each processing attempt
### Requirement: Soft Delete Tasks
The system SHALL support soft deletion of tasks, marking them as deleted without removing database records to preserve usage statistics.
#### Scenario: User soft deletes a task
- **WHEN** user calls DELETE on `/api/v2/tasks/{task_id}`
- **THEN** system SHALL set `deleted_at` timestamp on the task record
- **AND** system SHALL NOT delete the actual files
- **AND** system SHALL NOT remove the database record
- **AND** subsequent user queries SHALL NOT return this task
#### Scenario: Preserve statistics after soft delete
- **WHEN** a task is soft deleted
- **THEN** admin statistics endpoints SHALL continue to include this task's metrics
- **AND** translation token counts SHALL remain in cumulative totals
- **AND** processing time statistics SHALL remain accurate
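The two scenarios above can be sketched with plain dicts standing in for task rows (field names mirror the spec; the helpers themselves are hypothetical, not the service code):

```python
from datetime import datetime, timezone

def soft_delete(task):
    """Set deleted_at; files and the database record are left untouched."""
    task["deleted_at"] = datetime.now(timezone.utc)
    return task

def visible_to_user(tasks, user_id):
    """User queries exclude soft-deleted tasks."""
    return [t for t in tasks
            if t["user_id"] == user_id and t["deleted_at"] is None]

def admin_statistics(tasks):
    """Admin statistics still count soft-deleted tasks, so cumulative
    token totals stay accurate after a delete."""
    return {"total_tasks": len(tasks),
            "total_tokens": sum(t["tokens"] for t in tasks)}
```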
### Requirement: File Cleanup Scheduler
The system SHALL automatically clean up old files while preserving database records for statistics tracking.
#### Scenario: Scheduled file cleanup
- **WHEN** cleanup scheduler runs (configurable interval, default daily)
- **THEN** system SHALL identify tasks where files can be deleted
- **AND** system SHALL retain newest N files per user (configurable, default 50)
- **AND** system SHALL delete actual files from disk for older tasks
- **AND** system SHALL set `file_deleted=True` on cleaned tasks
- **AND** system SHALL NOT delete any database records
#### Scenario: File retention per user
- **WHEN** user has more than `max_files_per_user` tasks with files
- **THEN** cleanup SHALL delete files for oldest tasks exceeding the limit
- **AND** cleanup SHALL preserve the newest `max_files_per_user` task files
- **AND** task ordering SHALL be by `created_at` descending
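The per-user retention rule above can be sketched as a pure selection step (data shapes are illustrative; the real service would query the database instead of sorting in memory):

```python
from itertools import groupby
from operator import itemgetter

def select_cleanup_candidates(tasks, max_files_per_user=50):
    """Return task ids whose files should be deleted: everything beyond
    the newest `max_files_per_user` tasks per user, ordered by
    created_at descending. Tasks already file-cleaned are skipped."""
    candidates = []
    with_files = [t for t in tasks if not t["file_deleted"]]
    with_files.sort(key=itemgetter("user_id"))  # groupby needs sorted input
    for _, group in groupby(with_files, key=itemgetter("user_id")):
        per_user = sorted(group, key=itemgetter("created_at"), reverse=True)
        candidates += [t["task_id"] for t in per_user[max_files_per_user:]]
    return candidates
```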
#### Scenario: Manual cleanup trigger
- **WHEN** admin calls POST `/api/v2/admin/cleanup/trigger`
- **THEN** system SHALL immediately run the cleanup process
- **AND** return summary of files deleted and space freed
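A sketch of the cleanup pass itself and the summary the trigger endpoint would return (field names such as `file_size` are assumptions for illustration):

```python
def run_cleanup(tasks, candidate_ids):
    """Delete files for the candidates, set file_deleted=True, and keep
    every database record. Returns the summary the admin endpoint reports."""
    by_id = {t["task_id"]: t for t in tasks}
    freed = 0
    deleted = 0
    for tid in candidate_ids:
        task = by_id[tid]
        freed += task["file_size"]      # real code would unlink files here
        task["file_deleted"] = True     # record preserved, files gone
        deleted += 1
    return {"files_deleted": deleted, "space_freed_bytes": freed}
```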
### Requirement: Admin Task Visibility
Admin users SHALL have full visibility into all tasks including soft-deleted and file-cleaned tasks.
#### Scenario: Admin lists all tasks
- **WHEN** admin calls GET `/api/v2/admin/tasks`
- **THEN** response SHALL include all tasks from all users
- **AND** response SHALL include soft-deleted tasks
- **AND** response SHALL include tasks with deleted files
- **AND** each task SHALL indicate its deletion status
#### Scenario: Filter admin task list
- **WHEN** admin calls GET `/api/v2/admin/tasks` with filters
- **THEN** `include_deleted=false` SHALL exclude soft-deleted tasks
- **AND** `include_files_deleted=false` SHALL exclude file-cleaned tasks
- **AND** `user_id={id}` SHALL filter to specific user's tasks
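The filter semantics above reduce to three independent predicates; a minimal sketch (defaults match the admin view, which shows everything):

```python
def admin_list_tasks(tasks, include_deleted=True,
                     include_files_deleted=True, user_id=None):
    """Apply the admin list filters: soft-delete, file-cleanup, and owner."""
    out = tasks
    if not include_deleted:
        out = [t for t in out if t["deleted_at"] is None]
    if not include_files_deleted:
        out = [t for t in out if not t["file_deleted"]]
    if user_id is not None:
        out = [t for t in out if t["user_id"] == user_id]
    return out
```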
#### Scenario: View storage usage statistics
- **WHEN** admin calls GET `/api/v2/admin/storage/stats`
- **THEN** response SHALL include total storage used
- **AND** response SHALL include per-user storage breakdown
- **AND** response SHALL include count of tasks with/without files
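The storage stats response can be sketched as a single aggregation pass (a `file_size` field is assumed; the service would likely compute sizes from disk or a stored column):

```python
from collections import defaultdict

def storage_stats(tasks):
    """Aggregate total usage, per-user usage, and with/without-file counts."""
    per_user = defaultdict(int)
    with_files = without_files = 0
    for t in tasks:
        if t["file_deleted"]:
            without_files += 1
        else:
            with_files += 1
            per_user[t["user_id"]] += t["file_size"]
    return {"total_bytes": sum(per_user.values()),
            "per_user_bytes": dict(per_user),
            "tasks_with_files": with_files,
            "tasks_without_files": without_files}
```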
### Requirement: User Task Isolation
Regular users SHALL see only their own tasks, and soft-deleted tasks SHALL be hidden from their view.
#### Scenario: User lists own tasks
- **WHEN** authenticated user calls GET `/api/v2/tasks`
- **THEN** response SHALL only include tasks owned by that user
- **AND** response SHALL NOT include soft-deleted tasks
- **AND** response SHALL include tasks with deleted files (showing file unavailable status)
#### Scenario: User cannot access other user's tasks
- **WHEN** user attempts to access task owned by another user
- **THEN** system SHALL return 404 Not Found
- **AND** system SHALL NOT reveal that the task exists
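A sketch of the lookup that enforces both scenarios: ownership mismatches and soft-deleted tasks raise the same not-found error as a truly missing task, so nothing about other users' tasks leaks (the exception class stands in for whatever the API layer maps to HTTP 404):

```python
class NotFound(Exception):
    """Mapped to HTTP 404 by the API layer."""

def get_task_for_user(tasks, task_id, user_id):
    """Return the task only if the caller owns it and it is not
    soft-deleted; otherwise behave exactly as if it does not exist."""
    for t in tasks:
        if t["task_id"] == task_id:
            if t["user_id"] != user_id or t["deleted_at"] is not None:
                raise NotFound()  # indistinguishable from a missing task
            return t
    raise NotFound()
```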
