OCR/openspec/specs/task-management/spec.md
egg 73112db055 feat: add storage cleanup mechanism with soft delete and auto scheduler
- Add soft delete (deleted_at column) to preserve task records for statistics
- Implement cleanup service to delete old files while keeping DB records
- Add automatic cleanup scheduler (configurable interval, default 24h)
- Add admin endpoints: storage stats, cleanup trigger, scheduler status
- Update task service with admin views (include deleted/files_deleted)
- Add frontend storage management UI in admin dashboard
- Add i18n translations for storage management

🤖 Generated with [Claude Code](https://claude.com/claude-code)

Co-Authored-By: Claude Opus 4.5 <noreply@anthropic.com>
2025-12-14 12:41:01 +08:00


task-management Specification

Purpose

TBD - created by archiving change fix-v2-api-ui-issues. Update Purpose after archive.

Requirements

Requirement: Task Result Generation

The OCR service SHALL generate both JSON and Markdown result files for completed tasks with actual content, including processing track information and enhanced structure data.

Scenario: Markdown file contains OCR results

  • WHEN a task completes OCR processing successfully
  • THEN the generated .md file SHALL contain the extracted text in markdown format
  • AND the file size SHALL be greater than 0 bytes
  • AND the markdown SHALL include headings, paragraphs, and formatting based on OCR layout detection

Scenario: Result files stored in task directory

  • WHEN OCR processing completes for task ID 88c6c2d2-37e1-48fd-a50f-406142987bdf
  • THEN result files SHALL be stored in storage/results/88c6c2d2-37e1-48fd-a50f-406142987bdf/
  • AND both <filename>_result.json and <filename>_result.md SHALL exist
  • AND both files SHALL contain valid OCR output data
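The directory layout in this scenario can be sketched as follows. This is a minimal illustration, not the service's actual code: RESULTS_ROOT, result_paths, and results_are_present are hypothetical names, and it is an assumption that <filename> is used as a plain prefix.

```python
from pathlib import Path

# Assumed root; the spec only names the relative layout storage/results/<task_id>/.
RESULTS_ROOT = Path("storage/results")


def result_paths(task_id: str, filename: str) -> dict[str, Path]:
    """Return the expected JSON and Markdown result paths for a task."""
    task_dir = RESULTS_ROOT / task_id
    return {
        "json": task_dir / f"{filename}_result.json",
        "markdown": task_dir / f"{filename}_result.md",
    }


def results_are_present(task_id: str, filename: str) -> bool:
    """True only if both result files exist and are larger than 0 bytes."""
    return all(
        p.is_file() and p.stat().st_size > 0
        for p in result_paths(task_id, filename).values()
    )
```

The size check mirrors the "greater than 0 bytes" rule for the Markdown file; applying it to the JSON file as well is a choice made for this sketch.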

Scenario: Include processing track in results

  • WHEN a task completes through dual-track processing
  • THEN the JSON result SHALL include a "processing_track" field
  • AND the field SHALL indicate whether the "ocr" or "direct" track was used
  • AND the result SHALL include track-specific metadata (confidence for OCR, extraction quality for direct)
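A possible check for the track fields is sketched below. Only "processing_track" and its two values come from the spec; the "track_metadata" container and the "confidence" and "extraction_quality" keys are hypothetical names chosen for illustration.

```python
from typing import Any

VALID_TRACKS = {"ocr", "direct"}


def validate_processing_track(result: dict[str, Any]) -> list[str]:
    """Return a list of problems; an empty list means the track fields look valid."""
    problems: list[str] = []
    track = result.get("processing_track")
    if track not in VALID_TRACKS:
        problems.append(f"processing_track must be one of {sorted(VALID_TRACKS)}")
        return problems
    # Hypothetical metadata keys: the spec names the concepts, not the field names.
    required = "confidence" if track == "ocr" else "extraction_quality"
    metadata = result.get("track_metadata", {})
    if required not in metadata:
        problems.append(f"track_metadata missing {required!r} for {track} track")
    return problems
```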

Scenario: Store UnifiedDocument format

  • WHEN processing completes through either track
  • THEN system SHALL save results in UnifiedDocument format
  • AND maintain backward-compatible JSON structure
  • AND include enhanced structure from PP-StructureV3 or PyMuPDF

Requirement: Task Detail View

The frontend SHALL provide a dedicated page for viewing individual task details with processing track information, enhanced preview capabilities, and file availability status.

Scenario: Navigate to task detail page

  • WHEN user clicks "View Details" button on task in Task History page
  • THEN browser SHALL navigate to /tasks/{task_id}
  • AND TaskDetailPage component SHALL render

Scenario: Display task information

  • WHEN TaskDetailPage loads for a valid task ID
  • THEN page SHALL display task metadata (filename, status, processing time, confidence)
  • AND page SHALL show markdown preview of OCR results
  • AND page SHALL provide download buttons for JSON, Markdown, and PDF formats

Scenario: Download from task detail page

  • WHEN user clicks download button for a specific format
  • THEN browser SHALL download the file using /api/v2/tasks/{task_id}/download/{format} endpoint
  • AND downloaded file SHALL contain the task's OCR results in requested format

Scenario: Display processing track information

  • WHEN viewing task processed through dual-track system
  • THEN page SHALL display processing track used (OCR or Direct)
  • AND show track-specific metrics (OCR confidence or extraction quality)
  • AND provide option to reprocess with alternate track if applicable

Scenario: Preview document structure

  • WHEN user enables structure view
  • THEN page SHALL display document element hierarchy
  • AND show bounding boxes overlay on preview
  • AND highlight different element types (headers, tables, lists) with distinct colors

Scenario: Display file unavailable status

  • WHEN task has file_deleted=True
  • THEN page SHALL show file unavailable indicator
  • AND download buttons SHALL be disabled or hidden
  • AND page SHALL display explanation that files were cleaned up

Requirement: Results Page V2 Migration

The Results page SHALL use V2 task-based APIs instead of V1 batch APIs.

Scenario: Load task results instead of batch

  • WHEN Results page loads with a task ID in upload store
  • THEN page SHALL call apiClientV2.getTask(taskId) to fetch task details
  • AND page SHALL NOT call any V1 batch status endpoints
  • AND task information SHALL display correctly

Scenario: Handle missing task gracefully

  • WHEN Results page loads without a task ID
  • THEN page SHALL display helpful message directing user to upload page
  • AND page SHALL provide button to navigate to /upload

Requirement: Processing Track Management

The task management system SHALL track and display processing track information for all tasks.

Scenario: Track processing route selection

  • WHEN a task begins processing
  • THEN system SHALL record the selected processing track
  • AND log the reason for track selection
  • AND store auto-detection confidence score

Scenario: Allow track override

  • WHEN user views a completed task
  • THEN system SHALL offer option to reprocess with different track
  • AND maintain both results for comparison
  • AND track which result user prefers

Scenario: Display processing metrics

  • WHEN task completes processing
  • THEN system SHALL record track-specific metrics
  • AND OCR track SHALL show confidence scores and character count
  • AND Direct track SHALL show extraction coverage and structure quality

Requirement: Task Processing History

The system SHALL maintain detailed processing history for tasks including track changes and reprocessing.

Scenario: Record reprocessing attempts

  • WHEN a task is reprocessed with different track
  • THEN system SHALL maintain processing history
  • AND store results from each attempt
  • AND allow comparison between different processing attempts

Scenario: Track quality improvements

  • WHEN viewing task history
  • THEN system SHALL show quality metrics over time
  • AND indicate if reprocessing improved results
  • AND suggest optimal track based on document characteristics

Scenario: Export processing analytics

  • WHEN exporting task data
  • THEN system SHALL include processing history
  • AND provide track selection statistics
  • AND include performance metrics for each processing attempt

Requirement: Soft Delete Tasks

The system SHALL support soft deletion of tasks, marking them as deleted without removing database records to preserve usage statistics.

Scenario: User soft deletes a task

  • WHEN user calls DELETE on /api/v2/tasks/{task_id}
  • THEN system SHALL set deleted_at timestamp on the task record
  • AND system SHALL NOT delete the actual files
  • AND system SHALL NOT remove the database record
  • AND subsequent user queries SHALL NOT return this task
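The soft-delete behavior can be illustrated with an in-memory sketch. The Task dataclass and both helpers are hypothetical stand-ins for whatever ORM model and queries the service actually uses; only the deleted_at and file_deleted names come from this spec.

```python
from dataclasses import dataclass
from datetime import datetime, timezone
from typing import Optional


@dataclass
class Task:
    id: str
    user_id: int
    deleted_at: Optional[datetime] = None  # None means the task is live
    file_deleted: bool = False


def soft_delete(task: Task, now: Optional[datetime] = None) -> None:
    """Mark the task deleted; the record and its files are left untouched."""
    if task.deleted_at is None:  # keep the original timestamp on repeat calls
        task.deleted_at = now or datetime.now(timezone.utc)


def visible_to_user(tasks: list[Task], user_id: int) -> list[Task]:
    """User-facing query: own tasks only, soft-deleted tasks hidden."""
    return [t for t in tasks if t.user_id == user_id and t.deleted_at is None]
```

Making a repeated DELETE a no-op is an assumption of this sketch; the spec does not say how repeated calls should behave.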

Scenario: Preserve statistics after soft delete

  • WHEN a task is soft deleted
  • THEN admin statistics endpoints SHALL continue to include this task's metrics
  • AND translation token counts SHALL remain in cumulative totals
  • AND processing time statistics SHALL remain accurate

Requirement: File Cleanup Scheduler

The system SHALL automatically clean up old files while preserving database records for statistics tracking.

Scenario: Scheduled file cleanup

  • WHEN cleanup scheduler runs (configurable interval, default daily)
  • THEN system SHALL identify tasks where files can be deleted
  • AND system SHALL retain files for the newest N tasks per user (N = max_files_per_user, configurable, default 50)
  • AND system SHALL delete actual files from disk for older tasks
  • AND system SHALL set file_deleted=True on cleaned tasks
  • AND system SHALL NOT delete any database records
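The interval logic of the scheduler might look like the sketch below. CleanupScheduler, is_due, and tick are hypothetical names, and running the job on the very first tick is an assumption; the spec only fixes the configurable interval and its daily default.

```python
from datetime import datetime, timedelta
from typing import Callable, Optional


class CleanupScheduler:
    """Runs a cleanup job at most once per interval (24 hours by default)."""

    def __init__(self, job: Callable[[], None],
                 interval: timedelta = timedelta(hours=24)) -> None:
        self.job = job
        self.interval = interval
        self.last_run: Optional[datetime] = None

    def is_due(self, now: datetime) -> bool:
        """Due if the job never ran or the interval has fully elapsed."""
        return self.last_run is None or now - self.last_run >= self.interval

    def tick(self, now: datetime) -> bool:
        """Run the job if due; return whether it ran."""
        if not self.is_due(now):
            return False
        self.job()
        self.last_run = now
        return True
```

Passing the clock in as an argument keeps the sketch deterministic; a real deployment would call tick from a background loop or an existing scheduling library.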

Scenario: File retention per user

  • WHEN user has more than max_files_per_user tasks with files
  • THEN cleanup SHALL delete files for oldest tasks exceeding the limit
  • AND cleanup SHALL preserve the newest max_files_per_user task files
  • AND task ordering SHALL be by created_at descending
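The retention rule above can be sketched as a pure selection function. TaskRow and select_tasks_for_cleanup are hypothetical names; the field names and the default of 50 follow this spec, and it is assumed that tasks already marked file_deleted do not count toward the limit.

```python
from collections import defaultdict
from dataclasses import dataclass
from datetime import datetime


@dataclass
class TaskRow:
    id: str
    user_id: int
    created_at: datetime
    file_deleted: bool = False


def select_tasks_for_cleanup(tasks: list[TaskRow],
                             max_files_per_user: int = 50) -> list[TaskRow]:
    """Return tasks whose files exceed the per-user retention limit."""
    by_user: dict[int, list[TaskRow]] = defaultdict(list)
    for task in tasks:
        if not task.file_deleted:  # only tasks that still have files count
            by_user[task.user_id].append(task)

    to_clean: list[TaskRow] = []
    for user_tasks in by_user.values():
        # Newest first, matching "created_at descending" in the spec.
        user_tasks.sort(key=lambda t: t.created_at, reverse=True)
        # Everything after the newest max_files_per_user entries is eligible.
        to_clean.extend(user_tasks[max_files_per_user:])
    return to_clean
```

The function only selects tasks; deleting the files and setting file_deleted=True would happen in a separate step, with no database rows removed.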

Scenario: Manual cleanup trigger

  • WHEN admin calls POST /api/v2/admin/cleanup/trigger
  • THEN system SHALL immediately run the cleanup process
  • AND return summary of files deleted and space freed
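One way to produce the cleanup summary is sketched here. cleanup_task_dirs and the summary keys are hypothetical; the sketch assumes each task's files live in their own directory under a results root, as in the storage/results/<task_id>/ layout described earlier.

```python
import shutil
from pathlib import Path


def cleanup_task_dirs(results_root: Path, task_ids: list[str]) -> dict[str, int]:
    """Delete the result directory of each task and report what was removed."""
    summary = {"tasks_cleaned": 0, "files_deleted": 0, "bytes_freed": 0}
    for task_id in task_ids:
        task_dir = results_root / task_id
        if not task_dir.is_dir():
            continue  # nothing on disk for this task; skip it
        files = [p for p in task_dir.rglob("*") if p.is_file()]
        summary["files_deleted"] += len(files)
        summary["bytes_freed"] += sum(p.stat().st_size for p in files)
        shutil.rmtree(task_dir)
        summary["tasks_cleaned"] += 1
    # The caller is expected to set file_deleted=True on the cleaned tasks;
    # no database record is touched here.
    return summary
```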

Requirement: Admin Task Visibility

Admin users SHALL have full visibility into all tasks including soft-deleted and file-cleaned tasks.

Scenario: Admin lists all tasks

  • WHEN admin calls GET /api/v2/admin/tasks
  • THEN response SHALL include all tasks from all users
  • AND response SHALL include soft-deleted tasks
  • AND response SHALL include tasks with deleted files
  • AND each task SHALL indicate its deletion status

Scenario: Filter admin task list

  • WHEN admin calls GET /api/v2/admin/tasks with filters
  • THEN include_deleted=false SHALL exclude soft-deleted tasks
  • AND include_files_deleted=false SHALL exclude file-cleaned tasks
  • AND user_id={id} SHALL restrict the list to that user's tasks
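The three filters can be sketched as a single function. AdminTask and filter_admin_tasks are hypothetical names; the parameter names match the query parameters above, and both include flags default to True because the unfiltered admin list includes everything.

```python
from dataclasses import dataclass
from datetime import datetime
from typing import Optional


@dataclass
class AdminTask:
    id: str
    user_id: int
    deleted_at: Optional[datetime] = None
    file_deleted: bool = False


def filter_admin_tasks(tasks: list[AdminTask],
                       include_deleted: bool = True,
                       include_files_deleted: bool = True,
                       user_id: Optional[int] = None) -> list[AdminTask]:
    """Apply the admin list filters; the defaults return every task."""
    result = []
    for task in tasks:
        if not include_deleted and task.deleted_at is not None:
            continue  # soft-deleted task excluded on request
        if not include_files_deleted and task.file_deleted:
            continue  # file-cleaned task excluded on request
        if user_id is not None and task.user_id != user_id:
            continue  # belongs to a different user
        result.append(task)
    return result
```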

Scenario: View storage usage statistics

  • WHEN admin calls GET /api/v2/admin/storage/stats
  • THEN response SHALL include total storage used
  • AND response SHALL include per-user storage breakdown
  • AND response SHALL include counts of tasks with and without files
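A minimal aggregation for the statistics response could look like this. The response keys and the per-task "size_bytes" input are hypothetical; the spec names what the response contains, not its field names.

```python
from collections import defaultdict


def storage_stats(tasks: list[dict]) -> dict:
    """Aggregate total and per-user storage plus with/without-file counts."""
    per_user: dict[int, int] = defaultdict(int)
    with_files = 0
    without_files = 0
    for task in tasks:
        if task["file_deleted"]:
            without_files += 1  # files were cleaned up; nothing left on disk
        else:
            with_files += 1
            per_user[task["user_id"]] += task["size_bytes"]
    return {
        "total_bytes": sum(per_user.values()),
        "per_user_bytes": dict(per_user),
        "tasks_with_files": with_files,
        "tasks_without_files": without_files,
    }
```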

Requirement: User Task Isolation

Regular users SHALL see only their own tasks, and soft-deleted tasks SHALL be hidden from their view.

Scenario: User lists own tasks

  • WHEN authenticated user calls GET /api/v2/tasks
  • THEN response SHALL only include tasks owned by that user
  • AND response SHALL NOT include soft-deleted tasks
  • AND response SHALL include tasks with deleted files (showing file unavailable status)

Scenario: User cannot access another user's tasks

  • WHEN user attempts to access task owned by another user
  • THEN system SHALL return 404 Not Found
  • AND system SHALL NOT reveal that the task exists
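The "do not reveal existence" rule can be sketched as a lookup that fails identically for missing and foreign tasks. TaskNotFound and get_task_for_user are hypothetical names, and treating soft-deleted tasks the same way is an assumption based on the soft-delete requirement above.

```python
from typing import Any


class TaskNotFound(Exception):
    """Maps to HTTP 404 in the API layer."""
    status_code = 404


def get_task_for_user(tasks_by_id: dict[str, dict[str, Any]],
                      task_id: str, user_id: int) -> dict[str, Any]:
    """Return the task, or raise the same error for every inaccessible case."""
    task = tasks_by_id.get(task_id)
    if (
        task is None
        or task["user_id"] != user_id
        or task.get("deleted_at") is not None  # assumption: hidden like a missing task
    ):
        # One message for all cases, so callers cannot probe for task IDs.
        raise TaskNotFound("Task not found")
    return task
```

Returning 403 for a foreign task would confirm that the ID exists, which is why all three cases share one error here.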