OCR/openspec/changes/archive/2025-12-02-add-document-translation/design.md
egg 8d9b69ba93 feat: add document translation via DIFY AI API
Implement document translation feature using DIFY AI API with batch processing:

Backend:
- Add DIFY client with batch translation support (5000 chars, 20 items per batch)
- Add translation service with element extraction and result building
- Add translation router with start/status/result/list/delete endpoints
- Add translation schemas (TranslationRequest, TranslationStatus, etc.)

Frontend:
- Enable translation UI in TaskDetailPage
- Add translation API methods to apiV2.ts
- Add translation types

Features:
- Batch translation with numbered markers [1], [2], [3]...
- Support for text, title, header, footer, paragraph, footnote, table cells
- Translation result JSON with statistics (tokens, latency, batch_count)
- Background task processing with progress tracking

🤖 Generated with [Claude Code](https://claude.com/claude-code)

Co-Authored-By: Claude <noreply@anthropic.com>
2025-12-02 11:57:02 +08:00


Design: Document Translation Feature

Context

Tool_OCR processes documents through three tracks (Direct/OCR/Hybrid) and outputs UnifiedDocument JSON. Users need to translate the extracted text into other languages while preserving the document structure.

Constraints

  • Must use DIFY AI service for translation
  • API-based solution (no local model management)
  • Translation quality depends on DIFY's underlying model

Stakeholders

  • End users: Need translated documents
  • System: Needs a simple HTTP-based integration

Goals / Non-Goals

Goals

  • Translate documents using DIFY AI API
  • Preserve document structure (element positions, formatting)
  • Support all three processing tracks with unified logic
  • Provide real-time progress feedback to users
  • Keep the API integration simple and maintainable

Non-Goals

  • Local model inference (replaced by DIFY API)
  • GPU memory management (not needed)
  • Translation memory or glossary support
  • Concurrent translation processing

Decisions

Decision 1: Translation Provider

Choice: DIFY AI Service (theaken.com)

Configuration:

  • Base URL: https://dify.theaken.com/v1
  • Endpoint: POST /chat-messages
  • API Key: app-YOPrF2ro5fshzMkCZviIuUJd
  • Mode: Chat (Blocking response)

Rationale:

  • High-quality cloud AI translation
  • No local model management required
  • No GPU memory concerns
  • Easy to maintain and update

Decision 2: Response Mode

Choice: Blocking Mode

API Request Format:

{
  "inputs": {},
  "query": "Translate the following text to Chinese:\n\nHello world",
  "response_mode": "blocking",
  "conversation_id": "",
  "user": "tool-ocr-{task_id}"
}

API Response Format:

{
  "event": "message",
  "answer": "你好世界",
  "conversation_id": "xxx",
  "metadata": {
    "usage": {
      "total_tokens": 54,
      "latency": 1.26
    }
  }
}

Rationale:

  • Simpler implementation than streaming
  • Adequate for batch text translation
  • Complete response in single call
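
As an illustration of the request and response formats above, here is a minimal sketch of building the blocking-mode payload and reading the answer and usage figures back out. The function names `build_chat_payload` and `parse_chat_response` are hypothetical, not part of the actual DIFY client, and the defaults for missing usage fields are an assumption.

```python
from typing import Any

DIFY_BASE_URL = "https://dify.theaken.com/v1"
CHAT_ENDPOINT = "/chat-messages"


def build_chat_payload(query: str, task_id: str) -> dict[str, Any]:
    """Build a blocking-mode chat request body in the format shown above."""
    return {
        "inputs": {},
        "query": query,
        "response_mode": "blocking",
        "conversation_id": "",  # empty string: each request starts a new conversation
        "user": f"tool-ocr-{task_id}",
    }


def parse_chat_response(body: dict[str, Any]) -> tuple[str, int, float]:
    """Return (answer, total_tokens, latency).

    Assumption: usage fields may be missing, in which case they default to 0.
    """
    usage = body.get("metadata", {}).get("usage", {})
    tokens = int(usage.get("total_tokens", 0))
    latency = float(usage.get("latency", 0.0))
    return body["answer"], tokens, latency
```

The payload would be sent as JSON to `DIFY_BASE_URL + CHAT_ENDPOINT` with the API key in an `Authorization: Bearer` header; the HTTP call is left out so the sketch stays independent of any particular HTTP library.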

Decision 3: Translation Batch Format

Choice: One text element per request, wrapped in a translation prompt

Request Format:

Translate the following text to {target_language}.
Return ONLY the translated text, no explanations.

{text_content}

Rationale:

  • Clear instruction for AI
  • Predictable response format
  • Easy to parse result
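
The prompt template above could be filled in as follows. This is a sketch only; `PROMPT_TEMPLATE` and `build_prompt` are illustrative names, not taken from the implementation.

```python
PROMPT_TEMPLATE = (
    "Translate the following text to {target_language}.\n"
    "Return ONLY the translated text, no explanations.\n"
    "\n"
    "{text_content}"
)


def build_prompt(text_content: str, target_language: str) -> str:
    """Fill the translation prompt template.

    str.format only parses the template itself, so braces inside
    text_content are inserted verbatim and cause no formatting errors.
    """
    return PROMPT_TEMPLATE.format(
        target_language=target_language,
        text_content=text_content,
    )
```

Here `target_language` is the human-readable language name (for example "Japanese"), not the language code.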

Decision 4: Translation Result Storage

Choice: Independent JSON file per language (unchanged from previous design)

backend/storage/results/{task_id}/
├── xxx_result.json              # Original
├── xxx_translated_en.json       # English translation
├── xxx_translated_ja.json       # Japanese translation
└── ...

Rationale:

  • Non-destructive (original preserved)
  • Multiple languages supported
  • Easy to manage and delete
  • Clear file naming convention
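
To make the naming convention concrete, here is a sketch that derives the translated file path from the original result path. It assumes the original file always ends in `_result.json`, as in the listing above; the helper name `translated_path` is hypothetical.

```python
from pathlib import Path

RESULT_SUFFIX = "_result.json"


def translated_path(result_path: Path, target_lang: str) -> Path:
    """Map .../xxx_result.json to .../xxx_translated_{lang}.json in the same directory."""
    name = result_path.name
    if not name.endswith(RESULT_SUFFIX):
        raise ValueError(f"not a result file: {name}")
    stem = name[: -len(RESULT_SUFFIX)]
    return result_path.with_name(f"{stem}_translated_{target_lang}.json")
```

Because each language gets its own file, deleting a translation amounts to removing that one file and leaves the original result untouched.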

Decision 5: Element Type Handling

Translatable types (content is string):

  • text, title, header, footer, paragraph, footnote

Special handling (content is dict):

  • table -> Translate cells[].content

Skip (non-text content):

  • page_number, image, chart, logo, reference
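
The type rules above can be sketched as a small extraction routine. The element shape used here (a dict with `element_id`, `type`, and `content` keys, and table cells under `content["cells"]`) is an assumption inferred from this document, not a confirmed UnifiedDocument schema.

```python
from typing import Any, Iterator, Optional

TRANSLATABLE_TYPES = {"text", "title", "header", "footer", "paragraph", "footnote"}


def iter_translatable(
    elements: list[dict[str, Any]],
) -> Iterator[tuple[str, Optional[int], str]]:
    """Yield (element_id, cell_index, text) for every translatable string.

    cell_index is None for plain text elements and the cell's list index
    for table cells. Any other element type is skipped.
    """
    for element in elements:
        etype = element.get("type")
        content = element.get("content")
        if etype in TRANSLATABLE_TYPES and isinstance(content, str):
            if content.strip():
                yield element["element_id"], None, content
        elif etype == "table" and isinstance(content, dict):
            for index, cell in enumerate(content.get("cells", [])):
                text = cell.get("content")
                if isinstance(text, str) and text.strip():
                    yield element["element_id"], index, text
```

Empty and whitespace-only strings are skipped here to avoid spending API calls on them; that is a design choice of this sketch rather than something the design mandates.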

Architecture

Component Diagram

┌─────────────────────────────────────────────────────────────┐
│                        Frontend                              │
│  ┌─────────────┐  ┌──────────────┐  ┌─────────────────────┐ │
│  │ TaskDetail  │  │ TranslateBtn │  │ ProgressDisplay     │ │
│  └─────────────┘  └──────────────┘  └─────────────────────┘ │
└────────────────────────────┬────────────────────────────────┘
                             │ HTTP
┌────────────────────────────▼────────────────────────────────┐
│                     Backend API                              │
│  ┌─────────────────────────────────────────────────────────┐│
│  │ TranslateRouter                                          ││
│  │ POST /api/v2/translate/{task_id}                         ││
│  │ GET  /api/v2/translate/{task_id}/status                  ││
│  │ GET  /api/v2/translate/{task_id}/result                  ││
│  └─────────────────────────────────────────────────────────┘│
└────────────────────────────┬────────────────────────────────┘
                             │
┌────────────────────────────▼────────────────────────────────┐
│                  TranslationService                          │
│  ┌───────────────┐  ┌───────────────┐  ┌─────────────────┐  │
│  │ DifyClient    │  │ BatchBuilder  │  │ ResultParser    │  │
│  │ - translate() │  │ - extract()   │  │ - parse()       │  │
│  │ - chat()      │  │ - format()    │  │ - map_ids()     │  │
│  └───────────────┘  └───────────────┘  └─────────────────┘  │
└────────────────────────────┬────────────────────────────────┘
                             │ HTTPS
┌────────────────────────────▼────────────────────────────────┐
│                    DIFY AI Service                           │
│                 https://dify.theaken.com/v1                  │
│                    (Chat - Blocking)                         │
└─────────────────────────────────────────────────────────────┘

Translation JSON Schema

{
  "schema_version": "1.0.0",
  "source_document": "xxx_result.json",
  "source_lang": "auto",
  "target_lang": "en",
  "provider": "dify",
  "translated_at": "2025-12-02T12:00:00Z",
  "statistics": {
    "total_elements": 50,
    "translated_elements": 45,
    "skipped_elements": 5,
    "total_characters": 5000,
    "processing_time_seconds": 30.5,
    "total_tokens": 2500
  },
  "translations": {
    "pp3_0_0": "Company Profile",
    "pp3_0_1": "Founded in 2020...",
    "table_1_0": {
      "cells": [
        {"row": 0, "col": 0, "content": "Technology"},
        {"row": 0, "col": 1, "content": "Epoxy"}
      ]
    }
  }
}
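
One way the `translations` map could be merged back onto the original elements is sketched below. It assumes each element carries an `element_id` matching the keys above and that table cells in the source carry the same `row` and `col` fields; `apply_translations` is a hypothetical helper.

```python
import copy
from typing import Any


def apply_translations(
    elements: list[dict[str, Any]],
    translations: dict[str, Any],
) -> list[dict[str, Any]]:
    """Return a deep copy of elements with translated content substituted.

    Elements without an entry in translations keep their original content.
    """
    result = copy.deepcopy(elements)
    for element in result:
        translated = translations.get(element.get("element_id"))
        if isinstance(translated, str):
            element["content"] = translated
        elif isinstance(translated, dict) and isinstance(element.get("content"), dict):
            # Match table cells by (row, col) rather than list position.
            by_position = {
                (cell["row"], cell["col"]): cell["content"]
                for cell in translated.get("cells", [])
            }
            for cell in element["content"].get("cells", []):
                key = (cell.get("row"), cell.get("col"))
                if key in by_position:
                    cell["content"] = by_position[key]
    return result
```

Working on a copy keeps the original result data unmodified, in line with the non-destructive storage decision.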

Language Code Mapping

LANGUAGE_NAMES = {
    "en": "English",
    "zh-TW": "Traditional Chinese",
    "zh-CN": "Simplified Chinese",
    "ja": "Japanese",
    "ko": "Korean",
    "de": "German",
    "fr": "French",
    "es": "Spanish",
    "pt": "Portuguese",
    "it": "Italian",
    "ru": "Russian",
    "vi": "Vietnamese",
    "th": "Thai",
    # Additional languages as needed
}
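
A lookup over this mapping might look like the following sketch. The fallback behavior (case-insensitive match, then returning the code unchanged) is an assumption for illustration, and only an excerpt of the mapping is repeated so the block is self-contained.

```python
# Excerpt of the mapping above, repeated so this sketch runs on its own.
LANGUAGE_NAMES = {
    "en": "English",
    "zh-TW": "Traditional Chinese",
    "zh-CN": "Simplified Chinese",
    "ja": "Japanese",
}


def language_name(code: str) -> str:
    """Resolve a language code to the name used in the translation prompt.

    Tries an exact match, then a case-insensitive match. Unknown codes are
    returned unchanged so the prompt still names some target language.
    """
    if code in LANGUAGE_NAMES:
        return LANGUAGE_NAMES[code]
    lowered = {key.lower(): name for key, name in LANGUAGE_NAMES.items()}
    return lowered.get(code.lower(), code)
```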

Risks / Trade-offs

Risk 1: API Availability

  • Risk: DIFY service downtime affects translation
  • Mitigation: Add timeout handling, retry logic, graceful error messages
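
The retry part of this mitigation could be sketched as follows, using the tentative figure from Open Questions (up to 3 retries with exponential backoff). The helper name, the 1-second base delay, and retrying on any exception are assumptions; a real client would likely retry only on timeouts and transient HTTP errors.

```python
import time
from typing import Callable, TypeVar

T = TypeVar("T")


def call_with_retry(
    fn: Callable[[], T],
    retries: int = 3,
    base_delay: float = 1.0,
    sleep: Callable[[float], None] = time.sleep,
) -> T:
    """Call fn, retrying up to `retries` times after a failure.

    The wait before retry n (0-based) is base_delay * 2**n seconds.
    The last exception is re-raised once all retries are used up.
    """
    for attempt in range(retries + 1):
        try:
            return fn()
        except Exception:
            if attempt == retries:
                raise
            sleep(base_delay * 2 ** attempt)
    raise AssertionError("unreachable")
```

The `sleep` parameter is injectable so the backoff schedule can be tested without actually waiting.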

Risk 2: API Cost

  • Risk: High volume translation increases cost
  • Mitigation: Monitor usage via metadata, consider rate limiting

Risk 3: Network Latency

  • Risk: Each translation request adds network latency
  • Mitigation: Batch text when possible, show progress to user

Risk 4: Translation Quality Variance

  • Risk: AI translation quality varies by language pair
  • Mitigation: Document known limitations, allow user feedback

Migration Plan

Phase 1: Core Translation (This Proposal)

  1. DIFY client implementation
  2. Backend translation service (rewrite)
  3. API endpoints (modify)
  4. Frontend activation

Phase 2: Enhanced Features (Future)

  1. Translated PDF generation
  2. Translation caching
  3. Custom terminology support

Rollback

  • Translation is an additive feature
  • No schema changes to existing data
  • Can disable by removing router registration

Open Questions

  1. Rate Limiting: Should we limit requests per minute to the DIFY API?
    • Tentative: 10 requests per minute per user
  2. Retry Logic: How should API failures be handled?
    • Tentative: Retry up to 3 times with exponential backoff
  3. Batch Size: How many elements per API call?
    • Tentative: 1 element per call for simplicity, optimize later if needed
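
If the tentative limit of 10 requests per minute per user is adopted, a sliding-window limiter is one possible way to enforce it. The sketch below is in-memory and single-process only, and the class name and API are hypothetical.

```python
import time
from collections import defaultdict, deque
from typing import Callable


class SlidingWindowLimiter:
    """Allow at most max_requests per user within any window_seconds span."""

    def __init__(
        self,
        max_requests: int = 10,
        window_seconds: float = 60.0,
        clock: Callable[[], float] = time.monotonic,
    ) -> None:
        self._max = max_requests
        self._window = window_seconds
        self._clock = clock
        self._hits: dict[str, deque] = defaultdict(deque)

    def allow(self, user: str) -> bool:
        """Record and allow the request if the user is under the limit."""
        now = self._clock()
        hits = self._hits[user]
        # Drop timestamps that have aged out of the window.
        while hits and now - hits[0] >= self._window:
            hits.popleft()
        if len(hits) >= self._max:
            return False
        hits.append(now)
        return True
```

A caller would check `limiter.allow(user_id)` before each DIFY request and either wait or return an error when it yields `False`.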