Meeting_Assistant/openspec/changes/archive/2025-12-11-add-dify-audio-transcription/design.md
2e78e3760a chore: Archive add-dify-audio-transcription proposal
Move completed Dify audio transcription proposal to archive and update
transcription spec with new capabilities.

2025-12-11 21:05:01 +08:00


Design: Dify Audio Transcription

Context

The Meeting Assistant currently supports real-time transcription via a local Python sidecar using faster-whisper. Users have requested the ability to upload pre-recorded audio files for transcription. To avoid overloading local resources and leverage cloud capabilities, uploaded files will be processed by Dify's speech-to-text service.

Goals / Non-Goals

Goals:

  • Allow users to upload audio files for transcription
  • Use Dify STT service for file-based transcription
  • Handle large files via VAD-based automatic segmentation
  • Maintain seamless UX with transcription appearing in the same location
  • Support common audio formats (MP3, WAV, M4A, WebM, OGG)

Non-Goals:

  • Replace real-time local transcription (sidecar remains for live recording)
  • Support video files (audio would first need to be extracted from the video; out of scope)
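The supported-format goal implies an upload-time check. A minimal sketch (the constant and function names are illustrative, not from the codebase):

```python
from pathlib import Path

# Upload-time allow-list mirroring the supported formats listed above.
ALLOWED_EXTENSIONS = {".mp3", ".wav", ".m4a", ".webm", ".ogg"}

def is_supported_audio(filename: str) -> bool:
    """Return True if the filename carries a supported audio extension."""
    return Path(filename).suffix.lower() in ALLOWED_EXTENSIONS
```

Video extensions such as .mp4 are rejected here, matching the non-goal above.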

Decisions

Decision 1: Two-Path Transcription Architecture

  • What: Real-time recording uses local sidecar; file uploads use Dify cloud
  • Why: Local processing provides low latency for real-time needs; cloud processing handles larger files without impacting local resources
  • Alternatives considered:
    • All transcription via Dify: Rejected due to latency and network dependency for real-time use
    • All transcription local: Rejected due to resource constraints for large file processing

Decision 2: VAD-Based Audio Chunking

  • What: Use sidecar's Silero VAD to segment large audio files before sending to Dify
  • Why:
    • Dify STT API has file size limits (typically ~25MB)
    • VAD removes silence, reducing total upload size
    • Natural speech boundaries improve transcription quality
  • Implementation:
    • Sidecar segments audio into chunks (~2-5 minutes each based on speech boundaries)
    • Each chunk sent to Dify sequentially
    • Results concatenated with proper spacing
  • Alternatives considered:
    • Fixed-time splitting (e.g., 5min chunks): Rejected: may cut mid-sentence
    • Client-side splitting: Rejected: requires shipping VAD to client

Decision 3: Separate API Key for STT Service

  • What: Use dedicated Dify API key app-xQeSipaQecs0cuKeLvYDaRsu for STT
  • Why: Allows independent rate limiting and monitoring from summarization service
  • Configuration: DIFY_STT_API_KEY environment variable
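A minimal sketch of reading the dedicated key from the environment (the fail-fast behavior is an assumption, not specified above):

```python
import os

def get_stt_api_key() -> str:
    """Read the dedicated Dify STT key, failing fast when it is unset."""
    key = os.environ.get("DIFY_STT_API_KEY", "").strip()
    if not key:
        raise RuntimeError("DIFY_STT_API_KEY is not configured")
    return key
```

Keeping this separate from the summarization key lets the two services be rotated and rate-limited independently.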

Decision 4: Backend-Mediated Upload with Sidecar Processing

  • What: Client → Backend → Sidecar (VAD) → Backend → Dify → Client
  • Why:
    • Keeps Dify API keys secure on server
    • Reuses existing sidecar VAD capability
    • Enables progress tracking for multi-chunk processing
  • Alternatives considered:
    • Direct client → Dify: Rejected due to API key exposure and file size limits

Decision 5: Append vs Replace Transcript

  • What: Uploaded file transcription replaces current transcript content
  • Why: Users typically upload complete meeting recordings; appending would create confusion
  • UI: Show confirmation dialog before replacing existing content

API Design

Backend Endpoint

POST /api/ai/transcribe-audio
Content-Type: multipart/form-data

Request:
- file: Audio file (max 500MB, will be chunked)

Response (progress events are streamed; final payload):
{
  "transcript": "Full meeting transcript content...",
  "chunks_processed": 5,
  "total_duration_seconds": 3600,
  "language": "zh"
}
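For illustration, the final payload can be modeled as a dataclass (the class name is hypothetical; the field names come from the spec above):

```python
import json
from dataclasses import dataclass, asdict

@dataclass
class TranscribeAudioResponse:
    """Final payload of POST /api/ai/transcribe-audio; fields match the spec above."""
    transcript: str
    chunks_processed: int
    total_duration_seconds: float
    language: str

resp = TranscribeAudioResponse(
    transcript="Full meeting transcript...",
    chunks_processed=5,
    total_duration_seconds=3600,
    language="zh",
)
payload = json.dumps(asdict(resp))
```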

Sidecar VAD Segmentation Command

// Request
{
  "action": "segment_audio",
  "file_path": "/tmp/uploaded_audio.mp3",
  "max_chunk_seconds": 300,
  "min_silence_ms": 500
}

// Response
{
  "status": "success",
  "segments": [
    {"index": 0, "path": "/tmp/chunk_0.wav", "start": 0, "end": 180.5},
    {"index": 1, "path": "/tmp/chunk_1.wav", "start": 180.5, "end": 362.0},
    ...
  ],
  "total_segments": 5
}
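A sketch of how the sidecar might dispatch this command; handle_sidecar_command and the injected segment_fn are illustrative names, and the defaults mirror the request fields above:

```python
def handle_sidecar_command(command: dict, segment_fn) -> dict:
    """Dispatch one sidecar JSON command.

    segment_fn(file_path, max_chunk_seconds, min_silence_ms) performs the
    actual VAD segmentation and returns a list of segment dicts.
    """
    if command.get("action") != "segment_audio":
        return {"status": "error", "message": f"unknown action: {command.get('action')}"}
    segments = segment_fn(
        command["file_path"],
        command.get("max_chunk_seconds", 300),  # defaults mirror the request above
        command.get("min_silence_ms", 500),
    )
    return {"status": "success", "segments": segments, "total_segments": len(segments)}
```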

Dify STT API Integration

POST https://dify.theaken.com/v1/audio-to-text
Authorization: Bearer {DIFY_STT_API_KEY}
Content-Type: multipart/form-data

Request:
- file: Audio chunk (<25MB)
- user: User identifier

Response:
{
  "text": "transcribed content for this chunk..."
}
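A hedged sketch of the per-chunk call with the exponential-backoff retry noted under Risks; post_fn is an injected helper (assumed here, not part of the spec) that performs the actual multipart POST to /v1/audio-to-text and returns the parsed JSON body:

```python
import time

def transcribe_chunk(post_fn, chunk_path: str, user: str,
                     max_retries: int = 3, initial_delay: float = 1.0) -> str:
    """Send one chunk to Dify STT, retrying with exponential backoff.

    post_fn(chunk_path, user) is expected to raise on HTTP errors such as
    429 (rate limiting), triggering a retry.
    """
    delay = initial_delay
    for attempt in range(max_retries):
        try:
            return post_fn(chunk_path, user)["text"]
        except Exception:
            if attempt == max_retries - 1:
                raise  # retries exhausted; surface the error to the caller
            time.sleep(delay)
            delay *= 2  # exponential backoff
```

Injecting post_fn keeps the retry logic testable without network access.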

Data Flow

┌─────────────┐     ┌─────────────┐     ┌─────────────┐     ┌─────────────┐
│  Electron   │     │  FastAPI    │     │  Sidecar    │     │  Dify STT   │
│  Client     │     │  Backend    │     │  (VAD)      │     │  Service    │
└──────┬──────┘     └──────┬──────┘     └──────┬──────┘     └──────┬──────┘
       │                   │                   │                   │
       │ Upload audio      │                   │                   │
       │──────────────────>│                   │                   │
       │                   │                   │                   │
       │                   │ segment_audio     │                   │
       │                   │──────────────────>│                   │
       │                   │                   │                   │
       │                   │ segments[]        │                   │
       │                   │<──────────────────│                   │
       │                   │                   │                   │
       │                   │ For each chunk:   │                   │
       │   Progress: 1/5   │──────────────────────────────────────>│
       │<──────────────────│                   │                   │
       │                   │                   │    transcription  │
       │                   │<──────────────────────────────────────│
       │                   │                   │                   │
       │   Progress: 2/5   │──────────────────────────────────────>│
       │<──────────────────│                   │                   │
       │                   │<──────────────────────────────────────│
       │                   │                   │                   │
       │                   │  ... repeat ...   │                   │
       │                   │                   │                   │
       │  Final transcript │                   │                   │
       │<──────────────────│                   │                   │
       │  (concatenated)   │                   │                   │

Chunking Algorithm

def segment_audio_with_vad(audio_path, max_chunk_seconds=300, min_silence_ms=500):
    """
    Segment audio file using VAD for natural speech boundaries.

    1. Load audio file
    2. Run VAD to detect speech/silence regions
    3. Find silence gaps >= min_silence_ms
    4. Split at silence gaps, keeping chunks <= max_chunk_seconds
    5. If no silence found within max_chunk_seconds, force split at max
    6. Export each chunk as WAV file
    7. Return list of chunk file paths with timestamps
    """

Risks / Trade-offs

Risk | Mitigation
--- | ---
Large file causes memory issues | Stream audio processing; limit uploads to 500MB
Dify rate limiting | Retry with exponential backoff
Chunk boundary affects context | Overlap chunks by 1-2 seconds
Long processing time | Show progress indicator with chunk count
Sidecar not available | Return error suggesting real-time recording

Migration Plan

No migration needed - this is additive functionality.

Open Questions

  • Maximum file size limit? Resolved: 500MB with VAD chunking
  • Chunk overlap for context continuity?
    • Proposal: 1 second overlap, deduplicate in concatenation
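The proposed dedup-on-concatenation could work at the text level, as in this sketch (character-based overlap matching is an assumption; a real implementation might instead dedup using chunk timestamps):

```python
def concat_transcripts(chunk_texts, max_overlap_chars=80):
    """Join per-chunk transcripts, dropping text duplicated by chunk overlap.

    If the start of a chunk repeats the tail of the running transcript
    (up to max_overlap_chars), the repeated prefix is skipped once.
    """
    result = ""
    for text in chunk_texts:
        text = text.strip()
        if not text:
            continue
        if not result:
            result = text
            continue
        # find the longest suffix of result that is also a prefix of text
        limit = min(max_overlap_chars, len(result), len(text))
        for k in range(limit, 0, -1):
            if result.endswith(text[:k]):
                text = text[k:].lstrip()
                break
        result += (" " + text) if text else ""
    return result
```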