# Design: Dify Audio Transcription

## Context

The Meeting Assistant currently supports real-time transcription via a local Python sidecar using faster-whisper. Users have requested the ability to upload pre-recorded audio files for transcription. To avoid overloading local resources and to leverage cloud capabilities, uploaded files will be processed by Dify's speech-to-text service.

## Goals / Non-Goals

**Goals:**

- Allow users to upload audio files for transcription
- Use the Dify STT service for file-based transcription
- Handle large files via VAD-based automatic segmentation
- Maintain a seamless UX, with transcription appearing in the same location
- Support common audio formats (MP3, WAV, M4A, WebM, OGG)

**Non-Goals:**

- Replace real-time local transcription (the sidecar remains for live recording)
- Support video files (audio extraction)

## Decisions

### Decision 1: Two-Path Transcription Architecture

- **What**: Real-time recording uses the local sidecar; file uploads use Dify cloud
- **Why**: Local processing provides low latency for real-time needs; cloud processing handles larger files without impacting local resources
- **Alternatives considered**:
  - All transcription via Dify: Rejected due to latency and network dependency for real-time use
  - All transcription local: Rejected due to resource constraints for large file processing

### Decision 2: VAD-Based Audio Chunking

- **What**: Use the sidecar's Silero VAD to segment large audio files before sending them to Dify
- **Why**:
  - The Dify STT API has file size limits (typically ~25MB)
  - VAD removes silence, reducing total upload size
  - Natural speech boundaries improve transcription quality
- **Implementation**:
  - The sidecar segments audio into chunks (~2-5 minutes each, based on speech boundaries)
  - Each chunk is sent to Dify sequentially
  - Results are concatenated with proper spacing
- **Alternatives considered**:
  - Fixed-time splitting (e.g., 5-minute chunks): Rejected - may cut mid-sentence
  - Client-side splitting: Rejected - requires shipping
VAD to the client

### Decision 3: Separate API Key for STT Service

- **What**: Use a dedicated Dify API key `app-xQeSipaQecs0cuKeLvYDaRsu` for STT
- **Why**: Allows rate limiting and monitoring independent of the summarization service
- **Configuration**: `DIFY_STT_API_KEY` environment variable

### Decision 4: Backend-Mediated Upload with Sidecar Processing

- **What**: Client → Backend → Sidecar (VAD) → Backend → Dify → Client
- **Why**:
  - Keeps Dify API keys secure on the server
  - Reuses the existing sidecar VAD capability
  - Enables progress tracking for multi-chunk processing
- **Alternatives considered**:
  - Direct client → Dify: Rejected due to API key exposure and file size limits

### Decision 5: Append vs. Replace Transcript

- **What**: Uploaded-file transcription replaces the current transcript content
- **Why**: Users typically upload complete meeting recordings; appending would create confusion
- **UI**: Show a confirmation dialog before replacing existing content

## API Design

### Backend Endpoint

```
POST /api/ai/transcribe-audio
Content-Type: multipart/form-data

Request:
- file: Audio file (max 500MB, will be chunked)

Response (streaming for progress):
{
  "transcript": "Full transcript of the meeting...",
  "chunks_processed": 5,
  "total_duration_seconds": 3600,
  "language": "zh"
}
```

### Sidecar VAD Segmentation Command

```json
// Request
{
  "action": "segment_audio",
  "file_path": "/tmp/uploaded_audio.mp3",
  "max_chunk_seconds": 300,
  "min_silence_ms": 500
}

// Response
{
  "status": "success",
  "segments": [
    {"index": 0, "path": "/tmp/chunk_0.wav", "start": 0, "end": 180.5},
    {"index": 1, "path": "/tmp/chunk_1.wav", "start": 180.5, "end": 362.0},
    ...
  ],
  "total_segments": 5
}
```

### Dify STT API Integration

```
POST https://dify.theaken.com/v1/audio-to-text
Authorization: Bearer {DIFY_STT_API_KEY}
Content-Type: multipart/form-data

Request:
- file: Audio chunk (<25MB)
- user: User identifier

Response:
{
  "text": "transcribed content for this chunk..."
}
```

## Data Flow

```
┌─────────────┐    ┌─────────────┐    ┌─────────────┐    ┌─────────────┐
│  Electron   │    │   FastAPI   │    │   Sidecar   │    │  Dify STT   │
│   Client    │    │   Backend   │    │    (VAD)    │    │   Service   │
└──────┬──────┘    └──────┬──────┘    └──────┬──────┘    └──────┬──────┘
       │                  │                  │                  │
       │   Upload audio   │                  │                  │
       │─────────────────>│                  │                  │
       │                  │                  │                  │
       │                  │  segment_audio   │                  │
       │                  │─────────────────>│                  │
       │                  │                  │                  │
       │                  │    segments[]    │                  │
       │                  │<─────────────────│                  │
       │                  │                  │                  │
       │                  │ For each chunk:  │                  │
       │  Progress: 1/5   │────────────────────────────────────>│
       │<─────────────────│                  │                  │
       │                  │                  │  transcription   │
       │                  │<────────────────────────────────────│
       │                  │                  │                  │
       │  Progress: 2/5   │────────────────────────────────────>│
       │<─────────────────│                  │                  │
       │                  │<────────────────────────────────────│
       │                  │                  │                  │
       │                  │  ... repeat ...  │                  │
       │                  │                  │                  │
       │ Final transcript │                  │                  │
       │<─────────────────│                  │                  │
       │  (concatenated)  │                  │                  │
```

## Chunking Algorithm

```python
def segment_audio_with_vad(audio_path, max_chunk_seconds=300, min_silence_ms=500):
    """
    Segment an audio file using VAD for natural speech boundaries.

    1. Load the audio file
    2. Run VAD to detect speech/silence regions
    3. Find silence gaps >= min_silence_ms
    4. Split at silence gaps, keeping chunks <= max_chunk_seconds
    5. If no silence is found within max_chunk_seconds, force a split at the max
    6. Export each chunk as a WAV file
    7. Return a list of chunk file paths with timestamps
    """
```

## Risks / Trade-offs

| Risk | Mitigation |
|------|------------|
| Large file causes memory issues | Stream audio processing; limit uploads to 500MB |
| Dify rate limiting | Retry with exponential backoff |
| Chunk boundary affects context | Overlap chunks by 1-2 seconds |
| Long processing time | Show a progress indicator with chunk count |
| Sidecar not available | Return an error suggesting real-time recording |

## Migration Plan

No migration needed - this is additive functionality.

## Open Questions

- ~~Maximum file size limit?~~ **Resolved**: 500MB with VAD chunking
- Chunk overlap for context continuity?
  - Proposal: 1-second overlap, deduplicate during concatenation
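The splitting rule from Decision 2 (split at silence gaps, cap chunk length) can be sketched as a pure function over VAD output. This is an illustrative sketch, not the sidecar implementation: `plan_chunks` is a hypothetical name, `speech_segments` stands in for the `(start, end)` speech timestamps Silero VAD would report, and force-splitting a single overlong segment (step 5 of the algorithm) is omitted for brevity.

```python
def plan_chunks(speech_segments, max_chunk_seconds=300, min_silence_s=0.5):
    """Group VAD speech segments, given as (start, end) seconds, into chunks.

    A chunk is closed at a silence gap of at least min_silence_s once
    adding the next segment would push the chunk past max_chunk_seconds.
    Illustrative sketch only; does not force-split overlong segments.
    """
    chunks = []
    current = []  # speech segments accumulated into the open chunk
    for seg in speech_segments:
        if not current:
            current = [seg]
            continue
        gap = seg[0] - current[-1][1]          # silence before this segment
        length_if_added = seg[1] - current[0][0]
        if length_if_added > max_chunk_seconds and gap >= min_silence_s:
            # Close the chunk at this silence gap and start a new one.
            chunks.append((current[0][0], current[-1][1]))
            current = [seg]
        else:
            current.append(seg)
    if current:
        chunks.append((current[0][0], current[-1][1]))
    return chunks
```

With `max_chunk_seconds=300`, three segments spanning 0-320s with a 1-second silence at 250-251s would yield two chunks, split at that gap.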
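The exponential-backoff mitigation listed under Risks / Trade-offs could take the shape of a small retry wrapper. `with_backoff` is a hypothetical helper, not existing project code; `call` would wrap one Dify audio-to-text request, and the injectable `sleep` keeps the wrapper testable.

```python
import time


def with_backoff(call, max_attempts=4, base_delay=1.0, sleep=time.sleep):
    """Retry `call` on exception, sleeping base_delay * 2**attempt between
    tries (1s, 2s, 4s, ...). Re-raises after the final attempt.

    Sketch of the rate-limit mitigation; in practice you would likely
    retry only on HTTP 429/5xx rather than on any exception.
    """
    for attempt in range(max_attempts):
        try:
            return call()
        except Exception:
            if attempt == max_attempts - 1:
                raise
            sleep(base_delay * (2 ** attempt))
```

A backend chunk loop would then call `with_backoff(lambda: post_chunk_to_dify(path))` per segment, where `post_chunk_to_dify` is the (assumed) function performing the multipart POST.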
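For the open question on overlap deduplication, one minimal approach is a greedy suffix/prefix match at each seam between consecutive chunk transcripts. `join_with_dedup` is a hypothetical sketch under the 1-second-overlap proposal, not a production text aligner; real overlapping transcriptions may differ in wording at the seam and need fuzzier matching.

```python
def join_with_dedup(pieces, max_overlap_chars=50):
    """Concatenate chunk transcripts, trimming text that is repeated
    verbatim at the seam (from overlapping audio chunks).

    Finds the longest suffix of the accumulated result that equals a
    prefix of the next piece, capped at max_overlap_chars.
    """
    result = ""
    for piece in pieces:
        overlap = 0
        limit = min(len(result), len(piece), max_overlap_chars)
        for k in range(limit, 0, -1):
            if result.endswith(piece[:k]):
                overlap = k
                break
        result += piece[overlap:]
    return result
```

For example, joining `"hello world and"` with `"and then some"` drops the repeated `"and"` and yields `"hello world and then some"`.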