# Design: Dify Audio Transcription

## Context

The Meeting Assistant currently supports real-time transcription via a local Python sidecar using faster-whisper. Users have requested the ability to upload pre-recorded audio files for transcription. To avoid overloading local resources and to leverage cloud capabilities, uploaded files will be processed by Dify's speech-to-text service.

## Goals / Non-Goals

**Goals:**

- Allow users to upload audio files for transcription
- Use the Dify STT service for file-based transcription
- Handle large files via VAD-based automatic segmentation
- Maintain a seamless UX, with the uploaded-file transcript appearing in the same location as the live transcript
- Support common audio formats (MP3, WAV, M4A, WebM, OGG)

**Non-Goals:**

- Replace real-time local transcription (sidecar remains for live recording)
- Support video files (audio extraction)

## Decisions

### Decision 1: Two-Path Transcription Architecture

- **What**: Real-time recording uses local sidecar; file uploads use Dify cloud
- **Why**: Local processing provides low latency for real-time needs; cloud processing handles larger files without impacting local resources
- **Alternatives considered**:
  - All transcription via Dify: Rejected due to latency and network dependency for real-time use
  - All transcription local: Rejected due to resource constraints for large file processing

### Decision 2: VAD-Based Audio Chunking

- **What**: Use sidecar's Silero VAD to segment large audio files before sending to Dify
- **Why**:
  - Dify STT API has file size limits (typically ~25MB)
  - VAD removes silence, reducing total upload size
  - Natural speech boundaries improve transcription quality
- **Implementation**:
  - Sidecar segments audio into chunks (~2-5 minutes each based on speech boundaries)
  - Each chunk sent to Dify sequentially
  - Results concatenated with proper spacing
- **Alternatives considered**:
  - Fixed-time splitting (e.g., 5min chunks): Rejected - may cut mid-sentence
  - Client-side splitting: Rejected - requires shipping VAD to client

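
Whether a 2-5 minute chunk stays under the ~25MB limit depends on the export format, which this document does not specify. The sketch below works through the arithmetic under stated assumptions: uncompressed PCM WAV with a 44-byte header, 16 kHz mono 16-bit samples by default, and "25MB" read as 25 MiB. The names `wav_chunk_bytes` and `fits_dify_limit` are hypothetical helpers, not part of the sidecar.

```python
WAV_HEADER_BYTES = 44                  # assumption: canonical PCM WAV header size
DIFY_LIMIT_BYTES = 25 * 1024 * 1024    # assumption: "25MB" interpreted as 25 MiB


def wav_chunk_bytes(seconds, sample_rate=16_000, channels=1, bytes_per_sample=2):
    """Estimate the size of an uncompressed PCM WAV chunk in bytes."""
    payload = int(seconds * sample_rate * channels * bytes_per_sample)
    return WAV_HEADER_BYTES + payload


def fits_dify_limit(seconds, **wav_format):
    """Return True if a chunk of this duration stays under the assumed limit."""
    return wav_chunk_bytes(seconds, **wav_format) < DIFY_LIMIT_BYTES


if __name__ == "__main__":
    print(wav_chunk_bytes(300))                                     # 16 kHz mono
    print(fits_dify_limit(300))
    print(fits_dify_limit(300, sample_rate=44_100, channels=2))     # CD-quality stereo
```

Under these assumptions a 5-minute chunk at 16 kHz mono is about 9.6 MB, while the same 5 minutes at 44.1 kHz stereo is about 52.9 MB and would exceed the limit, so the chunk export format matters as much as the chunk duration.
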
### Decision 3: Separate API Key for STT Service

- **What**: Use dedicated Dify API key `app-xQeSipaQecs0cuKeLvYDaRsu` for STT
- **Why**: Allows independent rate limiting and monitoring from summarization service
- **Configuration**: `DIFY_STT_API_KEY` environment variable

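
A minimal sketch of how the backend might read this setting, assuming the key is supplied only through the `DIFY_STT_API_KEY` environment variable and that a missing key should fail loudly. `MissingSttKeyError`, `load_stt_api_key`, and `auth_header` are hypothetical names introduced for illustration.

```python
import os


class MissingSttKeyError(RuntimeError):
    """Raised when DIFY_STT_API_KEY is absent or blank (hypothetical error type)."""


def load_stt_api_key(env=None):
    """Read the STT key from the environment; `env` can be injected for tests."""
    env = os.environ if env is None else env
    key = env.get("DIFY_STT_API_KEY", "").strip()
    if not key:
        raise MissingSttKeyError("DIFY_STT_API_KEY is not set")
    return key


def auth_header(api_key):
    """Build the Authorization header used by the Dify STT request."""
    return {"Authorization": f"Bearer {api_key}"}
```

Keeping the lookup in one function makes it easy to check at startup that the STT key is configured separately from the summarization key.
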
### Decision 4: Backend-Mediated Upload with Sidecar Processing

- **What**: Client → Backend → Sidecar (VAD) → Backend → Dify → Backend → Client
- **Why**:
  - Keeps Dify API keys secure on server
  - Reuses existing sidecar VAD capability
  - Enables progress tracking for multi-chunk processing
- **Alternatives considered**:
  - Direct client → Dify: Rejected due to API key exposure and file size limits

### Decision 5: Append vs Replace Transcript

- **What**: Uploaded file transcription replaces current transcript content
- **Why**: Users typically upload complete meeting recordings; appending would create confusion
- **UI**: Show confirmation dialog before replacing existing content

## API Design

### Backend Endpoint

```
POST /api/ai/transcribe-audio
Content-Type: multipart/form-data

Request:
- file: Audio file (max 500MB, will be chunked)

Response (streaming for progress):
{
  "transcript": "完整的會議逐字稿內容...",
  "chunks_processed": 5,
  "total_duration_seconds": 3600,
  "language": "zh"
}
```

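
The endpoint would presumably reject unsupported files before handing them to the sidecar. The sketch below checks the two constraints stated in this document (the supported formats from the Goals and the 500MB cap). It assumes the format is judged by file extension and that 500MB means 500 MiB; `validate_upload` is a hypothetical helper, not the backend's actual code.

```python
from pathlib import Path

ALLOWED_EXTENSIONS = {".mp3", ".wav", ".m4a", ".webm", ".ogg"}
MAX_UPLOAD_BYTES = 500 * 1024 * 1024  # assumption: "500MB" interpreted as 500 MiB


def validate_upload(filename, size_bytes):
    """Return None when the upload is acceptable, otherwise an error message."""
    extension = Path(filename).suffix.lower()
    if extension not in ALLOWED_EXTENSIONS:
        return f"Unsupported audio format: {extension or '(none)'}"
    if size_bytes <= 0:
        return "File is empty"
    if size_bytes > MAX_UPLOAD_BYTES:
        return "File exceeds the 500MB upload limit"
    return None
```

An extension check is only a first filter; a real handler may also want to inspect the decoded audio, since a renamed video file would pass this test.
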
### Sidecar VAD Segmentation Command

```json
// Request
{
  "action": "segment_audio",
  "file_path": "/tmp/uploaded_audio.mp3",
  "max_chunk_seconds": 300,
  "min_silence_ms": 500
}

// Response
{
  "status": "success",
  "segments": [
    {"index": 0, "path": "/tmp/chunk_0.wav", "start": 0, "end": 180.5},
    {"index": 1, "path": "/tmp/chunk_1.wav", "start": 180.5, "end": 362.0},
    ...
  ],
  "total_segments": 5
}
```

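
On the backend side, the command and its response could be handled roughly as follows. This sketch covers only serialization and parsing; how the message reaches the sidecar is not described here and is left out. It assumes any `status` other than `"success"` is an error, which this document does not confirm. `Segment`, `build_segment_request`, and `parse_segment_response` are hypothetical names.

```python
import json
from dataclasses import dataclass


@dataclass(frozen=True)
class Segment:
    index: int
    path: str
    start: float  # seconds from the start of the source audio
    end: float


def build_segment_request(file_path, max_chunk_seconds=300, min_silence_ms=500):
    """Serialize a segment_audio command with the documented fields."""
    return json.dumps({
        "action": "segment_audio",
        "file_path": file_path,
        "max_chunk_seconds": max_chunk_seconds,
        "min_silence_ms": min_silence_ms,
    })


def parse_segment_response(raw):
    """Parse the sidecar response into Segment objects ordered by index."""
    payload = json.loads(raw)
    if payload.get("status") != "success":  # assumption: non-success means failure
        raise ValueError(f"segment_audio failed: {payload}")
    segments = [
        Segment(int(s["index"]), s["path"], float(s["start"]), float(s["end"]))
        for s in payload["segments"]
    ]
    segments.sort(key=lambda s: s.index)
    if len(segments) != payload.get("total_segments", len(segments)):
        raise ValueError("segment count does not match total_segments")
    return segments
```
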
### Dify STT API Integration

```
POST https://dify.theaken.com/v1/audio-to-text
Authorization: Bearer {DIFY_STT_API_KEY}
Content-Type: multipart/form-data

Request:
- file: Audio chunk (<25MB)
- user: User identifier

Response:
{
  "text": "transcribed content for this chunk..."
}
```

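
As an illustration of the request shape above, the following sketch builds the multipart request with only the Python standard library. It assumes chunks are WAV files (hence the `audio/wav` part type) and does not send anything; a real backend would more likely use an HTTP client library. `build_stt_request` is a hypothetical helper.

```python
import uuid
from urllib.request import Request

DIFY_STT_URL = "https://dify.theaken.com/v1/audio-to-text"


def build_stt_request(chunk_bytes, filename, user, api_key, url=DIFY_STT_URL):
    """Build (but do not send) a multipart/form-data request for one chunk."""
    boundary = uuid.uuid4().hex
    user_part = (
        f"--{boundary}\r\n"
        'Content-Disposition: form-data; name="user"\r\n\r\n'
        f"{user}\r\n"
    ).encode("utf-8")
    file_header = (
        f"--{boundary}\r\n"
        f'Content-Disposition: form-data; name="file"; filename="{filename}"\r\n'
        "Content-Type: audio/wav\r\n\r\n"  # assumption: chunks are exported as WAV
    ).encode("utf-8")
    closing = f"\r\n--{boundary}--\r\n".encode("utf-8")
    return Request(
        url,
        data=user_part + file_header + chunk_bytes + closing,
        method="POST",
        headers={
            "Authorization": f"Bearer {api_key}",
            "Content-Type": f"multipart/form-data; boundary={boundary}",
        },
    )
```

The request could then be sent with `urllib.request.urlopen(request, timeout=...)` and the `text` field read from the JSON response body.
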
## Data Flow

```
┌─────────────┐     ┌─────────────┐     ┌─────────────┐     ┌─────────────┐
│  Electron   │     │   FastAPI   │     │   Sidecar   │     │  Dify STT   │
│   Client    │     │   Backend   │     │    (VAD)    │     │   Service   │
└──────┬──────┘     └──────┬──────┘     └──────┬──────┘     └──────┬──────┘
       │                   │                   │                   │
       │  Upload audio     │                   │                   │
       │──────────────────>│                   │                   │
       │                   │                   │                   │
       │                   │  segment_audio    │                   │
       │                   │──────────────────>│                   │
       │                   │                   │                   │
       │                   │  segments[]       │                   │
       │                   │<──────────────────│                   │
       │                   │                   │                   │
       │                   │  For each chunk:  │                   │
       │  Progress: 1/5    │──────────────────────────────────────>│
       │<──────────────────│                   │                   │
       │                   │                   │   transcription   │
       │                   │<──────────────────────────────────────│
       │                   │                   │                   │
       │  Progress: 2/5    │──────────────────────────────────────>│
       │<──────────────────│                   │                   │
       │                   │<──────────────────────────────────────│
       │                   │                   │                   │
       │                   │  ... repeat ...   │                   │
       │                   │                   │                   │
       │  Final transcript │                   │                   │
       │<──────────────────│                   │                   │
       │  (concatenated)   │                   │                   │
```

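
The backend's role in this flow can be sketched as a small loop. The sidecar call and the Dify call are passed in as functions so the example stays self-contained; their names and signatures are assumptions, as is joining chunk texts with a single space and reporting progress as each chunk is dispatched.

```python
def transcribe_upload(audio_path, segment_audio, transcribe_chunk, on_progress=None):
    """Segment the upload, transcribe chunks sequentially, and join the text.

    segment_audio(path) -> list of dicts with "index" and "path" (hypothetical)
    transcribe_chunk(chunk_path) -> transcribed text for one chunk (hypothetical)
    on_progress(position, total) -> called as each chunk is dispatched
    """
    segments = sorted(segment_audio(audio_path), key=lambda s: s["index"])
    total = len(segments)
    texts = []
    for position, segment in enumerate(segments, start=1):
        if on_progress is not None:
            on_progress(position, total)
        text = transcribe_chunk(segment["path"]).strip()
        if text:  # skip chunks that produced no text
            texts.append(text)
    return {"transcript": " ".join(texts), "chunks_processed": total}


if __name__ == "__main__":
    fake_segments = [{"index": 0, "path": "c0.wav"}, {"index": 1, "path": "c1.wav"}]
    result = transcribe_upload(
        "meeting.mp3",
        segment_audio=lambda path: fake_segments,
        transcribe_chunk=lambda chunk: f"text from {chunk}",
        on_progress=lambda done, total: print(f"Progress: {done}/{total}"),
    )
    print(result["transcript"])
```
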
## Chunking Algorithm

```python
def segment_audio_with_vad(audio_path, max_chunk_seconds=300, min_silence_ms=500):
    """
    Segment audio file using VAD for natural speech boundaries.

    1. Load audio file
    2. Run VAD to detect speech/silence regions
    3. Find silence gaps >= min_silence_ms
    4. Split at silence gaps, keeping chunks <= max_chunk_seconds
    5. If no silence found within max_chunk_seconds, force split at max
    6. Export each chunk as WAV file
    7. Return list of chunk file paths with timestamps
    """
```

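
Steps 3 to 5 can be illustrated independently of audio loading and the VAD itself. The sketch below assumes the VAD output is already available as a sorted list of `(start, end)` speech regions in seconds, cuts at the midpoint of qualifying silences, and greedily picks the latest cut that keeps a chunk within the maximum. `plan_chunks` is a hypothetical helper, and the greedy strategy is one possible reading of the steps, not necessarily what the sidecar does.

```python
def plan_chunks(speech_regions, max_chunk_seconds=300.0, min_silence_ms=500):
    """Return (start, end) chunk boundaries in seconds from VAD speech regions."""
    if not speech_regions:
        return []
    min_gap = min_silence_ms / 1000.0

    # Step 3: candidate cut points are midpoints of sufficiently long silences.
    cuts = [
        (prev_end + next_start) / 2.0
        for (_, prev_end), (next_start, _) in zip(speech_regions, speech_regions[1:])
        if next_start - prev_end >= min_gap
    ]

    audio_end = speech_regions[-1][1]
    start = speech_regions[0][0]
    chunks = []
    while audio_end - start > max_chunk_seconds:
        limit = start + max_chunk_seconds
        # Step 4: prefer the latest silence cut that keeps the chunk within the limit.
        usable = [c for c in cuts if start < c <= limit]
        # Step 5: with no usable silence, force a split at the maximum length.
        cut = max(usable) if usable else limit
        chunks.append((start, cut))
        start = cut
    chunks.append((start, audio_end))
    return chunks
```

For example, with speech regions `[(0, 100), (101, 250), (251, 400), (400.2, 700)]` and the default parameters, the first chunk ends at the 250.5 s silence, the second is force-split at 550.5 s because the only later pause (0.2 s) is shorter than `min_silence_ms`, and the third runs to the end of the audio.
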
## Risks / Trade-offs

| Risk | Mitigation |
|------|------------|
| Large file causes memory issues | Stream audio processing; limit to 500MB |
| Dify rate limiting | Add retry with exponential backoff |
| Chunk boundary affects context | Overlap chunks by 1-2 seconds |
| Long processing time | Show progress indicator with chunk count |
| Sidecar not available | Return error suggesting real-time recording |

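
The retry mitigation for rate limiting could look roughly like the sketch below. It assumes the caller translates a rate-limit response into a dedicated exception; `RateLimitedError`, `with_backoff`, and the default delays are hypothetical choices, not values taken from Dify.

```python
import time


class RateLimitedError(Exception):
    """Raised by the caller when the STT service signals rate limiting (hypothetical)."""


def with_backoff(call, max_attempts=4, base_delay=1.0, sleep=time.sleep):
    """Run call(), retrying on RateLimitedError with delays of 1x, 2x, 4x, ..."""
    for attempt in range(max_attempts):
        try:
            return call()
        except RateLimitedError:
            if attempt == max_attempts - 1:
                raise  # out of attempts: surface the error to the caller
            sleep(base_delay * (2 ** attempt))
```

Injecting `sleep` keeps the helper testable without real delays; adding random jitter to each delay is a common refinement when several uploads retry at once.
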
## Migration Plan

No migration needed - this is additive functionality.

## Open Questions

- ~~Maximum file size limit?~~ **Resolved**: 500MB with VAD chunking
- Chunk overlap for context continuity?
  - Proposal: 1 second overlap, deduplicate in concatenation
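
One way the proposed deduplication could work is to drop text at the start of a chunk that repeats the end of the previous chunk. The sketch below does this at the character level so it also applies to languages written without spaces. It assumes the overlapping audio is transcribed identically in both chunks, which STT output does not guarantee; `merge_with_overlap` and its thresholds are hypothetical.

```python
def merge_with_overlap(previous, current, min_overlap_chars=4, max_overlap_chars=40):
    """Join two chunk transcripts, removing a duplicated suffix/prefix if found."""
    limit = min(len(previous), len(current), max_overlap_chars)
    # Try the longest candidate overlap first; ignore very short coincidences.
    for size in range(limit, min_overlap_chars - 1, -1):
        if previous.endswith(current[:size]):
            return previous + current[size:]
    if previous and current:
        return previous + " " + current  # no overlap detected: plain concatenation
    return previous + current
```

For instance, merging `"we agreed to ship on Friday"` with `"on Friday and review next week"` yields `"we agreed to ship on Friday and review next week"`. When the two transcriptions of the overlap differ, the function falls back to plain concatenation and the repeated words remain.
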