chore: Archive add-dify-audio-transcription proposal

Move completed Dify audio transcription proposal to archive and update transcription spec with new capabilities.

🤖 Generated with [Claude Code](https://claude.com/claude-code)

Co-Authored-By: Claude Opus 4.5 <noreply@anthropic.com>
2025-12-11 21:05:01 +08:00
parent 263eb1c394
commit 2e78e3760a
5 changed files with 435 additions and 0 deletions
--- a/openspec/changes/archive/2025-12-11-add-dify-audio-transcription/design.md
+++ b/openspec/changes/archive/2025-12-11-add-dify-audio-transcription/design.md
@@ -0,0 +1,185 @@
+# Design: Dify Audio Transcription
+
+## Context
+The Meeting Assistant currently supports real-time transcription via a local Python sidecar using faster-whisper. Users have requested the ability to upload pre-recorded audio files for transcription. To avoid overloading local resources and leverage cloud capabilities, uploaded files will be processed by Dify's speech-to-text service.
+
+## Goals / Non-Goals
+
+**Goals:**
+- Allow users to upload audio files for transcription
+- Use Dify STT service for file-based transcription
+- Handle large files via VAD-based automatic segmentation
+- Maintain seamless UX with transcription appearing in the same location
+- Support common audio formats (MP3, WAV, M4A, WebM, OGG)
+
+**Non-Goals:**
+- Replace real-time local transcription (sidecar remains for live recording)
+- Support video files (this would require audio extraction)
+
+## Decisions
+
+### Decision 1: Two-Path Transcription Architecture
+- **What**: Real-time recording uses local sidecar; file uploads use Dify cloud
+- **Why**: Local processing provides low latency for real-time needs; cloud processing handles larger files without impacting local resources
+- **Alternatives considered**:
+  - All transcription via Dify: Rejected due to latency and network dependency for real-time use
+  - All transcription local: Rejected due to resource constraints for large file processing
+
+### Decision 2: VAD-Based Audio Chunking
+- **What**: Use sidecar's Silero VAD to segment large audio files before sending to Dify
+- **Why**:
+  - Dify STT API has file size limits (typically ~25MB)
+  - VAD removes silence, reducing total upload size
+  - Natural speech boundaries improve transcription quality
+- **Implementation**:
+  - Sidecar segments audio into chunks (~2-5 minutes each based on speech boundaries)
+  - Each chunk sent to Dify sequentially
+  - Results concatenated with proper spacing
+- **Alternatives considered**:
+  - Fixed-time splitting (e.g., 5min chunks): Rejected - may cut mid-sentence
+  - Client-side splitting: Rejected - requires shipping VAD to client
+
+### Decision 3: Separate API Key for STT Service
+- **What**: Use dedicated Dify API key `app-xQeSipaQecs0cuKeLvYDaRsu` for STT
+- **Why**: Allows independent rate limiting and monitoring from summarization service
+- **Configuration**: `DIFY_STT_API_KEY` environment variable
+
+### Decision 4: Backend-Mediated Upload with Sidecar Processing
+- **What**: Client → Backend → Sidecar (VAD) → Backend → Dify → Backend → Client
+- **Why**:
+  - Keeps Dify API keys secure on server
+  - Reuses existing sidecar VAD capability
+  - Enables progress tracking for multi-chunk processing
+- **Alternatives considered**:
+  - Direct client → Dify: Rejected due to API key exposure and file size limits
+
+### Decision 5: Append vs Replace Transcript
+- **What**: Uploaded file transcription replaces current transcript content
+- **Why**: Users typically upload complete meeting recordings; appending would create confusion
+- **UI**: Show confirmation dialog before replacing existing content
+
+## API Design
+
+### Backend Endpoint
+```
+POST /api/ai/transcribe-audio
+Content-Type: multipart/form-data
+
+Request:
+- file: Audio file (max 500MB, will be chunked)
+
+Response (final payload; progress updates are streamed while chunks are processed):
+{
+  "transcript": "Full meeting transcript content...",
+  "chunks_processed": 5,
+  "total_duration_seconds": 3600,
+  "language": "zh"
+}
+```
+
+### Sidecar VAD Segmentation Command
+```json
+// Request
+{
+  "action": "segment_audio",
+  "file_path": "/tmp/uploaded_audio.mp3",
+  "max_chunk_seconds": 300,
+  "min_silence_ms": 500
+}
+
+// Response
+{
+  "status": "success",
+  "segments": [
+    {"index": 0, "path": "/tmp/chunk_0.wav", "start": 0, "end": 180.5},
+    {"index": 1, "path": "/tmp/chunk_1.wav", "start": 180.5, "end": 362.0},
+    ...
+  ],
+  "total_segments": 5
+}
+```
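
A minimal sketch of how the backend might build the `segment_audio` request and validate the sidecar's reply, assuming both travel as plain JSON strings. The `Segment` dataclass and the helper names `build_segment_request` and `parse_segment_response` are hypothetical, and the shape of an error reply (anything other than `"status": "success"`) is an assumption, since the design only shows the success case.

```python
import json
from dataclasses import dataclass
from typing import List


@dataclass
class Segment:
    """One VAD-derived chunk, mirroring the fields in the response above."""
    index: int
    path: str
    start: float
    end: float


def build_segment_request(file_path: str,
                          max_chunk_seconds: int = 300,
                          min_silence_ms: int = 500) -> str:
    """Serialize a segment_audio command with the documented field names."""
    return json.dumps({
        "action": "segment_audio",
        "file_path": file_path,
        "max_chunk_seconds": max_chunk_seconds,
        "min_silence_ms": min_silence_ms,
    })


def parse_segment_response(raw: str) -> List[Segment]:
    """Parse the sidecar reply and return segments ordered by index."""
    data = json.loads(raw)
    if data.get("status") != "success":
        # Assumption: any other status is treated as a failure.
        raise RuntimeError(f"segment_audio failed: {data!r}")
    segments = sorted((Segment(**s) for s in data["segments"]),
                      key=lambda s: s.index)
    if len(segments) != data.get("total_segments", len(segments)):
        raise ValueError("total_segments does not match the segment list")
    return segments
```

Sorting by `index` before upload keeps the final concatenation in the original speech order even if the sidecar returns segments out of order.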
+
+### Dify STT API Integration
+```
+POST https://dify.theaken.com/v1/audio-to-text
+Authorization: Bearer {DIFY_STT_API_KEY}
+Content-Type: multipart/form-data
+
+Request:
+- file: Audio chunk (<25MB)
+- user: User identifier
+
+Response:
+{
+  "text": "transcribed content for this chunk..."
+}
+```
+
+## Data Flow
+
+```
+┌─────────────┐     ┌─────────────┐     ┌─────────────┐     ┌─────────────┐
+│  Electron   │     │  FastAPI    │     │  Sidecar    │     │  Dify STT   │
+│  Client     │     │  Backend    │     │  (VAD)      │     │  Service    │
+└──────┬──────┘     └──────┬──────┘     └──────┬──────┘     └──────┬──────┘
+       │                   │                   │                   │
+       │ Upload audio      │                   │                   │
+       │──────────────────>│                   │                   │
+       │                   │                   │                   │
+       │                   │ segment_audio     │                   │
+       │                   │──────────────────>│                   │
+       │                   │                   │                   │
+       │                   │ segments[]        │                   │
+       │                   │<──────────────────│                   │
+       │                   │                   │                   │
+       │                   │ For each chunk:   │                   │
+       │   Progress: 1/5   │──────────────────────────────────────>│
+       │<──────────────────│                   │                   │
+       │                   │                   │    transcription  │
+       │                   │<──────────────────────────────────────│
+       │                   │                   │                   │
+       │   Progress: 2/5   │──────────────────────────────────────>│
+       │<──────────────────│                   │                   │
+       │                   │<──────────────────────────────────────│
+       │                   │                   │                   │
+       │                   │  ... repeat ...   │                   │
+       │                   │                   │                   │
+       │  Final transcript │                   │                   │
+       │<──────────────────│                   │                   │
+       │  (concatenated)   │                   │                   │
+```
+
+## Chunking Algorithm
+
+```python
+def segment_audio_with_vad(audio_path, max_chunk_seconds=300, min_silence_ms=500):
+    """
+    Segment audio file using VAD for natural speech boundaries.
+
+    1. Load audio file
+    2. Run VAD to detect speech/silence regions
+    3. Find silence gaps >= min_silence_ms
+    4. Split at silence gaps, keeping chunks <= max_chunk_seconds
+    5. If no silence found within max_chunk_seconds, force split at max
+    6. Export each chunk as WAV file
+    7. Return list of chunk file paths with timestamps
+    """
+```
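
Steps 3 to 5 of the docstring can be illustrated without any audio library. The sketch below assumes VAD has already produced a sorted list of `(start, end)` speech regions in seconds, which is roughly what Silero VAD timestamps reduce to. The function name `plan_chunks` is hypothetical, and the sketch only plans chunk boundaries; loading audio and exporting WAV files are left out.

```python
def plan_chunks(speech_regions, max_chunk_seconds=300.0, min_silence_ms=500):
    """Plan (start, end) chunk boundaries from sorted speech regions.

    A chunk is closed at a silence gap >= min_silence_ms when adding the
    next speech region would push it past max_chunk_seconds. If a chunk
    still exceeds the limit, it is force-split at exactly the limit.
    """
    min_gap = min_silence_ms / 1000.0
    chunks = []
    chunk_start = None
    chunk_end = None
    for start, end in speech_regions:
        if chunk_start is None:
            chunk_start = start
        else:
            gap = start - chunk_end
            too_long = end - chunk_start > max_chunk_seconds
            if gap >= min_gap and too_long:
                # Natural boundary: close the chunk at the end of speech.
                chunks.append((chunk_start, chunk_end))
                chunk_start = start
        # Forced boundary: no usable silence within the limit.
        while end - chunk_start > max_chunk_seconds:
            cut = chunk_start + max_chunk_seconds
            chunks.append((chunk_start, cut))
            chunk_start = cut
        chunk_end = end
    if chunk_start is not None and chunk_end > chunk_start:
        chunks.append((chunk_start, chunk_end))
    return chunks


if __name__ == "__main__":
    regions = [(0, 100), (101, 250), (252, 400), (400.2, 500)]
    print(plan_chunks(regions))
```

In this sketch, silence is dropped only between chunks; shorter pauses inside a chunk are kept, so the upload-size saving is smaller than if every silent span were removed.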
+
+## Risks / Trade-offs
+
+| Risk | Mitigation |
+|------|------------|
+| Large file causes memory issues | Stream audio processing; limit to 500MB |
+| Dify rate limiting | Add retry with exponential backoff |
+| Chunk boundary affects context | Overlap chunks by 1-2 seconds (exact overlap is still open; see Open Questions) |
+| Long processing time | Show progress indicator with chunk count |
+| Sidecar not available | Return error suggesting real-time recording |
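
The "retry with exponential backoff" mitigation could look roughly like the sketch below. The helper name `call_with_backoff`, the default delays, and the 10% jitter are assumptions for illustration; the design does not specify them. The `sleep` parameter is injectable so the helper can be exercised without real waiting.

```python
import random
import time


def call_with_backoff(func, max_attempts=5, base_delay=1.0, max_delay=30.0,
                      retry_on=(Exception,), sleep=time.sleep):
    """Call func(), retrying on failure with exponentially growing delays.

    The delay doubles after every failed attempt (1s, 2s, 4s, ...), is
    capped at max_delay, and gets up to 10% random jitter so parallel
    uploads do not retry in lockstep. The last failure is re-raised.
    """
    for attempt in range(max_attempts):
        try:
            return func()
        except retry_on:
            if attempt == max_attempts - 1:
                raise
            delay = min(max_delay, base_delay * (2 ** attempt))
            sleep(delay + random.uniform(0, delay * 0.1))
```

In practice `retry_on` would be narrowed to the errors that signal rate limiting or transient network failure, so that permanent errors such as an invalid API key fail immediately.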
+
+## Migration Plan
+No migration needed - this is additive functionality.
+
+## Open Questions
+- ~~Maximum file size limit?~~ **Resolved**: 500MB with VAD chunking
+- Chunk overlap for context continuity?
+  - Proposal: 1 second overlap, deduplicate in concatenation
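
One way the proposed deduplication could work is to drop the longest run of characters that ends one chunk's text and begins the next. This is only a sketch: the function name `merge_transcripts` and its thresholds are hypothetical, and it assumes the STT service returns identical text for the overlapped audio, which will not always hold. Matching on characters rather than words keeps it usable for languages written without spaces.

```python
def merge_transcripts(chunks, max_overlap_chars=80, min_overlap_chars=4,
                      separator=" "):
    """Concatenate chunk transcripts, removing text duplicated by overlap.

    For each new chunk, find the longest prefix (between min_overlap_chars
    and max_overlap_chars long) that the merged text already ends with,
    and skip it. If nothing matches, join with the separator instead.
    """
    merged = ""
    for text in chunks:
        text = text.strip()
        if not text:
            continue
        if not merged:
            merged = text
            continue
        limit = min(len(merged), len(text), max_overlap_chars)
        overlap = 0
        for n in range(limit, min_overlap_chars - 1, -1):
            if merged.endswith(text[:n]):
                overlap = n
                break
        if overlap:
            merged += text[overlap:]
        else:
            merged += separator + text
    return merged
```

The minimum overlap length guards against accidental matches on one or two characters; a fuzzy comparison would be needed if the overlapped text differs slightly between chunks.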