egg 2e78e3760a chore: Archive add-dify-audio-transcription proposal
Move completed Dify audio transcription proposal to archive and update
transcription spec with new capabilities.

🤖 Generated with [Claude Code](https://claude.com/claude-code)

Co-Authored-By: Claude Opus 4.5 <noreply@anthropic.com>
2025-12-11 21:05:01 +08:00

Implementation Tasks

1. Backend Configuration

  • 1.1 Add DIFY_STT_API_KEY to backend/app/config.py
  • 1.2 Add DIFY_STT_API_KEY to backend/.env.example
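
This is a minimal sketch of 1.1. It assumes config.py reads settings from environment variables; the file's actual pattern (for example pydantic-settings) may differ. `DifySTTConfig` and `load_dify_stt_config` are hypothetical names. The loader fails at startup when the key is missing, so a misconfigured deployment shows up before the first upload rather than partway through a transcription. The matching backend/.env.example entry would be a placeholder line such as `DIFY_STT_API_KEY=`.

```python
import os
from dataclasses import dataclass


@dataclass(frozen=True)
class DifySTTConfig:
    """Hypothetical settings holder; the real config.py may be structured differently."""
    api_key: str


def load_dify_stt_config(env=None) -> DifySTTConfig:
    """Read DIFY_STT_API_KEY from the environment and fail loudly if it is absent."""
    env = os.environ if env is None else env
    api_key = env.get("DIFY_STT_API_KEY", "").strip()
    if not api_key:
        raise RuntimeError("DIFY_STT_API_KEY is not set; see backend/.env.example")
    return DifySTTConfig(api_key=api_key)
```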

2. Sidecar VAD Segmentation

  • 2.1 Add segment_audio action handler in sidecar/transcriber.py
  • 2.2 Implement VAD-based audio segmentation using Silero VAD
  • 2.3 Support max chunk duration (default 5 minutes)
  • 2.4 Support minimum silence threshold (default 500ms)
  • 2.5 Export chunks as WAV files to temp directory
  • 2.6 Return segment metadata (paths, timestamps)
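
Tasks 2.2 to 2.6 can be sketched in two steps. First, VAD speech spans are packed into chunks of at most 5 minutes, with cuts placed only in the silence between spans. Second, the WAV is sliced and each chunk is written to its own file. This sketch rests on three assumptions:

- The speech spans arrive as (start, end) pairs in seconds, converted from Silero VAD output. The silero-vad package's `get_speech_timestamps` returns spans as `start`/`end` dicts.
- Passing `min_silence_duration_ms=500` to that function is presumably how 2.4 would be wired in.
- The upload has already been decoded to a PCM WAV (for example with ffmpeg). The task list does not spell out this step.

`plan_chunks` and `export_chunks` are hypothetical names.

```python
import os
import tempfile
import wave


def plan_chunks(speech, max_chunk_s=300.0):
    """Pack sorted, non-overlapping (start_s, end_s) speech spans into chunks.

    A chunk grows across silence gaps while it stays within max_chunk_s;
    otherwise it is closed at the end of its last speech span, so the cut
    lands in silence. Silence between chunks is dropped. A single speech
    span longer than max_chunk_s is hard-split at fixed intervals.
    """
    chunks = []
    cur_start = cur_end = None
    for start, end in speech:
        if cur_start is not None and end - cur_start <= max_chunk_s:
            cur_end = end  # still fits: extend the current chunk over the gap
            continue
        if cur_start is not None:
            chunks.append((cur_start, cur_end))  # close before the silence gap
        while end - start > max_chunk_s:  # over-long speech: no silence to cut at
            chunks.append((start, start + max_chunk_s))
            start += max_chunk_s
        cur_start, cur_end = start, end
    if cur_start is not None:
        chunks.append((cur_start, cur_end))
    return chunks


def export_chunks(wav_path, chunks, out_dir=None):
    """Write each (start_s, end_s) chunk of a PCM WAV to its own file.

    Returns segment metadata (index, path, timestamps) for the backend.
    """
    out_dir = out_dir or tempfile.mkdtemp(prefix="vad_chunks_")
    segments = []
    with wave.open(wav_path, "rb") as src:
        rate, n_total = src.getframerate(), src.getnframes()
        for i, (start, end) in enumerate(chunks):
            first = min(round(start * rate), n_total)
            last = min(round(end * rate), n_total)
            src.setpos(first)
            frames = src.readframes(last - first)
            path = os.path.join(out_dir, f"chunk_{i:03d}.wav")
            with wave.open(path, "wb") as dst:
                dst.setnchannels(src.getnchannels())
                dst.setsampwidth(src.getsampwidth())
                dst.setframerate(rate)
                dst.writeframes(frames)
            segments.append({"index": i, "path": path, "start": start, "end": end})
    return segments
```

Chunks start and end on speech rather than on a fixed time grid. As a result, silence between chunks is dropped, and only a speech run longer than the limit gets a hard cut. That hard cut is the one place where a word can be split, which makes it the case 6.6 should probe.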

3. Backend API Endpoint

  • 3.1 Create POST /api/ai/transcribe-audio endpoint in backend/app/routers/ai.py
  • 3.2 Implement multipart file upload handling (max 500MB)
  • 3.3 Validate audio file format (MP3, WAV, M4A, WebM, OGG)
  • 3.4 Save uploaded file to temp directory
  • 3.5 Call sidecar segment_audio for VAD chunking
  • 3.6 For each chunk: call Dify STT API (/v1/audio-to-text)
  • 3.7 Implement retry with exponential backoff for Dify errors
  • 3.8 Concatenate chunk transcriptions
  • 3.9 Clean up temp files after processing
  • 3.10 Return final transcript with metadata
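
Steps 3.2, 3.3, 3.7 and 3.8 reduce to a few small functions. The sketch below relies on these assumptions:

- `transcribe_chunk` is a hypothetical callable. It would POST one chunk to Dify's `/v1/audio-to-text` with `DIFY_STT_API_KEY` as a bearer token and return the `text` field. Dify's API reference describes a multipart `file` field plus a `user` identifier; verify this against the deployed version.
- `TransientDifyError` is a hypothetical exception that the caller would raise for timeouts and for HTTP 429 and 5xx responses.
- The 500MB limit is read as 500 MiB.

```python
import os
import random
import time

ALLOWED_EXTENSIONS = {".mp3", ".wav", ".m4a", ".webm", ".ogg"}
MAX_UPLOAD_BYTES = 500 * 1024 * 1024  # "max 500MB", assumed to mean MiB


class TransientDifyError(Exception):
    """Hypothetical marker for retryable failures (timeouts, HTTP 429/5xx)."""


def validate_upload(filename, size_bytes):
    """Return an error message for an unacceptable upload, or None if it is fine."""
    ext = os.path.splitext(filename)[1].lower()
    if ext not in ALLOWED_EXTENSIONS:
        return f"unsupported format: {ext or '(none)'}"
    if size_bytes > MAX_UPLOAD_BYTES:
        return "file too large"
    return None


def with_retry(call, max_attempts=4, base_delay_s=1.0, max_delay_s=30.0, sleep=time.sleep):
    """Run call(), retrying TransientDifyError with exponential backoff and jitter."""
    for attempt in range(max_attempts):
        try:
            return call()
        except TransientDifyError:
            if attempt == max_attempts - 1:
                raise  # out of attempts: let the endpoint report an API error
            delay = min(max_delay_s, base_delay_s * 2 ** attempt)  # 1 s, 2 s, 4 s, ...
            sleep(delay * random.uniform(0.5, 1.0))  # jitter spreads out concurrent retries


def transcribe_chunks(chunk_paths, transcribe_chunk, sep=" "):
    """Transcribe chunks in order and join the non-empty texts."""
    texts = []
    for path in chunk_paths:
        text = with_retry(lambda: transcribe_chunk(path))
        if text.strip():
            texts.append(text.strip())
    return sep.join(texts)
```

Only transient errors are retried. A 400 for a malformed chunk fails the same way on every attempt, so it should propagate immediately. For 3.9, running the whole pipeline inside `tempfile.TemporaryDirectory()` removes the upload and chunk files even when a Dify call ultimately fails. The `sep` parameter matters for languages written without spaces between words, where joining chunks with `""` may read better.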

4. Frontend UI

  • 4.1 Add "Upload Audio" button in meeting-detail.html (next to recording controls)
  • 4.2 Implement file input with accepted audio formats
  • 4.3 Add upload progress indicator (upload phase)
  • 4.4 Add transcription progress indicator (chunk X of Y)
  • 4.5 Show confirmation dialog if transcript already has content
  • 4.6 Display transcription result in transcript area
  • 4.7 Handle error states (file too large, unsupported format, API error)

5. API Service

  • 5.1 Add transcribeAudio() function to client/src/services/api.js
  • 5.2 Implement FormData upload with progress tracking
  • 5.3 Handle streaming response for chunk progress
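
Task 5.3 needs a wire format for chunk progress, which the list leaves open. One option, assumed here, is newline-delimited JSON: the endpoint emits a `progress` event for each finished chunk and a final `done` event that carries the transcript. The event schema and function names are hypothetical. `parse_ndjson` mirrors what the client would do with the response body reader: it buffers text, splits on newlines, and keeps any partial trailing line for the next read, because network reads do not align with event boundaries. Upload progress for 5.2 is a separate matter. It usually comes from `XMLHttpRequest`'s `upload.onprogress`, since `fetch` exposes no upload-progress events.

```python
import json


def progress_events(total_chunks, transcribe_one):
    """Yield one NDJSON line per finished chunk, then a final "done" line.

    Hypothetical schema: {"type": "progress", "chunk": i, "total": n},
    then {"type": "done", "transcript": "..."}.
    """
    texts = []
    for i in range(total_chunks):
        texts.append(transcribe_one(i))
        yield json.dumps({"type": "progress", "chunk": i + 1, "total": total_chunks}) + "\n"
    yield json.dumps({"type": "done", "transcript": " ".join(t for t in texts if t)}) + "\n"


def parse_ndjson(stream_pieces):
    """Reassemble lines across arbitrary read boundaries and decode each event."""
    buffer = ""
    for piece in stream_pieces:
        buffer += piece
        *lines, buffer = buffer.split("\n")  # last element is an incomplete line
        for line in lines:
            if line.strip():
                yield json.loads(line)
    if buffer.strip():
        yield json.loads(buffer)
```

With a FastAPI backend, as the routers/ layout suggests, the generator could be returned as `StreamingResponse(progress_events(...), media_type="application/x-ndjson")`. An `error` event type would let 4.7 report an API failure that occurs after streaming has begun, when the HTTP status can no longer change.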

6. Testing

  • 6.1 Test sidecar VAD segmentation with various audio lengths
  • 6.2 Test with various audio formats (MP3, WAV, M4A, WebM, OGG)
  • 6.3 Test with a large file (>100 MB) to verify chunking
  • 6.4 Test error handling (invalid format, Dify timeout, API error)
  • 6.5 Verify transcript displays correctly after upload
  • 6.6 Test chunk concatenation quality (no missing content at boundaries)
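
Some sizing arithmetic helps with 6.3. Decoded 16 kHz mono 16-bit PCM, the usual VAD input, runs at 16,000 × 2 = 32,000 bytes/s. A WAV over 100 MiB (104,857,600 bytes) therefore needs about 3,277 s, roughly 55 minutes, and the 500 MB cap corresponds to about 4.5 hours of such audio.

Synthetic tones make a poor fixture because Silero VAD is trained on speech and may not flag them. The sketch below instead loops a short real recording up to a target length. It separates the repetitions with 1 s silences, which are longer than the 500 ms threshold. `build_long_fixture` is a hypothetical helper and assumes 16-bit signed PCM input.

```python
import wave


def bytes_per_second(rate=16000, channels=1, sample_width=2):
    """PCM data rate in bytes/s; 16 kHz mono 16-bit gives 32,000."""
    return rate * channels * sample_width


def build_long_fixture(sample_path, out_path, target_s, gap_s=1.0):
    """Loop a short speech WAV, separated by silence, until target_s is reached.

    Returns the written duration in seconds. Zero bytes are silence only for
    signed PCM (16-bit and wider); 8-bit WAV is unsigned and would need 0x80.
    """
    with wave.open(sample_path, "rb") as src:
        params = src.getparams()
        speech = src.readframes(src.getnframes())
    if not speech:
        raise ValueError("sample WAV contains no audio frames")
    frame_bytes = params.nchannels * params.sampwidth
    silence = b"\x00" * (int(gap_s * params.framerate) * frame_bytes)
    target_frames = int(target_s * params.framerate)
    written = 0
    with wave.open(out_path, "wb") as dst:
        dst.setnchannels(params.nchannels)
        dst.setsampwidth(params.sampwidth)
        dst.setframerate(params.framerate)
        while written < target_frames:
            for block in (speech, silence):
                n = min(len(block) // frame_bytes, target_frames - written)
                dst.writeframes(block[: n * frame_bytes])
                written += n
    return written / params.framerate
```

For example, `build_long_fixture("speech_sample.wav", "long.wav", target_s=3600)` (with a hypothetical 16 kHz mono sample) produces a one-hour WAV of roughly 110 MiB. Because every repetition has the same content, each one should produce the same transcript, so words lost or duplicated at chunk boundaries (6.6) stand out.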