chore: Archive add-dify-audio-transcription proposal
Move completed Dify audio transcription proposal to archive and update transcription spec with new capabilities.

🤖 Generated with [Claude Code](https://claude.com/claude-code)

Co-Authored-By: Claude Opus 4.5 <noreply@anthropic.com>
@@ -0,0 +1,185 @@
# Design: Dify Audio Transcription

## Context

The Meeting Assistant currently supports real-time transcription via a local Python sidecar running faster-whisper. Users have requested the ability to upload pre-recorded audio files for transcription. To avoid overloading local resources and to leverage cloud capabilities, uploaded files are processed by Dify's speech-to-text service.

## Goals / Non-Goals

**Goals:**

- Allow users to upload audio files for transcription
- Use the Dify STT service for file-based transcription
- Handle large files via VAD-based automatic segmentation
- Maintain a seamless UX, with the transcription appearing in the same location as real-time results
- Support common audio formats (MP3, WAV, M4A, WebM, OGG)

**Non-Goals:**

- Replace real-time local transcription (the sidecar remains for live recording)
- Support video files (audio extraction from video is out of scope)

## Decisions

### Decision 1: Two-Path Transcription Architecture

- **What**: Real-time recording uses the local sidecar; file uploads use Dify cloud
- **Why**: Local processing provides low latency for real-time needs; cloud processing handles large files without impacting local resources
- **Alternatives considered**:
  - All transcription via Dify: rejected due to latency and network dependency for real-time use
  - All transcription local: rejected due to resource constraints for large file processing

### Decision 2: VAD-Based Audio Chunking

- **What**: Use the sidecar's Silero VAD to segment large audio files before sending them to Dify
- **Why**:
  - The Dify STT API has file size limits (typically ~25MB)
  - VAD removes silence, reducing total upload size
  - Natural speech boundaries improve transcription quality
- **Implementation**:
  - The sidecar segments audio into chunks (~2-5 minutes each, based on speech boundaries)
  - Each chunk is sent to Dify sequentially
  - Results are concatenated with proper spacing
- **Alternatives considered**:
  - Fixed-time splitting (e.g., 5-minute chunks): rejected; may cut mid-sentence
  - Client-side splitting: rejected; requires shipping VAD to the client

### Decision 3: Separate API Key for STT Service

- **What**: Use a dedicated Dify API key `app-xQeSipaQecs0cuKeLvYDaRsu` for STT
- **Why**: Allows rate limiting and monitoring independent of the summarization service
- **Configuration**: `DIFY_STT_API_KEY` environment variable

### Decision 4: Backend-Mediated Upload with Sidecar Processing

- **What**: Client → Backend → Sidecar (VAD) → Backend → Dify → Client
- **Why**:
  - Keeps Dify API keys secure on the server
  - Reuses the existing sidecar VAD capability
  - Enables progress tracking for multi-chunk processing
- **Alternatives considered**:
  - Direct client → Dify: rejected due to API key exposure and file size limits

### Decision 5: Append vs Replace Transcript

- **What**: Uploaded file transcription replaces the current transcript content
- **Why**: Users typically upload complete meeting recordings; appending would create confusion
- **UI**: Show a confirmation dialog before replacing existing content

## API Design

### Backend Endpoint

```
POST /api/ai/transcribe-audio
Content-Type: multipart/form-data

Request:
- file: Audio file (max 500MB, will be chunked)

Response (streaming for progress):
{
  "transcript": "Full meeting transcript content...",
  "chunks_processed": 5,
  "total_duration_seconds": 3600,
  "language": "zh"
}
```

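The validation implied by the request contract above (format filter plus the 500MB cap) can be sketched as a small helper. This is a minimal sketch, assuming the backend validates by filename extension; the helper name `validate_upload` and the exact error messages are illustrative, not the actual backend code.

```python
# Illustrative sketch of upload validation for the transcribe-audio
# endpoint. Limits follow the design: five supported formats, 500MB cap.
import os

SUPPORTED_EXTENSIONS = {".mp3", ".wav", ".m4a", ".webm", ".ogg"}
MAX_FILE_BYTES = 500 * 1024 * 1024  # 500MB cap from the design

def validate_upload(filename: str, size_bytes: int) -> tuple[bool, str]:
    """Return (ok, error_message) for an uploaded audio file."""
    ext = os.path.splitext(filename)[1].lower()
    if ext not in SUPPORTED_EXTENSIONS:
        supported = ", ".join(sorted(SUPPORTED_EXTENSIONS))
        return False, f"Unsupported format {ext or '(none)'}; supported: {supported}"
    if size_bytes > MAX_FILE_BYTES:
        return False, "File exceeds the 500MB limit"
    return True, ""
```

A failed check would map to the HTTP 400 response described in the spec delta below.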
### Sidecar VAD Segmentation Command

```json
// Request
{
  "action": "segment_audio",
  "file_path": "/tmp/uploaded_audio.mp3",
  "max_chunk_seconds": 300,
  "min_silence_ms": 500
}

// Response
{
  "status": "success",
  "segments": [
    {"index": 0, "path": "/tmp/chunk_0.wav", "start": 0, "end": 180.5},
    {"index": 1, "path": "/tmp/chunk_1.wav", "start": 180.5, "end": 362.0},
    ...
  ],
  "total_segments": 5
}
```

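As a hedged illustration of this command's handling, the sidecar's dispatch over a JSON line protocol might look like the following. The function `handle_command` and the injected `segment_fn` are hypothetical names; `segment_fn` stands in for the real VAD segmentation routine so the protocol logic stays testable.

```python
import json

def handle_command(line: str, segment_fn) -> str:
    """Parse one JSON command line and return the JSON response line.

    segment_fn(file_path, max_chunk_seconds=..., min_silence_ms=...) is a
    stand-in for the actual Silero-VAD-based segmentation.
    """
    try:
        request = json.loads(line)
    except json.JSONDecodeError:
        return json.dumps({"status": "error", "message": "invalid JSON"})
    if request.get("action") != "segment_audio":
        return json.dumps({"status": "error", "message": "unknown action"})
    segments = segment_fn(
        request["file_path"],
        max_chunk_seconds=request.get("max_chunk_seconds", 300),
        min_silence_ms=request.get("min_silence_ms", 500),
    )
    return json.dumps({
        "status": "success",
        "segments": segments,
        "total_segments": len(segments),
    })
```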
### Dify STT API Integration

```
POST https://dify.theaken.com/v1/audio-to-text
Authorization: Bearer {DIFY_STT_API_KEY}
Content-Type: multipart/form-data

Request:
- file: Audio chunk (<25MB)
- user: User identifier

Response:
{
  "text": "transcribed content for this chunk..."
}
```

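The retry policy this design pairs with each Dify call (up to 3 retries with exponential backoff, per the risks table and spec scenarios) can be sketched independently of HTTP details. `transcribe_with_retry` is an illustrative name; the injected `transcribe` callable stands in for the actual POST to `/v1/audio-to-text`, and `sleep` is injectable for testing.

```python
import time

def transcribe_with_retry(transcribe, chunk_path, retries=3, base_delay=1.0,
                          sleep=time.sleep):
    """Call transcribe(chunk_path), retrying with 1s, 2s, 4s waits on failure."""
    last_error = None
    for attempt in range(retries + 1):
        try:
            return transcribe(chunk_path)
        except Exception as exc:  # in practice: catch network/HTTP errors only
            last_error = exc
            if attempt < retries:
                sleep(base_delay * (2 ** attempt))
    raise last_error
```

After the final retry fails, the backend would surface the HTTP 502 described in the spec delta.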
## Data Flow

```
┌─────────────┐     ┌─────────────┐     ┌─────────────┐     ┌─────────────┐
│  Electron   │     │   FastAPI   │     │   Sidecar   │     │  Dify STT   │
│   Client    │     │   Backend   │     │    (VAD)    │     │   Service   │
└──────┬──────┘     └──────┬──────┘     └──────┬──────┘     └──────┬──────┘
       │                   │                   │                   │
       │  Upload audio     │                   │                   │
       │──────────────────>│                   │                   │
       │                   │                   │                   │
       │                   │  segment_audio    │                   │
       │                   │──────────────────>│                   │
       │                   │                   │                   │
       │                   │  segments[]       │                   │
       │                   │<──────────────────│                   │
       │                   │                   │                   │
       │                   │  For each chunk:  │                   │
       │  Progress: 1/5    │──────────────────────────────────────>│
       │<──────────────────│                   │                   │
       │                   │                   │  transcription    │
       │                   │<──────────────────────────────────────│
       │                   │                   │                   │
       │  Progress: 2/5    │──────────────────────────────────────>│
       │<──────────────────│                   │                   │
       │                   │<──────────────────────────────────────│
       │                   │                   │                   │
       │                   │  ... repeat ...   │                   │
       │                   │                   │                   │
       │  Final transcript │                   │                   │
       │<──────────────────│                   │                   │
       │  (concatenated)   │                   │                   │
```

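The per-chunk loop in the diagram above (send chunks sequentially, stream progress, concatenate results) reduces to a small orchestration function. This is a sketch under stated assumptions: `transcribe_chunk` and `on_progress` are injected stand-ins for the Dify call and the streaming progress response, and joining with a single space is a simplification of the design's "proper spacing" (Chinese output may need different joining).

```python
def transcribe_file(segments, transcribe_chunk, on_progress):
    """segments: ordered list of chunk file paths. Returns the full transcript."""
    parts = []
    total = len(segments)
    for i, path in enumerate(segments, start=1):
        on_progress(i, total)  # streamed to the client as "Processing chunk i of total"
        parts.append(transcribe_chunk(path).strip())
    # Drop empty chunk results (e.g. pure silence) before joining.
    return " ".join(p for p in parts if p)
```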
## Chunking Algorithm

```python
def segment_audio_with_vad(audio_path, max_chunk_seconds=300, min_silence_ms=500):
    """
    Segment audio file using VAD for natural speech boundaries.

    1. Load audio file
    2. Run VAD to detect speech/silence regions
    3. Find silence gaps >= min_silence_ms
    4. Split at silence gaps, keeping chunks <= max_chunk_seconds
    5. If no silence found within max_chunk_seconds, force split at max
    6. Export each chunk as WAV file
    7. Return list of chunk file paths with timestamps
    """
```

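Steps 3-5 of the algorithm can be made concrete as a runnable sketch, assuming VAD output arrives as `(start, end)` speech regions in seconds (the shape Silero-style VADs produce). The VAD itself and the WAV export (steps 1-2, 6-7) are omitted; `plan_chunks` is an illustrative name, not the sidecar's actual code.

```python
def plan_chunks(speech_regions, max_chunk_seconds=300, min_silence_ms=500):
    """Group (start, end) speech regions into chunk boundaries in seconds."""
    min_silence = min_silence_ms / 1000.0
    chunks = []
    cur_start = None
    cur_end = None
    for start, end in speech_regions:
        if cur_start is not None:
            gap = start - cur_end
            # Close the current chunk at a qualifying silence gap when
            # extending it past this region would exceed the cap.
            if gap >= min_silence and (end - cur_start) > max_chunk_seconds:
                chunks.append((cur_start, cur_end))
                cur_start = start
        if cur_start is None:
            cur_start = start
        cur_end = end
        # Force split when continuous speech exceeds the cap (step 5).
        while cur_end - cur_start > max_chunk_seconds:
            chunks.append((cur_start, cur_start + max_chunk_seconds))
            cur_start += max_chunk_seconds
    if cur_start is not None:
        chunks.append((cur_start, cur_end))
    return chunks
```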
## Risks / Trade-offs

| Risk | Mitigation |
|------|------------|
| Large file causes memory issues | Stream audio processing; limit to 500MB |
| Dify rate limiting | Add retry with exponential backoff |
| Chunk boundary affects context | Overlap chunks by 1-2 seconds |
| Long processing time | Show progress indicator with chunk count |
| Sidecar not available | Return error suggesting real-time recording |

## Migration Plan

No migration needed; this is additive functionality.

## Open Questions

- ~~Maximum file size limit?~~ **Resolved**: 500MB with VAD chunking
- Chunk overlap for context continuity?
  - Proposal: 1-second overlap, deduplicated during concatenation

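The proposed deduplication could work at the text level: when chunks overlap by ~1 second, the tail of one chunk's transcript may repeat at the head of the next. The sketch below trims the longest repeated head from each subsequent chunk. This is an illustration of the idea, not a settled design; real matching would likely need normalization (whitespace, punctuation) and a bound derived from the actual overlap duration.

```python
def join_with_dedup(chunks, max_overlap_chars=50):
    """Concatenate chunk transcripts, dropping a repeated head in each chunk."""
    result = ""
    for text in chunks:
        if not result:
            result = text
            continue
        overlap = 0
        limit = min(max_overlap_chars, len(result), len(text))
        # Find the longest prefix of `text` that is a suffix of `result`.
        for k in range(limit, 0, -1):
            if result.endswith(text[:k]):
                overlap = k
                break
        result += text[overlap:]
    return result
```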
@@ -0,0 +1,28 @@
# Change: Add Dify Audio Transcription for Uploaded Files

## Why

Users need to transcribe pre-recorded audio files (e.g., meeting recordings from external sources). Currently, transcription only works with real-time recording via the local sidecar. Adding Dify-based transcription for uploaded files provides flexibility while keeping real-time transcription local for low latency.

## What Changes

- Add audio file upload UI in the Electron client (meeting detail page)
- Add a `segment_audio` command to the sidecar for VAD-based audio chunking
- Add a backend API endpoint to receive audio files, chunk them via the sidecar, and forward chunks to the Dify STT service
- Each chunk (~5 minutes max) is sent to Dify separately and the results are concatenated
- The transcription result replaces the transcript field (same as real-time transcription)
- Support common audio formats: MP3, WAV, M4A, WebM, OGG

## Impact

- Affected specs: `transcription`
- Affected code:
  - `sidecar/transcriber.py` - Add `segment_audio` action for VAD chunking
  - `client/src/pages/meeting-detail.html` - Add upload button and progress UI
  - `backend/app/routers/ai.py` - Add `/api/ai/transcribe-audio` endpoint
  - `backend/app/config.py` - Add Dify STT API key configuration

## Technical Notes

- Dify STT API Key: `app-xQeSipaQecs0cuKeLvYDaRsu`
- Real-time transcription continues to use the local sidecar (no change)
- File upload transcription uses the Dify cloud service with VAD chunking
- VAD chunking ensures each chunk is < 25MB (Dify API limit)
- Max file size: 500MB (chunked processing handles large files)
- Both methods write to the same transcript_blob field

@@ -0,0 +1,88 @@
## ADDED Requirements

### Requirement: Audio File Upload

The Electron client SHALL allow users to upload pre-recorded audio files for transcription.

#### Scenario: Upload audio file

- **WHEN** user clicks "Upload Audio" button in meeting detail page
- **THEN** file picker SHALL open with filter for supported audio formats (MP3, WAV, M4A, WebM, OGG)

#### Scenario: Show upload progress

- **WHEN** audio file is being uploaded
- **THEN** progress indicator SHALL be displayed showing upload percentage

#### Scenario: Show transcription progress

- **WHEN** audio file is being transcribed in chunks
- **THEN** progress indicator SHALL display "Processing chunk X of Y"

#### Scenario: Replace existing transcript

- **WHEN** user uploads audio file and transcript already has content
- **THEN** confirmation dialog SHALL appear before replacing existing transcript

#### Scenario: File size limit

- **WHEN** user selects audio file larger than 500MB
- **THEN** error message SHALL be displayed indicating file size limit

### Requirement: VAD-Based Audio Segmentation

The sidecar SHALL segment large audio files using Voice Activity Detection before cloud transcription.

#### Scenario: Segment audio command

- **WHEN** sidecar receives `{"action": "segment_audio", "file_path": "...", "max_chunk_seconds": 300}`
- **THEN** it SHALL load the audio file and run VAD to detect speech boundaries

#### Scenario: Split at silence boundaries

- **WHEN** VAD detects silence gap >= 500ms within max chunk duration
- **THEN** audio SHALL be split at the silence boundary
- **AND** each chunk exported as WAV file to temp directory

#### Scenario: Force split for continuous speech

- **WHEN** speech continues beyond max_chunk_seconds without silence gap
- **THEN** audio SHALL be force-split at max_chunk_seconds boundary

#### Scenario: Return segment metadata

- **WHEN** segmentation completes
- **THEN** sidecar SHALL return list of segments with file paths and timestamps

### Requirement: Dify Speech-to-Text Integration

The backend SHALL integrate with Dify STT service for audio file transcription.

#### Scenario: Transcribe uploaded audio with chunking

- **WHEN** backend receives POST /api/ai/transcribe-audio with audio file
- **THEN** backend SHALL call sidecar for VAD segmentation
- **AND** send each chunk to Dify STT API sequentially
- **AND** concatenate results into final transcript

#### Scenario: Supported audio formats

- **WHEN** audio file is in MP3, WAV, M4A, WebM, or OGG format
- **THEN** system SHALL accept and process the file

#### Scenario: Unsupported format handling

- **WHEN** audio file format is not supported
- **THEN** backend SHALL return HTTP 400 with error message listing supported formats

#### Scenario: Dify chunk transcription

- **WHEN** backend sends audio chunk to Dify STT API
- **THEN** chunk size SHALL be under 25MB to comply with API limits

#### Scenario: Transcription timeout per chunk

- **WHEN** Dify STT does not respond for a chunk within 2 minutes
- **THEN** backend SHALL retry up to 3 times with exponential backoff

#### Scenario: Dify STT error handling

- **WHEN** Dify STT API returns error after retries
- **THEN** backend SHALL return HTTP 502 with error details

### Requirement: Dual Transcription Mode

The system SHALL support both real-time local transcription and file-based cloud transcription.

#### Scenario: Real-time transcription unchanged

- **WHEN** user records audio in real-time
- **THEN** local sidecar SHALL process audio using faster-whisper (existing behavior)

#### Scenario: File upload uses cloud transcription

- **WHEN** user uploads audio file
- **THEN** Dify cloud service SHALL process audio via chunked upload

#### Scenario: Unified transcript output

- **WHEN** transcription completes from either source
- **THEN** result SHALL be displayed in the same transcript area in meeting detail page

@@ -0,0 +1,47 @@
# Implementation Tasks

## 1. Backend Configuration

- [x] 1.1 Add `DIFY_STT_API_KEY` to `backend/app/config.py`
- [x] 1.2 Add `DIFY_STT_API_KEY` to `backend/.env.example`

## 2. Sidecar VAD Segmentation

- [x] 2.1 Add `segment_audio` action handler in `sidecar/transcriber.py`
- [x] 2.2 Implement VAD-based audio segmentation using Silero VAD
- [x] 2.3 Support max chunk duration (default 5 minutes)
- [x] 2.4 Support minimum silence threshold (default 500ms)
- [x] 2.5 Export chunks as WAV files to temp directory
- [x] 2.6 Return segment metadata (paths, timestamps)

## 3. Backend API Endpoint

- [x] 3.1 Create `POST /api/ai/transcribe-audio` endpoint in `backend/app/routers/ai.py`
- [x] 3.2 Implement multipart file upload handling (max 500MB)
- [x] 3.3 Validate audio file format (MP3, WAV, M4A, WebM, OGG)
- [x] 3.4 Save uploaded file to temp directory
- [x] 3.5 Call sidecar `segment_audio` for VAD chunking
- [x] 3.6 For each chunk: call Dify STT API (`/v1/audio-to-text`)
- [x] 3.7 Implement retry with exponential backoff for Dify errors
- [x] 3.8 Concatenate chunk transcriptions
- [x] 3.9 Clean up temp files after processing
- [x] 3.10 Return final transcript with metadata

## 4. Frontend UI

- [x] 4.1 Add "Upload Audio" button in meeting-detail.html (next to recording controls)
- [x] 4.2 Implement file input with accepted audio formats
- [x] 4.3 Add upload progress indicator (upload phase)
- [x] 4.4 Add transcription progress indicator (chunk X of Y)
- [x] 4.5 Show confirmation dialog if transcript already has content
- [x] 4.6 Display transcription result in transcript area
- [x] 4.7 Handle error states (file too large, unsupported format, API error)

## 5. API Service

- [x] 5.1 Add `transcribeAudio()` function to `client/src/services/api.js`
- [x] 5.2 Implement FormData upload with progress tracking
- [x] 5.3 Handle streaming response for chunk progress

## 6. Testing

- [ ] 6.1 Test sidecar VAD segmentation with various audio lengths
- [ ] 6.2 Test with various audio formats (MP3, WAV, M4A, WebM, OGG)
- [ ] 6.3 Test with large file (>100MB) to verify chunking
- [ ] 6.4 Test error handling (invalid format, Dify timeout, API error)
- [ ] 6.5 Verify transcript displays correctly after upload
- [ ] 6.6 Test chunk concatenation quality (no missing content at boundaries)

@@ -88,3 +88,90 @@ The sidecar SHALL output transcribed text with appropriate Chinese punctuation m
- **WHEN** transcribed text ends with question particles (嗎、呢、什麼、怎麼、為什麼)
- **THEN** the punctuation processor SHALL append question mark (?)

### Requirement: Audio File Upload

The Electron client SHALL allow users to upload pre-recorded audio files for transcription.

#### Scenario: Upload audio file

- **WHEN** user clicks "Upload Audio" button in meeting detail page
- **THEN** file picker SHALL open with filter for supported audio formats (MP3, WAV, M4A, WebM, OGG)

#### Scenario: Show upload progress

- **WHEN** audio file is being uploaded
- **THEN** progress indicator SHALL be displayed showing upload percentage

#### Scenario: Show transcription progress

- **WHEN** audio file is being transcribed in chunks
- **THEN** progress indicator SHALL display "Processing chunk X of Y"

#### Scenario: Replace existing transcript

- **WHEN** user uploads audio file and transcript already has content
- **THEN** confirmation dialog SHALL appear before replacing existing transcript

#### Scenario: File size limit

- **WHEN** user selects audio file larger than 500MB
- **THEN** error message SHALL be displayed indicating file size limit

### Requirement: VAD-Based Audio Segmentation

The sidecar SHALL segment large audio files using Voice Activity Detection before cloud transcription.

#### Scenario: Segment audio command

- **WHEN** sidecar receives `{"action": "segment_audio", "file_path": "...", "max_chunk_seconds": 300}`
- **THEN** it SHALL load the audio file and run VAD to detect speech boundaries

#### Scenario: Split at silence boundaries

- **WHEN** VAD detects silence gap >= 500ms within max chunk duration
- **THEN** audio SHALL be split at the silence boundary
- **AND** each chunk exported as WAV file to temp directory

#### Scenario: Force split for continuous speech

- **WHEN** speech continues beyond max_chunk_seconds without silence gap
- **THEN** audio SHALL be force-split at max_chunk_seconds boundary

#### Scenario: Return segment metadata

- **WHEN** segmentation completes
- **THEN** sidecar SHALL return list of segments with file paths and timestamps

### Requirement: Dify Speech-to-Text Integration

The backend SHALL integrate with Dify STT service for audio file transcription.

#### Scenario: Transcribe uploaded audio with chunking

- **WHEN** backend receives POST /api/ai/transcribe-audio with audio file
- **THEN** backend SHALL call sidecar for VAD segmentation
- **AND** send each chunk to Dify STT API sequentially
- **AND** concatenate results into final transcript

#### Scenario: Supported audio formats

- **WHEN** audio file is in MP3, WAV, M4A, WebM, or OGG format
- **THEN** system SHALL accept and process the file

#### Scenario: Unsupported format handling

- **WHEN** audio file format is not supported
- **THEN** backend SHALL return HTTP 400 with error message listing supported formats

#### Scenario: Dify chunk transcription

- **WHEN** backend sends audio chunk to Dify STT API
- **THEN** chunk size SHALL be under 25MB to comply with API limits

#### Scenario: Transcription timeout per chunk

- **WHEN** Dify STT does not respond for a chunk within 2 minutes
- **THEN** backend SHALL retry up to 3 times with exponential backoff

#### Scenario: Dify STT error handling

- **WHEN** Dify STT API returns error after retries
- **THEN** backend SHALL return HTTP 502 with error details

### Requirement: Dual Transcription Mode

The system SHALL support both real-time local transcription and file-based cloud transcription.

#### Scenario: Real-time transcription unchanged

- **WHEN** user records audio in real-time
- **THEN** local sidecar SHALL process audio using faster-whisper (existing behavior)

#### Scenario: File upload uses cloud transcription

- **WHEN** user uploads audio file
- **THEN** Dify cloud service SHALL process audio via chunked upload

#### Scenario: Unified transcript output

- **WHEN** transcription completes from either source
- **THEN** result SHALL be displayed in the same transcript area in meeting detail page