diff --git a/openspec/changes/archive/2025-12-11-add-dify-audio-transcription/design.md b/openspec/changes/archive/2025-12-11-add-dify-audio-transcription/design.md new file mode 100644 index 0000000..e9a49eb --- /dev/null +++ b/openspec/changes/archive/2025-12-11-add-dify-audio-transcription/design.md @@ -0,0 +1,185 @@ +# Design: Dify Audio Transcription + +## Context +The Meeting Assistant currently supports real-time transcription via a local Python sidecar using faster-whisper. Users have requested the ability to upload pre-recorded audio files for transcription. To avoid overloading local resources and leverage cloud capabilities, uploaded files will be processed by Dify's speech-to-text service. + +## Goals / Non-Goals + +**Goals:** +- Allow users to upload audio files for transcription +- Use Dify STT service for file-based transcription +- Handle large files via VAD-based automatic segmentation +- Maintain seamless UX with transcription appearing in the same location +- Support common audio formats (MP3, WAV, M4A, WebM, OGG) + +**Non-Goals:** +- Replace real-time local transcription (sidecar remains for live recording) +- Support video files (audio extraction) + +## Decisions + +### Decision 1: Two-Path Transcription Architecture +- **What**: Real-time recording uses local sidecar; file uploads use Dify cloud +- **Why**: Local processing provides low latency for real-time needs; cloud processing handles larger files without impacting local resources +- **Alternatives considered**: + - All transcription via Dify: Rejected due to latency and network dependency for real-time use + - All transcription local: Rejected due to resource constraints for large file processing + +### Decision 2: VAD-Based Audio Chunking +- **What**: Use sidecar's Silero VAD to segment large audio files before sending to Dify +- **Why**: + - Dify STT API has file size limits (typically ~25MB) + - VAD removes silence, reducing total upload size + - Natural speech boundaries improve 
transcription quality +- **Implementation**: + - Sidecar segments audio into chunks (~2-5 minutes each based on speech boundaries) + - Each chunk sent to Dify sequentially + - Results concatenated with proper spacing +- **Alternatives considered**: + - Fixed-time splitting (e.g., 5min chunks): Rejected - may cut mid-sentence + - Client-side splitting: Rejected - requires shipping VAD to client + +### Decision 3: Separate API Key for STT Service +- **What**: Use dedicated Dify API key `app-xQeSipaQecs0cuKeLvYDaRsu` for STT +- **Why**: Allows independent rate limiting and monitoring from summarization service +- **Configuration**: `DIFY_STT_API_KEY` environment variable + +### Decision 4: Backend-Mediated Upload with Sidecar Processing +- **What**: Client → Backend → Sidecar (VAD) → Backend → Dify → Client +- **Why**: + - Keeps Dify API keys secure on server + - Reuses existing sidecar VAD capability + - Enables progress tracking for multi-chunk processing +- **Alternatives considered**: + - Direct client → Dify: Rejected due to API key exposure and file size limits + +### Decision 5: Append vs Replace Transcript +- **What**: Uploaded file transcription replaces current transcript content +- **Why**: Users typically upload complete meeting recordings; appending would create confusion +- **UI**: Show confirmation dialog before replacing existing content + +## API Design + +### Backend Endpoint +``` +POST /api/ai/transcribe-audio +Content-Type: multipart/form-data + +Request: +- file: Audio file (max 500MB, will be chunked) + +Response (streaming for progress): +{ + "transcript": "完整的會議逐字稿內容...", + "chunks_processed": 5, + "total_duration_seconds": 3600, + "language": "zh" +} +``` + +### Sidecar VAD Segmentation Command +```json +// Request +{ + "action": "segment_audio", + "file_path": "/tmp/uploaded_audio.mp3", + "max_chunk_seconds": 300, + "min_silence_ms": 500 +} + +// Response +{ + "status": "success", + "segments": [ + {"index": 0, "path": "/tmp/chunk_0.wav", 
"start": 0, "end": 180.5}, + {"index": 1, "path": "/tmp/chunk_1.wav", "start": 180.5, "end": 362.0}, + ... + ], + "total_segments": 5 +} +``` + +### Dify STT API Integration +``` +POST https://dify.theaken.com/v1/audio-to-text +Authorization: Bearer {DIFY_STT_API_KEY} +Content-Type: multipart/form-data + +Request: +- file: Audio chunk (<25MB) +- user: User identifier + +Response: +{ + "text": "transcribed content for this chunk..." +} +``` + +## Data Flow + +``` +┌─────────────┐ ┌─────────────┐ ┌─────────────┐ ┌─────────────┐ +│ Electron │ │ FastAPI │ │ Sidecar │ │ Dify STT │ +│ Client │ │ Backend │ │ (VAD) │ │ Service │ +└──────┬──────┘ └──────┬──────┘ └──────┬──────┘ └──────┬──────┘ + │ │ │ │ + │ Upload audio │ │ │ + │──────────────────>│ │ │ + │ │ │ │ + │ │ segment_audio │ │ + │ │──────────────────>│ │ + │ │ │ │ + │ │ segments[] │ │ + │ │<──────────────────│ │ + │ │ │ │ + │ │ For each chunk: │ │ + │ Progress: 1/5 │──────────────────────────────────────>│ + │<──────────────────│ │ │ + │ │ │ transcription │ + │ │<──────────────────────────────────────│ + │ │ │ │ + │ Progress: 2/5 │──────────────────────────────────────>│ + │<──────────────────│ │ │ + │ │<──────────────────────────────────────│ + │ │ │ │ + │ │ ... repeat ... │ │ + │ │ │ │ + │ Final transcript │ │ │ + │<──────────────────│ │ │ + │ (concatenated) │ │ │ +``` + +## Chunking Algorithm + +```python +def segment_audio_with_vad(audio_path, max_chunk_seconds=300, min_silence_ms=500): + """ + Segment audio file using VAD for natural speech boundaries. + + 1. Load audio file + 2. Run VAD to detect speech/silence regions + 3. Find silence gaps >= min_silence_ms + 4. Split at silence gaps, keeping chunks <= max_chunk_seconds + 5. If no silence found within max_chunk_seconds, force split at max + 6. Export each chunk as WAV file + 7. 
Return list of chunk file paths with timestamps + """ +``` + +## Risks / Trade-offs + +| Risk | Mitigation | +|------|------------| +| Large file causes memory issues | Stream audio processing; limit to 500MB | +| Dify rate limiting | Add retry with exponential backoff | +| Chunk boundary affects context | Overlap chunks by 1-2 seconds | +| Long processing time | Show progress indicator with chunk count | +| Sidecar not available | Return error suggesting real-time recording | + +## Migration Plan +No migration needed - this is additive functionality. + +## Open Questions +- ~~Maximum file size limit?~~ **Resolved**: 500MB with VAD chunking +- Chunk overlap for context continuity? + - Proposal: 1 second overlap, deduplicate in concatenation diff --git a/openspec/changes/archive/2025-12-11-add-dify-audio-transcription/proposal.md b/openspec/changes/archive/2025-12-11-add-dify-audio-transcription/proposal.md new file mode 100644 index 0000000..f4e8a29 --- /dev/null +++ b/openspec/changes/archive/2025-12-11-add-dify-audio-transcription/proposal.md @@ -0,0 +1,28 @@ +# Change: Add Dify Audio Transcription for Uploaded Files + +## Why +Users need to transcribe pre-recorded audio files (e.g., meeting recordings from external sources). Currently, transcription only works with real-time recording via the local sidecar. Adding Dify-based transcription for uploaded files provides flexibility while keeping real-time transcription local for low latency. 
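
The VAD-based chunk planning this change depends on can be sketched as pure logic over the `(start, end)` speech timestamps a VAD produces. This is an illustrative sketch only — `plan_chunks`, its signature, and the two-pass merge/pack approach are assumptions for clarity, not the sidecar's actual implementation:

```python
def plan_chunks(speech, max_chunk_seconds=300.0, min_silence_ms=500):
    """Group VAD speech regions [(start, end), in seconds] into upload chunks.

    Hypothetical helper: splits preferentially at silence gaps >= min_silence_ms,
    keeps each chunk <= max_chunk_seconds, and force-splits continuous speech
    at the cap, mirroring the algorithm sketched in design.md.
    """
    min_gap = min_silence_ms / 1000.0

    # 1) Merge regions separated by silences shorter than min_gap, so every
    #    remaining boundary is a usable split point.
    merged = []
    for start, end in speech:
        if merged and start - merged[-1][1] < min_gap:
            merged[-1][1] = end
        else:
            merged.append([start, end])

    # 2) Greedily pack merged regions into chunks <= max_chunk_seconds,
    #    force-splitting any single region longer than the cap.
    chunks, cur = [], None  # cur = [start, end] of the chunk being built
    for start, end in merged:
        while end - start > max_chunk_seconds:  # continuous speech: force split
            if cur is not None:
                chunks.append(tuple(cur))
                cur = None
            chunks.append((start, start + max_chunk_seconds))
            start += max_chunk_seconds
        if cur is None:
            cur = [start, end]
        elif end - cur[0] > max_chunk_seconds:
            chunks.append(tuple(cur))  # split at a qualifying silence gap
            cur = [start, end]
        else:
            cur[1] = end  # short gap: keep accumulating into the same chunk
    if cur is not None:
        chunks.append(tuple(cur))
    return chunks
```

Keeping this planning step free of I/O would let the sidecar unit-test the boundary rules (silence split, force split) independently of audio decoding and WAV export.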
+ +## What Changes +- Add audio file upload UI in Electron client (meeting detail page) +- Add `segment_audio` command to sidecar for VAD-based audio chunking +- Add backend API endpoint to receive audio files, chunk via sidecar, and forward to Dify STT service +- Each chunk (~5 minutes max) sent to Dify separately, results concatenated +- Transcription result replaces the transcript field (same as real-time transcription) +- Support common audio formats: MP3, WAV, M4A, WebM, OGG + +## Impact +- Affected specs: `transcription` +- Affected code: + - `sidecar/transcriber.py` - Add `segment_audio` action for VAD chunking + - `client/src/pages/meeting-detail.html` - Add upload button and progress UI + - `backend/app/routers/ai.py` - Add `/api/ai/transcribe-audio` endpoint + - `backend/app/config.py` - Add Dify STT API key configuration + +## Technical Notes +- Dify STT API Key: `app-xQeSipaQecs0cuKeLvYDaRsu` +- Real-time transcription continues to use local sidecar (no change) +- File upload transcription uses Dify cloud service with VAD chunking +- VAD chunking ensures each chunk < 25MB (Dify API limit) +- Max file size: 500MB (chunked processing handles large files) +- Both methods output to the same transcript_blob field diff --git a/openspec/changes/archive/2025-12-11-add-dify-audio-transcription/specs/transcription/spec.md b/openspec/changes/archive/2025-12-11-add-dify-audio-transcription/specs/transcription/spec.md new file mode 100644 index 0000000..cc0acb0 --- /dev/null +++ b/openspec/changes/archive/2025-12-11-add-dify-audio-transcription/specs/transcription/spec.md @@ -0,0 +1,88 @@ +## ADDED Requirements + +### Requirement: Audio File Upload +The Electron client SHALL allow users to upload pre-recorded audio files for transcription. 
+ +#### Scenario: Upload audio file +- **WHEN** user clicks "Upload Audio" button in meeting detail page +- **THEN** file picker SHALL open with filter for supported audio formats (MP3, WAV, M4A, WebM, OGG) + +#### Scenario: Show upload progress +- **WHEN** audio file is being uploaded +- **THEN** progress indicator SHALL be displayed showing upload percentage + +#### Scenario: Show transcription progress +- **WHEN** audio file is being transcribed in chunks +- **THEN** progress indicator SHALL display "Processing chunk X of Y" + +#### Scenario: Replace existing transcript +- **WHEN** user uploads audio file and transcript already has content +- **THEN** confirmation dialog SHALL appear before replacing existing transcript + +#### Scenario: File size limit +- **WHEN** user selects audio file larger than 500MB +- **THEN** error message SHALL be displayed indicating file size limit + +### Requirement: VAD-Based Audio Segmentation +The sidecar SHALL segment large audio files using Voice Activity Detection before cloud transcription. + +#### Scenario: Segment audio command +- **WHEN** sidecar receives `{"action": "segment_audio", "file_path": "...", "max_chunk_seconds": 300}` +- **THEN** it SHALL load audio file and run VAD to detect speech boundaries + +#### Scenario: Split at silence boundaries +- **WHEN** VAD detects silence gap >= 500ms within max chunk duration +- **THEN** audio SHALL be split at the silence boundary +- **AND** each chunk exported as WAV file to temp directory + +#### Scenario: Force split for continuous speech +- **WHEN** speech continues beyond max_chunk_seconds without silence gap +- **THEN** audio SHALL be force-split at max_chunk_seconds boundary + +#### Scenario: Return segment metadata +- **WHEN** segmentation completes +- **THEN** sidecar SHALL return list of segments with file paths and timestamps + +### Requirement: Dify Speech-to-Text Integration +The backend SHALL integrate with Dify STT service for audio file transcription. 
+ +#### Scenario: Transcribe uploaded audio with chunking +- **WHEN** backend receives POST /api/ai/transcribe-audio with audio file +- **THEN** backend SHALL call sidecar for VAD segmentation +- **AND** send each chunk to Dify STT API sequentially +- **AND** concatenate results into final transcript + +#### Scenario: Supported audio formats +- **WHEN** audio file is in MP3, WAV, M4A, WebM, or OGG format +- **THEN** system SHALL accept and process the file + +#### Scenario: Unsupported format handling +- **WHEN** audio file format is not supported +- **THEN** backend SHALL return HTTP 400 with error message listing supported formats + +#### Scenario: Dify chunk transcription +- **WHEN** backend sends audio chunk to Dify STT API +- **THEN** chunk size SHALL be under 25MB to comply with API limits + +#### Scenario: Transcription timeout per chunk +- **WHEN** Dify STT does not respond for a chunk within 2 minutes +- **THEN** backend SHALL retry up to 3 times with exponential backoff + +#### Scenario: Dify STT error handling +- **WHEN** Dify STT API returns error after retries +- **THEN** backend SHALL return HTTP 502 with error details + +### Requirement: Dual Transcription Mode +The system SHALL support both real-time local transcription and file-based cloud transcription. 
+ +#### Scenario: Real-time transcription unchanged +- **WHEN** user records audio in real-time +- **THEN** local sidecar SHALL process audio using faster-whisper (existing behavior) + +#### Scenario: File upload uses cloud transcription +- **WHEN** user uploads audio file +- **THEN** Dify cloud service SHALL process audio via chunked upload + +#### Scenario: Unified transcript output +- **WHEN** transcription completes from either source +- **THEN** result SHALL be displayed in the same transcript area in meeting detail page diff --git a/openspec/changes/archive/2025-12-11-add-dify-audio-transcription/tasks.md b/openspec/changes/archive/2025-12-11-add-dify-audio-transcription/tasks.md new file mode 100644 index 0000000..0cad49b --- /dev/null +++ b/openspec/changes/archive/2025-12-11-add-dify-audio-transcription/tasks.md @@ -0,0 +1,47 @@ +# Implementation Tasks + +## 1. Backend Configuration +- [x] 1.1 Add `DIFY_STT_API_KEY` to `backend/app/config.py` +- [x] 1.2 Add `DIFY_STT_API_KEY` to `backend/.env.example` + +## 2. Sidecar VAD Segmentation +- [x] 2.1 Add `segment_audio` action handler in `sidecar/transcriber.py` +- [x] 2.2 Implement VAD-based audio segmentation using Silero VAD +- [x] 2.3 Support max chunk duration (default 5 minutes) +- [x] 2.4 Support minimum silence threshold (default 500ms) +- [x] 2.5 Export chunks as WAV files to temp directory +- [x] 2.6 Return segment metadata (paths, timestamps) + +## 3. 
Backend API Endpoint +- [x] 3.1 Create `POST /api/ai/transcribe-audio` endpoint in `backend/app/routers/ai.py` +- [x] 3.2 Implement multipart file upload handling (max 500MB) +- [x] 3.3 Validate audio file format (MP3, WAV, M4A, WebM, OGG) +- [x] 3.4 Save uploaded file to temp directory +- [x] 3.5 Call sidecar `segment_audio` for VAD chunking +- [x] 3.6 For each chunk: call Dify STT API (`/v1/audio-to-text`) +- [x] 3.7 Implement retry with exponential backoff for Dify errors +- [x] 3.8 Concatenate chunk transcriptions +- [x] 3.9 Clean up temp files after processing +- [x] 3.10 Return final transcript with metadata + +## 4. Frontend UI +- [x] 4.1 Add "Upload Audio" button in meeting-detail.html (next to recording controls) +- [x] 4.2 Implement file input with accepted audio formats +- [x] 4.3 Add upload progress indicator (upload phase) +- [x] 4.4 Add transcription progress indicator (chunk X of Y) +- [x] 4.5 Show confirmation dialog if transcript already has content +- [x] 4.6 Display transcription result in transcript area +- [x] 4.7 Handle error states (file too large, unsupported format, API error) + +## 5. API Service +- [x] 5.1 Add `transcribeAudio()` function to `client/src/services/api.js` +- [x] 5.2 Implement FormData upload with progress tracking +- [x] 5.3 Handle streaming response for chunk progress + +## 6. 
Testing +- [ ] 6.1 Test sidecar VAD segmentation with various audio lengths +- [ ] 6.2 Test with various audio formats (MP3, WAV, M4A, WebM, OGG) +- [ ] 6.3 Test with large file (>100MB) to verify chunking +- [ ] 6.4 Test error handling (invalid format, Dify timeout, API error) +- [ ] 6.5 Verify transcript displays correctly after upload +- [ ] 6.6 Test chunk concatenation quality (no missing content at boundaries) diff --git a/openspec/specs/transcription/spec.md b/openspec/specs/transcription/spec.md index bce3a7c..58ed996 100644 --- a/openspec/specs/transcription/spec.md +++ b/openspec/specs/transcription/spec.md @@ -88,3 +88,90 @@ The sidecar SHALL output transcribed text with appropriate Chinese punctuation m - **WHEN** transcribed text ends with question particles (嗎、呢、什麼、怎麼、為什麼) - **THEN** the punctuation processor SHALL append question mark (?) +### Requirement: Audio File Upload +The Electron client SHALL allow users to upload pre-recorded audio files for transcription. + +#### Scenario: Upload audio file +- **WHEN** user clicks "Upload Audio" button in meeting detail page +- **THEN** file picker SHALL open with filter for supported audio formats (MP3, WAV, M4A, WebM, OGG) + +#### Scenario: Show upload progress +- **WHEN** audio file is being uploaded +- **THEN** progress indicator SHALL be displayed showing upload percentage + +#### Scenario: Show transcription progress +- **WHEN** audio file is being transcribed in chunks +- **THEN** progress indicator SHALL display "Processing chunk X of Y" + +#### Scenario: Replace existing transcript +- **WHEN** user uploads audio file and transcript already has content +- **THEN** confirmation dialog SHALL appear before replacing existing transcript + +#### Scenario: File size limit +- **WHEN** user selects audio file larger than 500MB +- **THEN** error message SHALL be displayed indicating file size limit + +### Requirement: VAD-Based Audio Segmentation +The sidecar SHALL segment large audio files using Voice 
Activity Detection before cloud transcription. + +#### Scenario: Segment audio command +- **WHEN** sidecar receives `{"action": "segment_audio", "file_path": "...", "max_chunk_seconds": 300}` +- **THEN** it SHALL load audio file and run VAD to detect speech boundaries + +#### Scenario: Split at silence boundaries +- **WHEN** VAD detects silence gap >= 500ms within max chunk duration +- **THEN** audio SHALL be split at the silence boundary +- **AND** each chunk exported as WAV file to temp directory + +#### Scenario: Force split for continuous speech +- **WHEN** speech continues beyond max_chunk_seconds without silence gap +- **THEN** audio SHALL be force-split at max_chunk_seconds boundary + +#### Scenario: Return segment metadata +- **WHEN** segmentation completes +- **THEN** sidecar SHALL return list of segments with file paths and timestamps + +### Requirement: Dify Speech-to-Text Integration +The backend SHALL integrate with Dify STT service for audio file transcription. + +#### Scenario: Transcribe uploaded audio with chunking +- **WHEN** backend receives POST /api/ai/transcribe-audio with audio file +- **THEN** backend SHALL call sidecar for VAD segmentation +- **AND** send each chunk to Dify STT API sequentially +- **AND** concatenate results into final transcript + +#### Scenario: Supported audio formats +- **WHEN** audio file is in MP3, WAV, M4A, WebM, or OGG format +- **THEN** system SHALL accept and process the file + +#### Scenario: Unsupported format handling +- **WHEN** audio file format is not supported +- **THEN** backend SHALL return HTTP 400 with error message listing supported formats + +#### Scenario: Dify chunk transcription +- **WHEN** backend sends audio chunk to Dify STT API +- **THEN** chunk size SHALL be under 25MB to comply with API limits + +#### Scenario: Transcription timeout per chunk +- **WHEN** Dify STT does not respond for a chunk within 2 minutes +- **THEN** backend SHALL retry up to 3 times with exponential backoff + +#### 
Scenario: Dify STT error handling +- **WHEN** Dify STT API returns error after retries +- **THEN** backend SHALL return HTTP 502 with error details + +### Requirement: Dual Transcription Mode +The system SHALL support both real-time local transcription and file-based cloud transcription. + +#### Scenario: Real-time transcription unchanged +- **WHEN** user records audio in real-time +- **THEN** local sidecar SHALL process audio using faster-whisper (existing behavior) + +#### Scenario: File upload uses cloud transcription +- **WHEN** user uploads audio file +- **THEN** Dify cloud service SHALL process audio via chunked upload + +#### Scenario: Unified transcript output +- **WHEN** transcription completes from either source +- **THEN** result SHALL be displayed in the same transcript area in meeting detail page +