From 2e78e3760a38955412e8b2e60801476b6e5c3acd Mon Sep 17 00:00:00 2001
From: egg
Date: Thu, 11 Dec 2025 21:05:01 +0800
Subject: [PATCH] chore: Archive add-dify-audio-transcription proposal
MIME-Version: 1.0
Content-Type: text/plain; charset=UTF-8
Content-Transfer-Encoding: 8bit

Move completed Dify audio transcription proposal to archive and update
transcription spec with new capabilities.

🤖 Generated with [Claude Code](https://claude.com/claude-code)

Co-Authored-By: Claude Opus 4.5
---
 .../design.md                        | 185 ++++++++++++++++++
 .../proposal.md                      |  28 +++
 .../specs/transcription/spec.md      |  88 +++++++++
 .../tasks.md                         |  47 +++++
 openspec/specs/transcription/spec.md |  87 ++++++++
 5 files changed, 435 insertions(+)
 create mode 100644 openspec/changes/archive/2025-12-11-add-dify-audio-transcription/design.md
 create mode 100644 openspec/changes/archive/2025-12-11-add-dify-audio-transcription/proposal.md
 create mode 100644 openspec/changes/archive/2025-12-11-add-dify-audio-transcription/specs/transcription/spec.md
 create mode 100644 openspec/changes/archive/2025-12-11-add-dify-audio-transcription/tasks.md

diff --git a/openspec/changes/archive/2025-12-11-add-dify-audio-transcription/design.md b/openspec/changes/archive/2025-12-11-add-dify-audio-transcription/design.md
new file mode 100644
index 0000000..e9a49eb
--- /dev/null
+++ b/openspec/changes/archive/2025-12-11-add-dify-audio-transcription/design.md
@@ -0,0 +1,185 @@
+# Design: Dify Audio Transcription
+
+## Context
+The Meeting Assistant currently supports real-time transcription via a local Python sidecar using faster-whisper. Users have requested the ability to upload pre-recorded audio files for transcription. To avoid overloading local resources and leverage cloud capabilities, uploaded files will be processed by Dify's speech-to-text service.
+ +## Goals / Non-Goals + +**Goals:** +- Allow users to upload audio files for transcription +- Use Dify STT service for file-based transcription +- Handle large files via VAD-based automatic segmentation +- Maintain seamless UX with transcription appearing in the same location +- Support common audio formats (MP3, WAV, M4A, WebM, OGG) + +**Non-Goals:** +- Replace real-time local transcription (sidecar remains for live recording) +- Support video files (audio extraction) + +## Decisions + +### Decision 1: Two-Path Transcription Architecture +- **What**: Real-time recording uses local sidecar; file uploads use Dify cloud +- **Why**: Local processing provides low latency for real-time needs; cloud processing handles larger files without impacting local resources +- **Alternatives considered**: + - All transcription via Dify: Rejected due to latency and network dependency for real-time use + - All transcription local: Rejected due to resource constraints for large file processing + +### Decision 2: VAD-Based Audio Chunking +- **What**: Use sidecar's Silero VAD to segment large audio files before sending to Dify +- **Why**: + - Dify STT API has file size limits (typically ~25MB) + - VAD removes silence, reducing total upload size + - Natural speech boundaries improve transcription quality +- **Implementation**: + - Sidecar segments audio into chunks (~2-5 minutes each based on speech boundaries) + - Each chunk sent to Dify sequentially + - Results concatenated with proper spacing +- **Alternatives considered**: + - Fixed-time splitting (e.g., 5min chunks): Rejected - may cut mid-sentence + - Client-side splitting: Rejected - requires shipping VAD to client + +### Decision 3: Separate API Key for STT Service +- **What**: Use dedicated Dify API key `app-xQeSipaQecs0cuKeLvYDaRsu` for STT +- **Why**: Allows independent rate limiting and monitoring from summarization service +- **Configuration**: `DIFY_STT_API_KEY` environment variable + +### Decision 4: Backend-Mediated 
Upload with Sidecar Processing
+- **What**: Client → Backend → Sidecar (VAD) → Backend → Dify → Client
+- **Why**:
+  - Keeps Dify API keys secure on server
+  - Reuses existing sidecar VAD capability
+  - Enables progress tracking for multi-chunk processing
+- **Alternatives considered**:
+  - Direct client → Dify: Rejected due to API key exposure and file size limits
+
+### Decision 5: Append vs Replace Transcript
+- **What**: Uploaded file transcription replaces current transcript content
+- **Why**: Users typically upload complete meeting recordings; appending would create confusion
+- **UI**: Show confirmation dialog before replacing existing content
+
+## API Design
+
+### Backend Endpoint
+```
+POST /api/ai/transcribe-audio
+Content-Type: multipart/form-data
+
+Request:
+- file: Audio file (max 500MB, will be chunked)
+
+Response (streaming for progress):
+{
+  "transcript": "完整的會議逐字稿內容...",
+  "chunks_processed": 5,
+  "total_duration_seconds": 3600,
+  "language": "zh"
+}
+```
+
+### Sidecar VAD Segmentation Command
+```json
+// Request
+{
+  "action": "segment_audio",
+  "file_path": "/tmp/uploaded_audio.mp3",
+  "max_chunk_seconds": 300,
+  "min_silence_ms": 500
+}
+
+// Response
+{
+  "status": "success",
+  "segments": [
+    {"index": 0, "path": "/tmp/chunk_0.wav", "start": 0, "end": 180.5},
+    {"index": 1, "path": "/tmp/chunk_1.wav", "start": 180.5, "end": 362.0},
+    ...
+  ],
+  "total_segments": 5
+}
+```
+
+### Dify STT API Integration
+```
+POST https://dify.theaken.com/v1/audio-to-text
+Authorization: Bearer {DIFY_STT_API_KEY}
+Content-Type: multipart/form-data
+
+Request:
+- file: Audio chunk (<25MB)
+- user: User identifier
+
+Response:
+{
+  "text": "transcribed content for this chunk..."
+}
+```
+
+## Data Flow
+
+```
+┌─────────────┐     ┌─────────────┐     ┌─────────────┐     ┌─────────────┐
+│  Electron   │     │   FastAPI   │     │   Sidecar   │     │  Dify STT   │
+│   Client    │     │   Backend   │     │    (VAD)    │     │   Service   │
+└──────┬──────┘     └──────┬──────┘     └──────┬──────┘     └──────┬──────┘
+       │                   │                   │                   │
+       │  Upload audio     │                   │                   │
+       │──────────────────>│                   │                   │
+       │                   │                   │                   │
+       │                   │  segment_audio    │                   │
+       │                   │──────────────────>│                   │
+       │                   │                   │                   │
+       │                   │  segments[]       │                   │
+       │                   │<──────────────────│                   │
+       │                   │                   │                   │
+       │                   │  For each chunk:  │                   │
+       │  Progress: 1/5    │──────────────────────────────────────>│
+       │<──────────────────│                   │                   │
+       │                   │                   │   transcription   │
+       │                   │<──────────────────────────────────────│
+       │                   │                   │                   │
+       │  Progress: 2/5    │──────────────────────────────────────>│
+       │<──────────────────│                   │                   │
+       │                   │<──────────────────────────────────────│
+       │                   │                   │                   │
+       │                   │  ... repeat ...   │                   │
+       │                   │                   │                   │
+       │  Final transcript │                   │                   │
+       │<──────────────────│                   │                   │
+       │  (concatenated)   │                   │                   │
+```
+
+## Chunking Algorithm
+
+```python
+def segment_audio_with_vad(audio_path, max_chunk_seconds=300, min_silence_ms=500):
+    """
+    Segment audio file using VAD for natural speech boundaries.
+
+    1. Load audio file
+    2. Run VAD to detect speech/silence regions
+    3. Find silence gaps >= min_silence_ms
+    4. Split at silence gaps, keeping chunks <= max_chunk_seconds
+    5. If no silence found within max_chunk_seconds, force split at max
+    6. Export each chunk as WAV file
+    7. Return list of chunk file paths with timestamps
+    """
+```
+
+## Risks / Trade-offs
+
+| Risk | Mitigation |
+|------|------------|
+| Large file causes memory issues | Stream audio processing; limit to 500MB |
+| Dify rate limiting | Add retry with exponential backoff |
+| Chunk boundary affects context | Overlap chunks by 1-2 seconds |
+| Long processing time | Show progress indicator with chunk count |
+| Sidecar not available | Return error suggesting real-time recording |
+
+## Migration Plan
+No migration needed - this is additive functionality.
+
+## Open Questions
+- ~~Maximum file size limit?~~ **Resolved**: 500MB with VAD chunking
+- Chunk overlap for context continuity?
+  - Proposal: 1 second overlap, deduplicate in concatenation
diff --git a/openspec/changes/archive/2025-12-11-add-dify-audio-transcription/proposal.md b/openspec/changes/archive/2025-12-11-add-dify-audio-transcription/proposal.md
new file mode 100644
index 0000000..f4e8a29
--- /dev/null
+++ b/openspec/changes/archive/2025-12-11-add-dify-audio-transcription/proposal.md
@@ -0,0 +1,28 @@
+# Change: Add Dify Audio Transcription for Uploaded Files
+
+## Why
+Users need to transcribe pre-recorded audio files (e.g., meeting recordings from external sources). Currently, transcription only works with real-time recording via the local sidecar.
Adding Dify-based transcription for uploaded files provides flexibility while keeping real-time transcription local for low latency. + +## What Changes +- Add audio file upload UI in Electron client (meeting detail page) +- Add `segment_audio` command to sidecar for VAD-based audio chunking +- Add backend API endpoint to receive audio files, chunk via sidecar, and forward to Dify STT service +- Each chunk (~5 minutes max) sent to Dify separately, results concatenated +- Transcription result replaces the transcript field (same as real-time transcription) +- Support common audio formats: MP3, WAV, M4A, WebM, OGG + +## Impact +- Affected specs: `transcription` +- Affected code: + - `sidecar/transcriber.py` - Add `segment_audio` action for VAD chunking + - `client/src/pages/meeting-detail.html` - Add upload button and progress UI + - `backend/app/routers/ai.py` - Add `/api/ai/transcribe-audio` endpoint + - `backend/app/config.py` - Add Dify STT API key configuration + +## Technical Notes +- Dify STT API Key: `app-xQeSipaQecs0cuKeLvYDaRsu` +- Real-time transcription continues to use local sidecar (no change) +- File upload transcription uses Dify cloud service with VAD chunking +- VAD chunking ensures each chunk < 25MB (Dify API limit) +- Max file size: 500MB (chunked processing handles large files) +- Both methods output to the same transcript_blob field diff --git a/openspec/changes/archive/2025-12-11-add-dify-audio-transcription/specs/transcription/spec.md b/openspec/changes/archive/2025-12-11-add-dify-audio-transcription/specs/transcription/spec.md new file mode 100644 index 0000000..cc0acb0 --- /dev/null +++ b/openspec/changes/archive/2025-12-11-add-dify-audio-transcription/specs/transcription/spec.md @@ -0,0 +1,88 @@ +## ADDED Requirements + +### Requirement: Audio File Upload +The Electron client SHALL allow users to upload pre-recorded audio files for transcription. 
+ +#### Scenario: Upload audio file +- **WHEN** user clicks "Upload Audio" button in meeting detail page +- **THEN** file picker SHALL open with filter for supported audio formats (MP3, WAV, M4A, WebM, OGG) + +#### Scenario: Show upload progress +- **WHEN** audio file is being uploaded +- **THEN** progress indicator SHALL be displayed showing upload percentage + +#### Scenario: Show transcription progress +- **WHEN** audio file is being transcribed in chunks +- **THEN** progress indicator SHALL display "Processing chunk X of Y" + +#### Scenario: Replace existing transcript +- **WHEN** user uploads audio file and transcript already has content +- **THEN** confirmation dialog SHALL appear before replacing existing transcript + +#### Scenario: File size limit +- **WHEN** user selects audio file larger than 500MB +- **THEN** error message SHALL be displayed indicating file size limit + +### Requirement: VAD-Based Audio Segmentation +The sidecar SHALL segment large audio files using Voice Activity Detection before cloud transcription. + +#### Scenario: Segment audio command +- **WHEN** sidecar receives `{"action": "segment_audio", "file_path": "...", "max_chunk_seconds": 300}` +- **THEN** it SHALL load audio file and run VAD to detect speech boundaries + +#### Scenario: Split at silence boundaries +- **WHEN** VAD detects silence gap >= 500ms within max chunk duration +- **THEN** audio SHALL be split at the silence boundary +- **AND** each chunk exported as WAV file to temp directory + +#### Scenario: Force split for continuous speech +- **WHEN** speech continues beyond max_chunk_seconds without silence gap +- **THEN** audio SHALL be force-split at max_chunk_seconds boundary + +#### Scenario: Return segment metadata +- **WHEN** segmentation completes +- **THEN** sidecar SHALL return list of segments with file paths and timestamps + +### Requirement: Dify Speech-to-Text Integration +The backend SHALL integrate with Dify STT service for audio file transcription. 
+ +#### Scenario: Transcribe uploaded audio with chunking +- **WHEN** backend receives POST /api/ai/transcribe-audio with audio file +- **THEN** backend SHALL call sidecar for VAD segmentation +- **AND** send each chunk to Dify STT API sequentially +- **AND** concatenate results into final transcript + +#### Scenario: Supported audio formats +- **WHEN** audio file is in MP3, WAV, M4A, WebM, or OGG format +- **THEN** system SHALL accept and process the file + +#### Scenario: Unsupported format handling +- **WHEN** audio file format is not supported +- **THEN** backend SHALL return HTTP 400 with error message listing supported formats + +#### Scenario: Dify chunk transcription +- **WHEN** backend sends audio chunk to Dify STT API +- **THEN** chunk size SHALL be under 25MB to comply with API limits + +#### Scenario: Transcription timeout per chunk +- **WHEN** Dify STT does not respond for a chunk within 2 minutes +- **THEN** backend SHALL retry up to 3 times with exponential backoff + +#### Scenario: Dify STT error handling +- **WHEN** Dify STT API returns error after retries +- **THEN** backend SHALL return HTTP 502 with error details + +### Requirement: Dual Transcription Mode +The system SHALL support both real-time local transcription and file-based cloud transcription. 
+ +#### Scenario: Real-time transcription unchanged +- **WHEN** user records audio in real-time +- **THEN** local sidecar SHALL process audio using faster-whisper (existing behavior) + +#### Scenario: File upload uses cloud transcription +- **WHEN** user uploads audio file +- **THEN** Dify cloud service SHALL process audio via chunked upload + +#### Scenario: Unified transcript output +- **WHEN** transcription completes from either source +- **THEN** result SHALL be displayed in the same transcript area in meeting detail page diff --git a/openspec/changes/archive/2025-12-11-add-dify-audio-transcription/tasks.md b/openspec/changes/archive/2025-12-11-add-dify-audio-transcription/tasks.md new file mode 100644 index 0000000..0cad49b --- /dev/null +++ b/openspec/changes/archive/2025-12-11-add-dify-audio-transcription/tasks.md @@ -0,0 +1,47 @@ +# Implementation Tasks + +## 1. Backend Configuration +- [x] 1.1 Add `DIFY_STT_API_KEY` to `backend/app/config.py` +- [x] 1.2 Add `DIFY_STT_API_KEY` to `backend/.env.example` + +## 2. Sidecar VAD Segmentation +- [x] 2.1 Add `segment_audio` action handler in `sidecar/transcriber.py` +- [x] 2.2 Implement VAD-based audio segmentation using Silero VAD +- [x] 2.3 Support max chunk duration (default 5 minutes) +- [x] 2.4 Support minimum silence threshold (default 500ms) +- [x] 2.5 Export chunks as WAV files to temp directory +- [x] 2.6 Return segment metadata (paths, timestamps) + +## 3. 
Backend API Endpoint +- [x] 3.1 Create `POST /api/ai/transcribe-audio` endpoint in `backend/app/routers/ai.py` +- [x] 3.2 Implement multipart file upload handling (max 500MB) +- [x] 3.3 Validate audio file format (MP3, WAV, M4A, WebM, OGG) +- [x] 3.4 Save uploaded file to temp directory +- [x] 3.5 Call sidecar `segment_audio` for VAD chunking +- [x] 3.6 For each chunk: call Dify STT API (`/v1/audio-to-text`) +- [x] 3.7 Implement retry with exponential backoff for Dify errors +- [x] 3.8 Concatenate chunk transcriptions +- [x] 3.9 Clean up temp files after processing +- [x] 3.10 Return final transcript with metadata + +## 4. Frontend UI +- [x] 4.1 Add "Upload Audio" button in meeting-detail.html (next to recording controls) +- [x] 4.2 Implement file input with accepted audio formats +- [x] 4.3 Add upload progress indicator (upload phase) +- [x] 4.4 Add transcription progress indicator (chunk X of Y) +- [x] 4.5 Show confirmation dialog if transcript already has content +- [x] 4.6 Display transcription result in transcript area +- [x] 4.7 Handle error states (file too large, unsupported format, API error) + +## 5. API Service +- [x] 5.1 Add `transcribeAudio()` function to `client/src/services/api.js` +- [x] 5.2 Implement FormData upload with progress tracking +- [x] 5.3 Handle streaming response for chunk progress + +## 6. 
Testing
+- [ ] 6.1 Test sidecar VAD segmentation with various audio lengths
+- [ ] 6.2 Test with various audio formats (MP3, WAV, M4A, WebM, OGG)
+- [ ] 6.3 Test with large file (>100MB) to verify chunking
+- [ ] 6.4 Test error handling (invalid format, Dify timeout, API error)
+- [ ] 6.5 Verify transcript displays correctly after upload
+- [ ] 6.6 Test chunk concatenation quality (no missing content at boundaries)
diff --git a/openspec/specs/transcription/spec.md b/openspec/specs/transcription/spec.md
index bce3a7c..58ed996 100644
--- a/openspec/specs/transcription/spec.md
+++ b/openspec/specs/transcription/spec.md
@@ -88,3 +88,90 @@ The sidecar SHALL output transcribed text with appropriate Chinese punctuation m
 - **WHEN** transcribed text ends with question particles (嗎、呢、什麼、怎麼、為什麼)
 - **THEN** the punctuation processor SHALL append question mark (？)
 
+### Requirement: Audio File Upload
+The Electron client SHALL allow users to upload pre-recorded audio files for transcription.
+ +#### Scenario: Upload audio file +- **WHEN** user clicks "Upload Audio" button in meeting detail page +- **THEN** file picker SHALL open with filter for supported audio formats (MP3, WAV, M4A, WebM, OGG) + +#### Scenario: Show upload progress +- **WHEN** audio file is being uploaded +- **THEN** progress indicator SHALL be displayed showing upload percentage + +#### Scenario: Show transcription progress +- **WHEN** audio file is being transcribed in chunks +- **THEN** progress indicator SHALL display "Processing chunk X of Y" + +#### Scenario: Replace existing transcript +- **WHEN** user uploads audio file and transcript already has content +- **THEN** confirmation dialog SHALL appear before replacing existing transcript + +#### Scenario: File size limit +- **WHEN** user selects audio file larger than 500MB +- **THEN** error message SHALL be displayed indicating file size limit + +### Requirement: VAD-Based Audio Segmentation +The sidecar SHALL segment large audio files using Voice Activity Detection before cloud transcription. + +#### Scenario: Segment audio command +- **WHEN** sidecar receives `{"action": "segment_audio", "file_path": "...", "max_chunk_seconds": 300}` +- **THEN** it SHALL load audio file and run VAD to detect speech boundaries + +#### Scenario: Split at silence boundaries +- **WHEN** VAD detects silence gap >= 500ms within max chunk duration +- **THEN** audio SHALL be split at the silence boundary +- **AND** each chunk exported as WAV file to temp directory + +#### Scenario: Force split for continuous speech +- **WHEN** speech continues beyond max_chunk_seconds without silence gap +- **THEN** audio SHALL be force-split at max_chunk_seconds boundary + +#### Scenario: Return segment metadata +- **WHEN** segmentation completes +- **THEN** sidecar SHALL return list of segments with file paths and timestamps + +### Requirement: Dify Speech-to-Text Integration +The backend SHALL integrate with Dify STT service for audio file transcription. 
+ +#### Scenario: Transcribe uploaded audio with chunking +- **WHEN** backend receives POST /api/ai/transcribe-audio with audio file +- **THEN** backend SHALL call sidecar for VAD segmentation +- **AND** send each chunk to Dify STT API sequentially +- **AND** concatenate results into final transcript + +#### Scenario: Supported audio formats +- **WHEN** audio file is in MP3, WAV, M4A, WebM, or OGG format +- **THEN** system SHALL accept and process the file + +#### Scenario: Unsupported format handling +- **WHEN** audio file format is not supported +- **THEN** backend SHALL return HTTP 400 with error message listing supported formats + +#### Scenario: Dify chunk transcription +- **WHEN** backend sends audio chunk to Dify STT API +- **THEN** chunk size SHALL be under 25MB to comply with API limits + +#### Scenario: Transcription timeout per chunk +- **WHEN** Dify STT does not respond for a chunk within 2 minutes +- **THEN** backend SHALL retry up to 3 times with exponential backoff + +#### Scenario: Dify STT error handling +- **WHEN** Dify STT API returns error after retries +- **THEN** backend SHALL return HTTP 502 with error details + +### Requirement: Dual Transcription Mode +The system SHALL support both real-time local transcription and file-based cloud transcription. + +#### Scenario: Real-time transcription unchanged +- **WHEN** user records audio in real-time +- **THEN** local sidecar SHALL process audio using faster-whisper (existing behavior) + +#### Scenario: File upload uses cloud transcription +- **WHEN** user uploads audio file +- **THEN** Dify cloud service SHALL process audio via chunked upload + +#### Scenario: Unified transcript output +- **WHEN** transcription completes from either source +- **THEN** result SHALL be displayed in the same transcript area in meeting detail page +
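---

Reviewer note: the per-chunk retry behavior specified above ("retry up to 3 times with exponential backoff", sequential chunks, concatenated output) can be sketched as below. This is a minimal illustration under stated assumptions, not the project's actual code: `transcribe_chunks`, `post_chunk`, `backoff_delay`, and the delay constants are hypothetical names, and the HTTP call to Dify's `/v1/audio-to-text` is injected as a callable so the sketch stays network-free.

```python
import time

MAX_RETRIES = 3      # per-chunk retry limit from the spec
BASE_DELAY_S = 1.0   # hypothetical base delay; doubles on each retry


def backoff_delay(attempt: int) -> float:
    """Exponential backoff delay for a 0-based attempt: 1s, 2s, 4s."""
    return BASE_DELAY_S * (2 ** attempt)


def transcribe_chunks(chunk_paths, post_chunk, sleep=time.sleep):
    """Transcribe VAD chunks sequentially and join the results.

    `post_chunk(path)` stands in for the Dify /v1/audio-to-text call:
    it returns the chunk's transcribed text or raises on failure.
    """
    texts = []
    for path in chunk_paths:
        for attempt in range(MAX_RETRIES):
            try:
                texts.append(post_chunk(path))
                break
            except Exception:
                if attempt == MAX_RETRIES - 1:
                    raise  # the real endpoint would surface this as HTTP 502
                sleep(backoff_delay(attempt))
    return " ".join(texts)
```

In the real backend, `post_chunk` would presumably be a multipart POST with the Bearer key and the 2-minute per-chunk timeout the spec requires; joining with a single space matches the design note about concatenating results "with proper spacing".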