From 2e78e3760a38955412e8b2e60801476b6e5c3acd Mon Sep 17 00:00:00 2001
From: egg
Date: Thu, 11 Dec 2025 21:05:01 +0800
Subject: [PATCH] chore: Archive add-dify-audio-transcription proposal
MIME-Version: 1.0
Content-Type: text/plain; charset=UTF-8
Content-Transfer-Encoding: 8bit

Move completed Dify audio transcription proposal to archive and update
transcription spec with new capabilities.

🤖 Generated with [Claude Code](https://claude.com/claude-code)

Co-Authored-By: Claude Opus 4.5
---
 .../design.md                        | 185 ++++++++++++++++++
 .../proposal.md                      |  28 +++
 .../specs/transcription/spec.md      |  88 +++++++++
 .../tasks.md                         |  47 +++++
 openspec/specs/transcription/spec.md |  87 ++++++++
 5 files changed, 435 insertions(+)
 create mode 100644 openspec/changes/archive/2025-12-11-add-dify-audio-transcription/design.md
 create mode 100644 openspec/changes/archive/2025-12-11-add-dify-audio-transcription/proposal.md
 create mode 100644 openspec/changes/archive/2025-12-11-add-dify-audio-transcription/specs/transcription/spec.md
 create mode 100644 openspec/changes/archive/2025-12-11-add-dify-audio-transcription/tasks.md

diff --git a/openspec/changes/archive/2025-12-11-add-dify-audio-transcription/design.md b/openspec/changes/archive/2025-12-11-add-dify-audio-transcription/design.md
new file mode 100644
index 0000000..e9a49eb
--- /dev/null
+++ b/openspec/changes/archive/2025-12-11-add-dify-audio-transcription/design.md
@@ -0,0 +1,185 @@
+# Design: Dify Audio Transcription
+
+## Context
+The Meeting Assistant currently supports real-time transcription via a local Python sidecar using faster-whisper. Users have requested the ability to upload pre-recorded audio files for transcription. To avoid overloading local resources and leverage cloud capabilities, uploaded files will be processed by Dify's speech-to-text service.
+ +## Goals / Non-Goals + +**Goals:** +- Allow users to upload audio files for transcription +- Use Dify STT service for file-based transcription +- Handle large files via VAD-based automatic segmentation +- Maintain seamless UX with transcription appearing in the same location +- Support common audio formats (MP3, WAV, M4A, WebM, OGG) + +**Non-Goals:** +- Replace real-time local transcription (sidecar remains for live recording) +- Support video files (audio extraction) + +## Decisions + +### Decision 1: Two-Path Transcription Architecture +- **What**: Real-time recording uses local sidecar; file uploads use Dify cloud +- **Why**: Local processing provides low latency for real-time needs; cloud processing handles larger files without impacting local resources +- **Alternatives considered**: + - All transcription via Dify: Rejected due to latency and network dependency for real-time use + - All transcription local: Rejected due to resource constraints for large file processing + +### Decision 2: VAD-Based Audio Chunking +- **What**: Use sidecar's Silero VAD to segment large audio files before sending to Dify +- **Why**: + - Dify STT API has file size limits (typically ~25MB) + - VAD removes silence, reducing total upload size + - Natural speech boundaries improve transcription quality +- **Implementation**: + - Sidecar segments audio into chunks (~2-5 minutes each based on speech boundaries) + - Each chunk sent to Dify sequentially + - Results concatenated with proper spacing +- **Alternatives considered**: + - Fixed-time splitting (e.g., 5min chunks): Rejected - may cut mid-sentence + - Client-side splitting: Rejected - requires shipping VAD to client + +### Decision 3: Separate API Key for STT Service +- **What**: Use dedicated Dify API key `app-xQeSipaQecs0cuKeLvYDaRsu` for STT +- **Why**: Allows independent rate limiting and monitoring from summarization service +- **Configuration**: `DIFY_STT_API_KEY` environment variable + +### Decision 4: Backend-Mediated 
Upload with Sidecar Processing
+- **What**: Client → Backend → Sidecar (VAD) → Backend → Dify → Client
+- **Why**:
+  - Keeps Dify API keys secure on server
+  - Reuses existing sidecar VAD capability
+  - Enables progress tracking for multi-chunk processing
+- **Alternatives considered**:
+  - Direct client → Dify: Rejected due to API key exposure and file size limits
+
+### Decision 5: Append vs Replace Transcript
+- **What**: Uploaded file transcription replaces current transcript content
+- **Why**: Users typically upload complete meeting recordings; appending would create confusion
+- **UI**: Show confirmation dialog before replacing existing content
+
+## API Design
+
+### Backend Endpoint
+```
+POST /api/ai/transcribe-audio
+Content-Type: multipart/form-data
+
+Request:
+- file: Audio file (max 500MB, will be chunked)
+
+Response (streaming for progress):
+{
+  "transcript": "完整的會議逐字稿內容...",
+  "chunks_processed": 5,
+  "total_duration_seconds": 3600,
+  "language": "zh"
+}
+```
+
+### Sidecar VAD Segmentation Command
+```json
+// Request
+{
+  "action": "segment_audio",
+  "file_path": "/tmp/uploaded_audio.mp3",
+  "max_chunk_seconds": 300,
+  "min_silence_ms": 500
+}
+
+// Response
+{
+  "status": "success",
+  "segments": [
+    {"index": 0, "path": "/tmp/chunk_0.wav", "start": 0, "end": 180.5},
+    {"index": 1, "path": "/tmp/chunk_1.wav", "start": 180.5, "end": 362.0},
+    ...
+  ],
+  "total_segments": 5
+}
+```
+
+### Dify STT API Integration
+```
+POST https://dify.theaken.com/v1/audio-to-text
+Authorization: Bearer {DIFY_STT_API_KEY}
+Content-Type: multipart/form-data
+
+Request:
+- file: Audio chunk (<25MB)
+- user: User identifier
+
+Response:
+{
+  "text": "transcribed content for this chunk..."
+}
+```
+
+## Data Flow
+
+```
+┌─────────────┐     ┌─────────────┐     ┌─────────────┐     ┌─────────────┐
+│  Electron   │     │   FastAPI   │     │   Sidecar   │     │  Dify STT   │
+│   Client    │     │   Backend   │     │    (VAD)    │     │   Service   │
+└──────┬──────┘     └──────┬──────┘     └──────┬──────┘     └──────┬──────┘
+       │                   │                   │                   │
+       │  Upload audio     │                   │                   │
+       │──────────────────>│                   │                   │
+       │                   │                   │                   │
+       │                   │  segment_audio    │                   │
+       │                   │──────────────────>│                   │
+       │                   │                   │                   │
+       │                   │  segments[]       │                   │
+       │                   │<──────────────────│                   │
+       │                   │                   │                   │
+       │                   │  For each chunk:  │                   │
+       │  Progress: 1/5    │──────────────────────────────────────>│
+       │<──────────────────│                   │                   │
+       │                   │                   │   transcription   │
+       │                   │<──────────────────────────────────────│
+       │                   │                   │                   │
+       │  Progress: 2/5    │──────────────────────────────────────>│
+       │<──────────────────│                   │                   │
+       │                   │<──────────────────────────────────────│
+       │                   │                   │                   │
+       │                   │  ... repeat ...   │                   │
+       │                   │                   │                   │
+       │  Final transcript │                   │                   │
+       │<──────────────────│                   │                   │
+       │  (concatenated)   │                   │                   │
+```
+
+## Chunking Algorithm
+
+```python
+def segment_audio_with_vad(audio_path, max_chunk_seconds=300, min_silence_ms=500):
+    """
+    Segment audio file using VAD for natural speech boundaries.
+
+    1. Load audio file
+    2. Run VAD to detect speech/silence regions
+    3. Find silence gaps >= min_silence_ms
+    4. Split at silence gaps, keeping chunks <= max_chunk_seconds
+    5. If no silence found within max_chunk_seconds, force split at max
+    6. Export each chunk as WAV file
+    7. Return list of chunk file paths with timestamps
+    """
+```
+
+## Risks / Trade-offs
+
+| Risk | Mitigation |
+|------|------------|
+| Large file causes memory issues | Stream audio processing; limit to 500MB |
+| Dify rate limiting | Add retry with exponential backoff |
+| Chunk boundary affects context | Overlap chunks by 1-2 seconds |
+| Long processing time | Show progress indicator with chunk count |
+| Sidecar not available | Return error suggesting real-time recording |
+
+## Migration Plan
+No migration needed - this is additive functionality.
+
+## Open Questions
+- ~~Maximum file size limit?~~ **Resolved**: 500MB with VAD chunking
+- Chunk overlap for context continuity?
+  - Proposal: 1 second overlap, deduplicate in concatenation
diff --git a/openspec/changes/archive/2025-12-11-add-dify-audio-transcription/proposal.md b/openspec/changes/archive/2025-12-11-add-dify-audio-transcription/proposal.md
new file mode 100644
index 0000000..f4e8a29
--- /dev/null
+++ b/openspec/changes/archive/2025-12-11-add-dify-audio-transcription/proposal.md
@@ -0,0 +1,28 @@
+# Change: Add Dify Audio Transcription for Uploaded Files
+
+## Why
+Users need to transcribe pre-recorded audio files (e.g., meeting recordings from external sources). Currently, transcription only works with real-time recording via the local sidecar.
Adding Dify-based transcription for uploaded files provides flexibility while keeping real-time transcription local for low latency. + +## What Changes +- Add audio file upload UI in Electron client (meeting detail page) +- Add `segment_audio` command to sidecar for VAD-based audio chunking +- Add backend API endpoint to receive audio files, chunk via sidecar, and forward to Dify STT service +- Each chunk (~5 minutes max) sent to Dify separately, results concatenated +- Transcription result replaces the transcript field (same as real-time transcription) +- Support common audio formats: MP3, WAV, M4A, WebM, OGG + +## Impact +- Affected specs: `transcription` +- Affected code: + - `sidecar/transcriber.py` - Add `segment_audio` action for VAD chunking + - `client/src/pages/meeting-detail.html` - Add upload button and progress UI + - `backend/app/routers/ai.py` - Add `/api/ai/transcribe-audio` endpoint + - `backend/app/config.py` - Add Dify STT API key configuration + +## Technical Notes +- Dify STT API Key: `app-xQeSipaQecs0cuKeLvYDaRsu` +- Real-time transcription continues to use local sidecar (no change) +- File upload transcription uses Dify cloud service with VAD chunking +- VAD chunking ensures each chunk < 25MB (Dify API limit) +- Max file size: 500MB (chunked processing handles large files) +- Both methods output to the same transcript_blob field diff --git a/openspec/changes/archive/2025-12-11-add-dify-audio-transcription/specs/transcription/spec.md b/openspec/changes/archive/2025-12-11-add-dify-audio-transcription/specs/transcription/spec.md new file mode 100644 index 0000000..cc0acb0 --- /dev/null +++ b/openspec/changes/archive/2025-12-11-add-dify-audio-transcription/specs/transcription/spec.md @@ -0,0 +1,88 @@ +## ADDED Requirements + +### Requirement: Audio File Upload +The Electron client SHALL allow users to upload pre-recorded audio files for transcription. 
+ +#### Scenario: Upload audio file +- **WHEN** user clicks "Upload Audio" button in meeting detail page +- **THEN** file picker SHALL open with filter for supported audio formats (MP3, WAV, M4A, WebM, OGG) + +#### Scenario: Show upload progress +- **WHEN** audio file is being uploaded +- **THEN** progress indicator SHALL be displayed showing upload percentage + +#### Scenario: Show transcription progress +- **WHEN** audio file is being transcribed in chunks +- **THEN** progress indicator SHALL display "Processing chunk X of Y" + +#### Scenario: Replace existing transcript +- **WHEN** user uploads audio file and transcript already has content +- **THEN** confirmation dialog SHALL appear before replacing existing transcript + +#### Scenario: File size limit +- **WHEN** user selects audio file larger than 500MB +- **THEN** error message SHALL be displayed indicating file size limit + +### Requirement: VAD-Based Audio Segmentation +The sidecar SHALL segment large audio files using Voice Activity Detection before cloud transcription. + +#### Scenario: Segment audio command +- **WHEN** sidecar receives `{"action": "segment_audio", "file_path": "...", "max_chunk_seconds": 300}` +- **THEN** it SHALL load audio file and run VAD to detect speech boundaries + +#### Scenario: Split at silence boundaries +- **WHEN** VAD detects silence gap >= 500ms within max chunk duration +- **THEN** audio SHALL be split at the silence boundary +- **AND** each chunk exported as WAV file to temp directory + +#### Scenario: Force split for continuous speech +- **WHEN** speech continues beyond max_chunk_seconds without silence gap +- **THEN** audio SHALL be force-split at max_chunk_seconds boundary + +#### Scenario: Return segment metadata +- **WHEN** segmentation completes +- **THEN** sidecar SHALL return list of segments with file paths and timestamps + +### Requirement: Dify Speech-to-Text Integration +The backend SHALL integrate with Dify STT service for audio file transcription. 
+ +#### Scenario: Transcribe uploaded audio with chunking +- **WHEN** backend receives POST /api/ai/transcribe-audio with audio file +- **THEN** backend SHALL call sidecar for VAD segmentation +- **AND** send each chunk to Dify STT API sequentially +- **AND** concatenate results into final transcript + +#### Scenario: Supported audio formats +- **WHEN** audio file is in MP3, WAV, M4A, WebM, or OGG format +- **THEN** system SHALL accept and process the file + +#### Scenario: Unsupported format handling +- **WHEN** audio file format is not supported +- **THEN** backend SHALL return HTTP 400 with error message listing supported formats + +#### Scenario: Dify chunk transcription +- **WHEN** backend sends audio chunk to Dify STT API +- **THEN** chunk size SHALL be under 25MB to comply with API limits + +#### Scenario: Transcription timeout per chunk +- **WHEN** Dify STT does not respond for a chunk within 2 minutes +- **THEN** backend SHALL retry up to 3 times with exponential backoff + +#### Scenario: Dify STT error handling +- **WHEN** Dify STT API returns error after retries +- **THEN** backend SHALL return HTTP 502 with error details + +### Requirement: Dual Transcription Mode +The system SHALL support both real-time local transcription and file-based cloud transcription. 
+ +#### Scenario: Real-time transcription unchanged +- **WHEN** user records audio in real-time +- **THEN** local sidecar SHALL process audio using faster-whisper (existing behavior) + +#### Scenario: File upload uses cloud transcription +- **WHEN** user uploads audio file +- **THEN** Dify cloud service SHALL process audio via chunked upload + +#### Scenario: Unified transcript output +- **WHEN** transcription completes from either source +- **THEN** result SHALL be displayed in the same transcript area in meeting detail page diff --git a/openspec/changes/archive/2025-12-11-add-dify-audio-transcription/tasks.md b/openspec/changes/archive/2025-12-11-add-dify-audio-transcription/tasks.md new file mode 100644 index 0000000..0cad49b --- /dev/null +++ b/openspec/changes/archive/2025-12-11-add-dify-audio-transcription/tasks.md @@ -0,0 +1,47 @@ +# Implementation Tasks + +## 1. Backend Configuration +- [x] 1.1 Add `DIFY_STT_API_KEY` to `backend/app/config.py` +- [x] 1.2 Add `DIFY_STT_API_KEY` to `backend/.env.example` + +## 2. Sidecar VAD Segmentation +- [x] 2.1 Add `segment_audio` action handler in `sidecar/transcriber.py` +- [x] 2.2 Implement VAD-based audio segmentation using Silero VAD +- [x] 2.3 Support max chunk duration (default 5 minutes) +- [x] 2.4 Support minimum silence threshold (default 500ms) +- [x] 2.5 Export chunks as WAV files to temp directory +- [x] 2.6 Return segment metadata (paths, timestamps) + +## 3. 
Backend API Endpoint +- [x] 3.1 Create `POST /api/ai/transcribe-audio` endpoint in `backend/app/routers/ai.py` +- [x] 3.2 Implement multipart file upload handling (max 500MB) +- [x] 3.3 Validate audio file format (MP3, WAV, M4A, WebM, OGG) +- [x] 3.4 Save uploaded file to temp directory +- [x] 3.5 Call sidecar `segment_audio` for VAD chunking +- [x] 3.6 For each chunk: call Dify STT API (`/v1/audio-to-text`) +- [x] 3.7 Implement retry with exponential backoff for Dify errors +- [x] 3.8 Concatenate chunk transcriptions +- [x] 3.9 Clean up temp files after processing +- [x] 3.10 Return final transcript with metadata + +## 4. Frontend UI +- [x] 4.1 Add "Upload Audio" button in meeting-detail.html (next to recording controls) +- [x] 4.2 Implement file input with accepted audio formats +- [x] 4.3 Add upload progress indicator (upload phase) +- [x] 4.4 Add transcription progress indicator (chunk X of Y) +- [x] 4.5 Show confirmation dialog if transcript already has content +- [x] 4.6 Display transcription result in transcript area +- [x] 4.7 Handle error states (file too large, unsupported format, API error) + +## 5. API Service +- [x] 5.1 Add `transcribeAudio()` function to `client/src/services/api.js` +- [x] 5.2 Implement FormData upload with progress tracking +- [x] 5.3 Handle streaming response for chunk progress + +## 6. 
Testing
+- [ ] 6.1 Test sidecar VAD segmentation with various audio lengths
+- [ ] 6.2 Test with various audio formats (MP3, WAV, M4A, WebM, OGG)
+- [ ] 6.3 Test with large file (>100MB) to verify chunking
+- [ ] 6.4 Test error handling (invalid format, Dify timeout, API error)
+- [ ] 6.5 Verify transcript displays correctly after upload
+- [ ] 6.6 Test chunk concatenation quality (no missing content at boundaries)
diff --git a/openspec/specs/transcription/spec.md b/openspec/specs/transcription/spec.md
index bce3a7c..58ed996 100644
--- a/openspec/specs/transcription/spec.md
+++ b/openspec/specs/transcription/spec.md
@@ -88,3 +88,90 @@ The sidecar SHALL output transcribed text with appropriate Chinese punctuation m
 - **WHEN** transcribed text ends with question particles (嗎、呢、什麼、怎麼、為什麼)
 - **THEN** the punctuation processor SHALL append question mark (？)
 
+### Requirement: Audio File Upload
+The Electron client SHALL allow users to upload pre-recorded audio files for transcription.
+ +#### Scenario: Upload audio file +- **WHEN** user clicks "Upload Audio" button in meeting detail page +- **THEN** file picker SHALL open with filter for supported audio formats (MP3, WAV, M4A, WebM, OGG) + +#### Scenario: Show upload progress +- **WHEN** audio file is being uploaded +- **THEN** progress indicator SHALL be displayed showing upload percentage + +#### Scenario: Show transcription progress +- **WHEN** audio file is being transcribed in chunks +- **THEN** progress indicator SHALL display "Processing chunk X of Y" + +#### Scenario: Replace existing transcript +- **WHEN** user uploads audio file and transcript already has content +- **THEN** confirmation dialog SHALL appear before replacing existing transcript + +#### Scenario: File size limit +- **WHEN** user selects audio file larger than 500MB +- **THEN** error message SHALL be displayed indicating file size limit + +### Requirement: VAD-Based Audio Segmentation +The sidecar SHALL segment large audio files using Voice Activity Detection before cloud transcription. + +#### Scenario: Segment audio command +- **WHEN** sidecar receives `{"action": "segment_audio", "file_path": "...", "max_chunk_seconds": 300}` +- **THEN** it SHALL load audio file and run VAD to detect speech boundaries + +#### Scenario: Split at silence boundaries +- **WHEN** VAD detects silence gap >= 500ms within max chunk duration +- **THEN** audio SHALL be split at the silence boundary +- **AND** each chunk exported as WAV file to temp directory + +#### Scenario: Force split for continuous speech +- **WHEN** speech continues beyond max_chunk_seconds without silence gap +- **THEN** audio SHALL be force-split at max_chunk_seconds boundary + +#### Scenario: Return segment metadata +- **WHEN** segmentation completes +- **THEN** sidecar SHALL return list of segments with file paths and timestamps + +### Requirement: Dify Speech-to-Text Integration +The backend SHALL integrate with Dify STT service for audio file transcription. 
+ +#### Scenario: Transcribe uploaded audio with chunking +- **WHEN** backend receives POST /api/ai/transcribe-audio with audio file +- **THEN** backend SHALL call sidecar for VAD segmentation +- **AND** send each chunk to Dify STT API sequentially +- **AND** concatenate results into final transcript + +#### Scenario: Supported audio formats +- **WHEN** audio file is in MP3, WAV, M4A, WebM, or OGG format +- **THEN** system SHALL accept and process the file + +#### Scenario: Unsupported format handling +- **WHEN** audio file format is not supported +- **THEN** backend SHALL return HTTP 400 with error message listing supported formats + +#### Scenario: Dify chunk transcription +- **WHEN** backend sends audio chunk to Dify STT API +- **THEN** chunk size SHALL be under 25MB to comply with API limits + +#### Scenario: Transcription timeout per chunk +- **WHEN** Dify STT does not respond for a chunk within 2 minutes +- **THEN** backend SHALL retry up to 3 times with exponential backoff + +#### Scenario: Dify STT error handling +- **WHEN** Dify STT API returns error after retries +- **THEN** backend SHALL return HTTP 502 with error details + +### Requirement: Dual Transcription Mode +The system SHALL support both real-time local transcription and file-based cloud transcription. + +#### Scenario: Real-time transcription unchanged +- **WHEN** user records audio in real-time +- **THEN** local sidecar SHALL process audio using faster-whisper (existing behavior) + +#### Scenario: File upload uses cloud transcription +- **WHEN** user uploads audio file +- **THEN** Dify cloud service SHALL process audio via chunked upload + +#### Scenario: Unified transcript output +- **WHEN** transcription completes from either source +- **THEN** result SHALL be displayed in the same transcript area in meeting detail page +
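---

Reviewer note: the per-chunk retry behavior specified above ("retry up to 3 times with exponential backoff", sequential chunks, concatenated output) can be sketched as below. This is a minimal illustration under stated assumptions, not the project's actual code: `transcribe_chunks`, `post_chunk`, `backoff_delay`, and the delay constants are hypothetical names, and the HTTP call to Dify's `/v1/audio-to-text` is injected as a callable so the sketch stays network-free.

```python
import time

MAX_RETRIES = 3      # per-chunk retry limit from the spec
BASE_DELAY_S = 1.0   # hypothetical base delay; doubles on each retry


def backoff_delay(attempt: int) -> float:
    """Exponential backoff delay for a 0-based attempt: 1s, 2s, 4s."""
    return BASE_DELAY_S * (2 ** attempt)


def transcribe_chunks(chunk_paths, post_chunk, sleep=time.sleep):
    """Transcribe VAD chunks sequentially and join the results.

    `post_chunk(path)` stands in for the Dify /v1/audio-to-text call:
    it returns the chunk's transcribed text or raises on failure.
    """
    texts = []
    for path in chunk_paths:
        for attempt in range(MAX_RETRIES):
            try:
                texts.append(post_chunk(path))
                break
            except Exception:
                if attempt == MAX_RETRIES - 1:
                    raise  # the real endpoint would surface this as HTTP 502
                sleep(backoff_delay(attempt))
    return " ".join(texts)
```

In the real backend, `post_chunk` would presumably be a multipart POST with the Bearer key and the 2-minute per-chunk timeout the spec requires; joining with a single space matches the design note about concatenating results "with proper spacing".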