## ADDED Requirements

### Requirement: Audio File Upload
The Electron client SHALL allow users to upload pre-recorded audio files for transcription.
#### Scenario: Upload audio file
- WHEN user clicks "Upload Audio" button in meeting detail page
- THEN file picker SHALL open with filter for supported audio formats (MP3, WAV, M4A, WebM, OGG)
#### Scenario: Show upload progress
- WHEN audio file is being uploaded
- THEN progress indicator SHALL be displayed showing upload percentage
#### Scenario: Show transcription progress
- WHEN audio file is being transcribed in chunks
- THEN progress indicator SHALL display "Processing chunk X of Y"
#### Scenario: Replace existing transcript
- WHEN user uploads audio file and transcript already has content
- THEN confirmation dialog SHALL appear before replacing existing transcript
#### Scenario: File size limit
- WHEN user selects audio file larger than 500MB
- THEN error message SHALL be displayed indicating file size limit
### Requirement: VAD-Based Audio Segmentation
The sidecar SHALL segment large audio files using Voice Activity Detection before cloud transcription.
#### Scenario: Segment audio command
- WHEN sidecar receives `{"action": "segment_audio", "file_path": "...", "max_chunk_seconds": 300}`
- THEN it SHALL load audio file and run VAD to detect speech boundaries
#### Scenario: Split at silence boundaries
- WHEN VAD detects silence gap >= 500ms within max chunk duration
- THEN audio SHALL be split at the silence boundary
- AND each chunk exported as WAV file to temp directory
#### Scenario: Force split for continuous speech
- WHEN speech continues beyond max_chunk_seconds without silence gap
- THEN audio SHALL be force-split at max_chunk_seconds boundary
#### Scenario: Return segment metadata
- WHEN segmentation completes
- THEN sidecar SHALL return list of segments with file paths and timestamps
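The spec does not name a VAD implementation. The sketch below assumes the `silero-vad` package for speech detection and `pydub` (with ffmpeg) for slicing and export; `segment_audio()` and the returned metadata shape are illustrative, not the sidecar's actual handler.

```python
# Minimal sketch of the segmentation step, assuming silero-vad + pydub.
import tempfile
from pathlib import Path

from pydub import AudioSegment
from silero_vad import load_silero_vad, read_audio, get_speech_timestamps

def segment_audio(file_path: str, max_chunk_seconds: float = 300.0) -> list[dict]:
    model = load_silero_vad()
    wav = read_audio(file_path)  # resampled to 16 kHz mono for the VAD model
    speech = get_speech_timestamps(
        wav, model, return_seconds=True, min_silence_duration_ms=500
    )
    # Candidate cut points: midpoints of silence gaps >= 500 ms.
    cuts = [
        (a["end"] + b["start"]) / 2
        for a, b in zip(speech, speech[1:])
        if b["start"] - a["end"] >= 0.5
    ]

    audio = AudioSegment.from_file(file_path)
    total = len(audio) / 1000.0  # pydub lengths are in milliseconds
    out_dir = Path(tempfile.mkdtemp(prefix="segments_"))
    segments: list[dict] = []
    start = 0.0
    while start < total:
        window_end = start + max_chunk_seconds
        in_window = [c for c in cuts if start < c <= window_end]
        # Prefer the last silence boundary inside the window; force-split
        # when speech runs past max_chunk_seconds with no usable gap.
        end = in_window[-1] if in_window else min(window_end, total)
        path = out_dir / f"chunk_{len(segments):04d}.wav"
        audio[int(start * 1000):int(end * 1000)].export(str(path), format="wav")
        segments.append({"file_path": str(path), "start": start, "end": end})
        start = end
    return segments
```

A handler for the `segment_audio` action could wrap this function and return the segment list as the command's JSON response.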
### Requirement: Dify Speech-to-Text Integration
The backend SHALL integrate with Dify STT service for audio file transcription.
#### Scenario: Transcribe uploaded audio with chunking
- WHEN backend receives POST /api/ai/transcribe-audio with audio file
- THEN backend SHALL call sidecar for VAD segmentation
- AND send each chunk to Dify STT API sequentially
- AND concatenate results into final transcript
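As a sketch of this flow: Dify's published API exposes speech-to-text at `POST /audio-to-text` (multipart `file` plus a `user` field, returning `{"text": ...}`). The base URL, key handling, and function names below are illustrative assumptions, and `segment_audio()` is the sketch above.

```python
# Sketch of the chunked transcription flow against Dify's audio-to-text API.
from pathlib import Path

import requests

DIFY_BASE_URL = "https://api.dify.ai/v1"  # or a self-hosted Dify instance

def transcribe_chunk(path: str, api_key: str, user: str) -> str:
    with open(path, "rb") as f:
        resp = requests.post(
            f"{DIFY_BASE_URL}/audio-to-text",
            headers={"Authorization": f"Bearer {api_key}"},
            files={"file": (Path(path).name, f, "audio/wav")},
            data={"user": user},
            timeout=120,  # the spec's 2-minute per-chunk timeout
        )
    resp.raise_for_status()
    return resp.json()["text"]

def transcribe_file(file_path: str, api_key: str, user: str) -> str:
    # Stands in for the backend's IPC call to the sidecar's VAD segmentation.
    segments = segment_audio(file_path)
    # Chunks are sent sequentially and concatenated in order.
    return " ".join(
        transcribe_chunk(seg["file_path"], api_key, user) for seg in segments
    )
```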
#### Scenario: Supported audio formats
- WHEN audio file is in MP3, WAV, M4A, WebM, or OGG format
- THEN system SHALL accept and process the file
#### Scenario: Unsupported format handling
- WHEN audio file format is not supported
- THEN backend SHALL return HTTP 400 with error message listing supported formats
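A minimal format guard matching these two scenarios, assuming a FastAPI backend (the spec does not name the framework):

```python
# Format guard for the transcribe-audio endpoint; FastAPI's HTTPException
# is an assumption — the spec only fixes the HTTP 400 behavior.
from pathlib import Path

from fastapi import HTTPException

SUPPORTED_EXTENSIONS = {".mp3", ".wav", ".m4a", ".webm", ".ogg"}

def validate_audio_format(filename: str) -> None:
    ext = Path(filename).suffix.lower()
    if ext not in SUPPORTED_EXTENSIONS:
        raise HTTPException(
            status_code=400,
            detail=(
                f"Unsupported audio format '{ext}'. "
                f"Supported formats: {', '.join(sorted(SUPPORTED_EXTENSIONS))}"
            ),
        )
```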
#### Scenario: Dify chunk transcription
- WHEN backend sends audio chunk to Dify STT API
- THEN chunk size SHALL be under 25MB to comply with API limits
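A simple pre-flight check can enforce the 25MB ceiling before each chunk is sent; `check_chunk_size()` and the exception choice are illustrative:

```python
from pathlib import Path

MAX_CHUNK_BYTES = 25 * 1024 * 1024  # Dify STT per-request limit from the spec

def check_chunk_size(path: str) -> None:
    size = Path(path).stat().st_size
    if size > MAX_CHUNK_BYTES:
        # 300 s of 16 kHz mono 16-bit WAV is ~10 MB, well under the limit,
        # so this mainly guards against misconfiguration (e.g. exporting
        # high-sample-rate stereo chunks).
        raise ValueError(
            f"{path} is {size} bytes; reduce max_chunk_seconds to stay under 25MB"
        )
```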
#### Scenario: Transcription timeout per chunk
- WHEN Dify STT does not respond for a chunk within 2 minutes
- THEN backend SHALL retry up to 3 times with exponential backoff
#### Scenario: Dify STT error handling
- WHEN Dify STT API returns error after retries
- THEN backend SHALL return HTTP 502 with error details
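The timeout, retry, and 502 scenarios can be realized together in a small wrapper around `transcribe_chunk()` from the sketch above. The backoff schedule here is an illustrative choice; the spec fixes only the 2-minute timeout, the three retries, and the 502 status:

```python
# Retry wrapper for a single chunk: up to 3 attempts with exponential
# backoff, surfacing HTTP 502 once retries are exhausted.
import time

import requests
from fastapi import HTTPException

def transcribe_chunk_with_retry(path: str, api_key: str, user: str,
                                retries: int = 3, base_delay: float = 2.0) -> str:
    last_error: Exception | None = None
    for attempt in range(retries):
        try:
            return transcribe_chunk(path, api_key, user)  # 120 s timeout inside
        except (requests.Timeout, requests.HTTPError) as exc:
            last_error = exc
            time.sleep(base_delay * 2 ** attempt)  # 2 s, 4 s, 8 s
    raise HTTPException(
        status_code=502,
        detail=f"Dify STT failed after {retries} attempts: {last_error}",
    )
```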
### Requirement: Dual Transcription Mode
The system SHALL support both real-time local transcription and file-based cloud transcription.
#### Scenario: Real-time transcription unchanged
- WHEN user records audio in real-time
- THEN local sidecar SHALL process audio using faster-whisper (existing behavior)
#### Scenario: File upload uses cloud transcription
- WHEN user uploads audio file
- THEN Dify cloud service SHALL process audio via chunked upload
#### Scenario: Unified transcript output
- WHEN transcription completes from either source
- THEN result SHALL be displayed in the same transcript area in meeting detail page