transcription Specification
Purpose
TBD - created by archiving change add-meeting-assistant-mvp. Update Purpose after archive.
Requirements
Requirement: Edge Speech-to-Text
The Electron client SHALL perform speech-to-text conversion locally using the faster-whisper int8 model.
Scenario: Successful transcription
- WHEN user records audio during a meeting
- THEN the audio SHALL be transcribed locally without network dependency
Scenario: Transcription on target hardware
- WHEN running on an i5 processor with 8 GB RAM
- THEN transcription SHALL complete within acceptable latency for real-time display
Requirement: Traditional Chinese Output
The transcription engine SHALL output Traditional Chinese (繁體中文) text.
Scenario: Simplified to Traditional conversion
- WHEN whisper outputs Simplified Chinese characters
- THEN OpenCC SHALL convert output to Traditional Chinese
Scenario: Native Traditional Chinese
- WHEN whisper outputs Traditional Chinese directly
- THEN the text SHALL pass through unchanged
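The conversion step above can be sketched with the `opencc` Python package. The `"s2twp"` config (Simplified to Taiwan-standard Traditional, with common phrase substitutions) is an assumption about the sidecar's setup, not confirmed by this spec; the fallback branch is likewise illustrative. Text that is already Traditional passes through unchanged, which covers the second scenario.

```python
# Sketch of the Simplified-to-Traditional conversion step (assumed config:
# "s2twp" — Simplified -> Taiwan-standard Traditional with phrase mapping).
def to_traditional(text: str) -> str:
    try:
        from opencc import OpenCC  # real conversion when opencc is installed
    except ImportError:
        return text  # illustrative fallback: pass text through unchanged
    return OpenCC("s2twp").convert(text)
```

Because already-Traditional text maps to itself, the conversion is safe to apply unconditionally to every whisper output.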
Requirement: Real-time Display
The Electron client SHALL display transcription results in real-time.
Scenario: Streaming transcription
- WHEN user is recording
- THEN transcribed text SHALL appear in the left panel within seconds of speech
Requirement: Python Sidecar
The transcription engine SHALL be packaged as a Python sidecar using PyInstaller.
Scenario: Sidecar startup
- WHEN Electron app launches
- THEN the Python sidecar containing faster-whisper and OpenCC SHALL be available
Scenario: Sidecar communication
- WHEN Electron sends audio data to sidecar
- THEN transcribed text SHALL be returned via IPC
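The sidecar scenarios above imply a line-oriented JSON protocol over stdin/stdout, which is how the later streaming commands are framed. A minimal sketch of that loop, assuming one JSON object per line in each direction (the `ping` action is a hypothetical health check, not part of the spec):

```python
# Minimal sketch of the sidecar's JSON-lines IPC loop: one JSON command
# per stdin line, one JSON response per stdout line.
import json
import sys

def handle_command(cmd: dict) -> dict:
    action = cmd.get("action")
    if action == "ping":  # hypothetical health-check command
        return {"status": "ok"}
    return {"status": "error", "message": f"unknown action: {action}"}

def main() -> None:
    for line in sys.stdin:
        line = line.strip()
        if not line:
            continue
        response = handle_command(json.loads(line))
        sys.stdout.write(json.dumps(response) + "\n")
        sys.stdout.flush()  # flush so Electron receives each reply immediately
```

Flushing after every response matters here: without it, stdout buffering in the PyInstaller-packaged process would delay replies to Electron.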
Requirement: Streaming Transcription Mode
The sidecar SHALL support a streaming mode where audio chunks are continuously received and transcribed in real-time with VAD-triggered segmentation.
Scenario: Start streaming session
- WHEN sidecar receives the `{"action": "start_stream"}` command
- THEN it SHALL initialize the audio buffer and VAD processor
- AND respond with `{"status": "streaming", "session_id": "<uuid>"}`
Scenario: Process audio chunk
- WHEN sidecar receives `{"action": "audio_chunk", "data": "<base64_pcm>"}` during an active stream
- THEN it SHALL append the audio to the buffer and run VAD detection
- AND if a speech boundary is detected, transcribe the accumulated audio
- AND emit `{"segment_id": <int>, "text": "<transcription>", "is_final": true}`
Scenario: Stop streaming session
- WHEN sidecar receives the `{"action": "stop_stream"}` command
- THEN it SHALL transcribe any remaining buffered audio
- AND respond with `{"status": "stream_stopped", "total_segments": <int>}`
Requirement: VAD-based Speech Segmentation
The sidecar SHALL use Voice Activity Detection to identify natural speech boundaries for segmentation.
Scenario: Detect speech end
- WHEN VAD detects silence exceeding 500ms after speech
- THEN the accumulated speech audio SHALL be sent for transcription
- AND a new segment SHALL begin for subsequent speech
Scenario: Handle continuous speech
- WHEN speech continues for more than 15 seconds without pause
- THEN the sidecar SHALL force a segment boundary
- AND transcribe the 15-second chunk to prevent excessive latency
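The two boundary rules above reduce to a single decision function. The thresholds come directly from the requirement; the VAD itself (e.g. silero-vad or webrtcvad, neither of which this spec names) is out of scope here.

```python
# Segment-cut decision per the requirement: cut on >= 500 ms trailing
# silence, or force a cut once a segment reaches 15 s of continuous speech.
SILENCE_THRESHOLD_MS = 500
MAX_SEGMENT_MS = 15_000

def should_cut_segment(trailing_silence_ms: float, segment_duration_ms: float) -> bool:
    if trailing_silence_ms >= SILENCE_THRESHOLD_MS:
        return True   # natural boundary: the speaker paused
    if segment_duration_ms >= MAX_SEGMENT_MS:
        return True   # forced boundary: cap latency on continuous speech
    return False
```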
Requirement: Punctuation in Transcription Output
The sidecar SHALL output transcribed text with appropriate Chinese punctuation marks.
Scenario: Add sentence-ending punctuation
- WHEN transcription completes for a segment
- THEN the output SHALL include a period (。) at natural sentence boundaries
- AND question marks (?) for interrogative sentences
- AND commas (,) for clause breaks within sentences
Scenario: Detect question patterns
- WHEN transcribed text ends with question particles (嗎、呢、什麼、怎麼、為什麼)
- THEN the punctuation processor SHALL append question mark (?)
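The particle rule above can be sketched as a small heuristic. The particle list comes from the scenario; real punctuation restoration (including the clause-level commas required earlier) would need more than suffix matching, so this only covers sentence-final marks.

```python
# Sentence-final punctuation heuristic: append 「?」 after question
# particles from the scenario, 「。」 otherwise; leave punctuated text alone.
QUESTION_PARTICLES = ("嗎", "呢", "什麼", "怎麼", "為什麼")

def add_final_punctuation(text: str) -> str:
    text = text.rstrip()
    if not text or text[-1] in "。?!,":
        return text  # empty or already punctuated
    if text.endswith(QUESTION_PARTICLES):
        return text + "?"
    return text + "。"
```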
Requirement: Audio File Upload
The Electron client SHALL allow users to upload pre-recorded audio files for transcription.
Scenario: Upload audio file
- WHEN user clicks "Upload Audio" button in meeting detail page
- THEN file picker SHALL open with filter for supported audio formats (MP3, WAV, M4A, WebM, OGG)
Scenario: Show upload progress
- WHEN audio file is being uploaded
- THEN progress indicator SHALL be displayed showing upload percentage
Scenario: Show transcription progress
- WHEN audio file is being transcribed in chunks
- THEN progress indicator SHALL display "Processing chunk X of Y"
Scenario: Replace existing transcript
- WHEN user uploads audio file and transcript already has content
- THEN confirmation dialog SHALL appear before replacing existing transcript
Scenario: File size limit
- WHEN user selects audio file larger than 500MB
- THEN error message SHALL be displayed indicating file size limit
Requirement: VAD-Based Audio Segmentation
The sidecar SHALL segment large audio files using Voice Activity Detection before cloud transcription.
Scenario: Segment audio command
- WHEN sidecar receives `{"action": "segment_audio", "file_path": "...", "max_chunk_seconds": 300}`
- THEN it SHALL load the audio file and run VAD to detect speech boundaries
Scenario: Split at silence boundaries
- WHEN VAD detects silence gap >= 500ms within max chunk duration
- THEN audio SHALL be split at the silence boundary
- AND each chunk exported as WAV file to temp directory
Scenario: Force split for continuous speech
- WHEN speech continues beyond max_chunk_seconds without silence gap
- THEN audio SHALL be force-split at max_chunk_seconds boundary
Scenario: Return segment metadata
- WHEN segmentation completes
- THEN sidecar SHALL return list of segments with file paths and timestamps
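One way to read the three splitting scenarios together: VAD yields speech regions as `(start, end)` timestamps, adjacent regions are merged into a chunk until adding the next one would exceed `max_chunk_seconds`, and a single region longer than the cap is force-split. This sketch produces only the segment metadata; audio loading and WAV export to the temp directory are omitted, and all names are illustrative.

```python
# Plan chunk boundaries from VAD speech regions, per the scenarios above:
# split at silence gaps, force-split runs longer than max_chunk_seconds.
def plan_chunks(speech_regions: list[tuple[float, float]],
                max_chunk_seconds: float = 300.0) -> list[dict]:
    chunks: list[dict] = []

    def emit(start: float, end: float) -> None:
        chunks.append({"index": len(chunks), "start_sec": start, "end_sec": end})

    cur_start = cur_end = None
    for start, end in speech_regions:
        # force-split a single speech region that exceeds the cap
        while end - start > max_chunk_seconds:
            if cur_start is not None:
                emit(cur_start, cur_end)
                cur_start = cur_end = None
            emit(start, start + max_chunk_seconds)
            start += max_chunk_seconds
        if cur_start is None:
            cur_start, cur_end = start, end
        elif end - cur_start <= max_chunk_seconds:
            cur_end = end             # extend the chunk across the silence gap
        else:
            emit(cur_start, cur_end)  # close the chunk at the silence boundary
            cur_start, cur_end = start, end
    if cur_start is not None:
        emit(cur_start, cur_end)
    return chunks
```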
Requirement: Dify Speech-to-Text Integration
The backend SHALL integrate with Dify STT service for audio file transcription.
Scenario: Transcribe uploaded audio with chunking
- WHEN backend receives POST /api/ai/transcribe-audio with audio file
- THEN backend SHALL call sidecar for VAD segmentation
- AND send each chunk to Dify STT API sequentially
- AND concatenate results into final transcript
Scenario: Supported audio formats
- WHEN audio file is in MP3, WAV, M4A, WebM, or OGG format
- THEN system SHALL accept and process the file
Scenario: Unsupported format handling
- WHEN audio file format is not supported
- THEN backend SHALL return HTTP 400 with error message listing supported formats
Scenario: Dify chunk transcription
- WHEN backend sends audio chunk to Dify STT API
- THEN chunk size SHALL be under 25MB to comply with API limits
Scenario: Transcription timeout per chunk
- WHEN Dify STT does not respond for a chunk within 2 minutes
- THEN backend SHALL retry up to 3 times with exponential backoff
Scenario: Dify STT error handling
- WHEN Dify STT API returns error after retries
- THEN backend SHALL return HTTP 502 with error details
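The retry policy above (2-minute timeout per attempt, up to 3 retries with exponential backoff, HTTP 502 after exhaustion) can be sketched with the Dify call injected as a callable, which keeps the policy testable without the network. The endpoint details and the `TranscriptionError` name are assumptions, not Dify's API.

```python
# Per-chunk retry policy: 120 s timeout per attempt, 3 retries with
# exponential backoff (1 s, 2 s, 4 s); the Dify STT call is injected.
import time

CHUNK_TIMEOUT_SECONDS = 120
MAX_RETRIES = 3

class TranscriptionError(Exception):
    """Raised when a chunk still fails after all retries (maps to HTTP 502)."""

def transcribe_chunk_with_retry(call_dify, chunk_path: str,
                                sleep=time.sleep) -> str:
    last_error = None
    for attempt in range(MAX_RETRIES + 1):  # initial attempt + 3 retries
        try:
            return call_dify(chunk_path, timeout=CHUNK_TIMEOUT_SECONDS)
        except Exception as exc:  # timeout or API error
            last_error = exc
            if attempt < MAX_RETRIES:
                sleep(2 ** attempt)  # exponential backoff between attempts
    raise TranscriptionError(f"chunk {chunk_path} failed: {last_error}")
```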
Requirement: Dual Transcription Mode
The system SHALL support both real-time local transcription and file-based cloud transcription.
Scenario: Real-time transcription unchanged
- WHEN user records audio in real-time
- THEN local sidecar SHALL process audio using faster-whisper (existing behavior)
Scenario: File upload uses cloud transcription
- WHEN user uploads audio file
- THEN Dify cloud service SHALL process audio via chunked upload
Scenario: Unified transcript output
- WHEN transcription completes from either source
- THEN result SHALL be displayed in the same transcript area in meeting detail page
Requirement: Model Download Progress Display
The sidecar SHALL report Whisper model download progress to enable UI feedback.
Scenario: Emit download start
- WHEN Whisper model download begins
- THEN sidecar SHALL emit JSON to stdout: `{"status": "downloading_model", "model": "<size>", "progress": 0, "total_mb": <size>}`
Scenario: Emit download progress
- WHEN download progress updates
- THEN sidecar SHALL emit JSON: `{"status": "downloading_model", "progress": <percent>, "downloaded_mb": <current>, "total_mb": <total>}`
- AND progress updates SHALL occur at least every 5% or every 5 seconds
Scenario: Emit download complete
- WHEN model download completes
- THEN sidecar SHALL emit JSON: `{"status": "model_downloaded", "model": "<size>"}`
- AND proceed to model loading
Scenario: Skip download for cached model
- WHEN model already exists in the Hugging Face cache
- THEN sidecar SHALL NOT emit download progress messages
- AND proceed directly to loading
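The emit scenarios above amount to a throttled reporter: a progress line goes out when either 5% or 5 seconds have elapsed since the last one. This sketch covers only the throttle and the message shapes; the actual download loop (huggingface_hub) is out of scope, and the class name is illustrative. The injectable clock exists only to make the throttle testable.

```python
# Throttled download-progress reporter: JSON status lines on stdout,
# emitted on every 5% step or every 5 s, whichever comes first.
import json
import sys
import time

class DownloadProgressReporter:
    def __init__(self, model: str, total_mb: float, now=time.monotonic):
        self.model = model
        self.total_mb = total_mb
        self.now = now
        self.last_percent = 0.0
        self.last_time = now()

    def _emit(self, payload: dict) -> None:
        sys.stdout.write(json.dumps(payload) + "\n")
        sys.stdout.flush()

    def start(self) -> None:
        self._emit({"status": "downloading_model", "model": self.model,
                    "progress": 0, "total_mb": self.total_mb})

    def update(self, downloaded_mb: float) -> bool:
        """Emit a progress line if the 5% / 5 s throttle allows; return True if emitted."""
        percent = 100.0 * downloaded_mb / self.total_mb
        if percent - self.last_percent < 5 and self.now() - self.last_time < 5:
            return False  # throttled: too soon and too little progress
        self.last_percent = percent
        self.last_time = self.now()
        self._emit({"status": "downloading_model", "progress": round(percent),
                    "downloaded_mb": downloaded_mb, "total_mb": self.total_mb})
        return True

    def done(self) -> None:
        self._emit({"status": "model_downloaded", "model": self.model})
```

For the cached-model scenario, the sidecar would simply never construct a reporter and proceed straight to loading.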
Requirement: Frontend Model Download Progress Display
The Electron frontend SHALL display model download progress to users.
Scenario: Show download progress in transcript panel
- WHEN sidecar emits download progress
- THEN whisper status element SHALL display download percentage and size
- AND format: "Downloading: XX% (YYY MB / ZZZ MB)"
Scenario: Show download complete
- WHEN sidecar emits model_downloaded status
- THEN whisper status element SHALL briefly show "Model downloaded"
- AND transition to loading state
Scenario: Forward progress events via IPC
- WHEN main process receives download progress from sidecar
- THEN it SHALL forward to the renderer via the `model-download-progress` IPC channel