# transcription Specification

## Purpose

TBD - created by archiving change add-meeting-assistant-mvp. Update Purpose after archive.

## Requirements

### Requirement: Edge Speech-to-Text

The Electron client SHALL perform speech-to-text conversion locally using the faster-whisper int8 model.

#### Scenario: Successful transcription

- **WHEN** user records audio during a meeting
- **THEN** the audio SHALL be transcribed locally without network dependency

#### Scenario: Transcription on target hardware

- **WHEN** running on an Intel i5 processor with 8GB RAM
- **THEN** transcription SHALL complete within acceptable latency for real-time display

### Requirement: Traditional Chinese Output

The transcription engine SHALL output Traditional Chinese (繁體中文) text.

#### Scenario: Simplified to Traditional conversion

- **WHEN** Whisper outputs Simplified Chinese characters
- **THEN** OpenCC SHALL convert the output to Traditional Chinese

#### Scenario: Native Traditional Chinese

- **WHEN** Whisper outputs Traditional Chinese directly
- **THEN** the text SHALL pass through unchanged

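The conversion-plus-pass-through behavior above can be sketched as follows. To keep the sketch self-contained it uses a tiny hypothetical character table standing in for OpenCC's s2t dictionary; the real sidecar would call OpenCC's converter instead.

```python
# Tiny illustrative sample of an s2t mapping; the real sidecar would use OpenCC,
# not this hypothetical table.
S2T_SAMPLE = {"简": "簡", "体": "體", "会": "會", "议": "議", "记": "記", "录": "錄"}

def to_traditional(text: str) -> str:
    """Map Simplified characters to Traditional; Traditional input passes through unchanged."""
    return "".join(S2T_SAMPLE.get(ch, ch) for ch in text)
```

Characters with no Simplified/Traditional distinction fall through the lookup untouched, which is exactly the pass-through scenario.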
### Requirement: Real-time Display

The Electron client SHALL display transcription results in real-time.

#### Scenario: Streaming transcription

- **WHEN** user is recording
- **THEN** transcribed text SHALL appear in the left panel within seconds of speech

### Requirement: Python Sidecar

The transcription engine SHALL be packaged as a Python sidecar using PyInstaller.

#### Scenario: Sidecar startup

- **WHEN** Electron app launches
- **THEN** the Python sidecar containing faster-whisper and OpenCC SHALL be available

#### Scenario: Sidecar communication

- **WHEN** Electron sends audio data to sidecar
- **THEN** transcribed text SHALL be returned via IPC

### Requirement: Streaming Transcription Mode

The sidecar SHALL support a streaming mode where audio chunks are continuously received and transcribed in real-time with VAD-triggered segmentation.

#### Scenario: Start streaming session

- **WHEN** sidecar receives `{"action": "start_stream"}` command
- **THEN** it SHALL initialize audio buffer and VAD processor
- **AND** respond with `{"status": "streaming", "session_id": "<uuid>"}`

#### Scenario: Process audio chunk

- **WHEN** sidecar receives `{"action": "audio_chunk", "data": "<base64_pcm>"}` during active stream
- **THEN** it SHALL append audio to buffer and run VAD detection
- **AND** if speech boundary detected, transcribe accumulated audio
- **AND** emit `{"segment_id": <int>, "text": "<transcription>", "is_final": true}`

#### Scenario: Stop streaming session

- **WHEN** sidecar receives `{"action": "stop_stream"}` command
- **THEN** it SHALL transcribe any remaining buffered audio
- **AND** respond with `{"status": "stream_stopped", "total_segments": <int>}`

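The three commands above can be sketched as a small session object. The transcriber and VAD boundary detector are injected stubs (stand-ins for faster-whisper and the VAD processor), and the intermediate `{"status": "buffering"}` reply is an assumption this sketch adds, not something the spec requires:

```python
import base64
import uuid

class StreamSession:
    """Sketch of the streaming command protocol; transcription and VAD are injected stubs."""

    def __init__(self, transcribe, detect_boundary):
        self.transcribe = transcribe            # bytes -> str (stand-in for faster-whisper)
        self.detect_boundary = detect_boundary  # bytes -> bool (stand-in for VAD)
        self.buffer = b""
        self.segment_id = 0
        self.active = False

    def handle(self, cmd: dict) -> dict:
        action = cmd.get("action")
        if action == "start_stream":
            self.active = True
            self.buffer = b""
            self.segment_id = 0
            return {"status": "streaming", "session_id": str(uuid.uuid4())}
        if action == "audio_chunk" and self.active:
            self.buffer += base64.b64decode(cmd["data"])
            if self.detect_boundary(self.buffer):
                return self._emit_segment()
            return {"status": "buffering"}  # assumed reply; not specified above
        if action == "stop_stream":
            if self.buffer:
                self._emit_segment()  # final segment message would also be emitted here
            self.active = False
            return {"status": "stream_stopped", "total_segments": self.segment_id}
        return {"status": "error", "message": f"unexpected action: {action}"}

    def _emit_segment(self) -> dict:
        text = self.transcribe(self.buffer)
        self.buffer = b""
        self.segment_id += 1
        return {"segment_id": self.segment_id, "text": text, "is_final": True}
```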
### Requirement: VAD-based Speech Segmentation

The sidecar SHALL use Voice Activity Detection to identify natural speech boundaries for segmentation.

#### Scenario: Detect speech end

- **WHEN** VAD detects silence exceeding 500ms after speech
- **THEN** the accumulated speech audio SHALL be sent for transcription
- **AND** a new segment SHALL begin for subsequent speech

#### Scenario: Handle continuous speech

- **WHEN** speech continues for more than 15 seconds without pause
- **THEN** the sidecar SHALL force a segment boundary
- **AND** transcribe the 15-second chunk to prevent excessive latency

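The two rules combine into a single boundary predicate. The thresholds come from the spec; the function and parameter names are illustrative:

```python
SILENCE_MS = 500         # close a segment once trailing silence exceeds this
MAX_SEGMENT_MS = 15_000  # force a boundary for continuous speech

def should_close_segment(speech_ms: int, trailing_silence_ms: int) -> bool:
    """Apply both boundary rules; an empty segment is never closed."""
    if speech_ms <= 0:
        return False
    return trailing_silence_ms > SILENCE_MS or speech_ms > MAX_SEGMENT_MS
```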
### Requirement: Punctuation in Transcription Output

The sidecar SHALL output transcribed text with appropriate Chinese punctuation marks.

#### Scenario: Add sentence-ending punctuation

- **WHEN** transcription completes for a segment
- **THEN** the output SHALL include periods (。) at natural sentence boundaries
- **AND** question marks (?) for interrogative sentences
- **AND** commas (,) for clause breaks within sentences

#### Scenario: Detect question patterns

- **WHEN** transcribed text ends with question particles (嗎、呢、什麼、怎麼、為什麼)
- **THEN** the punctuation processor SHALL append a question mark (?)

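The question-particle rule can be sketched as below. This covers only sentence-ending punctuation; clause-level comma insertion needs more linguistic context than a suffix check, so it is out of scope for the sketch:

```python
# Sentence-ending particles that signal a question (the list from the spec).
QUESTION_PARTICLES = ("嗎", "呢", "什麼", "怎麼", "為什麼")

def add_ending_punctuation(text: str) -> str:
    """Append 。 or ? to a transcribed segment; a simplified sketch of the rule above."""
    text = text.rstrip("。?,")  # avoid doubling punctuation that is already present
    if text.endswith(QUESTION_PARTICLES):
        return text + "?"
    return text + "。"
```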
### Requirement: Audio File Upload

The Electron client SHALL allow users to upload pre-recorded audio files for transcription.

#### Scenario: Upload audio file

- **WHEN** user clicks "Upload Audio" button in meeting detail page
- **THEN** file picker SHALL open with filter for supported audio formats (MP3, WAV, M4A, WebM, OGG)

#### Scenario: Show upload progress

- **WHEN** audio file is being uploaded
- **THEN** progress indicator SHALL be displayed showing upload percentage

#### Scenario: Show transcription progress

- **WHEN** audio file is being transcribed in chunks
- **THEN** progress indicator SHALL display "Processing chunk X of Y"

#### Scenario: Replace existing transcript

- **WHEN** user uploads audio file and transcript already has content
- **THEN** confirmation dialog SHALL appear before replacing existing transcript

#### Scenario: File size limit

- **WHEN** user selects audio file larger than 500MB
- **THEN** error message SHALL be displayed indicating file size limit

### Requirement: VAD-Based Audio Segmentation

The sidecar SHALL segment large audio files using Voice Activity Detection before cloud transcription.

#### Scenario: Segment audio command

- **WHEN** sidecar receives `{"action": "segment_audio", "file_path": "...", "max_chunk_seconds": 300}`
- **THEN** it SHALL load audio file and run VAD to detect speech boundaries

#### Scenario: Split at silence boundaries

- **WHEN** VAD detects silence gap >= 500ms within max chunk duration
- **THEN** audio SHALL be split at the silence boundary
- **AND** each chunk exported as WAV file to temp directory

#### Scenario: Force split for continuous speech

- **WHEN** speech continues beyond max_chunk_seconds without silence gap
- **THEN** audio SHALL be force-split at max_chunk_seconds boundary

#### Scenario: Return segment metadata

- **WHEN** segmentation completes
- **THEN** sidecar SHALL return list of segments with file paths and timestamps

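Given the silence gaps VAD reports, chunk planning reduces to choosing cut points. A sketch under the assumption that gaps arrive as timestamps in seconds (function and parameter names are illustrative; exporting WAV files is left out):

```python
def plan_chunks(silence_gaps: list[float], duration_s: float,
                max_chunk_seconds: float = 300.0) -> list[tuple[float, float]]:
    """Choose (start, end) cut points: prefer the last silence gap inside each
    max_chunk_seconds window, force-split at the window edge when there is none."""
    chunks: list[tuple[float, float]] = []
    start = 0.0
    gaps = sorted(silence_gaps)
    while duration_s - start > max_chunk_seconds:
        limit = start + max_chunk_seconds
        inside = [t for t in gaps if start < t <= limit]
        cut = inside[-1] if inside else limit  # force split for continuous speech
        chunks.append((start, cut))
        start = cut
    chunks.append((start, duration_s))
    return chunks
```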
### Requirement: Dify Speech-to-Text Integration

The backend SHALL integrate with Dify STT service for audio file transcription.

#### Scenario: Transcribe uploaded audio with chunking

- **WHEN** backend receives POST /api/ai/transcribe-audio with audio file
- **THEN** backend SHALL call sidecar for VAD segmentation
- **AND** send each chunk to Dify STT API sequentially
- **AND** concatenate results into final transcript

#### Scenario: Supported audio formats

- **WHEN** audio file is in MP3, WAV, M4A, WebM, or OGG format
- **THEN** system SHALL accept and process the file

#### Scenario: Unsupported format handling

- **WHEN** audio file format is not supported
- **THEN** backend SHALL return HTTP 400 with error message listing supported formats

#### Scenario: Dify chunk transcription

- **WHEN** backend sends audio chunk to Dify STT API
- **THEN** chunk size SHALL be under 25MB to comply with API limits

#### Scenario: Transcription timeout per chunk

- **WHEN** Dify STT does not respond for a chunk within 2 minutes
- **THEN** backend SHALL retry up to 3 times with exponential backoff

#### Scenario: Dify STT error handling

- **WHEN** Dify STT API returns error after retries
- **THEN** backend SHALL return HTTP 502 with error details

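The 25MB limit, per-chunk timeout, and retry policy fit in one wrapper. Here `send` is an injected stand-in for the actual Dify STT call (assumed to raise `TimeoutError` when the 2-minute deadline passes), and the 1-second base delay is an assumed schedule since the spec only requires "exponential":

```python
import time

MAX_CHUNK_BYTES = 25 * 1024 * 1024  # Dify API chunk limit from the spec

def transcribe_chunk_with_retry(send, chunk: bytes, max_retries: int = 3,
                                base_delay_s: float = 1.0, sleep=time.sleep) -> str:
    """Send one chunk, retrying on timeout with exponential backoff."""
    if len(chunk) > MAX_CHUNK_BYTES:
        raise ValueError("chunk exceeds the 25MB Dify API limit")
    for attempt in range(max_retries + 1):
        try:
            return send(chunk)
        except TimeoutError:
            if attempt == max_retries:
                raise  # caller maps this to HTTP 502 with error details
            sleep(base_delay_s * 2 ** attempt)  # 1s, 2s, 4s (assumed schedule)
```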
### Requirement: Dual Transcription Mode

The system SHALL support both real-time local transcription and file-based cloud transcription.

#### Scenario: Real-time transcription unchanged

- **WHEN** user records audio in real-time
- **THEN** local sidecar SHALL process audio using faster-whisper (existing behavior)

#### Scenario: File upload uses cloud transcription

- **WHEN** user uploads audio file
- **THEN** Dify cloud service SHALL process audio via chunked upload

#### Scenario: Unified transcript output

- **WHEN** transcription completes from either source
- **THEN** result SHALL be displayed in the same transcript area in meeting detail page

### Requirement: Model Download Progress Display

The sidecar SHALL report Whisper model download progress to enable UI feedback.

#### Scenario: Emit download start

- **WHEN** Whisper model download begins
- **THEN** sidecar SHALL emit JSON to stdout: `{"status": "downloading_model", "model": "<size>", "progress": 0, "total_mb": <size>}`

#### Scenario: Emit download progress

- **WHEN** download progress updates
- **THEN** sidecar SHALL emit JSON: `{"status": "downloading_model", "progress": <percent>, "downloaded_mb": <current>, "total_mb": <total>}`
- **AND** progress updates SHALL occur at least every 5% or every 5 seconds

#### Scenario: Emit download complete

- **WHEN** model download completes
- **THEN** sidecar SHALL emit JSON: `{"status": "model_downloaded", "model": "<size>"}`
- **AND** proceed to model loading

#### Scenario: Skip download for cached model

- **WHEN** model already exists in the Hugging Face cache
- **THEN** sidecar SHALL NOT emit download progress messages
- **AND** proceed directly to loading

### Requirement: Frontend Model Download Progress Display

The Electron frontend SHALL display model download progress to users.

#### Scenario: Show download progress in transcript panel

- **WHEN** sidecar emits download progress
- **THEN** whisper status element SHALL display download percentage and size
- **AND** format: "Downloading: XX% (YYY MB / ZZZ MB)"

#### Scenario: Show download complete

- **WHEN** sidecar emits model_downloaded status
- **THEN** whisper status element SHALL briefly show "Model downloaded"
- **AND** transition to loading state

#### Scenario: Forward progress events via IPC

- **WHEN** main process receives download progress from sidecar
- **THEN** it SHALL forward to renderer via `model-download-progress` IPC channel