# transcription Specification
## Purpose
TBD - created by archiving change add-meeting-assistant-mvp. Update Purpose after archive.
## Requirements
### Requirement: Edge Speech-to-Text
The Electron client SHALL perform speech-to-text conversion locally using faster-whisper int8 model.
#### Scenario: Successful transcription
- **WHEN** user records audio during a meeting
- **THEN** the audio SHALL be transcribed locally without network dependency
#### Scenario: Transcription on target hardware
- **WHEN** running on i5 processor with 8GB RAM
- **THEN** transcription SHALL complete within acceptable latency for real-time display
### Requirement: Traditional Chinese Output
The transcription engine SHALL output Traditional Chinese (繁體中文) text.
#### Scenario: Simplified to Traditional conversion
- **WHEN** whisper outputs Simplified Chinese characters
- **THEN** OpenCC SHALL convert output to Traditional Chinese
#### Scenario: Native Traditional Chinese
- **WHEN** whisper outputs Traditional Chinese directly
- **THEN** the text SHALL pass through unchanged
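The convert-or-pass-through rule above can be sketched as follows. The real sidecar uses OpenCC for the character mapping; the tiny sample dict below is a hypothetical stand-in so the pass-through behavior is visible in isolation:

```python
# Sketch of the Simplified -> Traditional rule. In the real sidecar the
# mapping is OpenCC; S2T_SAMPLE below is an illustrative stand-in only.
S2T_SAMPLE = {"汉": "漢", "语": "語", "会": "會", "议": "議"}

def to_traditional(text: str) -> str:
    """Convert Simplified characters; Traditional text passes through unchanged."""
    return "".join(S2T_SAMPLE.get(ch, ch) for ch in text)
```

Because characters already in Traditional form have no mapping entry, native Traditional output is returned byte-for-byte unchanged, satisfying both scenarios with one code path.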
### Requirement: Real-time Display
The Electron client SHALL display transcription results in real-time.
#### Scenario: Streaming transcription
- **WHEN** user is recording
- **THEN** transcribed text SHALL appear in the left panel within seconds of speech
### Requirement: Python Sidecar
The transcription engine SHALL be packaged as a Python sidecar using PyInstaller.
#### Scenario: Sidecar startup
- **WHEN** Electron app launches
- **THEN** the Python sidecar containing faster-whisper and OpenCC SHALL be available
#### Scenario: Sidecar communication
- **WHEN** Electron sends audio data to sidecar
- **THEN** transcribed text SHALL be returned via IPC
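The sidecar IPC in the scenarios above is line-delimited JSON over stdio. A minimal sketch of that loop, assuming one JSON object per line in each direction (the `ping` action and exact reply shapes here are illustrative, not fixed by this spec):

```python
import json
import sys

def handle_message(line: str) -> dict:
    """Dispatch one JSON command received from Electron.
    Action names other than the streaming actions below are illustrative."""
    msg = json.loads(line)
    action = msg.get("action")
    if action == "ping":
        return {"status": "ok"}
    return {"status": "error", "message": f"unknown action: {action}"}

def main() -> None:
    # One JSON object per line on stdin; one JSON reply per line on stdout.
    # flush=True matters: Electron reads the pipe incrementally.
    for line in sys.stdin:
        if line.strip():
            print(json.dumps(handle_message(line)), flush=True)
```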
### Requirement: Streaming Transcription Mode
The sidecar SHALL support a streaming mode where audio chunks are continuously received and transcribed in real-time with VAD-triggered segmentation.
#### Scenario: Start streaming session
- **WHEN** sidecar receives `{"action": "start_stream"}` command
- **THEN** it SHALL initialize audio buffer and VAD processor
- **AND** respond with `{"status": "streaming", "session_id": "<uuid>"}`
#### Scenario: Process audio chunk
- **WHEN** sidecar receives `{"action": "audio_chunk", "data": "<base64_pcm>"}` during active stream
- **THEN** it SHALL append audio to buffer and run VAD detection
- **AND** if speech boundary detected, transcribe accumulated audio
- **AND** emit `{"segment_id": <int>, "text": "<transcription>", "is_final": true}`
#### Scenario: Stop streaming session
- **WHEN** sidecar receives `{"action": "stop_stream"}` command
- **THEN** it SHALL transcribe any remaining buffered audio
- **AND** respond with `{"status": "stream_stopped", "total_segments": <int>}`
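The three streaming scenarios describe a small state machine. A sketch, with VAD and the actual whisper call stubbed out as comments (the `buffered` intermediate reply is a hypothetical detail, not mandated above):

```python
import base64
import uuid

class StreamSession:
    """State machine for start_stream / audio_chunk / stop_stream."""

    def __init__(self):
        self.active = False
        self.buffer = bytearray()
        self.segments = 0

    def handle(self, msg: dict) -> dict:
        action = msg["action"]
        if action == "start_stream":
            self.active = True
            self.buffer.clear()
            self.segments = 0
            return {"status": "streaming", "session_id": str(uuid.uuid4())}
        if action == "audio_chunk" and self.active:
            self.buffer.extend(base64.b64decode(msg["data"]))
            # Real code runs VAD here and, at a speech boundary, transcribes
            # self.buffer and emits {"segment_id": ..., "text": ..., "is_final": true}.
            return {"status": "buffered", "bytes": len(self.buffer)}
        if action == "stop_stream":
            self.active = False
            # Real code transcribes any remaining buffered audio first.
            return {"status": "stream_stopped", "total_segments": self.segments}
        return {"status": "error", "message": f"unexpected action: {action}"}
```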
### Requirement: VAD-based Speech Segmentation
The sidecar SHALL use Voice Activity Detection to identify natural speech boundaries for segmentation.
#### Scenario: Detect speech end
- **WHEN** VAD detects silence exceeding 500ms after speech
- **THEN** the accumulated speech audio SHALL be sent for transcription
- **AND** a new segment SHALL begin for subsequent speech
#### Scenario: Handle continuous speech
- **WHEN** speech continues for more than 15 seconds without pause
- **THEN** the sidecar SHALL force a segment boundary
- **AND** transcribe the 15-second chunk to prevent excessive latency
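The two boundary rules above (silence ≥ 500 ms, or 15 s of continuous speech) reduce to a single cut decision. A minimal sketch, assuming the caller tracks accumulated segment length and trailing silence from the VAD:

```python
SILENCE_END_MS = 500       # silence that closes a segment
MAX_SEGMENT_MS = 15_000    # force a boundary after 15 s without a pause

def should_cut(segment_ms: int, trailing_silence_ms: int) -> bool:
    """True when accumulated audio should be sent for transcription:
    either VAD saw >= 500 ms of silence after speech, or the segment
    has run 15 s without a pause (latency cap)."""
    if segment_ms == 0:
        return False  # nothing accumulated yet
    return trailing_silence_ms >= SILENCE_END_MS or segment_ms >= MAX_SEGMENT_MS
```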
### Requirement: Punctuation in Transcription Output
The sidecar SHALL output transcribed text with appropriate Chinese punctuation marks.
#### Scenario: Add sentence-ending punctuation
- **WHEN** transcription completes for a segment
- **THEN** the output SHALL include period (。) at natural sentence boundaries
- **AND** question marks (？) for interrogative sentences
- **AND** commas (，) for clause breaks within sentences
#### Scenario: Detect question patterns
- **WHEN** transcribed text ends with question particles (嗎、呢、什麼、怎麼、為什麼)
- **THEN** the punctuation processor SHALL append question mark (？)
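The particle-based question detection above can be sketched as a simple heuristic (the period fallback and the de-duplication of existing marks are illustrative choices, not mandated by the scenarios):

```python
QUESTION_PARTICLES = ("嗎", "呢", "什麼", "怎麼", "為什麼")

def punctuate_segment(text: str) -> str:
    """Append sentence-ending punctuation to one transcribed segment:
    full-width question mark after question particles, else full-width period."""
    text = text.rstrip("。？，")  # avoid doubling marks whisper already emitted
    if text.endswith(QUESTION_PARTICLES):
        return text + "？"
    return text + "。"
```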
### Requirement: Audio File Upload
The Electron client SHALL allow users to upload pre-recorded audio files for transcription.
#### Scenario: Upload audio file
- **WHEN** user clicks "Upload Audio" button in meeting detail page
- **THEN** file picker SHALL open with filter for supported audio formats (MP3, WAV, M4A, WebM, OGG)
#### Scenario: Show upload progress
- **WHEN** audio file is being uploaded
- **THEN** progress indicator SHALL be displayed showing upload percentage
#### Scenario: Show transcription progress
- **WHEN** audio file is being transcribed in chunks
- **THEN** progress indicator SHALL display "Processing chunk X of Y"
#### Scenario: Replace existing transcript
- **WHEN** user uploads audio file and transcript already has content
- **THEN** confirmation dialog SHALL appear before replacing existing transcript
#### Scenario: File size limit
- **WHEN** user selects audio file larger than 500MB
- **THEN** error message SHALL be displayed indicating file size limit
### Requirement: VAD-Based Audio Segmentation
The sidecar SHALL segment large audio files using Voice Activity Detection before cloud transcription.
#### Scenario: Segment audio command
- **WHEN** sidecar receives `{"action": "segment_audio", "file_path": "...", "max_chunk_seconds": 300}`
- **THEN** it SHALL load audio file and run VAD to detect speech boundaries
#### Scenario: Split at silence boundaries
- **WHEN** VAD detects silence gap >= 500ms within max chunk duration
- **THEN** audio SHALL be split at the silence boundary
- **AND** each chunk exported as WAV file to temp directory
#### Scenario: Force split for continuous speech
- **WHEN** speech continues beyond max_chunk_seconds without silence gap
- **THEN** audio SHALL be force-split at max_chunk_seconds boundary
#### Scenario: Return segment metadata
- **WHEN** segmentation completes
- **THEN** sidecar SHALL return list of segments with file paths and timestamps
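Given VAD output as speech spans, the splitting rules above (cut at silence gaps, force-split at `max_chunk_seconds`) can be sketched as a pure planning step; the WAV export is omitted and the `path` values are illustrative placeholders:

```python
def plan_segments(speech_spans, max_chunk_seconds=300, min_gap=0.5):
    """Turn VAD speech spans [(start_s, end_s), ...] into chunk metadata.
    A gap of >= min_gap seconds starts a new chunk; any chunk longer than
    max_chunk_seconds is force-split at the duration cap."""
    chunks = []
    cur_start = cur_end = None
    for start, end in speech_spans:
        if cur_start is None:
            cur_start, cur_end = start, end
        elif start - cur_end >= min_gap or end - cur_start > max_chunk_seconds:
            chunks.append((cur_start, cur_end))
            cur_start, cur_end = start, end
        else:
            cur_end = end  # merge: gap too small to split on
        # Force-split a single over-long span at the duration cap.
        while cur_end - cur_start > max_chunk_seconds:
            chunks.append((cur_start, cur_start + max_chunk_seconds))
            cur_start = cur_start + max_chunk_seconds
    if cur_start is not None:
        chunks.append((cur_start, cur_end))
    return [
        {"index": i, "path": f"chunk_{i:03d}.wav", "start": s, "end": e}
        for i, (s, e) in enumerate(chunks)
    ]
```

The returned list is the "segment metadata" of the last scenario: one entry per exported chunk with its (hypothetical) temp-file path and source timestamps.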
### Requirement: Dify Speech-to-Text Integration
The backend SHALL integrate with Dify STT service for audio file transcription.
#### Scenario: Transcribe uploaded audio with chunking
- **WHEN** backend receives POST /api/ai/transcribe-audio with audio file
- **THEN** backend SHALL call sidecar for VAD segmentation
- **AND** send each chunk to Dify STT API sequentially
- **AND** concatenate results into final transcript
#### Scenario: Supported audio formats
- **WHEN** audio file is in MP3, WAV, M4A, WebM, or OGG format
- **THEN** system SHALL accept and process the file
#### Scenario: Unsupported format handling
- **WHEN** audio file format is not supported
- **THEN** backend SHALL return HTTP 400 with error message listing supported formats
#### Scenario: Dify chunk transcription
- **WHEN** backend sends audio chunk to Dify STT API
- **THEN** chunk size SHALL be under 25MB to comply with API limits
#### Scenario: Transcription timeout per chunk
- **WHEN** Dify STT does not respond for a chunk within 2 minutes
- **THEN** backend SHALL retry up to 3 times with exponential backoff
#### Scenario: Dify STT error handling
- **WHEN** Dify STT API returns error after retries
- **THEN** backend SHALL return HTTP 502 with error details
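The per-chunk limits above (25 MB size cap, 2-minute timeout, 3 retries with exponential backoff, then HTTP 502) can be sketched as one wrapper. `send` stands in for the actual Dify STT HTTP call, which is not specified here; the 1 s backoff base is an assumed starting value:

```python
import time

MAX_CHUNK_BYTES = 25 * 1024 * 1024  # Dify STT per-request size limit
CHUNK_TIMEOUT_S = 120               # 2-minute budget per chunk

def transcribe_chunk_with_retry(send, chunk, retries=3, base_delay=1.0, sleep=time.sleep):
    """Send one audio chunk via `send(chunk, timeout=...)` (a hypothetical
    callable wrapping the Dify STT request), retrying with exponential
    backoff (1 s, 2 s, 4 s) before giving up."""
    if len(chunk) > MAX_CHUNK_BYTES:
        raise ValueError("chunk exceeds 25MB Dify limit")
    last_error = None
    for attempt in range(retries):
        try:
            return send(chunk, timeout=CHUNK_TIMEOUT_S)
        except Exception as exc:  # timeout or API error
            last_error = exc
            sleep(base_delay * 2 ** attempt)
    # Callers map this to the HTTP 502 response of the last scenario.
    raise RuntimeError("Dify STT failed after retries") from last_error
```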
### Requirement: Dual Transcription Mode
The system SHALL support both real-time local transcription and file-based cloud transcription.
#### Scenario: Real-time transcription unchanged
- **WHEN** user records audio in real-time
- **THEN** local sidecar SHALL process audio using faster-whisper (existing behavior)
#### Scenario: File upload uses cloud transcription
- **WHEN** user uploads audio file
- **THEN** Dify cloud service SHALL process audio via chunked upload
#### Scenario: Unified transcript output
- **WHEN** transcription completes from either source
- **THEN** result SHALL be displayed in the same transcript area in meeting detail page
### Requirement: Model Download Progress Display
The sidecar SHALL report Whisper model download progress to enable UI feedback.
#### Scenario: Emit download start
- **WHEN** Whisper model download begins
- **THEN** sidecar SHALL emit JSON to stdout: `{"status": "downloading_model", "model": "<size>", "progress": 0, "total_mb": <size>}`
#### Scenario: Emit download progress
- **WHEN** download progress updates
- **THEN** sidecar SHALL emit JSON: `{"status": "downloading_model", "progress": <percent>, "downloaded_mb": <current>, "total_mb": <total>}`
- **AND** progress updates SHALL occur at least every 5% or every 5 seconds
#### Scenario: Emit download complete
- **WHEN** model download completes
- **THEN** sidecar SHALL emit JSON: `{"status": "model_downloaded", "model": "<size>"}`
- **AND** proceed to model loading
#### Scenario: Skip download for cached model
- **WHEN** model already exists in huggingface cache
- **THEN** sidecar SHALL NOT emit download progress messages
- **AND** proceed directly to loading
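The "at least every 5% or every 5 seconds" cadence above implies throttling on the sidecar side. A sketch, with the clock injectable for testing (the class name and constructor shape are illustrative; the emitted JSON keys come from the scenarios):

```python
import time

class DownloadProgressEmitter:
    """Throttle download progress messages to every 5% or every 5 seconds.
    `emit` is any callable that writes one JSON-serializable dict per event."""

    def __init__(self, emit, total_mb, clock=time.monotonic):
        self.emit = emit
        self.total_mb = total_mb
        self.clock = clock
        self.last_percent = -100  # forces an emit on the first update
        self.last_time = clock()

    def update(self, downloaded_mb):
        percent = int(100 * downloaded_mb / self.total_mb)
        now = self.clock()
        if percent - self.last_percent >= 5 or now - self.last_time >= 5:
            self.emit({
                "status": "downloading_model",
                "progress": percent,
                "downloaded_mb": downloaded_mb,
                "total_mb": self.total_mb,
            })
            self.last_percent = percent
            self.last_time = now
```

Skipping the cached-model case falls out naturally: if no download happens, `update` is never called and no progress messages reach stdout.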
### Requirement: Frontend Model Download Progress Display
The Electron frontend SHALL display model download progress to users.
#### Scenario: Show download progress in transcript panel
- **WHEN** sidecar emits download progress
- **THEN** whisper status element SHALL display download percentage and size
- **AND** format: "Downloading: XX% (YYY MB / ZZZ MB)"
#### Scenario: Show download complete
- **WHEN** sidecar emits model_downloaded status
- **THEN** whisper status element SHALL briefly show "Model downloaded"
- **AND** transition to loading state
#### Scenario: Forward progress events via IPC
- **WHEN** main process receives download progress from sidecar
- **THEN** it SHALL forward to renderer via `model-download-progress` IPC channel