
# transcription Specification

## Purpose

TBD - created by archiving change add-meeting-assistant-mvp. Update Purpose after archive.

## Requirements

### Requirement: Edge Speech-to-Text

The Electron client SHALL perform speech-to-text conversion locally using the faster-whisper int8 model.

#### Scenario: Successful transcription

- WHEN user records audio during a meeting
- THEN the audio SHALL be transcribed locally without network dependency

#### Scenario: Transcription on target hardware

- WHEN running on an i5 processor with 8GB RAM
- THEN transcription SHALL complete within acceptable latency for real-time display
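
A minimal sketch of the local engine, assuming the `small` model size and Chinese as the source language (the requirement fixes only the int8 compute type and offline execution):

```python
# Sketch: local int8 transcription with faster-whisper.
# Model size ("small") and language ("zh") are assumptions.
from faster_whisper import WhisperModel

model = WhisperModel("small", device="cpu", compute_type="int8")

def transcribe_locally(wav_path: str) -> str:
    """Transcribe a WAV file fully offline and return the joined text."""
    segments, _info = model.transcribe(wav_path, language="zh", beam_size=5)
    return "".join(segment.text for segment in segments)
```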

### Requirement: Traditional Chinese Output

The transcription engine SHALL output Traditional Chinese (繁體中文) text.

#### Scenario: Simplified to Traditional conversion

- WHEN whisper outputs Simplified Chinese characters
- THEN OpenCC SHALL convert output to Traditional Chinese

#### Scenario: Native Traditional Chinese

- WHEN whisper outputs Traditional Chinese directly
- THEN the text SHALL pass through unchanged
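
A sketch of the conversion step with OpenCC; the `s2t` conversion config is an assumption (a locale-aware config such as `s2twp` could equally apply):

```python
# Sketch: Simplified-to-Traditional post-processing with OpenCC.
from opencc import OpenCC

s2t = OpenCC("s2t")  # assumed config; "s2twp" adds Taiwan phrasing

def to_traditional(text: str) -> str:
    return s2t.convert(text)

assert to_traditional("会议记录") == "會議記錄"
assert to_traditional("會議記錄") == "會議記錄"  # already Traditional: unchanged
```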

### Requirement: Real-time Display

The Electron client SHALL display transcription results in real-time.

#### Scenario: Streaming transcription

- WHEN user is recording
- THEN transcribed text SHALL appear in the left panel within seconds of speech

### Requirement: Python Sidecar

The transcription engine SHALL be packaged as a Python sidecar using PyInstaller.

#### Scenario: Sidecar startup

- WHEN Electron app launches
- THEN the Python sidecar containing faster-whisper and OpenCC SHALL be available

#### Scenario: Sidecar communication

- WHEN Electron sends audio data to sidecar
- THEN transcribed text SHALL be returned via IPC
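
The scenarios above leave the IPC mechanism open. A common pattern for Electron sidecars is newline-delimited JSON over the child process's stdin/stdout, sketched here with a hypothetical `handle_transcribe` handler:

```python
# Sketch: newline-delimited JSON over stdio as the sidecar transport.
# The spec says "IPC" without naming a mechanism; stdio pipes from
# Electron's child_process are one common choice.
import json
import sys

def handle_transcribe(request: dict) -> dict:
    # Placeholder: decode request["data"] and run faster-whisper + OpenCC.
    return {"status": "ok", "text": ""}

def main() -> None:
    for line in sys.stdin:
        request = json.loads(line)
        if request.get("action") == "transcribe":
            response = handle_transcribe(request)
        else:
            response = {"status": "error", "message": "unknown action"}
        sys.stdout.write(json.dumps(response, ensure_ascii=False) + "\n")
        sys.stdout.flush()

if __name__ == "__main__":
    main()
```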

### Requirement: Streaming Transcription Mode

The sidecar SHALL support a streaming mode where audio chunks are continuously received and transcribed in real-time with VAD-triggered segmentation.

#### Scenario: Start streaming session

- WHEN sidecar receives `{"action": "start_stream"}` command
- THEN it SHALL initialize audio buffer and VAD processor
- AND respond with `{"status": "streaming", "session_id": "<uuid>"}`

#### Scenario: Process audio chunk

- WHEN sidecar receives `{"action": "audio_chunk", "data": "<base64_pcm>"}` during active stream
- THEN it SHALL append audio to buffer and run VAD detection
- AND if speech boundary detected, transcribe accumulated audio
- AND emit `{"segment_id": <int>, "text": "<transcription>", "is_final": true}`

#### Scenario: Stop streaming session

- WHEN sidecar receives `{"action": "stop_stream"}` command
- THEN it SHALL transcribe any remaining buffered audio
- AND respond with `{"status": "stream_stopped", "total_segments": <int>}`

### Requirement: VAD-based Speech Segmentation

The sidecar SHALL use Voice Activity Detection to identify natural speech boundaries for segmentation.

#### Scenario: Detect speech end

- WHEN VAD detects silence exceeding 500ms after speech
- THEN the accumulated speech audio SHALL be sent for transcription
- AND a new segment SHALL begin for subsequent speech

#### Scenario: Handle continuous speech

- WHEN speech continues for more than 15 seconds without pause
- THEN the sidecar SHALL force a segment boundary
- AND transcribe the 15-second chunk to prevent excessive latency
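
A sketch of these two boundary rules, using `webrtcvad` as an assumed VAD library (the spec does not name one) and 16 kHz 16-bit mono PCM in 30 ms frames:

```python
# Sketch: 500 ms trailing-silence boundary plus 15 s forced boundary.
import webrtcvad

SAMPLE_RATE = 16000
FRAME_MS = 30                                      # webrtcvad accepts 10/20/30 ms
FRAME_BYTES = SAMPLE_RATE * FRAME_MS // 1000 * 2   # per-frame size, 16-bit samples
SILENCE_LIMIT_MS = 500                             # "silence exceeding 500ms" rule
MAX_SEGMENT_MS = 15_000                            # forced boundary for continuous speech

vad = webrtcvad.Vad(2)  # aggressiveness 0 (least) .. 3 (most); 2 is an assumption

def should_flush(frames: list[bytes]) -> bool:
    """True when the buffered frames (FRAME_BYTES each) should go to the model."""
    silence_ms = 0
    for frame in reversed(frames):                 # count trailing silence
        if vad.is_speech(frame, SAMPLE_RATE):
            break
        silence_ms += FRAME_MS
    return silence_ms >= SILENCE_LIMIT_MS or len(frames) * FRAME_MS >= MAX_SEGMENT_MS
```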

### Requirement: Punctuation in Transcription Output

The sidecar SHALL output transcribed text with appropriate Chinese punctuation marks.

#### Scenario: Add sentence-ending punctuation

- WHEN transcription completes for a segment
- THEN the output SHALL include periods (。) at natural sentence boundaries
- AND question marks (？) for interrogative sentences
- AND commas (，) for clause breaks within sentences

#### Scenario: Detect question patterns

- WHEN transcribed text ends with question particles (嗎、呢、什麼、怎麼、為什麼)
- THEN the punctuation processor SHALL append a question mark (？)
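
A sketch of the sentence-final rules; the particle list comes straight from the scenario above, while mid-sentence comma insertion is left out:

```python
# Sketch: rule-based sentence-final punctuation for Chinese output.
QUESTION_PARTICLES = ("嗎", "呢", "什麼", "怎麼", "為什麼")

def punctuate(text: str) -> str:
    text = text.rstrip("。？，")                 # avoid doubling existing marks
    if text.endswith(QUESTION_PARTICLES):        # str.endswith accepts a tuple
        return text + "？"
    return text + "。"

assert punctuate("這樣可以嗎") == "這樣可以嗎？"
assert punctuate("我們下週再討論") == "我們下週再討論。"
```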

### Requirement: Audio File Upload

The Electron client SHALL allow users to upload pre-recorded audio files for transcription.

#### Scenario: Upload audio file

- WHEN user clicks "Upload Audio" button in meeting detail page
- THEN file picker SHALL open with filter for supported audio formats (MP3, WAV, M4A, WebM, OGG)

#### Scenario: Show upload progress

- WHEN audio file is being uploaded
- THEN progress indicator SHALL be displayed showing upload percentage

#### Scenario: Show transcription progress

- WHEN audio file is being transcribed in chunks
- THEN progress indicator SHALL display "Processing chunk X of Y"

#### Scenario: Replace existing transcript

- WHEN user uploads audio file and transcript already has content
- THEN confirmation dialog SHALL appear before replacing existing transcript

#### Scenario: File size limit

- WHEN user selects audio file larger than 500MB
- THEN error message SHALL be displayed indicating file size limit

### Requirement: VAD-Based Audio Segmentation

The sidecar SHALL segment large audio files using Voice Activity Detection before cloud transcription.

#### Scenario: Segment audio command

- WHEN sidecar receives `{"action": "segment_audio", "file_path": "...", "max_chunk_seconds": 300}`
- THEN it SHALL load audio file and run VAD to detect speech boundaries

#### Scenario: Split at silence boundaries

- WHEN VAD detects silence gap >= 500ms within max chunk duration
- THEN audio SHALL be split at the silence boundary
- AND each chunk exported as WAV file to temp directory

#### Scenario: Force split for continuous speech

- WHEN speech continues beyond max_chunk_seconds without silence gap
- THEN audio SHALL be force-split at max_chunk_seconds boundary

#### Scenario: Return segment metadata

- WHEN segmentation completes
- THEN sidecar SHALL return list of segments with file paths and timestamps
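
A sketch of the `segment_audio` flow, using pydub's energy-based silence detection as a stand-in for a dedicated VAD (the spec does not fix a library); the -40 dBFS threshold and temp-file layout are assumptions:

```python
# Sketch: split a file at >=500 ms silence gaps, force-splitting at
# max_chunk_seconds, and return per-chunk metadata.
import os
import tempfile
from pydub import AudioSegment
from pydub.silence import detect_silence

def segment_audio(file_path: str, max_chunk_seconds: int = 300) -> list[dict]:
    audio = AudioSegment.from_file(file_path)
    max_ms = max_chunk_seconds * 1000
    # [start_ms, end_ms] spans of >=500 ms silence; threshold is assumed
    silences = detect_silence(audio, min_silence_len=500, silence_thresh=-40)
    segments, start = [], 0
    while start < len(audio):                    # len() is duration in ms
        hard_end = min(start + max_ms, len(audio))
        # Split at the last qualifying silence inside the window, else force-split.
        end = max(
            (s_end for s_start, s_end in silences if start < s_start and s_end <= hard_end),
            default=hard_end,
        )
        fd, chunk_path = tempfile.mkstemp(suffix=".wav")
        os.close(fd)
        audio[start:end].export(chunk_path, format="wav")
        segments.append({"path": chunk_path, "start_ms": start, "end_ms": end})
        start = end
    return segments
```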

### Requirement: Dify Speech-to-Text Integration

The backend SHALL integrate with Dify STT service for audio file transcription.

#### Scenario: Transcribe uploaded audio with chunking

- WHEN backend receives `POST /api/ai/transcribe-audio` with audio file
- THEN backend SHALL call sidecar for VAD segmentation
- AND send each chunk to Dify STT API sequentially
- AND concatenate results into final transcript

#### Scenario: Supported audio formats

- WHEN audio file is in MP3, WAV, M4A, WebM, or OGG format
- THEN system SHALL accept and process the file

#### Scenario: Unsupported format handling

- WHEN audio file format is not supported
- THEN backend SHALL return HTTP 400 with error message listing supported formats
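
A sketch of the acceptance checks; the error payload shape is an assumption, while the format list and HTTP 400 come from the scenarios above and the 500MB limit from the Audio File Upload requirement:

```python
# Sketch: extension and size validation before transcription starts.
import os

SUPPORTED_EXTENSIONS = {".mp3", ".wav", ".m4a", ".webm", ".ogg"}
MAX_FILE_BYTES = 500 * 1024 * 1024  # 500MB, per the upload requirement

def validate_upload(filename: str, size_bytes: int) -> dict | None:
    """Return an HTTP-400-style error payload, or None when acceptable."""
    ext = os.path.splitext(filename)[1].lower()
    if ext not in SUPPORTED_EXTENSIONS:
        return {"status": 400,
                "error": f"Unsupported format; supported: {sorted(SUPPORTED_EXTENSIONS)}"}
    if size_bytes > MAX_FILE_BYTES:
        return {"status": 400, "error": "File exceeds the 500MB size limit"}
    return None
```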

#### Scenario: Dify chunk transcription

- WHEN backend sends audio chunk to Dify STT API
- THEN chunk size SHALL be under 25MB to comply with API limits

#### Scenario: Transcription timeout per chunk

- WHEN Dify STT does not respond for a chunk within 2 minutes
- THEN backend SHALL retry up to 3 times with exponential backoff

#### Scenario: Dify STT error handling

- WHEN Dify STT API returns error after retries
- THEN backend SHALL return HTTP 502 with error details
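
A sketch of per-chunk transcription with the retry policy above; the endpoint URL, auth header, and response field are assumptions, while the 25MB limit, 2-minute timeout, and 3 attempts come from the scenarios:

```python
# Sketch: one chunk -> Dify STT with timeout, retries, and backoff.
import time
import requests

DIFY_STT_URL = "https://example.dify.host/v1/audio-to-text"  # assumed endpoint
MAX_CHUNK_BYTES = 25 * 1024 * 1024
CHUNK_TIMEOUT_S = 120

def transcribe_chunk(chunk_path: str, api_key: str) -> str:
    with open(chunk_path, "rb") as f:
        data = f.read()
    if len(data) > MAX_CHUNK_BYTES:
        raise ValueError("chunk exceeds 25MB API limit")
    for attempt in range(3):
        try:
            resp = requests.post(
                DIFY_STT_URL,
                headers={"Authorization": f"Bearer {api_key}"},
                files={"file": ("chunk.wav", data, "audio/wav")},
                timeout=CHUNK_TIMEOUT_S,
            )
            resp.raise_for_status()
            return resp.json()["text"]        # assumed response shape
        except requests.RequestException:
            if attempt == 2:
                raise                          # caller maps this to HTTP 502
            time.sleep(2 ** attempt)           # backoff: waits 1s, then 2s
    raise RuntimeError("unreachable")
```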

### Requirement: Dual Transcription Mode

The system SHALL support both real-time local transcription and file-based cloud transcription.

#### Scenario: Real-time transcription unchanged

- WHEN user records audio in real-time
- THEN local sidecar SHALL process audio using faster-whisper (existing behavior)

#### Scenario: File upload uses cloud transcription

- WHEN user uploads audio file
- THEN Dify cloud service SHALL process audio via chunked upload

#### Scenario: Unified transcript output

- WHEN transcription completes from either source
- THEN result SHALL be displayed in the same transcript area in meeting detail page

### Requirement: Model Download Progress Display

The sidecar SHALL report Whisper model download progress to enable UI feedback.

#### Scenario: Emit download start

- WHEN Whisper model download begins
- THEN sidecar SHALL emit JSON to stdout: `{"status": "downloading_model", "model": "<size>", "progress": 0, "total_mb": <size>}`

#### Scenario: Emit download progress

- WHEN download progress updates
- THEN sidecar SHALL emit JSON: `{"status": "downloading_model", "progress": <percent>, "downloaded_mb": <current>, "total_mb": <total>}`
- AND progress updates SHALL occur at least every 5% or every 5 seconds

#### Scenario: Emit download complete

- WHEN model download completes
- THEN sidecar SHALL emit JSON: `{"status": "model_downloaded", "model": "<size>"}`
- AND proceed to model loading

#### Scenario: Skip download for cached model

- WHEN model already exists in huggingface cache
- THEN sidecar SHALL NOT emit download progress messages
- AND proceed directly to loading
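
A sketch of the throttled stdout protocol; how `report()` gets wired into the actual huggingface_hub download callback is left out:

```python
# Sketch: emit download progress at most "every 5% or every 5 seconds".
import json
import time

class ProgressEmitter:
    def __init__(self, total_mb: float):
        self.total_mb = total_mb
        self.last_percent = -100.0   # guarantees the first call emits
        self.last_emit = 0.0

    def report(self, downloaded_mb: float) -> None:
        percent = downloaded_mb / self.total_mb * 100
        now = time.monotonic()
        if percent - self.last_percent >= 5 or now - self.last_emit >= 5:
            print(json.dumps({
                "status": "downloading_model",
                "progress": round(percent),
                "downloaded_mb": round(downloaded_mb, 1),
                "total_mb": self.total_mb,
            }), flush=True)          # flush so Electron sees it immediately
            self.last_percent, self.last_emit = percent, now
```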

### Requirement: Frontend Model Download Progress Display

The Electron frontend SHALL display model download progress to users.

#### Scenario: Show download progress in transcript panel

- WHEN sidecar emits download progress
- THEN whisper status element SHALL display download percentage and size
- AND format: "Downloading: XX% (YYY MB / ZZZ MB)"

#### Scenario: Show download complete

- WHEN sidecar emits `model_downloaded` status
- THEN whisper status element SHALL briefly show "Model downloaded"
- AND transition to loading state

#### Scenario: Forward progress events via IPC

- WHEN main process receives download progress from sidecar
- THEN it SHALL forward to renderer via `model-download-progress` IPC channel