
# transcription Specification

## Purpose

TBD - created by archiving change add-meeting-assistant-mvp. Update Purpose after archive.

## Requirements

### Requirement: Edge Speech-to-Text

The Electron client SHALL perform speech-to-text conversion locally using the faster-whisper int8 model.

#### Scenario: Successful transcription

- WHEN user records audio during a meeting
- THEN the audio SHALL be transcribed locally without network dependency

#### Scenario: Transcription on target hardware

- WHEN running on an i5 processor with 8GB RAM
- THEN transcription SHALL complete within acceptable latency for real-time display
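
A minimal sketch of the local engine, assuming the `small` model size and Chinese as the source language (the requirement fixes only the int8 compute type and offline execution):

```python
# Sketch: local int8 transcription with faster-whisper.
# Model size ("small") and language ("zh") are assumptions.
from faster_whisper import WhisperModel

model = WhisperModel("small", device="cpu", compute_type="int8")

def transcribe_locally(wav_path: str) -> str:
    """Transcribe a WAV file fully offline and return the joined text."""
    segments, _info = model.transcribe(wav_path, language="zh", beam_size=5)
    return "".join(segment.text for segment in segments)
```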

### Requirement: Traditional Chinese Output

The transcription engine SHALL output Traditional Chinese (繁體中文) text.

#### Scenario: Simplified to Traditional conversion

- WHEN whisper outputs Simplified Chinese characters
- THEN OpenCC SHALL convert output to Traditional Chinese

#### Scenario: Native Traditional Chinese

- WHEN whisper outputs Traditional Chinese directly
- THEN the text SHALL pass through unchanged
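
A sketch of the conversion step with OpenCC; the `s2t` conversion config is an assumption (a locale-aware config such as `s2twp` could equally apply):

```python
# Sketch: Simplified-to-Traditional post-processing with OpenCC.
from opencc import OpenCC

s2t = OpenCC("s2t")  # assumed config; "s2twp" adds Taiwan phrasing

def to_traditional(text: str) -> str:
    return s2t.convert(text)

assert to_traditional("会议记录") == "會議記錄"
assert to_traditional("會議記錄") == "會議記錄"  # already Traditional: unchanged
```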

### Requirement: Real-time Display

The Electron client SHALL display transcription results in real-time.

#### Scenario: Streaming transcription

- WHEN user is recording
- THEN transcribed text SHALL appear in the left panel within seconds of speech

### Requirement: Python Sidecar

The transcription engine SHALL be packaged as a Python sidecar using PyInstaller.

#### Scenario: Sidecar startup

- WHEN Electron app launches
- THEN the Python sidecar containing faster-whisper and OpenCC SHALL be available

#### Scenario: Sidecar communication

- WHEN Electron sends audio data to sidecar
- THEN transcribed text SHALL be returned via IPC
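
The scenarios above leave the IPC mechanism open. A common pattern for Electron sidecars is newline-delimited JSON over the child process's stdin/stdout, sketched here with a hypothetical `handle_transcribe` handler:

```python
# Sketch: newline-delimited JSON over stdio as the sidecar transport.
# The spec says "IPC" without naming a mechanism; stdio pipes from
# Electron's child_process are one common choice.
import json
import sys

def handle_transcribe(request: dict) -> dict:
    # Placeholder: decode request["data"] and run faster-whisper + OpenCC.
    return {"status": "ok", "text": ""}

def main() -> None:
    for line in sys.stdin:
        request = json.loads(line)
        if request.get("action") == "transcribe":
            response = handle_transcribe(request)
        else:
            response = {"status": "error", "message": "unknown action"}
        sys.stdout.write(json.dumps(response, ensure_ascii=False) + "\n")
        sys.stdout.flush()

if __name__ == "__main__":
    main()
```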

### Requirement: Streaming Transcription Mode

The sidecar SHALL support a streaming mode where audio chunks are continuously received and transcribed in real-time with VAD-triggered segmentation.

#### Scenario: Start streaming session

- WHEN sidecar receives `{"action": "start_stream"}` command
- THEN it SHALL initialize audio buffer and VAD processor
- AND respond with `{"status": "streaming", "session_id": "<uuid>"}`

#### Scenario: Process audio chunk

- WHEN sidecar receives `{"action": "audio_chunk", "data": "<base64_pcm>"}` during active stream
- THEN it SHALL append audio to buffer and run VAD detection
- AND if speech boundary detected, transcribe accumulated audio
- AND emit `{"segment_id": <int>, "text": "<transcription>", "is_final": true}`

#### Scenario: Stop streaming session

- WHEN sidecar receives `{"action": "stop_stream"}` command
- THEN it SHALL transcribe any remaining buffered audio
- AND respond with `{"status": "stream_stopped", "total_segments": <int>}`

### Requirement: VAD-based Speech Segmentation

The sidecar SHALL use Voice Activity Detection to identify natural speech boundaries for segmentation.

#### Scenario: Detect speech end

- WHEN VAD detects silence exceeding 500ms after speech
- THEN the accumulated speech audio SHALL be sent for transcription
- AND a new segment SHALL begin for subsequent speech

#### Scenario: Handle continuous speech

- WHEN speech continues for more than 15 seconds without pause
- THEN the sidecar SHALL force a segment boundary
- AND transcribe the 15-second chunk to prevent excessive latency
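
A sketch of these two boundary rules, using `webrtcvad` as an assumed VAD library (the spec does not name one) and 16 kHz 16-bit mono PCM in 30 ms frames:

```python
# Sketch: 500 ms trailing-silence boundary plus 15 s forced boundary.
import webrtcvad

SAMPLE_RATE = 16000
FRAME_MS = 30                                      # webrtcvad accepts 10/20/30 ms
FRAME_BYTES = SAMPLE_RATE * FRAME_MS // 1000 * 2   # per-frame size, 16-bit samples
SILENCE_LIMIT_MS = 500                             # "silence exceeding 500ms" rule
MAX_SEGMENT_MS = 15_000                            # forced boundary for continuous speech

vad = webrtcvad.Vad(2)  # aggressiveness 0 (least) .. 3 (most); 2 is an assumption

def should_flush(frames: list[bytes]) -> bool:
    """True when the buffered frames (FRAME_BYTES each) should go to the model."""
    silence_ms = 0
    for frame in reversed(frames):                 # count trailing silence
        if vad.is_speech(frame, SAMPLE_RATE):
            break
        silence_ms += FRAME_MS
    return silence_ms >= SILENCE_LIMIT_MS or len(frames) * FRAME_MS >= MAX_SEGMENT_MS
```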

### Requirement: Punctuation in Transcription Output

The sidecar SHALL output transcribed text with appropriate Chinese punctuation marks.

#### Scenario: Add sentence-ending punctuation

- WHEN transcription completes for a segment
- THEN the output SHALL include periods (。) at natural sentence boundaries
- AND question marks (？) for interrogative sentences
- AND commas (，) for clause breaks within sentences

#### Scenario: Detect question patterns

- WHEN transcribed text ends with question particles (嗎、呢、什麼、怎麼、為什麼)
- THEN the punctuation processor SHALL append a question mark (？)
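
A sketch of the sentence-final rules; the particle list comes straight from the scenario above, while mid-sentence comma insertion is left out:

```python
# Sketch: rule-based sentence-final punctuation for Chinese output.
QUESTION_PARTICLES = ("嗎", "呢", "什麼", "怎麼", "為什麼")

def punctuate(text: str) -> str:
    text = text.rstrip("。？，")                 # avoid doubling existing marks
    if text.endswith(QUESTION_PARTICLES):        # str.endswith accepts a tuple
        return text + "？"
    return text + "。"

assert punctuate("這樣可以嗎") == "這樣可以嗎？"
assert punctuate("我們下週再討論") == "我們下週再討論。"
```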

### Requirement: Audio File Upload

The Electron client SHALL allow users to upload pre-recorded audio files for transcription.

#### Scenario: Upload audio file

- WHEN user clicks "Upload Audio" button in meeting detail page
- THEN file picker SHALL open with filter for supported audio formats (MP3, WAV, M4A, WebM, OGG)

#### Scenario: Show upload progress

- WHEN audio file is being uploaded
- THEN progress indicator SHALL be displayed showing upload percentage

#### Scenario: Show transcription progress

- WHEN audio file is being transcribed in chunks
- THEN progress indicator SHALL display "Processing chunk X of Y"

#### Scenario: Replace existing transcript

- WHEN user uploads audio file and transcript already has content
- THEN confirmation dialog SHALL appear before replacing existing transcript

#### Scenario: File size limit

- WHEN user selects audio file larger than 500MB
- THEN error message SHALL be displayed indicating file size limit

### Requirement: VAD-Based Audio Segmentation

The sidecar SHALL segment large audio files using Voice Activity Detection before cloud transcription.

#### Scenario: Segment audio command

- WHEN sidecar receives `{"action": "segment_audio", "file_path": "...", "max_chunk_seconds": 300}`
- THEN it SHALL load audio file and run VAD to detect speech boundaries

#### Scenario: Split at silence boundaries

- WHEN VAD detects silence gap >= 500ms within max chunk duration
- THEN audio SHALL be split at the silence boundary
- AND each chunk exported as WAV file to temp directory

#### Scenario: Force split for continuous speech

- WHEN speech continues beyond max_chunk_seconds without silence gap
- THEN audio SHALL be force-split at max_chunk_seconds boundary

#### Scenario: Return segment metadata

- WHEN segmentation completes
- THEN sidecar SHALL return list of segments with file paths and timestamps
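
A sketch of the `segment_audio` flow, using pydub's energy-based silence detection as a stand-in for a dedicated VAD (the spec does not fix a library); the -40 dBFS threshold and temp-file layout are assumptions:

```python
# Sketch: split a file at >=500 ms silence gaps, force-splitting at
# max_chunk_seconds, and return per-chunk metadata.
import os
import tempfile
from pydub import AudioSegment
from pydub.silence import detect_silence

def segment_audio(file_path: str, max_chunk_seconds: int = 300) -> list[dict]:
    audio = AudioSegment.from_file(file_path)
    max_ms = max_chunk_seconds * 1000
    # [start_ms, end_ms] spans of >=500 ms silence; threshold is assumed
    silences = detect_silence(audio, min_silence_len=500, silence_thresh=-40)
    segments, start = [], 0
    while start < len(audio):                    # len() is duration in ms
        hard_end = min(start + max_ms, len(audio))
        # Split at the last qualifying silence inside the window, else force-split.
        end = max(
            (s_end for s_start, s_end in silences if start < s_start and s_end <= hard_end),
            default=hard_end,
        )
        fd, chunk_path = tempfile.mkstemp(suffix=".wav")
        os.close(fd)
        audio[start:end].export(chunk_path, format="wav")
        segments.append({"path": chunk_path, "start_ms": start, "end_ms": end})
        start = end
    return segments
```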

### Requirement: Dify Speech-to-Text Integration

The backend SHALL integrate with Dify STT service for audio file transcription.

#### Scenario: Transcribe uploaded audio with chunking

- WHEN backend receives `POST /api/ai/transcribe-audio` with audio file
- THEN backend SHALL call sidecar for VAD segmentation
- AND send each chunk to Dify STT API sequentially
- AND concatenate results into final transcript

#### Scenario: Supported audio formats

- WHEN audio file is in MP3, WAV, M4A, WebM, or OGG format
- THEN system SHALL accept and process the file

#### Scenario: Unsupported format handling

- WHEN audio file format is not supported
- THEN backend SHALL return HTTP 400 with error message listing supported formats
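
A sketch of the acceptance checks; the error payload shape is an assumption, while the format list and HTTP 400 come from the scenarios above and the 500MB limit from the Audio File Upload requirement:

```python
# Sketch: extension and size validation before transcription starts.
import os

SUPPORTED_EXTENSIONS = {".mp3", ".wav", ".m4a", ".webm", ".ogg"}
MAX_FILE_BYTES = 500 * 1024 * 1024  # 500MB, per the upload requirement

def validate_upload(filename: str, size_bytes: int) -> dict | None:
    """Return an HTTP-400-style error payload, or None when acceptable."""
    ext = os.path.splitext(filename)[1].lower()
    if ext not in SUPPORTED_EXTENSIONS:
        return {"status": 400,
                "error": f"Unsupported format; supported: {sorted(SUPPORTED_EXTENSIONS)}"}
    if size_bytes > MAX_FILE_BYTES:
        return {"status": 400, "error": "File exceeds the 500MB size limit"}
    return None
```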

#### Scenario: Dify chunk transcription

- WHEN backend sends audio chunk to Dify STT API
- THEN chunk size SHALL be under 25MB to comply with API limits

#### Scenario: Transcription timeout per chunk

- WHEN Dify STT does not respond for a chunk within 2 minutes
- THEN backend SHALL retry up to 3 times with exponential backoff

#### Scenario: Dify STT error handling

- WHEN Dify STT API returns error after retries
- THEN backend SHALL return HTTP 502 with error details
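
A sketch of per-chunk transcription with the retry policy above; the endpoint URL, auth header, and response field are assumptions, while the 25MB limit, 2-minute timeout, and 3 attempts come from the scenarios:

```python
# Sketch: one chunk -> Dify STT with timeout, retries, and backoff.
import time
import requests

DIFY_STT_URL = "https://example.dify.host/v1/audio-to-text"  # assumed endpoint
MAX_CHUNK_BYTES = 25 * 1024 * 1024
CHUNK_TIMEOUT_S = 120

def transcribe_chunk(chunk_path: str, api_key: str) -> str:
    with open(chunk_path, "rb") as f:
        data = f.read()
    if len(data) > MAX_CHUNK_BYTES:
        raise ValueError("chunk exceeds 25MB API limit")
    for attempt in range(3):
        try:
            resp = requests.post(
                DIFY_STT_URL,
                headers={"Authorization": f"Bearer {api_key}"},
                files={"file": ("chunk.wav", data, "audio/wav")},
                timeout=CHUNK_TIMEOUT_S,
            )
            resp.raise_for_status()
            return resp.json()["text"]        # assumed response shape
        except requests.RequestException:
            if attempt == 2:
                raise                          # caller maps this to HTTP 502
            time.sleep(2 ** attempt)           # backoff: waits 1s, then 2s
    raise RuntimeError("unreachable")
```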

### Requirement: Dual Transcription Mode

The system SHALL support both real-time local transcription and file-based cloud transcription.

#### Scenario: Real-time transcription unchanged

- WHEN user records audio in real-time
- THEN local sidecar SHALL process audio using faster-whisper (existing behavior)

#### Scenario: File upload uses cloud transcription

- WHEN user uploads audio file
- THEN Dify cloud service SHALL process audio via chunked upload

#### Scenario: Unified transcript output

- WHEN transcription completes from either source
- THEN result SHALL be displayed in the same transcript area in meeting detail page

### Requirement: Model Download Progress Display

The sidecar SHALL report Whisper model download progress to enable UI feedback.

#### Scenario: Emit download start

- WHEN Whisper model download begins
- THEN sidecar SHALL emit JSON to stdout: `{"status": "downloading_model", "model": "<size>", "progress": 0, "total_mb": <size>}`

#### Scenario: Emit download progress

- WHEN download progress updates
- THEN sidecar SHALL emit JSON: `{"status": "downloading_model", "progress": <percent>, "downloaded_mb": <current>, "total_mb": <total>}`
- AND progress updates SHALL occur at least every 5% or every 5 seconds

#### Scenario: Emit download complete

- WHEN model download completes
- THEN sidecar SHALL emit JSON: `{"status": "model_downloaded", "model": "<size>"}`
- AND proceed to model loading

#### Scenario: Skip download for cached model

- WHEN model already exists in huggingface cache
- THEN sidecar SHALL NOT emit download progress messages
- AND proceed directly to loading
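
A sketch of the throttled stdout protocol; how `report()` gets wired into the actual huggingface_hub download callback is left out:

```python
# Sketch: emit download progress at most "every 5% or every 5 seconds".
import json
import time

class ProgressEmitter:
    def __init__(self, total_mb: float):
        self.total_mb = total_mb
        self.last_percent = -100.0   # guarantees the first call emits
        self.last_emit = 0.0

    def report(self, downloaded_mb: float) -> None:
        percent = downloaded_mb / self.total_mb * 100
        now = time.monotonic()
        if percent - self.last_percent >= 5 or now - self.last_emit >= 5:
            print(json.dumps({
                "status": "downloading_model",
                "progress": round(percent),
                "downloaded_mb": round(downloaded_mb, 1),
                "total_mb": self.total_mb,
            }), flush=True)          # flush so Electron sees it immediately
            self.last_percent, self.last_emit = percent, now
```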

### Requirement: Frontend Model Download Progress Display

The Electron frontend SHALL display model download progress to users.

#### Scenario: Show download progress in transcript panel

- WHEN sidecar emits download progress
- THEN whisper status element SHALL display download percentage and size
- AND format: "Downloading: XX% (YYY MB / ZZZ MB)"

#### Scenario: Show download complete

- WHEN sidecar emits `model_downloaded` status
- THEN whisper status element SHALL briefly show "Model downloaded"
- AND transition to loading state

#### Scenario: Forward progress events via IPC

- WHEN main process receives download progress from sidecar
- THEN it SHALL forward to renderer via `model-download-progress` IPC channel