# transcription Specification

## Purpose

TBD - created by archiving change add-meeting-assistant-mvp. Update Purpose after archive.

## Requirements

### Requirement: Edge Speech-to-Text

The Electron client SHALL perform speech-to-text conversion locally using the faster-whisper int8 model.

#### Scenario: Successful transcription

- **WHEN** user records audio during a meeting
- **THEN** the audio SHALL be transcribed locally without network dependency

#### Scenario: Transcription on target hardware

- **WHEN** running on an i5 processor with 8GB RAM
- **THEN** transcription SHALL complete within acceptable latency for real-time display

### Requirement: Traditional Chinese Output

The transcription engine SHALL output Traditional Chinese (繁體中文) text.

#### Scenario: Simplified to Traditional conversion

- **WHEN** whisper outputs Simplified Chinese characters
- **THEN** OpenCC SHALL convert the output to Traditional Chinese

#### Scenario: Native Traditional Chinese

- **WHEN** whisper outputs Traditional Chinese directly
- **THEN** the text SHALL pass through unchanged

### Requirement: Real-time Display

The Electron client SHALL display transcription results in real-time.

#### Scenario: Streaming transcription

- **WHEN** user is recording
- **THEN** transcribed text SHALL appear in the left panel within seconds of speech

### Requirement: Python Sidecar

The transcription engine SHALL be packaged as a Python sidecar using PyInstaller.

#### Scenario: Sidecar startup

- **WHEN** Electron app launches
- **THEN** the Python sidecar containing faster-whisper and OpenCC SHALL be available

#### Scenario: Sidecar communication

- **WHEN** Electron sends audio data to the sidecar
- **THEN** transcribed text SHALL be returned via IPC

### Requirement: Streaming Transcription Mode

The sidecar SHALL support a streaming mode in which audio chunks are continuously received and transcribed in real-time with VAD-triggered segmentation.
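A minimal sketch of how the sidecar might dispatch these streaming commands, assuming a newline-delimited JSON protocol over stdio. `handle_command`, the `state` dict, and the per-chunk `"buffered"` acknowledgement are illustrative stand-ins, not the real implementation; VAD and transcription are stubbed out.

```python
import json

def handle_command(line: str, state: dict) -> dict:
    """Dispatch one JSON command line to the matching stream action (sketch)."""
    cmd = json.loads(line)
    action = cmd.get("action")
    if action == "start_stream":
        state["buffer"] = bytearray()   # accumulated audio
        state["segments"] = 0           # segments emitted so far
        state["active"] = True
        return {"status": "streaming", "session_id": state.get("session_id", "demo")}
    if action == "audio_chunk" and state.get("active"):
        # Real sidecar: base64-decode cmd["data"], append PCM, run VAD,
        # and transcribe + emit a segment when a speech boundary is found.
        state["buffer"].extend(cmd.get("data", "").encode())
        return {"status": "buffered", "buffer_bytes": len(state["buffer"])}
    if action == "stop_stream":
        state["active"] = False
        # Real sidecar: transcribe any remaining buffered audio first.
        return {"status": "stream_stopped", "total_segments": state.get("segments", 0)}
    return {"status": "error", "message": f"unknown action: {action}"}
```

Keeping the dispatcher a pure function of (command, state) makes each scenario below directly testable without a running audio pipeline.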
#### Scenario: Start streaming session

- **WHEN** sidecar receives `{"action": "start_stream"}` command
- **THEN** it SHALL initialize the audio buffer and VAD processor
- **AND** respond with `{"status": "streaming", "session_id": "<id>"}`

#### Scenario: Process audio chunk

- **WHEN** sidecar receives `{"action": "audio_chunk", "data": "<base64_audio>"}` during an active stream
- **THEN** it SHALL append the audio to the buffer and run VAD detection
- **AND** if a speech boundary is detected, transcribe the accumulated audio
- **AND** emit `{"segment_id": <n>, "text": "<text>", "is_final": true}`

#### Scenario: Stop streaming session

- **WHEN** sidecar receives `{"action": "stop_stream"}` command
- **THEN** it SHALL transcribe any remaining buffered audio
- **AND** respond with `{"status": "stream_stopped", "total_segments": <count>}`

### Requirement: VAD-based Speech Segmentation

The sidecar SHALL use Voice Activity Detection to identify natural speech boundaries for segmentation.

#### Scenario: Detect speech end

- **WHEN** VAD detects silence exceeding 500ms after speech
- **THEN** the accumulated speech audio SHALL be sent for transcription
- **AND** a new segment SHALL begin for subsequent speech

#### Scenario: Handle continuous speech

- **WHEN** speech continues for more than 15 seconds without a pause
- **THEN** the sidecar SHALL force a segment boundary
- **AND** transcribe the 15-second chunk to prevent excessive latency

### Requirement: Punctuation in Transcription Output

The sidecar SHALL output transcribed text with appropriate Chinese punctuation marks.

#### Scenario: Add sentence-ending punctuation

- **WHEN** transcription completes for a segment
- **THEN** the output SHALL include periods (。) at natural sentence boundaries
- **AND** question marks (?) for interrogative sentences
- **AND** commas (,) for clause breaks within sentences

#### Scenario: Detect question patterns

- **WHEN** transcribed text ends with question particles (嗎、呢、什麼、怎麼、為什麼)
- **THEN** the punctuation processor SHALL append a question mark (?)

### Requirement: Audio File Upload

The Electron client SHALL allow users to upload pre-recorded audio files for transcription.

#### Scenario: Upload audio file

- **WHEN** user clicks the "Upload Audio" button in the meeting detail page
- **THEN** a file picker SHALL open, filtered to the supported audio formats (MP3, WAV, M4A, WebM, OGG)

#### Scenario: Show upload progress

- **WHEN** an audio file is being uploaded
- **THEN** a progress indicator SHALL be displayed showing the upload percentage

#### Scenario: Show transcription progress

- **WHEN** an audio file is being transcribed in chunks
- **THEN** the progress indicator SHALL display "Processing chunk X of Y"

#### Scenario: Replace existing transcript

- **WHEN** user uploads an audio file and the transcript already has content
- **THEN** a confirmation dialog SHALL appear before the existing transcript is replaced

#### Scenario: File size limit

- **WHEN** user selects an audio file larger than 500MB
- **THEN** an error message SHALL be displayed indicating the file size limit

### Requirement: VAD-Based Audio Segmentation

The sidecar SHALL segment large audio files using Voice Activity Detection before cloud transcription.
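The question-particle rule above can be sketched as a small post-processing step. The particle list comes from the spec; the function name `punctuate_segment` and the "already punctuated" guard are illustrative assumptions:

```python
# Question particles listed in the spec; a segment ending in one of these
# gets a full-width question mark instead of a period.
QUESTION_PARTICLES = ("嗎", "呢", "什麼", "怎麼", "為什麼")

def punctuate_segment(text: str) -> str:
    """Append sentence-ending punctuation to a transcribed segment (sketch)."""
    text = text.strip()
    if not text:
        return text
    if text[-1] in "。?!,":
        return text  # already punctuated (or ends mid-clause)
    if text.endswith(QUESTION_PARTICLES):
        return text + "?"
    return text + "。"
```

For example, `punctuate_segment("你吃飯了嗎")` yields `你吃飯了嗎?`, while a declarative segment gets `。` appended.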
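The silence-gap and force-split rules above can be sketched on top of per-frame VAD labels. `split_at_silence` is an illustrative helper, not the sidecar's real API: it takes a list of booleans (True = speech) as a frame-based VAD such as webrtcvad would produce, and returns chunk boundaries in frame indices.

```python
def split_at_silence(frames, frame_ms=10, min_silence_ms=500, max_chunk_s=300):
    """Split VAD frame labels into chunk boundaries (sketch).

    Splits when a silence run reaches min_silence_ms, and force-splits
    when a chunk would exceed max_chunk_s without any qualifying gap.
    """
    min_silence = min_silence_ms // frame_ms       # frames of silence to split
    max_frames = (max_chunk_s * 1000) // frame_ms  # hard cap per chunk
    chunks, start, silence_run = [], 0, 0
    for i, is_speech in enumerate(frames):
        silence_run = 0 if is_speech else silence_run + 1
        chunk_len = i - start + 1
        # The chunk_len > silence_run guard avoids emitting pure-silence chunks.
        if (silence_run >= min_silence and chunk_len > silence_run) or chunk_len >= max_frames:
            chunks.append((start, i + 1))
            start, silence_run = i + 1, 0
    if start < len(frames):
        chunks.append((start, len(frames)))
    return chunks
```

In the real sidecar each (start, end) range would be cut from the source audio and exported as a WAV chunk with its timestamps.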
#### Scenario: Segment audio command

- **WHEN** sidecar receives `{"action": "segment_audio", "file_path": "...", "max_chunk_seconds": 300}`
- **THEN** it SHALL load the audio file and run VAD to detect speech boundaries

#### Scenario: Split at silence boundaries

- **WHEN** VAD detects a silence gap >= 500ms within the max chunk duration
- **THEN** audio SHALL be split at the silence boundary
- **AND** each chunk exported as a WAV file to a temp directory

#### Scenario: Force split for continuous speech

- **WHEN** speech continues beyond max_chunk_seconds without a silence gap
- **THEN** audio SHALL be force-split at the max_chunk_seconds boundary

#### Scenario: Return segment metadata

- **WHEN** segmentation completes
- **THEN** sidecar SHALL return a list of segments with file paths and timestamps

### Requirement: Dify Speech-to-Text Integration

The backend SHALL integrate with the Dify STT service for audio file transcription.

#### Scenario: Transcribe uploaded audio with chunking

- **WHEN** backend receives POST /api/ai/transcribe-audio with an audio file
- **THEN** backend SHALL call the sidecar for VAD segmentation
- **AND** send each chunk to the Dify STT API sequentially
- **AND** concatenate the results into the final transcript

#### Scenario: Supported audio formats

- **WHEN** the audio file is in MP3, WAV, M4A, WebM, or OGG format
- **THEN** the system SHALL accept and process the file

#### Scenario: Unsupported format handling

- **WHEN** the audio file format is not supported
- **THEN** backend SHALL return HTTP 400 with an error message listing the supported formats

#### Scenario: Dify chunk transcription

- **WHEN** backend sends an audio chunk to the Dify STT API
- **THEN** chunk size SHALL be under 25MB to comply with API limits

#### Scenario: Transcription timeout per chunk

- **WHEN** Dify STT does not respond for a chunk within 2 minutes
- **THEN** backend SHALL retry up to 3 times with exponential backoff

#### Scenario: Dify STT error handling

- **WHEN** the Dify STT API returns an error after retries
- **THEN** backend SHALL return HTTP 502 with error details

### Requirement: Dual Transcription Mode

The system SHALL support both real-time local transcription and file-based cloud transcription.

#### Scenario: Real-time transcription unchanged

- **WHEN** user records audio in real-time
- **THEN** the local sidecar SHALL process the audio using faster-whisper (existing behavior)

#### Scenario: File upload uses cloud transcription

- **WHEN** user uploads an audio file
- **THEN** the Dify cloud service SHALL process the audio via chunked upload

#### Scenario: Unified transcript output

- **WHEN** transcription completes from either source
- **THEN** the result SHALL be displayed in the same transcript area in the meeting detail page

### Requirement: Model Download Progress Display

The sidecar SHALL report Whisper model download progress to enable UI feedback.

#### Scenario: Emit download start

- **WHEN** Whisper model download begins
- **THEN** sidecar SHALL emit JSON to stdout: `{"status": "downloading_model", "model": "<name>", "progress": 0, "total_mb": <size>}`

#### Scenario: Emit download progress

- **WHEN** download progress updates
- **THEN** sidecar SHALL emit JSON: `{"status": "downloading_model", "progress": <percent>, "downloaded_mb": <mb>, "total_mb": <mb>}`
- **AND** progress updates SHALL occur at least every 5% or every 5 seconds

#### Scenario: Emit download complete

- **WHEN** model download completes
- **THEN** sidecar SHALL emit JSON: `{"status": "model_downloaded", "model": "<name>"}`
- **AND** proceed to model loading

#### Scenario: Skip download for cached model

- **WHEN** the model already exists in the huggingface cache
- **THEN** sidecar SHALL NOT emit download progress messages
- **AND** proceed directly to loading

### Requirement: Frontend Model Download Progress Display

The Electron frontend SHALL display model download progress to users.
#### Scenario: Show download progress in transcript panel

- **WHEN** sidecar emits download progress
- **THEN** the whisper status element SHALL display the download percentage and size
- **AND** format: "Downloading: XX% (YYY MB / ZZZ MB)"

#### Scenario: Show download complete

- **WHEN** sidecar emits model_downloaded status
- **THEN** the whisper status element SHALL briefly show "Model downloaded"
- **AND** transition to the loading state

#### Scenario: Forward progress events via IPC

- **WHEN** the main process receives download progress from the sidecar
- **THEN** it SHALL forward it to the renderer via the `model-download-progress` IPC channel
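The per-chunk retry policy in the Dify STT integration above (up to 3 retries with exponential backoff after a timeout or error) can be sketched as follows. `transcribe_chunk_with_retry` and the injected `send`/`sleep` callables are illustrative stand-ins, not the backend's actual API:

```python
import time

def transcribe_chunk_with_retry(send, max_retries=3, base_delay=1.0, sleep=time.sleep):
    """Call `send()` (one chunk -> transcript text), retrying on failure (sketch).

    `send` stands in for the real Dify STT request (which carries its own
    2-minute timeout). On error, wait base_delay * 2**attempt between
    attempts; after max_retries failures, re-raise so the caller can map
    the failure to HTTP 502.
    """
    for attempt in range(max_retries + 1):
        try:
            return send()
        except Exception:
            if attempt == max_retries:
                raise
            sleep(base_delay * (2 ** attempt))
```

Injecting `sleep` keeps the backoff schedule unit-testable without real waiting.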
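The download-progress cadence above (emit at least every 5% or every 5 seconds) amounts to a small throttle in the sidecar. `ProgressThrottle` and the injectable clock are illustrative, assuming the caller checks `should_emit` before printing each progress JSON line:

```python
import time

class ProgressThrottle:
    """Decide when to re-emit download progress (sketch).

    Emits whenever progress has advanced by >= pct_step percentage points
    or >= interval_s seconds have passed since the last emit.
    """

    def __init__(self, pct_step=5, interval_s=5.0, clock=time.monotonic):
        self.pct_step = pct_step
        self.interval_s = interval_s
        self.clock = clock
        self.last_pct = -pct_step      # force the first update through
        self.last_ts = -interval_s

    def should_emit(self, pct: float) -> bool:
        now = self.clock()
        if pct - self.last_pct >= self.pct_step or now - self.last_ts >= self.interval_s:
            self.last_pct, self.last_ts = pct, now
            return True
        return False
```

Either condition alone is enough to emit, so a stalled download still produces a heartbeat update every few seconds.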