## ADDED Requirements

### Requirement: Audio File Upload
The Electron client SHALL allow users to upload pre-recorded audio files for transcription.
#### Scenario: Upload audio file
- WHEN user clicks "Upload Audio" button in meeting detail page
- THEN file picker SHALL open with filter for supported audio formats (MP3, WAV, M4A, WebM, OGG)
#### Scenario: Show upload progress
- WHEN audio file is being uploaded
- THEN progress indicator SHALL be displayed showing upload percentage
#### Scenario: Show transcription progress
- WHEN audio file is being transcribed in chunks
- THEN progress indicator SHALL display "Processing chunk X of Y"
#### Scenario: Replace existing transcript
- WHEN user uploads audio file and transcript already has content
- THEN confirmation dialog SHALL appear before replacing existing transcript
#### Scenario: File size limit
- WHEN user selects audio file larger than 500MB
- THEN error message SHALL be displayed indicating file size limit
### Requirement: VAD-Based Audio Segmentation
The sidecar SHALL segment large audio files using Voice Activity Detection before cloud transcription.
#### Scenario: Segment audio command
- WHEN sidecar receives `{"action": "segment_audio", "file_path": "...", "max_chunk_seconds": 300}`
- THEN it SHALL load audio file and run VAD to detect speech boundaries
#### Scenario: Split at silence boundaries
- WHEN VAD detects silence gap >= 500ms within max chunk duration
- THEN audio SHALL be split at the silence boundary
- AND each chunk exported as WAV file to temp directory
#### Scenario: Force split for continuous speech
- WHEN speech continues beyond max_chunk_seconds without silence gap
- THEN audio SHALL be force-split at max_chunk_seconds boundary
#### Scenario: Return segment metadata
- WHEN segmentation completes
- THEN sidecar SHALL return list of segments with file paths and timestamps
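The spec does not name a VAD implementation. The sketch below assumes the `silero-vad` package for speech detection and `pydub` (with ffmpeg) for slicing and export; `segment_audio()` and the returned metadata shape are illustrative, not the sidecar's actual handler.

```python
# Minimal sketch of the segmentation step, assuming silero-vad + pydub.
import tempfile
from pathlib import Path

from pydub import AudioSegment
from silero_vad import load_silero_vad, read_audio, get_speech_timestamps

def segment_audio(file_path: str, max_chunk_seconds: float = 300.0) -> list[dict]:
    model = load_silero_vad()
    wav = read_audio(file_path)  # resampled to 16 kHz mono for the VAD model
    speech = get_speech_timestamps(
        wav, model, return_seconds=True, min_silence_duration_ms=500
    )
    # Candidate cut points: midpoints of silence gaps >= 500 ms.
    cuts = [
        (a["end"] + b["start"]) / 2
        for a, b in zip(speech, speech[1:])
        if b["start"] - a["end"] >= 0.5
    ]

    audio = AudioSegment.from_file(file_path)
    total = len(audio) / 1000.0  # pydub lengths are in milliseconds
    out_dir = Path(tempfile.mkdtemp(prefix="segments_"))
    segments: list[dict] = []
    start = 0.0
    while start < total:
        window_end = start + max_chunk_seconds
        in_window = [c for c in cuts if start < c <= window_end]
        # Prefer the last silence boundary inside the window; force-split
        # when speech runs past max_chunk_seconds with no usable gap.
        end = in_window[-1] if in_window else min(window_end, total)
        path = out_dir / f"chunk_{len(segments):04d}.wav"
        audio[int(start * 1000):int(end * 1000)].export(str(path), format="wav")
        segments.append({"file_path": str(path), "start": start, "end": end})
        start = end
    return segments
```

A handler for the `segment_audio` action could wrap this function and return the segment list as the command's JSON response.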
### Requirement: Dify Speech-to-Text Integration
The backend SHALL integrate with Dify STT service for audio file transcription.
#### Scenario: Transcribe uploaded audio with chunking
- WHEN backend receives POST /api/ai/transcribe-audio with audio file
- THEN backend SHALL call sidecar for VAD segmentation
- AND send each chunk to Dify STT API sequentially
- AND concatenate results into final transcript
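As a sketch of this flow: Dify's published API exposes speech-to-text at `POST /audio-to-text` (multipart `file` plus a `user` field, returning `{"text": ...}`). The base URL, key handling, and function names below are illustrative assumptions, and `segment_audio()` is the sketch above.

```python
# Sketch of the chunked transcription flow against Dify's audio-to-text API.
from pathlib import Path

import requests

DIFY_BASE_URL = "https://api.dify.ai/v1"  # or a self-hosted Dify instance

def transcribe_chunk(path: str, api_key: str, user: str) -> str:
    with open(path, "rb") as f:
        resp = requests.post(
            f"{DIFY_BASE_URL}/audio-to-text",
            headers={"Authorization": f"Bearer {api_key}"},
            files={"file": (Path(path).name, f, "audio/wav")},
            data={"user": user},
            timeout=120,  # the spec's 2-minute per-chunk timeout
        )
    resp.raise_for_status()
    return resp.json()["text"]

def transcribe_file(file_path: str, api_key: str, user: str) -> str:
    # Stands in for the backend's IPC call to the sidecar's VAD segmentation.
    segments = segment_audio(file_path)
    # Chunks are sent sequentially and concatenated in order.
    return " ".join(
        transcribe_chunk(seg["file_path"], api_key, user) for seg in segments
    )
```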
#### Scenario: Supported audio formats
- WHEN audio file is in MP3, WAV, M4A, WebM, or OGG format
- THEN system SHALL accept and process the file
#### Scenario: Unsupported format handling
- WHEN audio file format is not supported
- THEN backend SHALL return HTTP 400 with error message listing supported formats
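A minimal format guard matching these two scenarios, assuming a FastAPI backend (the spec does not name the framework):

```python
# Format guard for the transcribe-audio endpoint; FastAPI's HTTPException
# is an assumption — the spec only fixes the HTTP 400 behavior.
from pathlib import Path

from fastapi import HTTPException

SUPPORTED_EXTENSIONS = {".mp3", ".wav", ".m4a", ".webm", ".ogg"}

def validate_audio_format(filename: str) -> None:
    ext = Path(filename).suffix.lower()
    if ext not in SUPPORTED_EXTENSIONS:
        raise HTTPException(
            status_code=400,
            detail=(
                f"Unsupported audio format '{ext}'. "
                f"Supported formats: {', '.join(sorted(SUPPORTED_EXTENSIONS))}"
            ),
        )
```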
#### Scenario: Dify chunk transcription
- WHEN backend sends audio chunk to Dify STT API
- THEN chunk size SHALL be under 25MB to comply with API limits
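A simple pre-flight check can enforce the 25MB ceiling before each chunk is sent; `check_chunk_size()` and the exception choice are illustrative:

```python
from pathlib import Path

MAX_CHUNK_BYTES = 25 * 1024 * 1024  # Dify STT per-request limit from the spec

def check_chunk_size(path: str) -> None:
    size = Path(path).stat().st_size
    if size > MAX_CHUNK_BYTES:
        # 300 s of 16 kHz mono 16-bit WAV is ~10 MB, well under the limit,
        # so this mainly guards against misconfiguration (e.g. exporting
        # high-sample-rate stereo chunks).
        raise ValueError(
            f"{path} is {size} bytes; reduce max_chunk_seconds to stay under 25MB"
        )
```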
#### Scenario: Transcription timeout per chunk
- WHEN Dify STT does not respond for a chunk within 2 minutes
- THEN backend SHALL retry up to 3 times with exponential backoff
#### Scenario: Dify STT error handling
- WHEN Dify STT API returns error after retries
- THEN backend SHALL return HTTP 502 with error details
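The timeout, retry, and 502 scenarios can be realized together in a small wrapper around `transcribe_chunk()` from the sketch above. The backoff schedule here is an illustrative choice; the spec fixes only the 2-minute timeout, the three retries, and the 502 status:

```python
# Retry wrapper for a single chunk: up to 3 attempts with exponential
# backoff, surfacing HTTP 502 once retries are exhausted.
import time

import requests
from fastapi import HTTPException

def transcribe_chunk_with_retry(path: str, api_key: str, user: str,
                                retries: int = 3, base_delay: float = 2.0) -> str:
    last_error: Exception | None = None
    for attempt in range(retries):
        try:
            return transcribe_chunk(path, api_key, user)  # 120 s timeout inside
        except (requests.Timeout, requests.HTTPError) as exc:
            last_error = exc
            time.sleep(base_delay * 2 ** attempt)  # 2 s, 4 s, 8 s
    raise HTTPException(
        status_code=502,
        detail=f"Dify STT failed after {retries} attempts: {last_error}",
    )
```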
### Requirement: Dual Transcription Mode
The system SHALL support both real-time local transcription and file-based cloud transcription.
#### Scenario: Real-time transcription unchanged
- WHEN user records audio in real-time
- THEN local sidecar SHALL process audio using faster-whisper (existing behavior)
#### Scenario: File upload uses cloud transcription
- WHEN user uploads audio file
- THEN Dify cloud service SHALL process audio via chunked upload
#### Scenario: Unified transcript output
- WHEN transcription completes from either source
- THEN result SHALL be displayed in the same transcript area in meeting detail page