egg 2e78e3760a chore: Archive add-dify-audio-transcription proposal
Move completed Dify audio transcription proposal to archive and update
transcription spec with new capabilities.

🤖 Generated with [Claude Code](https://claude.com/claude-code)

Co-Authored-By: Claude Opus 4.5 <noreply@anthropic.com>
2025-12-11 21:05:01 +08:00


ADDED Requirements

Requirement: Audio File Upload

The Electron client SHALL allow users to upload pre-recorded audio files for transcription.

Scenario: Upload audio file

  • WHEN user clicks "Upload Audio" button in meeting detail page
  • THEN file picker SHALL open with filter for supported audio formats (MP3, WAV, M4A, WebM, OGG)

Scenario: Show upload progress

  • WHEN audio file is being uploaded
  • THEN progress indicator SHALL be displayed showing upload percentage

Scenario: Show transcription progress

  • WHEN audio file is being transcribed in chunks
  • THEN progress indicator SHALL display "Processing chunk X of Y"

Scenario: Replace existing transcript

  • WHEN user uploads audio file and transcript already has content
  • THEN confirmation dialog SHALL appear before replacing existing transcript

Scenario: File size limit

  • WHEN user selects audio file larger than 500MB
  • THEN error message SHALL be displayed indicating file size limit
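The format filter and size-limit scenarios above can be sketched as a single validation step. This is a minimal illustration, not code from the repository; the function and constant names are hypothetical, and only the format list and 500MB cap come from the spec.

```python
import os

# Format allowlist and size cap taken from the scenarios above.
SUPPORTED_EXTENSIONS = {".mp3", ".wav", ".m4a", ".webm", ".ogg"}
MAX_FILE_SIZE = 500 * 1024 * 1024  # 500MB

def validate_audio_upload(filename: str, size_bytes: int) -> tuple[bool, str]:
    """Return (ok, error_message); the message is empty when the file is acceptable."""
    ext = os.path.splitext(filename)[1].lower()
    if ext not in SUPPORTED_EXTENSIONS:
        return False, f"Unsupported format {ext or '(none)'}; supported: MP3, WAV, M4A, WebM, OGG"
    if size_bytes > MAX_FILE_SIZE:
        return False, "File exceeds the 500MB size limit"
    return True, ""
```

The same check can back both the client-side file picker filter and the backend's HTTP 400 response, so the two layers cannot drift apart.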

Requirement: VAD-Based Audio Segmentation

The sidecar SHALL segment large audio files using Voice Activity Detection before cloud transcription.

Scenario: Segment audio command

  • WHEN sidecar receives {"action": "segment_audio", "file_path": "...", "max_chunk_seconds": 300}
  • THEN it SHALL load audio file and run VAD to detect speech boundaries

Scenario: Split at silence boundaries

  • WHEN VAD detects silence gap >= 500ms within max chunk duration
  • THEN audio SHALL be split at the silence boundary
  • AND each chunk exported as WAV file to temp directory

Scenario: Force split for continuous speech

  • WHEN speech continues beyond max_chunk_seconds without silence gap
  • THEN audio SHALL be force-split at max_chunk_seconds boundary

Scenario: Return segment metadata

  • WHEN segmentation completes
  • THEN sidecar SHALL return list of segments with file paths and timestamps
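The splitting rules above (cut at silence gaps >= 500ms, force-split at max_chunk_seconds) can be sketched independently of any particular VAD library by operating on per-frame speech flags. The function below is illustrative only; a real sidecar would obtain the flags from its VAD, export each span as a WAV file, and likely drop all-silence chunks.

```python
def segment_by_vad(speech_flags, frame_ms=30, min_silence_ms=500, max_chunk_seconds=300):
    """Turn per-frame VAD flags into (start_s, end_s) chunk boundaries.

    Splits when a silence run reaches min_silence_ms, and force-splits
    when a chunk reaches max_chunk_seconds without a qualifying gap.
    """
    frame_s = frame_ms / 1000.0
    min_silence_frames = min_silence_ms // frame_ms
    segments, start, silence_run = [], 0, 0
    for i, is_speech in enumerate(speech_flags):
        silence_run = 0 if is_speech else silence_run + 1
        duration_s = (i + 1 - start) * frame_s
        if silence_run >= min_silence_frames or duration_s >= max_chunk_seconds:
            segments.append((start * frame_s, (i + 1) * frame_s))
            start, silence_run = i + 1, 0
    if start < len(speech_flags):  # trailing chunk with no closing gap
        segments.append((start * frame_s, len(speech_flags) * frame_s))
    return segments
```

With 100 ms frames, ten speech frames followed by five silence frames and ten more speech frames yield two chunks, split at the silence gap; twenty continuous speech frames with a 1-second cap yield two force-split chunks.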

Requirement: Dify Speech-to-Text Integration

The backend SHALL integrate with Dify STT service for audio file transcription.

Scenario: Transcribe uploaded audio with chunking

  • WHEN backend receives POST /api/ai/transcribe-audio with audio file
  • THEN backend SHALL call sidecar for VAD segmentation
  • AND send each chunk to Dify STT API sequentially
  • AND concatenate results into final transcript
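The segment-then-transcribe-then-concatenate flow in this scenario can be sketched as below. `segment_audio` and `transcribe_chunk` are hypothetical stand-ins for the sidecar RPC and the Dify STT call; neither name is from the codebase.

```python
def transcribe_file(path, segment_audio, transcribe_chunk):
    """Segment an audio file via the sidecar, transcribe chunks in order, join results."""
    segments = segment_audio(path)  # sidecar VAD segmentation (list of {"file_path": ...})
    # Sequential calls preserve chunk order so the final transcript reads correctly.
    parts = [transcribe_chunk(seg["file_path"]) for seg in segments]
    return " ".join(p.strip() for p in parts if p.strip())
```

Passing the two dependencies as callables keeps the orchestration testable without a live sidecar or Dify endpoint.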

Scenario: Supported audio formats

  • WHEN audio file is in MP3, WAV, M4A, WebM, or OGG format
  • THEN system SHALL accept and process the file

Scenario: Unsupported format handling

  • WHEN audio file format is not supported
  • THEN backend SHALL return HTTP 400 with error message listing supported formats

Scenario: Dify chunk transcription

  • WHEN backend sends audio chunk to Dify STT API
  • THEN chunk size SHALL be under 25MB to comply with API limits

Scenario: Transcription timeout per chunk

  • WHEN Dify STT does not respond for a chunk within 2 minutes
  • THEN backend SHALL retry up to 3 times with exponential backoff
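The timeout scenario above implies a per-chunk retry wrapper; a minimal sketch follows. `call_dify_stt` and the backoff delays are illustrative assumptions; only the 2-minute timeout and 3-retry cap come from the spec.

```python
import time

def transcribe_with_retry(call_dify_stt, chunk_path, max_retries=3, base_delay=1.0):
    """Call the STT client, retrying on timeout with exponential backoff."""
    for attempt in range(max_retries + 1):
        try:
            return call_dify_stt(chunk_path, timeout=120)  # 2-minute per-chunk timeout
        except TimeoutError:
            if attempt == max_retries:
                raise  # exhausted retries; upstream maps this to HTTP 502
            time.sleep(base_delay * (2 ** attempt))  # e.g. 1s, 2s, 4s between attempts
```

The final re-raise is what lets the next scenario's HTTP 502 path fire only after all retries are spent.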

Scenario: Dify STT error handling

  • WHEN Dify STT API returns error after retries
  • THEN backend SHALL return HTTP 502 with error details

Requirement: Dual Transcription Mode

The system SHALL support both real-time local transcription and file-based cloud transcription.

Scenario: Real-time transcription unchanged

  • WHEN user records audio in real-time
  • THEN local sidecar SHALL process audio using faster-whisper (existing behavior)

Scenario: File upload uses cloud transcription

  • WHEN user uploads audio file
  • THEN Dify cloud service SHALL process audio via chunked upload

Scenario: Unified transcript output

  • WHEN transcription completes from either source
  • THEN result SHALL be displayed in the same transcript area in meeting detail page
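The routing implied by the dual-mode scenarios can be summarized in one dispatch step. The mode strings and backend names below are illustrative, not identifiers from the codebase.

```python
def pick_transcription_backend(source: str) -> str:
    """Route audio to the right pipeline based on how it arrived."""
    if source == "realtime":
        return "local-faster-whisper"  # existing sidecar behavior, unchanged
    if source == "file-upload":
        return "dify-cloud-chunked"    # new VAD-segmented cloud path
    raise ValueError(f"unknown audio source: {source}")
```

Whichever branch runs, both pipelines write to the same transcript field, which is what makes the unified display scenario hold.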