chore: Archive add-dify-audio-transcription proposal

Move completed Dify audio transcription proposal to archive and update transcription spec with new capabilities.

🤖 Generated with [Claude Code](https://claude.com/claude-code)

Co-Authored-By: Claude Opus 4.5 <noreply@anthropic.com>
2025-12-11 21:05:01 +08:00
parent 263eb1c394
commit 2e78e3760a
5 changed files with 435 additions and 0 deletions
--- a/openspec/specs/transcription/spec.md
+++ b/openspec/specs/transcription/spec.md
@@ -88,3 +88,90 @@ The sidecar SHALL output transcribed text with appropriate Chinese punctuation m
 - **WHEN** transcribed text ends with question particles (嗎、呢、什麼、怎麼、為什麼)
 - **THEN** the punctuation processor SHALL append question mark (？)

+### Requirement: Audio File Upload
+The Electron client SHALL allow users to upload pre-recorded audio files for transcription.
+
+#### Scenario: Upload audio file
+- **WHEN** user clicks "Upload Audio" button in meeting detail page
+- **THEN** file picker SHALL open with filter for supported audio formats (MP3, WAV, M4A, WebM, OGG)
+
+#### Scenario: Show upload progress
+- **WHEN** audio file is being uploaded
+- **THEN** progress indicator SHALL be displayed showing upload percentage
+
+#### Scenario: Show transcription progress
+- **WHEN** audio file is being transcribed in chunks
+- **THEN** progress indicator SHALL display "Processing chunk X of Y"
+
+#### Scenario: Replace existing transcript
+- **WHEN** user uploads audio file and transcript already has content
+- **THEN** confirmation dialog SHALL appear before replacing existing transcript
+
+#### Scenario: File size limit
+- **WHEN** user selects audio file larger than 500MB
+- **THEN** error message SHALL be displayed indicating file size limit
+
+### Requirement: VAD-Based Audio Segmentation
+The sidecar SHALL segment large audio files using Voice Activity Detection before cloud transcription.
+
+#### Scenario: Segment audio command
+- **WHEN** sidecar receives `{"action": "segment_audio", "file_path": "...", "max_chunk_seconds": 300}`
+- **THEN** it SHALL load audio file and run VAD to detect speech boundaries
+
+#### Scenario: Split at silence boundaries
+- **WHEN** VAD detects silence gap >= 500ms within max chunk duration
+- **THEN** audio SHALL be split at the silence boundary
+- **AND** each chunk exported as WAV file to temp directory
+
+#### Scenario: Force split for continuous speech
+- **WHEN** speech continues beyond max_chunk_seconds without silence gap
+- **THEN** audio SHALL be force-split at max_chunk_seconds boundary
+
+#### Scenario: Return segment metadata
+- **WHEN** segmentation completes
+- **THEN** sidecar SHALL return list of segments with file paths and timestamps
+
+### Requirement: Dify Speech-to-Text Integration
+The backend SHALL integrate with Dify STT service for audio file transcription.
+
+#### Scenario: Transcribe uploaded audio with chunking
+- **WHEN** backend receives POST /api/ai/transcribe-audio with audio file
+- **THEN** backend SHALL call sidecar for VAD segmentation
+- **AND** send each chunk to Dify STT API sequentially
+- **AND** concatenate results into final transcript
+
+#### Scenario: Supported audio formats
+- **WHEN** audio file is in MP3, WAV, M4A, WebM, or OGG format
+- **THEN** system SHALL accept and process the file
+
+#### Scenario: Unsupported format handling
+- **WHEN** audio file format is not supported
+- **THEN** backend SHALL return HTTP 400 with error message listing supported formats
+
+#### Scenario: Dify chunk transcription
+- **WHEN** backend sends audio chunk to Dify STT API
+- **THEN** chunk size SHALL be under 25MB to comply with API limits
+
+#### Scenario: Transcription timeout per chunk
+- **WHEN** Dify STT does not respond for a chunk within 2 minutes
+- **THEN** backend SHALL retry up to 3 times with exponential backoff
+
+#### Scenario: Dify STT error handling
+- **WHEN** Dify STT API returns error after retries
+- **THEN** backend SHALL return HTTP 502 with error details
+
+### Requirement: Dual Transcription Mode
+The system SHALL support both real-time local transcription and file-based cloud transcription.
+
+#### Scenario: Real-time transcription unchanged
+- **WHEN** user records audio in real-time
+- **THEN** local sidecar SHALL process audio using faster-whisper (existing behavior)
+
+#### Scenario: File upload uses cloud transcription
+- **WHEN** user uploads audio file
+- **THEN** Dify cloud service SHALL process audio via chunked upload
+
+#### Scenario: Unified transcript output
+- **WHEN** transcription completes from either source
+- **THEN** result SHALL be displayed in the same transcript area in meeting detail page
+
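
The VAD segmentation scenarios added in this diff (split at silence gaps of at least 500 ms, force-split at `max_chunk_seconds`, return timestamps per segment) could be planned roughly as follows. This is a minimal sketch, not the sidecar's actual implementation. The function `plan_chunks`, its parameter names, and the greedy reading of the spec (fill a chunk up to the limit, then cut at the midpoint of the last qualifying silence gap) are assumptions made for illustration. The speech spans are presumed to come from whichever VAD the sidecar runs.

```python
from typing import List, Tuple

Span = Tuple[float, float]  # (start_seconds, end_seconds)


def plan_chunks(
    speech: List[Span],
    duration: float,
    max_chunk_seconds: float = 300.0,
    min_silence_seconds: float = 0.5,
) -> List[Span]:
    """Plan chunk boundaries from sorted, non-overlapping VAD speech spans.

    Hypothetical helper: cuts at silence where possible, otherwise
    force-splits at max_chunk_seconds.
    """
    # Candidate cut points: the midpoint of every silence gap that is
    # at least min_silence_seconds long.
    cuts = [
        (prev_end + next_start) / 2
        for (_, prev_end), (next_start, _) in zip(speech, speech[1:])
        if next_start - prev_end >= min_silence_seconds
    ]

    chunks: List[Span] = []
    start = 0.0
    while duration - start > max_chunk_seconds:
        limit = start + max_chunk_seconds
        # Prefer the last silence cut that still keeps the chunk within the limit.
        usable = [c for c in cuts if start < c <= limit]
        # No usable silence: continuous speech, so force-split at the limit.
        end = usable[-1] if usable else limit
        chunks.append((start, end))
        start = end

    # Whatever remains fits in a single final chunk.
    if duration > start:
        chunks.append((start, duration))
    return chunks


if __name__ == "__main__":
    spans = [(0.0, 100.0), (101.0, 250.0), (250.2, 400.0), (402.0, 650.0)]
    for chunk_start, chunk_end in plan_chunks(spans, duration=650.0):
        print(f"{chunk_start:.1f}s -> {chunk_end:.1f}s")
```

In this example the 0.2 s gap at 250 s is ignored because it is shorter than 500 ms, and the second chunk is force-split because no qualifying silence falls inside its 300-second window. Each returned span would then map to one exported WAV file and one entry in the segment metadata.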