Meeting_Assistant/openspec/specs/transcription/spec.md
egg 8b6184ecc5 feat: Meeting Assistant MVP - Complete implementation
Enterprise Meeting Knowledge Management System with:

Backend (FastAPI):
- Authentication proxy with JWT (pj-auth-api integration)
- MySQL database with 4 tables (users, meetings, conclusions, actions)
- Meeting CRUD with system code generation (C-YYYYMMDD-XX, A-YYYYMMDD-XX)
- Dify LLM integration for AI summarization
- Excel export with openpyxl
- 20 unit tests (all passing)

Client (Electron):
- Login page with company auth
- Meeting list with create/delete
- Meeting detail with real-time transcription
- Editable transcript textarea (single block, easy editing)
- AI summarization with conclusions/action items
- 5-second segment recording (efficient for long meetings)

Sidecar (Python):
- faster-whisper medium model with int8 quantization
- ONNX Runtime VAD (lightweight, ~20MB vs PyTorch ~2GB)
- Chinese punctuation processing
- OpenCC for Traditional Chinese conversion
- Anti-hallucination parameters
- Auto-cleanup of temp audio files

OpenSpec:
- add-meeting-assistant-mvp (47 tasks, archived)
- add-realtime-transcription (29 tasks, archived)

🤖 Generated with [Claude Code](https://claude.com/claude-code)

Co-Authored-By: Claude Opus 4.5 <noreply@anthropic.com>
2025-12-10 20:17:44 +08:00


# transcription Specification

## Purpose

Provide local, on-device speech-to-text for the Meeting Assistant: the Electron client records meeting audio and a Python sidecar transcribes it with faster-whisper, returning punctuated Traditional Chinese text in real time. (Created by archiving change add-meeting-assistant-mvp.)

## Requirements

### Requirement: Edge Speech-to-Text

The Electron client SHALL perform speech-to-text conversion locally using the faster-whisper int8-quantized model.

#### Scenario: Successful transcription

- WHEN user records audio during a meeting
- THEN the audio SHALL be transcribed locally without network dependency

#### Scenario: Transcription on target hardware

- WHEN running on an i5 processor with 8 GB RAM
- THEN transcription SHALL complete within acceptable latency for real-time display

### Requirement: Traditional Chinese Output

The transcription engine SHALL output Traditional Chinese (繁體中文) text.

#### Scenario: Simplified to Traditional conversion

- WHEN whisper outputs Simplified Chinese characters
- THEN OpenCC SHALL convert the output to Traditional Chinese

#### Scenario: Native Traditional Chinese

- WHEN whisper outputs Traditional Chinese directly
- THEN the text SHALL pass through unchanged
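In the sidecar, OpenCC performs the real conversion with its full dictionary. As a minimal sketch of the two behaviors above — convert Simplified characters, pass Traditional text through unchanged — using a tiny hypothetical stand-in table:

```python
# Hypothetical stand-in for OpenCC's Simplified-to-Traditional dictionary;
# the sidecar uses OpenCC itself, not this table.
S2T_SAMPLE = {
    "记": "記", "录": "錄", "会": "會", "议": "議",
}

def to_traditional(text: str) -> str:
    """Map Simplified characters to Traditional; pass all others through."""
    return "".join(S2T_SAMPLE.get(ch, ch) for ch in text)

# Simplified input is converted...
print(to_traditional("会议记录"))   # → 會議記錄
# ...while native Traditional input passes through unchanged.
print(to_traditional("會議記錄"))   # → 會議記錄
```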

### Requirement: Real-time Display

The Electron client SHALL display transcription results in real time.

#### Scenario: Streaming transcription

- WHEN user is recording
- THEN transcribed text SHALL appear in the left panel within seconds of speech

### Requirement: Python Sidecar

The transcription engine SHALL be packaged as a Python sidecar using PyInstaller.

#### Scenario: Sidecar startup

- WHEN Electron app launches
- THEN the Python sidecar containing faster-whisper and OpenCC SHALL be available

#### Scenario: Sidecar communication

- WHEN Electron sends audio data to the sidecar
- THEN transcribed text SHALL be returned via IPC
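The spec does not fix the IPC wire format; a common choice for Electron sidecars, assumed here, is newline-delimited JSON over stdin/stdout (the `transcribe` action and the handler name are illustrative):

```python
import json
import sys

def handle(request: dict) -> dict:
    """Dispatch one request; the transcription call is stubbed out."""
    if request.get("action") == "transcribe":
        # Real sidecar: decode audio, run faster-whisper, convert with OpenCC.
        return {"status": "ok", "text": "(transcribed text)"}
    return {"status": "error", "message": "unknown action"}

def main() -> None:
    # One JSON object per line in, one JSON object per line out.
    for line in sys.stdin:
        line = line.strip()
        if not line:
            continue
        response = handle(json.loads(line))
        sys.stdout.write(json.dumps(response, ensure_ascii=False) + "\n")
        sys.stdout.flush()  # Electron reads line-by-line; avoid buffering stalls.

# Sidecar entry point would call main().
```

Flushing after every response matters: Electron's `child_process` consumes the sidecar's stdout as a stream, and a buffered response would stall the UI.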

### Requirement: Streaming Transcription Mode

The sidecar SHALL support a streaming mode in which audio chunks are continuously received and transcribed in real time with VAD-triggered segmentation.

#### Scenario: Start streaming session

- WHEN sidecar receives a `{"action": "start_stream"}` command
- THEN it SHALL initialize the audio buffer and VAD processor
- AND respond with `{"status": "streaming", "session_id": "<uuid>"}`

#### Scenario: Process audio chunk

- WHEN sidecar receives `{"action": "audio_chunk", "data": "<base64_pcm>"}` during an active stream
- THEN it SHALL append the audio to the buffer and run VAD detection
- AND if a speech boundary is detected, transcribe the accumulated audio
- AND emit `{"segment_id": <int>, "text": "<transcription>", "is_final": true}`

#### Scenario: Stop streaming session

- WHEN sidecar receives a `{"action": "stop_stream"}` command
- THEN it SHALL transcribe any remaining buffered audio
- AND respond with `{"status": "stream_stopped", "total_segments": <int>}`
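The three streaming commands can be sketched as a small state machine. This is a sketch of the message flow only: `detect_boundary` and `transcribe` are hypothetical stubs standing in for the ONNX VAD and faster-whisper calls.

```python
import base64
import uuid

class StreamSession:
    """Minimal sketch of the streaming command handler."""

    def __init__(self) -> None:
        self.active = False
        self.buffer = bytearray()
        self.segments = 0

    def detect_boundary(self, audio: bytes) -> bool:
        # Placeholder: real sidecar runs ONNX VAD over the PCM buffer.
        return False

    def transcribe(self, audio: bytes) -> str:
        # Placeholder: real sidecar runs faster-whisper, then OpenCC.
        return "(text)"

    def handle(self, msg: dict) -> dict:
        action = msg.get("action")
        if action == "start_stream":
            self.active = True
            self.buffer.clear()
            self.segments = 0
            return {"status": "streaming", "session_id": str(uuid.uuid4())}
        if action == "audio_chunk" and self.active:
            self.buffer.extend(base64.b64decode(msg["data"]))
            if self.detect_boundary(bytes(self.buffer)):
                text = self.transcribe(bytes(self.buffer))
                self.buffer.clear()
                self.segments += 1
                return {"segment_id": self.segments, "text": text, "is_final": True}
            return {"status": "buffering"}
        if action == "stop_stream":
            self.active = False
            if self.buffer:  # flush any trailing audio as a final segment
                self.transcribe(bytes(self.buffer))
                self.buffer.clear()
                self.segments += 1
            return {"status": "stream_stopped", "total_segments": self.segments}
        return {"status": "error", "message": "unknown action or inactive stream"}
```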

### Requirement: VAD-based Speech Segmentation

The sidecar SHALL use Voice Activity Detection (VAD) to identify natural speech boundaries for segmentation.

#### Scenario: Detect speech end

- WHEN VAD detects silence exceeding 500 ms after speech
- THEN the accumulated speech audio SHALL be sent for transcription
- AND a new segment SHALL begin for subsequent speech

#### Scenario: Handle continuous speech

- WHEN speech continues for more than 15 seconds without a pause
- THEN the sidecar SHALL force a segment boundary
- AND transcribe the 15-second chunk to prevent excessive latency
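A minimal sketch of the two boundary rules above, assuming the VAD yields a per-frame speech/silence flag (the class and method names are illustrative, not the sidecar's actual API):

```python
SILENCE_MS = 500         # silence that closes a segment
MAX_SEGMENT_MS = 15_000  # forced boundary for continuous speech

class Segmenter:
    """Decide when accumulated audio should be sent for transcription."""

    def __init__(self) -> None:
        self.speech_ms = 0
        self.silence_ms = 0

    def feed(self, is_speech: bool, frame_ms: int) -> bool:
        """Consume one VAD frame; return True at a segment boundary."""
        if is_speech:
            self.speech_ms += frame_ms
            self.silence_ms = 0
            if self.speech_ms >= MAX_SEGMENT_MS:  # force-cut long speech
                self.speech_ms = 0
                return True
            return False
        self.silence_ms += frame_ms
        if self.speech_ms > 0 and self.silence_ms > SILENCE_MS:
            self.speech_ms = 0  # close the segment after >500 ms of silence
            return True
        return False
```

Resetting the silence counter on every speech frame is what makes the 500 ms rule mean "500 ms of *uninterrupted* silence after speech", not cumulative silence.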

### Requirement: Punctuation in Transcription Output

The sidecar SHALL output transcribed text with appropriate Chinese punctuation marks.

#### Scenario: Add sentence-ending punctuation

- WHEN transcription completes for a segment
- THEN the output SHALL include periods (。) at natural sentence boundaries
- AND question marks (？) for interrogative sentences
- AND commas (，) for clause breaks within sentences

#### Scenario: Detect question patterns

- WHEN transcribed text ends with question particles (嗎、呢、什麼、怎麼、為什麼)
- THEN the punctuation processor SHALL append a question mark (？)
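A simplified sketch of the particle-based rule: if a segment ends in one of the listed question particles it gets ？, otherwise 。. The real processor also inserts mid-sentence commas; names here are illustrative.

```python
# Question particles from the scenario above.
QUESTION_PARTICLES = ("嗎", "呢", "什麼", "怎麼", "為什麼")

def punctuate(segment: str) -> str:
    """Append sentence-final punctuation to one transcribed segment."""
    text = segment.strip()
    if not text or text[-1] in "。？！，":
        return text  # already punctuated (or empty): leave unchanged
    if text.endswith(QUESTION_PARTICLES):
        return text + "？"
    return text + "。"

print(punctuate("今天開會討論什麼"))  # → 今天開會討論什麼？
print(punctuate("會議到此結束"))      # → 會議到此結束。
```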