# transcription Specification

## Purpose
Defines the edge speech-to-text capability of the meeting assistant: a Python sidecar transcribes meeting audio locally with faster-whisper (int8), segments speech with VAD, outputs punctuated Traditional Chinese via OpenCC, and streams results to the Electron client for real-time display. (Created by archiving change add-meeting-assistant-mvp.)
## Requirements
### Requirement: Edge Speech-to-Text
The Electron client SHALL perform speech-to-text conversion locally using the faster-whisper int8 model.
#### Scenario: Successful transcription
- WHEN user records audio during a meeting
- THEN the audio SHALL be transcribed locally without network dependency
#### Scenario: Transcription on target hardware
- WHEN running on i5 processor with 8GB RAM
- THEN transcription SHALL complete within acceptable latency for real-time display
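A minimal sketch of the local transcription path, assuming the CPU-only `medium` model with int8 quantization; the decoding parameters shown (beam size, VAD filter, disabling conditioning on previous text) are illustrative choices, not mandated by this requirement.

```python
# Sketch: local speech-to-text with faster-whisper, int8 on CPU.
# Model size and decoding parameters are illustrative assumptions.
from faster_whisper import WhisperModel

model = WhisperModel("medium", device="cpu", compute_type="int8")

def transcribe_file(wav_path: str) -> str:
    segments, _info = model.transcribe(
        wav_path,
        language="zh",                     # force Chinese decoding
        beam_size=5,
        vad_filter=True,                   # drop non-speech before decoding
        condition_on_previous_text=False,  # reduces repetition on long audio
    )
    return "".join(segment.text for segment in segments)
```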
### Requirement: Traditional Chinese Output
The transcription engine SHALL output Traditional Chinese (繁體中文) text.
#### Scenario: Simplified to Traditional conversion
- WHEN whisper outputs Simplified Chinese characters
- THEN OpenCC SHALL convert output to Traditional Chinese
#### Scenario: Native Traditional Chinese
- WHEN whisper outputs Traditional Chinese directly
- THEN the text SHALL pass through unchanged
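A minimal sketch of the conversion step, assuming the OpenCC Python binding; the `s2twp` configuration (Simplified to Taiwan Traditional with phrase mapping) is an assumption, and the plain `s2t` table would also satisfy the requirement.

```python
# Sketch: Simplified -> Traditional post-processing with OpenCC.
# The "s2twp" config name is an assumption; naming varies slightly
# between OpenCC Python bindings.
from opencc import OpenCC

_converter = OpenCC("s2twp")

def to_traditional(text: str) -> str:
    # Traditional characters map onto themselves, so native Traditional
    # output passes through unchanged, as required above.
    return _converter.convert(text)
```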
### Requirement: Real-time Display
The Electron client SHALL display transcription results in real-time.
#### Scenario: Streaming transcription
- WHEN user is recording
- THEN transcribed text SHALL appear in the left panel within seconds of speech
### Requirement: Python Sidecar
The transcription engine SHALL be packaged as a Python sidecar using PyInstaller.
#### Scenario: Sidecar startup
- WHEN Electron app launches
- THEN the Python sidecar containing faster-whisper and OpenCC SHALL be available
#### Scenario: Sidecar communication
- WHEN Electron sends audio data to sidecar
- THEN transcribed text SHALL be returned via IPC
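A minimal sketch of the sidecar side of this communication, assuming newline-delimited JSON over stdin/stdout as the IPC transport (the requirement mandates IPC, not this specific transport); `transcribe_file` and the `transcribe` action name are hypothetical, mirroring the helper sketched under Edge Speech-to-Text.

```python
# Sketch: sidecar IPC loop over stdin/stdout using newline-delimited JSON.
# The transport and the "transcribe" action name are assumptions.
import json
import sys

def send(message: dict) -> None:
    sys.stdout.write(json.dumps(message, ensure_ascii=False) + "\n")
    sys.stdout.flush()

def main() -> None:
    for line in sys.stdin:
        request = json.loads(line)
        if request.get("action") == "transcribe":
            send({"text": transcribe_file(request["path"])})
        else:
            send({"error": f"unknown action: {request.get('action')}"})

if __name__ == "__main__":
    main()
```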
### Requirement: Streaming Transcription Mode
The sidecar SHALL support a streaming mode where audio chunks are continuously received and transcribed in real-time with VAD-triggered segmentation.
#### Scenario: Start streaming session
- WHEN sidecar receives `{"action": "start_stream"}` command
- THEN it SHALL initialize audio buffer and VAD processor
- AND respond with `{"status": "streaming", "session_id": "<uuid>"}`
#### Scenario: Process audio chunk
- WHEN sidecar receives `{"action": "audio_chunk", "data": "<base64_pcm>"}` during active stream
- THEN it SHALL append audio to buffer and run VAD detection
- AND if speech boundary detected, transcribe accumulated audio
- AND emit `{"segment_id": <int>, "text": "<transcription>", "is_final": true}`
#### Scenario: Stop streaming session
- WHEN sidecar receives `{"action": "stop_stream"}` command
- THEN it SHALL transcribe any remaining buffered audio
- AND respond with `{"status": "stream_stopped", "total_segments": <int>}`
### Requirement: VAD-based Speech Segmentation
The sidecar SHALL use Voice Activity Detection to identify natural speech boundaries for segmentation.
#### Scenario: Detect speech end
- WHEN VAD detects silence exceeding 500ms after speech
- THEN the accumulated speech audio SHALL be sent for transcription
- AND a new segment SHALL begin for subsequent speech
#### Scenario: Handle continuous speech
- WHEN speech continues for more than 15 seconds without pause
- THEN the sidecar SHALL force a segment boundary
- AND transcribe the 15-second chunk to prevent excessive latency
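A minimal sketch of the boundary rules above (500 ms silence cutoff, 15 s forced cut); `is_speech` stands in for the ONNX VAD model's per-frame decision, and the 30 ms frame duration is an assumption.

```python
# Sketch: VAD-driven segmentation with a 500 ms silence cutoff and a
# 15 s forced boundary. Frame duration and is_speech() are assumptions.
SILENCE_LIMIT_S = 0.5   # silence that closes a segment
MAX_SEGMENT_S = 15.0    # forced boundary during continuous speech
FRAME_S = 0.03          # assumed per-frame duration

def segment_audio(frames, is_speech):
    buffer, voiced, silence = [], 0.0, 0.0
    for frame in frames:
        buffer.append(frame)
        if is_speech(frame):
            voiced += FRAME_S
            silence = 0.0
        else:
            silence += FRAME_S
        if voiced > 0 and (silence >= SILENCE_LIMIT_S or voiced >= MAX_SEGMENT_S):
            yield b"".join(buffer)      # hand accumulated speech to the transcriber
            buffer, voiced, silence = [], 0.0, 0.0
    if voiced > 0:
        yield b"".join(buffer)          # flush remaining speech on stream stop
```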
### Requirement: Punctuation in Transcription Output
The sidecar SHALL output transcribed text with appropriate Chinese punctuation marks.
#### Scenario: Add sentence-ending punctuation
- WHEN transcription completes for a segment
- THEN the output SHALL include period (。) at natural sentence boundaries
- AND question marks (?) for interrogative sentences
- AND commas (,) for clause breaks within sentences
#### Scenario: Detect question patterns
- WHEN transcribed text ends with question particles (嗎、呢、什麼、怎麼、為什麼)
- THEN the punctuation processor SHALL append question mark (?)
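A minimal sketch of the sentence-ending part of this requirement; the particle list mirrors the scenario above, and clause-level comma insertion is left out.

```python
# Sketch: append 。 or ? at the end of a transcribed segment.
# Only the sentence-ending rule is shown; comma insertion is omitted.
QUESTION_PARTICLES = ("嗎", "呢", "什麼", "怎麼", "為什麼")

def punctuate_segment(text: str) -> str:
    text = text.strip().rstrip("。?,")
    if text.endswith(QUESTION_PARTICLES):
        return text + "?"
    return text + "。"
```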