Files
Meeting_Assistant/openspec/specs/transcription/spec.md
egg 8b6184ecc5 feat: Meeting Assistant MVP - Complete implementation
Enterprise Meeting Knowledge Management System with:

Backend (FastAPI):
- Authentication proxy with JWT (pj-auth-api integration)
- MySQL database with 4 tables (users, meetings, conclusions, actions)
- Meeting CRUD with system code generation (C-YYYYMMDD-XX, A-YYYYMMDD-XX)
- Dify LLM integration for AI summarization
- Excel export with openpyxl
- 20 unit tests (all passing)

Client (Electron):
- Login page with company auth
- Meeting list with create/delete
- Meeting detail with real-time transcription
- Editable transcript textarea (single block, easy editing)
- AI summarization with conclusions/action items
- 5-second segment recording (efficient for long meetings)

Sidecar (Python):
- faster-whisper medium model with int8 quantization
- ONNX Runtime VAD (lightweight, ~20MB vs PyTorch ~2GB)
- Chinese punctuation processing
- OpenCC for Traditional Chinese conversion
- Anti-hallucination parameters
- Auto-cleanup of temp audio files

OpenSpec:
- add-meeting-assistant-mvp (47 tasks, archived)
- add-realtime-transcription (29 tasks, archived)

🤖 Generated with [Claude Code](https://claude.com/claude-code)

Co-Authored-By: Claude Opus 4.5 <noreply@anthropic.com>
2025-12-10 20:17:44 +08:00

91 lines
3.9 KiB
Markdown
Raw Blame History

This file contains ambiguous Unicode characters

This file contains Unicode characters that might be confused with other characters. If you think that this is intentional, you can safely ignore this warning. Use the Escape button to reveal them.

# transcription Specification
## Purpose
TBD - created by archiving change add-meeting-assistant-mvp. Update Purpose after archive.
## Requirements
### Requirement: Edge Speech-to-Text
The Electron client SHALL perform speech-to-text conversion locally using faster-whisper int8 model.
#### Scenario: Successful transcription
- **WHEN** user records audio during a meeting
- **THEN** the audio SHALL be transcribed locally without network dependency
#### Scenario: Transcription on target hardware
- **WHEN** running on i5 processor with 8GB RAM
- **THEN** transcription SHALL complete within acceptable latency for real-time display
### Requirement: Traditional Chinese Output
The transcription engine SHALL output Traditional Chinese (繁體中文) text.
#### Scenario: Simplified to Traditional conversion
- **WHEN** whisper outputs Simplified Chinese characters
- **THEN** OpenCC SHALL convert output to Traditional Chinese
#### Scenario: Native Traditional Chinese
- **WHEN** whisper outputs Traditional Chinese directly
- **THEN** the text SHALL pass through unchanged
### Requirement: Real-time Display
The Electron client SHALL display transcription results in real-time.
#### Scenario: Streaming transcription
- **WHEN** user is recording
- **THEN** transcribed text SHALL appear in the left panel within seconds of speech
### Requirement: Python Sidecar
The transcription engine SHALL be packaged as a Python sidecar using PyInstaller.
#### Scenario: Sidecar startup
- **WHEN** Electron app launches
- **THEN** the Python sidecar containing faster-whisper and OpenCC SHALL be available
#### Scenario: Sidecar communication
- **WHEN** Electron sends audio data to sidecar
- **THEN** transcribed text SHALL be returned via IPC
### Requirement: Streaming Transcription Mode
The sidecar SHALL support a streaming mode where audio chunks are continuously received and transcribed in real-time with VAD-triggered segmentation.
#### Scenario: Start streaming session
- **WHEN** sidecar receives `{"action": "start_stream"}` command
- **THEN** it SHALL initialize audio buffer and VAD processor
- **AND** respond with `{"status": "streaming", "session_id": "<uuid>"}`
#### Scenario: Process audio chunk
- **WHEN** sidecar receives `{"action": "audio_chunk", "data": "<base64_pcm>"}` during active stream
- **THEN** it SHALL append audio to buffer and run VAD detection
- **AND** if speech boundary detected, transcribe accumulated audio
- **AND** emit `{"segment_id": <int>, "text": "<transcription>", "is_final": true}`
#### Scenario: Stop streaming session
- **WHEN** sidecar receives `{"action": "stop_stream"}` command
- **THEN** it SHALL transcribe any remaining buffered audio
- **AND** respond with `{"status": "stream_stopped", "total_segments": <int>}`
### Requirement: VAD-based Speech Segmentation
The sidecar SHALL use Voice Activity Detection to identify natural speech boundaries for segmentation.
#### Scenario: Detect speech end
- **WHEN** VAD detects silence exceeding 500ms after speech
- **THEN** the accumulated speech audio SHALL be sent for transcription
- **AND** a new segment SHALL begin for subsequent speech
#### Scenario: Handle continuous speech
- **WHEN** speech continues for more than 15 seconds without pause
- **THEN** the sidecar SHALL force a segment boundary
- **AND** transcribe the 15-second chunk to prevent excessive latency
### Requirement: Punctuation in Transcription Output
The sidecar SHALL output transcribed text with appropriate Chinese punctuation marks.
#### Scenario: Add sentence-ending punctuation
- **WHEN** transcription completes for a segment
- **THEN** the output SHALL include period (。) at natural sentence boundaries
- **AND** question marks () for interrogative sentences
- **AND** commas () for clause breaks within sentences
#### Scenario: Detect question patterns
- **WHEN** transcribed text ends with question particles (嗎、呢、什麼、怎麼、為什麼)
- **THEN** the punctuation processor SHALL append question mark ()