## Context

The Meeting Assistant currently uses batch transcription: audio is recorded, saved to a file, then sent to Whisper for processing. This makes for a poor UX: users see no text until recording stops, and they cannot correct transcription errors.

**Stakeholders**: End users recording meetings; admins reviewing transcripts

**Constraints**: i5/8GB hardware target; offline capability required

## Goals / Non-Goals

### Goals

- Real-time text display during recording (< 3 second latency)
- Segment-based editing without disrupting ongoing transcription
- Punctuation in output (Chinese: 。,?!;:)
- Maintain offline capability (all processing local)

### Non-Goals

- Speaker diarization (who said what): future enhancement
- Multi-language mixing: Chinese only for the MVP
- Cloud-based transcription fallback

## Architecture

```
┌─────────────────────────────────────────────────────────────┐
│ Renderer Process (meeting-detail.html)                      │
│  ┌──────────────┐    ┌─────────────────────────────────┐    │
│  │ MediaRecorder│───▶│ Editable Transcript Component   │    │
│  │(audio chunks)│    │ [Segment 1] [Segment 2] [...]   │    │
│  └──────┬───────┘    └─────────────────────────────────┘    │
│         │ IPC: stream-audio-chunk                           │
└─────────┼───────────────────────────────────────────────────┘
          ▼
┌─────────────────────────────────────────────────────────────┐
│ Main Process (main.js)                                      │
│  ┌──────────────────┐    ┌─────────────────────────────┐    │
│  │ Audio Buffer     │───▶│ Sidecar (stdin pipe)        │    │
│  │ (accumulate PCM) │    │                             │    │
│  └──────────────────┘    └─────┬───────────────────────┘    │
│                                │ IPC: transcription-segment │
│                                ▼                            │
│                        Forward to renderer                  │
└─────────────────────────────────────────────────────────────┘
          │
          ▼ stdin (WAV chunks)
┌─────────────────────────────────────────────────────────────┐
│ Sidecar Process (transcriber.py)                            │
│  ┌──────────────┐   ┌──────────────┐   ┌────────────────┐   │
│  │ VAD Buffer   │──▶│ Whisper      │──▶│ Punctuator     │   │
│  │ (silero-vad) │   │ (transcribe) │   │ (rule-based)   │   │
│  └──────────────┘   └──────────────┘   └────────────────┘   │
│          │                                   │              │
│          │ Detect speech end                 │              │
│          ▼                                   ▼              │
│  stdout: {"segment_id": 1, "text": "今天開會討論。", ...}   │
└─────────────────────────────────────────────────────────────┘
```

## Decisions

### Decision 1: VAD-triggered Segmentation

**What**: Use Silero VAD to detect speech boundaries and transcribe complete utterances (sketched below).

**Why**:
- More accurate than fixed-interval chunking
- Yields natural sentence boundaries
- Reduces partial/incomplete transcriptions

**Alternatives**:
- Fixed 5-second chunks (simpler, but cuts mid-sentence)
- Word-level streaming (too fragmented, higher latency)

### Decision 2: Segment-based Editing

**What**: Each VAD segment becomes an editable text block with a unique ID (data model sketched below).

**Why**:
- Users can edit specific segments without affecting others
- New segments append without disrupting editing
- Simple merge on save (concatenate all segments)

**Alternatives**:
- Single textarea (editing conflicts with appending text)
- Contenteditable div (complex cursor management)

### Decision 3: Audio Format Pipeline

**What**: WebM (MediaRecorder) → WAV conversion in main.js → raw PCM to the sidecar (read loop sketched below).

**Why**:
- MediaRecorder only outputs WebM/Opus in browsers
- Whisper works best with WAV/PCM
- Converting in main.js keeps the sidecar simple

**Alternatives**:
- ffmpeg in the sidecar (adds a large dependency)
- Raw PCM from an AudioWorklet (complex, browser compatibility issues)

### Decision 4: Punctuation via Whisper + Rules

**What**: Enable Whisper's word_timestamps and apply rule-based punctuation afterwards (sketched below).

**Why**:
- Whisper alone outputs minimal punctuation for Chinese
- Rule-based post-processing adds 。,? based on pauses and patterns
- No additional model needed

**Alternatives**:
- Separate punctuation model (adds latency and complexity)
- No punctuation (fails a stated user requirement)
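To make Decision 1 concrete, here is a minimal sketch of the streaming loop in transcriber.py, emitting one JSON line per utterance in the shape shown in the architecture diagram. It assumes the `silero-vad` and `openai-whisper` pip packages; the 500 ms silence threshold comes from the risks table, and everything else (names, model size) is illustrative rather than settled.

```python
# Sketch only: VAD-triggered segmentation (Decision 1). Assumes the
# `silero-vad` and `openai-whisper` pip packages; helper names are
# illustrative, not existing project code.
import json

import numpy as np
import torch
import whisper
from silero_vad import load_silero_vad, VADIterator

SAMPLE_RATE = 16_000   # silero-vad consumes 512-sample chunks at 16 kHz

vad = VADIterator(load_silero_vad(),
                  min_silence_duration_ms=500)   # 500 ms silence ends a segment
asr = whisper.load_model("base")                 # drop to "tiny" on slow hardware

def run(pcm_chunks):
    """pcm_chunks yields float32 arrays of 512 samples each."""
    buffer, in_speech, seg_id = [], False, 0
    for chunk in pcm_chunks:
        event = vad(torch.from_numpy(chunk))     # {'start': ...}, {'end': ...}, or None
        if event and "start" in event:
            in_speech = True
        if in_speech:
            buffer.append(chunk)
        if event and "end" in event:             # speech ended: one complete utterance
            seg_id += 1
            result = asr.transcribe(np.concatenate(buffer),
                                    language="zh", word_timestamps=True)
            print(json.dumps({"segment_id": seg_id, "text": result["text"]},
                             ensure_ascii=False), flush=True)
            buffer, in_speech = [], False
```

`flush=True` matters on a stdout pipe: without it, completed segments would sit in the block buffer instead of reaching main.js as they finish.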
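The contract in Decision 2 is easiest to see as a data model. The real component lives in the renderer (JavaScript); this Python sketch, with hypothetical names, only illustrates the append / edit / merge-on-save behavior.

```python
# Hypothetical data model behind the editable transcript (Decision 2).
# The real component is JavaScript in the renderer; this only shows the
# append / edit / merge-on-save contract.
from dataclasses import dataclass, field

@dataclass
class Transcript:
    segments: dict[int, str] = field(default_factory=dict)  # segment_id -> text

    def append(self, segment_id: int, text: str) -> None:
        """New VAD segments arrive without disturbing segments being edited."""
        self.segments[segment_id] = text

    def edit(self, segment_id: int, text: str) -> None:
        """A user correction touches exactly one segment."""
        self.segments[segment_id] = text

    def merge(self) -> str:
        """Merge on save: concatenate all segments in ID order."""
        return "".join(text for _, text in sorted(self.segments.items()))
```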
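On the sidecar side of Decision 3, the read loop stays format-agnostic: main.js does the WebM→WAV work, so transcriber.py only consumes fixed-size PCM frames from stdin. A sketch, assuming 16 kHz mono 16-bit little-endian samples (the exact framing is an assumption, not a settled protocol):

```python
# Sidecar-side read loop for Decision 3: main.js pipes converted audio to
# stdin, so transcriber.py never needs ffmpeg. Assumes 16 kHz mono s16le.
import sys

import numpy as np

CHUNK_SAMPLES = 512                      # matches the VAD chunk size above
BYTES_PER_CHUNK = CHUNK_SAMPLES * 2      # 16-bit samples

def pcm_chunks():
    """Yield float32 chunks in [-1, 1] from int16 PCM on stdin."""
    while True:
        raw = sys.stdin.buffer.read(BYTES_PER_CHUNK)
        if len(raw) < BYTES_PER_CHUNK:   # pipe closed: recording stopped
            return
        yield np.frombuffer(raw, dtype=np.int16).astype(np.float32) / 32768.0
```

`pcm_chunks()` is what `run()` in the Decision 1 sketch would iterate over.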
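For Decision 4, the rules can key off the inter-word pauses that `word_timestamps=True` exposes: a long pause (or the end of the utterance) closes a sentence with 。 or ?, while a shorter pause inserts ,. In this sketch the thresholds and the question-particle list are illustrative guesses, not tuned values.

```python
# Rule-based punctuation sketch (Decision 4). Pause thresholds and the
# question-particle list are illustrative, not tuned values.
QUESTION_ENDINGS = ("嗎", "呢")   # hypothetical pattern list

def punctuate(words, pause_comma=0.3, pause_period=0.8):
    """words: [{"word": str, "start": float, "end": float}, ...]
    as produced per segment by whisper with word_timestamps=True."""
    out = []
    for cur, nxt in zip(words, words[1:] + [None]):
        out.append(cur["word"])
        gap = (nxt["start"] - cur["end"]) if nxt else None
        if gap is None or gap >= pause_period:    # long pause or utterance end
            out.append("?" if cur["word"].endswith(QUESTION_ENDINGS) else "。")
        elif gap >= pause_comma:                  # short pause: clause break
            out.append(",")
    return "".join(out)
```

With openai-whisper, each entry of `result["segments"][i]["words"]` already carries `word`, `start`, and `end` keys, which is all this function needs.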
## Risks / Trade-offs

| Risk | Mitigation |
|------|------------|
| Latency > 3 s on slow hardware | Use "tiny" model option; skip VAD if needed |
| WebM→WAV conversion quality loss | Use lossless conversion; test on varied inputs |
| Memory usage in long meetings | Limit audio buffer to 30 s; process and discard |
| Segment boundary splits words | Use VAD with a 500 ms silence threshold |

## Implementation Phases

1. **Phase 1**: Sidecar streaming mode with VAD
2. **Phase 2**: IPC audio streaming pipeline
3. **Phase 3**: Frontend editable segment component
4. **Phase 4**: Punctuation post-processing

## Open Questions

- Should segments be auto-merged after N seconds of no editing?
- Maximum segment count before auto-archiving old segments?
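Returning to the risks table, the memory mitigation is mostly a one-liner: a bounded deque caps the buffer at roughly 30 seconds by discarding the oldest chunks on append. The chunking constants are assumptions carried over from the earlier sketches.

```python
# Sketch of the 30 s audio cap from the risks table: memory stays flat
# because a bounded deque discards its oldest chunks once full.
from collections import deque

import numpy as np

SAMPLE_RATE = 16_000
CHUNK_SAMPLES = 512                                  # as in the earlier sketches
MAX_CHUNKS = (30 * SAMPLE_RATE) // CHUNK_SAMPLES     # ≈ 30 s of audio

buffer: deque = deque(maxlen=MAX_CHUNKS)

def on_chunk(chunk: np.ndarray) -> None:
    buffer.append(chunk)                 # once full, the oldest chunk drops off

def take_segment() -> np.ndarray:
    """Process and discard: hand buffered audio to Whisper, then reset."""
    audio = np.concatenate(list(buffer))
    buffer.clear()
    return audio
```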