Context

The Meeting Assistant currently uses batch transcription: audio is recorded, saved to file, then sent to Whisper for processing. This creates a poor UX where users must wait until recording stops to see any text. Users also cannot correct transcription errors.

Stakeholders: End users recording meetings, admins reviewing transcripts
Constraints: i5/8GB hardware target, offline capability required

Goals / Non-Goals

Goals

  • Real-time text display during recording (< 3 second latency)
  • Segment-based editing without disrupting ongoing transcription
  • Punctuation in output (Chinese: 。,?!;:)
  • Maintain offline capability (all processing local)

Non-Goals

  • Speaker diarization (who said what) - future enhancement
  • Multi-language mixing - Chinese only for MVP
  • Cloud-based transcription fallback

Architecture

┌─────────────────────────────────────────────────────────────┐
│ Renderer Process (meeting-detail.html)                      │
│  ┌──────────────┐    ┌─────────────────────────────────┐   │
│  │ MediaRecorder│───▶│ Editable Transcript Component   │   │
│  │ (audio chunks)    │  [Segment 1] [Segment 2] [...]  │   │
│  └──────┬───────┘    └─────────────────────────────────┘   │
│         │ IPC: stream-audio-chunk                          │
└─────────┼──────────────────────────────────────────────────┘
          ▼
┌─────────────────────────────────────────────────────────────┐
│ Main Process (main.js)                                      │
│  ┌──────────────────┐     ┌─────────────────────────────┐  │
│  │ Audio Buffer     │────▶│ Sidecar (stdin pipe)        │  │
│  │ (accumulate PCM) │     │                             │  │
│  └──────────────────┘     └──────────┬──────────────────┘  │
│                                      │ IPC: transcription-segment
│                                      ▼                      │
│                           Forward to renderer               │
└─────────────────────────────────────────────────────────────┘
          │
          ▼ stdin (WAV chunks)
┌─────────────────────────────────────────────────────────────┐
│ Sidecar Process (transcriber.py)                            │
│  ┌──────────────┐   ┌──────────────┐   ┌────────────────┐  │
│  │ VAD Buffer   │──▶│ Whisper      │──▶│ Punctuator     │  │
│  │ (silero-vad) │   │ (transcribe) │   │ (rule-based)   │  │
│  └──────────────┘   └──────────────┘   └────────────────┘  │
│         │                                      │            │
│         │ Detect speech end                    │            │
│         ▼                                      ▼            │
│  stdout: {"segment_id": 1, "text": "今天開會討論。", ...}  │
└─────────────────────────────────────────────────────────────┘
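
The pipe protocol between the main process and the sidecar is line-oriented: WAV chunks go in on stdin, and every completed segment comes back on stdout as one JSON object per line, matching the example in the diagram. Below is a minimal sketch of the output side, assuming `start`/`end` fields in seconds; any field beyond `segment_id` and `text` is illustrative, not taken from the real transcriber.py.

```python
import json
import sys

def emit_segment(segment_id: int, text: str, start: float, end: float) -> None:
    """Write one transcription segment to stdout as a single NDJSON line.

    The Electron main process reads stdout line by line and forwards each
    parsed object to the renderer as a `transcription-segment` IPC event.
    """
    message = {
        "segment_id": segment_id,
        "text": text,
        "start": start,  # seconds from recording start (assumed fields)
        "end": end,
    }
    sys.stdout.write(json.dumps(message, ensure_ascii=False) + "\n")
    sys.stdout.flush()  # flush immediately so the UI sees text within the latency budget
```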

Decisions

Decision 1: VAD-triggered Segmentation

What: Use Silero VAD to detect speech boundaries and transcribe complete utterances (segmentation loop sketched below)
Why:

  • More accurate than fixed-interval chunking
  • Natural sentence boundaries
  • Reduces partial/incomplete transcriptions

Alternatives:

  • Fixed 5-second chunks (simpler but cuts mid-sentence)
  • Word-level streaming (too fragmented, higher latency)
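
A minimal sketch of the segmentation loop, assuming a `vad_speech_prob(frame)` helper that wraps the Silero ONNX model (hypothetical name), 30 ms frames, and a 0.5 speech threshold; the 500 ms end-of-speech silence matches the value in the Risks table.

```python
FRAME_MS = 30           # duration of one audio frame fed to the VAD (assumed)
SILENCE_END_MS = 500    # speech ends after this much continuous silence
SPEECH_THRESHOLD = 0.5  # VAD probability above which a frame counts as speech

def segment_utterances(frames, vad_speech_prob):
    """Yield lists of frames, one list per utterance, via VAD speech/silence gating.

    `frames` is an iterable of fixed-size PCM frames; `vad_speech_prob` is an
    assumed callable wrapping the Silero ONNX model, returning 0.0-1.0.
    """
    buffer, silence_ms, in_speech = [], 0, False
    for frame in frames:
        speaking = vad_speech_prob(frame) >= SPEECH_THRESHOLD
        if speaking:
            buffer.append(frame)
            silence_ms, in_speech = 0, True
        elif in_speech:
            buffer.append(frame)   # keep trailing silence for context
            silence_ms += FRAME_MS
            if silence_ms >= SILENCE_END_MS:
                yield buffer       # complete utterance -> hand to Whisper
                buffer, silence_ms, in_speech = [], 0, False
    if in_speech and buffer:
        yield buffer               # flush whatever remains at stream end
```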

Decision 2: Segment-based Editing

What: Each VAD segment becomes an editable text block with a unique ID (data model sketched below)
Why:

  • Users can edit specific segments without affecting others
  • New segments append without disrupting editing
  • Simple merge on save (concatenate all segments)

Alternatives:

  • Single textarea (editing conflicts with appending text)
  • Contenteditable div (complex cursor management)
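
The actual component lives in the renderer; this Python sketch only illustrates the data model: segments keyed by the sidecar's `segment_id`, edits that replace exactly one entry, and a save that concatenates everything in arrival order (Python dicts preserve insertion order).

```python
class TranscriptStore:
    """Ordered collection of editable transcript segments, keyed by segment id."""

    def __init__(self):
        self._segments: dict[int, str] = {}  # insertion order == arrival order

    def append(self, segment_id: int, text: str) -> None:
        """Add a newly transcribed segment without touching existing ones."""
        self._segments[segment_id] = text

    def edit(self, segment_id: int, new_text: str) -> None:
        """Replace the text of one segment; all other segments are unaffected."""
        if segment_id not in self._segments:
            raise KeyError(f"unknown segment {segment_id}")
        self._segments[segment_id] = new_text

    def merge(self) -> str:
        """Flatten all segments into the final transcript on save."""
        return "".join(self._segments.values())
```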

Decision 3: Audio Format Pipeline

What: WebM (MediaRecorder) → WAV conversion in main.js → raw PCM to sidecar (WAV framing sketched below)
Why:

  • MediaRecorder only outputs WebM/Opus in browsers
  • Whisper works best with WAV/PCM
  • Conversion in main.js keeps the sidecar simple

Alternatives:

  • ffmpeg in sidecar (adds a large dependency)
  • Raw PCM from AudioWorklet (complex; browser compatibility issues)
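
The conversion itself happens in main.js; as an illustration of the final, lossless step, here is the WAV framing in standard-library Python, assuming 16 kHz mono 16-bit PCM (the sample rate is an assumption, not confirmed by this document).

```python
import io
import wave

def pcm_to_wav(pcm: bytes, sample_rate: int = 16000) -> bytes:
    """Wrap raw 16-bit mono PCM in a WAV container for the sidecar's stdin.

    The real pipeline decodes WebM/Opus first; this shows only the container
    step, which is lossless, so no audio quality is lost here.
    """
    buf = io.BytesIO()
    with wave.open(buf, "wb") as wav:
        wav.setnchannels(1)          # mono
        wav.setsampwidth(2)          # 16-bit samples
        wav.setframerate(sample_rate)
        wav.writeframes(pcm)
    return buf.getvalue()
```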

Decision 4: Punctuation via Whisper + Rules

What: Enable Whisper word_timestamps, then apply rule-based punctuation as post-processing (pause rules sketched below)
Why:

  • Whisper alone outputs minimal punctuation for Chinese
  • Rule-based post-processing adds 。,? based on pauses and sentence patterns
  • No additional model needed

Alternatives:

  • Separate punctuation model (adds latency and complexity)
  • No punctuation (rejected: punctuation is a user requirement)
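
A sketch of the pause-based rules, assuming Whisper's word timestamps arrive as (text, start, end) tuples; the thresholds and the 嗎/呢 question-particle heuristic are illustrative, not the shipped rule set.

```python
PAUSE_COMMA_S = 0.3   # short inter-word pause -> ,
PAUSE_PERIOD_S = 0.8  # long inter-word pause  -> 。
QUESTION_PARTICLES = ("嗎", "呢")

def punctuate(words):
    """Insert Chinese punctuation from inter-word pauses and sentence-final particles.

    `words` is a list of (text, start_s, end_s) tuples derived from Whisper's
    word_timestamps output.
    """
    out = []
    for i, (text, _start, end) in enumerate(words):
        out.append(text)
        is_last = i == len(words) - 1
        gap = None if is_last else words[i + 1][1] - end
        if text.endswith(QUESTION_PARTICLES) and (is_last or gap >= PAUSE_COMMA_S):
            out.append("?")
        elif is_last or gap >= PAUSE_PERIOD_S:
            out.append("。")
        elif gap >= PAUSE_COMMA_S:
            out.append(",")
    return "".join(out)
```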

Risks / Trade-offs

Risk                               Mitigation
Latency > 3 s on slow hardware     Use "tiny" model option; skip VAD if needed
WebM→WAV conversion quality loss   Use lossless conversion; test on various inputs
Memory usage with long meetings    Limit audio buffer to 30 s; process and discard (sketched below)
Segment boundary splits words      Use VAD with a 500 ms silence threshold
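
A sketch of the 30-second cap from the memory-usage row, assuming 16 kHz mono 16-bit input; the buffer signals when it is full so the caller can transcribe and then discard the audio.

```python
SAMPLE_RATE = 16000      # assumed sidecar input rate
MAX_BUFFER_SECONDS = 30  # cap from the mitigation above

class BoundedAudioBuffer:
    """Accumulate PCM chunks, forcing a flush before the 30 s cap is exceeded."""

    def __init__(self):
        self._chunks: list[bytes] = []
        self._samples = 0

    def add(self, chunk: bytes) -> bool:
        """Append a chunk; return True once the buffer should be processed."""
        self._chunks.append(chunk)
        self._samples += len(chunk) // 2  # 16-bit samples -> 2 bytes each
        return self._samples >= SAMPLE_RATE * MAX_BUFFER_SECONDS

    def drain(self) -> bytes:
        """Hand the buffered audio to the transcriber and reset to empty."""
        data = b"".join(self._chunks)
        self._chunks, self._samples = [], 0
        return data
```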

Implementation Phases

  1. Phase 1: Sidecar streaming mode with VAD
  2. Phase 2: IPC audio streaming pipeline
  3. Phase 3: Frontend editable segment component
  4. Phase 4: Punctuation post-processing

Open Questions

  • Should segments be auto-merged after N seconds of no editing?
  • Maximum segment count before auto-archiving old segments?