## Context

The Meeting Assistant currently uses batch transcription: audio is recorded, saved to a file, and only then sent to Whisper for processing. Users therefore see no text until recording stops, and they have no way to correct transcription errors.

**Stakeholders**: End users recording meetings, admins reviewing transcripts
**Constraints**: i5/8GB hardware target, offline capability required

## Goals / Non-Goals

### Goals
- Real-time text display during recording (< 3 second latency)
- Segment-based editing without disrupting ongoing transcription
- Punctuation in output (Chinese: 。,?!;:)
- Maintain offline capability (all processing local)

### Non-Goals
- Speaker diarization (who said what) - future enhancement
- Multi-language mixing - Chinese only for MVP
- Cloud-based transcription fallback

## Architecture

```
┌─────────────────────────────────────────────────────────────┐
│ Renderer Process (meeting-detail.html)                      │
│  ┌────────────────┐   ┌─────────────────────────────────┐   │
│  │ MediaRecorder  │──▶│ Editable Transcript Component   │   │
│  │ (audio chunks) │   │ [Segment 1] [Segment 2] [...]   │   │
│  └───────┬────────┘   └─────────────────────────────────┘   │
│          │ IPC: stream-audio-chunk                          │
└──────────┼──────────────────────────────────────────────────┘
           ▼
┌─────────────────────────────────────────────────────────────┐
│ Main Process (main.js)                                      │
│  ┌──────────────────┐     ┌─────────────────────────────┐   │
│  │ Audio Buffer     │────▶│ Sidecar (stdin pipe)        │   │
│  │ (accumulate PCM) │     │                             │   │
│  └──────────────────┘     └──┬──────────────────────────┘   │
│                              │ IPC: transcription-segment   │
│                              ▼                              │
│                     Forward to renderer                     │
└─────────────────────────────────────────────────────────────┘
           │
           ▼ stdin (WAV chunks)
┌─────────────────────────────────────────────────────────────┐
│ Sidecar Process (transcriber.py)                            │
│  ┌──────────────┐   ┌──────────────┐   ┌────────────────┐   │
│  │ VAD Buffer   │──▶│ Whisper      │──▶│ Punctuator     │   │
│  │ (silero-vad) │   │ (transcribe) │   │ (rule-based)   │   │
│  └──────────────┘   └──────────────┘   └────────────────┘   │
│         │                                      │            │
│         │ Detect speech end                    │            │
│         ▼                                      ▼            │
│ stdout: {"segment_id": 1, "text": "今天開會討論。", ...}    │
└─────────────────────────────────────────────────────────────┘
```

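The sidecar's stdout message in the diagram suggests a newline-delimited JSON stream. The sketch below shows one way that framing could work, assuming one JSON object per line. Only the `segment_id` and `text` fields appear in the diagram; the function names `encode_segment` and `decode_segments` are hypothetical, and any further fields hidden behind the `...` are not modeled.

```python
import json


def encode_segment(segment_id: int, text: str) -> str:
    """Serialize one segment as a single JSON line.

    json.dumps escapes embedded newlines, so the result never spans lines.
    ensure_ascii=False keeps Chinese text readable in the pipe.
    """
    return json.dumps({"segment_id": segment_id, "text": text}, ensure_ascii=False)


def decode_segments(stream_text: str) -> list:
    """Parse newline-delimited JSON, skipping blank lines."""
    return [json.loads(line) for line in stream_text.split("\n") if line.strip()]


if __name__ == "__main__":
    payload = encode_segment(1, "今天開會討論。") + "\n" + encode_segment(2, "下週交付。") + "\n"
    for segment in decode_segments(payload):
        print(segment["segment_id"], segment["text"])
```

Line-based framing lets the main process split the stream on `\n` and forward each parsed object to the renderer without a length prefix.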
## Decisions

### Decision 1: VAD-triggered Segmentation
**What**: Use Silero VAD to detect speech boundaries and transcribe each complete utterance

**Why**:
- More accurate than fixed-interval chunking
- Natural sentence boundaries
- Reduces partial/incomplete transcriptions

**Alternatives**:
- Fixed 5-second chunks (simpler but cuts mid-sentence)
- Word-level streaming (too fragmented, higher latency)

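The segmentation idea can be sketched independently of the VAD model. In this illustration the VAD decision is passed in as a boolean per frame, so Silero itself is not called. The class name `UtteranceSegmenter`, the 30 ms frame size, and the choice to drop trailing silence are assumptions; the 500 ms threshold mirrors the value in the risk table.

```python
from dataclasses import dataclass, field
from typing import Optional


@dataclass
class UtteranceSegmenter:
    """Group audio frames into utterances that end after enough silence."""

    frame_ms: int = 30          # assumed frame duration
    min_silence_ms: int = 500   # silence needed to close an utterance
    _frames: list = field(default_factory=list, init=False)
    _pending_silence: list = field(default_factory=list, init=False)

    def push(self, frame: bytes, is_speech: bool) -> Optional[bytes]:
        """Feed one frame; return a finished utterance, or None if still open."""
        if is_speech:
            # The pause was short, so keep it inside the utterance.
            self._frames.extend(self._pending_silence)
            self._pending_silence.clear()
            self._frames.append(frame)
            return None
        if not self._frames:
            return None  # leading silence, nothing buffered yet
        self._pending_silence.append(frame)
        if len(self._pending_silence) * self.frame_ms < self.min_silence_ms:
            return None
        # Enough silence: emit the speech and drop the trailing silence.
        utterance = b"".join(self._frames)
        self._frames.clear()
        self._pending_silence.clear()
        return utterance
```

Each returned utterance would then be handed to Whisper as one unit, which is what keeps segments aligned with natural pauses rather than fixed intervals.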
### Decision 2: Segment-based Editing
**What**: Each VAD segment becomes an editable text block with a unique ID

**Why**:
- Users can edit specific segments without affecting others
- New segments append without disrupting editing
- Simple merge on save (concatenate all segments)

**Alternatives**:
- Single textarea (editing conflicts with appending text)
- Contenteditable div (complex cursor management)

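A minimal model of the segment store, written in Python for consistency with the other sketches even though the real component lives in the renderer. The class name `Transcript`, the empty-string separator, and the rule that a repeated `segment_id` does not overwrite existing text are all assumptions for illustration.

```python
class Transcript:
    """Ordered collection of editable segments keyed by segment ID."""

    def __init__(self) -> None:
        # dict preserves insertion order, which is the display order.
        self._segments = {}

    def append(self, segment_id: int, text: str) -> None:
        """Add a new segment; an existing (possibly edited) one is left alone."""
        self._segments.setdefault(segment_id, text)

    def edit(self, segment_id: int, text: str) -> None:
        """Replace the text of one segment without touching the others."""
        if segment_id not in self._segments:
            raise KeyError(segment_id)
        self._segments[segment_id] = text

    def merged(self, separator: str = "") -> str:
        """Concatenate all segments in order, as done on save."""
        return separator.join(self._segments.values())


if __name__ == "__main__":
    transcript = Transcript()
    transcript.append(1, "今天開會討論。")
    transcript.append(2, "下週交付。")
    transcript.edit(1, "今天開會討論預算。")  # user correction
    transcript.append(3, "散會。")            # arrives while the user is editing
    print(transcript.merged())
```

Because edits and appends address different keys, a segment arriving mid-edit never moves the cursor or rewrites text the user is working on.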
### Decision 3: Audio Format Pipeline
**What**: WebM (MediaRecorder) → WAV conversion in main.js → raw PCM to sidecar

**Why**:
- MediaRecorder in Chromium-based runtimes such as Electron emits compressed audio (typically WebM/Opus), not raw PCM
- Whisper works best with WAV/PCM
- Conversion in main.js keeps sidecar simple

**Alternatives**:
- ffmpeg in sidecar (adds large dependency)
- Raw PCM from AudioWorklet (complex, browser compatibility issues)

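The WAV end of this pipeline is only a header around PCM samples. The sketch below wraps already-decoded PCM in a WAV container using Python's standard `wave` module. It does not cover the WebM/Opus decoding step, and it is shown in Python for illustration even though the document places the conversion in main.js. The function name `pcm_to_wav` and the 16 kHz, mono, 16-bit defaults are assumptions (16 kHz is the rate Whisper models expect).

```python
import io
import wave


def pcm_to_wav(pcm: bytes, sample_rate: int = 16000,
               channels: int = 1, sample_width: int = 2) -> bytes:
    """Wrap raw little-endian PCM samples in a WAV container."""
    buffer = io.BytesIO()
    with wave.open(buffer, "wb") as writer:
        writer.setnchannels(channels)
        writer.setsampwidth(sample_width)   # bytes per sample (2 = 16-bit)
        writer.setframerate(sample_rate)
        writer.writeframes(pcm)             # header sizes are fixed up on close
    return buffer.getvalue()


if __name__ == "__main__":
    one_second_of_silence = b"\x00\x00" * 16000
    wav_bytes = pcm_to_wav(one_second_of_silence)
    with wave.open(io.BytesIO(wav_bytes), "rb") as reader:
        print(reader.getframerate(), reader.getnframes())
```

Since the container adds only a header, sending WAV chunks or raw PCM over stdin differs mainly in whether the sidecar has to be told the sample format out of band.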
### Decision 4: Punctuation via Whisper + Rules
**What**: Enable Whisper `word_timestamps`, then apply rule-based punctuation as a post-processing step

**Why**:
- Whisper alone outputs minimal punctuation for Chinese
- Rule-based post-processing adds 。,? based on pauses and patterns
- No additional model needed

**Alternatives**:
- Separate punctuation model (adds latency and complexity)
- No punctuation (rejected: punctuation is a user requirement)

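A rough sketch of pause-based punctuation, assuming each word arrives as a `(text, start, end)` tuple in seconds, which is the kind of data word timestamps provide. The function name `punctuate`, the pause thresholds (0.3 s and 0.7 s), and the question-particle list are hypothetical; the document does not specify the actual rules.

```python
# Sentence-final particles that usually mark a question (illustrative list).
QUESTION_ENDINGS = ("嗎", "呢")


def punctuate(words, comma_pause: float = 0.3, period_pause: float = 0.7) -> str:
    """Insert 。,? based on the silence that follows each word.

    words: list of (text, start_seconds, end_seconds) tuples in time order.
    """
    pieces = []
    for index, (text, _start, end) in enumerate(words):
        pieces.append(text)
        is_last = index == len(words) - 1
        # Treat the end of the segment as an unbounded pause.
        gap = float("inf") if is_last else words[index + 1][1] - end
        if gap >= period_pause:
            pieces.append("?" if text.endswith(QUESTION_ENDINGS) else "。")
        elif gap >= comma_pause:
            pieces.append(",")
    return "".join(pieces)


if __name__ == "__main__":
    sample = [
        ("今天", 0.0, 0.4), ("開會", 0.4, 0.8),
        ("討論", 1.2, 1.6), ("預算", 1.6, 2.0),
        ("可以", 3.0, 3.3), ("嗎", 3.3, 3.5),
    ]
    print(punctuate(sample))
```

Since the rules only inspect timestamps and the final character of each word, they run in linear time and need no model beyond Whisper itself.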
## Risks / Trade-offs

| Risk | Mitigation |
|------|------------|
| Latency > 3s on slow hardware | Use "tiny" model option, skip VAD if needed |
| WebM→WAV conversion quality loss | Decode to uncompressed PCM without re-encoding, test on various inputs |
| Memory usage with long meetings | Limit audio buffer to 30s, process and discard |
| Segment boundary splits words | Use VAD with 500ms silence threshold |

## Implementation Phases

1. **Phase 1**: Sidecar streaming mode with VAD
2. **Phase 2**: IPC audio streaming pipeline
3. **Phase 3**: Frontend editable segment component
4. **Phase 4**: Punctuation post-processing

## Open Questions
- Should segments be auto-merged after N seconds of no editing?
- What is the maximum segment count before old segments are auto-archived?