Meeting_Assistant/design.md at d3e32056929b88cb3e573c0b74ff6a39ddcb4dae

egg 8b6184ecc5 feat: Meeting Assistant MVP - Complete implementation

Enterprise Meeting Knowledge Management System with:

Backend (FastAPI):
- Authentication proxy with JWT (pj-auth-api integration)
- MySQL database with 4 tables (users, meetings, conclusions, actions)
- Meeting CRUD with system code generation (C-YYYYMMDD-XX, A-YYYYMMDD-XX)
- Dify LLM integration for AI summarization
- Excel export with openpyxl
- 20 unit tests (all passing)

Client (Electron):
- Login page with company auth
- Meeting list with create/delete
- Meeting detail with real-time transcription
- Editable transcript textarea (single block, easy editing)
- AI summarization with conclusions/action items
- 5-second segment recording (efficient for long meetings)

Sidecar (Python):
- faster-whisper medium model with int8 quantization
- ONNX Runtime VAD (lightweight, ~20MB vs PyTorch ~2GB)
- Chinese punctuation processing
- OpenCC for Traditional Chinese conversion
- Anti-hallucination parameters
- Auto-cleanup of temp audio files

OpenSpec:
- add-meeting-assistant-mvp (47 tasks, archived)
- add-realtime-transcription (29 tasks, archived)

🤖 Generated with [Claude Code](https://claude.com/claude-code)

Co-Authored-By: Claude Opus 4.5 <noreply@anthropic.com>

2025-12-10 20:17:44 +08:00

Context

The Meeting Assistant currently uses batch transcription: audio is recorded, saved to a file, and then sent to Whisper for processing. Users therefore see no text until recording stops, and they have no way to correct transcription errors.

Stakeholders: end users recording meetings, admins reviewing transcripts
Constraints: i5/8GB hardware target, offline capability required

Goals / Non-Goals

Goals

Real-time text display during recording (< 3 second latency)
Segment-based editing without disrupting ongoing transcription
Punctuation in output (Chinese: 。，？！；：)
Maintain offline capability (all processing local)

Non-Goals

Speaker diarization (who said what) - future enhancement
Multi-language mixing - Chinese only for MVP
Cloud-based transcription fallback

Architecture

┌─────────────────────────────────────────────────────────────┐
│ Renderer Process (meeting-detail.html)                      │
│  ┌──────────────┐    ┌─────────────────────────────────┐   │
│  │ MediaRecorder│───▶│ Editable Transcript Component   │   │
│  │(audio chunks)│    │  [Segment 1] [Segment 2] [...]  │   │
│  └──────┬───────┘    └─────────────────────────────────┘   │
│         │ IPC: stream-audio-chunk                          │
└─────────┼──────────────────────────────────────────────────┘
          ▼
┌─────────────────────────────────────────────────────────────┐
│ Main Process (main.js)                                      │
│  ┌──────────────────┐     ┌─────────────────────────────┐  │
│  │ Audio Buffer     │────▶│ Sidecar (stdin pipe)        │  │
│  │ (accumulate PCM) │     │                             │  │
│  └──────────────────┘     └──────────┬──────────────────┘  │
│                                      │ IPC: transcription-segment
│                                      ▼                      │
│                           Forward to renderer               │
└─────────────────────────────────────────────────────────────┘
          │
          ▼ stdin (WAV chunks)
┌─────────────────────────────────────────────────────────────┐
│ Sidecar Process (transcriber.py)                            │
│  ┌──────────────┐   ┌──────────────┐   ┌────────────────┐  │
│  │ VAD Buffer   │──▶│ Whisper      │──▶│ Punctuator     │  │
│  │ (silero-vad) │   │ (transcribe) │   │ (rule-based)   │  │
│  └──────────────┘   └──────────────┘   └────────────────┘  │
│         │                                      │            │
│         │ Detect speech end                    │            │
│         ▼                                      ▼            │
│  stdout: {"segment_id": 1, "text": "今天開會討論。", ...}  │
└─────────────────────────────────────────────────────────────┘

Decisions

Decision 1: VAD-triggered Segmentation

What: Use Silero VAD to detect speech boundaries and transcribe complete utterances

Why:

More accurate than fixed-interval chunking
Natural sentence boundaries
Reduces partial/incomplete transcriptions

Alternatives:

Fixed 5-second chunks (simpler but cuts mid-sentence)
Word-level streaming (too fragmented, higher latency)
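
The segmentation logic behind this decision can be sketched independently of the VAD model. The example assumes the VAD yields one speech probability per fixed-length frame; the 32 ms frame length and 0.5 cut-off are illustrative assumptions, while the 500 ms silence threshold comes from the risk table. `segment_by_silence` and `SpeechSegment` are hypothetical names, not part of the project.

```python
from dataclasses import dataclass
from typing import Iterable, Iterator, Optional


@dataclass
class SpeechSegment:
    start_frame: int  # index of the first speech frame
    end_frame: int    # index one past the last speech frame


def segment_by_silence(
    speech_probs: Iterable[float],
    frame_ms: int = 32,             # assumed frame duration
    speech_threshold: float = 0.5,  # assumed probability cut-off
    min_silence_ms: int = 500,      # silence needed to close a segment
) -> Iterator[SpeechSegment]:
    """Yield a segment each time speech is followed by enough silence."""
    silence_frames_needed = -(-min_silence_ms // frame_ms)  # ceiling division
    start: Optional[int] = None
    last_speech = 0
    for i, prob in enumerate(speech_probs):
        if prob >= speech_threshold:
            if start is None:
                start = i  # a new utterance begins
            last_speech = i
        elif start is not None and i - last_speech >= silence_frames_needed:
            yield SpeechSegment(start, last_speech + 1)
            start = None
    if start is not None:
        # The stream ended while speech was still open: flush the remainder.
        yield SpeechSegment(start, last_speech + 1)
```

Short pauses inside an utterance do not close the segment, which is what separates this approach from fixed-interval chunking.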

Decision 2: Segment-based Editing

What: Each VAD segment becomes an editable text block with a unique ID

Why:

Users can edit specific segments without affecting others
New segments append without disrupting editing
Simple merge on save (concatenate all segments)

Alternatives:

Single textarea (editing conflicts with appending text)
Contenteditable div (complex cursor management)
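
A minimal model of the segment store shows why appending and editing do not interfere. The real component lives in the renderer, so this Python sketch only illustrates the data structure; `Transcript`, `Segment`, and the rule of ignoring duplicate IDs are assumptions.

```python
from dataclasses import dataclass, field
from typing import Dict


@dataclass
class Segment:
    segment_id: int
    text: str
    edited: bool = False  # set once the user changes the text


@dataclass
class Transcript:
    _segments: Dict[int, Segment] = field(default_factory=dict)

    def append(self, segment_id: int, text: str) -> None:
        """Add a newly transcribed segment without touching existing ones."""
        if segment_id in self._segments:
            return  # assumption: a repeated ID never overwrites user edits
        self._segments[segment_id] = Segment(segment_id, text)

    def edit(self, segment_id: int, text: str) -> None:
        """Replace the text of one segment; other segments are unaffected."""
        segment = self._segments[segment_id]
        segment.text = text
        segment.edited = True

    def merged(self) -> str:
        """Concatenate all segments in ID order (the merge on save)."""
        return "".join(self._segments[k].text for k in sorted(self._segments))
```

Because each edit is keyed by segment ID, a segment that arrives while the user is typing only adds a new entry and never moves text under the cursor.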

Decision 3: Audio Format Pipeline

What: WebM (MediaRecorder) → WAV conversion in main.js → raw PCM to sidecar

Why:

In Electron's Chromium renderer, MediaRecorder emits compressed WebM/Opus, not raw PCM or WAV
Whisper expects 16 kHz mono PCM, so the audio has to be decoded before transcription
Conversion in main.js keeps the sidecar simple

Alternatives:

ffmpeg in sidecar (adds large dependency)
Raw PCM from AudioWorklet (complex, browser compatibility issues)
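
The relationship between raw PCM and WAV in this pipeline is small enough to show with the standard library. The sketch assumes 16-bit little-endian mono samples at 16 kHz. In the design the conversion itself runs in main.js, so these hypothetical helpers (`pcm_to_wav`, `wav_to_floats`) only illustrate what the sidecar would receive and how it might decode it.

```python
import io
import struct
import wave
from typing import List

SAMPLE_RATE = 16_000  # assumed pipeline rate; Whisper models use 16 kHz input


def pcm_to_wav(pcm: bytes, sample_rate: int = SAMPLE_RATE) -> bytes:
    """Wrap raw 16-bit little-endian mono PCM in a WAV container."""
    buf = io.BytesIO()
    with wave.open(buf, "wb") as writer:
        writer.setnchannels(1)
        writer.setsampwidth(2)  # 2 bytes per sample = 16-bit audio
        writer.setframerate(sample_rate)
        writer.writeframes(pcm)
    return buf.getvalue()


def wav_to_floats(wav_bytes: bytes) -> List[float]:
    """Decode a 16-bit mono WAV chunk into floats in [-1.0, 1.0)."""
    with wave.open(io.BytesIO(wav_bytes), "rb") as reader:
        if reader.getnchannels() != 1 or reader.getsampwidth() != 2:
            raise ValueError("expected 16-bit mono audio")
        frames = reader.readframes(reader.getnframes())
    samples = struct.unpack(f"<{len(frames) // 2}h", frames)
    return [sample / 32768.0 for sample in samples]
```

WAV adds only a header in front of the PCM payload, so wrapping and unwrapping are lossless. Any quality loss in the pipeline comes from the Opus encoding done by MediaRecorder.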

Decision 4: Punctuation via Whisper + Rules

What: Enable Whisper word_timestamps, then apply rule-based punctuation

Why:

Whisper alone outputs minimal punctuation for Chinese
Rule-based post-processing adds 。，？ based on pauses and patterns
No additional model needed

Alternatives:

Separate punctuation model (adds latency and complexity)
No punctuation (rejected: punctuation is a user requirement)
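
One way the pause-based rules could work is sketched below. It assumes each word carries start and end times in seconds, as word-level timestamps provide. The pause thresholds and the short list of question particles are illustrative guesses rather than values from the design, and `punctuate` and `Word` are hypothetical names.

```python
from dataclasses import dataclass
from typing import List, Sequence


@dataclass
class Word:
    text: str
    start: float  # seconds
    end: float    # seconds


QUESTION_ENDINGS = ("嗎", "呢")  # illustrative, not exhaustive


def punctuate(
    words: Sequence[Word],
    comma_pause: float = 0.3,   # assumed: a short pause becomes a comma
    period_pause: float = 0.7,  # assumed: a long pause ends the sentence
) -> str:
    """Insert ，。？ based on the silence after each word."""
    out: List[str] = []
    clause = ""  # text since the last inserted punctuation mark
    for i, word in enumerate(words):
        out.append(word.text)
        clause += word.text
        is_last = i == len(words) - 1
        # Treat the end of the segment as an arbitrarily long pause.
        pause = float("inf") if is_last else words[i + 1].start - word.end
        if pause >= period_pause:
            out.append("？" if clause.endswith(QUESTION_ENDINGS) else "。")
            clause = ""
        elif pause >= comma_pause:
            out.append("，")
            clause = ""
    return "".join(out)
```

The rules only read timestamps and the trailing character of each clause, so they add negligible latency on top of transcription.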

Risks / Trade-offs

Risk	Mitigation
Latency > 3s on slow hardware	Use "tiny" model option, skip VAD if needed
WebM→WAV conversion quality loss	Decode Opus directly to PCM with no re-encoding, test on various inputs
Memory usage with long meetings	Limit audio buffer to 30s, process and discard
Segment boundary splits words	Use VAD with 500ms silence threshold
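
The 30-second buffer limit from the memory-usage row can be sketched as a bounded byte buffer. This assumes 16-bit mono PCM at 16 kHz, so 30 seconds is 960,000 bytes. `BoundedAudioBuffer` is a hypothetical class, and dropping the oldest audio on overflow is one possible policy, not necessarily the one the project uses.

```python
class BoundedAudioBuffer:
    """Hold at most `max_seconds` of 16-bit mono PCM, dropping the oldest bytes."""

    def __init__(self, max_seconds: float = 30.0, sample_rate: int = 16_000):
        self.bytes_per_second = sample_rate * 2  # 2 bytes per 16-bit sample
        self.max_bytes = int(max_seconds * self.bytes_per_second)
        self._data = bytearray()

    def append(self, pcm: bytes) -> int:
        """Append PCM and return how many old bytes were discarded."""
        self._data.extend(pcm)
        overflow = len(self._data) - self.max_bytes
        if overflow <= 0:
            return 0
        overflow += overflow % 2  # round up so samples stay 2-byte aligned
        del self._data[:overflow]
        return overflow

    def drain(self) -> bytes:
        """Hand the buffered audio to the transcriber and clear the buffer."""
        data = bytes(self._data)
        self._data.clear()
        return data

    @property
    def seconds(self) -> float:
        """Duration of the audio currently held."""
        return len(self._data) / self.bytes_per_second
```

Calling `drain()` whenever the VAD closes a segment keeps memory flat over a long meeting, and the cap only takes effect if speech runs past 30 seconds without a pause.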

Implementation Phases

Phase 1: Sidecar streaming mode with VAD
Phase 2: IPC audio streaming pipeline
Phase 3: Frontend editable segment component
Phase 4: Punctuation post-processing

Open Questions

Should segments be auto-merged after N seconds of no editing?
What is the maximum segment count before old segments are auto-archived?
