feat: Meeting Assistant MVP - Complete implementation

Enterprise Meeting Knowledge Management System with:

Backend (FastAPI):
- Authentication proxy with JWT (pj-auth-api integration)
- MySQL database with 4 tables (users, meetings, conclusions, actions)
- Meeting CRUD with system code generation (C-YYYYMMDD-XX, A-YYYYMMDD-XX)
- Dify LLM integration for AI summarization
- Excel export with openpyxl
- 20 unit tests (all passing)

Client (Electron):
- Login page with company auth
- Meeting list with create/delete
- Meeting detail with real-time transcription
- Editable transcript textarea (single block, easy editing)
- AI summarization with conclusions/action items
- 5-second segment recording (efficient for long meetings)

Sidecar (Python):
- faster-whisper medium model with int8 quantization
- ONNX Runtime VAD (lightweight, ~20MB vs PyTorch ~2GB)
- Chinese punctuation processing
- OpenCC for Traditional Chinese conversion
- Anti-hallucination parameters
- Auto-cleanup of temp audio files

OpenSpec:
- add-meeting-assistant-mvp (47 tasks, archived)
- add-realtime-transcription (29 tasks, archived)

🤖 Generated with [Claude Code](https://claude.com/claude-code)

Co-Authored-By: Claude Opus 4.5 <noreply@anthropic.com>
2025-12-10 20:17:44 +08:00
commit 8b6184ecc5
65 changed files with 10510 additions and 0 deletions
--- a/openspec/changes/archive/2025-12-10-add-realtime-transcription/design.md
+++ b/openspec/changes/archive/2025-12-10-add-realtime-transcription/design.md
@@ -0,0 +1,117 @@
+## Context
+The Meeting Assistant currently uses batch transcription: audio is recorded, saved to a file, and then sent to Whisper for processing. This creates a poor UX: users see no text until recording stops, and they cannot correct transcription errors.
+
+**Stakeholders**: End users recording meetings, admins reviewing transcripts
+**Constraints**: i5/8GB hardware target, offline capability required
+
+## Goals / Non-Goals
+
+### Goals
+- Real-time text display during recording (< 3 second latency)
+- Segment-based editing without disrupting ongoing transcription
+- Punctuation in output (Chinese: 。，？！；：)
+- Maintain offline capability (all processing local)
+
+### Non-Goals
+- Speaker diarization (who said what) - future enhancement
+- Multi-language mixing - Chinese only for MVP
+- Cloud-based transcription fallback
+
+## Architecture
+
+```
+┌─────────────────────────────────────────────────────────────┐
+│ Renderer Process (meeting-detail.html)                      │
+│  ┌──────────────┐    ┌─────────────────────────────────┐   │
+│  │ MediaRecorder│───▶│ Editable Transcript Component   │   │
+│  │(audio chunks)│    │  [Segment 1] [Segment 2] [...]  │   │
+│  └──────┬───────┘    └─────────────────────────────────┘   │
+│         │ IPC: stream-audio-chunk                          │
+└─────────┼──────────────────────────────────────────────────┘
+          ▼
+┌─────────────────────────────────────────────────────────────┐
+│ Main Process (main.js)                                      │
+│  ┌──────────────────┐     ┌─────────────────────────────┐  │
+│  │ Audio Buffer     │────▶│ Sidecar (stdin pipe)        │  │
+│  │ (accumulate PCM) │     │                             │  │
+│  └──────────────────┘     └──────────┬──────────────────┘  │
+│                                      │ IPC: transcription-segment
+│                                      ▼                      │
+│                           Forward to renderer               │
+└─────────────────────────────────────────────────────────────┘
+          │
+          ▼ stdin (WAV chunks)
+┌─────────────────────────────────────────────────────────────┐
+│ Sidecar Process (transcriber.py)                            │
+│  ┌──────────────┐   ┌──────────────┐   ┌────────────────┐  │
+│  │ VAD Buffer   │──▶│ Whisper      │──▶│ Punctuator     │  │
+│  │ (silero-vad) │   │ (transcribe) │   │ (rule-based)   │  │
+│  └──────────────┘   └──────────────┘   └────────────────┘  │
+│         │                                      │            │
+│         │ Detect speech end                    │            │
+│         ▼                                      ▼            │
+│  stdout: {"segment_id": 1, "text": "今天開會討論。", ...}  │
+└─────────────────────────────────────────────────────────────┘
+```
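+
+The sidecar's stdout line in the diagram suggests a one-JSON-object-per-line protocol. A minimal sketch of both ends follows, assuming only the two fields shown in the diagram (`segment_id`, `text`); the function names `emit_segment` and `parse_segment` are hypothetical, and any further fields are left out because the design does not list them.
+
+```python
+import json
+import sys
+from typing import TextIO
+
+
+def emit_segment(segment_id: int, text: str, out: TextIO = sys.stdout) -> None:
+    """Write one transcription segment as a single JSON line."""
+    # ensure_ascii=False keeps Chinese text readable instead of \uXXXX escapes.
+    line = json.dumps({"segment_id": segment_id, "text": text}, ensure_ascii=False)
+    out.write(line + "\n")
+    # Flush immediately so the parent process sees the segment without delay.
+    out.flush()
+
+
+def parse_segment(line: str) -> dict:
+    """Parse one line read from the sidecar's stdout (the reader's side)."""
+    return json.loads(line)
+```
+
+Flushing after every line matters here: without it, a piped stdout is block-buffered and segments could arrive in bursts rather than as they are produced.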
+
+## Decisions
+
+### Decision 1: VAD-triggered Segmentation
+**What**: Use Silero VAD to detect speech boundaries and transcribe each complete utterance as one segment
+**Why**:
+- More accurate than fixed-interval chunking
+- Natural sentence boundaries
+- Reduces partial/incomplete transcriptions
+**Alternatives**:
+- Fixed 5-second chunks (simpler but cuts mid-sentence)
+- Word-level streaming (too fragmented, higher latency)
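+
+To illustrate the segmentation idea, here is a minimal sketch that closes a segment once enough consecutive silent frames follow speech. It works on a list of per-frame speech probabilities such as a VAD might produce. The 500 ms silence window comes from the risk table in this document; the 32 ms frame length, the 0.5 probability cut-off, and the name `split_on_silence` are assumptions for illustration only.
+
+```python
+from typing import Iterable, List, Tuple
+
+FRAME_MS = 32            # assumed length of one VAD frame
+SILENCE_MS = 500         # silence required to close a segment
+SPEECH_THRESHOLD = 0.5   # assumed probability cut-off for "speech"
+
+
+def split_on_silence(probs: Iterable[float]) -> List[Tuple[int, int]]:
+    """Return (start_frame, end_frame) pairs, end exclusive, one per utterance."""
+    silence_frames_needed = -(-SILENCE_MS // FRAME_MS)  # ceiling division
+    segments: List[Tuple[int, int]] = []
+    start = None        # first speech frame of the currently open segment
+    last_speech = None  # most recent speech frame seen
+    silent_run = 0
+    for i, p in enumerate(probs):
+        if p >= SPEECH_THRESHOLD:
+            if start is None:
+                start = i
+            last_speech = i
+            silent_run = 0
+        elif start is not None:
+            silent_run += 1
+            if silent_run >= silence_frames_needed:
+                # Enough silence: close the segment at the last speech frame.
+                segments.append((start, last_speech + 1))
+                start = None
+                silent_run = 0
+    if start is not None:
+        # Input ended mid-utterance; emit what has been collected so far.
+        segments.append((start, last_speech + 1))
+    return segments
+```
+
+Pauses shorter than the silence window stay inside one segment, which is what keeps a brief hesitation from cutting a sentence in two.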
+
+### Decision 2: Segment-based Editing
+**What**: Each VAD segment becomes an editable text block with unique ID
+**Why**:
+- Users can edit specific segments without affecting others
+- New segments append without disrupting editing
+- Simple merge on save (concatenate all segments)
+**Alternatives**:
+- Single textarea (editing conflicts with appending text)
+- Contenteditable div (complex cursor management)
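+
+The segment model can be sketched as an ordered map from segment ID to text. The sketch below is in Python for consistency with the sidecar, although the real component would live in the Electron renderer; the class name `Transcript` and its methods are hypothetical.
+
+```python
+from dataclasses import dataclass, field
+from typing import Dict
+
+
+@dataclass
+class Transcript:
+    # Dicts preserve insertion order, so segments stay in arrival order.
+    segments: Dict[int, str] = field(default_factory=dict)
+
+    def append(self, segment_id: int, text: str) -> None:
+        """Add a newly transcribed segment without overwriting an existing one."""
+        self.segments.setdefault(segment_id, text)
+
+    def edit(self, segment_id: int, text: str) -> None:
+        """Replace the text of one existing segment."""
+        if segment_id not in self.segments:
+            raise KeyError(segment_id)
+        self.segments[segment_id] = text
+
+    def merged(self) -> str:
+        """Concatenate all segments in arrival order for saving."""
+        return "".join(self.segments.values())
+```
+
+Because each edit touches only one key, a segment arriving while the user is typing in another block does not disturb that block. Joining without a separator suits Chinese text; a space-delimited language would need a different join.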
+
+### Decision 3: Audio Format Pipeline
+**What**: WebM (MediaRecorder) → WAV conversion in main.js → raw PCM to sidecar
+**Why**:
+- In Electron (Chromium), MediaRecorder emits compressed WebM/Opus rather than raw PCM
+- Whisper operates on 16 kHz mono PCM, so the compressed audio must be decoded first
+- Conversion in main.js keeps sidecar simple
+**Alternatives**:
+- ffmpeg in sidecar (adds large dependency)
+- Raw PCM from AudioWorklet (complex, browser compatibility issues)
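+
+The last hop of the pipeline (WAV to raw PCM) amounts to stripping the WAV container. The sketch below shows that step and its inverse with Python's standard `wave` module. It is illustrative only: the design places the conversion in main.js, the 16 kHz, 16-bit mono format is an assumption based on what Whisper consumes, and the WebM/Opus decoding step is not covered here.
+
+```python
+import io
+import wave
+
+
+def wav_to_pcm(wav_bytes: bytes, expected_rate: int = 16000) -> bytes:
+    """Strip the WAV container and return raw 16-bit mono PCM frames."""
+    with wave.open(io.BytesIO(wav_bytes), "rb") as wav:
+        if wav.getnchannels() != 1 or wav.getsampwidth() != 2:
+            raise ValueError("expected 16-bit mono audio")
+        if wav.getframerate() != expected_rate:
+            raise ValueError(f"expected {expected_rate} Hz audio")
+        return wav.readframes(wav.getnframes())
+
+
+def pcm_to_wav(pcm: bytes, rate: int = 16000) -> bytes:
+    """Wrap raw 16-bit mono PCM in a WAV container (the inverse step)."""
+    buf = io.BytesIO()
+    with wave.open(buf, "wb") as wav:
+        wav.setnchannels(1)
+        wav.setsampwidth(2)
+        wav.setframerate(rate)
+        wav.writeframes(pcm)
+    return buf.getvalue()
+```
+
+Validating the format at the boundary turns a mismatched sample rate into an immediate error instead of silently degraded transcription.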
+
+### Decision 4: Punctuation via Whisper + Rules
+**What**: Enable Whisper `word_timestamps`, then apply rule-based punctuation to the timed words
+**Why**:
+- Whisper alone outputs minimal punctuation for Chinese
+- Rule-based post-processing adds 。，？ based on pauses and patterns
+- No additional model needed
+**Alternatives**:
+- Separate punctuation model (adds latency and complexity)
+- No punctuation (rejected: punctuation is a user requirement)
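+
+A pause-based rule pass could look roughly like the sketch below. It takes words as `(text, start, end)` tuples, a simplified stand-in for word timestamps, and inserts ， or 。 according to the gap before the next word, with ？ after a question particle. The pause thresholds, the particle list, and the name `punctuate` are hypothetical values chosen for illustration, not the rules this project uses.
+
+```python
+from typing import List, Tuple
+
+Word = Tuple[str, float, float]  # (text, start_seconds, end_seconds)
+
+COMMA_PAUSE = 0.3                # hypothetical: gap that inserts ，
+PERIOD_PAUSE = 0.8               # hypothetical: gap that ends a sentence
+QUESTION_ENDINGS = ("嗎", "呢")   # hypothetical sentence-final question particles
+PUNCTUATION = "。，？！；："
+
+
+def _closing_mark(word: str) -> str:
+    """Pick ？ after a question particle, otherwise 。."""
+    return "？" if word.endswith(QUESTION_ENDINGS) else "。"
+
+
+def punctuate(words: List[Word]) -> str:
+    out = []
+    for i, (text, _start, end) in enumerate(words):
+        out.append(text)
+        if text and text[-1] in PUNCTUATION:
+            continue  # already punctuated upstream; do not double up
+        if i + 1 == len(words):
+            out.append(_closing_mark(text))  # always close the final sentence
+            continue
+        pause = words[i + 1][1] - end  # silence before the next word
+        if pause >= PERIOD_PAUSE:
+            out.append(_closing_mark(text))
+        elif pause >= COMMA_PAUSE:
+            out.append("，")
+    return "".join(out)
+```
+
+Because the pass only reads timestamps and appends marks, it adds no model and negligible latency, which matches the rationale given for this decision.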
+
+## Risks / Trade-offs
+
+| Risk | Mitigation |
+|------|------------|
+| Latency > 3s on slow hardware | Use "tiny" model option, skip VAD if needed |
+| WebM→WAV conversion quality loss | Decode Opus directly to PCM with no further lossy re-encoding, test on various inputs |
+| Memory usage with long meetings | Limit audio buffer to 30s, process and discard |
+| Segment boundary splits words | Use VAD with 500ms silence threshold |
+
+## Implementation Phases
+
+1. **Phase 1**: Sidecar streaming mode with VAD
+2. **Phase 2**: IPC audio streaming pipeline
+3. **Phase 3**: Frontend editable segment component
+4. **Phase 4**: Punctuation post-processing
+
+## Open Questions
+- Should segments be auto-merged after N seconds of no editing?
+- What is the maximum segment count before old segments are auto-archived?