Meeting_Assistant/openspec/changes/archive/2025-12-10-add-realtime-transcription/design.md

## Context
The Meeting Assistant currently uses batch transcription: audio is recorded, saved to a file, then sent to Whisper for processing. This makes for a poor UX, because users must wait until recording stops before they see any text. Users also cannot correct transcription errors.

**Stakeholders**: End users recording meetings, admins reviewing transcripts
**Constraints**: i5/8GB hardware target, offline capability required

## Goals / Non-Goals

### Goals
- Real-time text display during recording (< 3 second latency)
- Segment-based editing without disrupting ongoing transcription
- Punctuation in output (Chinese: 。，？！；：)
- Maintain offline capability (all processing local)

### Non-Goals
- Speaker diarization (who said what) - future enhancement
- Multi-language mixing - Chinese only for MVP
- Cloud-based transcription fallback

## Architecture

```
┌─────────────────────────────────────────────────────────────┐
│ Renderer Process (meeting-detail.html)                      │
│  ┌──────────────┐    ┌─────────────────────────────────┐   │
│  │ MediaRecorder│───▶│ Editable Transcript Component   │   │
│  │(audio chunks)│    │  [Segment 1] [Segment 2] [...]  │   │
│  └──────┬───────┘    └─────────────────────────────────┘   │
│         │ IPC: stream-audio-chunk                          │
└─────────┼──────────────────────────────────────────────────┘
          ▼
┌─────────────────────────────────────────────────────────────┐
│ Main Process (main.js)                                      │
│  ┌──────────────────┐     ┌─────────────────────────────┐  │
│  │ Audio Buffer     │────▶│ Sidecar (stdin pipe)        │  │
│  │ (accumulate PCM) │     │                             │  │
│  └──────────────────┘     └──────────┬──────────────────┘  │
│                                      │ IPC: transcription-segment
│                                      ▼                      │
│                           Forward to renderer               │
└─────────────────────────────────────────────────────────────┘
          │
          ▼ stdin (WAV chunks)
┌─────────────────────────────────────────────────────────────┐
│ Sidecar Process (transcriber.py)                            │
│  ┌──────────────┐   ┌──────────────┐   ┌────────────────┐  │
│  │ VAD Buffer   │──▶│ Whisper      │──▶│ Punctuator     │  │
│  │ (silero-vad) │   │ (transcribe) │   │ (rule-based)   │  │
│  └──────────────┘   └──────────────┘   └────────────────┘  │
│         │                                      │            │
│         │ Detect speech end                    │            │
│         ▼                                      ▼            │
│  stdout: {"segment_id": 1, "text": "今天開會討論。", ...}  │
└─────────────────────────────────────────────────────────────┘
```
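
The sidecar's stdout line in the diagram suggests a one-JSON-object-per-line protocol. A minimal sketch of that idea follows. It uses only the two fields visible in the diagram (`segment_id`, `text`); the remaining fields are elided in the diagram and are left out here. The function names are hypothetical.

```python
import json
import sys


def format_segment(segment_id: int, text: str) -> str:
    """Serialize one finished segment as a single JSON line (no trailing newline)."""
    message = {"segment_id": segment_id, "text": text}
    # ensure_ascii=False keeps Chinese text readable instead of \uXXXX escapes.
    return json.dumps(message, ensure_ascii=False)


def emit_segment(segment_id: int, text: str, stream=sys.stdout) -> None:
    """Write the segment and flush so the parent process sees it right away."""
    stream.write(format_segment(segment_id, text) + "\n")
    stream.flush()


def parse_segment(line: str) -> dict:
    """What a reader on the other end of the pipe would do with one line."""
    return json.loads(line)
```

Flushing after every line matters for latency: without it, a piped stdout is typically block-buffered and segments would arrive in bursts.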

## Decisions

### Decision 1: VAD-triggered Segmentation
**What**: Use Silero VAD to detect speech boundaries, transcribe complete utterances
**Why**:
- More accurate than fixed-interval chunking
- Natural sentence boundaries
- Reduces partial/incomplete transcriptions
**Alternatives**:
- Fixed 5-second chunks (simpler but cuts mid-sentence)
- Word-level streaming (too fragmented, higher latency)
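
The segmentation logic can be sketched independently of the detector. The example below is an illustration only: it substitutes a simple energy threshold for Silero VAD so that it runs without extra dependencies, and the frame length and function names are assumptions. The 500 ms silence threshold is the value from the risks table.

```python
from typing import Callable, Iterable, List

FRAME_MS = 30      # assumed frame length
SILENCE_MS = 500   # silence needed to close an utterance

Frame = List[int]  # 16-bit PCM samples for one frame


def energy_is_speech(frame: Frame, threshold: float = 500.0) -> bool:
    """Stand-in detector (NOT Silero): mean absolute amplitude over the frame."""
    if not frame:
        return False
    return sum(abs(s) for s in frame) / len(frame) >= threshold


def segment_utterances(
    frames: Iterable[Frame],
    is_speech: Callable[[Frame], bool] = energy_is_speech,
    frame_ms: int = FRAME_MS,
    silence_ms: int = SILENCE_MS,
) -> List[List[Frame]]:
    """Group frames into utterances, closing one after `silence_ms` of silence."""
    frames_needed = max(1, silence_ms // frame_ms)
    utterances: List[List[Frame]] = []
    current: List[Frame] = []
    silent_run = 0
    for frame in frames:
        if is_speech(frame):
            current.append(frame)
            silent_run = 0
        elif current:
            # Keep short pauses inside the utterance until the threshold is hit.
            current.append(frame)
            silent_run += 1
            if silent_run >= frames_needed:
                utterances.append(current[:-silent_run])  # drop trailing silence
                current, silent_run = [], 0
    if current:
        # Stream ended mid-utterance: emit what we have, minus trailing silence.
        utterances.append(current[: len(current) - silent_run])
    return utterances
```

Each returned utterance would then be passed to Whisper as one unit, which is what keeps segments aligned with natural pauses rather than fixed intervals.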

### Decision 2: Segment-based Editing
**What**: Each VAD segment becomes an editable text block with unique ID
**Why**:
- Users can edit specific segments without affecting others
- New segments append without disrupting editing
- Simple merge on save (concatenate all segments)
**Alternatives**:
- Single textarea (editing conflicts with appending text)
- Contenteditable div (complex cursor management)
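
The data model behind this decision is small. The sketch below is written in Python for brevity, although the real component would live in the renderer. The class and method names are hypothetical, and it assumes segment IDs increase in time order and that a user-edited block should never be overwritten by later sidecar output.

```python
from dataclasses import dataclass, field
from typing import Dict


@dataclass
class Segment:
    segment_id: int
    text: str
    edited: bool = False  # set once the user changes the text


@dataclass
class Transcript:
    segments: Dict[int, Segment] = field(default_factory=dict)

    def append_from_sidecar(self, segment_id: int, text: str) -> None:
        """Add a new segment; leave blocks the user has already edited alone."""
        existing = self.segments.get(segment_id)
        if existing is not None and existing.edited:
            return
        self.segments[segment_id] = Segment(segment_id, text)

    def edit(self, segment_id: int, new_text: str) -> None:
        """Apply a user edit to exactly one segment."""
        segment = self.segments[segment_id]
        segment.text = new_text
        segment.edited = True

    def merge(self) -> str:
        """Merge on save: concatenate segment texts in ID order."""
        return "".join(self.segments[i].text for i in sorted(self.segments))
```

Because edits and appends touch different keys, a segment arriving mid-edit does not disturb the block being edited.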

### Decision 3: Audio Format Pipeline
**What**: WebM (MediaRecorder) → WAV conversion in main.js → raw PCM to sidecar
**Why**:
- MediaRecorder in Chromium-based renderers records WebM/Opus by default
- Whisper consumes uncompressed 16 kHz mono audio, so WAV/PCM input avoids a decode step in the sidecar
- Conversion in main.js keeps sidecar simple
**Alternatives**:
- ffmpeg in sidecar (adds large dependency)
- Raw PCM from AudioWorklet (complex, browser compatibility issues)
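
On the receiving side, the sidecar needs to turn each chunk into samples. The sketch below assumes the chunks arrive as complete 16-bit mono WAV containers, as the architecture diagram labels them. The function names are hypothetical, and `encode_wav_chunk` exists only to make the example testable.

```python
import array
import io
import sys
import wave
from typing import List, Tuple


def decode_wav_chunk(chunk: bytes) -> Tuple[int, List[float]]:
    """Return (sample_rate, samples) with samples scaled to [-1.0, 1.0)."""
    with wave.open(io.BytesIO(chunk), "rb") as wav:
        if wav.getsampwidth() != 2 or wav.getnchannels() != 1:
            raise ValueError("expected 16-bit mono PCM")
        sample_rate = wav.getframerate()
        raw = wav.readframes(wav.getnframes())
    pcm = array.array("h")
    pcm.frombytes(raw)
    if sys.byteorder == "big":
        pcm.byteswap()  # WAV sample data is little-endian
    return sample_rate, [s / 32768.0 for s in pcm]


def encode_wav_chunk(samples: List[int], sample_rate: int = 16000) -> bytes:
    """Test helper: wrap 16-bit samples in a WAV container."""
    buf = io.BytesIO()
    with wave.open(buf, "wb") as wav:
        wav.setnchannels(1)
        wav.setsampwidth(2)
        wav.setframerate(sample_rate)
        data = array.array("h", samples)
        if sys.byteorder == "big":
            data.byteswap()
        wav.writeframes(data.tobytes())
    return buf.getvalue()
```

If the pipe carries headerless PCM instead, only the `array` conversion is needed and the sample rate has to be agreed on out of band.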

### Decision 4: Punctuation via Whisper + Rules
**What**: Enable Whisper word_timestamps, apply rule-based punctuation after
**Why**:
- Whisper alone outputs minimal punctuation for Chinese
- Rule-based post-processing adds 。，？ based on pauses and patterns
- No additional model needed
**Alternatives**:
- Separate punctuation model (adds latency and complexity)
- No punctuation (rejected: punctuated output is a user requirement)
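
A pause-based rule set could look like the sketch below. It is an illustration, not the project's rule set: the pause thresholds, the question-particle list, and the `(text, start, end)` word shape are all assumptions chosen for the example.

```python
from typing import List, Tuple

Word = Tuple[str, float, float]  # (text, start_seconds, end_seconds), assumed shape

COMMA_PAUSE = 0.3    # assumed: a gap this long inserts "，"
PERIOD_PAUSE = 0.8   # assumed: a gap this long ends the sentence
QUESTION_ENDINGS = ("嗎", "呢")  # assumed sentence-final question particles


def sentence_end(text: str) -> str:
    """Pick the closing mark from the last word of the sentence."""
    return "？" if text.endswith(QUESTION_ENDINGS) else "。"


def punctuate(words: List[Word]) -> str:
    """Insert Chinese punctuation based on the gaps between timestamped words."""
    out: List[str] = []
    for i, (text, _start, end) in enumerate(words):
        out.append(text)
        if i == len(words) - 1:
            out.append(sentence_end(text))  # always close the final sentence
            break
        gap = words[i + 1][1] - end
        if gap >= PERIOD_PAUSE:
            out.append(sentence_end(text))
        elif gap >= COMMA_PAUSE:
            out.append("，")
    return "".join(out)
```

Thresholds like these would need tuning against real recordings, since speaking pace varies between speakers.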

## Risks / Trade-offs

| Risk | Mitigation |
|------|------------|
| Latency > 3s on slow hardware | Use "tiny" model option, skip VAD if needed |
| WebM→WAV conversion quality loss | Decode Opus straight to PCM without re-encoding (Opus is already lossy; decoding adds no further loss), test on various inputs |
| Memory usage with long meetings | Limit audio buffer to 30s, process and discard |
| Segment boundary splits words | Use VAD with 500ms silence threshold |
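
The 30-second buffer limit from the table can be sketched as a byte buffer that drops its oldest audio once the cap is reached. The class name and the 16 kHz, 16-bit mono defaults are assumptions for illustration.

```python
class RollingAudioBuffer:
    """Holds at most `max_seconds` of PCM audio, discarding the oldest bytes."""

    def __init__(self, max_seconds: float = 30.0, sample_rate: int = 16000,
                 bytes_per_sample: int = 2) -> None:
        self.bytes_per_sample = bytes_per_sample
        self.max_bytes = int(max_seconds * sample_rate * bytes_per_sample)
        self._data = bytearray()

    def append(self, chunk: bytes) -> int:
        """Add audio; return how many old bytes were dropped to stay under the cap."""
        self._data.extend(chunk)
        overflow = len(self._data) - self.max_bytes
        if overflow <= 0:
            return 0
        # Round up to a whole sample so the remaining data stays aligned.
        overflow += -overflow % self.bytes_per_sample
        del self._data[:overflow]
        return overflow

    def take(self) -> bytes:
        """Hand everything to the transcriber and discard it (process and discard)."""
        out = bytes(self._data)
        self._data.clear()
        return out
```

With these defaults the cap works out to 30 × 16000 × 2 = 960,000 bytes, so memory use stays flat regardless of meeting length.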

## Implementation Phases

1. **Phase 1**: Sidecar streaming mode with VAD
2. **Phase 2**: IPC audio streaming pipeline
3. **Phase 3**: Frontend editable segment component
4. **Phase 4**: Punctuation post-processing

## Open Questions
- Should segments be auto-merged after N seconds of no editing?
- What is the maximum segment count before old segments are auto-archived?