pyannote/speaker-diarization-3.1 to segment audio by speaker, then merges the speaker labels with the transcription.
How it works
- Transcribe: the audio is transcribed with faster-whisper (uses cache if available)
- Diarize: pyannote analyzes the audio waveform to detect speaker turns
- Merge: each transcription segment is assigned to the speaker with the maximum temporal overlap
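The merge step above can be sketched as a small pure function. This is a minimal illustration (the function and data shapes are hypothetical, not the tool's actual internals): each transcription segment sums its overlap with every diarization turn per speaker, then takes the speaker with the largest total.

```python
def overlap(a_start, a_end, b_start, b_end):
    # Length of the intersection of two time intervals (0.0 if disjoint).
    return max(0.0, min(a_end, b_end) - max(a_start, b_start))

def assign_speakers(segments, turns):
    """Label each transcription segment with the speaker of maximum
    temporal overlap.

    segments: list of (start, end, text) from the transcriber
    turns:    list of (start, end, speaker) from the diarizer
    returns:  list of (start, end, speaker, text)
    """
    labeled = []
    for s_start, s_end, text in segments:
        totals = {}
        for t_start, t_end, speaker in turns:
            ov = overlap(s_start, s_end, t_start, t_end)
            if ov > 0:
                totals[speaker] = totals.get(speaker, 0.0) + ov
        # Fall back to a placeholder if no turn overlaps the segment.
        speaker = max(totals, key=totals.get) if totals else "UNKNOWN"
        labeled.append((s_start, s_end, speaker, text))
    return labeled
```

Summing overlap per speaker (rather than picking the single longest turn) handles segments that span a speaker change: the speaker who talks for most of the segment wins.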
No API keys required
The pyannote models (~29MB) are pre-downloaded by the installer as a tarball from GitHub Releases into ~/.cache/huggingface/hub/. No Hugging Face token or API key is needed.
Speaker count
By default, pyannote auto-detects how many speakers are in the audio. If you know the speaker count, pass num_speakers to force it — this can improve accuracy.
Auto-detection works well for most audio. Use an explicit count when speakers have very similar voices or when there are brief cameos you want attributed.
Caching
Diarization results are cached in SQLite separately from transcriptions, keyed by file_hash:num_speakers. This means:
- Diarizing with a different speaker count doesn’t re-transcribe
- Re-diarizing with the same speaker count is instant
- Transcribing with a different model size doesn’t invalidate diarization
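The caching scheme above can be sketched as follows. This is a minimal illustration under stated assumptions — the `DiarizationCache` class, table name, and JSON storage format are hypothetical, chosen only to show the `file_hash:num_speakers` keying; the real schema may differ.

```python
import json
import sqlite3

def diarization_cache_key(file_hash, num_speakers):
    # num_speakers may be None when the speaker count is auto-detected,
    # so "abc123:None" and "abc123:2" are distinct cache entries.
    return f"{file_hash}:{num_speakers}"

class DiarizationCache:
    """SQLite-backed cache for diarization turns, independent of the
    transcription cache, so neither invalidates the other."""

    def __init__(self, path=":memory:"):
        self.db = sqlite3.connect(path)
        self.db.execute(
            "CREATE TABLE IF NOT EXISTS diarization "
            "(key TEXT PRIMARY KEY, turns TEXT)"
        )

    def get(self, file_hash, num_speakers):
        row = self.db.execute(
            "SELECT turns FROM diarization WHERE key = ?",
            (diarization_cache_key(file_hash, num_speakers),),
        ).fetchone()
        return json.loads(row[0]) if row else None

    def put(self, file_hash, num_speakers, turns):
        self.db.execute(
            "INSERT OR REPLACE INTO diarization (key, turns) VALUES (?, ?)",
            (diarization_cache_key(file_hash, num_speakers), json.dumps(turns)),
        )
        self.db.commit()
```

Because the key includes only the file hash and speaker count, changing the transcription model size never touches these rows, and re-running with the same count is a single indexed lookup.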
When to use
- Interviews and podcasts with multiple hosts/guests
- Meeting recordings where attribution matters
- Any audio where “who said this?” is the question

