Speaker diarization answers the question: who said what, and when? Augent uses pyannote/speaker-diarization-3.1 to segment audio by speaker, then merges the speaker labels with the transcription.

How it works

  1. Transcribe: the audio is transcribed with faster-whisper (uses cache if available)
  2. Diarize: pyannote analyzes the audio waveform to detect speaker turns
  3. Merge: each transcription segment is assigned to the speaker with the maximum temporal overlap

The merge computes the temporal overlap between each transcription segment and each speaker turn, then assigns the segment to the speaker with the largest overlap. Segments that overlap no speaker turn are labeled "Unknown."
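The merge step above can be sketched in a few lines. This is a minimal illustration of overlap-maximization merging, not Augent's actual implementation; function and field names are illustrative.

```python
def assign_speakers(transcript_segments, speaker_turns):
    """Assign each transcription segment to the speaker whose turn has
    the largest temporal overlap with it; "Unknown" if nothing overlaps.

    transcript_segments: list of (start, end, text)
    speaker_turns:       list of (start, end, speaker_label)
    """
    labeled = []
    for seg_start, seg_end, text in transcript_segments:
        best_speaker, best_overlap = "Unknown", 0.0
        for turn_start, turn_end, speaker in speaker_turns:
            # Overlap is the length of the intersection of the two intervals
            # (negative when they don't intersect at all).
            overlap = min(seg_end, turn_end) - max(seg_start, turn_start)
            if overlap > best_overlap:
                best_speaker, best_overlap = speaker, overlap
        labeled.append((seg_start, seg_end, best_speaker, text))
    return labeled
```

A segment that straddles a speaker change still gets exactly one label: the speaker it spends the most time with.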

No API keys required

The installer pre-downloads the pyannote models (~29 MB) as a tarball from GitHub Releases and places them in ~/.cache/huggingface/hub/. No Hugging Face token or API key is needed.

Speaker count

By default, pyannote auto-detects the number of speakers, which works well for most audio. If you know the speaker count, pass num_speakers to force it; an explicit count can improve accuracy when speakers have very similar voices or when there are brief cameos you want captured.
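A small sketch of how the optional count might be threaded through to the pipeline call. The helper name is illustrative, and the commented pipeline invocation assumes a loaded pyannote pipeline object whose call accepts num_speakers.

```python
def diarization_kwargs(num_speakers=None):
    """Build keyword arguments for the diarization call.

    When num_speakers is None, nothing is passed and pyannote
    auto-detects the speaker count; otherwise the count is forced.
    """
    return {} if num_speakers is None else {"num_speakers": num_speakers}

# Usage (assumes `pipeline` is a loaded pyannote diarization pipeline):
# diarization = pipeline("interview.wav", **diarization_kwargs(num_speakers=2))
```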

Caching

Diarization results are cached in SQLite separately from transcriptions, keyed by file_hash:num_speakers. This means:
  • Diarizing with a different speaker count doesn’t re-transcribe
  • Re-diarizing with the same speaker count is instant
  • Transcribing with a different model size doesn’t invalidate diarization
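The cache key shape described above can be sketched as follows. The document only states that keys are file_hash:num_speakers; the "auto" sentinel for auto-detected runs and the function name are assumptions for illustration.

```python
def diarization_cache_key(file_hash, num_speakers=None):
    """Compose the diarization cache key: the audio file's hash plus the
    requested speaker count ("auto" is an assumed sentinel for
    auto-detection). Transcription uses a separate cache, so changing
    num_speakers never forces a re-transcribe.
    """
    count = "auto" if num_speakers is None else str(num_speakers)
    return f"{file_hash}:{count}"
```

Because the key includes the speaker count, diarizing the same file with num_speakers=2 and then num_speakers=3 produces two distinct cache entries, and repeating either request is a cache hit.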

When to use

  • Interviews and podcasts with multiple hosts/guests
  • Meeting recordings where attribution matters
  • Any audio where “who said this?” is the question