pyannote/speaker-diarization-3.1 to segment audio by speaker, then merges the speaker labels with the transcription.
How it works
- Transcribe: the audio is transcribed with faster-whisper (uses cache if available)
- Diarize: pyannote analyzes the audio waveform to detect speaker turns
- Merge: each transcription segment is assigned to the speaker with the maximum temporal overlap
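The merge step above can be sketched as a small pure function. This is a minimal illustration (the function and data shapes are hypothetical, not the tool's actual internals): each transcription segment sums its overlap with every diarization turn per speaker, then takes the speaker with the largest total.

```python
def overlap(a_start, a_end, b_start, b_end):
    # Length of the intersection of two time intervals (0.0 if disjoint).
    return max(0.0, min(a_end, b_end) - max(a_start, b_start))

def assign_speakers(segments, turns):
    """Label each transcription segment with the speaker of maximum
    temporal overlap.

    segments: list of (start, end, text) from the transcriber
    turns:    list of (start, end, speaker) from the diarizer
    returns:  list of (start, end, speaker, text)
    """
    labeled = []
    for s_start, s_end, text in segments:
        totals = {}
        for t_start, t_end, speaker in turns:
            ov = overlap(s_start, s_end, t_start, t_end)
            if ov > 0:
                totals[speaker] = totals.get(speaker, 0.0) + ov
        # Fall back to a placeholder if no turn overlaps the segment.
        speaker = max(totals, key=totals.get) if totals else "UNKNOWN"
        labeled.append((s_start, s_end, speaker, text))
    return labeled
```

Summing overlap per speaker (rather than picking the single longest turn) handles segments that span a speaker change: the speaker who talks for most of the segment wins.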
No API keys required
The pyannote models (~29MB) are pre-downloaded by the installer as a tarball from GitHub Releases into ~/.cache/huggingface/hub/. No Hugging Face token or API key is needed.
Speaker count
By default, pyannote auto-detects how many speakers are in the audio. If you know the speaker count, pass num_speakers to force it — this can improve accuracy.
Auto-detection works well for most audio. Use an explicit count when speakers have very similar voices or when there are brief cameos you want attributed.
Caching
Diarization results are cached in SQLite separately from transcriptions, keyed by file_hash:num_speakers. This means:
- Diarizing with a different speaker count doesn’t re-transcribe
- Re-diarizing with the same speaker count is instant
- Transcribing with a different model size doesn’t invalidate diarization
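The caching scheme above can be sketched as follows. This is a minimal illustration under stated assumptions — the `DiarizationCache` class, table name, and JSON storage format are hypothetical, chosen only to show the `file_hash:num_speakers` keying; the real schema may differ.

```python
import json
import sqlite3

def diarization_cache_key(file_hash, num_speakers):
    # num_speakers may be None when the speaker count is auto-detected,
    # so "abc123:None" and "abc123:2" are distinct cache entries.
    return f"{file_hash}:{num_speakers}"

class DiarizationCache:
    """SQLite-backed cache for diarization turns, independent of the
    transcription cache, so neither invalidates the other."""

    def __init__(self, path=":memory:"):
        self.db = sqlite3.connect(path)
        self.db.execute(
            "CREATE TABLE IF NOT EXISTS diarization "
            "(key TEXT PRIMARY KEY, turns TEXT)"
        )

    def get(self, file_hash, num_speakers):
        row = self.db.execute(
            "SELECT turns FROM diarization WHERE key = ?",
            (diarization_cache_key(file_hash, num_speakers),),
        ).fetchone()
        return json.loads(row[0]) if row else None

    def put(self, file_hash, num_speakers, turns):
        self.db.execute(
            "INSERT OR REPLACE INTO diarization (key, turns) VALUES (?, ?)",
            (diarization_cache_key(file_hash, num_speakers), json.dumps(turns)),
        )
        self.db.commit()
```

Because the key includes only the file hash and speaker count, changing the transcription model size never touches these rows, and re-running with the same count is a single indexed lookup.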
When to use
- Interviews and podcasts with multiple hosts/guests
- Meeting recordings where attribution matters
- Any audio where “who said this?” is the question

