Powered by pyannote-audio, a widely used open-source speaker diarization toolkit. Pre-trained models are bundled with Augent and downloaded automatically during installation. No API keys, no tokens, no accounts required.
Automatically detects the number of speakers. Handles overlapping speech.
Models used:
| Model | Role |
|---|---|
| speaker-diarization-3.1 | Main pipeline: detects speakers and assigns segments |
| segmentation-3.0 | Underlying segmentation model used by the pipeline |
Example
Request:
```json
{
  "audio_path": "/Users/you/Downloads/interview.webm",
  "num_speakers": 2
}
```
Response:
```json
{
  "speakers": ["SPEAKER_0", "SPEAKER_1"],
  "segments": [
    {
      "speaker": "SPEAKER_0",
      "start": 0.0,
      "end": 4.8,
      "text": "Welcome to the show. Today we're talking about AI.",
      "timestamp": "0:00"
    },
    {
      "speaker": "SPEAKER_1",
      "start": 5.1,
      "end": 12.3,
      "text": "Thanks for having me. I've been working on language models for about five years now.",
      "timestamp": "0:05"
    }
  ],
  "segment_count": 84,
  "duration": 1823.4,
  "duration_formatted": "30:23",
  "language": "en",
  "cached": false,
  "model_used": "tiny"
}
```
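The response is easy to turn into a readable, speaker-labeled transcript. A minimal sketch, using only the field names shown in the example above (the `fmt_ts` helper mirrors how the `timestamp` and `duration_formatted` fields appear to be derived from seconds):

```python
# Sketch: render a diarization response into a speaker-labeled transcript.
# Field names ("segments", "speaker", "start", "text") come from the
# example response above; everything else here is illustrative.

def fmt_ts(seconds: float) -> str:
    """Format seconds as M:SS, matching the "timestamp" field above."""
    m, s = divmod(int(seconds), 60)
    return f"{m}:{s:02d}"

def render_transcript(response: dict) -> str:
    lines = []
    for seg in response["segments"]:
        lines.append(f'[{fmt_ts(seg["start"])}] {seg["speaker"]}: {seg["text"]}')
    return "\n".join(lines)

response = {
    "segments": [
        {"speaker": "SPEAKER_0", "start": 0.0, "end": 4.8,
         "text": "Welcome to the show."},
        {"speaker": "SPEAKER_1", "start": 5.1, "end": 12.3,
         "text": "Thanks for having me."},
    ]
}
transcript = render_transcript(response)
# [0:00] SPEAKER_0: Welcome to the show.
# [0:05] SPEAKER_1: Thanks for having me.
```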
Parameters
| Parameter | Required | Default | Description |
|---|---|---|---|
| audio_path | Yes | — | Path to the audio/video file |
| model_size | No | tiny | Whisper model size for transcription |
| num_speakers | No | auto-detect | Number of speakers (omit to auto-detect) |
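As a sketch of how these defaults behave, the snippet below applies them to an incoming request. This is illustrative only, not the tool's actual code; the `normalize_request` helper and `DEFAULTS` dict are assumptions:

```python
# Sketch: apply the parameter defaults from the table above.
# Hypothetical helper; the tool's real validation logic is not specified.
DEFAULTS = {"model_size": "tiny", "num_speakers": None}  # None = auto-detect

def normalize_request(request: dict) -> dict:
    if "audio_path" not in request:
        raise ValueError("audio_path is required")
    return {**DEFAULTS, **request}

req = normalize_request({"audio_path": "/Users/you/Downloads/interview.webm"})
# model_size falls back to "tiny"; num_speakers stays None (auto-detect)
```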
How it works
- Transcribe the audio with faster-whisper (from memory if already transcribed)
- Diarize with pyannote to detect speaker boundaries and count
- Merge transcription segments with speaker turns by timestamp overlap
- Cache the result. Same file, same speaker count returns instantly on next call.
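Step 3 above (merging transcription segments with speaker turns) can be sketched as follows — a simplified version that assigns each transcription segment the speaker turn it overlaps most in time. The tool's actual merging logic may differ:

```python
# Sketch: merge transcription segments with diarized speaker turns by
# timestamp overlap. Simplified; illustrative only.

def overlap(a_start, a_end, b_start, b_end):
    """Length in seconds of the intersection of two time ranges."""
    return max(0.0, min(a_end, b_end) - max(a_start, b_start))

def assign_speakers(transcript_segments, speaker_turns):
    merged = []
    for seg in transcript_segments:
        # Pick the speaker turn with the largest temporal overlap.
        best = max(
            speaker_turns,
            key=lambda t: overlap(seg["start"], seg["end"], t["start"], t["end"]),
        )
        merged.append({**seg, "speaker": best["speaker"]})
    return merged

turns = [
    {"speaker": "SPEAKER_0", "start": 0.0, "end": 5.0},
    {"speaker": "SPEAKER_1", "start": 5.0, "end": 12.5},
]
segs = [
    {"start": 0.0, "end": 4.8, "text": "Welcome to the show."},
    {"start": 5.1, "end": 12.3, "text": "Thanks for having me."},
]
merged = assign_speakers(segs, turns)
# merged[0] is labeled SPEAKER_0, merged[1] is labeled SPEAKER_1
```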
Use the diarized output to drive deeper analysis:
- search_audio or deep_search to find what a specific speaker said about a topic
- separate_audio before diarization for cleaner results on noisy recordings
- chapters to see which speakers dominate which sections
- batch_search to find a speaker’s remarks across multiple recordings
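For example, pulling one speaker's remarks out of the diarized segments before handing them to a downstream search step might look like this (the `speaker_lines` helper is hypothetical):

```python
# Sketch: filter diarized segments down to a single speaker's remarks.
# Hypothetical helper; segment field names come from the example response.

def speaker_lines(segments, speaker):
    return [s["text"] for s in segments if s["speaker"] == speaker]

segments = [
    {"speaker": "SPEAKER_0", "text": "Welcome to the show."},
    {"speaker": "SPEAKER_1", "text": "Thanks for having me."},
]
quotes = speaker_lines(segments, "SPEAKER_1")
# → ["Thanks for having me."]
```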
Notes
- Speaker labels are generic (SPEAKER_0, SPEAKER_1, etc.). The tool identifies who speaks when, not who they are.
- Omit num_speakers to let the model auto-detect. If you know the exact count, providing it improves accuracy.
- Models are stored at ~/.cache/huggingface/hub/ (~30MB total). Downloaded once during install, used offline from that point forward.
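The caching behavior described in "How it works" (same file, same speaker count returns instantly) could be keyed as sketched below. This is one plausible scheme, not the tool's actual cache key, which is not specified:

```python
# Sketch: one way to key the result cache: hash the file contents together
# with the requested speaker count. Illustrative only.
import hashlib

def cache_key(audio_bytes: bytes, num_speakers=None) -> str:
    h = hashlib.sha256(audio_bytes)
    h.update(str(num_speakers).encode())
    return h.hexdigest()

k1 = cache_key(b"fake-audio", 2)
k2 = cache_key(b"fake-audio", 2)
k3 = cache_key(b"fake-audio", None)
# k1 == k2 (cache hit); k3 differs (different speaker count)
```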