Powered by pyannote-audio, the most widely used speaker diarization toolkit in production. Pre-trained models are downloaded automatically during installation: no API keys, no tokens, no accounts required. The tool automatically detects the number of speakers and handles overlapping speech. Models used:
Model                     Role
speaker-diarization-3.1   Main pipeline: detects speakers and assigns segments
segmentation-3.0          Underlying segmentation model used by the pipeline

Example

Request:
{
  "audio_path": "/Users/you/Downloads/interview.webm",
  "num_speakers": 2
}
Response:
{
  "speakers": ["SPEAKER_0", "SPEAKER_1"],
  "segments": [
    {
      "speaker": "SPEAKER_0",
      "start": 0.0,
      "end": 4.8,
      "text": "Welcome to the show. Today we're talking about AI.",
      "timestamp": "0:00"
    },
    {
      "speaker": "SPEAKER_1",
      "start": 5.1,
      "end": 12.3,
      "text": "Thanks for having me. I've been working on language models for about five years now.",
      "timestamp": "0:05"
    }
  ],
  "segment_count": 84,
  "duration": 1823.4,
  "duration_formatted": "30:23",
  "language": "en",
  "cached": false,
  "model_used": "tiny"
}
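The segments array is easy to post-process. A minimal sketch (field names taken from the sample response above; the aggregation itself is illustrative) that totals talk time per speaker and reproduces the M:SS style used by duration_formatted:

```python
from collections import defaultdict

# Segment fields as shown in the sample response
segments = [
    {"speaker": "SPEAKER_0", "start": 0.0, "end": 4.8},
    {"speaker": "SPEAKER_1", "start": 5.1, "end": 12.3},
]

# Total speaking time per speaker
talk_time = defaultdict(float)
for seg in segments:
    talk_time[seg["speaker"]] += seg["end"] - seg["start"]

def fmt(seconds):
    """Format seconds as M:SS (e.g. 1823.4 -> '30:23')."""
    minutes, secs = divmod(int(seconds), 60)
    return f"{minutes}:{secs:02d}"

for speaker, seconds in sorted(talk_time.items()):
    print(f"{speaker}: {fmt(seconds)} ({seconds:.1f}s)")
```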

Parameters

Parameter      Required   Default       Description
audio_path     Yes        —             Path to the audio/video file
model_size     No         tiny          Whisper model size for transcription
num_speakers   No         auto-detect   Number of speakers (omit to auto-detect)

How it works

  1. Transcribe the audio with faster-whisper (reused from memory if already transcribed)
  2. Diarize with pyannote to detect speaker boundaries and count
  3. Merge transcription segments with speaker turns by timestamp overlap
  4. Cache the result; the same file with the same speaker count returns instantly on the next call
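Step 3 can be sketched as a maximum-overlap match: each transcription segment is assigned the speaker whose turn overlaps it the most. This is illustrative logic under assumed dict shapes, not the tool's actual implementation:

```python
def assign_speakers(transcript_segments, speaker_turns):
    """Assign each transcription segment the speaker whose
    turn overlaps it the most (illustrative merge logic)."""
    merged = []
    for seg in transcript_segments:
        best_speaker, best_overlap = None, 0.0
        for turn in speaker_turns:
            # Overlap of [seg.start, seg.end] with [turn.start, turn.end]
            overlap = min(seg["end"], turn["end"]) - max(seg["start"], turn["start"])
            if overlap > best_overlap:
                best_speaker, best_overlap = turn["speaker"], overlap
        merged.append({**seg, "speaker": best_speaker})
    return merged

transcript = [{"start": 0.0, "end": 4.8, "text": "Welcome to the show."}]
turns = [
    {"speaker": "SPEAKER_0", "start": 0.0, "end": 4.9},
    {"speaker": "SPEAKER_1", "start": 4.9, "end": 12.5},
]
print(assign_speakers(transcript, turns))
```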

Combine with other tools

Use the diarized output to drive deeper analysis:
  • search_audio or deep_search to find what a specific speaker said about a topic
  • separate_audio before diarization for cleaner results on noisy recordings
  • chapters to see which speakers dominate which sections
  • batch_search to find a speaker’s remarks across multiple recordings

Notes

Speaker labels are generic (SPEAKER_0, SPEAKER_1, etc.). The tool identifies who speaks when, not who they are.
Omit num_speakers to let the model auto-detect. If you know the exact count, providing it improves accuracy.
Models are stored at ~/.cache/huggingface/hub/ (~30MB total). Downloaded once during install, used offline from that point forward.
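The result caching described under "How it works" implies a key built from the input file and the requested speaker count. A purely hypothetical sketch of such a key (the tool's real cache scheme is not documented here):

```python
import hashlib

def cache_key(audio_path, num_speakers=None):
    # Hypothetical: key on the file path plus requested speaker count,
    # so the same file with the same count hits the cache
    raw = f"{audio_path}|{num_speakers if num_speakers is not None else 'auto'}"
    return hashlib.sha256(raw.encode()).hexdigest()[:16]

print(cache_key("/Users/you/Downloads/interview.webm", num_speakers=2))
```

Note that changing num_speakers (or omitting it) yields a different key, which matches the documented behavior: only the same file with the same speaker count returns instantly.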