Powered by pyannote-audio, a widely used open-source speaker diarization toolkit. Pre-trained models are bundled with Augent and downloaded automatically during installation. No API keys, no tokens, no accounts required.
Automatically detects the number of speakers. Handles overlapping speech.
Models used:
| Model | Role |
|---|---|
| speaker-diarization-3.1 | Main pipeline: detects speakers and assigns segments |
| segmentation-3.0 | Underlying segmentation model used by the pipeline |
Example
Request:
```json
{
  "audio_path": "/Users/you/Downloads/interview.webm",
  "num_speakers": 2
}
```
Response:
```json
{
  "speakers": ["SPEAKER_0", "SPEAKER_1"],
  "segments": [
    {
      "speaker": "SPEAKER_0",
      "start": 0.0,
      "end": 4.8,
      "text": "Welcome to the show. Today we're talking about AI.",
      "timestamp": "0:00"
    },
    {
      "speaker": "SPEAKER_1",
      "start": 5.1,
      "end": 12.3,
      "text": "Thanks for having me. I've been working on language models for about five years now.",
      "timestamp": "0:05"
    }
  ],
  "segment_count": 84,
  "duration": 1823.4,
  "duration_formatted": "30:23",
  "language": "en",
  "cached": false,
  "model_used": "tiny"
}
```
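The response is easy to turn into a readable, speaker-labeled transcript. A minimal sketch, using only the field names shown in the example above (the `fmt_ts` helper mirrors how the `timestamp` and `duration_formatted` fields appear to be derived from seconds):

```python
# Sketch: render a diarization response into a speaker-labeled transcript.
# Field names ("segments", "speaker", "start", "text") come from the
# example response above; everything else here is illustrative.

def fmt_ts(seconds: float) -> str:
    """Format seconds as M:SS, matching the "timestamp" field above."""
    m, s = divmod(int(seconds), 60)
    return f"{m}:{s:02d}"

def render_transcript(response: dict) -> str:
    lines = []
    for seg in response["segments"]:
        lines.append(f'[{fmt_ts(seg["start"])}] {seg["speaker"]}: {seg["text"]}')
    return "\n".join(lines)

response = {
    "segments": [
        {"speaker": "SPEAKER_0", "start": 0.0, "end": 4.8,
         "text": "Welcome to the show."},
        {"speaker": "SPEAKER_1", "start": 5.1, "end": 12.3,
         "text": "Thanks for having me."},
    ]
}
transcript = render_transcript(response)
# [0:00] SPEAKER_0: Welcome to the show.
# [0:05] SPEAKER_1: Thanks for having me.
```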
Parameters
| Parameter | Required | Default | Description |
|---|---|---|---|
| audio_path | Yes | — | Path to the audio/video file |
| model_size | No | tiny | Whisper model size for transcription |
| num_speakers | No | auto-detect | Number of speakers (omit to auto-detect) |
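As a sketch of how these defaults behave, the snippet below applies them to an incoming request. This is illustrative only, not the tool's actual code; the `normalize_request` helper and `DEFAULTS` dict are assumptions:

```python
# Sketch: apply the parameter defaults from the table above.
# Hypothetical helper; the tool's real validation logic is not specified.
DEFAULTS = {"model_size": "tiny", "num_speakers": None}  # None = auto-detect

def normalize_request(request: dict) -> dict:
    if "audio_path" not in request:
        raise ValueError("audio_path is required")
    return {**DEFAULTS, **request}

req = normalize_request({"audio_path": "/Users/you/Downloads/interview.webm"})
# model_size falls back to "tiny"; num_speakers stays None (auto-detect)
```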
How it works
- Transcribe the audio with faster-whisper (from memory if already transcribed)
- Diarize with pyannote to detect speaker boundaries and count
- Merge transcription segments with speaker turns by timestamp overlap
- Cache the result. Same file, same speaker count returns instantly on next call.
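Step 3 above (merging transcription segments with speaker turns) can be sketched as follows — a simplified version that assigns each transcription segment the speaker turn it overlaps most in time. The tool's actual merging logic may differ:

```python
# Sketch: merge transcription segments with diarized speaker turns by
# timestamp overlap. Simplified; illustrative only.

def overlap(a_start, a_end, b_start, b_end):
    """Length in seconds of the intersection of two time ranges."""
    return max(0.0, min(a_end, b_end) - max(a_start, b_start))

def assign_speakers(transcript_segments, speaker_turns):
    merged = []
    for seg in transcript_segments:
        # Pick the speaker turn with the largest temporal overlap.
        best = max(
            speaker_turns,
            key=lambda t: overlap(seg["start"], seg["end"], t["start"], t["end"]),
        )
        merged.append({**seg, "speaker": best["speaker"]})
    return merged

turns = [
    {"speaker": "SPEAKER_0", "start": 0.0, "end": 5.0},
    {"speaker": "SPEAKER_1", "start": 5.0, "end": 12.5},
]
segs = [
    {"start": 0.0, "end": 4.8, "text": "Welcome to the show."},
    {"start": 5.1, "end": 12.3, "text": "Thanks for having me."},
]
merged = assign_speakers(segs, turns)
# merged[0] is labeled SPEAKER_0, merged[1] is labeled SPEAKER_1
```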
Use the diarized output to drive deeper analysis:
- search_audio or deep_search to find what a specific speaker said about a topic
- separate_audio before diarization for cleaner results on noisy recordings
- chapters to see which speakers dominate which sections
- batch_search to find a speaker’s remarks across multiple recordings
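For example, pulling one speaker's remarks out of the diarized segments before handing them to a downstream search step might look like this (the `speaker_lines` helper is hypothetical):

```python
# Sketch: filter diarized segments down to a single speaker's remarks.
# Hypothetical helper; segment field names come from the example response.

def speaker_lines(segments, speaker):
    return [s["text"] for s in segments if s["speaker"] == speaker]

segments = [
    {"speaker": "SPEAKER_0", "text": "Welcome to the show."},
    {"speaker": "SPEAKER_1", "text": "Thanks for having me."},
]
quotes = speaker_lines(segments, "SPEAKER_1")
# → ["Thanks for having me."]
```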
Notes
- Speaker labels are generic (SPEAKER_0, SPEAKER_1, etc.). The tool identifies who speaks when, not who they are.
- Omit num_speakers to let the model auto-detect. If you know the exact count, providing it improves accuracy.
- Models are stored at ~/.cache/huggingface/hub/ (~30MB total). Downloaded once during install, used offline from that point forward.
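The caching behavior described in "How it works" (same file, same speaker count returns instantly) could be keyed as sketched below. This is one plausible scheme, not the tool's actual cache key, which is not specified:

```python
# Sketch: one way to key the result cache: hash the file contents together
# with the requested speaker count. Illustrative only.
import hashlib

def cache_key(audio_bytes: bytes, num_speakers=None) -> str:
    h = hashlib.sha256(audio_bytes)
    h.update(str(num_speakers).encode())
    return h.hexdigest()

k1 = cache_key(b"fake-audio", 2)
k2 = cache_key(b"fake-audio", 2)
k3 = cache_key(b"fake-audio", None)
# k1 == k2 (cache hit); k3 differs (different speaker count)
```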