# identify_speakers

> Identify who speaks when in an audio or video file, using pyannote-audio for speaker diarization.

Powered by [pyannote-audio](https://github.com/pyannote/pyannote-audio), a widely used open-source speaker diarization toolkit. The pre-trained models are downloaded automatically when Augent is installed, so no API keys, tokens, or accounts are required.

The pipeline detects the number of speakers automatically and handles overlapping speech.

**Models used:**

| Model                                                                              | Role                                                 |
| :--------------------------------------------------------------------------------- | :--------------------------------------------------- |
| [speaker-diarization-3.1](https://huggingface.co/pyannote/speaker-diarization-3.1) | Main pipeline: detects speakers and assigns segments |
| [segmentation-3.0](https://huggingface.co/pyannote/segmentation-3.0)               | Underlying segmentation model used by the pipeline   |

***

## Example

**Request:**

```json theme={null}
{
  "audio_path": "/Users/you/Downloads/interview.webm",
  "num_speakers": 2
}
```

**Response** (abridged: only the first 2 of the 84 segments are shown):

```json theme={null}
{
  "speakers": ["SPEAKER_0", "SPEAKER_1"],
  "segments": [
    {
      "speaker": "SPEAKER_0",
      "start": 0.0,
      "end": 4.8,
      "text": "Welcome to the show. Today we're talking about AI.",
      "timestamp": "0:00"
    },
    {
      "speaker": "SPEAKER_1",
      "start": 5.1,
      "end": 12.3,
      "text": "Thanks for having me. I've been working on language models for about five years now.",
      "timestamp": "0:05"
    }
  ],
  "segment_count": 84,
  "duration": 1823.4,
  "duration_formatted": "30:23",
  "language": "en",
  "cached": false,
  "model_used": "tiny"
}
```
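If you consume the response programmatically, per-speaker talk time follows directly from each segment's `start` and `end` values. The sketch below is a minimal Python example that assumes the response has already been parsed into a dict. `talk_time` and `format_timestamp` are hypothetical helpers, not part of Augent, and the `H:MM:SS` format for recordings over an hour is a guess, since the example only shows `M:SS` values.

```python
from collections import defaultdict
from typing import Dict


def format_timestamp(seconds: float) -> str:
    """Render seconds as M:SS, or H:MM:SS past one hour (hour format is assumed)."""
    total = int(seconds)  # drop fractional seconds, e.g. 1823.4 -> 1823
    hours, rem = divmod(total, 3600)
    minutes, secs = divmod(rem, 60)
    if hours:
        return f"{hours}:{minutes:02d}:{secs:02d}"
    return f"{minutes}:{secs:02d}"


def talk_time(response: dict) -> Dict[str, float]:
    """Sum segment durations (end - start) per speaker label."""
    totals: Dict[str, float] = defaultdict(float)
    for seg in response["segments"]:
        totals[seg["speaker"]] += seg["end"] - seg["start"]
    return dict(totals)


# The two segments shown in the example response, trimmed to the fields used here.
response = {
    "segments": [
        {"speaker": "SPEAKER_0", "start": 0.0, "end": 4.8},
        {"speaker": "SPEAKER_1", "start": 5.1, "end": 12.3},
    ],
    "duration": 1823.4,
}

for speaker, seconds in talk_time(response).items():
    print(f"{speaker}: {seconds:.1f}s")
print("Total:", format_timestamp(response["duration"]))
```

`format_timestamp` reproduces the `timestamp` and `duration_formatted` values in the example (`0:05` for 5.1 seconds, `30:23` for 1823.4 seconds).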

***

## Parameters

| Parameter      | Required | Default     | Description                              |
| -------------- | -------- | ----------- | ---------------------------------------- |
| `audio_path`   | Yes      | —           | Path to the audio/video file             |
| `model_size`   | No       | `tiny`      | Whisper model size for transcription     |
| `num_speakers` | No       | auto-detect | Number of speakers (omit to auto-detect) |

***

## How it works

1. **Transcribe** the audio with faster-whisper, reusing the stored transcript if the file has already been transcribed
2. **Diarize** with pyannote to detect speaker boundaries and count
3. **Merge** transcription segments with speaker turns by timestamp overlap
4. **Cache** the result, so a later call with the same file and the same speaker count returns instantly.
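Step 3 can be illustrated with a simple rule: give each transcript segment the speaker whose diarization turns overlap it for the longest total time. This is a minimal sketch of that idea, not Augent's actual merge code. The `Segment` and `Turn` classes, the turn timings in the usage example, and the choice to leave a segment unlabeled (`None`) when no turn overlaps it are all assumptions.

```python
from dataclasses import dataclass
from typing import List, Optional


@dataclass
class Segment:  # one transcript segment from the speech-to-text step
    start: float
    end: float
    text: str


@dataclass
class Turn:  # one speaker turn from the diarization step
    start: float
    end: float
    speaker: str


def overlap(a_start: float, a_end: float, b_start: float, b_end: float) -> float:
    """Length of the intersection of two time intervals (0.0 if they are disjoint)."""
    return max(0.0, min(a_end, b_end) - max(a_start, b_start))


def assign_speakers(segments: List[Segment], turns: List[Turn]) -> List[dict]:
    """Label each segment with the speaker whose turns overlap it for the most time."""
    merged = []
    for seg in segments:
        scores = {}  # speaker label -> total overlapping seconds
        for turn in turns:
            o = overlap(seg.start, seg.end, turn.start, turn.end)
            if o > 0:
                scores[turn.speaker] = scores.get(turn.speaker, 0.0) + o
        # Assumption: segments with no overlapping turn stay unlabeled.
        speaker: Optional[str] = max(scores, key=scores.get) if scores else None
        merged.append({"speaker": speaker, "start": seg.start, "end": seg.end, "text": seg.text})
    return merged


segments = [
    Segment(0.0, 4.8, "Welcome to the show. Today we're talking about AI."),
    Segment(5.1, 12.3, "Thanks for having me. I've been working on language models for about five years now."),
]
turns = [Turn(0.0, 5.0, "SPEAKER_0"), Turn(5.0, 12.5, "SPEAKER_1")]  # illustrative timings
for row in assign_speakers(segments, turns):
    print(row["speaker"], row["text"])
```

Scoring by total overlap rather than by a segment's midpoint means a segment that straddles a speaker change goes to whoever talks for most of it.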

***

## Combine with other tools

Use the diarized output to drive deeper analysis:

* `search_audio` or `deep_search` to find what a specific speaker said about a topic
* `separate_audio` before diarization for cleaner results on noisy recordings
* `chapters` to see which speakers dominate which sections
* `batch_search` to find a speaker's remarks across multiple recordings

***

## Notes

<Tip>Speaker labels are generic (`SPEAKER_0`, `SPEAKER_1`, etc.). The tool identifies *who* speaks *when*, not *who they are*.</Tip>

<Tip>Omit `num_speakers` to let the model auto-detect the speaker count. If you know the exact count, pass it; this usually improves accuracy.</Tip>

<Tip>Models are stored at `~/.cache/huggingface/hub/` (\~30MB total). They are downloaded once during installation and used offline from then on.</Tip>