# Audio Separation

> How Augent isolates vocals from music and background noise using Meta's Demucs v4.

Audio separation splits a recording into its component parts — vocals, drums, bass, and other instruments — using Meta's Demucs v4 neural network.

The primary use case is to **isolate vocals before transcription**, which improves accuracy on audio with music, intros, or background noise.

***

## How it works

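Demucs v4 (Hybrid Transformer Demucs) is a source-separation model that combines waveform and spectrogram processing with transformer layers and is trained on music recordings. It takes a mixed audio signal and outputs separate stems: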

* **Vocals**: speech, singing, human voice
* **Drums**: percussion
* **Bass**: low-frequency instruments
* **Other**: everything else (guitars, keyboards, effects)

Augent runs Demucs as a subprocess for maximum reliability across versions.
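As a rough illustration of the subprocess approach, the sketch below builds and runs a Demucs command line from Python. It is not Augent's actual implementation: the helper names `build_demucs_command` and `run_demucs` are hypothetical, and the sketch assumes the `demucs` package is installed in the current interpreter. The `-n`, `-o`, and `--two-stems` options are part of the Demucs command-line interface.

```python
import subprocess
import sys


def build_demucs_command(audio_path, out_dir, model="htdemucs", vocals_only=True):
    """Build the argument list for one Demucs CLI run (hypothetical helper)."""
    cmd = [sys.executable, "-m", "demucs", "-n", model, "-o", str(out_dir)]
    if vocals_only:
        # --two-stems=vocals makes Demucs write vocals and no_vocals only.
        cmd.append("--two-stems=vocals")
    cmd.append(str(audio_path))
    return cmd


def run_demucs(audio_path, out_dir, **options):
    """Run Demucs in a child process and raise if it exits with an error."""
    cmd = build_demucs_command(audio_path, out_dir, **options)
    subprocess.run(cmd, check=True, capture_output=True, text=True)
```

Keeping the command construction separate from the call to `subprocess.run` makes it possible to inspect or log the exact command without starting a separation job.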

***

## Two modes

| Mode                          | What it produces    | Speed  |
| ----------------------------- | ------------------- | ------ |
| `vocals_only: true` (default) | vocals + no\_vocals | Faster |
| `vocals_only: false`          | all 4 stems         | Slower |

For transcription cleanup, `vocals_only: true` is all you need. Use the full 4-stem mode when you need the individual instrument tracks.
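To make the difference between the two modes concrete, the sketch below lists the stem files each mode would yield. It assumes the default output layout of the Demucs command-line tool (`<out_dir>/<model>/<track name>/<stem>.wav`); Augent's own file layout may differ, and `expected_stem_paths` is a hypothetical helper, not part of Augent.

```python
from pathlib import Path

TWO_STEM_NAMES = ("vocals", "no_vocals")
FOUR_STEM_NAMES = ("vocals", "drums", "bass", "other")


def expected_stem_paths(audio_path, out_dir, model="htdemucs", vocals_only=True):
    """Map each stem name to the WAV file Demucs is expected to write."""
    names = TWO_STEM_NAMES if vocals_only else FOUR_STEM_NAMES
    # Assumed layout: the track folder is the input file name without its extension.
    track_dir = Path(out_dir) / model / Path(audio_path).stem
    return {name: track_dir / f"{name}.wav" for name in names}
```

For example, `expected_stem_paths("shows/ep1.mp3", "out")` returns entries for `vocals` and `no_vocals` only, while passing `vocals_only=False` returns all four stems.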

***

## Models

| Model                | Quality | Speed  |
| -------------------- | ------- | ------ |
| `htdemucs` (default) | Great   | Fast   |
| `htdemucs_ft`        | Best    | Slower |

The fine-tuned model (`htdemucs_ft`) produces cleaner separation but takes longer. For pre-transcription cleanup, the default model is sufficient.

***

## Caching

Separation results are cached on the filesystem at `~/.augent/separated/`, keyed by `MD5(file):model:stem_mode`. If you separate the same file again with the same settings, Augent returns the cached stems without running Demucs again.
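A key of this shape could be computed as in the sketch below. This is an illustration only: the function names and the `vocals_only` / `all_stems` labels for the stem mode are assumptions, and the exact strings Augent uses are not documented here.

```python
import hashlib


def file_md5(path, chunk_size=1 << 20):
    """Hash the file contents in chunks so large audio files are not loaded at once."""
    digest = hashlib.md5()
    with open(path, "rb") as handle:
        for chunk in iter(lambda: handle.read(chunk_size), b""):
            digest.update(chunk)
    return digest.hexdigest()


def cache_key(path, model="htdemucs", vocals_only=True):
    """Combine content hash, model name, and stem mode into one key."""
    # The stem-mode labels below are hypothetical placeholders.
    stem_mode = "vocals_only" if vocals_only else "all_stems"
    return f"{file_md5(path)}:{model}:{stem_mode}"
```

Because the key depends on the file contents rather than the file name, a renamed copy of the same audio maps to the same key, while changing the model or the stem mode produces a different one.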

***

## Workflow

```
separate_audio → vocals_path → transcribe_audio
```

1. Call `separate_audio` on the audio file
2. Use the `vocals_path` from the response as the `audio_path` for transcription
3. Transcribe the clean vocal track, which typically yields a more accurate transcript than the original mix

This is especially effective for:

* Podcasts with music intros/outros
* Videos with background music
* Lectures in noisy environments
* Any audio where speech competes with other sounds
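The workflow can be sketched in Python as follows. The two tools are passed in as callables because how you invoke `separate_audio` and `transcribe_audio` depends on your client. The `audio_path` argument to `separate_audio` and the dictionary shape of its response are assumptions made for illustration; only `vocals_path`, `vocals_only`, and the `audio_path` parameter for transcription come from the description above.

```python
def transcribe_with_separation(audio_path, separate_audio, transcribe_audio):
    """Isolate vocals first, then transcribe the vocal stem instead of the mix."""
    # Assumed call shape: keyword arguments in, dict-like response out.
    separation = separate_audio(audio_path=audio_path, vocals_only=True)
    # Feed the isolated vocal track to the transcription step.
    return transcribe_audio(audio_path=separation["vocals_path"])
```

Injecting the two callables keeps the sketch independent of any particular client library and makes the chaining easy to test with stand-in functions.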