Audio separation splits a recording into its component parts — vocals, drums, bass, and other instruments — using Meta’s Demucs v4 neural network. The primary use case: isolate vocals before transcription to dramatically improve accuracy on audio with music, intros, or background noise.

How it works

Demucs is a hybrid transformer model trained on thousands of songs. It takes a mixed audio signal and outputs separate stems:
  • Vocals: speech, singing, human voice
  • Drums: percussion
  • Bass: low-frequency instruments
  • Other: everything else (guitars, keyboards, effects)
Augent runs Demucs as a subprocess, which isolates it from the host process and keeps the integration reliable across Demucs versions.
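As a rough sketch of what that subprocess invocation looks like: the `demucs` CLI with its `-n` (model), `-o` (output directory), and `--two-stems` flags is standard Demucs usage, but `build_demucs_cmd` is an illustrative helper, not Augent's actual code.

```python
def build_demucs_cmd(audio_path: str, out_dir: str,
                     model: str = "htdemucs",
                     vocals_only: bool = True) -> list[str]:
    """Build the argv for a Demucs subprocess call.

    -n selects the model; --two-stems vocals asks Demucs to produce
    only vocals + no_vocals instead of all four stems. Demucs writes
    stems under <out_dir>/<model>/<track>/<stem>.wav.
    """
    cmd = ["demucs", "-n", model, "-o", out_dir]
    if vocals_only:
        cmd += ["--two-stems", "vocals"]
    cmd.append(audio_path)
    return cmd

cmd = build_demucs_cmd("episode.mp3", "out")
# The real call would then be: subprocess.run(cmd, check=True)
```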

Two modes

| Mode | What it produces | Speed |
| --- | --- | --- |
| vocals_only: true (default) | vocals + no_vocals | Faster |
| vocals_only: false | all 4 stems | Slower |
For transcription cleanup, vocals_only: true is all you need. Use the full 4-stem mode when you need the individual instrument tracks.
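The two modes map directly to the stem names in the table above; a minimal helper makes the mapping explicit (`stems_for_mode` is illustrative, not part of Augent's API):

```python
def stems_for_mode(vocals_only: bool) -> list[str]:
    """Stem names produced by each mode, per the table above."""
    if vocals_only:
        # Faster: a single vocals / everything-else split
        return ["vocals", "no_vocals"]
    # Slower: full 4-stem separation
    return ["vocals", "drums", "bass", "other"]
```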

Models

| Model | Quality | Speed |
| --- | --- | --- |
| htdemucs (default) | Great | Fast |
| htdemucs_ft | Best | Slower |
The fine-tuned model produces cleaner separation but takes longer. For pre-transcription cleanup, the default model is sufficient.
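In code, the choice reduces to picking one of the two model names from the table (`pick_model` is a hypothetical helper for illustration):

```python
def pick_model(need_cleanest_separation: bool = False) -> str:
    """htdemucs is the fast default; htdemucs_ft is the fine-tuned,
    slower variant with cleaner separation."""
    return "htdemucs_ft" if need_cleanest_separation else "htdemucs"
```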

Caching

Separation results are cached on the filesystem at ~/.augent/separated/, keyed by MD5(file):model:stem_mode. If you separate the same file again with the same settings, it returns the cached stems instantly.
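The cache key scheme can be sketched as follows. The doc specifies only the directory and the MD5(file):model:stem_mode key; the exact stem-mode labels and on-disk naming here are assumptions.

```python
import hashlib
from pathlib import Path

# Cache root from the docs: ~/.augent/separated/
CACHE_ROOT = Path.home() / ".augent" / "separated"

def cache_key(audio_bytes: bytes, model: str, vocals_only: bool) -> str:
    """Key = MD5 of the file contents, plus model and stem mode.

    The "two_stems" / "four_stems" labels are illustrative; the docs
    only say the stem mode is part of the key.
    """
    digest = hashlib.md5(audio_bytes).hexdigest()
    stem_mode = "two_stems" if vocals_only else "four_stems"
    return f"{digest}:{model}:{stem_mode}"

key = cache_key(b"hello", "htdemucs", True)
```

Because the key hashes the file contents rather than its path, renaming or moving a file still hits the cache, while any re-encode produces a new key.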

Workflow

separate_audio → vocals_path → transcribe_audio
  1. Call separate_audio on the audio file
  2. Use the vocals_path from the response as the audio_path for transcription
  3. The clean vocal track yields significantly more accurate transcription
This is especially effective for:
  • Podcasts with music intros/outros
  • Videos with background music
  • Lectures in noisy environments
  • Any audio where speech competes with other sounds
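The three-step workflow can be sketched end to end. Both tool calls are stubbed here, since the real interface isn't shown; what matters is the plumbing, i.e. feeding the vocals_path from the separation response into the transcription call as its audio_path.

```python
def separate_audio(audio_path: str, vocals_only: bool = True) -> dict:
    # Stub standing in for the real separation tool; the real response
    # includes a vocals_path pointing at the isolated vocal stem.
    return {"vocals_path": audio_path.rsplit(".", 1)[0] + "/vocals.wav"}

def transcribe_audio(audio_path: str) -> str:
    # Stub: the real tool would run speech-to-text on the given file.
    return f"<transcript of {audio_path}>"

# Step 1: separate. Step 2: take vocals_path from the response.
# Step 3: transcribe the clean vocal track.
stems = separate_audio("podcast.mp3")
transcript = transcribe_audio(stems["vocals_path"])
```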