Audio separation splits a recording into its component parts — vocals, drums, bass, and other instruments — using Meta’s Demucs v4 neural network. The primary use case: isolate vocals before transcription to dramatically improve accuracy on audio with music, intros, or background noise.

How it works

Demucs is a hybrid transformer model trained on thousands of songs. It takes a mixed audio signal and outputs separate stems:
  • Vocals: speech, singing, human voice
  • Drums: percussion
  • Bass: low-frequency instruments
  • Other: everything else (guitars, keyboards, effects)
Augent runs Demucs as a subprocess, which isolates it from the host process and keeps the integration reliable across Demucs versions.
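As a rough sketch of what that subprocess invocation looks like: the `demucs` CLI with its `-n` (model), `-o` (output directory), and `--two-stems` flags is standard Demucs usage, but `build_demucs_cmd` is an illustrative helper, not Augent's actual code.

```python
def build_demucs_cmd(audio_path: str, out_dir: str,
                     model: str = "htdemucs",
                     vocals_only: bool = True) -> list[str]:
    """Build the argv for a Demucs subprocess call.

    -n selects the model; --two-stems vocals asks Demucs to produce
    only vocals + no_vocals instead of all four stems. Demucs writes
    stems under <out_dir>/<model>/<track>/<stem>.wav.
    """
    cmd = ["demucs", "-n", model, "-o", out_dir]
    if vocals_only:
        cmd += ["--two-stems", "vocals"]
    cmd.append(audio_path)
    return cmd

cmd = build_demucs_cmd("episode.mp3", "out")
# The real call would then be: subprocess.run(cmd, check=True)
```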

Two modes

| Mode | What it produces | Speed |
| --- | --- | --- |
| vocals_only: true (default) | vocals + no_vocals | Faster |
| vocals_only: false | all 4 stems | Slower |
For transcription cleanup, vocals_only: true is all you need. Use the full 4-stem mode when you need the individual instrument tracks.
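The two modes map directly to the stem names in the table above; a minimal helper makes the mapping explicit (`stems_for_mode` is illustrative, not part of Augent's API):

```python
def stems_for_mode(vocals_only: bool) -> list[str]:
    """Stem names produced by each mode, per the table above."""
    if vocals_only:
        # Faster: a single vocals / everything-else split
        return ["vocals", "no_vocals"]
    # Slower: full 4-stem separation
    return ["vocals", "drums", "bass", "other"]
```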

Models

| Model | Quality | Speed |
| --- | --- | --- |
| htdemucs (default) | Great | Fast |
| htdemucs_ft | Best | Slower |
The fine-tuned model produces cleaner separation but takes longer. For pre-transcription cleanup, the default model is sufficient.
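In code, the choice reduces to picking one of the two model names from the table (`pick_model` is a hypothetical helper for illustration):

```python
def pick_model(need_cleanest_separation: bool = False) -> str:
    """htdemucs is the fast default; htdemucs_ft is the fine-tuned,
    slower variant with cleaner separation."""
    return "htdemucs_ft" if need_cleanest_separation else "htdemucs"
```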

Caching

Separation results are cached on the filesystem at ~/.augent/separated/, keyed by MD5(file):model:stem_mode. If you separate the same file again with the same settings, it returns the cached stems instantly.
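The cache key scheme can be sketched as follows. The doc specifies only the directory and the MD5(file):model:stem_mode key; the exact stem-mode labels and on-disk naming here are assumptions.

```python
import hashlib
from pathlib import Path

# Cache root from the docs: ~/.augent/separated/
CACHE_ROOT = Path.home() / ".augent" / "separated"

def cache_key(audio_bytes: bytes, model: str, vocals_only: bool) -> str:
    """Key = MD5 of the file contents, plus model and stem mode.

    The "two_stems" / "four_stems" labels are illustrative; the docs
    only say the stem mode is part of the key.
    """
    digest = hashlib.md5(audio_bytes).hexdigest()
    stem_mode = "two_stems" if vocals_only else "four_stems"
    return f"{digest}:{model}:{stem_mode}"

key = cache_key(b"hello", "htdemucs", True)
```

Because the key hashes the file contents rather than its path, renaming or moving a file still hits the cache, while any re-encode produces a new key.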

Workflow

separate_audio → vocals_path → transcribe_audio
  1. Call separate_audio on the audio file
  2. Use the vocals_path from the response as the audio_path for transcription
  3. The clean vocal track yields significantly more accurate transcription
This is especially effective for:
  • Podcasts with music intros/outros
  • Videos with background music
  • Lectures in noisy environments
  • Any audio where speech competes with other sounds
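The three-step workflow can be sketched end to end. Both tool calls are stubbed here, since the real interface isn't shown; what matters is the plumbing, i.e. feeding the vocals_path from the separation response into the transcription call as its audio_path.

```python
def separate_audio(audio_path: str, vocals_only: bool = True) -> dict:
    # Stub standing in for the real separation tool; the real response
    # includes a vocals_path pointing at the isolated vocal stem.
    return {"vocals_path": audio_path.rsplit(".", 1)[0] + "/vocals.wav"}

def transcribe_audio(audio_path: str) -> str:
    # Stub: the real tool would run speech-to-text on the given file.
    return f"<transcript of {audio_path}>"

# Step 1: separate. Step 2: take vocals_path from the response.
# Step 3: transcribe the clean vocal track.
stems = separate_audio("podcast.mp3")
transcript = transcribe_audio(stems["vocals_path"])
```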