How it works
Demucs is a hybrid transformer model trained on thousands of songs. It takes a mixed audio signal and outputs separate stems:
- Vocals: speech, singing, human voice
- Drums: percussion
- Bass: low-frequency instruments
- Other: everything else (guitars, keyboards, effects)
Two modes
| Mode | What it produces | Speed |
|---|---|---|
| vocals_only: true (default) | vocals + no_vocals | Faster |
| vocals_only: false | all 4 stems | Slower |
For transcription, vocals_only: true is all you need. Use the full 4-stem mode when you need the individual instrument tracks.
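As a quick illustration of the two modes, the sketch below lists the stem names each setting would yield. The function name `expected_stems` is hypothetical, and the lowercase stem names for the 4-stem mode are an assumption based on the list above.

```python
def expected_stems(vocals_only: bool = True) -> list[str]:
    """Return the stem names produced for a given vocals_only setting."""
    if vocals_only:
        # Two-stem mode: the voice, plus everything else mixed together.
        return ["vocals", "no_vocals"]
    # Full mode: one track per stem.
    return ["vocals", "drums", "bass", "other"]
```

In both modes a vocals stem is available, which is the track that matters for transcription.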
Models
| Model | Quality | Speed |
|---|---|---|
| htdemucs (default) | Great | Fast |
| htdemucs_ft | Best | Slower |
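How these options reach Demucs is not specified here. As a rough sketch, assuming they map onto the upstream Demucs command-line flags (`-n` to pick a model, `--two-stems` to split one stem from the rest), the arguments could be built as follows. The helper `demucs_args` is hypothetical and is not part of this tool.

```python
def demucs_args(audio_path: str, vocals_only: bool = True,
                model: str = "htdemucs") -> list[str]:
    """Build an argument list for the Demucs CLI (assumed mapping)."""
    args = ["-n", model]  # -n selects the pretrained model by name
    if vocals_only:
        # Ask Demucs for vocals plus a single "everything else" track.
        args.append("--two-stems=vocals")
    args.append(audio_path)
    return args


if __name__ == "__main__":
    print(demucs_args("talk.mp3"))
    print(demucs_args("song.mp3", vocals_only=False, model="htdemucs_ft"))
```

Omitting `--two-stems` leaves Demucs in its default mode, which writes all four stems.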
Caching
Separation results are cached on the filesystem at ~/.augent/separated/, keyed by MD5(file):model:stem_mode. If you separate the same file again with the same settings, the cached stems are returned instantly.
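The key scheme can be illustrated with a minimal sketch. It assumes that MD5(file) means the hex digest of the file's contents and that stem_mode is a short label such as `vocals_only` or `full`; both are assumptions, and the names `file_md5` and `cache_key` are hypothetical.

```python
import hashlib
from pathlib import Path


def file_md5(path: Path, chunk_size: int = 1 << 20) -> str:
    """Hex MD5 digest of a file's contents, read in chunks."""
    digest = hashlib.md5()
    with path.open("rb") as f:
        for chunk in iter(lambda: f.read(chunk_size), b""):
            digest.update(chunk)
    return digest.hexdigest()


def cache_key(path: Path, model: str = "htdemucs",
              vocals_only: bool = True) -> str:
    """Compose a key of the form MD5(file):model:stem_mode."""
    stem_mode = "vocals_only" if vocals_only else "full"  # assumed labels
    return f"{file_md5(path)}:{model}:{stem_mode}"
```

Under this scheme, the same file with the same settings always maps to the same key, while changing the model or the mode yields a different key, so results from different settings do not collide.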
Workflow
- Call separate_audio on the audio file
- Use the vocals_path from the response as the audio_path for transcription
- The clean vocal track produces significantly better transcription
Separating before transcription helps most with:
- Podcasts with music intros/outros
- Videos with background music
- Lectures in noisy environments
- Any audio where speech competes with other sounds
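The workflow above (separate first, then transcribe the vocal track) can be sketched as follows. The call signatures of the tools are not given here, so both steps are passed in as callables: `separate_audio` is assumed to return a mapping that contains `vocals_path`, and `transcribe` is a placeholder for whichever transcription tool accepts an `audio_path`. The wrapper name `transcribe_with_separation` is hypothetical.

```python
from typing import Callable, Mapping


def transcribe_with_separation(
    audio_path: str,
    separate_audio: Callable[[str], Mapping[str, str]],
    transcribe: Callable[[str], str],
) -> str:
    """Separate first, then transcribe the isolated vocal track."""
    result = separate_audio(audio_path)   # step 1: run separation
    vocals_path = result["vocals_path"]   # step 2: take the vocal stem
    return transcribe(vocals_path)        # step 3: transcribe the clean audio
```

Passing the two steps in as arguments keeps the sketch independent of how the tools are actually invoked.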

