separate_audio fixes this. It runs Meta’s Demucs v4 on the recording and isolates the vocal track from everything else. Music, drums, bass, ambient noise, all stripped out. What remains is clean speech that Whisper transcribes accurately.
When to use it
- Podcast episodes with intro/outro music. The first 30 seconds of most podcasts are music. Whisper tries to transcribe the lyrics or hallucinates words. Separation removes the music entirely.
- Twitter/X Spaces with background noise. Spaces are recorded from phones in noisy environments. Separation isolates the speakers.
- Conference talks and seminars. Venue acoustics, audience noise, and background music between segments all degrade transcription quality.
- Interviews recorded in public. Coffee shops, street noise, other conversations bleeding in.
- Any recording where someone is talking over music. Demucs was built for exactly this. It separates the voice from the music even when they overlap completely.
How it works
One tool call before transcription. Step 1: Separate the recording with separate_audio. The separated vocals can then be used with search_audio, deep_search, chapters, identify_speakers, batch_search, take_notes, and search_proximity.
Caching
Separation results are cached by file hash at ~/.augent/separated/. The first run processes the audio through Demucs. Every run after returns the cached stems instantly.
Same caching behavior as transcriptions. Process once, use forever.
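To make the cache lookup concrete, here is a minimal sketch of hash-keyed caching. The cache root comes from the text above; everything else is an assumption for illustration: the choice of SHA-256, the one-directory-per-hash layout, the .wav extension, and the helper names file_hash and cached_stems are hypothetical and may not match what separate_audio does internally.

```python
import hashlib
from pathlib import Path
from typing import Dict, Optional

# Cache root named in the documentation above.
CACHE_ROOT = Path.home() / ".augent" / "separated"


def file_hash(path: Path, chunk_size: int = 1 << 20) -> str:
    """Hash the file contents in chunks (SHA-256 is an assumption)."""
    digest = hashlib.sha256()
    with path.open("rb") as f:
        for chunk in iter(lambda: f.read(chunk_size), b""):
            digest.update(chunk)
    return digest.hexdigest()


def cached_stems(audio: Path, cache_root: Path = CACHE_ROOT) -> Optional[Dict[str, Path]]:
    """Return {stem name: path} if stems for this file are cached, else None.

    Assumes a hypothetical layout of <cache_root>/<content hash>/<stem>.wav.
    """
    stem_dir = cache_root / file_hash(audio)
    if not stem_dir.is_dir():
        return None
    stems = {p.stem: p for p in sorted(stem_dir.glob("*.wav"))}
    return stems or None
```

Keying on the content hash rather than the file name means a renamed or moved copy of the same recording still hits the cache, while an edited file with the same name does not.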
Models
| Model | Speed | Quality | Best for |
|---|---|---|---|
| htdemucs | Fast | Great | Default. Handles most recordings well. |
| htdemucs_ft | Slower | Best | Difficult audio with heavy overlap between voice and music. |
Use htdemucs unless the default output still has audible music bleed in the vocal stem. In that case, switch to htdemucs_ft.
Vocals-only vs full separation
By default, separate_audio runs in vocals-only mode: it produces two stems (vocals and no_vocals). This is faster than full separation and produces the same vocal quality.
Set vocals_only: false to get all four stems: vocals, drums, bass, other. This is useful if you need the individual instrument tracks for other purposes, but for transcription, vocals-only is all you need.
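For readers who want to see what the model and stem choices correspond to in Demucs itself, the sketch below builds a command line for the upstream demucs CLI. The flags -n, -o, and --two-stems are real Demucs options; the function demucs_command and its parameters are hypothetical, and this is not necessarily how separate_audio invokes Demucs.

```python
from typing import List

# The two models described in the table above.
MODELS = ("htdemucs", "htdemucs_ft")


def demucs_command(audio: str, out_dir: str,
                   model: str = "htdemucs",
                   vocals_only: bool = True) -> List[str]:
    """Build an argument list for the demucs CLI (illustrative helper)."""
    if model not in MODELS:
        raise ValueError(f"unknown model: {model!r}")
    cmd = ["demucs", "-n", model, "-o", out_dir]
    if vocals_only:
        # Two-stem mode: writes vocals and no_vocals only.
        cmd += ["--two-stems", "vocals"]
    # Without --two-stems, Demucs writes vocals, drums, bass, and other.
    cmd.append(audio)
    return cmd
```

Assuming demucs is installed and on the PATH, the resulting list can be passed to subprocess.run(cmd, check=True). Calling demucs_command("talk.mp3", "out", model="htdemucs_ft", vocals_only=False) gives the slower, full four-stem variant.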

