Augent’s transcription is powered by Whisper, which is built for clean speech. When the audio has music, background noise, podcast intros, or overlapping sounds, Whisper does its best but the results suffer. Words get missed. Sentences get mangled. Timestamps drift.

separate_audio fixes this. It runs Meta’s Demucs v4 on the recording and isolates the vocal track from everything else. Music, drums, bass, ambient noise, all stripped out. What remains is clean speech that Whisper transcribes accurately.

When to use it

  • Podcast episodes with intro/outro music. The first 30 seconds of most podcasts are music. Whisper tries to transcribe the lyrics or hallucinates words. Separation removes the music entirely.
  • Twitter/X Spaces with background noise. Spaces are recorded from phones in noisy environments. Separation isolates the speakers.
  • Conference talks and seminars. Venue acoustics, audience noise, and background music between segments all degrade transcription quality.
  • Interviews recorded in public. Coffee shops, street noise, other conversations bleeding in.
  • Any recording where someone is talking over music. Demucs was built for exactly this. It separates the voice from the music even when they overlap completely.

How it works

One tool call before transcription:

Step 1: Separate
separate_audio
  audio_path: "/path/to/noisy-podcast.mp3"
Returns the path to the clean vocal stem.

Step 2: Transcribe the vocal stem
transcribe_audio
  audio_path: "/path/to/.augent/separated/.../vocals.wav"
Clean transcription. No background noise. Accurate timestamps.

The vocal stem works with every tool in Augent: search_audio, deep_search, chapters, identify_speakers, batch_search, take_notes, and search_proximity.
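The two steps above can be chained from a script. The following is a minimal sketch, not Augent's client API: call_tool is a hypothetical stand-in for however your client invokes Augent tools, and the "vocals_path" result key is an assumption made for illustration. The source only says that separate_audio returns the path to the vocal stem.

```python
from typing import Any, Callable, Dict

# Hypothetical signature: call_tool(tool_name, arguments) -> result dict.
ToolCaller = Callable[[str, Dict[str, Any]], Dict[str, Any]]


def separate_then_transcribe(call_tool: ToolCaller, audio_path: str) -> Dict[str, Any]:
    """Run separate_audio, then transcribe_audio on the returned vocal stem."""
    separated = call_tool("separate_audio", {"audio_path": audio_path})
    # Assumption: the result exposes the stem path under "vocals_path".
    vocals_path = separated["vocals_path"]
    return call_tool("transcribe_audio", {"audio_path": vocals_path})
```

The point of the sketch is the ordering: transcription receives the separated stem, never the original noisy file.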

Caching

Separation results are cached by file hash at ~/.augent/separated/. The first run processes the audio through Demucs. Every run after returns the cached stems instantly. Same caching behavior as transcriptions. Process once, use forever.
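To illustrate what caching by file hash means, here is a rough sketch of a content-addressed lookup. The hash algorithm (SHA-256) and the directory layout (one folder per hash containing vocals.wav) are assumptions chosen for the example. Augent's actual cache layout is not documented here.

```python
import hashlib
from pathlib import Path
from typing import Optional


def file_hash(path: Path, chunk_size: int = 1 << 20) -> str:
    """Hash file contents in chunks so large recordings are not loaded at once."""
    digest = hashlib.sha256()
    with path.open("rb") as handle:
        for chunk in iter(lambda: handle.read(chunk_size), b""):
            digest.update(chunk)
    return digest.hexdigest()


def cached_vocals(audio: Path, cache_root: Path) -> Optional[Path]:
    """Return the cached vocal stem for this audio, or None on a cache miss."""
    # Hypothetical layout: <cache_root>/<content hash>/vocals.wav
    candidate = cache_root / file_hash(audio) / "vocals.wav"
    return candidate if candidate.is_file() else None
```

Because the key comes from the file's contents rather than its name, a renamed or moved copy of the same recording would still hit the cache under this scheme.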

Models

Model | Speed | Quality | Best for
htdemucs | Fast | Great | Default. Handles most recordings well.
htdemucs_ft | Slower | Best | Difficult audio with heavy overlap between voice and music.
Stick with htdemucs unless the default output still has audible music bleed in the vocal stem.

Vocals-only vs full separation

By default, separate_audio runs in vocals-only mode: it produces two stems (vocals and no_vocals). This is faster than full separation and produces the same vocal quality. Set vocals_only: false to get all four stems: vocals, drums, bass, other. This is useful if you need the individual instrument tracks for other purposes, but for transcription, vocals-only is all you need.
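For readers who installed demucs directly, the model choice and the vocals-only switch correspond to options on the Demucs command line (-n for the model name, --two-stems for two-stem mode). The helper below is a sketch that builds such a command. demucs_command is a hypothetical name, and how separate_audio invokes Demucs internally is not specified here.

```python
from typing import List


def demucs_command(audio_path: str, model: str = "htdemucs", vocals_only: bool = True) -> List[str]:
    """Build a Demucs CLI invocation for the given model and stem mode."""
    command = ["demucs", "-n", model]
    if vocals_only:
        # Two-stem mode writes vocals.wav and no_vocals.wav only.
        command += ["--two-stems", "vocals"]
    command.append(audio_path)
    return command
```

On a machine with demucs installed, the returned list can be passed to subprocess.run(command, check=True). Omitting --two-stems yields the four stems: vocals, drums, bass, other.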

Installation

Source separation is included in the standard Augent install:
curl -fsSL https://augent.app/install.sh | bash
If you installed Augent before this feature was added, install the separator package:
pip install "augent[separator]"
Or install demucs directly:
pip install demucs