Augent is a pipeline. Audio goes in as URLs or files and comes out as searchable, indexed text stored permanently on your machine.

Stages

1. Download

When you give Augent a URL, it downloads the audio track only, never video. It uses yt-dlp with aria2c for speed (16 parallel connections, concurrent fragments) and supports YouTube, Twitter/X, TikTok, Instagram, SoundCloud, and 1000+ other sites. The source URL is stored permanently, keyed by file hash, so any future operation on that file — even weeks later, from a different path — automatically links back to the original source.

2. Separate (optional)

If the audio has music, intros, or background noise, Augent can isolate the vocals using Meta’s Demucs v4 before transcription. This dramatically improves transcription accuracy on noisy audio. See Audio Separation.
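Demucs v4 ships a command-line interface whose `--two-stems` option splits audio into a vocal stem and everything else. A minimal invocation sketch, assuming Demucs is installed separately (`pip install demucs`) — the wrapper names are hypothetical, not Augent's own API:

```python
import subprocess

def demucs_vocals_command(audio_path: str, out_dir: str) -> list[str]:
    """Build a Demucs v4 invocation that keeps only the vocal stem."""
    return [
        "demucs",
        "--two-stems", "vocals",  # split into vocals / no_vocals only
        "-o", out_dir,            # directory where separated stems land
        audio_path,
    ]

def separate_vocals(audio_path: str, out_dir: str) -> None:
    # Requires demucs on PATH; the first run downloads model weights.
    subprocess.run(demucs_vocals_command(audio_path, out_dir), check=True)
```

Transcribing the resulting vocal stem instead of the original mix is what removes music and background noise from the transcript.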

3. Transcribe

The file is transcribed locally using faster-whisper (a CTranslate2-optimized build of OpenAI’s Whisper). Transcription produces word-level timestamps, detects the language automatically, and applies VAD filtering to skip silence. The result is stored in memory immediately, keyed by file hash + model size. See Memory & Caching.
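The hash + model-size keying described above can be sketched as a simple cache. This is an illustrative sketch, not Augent's real storage layer:

```python
import hashlib

class TranscriptCache:
    """In-memory transcript store keyed by (content hash, model size)."""

    def __init__(self):
        self._store: dict[tuple[str, str], str] = {}

    @staticmethod
    def _key(path: str, model_size: str) -> tuple[str, str]:
        h = hashlib.sha256()
        with open(path, "rb") as f:
            for chunk in iter(lambda: f.read(1 << 20), b""):
                h.update(chunk)
        return (h.hexdigest(), model_size)

    def get_or_transcribe(self, path, model_size, transcribe):
        """Transcribe once; later calls with the same file + model hit the cache."""
        key = self._key(path, model_size)
        if key not in self._store:
            self._store[key] = transcribe(path)
        return self._store[key]
```

Because the model size is part of the key, re-running with a larger model transcribes again, while repeating the same file + model pair never does.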

4. Search & Analyze

Once a file is in memory, every tool works on the stored transcript instantly — no re-transcription:
  • Keyword search: literal string matching with timestamps and context
  • Semantic search: find content by meaning using sentence-transformer embeddings. See Semantic Search
  • Chapters: auto-detect topic boundaries using embedding similarity
  • Speaker ID: identify who said what using pyannote diarization. See Speaker Diarization
  • Highlights: find the best moments automatically or by topic
  • Notes: formatted notes in multiple styles
  • Batch search: search dozens of files in parallel
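The simplest of these, keyword search, boils down to literal matching over the stored word-level timestamps. A toy sketch, with an assumed transcript shape of `(start_seconds, word)` pairs rather than Augent's actual schema:

```python
def keyword_search(words, query, context=2):
    """Literal match over (start_seconds, word) pairs.

    Returns one hit per matching word, with its timestamp and a snippet
    of `context` words on each side.
    """
    q = query.lower()
    hits = []
    for i, (start, text) in enumerate(words):
        if q in text.lower():
            lo, hi = max(0, i - context), min(len(words), i + context + 1)
            snippet = " ".join(t for _, t in words[lo:hi])
            hits.append({"time": start, "context": snippet})
    return hits
```

Semantic search and chapter detection replace the literal `in` test with cosine similarity over sentence-transformer embeddings, but the timestamped-hit shape stays the same.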

5. Export

Results can be exported as CSV, XLSX, SRT, VTT, or JSON. Video clips can be extracted around matches via clip_export. Notes are saved as .txt files formatted for Obsidian.
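SRT export is mostly timestamp formatting: each cue is an index, a `start --> end` line in `HH:MM:SS,mmm` form, and the text. A minimal sketch, assuming segments as `(start, end, text)` triples in seconds:

```python
def srt_timestamp(seconds: float) -> str:
    """Format seconds as HH:MM:SS,mmm per the SubRip convention."""
    ms = round(seconds * 1000)
    h, rem = divmod(ms, 3_600_000)
    m, rem = divmod(rem, 60_000)
    s, ms = divmod(rem, 1000)
    return f"{h:02d}:{m:02d}:{s:02d},{ms:03d}"

def to_srt(segments) -> str:
    """segments: iterable of (start, end, text), times in seconds."""
    blocks = []
    for i, (start, end, text) in enumerate(segments, 1):
        blocks.append(f"{i}\n{srt_timestamp(start)} --> {srt_timestamp(end)}\n{text}\n")
    return "\n".join(blocks)
```

VTT output differs only in the header and in using `.` instead of `,` as the millisecond separator, which is why both come almost for free once segments exist.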

First run vs. subsequent runs

The first time you process a file, Augent downloads it, transcribes it, and stores the result. This takes a few minutes depending on file length and model size. Every operation after that — search, chapters, notes, highlights — queries the stored transcript instantly. There is no re-transcription unless you change the model size.

What stays on your machine

Everything. Audio files, transcriptions, embeddings, diarization results, and exported files never leave your device. The only network activity is downloading audio from URLs you provide.