# Architecture

> How Augent processes audio, from URL to insight.

Augent is a pipeline. Each stage does one thing, stores the result in memory, and passes it forward.

```mermaid
graph TB
    A["URL / File"] --> B["Download"]
    B --> S["Separate Vocals"]
    S --> C["Transcribe"]
    C --> D["Memory + Tag"]

    D --> Search
    D --> Analyze
    D --> Export

    subgraph Search
        direction LR
        E1["Keyword"]
        E2["Semantic"]
        E3["Proximity"]
        E4["Batch"]
        E5["Cross-Memory"]
    end

    subgraph Analyze
        direction LR
        F1["Chapters"]
        F2["Speaker ID"]
        F3["Notes"]
        F4["Highlights"]
    end

    subgraph Export
        direction LR
        G1["Clip Export"]
        G2["TTS"]
    end

    style A fill:#0d2618,stroke:#00f060,color:#00f060,stroke-width:2px
    style B fill:#0d2618,stroke:#00f060,color:#00f060,stroke-width:2px
    style S fill:#0d2618,stroke:#00f060,color:#00f060,stroke-width:2px
    style C fill:#0d2618,stroke:#00f060,color:#00f060,stroke-width:2px
    style D fill:#0d2618,stroke:#00f060,color:#00f060,stroke-width:2px

    style E1 fill:#0a0a0a,stroke:#00f060,color:#00f060
    style E2 fill:#0a0a0a,stroke:#00f060,color:#00f060
    style E3 fill:#0a0a0a,stroke:#00f060,color:#00f060
    style E4 fill:#0a0a0a,stroke:#00f060,color:#00f060
    style E5 fill:#0a0a0a,stroke:#00f060,color:#00f060
    style F1 fill:#0a0a0a,stroke:#00f060,color:#00f060
    style F2 fill:#0a0a0a,stroke:#00f060,color:#00f060
    style F3 fill:#0a0a0a,stroke:#00f060,color:#00f060
    style F4 fill:#0a0a0a,stroke:#00f060,color:#00f060
    style G1 fill:#0a0a0a,stroke:#00f060,color:#00f060
    style G2 fill:#0a0a0a,stroke:#00f060,color:#00f060

    style Search fill:#0a0a0a,stroke:#00f060,color:#00f060,stroke-width:2px
    style Analyze fill:#0a0a0a,stroke:#00f060,color:#00f060,stroke-width:2px
    style Export fill:#0a0a0a,stroke:#00f060,color:#00f060,stroke-width:2px

    linkStyle default stroke:#00f060,stroke-width:1.5px
```

***

## Pipeline

<Steps>
  <Step title="Download">
    `download_audio` fetches audio from any URL that yt-dlp supports, using yt-dlp and aria2c (16 parallel connections, 4 concurrent fragments). Only the audio stream is downloaded: no video, no format conversion. The raw file lands in `~/Downloads` by default.
  </Step>

  <Step title="Separate (optional)">
    `separate_audio` runs Meta's Demucs v4 to isolate vocals from music, background noise, and other sounds. The clean vocal stem feeds into transcription, which gives more accurate results on noisy recordings. Separated stems are cached by file hash at `~/.augent/separated/`.
  </Step>

  <Step title="Transcribe">
    `transcribe_audio` runs the file through faster-whisper locally. The full transcript with word-level timestamps is stored in a SQLite memory at `~/.augent/memory/`. A human-readable `.md` copy is saved to `~/.augent/memory/transcriptions/`. Nothing leaves your machine.
  </Step>

  <Step title="Memory">
    Every transcription is keyed by file hash. If you search, analyze, or re-transcribe the same file, the stored result is returned instantly. Embeddings computed by `deep_search` and `chapters` are also stored and shared between tools.
  </Step>

  <Step title="Search & Analyze">
    Multiple tools operate on the stored transcript:

    * **search\_audio:** exact keyword matching with timestamps and context
    * **deep\_search:** semantic (meaning-based) search using embeddings
    * **search\_memory:** search across all stored transcriptions by keyword or meaning
    * **search\_proximity:** find where two keywords appear near each other
    * **batch\_search:** run keyword search across many files in parallel
    * **chapters:** auto-detect topic changes using embedding similarity
    * **identify\_speakers:** speaker diarization (who spoke when)
    * **highlights:** export MP4 clips of the best moments (auto or focused by topic)
    * **take\_notes:** generate formatted notes with AI
    * **tag:** organize transcriptions with broad topic categories
    * **clip\_export:** export video clips for specific time ranges
    * **text\_to\_speech:** convert text back to spoken audio
  </Step>
</Steps>
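
As an illustration of how a tool can work on the stored word-level timestamps, here is a minimal sketch of a proximity search. The `Word` type, the `find_proximity` function, and the `max_gap` parameter are hypothetical names chosen for this example, not Augent's API; the real `search_proximity` tool may measure distance differently (for example in words rather than seconds).

```python
from dataclasses import dataclass


@dataclass(frozen=True)
class Word:
    """One transcribed word and its start time in seconds (hypothetical shape)."""
    text: str
    start: float


def normalize(token):
    """Lowercase a token and drop surrounding punctuation before comparing."""
    return token.lower().strip(".,!?;:\"'")


def find_proximity(words, first, second, max_gap=5.0):
    """Return (start_first, start_second) pairs where the two keywords
    occur within `max_gap` seconds of each other, in either order."""
    first, second = normalize(first), normalize(second)
    hits_first = [w.start for w in words if normalize(w.text) == first]
    hits_second = [w.start for w in words if normalize(w.text) == second]
    return [(a, b) for a in hits_first for b in hits_second if abs(a - b) <= max_gap]


# A tiny hand-written transcript stands in for a stored transcription.
transcript = [
    Word("Whisper", 0.0), Word("runs", 0.4), Word("locally,", 0.8),
    Word("and", 1.2), Word("the", 1.4), Word("cache", 1.6),
    Word("makes", 30.0), Word("Whisper", 30.5), Word("instant.", 31.0),
]

matches = find_proximity(transcript, "whisper", "cache", max_gap=5.0)
print(matches)
```

Only the first "Whisper" is reported here, because the second one occurs almost 29 seconds after "cache". Comparing every pair of hits is quadratic in the number of occurrences, which is acceptable for a sketch but worth replacing with a sorted two-pointer scan on long transcripts.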

***

## Memory Layer

Stored data lives under `~/.augent/memory/`, except separated stems, which are kept in `~/.augent/separated/`:

| Data                | Storage                            | Shared between                                    |
| ------------------- | ---------------------------------- | ------------------------------------------------- |
| Transcriptions      | SQLite DB + `.md` files            | All tools                                         |
| Embeddings          | SQLite DB                          | `deep_search`, `chapters`, `search_memory`, `tag` |
| Tags                | SQLite DB                          | `tag`, Web UI Memory Explorer                     |
| Speaker diarization | SQLite DB                          | `identify_speakers`                               |
| Source URLs         | SQLite DB (by file hash)           | All search tools, Web UI                          |
| Separated stems     | WAV files (`~/.augent/separated/`) | `separate_audio`                                  |

Source URLs from any platform (YouTube, Twitter/X, TikTok, Instagram, SoundCloud, and 1000+ sites) are stored permanently by audio file hash when downloaded via `download_audio`, the CLI, or the Web UI. Any future operation on that file automatically inherits the source URL for linking back to the original content.
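
The hash-keyed design described above can be sketched with the standard library alone. This is an illustration under stated assumptions, not Augent's implementation: the table names, the column layout, the helper functions, and the choice of SHA-256 are all hypothetical, and the transcriber is a stand-in callable.

```python
import hashlib
import sqlite3
import tempfile
from pathlib import Path


def file_hash(path):
    """Hash the file contents in chunks so large audio files fit in memory.
    SHA-256 is an assumption; the source does not name the hash function."""
    digest = hashlib.sha256()
    with open(path, "rb") as fh:
        for chunk in iter(lambda: fh.read(1 << 20), b""):
            digest.update(chunk)
    return digest.hexdigest()


def open_memory(db_path=":memory:"):
    """Create two illustrative tables, both keyed by file hash (hypothetical schema)."""
    db = sqlite3.connect(db_path)
    db.execute("CREATE TABLE IF NOT EXISTS transcriptions (hash TEXT PRIMARY KEY, text TEXT)")
    db.execute("CREATE TABLE IF NOT EXISTS source_urls (hash TEXT PRIMARY KEY, url TEXT)")
    return db


def remember_source(db, path, url):
    """Record where a downloaded file came from."""
    db.execute("INSERT OR REPLACE INTO source_urls VALUES (?, ?)", (file_hash(path), url))


def get_transcript(db, path, transcribe):
    """Return (text, source_url, cache_hit); `transcribe` runs only on a cache miss."""
    key = file_hash(path)
    row = db.execute("SELECT text FROM transcriptions WHERE hash = ?", (key,)).fetchone()
    hit = row is not None
    if hit:
        text = row[0]
    else:
        text = transcribe(path)
        db.execute("INSERT INTO transcriptions VALUES (?, ?)", (key, text))
    url_row = db.execute("SELECT url FROM source_urls WHERE hash = ?", (key,)).fetchone()
    return text, (url_row[0] if url_row else None), hit


# Demo: placeholder bytes and lambda transcribers stand in for real audio and a real model.
with tempfile.TemporaryDirectory() as tmp:
    audio = Path(tmp) / "talk.m4a"
    audio.write_bytes(b"not real audio")
    db = open_memory()
    remember_source(db, audio, "https://example.com/talk")
    first = get_transcript(db, audio, lambda p: "hello world")
    second = get_transcript(db, audio, lambda p: "never called")
```

The second call never invokes its transcriber: the content hash matches, so the stored text and the source URL come back together. Keying on content rather than on the path means a renamed or moved file still hits the cache.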

Use `memory_stats` to see how much is stored, `list_memories` to browse entries, `clear_memory` to wipe everything, or the Web UI Memory Explorer to browse and delete individual entries.

***

## Key Design Decisions

* **Local-only:** no API calls, no cloud. Whisper runs on your CPU/GPU.
* **Audio-only downloads:** skipping video makes downloads up to 200x faster.
* **No format conversion:** files stay in their native format (`.webm`, `.m4a`, etc.) to avoid slow transcoding.
* **Cache everything:** the first run on a file is slow because it includes transcription; every later operation on that file reads from the cache and returns almost instantly.
* **Composable tools:** each tool does one thing. Claude chains them together based on your prompt.
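
To make the audio-only and no-conversion decisions concrete, the sketch below shows one way to express them as yt-dlp options. The function names and the exact aria2c arguments are assumptions for illustration; the source states only the figures (16 connections, 4 fragments), not how Augent passes them or how the two settings interact.

```python
import os


def audio_only_options(output_dir="~/Downloads"):
    """Build yt-dlp options for an audio-only download with no transcoding."""
    target = os.path.expanduser(output_dir)
    return {
        # Prefer the best audio-only stream; fall back to the best combined one.
        "format": "bestaudio/best",
        # No postprocessors are configured, so the file keeps its native
        # container (.webm, .m4a, ...) and nothing is re-encoded.
        "outtmpl": os.path.join(target, "%(title)s.%(ext)s"),
        # Hand transfers to aria2c with up to 16 connections per server.
        "external_downloader": {"default": "aria2c"},
        "external_downloader_args": {"aria2c": ["-x", "16", "-s", "16"]},
        # Fragment concurrency used by yt-dlp's native HLS/DASH downloader.
        "concurrent_fragment_downloads": 4,
    }


def download_audio_sketch(url, output_dir="~/Downloads"):
    """Download one URL; needs the yt-dlp package installed and aria2c on PATH."""
    # Imported lazily so the options can be inspected without yt-dlp installed.
    from yt_dlp import YoutubeDL

    with YoutubeDL(audio_only_options(output_dir)) as ydl:
        ydl.download([url])


options = audio_only_options()
print(options["format"], options["concurrent_fragment_downloads"])
```

Calling `download_audio_sketch("<some URL>")` would perform the actual download. Building the options in a separate function keeps them easy to inspect and adjust without touching the network.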
