
Model Sizes

| Model | Speed | Accuracy |
| --- | --- | --- |
| tiny | Fastest | Excellent (default) |
| base | Fast | Excellent |
| small | Medium | Superior |
| medium | Slow | Outstanding |
| large | Slowest | Maximum |
Use tiny for nearly everything. Only upgrade for heavy accents, poor audio quality, or lyrics.

Example

Request:
{
  "audio_path": "/Users/you/Downloads/podcast.webm",
  "model_size": "tiny"
}
Response:
{
  "text": "Full transcription text...",
  "language": "en",
  "duration": 1076.12,
  "duration_formatted": "17:56",
  "segments": [
    {
      "start": 0.0,
      "end": 4.8,
      "timestamp": "0:00",
      "text": "Welcome back to the show. Today we're diving into..."
    },
    {
      "start": 4.8,
      "end": 9.2,
      "timestamp": "0:04",
      "text": "something I've been thinking about for a long time."
    }
  ],
  "segment_count": 430,
  "cached": false,
  "model_used": "tiny"
}

Example: Transcribe a specific section

Use start and duration to transcribe only a portion of the file — no manual ffmpeg trimming needed.
{
  "audio_path": "/Users/you/Downloads/podcast.webm",
  "start": 600,
  "duration": 300
}
This transcribes 5 minutes starting at the 10-minute mark. Timestamps in the response are offset back to the original file position.
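The offset behaviour described above can be sketched roughly as follows. This is an illustration only, not Augent's actual implementation: `format_timestamp` and `offset_segments` are hypothetical helper names, and the M:SS format is inferred from the example response, where a duration of 1076.12 seconds appears as "17:56". How files longer than an hour are formatted is not shown in the source, so the sketch simply keeps counting minutes.

```python
def format_timestamp(seconds: float) -> str:
    """Format seconds as M:SS, dropping the fractional part (assumed format)."""
    minutes, secs = divmod(int(seconds), 60)
    return f"{minutes}:{secs:02d}"


def offset_segments(segments: list[dict], start: float) -> list[dict]:
    """Shift clip-relative segment times back to positions in the original file."""
    shifted = []
    for seg in segments:
        new_start = seg["start"] + start
        shifted.append({
            **seg,
            "start": new_start,
            "end": seg["end"] + start,
            "timestamp": format_timestamp(new_start),
        })
    return shifted


# A segment that begins at the start of a clip cut from the 10-minute mark
clip_segments = [{"start": 0.0, "end": 4.8, "text": "Welcome back to the show."}]
print(offset_segments(clip_segments, start=600)[0]["timestamp"])  # 10:00
```

With start set to 600, a segment that begins at 0.0 within the clip is reported at 10:00 in the original file.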

Example: Export to file

{
  "audio_path": "/Users/you/Downloads/podcast.webm",
  "output": "~/Desktop/transcription.xlsx"
}
When output is provided, the transcription is written to disk and output_path is added to the response. Use .xlsx for styled spreadsheets with bold headers, or .csv for plain data.
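For readers who want to post-process a response themselves, here is a minimal sketch of a CSV export. It assumes one row per segment with start, end, timestamp, and text columns; the column layout Augent actually writes is not specified here, and `write_segments_csv` is a hypothetical helper, not part of the tool.

```python
import csv
from pathlib import Path


def write_segments_csv(response: dict, output: str) -> str:
    """Write one CSV row per segment and return the resolved output path."""
    path = Path(output).expanduser()  # expand a leading "~", as in the example above
    fieldnames = ["start", "end", "timestamp", "text"]  # assumed column layout
    with path.open("w", newline="", encoding="utf-8") as f:
        # extrasaction="ignore" drops any segment keys not listed in fieldnames
        writer = csv.DictWriter(f, fieldnames=fieldnames, extrasaction="ignore")
        writer.writeheader()
        writer.writerows(response["segments"])
    return str(path)
```

The function returns the resolved path, mirroring the way output_path is added to the response.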

Parameters

| Parameter | Required | Default | Description |
| --- | --- | --- | --- |
| audio_path | Yes | | Path to the audio file |
| model_size | No | tiny | Whisper model size |
| start | No | 0 | Start transcription at this many seconds into the audio |
| duration | No | full file | Only transcribe this many seconds of audio |
| output | No | | File path to save transcription (.csv or .xlsx) |
| translated_text | No | | English translation to store alongside the original. Used after translating a non-English transcription. |

Multilingual

Augent transcribes audio in its original language — Chinese, French, Spanish, Japanese, etc. Translation to English is handled by Claude, which produces far better results than any local translation model. When the transcription language is not English, the response includes:
{
  "language": "zh",
  "translation_available": true,
  "translation_hint": "This audio is in Chinese. To store an English translation..."
}
Translation workflow:
  1. transcribe_audio returns the original-language transcription with translation_available: true
  2. Claude translates the text
  3. Claude calls transcribe_audio again with the same audio_path and translated_text containing the English translation
  4. A sibling (eng) markdown file is created in memory alongside the original
Both versions appear in the Web UI Memory Explorer and are searchable via search_memory.
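Step 3 of the workflow above amounts to building a second set of transcribe_audio arguments. A minimal sketch, assuming the response fields shown earlier; `build_translation_request` is a hypothetical helper name, and the translation itself is produced by Claude, not by this code.

```python
from typing import Optional


def build_translation_request(
    audio_path: str, response: dict, translated_text: str
) -> Optional[dict]:
    """Return arguments for the follow-up transcribe_audio call, or None if not needed."""
    # The response only carries translation_available for non-English audio
    if not response.get("translation_available"):
        return None
    return {"audio_path": audio_path, "translated_text": translated_text}


first_response = {"language": "zh", "translation_available": True}
follow_up = build_translation_request(
    "/Users/you/Downloads/podcast.webm", first_response, "English translation..."
)
print(follow_up)
```

The follow-up call reuses the same audio_path so the translation is stored alongside the original transcription.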

Memory

  • Transcriptions are stored by file content hash + model size
  • Same file, same model = instant memory hit
  • Same file, different model = new transcription
  • Modified file = new transcription (hash changes)
  • A markdown file is also saved to ~/.augent/memory/transcriptions/
  • Translated transcriptions get a sibling (eng) file (e.g., My Video.md + My Video (eng).md)
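The lookup rules above can be illustrated with a short sketch. The hash algorithm and key format are assumptions (SHA-256 and a "digest:model" string are used here purely for illustration), and `memory_key` and `sibling_eng_name` are hypothetical helpers, not Augent functions.

```python
import hashlib
from pathlib import Path


def memory_key(audio_bytes: bytes, model_size: str = "tiny") -> str:
    """Combine a content hash with the model size (assumed key format)."""
    digest = hashlib.sha256(audio_bytes).hexdigest()  # assumed hash algorithm
    return f"{digest}:{model_size}"


def sibling_eng_name(markdown_name: str) -> str:
    """Derive the (eng) sibling file name, e.g. "My Video.md" -> "My Video (eng).md"."""
    p = Path(markdown_name)
    return p.with_name(f"{p.stem} (eng){p.suffix}").name


audio = b"example audio bytes"
print(memory_key(audio, "tiny") == memory_key(audio, "tiny"))   # same file, same model
print(memory_key(audio, "tiny") == memory_key(audio, "base"))   # same file, different model
print(sibling_eng_name("My Video.md"))
```

Because the key depends on file content and not on the path, renaming or moving a file would still produce a memory hit under this scheme, while editing the file would not.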