transcribe_audio

Model Sizes

Model	Speed	Accuracy
tiny (default)	Fastest	Excellent
base	Fast	Excellent
small	Medium	Superior
medium	Slow	Outstanding
large	Slowest	Maximum

Use tiny for nearly everything. Switch to a larger model only for heavy accents, poor audio quality, or song lyrics.

Example

Request:

{
  "audio_path": "/Users/you/Downloads/podcast.webm",
  "model_size": "tiny"
}

Response:

{
  "text": "Full transcription text...",
  "language": "en",
  "duration": 1076.12,
  "duration_formatted": "17:56",
  "segments": [
    {
      "start": 0.0,
      "end": 4.8,
      "timestamp": "0:00",
      "text": "Welcome back to the show. Today we're diving into..."
    },
    {
      "start": 4.8,
      "end": 9.2,
      "timestamp": "0:04",
      "text": "something I've been thinking about for a long time."
    }
  ],
  "segment_count": 430,
  "cached": false,
  "model_used": "tiny"
}
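The duration_formatted and timestamp fields appear to be the numeric seconds rendered as minutes and seconds with the fraction dropped (1076.12 becomes "17:56", 4.8 becomes "0:04"). The minimal Python sketch below reproduces them under two assumptions: the fraction is truncated rather than rounded, and audio of an hour or longer uses an H:MM:SS form, which the examples here never show. format_timestamp and segment_lines are hypothetical helpers, not part of Augent.

```python
import math


def format_timestamp(seconds: float) -> str:
    """Render seconds as M:SS, dropping the fractional part.

    Assumption: an hour or more is rendered as H:MM:SS.
    """
    total = math.floor(seconds)
    hours, rem = divmod(total, 3600)
    minutes, secs = divmod(rem, 60)
    if hours:
        return f"{hours}:{minutes:02d}:{secs:02d}"
    return f"{minutes}:{secs:02d}"


def segment_lines(response: dict) -> list[str]:
    """Turn the response's segments into '[M:SS] text' lines for display."""
    return [f"[{format_timestamp(seg['start'])}] {seg['text']}" for seg in response["segments"]]


if __name__ == "__main__":
    sample = {
        "duration": 1076.12,
        "segments": [{"start": 4.8, "end": 9.2, "text": "something I've been thinking about for a long time."}],
    }
    print(format_timestamp(sample["duration"]))
    for line in segment_lines(sample):
        print(line)
```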

Example: Transcribe a specific section

Use start and duration (both in seconds) to transcribe only a portion of the file. There is no need to trim the audio manually with ffmpeg first.

{
  "audio_path": "/Users/you/Downloads/podcast.webm",
  "start": 600,
  "duration": 300
}

This transcribes 5 minutes (300 seconds) starting at the 10-minute mark (600 seconds). Timestamps in the response refer to positions in the original file, not to the start of the trimmed section.
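Conceptually, a partial transcription covers the window from start to start + duration and then adds start to every segment time so it lines up with the original file. The minimal sketch below shows that bookkeeping, assuming segments are produced relative to the trimmed clip. trim_window and shift_segments are hypothetical names, and the tool may implement the cut differently internally.

```python
from typing import Optional


def trim_window(start: float = 0.0, duration: Optional[float] = None,
                total: Optional[float] = None) -> tuple:
    """Return (begin, end) in seconds.

    When duration is omitted, end is the file length (or None if unknown).
    """
    if duration is None:
        return start, total
    end = start + duration
    if total is not None:
        end = min(end, total)  # never run past the end of the audio
    return start, end


def shift_segments(segments: list, start: float) -> list:
    """Move clip-relative segment times back to original-file positions."""
    return [{**seg, "start": seg["start"] + start, "end": seg["end"] + start} for seg in segments]
```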

Example: Export to file

{
  "audio_path": "/Users/you/Downloads/podcast.webm",
  "output": "~/Desktop/transcription.xlsx"
}

When output is provided, the transcription is written to disk and output_path is added to the response. Use .xlsx for styled spreadsheets with bold headers, or .csv for plain data.
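For the .csv case, the export amounts to one row per segment. The minimal sketch below uses only Python's standard csv module. The column set (timestamp, start, end, text) is an assumption, not Augent's documented layout. The styled .xlsx output would need a spreadsheet library such as openpyxl, which is not shown here. export_csv is a hypothetical helper.

```python
import csv
from pathlib import Path


def export_csv(response: dict, output: str) -> Path:
    """Write one row per segment to a CSV file and return the resolved path."""
    path = Path(output).expanduser()  # allow "~/Desktop/..." style paths
    path.parent.mkdir(parents=True, exist_ok=True)
    with path.open("w", newline="", encoding="utf-8") as f:
        writer = csv.writer(f)
        writer.writerow(["timestamp", "start", "end", "text"])  # assumed columns
        for seg in response["segments"]:
            writer.writerow([seg["timestamp"], seg["start"], seg["end"], seg["text"]])
    return path
```

The returned path plays the same role as output_path in the tool's response.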

Parameters

Parameter	Required	Default	Description
`audio_path`	Yes	—	Path to the audio file
`model_size`	No	`tiny`	Whisper model size: `tiny`, `base`, `small`, `medium`, or `large`
`start`	No	`0`	Offset, in seconds, at which to start transcribing
`duration`	No	full file	Number of seconds of audio to transcribe, counted from `start`
`output`	No	—	File path to save transcription (`.csv` or `.xlsx`)
`translated_text`	No	—	English translation to store alongside the original-language transcription. Pass it on a second call after translating a non-English transcription.
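A client calling this tool could assemble the request from the parameter table and send only the fields that differ from their defaults. The minimal sketch below does that. build_request is a hypothetical helper. Its checks on start, duration, and the output suffix are assumptions inferred from the table, not validation rules the tool is documented to enforce.

```python
from pathlib import Path
from typing import Optional

MODEL_SIZES = ("tiny", "base", "small", "medium", "large")
EXPORT_SUFFIXES = (".csv", ".xlsx")


def build_request(audio_path: str, model_size: str = "tiny", start: float = 0,
                  duration: Optional[float] = None, output: Optional[str] = None,
                  translated_text: Optional[str] = None) -> dict:
    """Build a transcribe_audio request, omitting parameters left at their defaults."""
    if model_size not in MODEL_SIZES:
        raise ValueError(f"unknown model_size: {model_size!r}")
    if start < 0:
        raise ValueError("start must be >= 0 seconds")
    if duration is not None and duration <= 0:
        raise ValueError("duration must be > 0 seconds")
    if output is not None and Path(output).suffix.lower() not in EXPORT_SUFFIXES:
        raise ValueError("output must end in .csv or .xlsx")

    request = {"audio_path": audio_path}
    if model_size != "tiny":
        request["model_size"] = model_size
    if start:
        request["start"] = start
    if duration is not None:
        request["duration"] = duration
    if output is not None:
        request["output"] = output
    if translated_text is not None:
        request["translated_text"] = translated_text
    return request
```

For example, build_request(path, start=600, duration=300) yields the same fields as the "Transcribe a specific section" request.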

Multilingual

Augent transcribes audio in its original language (Chinese, French, Spanish, Japanese, etc.). Translation to English is handled by Claude rather than a local translation model, because Claude generally produces much better translations. When the transcription language is not English, the response includes:

{
  "language": "zh",
  "translation_available": true,
  "translation_hint": "This audio is in Chinese. To store an English translation..."
}

Translation workflow:

1. transcribe_audio returns the original-language transcription with translation_available: true.
2. Claude translates the text.
3. Claude calls transcribe_audio again with the same audio_path and with translated_text set to the English translation.
4. A sibling (eng) markdown file is created in memory alongside the original.

Both versions appear in the Web UI Memory Explorer and are searchable via search_memory.
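The second call in this workflow reuses the first request's audio_path and adds translated_text. The minimal sketch below decides whether that call is needed. follow_up_request is a hypothetical helper, and the English text is assumed to come from Claude. Carrying model_size over, so the translation attaches to the same memory entry, is an assumption rather than a documented requirement.

```python
from typing import Optional


def follow_up_request(first_request: dict, response: dict, english_text: str) -> Optional[dict]:
    """Return the second transcribe_audio request, or None if no translation is offered."""
    if not response.get("translation_available"):
        return None  # English audio, or translation not offered: nothing to store
    request = {
        "audio_path": first_request["audio_path"],  # must match the original call
        "translated_text": english_text,
    }
    if "model_size" in first_request:
        # Assumption: reuse the same model so the memory key (hash + model) matches.
        request["model_size"] = first_request["model_size"]
    return request
```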

Memory

Transcriptions are keyed by file content hash and model size
Same file, same model = instant memory hit
Same file, different model = new transcription
Modified file = new transcription (hash changes)
A markdown file is also saved to ~/.augent/memory/transcriptions/
Translated transcriptions get a sibling (eng) file (e.g., My Video.md + My Video (eng).md)
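The lookup rules above follow from keying each entry on the file's bytes plus the model size. The minimal sketch below assumes SHA-256 as the content hash, since the documentation does not name the algorithm. memory_key and english_sibling are hypothetical helpers. The second one reproduces the (eng) file naming from the My Video example.

```python
import hashlib
from pathlib import Path


def memory_key(audio_path: str, model_size: str) -> str:
    """Key = content hash of the file plus model size (SHA-256 is an assumption)."""
    digest = hashlib.sha256()
    with open(audio_path, "rb") as f:
        for chunk in iter(lambda: f.read(1 << 20), b""):  # stream 1 MiB at a time
            digest.update(chunk)
    return f"{digest.hexdigest()}:{model_size}"


def english_sibling(md_name: str) -> str:
    """'My Video.md' -> 'My Video (eng).md'."""
    p = Path(md_name)
    return f"{p.stem} (eng){p.suffix}"
```

Renaming or moving the file keeps the same key, while editing its contents or switching models produces a new one.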
