Model Sizes
| Model | Speed | Accuracy |
|---|---|---|
| tiny | Fastest | Excellent (default) |
| base | Fast | Excellent |
| small | Medium | Superior |
| medium | Slow | Outstanding |
| large | Slowest | Maximum |
Use tiny for nearly everything. Only upgrade for heavy accents, poor audio quality, or lyrics.
Example
Request:
{
  "audio_path": "/Users/you/Downloads/podcast.webm",
  "model_size": "tiny"
}
Response:
{
  "text": "Full transcription text...",
  "language": "en",
  "duration": 1076.12,
  "duration_formatted": "17:56",
  "segments": [
    {
      "start": 0.0,
      "end": 4.8,
      "timestamp": "0:00",
      "text": "Welcome back to the show. Today we're diving into..."
    },
    {
      "start": 4.8,
      "end": 9.2,
      "timestamp": "0:04",
      "text": "something I've been thinking about for a long time."
    }
  ],
  "segment_count": 430,
  "cached": false,
  "model_used": "tiny"
}
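The `timestamp` and `duration_formatted` values in the response above appear to be second counts rendered as minutes and seconds. The following is a minimal Python sketch of that formatting, assuming fractional seconds are truncated (consistent with `4.8` becoming `0:04` and `1076.12` becoming `17:56`). `format_timestamp` is a hypothetical helper for illustration, not part of Augent, and how Augent renders durations of an hour or more is not shown in this document.

```python
def format_timestamp(seconds: float) -> str:
    """Render a second count as M:SS, dropping fractional seconds.

    Assumption: values are truncated rather than rounded, which matches
    the sample response (4.8 -> "0:04", 1076.12 -> "17:56").
    """
    total = int(seconds)               # truncate the fractional part
    minutes, secs = divmod(total, 60)  # split into whole minutes and leftover seconds
    return f"{minutes}:{secs:02d}"


print(format_timestamp(1076.12))
print(format_timestamp(4.8))
```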
Example: Transcribe a specific section
Use start and duration to transcribe only a portion of the file — no manual ffmpeg trimming needed.
{
  "audio_path": "/Users/you/Downloads/podcast.webm",
  "start": 600,
  "duration": 300
}
This transcribes 5 minutes starting at the 10-minute mark. Timestamps in the response are offset back to the original file position.
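To illustrate what "offset back to the original file position" means, here is a small Python sketch that shifts clip-relative segment times by the `start` value. It only models the arithmetic. `offset_segments` is a hypothetical name, and whether Augent applies the offset exactly this way internally is an assumption.

```python
from typing import Any, Dict, List


def offset_segments(segments: List[Dict[str, Any]], start: float) -> List[Dict[str, Any]]:
    """Shift clip-relative segment times to positions in the original file.

    Hypothetical helper: a segment that begins 0.0 s into a clip cut at
    start=600 is reported as beginning 600 s into the full recording.
    """
    shifted = []
    for seg in segments:
        moved = dict(seg)                    # copy so the input is left untouched
        moved["start"] = seg["start"] + start
        moved["end"] = seg["end"] + start
        shifted.append(moved)
    return shifted


clip = [{"start": 0.0, "end": 4.8, "text": "first segment of the clip"}]
for seg in offset_segments(clip, start=600):
    print(seg["start"], seg["end"], seg["text"])
```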
Example: Export to file
{
  "audio_path": "/Users/you/Downloads/podcast.webm",
  "output": "~/Desktop/transcription.xlsx"
}
When output is provided, the transcription is written to disk and output_path is added to the response. Use .xlsx for styled spreadsheets with bold headers, or .csv for plain data.
Parameters
| Parameter | Required | Default | Description |
|---|---|---|---|
| audio_path | Yes | — | Path to the audio file |
| model_size | No | tiny | Whisper model size |
| start | No | 0 | Start transcription at this many seconds into the audio |
| duration | No | full file | Only transcribe this many seconds of audio |
| output | No | — | File path to save transcription (.csv or .xlsx) |
| translated_text | No | — | English translation to store alongside the original. Used after translating a non-English transcription. |
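As a rough illustration of how the parameters above combine, the Python sketch below assembles a request dictionary, applies the documented defaults, and leaves out optional fields that were not supplied. `build_request` and its validation rules are assumptions made for this example. Augent itself may validate differently.

```python
from typing import Any, Dict, Optional


def build_request(
    audio_path: str,
    model_size: str = "tiny",
    start: float = 0,
    duration: Optional[float] = None,
    output: Optional[str] = None,
    translated_text: Optional[str] = None,
) -> Dict[str, Any]:
    """Assemble a transcribe_audio request (hypothetical helper, not Augent's API)."""
    if not audio_path:
        raise ValueError("audio_path is required")
    # The table lists only .csv and .xlsx as output formats.
    if output is not None and not output.lower().endswith((".csv", ".xlsx")):
        raise ValueError("output must end in .csv or .xlsx")

    request: Dict[str, Any] = {
        "audio_path": audio_path,
        "model_size": model_size,
        "start": start,
    }
    optional = {
        "duration": duration,
        "output": output,
        "translated_text": translated_text,
    }
    # Omit optional fields that were not supplied.
    request.update({key: value for key, value in optional.items() if value is not None})
    return request


print(build_request("/Users/you/Downloads/podcast.webm", start=600, duration=300))
```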
Multilingual
Augent transcribes audio in its original language (Chinese, French, Spanish, Japanese, and so on). Translation to English is handled by Claude, which typically produces better results than a local translation model.
When the transcription language is not English, the response includes:
{
  "language": "zh",
  "translation_available": true,
  "translation_hint": "This audio is in Chinese. To store an English translation..."
}
Translation workflow:
- transcribe_audio returns the original-language transcription with translation_available: true
- Claude translates the text
- Claude calls transcribe_audio again with the same audio_path and translated_text containing the English translation
- A sibling (eng) markdown file is created in memory alongside the original
Both versions appear in the Web UI Memory Explorer and are searchable via search_memory.
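The translation workflow can be sketched from the caller's side as follows. This Python example is illustrative only: `translation_followup` is a hypothetical helper, the `translate` callable stands in for the step Claude performs, and the sketch assumes the first response carries the transcription under `text` as in the earlier example response.

```python
from typing import Any, Callable, Dict, Optional


def translation_followup(
    audio_path: str,
    response: Dict[str, Any],
    translate: Callable[[str], str],
) -> Optional[Dict[str, Any]]:
    """Build the second transcribe_audio request, or return None if none is needed.

    `translate` is a stand-in for the translation step; it is not an Augent function.
    """
    if not response.get("translation_available"):
        return None  # English audio: nothing further to store
    english = translate(response["text"])
    # Same audio_path as the first call, plus the English translation.
    return {"audio_path": audio_path, "translated_text": english}


first_response = {"text": "你好", "language": "zh", "translation_available": True}
followup = translation_followup(
    "/Users/you/Downloads/podcast.webm",
    first_response,
    translate=lambda text: "Hello",  # placeholder translator for the example
)
print(followup)
```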
Memory
- Transcriptions are stored by file content hash + model size
- Same file, same model = instant memory hit
- Same file, different model = new transcription
- Modified file = new transcription (hash changes)
- A markdown file is also saved to ~/.augent/memory/transcriptions/
- Translated transcriptions get a sibling (eng) file (e.g., My Video.md + My Video (eng).md)
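The memory rules above follow naturally from keying on file content plus model size. The Python sketch below shows one way such a key could be derived, along with the sibling `(eng)` filename. The choice of SHA-256, the key layout, and the helper names are all assumptions for illustration, since this document does not specify which hash or key format Augent uses.

```python
import hashlib
from pathlib import Path


def memory_key(data: bytes, model_size: str) -> str:
    """Derive a lookup key from file content and model size.

    Assumption: SHA-256 and the "hash:model" layout are illustrative choices only.
    """
    digest = hashlib.sha256(data).hexdigest()
    return f"{digest}:{model_size}"


def memory_key_for_file(audio_path: str, model_size: str) -> str:
    """Convenience wrapper that hashes the file's bytes (hypothetical helper)."""
    return memory_key(Path(audio_path).read_bytes(), model_size)


def translated_sibling(filename: str) -> str:
    """Return the sibling name for a translation, e.g. 'My Video.md' -> 'My Video (eng).md'."""
    path = Path(filename)
    return f"{path.stem} (eng){path.suffix}"


audio = b"example audio bytes"
print(memory_key(audio, "tiny") == memory_key(audio, "tiny"))         # same file, same model
print(memory_key(audio, "tiny") == memory_key(audio, "base"))         # same file, different model
print(memory_key(audio, "tiny") == memory_key(audio + b"!", "tiny"))  # modified file
print(translated_sibling("My Video.md"))
```

Because the key depends on the bytes rather than the path, renaming or moving a file would still hit the same entry under this scheme, while any change to the content produces a new key.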