Someone explains their entire workflow in a 40-minute video. How they generate leads, how they close deals, how they automate fulfillment. Every step, every tool, every sequence. It’s all there in the audio. Augent transcribes it. Builds the workflow files. Maps the sequencing, the decision points, the tool stack, the timing. Structures every piece of the puzzle into something your MCP client can act on. But some steps are inherently visual. “Click here.” “You’ll see this dashboard.” “Drag this into that column.” The audio describes it, but the context is incomplete without seeing it. Augent gives agents ears and eyes.

The workflow

A single video explanation becomes a fully structured, replicable system:
  1. Transcribe the audio, extracting every word with timestamps
  2. Build workflow files from the explanation, the sequencing, the steps, the tools mentioned, the order of operations
  3. Detect visual gaps where the speaker describes something that needs to be seen (“this screen”, “click here”, “the layout looks like”)
  4. Extract screenshots at those moments, picking the sharpest frame with the most visual information
  5. Assemble the package with transcription + structured workflow + visual context, embedded inline in the notes with Obsidian wikilinks
The result: a spoken explanation becomes a complete blueprint. Not notes, not a summary, but a buildable system with every step documented and every visual reference captured.
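To make the five steps concrete, here is a minimal, self-contained sketch of the shape of that pipeline. Everything in it is illustrative: the function names, the cue list, and the stub transcript are ours, not Augent's API.

```python
# Hypothetical sketch of the five steps above. A real transcription
# returns (timestamp, text) segments; we stub one in so the flow runs.

def transcribe(url):
    # Step 1: audio -> timestamped segments (stand-in data).
    return [(12.0, "first, open the dashboard"),
            (48.5, "click here to connect Gmail"),
            (95.0, "then the automation runs on its own")]

# Step 3: wording that implies something must be seen on screen.
VISUAL_CUES = ("click here", "this screen", "looks like", "drag")

def detect_visual_gaps(segments):
    return [t for t, text in segments
            if any(cue in text for cue in VISUAL_CUES)]

def build_package(segments, gap_timestamps):
    # Steps 2, 4, 5: ordered workflow notes with inline Obsidian embeds
    # at each visual gap.
    lines = []
    for t, text in segments:
        lines.append(f"- [{t:.0f}s] {text}")
        if t in gap_timestamps:
            lines.append(f"  ![[frame_{int(t)}.png]]")
    return "\n".join(lines)

segments = transcribe("https://youtube.com/...")
gaps = detect_visual_gaps(segments)
notes = build_package(segments, gaps)
```

Only the "click here" segment trips a visual cue, so only that step gets a frame embedded beneath it.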

How it works

The visual tool analyzes the transcript to identify moments where visual context is needed, then extracts frames only at those timestamps. It has four modes.
Query mode (primary): describe what you need visual context for. The tool searches the transcript semantically and extracts frames at matching moments.
visual(url="https://youtube.com/...", query="connecting Gmail to the agent")
One-shot with notes: pass visual directly to take_notes and get notes plus screenshots in a single call.
take_notes(url="https://youtube.com/...", visual="the workflow setup steps")
Auto mode: the tool autonomously detects visual moments using pattern matching and semantic scoring. Identifies UI actions, spatial references, demonstration language, and screen recordings. Suppresses B-roll, talking heads, and intros.
visual(url="https://youtube.com/...", auto=true)
Assist mode: when the video is a talking head or podcast where the speaker describes a workflow but the video doesn’t show it, assist mode flags the visual gaps and tells you exactly where to provide your own screenshots.
visual(url="https://youtube.com/...", assist=true)
Assist mode returns time ranges like “8:12-9:15: UI interaction or navigation being described” along with the transcript excerpt, so you know exactly what screenshot to take and where it belongs. Frames are saved directly into your Obsidian vault. The notes file includes ![[frame.png]] embeds at the relevant sections. Open the file in Obsidian and every screenshot renders inline, right where it belongs.
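If you want to act on assist-mode output programmatically, the time-range lines are easy to parse. A small sketch, assuming the exact line format shown above (the `parse_gap` function and dict keys are ours, not part of Augent):

```python
import re

# Matches lines like "8:12-9:15: UI interaction or navigation being described".
RANGE_RE = re.compile(r"(\d+):(\d{2})-(\d+):(\d{2}):\s*(.+)")

def parse_gap(line):
    m = RANGE_RE.match(line)
    if not m:
        return None
    m1, s1, m2, s2, label = m.groups()
    return {
        "start_s": int(m1) * 60 + int(s1),  # 8:12 -> 492 seconds
        "end_s": int(m2) * 60 + int(s2),    # 9:15 -> 555 seconds
        "label": label,
    }

gap = parse_gap("8:12-9:15: UI interaction or navigation being described")
# gap == {"start_s": 492, "end_s": 555,
#         "label": "UI interaction or navigation being described"}
```

With the gaps as structured data, you can seek your own recording to those seconds and drop the screenshots into the flagged sections.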

The pipeline

| Step | Tool | What it does |
| --- | --- | --- |
| Download | download_audio | Pulls audio from any URL at maximum speed |
| Transcribe | transcribe_audio | Full transcription with per-segment timestamps |
| Structure | take_notes | Builds formatted workflow notes with sections, steps, and sequencing |
| Visual context | visual | Extracts screenshots at moments where audio alone isn’t enough |
| Find moments | highlights | Identifies the most important moments by content density |
| Find topics | deep_search | Searches by meaning to find specific workflow steps |
| Find context | search_proximity | Finds where two concepts appear near each other |
| Extract clips | clip_export | Exports video segments around specific timestamps |
| Detect chapters | chapters | Auto-detects topic boundaries and transitions |
| Identify speakers | identify_speakers | Labels who is explaining which part of the workflow |
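Composed, the first four tools form the core path from URL to annotated notes. A hedged sketch of what a client-side sequence might look like: the tool names and the url/visual/auto parameters come from this page, but the `call()` dispatcher is a stand-in for however your MCP client actually invokes a tool.

```python
def call(tool, **params):
    # Stand-in dispatcher so the sketch runs; a real MCP client would
    # send a tool request here and return the tool's response.
    return {"tool": tool, "params": params}

url = "https://youtube.com/..."

audio = call("download_audio", url=url)                         # pull audio
transcript = call("transcribe_audio", url=url)                  # timestamped text
notes = call("take_notes", url=url,
             visual="the workflow setup steps")                 # notes + screenshots
frames = call("visual", url=url, auto=True)                     # extra visual passes
```

The remaining tools (highlights, deep_search, search_proximity, clip_export, chapters, identify_speakers) slot in the same way once the transcript exists.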

What makes the visual tool accurate

The tool doesn’t extract frames blindly. It analyzes the transcript first.

Pattern matching catches explicit visual language: “click on the settings icon in the top right”, “as you can see on screen”, “I’m dragging this block onto the canvas.” These trigger high-confidence frame extraction.

Semantic scoring catches subtler cues. Each transcript segment is compared against visual and non-visual anchor concepts using sentence-transformer embeddings. Segments that describe spatial actions, UI interactions, or demonstrations score high. Abstract discussion, opinions, and stories score low.

Smart frame selection extracts three candidate frames per timestamp and picks the one with the most visual information, measured by edge density. A frame showing a UI with text and buttons beats a frame caught mid-transition.

Duplicate detection removes near-identical frames using perceptual hashing. If two screenshots show the same screen with a minor cursor shift, only the higher-scored one survives.

Intro suppression recognizes that the first 30 seconds of most videos are B-roll with voiceover. Scores in that window are dampened unless confidence is very high.
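Two of these mechanisms, edge-density selection and perceptual-hash deduplication, are simple enough to illustrate directly. A toy sketch on synthetic grayscale arrays: the specific metrics below (mean gradient magnitude, 8×8 average hash) are common stand-ins, not necessarily what Augent computes internally.

```python
import numpy as np

def edge_density(frame):
    # Mean gradient magnitude: high for a UI dense with text and
    # buttons, low for a blurry mid-transition frame.
    gy, gx = np.gradient(frame.astype(float))
    return float(np.hypot(gx, gy).mean())

def average_hash(frame, size=8):
    # Block-average down to size x size, threshold at the mean: a
    # 64-bit perceptual fingerprint that ignores tiny cursor shifts.
    h, w = frame.shape
    small = frame[:h - h % size, :w - w % size]
    small = small.reshape(size, h // size, size, w // size).mean(axis=(1, 3))
    return (small > small.mean()).flatten()

def pick_best(candidates):
    # "Three candidate frames per timestamp": keep the sharpest one.
    return max(candidates, key=edge_density)

def dedupe(frames, max_diff=5):
    # Drop frames whose hash differs from an already-kept frame by
    # only a few bits (near-identical screens).
    kept = []
    for f in frames:
        h = average_hash(f)
        if all(np.count_nonzero(h != average_hash(k)) > max_diff
               for k in kept):
            kept.append(f)
    return kept
```

A uniform frame has zero edge density, so `pick_best` discards it in favor of anything with visible structure, and two copies of the same screen collapse to one after `dedupe`.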

Why this changes the scale

Without visual context, Augent builds workflows from audio alone. That covers 80-90% of most explanations. People naturally describe what they’re doing as they do it. With visual context, it covers 100%. Your MCP client doesn’t have to guess what “this screen” looks like. It doesn’t have to infer which button the speaker clicked. It has the screenshots right there, timestamped and tied to the exact moments in the transcription. A business owner explains how they run their operation. Augent builds the entire system. Every step, every screen, every sequence. Ready for your MCP client to execute.