Extract visual context from video at the moments that matter. The tool analyzes the transcript to identify when visual context is needed, then either extracts frames at those timestamps or flags visual gaps for you to fill manually. Skips talking heads, B-roll, and audio-sufficient content. Four modes:
  • Query mode (default): describe what you need visual context for. The tool searches the transcript semantically and extracts frames at matching moments.
  • Auto mode: autonomously detects visual moments from the transcript using pattern matching and semantic scoring.
  • Manual mode: extract frames at specific timestamps you provide.
  • Assist mode: analyzes the transcript for visual gaps and returns time ranges where you should provide your own screenshots. Ideal for talking-head videos or podcasts where the speaker describes a workflow but the video doesn’t show it.
For query, auto, and manual modes, frames are saved directly into your Obsidian vault with ![[filename.png]] wikilink embeds. Assist mode returns structured gap data only; no frames are extracted.

Example: Query mode

Get screenshots of the Gmail configuration steps from a tutorial video. Request:
{
  "url": "https://youtube.com/watch?v=example",
  "query": "connecting Gmail to the agent and configuring email settings"
}
Response:
{
  "mode": "query",
  "query": "connecting Gmail to the agent and configuring email settings",
  "frame_count": 6,
  "analyzed_segments": 236,
  "video_duration": 941.0,
  "video_duration_formatted": "15:41",
  "frames": [
    {
      "filename": "chatgpt_agent_builder_07m19s.png",
      "timestamp": 439.7,
      "timestamp_formatted": "7:19",
      "score": 0.54,
      "reason": "query match",
      "transcript": "I go to the left sidebar, tool section, then MCP, I'm selecting Gmail..."
    }
  ]
}

Example: One-shot with take_notes

Get notes and visual context in a single call.
{
  "url": "https://youtube.com/watch?v=example",
  "style": "highlight",
  "visual": "the workflow setup steps"
}
The notes file includes ![[frame.png]] embeds inline at the relevant sections. Open in Obsidian and every screenshot renders where it belongs.

Example: Auto mode

Let the tool decide which moments need visual context.
{
  "url": "https://youtube.com/watch?v=example",
  "auto": true,
  "max_frames": 15
}
Auto mode scores each transcript segment for visual necessity using pattern matching (UI keywords, spatial references, demonstration language) and semantic similarity (embedding comparison against visual and non-visual anchor concepts). Intro B-roll is suppressed automatically.
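The scoring idea can be sketched in a few lines. This is a minimal illustration, not the tool's actual implementation: the keyword lists, the 0.5/0.5 blend, and the saturation threshold are all assumptions, and the semantic similarity is passed in as a precomputed value rather than derived from real embeddings.

```python
import re

# Illustrative visual-cue patterns; the tool's real lists and weights differ.
UI_PATTERNS = [r"\bclick\b", r"\bbutton\b", r"\bsidebar\b", r"\bmenu\b", r"\btab\b"]
SPATIAL_PATTERNS = [r"\btop right\b", r"\bleft\b", r"\bcorner\b", r"\bhere\b"]

def pattern_score(segment: str) -> float:
    """Score a transcript segment by how many visual-cue patterns it contains."""
    patterns = UI_PATTERNS + SPATIAL_PATTERNS
    hits = sum(1 for p in patterns if re.search(p, segment.lower()))
    return min(hits / 3, 1.0)  # saturate once a few cues are present

def visual_score(segment: str, semantic: float) -> float:
    """Blend pattern evidence with a semantic similarity score in [0, 1]."""
    return 0.5 * pattern_score(segment) + 0.5 * semantic

seg = "I go to the left sidebar, tool section, then click the MCP tab"
print(round(visual_score(seg, semantic=0.6), 2))  # → 0.8
```

Segments scoring above a threshold become frame candidates; in the real tool, the semantic term comes from comparing segment embeddings against visual and non-visual anchor concepts.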

Example: Manual mode

Extract frames at specific timestamps you already know.
{
  "video_path": "/Users/you/Downloads/tutorial.mp4",
  "timestamps": [120, 185, 240, 360]
}

Example: Assist mode

The speaker describes their entire automation workflow on a podcast, but the video is just their face. Assist mode flags the moments where screenshots would complete the picture.
{
  "url": "https://youtube.com/watch?v=example",
  "assist": true
}
Response:
{
  "mode": "assist",
  "gap_count": 4,
  "analyzed_segments": 312,
  "video_duration": 2460.0,
  "video_duration_formatted": "41:00",
  "gaps": [
    {
      "gap_number": 1,
      "start": 492.3,
      "end": 555.1,
      "start_formatted": "8:12",
      "end_formatted": "9:15",
      "duration_seconds": 62.8,
      "peak_score": 0.87,
      "screenshot_type": "UI interaction or navigation being described",
      "reasons": ["UI action", "navigation action"],
      "transcript": "so I go into Gmail settings, click on the forwarding tab, and then you add a forwarding address..."
    },
    {
      "gap_number": 2,
      "start": 873.0,
      "end": 898.4,
      "start_formatted": "14:33",
      "end_formatted": "14:58",
      "duration_seconds": 25.4,
      "peak_score": 0.85,
      "screenshot_type": "specific UI element or area being pointed to",
      "reasons": ["deictic reference", "UI action"],
      "transcript": "click the three dots in the top right corner, then go to advanced settings..."
    }
  ],
  "hint": "These are moments where the speaker describes something visual but the video may not show it. Provide your own screenshots for these time ranges."
}
No frames are extracted. Instead, the tool returns structured gaps with time windows, transcript context, and what kind of screenshot would help. Take the screenshots yourself, then use manual mode with timestamps to place them.

Example: Clear and redo

Remove all previous frames for a video and extract fresh ones with a new query.
{
  "video_path": "/Users/you/Downloads/tutorial.mp4",
  "clear": true,
  "query": "the database configuration"
}

Parameters

| Parameter | Required | Default | Description |
| --- | --- | --- | --- |
| video_path | No | — | Path to a local video file. Provide this or url. |
| url | No | — | Video URL. Downloads the video automatically. Provide this or video_path. |
| query | No | — | What you need visual context for. Searches the transcript semantically. |
| auto | No | false | Autonomously detect visual moments from the transcript. |
| assist | No | false | Analyze the transcript for visual gaps and return time ranges for manual screenshots. No frames extracted. |
| timestamps | No | — | List of timestamps in seconds to extract frames at. |
| clear | No | false | Remove all previous frames for this video. Can combine with query or auto. |
| model_size | No | tiny | Whisper model size for transcription. |
| max_frames | No | 30 | Maximum number of frames to extract. |
| top_k | No | 10 | Number of transcript matches in query mode. |
| context_words | No | 40 | Words of context around each match in query mode. |

How frame selection works

The tool doesn’t blindly grab frames at fixed intervals.
  1. Transcript analysis: each segment is scored for visual necessity using pattern matching and semantic embeddings.
  2. Smart timing: three candidate frames are extracted per timestamp (t, t+1s, t+2s). The one with the highest edge density wins, capturing the sharpest UI or screen content.
  3. Deduplication: near-identical frames are detected using perceptual hashing and dropped automatically.
  4. Intro suppression: the first 30 seconds of a video require much higher confidence to qualify, filtering out B-roll and title cards.
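Steps 2 and 3 can be sketched with simple stand-ins. This is a hedged illustration, not the tool's code: it uses a mean-absolute-gradient proxy for edge density and a tiny average hash for perceptual deduplication, and the 8×8 hash size and 5-bit Hamming threshold are assumed values.

```python
import numpy as np

def edge_density(frame: np.ndarray) -> float:
    """Mean absolute gradient of a grayscale frame: a cheap sharpness proxy."""
    gy, gx = np.gradient(frame.astype(float))
    return float(np.mean(np.abs(gx) + np.abs(gy)))

def pick_sharpest(frames: list) -> int:
    """Return the index of the candidate (t, t+1s, t+2s) with the most edges."""
    return max(range(len(frames)), key=lambda i: edge_density(frames[i]))

def average_hash(frame: np.ndarray, size: int = 8) -> int:
    """Tiny perceptual hash: downsample to size x size, threshold at the mean."""
    h, w = frame.shape
    small = frame[:h - h % size, :w - w % size] \
        .reshape(size, h // size, size, w // size).mean(axis=(1, 3))
    bits = (small > small.mean()).flatten()
    return int("".join("1" if b else "0" for b in bits), 2)

def is_duplicate(h1: int, h2: int, max_distance: int = 5) -> bool:
    """Frames are near-identical if their hashes differ in only a few bits."""
    return bin(h1 ^ h2).count("1") <= max_distance
```

A new frame is kept only if its hash is far enough from every frame already saved; a flat talking-head shot scores low on edge density, while a dense UI screenshot scores high.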

Notes

Query mode is the recommended default for screen recordings. For talking-head videos or podcasts, use assist mode to find where manual screenshots are needed.
Frames are saved into your Obsidian vault at External Files/visual/. The vault is auto-detected from Obsidian’s configuration. No manual setup needed.
Frame filenames include the video title and timestamp (e.g. chatgpt_agent_builder_07m19s.png), so they never collide across videos in Obsidian’s search.
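That naming pattern is easy to reproduce if you need to predict or parse filenames. The slug rules below are a hypothetical reconstruction, not the tool's actual code; only the title-plus-MMmSSs shape is confirmed by the example above.

```python
import re

def frame_filename(title: str, timestamp: float) -> str:
    """Slugified video title plus a zero-padded MMmSSs timestamp suffix."""
    slug = re.sub(r"[^a-z0-9]+", "_", title.lower()).strip("_")  # assumed slug rule
    minutes, seconds = divmod(int(timestamp), 60)
    return f"{slug}_{minutes:02d}m{seconds:02d}s.png"

print(frame_filename("ChatGPT Agent Builder", 439.7))  # → chatgpt_agent_builder_07m19s.png
```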
Use clear: true to remove old frames before re-extracting with a different query; clear and query can be combined in a single call.