Extract visual context from video at the moments that matter. The tool analyzes the transcript to identify when visual context is needed, then either extracts frames at those timestamps or flags visual gaps for you to fill manually. Skips talking heads, B-roll, and audio-sufficient content. Four modes:
  • Query mode (default): describe what you need visual context for. The tool searches the transcript semantically and extracts frames at matching moments.
  • Auto mode: autonomously detects visual moments from the transcript using pattern matching and semantic scoring.
  • Manual mode: extract frames at specific timestamps you provide.
  • Assist mode: analyzes the transcript for visual gaps and returns time ranges where you should provide your own screenshots. Ideal for talking-head videos or podcasts where the speaker describes a workflow but the video doesn’t show it.
For query, auto, and manual modes, frames are saved directly into your Obsidian vault with ![[filename.png]] wikilink embeds. Assist mode returns structured gap data only; no frames are extracted.

Example: Query mode

Get screenshots of the Gmail configuration steps from a tutorial video. Request:
{
  "url": "https://youtube.com/watch?v=example",
  "query": "connecting Gmail to the agent and configuring email settings"
}
Response:
{
  "mode": "query",
  "query": "connecting Gmail to the agent and configuring email settings",
  "frame_count": 6,
  "analyzed_segments": 236,
  "video_duration": 941.0,
  "video_duration_formatted": "15:41",
  "frames": [
    {
      "filename": "chatgpt_agent_builder_07m19s.png",
      "timestamp": 439.7,
      "timestamp_formatted": "7:19",
      "score": 0.54,
      "reason": "query match",
      "transcript": "I go to the left sidebar, tool section, then MCP, I'm selecting Gmail..."
    }
  ]
}

Example: One-shot with take_notes

Get notes and visual context in a single call.
{
  "url": "https://youtube.com/watch?v=example",
  "style": "highlight",
  "visual": "the workflow setup steps"
}
The notes file includes ![[frame.png]] embeds inline at the relevant sections. Open in Obsidian and every screenshot renders where it belongs.

Example: Auto mode

Let the tool decide which moments need visual context.
{
  "url": "https://youtube.com/watch?v=example",
  "auto": true,
  "max_frames": 15
}
Auto mode scores each transcript segment for visual necessity using pattern matching (UI keywords, spatial references, demonstration language) and semantic similarity (embedding comparison against visual and non-visual anchor concepts). Intro B-roll is suppressed automatically.
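The scoring idea can be sketched in a few lines. This is a minimal illustration, not the tool's actual implementation: the keyword lists, the 0.5/0.5 blend, and the saturation threshold are all assumptions, and the semantic similarity is passed in as a precomputed value rather than derived from real embeddings.

```python
import re

# Illustrative visual-cue patterns; the tool's real lists and weights differ.
UI_PATTERNS = [r"\bclick\b", r"\bbutton\b", r"\bsidebar\b", r"\bmenu\b", r"\btab\b"]
SPATIAL_PATTERNS = [r"\btop right\b", r"\bleft\b", r"\bcorner\b", r"\bhere\b"]

def pattern_score(segment: str) -> float:
    """Score a transcript segment by how many visual-cue patterns it contains."""
    patterns = UI_PATTERNS + SPATIAL_PATTERNS
    hits = sum(1 for p in patterns if re.search(p, segment.lower()))
    return min(hits / 3, 1.0)  # saturate once a few cues are present

def visual_score(segment: str, semantic: float) -> float:
    """Blend pattern evidence with a semantic similarity score in [0, 1]."""
    return 0.5 * pattern_score(segment) + 0.5 * semantic

seg = "I go to the left sidebar, tool section, then click the MCP tab"
print(round(visual_score(seg, semantic=0.6), 2))  # → 0.8
```

Segments scoring above a threshold become frame candidates; in the real tool, the semantic term comes from comparing segment embeddings against visual and non-visual anchor concepts.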

Example: Manual mode

Extract frames at specific timestamps you already know.
{
  "video_path": "/Users/you/Downloads/tutorial.mp4",
  "timestamps": [120, 185, 240, 360]
}

Example: Assist mode

The speaker describes their entire automation workflow on a podcast, but the video is just their face. Assist mode flags the moments where screenshots would complete the picture.
{
  "url": "https://youtube.com/watch?v=example",
  "assist": true
}
Response:
{
  "mode": "assist",
  "gap_count": 4,
  "analyzed_segments": 312,
  "video_duration": 2460.0,
  "video_duration_formatted": "41:00",
  "gaps": [
    {
      "gap_number": 1,
      "start": 492.3,
      "end": 555.1,
      "start_formatted": "8:12",
      "end_formatted": "9:15",
      "duration_seconds": 62.8,
      "peak_score": 0.87,
      "screenshot_type": "UI interaction or navigation being described",
      "reasons": ["UI action", "navigation action"],
      "transcript": "so I go into Gmail settings, click on the forwarding tab, and then you add a forwarding address..."
    },
    {
      "gap_number": 2,
      "start": 873.0,
      "end": 898.4,
      "start_formatted": "14:33",
      "end_formatted": "14:58",
      "duration_seconds": 25.4,
      "peak_score": 0.85,
      "screenshot_type": "specific UI element or area being pointed to",
      "reasons": ["deictic reference", "UI action"],
      "transcript": "click the three dots in the top right corner, then go to advanced settings..."
    }
  ],
  "hint": "These are moments where the speaker describes something visual but the video may not show it. Provide your own screenshots for these time ranges."
}
No frames are extracted. Instead, the tool returns structured gaps with time windows, transcript context, and what kind of screenshot would help. Take the screenshots yourself, then use manual mode with timestamps to place them.

Example: Clear and redo

Remove all previous frames for a video and extract fresh ones with a new query.
{
  "video_path": "/Users/you/Downloads/tutorial.mp4",
  "clear": true,
  "query": "the database configuration"
}

Parameters

| Parameter | Required | Default | Description |
| --- | --- | --- | --- |
| video_path | No | — | Path to a local video file. Provide this or url. |
| url | No | — | Video URL. Downloads the video automatically. Provide this or video_path. |
| query | No | — | What you need visual context for. Searches the transcript semantically. |
| auto | No | false | Autonomously detect visual moments from the transcript. |
| assist | No | false | Analyze the transcript for visual gaps and return time ranges for manual screenshots. No frames extracted. |
| timestamps | No | — | List of timestamps in seconds to extract frames at. |
| clear | No | false | Remove all previous frames for this video. Can combine with query or auto. |
| model_size | No | tiny | Whisper model size for transcription. |
| max_frames | No | 30 | Maximum number of frames to extract. |
| top_k | No | 10 | Number of transcript matches in query mode. |
| context_words | No | 40 | Words of context around each match in query mode. |

How frame selection works

The tool doesn’t blindly grab frames at fixed intervals.
  1. Transcript analysis: each segment is scored for visual necessity using pattern matching and semantic embeddings.
  2. Smart timing: three candidate frames are extracted per timestamp (t, t+1s, t+2s). The one with the highest edge density wins, capturing the sharpest UI or screen content.
  3. Deduplication: near-identical frames are detected using perceptual hashing and dropped automatically.
  4. Intro suppression: the first 30 seconds of a video require much higher confidence to qualify, filtering out B-roll and title cards.
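Steps 2 and 3 can be sketched with simple stand-ins. This is a hedged illustration, not the tool's code: it uses a mean-absolute-gradient proxy for edge density and a tiny average hash for perceptual deduplication, and the 8×8 hash size and 5-bit Hamming threshold are assumed values.

```python
import numpy as np

def edge_density(frame: np.ndarray) -> float:
    """Mean absolute gradient of a grayscale frame: a cheap sharpness proxy."""
    gy, gx = np.gradient(frame.astype(float))
    return float(np.mean(np.abs(gx) + np.abs(gy)))

def pick_sharpest(frames: list) -> int:
    """Return the index of the candidate (t, t+1s, t+2s) with the most edges."""
    return max(range(len(frames)), key=lambda i: edge_density(frames[i]))

def average_hash(frame: np.ndarray, size: int = 8) -> int:
    """Tiny perceptual hash: downsample to size x size, threshold at the mean."""
    h, w = frame.shape
    small = frame[:h - h % size, :w - w % size] \
        .reshape(size, h // size, size, w // size).mean(axis=(1, 3))
    bits = (small > small.mean()).flatten()
    return int("".join("1" if b else "0" for b in bits), 2)

def is_duplicate(h1: int, h2: int, max_distance: int = 5) -> bool:
    """Frames are near-identical if their hashes differ in only a few bits."""
    return bin(h1 ^ h2).count("1") <= max_distance
```

A new frame is kept only if its hash is far enough from every frame already saved; a flat talking-head shot scores low on edge density, while a dense UI screenshot scores high.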

Notes

Query mode is the recommended default for screen recordings. For talking-head videos or podcasts, use assist mode to find where manual screenshots are needed.
Frames are saved into your Obsidian vault at External Files/visual/. The vault is auto-detected from Obsidian’s configuration. No manual setup needed.
Frame filenames include the video title and timestamp (e.g. chatgpt_agent_builder_07m19s.png), so they never collide across videos in Obsidian’s search.
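That naming pattern is easy to reproduce if you need to predict or parse filenames. The slug rules below are a hypothetical reconstruction, not the tool's actual code; only the title-plus-MMmSSs shape is confirmed by the example above.

```python
import re

def frame_filename(title: str, timestamp: float) -> str:
    """Slugified video title plus a zero-padded MMmSSs timestamp suffix."""
    slug = re.sub(r"[^a-z0-9]+", "_", title.lower()).strip("_")  # assumed slug rule
    minutes, seconds = divmod(int(timestamp), 60)
    return f"{slug}_{minutes:02d}m{seconds:02d}s.png"

print(frame_filename("ChatGPT Agent Builder", 439.7))  # → chatgpt_agent_builder_07m19s.png
```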
Use clear: true to remove old frames before re-extracting with a different query; clear and query can be combined in a single call.