Two pipelines, one command. YouTube URLs get gem extraction (insights, opinions, advice). Screen recordings get QA processing (bugs, issues, findings).

How It Works


Gems Workflow (YouTube)

The gems pipeline extracts reusable knowledge from YouTube videos — the kind of insights you'd bookmark, screenshot, or reference later.

Pipeline

  1. Download — yt-dlp extracts audio + metadata from the YouTube URL
  2. Transcribe — whisper-cli produces timestamped SRT + plain text
  3. Detect hotspots — LLM reads the transcript and identifies gem moments
  4. Extract frames — ffmpeg captures video frames at each gem timestamp
  5. Vision analysis — Claude reads frames for slides, code, diagrams
  6. Store — brain_digest + brain_store persist everything to BrainLayer
  7. Report — Structured gems with categories, quotes, and relevance
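The command-line steps (1, 2, and 4) can be sketched as follows. This is an illustrative helper, not the skill's actual code: the output file names, the whisper model path, and the helper name itself are placeholders.

```python
import shlex

def gem_pipeline_commands(url, gem_timestamps):
    """Build the shell commands for the download, transcribe, and
    frame-extraction steps of the gems pipeline (sketch only)."""
    cmds = [
        # 1. Download: extract audio + metadata from the YouTube URL
        ["yt-dlp", "-x", "--audio-format", "wav", "--write-info-json",
         "-o", "audio.%(ext)s", url],
        # 2. Transcribe: timestamped SRT + plain text via whisper.cpp
        ["whisper-cli", "-m", "models/ggml-small.bin", "-f", "audio.wav",
         "-osrt", "-otxt"],
    ]
    # 4. Extract frames: one frame per gem timestamp
    for i, ts in enumerate(gem_timestamps):
        cmds.append(["ffmpeg", "-ss", ts, "-i", "video.mp4",
                     "-frames:v", "1", f"frame_{i:02d}.jpg"])
    return [shlex.join(c) for c in cmds]
```

Building the commands as argument lists (and quoting with `shlex.join`) avoids shell-injection issues with untrusted URLs.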

Gem Categories

| Category | What It Captures |
| --- | --- |
| 💡 Insight | Non-obvious technical or conceptual shift |
| 🔥 Opinion | Strong take, contrarian view |
| ✅ Advice | Concrete, actionable recommendation |
| 🔧 Tool | Specific technology mention with evaluation |
| 🏗️ Architecture | System design pattern or structural decision |
| ⚔️ War Story | Real production experience or post-mortem |
| 💬 Quotable | Memorable phrasing, tweetable insight |
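Each extracted gem maps naturally onto a small record. The field names below are an assumption inferred from the report format, not the skill's actual schema:

```python
from dataclasses import dataclass, asdict

@dataclass
class Gem:
    """Hypothetical shape of one extracted gem (illustrative only)."""
    category: str   # one of the seven categories, e.g. "Opinion"
    timestamp: str  # position in the video, e.g. "3:24"
    quote: str      # verbatim line from the transcript
    relevance: str  # one-line note on why the gem matters

gem = Gem("Opinion", "3:24",
          "Anthropic is massively subsidizing the amount of inference...",
          "Reveals competitive moat strategy")
```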

Example Output

## 🎬 T3 Code + Claude Subscriptions — Theo
Duration: 10:12 | Gems: 7 | Stored: ✅

1. 🔥 Opinion @ 3:24 — "Anthropic is massively subsidizing
   the amount of inference... $5,000 in compute for $200"
   → Reveals competitive moat strategy

2. 🏗️ Architecture @ 6:15 — "We're using the CLIs...
   we had to unbake a cake to reassemble it"
   → Event system abstraction for multi-harness support

3. ✅ Advice @ 8:42 — "5.4 is the best model for coding...
   switch to Claude for UI pass, quick tidy ups"
   → Multi-model workflow strategy

QA Workflow (Screen Recordings)

Delegates to the /qa-video stalker pipeline for screen recordings with narration:

  1. Audio extraction — ffmpeg extracts audio from the recording
  2. Transcription — whisper-cli converts narration to timestamped text
  3. Hotspot detection — LLM identifies bug reports, UX issues, feature requests
  4. Frame extraction — Frames at hotspot timestamps + 30-second intervals
  5. Vision correlation — Matches what was said with what's on screen
  6. Findings document — Structured by severity (Critical → Minor → Enhancement)
  7. Agent handoff — Sends findings to implementing agent via cmux
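Step 4's sampling (hotspot timestamps plus a 30-second grid) can be sketched like this; the function name and exact merge behavior are assumptions, not the pipeline's real implementation:

```python
def frame_times(hotspots, duration, interval=30):
    """Merge hotspot timestamps (seconds) with a fixed sampling grid
    so every hotspot gets a frame and no 30-second stretch is skipped."""
    grid = range(0, int(duration) + 1, interval)
    return sorted(set(grid) | set(hotspots))

# e.g. a 90-second recording with hotspots at 12s and 61s
times = frame_times([12, 61], 90)  # → [0, 12, 30, 60, 61, 90]
```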

Routing Logic

| Input | Routes To | Override |
| --- | --- | --- |
| YouTube URL | Gems | Say "QA" to override |
| Local .mov/.mp4 | QA | Say "gems" to override |
| "extract gems from..." | Gems | Regardless of source |
| "process QA recording" | QA | Regardless of source |

Prerequisites

| Tool | Purpose | Install |
| --- | --- | --- |
| yt-dlp | YouTube download | pip3 install yt-dlp |
| ffmpeg | Audio/frame extraction | brew install ffmpeg |
| whisper-cli | Speech-to-text | brew install whisper-cpp |
| whisper model | Transcription model | whisper-cli --download-model small |
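A preflight check for the binaries above is straightforward (the helper name is ours, not part of the skill):

```python
import shutil

REQUIRED_TOOLS = ["yt-dlp", "ffmpeg", "whisper-cli"]

def missing_tools(tools=REQUIRED_TOOLS):
    """Return the prerequisites not found on PATH (empty list = ready)."""
    return [t for t in tools if shutil.which(t) is None]
```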

BrainLayer Integration

Both workflows store results in BrainLayer. Gems get brain_digest (full content indexing) + brain_store (curated gems with tags). QA findings get brain_store with severity counts and project tags.

If BrainLayer is unavailable, the skill loudly flags the failure and saves a local fallback file — it never silently skips storage.
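That fail-loud behavior looks roughly like the sketch below, where `brain_store` is a stand-in callable for the BrainLayer API and the fallback path is an assumption:

```python
import json

def store_gems(gems, brain_store, fallback_path="gems_fallback.json"):
    """Persist gems to BrainLayer; on failure, flag loudly and write a
    local fallback file instead of silently skipping storage (sketch)."""
    try:
        brain_store(gems)
        return "stored"
    except Exception as err:
        # Loudly flag the failure, then save a local fallback file
        print(f"WARNING: BrainLayer unavailable ({err}); "
              f"saving fallback to {fallback_path}")
        with open(fallback_path, "w") as f:
            json.dump(gems, f, indent=2)
        return "fallback"
```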

Eval Coverage

7 eval cases covering:

  • URL routing (YouTube → gems, local → QA)
  • Gem quality from real transcript (Theo fixture, 8 assertions)
  • BrainLayer unavailability flagging
  • Gem-mode override for local files
  • With-skill vs without-skill comparison (10 assertions)
  • Ambiguous input clarification

Source

skills/golem-powers/video-extract/