YouTube Transcript
Overview
Extract YouTube video transcripts, metadata, and chapters using yt-dlp. The output is formatted as Markdown with YAML frontmatter and saved to ~/Brains/brain/ (an Obsidian vault).
Quick Start
To extract a transcript from a YouTube video:
python scripts/extract_transcript.py <youtube_url>
Optionally, specify a custom output filename:
python scripts/extract_transcript.py <youtube_url> custom_filename.md
Output Format
YAML Frontmatter
The generated Markdown includes comprehensive metadata:
- title - Video title
- channel - Channel name
- url - YouTube URL
- upload_date - Upload date (YYYY-MM-DD)
- duration - Video duration (HH:MM:SS)
- description - Video description (truncated to 500 chars)
- tags - Array of video tags
- view_count - View count
- like_count - Like count
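As a rough illustration of how these fields could be assembled, here is a minimal sketch. It is not the code in scripts/extract_transcript.py: the function name `build_frontmatter` is hypothetical, and it assumes the input is a yt-dlp info dictionary with the keys `title`, `channel`, `webpage_url`, `upload_date` (as YYYYMMDD), `duration` (in seconds), `description`, `tags`, `view_count`, and `like_count`. Values are written with `json.dumps`, relying on JSON scalars and lists being valid YAML, so no YAML library is needed.

```python
import json


def build_frontmatter(info, max_description=500):
    """Build a YAML frontmatter block from a yt-dlp style info dict (assumed shape)."""
    # yt-dlp reports upload_date as YYYYMMDD; reshape it to YYYY-MM-DD.
    raw_date = str(info.get("upload_date") or "")
    if len(raw_date) == 8:
        upload_date = f"{raw_date[:4]}-{raw_date[4:6]}-{raw_date[6:8]}"
    else:
        upload_date = raw_date

    # duration is assumed to be a number of seconds; render it as HH:MM:SS.
    total = int(info.get("duration") or 0)
    duration = f"{total // 3600:02d}:{total % 3600 // 60:02d}:{total % 60:02d}"

    fields = {
        "title": info.get("title") or "",
        "channel": info.get("channel") or "",
        "url": info.get("webpage_url") or "",
        "upload_date": upload_date,
        "duration": duration,
        # Truncate long descriptions, as described in the field list above.
        "description": (info.get("description") or "")[:max_description],
        "tags": info.get("tags") or [],
        "view_count": info.get("view_count"),
        "like_count": info.get("like_count"),
    }

    lines = ["---"]
    # json.dumps quotes and escapes strings, so titles with colons or quotes stay valid.
    lines += [f"{key}: {json.dumps(value, ensure_ascii=False)}" for key, value in fields.items()]
    lines.append("---")
    return "\n".join(lines) + "\n"
```

With this approach a missing count is written as `null`, and a multi-line description stays on a single escaped line.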
Body Structure
The transcript is organized by video chapters, if available:
## Chapter Title
**00:05:23** Transcript text for this segment.
**00:05:45** Next segment text.
If no chapters exist, all content appears under a single "## Transcript" heading.
Timestamps are formatted as HH:MM:SS for consistency, including for videos shorter than one hour.
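The body layout described above can be sketched as follows. This is an illustration under stated assumptions rather than the script's actual code: `format_timestamp` and `render_body` are hypothetical names, segments are assumed to be `(start_seconds, text)` pairs, chapters are assumed to be dictionaries with `start_time` and `title` keys (the shape yt-dlp uses for its `chapters` list), and the blank lines between entries are a formatting choice made here.

```python
def format_timestamp(seconds):
    """Render a number of seconds as HH:MM:SS."""
    hours, remainder = divmod(int(seconds), 3600)
    minutes, secs = divmod(remainder, 60)
    return f"{hours:02d}:{minutes:02d}:{secs:02d}"


def render_body(segments, chapters=None):
    """Group (start_seconds, text) segments under chapter headings."""
    if not chapters:
        # No chapters: everything goes under one "Transcript" heading.
        chapters = [{"start_time": 0, "title": "Transcript"}]
    chapters = sorted(chapters, key=lambda chapter: chapter["start_time"])
    sections = [(chapter["title"], []) for chapter in chapters]

    for start, text in segments:
        # Pick the last chapter that begins at or before this segment.
        # Segments earlier than the first chapter fall into the first one.
        index = 0
        for position, chapter in enumerate(chapters):
            if chapter["start_time"] <= start:
                index = position
        sections[index][1].append(f"**{format_timestamp(start)}** {text}")

    parts = []
    for title, lines in sections:
        if lines:  # skip chapters that contain no transcript text
            parts.append(f"## {title}\n\n" + "\n\n".join(lines))
    return "\n\n".join(parts)
```

For example, `render_body([(323, "Transcript text for this segment.")])` yields a "## Transcript" heading followed by an entry stamped `**00:05:23**`.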
Workflow
- Extract metadata and subtitles using yt-dlp
- Parse VTT subtitle format to extract timestamps and text
- Group transcript segments by video chapters (if present)
- Format as Markdown with YAML frontmatter
- Save to ~/Brains/brain/ with sanitized filename based on video title
- Clean up temporary subtitle files
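Two of the steps above, parsing VTT and sanitizing the filename, can be sketched roughly as below. This is a simplified illustration, not the parser used by the script: `parse_vtt` and `sanitize_filename` are hypothetical names, the parser handles only basic cue blocks separated by blank lines, and the set of characters stripped from filenames is an assumption.

```python
import html
import re

# Matches the start of a cue timing line such as "00:05:23.000 --> 00:05:25.000".
# The hours field is optional in WebVTT, so it is optional here as well.
CUE_TIMING = re.compile(r"(?:(\d{2,}):)?(\d{2}):(\d{2})\.(\d{3})\s+-->")
INLINE_TAG = re.compile(r"<[^>]+>")


def parse_vtt(vtt_text):
    """Return a list of (start_seconds, text) tuples from WebVTT content."""
    segments = []
    # Cues are separated by blank lines; the header block has no timing line.
    for block in re.split(r"\n\s*\n", vtt_text.replace("\r\n", "\n")):
        lines = [line.strip() for line in block.strip().split("\n")]
        for index, line in enumerate(lines):
            match = CUE_TIMING.match(line)
            if not match:
                continue  # cue identifiers and header lines are skipped
            hours = int(match.group(1) or 0)
            minutes = int(match.group(2))
            seconds = int(match.group(3))
            # Everything after the timing line is cue text; drop inline tags
            # such as <c> or word-level timestamps, then decode entities.
            text = " ".join(INLINE_TAG.sub("", part) for part in lines[index + 1:])
            text = re.sub(r"\s+", " ", html.unescape(text)).strip()
            if text:
                segments.append((hours * 3600 + minutes * 60 + seconds, text))
            break
    return segments


def sanitize_filename(title, extension=".md"):
    """Remove characters that are unsafe in filenames and tidy whitespace."""
    cleaned = re.sub(r'[\\/:*?"<>|\x00-\x1f]', "", title)
    cleaned = re.sub(r"\s+", " ", cleaned).strip(" .")
    return (cleaned or "untitled") + extension
```

Millisecond precision is discarded because the output only shows whole seconds.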
Deduplication
To remove duplicates from existing transcript files:
python scripts/deduplicate_transcript.py <markdown_file>
This removes transcript entries that are prefixes of subsequent entries (common in VTT files where subtitles accumulate).
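The prefix rule can be illustrated with a short sketch. It is not the code in scripts/deduplicate_transcript.py: `remove_prefix_entries` is a hypothetical name, entries are assumed to be `(timestamp, text)` pairs already parsed from the Markdown file, and "subsequent" is interpreted here as the immediately following entry.

```python
def remove_prefix_entries(entries):
    """Drop each entry whose text is a prefix of the next entry's text."""
    kept = []
    for index, (timestamp, text) in enumerate(entries):
        has_next = index + 1 < len(entries)
        # An exact repeat is also a prefix, so only the last copy survives.
        if has_next and entries[index + 1][1].startswith(text):
            continue
        kept.append((timestamp, text))
    return kept
```

Given the entries "Hello", "Hello world", "Hello world", and "Next line", this keeps only the final "Hello world" and "Next line". One consequence of this variant is that the surviving entry carries the later timestamp; keeping the earliest timestamp instead would require a small change.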
Requirements
Ensure yt-dlp is installed:
pip install yt-dlp
Limitations
- Tries English subtitles first and falls back to Russian if English is unavailable
- Requires the video to have subtitles (auto-generated or manual)
- Does not download video or audio files
- Description truncated to 500 characters in frontmatter
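The language fallback in the first limitation could look roughly like the sketch below. The function name `pick_subtitle_language` is hypothetical. It assumes a yt-dlp info dictionary whose `subtitles` and `automatic_captions` keys map language codes to track lists. Preferring manual subtitles over auto-generated ones within the same language is an assumption made here, not something the script is confirmed to do.

```python
def pick_subtitle_language(info, preferred=("en", "ru")):
    """Return (language, kind) for the first preferred language available, else None."""
    manual = info.get("subtitles") or {}
    automatic = info.get("automatic_captions") or {}
    for language in preferred:
        # Within one language, manual subtitles win over auto-generated ones.
        if language in manual:
            return language, "manual"
        if language in automatic:
            return language, "automatic"
    return None
```

This matches language codes exactly, so a track listed only under a regional variant such as `en-US` would not be found by this sketch. A `None` result corresponds to the second limitation: without subtitles there is nothing to extract.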