Convert YouTube transcripts to structured XML format with automatic chapter detection.
Problem: Raw YouTube transcripts are unstructured text that LLMs struggle to parse, degrading AI chat responses about video content.
Solution: Converts transcripts to XML with chapter elements for improved AI comprehension.
(1) First, install UV, the Python package and project manager (see the UV installation docs).
(2) Then, install youtube-to-xml so it's accessible from anywhere in your terminal:
uv tool install git+https://github.com/michellepace/youtube-to-xml.git

The youtube-to-xml command auto-detects whether you're providing a YouTube URL or a transcript file.
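A minimal sketch of how such input detection could work (hypothetical logic, not the actual cli.py implementation):

```python
from urllib.parse import urlparse

def detect_input_type(arg: str) -> str:
    """Classify a CLI argument as a YouTube URL or a transcript file path."""
    parsed = urlparse(arg)
    if parsed.scheme in ("http", "https"):
        return "url"
    return "file"

print(detect_input_type("https://youtu.be/Q4gsvJvRjCU"))  # url
print(detect_input_type("my_transcript.txt"))             # file
```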
Convert directly from YouTube URL:
youtube-to-xml https://youtu.be/Q4gsvJvRjCU
🎬 Processing: https://www.youtube.com/watch?v=Q4gsvJvRjCU
✅ Created: how-claude-code-hooks-save-me-hours-daily.xml

Output XML (condensed - 4 chapters, 88 lines total):
<?xml version='1.0' encoding='utf-8'?>
<transcript video_title="How Claude Code Hooks Save Me HOURS Daily"
video_published="2025-07-12"
video_duration="2m 43s"
video_url="https://www.youtube.com/watch?v=Q4gsvJvRjCU">
<chapters>
<chapter title="Intro" start_time="0:00">
0:00 Hooks are hands down one of the best
0:02 features in Claude Code and for some
<!-- ... more transcript content ... -->
</chapter>
<chapter title="Hooks" start_time="0:19">
0:20 To create your first hook, use the hooks
<!-- ... more transcript content ... -->
</chapter>
<!-- ... 2 more chapters ... -->
</chapters>
</transcript>

Manually copy a YouTube transcript into a text file, then:
youtube-to-xml my_transcript.txt
# ✅ Created: my_transcript.xml

Copy-Paste Exact YouTube Format for my_transcript.txt:
Introduction to Cows
0:02
Welcome to this talk about erm.. er
2:30
Let's start with the fundamentals
Washing the cow
15:45
First, we'll start with the patches
Output XML:
<?xml version='1.0' encoding='utf-8'?>
<transcript video_title="" video_published="" video_duration="" video_url="">
<chapters>
<chapter title="Introduction to Cows" start_time="0:02">
0:02 Welcome to this talk about erm.. er
2:30 Let's start with the fundamentals
</chapter>
<chapter title="Washing the cow" start_time="15:45">
15:45 First, we'll start with the patches
</chapter>
</chapters>
</transcript>

See demo-analysing-transcripts-with-claude-code.md for a real conversation where Claude Code analyses a 2-hour video transcript.
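The file-input path has to infer chapters from the pasted text alone. Here is a sketch of one possible heuristic (an illustrative simplification, not the actual rules in file_parser.py): a non-timestamp line immediately followed by a timestamp line starts a new chapter, and each timestamp is paired with the text line after it.

```python
import re

# Bare timestamp line, e.g. "0:02" or "1:02:45"
TIMESTAMP = re.compile(r"^\d+:\d{2}(?::\d{2})?$")

def parse_chapters(lines: list[str]) -> list[dict]:
    """Group pasted transcript lines into chapters (illustrative heuristic)."""
    chapters: list[dict] = []
    i = 0
    while i < len(lines):
        line = lines[i].strip()
        next_line = lines[i + 1].strip() if i + 1 < len(lines) else ""
        if not TIMESTAMP.match(line) and TIMESTAMP.match(next_line):
            # Non-timestamp line followed by a timestamp: a chapter title
            chapters.append({"title": line, "lines": []})
            i += 1
        elif TIMESTAMP.match(line) and chapters:
            # Pair the timestamp with the text line that follows it
            chapters[-1]["lines"].append(f"{line} {next_line}")
            i += 2
        else:
            i += 1
    return chapters

lines = [
    "Introduction to Cows",
    "0:02", "Welcome to this talk about erm.. er",
    "2:30", "Let's start with the fundamentals",
    "Washing the cow",
    "15:45", "First, we'll start with the patches",
]
for ch in parse_chapters(lines):
    print(ch["title"], ch["lines"])
```

Run against the cow-talk example above, this yields two chapters with same-line timestamped entries, matching the output XML shown.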
Interesting findings:
- 63,231 tokens — too large for Claude Code to read at once, but it adapted by using grep and reading specific line ranges
- XML chapters — made it trivial to target specific sections (e.g., "analyse chapter 10 and 11")
- Follow-up questions — improved answer completeness. In an app this could be handled by prompt engineering.
Claude Code can only read 25,000 tokens at a time, but the Anthropic API has a 200,000-token context window, so we still don't need RAG (yet).
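Chapter elements also make programmatic targeting easy. A sketch using Python's standard library, with a condensed stand-in for a generated file (structure taken from the examples above):

```python
import xml.etree.ElementTree as ET

# Condensed stand-in for a generated transcript file; real files carry
# full metadata and many more lines.
XML = """<transcript video_title="" video_published="" video_duration="" video_url="">
  <chapters>
    <chapter title="Introduction to Cows" start_time="0:02">
0:02 Welcome to this talk about erm.. er
    </chapter>
    <chapter title="Washing the cow" start_time="15:45">
15:45 First, we'll start with the patches
    </chapter>
  </chapters>
</transcript>"""

root = ET.fromstring(XML)
# Attribute predicate: fetch only the chapter we care about
chapter = root.find(".//chapter[@title='Washing the cow']")
print(chapter.get("start_time"))  # 15:45
print(chapter.text.strip())
```

This is exactly the property the demo exploits: a chapter can be addressed by title without reading the whole transcript.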
- Architecture: Pure functions with clear module separation
- Key Modules: See CLAUDE.md
- Dependencies: Python 3.14+, yt-dlp for YouTube downloads (see pyproject.toml)
- Python Package Management: UV
- Test-Driven Development: 125 tests (19 slow, 106 unit)
- Terminology: Uses TRANSCRIPT terminology throughout codebase, see docs/terminology.md
🤖 Repo 100% generated by Claude Code — every single line.
Setup:
git clone https://github.com/michellepace/youtube-to-xml.git
cd youtube-to-xml
uv sync
uv run pre-commit install
uv run pre-commit install --hook-type pre-push

Code Quality:
uv run ruff check --fix # Lint and auto-fix (see pyproject.toml)
uv run ruff format # Format code (see pyproject.toml)

Testing:
uv run pytest # All tests
uv run pytest -m "slow" # Only slow tests (internet required)
uv run pytest -m "not slow" # All tests except slow tests
uv run pre-commit run --all-files # (see .pre-commit-config.yaml)
Counted by my plot-py-repo tool
youtube-to-xml CLI
│
┌──────┴───────┐
│ cli.py │
│ (auto-detect)│
└──────┬───────┘
│
┌────────────┴────────────┐
│ │
[URL Input] [File Input]
│ │
┌───────▼────────┐ ┌────────▼────────┐
│ url_parser.py │ │ file_parser.py │
│ │ │ │
│ • yt-dlp API │ │ • Pattern match │
│ • JSON3 parse │ │ • Chapter rules │
│ • Metadata │ │ • Empty metadata│
└───────┬────────┘ └────────┬────────┘
│ │
└────────────┬────────────┘
│
┌──────▼──────┐
│ models.py │
│ │
│TranscriptDoc│
│ Chapters │
│ Metadata │
└──────┬──────┘
│
┌──────▼──────────┐
│ xml_builder.py │
│ │
│ • Format times │
│ • Build XML tree│
└──────┬──────────┘
│
┌────▼────┐
│ XML File│
└─────────┘
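The shared model in the middle of the diagram could look roughly like this (a sketch only; the actual definitions live in models.py):

```python
from dataclasses import dataclass, field

@dataclass
class Chapter:
    title: str
    start_time: str
    lines: list[str] = field(default_factory=list)  # "0:02 Welcome..." entries

@dataclass
class TranscriptDocument:
    """Common shape both url_parser and file_parser produce for xml_builder."""
    video_title: str = ""      # empty for file input (no metadata available)
    video_published: str = ""
    video_duration: str = ""
    video_url: str = ""
    chapters: list[Chapter] = field(default_factory=list)
```

Both parsers converging on one data model is what lets xml_builder.py stay ignorant of where the transcript came from.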
Each transcript line now places the timestamp and text on the same line, rather than on separate lines:
Before (separate lines)
0:02
Welcome to this talk about cows
2:30
Let's start with the fundamentals
15:45
First, we'll start with the patches
After (same line)
0:02 Welcome to this talk about cows
2:30 Let's start with the fundamentals
15:45 First, we'll start with the patches
Why? The primary consumer of these transcripts is an LLM agent (e.g. Claude Code) that navigates large files by searching for keywords and reading line ranges. With inline timestamps, every search hit is a self-contained record — the agent immediately knows what was said and when, in a single operation. No follow-up read to find the timestamp on the line above.
Searching a transcript with thousands of lines: the separate-lines format requires a second lookup to find each hit's timestamp, while the same-line format returns a complete record in one operation.
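A toy illustration of that difference (hypothetical helper, not project code): one grep-style search over the same-line format already yields complete timestamped records.

```python
def search(transcript: str, keyword: str) -> list[str]:
    """Return every line containing the keyword, grep-style."""
    return [line for line in transcript.splitlines() if keyword in line]

same_line = """0:02 Welcome to this talk about cows
2:30 Let's start with the fundamentals
15:45 First, we'll start with the patches"""

# One search, complete records: each hit already carries its timestamp.
print(search(same_line, "fundamentals"))
```

With the separate-lines format the same search would return only the text line, forcing a second read of the line above it to recover the timestamp.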
Evals To Do (transcript.txt vs transcript.xml):
- Build a Shiny for Python app to use Hamel's simple error analysis approach
- But I don't like Hamel's binary approach; what about a Six Sigma ordinal data approach like in docs/idea-evals.md?
- Automate the evals with pytest as far as possible, using LLM-as-a-Judge for the rest
- If XML is the winner, try tweaking the XML structure to improve it, for example this: whitespace, more tags, or maybe JSON?
- But now I've got a cost problem because the XML sits in the context window. So can RAG perform equally well, and as fast?
- Can a cheaper model perform equally well, like Haiku over Sonnet (for some things)?
- At some point I'm going to have to head over to BrainTrust.dev - use an agnostic SDK?
Learnings To Carry Over:
- Use CodeRabbit for PR review to improve code
- Use Claude Code Docs so Claude Code knows what it can do
- Use Claude Code Project Index so Claude Code sees entire project easily
- Manage MCPs nicely: constrain what you use, put API keys in one place
- Git branch workflow: try to put everything on a purposeful branch
- Always use strict linting and typing, and enforce them in a pre-commit hook
- Always do test-driven development; manual LLM testing is useful too
- Manage LLM Context: set terminology, use clear naming, keep docstrings/comments accurate, /clear Claude Code at 60% of the context window
Open Questions:
