Convert YouTube transcripts to structured XML format with automatic chapter detection.
Problem: Raw YouTube transcripts are unstructured text that LLMs struggle to parse, degrading AI chat responses about video content.
Solution: Converts transcripts to XML with chapter elements for improved AI comprehension.
(1) First, install UV, the Python package and project manager (see the UV installation docs).
(2) Then, install youtube-to-xml so it's accessible from anywhere in your terminal:
uv tool install git+https://github.com/michellepace/youtube-to-xml.git

The youtube-to-xml command auto-detects whether you're providing a YouTube URL or a transcript file.
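A minimal sketch of how such input detection could work (hypothetical logic, not the actual cli.py implementation):

```python
from urllib.parse import urlparse

def detect_input_type(arg: str) -> str:
    """Classify a CLI argument as a YouTube URL or a transcript file path."""
    parsed = urlparse(arg)
    if parsed.scheme in ("http", "https"):
        return "url"
    return "file"

print(detect_input_type("https://youtu.be/Q4gsvJvRjCU"))  # url
print(detect_input_type("my_transcript.txt"))             # file
```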
Convert directly from YouTube URL:
youtube-to-xml https://youtu.be/Q4gsvJvRjCU
🎬 Processing: https://www.youtube.com/watch?v=Q4gsvJvRjCU
✅ Created: how-claude-code-hooks-save-me-hours-daily.xml

Output XML (condensed - 4 chapters, 88 lines total):
<?xml version='1.0' encoding='utf-8'?>
<transcript video_title="How Claude Code Hooks Save Me HOURS Daily"
video_published="2025-07-12"
video_duration="2m 43s"
video_url="https://www.youtube.com/watch?v=Q4gsvJvRjCU">
<chapters>
<chapter title="Intro" start_time="0:00">
0:00 Hooks are hands down one of the best
0:02 features in Claude Code and for some
<!-- ... more transcript content ... -->
</chapter>
<chapter title="Hooks" start_time="0:19">
0:20 To create your first hook, use the hooks
<!-- ... more transcript content ... -->
</chapter>
<!-- ... 2 more chapters ... -->
</chapters>
</transcript>

Manually copy a YouTube transcript into a text file, then:
youtube-to-xml my_transcript.txt
# ✅ Created: my_transcript.xml

Copy-Paste Exact YouTube Format for my_transcript.txt:
Introduction to Cows
0:02
Welcome to this talk about erm.. er
2:30
Let's start with the fundamentals
Washing the cow
15:45
First, we'll start with the patches
Output XML:
<?xml version='1.0' encoding='utf-8'?>
<transcript video_title="" video_published="" video_duration="" video_url="">
<chapters>
<chapter title="Introduction to Cows" start_time="0:02">
0:02 Welcome to this talk about erm.. er
2:30 Let's start with the fundamentals
</chapter>
<chapter title="Washing the cow" start_time="15:45">
15:45 First, we'll start with the patches
</chapter>
</chapters>
</transcript>

See demo-analysing-transcripts-with-claude-code.md for a real conversation where Claude Code analyses a 2-hour video transcript.
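The file-input path has to infer chapters from the pasted text alone. Here is a sketch of one possible heuristic (an illustrative simplification, not the actual rules in file_parser.py): a non-timestamp line immediately followed by a timestamp line starts a new chapter, and each timestamp is paired with the text line after it.

```python
import re

# Bare timestamp line, e.g. "0:02" or "1:02:45"
TIMESTAMP = re.compile(r"^\d+:\d{2}(?::\d{2})?$")

def parse_chapters(lines: list[str]) -> list[dict]:
    """Group pasted transcript lines into chapters (illustrative heuristic)."""
    chapters: list[dict] = []
    i = 0
    while i < len(lines):
        line = lines[i].strip()
        next_line = lines[i + 1].strip() if i + 1 < len(lines) else ""
        if not TIMESTAMP.match(line) and TIMESTAMP.match(next_line):
            # Non-timestamp line followed by a timestamp: a chapter title
            chapters.append({"title": line, "lines": []})
            i += 1
        elif TIMESTAMP.match(line) and chapters:
            # Pair the timestamp with the text line that follows it
            chapters[-1]["lines"].append(f"{line} {next_line}")
            i += 2
        else:
            i += 1
    return chapters

lines = [
    "Introduction to Cows",
    "0:02", "Welcome to this talk about erm.. er",
    "2:30", "Let's start with the fundamentals",
    "Washing the cow",
    "15:45", "First, we'll start with the patches",
]
for ch in parse_chapters(lines):
    print(ch["title"], ch["lines"])
```

Run against the cow-talk example above, this yields two chapters with same-line timestamped entries, matching the output XML shown.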
Interesting findings:
- 63,231 tokens — too large for Claude Code to read at once, but it adapted by using grep and reading specific line ranges
- XML chapters — made it trivial to target specific sections (e.g., "analyse chapter 10 and 11")
- Follow-up questions — improved answer completeness. In an app this could be handled by prompt engineering.
Claude Code can only read 25,000 tokens at a time, but the Anthropic API has a 200,000-token context window, so we still don't need RAG (yet).
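Chapter elements also make programmatic targeting easy. A sketch using Python's standard library, with a condensed stand-in for a generated file (structure taken from the examples above):

```python
import xml.etree.ElementTree as ET

# Condensed stand-in for a generated transcript file; real files carry
# full metadata and many more lines.
XML = """<transcript video_title="" video_published="" video_duration="" video_url="">
  <chapters>
    <chapter title="Introduction to Cows" start_time="0:02">
0:02 Welcome to this talk about erm.. er
    </chapter>
    <chapter title="Washing the cow" start_time="15:45">
15:45 First, we'll start with the patches
    </chapter>
  </chapters>
</transcript>"""

root = ET.fromstring(XML)
# Attribute predicate: fetch only the chapter we care about
chapter = root.find(".//chapter[@title='Washing the cow']")
print(chapter.get("start_time"))  # 15:45
print(chapter.text.strip())
```

This is exactly the property the demo exploits: a chapter can be addressed by title without reading the whole transcript.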
- Architecture: Pure functions with clear module separation
- Key Modules: See CLAUDE.md
- Dependencies: Python 3.14+, yt-dlp for YouTube downloads (see pyproject.toml)
- Python Package Management: UV
- Test-Driven Development: 125 tests (19 slow, 106 unit)
- Terminology: Uses TRANSCRIPT terminology throughout codebase, see docs/terminology.md
🤖 Repo 100% generated by Claude Code — every single line.
Setup:
git clone https://github.com/michellepace/youtube-to-xml.git
cd youtube-to-xml
uv sync
uv run pre-commit install
uv run pre-commit install --hook-type pre-push

Code Quality:
uv run ruff check --fix # Lint and auto-fix (see pyproject.toml)
uv run ruff format # Format code (see pyproject.toml)

Testing:
uv run pytest # All tests
uv run pytest -m "slow" # Only slow tests (internet required)
uv run pytest -m "not slow" # All tests except slow tests
uv run pre-commit run --all-files # (see .pre-commit-config.yaml)
Counted by my plot-py-repo tool
youtube-to-xml CLI
│
┌──────┴───────┐
│ cli.py │
│ (auto-detect)│
└──────┬───────┘
│
┌────────────┴────────────┐
│ │
[URL Input] [File Input]
│ │
┌───────▼────────┐ ┌────────▼────────┐
│ url_parser.py │ │ file_parser.py │
│ │ │ │
│ • yt-dlp API │ │ • Pattern match │
│ • JSON3 parse │ │ • Chapter rules │
│ • Metadata │ │ • Empty metadata│
└───────┬────────┘ └────────┬────────┘
│ │
└────────────┬────────────┘
│
┌──────▼──────┐
│ models.py │
│ │
│TranscriptDoc│
│ Chapters │
│ Metadata │
└──────┬──────┘
│
┌──────▼──────────┐
│ xml_builder.py │
│ │
│ • Format times │
│ • Build XML tree│
└──────┬──────────┘
│
┌────▼────┐
│ XML File│
└─────────┘
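The shared model in the middle of the diagram could look roughly like this (a sketch only; the actual definitions live in models.py):

```python
from dataclasses import dataclass, field

@dataclass
class Chapter:
    title: str
    start_time: str
    lines: list[str] = field(default_factory=list)  # "0:02 Welcome..." entries

@dataclass
class TranscriptDocument:
    """Common shape both url_parser and file_parser produce for xml_builder."""
    video_title: str = ""      # empty for file input (no metadata available)
    video_published: str = ""
    video_duration: str = ""
    video_url: str = ""
    chapters: list[Chapter] = field(default_factory=list)
```

Both parsers converging on one data model is what lets xml_builder.py stay ignorant of where the transcript came from.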
Each transcript line now places the timestamp and text on the same line, rather than on separate lines:
Before (separate lines)
0:02
Welcome to this talk about cows
2:30
Let's start with the fundamentals
15:45
First, we'll start with the patches
After (same line)
0:02 Welcome to this talk about cows
2:30 Let's start with the fundamentals
15:45 First, we'll start with the patches
Why? The primary consumer of these transcripts is an LLM agent (e.g. Claude Code) that navigates large files by searching for keywords and reading line ranges. With inline timestamps, every search hit is a self-contained record — the agent immediately knows what was said and when, in a single operation. No follow-up read to find the timestamp on the line above.
Searching a transcript with thousands of lines: the separate-lines format requires a second lookup to find each hit's timestamp, while the same-line format returns a complete record in one operation.
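A toy illustration of that difference (hypothetical helper, not project code): one grep-style search over the same-line format already yields complete timestamped records.

```python
def search(transcript: str, keyword: str) -> list[str]:
    """Return every line containing the keyword, grep-style."""
    return [line for line in transcript.splitlines() if keyword in line]

same_line = """0:02 Welcome to this talk about cows
2:30 Let's start with the fundamentals
15:45 First, we'll start with the patches"""

# One search, complete records: each hit already carries its timestamp.
print(search(same_line, "fundamentals"))
```

With the separate-lines format the same search would return only the text line, forcing a second read of the line above it to recover the timestamp.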
Evals To Do (transcript.txt vs transcript.xml):
- Build a Shiny for Python app to use Hamel's simple error analysis approach
- But I don't like Hamel's binary approach; what about a Six Sigma ordinal data approach like in docs/idea-evals.md?
- Automate the evals with pytest as far as possible, using LLM-as-a-Judge for the rest
- If XML is the winner, try tweaking the XML structure to improve it, for example this: whitespace, more tags, or maybe JSON?
- But now I've got a cost problem because the XML sits in the context window. So can RAG perform equally well, and as fast?
- Can a cheaper model perform equally well, like Haiku over Sonnet (for some things)?
- At some point I'm going to have to head over to BrainTrust.dev - use an agnostic SDK?
Learnings To Carry Over:
- Use CodeRabbit for PR review to improve code
- Use Claude Code Docs so Claude Code knows what it can do
- Use Claude Code Project Index so Claude Code sees entire project easily
- Manage MCPs nicely: constrain what you use, put API keys in one place
- Git branch workflow: try to put everything on a purposeful branch
- Always use strict linting and typing, and enforce them in a pre-commit hook
- Always do test-driven development; manual LLM testing is useful too
- Manage LLM Context: set terminology, use clear naming, keep docstrings/comments accurate, /clear Claude Code at 60% of the context window
Open Questions:
