# docpull

Security-hardened, browser-free crawler that turns static documentation sites into clean, AI-ready Markdown — fast.

docpull uses async HTTP (not Playwright) to fetch server-rendered pages, extracts main content, and writes clean Markdown with source-URL frontmatter — in seconds, with a small install footprint. It won't render JavaScript, but for the large class of docs that don't need it (API references, Python/Go stdlib, most dev-tool docs, OpenAPI specs, Next.js and Docusaurus builds), it is a fast, auditable, sandbox-friendly way to pipe documentation into an LLM context, a RAG index, or an offline archive. SSRF, XXE, DNS-rebinding, and CRLF-injection protections are on by default — a necessity when an AI agent is choosing the URLs.
## Install

```shell
pip install docpull

# Optional extras
pip install 'docpull[llm]'          # tiktoken for token-accurate chunking
pip install 'docpull[trafilatura]'  # alternative extractor for noisy pages
pip install 'docpull[mcp]'          # run as an MCP server for AI agents
pip install 'docpull[all]'          # everything above
```

## Quick start

```shell
# Crawl and save Markdown
docpull https://docs.example.com

# One page, no crawl — the fast path for agents
docpull https://docs.example.com/guide --single

# LLM-ready NDJSON with 4k-token chunks streamed to stdout
docpull https://docs.example.com --profile llm --stream | jq .

# Mirror a site for offline use
docpull https://docs.example.com --profile mirror --cache
```

## Framework-aware extraction

docpull inspects each page before running the generic extractor and can pull content directly from framework data feeds:
| Framework | Strategy |
|---|---|
| Next.js | Parses `__NEXT_DATA__` JSON |
| Mintlify | `__NEXT_DATA__` with Mintlify tagging |
| OpenAPI | Renders `openapi.json` / `swagger.json` into Markdown |
| Docusaurus | Detected and tagged; generic extractor produces Markdown |
| Sphinx | Detected and tagged; generic extractor produces Markdown |
JS-only SPAs with no server-rendered content are detected and skipped with a clear reason (or, with `--strict-js-required`, reported as an error so agents can route elsewhere).
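As an illustration of the Next.js strategy in the table, extracting `__NEXT_DATA__` boils down to reading one JSON script tag. This stdlib-only sketch shows the idea; it is not docpull's actual code:

```python
import json
from html.parser import HTMLParser

class NextDataParser(HTMLParser):
    """Collect the JSON payload of <script id="__NEXT_DATA__">...</script>."""

    def __init__(self):
        super().__init__()
        self._capturing = False
        self._buf = []
        self.payload = None

    def handle_starttag(self, tag, attrs):
        # attrs is a list of (name, value) pairs; match the Next.js data script.
        if tag == "script" and ("id", "__NEXT_DATA__") in attrs:
            self._capturing = True

    def handle_data(self, data):
        if self._capturing:
            self._buf.append(data)

    def handle_endtag(self, tag):
        if tag == "script" and self._capturing:
            self.payload = json.loads("".join(self._buf))
            self._capturing = False

html_page = (
    '<html><body>'
    '<script id="__NEXT_DATA__" type="application/json">'
    '{"props": {"pageProps": {"title": "Guide"}}}'
    '</script></body></html>'
)
parser = NextDataParser()
parser.feed(html_page)
print(parser.payload["props"]["pageProps"]["title"])  # Guide
```

Once the payload is parsed, page content sits in ordinary JSON fields rather than rendered HTML, which is why no browser is needed.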
## Agent-friendly flags

- `--single` — fetch a single URL without discovery. Designed for tool loops.
- `--stream` — NDJSON, one record per line, flushed on every page, pipeable.
- `--max-tokens-per-file N` — split each page into token-bounded chunks on heading boundaries (exact counts with tiktoken, an estimate without).
- `--emit-chunks` — write one file or record per chunk instead of per page.
- `--strict-js-required` — hard-fail on JS-only pages instead of silently skipping.
- `--extractor trafilatura` — swap in trafilatura for sites where the default heuristics struggle.
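A rough sketch of the idea behind `--max-tokens-per-file`: split on heading boundaries, then pack sections under a token budget. The 4-characters-per-token fallback and the function names here are illustrative assumptions, not docpull's actual implementation:

```python
import re

def estimate_tokens(text: str) -> int:
    # Rough fallback when tiktoken is unavailable: ~4 characters per token.
    return max(1, len(text) // 4)

def chunk_on_headings(markdown: str, max_tokens: int) -> list[str]:
    # Split at Markdown headings (zero-width lookahead keeps the heading with
    # its section), then greedily pack sections under the token budget.
    sections = re.split(r"(?m)^(?=#{1,6} )", markdown)
    chunks: list[str] = []
    current: list[str] = []
    current_tokens = 0
    for section in sections:
        if not section:
            continue
        t = estimate_tokens(section)
        if current and current_tokens + t > max_tokens:
            chunks.append("".join(current))
            current, current_tokens = [], 0
        current.append(section)
        current_tokens += t
    if current:
        chunks.append("".join(current))
    return chunks

doc = "# A\n" + "x" * 300 + "\n# B\n" + "y" * 300 + "\n"
for chunk in chunk_on_headings(doc, max_tokens=80):
    print(chunk.splitlines()[0])  # each chunk starts at a heading
```

Splitting only at headings means a single oversized section is never broken mid-paragraph; it simply becomes its own chunk.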
## Python API

One-shot fetch:

```python
from docpull import fetch_one

ctx = fetch_one("https://docs.python.org/3/library/asyncio.html")
print(ctx.title, ctx.source_type)
print(ctx.markdown[:500])
```

Async streaming:
```python
import asyncio

from docpull import Fetcher, DocpullConfig, ProfileName, EventType

async def main():
    cfg = DocpullConfig(
        url="https://docs.example.com",
        profile=ProfileName.LLM,  # chunked NDJSON output
    )
    async with Fetcher(cfg) as fetcher:
        async for event in fetcher.run():
            if event.type == EventType.FETCH_PROGRESS:
                print(f"{event.current}/{event.total}: {event.url}")
    print(f"Done: {fetcher.stats.pages_fetched} pages")

asyncio.run(main())
```

Single-page fetch from an agent tool:
```python
from docpull import Fetcher, DocpullConfig

async def tool_call(url: str) -> str:
    async with Fetcher(DocpullConfig(url=url)) as f:
        ctx = await f.fetch_one(url, save=False)
        return ctx.markdown or ctx.error or ""
```

## Profiles

```shell
docpull https://site.com --profile rag     # Default. Dedup, rich metadata.
docpull https://site.com --profile llm     # NDJSON + chunks + metadata.
docpull https://site.com --profile mirror  # Full archive, polite, cached.
docpull https://site.com --profile quick   # Sampling: 50 pages, depth 2.
```

## MCP server

docpull ships an MCP (Model Context Protocol) server so AI agents can call it directly over stdio:
```shell
pip install 'docpull[mcp]'
docpull mcp  # starts the stdio server
```

Add to Claude Desktop or Claude Code:

```json
{
  "mcpServers": {
    "docpull": {
      "command": "docpull",
      "args": ["mcp"]
    }
  }
}
```

Tools exposed:

- `fetch_url(url, max_tokens?)` — one-shot fetch, no crawl
- `ensure_docs(source, force?)` — fetch a named library (cached 7 days)
- `list_sources(category?)` — show available aliases (react, nextjs, fastapi, …)
- `list_indexed()` — what has been fetched locally
- `grep_docs(pattern, library?)` — regex search across fetched Markdown
User-defined sources live in `~/.config/docpull-mcp/sources.yaml`:
```yaml
sources:
  mydocs:
    url: https://docs.example.com
    description: My internal docs
    category: internal
    maxPages: 200
```

## Output

Markdown files with YAML frontmatter:
```markdown
---
title: "Getting Started"
source: https://docs.example.com/guide
source_type: "nextjs"
---

# Getting Started
…
```

NDJSON (one record per page or chunk):

```json
{"url": "...", "title": "...", "content": "...", "hash": "...", "token_count": 842, "chunk_index": 0}
```

## Security

- HTTPS-only, mandatory robots.txt compliance
- SSRF protection: blocks private/internal network IPs, DNS rebinding
- XXE protection via `defusedxml` on sitemaps
- Path traversal and CRLF header injection guards
- Auth headers stripped on cross-origin redirects
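As a sketch of what the SSRF guard amounts to (not docpull's actual implementation): resolve the hostname first, validate every resolved address against internal ranges, and then connect only to addresses that passed the check, which is also what defeats DNS rebinding:

```python
import ipaddress
import socket

def is_safe_host(hostname: str) -> bool:
    """Resolve a hostname and reject any address in private/internal ranges."""
    try:
        infos = socket.getaddrinfo(hostname, None)
    except socket.gaierror:
        return False  # unresolvable hosts are rejected, not retried
    for info in infos:
        ip = ipaddress.ip_address(info[4][0])
        # Reject loopback (127.0.0.0/8, ::1), RFC 1918 ranges,
        # link-local (169.254.0.0/16), and reserved blocks.
        if ip.is_private or ip.is_loopback or ip.is_link_local or ip.is_reserved:
            return False
    return True

print(is_safe_host("localhost"))  # False: resolves to a loopback address
```

The key point is that the address that was checked must be the address that is used; re-resolving at connect time would reopen the rebinding window.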
## CLI options

Run `docpull --help` for the full list. Highlights:

```text
Core:
  --profile {rag,mirror,quick,llm,custom}
  --single                 Fetch one URL (no crawl)
  --format {markdown,json,ndjson,sqlite}
  --stream                 Stream NDJSON to stdout

LLM / chunking:
  --max-tokens-per-file N
  --tokenizer NAME         tiktoken encoding (default cl100k_base)
  --emit-chunks            One file/record per chunk

Content extraction:
  --extractor {default,trafilatura}
  --no-special-cases       Disable framework extractors
  --strict-js-required     Error on JS-only pages

Cache:
  --cache                  Enable incremental updates
  --cache-dir DIR
  --cache-ttl DAYS
```
## Troubleshooting

```shell
docpull --doctor            # Check installation
docpull URL --verbose       # Verbose output
docpull URL --dry-run       # Test without downloading
docpull URL --preview-urls  # List URLs without fetching
```

## License

MIT