Plasmate Python Toolkit

Simple, powerful web scraping with Plasmate's Semantic Object Model.

This toolkit provides a clean Python interface to the Plasmate CLI for developers who want structured web data without the overhead of a full framework like Scrapy.

The Problem

Traditional scraping with libraries like requests and BeautifulSoup requires you to write complex, site-specific parsers that break whenever a CSS class or HTML tag changes.

# The old way: brittle, complex, high-maintenance
import requests
from bs4 import BeautifulSoup

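url = 'https://news.ycombinator.com/'
response = requests.get(url, timeout=30)
response.raise_for_status()  # Fail loudly on HTTP errors
soup = BeautifulSoup(response.text, 'html.parser')

stories = []
for row in soup.select('tr.athing'):
    # This selector is tied to the current markup; any change breaks it
    title_el = row.select_one('td.title > span.titleline > a')
    if title_el is None:
        continue  # Markup changed or the row has no title link
    stories.append({
        'title': title_el.text,
        'url': title_el.get('href'),
    })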

Result: You get fragile code and raw HTML. Processing this with an LLM is expensive: a typical news homepage can be 10,000-20,000 tokens.

The Solution

Plasmate handles fetching and parsing, giving you a clean Semantic Object Model (SOM). This library makes it trivial to use Plasmate in any Python script.

# The new way: simple, robust, low-maintenance
from plasmate_scraper import fetch, extract_links

som = fetch("https://news.ycombinator.com/")
links = extract_links(som)

for link in links:
    print(f"{link['text']} -> {link['url']}")

Result: You get clean data and far fewer tokens in LLM pipelines. The same homepage as a Plasmate SOM is often 1,000-2,000 tokens, roughly an order of magnitude smaller than the raw HTML.

Installation

pip install plasmate-scraper

You also need the Plasmate CLI:

# macOS
brew install nicholasgasior/plasmate/plasmate

# From source
cargo install plasmate

# Or download from https://github.com/nicholasgasior/plasmate/releases

Quick Start

Fetch a single page

from plasmate_scraper import fetch, extract_text

# Fetch the page and get the SOM
som = fetch("https://example.com")

# The SOM is a dictionary
print(som['title'])
# > Example Domain

# Use helpers to extract common data
text = extract_text(som)
print(text)
# > Example Domain
# > This domain is for use in illustrative examples.
# > More information...

Fetch multiple pages concurrently

from plasmate_scraper import batch_fetch

urls = [
    "https://news.ycombinator.com",
    "https://github.com/explore",
    "https://dev.to",
]

# Fetches all pages in parallel
results = batch_fetch(urls, max_concurrent=10)

for som in results:
    if 'error' in som:
        print(f"Failed to fetch {som['url']}: {som['error']}")
    else:
        print(f"Fetched: {som.get('title', 'Untitled')}")
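If you need successes and failures separately, the loop above can be factored into a small helper. The following is a minimal sketch, not part of the library: `split_results` is a hypothetical name, and the sample data is a stand-in shaped like the error dicts shown above.

```python
def split_results(results):
    """Separate batch results into successes and failures.

    Assumes the failure shape used in the loop above: a dict that
    contains an 'error' key (plus the originating 'url').
    """
    succeeded = [som for som in results if "error" not in som]
    failed = [som for som in results if "error" in som]
    return succeeded, failed


# Stand-in data shaped like batch_fetch output; not real fetch results.
sample = [
    {"title": "Example Domain"},
    {"url": "https://invalid.example", "error": "timeout"},
]
ok, bad = split_results(sample)
print(len(ok), len(bad))
```

Keeping the failed entries around makes it easy to retry only those URLs in a second pass.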

API Reference

`fetch()`

fetch(
    url: str,
    *,
    timeout: int = 30,
    javascript: bool = True,
    format: str = "json",
    binary: str = "plasmate",
    extra_args: list[str] | None = None,
) -> dict

Fetches a URL and returns the parsed SOM. Raises PlasmateError on failure.
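Because `fetch()` raises on failure, transient errors are usually handled by the caller. The sketch below shows one way to retry with exponential backoff. It is an illustration rather than a library feature: `fetch_with_retry` is a hypothetical helper, and the fetch function and exception type are passed in as parameters so the example runs without the package installed. In real use you would pass `fetch` and `PlasmateError` (its import path is not shown in this README).

```python
import time


def fetch_with_retry(fetch_fn, url, *, attempts=3, base_delay=1.0,
                     error_type=Exception, sleep=time.sleep):
    """Call fetch_fn(url), retrying on error_type with exponential backoff."""
    if attempts < 1:
        raise ValueError("attempts must be at least 1")
    for attempt in range(attempts):
        try:
            return fetch_fn(url)
        except error_type:
            if attempt == attempts - 1:
                raise  # Out of attempts: surface the last error.
            # Wait base_delay, then 2x, then 4x, ... between attempts.
            sleep(base_delay * 2 ** attempt)


# Stand-in fetcher that fails twice and then succeeds (hypothetical).
calls = {"n": 0}


def flaky_fetch(url):
    calls["n"] += 1
    if calls["n"] < 3:
        raise RuntimeError("temporary failure")
    return {"title": "Example Domain", "url": url}


som = fetch_with_retry(flaky_fetch, "https://example.com",
                       error_type=RuntimeError, sleep=lambda seconds: None)
print(som["title"])
```

The `sleep` parameter exists only so the delay can be replaced in tests; the default is `time.sleep`.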

`batch_fetch()`

batch_fetch(
    urls: list[str],
    *,
    max_concurrent: int = 5,
    raise_on_error: bool = False,
    # ... accepts same args as fetch()
) -> list[dict]

Fetches multiple URLs in parallel. If raise_on_error is False (default), failures are returned as dicts like {'url': '...', 'error': '...'}.
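To make the `raise_on_error` behavior concrete, here is a rough sketch of how a function with these semantics could be built on `concurrent.futures.ThreadPoolExecutor`. This is not the library's actual implementation: `batch_fetch_sketch` and `fake_fetch` are hypothetical names, and the fetch function is passed in so the example is self-contained. The sketch returns results in input order; the README does not state whether `batch_fetch()` guarantees that.

```python
from concurrent.futures import ThreadPoolExecutor


def batch_fetch_sketch(fetch_fn, urls, *, max_concurrent=5, raise_on_error=False):
    """Fetch urls in parallel with at most max_concurrent worker threads."""

    def fetch_one(url):
        try:
            return fetch_fn(url)
        except Exception as exc:
            if raise_on_error:
                raise
            # Mirror the documented failure shape.
            return {"url": url, "error": str(exc)}

    with ThreadPoolExecutor(max_workers=max_concurrent) as pool:
        # pool.map yields results in the same order as the input urls.
        return list(pool.map(fetch_one, urls))


# Stand-in for fetch(); no network access involved.
def fake_fetch(url):
    if "bad" in url:
        raise ValueError("could not fetch")
    return {"url": url, "title": url.upper()}


results = batch_fetch_sketch(fake_fetch, ["https://a.example", "https://bad.example"])
print(results)
```

With `raise_on_error=True`, the first exception propagates out of `list(pool.map(...))` instead of being converted to an error dict.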

Utility Functions

All utilities take a SOM dictionary as input.

from plasmate_scraper import (
    extract_text,      # All text content as a string
    extract_links,     # [{'url': '...', 'text': '...'}]
    extract_headings,  # [{'level': 1, 'text': '...'}]
    extract_tables,    # Table regions/elements from the SOM
    extract_images,    # [{'src': '...', 'alt': '...'}]
    extract_by_role,   # Filter elements by SOM role
)
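As a rough illustration of what role-based filtering involves, the sketch below walks a nested dictionary and collects every node with a given role. The SOM schema is not documented in this README, so everything about the data shape here is an assumption: the `role`, `regions`, and `elements` keys and the `find_by_role` helper are invented for the example and may not match what `extract_by_role` actually does.

```python
def find_by_role(node, role):
    """Collect every dict in a nested structure whose 'role' equals role.

    ASSUMPTION: nodes are dicts with an optional 'role' key, nested via
    dict values and lists. The real SOM schema may differ.
    """
    matches = []
    if isinstance(node, dict):
        if node.get("role") == role:
            matches.append(node)
        for value in node.values():
            matches.extend(find_by_role(value, role))
    elif isinstance(node, list):
        for item in node:
            matches.extend(find_by_role(item, role))
    return matches


# Hypothetical SOM-like data, invented for this example.
toy_som = {
    "title": "Example Domain",
    "regions": [
        {"role": "main", "elements": [
            {"role": "heading", "text": "Example Domain"},
            {"role": "link", "text": "More information...",
             "url": "https://example.org/more"},
        ]},
    ],
}
links = find_by_role(toy_som, "link")
print([link["text"] for link in links])
```

In practice, prefer the library's own helpers, which know the real SOM layout.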

Comparison to Alternatives

| Feature | `requests` + `bs4` | `playwright` | `plasmate-scraper` |
| --- | --- | --- | --- |
| Parsing | Manual (CSS/XPath) | Manual (CSS/XPath) | Automatic (SOM) |
| Resilience | Low (breaks easily) | Low (breaks easily) | High (semantic) |
| JS Support | No | Yes | Yes (default) |
| Concurrency | Manual (e.g., `ThreadPool`) | Manual | Built-in (`batch_fetch`) |
| LLM-ready | No (too verbose) | No (too verbose) | Yes (token-efficient) |

License

Apache 2.0 — see LICENSE.

Part of the Plasmate Ecosystem


Engine	plasmate - The browser engine for agents
MCP	plasmate-mcp - Claude Code, Cursor, Windsurf
Extension	plasmate-extension - Chrome cookie export
SDKs	Python / Node.js / Go / Rust
Frameworks	LangChain / CrewAI / AutoGen / Smolagents
Tools	Scrapy / Audit / A11y / GitHub Action
Resources	Awesome Plasmate / Notebooks / Benchmarks
Docs	docs.plasmate.app
W3C	Web Content Browser for AI Agents
