Simple, powerful web scraping with Plasmate's Semantic Object Model.
This toolkit provides a clean Python interface to the Plasmate CLI for developers who want structured web data without the overhead of a full framework like Scrapy.
Traditional scraping with libraries like requests and BeautifulSoup requires you to write complex, site-specific parsers that break whenever a CSS class or HTML tag changes.
# The old way: brittle, complex, high-maintenance
import requests
from bs4 import BeautifulSoup
url = 'https://news.ycombinator.com/'
response = requests.get(url)
soup = BeautifulSoup(response.text, 'html.parser')
stories = []
for row in soup.select('tr.athing'):
    title_el = row.select_one('td.title > span.titleline > a')
    stories.append({
        'title': title_el.text,
        'url': title_el.get('href'),
    })

Result: You get fragile code and raw HTML. Processing this with an LLM is expensive: a typical news homepage can be 10,000-20,000 tokens.
Plasmate handles fetching and parsing, giving you a clean Semantic Object Model (SOM). This library makes it trivial to use Plasmate in any Python script.
# The new way: simple, robust, low-maintenance
from plasmate_scraper import fetch, extract_links
som = fetch("https://news.ycombinator.com/")
links = extract_links(som)
for link in links:
    print(f"{link['text']} -> {link['url']}")

Result: You get clean data and far lower token usage in LLM pipelines. The same homepage as a Plasmate SOM is often 1,000-2,000 tokens, roughly an order of magnitude smaller.
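To check the size difference on your own pages, one option is to compare approximate token counts for the raw HTML and the serialized SOM. The sketch below assumes the common rule of thumb of about four characters per token, which is only an approximation and will not match any particular tokenizer. The helper names and the sample data are hypothetical and are not part of plasmate-scraper.

```python
import json

# Assumption: ~4 characters per token. This is a rough heuristic only.
CHARS_PER_TOKEN = 4


def estimate_tokens(text: str) -> int:
    """Return a rough token estimate for a string (0 for empty input)."""
    if not text:
        return 0
    return max(1, len(text) // CHARS_PER_TOKEN)


def compare_sizes(html: str, som: dict) -> dict:
    """Compare estimated token counts for raw HTML and a serialized SOM."""
    html_tokens = estimate_tokens(html)
    som_tokens = estimate_tokens(json.dumps(som, separators=(",", ":")))
    return {
        "html_tokens": html_tokens,
        "som_tokens": som_tokens,
        "ratio": html_tokens / som_tokens,
    }


# Hypothetical sample data standing in for a fetched page and its SOM.
sample_html = "<tr class='athing'><td class='title'><a href='#'>Story</a></td></tr>" * 50
sample_som = {"title": "Example", "links": [{"url": "#", "text": "Story"}] * 50}

report = compare_sizes(sample_html, sample_som)
print(report)
```

For real measurements, swap the heuristic for the tokenizer of the model you actually use.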
pip install plasmate-scraper

You also need the Plasmate CLI:
# macOS
brew install nicholasgasior/plasmate/plasmate
# From source
cargo install plasmate
# Or download from https://github.com/nicholasgasior/plasmate/releases

from plasmate_scraper import fetch, extract_text
# Fetch the page and get the SOM
som = fetch("https://example.com")
# The SOM is a dictionary
print(som['title'])
# > Example Domain
# Use helpers to extract common data
text = extract_text(som)
print(text)
# > Example Domain
# > This domain is for use in illustrative examples.
# > More information...

from plasmate_scraper import batch_fetch
urls = [
    "https://news.ycombinator.com",
    "https://github.com/explore",
    "https://dev.to",
]
# Fetches all pages in parallel
results = batch_fetch(urls, max_concurrent=10)
for som in results:
    if 'error' in som:
        print(f"Failed to fetch {som['url']}: {som['error']}")
    else:
        print(f"Fetched: {som.get('title', 'Untitled')}")

fetch(
    url: str,
    *,
    timeout: int = 30,
    javascript: bool = True,
    format: str = "json",
    binary: str = "plasmate",
    extra_args: list[str] | None = None,
) -> dict

Fetches a URL and returns the parsed SOM. Raises PlasmateError on failure.
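Because a failed fetch raises an exception, callers may want to retry transient failures. The following is a minimal sketch of a generic retry wrapper with exponential backoff. `with_retries`, `FakeFetchError`, and `flaky_fetch` are hypothetical names used for illustration and are not part of plasmate-scraper; the fake fetch stands in for the real call so the example runs without the CLI installed.

```python
from __future__ import annotations

import time
from typing import Callable, TypeVar

T = TypeVar("T")


def with_retries(
    call: Callable[[], T],
    *,
    retry_on: type[BaseException] | tuple[type[BaseException], ...] = Exception,
    attempts: int = 3,
    base_delay: float = 1.0,
    sleep: Callable[[float], None] = time.sleep,
) -> T:
    """Run call(), retrying on retry_on with exponential backoff."""
    if attempts < 1:
        raise ValueError("attempts must be at least 1")
    for attempt in range(1, attempts + 1):
        try:
            return call()
        except retry_on:
            if attempt == attempts:
                raise  # out of attempts: surface the last error
            sleep(base_delay * 2 ** (attempt - 1))
    raise AssertionError("unreachable")


class FakeFetchError(Exception):
    """Stand-in for the library's error type in this demo."""


calls = {"count": 0}


def flaky_fetch() -> dict:
    """Simulated fetch that fails twice, then succeeds."""
    calls["count"] += 1
    if calls["count"] < 3:
        raise FakeFetchError("simulated failure")
    return {"title": "Example Domain"}


delays: list[float] = []  # record delays instead of actually sleeping
som = with_retries(
    flaky_fetch, retry_on=FakeFetchError, base_delay=0.5, sleep=delays.append
)
print(som["title"], delays)
```

With the real library, the call would presumably look like `with_retries(lambda: fetch(url), retry_on=PlasmateError)`, assuming `PlasmateError` can be imported from `plasmate_scraper`. The source does not state its import path.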
batch_fetch(
    urls: list[str],
    *,
    max_concurrent: int = 5,
    raise_on_error: bool = False,
    # ... accepts same args as fetch()
) -> list[dict]

Fetches multiple URLs in parallel. If raise_on_error is False (default), failures are returned as dicts like {'url': '...', 'error': '...'}.
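Since failures come back in the same list as successful results, it can help to separate the two before further processing. The sketch below relies only on the failure shape described above (a dict with an 'error' key). `partition_results` is a hypothetical helper, and the sample list is made-up data rather than real `batch_fetch` output.

```python
def partition_results(results: list[dict]) -> tuple[list[dict], list[dict]]:
    """Split batch results into (successes, failures) by the 'error' key."""
    successes: list[dict] = []
    failures: list[dict] = []
    for item in results:
        if "error" in item:
            failures.append(item)
        else:
            successes.append(item)
    return successes, failures


# Made-up sample shaped like the results described above.
sample_results = [
    {"title": "Example Domain"},
    {"url": "https://invalid.example", "error": "timeout"},
    {"title": "Another Page"},
]

ok, failed = partition_results(sample_results)

# URLs that could be passed to a second batch_fetch call as a retry.
retry_urls = [item["url"] for item in failed]

print(f"{len(ok)} succeeded, {len(failed)} failed")
print("Retry candidates:", retry_urls)
```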
All utilities take a SOM dictionary as input.
from plasmate_scraper import (
    extract_text,      # All text content as a string
    extract_links,     # [{'url': '...', 'text': '...'}]
    extract_headings,  # [{'level': 1, 'text': '...'}]
    extract_tables,    # Table regions/elements from the SOM
    extract_images,    # [{'src': '...', 'alt': '...'}]
    extract_by_role,   # Filter elements by SOM role
)

| Feature | requests + bs4 | playwright | plasmate-scraper |
|---|---|---|---|
| Parsing | Manual (CSS/XPath) | Manual (CSS/XPath) | Automatic (SOM) |
| Resilience | Low (breaks easily) | Low (breaks easily) | High (semantic) |
| JS Support | No | Yes | Yes (default) |
| Concurrency | Manual (e.g., ThreadPool) | Manual | Built-in (batch_fetch) |
| LLM-ready | No (too verbose) | No (too verbose) | Yes (token-efficient) |
Apache 2.0 — see LICENSE.
| Area | Projects |
|---|---|
| Engine | plasmate - The browser engine for agents |
| MCP | plasmate-mcp - Claude Code, Cursor, Windsurf |
| Extension | plasmate-extension - Chrome cookie export |
| SDKs | Python / Node.js / Go / Rust |
| Frameworks | LangChain / CrewAI / AutoGen / Smolagents |
| Tools | Scrapy / Audit / A11y / GitHub Action |
| Resources | Awesome Plasmate / Notebooks / Benchmarks |
| Docs | docs.plasmate.app |
| W3C | Web Content Browser for AI Agents |