plasmate-labs/plasmate-python

Plasmate Python Toolkit

Simple, powerful web scraping with Plasmate's Semantic Object Model.

This toolkit provides a clean Python interface to the Plasmate CLI for developers who want structured web data without the overhead of a full framework like Scrapy.

The Problem

Traditional scraping with libraries like requests and BeautifulSoup requires you to write complex, site-specific parsers that break whenever a CSS class or HTML tag changes.

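# The old way: brittle, complex, high-maintenance
import requests
from bs4 import BeautifulSoup

url = 'https://news.ycombinator.com/'
response = requests.get(url, timeout=30)
response.raise_for_status()  # Fail fast on HTTP errors
soup = BeautifulSoup(response.text, 'html.parser')

stories = []
for row in soup.select('tr.athing'):
    # These selectors mirror the site's markup and break when it changes
    title_el = row.select_one('td.title > span.titleline > a')
    if title_el is None:
        continue  # Row has no title link (or the markup changed)
    stories.append({
        'title': title_el.text,
        'url': title_el.get('href'),
    })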

Result: You get fragile code and raw HTML. Processing this with an LLM is expensive: a typical news homepage can be 10,000-20,000 tokens.

The Solution

Plasmate handles fetching and parsing, giving you a clean Semantic Object Model (SOM). This library makes it trivial to use Plasmate in any Python script.

# The new way: simple, robust, low-maintenance
from plasmate_scraper import fetch, extract_links

som = fetch("https://news.ycombinator.com/")
links = extract_links(som)

for link in links:
    print(f"{link['text']} -> {link['url']}")

Result: You get clean data and far fewer tokens in LLM pipelines. The same homepage as a Plasmate SOM is often 1,000-2,000 tokens, roughly an order of magnitude smaller than the raw HTML.
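To check the token claim on your own pages, a rough estimate is usually enough. The sketch below assumes a rule of thumb of about four characters per token, which is an approximation and not a real tokenizer. The helper names `estimate_tokens` and `compare_sizes` are hypothetical and not part of this library.

```python
import json


def estimate_tokens(text: str, chars_per_token: float = 4.0) -> int:
    """Rough token estimate. chars_per_token is a rule-of-thumb assumption."""
    if not text:
        return 0
    return max(1, round(len(text) / chars_per_token))


def compare_sizes(raw_html: str, som: dict) -> dict:
    """Compare estimated token counts for raw HTML and a SOM dictionary."""
    html_tokens = estimate_tokens(raw_html)
    # The SOM is a dictionary, so serialize it the way you would for a prompt.
    som_tokens = estimate_tokens(json.dumps(som, ensure_ascii=False))
    ratio = html_tokens / som_tokens if som_tokens else float("inf")
    return {"html_tokens": html_tokens, "som_tokens": som_tokens, "ratio": ratio}
```

For exact numbers, swap the heuristic for the tokenizer of the model you actually use.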

Installation

pip install plasmate-scraper

You also need the Plasmate CLI:

# macOS
brew install nicholasgasior/plasmate/plasmate

# From source
cargo install plasmate

# Or download from https://github.com/nicholasgasior/plasmate/releases
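Because the library shells out to the CLI, it can help to confirm the binary is on your PATH before the first fetch. This is a minimal sketch using the standard library's `shutil.which`. The helper name `find_plasmate` is hypothetical, and the default name `"plasmate"` is taken from the `binary` parameter of `fetch()`.

```python
import shutil
from typing import Optional


def find_plasmate(binary: str = "plasmate") -> Optional[str]:
    """Return the full path to the Plasmate CLI, or None if it is not on PATH."""
    return shutil.which(binary)


path = find_plasmate()
if path is None:
    print("Plasmate CLI not found on PATH; install it before calling fetch().")
else:
    print(f"Using Plasmate CLI at {path}")
```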

Quick Start

Fetch a single page

from plasmate_scraper import fetch, extract_text

# Fetch the page and get the SOM
som = fetch("https://example.com")

# The SOM is a dictionary
print(som['title'])
# > Example Domain

# Use helpers to extract common data
text = extract_text(som)
print(text)
# > Example Domain
# > This domain is for use in illustrative examples.
# > More information...

Fetch multiple pages concurrently

from plasmate_scraper import batch_fetch

urls = [
    "https://news.ycombinator.com",
    "https://github.com/explore",
    "https://dev.to",
]

# Fetches the pages in parallel, at most 10 at a time
results = batch_fetch(urls, max_concurrent=10)

for som in results:
    if 'error' in som:
        print(f"Failed to fetch {som['url']}: {som['error']}")
    else:
        print(f"Fetched: {som.get('title', 'Untitled')}")

API Reference

fetch()

fetch(
    url: str,
    *,
    timeout: int = 30,
    javascript: bool = True,
    format: str = "json",
    binary: str = "plasmate",
    extra_args: list[str] | None = None,
) -> dict

Fetches a URL and returns the parsed SOM. Raises PlasmateError on failure.
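Since `fetch()` raises on failure, transient errors are often handled with a small retry wrapper. The sketch below is one way to do that and is not part of this library: `fetch_with_retry` is a hypothetical helper that takes the fetch function as an argument, so it makes no assumptions about where `PlasmateError` is imported from.

```python
import time


def fetch_with_retry(fetch_fn, url, *, attempts=3, backoff=1.0,
                     retry_on=(Exception,), sleep=time.sleep):
    """Call fetch_fn(url), retrying with exponential backoff on failure.

    retry_on is the tuple of exception types worth retrying; pass the
    library's PlasmateError there if you only want to retry fetch failures.
    """
    if attempts < 1:
        raise ValueError("attempts must be at least 1")
    for attempt in range(1, attempts + 1):
        try:
            return fetch_fn(url)
        except retry_on:
            if attempt == attempts:
                raise  # Out of attempts: surface the last error
            # Wait backoff, 2*backoff, 4*backoff, ... between attempts
            sleep(backoff * 2 ** (attempt - 1))
```

In use, this would look like `fetch_with_retry(fetch, "https://example.com", retry_on=(PlasmateError,))`, assuming `PlasmateError` is importable alongside `fetch`.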

batch_fetch()

batch_fetch(
    urls: list[str],
    *,
    max_concurrent: int = 5,
    raise_on_error: bool = False,
    # ... accepts same args as fetch()
) -> list[dict]

Fetches multiple URLs in parallel. If raise_on_error is False (default), failures are returned as dicts like {'url': '...', 'error': '...'}.
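With the default `raise_on_error=False`, successes and failures arrive in the same list, so a common first step is to separate them. A minimal sketch, relying only on the documented `'error'` key; `split_results` is a hypothetical helper name.

```python
def split_results(results):
    """Split batch_fetch output into (successful SOMs, failure dicts)."""
    ok, failed = [], []
    for item in results:
        # Failures are dicts that carry an 'error' key, per the description above.
        (failed if "error" in item else ok).append(item)
    return ok, failed
```

The failed list keeps each `'url'`, which makes it easy to pass those URLs to a second `batch_fetch` call.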

Utility Functions

All utilities take a SOM dictionary as input.

from plasmate_scraper import (
    extract_text,      # All text content as a string
    extract_links,     # [{'url': '...', 'text': '...'}]
    extract_headings,  # [{'level': 1, 'text': '...'}]
    extract_tables,    # Table regions/elements from the SOM
    extract_images,    # [{'src': '...', 'alt': '...'}]
    extract_by_role,   # Filter elements by SOM role
)
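The output of these helpers is plain Python data, so post-processing needs nothing beyond the standard library. As an illustration, the sketch below resolves, filters, and de-duplicates the list returned by `extract_links`. It assumes links may be relative or carry fragments, which the SOM may or may not already normalize; `clean_links` is a hypothetical helper.

```python
from urllib.parse import urldefrag, urljoin, urlparse


def clean_links(links, base_url):
    """Resolve links against base_url, keep http(s) only, drop duplicates."""
    seen = set()
    cleaned = []
    for link in links:
        href = (link.get("url") or "").strip()
        if not href:
            continue
        # Make the URL absolute and strip any #fragment
        absolute, _fragment = urldefrag(urljoin(base_url, href))
        if urlparse(absolute).scheme not in ("http", "https"):
            continue  # Skip mailto:, javascript:, and similar
        if absolute in seen:
            continue
        seen.add(absolute)
        cleaned.append({"url": absolute, "text": (link.get("text") or "").strip()})
    return cleaned
```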

Comparison to Alternatives

| Feature | requests + bs4 | playwright | plasmate-scraper |
|---|---|---|---|
| Parsing | Manual (CSS/XPath) | Manual (CSS/XPath) | Automatic (SOM) |
| Resilience | Low (breaks easily) | Low (breaks easily) | High (semantic) |
| JS Support | No | Yes | Yes (default) |
| Concurrency | Manual (e.g., ThreadPool) | Manual | Built-in (batch_fetch) |
| LLM-ready | No (too verbose) | No (too verbose) | Yes (token-efficient) |
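For context on the Concurrency row, this is roughly what the manual approach involves when you manage a thread pool yourself. It is a generic sketch, not how `batch_fetch` is implemented. `fetch_all` and `fetch_one` are hypothetical names, and the error-dict shape simply mirrors the one `batch_fetch` is documented to return.

```python
from concurrent.futures import ThreadPoolExecutor


def fetch_all(fetch_one, urls, max_workers=5):
    """Run fetch_one over urls in a thread pool, preserving input order."""

    def safe(url):
        try:
            return fetch_one(url)
        except Exception as exc:
            # Report the failure as data instead of aborting the whole batch
            return {"url": url, "error": str(exc)}

    with ThreadPoolExecutor(max_workers=max_workers) as pool:
        # pool.map yields results in the same order as the input URLs
        return list(pool.map(safe, urls))
```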

License

Apache 2.0 — see LICENSE.


Part of the Plasmate Ecosystem

Engine: plasmate - The browser engine for agents
MCP: plasmate-mcp - Claude Code, Cursor, Windsurf
Extension: plasmate-extension - Chrome cookie export
SDKs: Python / Node.js / Go / Rust
Frameworks: LangChain / CrewAI / AutoGen / Smolagents
Tools: Scrapy / Audit / A11y / GitHub Action
Resources: Awesome Plasmate / Notebooks / Benchmarks
Docs: docs.plasmate.app
W3C: Web Content Browser for AI Agents

About

Python SDK for Plasmate - fetch web pages as structured SOM JSON. pip install plasmate.
