
brandonkramer/pi-scraper


🕸️ pi-scraper

A scraper-first, Pi-native, and local-first extension for the Pi ecosystem.



pi-scraper reads known URLs and sites. Use it to scrape or summarize a single page, crawl, map URLs, diff snapshots, retrieve stored results, download files, or extract deterministic, structured data. It includes a CloakBrowser-backed browser mode with C++ fingerprint patches and persistent sessions.


Quick Start

Install the extension via the Pi CLI:

pi install npm:pi-scraper

Then ask naturally; Pi can choose the right web tool automatically.


⚡ Scrape Modes

pi-scraper intelligently escalates its scraping strategy to balance speed and capability.

| Mode | JS Support | Speed | Best Use Case |
| --- | --- | --- | --- |
| fast | ❌ | 🚀 | Static HTML, documentation, and quick text extraction. |
| fingerprint | ❌ | 🏎️ | Sites that block simple bots (uses TLS fingerprinting). |
| readable | ❌ | ⏱️ | Articles and blogs where noise reduction is critical. |
| browser | ✅ | 🐢 | Heavily JS-rendered sites (uses CloakBrowser by default). |
| auto | 🤖 | 🔄 | Default. Automatically selects the best path based on signals. |
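
The README does not spell out which signals auto mode uses or in what order it escalates. As a rough illustration only, the sketch below assumes a cheapest-first order (fast, then fingerprint, then browser) and two made-up signals: a blocking HTTP status and an almost empty JS app shell. The names `FetchSignals`, `ESCALATION`, and `nextMode` are hypothetical and are not part of pi-scraper.

```typescript
// Hypothetical sketch of mode escalation; not pi-scraper's actual logic.
type Mode = "fast" | "fingerprint" | "readable" | "browser";

interface FetchSignals {
  status: number;         // HTTP status of the previous attempt
  textLength: number;     // length of the visible text that was extracted
  hasJsAppShell: boolean; // e.g. an empty root element plus script bundles
}

// Assumed escalation order for illustration: cheapest mode first.
const ESCALATION: Mode[] = ["fast", "fingerprint", "browser"];

// Returns the mode to retry with, or null when the current result is kept.
function nextMode(current: Mode, signals: FetchSignals): Mode | null {
  const blocked = signals.status === 403 || signals.status === 429;
  const emptyShell = signals.hasJsAppShell && signals.textLength < 200;
  if (!blocked && !emptyShell) return null; // result looks usable

  // TLS fingerprinting cannot render JavaScript, so jump straight to browser.
  if (emptyShell) return current === "browser" ? null : "browser";

  const i = ESCALATION.indexOf(current);
  const next = i >= 0 ? ESCALATION[i + 1] : undefined;
  return next ?? null; // null when there is nothing heavier to try
}
```

Under these assumptions, a 403 from fast mode would retry with fingerprint, while a 200 response that contains only an app shell would skip directly to browser.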

🛠️ Public Tools

| Tool | Capability | Best For | Contract ≈ |
| --- | --- | --- | --- |
| web_scrape | 🏠 Local | Reading a single URL as Markdown, Text, or HTML. | 308 tokens |
| web_crawl | 🕷️ Resumable | BFS crawling to build local datasets or context packages. | 158 tokens |
| web_map | 🗺️ Discovery | Inventorying URLs via robots.txt, sitemaps, and llms.txt. | 58 tokens |
| web_batch | 📦 Bulk | Scraping multiple independent URLs concurrently. | 195 tokens |
| web_extract | 🔍 Structured | Deterministic, selector-based, or LLM-backed extraction. | 337 tokens |
| web_get_result | 📂 Retrieval | Accessing stored results, job manifests, or snapshots. | 56 tokens |

Note

Contract is the approximate total number of tokens in the tool declaration.


📖 Parameter Reference

| Area | Parameters | Description |
| --- | --- | --- |
| Shared | sessionId, saveSession, clearSession, stealth, autoWait, browserBackend, proxy, headers, provider | Sessions, browser controls, and LLM provider selection. |
| Scrape | url, urls, content, task, mode, format, refresh, respectRobots, timeoutSeconds | Targets, tasks (read/summarize), and fetch behavior. |
| Limits | maxBytes, maxChars, onlyMainContent | Size limits and content cleaning. |
| Filtering | include, exclude, linesMatching, contextLines, caseSensitive | Glob patterns and line-based content filtering. |
| Redirection | followAlternates, followMetaRefresh | Controls for non-standard redirects. |
| Snapshots | snapshotName, snapshotTag, diff, compareTag, maxSnapshotAgeSeconds | Versioning and diffing baselines. |
| Crawl | action, maxPages, maxDepth, sameOrigin, concurrency, resume, crawlId, compile, seed, seedSitemap, status, limit, extract | BFS discovery, limits, and state management. |
| Extract | action, extractor, prompt, schema, selector, selectorType, attribute, adaptive, bullets, sentences, identifier, autoSave, threshold, extractSchema | Vertical, ad-hoc, and selector extraction. |
| Patterns | markers, contains, excerpts, regexes, sections, jsonPaths, sourceFormat, length | Deterministic inspection: strings, regex, and ranges. |
| Map | url, maxSitemaps | Site-wide discovery of robots.txt and sitemaps. |
| Storage | saveToFile | true or {dir, filename, maxBytes} for disk storage. |
| Retrieval | responseId, jobId, snapshotUrl, snapshotName, snapshotTag | Retrieve stored payloads and job manifests. |
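
To make the Crawl limits concrete, here is a minimal sketch of breadth-first traversal bounded by maxPages, maxDepth, and sameOrigin. It walks an in-memory link graph instead of fetching pages, and it assumes the seed counts as depth 0. `LinkGraph`, `CrawlLimits`, and `bfsCrawl` are hypothetical names for illustration; how web_crawl interprets these parameters internally may differ.

```typescript
// Hypothetical sketch: BFS over a pre-built link graph, no network access.
type LinkGraph = Record<string, string[]>;

interface CrawlLimits {
  maxPages: number;    // stop after this many pages have been visited
  maxDepth: number;    // assumed here: the seed is depth 0
  sameOrigin: boolean; // skip links that leave the seed's origin
}

function bfsCrawl(seed: string, links: LinkGraph, limits: CrawlLimits): string[] {
  const origin = new URL(seed).origin;
  const seen = new Set<string>([seed]);
  const order: string[] = [];
  let frontier: string[] = [seed];

  // Process one depth level at a time so maxDepth is easy to enforce.
  for (let depth = 0; depth <= limits.maxDepth && frontier.length > 0; depth++) {
    const next: string[] = [];
    for (const url of frontier) {
      if (order.length >= limits.maxPages) return order;
      order.push(url);
      for (const link of links[url] ?? []) {
        if (seen.has(link)) continue;
        if (limits.sameOrigin && new URL(link).origin !== origin) continue;
        seen.add(link);
        next.push(link);
      }
    }
    frontier = next;
  }
  return order;
}
```

Because the frontier is processed level by level, a resumable crawler could persist `seen` and `frontier` between runs; that part is left out of the sketch.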

🔑 Sessions & Persistence

pi-scraper is stateless by default. Use sessionId when you need to maintain state (cookies, login, cart) across multiple calls.

  • sessionId: A unique key for the session.
  • saveSession: Persist cookies to disk (useful across Pi reloads).
  • clearSession: Wipe the session state.
  • fingerprint: Use mode: "fingerprint" to bypass basic bot blocks using browser-grade TLS fingerprints without the overhead of a full browser.
// Example: Log in and then scrape a protected page
web_scrape({ url: "https://example.com/login", sessionId: "user-1", saveSession: true })
web_scrape({ url: "https://example.com/dashboard", sessionId: "user-1" })

🎯 Selector Extraction

Extract structured data using CSS selectors, XPath, or plain text search.

| Parameter | Description |
| --- | --- |
| selector | The CSS/XPath/Text to find. |
| attribute | Extract a specific attribute (e.g., href) instead of text. |
| adaptive | Enable relocation if the page layout changes. |
| limit | Maximum elements to return. |

Example:

{
  "url": "https://example.com/products",
  "selector": ".product-card",
  "identifier": "products-v1",
  "autoSave": true,
  "limit": 5
}

🌐 Browser Mode Support

mode: "browser" uses CloakBrowser by default: a patched Chromium binary with 48 C++-level fingerprint patches.

⚙️ Backend options

| Backend | Default | Browser | Stealth level | Requirement |
| --- | --- | --- | --- | --- |
| "cloak" | ✅ | CloakBrowser Chromium 145 | C++ source-level (48 patches) | Bundled |
| "playwright" | ❌ | Stock Playwright Chromium | JS page.evaluate() via stealth=true | npm install playwright |

🛡️ Fingerprint evasion

CloakBrowser does not need stealth=true. All anti-detection patches (navigator.webdriver, canvas, WebGL, audio, fonts, GPU, screen, WebRTC, network timing) are applied at the C++ binary level, where JS-level bot detection cannot see them.

Test results from CloakBrowser:

  • reCAPTCHA v3 score: 0.9 (human)
  • Cloudflare Turnstile: PASS
  • FingerprintJS: PASS
  • BrowserScan: NORMAL (4/4)
  • 30+ detection sites: passed

💾 Persistent sessions (CloakBrowser only)

When using CloakBrowser with sessionId + saveSession=true:

web_scrape url="https://example.com" mode=browser sessionId="my-session" saveSession=true

CloakBrowser uses launchPersistentContext() which writes cookies, localStorage, and session state to a disk profile at ~/.pi/browser-sessions/<sessionId>/. This:

  • Avoids incognito/private-mode detection (BrowserScan penalizes incognito by ~10%)
  • Survives Pi restarts and process reloads
  • Keeps login state across multiple scrape calls

To persist an authenticated login flow:

  1. Log in and save the session. Open the login page in browser mode. Specifying saveSession=true writes the cookies and session state to your local profile.

    web_scrape url="https://example.com/login" mode=browser sessionId="site-session" saveSession=true
  2. Scrape authenticated content. Subsequent calls using the same sessionId automatically inherit the authenticated state (cookies, local storage, etc.).

    web_scrape url="https://example.com/dashboard" mode=browser sessionId="site-session"
  3. Clear the session when done (optional). Wipe the saved session and context from your local disk.

    web_scrape url="https://example.com" mode=browser sessionId="site-session" clearSession=true
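
The three steps above can also be written as plain parameter objects, which is handy when the flow is generated for more than one site. This is only a sketch: `BrowserScrapeCall` and `loginFlow` are hypothetical helpers, the /login and /dashboard paths come from the example above, and how the objects are handed to web_scrape depends on your Pi setup.

```typescript
// Hypothetical helper: builds the parameter objects for the three-step flow.
interface BrowserScrapeCall {
  url: string;
  mode: "browser";
  sessionId: string;
  saveSession?: boolean;
  clearSession?: boolean;
}

function loginFlow(site: string, sessionId: string): BrowserScrapeCall[] {
  // Resolve paths against the site root so callers pass only the origin.
  const at = (path: string): string => new URL(path, site).toString();
  return [
    // 1. Log in and persist cookies and session state to the local profile.
    { url: at("/login"), mode: "browser", sessionId, saveSession: true },
    // 2. Reuse the same sessionId to read authenticated content.
    { url: at("/dashboard"), mode: "browser", sessionId },
    // 3. Optionally wipe the stored session afterwards.
    { url: at("/"), mode: "browser", sessionId, clearSession: true },
  ];
}

const calls = loginFlow("https://example.com", "site-session");
```

Keeping sessionId identical across all three objects is the part that matters; the rest mirrors the CLI-style calls shown in the steps.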

🔧 CloakBrowser-specific options

| Option | Type | Description |
| --- | --- | --- |
| timezone | string | IANA timezone (e.g. "America/New_York"). Set via binary flag (undetectable). |
| locale | string | BCP 47 locale (e.g. "en-US"). Set via --lang binary flag. |
| proxy | string | HTTP or SOCKS5 proxy URL. |

These are safe to set even with the Playwright backend (ignored or applied via JS patches).


🏗️ Vertical Extraction

For well-known sites, pi-scraper uses optimized "vertical" extractors that hit APIs directly, bypassing slow HTML scraping.

| Vertical | Platforms / Sites | Extracted Data / Possibilities |
| --- | --- | --- |
| GitHub Repo | GitHub | Metadata, README, File Tree, Languages, Topics. |
| GitHub Issue | GitHub | Issue body, comments, participants, labels, status. |
| GitHub PR | GitHub | Pull request body, diff stats, reviews, comments. |
| GitHub Release | GitHub | Release notes, tag info, assets, author metadata. |
| npm Package | npmjs.com | Manifest JSON, versions, dependencies, README. |
| PyPI Package | pypi.org | Package metadata, versions, author, description. |
| crates.io | crates.io | Rust crate metadata, versions, dependencies. |
| Docker Hub | hub.docker.com | Image metadata, tags, architectures, layers. |
| HF Model | huggingface.co | Model cards, metadata, files, community stats. |
| HF Dataset | huggingface.co | Dataset cards, configuration, metadata, previews. |
| Hacker News | ycombinator.com | Story/Comment trees via Firebase API. |
| arXiv | arxiv.org | Academic paper metadata and Atom feeds. |
| DeepWiki | deepwiki.io | Structured wiki content and metadata. |
| Docs Site | Docusaurus, RTD | Sections, sidebar navigation, and page metadata. |
| docstrings | TS/JS/Py/Rs | Exported symbols, types, and function signatures. |
| YouTube Metadata | youtube.com | Video title, views, channel name, duration, and description. |
| YouTube Transcriptions | youtube.com | Full transcripts in plain-text and timed segments. |
| YouTube Comments | youtube.com | Preview of top video comments and engagement stats. |
| Reddit Post | reddit.com | Post content, scoring, flairs, and author metadata. |
| Reddit Thread | reddit.com | Full nested comment trees (retains original thread depth). |
| Reddit List | reddit.com | Subreddit listings (hot/new/top) and search results. |
| OSS Analytics | ossinsight.io | Real-time repository metrics, stars, and contribution trends. |
| OSS Trending | ossinsight.io | Daily/weekly trending repositories and collections. |
| OSS Rankings | ossinsight.io | Collection-based rankings and ecosystem comparison data. |
// Get structured data for an npm package
web_extract({ action: "vertical", url: "https://www.npmjs.com/package/undici" })

// Get YouTube video metadata, transcript, and comment preview
web_extract({ action: "vertical", extractor: "youtube", url: "https://www.youtube.com/watch?v=arj7oStGLkU" })
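
The first call above omits extractor, which suggests the vertical can be inferred from the URL. One plausible way to do that is a hostname lookup, sketched below. Only "youtube" appears as an extractor name in this README; the other names in the map, and the `pickVertical` helper itself, are assumptions for illustration and may not match pi-scraper's real identifiers.

```typescript
// Hypothetical hostname-to-extractor map. Only "youtube" is confirmed by the
// examples above; the other extractor names are placeholders.
const VERTICAL_HOSTS = new Map<string, string>([
  ["github.com", "github"],
  ["npmjs.com", "npm"],
  ["pypi.org", "pypi"],
  ["crates.io", "crates"],
  ["youtube.com", "youtube"],
]);

// Returns the assumed extractor name for a URL, or null when none matches.
function pickVertical(rawUrl: string): string | null {
  // Normalize a leading "www." so both host forms resolve to the same key.
  const host = new URL(rawUrl).hostname.replace(/^www\./, "");
  return VERTICAL_HOSTS.get(host) ?? null;
}
```

A null result would be the point at which a caller falls back to ordinary scraping or passes extractor explicitly.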

💾 Download, Storage & History

Large results are stored automatically. You can retrieve them later using web_get_result.

📂 Persistent Paths

| Data | Path |
| --- | --- |
| SQLite Index | ~/.pi/scraper/index.db |
| Payload Blobs | ~/.pi/scraper/blobs/ |
| Downloads | ~/.pi/scraper/downloads/ |

📄 Binary Downloads

Add saveToFile: true to persist PDFs, images, or archives to disk.

{ "url": "https://arxiv.org/pdf/1706.03762", "saveToFile": true }

⚖️ Max Bytes

Control the fetch limit per request (default: 30 MB).

{ "url": "https://example.com/large.zip", "maxBytes": 104857600 }
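
The maxBytes value in the example, 104857600, is 100 MiB (100 × 1024 × 1024). A small helper avoids typing such numbers by hand. `mibToBytes` is a hypothetical convenience function, not something pi-scraper provides, and the README does not say whether the 30 MB default is counted in decimal or binary units.

```typescript
// Hypothetical helper: converts mebibytes to a byte count for maxBytes.
function mibToBytes(mib: number): number {
  return Math.round(mib * 1024 * 1024);
}

// Same request as above, with the limit expressed in MiB.
const request = {
  url: "https://example.com/large.zip",
  maxBytes: mibToBytes(100), // 104857600
};
```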

🗺️ Site Mapping (web_map)

Use web_map for fast discovery of a domain's structure without downloading full page bodies. It is an "inventory-only" tool.

What it discovers:

  • robots.txt: Respects crawl delays and discovers sitemap links.
  • Sitemaps: Automatically parses sitemap.xml and gzipped sitemaps.
  • llms.txt: Finds specialized manifests designed for AI consumption.
// Inventory all known URLs for a domain
{ "url": "https://example.com", "action": "inventory" }
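
Sitemap discovery through robots.txt relies on `Sitemap:` lines, which may appear anywhere in the file and are matched case-insensitively. The sketch below shows one way to collect them. `sitemapsFromRobots` is a hypothetical function written for illustration; web_map's actual parser is not shown in this README and may handle more edge cases.

```typescript
// Hypothetical sketch: collect Sitemap URLs from the text of a robots.txt file.
function sitemapsFromRobots(robotsTxt: string): string[] {
  const found: string[] = [];
  for (const rawLine of robotsTxt.split(/\r?\n/)) {
    // Drop trailing comments, then surrounding whitespace.
    const line = rawLine.replace(/#.*$/, "").trim();
    // The directive name is case-insensitive; the value is the first token.
    const match = /^sitemap\s*:\s*(\S+)/i.exec(line);
    const url = match?.[1];
    if (url !== undefined && !found.includes(url)) found.push(url);
  }
  return found;
}

const robots = [
  "User-agent: *",
  "Disallow: /private",
  "Sitemap: https://example.com/sitemap.xml",
  "sitemap: https://example.com/news.xml.gz # news feed",
].join("\n");

const sitemaps = sitemapsFromRobots(robots);
```

Each URL found this way would then be fetched and parsed as a sitemap (optionally gzipped), which is outside the scope of this sketch.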

🔒 Safety & Resilience

  • SSRF Protection: Built-in validation at the connect and redirect layers.
  • Robots.txt: Full respect for site crawling rules (configurable).
  • Memory Efficient: Large responses are streamed and stored locally.
  • Incremental Enforcement: maxBytes limits are enforced during the stream.
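
Enforcing maxBytes during the stream means counting bytes as chunks arrive and aborting as soon as the limit is crossed, instead of checking the size after the whole body has been buffered. The following is a minimal synchronous sketch of that idea. `readWithLimit` and `MaxBytesExceededError` are hypothetical names; pi-scraper's implementation presumably works on asynchronous network streams.

```typescript
// Hypothetical error type raised when the running total passes the limit.
class MaxBytesExceededError extends Error {
  limit: number;
  constructor(limit: number) {
    super(`response exceeded maxBytes=${limit}`);
    this.name = "MaxBytesExceededError";
    this.limit = limit;
  }
}

// Consumes chunks lazily and stops at the first chunk that crosses maxBytes.
function readWithLimit(chunks: Iterable<Uint8Array>, maxBytes: number): Uint8Array {
  const kept: Uint8Array[] = [];
  let total = 0;
  for (const chunk of chunks) {
    total += chunk.byteLength;
    if (total > maxBytes) throw new MaxBytesExceededError(maxBytes);
    kept.push(chunk);
  }
  // Concatenate the accepted chunks into one buffer.
  const out = new Uint8Array(total);
  let offset = 0;
  for (const chunk of kept) {
    out.set(chunk, offset);
    offset += chunk.byteLength;
  }
  return out;
}
```

Since the check runs per chunk, a source that produces data lazily is never asked for more than the chunk that crosses the limit.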

⚙️ Configuration

Use the /scrape-config slash command to manage your settings interactively or via the CLI:

/scrape-config status                     # View current settings
/scrape-config scrape-mode browser        # Set default mode to browser
/scrape-config robots off                 # Disable robots.txt respect
/scrape-config cache clear                # Wipe the local response cache

📦 Developer Info

If you are contributing to or building on top of pi-scraper:

Requirements

  • Node.js: >=22.19.0
  • Pi: >=0.74.0

Build & Test

npm install        # Install dependencies
npm run typecheck  # Verify types
npm test           # Run unit tests
npm run test:tools # Run tool smoke tests

🔄 Playwright backend (opt-out)

To use stock Playwright Chromium instead of CloakBrowser:

npm install playwright
npx playwright install chromium
web_scrape url="https://example.com" mode=browser browserBackend=playwright stealth=true

📜 License

This project is licensed under the MIT License. See LICENSE for details.

About

Pi extension for fast page scraping, recursive crawling, URL/site mapping, brand extraction, content diffing, PDF text extraction, and deterministic vertical extraction.
