🕸️ pi-scraper

A scraper-first, Pi-native, and local-first extension for the Pi ecosystem.

pi-scraper reads known URLs and sites. Use it to scrape or summarize a single page, crawl a site, map URLs, diff snapshots, retrieve stored results, download files, or extract structured data deterministically. Browser mode is backed by CloakBrowser, with C++ fingerprint patches and persistent sessions.

Quick Start

Install the extension via the Pi CLI:

pi install npm:pi-scraper

Try these prompts:

Tip

Ask naturally; Pi can choose the right web tool automatically:

"Read https://example.com as markdown."
"List all URLs available from https://example.com."
"Crawl https://example.com, up to 25 pages."
"Compare https://example.com against my homepage snapshot."
"Open https://example.com/login in browser mode, save the session, then scrape /dashboard."

⚡ Scrape Modes

pi-scraper intelligently escalates its scraping strategy to balance speed and capability.

Mode	JS Support	Speed	Best Use Case
`fast`	❌	🚀	Static HTML, documentation, and quick text extraction.
`fingerprint`	❌	🏎️	Sites that block simple bots (uses TLS fingerprinting).
`readable`	❌	⏱️	Articles and blogs where noise reduction is critical.
`browser`	✅	🐢	Heavily JS-rendered sites (uses CloakBrowser by default).
`auto`	🤖	🔄	Default. Automatically selects the best path based on signals.
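The signals that `auto` uses are not spelled out in this README. As a rough mental model only, the sketch below picks the cheapest mode that can plausibly handle a page. `PageSignals`, its fields, and `chooseMode` are hypothetical names invented for illustration; they are not part of pi-scraper's API.

```typescript
// Hypothetical signals; pi-scraper's real heuristics are not documented here.
type ScrapeMode = "fast" | "fingerprint" | "readable" | "browser";

interface PageSignals {
  blockedByBotCheck: boolean; // e.g. the plain fetch returned a challenge page
  needsJavaScript: boolean;   // e.g. the HTML is an empty app shell
  isArticle: boolean;         // e.g. long-form text surrounded by page chrome
}

// Pick the cheapest mode that can plausibly handle the page.
function chooseMode(signals: PageSignals): ScrapeMode {
  if (signals.needsJavaScript) return "browser";
  if (signals.blockedByBotCheck) return "fingerprint";
  if (signals.isArticle) return "readable";
  return "fast";
}

console.log(
  chooseMode({ blockedByBotCheck: false, needsJavaScript: true, isArticle: false }),
);
```

The ordering reflects the trade-off in the table: only pay for a full browser when nothing lighter can work.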

🛠️ Public Tools

Tool	Capability	Best For...	Contract ≈
`web_scrape`	🏠 Local	Reading a single URL as Markdown, Text, or HTML.	308 tokens
`web_crawl`	🕷️ Resumable	BFS crawling to build local datasets or context packages.	158 tokens
`web_map`	🗺️ Discovery	Inventorying URLs via robots.txt, sitemaps, and llms.txt.	58 tokens
`web_batch`	📦 Bulk	Scraping multiple independent URLs concurrently.	195 tokens
`web_extract`	🔍 Structured	Deterministic, selector-based, or LLM-backed extraction.	337 tokens
`web_get_result`	📂 Retrieval	Accessing stored results, job manifests, or snapshots.	56 tokens

Note

Contract is the approximate number of tokens in each tool's declaration.
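To estimate the combined context cost of all six declarations, the per-tool figures from the table can simply be added up. This is plain arithmetic on the approximate numbers listed above, not a measurement.

```typescript
// Approximate contract sizes copied from the Public Tools table.
const contractTokens: Record<string, number> = {
  web_scrape: 308,
  web_crawl: 158,
  web_map: 58,
  web_batch: 195,
  web_extract: 337,
  web_get_result: 56,
};

// Sum of all declarations: the rough token cost of exposing every tool.
const totalTokens = Object.values(contractTokens).reduce((sum, n) => sum + n, 0);

console.log(totalTokens); // 1112
```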

📖 Parameter Reference

Area	Parameters	Description
Shared	`sessionId`, `saveSession`, `clearSession`, `stealth`, `autoWait`, `browserBackend`, `proxy`, `headers`, `provider`	Sessions, browser controls, and LLM provider selection.
Scrape	`url`, `urls`, `content`, `task`, `mode`, `format`, `refresh`, `respectRobots`, `timeoutSeconds`	Targets, tasks (`read`/`summarize`), and fetch behavior.
Limits	`maxBytes`, `maxChars`, `onlyMainContent`	Size limits and content cleaning.
Filtering	`include`, `exclude`, `linesMatching`, `contextLines`, `caseSensitive`	Glob patterns and line-based content filtering.
Redirection	`followAlternates`, `followMetaRefresh`	Controls for non-standard redirects.
Snapshots	`snapshotName`, `snapshotTag`, `diff`, `compareTag`, `maxSnapshotAgeSeconds`	Versioning and diffing baselines.
Crawl	`action`, `maxPages`, `maxDepth`, `sameOrigin`, `concurrency`, `resume`, `crawlId`, `compile`, `seed`, `seedSitemap`, `status`, `limit`, `extract`	BFS discovery, limits, and state management.
Extract	`action`, `extractor`, `prompt`, `schema`, `selector`, `selectorType`, `attribute`, `adaptive`, `bullets`, `sentences`, `identifier`, `autoSave`, `threshold`, `extractSchema`	Vertical, ad-hoc, and selector extraction.
Patterns	`markers`, `contains`, `excerpts`, `regexes`, `sections`, `jsonPaths`, `sourceFormat`, `length`	Deterministic inspection: strings, regex, and ranges.
Map	`url`, `maxSitemaps`	Site-wide discovery of robots.txt and sitemaps.
Storage	`saveToFile`	`true` or `{dir, filename, maxBytes}` for disk storage.
Retrieval	`responseId`, `jobId`, `snapshotUrl`, `snapshotName`, `snapshotTag`	Retrieve stored payloads and job manifests.
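To make the crawl limits concrete, here is a minimal sketch of a breadth-first crawl bounded by `maxPages`, `maxDepth`, and `sameOrigin`. It walks an in-memory link graph instead of fetching pages, and it assumes the seed counts as depth 0; pi-scraper's actual crawler and its exact depth semantics may differ. `bfsCrawl` and `CrawlLimits` are hypothetical names.

```typescript
interface CrawlLimits {
  maxPages: number;
  maxDepth: number; // assumption: the seed URL is depth 0
  sameOrigin: boolean;
}

// `links` stands in for fetching a page and extracting its outgoing links.
function bfsCrawl(seed: string, links: Map<string, string[]>, limits: CrawlLimits): string[] {
  const origin = new URL(seed).origin;
  const visited = new Set<string>([seed]);
  const order: string[] = [];
  let frontier: string[] = [seed];

  // Process one depth level at a time; `next` collects the following level.
  for (let depth = 0; depth <= limits.maxDepth && frontier.length > 0; depth++) {
    const next: string[] = [];
    for (const url of frontier) {
      if (order.length >= limits.maxPages) return order;
      order.push(url);
      for (const target of links.get(url) ?? []) {
        if (visited.has(target)) continue;
        if (limits.sameOrigin && new URL(target).origin !== origin) continue;
        visited.add(target);
        next.push(target);
      }
    }
    frontier = next;
  }
  return order;
}

const site = new Map<string, string[]>([
  ["https://example.com/", ["https://example.com/a", "https://other.test/x"]],
  ["https://example.com/a", ["https://example.com/b"]],
  ["https://example.com/b", []],
]);

console.log(bfsCrawl("https://example.com/", site, { maxPages: 25, maxDepth: 1, sameOrigin: true }));
```

With `maxDepth: 1` the crawl stops after the seed and its direct links, and the off-origin link is never queued.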

🔑 Sessions & Persistence

pi-scraper is stateless by default. Use sessionId when you need to maintain state (cookies, login, cart) across multiple calls.

sessionId: A unique key for the session.
saveSession: Persist cookies to disk (useful across Pi reloads).
clearSession: Wipe the session state.
fingerprint: Use mode: "fingerprint" to bypass basic bot blocks using browser-grade TLS fingerprints without the overhead of a full browser.

// Example: Log in and then scrape a protected page
web_scrape({ url: "https://example.com/login", sessionId: "user-1", saveSession: true })
web_scrape({ url: "https://example.com/dashboard", sessionId: "user-1" })

🎯 Selector Extraction

Extract structured data using CSS selectors, XPath, or plain text search.

Parameter	Description
`selector`	The CSS selector, XPath expression, or text to find.
`attribute`	Extract a specific attribute (e.g., `href`) instead of text.
`adaptive`	Enable relocation if the page layout changes.
`limit`	Maximum elements to return.

Example:

{
  "url": "https://example.com/products",
  "selector": ".product-card",
  "identifier": "products-v1",
  "autoSave": true,
  "limit": 5
}

🌐 Browser Mode Support

mode: "browser" uses CloakBrowser by default — a patched Chromium binary with 48 C++-level fingerprint patches.

⚙️ Backend options

Backend	Default	Browser	Stealth level	Requirement
`"cloak"`	✅	CloakBrowser Chromium 145	C++ source-level (48 patches)	Bundled
`"playwright"`	❌	Stock Playwright Chromium	JS `page.evaluate()` via `stealth=true`	`npm install playwright`

🛡️ Fingerprint evasion

CloakBrowser does not need stealth=true. All anti-detection patches (navigator.webdriver, canvas, WebGL, audio, fonts, GPU, screen, WebRTC, network timing) are applied at the C++ binary level, which the project describes as undetectable by JS-level bot detection.

Test results from CloakBrowser:

reCAPTCHA v3 score: 0.9 (human)
Cloudflare Turnstile: PASS
FingerprintJS: PASS
BrowserScan: NORMAL (4/4)
30+ detection sites: passed

💾 Persistent sessions (CloakBrowser only)

When using CloakBrowser with sessionId + saveSession=true:

web_scrape url="https://example.com" mode=browser sessionId="my-session" saveSession=true

CloakBrowser uses launchPersistentContext() which writes cookies, localStorage, and session state to a disk profile at ~/.pi/browser-sessions/<sessionId>/. This:

Avoids incognito/private-mode detection (BrowserScan penalizes incognito by ~10%)
Survives Pi restarts and process reloads
Keeps login state across multiple scrape calls

To persist an authenticated login flow:

Log in and save the session: Open the login page in browser mode. Specifying saveSession=true writes the cookies and session state to your local profile.
```
web_scrape url="https://example.com/login" mode=browser sessionId="site-session" saveSession=true
```
Scrape authenticated content: Subsequent calls using the same sessionId automatically inherit the authenticated state (cookies, local storage, etc.).
```
web_scrape url="https://example.com/dashboard" mode=browser sessionId="site-session"
```

Clear the session when done (optional): Wipe the saved session and context from your local disk.

web_scrape url="https://example.com" mode=browser sessionId="site-session" clearSession=true

🔧 CloakBrowser-specific options

Option	Type	Description
`timezone`	string	IANA timezone (e.g. `"America/New_York"`). Set via binary flag — undetectable.
`locale`	string	BCP 47 locale (e.g. `"en-US"`). Set via `--lang` binary flag.
`proxy`	string	HTTP or SOCKS5 proxy URL.

These are safe to set even with the Playwright backend (ignored or applied via JS patches).

🏗️ Vertical Extraction

For well-known sites, pi-scraper uses optimized "vertical" extractors that hit APIs directly, bypassing slow HTML scraping.

Vertical	Platforms / Sites	Extracted Data / Possibilities
GitHub Repo	GitHub	Metadata, README, File Tree, Languages, Topics.
GitHub Issue	GitHub	Issue body, comments, participants, labels, status.
GitHub PR	GitHub	Pull request body, diff stats, reviews, comments.
GitHub Release	GitHub	Release notes, tag info, assets, author metadata.
npm Package	npmjs.com	Manifest JSON, versions, dependencies, README.
PyPI Package	pypi.org	Package metadata, versions, author, description.
crates.io	crates.io	Rust crate metadata, versions, dependencies.
Docker Hub	hub.docker.com	Image metadata, tags, architectures, layers.
HF Model	huggingface.co	Model cards, metadata, files, community stats.
HF Dataset	huggingface.co	Dataset cards, configuration, metadata, previews.
Hacker News	news.ycombinator.com	Story/Comment trees via Firebase API.
arXiv	arxiv.org	Academic paper metadata and Atom feeds.
DeepWiki	deepwiki.io	Structured wiki content and metadata.
Docs Site	Docusaurus, RTD	Sections, sidebar navigation, and page metadata.
docstrings	TS/JS/Py/Rs	Exported symbols, types, and function signatures.
YouTube Metadata	youtube.com	Video title, views, channel name, duration, and description.
YouTube Transcriptions	youtube.com	Full transcripts in plain-text and timed segments.
YouTube Comments	youtube.com	Preview of top video comments and engagement stats.
Reddit Post	reddit.com	Post content, scoring, flairs, and author metadata.
Reddit Thread	reddit.com	Full nested comment trees (retains original thread depth).
Reddit List	reddit.com	Subreddit listings (hot/new/top) and search results.
OSS Analytics	ossinsight.io	Real-time repository metrics, stars, and contribution trends.
OSS Trending	ossinsight.io	Daily/weekly trending repositories and collections.
OSS Rankings	ossinsight.io	Collection-based rankings and ecosystem comparison data.

// Get structured data for an npm package
web_extract({ action: "vertical", url: "https://www.npmjs.com/package/undici" })

// Get YouTube video metadata, transcript, and comment preview
web_extract({ action: "vertical", extractor: "youtube", url: "https://www.youtube.com/watch?v=arj7oStGLkU" })
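How a URL gets matched to a vertical is not described here. One plausible approach is a lookup on hostname and path shape, sketched below for a few rows of the table. `guessVertical` is a hypothetical helper, and the returned labels are the table's row names rather than real extractor identifiers.

```typescript
// Hypothetical routing: map a URL to a row name from the Vertical table.
function guessVertical(rawUrl: string): string | undefined {
  const url = new URL(rawUrl);
  const host = url.hostname.replace(/^www\./, "");
  const parts = url.pathname.split("/").filter(Boolean);

  // GitHub URLs look like /owner/repo[/issues/N | /pull/N | /releases...].
  if (host === "github.com" && parts.length >= 2) {
    if (parts[2] === "issues" && parts[3]) return "GitHub Issue";
    if (parts[2] === "pull" && parts[3]) return "GitHub PR";
    if (parts[2] === "releases") return "GitHub Release";
    return "GitHub Repo";
  }
  if (host === "npmjs.com" && parts[0] === "package") return "npm Package";
  if (host === "pypi.org" && parts[0] === "project") return "PyPI Package";
  if (host === "arxiv.org") return "arXiv";
  return undefined; // no vertical matched; fall back to ordinary scraping
}

console.log(guessVertical("https://www.npmjs.com/package/undici"));
console.log(guessVertical("https://github.com/nodejs/undici/issues/1"));
```

Returning `undefined` for unknown hosts keeps the fallback to regular scraping explicit.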

💾 Download, Storage & History

Large results are stored automatically. You can retrieve them later using web_get_result.

📂 Persistent Paths

Data	Path
SQLite Index	`~/.pi/scraper/index.db`
Payload Blobs	`~/.pi/scraper/blobs/`
Downloads	`~/.pi/scraper/downloads/`

📄 Binary Downloads

Add saveToFile: true to persist PDFs, images, or archives to disk.

{ "url": "https://arxiv.org/pdf/1706.03762", "saveToFile": true }

⚖️ Max Bytes

Control the fetch limit per request (default: 30 MB).

{ "url": "https://example.com/large.zip", "maxBytes": 104857600 }
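The `maxBytes` value in the example above is 100 MiB written out in bytes. A small helper (the name `mib` is made up for this sketch) keeps such limits readable:

```typescript
// Convert mebibytes to bytes (1 MiB = 1024 * 1024 bytes).
function mib(n: number): number {
  return n * 1024 * 1024;
}

const largeDownload = {
  url: "https://example.com/large.zip",
  maxBytes: mib(100),
};

console.log(largeDownload.maxBytes); // 104857600
```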

🗺️ Site Mapping (`web_map`)

Use web_map for fast discovery of a domain's structure without downloading full page bodies. It is an "inventory-only" tool.

What it discovers:

robots.txt: Respects crawl delays and discovers sitemap links.
Sitemaps: Automatically parses sitemap.xml and gzipped sitemaps.
llms.txt: Finds specialized manifests designed for AI consumption.

// Inventory all known URLs for a domain
{ "url": "https://example.com", "action": "inventory" }
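As an illustration of the robots.txt step, sitemap links are conventionally advertised with `Sitemap:` lines, which can be collected with a simple line scan. This is a minimal sketch of that idea, not pi-scraper's parser; `sitemapsFromRobots` is a hypothetical name.

```typescript
// Collect the URLs from "Sitemap:" lines in a robots.txt body.
function sitemapsFromRobots(robotsTxt: string): string[] {
  const found: string[] = [];
  for (const line of robotsTxt.split(/\r?\n/)) {
    // The field name is matched case-insensitively; the URL is the first token after the colon.
    const url = /^\s*sitemap\s*:\s*(\S+)/i.exec(line)?.[1];
    if (url) found.push(url);
  }
  return found;
}

const robots = [
  "User-agent: *",
  "Disallow: /private/",
  "Sitemap: https://example.com/sitemap.xml",
  "sitemap: https://example.com/news-sitemap.xml.gz",
].join("\n");

console.log(sitemapsFromRobots(robots));
```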

🔒 Safety & Resilience

SSRF Protection: Built-in validation at the connect and redirect layers.
Robots.txt: Full respect for site crawling rules (configurable).
Memory Efficient: Large responses are streamed and stored locally.
Incremental Enforcement: maxBytes limits are enforced during the stream.
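Incremental enforcement means the limit is checked while chunks arrive rather than after the whole body has been buffered. The sketch below shows the general technique with an in-memory chunk source. It is an assumption about the approach, not pi-scraper's code, and whether the real tool raises an error or truncates is not stated here. `readWithLimit` and `fakeStream` are hypothetical names.

```typescript
// Consume chunks one at a time and stop as soon as the running total exceeds maxBytes.
function readWithLimit(chunks: Iterable<Uint8Array>, maxBytes: number): Uint8Array {
  const kept: Uint8Array[] = [];
  let total = 0;
  for (const chunk of chunks) {
    total += chunk.byteLength;
    if (total > maxBytes) {
      throw new RangeError(`response exceeded maxBytes (${maxBytes})`);
    }
    kept.push(chunk);
  }
  // Under the limit: join the chunks into one contiguous buffer.
  const body = new Uint8Array(total);
  let offset = 0;
  for (const chunk of kept) {
    body.set(chunk, offset);
    offset += chunk.byteLength;
  }
  return body;
}

// A stand-in for a network stream: three 4-byte chunks, counting how many were produced.
let produced = 0;
function* fakeStream(): Generator<Uint8Array> {
  for (let i = 0; i < 3; i++) {
    produced++;
    yield new Uint8Array(4);
  }
}

try {
  readWithLimit(fakeStream(), 6);
} catch (err) {
  console.log((err as Error).message, "after", produced, "chunks");
}
```

Because the check runs per chunk, the third chunk is never requested once the limit is crossed.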

⚙️ Configuration

Use the /scrape-config slash command to manage your settings interactively or via the CLI:

/scrape-config status                     # View current settings
/scrape-config scrape-mode browser        # Set default mode to browser
/scrape-config robots off                 # Disable robots.txt respect
/scrape-config cache clear                # Wipe the local response cache

📦 Developer Info

If you are contributing to or building on top of pi-scraper:

Requirements

Node.js: >=22.19.0
Pi: >=0.74.0

Build & Test

npm install        # Install dependencies
npm run typecheck  # Verify types
npm test           # Run unit tests
npm run test:tools # Run tool smoke tests

🔄 Playwright backend (opt-out)

To use stock Playwright Chromium instead of CloakBrowser:

npm install playwright
npx playwright install chromium

web_scrape url="https://example.com" mode=browser browserBackend=playwright stealth=true

📜 License

This project is licensed under the MIT License. See LICENSE for details.
