RUSTScrapling

A high-performance Rust port of Scrapling -- the modern web scraping framework built by web scrapers, for web scrapers.

Parse HTML with CSS selectors, fetch pages with stealth headers, and crawl entire sites with async concurrency -- all from a single Rust crate.

Why RUSTScrapling?

The original Scrapling (Python) combines three powerful ideas in one framework:

Adaptive Parsing -- CSS/XPath selectors that can relocate elements when page structure changes
Multi-Strategy Fetching -- simple HTTP, stealth-mode, and browser automation in one API
Spider-Based Crawling -- Scrapy-inspired async crawlers with rate limiting, deduplication, and checkpointing

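RUSTScrapling brings these ideas to Rust with native performance, memory safety, and zero-cost abstractions. Browser automation and adaptive element relocation are not implemented yet (see Areas for Contribution). It's structured as four independent layers that compose together: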

Layer	Purpose	Key Types
Core	Rich string types, attribute maps, persistent storage	`TextHandler`, `AttributesHandler`, `SqliteStorage`
Parser	HTML parsing with CSS selectors, DOM traversal, regex	`Selector`, `Selectors`
Fetchers	Async HTTP with retries, stealth headers, proxy rotation	`Fetcher`, `FetcherConfig`, `Response`
Spiders	Concurrent crawl orchestration	`Spider` trait, `CrawlerEngine`, `SpiderRequest`

Installation

Add to your Cargo.toml:

[dependencies]
rust_scrapling = { git = "https://github.com/Liohtml/RUSTScrapling" }

Or clone and build locally:

git clone https://github.com/Liohtml/RUSTScrapling.git
cd RUSTScrapling
cargo build --release

Requirements: Rust 1.75+ (edition 2021)

Quick Start

use rust_scrapling::{Selector, Fetcher, FetcherConfig};

#[tokio::main]
async fn main() {
    // -- Parse static HTML --
    let html = r#"<html><body>
      <h1 class="title">Hello World</h1>
      <a href="/about">About</a>
    </body></html>"#;

    let page = Selector::from_html(html);
    let title = page.css("h1.title");
    println!("{}", title[0].text());  // "Hello World"

    // -- Fetch a live page --
    let fetcher = Fetcher::new(FetcherConfig::default());
    let response = fetcher.get("https://example.com").await.unwrap();
    let page = response.selector();

    for link in page.css("a") {
        let href = link.attrib().get("href");
        println!("{}: {}", link.text(), href.map(|h| h.as_str()).unwrap_or(""));
    }
}

Usage Guide

Parsing HTML

Create a Selector from any HTML string. It wraps the parsed DOM tree and provides the full query API.

use rust_scrapling::Selector;

let page = Selector::from_html("<html><body><p>Hello</p></body></html>");

// With a base URL (enables urljoin for relative links)
let page = Selector::from_html_with_url(html, "https://example.com/page/1");
let absolute = page.urljoin("/about");  // "https://example.com/about"

CSS Selectors

Full CSS3 selector support powered by the scraper crate:

let page = Selector::from_html(html);

// By class
let items = page.css("div.product");

// By ID
let main = page.css("#main-content");

// By attribute
let priced = page.css("[data-price]");

// Compound selectors
let links = page.css("nav > ul > li > a.active");

// Descendant selectors
let deep = page.css("div.container span.highlight");

The result is a Selectors collection with batch operations:

let items = page.css("li.item");

// Access by index
let first = &items[0];
let last = items.last().unwrap();

// Iterate
for item in &items {
    println!("{}: {}", item.tag(), item.text());
}

// Filter
let special = items.filter(|item| item.has_class("special"));

// Search (find first match)
let target = items.search(|item| item.text().as_str() == "Target");

// Chain CSS queries
let names = page.css("div.product").css("h2.name");

// Batch text extraction
let all_text: Vec<_> = items.getall();  // Vec<TextHandler>

Text Extraction

TextHandler wraps every text value with regex, JSON, and cleaning methods:

// Direct text (immediate children only)
let text = element.text();  // TextHandler

// All text recursively, ignoring <script> and <style>
let all_text = element.get_all_text("\n", true, &["script", "style"], None);

// Chaining
let cleaned = element.text().strip().to_lowercase().replace_str("old", "new");

// JSON parsing
let data = element.text().json().unwrap();  // serde_json::Value

// Inner/outer HTML
let inner = element.html_content();
let outer = element.outer_html();

DOM Navigation

let item = page.css("li.product").first().unwrap();

// Parent
let list = item.parent().unwrap();
assert_eq!(list.tag(), "ul");

// Children (element nodes only)
let children = list.children();

// Siblings
let siblings = item.siblings();
let next = item.next();
let prev = item.previous();

// Attributes
let attrs = item.attrib();
let id = attrs.get("data-id").unwrap();
let has_class = item.has_class("featured");

// Search by text
let heading = page.find_by_text("Hello World", true, false, false);
let partial = page.find_by_text("Hello", true, true, false);  // partial match

// Search by regex
let match_ = page.find_by_regex(r"Item \d+", true, false);

Regex Extraction

Extract data from text using regex, with capture group support:

let price_el = page.css("span.price").first().unwrap();

// All matches (returns capture group 1 if present, else group 0)
let prices = price_el.re(r"\$(\d+\.\d+)", true, false, true);
// prices[0].as_str() == "19.99"

// First match only
let first = price_el.re_first(r"\$(\d+\.\d+)", true, false, true);

// Batch regex across multiple elements
let all_prices = page.css("span.price").re(r"\$(\d+\.\d+)", true, false, true);

Fetching Pages

The Fetcher is an async HTTP client with retries, stealth headers, and proxy support:

use rust_scrapling::{Fetcher, FetcherConfig};

// Default config: 30s timeout, 3 retries, stealth headers on
let fetcher = Fetcher::new(FetcherConfig::default());

// Custom config via builder
let fetcher = Fetcher::new(
    FetcherConfig::builder()
        .timeout(60)
        .retries(5)
        .proxy("http://proxy:8080")
        .user_agent("MyBot/1.0")
        .stealth(true)
        .verify_ssl(false)  // skips TLS certificate checks; avoid outside testing
        .header("Authorization", "Bearer token123")
        .build()
);

// HTTP methods (run inside an async fn that returns Result; `body` and `json_val` are defined elsewhere)
let resp = fetcher.get("https://example.com").await?;
let resp = fetcher.post("https://api.example.com/data", Some(body), None).await?;
let resp = fetcher.put("https://api.example.com/data/1", None, Some(&json_val)).await?;
let resp = fetcher.delete("https://api.example.com/data/1").await?;

// Response -> Selector (auto-parses HTML)
let page = resp.selector();
let titles = page.css("h1");

// Response metadata
println!("Status: {}", resp.status());
println!("URL: {}", resp.url());
println!("Blocked: {}", resp.is_blocked());
let json_data = resp.json()?;  // Parse as JSON
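
The retry setting above (3 retries by default) can be pictured with a small synchronous sketch. It is not the crate's implementation: `retry_with` and its parameters are hypothetical names, and this README does not say whether the real `Fetcher` waits between attempts, so no delay is modelled here.

```rust
/// Hypothetical helper: run `op` once, then retry up to `retries` more
/// times while it keeps failing. Returns the first success or the last error.
/// The closure receives the zero-based attempt number.
fn retry_with<T, E>(retries: u32, mut op: impl FnMut(u32) -> Result<T, E>) -> Result<T, E> {
    let mut attempt = 0;
    loop {
        match op(attempt) {
            Ok(value) => return Ok(value),
            // Out of retries: surface the most recent error.
            Err(err) if attempt >= retries => return Err(err),
            // Otherwise count the failure and try again.
            Err(_) => attempt += 1,
        }
    }
}
```

With `retries = 3` the operation runs at most four times: the initial attempt plus three retries.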

Building a Spider

Define a spider by implementing the Spider trait:

use rust_scrapling::{Spider, SpiderRequest, CrawlerEngine, FetcherConfig};
use rust_scrapling::spiders::response::SpiderResponse;
use rust_scrapling::spiders::session::SessionManager;
use async_trait::async_trait;
use std::sync::Arc;

struct ProductSpider;

#[async_trait]
impl Spider for ProductSpider {
    fn name(&self) -> &str { "products" }

    fn start_urls(&self) -> Vec<String> {
        vec!["https://shop.example.com/products".into()]
    }

    fn concurrent_requests(&self) -> u32 { 8 }
    fn download_delay(&self) -> f64 { 0.5 }
    fn robots_txt_obey(&self) -> bool { true }

    fn allowed_domains(&self) -> std::collections::HashSet<String> {
        ["shop.example.com".into()].into()
    }

    async fn parse(
        &self,
        response: SpiderResponse,
    ) -> (Vec<serde_json::Value>, Vec<SpiderRequest>) {
        let page = response.selector();
        let mut items = Vec::new();
        let mut requests = Vec::new();

        // Extract product data
        for product in page.css("div.product") {
            let name = product.css("h2.name");
            let price = product.css("span.price");

            items.push(serde_json::json!({
                "name": name.first().map(|n| n.text().as_str().to_string()),
                "price": price.first().map(|p| p.text().as_str().to_string()),
                "url": response.url(),
            }));
        }

        // Follow pagination
        for link in page.css("a.next-page") {
            if let Some(href) = link.attrib().get("href") {
                let next_url = page.urljoin(href.as_str());
                requests.push(SpiderRequest::new(&next_url));
            }
        }

        (items, requests)
    }

    async fn on_scraped_item(&self, item: serde_json::Value) -> Option<serde_json::Value> {
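        // Drop items without a price. The "price" key is always present
        // (a missing price serializes to null), so test for a non-null value.
        if item.get("price").is_some_and(|p| !p.is_null()) { Some(item) } else { None }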
    }
}

#[tokio::main]
async fn main() {
    let spider = Arc::new(ProductSpider);
    let mut session_manager = SessionManager::new(FetcherConfig::default());
    session_manager.ensure_default();

    let engine = CrawlerEngine::new(spider, session_manager, None);
    let result = engine.crawl().await;

    println!("Scraped {} items in {:.1}s",
        result.items.len(),
        result.stats.elapsed_seconds());
    println!("Requests: {}, Failed: {}",
        result.stats.requests_count,
        result.stats.failed_requests_count);

    // Export results
    result.items.to_json("products.json", true).unwrap();
    result.items.to_jsonl("products.jsonl").unwrap();
}

Spider Configuration Options

Option	Default	Description
`concurrent_requests()`	`4`	Global concurrency limit
`concurrent_requests_per_domain()`	`0`	Per-domain limit (0 = disabled)
`download_delay()`	`0.0`	Seconds between requests
`robots_txt_obey()`	`false`	Respect robots.txt
`max_blocked_retries()`	`3`	Retry limit for blocked responses
`allowed_domains()`	`{}`	Domain whitelist (empty = allow all)
`development_mode()`	`false`	Cache responses to disk for dev iteration
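
As an illustration of the `allowed_domains()` rule in the table (empty set = allow all), here is a minimal std-only sketch. The function name `is_allowed` is hypothetical, and treating subdomains of an allowed domain as allowed is an assumption; the README does not say how the crate matches hosts.

```rust
use std::collections::HashSet;

/// Hypothetical check: an empty set allows every host. Otherwise the host
/// must equal an allowed domain or be a subdomain of one (an assumption).
fn is_allowed(host: &str, allowed: &HashSet<String>) -> bool {
    if allowed.is_empty() {
        return true;
    }
    let host = host.to_ascii_lowercase();
    allowed.iter().any(|domain| {
        let domain = domain.to_ascii_lowercase();
        // The leading dot prevents "notshop.example.com" from matching
        // "shop.example.com".
        host == domain || host.ends_with(&format!(".{domain}"))
    })
}
```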

Spider Lifecycle Hooks

Hook	When
`on_start(resuming)`	Before crawl begins
`on_close()`	After crawl ends
`on_error(request, error)`	When a request fails
`on_scraped_item(item)`	Item pipeline -- return `None` to drop
`is_blocked(response)`	Custom block detection
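
The `on_scraped_item` contract (return `None` to drop) behaves like a filter step over the scraped items. A minimal sketch of that idea using only the standard library follows; `Item` and `run_pipeline` are hypothetical stand-ins, and the real hook is async and works on `serde_json::Value`.

```rust
/// Hypothetical stand-in for a scraped item.
#[derive(Debug, Clone, PartialEq)]
struct Item {
    name: String,
    price: Option<String>,
}

/// Plays the role of the hook: `Some` keeps the item, `None` drops it.
fn on_scraped_item(item: Item) -> Option<Item> {
    if item.price.is_some() { Some(item) } else { None }
}

/// Runs every item through the hook and keeps only the survivors.
fn run_pipeline(items: Vec<Item>) -> Vec<Item> {
    items.into_iter().filter_map(on_scraped_item).collect()
}
```

Because the hook returns the item itself, the same step could also rewrite an item before keeping it.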

CLI

RUSTScrapling includes a command-line tool for quick scraping:

# Fetch a page and extract text
rust-scrapling fetch https://example.com

# Extract specific elements with a CSS selector
rust-scrapling fetch https://example.com --selector "h1"

# Output as HTML
rust-scrapling fetch https://example.com --selector "div.content" --format html

# Output as JSON (tag, text, html per element)
rust-scrapling fetch https://example.com --selector "a" --format json

# Disable stealth headers
rust-scrapling fetch https://example.com --no-stealth

# Extract text content (shorthand)
rust-scrapling extract https://example.com --selector "p"

Architecture

rust_scrapling/
|
|-- core/                          # Foundation types
|   |-- text_handler.rs            # TextHandler: String + regex/json/clean
|   |-- text_handlers.rs           # TextHandlers: Vec<TextHandler> batch ops
|   |-- attributes_handler.rs      # AttributesHandler: read-only attr map
|   +-- storage.rs                 # SqliteStorage: adaptive element persistence
|
|-- parser/                        # HTML parsing engine
|   |-- selector.rs                # Selector: element wrapper (CSS, text, nav)
|   |-- selectors.rs               # Selectors: batch operations
|   |-- selector_generation.rs     # Auto-generate CSS/XPath from DOM position
|   +-- translator.rs              # ::text and ::attr() pseudo-elements
|
|-- fetchers/                      # HTTP layer
|   |-- client.rs                  # Fetcher: async HTTP with retries
|   |-- config.rs                  # FetcherConfig: builder pattern
|   |-- response.rs                # Response: auto-parses to Selector
|   |-- proxy.rs                   # ProxyRotator: round-robin proxy cycling
|   +-- constants.rs               # User agents, status codes, headers
|
+-- spiders/                       # Crawl framework
    |-- spider.rs                  # Spider trait (user-facing API)
    |-- engine.rs                  # CrawlerEngine: async orchestrator
    |-- request.rs                 # SpiderRequest: fingerprinting + priority
    |-- response.rs                # SpiderResponse: parser integration
    |-- result.rs                  # CrawlResult, CrawlStats, ItemList
    |-- scheduler.rs               # Priority queue with deduplication
    |-- session.rs                 # SessionManager: named HTTP sessions
    |-- robots.rs                  # robots.txt compliance
    |-- cache.rs                   # Dev-mode response caching
    +-- checkpoint.rs              # Pause/resume persistence
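
The scheduler.rs entry above describes a priority queue with deduplication. A minimal std-only sketch of that combination is shown below. It is not the crate's code: `Scheduler`, `Queued`, deduplicating on the raw URL string, and the rule that a larger number means higher priority are all assumptions.

```rust
use std::collections::{BinaryHeap, HashSet};

/// Hypothetical queued request. Field order matters: the derived ordering
/// compares `priority` first, then `url` as a tie-breaker.
#[derive(Debug, PartialEq, Eq, PartialOrd, Ord)]
struct Queued {
    priority: i32,
    url: String,
}

#[derive(Default)]
struct Scheduler {
    heap: BinaryHeap<Queued>, // max-heap: largest priority pops first
    seen: HashSet<String>,    // every URL ever enqueued
}

impl Scheduler {
    /// Enqueue unless this URL was seen before; returns whether it was added.
    fn push(&mut self, url: &str, priority: i32) -> bool {
        if !self.seen.insert(url.to_string()) {
            return false; // duplicate, skip it
        }
        self.heap.push(Queued { priority, url: url.to_string() });
        true
    }

    /// Take the highest-priority URL, if any.
    fn pop(&mut self) -> Option<String> {
        self.heap.pop().map(|q| q.url)
    }
}
```

Keeping `seen` separate from the heap means a URL stays deduplicated even after it has been popped and fetched.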

Design Principles

Each layer is independent. Use just the parser without fetchers. Use fetchers without spiders. Compose as needed.
No hidden copies of the tree. Selector uses Rc<Html> to share the parsed tree, so child selectors point into the same tree instead of cloning it.
Async-first. The fetcher and spider layers are built on tokio for high-concurrency crawling.
Scrapy-compatible API names. css(), text(), re(), re_first(), get(), getall() mirror Scrapy/Parsel conventions.
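
The `Rc<Html>` sharing mentioned in the design principles can be sketched with the standard library alone. `Document` and `Handle` below are hypothetical types, and a flat list of tag names stands in for a real DOM tree; the point is only that new handles share one allocation.

```rust
use std::rc::Rc;

/// Hypothetical parsed document: a flat list of tag names stands in for the tree.
struct Document {
    tags: Vec<String>,
}

/// A lightweight handle: a shared pointer to the document plus a node index.
#[derive(Clone)]
struct Handle {
    doc: Rc<Document>,
    index: usize,
}

impl Handle {
    /// Builds the document once and returns a handle to its first node.
    fn root(tags: &[&str]) -> Handle {
        let doc = Document { tags: tags.iter().map(|t| t.to_string()).collect() };
        Handle { doc: Rc::new(doc), index: 0 }
    }

    /// Creating another handle bumps a reference count; the tree is not copied.
    fn node(&self, index: usize) -> Option<Handle> {
        (index < self.doc.tags.len()).then(|| Handle { doc: Rc::clone(&self.doc), index })
    }

    /// Tag name of the node this handle points at.
    fn tag(&self) -> &str {
        &self.doc.tags[self.index]
    }
}
```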

API Reference

Core Types

Type	Description
`TextHandler`	String wrapper with `.re()`, `.json()`, `.clean()`, `.strip()`, `.replace_str()`
`TextHandlers`	`Vec<TextHandler>` with batch `.re()`, `.re_first()`
`AttributesHandler`	Read-only attribute map with `.get()`, `.search_values()`, `.json_string()`
`SqliteStorage`	SQLite-backed element storage for adaptive mode

Parser Types

Type	Description
`Selector`	HTML element wrapper -- `.css()`, `.text()`, `.attrib()`, `.children()`, `.parent()`, `.find_by_text()`
`Selectors`	Element collection -- `.css()`, `.filter()`, `.search()`, `.getall()`, `.re()`

Fetcher Types

Type	Description
`Fetcher`	Async HTTP client -- `.get()`, `.post()`, `.put()`, `.delete()`
`FetcherConfig`	Config builder -- `.timeout()`, `.retries()`, `.proxy()`, `.stealth()`
`Response`	HTTP response -- `.selector()`, `.json()`, `.status()`, `.is_blocked()`
`ProxyRotator`	Round-robin proxy rotation
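
Round-robin rotation, as named in the `ProxyRotator` row, means handing out proxies in a fixed cycle. A minimal sketch under that reading follows; `RoundRobin` and `next_proxy` are hypothetical names and do not reflect the real `ProxyRotator` API.

```rust
/// Hypothetical round-robin rotator over a fixed proxy list.
struct RoundRobin {
    proxies: Vec<String>,
    next: usize,
}

impl RoundRobin {
    fn new(proxies: Vec<String>) -> Self {
        RoundRobin { proxies, next: 0 }
    }

    /// Returns the next proxy in order, wrapping around at the end.
    /// Returns `None` when the list is empty.
    fn next_proxy(&mut self) -> Option<&str> {
        if self.proxies.is_empty() {
            return None;
        }
        let proxy = &self.proxies[self.next];
        self.next = (self.next + 1) % self.proxies.len();
        Some(proxy.as_str())
    }
}
```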

Spider Types

Type	Description
`Spider` (trait)	User implements `.parse()`, configures `start_urls`, concurrency, etc.
`CrawlerEngine<S>`	Async orchestrator -- `.crawl()` returns `CrawlResult`
`SpiderRequest`	Request with fingerprinting, priority, metadata
`CrawlResult`	Final result -- `.items`, `.stats`, `.completed()`
`CrawlStats`	Metrics -- requests, bytes, items, timing, status codes
`ItemList`	Scraped items -- `.to_json()`, `.to_jsonl()`
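
Request fingerprinting, listed for `SpiderRequest`, usually means reducing a request to a value that is equal for requests considered duplicates. The sketch below is one possible scheme, not the crate's: the `fingerprint` function, the choice of inputs (method, URL without fragment, body), and the use of `DefaultHasher` are all assumptions.

```rust
use std::collections::hash_map::DefaultHasher;
use std::hash::{Hash, Hasher};

/// Hypothetical fingerprint: hashes the uppercased method, the URL with any
/// `#fragment` removed, and the request body.
fn fingerprint(method: &str, url: &str, body: &[u8]) -> u64 {
    // Fragments are never sent to the server, so they are ignored here.
    let url = url.split('#').next().unwrap_or(url);
    let mut hasher = DefaultHasher::new();
    method.to_ascii_uppercase().hash(&mut hasher);
    url.hash(&mut hasher);
    body.hash(&mut hasher);
    hasher.finish()
}
```

`DefaultHasher` output is not guaranteed to stay the same across Rust releases, so a scheme that persists fingerprints (for example in checkpoints) would want a hash with a fixed specification instead.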

Testing

# Run all tests (175 pass, 3 network tests ignored)
cargo test

# Run with network tests
cargo test -- --ignored

# Run a specific test module
cargo test parser_selector
cargo test core_text_handler
cargo test integration_test

# Run with logging
RUST_LOG=debug cargo test

# Check code quality
cargo clippy -- -W clippy::all

# Build release binary
cargo build --release

Test Coverage

Module	Tests	Coverage
Core (TextHandler, AttributesHandler, Storage)	64	All public methods
Parser (Selector, Selectors, Generation)	38	CSS, text, nav, regex, DOM
Fetchers (Config, Client, Response)	21	Config, headers, Response struct
Spiders (Request, Scheduler, Result)	27	Fingerprinting, dedup, priority, export
Integration	28	End-to-end scraping workflows
Total	178

Contributing

Contributions are welcome! Here's how to get started:

Fork the repository
Create a branch for your feature (git checkout -b feature/amazing-feature)
Write tests for your changes
Run the test suite (cargo test && cargo clippy)
Commit with a descriptive message
Push and open a Pull Request

Development Setup

git clone https://github.com/Liohtml/RUSTScrapling.git
cd RUSTScrapling
cargo build
cargo test

Areas for Contribution

Browser automation -- Headless Chrome/Playwright integration (like Python Scrapling's StealthyFetcher/DynamicFetcher)
Adaptive mode -- Element relocation using similarity scoring (storage layer is ready)
Interactive shell -- REPL for exploring pages
Performance -- Benchmarks, SIMD text processing, zero-copy parsing
Documentation -- More examples, tutorials, API docs

License

Licensed under either of:

MIT License (LICENSE-MIT or http://opensource.org/licenses/MIT)
Apache License, Version 2.0 (LICENSE-APACHE or http://www.apache.org/licenses/LICENSE-2.0)

at your option.

Credits

Scrapling by Karim Shoair -- the original Python framework that inspired this project
scraper -- HTML parsing and CSS selection in Rust
reqwest -- HTTP client
tokio -- Async runtime

Built with Rust. Inspired by Scrapling. Made for scraping.
