A high-performance Rust port of Scrapling -- the modern web scraping framework built by web scrapers, for web scrapers.
Parse HTML with CSS selectors, fetch pages with stealth headers, and crawl entire sites with async concurrency -- all from a single Rust crate.
The original Scrapling (Python) combines three powerful ideas in one framework:
- Adaptive Parsing -- CSS/XPath selectors that can relocate elements when page structure changes
- Multi-Strategy Fetching -- simple HTTP, stealth-mode, and browser automation in one API
- Spider-Based Crawling -- Scrapy-inspired async crawlers with rate limiting, deduplication, and checkpointing
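RUSTScrapling brings these ideas to Rust with native performance, memory safety, and zero-cost abstractions. Browser automation and adaptive element relocation are not implemented yet; both are listed near the end of this README as open work. The crate is structured as four independent layers that compose together: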
| Layer | Purpose | Key Types |
|---|---|---|
| Core | Rich string types, attribute maps, persistent storage | TextHandler, AttributesHandler, SqliteStorage |
| Parser | HTML parsing with CSS selectors, DOM traversal, regex | Selector, Selectors |
| Fetchers | Async HTTP with retries, stealth headers, proxy rotation | Fetcher, FetcherConfig, Response |
| Spiders | Concurrent crawl orchestration | Spider trait, CrawlerEngine, SpiderRequest |
- Installation
- Quick Start
- Usage Guide
- CLI
- Architecture
- API Reference
- Testing
- Contributing
- License
- Credits
Add to your Cargo.toml:
[dependencies]
rust_scrapling = { git = "https://github.com/Liohtml/RUSTScrapling" }

Or clone and build locally:
git clone https://github.com/Liohtml/RUSTScrapling.git
cd RUSTScrapling
cargo build --release

Requirements: Rust 1.75+ (edition 2021)
use rust_scrapling::{Selector, Fetcher, FetcherConfig};

#[tokio::main]
async fn main() {
    // -- Parse static HTML --
    let html = r#"<html><body>
        <h1 class="title">Hello World</h1>
        <a href="/about">About</a>
    </body></html>"#;
    let page = Selector::from_html(html);
    let title = page.css("h1.title");
    println!("{}", title[0].text()); // "Hello World"

    // -- Fetch a live page --
    let fetcher = Fetcher::new(FetcherConfig::default());
    let response = fetcher.get("https://example.com").await.unwrap();
    let page = response.selector();
    for link in page.css("a") {
        let href = link.attrib().get("href");
        println!("{}: {}", link.text(), href.map(|h| h.as_str()).unwrap_or(""));
    }
}

Create a Selector from any HTML string. It wraps the parsed DOM tree and provides the full query API.
use rust_scrapling::Selector;
let page = Selector::from_html("<html><body><p>Hello</p></body></html>");
// With a base URL (enables urljoin for relative links)
let page = Selector::from_html_with_url(html, "https://example.com/page/1");
let absolute = page.urljoin("/about"); // "https://example.com/about"

Full CSS3 selector support powered by the scraper crate:
let page = Selector::from_html(html);
// By class
let items = page.css("div.product");
// By ID
let main = page.css("#main-content");
// By attribute
let priced = page.css("[data-price]");
// Compound selectors
let links = page.css("nav > ul > li > a.active");
// Descendant selectors
let deep = page.css("div.container span.highlight");

The result is a Selectors collection with batch operations:
let items = page.css("li.item");
// Access by index
let first = &items[0];
let last = items.last().unwrap();
// Iterate
for item in &items {
    println!("{}: {}", item.tag(), item.text());
}
// Filter
let special = items.filter(|item| item.has_class("special"));
// Search (find first match)
let target = items.search(|item| item.text().as_str() == "Target");
// Chain CSS queries
let names = page.css("div.product").css("h2.name");
// Batch text extraction
let all_text: Vec<_> = items.getall(); // Vec<TextHandler>

TextHandler wraps every text value with regex, JSON, and cleaning methods:
// Direct text (immediate children only)
let text = element.text(); // TextHandler
// All text recursively, ignoring <script> and <style>
let all_text = element.get_all_text("\n", true, &["script", "style"], None);
// Chaining
let cleaned = element.text().strip().to_lowercase().replace_str("old", "new");
// JSON parsing
let data = element.text().json().unwrap(); // serde_json::Value
// Inner/outer HTML
let inner = element.html_content();
let outer = element.outer_html();

let item = page.css("li.product").first().unwrap();
// Parent
let list = item.parent().unwrap();
assert_eq!(list.tag(), "ul");
// Children (element nodes only)
let children = list.children();
// Siblings
let siblings = item.siblings();
let next = item.next();
let prev = item.previous();
// Attributes
let attrs = item.attrib();
let id = attrs.get("data-id").unwrap();
let has_class = item.has_class("featured");
// Search by text
let heading = page.find_by_text("Hello World", true, false, false);
let partial = page.find_by_text("Hello", true, true, false); // partial match
// Search by regex
let match_ = page.find_by_regex(r"Item \d+", true, false);

Extract data from text using regex, with capture group support:
let price_el = page.css("span.price").first().unwrap();
// All matches (returns capture group 1 if present, else group 0)
let prices = price_el.re(r"\$(\d+\.\d+)", true, false, true);
// prices[0].as_str() == "19.99"
// First match only
let first = price_el.re_first(r"\$(\d+\.\d+)", true, false, true);
// Batch regex across multiple elements
let all_prices = page.css("span.price").re(r"\$(\d+\.\d+)", true, false, true);

The Fetcher is an async HTTP client with retries, stealth headers, and proxy support:
use rust_scrapling::{Fetcher, FetcherConfig};
// Default config: 30s timeout, 3 retries, stealth headers on
let fetcher = Fetcher::new(FetcherConfig::default());
// Custom config via builder
let fetcher = Fetcher::new(
    FetcherConfig::builder()
        .timeout(60)
        .retries(5)
        .proxy("http://proxy:8080")
        .user_agent("MyBot/1.0")
        .stealth(true)
        .verify_ssl(false)
        .header("Authorization", "Bearer token123")
        .build()
);
// HTTP methods
let resp = fetcher.get("https://example.com").await?;
let resp = fetcher.post("https://api.example.com/data", Some(body), None).await?;
let resp = fetcher.put("https://api.example.com/data/1", None, Some(&json_val)).await?;
let resp = fetcher.delete("https://api.example.com/data/1").await?;
// Response -> Selector (auto-parses HTML)
let page = resp.selector();
let titles = page.css("h1");
// Response metadata
println!("Status: {}", resp.status());
println!("URL: {}", resp.url());
println!("Blocked: {}", resp.is_blocked());
let json_data = resp.json()?; // Parse as JSON

Define a spider by implementing the Spider trait:
use rust_scrapling::{Spider, SpiderRequest, CrawlerEngine, FetcherConfig};
use rust_scrapling::spiders::response::SpiderResponse;
use rust_scrapling::spiders::session::SessionManager;
use async_trait::async_trait;
use std::sync::Arc;

struct ProductSpider;

#[async_trait]
impl Spider for ProductSpider {
    fn name(&self) -> &str { "products" }

    fn start_urls(&self) -> Vec<String> {
        vec!["https://shop.example.com/products".into()]
    }

    fn concurrent_requests(&self) -> u32 { 8 }
    fn download_delay(&self) -> f64 { 0.5 }
    fn robots_txt_obey(&self) -> bool { true }

    fn allowed_domains(&self) -> std::collections::HashSet<String> {
        ["shop.example.com".into()].into()
    }

    async fn parse(
        &self,
        response: SpiderResponse,
    ) -> (Vec<serde_json::Value>, Vec<SpiderRequest>) {
        let page = response.selector();
        let mut items = Vec::new();
        let mut requests = Vec::new();

        // Extract product data
        for product in page.css("div.product") {
            let name = product.css("h2.name");
            let price = product.css("span.price");
            items.push(serde_json::json!({
                "name": name.first().map(|n| n.text().as_str().to_string()),
                "price": price.first().map(|p| p.text().as_str().to_string()),
                "url": response.url(),
            }));
        }

        // Follow pagination
        for link in page.css("a.next-page") {
            if let Some(href) = link.attrib().get("href") {
                let next_url = page.urljoin(href.as_str());
                requests.push(SpiderRequest::new(&next_url));
            }
        }

        (items, requests)
    }
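    async fn on_scraped_item(&self, item: serde_json::Value) -> Option<serde_json::Value> {
        // Drop items without a price. parse() always emits the "price" key
        // (as null when no price element matched), so check for null too.
        let has_price = item.get("price").map_or(false, |p| !p.is_null());
        if has_price { Some(item) } else { None }
    }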
}

#[tokio::main]
async fn main() {
    let spider = Arc::new(ProductSpider);
    let mut session_manager = SessionManager::new(FetcherConfig::default());
    session_manager.ensure_default();

    let engine = CrawlerEngine::new(spider, session_manager, None);
    let result = engine.crawl().await;

    println!("Scraped {} items in {:.1}s",
        result.items.len(),
        result.stats.elapsed_seconds());
    println!("Requests: {}, Failed: {}",
        result.stats.requests_count,
        result.stats.failed_requests_count);

    // Export results
    result.items.to_json("products.json", true).unwrap();
    result.items.to_jsonl("products.jsonl").unwrap();
}

| Option | Default | Description |
|---|---|---|
| concurrent_requests() | 4 | Global concurrency limit |
| concurrent_requests_per_domain() | 0 | Per-domain limit (0 = disabled) |
| download_delay() | 0.0 | Seconds between requests |
| robots_txt_obey() | false | Respect robots.txt |
| max_blocked_retries() | 3 | Retry limit for blocked responses |
| allowed_domains() | {} | Domain whitelist (empty = allow all) |
| development_mode() | false | Cache responses to disk for dev iteration |
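
The allowed_domains() row says that an empty set allows every domain. The rule can be sketched with the standard library alone. The `host_of` and `is_allowed` helpers below are hypothetical and are not part of the crate's API. The sketch assumes an exact host match; whether RUSTScrapling also accepts subdomains is not stated here.

```rust
use std::collections::HashSet;

/// Hypothetical helper: pull the host out of an absolute URL without a URL crate.
/// Only handles the simple `scheme://[user@]host[:port]/path` shape (no IPv6 literals).
fn host_of(url: &str) -> Option<&str> {
    let rest = url.split_once("://")?.1;
    // The authority ends at the first '/', '?' or '#'.
    let authority = rest.split(|c: char| c == '/' || c == '?' || c == '#').next()?;
    // Drop any userinfo, then any port.
    let host_port = authority.rsplit('@').next()?;
    let host = host_port.split(':').next()?;
    if host.is_empty() { None } else { Some(host) }
}

/// Hypothetical whitelist check: an empty set allows everything,
/// otherwise the URL's host must be in the set (exact match, by assumption).
fn is_allowed(url: &str, allowed: &HashSet<String>) -> bool {
    if allowed.is_empty() {
        return true;
    }
    match host_of(url) {
        Some(host) => allowed.contains(host),
        None => false,
    }
}

fn main() {
    let allowed: HashSet<String> = ["shop.example.com".to_string()].into();
    for url in ["https://shop.example.com/products", "https://other.example.com/"] {
        println!("{} -> allowed: {}", url, is_allowed(url, &allowed));
    }
}
```

A production crawler would normally parse URLs with a dedicated URL library instead of string splitting; the sketch only illustrates the "empty = allow all" semantics.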

| Hook | When |
|---|---|
| on_start(resuming) | Before crawl begins |
| on_close() | After crawl ends |
| on_error(request, error) | When a request fails |
| on_scraped_item(item) | Item pipeline -- return None to drop |
| is_blocked(response) | Custom block detection |

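
The on_scraped_item hook acts as a filter: returning None drops the item, returning Some keeps it (possibly modified). A minimal sketch of that contract, using a plain struct instead of serde_json::Value so it runs without dependencies. `Item`, `keep_priced`, and `run_pipeline` are hypothetical names for illustration, not the crate's types.

```rust
/// Hypothetical scraped item; the real spider passes serde_json::Value.
#[derive(Debug, Clone, PartialEq)]
struct Item {
    name: String,
    price: Option<String>,
}

/// Hypothetical hook in the style of on_scraped_item:
/// keep items that have a price, drop the rest.
fn keep_priced(item: Item) -> Option<Item> {
    if item.price.is_some() { Some(item) } else { None }
}

/// Run every item through the hook; items for which the hook returns None are dropped.
fn run_pipeline<F>(items: Vec<Item>, hook: F) -> Vec<Item>
where
    F: Fn(Item) -> Option<Item>,
{
    items.into_iter().filter_map(hook).collect()
}

fn main() {
    let items = vec![
        Item { name: "Widget".into(), price: Some("19.99".into()) },
        Item { name: "Gadget".into(), price: None },
    ];
    let kept = run_pipeline(items, keep_priced);
    println!("kept {} item(s): {:?}", kept.len(), kept);
}
```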
RUSTScrapling includes a command-line tool for quick scraping:
# Fetch a page and extract text
rust-scrapling fetch https://example.com
# Extract specific elements with a CSS selector
rust-scrapling fetch https://example.com --selector "h1"
# Output as HTML
rust-scrapling fetch https://example.com --selector "div.content" --format html
# Output as JSON (tag, text, html per element)
rust-scrapling fetch https://example.com --selector "a" --format json
# Disable stealth headers
rust-scrapling fetch https://example.com --no-stealth
# Extract text content (shorthand)
rust-scrapling extract https://example.com --selector "p"

rust_scrapling/
|
|-- core/                         # Foundation types
|   |-- text_handler.rs           # TextHandler: String + regex/json/clean
|   |-- text_handlers.rs          # TextHandlers: Vec<TextHandler> batch ops
|   |-- attributes_handler.rs     # AttributesHandler: read-only attr map
|   +-- storage.rs                # SqliteStorage: adaptive element persistence
|
|-- parser/                       # HTML parsing engine
|   |-- selector.rs               # Selector: element wrapper (CSS, text, nav)
|   |-- selectors.rs              # Selectors: batch operations
|   |-- selector_generation.rs    # Auto-generate CSS/XPath from DOM position
|   +-- translator.rs             # ::text and ::attr() pseudo-elements
|
|-- fetchers/                     # HTTP layer
|   |-- client.rs                 # Fetcher: async HTTP with retries
|   |-- config.rs                 # FetcherConfig: builder pattern
|   |-- response.rs               # Response: auto-parses to Selector
|   |-- proxy.rs                  # ProxyRotator: round-robin proxy cycling
|   +-- constants.rs              # User agents, status codes, headers
|
+-- spiders/                      # Crawl framework
    |-- spider.rs                 # Spider trait (user-facing API)
    |-- engine.rs                 # CrawlerEngine: async orchestrator
    |-- request.rs                # SpiderRequest: fingerprinting + priority
    |-- response.rs               # SpiderResponse: parser integration
    |-- result.rs                 # CrawlResult, CrawlStats, ItemList
    |-- scheduler.rs              # Priority queue with deduplication
    |-- session.rs                # SessionManager: named HTTP sessions
    |-- robots.rs                 # robots.txt compliance
    |-- cache.rs                  # Dev-mode response caching
    +-- checkpoint.rs             # Pause/resume persistence

- Each layer is independent. Use just the parser without fetchers. Use fetchers without spiders. Compose as needed.
- Zero hidden allocations. Selector uses Rc<Html> to share the parsed tree. Child selectors point into the same tree.
- Async-first. The fetcher and spider layers are built on tokio for high-concurrency crawling.
- Scrapy-compatible API names. css(), text(), re(), re_first(), get(), getall() mirror Scrapy/Parsel conventions.
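
The tree-sharing point above can be illustrated with a small sketch. It is a toy model under stated assumptions, not the crate's implementation: `Tree` and `Handle` are hypothetical stand-ins for the parsed document and for Selector. The idea is that a child handle clones the Rc (a reference-count bump) and never copies the tree.

```rust
use std::rc::Rc;

/// Hypothetical stand-in for a parsed document: a flat list of tag names.
struct Tree {
    nodes: Vec<String>,
}

/// Hypothetical element handle: a shared pointer to the tree plus a node index.
#[derive(Clone)]
struct Handle {
    tree: Rc<Tree>,
    index: usize,
}

impl Handle {
    /// Take ownership of the tree once; every later handle shares it.
    fn root(tree: Tree) -> Handle {
        Handle { tree: Rc::new(tree), index: 0 }
    }

    /// Create a handle to another node. Rc::clone only increments a counter.
    fn child(&self, index: usize) -> Option<Handle> {
        if index < self.tree.nodes.len() {
            Some(Handle { tree: Rc::clone(&self.tree), index })
        } else {
            None
        }
    }

    fn tag(&self) -> &str {
        &self.tree.nodes[self.index]
    }
}

fn main() {
    let root = Handle::root(Tree {
        nodes: vec!["html".into(), "body".into(), "p".into()],
    });
    let p = root.child(2).expect("node 2 exists");
    println!("root = {}, child = {}", root.tag(), p.tag());
    println!("same tree: {}", Rc::ptr_eq(&root.tree, &p.tree));
}
```

One consequence of Rc is that such handles are not Send, so they cannot be moved across threads; that trade-off is a property of Rc in general and applies to any design that shares a tree this way.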

| Type | Description |
|---|---|
| TextHandler | String wrapper with .re(), .json(), .clean(), .strip(), .replace_str() |
| TextHandlers | Vec<TextHandler> with batch .re(), .re_first() |
| AttributesHandler | Read-only attribute map with .get(), .search_values(), .json_string() |
| SqliteStorage | SQLite-backed element storage for adaptive mode |
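
The "string wrapper with chainable methods" idea behind TextHandler can be sketched as a newtype over String. This is an illustration only: the `Text` type below is hypothetical, it borrows three method names from the table (strip, to_lowercase, replace_str), and the real signatures may differ.

```rust
use std::ops::Deref;

/// Hypothetical newtype over String; not the crate's TextHandler.
#[derive(Debug, Clone, PartialEq)]
struct Text(String);

impl Text {
    /// Trim leading and trailing whitespace.
    fn strip(&self) -> Text {
        Text(self.0.trim().to_string())
    }

    fn to_lowercase(&self) -> Text {
        Text(self.0.to_lowercase())
    }

    /// Replace every occurrence of `from` with `to`.
    fn replace_str(&self, from: &str, to: &str) -> Text {
        Text(self.0.replace(from, to))
    }
}

/// Deref to str so the wrapper can be used wherever a &str is expected.
impl Deref for Text {
    type Target = str;
    fn deref(&self) -> &str {
        &self.0
    }
}

fn main() {
    let raw = Text("  Old Price  ".to_string());
    // Each method returns a new Text, so calls chain naturally.
    let cleaned = raw.strip().to_lowercase().replace_str("old", "new");
    println!("{:?} has {} bytes", &*cleaned, cleaned.len());
}
```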

| Type | Description |
|---|---|
| Selector | HTML element wrapper -- .css(), .text(), .attrib(), .children(), .parent(), .find_by_text() |
| Selectors | Element collection -- .css(), .filter(), .search(), .getall(), .re() |

| Type | Description |
|---|---|
| Fetcher | Async HTTP client -- .get(), .post(), .put(), .delete() |
| FetcherConfig | Config builder -- .timeout(), .retries(), .proxy(), .stealth() |
| Response | HTTP response -- .selector(), .json(), .status(), .is_blocked() |
| ProxyRotator | Round-robin proxy rotation |
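
Round-robin rotation simply cycles through the proxy list in order and wraps around at the end. A minimal sketch of the technique follows; the `RoundRobin` type and its `next_proxy` method are hypothetical, and ProxyRotator's actual API is not shown in this README.

```rust
/// Hypothetical round-robin rotator; the crate's ProxyRotator may differ.
struct RoundRobin {
    proxies: Vec<String>,
    next: usize,
}

impl RoundRobin {
    fn new(proxies: Vec<String>) -> Self {
        Self { proxies, next: 0 }
    }

    /// Return the next proxy in order, wrapping to the start after the last one.
    /// Returns None when no proxies are configured.
    fn next_proxy(&mut self) -> Option<&str> {
        if self.proxies.is_empty() {
            return None;
        }
        let index = self.next;
        self.next = (index + 1) % self.proxies.len();
        Some(self.proxies[index].as_str())
    }
}

fn main() {
    let mut rotator = RoundRobin::new(vec![
        "http://proxy-a:8080".to_string(),
        "http://proxy-b:8080".to_string(),
    ]);
    for _ in 0..3 {
        println!("{:?}", rotator.next_proxy());
    }
}
```

This version needs `&mut self`. A rotator shared between concurrent tasks would more likely keep the counter in an atomic or behind a lock; that detail is left out of the sketch.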

| Type | Description |
|---|---|
| Spider (trait) | User implements .parse(), configures start_urls, concurrency, etc. |
| CrawlerEngine<S> | Async orchestrator -- .crawl() returns CrawlResult |
| SpiderRequest | Request with fingerprinting, priority, metadata |
| CrawlResult | Final result -- .items, .stats, .completed() |
| CrawlStats | Metrics -- requests, bytes, items, timing, status codes |
| ItemList | Scraped items -- .to_json(), .to_jsonl() |
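
Request fingerprinting and a deduplicating priority queue fit together as follows: each request is hashed to a fingerprint, and a request whose fingerprint has already been seen is never queued again. The sketch below is a guess at the general technique, not the crate's scheduler. It assumes the fingerprint covers only method and URL, that higher priority values are served first, and that ties are served in insertion order. `fingerprint`, `Queued`, and `Scheduler` are hypothetical names.

```rust
use std::cmp::Reverse;
use std::collections::hash_map::DefaultHasher;
use std::collections::{BinaryHeap, HashSet};
use std::hash::{Hash, Hasher};

/// Hypothetical fingerprint: hash of the upper-cased method and the URL.
fn fingerprint(method: &str, url: &str) -> u64 {
    let mut hasher = DefaultHasher::new();
    method.to_ascii_uppercase().hash(&mut hasher);
    url.hash(&mut hasher);
    hasher.finish()
}

/// Heap entry. Derived ordering compares fields top to bottom:
/// priority first, then reversed insertion order (earlier wins ties).
#[derive(Debug, PartialEq, Eq, PartialOrd, Ord)]
struct Queued {
    priority: i32,
    order: Reverse<u64>,
    url: String,
}

struct Scheduler {
    heap: BinaryHeap<Queued>,
    seen: HashSet<u64>,
    counter: u64,
}

impl Scheduler {
    fn new() -> Self {
        Self { heap: BinaryHeap::new(), seen: HashSet::new(), counter: 0 }
    }

    /// Queue a request unless an identical one was queued before.
    /// Returns false for duplicates.
    fn push(&mut self, method: &str, url: &str, priority: i32) -> bool {
        if !self.seen.insert(fingerprint(method, url)) {
            return false;
        }
        self.heap.push(Queued {
            priority,
            order: Reverse(self.counter),
            url: url.to_string(),
        });
        self.counter += 1;
        true
    }

    /// Pop the highest-priority URL, if any.
    fn pop(&mut self) -> Option<String> {
        self.heap.pop().map(|queued| queued.url)
    }
}

fn main() {
    let mut scheduler = Scheduler::new();
    scheduler.push("GET", "https://example.com/a", 0);
    scheduler.push("GET", "https://example.com/b", 10);
    let duplicate_accepted = scheduler.push("GET", "https://example.com/a", 5);
    println!("duplicate accepted: {}", duplicate_accepted);
    while let Some(url) = scheduler.pop() {
        println!("next: {}", url);
    }
}
```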
# Run all tests (175 pass, 3 network tests ignored)
cargo test
# Run with network tests
cargo test -- --ignored
# Run a specific test module
cargo test parser_selector
cargo test core_text_handler
cargo test integration_test
# Run with logging
RUST_LOG=debug cargo test
# Check code quality
cargo clippy -- -W clippy::all
# Build release binary
cargo build --release

| Module | Tests | Coverage |
|---|---|---|
| Core (TextHandler, AttributesHandler, Storage) | 64 | All public methods |
| Parser (Selector, Selectors, Generation) | 38 | CSS, text, nav, regex, DOM |
| Fetchers (Config, Client, Response) | 21 | Config, headers, Response struct |
| Spiders (Request, Scheduler, Result) | 27 | Fingerprinting, dedup, priority, export |
| Integration | 28 | End-to-end scraping workflows |
| Total | 178 | |

Contributions are welcome! Here's how to get started:
- Fork the repository
- Create a branch for your feature (git checkout -b feature/amazing-feature)
- Write tests for your changes
- Run the test suite (cargo test && cargo clippy)
- Commit with a descriptive message
- Push and open a Pull Request
git clone https://github.com/Liohtml/RUSTScrapling.git
cd RUSTScrapling
cargo build
cargo test

- Browser automation -- Headless Chrome/Playwright integration (like Python Scrapling's StealthyFetcher/DynamicFetcher)
- Adaptive mode -- Element relocation using similarity scoring (storage layer is ready)
- Interactive shell -- REPL for exploring pages
- Performance -- Benchmarks, SIMD text processing, zero-copy parsing
- Documentation -- More examples, tutorials, API docs
Licensed under either of:
- MIT License (LICENSE-MIT or http://opensource.org/licenses/MIT)
- Apache License, Version 2.0 (LICENSE-APACHE or http://www.apache.org/licenses/LICENSE-2.0)
at your option.
- Scrapling by Karim Shoair -- the original Python framework that inspired this project
- scraper -- HTML parsing and CSS selection in Rust
- reqwest -- HTTP client
- tokio -- Async runtime
Built with Rust. Inspired by Scrapling. Made for scraping.