Summary
The Spider trait exposes a concurrent_requests_per_domain() method that users can override to cap simultaneous requests to any single domain. The value is stored in CrawlStats, but CrawlerEngine never enforces it: all requests contend on a single global Semaphore regardless of their target domain. A spider that overrides this method appears to limit per-domain concurrency, yet the override has no effect on scheduling.
Location
- File: src/spiders/engine.rs, Line(s): 95 (stored in stats), entire process_request / crawl loop (no per-domain semaphore)
- File: src/spiders/spider.rs, Line 17 (trait method declaration)
- File: src/spiders/result.rs, Line 68 (stored in CrawlStats but unused for control flow)
Severity
Medium
Details
CrawlerEngine creates one global_limiter: Arc<Semaphore> with capacity spider.concurrent_requests().max(1). The per-domain value from spider.concurrent_requests_per_domain() is copied into CrawlStats for reporting purposes only — no HashMap<String, Arc<Semaphore>> keyed on domain is ever created or consulted.
Consequence: a spider that sets concurrent_requests_per_domain to 1 (one-at-a-time per host) can still hammer the same origin with as many parallel requests as concurrent_requests allows. This may trigger bans, violate politeness policies, or cause unintended load on the target.
// engine.rs line 95 — value recorded but never used to throttle
stats.concurrent_requests_per_domain = self.spider.concurrent_requests_per_domain();
No code path checks concurrent_requests_per_domain before acquiring a semaphore permit.
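To make the gap concrete, the two bounds can be written out as plain arithmetic. This is only an illustrative sketch: both function names are hypothetical, and treating 0 as "no per-domain cap" is an assumption taken from the `> 0` check in the suggested fix.

```rust
/// Upper bound on simultaneous requests to one host as described in this
/// finding: only the global limiter applies, so the per-domain value is ignored.
pub fn current_per_host_bound(concurrent_requests: usize, _per_domain: usize) -> usize {
    // Mirrors the reported capacity: spider.concurrent_requests().max(1)
    concurrent_requests.max(1)
}

/// Upper bound a spider author would expect once the per-domain value is enforced.
/// Assumption: a per-domain value of 0 means "no per-domain cap".
pub fn intended_per_host_bound(concurrent_requests: usize, per_domain: usize) -> usize {
    let global = concurrent_requests.max(1);
    if per_domain > 0 {
        global.min(per_domain)
    } else {
        global
    }
}
```

With concurrent_requests = 16 and concurrent_requests_per_domain = 1, the first function yields 16 while the second yields 1, which is the discrepancy reported here.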
Suggested Fix
Introduce a HashMap<String, Arc<Semaphore>> keyed on the request's domain, with entries created lazily on first encounter. Before dispatching each request, acquire a permit from the global semaphore and, when concurrent_requests_per_domain > 0, a second permit from that domain's semaphore. Hold both permits until the fetch completes. Example sketch:
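// Assumes tokio's async Mutex and Semaphore, as `.lock().await` implies.
use std::{collections::HashMap, sync::Arc};
use tokio::sync::{Mutex, Semaphore};

let domain_limiters: Arc<Mutex<HashMap<String, Arc<Semaphore>>>> = ...;
// in process_request:
// Keep the permit in an Option so it outlives the `if` and covers the fetch.
let _domain_permit = if per_domain > 0 {
    let domain = request.domain().unwrap_or_default();
    // The map lock is released at the end of this statement,
    // before waiting on the semaphore.
    let sem = domain_limiters.lock().await
        .entry(domain)
        .or_insert_with(|| Arc::new(Semaphore::new(per_domain as usize)))
        .clone();
    Some(sem.acquire_owned().await?)
} else {
    None
};
// proceed with fetch; the slot is freed when `_domain_permit` is dropped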
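For a dependency-free way to check the idea, the following sketch swaps the async semaphores for a blocking std `Mutex` + `Condvar` counter per domain. It is not the engine's code: `DomainLimiter`, `DomainPermit`, and `peak_concurrency` are hypothetical names, and threads stand in for the engine's async tasks. It shows only the per-domain half of the fix, without the global limiter.

```rust
use std::collections::HashMap;
use std::sync::atomic::{AtomicUsize, Ordering};
use std::sync::{Arc, Condvar, Mutex};
use std::thread;
use std::time::Duration;

/// Hypothetical limiter: at most `limit` permits may be held per domain key.
pub struct DomainLimiter {
    limit: usize,
    in_flight: Mutex<HashMap<String, usize>>,
    freed: Condvar,
}

/// RAII permit; the domain's slot is released when this value is dropped.
pub struct DomainPermit {
    limiter: Arc<DomainLimiter>,
    domain: String,
}

impl DomainLimiter {
    pub fn new(limit: usize) -> Arc<Self> {
        Arc::new(Self {
            limit: limit.max(1),
            in_flight: Mutex::new(HashMap::new()),
            freed: Condvar::new(),
        })
    }

    /// Blocks until the domain has a free slot, then claims it.
    pub fn acquire(self: &Arc<Self>, domain: &str) -> DomainPermit {
        let mut map = self.in_flight.lock().unwrap();
        while map.get(domain).copied().unwrap_or(0) >= self.limit {
            map = self.freed.wait(map).unwrap();
        }
        // Entries are created lazily on first encounter of a domain.
        *map.entry(domain.to_string()).or_insert(0) += 1;
        DomainPermit { limiter: Arc::clone(self), domain: domain.to_string() }
    }
}

impl Drop for DomainPermit {
    fn drop(&mut self) {
        let mut map = self.limiter.in_flight.lock().unwrap();
        if let Some(n) = map.get_mut(&self.domain) {
            *n -= 1;
            if *n == 0 {
                map.remove(&self.domain);
            }
        }
        // Wake all waiters; each re-checks the count for its own domain.
        self.limiter.freed.notify_all();
    }
}

/// Runs `workers` threads against a single domain and returns the largest
/// number that held a permit at the same time.
pub fn peak_concurrency(limit: usize, workers: usize) -> usize {
    let limiter = DomainLimiter::new(limit);
    let current = Arc::new(AtomicUsize::new(0));
    let peak = Arc::new(AtomicUsize::new(0));
    let handles: Vec<_> = (0..workers)
        .map(|_| {
            let limiter = Arc::clone(&limiter);
            let current = Arc::clone(&current);
            let peak = Arc::clone(&peak);
            thread::spawn(move || {
                let _permit = limiter.acquire("example.com");
                let now = current.fetch_add(1, Ordering::SeqCst) + 1;
                peak.fetch_max(now, Ordering::SeqCst);
                thread::sleep(Duration::from_millis(5)); // stand-in for the fetch
                current.fetch_sub(1, Ordering::SeqCst);
                // `_permit` is dropped here, after the counter is decremented.
            })
        })
        .collect();
    for handle in handles {
        handle.join().unwrap();
    }
    peak.load(Ordering::SeqCst)
}
```

Calling `peak_concurrency(1, 8)` exercises the one-at-a-time case from the Details section: eight workers target the same host, but no two hold a permit simultaneously. A regression test for the engine could follow the same pattern, recording peak in-flight requests per domain and asserting it never exceeds concurrent_requests_per_domain.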
Automated finding by repo-monitor