Summary
The robots.txt parser cannot handle grouped user-agent entries (multiple User-agent: lines before Disallow:), causing the scraper to silently ignore restrictions that apply to it.
Location
- File: src/spiders/robots.rs
- Line(s): 84–128
Severity
Medium
Details
For a robots.txt grouping multiple agents:
User-agent: MyBot
User-agent: Googlebot
Disallow: /private
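The parser sets in_matching_section = true for MyBot, then encounters Googlebot and resets it to false. As a result, Disallow: /private is never applied to MyBot, even though the rule belongs to a group that names it. Grouping several agents this way is a common robots.txt pattern.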
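The failure mode can be reproduced with a minimal sketch. This is a hypothetical reconstruction based only on the behaviour described above, not the actual code in src/spiders/robots.rs; the function name `disallowed_paths_buggy` and its signature are assumptions made for illustration.

```rust
/// Hypothetical reconstruction of the faulty logic: the flag is overwritten on
/// every User-agent line, so only the last agent named in a group can match.
fn disallowed_paths_buggy(content: &str, target_agent: &str) -> Vec<String> {
    let mut in_matching_section = false;
    let mut rules = Vec::new();
    for line in content.lines() {
        let line = line.trim();
        if let Some(agent) = line.strip_prefix("User-agent:") {
            // Bug: this discards a match made by an earlier line of the same group.
            in_matching_section = agent.trim() == target_agent;
        } else if let Some(path) = line.strip_prefix("Disallow:") {
            if in_matching_section {
                rules.push(path.trim().to_string());
            }
        }
    }
    rules
}
```

With the three-line robots.txt above, this sketch returns no rules for MyBot but does return /private for Googlebot, which matches the reported symptom.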
Suggested Fix
Either implement a two-pass parser (first collect the agent group boundaries, then extract the Disallow rules from the groups that apply), or collect consecutive User-agent: lines before processing directives:
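let mut current_agents: Vec<String> = Vec::new();
let mut seen_directive = false; // true once the current group has a rule line
for line in content.lines() {
    let line = line.trim();
    if let Some(agent) = line.strip_prefix("User-agent:") {
        if seen_directive {
            // A User-agent line after a rule starts a new group.
            current_agents.clear();
            seen_directive = false;
        }
        current_agents.push(agent.trim().to_string());
    } else if let Some(path) = line.strip_prefix("Disallow:") {
        seen_directive = true;
        if current_agents.contains(&target_agent) {
            // apply rule for path.trim()
        }
    }
    // Blank lines, comments, and other fields leave the current group unchanged.
}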
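The two-pass alternative could look roughly like the sketch below. It is a minimal illustration under stated assumptions, not the project's implementation: the `Group` struct and the `parse_groups` and `disallowed_paths` functions are hypothetical names, field names and agent names are compared case-insensitively, and only User-agent and Disallow lines are interpreted.

```rust
/// One robots.txt group: the agents it names and the Disallow paths it lists.
#[derive(Debug, Default)]
struct Group {
    agents: Vec<String>,
    disallow: Vec<String>,
}

/// Pass 1: split the file into groups. Consecutive User-agent lines share a
/// group; a User-agent line that follows a rule starts a new one.
fn parse_groups(content: &str) -> Vec<Group> {
    let mut groups: Vec<Group> = Vec::new();
    let mut seen_rule = true; // forces the first User-agent line to open a group
    for raw in content.lines() {
        // Drop a trailing comment, then surrounding whitespace.
        let line = raw.split('#').next().unwrap_or("").trim();
        let Some((field, value)) = line.split_once(':') else { continue };
        let value = value.trim();
        match field.trim().to_ascii_lowercase().as_str() {
            "user-agent" => {
                if seen_rule {
                    groups.push(Group::default());
                    seen_rule = false;
                }
                if let Some(group) = groups.last_mut() {
                    group.agents.push(value.to_ascii_lowercase());
                }
            }
            "disallow" => {
                seen_rule = true;
                // An empty Disallow value restricts nothing, so it is skipped.
                if !value.is_empty() {
                    if let Some(group) = groups.last_mut() {
                        group.disallow.push(value.to_string());
                    }
                }
            }
            _ => {} // other fields (Allow, Sitemap, ...) are ignored in this sketch
        }
    }
    groups
}

/// Pass 2: collect the Disallow paths of every group that names `target_agent`.
fn disallowed_paths(content: &str, target_agent: &str) -> Vec<String> {
    let target = target_agent.to_ascii_lowercase();
    parse_groups(content)
        .into_iter()
        .filter(|g| g.agents.contains(&target))
        .flat_map(|g| g.disallow)
        .collect()
}
```

Calling `disallowed_paths(content, "MyBot")` on the example above yields /private for both MyBot and Googlebot. Fallback to the wildcard `*` group and handling of Allow rules are deliberately left out and would need to be added before this could replace the existing parser.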
Automated finding by repo-monitor