[repo-monitor] Medium: robots.txt multi-agent group rules silently ignored #6

@Liohtml

Summary

The robots.txt parser cannot handle grouped user-agent entries (multiple User-agent: lines before Disallow:), causing the scraper to silently ignore restrictions that apply to it.

Location

  • File: src/spiders/robots.rs
  • Line(s): 84–128

Severity

Medium

Details

For a robots.txt file that groups multiple agents under one set of rules:

User-agent: MyBot
User-agent: Googlebot
Disallow: /private

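The parser sets in_matching_section = true when it reads User-agent: MyBot, then resets it to false when it reads User-agent: Googlebot. By the time it reaches Disallow: /private the flag is false, so the rule is never applied to MyBot even though MyBot belongs to the group. Grouping agents this way is a common robots.txt pattern.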
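The behaviour described above can be reproduced with a small stand-alone function. This is a hypothetical reconstruction of the single-flag logic, not the code in src/spiders/robots.rs; the function name disallowed_single_flag and its signature are assumptions made for illustration.

```rust
/// Hypothetical reconstruction of the single-flag logic described in this issue.
/// This is NOT the actual code from src/spiders/robots.rs.
fn disallowed_single_flag(content: &str, target_agent: &str) -> Vec<String> {
    let mut in_matching_section = false;
    let mut rules = Vec::new();
    for line in content.lines().map(str::trim) {
        if let Some(agent) = line.strip_prefix("User-agent:") {
            // Every User-agent line overwrites the flag, so a later
            // non-matching agent in the same group resets it to false.
            in_matching_section = agent.trim().eq_ignore_ascii_case(target_agent);
        } else if let Some(path) = line.strip_prefix("Disallow:") {
            if in_matching_section {
                rules.push(path.trim().to_string());
            }
        }
    }
    rules
}
```

With the robots.txt above and "MyBot" as the target, this returns an empty list. Swapping the two User-agent: lines makes the rule appear, which shows that the result depends on agent order within the group.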

Suggested Fix

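Either implement a two-pass parser (first collect the boundaries of every agent group, then extract the disallow rules from the groups that apply), or collect consecutive User-agent: lines into one group before processing directives. In the single-pass variant a group ends only when a User-agent: line follows a rule line, so blank lines and comments do not split a group:

let mut current_agents: Vec<String> = Vec::new();
let mut seen_rule = false; // true once the current group has a rule line
for raw in content.lines() {
    let line = raw.split('#').next().unwrap_or("").trim(); // strip comments
    let Some((key, value)) = line.split_once(':') else { continue };
    let key = key.trim().to_ascii_lowercase(); // field names are case-insensitive
    if key == "user-agent" {
        if seen_rule {
            current_agents.clear(); // a User-agent after rules starts a new group
            seen_rule = false;
        }
        current_agents.push(value.trim().to_ascii_lowercase());
    } else if key == "disallow" || key == "allow" {
        seen_rule = true;
        if key == "disallow" && current_agents.contains(&target_agent) {
            // apply rule (target_agent is assumed to be a lowercased String)
        }
    }
}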
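The collect-consecutive-agents approach can be sketched as a self-contained function that is easy to unit test. The name disallowed_for, its signature, and the choice to return plain path strings are assumptions for illustration, not the existing API of the spider. Wildcard (*) agents and Allow: precedence are deliberately left out.

```rust
/// Sketch: collect the Disallow paths that apply to `target_agent`,
/// treating consecutive User-agent lines as a single group.
/// Hypothetical helper; not part of the existing parser.
fn disallowed_for(content: &str, target_agent: &str) -> Vec<String> {
    let target = target_agent.to_ascii_lowercase();
    let mut current_agents: Vec<String> = Vec::new();
    let mut seen_rule = false; // true once the current group has a rule line
    let mut rules = Vec::new();

    for raw in content.lines() {
        // Drop trailing comments, then surrounding whitespace.
        let line = raw.split('#').next().unwrap_or("").trim();
        // Lines without a "key: value" shape (including blanks) are skipped.
        let Some((key, value)) = line.split_once(':') else { continue };
        let key = key.trim().to_ascii_lowercase();
        let value = value.trim();

        if key == "user-agent" {
            if seen_rule {
                // A User-agent line after rule lines opens a new group.
                current_agents.clear();
                seen_rule = false;
            }
            current_agents.push(value.to_ascii_lowercase());
        } else if key == "disallow" || key == "allow" {
            seen_rule = true;
            // An empty Disallow value restricts nothing, so it is not recorded.
            if key == "disallow" && !value.is_empty() && current_agents.contains(&target) {
                rules.push(value.to_string());
            }
        }
    }
    rules
}
```

Called with the grouped example from Details and "MyBot" as the target, this returns the single path /private, and it returns the same result for "Googlebot". Tracking seen_rule rather than clearing the group on every unrecognised line keeps blank lines and comments between User-agent: lines from breaking the group.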

Automated finding by repo-monitor
