Skip to content

Add brief semgrep command to generate rules from sink data #39

@andrew

Description

@andrew

Generate a Semgrep/Opengrep ruleset tailored to the detected project stack. brief doesn't become a scanner, it outputs .semgrep.yml that you run with an existing engine. The value is automated curation: a Django+SQLAlchemy project gets Django and SQLAlchemy rules, not Rails or GORM.

brief semgrep [flags] [path]
  --output FILE           Write to file (default stdout)
  --min-severity LEVEL    Filter rules (low/medium/high)

Each sink in the KB becomes a Semgrep rule with a pattern, language, severity, CWE metadata, and a message pulled from the sink note.

Pattern generation from a prototype against all 771 sinks:

  • 675 structural patterns (87%) — auto-generated from symbol names. html_safe becomes $X.html_safe, requests.get becomes requests.get(...), eval becomes eval(...). These work correctly as-is.
  • 55 multi-patterns (7%) — Ruby methods that can be called with or without parens, generating both $X.method and $X.method(...).
  • 41 regex fallbacks (5%) — template syntax ({{{, <%-, v-html, |safe, dangerouslySetInnerHTML) that isn't code. Uses Semgrep's pattern-regex with languages: [generic].

The prototype found three areas where the auto-generation needs refinement:

  • :: means module separator in Ruby, scope in C++, namespace in PHP, path in Rust. Digest::MD5 should be Digest::MD5.new(...) in Ruby but new Digest::MD5(...) doesn't make sense. Needs per-language handling of :: symbols.
  • Capitalized single-word symbols like ProcessStartInfo, Random, SqlCommand are constructors in C#/Java but the heuristic treats them as method calls. Should generate new X(...) for those languages.
  • Duplicate rule IDs from symbols like backtick and !{ that produce empty slugs.

An optional patterns field on the Sink struct would let individual sinks override auto-generation for the ~15% of cases where heuristics aren't enough:

[[security.sinks]]
symbol = "where"
threat = "sql_injection"
cwe = "CWE-89"
note = "With string interpolation; safe with hash"
patterns = ['$MODEL.where("..." + $X)', '$MODEL.where("...#{...}")']

When present, use those patterns directly. When absent, fall back to auto-generation.

Ecosystem to Semgrep language mapping: ruby→ruby, python→python, node→javascript+typescript, go→go, java→java, php→php, csharp→csharp, rust→rust, kotlin→kotlin, scala→scala, swift→swift, c→c, cpp→cpp. Elixir, Dart, Perl, Lua fall back to generic with regex.

Severity mapping from threat IDs: sql_injection/command_injection/code_injection/deserialization → ERROR. xss/ssrf/path_traversal/ssti → WARNING. weak_crypto/open_redirect/dos → INFO. Could also add a severity field to the threat registry in _threats.toml so it's data not code.

Implementation:

  • Add optional patterns []string field to Sink in kb/kb.go
  • Add severity field to ThreatDef in kb/kb.go and seed in _threats.toml
  • detect/semgrep.go — pattern generation heuristics, language mapping, rule assembly
  • report/semgrep.go — YAML output (not JSON, Semgrep expects YAML)
  • cmd/brief/semgrep.go — command wiring via runDetection
  • Tests: pattern generation unit tests for each symbol class, golden-file test for YAML output shape
  • Validate generated rules parse: semgrep --validate --config .semgrep.yml

Metadata

Metadata

Assignees

No one assigned

    Labels

    No labels
    No labels

    Type

    No type
    No fields configured for issues without a type.

    Projects

    No projects

    Milestone

    No milestone

    Relationships

    None yet

    Development

    No branches or pull requests

    Issue actions