Skip to content

Semantic prompt injection detection via LLM classifier #59

@debu-sinha

Description

@debu-sinha

Problem or use case

Current prompt injection detection uses regex pattern matching against known phrases ("ignore previous instructions", "you are now", etc.). Real-world attacks use paraphrasing, obfuscation, homoglyphs, and multi-step decomposition that regex cannot catch. SkillScan-Security ships a fine-tuned DeBERTa classifier for this. Trail of Bits demonstrated that moderate evasion effort bypasses all pattern-based detection.

Proposed solution

Add optional --deep-scan mode using an offline ML classifier:

agentsec scan --deep-scan     # Enable semantic detection

Use a fine-tuned DeBERTa or DistilBERT adapter that classifies text segments as benign/injection. Runs fully offline (no API calls). Optional dependency (agentsec-ai[ml]).

Falls back to regex-only when ML dependency not installed.

Area

Skill scanner / MCP scanner

Metadata

Metadata

Assignees

No one assigned

    Labels

    enhancementNew feature or requestsecuritySecurity hardening

    Projects

    No projects

    Milestone

    No milestone

    Relationships

    None yet

    Development

    No branches or pull requests

    Issue actions