Skip to content

ItsNishi/AI-Agent-Security

Folders and files

NameName
Last commit message
Last commit date

Latest commit

Β 

History

31 Commits
Β 
Β 
Β 
Β 
Β 
Β 
Β 
Β 
Β 
Β 
Β 
Β 
Β 
Β 
Β 
Β 

Repository files navigation

πŸ›‘οΈ AI Agent Security Research

Documenting how AI coding agents get exploited -- so we can build better defenses.

Educational AI Agents No Live Exploits


πŸ” What This Is

A growing collection of research, annotated attack examples, and defense strategies targeting the security of AI coding agents -- Claude Code, Cursor, Copilot, Windsurf, and the broader ecosystem of LLM-powered development tools.

Every attack has a defense. Every payload is annotated, defanged, and educational.

Note: This project is actively maintained and frequently updated as new findings emerge, attack surfaces evolve, and AI-assisted research uncovers new patterns. Expect content to change regularly.


πŸ“ Research Notes

Topic What You'll Learn
πŸ—‚οΈ Tools & Frameworks Index Quick reference for every tool, framework, benchmark, and standard mentioned across all notes
πŸ’‰ Prompt Injection & Skill Injection Foundational injection concepts, agent attack surface, trojanized skill teardown, supply chain comparison
🧱 Defense Patterns Sanitization, sandboxing, and mitigation strategies with working code
βš™οΈ Claude Code Skill Architecture How Claude Code's extensibility (skills, hooks, MCP) creates attack surface
πŸ‘» LLM Hallucination Prevention Why models invent things, how to detect it, and how to stop it
🌐 AI Coding Language Performance Multilingual benchmarks, token efficiency, and language-steering attacks
πŸ”“ LLM Jailbreaking Deep Dive Full taxonomy: DAN to GCG to Crescendo, defenses, benchmarks, agent implications
πŸ” Skill Scanning & Detection Landscape Cisco Skill Scanner, VirusTotal, ToxicSkills audit, gap analysis, what to build next
πŸ“‹ AI GRC & Policy Landscape NIST AI RMF, EU AI Act, ISO 42001, state laws, agentic governance, OWASP Agentic Top 10
🧠 AI Memory & Corruption Memory architectures, RAG poisoning, MINJA, persistence risks, real-world case studies, defenses
πŸ“„ Agent Configuration Files Cross-tool instruction file attack surface: CLAUDE.md, AGENTS.md, Copilot, Cursor, Unicode obfuscation, hardening recommendations
🧠 Chatbot & AI Psychosis AI-induced psychosis, sycophancy mechanisms, documented deaths, folie a deux, weaponization, RAND national security analysis
🦞 OpenClaw & ClawHub Security OpenClaw architecture, ClawHub supply chain, CVE-2026-25253, ClawHavoc campaign, AMOS stealer, memory poisoning, 42K exposed instances
πŸͺ AI Application Ecosystem Security GPT Store, MCP tool poisoning, LangChain, HuggingFace, AutoGPT, CrewAI, Devin, IDEsaster, GlassWorm, OWASP Agentic Top 10, MITRE ATLAS
βš”οΈ AI Hacking Frameworks XBOW, Shannon, Strix, PentAGI, CAI, Reaper, Nebula, CHECKMATE, Garak, Promptfoo, PyRIT, benchmarks, architecture patterns
πŸ’© Bullshit Benchmark & LLM Honesty BullshitBench, TruthfulQA, SimpleQA, sycophancy benchmarks, Bullshit Index, abstention, slopsquatting, RLHF-security tension
πŸ›‘οΈ AI Blue Teaming & Defensive AI AI SOC agents, CrowdStrike Charlotte, Microsoft Security Copilot, malware RE, DARPA AIxCC, NIST AI 100-2, defender's advantage analysis
πŸ”€ Unicode Variation Selector Attacks Invisible jailbreaking, guardrail evasion, Sneaky Bits encoding, GlassWorm malware, token expansion DoS, defense: explicit stripping required
πŸͺ™ Token Optimization & LLM Efficiency Context engineering, prompt structure, model routing, caching, batching, agent loop optimization, Claude Code cost management
πŸ’Έ Token-Based Attacks & Resource Exploitation Denial of wallet, sponge examples, reasoning exhaustion (ThinkTrap/ReasoningBomb), context window poisoning, token smuggling, tokenizer security, model extraction, LLMjacking
πŸ“Š LLM Landscape & Token Economics Security-framed model reference: tokenization attack surface, cost economics for threat modeling, model selection for security work, open model supply chain risks, Chinese censorship implications
🀝 Multi-Agent Security Agent-to-agent attacks, A2A protocol spoofing, 82.4% peer compliance bypass, delegation chain injection, memory contagion, cross-agent config attacks, offensive swarms, defense patterns

πŸ§ͺ Attack / Defense Examples

Hands-on annotated scenarios -- each one shows the attack and the fix.

Technique TL;DR
πŸ•΅οΈ Hidden Comment Injection HTML comments are invisible in markdown previews but the LLM reads every word
🌊 Indirect Prompt Injection Poison the web page, API response, or file the agent fetches -- it obeys
πŸ“€ Data Exfiltration Via Agent The agent becomes an unwitting mule for your secrets, keys, and credentials
πŸ“¦ Hallucinated Package Injection LLM invents a package name, attacker registers it -- instant supply chain attack
πŸ”§ MCP Tool Poisoning Malicious instructions hidden in tool descriptions hijack agent behavior silently
πŸ’Έ Resource Exhaustion & Denial of Wallet Reasoning bombs, context saturation, and cost math -- drain API budgets without crashing the service
πŸ”€ Unicode Invisible Injection Variation selectors and Tags block encode hidden instructions that survive diff review and Unicode normalization

πŸ—‚οΈ Attack Taxonomy

β”Œβ”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”
β”‚                              AI Agent Attacks                                       β”‚
β”œβ”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”¬β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”¬β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”¬β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€
β”‚ 🎯 Injection β”‚ πŸ”— Supply Chain  β”‚ πŸ“€ Exfiltration   β”‚ 🧠 Memory & Persistence      β”‚
β”‚              β”‚                  β”‚                   β”‚                               β”‚
β”‚ Direct       β”‚ Trojan skills    β”‚ Secrets & keys    β”‚ RAG poisoning                β”‚
β”‚ Indirect     β”‚ Hallucinated     β”‚ Source code       β”‚ Memory injection (MINJA)     β”‚
β”‚ Hidden       β”‚  packages        β”‚ Environment       β”‚ Context window manipulation  β”‚
β”‚  comments    β”‚ Poisoned docs    β”‚  variables        β”‚ Persistent backdoors         β”‚
β”‚ MCP tool     β”‚ Rules file       β”‚ Credentials       β”‚ Config file persistence      β”‚
β”‚  poisoning   β”‚  backdoor        β”‚ Agent tokens      β”‚ Instruction drift            β”‚
β”‚ Language-    β”‚ Namespace        β”‚ Chat history      β”‚ SOUL.md/MEMORY.md poisoning  β”‚
β”‚  steering    β”‚  squatting       β”‚ IDE telemetry     β”‚                               β”‚
β”‚ Sampling     β”‚ GlassWorm        β”‚                   β”‚                               β”‚
β”‚  injection   β”‚  extension worm  β”‚                   β”‚                               β”‚
β”œβ”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”¬β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”΄β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”΄β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€
β”‚ πŸ’Έ Resource  β”‚ πŸ—οΈ Framework & Platform                                            β”‚
β”‚   Attacks    β”‚                                                                     β”‚
β”‚              β”‚ MCP server compromise (CVE-2025-6514)                              β”‚
β”‚ Denial of    β”‚ OpenClaw gateway exposure (42K+ instances)                         β”‚
β”‚  wallet      β”‚ GPT Store plugin OAuth flaws                                       β”‚
β”‚ Sponge       β”‚ HuggingFace pickle deserialization                                 β”‚
β”‚  examples    β”‚ IDE Chromium CVEs (94+ in Cursor/Windsurf)                         β”‚
β”‚ Reasoning    β”‚ ClawHub malicious skills (1184+)                                   β”‚
β”‚  exhaustion  β”œβ”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€
β”‚ Model routingβ”‚ πŸ›‘οΈ Bypass & Escalation                                             β”‚
β”‚  manipulationβ”‚                                                                     β”‚
β”‚ Token        β”‚ Sandbox escape (numpy allowlist)                                   β”‚
β”‚  smuggling   β”‚ Cross-agent privilege escalation                                   β”‚
β”‚ LLMjacking   β”‚ Tool confusion / confused deputy                                   β”‚
β”‚              β”‚ Rug pull / bait-and-switch                                          β”‚
β”‚              β”‚ IDEsaster (30+ CVEs across AI IDEs)                                β”‚
β”‚              β”‚ Agent-to-agent prompt injection                                     β”‚
β””β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”΄β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”˜

πŸ”§ Security Skill Suite

Working defensive tooling built on Claude Code's skill + hook architecture. These turn the research above into practical detection.

Install from ClawHub

The fastest way to install -- each link goes to the ClawHub listing:

Skill ClawHub What It Does
vet-repo clawhub.ai/ItsNishi/vet-repo Scans .claude/, .mcp.json, CLAUDE.md, VS Code/Cursor configs for hook abuse, injection, MCP poisoning
scan-skill clawhub.ai/ItsNishi/scan-skill Deep analysis of a single skill before installation -- frontmatter, HTML comments, persistence triggers, supporting scripts
audit-code clawhub.ai/ItsNishi/audit-code Code security review -- hardcoded secrets, dangerous calls, SQL injection, .env files, file permissions

Install from Source

If you prefer to install manually from this repo:

# Clone the repo
git clone git@github.com:ItsNishi/AI-Agent-Security.git

# Copy the skills you want into your project or personal skills directory
# Project-level (scoped to one repo):
cp -r AI-Agent-Security/.claude/skills/vet-repo /path/to/your/project/.claude/skills/
cp -r AI-Agent-Security/.claude/skills/scan-skill /path/to/your/project/.claude/skills/
cp -r AI-Agent-Security/.claude/skills/audit-code /path/to/your/project/.claude/skills/

# Personal-level (available in all projects):
cp -r AI-Agent-Security/.claude/skills/vet-repo ~/.claude/skills/
cp -r AI-Agent-Security/.claude/skills/scan-skill ~/.claude/skills/
cp -r AI-Agent-Security/.claude/skills/audit-code ~/.claude/skills/

Usage

Once installed, invoke in any Claude Code session:

/vet-repo              # Scan current repo's agent configs
/scan-skill <dir>      # Analyze a skill before installing it
/audit-code [path]     # Security review of project code (defaults to project root)

Prerequisites

  • Python 3.10+ -- scanner scripts use stdlib only, no third-party packages
  • Claude Code -- skills are invoked via /skill-name in a Claude Code session

Hooks

Advisory PreToolUse guards in .claude/settings.json that warn (not block) on:

  • Bash: pipe-to-shell, rm -rf /, chmod 777, eval with variables, base64-to-execution
  • Write: writes to ~/.ssh/, ~/.aws/, .claude/settings.json, shell profiles

To install the hooks, copy .claude/settings.json into your project's .claude/ directory.

Shared Pattern Database

151 detection patterns across 15 categories. Each skill bundles its own copy of patterns.py so it works standalone:

skill_injection | hook_abuse | mcp_config | secrets | dangerous_calls
exfiltration | encoding_obfuscation | instruction_override | supply_chain | file_permissions
code_before_review | config_backdoor | memory_corruption | confused_delegation | persistence

All patterns derived from the research notes and examples in this repo.


πŸ“ Project Structure

AI-Agent-Security/
β”œβ”€β”€ πŸ“„ README.md
β”œβ”€β”€ πŸ“ notes/                           # Research writeups and analysis
β”œβ”€β”€ πŸ§ͺ examples/                        # Annotated attack/defense pairs
└── πŸ”§ .claude/
    β”œβ”€β”€ settings.json                    # Hook configurations
    └── skills/
        β”œβ”€β”€ vet-repo/                    # Repository agent config scanner
        β”‚   β”œβ”€β”€ SKILL.md
        β”‚   └── scripts/
        β”‚       β”œβ”€β”€ patterns.py          # Pattern database
        β”‚       └── vet_repo.py
        β”œβ”€β”€ scan-skill/                  # Individual skill analyzer
        β”‚   β”œβ”€β”€ SKILL.md
        β”‚   └── scripts/
        β”‚       β”œβ”€β”€ patterns.py          # Pattern database
        β”‚       └── scan_skill.py
        └── audit-code/                  # Code security auditor
            β”œβ”€β”€ SKILL.md
            └── scripts/
                β”œβ”€β”€ patterns.py          # Pattern database
                └── audit_code.py

⚠️ Disclaimer

This research is for educational and defensive purposes only. All examples use defanged URLs (hxxps://, [.]), annotated payloads marked [MALICIOUS], and non-executable demonstrations. Every attack technique includes corresponding defenses.


πŸ“œ License

MIT

About

Educational security research on AI coding agent attack surfaces -- prompt injection, skill poisoning, defense patterns

Topics

Resources

Stars

Watchers

Forks

Releases

No releases published

Packages

 
 
 

Contributors

Languages