🛡️ AI Agent Security Research

Documenting how AI coding agents get exploited -- so we can build better defenses.

🔍 What This Is

A growing collection of research, annotated attack examples, and defense strategies targeting the security of AI coding agents -- Claude Code, Cursor, Copilot, Windsurf, and the broader ecosystem of LLM-powered development tools.

Every attack has a defense. Every payload is annotated, defanged, and educational.

Note: This project is actively maintained and frequently updated as new findings emerge, attack surfaces evolve, and AI-assisted research uncovers new patterns. Expect content to change regularly.

📝 Research Notes

	Topic	What You'll Learn
🗂️	Tools & Frameworks Index	Quick reference for every tool, framework, benchmark, and standard mentioned across all notes
💉	Prompt Injection & Skill Injection	Foundational injection concepts, agent attack surface, trojanized skill teardown, supply chain comparison
🧱	Defense Patterns	Sanitization, sandboxing, and mitigation strategies with working code
⚙️	Claude Code Skill Architecture	How Claude Code's extensibility (skills, hooks, MCP) creates attack surface
👻	LLM Hallucination Prevention	Why models invent things, how to detect it, and how to stop it
🌐	AI Coding Language Performance	Multilingual benchmarks, token efficiency, and language-steering attacks
🔓	LLM Jailbreaking Deep Dive	Full taxonomy: DAN to GCG to Crescendo, defenses, benchmarks, agent implications
🔍	Skill Scanning & Detection Landscape	Cisco Skill Scanner, VirusTotal, ToxicSkills audit, gap analysis, what to build next
📋	AI GRC & Policy Landscape	NIST AI RMF, EU AI Act, ISO 42001, state laws, agentic governance, OWASP Agentic Top 10
🧠	AI Memory & Corruption	Memory architectures, RAG poisoning, MINJA, persistence risks, real-world case studies, defenses
📄	Agent Configuration Files	Cross-tool instruction file attack surface: CLAUDE.md, AGENTS.md, Copilot, Cursor, Unicode obfuscation, hardening recommendations
🧠	Chatbot & AI Psychosis	AI-induced psychosis, sycophancy mechanisms, documented deaths, folie a deux, weaponization, RAND national security analysis
🦞	OpenClaw & ClawHub Security	OpenClaw architecture, ClawHub supply chain, CVE-2026-25253, ClawHavoc campaign, AMOS stealer, memory poisoning, 42K exposed instances
🏪	AI Application Ecosystem Security	GPT Store, MCP tool poisoning, LangChain, HuggingFace, AutoGPT, CrewAI, Devin, IDEsaster, GlassWorm, OWASP Agentic Top 10, MITRE ATLAS
⚔️	AI Hacking Frameworks	XBOW, Shannon, Strix, PentAGI, CAI, Reaper, Nebula, CHECKMATE, Garak, Promptfoo, PyRIT, benchmarks, architecture patterns
💩	Bullshit Benchmark & LLM Honesty	BullshitBench, TruthfulQA, SimpleQA, sycophancy benchmarks, Bullshit Index, abstention, slopsquatting, RLHF-security tension
🛡️	AI Blue Teaming & Defensive AI	AI SOC agents, CrowdStrike Charlotte, Microsoft Security Copilot, malware RE, DARPA AIxCC, NIST AI 100-2, defender's advantage analysis
🔤	Unicode Variation Selector Attacks	Invisible jailbreaking, guardrail evasion, Sneaky Bits encoding, GlassWorm malware, token expansion DoS, defense: explicit stripping required
🪙	Token Optimization & LLM Efficiency	Context engineering, prompt structure, model routing, caching, batching, agent loop optimization, Claude Code cost management
💸	Token-Based Attacks & Resource Exploitation	Denial of wallet, sponge examples, reasoning exhaustion (ThinkTrap/ReasoningBomb), context window poisoning, token smuggling, tokenizer security, model extraction, LLMjacking
📊	LLM Landscape & Token Economics	Security-framed model reference: tokenization attack surface, cost economics for threat modeling, model selection for security work, open model supply chain risks, Chinese censorship implications
🤝	Multi-Agent Security	Agent-to-agent attacks, A2A protocol spoofing, 82.4% peer compliance bypass, delegation chain injection, memory contagion, cross-agent config attacks, offensive swarms, defense patterns

🧪 Attack / Defense Examples

Hands-on annotated scenarios -- each one shows the attack and the fix.

	Technique	TL;DR
🕵️	Hidden Comment Injection	HTML comments are invisible in markdown previews but the LLM reads every word
🌊	Indirect Prompt Injection	Poison the web page, API response, or file the agent fetches -- it obeys
📤	Data Exfiltration Via Agent	The agent becomes an unwitting mule for your secrets, keys, and credentials
📦	Hallucinated Package Injection	LLM invents a package name, attacker registers it -- instant supply chain attack
🔧	MCP Tool Poisoning	Malicious instructions hidden in tool descriptions hijack agent behavior silently
💸	Resource Exhaustion & Denial of Wallet	Reasoning bombs, context saturation, and cost math -- drain API budgets without crashing the service
🔤	Unicode Invisible Injection	Variation selectors and Tags block encode hidden instructions that survive diff review and Unicode normalization

🗂️ Attack Taxonomy

┌─────────────────────────────────────────────────────────────────────────────────────┐
│                              AI Agent Attacks                                       │
├──────────────┬──────────────────┬───────────────────┬───────────────────────────────┤
│ 🎯 Injection │ 🔗 Supply Chain  │ 📤 Exfiltration   │ 🧠 Memory & Persistence      │
│              │                  │                   │                               │
│ Direct       │ Trojan skills    │ Secrets & keys    │ RAG poisoning                │
│ Indirect     │ Hallucinated     │ Source code       │ Memory injection (MINJA)     │
│ Hidden       │  packages        │ Environment       │ Context window manipulation  │
│  comments    │ Poisoned docs    │  variables        │ Persistent backdoors         │
│ MCP tool     │ Rules file       │ Credentials       │ Config file persistence      │
│  poisoning   │  backdoor        │ Agent tokens      │ Instruction drift            │
│ Language-    │ Namespace        │ Chat history      │ SOUL.md/MEMORY.md poisoning  │
│  steering    │  squatting       │ IDE telemetry     │                               │
│ Sampling     │ GlassWorm        │                   │                               │
│  injection   │  extension worm  │                   │                               │
├──────────────┬──────────────────┴───────────────────┴───────────────────────────────┤
│ 💸 Resource  │ 🏗️ Framework & Platform                                            │
│   Attacks    │                                                                     │
│              │ MCP server compromise (CVE-2025-6514)                              │
│ Denial of    │ OpenClaw gateway exposure (42K+ instances)                         │
│  wallet      │ GPT Store plugin OAuth flaws                                       │
│ Sponge       │ HuggingFace pickle deserialization                                 │
│  examples    │ IDE Chromium CVEs (94+ in Cursor/Windsurf)                         │
│ Reasoning    │ ClawHub malicious skills (1184+)                                   │
│  exhaustion  ├─────────────────────────────────────────────────────────────────────┤
│ Model routing│ 🛡️ Bypass & Escalation                                             │
│  manipulation│                                                                     │
│ Token        │ Sandbox escape (numpy allowlist)                                   │
│  smuggling   │ Cross-agent privilege escalation                                   │
│ LLMjacking   │ Tool confusion / confused deputy                                   │
│              │ Rug pull / bait-and-switch                                          │
│              │ IDEsaster (30+ CVEs across AI IDEs)                                │
│              │ Agent-to-agent prompt injection                                     │
└──────────────┴─────────────────────────────────────────────────────────────────────┘

🔧 Security Skill Suite

Working defensive tooling built on Claude Code's skill + hook architecture. These turn the research above into practical detection.

Install from ClawHub

The fastest way to install -- each link goes to the ClawHub listing:

Skill	ClawHub	What It Does
vet-repo	clawhub.ai/ItsNishi/vet-repo	Scans `.claude/`, `.mcp.json`, `CLAUDE.md`, VS Code/Cursor configs for hook abuse, injection, MCP poisoning
scan-skill	clawhub.ai/ItsNishi/scan-skill	Deep analysis of a single skill before installation -- frontmatter, HTML comments, persistence triggers, supporting scripts
audit-code	clawhub.ai/ItsNishi/audit-code	Code security review -- hardcoded secrets, dangerous calls, SQL injection, `.env` files, file permissions

Install from Source

If you prefer to install manually from this repo:

# Clone the repo
git clone git@github.com:ItsNishi/AI-Agent-Security.git

# Copy the skills you want into your project or personal skills directory
# Project-level (scoped to one repo):
cp -r AI-Agent-Security/.claude/skills/vet-repo /path/to/your/project/.claude/skills/
cp -r AI-Agent-Security/.claude/skills/scan-skill /path/to/your/project/.claude/skills/
cp -r AI-Agent-Security/.claude/skills/audit-code /path/to/your/project/.claude/skills/

# Personal-level (available in all projects):
cp -r AI-Agent-Security/.claude/skills/vet-repo ~/.claude/skills/
cp -r AI-Agent-Security/.claude/skills/scan-skill ~/.claude/skills/
cp -r AI-Agent-Security/.claude/skills/audit-code ~/.claude/skills/

Usage

Once installed, invoke in any Claude Code session:

/vet-repo              # Scan current repo's agent configs
/scan-skill <dir>      # Analyze a skill before installing it
/audit-code [path]     # Security review of project code (defaults to project root)

Prerequisites

Python 3.10+ -- scanner scripts use stdlib only, no third-party packages
Claude Code -- skills are invoked via /skill-name in a Claude Code session

Hooks

Advisory PreToolUse guards in .claude/settings.json that warn (not block) on:

Bash: pipe-to-shell, rm -rf /, chmod 777, eval with variables, base64-to-execution
Write: writes to ~/.ssh/, ~/.aws/, .claude/settings.json, shell profiles

To install the hooks, copy .claude/settings.json into your project's .claude/ directory.

Shared Pattern Database

151 detection patterns across 15 categories. Each skill bundles its own copy of patterns.py so it works standalone:

skill_injection | hook_abuse | mcp_config | secrets | dangerous_calls
exfiltration | encoding_obfuscation | instruction_override | supply_chain | file_permissions
code_before_review | config_backdoor | memory_corruption | confused_delegation | persistence

All patterns derived from the research notes and examples in this repo.

📁 Project Structure

AI-Agent-Security/
├── 📄 README.md
├── 📝 notes/                           # Research writeups and analysis
├── 🧪 examples/                        # Annotated attack/defense pairs
└── 🔧 .claude/
    ├── settings.json                    # Hook configurations
    └── skills/
        ├── vet-repo/                    # Repository agent config scanner
        │   ├── SKILL.md
        │   └── scripts/
        │       ├── patterns.py          # Pattern database
        │       └── vet_repo.py
        ├── scan-skill/                  # Individual skill analyzer
        │   ├── SKILL.md
        │   └── scripts/
        │       ├── patterns.py          # Pattern database
        │       └── scan_skill.py
        └── audit-code/                  # Code security auditor
            ├── SKILL.md
            └── scripts/
                ├── patterns.py          # Pattern database
                └── audit_code.py

⚠️ Disclaimer

This research is for educational and defensive purposes only. All examples use defanged URLs (hxxps://, [.]), annotated payloads marked [MALICIOUS], and non-executable demonstrations. Every attack technique includes corresponding defenses.

📜 License

MIT

Name		Name	Last commit message	Last commit date
Latest commit History 31 Commits
.claude		.claude
examples		examples
notes		notes
tests		tests
.gitignore		.gitignore
README.md		README.md
SEVERITY_CALIBRATION.md		SEVERITY_CALIBRATION.md
pyproject.toml		pyproject.toml

Provide feedback

Saved searches

Use saved searches to filter your results more quickly

Repository files navigation

🛡️ AI Agent Security Research

🔍 What This Is

📝 Research Notes

🧪 Attack / Defense Examples

🗂️ Attack Taxonomy

🔧 Security Skill Suite

Install from ClawHub

Install from Source

Usage

Prerequisites

Hooks

Shared Pattern Database

📁 Project Structure

⚠️ Disclaimer

📜 License

About

Uh oh!

Releases

Packages

Uh oh!

Contributors

Uh oh!

Languages

Folders and files

Latest commit

History

Repository files navigation

🛡️ AI Agent Security Research

🔍 What This Is

📝 Research Notes

🧪 Attack / Defense Examples

🗂️ Attack Taxonomy

🔧 Security Skill Suite

Install from ClawHub

Install from Source

Usage

Prerequisites

Hooks

Shared Pattern Database

📁 Project Structure

⚠️ Disclaimer

📜 License

About

Topics

Resources

Uh oh!

Stars

Watchers

Forks

Releases

Packages 0

Uh oh!

Contributors

Uh oh!

Languages

Packages