Documenting how AI coding agents get exploited -- so we can build better defenses.
A growing collection of research, annotated attack examples, and defense strategies targeting the security of AI coding agents -- Claude Code, Cursor, Copilot, Windsurf, and the broader ecosystem of LLM-powered development tools.
Every attack has a defense. Every payload is annotated, defanged, and educational.
Note: This project is actively maintained and frequently updated as new findings emerge, attack surfaces evolve, and AI-assisted research uncovers new patterns. Expect content to change regularly.
| Topic | What You'll Learn | |
|---|---|---|
| ποΈ | Tools & Frameworks Index | Quick reference for every tool, framework, benchmark, and standard mentioned across all notes |
| π | Prompt Injection & Skill Injection | Foundational injection concepts, agent attack surface, trojanized skill teardown, supply chain comparison |
| π§± | Defense Patterns | Sanitization, sandboxing, and mitigation strategies with working code |
| βοΈ | Claude Code Skill Architecture | How Claude Code's extensibility (skills, hooks, MCP) creates attack surface |
| π» | LLM Hallucination Prevention | Why models invent things, how to detect it, and how to stop it |
| π | AI Coding Language Performance | Multilingual benchmarks, token efficiency, and language-steering attacks |
| π | LLM Jailbreaking Deep Dive | Full taxonomy: DAN to GCG to Crescendo, defenses, benchmarks, agent implications |
| π | Skill Scanning & Detection Landscape | Cisco Skill Scanner, VirusTotal, ToxicSkills audit, gap analysis, what to build next |
| π | AI GRC & Policy Landscape | NIST AI RMF, EU AI Act, ISO 42001, state laws, agentic governance, OWASP Agentic Top 10 |
| π§ | AI Memory & Corruption | Memory architectures, RAG poisoning, MINJA, persistence risks, real-world case studies, defenses |
| π | Agent Configuration Files | Cross-tool instruction file attack surface: CLAUDE.md, AGENTS.md, Copilot, Cursor, Unicode obfuscation, hardening recommendations |
| π§ | Chatbot & AI Psychosis | AI-induced psychosis, sycophancy mechanisms, documented deaths, folie a deux, weaponization, RAND national security analysis |
| π¦ | OpenClaw & ClawHub Security | OpenClaw architecture, ClawHub supply chain, CVE-2026-25253, ClawHavoc campaign, AMOS stealer, memory poisoning, 42K exposed instances |
| πͺ | AI Application Ecosystem Security | GPT Store, MCP tool poisoning, LangChain, HuggingFace, AutoGPT, CrewAI, Devin, IDEsaster, GlassWorm, OWASP Agentic Top 10, MITRE ATLAS |
| βοΈ | AI Hacking Frameworks | XBOW, Shannon, Strix, PentAGI, CAI, Reaper, Nebula, CHECKMATE, Garak, Promptfoo, PyRIT, benchmarks, architecture patterns |
| π© | Bullshit Benchmark & LLM Honesty | BullshitBench, TruthfulQA, SimpleQA, sycophancy benchmarks, Bullshit Index, abstention, slopsquatting, RLHF-security tension |
| π‘οΈ | AI Blue Teaming & Defensive AI | AI SOC agents, CrowdStrike Charlotte, Microsoft Security Copilot, malware RE, DARPA AIxCC, NIST AI 100-2, defender's advantage analysis |
| π€ | Unicode Variation Selector Attacks | Invisible jailbreaking, guardrail evasion, Sneaky Bits encoding, GlassWorm malware, token expansion DoS, defense: explicit stripping required |
| πͺ | Token Optimization & LLM Efficiency | Context engineering, prompt structure, model routing, caching, batching, agent loop optimization, Claude Code cost management |
| πΈ | Token-Based Attacks & Resource Exploitation | Denial of wallet, sponge examples, reasoning exhaustion (ThinkTrap/ReasoningBomb), context window poisoning, token smuggling, tokenizer security, model extraction, LLMjacking |
| π | LLM Landscape & Token Economics | Security-framed model reference: tokenization attack surface, cost economics for threat modeling, model selection for security work, open model supply chain risks, Chinese censorship implications |
| π€ | Multi-Agent Security | Agent-to-agent attacks, A2A protocol spoofing, 82.4% peer compliance bypass, delegation chain injection, memory contagion, cross-agent config attacks, offensive swarms, defense patterns |
Hands-on annotated scenarios -- each one shows the attack and the fix.
| Technique | TL;DR | |
|---|---|---|
| π΅οΈ | Hidden Comment Injection | HTML comments are invisible in markdown previews but the LLM reads every word |
| π | Indirect Prompt Injection | Poison the web page, API response, or file the agent fetches -- it obeys |
| π€ | Data Exfiltration Via Agent | The agent becomes an unwitting mule for your secrets, keys, and credentials |
| π¦ | Hallucinated Package Injection | LLM invents a package name, attacker registers it -- instant supply chain attack |
| π§ | MCP Tool Poisoning | Malicious instructions hidden in tool descriptions hijack agent behavior silently |
| πΈ | Resource Exhaustion & Denial of Wallet | Reasoning bombs, context saturation, and cost math -- drain API budgets without crashing the service |
| π€ | Unicode Invisible Injection | Variation selectors and Tags block encode hidden instructions that survive diff review and Unicode normalization |
βββββββββββββββββββββββββββββββββββββββββββββββββββββββββββββββββββββββββββββββββββββββ
β AI Agent Attacks β
ββββββββββββββββ¬βββββββββββββββββββ¬ββββββββββββββββββββ¬ββββββββββββββββββββββββββββββββ€
β π― Injection β π Supply Chain β π€ Exfiltration β π§ Memory & Persistence β
β β β β β
β Direct β Trojan skills β Secrets & keys β RAG poisoning β
β Indirect β Hallucinated β Source code β Memory injection (MINJA) β
β Hidden β packages β Environment β Context window manipulation β
β comments β Poisoned docs β variables β Persistent backdoors β
β MCP tool β Rules file β Credentials β Config file persistence β
β poisoning β backdoor β Agent tokens β Instruction drift β
β Language- β Namespace β Chat history β SOUL.md/MEMORY.md poisoning β
β steering β squatting β IDE telemetry β β
β Sampling β GlassWorm β β β
β injection β extension worm β β β
ββββββββββββββββ¬βββββββββββββββββββ΄ββββββββββββββββββββ΄ββββββββββββββββββββββββββββββββ€
β πΈ Resource β ποΈ Framework & Platform β
β Attacks β β
β β MCP server compromise (CVE-2025-6514) β
β Denial of β OpenClaw gateway exposure (42K+ instances) β
β wallet β GPT Store plugin OAuth flaws β
β Sponge β HuggingFace pickle deserialization β
β examples β IDE Chromium CVEs (94+ in Cursor/Windsurf) β
β Reasoning β ClawHub malicious skills (1184+) β
β exhaustion βββββββββββββββββββββββββββββββββββββββββββββββββββββββββββββββββββββββ€
β Model routingβ π‘οΈ Bypass & Escalation β
β manipulationβ β
β Token β Sandbox escape (numpy allowlist) β
β smuggling β Cross-agent privilege escalation β
β LLMjacking β Tool confusion / confused deputy β
β β Rug pull / bait-and-switch β
β β IDEsaster (30+ CVEs across AI IDEs) β
β β Agent-to-agent prompt injection β
ββββββββββββββββ΄ββββββββββββββββββββββββββββββββββββββββββββββββββββββββββββββββββββββ
Working defensive tooling built on Claude Code's skill + hook architecture. These turn the research above into practical detection.
The fastest way to install -- each link goes to the ClawHub listing:
| Skill | ClawHub | What It Does |
|---|---|---|
| vet-repo | clawhub.ai/ItsNishi/vet-repo | Scans .claude/, .mcp.json, CLAUDE.md, VS Code/Cursor configs for hook abuse, injection, MCP poisoning |
| scan-skill | clawhub.ai/ItsNishi/scan-skill | Deep analysis of a single skill before installation -- frontmatter, HTML comments, persistence triggers, supporting scripts |
| audit-code | clawhub.ai/ItsNishi/audit-code | Code security review -- hardcoded secrets, dangerous calls, SQL injection, .env files, file permissions |
If you prefer to install manually from this repo:
# Clone the repo
git clone git@github.com:ItsNishi/AI-Agent-Security.git
# Copy the skills you want into your project or personal skills directory
# Project-level (scoped to one repo):
cp -r AI-Agent-Security/.claude/skills/vet-repo /path/to/your/project/.claude/skills/
cp -r AI-Agent-Security/.claude/skills/scan-skill /path/to/your/project/.claude/skills/
cp -r AI-Agent-Security/.claude/skills/audit-code /path/to/your/project/.claude/skills/
# Personal-level (available in all projects):
cp -r AI-Agent-Security/.claude/skills/vet-repo ~/.claude/skills/
cp -r AI-Agent-Security/.claude/skills/scan-skill ~/.claude/skills/
cp -r AI-Agent-Security/.claude/skills/audit-code ~/.claude/skills/Once installed, invoke in any Claude Code session:
/vet-repo # Scan current repo's agent configs
/scan-skill <dir> # Analyze a skill before installing it
/audit-code [path] # Security review of project code (defaults to project root)
- Python 3.10+ -- scanner scripts use stdlib only, no third-party packages
- Claude Code -- skills are invoked via
/skill-namein a Claude Code session
Advisory PreToolUse guards in .claude/settings.json that warn (not block) on:
- Bash: pipe-to-shell,
rm -rf /,chmod 777, eval with variables, base64-to-execution - Write: writes to
~/.ssh/,~/.aws/,.claude/settings.json, shell profiles
To install the hooks, copy .claude/settings.json into your project's .claude/ directory.
151 detection patterns across 15 categories. Each skill bundles its own copy of patterns.py so it works standalone:
skill_injection | hook_abuse | mcp_config | secrets | dangerous_calls
exfiltration | encoding_obfuscation | instruction_override | supply_chain | file_permissions
code_before_review | config_backdoor | memory_corruption | confused_delegation | persistence
All patterns derived from the research notes and examples in this repo.
AI-Agent-Security/
βββ π README.md
βββ π notes/ # Research writeups and analysis
βββ π§ͺ examples/ # Annotated attack/defense pairs
βββ π§ .claude/
βββ settings.json # Hook configurations
βββ skills/
βββ vet-repo/ # Repository agent config scanner
β βββ SKILL.md
β βββ scripts/
β βββ patterns.py # Pattern database
β βββ vet_repo.py
βββ scan-skill/ # Individual skill analyzer
β βββ SKILL.md
β βββ scripts/
β βββ patterns.py # Pattern database
β βββ scan_skill.py
βββ audit-code/ # Code security auditor
βββ SKILL.md
βββ scripts/
βββ patterns.py # Pattern database
βββ audit_code.py
This research is for educational and defensive purposes only. All examples use defanged URLs (hxxps://, [.]), annotated payloads marked [MALICIOUS], and non-executable demonstrations. Every attack technique includes corresponding defenses.
MIT