A Claude Code plugin that makes 12 NVIDIA NIM-hosted LLMs (Llama 3.x, Kimi K2, DeepSeek V3/R1, Nemotron, Mixtral, Mistral Large, Qwen 2.5, Qwen Coder) available to Claude — and nudges Claude to use them proactively when the user wants a second opinion, an ensemble comparison, or a model better suited to a specific task.
| Component | What it does |
|---|---|
MCP server (server/server.py) |
Wraps NVIDIA NIM's OpenAI-compatible endpoint as three MCP tools: list_models, nvidia_chat, nvidia_compare. |
Skill (skills/nvidia-delegate/) |
Description-driven trigger that makes Claude consider routing work to NVIDIA models when delegation would help — without the user having to ask. |
Slash commands (commands/) |
Explicit one-key access: /nvidia-compare, /nvidia-list, /nvidia-chat. |
The MCP server is the same hardened code shipped at nvidia-models-mcp v0.1.1 — bounded inputs, sanitized errors, 60 s timeout, no live network in tests.
- Claude Code (the CLI / desktop client that supports plugins)
- uv on
PATH - Python 3.10+
- An NVIDIA API key from build.nvidia.com (starts with
nvapi-)
Windows (PowerShell, persistent):
setx NVIDIA_API_KEY "nvapi-your-key-here"macOS / Linux (shell profile):
export NVIDIA_API_KEY="nvapi-your-key-here"Restart Claude Code so it picks up the env var.
Claude Code installs plugins from marketplaces in two steps. Run these inside Claude Code:
/plugin marketplace add NAJEMWEHBE/nvidia-models-plugin
/plugin install nvidia-models@nvidia-models-plugin
The first line registers this repo as a single-plugin marketplace (cloned to your local cache). The second installs the nvidia-models plugin from it.
After install: MCP server is wired automatically, the skill loads at session start, and the three slash commands (/nvidia-compare, /nvidia-list, /nvidia-chat) become available. No restart needed — Claude Code hot-reloads plugins.
Claude reads every skill's description field at session start. The nvidia-delegate skill description lists trigger scenarios ("second opinion", "ensemble", "what would model X say", explicit NVIDIA model names, etc.). When the user's request matches, Claude opens the skill, reads the routing table inside, and calls the appropriate MCP tool. You never have to say "use the MCP" — just describe what you want.
Examples that should auto-trigger:
"Get a second opinion from another LLM on this refactor." "Ask three models to draft a one-paragraph summary and compare them." "Use qwen-coder to write the test, then have deepseek-r1 critique it." "What would Kimi K2 say about this?"
If you want explicit control, use the slash commands instead.
| Command | Purpose |
|---|---|
/nvidia-compare <prompt> |
Fan the prompt to 3 default models; synthesize agreements + divergences. |
/nvidia-list |
Show the 12 short keys → full NIM paths. |
/nvidia-chat <short-key> <prompt> |
Single-model call. |
Catalog verified live on NIM as of 2026-05-12.
| Short key | NIM path | Notes |
|---|---|---|
llama-4-maverick |
meta/llama-4-maverick-17b-128e-instruct |
Meta flagship; 128-expert MoE |
llama-3.3-70b |
meta/llama-3.3-70b-instruct |
Generalist |
llama-3.1-70b |
meta/llama-3.1-70b-instruct |
Cheap generalist |
deepseek-v4-pro |
deepseek-ai/deepseek-v4-pro |
Premium reasoning MoE |
deepseek-v4-flash |
deepseek-ai/deepseek-v4-flash |
Fast reasoning MoE |
kimi-k2.6 |
moonshotai/kimi-k2.6 |
Moonshot Kimi |
qwen3-next-80b |
qwen/qwen3-next-80b-a3b-instruct |
Qwen MoE |
nemotron-super-49b |
nvidia/llama-3.3-nemotron-super-49b-v1 |
NVIDIA Nemotron flagship |
nemotron-mini-4b |
nvidia/nemotron-mini-4b-instruct |
Tiny NVIDIA |
mixtral-8x22b |
mistralai/mixtral-8x22b-instruct-v0.1 |
Large Mistral MoE |
mixtral-8x7b |
mistralai/mixtral-8x7b-instruct-v0.1 |
Small Mistral MoE |
NVIDIA_API_KEYis read from the environment at MCP server startup and never persisted or logged.- All tool arguments are bounded (prompt ≤ 32 KB, system ≤ 8 KB,
max_tokens≤ 8192,temperature∈ [0, 2]). - Upstream error messages are sanitized —
nvapi-…,sk-…, andAuthorization/x-api-keysubstrings are replaced with[REDACTED]before reaching Claude. - The MCP
commandisuv(PATH lookup). Verify youruvbinary (uv --version) is the official astral build — a PATH-shadoweduvcould exfiltrate the API key. - Full threat model in the MCP repo's SECURITY.md.
Never paste your API key into a chat session. If you do, rotate it at build.nvidia.com.
NVIDIA NIM has a free monthly credit. After that, per-token pricing varies by model — check build.nvidia.com for current quotas.
- nvidia-models-mcp — the bare MCP server. Use directly if you're on Claude Desktop (drag-drop DXT), Codex CLI, or any non-Claude-Code MCP host.
- nvidia-models-plugin (this repo) — the Claude Code experience: bundles the same server plus skill + slash commands. The skill is what makes Claude proactively reach for these models.
Bug reports, model additions, and contributions welcome on either repo.