Route every AI call to the cheapest model that can do the job well. 48 tools · 20+ providers · personal routing memory · budget caps, dashboards, traces.
Average savings: 60–80% vs running everything on Claude Opus.
```bash
pipx install claude-code-llm-router && llm-router install
```

| Host | Command |
|---|---|
| Claude Code | `llm-router install` |
| VS Code | `llm-router install --host vscode` |
| Cursor | `llm-router install --host cursor` |
| Codex CLI | `llm-router install --host codex` |
| Gemini CLI | `llm-router install --host gemini-cli` |
llm-router works as an MCP server inside any tool that supports MCP, providing unified routing across your entire development environment.
| Tool | Status | What You Get |
|---|---|---|
| Claude Code | ✅ Full | Auto-routing hooks + session tracking + quota display |
| Gemini CLI | ✅ Full | Auto-routing hooks + session tracking + quota display |
| Codex CLI | ✅ Full | Auto-routing hooks + savings tracking |
| VS Code + Copilot | ✅ MCP | llm-router tools available (routing is model-voluntary) |
| Cursor | ✅ MCP | llm-router tools available (routing is model-voluntary) |
| OpenCode | ✅ MCP | llm-router tools available (routing is model-voluntary) |
| Windsurf | ✅ MCP | llm-router tools available (routing is model-voluntary) |
| Any MCP-compatible tool | ⚡ Manual | Add llm-router to your tool's MCP config |
Full support = auto-routing hooks fire before the model answers, enforcing your routing policy. MCP support = tools are available, but the model chooses whether to use them.
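For the "Any MCP-compatible tool" row, registration is usually a single server entry in the tool's MCP config file. A hypothetical example — the `mcpServers` shape is the common MCP client convention, but the `serve` argument is an assumption; check `llm-router --help` for the actual server entrypoint and your tool's docs for the config location:

```json
{
  "mcpServers": {
    "llm-router": {
      "command": "llm-router",
      "args": ["serve"]
    }
  }
}
```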
```bash
pipx install claude-code-llm-router
llm-router install
```

Then in Claude Code, `llm_route` and friends appear as built-in tools. Your settings control the profile (budget/balanced/premium).
```bash
pipx install claude-code-llm-router
llm-router install --host gemini-cli
```

Gemini CLI users get the full routing experience: auto-routing suggestions, quota display, and free-first chaining (Ollama → Codex → Gemini CLI → paid).
```bash
pipx install claude-code-llm-router
llm-router install --host codex
```

Codex integrates deeply into the routing chain as a free fallback when your OpenAI subscription is available.
```bash
pipx install claude-code-llm-router
llm-router install --host vscode   # or --host cursor
```

The MCP server loads automatically. Tools appear in your IDE's model UI.
Intercepts prompts and routes them to the cheapest model that can handle the task. Most AI sessions are full of low-value work: file lookups, small edits, quick questions. Those burn through expensive models unnecessarily.
llm-router keeps cheap work on cheap/free models, escalates to premium models only when needed. No micromanagement required.
- Works in: Claude Code, Cursor, VS Code, Codex, Windsurf, Zed, claw-code, Agno
- Free-first: Ollama (local) → Codex → Gemini Flash → OpenAI → Claude (subscription)
Think of llm-router as a smart task dispatcher. When you ask a question:
1. Analyze — What kind of task is this? (simple lookup vs. complex reasoning)
2. Choose — Which model can handle this best and cheapest?
3. Check Constraints — Are we over budget? Is this model degraded?
4. Execute — Send to that model
The dispatcher learns over time: if a model starts performing poorly (judge scores drop), it gets demoted in future decisions. If you're running low on quota (budget pressure), it automatically uses cheaper models. You don't manage any of this—it just happens behind the scenes.
Example: "Explain this error message" → Simple task → Route to Haiku (fast, cheap) → Done. vs. "Refactor this complex architecture" → Complex task → Route to Opus (expensive but thorough) → Done.
The savings come from not using Opus for every question.
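The dispatch loop above can be sketched in a few lines of Python. This is a conceptual illustration, not llm-router's actual internals; the classifier heuristic and tier-to-model mapping are assumptions:

```python
# Conceptual sketch of the analyze → choose → execute loop.
# Not llm-router's real code; the heuristic and model names are illustrative.

def classify(prompt: str) -> str:
    """Stand-in complexity classifier: long prompts count as complex."""
    return "complex" if len(prompt.split()) > 20 else "simple"

MODEL_FOR_TIER = {"simple": "haiku", "complex": "opus"}

def dispatch(prompt: str) -> str:
    tier = classify(prompt)        # 1. Analyze the task
    model = MODEL_FOR_TIER[tier]   # 2. Choose the best/cheapest model
    # 3. Constraint checks (budget pressure, model health) would go here
    return model                   # 4. Execute on the selected model

print(dispatch("Explain this error message"))  # → haiku
```

A real classifier would score prompts on structure and intent rather than length, but the shape of the decision is the same: cheap by default, expensive only when the task demands it.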
Major release with optimized routing chains and automatic Ollama management.
- Ollama Auto-Startup — Session-start hook automatically launches Ollama and loads budget models (gemma4, qwen3.5) if not running
  - Eliminates manual setup — local free inference available immediately
  - Graceful fallback if Ollama unavailable
  - 10-second readiness timeout with model auto-pull
- Free-First MCP Chain for All Complexity Levels
  - Simple tasks → Ollama → Codex → Gemini Flash → Groq
  - Moderate tasks → Ollama → Codex → Gemini Pro (improved quality-to-cost) → GPT-4o → Claude Sonnet
  - Complex tasks → Ollama → Codex → o3 → Gemini Pro → Claude Opus
  - Codex injected before all paid externals as a free fallback when a subscription is available
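The free-first fallback amounts to a first-available walk over the chain for a given complexity tier. A minimal sketch, assuming a simple availability set (llm-router's real availability checks — subscription state, quota, health — are more involved):

```python
# Illustrative free-first chain walk; not llm-router's actual implementation.
CHAINS = {
    "simple":   ["ollama", "codex", "gemini-flash", "groq"],
    "moderate": ["ollama", "codex", "gemini-pro", "gpt-4o", "claude-sonnet"],
    "complex":  ["ollama", "codex", "o3", "gemini-pro", "claude-opus"],
}

def pick_model(tier: str, available: set[str]) -> str:
    """Return the first available (cheapest-first) model in the tier's chain."""
    for model in CHAINS[tier]:
        if model in available:
            return model
    raise RuntimeError(f"no provider available for tier {tier!r}")

# Ollama down and no Codex subscription: a moderate task falls through to Gemini Pro.
print(pick_model("moderate", {"gemini-pro", "gpt-4o", "claude-sonnet"}))  # → gemini-pro
```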
- BALANCED Tier Chain Reordering — Gemini Pro prioritized after Codex injection
  - Previously defaulted to expensive DeepSeek for moderate tasks
  - Now balances cost + quality: Codex → Gemini Pro (better ROI) → paid fallbacks
  - Reduces BALANCED tier spend by ~40% while maintaining output quality
- Routing Decision Logging & Analytics
  - Tracks which model was selected for each task, cost impact, and complexity distribution
  - Session-end hook shows a routing summary with savings vs. a full-Opus baseline
  - Identifies anomalies (e.g., high-cost tasks that should route cheaper)
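The savings-vs-Opus summary reduces to summing actual cost against what every call would have cost on Opus. A sketch with invented per-call prices (the real numbers come from provider pricing and token counts):

```python
# Sketch of the savings-vs-full-Opus summary; prices are invented for illustration.
OPUS_COST = 0.90  # hypothetical $ per call on the all-Opus baseline

decisions = [
    {"model": "haiku",  "cost": 0.01},
    {"model": "ollama", "cost": 0.00},
    {"model": "opus",   "cost": 0.90},
]

actual = sum(d["cost"] for d in decisions)
baseline = OPUS_COST * len(decisions)
savings_pct = 100 * (1 - actual / baseline)
print(f"spent ${actual:.2f} vs ${baseline:.2f} baseline ({savings_pct:.0f}% saved)")
# → spent $0.91 vs $2.70 baseline (66% saved)
```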
See CHANGELOG.md for full version history and v6.x features.
Smart content generation detection with automatic routing suggestions.
- Automatic Content Generation Detection — Hook detects "write", "draft", "add card", "create spec" patterns
  - Prevents routing misses where content-generation tasks skip `llm_generate` routing
  - Suggests decomposition: route generation first, integrate locally second
  - Example: "add carousel card about X to file.md" → auto-routes via `llm_generate`
- Decomposition Patterns — Multi-step content+file tasks now route intelligently
  - "Generate narrative" → `llm_generate` → Done
  - "Add card to blueprint" → `llm_generate` content → `Edit` file integration
  - Cost impact: ~90% savings on writing tasks (route to a cheap model vs. expensive local generation)
- Soft Nudges via Hook Suggestion (not blocking)
  - Detects multi-step content-generation patterns
  - Suggests: "Consider routing via `llm_generate` first, then integrate locally"
  - Enforces routing discipline without forcing user behavior
- Fast-Path for Content Tasks — Content generation routed instantly without waiting for the classifier
  - Patterns: simple generation, decomposition, refinement, documentation
  - Same speed as the code-detection fast-path
  - Seamless fallback if the pattern doesn't match
See CLAUDE.md § Content Generation Routing for detailed decision tree.
```
User Prompt
     ↓
[Complexity Classifier] — Haiku/Sonnet/Opus?
     ↓
[Free-First Router] — Ollama → Codex → Gemini Flash → OpenAI → Claude
     ↓
[Budget Pressure Check] — Downshift if over 85% of budget
     ↓
[Quality Guard] — Demote if judge score < 0.6
     ↓
Selected Model → Execute
```
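The last two pipeline stages are simple threshold checks: a budget-pressure downshift at 85% spend and a quality-guard demotion when a model's rolling judge score falls below 0.6. The thresholds come from the pipeline; the rolling-window size and the downshift target are assumptions in this sketch:

```python
# Sketch of the budget-pressure and quality-guard stages.
# Window size and downshift target are assumptions, not llm-router's real values.
from collections import deque

JUDGE_WINDOW = 5  # assumed number of recent judge scores to average

class QualityGuard:
    def __init__(self) -> None:
        self.scores: dict[str, deque] = {}

    def record(self, model: str, score: float) -> None:
        self.scores.setdefault(model, deque(maxlen=JUDGE_WINDOW)).append(score)

    def is_healthy(self, model: str) -> bool:
        s = self.scores.get(model)
        return not s or sum(s) / len(s) >= 0.6  # demote below 0.6

def apply_budget_pressure(model: str, spent: float, budget: float) -> str:
    """Downshift to a cheap model once spend crosses 85% of budget."""
    return "haiku" if spent / budget > 0.85 else model

guard = QualityGuard()
for score in (0.7, 0.5, 0.4):
    guard.record("gemini-pro", score)
print(guard.is_healthy("gemini-pro"))                       # avg ≈ 0.53 → False
print(apply_budget_pressure("opus", spent=90, budget=100))  # → haiku
```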
Zero-config by default if you use Claude Code Pro/Max (subscription mode).
Optional env vars:
```bash
OPENAI_API_KEY=sk-...                    # GPT-4o, o3
GEMINI_API_KEY=AIza...                   # Gemini Flash (free tier)
OLLAMA_BASE_URL=http://localhost:11434   # Local Ollama (free)
LLM_ROUTER_PROFILE=balanced              # budget|balanced|premium
LLM_ROUTER_COMPRESS_RESPONSE=true        # Enable response compression
```

For the full setup guide, see docs/SETUP.md.
Routing:
- `llm_route` — Route a task to the optimal model
- `llm_classify` — Classify task complexity
- `llm_quality_guard` — Monitor model health

Text:
- `llm_query`, `llm_research`, `llm_generate`, `llm_analyze`, `llm_code`

Media:
- `llm_image`, `llm_video`, `llm_audio`

Admin:
- `llm_usage`, `llm_savings`, `llm_budget`, `llm_health`, `llm_providers`

Advanced:
- `llm_orchestrate` — Multi-step pipelines
- `llm_setup` — Configure provider keys
- `llm_policy` — Routing policy management
Full tool reference — Complete documentation for all 48 tools
See CLAUDE.md for:
- Design decisions
- Module organization
- Development workflow
- Release process
See docs/ARCHITECTURE.md for:
- Three-layer compression pipeline
- Judge scoring system
- Quality trend tracking
- Budget pressure algorithm
```bash
uv run pytest tests/ -q          # Run tests
uv run ruff check src/ tests/    # Lint
uv run llm-router --version      # Check version
```

MIT — See LICENSE
- Issues: GitHub Issues
- Discussions: GitHub Discussions
- Releases: PyPI
