| Date | Update |
|---|---|
| 4 May 2026 | v0.2.3 Released: 5 new cloud backends (Groq, Together AI, Fireworks, Replicate, HuggingFace Inference) – 9 providers total. Unified ProviderRegistry, `effgen doctor` auth check, backend parity matrix. See changelog |
| 28 Apr 2026 | v0.2.2 Released: Gemini 3.x/2.5/2.0 registry, thinking_budget, Google Search grounding, Files API, Gemini native tools (GoogleSearch, UrlContext, CodeExecution). Anthropic Claude 4.7 registry, extended thinking, prompt caching (cache_control), streaming polish, experimental native tools. See changelog |
| 25 Apr 2026 | v0.2.1 Released: Cerebras backend (4 free-tier models, streaming, native tool-calling, rate-limit coordinator, cost tracking) + OpenAI gpt-5/gpt-5.4-nano/o-series with reasoning_effort, prompt caching, structured outputs v2, and OpenAI native tools (web_search, code_interpreter, file_search). See changelog |
| 9 Apr 2026 | v0.2.0 Released: Major release – native tool calling, guardrails, multi-agent orchestration, RAG pipeline, 31 tools, eval framework, production API server, MLX Apple Silicon support, Python & TypeScript SDKs. See changelog |
| 8 Apr 2026 | MLX & Apple Silicon support merged (PR #4): Native Metal GPU acceleration via MLX & MLX-VLM backends, hardware detection, 5 Gradio GUI examples. `pip install effgen[mlx]` |
| 25 Mar 2026 | v0.1.3 Released: Verification hardening – smarter loop detection, "skip the tool" prompting, model-aware token counting, sub-agent depth limits, circuit breaker persistence. See changelog |
| 12 Mar 2026 | v0.1.2 Released: Test-driven hardening – 10 example agents, 19 bug fixes, cross-model compatibility matrix (11 models, 73% pass rate). See changelog |
| 6 Mar 2026 | v0.1.1 Released: Stabilization – fixed license/metadata consistency, improved error handling, added 6 examples, expanded test suite. See changelog |
| 1 Mar 2026 | v0.1.0 Released: Major feature release – 14 built-in tools, agent presets, plugin system, real streaming, memory integration, ACP/MCP protocols, CI/CD, and comprehensive test suite. See changelog |
| 3 Feb 2026 | v0.0.2 Released: vLLM backend fixes with automatic chat template support, GPU memory control, improved OOM error handling, and multi-model family compatibility |
| 2 Feb 2026 | Preprint available: EffGen: Enabling Small Language Models as Capable Autonomous Agents |
| 31 Jan 2026 | Initial release of effGen framework (v0.0.1) |
effGen transforms Small Language Models into powerful AI agents. While most frameworks require massive LLMs, effGen is optimized from the ground up for efficient, smaller models – delivering fast, capable agents without the compute overhead.
```python
from effgen import Agent, load_model
from effgen.core.agent import AgentConfig
from effgen.tools.builtin import Calculator, PythonREPL

# Load a small but mighty model
model = load_model("Qwen/Qwen2.5-1.5B-Instruct", quantization="4bit")

# Create agent with tools
config = AgentConfig(
    name="math_agent",
    model=model,
    tools=[Calculator(), PythonREPL()]
)
agent = Agent(config=config)

# Run computation
result = agent.run("What is 24344 * 334?")
print(f"Answer: {result.output}")
```

Requires Python 3.10 or newer. Tested on Python 3.10, 3.11, 3.12, 3.13, 3.14.
```bash
pip install effgen
```

```bash
pip install effgen[mlx]      # Text models on Apple Silicon
pip install effgen[mlx-vlm]  # Vision-Language models on Apple Silicon
```

```bash
pip install effgen[vllm]
```

```bash
pip install effgen[all]  # installs vLLM + RAG + vector-DB + search + cloud-secrets + monitoring + …
```

`flash-attn` is not in `[all]` on purpose: its own `setup.py` imports `torch` before pip's isolated build environment has torch installed (a well-known upstream bug), so bundling it would break `pip install effgen[all]` for everyone. Install it in two steps instead:

```bash
pip install effgen[all]                       # step 1: gets torch + the rest
pip install flash-attn --no-build-isolation   # step 2: reuses the torch from step 1
```

See docs/installation.md for the full guide.
```bash
git clone https://github.com/ctrl-gaurav/effGen.git
cd effGen

# Quick install
./install.sh

# Full install (includes vLLM + dev tools)
./install.sh --full

# Manual install
pip install -e .
```

```bash
# Run a task
effgen run "What is the capital of France?"

# Interactive chat
effgen chat

# Start API server
effgen serve --port 8000

# List available presets
effgen presets

# Check infrastructure health
effgen health

# Interactive wizard
effgen
```

```python
from effgen import Agent, load_model
from effgen.core.agent import AgentConfig
from effgen.tools.builtin import Calculator

# Load model
model = load_model("Qwen/Qwen2.5-1.5B-Instruct", quantization="4bit")

# Configure agent
config = AgentConfig(
    name="calculator_agent",
    model=model,
    tools=[Calculator()],
    system_prompt="You are a helpful math assistant."
)

# Create and run
agent = Agent(config=config)
result = agent.run("Calculate 15% tip on $85.50")
print(result.output)
```

```python
from effgen import Agent, load_model
from effgen.core.agent import AgentConfig
from effgen.tools.builtin import Calculator

# Load MLX model – native Metal GPU, unified memory, no CPU-GPU transfer
model = load_model("LiquidAI/LFM2.5-1.2B-Instruct-MLX-8bit", engine="mlx")

config = AgentConfig(
    name="mlx_agent",
    model=model,
    tools=[Calculator()],
)

agent = Agent(config=config)
result = agent.run("What is sqrt(144) + 2^10?")
print(result.output)
```
Top 5 features in v0.2.3

- 5 new cloud backends – `GroqAdapter`, `TogetherAdapter`, `FireworksAdapter`, `ReplicateAdapter`, `HFInferenceAdapter`, each with streaming, native tools, rate-limit coordination, and cost tracking. 9 providers total.

  ```python
  model = load_model("llama-3.1-8b-instant", provider="groq")
  model = load_model("Qwen/Qwen2.5-72B-Instruct", provider="hf")
  ```

- Unified ProviderRegistry – `list_providers()`, `list_models(provider)`, `lookup(model_id)` consolidated across all 9 adapters. `AmbiguousModelError` on bare IDs shared across providers.
- `effgen doctor` – new CLI command showing which providers have API keys configured.
- Backend parity matrix – canonical agentic task ("(17 × 23) + sqrt(144) = 403") runs identically across all providers; streaming and error surfaces verified uniform. See `docs/providers/parity.md`.
- HuggingFace Router support – `HFInferenceAdapter` with 124-model dynamic catalog, `refresh_models()` + `check_drift()`, `ModelUnavailableError` with `suggest_alternatives()`, and custom Inference Endpoint URL support.
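The disambiguation rule behind bare model IDs can be sketched in plain Python. This is an illustrative model, not effGen's implementation; only the names `ProviderRegistry` and `AmbiguousModelError` come from the changelog above, everything else is hypothetical:

```python
class AmbiguousModelError(LookupError):
    """Raised when a bare model ID exists under more than one provider."""

class ProviderRegistry:
    """Illustrative sketch: maps each provider to the model IDs it serves."""

    def __init__(self):
        self._catalog = {}  # provider -> set of model IDs

    def register(self, provider, model_id):
        self._catalog.setdefault(provider, set()).add(model_id)

    def lookup(self, model_id, provider=None):
        # Explicit provider: resolve directly or fail.
        if provider is not None:
            if model_id in self._catalog.get(provider, set()):
                return provider, model_id
            raise KeyError(f"{model_id!r} not found under provider {provider!r}")
        # Bare ID: only resolve if exactly one provider serves it.
        hits = [p for p, models in self._catalog.items() if model_id in models]
        if not hits:
            raise KeyError(f"unknown model {model_id!r}")
        if len(hits) > 1:
            raise AmbiguousModelError(
                f"{model_id!r} exists under providers {sorted(hits)}; pass provider="
            )
        return hits[0], model_id

registry = ProviderRegistry()
registry.register("groq", "llama-3.1-8b-instant")
registry.register("groq", "qwen3-32b")
registry.register("fireworks", "qwen3-32b")  # same bare ID, second provider
```

A bare `lookup("qwen3-32b")` would raise, while `lookup("qwen3-32b", provider="fireworks")` resolves cleanly.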
Top 5 features from v0.2.2

- Gemini 3.x/2.5/2.0 + Gemma families – full model registry with correct context windows, output limits, and feature flags; SDK migrated to `google-genai>=1.0.0`.
- Gemini `thinking_budget` – activate Gemini's internal reasoning with `GenerationConfig(thinking_budget=8192, include_thoughts=True)`; the thinking trace surfaces in `ModelResponse.metadata["thinking"]`.
- Gemini grounding + Files API – `GenerationConfig(grounding=True)` injects Google Search; `upload_file(path)` passes PDFs/images to the model with a 2 GiB guard.
- Gemini native tools – `GoogleSearchTool`, `GeminiUrlContextTool`, `GeminiCodeExecutionTool` activate server-side Gemini capabilities in any Agent. Parallel function calls are handled automatically.
- Anthropic Claude 4.7, extended thinking, prompt caching – full Claude 4.x registry; `GenerationConfig.thinking` for extended reasoning; `mark_cached()` + `AgentConfig.cache_system_prompt`/`cache_tools` for `cache_control`; cache tokens surfaced in usage.
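The `cache_control` wire format that prompt caching boils down to can be sketched in plain Python. The block shape below follows Anthropic's documented Messages API convention; `build_cached_system` itself is a hypothetical helper for illustration, not effGen or Anthropic SDK API:

```python
def build_cached_system(system_prompt: str) -> list[dict]:
    """Return an Anthropic-style system block list with the prompt marked
    as a cache breakpoint, roughly what a cache_system_prompt-style
    option would have to emit under the hood."""
    return [
        {
            "type": "text",
            "text": system_prompt,
            # Everything up to and including this block is cached
            # server-side and billed at the cheaper cache-read rate
            # on subsequent calls with an identical prefix.
            "cache_control": {"type": "ephemeral"},
        }
    ]

blocks = build_cached_system("You are a helpful math assistant.")
```

Because caching keys on an exact prefix match, the system prompt (and tool definitions) must stay byte-identical across calls for the cache to hit.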
Top 5 features from v0.2.1

- Cerebras backend – 4 free-tier models (`llama3.1-8b`, `qwen-3-235b-a22b-instruct-2507`, `gpt-oss-120b`, `zai-glm-4.7`) with streaming, native function-calling, automatic RPM/TPM/RPD/TPD rate-limit coordination, and per-call cost tracking. `pip install effgen[cerebras]` and set `CEREBRAS_API_KEY`.

  ```python
  from effgen import load_model
  model = load_model("llama3.1-8b", provider="cerebras")
  ```

- OpenAI gpt-5 / gpt-5.4-nano / o-series reasoning models – full registry coverage with `reasoning_effort` (minimal/low/medium/high) and `max_reasoning_tokens` on `GenerationConfig`. Reasoning payloads are routed only to reasoning-capable models.
- OpenAI prompt caching surfacing – `cached_input_tokens` exposed on `ModelResponse.usage`; `AgentConfig.stable_system_prompt=True` keeps the system prompt anchored at position 0 to maximize OpenAI's automatic ≥1024-token prefix cache hit rate.
- Structured outputs v2 – `OpenAIAdapter.generate_structured()` with strict JSON Schema; `to_openai_schema(pydantic_model)` inlines `$ref`s and forces `additionalProperties: false`; refusals raise `ModelRefusalError`.
- OpenAI native tools – `OpenAIWebSearchTool`, `OpenAICodeInterpreterTool`, `OpenAIFileSearchTool` route through OpenAI's Responses API and compose with effGen's local tools in the same agent. `ToolIncompatibleError` fires at Agent init when paired with a non-OpenAI model.
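What "inlines `$ref`s and forces `additionalProperties: false`" means can be shown on a raw JSON Schema dict. This is a deliberately simplified stand-in: `strictify` is hypothetical and handles only local, non-recursive `#/$defs/...` references, far less than a real `to_openai_schema` would:

```python
def strictify(schema: dict) -> dict:
    """Sketch of the two transformations strict structured outputs need:
    (1) inline local $refs so the schema is self-contained, and
    (2) set additionalProperties: false on every object so the model
    cannot emit unrequested keys."""
    defs = schema.get("$defs", {})

    def walk(node):
        if isinstance(node, dict):
            if "$ref" in node:  # e.g. {"$ref": "#/$defs/Address"}
                name = node["$ref"].rsplit("/", 1)[-1]
                return walk(dict(defs[name]))
            out = {k: walk(v) for k, v in node.items() if k != "$defs"}
            if out.get("type") == "object":
                out["additionalProperties"] = False
            return out
        if isinstance(node, list):
            return [walk(v) for v in node]
        return node

    return walk(schema)

schema = {
    "type": "object",
    "properties": {"home": {"$ref": "#/$defs/Address"}},
    "$defs": {
        "Address": {"type": "object", "properties": {"city": {"type": "string"}}}
    },
}
strict = strictify(schema)
```

After the pass, `strict` has no `$defs`/`$ref` left and every object level carries `additionalProperties: false`.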
Top 5 features from v0.2.0

- Native Tool Calling – Qwen, Llama, Mistral models use built-in function calling instead of text parsing. Set `tool_calling_mode="native"` or `"hybrid"`. Structured JSON/Pydantic output validation included.
- Guardrails & Safety – PII detection, prompt injection blocking, toxicity filtering, tool permissions. One-liner: `get_guardrail_preset("strict")`.
- Production RAG Pipeline – ingest PDF/DOCX/HTML/Markdown, semantic + BM25 hybrid search, reranking, inline citations. `create_agent("rag", model, knowledge_base="./docs/")`.
- Production API Server – OpenAI-compatible `/v1/chat/completions`, request queuing, agent pooling, multi-tenancy, API keys. Drop-in OpenAI replacement with local SLMs.
- Apple Silicon Native – MLX & MLX-VLM backends for M1/M2/M3/M4. Metal GPU acceleration, unified memory. `pip install effgen[mlx]`.
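The difference between native and text-parsed tool calling is easiest to see in the message shape native mode consumes: a structured payload instead of free text the framework must regex out of the completion. The layout below follows the common chat-completions convention; `extract_calls` is a hypothetical helper for illustration, not effGen API:

```python
import json

# What a native tool-calling model returns: no prose to parse,
# just a structured call with JSON-encoded arguments.
native_message = {
    "role": "assistant",
    "content": None,
    "tool_calls": [
        {
            "id": "call_1",
            "type": "function",
            "function": {
                "name": "calculator",
                "arguments": json.dumps({"expression": "85.50 * 0.15"}),
            },
        }
    ],
}

def extract_calls(message: dict) -> list[tuple[str, dict]]:
    """Pull (tool_name, parsed_args) pairs out of a native tool-call message."""
    return [
        (call["function"]["name"], json.loads(call["function"]["arguments"]))
        for call in message.get("tool_calls", [])
    ]

calls = extract_calls(native_message)
```

Because the arguments arrive as well-formed JSON rather than model prose, argument parsing cannot silently drift with the model's phrasing, which is the main reliability win over text-parsing mode.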
Get started instantly with ready-to-use agent configurations:

```python
from effgen import load_model
from effgen.presets import create_agent

model = load_model("Qwen/Qwen2.5-3B-Instruct", quantization="4bit")

# One-line agent creation
math_agent = create_agent("math", model)          # Calculator + PythonREPL
research_agent = create_agent("research", model)  # WebSearch + URLFetch + Wikipedia
coding_agent = create_agent("coding", model)      # CodeExecutor + PythonREPL + FileOps + Bash
general_agent = create_agent("general", model)    # All tools
rag_agent = create_agent("rag", model, knowledge_base="./docs/")  # RAG pipeline
minimal_agent = create_agent("minimal", model)    # Direct inference, no tools
```

```bash
# CLI preset support
effgen run --preset math "What is sqrt(144)?"
effgen run --preset research "Tell me about quantum computing"
```
```bash
# Visual agent & tool development
python examples/basic/chat_gui_mlx.py      # MLX Chat – streaming chat with Apple Silicon models (port 7860)
python examples/basic/agent_viz_mlx.py     # Agent Visualizer – step-by-step reasoning + code editor (port 7860)
python examples/basic/tool_builder_gui.py  # Tool Builder – visually create custom tools (port 7863)
python examples/basic/tool_tester_gui.py   # Tool Tester – browse, test, inspect all 31 tools (port 7864)
```

```bash
python examples/basic/basic_agent_mlx.py           # Basic MLX agent with calculator
python examples/basic/chat_gui_mlx.py --autoload   # Chat GUI with auto model loading
python examples/basic/agent_viz_mlx.py --autoload  # Agent visualizer with auto model loading
```

```bash
python examples/basic/qa_agent.py                     # Q&A agent (no tools)
python examples/basic/calculator_agent.py             # Math with Calculator + PythonREPL
python examples/tools/advanced_multi_tool_agent.py    # 5 tools + fallback chains
python examples/tools/file_operations_agent.py        # File read/write/search
python examples/tools/coding_agent.py                 # Code execution + iteration
python examples/advanced/conversational_agent.py      # Multi-turn memory
python examples/advanced/advanced_streaming_agent.py  # Token streaming with callbacks
python examples/advanced/data_processing_agent.py     # JSON & data pipelines
python examples/advanced/multi_agent_pipeline.py      # Multi-agent orchestration
python examples/advanced/error_recovery_agent.py      # Error handling patterns
```

```bash
python examples/basic/basic_agent.py               # Basic agent (Transformers)
python examples/basic/basic_agent_vllm.py          # Basic agent (vLLM - 5-10x faster)
python examples/plugins_presets/preset_agents.py   # Ready-to-use agent presets
python examples/web_retrieval/streaming_agent.py   # Simple streaming
python examples/web_retrieval/memory_agent.py      # Simple multi-turn memory
python examples/tools/multi_tool_agent.py          # Simple multi-tool
python examples/web_retrieval/weather_agent.py     # Weather via Open-Meteo (free)
python examples/plugins_presets/plugin_example.py  # Custom tool plugins
python examples/web_retrieval/web_agent.py         # Web search agent
python examples/web_retrieval/retrieval_agent.py   # RAG-based retrieval
```

See examples/compatibility_matrix.md for model compatibility across all agents.
More Examples
```python
from effgen import Agent, load_model
from effgen.core.agent import AgentConfig
from effgen.tools.builtin import Calculator, WebSearch, PythonREPL

model = load_model("Qwen/Qwen2.5-3B-Instruct")

config = AgentConfig(
    name="research_agent",
    model=model,
    tools=[Calculator(), WebSearch(), PythonREPL()],
    system_prompt="You are a research assistant."
)

agent = Agent(config=config)
result = agent.run("Search for the population of Tokyo and calculate what percentage it is of Japan's total population")
```

```python
from effgen import Agent, load_model
from effgen.core.agent import AgentConfig
from effgen.tools.builtin import Calculator

model = load_model("Qwen/Qwen2.5-3B-Instruct", quantization="4bit")

agent = Agent(config=AgentConfig(
    name="stream_demo", model=model,
    tools=[Calculator()], enable_streaming=True
))

for token in agent.stream("What is 2 + 2?"):
    print(token, end="", flush=True)
```

```python
agent = Agent(config=AgentConfig(
    name="memory_demo", model=model,
    tools=[], enable_memory=True
))

agent.run("My name is Alice and I'm working on quantum computing.")
result = agent.run("What's my name and what am I working on?")
# → "Your name is Alice and you're working on quantum computing."
```

```python
from effgen.tools.builtin import Retrieval

retrieval_tool = Retrieval(knowledge_base_path="./docs")
config = AgentConfig(name="qa_agent", model=model, tools=[retrieval_tool])

agent = Agent(config=config)
result = agent.run("What does the documentation say about configuration?")
```

effGen supports 9 cloud inference providers + 4 local backends, tested across 11+ model families:
| Backend | Platform | Install | Best For |
|---|---|---|---|
| MLX | Apple Silicon (M1/M2/M3/M4) | `effgen[mlx]` | Native Metal GPU, unified memory, 4/8-bit quantization |
| MLX-VLM | Apple Silicon | `effgen[mlx-vlm]` | Vision-Language models (Qwen2-VL, LLaVA, Phi-3 Vision, 30+ architectures) |
| vLLM | NVIDIA GPU | `effgen[vllm]` | High-throughput batch inference |
| Transformers | Any (CPU/GPU) | (bundled) | Universal compatibility, local models |
| OpenAI | Cloud API | (bundled) | gpt-5/gpt-5.4/o-series, reasoning_effort, structured outputs, native tools |
| Anthropic | Cloud API | (bundled) | Claude 4.7/4.x, extended thinking, prompt caching, native tools |
| Google Gemini | Cloud API | (bundled) | Gemini 3.x/2.5/2.0, thinking_budget, grounding, Files API, native tools |
| Cerebras | Cloud API | `effgen[cerebras]` | 4 free-tier models (llama3.1-8b, qwen-3-235b), ultra-low latency |
| Groq | Cloud API | `effgen[groq]` | 16 models (llama-3.3-70b, mixtral, qwen3-32b), ultra-fast free-tier inference |
| Together AI | Cloud API | `effgen[together]` | 163-model catalog (llama, deepseek, qwen, mistral), per-model pricing |
| Fireworks | Cloud API | `effgen[fireworks]` | 80 chat models (54 tool-capable), serverless + dedicated |
| Replicate | Cloud API | `effgen[replicate]` | 38 models, async run-poll, SSE streaming, compute-second billing |
| HuggingFace | Cloud API | `effgen[hf]` | 124-model HF Router catalog, custom Inference Endpoints, free serverless tier |
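For the free-tier providers, client-side rate-limit coordination comes down to a sliding-window budget. A minimal RPM-only sketch (illustrative only; effGen's actual coordinator also tracks TPM/RPD/TPD and is not shown here):

```python
from collections import deque

class RPMLimiter:
    """Sliding-window requests-per-minute limiter: allow at most `rpm`
    requests inside any trailing `window`-second interval."""

    def __init__(self, rpm: int, window: float = 60.0):
        self.rpm = rpm
        self.window = window
        self._stamps = deque()  # timestamps of requests inside the window

    def wait_time(self, now: float) -> float:
        """Seconds to wait before the next request is allowed at `now`."""
        # Drop timestamps that have aged out of the window.
        while self._stamps and now - self._stamps[0] >= self.window:
            self._stamps.popleft()
        if len(self._stamps) < self.rpm:
            return 0.0
        # Window is full: wait until the oldest request ages out.
        return self.window - (now - self._stamps[0])

    def record(self, now: float) -> None:
        self._stamps.append(now)

limiter = RPMLimiter(rpm=2, window=60.0)
```

A caller would sleep for `wait_time(time.time())` before each request and `record()` after sending it; per-token (TPM) budgets follow the same pattern with token counts instead of request counts.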
```bash
# See which API keys are configured
effgen doctor
```

```python
from effgen import load_model, Agent
from effgen.core.agent import AgentConfig
from effgen.tools.builtin import Calculator

# Any of the 9 cloud providers
model = load_model("llama-3.1-8b-instant", provider="groq")  # Groq
# model = load_model("meta-llama/Llama-3.3-70B-Instruct-Turbo", provider="together")
# model = load_model("Qwen/Qwen2.5-72B-Instruct", provider="hf")

agent = Agent(config=AgentConfig(name="agent", model=model, tools=[Calculator()]))
result = agent.run("What is (17 * 23) + sqrt(144)?")
print(result.output)  # → 403
```

| Model | Size | Compatibility |
|---|---|---|
| LFM2.5-1.2B-Instruct-MLX-8bit | 1.2B | Apple Silicon optimized, fast agentic |
| Qwen2.5-1.5B-Instruct | 1.5B | 10/10 agents pass |
| Qwen2.5-3B-Instruct | 3B | 10/10 agents pass (recommended default) |
| Phi-4-mini-instruct | 3.8B | 10/10 agents pass |
| Qwen3-1.7B | 1.7B | 9.5/10 |
| Qwen2.5-7B-Instruct | 7B | 9/10 |
| Llama-3.2-3B-Instruct | 3B | 8.5/10 |
Full matrix with 11 models x 10 agents: compatibility_matrix.md
For security policies and vulnerability reporting, see SECURITY.md
If you use effGen in your research, please cite our paper:
```bibtex
@software{srivastava2026effgen,
  title={effGen: Enabling Small Language Models as Capable Autonomous Agents},
  author={Gaurav Srivastava and Aafiya Hussain and Chi Wang and Yingyan Celine Lin and Xuan Wang},
  year={2026},
  eprint={2602.00887},
  archivePrefix={arXiv},
  primaryClass={cs.CL},
  url={https://arxiv.org/abs/2602.00887},
}
```

Apache License 2.0 – see LICENSE for details.