# effGen


## 📰 News & Updates

| Date | Update |
|------|--------|
| 🚀 4 May 2026 | **v0.2.3 Released:** 5 new cloud backends (Groq, Together AI, Fireworks, Replicate, HuggingFace Inference) — 9 providers total. Unified ProviderRegistry, `effgen doctor` auth check, backend parity matrix. See changelog |
| 🚀 28 Apr 2026 | **v0.2.2 Released:** Gemini 3.x/2.5/2.0 registry, `thinking_budget`, Google Search grounding, Files API, Gemini native tools (GoogleSearch, UrlContext, CodeExecution). Anthropic Claude 4.7 registry, extended thinking, prompt caching (`cache_control`), streaming polish, experimental native tools. See changelog |
| 🚀 25 Apr 2026 | **v0.2.1 Released:** Cerebras backend (4 free-tier models, streaming, native tool-calling, rate-limit coordinator, cost tracking) plus OpenAI gpt-5/gpt-5.4-nano/o-series with `reasoning_effort`, prompt caching, structured outputs v2, and OpenAI native tools (web_search, code_interpreter, file_search). See changelog |
| 🚀 9 Apr 2026 | **v0.2.0 Released:** Major release — native tool calling, guardrails, multi-agent orchestration, RAG pipeline, 31 tools, eval framework, production API server, MLX Apple Silicon support, Python & TypeScript SDKs. See changelog |
| 🍎 8 Apr 2026 | **MLX & Apple Silicon support merged (PR #4):** Native Metal GPU acceleration via MLX & MLX-VLM backends, hardware detection, 5 Gradio GUI examples. `pip install effgen[mlx]` |
| 🔧 25 Mar 2026 | **v0.1.3 Released:** Verification hardening — smarter loop detection, "skip the tool" prompting, model-aware token counting, sub-agent depth limits, circuit breaker persistence. See changelog |
| 🔧 12 Mar 2026 | **v0.1.2 Released:** Test-driven hardening — 10 example agents, 19 bug fixes, cross-model compatibility matrix (11 models, 73% pass rate). See changelog |
| 🔒 6 Mar 2026 | **v0.1.1 Released:** Stabilization — fixed license/metadata consistency, improved error handling, added 6 examples, expanded test suite. See changelog |
| 🎉 1 Mar 2026 | **v0.1.0 Released:** Major feature release — 14 built-in tools, agent presets, plugin system, real streaming, memory integration, ACP/MCP protocols, CI/CD, and comprehensive test suite. See changelog |
| 🔧 3 Feb 2026 | **v0.0.2 Released:** vLLM backend fixes with automatic chat template support, GPU memory control, improved OOM error handling, and multi-model family compatibility |
| 📄 2 Feb 2026 | **Preprint available:** EffGen: Enabling Small Language Models as Capable Autonomous Agents |
| 🚀 31 Jan 2026 | Initial release of effGen framework (v0.0.1) |

## 🤔 What is effGen?

effGen transforms Small Language Models into powerful AI agents. While most frameworks require massive LLMs, effGen is optimized from the ground up for efficient, smaller models — delivering fast, capable agents without the compute overhead.

```python
from effgen import Agent, load_model
from effgen.core.agent import AgentConfig
from effgen.tools.builtin import Calculator, PythonREPL

# Load a small but mighty model
model = load_model("Qwen/Qwen2.5-1.5B-Instruct", quantization="4bit")

# Create an agent with tools
config = AgentConfig(
    name="math_agent",
    model=model,
    tools=[Calculator(), PythonREPL()]
)
agent = Agent(config=config)

# Run a computation
result = agent.run("What is 24344 * 334?")
print(f"Answer: {result.output}")
```

## ⚡ Installation

Requires Python 3.10 or newer. Tested on Python 3.10, 3.11, 3.12, 3.13, and 3.14.

### 📦 From PyPI (Recommended)

```bash
pip install effgen
```

### 🍎 Apple Silicon (MLX — Recommended for Mac)

```bash
pip install effgen[mlx]          # Text models on Apple Silicon
pip install effgen[mlx-vlm]      # Vision-Language models on Apple Silicon
```

### 🚀 With vLLM for Faster Inference

```bash
pip install effgen[vllm]
```

### 🎁 Everything in one shot

```bash
pip install effgen[all]    # installs vLLM + RAG + vector-DB + search + cloud-secrets + monitoring + …
```

### ⚡ Optional: flash-attn (NVIDIA GPUs only — 2 steps)

flash-attn is deliberately excluded from `[all]`: its own `setup.py` imports torch before pip's isolated build environment has torch installed (a well-known upstream bug), so bundling it would break `pip install effgen[all]` for everyone. Install it in two steps instead:

```bash
pip install effgen[all]                       # step 1: gets torch + the rest
pip install flash-attn --no-build-isolation   # step 2: reuses the torch from step 1
```

See docs/installation.md for the full guide.

### 🔧 From Source

```bash
git clone https://github.com/ctrl-gaurav/effGen.git
cd effGen

# Quick install
./install.sh

# Full install (includes vLLM + dev tools)
./install.sh --full

# Manual install
pip install -e .
```

## 🚀 Quick Start

### 💻 CLI Usage

```bash
# Run a task
effgen run "What is the capital of France?"

# Interactive chat
effgen chat

# Start the API server
effgen serve --port 8000

# List available presets
effgen presets

# Check infrastructure health
effgen health

# Interactive wizard
effgen
```
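Once `effgen serve` is running, it exposes an OpenAI-compatible endpoint, so any OpenAI-style client can talk to it. As a minimal sketch of the wire format, the request body looks like a standard Chat Completions payload (the model id below is illustrative — use whichever model your server is hosting):

```python
import json

# Chat Completions-style request body for a local effgen server.
# The model id is illustrative; substitute the model you serve.
payload = {
    "model": "Qwen/Qwen2.5-1.5B-Instruct",
    "messages": [
        {"role": "user", "content": "What is the capital of France?"}
    ],
    "stream": False,
}

body = json.dumps(payload)
print(body)
```

POST this body to `http://localhost:8000/v1/chat/completions`, or point an OpenAI SDK client's `base_url` at `http://localhost:8000/v1`.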

### 🐍 Python API

```python
from effgen import Agent, load_model
from effgen.core.agent import AgentConfig
from effgen.tools.builtin import Calculator

# Load model
model = load_model("Qwen/Qwen2.5-1.5B-Instruct", quantization="4bit")

# Configure agent
config = AgentConfig(
    name="calculator_agent",
    model=model,
    tools=[Calculator()],
    system_prompt="You are a helpful math assistant."
)

# Create and run
agent = Agent(config=config)
result = agent.run("Calculate 15% tip on $85.50")
print(result.output)
```

### 🍎 Apple Silicon (MLX)

```python
from effgen import Agent, load_model
from effgen.core.agent import AgentConfig
from effgen.tools.builtin import Calculator

# Load an MLX model — native Metal GPU, unified memory, no CPU-GPU transfer
model = load_model("LiquidAI/LFM2.5-1.2B-Instruct-MLX-8bit", engine="mlx")

config = AgentConfig(
    name="mlx_agent",
    model=model,
    tools=[Calculator()],
)
agent = Agent(config=config)
result = agent.run("What is sqrt(144) + 2^10?")
print(result.output)
```

## ✨ Features

- 🧠 **SLM Optimized**: Small models
- 🍎 **Apple Silicon**: MLX + Metal GPU
- 🛡️ **Guardrails**: PII, injection, safety
- 📚 **RAG Pipeline**: Ingest, search, cite
- 👥 **Multi-Agent**: DAG workflows
- 🔧 **31 Tools**: + MCP/A2A/ACP
- 🏭 **Production API**: OpenAI-compat


## 🆕 What's New in v0.2.3

**Top 5 features in v0.2.3**

1. **5 new cloud backends** — GroqAdapter, TogetherAdapter, FireworksAdapter, ReplicateAdapter, HFInferenceAdapter — each with streaming, native tools, rate-limit coordination, and cost tracking. 9 providers total.

   ```python
   model = load_model("llama-3.1-8b-instant", provider="groq")
   model = load_model("Qwen/Qwen2.5-72B-Instruct", provider="hf")
   ```

2. **Unified ProviderRegistry** — `list_providers()`, `list_models(provider)`, `lookup(model_id)` consolidated across all 9 adapters. `AmbiguousModelError` on bare IDs shared across providers.

3. **effgen doctor** — new CLI command showing which providers have API keys configured.

4. **Backend parity matrix** — the canonical agentic task ("(17 × 23) + sqrt(144) = 403") runs identically across all providers; streaming and error surfaces verified uniform. See docs/providers/parity.md.

5. **HuggingFace Router support** — HFInferenceAdapter with a 124-model dynamic catalog, `refresh_models()` + `check_drift()`, `ModelUnavailableError` with `suggest_alternatives()`, and custom Inference Endpoint URLs.
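The parity matrix's canonical task is easy to sanity-check locally before comparing provider outputs:

```python
import math

# The canonical agentic task from the parity matrix, computed directly.
result = (17 * 23) + math.sqrt(144)  # 391 + 12.0
print(result)  # 403.0
```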

**Top 5 features from v0.2.2**

1. **Gemini 3.x/2.5/2.0 + Gemma families** — full model registry with correct context windows, output limits, and feature flags; SDK migrated to `google-genai>=1.0.0`.

2. **Gemini thinking_budget** — activate Gemini's internal reasoning with `GenerationConfig(thinking_budget=8192, include_thoughts=True)`; the thinking trace surfaces in `ModelResponse.metadata["thinking"]`.

3. **Gemini grounding + Files API** — `GenerationConfig(grounding=True)` injects Google Search; `upload_file(path)` passes PDFs/images to the model with a 2 GiB guard.

4. **Gemini native tools** — GoogleSearchTool, GeminiUrlContextTool, GeminiCodeExecutionTool activate server-side Gemini capabilities in any Agent. Parallel function calls are handled automatically.

5. **Anthropic Claude 4.7, extended thinking, prompt caching** — full Claude 4.x registry; `GenerationConfig.thinking` for extended reasoning; `mark_cached()` + `AgentConfig.cache_system_prompt`/`cache_tools` for `cache_control`; cache tokens surfaced in usage.

**Top 5 features from v0.2.1**

1. **Cerebras backend** — 4 free-tier models (llama3.1-8b, qwen-3-235b-a22b-instruct-2507, gpt-oss-120b, zai-glm-4.7) with streaming, native function-calling, automatic RPM/TPM/RPD/TPD rate-limit coordination, and per-call cost tracking. `pip install effgen[cerebras]` and set `CEREBRAS_API_KEY`.

   ```python
   from effgen import load_model
   model = load_model("llama3.1-8b", provider="cerebras")
   ```

2. **OpenAI gpt-5 / gpt-5.4-nano / o-series reasoning models** — full registry coverage with `reasoning_effort` (minimal/low/medium/high) and `max_reasoning_tokens` on GenerationConfig. Reasoning payloads are routed only to reasoning-capable models.

3. **OpenAI prompt caching surfacing** — `cached_input_tokens` exposed on `ModelResponse.usage`; `AgentConfig.stable_system_prompt=True` keeps the system prompt anchored at position 0 to maximize OpenAI's automatic ≥1024-token prefix cache hit rate.

4. **Structured outputs v2** — `OpenAIAdapter.generate_structured()` with strict JSON Schema; `to_openai_schema(pydantic_model)` inlines `$ref`s and forces `additionalProperties: false`; refusals raise `ModelRefusalError`.

5. **OpenAI native tools** — OpenAIWebSearchTool, OpenAICodeInterpreterTool, OpenAIFileSearchTool route through OpenAI's Responses API and compose with effGen's local tools in the same agent. `ToolIncompatibleError` fires at Agent init when paired with a non-OpenAI model.
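For context on structured outputs, OpenAI's strict mode requires a schema with no extra keys allowed and every property marked required. A plain-JSON-Schema sketch of that shape (field names are illustrative, not from effGen):

```python
import json

# The strict-mode schema shape OpenAI structured outputs expect:
# additionalProperties must be false, and every property must be
# listed in "required". Field names here are illustrative only.
schema = {
    "type": "object",
    "properties": {
        "answer": {"type": "string"},
        "confidence": {"type": "number"},
    },
    "required": ["answer", "confidence"],
    "additionalProperties": False,
}
print(json.dumps(schema, indent=2))
```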

**Top 5 features from v0.2.0**

1. **Native Tool Calling** — Qwen, Llama, and Mistral models use built-in function calling instead of text parsing. Set `tool_calling_mode="native"` or `"hybrid"`. Structured JSON/Pydantic output validation included.

2. **Guardrails & Safety** — PII detection, prompt injection blocking, toxicity filtering, tool permissions. One-liner: `get_guardrail_preset("strict")`.

3. **Production RAG Pipeline** — ingest PDF/DOCX/HTML/Markdown, semantic + BM25 hybrid search, reranking, inline citations. `create_agent("rag", model, knowledge_base="./docs/")`.

4. **Production API Server** — OpenAI-compatible `/v1/chat/completions`, request queuing, agent pooling, multi-tenancy, API keys. A drop-in OpenAI replacement with local SLMs.

5. **Apple Silicon Native** — MLX & MLX-VLM backends for M1/M2/M3/M4. Metal GPU acceleration, unified memory. `pip install effgen[mlx]`.
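To illustrate the kind of check a PII guardrail performs, here is a standalone regex sketch of email redaction. This is a concept illustration only — not effGen's detector, which also covers injection patterns, toxicity, and tool permissions:

```python
import re

# Standalone illustration of PII redaction -- NOT effGen's implementation.
EMAIL_RE = re.compile(r"[\w.+-]+@[\w-]+\.[\w.]+")

def redact_emails(text: str) -> str:
    """Replace anything that looks like an email with a placeholder."""
    return EMAIL_RE.sub("[REDACTED_EMAIL]", text)

print(redact_emails("Contact alice@example.com for access."))
# Contact [REDACTED_EMAIL] for access.
```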


## 🎯 Agent Presets

Get started instantly with ready-to-use agent configurations:

```python
from effgen import load_model
from effgen.presets import create_agent

model = load_model("Qwen/Qwen2.5-3B-Instruct", quantization="4bit")

# One-line agent creation
math_agent = create_agent("math", model)         # Calculator + PythonREPL
research_agent = create_agent("research", model) # WebSearch + URLFetch + Wikipedia
coding_agent = create_agent("coding", model)     # CodeExecutor + PythonREPL + FileOps + Bash
general_agent = create_agent("general", model)   # All tools
rag_agent = create_agent("rag", model, knowledge_base="./docs/")  # RAG pipeline
minimal_agent = create_agent("minimal", model)   # Direct inference, no tools
```

```bash
# CLI preset support
effgen run --preset math "What is sqrt(144)?"
effgen run --preset research "Tell me about quantum computing"
```

πŸ› οΈ Built-in Tools (31)

πŸ”’
Calculator
Math & Units

🌐
WebSearch
DuckDuckGo

πŸ’»
CodeExecutor
Sandboxed

🐍
PythonREPL
Interactive

πŸ“
FileOps
Read/Write

πŸ”
Retrieval
RAG + BM25

🎯
AgenticSearch
ripgrep

πŸ–₯️
BashTool
Shell Cmds

🌀️
WeatherTool
Open-Meteo

πŸ“‹
JSONTool
Query/Validate

πŸ•
DateTimeTool
Timezones

πŸ“
TextProcessing
Regex/Count

πŸ”—
URLFetch
Web Scrape

πŸ“–
Wikipedia
Free API


## 📚 Examples

### 🖥️ GUI Applications (Gradio)

```bash
# Visual agent & tool development
python examples/basic/chat_gui_mlx.py              # MLX Chat — streaming chat with Apple Silicon models (port 7860)
python examples/basic/agent_viz_mlx.py             # Agent Visualizer — step-by-step reasoning + code editor (port 7860)
python examples/basic/tool_builder_gui.py          # Tool Builder — visually create custom tools (port 7863)
python examples/basic/tool_tester_gui.py           # Tool Tester — browse, test, inspect all 31 tools (port 7864)
```

### 🍎 Apple Silicon (MLX)

```bash
python examples/basic/basic_agent_mlx.py            # Basic MLX agent with calculator
python examples/basic/chat_gui_mlx.py --autoload    # Chat GUI with auto model loading
python examples/basic/agent_viz_mlx.py --autoload   # Agent visualizer with auto model loading
```

### 🤖 Core Agent Examples

```bash
python examples/basic/qa_agent.py                  # Q&A agent (no tools)
python examples/basic/calculator_agent.py          # Math with Calculator + PythonREPL
python examples/tools/advanced_multi_tool_agent.py # 5 tools + fallback chains
python examples/tools/file_operations_agent.py     # File read/write/search
python examples/tools/coding_agent.py              # Code execution + iteration
python examples/advanced/conversational_agent.py   # Multi-turn memory
python examples/advanced/advanced_streaming_agent.py # Token streaming with callbacks
python examples/advanced/data_processing_agent.py  # JSON & data pipelines
python examples/advanced/multi_agent_pipeline.py   # Multi-agent orchestration
python examples/advanced/error_recovery_agent.py   # Error handling patterns
```

### ⚡ Quick-Start Examples

```bash
python examples/basic/basic_agent.py               # Basic agent (Transformers)
python examples/basic/basic_agent_vllm.py          # Basic agent (vLLM - 5-10x faster)
python examples/plugins_presets/preset_agents.py   # Ready-to-use agent presets
python examples/web_retrieval/streaming_agent.py   # Simple streaming
python examples/web_retrieval/memory_agent.py      # Simple multi-turn memory
python examples/tools/multi_tool_agent.py          # Simple multi-tool
python examples/web_retrieval/weather_agent.py     # Weather via Open-Meteo (free)
python examples/plugins_presets/plugin_example.py  # Custom tool plugins
python examples/web_retrieval/web_agent.py         # Web search agent
python examples/web_retrieval/retrieval_agent.py   # RAG-based retrieval
```

📊 See examples/compatibility_matrix.md for model compatibility across all agents.

## 📖 More Examples

### Multi-Tool Agent

```python
from effgen import Agent, load_model
from effgen.core.agent import AgentConfig
from effgen.tools.builtin import Calculator, WebSearch, PythonREPL

model = load_model("Qwen/Qwen2.5-3B-Instruct")

config = AgentConfig(
    name="research_agent",
    model=model,
    tools=[Calculator(), WebSearch(), PythonREPL()],
    system_prompt="You are a research assistant."
)

agent = Agent(config=config)
result = agent.run("Search for the population of Tokyo and calculate what percentage it is of Japan's total population")
```

### Streaming

```python
from effgen import Agent, load_model
from effgen.core.agent import AgentConfig
from effgen.tools.builtin import Calculator

model = load_model("Qwen/Qwen2.5-3B-Instruct", quantization="4bit")
agent = Agent(config=AgentConfig(
    name="stream_demo", model=model,
    tools=[Calculator()], enable_streaming=True
))

for token in agent.stream("What is 2 + 2?"):
    print(token, end="", flush=True)
```

### Memory (Multi-Turn)

```python
agent = Agent(config=AgentConfig(
    name="memory_demo", model=model,
    tools=[], enable_memory=True
))

agent.run("My name is Alice and I'm working on quantum computing.")
result = agent.run("What's my name and what am I working on?")
# → "Your name is Alice and you're working on quantum computing."
```

### Retrieval Agent (RAG)

```python
from effgen.tools.builtin import Retrieval

retrieval_tool = Retrieval(knowledge_base_path="./docs")
config = AgentConfig(name="qa_agent", model=model, tools=[retrieval_tool])
agent = Agent(config=config)
result = agent.run("What does the documentation say about configuration?")
```

## 🤖 Multi-Model Support

effGen supports 9 cloud inference providers plus 4 local backends, tested across 11+ model families:

| Backend | Platform | Install | Best For |
|---------|----------|---------|----------|
| MLX | Apple Silicon (M1/M2/M3/M4) | `effgen[mlx]` | Native Metal GPU, unified memory, 4/8-bit quantization |
| MLX-VLM | Apple Silicon | `effgen[mlx-vlm]` | Vision-Language models (Qwen2-VL, LLaVA, Phi-3 Vision, 30+ architectures) |
| vLLM | NVIDIA GPU | `effgen[vllm]` | High-throughput batch inference |
| Transformers | Any (CPU/GPU) | (bundled) | Universal compatibility, local models |
| OpenAI | Cloud API | (bundled) | gpt-5/gpt-5.4/o-series, reasoning_effort, structured outputs, native tools |
| Anthropic | Cloud API | (bundled) | Claude 4.7/4.x, extended thinking, prompt caching, native tools |
| Google Gemini | Cloud API | (bundled) | Gemini 3.x/2.5/2.0, thinking_budget, grounding, Files API, native tools |
| Cerebras | Cloud API | `effgen[cerebras]` | 4 free-tier models (llama3.1-8b, qwen-3-235b), ultra-low latency |
| Groq | Cloud API | `effgen[groq]` | 16 models (llama-3.3-70b, mixtral, qwen3-32b), ultra-fast free-tier inference |
| Together AI | Cloud API | `effgen[together]` | 163-model catalog (llama, deepseek, qwen, mistral), per-model pricing |
| Fireworks | Cloud API | `effgen[fireworks]` | 80 chat models (54 tool-capable), serverless + dedicated |
| Replicate | Cloud API | `effgen[replicate]` | 38 models, async run-poll, SSE streaming, compute-second billing |
| HuggingFace | Cloud API | `effgen[hf]` | 124-model HF Router catalog, custom Inference Endpoints, free serverless tier |

### Provider Auth Check

```bash
# See which API keys are configured
effgen doctor
```

### Quick Cloud Start

```python
from effgen import load_model, Agent
from effgen.core.agent import AgentConfig
from effgen.tools.builtin import Calculator

# Any of the 9 cloud providers
model = load_model("llama-3.1-8b-instant", provider="groq")          # Groq
# model = load_model("meta-llama/Llama-3.3-70B-Instruct-Turbo", provider="together")
# model = load_model("Qwen/Qwen2.5-72B-Instruct", provider="hf")

agent = Agent(config=AgentConfig(name="agent", model=model, tools=[Calculator()]))
result = agent.run("What is (17 * 23) + sqrt(144)?")
print(result.output)  # → 403
```

### Top Recommended Models

| Model | Size | Compatibility |
|-------|------|---------------|
| LFM2.5-1.2B-Instruct-MLX-8bit | 1.2B | Apple Silicon optimized, fast agentic |
| Qwen2.5-1.5B-Instruct | 1.5B | 10/10 agents pass |
| Qwen2.5-3B-Instruct | 3B | 10/10 agents pass (recommended default) |
| Phi-4-mini-instruct | 3.8B | 10/10 agents pass |
| Qwen3-1.7B | 1.7B | 9.5/10 |
| Qwen2.5-7B-Instruct | 7B | 9/10 |
| Llama-3.2-3B-Instruct | 3B | 8.5/10 |

Full matrix with 11 models × 10 agents: compatibility_matrix.md


## 🔒 Security

- 🐳 **Docker Sandbox**: Isolated execution
- 🛡️ **Input Validation**: Auto sanitization
- ⚡ **Rate Limiting**: Configurable limits

📋 For security policies and vulnerability reporting, see SECURITY.md
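As a concept illustration of the rate-limiting idea (a generic token bucket, not effGen's limiter): each request consumes one token, and tokens refill at a fixed rate up to a burst capacity.

```python
import time

class TokenBucket:
    """Minimal token-bucket rate limiter (concept sketch, not effGen's)."""

    def __init__(self, rate: float, capacity: float, clock=time.monotonic):
        self.rate = rate            # tokens refilled per second
        self.capacity = capacity    # maximum burst size
        self.tokens = capacity
        self.clock = clock
        self.last = clock()

    def allow(self) -> bool:
        """Consume one token if available; refill based on elapsed time."""
        now = self.clock()
        self.tokens = min(self.capacity, self.tokens + (now - self.last) * self.rate)
        self.last = now
        if self.tokens >= 1:
            self.tokens -= 1
            return True
        return False

# Demo with an injected fake clock: capacity 2, refill 1 token/sec.
t = [0.0]
bucket = TokenBucket(rate=1.0, capacity=2, clock=lambda: t[0])
print(bucket.allow(), bucket.allow(), bucket.allow())  # True True False
t[0] = 1.0  # one simulated second passes -> one token refilled
print(bucket.allow())  # True
```

Injecting the clock keeps the behavior deterministic and testable; a real limiter would also track per-provider budgets (RPM/TPM) rather than a single bucket.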


## 📖 Citation

If you use effGen in your research, please cite our paper:

```bibtex
@misc{srivastava2026effgen,
      title={effGen: Enabling Small Language Models as Capable Autonomous Agents},
      author={Gaurav Srivastava and Aafiya Hussain and Chi Wang and Yingyan Celine Lin and Xuan Wang},
      year={2026},
      eprint={2602.00887},
      archivePrefix={arXiv},
      primaryClass={cs.CL},
      url={https://arxiv.org/abs/2602.00887},
}
```

## 🔗 Links

Paper · Website · Docs · PyPI · Issues


## 📄 License

Apache License 2.0 — see LICENSE for details.


Made with ❤️ for the AI community

## About

[ICML 2026] effGen: Enabling Small Language Models as Capable Autonomous Agents