A Python package containing easy-to-use tools for working with various language models and AI services. AIMU is designed for running models locally via Ollama, Hugging Face Transformers, or any OpenAI-compatible local serving framework, and for cloud models via native provider SDKs (OpenAI, Anthropic, Google Gemini).
- **Model Clients**: Support for multiple AI model providers, including:
  - Ollama (local models, native API)
  - Hugging Face Transformers (local models)
  - llama-cpp-python (local GGUF models, in-process, no external service required)
  - Anthropic Claude models via the native `anthropic` SDK (`AnthropicClient`), with native thinking support
  - Cloud and local servers via the `openai` SDK (`aimu[openai_compat]`):
    - OpenAI (`OpenAIClient`): GPT-4o, GPT-4.1, o3, o4-mini, and more
    - Google Gemini (`GeminiClient`): Gemini 2.0/2.5 via Google's OpenAI-compatible endpoint
    - LM Studio (`LMStudioOpenAIClient`)
    - Ollama OpenAI-compat endpoint (`OllamaOpenAIClient`)
    - HuggingFace Transformers Serve (`HFOpenAIClient`)
    - vLLM (`VLLMOpenAIClient`)
    - llama.cpp llama-server (`LlamaServerOpenAIClient`)
    - SGLang (`SGLangOpenAIClient`)
    - Any OpenAI-compatible server (`OpenAICompatClient`)
- **Thinking Models**: First-class support for extended reasoning models (e.g. DeepSeek-R1, Qwen3, GPT-OSS). Thinking is enabled automatically for supported models, with access to the reasoning traces.
- **Agentic Workflows**: Per Anthropic's taxonomy, AIMU separates agents (autonomous, tool-driven) from workflows (code-controlled). `SimpleAgent` and `SkillAgent` implement the agent side; `Chain`, `Router`, `Parallel`, and `EvaluatorOptimizer` implement Anthropic's four workflow patterns. All share a `Runner` base class with `run()`/`run_streamed()`. `AgenticModelClient` wraps a `SimpleAgent` behind the standard `ModelClient` interface, making agentic and single-turn clients interchangeable.
- **MCP Tools**: Model Context Protocol (MCP) client for enhancing AI capabilities. Provides a simple(r) interface for FastMCP 2.0.
- **Chat Conversation Storage/Management**: Chat conversation history management using TinyDB.
- **Memory Storage**: Two complementary persistent memory stores:
  - Semantic Memory (`SemanticMemoryStore`): Fact storage using ChromaDB vector embeddings. Store natural-language subject-predicate-object strings (e.g. `"Paul works at Google"`) and retrieve by semantic topic (e.g. `"employment"`, `"family life"`).
  - Document Memory (`DocumentStore`): Path-based document store mirroring Anthropic's Managed Agents Memory API. Supports `write`, `read`, `edit`, `delete`, and full-text `search` on named paths (e.g. `/preferences.md`).
- **Agent Skills**: Filesystem-discovered skill definitions that inject instructions and tools into agents automatically. Skills are YAML-fronted Markdown files discovered from project and user directories.
- **Prompt Storage/Management**: Versioned prompt catalog backed by SQLite (SQLAlchemy), plus a hill-climbing `PromptTuner` for automatic prompt optimization. Four concrete tuners are included: `ClassificationPromptTuner` (binary YES/NO), `MultiClassPromptTuner` (N-way), `ExtractionPromptTuner` (JSON field extraction), and `JudgedPromptTuner` (open-ended generation rated by a second LLM). Subclass `PromptTuner` to implement custom task types.
In addition to the AIMU package in the `aimu` directory, the AIMU code repository includes:

- Jupyter notebooks demonstrating key AIMU features.
- Example chat clients in the `web/` directory, built with Streamlit and Gradio, using the AIMU Model Client, MCP tools support, and chat conversation management.
- A full suite of Pytest tests.
The following Jupyter notebooks demonstrate key AIMU features:
| Notebook | Description |
|---|---|
| 01 - Model Client | Text generation, chat, streaming, and thinking models |
| 02 - MCP Tools | MCP tool integration with model clients |
| 03 - Prompts | Versioned prompt storage and hill-climbing tuning |
| 04 - Conversations | Persistent chat conversation management |
| 05 - Memory | Semantic fact storage and retrieval |
| 06 - Agents | SimpleAgent and AgenticModelClient |
| 07 - Agent Skills | Filesystem-discovered skill injection with SkillAgent |
| 08 - Agent Workflows | Chain, Router, Parallel, and EvaluatorOptimizer patterns |
For all features, run:

``` shell
pip install aimu[all]
```

Or install only what you need:

``` shell
pip install aimu[ollama]        # Ollama (local models, native API)
pip install aimu[hf]            # Hugging Face Transformers (local models)
pip install aimu[anthropic]     # Anthropic Claude models
pip install aimu[openai_compat] # OpenAI, Google Gemini, and OpenAI-compatible local servers
pip install aimu[llamacpp]      # Local GGUF models via llama-cpp-python (no external service)
```

For gated Hugging Face models, you'll need a Hugging Face Hub access token:

``` shell
hf auth login
```

Once you've cloned the repository, run the following command to install all model dependencies:

``` shell
pip install -e '.[all]'
```

Additionally, run the following command to install development (testing, linting) and notebook dependencies:

``` shell
pip install -e '.[dev,notebooks]'
```

Alternatively, if you have uv installed, you can get all model and development dependencies with:

``` shell
uv sync --all-extras
```

Using Pytest, tests can be run for a specific model client and/or model, using optional arguments:
``` shell
pytest tests/test_models.py --client=ollama --model=GPT_OSS_20B
```

``` python
from aimu.models import OllamaClient as ModelClient  # or HuggingFaceClient, or OpenAICompatClient

model_client = ModelClient(ModelClient.MODELS.QWEN_3_5_9B)
response = model_client.generate("What is the capital of France?", {"temperature": 0.7})
```

``` python
from aimu.models import OllamaClient as ModelClient

model_client = ModelClient(ModelClient.MODELS.QWEN_3_5_9B)
response = model_client.chat("What is the capital of France?")
print(model_client.messages)
```

Models with extended reasoning capabilities (e.g. DeepSeek-R1, Qwen3, GPT-OSS) are identified by the `THINKING_MODELS` list on each client. Thinking is enabled automatically when one of these models is selected.
After generation, the model's reasoning trace is available in `last_thinking`:

``` python
from aimu.models import OllamaClient as ModelClient

model_client = ModelClient(ModelClient.MODELS.DEEPSEEK_R1_8B)
response = model_client.generate("What is the capital of France?")
print(model_client.last_thinking)  # reasoning trace
print(response)  # final answer
```

During streamed generation via `generate_streamed()`, thinking tokens are yielded first, followed by the response tokens, as a single flat stream. For phase-separated streaming (thinking, tool calls, response), use `chat_streamed()` instead.
`chat_streamed()` yields `StreamChunk` objects. Each chunk carries its own type:

| `chunk.phase` | `chunk.content` type | Description |
|---|---|---|
| `StreamPhase.THINKING` | `str` | Reasoning token (thinking models only) |
| `StreamPhase.TOOL_CALLING` | `dict` (`{"name": str, "response": str}`) | Tool call and its result |
| `StreamPhase.GENERATING` | `str` | Final response token |
``` python
from aimu.models import OllamaClient as ModelClient, StreamPhase

model_client = ModelClient(ModelClient.MODELS.QWEN_3_5_9B)

last_phase = None
for chunk in model_client.chat_streamed("What is the capital of France?"):
    if last_phase != chunk.phase:
        print(f"--- {chunk.phase} ---")
        last_phase = chunk.phase
    print(chunk.content, end="", flush=True)
```

`OpenAIClient`, `AnthropicClient`, and `GeminiClient` connect to cloud APIs using each provider's native SDK or OpenAI-compatible endpoint. API keys are read from environment variables (or a `.env` file).
``` python
from aimu.models import OpenAIClient, OpenAIModel

client = OpenAIClient(OpenAIModel.GPT_4O_MINI)
response = client.chat("What is the capital of France?")
```

``` python
from aimu.models import AnthropicClient, AnthropicModel

client = AnthropicClient(AnthropicModel.CLAUDE_SONNET_4_6)
response = client.chat("What is the capital of France?")
```

``` python
from aimu.models import GeminiClient, GeminiModel

client = GeminiClient(GeminiModel.GEMINI_2_0_FLASH)
response = client.chat("What is the capital of France?")
```

All three support the full `ModelClient` API, including streaming, tool calling, and thinking models (`AnthropicModel.CLAUDE_SONNET_4_6`, `GeminiModel.GEMINI_2_5_PRO`).
Required environment variables:

- OpenAI: `OPENAI_API_KEY`
- Anthropic: `ANTHROPIC_API_KEY`
- Google Gemini: `GOOGLE_API_KEY`
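For example, these can be placed in a `.env` file at the project root (placeholder values shown; substitute your own keys):

``` shell
# .env -- placeholder values
OPENAI_API_KEY=sk-...
ANTHROPIC_API_KEY=sk-ant-...
GOOGLE_API_KEY=...
```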
Use any of the service-specific clients to connect to a local server that speaks the OpenAI REST API. Each client uses service-appropriate default URLs and model IDs:

``` python
from aimu.models import LMStudioOpenAIClient, LMStudioOpenAIModel

# Connects to http://localhost:1234/v1 by default
client = LMStudioOpenAIClient(LMStudioOpenAIModel.QWEN_3_8B)
response = client.chat("What is the capital of France?")
```

``` python
from aimu.models import OllamaOpenAIClient, OllamaOpenAIModel

# Connects to Ollama's OpenAI-compat endpoint at http://localhost:11434/v1
client = OllamaOpenAIClient(OllamaOpenAIModel.QWEN_3_8B)
response = client.chat("What is the capital of France?")
```

``` python
from aimu.models import LlamaServerOpenAIClient, LlamaServerOpenAIModel

# Connects to llama-server at http://localhost:8080/v1 by default
# Start with: llama-server -m /path/to/model.gguf --port 8080
client = LlamaServerOpenAIClient(LlamaServerOpenAIModel.QWEN_3_8B)
response = client.chat("What is the capital of France?")
```

``` python
from aimu.models import SGLangOpenAIClient, SGLangOpenAIModel

# Connects to SGLang at http://localhost:30000/v1 by default
# Start with: python -m sglang.launch_server --model-path Qwen/Qwen3-8B --port 30000
client = SGLangOpenAIClient(SGLangOpenAIModel.QWEN_3_8B)
response = client.chat("What is the capital of France?")
```

For a custom server or model not in the enum, use `OpenAICompatClient` directly:

``` python
from aimu.models import OpenAICompatClient
from aimu.models.openai_compat import OllamaOpenAIModel

client = OpenAICompatClient(OllamaOpenAIModel.QWEN_3_8B, base_url="http://myserver:8080/v1")
```

All OpenAI-compatible clients support the full `ModelClient` API. Streaming, tool calling, thinking models, and MCP tools work identically to the other clients.
`LlamaCppClient` runs GGUF models directly in-process; no Ollama, LM Studio, or other serving process is required. Pass the path to any GGUF file and a `LlamaCppModel` enum value that describes the model's capabilities:

``` python
from aimu.models.llamacpp import LlamaCppClient, LlamaCppModel

client = LlamaCppClient(LlamaCppModel.QWEN_3_4B, model_path="/path/to/qwen3-4b.Q4_K_M.gguf")
response = client.chat("What is the capital of France?")
```

GPU offloading is enabled by default (`n_gpu_layers=-1`). To run on CPU only, pass `n_gpu_layers=0`. The context window defaults to 4096 tokens; increase it with `n_ctx`:

``` python
client = LlamaCppClient(
    LlamaCppModel.QWEN_3_4B,
    model_path="/path/to/model.gguf",
    n_ctx=8192,
    n_gpu_layers=-1,  # offload all layers to GPU
)
```

All standard `ModelClient` features work: streaming, tool calling, thinking models, and MCP tools.
A full-featured chat UI with model/client selection, streaming, thinking model support, MCP tool calls, and conversation persistence:

``` shell
streamlit run web/streamlit_chatbot.py
```

A full-featured chat UI equivalent to the Streamlit example above:

``` shell
python web/gradio_chatbot.py
```

AIMU follows Anthropic's agent/workflow taxonomy. All runnable units share a `Runner` base class with `run()` and `run_streamed()`. Agents (`SimpleAgent`, `SkillAgent`) autonomously direct tool use; workflows (`Chain`, `Router`, `Parallel`, `EvaluatorOptimizer`) have code-controlled flow.
`SimpleAgent` wraps a `ModelClient` and runs a tool-calling loop until the model stops invoking tools.

``` python
from aimu.models.ollama import OllamaClient, OllamaModel
from aimu.tools import MCPClient
from aimu.agents import SimpleAgent

client = OllamaClient(OllamaModel.QWEN_3_8B)
client.mcp_client = MCPClient({"mcpServers": {"mytools": {"command": "python", "args": ["tools.py"]}}})

agent = SimpleAgent(client, name="assistant", max_iterations=10)
result = agent.run("Find all log files modified today and summarise the errors.")
```

Agents are configurable from a plain dict:

``` python
agent = SimpleAgent.from_config(
    {"name": "researcher", "system_message": "Use tools to answer.", "max_iterations": 8},
    client,
)
```

`run_streamed()` yields `AgentChunk` objects tagged with `agent_name`, `iteration`, and `StreamPhase`.
`Chain` sequences agents so the output of each step becomes the input to the next. Pass a single `ModelClient`; it is shared across steps, with `messages` cleared and `system_message` applied from each step's config before it runs:

``` python
from aimu.agents import Chain

chain = Chain.from_config(
    [
        {"name": "planner", "system_message": "Break the task into steps.", "max_iterations": 3},
        {"name": "executor", "system_message": "Execute each step using tools.", "max_iterations": 10},
        {"name": "formatter", "system_message": "Format the results clearly.", "max_iterations": 1},
    ],
    client,
)
result = chain.run("Research the top Python web frameworks.")
```

`run_streamed()` yields `ChainChunk` objects that extend `AgentChunk` with a step index.
`Router` classifies input with a routing agent and dispatches to the matching handler:

``` python
from aimu.agents import Router, SimpleAgent

routing_agent = SimpleAgent.from_config({"system_message": "Reply with only: weather, math, or general"}, client)
router = Router(
    routing_agent=routing_agent,
    handlers={"weather": weather_agent, "math": math_agent},
    fallback=general_agent,
)
result = router.run("What is the weather in Tokyo?")
```

Every `Runner` exposes a `messages` property that returns a `dict[str, list[dict]]` mapping each agent name to its message history snapshot. For workflows, the dict is merged recursively across all sub-agents, so you can inspect every exchange after a run:

``` python
router.run("What is the weather in Tokyo?")
router.messages
# {
#     "routing-agent": [{"role": "user", ...}, {"role": "assistant", "content": "weather"}],
#     "weather-specialist": [{"role": "user", ...}, {"role": "assistant", "content": "..."}],
#     "math-specialist": [],  # not dispatched: empty snapshot
# }
```

Snapshots are taken at the end of each `run()` / `run_streamed()` call, so they survive subsequent runs even when agents share a single `ModelClient`.
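Conceptually, the merged view is just a combination of per-agent dicts. The sketch below is purely illustrative (`merge_messages` and the agent names are hypothetical, not AIMU internals):

``` python
# Illustrative only: one way a workflow could merge per-agent message
# snapshots into a single {agent_name: [messages]} dict, as the
# `messages` property described above returns.

def merge_messages(*snapshots: dict) -> dict:
    """Merge {agent_name: [messages]} dicts into one combined view."""
    merged: dict = {}
    for snap in snapshots:
        merged.update(snap)
    return merged

router_snap = {"routing-agent": [{"role": "assistant", "content": "weather"}]}
handler_snaps = {
    "weather-specialist": [{"role": "assistant", "content": "Sunny in Tokyo."}],
    "math-specialist": [],  # not dispatched, so its snapshot is empty
}

combined = merge_messages(router_snap, handler_snaps)
```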
`Parallel` runs workers concurrently via `ThreadPoolExecutor` and optionally aggregates their outputs:

``` python
from aimu.agents import Parallel

parallel = Parallel(
    workers=[perspective_a_agent, perspective_b_agent, perspective_c_agent],
    aggregator=summarizer_agent,
)
result = parallel.run("Analyse the impact of remote work on productivity.")
```

`EvaluatorOptimizer` runs a generate → evaluate → revise loop until the evaluator emits a pass signal or `max_rounds` is reached:
``` python
from aimu.agents import EvaluatorOptimizer

eo = EvaluatorOptimizer(
    generator=writer_agent,
    evaluator=critic_agent,
    max_rounds=4,
    pass_keyword="PASS",
)
result = eo.run("Write a concise explanation of transformer attention.")
```
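The control flow of that loop is easy to picture in plain Python. The sketch below is illustrative only: the stub `generate`/`evaluate` functions stand in for real agents, and none of these names are AIMU APIs.

``` python
# Schematic generate -> evaluate -> revise loop, as run by an
# evaluator-optimizer workflow. Stub functions stand in for agents.

def run_evaluator_optimizer(generate, evaluate, task, max_rounds=4, pass_keyword="PASS"):
    draft = generate(task, feedback=None)
    for _ in range(max_rounds):
        verdict = evaluate(draft)
        if pass_keyword in verdict:               # evaluator signalled acceptance
            break
        draft = generate(task, feedback=verdict)  # revise using the critique
    return draft

# Toy "agents": the evaluator passes once the draft mentions "attention".
def generate(task, feedback):
    return "attention explained" if feedback else "first draft"

def evaluate(draft):
    return "PASS" if "attention" in draft else "FAIL: mention attention"

result = run_evaluator_optimizer(generate, evaluate, "explain attention")
```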
### AgenticModelClient
`AgenticModelClient` wraps a `SimpleAgent` behind the standard `ModelClient` interface. Use it anywhere a `ModelClient` is accepted — web UIs, conversation managers, etc. — to get the full agentic loop transparently:
``` python
from aimu.models.ollama import OllamaClient, OllamaModel
from aimu.tools import MCPClient
from aimu.agents import SimpleAgent, AgenticModelClient
inner = OllamaClient(OllamaModel.QWEN_3_8B)
inner.mcp_client = MCPClient({"mcpServers": {"mytools": {"command": "python", "args": ["tools.py"]}}})
# Single-turn client
client = inner
# Agentic client — same interface, loops until tools stop being called
client = AgenticModelClient(SimpleAgent(inner, max_iterations=10))
# Both work identically here:
response = client.chat("Find all log files modified today and summarise the errors.")
```

For workflow patterns (`Chain`, `Router`, `Parallel`, `EvaluatorOptimizer`), call `run()` / `run_streamed()` directly instead.
``` python
from aimu.tools import MCPClient

mcp_client = MCPClient({
    "mcpServers": {
        "mytools": {"command": "python", "args": ["tools.py"]},
    }
})

mcp_client.call_tool("mytool", {"input": "hello world!"})
```

``` python
from aimu.models import OllamaClient as ModelClient
from aimu.tools import MCPClient

mcp_client = MCPClient({
    "mcpServers": {
        "mytools": {"command": "python", "args": ["tools.py"]},
    }
})

model_client = ModelClient(ModelClient.MODELS.QWEN_3_5_9B)
model_client.mcp_client = mcp_client
model_client.chat("use my tool please")
```
``` python
from aimu.models import OllamaClient as ModelClient
from aimu.history import ConversationManager

chat_manager = ConversationManager("conversations.json", use_last_conversation=True)  # loads the last saved conversation

model_client = ModelClient(ModelClient.MODELS.QWEN_3_5_9B)
model_client.messages = chat_manager.messages
model_client.chat("What is the capital of France?")

chat_manager.update_conversation(model_client.messages)  # store the updated conversation
```
``` python
from aimu.memory import SemanticMemoryStore

store = SemanticMemoryStore(persist_path="./memory_store")
store.store("Paul works at Google")
store.store("Paul is married to Sarah")
store.store("Sarah is the sister of Emma")

store.search("work and employment")  # ["Paul works at Google", ...]
store.search("family relationships")  # ["Paul is married to Sarah", ...]
store.search("work and employment", max_distance=0.4)  # only close matches
```
``` python
from aimu.memory import DocumentStore

store = DocumentStore(persist_path="./doc_store")
store.write("/preferences.md", "Always use concise responses.")
store.write("/notes/meeting.md", "Discussed Q3 roadmap with team.")

store.read("/preferences.md")  # "Always use concise responses."
store.edit("/preferences.md", "concise", "detailed")  # in-place edit
store.list_paths()  # ["/notes/meeting.md", "/preferences.md"]
store.search_full_text("roadmap")  # [{"path": ..., "content": ...}]
store.delete("/notes/meeting.md")
```

Skills are `SKILL.md` files discovered from `.agents/skills/` or `.claude/skills/` directories (project-level overrides user-level). Use `SkillAgent` instead of `SimpleAgent` to enable skill support:
``` python
from aimu.agents import SkillAgent
from aimu.skills import SkillManager

# Auto-discover skills from default paths
agent = SkillAgent(client, name="assistant")

# Or point at specific directories
agent = SkillAgent(client, skill_manager=SkillManager(skill_dirs=["./skills"]))

# Or use from_config (skill_dirs optional; omit to auto-discover)
agent = SkillAgent.from_config({"name": "assistant", "skill_dirs": ["./skills"]}, client)

result = agent.run("Use the pdf-processing skill to extract pages from report.pdf")
```

Each skill directory contains a `SKILL.md` with YAML frontmatter (`name`, `description`) and optional `scripts/*.py` files that are registered as callable tools.
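A minimal `SKILL.md` might look like the following. This is an illustrative sketch: only the frontmatter keys `name` and `description` are stated above; the body text and layout are assumptions.

``` markdown
---
name: pdf-processing
description: Extract pages and text from PDF files.
---

# PDF Processing

Instructions injected into the agent when this skill is active.
Helper scripts in `scripts/` are registered as callable tools.
```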
``` python
from aimu.prompts import PromptCatalog, Prompt

with PromptCatalog("prompts.db") as catalog:
    prompt = Prompt(name="summarizer", prompt="Summarize the following: {content}", model_id="llama3.1:8b")
    catalog.store_prompt(prompt)  # version and created_at assigned automatically

    latest = catalog.retrieve_last("summarizer", "llama3.1:8b")
    print(f"v{latest.version}: {latest.prompt}")
```

Four concrete tuners ship out of the box, all sharing the same hill-climbing loop inherited from `PromptTuner`.
``` python
import pandas as pd

from aimu.models import OllamaClient as ModelClient
from aimu.prompts import ClassificationPromptTuner

model_client = ModelClient(ModelClient.MODELS.QWEN_3_5_9B)
tuner = ClassificationPromptTuner(model_client=model_client)

df = pd.DataFrame({
    "content": ["LLMs are transforming AI.", "The recipe calls for flour.", ...],
    "actual_class": [True, False, ...],
})

best_prompt = tuner.tune(
    df,
    initial_prompt="Is this about AI? Reply [YES] or [NO]. Content: {content}",
    max_iterations=10,
)
```
``` python
from aimu.prompts import MultiClassPromptTuner

tuner = MultiClassPromptTuner(model_client, classes=["positive", "negative", "neutral"])

df = pd.DataFrame({
    "content": ["Loved it!", "Terrible experience.", "It was okay."],
    "actual_class": ["positive", "negative", "neutral"],
})

best_prompt = tuner.tune(
    df,
    initial_prompt="Classify the sentiment as [positive], [negative], or [neutral]. Text: {content}",
)
```

Metrics include per-class precision, recall, F1, and macro F1.
``` python
from aimu.prompts import ExtractionPromptTuner

tuner = ExtractionPromptTuner(model_client, fields=["name", "company"])

df = pd.DataFrame({
    "content": ["Alice Smith works at Acme Corp.", "Bob Jones is at Initech."],
    "expected": [{"name": "Alice Smith", "company": "Acme Corp."}, {"name": "Bob Jones", "company": "Initech"}],
})

best_prompt = tuner.tune(
    df,
    initial_prompt='Extract "name" and "company" as JSON. Text: {content}',
)
```

Metrics include row-level accuracy and per-field match rates. The model may return raw JSON, a fenced block, or any JSON object substring; all are handled automatically.
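That kind of tolerance can be sketched in a few lines. The parser below is illustrative, not the tuner's actual code:

``` python
import json
import re

# Illustrative sketch: pull the first JSON object out of raw model output,
# whether it is bare JSON, wrapped in a code fence, or embedded in prose.
FENCE = "`" * 3  # a literal code-fence marker, built up to keep this snippet renderable

def extract_json(text: str):
    # Drop any code-fence markers, then grab the first {...} substring.
    cleaned = text.replace(FENCE + "json", " ").replace(FENCE, " ")
    match = re.search(r"\{.*\}", cleaned, re.DOTALL)
    if not match:
        return None
    try:
        return json.loads(match.group(0))
    except json.JSONDecodeError:
        return None

extract_json('{"name": "Alice"}')                         # bare JSON
extract_json('Sure! {"name": "Alice"} Hope that helps.')  # JSON inside prose
```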
`JudgedPromptTuner` uses a second model to score each output on a 1–10 scale. The primary optimisation target is mean judge score rather than binary accuracy.

``` python
from aimu.prompts import JudgedPromptTuner

tuner = JudgedPromptTuner(
    model_client=writer_client,
    judge_client=judge_client,
    criteria="Summaries should be under three sentences and mention the key conclusion.",
    pass_threshold=0.7,
)

df = pd.DataFrame({"content": [long_article_1, long_article_2, ...]})

best_prompt = tuner.tune(
    df,
    initial_prompt="Summarise this article: {content}",
    max_iterations=10,
)
```

Subclass `PromptTuner` and implement three methods:
``` python
from aimu.prompts import PromptTuner

class MyTuner(PromptTuner):
    def apply_prompt(self, prompt, data):
        # Run prompt on data; set data["_correct"] boolean column
        ...

    def evaluate(self, data) -> dict:
        # Return metrics dict; must include "accuracy" key (or override score())
        ...

    def mutation_prompt(self, current_prompt, items) -> str:
        # Return a prompt asking the LLM to improve current_prompt given failing items
        # LLM response should contain the improved prompt in <prompt>...</prompt> tags
        ...

    # Optional: override to optimise a different metric
    def score(self, metrics) -> float:
        return metrics["my_metric"]

    # Optional: override to parse mutations in a different format
    def extract_mutated_prompt(self, result) -> str:
        return result.strip()
```

The tuner calls `apply_prompt` → `evaluate` → `mutation_prompt` in a loop, reverting on regression and saving each improvement to an optional `PromptCatalog`.
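Stripped to its essentials, that hill-climbing loop looks like the schematic below. This is an illustrative sketch only; the `evaluate`/`mutate` functions stand in for the tuner's methods and LLM calls, and the toy scoring rule is invented for the example:

``` python
# Schematic hill-climbing loop: propose a mutated prompt, keep it only
# if its score improves, otherwise revert to the previous best.

def hill_climb(prompt, evaluate, mutate, max_iterations=10):
    best_prompt, best_score = prompt, evaluate(prompt)
    for _ in range(max_iterations):
        candidate = mutate(best_prompt)  # in AIMU, an LLM proposes the variant
        score = evaluate(candidate)
        if score > best_score:           # keep improvements...
            best_prompt, best_score = candidate, score
        # ...otherwise revert (best_prompt is left unchanged)
    return best_prompt, best_score

# Toy example: "score" is just prompt length, capped at 30.
best, score = hill_climb(
    "short",
    evaluate=lambda p: min(len(p), 30),
    mutate=lambda p: p + "!",
)
```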
This project is licensed under the Apache 2.0 license.