A Python package containing easy-to-use tools for working with various language models and AI services. AIMU is designed for running models locally via Ollama, Hugging Face Transformers, or any OpenAI-compatible local serving framework, and for cloud models via native provider SDKs (OpenAI, Anthropic, Google Gemini).
- **Model Clients**: Support for multiple AI model providers, including:
  - Ollama (local models, native API)
  - Hugging Face Transformers (local models)
  - llama-cpp-python (local GGUF models, in-process, no external service required)
  - Anthropic Claude models via the native `anthropic` SDK (`AnthropicClient`), with native thinking support
  - Cloud and local servers via the `openai` SDK (`aimu[openai_compat]`):
    - OpenAI (`OpenAIClient`): GPT-4o, GPT-4.1, o3, o4-mini, and more
    - Google Gemini (`GeminiClient`): Gemini 2.0/2.5 via Google's OpenAI-compatible endpoint
    - LM Studio (`LMStudioOpenAIClient`)
    - Ollama OpenAI-compat endpoint (`OllamaOpenAIClient`)
    - HuggingFace Transformers Serve (`HFOpenAIClient`)
    - vLLM (`VLLMOpenAIClient`)
    - llama.cpp llama-server (`LlamaServerOpenAIClient`)
    - SGLang (`SGLangOpenAIClient`)
    - Any OpenAI-compatible server (`OpenAICompatClient`)
- **Thinking Models**: First-class support for extended reasoning models (e.g. DeepSeek-R1, Qwen3, GPT-OSS). Thinking is enabled automatically for supported models, with access to the reasoning traces.
- **Agentic Workflows**: Per Anthropic's taxonomy, AIMU separates agents (autonomous, tool-driven) from workflows (code-controlled). `SimpleAgent` and `SkillAgent` implement the agent side; `Chain`, `Router`, `Parallel`, and `EvaluatorOptimizer` implement Anthropic's four workflow patterns. All share a `Runner` base class with `run()`/`run_streamed()`. `AgenticModelClient` wraps a `SimpleAgent` behind the standard `ModelClient` interface, making agentic and single-turn clients interchangeable.
- **MCP Tools**: Model Context Protocol (MCP) client for enhancing AI capabilities. Provides a simple(r) interface for FastMCP 2.0.
- **Chat Conversation Storage/Management**: Chat conversation history management using TinyDB.
- **Memory Storage**: Two complementary persistent memory stores:
  - Semantic Memory (`SemanticMemoryStore`): Fact storage using ChromaDB vector embeddings. Store natural-language subject-predicate-object strings (e.g. `"Paul works at Google"`) and retrieve by semantic topic (e.g. `"employment"`, `"family life"`).
  - Document Memory (`DocumentStore`): Path-based document store mirroring Anthropic's Managed Agents Memory API. Supports `write`, `read`, `edit`, `delete`, and full-text `search` on named paths (e.g. `/preferences.md`).
- **Agent Skills**: Filesystem-discovered skill definitions that inject instructions and tools into agents automatically. Skills are YAML-fronted Markdown files discovered from project and user directories.
- **Prompt Storage/Management**: Versioned prompt catalog backed by SQLite (SQLAlchemy), plus a hill-climbing `PromptTuner` for automatic prompt optimization. Four concrete tuners are included: `ClassificationPromptTuner` (binary YES/NO), `MultiClassPromptTuner` (N-way), `ExtractionPromptTuner` (JSON field extraction), and `JudgedPromptTuner` (open-ended generation rated by a second LLM). Subclass `PromptTuner` to implement custom task types.
In addition to the AIMU package in the `aimu` directory, the AIMU code repository includes:

- Jupyter notebooks demonstrating key AIMU features.
- Example chat clients in the `web/` directory, built with Streamlit and Gradio, using the AIMU Model Client, MCP tools support, and chat conversation management.
- A full suite of Pytest tests.
The following Jupyter notebooks demonstrate key AIMU features:
| Notebook | Description |
|---|---|
| 01 - Model Client | Text generation, chat, streaming, and thinking models |
| 02 - MCP Tools | MCP tool integration with model clients |
| 03 - Prompts | Versioned prompt storage and hill-climbing tuning |
| 04 - Conversations | Persistent chat conversation management |
| 05 - Memory | Semantic fact storage and retrieval |
| 06 - Agents | SimpleAgent and AgenticModelClient |
| 07 - Agent Skills | Filesystem-discovered skill injection with SkillAgent |
| 08 - Agent Workflows | Chain, Router, Parallel, and EvaluatorOptimizer patterns |
For all features, run:

``` shell
pip install aimu[all]
```

Or install only what you need:

``` shell
pip install aimu[ollama]        # Ollama (local models, native API)
pip install aimu[hf]            # Hugging Face Transformers (local models)
pip install aimu[anthropic]     # Anthropic Claude models
pip install aimu[openai_compat] # OpenAI, Google Gemini, and OpenAI-compatible local servers
pip install aimu[llamacpp]      # Local GGUF models via llama-cpp-python (no external service)
```

For gated Hugging Face models, you'll need a Hugging Face Hub access token:

``` shell
hf auth login
```

Once you've cloned the repository, run the following command to install all model dependencies:

``` shell
pip install -e '.[all]'
```

Additionally, run the following command to install development (testing, linting) and notebook dependencies:

``` shell
pip install -e '.[dev,notebooks]'
```

Alternatively, if you have uv installed, you can get all model and development dependencies with:

``` shell
uv sync --all-extras
```

Using Pytest, tests can be run for a specific model client and/or model, using optional arguments:
``` shell
pytest tests/test_models.py --client=ollama --model=GPT_OSS_20B
```

``` python
from aimu.models import OllamaClient as ModelClient  # or HuggingFaceClient, or OpenAICompatClient

model_client = ModelClient(ModelClient.MODELS.QWEN_3_5_9B)
response = model_client.generate("What is the capital of France?", {"temperature": 0.7})
```

``` python
from aimu.models import OllamaClient as ModelClient

model_client = ModelClient(ModelClient.MODELS.QWEN_3_5_9B)
response = model_client.chat("What is the capital of France?")
print(model_client.messages)
```

Models with extended reasoning capabilities (e.g. DeepSeek-R1, Qwen3, GPT-OSS) are identified by the `THINKING_MODELS` list on each client. Thinking is enabled automatically when one of these models is selected.
After generation, the model's reasoning trace is available in `last_thinking`:

``` python
from aimu.models import OllamaClient as ModelClient

model_client = ModelClient(ModelClient.MODELS.DEEPSEEK_R1_8B)
response = model_client.generate("What is the capital of France?")
print(model_client.last_thinking)  # reasoning trace
print(response)  # final answer
```

During streamed generation via `generate_streamed()`, thinking tokens are yielded first, followed by the response tokens, as a single flat stream. For phase-separated streaming (thinking, tool calls, response), use `chat_streamed()` instead.
`chat_streamed()` yields `StreamChunk` objects. Each chunk carries its own type:

| `chunk.phase` | `chunk.content` type | Description |
|---|---|---|
| `StreamPhase.THINKING` | `str` | Reasoning token (thinking models only) |
| `StreamPhase.TOOL_CALLING` | `dict` (`{"name": str, "response": str}`) | Tool call and its result |
| `StreamPhase.GENERATING` | `str` | Final response token |
``` python
from aimu.models import OllamaClient as ModelClient, StreamPhase

model_client = ModelClient(ModelClient.MODELS.QWEN_3_5_9B)

last_phase = None
for chunk in model_client.chat_streamed("What is the capital of France?"):
    if last_phase != chunk.phase:
        print(f"--- {chunk.phase} ---")
        last_phase = chunk.phase
    print(chunk.content, end="", flush=True)
```

`OpenAIClient`, `AnthropicClient`, and `GeminiClient` connect to cloud APIs using each provider's native SDK or OpenAI-compatible endpoint. API keys are read from environment variables (or a `.env` file).
``` python
from aimu.models import OpenAIClient, OpenAIModel

client = OpenAIClient(OpenAIModel.GPT_4O_MINI)
response = client.chat("What is the capital of France?")
```

``` python
from aimu.models import AnthropicClient, AnthropicModel

client = AnthropicClient(AnthropicModel.CLAUDE_SONNET_4_6)
response = client.chat("What is the capital of France?")
```

``` python
from aimu.models import GeminiClient, GeminiModel

client = GeminiClient(GeminiModel.GEMINI_2_0_FLASH)
response = client.chat("What is the capital of France?")
```

All three support the full `ModelClient` API, including streaming, tool calling, and thinking models (`AnthropicModel.CLAUDE_SONNET_4_6`, `GeminiModel.GEMINI_2_5_PRO`).
Required environment variables:

- OpenAI: `OPENAI_API_KEY`
- Anthropic: `ANTHROPIC_API_KEY`
- Google Gemini: `GOOGLE_API_KEY`
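For example, these can be placed in a `.env` file at the project root (placeholder values shown; substitute your own keys):

``` shell
# .env -- placeholder values
OPENAI_API_KEY=sk-...
ANTHROPIC_API_KEY=sk-ant-...
GOOGLE_API_KEY=...
```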
Use any of the service-specific clients to connect to a local server that speaks the OpenAI REST API. Each client uses service-appropriate default URLs and model IDs:

``` python
from aimu.models import LMStudioOpenAIClient, LMStudioOpenAIModel

# Connects to http://localhost:1234/v1 by default
client = LMStudioOpenAIClient(LMStudioOpenAIModel.QWEN_3_8B)
response = client.chat("What is the capital of France?")
```

``` python
from aimu.models import OllamaOpenAIClient, OllamaOpenAIModel

# Connects to Ollama's OpenAI-compat endpoint at http://localhost:11434/v1
client = OllamaOpenAIClient(OllamaOpenAIModel.QWEN_3_8B)
response = client.chat("What is the capital of France?")
```

``` python
from aimu.models import LlamaServerOpenAIClient, LlamaServerOpenAIModel

# Connects to llama-server at http://localhost:8080/v1 by default
# Start with: llama-server -m /path/to/model.gguf --port 8080
client = LlamaServerOpenAIClient(LlamaServerOpenAIModel.QWEN_3_8B)
response = client.chat("What is the capital of France?")
```

``` python
from aimu.models import SGLangOpenAIClient, SGLangOpenAIModel

# Connects to SGLang at http://localhost:30000/v1 by default
# Start with: python -m sglang.launch_server --model-path Qwen/Qwen3-8B --port 30000
client = SGLangOpenAIClient(SGLangOpenAIModel.QWEN_3_8B)
response = client.chat("What is the capital of France?")
```

For a custom server or model not in the enum, use `OpenAICompatClient` directly:

``` python
from aimu.models import OpenAICompatClient
from aimu.models.openai_compat import OllamaOpenAIModel

client = OpenAICompatClient(OllamaOpenAIModel.QWEN_3_8B, base_url="http://myserver:8080/v1")
```

All OpenAI-compatible clients support the full `ModelClient` API. Streaming, tool calling, thinking models, and MCP tools work identically to the other clients.
`LlamaCppClient` runs GGUF models directly in-process; no Ollama, LM Studio, or other serving process is required. Pass the path to any GGUF file and a `LlamaCppModel` enum value that describes the model's capabilities:

``` python
from aimu.models.llamacpp import LlamaCppClient, LlamaCppModel

client = LlamaCppClient(LlamaCppModel.QWEN_3_4B, model_path="/path/to/qwen3-4b.Q4_K_M.gguf")
response = client.chat("What is the capital of France?")
```

GPU offloading is enabled by default (`n_gpu_layers=-1`). To run on CPU only, pass `n_gpu_layers=0`. The context window defaults to 4096 tokens; increase it with `n_ctx`:

``` python
client = LlamaCppClient(
    LlamaCppModel.QWEN_3_4B,
    model_path="/path/to/model.gguf",
    n_ctx=8192,
    n_gpu_layers=-1,  # offload all layers to GPU
)
```

All standard `ModelClient` features work: streaming, tool calling, thinking models, and MCP tools.
A full-featured chat UI with model/client selection, streaming, thinking model support, MCP tool calls, and conversation persistence:

``` shell
streamlit run web/streamlit_chatbot.py
```

A full-featured chat UI equivalent to the Streamlit example above:

``` shell
python web/gradio_chatbot.py
```

AIMU follows Anthropic's agent/workflow taxonomy. All runnable units share a `Runner` base class with `run()` and `run_streamed()`. Agents (`SimpleAgent`, `SkillAgent`) autonomously direct tool use; workflows (`Chain`, `Router`, `Parallel`, `EvaluatorOptimizer`) have code-controlled flow.
`SimpleAgent` wraps a `ModelClient` and runs a tool-calling loop until the model stops invoking tools.

``` python
from aimu.models.ollama import OllamaClient, OllamaModel
from aimu.tools import MCPClient
from aimu.agents import SimpleAgent

client = OllamaClient(OllamaModel.QWEN_3_8B)
client.mcp_client = MCPClient({"mcpServers": {"mytools": {"command": "python", "args": ["tools.py"]}}})

agent = SimpleAgent(client, name="assistant", max_iterations=10)
result = agent.run("Find all log files modified today and summarise the errors.")
```

Agents are configurable from a plain dict:

``` python
agent = SimpleAgent.from_config(
    {"name": "researcher", "system_message": "Use tools to answer.", "max_iterations": 8},
    client,
)
```

`run_streamed()` yields `AgentChunk` objects tagged with `agent_name`, `iteration`, and `StreamPhase`.
`Chain` sequences agents so the output of each step becomes the input to the next. Pass a single `ModelClient`; it is shared across steps, with `messages` cleared and `system_message` applied from each step's config before it runs:

``` python
from aimu.agents import Chain

chain = Chain.from_config(
    [
        {"name": "planner", "system_message": "Break the task into steps.", "max_iterations": 3},
        {"name": "executor", "system_message": "Execute each step using tools.", "max_iterations": 10},
        {"name": "formatter", "system_message": "Format the results clearly.", "max_iterations": 1},
    ],
    client,
)
result = chain.run("Research the top Python web frameworks.")
```

`run_streamed()` yields `ChainChunk` objects that extend `AgentChunk` with a step index.
`Router` classifies input with a routing agent and dispatches to the matching handler:

``` python
from aimu.agents import Router, SimpleAgent

routing_agent = SimpleAgent.from_config({"system_message": "Reply with only: weather, math, or general"}, client)
router = Router(
    routing_agent=routing_agent,
    handlers={"weather": weather_agent, "math": math_agent},
    fallback=general_agent,
)
result = router.run("What is the weather in Tokyo?")
```

Every `Runner` exposes a `messages` property that returns a `dict[str, list[dict]]` mapping each agent name to its message history snapshot. For workflows, the dict is merged recursively across all sub-agents, so you can inspect every exchange after a run:

``` python
router.run("What is the weather in Tokyo?")
router.messages
# {
#     "routing-agent": [{"role": "user", ...}, {"role": "assistant", "content": "weather"}],
#     "weather-specialist": [{"role": "user", ...}, {"role": "assistant", "content": "..."}],
#     "math-specialist": [],  # not dispatched: empty snapshot
# }
```

Snapshots are taken at the end of each `run()` / `run_streamed()` call, so they survive subsequent runs even when agents share a single `ModelClient`.
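Conceptually, the merged view is just a combination of per-agent dicts. The sketch below is purely illustrative (`merge_messages` and the agent names are hypothetical, not AIMU internals):

``` python
# Illustrative only: one way a workflow could merge per-agent message
# snapshots into a single {agent_name: [messages]} dict, as the
# `messages` property described above returns.

def merge_messages(*snapshots: dict) -> dict:
    """Merge {agent_name: [messages]} dicts into one combined view."""
    merged: dict = {}
    for snap in snapshots:
        merged.update(snap)
    return merged

router_snap = {"routing-agent": [{"role": "assistant", "content": "weather"}]}
handler_snaps = {
    "weather-specialist": [{"role": "assistant", "content": "Sunny in Tokyo."}],
    "math-specialist": [],  # not dispatched, so its snapshot is empty
}

combined = merge_messages(router_snap, handler_snaps)
```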
`Parallel` runs workers concurrently via `ThreadPoolExecutor` and optionally aggregates their outputs:

``` python
from aimu.agents import Parallel

parallel = Parallel(
    workers=[perspective_a_agent, perspective_b_agent, perspective_c_agent],
    aggregator=summarizer_agent,
)
result = parallel.run("Analyse the impact of remote work on productivity.")
```

`EvaluatorOptimizer` runs a generate → evaluate → revise loop until the evaluator emits a pass signal or `max_rounds` is reached:
``` python
from aimu.agents import EvaluatorOptimizer

eo = EvaluatorOptimizer(
    generator=writer_agent,
    evaluator=critic_agent,
    max_rounds=4,
    pass_keyword="PASS",
)
result = eo.run("Write a concise explanation of transformer attention.")
```
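The control flow of that loop is easy to picture in plain Python. The sketch below is illustrative only: the stub `generate`/`evaluate` functions stand in for real agents, and none of these names are AIMU APIs.

``` python
# Schematic generate -> evaluate -> revise loop, as run by an
# evaluator-optimizer workflow. Stub functions stand in for agents.

def run_evaluator_optimizer(generate, evaluate, task, max_rounds=4, pass_keyword="PASS"):
    draft = generate(task, feedback=None)
    for _ in range(max_rounds):
        verdict = evaluate(draft)
        if pass_keyword in verdict:               # evaluator signalled acceptance
            break
        draft = generate(task, feedback=verdict)  # revise using the critique
    return draft

# Toy "agents": the evaluator passes once the draft mentions "attention".
def generate(task, feedback):
    return "attention explained" if feedback else "first draft"

def evaluate(draft):
    return "PASS" if "attention" in draft else "FAIL: mention attention"

result = run_evaluator_optimizer(generate, evaluate, "explain attention")
```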
### AgenticModelClient
`AgenticModelClient` wraps a `SimpleAgent` behind the standard `ModelClient` interface. Use it anywhere a `ModelClient` is accepted — web UIs, conversation managers, etc. — to get the full agentic loop transparently:
``` python
from aimu.models.ollama import OllamaClient, OllamaModel
from aimu.tools import MCPClient
from aimu.agents import SimpleAgent, AgenticModelClient
inner = OllamaClient(OllamaModel.QWEN_3_8B)
inner.mcp_client = MCPClient({"mcpServers": {"mytools": {"command": "python", "args": ["tools.py"]}}})
# Single-turn client
client = inner
# Agentic client — same interface, loops until tools stop being called
client = AgenticModelClient(SimpleAgent(inner, max_iterations=10))
# Both work identically here:
response = client.chat("Find all log files modified today and summarise the errors.")
```

For workflow patterns (`Chain`, `Router`, `Parallel`, `EvaluatorOptimizer`), call `run()` / `run_streamed()` directly instead.
``` python
from aimu.tools import MCPClient

mcp_client = MCPClient({
    "mcpServers": {
        "mytools": {"command": "python", "args": ["tools.py"]},
    }
})

mcp_client.call_tool("mytool", {"input": "hello world!"})
```

``` python
from aimu.models import OllamaClient as ModelClient
from aimu.tools import MCPClient

mcp_client = MCPClient({
    "mcpServers": {
        "mytools": {"command": "python", "args": ["tools.py"]},
    }
})

model_client = ModelClient(ModelClient.MODELS.QWEN_3_5_9B)
model_client.mcp_client = mcp_client
model_client.chat("use my tool please")
```
``` python
from aimu.models import OllamaClient as ModelClient
from aimu.history import ConversationManager

chat_manager = ConversationManager("conversations.json", use_last_conversation=True)  # loads the last saved conversation

model_client = ModelClient(ModelClient.MODELS.QWEN_3_5_9B)
model_client.messages = chat_manager.messages
model_client.chat("What is the capital of France?")

chat_manager.update_conversation(model_client.messages)  # store the updated conversation
```
``` python
from aimu.memory import SemanticMemoryStore

store = SemanticMemoryStore(persist_path="./memory_store")
store.store("Paul works at Google")
store.store("Paul is married to Sarah")
store.store("Sarah is the sister of Emma")

store.search("work and employment")  # ["Paul works at Google", ...]
store.search("family relationships")  # ["Paul is married to Sarah", ...]
store.search("work and employment", max_distance=0.4)  # only close matches
```
``` python
from aimu.memory import DocumentStore

store = DocumentStore(persist_path="./doc_store")
store.write("/preferences.md", "Always use concise responses.")
store.write("/notes/meeting.md", "Discussed Q3 roadmap with team.")

store.read("/preferences.md")  # "Always use concise responses."
store.edit("/preferences.md", "concise", "detailed")  # in-place edit
store.list_paths()  # ["/notes/meeting.md", "/preferences.md"]
store.search_full_text("roadmap")  # [{"path": ..., "content": ...}]
store.delete("/notes/meeting.md")
```

Skills are `SKILL.md` files discovered from `.agents/skills/` or `.claude/skills/` directories (project-level overrides user-level). Use `SkillAgent` instead of `SimpleAgent` to enable skill support:
``` python
from aimu.agents import SkillAgent
from aimu.skills import SkillManager

# Auto-discover skills from default paths
agent = SkillAgent(client, name="assistant")

# Or point at specific directories
agent = SkillAgent(client, skill_manager=SkillManager(skill_dirs=["./skills"]))

# Or use from_config (skill_dirs optional; omit to auto-discover)
agent = SkillAgent.from_config({"name": "assistant", "skill_dirs": ["./skills"]}, client)

result = agent.run("Use the pdf-processing skill to extract pages from report.pdf")
```

Each skill directory contains a `SKILL.md` with YAML frontmatter (`name`, `description`) and optional `scripts/*.py` files that are registered as callable tools.
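A minimal `SKILL.md` might look like the following. This is an illustrative sketch: only the frontmatter keys `name` and `description` are stated above; the body text and layout are assumptions.

``` markdown
---
name: pdf-processing
description: Extract pages and text from PDF files.
---

# PDF Processing

Instructions injected into the agent when this skill is active.
Helper scripts in `scripts/` are registered as callable tools.
```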
``` python
from aimu.prompts import PromptCatalog, Prompt

with PromptCatalog("prompts.db") as catalog:
    prompt = Prompt(name="summarizer", prompt="Summarize the following: {content}", model_id="llama3.1:8b")
    catalog.store_prompt(prompt)  # version and created_at assigned automatically

    latest = catalog.retrieve_last("summarizer", "llama3.1:8b")
    print(f"v{latest.version}: {latest.prompt}")
```

Four concrete tuners ship out of the box, all sharing the same hill-climbing loop inherited from `PromptTuner`.
``` python
import pandas as pd

from aimu.models import OllamaClient as ModelClient
from aimu.prompts import ClassificationPromptTuner

model_client = ModelClient(ModelClient.MODELS.QWEN_3_5_9B)
tuner = ClassificationPromptTuner(model_client=model_client)

df = pd.DataFrame({
    "content": ["LLMs are transforming AI.", "The recipe calls for flour.", ...],
    "actual_class": [True, False, ...],
})

best_prompt = tuner.tune(
    df,
    initial_prompt="Is this about AI? Reply [YES] or [NO]. Content: {content}",
    max_iterations=10,
)
```
``` python
from aimu.prompts import MultiClassPromptTuner

tuner = MultiClassPromptTuner(model_client, classes=["positive", "negative", "neutral"])

df = pd.DataFrame({
    "content": ["Loved it!", "Terrible experience.", "It was okay."],
    "actual_class": ["positive", "negative", "neutral"],
})

best_prompt = tuner.tune(
    df,
    initial_prompt="Classify the sentiment as [positive], [negative], or [neutral]. Text: {content}",
)
```

Metrics include per-class precision, recall, F1, and macro F1.
``` python
from aimu.prompts import ExtractionPromptTuner

tuner = ExtractionPromptTuner(model_client, fields=["name", "company"])

df = pd.DataFrame({
    "content": ["Alice Smith works at Acme Corp.", "Bob Jones is at Initech."],
    "expected": [{"name": "Alice Smith", "company": "Acme Corp."}, {"name": "Bob Jones", "company": "Initech"}],
})

best_prompt = tuner.tune(
    df,
    initial_prompt='Extract "name" and "company" as JSON. Text: {content}',
)
```

Metrics include row-level accuracy and per-field match rates. The model may return raw JSON, a fenced block, or any JSON object substring; all are handled automatically.
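That kind of tolerance can be sketched in a few lines. The parser below is illustrative, not the tuner's actual code:

``` python
import json
import re

# Illustrative sketch: pull the first JSON object out of raw model output,
# whether it is bare JSON, wrapped in a code fence, or embedded in prose.
FENCE = "`" * 3  # a literal code-fence marker, built up to keep this snippet renderable

def extract_json(text: str):
    # Drop any code-fence markers, then grab the first {...} substring.
    cleaned = text.replace(FENCE + "json", " ").replace(FENCE, " ")
    match = re.search(r"\{.*\}", cleaned, re.DOTALL)
    if not match:
        return None
    try:
        return json.loads(match.group(0))
    except json.JSONDecodeError:
        return None

extract_json('{"name": "Alice"}')                         # bare JSON
extract_json('Sure! {"name": "Alice"} Hope that helps.')  # JSON inside prose
```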
`JudgedPromptTuner` uses a second model to score each output on a 1–10 scale. The primary optimisation target is mean judge score rather than binary accuracy.

``` python
from aimu.prompts import JudgedPromptTuner

tuner = JudgedPromptTuner(
    model_client=writer_client,
    judge_client=judge_client,
    criteria="Summaries should be under three sentences and mention the key conclusion.",
    pass_threshold=0.7,
)

df = pd.DataFrame({"content": [long_article_1, long_article_2, ...]})

best_prompt = tuner.tune(
    df,
    initial_prompt="Summarise this article: {content}",
    max_iterations=10,
)
```

Subclass `PromptTuner` and implement three methods:
``` python
from aimu.prompts import PromptTuner

class MyTuner(PromptTuner):
    def apply_prompt(self, prompt, data):
        # Run prompt on data; set data["_correct"] boolean column
        ...

    def evaluate(self, data) -> dict:
        # Return metrics dict; must include "accuracy" key (or override score())
        ...

    def mutation_prompt(self, current_prompt, items) -> str:
        # Return a prompt asking the LLM to improve current_prompt given failing items
        # LLM response should contain the improved prompt in <prompt>...</prompt> tags
        ...

    # Optional: override to optimise a different metric
    def score(self, metrics) -> float:
        return metrics["my_metric"]

    # Optional: override to parse mutations in a different format
    def extract_mutated_prompt(self, result) -> str:
        return result.strip()
```

The tuner calls `apply_prompt` → `evaluate` → `mutation_prompt` in a loop, reverting on regression and saving each improvement to an optional `PromptCatalog`.
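Stripped to its essentials, that hill-climbing loop looks like the schematic below. This is an illustrative sketch only; the `evaluate`/`mutate` functions stand in for the tuner's methods and LLM calls, and the toy scoring rule is invented for the example:

``` python
# Schematic hill-climbing loop: propose a mutated prompt, keep it only
# if its score improves, otherwise revert to the previous best.

def hill_climb(prompt, evaluate, mutate, max_iterations=10):
    best_prompt, best_score = prompt, evaluate(prompt)
    for _ in range(max_iterations):
        candidate = mutate(best_prompt)  # in AIMU, an LLM proposes the variant
        score = evaluate(candidate)
        if score > best_score:           # keep improvements...
            best_prompt, best_score = candidate, score
        # ...otherwise revert (best_prompt is left unchanged)
    return best_prompt, best_score

# Toy example: "score" is just prompt length, capped at 30.
best, score = hill_climb(
    "short",
    evaluate=lambda p: min(len(p), 30),
    mutate=lambda p: p + "!",
)
```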
This project is licensed under the Apache 2.0 license.