A reliability layer for self-hosted LLM tool-calling. Forge takes an 8B model from ~38% to ~99% on multi-step agentic workflows through guardrails (rescue parsing, retry nudges, step enforcement) and context management (VRAM-aware budgets, tiered compaction).
Three ways to use it:

- WorkflowRunner — Define tools, pick a backend, run structured agent loops. Forge manages the full lifecycle: system prompts, tool execution, context compaction, and guardrails. SlotWorker adds priority-queued access to a shared inference slot with auto-preemption — for multi-agent architectures where specialist workflows share a GPU slot. Best when you're building on forge directly.
- Guardrails middleware — Use forge's reliability stack (composable middleware) inside your own orchestration loop. You control the loop; forge validates responses, rescues malformed tool calls, and enforces required steps.
- Proxy server — Drop-in OpenAI-compatible proxy (`python -m forge.proxy`) that sits between any client (opencode, Continue, aider, etc.) and a local model server. Applies guardrails transparently — the client thinks it's talking to a smarter model.

Supports Ollama, llama-server (llama.cpp), Llamafile, and Anthropic as backends.
- Python 3.12+
- A running LLM backend (see below)
```bash
pip install forge-guardrails                 # core only
pip install "forge-guardrails[anthropic]"    # + Anthropic client
```

For development:

```bash
git clone https://github.com/antoinezambelli/forge.git
cd forge
pip install -e ".[dev]"
```

Ollama (easiest):

```bash
# Install from https://ollama.com/download
ollama pull ministral-3:8b-instruct-2512-q4_K_M
```

llama-server (best performance):

```bash
# Install from https://github.com/ggml-org/llama.cpp/releases
llama-server -m path/to/Ministral-3-8B-Reasoning-2512-Q4_K_M.gguf --jinja -ngl 999 --port 8080
```

Anthropic (API, no local GPU needed):

```bash
pip install -e ".[anthropic]"
export ANTHROPIC_API_KEY=sk-...
```

See Backend Setup for full instructions and Model Guide for which model fits your hardware.
```python
import asyncio

from pydantic import BaseModel, Field

from forge import (
    Workflow, ToolDef, ToolSpec,
    WorkflowRunner, OllamaClient,
    ContextManager, TieredCompact,
)


def get_weather(city: str) -> str:
    return f"72°F and sunny in {city}"


class GetWeatherParams(BaseModel):
    city: str = Field(description="City name")


workflow = Workflow(
    name="weather",
    description="Look up weather for a city.",
    tools={
        "get_weather": ToolDef(
            spec=ToolSpec(
                name="get_weather",
                description="Get current weather",
                parameters=GetWeatherParams,
            ),
            callable=get_weather,
        ),
    },
    required_steps=[],
    terminal_tool="get_weather",
    system_prompt_template="You are a helpful assistant. Use the available tools to answer the user.",
)


async def main():
    client = OllamaClient(model="ministral-3:8b-instruct-2512-q4_K_M")
    ctx = ContextManager(strategy=TieredCompact(keep_recent=2), budget_tokens=8192)
    runner = WorkflowRunner(client=client, context_manager=ctx)
    await runner.run(workflow, "What's the weather in Paris?")


asyncio.run(main())
```

For multi-step workflows, multi-turn conversations, and backend auto-management, see the User Guide. If you're building a long-running session (CLI, chat server, voice assistant), see the long-running session advisory for important guidance on filtering transient messages.
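The TieredCompact strategy in the example keeps the most recent messages verbatim and folds older turns into a compact form once the token budget is exceeded. The function below is a toy illustration of that general idea; it is not forge's actual strategy code, and the 4-characters-per-token heuristic is an assumption for the sketch.

```python
def tiered_compact(messages: list[dict], keep_recent: int, budget_tokens: int) -> list[dict]:
    """Toy tiered compaction: if the rough token count exceeds the budget,
    fold everything but the last `keep_recent` messages into one summary stub.

    Illustrative only -- not forge's TieredCompact implementation.
    """
    def rough_tokens(msgs: list[dict]) -> int:
        # Crude heuristic: roughly 4 characters per token.
        return sum(len(m["content"]) for m in msgs) // 4

    if rough_tokens(messages) <= budget_tokens:
        return messages  # under budget: nothing to do
    old, recent = messages[:-keep_recent], messages[-keep_recent:]
    summary = {
        "role": "system",
        "content": f"[{len(old)} earlier messages compacted]",
    }
    return [summary] + recent
```

With `keep_recent=2`, a long history collapses to one summary stub plus the last two messages, which is the shape of behavior the quickstart's `TieredCompact(keep_recent=2)` configures.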
Drop-in replacement for a local model server. Point any OpenAI-compatible client at the proxy and get forge's guardrails for free.
```bash
# External mode — you manage llama-server, forge proxies it
python -m forge.proxy --backend-url http://localhost:8080 --port 8081

# Managed mode — forge starts llama-server and the proxy together
python -m forge.proxy --backend llamaserver --gguf path/to/model.gguf --port 8081
```

Then configure your client to use `http://localhost:8081/v1` as the API base URL.
Note: The proxy automatically injects a synthetic `respond` tool when tools are present in the request. The model calls `respond(message="...")` instead of producing bare text, keeping it in tool-calling mode where forge's full guardrail stack applies. The `respond` call is stripped from the outbound response — the client sees a normal text response (`finish_reason: "stop"`) and never knows the tool exists. This is essential for small local models (~8B), which cannot be trusted to choose correctly between text and tool calls, so steering every response through a tool call is necessary. See ADR-013 for the full analysis.
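The stripping step can be pictured as a small transform on the backend's response choice. The function below is a hypothetical sketch of that behavior, not the proxy's actual code; the OpenAI-style `message`/`tool_calls` shape is the assumption.

```python
import json


def strip_respond_call(choice: dict) -> dict:
    """If the model called the synthetic respond tool, rewrite the choice as a
    plain text completion. Hypothetical sketch of the proxy's behavior.
    """
    msg = choice.get("message", {})
    for call in msg.get("tool_calls", []):
        if call["function"]["name"] == "respond":
            args = json.loads(call["function"]["arguments"])
            return {
                "message": {"role": "assistant", "content": args["message"]},
                "finish_reason": "stop",  # client sees a normal text response
            }
    return choice  # real tool calls pass through untouched
```

Real tool calls are forwarded unchanged; only the synthetic `respond` call is rewritten into ordinary assistant text.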
| Backend | Best for | Native FC? |
|---|---|---|
| Ollama | Easiest setup, model management built-in | Yes |
| llama-server | Best performance, full control | Yes (with --jinja) |
| Llamafile | Single binary, zero dependencies | No (prompt-injected) |
| Anthropic | Frontier baseline, hybrid workflows | Yes |
See Backend Setup for installation and Model Guide for which model to pick.
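For backends without native function calling (Llamafile in the table above), tool schemas are serialized into the prompt and the model is asked to answer with a JSON tool call. The helper below is a toy sketch of that idea; it is illustrative only and is not forge's actual template API from prompts/templates.py.

```python
import json


def build_tool_prompt(tools: list[dict]) -> str:
    """Toy prompt-injected tool preamble. Illustrative only -- forge's real
    templates are more elaborate.
    """
    lines = [
        "You can call the following tools. "
        'Reply with a single JSON object: {"name": <tool>, "arguments": {...}}.'
    ]
    for tool in tools:
        # Embed each tool's name, description, and parameter schema in the prompt.
        lines.append(
            f"- {tool['name']}: {tool['description']} "
            f"parameters: {json.dumps(tool['parameters'])}"
        )
    return "\n".join(lines)
```

The resulting preamble is prepended to the system prompt, and the backend's plain-text reply is then parsed (and rescue-parsed, if needed) back into a tool call.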
```bash
python -m pytest tests/ -v --tb=short
python -m pytest tests/ --cov=forge --cov-report=term-missing
```

The eval harness runs 22 scenarios measuring how reliably a model + backend combo navigates multi-step tool-calling workflows. See Eval Guide for full CLI reference.
```bash
# Ollama
python -m tests.eval.eval_runner --backend ollama --model "ministral-3:8b-instruct-2512-q4_K_M" --runs 10 --stream --verbose

# Batch eval (JSONL output, automatic resume)
python -m tests.eval.batch_eval --config all --runs 50

# Reports (ASCII table, HTML dashboard, markdown views)
python -m tests.eval.report eval_results.jsonl
```

```
src/forge/
  __init__.py              # Public API exports
  errors.py                # ForgeError hierarchy
  server.py                # setup_backend(), ServerManager, BudgetMode
  core/
    messages.py            # Message, MessageRole, MessageType, MessageMeta
    workflow.py            # ToolSpec, ToolDef, ToolCall, TextResponse, Workflow
    inference.py           # run_inference() — shared front half (compact, fold, validate, retry)
    runner.py              # WorkflowRunner — the agentic loop
    slot_worker.py         # SlotWorker — priority-queued slot access
    steps.py               # StepTracker
  guardrails/
    nudge.py               # Nudge dataclass
    response_validator.py  # ResponseValidator, ValidationResult
    step_enforcer.py       # StepEnforcer, StepCheck
    error_tracker.py       # ErrorTracker
  clients/
    base.py                # ChunkType, StreamChunk, LLMClient protocol
    ollama.py              # OllamaClient (native FC)
    llamafile.py           # LlamafileClient (native FC or prompt-injected)
    anthropic.py           # AnthropicClient (frontier baseline)
  context/
    manager.py             # ContextManager, CompactEvent
    strategies.py          # CompactStrategy, NoCompact, TieredCompact, SlidingWindowCompact
    hardware.py            # HardwareProfile, detect_hardware()
  prompts/
    templates.py           # Tool prompt builders (prompt-injected path)
    nudges.py              # Retry and step-enforcement nudge templates
  tools/
    respond.py             # Synthetic respond tool (respond_tool(), respond_spec())
  proxy/
    proxy.py               # ProxyServer — programmatic start/stop API
    server.py              # Raw asyncio HTTP server, SSE streaming
    handler.py             # Request handler — bridge between HTTP and run_inference
    convert.py             # OpenAI messages ↔ forge Messages conversion
tests/
  unit/                    # 638 deterministic tests — no LLM backend required
  eval/                    # Eval harness — model qualification against real backends
```
- User Guide — Usage patterns, multi-turn, context management, guardrails, slot worker, long-running session advisory
- Model Guide — Which model and backend for your hardware
- Backend Setup — Backend installation and server setup
- Eval Guide — Eval harness CLI reference, batch eval
- Architecture — Full design document
- Workflow Internals — Workflow design and runner internals
- Contributing — How to set up, test, and add new backends or scenarios
MIT — Copyright (c) 2025-2026 Antoine Zambelli