llama_cpp_sdk is the first concrete backend package for the self-hosted
inference stack:
```text
external_runtime_transport
  -> self_hosted_inference_core
    -> llama_cpp_sdk
      -> req_llm (through published EndpointDescriptor values)
```
It owns the llama-server specifics that do not belong in the shared kernel:
- boot-spec normalization
- llama-server flag rendering
- readiness and health probes
- stop semantics for a spawned service
- backend manifest publication
- OpenAI-compatible endpoint descriptor production
It does not parse OpenAI payloads, token streams, or inference responses.
Those stay northbound in req_llm and the calling control plane.
The phase-1 proof fixture also serves /v1/chat/completions with both standard
JSON and SSE streaming responses so the published endpoint contract can be
exercised honestly by northbound clients.
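As an illustration, the SSE branch of that fixture can be consumed by filtering `data:` lines out of the stream body. This minimal sketch (module name hypothetical) assumes the standard OpenAI-style framing where the stream ends with a `data: [DONE]` sentinel:

```elixir
defmodule SseChunks do
  @moduledoc "Hypothetical helper: extract JSON payloads from an SSE body."

  # Keep only `data: ` lines and drop the terminal `[DONE]` sentinel;
  # each remaining string is one JSON chunk ready for decoding.
  def data_payloads(sse_body) do
    sse_body
    |> String.split("\n")
    |> Enum.filter(&String.starts_with?(&1, "data: "))
    |> Enum.map(&String.replace_prefix(&1, "data: ", ""))
    |> Enum.reject(&(&1 == "[DONE]"))
  end
end
```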
The first backend release is intentionally narrow and truthful:
- supported startup kind: `:spawned`
- supported execution surface: `:local_subprocess`
- non-local execution surfaces: rejected during boot-spec normalization
- published protocol: `:openai_chat_completions`
- northbound integration: `self_hosted_inference_core`
- `:ssh_exec` story: documented as a future additive path once remote
  model-path semantics, readiness reachability, and shutdown guarantees are
  verified
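The rejection of non-local execution surfaces can be pictured with a small sketch. The module and field names below are hypothetical illustrations of the behavior described above, not the package's actual internals:

```elixir
defmodule BootSpecSketch do
  @moduledoc "Hypothetical illustration of boot-spec surface validation."

  # :local_subprocess is the only surface the first release accepts;
  # anything else is rejected during normalization, before any spawn.
  def validate_surface(%{execution_surface: :local_subprocess} = spec),
    do: {:ok, spec}

  def validate_surface(%{execution_surface: other}),
    do: {:error, {:unsupported_execution_surface, other}}
end
```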
Add the package to your dependency list:
```elixir
def deps do
  [
    {:llama_cpp_sdk, "~> 0.1.0"}
  ]
end
```

llama_cpp_sdk depends on self_hosted_inference_core, which in turn depends
on external_runtime_transport.
Resolve a spawned endpoint through the shared kernel:
```elixir
alias LlamaCppSdk
alias SelfHostedInferenceCore.ConsumerManifest

consumer =
  ConsumerManifest.new!(
    consumer: :jido_integration_req_llm,
    accepted_runtime_kinds: [:service],
    accepted_management_modes: [:jido_managed],
    accepted_protocols: [:openai_chat_completions],
    required_capabilities: %{streaming?: true},
    optional_capabilities: %{tool_calling?: :unknown},
    constraints: %{startup_kind: :spawned},
    metadata: %{}
  )

{:ok, resolution} =
  LlamaCppSdk.resolve_endpoint(
    %{
      model: "/models/qwen3-14b-instruct.gguf",
      alias: "qwen3-14b-instruct",
      host: "127.0.0.1",
      port: 8080,
      ctx_size: 8_192,
      gpu_layers: :all,
      threads: 8,
      parallel: 2,
      flash_attn: :auto
    },
    consumer,
    owner_ref: "run-123",
    ttl_ms: 30_000
  )

resolution.endpoint.base_url
resolution.lease.lease_ref
```

The backend normalizes the boot spec, registers itself with
self_hosted_inference_core, and publishes an endpoint descriptor once the
service is actually ready.
That published descriptor is the northbound contract used by
jido_integration. The caller should execute requests against:

- `endpoint.base_url <> "/chat/completions"` for chat completions
- `endpoint.headers` for bearer auth or other published headers
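For illustration, assembling a request from a published descriptor might look like the sketch below. The descriptor values and the commented-out `Req.post!` call are assumptions for the example (any HTTP client works), not output of or a dependency of this package:

```elixir
# Example descriptor shape (values are illustrative, not real output).
endpoint = %{
  base_url: "http://127.0.0.1:8080/v1",
  headers: [{"authorization", "Bearer example-key"}]
}

# The caller joins the published base_url with the chat-completions path.
url = endpoint.base_url <> "/chat/completions"

body = %{
  model: "qwen3-14b-instruct",
  messages: [%{role: "user", content: "Say hello."}],
  stream: false
}

# With the Req HTTP client (an assumption, not part of this package's API):
# Req.post!(url, json: body, headers: endpoint.headers)
```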
The first release supports normalized fields for the installed
llama-server CLI surface:
`binary_path`, `launcher_args`, `model`, `alias`, `host`, `port`, `ctx_size`,
`gpu_layers`, `threads`, `threads_batch`, `parallel`, `flash_attn`,
`embeddings`, `api_key`, `api_key_file`, `api_prefix`, `timeout_seconds`,
`threads_http`, `pooling`, `environment`, and `extra_args`.
See guides/boot_spec.md for the full contract.
When api_key_file is provided, llama_cpp_sdk reads it to derive the
published authorization header for northbound clients.
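A minimal sketch of that derivation (module name hypothetical; the real package may normalize the key or header differently):

```elixir
defmodule AuthHeaderSketch do
  @moduledoc "Hypothetical sketch: api_key_file -> published bearer header."

  # Read the key file, trim surrounding whitespace and the trailing
  # newline, and emit an authorization header pair for the descriptor.
  def from_key_file(path) do
    key = path |> File.read!() |> String.trim()
    {"authorization", "Bearer " <> key}
  end
end
```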
Readiness is owned here, above the transport seam:
- launch the spawned process via external_runtime_transport
- probe TCP reachability on the requested host and port
- probe HTTP availability on `/health` or `/v1/models`
- publish the endpoint only after readiness succeeds
Health continues to poll after publication so the shared kernel can expose
healthy, degraded, or unavailable runtime truth.
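The TCP leg of that probe sequence could be sketched with Erlang's `:gen_tcp` (module name hypothetical; the package's actual probe logic may differ):

```elixir
defmodule TcpProbeSketch do
  @moduledoc "Hypothetical sketch of the TCP reachability probe."

  # Returns true when a TCP connection to host:port succeeds within
  # timeout_ms, closing the socket immediately after the check.
  def reachable?(host, port, timeout_ms \\ 1_000) do
    opts = [:binary, active: false]

    case :gen_tcp.connect(String.to_charlist(host), port, opts, timeout_ms) do
      {:ok, socket} ->
        :ok = :gen_tcp.close(socket)
        true

      {:error, _reason} ->
        false
    end
  end
end
```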
- guides/architecture.md
- guides/readiness_and_health.md
- guides/integration_with_self_hosted_inference_core.md
- examples/README.md
Run the normal quality checks from the repo root when your environment allows Mix to create its local coordination socket:
```shell
mix format --check-formatted
mix compile --warnings-as-errors
mix test
MIX_ENV=test mix credo --strict
MIX_ENV=dev mix dialyzer
mix docs
```

This repository is released under the MIT License. See LICENSE for the
canonical license text and CHANGELOG.md for release history.