> **WIP** — work in progress, API is unstable.

Rust SDK for NVIDIA Triton Inference Server.

Provides two things:

- A safe Rust API for writing custom Triton backends (compiled as a `.so` and loaded by Triton)
- A high-level async gRPC client for sending inference requests to a running Triton server
## Crates

| Crate | Description |
|---|---|
| `triton-ng-sys` | Raw FFI bindings generated by bindgen from `tritonbackend.h` |
| `triton-ng` | Safe Rust wrapper over `triton-ng-sys` |
| `triton-ng-macros` | Proc-macros for `triton-ng` |
| `triton-ng-client` | High-level async gRPC client |
| `example/custom-backend` | Example custom backend (MNIST, proxies to an ONNX model) |
| `example/app` | Example client application |
## Writing a backend

Implement the `Backend` trait and register it with `declare_backend!`:

```rust
use triton_ng::backend::Backend;
use triton_ng::{BackendHandle, DataType, Error, InferenceRequest, Response};

struct MyBackend;

impl Backend for MyBackend {
    fn initialize(backend: &BackendHandle) -> Result<(), Error> {
        Ok(())
    }

    fn model_instance_execute(
        model: triton_ng::Model,
        requests: &[triton_ng::Request],
    ) -> Result<(), Error> {
        for request in requests {
            let input = request.get_input("INPUT")?;
            let data = input.as_fp32_vec()?;
            // ... run inference ...
            let mut response = Response::new(request)?;
            response
                .create_output("OUTPUT", DataType::Fp32, &[1, 10])?
                .write_fp32_vec(&result)?;
            response.send()?;
        }
        Ok(())
    }
}

triton_ng::declare_backend!(MyBackend);
```

Build as a `cdylib`:

```toml
# Cargo.toml
[lib]
crate-type = ["cdylib"]
```
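For Triton to load the compiled backend, the model repository also needs a model whose config names it. Triton derives the backend name from the library filename (`libtriton_<name>.so`), so `libtriton_custom_backend.so` corresponds to `backend: "custom_backend"`. A minimal sketch — the model name and tensor shapes below are illustrative, not taken from this repo:

```protobuf
# models/my_model/config.pbtxt — illustrative sketch
name: "my_model"
backend: "custom_backend"
input [
  { name: "INPUT", data_type: TYPE_FP32, dims: [ 784 ] }
]
output [
  { name: "OUTPUT", data_type: TYPE_FP32, dims: [ 10 ] }
]
```

The model directory also needs at least one numeric version subdirectory (e.g. `models/my_model/1/`), which may be empty for a custom backend.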
## Client

```rust
use triton_ng_client::{InferInput, InferOptions, TritonClient, TritonClientConfig};

#[tokio::main]
async fn main() -> anyhow::Result<()> {
    let client = TritonClient::new(TritonClientConfig::new("http://localhost:8001")).await?;
    let meta = client.model_metadata("my_model", None, None).await?;
    let n: usize = meta.inputs[0].shape.iter().map(|&d| d as usize).product();
    let response = client
        .infer(
            "my_model",
            None,
            [InferInput::fp32("INPUT", meta.inputs[0].shape.clone(), vec![0.0f32; n])],
            ["OUTPUT"],
            InferOptions::default(),
        )
        .await?;
    println!("{:?}", response.outputs[0].data);
    Ok(())
}
```

TLS:
```rust
use triton_ng_client::{ClientTlsConfig, TritonClientConfig};

let config = TritonClientConfig::new("https://triton.example.com:8001")
    .with_tls(ClientTlsConfig::new()); // uses system roots
```

## Requirements

- Rust stable
- NVIDIA driver 570+ (580+ for Blackwell / RTX 50xx)
- NVIDIA Container Toolkit
- Docker
## Quick start

```shell
git submodule update --init --recursive
make build           # compile custom backend → target/release/libtriton_custom_backend.so
make download-model  # download mnist_onnx + create model version dirs
make docker-env-up   # start Triton (mounts the .so and models/)
```

Run the example client:

```shell
cargo run --manifest-path=example/app/Cargo.toml --release
```

Triton must be running with both models in READY state.
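To check that the models have reached READY, Triton's KServe-v2 health endpoints can be polled over HTTP. This assumes the Docker environment also exposes Triton's default HTTP port 8000 (only the gRPC port 8001 is referenced elsewhere in this README):

```shell
# Server-level readiness: returns HTTP 200 once Triton is up
curl -sf localhost:8000/v2/health/ready

# Per-model readiness (mnist_onnx is the model fetched by `make download-model`)
curl -sf localhost:8000/v2/models/mnist_onnx/ready
```

`curl -f` exits non-zero on a non-2xx status, so both commands are usable directly in scripts or CI wait loops.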
## Testing

```shell
make tests  # cargo nextest run --workspace
```

Tests require a running Triton instance (`make docker-env-up`). After changing the backend, rebuild and restart Triton:

```shell
make build
make docker-env-down && make docker-env-up
```

## Feature flags

| Feature | Description |
|---|---|
| `cuda` | Enable GPU and pinned memory allocation in `ResponseAllocator` |

```toml
triton-ng = { version = "0.1", features = ["cuda"] }
```

## License

MIT