
Work in progress: the API is unstable.

triton-ng

Rust SDK for NVIDIA Triton Inference Server.

Provides two things:

  • A safe Rust API for writing custom Triton backends (compiled as .so and loaded by Triton)
  • A high-level async gRPC client for sending inference requests to a running Triton server

Crates

| Crate | Description |
| --- | --- |
| `triton-ng-sys` | Raw FFI bindings generated by `bindgen` from `tritonbackend.h` |
| `triton-ng` | Safe Rust wrapper over `triton-ng-sys` |
| `triton-ng-macros` | Proc-macros for `triton-ng` |
| `triton-ng-client` | High-level async gRPC client |
| `example/custom-backend` | Example custom backend (MNIST, proxies to an ONNX model) |
| `example/app` | Example client application |

Writing a custom backend

Implement the Backend trait and register it with declare_backend!:

use triton_ng::backend::Backend;
use triton_ng::{BackendHandle, DataType, Error, Response};

struct MyBackend;

impl Backend for MyBackend {
    fn initialize(_backend: &BackendHandle) -> Result<(), Error> {
        Ok(())
    }

    fn model_instance_execute(
        _model: triton_ng::Model,
        requests: &[triton_ng::Request],
    ) -> Result<(), Error> {
        for request in requests {
            let input = request.get_input("INPUT")?;
            let data = input.as_fp32_vec()?;

            // ... run inference on `data` to produce `result` ...
            let result = vec![0.0f32; 10];

            let mut response = Response::new(request)?;
            response
                .create_output("OUTPUT", DataType::Fp32, &[1, 10])?
                .write_fp32_vec(&result)?;
            response.send()?;
        }
        Ok(())
    }
}

triton_ng::declare_backend!(MyBackend);

Build as a cdylib:

# Cargo.toml
[lib]
crate-type = ["cdylib"]
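A fuller Cargo.toml sketch for a backend crate might look like the following (the package name and version are illustrative; only the `crate-type` line comes from above). Note that Triton's convention is to discover backend shared libraries named `libtriton_<backend>.so`, so the crate name should produce a matching library name.

```toml
[package]
name = "my-triton-backend"
version = "0.1.0"
edition = "2021"

[lib]
# Triton loads backends as shared libraries (.so)
crate-type = ["cdylib"]

[dependencies]
triton-ng = "0.1"
```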

Using the gRPC client

use triton_ng_client::{InferInput, InferOptions, TritonClient, TritonClientConfig};

#[tokio::main]
async fn main() -> anyhow::Result<()> {
    let client = TritonClient::new(TritonClientConfig::new("http://localhost:8001")).await?;

    let meta = client.model_metadata("my_model", None, None).await?;
    // Assumes a fully static shape (no -1 dynamic dimensions).
    let n: usize = meta.inputs[0].shape.iter().map(|&d| d as usize).product();

    let response = client
        .infer(
            "my_model",
            None,
            [InferInput::fp32("INPUT", meta.inputs[0].shape.clone(), vec![0.0f32; n])],
            ["OUTPUT"],
            InferOptions::default(),
        )
        .await?;

    println!("{:?}", response.outputs[0].data);
    Ok(())
}

TLS:

use triton_ng_client::{ClientTlsConfig, TritonClientConfig};

let config = TritonClientConfig::new("https://triton.example.com:8001")
    .with_tls(ClientTlsConfig::new()); // uses system roots
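For a server signed by a private CA, a sketch assuming `ClientTlsConfig` re-exports tonic's `tonic::transport::ClientTlsConfig` (the `Certificate` import and the certificate path are illustrative, not part of this crate's documented API):

```rust
use tonic::transport::Certificate;
use triton_ng_client::{ClientTlsConfig, TritonClientConfig};

// Load the CA bundle that signed the server certificate.
let pem = std::fs::read_to_string("certs/ca.pem")?;

let config = TritonClientConfig::new("https://triton.example.com:8001")
    .with_tls(
        ClientTlsConfig::new()
            .ca_certificate(Certificate::from_pem(pem))
            // Must match a SAN on the server certificate.
            .domain_name("triton.example.com"),
    );
```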

Getting started

Prerequisites

First run

git submodule update --init --recursive
make build           # compile custom backend → target/release/libtriton_custom_backend.so
make download-model  # download mnist_onnx + create model version dirs
make docker-env-up   # start Triton (mounts .so and models/)

Run the example app

cargo run --manifest-path=example/app/Cargo.toml --release

Triton must be running with both models in READY state.

Run integration tests

make tests           # cargo nextest run --workspace

Tests require a running Triton instance (make docker-env-up).
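To confirm the server is up before running tests, Triton's standard KServe v2 HTTP endpoints can be queried (port 8000 is Triton's default HTTP port; adjust if the compose file maps it differently, and substitute your model name for `mnist_onnx`):

```shell
# HTTP 200 means the server is ready to accept inference requests.
curl -sf http://localhost:8000/v2/health/ready && echo "server ready"
# Per-model readiness is exposed separately.
curl -sf http://localhost:8000/v2/models/mnist_onnx/ready && echo "model ready"
```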

Rebuild after backend changes

make build
make docker-env-down && make docker-env-up

Features

| Feature | Description |
| --- | --- |
| `cuda` | Enable GPU and pinned-memory allocation in `ResponseAllocator` |

Enable it in Cargo.toml:

triton-ng = { version = "0.1", features = ["cuda"] }

License

MIT
