I have an ASUS laptop with an NVIDIA dGPU and an AMD iGPU, so I built onnxruntime with CUDA, MIGraphX, TensorRT, and OpenVINO (OpenVINO can run on AMD CPUs, but refuses to generate any OpenCL code without detecting an Intel GPU).
Using the large 80M model:
OpenVINO and TensorRT can't handle it.
CUDA helpfully points out that a bunch of ops will stay on the CPU:
2026-03-20 16:55:12.724633751 [W:onnxruntime:, transformer_memcpy.cc:111 ApplyImpl] 501 Memcpy nodes are added to the graph main_graph for CUDAExecutionProvider. It might have negative impact on performance (including unable to run CUDA graph). Set session_options.log_severity_level=1 to see the detail logs before this message.
2026-03-20 16:55:12.724957123 [W:onnxruntime:, transformer_memcpy.cc:111 ApplyImpl] 4 Memcpy nodes are added to the graph sub_graph for CUDAExecutionProvider. It might have negative impact on performance (including unable to run CUDA graph). Set session_options.log_severity_level=1 to see the detail logs before this message.
2026-03-20 16:55:12.733764468 [W:onnxruntime:, session_state.cc:1359 VerifyEachNodeIsAssignedToAnEp] Some nodes were not assigned to the preferred execution providers which may or may not have an negative impact on performance. e.g. ORT explicitly assigns shape related ops to CPU to improve perf.
2026-03-20 16:55:12.733774166 [W:onnxruntime:, session_state.cc:1361 VerifyEachNodeIsAssignedToAnEp] Rerunning with verbose output on a non-minimal build will show node assignments.
2026-03-20 16:55:12.849917281 [W:onnxruntime:Default, scatter_nd.h:51 ScatterNDWithAtomicReduction] ScatterND with reduction=='none' only guarantees to be correct if indices are not duplicated.
2026-03-20 16:55:12.851029854 [W:onnxruntime:Default, scatter_nd.h:51 ScatterNDWithAtomicReduction] ScatterND with reduction=='none' only guarantees to be correct if indices are not duplicated.
2026-03-20 16:55:12.851043209 [W:onnxruntime:Default, scatter_nd.h:51 ScatterNDWithAtomicReduction] ScatterND with reduction=='none' only guarantees to be correct if indices are not duplicated.
I was lazy and just timed per chunk, 10 runs each (3 chunks total, 30 timings total; entries 0 and 3 are the same sentence, as are 1 and 4, 2 and 5, etc.).
Here are the CPU times (laptop AI 370 CPU):
[1.3289828340057284, 1.1938998060068116, 1.9802772200200707, 1.251285506063141, 1.1841227940749377, 1.6927821079734713, 1.2578138179378584, 1.0629193530185148, 1.650097442092374, 1.264636099920608, 1.0339840319938958, 1.6628400130430236, 1.3496463500196114, 1.0624994849786162, 1.732478471007198, 1.3355245280545205, 1.0741714680334553, 1.7283370570512488, 1.3293178300373256, 1.04622592497617, 1.7413946259766817, 1.4043363209348172, 1.0919002940645441, 1.8983300430700183, 1.254338926053606, 1.1229005639906973, 1.710633401060477, 1.2498087181011215, 1.092614170978777, 1.7522031349362805]
Here are the CUDA times (laptop 5070 Ti):
[1.4710771288955584, 0.9855507109314203, 1.4329129040706903, 1.1214255160884932, 0.9520677300170064, 1.3678763950010762, 1.0963406789815053, 1.022940052091144, 1.4194873250089586, 1.0670830560848117, 0.9428946709958836, 1.4637802629731596, 1.228167102090083, 0.963685033028014, 1.4485970759997144, 1.1118927469942719, 0.9723141189897433, 1.4803383040707558, 1.1505753980018198, 0.9792011349927634, 1.4607792579336092, 1.1768588400445879, 0.982806543004699, 1.4588850239524618, 1.1415482349693775, 0.9726746029919013, 1.4941535888938233, 1.1467962849419564, 0.963144963956438, 1.4559551189886406]
Here are the MIGraphX times, without caching (caching doesn't work in this case), on the laptop's 890M:
[2.8711149959126487, 2.2525830400409177, 3.338948813965544, 2.4883090369403362, 2.0587474269559607, 3.218932680087164, 2.4930616130586714, 2.0167018399806693, 3.278708435012959, 2.4618481990182772, 1.9911229209974408, 3.2683962740702555, 2.62071086501237, 2.009289737092331, 3.3030359429540113, 2.531531401909888, 2.0827651449944824, 3.416693067061715, 2.520578727009706, 2.0820888379821554, 3.3825352049898356, 2.590034666005522, 1.9985353520605713, 3.2973383460193872, 2.6240767521085218, 2.1150400809710845, 3.3563423779560253, 2.642783566028811, 2.0570333279902115, 3.340745543013327]
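Since entries i, i+3, i+6, ... all time the same chunk, each list can be averaged per chunk to make the providers easier to compare. A quick stdlib-only sketch using the CPU numbers above (the same slicing works for the CUDA and MIGraphX lists):

```python
from statistics import mean

# CPU timings from above: 10 runs over 3 chunks, flattened run-major
cpu_times = [
    1.3289828340057284, 1.1938998060068116, 1.9802772200200707,
    1.251285506063141, 1.1841227940749377, 1.6927821079734713,
    1.2578138179378584, 1.0629193530185148, 1.650097442092374,
    1.264636099920608, 1.0339840319938958, 1.6628400130430236,
    1.3496463500196114, 1.0624994849786162, 1.732478471007198,
    1.3355245280545205, 1.0741714680334553, 1.7283370570512488,
    1.3293178300373256, 1.04622592497617, 1.7413946259766817,
    1.4043363209348172, 1.0919002940645441, 1.8983300430700183,
    1.254338926053606, 1.1229005639906973, 1.710633401060477,
    1.2498087181011215, 1.092614170978777, 1.7522031349362805,
]

# chunk i gets entries i, i+3, i+6, ... (every 3rd element)
per_chunk = [mean(cpu_times[i::3]) for i in range(3)]
print([round(t, 3) for t in per_chunk])
```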
My attempt at compacting your stuff:
from misaki import en, espeak
import numpy as np
import phonemizer
import soundfile as sf
import onnxruntime as ort
import time
from preprocess import TextPreprocessor
import re
import os
# os.environ["ORT_MIGRAPHX_MODEL_CACHE_PATH"] = os.path.abspath("./cache")
clean_text = True
pad = "$"
punctuation = ';:,.!?¡¿—…"«»"" '
letters = "ABCDEFGHIJKLMNOPQRSTUVWXYZabcdefghijklmnopqrstuvwxyz"
letters_ipa = "ɑɐɒæɓʙβɔɕçɗɖðʤəɘɚɛɜɝɞɟʄɡɠɢʛɦɧħɥʜɨɪʝɭɬɫɮʟɱɯɰŋɳɲɴøɵɸθœɶʘɹɺɾɻʀʁɽʂʃʈʧʉʊʋⱱʌɣɤʍχʎʏʑʐʒʔʡʕʢǀǁǂǃˈˌːˑʼʴʰʱʲʷˠˤ˞↓↑→↗↘'̩'ᵻ"
symbols = [pad] + list(punctuation) + list(letters) + list(letters_ipa)
dicts = {symbol: i for i, symbol in enumerate(symbols)}
phonemizer = phonemizer.backend.EspeakBackend(
    language="en-us", preserve_punctuation=True, with_stress=True
)
preprocessor = TextPreprocessor(remove_punctuation=False)
voices = np.load("kittens/voices.npz")
# available_voices : ['Bella', 'Jasper', 'Luna', 'Bruno', 'Rosie', 'Hugo', 'Kiki', 'Leo']
voice = "Bruno"
voice_aliases = {
    "Bella": "expr-voice-2-f",
    "Jasper": "expr-voice-2-m",
    "Luna": "expr-voice-3-f",
    "Bruno": "expr-voice-3-m",
    "Rosie": "expr-voice-4-f",
    "Hugo": "expr-voice-4-m",
    "Kiki": "expr-voice-5-f",
    "Leo": "expr-voice-5-m",
}
# ['TensorrtExecutionProvider', 'CUDAExecutionProvider', 'MIGraphXExecutionProvider', 'OpenVINOExecutionProvider', 'CPUExecutionProvider']
kitten_session = ort.InferenceSession(
    "kittens/kitten_tts_mini_v0_8.onnx", providers=["MIGraphXExecutionProvider"]
)
# Make sure it doesn't end with .!? or make sure sentence isn't ''
text = """One day, a little girl named Lily found a needle in her room. She knew it was difficult to play with it because it was sharp. I saw 12 turtles and called 212-555-5432,"""
if clean_text:
    text = preprocessor(text)
chunks = []
sentences = re.split(r"[.!?]+", text)
for sentence in sentences:
    sentence = sentence.strip()
    chunks.append(sentence)
times = []
for _ in range(10):
    out_chunks = []
    for text_chunk in chunks:
        phonemes_list = phonemizer.phonemize([text_chunk])
        # split the phoneme string into word/punctuation pieces, space-separated
        phonemes = re.findall(r"\w+|[^\w\s]", phonemes_list[0])
        phonemes = " ".join(phonemes)
        tokens = []
        for char in phonemes:
            try:
                tokens.append(dicts[char])
            except KeyError:
                # skip characters missing from the symbol table
                pass
        tokens.insert(0, 0)
        tokens.append(10)
        tokens.append(0)
        input_ids = np.array([tokens], dtype=np.int64)
        # pick a style row based on chunk length, clamped to the table size
        ref_id = min(len(text_chunk), voices[voice_aliases[voice]].shape[0] - 1)
        ref_s = voices[voice_aliases[voice]][ref_id : ref_id + 1]
        onnx_inputs = {
            "input_ids": input_ids,
            "style": ref_s,
            "speed": np.array([1], dtype=np.float32),
        }
        # time only the inference call
        start = time.perf_counter()
        outputs = kitten_session.run(None, onnx_inputs)
        end = time.perf_counter()
        times.append(end - start)
        # trim trailing samples from the generated audio
        audio = outputs[0][..., :-5000]
        out_chunks.append(audio)
print(times)
final_audio = np.concatenate(out_chunks, axis=-1)
sf.write("kitten_test.wav", final_audio, 24000)
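On the "make sure sentence isn't ''" comment in the script: re.split(r"[.!?]+", ...) emits a trailing empty string whenever the text ends in sentence-final punctuation, so an empty chunk would reach the phonemizer. A small sketch of a splitter that guards against this (split_chunks is a hypothetical helper, not part of the script above):

```python
import re

def split_chunks(text):
    # split on runs of sentence-final punctuation; drop empty/whitespace-only pieces
    return [s.strip() for s in re.split(r"[.!?]+", text) if s.strip()]

# a text ending in "!" would otherwise yield a trailing '' chunk
print(split_chunks("She knew it was sharp. It was!"))
```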