
Not much benefit from GPU #114

@bitters1453

Description


I have an ASUS laptop with an NVIDIA dGPU and an AMD iGPU, so I built onnxruntime with CUDA, MIGraphX, TensorRT, and OpenVINO (OpenVINO can run on AMD CPUs, but it refuses to generate any OpenCL code without detecting an Intel GPU).

Using the large 80M model:

OpenVINO and TensorRT can't handle it.

CUDA helpfully points out that a bunch of stuff will remain on the CPU:

2026-03-20 16:55:12.724633751 [W:onnxruntime:, transformer_memcpy.cc:111 ApplyImpl] 501 Memcpy nodes are added to the graph main_graph for CUDAExecutionProvider. It might have negative impact on performance (including unable to run CUDA graph). Set session_options.log_severity_level=1 to see the detail logs before this message.
2026-03-20 16:55:12.724957123 [W:onnxruntime:, transformer_memcpy.cc:111 ApplyImpl] 4 Memcpy nodes are added to the graph sub_graph for CUDAExecutionProvider. It might have negative impact on performance (including unable to run CUDA graph). Set session_options.log_severity_level=1 to see the detail logs before this message.
2026-03-20 16:55:12.733764468 [W:onnxruntime:, session_state.cc:1359 VerifyEachNodeIsAssignedToAnEp] Some nodes were not assigned to the preferred execution providers which may or may not have an negative impact on performance. e.g. ORT explicitly assigns shape related ops to CPU to improve perf.
2026-03-20 16:55:12.733774166 [W:onnxruntime:, session_state.cc:1361 VerifyEachNodeIsAssignedToAnEp] Rerunning with verbose output on a non-minimal build will show node assignments.
2026-03-20 16:55:12.849917281 [W:onnxruntime:Default, scatter_nd.h:51 ScatterNDWithAtomicReduction] ScatterND with reduction=='none' only guarantees to be correct if indices are not duplicated.
2026-03-20 16:55:12.851029854 [W:onnxruntime:Default, scatter_nd.h:51 ScatterNDWithAtomicReduction] ScatterND with reduction=='none' only guarantees to be correct if indices are not duplicated.
2026-03-20 16:55:12.851043209 [W:onnxruntime:Default, scatter_nd.h:51 ScatterNDWithAtomicReduction] ScatterND with reduction=='none' only guarantees to be correct if indices are not duplicated.
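The warning itself suggests lowering the log severity to see which nodes end up on which execution provider. A minimal configuration sketch (same model path as the script below; requires a CUDA build of onnxruntime):

```python
import onnxruntime as ort

# Lower the log severity (1 = info, 0 = verbose) so ONNX Runtime
# reports node-to-execution-provider assignments before the
# Memcpy warning, as the message above suggests.
sess_options = ort.SessionOptions()
sess_options.log_severity_level = 1

session = ort.InferenceSession(
    "kittens/kitten_tts_mini_v0_8.onnx",
    sess_options=sess_options,
    providers=["CUDAExecutionProvider", "CPUExecutionProvider"],
)
```
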

I was lazy and just timed per chunk, 10 runs each (3 chunks total, 30 timings; indices 0 and 3 are the same sentence, as are 1 and 4, 2 and 5, and so on).
Here are the CPU times (laptop AI 370 CPU):
[1.3289828340057284, 1.1938998060068116, 1.9802772200200707, 1.251285506063141, 1.1841227940749377, 1.6927821079734713, 1.2578138179378584, 1.0629193530185148, 1.650097442092374, 1.264636099920608, 1.0339840319938958, 1.6628400130430236, 1.3496463500196114, 1.0624994849786162, 1.732478471007198, 1.3355245280545205, 1.0741714680334553, 1.7283370570512488, 1.3293178300373256, 1.04622592497617, 1.7413946259766817, 1.4043363209348172, 1.0919002940645441, 1.8983300430700183, 1.254338926053606, 1.1229005639906973, 1.710633401060477, 1.2498087181011215, 1.092614170978777, 1.7522031349362805]

Here are the CUDA times (laptop 5070 Ti):
[1.4710771288955584, 0.9855507109314203, 1.4329129040706903, 1.1214255160884932, 0.9520677300170064, 1.3678763950010762, 1.0963406789815053, 1.022940052091144, 1.4194873250089586, 1.0670830560848117, 0.9428946709958836, 1.4637802629731596, 1.228167102090083, 0.963685033028014, 1.4485970759997144, 1.1118927469942719, 0.9723141189897433, 1.4803383040707558, 1.1505753980018198, 0.9792011349927634, 1.4607792579336092, 1.1768588400445879, 0.982806543004699, 1.4588850239524618, 1.1415482349693775, 0.9726746029919013, 1.4941535888938233, 1.1467962849419564, 0.963144963956438, 1.4559551189886406]

Here are the MIGraphX times without caching (caching doesn't work in this case; laptop 890M):
[2.8711149959126487, 2.2525830400409177, 3.338948813965544, 2.4883090369403362, 2.0587474269559607, 3.218932680087164, 2.4930616130586714, 2.0167018399806693, 3.278708435012959, 2.4618481990182772, 1.9911229209974408, 3.2683962740702555, 2.62071086501237, 2.009289737092331, 3.3030359429540113, 2.531531401909888, 2.0827651449944824, 3.416693067061715, 2.520578727009706, 2.0820888379821554, 3.3825352049898356, 2.590034666005522, 1.9985353520605713, 3.2973383460193872, 2.6240767521085218, 2.1150400809710845, 3.3563423779560253, 2.642783566028811, 2.0570333279902115, 3.340745543013327]
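Since every third timing belongs to the same sentence, a quick way to compare backends is the mean per chunk position. A small illustrative helper (the lists above would be pasted in; the sample data here is made up):

```python
from statistics import mean

def per_chunk_means(times, n_chunks=3):
    """Group a flat list of timings by chunk position (i % n_chunks)
    and return the mean runtime for each position."""
    return [
        mean(times[i::n_chunks])  # every n_chunks-th sample is the same chunk
        for i in range(n_chunks)
    ]

# Illustrative data: two passes over three chunks.
sample = [1.0, 2.0, 3.0, 1.2, 2.2, 3.2]
print(per_chunk_means(sample))
```
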

My attempt at compacting your stuff:

from misaki import en, espeak
import numpy as np
import phonemizer
import soundfile as sf
import onnxruntime as ort
import time
from preprocess import TextPreprocessor
import re
import os

# os.environ["ORT_MIGRAPHX_MODEL_CACHE_PATH"] = os.path.abspath("./cache")

clean_text = True

pad = "$"
punctuation = ';:,.!?¡¿—…"«»"" '
letters = "ABCDEFGHIJKLMNOPQRSTUVWXYZabcdefghijklmnopqrstuvwxyz"
letters_ipa = "ɑɐɒæɓʙβɔɕçɗɖðʤəɘɚɛɜɝɞɟʄɡɠɢʛɦɧħɥʜɨɪʝɭɬɫɮʟɱɯɰŋɳɲɴøɵɸθœɶʘɹɺɾɻʀʁɽʂʃʈʧʉʊʋⱱʌɣɤʍχʎʏʑʐʒʔʡʕʢǀǁǂǃˈˌːˑʼʴʰʱʲʷˠˤ˞↓↑→↗↘'̩'ᵻ"
symbols = [pad] + list(punctuation) + list(letters) + list(letters_ipa)
dicts = {symbol: i for i, symbol in enumerate(symbols)}

phonemizer = phonemizer.backend.EspeakBackend(
    language="en-us", preserve_punctuation=True, with_stress=True
)

preprocessor = TextPreprocessor(remove_punctuation=False)

voices = np.load("kittens/voices.npz")
# available_voices : ['Bella', 'Jasper', 'Luna', 'Bruno', 'Rosie', 'Hugo', 'Kiki', 'Leo']
voice = "Bruno"

voice_aliases = {
    "Bella": "expr-voice-2-f",
    "Jasper": "expr-voice-2-m",
    "Luna": "expr-voice-3-f",
    "Bruno": "expr-voice-3-m",
    "Rosie": "expr-voice-4-f",
    "Hugo": "expr-voice-4-m",
    "Kiki": "expr-voice-5-f",
    "Leo": "expr-voice-5-m",
}

# ['TensorrtExecutionProvider', 'CUDAExecutionProvider', 'MIGraphXExecutionProvider', 'OpenVINOExecutionProvider', 'CPUExecutionProvider']
kitten_session = ort.InferenceSession(
    "kittens/kitten_tts_mini_v0_8.onnx", providers=["MIGraphXExecutionProvider"]
)
# Make sure it doesn't end with .!? or make sure sentence isn't ''
text = """One day, a little girl named Lily found a needle in her room. She knew it was difficult to play with it because it was sharp. I saw 12 turtles and called 212-555-5432,"""

if clean_text:
    text = preprocessor(text)

chunks = []
sentences = re.split(r"[.!?]+", text)
for sentence in sentences:
    sentence = sentence.strip()
    if sentence:  # skip empty chunks, e.g. from trailing .!?
        chunks.append(sentence)

times = []
for _ in range(10):
    out_chunks = []
    for text_chunk in chunks:
        phonemes_list = phonemizer.phonemize([text_chunk])
        phonemes = re.findall(r"\w+|[^\w\s]", phonemes_list[0])
        phonemes = " ".join(phonemes)
        tokens = []
        for char in phonemes:
            try:
                tokens.append(dicts[char])
            except KeyError:
                pass  # skip characters missing from the symbol table
        tokens.insert(0, 0)  # leading pad token
        tokens.append(10)  # index 10 in this symbol table ('…')
        tokens.append(0)  # trailing pad token
        input_ids = np.array([tokens], dtype=np.int64)
        ref_id = min(len(text_chunk), voices[voice_aliases[voice]].shape[0] - 1)
        ref_s = voices[voice_aliases[voice]][ref_id : ref_id + 1]
        onnx_inputs = {
            "input_ids": input_ids,
            "style": ref_s,
            "speed": np.array([1], dtype=np.float32),
        }
        start = time.perf_counter()
        outputs = kitten_session.run(None, onnx_inputs)
        end = time.perf_counter()
        times.append(end - start)
        audio = outputs[0][..., :-5000]  # drop the trailing 5000 samples (presumably silence)
        out_chunks.append(audio)
print(times)
final_audio = np.concatenate(out_chunks, axis=-1)
sf.write("kitten_test.wav", final_audio, 24000)
