
Change to only add in marker events when the other queue is not empty and remove event generation for kernel launches #1162

Merged
pvelesko merged 8 commits into main from hiplaunchslowdown on Mar 17, 2026
Conversation

@colleeneb
Contributor

@colleeneb colleeneb commented Mar 2, 2026

This PR follows from noticing a large slowdown in hipLaunchKernel times for the in-order queue code vs. the out-of-order queue code in a specific application that uses 21 streams and launches many kernels.

Profiling output from out-of-order run (older chipStar):

> iprof -b hip ./a.out 
                      Name |     Time | Time(%) | Calls |  Average |      Min |      Max |
  __hipUnregisterFatBinary | 100.67ms |  42.06% |     1 | 100.67ms | 100.67ms | 100.67ms |
           hipLaunchKernel |  52.57ms |  21.96% | 10000 |   5.26us |   3.21us |   3.27ms |
[...]

Output from in-order-queue chipStar:

> iprof -b hip ./a.out 

                      Name |     Time | Time(%) | Calls |  Average |      Min |      Max |
           hipLaunchKernel |    1.43s |  85.15% | 10000 | 143.09us |  34.16us |   3.15ms |
  __hipUnregisterFatBinary | 100.77ms |   6.00% |     1 | 100.77ms | 100.77ms | 100.77ms |
[...]

So the average hipLaunchKernel time went up by roughly 27x (5.26 us to 143.09 us).

The slowdown was tracked down to the fact that we add marker events to all the other queues (like here:

for (auto &q : ChipDevice_->getQueuesNoLock()) {
). This overhead is not very noticeable if there are few streams or if launches don't involve the NULL stream. In this application's case, however, there are many streams and many kernel launches on the NULL stream, hence a lot of implicit syncing. If we know a queue is empty or has already synced, we don't need to add a marker event, since there's nothing to wait on. This PR therefore adds a check, && !q->isEmptyQueue(), so the marker event is only added when the queue is not empty. The check itself calls zeCommandListHostSynchronize(ZeCmdListImm_, 0) to test whether the command list has finished, returning true if it has and false otherwise. We could replace this Level Zero call with a check of the bool IsEmptyQueue_ that we already track in the code (updated via IsEmptyQueue_.store(...) calls). However, there is currently a bug in LZ (argonne-lcf/AuroraBugTracking#124), so we need to wait for that to be resolved before switching to the bool from the API call.

Additionally, the launchKernel routine generates and tracks an event for the kernel launch (

std::shared_ptr<chipstar::Event> LaunchEvent =
). This should no longer be necessary for in-order queues, as we no longer track last events. One part of removing this event involved removing
if (std::shared_ptr<chipstar::ArgSpillBuffer> SpillBuf =
. If I read the git blame right, this was originally from @linehill. Is this still needed? If so, for what case? All the tests pass without it, and I'd love to remove the generated event unless we need it.

With these changes it is much better:

iprof -b hip ./a.out

                      Name |     Time | Time(%) | Calls |  Average |      Min |      Max |
           hipLaunchKernel | 476.67ms |  71.09% | 10000 |  47.67us |  12.67us |   2.62ms |
  __hipUnregisterFatBinary | 100.71ms |  15.02% |     1 | 100.71ms | 100.71ms | 100.71ms |
[...]

It is still slower than before, but switching from the LZ API call to the bool check will help further.

For reference, here's a reproducer that shows the slowdown:

#include <hip/hip_runtime.h>
#include <iostream>
#include <vector>

#define HIP_CHECK(expression) \
{ \
    hipError_t status = expression; \
    if (status != hipSuccess) { \
        std::cerr << "HIP error " << hipGetErrorString(status) << " at " << __FILE__ << ":" << __LINE__ << std::endl; \
        exit(1); \
    } \
}

// Kernel to fill array
__global__ void fillKernel(float* d_arr, int N) {
  for (int i = 0; i < N; i++)
    d_arr[i] = i;
}

int main(int argc, char* argv[]) {
  int N = 100;
  int num_streams = 20;
  int num_arrays = 100;
  size_t size = N * sizeof(float);
  std::vector<float> h_arr(N);
  float **d_arr = (float **) malloc(num_arrays * sizeof(float *));

  printf("running with N=%d and num_arrays=%d\n", N, num_arrays);

  HIP_CHECK(hipSetDevice(0));
  HIP_CHECK(hipDeviceSynchronize());

  for (int i = 0; i < num_arrays; i++) {
    HIP_CHECK(hipMalloc(&d_arr[i], size));
  }

  // The streams are never launched on; their mere existence triggers the
  // cross-queue marker-event overhead for NULL-stream work.
  hipStream_t *stream_array = (hipStream_t *) malloc(num_streams * sizeof(hipStream_t));
  for (int i = 0; i < num_streams; i++) {
    HIP_CHECK(hipStreamCreate(&stream_array[i]));
  }

  HIP_CHECK(hipMemcpy(h_arr.data(), d_arr[0], size, hipMemcpyDeviceToHost));
  HIP_CHECK(hipMemcpy(h_arr.data(), d_arr[1], size, hipMemcpyDeviceToHost));

  for (int iter = 0; iter < 100; iter++) {
    for (int i = 0; i < num_arrays; i++) {
      // Launch on the NULL stream, which implicitly syncs with all other streams.
      hipLaunchKernelGGL(fillKernel, 1, 1, 0, 0, d_arr[i], N);
    }

    HIP_CHECK(hipMemcpy(d_arr[0], h_arr.data(), size, hipMemcpyHostToDevice));
    HIP_CHECK(hipDeviceSynchronize());
  }

  for (int i = 0; i < num_streams; i++) {
    HIP_CHECK(hipStreamDestroy(stream_array[i]));
  }
  for (int i = 0; i < num_arrays; i++) {
    HIP_CHECK(hipFree(d_arr[i]));
  }
  free(stream_array);
  free(d_arr);
  return 0;
}

@linehill
Collaborator

linehill commented Mar 3, 2026

If I read the git blame right, this was originally from @linehill . Is this still needed? If so, what case was it?

The SpillBuf is part of kernel argument "spilling" - the detail can be found here.

All the tests are passing without it [SpillBuf?] ...

Hmm. There is a test for the argument spilling. I'd imagine the kernels in the tests would crash or read garbage if the SpillBuf is removed.

@colleeneb
Contributor Author

Thanks @linehill !

I'll take a look at the test you mentioned. When I said "all tests are passing" I meant the tests run via check.py:

> CHIP_BE=level0 make build_tests -j
> ../scripts/check.py ./ dgpu level0 --timeout 60 --num-tries=1 --num-threads=1 
[...]

The test you pointed out probably isn't part of that set, so I didn't see a failure; I will check it. The runs above pass with 100% tests passed, 0 tests failed out of 1147.

@colleeneb
Contributor Author

@linehill I confirm by hand-testing that TestLargeKernelArgLists runs and passes.

> hipcc spill.cpp 
> ./a.out 
PASSED

I also tried increasing the size of LargeStruct from 4 * 1024 - 16 to 127 * 1024 - 16, and this also passed. 128 * 1024 - 16 did fail with a segfault on kernel launch, but I see the same behavior with an older chipStar without this PR.

@colleeneb colleeneb force-pushed the hiplaunchslowdown branch from 86435c6 to 9d5099a Compare March 3, 2026 20:21
@linehill
Collaborator

linehill commented Mar 4, 2026

I don’t think demonstrations like this are a good metric for justifying the removal of SpillBuf. This is indicated by the lucky shot with the test run where the size of LargeStruct was increased to 127 * 1024 - 16. If you take a look here you’ll find that the size value* of an object that large doesn’t fit in 16 bits, meaning it probably gets clipped and ends up reporting a wrong argument size to the chipStar runtime. This should be caught by assertions here - maybe they are not triggered because chipStar and/or LLVM is not built in debug mode.

but I see this same behavior with an older chipStar without this PR.

Is the behavior same as in #1133?

*: 17 bits are needed to represent the value 127 * 1024 - 16.

@colleeneb
Contributor Author

I haven't seen any behavior like in #1133, but I'll build chipStar and LLVM with debug and see. (Indeed I was using the release build.)

The goal is not to remove SpillBuf but to remove the kernel launch event: with in-order LZ command lists, we should not need to store the event of a kernel launch (and with it we see the overhead in the application that led to this PR). As part of making that change, I removed adding an action to the launch event (the part of the kernel launch function that involved SpillBuf, since it used the launch event), and all the tests passed without it, including TestLargeKernelArgLists (with no changes to sizes, so just the original LargeStruct size). I will check the tests today with LLVM and chipStar built in debug mode to see what happens.

What else would be good to check? I have no problem leaving the launch event if needed; I'd just like to avoid it if possible.

@colleeneb colleeneb changed the title Change to only add in marker envets when the other queue is not empty and remove event generation for kernel launches Change to only add in marker events when the other queue is not empty and remove event generation for kernel launches Mar 4, 2026
@linehill
Collaborator

linehill commented Mar 5, 2026

The goal is not to remove SpillBuf but to remove the kernel launch event: with in-order LZ command lists, we should not need to store the event of a kernel launch (and with it we see the overhead in the application that led to this PR). As part of making that change, I removed adding an action to the launch event (the part of the kernel launch function that involved SpillBuf, since it used the launch event)

Got it - I misread and thought the SpillBuf allocation was going to be removed.

OK, I see it’s fine to remove the kernel launch event as long as there is a replacement for the event action that ensures the buffer is not released too early (before the kernel execution starts, or mid-execution). Removing the event action means the buffer is held alive only until the ExecItem gets destroyed, which happens shortly after the launchImpl() call - you might not necessarily see issues if the launched kernel completes before then.

... and all the tests passed without it, including TestLargeKernelArgLists (with no changes to sizes, so just the original LargeStruct size). I will check the tests today with LLVM and chipStar built with debug to see what happens.

Probably none of the kernels in the tests currently need spill buffers (i.e., ExecItem->getArgSpillBuffer() returns a null pointer) except the TestLargeKernelArgLists case, and the kernel in that one probably completes before the SpillBuf gets released (by luck).

@colleeneb
Contributor Author

Thanks for the explanation!

Since there seems to be no way in Level Zero to put a zeMemFree into an in-order command list (otherwise we could just make it the next task in the in-order list!), I think the options are:

  1. Use the kernel launch event, but only when there’s a spill buffer. This would be the simplest change.

  2. Instead of zeMemFree-ing the spill buffer when the exec item goes out of scope, keep track of the spill buffers and free them when there’s a sync/wait on the command list to which the kernel launch was submitted. This would not free the memory right away, which could be an issue if the buffers are large, but it would avoid the kernel launch event.

Item 2) is what I’d prefer (or at least what I’d try to implement if possible), since I want to avoid events unless needed. But for the sake of getting this PR in quickly, given the kernel launch performance regression relative to last year’s chipStar, I made the change as in 1), so we still have an event when there’s a spill buffer. We can consider 2) later if other people are OK with it.

(As a side point, after learning about counter-based events, I think they are the best way to handle many of our events now that we use in-order lists. Counter-based events can be reused immediately without resetting once they are set as dependencies, so there's no need to wait for all the tasks that depend on a specific event before resetting it. After this one, I will open a PR that changes some of the marker events to counter-based events, for review later this week.)

@colleeneb colleeneb force-pushed the hiplaunchslowdown branch from 1f8c21e to 3f3b9e9 Compare March 8, 2026 22:47
@linehill
Collaborator

Option (1) seems to me the best option to start with. My gut feeling is that the cases needing spill buffers are rare: they are kernels with parameter lists of 1024 bytes or more - how often do you see such kernels?

@colleeneb
Contributor Author

how often you see such kernels?
True, it is not common. Option (1) is what we'll go with.

Since the PR now has (1) implemented, could we merge this in? Or is something else needed? The only CI failing are for macOS but I think that's due to something other than what's in the code.

@linehill
Collaborator

how often you see such kernels?
True, it is not common. Option (1) is what we'll go with.

Since the PR now has (1) implemented, could we merge this in? Or is something else needed? The only CI failing are for macOS but I think that's due to something other than what's in the code.

The plan now sounds good to me. What comes to reviewing and merging is left to you and @pvelesko.

…lLaunch.md baseline

Add logTrace timing around addDependenciesQueueSync in launchImpl to
surface O(N) cross-queue dependency overhead.  Document baseline numbers
on Intel Arc A770 showing ~4x slowdown at 64 idle blocking streams.
@pvelesko pvelesko force-pushed the hiplaunchslowdown branch 3 times, most recently from 43e2ad7 to 01395e6 Compare March 12, 2026 07:44
@pvelesko
Collaborator

pvelesko commented Mar 12, 2026

  ┌──────────────┬───────────────┬────────────────┐
  │ Idle streams │ Baseline (ns) │ After fix (ns) │
  ├──────────────┼───────────────┼────────────────┤
  │ 0            │ ~328,000      │ ~46,000        │
  ├──────────────┼───────────────┼────────────────┤
  │ 1            │ ~367,000      │ ~54,000        │
  ├──────────────┼───────────────┼────────────────┤
  │ 8            │ ~482,000      │ ~46,000        │
  ├──────────────┼───────────────┼────────────────┤
  │ 32           │ ~805,000      │ ~46,000        │
  ├──────────────┼───────────────┼────────────────┤
  │ 64           │ ~1,320,000    │ ~46,000        │
  └──────────────┴───────────────┴────────────────┘

@colleeneb
Contributor Author

Thanks for the changes Paulius! Removing the barriers in finish() looks great indeed. I'll pull and test on Aurora soon.

One thing, though: at some point we may want a separate copy list; I thought it was only set equal to the command list due to a bug. So it would be good to leave a comment that if we use a different copy list in the future, it must be synced here too, or to add a guard around the copy-list sync, like if (copylist != commandlist) sync(copylist), so it doesn't get forgotten if a separate copy list is used again. The other thing is that the added test case is good but not really appropriate for the issue of kernel launch time itself (which is what this PR was trying to address), since it includes the synchronize time:

    auto t0 = std::chrono::steady_clock::now();
    nopKernel<<<1, 1, 0, nullptr>>>();
    CHECK(hipStreamSynchronize(nullptr));
    auto t1 = std::chrono::steady_clock::now();

So I think it's worth adding one just for kernel launch time too. We could add a flag to the current test, with slight adjustments, to also measure launch-only time. I can do it, but feel free to push it yourself if you prefer; let me know.

@pvelesko
Collaborator

pvelesko commented Mar 12, 2026 via email

@Kerilk
Contributor

Kerilk commented Mar 12, 2026

@pvelesko Please, when reworking contributions, be careful to rebase commits cleanly so that individual contributions are still captured, or at least use the Co-Authored-By: Name <mail> trailer to credit other contributors. This is important for contribution attribution, thanks.

@colleeneb
Contributor Author

I pushed the additional test and the copy-list guard.

However, a few of the tests are now failing on Aurora when they weren’t before. One issue is that the rebased commits removed the zeCommandListHostSynchronize(ZeCmdListImm_, 0) call in isEmptyQueue. As I mentioned in the PR description and had in the original commits, on Aurora we need the zeCommandListHostSynchronize call due to a runtime bug (argonne-lcf/AuroraBugTracking#124). If all the tests pass on other systems without it (and it looked that way from the CI), then I’d vote to add an AURORA_SYSTEM_WORKAROUND macro (or some other name) to guard which check is used, and I'll set it when I compile on Aurora. If that's OK, I’ll push it too. Let me know.

I haven’t had a chance to dig into the other fails but maybe they’re similarly Aurora-specific. I’ll update when I’ve dug into them more.

pvelesko and others added 4 commits March 13, 2026 08:14
Track queue emptiness with an atomic IsEmptyQueue_ bool: set false on
every submit operation, reset true in finish().  Guard CreateMarkerInQueue
calls in addDependenciesQueueSyncImpl with !q->isEmptyQueue() so marker
events (and their zeCommandListAppendBarrier calls) are skipped for queues
that have never received work.

Also restructure launchImpl to skip creating a LaunchEvent when no spill
buffer is used.  The event was only needed to anchor the spill buffer
lifetime; with no spill buffer there is nothing to track, and passing
nullptr as the signal event to zeCommandListAppendLaunchKernel is valid
and avoids event-pool pressure.

PendingCrossQueueDeps_ holds marker event shared_ptrs on the submitting
queue to prevent checkEvents() from recycling their ze_events before the
GPU wait operations that reference them have completed.

Co-Authored-By: Colleen Bertoni <bertoni@anl.gov>
…mandListHostSynchronize

ZeCmdListImmCopy_ == ZeCmdListImm_ (issue #1136 was never implemented),
so the existing two-barrier + event-sync sequence was a self-referential
no-op: a barrier on the copy CL signaling an event, then a barrier on
the compute CL waiting on that same event, then zeEventHostSynchronize,
then zeCommandListHostSynchronize on both CLs.

Replacing the whole sequence with a single zeCommandListHostSynchronize
on ZeCmdListImm_ is semantically correct and eliminates the
zeCommandListAppendBarrier call with a signal event.  On an in-order CL
the driver had to track that signal event across all in-order CLs on the
same physical engine, making finish() cost O(N queues).  With this patch
finish() is O(1) regardless of how many streams exist.

Also reset IsEmptyQueue_ to true and drain PendingCrossQueueDeps_ in
finish() now that it is the canonical "all GPU work done" point.
@colleeneb
Contributor Author

colleeneb commented Mar 15, 2026

After looking more, the other issues on Aurora are resolved if we use explicit scaling or remove the kernel stamp flag. It's odd that it was working fine previously; maybe the issue comes from specific patterns that changed with the newer commits. I opened a bug with Intel with a Level Zero reproducer.

Based on that, this PR is ready to go.

@pvelesko pvelesko merged commit 2a4572f into main Mar 17, 2026
17 of 18 checks passed
@pvelesko pvelesko deleted the hiplaunchslowdown branch March 17, 2026 17:06