
Change to only add in marker events when the other queue is not empty and remove event generation for kernel launches #1162

Merged
pvelesko merged 8 commits into main from hiplaunchslowdown on Mar 17, 2026
Conversation

@colleeneb
Contributor

@colleeneb colleeneb commented Mar 2, 2026

This PR follows from noticing a large slowdown in hipLaunchKernel times for the in-order queue code vs. the out-of-order queue code in a specific application that uses 21 streams and launches many kernels.

Profiling output from out-of-order run (older chipStar):

> iprof -b hip ./a.out 
                      Name |     Time | Time(%) | Calls |  Average |      Min |      Max |
  __hipUnregisterFatBinary | 100.67ms |  42.06% |     1 | 100.67ms | 100.67ms | 100.67ms |
           hipLaunchKernel |  52.57ms |  21.96% | 10000 |   5.26us |   3.21us |   3.27ms |
[...]

Output from in-order-queue chipStar:

> iprof -b hip ./a.out 

                      Name |     Time | Time(%) | Calls |  Average |      Min |      Max |
           hipLaunchKernel |    1.43s |  85.15% | 10000 | 143.09us |  34.16us |   3.15ms |
  __hipUnregisterFatBinary | 100.77ms |   6.00% |     1 | 100.77ms | 100.77ms | 100.77ms |
[...]

So the average hipLaunchKernel time went up by roughly 27x (5.26 us to 143.09 us).

The slowdown was tracked down to the fact that we add marker events to all the other queues (like here:

for (auto &q : ChipDevice_->getQueuesNoLock()) {
). This overhead is not very noticeable if there are few streams or if launches don't involve the NULL stream. In this application's case, however, there are many streams and many kernel launches on the NULL stream, hence a lot of implicit syncing. If we know a queue is empty or has already synced, we don't need to add a marker event, since there's nothing to wait on. This PR therefore adds a check, && !q->isEmptyQueue(), so the marker event is only added when the queue is not empty. The check itself calls zeCommandListHostSynchronize(ZeCmdListImm_, 0) to test whether the command list has finished, returning true if it has and false otherwise. We could replace this Level Zero call with a check of the bool IsEmptyQueue_ that we already track in the code (updated via IsEmptyQueue_.store(...) calls). However, there is currently a bug in LZ (argonne-lcf/AuroraBugTracking#124), so we need to wait for that to be resolved before switching to the bool from the API call.

Additionally, the launchKernel routine generates and tracks an event for the kernel launch (

std::shared_ptr<chipstar::Event> LaunchEvent =
). This should no longer be necessary for in-order queues, as we no longer track last events. One part of removing this event involved removing
if (std::shared_ptr<chipstar::ArgSpillBuffer> SpillBuf =
. If I read the git blame right, this was originally from @linehill. Is this still needed? If so, for what case? All the tests pass without it, and I'd love to remove the generated event unless we need it.

With these changes it is much better:

iprof -b hip ./a.out

                      Name |     Time | Time(%) | Calls |  Average |      Min |      Max |
           hipLaunchKernel | 476.67ms |  71.09% | 10000 |  47.67us |  12.67us |   2.62ms |
  __hipUnregisterFatBinary | 100.71ms |  15.02% |     1 | 100.71ms | 100.71ms | 100.71ms |
[...]

It is still slower than before, but switching from the LZ API call to the bool check will help further.

For reference, here's a reproducer that shows the slowdown:

#include <hip/hip_runtime.h>
#include <iostream>
#include <vector>

#define HIP_CHECK(expression) \
{ \
    hipError_t status = expression; \
    if (status != hipSuccess) { \
        std::cerr << "HIP error " << hipGetErrorString(status) << " at " << __FILE__ << ":" << __LINE__ << std::endl; \
        exit(1); \
    } \
}

// Kernel to fill array
__global__ void fillKernel(float* d_arr, int N) {
  for (int i = 0; i < N; i++)
    d_arr[i] = i;
}

int main(int argc, char* argv[]) {
  int N = 100;
  int num_streams = 20;
  int num_arrays = 100;
  size_t size = N * sizeof(float);
  std::vector<float> h_arr(N);
  float **d_arr = (float **) malloc(num_arrays * sizeof(float *));

  printf("running with N=%d and num_arrays=%d\n", N, num_arrays);

  HIP_CHECK(hipSetDevice(0));
  HIP_CHECK(hipDeviceSynchronize());

  for (int i = 0; i < num_arrays; i++) {
    HIP_CHECK(hipMalloc(&d_arr[i], size));
  }

  // The streams are never launched on; their mere existence triggers the
  // cross-queue marker-event overhead for NULL-stream work.
  hipStream_t *stream_array = (hipStream_t *) malloc(num_streams * sizeof(hipStream_t));
  for (int i = 0; i < num_streams; i++) {
    HIP_CHECK(hipStreamCreate(&stream_array[i]));
  }

  HIP_CHECK(hipMemcpy(h_arr.data(), d_arr[0], size, hipMemcpyDeviceToHost));
  HIP_CHECK(hipMemcpy(h_arr.data(), d_arr[1], size, hipMemcpyDeviceToHost));

  for (int iter = 0; iter < 100; iter++) {
    for (int i = 0; i < num_arrays; i++) {
      // Launch on the NULL stream, which implicitly syncs with all other streams.
      hipLaunchKernelGGL(fillKernel, 1, 1, 0, 0, d_arr[i], N);
    }

    HIP_CHECK(hipMemcpy(d_arr[0], h_arr.data(), size, hipMemcpyHostToDevice));
    HIP_CHECK(hipDeviceSynchronize());
  }

  for (int i = 0; i < num_streams; i++) {
    HIP_CHECK(hipStreamDestroy(stream_array[i]));
  }
  for (int i = 0; i < num_arrays; i++) {
    HIP_CHECK(hipFree(d_arr[i]));
  }
  free(stream_array);
  free(d_arr);
  return 0;
}

@linehill
Collaborator

linehill commented Mar 3, 2026

If I read the git blame right, this was originally from @linehill . Is this still needed? If so, what case was it?

The SpillBuf is part of kernel argument "spilling" - the detail can be found here.

All the tests are passing without it [SpillBuf?] ...

Hmm. There is a test for the argument spilling. I'd imagine the kernels in the tests would crash or read garbage if the SpillBuf is removed.

@colleeneb
Contributor Author

Thanks @linehill !

I'll take a look at the test you mentioned. When I said "all tests are passing" I meant the tests run via check.py:

> CHIP_BE=level0 make build_tests -j
> ../scripts/check.py ./ dgpu level0 --timeout 60 --num-tries=1 --num-threads=1 
[...]

The test you pointed out probably isn't part of that set, so I didn't see a failure; I will check it. The runs above pass with 100% tests passed, 0 tests failed out of 1147.

@colleeneb
Contributor Author

@linehill I confirm by hand-testing that TestLargeKernelArgLists runs and passes.

> hipcc spill.cpp 
> ./a.out 
PASSED

I also tried increasing the size of LargeStruct from 4 * 1024 - 16 to 127 * 1024 - 16, and this also passed. 128 * 1024 - 16 did fail with a segfault on kernel launch, but I see the same behavior with an older chipStar without this PR.

@colleeneb colleeneb force-pushed the hiplaunchslowdown branch from 86435c6 to 9d5099a Compare March 3, 2026 20:21
@linehill
Collaborator

linehill commented Mar 4, 2026

I don’t think demonstrations like this are a good metric for justifying the removal of SpillBuf. This is indicated by the lucky shot with the test run where the size of LargeStruct was increased to 127 * 1024 - 16. If you take a look here you’ll find that the size value* of an object that large doesn’t fit in 16 bits, meaning it probably gets clipped and ends up reporting a wrong argument size to the chipStar runtime. This should be caught by assertions here - maybe they are not triggered because chipStar and/or LLVM is not built in debug mode.

but I see this same behavior with an older chipStar without this PR.

Is the behavior same as in #1133?

*: 17 bits are needed to represent the value 127 * 1024 - 16.

@colleeneb
Contributor Author

I haven't seen any behavior like in #1133, but I'll build chipStar and LLVM with debug and see. (Indeed I was using the release build.)

The goal is not to remove SpillBuf but to remove the kernel launch event: with in-order LZ command lists, we should not need to store the event of a kernel launch (and with it we see the overhead in the application that led to this PR). As part of making that change, I removed adding an action to the launch event (the part of the kernel launch function that involved SpillBuf, since it used the launch event), and all the tests passed without it, including TestLargeKernelArgLists (with no changes to sizes, so just the original LargeStruct size). I will check the tests today with LLVM and chipStar built in debug mode to see what happens.

What else would be good to check? I have no problem leaving the launch event if needed; I'd just like to avoid it if possible.

@colleeneb colleeneb changed the title Change to only add in marker envets when the other queue is not empty and remove event generation for kernel launches Change to only add in marker events when the other queue is not empty and remove event generation for kernel launches Mar 4, 2026
@linehill
Collaborator

linehill commented Mar 5, 2026

The goal is not to remove SpillBuf but to remove the kernel launch event: with in-order LZ command lists, we should not need to store the event of a kernel launch (and with it we see the overhead in the application that led to this PR). As part of making that change, I removed adding an action to the launch event (the part of the kernel launch function that involved SpillBuf, since it used the launch event)

Got it - I misread and thought the SpillBuf allocation was going to be removed.

OK, I see it’s fine to remove the kernel launch event as long as there is a replacement for the event action that ensures the buffer is not released too early (before the kernel execution starts, or mid-execution). Removing the event action means the buffer is held alive only until the ExecItem gets destroyed, which happens shortly after the launchImpl() call - you might not necessarily see issues if the launched kernel completes before then.

... and all the tests passed without it, including TestLargeKernelArgLists (with no changes to sizes, so just the original LargeStruct size). I will check the tests today with LLVM and chipStar built with debug to see what happens.

Probably none of the kernels in the tests currently need spill buffers (i.e., ExecItem->getArgSpillBuffer() returns a null pointer) except the TestLargeKernelArgLists case, and the kernel in that one probably completes before the SpillBuf gets released (by luck).

@colleeneb
Contributor Author

Thanks for the explanation!

Since there seems to be no way in Level Zero to put a zeMemFree into an in-order command list (otherwise we could just make it the next task in the in-order list!), I think the options are:

  1. Use the kernel launch event, but only when there’s a spill buffer. This would be the simplest change.

  2. Instead of zeMemFree-ing the spill buffer when the exec item goes out of scope, keep track of the spill buffers and free them when there’s a sync/wait on the command list to which the kernel launch was submitted. This would not free the memory right away, which could be an issue if the buffers are large, but it would avoid the kernel launch event.

Item 2) is what I’d prefer (or at least what I’d try to implement if possible), since I want to avoid events unless needed. But for the sake of getting this PR in quickly, given the kernel launch performance regression relative to last year’s chipStar, I made the change as in 1), so we still have an event when there’s a spill buffer. We can consider 2) later if other people are OK with it.

(As a side point, after learning about counter-based events, I think they are the best way to handle many of our events now that we use in-order lists. Counter-based events can be reused immediately without resetting once they are set as dependencies, so there's no need to wait for all the tasks that depend on a specific event before resetting it. After this one, I will open a PR that changes some of the marker events to counter-based events, for review later this week.)

@colleeneb colleeneb force-pushed the hiplaunchslowdown branch from 1f8c21e to 3f3b9e9 Compare March 8, 2026 22:47
@linehill
Collaborator

Option (1) seems to me the best option to start with. My gut feeling is that the cases needing spill buffers are rare: they are kernels with parameter lists of 1024 bytes or more - how often do you see such kernels?

@colleeneb
Contributor Author

how often you see such kernels?
True, it is not common. Option (1) is what we'll go with.

Since the PR now has (1) implemented, could we merge this in? Or is something else needed? The only CI failing are for macOS but I think that's due to something other than what's in the code.

@linehill
Collaborator

how often you see such kernels?
True, it is not common. Option (1) is what we'll go with.

Since the PR now has (1) implemented, could we merge this in? Or is something else needed? The only CI failing are for macOS but I think that's due to something other than what's in the code.

The plan now sounds good to me. What comes to reviewing and merging is left to you and @pvelesko.

…lLaunch.md baseline

Add logTrace timing around addDependenciesQueueSync in launchImpl to
surface O(N) cross-queue dependency overhead.  Document baseline numbers
on Intel Arc A770 showing ~4x slowdown at 64 idle blocking streams.
@pvelesko pvelesko force-pushed the hiplaunchslowdown branch 3 times, most recently from 43e2ad7 to 01395e6 Compare March 12, 2026 07:44
@pvelesko
Collaborator

pvelesko commented Mar 12, 2026

  ┌──────────────┬───────────────┬────────────────┐
  │ Idle streams │ Baseline (ns) │ After fix (ns) │
  ├──────────────┼───────────────┼────────────────┤
  │ 0            │ ~328,000      │ ~46,000        │
  ├──────────────┼───────────────┼────────────────┤
  │ 1            │ ~367,000      │ ~54,000        │
  ├──────────────┼───────────────┼────────────────┤
  │ 8            │ ~482,000      │ ~46,000        │
  ├──────────────┼───────────────┼────────────────┤
  │ 32           │ ~805,000      │ ~46,000        │
  ├──────────────┼───────────────┼────────────────┤
  │ 64           │ ~1,320,000    │ ~46,000        │
  └──────────────┴───────────────┴────────────────┘

@colleeneb
Contributor Author

Thanks for the changes Paulius! Removing the barriers in finish() looks great indeed. I'll pull and test on Aurora soon.

One thing, though: at some point we may want a separate copy list; I thought it was only set equal to the command list due to a bug. So it would be good to leave a comment that if we use a different copy list in the future, it must be synced here too, or to add a guard around the copy-list sync, like if (copylist != commandlist) sync(copylist), so it doesn't get forgotten if a separate copy list is used again. The other thing is that the added test case is good but not really appropriate for the issue of kernel launch time itself (which is what this PR was trying to address), since it includes the synchronize time:

    auto t0 = std::chrono::steady_clock::now();
    nopKernel<<<1, 1, 0, nullptr>>>();
    CHECK(hipStreamSynchronize(nullptr));
    auto t1 = std::chrono::steady_clock::now();

So I think it's worth adding one just for kernel launch time too. We could add a flag to the current test, with slight adjustments, to also measure launch-only time. I can do it, but feel free to push it yourself if you prefer; let me know.

@pvelesko
Collaborator

pvelesko commented Mar 12, 2026 via email

@Kerilk
Contributor

Kerilk commented Mar 12, 2026

@pvelesko Please, when reworking contributions, be careful to rebase commits cleanly so that individual contributions are still captured, or at least use the Co-Authored-By: Name <mail> trailer to credit other contributors. This is important for contribution attribution, thanks.

@colleeneb
Contributor Author

I pushed the additional test and the copy-list guard.

However, a few of the tests are now failing on Aurora when they weren’t before. One issue is that the rebased commits removed the zeCommandListHostSynchronize(ZeCmdListImm_, 0) call in isEmptyQueue. As I mentioned in the PR description and had in the original commits, on Aurora we need the zeCommandListHostSynchronize call due to a runtime bug (argonne-lcf/AuroraBugTracking#124). If all the tests pass on other systems without it (and it looked that way from the CI), then I’d vote to add an AURORA_SYSTEM_WORKAROUND macro (or some other name) to guard which check is used, and I'll set it when I compile on Aurora. If that's OK, I’ll push it too. Let me know.

I haven’t had a chance to dig into the other fails but maybe they’re similarly Aurora-specific. I’ll update when I’ve dug into them more.

pvelesko and others added 4 commits March 13, 2026 08:14
Track queue emptiness with an atomic IsEmptyQueue_ bool: set false on
every submit operation, reset true in finish().  Guard CreateMarkerInQueue
calls in addDependenciesQueueSyncImpl with !q->isEmptyQueue() so marker
events (and their zeCommandListAppendBarrier calls) are skipped for queues
that have never received work.

Also restructure launchImpl to skip creating a LaunchEvent when no spill
buffer is used.  The event was only needed to anchor the spill buffer
lifetime; with no spill buffer there is nothing to track, and passing
nullptr as the signal event to zeCommandListAppendLaunchKernel is valid
and avoids event-pool pressure.

PendingCrossQueueDeps_ holds marker event shared_ptrs on the submitting
queue to prevent checkEvents() from recycling their ze_events before the
GPU wait operations that reference them have completed.

Co-Authored-By: Colleen Bertoni <bertoni@anl.gov>
…mandListHostSynchronize

ZeCmdListImmCopy_ == ZeCmdListImm_ (issue #1136 was never implemented),
so the existing two-barrier + event-sync sequence was a self-referential
no-op: a barrier on the copy CL signaling an event, then a barrier on
the compute CL waiting on that same event, then zeEventHostSynchronize,
then zeCommandListHostSynchronize on both CLs.

Replacing the whole sequence with a single zeCommandListHostSynchronize
on ZeCmdListImm_ is semantically correct and eliminates the
zeCommandListAppendBarrier call with a signal event.  On an in-order CL
the driver had to track that signal event across all in-order CLs on the
same physical engine, making finish() cost O(N queues).  With this patch
finish() is O(1) regardless of how many streams exist.

Also reset IsEmptyQueue_ to true and drain PendingCrossQueueDeps_ in
finish() now that it is the canonical "all GPU work done" point.
@colleeneb
Contributor Author

colleeneb commented Mar 15, 2026

After looking more, the other issues on Aurora are resolved if we use explicit scaling or remove the kernel stamp flag. It's odd that it was working fine previously; maybe the issue comes from specific patterns that changed with the newer commits. I opened a bug with Intel with a Level Zero reproducer.

Based on that, this PR is ready to go.

@pvelesko pvelesko merged commit 2a4572f into main Mar 17, 2026
17 of 18 checks passed
@pvelesko pvelesko deleted the hiplaunchslowdown branch March 17, 2026 17:06