Conversation
The
Hmm. There is a test for the argument spilling. I'd imagine the kernels in the tests would crash or read garbage if the |
|
Thanks @linehill ! I'll take a look at the test you mentioned. When I said "all tests are passing" I meant the tests that are running with Probably the test you pointed out isn't part of that set so I didn't see a fail, I will check it. The test runs as above run pass with |
|
@linehill I confirm by hand-testing that TestLargeKernelArgLists runs and passes. I also tried increasing the size of LargeStruct from |
86435c6 to
9d5099a
Compare
|
I don’t think doing demonstrations like this is a good metric for justifying the removal of
Is the behavior same as in #1133? *: 17-bits are needed to represent |
|
I haven't seen any behavior like in #1133, but I'll build chipStar and LLVM with debug and see. (Indeed I was using the release build.) The goal is not to remove SpillBuf but to remove the launch kernel event, as with in-order LZ command lists, we should not need to store the event of a kernel launch (and with it, we see an overhead in the application that led to this PR). As part of making that change, I removed the adding an action to the launch event (the part of the kernel launch function that involved SpillBuf since it was using the launch event) and all the tests passed without it, including TestLargeKernelArgLists (with no changes to sizes, so just the original LargeStruct size). I will check the tests today with LLVM and chipStar built with debug to see what happens. What else would be good to check? I have no problems to leave the launch event if needed, just would like to avoid it if possible. |
Got it - I thought/misread the SpillBuf allocation was going to be removed. Ok, I see it’s fine to remove the launch kernel event as long as there is a replacement for the event action that ensures the buffer is not released too early (before the kernel execution starts or at mid-execution). Removing the event action means the buffer is held alive until the ExecItem gets destroyed which happens shortly after
Probably right now none of the kernels in the tests need spill buffers (= ExecItem->getArgSpillBuffer() returns a null pointer) except the |
|
Thanks for the explanation! Since there seems to be no way in Level Zero to put a zeMemFree into an in-order command list (otherwise we could just put it in the in-order list as the next task!), I think the options are:
Item 2) is what I’d prefer (or at least what I’d try to implement if possible) due to wanting to avoid events unless needed, but for the sake of getting this in PR in quickly due to the regression in kernel launch performance compared to chipStar last year, I made the change as in 1) so that we still have an event when there’s a spill buffer. We can consider 2) later if other people are ok with it. (As a side point, after learning about counter-based events, I think counter-based events are best way to handle many of the events since we now use in-order lists. Counter-based events can immediately be reused without resetting them once they are set as dependencies, so no need to wait for all the tasks that depend on a specific event before resetting it. I will do a PR after this one which changes the some of the marker events to counter based events instead for review later this week.) |
1f8c21e to
3f3b9e9
Compare
|
Option (1) seems to me the best option to starts with. My gut feeling is that the cases where the spill-buffers are needed are rare. The cases that use spill-buffers are kernels with 1024 byte parameter list or more - how often you see such kernels? |
Since the PR now has (1) implemented, could we merge this in? Or is something else needed? The only CI failing are for macOS but I think that's due to something other than what's in the code. |
The plan now sounds good to me. What comes to reviewing and merging is left to you and @pvelesko. |
…lLaunch.md baseline Add logTrace timing around addDependenciesQueueSync in launchImpl to surface O(N) cross-queue dependency overhead. Document baseline numbers on Intel Arc A770 showing ~4x slowdown at 64 idle blocking streams.
43e2ad7 to
01395e6
Compare
|
|
Thanks for the changes Paulius! Removing the barriers in finish() looks great indeed. I'll pull and test on Aurora soon. One thing though is that at some point we may want to have a separate copy list, it was only set equal to the command list due to a bug I thought. So it would be good to leave a comment that if we use a different copy list in the future, remember to sync it here, or add a guard around the copy list sync like So I think maybe it's worth it to add one just for kernel launch time too. I think we could just add a flag to the current one with slight adjustments to also get kernel launch only time too -- I think I can do it but feel free yourself if you prefer, let me know or feel free to push. |
|
Please fix as you see fit and push, it's LGTM for me once you fix the benchmark and add the conditional sync/ comment
Sent from Outlook for iOS<https://aka.ms/o0ukef>
…________________________________
From: Colleen Bertoni ***@***.***>
Sent: Thursday, March 12, 2026 2:49:14 PM
To: CHIP-SPV/chipStar ***@***.***>
Cc: Paulius Velesko ***@***.***>; Mention ***@***.***>
Subject: Re: [CHIP-SPV/chipStar] Change to only add in marker events when the other queue is not empty and remove event generation for kernel launches (PR #1162)
[https://avatars.githubusercontent.com/u/18434185?s=20&v=4]colleeneb left a comment (CHIP-SPV/chipStar#1162)<#1162 (comment)>
Thanks for the changes Paulius! Removing the barriers in finish() looks great indeed. I'll pull and test on Aurora soon.
One thing though is that at some point we may want to have a separate copy list, it was only set equal to the command list due to a bug I thought. So it would be good to leave a comment that if we use a different copy list in the future, remember to sync it here, or add a guard around the copy list sync like if( copylist != commandlist ) sync(copylist) so it doesn't get forgotten in the future if the separate copylist is used again. The other thing is that the test case added is good but not really appropriate for the issue of kernel launch time itself (which is what this was PR was trying to address) since it includes the synchronize time:
auto t0 = std::chrono::steady_clock::now();
nopKernel<<<1, 1, 0, nullptr>>>();
CHECK(hipStreamSynchronize(nullptr));
auto t1 = std::chrono::steady_clock::now();
So I think maybe it's worth it to add one just for kernel launch time too. I think we could just add a flag to the current one with slight adjustments to also get kernel launch only time too -- I think I can do it but feel free yourself if you prefer, let me know or feel free to push.
—
Reply to this email directly, view it on GitHub<#1162 (comment)>, or unsubscribe<https://github.com/notifications/unsubscribe-auth/ACCJBQP5M7ICVOKISDE22UL4QKW4VAVCNFSM6AAAAACWEWR7COVHI2DSMVQWIX3LMV43OSLTON2WKQ3PNVWWK3TUHM2DANBWGUYDOMZTG4>.
You are receiving this because you were mentioned.Message ID: ***@***.***>
|
|
@pvelesko Please, when reworking contributions, be careful to cleanly rebase commits so that individual contributions are still captured, or at least use the tag |
|
I pushed the additional test and guard of the copy list. However, a few of the tests are now failing on Aurora, when they weren’t before. One thing is that the rebase commits removed the I haven’t had a chance to dig into the other fails but maybe they’re similarly Aurora-specific. I’ll update when I’ve dug into them more. |
Track queue emptiness with an atomic IsEmptyQueue_ bool: set false on every submit operation, reset true in finish(). Guard CreateMarkerInQueue calls in addDependenciesQueueSyncImpl with !q->isEmptyQueue() so marker events (and their zeCommandListAppendBarrier calls) are skipped for queues that have never received work. Also restructure launchImpl to skip creating a LaunchEvent when no spill buffer is used. The event was only needed to anchor the spill buffer lifetime; with no spill buffer there is nothing to track, and passing nullptr as the signal event to zeCommandListAppendLaunchKernel is valid and avoids event-pool pressure. PendingCrossQueueDeps_ holds marker event shared_ptrs on the submitting queue to prevent checkEvents() from recycling their ze_events before the GPU wait operations that reference them have completed. Co-Authored-By: Colleen Bertoni <bertoni@anl.gov>
…mandListHostSynchronize ZeCmdListImmCopy_ == ZeCmdListImm_ (issue #1136 was never implemented), so the existing two-barrier + event-sync sequence was a self-referential no-op: a barrier on the copy CL signaling an event, then a barrier on the compute CL waiting on that same event, then zeEventHostSynchronize, then zeCommandListHostSynchronize on both CLs. Replacing the whole sequence with a single zeCommandListHostSynchronize on ZeCmdListImm_ is semantically correct and eliminates the zeCommandListAppendBarrier call with a signal event. On an in-order CL the driver had to track that signal event across all in-order CLs on the same physical engine, making finish() cost O(N queues). With this patch finish() is O(1) regardless of how many streams exist. Also reset IsEmptyQueue_ to true and drain PendingCrossQueueDeps_ in finish() now that it is the canonical "all GPU work done" point.
4fb11e3 to
95a02d3
Compare
|
After looking more, the other issues on Aurora are resolved if we use explicit scaling or remove the kernel stamp flag. Odd that it was working previously fine but maybe the issue is coming from specific patterns, and that changed with the newer commits. I opened a bug with Intel with a LZ reproducer. Based on that, this PR is ready to go. |
This PR is a consequence of noticing a large slowdown in hipKernelLaunch times for the in-order queue code vs the out-of-order queue code for a specific application that has 21 streams and does a lot of kernel launching.
Profiling output from out-of-order run (older chipStar):
Output from in-order-queue chipStar:
So the hipLaunchKernel time went up by 10x.
The slowdown was tracked down to the fact that we add marker events to all the other queues (like here:
chipStar/src/CHIPBackend.hh
Line 2165 in cdd85e1
&& !q->isEmptyQueue()to only add in the mark event if the queue is empty. The check itself callszeCommandListHostSynchronize(ZeCmdListImm_, 0)to check if the command list is finished or not, and returns true if it is and false if it isn't. We can change this LZ call to just checking the boolIsEmptyQueue_which we track in the code (added viaIsEmptyQueue_.store(...);calls). However, there is a bug with LZ at the moment (argonne-lcf/AuroraBugTracking#124) so we need to wait for that to be resolved before switching to that from the API call.Additionally, the launchKernel routine is generating and tracking an event for the kernel launch (
chipStar/src/backend/Level0/CHIPBackendLevel0.cc
Line 1296 in cdd85e1
chipStar/src/backend/Level0/CHIPBackendLevel0.cc
Line 1353 in cdd85e1
git blameright, this was originally from @linehill . Is this still needed? If so, what case was it? All the tests are passing without it and I'd love to remove the event generated unless we need it.With these changes it is much better:
It is still slower than before, but once we switch to a bool from the LZ API call it will help even more.
For reference, here's a code that will show the slowdown: