Added hipblaslt bias fused kernels into autotune backend for addmm#3012
Open
guangzlu wants to merge 1 commit into release/2.9
Conversation
Jenkins build for commit a844cd4069ac6f83256820788b5dd7ae68a62633 finished as FAILURE
Motivation
We found that in some GEMM-with-bias cases, torch compile with max autotune gave poorer performance than running without it. The root cause is that when max autotune is turned on, inductor cannot call the hipblaslt bias fused kernels. When it cannot call a fused kernel, it instead calls either a Triton fused kernel or a separate Aten solution (a single GEMM kernel plus an elementwise kernel), and the separate Aten solution performs worse than the hipblaslt fused kernels.
Technical Details
In the current code, inductor uses inp_expanded as the bias argument for the addmm lowering kernel inputs. inp_expanded is the bias argument expanded from 1D to 2D during argument processing. The hipblaslt bias fused kernel cannot accept a 2D bias input, so inductor currently cannot call the hipblaslt bias fused kernels.
This PR passes the original 1D bias argument as the kernel input for aten addmm, which enables the hipblaslt bias fused kernel in inductor.
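Why the 1D form is safe to pass through: addmm computes `out = bias + A @ B`, and a 1D bias of length N broadcasts identically across all M rows, so the original 1D bias and its 2D expanded copy produce the same result. The sketch below illustrates this equivalence in plain Python (no PyTorch); the helper names `matmul` and `addmm_ref` are illustrative, not inductor internals.

```python
# Illustrative sketch: a 1D bias broadcast over rows gives the same addmm
# result as the 2D expanded bias inductor previously passed to the kernel.

def matmul(a, b):
    """Naive M x K @ K x N matrix multiply."""
    m, k, n = len(a), len(b), len(b[0])
    return [[sum(a[i][p] * b[p][j] for p in range(k)) for j in range(n)]
            for i in range(m)]

def addmm_ref(bias, a, b):
    """bias + a @ b, accepting a 1D (length-N) or 2D (M x N) bias."""
    prod = matmul(a, b)
    if bias and not isinstance(bias[0], list):
        # 1D bias: broadcast the same row of values over every output row.
        return [[prod[i][j] + bias[j] for j in range(len(prod[0]))]
                for i in range(len(prod))]
    return [[prod[i][j] + bias[i][j] for j in range(len(prod[0]))]
            for i in range(len(prod))]

a = [[1.0, 2.0], [3.0, 4.0]]
b = [[5.0, 6.0], [7.0, 8.0]]
bias_1d = [0.5, -0.5]
bias_2d = [bias_1d[:] for _ in range(2)]  # the expanded copy inductor built

assert addmm_ref(bias_1d, a, b) == addmm_ref(bias_2d, a, b)
```

Since both forms are mathematically identical, keeping the bias in its original 1D shape loses nothing while satisfying the hipblaslt fused kernel's input requirement.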
Test Plan
Here is a simple unit test that measures the performance of Linear with bias, which inductor lowers into addmm.
run-test.sh
unittest_linear_mi308x_fp16.py
Test Result
In this case, (m, n, k) is [4352, 1024, 1024].
Without the PR, inductor runs two kernels: the GEMM takes 81.659us and the elementwise kernel takes 13.619us, for a total execution time of 85us.
With the PR, inductor chooses a hipblaslt bias fused kernel, and the execution time is 63.504us.
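For a quick sanity check on the reported numbers, the speedup from the fused kernel can be computed directly (values taken from the measurements above):

```python
# Speedup of the hipblaslt fused kernel over the separate Aten path,
# using the totals reported in the test result above.
separate_total_us = 85.0   # GEMM + elementwise path, total execution time
fused_us = 63.504          # hipblaslt bias fused kernel

speedup = separate_total_us / fused_us
print(f"fused kernel speedup: {speedup:.2f}x")  # prints "fused kernel speedup: 1.34x"
```

That is roughly a 1.34x speedup for this shape from avoiding the separate elementwise bias kernel.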
Submission Checklist