[CUDA] [PERFORMANCE] Increase speed of bf16bf16bf16_grouped_wgrad via indicating that ElementC is void / nullptr#329

Open
benediktjohannes wants to merge 1 commit into meta-pytorch:main from benediktjohannes:patch-2

@benediktjohannes benediktjohannes commented Apr 19, 2026

I haven't tested these changes because I don't have a CUDA device, so please correct me if I'm mistaken. I'm fairly confident the change is correct, but a few points are unverified:

- Whether it is really true that there is no beta scaling here, i.e. that ElementC is not needed.
- Whether the change actually increases performance rather than leaving it the same or decreasing it.
- The check `OUTPUT_ACCUM ? ... : ...` is now used in two places. I think it should automatically be evaluated as constexpr, but I have not confirmed this.
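To make the constexpr question concrete, here is a minimal standalone sketch; it is not the code from this PR. `EpilogueConfig`, `beta`, and `reads_source` are hypothetical names, and it assumes that a source operand C is only read when `OUTPUT_ACCUM` is true, which is exactly the point that still needs to be verified against the real kernel.

```cpp
#include <type_traits>

// Hypothetical configuration struct; the names are illustrative only
// and do not come from the kernel touched by this PR.
template <bool OUTPUT_ACCUM>
struct EpilogueConfig {
  // Compile-time type selection. Assumption: C is only needed when
  // accumulating into the output, so it becomes void otherwise.
  using ElementC = typename std::conditional<OUTPUT_ACCUM, float, void>::type;

  // A ternary whose condition is a template parameter and whose operands
  // are constants is itself a constant expression.
  static constexpr float beta = OUTPUT_ACCUM ? 1.0f : 0.0f;

  // True only when a non-void source element type was selected.
  static constexpr bool reads_source = !std::is_void<ElementC>::value;
};

// These checks are evaluated entirely by the compiler.
static_assert(std::is_void<EpilogueConfig<false>::ElementC>::value,
              "no source operand when not accumulating");
static_assert(std::is_same<EpilogueConfig<true>::ElementC, float>::value,
              "source operand present when accumulating");
static_assert(EpilogueConfig<false>::beta == 0.0f, "beta is 0 without accumulation");
static_assert(EpilogueConfig<true>::reads_source, "source is read when accumulating");
```

If the repeated `OUTPUT_ACCUM ? ... : ...` expressions in the kernel have this shape, using them in two places should add no runtime cost, since both are resolved during compilation. Whether the void ElementC path is valid for this kernel still has to be confirmed by building and running it on a CUDA device.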

Contributed by Benedikt Johannes

@meta-cla meta-cla Bot added the cla signed label Apr 19, 2026
cthi (Contributor) commented Apr 23, 2026

Hi Benedikt, thanks for the contribution. Unfortunately it's a bit hard for us to spend time to review it unless you have access to the appropriate hardware to iterate on it/check performance. Additionally it seems the code is not actually compiling based on the build checks.

benediktjohannes (Author) commented

> Hi Benedikt, thanks for the contribution. Unfortunately it's a bit hard for us to spend time to review it unless you have access to the appropriate hardware to iterate on it/check performance. Additionally it seems the code is not actually compiling based on the build checks.

Hi! Thanks for the answer! I'll try my best to look into the compilation problems. Hopefully we also get the possibility to run some kind of performance checks, like the ones on PyTorch or something similar; that would be great!
