[CUDA] [PERFORMANCE] Increase speed of bf16bf16bf16_grouped_wgrad via indicating that ElementC is void / nullptr #329
… indicating that ElementC is void / nullptr Contributed by Benedikt Johannes
Hi Benedikt, thanks for the contribution. Unfortunately it's a bit hard for us to spend time to review it unless you have access to the appropriate hardware to iterate on it/check performance. Additionally, it seems the code is not actually compiling based on the build checks.
Hi! Thanks for the answer! I'll do my best to look into the compilation problems. Hopefully we can also find a way to run some kind of performance checks, like the performance tests PyTorch has or something similar; that would be great!
I have not tested these changes because I don't have a CUDA device, so please correct me if I'm mistaken. I'm fairly confident the change is correct, but I'm not 100% sure about either correctness or performance. On correctness, I'm not certain that we really have no beta scaling here, i.e. that ElementC is not needed. On performance, I'm not certain that the change makes the kernel faster rather than leaving it unchanged or slowing it down. One specific concern is that we now use another check ("OUTPUT_ACCUM ? ... : ...,") two times, although I think it should automatically be constexpr.
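To make the two open questions above concrete, here is a minimal host-side C++ sketch, not the code from this PR. It assumes a bool template parameter named `OUTPUT_ACCUM` (what that flag selects in the real kernel is not shown in this thread) and a textbook epilogue of the form `D = acc + beta * C`. `WgradConfig`, `Element`, `reads_c` and `epilogue` are hypothetical names for illustration only. The sketch shows that a ternary on a template parameter is a constant expression, and that a `void` ElementC lets the C read be compiled out, so the C pointer can be `nullptr`.

```cpp
#include <type_traits>

// Hypothetical stand-in for the element type; the real kernel works on bf16.
using Element = float;

// Hypothetical config keyed on a bool template parameter named OUTPUT_ACCUM.
// Assumption: the flag decides whether a C operand is read at all.
template <bool OUTPUT_ACCUM>
struct WgradConfig {
  // With no C operand, ElementC can be void to signal "nothing to load".
  using ElementC = std::conditional_t<OUTPUT_ACCUM, Element, void>;
  // A ternary on a template parameter is a constant expression,
  // so it can initialize a constexpr member.
  static constexpr float beta = OUTPUT_ACCUM ? 1.0f : 0.0f;
  static constexpr bool reads_c = !std::is_void_v<ElementC>;
};

// Compile-time checks: both uses of the flag are resolved by the compiler.
static_assert(WgradConfig<true>::reads_c, "C is read when accumulating");
static_assert(!WgradConfig<false>::reads_c, "C is skipped otherwise");
static_assert(WgradConfig<false>::beta == 0.0f, "no beta scaling without C");

// Textbook epilogue D = acc + beta * C for a single element.
template <bool OUTPUT_ACCUM>
float epilogue(float acc, const float* c_ptr) {
  using Cfg = WgradConfig<OUTPUT_ACCUM>;
  if constexpr (Cfg::reads_c) {
    return acc + Cfg::beta * (*c_ptr);
  } else {
    // c_ptr is never dereferenced in this branch, so nullptr is safe.
    return acc;
  }
}
```

Calling `epilogue<false>(acc, nullptr)` never touches the pointer, while `epilogue<true>` requires a valid C value. Whether the real grouped wgrad kernel matches this picture, and whether skipping the C load actually speeds it up, still has to be checked on hardware.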