Required prerequisites
Motivation
I have been writing a backward kernel that performs an atomic reduction across the hc_mult (4) dimension. If I use atomic.global.add, I need to read the whole tensor into L2, which is unacceptable.
It would be better to perform the reduction at the shared-memory level. I have searched the docs and atomic.h and have not found this implemented yet. Could you please add this feature in the near future?
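To illustrate the pattern I have in mind, here is a minimal plain-CUDA sketch. It is not this library's API. The `[N, HC_MULT, D]` layout, the width `D = 64`, and the names `reduce_hc_shared` and `run_reduce_example` are assumptions made up for the example. Each block accumulates its partial sums with `atomicAdd` on a shared-memory buffer and writes each output element to global memory only once.

```cuda
#include <cstddef>
#include <vector>
#include <cuda_runtime.h>

constexpr int HC_MULT = 4;  // reduced dimension, as in the issue
constexpr int D = 64;       // hypothetical feature width (assumption)

// grad_in:  [N, HC_MULT, D]   grad_out: [N, D]
// One block per row n, one thread per (h, d) element of that row.
__global__ void reduce_hc_shared(const float* grad_in, float* grad_out) {
    __shared__ float acc[D];
    const int n = blockIdx.x;
    const int tid = threadIdx.x;  // tid = h * D + d
    if (tid < D) acc[tid] = 0.0f;
    __syncthreads();

    // Shared-memory atomic: HC_MULT threads collide on each acc[d].
    atomicAdd(&acc[tid % D], grad_in[n * HC_MULT * D + tid]);
    __syncthreads();

    // A single global store per output element, no global atomics.
    if (tid < D) grad_out[n * D + tid] = acc[tid];
}

// Hypothetical host helper: runs the kernel and checks it against a CPU sum.
bool run_reduce_example() {
    const int N = 8;
    std::vector<float> h_in(N * HC_MULT * D), h_out(N * D, 0.0f);
    // Small integer values keep the float sums exact in any order.
    for (std::size_t i = 0; i < h_in.size(); ++i) h_in[i] = float(i % 7);

    float *d_in = nullptr, *d_out = nullptr;
    cudaMalloc(&d_in, h_in.size() * sizeof(float));
    cudaMalloc(&d_out, h_out.size() * sizeof(float));
    cudaMemcpy(d_in, h_in.data(), h_in.size() * sizeof(float),
               cudaMemcpyHostToDevice);

    reduce_hc_shared<<<N, HC_MULT * D>>>(d_in, d_out);

    cudaMemcpy(h_out.data(), d_out, h_out.size() * sizeof(float),
               cudaMemcpyDeviceToHost);
    cudaFree(d_in);
    cudaFree(d_out);

    for (int n = 0; n < N; ++n) {
        for (int d = 0; d < D; ++d) {
            float ref = 0.0f;
            for (int h = 0; h < HC_MULT; ++h)
                ref += h_in[(n * HC_MULT + h) * D + d];
            if (h_out[n * D + d] != ref) return false;
        }
    }
    return true;
}
```

Calling `run_reduce_example()` from a `main` should return `true` on a CUDA-capable device. The request is for an equivalent shared-memory atomic add to be exposed next to atomic.global.add.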
Thanks!
Solution
No response
Alternatives
No response
Additional context
No response