[Feature Request] Atomic ops on shared memory, like red.shared.add

### Required prerequisites

- [x] I have searched the [Issue Tracker](https://github.com/tile-ai/tilelang/issues) that this hasn't already been reported. (comment there if it has.)

### Motivation

I have been writing a backward kernel that performs an atomic reduction across the hc_mult (4) dimension. If I use atomic.global.add, the whole tensor has to be read into L2, which is unacceptable.

It would be better to perform the reduction at the shared-memory level. I have searched the docs and atomic.h and haven't found this implemented yet. Could you please add this feature in the near future?
Thanks!
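
To make the request concrete, here is a minimal plain-CUDA sketch of the pattern I would like to be able to express. It is not TileLang code and not the actual kernel: the shapes (`N`, `HC`, `D`), the layout `[N, HC, D]`, and the kernel name `reduce_hc_shared` are assumptions made up for illustration. Each block accumulates into a shared-memory buffer with `atomicAdd`, then writes each result to global memory once.

```cuda
#include <cstdio>
#include <cuda_runtime.h>

constexpr int N  = 8;   // hypothetical number of rows
constexpr int HC = 4;   // hypothetical reduction width (hc_mult in this issue)
constexpr int D  = 64;  // hypothetical feature width

// One block per row n. in: [N, HC, D], out: [N, D]; out must be zero-initialized.
__global__ void reduce_hc_shared(const float* in, float* out) {
    __shared__ float acc[D];
    const int n = blockIdx.x;

    // Zero the shared accumulator.
    for (int d = threadIdx.x; d < D; d += blockDim.x) acc[d] = 0.0f;
    __syncthreads();

    // Reduce across the HC dimension in shared memory. The return value of
    // atomicAdd is unused, so the compiler is free to lower this to a
    // reduction instruction such as red.shared.add.f32 (not guaranteed).
    for (int i = threadIdx.x; i < HC * D; i += blockDim.x) {
        atomicAdd(&acc[i % D], in[n * HC * D + i]);
    }
    __syncthreads();

    // One global update per output element. atomicAdd is used so that several
    // blocks could contribute to the same row; with exactly one block per row,
    // as here, a plain store would also work.
    for (int d = threadIdx.x; d < D; d += blockDim.x) {
        atomicAdd(&out[n * D + d], acc[d]);
    }
}

int main() {
    float *in = nullptr, *out = nullptr;
    cudaMallocManaged(&in,  N * HC * D * sizeof(float));
    cudaMallocManaged(&out, N * D * sizeof(float));
    for (int i = 0; i < N * HC * D; ++i) in[i] = 1.0f;
    for (int i = 0; i < N * D; ++i) out[i] = 0.0f;

    reduce_hc_shared<<<N, HC * D>>>(in, out);
    cudaDeviceSynchronize();

    // Every output element is the sum of HC ones.
    bool ok = true;
    for (int i = 0; i < N * D; ++i) ok = ok && (out[i] == float(HC));
    printf("%s\n", ok ? "ok" : "mismatch");

    cudaFree(in);
    cudaFree(out);
    return ok ? 0 : 1;
}
```

In this sketch the contended read-modify-write traffic stays in shared memory, and global memory sees only one update per output element, which is the behavior I am hoping TileLang can generate.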

### Solution

_No response_

### Alternatives

_No response_

### Additional context

_No response_

[Feature Request] Atomic ops on shared memory, like red.shared.add #1998

Required prerequisites

Motivation

Solution

Alternatives

Additional context

Metadata

Assignees

Labels

Type

Projects

Milestone

Relationships

Development

[Feature Request] Atomic Ops on share memory like red.shared.add #1998

Description

Required prerequisites

Motivation

Solution

Alternatives

Additional context

Metadata

Metadata

Assignees

Labels

Type

Projects

Milestone

Relationships

Development

Issue actions