Add validity checks for MoE FlatMM scatter and enable bf16 hardware atomic-add #3236
Merged
Pull Request Overview
This PR fixes crashes in MoE FlatMM operations when using the bf16 data type on CDNA3 (gfx942) GPUs by adding validity checks for scatter operations and enabling bf16 hardware atomic-add operations.
Key changes:
- Added validity flag tracking to prevent out-of-bounds atomic writes for invalid/padding tokens
- Enabled hardware atomic builtin for bf16x2_t when available instead of always using software CAS fallback
- Fixed typo in tensor size calculation for non-InputGemm case
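The software CAS fallback mentioned in the second item can be illustrated on the host. The following is a minimal sketch, not the code from this PR: the names `bf16_bits_to_float`, `float_to_bf16_bits`, `pack_bf16x2`, and `atomic_add_bf16x2_cas` are hypothetical, `std::atomic` stands in for device memory, and the float-to-bf16 conversion truncates for brevity rather than rounding.

```cpp
#include <atomic>
#include <cstdint>
#include <cstring>

// Hypothetical helper: widen raw bf16 bits to float (bf16 is the top 16 bits of a float).
inline float bf16_bits_to_float(std::uint16_t b) {
    const std::uint32_t bits = static_cast<std::uint32_t>(b) << 16;
    float f;
    std::memcpy(&f, &bits, sizeof f);
    return f;
}

// Hypothetical helper: narrow float to raw bf16 bits by truncation (assumption: no rounding).
inline std::uint16_t float_to_bf16_bits(float f) {
    std::uint32_t bits;
    std::memcpy(&bits, &f, sizeof bits);
    return static_cast<std::uint16_t>(bits >> 16);
}

// Pack two bf16 values into one 32-bit word (low element in the low half).
inline std::uint32_t pack_bf16x2(std::uint16_t lo, std::uint16_t hi) {
    return static_cast<std::uint32_t>(lo) | (static_cast<std::uint32_t>(hi) << 16);
}

// Software fallback: add two bf16 values to a packed word with a compare-and-swap loop.
// A hardware packed atomic add would replace this whole loop with a single instruction.
inline void atomic_add_bf16x2_cas(std::atomic<std::uint32_t>& dst,
                                  std::uint16_t add_lo, std::uint16_t add_hi) {
    std::uint32_t expected = dst.load(std::memory_order_relaxed);
    std::uint32_t desired;
    do {
        const float lo = bf16_bits_to_float(static_cast<std::uint16_t>(expected & 0xFFFFu)) +
                         bf16_bits_to_float(add_lo);
        const float hi = bf16_bits_to_float(static_cast<std::uint16_t>(expected >> 16)) +
                         bf16_bits_to_float(add_hi);
        desired = pack_bf16x2(float_to_bf16_bits(lo), float_to_bf16_bits(hi));
        // On failure, expected is refreshed with the current value and the sum is recomputed.
    } while (!dst.compare_exchange_weak(expected, desired, std::memory_order_relaxed));
}
```

Each contended update may repeat the loop several times, which is the cost a single hardware atomic instruction avoids.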
Reviewed Changes
Copilot reviewed 3 out of 3 changed files in this pull request and generated no comments.
| File | Description |
|---|---|
| include/ck_tile/ops/flatmm/kernel/moe_flatmm_kernel.hpp | Added validity flag tracking for scatter operations and fixed tensor size typo |
| include/ck_tile/core/arch/generic_memory_space_atomic.hpp | Enabled bf16 hardware atomic builtin when available |
| example/ck_tile/18_flatmm/run_moe_flatmm_example.inc | Marked unused variable with [[maybe_unused]] |
aosewski reviewed Nov 20, 2025
aosewski approved these changes Nov 27, 2025
AviralGoelAMD pushed a commit that referenced this pull request Nov 28, 2025
…tomic-add (#3236) * Add validity checks for MoE FlatMM scatter and enable bf16 hardware atomic * correct clang-format * removed unused rtol_atol variable from example code * clang format correction * remove unused varable max_accumulated_value from example
Proposed changes
Summary
This PR fixes crashes in MoE FlatMM operations when using the bf16 data type on CDNA3 (gfx942) GPUs and enables bf16 hardware atomic operations. The crashes occurred with specific NumTokens values (powers of 2 ≥ 512) and were caused by out-of-bounds atomic write operations.
Problem
Invalid/padding tokens in MoE operations are marked with `scatter_token_id = NumTokens`, which generates scatter offsets pointing exactly at the buffer end (out of bounds). When atomic write operations attempt to access these addresses:
Solution
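As a rough host-side illustration of the problem and of the guard that addresses it, the sketch below models the output as `num_tokens` rows and skips any row whose id is not below `num_tokens`. It is an assumption-laden model, not the kernel code: `is_valid_token` and `scatter_add_rows` are hypothetical names, and plain `+=` stands in for the device atomic add.

```cpp
#include <cstddef>
#include <vector>

// Hypothetical predicate: a token is valid only if its id indexes a real output row.
inline bool is_valid_token(std::size_t scatter_token_id, std::size_t num_tokens) {
    return scatter_token_id < num_tokens;
}

// Hypothetical host model of a guarded scatter-add.
// out holds num_tokens rows of row_width floats; rows holds one source row per id.
// Returns the number of rows actually written.
inline std::size_t scatter_add_rows(std::vector<float>& out,
                                    std::size_t num_tokens,
                                    std::size_t row_width,
                                    const std::vector<std::size_t>& scatter_token_ids,
                                    const std::vector<float>& rows) {
    std::size_t written = 0;
    for (std::size_t i = 0; i < scatter_token_ids.size(); ++i) {
        const std::size_t id = scatter_token_ids[i];
        // A padding id equal to num_tokens would give offset == out.size(),
        // i.e. the first element past the end of the buffer, so it is skipped.
        if (!is_valid_token(id, num_tokens)) {
            continue;
        }
        const std::size_t offset = id * row_width;
        for (std::size_t j = 0; j < row_width; ++j) {
            out[offset + j] += rows[i * row_width + j];
        }
        ++written;
    }
    return written;
}
```

With two tokens and ids `{0, 2, 1, 0}`, the entry with id 2 is treated as padding and contributes nothing to the output.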
1. Validity Flag Tracking
Added validity flag tracking to scatter operations using the designed `tile_scatter_gather` API:
- `scatter_token_id < NumTokens`
- `make_tile_scatter_gather()`
2. Enable bf16 Hardware Atomic Builtin
Previously, `atomic_add<bf16x2_t>` was missing the `#if HAS_GLOBAL_ATOMIC_PK_ADD_BUILTIN` check and always used the software CAS fallback. Now it properly uses the hardware builtin when available:
Checklist
Please put an `x` into the boxes that apply. You can also fill these out after creating the PR. If you're not sure, please don't hesitate to ask.
- `clang-format` on all changed files
Discussion
If this is a relatively large or complex change, feel free to start a discussion by explaining why you chose the solution you did and what alternatives you considered