Skip to content

[Common] Remove volatile keyword in fused router kernel utils#2683

Merged
yaox12 merged 2 commits intoNVIDIA:mainfrom
denera:common/fused-router-no-volatile
Feb 26, 2026
Merged

[Common] Remove volatile keyword in fused router kernel utils#2683
yaox12 merged 2 commits intoNVIDIA:mainfrom
denera:common/fused-router-no-volatile

Conversation

@denera
Copy link
Collaborator

@denera denera commented Feb 13, 2026

Description

Per recommendation from the compiler team, this helps avoid local memory loads/stores on SM100 and significantly improves performance of fused router kernels.

Type of change

  • Documentation change (change only to the documentation, either a fix or a new content)
  • Bug fix (non-breaking change which fixes an issue)
  • New feature (non-breaking change which adds functionality)
  • Breaking change (fix or feature that would cause existing functionality to not work as expected)
  • Infra/Build change
  • Code refactoring

Checklist:

  • I have read and followed the contributing guidelines
  • The functionality is complete
  • I have commented my code, particularly in hard-to-understand areas
  • I have made corresponding changes to the documentation
  • My changes generate no new warnings
  • I have added tests that prove my fix is effective or that my feature works
  • New and existing unit tests pass locally with my changes

@denera denera requested a review from ptrendx February 13, 2026 18:40
@denera denera self-assigned this Feb 13, 2026
@denera denera marked this pull request as ready for review February 25, 2026 16:50
denera and others added 2 commits February 25, 2026 10:50
@denera denera force-pushed the common/fused-router-no-volatile branch from d386e21 to 17c055a Compare February 25, 2026 16:50
@greptile-apps
Copy link
Contributor

greptile-apps bot commented Feb 25, 2026

Greptile Summary

Removed volatile keyword from 10 local variable declarations in CUDA device functions (warp_reduce_on_shmem, masked_warp_reduce_on_shmem, and naive_topk_and_mask) to optimize SM100 performance by avoiding unnecessary local memory spills.

Changes

  • Removed volatile from double val variables used in warp reduction operations
  • Removed volatile from int index and double cur_val variables used in topk selection
  • Removed volatile from auto shuffled_val and auto shuffled_index variables used in warp shuffle operations

Analysis

The volatile keyword was originally used in these CUDA kernels, likely as a historical practice from older CUDA programming patterns. However, modern CUDA compilers and the CUDA memory model provide proper guarantees for warp-level operations through warp-synchronous programming primitives like __shfl_xor_sync() and __syncwarp(). The code correctly uses these synchronization primitives to ensure proper memory ordering within the warp.

The removal is safe because:

  • All warp-level data sharing uses explicit shuffle intrinsics (__shfl_xor_sync) which provide the necessary synchronization
  • Explicit warp synchronization barriers (__syncwarp()) are present where needed
  • The variables are local to each thread and don't require volatile semantics for correctness
  • Modern compiler optimizations benefit from removing unnecessary volatile qualifiers

Confidence Score: 5/5

  • This PR is safe to merge with no risk - it's a targeted performance optimization based on compiler team recommendation
  • The change removes outdated volatile keywords that were causing performance degradation on SM100 architecture. The code uses proper warp synchronization primitives (__shfl_xor_sync, __syncwarp) that provide correct memory ordering guarantees without needing volatile. The change is mechanical, well-scoped to a single header file, and based on official compiler team guidance.
  • No files require special attention

Important Files Changed

Filename Overview
transformer_engine/common/fused_router/utils.h Removed volatile keyword from local variables in warp reduction and topk functions to prevent local memory spills on SM100 architecture

Last reviewed commit: 17c055a

Copy link
Contributor

@greptile-apps greptile-apps bot left a comment

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

1 file reviewed, no comments

Edit Code Review Agent Settings | Greptile

@ptrendx
Copy link
Member

ptrendx commented Feb 25, 2026

/te-ci pytorch

@yaox12 yaox12 merged commit 842b770 into NVIDIA:main Feb 26, 2026
23 of 28 checks passed
Sign up for free to join this conversation on GitHub. Already have an account? Sign in to comment

Labels

None yet

Projects

None yet

Development

Successfully merging this pull request may close these issues.

3 participants