[Common] Remove volatile keyword in fused router kernel utils by denera · Pull Request #2683 · NVIDIA/TransformerEngine

denera · 2026-02-13T18:40:07Z

Description

Per a recommendation from the compiler team, this PR removes the volatile keyword from the fused router kernel utilities. Doing so helps avoid local memory loads/stores on SM100 and significantly improves the performance of the fused router kernels.

Type of change

Documentation change (change only to the documentation, either a fix or a new content)
Bug fix (non-breaking change which fixes an issue)
New feature (non-breaking change which adds functionality)
Breaking change (fix or feature that would cause existing functionality to not work as expected)
Infra/Build change
Code refactoring

Checklist:

I have read and followed the contributing guidelines
The functionality is complete
I have commented my code, particularly in hard-to-understand areas
I have made corresponding changes to the documentation
My changes generate no new warnings
I have added tests that prove my fix is effective or that my feature works
New and existing unit tests pass locally with my changes

greptile-apps · 2026-02-25T16:52:33Z

Greptile Summary

Removed volatile keyword from 10 local variable declarations in CUDA device functions (warp_reduce_on_shmem, masked_warp_reduce_on_shmem, and naive_topk_and_mask) to optimize SM100 performance by avoiding unnecessary local memory spills.

Changes

Removed volatile from double val variables used in warp reduction operations
Removed volatile from int index and double cur_val variables used in topk selection
Removed volatile from auto shuffled_val and auto shuffled_index variables used in warp shuffle operations
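The kind of declaration touched by these changes can be illustrated with a minimal sketch. It is not the TransformerEngine source: `warp_argmax`, `argmax_kernel`, and `run_argmax_demo` are hypothetical names, and the tie-breaking rule is an assumption made so that every lane converges on the same result. The variable names `cur_val`, `index`, `shuffled_val`, and `shuffled_index` mirror the ones listed above, declared here as plain locals.

```cuda
#include <cuda_runtime.h>

// Hypothetical helper (not the TransformerEngine source): butterfly arg-max
// across the 32 lanes of one warp.
__device__ void warp_argmax(double &cur_val, int &index) {
  for (int offset = 16; offset > 0; offset /= 2) {
    // Plain locals; the pre-PR style would have read `volatile auto ...`.
    auto shuffled_val = __shfl_xor_sync(0xffffffffu, cur_val, offset);
    auto shuffled_index = __shfl_xor_sync(0xffffffffu, index, offset);
    // Prefer the larger value; break ties on the lower index so all lanes agree.
    if (shuffled_val > cur_val ||
        (shuffled_val == cur_val && shuffled_index < index)) {
      cur_val = shuffled_val;
      index = shuffled_index;
    }
  }
}

__global__ void argmax_kernel(const double *in, int *out_index) {
  double cur_val = in[threadIdx.x];
  int index = static_cast<int>(threadIdx.x);
  warp_argmax(cur_val, index);
  if (threadIdx.x == 0) *out_index = index;
}

// Host-side driver: returns the arg-max of 32 distinct values, or -1 on a CUDA error.
int run_argmax_demo() {
  double h_in[32];
  for (int i = 0; i < 32; ++i) h_in[i] = static_cast<double>((i * 7) % 32);
  double *d_in = nullptr;
  int *d_out = nullptr;
  int h_out = -1;
  if (cudaMalloc(&d_in, sizeof(h_in)) != cudaSuccess) return -1;
  if (cudaMalloc(&d_out, sizeof(int)) != cudaSuccess) {
    cudaFree(d_in);
    return -1;
  }
  cudaMemcpy(d_in, h_in, sizeof(h_in), cudaMemcpyHostToDevice);
  argmax_kernel<<<1, 32>>>(d_in, d_out);
  cudaError_t err = cudaMemcpy(&h_out, d_out, sizeof(int), cudaMemcpyDeviceToHost);
  cudaFree(d_in);
  cudaFree(d_out);
  return err == cudaSuccess ? h_out : -1;
}
```

With the input `(i * 7) % 32`, the largest value (31) sits at position 9, so calling `run_argmax_demo()` from a `main` compiled with nvcc should return 9 on a working GPU.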

Analysis

The volatile keyword in these CUDA kernels is likely a holdover from older warp-synchronous programming patterns, in which volatile was used to keep the compiler from caching shared-memory values in registers while the code relied on implicit lockstep execution within a warp. The variables affected here are thread-local, and the code exchanges data between lanes through explicit warp-level primitives such as __shfl_xor_sync() and __syncwarp(), which provide the required synchronization and memory ordering within the warp without volatile.

The removal is safe because:

All warp-level data sharing uses explicit shuffle intrinsics (__shfl_xor_sync) which provide the necessary synchronization
Explicit warp synchronization barriers (__syncwarp()) are present where needed
The variables are local to each thread and don't require volatile semantics for correctness
Modern compiler optimizations benefit from removing unnecessary volatile qualifiers
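The reasoning above can be sketched with a small warp-level sum over shared memory. This is an illustration under stated assumptions, not the code in utils.h: `warp_sum_on_shmem`, `warp_sum_kernel`, `run_warp_sum_demo`, `kWarpSize`, and `kNumValues` are hypothetical names, and only a sum is shown. The accumulator is a plain thread-local `double`, and all cross-lane exchange goes through `__shfl_xor_sync`.

```cuda
#include <cuda_runtime.h>

constexpr int kWarpSize = 32;   // hypothetical constant name
constexpr int kNumValues = 64;  // illustrative problem size

// Hypothetical sketch: each lane accumulates a strided slice of shared memory,
// then a butterfly shuffle leaves the full sum in every lane.
__device__ double warp_sum_on_shmem(const double *data, int size, int lane_id) {
  double val = 0.0;  // plain thread-local accumulator, no volatile
  for (int i = lane_id; i < size; i += kWarpSize) val += data[i];
  for (int offset = kWarpSize / 2; offset > 0; offset /= 2) {
    val += __shfl_xor_sync(0xffffffffu, val, offset);
  }
  __syncwarp();
  return val;
}

__global__ void warp_sum_kernel(const double *in, double *out) {
  __shared__ double smem[kNumValues];
  int lane_id = static_cast<int>(threadIdx.x);
  for (int i = lane_id; i < kNumValues; i += kWarpSize) smem[i] = in[i];
  __syncwarp();  // order the shared-memory writes before the reduction reads
  out[lane_id] = warp_sum_on_shmem(smem, kNumValues, lane_id);
}

// Host-side driver: returns the sum if all 32 lanes agree, otherwise -1.0.
double run_warp_sum_demo() {
  double h_in[kNumValues];
  double h_out[kWarpSize];
  for (int i = 0; i < kNumValues; ++i) h_in[i] = static_cast<double>(i);
  double *d_in = nullptr;
  double *d_out = nullptr;
  if (cudaMalloc(&d_in, sizeof(h_in)) != cudaSuccess) return -1.0;
  if (cudaMalloc(&d_out, sizeof(h_out)) != cudaSuccess) {
    cudaFree(d_in);
    return -1.0;
  }
  cudaMemcpy(d_in, h_in, sizeof(h_in), cudaMemcpyHostToDevice);
  warp_sum_kernel<<<1, kWarpSize>>>(d_in, d_out);
  cudaError_t err = cudaMemcpy(h_out, d_out, sizeof(h_out), cudaMemcpyDeviceToHost);
  cudaFree(d_in);
  cudaFree(d_out);
  if (err != cudaSuccess) return -1.0;
  for (int i = 1; i < kWarpSize; ++i) {
    if (h_out[i] != h_out[0]) return -1.0;
  }
  return h_out[0];
}
```

The inputs 0 through 63 sum to 2016, so `run_warp_sum_demo()` should return 2016.0 when every lane ends up with the same total. Whether a given build actually places such locals in local memory can be checked with the resource usage report from `nvcc -Xptxas -v`.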

Confidence Score: 5/5

This PR is safe to merge with no risk - it's a targeted performance optimization based on compiler team recommendation
The change removes outdated volatile keywords that were causing performance degradation on SM100 architecture. The code uses proper warp synchronization primitives (__shfl_xor_sync, __syncwarp) that provide correct memory ordering guarantees without needing volatile. The change is mechanical, well-scoped to a single header file, and based on official compiler team guidance.
No files require special attention

Important Files Changed

| Filename | Overview |
|---|---|
| transformer_engine/common/fused_router/utils.h | Removed `volatile` keyword from local variables in warp reduction and topk functions to prevent local memory spills on SM100 architecture |

Last reviewed commit: 17c055a

1 file reviewed, no comments

ptrendx · 2026-02-25T19:57:26Z

/te-ci pytorch

denera requested a review from ptrendx February 13, 2026 18:40

denera self-assigned this Feb 13, 2026

nvMelissa mentioned this pull request Feb 23, 2026

Fused router optimization for GroupedTensor #2457

Open

denera marked this pull request as ready for review February 25, 2026 16:50

denera and others added 2 commits February 25, 2026 10:50

remove volatile keyword in fused router kernel utils to avoid local m…

84708b4

…em spill on SM100 Signed-off-by: Alp Dener <adener@nvidia.com>

[pre-commit.ci] auto fixes from pre-commit.com hooks

17c055a

for more information, see https://pre-commit.ci

denera force-pushed the common/fused-router-no-volatile branch from d386e21 to 17c055a February 25, 2026 16:50

greptile-apps bot reviewed Feb 25, 2026

ptrendx approved these changes Feb 25, 2026

yaox12 merged commit 842b770 into NVIDIA:main Feb 26, 2026
23 of 28 checks passed
