[ROCm] Sdpa dropout fix by alugorey · Pull Request #174708 · pytorch/pytorch

alugorey · 2026-02-10T20:05:00Z

Fixes a device side memory access fault when dropout is used with CK sdpa backend. There was an bug where the philox seed/offset tensors were not being placed on the GPU correctly thus resulting in a memory access fault.

This PR is an attempted reland of #154864 which was reverted a couple times due to internal tests breaking. This was because the output tensors for mha_fwd were mistakenly changed causing dynamo to complain. In the newest commit, we changed the output tensors back to what they were originally while preserving the bug fix for the device side memory assertion. This allowed the dynamo checks to succeed and avoided the original error.

cc @jeffdaily @sunway513 @jithunnair-amd @pruthvistony @ROCmSupport @jataylo @hongxiayang @naromero77amd @pragupta @jerrymannil @xinyazhang

pytorch-bot · 2026-02-10T20:05:04Z

🔗 Helpful Links

🧪 See artifacts and rendered test results at hud.pytorch.org/pr/174708

📄 Preview Python docs built from this PR
📄 Preview C++ docs built from this PR
❓ Need help or want to give feedback on the CI? Visit the bot commands wiki

Note: Links to docs will display an error until the docs builds have been completed.

❌ 1 New Failure

As of commit bdddea3 with merge base e45dfba ():

NEW FAILURE - The following job has failed:

pull / linux-jammy-py3.14t-clang15 / test (default, 1, 5, lf.linux.4xlarge) (gh)
test/profiler/test_profiler.py::TestProfiler::test_source_multithreaded_multiple_preexisting_work_in_main_thread_True

This comment was automatically generated by Dr. CI and updates every 15 minutes.

pytorch-bot · 2026-02-10T20:05:08Z

This PR needs a `release notes:` label

If your changes are user facing and intended to be a part of release notes, please use a label starting with release notes:.

If not, please add the topic: not user facing label.

To add a label, you can comment to pytorchbot, for example
@pytorchbot label "topic: not user facing"

For more information, see
https://github.com/pytorch/pytorch/wiki/PyTorch-AutoLabel-Bot#why-categorize-for-release-notes-and-how-does-it-work.

meta-codesync · 2026-02-10T20:09:46Z

@drisspg has imported this pull request. If you are a Meta employee, you can view this in D92865015.

alugorey · 2026-02-23T15:53:09Z

@eqy @malfet

Can we get eyes on this? unrelated failures. This was the one that failed some Meta internal tests. @drisspg gave us a reproducer last week and this snapshot of the commit fixed it.

malfet · 2026-02-23T16:20:47Z

@alugorey can you provide a bit more description on the PR what exactly it fixes and what changes were made after initial attempt got reverted

alugorey · 2026-02-24T19:12:06Z

hi @malfet I updated the PR description. Please take a look

alugorey · 2026-02-26T16:14:17Z

@malfet one more look? :)

drisspg · 2026-02-27T01:00:30Z

@claude explain to me what this PR is doing and what was broken before, does the regression test adequately test this

claude · 2026-02-27T01:00:56Z

Claude finished @drisspg's task in 1m 17s —— View job

Analysis Complete

Understand the bug that was fixed
Review the code changes
Analyze the regression test
Provide summary and assessment

What This PR Does

This PR fixes a device-side memory access fault when using dropout with the CK (Composable Kernel) SDPA backend on ROCm/AMD GPUs.

The Bug (Before)

In mha_fwd_ck.hip, the philox seed/offset tensors used for dropout RNG were being created incorrectly:

// OLD (buggy) code - mha_fwd_ck.hip:287-298
auto rng_state = at::empty({2}, opts.dtype(at::kLong));  // Created on CPU!
auto rng_state_ptr = reinterpret_cast<uint64_t*>(rng_state.data_ptr());

// Later passed to GPU kernel - MEMORY FAULT!
auto drop_seed_offset = std::make_pair(rng_state_ptr, rng_state_ptr + 1);

The issue: opts comes from q.options() without explicitly specifying device, and the at::empty call wasn't forcing CUDA placement. When rng_state_ptr was dereferenced on the GPU, it was actually pointing to CPU memory, causing the memory access fault.

The Fix

The fix explicitly creates rng_state on the GPU:

// NEW (fixed) code - mha_fwd_ck.hip:291-293
auto rng_state_options = at::TensorOptions().dtype(at::kUInt64).device(at::kCUDA);
auto rng_state = at::zeros({2}, rng_state_options.dtype(at::kUInt64));

Additionally:

The backward pass (mha_bwd_ck.hip) now directly reads philox seed/offset from GPU tensors rather than copying to local variables
The return tensors seed_t and offset_t are now explicitly created on CUDA with the correct dtype (kUInt64)

Why Previous Reverts

The PR description mentions this is a reland of #154864 which was reverted due to internal test failures. The original fix accidentally changed the output tensor structure of mha_fwd. This version preserves the original return signature while fixing the memory bug.

Regression Test Assessment

The test test_flash_attention_vs_math_ref_grads was modified to parametrize over sdpa_backend (["aotriton", "ck"]):

Strengths:

✅ Tests both forward and backward passes with dropout (dropout_p parameter includes non-zero values)
✅ Tests the CK backend explicitly via torch.backends.cuda.preferred_rocm_fa_library(sdpa_backend)
✅ Properly handles CK's different dropout mask format (lines 3866-3870)
✅ Tests gradient correctness against a math reference

Concerns:

⚠️ The test only runs if PLATFORM_SUPPORTS_CK_SDPA is true, which requires the CK backend to be available. This means CI without ROCm won't exercise this path.
⚠️ The test doesn't explicitly verify that the dropout values are on the GPU or that no memory fault occurs - it relies on the test passing without crashing.
⚠️ The dropout_p parametrization includes [0.0, 0.5] but only 0.5 exercises the fixed code path.

Is it adequate?

For a functional regression test, yes - if the bug were present, the test would crash with a memory access fault when running with dropout_p > 0 on the CK backend. However, a more targeted unit test that specifically verifies tensor device placement would be more robust for catching this class of bug in the future.

Summary

Aspect	Assessment
Bug identified	✅ Clear - philox RNG tensors not on GPU
Fix correctness	✅ Correctly forces CUDA device placement
Test coverage	⚠️ Adequate for regression, but only exercises on ROCm with CK available
Risk of re-regression	Low - the fix is straightforward and explicit

drisspg · 2026-02-27T01:25:46Z

I dont really trust claude here since opts should already have the correct device can you explain why your code changes fix the issue?

jeffdaily · 2026-03-04T17:00:53Z

I dont really trust claude here since opts should already have the correct device can you explain why your code changes fix the issue?

@alugorey IIRC it was a Tensor lifetime issue. Previously the seed/offset were referencing ptrs to Tensors that went out of scope, but this PR guarantees proper lifetime management of seed/offset now.

drisspg

Okay lets try again

alugorey · 2026-03-06T17:06:44Z

Hi @drisspg, Jeff's assessment was right. it was a tensor lifetime issue. This is ready to land 👍

jeffdaily · 2026-03-06T17:38:47Z

@pytorchbot merge -f "unrelated lint infra, other lint is passing; others are known flaky"

pytorchmergebot · 2026-03-06T17:42:07Z

Merge started

Your change will be merged immediately since you used the force (-f) flag, bypassing any CI checks (ETA: 1-5 minutes). Please use -f as last resort and instead consider -i/--ignore-current to continue the merge ignoring current failures. This will allow currently pending tests to finish and report signal before the merge.

Learn more about merging in the wiki.

Questions? Feedback? Please reach out to the PyTorch DevX Team

Advanced Debugging

Check the merge workflow status
here

pytorch-bot · 2026-03-06T17:43:45Z

This PR needs a `release notes:` label

If your changes are user facing and intended to be a part of release notes, please use a label starting with release notes:.

If not, please add the topic: not user facing label.

To add a label, you can comment to pytorchbot, for example
@pytorchbot label "topic: not user facing"

For more information, see
https://github.com/pytorch/pytorch/wiki/PyTorch-AutoLabel-Bot#why-categorize-for-release-notes-and-how-does-it-work.

jeffdaily · 2026-03-10T18:03:17Z

@pytorchbot merge -f "resolved the doc errors that caused the prior revert, all other CI is passing"

pytorchmergebot · 2026-03-10T18:06:07Z

Merge started

Your change will be merged immediately since you used the force (-f) flag, bypassing any CI checks (ETA: 1-5 minutes). Please use -f as last resort and instead consider -i/--ignore-current to continue the merge ignoring current failures. This will allow currently pending tests to finish and report signal before the merge.

Learn more about merging in the wiki.

Questions? Feedback? Please reach out to the PyTorch DevX Team

Advanced Debugging

Check the merge workflow status
here

liangel-02 · 2026-03-10T20:22:44Z

@pytorchbot revert -m "sorry, need to revert this due to #172246 being reverted" -c ghfirst

pytorchmergebot · 2026-03-10T20:26:32Z

@pytorchbot successfully started a revert job. Check the current status here.
Questions? Feedback? Please reach out to the PyTorch DevX Team

pytorchmergebot · 2026-03-10T20:26:41Z

@alugorey your PR has been successfully reverted.

This reverts commit 3f60bc4. Reverted #174708 on behalf of https://github.com/liangel-02 due to sorry, need to revert this due to #172246 being reverted ([comment](#174708 (comment)))

This PR was reopened (likely due to being reverted), so your approval was removed. Please request another review.

Summary: This reverts D95628165, which was an auto-revert of the ROCm SDPA dropout fix (PR pytorch#174708). Re-applying the original dropout fix restores: - Proper dropout seed/offset handling in CK forward path using ParsePhiloxCudaState kernel launch instead of host-side unpack - CK backward dropout using pointer-based seed/offset pairs - AOTriton forward return with inline comments - CK SDPA parametrized testing with proper backend selection - CK-specific dropout mask logic for test comparisons Test Plan: buck build --flagfile fbcode//mode/dev-nosan-amd-gpu fbcode//caffe2:ATen-hip Differential Revision: D98007271 Pulled By: haoyuz

Fixes a device side memory access fault when dropout is used with CK sdpa backend. There was an bug where the philox seed/offset tensors were not being placed on the GPU correctly thus resulting in a memory access fault. This PR is an attempted reland of pytorch#154864 which was reverted a couple times due to internal tests breaking. This was because the output tensors for mha_fwd were mistakenly changed causing dynamo to complain. In the newest commit, we changed the output tensors back to what they were originally while preserving the bug fix for the device side memory assertion. This allowed the dynamo checks to succeed and avoided the original error. Pull Request resolved: pytorch#174708 Approved by: https://github.com/drisspg

This reverts commit c21aa36. Reverted pytorch#174708 on behalf of https://github.com/pytorch-auto-revert due to Reverted automatically by pytorch's autorevert, to avoid this behaviour add the tag autorevert: disable ([comment](pytorch#174708 (comment)))

Fixes a device side memory access fault when dropout is used with CK sdpa backend. There was an bug where the philox seed/offset tensors were not being placed on the GPU correctly thus resulting in a memory access fault. This PR is an attempted reland of pytorch#154864 which was reverted a couple times due to internal tests breaking. This was because the output tensors for mha_fwd were mistakenly changed causing dynamo to complain. In the newest commit, we changed the output tensors back to what they were originally while preserving the bug fix for the device side memory assertion. This allowed the dynamo checks to succeed and avoided the original error. Pull Request resolved: pytorch#174708 Approved by: https://github.com/jeffdaily

This reverts commit 3f60bc4. Reverted pytorch#174708 on behalf of https://github.com/liangel-02 due to sorry, need to revert this due to pytorch#172246 being reverted ([comment](pytorch#174708 (comment)))

Summary: Pull Request resolved: pytorch#178713 This reverts D95628165, which was an auto-revert of the ROCm SDPA dropout fix (PR pytorch#174708). Re-applying the original dropout fix restores: - Proper dropout seed/offset handling in CK forward path using ParsePhiloxCudaState kernel launch instead of host-side unpack - CK backward dropout using pointer-based seed/offset pairs - AOTriton forward return with inline comments - CK SDPA parametrized testing with proper backend selection - CK-specific dropout mask logic for test comparisons Test Plan: buck build --flagfile fbcode//mode/dev-nosan-amd-gpu fbcode//caffe2:ATen-hip Reviewed By: leitian Differential Revision: D98007271 Pulled By: haoyuz

Summary: Pull Request resolved: #178713 This reverts D95628165, which was an auto-revert of the ROCm SDPA dropout fix (PR #174708). Re-applying the original dropout fix restores: - Proper dropout seed/offset handling in CK forward path using ParsePhiloxCudaState kernel launch instead of host-side unpack - CK backward dropout using pointer-based seed/offset pairs - AOTriton forward return with inline comments - CK SDPA parametrized testing with proper backend selection - CK-specific dropout mask logic for test comparisons Test Plan: buck build --flagfile fbcode//mode/dev-nosan-amd-gpu fbcode//caffe2:ATen-hip Reviewed By: leitian Differential Revision: D98007271 Pulled By: haoyuz

Summary: Pull Request resolved: pytorch#178713 This reverts D95628165, which was an auto-revert of the ROCm SDPA dropout fix (PR pytorch#174708). Re-applying the original dropout fix restores: - Proper dropout seed/offset handling in CK forward path using ParsePhiloxCudaState kernel launch instead of host-side unpack - CK backward dropout using pointer-based seed/offset pairs - AOTriton forward return with inline comments - CK SDPA parametrized testing with proper backend selection - CK-specific dropout mask logic for test comparisons Test Plan: buck build --flagfile fbcode//mode/dev-nosan-amd-gpu fbcode//caffe2:ATen-hip Reviewed By: leitian Differential Revision: D98007271 Pulled By: haoyuz

Summary: This reverts D95628165, which was an auto-revert of the ROCm SDPA dropout fix (PR #174708). Re-applying the original dropout fix restores: - Proper dropout seed/offset handling in CK forward path using ParsePhiloxCudaState kernel launch instead of host-side unpack - CK backward dropout using pointer-based seed/offset pairs - AOTriton forward return with inline comments - CK SDPA parametrized testing with proper backend selection - CK-specific dropout mask logic for test comparisons Test Plan: buck build --flagfile fbcode//mode/dev-nosan-amd-gpu fbcode//caffe2:ATen-hip Reviewed By: leitian Differential Revision: D98007271 Pulled By: haoyuz Pull Request resolved: #178713 Approved by: https://github.com/echen4096 Co-authored-by: Andy.LugoReyes@amd.com <Andy.LugoReyes@amd.com>

alugorey · 2026-04-03T21:08:48Z

This functionality landed here: 980f368

closing

Summary: This reverts D95628165, which was an auto-revert of the ROCm SDPA dropout fix (PR #174708). Re-applying the original dropout fix restores: - Proper dropout seed/offset handling in CK forward path using ParsePhiloxCudaState kernel launch instead of host-side unpack - CK backward dropout using pointer-based seed/offset pairs - AOTriton forward return with inline comments - CK SDPA parametrized testing with proper backend selection - CK-specific dropout mask logic for test comparisons Test Plan: buck build --flagfile fbcode//mode/dev-nosan-amd-gpu fbcode//caffe2:ATen-hip Reviewed By: leitian Differential Revision: D98007271 Pulled By: haoyuz Pull Request resolved: #178713 Approved by: https://github.com/echen4096 Co-authored-by: Andy.LugoReyes@amd.com <Andy.LugoReyes@amd.com>

…torch#178713) Summary: This reverts D95628165, which was an auto-revert of the ROCm SDPA dropout fix (PR pytorch#174708). Re-applying the original dropout fix restores: - Proper dropout seed/offset handling in CK forward path using ParsePhiloxCudaState kernel launch instead of host-side unpack - CK backward dropout using pointer-based seed/offset pairs - AOTriton forward return with inline comments - CK SDPA parametrized testing with proper backend selection - CK-specific dropout mask logic for test comparisons Test Plan: buck build --flagfile fbcode//mode/dev-nosan-amd-gpu fbcode//caffe2:ATen-hip Reviewed By: leitian Differential Revision: D98007271 Pulled By: haoyuz Pull Request resolved: pytorch#178713 Approved by: https://github.com/echen4096 Co-authored-by: Andy.LugoReyes@amd.com <Andy.LugoReyes@amd.com>

Summary: This reverts D95628165, which was an auto-revert of the ROCm SDPA dropout fix (PR #174708). Re-applying the original dropout fix restores: - Proper dropout seed/offset handling in CK forward path using ParsePhiloxCudaState kernel launch instead of host-side unpack - CK backward dropout using pointer-based seed/offset pairs - AOTriton forward return with inline comments - CK SDPA parametrized testing with proper backend selection - CK-specific dropout mask logic for test comparisons Test Plan: buck build --flagfile fbcode//mode/dev-nosan-amd-gpu fbcode//caffe2:ATen-hip Reviewed By: leitian Differential Revision: D98007271 Pulled By: haoyuz Pull Request resolved: #178713 Approved by: https://github.com/echen4096 Co-authored-by: Andy.LugoReyes@amd.com <Andy.LugoReyes@amd.com>

alugorey requested review from Aidyn-A, eqy and syed-ahmed as code owners February 10, 2026 20:05

pytorch-bot Bot added ciflow/rocm-mi300 Trigger "default" config CI on ROCm MI300 module: rocm AMD GPU support for Pytorch labels Feb 10, 2026

pytorchbot added the open source label Feb 10, 2026

bdhirsh requested a review from jeffdaily February 10, 2026 23:20

bdhirsh added the triaged This issue has been looked at a team member, and triaged and prioritized into an appropriate module label Feb 10, 2026

alugorey mentioned this pull request Feb 24, 2026

[ROCm] SDPA fix mem fault when dropout is enabled #154864

Closed

drisspg previously approved these changes Mar 5, 2026

View reviewed changes

pytorchmergebot added the merging label Mar 6, 2026

pytorchmergebot added the Merged label Mar 6, 2026

pytorchmergebot closed this in c21aa36 Mar 6, 2026

pytorchmergebot removed the merging label Mar 6, 2026

jithunnair-amd added the ciflow/trunk Trigger trunk jobs on your pull request label Mar 9, 2026

jeffdaily previously approved these changes Mar 10, 2026

View reviewed changes

pytorchmergebot added the merging label Mar 10, 2026

pytorchmergebot closed this in 3f60bc4 Mar 10, 2026

pytorchmergebot removed the merging label Mar 10, 2026

pytorchmergebot reopened this Mar 10, 2026

haoyuz mentioned this pull request Mar 29, 2026

[ROCm] Reland SDPA dropout fix (#174708)" (#178713) #178713

Closed

alugorey closed this Apr 3, 2026

Conversation

alugorey commented Feb 10, 2026 • edited Loading Uh oh! There was an error while loading. Please reload this page.

Uh oh!

Uh oh!

pytorch-bot Bot commented Feb 10, 2026 • edited Loading Uh oh! There was an error while loading. Please reload this page.

Uh oh!

🔗 Helpful Links

🧪 See artifacts and rendered test results at hud.pytorch.org/pr/174708

❌ 1 New Failure

Uh oh!

pytorch-bot Bot commented Feb 10, 2026

This PR needs a release notes: label

Uh oh!

meta-codesync Bot commented Feb 10, 2026

Uh oh!

alugorey commented Feb 23, 2026

Uh oh!

malfet commented Feb 23, 2026

Uh oh!

alugorey commented Feb 24, 2026

Uh oh!

alugorey commented Feb 26, 2026

Uh oh!

drisspg commented Feb 27, 2026

Uh oh!

claude Bot commented Feb 27, 2026 • edited Loading Uh oh! There was an error while loading. Please reload this page.

Uh oh!

Analysis Complete

What This PR Does

The Bug (Before)

The Fix

Why Previous Reverts

Regression Test Assessment

Summary

Uh oh!

drisspg commented Feb 27, 2026

Uh oh!

jeffdaily commented Mar 4, 2026

Uh oh!

drisspg left a comment

Choose a reason for hiding this comment

Uh oh!

alugorey commented Mar 6, 2026

Uh oh!

jeffdaily commented Mar 6, 2026

Uh oh!

pytorchmergebot commented Mar 6, 2026

Merge started

Uh oh!

pytorch-bot Bot commented Mar 6, 2026

This PR needs a release notes: label

Uh oh!

jeffdaily commented Mar 10, 2026

Uh oh!

pytorchmergebot commented Mar 10, 2026

Merge started

Uh oh!

liangel-02 commented Mar 10, 2026

Uh oh!

pytorchmergebot commented Mar 10, 2026

Uh oh!

pytorchmergebot commented Mar 10, 2026

Uh oh!

alugorey commented Apr 3, 2026

Uh oh!

Reviewers

Assignees

Labels

Projects

Milestone

Development

Uh oh!

9 participants

alugorey commented Feb 10, 2026 •

edited

Loading

pytorch-bot Bot commented Feb 10, 2026 •

edited

Loading

This PR needs a `release notes:` label

claude Bot commented Feb 27, 2026 •

edited

Loading

This PR needs a `release notes:` label