[CUDA] Implement MaskedScatter #3151
Conversation
Pull request overview
This PR implements CUDA support for the MaskedScatter operation, which scatters values from a source array into a destination array at positions specified by a boolean mask. The implementation follows the existing Metal backend pattern and properly integrates with the CUDA backend infrastructure.
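The operation's semantics can be sketched with NumPy's boolean-mask assignment as a stand-in (the arrays and names below are illustrative, not the MLX API): source values are consumed in order and written at the positions where the mask is `True`.

```python
import numpy as np

# Illustrative sketch of masked-scatter semantics (NumPy stand-in,
# not the MLX API): src values fill the True positions of mask, in order.
dst = np.zeros(6, dtype=np.int32)
mask = np.array([True, False, True, True, False, False])
src = np.array([10, 20, 30], dtype=np.int32)

dst[mask] = src[: mask.sum()]  # scatter src into dst where mask is True
print(dst.tolist())  # → [10, 0, 20, 30, 0, 0]
```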
Changes:
- Converted `indexing.cpp` to `indexing.cu` and added a full CUDA `MaskedScatter::eval_gpu` implementation with a `masked_assign` kernel
- Refactored scan launch logic into a reusable `scan_gpu_inplace` function with a new header file
- Removed CUDA skip entries for masked-scatter-related tests
Reviewed changes
Copilot reviewed 8 out of 8 changed files in this pull request and generated no comments.
| File | Description |
|---|---|
| mlx/backend/cuda/indexing.cu | Implemented masked_assign CUDA kernel and MaskedScatter::eval_gpu method; converted from .cpp to .cu |
| mlx/backend/cuda/scan.cu | Refactored scan logic into scan_gpu_inplace function for reuse in MaskedScatter |
| mlx/backend/cuda/scan.h | Added header declaring scan_gpu_inplace function |
| mlx/backend/cuda/primitives.cpp | Removed NO_GPU(MaskedScatter) macro to enable CUDA support |
| mlx/backend/cuda/CMakeLists.txt | Updated build to compile indexing.cu instead of indexing.cpp |
| tests/ops_tests.cpp | Removed CUDA skip guard from masked_scatter tests |
| tests/autograd_tests.cpp | Removed CUDA skip guard from masked_scatter autograd tests |
| python/tests/cuda_skip.py | Removed three masked-scatter-related test entries from skip list |
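The file summary shows `MaskedScatter` reusing the scan machinery. A plausible reason, sketched here with NumPy as a stand-in (the variable names are illustrative), is that an exclusive prefix sum over the boolean mask gives each `True` position the index of the source element to scatter there:

```python
import numpy as np

# Hedged sketch: an exclusive scan over the mask produces per-position
# source offsets for a masked scatter (NumPy stand-in for the GPU scan).
mask = np.array([1, 0, 1, 1, 0], dtype=np.int32)
offsets = np.cumsum(mask) - mask  # exclusive prefix sum of the mask

# offsets[i] is the source index for destination i (valid where mask[i] == 1)
print(offsets.tolist())  # → [0, 1, 1, 2, 3]
```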
Comments suppressed due to low confidence (1)
mlx/backend/cuda/indexing.cu:80
- The `masked_assign` kernel uses a signed 32-bit `IdxT` together with `stride = static_cast<IdxT>(blockDim.x) * gridDim.x * gridDim.y * gridDim.z`, which can overflow when `mask_flat.size()` approaches `INT32_MAX`, causing `stride` to wrap negative while `total` remains positive. In that case, the loop `for (IdxT idx = thread_id; idx < total; idx += stride)` can revisit with negative `idx` values and read/write `mask[idx]`, `scatter_offsets[idx]`, and `out[idx]` out of bounds, leading to GPU memory corruption and potential data exposure or code execution in contexts that rely on untrusted shapes. To address this, ensure the index type used in this kernel cannot overflow for the chosen grid/block configuration (e.g., use an unsigned or 64-bit index consistently for `IdxT` when computing `block_id`, `stride`, and indexing, or otherwise constrain `gridDim`/`blockDim` so their product fits safely in the index type).
zcbenz
left a comment
Looks good to me, would like another review before merging.
Force-pushed from 39962fe to f5693f7
nastya236
left a comment
Looks good to me as well!
nastya236
left a comment
As I said, it looks great. Thanks for your contribution.
Could you please provide bandwidth numbers for masked scatter kernel for a range of shapes?
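One way to produce the requested numbers, sketched here with NumPy standing in for the GPU kernel (the function name and byte accounting are illustrative, not MLX's benchmark harness):

```python
import time
import numpy as np

# Hedged sketch of estimating effective bandwidth for a masked-scatter-style
# op across shapes. NumPy stands in for the GPU kernel; bandwidth_gbps and
# its byte accounting are illustrative assumptions, not an MLX utility.
def bandwidth_gbps(n, trials=5, seed=0):
    rng = np.random.default_rng(seed)
    dst = np.zeros(n, dtype=np.float32)
    mask = rng.random(n) > 0.5
    src = rng.random(int(mask.sum())).astype(np.float32)

    start = time.perf_counter()
    for _ in range(trials):
        dst[mask] = src  # the masked-scatter step being timed
    elapsed = (time.perf_counter() - start) / trials

    # Bytes touched per pass: read the mask, read the source values,
    # and write one destination element per True in the mask.
    bytes_moved = mask.nbytes + 2 * src.nbytes
    return bytes_moved / elapsed / 1e9

for n in (1 << 16, 1 << 20):
    print(f"n={n}: {bandwidth_gbps(n):.2f} GB/s")
```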
@nastya236 bench result from
Force-pushed from 19571bc to e70c1fc
@nastya236 What changed:
On my setup, MLX is now faster than Torch in all benchmark cases, and the prior large-shape degradation trend is resolved.
Could you please take another look?
Thanks for the update! I will look as soon as possible.
I think the performance improvement is impressive, but the new kernel code is complicated and really hard to review, and we also want to avoid using the CUB device APIs, as the overhead of graph capture easily eliminates the performance gain. I would say the initial PR was already good enough: the code was clean and achieved reasonable performance. Further optimizations should be judged against real-world needs; otherwise we end up with a large piece of code that is hard to maintain and never used. @nastya236 What do you think if we just merge the initial version?
I agree with you @zcbenz. I think the initial masked scatter is slow for larger shapes because of the scan. Just out of curiosity I checked PyTorch's masked scatter, and it is identical to what was proposed by @Lyxot initially. I think we can merge the first version of masked scatter, and if needed, the scan improvement should be a separate PR. @Lyxot thanks for exploring the faster approach.
@nastya236 @zcbenz Should we preserve e70c1fc and f1c2a0b? These two small commits provide a ~1.1x-1.4x performance improvement over the initial version, while keeping the implementation relatively simple.
Force-pushed from fead885 to 7370a22
zcbenz
left a comment
The new changes look good to me, thanks for updating the PR!


Proposed changes
This PR adds CUDA support for `MaskedScatter`.

Changed files

- `mlx/backend/cuda/indexing.cpp`: implemented CUDA `MaskedScatter::eval_gpu` using the CUDA JIT module path.
- `mlx/backend/cuda/device/scatter.cuh`: added the JIT device kernel `masked_scatter_assign<...>` used by CUDA masked scatter.
- `mlx/backend/cuda/scan.cu`: refactored scan execution into a reusable `scan_gpu_inplace(...)` and updated `Scan::eval_gpu` to delegate to it.
- `mlx/backend/cuda/scan.h`: added a declaration for `scan_gpu_inplace(...)`.
- `python/tests/cuda_skip.py`, `tests/ops_tests.cpp`, `tests/autograd_tests.cpp`: removed CUDA skip entries for masked-scatter-related tests.

Validation
- `python -m pytest python/tests/test_ops.py -k masked_scatter -q` passed.
- `python -m pytest python/tests/test_vmap.py -k vmap_masked_scatter -q` passed.
- `python -m pytest python/tests/test_array.py -k setitem_with_boolean_mask -q` passed.
- `build/tests/tests -tc="test masked_scatter,test masked_scatter autograd"` passed.

Checklist
Put an `x` in the boxes that apply.

- [x] I ran `pre-commit run --all-files` to format my code / installed pre-commit prior to committing changes