cuda.bindings latency benchmarks by danielfrg · Pull Request #1736 · NVIDIA/cuda-python

danielfrg · 2026-03-06T23:21:18Z

Description

@leofang @mdboom I migrated one benchmark from the pytest suite to use pyperf and added a C++ equivalent.

Added a small benchmark discovery step that finds bench_*.py files containing bench_*() functions.
The Python benchmarks use pyperf's bench_time_func.
The C++ benchmarks output pyperf-compatible JSON, so both sides can be analyzed with the same pyperf stats / pyperf hist commands.
The README explains how to run it in the different environments using pixi.
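
For readers who want a concrete picture of the discovery step, here is a minimal sketch using only the standard library. It is not the runner's actual code: the function name `discover_benchmarks` and the `file_stem.function_name` key format are assumptions made for illustration.

```python
import importlib.util
import inspect
from pathlib import Path
from typing import Callable, Dict


def discover_benchmarks(root: Path) -> Dict[str, Callable]:
    """Collect bench_*() functions from bench_*.py files directly under root.

    Hypothetical helper: the name and the returned key format are
    illustrative and not taken from the PR.
    """
    found: Dict[str, Callable] = {}
    for path in sorted(root.glob("bench_*.py")):
        # Load each file as a standalone module named after its stem.
        spec = importlib.util.spec_from_file_location(path.stem, path)
        module = importlib.util.module_from_spec(spec)
        spec.loader.exec_module(module)
        for name, obj in inspect.getmembers(module, inspect.isfunction):
            # Keep only bench_* functions defined in this file, skipping
            # helpers and anything imported from other modules.
            if name.startswith("bench_") and obj.__module__ == module.__name__:
                found[f"{path.stem}.{name}"] = obj
    return found
```

Each discovered callable could then be registered with the runner under its key.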

The benchmark is cuPointerGetAttribute; both Python and C++ call the same driver API with error checking.

This is one set of results for Python and C++ on my system, so we are OK, under 1 us. The two sides don't yet use the same warmup and run counts; I still need to finish that, but this should give you an idea.

# Python (pyperf bench_time_func)
bindings.pointer_attributes.pointer_get_attribute: Mean +- std dev: 603 ns +- 25 ns

# C++ (driver API baseline)
cpp.pointer_attributes.pointer_get_attribute: Mean +- std dev: 29 ns +- 1 ns
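
As a rough reading of the two means above (the runs do not yet share warmup or loop settings, so the gap is indicative only), the per-call difference works out as follows. The variable names are illustrative.

```python
python_ns = 603.0  # mean reported for the Python binding above
cpp_ns = 29.0      # mean reported for the C++ driver API baseline above

# Difference and ratio between the binding call and the raw driver call.
overhead_ns = python_ns - cpp_ns
ratio = python_ns / cpp_ns

print(f"binding overhead: {overhead_ns:.0f} ns per call ({ratio:.1f}x the C++ baseline)")
```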

I still need to work on matching parameters across all the benchmarks and so on, but I wanted to get feedback first on whether this looks fine to keep going.

Checklist

New or existing tests cover these changes.
The documentation is up to date with these changes.

copy-pr-bot · 2026-03-06T23:21:22Z

This pull request requires additional validation before any workflows can run on NVIDIA's runners.

Pull request vetters can view their responsibilities here.

Contributors can view more details about this message here.

mdboom

I'm marking this as "approve" even though I have some questions inline, since I think it's totally fine to merge this and iterate if that's the easiest way forward.

(I am not a regular pixi user...) I tried to follow the instructions but I get:

 pixi run -e source bench
Error:   × failed to solve requirements of environment 'source' for platform 'linux-64'
  ├─▶   × failed to solve the environment
  │
  ╰─▶ Cannot solve the request because of: cuda-bindings * cannot be installed because there are no viable options:
      └─ cuda-bindings 13.1.0 would require
         └─ cuda-nvrtc >=13.2.51,<14.0a0, which cannot be installed because there are no viable options:
            └─ cuda-nvrtc 13.2.51 would require
               └─ cuda-version >=13.2,<13.3.0a0, for which no candidates were found.

 pixi run -e wheel bench
Error:   × failed to solve requirements of environment 'source' for platform 'linux-64'
  ├─▶   × failed to solve the environment
  │
  ╰─▶ Cannot solve the request because of: cuda-bindings * cannot be installed because there are no viable options:
      └─ cuda-bindings 13.1.0 would require
         └─ cuda-nvrtc >=13.2.51,<14.0a0, which cannot be installed because there are no viable options:
            └─ cuda-nvrtc 13.2.51 would require
               └─ cuda-version >=13.2,<13.3.0a0, for which no candidates were found.

mdboom · 2026-03-12T16:11:11Z

cuda_bindings/benchmarks/bench_pointer_attributes.py

@@ -0,0 +1,17 @@
+# SPDX-FileCopyrightText: Copyright (c) 2026 NVIDIA CORPORATION & AFFILIATES. All rights reserved.


Maybe move this file into a benchmarks subdirectory so as not to clutter the top-level of the cuda_bindings/benchmarks directory.

mdboom · 2026-03-12T16:11:43Z

cuda_bindings/benchmarks/cpp/CMakeLists.txt

+    message(FATAL_ERROR "Could not find libcuda. Ensure the NVIDIA driver is installed.")
+endif()
+
+add_executable(bench_pointer_attributes_cpp bench_pointer_attributes.cpp)


Did you forget to commit bench_pointer_attributes.cpp?

mdboom · 2026-03-12T16:20:28Z

cuda_bindings/benchmarks/runner/main.py

+    def time_func(loops: int) -> float:
+        t0 = time.perf_counter()
+        for _ in range(loops):
+            fn()


I appreciate the decorator approach here, but this means we will be measuring the overhead of this Python function call, in addition to the actual cuda_bindings function call we are measuring.

Even though it's less convenient, I think we need to manually inline this timing benchmark into the function itself and not use this wrapper in order to get accurate timings.
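
To make the concern concrete, here is a small sketch that contrasts the two shapes. It runs without a GPU because `fake_driver_call` is a hypothetical stand-in for the cuda_bindings call; the function names are illustrative and not taken from the PR.

```python
import time


def fake_driver_call():
    """Hypothetical stand-in for the cuda_bindings call under test."""
    return 0


def time_via_wrapper(loops, fn=fake_driver_call):
    # Wrapper shape: the benchmark body is its own Python function,
    # so every iteration pays one extra Python-level call.
    def bench():
        fn()

    t0 = time.perf_counter()
    for _ in range(loops):
        bench()
    return time.perf_counter() - t0


def time_inlined(loops, fn=fake_driver_call):
    # Inlined shape: the timed loop calls the target directly.
    t0 = time.perf_counter()
    for _ in range(loops):
        fn()
    return time.perf_counter() - t0
```

Calling both with the same `loops` value and dividing by `loops` gives a per-call estimate of the extra layer; the exact figure will vary by machine and Python version.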

mdboom · 2026-03-12T16:22:11Z

cuda_bindings/benchmarks/README.md

+
+- `bench`: Runs the Python benchmarks
+- `bench-cpp`: Runs the C++ benchmarks
+


Maybe mention pyperf system tune here?

mdboom · 2026-03-12T16:25:28Z

cuda_bindings/benchmarks/bench_pointer_attributes.py

+
+
+def bench_pointer_get_attribute() -> None:
+    err, _ = cuda.cuPointerGetAttribute(ATTRIBUTE, PTR)


When this is refactored to do its own timing measurement, the PTR and ATTRIBUTE vars should also be moved here (but outside of the loop) so the Python compiler will use fast local variable lookups rather than global lookups.
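
A sketch of what the refactored benchmark might look like, assuming the pyperf `bench_time_func` convention of a function that takes `loops` and returns elapsed seconds. The stand-in call and the constant values are hypothetical so that the example runs without a GPU.

```python
import time

# Hypothetical stand-ins; the real module would use the cuda_bindings
# attribute enum and a device pointer instead.
ATTRIBUTE = 1
PTR = 0x1000


def fake_pointer_get_attribute(attribute, ptr):
    """Hypothetical stand-in for cuda.cuPointerGetAttribute."""
    return 0, None


def bench_pointer_get_attribute(loops: int) -> float:
    # Bind globals to locals once, outside the timed loop, so each
    # iteration uses local variable lookups instead of global lookups.
    fn = fake_pointer_get_attribute
    attribute = ATTRIBUTE
    ptr = PTR

    t0 = time.perf_counter()
    for _ in range(loops):
        fn(attribute, ptr)
    return time.perf_counter() - t0
```

A function with this signature can be passed to `pyperf.Runner().bench_time_func(name, bench_pointer_get_attribute)`, which supplies the `loops` argument.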

danielfrg · 2026-03-12T16:52:09Z

Thanks for the comments! I don't think we need to merge now. I'll address the comments, and once we are happy with the template we have here we can commit; then in another PR I can just add more benchmarks.

danielfrg added 3 commits March 6, 2026 12:06

Pixi file for wheels and source

5b5d911

pyperf runner with one pointer benchmark

9850c41

Add pointer benchmark in Cpp too

f35d320

leofang assigned danielfrg Mar 7, 2026

leofang requested review from leofang and mdboom March 7, 2026 02:40

leofang added this to the cuda.bindings next milestone Mar 7, 2026

danielfrg mentioned this pull request Mar 9, 2026

Python latency testing & benchmarking #1580

Open

mdboom approved these changes Mar 12, 2026

danielfrg wants to merge 3 commits into main from cuda-bindings-bench
