
tests: skip local NVML runtime mismatches while preserving CI failures#1739

Closed
cpcloud wants to merge 4 commits into NVIDIA:main from cpcloud:fix/nvml-local-skip-ci-fail

Conversation


cpcloud commented Mar 7, 2026

Summary

  • Centralize NVML runtime gating through require_nvml_runtime_or_skip_local fixtures in cuda_bindings and cuda_core tests.
  • Treat NVML init/load failures (including driver/library mismatch and missing NVML shared library) as local skips, but continue to fail in CI by re-raising.
  • Cover the behavior with regression tests and apply fixture-based gating across NVML-dependent test modules/fixtures.
  • Rationale: after a driver upgrade without a reboot (a common local developer state), NVML can report a temporary driver/library mismatch; local runs should skip NVML-dependent tests instead of failing collection, while CI should still fail fast for real infra regressions.
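The skip-locally/fail-in-CI decision described above can be sketched as a small predicate. This is a minimal sketch, not the actual fixture code: the error names and the CI detection via the `CI` environment variable are illustrative assumptions.

```python
import os

# Hypothetical labels for NVML init/load failures that indicate a local
# runtime problem (e.g. driver/library mismatch after an upgrade without a
# reboot, or a missing libnvidia-ml). These are NOT the real NVML constants.
LOCAL_RUNTIME_ERRORS = {"driver_library_mismatch", "library_not_found"}


def nvml_failure_action(error_kind, environ=None):
    """Decide how a test run should react to an NVML init failure.

    Returns "skip" for known local-only runtime issues when not in CI, and
    "fail" otherwise, so CI runs still surface real infra regressions.
    """
    environ = os.environ if environ is None else environ
    in_ci = environ.get("CI", "").lower() not in ("", "0", "false")
    if not in_ci and error_kind in LOCAL_RUNTIME_ERRORS:
        return "skip"
    return "fail"
```

In a real conftest, a fixture wrapping this logic would call `pytest.skip(...)` on "skip" and re-raise the original NVML exception on "fail".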

Test plan

  • pixi run --manifest-path cuda_bindings pytest cuda_bindings/tests --override-ini norecursedirs=examples -k "not test_cufile"
  • CI=1 pixi run --manifest-path cuda_bindings pytest cuda_bindings/tests/nvml/test_init.py::test_init_ref_count (expected error on NVML mismatch in CI mode)
  • pixi run --manifest-path cuda_core test (currently blocked in this workspace by unrelated import mismatch: cuda.core._resource_handles does not export expected C function create_culink_handle)
  • CI=1 pixi run --manifest-path cuda_core pytest cuda_core/tests/system/test_system_system.py::test_num_devices (same unrelated import mismatch blocker)

Made with Cursor


copy-pr-bot bot commented Mar 7, 2026

Auto-sync is disabled for ready-for-review pull requests in this repository. Workflows must be run manually.


cpcloud added 3 commits March 7, 2026 11:51
Driver upgrades without a reboot can temporarily leave NVML in a driver/library mismatch state, which is a common local developer scenario. Route NVML-dependent checks through shared fixtures/helpers so local runs skip cleanly while CI still fails fast on real NVML init/load regressions.

Made-with: Cursor
Run repository hooks and keep the NVML fixture changes compliant by applying ruff import ordering and formatting adjustments.

Made-with: Cursor
Apply hook-driven import ordering/spacing updates introduced by rebasing onto upstream/main so pre-commit passes cleanly.

Made-with: Cursor
cpcloud force-pushed the fix/nvml-local-skip-ci-fail branch from 3ae1ece to 4272269 on March 7, 2026

cpcloud commented Mar 7, 2026

/ok to test


github-actions bot commented Mar 7, 2026


mdboom left a comment


This looks great, in terms of centralizing all of this logic.

However, I'm not sure why some tests unrelated to NVML now have this tagging.

And within system/test_system_system.py, we need to keep most of those tests still running even when NVML is totally missing.



@pytest.mark.parametrize("change_device", [True, False])
@pytest.mark.usefixtures("require_nvml_runtime_or_skip_local")
Contributor


Why do the tests here now require a working NVML? These tests predate NVML in cuda_bindings... What's the root cause?

Contributor Author


I'll look into why this ends up being required.

Contributor Author


get_num_devices will use NVML if it's available. So, yes, these are pre-existing tests, but they now exercise additional APIs.

Now that they route through NVML when it's available, they're a place where we need to skip if NVML is present but fails in an expected way.

What's the root cause?

The root cause is that I upgraded the driver without rebooting. Since NVML is a driver library, I can no longer use it without a reboot.

I don't want to reboot to keep working in the repo, especially if I'm working on something unrelated to any of this code.

Contributor


Thanks for the clarification. That makes sense.

However, I'm not sure we want to pave over broken installations like this (though I feel your pain about wanting to continue working with a partially-broken install). The test suite can't know whether things are broken because of local installation issues or changes to the code that have broken things, and this would pave over the latter. I know this would eventually get caught in CI, but I'm not a fan of having local and remote testing behave differently.

Member


@cpcloud there is a way to update the driver without rebooting: unload and reload the kernel modules with rmmod and/or modprobe; see, e.g.,
https://forums.developer.nvidia.com/t/reset-driver-without-rebooting-on-linux/40625/2
Some of our internal infra uses this, for example, to achieve <1 min driver refreshes at node-allocation time.
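The module reload mentioned above looks roughly like the following. This is a hedged sketch only: module names vary by driver generation, the commands need root, and every process holding the GPU (display manager, CUDA programs, nvidia-persistenced) must be stopped first or rmmod will refuse.

```shell
# Stop services holding the GPU (add your display manager if applicable).
sudo systemctl stop nvidia-persistenced

# Unload the NVIDIA kernel modules, dependents first.
sudo rmmod nvidia_uvm nvidia_drm nvidia_modeset nvidia

# Load the freshly installed driver's modules.
sudo modprobe nvidia nvidia_uvm

# NVML should now match the kernel driver again.
nvidia-smi
```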


from .conftest import skip_if_nvml_unsupported

pytestmark = skip_if_nvml_unsupported
Contributor


Most of the tests in this file are expected to run, other than test_gpu_driver_version, even without an NVML available.

Contributor Author


I'll look into why this ends up being required.

Contributor Author


Not required

Remove the module-level pytestmark from test_system_system.py and the
per-test require_nvml_runtime_or_skip_local markers from test_memory.py.
These tests don't inherently need NVML; the NVML-specific tests already
have individual @skip_if_nvml_unsupported decorators.

Made-with: Cursor
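The commit above amounts to replacing a module-wide pytestmark with per-test decorators, so only the genuinely NVML-dependent tests are gated. A minimal sketch (the skipif condition here is a stand-in for the repo's real NVML availability probe in conftest.py):

```python
import pytest

# Stand-in for the repo's skip_if_nvml_unsupported helper; the real marker
# is built from an actual NVML availability check, not a constant.
skip_if_nvml_unsupported = pytest.mark.skipif(False, reason="NVML unsupported")


@skip_if_nvml_unsupported          # gated: needs a working NVML runtime
def test_gpu_driver_version():
    pass


def test_num_devices():            # ungated: runs even when NVML is absent
    pass
```

With per-test decorators, a missing or broken NVML skips only `test_gpu_driver_version`, whereas a module-level `pytestmark` would have skipped everything in the file.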
cpcloud requested a review from mdboom on March 10, 2026

cpcloud commented Mar 10, 2026

/ok to test


mdboom left a comment


I'm kind of wary of this change papering over failures on broken installations. Maybe let's get a second opinion?


cpcloud commented Mar 11, 2026

Not worth the review effort/delay. By the time anyone else chimes in, everyone will have forgotten why this was done and it'll remain in limbo.

cpcloud closed this Mar 11, 2026
cpcloud deleted the fix/nvml-local-skip-ci-fail branch on March 11, 2026
github-actions bot pushed a commit that referenced this pull request Mar 12, 2026
Removed preview folders for the following PRs:
- PR #1739

leofang commented Mar 12, 2026

Agreed with Mike. The system needs to be in a sane state. Running nvidia-smi prior to any CUDA program, as in our CI, would catch such issues.

