NVML v12 vs v13 ABI incompatibility, incorrect NVML_FI_PWR_SMOOTHING_* constants in current nvml-wrapper-sys release #134

@blthayer

TL;DR - it appears nvml-wrapper-sys has incorrect NVML_FI field numbering for recently added fields. NVML v13 renumbered fields, so users of nvml-wrapper and nvml-wrapper-sys will get silently incorrect data on machines with CUDA 13. Perhaps a runtime version check that crashes out for v13 is in order? It does appear that NVIDIA at least somewhat documented this: https://docs.nvidia.com/deploy/nvml-api/known-issues.html#known-issues

Caveat: I did leverage AI to generate this as it's a hell of a lot faster (and more patient) than me when comparing and describing differences between two massive header files.

AI generated report below:

NVML Field ID ABI Break Between CUDA 12.9 and CUDA 13.2

Summary

The NVML field IDs used with nvmlDeviceGetFieldValues (the NVML_FI_* constants) were
renumbered between the CUDA 12.9 and CUDA 13.2 releases. Fields from ID 251 onward
are assigned different meanings depending on which NVML runtime version is present on the
host. Code compiled against one version's header and run against the other version's
runtime will silently query the wrong fields and receive incorrect data — no error is
returned.

This also means the NVML_FI_PWR_SMOOTHING_* constants currently shipped in
nvml-wrapper-sys are incorrect for any CUDA 12 deployment: they carry v13-style IDs
but the crate declares NVML API version 12.


Affected Versions

| Artifact | Version |
| --- | --- |
| CUDA 12 package | cuda-nvml-dev-12-9_12.9.79-1_amd64 |
| CUDA 13 package | cuda-nvml-dev-13-2_13.2.51-1_amd64 |
| NVML API version (v12 header) | 12 (#define NVML_API_VERSION 12) |
| NVML API version (v13 header) | 13 (#define NVML_API_VERSION 13) |

How to Obtain the Headers Without Installing CUDA

The nvml.h header is distributed inside the cuda-nvml-dev-* Debian packages available
from NVIDIA's public repository. No CUDA installation or root access is required — you can
download and extract the header directly.

CUDA 12.9 (latest v12)

curl -O https://developer.download.nvidia.com/compute/cuda/repos/ubuntu2204/x86_64/cuda-nvml-dev-12-9_12.9.79-1_amd64.deb
mkdir nvml-v12 && cd nvml-v12
ar x ../cuda-nvml-dev-12-9_12.9.79-1_amd64.deb
tar -xf data.tar.*
# Header is at:
# ./usr/local/cuda-12.9/targets/x86_64-linux/include/nvml.h

CUDA 13.2 (latest v13 at time of writing)

curl -O https://developer.download.nvidia.com/compute/cuda/repos/ubuntu2204/x86_64/cuda-nvml-dev-13-2_13.2.51-1_amd64.deb
mkdir nvml-v13 && cd nvml-v13
ar x ../cuda-nvml-dev-13-2_13.2.51-1_amd64.deb
tar -xf data.tar.*
# Header is at:
# ./usr/local/cuda-13.2/targets/x86_64-linux/include/nvml.h

Other distro package URLs can be found by browsing
https://developer.download.nvidia.com/compute/cuda/repos/ and substituting the
appropriate platform directory (e.g. ubuntu2004/x86_64, rhel9/x86_64, etc.).
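
Once both headers are extracted, the NVML_FI_* defines can be compared mechanically rather than by eye. The following is a minimal sketch, assuming the field IDs appear in nvml.h as `#define NVML_FI_<NAME> <integer>` lines; `parse_field_ids` and `renumbered` are hypothetical helper names and are not part of nvml-wrapper or nvml-wrapper-sys.

```rust
use std::collections::BTreeMap;

/// Collect `#define NVML_FI_* <integer>` lines from the text of an nvml.h header.
/// Defines whose value is not a plain integer (for example, aliases) are skipped.
pub fn parse_field_ids(header: &str) -> BTreeMap<String, u32> {
    let mut ids = BTreeMap::new();
    for line in header.lines() {
        let mut tokens = line.split_whitespace();
        if tokens.next() != Some("#define") {
            continue;
        }
        if let (Some(name), Some(value)) = (tokens.next(), tokens.next()) {
            if !name.starts_with("NVML_FI_") {
                continue;
            }
            if let Ok(id) = value.parse::<u32>() {
                ids.insert(name.to_string(), id);
            }
        }
    }
    ids
}

/// Names present in both headers whose numeric ID differs,
/// returned as (name, old_id, new_id) sorted by name.
pub fn renumbered(
    old: &BTreeMap<String, u32>,
    new: &BTreeMap<String, u32>,
) -> Vec<(String, u32, u32)> {
    old.iter()
        .filter_map(|(name, &old_id)| {
            let new_id = *new.get(name)?;
            (new_id != old_id).then(|| (name.clone(), old_id, new_id))
        })
        .collect()
}
```

To use it, read each extracted header with `std::fs::read_to_string`, pass the text to `parse_field_ids`, and hand both maps to `renumbered`. Any non-empty result means the same constant name resolves to different integers in the two headers.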


The Renumbering

IDs 1–250 are identical in both versions. This includes the four C2C link fields at
IDs 231–234 and the NVLink FEC history fields at IDs 235–250. From ID 251 onward the two
versions diverge completely.

IDs 231–250: Identical in both versions

| ID | Name |
| --- | --- |
| 231 | NVML_FI_DEV_C2C_LINK_ERROR_INTR |
| 232 | NVML_FI_DEV_C2C_LINK_ERROR_REPLAY |
| 233 | NVML_FI_DEV_C2C_LINK_ERROR_REPLAY_B2B |
| 234 | NVML_FI_DEV_C2C_LINK_POWER_STATE |
| 235–250 | NVML_FI_DEV_NVLINK_COUNT_FEC_HISTORY_0 through _15 |

IDs 251+: Diverge between versions

| ID | CUDA 12.9 (nvml-dev-12-9_12.9.79-1) | CUDA 13.2 (nvml-dev-13-2_13.2.51-1) |
| --- | --- | --- |
| 251 | NVML_FI_DEV_CLOCKS_EVENT_REASON_SW_THERM_SLOWDOWN | NVML_FI_PWR_SMOOTHING_ENABLED |
| 252 | NVML_FI_DEV_CLOCKS_EVENT_REASON_HW_THERM_SLOWDOWN | NVML_FI_PWR_SMOOTHING_PRIV_LVL |
| 253 | NVML_FI_DEV_CLOCKS_EVENT_REASON_HW_POWER_BRAKE_SLOWDOWN | NVML_FI_PWR_SMOOTHING_IMM_RAMP_DOWN_ENABLED |
| 254 | NVML_FI_DEV_POWER_SYNC_BALANCING_FREQ | NVML_FI_PWR_SMOOTHING_APPLIED_TMP_CEIL |
| 255 | NVML_FI_DEV_POWER_SYNC_BALANCING_AF | NVML_FI_PWR_SMOOTHING_APPLIED_TMP_FLOOR |
| 256 | NVML_FI_PWR_SMOOTHING_ENABLED | NVML_FI_PWR_SMOOTHING_MAX_PERCENT_TMP_FLOOR_SETTING |
| 257 | NVML_FI_PWR_SMOOTHING_PRIV_LVL | NVML_FI_PWR_SMOOTHING_MIN_PERCENT_TMP_FLOOR_SETTING |
| 258 | NVML_FI_PWR_SMOOTHING_IMM_RAMP_DOWN_ENABLED | NVML_FI_PWR_SMOOTHING_HW_CIRCUITRY_PERCENT_LIFETIME_REMAINING |
| 259 | NVML_FI_PWR_SMOOTHING_APPLIED_TMP_CEIL | NVML_FI_PWR_SMOOTHING_MAX_NUM_PRESET_PROFILES |
| 260 | NVML_FI_PWR_SMOOTHING_APPLIED_TMP_FLOOR | NVML_FI_PWR_SMOOTHING_PROFILE_PERCENT_TMP_FLOOR |
| 261 | NVML_FI_PWR_SMOOTHING_MAX_PERCENT_TMP_FLOOR_SETTING | NVML_FI_PWR_SMOOTHING_PROFILE_RAMP_UP_RATE |
| 262 | NVML_FI_PWR_SMOOTHING_MIN_PERCENT_TMP_FLOOR_SETTING | NVML_FI_PWR_SMOOTHING_PROFILE_RAMP_DOWN_RATE |
| 263 | NVML_FI_PWR_SMOOTHING_HW_CIRCUITRY_PERCENT_LIFETIME_REMAINING | NVML_FI_PWR_SMOOTHING_PROFILE_RAMP_DOWN_HYST_VAL |
| 264 | NVML_FI_PWR_SMOOTHING_MAX_NUM_PRESET_PROFILES | NVML_FI_PWR_SMOOTHING_ACTIVE_PRESET_PROFILE |
| 265 | NVML_FI_PWR_SMOOTHING_PROFILE_PERCENT_TMP_FLOOR | NVML_FI_PWR_SMOOTHING_ADMIN_OVERRIDE_PERCENT_TMP_FLOOR |
| 266 | NVML_FI_PWR_SMOOTHING_PROFILE_RAMP_UP_RATE | NVML_FI_PWR_SMOOTHING_ADMIN_OVERRIDE_RAMP_UP_RATE |
| 267 | NVML_FI_PWR_SMOOTHING_PROFILE_RAMP_DOWN_RATE | NVML_FI_PWR_SMOOTHING_ADMIN_OVERRIDE_RAMP_DOWN_RATE |
| 268 | NVML_FI_PWR_SMOOTHING_PROFILE_RAMP_DOWN_HYST_VAL | NVML_FI_PWR_SMOOTHING_ADMIN_OVERRIDE_RAMP_DOWN_HYST_VAL |
| 269 | NVML_FI_PWR_SMOOTHING_ACTIVE_PRESET_PROFILE | NVML_FI_DEV_CLOCKS_EVENT_REASON_SW_THERM_SLOWDOWN |
| 270 | NVML_FI_PWR_SMOOTHING_ADMIN_OVERRIDE_PERCENT_TMP_FLOOR | NVML_FI_DEV_CLOCKS_EVENT_REASON_HW_THERM_SLOWDOWN |
| 271 | NVML_FI_PWR_SMOOTHING_ADMIN_OVERRIDE_RAMP_UP_RATE | NVML_FI_DEV_CLOCKS_EVENT_REASON_HW_POWER_BRAKE_SLOWDOWN |
| 272 | NVML_FI_PWR_SMOOTHING_ADMIN_OVERRIDE_RAMP_DOWN_RATE | NVML_FI_DEV_POWER_SYNC_BALANCING_FREQ |
| 273 | NVML_FI_PWR_SMOOTHING_ADMIN_OVERRIDE_RAMP_DOWN_HYST_VAL | NVML_FI_DEV_POWER_SYNC_BALANCING_AF |
| 274 | NVML_FI_MAX (274) | NVML_FI_DEV_EDPP_MULTIPLIER |
| 275–288 | (not present) | NVML_FI_PWR_SMOOTHING_PRIMARY_POWER_FLOOR through NVML_FI_PWR_SMOOTHING_ADMIN_OVERRIDE_PRIMARY_FLOOR_ACT_OFFSET |
| 289–295 | (not present) | NVML_FI_DEV_NVLINK_COUNT_RAW_ERRORS_LANE0/1, NVML_FI_DEV_NVLINK_COUNT_RAW_BER_LANE0/1_V2, NVML_FI_DEV_NVLINK_COUNT_RAW_BER_V2, NVML_FI_DEV_NVLINK_PLR_XMIT_BLOCKS, NVML_FI_DEV_NVLINK_PLR_XMIT_RETRY_BLOCKS |
| 296 | (not present) | NVML_FI_MAX (296) |
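
For IDs 251–273 the table above amounts to two blocks that swapped places: five clock-event/power-sync fields and eighteen power smoothing fields. The sketch below encodes that as a lookup so the effect of passing one integer to two different runtimes is explicit. The `Header` enum and `field_name` function are hypothetical illustrations, and the names are copied from the table rather than re-verified against the headers.

```rust
/// Which header's numbering applies (hypothetical type, for illustration only).
#[derive(Clone, Copy)]
pub enum Header {
    V12, // CUDA 12.9 nvml.h
    V13, // CUDA 13.2 nvml.h
}

/// The five fields at 251..=255 in CUDA 12.9 and 269..=273 in CUDA 13.2.
const CLOCKS_AND_SYNC: [&str; 5] = [
    "NVML_FI_DEV_CLOCKS_EVENT_REASON_SW_THERM_SLOWDOWN",
    "NVML_FI_DEV_CLOCKS_EVENT_REASON_HW_THERM_SLOWDOWN",
    "NVML_FI_DEV_CLOCKS_EVENT_REASON_HW_POWER_BRAKE_SLOWDOWN",
    "NVML_FI_DEV_POWER_SYNC_BALANCING_FREQ",
    "NVML_FI_DEV_POWER_SYNC_BALANCING_AF",
];

/// The eighteen fields at 256..=273 in CUDA 12.9 and 251..=268 in CUDA 13.2.
const POWER_SMOOTHING: [&str; 18] = [
    "NVML_FI_PWR_SMOOTHING_ENABLED",
    "NVML_FI_PWR_SMOOTHING_PRIV_LVL",
    "NVML_FI_PWR_SMOOTHING_IMM_RAMP_DOWN_ENABLED",
    "NVML_FI_PWR_SMOOTHING_APPLIED_TMP_CEIL",
    "NVML_FI_PWR_SMOOTHING_APPLIED_TMP_FLOOR",
    "NVML_FI_PWR_SMOOTHING_MAX_PERCENT_TMP_FLOOR_SETTING",
    "NVML_FI_PWR_SMOOTHING_MIN_PERCENT_TMP_FLOOR_SETTING",
    "NVML_FI_PWR_SMOOTHING_HW_CIRCUITRY_PERCENT_LIFETIME_REMAINING",
    "NVML_FI_PWR_SMOOTHING_MAX_NUM_PRESET_PROFILES",
    "NVML_FI_PWR_SMOOTHING_PROFILE_PERCENT_TMP_FLOOR",
    "NVML_FI_PWR_SMOOTHING_PROFILE_RAMP_UP_RATE",
    "NVML_FI_PWR_SMOOTHING_PROFILE_RAMP_DOWN_RATE",
    "NVML_FI_PWR_SMOOTHING_PROFILE_RAMP_DOWN_HYST_VAL",
    "NVML_FI_PWR_SMOOTHING_ACTIVE_PRESET_PROFILE",
    "NVML_FI_PWR_SMOOTHING_ADMIN_OVERRIDE_PERCENT_TMP_FLOOR",
    "NVML_FI_PWR_SMOOTHING_ADMIN_OVERRIDE_RAMP_UP_RATE",
    "NVML_FI_PWR_SMOOTHING_ADMIN_OVERRIDE_RAMP_DOWN_RATE",
    "NVML_FI_PWR_SMOOTHING_ADMIN_OVERRIDE_RAMP_DOWN_HYST_VAL",
];

/// Name that a field ID in 251..=273 refers to under the given header.
/// Returns None outside that range, which this sketch does not cover.
pub fn field_name(header: Header, id: u32) -> Option<&'static str> {
    if !(251..=273).contains(&id) {
        return None;
    }
    let offset = (id - 251) as usize;
    let name = match header {
        Header::V12 if offset < 5 => CLOCKS_AND_SYNC[offset],
        Header::V12 => POWER_SMOOTHING[offset - 5],
        Header::V13 if offset < 18 => POWER_SMOOTHING[offset],
        Header::V13 => CLOCKS_AND_SYNC[offset - 18],
    };
    Some(name)
}
```

With this lookup, `field_name(Header::V12, 251)` and `field_name(Header::V13, 251)` return different names, which is the mismatch a caller hits when the compiled-in constant and the runtime disagree.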

Impact on nvml-wrapper-sys

The NVML_FI_PWR_SMOOTHING_* constants were added to nvml-wrapper-sys in commit
29f91b8 (v0.9.0, released 2025-03-28). The values shipped were:

NVML_FI_PWR_SMOOTHING_ENABLED = 251
...
NVML_FI_PWR_SMOOTHING_ADMIN_OVERRIDE_RAMP_DOWN_HYST_VAL = 268
NVML_FI_MAX = 269

The NVML_FI_PWR_SMOOTHING_* values correspond to the CUDA 13 numbering, not CUDA 12.9
(the shipped NVML_FI_MAX of 269 matches neither header compared here). The crate
simultaneously declares NVML_API_VERSION = 12. As a result:

  • On a CUDA 12 host: passing NVML_FI_PWR_SMOOTHING_ENABLED (integer 251) to
    nvmlDeviceGetFieldValues asks the driver for
    NVML_FI_DEV_CLOCKS_EVENT_REASON_SW_THERM_SLOWDOWN. The call succeeds and returns a
    value, but it is throttle-reason nanosecond data, not a power smoothing enablement flag.
    No error is surfaced.

  • On a CUDA 13 host: the same integer 251 correctly addresses
    NVML_FI_PWR_SMOOTHING_ENABLED and returns the expected data.


Recommended Fix

The bindings should be regenerated from the official CUDA 12.9 header. The correct v12
values for the affected constants are:

| Constant | Correct v12 ID | Shipped (incorrect) ID |
| --- | --- | --- |
| NVML_FI_DEV_CLOCKS_EVENT_REASON_SW_THERM_SLOWDOWN | 251 | (not present) |
| NVML_FI_DEV_CLOCKS_EVENT_REASON_HW_THERM_SLOWDOWN | 252 | (not present) |
| NVML_FI_DEV_CLOCKS_EVENT_REASON_HW_POWER_BRAKE_SLOWDOWN | 253 | (not present) |
| NVML_FI_DEV_POWER_SYNC_BALANCING_FREQ | 254 | (not present) |
| NVML_FI_DEV_POWER_SYNC_BALANCING_AF | 255 | (not present) |
| NVML_FI_PWR_SMOOTHING_ENABLED | 256 | 251 |
| NVML_FI_PWR_SMOOTHING_PRIV_LVL | 257 | 252 |
| NVML_FI_PWR_SMOOTHING_IMM_RAMP_DOWN_ENABLED | 258 | 253 |
| NVML_FI_PWR_SMOOTHING_APPLIED_TMP_CEIL | 259 | 254 |
| NVML_FI_PWR_SMOOTHING_APPLIED_TMP_FLOOR | 260 | 255 |
| NVML_FI_PWR_SMOOTHING_MAX_PERCENT_TMP_FLOOR_SETTING | 261 | 256 |
| NVML_FI_PWR_SMOOTHING_MIN_PERCENT_TMP_FLOOR_SETTING | 262 | 257 |
| NVML_FI_PWR_SMOOTHING_HW_CIRCUITRY_PERCENT_LIFETIME_REMAINING | 263 | 258 |
| NVML_FI_PWR_SMOOTHING_MAX_NUM_PRESET_PROFILES | 264 | 259 |
| NVML_FI_PWR_SMOOTHING_PROFILE_PERCENT_TMP_FLOOR | 265 | 260 |
| NVML_FI_PWR_SMOOTHING_PROFILE_RAMP_UP_RATE | 266 | 261 |
| NVML_FI_PWR_SMOOTHING_PROFILE_RAMP_DOWN_RATE | 267 | 262 |
| NVML_FI_PWR_SMOOTHING_PROFILE_RAMP_DOWN_HYST_VAL | 268 | 263 |
| NVML_FI_PWR_SMOOTHING_ACTIVE_PRESET_PROFILE | 269 | 264 |
| NVML_FI_PWR_SMOOTHING_ADMIN_OVERRIDE_PERCENT_TMP_FLOOR | 270 | 265 |
| NVML_FI_PWR_SMOOTHING_ADMIN_OVERRIDE_RAMP_UP_RATE | 271 | 266 |
| NVML_FI_PWR_SMOOTHING_ADMIN_OVERRIDE_RAMP_DOWN_RATE | 272 | 267 |
| NVML_FI_PWR_SMOOTHING_ADMIN_OVERRIDE_RAMP_DOWN_HYST_VAL | 273 | 268 |
| NVML_FI_MAX | 274 | 269 |
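
Every power smoothing row in the table above differs by the same amount, because the CUDA 12.9 header places five fields (IDs 251–255) ahead of the power smoothing block. The worked arithmetic can be sketched as follows; `corrected_v12_id` is a hypothetical helper used only to illustrate the offset, and regenerating the bindings from the header remains the actual fix.

```rust
/// Number of fields the CUDA 12.9 header places at 251..=255,
/// ahead of the NVML_FI_PWR_SMOOTHING_* block.
const FIELDS_INSERTED_BEFORE: u32 = 5;

/// Map an ID shipped for a NVML_FI_PWR_SMOOTHING_* constant (251..=268)
/// to the ID the CUDA 12.9 header assigns to the same name.
/// Returns None for IDs outside the shipped power smoothing range.
pub fn corrected_v12_id(shipped: u32) -> Option<u32> {
    (251..=268)
        .contains(&shipped)
        .then(|| shipped + FIELDS_INSERTED_BEFORE)
}
```

For example, the shipped value 251 for NVML_FI_PWR_SMOOTHING_ENABLED maps to 256, and the shipped value 268 for NVML_FI_PWR_SMOOTHING_ADMIN_OVERRIDE_RAMP_DOWN_HYST_VAL maps to 273, matching the first and last power smoothing rows of the table.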

This is a silent data corruption bug for any code using the NVML_FI_PWR_SMOOTHING_*
constants on a CUDA 12 deployment. A semver-major release or, at minimum, a clearly
documented breaking change is warranted when publishing a corrected version.
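
Independently of regenerating the bindings, a runtime guard along the lines suggested in the TL;DR could refuse to issue field queries when the loaded NVML library's major version differs from the one the bindings were generated against. The sketch below is a rough illustration under two assumptions: that the version string (for instance, the one returned by nvmlSystemGetNVMLVersion) begins with the NVML API major version, and that matching majors imply matching field numbering. Neither is confirmed here. `BINDINGS_API_MAJOR`, `runtime_major`, and `check_field_id_compat` are hypothetical names.

```rust
/// NVML API major version the (hypothetical) bindings were generated from.
pub const BINDINGS_API_MAJOR: u32 = 12;

/// Leading numeric component of a dotted version string, if it parses.
/// Assumption: this component is the NVML API major version.
pub fn runtime_major(version: &str) -> Option<u32> {
    version.trim().split('.').next()?.parse().ok()
}

/// Return an error instead of letting field queries run against a runtime
/// whose major version differs from the bindings' declared version.
pub fn check_field_id_compat(runtime_version: &str) -> Result<(), String> {
    match runtime_major(runtime_version) {
        Some(major) if major == BINDINGS_API_MAJOR => Ok(()),
        Some(major) => Err(format!(
            "NVML runtime major version {} does not match bindings version {}; \
             NVML_FI_* field IDs may refer to different fields",
            major, BINDINGS_API_MAJOR
        )),
        None => Err(format!(
            "could not parse NVML version string {:?}",
            runtime_version
        )),
    }
}
```

Returning an error rather than panicking leaves the decision to the caller: a monitoring agent might disable only the field-value queries, while a stricter tool could abort at startup.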
