drop support for compute capability <= 7.0 for newer cuDNN versions #170

Open
bedroge wants to merge 1 commit into EESSI:main from bedroge:cudnn915_cc70

Conversation


@bedroge bedroge commented Feb 27, 2026

This one is a little trickier than CUDA itself, as the list of supported compute capabilities in the docs (https://docs.nvidia.com/deeplearning/cudnn/backend/v9.19.0/reference/support-matrix.html) doesn't really match what running cuobjdump on the binaries shows. Also, there seem to be some gaps in the matrix, and I wonder if that's really correct.

So for now I've chosen an easier approach: just check if we're building with a newer cuDNN and a compute capability <= 7.0, and in that case do the same thing as what @casparvl implemented for CUDA. In order to check whether cuDNN is used as a dependency, I've generalized Caspar's get_cuda_version into a get_dependency_software_version function.
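To illustrate the generalization described above, here is a minimal sketch of what such a get_dependency_software_version helper could look like. This is not the actual PR implementation: the real hook operates on an EasyBuild EasyConfig object, which is approximated here with plain dicts, and the function names mirror the ones mentioned in the text.

```python
# Hypothetical sketch: a generic dependency-version lookup that a
# get_cuda_version-style helper can be built on. The easyconfig is
# approximated as a dict with a 'dependencies' list of dicts.

def get_dependency_software_version(ec, dep_name):
    """Return the version of dependency `dep_name` in easyconfig `ec`, or None."""
    for dep in ec.get('dependencies', []):
        if dep.get('name') == dep_name:
            return dep.get('version')
    return None

def get_cuda_version(ec):
    # The CUDA-specific helper becomes a thin wrapper around the generic one
    return get_dependency_software_version(ec, 'CUDA')
```

The same wrapper pattern then works for cuDNN (or any other dependency) by passing a different name.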

Tested this locally with EESSI-extend and the cuDNN from EESSI/software-layer#1410 on a V100 (CC 7.0) and RTX PRO 6000 (CC 12.0f), and got the expected result: on the RTX PRO 6000 I get a full cuDNN installation, while for the V100 I get the following output during the build:

WARNING: Requested a CUDA Compute Capability (['7.0']) that is not supported by the cuDNN version (9.15.0.57) used by this software. Switching to 
'--module-only --force' and injecting an LmodError into the modulefile. You can override this behaviour by setting the 
EESSI_OVERRIDE_CUDA_CC_CUDNN_CHECK environment variable.

and a module file that has:

if (not os.getenv("EESSI_IGNORE_CUDNN_9_15_0_57_CC_7_0")) then LmodError("EasyConfigs using cuDNN 9.15.0.57 or older are not supported for (all) requested Compute Capabilities: ['7.0'].\n") end
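The guard line quoted above could be assembled with a small helper along these lines. This is a hypothetical sketch, not the PR's code: the EESSI_IGNORE_... variable format is taken from the output above, and the function name is invented for illustration.

```python
# Hypothetical sketch: build the environment variable name and the Lua
# LmodError guard that gets injected into the modulefile.

def make_lmod_guard(cudnn_ver, cuda_ccs):
    """Return a Lua guard line raising an LmodError unless an override var is set."""
    ccs_string = ','.join(cuda_ccs)
    # Periods and commas are not valid in environment variable names
    var = f"EESSI_IGNORE_CUDNN_{cudnn_ver}_CC_{ccs_string}".replace('.', '_').replace(',', '_')
    errmsg = (f"EasyConfigs using cuDNN {cudnn_ver} are not supported for (all) "
              f"requested Compute Capabilities: {cuda_ccs}.\\n")
    return f'if (not os.getenv("{var}")) then LmodError("{errmsg}") end'
```

For cudnn_ver='9.15.0.57' and cuda_ccs=['7.0'] this yields a guard keyed on EESSI_IGNORE_CUDNN_9_15_0_57_CC_7_0, matching the modulefile line above.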


bedroge commented Feb 27, 2026

Ultimately we could make the same kind of lookup table as for CUDA. Initially I started working on it:

# The documentation at e.g. https://docs.nvidia.com/deeplearning/cudnn/backend/v9.19.0/reference/support-matrix.html and
# what cuobjdump shows on cuDNN libraries do not fully match. The support matrix below may be too inclusive,
# so if you find that a specific combination is not supported in practice, please remove it from the matrix.
CUDNN_SUPPORTED_CCS = {
    '8.8.0': [],
    '9.15.0': ['75', '80', '86', '89', '90', '100', '103', '120', '121'],
    '9.15.1': ['75', '80', '86', '89', '90', '100', '103', '120', '121'],
    '9.16.0': ['75', '80', '86', '89', '90', '100', '103', '120', '121'],
    '9.17.0': ['75', '80', '86', '89', '90', '100', '103', '120', '121'],
    '9.17.1': ['75', '80', '86', '89', '90', '100', '103', '120', '121'],
    '9.18.0': ['75', '80', '86', '89', '90', '100', '103', '120', '121'],
    '9.18.1': ['75', '80', '86', '89', '90', '100', '103', '120', '121'],
    '9.19.0': ['75', '80', '86', '89', '90', '100', '103', '120', '121'],
}

but it's a lot of work, and as mentioned, it's not really clear what is supported and what is not. We could also consider a simpler lookup table with just the min+max supported CCs per X.YZ version? But then again, while https://docs.nvidia.com/deeplearning/cudnn/backend/v9.19.0/reference/support-matrix.html says that 12.1 is not supported, the binaries do seem to indicate that it is, so it's very confusing and unclear...
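For comparison, the min+max variant mentioned above could be sketched like this. The version key and range are illustrative only, not verified support data, and (as noted further down in this thread) a pure range check silently assumes every intermediate CC is supported, which may not hold.

```python
# Hypothetical sketch of a min/max lookup table per cuDNN X.Y version.
# Range values are placeholders, not authoritative support data.

CUDNN_CC_RANGE = {
    # cuDNN X.Y -> (minimum CC, maximum CC), as floats for numeric comparison
    '9.15': (7.5, 12.1),
}

def cc_supported(cudnn_xy, cc):
    """True/False if the range table has a verdict, None for unknown cuDNN versions."""
    if cudnn_xy not in CUDNN_CC_RANGE:
        return None
    lo, hi = CUDNN_CC_RANGE[cudnn_xy]
    return lo <= float(cc) <= hi
```

The maintenance cost is lower than a full table, at the price of the implicit "everything in between is supported" assumption.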

cuda_ccs_string = re.sub(r'[a-zA-Z]', '', cuda_ccs_string).replace(',', '_')
# Also replace periods, those are not officially supported in environment variable names
var=f"EESSI_IGNORE_CUDNN_{cudnn_ver}_CC_{cuda_ccs_string}".replace('.', '_')
errmsg = f"EasyConfigs using cuDNN {cudnn_ver} or older are not supported for (all) requested Compute "
I think this is wrong: in your case the cuDNN is too new, not too old, right?

@casparvl

My 2 cents:

  1. Go for a lookup table. If you only specify a min and max version, the implicit assumption is that all intermediate versions are supported - which does not seem to be the case (i.e. 11.X almost certainly isn't, since that's not supported in CUDA 12 - see the CUDA lookup table)
  2. If you create a lookup table, and if the docs contradict what the binaries show, assume the binaries to be correct. If the binaries say there is no X.Y support, there is no X.Y code in the binary - so there can't be support. If the binary says there is X.Y code in the binary, that might not be a hard guarantee that the full cuDNN API is supported for that architecture - but the only way to find out is to assume the support is there, install it, and see how this works in practice. If we skip installations for targets that do turn out to be supported, we'd never find out otherwise.
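Deriving the table from the binaries themselves, as point 2 suggests, could be sketched as follows: cuobjdump prints an `arch = sm_NN` line for each embedded cubin, which can be collected per library. The cuobjdump invocation is the standard CUDA binary utility; the parsing helpers are assumptions for illustration.

```python
# Sketch: extract the compute capabilities embedded in a CUDA library by
# parsing the 'arch = sm_NN' lines from cuobjdump output.
import re
import subprocess

def parse_cuobjdump_arches(text):
    """Return the sorted, deduplicated list of sm targets found in cuobjdump output."""
    return sorted(set(re.findall(r'arch = sm_(\d+)', text)))

def ccs_in_binary(path):
    """Run cuobjdump on `path` and return the embedded CCs, e.g. ['75', '90']."""
    out = subprocess.run(['cuobjdump', path],
                         capture_output=True, text=True, check=True).stdout
    return parse_cuobjdump_arches(out)
```

Running this over the cuDNN shared libraries of each release would give a binaries-derived support list to populate (or cross-check) CUDNN_SUPPORTED_CCS with.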


bedroge commented Feb 27, 2026

I just feel like a lookup table is a lot of work to set up and to maintain, while (according to the docs) the supported CCs don't change that often. Also, wouldn't the sanity check still catch unsupported CCs, as it did for CC 7.0 in EESSI/software-layer#1410? So whenever we run into this, we can mark those as unsupported in the hooks (and if necessary, change the if statement to something else if there are going to be too many combinations)?
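The simple check this PR describes (newer cuDNN plus any CC <= 7.0 triggers the module-only fallback) could look roughly like this. The threshold version, helper name, and suffix handling are assumptions for illustration, not the PR's actual code.

```python
# Hypothetical sketch of the gating condition described in this PR.
import re

def _vtuple(v):
    """Turn a dotted version string into a tuple of ints for comparison."""
    return tuple(int(x) for x in v.split('.'))

def needs_module_only(cudnn_ver, cuda_ccs, threshold='9.15.0'):
    """True if cuDNN is at least `threshold` and any requested CC is 7.0 or lower."""
    if _vtuple(cudnn_ver)[:3] < _vtuple(threshold):
        return False
    # strip suffixes such as the 'f' in '12.0f' before the numeric comparison
    return any(float(re.sub(r'[a-zA-Z]', '', cc)) <= 7.0 for cc in cuda_ccs)
```

If this returns True, the build switches to '--module-only --force' and injects the LmodError guard shown earlier; unsupported CCs above 7.0 would still be left for the sanity check to catch.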
