
{2025.06}[system] CUDA 12.9.1 and cuDNN 9.15.0.57 #1410

Open
bedroge wants to merge 3 commits into EESSI:main from bedroge:main


@bedroge
Collaborator

@bedroge bedroge commented Feb 21, 2026

We probably need cuda-sanity-check-accept-missing-ptx: True here for cuDNN (see https://github.com/EESSI/software-layer/blob/main/easystacks/software.eessi.io/2025.06/accel/nvidia/eessi-2025.06-eb-5.2.0-001-system.yml), but let's try without it first.

@bedroge bedroge added the accel:nvidia and 2025.06-software.eessi.io (2025.06 version of software.eessi.io) labels on Feb 21, 2026
@bedroge
Collaborator Author

bedroge commented Feb 21, 2026

bot: build repo:eessi.io-2025.06-software instance:eessi-bot-rug for:arch=x86_64/intel/skylake_avx512,accel=nvidia/cc70

@eessi-bot-rug

eessi-bot-rug bot commented Feb 21, 2026

New job on instance eessi-bot-rug for repository eessi.io-2025.06-software
Building on: intel-skylake_avx512 and accelerator nvidia/cc70
Building for: x86_64/intel/skylake_avx512 and accelerator nvidia/cc70
Job dir: /scratch/hb-eessibot/SHARED/jobs/2026.02/pr_1410/27398927

date | job status | comment
Feb 21 08:43:27 UTC 2026 | submitted | job id 27398927 awaits release by job manager
Feb 21 08:43:34 UTC 2026 | released | job awaits launch by Slurm scheduler
Feb 21 08:45:37 UTC 2026 | running | job 27398927 is running
Feb 21 08:57:47 UTC 2026 | finished | 😢 FAILURE
Details
✅ job output file slurm-27398927.out
✅ no message matching FATAL:
❌ found message matching ERROR:
❌ found message matching FAILED:
✅ no message matching required modules missing:
✅ found message(s) matching No missing installations
✅ found message matching .tar.* created!
Artefacts
eessi-2025.06-software-linux-x86_64-intel-skylake_avx512-accel-nvidia-cc70-17716641240.tar.zst
size: 0 MiB (22 bytes)
entries: 0
modules under 2025.06/software/linux/x86_64/intel/skylake_avx512/accel/nvidia/cc70/modules/all
no module files in tarball
software under 2025.06/software/linux/x86_64/intel/skylake_avx512/accel/nvidia/cc70/software
no software packages in tarball
reprod directories under 2025.06/software/linux/x86_64/intel/skylake_avx512/accel/nvidia/cc70/reprod
no reprod directories in tarball
other under 2025.06/software/linux/x86_64/intel/skylake_avx512/accel/nvidia/cc70
no other files in tarball
Feb 21 08:57:47 UTC 2026 | test result | 😁 SUCCESS
ReFrame Summary
[ PASSED ] Ran 0/0 test case(s) from 0 check(s) (0 failure(s), 0 skipped, 0 aborted)
Details
✅ job output file slurm-27398927.out
❌ found message matching ERROR:
✅ no message matching [\s*FAILED\s*].*Ran .* test case
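The build checklist in the bot comment above reads like a set of pattern searches over the Slurm output file. A minimal sketch of that idea follows, assuming each check is a regular expression that either must or must not occur in the log. The names `BUILD_CHECKS` and `evaluate_build_log`, and the sample log lines, are hypothetical; the bot's actual rules and implementation are not shown in this thread and may differ.

```python
import re

# Hypothetical check table modelled on the checklist in the bot comment:
# (regular expression, whether a match is required). Not the bot's real config.
BUILD_CHECKS = [
    (r"FATAL:", False),
    (r"ERROR:", False),
    (r"FAILED:", False),
    (r"required modules missing:", False),
    (r"No missing installations", True),
    (r"\.tar\..* created!", True),
]


def evaluate_build_log(log_text):
    """Return (overall_ok, results); results is a list of (pattern, passed)."""
    results = []
    for pattern, must_match in BUILD_CHECKS:
        found = re.search(pattern, log_text) is not None
        # A check passes when "found" agrees with what the check requires.
        results.append((pattern, found == must_match))
    overall_ok = all(passed for _, passed in results)
    return overall_ok, results


# Made-up log lines, only to exercise the function.
sample_log = "\n".join([
    "ERROR: something went wrong",
    "No missing installations",
    "/tmp/example.tar.zst created!",
])
ok, results = evaluate_build_log(sample_log)
print(ok)  # False, because the ERROR: check fails
```

Under this reading, a job is reported as a failure as soon as any single check fails, which is consistent with the first job: the tarball was created, yet `ERROR:` and `FAILED:` were found.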

@bedroge
Collaborator Author

bedroge commented Feb 21, 2026

bot: build repo:eessi.io-2025.06-software instance:eessi-bot-rug for:arch=x86_64/intel/skylake_avx512,accel=nvidia/cc70

@eessi-bot-rug

eessi-bot-rug bot commented Feb 21, 2026

New job on instance eessi-bot-rug for repository eessi.io-2025.06-software
Building on: intel-skylake_avx512 and accelerator nvidia/cc70
Building for: x86_64/intel/skylake_avx512 and accelerator nvidia/cc70
Job dir: /scratch/hb-eessibot/SHARED/jobs/2026.02/pr_1410/27401161

date | job status | comment
Feb 21 16:34:53 UTC 2026 | submitted | job id 27401161 awaits release by job manager
Feb 21 16:36:08 UTC 2026 | released | job awaits launch by Slurm scheduler
Feb 21 17:18:13 UTC 2026 | running | job 27401161 is running
Feb 21 17:32:26 UTC 2026 | finished | 😢 FAILURE
Details
✅ job output file slurm-27401161.out
✅ no message matching FATAL:
❌ found message matching ERROR:
❌ found message matching FAILED:
❌ found message matching required modules missing:
❌ no message matching No missing installations
✅ found message matching .tar.* created!
Artefacts
eessi-2025.06-software-linux-x86_64-intel-skylake_avx512-accel-nvidia-cc70-17716948950.tar.zst
size: 3068 MiB (3217285682 bytes)
entries: 6538
modules under 2025.06/software/linux/x86_64/intel/skylake_avx512/accel/nvidia/cc70/modules/all
CUDA/12.9.1.lua
software under 2025.06/software/linux/x86_64/intel/skylake_avx512/accel/nvidia/cc70/software
CUDA/12.9.1
reprod directories under 2025.06/software/linux/x86_64/intel/skylake_avx512/accel/nvidia/cc70/reprod
CUDA/12.9.1/20260221_172548UTC
other under 2025.06/software/linux/x86_64/intel/skylake_avx512/accel/nvidia/cc70
no other files in tarball
Feb 21 17:32:26 UTC 2026 | test result | 😁 SUCCESS
ReFrame Summary
[ PASSED ] Ran 0/0 test case(s) from 0 check(s) (0 failure(s), 0 skipped, 0 aborted)
Details
✅ job output file slurm-27401161.out
❌ found message matching ERROR:
✅ no message matching [\s*FAILED\s*].*Ran .* test case

@bedroge
Collaborator Author

bedroge commented Feb 21, 2026

bot: build repo:eessi.io-2025.06-software instance:eessi-bot-mc-aws on:arch=zen4 for:arch=x86_64/amd/zen4,accel=nvidia/cc100
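The build command above is a flat list of `key:value` arguments, where `on:` and `for:` carry comma-separated `name=value` pairs. A minimal sketch of splitting such a line into a dictionary follows. The function name `parse_bot_command` is hypothetical and this is not the bot's own parser; it only illustrates the structure visible in the commands in this thread.

```python
def parse_bot_command(line):
    """Split a 'bot: <action> key:value ...' comment line into a dict (illustrative only)."""
    prefix = "bot:"
    if not line.startswith(prefix):
        raise ValueError("not a bot command")
    # First word after the prefix is the action, the rest are key:value tokens.
    action, *tokens = line[len(prefix):].split()
    parsed = {"action": action}
    for token in tokens:
        key, _, value = token.partition(":")
        if key in ("on", "for"):
            # e.g. "arch=x86_64/amd/zen4,accel=nvidia/cc100"
            parsed[key] = dict(item.split("=", 1) for item in value.split(","))
        else:
            parsed[key] = value
    return parsed


command = (
    "bot: build repo:eessi.io-2025.06-software instance:eessi-bot-mc-aws "
    "on:arch=zen4 for:arch=x86_64/amd/zen4,accel=nvidia/cc100"
)
args = parse_bot_command(command)
print(args["on"])   # {'arch': 'zen4'}
print(args["for"])  # {'arch': 'x86_64/amd/zen4', 'accel': 'nvidia/cc100'}
```

Read this way, the command asks for a build on a zen4 node that targets zen4 CPUs plus the `nvidia/cc100` accelerator, which matches the "Building on" and "Building for" lines in the bot's reply.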

@eessi-bot-aws

eessi-bot-aws bot commented Feb 21, 2026

New job on instance eessi-bot-mc-aws for repository eessi.io-2025.06-software
Building on: amd-zen4
Building for: x86_64/amd/zen4 and accelerator nvidia/cc100
Job dir: /project/def-users/SHARED/jobs/2026.02/pr_1410/133161

date | job status | comment
Feb 21 21:59:45 UTC 2026 | submitted | job id 133161 awaits release by job manager
Feb 21 22:00:19 UTC 2026 | released | job awaits launch by Slurm scheduler
Feb 21 22:05:21 UTC 2026 | running | job 133161 is running
Feb 21 22:13:31 UTC 2026 | finished | 😁 SUCCESS
Details
✅ job output file slurm-133161.out
✅ no message matching FATAL:
✅ no message matching ERROR:
✅ no message matching FAILED:
✅ no message matching required modules missing:
✅ found message(s) matching No missing installations
✅ found message matching .tar.* created!
Artefacts
eessi-2025.06-software-linux-x86_64-amd-zen4-accel-nvidia-cc100-17717118380.tar.zst
size: 3603 MiB (3778037591 bytes)
entries: 6623
modules under 2025.06/software/linux/x86_64/amd/zen4/accel/nvidia/cc100/modules/all
CUDA/12.9.1.lua
cuDNN/9.15.0.57-CUDA-12.9.1.lua
software under 2025.06/software/linux/x86_64/amd/zen4/accel/nvidia/cc100/software
CUDA/12.9.1
cuDNN/9.15.0.57-CUDA-12.9.1
reprod directories under 2025.06/software/linux/x86_64/amd/zen4/accel/nvidia/cc100/reprod
CUDA/12.9.1/20260221_220859UTC
cuDNN/9.15.0.57-CUDA-12.9.1/20260221_221013UTC
other under 2025.06/software/linux/x86_64/amd/zen4/accel/nvidia/cc100
no other files in tarball
Feb 21 22:13:31 UTC 2026 | test result | 😁 SUCCESS
ReFrame Summary
[ OK ] (1/4) EESSI_OSU_coll %benchmark_info=mpi.collective.osu_allreduce %module_name=OSU-Micro-Benchmarks/7.5-gompi-2025a %scale=1_node %device_type=cpu /e4bf9965 @BotBuildTests:x86-64-zen4+default
P: latency: 1.43 us (r:0, l:None, u:None)
[ OK ] (2/4) EESSI_OSU_coll %benchmark_info=mpi.collective.osu_alltoall %module_name=OSU-Micro-Benchmarks/7.5-gompi-2025a %scale=1_node %device_type=cpu /3da4890b @BotBuildTests:x86-64-zen4+default
P: latency: 3.28 us (r:0, l:None, u:None)
[ OK ] (3/4) EESSI_OSU_pt2pt_CPU %benchmark_info=mpi.pt2pt.osu_latency %module_name=OSU-Micro-Benchmarks/7.5-gompi-2025a %scale=1_node /3255009a @BotBuildTests:x86-64-zen4+default
P: latency: 0.16 us (r:0, l:None, u:None)
[ OK ] (4/4) EESSI_OSU_pt2pt_CPU %benchmark_info=mpi.pt2pt.osu_bw %module_name=OSU-Micro-Benchmarks/7.5-gompi-2025a %scale=1_node /59f4b331 @BotBuildTests:x86-64-zen4+default
P: bandwidth: 14251.52 MB/s (r:0, l:None, u:None)
[ PASSED ] Ran 4/4 test case(s) from 4 check(s) (0 failure(s), 0 skipped, 0 aborted)
Details
✅ job output file slurm-133161.out
✅ no message matching ERROR:
✅ no message matching [\s*FAILED\s*].*Ran .* test case

@bedroge
Collaborator Author

bedroge commented Feb 21, 2026

According to https://docs.nvidia.com/deeplearning/cudnn/backend/v9.15.0/reference/support-matrix.html, this version of cuDNN doesn't support CC 7.0 anymore...

@ocaisa
Member

ocaisa commented Feb 22, 2026

It does support cc7.5

@ocaisa
Member

ocaisa commented Feb 22, 2026

One option may be to downgrade the cuDNN version for the cc...bit painful, and might cause problems in real life

@bedroge
Collaborator Author

bedroge commented Feb 23, 2026

> One option may be to downgrade the cuDNN version for the cc...bit painful, and might cause problems in real life

Though we could indeed try that (9.11.1 seems to be the latest one that still supports 7.0: https://docs.nvidia.com/deeplearning/cudnn/backend/v9.11.1/reference/support-matrix.html), I don't want to go down that rabbit hole. So for now, I'll just try to reuse what @casparvl has done for CUDA, and create dummy modules for these unsupported combinations.
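The decision discussed above (install the real cuDNN where the target compute capability is supported, fall back to a dummy module where it is not) can be sketched as a simple lookup. The table `ASSUMED_MIN_CC` and the helpers `parse_cc` and `install_kind` are hypothetical. The minimum values are read off this thread only (9.15.0 works on 7.5 but not 7.0; 9.11.1 still supports 7.0), and treating them as exact lower bounds is an assumption. What a dummy module actually contains is not described here.

```python
# Assumed minimum compute capability per cuDNN version, as (major, minor).
# Derived from the discussion in this thread, not copied from NVIDIA's matrix.
ASSUMED_MIN_CC = {
    "9.15.0": (7, 5),
    "9.11.1": (7, 0),
}


def parse_cc(target):
    """Turn a target such as 'cc70' or 'cc100' into a (major, minor) tuple."""
    if not target.startswith("cc"):
        raise ValueError(f"unexpected target: {target}")
    digits = target[2:]
    # The last digit is the minor version, everything before it is the major.
    return int(digits[:-1]), int(digits[-1])


def install_kind(cudnn_version, target):
    """Return 'real' if the target meets the assumed minimum, else 'dummy'."""
    supported = parse_cc(target) >= ASSUMED_MIN_CC[cudnn_version]
    return "real" if supported else "dummy"


print(install_kind("9.15.0", "cc70"))   # dummy
print(install_kind("9.15.0", "cc100"))  # real
print(install_kind("9.11.1", "cc70"))   # real
```

Under these assumptions the cc70 target from the first two build jobs would get a dummy cuDNN module, while the cc100 target that built successfully would get the real installation.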

