Container image recipes and Perlmutter run scripts for Slingshot/CXI communication libraries.
The recipes are for NERSC Perlmutter CPU and GPU nodes. GPU images default to CUDA 13.2.0. Perlmutter currently has NVIDIA driver 580.105.08; NVIDIA documents R580 as the CUDA 13.x driver family, so CUDA 13.2 is expected to work through CUDA minor-version compatibility. The CUDA version is still a build argument so it can be pinned to 13.0 if site validation requires it.
Public-source targets in container/Containerfile:
| Image tag | Nodes | Contents |
|---|---|---|
| libfabric-cpu | CPU | XPMEM userspace, Cassini/CXI headers, libcxi, libfabric with CXI/LNX/EFA |
| libfabric-gpu | GPU | libfabric-cpu plus CUDA and GDRCopy support |
| mpich-cpu | CPU | libfabric-cpu plus MPICH CH4/OFI linked to PMIx |
| mpich-gpu | GPU | mpich-cpu equivalent with CUDA-aware MPICH build |
| openmpi-cpu | CPU | libfabric-cpu plus Open MPI 5 with OFI and external PMIx |
| openmpi-gpu | GPU | openmpi-cpu equivalent with CUDA-aware Open MPI build |
| openmpi-ofi-ucx-cpu | CPU | openmpi-cpu plus UCX and OpenSHMEM enabled |
| openmpi-ofi-ucx-gpu | GPU | CUDA-aware Open MPI with OFI, UCX, OpenSHMEM, and external PMIx |
| nccl-gpu | GPU | libfabric-gpu, NCCL, the AWS OFI NCCL plugin, and single-process NCCL tests built without MPI |
| nvshmem-gpu | GPU | nccl-gpu plus CUDA 13 NVSHMEM packages, PMIx/libfabric runtime settings, and an NVSHMEM hello test |
The source of truth is the multi-stage container/Containerfile. For easier reading and CI builds, the repository also carries one buildable Containerfile per named target under container/targets/.
The generator is scripts/generate-target-containerfiles.py. It parses named FROM ... AS ... stages from container/Containerfile, follows the parent-stage dependency closure for each target, and writes a single-target file containing only the top-level build arguments and the ancestor stages needed for that target. For example, container/targets/mpich-gpu.Containerfile contains gpu-base, libfabric-gpu, and mpich-gpu, but omits unrelated OpenMPI, NCCL, and NVSHMEM stages.
Useful commands:
# List all named stages.
scripts/generate-target-containerfiles.py --list
# Regenerate every per-target file under container/targets/.
scripts/generate-target-containerfiles.py
# Regenerate one target.
scripts/generate-target-containerfiles.py --target mpich-gpu
# Check that generated files are up to date. GitHub Actions runs this.
scripts/generate-target-containerfiles.py --check

scripts/build.sh and the GitHub Actions matrix build from container/targets/&lt;target&gt;.Containerfile directly. Edit container/Containerfile, regenerate, and commit both the source and generated files.
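The stage-selection step can be sketched in a few lines of Python (a simplified illustration, not the actual script; the function names, the sample Containerfile text, and the FROM-only dependency model are invented for this example — the real generator may also track other cross-stage references):

```python
import re

# Match named stages: "FROM <base> AS <name>".
STAGE_RE = re.compile(r"^FROM\s+(\S+)\s+AS\s+(\S+)", re.IGNORECASE | re.MULTILINE)

def stage_parents(containerfile_text):
    """Map each named stage to the parent stages it builds FROM."""
    parents = {}
    for base, name in STAGE_RE.findall(containerfile_text):
        # Only record parents that are themselves named stages;
        # external images like ubuntu:24.04 are not part of the closure.
        parents[name] = [base] if base in parents else []
    return parents

def closure(target, parents):
    """Return the target stage plus all ancestor stages, in build order."""
    ordered = []
    def visit(stage):
        for parent in parents.get(stage, []):
            visit(parent)
        if stage not in ordered:
            ordered.append(stage)
    visit(target)
    return ordered

text = """\
FROM ubuntu:24.04 AS gpu-base
FROM gpu-base AS libfabric-gpu
FROM libfabric-gpu AS mpich-gpu
FROM libfabric-gpu AS nccl-gpu
"""
print(closure("mpich-gpu", stage_parents(text)))
# -> ['gpu-base', 'libfabric-gpu', 'mpich-gpu']; nccl-gpu is omitted
```

Emitting only the stages in this closure, plus the top-level build arguments, yields the single-target files under container/targets/.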
These images keep the communication stack inside the container, while the host provides the kernel drivers, Slurm launch, and device files.
The main layers are:
| Layer | What it provides | Main techniques used |
|---|---|---|
| CXI | Userspace access to the HPE Slingshot Cassini NIC. | Host kernel device files such as /dev/cxi*, libcxi ioctls, NIC command queues, completion queues, memory registration, and hardware offload. |
| OFI | A provider-neutral network API used by MPI, NCCL plugins, and PGAS libraries. | The libfabric API exposes endpoints, address vectors, completion queues, scalable endpoints, tagged messaging, RMA, atomics, and provider selection through FI_PROVIDER=cxi. |
| libfabric | The implementation of OFI and the container entry point to the CXI provider. | Provider plugins, memory registration cache, XPMEM-assisted intra-node paths, and optional CUDA/GDRCopy memory support in GPU images. |
| XPMEM | Intra-node cross-process memory mapping used by shared-memory and libfabric paths. | Host /dev/xpmem, make/get/attach/detach handles, process-to-process mappings, and low-copy host-memory transfers within a node. |
| UCX | A communication framework used by OpenMPI components and OpenSHMEM. | Transport selection, active messages, RMA, tagged operations, shared memory transports, CUDA memory hooks, and optional GPUDirect-style paths. |
| PMIx | Process-management wire-up between Slurm and ranks inside the container. | Slurm hosts the PMIx server, the image carries the OpenPMIx client, and ranks exchange job metadata, endpoints, namespaces, and local topology. |
| MPI | The application-facing distributed-memory programming model. | Point-to-point messages, collectives, communicators, derived datatypes, one-sided RMA, and launcher integration through PMIx. |
| MPICH | MPI implementation used for CH4/OFI examples. | The CH4 device uses the OFI netmod, which maps MPI operations onto libfabric endpoints and the cxi provider. |
| OpenMPI | MPI and OpenSHMEM implementation used for the combined examples. | Modular components select the runtime path: pml/cm plus mtl/ofi for MPI over OFI/CXI, and UCX components for optional MPI paths and OSHMEM. |
| NCCL | GPU collective communication library. | CUDA kernels, topology-aware rings/trees, GPU buffers, and the AWS OFI NCCL plugin for Slingshot through libfabric/CXI. |
| aws-ofi-nccl | NCCL network plugin that connects NCCL collectives to OFI/libfabric. | NCCL net-plugin callbacks, libfabric endpoints and completion queues, CXI provider selection, GPU-memory registration, and multi-NIC rank-to-CXI mapping. |
| OpenSHMEM | Partitioned global address space model for symmetric-memory communication. | Processing elements, symmetric heaps, puts/gets, atomics, barriers, and OpenMPI OSHMEM SPML components, with UCX in the combined images. |
| NVSHMEM | GPU-resident SHMEM programming model. | Symmetric GPU memory, device-side puts/gets/atomics, CUDA streams, PMIx bootstrap, libfabric/CXI transport, and NCCL-assisted collectives where applicable. |
MPI over Slingshot uses OFI/libfabric and the CXI provider:
MPI application
|
+-- MPICH CH4/OFI
| |
| +-- OFI API
| |
| +-- libfabric cxi provider
| |
| +-- libcxi + host /dev/cxi*
| |
| +-- Slingshot/Cassini NIC
|
+-- OpenMPI default on Perlmutter
|
+-- PML cm -> MTL ofi
|
+-- OFI API -> libfabric cxi -> libcxi -> /dev/cxi*
PMIx is the process wire-up path from Slurm into the container:
srun --mpi=pmix
|
+-- Slurm PMIx server on host
|
+-- OpenPMIx client in image
|
+-- MPI ranks or SHMEM PEs
The combined OpenMPI targets include both OFI and UCX:
openmpi-ofi-ucx-*
|
+-- MPI default path:
| PML cm -> MTL ofi -> libfabric -> cxi
|
+-- Optional MPI/one-sided path:
| PML ucx / OSC ucx -> UCX transports
|
+-- OpenSHMEM path:
OSHMEM -> SPML ucx -> UCX transports
GPU collectives and GPU SHMEM layer on the same network substrate:
NCCL
|
+-- AWS OFI NCCL plugin -> libfabric/OFI -> cxi
NVSHMEM
|
+-- PMIx for wire-up
+-- libfabric/OFI/cxi for remote transport
+-- NCCL for supported collectives
+-- CUDA/GDRCopy for GPU memory paths
The NCCL image is intentionally not an MPI image. It builds nccl-tests with MPI=0, which makes the bundled tests useful for single-process GPU smoke tests. Multi-rank NCCL validation should be driven by application launchers or Slurm wrapper scripts that explicitly choose the rank layout, GPU binding, and cross-node topology.
The NVSHMEM image also avoids an OpenMPI base. It uses NVIDIA's CUDA 13 NVSHMEM packages, sets PMIx as the Slurm bootstrap path, keeps libfabric/CXI as the remote transport, and removes packaged MPI/OpenSHMEM/UCX/IB plugins that are not part of the Perlmutter Slingshot path.
UCX is present in the combined OpenMPI images for OpenSHMEM and portability testing. The default Perlmutter MPI path still selects OFI/CXI with:
OMPI_MCA_pml=cm
OMPI_MCA_mtl=ofi
FI_PROVIDER=cxi
PMIX_MCA_psec=native

User-facing Perlmutter container instructions are in user-docs/. These pages focus on choosing and running one published image with podman-hpc, and include per-library benchmark summaries.
Implementation notes are in docs/:
| Topic | Page |
|---|---|
| CXI | docs/cxi.md |
| libfabric and OFI | docs/libfabric-ofi.md |
| UCX | docs/ucx.md |
| PMIx | docs/pmix.md |
| MPI | docs/mpi.md |
| MPICH | docs/mpich.md |
| OpenMPI | docs/openmpi.md |
| NCCL | docs/nccl.md |
| OpenSHMEM | docs/openshmem.md |
| NVSHMEM | docs/nvshmem.md |
| Perlmutter runtime | docs/runtime.md |
Build one target:
scripts/build.sh mpich-cpu
scripts/build.sh openmpi-gpu
scripts/build.sh openmpi-ofi-ucx-gpu

Build all public-source targets:

scripts/build.sh all

Override CUDA:

scripts/build.sh --build-arg CUDA_VERSION=13.0.0 mpich-gpu

The NCCL image defaults to NCCL_PACKAGE_VERSION=2.29.7-1+cuda13.2, which is one of the NCCL versions tested with aws-ofi-nccl 1.19.0. Override it only when validating a newer NCCL package:

scripts/build.sh --build-arg NCCL_PACKAGE_VERSION=2.30.4-1+cuda13.2 nccl-gpu

These examples do not use podman-hpc --mpi or --cuda-mpi. MPI, libfabric, CXI, and GPU communication libraries are in the image. Slurm provides launch and PMIx wire-up, and podman-hpc shared-run starts one container per node.
The directory scripts/perlmutter-images/ contains standalone sbatch scripts, one per published image. Each file contains the flattened srun ... podman-hpc shared-run command and can be copied into an application repository. Use scripts/run-perlmutter.sh for day-to-day repo testing; use the standalone scripts when documenting or adapting a single image for an end-user workflow.
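The overall shape of such a standalone script is roughly the following. This is an illustrative sketch, not a copy of the repository scripts: the #SBATCH values, the image tag, the bash -c wrapping, and the use of the podman-hpc --gpu flag are assumptions to adapt, and the real scripts also export the PMIx and CXI runtime environment described later in this document.

```shell
#!/bin/bash
#SBATCH --nodes=2
#SBATCH --ntasks-per-node=4
#SBATCH --constraint=gpu
#SBATCH --time=00:10:00

# Default smoke test; override with APP_COMMAND=... sbatch --export=ALL ...
APP_COMMAND=${APP_COMMAND:-'python3 /workspace/tests/test_mpi4py.py'}
IMAGE=ghcr.io/dingp/communication-libraries-image:openmpi-gpu

# Slurm provides the launch and PMIx wire-up; podman-hpc shared-run
# starts one container per node and places that node's ranks inside it.
srun --mpi=pmix podman-hpc shared-run --gpu "$IMAGE" bash -c "$APP_COMMAND"
```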
For MPICH and OpenMPI images, the default smoke test is:
python3 /workspace/tests/test_mpi4py.py

Each script has default #SBATCH settings, a default image tag, and a simple smoke-test command. Copy the script for the image you use, edit the #SBATCH lines and APP_COMMAND, then submit it with sbatch.
| Image tag | Script | Submit example |
|---|---|---|
| libfabric-cpu | scripts/perlmutter-images/run-libfabric-cpu.sbatch | sbatch scripts/perlmutter-images/run-libfabric-cpu.sbatch |
| libfabric-gpu | scripts/perlmutter-images/run-libfabric-gpu.sbatch | sbatch scripts/perlmutter-images/run-libfabric-gpu.sbatch |
| mpich-cpu | scripts/perlmutter-images/run-mpich-cpu.sbatch | sbatch scripts/perlmutter-images/run-mpich-cpu.sbatch |
| mpich-gpu | scripts/perlmutter-images/run-mpich-gpu.sbatch | sbatch scripts/perlmutter-images/run-mpich-gpu.sbatch |
| openmpi-cpu | scripts/perlmutter-images/run-openmpi-cpu.sbatch | sbatch scripts/perlmutter-images/run-openmpi-cpu.sbatch |
| openmpi-gpu | scripts/perlmutter-images/run-openmpi-gpu.sbatch | sbatch scripts/perlmutter-images/run-openmpi-gpu.sbatch |
| openmpi-ofi-ucx-cpu | scripts/perlmutter-images/run-openmpi-ofi-ucx-cpu.sbatch | sbatch scripts/perlmutter-images/run-openmpi-ofi-ucx-cpu.sbatch |
| openmpi-ofi-ucx-gpu | scripts/perlmutter-images/run-openmpi-ofi-ucx-gpu.sbatch | sbatch scripts/perlmutter-images/run-openmpi-ofi-ucx-gpu.sbatch |
| nccl-gpu | scripts/perlmutter-images/run-nccl-gpu.sbatch | sbatch scripts/perlmutter-images/run-nccl-gpu.sbatch |
| nvshmem-gpu | scripts/perlmutter-images/run-nvshmem-gpu.sbatch | sbatch scripts/perlmutter-images/run-nvshmem-gpu.sbatch |
Override the application command without editing the script:
APP_COMMAND='./my_app --input input.toml' sbatch --export=ALL scripts/perlmutter-images/run-openmpi-ofi-ucx-gpu.sbatch

The repository wrapper commands are shorter and use the same runtime shape:
scripts/run-perlmutter.sh cpu mpich
scripts/run-perlmutter.sh gpu openmpi
scripts/run-perlmutter.sh gpu openmpi-ofi-ucx
scripts/run-perlmutter.sh gpu nccl
scripts/run-perlmutter.sh gpu nvshmem

The default gpu nccl run uses one node and one task because the bundled nccl-tests binary is built without MPI. Set NODES and TASKS_PER_NODE, and pass a custom command, when using an application-level or Slurm-managed distributed NCCL test.
The run script currently passes the PMIx and CXI runtime environment explicitly. Once NERSC/podman-hpc#152 is deployed, set PODMANHPC_PMIX_HELPER=module to use the podman-hpc --pmix helper instead of the manual PMIx plumbing:
PODMANHPC_PMIX_HELPER=module scripts/run-perlmutter.sh gpu openmpi

For Podman backend debugging, build a local Podman 5.8.2 under $SCRATCH and point podman-hpc at it:
scripts/perlmutter-tools/build-podman-5.8.2.sh
export PODMANHPC_PODMAN_BIN=$SCRATCH/communication-libraries-image/podman-alt/podman-5.8.2/bin/podmanTo test a newer passt/pasta helper with either the site Podman or the local Podman build:
scripts/perlmutter-tools/build-passt.sh
export CONTAINERS_HELPER_BINARY_DIR=$SCRATCH/communication-libraries-image/podman-alt/passt-2026_01_20.386b5f5/install/bin

Podman searches CONTAINERS_HELPER_BINARY_DIR before the system helper directories, so this changes which pasta binary is used without replacing /usr/bin/pasta.
For overlay shifting diagnostics with non-contiguous rootless UID/GID maps, the same helper can build a patched comparison binary:
APPLY_FORCE_SHIFTING_PATCH=1 \
INSTALL_ROOT=$SCRATCH/communication-libraries-image/podman-alt/podman-5.8.2-force-shifting \
scripts/perlmutter-tools/build-podman-5.8.2.sh
export PODMANHPC_PODMAN_BIN=$SCRATCH/communication-libraries-image/podman-alt/podman-5.8.2-force-shifting/bin/podman
export _CONTAINERS_FORCE_SHIFTING=1

For direct podman-hpc run --userns=keep-id path-access debugging, the helper can also build a Podman 5.8.2 binary with the old rootless runtime makeAccessible() path restored:
APPLY_MAKE_ACCESSIBLE_PATCH=1 \
INSTALL_ROOT=$SCRATCH/communication-libraries-image/podman-alt/podman-5.8.2-make-accessible \
scripts/perlmutter-tools/build-podman-5.8.2.sh

For group-protected host bind mounts under --userns=keep-id, Podman/crun can preserve the submitting user's supplemental host groups with:
podman-hpc run --rm --privileged --userns=keep-id --group-add keep-groups ...

The diagnostic wrapper can test both normal writable bind mounts and a directory writable only through a supplemental group:
sbatch --export=ALL,PODMAN_BIN=$SCRATCH/communication-libraries-image/podman-alt/podman-5.8.2-make-accessible/bin/podman,GROUP_WRITE_TEST_DIR=<group-writable-host-dir> \
scripts/perlmutter-tools/test-podman-keepid-bindmount.sbatch

See docs/podman-backend.md for the PMIx --userns=keep-id comparison, direct compute-node reproducer, overlay-shifting comparison, bind-mount ownership test, and supplemental-group findings.
The direct run and bind-mount wrappers are:
sbatch scripts/perlmutter-tools/test-podman-keepid-run.sbatch
sbatch scripts/perlmutter-tools/test-podman-keepid-bindmount.sbatch

For login-node Podman network throughput checks, run the reproducible curl benchmark:
scripts/perlmutter-tools/test-podman-network-curl.sh

It compares single and parallel curl downloads for default networking, explicit pasta, slirp4netns, --network=host, and a direct host curl baseline, then writes CSV files and report.md under $SCRATCH/communication-libraries-image/podman-network-curl/.
For GPU runs, bind the complete host NVIDIA driver library set, not only libcuda. The script binds /usr/lib64/libcuda*, /usr/lib64/libnvidia*, and /usr/bin/nvidia-smi; this keeps the driver JIT and NVML libraries matched to the Perlmutter 580.105.08 host driver while the image carries the CUDA 13.2 user-space toolkit.
Benchmark support is under benchmarks/. It includes benchmark image stages based on this repo's images, Perlmutter sbatch scripts, and a parser that converts OSU, NCCL, and NVSHMEM logs into a Markdown report.
benchmarks/scripts/build.sh bench-openmpi-ofi-ucx-gpu
MPI_IMPL=openmpi-ofi-ucx sbatch --export=ALL benchmarks/scripts/perlmutter/run-mpi-osu-gpu.sbatch
benchmarks/scripts/process-results.py "$SCRATCH/communication-libraries-image/benchmarks/results/<jobid>" \
-o "$SCRATCH/communication-libraries-image/benchmarks/results/<jobid>/report.md"

See benchmarks/README.md for the benchmark matrix.
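The log-to-Markdown step can be illustrated with a minimal OSU-latency extractor. This is a simplified sketch, not the repository parser: the function names and the sample log text are invented here, and the real tool also handles OSU bandwidth, NCCL, and NVSHMEM output formats.

```python
def parse_osu_latency(log_text):
    """Extract (message_size_bytes, latency_us) pairs from an OSU latency log."""
    rows = []
    for line in log_text.splitlines():
        line = line.strip()
        if not line or line.startswith("#"):
            continue  # skip the test banner and column-header comment lines
        parts = line.split()
        if len(parts) == 2:
            rows.append((int(parts[0]), float(parts[1])))
    return rows

def to_markdown(rows):
    """Render the parsed rows as a Markdown table."""
    lines = ["| Size (bytes) | Latency (us) |", "|---|---|"]
    lines += [f"| {size} | {lat:.2f} |" for size, lat in rows]
    return "\n".join(lines)

sample = """\
# OSU MPI Latency Test
# Size          Latency (us)
8               2.31
1024            3.75
"""
print(to_markdown(parse_osu_latency(sample)))
```

Running the sketch on the sample log prints a two-row Markdown table, the same shape the generated report.md uses for per-size results.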
GitHub Actions builds and pushes public-source targets to GHCR:
ghcr.io/dingp/communication-libraries-image:libfabric-cpu
ghcr.io/dingp/communication-libraries-image:libfabric-gpu
ghcr.io/dingp/communication-libraries-image:mpich-cpu
ghcr.io/dingp/communication-libraries-image:mpich-gpu
ghcr.io/dingp/communication-libraries-image:openmpi-cpu
ghcr.io/dingp/communication-libraries-image:openmpi-gpu
ghcr.io/dingp/communication-libraries-image:openmpi-ofi-ucx-cpu
ghcr.io/dingp/communication-libraries-image:openmpi-ofi-ucx-gpu
ghcr.io/dingp/communication-libraries-image:nccl-gpu
ghcr.io/dingp/communication-libraries-image:nvshmem-gpu
ghcr.io/dingp/communication-libraries-image:bench-mpich-cpu
ghcr.io/dingp/communication-libraries-image:bench-mpich-gpu
ghcr.io/dingp/communication-libraries-image:bench-openmpi-cpu
ghcr.io/dingp/communication-libraries-image:bench-openmpi-gpu
ghcr.io/dingp/communication-libraries-image:bench-openmpi-ofi-ucx-cpu
ghcr.io/dingp/communication-libraries-image:bench-openmpi-ofi-ucx-gpu
ghcr.io/dingp/communication-libraries-image:bench-nccl-gpu
ghcr.io/dingp/communication-libraries-image:bench-nccl-mpich-gpu
ghcr.io/dingp/communication-libraries-image:bench-nvshmem-gpu