Pyxis Enroot Error Reference
pyxis: [ERROR] /etc/enroot/hooks.d/98-nvidia.sh exited with return code <N>
This umbrella line fires whenever the NVIDIA hook exits non-zero, for any reason. It is always accompanied by a more specific inner error (device, driver, capability, requirement, ldcache, etc.) — look at the lines above this one to find the real cause. Most production cases are toolkit or driver breakage on the node.
Owner: depends on the inner error. See the sub-cases below.
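# Reproduction: remove a driver control device, then run any GPU container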
$ sudo rm /dev/nvidia-modeset
$ NVIDIA_VISIBLE_DEVICES=0 NVIDIA_DRIVER_CAPABILITIES=all \
srun --container-image=nvidia/cuda:13.0.1-base-ubuntu24.04 nvidia-smi
error: pyxis: nvidia-container-cli: mount error: stat failed: /dev/nvidia-modeset: no such file or directory
error: pyxis: [ERROR] /etc/enroot/hooks.d/98-nvidia.sh exited with return code 1
NVIDIA control devices (nvidia-modeset, nvidiactl, nvidia-uvm,
nvidia-uvm-tools) are created at driver load. If any is missing,
libnvidia-container's host-device stat() fails. Numbered GPU devices
(/dev/nvidia0, etc.) fail through a different path (device error, below).
Owner: cluster operator. The driver state on the node is broken — typically needs a driver reinstall or reload.
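A minimal node-side triage sketch, assuming root on the affected node (nvidia-modprobe flag behavior varies slightly across driver versions):
$ ls -l /dev/nvidiactl /dev/nvidia-uvm /dev/nvidia-uvm-tools /dev/nvidia-modeset
$ sudo nvidia-modprobe -m -u -c 0   # recreate missing control/device nodes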
error: pyxis: [WARN] Kernel module nvidia_uvm is not loaded. Make sure the NVIDIA device driver is installed and loaded.
error: pyxis: nvidia-container-cli: detection error: nvml error: not supported
error: pyxis: [ERROR] /etc/enroot/hooks.d/98-nvidia.sh exited with return code 1
NVML cannot talk to the kernel module — missing module load, partial driver upgrade, or ABI mismatch between user-space libraries and kernel module.
Owner: cluster operator.
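Node-side check for this case, assuming root:
$ lsmod | grep -E '^nvidia'
$ sudo modprobe nvidia_uvm   # load the missing module
$ nvidia-smi                 # should succeed once user space and kernel module agree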
$ NVIDIA_VISIBLE_DEVICES=42 srun --container-image=nvcr.io/nvidia/pytorch:26.02-py3 --pty bash
error: pyxis: nvidia-container-cli: device error: 42: unknown device
error: pyxis: [ERROR] /etc/enroot/hooks.d/98-nvidia.sh exited with return code 1
NVIDIA_VISIBLE_DEVICES points at a device index that does not exist on the
node — typically a user typo, or requesting more devices than the SLURM
--gres allocation provides.
Owner: user. Fix the env var or the --gres request. Edge case: on a
GPU-less node where device 0 is missing, the fix is still user-side
(don't request GPUs on a CPU node).
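To list the indices that are actually valid on the node before setting the variable:
$ nvidia-smi -L   # one line per GPU; the leading number is the index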
$ NVIDIA_DRIVER_CAPABILITIES=comput srun --container-image=nvcr.io/nvidia/pytorch:26.02-py3 --pty bash
error: pyxis: [ERROR] Unknown NVIDIA driver capability: comput
error: pyxis: [ERROR] /etc/enroot/hooks.d/98-nvidia.sh exited with return code 1
Typo in NVIDIA_DRIVER_CAPABILITIES. Valid values include compute,
utility, graphics, display, video, compat32, all.
Owner: user.
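Corrected invocation (values are comma-separated, or use all):
$ NVIDIA_DRIVER_CAPABILITIES=compute,utility srun --container-image=nvcr.io/nvidia/pytorch:26.02-py3 --pty bash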
$ NVIDIA_REQUIRE_CUDA="cuda>=99.0" srun --container-image=nvcr.io/nvidia/pytorch:26.02-py3 --pty bash
error: pyxis: nvidia-container-cli: requirement error: unsatisfied condition: cuda>=99.0, please update your driver to a newer version, or use an earlier cuda container
error: pyxis: [ERROR] /etc/enroot/hooks.d/98-nvidia.sh exited with return code 1
NVIDIA_REQUIRE_* (set by the user at submit time, or baked into the image)
isn't satisfied by the host driver.
Owner: user. Even when the requirement comes from the image, the submitter's fix is to use an older image or request a node with a newer driver.
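To check what the host driver actually provides before pinning a requirement:
$ nvidia-smi --query-gpu=driver_version --format=csv,noheader
$ nvidia-smi | grep 'CUDA Version'   # the maximum CUDA version this driver satisfies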
error: pyxis: [ERROR] Command not found: nvidia-container-cli, see https://github.com/NVIDIA/libnvidia-container
error: pyxis: [ERROR] /etc/enroot/hooks.d/98-nvidia.sh exited with return code 1
Typically a CPU-only node where the admin has not set
NVIDIA_VISIBLE_DEVICES=void in the enroot config, which would make the
hook a no-op. Without that, the hook tries to exec nvidia-container-cli
on a node where it is correctly not installed.
Owner: cluster operator. Fix is admin-side — either install the toolkit
or add NVIDIA_VISIBLE_DEVICES=void to the enroot config on CPU-only nodes.
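Sketch of the admin-side fix, assuming a stock enroot layout (the environ.d filename is arbitrary):
$ echo 'NVIDIA_VISIBLE_DEVICES=void' | sudo tee /etc/enroot/environ.d/99-cpu-only.env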
error: pyxis: nvidia-container-cli: ldcache error: process /usr/sbin/ldconfig.real failed with error code: 1
error: pyxis: [ERROR] /etc/enroot/hooks.d/98-nvidia.sh exited with return code 1
ldconfig fails during the post-mount library cache rebuild — corrupted
ldconfig binary inside the container, ABI skew with host glibc, or LSM
confinement (AppArmor/SELinux).
Owner: cluster operator (LSM / glibc skew) or image author (corrupted ldconfig). Triage by reproducing inside the container.
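Reproduction sketch inside the same image (the ldconfig path follows the error line; adjust for non-Debian images):
$ srun --container-image=<same-image> /usr/sbin/ldconfig.real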
slurmstepd: error: pyxis: nvidia-container-cli: detection error: path error: ///var/run/nvidia-fabricmanager/socket: permission denied
slurmstepd: error: pyxis: [ERROR] /etc/enroot/hooks.d/98-nvidia.sh exited with return code 1
libnvidia-container can't stat() or open an admin-managed system path
(fabric-manager socket, /proc/driver/nvidia/*, etc.). Typically wrong
permissions on the socket file as set by the system service.
Owner: cluster operator.
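Node-side check, assuming the standard fabric-manager systemd unit name:
$ ls -l /var/run/nvidia-fabricmanager/socket
$ systemctl status nvidia-fabricmanager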
error: pyxis: enroot-mount: failed to mount: tmpfs at /raid/enroot/data/user-147141/pyxis_49_49.1/sys/class/infiniband_mad: No such file or directory
error: pyxis: [ERROR] /etc/enroot/hooks.d/99-mellanox.sh exited with return code 1
The mellanox hook bind-mounts tmpfs at /sys/class/infiniband* paths. When
an IB kernel module (e.g. ib_umad) is not loaded on the host, the source
sysfs directory does not exist and the hook fails.
Owner: cluster operator. Load the missing IB module(s) or remove the mellanox hook on nodes where InfiniBand is not provisioned.
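Node-side check/fix sketch:
$ ls -d /sys/class/infiniband_mad 2>/dev/null || echo missing
$ sudo modprobe ib_umad   # creates /sys/class/infiniband_mad on load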
$ srun --container-image=docker://registry-1.docker.io/library/private-image:latest true
error: pyxis: [ERROR] URL https://registry-1.docker.io/v2/library/private-image/manifests/latest returned error code: 401 Unauthorized
Missing or insufficient credentials. On Docker Hub, the /token endpoint
typically succeeds anonymously but /manifests/ returns 401 for private
repos.
Owner: user. Configure credentials in ~/.config/enroot/.credentials
for the target hostname.
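The credentials file is netrc-formatted; for Docker Hub the relevant machine entry is the auth endpoint, not the registry hostname:
machine auth.docker.io login <user> password <access-token>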
$ srun --container-image=docker://ghcr.io/nonexistent-user/no-such-image:latest true
error: pyxis: [ERROR] URL https://ghcr.io/token returned error code: 403 Forbidden
Quirk of ghcr.io: it returns 403 at the /token endpoint for nonexistent
repos rather than 404 at the manifest endpoint. The user-facing fix is the
same as for 401 — check credentials and image name.
Owner: user.
$ srun --container-image=docker://ubuntu:36.04 true # nonexistent tag
error: pyxis: [ERROR] URL https://registry-1.docker.io/v2/library/ubuntu/manifests/36.04 returned error code: 404 Not Found
Misspelled image name, deleted tag, or wrong registry hostname.
Owner: user.
$ srun --container-image='docker:///foo' true
error: pyxis: [ERROR] Invalid image reference: docker:///foo
The URI fails enroot's parser — common patterns: triple-slash, trailing
space, embedded whitespace. The single-slash form docker:/foo actually
parses (as a hostname) and surfaces as a different error (curl tunnel /
DNS).
Owner: user.
$ srun --container-image=nvcr.io/nvidia/eval-factory/simple-evals:26.03 ...
error: pyxis: [ERROR] URL https://registry-1.docker.io/v2/nvcr.io/nvidia/eval-factory/simple-evals/manifests/26.03 returned error code: 401 Unauthorized
Older enroot quirk: when no credentials are configured for a <hostname>,
enroot silently treats the whole <hostname>/<repo> as a Docker Hub repo
path. The dotted hostname appearing in the /v2/ path segment is the tell
(v2/nvcr.io/... instead of v2/...).
Owner: user. Add credentials for the target hostname in
~/.config/enroot/.credentials.
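For nvcr.io the login is the literal string $oauthtoken, per NGC convention:
machine nvcr.io login $oauthtoken password <NGC-API-key>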
slurmstepd: error: pyxis: [ERROR] Could not parse digest information from https://<your-registry>/v2/<repo>/manifests/<tag>
The registry returned a well-formed response that is not a valid manifest,
or the manifest is missing the docker-content-digest header. Typically a
registry-side or proxy-side bug.
Owner: cluster / registry operator.
$ srun --container-image=docker:/alpine true # single slash parses as hostname "alpine"
error: pyxis: curl: (56) CONNECT tunnel failed, response 502
$ srun --container-image=docker://nonexistent.example.com/foo:bar true
error: pyxis: curl: (56) CONNECT tunnel failed, response 502
On clusters with a CONNECT proxy, both user typos pointing at nonexistent
hostnames and genuine registry outages surface as the same curl: (56)
error. The two cases are indistinguishable from the log alone.
Owner: either user (typo) or cluster / registry operator (real outage). Verify the hostname and image first; if it looks correct, escalate.
Two zstd sub-errors for different corruption modes, same remedy:
# truncated layer
slurmstepd: error: pyxis: [INFO] Extracting image layers...
slurmstepd: error: pyxis: zstd: /*stdin*\: unexpected end of file
slurmstepd: error: pyxis: tar: Child returned status 1
# byte-corrupted layer (checksum mismatch)
error: pyxis: /*stdin*\ : Decoding error (36) : Restored data doesn't match checksum
error: pyxis: tar: Unexpected EOF in archive
error: pyxis: tar: Error is not recoverable: exiting now
The enroot cache contains a bad layer blob — usually from a previously
interrupted import. Remedy: clear $ENROOT_CACHE_PATH (or the per-user
cache subdirectory) and retry.
Owner: typically cluster-side cache hygiene; occasionally user-side when the user controls the cache directory.
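Remedy sketch (destructive; assumes ENROOT_CACHE_PATH is set in the environment or resolvable from the enroot config):
$ rm -rf "${ENROOT_CACHE_PATH:?}"/*   # then retry the pull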
error: pyxis: curl: (28) Operation timed out after 1200000 milliseconds with 1510260015 out of 10371127195 bytes received
error: pyxis: pigz: skipping: <stdin>: corrupted -- invalid deflate data (invalid distance code)
The pull exceeded enroot's configured transfer timeout. The log includes both bytes received and the timeout window — divide them to estimate effective throughput. Slow effective rate suggests an infrastructure problem; a normal rate on a very large image suggests the user should increase the timeout or use a smaller image.
Owner: either. Use the throughput estimate to decide.
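Worked example from the log line above: 1510260015 bytes / 1200 s ≈ 1.26 MB/s effective throughput — far below a healthy registry link, so this particular instance points at infrastructure rather than image size.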
slurmstepd: error: pyxis: curl: (52) Empty reply from server
The connection succeeded and the request was sent, but the server closed without returning anything — registry crash, load-balancer dropped the connection mid-request, etc. User action cannot produce this.
Owner: cluster / registry operator.
slurmstepd: error: pyxis: curl: (6) Could not resolve host: registry-1.docker.io
Node DNS misconfiguration, resolver unreachable, or genuinely unreachable
hostname. For known-good registries (e.g. registry-1.docker.io) this is
unambiguously infrastructure; for unknown hostnames it could be a user
typo on clusters with direct egress.
Owner: likely cluster operator; check the hostname for typos first.
error: pyxis: [INFO] Authentication succeeded
error: pyxis: [ERROR] Could not process JSON input
The HTTP request succeeded (auth even passed) but the response body is not JSON — almost always a caching proxy returning an HTML error page in place of the registry's response.
Owner: cluster / registry operator.
error: pyxis: [ERROR] Could not acquire rootfs lock
Pyxis creates one container per node per step, so a user cannot legitimately produce lock contention. A hit means another enroot process is stuck holding the lock — stale state on the node.
Owner: cluster operator. Investigate stuck enroot processes on the node.
$ echo 'not a real sqsh' > /tmp/bad.sqsh && srun --container-image=/tmp/bad.sqsh true
error: pyxis: failed to create container filesystem
error: pyxis: [ERROR] Invalid image format: /tmp/bad.sqsh
unsquashfs -s fails on the image — truncated or corrupted sqsh, or a
non-sqsh file passed as the image.
Owner: either. User-corrupted file (e.g. interrupted enroot import)
versus a cache-side bad pull. Check the file with unsquashfs -s outside
the job to triage.
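Triage sketch — a healthy image prints its superblock, a corrupted or non-sqsh file fails immediately:
$ unsquashfs -s /tmp/bad.sqsh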
error: pyxis: [ERROR] File already exists: /raid/containers/data/user-39367/pyxis_1620149_1620149.0
The last path component pyxis_<jobid>_<stepid>.<taskid> is auto-generated
by pyxis from SLURM IDs — a user cannot cause a collision on that pattern.
A hit means stale filesystem state (a previous job's container rootfs was
not cleaned up).
Owner: cluster operator.
$ srun --container-image=docker://alpine --container-mounts=/tmp/doesnotexist:/mnt/x true
error: pyxis: enroot-mount: failed to mount: /tmp/doesnotexist at /tmp/enroot-data/.../mnt/x: No such file or directory
Pyxis's own pre-check (--container-mounts: source path does not exist)
catches some of these earlier; the runtime enroot-mount failure catches
the rest.
Owner: user.
$ srun --container-image=/tmp/no-such-file.sqsh true
error: pyxis: failed to create container filesystem
error: pyxis: [ERROR] No such file or directory: /tmp/no-such-file.sqsh
User passed --container-image=<local sqsh> pointing at a file that does
not exist on the node.
Owner: user.
error: pyxis: mkdir: cannot create directory '/raid/enroot/cache/group-21323': Permission denied
error: pyxis: mkdir: cannot create directory '/var/lib/enroot/cache/job-1664320/.tokens.28686': Permission denied
Enroot tries to create a cache, data, or runtime subdirectory and mkdir
returns EACCES. Path shape varies per cluster
(/raid/enroot/*, /var/lib/enroot/*, etc.).
Owner: depends on the path. If the parent is under /home/* or
/lustre/* the user owns it (filesystem permissions are theirs to fix);
otherwise it's an admin-side misconfiguration of the parent directory.
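Triage sketch — the owner of the parent directory decides who owns the fix:
$ stat -c '%U %a %n' /raid/enroot/cache /raid/enroot/cache/group-21323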
error: pyxis: [INFO] Extracting squashfs filesystem...
error: pyxis: Read on filesystem failed because EOF
error: pyxis: read_id_table: failed to read id index table
error: pyxis: FATAL ERROR: File system corruption detected
Most common cause: two jobs concurrently writing to the same sqsh path
without synchronization (user-side race). Also possible: a stale partial
sqsh from a killed enroot import, or genuine disk corruption. The log
does not carry the path so the cause cannot be inferred from this line
alone.
Owner: typically user (concurrent writes); occasionally cluster operator (storage corruption).
error: pyxis: failed to create container filesystem
error: pyxis: [INFO] Extracting squashfs filesystem...
error: pyxis: FATAL ERROR: dir_scan: failed to make directory /persist-var/a4u8g-0002/lib/enroot/data/job-1673945/pyxis_<...>, because Permission denied
During extraction, unsquashfs's dir_scan cannot create a directory in the
destination tree. Distinct from the pre-extraction mkdir permission error
(different phase) and from the disk-full write error (different errno).
Owner: depends on the captured path (user-managed paths like /home/* or
/lustre/* versus admin-managed system paths like /raid/*).
error: pyxis: child 2529097 failed with error code: 137
error: pyxis: [INFO] Extracting squashfs filesystem...
error: pyxis: /usr/lib/enroot/runtime.sh: line 384: 2529132 Killed unsquashfs ...
error: Detected 1 oom_kill event in StepId=26343980.0.
srun: error: cpu-00030: task 0: Out Of Memory.
The kernel OOM-killed unsquashfs due to a cgroup memory limit. On CPU
nodes (shared, tight per-job memory budgets) this is typically a user
--mem issue; on GPU nodes (dedicated, plenty of RAM) it usually points
to an unusually large image or a leak.
Owner: typically user on CPU nodes; cluster operator on GPU nodes.
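User-side fix sketch on CPU nodes (the right limit depends on image size):
$ srun --mem=32G --container-image=nvcr.io/nvidia/pytorch:26.02-py3 true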
$ srun --container-image=nvcr.io/nvidia/pytorch:26.02-py3 true
error: pyxis: failed to import docker image: nvcr.io/nvidia/pytorch:26.02-py3
error: pyxis: [INFO] Creating squashfs filesystem...
error: pyxis: Write failed because No space left on device
error: pyxis: FATAL ERROR: Probably out of space on output filesystem
error: pyxis: Creating 4.0 filesystem on /run/pyxis/1000/16.0.squashfs, block size 131072.
The output partition ran out of space mid-write during the mksquashfs
import stage. Distinct from the unsquashfs extraction write failure
(different pipeline stage). The output partition for mksquashfs is
typically tmpfs (/run), small and system-managed.
Owner: cluster operator (the system path filling up).
error: pyxis: failed to create container filesystem
error: pyxis: Write on output file failed because No space left on device
error: pyxis: FATAL ERROR:writer: failed to write file /raid/containers/data/user-40595/pyxis_<...>/usr/local/lib/.../foo.pyc
unsquashfs runs out of space mid-extraction. Enroot's pre-check does not
always catch this (different threshold, or the partition fills between the
check and the extract). The path is captured in the log so the owner can
be inferred from it.
Owner: depends on the captured path.
Two production-observed sub-causes, same owner:
# EACCES — AppArmor profile missing (Ubuntu 24.04+)
$ sudo aa-disable /etc/apparmor.d/enroot-nsenter
$ srun --container-image=ubuntu:26.04 true
error: pyxis: enroot-nsenter: failed to create user namespace: Permission denied
# ENOSPC — /proc/sys/user/max_user_namespaces exhausted
error: pyxis: enroot-nsenter: failed to create user namespace: No space left on device
The kernel rejected unshare(CLONE_NEWUSER). The errno suffix identifies
the specific sub-cause (AppArmor profile missing, sysctl limit hit, kernel
config). Trap: the No space left on device errno intuitively suggests
disk-full but in this context means the namespace count cap — anchor on
the enroot-nsenter: failed to create <ns> namespace prefix when matching.
Owner: cluster operator.
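Admin-side check/fix sketch covering both sub-causes:
$ sysctl user.max_user_namespaces                  # ENOSPC sub-case: cap exhausted
$ sudo sysctl -w user.max_user_namespaces=65536    # raise the cap
$ sudo aa-enforce /etc/apparmor.d/enroot-nsenter   # EACCES sub-case: re-enable the profile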
slurmstepd: error: pyxis: enroot-aufs2ovlfs: failed to set capabilities: Operation not permitted
An enroot C binary (aufs2ovlfs, mount, nsenter, switchroot) cannot
acquire its required file capabilities via capset(). Typically the
enroot+caps companion package is not installed (that package runs
setcap during install), or the binaries were copied without xattr
preservation.
Owner: cluster operator.
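Check sketch (binary paths vary by packaging; /usr/bin shown):
$ getcap /usr/bin/enroot-aufs2ovlfs /usr/bin/enroot-mount /usr/bin/enroot-nsenter /usr/bin/enroot-switchroot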
slurmstepd: error: pyxis: enroot-switchroot: failed to execute: /bin/bash: Exec format error
execve() of the container's entrypoint fails. Exec format error
(ENOEXEC) means wrong architecture (e.g. amd64 image on an arm64 node)
or a corrupted ELF binary. Related variants — No such file or directory
(missing entrypoint) and Permission denied (non-executable file) — show
the same diagnostic shape.
Owner: user (image / arch mismatch is selectable at submit time).
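Triage sketch — compare the node architecture against the image platform:
$ uname -m                                   # node architecture
$ srun --container-image=<same-image> true   # if even /bin/true hits ENOEXEC, the whole rootfs is the wrong arch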
error: pyxis: container start failed with error code: <N>
Appears in nearly every pyxis-level failure. By itself it is just a wrapper; the real root cause is in the preceding enroot log lines.
Owner: depends on the underlying error in the log.
$ srun --container-image=docker://alpine --container-workdir=/does/not/exist pwd
error: pyxis: couldn't chdir to /does/not/exist: No such file or directory
--container-workdir points at a path that does not exist inside the
image. Container start itself succeeded — only the chdir() failed.
Owner: user.
$ srun --container-image=A --container-image=B true
srun: error: pyxis: --container-image specified multiple times
srun: error: Invalid --container-image argument: B.
Owner: user.
$ srun --container-mounts='missingcolon' --container-image=ubuntu:24.04 true
srun: error: pyxis: mount source must be a relative path, an absolute path, "tmpfs" or "umount"
srun: error: pyxis: could not add mount entry: missingcolon:missingcolon
srun: error: Invalid --container-mounts argument: missingcolon.
--container-mounts spec has no colon; pyxis tries to parse it as
<src>:<dst> with both fields identical and fails validation.
Owner: user.
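Corrected spec:
$ srun --container-mounts=/tmp:/mnt/x --container-image=ubuntu:24.04 true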
$ srun --container-mount-home --no-container-mount-home --container-image=ubuntu:26.04 true
srun: error: pyxis: both --container-mount-home and --no-container-mount-home were specified
srun: error: Invalid --no-container-mount-home argument: (null).
Any pair of mutually exclusive options — covers --container-mount-home,
--container-remap-root, --container-entrypoint, and the
--container-writable / --container-readonly pair.
Owner: user.
$ srun --container-name=collision:create --container-image=docker://alpine sleep 60 &
$ sleep 5
$ srun --container-name=collision:create --container-image=docker://alpine true
pyxis: imported docker image: docker://alpine
error: pyxis: error: "create" flag was passed to --container-name but the container already exists
The second job tries to create a container with a name already in use by the first on the same node. The image import completes before the collision check fires — a wasted pull, but harmless.
Owner: user.
srun: error: Invalid --container-image argument: B.
srun: error: Invalid --container-mounts argument: missingcolon.
srun: error: Invalid --no-container-mount-home argument: (null).
SLURM's own error message, emitted after any pyxis SPANK option fails validation. Useful as a fallback when the more specific pyxis validation errors were not captured.
Owner: user.
error: pyxis: failed to create container filesystem
The enroot create-stage wrapper (distinct from the import-stage wrapper,
whose line reads failed to import docker image). Useful for identifying
which pipeline stage failed; the real cause is in the preceding lines.
Owner: depends on the underlying error.
# Newer pyxis (dedicated SIGTERM handler)
error: pyxis: received SIGTERM, forwarding to child 807892
error: pyxis: child 807892 terminated with signal 15
error: pyxis: failed to import docker image: nvcr.io/nvidia/pytorch:26.02-py3
# Older pyxis (waitpid returns EINTR)
error: pyxis: could not wait for child 1661846: Interrupted system call
error: pyxis: failed to import docker image
The job was killed mid-pyxis-operation by an external SIGTERM (user
scancel, SLURM time limit, watchdog, etc.) — not a pyxis failure.
Two signatures depending on pyxis version. Pyxis does not use
self-interrupting patterns, so EINTR on waitpid() in pyxis's code
path is specifically a cancellation signal arriving from outside.
Owner: not a failure. The cancellation source (user, time limit, watchdog) is what should be investigated, not pyxis.
error: pyxis: container was not started successfully by another task
Recent pyxis versions only — older pyxis does not emit this line.
A peer task on the same node failed to start the container, and this task
is reporting that cascade. By itself this line is not the root cause;
look for the primary task's failure earlier in the log. If it is the only
pyxis signal on the job, the primary task probably received SIGKILL
and produced no log of its own.
Owner: depends on the primary task's failure (look at the log).