Pyxis Enroot Error Reference

Felix Abecassis edited this page Apr 29, 2026 · 8 revisions

Enroot hook failures

98-nvidia.sh hook failed (generic umbrella)

pyxis:     [ERROR] /etc/enroot/hooks.d/98-nvidia.sh exited with return code <N>

This umbrella line fires whenever the NVIDIA hook exits non-zero, for any reason. It is always accompanied by a more specific inner error (device, driver, capability, requirement, ldcache, etc.) — look at the lines above this one to find the real cause. Most production cases are toolkit or driver breakage on the node.

Owner: depends on the inner error. See the sub-cases below.
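A minimal triage sketch (assuming the step output has been captured to a file, here named job.log): print the lines just above the umbrella error, where the specific inner cause appears.

```shell
# Hypothetical captured step log containing a device error followed by
# the umbrella line (the log content below mirrors the examples on this page).
cat > job.log <<'EOF'
error: pyxis:     nvidia-container-cli: device error: 42: unknown device
error: pyxis:     [ERROR] /etc/enroot/hooks.d/98-nvidia.sh exited with return code 1
EOF

# The specific inner cause is in the lines just above the umbrella line.
grep -B 3 '98-nvidia.sh exited with return code' job.log
```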

Missing NVIDIA control device nodes

$ sudo rm /dev/nvidia-modeset
$ NVIDIA_VISIBLE_DEVICES=0 NVIDIA_DRIVER_CAPABILITIES=all \
    srun --container-image=nvidia/cuda:13.0.1-base-ubuntu24.04 nvidia-smi

error: pyxis:     nvidia-container-cli: mount error: stat failed: /dev/nvidia-modeset: no such file or directory
error: pyxis:     [ERROR] /etc/enroot/hooks.d/98-nvidia.sh exited with return code 1

NVIDIA control devices (nvidia-modeset, nvidiactl, nvidia-uvm, nvidia-uvm-tools) are created at driver load. If any is missing, libnvidia-container's host-device stat() fails. Numbered GPU devices (/dev/nvidia0, etc.) fail through a different path (device error, below).

Owner: cluster operator. The driver state on the node is broken — typically needs a driver reinstall or reload.

Driver / module mismatch (NVML detection)

error: pyxis:     [WARN] Kernel module nvidia_uvm is not loaded. Make sure the NVIDIA device driver is installed and loaded.
error: pyxis:     nvidia-container-cli: detection error: nvml error: not supported
error: pyxis:     [ERROR] /etc/enroot/hooks.d/98-nvidia.sh exited with return code 1

NVML cannot talk to the kernel module — missing module load, partial driver upgrade, or ABI mismatch between user-space libraries and kernel module.

Owner: cluster operator.

nvidia-container-cli: device error (bad device spec)

$ NVIDIA_VISIBLE_DEVICES=42 srun --container-image=nvcr.io/nvidia/pytorch:26.02-py3 --pty bash

error: pyxis:     nvidia-container-cli: device error: 42: unknown device
error: pyxis:     [ERROR] /etc/enroot/hooks.d/98-nvidia.sh exited with return code 1

NVIDIA_VISIBLE_DEVICES points at a device index that does not exist on the node — typically a user typo, or a request for more device indices than the SLURM --gres allocation provides.

Owner: user. Fix the env var or the --gres request. Edge case: on a GPU-less node where device 0 is missing, the fix is still user-side (don't request GPUs on a CPU node).

Unknown NVIDIA driver capability

$ NVIDIA_DRIVER_CAPABILITIES=comput srun --container-image=nvcr.io/nvidia/pytorch:26.02-py3 --pty bash

error: pyxis:     [ERROR] Unknown NVIDIA driver capability: comput
error: pyxis:     [ERROR] /etc/enroot/hooks.d/98-nvidia.sh exited with return code 1

Typo in NVIDIA_DRIVER_CAPABILITIES. Valid values include compute, utility, graphics, display, video, compat32, all.

Owner: user.
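A pre-submit sanity check can catch the typo before the job runs. This is a sketch validating tokens against the capability list above, not the hook's own parser:

```shell
# Hypothetical pre-submit check: validate NVIDIA_DRIVER_CAPABILITIES tokens
# against the documented capability names (a sketch, not the hook's parser).
check_caps() {
  for c in $(printf '%s' "$1" | tr ',' ' '); do
    case " compute utility graphics display video compat32 all " in
      *" $c "*) ;;
      *) echo "unknown capability: $c"; return 1 ;;
    esac
  done
  echo "ok"
}

check_caps compute,utility   # prints: ok
check_caps comput || true    # prints: unknown capability: comput
```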

Unsatisfied NVIDIA driver / cuda requirement

$ NVIDIA_REQUIRE_CUDA="cuda>=99.0" srun --container-image=nvcr.io/nvidia/pytorch:26.02-py3 --pty bash

error: pyxis:     nvidia-container-cli: requirement error: unsatisfied condition: cuda>=99.0, please update your driver to a newer version, or use an earlier cuda container
error: pyxis:     [ERROR] /etc/enroot/hooks.d/98-nvidia.sh exited with return code 1

NVIDIA_REQUIRE_* (set by the user at submit time, or baked into the image) isn't satisfied by the host driver.

Owner: user. Even when the requirement comes from the image, the submitter's fix is to use an older image or request a node with a newer driver.

nvidia-container-cli not installed on the node

error: pyxis:     [ERROR] Command not found: nvidia-container-cli, see https://github.com/NVIDIA/libnvidia-container
error: pyxis:     [ERROR] /etc/enroot/hooks.d/98-nvidia.sh exited with return code 1

Typically a CPU-only node where the admin has not set NVIDIA_VISIBLE_DEVICES=void in the enroot config, which would make the hook a no-op. Without that, the hook tries to exec nvidia-container-cli on a node where it is correctly not installed.

Owner: cluster operator. Fix is admin-side — either install the toolkit or add NVIDIA_VISIBLE_DEVICES=void to the enroot config on CPU-only nodes.

nvidia-container-cli: ldcache error

error: pyxis: nvidia-container-cli: ldcache error: process /usr/sbin/ldconfig.real failed with error code: 1
error: pyxis: [ERROR] /etc/enroot/hooks.d/98-nvidia.sh exited with return code 1

ldconfig fails during the post-mount library cache rebuild — corrupted ldconfig binary inside the container, ABI skew with host glibc, or LSM confinement (AppArmor/SELinux).

Owner: cluster operator (LSM / glibc skew) or image author (corrupted ldconfig). Triage by reproducing inside the container.

nvidia-container-cli: detection error (admin-managed path)

slurmstepd: error: pyxis:     nvidia-container-cli: detection error: path error: ///var/run/nvidia-fabricmanager/socket: permission denied
slurmstepd: error: pyxis:     [ERROR] /etc/enroot/hooks.d/98-nvidia.sh exited with return code 1

libnvidia-container can't stat() or open an admin-managed system path (fabric-manager socket, /proc/driver/nvidia/*, etc.). Typically wrong permissions on the socket file as set by the system service.

Owner: cluster operator.

99-mellanox.sh hook failed

error: pyxis:     enroot-mount: failed to mount: tmpfs at /raid/enroot/data/user-147141/pyxis_49_49.1/sys/class/infiniband_mad: No such file or directory
error: pyxis:     [ERROR] /etc/enroot/hooks.d/99-mellanox.sh exited with return code 1

The mellanox hook bind-mounts tmpfs at /sys/class/infiniband* paths. When an IB kernel module (e.g. ib_umad) is not loaded on the host, the source sysfs directory does not exist and the hook fails.

Owner: cluster operator. Load the missing IB module(s) or remove the mellanox hook on nodes where InfiniBand is not provisioned.


Registry / image pull errors

HTTP 401 Unauthorized

$ srun --container-image=docker://registry-1.docker.io/library/private-image:latest true

error: pyxis:     [ERROR] URL https://registry-1.docker.io/v2/library/private-image/manifests/latest returned error code: 401 Unauthorized

Missing or insufficient credentials. On Docker Hub, the /token endpoint typically succeeds anonymously but /manifests/ returns 401 for private repos.

Owner: user. Configure credentials in ~/.config/enroot/.credentials for the target hostname.
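The credentials file uses netrc syntax, keyed by the hostname of the auth endpoint. An example entry (hostnames shown are the usual ones for Docker Hub and NGC; the login/password values are placeholders):

```
# ~/.config/enroot/.credentials (netrc format)
machine auth.docker.io login <username> password <access-token>
machine nvcr.io login $oauthtoken password <NGC-API-key>
```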

HTTP 403 Forbidden

$ srun --container-image=docker://ghcr.io/nonexistent-user/no-such-image:latest true

error: pyxis:     [ERROR] URL https://ghcr.io/token returned error code: 403 Forbidden

Quirk of ghcr.io: it returns 403 at the /token endpoint for nonexistent repos rather than 404 at the manifest endpoint. The user-facing fix is the same as for 401 — check credentials and image name.

Owner: user.

HTTP 404 Not Found

$ srun --container-image=docker://ubuntu:36.04 true    # nonexistent tag

error: pyxis:     [ERROR] URL https://registry-1.docker.io/v2/library/ubuntu/manifests/36.04 returned error code: 404 Not Found

Misspelled image name, deleted tag, or wrong registry hostname.

Owner: user.

Invalid image reference (URI syntax)

$ srun --container-image='docker:///foo' true

error: pyxis:     [ERROR] Invalid image reference: docker:///foo

The URI fails enroot's parser — common patterns: triple-slash, trailing space, embedded whitespace. The single-slash form docker:/foo actually parses (as a hostname) and surfaces as a different error (curl tunnel / DNS).

Owner: user.

Missing enroot credentials, hostname parsed as Docker Hub repo path

$ srun --container-image=nvcr.io/nvidia/eval-factory/simple-evals:26.03 ...

error: pyxis:     [ERROR] URL https://registry-1.docker.io/v2/nvcr.io/nvidia/eval-factory/simple-evals/manifests/26.03 returned error code: 401 Unauthorized

Older enroot quirk: when no credentials are configured for a <hostname>, enroot silently treats the whole <hostname>/<repo> as a Docker Hub repo path. The dotted hostname appearing in the /v2/ path segment is the tell (v2/nvcr.io/... instead of v2/...).

Owner: user. Add credentials for the target hostname in ~/.config/enroot/.credentials.

Could not retrieve / parse digest

slurmstepd: error: pyxis:     [ERROR] Could not parse digest information from https://<your-registry>/v2/<repo>/manifests/<tag>

The registry returned a well-formed response that is not a valid manifest, or the manifest is missing the docker-content-digest header. Typically a registry-side or proxy-side bug.

Owner: cluster / registry operator.

curl tunnel / network failure (error 56)

$ srun --container-image=docker:/alpine true    # single slash parses as hostname "alpine"

error: pyxis:     curl: (56) CONNECT tunnel failed, response 502
$ srun --container-image=docker://nonexistent.example.com/foo:bar true

error: pyxis:     curl: (56) CONNECT tunnel failed, response 502

On clusters with a CONNECT proxy, both user typos pointing at nonexistent hostnames and genuine registry outages surface as the same curl: (56) error. The two cases are indistinguishable from the log alone.

Owner: either user (typo) or cluster / registry operator (real outage). Verify the hostname and image first; if it looks correct, escalate.

Corrupted cached layer (zstd / tar extraction)

Two zstd sub-errors for different corruption modes, same remedy:

# truncated layer
slurmstepd: error: pyxis:     [INFO] Extracting image layers...
slurmstepd: error: pyxis:     zstd: /*stdin*\: unexpected end of file
slurmstepd: error: pyxis:     tar: Child returned status 1
# byte-corrupted layer (checksum mismatch)
error: pyxis:     /*stdin*\ : Decoding error (36) : Restored data doesn't match checksum
error: pyxis:     tar: Unexpected EOF in archive
error: pyxis:     tar: Error is not recoverable: exiting now

The enroot cache contains a bad layer blob — usually from a previously interrupted import. Remedy: clear $ENROOT_CACHE_PATH (or the per-user cache subdirectory) and retry.

Owner: typically cluster-side cache hygiene; occasionally user-side when the user controls the cache directory.
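The remedy above can be sketched as a small helper. The default cache location shown is enroot's usual one, but clusters commonly override ENROOT_CACHE_PATH in the enroot config:

```shell
# Sketch: clear the enroot layer cache before retrying the pull.
# Default location is ${XDG_CACHE_HOME:-$HOME/.cache}/enroot unless the
# cluster's enroot config overrides ENROOT_CACHE_PATH.
clear_enroot_cache() {
  cache="${ENROOT_CACHE_PATH:-${XDG_CACHE_HOME:-$HOME/.cache}/enroot}"
  rm -rf -- "$cache"/* 2>/dev/null || true
}
```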

curl transfer timeout (error 28)

error: pyxis:    curl: (28) Operation timed out after 1200000 milliseconds with 1510260015 out of 10371127195 bytes received
error: pyxis:    pigz: skipping: <stdin>: corrupted -- invalid deflate data (invalid distance code)

The pull exceeded enroot's configured transfer timeout. The log includes both bytes received and the timeout window — divide them to estimate effective throughput. Slow effective rate suggests an infrastructure problem; a normal rate on a very large image suggests the user should increase the timeout or use a smaller image.

Owner: either. Use the throughput estimate to decide.
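For the numbers in the example log above, the arithmetic works out like this:

```shell
# 1510260015 bytes received in a 1200000 ms (20 min) timeout window:
# bytes / seconds / 1e6 gives effective MB/s.
awk 'BEGIN { printf "%.1f MB/s\n", 1510260015 / (1200000 / 1000) / 1e6 }'
# prints: 1.3 MB/s
```

About 1.3 MB/s is far below a healthy datacenter link, which in this particular example points at infrastructure rather than image size.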

curl empty reply from server (error 52)

slurmstepd: error: pyxis:     curl: (52) Empty reply from server

The connection succeeded and the request was sent, but the server closed without returning anything — registry crash, load-balancer dropped the connection mid-request, etc. User action cannot produce this.

Owner: cluster / registry operator.

DNS resolution failure (error 6)

slurmstepd: error: pyxis:     curl: (6) Could not resolve host: registry-1.docker.io

Node DNS misconfiguration, resolver unreachable, or genuinely unreachable hostname. For known-good registries (e.g. registry-1.docker.io) this is unambiguously infrastructure; for unknown hostnames it could be a user typo on clusters with direct egress.

Owner: likely cluster operator; check the hostname for typos first.

Could not process JSON input (proxy returning non-JSON)

error: pyxis:     [INFO] Authentication succeeded
error: pyxis:     [ERROR] Could not process JSON input

The HTTP request succeeded (auth even passed) but the response body is not JSON — almost always a caching proxy returning an HTML error page in place of the registry's response.

Owner: cluster / registry operator.


Filesystem / storage

Rootfs lock contention

error: pyxis:     [ERROR] Could not acquire rootfs lock

Pyxis creates one container per node per step, so a user cannot legitimately produce lock contention. A hit means another enroot process is stuck holding the lock — stale state on the node.

Owner: cluster operator. Investigate stuck enroot processes on the node.

Invalid image format (corrupt local sqsh)

$ echo 'not a real sqsh' > /tmp/bad.sqsh && srun --container-image=/tmp/bad.sqsh true

error: pyxis: failed to create container filesystem
error: pyxis:     [ERROR] Invalid image format: /tmp/bad.sqsh

unsquashfs -s fails on the image — truncated or corrupted sqsh, or a non-sqsh file passed as the image.

Owner: either. User-corrupted file (e.g. interrupted enroot import) versus a cache-side bad pull. Check the file with unsquashfs -s outside the job to triage.
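A cheap first-pass check outside the job (a sketch: a squashfs superblock starts with the little-endian magic "hsqs"; unsquashfs -s remains the authoritative test when squashfs-tools is installed):

```shell
# Sketch: quick magic-byte check for a candidate sqsh file.
check_sqsh() {
  if [ "$(head -c 4 "$1" 2>/dev/null)" = "hsqs" ]; then
    echo "looks like squashfs"
  else
    echo "not a squashfs image"
  fi
}

printf 'not a real sqsh\n' > /tmp/bad.sqsh
check_sqsh /tmp/bad.sqsh   # prints: not a squashfs image
```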

File already exists (ephemeral container name collision)

error: pyxis:     [ERROR] File already exists: /raid/containers/data/user-39367/pyxis_1620149_1620149.0

The last path component pyxis_<jobid>_<stepid>.<taskid> is auto-generated by pyxis from SLURM IDs — a user cannot cause a collision on that pattern. A hit means stale filesystem state (a previous job's container rootfs was not cleaned up).

Owner: cluster operator.

Enroot mount path not found

$ srun --container-image=docker://alpine --container-mounts=/tmp/doesnotexist:/mnt/x true

error: pyxis:     enroot-mount: failed to mount: /tmp/doesnotexist at /tmp/enroot-data/.../mnt/x: No such file or directory

Pyxis's own pre-check (--container-mounts: source path does not exist) catches some of these earlier; the runtime enroot-mount failure catches the rest.

Owner: user.

Cannot find container file (missing local sqsh)

$ srun --container-image=/tmp/no-such-file.sqsh true

error: pyxis: failed to create container filesystem
error: pyxis:     [ERROR] No such file or directory: /tmp/no-such-file.sqsh

User passed --container-image=<local sqsh> pointing at a file that does not exist on the node.

Owner: user.

Permission denied creating enroot cache / data / runtime directory

error: pyxis:     mkdir: cannot create directory '/raid/enroot/cache/group-21323': Permission denied
error: pyxis:     mkdir: cannot create directory '/var/lib/enroot/cache/job-1664320/.tokens.28686': Permission denied

Enroot tries to create a cache, data, or runtime subdirectory and mkdir returns EACCES. Path shape varies per cluster (/raid/enroot/*, /var/lib/enroot/*, etc.).

Owner: depends on the path. If the parent is under /home/* or /lustre/* the user owns it (filesystem permissions are theirs to fix); otherwise it's an admin-side misconfiguration of the parent directory.
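The path-based ownership rule above can be expressed as a small classifier. This is illustrative only; the prefixes are examples and vary per cluster:

```shell
# Illustrative triage heuristic: route the error by path prefix.
# The prefix lists are this sketch's own examples, not a fixed convention.
owner_for_path() {
  case "$1" in
    /home/*|/lustre/*) echo "user" ;;
    *)                 echo "cluster operator" ;;
  esac
}

owner_for_path /raid/enroot/cache/group-21323   # prints: cluster operator
owner_for_path /home/alice/.cache/enroot        # prints: user
```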

unsquashfs extraction read failure (squashfs corruption)

error: pyxis:     [INFO] Extracting squashfs filesystem...
error: pyxis:     Read on filesystem failed because EOF
error: pyxis:     read_id_table: failed to read id index table
error: pyxis:     FATAL ERROR: File system corruption detected

Most common cause: two jobs concurrently writing to the same sqsh path without synchronization (user-side race). Also possible: a stale partial sqsh from a killed enroot import, or genuine disk corruption. The log does not carry the path so the cause cannot be inferred from this line alone.

Owner: typically user (concurrent writes); occasionally cluster operator (storage corruption).

unsquashfs extraction directory-creation failure (permission denied)

error: pyxis: failed to create container filesystem
error: pyxis:     [INFO] Extracting squashfs filesystem...
error: pyxis:     FATAL ERROR: dir_scan: failed to make directory /persist-var/a4u8g-0002/lib/enroot/data/job-1673945/pyxis_<...>, because Permission denied

During extraction, unsquashfs's dir_scan cannot create a directory in the destination tree. Distinct from the pre-extraction mkdir permission error (different phase) and from the disk-full write error (different errno).

Owner: depends on the captured path (user-managed might be /home/*, /lustre/* versus admin-managed system path like /raid/*).

unsquashfs killed by OOM during extraction

error: pyxis: child 2529097 failed with error code: 137
error: pyxis:     [INFO] Extracting squashfs filesystem...
error: pyxis:     /usr/lib/enroot/runtime.sh: line 384: 2529132 Killed    unsquashfs ...
error: Detected 1 oom_kill event in StepId=26343980.0.
srun: error: cpu-00030: task 0: Out Of Memory.

The kernel OOM-killed unsquashfs due to a cgroup memory limit. On CPU nodes (shared, tight per-job memory budgets) this is typically a user --mem issue; on GPU nodes (dedicated, plenty of RAM) it usually points to an unusually large image or a leak.

Owner: typically user on CPU nodes; cluster operator on GPU nodes.

mksquashfs write failure during import (disk full)

$ srun --container-image=nvcr.io/nvidia/pytorch:26.02-py3 true

error: pyxis: failed to import docker image: nvcr.io/nvidia/pytorch:26.02-py3
error: pyxis:     [INFO] Creating squashfs filesystem...
error: pyxis:     Write failed because No space left on device
error: pyxis:     FATAL ERROR: Probably out of space on output filesystem
error: pyxis:     Creating 4.0 filesystem on /run/pyxis/1000/16.0.squashfs, block size 131072.

The output partition ran out of space mid-write during the mksquashfs import stage. Distinct from the unsquashfs extraction write failure (different pipeline stage). The output partition for mksquashfs is typically tmpfs (/run), small and system-managed.

Owner: cluster operator (the system path filling up).

unsquashfs extraction write failure (disk full mid-extraction)

error: pyxis: failed to create container filesystem
error: pyxis:     Write on output file failed because No space left on device
error: pyxis:     FATAL ERROR:writer: failed to write file /raid/containers/data/user-40595/pyxis_<...>/usr/local/lib/.../foo.pyc

unsquashfs runs out of space mid-extraction. Enroot's pre-check does not always catch this (different threshold, or the partition fills between the check and the extract). The path is captured in the log so the owner can be inferred from it.

Owner: depends on the captured path.


Namespace / kernel / exec

Failed to create user namespace

Two production-observed sub-causes, same owner:

# EACCES — AppArmor profile missing (Ubuntu 24.04+)
$ sudo aa-disable /etc/apparmor.d/enroot-nsenter
$ srun --container-image=ubuntu:26.04 true
error: pyxis:     enroot-nsenter: failed to create user namespace: Permission denied
# ENOSPC — /proc/sys/user/max_user_namespaces exhausted
error: pyxis:     enroot-nsenter: failed to create user namespace: No space left on device

The kernel rejected unshare(CLONE_NEWUSER). The errno suffix identifies the specific sub-cause (AppArmor profile missing, sysctl limit hit, kernel config). Trap: the No space left on device errno intuitively suggests disk-full but in this context means the namespace count cap — anchor on the enroot-nsenter: failed to create <ns> namespace prefix when matching.

Owner: cluster operator.
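A log matcher following the anchoring advice above might look like this (a sketch; the classification labels are this example's own, not pyxis output):

```shell
# Hypothetical matcher: anchor on the enroot-nsenter prefix, then
# disambiguate by errno text so ENOSPC is not misread as disk-full.
classify_ns_error() {
  case "$1" in
    *enroot-nsenter:*"failed to create user namespace: Permission denied"*)
      echo "lsm-denied (check AppArmor/SELinux profile)" ;;
    *enroot-nsenter:*"failed to create user namespace: No space left on device"*)
      echo "max_user_namespaces exhausted (not disk-full)" ;;
    *) echo "unclassified" ;;
  esac
}

classify_ns_error "pyxis:     enroot-nsenter: failed to create user namespace: No space left on device"
# prints: max_user_namespaces exhausted (not disk-full)
```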

enroot capability set failure

slurmstepd: error: pyxis:     enroot-aufs2ovlfs: failed to set capabilities: Operation not permitted

An enroot C binary (aufs2ovlfs, mount, nsenter, switchroot) cannot acquire its required file capabilities via capset(). Typically the enroot+caps companion package is not installed (that package runs setcap during install), or the binaries were copied without xattr preservation.

Owner: cluster operator.

Container entrypoint exec failed (wrong architecture)

slurmstepd: error: pyxis:     enroot-switchroot: failed to execute: /bin/bash: Exec format error

execve() of the container's entrypoint fails. Exec format error (ENOEXEC) means wrong architecture (e.g. amd64 image on an arm64 node) or a corrupted ELF binary. Related variants — No such file or directory (missing entrypoint) and Permission denied (non-executable file) — show the same diagnostic shape.

Owner: user (image / arch mismatch is selectable at submit time).


Pyxis errors

pyxis container start nonzero exit (wrapper)

error: pyxis: container start failed with error code: <N>

Appears in nearly every pyxis-level failure. By itself it is just a wrapper; the real root cause is in the preceding enroot log lines.

Owner: depends on the underlying error in the log.

pyxis couldn't chdir to container workdir

$ srun --container-image=docker://alpine --container-workdir=/does/not/exist pwd

error: pyxis: couldn't chdir to /does/not/exist: No such file or directory

--container-workdir points at a path that does not exist inside the image. Container start itself succeeded — only the chdir() failed.

Owner: user.


Pyxis argument validation

--container-image specified multiple times

$ srun --container-image=A --container-image=B true

srun: error: pyxis: --container-image specified multiple times
srun: error: Invalid --container-image argument: B.

Owner: user.

Mount source must be relative / absolute / tmpfs / umount

$ srun --container-mounts='missingcolon' --container-image=ubuntu:24.04 true

srun: error: pyxis: mount source must be a relative path, an absolute path, "tmpfs" or "umount"
srun: error: pyxis: could not add mount entry: missingcolon:missingcolon
srun: error: Invalid --container-mounts argument: missingcolon.

--container-mounts spec has no colon; pyxis tries to parse it as <src>:<dst> with both fields identical and fails validation.

Owner: user.

Conflicting pyxis options

$ srun --container-mount-home --no-container-mount-home --container-image=ubuntu:26.04 true

srun: error: pyxis: both --container-mount-home and --no-container-mount-home were specified
srun: error: Invalid --no-container-mount-home argument: (null).

Any pair of mutually exclusive options — covers --container-mount-home, --container-remap-root, --container-entrypoint, and the --container-writable / --container-readonly pair.

Owner: user.

Container name collision (create flag)

$ srun --container-name=collision:create --container-image=docker://alpine sleep 60 &
$ sleep 5
$ srun --container-name=collision:create --container-image=docker://alpine true

pyxis: imported docker image: docker://alpine
error: pyxis: error: "create" flag was passed to --container-name but the container already exists

The second job tries to create a container with a name already in use by the first on the same node. The image import completes before the collision check fires, which is a wasted operation but harmless.

Owner: user.

srun-level Invalid --container-<flag> argument (umbrella)

srun: error: Invalid --container-image argument: B.
srun: error: Invalid --container-mounts argument: missingcolon.
srun: error: Invalid --no-container-mount-home argument: (null).

SLURM's own error message that fires after any pyxis SPANK option fails validation. Useful as a fallback when more specific pyxis validation errors haven't been captured.

Owner: user.


Lifecycle wrappers & cancellation

pyxis failed to create container filesystem (wrapper)

error: pyxis: failed to create container filesystem

Enroot-create-stage wrapper (distinct from the enroot-import-stage wrapper failed to import docker image). Useful for identifying which pipeline stage failed; the real cause is in preceding lines.

Owner: depends on the underlying error.

Pyxis cancellation markers (external SIGTERM)

# Newer pyxis (dedicated SIGTERM handler)
error: pyxis: received SIGTERM, forwarding to child 807892
error: pyxis: child 807892 terminated with signal 15
error: pyxis: failed to import docker image: nvcr.io/nvidia/pytorch:26.02-py3
# Older pyxis (waitpid returns EINTR)
error: pyxis: could not wait for child 1661846: Interrupted system call
error: pyxis: failed to import docker image

The job was killed mid-pyxis-operation by an external SIGTERM (user scancel, SLURM time limit, watchdog, etc.) — not a pyxis failure. Two signatures depending on pyxis version. Pyxis does not use self-interrupting patterns, so EINTR on waitpid() in pyxis's code path is specifically a cancellation signal arriving from outside.

Owner: not a failure. The cancellation source (user, time limit, watchdog) is what should be investigated, not pyxis.


Peer-task cascades

Container was not started successfully by another task

error: pyxis: container was not started successfully by another task

Recent pyxis versions only — older pyxis does not emit this line.

A peer task on the same node failed to start the container, and this task is reporting that cascade. By itself this line is not the root cause; look for the primary task's failure earlier in the log. If it is the only pyxis signal on the job, the primary task probably received SIGKILL and produced no log of its own.

Owner: depends on the primary task's failure (look at the log).
