Spec reference

This page describes the Compose subset that hpc-compose accepts today. Unknown or unsupported fields are rejected unless this page explicitly says otherwise.

Top-level shape

name: demo
version: "3.9"

runtime:
  backend: pyxis

x-slurm:
  time: "00:30:00"
  cache_dir: /cluster/shared/hpc-compose-cache

services:
  app:
    image: python:3.11-slim
    command: python -m main

Top-level fields

Field	Shape	Default	Notes
`name`	string	omitted	Used as the Slurm job name when `x-slurm.job_name` is not set.
`version`	string	omitted	Accepted for Compose compatibility. Ignored by the planner.
`runtime`	mapping	`backend: pyxis`	Selects the service runtime backend and GPU passthrough policy.
`services`	mapping	required	Must contain at least one service.
`x-slurm`	mapping	omitted	Top-level Slurm settings and shared runtime defaults.

Settings-aware command table

Use these commands and global flags when you want the project-local settings file (.hpc-compose/settings.toml) to remember compose path, env files, env vars, and binary overrides.

Command or flag	Purpose	Notes
`--profile <NAME>`	Select the profile from settings	Global flag; applies to every subcommand.
`--settings-file <PATH>`	Use an explicit settings file	Global flag; bypasses upward auto-discovery of `.hpc-compose/settings.toml`.
`hpc-compose setup`	Create or update the project-local settings file	Interactive by default; supports `--non-interactive` with `--profile-name`, `--compose-file`, `--env-file`, `--env`, `--binary`, and `--default-profile`.
`hpc-compose context`	Print fully resolved execution context	Shows selected settings/profile, compose path, binaries, interpolation vars, runtime paths, and value sources; supports `--format json`.
`hpc-compose validate --strict-env`	Fail when interpolation fell back to defaults	Detects when `${VAR:-...}` or `${VAR-...}` consumed fallback values because `VAR` was missing.
`hpc-compose schema`	Print the checked-in JSON Schema	Useful for editor integration and authoring tools. Rust validation remains the semantic source of truth.

`x-slurm`

These fields live under the top-level x-slurm block.

Field	Shape	Default	Notes
`job_name`	string	`name` when present	Rendered as `#SBATCH --job-name`.
`partition`	string	omitted	Passed through to `#SBATCH --partition`.
`account`	string	omitted	Passed through to `#SBATCH --account`.
`qos`	string	omitted	Passed through to `#SBATCH --qos`.
`time`	string	omitted	Passed through to `#SBATCH --time`.
`nodes`	integer	omitted	Slurm allocation node count. Defaults to `1` when omitted.
`ntasks`	integer	omitted	Passed through to `#SBATCH --ntasks`.
`ntasks_per_node`	integer	omitted	Passed through to `#SBATCH --ntasks-per-node`.
`cpus_per_task`	integer	omitted	Top-level Slurm CPU request.
`mem`	string	omitted	Passed through to `#SBATCH --mem`.
`gres`	string	omitted	Passed through to `#SBATCH --gres`.
`gpus`	integer	omitted	Used only when `gres` is not set.
`gpus_per_node`	integer	omitted	Passed through to `#SBATCH --gpus-per-node`.
`gpus_per_task`	integer	omitted	Passed through to `#SBATCH --gpus-per-task`.
`cpus_per_gpu`	integer	omitted	Passed through to `#SBATCH --cpus-per-gpu`.
`mem_per_gpu`	string	omitted	Passed through to `#SBATCH --mem-per-gpu`.
`gpu_bind`	string	omitted	Passed through to `#SBATCH --gpu-bind`.
`cpu_bind`	string	omitted	Passed through to `#SBATCH --cpu-bind`.
`mem_bind`	string	omitted	Passed through to `#SBATCH --mem-bind`.
`distribution`	string	omitted	Passed through to `#SBATCH --distribution`.
`hint`	string	omitted	Passed through to `#SBATCH --hint`.
`constraint`	string	omitted	Passed through to `#SBATCH --constraint`.
`output`	string	omitted	Passed through to `#SBATCH --output`.
`error`	string	omitted	Passed through to `#SBATCH --error`.
`chdir`	string	omitted	Passed through to `#SBATCH --chdir`.
`cache_dir`	string	`$HOME/.cache/hpc-compose`	Must resolve to shared storage visible from the login node and the compute nodes.
`scratch`	mapping	omitted	Optional scratch path mounted into services and exposed as `HPC_COMPOSE_SCRATCH_DIR`.
`stage_in`	list of mappings	omitted	Copy or rsync host paths before services launch.
`stage_out`	list of mappings	omitted	Copy or rsync paths during teardown, optionally by outcome.
`burst_buffer`	mapping	omitted	Raw `#BB` / `#DW` directives for site-specific burst-buffer systems.
`metrics`	mapping	omitted	Enables runtime metrics sampling.
`artifacts`	mapping	omitted	Enables tracked artifact collection and export metadata.
`resume`	mapping	omitted	Enables checkpoint-aware resume semantics with a shared host path mounted into every service.
`notify`	mapping	omitted	First-class Slurm email notification settings.
`setup`	list of strings	omitted	Raw shell lines inserted into the generated batch script before service launches.
`submit_args`	list of strings	omitted	Extra raw Slurm arguments appended as `#SBATCH ...` lines.

`x-slurm.setup`

x-slurm:
  setup:
    - module load enroot
    - source /shared/env.sh

Shape: list of strings
Default: omitted
Notes:
- Each line is emitted verbatim into the generated bash script.
- The script runs under set -euo pipefail.
- Shell quoting and escaping are the user's responsibility.

`x-slurm.submit_args`

x-slurm:
  submit_args:
    - "--mail-type=END"
    - "--mail-user=user@example.com"
    - "--reservation=gpu-reservation"

Shape: list of strings
Default: omitted
Notes:
- Each entry is emitted as #SBATCH {arg}.
- Entries are rejected if they contain line breaks or null bytes.
- Entries are not validated against Slurm option syntax.

`x-slurm.notify`

x-slurm:
  notify:
    email:
      to: user@example.com
      on: [end, fail]

Field	Shape	Default	Notes
`notify.email`	mapping	omitted	Required when `notify` is present.
`notify.email.to`	string	required	Rendered as `#SBATCH --mail-user`.
`notify.email.on`	list of events	`[end, fail]`	Rendered as `#SBATCH --mail-type`.

Supported events:

Event	Slurm mail type
`start`	`BEGIN`
`end`	`END`
`fail`	`FAIL`
`all`	`ALL`

Rules:

When on is omitted or empty, defaults to [end, fail].
If all is present, it replaces all other events.
Cannot be combined with raw --mail-type or --mail-user in x-slurm.submit_args.

`x-slurm.cache_dir`

Shape: string
Default: $HOME/.cache/hpc-compose
Notes:
- Relative paths and environment variables are resolved against the compose file directory.
- Paths under /tmp, /var/tmp, /private/tmp, and /dev/shm are rejected.
- The path must be visible from both the login node and the compute nodes.

`runtime`

runtime:
  backend: apptainer
  gpu: auto

Field	Shape	Default	Notes
`backend`	`pyxis`, `apptainer`, `singularity`, or `host`	`pyxis`	Selects the runtime used inside Slurm steps.
`gpu`	`auto`, `none`, or `nvidia`	`auto`	For Apptainer/Singularity, controls `--nv`; `auto` enables it when Slurm GPU resources are requested.

Backend notes:

pyxis uses srun --container-* flags and Enroot .sqsh artifacts.
apptainer and singularity build or reuse .sif artifacts and launch them through apptainer exec/run or singularity exec/run inside srun.
host runs commands directly under srun; services must set command or entrypoint, and image prepare blocks, service volumes, and x-slurm.mpi.host_mpi.bind_paths are not allowed because no container bind mount is applied.
x-enroot.prepare is a Pyxis/Enroot compatibility spelling. Prefer x-runtime.prepare for new specs, especially with Apptainer/Singularity.

`x-slurm.scratch`, `stage_in`, `stage_out`, and `burst_buffer`

x-slurm:
  scratch:
    scope: shared
    base: /scratch/$USER/jobs
    mount: /scratch
    cleanup: on_success
  stage_in:
    - from: /shared/input
      to: /scratch/input
      mode: rsync
  stage_out:
    - from: /scratch/output
      to: /shared/results/${SLURM_JOB_ID}
      when: always
      mode: copy
  burst_buffer:
    directives:
      - "#BB create_persistent name=data capacity=100G"

scratch.base is a host path. scratch.mount is the container-visible mount point.
scratch.scope is node_local or shared; cluster profiles can warn when a shared scratch path does not look shared.
scratch.cleanup is always, on_success, or never.
stage_in runs before services launch; stage_out runs during teardown.
mode is rsync or copy; rsync falls back to cp -R when rsync is unavailable.
stage_out.when is always, on_success, or on_failure.
${SLURM_JOB_ID} is preserved in scratch and staging paths for runtime expansion.
burst_buffer.directives entries are emitted as raw batch-script directives and must start with #BB or #DW .

Multi-node placement rules

x-slurm.nodes > 1 reserves a multi-node allocation.
Helper services remain single-node steps and are pinned to the allocation's primary node.
When a multi-node job has exactly one service, that service defaults to the distributed full-allocation step.
Services may use services.<name>.x-slurm.placement to select explicit allocation node indices.
Overlapping explicit placements are rejected unless one side sets allow_overlap: true or uses share_with.
Any service spanning more than one node may use readiness.type: sleep or readiness.type: log, or TCP/HTTP readiness only with an explicit non-local host or URL.

`x-slurm.metrics`

x-slurm:
  metrics:
    interval_seconds: 5
    collectors: [gpu, slurm]

Shape: mapping
Default: omitted
Notes:
- Omitting the block disables runtime metrics sampling.
- If the block is present and enabled is omitted, metrics sampling is enabled.
- interval_seconds defaults to 5 and must be at least 1.
- collectors defaults to [gpu, slurm].
- Supported collectors:
  - gpu samples device and process telemetry through nvidia-smi
  - slurm samples job-step CPU and memory data through sstat
- In multi-node v1, gpu sampling remains primary-node-only; slurm sampling still observes the full distributed step through sstat.
- Sampler files are written under ${SLURM_SUBMIT_DIR:-$PWD}/.hpc-compose/${SLURM_JOB_ID}/metrics on the host and are also visible inside containers at /hpc-compose/job/metrics.
- Collector failures are best-effort and do not fail the batch job.

`x-slurm.artifacts`

x-slurm:
  artifacts:
    collect: always
    export_dir: ./results/${SLURM_JOB_ID}
    paths:
      - /hpc-compose/job/metrics/**
    bundles:
      checkpoints:
        paths:
          - /hpc-compose/job/checkpoints/*.pt

Shape: mapping
Default: omitted
Notes:
- Omitting the block disables tracked artifact collection.
- collect defaults to always. Supported values are always, on_success, and on_failure.
- export_dir is required and is resolved relative to the compose file directory when hpc-compose artifacts runs.
- ${SLURM_JOB_ID} is preserved in export_dir until hpc-compose artifacts expands it from tracked metadata.
- paths remains supported as the implicit default bundle.
- bundles is optional. Bundle names must match [A-Za-z0-9_-]+, and default is reserved for top-level paths.
- At least one source path must be present in paths or bundles.
- Every source path must be an absolute container-visible path rooted at /hpc-compose/job.
- Paths under /hpc-compose/job/artifacts are rejected.
- Collection happens during batch teardown and is best-effort.
- Collected payloads and manifest.json are written under ${SLURM_SUBMIT_DIR:-$PWD}/.hpc-compose/${SLURM_JOB_ID}/artifacts/.
- hpc-compose artifacts --bundle <name> exports only the selected bundle or bundles.
- hpc-compose artifacts --tarball also writes one <bundle>.tar.gz archive per exported bundle.
- Export writes per-bundle provenance metadata under <export_dir>/_hpc-compose/bundles/<bundle>.json.

`x-slurm.resume`

x-slurm:
  resume:
    path: /shared/$USER/runs/my-run

Shape: mapping
Default: omitted
Notes:
- Omitting the block disables resume semantics.
- path is required and must be an absolute host path.
- /hpc-compose/... paths are rejected because path must point at shared host storage, not a container-visible path.
- /tmp and /var/tmp technically validate, but preflight warns because those paths are not reliable resume storage.
- When enabled, hpc-compose mounts path into every service at /hpc-compose/resume.
- Services also receive HPC_COMPOSE_RESUME_DIR, HPC_COMPOSE_ATTEMPT, and HPC_COMPOSE_IS_RESUME.
- The canonical resume source is the shared path, not exported artifact bundles.
- Attempt-specific runtime state moves under ${SLURM_SUBMIT_DIR:-$PWD}/.hpc-compose/${SLURM_JOB_ID}/attempts/<attempt>/, and the top-level logs, metrics, artifacts, and state.json paths continue to point at the latest attempt for compatibility.

Allocation metadata inside services

Every service receives:

HPC_COMPOSE_PRIMARY_NODE
HPC_COMPOSE_NODE_COUNT
HPC_COMPOSE_NODELIST
HPC_COMPOSE_NODELIST_FILE
HPC_COMPOSE_SERVICE_PRIMARY_NODE
HPC_COMPOSE_SERVICE_NODE_COUNT
HPC_COMPOSE_SERVICE_NODELIST
HPC_COMPOSE_SERVICE_NODELIST_FILE

The allocation-wide data is also written under /hpc-compose/job/allocation/primary_node and /hpc-compose/job/allocation/nodes.txt. Service-scoped node lists are written under /hpc-compose/job/allocation/service-nodelists/.

Services that configure services.<name>.x-slurm.mpi also receive:

HPC_COMPOSE_MPI_TYPE
HPC_COMPOSE_MPI_HOSTFILE

The MPI hostfile is written under /hpc-compose/job/allocation/mpi-hostfiles/ and contains the service's effective node list. When ntasks_per_node is known, each host line includes slots=<ntasks_per_node>. For a single-node service with ntasks but no ntasks_per_node, the hostfile uses slots=<ntasks>. Otherwise it emits one node per line without slots.

MPI services also forward common PMI, PMIx, and Slurm rank variables into the container through Pyxis --container-env, including PMI_RANK, PMI_SIZE, PMIX_RANK, PMIX_NAMESPACE, SLURM_PROCID, SLURM_LOCALID, SLURM_NODEID, SLURM_NTASKS, and SLURM_TASKS_PER_NODE.

`gres` and `gpus`

When both gres and gpus are set at the same level, gres takes priority and gpus is ignored.

Service fields

Field	Shape	Default	Notes
`image`	string	required unless `runtime.backend: host`	Can be a remote image reference, a local `.sqsh` / `.squashfs` path for Pyxis, or a local `.sif` path for Apptainer/Singularity.
`command`	string or list of strings	omitted	Shell form or exec form.
`entrypoint`	string or list of strings	omitted	Must use the same form as `command` when both are present.
`environment`	mapping or list of `KEY=VALUE` strings	omitted	Both forms normalize to key/value pairs.
`volumes`	list of `host_path:container_path` strings	omitted	Runtime bind mounts. Host paths resolve against the compose file directory.
`working_dir`	string	omitted	Valid only when the service also has an explicit `command` or `entrypoint`.
`depends_on`	list or mapping	omitted	Dependency list with `service_started` or `service_healthy` conditions.
`readiness`	mapping	omitted	Post-launch readiness gate.
`healthcheck`	mapping	omitted	Compose-compatible sugar for a subset of `readiness`. Mutually exclusive with `readiness`.
`x-slurm`	mapping	omitted	Per-service Slurm overrides.
`x-runtime`	mapping	omitted	Backend-neutral image preparation rules.
`x-enroot`	mapping	omitted	Pyxis/Enroot preparation compatibility alias.

Image rules

Remote images

Any image reference without an explicit :// scheme is prefixed with docker://.
Explicit schemes are allowed only for docker://, dockerd://, and podman://.
Other schemes are rejected.
Shell variables in the image string are expanded at plan time.
Unset variables expand to empty strings.

Local images

Pyxis local image paths must point to .sqsh or .squashfs files.
Apptainer/Singularity local image paths must point to .sif files.
Relative paths are resolved against the compose file directory.
Paths that look like build contexts are rejected.

`command` and `entrypoint`

Both fields accept either:

a string, interpreted as shell form
a list of strings, interpreted as exec form

Rules:

If both fields are present, they must use the same form.
Mixed string/array combinations are rejected.
If neither field is present, the image default entrypoint and command are used.
If working_dir is set, at least one of command or entrypoint must also be set.

`environment`

Accepted forms:

environment:
  APP_ENV: prod
  LOG_LEVEL: info

environment:
  - APP_ENV=prod
  - LOG_LEVEL=info

Rules:

List items must use KEY=VALUE syntax.
.env from the compose file directory is loaded automatically when present.
Shell environment variables override .env; .env fills only missing variables.
environment, x-runtime.prepare.env, and compatibility x-enroot.prepare.env values support $VAR, ${VAR}, ${VAR:-default}, and ${VAR-default} interpolation.
Missing variables without defaults are errors.
Use $$ for a literal dollar sign in interpolated fields.
String-form shell snippets are still literal. For example, $PATH inside a string-form command is not expanded at plan time.

`volumes`

Accepted form:

volumes:
  - ./app:/workspace
  - /shared/data:/data
  - /shared/reference:/reference:ro

Rules:

Host paths are resolved against the compose file directory.
Runtime mounts accept host_path:container_path and host_path:container_path:ro|rw.
Pyxis mounts are passed through srun --container-mounts=...; Apptainer/Singularity mounts are passed as --bind.
Every service also gets an automatic shared mount at /hpc-compose/job, backed by ${SLURM_SUBMIT_DIR:-$PWD}/.hpc-compose/${SLURM_JOB_ID} on the host.
/hpc-compose/job is reserved and cannot be used as an explicit volume destination.

Warning

If a mounted file is a symlink, the symlink target must also be visible from inside the mounted directory. Otherwise the path can exist on the host but fail inside the container.

`depends_on`

Accepted forms:

depends_on:
  - redis

depends_on:
  redis:
    condition: service_started

depends_on:
  redis:
    condition: service_healthy

Rules:

List form means condition: service_started.
Map form accepts condition: service_started, condition: service_healthy, and condition: service_completed_successfully.
service_healthy requires the dependency service to define readiness.
service_started waits only for the dependency process to be launched and still alive.
service_healthy waits for the dependency readiness check to succeed.
service_completed_successfully waits for the dependency to exit with status 0 before launching the dependent service, which is useful for one-shot DAG stages such as preprocess -> train -> postprocess.

`readiness`

Supported types:

Sleep

readiness:
  type: sleep
  seconds: 5

seconds is required.

TCP

readiness:
  type: tcp
  host: 127.0.0.1
  port: 6379
  timeout_seconds: 30

host defaults to 127.0.0.1.
timeout_seconds defaults to 60.

Log

readiness:
  type: log
  pattern: "Server started"
  timeout_seconds: 60

timeout_seconds defaults to 60.

HTTP

readiness:
  type: http
  url: http://127.0.0.1:8080/health
  status_code: 200
  timeout_seconds: 30

status_code defaults to 200.
timeout_seconds defaults to 60.
The readiness check polls the URL through curl.

`healthcheck`

healthcheck is accepted as migration sugar and is normalized into the readiness model.

services:
  redis:
    image: redis:7
    healthcheck:
      test: ["CMD", "nc", "-z", "127.0.0.1", "6379"]
      timeout: 30s

Rules:

healthcheck and readiness are mutually exclusive.
Supported probe forms are a constrained subset:
- ["CMD", "nc", "-z", HOST, PORT]
- ["CMD-SHELL", "nc -z HOST PORT"]
- recognized curl probes against http:// or https:// URLs
- recognized wget --spider probes against http:// or https:// URLs
timeout maps to timeout_seconds.
disable: true disables readiness for that service.
interval, retries, and start_period are parsed but rejected in v1.
HTTP-style healthchecks normalize to readiness.type: http with status_code: 200.

Service-level `x-slurm`

These fields live under services.<name>.x-slurm.

Field	Shape	Default	Notes
`nodes`	integer	omitted	Legacy shorthand: `1` for a helper step, or the full top-level allocation node count for a full-allocation distributed service. Partial multi-node counts require `placement.node_count`.
`placement`	mapping	omitted	Explicit node-index placement inside the allocation.
`ntasks`	integer	omitted	Adds `--ntasks` to that service's `srun`.
`ntasks_per_node`	integer	omitted	Adds `--ntasks-per-node` to that service's `srun`.
`cpus_per_task`	integer	omitted	Adds `--cpus-per-task` to that service's `srun`.
`gpus`	integer	omitted	Adds `--gpus` when `gres` is not set.
`gres`	string	omitted	Adds `--gres` to that service's `srun`. Takes priority over `gpus`.
`gpus_per_node`	integer	omitted	Adds `--gpus-per-node` to that service's `srun`.
`gpus_per_task`	integer	omitted	Adds `--gpus-per-task` to that service's `srun`.
`cpus_per_gpu`	integer	omitted	Adds `--cpus-per-gpu` to that service's `srun`.
`mem_per_gpu`	string	omitted	Adds `--mem-per-gpu` to that service's `srun`.
`gpu_bind`	string	omitted	Adds `--gpu-bind` to that service's `srun`.
`cpu_bind`	string	omitted	Adds `--cpu-bind` to that service's `srun`.
`mem_bind`	string	omitted	Adds `--mem-bind` to that service's `srun`.
`distribution`	string	omitted	Adds `--distribution` to that service's `srun`.
`hint`	string	omitted	Adds `--hint` to that service's `srun`.
`time_limit`	string	omitted	Advisory per-service time limit. Validated against Slurm time formats but not passed to `srun`. `inspect` surfaces warnings when the limit exceeds allocation time or conflicts with dependencies. Accepted formats: `MM`, `MM:SS`, `HH:MM:SS`, `D-HH`, `D-HH:MM`, `D-HH:MM:SS`.
`extra_srun_args`	list of strings	omitted	Appended directly to the service's `srun` command.
`mpi`	mapping	omitted	Adds first-class MPI launch metadata and `srun --mpi=<type>`.
`failure_policy`	mapping	omitted	Per-service failure handling (`fail_job`, `ignore`, `restart_on_failure`).
`prologue`	string or mapping	omitted	Per-service shell hook run before each launch attempt. String shorthand runs on the host.
`epilogue`	string or mapping	omitted	Per-service shell hook run after each service exit attempt. String shorthand runs on the host.

`services.<name>.x-slurm.prologue` / `epilogue`

services:
  trainer:
    image: trainer:latest
    command: python train.py
    x-slurm:
      prologue: |
        module load cuda/12.1
        nvidia-smi
      epilogue:
        context: container
        script: |
          tar czf /shared/logs-${SLURM_JOB_ID}.tar.gz /hpc-compose/job/logs

Shape: either a block string, or a mapping with script and optional context.
context: host (default) or container.
Hook scripts are emitted as trusted shell and are not Compose-interpolated, so runtime variables such as ${SLURM_JOB_ID} are preserved.
Hooks run once per service launch attempt, including restart_on_failure retries.
Host hooks run in the generated batch supervisor on the allocation's primary execution context. Container hooks wrap the service command inside the container and can use /hpc-compose/job.
Hook stdout/stderr is written to the service log.
Container hooks require an explicit command or entrypoint; image-default services cannot be wrapped.

`services.<name>.x-slurm.placement`

services:
  a:
    image: app:a
    x-slurm:
      placement: { node_range: "0-3" }
  b:
    image: app:b
    x-slurm:
      placement: { node_range: "4-7" }
  ps:
    image: app:b
    x-slurm:
      placement: { share_with: b }

Exactly one selector is required:

Field	Shape	Notes
`node_range`	string	Zero-based inclusive allocation indices, for example `"0-3"` or `"0-3,6"`.
`node_count`	integer	Selects this many eligible nodes starting at `start_index`, default `0`.
`node_percent`	integer `1..100`	Selects `ceil(percent * eligible_nodes / 100)`, minimum one node.
`share_with`	string	Reuses another service's resolved node set for explicit co-location.

Optional fields:

start_index: applies to node_count and node_percent.
exclude: zero-based allocation indices removed from the eligible set and passed to srun --exclude.
allow_overlap: permits intentional overlap with another explicit placement.

Node indices are resolved against the Slurm allocation order from scontrol show hostnames "$SLURM_JOB_NODELIST". At runtime, containers receive both allocation-wide metadata (HPC_COMPOSE_NODELIST) and service-scoped metadata (HPC_COMPOSE_SERVICE_NODELIST, HPC_COMPOSE_SERVICE_NODELIST_FILE, HPC_COMPOSE_SERVICE_PRIMARY_NODE, HPC_COMPOSE_SERVICE_NODE_COUNT).

`services.<name>.x-slurm.mpi`

services:
  trainer:
    image: mpi-image:latest
    command: /usr/local/bin/train
    x-slurm:
      nodes: 2
      ntasks_per_node: 4
      mpi:
        type: pmix_v4
        implementation: openmpi
        launcher: srun
        expected_ranks: 8
        host_mpi:
          bind_paths:
            - /opt/site/openmpi:/opt/site/openmpi:ro
          env:
            MPI_DIR: /opt/site/openmpi

Shape: mapping
Default: omitted
type is an exact srun --mpi=<type> plugin token. Common values include pmix, pmix_v4, pmi2, pmi1, and openmpi; use srun --mpi=list or hpc-compose doctor --cluster-report on the target cluster to discover site-specific values.
Notes:
- Rendered as --mpi=<type> on the service's srun command.
- launcher defaults to srun; v1 rejects other launchers.
- implementation is optional metadata for diagnostics. Supported values are openmpi, mpich, intel_mpi, mvapich2, cray_mpi, hpe_mpi, and unknown.
- expected_ranks, when set, must match the resolved Slurm task geometry.
- host_mpi.bind_paths uses host_path:container_path[:ro|rw] syntax, is validated like service volumes, and is automatically mounted into the service.
- host_mpi.env is injected into the service environment after normal service environment entries.
- Cannot be combined with raw --mpi... entries in extra_srun_args.
- MPI services receive HPC_COMPOSE_MPI_TYPE and HPC_COMPOSE_MPI_HOSTFILE.
- hpc-compose doctor --mpi-smoke -f compose.yaml --service trainer renders a smoke probe for the service; add --submit to run it through Slurm. The smoke plan keeps allocation and MPI launch settings, but strips application workflow blocks such as setup, scratch staging, resume metadata, artifacts, and burst-buffer directives.

`services.<name>.x-slurm.failure_policy`

services:
  worker:
    image: python:3.11-slim
    x-slurm:
      failure_policy:
        mode: restart_on_failure
        max_restarts: 3
        backoff_seconds: 5
        window_seconds: 60
        max_restarts_in_window: 3

Field	Shape	Default	Notes
`mode`	`fail_job` \| `ignore` \| `restart_on_failure`	`fail_job`	`fail_job` keeps fail-fast behavior. `ignore` keeps the job running after non-zero exits. `restart_on_failure` restarts on non-zero exits only.
`max_restarts`	integer	`3` when `mode=restart_on_failure`	Required to be at least `1` after defaults are applied. Valid only for `restart_on_failure`.
`backoff_seconds`	integer	`5` when `mode=restart_on_failure`	Fixed delay between restart attempts. Required to be at least `1` after defaults are applied. Valid only for `restart_on_failure`.
`window_seconds`	integer	`60` when `mode=restart_on_failure`	Rolling window for counting restart-triggering exits. Required to be at least `1` after defaults are applied. Valid only for `restart_on_failure`.
`max_restarts_in_window`	integer	resolved `max_restarts` when `mode=restart_on_failure`	Maximum restart-triggering exits allowed within `window_seconds`. Required to be at least `1` after defaults are applied. Valid only for `restart_on_failure`.

Rules:

In a multi-node allocation, implicit helper services are pinned to HPC_COMPOSE_PRIMARY_NODE.
Explicit service placements may not overlap unless one side sets placement.allow_overlap: true or uses placement.share_with.
max_restarts, backoff_seconds, window_seconds, and max_restarts_in_window are rejected unless mode: restart_on_failure.
Restart attempts count relaunches after the initial launch.
Restarts trigger only for non-zero exits.
restart_on_failure enforces both a lifetime cap (max_restarts) and a rolling-window cap (max_restarts_in_window within window_seconds) during one live batch-script execution.
If you omit the rolling-window fields, restart_on_failure still enables default crash-loop protection with window_seconds: 60 and max_restarts_in_window: <resolved max_restarts>.
Services configured with mode: ignore cannot be used as dependencies in depends_on.

Examples:

Use the defaults when you only need bounded retries:

services:
  worker:
    image: python:3.11-slim
    x-slurm:
      failure_policy:
        mode: restart_on_failure

That resolves to:

max_restarts: 3
backoff_seconds: 5
window_seconds: 60
max_restarts_in_window: 3

Use explicit fields when you need a larger lifetime budget but still want a tighter crash-loop guard:

services:
  worker:
    image: python:3.11-slim
    x-slurm:
      failure_policy:
        mode: restart_on_failure
        max_restarts: 8
        backoff_seconds: 10
        window_seconds: 60
        max_restarts_in_window: 3

Semantics:

The initial launch does not count as a restart.
restart_count counts granted relaunches after the initial launch.
max_restarts_in_window counts restart-triggering non-zero exits whose timestamps still satisfy now - event < window_seconds.
If a non-zero exit would exceed the rolling-window cap, the job fails immediately and that blocked exit is not recorded as a consumed restart.
Successful exits do not trigger restarts and do not add entries to the rolling window.
The rolling window is attempt-local to one live batch-script execution. It is not hydrated from prior state.json, resume metadata, or Slurm requeue history.

Tracked state:

status --format json includes failure_policy_mode, restart_count, max_restarts, window_seconds, max_restarts_in_window, restart_failures_in_window, and last_exit_code for each tracked service.
Text status renders the live rolling-window budget as window=<current>/<max>@<seconds>s.

Unknown keys under top-level x-slurm or per-service x-slurm cause hard errors.

`x-runtime.prepare` and `x-enroot.prepare`

x-runtime.prepare lets a service build a prepared runtime image from its base image before submission. x-enroot.prepare remains accepted as a Pyxis-only compatibility spelling.

services:
  app:
    image: python:3.11-slim
    x-runtime:
      prepare:
        commands:
          - pip install --no-cache-dir numpy pandas
        mounts:
          - ./requirements.txt:/tmp/requirements.txt
        env:
          PIP_CACHE_DIR: /tmp/pip-cache
        root: true

Field	Shape	Default	Notes
`commands`	list of strings	required when `prepare` is present	Each command runs through the selected backend's writable prepare flow.
`mounts`	list of `host_path:container_path` strings	omitted	Visible only during prepare. Relative host paths resolve against the compose file directory.
`env`	mapping or list of `KEY=VALUE` strings	omitted	Passed only during prepare. Values support the same interpolation rules as `environment`.
`root`	boolean	`true`	Controls whether prepare commands request root/fakeroot behavior where the backend supports it.

Rules:

If x-runtime.prepare or x-enroot.prepare is present, commands cannot be empty.
A service may not set both spellings.
x-enroot.prepare is rejected when runtime.backend is not pyxis.
If prepare.mounts is non-empty, the service rebuilds on every prepare or submit.
Remote base images are imported under cache_dir/base.
Prepared images are exported under cache_dir/prepared.
Unknown keys under x-runtime, x-enroot, or prepare cause hard errors.

Unsupported Compose keys

These keys are rejected with explicit messages:

build
ports
networks
network_mode
Compose restart (use services.<name>.x-slurm.failure_policy)
deploy

Any other unknown key at the service level is also rejected.

FilesExpand file tree

spec-reference.md

Latest commit

History

spec-reference.md

File metadata and controls

Spec reference

Top-level shape

Top-level fields

Settings-aware command table

x-slurm

x-slurm.setup

x-slurm.submit_args

x-slurm.notify

x-slurm.cache_dir

runtime

x-slurm.scratch, stage_in, stage_out, and burst_buffer

Multi-node placement rules

x-slurm.metrics

x-slurm.artifacts

x-slurm.resume

Allocation metadata inside services

gres and gpus

Service fields

Image rules

Remote images

Local images

command and entrypoint

environment

volumes

depends_on

readiness

Sleep

TCP

Log

HTTP

healthcheck

Service-level x-slurm

services.<name>.x-slurm.prologue / epilogue

services.<name>.x-slurm.placement

services.<name>.x-slurm.mpi

services.<name>.x-slurm.failure_policy

x-runtime.prepare and x-enroot.prepare

Unsupported Compose keys

`x-slurm`

`x-slurm.setup`

`x-slurm.submit_args`

`x-slurm.notify`

`x-slurm.cache_dir`

`runtime`

`x-slurm.scratch`, `stage_in`, `stage_out`, and `burst_buffer`

`x-slurm.metrics`

`x-slurm.artifacts`

`x-slurm.resume`

`gres` and `gpus`

`command` and `entrypoint`

`environment`

`volumes`

`depends_on`

`readiness`

`healthcheck`

Service-level `x-slurm`

`services.<name>.x-slurm.prologue` / `epilogue`

`services.<name>.x-slurm.placement`

`services.<name>.x-slurm.mpi`

`services.<name>.x-slurm.failure_policy`

`x-runtime.prepare` and `x-enroot.prepare`