This runbook is the normal real-cluster flow for adapting a hpc-compose spec on a supported Linux Slurm submission host.
Commands below assume hpc-compose is on your PATH. If you are running from a local checkout, replace hpc-compose with target/release/hpc-compose.
Compose-aware commands accept -f / --file. When omitted, hpc-compose uses the active context compose file from .hpc-compose/settings.toml, then falls back to compose.yaml in the current directory. Global context flags are available everywhere:
--profile <NAME>selects a profile from.hpc-compose/settings.toml.--settings-file <PATH>uses an explicit settings file instead of upward auto-discovery.
Read Execution Model, Runtime Backends, and Support Matrix before adapting a workflow to a new cluster.
Make sure you have:
- a Linux submission host with
srunandsbatch, - the runtime backend selected by
runtime.backend, scontrolwhenx-slurm.nodes > 1,- Pyxis support in
srunwhenruntime.backend: pyxis(srun --helpshould mention--container-image), - shared storage for
x-slurm.cache_dir, - local source trees or local
.sqsh/.sifimages in place, - registry credentials when your cluster or registry requires them.
Backend-specific requirements are listed in Runtime Backends. Cluster profile generation and MPI smoke probes are covered in Cluster Profiles.
For a new spec on a real cluster:
- Choose a starter from Examples, or run
hpc-compose new --template <name> --name my-app --cache-dir '<shared-cache-dir>' --output compose.yaml. - Run
hpc-compose setuponce if you want compose path, env files, env vars, and binary overrides stored in a project-local settings file. - Run
hpc-compose context --format jsonto verify resolved values and sources. - Set or confirm
x-slurm.cache_dir, then adjust cluster-specific resource settings. - Run
hpc-compose plan -f compose.yamlandhpc-compose plan --verbose -f compose.yamlwhile adapting the file. - Run
hpc-compose up -f compose.yamlfor the normal cluster run. - If it fails, start with
hpc-compose debug -f compose.yaml --preflight, then use Troubleshooting and break outpreflight,prepare,render,status,ps,watch,stats, orlogsseparately.
For a minimal cluster smoke test from a checkout, set CACHE_DIR to shared storage and run scripts/cluster_smoke.sh. It validates, preflights, and renders by default; set HPC_COMPOSE_SMOKE_SUBMIT=1 only when you intentionally want it to launch the smoke job.
hpc-compose can discover .hpc-compose/settings.toml by walking upward from the current directory. You can also pin a file with --settings-file.
Typical setup flow:
hpc-compose setup
hpc-compose context
hpc-compose --profile dev context --format jsonNon-interactive setup is available for scripting:
hpc-compose setup --profile-name dev --compose-file compose.yaml --env-file .env --env-file .env.dev --env 'CACHE_DIR=<shared-cache-dir>' --default-profile dev --non-interactiveSettings file shape:
version = 1
default_profile = "dev"
[defaults]
compose_file = "compose.yaml"
env_files = [".env"]
[defaults.env]
CACHE_DIR = "/cluster/shared/hpc-compose-cache"
[profiles.dev]
compose_file = "compose.yaml"
env_files = [".env", ".env.dev"]
[profiles.dev.env]
RESUME_DIR = "/shared/$USER/runs/my-run"
MODEL_DIR = "$HOME/models"Resolution precedence is fixed:
- CLI flags
- selected profile values
- shared settings defaults
- built-in CLI defaults
Use context whenever you want to inspect effective compose path, binaries, interpolation variables, runtime paths, and per-field sources.
The maintained selection guide is Examples. It includes:
- four promoted beginner paths,
- a novice ladder from authoring to distributed workloads,
- the full repository example matrix,
- companion notes for LLM worker examples,
- an adaptation checklist.
Keep docs/src/examples.md as the single source of example selection truth. The embedded YAML source appendix is Example Source.
Set x-slurm.cache_dir to a path visible from both the login node and compute nodes:
x-slurm:
cache_dir: /cluster/shared/hpc-compose-cacheQuick recipe:
export CACHE_DIR=/cluster/shared/hpc-compose-cache
mkdir -p "$CACHE_DIR"
test -w "$CACHE_DIR"Rules:
- Do not use
/tmp,/var/tmp,/private/tmp, or/dev/shm. - If
cache_diris unset, the default is$HOME/.cache/hpc-compose. - The default may work on some clusters, but a shared project/work/scratch path is safer.
- Validation can accept unsafe local paths;
preflightreports them as policy errors.
More cache details are in Cache Management.
Start with the nearest example and then change:
imagecommand/entrypointvolumesenvironmentx-slurmresource settingsx-runtime.preparecommands for dependencies or tooling
Recommended pattern:
- Put fast-changing application code in
volumes. - Put slower-changing dependency installation in
x-runtime.prepare.commands. - Add
readinessonly to services that other services truly depend on.
hpc-compose validate -f compose.yaml
hpc-compose validate -f compose.yaml --strict-envUse validate first when changing field names, dependency shape, command/entrypoint form, paths, x-slurm, x-runtime, or compatibility x-enroot blocks.
If validate fails, fix that before doing anything more expensive. Use --strict-env when missing interpolation variables should fail instead of consuming ${VAR:-default} or ${VAR-default} fallbacks.
hpc-compose plan -f compose.yaml
hpc-compose plan --verbose -f compose.yaml
hpc-compose plan --show-script -f compose.yamlCheck:
- service order,
- allocation geometry and service step geometry,
- normalized image references,
- host-to-container mount mappings,
- resolved environment values,
- runtime artifact paths,
- cache hit/miss expectations.
plan is purely static: it parses, validates, builds the normalized runtime plan, and can print the generated script to stdout, but it does not run preflight, prepare images, call sbatch, or write hpc-compose.sbatch. plan --verbose can print secrets from resolved environment values.
hpc-compose up -f compose.yamlup is the preferred end-to-end cluster flow. It runs preflight unless disabled, prepares images unless skipped, renders the script, calls sbatch, records tracked job metadata, polls scheduler state, and streams logs.
Useful options:
--script-out path/to/job.sbatchkeeps a copy of the rendered script.--force-rebuildrefreshes imported and prepared artifacts.--skip-preparereuses existing prepared artifacts.--no-preflightskips the preflight phase.--detachsubmits or launches, records tracking metadata, and returns without watching.--format text|jsonis accepted with--detachor--dry-run.--watch-mode auto|tui|lineselects the live output mode;--no-tuiis a line-mode alias.--resume-diff-onlyprints resume-sensitive config diffs without launching.--allow-resume-changesconfirms intentional resume-coupled config drift.
up --local is Linux + Pyxis-only and single-host. See Runtime Backends.
hpc-compose preflight -f compose.yaml
hpc-compose preflight --verbose -f compose.yaml
hpc-compose preflight -f compose.yaml --strictpreflight checks selected-backend tools, Slurm tools, cache path policy, local mounts/images, registry credentials, cluster profile compatibility, distributed-readiness hazards, metrics collector tools, and resume path safety.
Generate a cluster capability profile on the target login node when you want validation and preflight to catch partition/backend/QOS/GPU/MPI mismatches earlier:
hpc-compose doctor cluster-reportSee Cluster Profiles for generated profile details, site policy packs, and MPI smoke probes.
hpc-compose prepare -f compose.yaml
hpc-compose prepare -f compose.yaml --forceUse this when you want to build or refresh prepared images before submission, confirm cache reuse behavior, or debug preparation separately from job submission.
prepare needs the selected runtime backend tools, but it does not call sbatch.
hpc-compose render -f compose.yaml --output /tmp/job.sbatchThis is useful when debugging generated srun arguments, mounts, environment passing, launch order, and readiness waits.
hpc-compose jobs list
hpc-compose status -f compose.yaml
hpc-compose ps -f compose.yaml
hpc-compose watch -f compose.yaml
hpc-compose logs -f compose.yaml --service app --follow
hpc-compose stats -f compose.yaml --format jsonlUse Runtime Observability for tracked state, logs, metrics, and machine-readable output. Use Artifacts and Resume for artifact bundles and resume-aware attempts.
hpc-compose cache list
hpc-compose cache inspect -f compose.yaml
hpc-compose cache prune --all-unused -f compose.yaml
hpc-compose cache prune --age 7 --cache-dir '<shared-cache-dir>'
hpc-compose clean -f compose.yaml --age 7 --dry-runUse Cache Management for cache reuse and pruning. Use Troubleshooting before deleting tracked job directories.
| If you changed... | Typical next step |
|---|---|
| YAML planning/runtime settings only | plan --verbose, then up |
Base image, x-runtime.prepare.commands, or prepare env |
up --force-rebuild, or prepare --force when debugging separately |
Mounted runtime source under volumes |
Usually just up |
| Cache entries this plan no longer references | cache prune --all-unused -f compose.yaml |
hpc-compose itself |
Expect cache misses on the next prepare or up, then optionally prune old entries |