5 changes: 3 additions & 2 deletions docs/cluster-setup-guide/deploy-cluster/index.rst
@@ -16,7 +16,8 @@ environments:
+------------------------------------------+-----------------------------------------------------+
| :doc:`sysadmin-deploy-on-k8s/overview` | How to run Determined on Kubernetes. |
+------------------------------------------+-----------------------------------------------------+
| :doc:`sysadmin-deploy-on-slurm/overview` | How to run Determined on Slurm. |
| :doc:`sysadmin-deploy-on-slurm/overview` | How to run Determined on an HPC cluster |
| | (Slurm/PBS). |
+------------------------------------------+-----------------------------------------------------+

.. toctree::
@@ -26,4 +27,4 @@ environments:
Deploy on AWS <sysadmin-deploy-on-aws/overview>
Deploy on GCP <sysadmin-deploy-on-gcp/overview>
Deploy on Kubernetes <sysadmin-deploy-on-k8s/overview>
Deploy on Slurm <sysadmin-deploy-on-slurm/overview>
Deploy on Slurm/PBS <sysadmin-deploy-on-slurm/overview>
@@ -1,13 +1,14 @@
.. _install-on-slurm:

#############################
Install Determined on Slurm
#############################
#################################
Install Determined on Slurm/PBS
#################################

This document describes how to deploy Determined on a Slurm cluster.
This document describes how to deploy Determined on an HPC cluster managed by the Slurm or PBS
workload managers.

The Determined master and launcher installation packages are configured for installation on a single
login or administrator Slurm cluster node.
login or administrator Slurm/PBS cluster node.

***************************
Install Determined Master
@@ -37,22 +38,23 @@ fulfilled and configured, install and configure the Determined master:

sudo apt install ./hpe-hpc-launcher-<version>.deb

The installation configures and enables the ``systemd`` ``launcher`` service, which provides
Slurm management capabilities.
The installation configures and enables the ``systemd`` ``launcher`` service, which provides HPC
management capabilities.

If launcher dependencies are not satisfied, warning messages are displayed. Install or update
missing dependencies or adjust the ``path`` and ``ld_library_path`` in the next step to locate the
dependencies.

.. _using_slurm:

*************************************************
Configure and Verify Determined Master on Slurm
*************************************************
*******************************************************
Configure and Verify Determined Master on HPC Cluster
*******************************************************

#. The launcher automatically adds a prototype ``resource_manager`` section for Slurm. Edit the
provided ``resource_manager`` configuration section for your particular deployment. For RPM-based
installations, the configuration file is typically the ``/etc/determined/master.yaml`` file.
#. The launcher automatically adds a prototype ``resource_manager`` section for Slurm/PBS if not
already present upon startup of the launcher service. Edit the provided ``resource_manager``
configuration section for your particular deployment. For RPM-based installations, the
configuration file is typically the ``/etc/determined/master.yaml`` file.

In this example, with Determined and the launcher colocated on a node named ``login``, the
section might resemble:
@@ -72,6 +74,7 @@ fulfilled and configured, install and configure the Determined master:
auth_file: /root/.launcher.token
job_storage_root:
path:
ld_library_path:
tres_supported: true
slot_type: cuda

@@ -81,6 +84,12 @@ fulfilled and configured, install and configure the Determined master:
+----------------------------+----------------------------------------------------------------+
| Option | Description |
+============================+================================================================+
| ``type`` | The cluster workload manager (``slurm`` or ``pbs``). |
+----------------------------+----------------------------------------------------------------+
| ``master_host`` | The host name of the Determined master. This is the name the |
| | compute nodes will utilize to communicate with the |
| | Determined master. |
+----------------------------+----------------------------------------------------------------+
| ``port`` | Communication port used by the launcher. Update this value if |
| | there are conflicts with other services on your cluster. |
+----------------------------+----------------------------------------------------------------+
@@ -103,9 +112,13 @@ fulfilled and configured, install and configure the Determined master:
| ``path`` | If any of the launcher dependencies are not on the default |
| | path, you can override the default by updating this value. |
+----------------------------+----------------------------------------------------------------+
| ``gres_supported`` | Indicates that Slurm/PBS is able to identify GPUs. The default |
| | is ``true``. See :ref:`slurm-config-requirements` or |
| | :ref:`pbs-config-requirements` for details. |
+----------------------------+----------------------------------------------------------------+

See the :ref:`slurm section <cluster-configuration-slurm>` of the cluster configuration reference
for the full list of configuration options.
See the :ref:`slurm/pbs section <cluster-configuration-slurm>` of the cluster configuration
reference for the full list of configuration options.

After changing values in the ``resource_manager`` section of the ``/etc/determined/master.yaml``
file, restart the launcher service:
@@ -120,7 +133,7 @@ fulfilled and configured, install and configure the Determined master:
``/etc/determined/master.yaml`` file, and restart the launcher.

If the installer reported incorrect dependencies, verify that they have been resolved by changes
to the ``path`` in the previous step:
to the ``path`` and ``ld_library_path`` in the previous step:

.. code:: bash

@@ -137,8 +150,8 @@ fulfilled and configured, install and configure the Determined master:
``/var/log/messages`` or ``journalctl --since=10m -u determined-master``, make the needed changes
to the ``/etc/determined/master.yaml`` file, and restart the determined-master.

#. If the compute nodes of your cluster do not have internet connectivity to download Docker images,
see :ref:`slurm-image-config`.
#. If using Singularity and the compute nodes of your cluster do not have internet connectivity to
download Docker images, see :ref:`slurm-image-config`.

#. Verify the configuration by sanity-checking your Determined Slurm configuration:

@@ -154,57 +167,3 @@ fulfilled and configured, install and configure the Determined master:
communication, access to the shared filesystem, GPU scheduling, and high-speed interconnect
configuration. For more complete validation, ensure that the ``slots_per_trial`` is at least
twice the number of GPUs available on a single node.
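The sizing rule above is simple arithmetic. As a minimal sketch (the helper name ``min_slots_for_multinode`` is hypothetical and is not part of Determined), the smallest ``slots_per_trial`` that forces the test to span more than one node can be computed as:

```python
def min_slots_for_multinode(gpus_per_node: int) -> int:
    """Smallest slots_per_trial that is at least twice the GPUs on one node.

    Hypothetical helper for illustration only; it simply restates the
    "at least twice the number of GPUs available on a single node" guideline.
    """
    if gpus_per_node < 1:
        raise ValueError("gpus_per_node must be a positive integer")
    return 2 * gpus_per_node


if __name__ == "__main__":
    # For nodes with 4 GPUs each, the guideline suggests slots_per_trial >= 8.
    print(min_slots_for_multinode(4))
```

With this value the job cannot fit on a single node, so the check also exercises inter-node communication.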

*****************
Configure Slurm
*****************

Determined should function with your existing Slurm configuration. The following steps are
recommended to optimize how Determined interacts with Slurm:

- Enable Slurm for GPU Scheduling.

Configure Slurm with `SelectType=select/cons_tres <https://slurm.schedmd.com/cons_res.html>`__.
This enables Slurm to track GPU allocation instead of tracking only CPUs. If this is not
available, you must change the :ref:`slurm section <cluster-configuration-slurm>`
``tres_supported`` option to ``false``.

- Configure GPU Generic Resources (GRES).

Determined works best when allocating GPUs. Slurm exposes information about the available GPUs
through GRES. You can use the `AutoDetect
<https://slurm.schedmd.com/gres.html#AutoDetect>`__ feature to configure GPU GRES automatically.
Otherwise, you should manually configure `GRES GPUs
<https://slurm.schedmd.com/gres.html#GPU_Management>`__ such that Slurm can schedule nodes with
the GPUs you want.

For the automatic selection of nodes with GPUs, Slurm must be configured for ``GresTypes=gpu``
and nodes with GPUs must have properly configured GRES indicating the presence of any GPUs. If
Slurm GRES cannot be properly configured, set the :ref:`slurm section
<cluster-configuration-slurm>` ``gres_supported`` option to ``false``. It is then the user's
responsibility to ensure that GPUs are available on the nodes selected for the job, for example by
targeting a resource pool that contains only GPU nodes or by specifying a Slurm constraint in the
experiment configuration.

- Ensure homogeneous Slurm partitions.

Determined maps Slurm partitions to Determined resource pools. It is recommended that the nodes
within a partition are homogeneous for Determined to effectively schedule GPU jobs.

- A Slurm partition with GPUs is identified as a CUDA/ROCM resource pool. The type is inherited
from the ``resource_manager.slot_type`` configuration. It can also be specified per partition
using ``resource_manager.partition_overrides``.

- A Slurm partition with no GPUs is identified as an AUX resource pool.

- The Determined default resource pool is set to the Slurm default partition.

- Tune the Slurm configuration for Determined job preemption.

Slurm preempts jobs using signals. When a Determined job receives SIGTERM, it begins a checkpoint
and graceful shutdown. To prevent unnecessary loss of work, it is recommended to set ``GraceTime
(secs)`` high enough to permit the job to complete an entire Determined ``scheduling_unit``.

To enable GPU job preemption, use ``PreemptMode=CANCEL`` or ``PreemptMode=REQUEUE``, because
``PreemptMode=SUSPEND`` does not release GPUs and therefore does not allow a higher-priority job
to access the allocated GPU resources.
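The preemption guidance above can be spot-checked against the text of a ``slurm.conf`` file. The following is a rough sketch, not a Determined or Slurm tool: the function names are hypothetical, the parsing is deliberately simplistic (whitespace-separated ``Key=Value`` tokens, ``#`` comments), and it assumes that ``CANCEL`` and ``REQUEUE`` are the modes that release GPUs.

```python
from typing import List

# Assumption: these are the PreemptMode values that free GPUs for other jobs.
GPU_RELEASING_MODES = {"CANCEL", "REQUEUE"}


def preempt_modes(slurm_conf_text: str) -> List[str]:
    """Collect every PreemptMode value found in slurm.conf text, upper-cased."""
    modes: List[str] = []
    for raw_line in slurm_conf_text.splitlines():
        line = raw_line.split("#", 1)[0]  # drop comments
        for token in line.split():
            key, sep, value = token.partition("=")
            if sep and key.lower() == "preemptmode":
                # A setting may list several modes, e.g. SUSPEND,GANG.
                modes.extend(v.upper() for v in value.split(",") if v)
    return modes


def allows_gpu_preemption(slurm_conf_text: str) -> bool:
    """True if a GPU-releasing mode is configured and SUSPEND is not used."""
    modes = preempt_modes(slurm_conf_text)
    return any(m in GPU_RELEASING_MODES for m in modes) and "SUSPEND" not in modes


if __name__ == "__main__":
    sample = "PreemptType=preempt/partition_prio\nPreemptMode=REQUEUE  # requeue jobs\n"
    print(preempt_modes(sample), allows_gpu_preemption(sample))
```

A real check would also need to account for per-partition overrides and included configuration files, which this sketch ignores.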
@@ -1,31 +1,34 @@
#################
Deploy on Slurm
#################
#####################
Deploy on Slurm/PBS
#####################

+----------------------+
| Supported Versions |
+======================+
| Slurm >= 19.05 |
| Slurm >= 19.05 or |
| PBS >= 2021.1.2 |
+----------------------+
| Singularity >= 3.7 |
| or PodMan >= 3.3.1 |
+----------------------+
| Launcher |
| (`hpe-hpc-launcher`) |
| >= 3.1.0 |
| >= 3.1.2 |
+----------------------+
| Java >= 1.8 |
+----------------------+

.. note::

Slurm deployment applies to the Enterprise Edition.
Slurm/PBS deployment applies to the Enterprise Edition.

Determined Slurm integration delegates all job scheduling and prioritization to the Slurm workload
manager. This integration enables existing Slurm workloads and Determined workloads to coexist and
Determined workloads to access all of the advanced capabilities of the Slurm workload manager.
This document describes how Determined can be configured to utilize HPC cluster scheduling systems
via the Determined HPC launcher. In this type of configuration, Determined delegates all job
scheduling and prioritization to the HPC workload manager (either Slurm or PBS). This integration
enables existing HPC workloads and Determined workloads to coexist and Determined workloads to
access all of the advanced capabilities of the HPC workload manager.

To install Determined on a Slurm cluster, ensure that the
To install Determined on the HPC cluster, ensure that the
:doc:`/cluster-setup-guide/deploy-cluster/sysadmin-deploy-on-slurm/slurm-requirements` are met, then
follow the steps in the
:doc:`/cluster-setup-guide/deploy-cluster/sysadmin-deploy-on-slurm/install-on-slurm` document.
@@ -36,7 +39,10 @@ follow the steps in the

- :ref:`Determined Installation Requirements <system-requirements>`
- `Slurm <https://slurm.schedmd.com/documentation.html>`__
- `OpenPBS® <https://www.openpbs.org/>`__
- `PBS Professional® <https://www.altair.com/pbs-professional/>`__
- `Singularity <https://docs.sylabs.io/guides/3.7/user-guide/introduction.html>`__
- `Apptainer <https://apptainer.org/>`__
- `PodMan <https://docs.podman.io>`__

.. toctree::