
Mark Ravinet edited this page Sep 12, 2025

Working on the HPC

Introduction

As before, this is a short guide designed to get you acquainted with working on Saga, part of the Sigma2 high performance computing infrastructure for Norway. It is worth noting that Sigma2 provides a lot of clear, high-level documentation. This tutorial is a rough guide; for a deeper dive, you should refer to the Sigma2 documentation, particularly the section on running jobs.

Although you can use Saga interactively, its true power comes from submitting jobs to a queue system. This allows you to send a large analysis off to a compute node, which carries the job out on your behalf; when it is finished, you collect the results. Think of it as a way to distribute your analyses so that instead of operating on a single computer, you can harness the power of many at once.

There are several different ways to interact with job schedulers. The system on Saga is built on slurm. This takes a bit of getting used to, but is generally quite easy to use. We will walk through submitting a basic job to get an idea of how it works.

Once you have got the hang of the skills taught here, it is essential that you read the guide on how to choose the memory, number of CPUs and run time of a job properly. You can find that here.

Submitting a simple job

When you are logged in to Saga, you are interacting with the login node. Here you can use the unix command line and even do some preliminary, small-scale analyses. However, when you submit a job, you submit it to a compute node.

Below is an example of a very basic job script. You can copy and paste it into nano on Saga and then save it as 1_simple_script.slurm. Alternatively you can download the script from github here (NB: this link is not yet working).

#!/bin/bash

# Job name:
#SBATCH --job-name=simple1

# partition/queue job being run on
#SBATCH --account=nn10082k

# number of nodes
#SBATCH --nodes=1

# tasks per node
#SBATCH --ntasks-per-node=1

# Processor and memory usage:
#SBATCH --mem-per-cpu=1G
#SBATCH --cpus-per-task=1

# total time - i.e. wallclock limit - how long job has to run
#SBATCH --time=01:00:00

# notify job failure
#SBATCH --mail-user=username@email.com
#SBATCH --mail-type=FAIL

# make sure the script starts where you want it to
cd $HOME

# a simple command
echo "Hello world!"

# after echoing, wait for 60 seconds
sleep 60

There is a lot going on here, so let’s break it down. There are two parts to this script, the header and the body.

The header

The header of the script is made up of all the lines beginning with #SBATCH - these are what set up the script and tell slurm how to assign the right resources to it. We will explain each line in turn:

  • #!/bin/bash - this is not actually part of the sbatch specification but simply denotes that this is a bash script and will be interpreted as such when it is run.
  • #SBATCH --job-name=simple1 - as you might expect, this option defines the name of the job and sets how it will be displayed in the queue. Here it will show the name simple1.
  • #SBATCH --account=nn10082k - here we specify the account that the job is run on. This is necessary to account for how much memory or resources are used by each group. nn10082k is our group account.
  • #SBATCH --nodes=1 - here we specify how many nodes are necessary for this task. This job is being run on a single node.
  • #SBATCH --ntasks-per-node=1 - and here we specify how many tasks are being performed per node. With most simple jobs, this will be 1 node and 1 task - i.e. we have not parallelised our analysis. For most things you run, you will not need to edit this.
  • #SBATCH --mem-per-cpu=1G - here we set the memory allocated per CPU for the job. In this case it is 1 GB. This is ample for this job, which is very unlikely to use more than a few MB, but you can increase it if you are doing something more memory intensive.
  • #SBATCH --cpus-per-task=1 - this is the number of cpus the job requires. In many cases, this can be set to 1 but if for example you need to use multiple threads (i.e. parallelise within a program) you might need to set this to a higher number. You can always ask advice on this and the other parameters if necessary.
  • #SBATCH --time=01:00:00 - This is the total wallclock time for the job - i.e. how long it will run for in real terms. Here it is set to 1 hour. If it takes longer than this, the scheduler will kill it. The maximum you can set is 1 week - 168 hours.
  • #SBATCH --mail-user=username@email.com - you can also set the script to email you when it is started, when it is finished or if it fails. Here you enter your email address for it to do that.
  • #SBATCH --mail-type=FAIL - and here you can tell it when to email you. In this case, it will email if the job fails.

There are plenty of other options you can set and you can see all of these using sbatch --help. Some of these options might be a little complicated to understand at first, but they make more sense once you get the hang of how slurm scripts work.
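For instance, two options worth knowing about (shown here as a sketch - check sbatch --help on Saga for details) let you control where the job's standard output and standard error are written; slurm replaces %x with the job name and %j with the job ID:

```shell
# optional extra header lines for a job script:
# write standard output and standard error to separate, named files
# (%x = job name, %j = job ID)
#SBATCH --output=%x-%j.out
#SBATCH --error=%x-%j.err
```

Without these options, both streams go to the default slurm-JobID.out file described below.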

The body

Once slurm has interpreted the header, it will interpret the body of the script. This is where you tell it to actually do things. A reminder of what the script body is in our simple example:

# make sure the script starts where you want it to
cd $HOME

# a simple command
echo "Hello world!"

# after echoing, wait for 60 seconds
sleep 60

The first thing this script does is move into the home directory. It then echoes the words “Hello world!”. Finally it waits 60 seconds and quits.

This is a very simple example, but we can use it to demonstrate how the scheduler works. Once you have the script in your home directory, simply run it like so:

sbatch 1_simple_script.slurm

You will see a response that tells you the job id number. You can then look at the queue to see if it is being run or not:

squeue -u username

This will show you all the jobs you have submitted (replace username with your own user name). If you see the job name and ID with R next to it, this means it is running; PD means pending - i.e. waiting to run. There is more information on accessing and viewing the queue here.

Checking the job has run

Once your job has run, you can check its usage statistics using the following command:

sacct -j JobID

The job will also produce an output file slurm-JobID.out in the same directory you submitted it from. This file has information from the output of the job and also the usage statistics that you can see with sacct.
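By default sacct prints a fixed set of columns, but you can request specific fields with the --format option (a sketch; 123456 is a placeholder job ID, and you can list all available field names with sacct --helpformat):

```shell
# show selected usage statistics for a finished job
# (123456 is a placeholder - substitute your real job ID)
sacct -j 123456 --format=JobID,JobName,Elapsed,MaxRSS,State
```

The MaxRSS column in particular is useful when deciding how much memory to request for future runs of the same job.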

If you look at the job output for our simple example, you should see that it wrote out the words “Hello world!”. Because this is all our job did, it didn’t actually create any additional files. However we will see an example of that later.

Setting up a script with output

For our next example, we will create a script which writes the work it does to another separate output file. As before, the script is below and you can use nano to make a version on Saga. It should be called 2_simple_script_with_output.slurm. You can also download it here.

The full script is below:

#!/bin/bash

# Job name:
#SBATCH --job-name=simple2

# partition/queue job being run on
#SBATCH --account=nn10082k

# number of nodes
#SBATCH --nodes=1

# tasks per node
#SBATCH --ntasks-per-node=1

# Processor and memory usage:
#SBATCH --mem-per-cpu=1G
#SBATCH --cpus-per-task=1

# total time - i.e. wallclock limit - how long job has to run
#SBATCH --time=01:00:00

# notify job failure
#SBATCH --mail-user=username@email.com
#SBATCH --mail-type=FAIL

# make sure the script starts where you want it to
cd $HOME

# a simple command
echo "Hello world!"

# a simple command, also redirected to an output
echo "Hello world!" > my_output.txt

# after echoing, wait for 60 seconds
sleep 60

We won’t run through the header this time as it is exactly the same as before, except for the job name. However, this time there is an extra line in the body:

# a simple command, also redirected to an output
echo "Hello world!" > my_output.txt

All this does is echo the same text again, this time redirected to an additional file - my_output.txt.

Run the script by submitting it as before. Like below:

sbatch 2_simple_script_with_output.slurm 

You can check its status while running as before if you wish (easy in this case when the job is quick). When it is done, you will have a slurm output file and also the output file my_output.txt. If you look inside this, you should see the words "Hello world!".

Of course this is a simple example, but you can quickly see how to set up a job that does some work and writes the results to a file for you to access once it is complete. In the next example, we will see how to give a job script input from the command line.

Providing input to slurm scripts

As a general strategy, it is a good idea to write scripts that can be used for more than one purpose, without having to edit them every single time. One way to achieve this is to have a script that can take input from the command line. This is generally easy to do with bash scripts - see here for a tutorial on how to do it. So what about with slurm?

Below is a script that once again you can copy and paste into nano or download here from github. If you copy it, make sure you name it 3_simple_script_take_input.slurm.

#!/bin/bash

# Job name:
#SBATCH --job-name=simple3

# partition/queue job being run on
#SBATCH --account=nn10082k

# number of nodes
#SBATCH --nodes=1

# tasks per node
#SBATCH --ntasks-per-node=1

# Processor and memory usage:
#SBATCH --mem-per-cpu=1G
#SBATCH --cpus-per-task=1

# total time - i.e. wallclock limit - how long job has to run
#SBATCH --time=01:00:00

# notify job failure
#SBATCH --mail-user=username@email.com
#SBATCH --mail-type=FAIL

# make sure the script starts where you want it to
cd $HOME

# a simple command - this time taking a value from the command line
echo "Hello world! My name is $NAME"

# a simple command, also redirected to an output
echo "Hello world! My name is $NAME" > my_output2.txt

# after echoing, wait for 60 seconds
sleep 60

This script is very similar to our previous one but it contains a bash environment variable, $NAME, that is not defined within the script. That means if we run it without defining the variable in some way, we will just get a blank space in the sentence. Luckily there is a way to define such variables from the command line when we submit the job. For example:

sbatch --export=NAME="mark" 3_simple_script_take_input.slurm

Here we use the --export option to define the variable, in this case NAME, and assign it a value. Feel free to try this with your own name too.
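Because --export simply places the variable in the job's environment, you can preview what the script body will print without submitting anything, by setting the variable for a single command yourself (a local sketch of the same mechanism, run on the login node or any machine with bash):

```shell
# set NAME in the environment for one command, just as
# sbatch --export=NAME="mark" does for a job
NAME="mark" bash -c 'echo "Hello world! My name is $NAME"'
```

This kind of dry run is a quick way to catch quoting or variable-name mistakes before a job spends time in the queue.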

Once the job is finished, check the slurm output and the my_output2.txt file. You should see that the job took your input and echoed it to the standard out (i.e. the slurm output) and the file the job produced.

An example of doing some work

So far we have only seen some toy examples of how to use a slurm script. Let’s try an example that is a little bit more realistic. We are going to count the number of lines in a file.

Firstly, we need to create the file we need. Once logged into the cluster, run the following for loop.

for i in {1..1000}; do
  echo "This is line $i" >> myfile.txt
done

This will create a file with 1000 lines. Obviously in this case we know the answer already but in reality we might often submit jobs that do things like count lines to get an idea of how many reads or variants are in a file - and then we don’t know the answer!
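As a quick sanity check, you can recreate the file and count its lines directly on the login node (this is small enough not to need a job). Note that redirecting the whole loop once with > is also a little tidier than appending line by line with >>:

```shell
# recreate the 1000-line file, redirecting the whole loop's output
for i in {1..1000}; do
  echo "This is line $i"
done > myfile.txt

# count the lines; this should report 1000
wc -l < myfile.txt
```

Reading the file via < makes wc print only the number, without the file name.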

We can use the script below to then do this via a job submission.

#!/bin/bash

# Job name:
#SBATCH --job-name=count

# partition/queue job being run on
#SBATCH --account=nn10082k

# number of nodes
#SBATCH --nodes=1

# tasks per node
#SBATCH --ntasks-per-node=1

# Processor and memory usage:
#SBATCH --mem-per-cpu=1G
#SBATCH --cpus-per-task=1

# total time - i.e. wallclock limit - how long job has to run
#SBATCH --time=01:00:00

# notify job failure
#SBATCH --mail-user=username@email.com
#SBATCH --mail-type=FAIL

# make sure the script starts where you want it to
cd $HOME

# good practice to specify files as variables
FILE=myfile.txt

# count the lines of this file
wc -l ${FILE}

# after counting, wait for 60 seconds
sleep 60

Once again, you can download this script here or copy and paste it into a nano window on the cluster and call it 4_count_lines.slurm. Then just submit it like so:

sbatch 4_count_lines.slurm

Once it has run, check the output (i.e. the slurm outfile) and you should see the correct number of lines has been written. If you want, you can edit the for loop above to try different values to verify to yourself that this works.

Last of all, a tip - if you wanted to supply a specific file name from the sbatch command, the value you supplied would be overwritten here because ${FILE} is assigned inside the script. If you comment that line out of the slurm script, you can submit the job and specify the file name like so:

sbatch --export=FILE="myfile.txt" 4_count_lines.slurm
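An alternative, if you want the script to work both with and without --export, is bash's default-value expansion (a general bash idiom, not slurm-specific): the script falls back to a default only when the variable was not already set.

```shell
# use the FILE value supplied via --export if one was set,
# otherwise fall back to a default file name
FILE=${FILE:-myfile.txt}

echo "Counting lines in ${FILE}"
```

With this in place, wc -l ${FILE} works either way: sbatch 4_count_lines.slurm counts myfile.txt, while sbatch --export=FILE="other.txt" 4_count_lines.slurm counts other.txt.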

Running jobs in arrays

By now, you should have a decent understanding of how to submit jobs via slurm. In the last section, you learned how to write a script to take a different input each time. But what if you want to run a script that does an individual job on hundreds of files? The easiest way to do this is as an array job.

An array job essentially allows you to submit a script that will run as individual sub-jobs, all under a single job ID. So for example, you could submit an array job for an array of 0-9 (note that in slurm, we start from zero) and it will run 10 jobs with IDs like the following:

123_0
123_1
123_2
123_3

and so on, assuming the overall job ID is 123 in this example. A key point with an array job is that each sub-job takes the value of its array index as a variable. Within an array script, this variable is called $SLURM_ARRAY_TASK_ID. So for job 123_1, $SLURM_ARRAY_TASK_ID = 1.

So far this probably seems a bit confusing and you might also be wondering what the point is. Let’s use a little example to show how it can work. Imagine we have a list of files. We can create this like so:

echo -e "file1\nfile2\nfile3" > test_files

This creates a simple file that is a list of individual file names, one on each line.

We can then use a bash array to create essentially an indexed list (i.e. similar to a vector in R). We do this like so:

readarray -t INPUT_FILE < test_files

This creates a bash array in a variable called INPUT_FILE (the -t option strips the trailing newline from each line, which is almost always what you want). You can investigate it like so:

echo $INPUT_FILE # shows only the first value
echo ${INPUT_FILE[@]} # shows all values
echo ${INPUT_FILE[0]} # shows value at position 0
echo ${INPUT_FILE[1]} # shows value at position 1
echo ${INPUT_FILE[2]} # shows value at position 2

Remember that these bash arrays are also zero-indexed (i.e. we start from zero). Arrays are generally useful to know about because they are fixed once created - you have to explicitly add to them after you have created one. This makes them a good source of input for for loops, for example, as you cannot accidentally run a loop on its own output - i.e. create loops that never end!

Anyway, back to slurm. Let’s create a simple slurm script like we had previously:

#!/bin/bash

# Job name:
#SBATCH --job-name=count_array

# partition/queue job being run on
#SBATCH --account=nn10082k

# number of nodes
#SBATCH --nodes=1

# tasks per node
#SBATCH --ntasks-per-node=1

# Processor and memory usage:
#SBATCH --mem-per-cpu=1G
#SBATCH --cpus-per-task=1

# total time - i.e. wallclock limit - how long job has to run
#SBATCH --time=01:00:00

# notify job failure
#SBATCH --mail-user=username@email.com
#SBATCH --mail-type=FAIL

# make sure the script starts where you want it to
cd $HOME

# read a list of files as an array (-t strips trailing newlines)
readarray -t INPUT_FILES < test_files
# declare input from array using the array task id
INPUT=${INPUT_FILES[${SLURM_ARRAY_TASK_ID}]}

# count the lines of this file
wc -l ${INPUT}

# after counting, wait for 60 seconds
sleep 60

You can see from this script that it first builds a bash array from the test_files list we created above. It then uses $SLURM_ARRAY_TASK_ID to pick the input file from the array. We can save this script as 5_count_lines_array.slurm. Now you can easily run it on the three files in our test_files list. In fact, test_files could have 1000 files listed in it and you could submit a single job to work on them all at once.
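Since $SLURM_ARRAY_TASK_ID is just an environment variable that slurm sets for each sub-job, you can dry-run the indexing logic on the login node (or any machine with bash) by setting it yourself - a useful check before submitting a large array:

```shell
# make a small list of file names, one per line
echo -e "file1\nfile2\nfile3" > test_files

# simulate sub-job 1 of an array by setting the task ID manually
SLURM_ARRAY_TASK_ID=1

# same indexing logic as in the job script (-t strips trailing newlines)
readarray -t INPUT_FILES < test_files
INPUT=${INPUT_FILES[${SLURM_ARRAY_TASK_ID}]}

echo "Sub-job ${SLURM_ARRAY_TASK_ID} would work on: ${INPUT}"
```

Because both the array and the task IDs are zero-indexed, task ID 1 picks out the second line of the list, file2.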

So how do you actually submit an array job? It is very straightforward - like so:

sbatch -a 0-2 5_count_lines_array.slurm

Here all you need to do is add the -a option to specify the array indices. Here we set it to run from 0-2 - i.e. lines 1-3 in our file, since the array is zero-indexed. You can submit hundreds of jobs with a single line this way and the queue will manage their submission for you. You can also throttle how many sub-jobs run at the same time using the % separator. For example:

sbatch -a 0-99%10 5_count_lines_array.slurm

Would run 100 jobs in total but only ever run 10 at a time. It is well worth learning more about array jobs as they can be a very powerful addition to your work on the HPC!

Conclusion

You should now have a basic idea of how to run jobs under the slurm scheduler. Although this might seem a little abstract at this point, it gets a lot easier to handle once you are familiar with it. In the next tutorial, we will learn about loading modules to get access to software on the cluster.
