

Mark Ravinet edited this page Aug 23, 2024 · 1 revision

Using conda on the cluster

Introduction

Conda is a package manager for Python and many other tools that will be important for your work on the cluster. However, it can be quite cumbersome to run and is prone to errors. Operating your own conda installation on Saga is not straightforward, as it installs many files and can quickly take up a lot of room - for example, by filling your $HOME directory.

For most uses, such as the genotyping and phasing pipelines, you should use the preinstalled conda and mamba (a faster drop-in replacement for conda) modules on Saga. This guide explains how to do that, and also how to create your own environments that you can repeatedly load from the /cluster/projects/nn10082k/ directory.

The rationale here is to follow the Sigma2 guidelines, which saves us from each having a separate conda installation taking up a lot of space in our shared project directory. It is also designed to make it easier to share reproducible scripts among members of the group.

Loading the conda and mamba modules

Loading the previously installed modules on Saga is very straightforward with the module command. For example:

module load Miniconda3/23.10.0-1

This will load miniconda, whereas the following will load mamba:

module load Mamba/23.11.0-0

You can use module avail miniconda to search for all the available packages. This is worth doing, as there might be multiple versions of miniconda, or the version installed on Saga may have been updated.

Once you have loaded one of these, you will need to activate them so that conda actually runs. For example, the following line will activate miniconda:

source ${EBROOTMINICONDA3}/bin/activate

And this will activate mamba:

source ${EBROOTMAMBA}/bin/activate

You will know this has worked because your command-line prompt will now show (base) next to it. You will need to run this line every time you start a new terminal, log in to Saga, or at the start of any script that makes use of conda packages or environments.

Running previously installed environments

I have installed several environments for the phasing and genotyping pipelines in the project directory. These are maintained at /cluster/projects/nn10082k/conda_group. There is a README in this directory explaining what each of them is.

It is very simple to load and run these environments so that you can make use of the pipelines - all you need to do is point conda or mamba at them. Here I will show you how to load the genotyping and phasing environments.

To load the genotyping pipeline, use the following command (once conda is loaded):

conda activate /cluster/projects/nn10082k/conda_group/nextflow 

Your prompt will change and you can now use all the programs that are contained in this environment. To do the same for the phasing environment it is simply:

conda activate /cluster/projects/nn10082k/conda_group/phase 

Installing programs with conda and creating your own environments

It is quite easy to install packages with conda. However, as a rule, you should not install anything in the base environment. Instead, you should create your own environments and install packages into a specific location on the cluster.

I have created the following location to ensure that everyone has a space for installation, and you should only install things here: /cluster/projects/nn10082k/conda_users. Create a directory named after your username, and we will then point conda only at this location. For example, my folder is:

/cluster/projects/nn10082k/conda_users/msravine
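Creating your personal directory can be scripted so that the path always matches your login name. This is a minimal sketch, assuming your Saga username is available in the standard $USER shell variable:

```shell
# Build the path from $USER so it always matches your login name
conda_user_dir="/cluster/projects/nn10082k/conda_users/$USER"
echo "$conda_user_dir"

# Create the directory if it does not already exist
# (|| true so the line is harmless if run off-cluster)
mkdir -p "$conda_user_dir" || true
```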

Next, we need to set an environment variable to ensure conda uses this directory as a cache - i.e. the place where it stores everything it downloads, which we can then easily maintain.

Do so like this:

export CONDA_PKGS_DIRS=/cluster/projects/nn10082k/conda_users/username/package-cache

Be sure to replace the username part of this path with your own. It is very important to keep the cache here so that it does not fill your home directory. However, it can quickly accumulate a lot of downloads, so you should regularly clean it using the following command:

conda clean -a

Next, we need to create a new personal environment to install things into - remember, we are not using base. Here I will create an environment called cpg in my conda directory (again, replace username with your own):

conda create -y --prefix /cluster/projects/nn10082k/conda_users/username/cpg

Note that the -y flag prevents conda from asking for confirmation - it will just go ahead and set up the environment for you. Once this is done, you can activate the environment like so:

conda activate /cluster/projects/nn10082k/conda_users/username/cpg

With the environment created and activated, you can now install programs into it like so:

conda install bcftools
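Note that bcftools is distributed through the bioconda channel, so if the plain command above cannot find the package, naming the channels explicitly usually helps. This is a sketch using the common bioconda/conda-forge channel setup:

```shell
# bcftools comes from the bioconda channel, which itself depends on conda-forge
conda install -c conda-forge -c bioconda bcftools
```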

Using conda in a slurm script

If you are using one of the lab pipelines that requires a conda installation - e.g. the genotyping pipeline - you should add the following lines at the start of your script.

module load Miniconda3/23.10.0-1
source ${EBROOTMINICONDA3}/bin/activate
export CONDA_PKGS_DIRS=/cluster/projects/nn10082k/conda_users/username/package-cache
conda activate /cluster/projects/nn10082k/conda_group/nextflow

This will load miniconda and activate the group environment nextflow. With this, you can essentially ignore the requirement to install conda listed on the pipeline page here. You do not need your own version of conda, and this makes it much easier to run the pipeline without issues.
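Putting it all together, the top of a Slurm script using the group nextflow environment might look like the sketch below. The #SBATCH values (job name, time, memory) are placeholders to adapt to your job, the account name is assumed from the project directory, and username is your own directory:

```shell
#!/bin/bash
#SBATCH --job-name=genotyping        # placeholder job name
#SBATCH --account=nn10082k           # project account (assumed from the project path)
#SBATCH --time=24:00:00              # adjust to your job
#SBATCH --mem-per-cpu=4G             # adjust to your job

# Load and activate the module-provided conda
module load Miniconda3/23.10.0-1
source ${EBROOTMINICONDA3}/bin/activate

# Keep the package cache out of $HOME (replace username with your directory)
export CONDA_PKGS_DIRS=/cluster/projects/nn10082k/conda_users/username/package-cache

# Activate the shared group environment
conda activate /cluster/projects/nn10082k/conda_group/nextflow

# ... pipeline commands go here ...
```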

Sharing environments with others

If you create an environment for running a script that you want to share with others, you can easily do so by giving them the path to your environment - i.e.:

conda activate /cluster/projects/nn10082k/conda_users/username/my_environment

You might need to ensure other group members have permission to access this. You can do so with chmod -R 775 /cluster/projects/nn10082k/conda_users/username/my_environment. However, once that is done, it is very easy for others to use your environment!
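If someone would rather rebuild the environment under their own prefix than activate yours directly, conda can write the environment out as a YAML spec. This is a sketch using conda's env export and env create subcommands; my_environment.yml is just an example filename:

```shell
# Write out the package list for an existing environment
conda env export --prefix /cluster/projects/nn10082k/conda_users/username/my_environment > my_environment.yml

# A collaborator can then rebuild it under their own directory
conda env create --prefix /cluster/projects/nn10082k/conda_users/their_username/my_environment --file my_environment.yml
```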
