The atavide_lite pipeline consists of multiple steps that we run sequentially.
We start with a DEFINITIONS.sh file that defines some variables that the scripts use. We source this
file in the scripts and then use it to control the file names.
Note that none of the definitions should have spaces in them. If you get an unbound variable error when
running atavide_lite, make sure the definitions are correct!
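As a concrete sketch (the values are just this document's examples, and we write to a demo file so we don't clobber a real DEFINITIONS.sh), this is roughly how the scripts consume the definitions:

```shell
# a minimal DEFINITIONS.sh, written to a demo file for illustration
cat > demo_DEFINITIONS.sh <<'EOF'
export SAMPLENAME=AngelSharks
export FILEEND=_R1.fastq.gz
export FAFILEEND=_R1.fasta.gz
export SOURCE=fastq
EOF

# the scripts source the file before using the variables; running them under
# `set -u` is what turns a missing definition into an "unbound variable" error
source demo_DEFINITIONS.sh
echo "Sample: $SAMPLENAME (source dir: $SOURCE)"
```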
The individual definitions are:
SAMPLENAME: This should be any non-whitespace text. Please don't put spaces in the sample name. We use it to create
the output files, so that if you run atavide_lite on multiple different samples you can figure out
which sample is which!
This can be any useful mnemonic for your samples.
Examples:
export SAMPLENAME=AngelSharks
export SAMPLENAME=Sputum
export SAMPLENAME=PRJEB20836

FILEEND: The $FILEEND variable is used to remove the .fastq.gz and other parts of the filename so that you end up with
meaningful names. We only need to include the R1 filename, as we will automatically do the same thing for R2
files.
The first step will check and throw an error if you have forgotten to change this.
Examples:
export FILEEND=_S34_R1.fastq.gz
export FILEEND=_S34_L001_R1.fastq.gz
export FILEEND=_R1.fastq.gz

In each of these cases, we trim off all the $FILEEND text before making directories, output files, and so on.
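For illustration (the input file name here is hypothetical), the trimming amounts to bash parameter expansion like this:

```shell
export FILEEND=_S34_L001_R1.fastq.gz

R1=SRR1237780_S34_L001_R1.fastq.gz   # hypothetical input file
BASE=${R1%$FILEEND}                  # trim $FILEEND from the end -> SRR1237780
R2=${R1/_R1./_R2.}                   # the matching R2 file name
echo "$BASE $R2"
```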
FAFILEEND: One of the necessary steps is to convert fastq to fasta so that we can run mmseqs2 easy-taxonomy. We need
the matching FAFILEEND so we know what to trim here. This will probably be the same as the $FILEEND variable,
but with fastq substituted for fasta.
Examples:
export FAFILEEND=_S34_R1.fasta.gz
export FAFILEEND=_S34_L001_R1.fasta.gz
export FAFILEEND=_R1.fasta.gz

SOURCE: This is the location of the fastq files to begin with. I almost always just use fastq since that makes sense!
Examples:
export SOURCE=fastq

HOSTFILE: If you want to remove host DNA, specify the location of the fasta file that has the host genome. You can omit this if you don't want to remove host DNA.
Examples:
export HOSTFILE=/scratch/$PAWSEY_PROJECT/$USER/Databases/human/GCA_000001405.15_GRCh38_no_alt_plus_hs38d1_analysis_set.fna.gz

HOST and HOSTREMOVED: These two variables are the directories where you would like to save the host/no host data. They are just the names of the directories, so no spaces please, and they can be simple like the examples.
Examples:
export HOST=human
export HOSTREMOVED=no_human
export HOST=shark
export HOSTREMOVED=no_shark

- Set the source of the scripts
Clone the atavide_lite repository to your home directory, and build the compiled code:
git clone https://github.com/linsalrob/atavide_lite.git
cd atavide_lite/bin
make
SRC=~/atavide_lite/pawsey_shortread # change this to the most suitable source for you!

- Set up the conda environment
Note, if you are working on Pawsey you should check out how to set up a temporary conda installation.
mamba env create -f ~/atavide_lite/atavide_lite.yaml
mamba env create -f ~/atavide_lite/atavide_lite_vamb.yaml

- Create some directories for the slurm output
mkdir -p slurm_output/host_slurm slurm_output/megahit_slurm slurm_output/mmseqs_slurm slurm_output/vamb_slurm

- Create a list of R1 files
Several of the steps will use the file R1_reads.txt to know which reads we have. This is one of the main control files:
our scripts iterate through it and compute the R1 and R2 file names.
Note: for the single-read processing (e.g. MinION data) we call this file reads.txt!
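Sketched with made-up file names (and a stand-in control file, so the sketch doesn't touch a real R1_reads.txt), the way an array task picks its files from this control file is roughly:

```shell
# a stand-in control file with two hypothetical samples
printf 'sampleA_R1.fastq.gz\nsampleB_R1.fastq.gz\n' > demo_R1_reads.txt

SLURM_ARRAY_TASK_ID=2                                    # set by Slurm in the real jobs
R1=$(sed -n "${SLURM_ARRAY_TASK_ID}p" demo_R1_reads.txt) # this task's R1 file
R2=${R1/_R1/_R2}                                         # and the matching R2 file
echo "$R1 $R2"
```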
find fastq -name \*_R1\* -printf "%f\n" > R1_reads.txt
NUM_R1_READS=$(wc -l R1_reads.txt | cut -f 1 -d ' ')
echo "There are $NUM_R1_READS R1 reads"
if [[ $(find fastq -name \*_R2\* | wc -l) != $NUM_R1_READS ]]; then echo "There are a different number of R1 and R2 reads"; fi

- Quality control of the sequences
We use fastp to QC/QA the sequences and to trim any adapters and barcodes. You might want to look at the fastp.slurm file and check
the default parameters. Currently, we use -n 1, which allows only one N in the sequence, and -l 100, which sets the minimum length
of sequences to 100 bp. Note: we have occasionally run into problems with data from the SRA that has reads of less than 100 bp,
and so nothing passes QC/QA. Currently, we ignore that really old data, because we have not checked the subsequent steps to see if
the results are actually meaningful.
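The invocation inside fastp.slurm will look something like this (a sketch with hypothetical file names; only the -n and -l values come from the text above, and the call is guarded so it is a no-op where fastp is not installed):

```shell
FASTP_OPTS="-n 1 -l 100"   # max one N per read; minimum read length 100 bp
if command -v fastp >/dev/null; then
  fastp -i sample_R1.fastq.gz -I sample_R2.fastq.gz \
        -o fastq_fastp/sample_R1.fastq.gz -O fastq_fastp/sample_R2.fastq.gz \
        $FASTP_OPTS
fi
echo "fastp options: $FASTP_OPTS"
```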
JOB=$(sbatch --parsable --array=1-$NUM_R1_READS:1 $SRC/fastp.slurm)

- HOST REMOVAL (optional)
Note: If you don't want to do host removal just set HOSTREMOVED=fastq_fastp in DEFINITIONS.sh
This optional step maps the reads to the host genome (e.g. human, shark, or coral, as specified by the $HOSTFILE variable)
and removes those reads using samtools. We create new R1 and R2 files in the $HOST and $HOSTREMOVED directories
which contain the clean data.
Current processing does not do anything with the host reads, but you could use them for other purposes.
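host_removal.slurm handles the details; a common shape for this kind of mapping-and-filtering step (a sketch, not the pipeline's exact commands, guarded so it is a no-op where the tools are not installed) is:

```shell
UNMAPPED_PAIR_FLAG=12   # SAM flags 4 (read unmapped) + 8 (mate unmapped)
if command -v minimap2 >/dev/null && command -v samtools >/dev/null; then
  # map against the host genome, then keep only pairs where neither mate mapped
  minimap2 -ax sr "$HOSTFILE" sample_R1.fastq.gz sample_R2.fastq.gz \
    | samtools fastq -f $UNMAPPED_PAIR_FLAG \
        -1 no_human/sample_R1.fastq.gz -2 no_human/sample_R2.fastq.gz -
fi
echo "keeping read pairs matching flag $UNMAPPED_PAIR_FLAG"
```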
HOSTJOB=$(sbatch --parsable --array=1-$NUM_R1_READS:1 --dependency=afterok:$JOB $SRC/host_removal.slurm)

- Convert to fasta for mmseqs
There is a common gotcha for mmseqs2, which is that the easy-taxonomy step requires the input to be in fasta format.
We have included a standalone C file that converts fastq to fasta, and we run that here. It processes
a whole directory at once.
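For illustration only (the pipeline uses its own compiled C converter), the transformation itself is just this:

```shell
# make a one-read demo fastq file
printf '@read1\nACGT\n+\nIIII\n' > demo.fastq

# fastq records are 4 lines; keep the header (swapping @ for >) and the sequence
awk 'NR % 4 == 1 { sub(/^@/, ">"); print } NR % 4 == 2 { print }' demo.fastq > demo.fasta
cat demo.fasta   # -> >read1 / ACGT
```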
FAJOB=$(sbatch --parsable --dependency=afterok:$HOSTJOB $SRC/fastq2fasta.slurm)

- Run mmseqs taxonomy
This is the most computationally demanding step of the pipeline. On some clusters you might want to run this on the
long queue or whatever the equivalent is.
We use the standard mmseqs2 easy-taxonomy pipeline. By default we use UniRef50 as the database, but you can change
that if you wish. (Note: we're working on a comparison of what is missed between UniRef50 and UniRef100)
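The shape of the call (a sketch; the real database path and options live in the slurm script, and the input name here is hypothetical) follows easy-taxonomy's signature of query fasta, target database, output prefix, and temporary directory:

```shell
DB=UniRef50   # the default database, per the text; swap in your own if you wish
if command -v mmseqs >/dev/null; then
  mmseqs easy-taxonomy sample_R1.fasta.gz "$DB" sample_tax tmp
fi
echo "searching against $DB"
```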
MMSEQSJOB=$(sbatch --parsable --dependency=afterok:$FAJOB $SRC/mmseqs_easy_taxonomy_submit.slurm)

- Create a taxonomy table
Here we use snakemake to run the summarise_taxonomy pipeline to summarise the
taxonomy outputs created by mmseqs2 easy-taxonomy. This first adds the taxonomy to the .lca files, and then
summarises those into a single .tsv file for each taxonomy level.
We made a design decision here to use the pytaxonkit wrapper around
taxonkit to add the taxonomy to the .lca files, rather than using the
mmseqs2 taxonomy, because taxonkit provides a more complete taxonomy with fewer gaps. The cost is an
additional compute step, but it doesn't take much time because it's just parsing text files.
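At the command line, the underlying lookup amounts to this (the pipeline calls it via pytaxonkit; taxid 9606 is only an example, taxonkit needs the NCBI taxonomy data files installed, and the call is guarded so it is a no-op without them):

```shell
TAXID=9606   # example NCBI taxid (Homo sapiens)
if command -v taxonkit >/dev/null; then
  echo "$TAXID" | taxonkit lineage   # prints the taxid and its full lineage
fi
echo "looked up taxid $TAXID"
```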
sbatch --dependency=afterok:$MMSEQSJOB $SRC/mmseqs_summarise_taxonomy.slurm

- Add the subsystems to the taxonomy
We love the BV-BRC subsystems (disclaimer: we wrote many of them!), and it is, to our
knowledge, still the best functional hierarchy for metagenomics.
We use an sqlite database to store the subsystems, and then we add those to the taxonomy. This also
provides us with a per-taxonomy functional summary, which is useful for downstream analysis.
We then also count the subsystems and summarise them as raw or normalised counts.
SSJOB=$(sbatch --parsable --dependency=afterok:$MMSEQSJOB --array=1-$NUM_R1_READS:1 $SRC/mmseqs_add_subsystems_taxonomy_fast.slurm)
COUNTSSJOB=$(sbatch --parsable --dependency=afterok:$SSJOB $SRC/count_subsystems.slurm)

- Run megahit
We use megahit for the assembly, and we assemble each of the runs independently, because we have found that trying to do cross-assemblies quickly runs out of memory!
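Per sample, the assembly amounts to something like this (a sketch; the real parameters are in megahit.slurm, the file and directory names are illustrative and follow the DEFINITIONS examples above, and the call is guarded so it is a no-op where megahit is not installed):

```shell
OUTDIR=megahit/sample   # one output directory per sample, not a cross-assembly
if command -v megahit >/dev/null; then
  megahit -1 no_human/sample_R1.fastq.gz -2 no_human/sample_R2.fastq.gz -o "$OUTDIR"
fi
echo "assembly output in $OUTDIR"
```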
MEGAHITJOB=$(sbatch --parsable --dependency=afterok:$HOSTJOB --array=1-$NUM_R1_READS:1 $SRC/megahit.slurm)

We prefer VAMB for our binning, especially with the new version that allows you to split contigs.
- Combine assemblies for VAMB
Note: this takes a few (<10) minutes, and so I run it on the short queue
VCJOB=$(sbatch --parsable --dependency=afterok:$MEGAHITJOB $SRC/vamb_concat.slurm)

- Map vamb reads
VMJOB=$(sbatch --parsable --dependency=afterok:$VCJOB --array=1-$NUM_R1_READS:1 $SRC/vamb_minimap.slurm)

- Run VAMB
VAMBJOB=$(sbatch --parsable --dependency=afterok:$VMJOB $SRC/vamb.slurm)

- Run checkm
# CHECKMJOB=$(sbatch --parsable --dependency=afterany:$VAMBJOB $SRC/checkm.slurm vamb/bins/ vamb/checkm)
CHECKMUNSPLITJOB=$(sbatch --parsable --dependency=afterany:$VAMBJOB $SRC/checkm.slurm vamb/clusters_unsplit vamb/checkm_unsplit)
CHECKMSPLITJOB=$(sbatch --parsable --dependency=afterany:$VAMBJOB $SRC/checkm.slurm vamb/clusters_split vamb/checkm_split)

There comes a time when we want to run vamb on a subset of reads, and so we have a grouped version of vamb.
We use a simple .tsv file to designate the groups, and then we run each of the components of vamb on that group.
For example, if we have a .tsv file with this data:
Host | Sample
--- | ---
Angelshark | GSV317_
Angelshark | GSV326_
Angelshark | GSV329_
Angelshark | GSV342_
Eagle_Ray | ER128RecapASuck_
Eagle_Ray | ER151Suck_
Eagle_Ray | ER247Suck_
We will separate GSV317, GSV326, GSV329, and GSV342 into Angelshark and ER128RecapASuck, ER151Suck, and ER247Suck into Eagle_Ray directories, and then run vamb on them separately.
The main trick here is to make it so that the sample names only give 0 or 1 results when grepping through R1_reads.txt, which is why I append the _ to these names.
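A small demonstration of why the trailing underscore matters (sample names here are made up, and a stand-in file is used so the sketch doesn't touch a real R1_reads.txt):

```shell
# two hypothetical samples whose names share a prefix
printf 'GSV31_R1.fastq.gz\nGSV317_R1.fastq.gz\n' > demo_reads.txt

AMBIGUOUS=$(grep -c 'GSV31'  demo_reads.txt)  # matches both lines
EXACT=$(grep -c 'GSV31_' demo_reads.txt)      # the trailing _ pins it to one sample
echo "without underscore: $AMBIGUOUS match(es); with underscore: $EXACT match(es)"
```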
sbatch $SRC/vamb_concat_groups.slurm groups_sample.tsv
sbatch --array=1-$NUM_R1_READS:1 $SRC/vamb_minimap_group.slurm
sbatch $SRC/vamb_group.slurm

All three lines in one go:
VCONCAT=$(sbatch --parsable $SRC/vamb_concat_groups.slurm groups_sample.tsv)
VMAP=$(sbatch --parsable --dependency=afterok:$VCONCAT --array=1-$NUM_R1_READS:1 $SRC/vamb_minimap_group.slurm)
sbatch --parsable --dependency=afterok:$VMAP $SRC/vamb_group.slurm

We don't really recommend copying and pasting all the steps at once, although we usually do this anyway because we are lazy. Depending on the nuances of your system, you will either end up with a bunch of jobs that can't proceed because of unmet dependencies, or the jobs will be deleted from the queue if a prior dependency fails.
One of the big advantages of atavide_lite is that you can process a single file, replace that output, and then restart at any step without worrying about timestamps (like snakemake does!).
mkdir -p slurm_output/host_slurm slurm_output/megahit_slurm slurm_output/mmseqs_slurm slurm_output/vamb_slurm slurm_output/fastp_slurm
find fastq -name \*_R1\* -printf "%f\n" > R1_reads.txt
export NUM_R1_READS=$(wc -l R1_reads.txt | cut -f 1 -d ' '); echo "There are $NUM_R1_READS reads"
SRC=~/atavide_lite/deepthought_shortread
cp $SRC/DEFINITIONS.sh .
# edit the DEFINITIONS file to change the sample name
JOB=$(sbatch --parsable --array=1-$NUM_R1_READS:1 $SRC/fastp.slurm)
HOSTJOB=$(sbatch --parsable --array=1-$NUM_R1_READS:1 --dependency=afterok:$JOB $SRC/host_removal.slurm)
FAJOB=$(sbatch --parsable --dependency=afterok:$HOSTJOB $SRC/fastq2fasta.slurm)
MMSEQSJOB=$(sbatch --parsable --array=1-$NUM_R1_READS:1 --dependency=afterok:$FAJOB $SRC/mmseqs_easy_taxonomy.slurm)
sbatch --dependency=afterok:$MMSEQSJOB $SRC/mmseqs_summarise_taxonomy.slurm
SSJOB=$(sbatch --parsable --dependency=afterok:$MMSEQSJOB --array=1-$NUM_R1_READS:1 $SRC/mmseqs_add_subsystems_taxonomy_fast.slurm)
sbatch --dependency=afterok:$SSJOB $SRC/count_subsystems.slurm
MEGAHITJOB=$(sbatch --parsable --dependency=afterok:$HOSTJOB --array=1-$NUM_R1_READS:1 $SRC/megahit.slurm)
VCJOB=$(sbatch --parsable --dependency=afterok:$MEGAHITJOB $SRC/vamb_concat.slurm)
VMJOB=$(sbatch --parsable --dependency=afterok:$VCJOB --array=1-$NUM_R1_READS:1 $SRC/vamb_minimap.slurm)
VAMBJOB=$(sbatch --parsable --dependency=afterany:$VMJOB $SRC/vamb.slurm)
VAMBUNSPLIT=$(sbatch --parsable --dependency=afterany:$VAMBJOB $SRC/vamb_split.slurm vamb/contigs.fna.gz vamb/vae_clusters_unsplit.tsv vamb/clusters_unsplit)
VAMBSPLIT=$(sbatch --parsable --dependency=afterany:$VAMBJOB $SRC/vamb_split.slurm vamb/contigs.fna.gz vamb/vae_clusters_split.tsv vamb/clusters_split)
# CHECKMJOB=$(sbatch --parsable --dependency=afterany:$VAMBJOB $SRC/checkm.slurm vamb/bins/ vamb/checkm)
CHECKMUNSPLITJOB=$(sbatch --parsable --dependency=afterany:$VAMBJOB $SRC/checkm.slurm vamb/clusters_unsplit vamb/checkm_unsplit)
CHECKMSPLITJOB=$(sbatch --parsable --dependency=afterany:$VAMBJOB $SRC/checkm.slurm vamb/clusters_split vamb/checkm_split)
We have some optional steps that we run on some data, mainly when we run on data we download from the SRA.
We have a 16S detection pipeline that uses the 16SMicrobial.fasta.gz from partie to determine whether a run from the SRA is a 16S sample or a metagenome.
sbatch --parsable --export=ATAVIDE_CONDA=$ATAVIDE_CONDA $PAWSEY_SRC/16S_detection_single.slurm

We love how Sankey plots summarise the data.
We can make the plots using our sankey code once we have generated all the data.
SANKEYJOB=$(sbatch --parsable --dependency=afterok:$COUNTSSJOB $SRC/sankey_plot.slurm)