A Snakemake workflow for quality control, taxonomic profiling, and functional annotation of biological sequence data (RE-RRSeq and WGS).
TaFeE provides modular Snakemake workflows for processing microbiome sequencing data from raw reads through to taxonomic count matrices and functional profiles. Three general-purpose workflow variants are included:
| Workflow | Snakefile | Input | Description |
|---|---|---|---|
| RE-RRSeq-TaFeE | workflow/RE-RRSeq-TaFeE.smk |
Demultiplexed RE-RRSeq single-end reads | QC, read masking, host/rRNA decontamination, Kraken 2 taxonomy, and taxpasta count matrices for reduced-representation sequencing libraries |
| TaFeE-prepare | workflow/TaFeE-prepare.smk |
Paired-end WGS FASTQ | QC and host/rRNA decontamination of paired-end shotgun reads |
| TaFeE-profile | workflow/TaFeE-profile.smk |
KneadData-cleaned paired-end reads (output of TaFeE-prepare) | Kraken 2 taxonomic profiling, taxpasta count matrices, host filtering, and HUMAnN 3 functional profiling |
An additional helper workflow handles RE-RRSeq demultiplexing:
| Workflow | Snakefile | Description |
|---|---|---|
| RE-RRSeq-TaFeE-demux | workflow/RE-RRSeq-TaFeE-demux.smk |
Demultiplex RE-RRSeq libraries with cutadapt using barcodes |
All workflows require Snakemake ≥ 8 with the cluster-generic executor plugin. A ready-made Conda environment specification is provided:
# workflow/env/conda.yaml
dependencies:
- conda
- snakedeploy
- snakefmt
- snakemake>=8
- snakemake-executor-plugin-cluster-genericCreate and activate the environment:
conda env create -f workflow/env/conda.yaml -n smk
conda activate smkEach processing step uses its own Conda environment that Snakemake creates automatically via --use-conda. The environment YAML files are located in workflow/env/:
| Environment | File | Key tools |
|---|---|---|
| BBMap 39.01 | env/bbmap-39.01.yaml |
bbduk.sh — adapter trimming, quality filtering, entropy masking |
| PRINSEQ++ 1.2.4 | env/prinseq-plus-plus-1.2.4.yaml |
prinseq++ — low-complexity filtering |
| KneadData 0.12 | env/kneaddata-0.12.yaml |
kneaddata — Trimmomatic trimming, TRF repeat removal, Bowtie 2 rRNA decontamination |
| SeqKit 2.4 | env/seqkit-2.4.yaml |
seqkit — FASTQ statistics and read-pair repair |
| pigz 2.6 | env/pigz-2.6.yaml |
pigz — parallel gzip compression |
| Kraken 2 2.1.3 | env/kraken2-2.1.3.yaml |
kraken2 — taxonomic classification |
| taxpasta 0.7.0 | env/taxpasta-0.7.0.yaml |
taxpasta — standardised taxonomic count matrix generation |
| HUMAnN 3.8 | env/human-3.8.yaml |
humann3 — functional profiling (UniRef50 + MetaCyc) |
| cutadapt 4.4 | env/cutadapt-4.4.yaml |
cutadapt — RE-RRSeq demultiplexing |
The following databases must be downloaded or built and their paths set in config/config.yaml:
| Config key | Description |
|---|---|
SILVA |
SILVA 138.2 rRNA Bowtie 2 index directory for KneadData rRNA decontamination |
K2INDEX |
Kraken 2 index directory (must contain hash.k2d, opts.k2d, taxo.k2d) |
TAXONOMY |
Directory containing names.dmp and nodes.dmp for the Kraken 2 index (used by taxpasta) |
HOSTS |
Kraken 2 index of host genomes for host-read filtering (TaFeE-profile only) |
UNIPROTDB |
HUMAnN 3 UniRef50/EC-filtered protein database directory (TaFeE-profile only) |
TRIMMOMATIC |
Path to a Trimmomatic installation (e.g. ~/.conda/envs/kneaddata/share/trimmomatic-0.39-2) |
LINKFARM |
Directory containing raw FASTQ symlinks for RE-RRSeq demultiplexing |
All pipeline parameters are set in config/config.yaml. Key parameters:
# Sequencing library identifier (RE-RRSeq workflows)
LIBRARY: SQ1040
# Run identifier (WGS workflows)
RUN: wgs
# Symlink farm for raw RE-RRSeq FASTQ files
LINKFARM: /path/to/fastq-link-farm
# Reference database paths (see above)
TRIMMOMATIC: "~/.conda/envs/kneaddata/share/trimmomatic-0.39-2"
SILVA: /path/to/SILVA138.2
K2INDEX: /path/to/kraken2_index
TAXONOMY: /path/to/kraken2_index/taxonomy
HOSTS: /path/to/kraken2-hosts
UNIPROTDB: /path/to/humann3/uniref50ECFiltA SLURM cluster profile is provided at config/slurm_profiles/eRI/config.yaml for HPC submission.
All commands are run from the repository root directory.
Demultiplex a RE-RRSeq library into per-sample FASTQ files using cutadapt:
snakemake --profile config/slurm_profiles/eRI \
--config LIBRARY=SQ1040 \
--snakefile workflow/RE-RRSeq-TaFeE-demux.smk \End-to-end processing of demultiplexed RE-RRSeq single-end reads — from read masking through Kraken 2 classification and taxpasta count matrices. Samples with fewer than 25,000 raw reads are automatically excluded.
snakemake --profile config/slurm_profiles/eRI \
--config LIBRARY=SQ1040 \
--snakefile workflow/RE-RRSeq-TaFeE.smk \QC and decontamination of paired-end WGS reads:
snakemake --profile config/slurm_profiles/eRI \
--config RUN=wgs \
--snakefile workflow/TaFeE-prepare.smk \Kraken 2 classification, taxpasta matrices, host filtering, and HUMAnN 3 functional profiling of prepared paired-end reads:
snakemake --profile config/slurm_profiles/eRI \
--config RUN=wgs \
--snakefile workflow/TaFeE-profile.smk \Append -n to any command to perform a dry run:
snakemake --profile config/slurm_profiles/eRI \
--config LIBRARY=SQ1040 \
--snakefile workflow/RE-RRSeq-TaFeE.smk \
-nResults are written under results/{LIBRARY}/ (RE-RRSeq) or results/{RUN}/ (WGS).
SeqKit statistics reports are generated at each processing stage to track read counts and quality:
| File | Description |
|---|---|
seqkit.report.raw.txt |
Raw input read statistics |
seqkit.report.bbduk.txt |
Post-BBDuk quality/entropy filtering |
seqkit.report.prinseq.txt |
Post-PRINSEQ++ low-complexity filtering |
seqkit.report.KDTrim.txt |
Post-KneadData Trimmomatic trimming |
seqkit.report.KDTRF.txt |
Post-KneadData TRF repeat removal |
seqkit.report.KDSILVA138.txt |
Reads removed by SILVA 138.2 rRNA decontamination |
seqkit.report.KDR.txt |
Final decontaminated reads |
RE-RRSeq workflows only. Per-sample demultiplexed FASTQ files: {sample}.fastq.gz.
Intermediate entropy-filtered (BBDuk) and low-complexity-filtered (PRINSEQ++) reads.
KneadData output reads after adapter trimming, TRF repeat removal, and SILVA rRNA decontamination:
- RE-RRSeq:
{sample}.fastq.gz(single-end) - WGS:
{sample}.KDR.R1.fastq.gz,{sample}.KDR.R2.fastq.gz(paired-end)
| File | Description |
|---|---|
{sample}.kraken2 |
Kraken 2 report (used by taxpasta) |
{sample}.k2.gz |
Compressed per-read classification output |
{sample}.kraken2.classified*.fastq.gz |
Classified reads (used for downstream functional profiling in WGS) |
Taxpasta merges per-sample Kraken 2 reports into standardised count matrices at multiple taxonomic ranks:
RE-RRSeq (RE-RRSeq-TaFeE):
| File | Format |
|---|---|
{LIBRARY}.kraken2.domain.counts.tsv |
TSV |
{LIBRARY}.kraken2.phylum.counts.tsv |
TSV |
{LIBRARY}.kraken2.class.counts.tsv |
TSV |
{LIBRARY}.kraken2.order.counts.tsv |
TSV |
{LIBRARY}.kraken2.family.counts.tsv |
TSV |
{LIBRARY}.kraken2.genus.counts.tsv |
TSV |
{LIBRARY}.kraken2.species.counts.tsv |
TSV |
{LIBRARY}.kraken2.genus.counts.biom |
BIOM |
WGS (TaFeE-profile):
| File | Format |
|---|---|
kraken2.domain.tsv |
TSV |
kraken2.phylum.tsv |
TSV |
kraken2.class.tsv |
TSV |
kraken2.order.tsv |
TSV |
kraken2.family.tsv |
TSV |
kraken2.genus.tsv |
TSV |
kraken2.species.tsv |
TSV |
kraken2.genus.biom |
BIOM |
All TSV matrices include taxonomy ID, name, rank, and full lineage columns.
WGS only. Per-sample summary of read classification:
{sample}.classification.proportions.txt— counts of microbe, host, and unclassified reads
WGS only. HUMAnN 3 functional profiling outputs (merged across all samples):
| File | Description |
|---|---|
humann3_uniref50EC_microbial_pathabundance.rpk.tsv |
MetaCyc pathway abundance (RPK) |
humann3_uniref50EC_microbial_pathcoverage.rpk.tsv |
MetaCyc pathway coverage |
humann3_uniref50EC_microbial_genefamilies.rpk.tsv |
UniRef50 gene family abundance (RPK) |
humann3_uniref50EC_microbial_genefamilies.rpk.KO.tsv |
Gene families regrouped to KEGG Orthologs |
humann3_uniref50EC_microbial_genefamilies.rpk.EC.tsv |
Gene families regrouped to Enzyme Commission |
humann3_uniref50EC_microbial_genefamilies.rpk.pfam.tsv |
Gene families regrouped to Pfam |
humann3_uniref50EC_microbial_genefamilies.rpk.EggNOG.tsv |
Gene families regrouped to EggNOG |
humann3_uniref50EC_microbial_genefamilies.rpk.cpm.QC.tsv |
Gene families normalised to CPM |
humann3_uniref50EC_microbial_pathabundance.rpk.cpm.QC.tsv |
Pathway abundance normalised to CPM |
Runtime and resource usage for every rule are recorded under results/{LIBRARY or RUN}/benchmarks/.
Raw RE-RRSeq FASTQ
→ cutadapt (demultiplex)
→ BBDuk (entropy/quality filter)
→ PRINSEQ++ (low-complexity filter)
→ KneadData (Trimmomatic + TRF + SILVA rRNA removal)
→ Kraken 2 (taxonomic classification)
→ taxpasta (count matrices at domain → species)
Raw paired-end FASTQ
→ BBDuk (entropy/quality filter)
→ PRINSEQ++ (low-complexity filter)
→ KneadData R1/R2 (Trimmomatic + TRF + SILVA rRNA removal)
→ SeqKit pair (read-pair repair)
→ Kraken 2 (taxonomic classification)
→ taxpasta (count matrices at domain → species)
→ Kraken 2 host filter
→ HUMAnN 3 (functional profiling)
→ Regroup (KO, EC, Pfam, EggNOG)
→ Normalise (CPM)
MIT License. Copyright (c) 2023-2024 Benjamin J Perry.
