Skip to content

AgResearch/TaFeE

Folders and files

NameName
Last commit message
Last commit date

Latest commit

 

History

209 Commits
 
 
 
 
 
 
 
 
 
 
 
 
 
 

Repository files navigation

TaFeE Banner

TaFeE — Taxonomic Feature Extraction from Microbiome Sequencing Data

A Snakemake workflow for quality control, taxonomic profiling, and functional annotation of biological sequence data (RE-RRSeq and WGS).


Overview

TaFeE provides modular Snakemake workflows for processing microbiome sequencing data from raw reads through to taxonomic count matrices and functional profiles. Three general-purpose workflow variants are included:

Workflow Snakefile Input Description
RE-RRSeq-TaFeE workflow/RE-RRSeq-TaFeE.smk Demultiplexed RE-RRSeq single-end reads QC, read masking, host/rRNA decontamination, Kraken 2 taxonomy, and taxpasta count matrices for reduced-representation sequencing libraries
TaFeE-prepare workflow/TaFeE-prepare.smk Paired-end WGS FASTQ QC and host/rRNA decontamination of paired-end shotgun reads
TaFeE-profile workflow/TaFeE-profile.smk KneadData-cleaned paired-end reads (output of TaFeE-prepare) Kraken 2 taxonomic profiling, taxpasta count matrices, host filtering, and HUMAnN 3 functional profiling

An additional helper workflow handles RE-RRSeq demultiplexing:

Workflow Snakefile Description
RE-RRSeq-TaFeE-demux workflow/RE-RRSeq-TaFeE-demux.smk Demultiplex RE-RRSeq libraries with cutadapt using barcodes

Dependencies

Snakemake & Conda

All workflows require Snakemake ≥ 8 with the cluster-generic executor plugin. A ready-made Conda environment specification is provided:

# workflow/env/conda.yaml
dependencies:
  - conda
  - snakedeploy
  - snakefmt
  - snakemake>=8
  - snakemake-executor-plugin-cluster-generic

Create and activate the environment:

conda env create -f workflow/env/conda.yaml -n smk
conda activate smk

Bioinformatics Tool Environments

Each processing step uses its own Conda environment that Snakemake creates automatically via --use-conda. The environment YAML files are located in workflow/env/:

Environment File Key tools
BBMap 39.01 env/bbmap-39.01.yaml bbduk.sh — adapter trimming, quality filtering, entropy masking
PRINSEQ++ 1.2.4 env/prinseq-plus-plus-1.2.4.yaml prinseq++ — low-complexity filtering
KneadData 0.12 env/kneaddata-0.12.yaml kneaddata — Trimmomatic trimming, TRF repeat removal, Bowtie 2 rRNA decontamination
SeqKit 2.4 env/seqkit-2.4.yaml seqkit — FASTQ statistics and read-pair repair
pigz 2.6 env/pigz-2.6.yaml pigz — parallel gzip compression
Kraken 2 2.1.3 env/kraken2-2.1.3.yaml kraken2 — taxonomic classification
taxpasta 0.7.0 env/taxpasta-0.7.0.yaml taxpasta — standardised taxonomic count matrix generation
HUMAnN 3.8 env/human-3.8.yaml humann3 — functional profiling (UniRef50 + MetaCyc)
cutadapt 4.4 env/cutadapt-4.4.yaml cutadapt — RE-RRSeq demultiplexing

Reference Databases

The following databases must be downloaded or built and their paths set in config/config.yaml:

Config key Description
SILVA SILVA 138.2 rRNA Bowtie 2 index directory for KneadData rRNA decontamination
K2INDEX Kraken 2 index directory (must contain hash.k2d, opts.k2d, taxo.k2d)
TAXONOMY Directory containing names.dmp and nodes.dmp for the Kraken 2 index (used by taxpasta)
HOSTS Kraken 2 index of host genomes for host-read filtering (TaFeE-profile only)
UNIPROTDB HUMAnN 3 UniRef50/EC-filtered protein database directory (TaFeE-profile only)
TRIMMOMATIC Path to a Trimmomatic installation (e.g. ~/.conda/envs/kneaddata/share/trimmomatic-0.39-2)
LINKFARM Directory containing raw FASTQ symlinks for RE-RRSeq demultiplexing

Configuration

All pipeline parameters are set in config/config.yaml. Key parameters:

# Sequencing library identifier (RE-RRSeq workflows)
LIBRARY: SQ1040

# Run identifier (WGS workflows)
RUN: wgs

# Symlink farm for raw RE-RRSeq FASTQ files
LINKFARM: /path/to/fastq-link-farm

# Reference database paths (see above)
TRIMMOMATIC: "~/.conda/envs/kneaddata/share/trimmomatic-0.39-2"
SILVA: /path/to/SILVA138.2
K2INDEX: /path/to/kraken2_index
TAXONOMY: /path/to/kraken2_index/taxonomy
HOSTS: /path/to/kraken2-hosts
UNIPROTDB: /path/to/humann3/uniref50ECFilt

A SLURM cluster profile is provided at config/slurm_profiles/eRI/config.yaml for HPC submission.


Usage

All commands are run from the repository root directory.

1. RE-RRSeq Demultiplexing (RE-RRSeq-TaFeE-demux)

Demultiplex a RE-RRSeq library into per-sample FASTQ files using cutadapt:

snakemake --profile config/slurm_profiles/eRI \
  --config LIBRARY=SQ1040 \
  --snakefile workflow/RE-RRSeq-TaFeE-demux.smk \

2. RE-RRSeq QC, Decontamination & Taxonomic Profiling (RE-RRSeq-TaFeE)

End-to-end processing of demultiplexed RE-RRSeq single-end reads — from read masking through Kraken 2 classification and taxpasta count matrices. Samples with fewer than 25,000 raw reads are automatically excluded.

snakemake --profile config/slurm_profiles/eRI \
  --config LIBRARY=SQ1040 \
  --snakefile workflow/RE-RRSeq-TaFeE.smk \

3. WGS Read Preparation (TaFeE-prepare)

QC and decontamination of paired-end WGS reads:

snakemake --profile config/slurm_profiles/eRI \
  --config RUN=wgs \
  --snakefile workflow/TaFeE-prepare.smk \

4. WGS Taxonomic & Functional Profiling (TaFeE-profile)

Kraken 2 classification, taxpasta matrices, host filtering, and HUMAnN 3 functional profiling of prepared paired-end reads:

snakemake --profile config/slurm_profiles/eRI \
  --config RUN=wgs \
  --snakefile workflow/TaFeE-profile.smk \

Dry Run

Append -n to any command to perform a dry run:

snakemake --profile config/slurm_profiles/eRI \
  --config LIBRARY=SQ1040 \
  --snakefile workflow/RE-RRSeq-TaFeE.smk \
  -n

Output Results

Directory Structure

Results are written under results/{LIBRARY}/ (RE-RRSeq) or results/{RUN}/ (WGS).

QC Reports (00_QC/)

SeqKit statistics reports are generated at each processing stage to track read counts and quality:

File Description
seqkit.report.raw.txt Raw input read statistics
seqkit.report.bbduk.txt Post-BBDuk quality/entropy filtering
seqkit.report.prinseq.txt Post-PRINSEQ++ low-complexity filtering
seqkit.report.KDTrim.txt Post-KneadData Trimmomatic trimming
seqkit.report.KDTRF.txt Post-KneadData TRF repeat removal
seqkit.report.KDSILVA138.txt Reads removed by SILVA 138.2 rRNA decontamination
seqkit.report.KDR.txt Final decontaminated reads

Demultiplexed Reads (01_cutadapt/)

RE-RRSeq workflows only. Per-sample demultiplexed FASTQ files: {sample}.fastq.gz.

Read Masking (01_readMasking/)

Intermediate entropy-filtered (BBDuk) and low-complexity-filtered (PRINSEQ++) reads.

Decontaminated Reads (02_kneaddata/)

KneadData output reads after adapter trimming, TRF repeat removal, and SILVA rRNA decontamination:

  • RE-RRSeq: {sample}.fastq.gz (single-end)
  • WGS: {sample}.KDR.R1.fastq.gz, {sample}.KDR.R2.fastq.gz (paired-end)

Kraken 2 Classification (03_kraken2/)

File Description
{sample}.kraken2 Kraken 2 report (used by taxpasta)
{sample}.k2.gz Compressed per-read classification output
{sample}.kraken2.classified*.fastq.gz Classified reads (used for downstream functional profiling in WGS)

Taxonomic Count Matrices

Taxpasta merges per-sample Kraken 2 reports into standardised count matrices at multiple taxonomic ranks:

RE-RRSeq (RE-RRSeq-TaFeE):

File Format
{LIBRARY}.kraken2.domain.counts.tsv TSV
{LIBRARY}.kraken2.phylum.counts.tsv TSV
{LIBRARY}.kraken2.class.counts.tsv TSV
{LIBRARY}.kraken2.order.counts.tsv TSV
{LIBRARY}.kraken2.family.counts.tsv TSV
{LIBRARY}.kraken2.genus.counts.tsv TSV
{LIBRARY}.kraken2.species.counts.tsv TSV
{LIBRARY}.kraken2.genus.counts.biom BIOM

WGS (TaFeE-profile):

File Format
kraken2.domain.tsv TSV
kraken2.phylum.tsv TSV
kraken2.class.tsv TSV
kraken2.order.tsv TSV
kraken2.family.tsv TSV
kraken2.genus.tsv TSV
kraken2.species.tsv TSV
kraken2.genus.biom BIOM

All TSV matrices include taxonomy ID, name, rank, and full lineage columns.

Host/Microbe Classification Proportions (04_k2_filtering/)

WGS only. Per-sample summary of read classification:

  • {sample}.classification.proportions.txt — counts of microbe, host, and unclassified reads

Functional Profiles (05_functional/)

WGS only. HUMAnN 3 functional profiling outputs (merged across all samples):

File Description
humann3_uniref50EC_microbial_pathabundance.rpk.tsv MetaCyc pathway abundance (RPK)
humann3_uniref50EC_microbial_pathcoverage.rpk.tsv MetaCyc pathway coverage
humann3_uniref50EC_microbial_genefamilies.rpk.tsv UniRef50 gene family abundance (RPK)
humann3_uniref50EC_microbial_genefamilies.rpk.KO.tsv Gene families regrouped to KEGG Orthologs
humann3_uniref50EC_microbial_genefamilies.rpk.EC.tsv Gene families regrouped to Enzyme Commission
humann3_uniref50EC_microbial_genefamilies.rpk.pfam.tsv Gene families regrouped to Pfam
humann3_uniref50EC_microbial_genefamilies.rpk.EggNOG.tsv Gene families regrouped to EggNOG
humann3_uniref50EC_microbial_genefamilies.rpk.cpm.QC.tsv Gene families normalised to CPM
humann3_uniref50EC_microbial_pathabundance.rpk.cpm.QC.tsv Pathway abundance normalised to CPM

Benchmarks

Runtime and resource usage for every rule are recorded under results/{LIBRARY or RUN}/benchmarks/.


Workflow DAG

RE-RRSeq (RE-RRSeq-TaFeE)

Raw RE-RRSeq FASTQ
  → cutadapt (demultiplex)
    → BBDuk (entropy/quality filter)
      → PRINSEQ++ (low-complexity filter)
        → KneadData (Trimmomatic + TRF + SILVA rRNA removal)
          → Kraken 2 (taxonomic classification)
            → taxpasta (count matrices at domain → species)

WGS (TaFeE-prepare → TaFeE-profile)

Raw paired-end FASTQ
  → BBDuk (entropy/quality filter)
    → PRINSEQ++ (low-complexity filter)
      → KneadData R1/R2 (Trimmomatic + TRF + SILVA rRNA removal)
        → SeqKit pair (read-pair repair)
          → Kraken 2 (taxonomic classification)
            → taxpasta (count matrices at domain → species)
            → Kraken 2 host filter
              → HUMAnN 3 (functional profiling)
                → Regroup (KO, EC, Pfam, EggNOG)
                → Normalise (CPM)

License

MIT License. Copyright (c) 2023-2024 Benjamin J Perry.

About

snakemake workflow for taxonomic profiling of biological sequence data

Resources

License

Stars

Watchers

Forks

Releases

No releases published

Packages

 
 
 

Contributors