TaFeE — Taxonomic Feature Extraction from Microbiome Sequencing Data

A Snakemake workflow for quality control, taxonomic profiling, and functional annotation of biological sequence data (RE-RRSeq and WGS).

Overview

TaFeE provides modular Snakemake workflows for processing microbiome sequencing data from raw reads through to taxonomic count matrices and functional profiles. Three general-purpose workflow variants are included:

Workflow	Snakefile	Input	Description
RE-RRSeq-TaFeE	`workflow/RE-RRSeq-TaFeE.smk`	Demultiplexed RE-RRSeq single-end reads	QC, read masking, host/rRNA decontamination, Kraken 2 taxonomy, and taxpasta count matrices for reduced-representation sequencing libraries
TaFeE-prepare	`workflow/TaFeE-prepare.smk`	Paired-end WGS FASTQ	QC and host/rRNA decontamination of paired-end shotgun reads
TaFeE-profile	`workflow/TaFeE-profile.smk`	KneadData-cleaned paired-end reads (output of TaFeE-prepare)	Kraken 2 taxonomic profiling, taxpasta count matrices, host filtering, and HUMAnN 3 functional profiling

An additional helper workflow handles RE-RRSeq demultiplexing:

Workflow	Snakefile	Description
RE-RRSeq-TaFeE-demux	`workflow/RE-RRSeq-TaFeE-demux.smk`	Demultiplex RE-RRSeq libraries with cutadapt using barcodes

Dependencies

Snakemake & Conda

All workflows require Snakemake ≥ 8 with the cluster-generic executor plugin. A ready-made Conda environment specification is provided:

# workflow/env/conda.yaml
dependencies:
  - conda
  - snakedeploy
  - snakefmt
  - snakemake>=8
  - snakemake-executor-plugin-cluster-generic

Create and activate the environment:

conda env create -f workflow/env/conda.yaml -n smk
conda activate smk

Bioinformatics Tool Environments

Each processing step uses its own Conda environment that Snakemake creates automatically via --use-conda. The environment YAML files are located in workflow/env/:

Environment	File	Key tools
BBMap 39.01	`env/bbmap-39.01.yaml`	`bbduk.sh` — adapter trimming, quality filtering, entropy masking
PRINSEQ++ 1.2.4	`env/prinseq-plus-plus-1.2.4.yaml`	`prinseq++` — low-complexity filtering
KneadData 0.12	`env/kneaddata-0.12.yaml`	`kneaddata` — Trimmomatic trimming, TRF repeat removal, Bowtie 2 rRNA decontamination
SeqKit 2.4	`env/seqkit-2.4.yaml`	`seqkit` — FASTQ statistics and read-pair repair
pigz 2.6	`env/pigz-2.6.yaml`	`pigz` — parallel gzip compression
Kraken 2 2.1.3	`env/kraken2-2.1.3.yaml`	`kraken2` — taxonomic classification
taxpasta 0.7.0	`env/taxpasta-0.7.0.yaml`	`taxpasta` — standardised taxonomic count matrix generation
HUMAnN 3.8	`env/human-3.8.yaml`	`humann3` — functional profiling (UniRef50 + MetaCyc)
cutadapt 4.4	`env/cutadapt-4.4.yaml`	`cutadapt` — RE-RRSeq demultiplexing

Reference Databases

The following databases must be downloaded or built and their paths set in config/config.yaml:

Config key	Description
`SILVA`	SILVA 138.2 rRNA Bowtie 2 index directory for KneadData rRNA decontamination
`K2INDEX`	Kraken 2 index directory (must contain `hash.k2d`, `opts.k2d`, `taxo.k2d`)
`TAXONOMY`	Directory containing `names.dmp` and `nodes.dmp` for the Kraken 2 index (used by taxpasta)
`HOSTS`	Kraken 2 index of host genomes for host-read filtering (TaFeE-profile only)
`UNIPROTDB`	HUMAnN 3 UniRef50/EC-filtered protein database directory (TaFeE-profile only)
`TRIMMOMATIC`	Path to a Trimmomatic installation (e.g. `~/.conda/envs/kneaddata/share/trimmomatic-0.39-2`)
`LINKFARM`	Directory containing raw FASTQ symlinks for RE-RRSeq demultiplexing

Configuration

All pipeline parameters are set in config/config.yaml. Key parameters:

# Sequencing library identifier (RE-RRSeq workflows)
LIBRARY: SQ1040

# Run identifier (WGS workflows)
RUN: wgs

# Symlink farm for raw RE-RRSeq FASTQ files
LINKFARM: /path/to/fastq-link-farm

# Reference database paths (see above)
TRIMMOMATIC: "~/.conda/envs/kneaddata/share/trimmomatic-0.39-2"
SILVA: /path/to/SILVA138.2
K2INDEX: /path/to/kraken2_index
TAXONOMY: /path/to/kraken2_index/taxonomy
HOSTS: /path/to/kraken2-hosts
UNIPROTDB: /path/to/humann3/uniref50ECFilt

A SLURM cluster profile is provided at config/slurm_profiles/eRI/config.yaml for HPC submission.

Usage

All commands are run from the repository root directory.

1. RE-RRSeq Demultiplexing (RE-RRSeq-TaFeE-demux)

Demultiplex a RE-RRSeq library into per-sample FASTQ files using cutadapt:

snakemake --profile config/slurm_profiles/eRI \
  --config LIBRARY=SQ1040 \
  --snakefile workflow/RE-RRSeq-TaFeE-demux.smk \

2. RE-RRSeq QC, Decontamination & Taxonomic Profiling (RE-RRSeq-TaFeE)

End-to-end processing of demultiplexed RE-RRSeq single-end reads — from read masking through Kraken 2 classification and taxpasta count matrices. Samples with fewer than 25,000 raw reads are automatically excluded.

snakemake --profile config/slurm_profiles/eRI \
  --config LIBRARY=SQ1040 \
  --snakefile workflow/RE-RRSeq-TaFeE.smk \

3. WGS Read Preparation (TaFeE-prepare)

QC and decontamination of paired-end WGS reads:

snakemake --profile config/slurm_profiles/eRI \
  --config RUN=wgs \
  --snakefile workflow/TaFeE-prepare.smk \

4. WGS Taxonomic & Functional Profiling (TaFeE-profile)

Kraken 2 classification, taxpasta matrices, host filtering, and HUMAnN 3 functional profiling of prepared paired-end reads:

snakemake --profile config/slurm_profiles/eRI \
  --config RUN=wgs \
  --snakefile workflow/TaFeE-profile.smk \

Dry Run

Append -n to any command to perform a dry run:

snakemake --profile config/slurm_profiles/eRI \
  --config LIBRARY=SQ1040 \
  --snakefile workflow/RE-RRSeq-TaFeE.smk \
  -n

Output Results

Directory Structure

Results are written under results/{LIBRARY}/ (RE-RRSeq) or results/{RUN}/ (WGS).

QC Reports (`00_QC/`)

SeqKit statistics reports are generated at each processing stage to track read counts and quality:

File	Description
`seqkit.report.raw.txt`	Raw input read statistics
`seqkit.report.bbduk.txt`	Post-BBDuk quality/entropy filtering
`seqkit.report.prinseq.txt`	Post-PRINSEQ++ low-complexity filtering
`seqkit.report.KDTrim.txt`	Post-KneadData Trimmomatic trimming
`seqkit.report.KDTRF.txt`	Post-KneadData TRF repeat removal
`seqkit.report.KDSILVA138.txt`	Reads removed by SILVA 138.2 rRNA decontamination
`seqkit.report.KDR.txt`	Final decontaminated reads

Demultiplexed Reads (`01_cutadapt/`)

RE-RRSeq workflows only. Per-sample demultiplexed FASTQ files: {sample}.fastq.gz.

Read Masking (`01_readMasking/`)

Intermediate entropy-filtered (BBDuk) and low-complexity-filtered (PRINSEQ++) reads.

Decontaminated Reads (`02_kneaddata/`)

KneadData output reads after adapter trimming, TRF repeat removal, and SILVA rRNA decontamination:

RE-RRSeq: {sample}.fastq.gz (single-end)
WGS: {sample}.KDR.R1.fastq.gz, {sample}.KDR.R2.fastq.gz (paired-end)

Kraken 2 Classification (`03_kraken2/`)

File	Description
`{sample}.kraken2`	Kraken 2 report (used by taxpasta)
`{sample}.k2.gz`	Compressed per-read classification output
`{sample}.kraken2.classified*.fastq.gz`	Classified reads (used for downstream functional profiling in WGS)

Taxonomic Count Matrices

Taxpasta merges per-sample Kraken 2 reports into standardised count matrices at multiple taxonomic ranks:

RE-RRSeq (RE-RRSeq-TaFeE):

File	Format
`{LIBRARY}.kraken2.domain.counts.tsv`	TSV
`{LIBRARY}.kraken2.phylum.counts.tsv`	TSV
`{LIBRARY}.kraken2.class.counts.tsv`	TSV
`{LIBRARY}.kraken2.order.counts.tsv`	TSV
`{LIBRARY}.kraken2.family.counts.tsv`	TSV
`{LIBRARY}.kraken2.genus.counts.tsv`	TSV
`{LIBRARY}.kraken2.species.counts.tsv`	TSV
`{LIBRARY}.kraken2.genus.counts.biom`	BIOM

WGS (TaFeE-profile):

File	Format
`kraken2.domain.tsv`	TSV
`kraken2.phylum.tsv`	TSV
`kraken2.class.tsv`	TSV
`kraken2.order.tsv`	TSV
`kraken2.family.tsv`	TSV
`kraken2.genus.tsv`	TSV
`kraken2.species.tsv`	TSV
`kraken2.genus.biom`	BIOM

All TSV matrices include taxonomy ID, name, rank, and full lineage columns.

Host/Microbe Classification Proportions (`04_k2_filtering/`)

WGS only. Per-sample summary of read classification:

{sample}.classification.proportions.txt — counts of microbe, host, and unclassified reads

Functional Profiles (`05_functional/`)

WGS only. HUMAnN 3 functional profiling outputs (merged across all samples):

File	Description
`humann3_uniref50EC_microbial_pathabundance.rpk.tsv`	MetaCyc pathway abundance (RPK)
`humann3_uniref50EC_microbial_pathcoverage.rpk.tsv`	MetaCyc pathway coverage
`humann3_uniref50EC_microbial_genefamilies.rpk.tsv`	UniRef50 gene family abundance (RPK)
`humann3_uniref50EC_microbial_genefamilies.rpk.KO.tsv`	Gene families regrouped to KEGG Orthologs
`humann3_uniref50EC_microbial_genefamilies.rpk.EC.tsv`	Gene families regrouped to Enzyme Commission
`humann3_uniref50EC_microbial_genefamilies.rpk.pfam.tsv`	Gene families regrouped to Pfam
`humann3_uniref50EC_microbial_genefamilies.rpk.EggNOG.tsv`	Gene families regrouped to EggNOG
`humann3_uniref50EC_microbial_genefamilies.rpk.cpm.QC.tsv`	Gene families normalised to CPM
`humann3_uniref50EC_microbial_pathabundance.rpk.cpm.QC.tsv`	Pathway abundance normalised to CPM

Benchmarks

Runtime and resource usage for every rule are recorded under results/{LIBRARY or RUN}/benchmarks/.

Workflow DAG

RE-RRSeq (RE-RRSeq-TaFeE)

Raw RE-RRSeq FASTQ
  → cutadapt (demultiplex)
    → BBDuk (entropy/quality filter)
      → PRINSEQ++ (low-complexity filter)
        → KneadData (Trimmomatic + TRF + SILVA rRNA removal)
          → Kraken 2 (taxonomic classification)
            → taxpasta (count matrices at domain → species)

WGS (TaFeE-prepare → TaFeE-profile)

Raw paired-end FASTQ
  → BBDuk (entropy/quality filter)
    → PRINSEQ++ (low-complexity filter)
      → KneadData R1/R2 (Trimmomatic + TRF + SILVA rRNA removal)
        → SeqKit pair (read-pair repair)
          → Kraken 2 (taxonomic classification)
            → taxpasta (count matrices at domain → species)
            → Kraken 2 host filter
              → HUMAnN 3 (functional profiling)
                → Regroup (KO, EC, Pfam, EggNOG)
                → Normalise (CPM)

Name		Name	Last commit message	Last commit date
Latest commit History 209 Commits
config		config
fastq		fastq
resources		resources
workflow		workflow
.gitignore		.gitignore
LICENSE		LICENSE
README.md		README.md

Folders and files

Latest commit

History

Repository files navigation

TaFeE — Taxonomic Feature Extraction from Microbiome Sequencing Data

Overview

Dependencies

Snakemake & Conda

Bioinformatics Tool Environments

Reference Databases

Configuration

Usage

1. RE-RRSeq Demultiplexing (RE-RRSeq-TaFeE-demux)

2. RE-RRSeq QC, Decontamination & Taxonomic Profiling (RE-RRSeq-TaFeE)

3. WGS Read Preparation (TaFeE-prepare)

4. WGS Taxonomic & Functional Profiling (TaFeE-profile)

Dry Run

Output Results

Directory Structure

QC Reports (00_QC/)

Demultiplexed Reads (01_cutadapt/)

Read Masking (01_readMasking/)

Decontaminated Reads (02_kneaddata/)

Kraken 2 Classification (03_kraken2/)

Taxonomic Count Matrices

Host/Microbe Classification Proportions (04_k2_filtering/)

Functional Profiles (05_functional/)

Benchmarks

Workflow DAG

RE-RRSeq (RE-RRSeq-TaFeE)

WGS (TaFeE-prepare → TaFeE-profile)

License

About

Resources

License

Uh oh!

Stars

Watchers

Forks

Releases

Packages 0

Uh oh!

Uh oh!

Contributors

Uh oh!

Languages

QC Reports (`00_QC/`)

Demultiplexed Reads (`01_cutadapt/`)

Read Masking (`01_readMasking/`)

Decontaminated Reads (`02_kneaddata/`)

Kraken 2 Classification (`03_kraken2/`)

Host/Microbe Classification Proportions (`04_k2_filtering/`)

Functional Profiles (`05_functional/`)

Packages