
Semantic Versioning · Snakemake · MIT License


pandoomain: the pipe


v0.0.2

I recommend using, and sending patches to, the upstream version at: https://github.com/elbecerrasoto/pandoomain


Description

pandoomain is a Snakemake pipeline designed for:

  • Downloading genomes.
  • Searching proteins using Hidden Markov Models (HMMs).
  • Annotating domains with interproscan.sh.
  • Extracting protein domain architectures.
  • Extracting gene neighborhoods.
  • Adding taxonomic information.

This pipeline helps identify functional and evolutionary patterns by analyzing Protein Domain Architecture and Gene Neighborhood data.

Some biological questions are better approached at the domain level rather than at the raw sequence level. This pipeline extends that idea to entire Gene Neighborhoods.

Domain Representation

pandoomain encodes a domain architecture as a string, offering several advantages:

  • Existing string-distance libraries can be applied directly.
  • Raw tables are easier for humans to inspect.
  • Domain architectures can be aligned.
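
As an illustration of the first point: once architectures are plain strings, a generic edit-distance routine works on them unchanged. The following is a minimal sketch in pure Python; the function name and the example strings are illustrative and are not taken from the pandoomain code base.

```python
def levenshtein(a: str, b: str) -> int:
    """Classic dynamic-programming edit distance between two strings."""
    # previous[j] holds the distance between the processed prefix of `a`
    # and the first j characters of `b`.
    previous = list(range(len(b) + 1))
    for i, char_a in enumerate(a, start=1):
        current = [i]
        for j, char_b in enumerate(b, start=1):
            cost = 0 if char_a == char_b else 1
            current.append(
                min(
                    previous[j] + 1,         # deletion
                    current[j - 1] + 1,      # insertion
                    previous[j - 1] + cost,  # substitution or match
                )
            )
        previous = current
    return previous[-1]


if __name__ == "__main__":
    # Hypothetical encoded architectures: each character stands for one domain.
    arch_a = "abcd"
    arch_b = "abed"
    print(levenshtein(arch_a, arch_b))
```

Any library that accepts Unicode strings could be swapped in for this function; the hand-written version is used here only to keep the example dependency-free.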

The encoding adds 33 to each PFAM ID and treats the result as a Unicode code point.

The reasons for this are:

  • Adding 33 avoids mapping to the leading control and whitespace characters, the same idea as the Phred+33 quality encoding.
  • Unicode can comfortably accommodate all defined PFAMs (~16,000), as it provides 155,063 characters. User-defined HMMs could be assigned a code point greater than 18,000 to comfortably dodge any PFAM ID.
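
The encoding described above can be sketched in a few lines of Python. This is a minimal sketch under stated assumptions, not the pipeline's own code: it assumes that "PFAM ID" means the numeric part of an accession written as `PF` followed by digits (for example `PF00069`), and the function names are hypothetical.

```python
OFFSET = 33  # the offset described in the text


def pfam_to_char(pfam_id: str) -> str:
    """Map a PFAM accession such as 'PF00069' to a single Unicode character."""
    # Assumption: accessions look like 'PF' followed by digits.
    number = int(pfam_id.removeprefix("PF"))
    return chr(number + OFFSET)


def char_to_pfam(char: str) -> str:
    """Invert pfam_to_char, restoring the zero-padded accession."""
    return f"PF{ord(char) - OFFSET:05d}"


def encode_architecture(pfam_ids: list[str]) -> str:
    """Encode an ordered list of PFAM accessions as one string."""
    return "".join(pfam_to_char(pfam_id) for pfam_id in pfam_ids)


def decode_architecture(encoded: str) -> list[str]:
    """Decode an architecture string back into PFAM accessions."""
    return [char_to_pfam(char) for char in encoded]


if __name__ == "__main__":
    architecture = ["PF00069", "PF07714"]
    encoded = encode_architecture(architecture)
    # The round trip should give back the original accessions.
    assert decode_architecture(encoded) == architecture
    print([hex(ord(char)) for char in encoded])
```

Because each domain becomes exactly one character, the length of the encoded string equals the number of domains in the architecture.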

Pipeline Workflow

The pipeline takes two inputs:

  1. A text file with assembly accessions.
  2. A directory of HMMs.

Then it retrieves genomes (in .gff and .faa formats), extracts proteins that match any given HMM, annotates them with interproscan.sh, and derives Domain Architectures at both protein and neighborhood levels.

The final results include taxonomic data for further analysis.

pandoomain is used at the DeMoraes Lab to search for novel bacterial toxins.


Quick Usage

Option 1: Using config/config.yaml

Edit config/config.yaml and then run:

snakemake --cores all

Option 2: Using Command-Line Arguments

Run the pipeline with configuration directly on the command line:

snakemake --cores all \
          --config \
            genomes=genomes.txt \
            queries=queries

Option 1 is recommended since an edited configuration file acts as a log of the experiment, improving reproducibility. Option 2 is useful for quick test runs.

Before running anything, perform a dry run by adding the options -np (short for --dry-run --printshellcmds) to the snakemake command.
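
For scripted runs, the same command line can be assembled programmatically. The sketch below only builds the argument list and does not execute anything; the helper name `snakemake_command` is hypothetical, while the `genomes` and `queries` config keys and the snakemake options are the ones shown above.

```python
import shlex


def snakemake_command(genomes: str, queries: str, dry_run: bool = True,
                      cores: str = "all") -> list[str]:
    """Build the snakemake invocation as an argument list."""
    command = ["snakemake", "--cores", str(cores)]
    if dry_run:
        # Long forms of -n and -p: plan the run and print shell commands.
        command += ["--dry-run", "--printshellcmds"]
    command += ["--config", f"genomes={genomes}", f"queries={queries}"]
    return command


if __name__ == "__main__":
    # Print a copy-pastable command; pass the list to subprocess.run to execute it.
    print(shlex.join(snakemake_command("genomes.txt", "queries")))
```

Defaulting `dry_run` to true is a design choice of this sketch, so that an accidental call plans the run instead of starting it.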


Inputs

  1. Genome List: A text file with no headers, containing one genome assembly accession per line. Example: tests/genomes.txt.

    • Use # for comments.
  2. HMM Directory: A directory containing .hmm files, which can be obtained from the InterPro database or manually generated from alignments.
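
To make the genome list format concrete, here is a minimal sketch of how such a file could be parsed. It is not the pipeline's own parser: the function names are hypothetical, and it assumes that `#` starts a comment anywhere on a line and that blank lines are ignored.

```python
from pathlib import Path


def parse_accessions(text: str) -> list[str]:
    """Return the assembly accessions found in the text, one per line."""
    accessions = []
    for line in text.splitlines():
        # Assumption: everything after '#' is a comment.
        entry = line.split("#", 1)[0].strip()
        if entry:
            accessions.append(entry)
    return accessions


def read_accessions(path: str) -> list[str]:
    """Read a genome list file such as tests/genomes.txt."""
    return parse_accessions(Path(path).read_text())


if __name__ == "__main__":
    example = """\
# example genome list (illustrative accessions)
GCF_000005845.2
GCF_000009045.1  # trailing comment
"""
    print(parse_accessions(example))
```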


Outputs

The pipeline generates TSV tables summarizing:

  • HMM hits.
  • Genome data.
  • Taxonomic information.
  • Protein domain architectures.

Documentation

For further details, see the documentation at docs/README.md.


Installation

Dependencies

The most complex dependency is interproscan.sh, so a helper script is included: utils/install_iscan.py.

The pipeline runs through the Snakemake framework.

Cloud Installation

For a guide on cloud deployment, see: deploy-pandoomain.

Local Installation

1. Clone the repository

git clone 'https://github.com/elbecerrasoto/pandoomain'
cd pandoomain

2. Install a Conda Distribution

I recommend Miniforge. A Makefile rule can simplify this step:

make install-mamba

3. Install the Conda Environment

~/miniforge3/bin/conda init
source ~/.bashrc
mamba shell init --shell bash --root-prefix=~/miniforge3
source ~/.bashrc
mamba env create --file environment.yml
mamba activate pandoomain

4. Install InterProScan

make install-iscan

5. Install R Libraries

make install-Rlibs

6. Test the Installation

make test

Everything should now be set up and ready to run. 🚀

Version Changes

0.0.2

  • Removal of hmmer_input rule.
    • It's simpler to just use the genomes.tsv as input for the hmmer rule.
  • Removal of preprocessing rule for genomes.txt.
    • Now the dependent rules can parse genomes.txt directly.
  • Fixed taxallnomy bug, caused by an updated DB.
    • taxallnomy now has 43 cols instead of 42.
  • Removal of utils.py
