pandoomain: the pipe

v0.0.2

I recommend using the upstream version, and sending patches to it, at: https://github.com/elbecerrasoto/pandoomain

Description

pandoomain is a Snakemake pipeline designed for:

Downloading genomes.
Searching proteins using Hidden Markov Models (HMMs).
Domain annotation via interproscan.sh.
Extracting protein domain architectures.
Extracting gene neighborhoods.
Adding taxonomic information.

This pipeline helps identify functional and evolutionary patterns by analyzing Protein Domain Architecture and Gene Neighborhood data.

Some biological questions are better approached at the domain level rather than at the raw sequence level. This pipeline extends that idea to entire Gene Neighborhoods.

Domain Representation

pandoomain encodes a domain architecture as a string, offering several advantages:

Existing libraries for string distance can be directly applied.
Easier human inspection of raw tables.
Enables domain alignments.

The encoding method consists of adding 33 to each PFAM ID and treating the result as a Unicode code point.

The reasons for this are:

Adding 33 avoids mapping to control and whitespace characters, the same idea behind the Phred+33 quality score encoding.
Unicode can comfortably accommodate all defined PFAMs (~16,000), as it provides 155,063 characters. User-defined HMMs could be assigned a code point bigger than 18,000 to comfortably dodge any PFAM ID.
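
The encoding described above can be sketched in a few lines of Python. This is a minimal illustration, not the pipeline's actual code: the function names are hypothetical, and it assumes the "PFAM ID" is the integer part of an unversioned accession such as PF00005. The last function shows how a standard-library string comparison (difflib) could be applied to the encoded architectures.

```python
import difflib

PFAM_OFFSET = 33  # shift past control and whitespace characters, as in Phred+33


def pfam_to_char(accession: str) -> str:
    """Map an unversioned PFAM accession (assumed form: 'PF00005') to one character."""
    if not accession.startswith("PF"):
        raise ValueError(f"not a PFAM accession: {accession!r}")
    return chr(int(accession[2:]) + PFAM_OFFSET)


def char_to_pfam(char: str) -> str:
    """Invert pfam_to_char, restoring the zero-padded accession."""
    return f"PF{ord(char) - PFAM_OFFSET:05d}"


def encode_architecture(accessions):
    """Encode an ordered list of domains as a single string."""
    return "".join(pfam_to_char(acc) for acc in accessions)


def decode_architecture(encoded):
    """Recover the ordered list of accessions from an encoded string."""
    return [char_to_pfam(char) for char in encoded]


def similarity(arch_a: str, arch_b: str) -> float:
    """Similarity ratio in [0, 1] between two encoded architectures."""
    return difflib.SequenceMatcher(None, arch_a, arch_b).ratio()


# Illustrative accessions only; any PFAM accessions would do.
arch_a = encode_architecture(["PF00005", "PF00072", "PF00486"])
arch_b = encode_architecture(["PF00005", "PF00072"])
print(decode_architecture(arch_a))
print(similarity(arch_a, arch_b))
```

Because each domain becomes exactly one character, the length of the string equals the number of domains, and any edit-distance or alignment routine that operates on strings treats a domain as its unit of comparison.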

Pipeline Workflow

The pipeline takes two inputs:

A text file with assembly accessions.
A directory of HMMs.

Then it retrieves genomes (in .gff and .faa formats), extracts proteins that match any given HMM, annotates them with interproscan.sh, and derives Domain Architectures at both protein and neighborhood levels.

The final results include taxonomic data for further analysis.

pandoomain is used at the DeMoraes Lab to search for novel bacterial toxins.

Quick Usage

Option 1: Using `config/config.yaml`

Edit config/config.yaml and then run:

snakemake --cores all

Option 2: Using Command-Line Arguments

Run the pipeline with configuration directly on the command line:

snakemake --cores all \
          --config \
            genomes=genomes.txt \
            queries=queries

Option 1 is recommended since an edited configuration file acts as a log of the experiment, improving reproducibility. Option 2 is useful for quick test runs.

Before running the pipeline for real, perform a dry run by adding the options -np (short for --dry-run --printshellcmds) to the snakemake command.

Inputs

Genome List: A text file with no headers, containing one genome assembly accession per line. Example: tests/genomes.txt.
- Use # for comments.
HMM Directory: A directory containing .hmm files, which can be obtained from the InterPro database or manually generated from alignments.
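
As a rough illustration of the genome list format, the sketch below reads such a file into a list of accessions. It is not the pipeline's own parser: the function name is hypothetical, and it assumes that a # starts a comment anywhere on a line and that blank lines are ignored. The accessions in the sample text are for illustration only.

```python
def read_accessions(text: str) -> list[str]:
    """Return the assembly accessions found in a genome list.

    Assumed rules (not confirmed by the pipeline source):
    everything after '#' is a comment, and blank lines are skipped.
    """
    accessions = []
    for line in text.splitlines():
        entry = line.split("#", 1)[0].strip()  # drop the comment, then whitespace
        if entry:
            accessions.append(entry)
    return accessions


sample = """\
# illustrative genome list
GCF_000005845.2
GCF_000009045.1  # trailing comment

"""
print(read_accessions(sample))
```

To check a real list before a run, the same function could be given the contents of tests/genomes.txt.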

Outputs

The pipeline generates TSV tables summarizing:

HMM hits.
Genome data.
Taxonomic information.
Protein domain architectures.

Documentation

For further details check the documentation at docs/README.md.

Installation

Dependencies

The most complex dependency is interproscan.sh, so a helper script is included: utils/install_iscan.py.

The pipeline runs through the Snakemake framework.

Cloud Installation

For a guide on cloud deployment, see: deploy-pandoomain.

Local Installation

1. Clone the repository

git clone 'https://github.com/elbecerrasoto/pandoomain'
cd pandoomain

2. Install an Anaconda Distribution

I recommend Miniforge. A Makefile rule can simplify this step:

make install-mamba

3. Install the Conda Environment

~/miniforge3/bin/conda init
source ~/.bashrc
mamba shell init --shell bash --root-prefix=~/miniforge3
source ~/.bashrc
mamba env create --file environment.yml
mamba activate pandoomain

4. Install InterProScan

make install-iscan

5. Install R Libraries

make install-Rlibs

6. Test the Installation

make test

Everything should now be set up and ready to run. 🚀

Version Changes

0.0.2

Removal of hmmer_input rule.
- It's simpler to just use the genomes.tsv as input for the hmmer rule.
Removal of preprocessing rule for genomes.txt.
- Now the dependent rules can parse genomes.txt directly.
Fixed taxallnomy bug, caused by an updated DB.
- taxallnomy now has 43 cols instead of 42.
Removal of utils.py
