I recommend using, and sending patches to, the upstream version at: https://github.com/elbecerrasoto/pandoomain
pandoomain is a Snakemake pipeline designed for:
- Downloading genomes.
- Searching proteins using Hidden Markov Models (HMMs).
- Domain annotation via interproscan.sh.
- Extracting protein domain architectures.
- Extracting gene neighborhoods.
- Adding taxonomic information.
This pipeline helps identify functional and evolutionary patterns by analyzing Protein Domain Architecture and Gene Neighborhood data.
Some biological questions are better approached at the domain level rather than at the raw sequence level. This pipeline extends that idea to entire Gene Neighborhoods.
pandoomain encodes a domain architecture as a string, offering several advantages:
- Existing libraries for string distance can be directly applied.
- Easier human inspection of raw tables.
- Enables domain alignments.
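As a sketch of the first point, a plain Levenshtein distance can be computed over two architecture strings, where each character stands for one domain. The function below is a generic textbook implementation, not code taken from pandoomain, and the two example strings are made-up placeholders.

```python
def levenshtein(a: str, b: str) -> int:
    """Edit distance between two strings (insert, delete, substitute)."""
    # prev[j] holds the distance between the processed prefix of a and b[:j].
    prev = list(range(len(b) + 1))
    for i, char_a in enumerate(a, start=1):
        curr = [i]
        for j, char_b in enumerate(b, start=1):
            cost = 0 if char_a == char_b else 1
            curr.append(
                min(
                    prev[j] + 1,         # delete char_a
                    curr[j - 1] + 1,     # insert char_b
                    prev[j - 1] + cost,  # substitute or match
                )
            )
        prev = curr
    return prev[-1]


# Hypothetical architectures: each character represents one domain.
arch_a = "ABC"
arch_b = "ABD"
print(levenshtein(arch_a, arch_b))
```

Because an architecture is an ordinary string, any existing string-distance library could be swapped in for this function.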
The encoding method consists of adding +33 to each PFAM ID and treating the result as a Unicode code point.
The reasons for this are:
- Adding +33 avoids mapping to control and whitespace characters, the same idea behind the Phred+33 quality score encoding.
- Unicode can comfortably accommodate all defined PFAMs (~16,000), as it provides 155,063 characters. User-defined HMMs could be assigned a code point bigger than 18,000 to comfortably dodge any PFAM ID.
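The encoding described above can be sketched in a few lines of Python. This is an illustration, not the pipeline's actual code: the function names are hypothetical, and it assumes that the "PFAM ID" is the numeric part of an accession such as PF00069, with any version suffix ignored.

```python
OFFSET = 33  # skips control and whitespace characters, as with Phred+33


def pfam_to_int(accession: str) -> int:
    """Turn an accession like 'PF00069' or 'PF00069.28' into the integer 69."""
    base = accession.split(".")[0]  # drop a version suffix if present (assumption)
    if not base.startswith("PF"):
        raise ValueError(f"not a PFAM accession: {accession!r}")
    return int(base[2:])


def encode_architecture(accessions):
    """Encode an ordered list of PFAM accessions as one string."""
    return "".join(chr(pfam_to_int(acc) + OFFSET) for acc in accessions)


def decode_architecture(encoded: str):
    """Recover the PFAM accessions (without versions) from an encoded string."""
    return [f"PF{ord(char) - OFFSET:05d}" for char in encoded]


architecture = ["PF00005", "PF12345"]
encoded = encode_architecture(architecture)
print(repr(encoded), decode_architecture(encoded))
```

Each domain maps to exactly one character, so the length of the encoded string equals the number of domains in the architecture.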
The pipeline takes two inputs:
- A text file with assembly accessions.
- A directory of HMMs.
Then it retrieves genomes (in .gff and .faa formats), extracts proteins that match any given HMM,
annotates them with interproscan.sh, and derives Domain Architectures at both protein and neighborhood levels.
The final results include taxonomic data for further analysis.
pandoomain is used at the DeMoraes Lab to search for novel bacterial toxins.
Option 1: Using config/config.yaml
Edit config/config.yaml and then run:
snakemake --cores all
Option 2: Using --config
Run the pipeline with configuration directly on the command line:
snakemake --cores all \
--config \
genomes=genomes.txt \
queries=queries
Option 1 is recommended since an edited configuration file acts as a log of the experiment, improving reproducibility. Option 2 is useful for quick test runs.
Before running anything, perform a dry run by adding the options -np --printshellcmds to the snakemake command.
- Genome List: A text file with no headers, containing one genome assembly accession per line. Example: tests/genomes.txt.
  - Use # for comments.
- HMM Directory: A directory containing .hmm files, which can be obtained from the InterPro database or manually generated from alignments.
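To illustrate the genome list format, here is a minimal sketch of a parser for such a file. It is not the pipeline's own parsing code. The function name is hypothetical, and it assumes that a # starts a comment anywhere on a line and that blank lines are ignored. The two accessions are only sample values.

```python
def read_accessions(lines):
    """Return the assembly accessions found in an iterable of text lines."""
    accessions = []
    for line in lines:
        # Assumption: everything after '#' is a comment, whole-line or trailing.
        text = line.split("#", 1)[0].strip()
        if text:
            accessions.append(text)
    return accessions


sample = [
    "# reference genomes\n",
    "GCF_000005845.2\n",
    "\n",
    "GCF_000009045.1  # trailing comment\n",
]
print(read_accessions(sample))
```

With a real file, the same function can be called on an open file handle, for example `read_accessions(open("tests/genomes.txt"))`.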
The pipeline generates TSV tables summarizing:
- HMM hits.
- Genome data.
- Taxonomic information.
- Protein domain architectures.
For further details check the documentation at docs/README.md.
The most complex dependency is interproscan.sh, so a helper script is included: utils/install_iscan.py.
The pipeline runs through the Snakemake framework.
For a guide on cloud deployment, see: deploy-pandoomain.
git clone 'https://github.com/elbecerrasoto/pandoomain'
cd pandoomain
I recommend Miniforge. A Makefile rule can simplify this step:
make install-mamba
~/miniforge3/bin/conda init
source ~/.bashrc
mamba shell init --shell bash --root-prefix=~/miniforge3
source ~/.bashrc
mamba env create --file environment.yml
mamba activate pandoomain
make install-iscan
make install-Rlibs
make test
Everything should now be set up and ready to run. 🚀
- Removal of hmmer_input rule.
  - It's simpler to just use the genomes.tsv as input for the hmmer rule.
- Removal of preprocessing rule for genomes.txt.
  - Now the dependent rules can parse genomes.txt directly.
- Fixed taxallnomy bug, caused by an updated DB.
- taxallnomy now has 43 cols instead of 42.
- Removal of utils.py