phang

Background

Phang is an end-to-end bioinformatics pipeline designed to give researchers and clinicians a complete genomic picture of novel phages. Starting from an input FASTA sequence, the pipeline automates three major deliverables:

  1. Auto-formatted NCBI BankIt submission package: a single multi-fasta and five feature tables
  2. Clinical Report Card: a structured preliminary therapeutic suitability assessment for phage therapy contexts
  3. Detailed Research Report: a comprehensive genomic characterisation document for novel phages

This repo also offers an easy-to-use Python API for users who prefer to interact with sequences at a more granular level, or who want finer control over exactly which components of the pipeline to run. Please see the section on running the pipeline below.

Note

If you have any feedback on how we can improve this pipeline, or wish to suggest new features/tools to add, please let us know, and we'll see what we can do!

Why Phang?

Phage genomics involves a crowded ecosystem of specialised tools, each with its own Python version requirements, dependency chains, and installation quirks. Phang exists to handle all of that for you.

  • Dependency isolation: Each tool runs in its own dedicated conda environment, so there are no version conflicts to manage. Adding or updating a tool does not risk breaking anything else in the pipeline.
  • Simple to configure: All settings live in human-readable YAML files managed with Hydra. You can override a single parameter, swap an entire config file, or restrict the run to a subset of tools entirely from Python or the command line, with no code changes required. If you do not wish to do any configuring, no problem! Sensible defaults are provided.
  • Automatic dependency resolution: Declare which components you want and Phang works out the rest. If you run phold, it will automatically include pharokka first since phold depends on its output. You never need to think about run order.
  • Guards against redundant work: Setup scripts only execute when a required conda environment or database is actually missing, so repeated runs never re-perform expensive setup actions.
  • Parallelised where possible: Tools that operate on multiple input files distribute work across CPU cores using Python's ProcessPoolExecutor, keeping runtimes shorter on larger datasets and letting the pipeline make use of extra hardware if you have it available.
  • Extensible by design: The registry-based, object-oriented architecture makes it straightforward to add new tools. Register a component, add a config schema, write a Runnable subclass, and the pipeline picks it up automatically.
  • Transparent logging: Every step emits structured logs via loguru, covering start, progress, warnings, and completion. When something goes wrong, the logs tell you exactly where and why. If you don't want to see verbose logs, just run the command without using the --no-capture-output flag.
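The parallelisation described above can be sketched with a minimal example. Note that run_tool here is a hypothetical stand-in, not phang's actual API; in the real pipeline each component dispatches a conda-managed subprocess.

```python
from concurrent.futures import ProcessPoolExecutor


def run_tool(fasta_path: str) -> str:
    # Hypothetical stand-in for one tool invocation on one input file.
    return f"processed {fasta_path}"


def run_parallel(paths: list[str]) -> list[str]:
    # Independent input files are distributed across CPU cores.
    with ProcessPoolExecutor() as pool:
        return list(pool.map(run_tool, paths))


if __name__ == "__main__":
    print(run_parallel(["phage_1.fasta", "phage_2.fasta"]))
```

Because each input file is processed independently, this pattern scales with core count without any shared state between workers.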

Authors

Ideated and Supervised by:

  1. Sean Pang, NTU Interdisciplinary graduate program, Institute for Digital Molecular Analytics and Science
  2. Assoc Prof. Eric Yap Peng Huat, NTU Lee Kong Chian School of Medicine, Institute for Digital Molecular Analytics and Science

Developed by:

  1. Shane Bharathan (frznprograms)

Contacting us

For queries about the code, email shanevbh@gmail.com. For queries related to the conceptual ideas behind the tools implemented in this pipeline, please contact Sean (SE0001LE@e.ntu.edu.sg). Please also note that we will not reply to queries from bots or anonymous senders.

If you experience any issues related to using our code, please open an issue on GitHub and state:

  1. The nature of the issue you are experiencing
  2. The exact code you used when you encountered the issue
  3. What you ran beforehand, if any (e.g. if you ran any of the bash scripts beforehand)

Set Up

Please install Python before using this repo. We also require an installation of conda. We are aware that faster package managers like uv exist, but many established libraries for bioinformatics research are conda-dependent. Furthermore, many of the tools utilised by Phang were created at different times and have conflicting dependencies. Much effort went into accounting for these, and conda was the easiest way to address the issue.

For our project, we recommend using miniconda to retain only the needed packages and remove bloat. Please refer to the miniconda website for more information.

If you wish to use the entire pipeline, run scripts/init.sh in your shell. On some systems (e.g. macOS), you may need to change the permissions on this file first to make it executable:

git clone https://github.com/frznprograms/phang.git
cd phang/
chmod +x scripts/init.sh
./scripts/init.sh

Running the above script sets up everything you need to run the pipeline, including environments, dependencies, and databases. Some of the databases are large, so please be patient while the downloads complete.

Since the above steps are for a quick start, you don't need to think about setting up the environments. If you do not wish to run the whole pipeline and do not want all the databases involved (i.e. you only want the specific databases and envs for the subset of tools you'd like to use), do this instead:

git clone https://github.com/frznprograms/phang.git
cd phang/
chmod +x scripts/setup_phang.sh
./scripts/setup_phang.sh

Warning

The setup steps involve heavy downloads of around 25GB, and output files can be large as well. This means you can expect the setup scripts and the pipeline to take some time to finish execution. Depending on the number of components you use and the number of input files you have, the pipeline may take a long time to run. I suggest clicking run and taking a coffee break or a longer lunch. This is, unfortunately, necessary given the heavy computations involved.

The above step is necessary as phang_env is the conda env used to dispatch all other tools and is a minimum requirement for the pipeline to run. I recommend using the first option to set everything up, as you may well need tools beyond the one you plan on using anyway.

If you want to run only some components of the pipeline, we offer you the option to run the specific scripts required for those components. Feel free to peruse the scripts/ directory for more. Each component's script is prefixed with the word "setup", e.g. "setup_cherry.sh". However, we recommend you just let the pipeline take care of this for you.

Quick Start

First, ensure all your data (.fasta files) is placed in the data/ folder; one should already exist when you clone the repo.

If you want to run all the tools and are happy with the default settings, you can do so in the Jupyter notebook provided at quickstart.ipynb. The core usage is just three lines:

from src.pipeline import Pipeline

p = Pipeline.build(config_dir="src/conf")
p.run()

Alternatively, if you prefer the shell, run:

conda run -n phang_env python3 -m main
# if you don't want to see verbose output, run:
# conda run --no-capture-output -n phang_env python3 -m main

and that's it! You should see the output folder populated with the outputs from each step in the pipeline. The output folder should look something like this:

output/
    <tool_name>/        # whichever tool you ran, e.g. pharokka, phold, etc.
        <phage_1_name>  # as determined by the input .fasta file name
        <phage_2_name>
    <tool_name>/
        ...
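If you want to check programmatically which tools produced output, a small stdlib sketch like the following will do (folder names shown earlier are illustrative, and this helper is not part of phang's API):

```python
from pathlib import Path


def list_outputs(output_dir: str = "output") -> dict[str, list[str]]:
    # Map each tool's output folder to the per-phage result folders inside it.
    root = Path(output_dir)
    return {
        tool.name: sorted(child.name for child in tool.iterdir())
        for tool in sorted(root.iterdir())
        if tool.is_dir()
    }
```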

You may see a folder called prokbert_inference_output created. This is a natural output of running the Cherry (from PhaBox2) component.

Important

Do not modify or delete the YAML files in the configuration directory that end with "*_default.yaml". These are necessary for the Hydra configurations to run properly.

The run method in Pipeline takes a single argument, skip_setup. When set to True, it skips the setup steps that verify the necessary conda environments and databases are installed before running each component. These checks are purely defensive programming, so if you are confident everything needed is already installed, you can set this parameter to True and call the function:

from src.pipeline import Pipeline

p = Pipeline.build(config_dir="src/conf")
p.run(skip_setup=True)

This parameter is set to False by default to prevent users from accidentally skipping setup steps the first time the pipeline is run.

Running and Configuring the Pipeline

The Pipeline object is fully configurable via YAML files. All settings live in src/conf/config.yaml, with one section per tool. Configuration is handled with OmegaConf, which makes it straightforward to swap in custom settings or run only specific tools via overrides and components.

For a detailed guide with examples, see docs/configurations.md.
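As an illustration only, a config section might look like the fragment below. The keys and tool sections shown here are placeholders; the real schema lives in src/conf/config.yaml and is documented in docs/configurations.md.

```yaml
# Hypothetical sketch of a config override — consult src/conf/config.yaml
# and docs/configurations.md for the actual keys.
components:
  - pharokka   # restrict the run to a subset of tools
  - phold      # phold's dependency on pharokka is resolved automatically
pharokka:
  threads: 8   # override a single tool-level parameter
```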

Citations

The following tools are used or modified in our pipeline. We would like to thank the researchers and teams that prepared them:

  1. PhaBox2: https://github.com/KennthShang/PhaBOX
  2. DefenseFinder: https://github.com/mdmparis/defense-finder
  3. Deposcope: https://github.com/dimiboeckaerts/DepoScope
  4. Pharokka: https://github.com/gbouras13/pharokka
  5. PhaStyle: https://github.com/nbrg-ppcu/PhaStyle
  6. Phold: https://github.com/gbouras13/phold
  7. Phynteny: https://github.com/susiegriggo/Phynteny
  8. RBPDetect: https://github.com/dimiboeckaerts/PhageRBPdetection
  9. VContact: https://vcontact3.readthedocs.io/en/latest/
@article{ProkBERT2024,
  author  = {Ligeti, Balázs and Szepesi-Nagy, István and Bodnár, Babett and Ligeti-Nagy, Noémi and Juhász, János},
  journal = {Frontiers in Microbiology},
  title   = {{ProkBERT} family: genomic language models for microbiome applications},
  year    = {2024},
  volume  = {14},
  url     = {https://www.frontiersin.org/articles/10.3389/fmicb.2023.1331233},
  doi     = {10.3389/fmicb.2023.1331233},
  issn    = {1664-302X}
}
@article{10.1093/nar/gkaf1448,
    author = {Bouras, George and Grigson, Susanna R and Mirdita, Milot and Heinzinger, Michael and Papudeshi, Bhavya and Mallawaarachchi, Vijini and Green, Renee and Kim, Rachel Seongeun and Mihalia, Victor and Psaltis, Alkis James and Wormald, Peter-John and Vreugde, Sarah and Steinegger, Martin and Edwards, Robert A},
    title = {Protein structure-informed bacteriophage genome annotation with Phold},
    journal = {Nucleic Acids Research},
    volume = {54},
    number = {1},
    pages = {gkaf1448},
    year = {2026},
    month = {01},
    abstract = {Bacteriophage (phage) genome annotation is essential for understanding their functional potential and suitability for use as therapeutic agents. Here, we introduce Phold, an annotation framework utilizing protein structural information that combines the ProstT5 protein language model and structural alignment tool Foldseek. Phold assigns annotations using a database of over 1.36 million predicted phage protein structures with high-quality functional labels. Benchmarking reveals that Phold outperforms existing sequence-based homology approaches in functional annotation sensitivity whilst maintaining speed, consistency, and scalability. Applying Phold to diverse cultured and metagenomic phage genomes shows it consistently annotates over 50\% of genes on an average phage and 40\% on an average archaeal virus. Comparisons of phage protein structures to other protein structures across the tree of life reveal that phage proteins commonly have structural homology to proteins shared across the tree of life, particularly those that have nucleic acid metabolism and enzymatic functions. Phold is available as free and open-source software at https://github.com/gbouras13/phold.},
    issn = {1362-4962},
    doi = {10.1093/nar/gkaf1448},
    url = {https://doi.org/10.1093/nar/gkaf1448},
    eprint = {https://academic.oup.com/nar/article-pdf/54/1/gkaf1448/66285038/gkaf1448.pdf},
}
@article{bolduc2025vcontact3,
author  = {Bolduc, Benjamin and others},
title   = {Machine learning enables scalable and systematic hierarchical virus taxonomy},
journal = {Nature Biotechnology},
year    = {2025},
doi     = {10.1038/s41587-025-02946-9},
url     = {https://doi.org/10.1038/s41587-025-02946-9}
}
Proudly created with Neovim on macOS, with the assistance of Claude and Claude Code. Human-first, AI-assisted.

About

Generate report cards for phage genomes
