This repository provides a four-step pipeline for data acquisition and initial filtering of protein sequence sets intended for phylogenetic inference benchmarking.
- For a detailed explanation of workflow and design choices, see
documentation.md. - Software requirements are listed in
requirements.txt.
The pipeline assumes the following initial directory layout:
base_dir/
├─ species_set.txt
├─ scripts/
│ ├─ oma_groups.py
│ ├─ blast.py
│ ├─ hits_counter.py
│ ├─ blast_filtering.py
├─ data/
All scripts are designed to be run consecutively in the order below. If you deviate from the pipeline, ensure you provide the correct corresponding inputs expected by each step.
-
Install the Python packages listed in
requirements.txt. -
blast.pyandblast_filtering.pyrequire NCBI BLAST+ (v2.17.0 or newer). Download it from: https://ftp.ncbi.nlm.nih.gov/blast/executables/blast+/LATEST/ and ensure the BLAST+ binaries are available on yourPATH. -
oma_groups.pyandblast_filtering.pyrequire a local NCBI Taxonomy database.- The latest version is downloaded automatically in
oma_groups.pyunless this behavior is disabled via the script parameters. - Note: if you rerun the script without disabling the download, the local taxonomy database version may change.
- The latest version is downloaded automatically in
-
blast.pyandblast_filtering.pyrequire a local BLAST Swiss-Prot database.- The latest version is downloaded automatically in
blast.pyunless this behavior is disabled via the script parameters. - Note: if you rerun the script without disabling the download, the local Swiss-Prot database version may change.
- The latest version is downloaded automatically in
Requires internet connection.
Input: species_set.txt
Plain text file containing five-letter OMA database species codes formatted as a list of lists. At least one species from each sublist will be represented in the resulting dataset. For exact formatting instructions, see species_set.txt.
Output: data/oma_groups/groupname.fasta
Directory containing FASTA files with full protein sequences from the input species. Each file represents one OMA orthologous group, and filenames correspond to the OMA group fingerprints.
Parameters: None
If you follow the expected directory structure and provide species_set.txt, no parameters are required.
Description:
Downloads an approximately 100 MB .txt.gz file from OMA DB, identifies groups containing the input species, queries OMA DB for full protein sequences, saves sequences as FASTA files (one per group), and prints the total number of orthologous groups.
Internet connection required only for database download/update.
Input: data/oma_groups/groupname.fasta
Directory of FASTA files containing full protein sequences to be used as BLAST queries. Each file (with multiple sequences) is used as a single query.
Output: data/blast_hits/groupname.xml
Directory containing BLAST results in XML format (one file per group/query).
Parameters: E-VALUE
BLAST E-value cutoff. Default is 1e-10.
Description: Downloads/updates the Swiss-Prot BLAST database, runs local BLAST using the input sequences as queries (with the specified E-value), and stores the resulting hits as BLAST XML.
No internet connection required.
Input: data/blast_hits/groupname.xml
Directory of BLAST hit XML files.
Output: data/groups_sorted.txt
Plain text file containing a dictionary of group fingerprints sorted into categories based on the number of unique BLAST hits (homologous sequences) per group. This file is used as a helper input for the filtering step.
Parameters: None
Description:
Parses BLAST results, counts the number of unique hits per query/group, assigns each group to sequence-count categories, and prints how many groups fall into each category. Categories correspond to powers of two: a set with N unique hits belongs to all categories 2^k such that N ≥ 2^k.
Example: if a set contains 423 unique hits, it belongs to categories:
0, 8, 16, 32, 64, 128, 256
but not 512 or 1024.
This summary helps the user decide how many sequences to retain per set for downstream analysis, which is required for filtering in the next step.
No internet connection required (assuming the Swiss-Prot database is already present from running blast.py).
Inputs:
data/blast_hits/groupname.xmldata/groups_sorted.txt
Outputs:
data/results/groupname.fastadata/results_filtered/groupname/groupname_filtered.fastadata/results_filtered/groupname/accesion_taxid_dict.txtdata/results_filtered/groupname/groupname.newick
data/results/ contains all sets with at least the user-defined number of sequences.
data/results_filtered/ contains one subdirectory per filtered set; each subdirectory contains:
groupname_filtered.fasta: sequences filtered by length (median ± tolerance) and reduced to an exact user-defined number of sequences (randomly sampled if more remain).accesion_taxid_dict.txt: mapping of UniProt accession → NCBI taxID.groupname.newick: NCBI taxonomy tree of all taxIDs in the set, in Newick format.
Parameters: SEQ_IN_SET, TOLERANCE
-
SEQ_IN_SETin{0, 1024}: exact number of sequences to keep per set.- If
0: no filtering by sequence count.
- If
-
TOLERANCEin[0, 1]: tolerance for length filtering around the median.- If
0: only sequences with exactly the median length are retained. - If
1: no length filtering.
- If
Description:
Retrieves full protein sequences for all hits from the local Swiss-Prot database for sets meeting the minimum size requirement, and saves them in FASTA format with UniProt accession numbers as headers. It then filters sequences by length (median ± tolerance). If more than SEQ_IN_SET sequences remain, it removes the original species' sequences, randomly samples down to the exact target size (minus original sequences) and adds the original sequences back. Finally, it builds an NCBI taxonomy tree (Newick) and writes an accession → taxID mapping for the retained sequences.
- Script ordering matters:
oma_groups.py→blast.py→hits_counter.py→blast_filtering.py. - Scripts
blast.py,hits_counter.py, andblast_filtering.pycould run offline if the database is provided. Please checkdocumentation.mdfor description. - Make sure that all necessary packages are installed (see
requirements.txt). - Make sure
BLAST+==2.17.0+is installed and added to PATH.