acg-team/data_acquisition_pipeline

Data acquisition pipeline for phylogenetic inference benchmarking

This repository provides a four-step pipeline for data acquisition and initial filtering of protein sequence sets intended for phylogenetic inference benchmarking.


Expected directory structure

The pipeline assumes the following initial directory layout:

base_dir/
    ├─ species_set.txt
    ├─ scripts/
    │   ├─ oma_groups.py
    │   ├─ blast.py
    │   ├─ hits_counter.py
    │   ├─ blast_filtering.py
    ├─ data/

All scripts are designed to be run consecutively in the order below. If you deviate from the pipeline, ensure you provide the correct corresponding inputs expected by each step.


Installation and setup

  • Install the Python packages listed in requirements.txt.

  • blast.py and blast_filtering.py require NCBI BLAST+ (v2.17.0 or newer). Download it from: https://ftp.ncbi.nlm.nih.gov/blast/executables/blast+/LATEST/ and ensure the BLAST+ binaries are available on your PATH.

  • oma_groups.py and blast_filtering.py require a local NCBI Taxonomy database.

    • The latest version is downloaded automatically in oma_groups.py unless this behavior is disabled via the script parameters.
    • Note: if you rerun the script without disabling the download, the local taxonomy database version may change.
  • blast.py and blast_filtering.py require a local BLAST Swiss-Prot database.

    • The latest version is downloaded automatically in blast.py unless this behavior is disabled via the script parameters.
    • Note: if you rerun the script without disabling the download, the local Swiss-Prot database version may change.

Pipeline overview

1) oma_groups.py

Requires an internet connection.

Input: species_set.txt, a plain text file containing five-letter OMA database species codes formatted as a list of lists. At least one species from each sublist will be represented in the resulting dataset. For exact formatting instructions, see species_set.txt.
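The exact layout is documented in species_set.txt itself. As an illustration only, assuming the file holds a Python-style list of lists of codes, it could be read like this (`parse_species_set` is a hypothetical helper, not part of the pipeline):

```python
# Hypothetical sketch: read species_set.txt, ASSUMING it contains a
# Python-literal list of lists of five-letter OMA species codes, e.g.
# [["HUMAN", "MOUSE"], ["YEAST"]]. Check species_set.txt for the real format.
import ast

def parse_species_set(path):
    """Return the species set as a list of lists of OMA codes."""
    with open(path) as fh:
        species = ast.literal_eval(fh.read())
    # Sanity check: every OMA species code is exactly five letters.
    for sublist in species:
        for code in sublist:
            assert len(code) == 5, f"unexpected OMA code: {code!r}"
    return species
```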

Output: data/oma_groups/groupname.fasta, FASTA files with full protein sequences from the input species. Each file represents one OMA orthologous group, and filenames correspond to the OMA group fingerprints.

Parameters: none. If you follow the expected directory structure and provide species_set.txt, no parameters are required.

Description: Downloads an approximately 100 MB .txt.gz file from OMA DB, identifies groups containing the input species, queries OMA DB for full protein sequences, saves sequences as FASTA files (one per group), and prints the total number of orthologous groups.
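The group-selection step can be sketched as follows, assuming the OMA dump uses tab-separated lines (group number, group fingerprint, then member entry IDs whose first five characters are the species code, e.g. HUMAN00001). `select_groups` is illustrative, not the script's actual code:

```python
# Illustrative sketch of selecting groups from OMA's oma-groups.txt.gz dump.
# Assumed line format: group_nr<TAB>fingerprint<TAB>member IDs...
import gzip

def select_groups(dump_path, species_set):
    """Yield (fingerprint, members) for groups covering every sublist."""
    with gzip.open(dump_path, "rt") as fh:
        for line in fh:
            if line.startswith("#"):      # skip header/comment lines
                continue
            fields = line.rstrip("\n").split("\t")
            fingerprint, members = fields[1], fields[2:]
            codes = {m[:5] for m in members}
            # Keep the group only if each sublist contributes >= 1 species.
            if all(codes & set(sub) for sub in species_set):
                yield fingerprint, members
```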


2) blast.py

Internet connection required only for database download/update.

Input: data/oma_groups/groupname.fasta, the FASTA files containing full protein sequences to be used as BLAST queries. Each file (with multiple sequences) is used as a single query.

Output: data/blast_hits/groupname.xml, BLAST results in XML format (one file per group/query).

Parameters: E-VALUE, the BLAST E-value cutoff (default: 1e-10).

Description: Downloads/updates the Swiss-Prot BLAST database, runs local BLAST using the input sequences as queries (with the specified E-value), and stores the resulting hits as BLAST XML.
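A minimal sketch of this step, assuming `blastp` from BLAST+ is on PATH and a Swiss-Prot BLAST database is already built; the file names and the `run_blast` helper are illustrative, not the script's actual API:

```python
# Hedged sketch of the BLAST step. -outfmt 5 requests BLAST XML output.
import subprocess
from pathlib import Path

def blastp_command(query, db, out, evalue=1e-10):
    """Build the blastp invocation (command list for subprocess)."""
    return [
        "blastp",
        "-query", str(query),
        "-db", str(db),
        "-evalue", str(evalue),
        "-outfmt", "5",        # 5 = BLAST XML
        "-out", str(out),
    ]

def run_blast(query_dir, db, out_dir, evalue=1e-10):
    """Run blastp once per FASTA file, writing one XML file per group."""
    out_dir = Path(out_dir)
    out_dir.mkdir(parents=True, exist_ok=True)
    for fasta in sorted(Path(query_dir).glob("*.fasta")):
        out = out_dir / (fasta.stem + ".xml")
        subprocess.run(blastp_command(fasta, db, out, evalue), check=True)
```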


3) hits_counter.py

No internet connection required.

Input: data/blast_hits/groupname.xml, the directory of BLAST hit XML files.

Output: data/groups_sorted.txt, a plain text file containing a dictionary of group fingerprints sorted into categories by the number of unique BLAST hits (homologous sequences) per group. This file is used as a helper input for the filtering step.

Parameters: None

Description: Parses BLAST results, counts the number of unique hits per query/group, assigns each group to sequence-count categories, and prints how many groups fall into each category. Categories are powers of two starting at 8, plus a catch-all category 0: a set with N unique hits belongs to category 0 and to every category 2^k (k ≥ 3) with 2^k ≤ N.

Example: a set with 423 unique hits belongs to categories 0, 8, 16, 32, 64, 128, and 256, but not 512 or 1024.

This summary helps the user decide how many sequences to retain per set for downstream analysis, which is required for filtering in the next step.
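The category rule above can be expressed as a small helper (`hit_categories` is illustrative, not the script's actual function):

```python
# Sketch of the category rule: every set falls into category 0, plus each
# power-of-two category from 8 upward that its unique-hit count reaches.
def hit_categories(n_unique_hits, max_category=1024):
    cats = [0]
    k = 8
    while k <= max_category:
        if n_unique_hits >= k:
            cats.append(k)
        k *= 2
    return cats

# hit_categories(423) → [0, 8, 16, 32, 64, 128, 256]
```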


4) blast_filtering.py

No internet connection required (assuming the Swiss-Prot database is already present from running blast.py).

Inputs:

  • data/blast_hits/groupname.xml
  • data/groups_sorted.txt

Outputs:

  • data/results/groupname.fasta
  • data/results_filtered/groupname/groupname_filtered.fasta
  • data/results_filtered/groupname/accesion_taxid_dict.txt
  • data/results_filtered/groupname/groupname.newick

data/results/ contains all sets with at least the user-defined number of sequences.

data/results_filtered/ contains one subdirectory per filtered set; each subdirectory contains:

  • groupname_filtered.fasta: sequences filtered by length (median ± tolerance) and reduced to an exact user-defined number of sequences (randomly sampled if more remain).
  • accesion_taxid_dict.txt: mapping of UniProt accession → NCBI taxID.
  • groupname.newick: NCBI taxonomy tree of all taxIDs in the set, in Newick format.

Parameters: SEQ_IN_SET, TOLERANCE

  • SEQ_IN_SET, an integer in [0, 1024]: exact number of sequences to keep per set.

    • If 0: no filtering by sequence count.
  • TOLERANCE in [0, 1]: tolerance for length filtering around the median.

    • If 0: only sequences with exactly the median length are retained.
    • If 1: no length filtering.

Description: Retrieves full protein sequences for all hits from the local Swiss-Prot database for sets meeting the minimum size requirement, and saves them in FASTA format with UniProt accession numbers as headers. It then filters sequences by length (median ± tolerance). If more than SEQ_IN_SET sequences remain, it sets aside the original species' sequences, randomly samples the remainder down to SEQ_IN_SET minus the number of original sequences, and adds the original sequences back. Finally, it builds an NCBI taxonomy tree (Newick) and writes an accession → taxID mapping for the retained sequences.
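The length filter and down-sampling described above can be sketched as follows; the function name, data shapes, and the handling of TOLERANCE = 1 are assumptions, not the script's actual API:

```python
# Hedged sketch of the filtering logic: keep sequences whose length lies
# within median * (1 ± tolerance), then sample down to an exact set size
# while always retaining the original species' sequences.
import random
from statistics import median

def filter_set(seqs, original_ids, seq_in_set, tolerance):
    """seqs: dict accession -> sequence string; returns the filtered dict."""
    if tolerance >= 1:
        kept = dict(seqs)                 # TOLERANCE = 1: no length filtering
    else:
        med = median(len(s) for s in seqs.values())
        lo, hi = med * (1 - tolerance), med * (1 + tolerance)
        kept = {a: s for a, s in seqs.items() if lo <= len(s) <= hi}
    if seq_in_set and len(kept) > seq_in_set:
        # Set aside the original species' sequences, sample the rest down.
        originals = {a: kept[a] for a in original_ids if a in kept}
        others = sorted(a for a in kept if a not in originals)
        sampled = random.sample(others, seq_in_set - len(originals))
        kept = {**originals, **{a: kept[a] for a in sampled}}
    return kept
```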


Notes

  • Script ordering matters: oma_groups.py → blast.py → hits_counter.py → blast_filtering.py.
  • blast.py, hits_counter.py, and blast_filtering.py can run offline if the required databases are already present; see documentation.md for details.
  • Make sure that all necessary packages are installed (see requirements.txt).
  • Make sure BLAST+ (v2.17.0 or newer) is installed and available on PATH.
