Staphylococcus aureus is one of the most common causes of human bacterial infections, including a large share of hospital-acquired infections. S. aureus has a highly variable genome, with differences between isolates that include substantial insertions and deletions of mobile elements. Disease surveillance research has led to the genome sequencing of many thousands of isolates. However, the annotation of these genome sequences does not provide researchers with a complete set of orthologs with informative names. Here, we present a computational pipeline that compares de novo sequence contigs to the set of complete RefSeq genomes for i) determining an appropriate reference genome for whole-genome alignment, ii) annotation, ortholog prediction, and comparative genomics, and iii) front-end visualization of genome annotation using a versatile, user-friendly, web-based genome browser. We demonstrate our pipeline using data from S. aureus as a paradigm, owing to its high sequence variability and the consequently less well-curated genomic sequences in public databases.
pyfasta link
BLAST+ link
MAUVE link
GLIMMER link
JBrowse link
Staphylococcus aureus genome sequence data were obtained from the NCBI Genomes portal using the search term “staphylococcus aureus[orgn]”. NCBI lists 7968 sequences associated with S. aureus, but only 66 of these are complete whole-genome sequences. We downloaded these 66 complete genomes.
Obtain near-complete whole-genome sequences from NCBI
wget ftp://ftp.ncbi.nlm.nih.gov/genomes/all/GCA/001/717/725/
As a proof of principle, we sampled 10 random genomes from the whole-genome set and subdivided each genome into 1 kb chunks using the pyfasta tool. The 1 kb sequences were then compared with BLAST against a database of the remaining genomes using a custom shell script. For each query genome, the reference genome that most frequently appeared as the best hit was identified as its closest reference.
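The chunk-and-vote idea above can be sketched in a few lines of Python. This is only an illustrative sketch, not the custom shell script used in the pipeline: the function names are hypothetical, and it assumes BLAST was run with the default tabular output (`-outfmt 6`) and that each subject ID identifies exactly one reference genome.

```python
from collections import Counter


def chunk_sequence(seq, size=1000):
    """Split a sequence into consecutive chunks of at most `size` bases."""
    return [seq[i:i + size] for i in range(0, len(seq), size)]


def top_hit_per_chunk(blast_rows):
    """Return {query_chunk: subject} keeping the highest bit score per chunk.

    Each row is one line of BLAST tabular output (-outfmt 6): 12
    tab-separated fields, query ID first, subject ID second, bit score last.
    """
    best = {}
    for row in blast_rows:
        fields = row.rstrip("\n").split("\t")
        query, subject, bitscore = fields[0], fields[1], float(fields[11])
        if query not in best or bitscore > best[query][1]:
            best[query] = (subject, bitscore)
    return {query: subject for query, (subject, _) in best.items()}


def closest_reference(blast_rows):
    """Return the subject that is the best hit for the most chunks."""
    counts = Counter(top_hit_per_chunk(blast_rows).values())
    return counts.most_common(1)[0][0]


# Toy input (made-up IDs and scores, for illustration only).
rows = [
    "chunk1\trefA\t99.0\t1000\t10\t0\t1\t1000\t1\t1000\t0.0\t1800",
    "chunk1\trefB\t97.0\t1000\t30\t0\t1\t1000\t1\t1000\t0.0\t1700",
    "chunk2\trefA\t98.5\t1000\t15\t0\t1\t1000\t1\t1000\t0.0\t1780",
    "chunk3\trefB\t99.5\t1000\t5\t0\t1\t1000\t1\t1000\t0.0\t1820",
]
winner = closest_reference(rows)
```

With the toy rows above, `refA` is the best hit for two of the three chunks, so it is returned as the closest reference.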
A pairwise alignment between the query genome and its closest reference genome is then constructed with MAUVE Contig Mover and exported as a JPEG image. A table of gaps in the alignment is also provided as a CSV file to aid in the identification of strain-specific insertions of mobile elements, which often carry drug resistance and virulence genes.
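To illustrate what a gap table contains, the sketch below derives unaligned stretches from a list of aligned blocks and writes them as CSV. It does not parse MAUVE output; the input format (1-based, inclusive `(start, end)` coordinates on the query genome), the function names, and the column names are all assumptions made for the example.

```python
import csv
import io


def find_gaps(aligned_blocks, genome_length):
    """Return (start, end, length) for regions not covered by any block.

    `aligned_blocks` holds (start, end) pairs, 1-based and inclusive.
    Overlapping blocks are handled by tracking the furthest covered base.
    """
    gaps = []
    position = 1  # first base not yet known to be covered
    for start, end in sorted(aligned_blocks):
        if start > position:
            gaps.append((position, start - 1, start - position))
        position = max(position, end + 1)
    if position <= genome_length:
        gaps.append((position, genome_length, genome_length - position + 1))
    return gaps


def gaps_to_csv(gaps):
    """Serialize the gap list as CSV text with a header row."""
    buffer = io.StringIO()
    writer = csv.writer(buffer, lineterminator="\n")
    writer.writerow(["start", "end", "length"])
    writer.writerows(gaps)
    return buffer.getvalue()


# Toy example: three aligned blocks on a 500 bp query.
gaps = find_gaps([(1, 100), (151, 300), (250, 400)], genome_length=500)
table = gaps_to_csv(gaps)
```

Long gaps in the query that have no counterpart in the reference are the candidates for strain-specific insertions mentioned above.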
staph_pipeline.sh
The best-alignment output is printed to the command line. Because this step runs behind the user-friendly StaphBrowse interface, the information is not shown in the web browser and is mainly of interest to users who want to develop the pipeline further.
For 50 S. aureus genomes, we extracted the protein-coding sequences (CDS) using the general feature format (GFF) file. The CDS FASTA files were then converted into individual BLAST databases for a reciprocal best BLAST search. Using the ortholog_identification.R script, we required a minimum alignment length of 50 nucleotides and an e-value < 0.001 for a reciprocal best BLAST hit.
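The reciprocal best-hit criterion described above can be sketched as follows. This is not the ortholog_identification.R script, only a hypothetical Python illustration of the same thresholds; it assumes both BLAST runs used the default tabular output (`-outfmt 6`), where field 4 is the alignment length, field 11 the e-value, and field 12 the bit score.

```python
def filtered_best_hits(rows, min_length=50, max_evalue=1e-3):
    """Return {query: subject} for the top-scoring hit passing both filters."""
    best = {}
    for row in rows:
        f = row.rstrip("\n").split("\t")
        query, subject = f[0], f[1]
        length, evalue, bitscore = int(f[3]), float(f[10]), float(f[11])
        # Discard hits that are too short or not significant enough.
        if length < min_length or evalue >= max_evalue:
            continue
        if query not in best or bitscore > best[query][1]:
            best[query] = (subject, bitscore)
    return {query: subject for query, (subject, _) in best.items()}


def reciprocal_best_hits(a_vs_b, b_vs_a):
    """Pairs (a, b) where b is a's best hit and a is b's best hit."""
    forward = filtered_best_hits(a_vs_b)
    reverse = filtered_best_hits(b_vs_a)
    return sorted((a, b) for a, b in forward.items() if reverse.get(b) == a)


# Toy input (made-up gene IDs and scores, for illustration only).
a_vs_b = [
    "geneA1\tgeneB1\t98.0\t300\t6\t0\t1\t300\t1\t300\t1e-150\t540",
    "geneA2\tgeneB2\t90.0\t40\t4\t0\t1\t40\t1\t40\t1e-10\t60",
    "geneA3\tgeneB3\t95.0\t200\t10\t0\t1\t200\t1\t200\t1e-90\t350",
]
b_vs_a = [
    "geneB1\tgeneA1\t98.0\t300\t6\t0\t1\t300\t1\t300\t1e-150\t540",
    "geneB2\tgeneA2\t90.0\t40\t4\t0\t1\t40\t1\t40\t1e-10\t60",
    "geneB3\tgeneA4\t96.0\t210\t8\t0\t1\t210\t1\t210\t1e-95\t360",
]
orthologs = reciprocal_best_hits(a_vs_b, b_vs_a)
```

In the toy data, the geneA2/geneB2 pair is rejected because its alignment is shorter than 50 nucleotides, and geneA3 is rejected because the best hit of geneB3 points to a different gene.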
The welcome page of the browser offers options to load a de novo genome of interest, find the best reference genome for alignment, visualize gene annotations using JBrowse, and view ortholog predictions in a table embedded within the StaphBrowse genome browser.
- User page for loading the genome
- Based on the best alignment for a reference, users can choose a genome for visualization of gene models
- Results from ortholog predictions can be loaded as a table and browsed through StaphBrowse




