Here are scripts to perform analysis of systematic viral epitope scanning (VirScan). The principles behind the codes were described in [Xu et al. (2015)] (http://doi.org/10.1126/science.aaa0698)
The scripts have been used to obtain the results published in “The repertoire of maternal anti-viral antibodies in human newborns” Christian Pou, Dieudonné Nkulikiyimfura, Ewa Henckel, Axel Olin, Tadepally Lakshmikanth, Jaromir Mikes, Jun Wang, Yang Chen, Anna-Karin Bernhardsson, Anna Gustafsson, Kajsa Bohlin and Petter Brodin. (http://dx.doi.org/10.1038/s41591-019-0392-8)
- Bowtie/1.2.0
- Samtools/1.1
- Python/3.5.4 (numpy/1.11.3, pandas/0.17.1, matplotlib/2.1.2)
db <-- db directory with the input virus library
src/ <-- src directory contains python scripts
index/ <-- index directory with an index for the reference sequences
The scripts available in this repo are all located in src/. The usage descriptions are provided in this readme
Use Bowtie to align the reads to a reference sequence.
Before you can align, you need to build an index for the reference sequences.
bowtie-build REFERENCE.fasta REFERENCE_OUTPUT_NAME
This has been done for you and an index for the reference sequences is found in the index directory.
bzip2 -dc path/to/FASTQ_SEQUENCES.fastq.bz2 | bowtie -n 3 -l 30 -e 1000 --tryhard --nomaqround --norc --best --sam --quiet path/to/REFERENCE_OUTPUT_NAME - | samtools-1.1/bin/samtools view -u - | samtools-1.1/bin/samtools sort -T BAM_OUTPUT_NAME.bam - -o $@
where
path/to/FASTQ_SEQUENCES.fastq.bz2is the path to the FASTQ file containing the sequencing readspath/to/REFERENCE_OUTPUT_NAMEis the path to bowtie index (without the.1.ebwtsuffix)BAM_OUTPUT_NAME.bamis the name of the bam file to which you will save the alignment results
Use samtools idxstats command to count the number of reads for each sequence.
Prior to that, it is necessary to index the bam file:
samtools index BAM_OUTPUT_NAME.bam
Then, count the number of reads for each reference sequence and converts it into a csv file.
samtool idxstats BAM_OUTPUT_NAME.bam | cut -f 1,3 | sed -e '/^\*\t/d' -e '1 i id\tSAMPLE_ID' | tr "\\t" "," >COUNT_FILE.csv
Where SAMPLE_ID is the id of the sample and COUNT_FILE.csv is the name of the count file.
To save space, compress the csv files with gzip:
gzip COUNT_FILE.csv
The python script calc_zipval.py will calculate zero-inflated p-values for a set of output counts based on a set of input counts.
python calc_zipval.py OUTPUT.count.csv.gz INPUT.count.csv.gz log_directory >OUTPUT.zipval.csv
OUTPUT.count.csvis the output read countINTPUT.count.csvis the input read countlog_directoryis a directory where the script will save several plots showing model fitsOUTPUT.zipval.csvis the resulting zero-inflated p-values.
The python script call_hits.py will call hits based on replicate zero-inflated p-values.
python call_hits.py REPLICATE1.zipval.csv.gz REPLICATE2.zipval.csv.gz THRESHOLD log_directory >OUTPUT.zihit.csv
REPLICATE1.zipval.csv.gzandREPLICATE2.zipval.csv.gzare the two replicate zero-inflated pvalue filesTHRESHOLDis the threshold zero-inflated p-value for calling hits (2.3 for VirScan)log_directoryis a directory where the script will save plots showing correlation between the replicatesOUTPUT.zihit.csvis the resulting hits. First column is oligo id, second is True/False
The python script calc_scores.py calculates virus scores using the maximum parsimony approach.
python calc_scores.py ZIHITS.zihit.csv.gz METADATA.csv.gz NHITS.BEADS.csv.gz NHITS.SAMPS.csv.gz GROUPING_LEVEL EPITOPE_LEN >OUTPUT.ziscore.spp.csv
ZIHITS.zihit.csv.gzis the gzipped results of call_hits.pyMETADATA.csv.gzis the file containing the metadata for the virus libraryNHITS.BEADS.csv.gzis a two column gzipped csv file, column 1 is oligo id, column 2 is the number of beads samples in which that oligo was a hitNHITS.SAMPS.csv.gzis a two column gzipped csv file, column 1 is oligo id, column 2 is the number of non-beads samples in which that oligo was a hitGROUPING_LEVELcan be Species or Organism, depending onEPITOPE_LENshould be 7, the length of a linear epitopeOUTPUT.ziscore.spp.csvis a two column csv file.
The concat_tables.py script combines multiple csv files into one.