Skip to content

Indel Detection using Ad Hoc Methods

sflink edited this page May 2, 2013 · 6 revisions

A really useful tool is the pysam module for python, a wrapper for the Samtools API. You gain the advantage of storing aligned reads in .bam format, while being able to extract information smoothly. You might want to use only the PE reads that mapped in a proper pair, like this:

#! /usr/bin/python

# Extract records for properly aligned reads from bowtie2-generated paired-end read file
# where proper alignment means the pairs aligned to opposite strands, in the 
# proper orientation, and within the range specified when calling the aligner. 
# That value was 2000 nt for our DNA-seq alignments, even though the fragments were targeted
# to be 300 nt in length. 

import sys, re
from pysam import *

if len(sys.argv) < 2:
  print
  print("USAGE: bestFromBam.py in_file.bam out_file.bam")
  sys.exit(1)

fin = Samfile(sys.argv[1], 'rb')
fout = Samfile(sys.argv[2], 'w', template=fin)
#fout = open(sys.argv[2], 'w')
for read in fin:
  if read.is_proper_pair:
    fout.write(read)
fin.close()
fout.close()

Apparently the pysam package does not work for Python 3. Here we use the script in a pipe to extract the lengths of the inserts.

python samwriter.py BNLX.bowtie2.sorted.bam.bam - | awk '{if ($9<0) print (-$9)-80; else print $9-80}' > BNLX_insert_sizes.txt

Why are we interested in the inserts? A genomic region spanned by inserts but not actually covered by reads itself is evidence of an insertion or deletion. To bring this idea to fruition, we will:

  1. filter out those regions which are within 1000 bp (say) of a gap in the assembly.
  2. choose some kind of threshold for insert depth. I think the SHR genome paper used as few as 2 or 3.
  3. filter the regions by mappability. It may be that no reads aligned because the region is identical, or nearly so, to many other regions.
  4. use a distribution of insert sizes to determine whether the lengths of the spanning inserts differ significantly from the lengths of inserts in general.

Oh, here's a question: Suppose the depth of coverage drops off suddenly at/near a region of depth 0. Does this imply an indel more strongly than if depth drops off gradually?

Some R code to examine the distribution of insert lengths:

> I <- scan("BNLX_insert_sizes.txt")
Read 393901394 items
> summary(I)
   Min. 1st Qu.  Median    Mean 3rd Qu.    Max.
    -12     179     228     255     292    1920
> K<-log(I)
Warning message:
In log(I) : NaNs produced
### oops!
> K<-log(I[which(I>0)])
> summary(K)
   Min. 1st Qu.  Median    Mean 3rd Qu.    Max.
  0.000   5.187   5.429   5.427   5.677   7.560
> pdf("BNLX_insert_size_hist.pdf")
> hist(K, freq=FALSE, breaks=100, main=mainTitle, sub=subTitle, xlab="log length",ylim=c(0,summary(IDensity$y)[6]*1.05)
+ )
> dev.off()
> sd(K)
[1] 0.4736811
> sdk=sd(K)
> summaryK<-summary(K)
> summaryK[4]
 Mean
5.427
> summaryK[4]+2*sdk
    Mean
6.374362
> summaryK[4]-2*sdk
    Mean
4.479638
> exp(summaryK[4]-2*sdk)
    Mean
88.20272
> exp(summaryK[4]+2*sdk)
    Mean
586.6112
> sum(I<=88)
[1] 8812724
> sum(I>=587)
[1] 7989358
> png("BNLX_insert_size_hist.png")
> hist(K, freq=FALSE, breaks=100, main=mainTitle, sub=subTitle, xlab="log length")
> dev.off()
null device
          1

Clone this wiki locally