Dropouts

Current hypothesis:

  • small amount of starting material and low capture efficiency → only a small fraction of the mRNA molecules in the cell is captured and amplified
  • large number of zero counts (but apparently not zero-inflated, as argued among others by Yanai and Svensson, i.e. "observed zeros are consistent with count statistics, and droplet scRNA-seq protocols are not producing higher numbers of dropouts than expected")
  • bimodal distributions

Assumptions made by existing methods (Ye 2017):

  • SCDE (Kharchenko et al., 2015) assumes that all zeroes are technical zeroes
  • MAST (Finak et al., 2015) categorizes all zero counts as 'unexpressed'

The scone package contains lists of genes that are believed to be ubiquitously and even uniformly expressed across human tissues. If we assume these genes are truly expressed in all cells, we can label all zero abundance observations as drop-out events. (scone vignette)

data(housekeeping, package = "scone")
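If we take such a list at face value, the labeling itself is simple: every zero observation of a presumed-ubiquitous gene becomes a candidate dropout. A minimal base-R sketch on toy data (the matrix and gene names here are made up, not from the scone vignette):

```r
## Toy sketch: label zero observations of presumed-ubiquitous genes as dropouts
set.seed(1)
counts <- matrix(rpois(20, lambda = 1), nrow = 4,
                 dimnames = list(paste0("gene", 1:4), paste0("cell", 1:5)))
hk <- c("gene1", "gene2")  # hypothetical housekeeping genes

## TRUE wherever a housekeeping gene has zero counts in a cell
is_dropout <- counts[hk, , drop = FALSE] == 0

## per-cell dropout rate over the housekeeping set
dropout_rate <- colMeans(is_dropout)
```

The per-cell dropout rate over the housekeeping set can then double as a cell-level QC metric.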

scone's approach

scone's approach to identifying transcripts that are worth keeping:

# Initial gene filtering:
# select "common" transcripts based on proportional criteria
num_reads = quantile(assay(fluidigm)[assay(fluidigm) > 0])[4]  # [4] = 75th percentile of the non-zero counts
num_cells = 0.25 * ncol(fluidigm)                              # expressed in at least 25% of cells
is_common = rowSums(assay(fluidigm) >= num_reads) >= num_cells

# Final gene filtering: highly expressed in at least 5 cells
num_reads = quantile(assay(fluidigm)[assay(fluidigm) > 0])[4]
num_cells = 5
is_quality = rowSums(assay(fluidigm) >= num_reads) >= num_cells
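The same proportional filter can be run end-to-end on simulated data; this is a self-contained sketch in which the `fluidigm` object is replaced by a random count matrix:

```r
## Sketch of the scone-style proportional filter on simulated counts
set.seed(42)
mat <- matrix(rnbinom(1000, mu = 5, size = 1), nrow = 100)  # 100 genes x 10 cells

## quantile() with default probs returns the 0/25/50/75/100% quantiles,
## so element [4] is the 75th percentile of the non-zero counts
num_reads <- quantile(mat[mat > 0])[4]
num_cells <- 0.25 * ncol(mat)

is_common <- rowSums(mat >= num_reads) >= num_cells
```

`is_common` is a logical vector over genes, directly usable to subset the matrix.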

My own approach using dropout rates

## calculate drop out rates
gns_dropouts <- calc_dropouts_per_cellGroup(sce, genes = rownames(sce), split_by = "condition")

## define HK genes for display
hk_genes <- unique(c(grep("^mt-", rownames(sce), value=TRUE, ignore.case=TRUE), # mitochondrial genes
            grep("^Rp[sl]", rownames(sce), value=TRUE, ignore.case=TRUE))) # ribosomal genes

## plot
ggplot(data = gns_dropouts,
       aes(x = log10(mean.pct.of.counts),
           y = log10(pct.zeroCov_cells + .1),
           text = paste(gene, condition, sep = "_"))) +
  geom_point(aes(color = condition), shape = 1, size = .5, alpha = .5) +
  ## add HK gene data points
  geom_point(data = gns_dropouts[gene %in% hk_genes],
             aes(fill = condition), shape = 22, size = 4, alpha = .8) +
  facet_grid(~condition) +
  ggtitle("With housekeeping genes")
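`calc_dropouts_per_cellGroup` is a custom helper; the quantities plotted above could be computed per condition roughly like this (a hypothetical base-R sketch on toy data, with column names chosen to match the plot aesthetics):

```r
## Hypothetical sketch of a per-group dropout-rate calculation
set.seed(7)
counts <- matrix(rpois(60, lambda = 0.8), nrow = 6,
                 dimnames = list(paste0("gene", 1:6), paste0("cell", 1:10)))
condition <- rep(c("ctrl", "treated"), each = 5)

dropouts_by_group <- lapply(
  split(seq_len(ncol(counts)), condition),
  function(cells) {
    sub <- counts[, cells, drop = FALSE]
    data.frame(gene = rownames(sub),
               ## percent of cells in the group with zero coverage for the gene
               pct.zeroCov_cells  = 100 * rowMeans(sub == 0),
               ## the gene's share of all counts in the group, in percent
               mean.pct.of.counts = 100 * rowSums(sub) / sum(sub))
  })
```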

Normalization

global scaling methods will fail if there is a large number of DE genes → pre-clustering with rank-based methods followed by normalization within each cluster is preferable in those cases (see the scran implementation)

## add logcounts
library(scran)
library(scater)
set.seed(100)
clusts <- quickCluster(scegaba) 
scegaba <- computeSumFactors(scegaba, cluster=clusts, min.mean=0.1)
scegaba <- logNormCounts(scegaba)
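Under the hood this boils down to dividing each cell by a size factor and log-transforming. A naive library-size version, much cruder than scran's deconvolution-based size factors, can be sketched in base R:

```r
## Naive global scaling sketch (NOT what computeSumFactors does; scran uses
## pooling/deconvolution to be robust to DE genes)
counts <- matrix(c(10, 0, 5, 20, 2, 9), nrow = 3)  # 3 genes x 2 cells
lib_sizes <- colSums(counts)
size_factors <- lib_sizes / mean(lib_sizes)  # centered library-size factors

## divide each cell (column) by its size factor, then log2 with pseudocount 1
logcounts <- log2(t(t(counts) / size_factors) + 1)
```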

log-offset

MNN-based expression correction

  • yields log-expression values that are not guaranteed to be interpretable as log-counts → useful for computing reduced dimensions and for visualizations, but probably not appropriate for DE analysis (Aaron @ github)

Smoothing

| Publication | Method | Package |
| --- | --- | --- |
| Wagner, Yan & Yanai, 2018 | kNN smoothing | github.com/yanailab/knn-smoothing |
| van Dijk et al., 2017 | manifold learning using diffusion maps (MAGIC) | github |

  • there is no guarantee that a smoothed expression profile accurately reflects an existing cell population
  • it might be a good idea to use scater's approach of first clustering and then smoothing within every cluster of similar cells (MAGIC tries to do this inherently)
  • after smoothing, the values of different genes may no longer be independent, which violates a basic assumption of most DE tests (Wagner's method instead introduces a dependency between cells, rather than between genes)
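The basic idea behind kNN smoothing can be sketched in a few lines of base R. This is a toy illustration only, not the Wagner et al. implementation:

```r
## Toy kNN smoothing: replace each cell's profile by the average over its
## k nearest neighbouring cells (Euclidean distance, self included)
set.seed(3)
expr <- matrix(rpois(50, lambda = 5), nrow = 5)  # 5 genes x 10 cells
k <- 3

d <- as.matrix(dist(t(expr)))                    # cell-to-cell distances
smoothed <- sapply(seq_len(ncol(expr)), function(i) {
  nn <- order(d[i, ])[seq_len(k)]                # k nearest cells (incl. self)
  rowMeans(expr[, nn, drop = FALSE])
})
```

Because every smoothed cell now shares information with its neighbours, the independence caveats listed above apply directly.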

Dimensionality reduction and clustering

if marker genes are expressed ≥ 4× higher than the rest of the genes, either the Seurat or the SIMLR algorithm will work (Abrams 2018)

run_clustering <- function(sce.object, dimred_name, neighbors, exprs_values){

  ## build a shared-nearest-neighbor graph on the chosen reduced dimensions
  message("Building SNNGraph")
  snn.gr <- scran::buildSNNGraph(sce.object,
                                 use.dimred = dimred_name,
                                 assay.type = exprs_values,
                                 k = neighbors)
  ## community detection on the graph
  message("Clustering")
  clusters <- igraph::cluster_walktrap(snn.gr)
  return(factor(clusters$membership))
}

sce.filt$Cluster.condRegress_k100 <- run_clustering(sce.filt,
                                                   dimred_name = "PCA_cond_regressed_Poisson",
                                                   neighbors = 100,
                                                   exprs_values = "cond_regressed")

t-SNE

great write up: "t-sne explained in plain javascript"

PCA

Avril Coghlan's write-up of PCA
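For a quick reminder of the mechanics, base R's `prcomp` is sufficient; a minimal sketch on random data, with cells as observations:

```r
## Minimal PCA sketch: rows = genes, columns = cells, so transpose first
set.seed(11)
expr <- matrix(rnorm(200), nrow = 20)  # 20 genes x 10 cells

pca <- prcomp(t(expr), center = TRUE, scale. = FALSE)
pcs <- pca$x[, 1:2]                    # first two PCs per cell
```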

DE

Soneson & Robinson:

  • Pre-filtering of lowly expressed genes can have important effects on the results, particularly for some of the methods originally developed for analysis of bulk RNA-seq data
  • Generally, methods developed for bulk RNA-seq analysis do not perform notably worse than those developed specifically for scRNA-seq.
  • make sure to use count-based approaches, possibly by adding a covariate for the sample batch effect
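A minimal sketch of such a pre-filter in base R (the CPM and cell-count thresholds here are illustrative, not recommendations from Soneson & Robinson):

```r
## Pre-filter lowly expressed genes before DE testing
set.seed(5)
counts <- matrix(rnbinom(200, mu = 2, size = 1), nrow = 20)  # 20 genes x 10 cells

## counts-per-million per cell
cpm <- t(t(counts) / colSums(counts)) * 1e6

## keep genes with > 1 CPM in at least 3 cells (thresholds are arbitrary here)
keep <- rowSums(cpm > 1) >= 3
counts_filt <- counts[keep, , drop = FALSE]
```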