Notes/scRNA-seq.md at master · friedue/Notes

Dropouts
Normalization
- Log-offset
- MNN correction
Smoothening & Imputation
Dimensionality reduction & clustering
DE

Dropouts

Current hypothesis:

small amount of starting material and low capture efficiency → only a small fraction of the mRNA molecules in the cell is captured and amplified
large number of zero counts (but apparently not zero-inflated as argued i.a. by Yanai and Svensson, i.e. "observed zeros are consistent with count statistics, and droplet scRNA-seq protocols are not producing higher numbers of dropouts than expected")
bimodal distributions

(Ye 2017):

SCDE (Kharchenko et al, 2015) assumes all zeroes are technical zeroes
MAST (Finak et al., 2015) categorizes all zero counts as 'unexpressed'

The scone package contains lists of genes that are believed to be ubiquitously and even uniformly expressed across human tissues. If we assume these genes are truly expressed in all cells, we can label all zero abundance observations as drop-out events. (scone vignette)

data(housekeeping, package = "scone")

scone's approach

scone's approach to identifying transcripts that are worth keeping:

# Initial Gene Filtering: 
# Select "common" transcripts based on proportional criteria.
num_reads = quantile(assay(fluidigm)[assay(fluidigm) > 0])[4]
num_cells = 0.25*ncol(fluidigm)
is_common = rowSums(assay(fluidigm) >= num_reads ) >= num_cells

# Final Gene Filtering: Highly expressed in at least 5 cells
num_reads = quantile(assay(fluidigm)[assay(fluidigm) > 0])[4]
num_cells = 5
is_quality = rowSums(assay(fluidigm) >= num_reads ) >= num_cells

My own approach using dropout rates

## calculate drop out rates
gns_dropouts <- calc_dropouts_per_cellGroup(sce, genes = rownames(sce), split_by = "condition")

## define HK genes for display
hk_genes <- unique(c(grep("^mt-", rownames(sce), value=TRUE, ignore.case=TRUE), # mitochondrial genes
            grep("^Rp[sl]", rownames(sce), value=TRUE, ignore.case=TRUE))) # ribosomal genes

## plot
ggplot(data = gns_dropouts,
        aes(x = log10(mean.pct.of.counts),
            y = log10(pct.zeroCov_cells + .1),
        text = paste(gene, condition, sep = "_"))) + 
  geom_point(aes(color = condition), shape = 1, size = .5, alpha = .5) +
  ## add HK gene data points
  geom_point(data = gns_dropouts[gene %in% hk_genes],
             aes(fill = condition), shape = 22, size = 4, alpha = .8) +
   facet_grid(~condition) + 
   ggtitle("With housekeeping genes")

Normalization

global scaling methods will fail if there's a large number of DE genes → per-clustering using rank-based methods followed by normalization within each group is preferable for those cases (see scran implementation)

## add logcounts
library(scran)
library(scater)
set.seed(100)
clusts <- quickCluster(scegaba) 
scegaba <- computeSumFactors(scegaba, cluster=clusts, min.mean=0.1)
scegaba <- logNormCounts(scegaba)

log-offset

MNN-based expression correction

will yield log-expression values, not guaranteed that they're interpretable as log-counts! → useful for computing reduced dimensions and for visualizations, but probably not appropriate for DE analysis Aaron @ github

Smoothening

Publication	Method	Package
Wagner, Yatai, 2018	knn-smoothing	github.com/yanailab/knn-smoothing
Dijk et al., 2017	manifold learning using diffusion maps	github

there is no guarantee that a smoothened expression profile accurately reflects an existing cell population
might be a good idea to use scater's approach of first clustering and then smoothening within every cluster of similar cells (MAGIC tries that inherently)
after smoothening, values of different genes might no longer independent, which violates basic assumption of most DE tests (Wagner's method generates a dependency of the cells, rather than genes)

Dimensionality reduction and clustering

marker genes expressed >= 4x than the rest of the genes, either Seurat or Simlr algorithm will work (Abrams 2018)

run_clustering <- function(sce.object, dimred_name, neighbors, exprs_values){
  
  print("Building SNNGraph")
  snn.gr <- scran::buildSNNGraph(sce.object,
                                 use.dimred = dimred_name,
                                 assay.type = exprs_values,
                                 k = neighbors)
  print("Clustering")
  clusters <- igraph::cluster_walktrap(snn.gr)
  return(factor(clusters$membership))
}

sce.filt$Cluster.condRegress_k100 <- run_clustering(sce.filt,
                                                   dimred_name = "PCA_cond_regressed_Poisson",
                                                   neighbors = 100,
                                                   exprs_values = "cond_regressed")

t-SNE

great write up: "t-sne explained in plain javascript"

PCA

Avril Coghlan's write-up of PCA

DE

Soneson & Robinson:

Pre-filtering of lowly expressed genes can have important effects on the results, particularly for some of the methods originally developed for analysis of bulk RNA-seq data
Generally, methods developed for bulk RNA-seq analysis do not perform notably worse than those developed specifically for scRNA-seq.
make sure to use count-based approaches, possibly by adding a covariate for the sample batch effect

Provide feedback

Saved searches

Use saved searches to filter your results more quickly

Dropouts

scone's approach

My own approach using dropout rates

Normalization

log-offset

MNN-based expression correction

Smoothening

Dimensionality reduction and clustering

t-SNE

PCA

DE

FilesExpand file tree

scRNA-seq.md

Latest commit

History

scRNA-seq.md

File metadata and controls

Dropouts

scone's approach

My own approach using dropout rates

Normalization

log-offset

MNN-based expression correction

Smoothening

Dimensionality reduction and clustering

t-SNE

PCA

DE