Current hypothesis:
- small amount of starting material and low capture efficiency → only a small fraction of the mRNA molecules in the cell is captured and amplified
- large number of zero counts (but apparently not zero-inflated as argued i.a. by Yanai and Svensson, i.e. "observed zeros are consistent with count statistics, and droplet scRNA-seq protocols are not producing higher numbers of dropouts than expected")
- bimodal distributions
- SCDE (Kharchenko et al, 2015) assumes all zeroes are technical zeroes
- MAST (Finak et al., 2015) categorizes all zero counts as 'unexpressed'
The scone package contains lists of genes that are believed to be ubiquitously and even uniformly expressed across human tissues. If we assume these genes are truly expressed in all cells, we can label all zero abundance observations as drop-out events. (scone vignette)
data(housekeeping, package = "scone")
scone's approach to identifying transcripts that are worth keeping:
# Initial Gene Filtering:
# Select "common" transcripts based on proportional criteria.
num_reads = quantile(assay(fluidigm)[assay(fluidigm) > 0])[4]
num_cells = 0.25*ncol(fluidigm)
is_common = rowSums(assay(fluidigm) >= num_reads ) >= num_cells
# Final Gene Filtering: Highly expressed in at least 5 cells
num_reads = quantile(assay(fluidigm)[assay(fluidigm) > 0])[4]
num_cells = 5
is_quality = rowSums(assay(fluidigm) >= num_reads ) >= num_cells
## calculate drop out rates
gns_dropouts <- calc_dropouts_per_cellGroup(sce, genes = rownames(sce), split_by = "condition")
## define HK genes for display
hk_genes <- unique(c(grep("^mt-", rownames(sce), value=TRUE, ignore.case=TRUE), # mitochondrial genes
grep("^Rp[sl]", rownames(sce), value=TRUE, ignore.case=TRUE))) # ribosomal genes
## plot
ggplot(data = gns_dropouts,
aes(x = log10(mean.pct.of.counts),
y = log10(pct.zeroCov_cells + .1),
text = paste(gene, condition, sep = "_"))) +
geom_point(aes(color = condition), shape = 1, size = .5, alpha = .5) +
## add HK gene data points
geom_point(data = gns_dropouts[gene %in% hk_genes],
aes(fill = condition), shape = 22, size = 4, alpha = .8) +
facet_grid(~condition) +
ggtitle("With housekeeping genes")
global scaling methods will fail if there's a large number of DE genes → per-clustering using rank-based methods followed by normalization within each group is preferable for those cases (see scran implementation)
## add logcounts
library(scran)
library(scater)
set.seed(100)
clusts <- quickCluster(scegaba)
scegaba <- computeSumFactors(scegaba, cluster=clusts, min.mean=0.1)
scegaba <- logNormCounts(scegaba)
- will yield log-expression values, not guaranteed that they're interpretable as log-counts! → useful for computing reduced dimensions and for visualizations, but probably not appropriate for DE analysis Aaron @ github
| Publication | Method | Package |
|---|---|---|
| Wagner, Yatai, 2018 | knn-smoothing | github.com/yanailab/knn-smoothing |
| Dijk et al., 2017 | manifold learning using diffusion maps | github |
- there is no guarantee that a smoothened expression profile accurately reflects an existing cell population
- might be a good idea to use scater's approach of first clustering and then smoothening within every cluster of similar cells (MAGIC tries that inherently)
- after smoothening, values of different genes might no longer independent, which violates basic assumption of most DE tests (Wagner's method generates a dependency of the cells, rather than genes)
marker genes expressed >= 4x than the rest of the genes, either Seurat or Simlr algorithm will work (Abrams 2018)
run_clustering <- function(sce.object, dimred_name, neighbors, exprs_values){
print("Building SNNGraph")
snn.gr <- scran::buildSNNGraph(sce.object,
use.dimred = dimred_name,
assay.type = exprs_values,
k = neighbors)
print("Clustering")
clusters <- igraph::cluster_walktrap(snn.gr)
return(factor(clusters$membership))
}
sce.filt$Cluster.condRegress_k100 <- run_clustering(sce.filt,
dimred_name = "PCA_cond_regressed_Poisson",
neighbors = 100,
exprs_values = "cond_regressed")
great write up: "t-sne explained in plain javascript"
Avril Coghlan's write-up of PCA
- Pre-filtering of lowly expressed genes can have important effects on the results, particularly for some of the methods originally developed for analysis of bulk RNA-seq data
- Generally, methods developed for bulk RNA-seq analysis do not perform notably worse than those developed specifically for scRNA-seq.
- make sure to use count-based approaches, possibly by adding a covariate for the sample batch effect