This is the Seurat workflow applied to single-cell RNA sequencing data from multiple samples using the Seurat package in R. I have added some details and steps in a way that will help to start from scratch and codes needed especially for a beginner. The goal is to read raw data, create Seurat object, perform quality control to filter out poor quality cells, merge and integrate multiple samples to correct for batch effects and explore a cell population for e.g M2 Tumor-Associated Macrophages (TAMs) as an example.
Note: This tutorial has used 4 samples (BC1–BC4) as an example. You can extend this workflow if you have more samples (e.g., BC5–BC18) as per your data availability.
Make sure you have the following R packages installed:
install.packages("tidyverse")
install.packages("Seurat")Then load the libraries:
library("Seurat")
library("Matrix")
library("tidyverse")In scRNA typically you will get three files for scRNA data for each sample generated by the Cell Ranger pipeline from 10X Genomics. Each sample will include:
matrix.mtx: the raw count matrixfeatures.tsv: gene namesbarcodes.tsv: cell barcodesReadMtxfunction will read these three files for each sample and convert into a single sparse matrix. Consider this count matrix of feature, barcode, matrix as a count matrix in bulk RNA-seq data. The rows are genes and columns are cell barcodes or cells. We will then convert this sparse count matrix into a seurat object.
# Replace these paths with your own where your files are placed in your local computer
BC1 <- ReadMtx(mtx = "path/to/BC1/matrix.mtx", features = "path/to/BC1/features.tsv",
cells = "path/to/BC1/barcodes.tsv")
BC2 <- ReadMtx(mtx = "path/to/BC2/matrix.mtx",features = "path/to/BC2/features.tsv",
cells = "path/to/BC2/barcodes.tsv")
BC3 <- ReadMtx(mtx = "path/to/BC3/matrix.mtx", features = "path/to/BC3/features.tsv",
cells = "path/to/BC3/barcodes.tsv")
BC4 <- ReadMtx(mtx = "path/to/BC4/matrix.mtx", features = "path/to/BC4/features.tsv",
cells = "path/to/BC4/barcodes.tsv")We now create individual Seurat objects for each sample. Each object represent a single sample’s scRNA data. We apply basic quality control filter:
min.cells = 3: Only keep genes expressed in at least 3 cells.min.features = 200: Only keep cells that have at least 200 detected genes (features). In addition, we also calculate the percentage of mitrochondrial gene expression extracted by "^MT-" pattern from the data. This will create a new columnpercent.mtin the metadata that will demonstrate the mitochondrial percentage.
# Sample BC1
BC1 <- CreateSeuratObject(counts = BC1, project = "BC1", min.cells = 3, min.features = 200)
BC1 <- PercentageFeatureSet(BC1, pattern = "^MT-", col.name = "percent.mt")
# Sample BC2
BC2 <- CreateSeuratObject(counts = BC2, project = "BC2", min.cells = 3, min.features = 200)
BC2 <- PercentageFeatureSet(BC2, pattern = "^MT-", col.name = "percent.mt")
# Sample BC3
BC3 <- CreateSeuratObject(counts = BC3, project = "BC3", min.cells = 3, min.features = 200)
BC3 <- PercentageFeatureSet(BC3, pattern = "^MT-", col.name = "percent.mt")
# Sample BC4
BC4 <- CreateSeuratObject(counts = BC4, project = "BC4", min.cells = 3, min.features = 200)
BC4 <- PercentageFeatureSet(BC4, pattern = "^MT-", col.name = "percent.mt")
# View each of these Seurat objects metadata separately
View(BC1@metadata)You will see three important columns in the metadata, which will be used for filtering in next step:
nCount_RNA: Total number of unique molecules detected in a cellnfeature_RNA: Total number of unique genes in a cellpercent.mt: mitochondrial percentage in each cell
Next we will filter out poor-quality cells with more stringent quality control filter.
- Cells with too few features (genes) by
nFeature_RNA> 200 - Cells with high mitochondrial gene content:
percent.mt> 20 - High complexity ratio : log10(
nFeature_RNA) / log10(nCount_RNA)) These filters will remove dead or damaged cells (high mitochondrial content) and technical artifacts like low-quality droplets or doublets which have highnFeature_RNAandnCount_RNA.
BC1$complexity_ratio <- log10(BC1$nFeature_RNA) / log10(BC1$nCount_RNA)
BC1 <- subset(BC1, subset = nFeature_RNA > 200 & complexity_ratio > 0.8 & percent.mt < 20)
BC2$complexity_ratio <- log10(BC2$nFeature_RNA) / log10(BC2$nCount_RNA)
BC2 <- subset(BC2, subset = nFeature_RNA > 200 & complexity_ratio > 0.8 & percent.mt < 20)
BC3$complexity_ratio <- log10(BC3$nFeature_RNA) / log10(BC3$nCount_RNA)
BC3 <- subset(BC3, subset = nFeature_RNA > 200 & complexity_ratio > 0.8 & percent.mt < 20)
BC4$complexity_ratio <- log10(BC4$nFeature_RNA) / log10(BC4$nCount_RNA)
BC4 <- subset(BC4, subset = nFeature_RNA > 200 & complexity_ratio > 0.8 & percent.mt < 20)We now merge all the four filtered samples (individual Seurat objects) into a single Seurat object. This is needed to perform joint analysis (normalization, clustering, and downstream analysis).
merged_seurat <- merge(BC1, y = c(BC2, BC3, BC4),
add.cell.ids = c("BC1", "BC2", "BC3", "BC4"),
project = "MergedProject")
# Add two new columns “patient” and "barcode" by separating the sample column(which contain cell barcode with its respective sample name) in the metadata for sample tracking
merged_seurat$sample <- rownames(merged_seurat@meta.data)
merged_seurat@meta.data <- separate(merged_seurat@meta.data, col = "sample", into = c("Patient", "Barcode"), sep = "_")Before we integrate datasets, we must normalize the raw counts, identify variable genes and scale the data to make it comparable across cells and samples.
# Normalize gene expression counts using log-normalization
merged_seurat <- NormalizeData(merged_seurat)
# Identify 2000 highly variable genes (used for downstream steps like PCA) in each cell
merged_seurat <- FindVariableFeatures(merged_seurat)
# Scale and centering the data (zero-mean, unit variance)
merged_seurat <- ScaleData(merged_seurat)
# Run Principal Component Analysis (PCA) to reduce dimensionality
merged_seurat <- RunPCA(merged_seurat)Choosing the right number of Principal Components (PCs) is crucial for clustering and integration.Seurat suggest Elbow Plot for this purpose. In addition to elbow plot, I use a heuristic method to select PCs. Too few PCs can miss meaningful clusters, too many include noise. We select PCs capturing most variance. You can just use the elbow plot but I use this heuristic method to get exact optimal PCs to proceed (but this is optional step).
# Visual inspection
ElbowPlot(merged_seurat, ndims = 50)
# Heuristic approach to select PCs
pct <- merged_seurat[["pca"]]@stdev / sum(merged_seurat[["pca"]]@stdev) * 100
cumu <- cumsum(pct)
co1 <- which(cumu > 90 & pct < 5)[1]
co2 <- sort(which((pct[1:length(pct) - 1] - pct[2:length(pct)]) > 0.1), decreasing = TRUE)[1] + 1
pcs <- min(co1, co2)
pcs #return the optimal PCs We cluster cells based on selected PCs identified in the previous step and use UMAP to visualize batch effects before the integration step. This helps us to see the clustering pattern and how much batch effect might exist. Lets assume we get 15 PCs in the last step, we will use 15 PCs here.
merged_seurat <- FindNeighbors(merged_seurat, dims = 1:15, reduction = "pca")
merged_seurat <- FindClusters(merged_seurat, resolution = 1, cluster.name = "unintegrated_clusters")
merged_seurat <- RunUMAP(merged_seurat, dims = 1:15, reduction = "pca", reduction.name = "umap.unintegrated")
#Plot to visualize cells based on clusters and patient samples
DimPlot(merged_seurat, reduction = "umap.unintegrated", group.by = c("Patient", "seurat_clusters"))Different samples might have batch effects due to technical variation (e.g processed on different days, sequencing machines, different labs). Seurat use Canonical Correlation Analysis (CCA) to integrate different samples and reduce technical noise. Seurat also provide other integration methods such as harmony, RPCA etc. You can explore various methods by adding the specific method in method below, I am showing CCA integration method here.
merged_seurat <- IntegrateLayers(object = merged_seurat, method = CCAIntegration,
orig.reduction = "pca", new.reduction = "integrated.cca", verbose = TRUE)
# Join RNA layers for downstream analysis
merged_seurat[["RNA"]] <- JoinLayers(merged_seurat[["RNA"]])We will repeat PCA, clustering and UMAP-this time on the batch corrected integrated data after CCA integration method.
#Explore through ElbowPlot to select optimum PCs. Repeat the Heuristic approach to select PCs described previously.
ElbowPlot(merged, ndims = 50)
# Re-compute dimensionality reduction and clustering
merged_seurat <- FindNeighbors(merged_seurat, reduction = "integrated.cca", dims = 1:12)
merged_seurat <- FindClusters(merged_seurat, resolution = 1)
merged_seurat <- RunUMAP(merged_seurat, dims = 1:12, reduction = "integrated.cca")
# UMAP of integrated clusters
DimPlot(merged_seurat, reduction = "umap", group.by = "seurat_clusters", label = TRUE)After the above step the data is integrated. We can now use cannonical markers to identify various cell types. For instance epithelial cells with 'EPCAM' gene, COL1A1 for fibroblast, LYZ for immune cell etc. Here We demonstrate expression of M2-like macrophage cell using markers like CD163, MRC1, CD14 in the integrated dataset's clusters through a violin plot and feature plot. These genes are commonly associated with M2-like macrophages, a subtype of TAMs involved in immune suppression and tumor progression. This will help us to identify which cluster is associated with M2-TAM macrophage.
VlnPlot(merged_seurat, features = c("CD163", "MRC1", "CD14"))
FeaturePlot(merged_seurat, features = c("CD163", "MRC1", "CD14"), reduction = "umap")To find differentially expressed genes (DEGs) across all clusters and marker for a specific cluster we use following codes. Lets say cluster 5 demonstrate high expression of M2 macrophage genes in the violin plot and feature plot. We can extract which genes are highly expressed in this cluster.
cluster5_markers <- FindMarkers(merged_seurat,
ident.1 = 5,
only.pos = TRUE,
logfc.threshold = 0.5,
min.pct = 0.25)
### For all clusters:
all_markers <- FindAllMarkers(merged_seurat,
only.pos = TRUE,
logfc.threshold = 0.5,
min.pct = 0.25)
head(all_markers)We can also download and export in a CSV file, all the genes expressed in each cluster after the filtering criteria. write.csv(all_markers, "all_cluster_markers.csv")
You can save this merged_seurat object in your computer and later reload this object instead of repeating this entire pipeline again.
# save Seurat object
saveRDS(merged_seurat, file = "merged_seurat.rds")
# Load Seurat Object Later
merged_seurat <- readRDS("merged_seurat.rds")- Annotate clusters manually by SingleR
- Subset macrophage cluster and reanalyze for key populations
- Run enrichment analysis (GO, KEGG, GSEA) on the marker genes