scRNA analysis pipeline: Multi-Sample scRNA-seq Integration & Analysis

This is the Seurat workflow applied to single-cell RNA sequencing data from multiple samples using the Seurat package in R. I have added some details and steps in a way that will help to start from scratch and codes needed especially for a beginner. The goal is to read raw data, create Seurat object, perform quality control to filter out poor quality cells, merge and integrate multiple samples to correct for batch effects and explore a cell population for e.g M2 Tumor-Associated Macrophages (TAMs) as an example.

Note: This tutorial has used 4 samples (BC1–BC4) as an example. You can extend this workflow if you have more samples (e.g., BC5–BC18) as per your data availability.

Prerequisites

Make sure you have the following R packages installed:

install.packages("tidyverse")
install.packages("Seurat")

Then load the libraries:

library("Seurat")
library("Matrix")
library("tidyverse")

1. Load Raw Expression Matrices

In scRNA typically you will get three files for scRNA data for each sample generated by the Cell Ranger pipeline from 10X Genomics. Each sample will include:

matrix.mtx: the raw count matrix
features.tsv: gene names
barcodes.tsv: cell barcodes ReadMtx function will read these three files for each sample and convert into a single sparse matrix. Consider this count matrix of feature, barcode, matrix as a count matrix in bulk RNA-seq data. The rows are genes and columns are cell barcodes or cells. We will then convert this sparse count matrix into a seurat object.

# Replace these paths with your own where your files are placed in your local computer
BC1 <- ReadMtx(mtx = "path/to/BC1/matrix.mtx", features = "path/to/BC1/features.tsv",
 cells = "path/to/BC1/barcodes.tsv")

BC2 <- ReadMtx(mtx = "path/to/BC2/matrix.mtx",features = "path/to/BC2/features.tsv",
  cells = "path/to/BC2/barcodes.tsv")

BC3 <- ReadMtx(mtx = "path/to/BC3/matrix.mtx", features = "path/to/BC3/features.tsv",
  cells = "path/to/BC3/barcodes.tsv")

BC4 <- ReadMtx(mtx = "path/to/BC4/matrix.mtx", features = "path/to/BC4/features.tsv",
  cells = "path/to/BC4/barcodes.tsv")

2. Create Seurat Objects

We now create individual Seurat objects for each sample. Each object represent a single sample’s scRNA data. We apply basic quality control filter:

min.cells = 3: Only keep genes expressed in at least 3 cells.
min.features = 200: Only keep cells that have at least 200 detected genes (features). In addition, we also calculate the percentage of mitrochondrial gene expression extracted by "^MT-" pattern from the data. This will create a new column percent.mt in the metadata that will demonstrate the mitochondrial percentage.

# Sample BC1
BC1 <- CreateSeuratObject(counts = BC1, project = "BC1", min.cells = 3, min.features = 200)
BC1 <- PercentageFeatureSet(BC1, pattern = "^MT-", col.name = "percent.mt")

# Sample BC2
BC2 <- CreateSeuratObject(counts = BC2, project = "BC2", min.cells = 3, min.features = 200)
BC2 <- PercentageFeatureSet(BC2, pattern = "^MT-", col.name = "percent.mt")

# Sample BC3
BC3 <- CreateSeuratObject(counts = BC3, project = "BC3", min.cells = 3, min.features = 200)
BC3 <- PercentageFeatureSet(BC3, pattern = "^MT-", col.name = "percent.mt")

# Sample BC4
BC4 <- CreateSeuratObject(counts = BC4, project = "BC4", min.cells = 3, min.features = 200)
BC4 <- PercentageFeatureSet(BC4, pattern = "^MT-", col.name = "percent.mt")
# View each of these Seurat objects metadata separately
View(BC1@metadata)

You will see three important columns in the metadata, which will be used for filtering in next step:

nCount_RNA: Total number of unique molecules detected in a cell
nfeature_RNA: Total number of unique genes in a cell
percent.mt: mitochondrial percentage in each cell

3. Quality Control and Filtering (Complex Filtering)

Next we will filter out poor-quality cells with more stringent quality control filter.

Cells with too few features (genes) by nFeature_RNA > 200
Cells with high mitochondrial gene content: percent.mt > 20
High complexity ratio : log10(nFeature_RNA) / log10(nCount_RNA)) These filters will remove dead or damaged cells (high mitochondrial content) and technical artifacts like low-quality droplets or doublets which have high nFeature_RNA and nCount_RNA.

BC1$complexity_ratio <- log10(BC1$nFeature_RNA) / log10(BC1$nCount_RNA)
BC1 <- subset(BC1, subset = nFeature_RNA > 200 & complexity_ratio > 0.8 & percent.mt < 20)

BC2$complexity_ratio <- log10(BC2$nFeature_RNA) / log10(BC2$nCount_RNA)
BC2 <- subset(BC2, subset = nFeature_RNA > 200 & complexity_ratio > 0.8 & percent.mt < 20)

BC3$complexity_ratio <- log10(BC3$nFeature_RNA) / log10(BC3$nCount_RNA)
BC3 <- subset(BC3, subset = nFeature_RNA > 200 & complexity_ratio > 0.8 & percent.mt < 20)

BC4$complexity_ratio <- log10(BC4$nFeature_RNA) / log10(BC4$nCount_RNA)
BC4 <- subset(BC4, subset = nFeature_RNA > 200 & complexity_ratio > 0.8 & percent.mt < 20)

4. Merge Seurat Objects

We now merge all the four filtered samples (individual Seurat objects) into a single Seurat object. This is needed to perform joint analysis (normalization, clustering, and downstream analysis).

merged_seurat <- merge(BC1, y = c(BC2, BC3, BC4),
                       add.cell.ids = c("BC1", "BC2", "BC3", "BC4"),
                       project = "MergedProject")

# Add two new columns “patient” and "barcode" by separating the sample column(which contain cell barcode with its respective sample name) in the metadata for sample tracking
merged_seurat$sample <- rownames(merged_seurat@meta.data)
merged_seurat@meta.data <- separate(merged_seurat@meta.data, col = "sample", into = c("Patient", "Barcode"), sep = "_")

5. Preprocessing for PCA and clustering (Before Integration)

Before we integrate datasets, we must normalize the raw counts, identify variable genes and scale the data to make it comparable across cells and samples.

# Normalize gene expression counts using log-normalization
merged_seurat <- NormalizeData(merged_seurat)

# Identify 2000 highly variable genes (used for downstream steps like PCA) in each cell
merged_seurat <- FindVariableFeatures(merged_seurat)

# Scale and centering the data (zero-mean, unit variance)
merged_seurat <- ScaleData(merged_seurat)

# Run Principal Component Analysis (PCA) to reduce dimensionality
merged_seurat <- RunPCA(merged_seurat)

6. Determine Optimal Number of Principal Components

Choosing the right number of Principal Components (PCs) is crucial for clustering and integration.Seurat suggest Elbow Plot for this purpose. In addition to elbow plot, I use a heuristic method to select PCs. Too few PCs can miss meaningful clusters, too many include noise. We select PCs capturing most variance. You can just use the elbow plot but I use this heuristic method to get exact optimal PCs to proceed (but this is optional step).

# Visual inspection
ElbowPlot(merged_seurat, ndims = 50)

# Heuristic approach to select PCs
pct <- merged_seurat[["pca"]]@stdev / sum(merged_seurat[["pca"]]@stdev) * 100
cumu <- cumsum(pct)
co1 <- which(cumu > 90 & pct < 5)[1]
co2 <- sort(which((pct[1:length(pct) - 1] - pct[2:length(pct)]) > 0.1), decreasing = TRUE)[1] + 1
pcs <- min(co1, co2)
pcs #return the optimal PCs

7. Clustering Without Integration

We cluster cells based on selected PCs identified in the previous step and use UMAP to visualize batch effects before the integration step. This helps us to see the clustering pattern and how much batch effect might exist. Lets assume we get 15 PCs in the last step, we will use 15 PCs here.

merged_seurat <- FindNeighbors(merged_seurat, dims = 1:15, reduction = "pca")
merged_seurat <- FindClusters(merged_seurat, resolution = 1, cluster.name = "unintegrated_clusters")
merged_seurat <- RunUMAP(merged_seurat, dims = 1:15, reduction = "pca", reduction.name = "umap.unintegrated")

#Plot to visualize cells based on clusters and patient samples 
DimPlot(merged_seurat, reduction = "umap.unintegrated", group.by = c("Patient", "seurat_clusters"))

8. Integrate samples using CCA

Different samples might have batch effects due to technical variation (e.g processed on different days, sequencing machines, different labs). Seurat use Canonical Correlation Analysis (CCA) to integrate different samples and reduce technical noise. Seurat also provide other integration methods such as harmony, RPCA etc. You can explore various methods by adding the specific method in method below, I am showing CCA integration method here.

merged_seurat <- IntegrateLayers(object = merged_seurat, method = CCAIntegration,
  orig.reduction = "pca", new.reduction = "integrated.cca", verbose = TRUE)

# Join RNA layers for downstream analysis
merged_seurat[["RNA"]] <- JoinLayers(merged_seurat[["RNA"]])

9. Re-Clustering on Integrated Data

We will repeat PCA, clustering and UMAP-this time on the batch corrected integrated data after CCA integration method.

#Explore through ElbowPlot to select optimum PCs. Repeat the Heuristic approach to select PCs described previously.
ElbowPlot(merged, ndims = 50)

# Re-compute dimensionality reduction and clustering
merged_seurat <- FindNeighbors(merged_seurat, reduction = "integrated.cca", dims = 1:12)
merged_seurat <- FindClusters(merged_seurat, resolution = 1)
merged_seurat <- RunUMAP(merged_seurat, dims = 1:12, reduction = "integrated.cca")

# UMAP of integrated clusters
DimPlot(merged_seurat, reduction = "umap", group.by = "seurat_clusters", label = TRUE)

10. Explore Tumor-Associated Macrophage (M2 TAM) Markers

After the above step the data is integrated. We can now use cannonical markers to identify various cell types. For instance epithelial cells with 'EPCAM' gene, COL1A1 for fibroblast, LYZ for immune cell etc. Here We demonstrate expression of M2-like macrophage cell using markers like CD163, MRC1, CD14 in the integrated dataset's clusters through a violin plot and feature plot. These genes are commonly associated with M2-like macrophages, a subtype of TAMs involved in immune suppression and tumor progression. This will help us to identify which cluster is associated with M2-TAM macrophage.

VlnPlot(merged_seurat, features = c("CD163", "MRC1", "CD14"))
FeaturePlot(merged_seurat, features = c("CD163", "MRC1", "CD14"), reduction = "umap")

11. Identify Marker Genes

To find differentially expressed genes (DEGs) across all clusters and marker for a specific cluster we use following codes. Lets say cluster 5 demonstrate high expression of M2 macrophage genes in the violin plot and feature plot. We can extract which genes are highly expressed in this cluster.

For a specific cluster (e.g., cluster 5):

cluster5_markers <- FindMarkers(merged_seurat,
                                ident.1 = 5,
                                only.pos = TRUE,
                                logfc.threshold = 0.5,
                                min.pct = 0.25)
### For all clusters:
all_markers <- FindAllMarkers(merged_seurat,
                              only.pos = TRUE,
                              logfc.threshold = 0.5,
                              min.pct = 0.25)
head(all_markers)

Export Marker Genes to CSV:

We can also download and export in a CSV file, all the genes expressed in each cluster after the filtering criteria. write.csv(all_markers, "all_cluster_markers.csv")

Save Seurat Object:

You can save this merged_seurat object in your computer and later reload this object instead of repeating this entire pipeline again.

# save Seurat object
saveRDS(merged_seurat, file = "merged_seurat.rds")
# Load Seurat Object Later
merged_seurat <- readRDS("merged_seurat.rds")

Next steps

Annotate clusters manually by SingleR
Subset macrophage cluster and reanalyze for key populations
Run enrichment analysis (GO, KEGG, GSEA) on the marker genes

Name		Name	Last commit message	Last commit date
Latest commit History 2 Commits
README.md		README.md

Provide feedback

Saved searches

Use saved searches to filter your results more quickly

Repository files navigation

scRNA analysis pipeline: Multi-Sample scRNA-seq Integration & Analysis

Prerequisites

1. Load Raw Expression Matrices

2. Create Seurat Objects

3. Quality Control and Filtering (Complex Filtering)

4. Merge Seurat Objects

5. Preprocessing for PCA and clustering (Before Integration)

6. Determine Optimal Number of Principal Components

7. Clustering Without Integration

8. Integrate samples using CCA

9. Re-Clustering on Integrated Data

10. Explore Tumor-Associated Macrophage (M2 TAM) Markers

11. Identify Marker Genes

For a specific cluster (e.g., cluster 5):

Export Marker Genes to CSV:

Save Seurat Object:

Next steps

About

Uh oh!

Releases

Packages

Uh oh!

Contributors

Uh oh!

Folders and files

Latest commit

History

Repository files navigation

scRNA analysis pipeline: Multi-Sample scRNA-seq Integration & Analysis

Prerequisites

1. Load Raw Expression Matrices

2. Create Seurat Objects

3. Quality Control and Filtering (Complex Filtering)

4. Merge Seurat Objects

5. Preprocessing for PCA and clustering (Before Integration)

6. Determine Optimal Number of Principal Components

7. Clustering Without Integration

8. Integrate samples using CCA

9. Re-Clustering on Integrated Data

10. Explore Tumor-Associated Macrophage (M2 TAM) Markers

11. Identify Marker Genes

For a specific cluster (e.g., cluster 5):

Export Marker Genes to CSV:

Save Seurat Object:

Next steps

About

Resources

Uh oh!

Stars

Watchers

Forks

Releases

Packages 0

Uh oh!

Contributors

Uh oh!

Packages