Skip to content

Latest commit

 

History

History
82 lines (52 loc) · 5.15 KB

File metadata and controls

82 lines (52 loc) · 5.15 KB

RNA velocity

Basics

  • Goal: predict the future expression state of a cell
  • input: two gene-by-cell matrices
    • one with spliced (S)
    • one with unspliced counts (U)

For each gene, a phase-plot is constructed. That phase-plot is used to estimate the gene- and cell-specific velocity. The phase-plot is a scatterplot of the relevant row (gene) from the U and S matrices (well, a moment estimator of these quantities, but that is a minor point). In other words, the gene-specific velocities are determined by the relationship of S and U. (HansenLabBlog)

In the words of the Theis Lab:

For each gene, a steady-state-ratio of pre-mature (unspliced) and mature (spliced) mRNA counts is fitted, which constitutes a constant transcriptional state. Velocities are then obtained as residuals from this ratio. Positive velocity indicates that a gene is up-regulated, which occurs for cells that show higher abundance of unspliced mRNA for that gene than expected in steady state. Conversely, negative velocity indicates that a gene is down-regulated.

This gif illustrates the principles.

In brief, velocity is estimated for each gene in each cell and then projected into lower dimensional space to reveal the direction of cell fate transitions. The extraplolation of traditional RNA velocity measurements is valid for approx. a couple of hours (based on Qiu et al., 2022.

Batch effect

Batch effect seems to be present according to the Hansen Lab. (HansenLabBlog)

Challenge for RNA velocity: we need to batch correct not 1 but 2 matrices simultaneously

scVelo currently does not pay attention to this, as they state "any additional preprocessing step only affects X and is not applied to spliced/unspliced counts." (Ref)

HansenLab suggests:

we correct S and U for library size, and form M=U+S. Then we log-transform M as log(M+1), use ComBat and invert the log transformation. This is what we feed to scVelo and friends. Details

Processing details

The starting point for any type of velocity analysis: 2 count matrices of pre-mature (unspliced) and mature (spliced) abundances.

These can be obtained from standard sequencing protocols, using the velocyto.py or loompy/kallisto counting pipeline.

Velocyto offers multiple wrappers around 10X Genomics (CellRanger) data, Smart-seq2 data etc. It essentially looks at every single mapped read and determines whether it represents a spliced, unspliced or ambiguous molecule.

The BAM file will have to:

  • Be sorted by mapping position.
  • Represents either a single sample (multiple cells prepared using a certain barcode set in a single experiment) or single cell.
  • Contain an error corrected cell barcodes as a TAG named CB or XC.
  • Contain an error corrected molecular barcodes as a TAG named UB or XM.
# wrapper for 10X
velocyto run10x -m repeat_msk.gtf mypath/sample01 somepath/refdata-cellranger-mm10-1.2.0/genes/genes.gtf

# same thing with the simple velocyto run command
velocyto run -b filtered_barcodes.tsv -o output_path -m repeat_msk_srt.gtf bam_file.bam annotation.gtf

The output is a 4-layered loom file, i.e. an HDF5 file that contains specific groups representing the main matrix as well as row and column attribute. (loom details here).

For a more detailed run-down of how to move from R-processed data over to velocity, see Sam's description or Basil's.

scanpy, scVelo and AnnData

scVelo is based on adata

  • stores a data matrix adata.X,
  • annotation of observations adata.obs
  • variables adata.var, and
  • unstructured annotations adata.uns
  • computed velocities are stored in adata.layers just like the count matrices.

Names of observations and variables can be accessed via adata.obs_names and adata.var_names, respectively.

AnnData objects can be sliced like dataframes: adata_subset = adata[:, list_of_gene_names]. For more details, see the anndata docs.

Additional resources: