Spatially Variable Genes

Spatial transcriptomics
SVGs
Moran's I

Identify spatially variable genes whose expression changes across tissue locations. This lesson shows how to detect, visualize, and prioritize SVGs as markers of spatial domains and gradients.

Author

Noor Sohail

Published

March 20, 2026

Keywords

Variable genes, Spatial patterns, Spatial autocorrelation, Spatial gradients

Approximate time: XX minutes

Learning objectives

In this lesson, we will:

  • Define the Moran’s I calculation
  • Run Moran’s I to identify spatially variable genes
  • Visualize spatialy variable genes

Overview of lesson

At this step, instead of comparing defined cell types, you might ask which genes vary as a function of spatial location. In this lesson you apply methods for detecting spatially variable genes (SVGs), then map and visualize their expression patterns across the tissue. These genes often serve as interpretable anchors for defining domains, building signatures, and linking expression to morphology.

Spatially variable genes (SVGs)

Spatially variable genes (SVGs) are similar in concept to highly variable genes (HVGs). These were identifies by evaluating the variability of gene expression across the entire dataset. However, now we have the additional layer of the spatial location of each bin. Therefore we are able to identify genes that have different levels of expression depending on which area of the tissue they are found on.

Figure 1: Schematic of how spatially variable genes present themselves in spatial data.
Image source: Yan et al. (2025)

These methods can be used to identify genes that have strong strong spatial associations for downstream analyses - similar to how we used HVGs to calculate PCA. These SVGs can also be used to evaluate spatial niches within cell-types to assist in find subtypes with the tissue.

Ultimately, these genes provide another peek in the spatial context of the tissue.

Moran’s I

Many of the SVG algorithms are built upon the concept of Moran’s I which is defined as “a measure of spatial autocorrelation”. The math behind the method is defined like so:

\[ I = \frac{N}{S_0} \frac{ \sum_{i=1}^N \sum_{j=1}^N w_{ij}\,(x_i - \bar{x})(x_j - \bar{x}) }{ \sum_{i=1}^N (x_i - \bar{x})^2 } \]

where:

  • \(N\) is the number of spatial units, indexed by \(i\) and \(j\);
  • \(x\) is the variable of interest;
  • \(\bar{x}\) is the mean of \(x\);
  • \(w_{ij}\) are the elements of the spatial-weights matrix, with \(w_{ii} = 0\);
  • \(S_0\) is the sum of all spatial weights, \(S_0 = \sum_{i=1}^N \sum_{j=1}^N w_{ij}\).

For a random spatial distribution, Moran’s I will be near zero. A large positive value indicates the data is clustered. A large negative value indicates that the data is dispersed, meaning units that are close spatially have dissimilar variable values.

The range of values for Moran’s I E(I) in spatial transcriptomics is usually [-1, 1].

Figure 2: Range of Moran’s I values, showing the range of scores that correspond to different spatial structures (clustered, random, dispersed).
Image source: Tsui et al. (2022)

To see why this is the case, consider the expression \((x_i - \bar{x}) (x_j - \bar{x})\) in the above formula. If \(x_i\) and \(x_j\) are similar and far from the mean, then this product will be large and positive. If they are dissimilar (in particular, if one is well above the mean and one is far below) then this product will be large and negative. Note that Moran’s I depends strongly on the calculation of \(w_{ij}\), which represents the spatial relations of the bins. A large value \(w_{ij}\) indicates that bins \(x_i\) and \(x_j\) are close spatially.

Running Moran’s I

Moran’s I will work best if we focus on one sample at a time, as the spatial coordinates will be consistent within a single slide. Furthermore, because this step can take a long time to run, we are going to further subset the data to a singular celltype. In this way, we are attempting to evaluate genes that are specific to spatially defined subtype of a celltype.

# Moran's I
# Visium HD spatial transcriptomics workshop
# Author: Harvard Chan Bioinformatics Core
# Created: December 2025

# Load libararies
library(tidyverse)
library(Seurat)
set.seed(12345)

# Increases the size of the default vector (8GB)
options(future.globals.maxSize = 8000 * 1024^2)
# Set parallelization (multithreading) to speed up calculations
library(future)
plan(multisession, workers = parallel::detectCores() - 1)

# Load dataset
seurat_rctd <- readRDS("data/intermediate_seurat/seurat_RCTD.rds")

In this case, let us zoom in on the tumor cells in the CRC dataset:

# Subset to tumor P5CRC bins
crc_sub <- subset(seurat_rctd,
              subset = (orig.ident == "P5CRC") &
                       (first_type == "Tumor"))
crc_sub
An object of class Seurat 
36170 features across 12855 samples within 2 assays 
Active assay: Spatial.008um (18085 features, 2000 variable features)
 2 layers present: counts, data
 1 other assay present: sketch
 6 dimensional reductions calculated: pca.sketch, umap.sketch, full.pca.sketch, full.umap.sketch, pca.banksy, harmony.banksy
 1 spatial field of view present: P5CRC.008um

For the best results we are going to make use of the scale.data assay (this is typically calculated right before the PCA step). We only ever calculated this on the sketch assay, so here we will run the ScaleData() function on our entire subsetted dataset.

# Scale log-normalized counts
crc_sub <- ScaleData(crc_sub)
crc_sub
An object of class Seurat 
36170 features across 12855 samples within 2 assays 
Active assay: Spatial.008um (18085 features, 2000 variable features)
 3 layers present: counts, data, scale.data
 1 other assay present: sketch
 6 dimensional reductions calculated: pca.sketch, umap.sketch, full.pca.sketch, full.umap.sketch, pca.banksy, harmony.banksy
 1 spatial field of view present: P5CRC.008um

Now we can actually calculate Moran’s I with FindSpatiallyVariableFeatures() and set the selection.method = "moransi". We are additionally going to specify that we want to evaluate Moran’s I on the top 50 highly variable genes. A gene that is spatially variable is likely to also be highly variable because ultimately we are assessing changes in gene expression across different populations. We take the top 50 to reduce the number of calculations that take place (so we can reasonably run this on a laptop).

This step may take a few minutes to run.

# Run Moran's I on the top 50 highly variable genes
crc_sub <- FindSpatiallyVariableFeatures(crc_sub,
                                        selection.method = "moransi",
                                        verbose = TRUE,
                                        features = VariableFeatures(crc_sub)[1:50],
                                        image = "P5CRC.008um")

Once the calculation completes, we can evaluate the Moran’s I scores and associated p-values for each of the genes. We will have to do a bit of subsetting as we only calculated these values for 50 genes.

# Moran's I scores for each gene
SVFInfo(crc_sub, method = "moransi") %>%
  subset(!is.na(MoransI_observed)) %>%
  arrange(desc(MoransI_observed)) %>%
  head()
Table 1: Results of FindSpatiallyVariableFeatures() with associated Moran’s I scores and p-values.
MoransI_observed MoransI_p.value
REG3A 0.4772619 0.0009756
OLFM4 0.4213175 0.0009756
PI3 0.3579835 0.0009756
LCN2 0.2793870 0.0009756
MUC17 0.2544224 0.0009756
CHGB 0.2052781 0.0009756

Then, we can use the SpatiallyVariableFeatures() to get the list of SVGs that were just calculated. For visualization purposes, we are going to grab the top 10 genes.

# Get top 10 spatially variable features
top10_svg <- SpatiallyVariableFeatures(crc_sub, 
                                       method = "moransi")
top10_svg <- top10_svg[1:10]
top10_svg
 [1] "REG3A" "OLFM4" "PI3"   "LCN2"  "MUC17" "CHGB"  "IGHG1" "CHGA"  "IGKC" 
[10] "MUC6" 

And visualize the expression of these genes on the spatial slide:

# Spatial plot of top SVGs
SpatialFeaturePlot(crc_sub,
                   features = top10_svg,
                   pt.size.factor = 15,
                   image.alpha = 0.1, 
                   ncol = 5)
Figure 3: Top spatially variable gene expression for tumor cells, plotted on top of the tissue.

Recall in the BANKSY lesson we noted that the BANKSY algorithm identified different, spatially constrained subtypes in this cancerous region. Now we can clearly see which of those genes were defining the phenotype of these tumor subtypes.

# Dotplot of tumor clusters for top SVGs
DotPlot(crc_sub,
        top10_svg,
        group.by = "banksy_cluster",
        idents = c(6, 13, 14, 15, 19))
Figure 4: DotPlot of BANKSY cancerous clusters, showing top tumor spatially variable genes.

Now we can see that there is a phenotype that is distinct between each of these groups of cancerous regions. For example, cluster 15 and 19 appear to be have different expression patterns compared to the other groups of tumor bins.

Using these spatially variable genes, we can identify subtypes of broader cell types.

Other SVG methods

As the field of spatial transcriptomics is ever evolving, there are other algorithms that use other priors and calcluations to identify spatially variable genes. Some methods include, but are not limited to:

Note that several of these are written in Python.


Next Lesson >>

Back to Schedule

Reuse

CC-BY-4.0