Identifying Unwanted Variation

Author

Will Gammerdinger, Noor Sohail, Zhu Zhuo, James Billingsley, Shannan Ho Sui

Published

July 22, 2025

Approximate time: XY minutes

Learning Objectives

  • LO 1
  • Perform clustering of cells based on significant principal components
  • LO3

Single-Cell RNA-seq (scRNA-seq) Workflow

We will be performing a standard scRNA-seq workflow on our sketch assay. For more explicit details on each one of these steps, we have materials on how to conduct a scRNA-seq analysis from start to end.

Here we show the overarching steps that are takin in a classical single-cell analysis.

The broader principles of this workflow remain the same when working with spatial transcriptomics data. This lesson will cover the following steps:

  1. FindVariableFeatures(): As before, this generates a list of highly variable genes, which may be slightly different for the sketched dataset than for the full dataset
  2. ScaleData(): Highly variable genes will be confounded with the most highly expressed genes, so we need to adjust for this
  3. RunPCA(): Perform a principal component analysis using our scaled data and variable genes. This will emphasize variation in gene expression as well as similarity across bins
  4. FindNeighbors(): Determine the Euclidean distance between bins in PCA space
  5. FindClusters(): Iteratively group bins together based on neighborhood distances. Higher resolution will yield more groups.
  6. RunUMAP(): TODO

With that being said, there are certain elements that are unique to a spatial dataset that we will highlight as we go through all of these steps.

Variation in the Dataset

library(Seurat)
library(tidyverse)

seurat_sketch <- readRDS("data/seurat_sketch.RDS")

Identify Highly Variable Features

Recall that when we were working with the entire dataset, we found the following genes to be the most variable:

# Identify the 15 most highly variable genes
ranked_variable_genes <- VariableFeatures(seurat_sketch, assay="Spatial.016um")
top_genes <- ranked_variable_genes[1:15]
top_genes
 [1] "SPP1"  "IGHG1" "IGHM"  "IGHG3" "CXCL8" "INSL5" "IGLC1" "MMP12" "CHGA" 
[10] "VIP"   "SST"   "GCG"   "REG3A" "HBA2"  "PYY"  

We are going to run FindVariableFeatures() once more, but this time on our smaller dataset of 10,000 bins.

# Identify the most variable genes
seurat_processed <- FindVariableFeatures(seurat_sketch, 
                                         selection.method = "vst",
                                         nfeatures = 2000,
                                         assay="sketch")
seurat_processed
An object of class Seurat 
36170 features across 67660 samples within 2 assays 
Active assay: sketch (18085 features, 2000 variable features)
 2 layers present: counts, data
 1 other assay present: Spatial.016um
 2 spatial fields of view present: P5CRC.016um P5NAT.016um

Do you seen any differences when we calculated HVGs for the entire dataset versus the sketch assay?

# Identify the 15 most highly variable genes
ranked_variable_genes <- VariableFeatures(seurat_processed, assay="sketch")
ranked_variable_genes[1:15]
 [1] "IGHG1" "IGHM"  "CXCL8" "SPP1"  "IGLC1" "CHGA"  "VIP"   "HBA2"  "REG3A"
[10] "MMP12" "INSL5" "GCG"   "CALB2" "PYY"   "MMP9" 

Visualize Highly Variable Genes

We can visualize the dispersion of all genes using Seurat’s VariableFeaturePlot(), which shows a gene’s average expression across all cells on the x-axis and variance on the y-axis. Ideally we would see genes lay across the entire x-axis, as we want to ensure that we are using genes that are both lowly and highly expressed.

For the visualization, we are adding labels using LabelPoints() to help us better understand which genes are driving the shape of our data.

# Identify the 15 most highly variable genes
ranked_variable_genes <- VariableFeatures(seurat_processed,
                                          assay = "sketch")
top_genes <- ranked_variable_genes[1:15]

# Plot the average expression and variance of these genes
# With labels to indicate which genes are in the top 15
p <- VariableFeaturePlot(seurat_processed)
LabelPoints(plot = p, points = top_genes, repel = TRUE)

Now we are primed to run the next step of the workflow, which is Principal Component Analysis.

PCA

Principal Component Analysis (PCA) is a technique used to emphasize variation as well as similarity, and to bring out strong patterns in a dataset; it is one of the methods used for “dimensionality reduction”.

For a more detailed explanation on PCA, please look over this lesson (adapted from StatQuests/Josh Starmer’s YouTube video). We also strongly encourage you to explore the video StatQuest’s video for a more thorough understanding.

Let’s say you are working with a single-cell RNA-seq dataset with 12,000 cells and you have quantified the expression of 20,000 genes. The schematic below demonstrates how you would go from a cell x gene matrix to principal component (PC) scores for each individual cell.

To perform PCA, we need to first choose the most variable features, then scale the data. Since highly expressed genes exhibit the highest amount of variation and we don’t want our ‘highly variable genes’ only to reflect high expression, we need to scale the data to scale variation with expression level. The Seurat ScaleData() function will scale the data by:

  • Adjusting the expression of each gene to give a mean expression across cells to be 0
  • Scaling expression of each gene to give a variance across cells to be 1
# Scale the data counts
seurat_processed <- ScaleData(seurat_processed)
seurat_processed
An object of class Seurat 
36170 features across 67660 samples within 2 assays 
Active assay: sketch (18085 features, 2000 variable features)
 3 layers present: counts, data, scale.data
 1 other assay present: Spatial.016um
 2 spatial fields of view present: P5CRC.016um P5NAT.016um

We have calculated a new layer titled sale.data, where our scaled log-normalized counts now exist. So now we can proceed in calculating our Principal Components with RunPCA().

# Run PCA
seurat_processed <- RunPCA(seurat_processed,
                           assay = "sketch",
                           reduction.name = "pca.sketch")

The information printed out show the top genes that influence a cell’s score for each component. This includes genes that have both a positive and negative weight.

If we look at our updated seurat_processed object:

seurat_processed
An object of class Seurat 
36170 features across 67660 samples within 2 assays 
Active assay: sketch (18085 features, 2000 variable features)
 3 layers present: counts, data, scale.data
 1 other assay present: Spatial.016um
 1 dimensional reduction calculated: pca.sketch
 2 spatial fields of view present: P5CRC.016um P5NAT.016um

We see that we have stored our PCA results: dimensional reduction calculated: pca.sketch

Visualize PCA

After the PC scores have been calculated, you are looking at a matrix of 12,000 x 12,000 that represents the information about relative gene expression in all the cells. You can select the PC1 and PC2 columns and plot that in a 2D way.

DimPlot(seurat_processed,
        group.by = "orig.ident",
        reduction = "pca.sketch")

Ultimately what we have calculated with PCA are values that can represent how similar cells are to one another. We would expect that cells with similar gene expression will have similar PC values. For example, we would anticipate that two Fibroblast cells would have comparable gene expression - which would also results in similar scores in principal components. This is why many of the downstream steps in the remainder of this lesson use PCA as the input.

Also assess the differences in coordinates based on batch variables.

DimPlot(seurat_processed,
        reduction = "pca.sketch",
        group.by = "orig.ident",
        split.by = "orig.ident")

Selecting PC Dimensions

To overcome the extensive technical noise in the expression of any single gene for scRNA-seq data, Seurat assigns cells to clusters based on their PCA scores derived from the expression of the integrated most variable genes. Each PC essentially representing a “metagene” that combines information across a correlated gene set. Determining how many PCs to include in the clustering step is therefore important to ensure that we are capturing the majority of the variation, or cell types, present in our dataset.

It is useful to explore the PCs prior to deciding which PCs to include for the downstream clustering analysis.

The elbow plot is a helpful way to determine how many PCs to use for clustering so that we are capturing majority of the variation in the data. The elbow plot visualizes the standard deviation of each PC, and we are looking for where the standard deviations begins to plateau. Essentially, where the elbow appears is usually the threshold for identifying the majority of the variation. However, this method can be quite subjective.

Let’s draw an elbow plot using the top 50 PCs:

# Plot the elbow plot
ElbowPlot(object = seurat_processed,
          reduction = "pca.sketch",
          ndims = 50)

Based on this plot, we could roughly determine the majority of the variation by where the elbow occurs around PC8 - PC10, or one could argue that it should be when the data points start to get close to the X-axis, PC30 or so. This gives us a very rough idea of the number of PCs needed to be included.

We will be using 50 PCs for our downstream steps.

Grouping Similar Cells Together

Seurat uses a graph-based clustering approach using a K-nearest neighbor approach, and then attempts to partition this graph into highly interconnected ‘quasi-cliques’ or ‘communities’ [Seurat - Guided Clustering Tutorial]. A nice in-depth description of clustering methods is provided in the SVI Bioinformatics and Cellular Genomics Lab course.

k-Nearest Neighbors

The first step is to construct a K-nearest neighbor (KNN) graph based on the euclidean distance in PCA space.

Image source: Analysis of Single cell RNA-seq data

  • Edges are drawn between cells with similar features expression patterns.
  • Edge weights are refined between any two cells based on shared overlap in their local neighborhoods.

This is done in Seurat by using the FindNeighbors() function. We also specify the reduction we want to use as well as which components (dims) will be used for the calculation. In this case, we will be using the first 50.

# Determine the K-nearest neighbor graph
seurat_processed <- FindNeighbors(seurat_processed, 
                                  assay = "sketch", 
                                  reduction = "pca.sketch",
                                  dims = 1:50)

Clustering

Next, Seurat will iteratively group cells together with the goal of optimizing the standard modularity function.

We will use the FindClusters() function to perform the graph-based clustering. The resolution is an important argument that sets the “granularity” of the downstream clustering and will need to be optimized for every individual experiment. For datasets of 3,000 - 5,000 cells, the resolution set between 0.4-1.4 generally yields good clustering. Increased resolution values lead to a greater number of clusters, which is often required for larger datasets.

The FindClusters() function allows us to enter a series of resolutions and will calculate the “granularity” of the clustering. This is very helpful for testing which resolution works for moving forward without having to run the function for each resolution.

Testing different resolution values

Typically we would recommend using a variety of different resolutions and select a resolution that best suits your dataset. However, for the sake of time, we will proceed with a resolution = 0.65. We have more detailed explanations of how to test a variety of resolutions at one time.

seurat_processed <- FindClusters(seurat_processed, 
                                 cluster.name = "seurat_cluster.sketched", 
                                 resolution = 0.65)
Modularity Optimizer version 1.3.0 by Ludo Waltman and Nees Jan van Eck

Number of nodes: 10000
Number of edges: 415959

Running Louvain algorithm...
Maximum modularity in 10 random starts: 0.8844
Number of communities: 18
Elapsed time: 1 seconds

If we take a look at the columns in our @meta.data, we see that the column seurat_cluster.sketched is a new column that was added.

seurat_processed@meta.data %>% head() %>% relocate("seurat_cluster.sketched")
                            seurat_cluster.sketched orig.ident
P5CRC_s_016um_00050_00315-1                    <NA>      P5CRC
P5CRC_s_016um_00204_00145-1                    <NA>      P5CRC
P5CRC_s_016um_00237_00273-1                    <NA>      P5CRC
P5CRC_s_016um_00224_00126-1                    <NA>      P5CRC
P5CRC_s_016um_00191_00159-1                       8      P5CRC
P5CRC_s_016um_00064_00214-1                    <NA>      P5CRC
                            nCount_Spatial.016um nFeature_Spatial.016um
P5CRC_s_016um_00050_00315-1                   80                     75
P5CRC_s_016um_00204_00145-1                   81                     73
P5CRC_s_016um_00237_00273-1                  278                    225
P5CRC_s_016um_00224_00126-1                  692                    511
P5CRC_s_016um_00191_00159-1                   35                     34
P5CRC_s_016um_00064_00214-1                 1250                    724
                            mitoRatio log10GenesPerUMI leverage.score
P5CRC_s_016um_00050_00315-1 0.1000000        0.9852720      0.6907727
P5CRC_s_016um_00204_00145-1 0.1234568        0.9763361      0.4799947
P5CRC_s_016um_00237_00273-1 0.0323741        0.9624138      0.3291851
P5CRC_s_016um_00224_00126-1 0.2095376        0.9536337      1.4497517
P5CRC_s_016um_00191_00159-1 0.1142857        0.9918468      6.8221965
P5CRC_s_016um_00064_00214-1 0.2176000        0.9234167      0.6191210
                            seurat_clusters
P5CRC_s_016um_00050_00315-1            <NA>
P5CRC_s_016um_00204_00145-1            <NA>
P5CRC_s_016um_00237_00273-1            <NA>
P5CRC_s_016um_00224_00126-1            <NA>
P5CRC_s_016um_00191_00159-1               8
P5CRC_s_016um_00064_00214-1            <NA>

When we looked at the first few rows of our metadata, it appeared that there were many cells that do not have a cluster value.

Count how many NA’s are found for our cluster with table() function and use the argument (useNA = "ifany"):

table(seurat_processed$seurat_cluster.sketched, useNA = "ifany")

    0     1     2     3     4     5     6     7     8     9    10    11    12 
 1357  1274  1079   898   828   624   579   545   497   495   463   402   300 
   13    14    15    16    17  <NA> 
  272   127   106    80    74 57660 

Why do you think there are so many NA values?

FindClusters() tends to automatically change the Idents() to the cluster values that were just calculated. If we take a look at our Idents we can see if that it has changed from orig.ident to seurat_cluster.sketched

Idents(seurat_processed) %>% head()
P5CRC_s_016um_00050_00315-1 P5CRC_s_016um_00204_00145-1 
                       <NA>                        <NA> 
P5CRC_s_016um_00237_00273-1 P5CRC_s_016um_00224_00126-1 
                       <NA>                        <NA> 
P5CRC_s_016um_00191_00159-1 P5CRC_s_016um_00064_00214-1 
                          8                        <NA> 
Levels: 0 1 2 3 4 5 6 7 8 9 10 11 12 13 14 15 16 17

UMAP

To visualize the cell clusters, there are a few different dimensionality reduction techniques that can be helpful. The most popular methods include t-distributed stochastic neighbor embedding (t-SNE) and Uniform Manifold Approximation and Projection (UMAP) techniques.

Both methods aim to place cells with similar local neighborhoods in high-dimensional space together in low-dimensional space. These methods will require you to input number of PCA dimentions to use for the visualization, we suggest using the same number of PCs as input to the clustering analysis. Here, we will proceed with the UMAP method for visualizing the clusters.

seurat_processed <- RunUMAP(seurat_processed, 
                            reduction = "pca.sketch", 
                            reduction.name = "umap.sketch", 
                            return.model = T, dims = 1:50)
seurat_processed
An object of class Seurat 
36170 features across 67660 samples within 2 assays 
Active assay: sketch (18085 features, 2000 variable features)
 3 layers present: counts, data, scale.data
 1 other assay present: Spatial.016um
 2 dimensional reductions calculated: pca.sketch, umap.sketch
 2 spatial fields of view present: P5CRC.016um P5NAT.016um
DimPlot(seurat_processed, group.by="orig.ident", reduction="umap.sketch")

These UMAP coordinates are now stored in the dimensional reductions calculated: pca.sketch, umap.sketch. Now we can visualize with the clusters that we calculated earlier using DimPlot() and adding clearer labels withLabelClusters()`.

# Plot UMAP
p <- DimPlot(seurat_processed, 
        reduction = "umap.sketch", label = T) + 
  ggtitle("Sketched clustering")
LabelClusters(p, id = "ident",  fontface = "bold", size = 5, 
              bg.colour = "white", bg.r = .2, force = 0)

Project Back to Entire Dataset

Now that we have our clusters and dimensional reductions from our sketched dataset, we need to extend these to the full dataset.

Run ProjectData()

The ProjectData function projects all the bins in the dataset (the Spatial.016um assay) onto the sketch assay.

# Increases the size of the default vector
options(future.globals.maxSize= 2000000000)

seurat_processed <- ProjectData(
  object = seurat_processed,
  assay = "Spatial.016um",
  full.reduction = "full.pca.sketch",
  sketched.assay = "sketch",
  sketched.reduction = "pca.sketch",
  umap.model = "umap.sketch",
  dims = 1:50,
  refdata = list(seurat_cluster.projected = "seurat_cluster.sketched"))

Using the sketch PCA and UMAP, the ProjectData function returns a Seurat object that includes:

Category Description
Dimensional reduction (PCA) The full.pca.sketch reduction extends the PCA from sketched cells to all bins in the dataset
Dimensional reduction (UMAP) The full.umap.sketch reduction extends the UMAP from sketched cells to all bins in the dataset
Cluster labels The seurat_cluster.projected metadata column labels all cells with cluster labels from sketched cells

We can now see the additional full-dataset reductions in the object.

seurat_processed
An object of class Seurat 
36170 features across 67660 samples within 2 assays 
Active assay: Spatial.016um (18085 features, 2000 variable features)
 2 layers present: counts, data
 1 other assay present: sketch
 4 dimensional reductions calculated: pca.sketch, umap.sketch, full.pca.sketch, full.umap.sketch
 2 spatial fields of view present: P5CRC.016um P5NAT.016um

Note that a score for the projection of each bin will be saved as a column in the metadata. Now the seurat_cluster.projected column shows values for every bin.

head(seurat_processed@meta.data)
                            orig.ident nCount_Spatial.016um
P5CRC_s_016um_00050_00315-1      P5CRC                   80
P5CRC_s_016um_00204_00145-1      P5CRC                   81
P5CRC_s_016um_00237_00273-1      P5CRC                  278
P5CRC_s_016um_00224_00126-1      P5CRC                  692
P5CRC_s_016um_00191_00159-1      P5CRC                   35
P5CRC_s_016um_00064_00214-1      P5CRC                 1250
                            nFeature_Spatial.016um mitoRatio log10GenesPerUMI
P5CRC_s_016um_00050_00315-1                     75 0.1000000        0.9852720
P5CRC_s_016um_00204_00145-1                     73 0.1234568        0.9763361
P5CRC_s_016um_00237_00273-1                    225 0.0323741        0.9624138
P5CRC_s_016um_00224_00126-1                    511 0.2095376        0.9536337
P5CRC_s_016um_00191_00159-1                     34 0.1142857        0.9918468
P5CRC_s_016um_00064_00214-1                    724 0.2176000        0.9234167
                            leverage.score seurat_cluster.sketched
P5CRC_s_016um_00050_00315-1      0.6907727                    <NA>
P5CRC_s_016um_00204_00145-1      0.4799947                    <NA>
P5CRC_s_016um_00237_00273-1      0.3291851                    <NA>
P5CRC_s_016um_00224_00126-1      1.4497517                    <NA>
P5CRC_s_016um_00191_00159-1      6.8221965                       8
P5CRC_s_016um_00064_00214-1      0.6191210                    <NA>
                            seurat_clusters seurat_cluster.projected
P5CRC_s_016um_00050_00315-1            <NA>                        8
P5CRC_s_016um_00204_00145-1            <NA>                        8
P5CRC_s_016um_00237_00273-1            <NA>                       11
P5CRC_s_016um_00224_00126-1            <NA>                       12
P5CRC_s_016um_00191_00159-1               8                        8
P5CRC_s_016um_00064_00214-1            <NA>                        2
                            seurat_cluster.projected.score
P5CRC_s_016um_00050_00315-1                              1
P5CRC_s_016um_00204_00145-1                              1
P5CRC_s_016um_00237_00273-1                              1
P5CRC_s_016um_00224_00126-1                              1
P5CRC_s_016um_00191_00159-1                              1
P5CRC_s_016um_00064_00214-1                              1

Visualize Projected Clusters

We can now visualize our clusters from the projected assignments. The UMAP plot now contains more points, which is expected because we are now visualizing the full dataset rather than our 10,000 bin sketch. Nonetheless, we can see that the full dataset is still well-represented by the projected dimensional reduction and clustering.

# switch to full dataset assay
DefaultAssay(seurat_processed) <- "Spatial.016um"

# Change the idents to the projected cluster assignments
Idents(seurat_processed) <- "seurat_cluster.projected"

# Plot UMAP
p <- DimPlot(seurat_processed, 
        reduction = "full.umap.sketch", label = T) + 
  ggtitle("Projected clustering")
LabelClusters(p, id = "ident",  fontface = "bold", size = 5, 
              bg.colour = "white", bg.r = .2, force = 0)

My cluster labels are all out of order!

To ensure that our cluster labels are listed in proper numeric order, we are going to make our seurat_cluster.projected a factor and set the levels such that the order is correct.

# Arrange so clusters get listed in numerical order
seurat_processed$seurat_cluster.projected <- seurat_processed$seurat_cluster.projected %>% 
  as.numeric() %>%
  as.factor()

seurat_processed$seurat_cluster.projected %>% head()
P5CRC_s_016um_00050_00315-1 P5CRC_s_016um_00204_00145-1 
                         12                          12 
P5CRC_s_016um_00237_00273-1 P5CRC_s_016um_00224_00126-1 
                          5                          14 
P5CRC_s_016um_00191_00159-1 P5CRC_s_016um_00064_00214-1 
                         12                           2 
Levels: 1 2 3 4 5 6 7 8 9 10 11 12 13 14 15 16 17 18

Now we should see the levels in correct numeric order.

# Change the idents to the projected cluster assignments
Idents(seurat_processed) <- "seurat_cluster.projected"

# Plot UMAP
p <- DimPlot(seurat_processed, 
        reduction = "full.umap.sketch", label = T) + 
  ggtitle("Projected clustering")
LabelClusters(p, id = "ident",  fontface = "bold", size = 5, 
              bg.colour = "white", bg.r = .2, force = 0)

In order to see the clusters superimposed on our image we can use the SpatialDimPlot() function. We will also set the color palette and convert the cluster assignments to a factor so they are ordered numerically rather than lexicographically in the figure.

SpatialDimPlot(seurat_processed,
               pt.size.factor = 12,
               image.alpha = 0)

Save!

Now is a great spot to save our seurat_processed object as we have added a lot new information into our Seurat object, such as:

  • Sketch HVGs
  • Scaled log-normalized counts
  • PCA
  • k-Nearest neighbors
  • Clusters
  • UMAP
  • Projected clusters
# Save Seurat object
saveRDS(seurat_processed, "data/seurat_processed.RDS")