Clustering

category_1
category_2
category_3
category_4

Write a description of the lesson here.

Author

Noor Sohail

Published

July 22, 2025

Keywords

keyword_1, keyword_2, keyword_3, keyword_4, keyword_5, keyword_6

Approximate time: XX minutes

Learning objectives

In this lesson, we will:

  • Learning Objective 1
  • Learning Objective 2
  • Learning Objective 3

Overview of lesson

When doing XYZ…

Grouping similar bins together

Now, we want to group similar bins (cell states) together based upon their gene expression profiles with clustering. This is a two step process that involves:

  1. Constructing a K-nearest neighbor (KNN) graph based on the PCA space.
  2. Group bins together based upon the KNN graph to assign cluster labels to each bin.

This graph-based approach allows us to partition the graph of bins into highly interconnected ‘quasi-cliques’ or ‘communities’ that represent clusters of similar bins. A nice in-depth description of clustering methods is provided in the SVI Bioinformatics and Cellular Genomics Lab course.

k-nearest neighbors (kNN)

The first step is to construct a K-nearest neighbor (KNN) graph. As previous mentioned, this is calculated on the PCA space. Recall that we have multiple principle components (PCs) with scores for each one of our bins. We can think of the steps taken like so:

  1. Each bin is a point in PCA (n-dimensional) space.
  2. We calculate the distance (euclidean) between each bin and all other bins in this PCA space.
  3. We then connect bins together (edges) that are close to each other in this PCA space. Bins that are close together in PCA space will have similar gene expression profiles, and thus are likely to be similar.
  4. Then edge weights are calculate for each connection based upon the shared overlap in their local neighborhoods to create a shared nearest neighbor (SNN) graph.

This means that if two bins are close together in PCA space and have similar neighbors, they will have a stronger connection (higher edge weight) than two bins that are close together but do not share similar neighbors.

Figure 1: Schematic showing how bins are connected in PCA space, where edges connected between close bins are then used to calculate edge weights based upon the shared overlap in their local neighborhoods.
Image source: Analysis of Single cell RNA-seq data.

All of this is done in one step with the Seurat function FindNeighbors(). The function takes in the PCA reduction and calculates the KNN graph. We specify the reduction (sketch) as well as which components (dims) will be used for the calculation. In this case, we will be using the first 50 PCs.

# Determine the K-nearest neighbor graph
seurat_processed <- FindNeighbors(seurat_processed, 
                                  assay = "sketch", 
                                  reduction = "pca.sketch",
                                  dims = 1:50)

We should now see two graphs in the @graphs slot of seurat_processed. The first in the list is the KNN graph (sketch_nn) and the second is the shared nearest neighbor (SNN) graph (sketch_snn).

seurat_processed@graphs
$sketch_nn
A Graph object containing 10000 cells
$sketch_snn
A Graph object containing 10000 cells

We use the SNN graph because it considers both the distance beteween two bins and the similarity of their local neighborhoods. This leads to more robust and biologically meaningful connections between bins. In contrast, the KNN graph only considers the distance between two bins, which may not capture the full complexity of the global relationships between many bins at once.

Now that we have this shared nearest neighbor graph, we can being the clustering step.

Clustering

The next step in the workflow is to group bins together based upon the graph to assign cluster labels to each bin.

The FindClusters() function accomplishes this by iteratively grouping bins together with the goal of optimizing the standard modularity function. This function is a measure of the strength of division of a network into clusters, with higher values indicating a stronger clustering structure. The FindClusters() function uses the Louvain algorithm to optimize this modularity function and assign cluster labels to each bin by default. We can also use the Leiden algorithm by setting the algorithm = 4. The Leiden algorithm is an improvement on the Louvain algorithm that is faster and more accurate in finding clusters in large datasets.

An important parameter is resolution which sets the “granularity” of the downstream clustering and will need to be optimized for every individual experiment. Larger resolution values lead to a greater number of clusters, which is often required for larger datasets. For datasets of 3,000 - 5,000 cells, the resolution set between 0.4-1.4 generally yields good clustering.

Testing different resolution values

Typically we would recommend using a variety of different resolutions and select a resolution that best suits your dataset. However, for the sake of time, we will proceed with a resolution = 0.65. We have more detailed explanations of how to test a variety of resolutions at one time.

seurat_processed <- FindClusters(seurat_processed, 
                                 cluster.name = "seurat_cluster.sketched", 
                                 algorithm = 4,
                                 resolution = 0.65,
                                 random.seed = 12345)

If we take a look at the columns in our @meta.data, we see that the column seurat_cluster.sketched is a new column that was added.

seurat_processed@meta.data %>% 
  head() %>% 
  relocate("seurat_cluster.sketched")
Table 1: Updated @meta.data with sketch Leiden clusters.
seurat_cluster.sketched orig.ident nCount_Spatial.008um nFeature_Spatial.008um log10GenesPerUMI mitoRatio leverage.score
P5CRC_s_008um_00078_00444-1 NA P5CRC 65 57 0.9685377 0.2307692 0.1182505
P5CRC_s_008um_00128_00278-1 NA P5CRC 1300 906 0.9496410 0.1730769 0.8126225
P5CRC_s_008um_00052_00559-1 NA P5CRC 128 121 0.9884090 0.0234375 1.3967893
P5CRC_s_008um_00121_00413-1 NA P5CRC 538 326 0.9203288 0.0092937 0.6153956
P5CRC_s_008um_00202_00633-1 NA P5CRC 365 232 0.9231919 0.0657534 0.4457335
P5CRC_s_008um_00035_00460-1 NA P5CRC 122 100 0.9586074 0.1639344 0.0945122
  1. When we looked at the first few rows of our metadata, it appeared that there were many cells that do not have a cluster value. Count how many NA’s are found for our cluster with table() function and use the argument (useNA = "ifany"). Why do you think there are so many NA values?

The Seurat workflow tends to automatically change the Idents() of the Seurat object to the last cluster values that were calculated during FindClusters(). If we take a look at our Idents we can see if that it has changed from orig.ident to seurat_cluster.sketched.

Idents(seurat_processed) %>% head()
P5CRC_s_008um_00078_00444-1 P5CRC_s_008um_00128_00278-1 
                       <NA>                        <NA> 
P5CRC_s_008um_00052_00559-1 P5CRC_s_008um_00121_00413-1 
                       <NA>                        <NA> 
P5CRC_s_008um_00202_00633-1 P5CRC_s_008um_00035_00460-1 
                       <NA>                        <NA> 
Levels: 1 2 3 4 5 6 7 8 9 10 11 12 13 14

Visualize clusters on UMAP

We previously calculated UMAP coordinates for our sketch assay and colored the bins by which sample they belonged to. Now, we can visualize the clusters that we just calculated on this UMAP to see how well the clusters are separating.

# Plot UMAP
p <- DimPlot(seurat_processed, 
        reduction = "umap.sketch", label = T) + 
  ggtitle("Sketched clustering")
LabelClusters(p, id = "ident",  fontface = "bold", size = 5, 
              bg.colour = "white", bg.r = .2, force = 0)
Figure 2: UMAP scatterplot with bins colored and labeled by sketch Leiden cluster annotation.

Ideally we would see distinct clusters of bins that are well separated from each other. If we see a lot of overlap between clusters, this may indicate that the clustering is not optimal and we may need to adjust our resolution parameter or re-consider our previous steps (QC, filtration, highly variable gene selection, etc.) to improve the clustering.

Project back to entire dataset

Recall that we have been calculating all of these steps on our sketched dataset of 10,000 bins. Now, we want to know what the cluster labels are for all of the bins in our full dataset. We can do this with the ProjectData() function.

Warning

Struggling to find method used to project sketch to full data

Run ProjectData()

During this projection step, we provide the sketched information we have just calculated:

  • sketched.assay = sketch assay name (sketch)
  • sketched.reduction = reduction (PCA) calculated on the sketch assay (pca.sketch)
  • umap.model = UMAP reduction calculated on the sketch assay (umap.sketch)

The we specify the assay with the full dataset that we want to project onto (in this case, the Spatial.008um assay). We also specify the full.reduction to name the PCA that will have scores for each bin (full.pca.sketch) as well as the number of principal components to use (dims).

# Increases the size of the default vector
# 8GB
options(future.globals.maxSize = 8000 * 1024^2)

# Project from sketch assay onto entire dataset
seurat_processed <- ProjectData(
  object = seurat_processed,
  sketched.assay = "sketch",
  sketched.reduction = "pca.sketch",
  umap.model = "umap.sketch",
  assay = "Spatial.008um",
  full.reduction = "full.pca.sketch",
  dims = 1:50,
  refdata = list(seurat_cluster.projected = "seurat_cluster.sketched"))

Using the sketch PCA and UMAP, the ProjectData function returns a Seurat object that includes:

Category Description
Dimensional reduction (PCA) The full.pca.sketch reduction extends the PCA from sketched cells to all bins in the dataset
Dimensional reduction (UMAP) The full.umap.sketch reduction extends the UMAP from sketched cells to all bins in the dataset
Cluster labels The seurat_cluster.projected metadata column labels all cells with cluster labels from sketched cells

So now when we investigate the @reductions slot, we see that we have two new reductions: full.pca.sketch and full.umap.sketch.

seurat_processed@reductions
$pca.sketch
A dimensional reduction object with key PC_ 
 Number of dimensions: 50 
 Number of cells: 10000 
 Projected dimensional reduction calculated:  FALSE 
 Jackstraw run: FALSE 
 Computed using assay: sketch 

$umap.sketch
A dimensional reduction object with key umapsketch_ 
 Number of dimensions: 2 
 Number of cells: 10000 
 Projected dimensional reduction calculated:  FALSE 
 Jackstraw run: FALSE 
 Computed using assay: sketch 

$full.pca.sketch
A dimensional reduction object with key fullpcasketch_ 
 Number of dimensions: 50 
 Number of cells: 135798 
 Projected dimensional reduction calculated:  FALSE 
 Jackstraw run: FALSE 
 Computed using assay: Spatial.008um 

$full.umap.sketch
A dimensional reduction object with key fullumapsketch_ 
 Number of dimensions: 2 
 Number of cells: 135798 
 Projected dimensional reduction calculated:  FALSE 
 Jackstraw run: FALSE 
 Computed using assay: Spatial.008um 

We also now have a seurat_cluster.projected column with Leiden cluster labels for every bin.

seurat_processed@meta.data %>% head()
Table 2: Updated @meta.data with projected Leiden clusters.
orig.ident nCount_Spatial.008um nFeature_Spatial.008um log10GenesPerUMI mitoRatio leverage.score seurat_cluster.sketched seurat_cluster.projected seurat_cluster.projected.score
P5CRC_s_008um_00078_00444-1 P5CRC 65 57 0.9685377 0.2307692 0.1182505 NA 3 0.9783292
P5CRC_s_008um_00128_00278-1 P5CRC 1300 906 0.9496410 0.1730769 0.8126225 NA 1 1.0000000
P5CRC_s_008um_00052_00559-1 P5CRC 128 121 0.9884090 0.0234375 1.3967893 NA 11 0.6450775
P5CRC_s_008um_00121_00413-1 P5CRC 538 326 0.9203288 0.0092937 0.6153956 NA 9 0.6799116
P5CRC_s_008um_00202_00633-1 P5CRC 365 232 0.9231919 0.0657534 0.4457335 NA 2 1.0000000
P5CRC_s_008um_00035_00460-1 P5CRC 122 100 0.9586074 0.1639344 0.0945122 NA 3 1.0000000
  1. How many bins are in each cluster? Use the table() function to count the number of bins in each cluster. Do you notice any patterns in the number of bins per cluster?

Visualize projected clusters

This is a good point to pause and inspect our datset!

We now have UMAP and cluster labels for every bin in our dataset. Earlier, we visualized the sketch UMAP with the cluster labels. and saw separation in groups. Now, we can visualize the full UMAP with the projected cluster labels to ensure that the projection step worked and that the clusters are still well separated in the full dataset.

# Switch to full dataset assay
DefaultAssay(seurat_processed) <- "Spatial.008um"

# Change the idents to the projected cluster assignments
Idents(seurat_processed) <- "seurat_cluster.projected"

# Plot UMAP
p <- DimPlot(seurat_processed, 
        reduction = "full.umap.sketch", label = T) + 
  ggtitle("Projected clustering")
LabelClusters(p, id = "ident",  fontface = "bold", size = 5, 
              bg.colour = "white", bg.r = .2, force = 0)
Figure 3: UMAP scatterplot with bins colored and labeled by projected Leiden cluster annotation.
My cluster labels are all out of order!

To ensure that our cluster labels are listed in proper numeric order, we are going to make our seurat_cluster.projected a factor and set the levels such that the order is correct.

# Arrange seurat_cluster.projected so that values are in numeric order
order <- seurat_processed$seurat_cluster.projected %>%
  unique() %>% as.numeric() %>%
  sort() %>% as.character()
order
 [1] "1"  "2"  "3"  "4"  "5"  "6"  "7"  "8"  "9"  "10" "11" "12" "13" "14"
# Factorize seurat_cluster.projected with levels in correct numeric order
seurat_processed$seurat_cluster.projected <- seurat_processed$seurat_cluster.projected %>% 
  factor(levels = order)

# Print head of seurat_cluster.projected to confirm levels are in correct order
seurat_processed$seurat_cluster.projected %>% head()
P5CRC_s_008um_00078_00444-1 P5CRC_s_008um_00128_00278-1 
                          3                           1 
P5CRC_s_008um_00052_00559-1 P5CRC_s_008um_00121_00413-1 
                         11                           9 
P5CRC_s_008um_00202_00633-1 P5CRC_s_008um_00035_00460-1 
                          2                           3 
Levels: 1 2 3 4 5 6 7 8 9 10 11 12 13 14

Now we should see the levels in numeric order on our UMAP legend.

# Change the idents to the projected cluster assignments
Idents(seurat_processed) <- "seurat_cluster.projected"

# Plot UMAP
p <- DimPlot(seurat_processed, 
        reduction = "full.umap.sketch", label = T) + 
  ggtitle("Projected clustering")
LabelClusters(p, id = "ident",  fontface = "bold", size = 5, 
              bg.colour = "white", bg.r = .2, force = 0)

In a typical single-cell analysis, this would a pause point. However, this is a spatial transcriptomics analysis with coordinates associated with each bin. So now we can visualize the clusters superimposed on the tissue image to see how the groups are spatially distributed across the tissue.

This is another litmus test to see if we have generated clusters that are biologically meaningful. For example, if a pathologist were to look at the tissue image, they may be able to identify unique, distinct regions of the tissue. If our clusters are biologically meaningful, we would expect to see that the clusters are spatially separated and correspond to these regions of the tissue (depending on the tissue type and the biological question at hand).

In order to plot the clusters superimposed on our image we use the SpatialDimPlot() function.

SpatialDimPlot(seurat_processed,
               pt.size.factor = 16,
               image.alpha = 0)
Figure 4: Bins on the tissue slide, colored by projected Leiden cluster annotation.

Save!

Now is a great spot to save our seurat_processed object as we have added a lot new information into our Seurat object, including:

  • Sketch HVGs
  • Scaled log-normalized counts
  • PCA
  • k-Nearest neighbors
  • Clusters
  • UMAP
  • Projected clusters
# Save Seurat object
saveRDS(seurat_processed, "data/seurat_processed.RDS")

Next Lesson >>

Back to Schedule

Reuse

CC-BY-4.0