# Determine the K-nearest neighbor graph
seurat_processed <- FindNeighbors(seurat_processed,
assay = "sketch",
reduction = "pca.sketch",
dims = 1:50)Clustering
Write a description of the lesson here.
keyword_1, keyword_2, keyword_3, keyword_4, keyword_5, keyword_6
Approximate time: XX minutes
Learning objectives
In this lesson, we will:
- Learning Objective 1
- Learning Objective 2
- Learning Objective 3
Overview of lesson
When doing XYZ…
Grouping similar bins together
Now, we want to group similar bins (cell states) together based upon their gene expression profiles with clustering. This is a two step process that involves:
- Constructing a K-nearest neighbor (KNN) graph based on the PCA space.
- Group bins together based upon the KNN graph to assign cluster labels to each bin.
This graph-based approach allows us to partition the graph of bins into highly interconnected ‘quasi-cliques’ or ‘communities’ that represent clusters of similar bins. A nice in-depth description of clustering methods is provided in the SVI Bioinformatics and Cellular Genomics Lab course.
k-nearest neighbors (kNN)
The first step is to construct a K-nearest neighbor (KNN) graph. As previous mentioned, this is calculated on the PCA space. Recall that we have multiple principle components (PCs) with scores for each one of our bins. We can think of the steps taken like so:
- Each bin is a point in PCA (n-dimensional) space.
- We calculate the distance (euclidean) between each bin and all other bins in this PCA space.
- We then connect bins together (edges) that are close to each other in this PCA space. Bins that are close together in PCA space will have similar gene expression profiles, and thus are likely to be similar.
- Then edge weights are calculate for each connection based upon the shared overlap in their local neighborhoods to create a shared nearest neighbor (SNN) graph.
This means that if two bins are close together in PCA space and have similar neighbors, they will have a stronger connection (higher edge weight) than two bins that are close together but do not share similar neighbors.
Image source: Analysis of Single cell RNA-seq data.
All of this is done in one step with the Seurat function FindNeighbors(). The function takes in the PCA reduction and calculates the KNN graph. We specify the reduction (sketch) as well as which components (dims) will be used for the calculation. In this case, we will be using the first 50 PCs.
We should now see two graphs in the @graphs slot of seurat_processed. The first in the list is the KNN graph (sketch_nn) and the second is the shared nearest neighbor (SNN) graph (sketch_snn).
seurat_processed@graphs$sketch_nn
A Graph object containing 10000 cells
$sketch_snn
A Graph object containing 10000 cells
We use the SNN graph because it considers both the distance beteween two bins and the similarity of their local neighborhoods. This leads to more robust and biologically meaningful connections between bins. In contrast, the KNN graph only considers the distance between two bins, which may not capture the full complexity of the global relationships between many bins at once.
Now that we have this shared nearest neighbor graph, we can being the clustering step.
Clustering
The next step in the workflow is to group bins together based upon the graph to assign cluster labels to each bin.
The FindClusters() function accomplishes this by iteratively grouping bins together with the goal of optimizing the standard modularity function. This function is a measure of the strength of division of a network into clusters, with higher values indicating a stronger clustering structure. The FindClusters() function uses the Louvain algorithm to optimize this modularity function and assign cluster labels to each bin by default. We can also use the Leiden algorithm by setting the algorithm = 4. The Leiden algorithm is an improvement on the Louvain algorithm that is faster and more accurate in finding clusters in large datasets.
An important parameter is resolution which sets the “granularity” of the downstream clustering and will need to be optimized for every individual experiment. Larger resolution values lead to a greater number of clusters, which is often required for larger datasets. For datasets of 3,000 - 5,000 cells, the resolution set between 0.4-1.4 generally yields good clustering.
Typically we would recommend using a variety of different resolutions and select a resolution that best suits your dataset. However, for the sake of time, we will proceed with a resolution = 0.65. We have more detailed explanations of how to test a variety of resolutions at one time.
seurat_processed <- FindClusters(seurat_processed,
cluster.name = "seurat_cluster.sketched",
algorithm = 4,
resolution = 0.65,
random.seed = 12345)If we take a look at the columns in our @meta.data, we see that the column seurat_cluster.sketched is a new column that was added.
seurat_processed@meta.data %>%
head() %>%
relocate("seurat_cluster.sketched")@meta.data with sketch Leiden clusters.
| seurat_cluster.sketched | orig.ident | nCount_Spatial.008um | nFeature_Spatial.008um | log10GenesPerUMI | mitoRatio | leverage.score | |
|---|---|---|---|---|---|---|---|
| P5CRC_s_008um_00078_00444-1 | NA | P5CRC | 65 | 57 | 0.9685377 | 0.2307692 | 0.1182505 |
| P5CRC_s_008um_00128_00278-1 | NA | P5CRC | 1300 | 906 | 0.9496410 | 0.1730769 | 0.8126225 |
| P5CRC_s_008um_00052_00559-1 | NA | P5CRC | 128 | 121 | 0.9884090 | 0.0234375 | 1.3967893 |
| P5CRC_s_008um_00121_00413-1 | NA | P5CRC | 538 | 326 | 0.9203288 | 0.0092937 | 0.6153956 |
| P5CRC_s_008um_00202_00633-1 | NA | P5CRC | 365 | 232 | 0.9231919 | 0.0657534 | 0.4457335 |
| P5CRC_s_008um_00035_00460-1 | NA | P5CRC | 122 | 100 | 0.9586074 | 0.1639344 | 0.0945122 |
- When we looked at the first few rows of our metadata, it appeared that there were many cells that do not have a cluster value. Count how many
NA’s are found for our cluster withtable()function and use the argument (useNA = "ifany"). Why do you think there are so manyNAvalues?
The Seurat workflow tends to automatically change the Idents() of the Seurat object to the last cluster values that were calculated during FindClusters(). If we take a look at our Idents we can see if that it has changed from orig.ident to seurat_cluster.sketched.
Idents(seurat_processed) %>% head()P5CRC_s_008um_00078_00444-1 P5CRC_s_008um_00128_00278-1
<NA> <NA>
P5CRC_s_008um_00052_00559-1 P5CRC_s_008um_00121_00413-1
<NA> <NA>
P5CRC_s_008um_00202_00633-1 P5CRC_s_008um_00035_00460-1
<NA> <NA>
Levels: 1 2 3 4 5 6 7 8 9 10 11 12 13 14
Visualize clusters on UMAP
We previously calculated UMAP coordinates for our sketch assay and colored the bins by which sample they belonged to. Now, we can visualize the clusters that we just calculated on this UMAP to see how well the clusters are separating.
# Plot UMAP
p <- DimPlot(seurat_processed,
reduction = "umap.sketch", label = T) +
ggtitle("Sketched clustering")
LabelClusters(p, id = "ident", fontface = "bold", size = 5,
bg.colour = "white", bg.r = .2, force = 0)Ideally we would see distinct clusters of bins that are well separated from each other. If we see a lot of overlap between clusters, this may indicate that the clustering is not optimal and we may need to adjust our resolution parameter or re-consider our previous steps (QC, filtration, highly variable gene selection, etc.) to improve the clustering.
Project back to entire dataset
Recall that we have been calculating all of these steps on our sketched dataset of 10,000 bins. Now, we want to know what the cluster labels are for all of the bins in our full dataset. We can do this with the ProjectData() function.
Struggling to find method used to project sketch to full data
Run ProjectData()
During this projection step, we provide the sketched information we have just calculated:
sketched.assay= sketch assay name (sketch)sketched.reduction= reduction (PCA) calculated on the sketch assay (pca.sketch)umap.model= UMAP reduction calculated on the sketch assay (umap.sketch)
The we specify the assay with the full dataset that we want to project onto (in this case, the Spatial.008um assay). We also specify the full.reduction to name the PCA that will have scores for each bin (full.pca.sketch) as well as the number of principal components to use (dims).
# Increases the size of the default vector
# 8GB
options(future.globals.maxSize = 8000 * 1024^2)
# Project from sketch assay onto entire dataset
seurat_processed <- ProjectData(
object = seurat_processed,
sketched.assay = "sketch",
sketched.reduction = "pca.sketch",
umap.model = "umap.sketch",
assay = "Spatial.008um",
full.reduction = "full.pca.sketch",
dims = 1:50,
refdata = list(seurat_cluster.projected = "seurat_cluster.sketched"))Using the sketch PCA and UMAP, the ProjectData function returns a Seurat object that includes:
| Category | Description |
|---|---|
| Dimensional reduction (PCA) | The full.pca.sketch reduction extends the PCA from sketched cells to all bins in the dataset |
| Dimensional reduction (UMAP) | The full.umap.sketch reduction extends the UMAP from sketched cells to all bins in the dataset |
| Cluster labels | The seurat_cluster.projected metadata column labels all cells with cluster labels from sketched cells |
So now when we investigate the @reductions slot, we see that we have two new reductions: full.pca.sketch and full.umap.sketch.
seurat_processed@reductions$pca.sketch
A dimensional reduction object with key PC_
Number of dimensions: 50
Number of cells: 10000
Projected dimensional reduction calculated: FALSE
Jackstraw run: FALSE
Computed using assay: sketch
$umap.sketch
A dimensional reduction object with key umapsketch_
Number of dimensions: 2
Number of cells: 10000
Projected dimensional reduction calculated: FALSE
Jackstraw run: FALSE
Computed using assay: sketch
$full.pca.sketch
A dimensional reduction object with key fullpcasketch_
Number of dimensions: 50
Number of cells: 135798
Projected dimensional reduction calculated: FALSE
Jackstraw run: FALSE
Computed using assay: Spatial.008um
$full.umap.sketch
A dimensional reduction object with key fullumapsketch_
Number of dimensions: 2
Number of cells: 135798
Projected dimensional reduction calculated: FALSE
Jackstraw run: FALSE
Computed using assay: Spatial.008um
We also now have a seurat_cluster.projected column with Leiden cluster labels for every bin.
seurat_processed@meta.data %>% head()@meta.data with projected Leiden clusters.
| orig.ident | nCount_Spatial.008um | nFeature_Spatial.008um | log10GenesPerUMI | mitoRatio | leverage.score | seurat_cluster.sketched | seurat_cluster.projected | seurat_cluster.projected.score | |
|---|---|---|---|---|---|---|---|---|---|
| P5CRC_s_008um_00078_00444-1 | P5CRC | 65 | 57 | 0.9685377 | 0.2307692 | 0.1182505 | NA | 3 | 0.9783292 |
| P5CRC_s_008um_00128_00278-1 | P5CRC | 1300 | 906 | 0.9496410 | 0.1730769 | 0.8126225 | NA | 1 | 1.0000000 |
| P5CRC_s_008um_00052_00559-1 | P5CRC | 128 | 121 | 0.9884090 | 0.0234375 | 1.3967893 | NA | 11 | 0.6450775 |
| P5CRC_s_008um_00121_00413-1 | P5CRC | 538 | 326 | 0.9203288 | 0.0092937 | 0.6153956 | NA | 9 | 0.6799116 |
| P5CRC_s_008um_00202_00633-1 | P5CRC | 365 | 232 | 0.9231919 | 0.0657534 | 0.4457335 | NA | 2 | 1.0000000 |
| P5CRC_s_008um_00035_00460-1 | P5CRC | 122 | 100 | 0.9586074 | 0.1639344 | 0.0945122 | NA | 3 | 1.0000000 |
- How many bins are in each cluster? Use the
table()function to count the number of bins in each cluster. Do you notice any patterns in the number of bins per cluster?
Visualize projected clusters
This is a good point to pause and inspect our datset!
We now have UMAP and cluster labels for every bin in our dataset. Earlier, we visualized the sketch UMAP with the cluster labels. and saw separation in groups. Now, we can visualize the full UMAP with the projected cluster labels to ensure that the projection step worked and that the clusters are still well separated in the full dataset.
# Switch to full dataset assay
DefaultAssay(seurat_processed) <- "Spatial.008um"
# Change the idents to the projected cluster assignments
Idents(seurat_processed) <- "seurat_cluster.projected"
# Plot UMAP
p <- DimPlot(seurat_processed,
reduction = "full.umap.sketch", label = T) +
ggtitle("Projected clustering")
LabelClusters(p, id = "ident", fontface = "bold", size = 5,
bg.colour = "white", bg.r = .2, force = 0)To ensure that our cluster labels are listed in proper numeric order, we are going to make our seurat_cluster.projected a factor and set the levels such that the order is correct.
# Arrange seurat_cluster.projected so that values are in numeric order
order <- seurat_processed$seurat_cluster.projected %>%
unique() %>% as.numeric() %>%
sort() %>% as.character()
order [1] "1" "2" "3" "4" "5" "6" "7" "8" "9" "10" "11" "12" "13" "14"
# Factorize seurat_cluster.projected with levels in correct numeric order
seurat_processed$seurat_cluster.projected <- seurat_processed$seurat_cluster.projected %>%
factor(levels = order)
# Print head of seurat_cluster.projected to confirm levels are in correct order
seurat_processed$seurat_cluster.projected %>% head()P5CRC_s_008um_00078_00444-1 P5CRC_s_008um_00128_00278-1
3 1
P5CRC_s_008um_00052_00559-1 P5CRC_s_008um_00121_00413-1
11 9
P5CRC_s_008um_00202_00633-1 P5CRC_s_008um_00035_00460-1
2 3
Levels: 1 2 3 4 5 6 7 8 9 10 11 12 13 14
Now we should see the levels in numeric order on our UMAP legend.
# Change the idents to the projected cluster assignments
Idents(seurat_processed) <- "seurat_cluster.projected"
# Plot UMAP
p <- DimPlot(seurat_processed,
reduction = "full.umap.sketch", label = T) +
ggtitle("Projected clustering")
LabelClusters(p, id = "ident", fontface = "bold", size = 5,
bg.colour = "white", bg.r = .2, force = 0)In a typical single-cell analysis, this would a pause point. However, this is a spatial transcriptomics analysis with coordinates associated with each bin. So now we can visualize the clusters superimposed on the tissue image to see how the groups are spatially distributed across the tissue.
This is another litmus test to see if we have generated clusters that are biologically meaningful. For example, if a pathologist were to look at the tissue image, they may be able to identify unique, distinct regions of the tissue. If our clusters are biologically meaningful, we would expect to see that the clusters are spatially separated and correspond to these regions of the tissue (depending on the tissue type and the biological question at hand).
In order to plot the clusters superimposed on our image we use the SpatialDimPlot() function.
Save!
Now is a great spot to save our seurat_processed object as we have added a lot new information into our Seurat object, including:
- Sketch HVGs
- Scaled log-normalized counts
- PCA
- k-Nearest neighbors
- Clusters
- UMAP
- Projected clusters
# Save Seurat object
saveRDS(seurat_processed, "data/seurat_processed.RDS")