library(Seurat)
library(tidyverse)
seurat_sketch <- readRDS("data/seurat_sketch.RDS")Identifying Unwanted Variation
Approximate time: XY minutes
Learning Objectives
- LO 1
- Perform clustering of cells based on significant principal components
- LO3
Single-Cell RNA-seq (scRNA-seq) Workflow
We will be performing a standard scRNA-seq workflow on our sketch assay. For more explicit details on each one of these steps, we have materials on how to conduct a scRNA-seq analysis from start to end.
Here we show the overarching steps that are takin in a classical single-cell analysis.
The broader principles of this workflow remain the same when working with spatial transcriptomics data. This lesson will cover the following steps:
FindVariableFeatures(): As before, this generates a list of highly variable genes, which may be slightly different for the sketched dataset than for the full datasetScaleData(): Highly variable genes will be confounded with the most highly expressed genes, so we need to adjust for thisRunPCA(): Perform a principal component analysis using our scaled data and variable genes. This will emphasize variation in gene expression as well as similarity across binsFindNeighbors(): Determine the Euclidean distance between bins in PCA spaceFindClusters(): Iteratively group bins together based on neighborhood distances. Higher resolution will yield more groups.RunUMAP(): TODO
With that being said, there are certain elements that are unique to a spatial dataset that we will highlight as we go through all of these steps.
Variation in the Dataset
Identify Highly Variable Features
Recall that when we were working with the entire dataset, we found the following genes to be the most variable:
# Identify the 15 most highly variable genes
ranked_variable_genes <- VariableFeatures(seurat_sketch, assay="Spatial.016um")
top_genes <- ranked_variable_genes[1:15]
top_genes [1] "SPP1" "IGHG1" "IGHM" "IGHG3" "CXCL8" "INSL5" "IGLC1" "MMP12" "CHGA"
[10] "VIP" "SST" "GCG" "REG3A" "HBA2" "PYY"
We are going to run FindVariableFeatures() once more, but this time on our smaller dataset of 10,000 bins.
# Identify the most variable genes
seurat_processed <- FindVariableFeatures(seurat_sketch,
selection.method = "vst",
nfeatures = 2000,
assay="sketch")
seurat_processedAn object of class Seurat
36170 features across 67660 samples within 2 assays
Active assay: sketch (18085 features, 2000 variable features)
2 layers present: counts, data
1 other assay present: Spatial.016um
2 spatial fields of view present: P5CRC.016um P5NAT.016um
Do you seen any differences when we calculated HVGs for the entire dataset versus the sketch assay?
# Identify the 15 most highly variable genes
ranked_variable_genes <- VariableFeatures(seurat_processed, assay="sketch")
ranked_variable_genes[1:15] [1] "IGHG1" "IGHM" "CXCL8" "SPP1" "IGLC1" "CHGA" "VIP" "HBA2" "REG3A"
[10] "MMP12" "INSL5" "GCG" "CALB2" "PYY" "MMP9"
Visualize Highly Variable Genes
We can visualize the dispersion of all genes using Seurat’s VariableFeaturePlot(), which shows a gene’s average expression across all cells on the x-axis and variance on the y-axis. Ideally we would see genes lay across the entire x-axis, as we want to ensure that we are using genes that are both lowly and highly expressed.
For the visualization, we are adding labels using LabelPoints() to help us better understand which genes are driving the shape of our data.
# Identify the 15 most highly variable genes
ranked_variable_genes <- VariableFeatures(seurat_processed,
assay = "sketch")
top_genes <- ranked_variable_genes[1:15]
# Plot the average expression and variance of these genes
# With labels to indicate which genes are in the top 15
p <- VariableFeaturePlot(seurat_processed)
LabelPoints(plot = p, points = top_genes, repel = TRUE)Now we are primed to run the next step of the workflow, which is Principal Component Analysis.
PCA
Principal Component Analysis (PCA) is a technique used to emphasize variation as well as similarity, and to bring out strong patterns in a dataset; it is one of the methods used for “dimensionality reduction”.
For a more detailed explanation on PCA, please look over this lesson (adapted from StatQuests/Josh Starmer’s YouTube video). We also strongly encourage you to explore the video StatQuest’s video for a more thorough understanding.
Let’s say you are working with a single-cell RNA-seq dataset with 12,000 cells and you have quantified the expression of 20,000 genes. The schematic below demonstrates how you would go from a cell x gene matrix to principal component (PC) scores for each individual cell.
To perform PCA, we need to first choose the most variable features, then scale the data. Since highly expressed genes exhibit the highest amount of variation and we don’t want our ‘highly variable genes’ only to reflect high expression, we need to scale the data to scale variation with expression level. The Seurat ScaleData() function will scale the data by:
- Adjusting the expression of each gene to give a mean expression across cells to be 0
- Scaling expression of each gene to give a variance across cells to be 1
# Scale the data counts
seurat_processed <- ScaleData(seurat_processed)
seurat_processedAn object of class Seurat
36170 features across 67660 samples within 2 assays
Active assay: sketch (18085 features, 2000 variable features)
3 layers present: counts, data, scale.data
1 other assay present: Spatial.016um
2 spatial fields of view present: P5CRC.016um P5NAT.016um
We have calculated a new layer titled sale.data, where our scaled log-normalized counts now exist. So now we can proceed in calculating our Principal Components with RunPCA().
# Run PCA
seurat_processed <- RunPCA(seurat_processed,
assay = "sketch",
reduction.name = "pca.sketch")The information printed out show the top genes that influence a cell’s score for each component. This includes genes that have both a positive and negative weight.
If we look at our updated seurat_processed object:
seurat_processedAn object of class Seurat
36170 features across 67660 samples within 2 assays
Active assay: sketch (18085 features, 2000 variable features)
3 layers present: counts, data, scale.data
1 other assay present: Spatial.016um
1 dimensional reduction calculated: pca.sketch
2 spatial fields of view present: P5CRC.016um P5NAT.016um
We see that we have stored our PCA results: dimensional reduction calculated: pca.sketch
Visualize PCA
After the PC scores have been calculated, you are looking at a matrix of 12,000 x 12,000 that represents the information about relative gene expression in all the cells. You can select the PC1 and PC2 columns and plot that in a 2D way.
DimPlot(seurat_processed,
group.by = "orig.ident",
reduction = "pca.sketch")Ultimately what we have calculated with PCA are values that can represent how similar cells are to one another. We would expect that cells with similar gene expression will have similar PC values. For example, we would anticipate that two Fibroblast cells would have comparable gene expression - which would also results in similar scores in principal components. This is why many of the downstream steps in the remainder of this lesson use PCA as the input.
Also assess the differences in coordinates based on batch variables.
DimPlot(seurat_processed,
reduction = "pca.sketch",
group.by = "orig.ident",
split.by = "orig.ident")Selecting PC Dimensions
To overcome the extensive technical noise in the expression of any single gene for scRNA-seq data, Seurat assigns cells to clusters based on their PCA scores derived from the expression of the integrated most variable genes. Each PC essentially representing a “metagene” that combines information across a correlated gene set. Determining how many PCs to include in the clustering step is therefore important to ensure that we are capturing the majority of the variation, or cell types, present in our dataset.
It is useful to explore the PCs prior to deciding which PCs to include for the downstream clustering analysis.
The elbow plot is a helpful way to determine how many PCs to use for clustering so that we are capturing majority of the variation in the data. The elbow plot visualizes the standard deviation of each PC, and we are looking for where the standard deviations begins to plateau. Essentially, where the elbow appears is usually the threshold for identifying the majority of the variation. However, this method can be quite subjective.
Let’s draw an elbow plot using the top 50 PCs:
# Plot the elbow plot
ElbowPlot(object = seurat_processed,
reduction = "pca.sketch",
ndims = 50)Based on this plot, we could roughly determine the majority of the variation by where the elbow occurs around PC8 - PC10, or one could argue that it should be when the data points start to get close to the X-axis, PC30 or so. This gives us a very rough idea of the number of PCs needed to be included.
We will be using 50 PCs for our downstream steps.
Grouping Similar Cells Together
Seurat uses a graph-based clustering approach using a K-nearest neighbor approach, and then attempts to partition this graph into highly interconnected ‘quasi-cliques’ or ‘communities’ [Seurat - Guided Clustering Tutorial]. A nice in-depth description of clustering methods is provided in the SVI Bioinformatics and Cellular Genomics Lab course.
k-Nearest Neighbors
The first step is to construct a K-nearest neighbor (KNN) graph based on the euclidean distance in PCA space.
Image source: Analysis of Single cell RNA-seq data
- Edges are drawn between cells with similar features expression patterns.
- Edge weights are refined between any two cells based on shared overlap in their local neighborhoods.
This is done in Seurat by using the FindNeighbors() function. We also specify the reduction we want to use as well as which components (dims) will be used for the calculation. In this case, we will be using the first 50.
# Determine the K-nearest neighbor graph
seurat_processed <- FindNeighbors(seurat_processed,
assay = "sketch",
reduction = "pca.sketch",
dims = 1:50)Clustering
Next, Seurat will iteratively group cells together with the goal of optimizing the standard modularity function.
We will use the FindClusters() function to perform the graph-based clustering. The resolution is an important argument that sets the “granularity” of the downstream clustering and will need to be optimized for every individual experiment. For datasets of 3,000 - 5,000 cells, the resolution set between 0.4-1.4 generally yields good clustering. Increased resolution values lead to a greater number of clusters, which is often required for larger datasets.
The FindClusters() function allows us to enter a series of resolutions and will calculate the “granularity” of the clustering. This is very helpful for testing which resolution works for moving forward without having to run the function for each resolution.
Typically we would recommend using a variety of different resolutions and select a resolution that best suits your dataset. However, for the sake of time, we will proceed with a resolution = 0.65. We have more detailed explanations of how to test a variety of resolutions at one time.
seurat_processed <- FindClusters(seurat_processed,
cluster.name = "seurat_cluster.sketched",
resolution = 0.65)Modularity Optimizer version 1.3.0 by Ludo Waltman and Nees Jan van Eck
Number of nodes: 10000
Number of edges: 415959
Running Louvain algorithm...
Maximum modularity in 10 random starts: 0.8844
Number of communities: 18
Elapsed time: 1 seconds
If we take a look at the columns in our @meta.data, we see that the column seurat_cluster.sketched is a new column that was added.
seurat_processed@meta.data %>% head() %>% relocate("seurat_cluster.sketched") seurat_cluster.sketched orig.ident
P5CRC_s_016um_00050_00315-1 <NA> P5CRC
P5CRC_s_016um_00204_00145-1 <NA> P5CRC
P5CRC_s_016um_00237_00273-1 <NA> P5CRC
P5CRC_s_016um_00224_00126-1 <NA> P5CRC
P5CRC_s_016um_00191_00159-1 8 P5CRC
P5CRC_s_016um_00064_00214-1 <NA> P5CRC
nCount_Spatial.016um nFeature_Spatial.016um
P5CRC_s_016um_00050_00315-1 80 75
P5CRC_s_016um_00204_00145-1 81 73
P5CRC_s_016um_00237_00273-1 278 225
P5CRC_s_016um_00224_00126-1 692 511
P5CRC_s_016um_00191_00159-1 35 34
P5CRC_s_016um_00064_00214-1 1250 724
mitoRatio log10GenesPerUMI leverage.score
P5CRC_s_016um_00050_00315-1 0.1000000 0.9852720 0.6907727
P5CRC_s_016um_00204_00145-1 0.1234568 0.9763361 0.4799947
P5CRC_s_016um_00237_00273-1 0.0323741 0.9624138 0.3291851
P5CRC_s_016um_00224_00126-1 0.2095376 0.9536337 1.4497517
P5CRC_s_016um_00191_00159-1 0.1142857 0.9918468 6.8221965
P5CRC_s_016um_00064_00214-1 0.2176000 0.9234167 0.6191210
seurat_clusters
P5CRC_s_016um_00050_00315-1 <NA>
P5CRC_s_016um_00204_00145-1 <NA>
P5CRC_s_016um_00237_00273-1 <NA>
P5CRC_s_016um_00224_00126-1 <NA>
P5CRC_s_016um_00191_00159-1 8
P5CRC_s_016um_00064_00214-1 <NA>
When we looked at the first few rows of our metadata, it appeared that there were many cells that do not have a cluster value.
Count how many NA’s are found for our cluster with table() function and use the argument (useNA = "ifany"):
table(seurat_processed$seurat_cluster.sketched, useNA = "ifany")
0 1 2 3 4 5 6 7 8 9 10 11 12
1357 1274 1079 898 828 624 579 545 497 495 463 402 300
13 14 15 16 17 <NA>
272 127 106 80 74 57660
Why do you think there are so many NA values?
FindClusters() tends to automatically change the Idents() to the cluster values that were just calculated. If we take a look at our Idents we can see if that it has changed from orig.ident to seurat_cluster.sketched
Idents(seurat_processed) %>% head()P5CRC_s_016um_00050_00315-1 P5CRC_s_016um_00204_00145-1
<NA> <NA>
P5CRC_s_016um_00237_00273-1 P5CRC_s_016um_00224_00126-1
<NA> <NA>
P5CRC_s_016um_00191_00159-1 P5CRC_s_016um_00064_00214-1
8 <NA>
Levels: 0 1 2 3 4 5 6 7 8 9 10 11 12 13 14 15 16 17
UMAP
To visualize the cell clusters, there are a few different dimensionality reduction techniques that can be helpful. The most popular methods include t-distributed stochastic neighbor embedding (t-SNE) and Uniform Manifold Approximation and Projection (UMAP) techniques.
Both methods aim to place cells with similar local neighborhoods in high-dimensional space together in low-dimensional space. These methods will require you to input number of PCA dimentions to use for the visualization, we suggest using the same number of PCs as input to the clustering analysis. Here, we will proceed with the UMAP method for visualizing the clusters.
seurat_processed <- RunUMAP(seurat_processed,
reduction = "pca.sketch",
reduction.name = "umap.sketch",
return.model = T, dims = 1:50)
seurat_processedAn object of class Seurat
36170 features across 67660 samples within 2 assays
Active assay: sketch (18085 features, 2000 variable features)
3 layers present: counts, data, scale.data
1 other assay present: Spatial.016um
2 dimensional reductions calculated: pca.sketch, umap.sketch
2 spatial fields of view present: P5CRC.016um P5NAT.016um
DimPlot(seurat_processed, group.by="orig.ident", reduction="umap.sketch")These UMAP coordinates are now stored in the dimensional reductions calculated: pca.sketch, umap.sketch. Now we can visualize with the clusters that we calculated earlier using DimPlot() and adding clearer labels withLabelClusters()`.
# Plot UMAP
p <- DimPlot(seurat_processed,
reduction = "umap.sketch", label = T) +
ggtitle("Sketched clustering")
LabelClusters(p, id = "ident", fontface = "bold", size = 5,
bg.colour = "white", bg.r = .2, force = 0)Project Back to Entire Dataset
Now that we have our clusters and dimensional reductions from our sketched dataset, we need to extend these to the full dataset.
Run ProjectData()
The ProjectData function projects all the bins in the dataset (the Spatial.016um assay) onto the sketch assay.
# Increases the size of the default vector
options(future.globals.maxSize= 2000000000)
seurat_processed <- ProjectData(
object = seurat_processed,
assay = "Spatial.016um",
full.reduction = "full.pca.sketch",
sketched.assay = "sketch",
sketched.reduction = "pca.sketch",
umap.model = "umap.sketch",
dims = 1:50,
refdata = list(seurat_cluster.projected = "seurat_cluster.sketched"))Using the sketch PCA and UMAP, the ProjectData function returns a Seurat object that includes:
| Category | Description |
|---|---|
| Dimensional reduction (PCA) | The full.pca.sketch reduction extends the PCA from sketched cells to all bins in the dataset |
| Dimensional reduction (UMAP) | The full.umap.sketch reduction extends the UMAP from sketched cells to all bins in the dataset |
| Cluster labels | The seurat_cluster.projected metadata column labels all cells with cluster labels from sketched cells |
We can now see the additional full-dataset reductions in the object.
seurat_processedAn object of class Seurat
36170 features across 67660 samples within 2 assays
Active assay: Spatial.016um (18085 features, 2000 variable features)
2 layers present: counts, data
1 other assay present: sketch
4 dimensional reductions calculated: pca.sketch, umap.sketch, full.pca.sketch, full.umap.sketch
2 spatial fields of view present: P5CRC.016um P5NAT.016um
Note that a score for the projection of each bin will be saved as a column in the metadata. Now the seurat_cluster.projected column shows values for every bin.
head(seurat_processed@meta.data) orig.ident nCount_Spatial.016um
P5CRC_s_016um_00050_00315-1 P5CRC 80
P5CRC_s_016um_00204_00145-1 P5CRC 81
P5CRC_s_016um_00237_00273-1 P5CRC 278
P5CRC_s_016um_00224_00126-1 P5CRC 692
P5CRC_s_016um_00191_00159-1 P5CRC 35
P5CRC_s_016um_00064_00214-1 P5CRC 1250
nFeature_Spatial.016um mitoRatio log10GenesPerUMI
P5CRC_s_016um_00050_00315-1 75 0.1000000 0.9852720
P5CRC_s_016um_00204_00145-1 73 0.1234568 0.9763361
P5CRC_s_016um_00237_00273-1 225 0.0323741 0.9624138
P5CRC_s_016um_00224_00126-1 511 0.2095376 0.9536337
P5CRC_s_016um_00191_00159-1 34 0.1142857 0.9918468
P5CRC_s_016um_00064_00214-1 724 0.2176000 0.9234167
leverage.score seurat_cluster.sketched
P5CRC_s_016um_00050_00315-1 0.6907727 <NA>
P5CRC_s_016um_00204_00145-1 0.4799947 <NA>
P5CRC_s_016um_00237_00273-1 0.3291851 <NA>
P5CRC_s_016um_00224_00126-1 1.4497517 <NA>
P5CRC_s_016um_00191_00159-1 6.8221965 8
P5CRC_s_016um_00064_00214-1 0.6191210 <NA>
seurat_clusters seurat_cluster.projected
P5CRC_s_016um_00050_00315-1 <NA> 8
P5CRC_s_016um_00204_00145-1 <NA> 8
P5CRC_s_016um_00237_00273-1 <NA> 11
P5CRC_s_016um_00224_00126-1 <NA> 12
P5CRC_s_016um_00191_00159-1 8 8
P5CRC_s_016um_00064_00214-1 <NA> 2
seurat_cluster.projected.score
P5CRC_s_016um_00050_00315-1 1
P5CRC_s_016um_00204_00145-1 1
P5CRC_s_016um_00237_00273-1 1
P5CRC_s_016um_00224_00126-1 1
P5CRC_s_016um_00191_00159-1 1
P5CRC_s_016um_00064_00214-1 1
Visualize Projected Clusters
We can now visualize our clusters from the projected assignments. The UMAP plot now contains more points, which is expected because we are now visualizing the full dataset rather than our 10,000 bin sketch. Nonetheless, we can see that the full dataset is still well-represented by the projected dimensional reduction and clustering.
# switch to full dataset assay
DefaultAssay(seurat_processed) <- "Spatial.016um"
# Change the idents to the projected cluster assignments
Idents(seurat_processed) <- "seurat_cluster.projected"
# Plot UMAP
p <- DimPlot(seurat_processed,
reduction = "full.umap.sketch", label = T) +
ggtitle("Projected clustering")
LabelClusters(p, id = "ident", fontface = "bold", size = 5,
bg.colour = "white", bg.r = .2, force = 0)To ensure that our cluster labels are listed in proper numeric order, we are going to make our seurat_cluster.projected a factor and set the levels such that the order is correct.
# Arrange so clusters get listed in numerical order
seurat_processed$seurat_cluster.projected <- seurat_processed$seurat_cluster.projected %>%
as.numeric() %>%
as.factor()
seurat_processed$seurat_cluster.projected %>% head()P5CRC_s_016um_00050_00315-1 P5CRC_s_016um_00204_00145-1
12 12
P5CRC_s_016um_00237_00273-1 P5CRC_s_016um_00224_00126-1
5 14
P5CRC_s_016um_00191_00159-1 P5CRC_s_016um_00064_00214-1
12 2
Levels: 1 2 3 4 5 6 7 8 9 10 11 12 13 14 15 16 17 18
Now we should see the levels in correct numeric order.
# Change the idents to the projected cluster assignments
Idents(seurat_processed) <- "seurat_cluster.projected"
# Plot UMAP
p <- DimPlot(seurat_processed,
reduction = "full.umap.sketch", label = T) +
ggtitle("Projected clustering")
LabelClusters(p, id = "ident", fontface = "bold", size = 5,
bg.colour = "white", bg.r = .2, force = 0)In order to see the clusters superimposed on our image we can use the SpatialDimPlot() function. We will also set the color palette and convert the cluster assignments to a factor so they are ordered numerically rather than lexicographically in the figure.
SpatialDimPlot(seurat_processed,
pt.size.factor = 12,
image.alpha = 0)Save!
Now is a great spot to save our seurat_processed object as we have added a lot new information into our Seurat object, such as:
- Sketch HVGs
- Scaled log-normalized counts
- PCA
- k-Nearest neighbors
- Clusters
- UMAP
- Projected clusters
# Save Seurat object
saveRDS(seurat_processed, "data/seurat_processed.RDS")