# Load libariries
library(Seurat)
library(Banksy)
library(SeuratWrappers)
library(tidyverse)
library(future)
library(harmony)
# Set seed for reproducibility
set.seed(12345)
# Set parallelization (multithreading) to speed up calculations
plan("multicore", workers = parallel::detectCores() - 1)
# Increases the size of the default vector (8GB)
options(future.globals.maxSize = 8000 * 1024^2)
# Load dataset
seurat_processed <- readRDS("data/seurat_processed.RDS")Spatially Derived Clusters
Discover how BANKSY leverages neighborhood information to cluster spatial transcriptomics data using both gene expression and spatial context. You will compare BANKSY clusters with expression-only methods and evaluate differences in tissue domain detection.
BANKSY, Neighborhoods, Niches, Azimuthal gabor filter
Approximate time: XX minutes
Learning objectives
In this lesson, we will:
- Calculate BANKSY counts matrix which incorporates neighborhood structure
- Run harmony integration
- Evaluate spatially constrained clusters
Overview of lesson
At this point in the analysis, we have clusters and UMAP coordinates calculated based solely upon the gene expression. As this is a spatial dataset, we also have the ability to incorporate the location of bins, or the neighborhood structure, in the clustering process. Here, we will make use of the BANKY method that adds the local structure as information to ultimately calculate new clusters that are spatially constrained.
BANKSY
In the standard single-cell workflow, our clusters are identified solely based upon the similarity of gene expression between bins. However, this approach does not take into account the spatial relationship and how close they are in proximity to one another. The idea being that we would expect bins of similar expression and celltype to be closeby.
BANKSY is an alogorithm that allows you calculate scores that consider both gene expression and spatial proximity of bins to one another. How much consideration that can be given to the spatial locations can be modulated with a lambda value. This lamdba value controls the balance between gene expression and location.
Therefore, BANKSY is a powerful way to include the full extent of data in our dataset to group bins together.
Image source: Singhar et al. (2024)
One of the examples used in the original manupscript makes use of the layered structure of the brain’s cortical structure. Here, we can see the differences from the annotated dataset compared to two different methods of clustering. The middle represents clustering based upon solely gene expression, similar to the Leiden clusters we have already calculated. To contrast, the right figure shows clusters generated from BANKSY. We can see that the bins are more cleanly grouped in spatial location.
Image source: Singhar et al. (2024)
Set-up for BANKSY
To start, we are going to load all the libraries necessary for this analysis as usual.
Before we jump in and use the RunBanksy() function, we first need to understand the inputs required and prepare our Seurat object accordingly.
| Parameter | Description |
|---|---|
group |
Allows us to run BANKSY on multiple samples, enabling the function to understand that overlapping spatial locations can come from different slides. |
dimx, dimy |
The x and y coordinates of a spot on the tissue slide. |
split.scale |
(True or False) Whether to use within-group scaling parameter, accounting for minor differences between datasets. |
k_geom |
Number of bins to consider for the local neighborhood. Larger values will yield larger domains. |
lambda |
Influence of the neighborhood. Larger values yield more spatially coherent domains. The authors recommend using 0.8 to identify broader spatial domains. |
We already have the group parameter with orig.ident (representing our two samples) and so we will set split.scale to True. Therefore the next parameter we needs to supply in the @meta.data are the dimx and dimy spatial coordinates for every bin in the dataset.
The easiest way to grab the coordinates is the GetTissueCoordinates() function:
# Store the cell barcode as a column
# Sometimes when you merge dataframes, you lose rownames
seurat_processed$cell <- rownames(seurat_processed@meta.data)
# CRC and NAT sample coordinates
coords_crc <- GetTissueCoordinates(seurat_processed,
image = "P5CRC.008um")
coords_nat <- GetTissueCoordinates(seurat_processed,
image = "P5NAT.008um")
# Merge P5NAT and P5CRC coordinates into one dataset
coords <- rbind(coords_crc, coords_nat)Why did we have to run GetTissueCoordinates() twice? This is because this function will only return the coordinates for one image at a time. So we need to specifiy the image for each sample and then merge together to create a merged dataframe with x and y coordinates for each sample
# View first few rows of coordinates (x, y)
coords %>% head()x and y coordinates from the GetTissueCoordinates() function.
| x | y | cell | |
|---|---|---|---|
| P5CRC_s_008um_00078_00444-1 | 57263.31 | 61738.62 | P5CRC_s_008um_00078_00444-1 |
| P5CRC_s_008um_00128_00278-1 | 52418.69 | 60257.61 | P5CRC_s_008um_00128_00278-1 |
| P5CRC_s_008um_00052_00559-1 | 60620.61 | 62512.22 | P5CRC_s_008um_00052_00559-1 |
| P5CRC_s_008um_00121_00413-1 | 56362.64 | 60478.41 | P5CRC_s_008um_00121_00413-1 |
| P5CRC_s_008um_00202_00633-1 | 62800.98 | 58138.03 | P5CRC_s_008um_00202_00633-1 |
| P5CRC_s_008um_00035_00460-1 | 57725.67 | 62997.04 | P5CRC_s_008um_00035_00460-1 |
Next we are going to add these coordinates into our metadata, this is so that we can directly call this column from the RunBanksy() function.
# Join together metadata with coordinates
seurat_processed@meta.data <- left_join(x = seurat_processed@meta.data,
y = coords,
by = "cell")
seurat_processed@meta.data %>% head()@meta.data and coords dataframes.
| orig.ident | nCount_Spatial.008um | nFeature_Spatial.008um | log10GenesPerUMI | mitoRatio | leverage.score | seurat_cluster.sketched | seurat_cluster.projected | seurat_cluster.projected.score | cell | x | y |
|---|---|---|---|---|---|---|---|---|---|---|---|
| P5CRC | 65 | 57 | 0.9685377 | 0.2307692 | 0.1182505 | NA | 3 | 0.9783292 | P5CRC_s_008um_00078_00444-1 | 57263.31 | 61738.62 |
| P5CRC | 1300 | 906 | 0.9496410 | 0.1730769 | 0.8126225 | NA | 1 | 1.0000000 | P5CRC_s_008um_00128_00278-1 | 52418.69 | 60257.61 |
| P5CRC | 128 | 121 | 0.9884090 | 0.0234375 | 1.3967893 | NA | 11 | 0.6450775 | P5CRC_s_008um_00052_00559-1 | 60620.61 | 62512.22 |
| P5CRC | 538 | 326 | 0.9203288 | 0.0092937 | 0.6153956 | NA | 9 | 0.6799116 | P5CRC_s_008um_00121_00413-1 | 56362.64 | 60478.41 |
| P5CRC | 365 | 232 | 0.9231919 | 0.0657534 | 0.4457335 | NA | 2 | 1.0000000 | P5CRC_s_008um_00202_00633-1 | 62800.98 | 58138.03 |
| P5CRC | 122 | 100 | 0.9586074 | 0.1639344 | 0.0945122 | NA | 3 | 1.0000000 | P5CRC_s_008um_00035_00460-1 | 57725.67 | 62997.04 |
But wait, the rownames with the bin barcodes are missing! It is vital that the rownames of our Seurat object align with what is stored in Cell(). So we are going to add the rownames back into our metadata, which is why we stored them as a column in the metadata earlier.
# Add cell barcodes back to rownames
rownames(seurat_processed@meta.data) <- seurat_processed$cellRunning BANKSY
The last thing we need to do before running BANKSY is to set values for lambda, use_agf, and k_geom, where:
k_geomdefines how many k-nearest neighbors to use when looking at neighboring bins.use_agftoggles whether or not we want to use the “Azimithal gabor filter” which are values that constrain bin niches.lambdadetermines how much weight to assign to the spatial locations. The higher the value, the more weight is given to the spatial structure of the tissue.
Here we can see an example of what happens when different lambda values are set before clustering for a brain sample:
lambda to different scores on a brain dataset to show how a larger lambda results in more weight to the spatial structure of the tissue.Image source: Spatial Nanocourse
The authors of BANKSY recommend setting lambda = 0.8 which is what we will also use.
Want to link out materials on working on HPC in this callout box
We have been running these analyses on laptops/computers with local resources throughout this workshop. Realistically for a spatial dataset, you would be using either HPC or a cloud-based platform (Google or AWS) which provides access to much more RAM and memory for such large datasets.
This BANKSY step is a great example of the limitations of running these analyses locally. Not everyone will be able to run this step to completion due to a lack of resources - which is okay because at the end of the lesson, we will provide the seurat object generated during this step.
And this is on a cropped slide, not the full dataset!
# Run Banksy from SeuratWrappers
seurat_banksy <- RunBanksy(seurat_processed,
lambda = 0.8,
k_geom = 15,
use_agf = TRUE,
dimx = 'x', dimy = 'y',
group = "orig.ident",
split.scale = TRUE,
assay = 'Spatial.008um',
slot = 'data',
verbose = TRUE)Now we can investigate what has changed in our seurat_banksy object. Most notably, we have a new Active assay: BANKSY.
# Print seurat object after running BANKSY
seurat_banksyAn object of class Seurat
42170 features across 135798 samples within 3 assays
Active assay: BANKSY (6000 features, 0 variable features)
2 layers present: data, scale.data
2 other assays present: Spatial.008um, sketch
4 dimensional reductions calculated: pca.sketch, umap.sketch, full.pca.sketch, full.umap.sketch
2 spatial fields of view present: P5CRC.008um P5NAT.008um
BANKSY matrix
The RunBanksy() step generated a new counts matrix from the original counts matrix. In actually what happens it that BANKSY will concatenate 3 different matrices together, all of which are computed from the top highly variable features.
If we take a look at the beginning, middle, and end of the Features() in the BANKSY matrix we can see each component of the 3 matrices:
# First features in BANKSY matrix
Features(seurat_banksy) %>% head()[1] "IGHG1" "IGHM" "SST" "IGLC1" "INSL5" "CHGA"
# Middle features in BANKSY matrix
Features(seurat_banksy)[2000:2005][1] "TLR8" "IGHG1.m0" "IGHM.m0" "SST.m0" "IGLC1.m0" "INSL5.m0"
# Last features in BANKSY matrix
Features(seurat_banksy) %>% tail()[1] "ADAM23.m1" "PNMT.m1" "RSPO2.m1" "HTRA3.m1" "SLC3A1.m1" "TLR8.m1"
As mentioned before, this is actually three different matrices stacked on top of one another:
Original counts: These are the original, normalized counts for the top highly variable features.
Weighted mean expression (m0): The average gene expression of neighboring cells, where closer neighbors are weighted more heavily than distant ones. This gives each cell a representation of its local micro-environment.
Azimuthal gabor filter (m1): Evaluates the symmetry of gene expression around the bin. This provides a value for whether or not a cell is surrounded by similar cells or at the boundary of a different spatial niche. A higher value indicates that the cell is likely at the boundary between different neighborhoods.
Calculations on BANKSY matrix
At this point, we still do not have our spatially constrained clusters. But now that we have our BANKSY matrix, we can run a simplified clustering workflow, similar how we calculated clusters in the previous lesson.
- What steps/functions did we use to calculate clusters previously?
Dimensionality reduction
With this new matrix, we are once again going to calculate PCA, this time storing the results as pca.banksy. We want to run this on all the features in our BANKSY matrix to include the spatially weighted scores in the calculation.
# PCA
seurat_banksy <- RunPCA(seurat_banksy,
assay = "BANKSY",
reduction.name = "pca.banksy",
features = Features(seurat_banksy),
npcs = 30)With this new PCA, we can visualize the distribution of bins to evaluate whether or not integration is necessary. The batch variable we may need to account for is orig.ident or samples. Additionally, we can pick a marker gene to determine whether there is a split (batch effect) in the data. For this example, we will be using the gene IGKC, a known B cell marker gene.
# BANKSY PCA colored by sample
p_orig <- DimPlot(seurat_banksy,
group.by = "orig.ident",
reduction = "pca.banksy")
# BANKSY PCA colored by IGKC expression
p_igkc <- FeaturePlot(seurat_banksy,
feature = "IGKC",
reduction = "pca.banksy")
# Plot PCA space side-by-side
p_orig + p_igkcThere appears to be 2 unique populations of B cells in our PCA space, where the orig.ident is driving the differences in those bins. This is a clear indication that integration will be necessary to resolve the batch effect in the data.
Integration
We are going to use the Harmony algorithm to batch correct our dataset. Starting from the pca.banksy values, the RunHarmony() function will iteratively adjust the PCA space such that we no longer see a divide demarkated by the batch variable specified in the dataset. We are going to store the resultant latent space as harmony.banksy within the Seurat object.
This step may take several iterations to run, but should converge within a few minutes.
# Harmony batch correction
seurat_banksy <- RunHarmony(seurat_banksy,
group.by.vars = "orig.ident",
reduction = "pca.banksy",
reduction.save = "harmony.banksy",
assay.use = "BANKSY")Once the RunHarmony() function has completed successfully, we have a new latent space called harmony.banksy. The simplest way to view these values is to consider it a “corrected PCA” and treat is the same way you would use a PCA space. Therefore, we can visualize the harmony.banksy space and see if there is a strong effect with orig.ident or B cell marker IGKC.
# Harmony BANKSY PCA colored by sample
p_orig <- DimPlot(seurat_banksy,
group.by = "orig.ident",
reduction = "harmony.banksy")
# Harmony BANKSY PCA colored by IGKC expression
p_igkc <- FeaturePlot(seurat_banksy,
feature = "IGKC",
reduction = "harmony.banksy")
# Plot latent space side-by-side
p_orig + p_igkcThere is no longer a split within the B cell populations, which indicates that we are likely good to proceed further in the workflow.
Clustering
With this harmony integrated space, we can use the FindNeighbors() and FindClusters() functions to perform the clustering. We will be using the same resolution = 0.65 from the previous step to make equivalent comparisons between the spatially-weighted clusters against the transcript-based clusters later on.
# Find k-Nearest Neighbors
seurat_banksy <- FindNeighbors(seurat_banksy,
reduction = "harmony.banksy",
dims = 1:30)
# Leiden clustering
seurat_banksy <- FindClusters(seurat_banksy,
algorithm = 4,
cluster.name = "banksy_cluster",
resolution = 0.65,
random.seed = 12345)At this point, we no longer need access to the BANKSY assay as all the relevant information has now been stored as:
banksy_clustersin@meta.datapca.banksyin@reductionsharmony.banksyin@reductions
So to conserve memory on our laptops (the Seurat object is ~14 GB!!), we are going to delete the BANKSY assay from our Seurat object.
# Change default asasy from BANKSY
DefaultAssay(seurat_banksy) <- "Spatial.008um"
# Delete BANKSY assay by setting it to NULL
seurat_banksy[["BANKSY"]] <- NULLIf you were unable to run the BANKSY step from earlier, we have provided the RDS file that includes the results for every step run until this point:
# Load pre-generated seurat object with relevant BANKSY results
seurat_banksy <- readRDS("data/intermediate_seurat/seurat_banksy.rds")Now is also a great point to assess how well the integration step worked when looking at each of our clusters. As we would hope, most of the clusters are comprised of both samples:
Comparing clustering results
We have two different clustering results, one was calculated solely on the gene expression (seurat_clusters.projected) and the other with spatially-concious BANKSY scores (banksy_cluster). Neither one of these results is “more correct” than the other, rather they provide different pieces of information that we can use to gain insights into our dataset.
Spatial overlay
First, let us see how each of the clusters appear on the spatial slide with SpatialDimPlot()
SpatialDimPlot(seurat_banksy,
group.by = c("banksy_cluster",
"seurat_cluster.projected"),
pt.size.factor = 15,
image.alpha = 0)What differences do you notice between the two clusters?
Sometimes when you have many clusters, it can be challenging to visualize each cluster at once. So here, we provide the code to use facet_wrap() to create separate figures for each cluster to more easily view each cluster:
# Grab x and y coordinates for plotting
# Also get the sample ID and the banksy cluster score for each bin
df <- FetchData(seurat_banksy,
vars = c("x", "y",
"orig.ident",
"banksy_cluster"))
# Subset to one sample (P5CRC)
ggplot(df %>% subset(orig.ident == "P5CRC"),
# Plot spatial coordinates and color by cluster
aes(x = x, y = y,
color = banksy_cluster)) +
geom_point(size = 0.05) +
theme_bw() +
# Create separate plots for each cluster
facet_wrap(~banksy_cluster) +
NoLegend()For one, we can see that there is less “noise” in the seurat_banksy clusters where bins of a cluster are not as scattered throughout the slide - which is what would be expected from the BANKSY algorithm.
UMAP
DimPlot(seurat_banksy,
group.by = c("banksy_cluster",
"seurat_cluster.projected"),
reduction = "full.umap.sketch",
label = TRUE)On the UMAP, we can see that cluster 1 in seurat_cluster.projected appears to be split in half. This tells us that based upon gene expression along (seurat_clusters.projected) these are the same celltype, but that there is a subtle, spatial difference between bins in those two regions. If we go back to the spatial overlay, we see that cluster 1 belongs primarily to tissue on the left. We see that there is a clear spatial divide when we look at the banksy_clusters, split in half from top to bottom.
Not happy with this wording
An important note about this UMAP is that we are still using the UMAP_1 and UMAP_2 that were derived from the original dimensionality workflow. We are not using the UMAP coordinates derived from BANKSY.
- What does the UMAP look like when use the BANKSY derived harmony latent space to calculate UMAP coordinates?
Assess clustering
Just like we saw in the clustering lesson, we can once again answer the question of how well these BANKSY cluster align with the expected celltypes.
| Cell type | Marker genes |
|---|---|
| B cells | IGKC, IGHM, CD79A, MS4A1, MZB1 |
| Endothelial cells | PECAM1, VWF, PLVAP, ENG, KLF2 |
| Fibroblasts | COL1A1, COL3A1, DCN, LUM, COL6A2 |
| Intestinal epithelial cells | CLCA1, FCGBP, MUC2, PIGR, ZG16 |
| Myeloid cells | C1QC, SELENOP, SPP1, LYZ, CD68 |
| Neural cells | NRXN1, L1CAM, NCAM1, VIP, CALB2 |
| Smooth muscle cells | TAGLN, ACTA2, MYH11, MYL9, CNN1 |
| T cells | TRAC, CD3E, TRBC2, IL7R, CD52 |
| Tumor cells | CEACAM6, CEACAM5, EPCAM, KRT8, LCN2 |
So now, for example, we can see how the marker genes for Smooth Muscle cells appear in our seurat_banksy clusters:
VlnPlot(seurat_banksy,
features = c("ACTA2", "MYH11"),
pt.size = 0,
ncol = 1) +
NoLegend()We can clearly see that there is an uptick of expression of both genes in cluster 9 - which is an indication that cluster 9 is likely comprised of Smooth Muscle cells.
- Use the
DotPlot()function in conjunction withmarker_listto see if clusters correspond well with celltypes.
marker_list <- list(
"B cells" = c("IGKC", "IGHM", "CD79A", "MS4A1", "MZB1"),
"Endothelial cells" = c("PECAM1", "VWF", "PLVAP", "ENG", "KLF2"),
"Fibroblasts" = c("COL1A1", "COL3A1", "DCN", "LUM", "COL6A2"),
"Intestinal epithelial cells" = c("CLCA1", "FCGBP", "MUC2", "PIGR", "ZG16"),
"Myeloid cells" = c("C1QC", "SELENOP", "SPP1", "LYZ", "CD68"),
"Neural cells" = c("NRXN1", "L1CAM", "NCAM1", "VIP", "CALB2"),
"Smooth muscle cells" = c("TAGLN", "ACTA2", "MYH11", "MYL9", "CNN1"),
"T cells" = c("TRAC", "CD3E", "TRBC2", "IL7R", "CD52"),
"Tumor cells" = c("CEACAM6", "CEACAM5", "EPCAM", "KRT8", "LCN2")
)- Roughly identify which clusters correspond to which celltypes to provide better context for future analyses.
Save!
saveRDS(seurat_banksy, "data/seurat_banksy.RDS")