Spatially Derived Clusters

Spatial transcriptomics
Clustering
Spatial domains

Discover how BANKSY leverages neighborhood information to cluster spatial transcriptomics data using both gene expression and spatial context. You will compare BANKSY clusters with expression-only methods and evaluate differences in tissue domain detection.

Author

Noor Sohail

Published

July 22, 2025

Keywords

BANKSY, Neighborhoods, Niches, Azimuthal gabor filter

Approximate time: XX minutes

Learning objectives

In this lesson, we will:

  • Calculate BANKSY counts matrix which incorporates neighborhood structure
  • Run harmony integration
  • Evaluate spatially constrained clusters

Overview of lesson

At this point in the analysis, we have clusters and UMAP coordinates calculated based solely upon the gene expression. As this is a spatial dataset, we also have the ability to incorporate the location of bins, or the neighborhood structure, in the clustering process. Here, we will make use of the BANKY method that adds the local structure as information to ultimately calculate new clusters that are spatially constrained.

BANKSY

In the standard single-cell workflow, our clusters are identified solely based upon the similarity of gene expression between bins. However, this approach does not take into account the spatial relationship and how close they are in proximity to one another. The idea being that we would expect bins of similar expression and celltype to be closeby.

BANKSY is an alogorithm that allows you calculate scores that consider both gene expression and spatial proximity of bins to one another. How much consideration that can be given to the spatial locations can be modulated with a lambda value. This lamdba value controls the balance between gene expression and location.

Therefore, BANKSY is a powerful way to include the full extent of data in our dataset to group bins together.

Figure 1: Schematic of how BANKSY identifes celltype clusters based upon both gene expression and spatial proximity.
Image source: Singhar et al. (2024)

One of the examples used in the original manupscript makes use of the layered structure of the brain’s cortical structure. Here, we can see the differences from the annotated dataset compared to two different methods of clustering. The middle represents clustering based upon solely gene expression, similar to the Leiden clusters we have already calculated. To contrast, the right figure shows clusters generated from BANKSY. We can see that the bins are more cleanly grouped in spatial location.

Figure 2: Three different visuals of the brain cortical layers, the left shows the reference annotation, middle is clustering based solely on gene expression, and right shows clusters assigned with BANKSY.
Image source: Singhar et al. (2024)

Set-up for BANKSY

To start, we are going to load all the libraries necessary for this analysis as usual.

# Load libariries
library(Seurat)
library(Banksy)
library(SeuratWrappers)
library(tidyverse)
library(future)
library(harmony)

# Set seed for reproducibility
set.seed(12345)

# Set parallelization (multithreading) to speed up calculations
plan("multicore", workers = parallel::detectCores() - 1)

# Increases the size of the default vector (8GB)
options(future.globals.maxSize = 8000 * 1024^2)

# Load dataset
seurat_processed <- readRDS("data/seurat_processed.RDS")

Before we jump in and use the RunBanksy() function, we first need to understand the inputs required and prepare our Seurat object accordingly.

Parameter Description
group Allows us to run BANKSY on multiple samples, enabling the function to understand that overlapping spatial locations can come from different slides.
dimx, dimy The x and y coordinates of a spot on the tissue slide.
split.scale (True or False) Whether to use within-group scaling parameter, accounting for minor differences between datasets.
k_geom Number of bins to consider for the local neighborhood. Larger values will yield larger domains.
lambda Influence of the neighborhood. Larger values yield more spatially coherent domains. The authors recommend using 0.8 to identify broader spatial domains.

We already have the group parameter with orig.ident (representing our two samples) and so we will set split.scale to True. Therefore the next parameter we needs to supply in the @meta.data are the dimx and dimy spatial coordinates for every bin in the dataset.

The easiest way to grab the coordinates is the GetTissueCoordinates() function:

# Store the cell barcode as a column
# Sometimes when you merge dataframes, you lose rownames
seurat_processed$cell <- rownames(seurat_processed@meta.data)

# CRC and NAT sample coordinates
coords_crc <- GetTissueCoordinates(seurat_processed, 
                                   image = "P5CRC.008um")
coords_nat <- GetTissueCoordinates(seurat_processed, 
                                   image = "P5NAT.008um")

# Merge P5NAT and P5CRC coordinates into one dataset
coords <- rbind(coords_crc, coords_nat)

Why did we have to run GetTissueCoordinates() twice? This is because this function will only return the coordinates for one image at a time. So we need to specifiy the image for each sample and then merge together to create a merged dataframe with x and y coordinates for each sample

# View first few rows of coordinates (x, y)
coords %>% head()
Table 1: Resultant dataframe of x and y coordinates from the GetTissueCoordinates() function.
x y cell
P5CRC_s_008um_00078_00444-1 57263.31 61738.62 P5CRC_s_008um_00078_00444-1
P5CRC_s_008um_00128_00278-1 52418.69 60257.61 P5CRC_s_008um_00128_00278-1
P5CRC_s_008um_00052_00559-1 60620.61 62512.22 P5CRC_s_008um_00052_00559-1
P5CRC_s_008um_00121_00413-1 56362.64 60478.41 P5CRC_s_008um_00121_00413-1
P5CRC_s_008um_00202_00633-1 62800.98 58138.03 P5CRC_s_008um_00202_00633-1
P5CRC_s_008um_00035_00460-1 57725.67 62997.04 P5CRC_s_008um_00035_00460-1

Next we are going to add these coordinates into our metadata, this is so that we can directly call this column from the RunBanksy() function.

# Join together metadata with coordinates
seurat_processed@meta.data <- left_join(x = seurat_processed@meta.data, 
                                        y = coords, 
                                        by = "cell")
seurat_processed@meta.data %>% head()
Table 2: Joined @meta.data and coords dataframes.
orig.ident nCount_Spatial.008um nFeature_Spatial.008um log10GenesPerUMI mitoRatio leverage.score seurat_cluster.sketched seurat_cluster.projected seurat_cluster.projected.score cell x y
P5CRC 65 57 0.9685377 0.2307692 0.1182505 NA 3 0.9783292 P5CRC_s_008um_00078_00444-1 57263.31 61738.62
P5CRC 1300 906 0.9496410 0.1730769 0.8126225 NA 1 1.0000000 P5CRC_s_008um_00128_00278-1 52418.69 60257.61
P5CRC 128 121 0.9884090 0.0234375 1.3967893 NA 11 0.6450775 P5CRC_s_008um_00052_00559-1 60620.61 62512.22
P5CRC 538 326 0.9203288 0.0092937 0.6153956 NA 9 0.6799116 P5CRC_s_008um_00121_00413-1 56362.64 60478.41
P5CRC 365 232 0.9231919 0.0657534 0.4457335 NA 2 1.0000000 P5CRC_s_008um_00202_00633-1 62800.98 58138.03
P5CRC 122 100 0.9586074 0.1639344 0.0945122 NA 3 1.0000000 P5CRC_s_008um_00035_00460-1 57725.67 62997.04

But wait, the rownames with the bin barcodes are missing! It is vital that the rownames of our Seurat object align with what is stored in Cell(). So we are going to add the rownames back into our metadata, which is why we stored them as a column in the metadata earlier.

# Add cell barcodes back to rownames
rownames(seurat_processed@meta.data) <- seurat_processed$cell

Running BANKSY

The last thing we need to do before running BANKSY is to set values for lambda, use_agf, and k_geom, where:

  • k_geom defines how many k-nearest neighbors to use when looking at neighboring bins.
  • use_agf toggles whether or not we want to use the “Azimithal gabor filter” which are values that constrain bin niches.
  • lambda determines how much weight to assign to the spatial locations. The higher the value, the more weight is given to the spatial structure of the tissue.

Here we can see an example of what happens when different lambda values are set before clustering for a brain sample:

Figure 3: Results of setting lambda to different scores on a brain dataset to show how a larger lambda results in more weight to the spatial structure of the tissue.
Image source: Spatial Nanocourse

The authors of BANKSY recommend setting lambda = 0.8 which is what we will also use.

Warning

Want to link out materials on working on HPC in this callout box

RAM and memory

We have been running these analyses on laptops/computers with local resources throughout this workshop. Realistically for a spatial dataset, you would be using either HPC or a cloud-based platform (Google or AWS) which provides access to much more RAM and memory for such large datasets.

This BANKSY step is a great example of the limitations of running these analyses locally. Not everyone will be able to run this step to completion due to a lack of resources - which is okay because at the end of the lesson, we will provide the seurat object generated during this step.

And this is on a cropped slide, not the full dataset!

# Run Banksy from SeuratWrappers
seurat_banksy <- RunBanksy(seurat_processed, 
                           lambda = 0.8, 
                           k_geom = 15, 
                           use_agf = TRUE,
                           dimx = 'x', dimy = 'y', 
                           group = "orig.ident", 
                           split.scale = TRUE,
                           assay = 'Spatial.008um', 
                           slot = 'data',
                           verbose = TRUE)

Now we can investigate what has changed in our seurat_banksy object. Most notably, we have a new Active assay: BANKSY.

# Print seurat object after running BANKSY
seurat_banksy
An object of class Seurat 
42170 features across 135798 samples within 3 assays 
Active assay: BANKSY (6000 features, 0 variable features)
 2 layers present: data, scale.data
 2 other assays present: Spatial.008um, sketch
 4 dimensional reductions calculated: pca.sketch, umap.sketch, full.pca.sketch, full.umap.sketch
 2 spatial fields of view present: P5CRC.008um P5NAT.008um

BANKSY matrix

The RunBanksy() step generated a new counts matrix from the original counts matrix. In actually what happens it that BANKSY will concatenate 3 different matrices together, all of which are computed from the top highly variable features.

If we take a look at the beginning, middle, and end of the Features() in the BANKSY matrix we can see each component of the 3 matrices:

# First features in BANKSY matrix
Features(seurat_banksy) %>% head()
[1] "IGHG1" "IGHM"  "SST"   "IGLC1" "INSL5" "CHGA" 
# Middle features in BANKSY matrix
Features(seurat_banksy)[2000:2005]
[1] "TLR8"     "IGHG1.m0" "IGHM.m0"  "SST.m0"   "IGLC1.m0" "INSL5.m0"
# Last features in BANKSY matrix
Features(seurat_banksy) %>% tail()
[1] "ADAM23.m1" "PNMT.m1"   "RSPO2.m1"  "HTRA3.m1"  "SLC3A1.m1" "TLR8.m1"  

As mentioned before, this is actually three different matrices stacked on top of one another:

  1. Original counts: These are the original, normalized counts for the top highly variable features.

  2. Weighted mean expression (m0): The average gene expression of neighboring cells, where closer neighbors are weighted more heavily than distant ones. This gives each cell a representation of its local micro-environment.

  3. Azimuthal gabor filter (m1): Evaluates the symmetry of gene expression around the bin. This provides a value for whether or not a cell is surrounded by similar cells or at the boundary of a different spatial niche. A higher value indicates that the cell is likely at the boundary between different neighborhoods.

Calculations on BANKSY matrix

At this point, we still do not have our spatially constrained clusters. But now that we have our BANKSY matrix, we can run a simplified clustering workflow, similar how we calculated clusters in the previous lesson.

  1. What steps/functions did we use to calculate clusters previously?

Dimensionality reduction

With this new matrix, we are once again going to calculate PCA, this time storing the results as pca.banksy. We want to run this on all the features in our BANKSY matrix to include the spatially weighted scores in the calculation.

# PCA
seurat_banksy <- RunPCA(seurat_banksy, 
                        assay = "BANKSY", 
                        reduction.name = "pca.banksy", 
                        features = Features(seurat_banksy), 
                        npcs = 30)

With this new PCA, we can visualize the distribution of bins to evaluate whether or not integration is necessary. The batch variable we may need to account for is orig.ident or samples. Additionally, we can pick a marker gene to determine whether there is a split (batch effect) in the data. For this example, we will be using the gene IGKC, a known B cell marker gene.

# BANKSY PCA colored by sample
p_orig <- DimPlot(seurat_banksy,
                  group.by = "orig.ident",
                  reduction = "pca.banksy") 

# BANKSY PCA colored by IGKC expression
p_igkc <- FeaturePlot(seurat_banksy,
                        feature = "IGKC",
                        reduction = "pca.banksy")

# Plot PCA space side-by-side
p_orig + p_igkc
Figure 4: PCA plot of PC1 vs PC2 (calculated on BANKSY) with bins colored by sample and B cell marker gene IGKC.

There appears to be 2 unique populations of B cells in our PCA space, where the orig.ident is driving the differences in those bins. This is a clear indication that integration will be necessary to resolve the batch effect in the data.

Integration

We are going to use the Harmony algorithm to batch correct our dataset. Starting from the pca.banksy values, the RunHarmony() function will iteratively adjust the PCA space such that we no longer see a divide demarkated by the batch variable specified in the dataset. We are going to store the resultant latent space as harmony.banksy within the Seurat object.

This step may take several iterations to run, but should converge within a few minutes.

# Harmony batch correction
seurat_banksy <- RunHarmony(seurat_banksy,
                            group.by.vars = "orig.ident",
                            reduction = "pca.banksy",
                            reduction.save = "harmony.banksy",
                            assay.use = "BANKSY")

Once the RunHarmony() function has completed successfully, we have a new latent space called harmony.banksy. The simplest way to view these values is to consider it a “corrected PCA” and treat is the same way you would use a PCA space. Therefore, we can visualize the harmony.banksy space and see if there is a strong effect with orig.ident or B cell marker IGKC.

# Harmony BANKSY PCA colored by sample
p_orig <- DimPlot(seurat_banksy,
                  group.by = "orig.ident",
                  reduction = "harmony.banksy") 

# Harmony BANKSY PCA colored by IGKC expression
p_igkc <- FeaturePlot(seurat_banksy,
                        feature = "IGKC",
                        reduction = "harmony.banksy")

# Plot latent space side-by-side
p_orig + p_igkc
Figure 5: Latent space plot after integration with Harmony, with bins colored by sample and B cell marker gene IGKC.

There is no longer a split within the B cell populations, which indicates that we are likely good to proceed further in the workflow.

Clustering

With this harmony integrated space, we can use the FindNeighbors() and FindClusters() functions to perform the clustering. We will be using the same resolution = 0.65 from the previous step to make equivalent comparisons between the spatially-weighted clusters against the transcript-based clusters later on.

# Find k-Nearest Neighbors
seurat_banksy <- FindNeighbors(seurat_banksy, 
                               reduction = "harmony.banksy", 
                               dims = 1:30)

# Leiden clustering
seurat_banksy <- FindClusters(seurat_banksy, 
                              algorithm = 4,
                              cluster.name = "banksy_cluster",
                              resolution = 0.65,
                              random.seed = 12345)

At this point, we no longer need access to the BANKSY assay as all the relevant information has now been stored as:

  • banksy_clusters in @meta.data
  • pca.banksy in @reductions
  • harmony.banksy in @reductions

So to conserve memory on our laptops (the Seurat object is ~14 GB!!), we are going to delete the BANKSY assay from our Seurat object.

# Change default asasy from BANKSY
DefaultAssay(seurat_banksy) <- "Spatial.008um"

# Delete BANKSY assay by setting it to NULL
seurat_banksy[["BANKSY"]] <- NULL
Loading pre-generated Seurat object

If you were unable to run the BANKSY step from earlier, we have provided the RDS file that includes the results for every step run until this point:

# Load pre-generated seurat object with relevant BANKSY results
seurat_banksy <- readRDS("data/intermediate_seurat/seurat_banksy.rds")

Now is also a great point to assess how well the integration step worked when looking at each of our clusters. As we would hope, most of the clusters are comprised of both samples:

# Barplot of proportion of cells in each cluster by sample
ggplot(seurat_banksy@meta.data) +
    geom_bar(aes(x = banksy_cluster, 
                 fill = orig.ident), 
             position = position_fill())  +
    theme_classic()
Figure 6: Proportion of bins in each cluster by sample for BANKSY clusters.

Comparing clustering results

We have two different clustering results, one was calculated solely on the gene expression (seurat_clusters.projected) and the other with spatially-concious BANKSY scores (banksy_cluster). Neither one of these results is “more correct” than the other, rather they provide different pieces of information that we can use to gain insights into our dataset.

Spatial overlay

First, let us see how each of the clusters appear on the spatial slide with SpatialDimPlot()

SpatialDimPlot(seurat_banksy, 
               group.by = c("banksy_cluster",
                            "seurat_cluster.projected"), 
               pt.size.factor = 15, 
               image.alpha = 0)
Figure 7: Spatial overlay of both BANKSY derived clusters and gene expression derived clusters.

What differences do you notice between the two clusters?

Sometimes when you have many clusters, it can be challenging to visualize each cluster at once. So here, we provide the code to use facet_wrap() to create separate figures for each cluster to more easily view each cluster:

# Grab x and y coordinates for plotting
# Also get the sample ID and the banksy cluster score for each bin
df <- FetchData(seurat_banksy,
                vars = c("x", "y", 
                         "orig.ident",
                         "banksy_cluster"))

# Subset to one sample (P5CRC)
ggplot(df %>% subset(orig.ident == "P5CRC"),
       # Plot spatial coordinates and color by cluster
       aes(x = x, y = y,
           color = banksy_cluster)) +
  geom_point(size = 0.05) +
  theme_bw() +
  # Create separate plots for each cluster
  facet_wrap(~banksy_cluster) +
  NoLegend()
Figure 8: Spatial overlay of BANKSY derived cluster, highlighting each cluster in a separate plot.

For one, we can see that there is less “noise” in the seurat_banksy clusters where bins of a cluster are not as scattered throughout the slide - which is what would be expected from the BANKSY algorithm.

UMAP

DimPlot(seurat_banksy, 
        group.by = c("banksy_cluster",
                     "seurat_cluster.projected"),
        reduction = "full.umap.sketch",
        label = TRUE)
Figure 9: UMAP of both BANKSY derived clusters and gene expression derived clusters.

On the UMAP, we can see that cluster 1 in seurat_cluster.projected appears to be split in half. This tells us that based upon gene expression along (seurat_clusters.projected) these are the same celltype, but that there is a subtle, spatial difference between bins in those two regions. If we go back to the spatial overlay, we see that cluster 1 belongs primarily to tissue on the left. We see that there is a clear spatial divide when we look at the banksy_clusters, split in half from top to bottom.

Warning

Not happy with this wording

An important note about this UMAP is that we are still using the UMAP_1 and UMAP_2 that were derived from the original dimensionality workflow. We are not using the UMAP coordinates derived from BANKSY.

  1. What does the UMAP look like when use the BANKSY derived harmony latent space to calculate UMAP coordinates?

Assess clustering

Just like we saw in the clustering lesson, we can once again answer the question of how well these BANKSY cluster align with the expected celltypes.

Table 3: Marker genes for broad cell types in our dataset.
Cell type Marker genes
B cells IGKC, IGHM, CD79A, MS4A1, MZB1
Endothelial cells PECAM1, VWF, PLVAP, ENG, KLF2
Fibroblasts COL1A1, COL3A1, DCN, LUM, COL6A2
Intestinal epithelial cells CLCA1, FCGBP, MUC2, PIGR, ZG16
Myeloid cells C1QC, SELENOP, SPP1, LYZ, CD68
Neural cells NRXN1, L1CAM, NCAM1, VIP, CALB2
Smooth muscle cells TAGLN, ACTA2, MYH11, MYL9, CNN1
T cells TRAC, CD3E, TRBC2, IL7R, CD52
Tumor cells CEACAM6, CEACAM5, EPCAM, KRT8, LCN2

So now, for example, we can see how the marker genes for Smooth Muscle cells appear in our seurat_banksy clusters:

VlnPlot(seurat_banksy,
        features = c("ACTA2", "MYH11"),
        pt.size = 0,
        ncol = 1) +
  NoLegend()
Figure 10: Violin plot of ACTA2 and MYH11 for each BANKSY cluster to potentially identify populations of Smooth Muscle cells.

We can clearly see that there is an uptick of expression of both genes in cluster 9 - which is an indication that cluster 9 is likely comprised of Smooth Muscle cells.

  1. Use the DotPlot() function in conjunction with marker_list to see if clusters correspond well with celltypes.
marker_list <- list(
  "B cells" = c("IGKC", "IGHM", "CD79A", "MS4A1", "MZB1"),
  "Endothelial cells" = c("PECAM1", "VWF", "PLVAP", "ENG", "KLF2"),
  "Fibroblasts" = c("COL1A1", "COL3A1", "DCN", "LUM", "COL6A2"),
  "Intestinal epithelial cells" = c("CLCA1", "FCGBP", "MUC2", "PIGR", "ZG16"),
  "Myeloid cells" = c("C1QC", "SELENOP", "SPP1", "LYZ", "CD68"),
  "Neural cells" = c("NRXN1", "L1CAM", "NCAM1", "VIP", "CALB2"),
  "Smooth muscle cells" = c("TAGLN", "ACTA2", "MYH11", "MYL9", "CNN1"),
  "T cells" = c("TRAC", "CD3E", "TRBC2", "IL7R", "CD52"),
  "Tumor cells" = c("CEACAM6", "CEACAM5", "EPCAM", "KRT8", "LCN2")
)
  1. Roughly identify which clusters correspond to which celltypes to provide better context for future analyses.

Save!

saveRDS(seurat_banksy, "data/seurat_banksy.RDS")

Next Lesson >>

Back to Schedule

Reuse

CC-BY-4.0