Spatially Derived Clusters

Spatial transcriptomics
Clustering
Spatial domains

In this lesson, we will discover how BANKSY leverages neighborhood information to cluster spatial transcriptomics data using both gene expression and spatial context. You will compare BANKSY clusters with expression-only methods and evaluate differences in tissue domain detection.

Author

Noor Sohail

Published

July 22, 2025

Keywords

BANKSY, Neighborhoods, Niches, Azimuthal Gabor filter

Approximate time: 40 minutes

Learning objectives

In this lesson, we will:

  • Calculate a BANKSY counts matrix which incorporates neighborhood structure
  • Run Harmony integration
  • Evaluate spatially-constrained clusters

Overview of lesson

At this point in the analysis, we have clusters and UMAP coordinates calculated based solely upon the gene expression. As this is a spatial dataset, we also have the ability to incorporate the location of bins, or the neighborhood structure, in the clustering process. Here, we will make use of the BANKSY method that adds the local structure as information to ultimately calculate new clusters that are spatially constrained.

BANKSY

In the standard single-cell workflow, our clusters are identified solely based upon the similarity of gene expression between bins. However, this approach does not take into account the spatial relationship and how close bins are in proximity to one another. The idea being that we would expect bins of similar expression and cell type to be nearby.

BANKSY is an algorithm that allows you to calculate scores that consider both the gene expression and spatial proximity of bins to one another. How much consideration can be given to the spatial locations is modulated with a lambda value. This lambda value controls the balance between gene expression and location.

Therefore, BANKSY is a powerful way to include the full extent of data in our dataset to group bins together.

Figure 1: Schematic of how BANKSY identifies cell type clusters based upon both gene expression and spatial proximity.
Image source: Singhar et al. (2024)

One of the examples used in the original manuscript makes use of the layered structure of the brain’s cortical structure. Here, we can see the differences from the annotated dataset compared to two different methods of clustering. The middle represents clustering based upon solely gene expression, similar to the Leiden clusters we have already calculated. To contrast, the right figure shows clusters generated from BANKSY. We can see that the bins are more cleanly grouped in spatial location.

Figure 2: Three different visuals of the brain cortical layers, the left shows the reference annotation, middle is clustering based solely on gene expression, and right shows clusters assigned with BANKSY.
Image source: Singhar et al. (2024)

Set-up for BANKSY

To start, we are going to open a new script and save it as 05_BANKSY.R to our scripts directory. Then, we will load all the libraries necessary for this analysis as usual.

# BANKSY Spatially Derived Clustering
# Visium HD spatial transcriptomics workshop
# Author: Harvard Chan Bioinformatics Core
# Created: May 2026

# Load libraries
library(Seurat)
library(Banksy)
library(SeuratWrappers)
library(tidyverse)
library(future)
library(harmony)
library(patchwork)
# Set seed for reproducibility
set.seed(12345)

# Increases the size of the default vector (8GB)
options(future.globals.maxSize = 8000 * 1024^2)
# Set parallelization (multithreading) to speed up calculations
library(future)
plan(multisession, workers = parallel::detectCores() - 1)

# Load dataset
seurat_processed <- readRDS("data/seurat_processed.RDS")

Before we jump in and use the RunBanksy() function, we first need to understand the inputs required. In doing so, we will update our Seurat object to include information required by the algorithm.

Parameter Description
group Allows us to run BANKSY on multiple samples, enabling the function to understand that overlapping spatial locations can come from different slides.
split.scale (True or False) Whether to use within-group scaling parameter, accounting for minor differences between datasets.
dimx, dimy The x and y coordinates of the bins on the tissue slide.
k_geom Number of bins to consider for the local neighborhood. Larger values will yield larger domains.
lambda Influence of the neighborhood. Larger values yield more spatially coherent domains. The authors recommend using 0.8 to identify broader spatial domains.

Since we have more than one slide, we will make use of both the group and split.scale arguments. Setting split.scale = TRUE will standardize the calculations such that if batch correction is necessary, the results will be improved. This will ideally result in clusters that are well represented across multiple slides.

The next parameter we need to account for are dimx and dimy. The RunBanksy() function expects that the coordinates for each bin are included in the @meta.data slot of the provided Seurat object. Therefore, we need to grab the coordinates for every bin in the dataset and add it as a column.

The easiest way to grab the coordinates is the GetTissueCoordinates() function:

# Store the cell barcode as a column
# Sometimes when you wrangle dataframes, you can lose the rownames
seurat_processed$cell <- rownames(seurat_processed@meta.data)

# CRC and NAT sample coordinates
coords_crc <- GetTissueCoordinates(seurat_processed, 
                                   image = "P5CRC.008um")
coords_nat <- GetTissueCoordinates(seurat_processed, 
                                   image = "P5NAT.008um")

# Merge P5NAT and P5CRC coordinates into one dataframe
coords <- rbind(coords_crc, coords_nat)

Why did we have to run GetTissueCoordinates() twice? This is because this function will only return the coordinates for one image at a time. So we need to specify the image for each sample and then merge together to create a merged dataframe with x and y coordinates for each sample.

# View first few rows of coordinates (x, y)
coords %>% head()
Table 1: Resulting dataframe of x and y coordinates from the GetTissueCoordinates() function.
x y cell
P5CRC_s_008um_00078_00444-1 57263.31 61738.62 P5CRC_s_008um_00078_00444-1
P5CRC_s_008um_00128_00278-1 52418.69 60257.61 P5CRC_s_008um_00128_00278-1
P5CRC_s_008um_00052_00559-1 60620.61 62512.22 P5CRC_s_008um_00052_00559-1
P5CRC_s_008um_00121_00413-1 56362.64 60478.41 P5CRC_s_008um_00121_00413-1
P5CRC_s_008um_00202_00633-1 62800.98 58138.03 P5CRC_s_008um_00202_00633-1
P5CRC_s_008um_00035_00460-1 57725.67 62997.04 P5CRC_s_008um_00035_00460-1

Next we are going to add these coordinates into our metadata, this is so that we can directly call this column from the RunBanksy() function.

# Join together metadata with coordinates
seurat_processed@meta.data <- left_join(x = seurat_processed@meta.data, 
                                        y = coords, 
                                        by = "cell")
seurat_processed@meta.data %>% head()
Table 2: Joined @meta.data and coords dataframes.
orig.ident nCount_Spatial.008um nFeature_Spatial.008um log10GenesPerUMI mitoRatio leverage.score seurat_cluster.sketched seurat_cluster.projected seurat_cluster.projected.score cell x y
P5CRC 65 57 0.9685377 0.2307692 0.1182505 NA 3 0.9783292 P5CRC_s_008um_00078_00444-1 57263.31 61738.62
P5CRC 1300 906 0.9496410 0.1730769 0.8126225 NA 1 1.0000000 P5CRC_s_008um_00128_00278-1 52418.69 60257.61
P5CRC 128 121 0.9884090 0.0234375 1.3967893 NA 11 0.6450775 P5CRC_s_008um_00052_00559-1 60620.61 62512.22
P5CRC 538 326 0.9203288 0.0092937 0.6153956 NA 9 0.6799116 P5CRC_s_008um_00121_00413-1 56362.64 60478.41
P5CRC 365 232 0.9231919 0.0657534 0.4457335 NA 2 1.0000000 P5CRC_s_008um_00202_00633-1 62800.98 58138.03
P5CRC 122 100 0.9586074 0.1639344 0.0945122 NA 3 1.0000000 P5CRC_s_008um_00035_00460-1 57725.67 62997.04

But wait, the rownames with the bin barcodes are missing! It is vital that the rownames of our Seurat object align with what is stored in Cells(). So we are going to add the rownames back into our metadata, which is why we stored them as a column in the metadata earlier.

# Add cell barcodes back to rownames
rownames(seurat_processed@meta.data) <- seurat_processed$cell

If the rownames inside your Seurat object’s @meta.data are missing, you will get errors that are not very informative. This is due to the fact that Seurat will check if your rownames are the same as what is stored in Cells():

# Check that seurat metadata rownames are the same as Cells()
all(rownames(seurat_processed@meta.data) == Cells(seurat_processed))
[1] TRUE

Therefore, it is a good idea to store the barcode/cell IDs as a column in your metadata so that you can always set the rownames if they get lost during a data manipulation step.

Running BANKSY

The last thing we need to do before running BANKSY is to set values for lambda, use_agf, and k_geom, where:

  • k_geom - Defines how many nearest neighbors to use when looking at neighboring bins.
  • use_agf - Toggles whether or not we want to use the “Azimithal Gabor filter” which are values that constrain bin niches.
  • lambda - Determines how much weight to assign to the spatial locations. The higher the value, the more weight is given to the spatial structure of the tissue.

Here we can see an example of what happens when different lambda values are set before clustering for a brain sample:

Figure 3: Results of setting lambda to different scores on a brain dataset to show how a larger lambda results in more weight to the spatial structure of the tissue.
Image source: Spatial Nanocourse
RAM and memory

We have been running these analyses on laptops/computers with local resources throughout this workshop. Realistically, for a spatial dataset, you would be using either a high-performance computing cluster or a cloud-based platform (Google or AWS) which provides access to much more RAM and memory for such large datasets.

This BANKSY step is a great example of the limitations of running these analyses locally. Not everyone will be able to run this step to completion due to a lack of resources.

If you are unable to run this BANKSY step, follow along with the explanation without running the code. We have provided the BANKSY processed object for you to load in at the appropriate step.

And this is on a cropped slide, not the full dataset!

Because of these memory limitations, you maybe be interested in running R on the O2 cluster through the either the command-line or, more likely, through the O2 Portal. The O2 Portal provides an R Studio environment for you to work within, but it allows you to leverage the the memory available on the high-performance computing cluster. We have materials for running R on O2 and also through the O2 Portal.

The authors of BANKSY recommend setting lambda = 0.8 which is what we will also use.

# Run Banksy from SeuratWrappers
seurat_banksy <- RunBanksy(seurat_processed, 
                           lambda = 0.8, 
                           k_geom = 15, 
                           use_agf = TRUE,
                           dimx = 'x',
                           dimy = 'y', 
                           group = "orig.ident", 
                           split.scale = TRUE,
                           assay = 'Spatial.008um', 
                           slot = 'data',
                           verbose = TRUE)

Now we can investigate what has changed in our seurat_banksy object. Most notably, we have a new Active assay: BANKSY.

# Print seurat object after running BANKSY
seurat_banksy
An object of class Seurat 
42170 features across 135798 samples within 3 assays 
Active assay: BANKSY (6000 features, 0 variable features)
 2 layers present: data, scale.data
 2 other assays present: Spatial.008um, sketch
 4 dimensional reductions calculated: pca.sketch, umap.sketch, full.pca.sketch, full.umap.sketch
 2 spatial fields of view present: P5CRC.008um P5NAT.008um

BANKSY matrix

The RunBanksy() step generated a new counts matrix from the original counts matrix. In the previous step, BANKSY calculated two different scores (matrices) for each highly variable gene. Ultimately, these results are all concatenated together to create a BANKSY count matrix.

If we take a look at the beginning, middle and end of the Features() in the BANKSY matrix we can see each component of the 3 matrices:

# First features in BANKSY matrix
Features(seurat_banksy)[1:6]
[1] "IGHG1" "IGHM"  "SST"   "IGLC1" "INSL5" "CHGA" 
# Middle features in BANKSY matrix
Features(seurat_banksy)[2000:2006]
[1] "TLR8"     "IGHG1.m0" "IGHM.m0"  "SST.m0"   "IGLC1.m0" "INSL5.m0" "CHGA.m0" 
# Last features in BANKSY matrix
Features(seurat_banksy)[4000:4006]
[1] "TLR8.m0"  "IGHG1.m1" "IGHM.m1"  "SST.m1"   "IGLC1.m1" "INSL5.m1" "CHGA.m1" 

As mentioned before, this is actually three different matrices stacked on top of one another:

  1. Original counts: These are the original, normalized counts for the top highly variable genes

  2. Weighted mean expression (m0): The average gene expression of neighboring cells, where closer neighbors are weighted more heavily than distant ones. This gives each cell a representation of its local micro-environment.

  3. Azimuthal Gabor filter (m1): Evaluates the symmetry of gene expression around the bin. This provides a value for whether or not a cell is surrounded by similar cells or at the boundary of a different spatial niche. A higher value indicates that the cell is likely at the boundary between different neighborhoods.

Calculations on BANKSY matrix

At this point, we still do not have our spatially constrained clusters. But now that we have our BANKSY matrix, we can run a simplified clustering workflow, similar how we calculated clusters in the previous lesson.

  1. What steps/functions did we use to calculate clusters previously?

Dimensionality reduction

With this new matrix, we are once again going to calculate PCA, this time storing the results as pca.banksy. We want to run this on all the features in our BANKSY matrix to include the spatially weighted scores in the calculation.

# Run PCA
seurat_banksy <- RunPCA(seurat_banksy, 
                        assay = "BANKSY", 
                        reduction.name = "pca.banksy", 
                        features = Features(seurat_banksy), 
                        npcs = 30)

With this new PCA, we can visualize the distribution of bins to evaluate whether or not integration is necessary. The batch variable we may need to account for is orig.ident, or samples. Additionally, we can pick a marker gene to determine whether there is a split (batch effect) in the data. For this example, we will be using the gene IGKC, a known B cell marker gene.

# BANKSY PCA colored by sample
p_orig <- DimPlot(seurat_banksy,
                  group.by = "orig.ident",
                  reduction = "pca.banksy") 

# BANKSY PCA colored by IGKC expression
p_igkc <- FeaturePlot(seurat_banksy,
                        feature = "IGKC",
                        reduction = "pca.banksy")

# Plot PCA space side-by-side
p_orig + p_igkc
Figure 4: PCA plot of PC1 vs PC2 (calculated on BANKSY) with bins colored by sample and B cell marker gene IGKC.

There appears to be 2 distinct populations of B cells in our PCA space, where the orig.ident is driving the differences in those bins. This is a clear indication that integration will be necessary to resolve the batch effect in the data.

Integration

We are going to use the Harmony algorithm to correct our dataset for batch. Starting from the pca.banksy values, the RunHarmony() function will iteratively adjust the PCA space such that we no longer see a divide demarkated by the batch variable specified in the dataset. We are going to store the resultant latent space, or dimensionality reduction, as harmony.banksy within the Seurat object.

This step may take several iterations to run, but it should converge within a few minutes.

# Harmony batch correction
seurat_banksy <- RunHarmony(seurat_banksy,
                            group.by.vars = "orig.ident",
                            reduction = "pca.banksy",
                            reduction.save = "harmony.banksy",
                            assay.use = "BANKSY")

Once the RunHarmony() function has completed successfully, we have a new latent space called harmony.banksy. The simplest way to view these values is to consider it a “corrected PCA” and treat is the same way you would use a PCA space. Therefore, we can visualize the harmony.banksy space and see if there is a strong effect with orig.ident or the B cell marker IGKC.

# Harmony BANKSY PCA colored by sample
p_orig <- DimPlot(seurat_banksy,
                  group.by = "orig.ident",
                  reduction = "harmony.banksy") 

# Harmony BANKSY PCA colored by IGKC expression
p_igkc <- FeaturePlot(seurat_banksy,
                        feature = "IGKC",
                        reduction = "harmony.banksy")

# Plot latent space side-by-side
p_orig + p_igkc
Figure 5: Latent space plot after integration with Harmony, with bins colored by sample and B cell marker gene IGKC.

There is no longer a split within the B cell populations, which indicates that we are likely good to proceed further in the workflow.

Clustering

With this Harmony integrated space, we can use the FindNeighbors() and FindClusters() functions to perform clustering. We will be using the same resolution = 0.65 from the previous step to make equivalent comparisons between the spatially-weighted clusters against the transcript-based clusters later on.

# Find k-Nearest Neighbors
seurat_banksy <- FindNeighbors(seurat_banksy, 
                               reduction = "harmony.banksy", 
                               dims = 1:30)

# Leiden clustering
seurat_banksy <- FindClusters(seurat_banksy, 
                              algorithm = 4,
                              cluster.name = "banksy_cluster",
                              resolution = 0.65,
                              random.seed = 12345)

At this point, we no longer need access to the BANKSY assay as all the relevant information has now been stored as:

  • banksy_clusters in @meta.data
  • pca.banksy in @reductions
  • harmony.banksy in @reductions

So to conserve memory on our laptops (the Seurat object is ~14 GB!!), we are going to delete the BANKSY assay from our Seurat object.

# Change default asasy from BANKSY
DefaultAssay(seurat_banksy) <- "Spatial.008um"

# Delete BANKSY assay by setting it to NULL
seurat_banksy[["BANKSY"]] <- NULL
Loading pre-generated Seurat object

If you were unable to run the BANKSY step from earlier, we have provided the RDS file that includes the results for every step run until this point:

# Load pre-generated seurat object with relevant BANKSY results
seurat_banksy <- readRDS("data/intermediate_seurat/seurat_banksy.rds")

Now is also a great time to assess how well the integration step worked when looking at each of our clusters. As we would hope, most of the clusters are comprised of both samples:

# Barplot of proportion of cells in each cluster by sample
ggplot(seurat_banksy@meta.data) +
    geom_bar(aes(x = banksy_cluster, 
                 fill = orig.ident), 
             position = position_fill())  +
    theme_classic()
Figure 6: Proportion of bins in each cluster by sample for BANKSY clusters.

Comparing clustering results

We have two different clustering results, one was calculated solely on the gene expression (seurat_clusters.projected) and the other with spatially-conscious BANKSY scores (banksy_cluster). Neither one of these results is “more correct” than the other, rather they provide different pieces of information that we can use to gain insights into our dataset.

Spatial overlay

First, let us see how each of the clusters appear on the spatial slide with SpatialDimPlot()

# Visualize the spatial overlay of both BANKSY derived clusters and
# gene expression derived clusters for both samples
SpatialDimPlot(seurat_banksy, 
               group.by = c("banksy_cluster",
                            "seurat_cluster.projected"), 
               pt.size.factor = 15, 
               image.alpha = 0)
Figure 7: Spatial overlay of both BANKSY derived clusters and gene expression derived clusters for both samples.

What differences do you notice between the two clustering results?

Sometimes when you have many clusters, it can be challenging to visualize each cluster at once. So here, we provide the code to use facet_wrap() to create separate figures for each cluster to more easily view each cluster:

# Grab x and y coordinates for plotting
# Also get the sample ID and the banksy cluster score for each bin
df <- FetchData(seurat_banksy,
                vars = c("x", "y", 
                         "orig.ident",
                         "banksy_cluster"))

# Subset to one sample (P5CRC)
ggplot(df %>% subset(orig.ident == "P5CRC"),
       # Plot spatial coordinates and color by cluster
       aes(x = x, y = y,
           color = banksy_cluster)) +
  geom_point(size = 0.05) +
  theme_bw() +
  # Create separate plots for each cluster
  facet_wrap(~banksy_cluster) +
  NoLegend()
Figure 8: Spatial overlay of BANKSY derived cluster, highlighting each cluster in a separate plot.

For one, we can see that there is less “noise” in the seurat_banksy clusters, where bins of a cluster are not as scattered throughout the slide. This is what would be expected from the BANKSY algorithm. For example, if we compare the outer rim of the tissue, the bins are more connected in the BANKSY representation compared to the gene-expression clustering.

First we highlight the bins from cluster 9 from the gene-expression based clustering, seurat_cluster.projected:

# Cluster 9 from seurat_cluster.projected 
# Highlighted on spatial slide
Idents(seurat_banksy) <- "seurat_cluster.projected"
SpatialDimPlot(seurat_banksy,
               pt.size.factor = 15,
               cells.highlight = WhichCells(seurat_banksy,
                                            idents = 9)) +
  plot_annotation(title = "Seurat cluster 9 (projected)")  
Figure 9: Spatial overlay of gene expression derived clusters, highlighting cluster 9.

And contrast it against the analagous BANKSY cluster 12:

# Cluster 12 from banksy_cluster
# Highlighted on spatial slide
Idents(seurat_banksy) <- "banksy_cluster"
SpatialDimPlot(seurat_banksy,
               pt.size.factor = 15,
               cells.highlight = WhichCells(seurat_banksy,
                                            idents = 12)) +
  plot_annotation(title = "BANKSY cluster 12")
Figure 10: Spatial overlay of BANKSY derived clusters, highlighting cluster 12.

The bins are less scattered throughout the slide with the BANKSY cluster. Additionally, bins in close proximity to one another are more likely to be part of the same cluster.

UMAP

We can also take this opportunity to visualize the BANKSY clusters on the UMAP coordinates we previously computed.

# Compare the UMAPs of the BANKSY derived clusters and gene expression derived clusters
DimPlot(seurat_banksy, 
        group.by = c("banksy_cluster",
                     "seurat_cluster.projected"),
        reduction = "full.umap.sketch",
        label = TRUE)
Figure 11: UMAP of both BANKSY derived clusters and gene expression derived clusters.

On the UMAP, cluster 1 in seurat_cluster.projected appears to be split into two groups in banksy_cluster. This suggests that, based on gene expression alone, these bins represent the same cell type, but BANKSY identifies a spatial distinction between them. If we return to the spatial overlay, we see that cluster 1 is primarily located in the left-hand region of P5CRC. In contrast, the corresponding banksy_clusters show a clear spatial divide, with this region split into three subclusters along a top–bottom axis.

Together, this would indicate that cluster 1 broadly corresponds to tumor bins (as we identified in the integration lesson). The spatial context further divides these tumor cells into distinct subtypes.

Keep in mind that that this UMAP still uses the original UMAP_1 and UMAP_2 coordinates from the initial dimensionality reduction workflow; we are not using UMAP coordinates derived from the BANKSY results.

  1. What does the UMAP look like when use the BANKSY derived Harmony latent space to calculate UMAP coordinates?

Assess clustering

Just like we saw in the clustering lesson, we can once again answer the question of how well these BANKSY cluster align with the expected cell types.

Table 3: Marker genes for broad cell types in our dataset.
Cell type Marker genes
B cells IGKC, IGHM, CD79A, MS4A1, MZB1
Endothelial cells PECAM1, VWF, PLVAP, ENG, KLF2
Fibroblasts COL1A1, COL3A1, DCN, LUM, COL6A2
Intestinal epithelial cells CLCA1, FCGBP, MUC2, PIGR, ZG16
Myeloid cells C1QC, SELENOP, SPP1, LYZ, CD68
Neural cells NRXN1, L1CAM, NCAM1, VIP, CALB2
Smooth Muscle cells TAGLN, ACTA2, MYH11, MYL9, CNN1
T cells TRAC, CD3E, TRBC2, IL7R, CD52
Tumor cells CEACAM6, CEACAM5, EPCAM, KRT8, LCN2

So now, for example, we can see how the marker genes for Smooth Muscle cells appear in our seurat_banksy clusters:

# Create violin plots for two Smooth Muscle marker genes
VlnPlot(seurat_banksy,
        features = c("ACTA2", "MYH11"),
        pt.size = 0,
        ncol = 1) +
  NoLegend()
Figure 12: Violin plot of ACTA2 and MYH11 for each BANKSY cluster to potentially identify populations of Smooth Muscle cells.

We can clearly see that there is an uptick of expression of both genes in cluster 9 - which is an indication that cluster 9 is likely comprised of Smooth Muscle cells.

  1. Use the DotPlot() function in conjunction with marker_list to see if clusters correspond well with cell types.
# Create marker list
marker_list <- list(
  "B cells" = c("IGKC", "IGHM", "CD79A", "MS4A1", "MZB1"),
  "Endothelial cells" = c("PECAM1", "VWF", "PLVAP", "ENG", "KLF2"),
  "Fibroblasts" = c("COL1A1", "COL3A1", "DCN", "LUM", "COL6A2"),
  "Intestinal epithelial cells" = c("CLCA1", "FCGBP", "MUC2", "PIGR", "ZG16"),
  "Myeloid cells" = c("C1QC", "SELENOP", "SPP1", "LYZ", "CD68"),
  "Neural cells" = c("NRXN1", "L1CAM", "NCAM1", "VIP", "CALB2"),
  "Smooth muscle cells" = c("TAGLN", "ACTA2", "MYH11", "MYL9", "CNN1"),
  "T cells" = c("TRAC", "CD3E", "TRBC2", "IL7R", "CD52"),
  "Tumor cells" = c("CEACAM6", "CEACAM5", "EPCAM", "KRT8", "LCN2")
)
  1. Roughly identify which clusters correspond to which cell types to provide better context for future analyses.

Save!

Lastly, let’s go ahead and save our Seurat object before we go any further:

# Save seurat_banksy
saveRDS(seurat_banksy, "data/seurat_banksy.RDS")

Next Lesson >>

Back to Schedule

Reuse

CC-BY-4.0