Normalization and Sketch Downsampling


Author

Noor Sohail

Published

July 22, 2025


Approximate time: XX minutes

Learning objectives

In this lesson, we will:

  • Discuss why normalizing counts is necessary for accurate comparison between cells
  • Identify interesting genes with highly variable gene selection
  • Downsample our dataset while retaining the biological complexity


Normalization

Normalization is the process of adjusting raw counts to account for technical variation, allowing us to make fair comparisons of gene expression across different samples, cells, and genes. The counts are affected by many factors, some of which are “interesting” (e.g. true biological differences in RNA expression) and some of which are “uninteresting” (e.g. sequencing depth, technical artifacts); normalization corrects for the “uninteresting” factors.

The most common example of an “uninteresting” factor is sequencing depth: the total number of reads (UMIs) found in a cell. As seen below, sequencing depth can make it appear as though one cell has higher gene expression than another simply because it was sequenced more deeply. In this example, each gene appears to have doubled in expression in cell 2; however, this is a consequence of cell 2 having twice the sequencing depth. Therefore, we need to normalize the counts to account for differences in sequencing depth across cells.

Figure 1: Effect of sequencing depth on gene expression.
Image source: M. Love: RNA-seq data analysis
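The depth effect illustrated above can be reproduced with a toy example in base R (the gene names and counts are made up for illustration):

```r
# Two cells with identical relative expression; cell 2 was sequenced
# twice as deeply, so every raw count is doubled
cell1 <- c(geneA = 10, geneB = 20, geneC = 70)
cell2 <- cell1 * 2

# Dividing by each cell's total counts (its sequencing depth) recovers
# identical per-cell proportions, removing the depth effect
prop1 <- cell1 / sum(cell1)
prop2 <- cell2 / sum(cell2)
all.equal(prop1, prop2)  # TRUE
```

Even though cell 2's raw counts are twice as large, the depth-adjusted values are identical, which is exactly the comparison we want to make.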

Normalization in spatial transcriptomics

Based on this information, we may conclude that normalization is always a necessary step. However, Bhuva et al. (2024) evaluated several normalization strategies and found that normalizing by the number of transcripts detected per bin can negatively affect spatial domain identification. This is because total detection per bin can represent real biology, so normalizing by it can mask true biological differences between bins.

We can also see that sequencing depth varies across different spatial technologies, and even across different tissues.

Figure 2: Library size across different spatial technologies for various tissues.
Image source: Bhuva et al. (2024)

We are cognizant of this subtlety, but as discussed earlier, it can be challenging to tease apart technical artifacts from biological signal. Without normalization, differences in sequencing depth will strongly affect clustering regardless of their cause. For this reason, we apply a standard log-transformed library-size normalization to our data.

Log-normalization steps

Log-normalization is one of the simplest normalization methods for scRNA-seq data. It is a two-step process that involves scaling and transformation.

1. Scaling

First, we divide each count by the total counts (nCount_Spatial.008um) for that cell. Then, we multiply by a scale factor (default is 10,000). Why would we want to do this? Different cells have different amounts of mRNA; this could be due to differences between cell types or to variation within the same cell type, depending on how well the kit worked in one bin versus another.

Regardless, we are not interested in comparing these absolute counts between cells. Instead we are interested in comparing concentrations, and scaling helps achieve this.

2. Transformation

The most common transformation is a log transformation, which is applied in the original Seurat workflow. Other transformations, such as a square root transform, are less commonly used.
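The two steps can be sketched on a toy counts matrix in base R; this mirrors what Seurat's default LogNormalize method does (the toy counts here are made up for illustration):

```r
# Toy counts matrix: genes in rows, cells in columns
counts <- matrix(c(10, 0, 5,
                   20, 2, 5,
                   70, 8, 40),
                 nrow = 3, byrow = TRUE,
                 dimnames = list(c("geneA", "geneB", "geneC"),
                                 c("cell1", "cell2", "cell3")))

# Step 1 (scaling): divide each count by its cell's total counts,
# then multiply by a scale factor of 10,000
scaled <- sweep(counts, 2, colSums(counts), "/") * 1e4

# Step 2 (transformation): natural-log transform with a pseudocount
# of 1 (log1p), since log(0) is undefined
lognorm <- log1p(scaled)
```

After these two steps, the values no longer reflect absolute counts but depth-adjusted, variance-dampened expression, which is what gets stored in the data layer.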

Various methods have been developed specifically for scRNA-seq normalization. Some simpler methods resemble what we have seen with bulk RNA-seq: the application of global scale factors adjusting for a count-depth relationship that is assumed to be common across all genes. However, if those assumptions do not hold, this basic normalization can lead to over-correction for lowly and moderately expressed genes and, in some cases, under-normalization of highly expressed genes (Bacher et al., 2017).

One such example is the SCTransform method, which models gene expression using a regularized negative binomial regression and accounts for technical variation. For more information on this method, the Introduction to scRNA-seq workshop has a lesson on how to apply SCTransform and what it does in detail.

Perform log-normalization

Now that we are on a new step of our analysis, let us create another script titled normalization_and_sketch.R that will contain our normalization and sketching code. This will help us keep our workflow organized and modular.

After loading the necessary libraries, we will load in the seurat_filtered object we created in the previous lesson.

# Normalization and sketch downsampling
# Visium HD spatial transcriptomics workshop
# Author: Harvard Chan Bioinformatics Core
# Created: December 2025

# Load libraries
library(Seurat)
library(tidyverse)
library(scales)

seurat_filtered <- readRDS("data/seurat_filtered.RDS")

To avoid making biased comparisons across bins, we will apply a simple normalization with the NormalizeData() function. This function performs log-normalization by default, but it also has other options for normalization methods.

# Normalize the counts
seurat_sketch <- NormalizeData(seurat_filtered)
seurat_sketch
An object of class Seurat 
18085 features across 135798 samples within 1 assay 
Active assay: Spatial.008um (18085 features, 0 variable features)
 2 layers present: counts, data
 2 spatial fields of view present: P5CRC.008um P5NAT.008um

When we examine our Seurat object, seurat_sketch, we see a new addition on the layers line: data. This is where our log-normalized counts are stored, while the original counts remain in the counts layer.

Highly variable genes (HVGs)

Highly variable gene (HVG) selection is a critical step in the analysis workflow. We know that not all genes are equally informative. We want to emphasize the effect of genes that change across our dataset, rather than giving more weight to genes whose expression remains consistent (and is therefore less interesting). These genes are oftentimes indicators of biological processes or cell types. Ultimately, we get a list of genes ranked by how much they change across populations.

Most downstream steps are computed only on these highly variable genes. Therefore, understanding which genes are highly variable is important for understanding how the results of downstream analyses are computed.

Calculating HVGs

HVGs are calculated using the FindVariableFeatures() function. Some key arguments supplied to this function are:

Table 1: Arguments for FindVariableFeatures() to calculate highly variable genes.
Argument Description
selection.method How to choose top variable features. vst: Fits a line to log(variance) vs. log(mean) using loess, standardizes values, and computes variance after clipping (see clip.max).
nfeatures Number of features to select as top variable features; used when selection.method is dispersion or vst
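The vst procedure described in the table can be sketched in base R on simulated counts. This is a simplified sketch of the idea; Seurat's actual implementation differs in details such as its handling of sparse matrices (the simulated gene rates are made up):

```r
set.seed(42)
# Simulate 200 genes x 50 cells; each gene has its own Poisson rate,
# recycled down the columns so gene i keeps rate lambda[i] in every cell
lambda <- runif(200, 0.1, 5)
counts <- matrix(rpois(200 * 50, lambda), nrow = 200)

gene_mean <- rowMeans(counts)
gene_var  <- apply(counts, 1, var)

# Fit a loess curve to log10(variance) vs. log10(mean), as vst does,
# to get the variance expected for a gene at each expression level
keep <- gene_mean > 0 & gene_var > 0
fit  <- loess(log10(gene_var[keep]) ~ log10(gene_mean[keep]))
expected_var <- 10^fitted(fit)

# Standardize counts using the expected standard deviation, clipping
# extreme values (vst clips at sqrt(number of cells) by default)
clip <- sqrt(ncol(counts))
z <- (counts[keep, ] - gene_mean[keep]) / sqrt(expected_var)
z <- pmin(z, clip)

# The feature score is the variance of the standardized values:
# higher = more variable than expected for its expression level
std_var <- apply(z, 1, var)
head(order(std_var, decreasing = TRUE))  # indices of the top HVGs
```

Ranking genes by std_var and taking the top nfeatures is, conceptually, what FindVariableFeatures() returns.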
# Identify the most variable genes
seurat_sketch <- FindVariableFeatures(seurat_sketch, 
                                      selection.method = "vst",
                                      nfeatures = 2000)
seurat_sketch
An object of class Seurat 
18085 features across 135798 samples within 1 assay 
Active assay: Spatial.008um (18085 features, 2000 variable features)
 2 layers present: counts, data
 2 spatial fields of view present: P5CRC.008um P5NAT.008um

When we examine our Seurat object, we see that FindVariableFeatures() has added 2,000 variable features - just as we specified with nfeatures. Seurat allows us to access the ranked highly variable genes with the VariableFeatures() function. Here we print the top 15:

# Identify the 15 most highly variable genes
ranked_variable_genes <- VariableFeatures(seurat_sketch)
top_genes <- ranked_variable_genes[1:15]
top_genes
 [1] "IGHG1" "IGHM"  "SST"   "IGLC1" "INSL5" "CHGA"  "CXCL8" "GCG"   "HBA2" 
[10] "IGLC7" "VIP"   "IGHG3" "MMP12" "PYY"   "IGHA1"
  1. Using GeneCards, look up one of the top genes and read more about its role and how it could possibly relate to CRC.
  2. Visualize one of the top variable genes on the spatial slide with SpatialFeaturePlot() to see if expression varies across the image.

Sketch downsampling

As high-throughput sequencing has evolved, the number of cells/bins in a dataset has increased dramatically. This can make it challenging to run calculations on the full dataset, especially for computationally intensive algorithms such as PCA and clustering. To offset these problems, “sketching” methods have been developed to downsample the dataset while retaining its biological complexity. This allows us to retain the accuracy of our analyses while reducing the time and resources needed to run calculations.

But how do we know which cells to keep when downsampling? We want to make sure that we are not just keeping the most abundant cell types, but also retaining rare populations that may be biologically important.


Figure 3: Contrasting the number of cells per cell type against leverage scores, highlighting how rare populations have higher leverage scores.
Image source: Hao et al. (2024)

A leverage score is calculated for each bin. This value reflects the magnitude of a bin’s contribution to the gene-covariance matrix, and therefore its importance to the overall dataset. From these scores, bins are selected and added to the sketch assay. This sketch-based analysis aims to “subsample” large datasets in a way that preserves rare populations. The authors of Seurat found that rare cell-type populations (with fewer cells) have higher leverage scores.

This means that, relative to their abundance, rare populations are oversampled in the sketch, retaining the biological complexity of the dataset while drastically reducing the number of cells.
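To build intuition, here is a minimal sketch of the idea using exact statistical leverage scores on toy data. Seurat's LeverageScore() approximates these scores on large matrices using random projections, so this is the concept rather than the exact implementation (the two toy populations are made up):

```r
set.seed(1)
# Toy data: 300 bins x 20 features, with a small "rare" population of
# 10 bins whose expression profile is shifted away from the bulk
common <- matrix(rnorm(290 * 20), nrow = 290)
rare   <- matrix(rnorm(10 * 20, mean = 3), nrow = 10)
A <- rbind(common, rare)

# Exact leverage scores: squared row norms of Q from a QR
# decomposition of the data matrix
Q <- qr.Q(qr(A))
leverage <- rowSums(Q^2)

# Rare bins tend to receive higher leverage scores than common bins
mean(leverage[291:300]) > mean(leverage[1:290])

# Sampling bins with probability proportional to leverage therefore
# preserves the rare population in the downsampled "sketch"
sketch_idx <- sample(nrow(A), 50, prob = leverage)
```

Because the rare bins dominate the leverage distribution, a leverage-weighted sample retains them at a much higher relative rate than uniform downsampling would.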

Downsampling

For our dataset, we are going to downsample to 10,000 bins using the SketchData() function. The function takes a normalized single-cell dataset containing a set of variable features, both of which we calculated in the previous steps. The function then calculates leverage scores for each bin and selects ncells based on these scores to create a new assay called “sketch”.

This step may take a few minutes to run.

# Select 10,000 cells based on leverage scores
# Create a new 'sketch' assay
seurat_sketch <- SketchData(object = seurat_sketch,
                            assay = 'Spatial.008um',
                            ncells = 10000,
                            method = "LeverageScore",
                            sketched.assay = "sketch")

Sketch assay

We can see that there are four major changes that have taken place:

  • The number of features in the second line has doubled, because we have added a new assay
  • Accordingly, the number of assays has increased from one to two
  • The active assay has changed from Spatial.008um to sketch
  • There is a new line listing additional assays that exist in the Seurat object
seurat_sketch
An object of class Seurat 
36170 features across 135798 samples within 2 assays 
Active assay: sketch (18085 features, 2000 variable features)
 2 layers present: counts, data
 1 other assay present: Spatial.008um
 2 spatial fields of view present: P5CRC.008um P5NAT.008um

We can also see that a leverage.score column has been added to the metadata of our object.

seurat_sketch@meta.data %>% View()
Table 2: Updated @meta.data with leverage.score calculated for each bin.
orig.ident nCount_Spatial.008um nFeature_Spatial.008um log10GenesPerUMI mitoRatio leverage.score
P5CRC_s_008um_00078_00444-1 P5CRC 65 57 0.9685377 0.2307692 0.1182505
P5CRC_s_008um_00128_00278-1 P5CRC 1300 906 0.9496410 0.1730769 0.8126225
P5CRC_s_008um_00052_00559-1 P5CRC 128 121 0.9884090 0.0234375 1.3967893
P5CRC_s_008um_00121_00413-1 P5CRC 538 326 0.9203288 0.0092937 0.6153956
P5CRC_s_008um_00202_00633-1 P5CRC 365 232 0.9231919 0.0657534 0.4457335

Visualizing leverage scores

We would not want to see a strong bias in leverage scores across samples, as this would indicate that we are preferentially keeping bins from one sample over another.

ggplot(seurat_sketch@meta.data) +
  geom_histogram(aes(x=leverage.score, 
                     fill=orig.ident), 
                 alpha=0.5, bins=100) +
  theme_classic() +
  scale_x_log10(labels = scales::label_number())
Figure 4: Distribution of leverage scores across samples.

We can also visualize the spatial distribution of leverage scores across the slide to see if we can begin identifying spatial patterns in the data with SpatialFeaturePlot().

SpatialFeaturePlot(seurat_sketch,
                   "leverage.score",
                   pt.size.factor = 15,
                   image.alpha = 0,
                   max.cutoff = 2)
Figure 5: Spatial distribution of leverage scores across the slide.

Save!

Now is a great spot to save our seurat_sketch object, as we have added a new assay to our Seurat object.

# Save Seurat object
saveRDS(seurat_sketch, "data/seurat_sketch.RDS")

Next Lesson >>

Back to Schedule

Reuse

CC-BY-4.0