Will Gammerdinger, Noor Sohail, Zhu Zhuo, James Billingsley, Shannan Ho Sui
Published
July 22, 2025
Approximate time: XY minutes
Learning Objectives
Discuss why normalizing counts is necessary for accurate comparison between cells
LO 2
LO 3
Normalization
Normalization is important in order to make expression counts comparable across genes and/or samples. We note that the best normalization methods for spatial data are still being developed and evaluated. In particular, Bhuva et al. tested a variety of normalization techniques and found that normalizing by the number of transcripts detected per bin can negatively affect spatial domain identification, because total detections per bin can represent real biology.
We are cognizant of this, but as discussed earlier, it can be challenging to determine whether a bin has few detections because of a technical artifact or biological signal. In the absence of normalization, this lack of signal will strongly affect clustering regardless of whether it is technical or biological. For this reason, we apply a standard log-transformed library size normalization to our data.
Why Should We Normalize?
An essential first step in the majority of mRNA expression analyses is normalization, in which systematic variation is adjusted for to make expression counts comparable across genes and/or samples. The counts of mapped reads for each gene are proportional to the expression of RNA ("interesting") in addition to many other factors ("uninteresting"). Normalization is the process of adjusting raw count values to account for the "uninteresting" factors.
Sequencing depth: Accounting for sequencing depth is necessary for comparing gene expression between cells. In the example below, each gene appears to have doubled in expression in cell 2; however, this is simply a consequence of cell 2 having twice the sequencing depth.
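To make the depth effect concrete, here is a small illustrative sketch (in Python rather than R, with invented counts): dividing each cell's counts by its total converts raw counts into proportions, which are directly comparable across cells.

```python
# Toy counts for three genes in two cells; "cell 2" was sequenced
# twice as deeply, so every raw count is doubled (invented numbers)
cell1 = [10, 20, 30]
cell2 = [20, 40, 60]

def to_proportions(counts):
    """Scale a cell's counts by its total so cells become comparable."""
    total = sum(counts)
    return [c / total for c in counts]

# After depth scaling, the two cells have identical expression profiles
assert to_proportions(cell1) == to_proportions(cell2)
```

Once depth is scaled out, any remaining differences between the two cells would reflect composition, not sequencing depth.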
Log-normalization Steps
Various methods have been developed specifically for scRNA-seq normalization. Some simpler methods resemble what we have seen with bulk RNA-seq: the application of global scale factors that adjust for a count-depth relationship assumed to be common across all genes. However, if those assumptions do not hold, this basic normalization can lead to over-correction for lowly and moderately expressed genes and, in some cases, under-normalization of highly expressed genes (Bacher R et al, 2017).
Regardless of which method is used for normalization, it can be helpful to think of it as a two-step process (even though it is often described as a single step in most papers). The first is a scaling step and the second is a transformation.
1. Scaling
The first step in normalization is to multiply each UMI count by a cell-specific factor so that all cells end up with the same total UMI count. Why would we want to do this? Different cells have different amounts of mRNA; this could be due to differences between cell types, or to variation within the same cell type depending on how well the chemistry worked in one droplet versus another. In either case, we are not interested in comparing absolute counts between cells. Instead, we are interested in comparing concentrations, and scaling helps achieve this.
2. Transformation
The next step is a transformation, and it is at this step where we can distinguish the simpler versus complex methods as mentioned above.
Simple transformations are those which apply the same function to each individual measurement. Common examples include a log transform (which is applied in the original Seurat workflow), or a square root transform (less commonly used).
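The two steps above can be sketched numerically. The snippet below is an illustrative Python translation of the arithmetic behind Seurat's default LogNormalize behavior (scale each cell to a common total, 10,000 by default, then apply a natural-log log1p transform); it is a sketch of the math, not Seurat code.

```python
import math

def log_normalize(counts, scale_factor=10_000):
    """Two-step normalization of one cell's counts:
    1. scaling   - divide by the cell's total and multiply by a common
                   scale factor, so all cells share the same total
    2. transform - log1p, i.e. log(1 + x), so zeros stay zero
    """
    total = sum(counts)
    return [math.log1p(c / total * scale_factor) for c in counts]

# Two cells with the same composition but different depth normalize identically
assert log_normalize([10, 20, 30]) == log_normalize([20, 40, 60])
```

Note that the log transform compresses large values, so highly expressed genes no longer dominate downstream distance calculations.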
Perform Log-normalization
Let’s start by creating a new script for the normalization and integration steps. Create a new script (File -> New File -> R script), and save it as normalization_and_sketch.R.
For the remainder of the workflow we will be mainly using functions available in the Seurat package. Additionally, we will load in the seurat_filtered object we created in the previous lesson.
# Normalization and sketch downsampling
# Visium HD spatial transcriptomics workshop
# Author: Harvard Chan Bioinformatics Core
# Created: December 2025

# Load libraries
library(Seurat)
library(tidyverse)
library(scales)

seurat_filtered <- readRDS("data/seurat_filtered.RDS")
seurat_filtered
An object of class Seurat
18085 features across 67660 samples within 1 assay
Active assay: Spatial.016um (18085 features, 0 variable features)
1 layer present: counts
2 spatial fields of view present: P5CRC.016um P5NAT.016um
Before we make any comparisons across cells, we will apply a simple normalization with the NormalizeData() function.
# Normalize the counts
seurat_sketch <- NormalizeData(seurat_filtered)
seurat_sketch
An object of class Seurat
18085 features across 67660 samples within 1 assay
Active assay: Spatial.016um (18085 features, 0 variable features)
2 layers present: counts, data
2 spatial fields of view present: P5CRC.016um P5NAT.016um
And we can see that there is now a new data layer in the Seurat object.
Highly Variable Genes
Highly variable gene (HVG) selection is extremely important since many downstream steps are computed only on these genes.
Essentially, we are looking at genes with high levels of variance while also accounting for the average expression. In doing so, we get a list of genes ranked by how much they change across different cell populations.
We calculate these scores to minimize the number of genes used in downstream calculations and to emphasize the genes that change across our dataset, rather than giving weight to genes whose expression remains consistent (and is therefore less informative).
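The idea behind this ranking can be sketched numerically. The toy Python example below, using simulated counts, fits a mean-variance trend, standardizes each gene by its expected standard deviation, clips extreme values, and ranks genes by standardized variance. Two simplifications to note: Seurat's vst fits the trend with loess, while a degree-2 polynomial stands in here, and the data are invented.

```python
import numpy as np

rng = np.random.default_rng(0)
# Simulated counts: 200 genes x 100 cells; inflate the variability of
# the first 5 genes so there is something to find
counts = rng.poisson(lam=5.0, size=(200, 100)).astype(float)
counts[:5] *= rng.uniform(0.2, 5.0, size=(5, 100))

gene_mean = counts.mean(axis=1)
gene_var = counts.var(axis=1)

# Fit the mean-variance trend in log10 space (a simple polynomial
# stand-in for the loess fit used by vst)
coef = np.polyfit(np.log10(gene_mean), np.log10(gene_var), deg=2)
expected_sd = np.sqrt(10.0 ** np.polyval(coef, np.log10(gene_mean)))

# Standardize each gene by its *expected* sd, clip at sqrt(n_cells),
# then rank genes by the variance of the standardized values
n_cells = counts.shape[1]
z = (counts - gene_mean[:, None]) / expected_sd[:, None]
z = np.clip(z, -np.sqrt(n_cells), np.sqrt(n_cells))
standardized_var = z.var(axis=1)
hvg_order = np.argsort(standardized_var)[::-1]  # most variable genes first
```

Genes whose variance far exceeds what their mean expression predicts end up at the top of `hvg_order`; these are the HVGs.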
Calculating HVGs
HVGs are calculated using the FindVariableFeatures() function. Some key arguments supplied to this function are:
Argument
Description
selection.method
How to choose top variable features. vst: Fits a line to log(variance) vs. log(mean) using loess, standardizes values, and computes variance after clipping (see clip.max).
nfeatures
Number of features to select as top variable features; used when selection.method is dispersion or vst
# Identify the most variable genes
seurat_sketch <- FindVariableFeatures(seurat_sketch,
                                      selection.method = "vst",
                                      nfeatures = 2000)
seurat_sketch
An object of class Seurat
18085 features across 67660 samples within 1 assay
Active assay: Spatial.016um (18085 features, 2000 variable features)
2 layers present: counts, data
2 spatial fields of view present: P5CRC.016um P5NAT.016um
When we examine our Seurat object, we see that FindVariableFeatures() has added 2,000 variable features - just as we specified with nfeatures.
Seurat allows us to access the ranked highly variable genes with the VariableFeatures() function. Here we print the top 15:
# Identify the 15 most highly variable genes
ranked_variable_genes <- VariableFeatures(seurat_sketch)
top_genes <- ranked_variable_genes[1:15]
top_genes
Using GeneCards, look up one of the top genes and read more about its role and how it could possibly relate to CRC.
Sketch Downsampling
As single-cell technologies have evolved, datasets have grown ever larger. The concept of a "sketch" has become popular as a way to offset the difficulties of analyzing these large populations. The aim is to generate a subset of cells that is representative of each cell state in the samples.
What are the advantages of using a "sketch" method?
One of the advantages of such a method is that we can run our typical scRNA-seq workflow on the downsampled data and project the results back onto the larger dataset. We can therefore run the more computationally intensive algorithms (PCA, clustering, integration) on a smaller number of cells, reducing both the time and resources required for the calculations.
In order to determine which cells to keep, a leverage score is calculated for each cell. The leverage score reflects the magnitude of a cell's contribution to the gene-covariance matrix, and therefore its importance to the overall dataset; sampling cells with probability proportional to their leverage preserves the low-rank structure of the expression matrix in the subset.
This sketch-based analysis aims to subsample large datasets in a way that preserves rare populations. The authors of Seurat found that rare cell-type populations, which contain fewer cells, tend to have higher leverage scores.
This means that our downsampled cells in the sketch will oversample rare populations, retaining the biological complexity of the dataset while drastically compressing the number of cells.
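As an illustration of why rare populations receive high leverage scores, the Python sketch below computes exact leverage scores for an invented matrix (290 "common" cells plus 10 "rare" cells with a distinct profile) as the diagonal of the hat matrix, obtained via a QR decomposition. Seurat's SketchData() computes an approximation of these scores at scale, so this is a conceptual sketch, not the package's implementation.

```python
import numpy as np

rng = np.random.default_rng(1)
# Invented normalized expression: 290 "common" cells plus 10 "rare"
# cells with a shifted profile, over 10 variable genes
common = rng.normal(0.0, 1.0, size=(290, 10))
rare = rng.normal(5.0, 1.0, size=(10, 10))
X = np.vstack([common, rare])

# Leverage score of cell i is the i-th diagonal entry of the hat matrix
# X (X^T X)^{-1} X^T, equal to the squared norm of row i of Q where X = QR
Q, _ = np.linalg.qr(X)
leverage = (Q ** 2).sum(axis=1)

# The rare cells carry far more leverage per cell, so sampling cells with
# probability proportional to leverage oversamples the rare population
print(leverage[:290].mean(), leverage[290:].mean())
```

The leverage scores sum to the rank of the matrix, so they behave like a fixed budget of importance; a small, distinct population claims a disproportionate share of it per cell.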
Downsampling
We are going to downsample our dataset to 10,000 bins using the SketchData() function. The function takes a normalized single-cell dataset containing a set of variable features.
It returns a Seurat object with a new assay (sketch), consisting of ncells bins selected based on a "leverage score" computed for each bin. This step may take a few minutes to run.
# We select 10,000 cells and create a new 'sketch' assay
seurat_sketch <- SketchData(object = seurat_sketch,
                            assay = 'Spatial.016um',
                            ncells = 10000,
                            method = "LeverageScore",
                            sketched.assay = "sketch")
Sketch Assay
We can see that there are four major changes that have taken place:
The number of features in the second line has doubled, because we have added a new assay
Accordingly, the number of assays has increased from one to two
The active assay has changed from Spatial.016um to sketch
There is a new line listing additional assays that exist in the Seurat object
seurat_sketch
An object of class Seurat
36170 features across 67660 samples within 2 assays
Active assay: sketch (18085 features, 2000 variable features)
2 layers present: counts, data
1 other assay present: Spatial.016um
2 spatial fields of view present: P5CRC.016um P5NAT.016um
We can also see that the leverage.score has been added as a column to the metadata of our object.
Visualizing Leverage Scores
We can plot the distribution of leverage scores for each sample, and also view them spatially:
ggplot(seurat_sketch@meta.data) +
  geom_histogram(aes(x = leverage.score, fill = orig.ident), alpha = 0.5, bins = 100) +
  theme_classic() +
  scale_x_log10(labels = scales::label_number())
SpatialFeaturePlot(seurat_sketch,
                   "leverage.score",
                   pt.size.factor = 11,
                   image.alpha = 0,
                   max.cutoff = 2)
Save!
Now is a great spot to save our seurat_sketch object as we have added a new assay to our Seurat object.
# Save Seurat object
saveRDS(seurat_sketch, "data/seurat_sketch.RDS")