Quality Control

spatial transcriptomics

Filtration

Focus on quality control strategies for Visium HD, using thresholds on genes, UMIs and mitochondrial content to filter low-quality spots. This lesson helps you detect technical artifacts and sample issues before downstream analyses.

Author

Noor Sohail

Published

July 22, 2025

Keywords

Quality control, Mitochondrial ratio, UMI counts

Approximate time: 45 minutes

Learning objectives

In this lesson, we will:

Construct quality control metrics and visually evaluate the quality of the data
Apply appropriate filters to remove low-quality bins
Create a filtered Seurat object

Overview of lesson

In Visium HD data, the main challenge is in distinguishing bins that are poor quality from bins containing reads from less complex cells. If you expect a particular cell type in your dataset to be less transcriptionally active as compared to other cell types in your dataset, the bins underneath this cell type will naturally have fewer detected genes and transcripts. However, having fewer detected genes and transcripts can also be a technical artifact and not a result of biological signal.

Here we will learn how to identify and set thresholds for filtration so that the dataset contains high-quality bins.

Number of bins before filtration

Before doing any filtration, we can check how many bins we have per sample. This is a number worth keeping in mind throughout the analysis, because it puts every later filtering step in context. For spatial datasets, the number of bins depends directly on the bin size selected: the smaller the bins, the more of them are needed to cover the same tissue area.

# Barplot number of bins per sample
ggplot(seurat_merged@meta.data) +
  geom_bar(aes(x = orig.ident, fill = orig.ident),
           color = "black") +
  geom_text(aes(x = orig.ident, label=after_stat(count)), 
            stat='count', vjust=-1) +
  theme_classic()

Figure 1: Number of bins in the dataset, split by sample.
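
The same per-sample counts can also be read off as a table rather than a plot. The sketch below uses a made-up metadata frame (`toy_meta` and its contents are invented for illustration); in the lesson, the equivalent input would be `seurat_merged@meta.data$orig.ident`.

```r
# Toy stand-in for seurat_merged@meta.data (values are invented)
toy_meta <- data.frame(
  orig.ident = c("P5CRC", "P5CRC", "P5CRC", "P5NAT", "P5NAT")
)

# Count bins per sample; this is the tally geom_bar() draws by default
bins_per_sample <- table(toy_meta$orig.ident)
print(bins_per_sample)
```

Keeping these numbers as an object makes it easy to compare them against the bin counts after filtration.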

Quality metrics

We will assess a variety of metrics to evaluate which bins are considered low/high quality. We will apply very permissive filtering here as it has been shown that low expression can be biologically meaningful for spatial context, so we won’t be as stringent as we normally are with scRNA-seq.

Sample-specific thresholds

We are frequently asked whether you have to use the same threshold values across all your samples. We recommend that you follow what the data is telling you and apply values on a per-sample basis. Ultimately, the end goal is to retain high-quality bins, even if that means using different values for each sample.

The metrics we will be using to filter low-quality bins from high-quality ones include:

UMI counts per bin
Genes detected per bin
Complexity (novelty score)
Mitochondrial counts ratio

We will calculate some of these values throughout this lesson. Others (UMIs and genes per bin) already exist in our @meta.data:

# Store metadata as variable meta
meta <- seurat_merged@meta.data
View(meta)

Table 1: View @meta.data dataframe

	orig.ident	nCount_Spatial.008um	nFeature_Spatial.008um
P5CRC_s_008um_00078_00444-1	P5CRC	65	57
P5CRC_s_008um_00128_00278-1	P5CRC	1300	906
P5CRC_s_008um_00052_00559-1	P5CRC	128	121
P5CRC_s_008um_00121_00413-1	P5CRC	538	326
P5CRC_s_008um_00167_00326-1	P5CRC	44	39

We will be using a variety of visualization methods:

Looking at the values of each bin on the spatial slide
Distribution of values before and after filtration as a density plot

UMI counts (transcripts) per bin

nCount is the number of unique transcripts (UMIs) detected per bin. These values often reflect how transcriptionally active the underlying cell is, which in turn depends largely on its cell type. For example, tumor cells tend to have very high UMI counts.

In the density plots, we expect to see a bimodal distribution. One peak should represent lower-quality bins with fewer UMIs, and a second peak should represent healthy bins with more UMIs. Ideally, the peak representing lower-quality bins (for example, bins over dying cells) is small and the peak representing healthy bins is large.

These numbers may be lower than what we would expect for scRNA-seq datasets due to the small size of the bins.

“Nicer” spatial visualizations

For “nicer” (subjective) plotting with SpatialFeaturePlot() and SpatialDimPlot(), we will add some extra parameters to gain a clearer image beyond the default plot (shown here):

# Visualize the spatial distribution of total UMIs 
# Default parameters
SpatialFeaturePlot(seurat_merged, 
                   "nCount_Spatial.008um",
                   pt.size.factor = 15)

Figure 2: Default `SpatialFeaturePlot()` visualization parameters.

For the rest of this lesson, we will therefore use the following arguments:

pt.size.factor = 15: to clearly see each bin on the slide
image.alpha = 0: to remove the H&E stained image in the background of the image
max.cutoff and min.cutoff: to not allow the color scale to be driven by smaller populations of cells with high/low values

# Visualize the spatial distribution of total UMIs
SpatialFeaturePlot(seurat_merged, 
                   "nCount_Spatial.008um",
                   pt.size.factor = 15,
                   image.alpha = 0,
                   max.cutoff = "q90")

Figure 3: Number of UMIs overlaid over spatial slide.
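
To see what a quantile cutoff such as `max.cutoff = "q90"` does to the color scale, the sketch below caps a made-up vector of UMI counts at its 90th percentile. It mirrors the idea rather than Seurat's internal code; `toy_counts`, `q90` and `capped` are illustrative names, not objects from the lesson.

```r
# Invented UMI counts with one extreme bin
toy_counts <- c(5, 10, 20, 40, 80, 100, 120, 150, 200, 5000)

# 90th percentile of the values, analogous to the "q90" shorthand
q90 <- quantile(toy_counts, probs = 0.90, names = FALSE)

# Values above the cutoff are treated as if they were equal to it
capped <- pmin(toy_counts, q90)

range(toy_counts)  # dominated by the single outlier
range(capped)      # upper end now sits at the 90th percentile
```

Without the cap, one extreme bin would stretch the color scale and make the remaining bins hard to tell apart.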

# log10-transformed density of UMIs for each sample
# Vertical lines are sample-specific filtering thresholds
ggplot(meta) +
  geom_density(aes(x = nCount_Spatial.008um, fill = orig.ident),
               alpha = 0.4,
               color = "black") +
  geom_vline(xintercept = 30, color = "pink") +
  geom_vline(xintercept = 10, color = "lightblue") +
  scale_x_log10() +
  theme_classic()

Figure 4: Number of UMIs density.

# Apply filtration thresholds
meta_filt <- subset(meta,
  ((orig.ident == "P5CRC") & (nCount_Spatial.008um > 30)) |
  ((orig.ident == "P5NAT") & (nCount_Spatial.008um > 10)))

# log10-transformed density of UMIs for each sample after filtration
ggplot(meta_filt) +
  geom_density(aes(x = nCount_Spatial.008um,
                     fill = orig.ident),
                 alpha = 0.4,
                 color = "black") +
  geom_vline(xintercept = 30, color = "pink") +
  geom_vline(xintercept = 10, color = "lightblue") +
  scale_x_log10() +
  theme_classic()

Figure 5: Number of UMIs density after filtration.
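
The chained per-sample conditions above can also be written as a lookup, which may be easier to extend when there are more than two samples. This is a minimal sketch on invented metadata; `toy_meta`, `min_umi` and `keep` are hypothetical names, and only the threshold values 30 and 10 come from this lesson.

```r
# Toy metadata (values are invented)
toy_meta <- data.frame(
  orig.ident = c("P5CRC", "P5CRC", "P5NAT", "P5NAT"),
  nCount_Spatial.008um = c(25, 400, 12, 8)
)

# Sample-specific minimum UMI thresholds used in this lesson
min_umi <- c(P5CRC = 30, P5NAT = 10)

# Look up each bin's threshold by its sample name, then compare
keep <- toy_meta$nCount_Spatial.008um > min_umi[as.character(toy_meta$orig.ident)]
toy_filt <- toy_meta[keep, ]
print(toy_filt)
```

Adding another sample then only requires one more entry in `min_umi`, rather than another clause in the `subset()` condition.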

Genes detected per bin

The nFeature is the number of genes detected per bin, or the number of genes that have a non-zero value in a bin. We have similar expectations for gene detection as we did for number of UMIs in terms of the distribution of values.
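
As a small illustration of how the two metrics relate, the sketch below derives both from an invented count matrix (genes in rows, bins in columns). The gene and bin names are placeholders; Seurat computes these columns for you when the object is created.

```r
# Invented count matrix: genes in rows, bins in columns
toy_counts <- matrix(
  c(0, 3, 0,
    1, 0, 0,
    2, 1, 0,
    0, 9, 0),
  nrow = 4, byrow = TRUE,
  dimnames = list(c("GENE_A", "GENE_B", "GENE_C", "GENE_D"),
                  c("bin_1", "bin_2", "bin_3"))
)

# nCount: total UMIs per bin
n_count <- colSums(toy_counts)

# nFeature: number of genes with a non-zero count per bin
n_feature <- colSums(toy_counts > 0)

rbind(n_count, n_feature)
```

Because every detected gene contributes at least one UMI, the number of genes in a bin can never exceed its number of UMIs.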

When we look at the spatial slide, we can already begin to see patterns of expression (both with genes and UMIs) that will be helpful in understanding our dataset better.

# Visualize the spatial distribution of number of genes
SpatialFeaturePlot(seurat_merged, 
                   "nFeature_Spatial.008um",
                   pt.size.factor = 15,
                   image.alpha = 0,
                   max.cutoff = "q90")

Figure 6: Number of features overlaid over spatial slide.

# log10-transformed density of number of genes for each sample
# Vertical lines are sample-specific filtering thresholds
ggplot(meta) +
  geom_density(aes(x = nFeature_Spatial.008um,
                     fill = orig.ident),
                 alpha = 0.4,
                 color = "black") +
  geom_vline(xintercept = 30, color = "pink") +
  geom_vline(xintercept = 10, color = "lightblue") +
  scale_x_log10() +
  theme_classic()

Figure 7: Number of features density.

# Apply filtration thresholds
meta_filt <- subset(meta,
  ((orig.ident == "P5CRC") & (nFeature_Spatial.008um > 30)) |
  ((orig.ident == "P5NAT") & (nFeature_Spatial.008um > 10)))

# log10-transformed density of number of genes for each sample after filtration
ggplot(meta_filt) +
  geom_density(aes(x = nFeature_Spatial.008um,
                     fill = orig.ident),
                 alpha = 0.4,
                 color = "black") +
  geom_vline(xintercept = 30, color = "pink") +
  geom_vline(xintercept = 10, color = "lightblue") +
  scale_x_log10() +
  theme_classic()

Figure 8: Number of features density after filtration.

Complexity (novelty) score

Sometimes there may be bins with high nCount (UMIs) and low nFeature (genes). This finding would indicate that a few genes were sequenced many times over. We consider these instances to be cases of “low complexity”, where we are getting high expression from only a small number of genes. Some cell types, such as red blood cells, are known for such behavior. However, we should be cautious when interpreting these results, as contamination or technical artifacts could also contribute to this finding. Generally, we expect the complexity score to be above 0.80 for good-quality bins.

The novelty score is computed as the ratio of the log10 number of genes to the log10 number of UMIs, as shown below:

\[ \text{Complexity Score} = \frac{\log_{10}(\text{Number of Genes})}{\log_{10}(\text{Number of UMIs})} \]

We can now calculate this score in R and store it in our @meta.data:

# Add the complexity score (log10 genes per log10 UMI) for each bin to metadata
seurat_merged$log10GenesPerUMI <- log10(seurat_merged$nFeature_Spatial.008um) / 
                                  log10(seurat_merged$nCount_Spatial.008um)

There are several NA values (strictly, NaN) in this newly generated column. They arise in bins with nCount_Spatial.008um = 0, where the score becomes log10(0) / log10(0), and also in bins with a single UMI, where it becomes 0 / 0. These bins will naturally be filtered out once we complete our filtration, so strictly for the purposes of visualization, we are going to set these NA values to 0:

# Turn NA values into 0 for now
seurat_merged$log10GenesPerUMI[is.na(seurat_merged$log10GenesPerUMI)] <- 0
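
To make the origin of these missing values concrete, here is a minimal sketch on invented per-bin values (`n_feature` and `n_count` are stand-ins for the two metadata columns). It covers an empty bin, a bin with a single UMI, a high-complexity bin and a low-complexity bin.

```r
# Invented per-bin values covering the edge cases
n_feature <- c(0, 1, 50, 5)
n_count   <- c(0, 1, 60, 500)

complexity <- log10(n_feature) / log10(n_count)
complexity
# bin 1: log10(0) / log10(0) is -Inf / -Inf, which is NaN
# bin 2: log10(1) / log10(1) is 0 / 0, which is also NaN
# bin 3: many distinct genes relative to UMIs, score close to 1
# bin 4: few genes sequenced many times, low score

# is.na() is TRUE for NaN, so the same replacement catches both cases
complexity[is.na(complexity)] <- 0
```

After the replacement, the first two bins score 0 and therefore fall below the 0.80 threshold along with the low-complexity bin.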

# Visualize the spatial distribution of calculated complexity score
SpatialFeaturePlot(seurat_merged, 
                   "log10GenesPerUMI",
                   pt.size.factor = 17,
                   image.alpha = 0,
                   min.cutoff = "q10")

Figure 9: Complexity score overlaid over spatial slide.

# Density of complexity score for each sample
# Vertical line is filtering threshold
meta <- seurat_merged@meta.data
ggplot(meta) +
  geom_density(aes(x = log10GenesPerUMI,
                   fill = orig.ident),
                 alpha = 0.4,
                 color = "black") +
  geom_vline(xintercept = 0.80) +
  theme_classic()

Figure 10: Complexity score density.

# Apply filtration thresholds
meta_filt <- subset(meta, log10GenesPerUMI > 0.80)

# Density of complexity score for each sample after filtration
ggplot(meta_filt) +
  geom_density(aes(x = log10GenesPerUMI,
                   fill = orig.ident),
                 alpha = 0.4,
                 color = "black") +
  geom_vline(xintercept = 0.80) +
  theme_classic()

Figure 11: Complexity score density after filtration.

Mitochondrial counts ratio

During sequencing, we do not only measure the levels of expression from the nuclear genome - we also capture the mitochondrial genome! High levels of mitochondrial expression can be a sign of dead or dying cells. Therefore, as another metric, we can calculate the proportion of reads (UMIs) in a bin that come from mitochondrial genes.

Bins with greater than 0.25 (25%) mitochondrial reads are typically defined as poor-quality. However, if you have reason to believe that the mitochondrial content is meant to be on the higher end, you can adjust this threshold.

Think about your biological question

While using a baseline score of 0.25 is an acceptable threshold for removing high mitochondrial content cells, it is important to always go back to your original biological question. What samples are you working with? Do you expect there to be high values of mitochondrial expression due to your experimental condition?

For example, if you were studying renal oncocytomas, would you make this same choice? This disease is characterized as having aberrantly high mitochondrial expression (https://www.nature.com/articles/modpathol2015101), so would it make sense to remove cells with a high mitochondrial ratio?

This ratio is computed as:

\[ \text{Mitochondrial Ratio} = \frac{\text{Number of reads aligning to mitochondrial genes}} {\text{Total reads}} \]

# Compute percent mito ratio by finding genes that start with "MT-"
seurat_merged$mitoRatio <- PercentageFeatureSet(object = seurat_merged, 
                                                pattern = "^MT-")
seurat_merged$mitoRatio <- seurat_merged@meta.data$mitoRatio / 100

The zero-count bins that caused the NA values in the complexity score also produce NA values in mitoRatio. Here we set these values to 1.00, so that these bins fall above the mitochondrial threshold and are removed during filtration:

# Turn NA values into 1.00 for now
seurat_merged$mitoRatio[is.na(seurat_merged$mitoRatio)] <- 1.00
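
Conceptually, the ratio is the share of a bin's UMIs that come from genes whose symbol starts with "MT-". The sketch below works this out by hand on an invented count matrix; it is an illustration of the formula, not Seurat's implementation, and the gene and bin names are placeholders.

```r
# Invented count matrix: genes in rows, bins in columns
toy_counts <- matrix(
  c( 4, 0, 1,
     6, 0, 1,
    10, 0, 8),
  nrow = 3, byrow = TRUE,
  dimnames = list(c("MT-CO1", "MT-ND1", "EPCAM"),
                  c("bin_1", "bin_2", "bin_3"))
)

# Rows whose gene symbol starts with "MT-"
is_mito <- grepl("^MT-", rownames(toy_counts))

# Mitochondrial UMIs divided by total UMIs, per bin
mito_ratio <- colSums(toy_counts[is_mito, , drop = FALSE]) / colSums(toy_counts)
mito_ratio
# bin_2 has no UMIs at all, so 0 / 0 gives NaN

# Same placeholder as in the lesson: empty bins get the worst possible score
mito_ratio[is.na(mito_ratio)] <- 1.00
```

Note that the pattern is case-sensitive and matches human gene symbols; other species may use a different prefix.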

# Visualize the spatial distribution of mitochondrial ratio
SpatialFeaturePlot(seurat_merged, 
                   "mitoRatio",
                   pt.size.factor = 15,
                   image.alpha = 0,
                   max.cutoff = "q90")

Figure 12: Mitochondrial ratio overlaid over spatial slide.

# Update meta to grab mitoRatio column
meta <- seurat_merged@meta.data

# Density of mitochondrial ratio for each sample
# Vertical line is filtering threshold
ggplot(meta) +
  geom_density(aes(x = mitoRatio,
                   fill = orig.ident),
                 alpha = 0.4,
                 color = "black") +
  geom_vline(xintercept = 0.25) +
  theme_classic()

Figure 13: Mitochondrial ratio density.

# Apply filtration thresholds
meta_filt <- subset(meta, mitoRatio < 0.25)

# Density of mitochondrial ratio for each sample after filtration
ggplot(meta_filt) +
  geom_density(aes(x = mitoRatio,
                   fill = orig.ident),
                 alpha = 0.4,
                 color = "black") +
  geom_vline(xintercept = 0.25) +
  theme_classic()

Figure 14: Mitochondrial ratio density after filtration.

Exercise 1

Do you notice a pattern across bins with regard to the number of UMIs and features? Make a geom_point plot to compare these values on a per-bin basis, and color each point by the mitochondrial ratio, following the structure provided here:

# Structure for making geom_point plot
# Fill in values to answer the question
seurat_merged@meta.data %>%
  # Sorting by mitoRatio to make high scores appear on top of the plot
  arrange(mitoRatio) %>%
  ggplot() +
  geom_point(aes(x = ???, 
                 y = ???,
                 color = ???),
             size = 0.5) +
  # Setting limits so that outliers don't determine scale of the plot
  ylim(0, 3500) + xlim(0, 3500) +
  theme_bw()

Filtration

We will apply very minimal filtering here. It has been shown that low expression can be biologically meaningful for spatial context, so we won’t be as stringent as we normally are with scRNA-seq.

# Per-sample nCount thresholds
seurat_filtered <- subset(seurat_merged,
  ((orig.ident == "P5CRC") & (nCount_Spatial.008um > 30)) |
  ((orig.ident == "P5NAT") & (nCount_Spatial.008um > 10)))

# Per-sample nFeature thresholds
seurat_filtered <- subset(seurat_filtered,
  ((orig.ident == "P5CRC") & (nFeature_Spatial.008um > 30)) |
  ((orig.ident == "P5NAT") & (nFeature_Spatial.008um > 10)))
  
# Global thresholds for mitochondrial ratio and complexity
seurat_filtered <- subset(seurat_filtered, mitoRatio < 0.25)
seurat_filtered <- subset(seurat_filtered, log10GenesPerUMI > 0.80)

# Print seurat object after filtration
seurat_filtered

An object of class Seurat 
18085 features across 135798 samples within 1 assay 
Active assay: Spatial.008um (18085 features, 0 variable features)
 1 layer present: counts
 2 spatial fields of view present: P5CRC.008um P5NAT.008um
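
When several filters are applied in sequence, it can help to see how many bins each criterion flags on its own. The sketch below does this on invented metadata for six bins from a single sample, using the P5CRC thresholds from this lesson; `toy_meta` and `pass` are hypothetical names and the values are made up.

```r
# Invented metadata for six bins from one sample
toy_meta <- data.frame(
  nCount           = c(5, 200, 150, 40, 900, 35),
  nFeature         = c(5, 150, 20, 35, 600, 31),
  mitoRatio        = c(0.10, 0.05, 0.02, 0.40, 0.08, 0.12),
  log10GenesPerUMI = c(1.00, 0.95, 0.60, 0.96, 0.94, 0.97)
)

# One logical column per criterion (TRUE = bin passes)
pass <- data.frame(
  umi        = toy_meta$nCount > 30,
  genes      = toy_meta$nFeature > 30,
  mito       = toy_meta$mitoRatio < 0.25,
  complexity = toy_meta$log10GenesPerUMI > 0.80
)

# Bins failing each criterion, and bins passing all of them
colSums(!pass)
keep <- rowSums(pass) == ncol(pass)
sum(keep)
```

A bin can fail more than one criterion, so the per-criterion failures do not have to add up to the total number of bins removed.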

Warning: Not validating

After subsetting, you may get the following warning message:

Warning: Not validating Centroids objects 
Warning: Not validating FOV objects
Warning: Not validating FOV objects
Warning: Not validating FOV objects
Warning: Not validating Seurat objects

This warning can be ignored: it comes from Seurat internally checking bin barcodes against the image. It may appear again after every subset step in future lessons, and it can be disregarded there too, because the subsetting still completes.

Exercise 2

How many bins did we remove in this filtration process? Hint: We can use the ncol() function to count the number of bins in a Seurat object.

Visualizing counts data

We can visualize the number of UMIs and gene counts per bin, both as a distribution and layered on top of the tissue image.

Violin plots

Let’s start with a violin plot to look at the distribution of UMI counts and gene counts. The input is our post-filtered dataset.

# Violin plot of UMIs
p_ncount <- VlnPlot(seurat_filtered, 
                    features = "nCount_Spatial.008um", 
                    pt.size = 0, group.by = 'orig.ident') +
  NoLegend()

# Violin plot of number of genes
p_nfeats <- VlnPlot(seurat_filtered, 
                    features = "nFeature_Spatial.008um", 
                    pt.size = 0, group.by = 'orig.ident') + 
  NoLegend()

# Plot UMIs and gene count violin plots side-by-side
p_ncount | p_nfeats

Figure 15: Violin plot of nCount and nFeature after filtration.

We see that both violin plots have a similar peak. However, the UMI (nCount) distribution has a much longer tail than the number of genes distribution (nFeature). This is expected, because while the small physical size of the bins means that most genes will be detected only once or twice, a minority of bins under very transcriptionally active cells may exhibit multiple transcripts of the same gene.

Spatial overlay

Next, we can look at the same metrics and their distribution on the actual image after filtration. Note that some bins will have lower counts than others, in part due to low cellular density or cell types with low complexity in certain tissue regions.

# Visualize the spatial distribution of total UMIs and number of genes after filtration
SpatialFeaturePlot(seurat_filtered, 
                   c("nFeature_Spatial.008um", 
                     "nCount_Spatial.008um"),
                   pt.size.factor = 16,
                   image.alpha = 0)

Figure 16: Number of features and counts overlaid over spatial slide.

Save!

Now is a great spot to save our seurat_filtered object as we have finished filtering.

# Save Seurat object
saveRDS(seurat_filtered, "data/seurat_filtered.RDS")
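
If you want to confirm that a saved object reads back unchanged, a round trip with `readRDS()` is a quick check. The sketch below uses a small stand-in list and a temporary file rather than `seurat_filtered` and the `data/` folder; `toy_object`, `rds_path` and `restored` are illustrative names.

```r
# Stand-in object; in the lesson this would be seurat_filtered
toy_object <- list(sample = "P5CRC", n_bins = 3)

# Write to a temporary file so the sketch does not touch data/
rds_path <- tempfile(fileext = ".RDS")
saveRDS(toy_object, rds_path)

# Read it back and confirm nothing changed
restored <- readRDS(rds_path)
identical(toy_object, restored)
```

Note that `saveRDS()` does not create missing folders, so the `data/` directory has to exist before the save step above is run.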


Reuse

CC-BY-4.0

--- title: "Quality Control" description: | Focus on quality control strategies for Visium HD, using thresholds on genes, UMIs and mitochondrial content to filter low-quality spots. This lesson helps you detect technical artifacts and sample issues before downstream analyses. author: - "Noor Sohail" date: "2025-07-22" categories: - spatial transcriptomics - QC - Filtration keywords: - Quality control - Mitochondrial ratio - UMI counts license: "CC-BY-4.0" editor_options: markdown: wrap: 72 --- ```{r} #| label: load_libraries_data #| echo: false # Load libraries and data library(Seurat) library(tidyverse) seurat_merged <- qs2::qs_read("intermediate/03_seurat_merged.qs") ``` Approximate time: 45 minutes ## Learning objectives In this lesson, we will: - Construct quality control metrics and visually evaluate the quality of the data - Apply appropriate filters to remove low-quality bins - Create a filtered Seurat object ## Overview of lesson In Visium HD data, the main challenge is in **distinguishing bins that are poor quality from bins containing reads from less complex cells.** If you expect a particular cell type in your dataset to be less transcriptionally active as compared to other cell types in your dataset, the bins underneath this cell type will naturally have fewer detected genes and transcripts. However, having fewer detected genes and transcripts can also be a technical artifact and not a result of biological signal. Here we will learn how to identify and set thresholds for filtration so that the dataset contains high-quality bins. ## Number of bins before filtration Before doing any filtration, we can see how many bins we have per sample. This is a number we should keep in the back of our mind throughout the analysis because it will help us understand the distribution of our spots. For spatial datasets, the number of bins is going to correspond directly to the bin size selected. 
```{r} #| label: fig-n_cells #| fig-cap: Number of bins in the dataset, split by sample. # Barplot number of bins per sample ggplot(seurat_merged@meta.data) + geom_bar(aes(x = orig.ident, fill = orig.ident), color = "black") + geom_text(aes(x = orig.ident, label=after_stat(count)), stat='count', vjust=-1) + theme_classic() ``` ## Quality metrics We will assess a variety of metrics to evaluate which bins are considered low/high quality. **We will apply very permissive filtering here** as it has been shown that low expression can be biologically meaningful for spatial context, so we won’t be as stringent as we normally are with scRNA-seq. ::: callout-note # Sample-specific thresholds We are frequently asked if you you have to use the same threshold values across all your samples. We recommend that you follow what the data is telling you and apply values on a per-sample basis. Ultimately the end goal is to retain **high-quality bins**, even if that means using different values for each sample. ::: The metrics we will be using to filter low-quality bins from high-quality ones include: - UMI counts per cell - Genes detected per cell - Complexity (novelty score) - Mitochondrial counts ratio We will calculate some of these values throughout this lesson. Others (UMIs and genes per cell) already exist in our `@meta.data`: ```{r} #| label: meta_view #| eval: false # Store metadata as variable meta meta <- seurat_merged@meta.data View(meta) ``` ```{r} #| label: tbl-meta_dt #| tbl-cap: View `@meta.data` dataframe #| echo: false meta <- seurat_merged@meta.data meta %>% head(5) %>% knitr::kable() ``` **We will be using a variety of visualization methods:** - Looking at the values of each bin on the spatial slide - Distribution of values **before and after** filtration as a density plot ### UMI counts (transcripts) per bin The `nCount` is the number of unique transcripts (UMIs) detected per bin. 
Oftentimes, these values will correspond with how transcriptionally active a cell may be, which is typically defined by their cell type. For example, tumor cells will have very high UMI counts. In the density plots, we expect to see a bimodal distribution. One peak should represent bins containing lower-quality bins with fewer UMIs and a second peak should represent healthy bins with more UMIs. Ideally, the peak representing lower-quality and dying cells is small and the peak representing healthy cells is large. These numbers may be lower than what we would expect for scRNA-seq datasets due to the small size of the bins. ::: {.panel-tabset} #### Spatial overlay ::: {.callout-note collapse=true} # "Nicer" spatial visualizations For "nicer" (subjective) plotting with `SpatialFeaturePlot()` and `SpatialDimPlot()`, we will add some extra parameters to gain a clearer image beyond the default plot (shown here): ```{r} #| label: fig-nCount_spatial_default #| fig-cap: Default `SpatialFeaturePlot()` visualization parameters. # Visualize the spatial distribution of total UMIs # Default parameters SpatialFeaturePlot(seurat_merged, "nCount_Spatial.008um", pt.size.factor = 15) ``` Therefore for the rest of this lesson, we will be using the following arguments: - `pt.size.factor = 15`: to clearly see each bin on the slide - `image.alpha = 0`: to remove the H&E stained image in the background of the image - `max.cutoff` and `min.cutoff`: to not allow the color scale to be driven by smaller populations of cells with high/low values ::: ```{r} #| label: fig-nCount_spatial #| fig-cap: Number of UMIs overlaid over spatial slide. # Visualize the spatial distribution of total UMIs SpatialFeaturePlot(seurat_merged, "nCount_Spatial.008um", pt.size.factor = 15, image.alpha = 0, max.cutoff = "q90") ``` #### Before filtration density ```{r} #| label: fig-nCount_density #| fig-cap: Number of UMIs density. 
# log10-transformed density of UMIs for each sample # Vertical lines are sample-specific filtering thresholds ggplot(meta) + geom_density(aes(x = nCount_Spatial.008um, fill = orig.ident), alpha = 0.4, color = "black") + geom_vline(xintercept = 30, color = "pink") + geom_vline(xintercept = 10, color = "lightblue") + scale_x_log10() + theme_classic() ``` #### After filtration density ```{r} #| label: fig-nCount_density_filt #| fig-cap: Number of UMIs density after filtration. # Apply filtration thresholds meta_filt <- subset(meta, ((orig.ident == "P5CRC") & (nCount_Spatial.008um > 30)) | ((orig.ident == "P5NAT") & (nCount_Spatial.008um > 10))) # log10-transformed density of UMIs for each sample after filtration ggplot(meta_filt) + geom_density(aes(x = nCount_Spatial.008um, fill = orig.ident), alpha = 0.4, color = "black") + geom_vline(xintercept = 30, color = "pink") + geom_vline(xintercept = 10, color = "lightblue") + scale_x_log10() + theme_classic() ``` ::: ### Genes detected per bin The `nFeature` is the number of genes detected per bin, or the number of genes that have a non-zero value in a bin. We have similar expectations for gene detection as we did for number of UMIs in terms of the distribution of values. When we look at the spatial slide, we can already begin to see patterns of expression (both with genes and UMIs) that will be helpful in understanding our dataset better. ::: {.panel-tabset} #### Spatial overlay ```{r} #| label: fig-nFeature_spatial #| fig-cap: Number of features overlaid over spatial slide. # Visualize the spatial distribution of number of genes SpatialFeaturePlot(seurat_merged, "nFeature_Spatial.008um", pt.size.factor = 15, image.alpha = 0, max.cutoff = "q90") ``` #### Before filtration density ```{r} #| label: fig-nFeature_density #| fig-cap: Number of features density. 
# log10-transformed density of number of genes for each sample # Vertical lines are sample-specific filtering thresholds ggplot(meta) + geom_density(aes(x = nFeature_Spatial.008um, fill = orig.ident), alpha = 0.4, color = "black") + geom_vline(xintercept = 30, color = "pink") + geom_vline(xintercept = 10, color = "lightblue") + scale_x_log10() + theme_classic() ``` #### After filtration density ```{r} #| label: fig-nFeature_density_filt #| fig-cap: Number of features density after filtration. # Apply filtration thresholds meta_filt <- subset(meta, ((orig.ident == "P5CRC") & (nFeature_Spatial.008um > 30)) | ((orig.ident == "P5NAT") & (nFeature_Spatial.008um > 10))) # log10-transformed density of number of genes for each sample after filtration ggplot(meta_filt) + geom_density(aes(x = nFeature_Spatial.008um, fill = orig.ident), alpha = 0.4, color = "black") + geom_vline(xintercept = 30, color = "pink") + geom_vline(xintercept = 10, color = "lightblue") + scale_x_log10() + theme_classic() ``` ::: ### Complexity (novelty) score Sometimes there may be bins with high `nCount` (UMIs) and low `nFeature` (genes). This finding would indicate that a few genes were sequenced many times over. We consider these instances to be cases of "low complexity", where we are getting high expression from only a small number of genes. Some cell types, such as red blood cells, are known for such behavior. However, we should be cautious when interpreting these results, as contamination or technical artifacts could also contribute to this finding. Generally, we expect the complexity score to be above 0.80 for good-quality bins. 
The novelty score is computed as a ratio of genes to UMIs, as shown below: $$ \text{Complexity Score} = \frac{\log_{10}(\text{Number of Genes})}{\log_{10}(\text{Number of UMIs})} $$ Which we can now calculate using R and store in our `@meta.data`: ```{r} #| label: calc_log10GenesPerUMI # Add number of genes per UMI for each cell to metadata seurat_merged$log10GenesPerUMI <- log10(seurat_merged$nFeature_Spatial.008um) / log10(seurat_merged$nCount_Spatial.008um) ``` There are several `NA` values in this newly generated column. This result is because there are some bins with `nCount_Spatial.008um = 0`. These bins will naturally be filtered out once we complete our filtration, so strictly for the purposes of visualization, we are going to set these `NA` values to 0: ```{r} #| label: zero_log10GenesPerUMI # Turn NA values into 0 for now seurat_merged$log10GenesPerUMI[is.na(seurat_merged$log10GenesPerUMI)] <- 0 ``` ::: {.panel-tabset} #### Spatial overlay ```{r} #| label: fig-log10GenesPerUMI_spatial #| fig-cap: Complexity score overlaid over spatial slide. # Visualize the spatial distribution of calculated complexity score SpatialFeaturePlot(seurat_merged, "log10GenesPerUMI", pt.size.factor = 17, image.alpha = 0, min.cutoff = "q10") ``` #### Before filtration density ```{r} #| label: fig-log10GenesPerUMI_density #| fig-cap: Complexity score density. # log10-transformed density of complexity score for each sample # Vertical line is filtering threshold meta <- seurat_merged@meta.data ggplot(meta) + geom_density(aes(x = log10GenesPerUMI, fill = orig.ident), alpha = 0.4, color = "black") + geom_vline(xintercept = 0.80) + theme_classic() ``` #### After filtration density ```{r} #| label: fig-log10GenesPerUMI_density_filt #| fig-cap: Complexity score density after filtration. 
# Apply filtration thresholds meta_filt <- subset(meta, log10GenesPerUMI > 0.80) # log10-transformed density of complexity score for each sample after filtration ggplot(meta_filt) + geom_density(aes(x = log10GenesPerUMI, fill = orig.ident), alpha = 0.4, color = "black") + geom_vline(xintercept = 0.80) + theme_classic() ``` ::: ### Mitochondrial counts ratio During sequencing, we do not only measure the levels of expression from the nuclear genome - we also capture the mitochondrial genome! High levels of mitochondrial expression can be a sign of dead or dying cells. Therefore, we can calculate the proportion of reads (UMIs) that come from the mitochondria out of all the transcripts in a cell as another metric. Bins with greater than `0.25` (25%) mitochondrial reads are typically defined as poor-quality. However, if you have reason to believe that the mitochondrial content is meant to be on the higher end, you can adjust this threshold. ::: {.callout-note collapse=true} # Think about your biological question While using a baseline score of 0.25 is an acceptable threshold for removing high mitochondrial content cells, it is important to always go back to your original biological question. What samples are you working with? Do you expect there to be high values of mitochondrial expression due to your experimental condition? For example, if you were studying renal oncocytomas, would you make this same choice? This disease is characterized as having [aberrantly high mitochondrial expression](https://www.nature.com/articles/modpathol2015101), so would it make sense to remove cells with high mitochondrial ratio? 
::: This ratio is computed as: $$ \text{Mitochondrial Ratio} = \frac{\text{Number of reads aligning to mitochondrial genes}} {\text{Total reads}} $$ ```{r} #| label: calc_mitoRatio # Compute percent mito ratio by finding genes that start with "MT-" seurat_merged$mitoRatio <- PercentageFeatureSet(object = seurat_merged, pattern = "^MT-") seurat_merged$mitoRatio <- seurat_merged@meta.data$mitoRatio / 100 ``` The same issue that caused the `NA` values in the complexity score will appear in the calculated `mitoRatio`. So here we will set these values to be `1.00`: ```{r} #| label: hundred_mitoRatio # Turn NA values into 1.00 for now seurat_merged$mitoRatio[is.na(seurat_merged$mitoRatio)] <- 1.00 ``` ::: {.panel-tabset} #### Spatial overlay ```{r} #| label: fig-mitoRatio_spatial #| fig-cap: Mitochondrial ratio overlaid over spatial slide. # Visualize the spatial distribution of mitochondrial ratio SpatialFeaturePlot(seurat_merged, "mitoRatio", pt.size.factor = 15, image.alpha = 0, max.cutoff = "q90") ``` #### Before filtration density ```{r} #| label: fig-mitoRatio_density #| fig-cap: Mitochondrial ratio density. # Update meta to grab mitoRatio column meta <- seurat_merged@meta.data # log10-transformed density of mitochondrial ratio for each sample # Vertical line is filtering threshold ggplot(meta) + geom_density(aes(x = mitoRatio, fill = orig.ident), alpha = 0.4, color = "black") + geom_vline(xintercept = 0.25) + theme_classic() ``` #### After filtration density ```{r} #| label: fig-mitoRatio_density_filt #| fig-cap: Mitochondrial ratio density after filtration. # Apply filtration thresholds meta_filt <- subset(meta, mitoRatio < 0.25) # log10-transformed density of mitochondrial ratio for each sample after filtration ggplot(meta_filt) + geom_density(aes(x = mitoRatio, fill = orig.ident), alpha = 0.4, color = "black") + geom_vline(xintercept = 0.25) + theme_classic() ``` ::: :::{.callout-tip} # [**Exercise 1**](05_quality_control-Answer_key.qmd#exercise-1) 1. 
   Do you notice a pattern across bins with regard to the number of UMIs and features? Make a `geom_point` plot to compare these values on a per-bin basis and color each point by the mitochondrial ratio, following the structure provided here:

```{r}
#| label: exercise_geom_point
#| eval: false

# Structure for making geom_point plot
# Fill in values to answer the question
seurat_merged@meta.data %>%
  # Sorting by mitoRatio to make high scores appear on top of the plot
  arrange(mitoRatio) %>%
  ggplot() +
  geom_point(aes(x = ???, y = ???, color = ???), size = 0.5) +
  # Setting limits so that outliers don't determine scale of the plot
  ylim(0, 3500) +
  xlim(0, 3500) +
  theme_bw()
```

:::

## Filtration

We will apply very minimal filtering here. It has been shown that low expression can be biologically meaningful for spatial context, so we won’t be as stringent as we normally are with scRNA-seq.

```{r}
#| label: filter_cells

# Per-sample nCount thresholds
seurat_filtered <- subset(seurat_merged,
                          ((orig.ident == "P5CRC") & (nCount_Spatial.008um > 30)) |
                          ((orig.ident == "P5NAT") & (nCount_Spatial.008um > 10)))

# Per-sample nFeature thresholds
seurat_filtered <- subset(seurat_filtered,
                          ((orig.ident == "P5CRC") & (nFeature_Spatial.008um > 30)) |
                          ((orig.ident == "P5NAT") & (nFeature_Spatial.008um > 10)))

# Global thresholds for mitochondrial ratio and complexity
seurat_filtered <- subset(seurat_filtered, mitoRatio < 0.25)
seurat_filtered <- subset(seurat_filtered, log10GenesPerUMI > 0.80)

# Print seurat object after filtration
seurat_filtered
```

::: callout-warning
# Warning: Not validating

After subsetting, you may get the following warning message:

```{r}
#| label: warning_fov
#| eval: false

Warning: Not validating Centroids objects
Warning: Not validating FOV objects
Warning: Not validating FOV objects
Warning: Not validating FOV objects
Warning: Not validating Seurat objects
```

This warning message can be ignored as it is Seurat internally checking bin barcodes against the image.
In future lessons, this message may appear again after every `subset` step, but it can be disregarded because the subsetting still completes.

:::

:::{.callout-tip}
# [**Exercise 2**](05_quality_control-Answer_key.qmd#exercise-2)

2. How many bins did we remove in this filtration process? _Hint: We can use the `ncol()` function to count the number of bins in a Seurat object._

:::

## Visualizing counts data

We can visualize the number of UMIs and genes detected per bin, both as a distribution and layered on top of the tissue image.

### Violin plots

Let’s start with a violin plot to look at the distribution of UMI counts and gene counts. The input is our post-filtered dataset.

```{r}
#| label: fig-vln_ncount_nfeature
#| fig-cap: Violin plot of nCount and nFeature after filtration.

# Violin plot of UMIs
p_ncount <- VlnPlot(seurat_filtered,
                    features = "nCount_Spatial.008um",
                    pt.size = 0,
                    group.by = 'orig.ident') +
  NoLegend()

# Violin plot of number of genes
p_nfeats <- VlnPlot(seurat_filtered,
                    features = "nFeature_Spatial.008um",
                    pt.size = 0,
                    group.by = 'orig.ident') +
  NoLegend()

# Plot UMIs and gene count violin plots side-by-side
p_ncount | p_nfeats
```

We see that both violin plots have a similar peak. However, the UMI (`nCount`) distribution has a much longer tail than the number of genes distribution (`nFeature`). This is expected: the small physical size of the bins means that most genes will be detected only once or twice, but a minority of bins under very transcriptionally active cells may contain multiple transcripts of the same gene.

### Spatial overlay

Next, we can look at the same metrics and their distribution on the actual image after filtration. Note that some bins will have lower counts than others, in part due to low cellular density or cell types with low complexity in certain tissue regions.

```{r}
#| label: fig-nFeature_nCount_spatial
#| fig-cap: Number of features and counts overlaid over spatial slide.
#| fig-width: 10
#| fig-height: 8

# Visualize the spatial distribution of total UMIs and number of genes after filtration
SpatialFeaturePlot(seurat_filtered,
                   c("nFeature_Spatial.008um", "nCount_Spatial.008um"),
                   pt.size.factor = 16,
                   image.alpha = 0)
```

## Save!

Now is a great spot to save our `seurat_filtered` object as we have finished filtering.

```{r}
#| label: save_seurat_filtered
#| eval: false

# Save Seurat object
saveRDS(seurat_filtered, "data/seurat_filtered.RDS")
```

```{r}
#| label: save_seurat_filtered_qs
#| eval: false
#| echo: false

# Save Seurat object
qs2::qs_save(seurat_filtered, "intermediate/05_seurat_filtered.qs")
```

***

[Next Lesson >>](06_normalization_and_sketch.qmd)

[Back to Schedule](../schedule/schedule.qmd)