# Barplot number of bins per sample
ggplot(seurat_merged@meta.data) +
geom_bar(aes(x = orig.ident, fill = orig.ident),
color = "black") +
geom_text(aes(x = orig.ident, label=after_stat(count)),
stat='count', vjust=-1) +
theme_classic()Quality Control
Focus on quality control strategies for Visium HD, using thresholds on genes, UMIs and mitochondrial content to filter low-quality spots. This lesson helps you detect technical artifacts and sample issues before downstream analyses.
Quality control, Mitochondrial ratio, UMI counts
Approximate time: 45 minutes
Learning objectives
In this lesson, we will:
- Construct quality control metrics and visually evaluate the quality of the data
- Apply appropriate filters to remove low-quality bins
- Create a filtered Seurat object
Overview of lesson
In Visium HD data, the main challenge is in distinguishing bins that are poor quality from bins containing reads from less complex cells. If you expect a particular cell type in your dataset to be less transcriptionally active as compared to other cell types in your dataset, the bins underneath this cell type will naturally have fewer detected genes and transcripts. However, having fewer detected genes and transcripts can also be a technical artifact and not a result of biological signal.
Here we will learn how to identify and set thresholds for filtration so that the dataset contains high-quality bins.
Number of bins before filtration
Before doing any filtration, we can see how many bins we have per sample. This is a number we should keep in the back of our mind throughout the analysis because it will help us understand the distribution of our spots. For spatial datasets, the number of bins is going to correspond directly to the bin size selected.
Quality metrics
We will assess a variety of metrics to evaluate which bins are considered low/high quality. We will apply very permissive filtering here as it has been shown that low expression can be biologically meaningful for spatial context, so we won’t be as stringent as we normally are with scRNA-seq.
We are frequently asked if you you have to use the same threshold values across all your samples. We recommend that you follow what the data is telling you and apply values on a per-sample basis. Ultimately the end goal is to retain high-quality bins, even if that means using different values for each sample.
The metrics we will be using to filter low-quality bins from high-quality ones include:
- UMI counts per cell
- Genes detected per cell
- Complexity (novelty score)
- Mitochondrial counts ratio
We will calculate some of these values throughout this lesson. Others (UMIs and genes per cell) already exist in our @meta.data:
# Store metadata as variable meta
meta <- seurat_merged@meta.data
View(meta)@meta.data dataframe
| orig.ident | nCount_Spatial.008um | nFeature_Spatial.008um | |
|---|---|---|---|
| P5CRC_s_008um_00078_00444-1 | P5CRC | 65 | 57 |
| P5CRC_s_008um_00128_00278-1 | P5CRC | 1300 | 906 |
| P5CRC_s_008um_00052_00559-1 | P5CRC | 128 | 121 |
| P5CRC_s_008um_00121_00413-1 | P5CRC | 538 | 326 |
| P5CRC_s_008um_00167_00326-1 | P5CRC | 44 | 39 |
We will be using a variety of visualization methods:
- Looking at the values of each bin on the spatial slide
- Distribution of values before and after filtration as a density plot
UMI counts (transcripts) per bin
The nCount is the number of unique transcripts (UMIs) detected per bin. Oftentimes, these values will correspond with how transcriptionally active a cell may be, which is typically defined by their cell type. For example, tumor cells will have very high UMI counts.
In the density plots, we expect to see a bimodal distribution. One peak should represent bins containing lower-quality bins with fewer UMIs and a second peak should represent healthy bins with more UMIs. Ideally, the peak representing lower-quality and dying cells is small and the peak representing healthy cells is large.
These numbers may be lower than what we would expect for scRNA-seq datasets due to the small size of the bins.
For “nicer” (subjective) plotting with SpatialFeaturePlot() and SpatialDimPlot(), we will add some extra parameters to gain a clearer image beyond the default plot (shown here):
# Visualize the spatial distribution of total UMIs
# Default parameters
SpatialFeaturePlot(seurat_merged,
"nCount_Spatial.008um",
pt.size.factor = 15)Therefore for the rest of this lesson, we will be using the following arguments:
pt.size.factor = 15: to clearly see each bin on the slideimage.alpha = 0: to remove the H&E stained image in the background of the imagemax.cutoffandmin.cutoff: to not allow the color scale to be driven by smaller populations of cells with high/low values
# log10-transformed density of UMIs for each sample
# Vertical lines are sample-specific filtering thresholds
ggplot(meta) +
geom_density(aes(x = nCount_Spatial.008um, fill = orig.ident),
alpha = 0.4,
color = "black") +
geom_vline(xintercept = 30, color = "pink") +
geom_vline(xintercept = 10, color = "lightblue") +
scale_x_log10() +
theme_classic()# Apply filtration thresholds
meta_filt <- subset(meta,
((orig.ident == "P5CRC") & (nCount_Spatial.008um > 30)) |
((orig.ident == "P5NAT") & (nCount_Spatial.008um > 10)))
# log10-transformed density of UMIs for each sample after filtration
ggplot(meta_filt) +
geom_density(aes(x = nCount_Spatial.008um,
fill = orig.ident),
alpha = 0.4,
color = "black") +
geom_vline(xintercept = 30, color = "pink") +
geom_vline(xintercept = 10, color = "lightblue") +
scale_x_log10() +
theme_classic()Genes detected per bin
The nFeature is the number of genes detected per bin, or the number of genes that have a non-zero value in a bin. We have similar expectations for gene detection as we did for number of UMIs in terms of the distribution of values.
When we look at the spatial slide, we can already begin to see patterns of expression (both with genes and UMIs) that will be helpful in understanding our dataset better.
# log10-transformed density of number of genes for each sample
# Vertical lines are sample-specific filtering thresholds
ggplot(meta) +
geom_density(aes(x = nFeature_Spatial.008um,
fill = orig.ident),
alpha = 0.4,
color = "black") +
geom_vline(xintercept = 30, color = "pink") +
geom_vline(xintercept = 10, color = "lightblue") +
scale_x_log10() +
theme_classic()# Apply filtration thresholds
meta_filt <- subset(meta,
((orig.ident == "P5CRC") & (nFeature_Spatial.008um > 30)) |
((orig.ident == "P5NAT") & (nFeature_Spatial.008um > 10)))
# log10-transformed density of number of genes for each sample after filtration
ggplot(meta_filt) +
geom_density(aes(x = nFeature_Spatial.008um,
fill = orig.ident),
alpha = 0.4,
color = "black") +
geom_vline(xintercept = 30, color = "pink") +
geom_vline(xintercept = 10, color = "lightblue") +
scale_x_log10() +
theme_classic()Complexity (novelty) score
Sometimes there may be bins with high nCount (UMIs) and low nFeature (genes). This finding would indicate that a few genes were sequenced many times over. We consider these instances to be cases of “low complexity”, where we are getting high expression from only a small number of genes. Some cell types, such as red blood cells, are known for such behavior. However, we should be cautious when interpreting these results, as contamination or technical artifacts could also contribute to this finding. Generally, we expect the complexity score to be above 0.80 for good-quality bins.
The novelty score is computed as a ratio of genes to UMIs, as shown below:
\[ \text{Complexity Score} = \frac{\log_{10}(\text{Number of Genes})}{\log_{10}(\text{Number of UMIs})} \]
Which we can now calculate using R and store in our @meta.data:
# Add number of genes per UMI for each cell to metadata
seurat_merged$log10GenesPerUMI <- log10(seurat_merged$nFeature_Spatial.008um) /
log10(seurat_merged$nCount_Spatial.008um)There are several NA values in this newly generated column. This result is because there are some bins with nCount_Spatial.008um = 0. These bins will naturally be filtered out once we complete our filtration, so strictly for the purposes of visualization, we are going to set these NA values to 0:
# Turn NA values into 0 for now
seurat_merged$log10GenesPerUMI[is.na(seurat_merged$log10GenesPerUMI)] <- 0# log10-transformed density of complexity score for each sample
# Vertical line is filtering threshold
meta <- seurat_merged@meta.data
ggplot(meta) +
geom_density(aes(x = log10GenesPerUMI,
fill = orig.ident),
alpha = 0.4,
color = "black") +
geom_vline(xintercept = 0.80) +
theme_classic()# Apply filtration thresholds
meta_filt <- subset(meta, log10GenesPerUMI > 0.80)
# log10-transformed density of complexity score for each sample after filtration
ggplot(meta_filt) +
geom_density(aes(x = log10GenesPerUMI,
fill = orig.ident),
alpha = 0.4,
color = "black") +
geom_vline(xintercept = 0.80) +
theme_classic()Mitochondrial counts ratio
During sequencing, we do not only measure the levels of expression from the nuclear genome - we also capture the mitochondrial genome! High levels of mitochondrial expression can be a sign of dead or dying cells. Therefore, we can calculate the proportion of reads (UMIs) that come from the mitochondria out of all the transcripts in a cell as another metric.
Bins with greater than 0.25 (25%) mitochondrial reads are typically defined as poor-quality. However, if you have reason to believe that the mitochondrial content is meant to be on the higher end, you can adjust this threshold.
While using a baseline score of 0.25 is an acceptable threshold for removing high mitochondrial content cells, it is important to always go back to your original biological question. What samples are you working with? Do you expect there to be high values of mitochondrial expression due to your experimental condition?
For example, if you were studying renal oncocytomas, would you make this same choice? This disease is characterized as having aberrantly high mitochondrial expression, so would it make sense to remove cells with high mitochondrial ratio?
This ratio is computed as:
\[ \text{Mitochondrial Ratio} = \frac{\text{Number of reads aligning to mitochondrial genes}} {\text{Total reads}} \]
# Compute percent mito ratio by finding genes that start with "MT-"
seurat_merged$mitoRatio <- PercentageFeatureSet(object = seurat_merged,
pattern = "^MT-")
seurat_merged$mitoRatio <- seurat_merged@meta.data$mitoRatio / 100The same issue that caused the NA values in the complexity score will appear in the calculated mitoRatio. So here we will set these values to be 1.00:
# Turn NA values into 1.00 for now
seurat_merged$mitoRatio[is.na(seurat_merged$mitoRatio)] <- 1.00# Update meta to grab mitoRatio column
meta <- seurat_merged@meta.data
# log10-transformed density of mitochondrial ratio for each sample
# Vertical line is filtering threshold
ggplot(meta) +
geom_density(aes(x = mitoRatio,
fill = orig.ident),
alpha = 0.4,
color = "black") +
geom_vline(xintercept = 0.25) +
theme_classic()# Apply filtration thresholds
meta_filt <- subset(meta, mitoRatio < 0.25)
# log10-transformed density of mitochondrial ratio for each sample after filtration
ggplot(meta_filt) +
geom_density(aes(x = mitoRatio,
fill = orig.ident),
alpha = 0.4,
color = "black") +
geom_vline(xintercept = 0.25) +
theme_classic()- Do you notice a pattern in cells in regards to the number of UMIs and features? Make a
geom_pointplot to compare these values on a per-cell basis and color each point by the mitochondrial ratio following the structure provided here:
# Structure for making geom_point plot
# Fill in values to answer the question
seurat_merged@meta.data %>%
# Sorting by mitoRatio to make high scores appear on top of the plot
arrange(mitoRatio) %>%
ggplot() +
geom_point(aes(x = ???,
y = ???,
color = ???),
size = 0.5) +
# Setting limits so that outliers don't determine scale of the plot
ylim(0, 3500) + xlim(0, 3500) +
theme_bw()Filtration
We will apply very minimal filtering here. It has been shown that low expression can be biologically meaningful for spatial context, so we won’t be as stringent as we normally are with scRNA-seq.
# Per-sample nCount thresholds
seurat_filtered <- subset(seurat_merged,
((orig.ident == "P5CRC") & (nCount_Spatial.008um > 30)) |
((orig.ident == "P5NAT") & (nCount_Spatial.008um > 10)))
# Per-sample nFeature thresholds
seurat_filtered <- subset(seurat_filtered,
((orig.ident == "P5CRC") & (nFeature_Spatial.008um > 30)) |
((orig.ident == "P5NAT") & (nFeature_Spatial.008um > 10)))
# Global thresholds for mitochondrial ratio and complexity
seurat_filtered <- subset(seurat_filtered, mitoRatio < 0.25)
seurat_filtered <- subset(seurat_filtered, log10GenesPerUMI > 0.80)
# Print seurat object after filtration
seurat_filteredAn object of class Seurat
18085 features across 135798 samples within 1 assay
Active assay: Spatial.008um (18085 features, 0 variable features)
1 layer present: counts
2 spatial fields of view present: P5CRC.008um P5NAT.008um
After subsetting, you may get the following warning message:
Warning: Not validating Centroids objects
Warning: Not validating FOV objects
Warning: Not validating FOV objects
Warning: Not validating FOV objects
Warning: Not validating Seurat objectsThis warning message can be ignored as it is Seurat internally checking bin barcodes against the image. In future lessons, after every subset step, this message may appear again, but can be disregarded as the subsetting is ultimately accomplished.
- How many bins did we remove in this filtration process? Hint: We can use the
ncol()function to count the number of bins in a Seurat object.
Visualizing counts data
We can visualize the number of UMIs and gene counts per bin, both as a distribution and layered on top of the tissue image.
Violin plots
Let’s start with a violin plot to look at the distribution of UMI counts and gene counts. The input is our post-filtered dataset.
# Violin plot of UMIs
p_ncount <- VlnPlot(seurat_filtered,
features = "nCount_Spatial.008um",
pt.size = 0, group.by = 'orig.ident') +
NoLegend()
# Violin plot of number of genes
p_nfeats <- VlnPlot(seurat_filtered,
features = "nFeature_Spatial.008um",
pt.size = 0, group.by = 'orig.ident') +
NoLegend()
# Plot UMIs and gene count violin plots side-by-side
p_ncount | p_nfeatsWe see that both violin plots have a similar peak. However, the UMI (nCount) distribution has a much longer tail than the number of genes distribution (nFeature). This is expected, because while the small physical size of the bins means that most genes will be detected only once or twice, a minority of bins under very transcriptionally active cells may exhibit multiple transcripts of the same gene.
Spatial overlay
Next, we can look at the same metrics and the distribution on the actual image itself after filtration. Note that some spots will have lower counts compared to others, in part due to low cellular density or cell types with low complexity in certain tissue regions.
Save!
Now is a great spot to save our seurat_filtered object as we have finished filtering.
# Save Seurat object
saveRDS(seurat_filtered, "data/seurat_filtered.RDS")