Exploration of quality control metrics

To determine whether our clusters might be due to artifacts such as cell cycle phase or mitochondrial expression, it can be useful to explore these metrics visually to see if any clusters exhibit enrichment or are different from the other clusters. However, if enrichment or differences are observed for particular clusters it may not be worrisome if it can be explained by the cell type.

First we will check the number of cells per cluster in each sample.

# Extract identity and sample information from seurat object to determine the number of cells per cluster per sample
n_cells <- FetchData(combined, 
                     vars = c("ident", "sample")) %>%
        group_by(sample) %>%
        dplyr::count(ident) %>% 
        spread(ident, n) 

# View table
View(n_cells)

We can additionally visualize whether we have any sample-specific clusters by using the split.by argument:

# Plot UMAP split by sample
DimPlot(combined,
        reduction = "umap",
        split.by = "sample",
        label = TRUE,
        label.size = 6,
        plot.title = "UMAP")

There doesn’t appear to be any sample-specific clusters present.

We can also perform the standard QC that we had gone through previously with the control sample.

We can explore cell cycle by cluster by splitting by Phase.

DimPlot(combined,
        reduction = "umap",
        split.by = "Phase",
        label = TRUE,
        label.size = 6,
        plot.title = "UMAP")

We can see that cluster 20 is primarily in S-phase and clusters 16 and 17 are in G2M phase. If we hadn’t regressed out cell cycle previously, we would do so after seeing these clusters segregrate. Since we did regress out cell cycle, it is quite likely that there is a biological reason for the segregation.

Next, we will explore additional metrics, such as the number of UMIs and genes per cell, S-phase and G2M-phase markers, and mitochondrial gene expression:

# Determine metrics to plot present in seurat_control@meta.data
metrics <-  c("nUMI", "nGene", "S.Score", "G2M.Score", "mitoRatio")

# Extract the UMAP coordinates for each cell and include information about the metrics to plot
qc_data <- FetchData(combined, 
                     vars = c(metrics, "ident", "UMAP_1", "UMAP_2"))

# Adding cluster label to center of cluster on UMAP
umap_label <- FetchData(combined, 
                        vars = c("ident", "UMAP_1", "UMAP_2"))  %>%
        as.data.frame() %>% 
        group_by(ident) %>%
        summarise(x=mean(UMAP_1), y=mean(UMAP_2))

# Plot a UMAP plot for each metric
map(metrics, function(qc){
        ggplot(qc_data,
               aes(UMAP_1, UMAP_2)) +
                geom_point(aes_string(color=qc), 
                           alpha = 0.7) +
                scale_color_gradient(guide = FALSE, 
                                     low = "grey90", 
                                     high = "blue")  +
                geom_text(data=umap_label, 
                          aes(label=ident, x, y)) +
                ggtitle(qc)
}) %>%
        plot_grid(plotlist = .)

We can also explore how well our clusters separate by the different PCs; we hope that the defined PCs separate the cell types well.

# Defining the information in the seurat object of interest
columns <- c(paste0("PC_", 1:30),
             "ident",
             "UMAP_1", "UMAP_2")

# Extracting this data from the seurat object
pc_data <- FetchData(combined, 
                     vars = columns)

# Plotting a UMAP plot for each of the PCs
map(paste0("PC_", 1:30), function(pc){
        ggplot(pc_data, 
               aes(UMAP_1, UMAP_2)) +
                geom_point(aes_string(color=pc), 
                           alpha = 0.7) +
                scale_color_gradient(guide = FALSE, 
                                     low = "grey90", 
                                     high = "blue")  +
                geom_text(data=umap_label, 
                          aes(label=ident, x, y)) +
                ggtitle(pc)
}) %>% 
        plot_grid(plotlist = .)

Evaluating clustering

To determine whether our clustering and resolution are appropriate for our experiment, it is helpful to explore a handful of markers for each of the major cell types that we expect to be in our data and see how they segregate.

We will explore known immune cell markers for expected clusters:

Cell Type Marker
CD14+ monocytes CD14, LYZ
FCGR3A+ monocytes FCGR3A, MS4A7
Conventional dendritic cells FCER1A, CST3
Plasmacytoid dendritic cells IL3RA, GZMB, SERPINF1, ITM2C
B cells CD79A, MS4A1
T cells CD3D
CD4+ T cells CD3D, IL7R, CCR7
CD8+ T cells CD3D, CD8A
NK cells GNLY, NKG7
Megakaryocytes PPBP
Erythrocytes HBB, HBA2

Let’s remind ourselves of our clusters:

# Plot UMAP
DimPlot(combined,
        reduction = "umap",
        label = TRUE,
        label.size = 6,
        plot.title = "UMAP")

CD14+ monocyte markers

FeaturePlot(combined, 
            reduction = "umap", 
            features = c("CD14", "LYZ"))

Cluster 0 appears to have expression of both CD14 and LYZ. Cluster 7 also appears to express both markers, but at a lower level.

FCGR3A+ monocyte markers

FeaturePlot(combined, 
            reduction = "umap", 
            features = c("FCGR3A", "MS4A7"))

Expression is strong for only cluster 7. Therefore, we will surmise that cluster 0 is CD14+ monocytes, while cluster 7 represents FCGR3A+ (CD16+) monocytes.

Dendritic cell (DC) markers

FeaturePlot(combined, 
            reduction = "umap", 
            features = c("FCER1A", "CST3"))

Although the expression is not that impressive, the only cluster to exhibit expression of both markers is cluster 13. Since these markers are weak, we would likely want to explore the marker identification markers in more detail for this cell type.

Plasmacytoid DCs

FeaturePlot(combined, 
            reduction = "umap", 
            features = c("IL3RA", "GZMB", "SERPINF1", "ITM2C"))

The expression of several markers suggest cluster 18 corresponds to plasmacytoid DCs.

B cell markers

FeaturePlot(combined, 
            reduction = "umap", 
            features = c("CD79A", "MS4A1"))

These B cell markers have a good amount of expression for both markers for clusters 4, 14, and 15.

T cell markers

FeaturePlot(combined, 
            reduction = "umap", 
            features = c("CD3D"))

The T cell marker maps to many clusters, including 1, 2, 3, 6, 9, 10, 12, 16, 17, and 19.

CD4+ T cell markers

FeaturePlot(combined, 
            reduction = "umap", 
            features = c("CD3D", "IL7R"))

These markers correspond to clusters 1, 2, 3, 9 and 16. IL7R appears to map to cluster 10.

CD8+ T cell markers

FeaturePlot(combined, 
            reduction = "umap", 
            features = c("CD3D", "CD8A"))

Markers for CD8+ T cells map to clusters 6, 10 and 11.

NK cell markers

FeaturePlot(combined, 
            reduction = "umap", 
            features = c("GNLY", "NKG7"))

The NK cell markers correspond to clusters 5 and 6. However, we know that cluster 6 has T cell markers, therefore, this cluster is not representing NK cells. Likely cluster 6 represents activated CD8+ T cells.

Megakaryocyte markers

FeaturePlot(combined, 
            reduction = "umap", 
            features = c("PPBP"))

This marker distinguishes cluster 12 as megakaryocytes.

Erythrocyte markers

FeaturePlot(combined, 
            reduction = "umap", 
            features = c("HBB", "HBA2"))

Erythocytes correspond to cluster 19.

We identified in the Control analysis that we had markers differentiating the naive from the activated T cells. We can look whether specific clusters correspond to these subsets.

Cell state exploration

Based on the analysis with the Control sample, we have some additional cell markers to determine immune cell activation status and stressed/dying cells:

Cell State Marker
Naive T cells CCR7, SELL
Activated B and T cells CREM, CD69
Stressed/dying cells HSPB1, DNAJB6, HSPH1, GADD45B

Naive or memory T cells

FeaturePlot(combined, 
            reduction = "umap", 
            features = c( "CCR7", "SELL"))

The naive CD4+ T cells correspond to clusters 1, 2, 9, and 11.

Activated T and B cells

FeaturePlot(combined, 
            reduction = "umap", 
            features = c("CREM", "CD69"))

Activated CD4+ T cells correspond to cluster 3, while activated B cells correspond to cluster 14.

Stressed or dying cells

Based on our previous analysis with the control sample, we know there should be a cluster containing the stressed or dying cells with heat shock genes, cell cycle arrest genes, and DNA damage genes overexpressed. We can determine which cluster corresponds to these cells using the identified markers from the previous analysis.

FeaturePlot(combined, 
            reduction = "umap", 
            features = c("HSPB1", "DNAJB6", "HSPH1", "GADD45B"))

This cluster of stressed cells corresponds to cluster 8.

Based on these results, we can associate the majority of the clusters with cell types. However, we would like to perform a deeper analysis using marker identification before performing a final assignment of the clusters.

Cell Type Clusters
CD14+ Monocytes 0
FCGR3A+ Monocytes 7
Conventional dendritic cells 13
Plasmacytoid dendritic cells 18
Naive B cells 4, 15
Activated B cells 14
Naive CD4+ T cells 1, 2, 9, 16
Activated CD4+ T cells 3
Naive CD8+ T cells 11
Activated (cytotoxic) CD8+ T cells 6, 10
NK cells 5
Megakaryocytes 12
Erythrocytes 19
Stressed / dying cells 8
Unknown 17, 20

While we have a good idea of the cell types for the different clusters, it’s always a good idea to perform marker identification to ensure the hypothesized cell identities make sense with the enriched genes. Also, we can explore identities for the unknown clusters 17 and 20.


This lesson has been developed by members of the teaching team at the Harvard Chan Bioinformatics Core (HBC). These are open access materials distributed under the terms of the Creative Commons Attribution license (CC BY 4.0), which permits unrestricted use, distribution, and reproduction in any medium, provided the original author and source are credited.