Clustering - Answer Key

Author

Noor Sohail

Published

July 22, 2025

Exercise 1

When we looked at the first few rows of our metadata, it appeared that many cells do not have a cluster value. Count how many NA values are found in the cluster column using the table() function with the argument useNA = "ifany". Why do you think there are so many NA values?

table(seurat_processed$seurat_cluster.sketched,
      useNA = "ifany")


     1      2      3      4      5      6      7      8      9     10     11 
  1332   1316   1280   1061    956    878    876    745    676    364    147 
    12     13     14   <NA> 
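   145    114    110 125798

The labelled clusters sum to exactly 10,000 bins, while the remaining 125,798 are NA. This is what we would expect if clustering was run only on a sketched subset of the data: seurat_cluster.sketched holds labels for the sketched bins alone, and every other bin stays NA until the labels are projected onto the full dataset (seurat_cluster.projected, used in Exercise 2).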
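
The counts printed above can be checked with a few lines of base R. This is a minimal sketch that re-enters the numbers from the table() output by hand, so it runs without the Seurat object; the variable names are illustrative and not part of the lesson code.

```r
# Cluster sizes copied from the table() output above
cluster_counts <- c(
  "1" = 1332, "2" = 1316, "3" = 1280, "4" = 1061, "5" = 956,
  "6" = 878, "7" = 876, "8" = 745, "9" = 676, "10" = 364,
  "11" = 147, "12" = 145, "13" = 114, "14" = 110
)
# Number of bins reported under <NA>
n_na <- 125798

n_labelled <- sum(cluster_counts)  # bins that carry a sketched cluster label
n_total <- n_labelled + n_na       # all bins in the object
frac_na <- n_na / n_total          # share of bins without a label

n_labelled
n_total
round(frac_na, 3)
```

On the real object, sum(is.na(seurat_processed$seurat_cluster.sketched)) should give the same NA count as the last column of the table.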

Exercise 2

How many bins are in each cluster? Use the table() function to count the number of bins in each cluster.

table(seurat_processed$seurat_cluster.projected,
      seurat_processed$orig.ident)

    
     P5CRC P5NAT
  1  12057    43
  2   6981  8822
  3  20668 11572
  4   4236 26666
  5   1167  7628
  6   2148  3659
  7   3753  8529
  8   1772  3778
  9   2211  1961
  10  1415  4237
  11   648    15
  12   145   703
  13    45   163
  14   217   559

Now we know that the bins in cluster 1 belong almost entirely to sample P5CRC (12,057 of 12,100 bins).
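
To read the cross-tabulation more easily, the per-cluster totals and the share of each sample within a cluster can be computed directly. The sketch below re-enters the counts from the table above as a plain matrix so that it is self-contained; on the real data the same calls could be applied to the result of table(). The object names are assumptions made for illustration.

```r
# Counts copied from the table() output above.
# matrix() fills column by column: first all P5CRC values, then all P5NAT values.
counts <- matrix(
  c(12057, 6981, 20668, 4236, 1167, 2148, 3753, 1772, 2211, 1415, 648, 145, 45, 217,
    43, 8822, 11572, 26666, 7628, 3659, 8529, 3778, 1961, 4237, 15, 703, 163, 559),
  ncol = 2,
  dimnames = list(cluster = as.character(1:14), sample = c("P5CRC", "P5NAT"))
)

# Total number of bins in each cluster, summed over both samples
bins_per_cluster <- rowSums(counts)

# Fraction of each cluster contributed by each sample (each row sums to 1)
sample_fraction <- prop.table(counts, margin = 1)

bins_per_cluster
round(sample_fraction, 3)
```

Row-wise proportions make sample-specific clusters stand out at a glance, which raw counts can hide when cluster sizes differ by orders of magnitude.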

# Visualize the number of bins per cluster, split by sample
seurat_processed@meta.data %>% 
    ggplot(aes(x=seurat_cluster.projected, 
               fill=orig.ident)) + 
    geom_bar(position=position_dodge()) +
    theme_classic() +
    ggtitle("Bins per cluster (resolution 0.65)") +
    NoLegend()

Exercise 3

Use the DotPlot() function in conjunction with marker_list to see whether the clusters correspond well with cell types.

marker_list <- list(
  "B cells" = c("IGKC", "IGHM", "CD79A", "MS4A1", "MZB1"),
  "Endothelial cells" = c("PECAM1", "VWF", "PLVAP", "ENG", "KLF2"),
  "Fibroblasts" = c("COL1A1", "COL3A1", "DCN", "LUM", "COL6A2"),
  "Intestinal epithelial cells" = c("CLCA1", "FCGBP", "MUC2", "PIGR", "ZG16"),
  "Myeloid cells" = c("C1QC", "SELENOP", "SPP1", "LYZ", "CD68"),
  "Neural cells" = c("NRXN1", "L1CAM", "NCAM1", "VIP", "CALB2"),
  "Smooth muscle cells" = c("TAGLN", "ACTA2", "MYH11", "MYL9", "CNN1"),
  "T cells" = c("TRAC", "CD3E", "TRBC2", "IL7R", "CD52"),
  "Tumor cells" = c("CEACAM6", "CEACAM5", "EPCAM", "KRT8", "LCN2")
)

DotPlot(seurat_processed,
        marker_list,
        group.by = "seurat_cluster.projected",
        cluster.idents = TRUE) +
  theme(axis.text.x = element_text(angle = 90, vjust = 0.5, hjust=1))
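
When reading the dot plot, it helps to keep in mind the two quantities each dot encodes: its size reflects the percentage of bins in a cluster that express the gene, and its colour reflects the average expression in that cluster. The sketch below computes rough equivalents on hypothetical toy values in base R. It is an illustration of the idea, not Seurat's implementation, which averages and scales expression somewhat differently.

```r
# Hypothetical toy data: normalised expression of one gene in six bins
expr <- c(0, 1.2, 2.0, 0, 0, 0.4)
# Hypothetical cluster assignment for the same six bins
cluster <- factor(c("1", "1", "1", "2", "2", "2"))

# Dot size: percentage of bins in each cluster with non-zero expression
pct_expressed <- tapply(expr, cluster, function(x) 100 * mean(x > 0))

# Dot colour: mean expression per cluster (unscaled in this sketch)
avg_expression <- tapply(expr, cluster, mean)

pct_expressed
avg_expression
```

A marker is most convincing for a cluster when both quantities are high there and low elsewhere.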

Roughly identify which clusters correspond to which cell types to provide better context for future analyses.

The dot plot does not give the clearest signal, but it is enough to identify the major cell populations among the clusters. As you can see, there is some uncertainty in this very rough assignment, and that is okay!

Cluster	Cell type
1	Tumor cells
2	B cells
3	Intestinal epithelial cells
4	?
5	Tumor cells / Intestinal epithelial cells
6	B cells / T cells
7	Tumor cells / Intestinal epithelial cells
8	Endothelial cells
9	Myeloid cells / Fibroblasts
10	Smooth muscle cells
11	Tumor cells
12	Neural cells
13	Neural cells
14	?
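
To carry this rough assignment into later steps, one option is to keep it as a named vector keyed by cluster ID. The sketch below follows the table above, writing the tumor label as "Tumor cells" to match marker_list and marking the "?" clusters as NA. The names rough_labels and example_clusters are hypothetical and not part of the lesson code.

```r
# Rough cluster-to-cell-type lookup based on the table above; NA marks the "?" clusters
rough_labels <- c(
  "1"  = "Tumor cells",
  "2"  = "B cells",
  "3"  = "Intestinal epithelial cells",
  "4"  = NA_character_,
  "5"  = "Tumor cells / Intestinal epithelial cells",
  "6"  = "B cells / T cells",
  "7"  = "Tumor cells / Intestinal epithelial cells",
  "8"  = "Endothelial cells",
  "9"  = "Myeloid cells / Fibroblasts",
  "10" = "Smooth muscle cells",
  "11" = "Tumor cells",
  "12" = "Neural cells",
  "13" = "Neural cells",
  "14" = NA_character_
)

# Hypothetical cluster IDs for a handful of bins
example_clusters <- c("1", "4", "12", "9")

# Look up the label for each bin by cluster ID (character indexing uses the names)
example_labels <- unname(rough_labels[example_clusters])
example_labels
```

The same lookup could be applied to as.character(seurat_processed$seurat_cluster.projected) and stored as a new metadata column, under the assumption that the cluster IDs in the object match the names used here.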

In a standard analysis, we could test different resolution values to better separate clusters of different cell types from one another. However, in future lessons, we are going to (1) run an alternative, spatially-constrained clustering method and (2) automatically annotate our dataset.

This exercise is a good check that we are able to identify the key cell types in our dataset.

Reuse

CC-BY-4.0
