Cell cycle scoring
Cell cycle variation is a common source of uninteresting variation in single-cell RNA-seq data. To examine cell cycle variation in our data, we assign each cell a score, based on its expression of G2/M and S phase markers.
An overview of the cell cycle phases is given in the image below:
Adapted from Wikipedia (Image License is CC BY-SA 3.0)
- G0: Quiescence or resting phase. The cell is not actively dividing, which is common for cells that are fully differentiated. Some types of cells enter G0 for long periods of time (many neuronal cells), while other cell types never enter G0 by continuously dividing (epithelial cells).
- G1: Gap 1 phase represents the beginning of interphase. During G1 there is growth of the non-chromosomal components of the cells. From this phase, the cell may enter G0 or S phase.
- S: Synthesis phase for the replication of the chromosomes (also part of interphase).
- G2: Gap 2 phase represents the end of interphase, prior to entering the mitotic phase. During this phase th cell grows in preparation for mitosis and the spindle forms.
- M: M phase is the nuclear division of the cell (consisting of prophase, metaphase, anaphase and telophase).
The Cell-Cycle Scoring and Regression tutorial from Seurat makes available a list of cell cycle phase marker genes for humans and performs phase scoring based on the paper from Tirosh, I. et al.. We have used this list to perform orthology searches to create compiled cell cycle gene lists for other organisms, as well.
After scoring each gene for cell cycle phase, we can perform PCA using the expression of cell cycle genes. If the cells group by cell cycle in the PCA, then we would want to regress out cell cycle variation, unless cells are differentiating.
NOTE: If cells are known to be differentiating and there is clear clustering differences between G2M and S phases, then you may want to regress out by the difference between the G2M and S phase scores as described in the Seurat tutorial, thereby still differentiating the cycling from the non-cycling cells.
The code in this lesson relies on these libraries:
library(rCurl)
library(AnnotationHub)
library(ensembldb)
# Download cell cycle genes for organism at https://github.com/hbc/tinyatlas/tree/master/cell_cycle. Read it in with:
cc_file <- getURL("https://raw.githubusercontent.com/hbc/tinyatlas/master/cell_cycle/Homo_sapiens.csv")
cell_cycle_genes <- read.csv(text = cc_file)
All of the cell cycle genes are Ensembl IDs, but our gene IDs are the gene names. To score the genes in our count matrix for cell cycle, we need to obtain the gene names for the cell cycle genes.
We can use annotation databases to acquire these IDs. While there are many different options, including BioMart, AnnotationDBI, and AnnotationHub. We will use the AnnotationHub
R package to query Ensembl using the ensembldb
R package.
# Connect to AnnotationHub
ah <- AnnotationHub()
# Access the Ensembl database for organism
ahDb <- query(ah,
pattern = c("Homo sapiens", "EnsDb"),
ignore.case = TRUE)
# Acquire the latest annotation files
id <- ahDb %>%
mcols() %>%
rownames() %>%
tail(n = 1)
# Download the appropriate Ensembldb database
edb <- ah[[id]]
# Extract gene-level information from database
annotations <- genes(edb,
return.type = "data.frame")
# Select annotations of interest
annotations <- annotations %>%
dplyr::select(gene_id, gene_name, seq_name, gene_biotype, description)
Now we can use these annotations to get the corresponding gene names for the Ensembl IDs of the cell cycle genes.
# Get gene names for Ensembl IDs for each gene
cell_cycle_markers <- dplyr::left_join(cell_cycle_genes, annotations, by = c("geneID" = "gene_id"))
# Acquire the S phase genes
s_genes <- cell_cycle_markers %>%
dplyr::filter(phase == "S") %>%
pull("gene_name")
# Acquire the G2M phase genes
g2m_genes <- cell_cycle_markers %>%
dplyr::filter(phase == "G2/M") %>%
pull("gene_name")
Taking the gene names for the cell cycle genes we can score each cell based which stage of the cell cycle it is most likely to be in.
# Perform cell cycle scoring
seurat_phase <- CellCycleScoring(seurat_phase,
g2m.features = g2m_genes,
s.features = s_genes)
By default, the PCA is run only using the most variable features. If identified previously, there is no need to run FindVariableFeatures()
again. The output of the PCA returns the correlated gene sets associated with the different principal components (PCs).
# Identify the most variable genes if it hasn't been run
seurat_phase <- FindVariableFeatures(seurat_phase,
selection.method = "vst",
nfeatures = 2000,
verbose = FALSE)
# Scale the counts
seurat_phase <- ScaleData(seurat_phase)
# Perform PCA and color by cell cycle phase
seurat_phase <- RunPCA(seurat_phase)
# Visualize the PCA, grouping by cell cycle phase
DimPlot(seurat_phase,
reduction = "pca",
group.by= "Phase")
We do see differences on PC1, with the G1 cells to the left of the other cells on PC1. Based on this plot, we would regress out the variation due to cell cycle.
NOTE: Alternatively, we could wait and perform the clustering without regression and see if we have clusters separated by cell cycle phase. If we do, then we could come back and perform the regression.