DE analysis Visualization from FindMarkers

Author

Noor Sohail, Mary Piper, Lorena Pantano, Meeta Mistry, Radhika Khetani, Jihe Liu, Will Gammerdinger

Published

Invalid Date

Approximate time: 75 minutes

Learning Objectives:

Create visualizations for differentially expressed genes
Discuss other statistical tests for differential expression analysis

Significant genes

We want to subset our results to show just our significant genes so we can begin visualizing and analyzing the results. To do this, we filter out rows based upon the p_val_adj column and subsetting only genes that meet our multiple testing-corrected significance threshold of 0.05.

# Subset significant genes
dge_vsm_sig <- dge_vsm %>% subset(p_val_adj < 0.05)
dge_vsm_sig %>% head()

                p_val avg_log2FC pct.1 pct.2     p_val_adj
Gm42418 4.492261e-223  1.8013603 0.999 0.986 8.881650e-219
Ubb     1.801612e-207 -0.8274973 0.982 0.999 3.561968e-203
H3f3b   6.044423e-198 -0.9395566 0.975 0.998 1.195043e-193
Nr4a2   9.607598e-193 -2.1841673 0.503 0.855 1.899518e-188
Rpl21   2.319987e-190 -0.6377144 0.967 0.998 4.586847e-186
Rpl9    1.249444e-187 -0.6824496 0.954 0.999 2.470276e-183

Volcano plot

To get a first look at the genes that are retained, we can generated a volcano plot using the EnhancedVolcano() function. This is a visualization that allows us to quickly see trends in the significant genes. The x-axis here represents the average log2-fold change value, showing the degree of difference between the two conditions. On the y-axis, we see our p_val_adj column represented after a negative log10 transformation is applied to better see the spread of the adjusted p-values.

Volcano plots provide a great overview of which genes are up-regulated (positive on the x-axis) or down-regulated (negative on the x-axis).

# Volcano plot
p_fm <- EnhancedVolcano(dge_vsm_sig,
        row.names(dge_vsm_sig),
        x="avg_log2FC",
        y="p_val_adj",
       title="FindMarkers VSM cells",
       subtitle="TN vs cold7")

p_fm

Violin plots

While looking at the overall trends in the data is a great starting point, we can also start looking at genes that have large differences between TN and cold7. To do this, we can take a look at the top 6 genes with the smallest p-values. We additionally disregard the ribsomal genes in this visualization step.

# Get the gene names and get the first 6 values
# Ignore ribosomal genes
genes <- dge_vsm_sig %>%
  rownames_to_column(var="gene") %>%
  filter(!str_detect(gene, "Rpl|Rps")) %>% 
  head(6)
genes <- genes$gene
genes

[1] "Gm42418" "Ubb"     "H3f3b"   "Nr4a2"   "Cebpb"   "Fau"

With these genes selected, we can now being to visualize the distribution of expression across our two conditions using the VlnPlot() function.

# Set Idents and draw Violin plots for top 6 genes
Idents(seurat_vsm) <- "condition"
VlnPlot(seurat_vsm, genes, ncol=3, idents=c("TN", "cold7"))

UMAP plots

When comparing two different conditions, we recommend creating a UMAP that clearly shows where the cells exist for each condition. To do so, we first need to get the UMAP coordinates for every cell of interest. When creating the scatterplot, the first thing we do is put a layer of light gray points that show the entire dataset to understand where all the cells fall. Then, we take the UMAP coordinates of the condition (TN or cold7 in our example) and plot those on top with a color to clearly indicate where those cells are located.

Note

This sometimes works better on the non-integrated data, so you observe a true separation of cells by condition.

# Grab the umap coordinates and condition information for each cell
df <- FetchData(seurat_vsm, c("umap_1", "umap_2", "condition"))
df_tn <- df %>% subset(condition == "TN")
df_cold7 <- df %>% subset(condition == "cold7")

# Scatterplot of TN cells
p_tn <- ggplot() +
  geom_point(data=df, aes(x=umap_1, y=umap_2), color="lightgray", alpha=0.5, size = 0.1) +
  geom_point(data=df_tn, aes(x=umap_1, y=umap_2), color="#F8766D", size = 0.1) +
  theme_classic() +
  ggtitle("VSM: TN cells")

# Scatterplot of cold7 cells
p_cold7 <- ggplot() +
  geom_point(data=df, aes(x=umap_1, y=umap_2), color="lightgray", alpha=0.5, size = 0.1) +
  geom_point(data=df_cold7, aes(x=umap_1, y=umap_2), color="#00B8E7", size = 0.1) +
  theme_classic() +
  ggtitle("VSM: cold7 cells")

# TN and cold7 UMAPs side by side
p_tn + p_cold7

This allows us to better understand our results when we look at any follow-up information on our UMAP. For example, we can begin to look at the distribution of gene expression for each of the top 6 genes with a better understanding of where the cells for each condition lie:

# Create Feature Plot
FeaturePlot(seurat_vsm, genes, ncol=3)

Other statistical tests

When we looked at the extra explanations for the FindMarkers() function, there was a parameter called test.use. By default, the method for calculating differentially expressed genes will be a Wilcoxon Rank Sum Test. This is a fairly simple statistical approach, and there a multitude of different algorithms that can be specified. These other options are documented on the FindMarkers() documentation page. For this workshop we want to highlight a few of these methods:

Wilcoxon Rank Sum Test

Often described as the non-parametric version of the two-sample t-test.
Beneficial because it can reduce the impact of outliers, which can skew the results of parametric testing.
It ranks the data and compares the sum of ranks within each group, to identify significant differences.

DESeq2

Identifies differentially expressed genes between two groups of cells based on a model using DESeq2 which uses a negative binomial distribution (Love et al., 2014). More information on DESeq2 will be provided in an upcoming lesson in this workshop.
This test option does not support pre-filtering of genes based on average difference (or percent detection rate) between cell groups. However, genes may be pre-filtered based on their minimum detection rate (min.pct) across both cell groups.

Note

The creators of the Seurat package no longer recommend using the FindMarkers() implementation of DESeq2.

MAST

Implements an approach that accounts for the stochastic dropout and characteristic bimodal expression distributions in which expression is either strongly non-zero or non-detectable.
- A two-part, generalized linear model for such bimodal data that parameterizes both of these features
Also allows for estimation and control of the “cellular detection rate” (CDR) while simultaneously estimating treatment effects. This addresses the fact that cells scale transcript copy number with cell volume.
Permits the analysis of complex experiments, such as repeated single-cell measurements under various treatments or longitudinal sampling of single cells from multiple subjects with a variety of background characteristics (e.g., sex, age), because it can easily be extended to accommodate random effects.

Note

Instead of using the FindMarkers() implementation, we recommend directly using the MAST algorithm from the package itself for the best results.

If you are interested in exploring code to run MAST on this dataset directly using the package, please see the script at the link below. We recommend including the sample in the model to improve results by taking into account biological variability. Please note that this is a computationally intensive calculation and may take a long time to run.

Click here for code to run MAST on this dataset side-by-side

Note that this R code below uses the MAST library. In order to run this you will need to first install the required packages and then:.

library(Seurat)
library(dplyr)
library(SingleCellExperiment)
library(MAST)

# Seurat to SingleCellExperiment
DefaultAssay(seurat_vsm) <- "RNA"
sce <- as.SingleCellExperiment(seurat_vsm)
# Apply log transformation
assay(sce, "logcounts") <- log2(counts(sce) + 1)

# Create new sce object (with only 'logcounts' matrix)
sce_1 <- SingleCellExperiment(assays = list(logcounts = assay(sce, "logcounts")))
colData(sce_1) <- colData(sce)
 
# Change to SingleCellAssay
sca <- SceToSingleCellAssay(sce_1)

# Calculate number of genes expressed per cell and scale the value
cdr2 <- colSums(SummarizedExperiment::assay(sca) > 0)
colData(sca)$cngeneson <- scale(cdr2)

# Takes a long time to calculate!
# Here our model includes:
## The number of genes expressed (ngeneson)
## Experimental condition (condition)
## Sample as a random variable ((1 | sample))

zlmCond <- zlm(~condition + cngeneson + (1 | sample), 
               sca, method="glmer", ebayes=FALSE)

# Only test the condition coefficient.
summaryCond <- summary(zlmCond, doLRT='conditionTN') 

# Some data wranging of the results
summaryDt <- summaryCond$datatable
fcHurdle <- merge(summaryDt[contrast=='conditionTN' & component=='H',.(primerid, `Pr(>Chisq)`)], # Hurdle P values
                  summaryDt[contrast=='conditionTN' & component=='logFC', 
                            .(primerid, coef, ci.hi, ci.lo)], by='primerid') #logFC coefficients

fcHurdle[,fdr:=p.adjust(`Pr(>Chisq)`, 'fdr')]

--- title: "DE analysis Visualization from FindMarkers" author: "Noor Sohail, Mary Piper, Lorena Pantano, Meeta Mistry, Radhika Khetani, Jihe Liu, Will Gammerdinger" date: Wednesday, February 5th, 2020 --- Approximate time: 75 minutes ```{r} #| label: load_libraries_data #| echo: false library(Seurat) library(tidyverse) library(EnhancedVolcano) seurat <- readRDS("data/BAT_GSE160585_final.rds") # Subset the object seurat_vsm <- subset(seurat, subset = (celltype == "VSM")) # Set the idents Idents(seurat_vsm) <- "condition" # Set default assay DefaultAssay(seurat_vsm) <- "RNA" # Determine differentiating markers for TN and cold7 dge_vsm <- FindMarkers(seurat_vsm, ident.1="cold7", ident.2="TN" ) ``` # Learning Objectives: * Create visualizations for differentially expressed genes * Discuss other statistical tests for differential expression analysis # Significant genes We want to subset our results to show just our significant genes so we can begin visualizing and analyzing the results. To do this, we filter out rows based upon the `p_val_adj` column and subsetting only genes that meet our multiple testing-corrected significance threshold of 0.05. ```{r} #| label: subset_sig_genes # Subset significant genes dge_vsm_sig <- dge_vsm %>% subset(p_val_adj < 0.05) dge_vsm_sig %>% head() ``` # Volcano plot To get a first look at the genes that are retained, we can generated a volcano plot using the `EnhancedVolcano()` function. This is a visualization that allows us to quickly see trends in the significant genes. The x-axis here represents the average log2-fold change value, showing the degree of difference between the two conditions. On the y-axis, we see our `p_val_adj` column represented after a negative log10 transformation is applied to better see the spread of the adjusted p-values. Volcano plots provide a great **overview of which genes are up-regulated (positive on the x-axis) or down-regulated (negative on the x-axis)**. ```{r fig.width=10, fig.height=6} #| label: volcano_plot # Volcano plot p_fm <- EnhancedVolcano(dge_vsm_sig, row.names(dge_vsm_sig), x="avg_log2FC", y="p_val_adj", title="FindMarkers VSM cells", subtitle="TN vs cold7") p_fm ``` # Violin plots While looking at the overall trends in the data is a great starting point, we can also start looking at genes that have large differences between `TN` and `cold7`. To do this, we can take a look at the top 6 genes with the smallest p-values. We additionally disregard the ribsomal genes in this visualization step. ```{r} #| label: top_6_sig_genes # Get the gene names and get the first 6 values # Ignore ribosomal genes genes <- dge_vsm_sig %>% rownames_to_column(var="gene") %>% filter(!str_detect(gene, "Rpl|Rps")) %>% head(6) genes <- genes$gene genes ``` With these genes selected, we can now being to visualize the distribution of expression across our two conditions using the `VlnPlot()` function. ```{r} #| label: create_violin_plot # Set Idents and draw Violin plots for top 6 genes Idents(seurat_vsm) <- "condition" VlnPlot(seurat_vsm, genes, ncol=3, idents=c("TN", "cold7")) ``` # UMAP plots When comparing two different conditions, we recommend creating a UMAP that clearly shows where the cells exist for each condition. To do so, we first need to get the UMAP coordinates for every cell of interest. When creating the scatterplot, the first thing we do is put a layer of light gray points that show the entire dataset to understand where all the cells fall. Then, we take the UMAP coordinates of the condition (`TN` or `cold7` in our example) and plot those on top with a color to clearly indicate where those cells are located. ::: callout-note This sometimes works better on the non-integrated data, so you observe a true separation of cells by condition. ::: ```{r} #| label: create_UMAP # Grab the umap coordinates and condition information for each cell df <- FetchData(seurat_vsm, c("umap_1", "umap_2", "condition")) df_tn <- df %>% subset(condition == "TN") df_cold7 <- df %>% subset(condition == "cold7") # Scatterplot of TN cells p_tn <- ggplot() + geom_point(data=df, aes(x=umap_1, y=umap_2), color="lightgray", alpha=0.5, size = 0.1) + geom_point(data=df_tn, aes(x=umap_1, y=umap_2), color="#F8766D", size = 0.1) + theme_classic() + ggtitle("VSM: TN cells") # Scatterplot of cold7 cells p_cold7 <- ggplot() + geom_point(data=df, aes(x=umap_1, y=umap_2), color="lightgray", alpha=0.5, size = 0.1) + geom_point(data=df_cold7, aes(x=umap_1, y=umap_2), color="#00B8E7", size = 0.1) + theme_classic() + ggtitle("VSM: cold7 cells") # TN and cold7 UMAPs side by side p_tn + p_cold7 ``` This allows us to better understand our results when we look at any follow-up information on our UMAP. For example, we can begin to look at the distribution of gene expression for each of the top 6 genes with a better understanding of where the cells for each condition lie: ```{r} #| label: create_feature_plot # Create Feature Plot FeaturePlot(seurat_vsm, genes, ncol=3) ``` # Other statistical tests When we looked at the extra explanations for the `FindMarkers()` function, there was a parameter called `test.use`. By default, the method for calculating differentially expressed genes will be a Wilcoxon Rank Sum Test. This is a fairly simple statistical approach, and there a multitude of different algorithms that can be specified. These other options are documented on the `FindMarkers()` [documentation page](https://www.rdocumentation.org/packages/Seurat/versions/5.0.3/topics/FindMarkers). For this workshop we want to highlight a few of these methods: # Wilcoxon Rank Sum Test * Often described as the non-parametric version of the two-sample t-test. * Beneficial because it can reduce the impact of outliers, which can skew the results of parametric testing. * It ranks the data and compares the sum of ranks within each group, to identify significant differences. # DESeq2 * Identifies differentially expressed genes between two groups of cells based on a model using DESeq2 which uses a negative binomial distribution ([Love et al., 2014](https://genomebiology.biomedcentral.com/articles/10.1186/s13059-014-0550-8)). More information on DESeq2 will be provided in an upcoming lesson in this workshop. * This test option does not support pre-filtering of genes based on average difference (or percent detection rate) between cell groups. However, genes may be pre-filtered based on their minimum detection rate (min.pct) across both cell groups. ::: callout-note The creators of the Seurat package [no longer recommend](https://github.com/satijalab/seurat/issues/2938) using the `FindMarkers()` implementation of DESeq2. ::: # MAST * Implements an approach that accounts for the stochastic dropout and characteristic bimodal expression distributions in which expression is either strongly non-zero or non-detectable. * A two-part, generalized linear model for such bimodal data that parameterizes both of these features * Also allows for estimation and control of the “cellular detection rate” (CDR) while simultaneously estimating treatment effects. This addresses the fact that cells scale transcript copy number with cell volume. * Permits the analysis of complex experiments, such as repeated single-cell measurements under various treatments or longitudinal sampling of single cells from multiple subjects with a variety of background characteristics (e.g., sex, age), because it can easily be extended to accommodate random effects. ::: callout-note Instead of using the `FindMarkers()` implementation, we recommend directly using the [MAST](https://genomebiology.biomedcentral.com/articles/10.1186/s13059-015-0844-5) algorithm from the package itself for the best results. If you are interested in exploring code to run MAST on this dataset directly using the package, please see the script at the link below. We recommend including the sample in the model to improve results by taking into account biological variability. Please note that this is a **computationally intensive** calculation and may take **a long time to run**. <details> <summary>Click here for code to run MAST on this dataset side-by-side</summary> Note that this R code below uses the MAST library. In order to run this you will need to first install the required packages and then:. ```{r} #| label: MAST_example #| eval: false library(Seurat) library(dplyr) library(SingleCellExperiment) library(MAST) # Seurat to SingleCellExperiment DefaultAssay(seurat_vsm) <- "RNA" sce <- as.SingleCellExperiment(seurat_vsm) # Apply log transformation assay(sce, "logcounts") <- log2(counts(sce) + 1) # Create new sce object (with only 'logcounts' matrix) sce_1 <- SingleCellExperiment(assays = list(logcounts = assay(sce, "logcounts"))) colData(sce_1) <- colData(sce) # Change to SingleCellAssay sca <- SceToSingleCellAssay(sce_1) # Calculate number of genes expressed per cell and scale the value cdr2 <- colSums(SummarizedExperiment::assay(sca) > 0) colData(sca)$cngeneson <- scale(cdr2) # Takes a long time to calculate! # Here our model includes: ## The number of genes expressed (ngeneson) ## Experimental condition (condition) ## Sample as a random variable ((1 | sample)) zlmCond <- zlm(~condition + cngeneson + (1 | sample), sca, method="glmer", ebayes=FALSE) # Only test the condition coefficient. summaryCond <- summary(zlmCond, doLRT='conditionTN') # Some data wranging of the results summaryDt <- summaryCond$datatable fcHurdle <- merge(summaryDt[contrast=='conditionTN' & component=='H',.(primerid, `Pr(>Chisq)`)], # Hurdle P values summaryDt[contrast=='conditionTN' & component=='logFC', .(primerid, coef, ci.hi, ci.lo)], by='primerid') #logFC coefficients fcHurdle[,fdr:=p.adjust(`Pr(>Chisq)`, 'fdr')] ``` </details> :::