Noor Sohail, Mary Piper, Lorena Pantano, Amélie Julé, Meeta Mistry, Radhika Khetani
Published
September 12, 2024
Approximate time: 40 minutes
Learning Objectives:
Explore visualization methods for interrogating significant genes from DESeq2
Evaluate top genes to better understand the changes between conditions
Visualization of differentially expressed genes
Visualization is a key step for understanding the results of our analysis. In particular, we want to focus on the significant genes that are differentially expressed between the two conditions of interest. In doing so, we can better understand patterns in gene expression and identify any outlier genes that may bring up further questions in a downstream analysis. Additionally, these visualizations are a great way to globally evaluate the changes brought about due to an experimental condition.
Identify significant genes
Now that we have a table of genes with their associated adjusted p-values and log-fold change scores, we need to filter the results. We are only interested in significantly differentially expressed genes that pass an adjusted p-value threshold of 0.05.
# Set thresholdspadj.cutoff <-0.05# Turn the results object into a tibble for use with tidyverse functionsdge_deseq2 <- res %>%data.frame() %>%as_tibble()# Subset the significant resultssig_res <- dplyr::filter(dge_deseq2, padj < padj.cutoff)# Look at top sig genessig_res %>%head()
We will take these results and use them as input to a few different visualization techniques to explore the changes in gene expression.
Volcano plot
To get a first look at the significant genes compared to all genes tested, we can generated a volcano plot using the EnhancedVolcano() function. This is a visualization that allows us to quickly see trends in the significant genes. The x-axis here represents the average log2-fold change value, showing the degree of difference between the two conditions. On the y-axis, we see our adjusted p-value to which a negative log10 transformation is applied to better see the spread of our p-values.
Volcano plots show us a great overview of which genes are up-regulated (positive on the x-axis) or down-regulated (negative on the x-axis).
p_deseq2 <-EnhancedVolcano(sig_res, sig_res$gene,x="log2FoldChange",y="padj",title="DESeq2 VSM cells",subtitle="TN vs cold7")print(p_deseq2)
Heatmap of differentially expressed genes
Another way to look at global patterns of gene expression is to take our normalized expression matrix for our significant genes and generate a heatmap. The rows correspond to significant genes, columns are samples, and each value is the normalized expression from the pseudobulk aggregation.
Using the pheatmap() function, we can also cluster samples and genes together based upon their similarity. From the heatmap below we can clearly see that samples are clustering together based upon which experimental condition they belong to (TN and cold7). Similarly, the genes are being grouped together based upon their expression values, where we can see which genes are up- and down-regulated in each condition.
# Extract normalized expression for significant genes from the samplesnormalized_counts <-counts(dds, normalized=TRUE) %>%as.data.frame()norm_sig <- normalized_counts %>% dplyr::filter(row.names(normalized_counts) %in% sig_res$gene)# Set a color paletteheat_colors <-brewer.pal(6, "YlOrRd")# Run pheatmap using the metadata data frame for the annotationanno <-colData(dds) %>%as.data.frame() %>%select(condition, celltype)pheatmap(norm_sig,color = heat_colors,cluster_rows =TRUE,show_rownames =FALSE,annotation = anno,border_color =NA,fontsize =10,scale ="row", fontsize_row =10, height =20)
Top 6 genes
It is important to take a look at some of the top genes that show up. Alternatively, you could choose to use this plot to explore some of your candidate genes where you anticipated changes in gene expression between groups. This is a great way to evaluate why these genes showed up in the results.
Rather than just taking the top genes based on adjusted p-value, we can apply a fold change threshold. In our example below, we are using abs(log2FoldChange) > 0.6 which translates to a ~50% increase or decrease in gene expression as additional criteria for subsetting.
Now that we have identified the significant genes that we will look at, we can use a scatterplot to look at the expression values for each sample in both groups. This plot is also a good sanity check to make sure that we are interpreting our fold change values correctly, as well.
Each point represents a sample with the y-axis representing the normalized expression. Ideally we should see a clear shift in expression between our two conditions. As this is a helpful metric for assessing the pseudobulk results, we will create a function to make repeated use of this type of visualization.
# pseudobulk scatterpot of normalized expressionplot_pb_count <-function(dds, gene) {# returnData to get normalized counts for each sample for a gene d <-plotCounts(dds, gene=gene, intgroup="condition", returnData=TRUE)# Keep the order TN then cold7 d$condition <-factor(d$condition, levels =c("TN", "cold7"))# Plot the normalized counts for each sample p <-ggplot(d, aes(x = condition, y = count, color = condition)) +geom_point(position=position_jitter(w =0.1, h =0)) +theme_bw() +NoLegend() +ggtitle(gene)return(p)}
plot_list <-list()for (gene in genes) { plot_list[[gene]] <-plot_pb_count(dds, gene)}plot_grid(plotlist=plot_list)
Finally, it is also a good exercise to evaluate the gene expression at a single-cell level. The results we have been visualizing so far represent the sample averages of gene expression across all VSM cells in each sample. Therefore it is prudent to go back to the cellular level to see what these same results look like for each individual cell. To do so, we will make use of our original seurat object seurat_vsm.
We see that generally the trends are consistent with pseudobulk expression scatterplots, but in some cases the change is not as proncounced.
In the next lesson, we will continue with additional visualizations for the DESeq2 results while comparing and contrasting with results from the FindMarkers() analysis.
Source Code
---title: "Pseudobulk visualization"author: "Noor Sohail, Mary Piper, Lorena Pantano, Amélie Julé, Meeta Mistry, Radhika Khetani"date: Monday, September 12 2024---Approximate time: 40 minutes```{r}#| echo: falselibrary(tidyverse)library(pheatmap)library(EnhancedVolcano)library(RColorBrewer)library(cowplot)library(Seurat)library(DESeq2)dge_deseq2 <-read.csv("../results/deseq2_VSM_cold7_vs_TN.csv")seurat <-readRDS("../data/BAT_GSE160585_final.rds")seurat_vsm <-subset(seurat, subset = (celltype =="VSM"))bulk <-AggregateExpression( seurat,return.seurat =TRUE,assays ="RNA",group.by =c("celltype", "sample", "condition"))bulk$condition <-factor(bulk$condition, levels=c("TN", "RT", "cold2", "cold7"))# Number of cells by sample and celltypen_cells <- seurat@meta.data %>% dplyr::count(sample, celltype) %>% dplyr::rename("n_cells"="n")n_cells$sample <-str_replace(n_cells$sample, "_", "-")bulk@meta.data <-left_join(bulk@meta.data, n_cells)rownames(bulk@meta.data) <-Cells(bulk)# VSM cellsbulk_vsm <-subset(bulk, subset= (celltype =="VSM") & (condition %in%c("TN", "cold7")))# Get count matrixcluster_counts <-FetchData(bulk_vsm, layer="counts", vars=rownames(bulk_vsm))# Create DESeq2 object# transpose it to get genes as rowsdds <-DESeqDataSetFromMatrix(t(cluster_counts),colData = bulk_vsm@meta.data,design =~ condition)dds <-DESeq(dds)res <-read.csv("../results/deseq2_VSM_cold7_vs_TN.csv")```## Learning Objectives:* Explore visualization methods for interrogating significant genes from DESeq2* Evaluate top genes to better understand the changes between conditions## Visualization of differentially expressed genesVisualization is a key step for understanding the results of our analysis. In particular, we want to focus on the significant genes that are differentially expressed between the two conditions of interest. In doing so, we can better understand patterns in gene expression and identify any outlier genes that may bring up further questions in a downstream analysis. Additionally, these visualizations are a great way to globally evaluate the changes brought about due to an experimental condition.### Identify significant genesNow that we have a table of genes with their associated adjusted p-values and log-fold change scores, we need to **filter the results**. We are only interested in significantly differentially expressed genes that pass an **adjusted p-value threshold of 0.05**.```{r}# Set thresholdspadj.cutoff <-0.05# Turn the results object into a tibble for use with tidyverse functionsdge_deseq2 <- res %>%data.frame() %>%as_tibble()# Subset the significant resultssig_res <- dplyr::filter(dge_deseq2, padj < padj.cutoff)# Look at top sig genessig_res %>%head()```We will take these results and use them as input to a few different visualization techniques to explore the changes in gene expression. ### Volcano plot To get a first look at the significant genes compared to all genes tested, we can generated a volcano plot using the `EnhancedVolcano()` function. This is a visualization that allows us to quickly see trends in the significant genes. The x-axis here represents the average log2-fold change value, showing the degree of difference between the two conditions. On the y-axis, we see our adjusted p-value to which a negative log10 transformation is applied to better see the spread of our p-values.Volcano plots show us a great overview of which genes are up-regulated (positive on the x-axis) or down-regulated (negative on the x-axis).```{r fig.width=10, fig.height=8}p_deseq2 <-EnhancedVolcano(sig_res, sig_res$gene,x="log2FoldChange",y="padj",title="DESeq2 VSM cells",subtitle="TN vs cold7")print(p_deseq2)```### Heatmap of differentially expressed genesAnother way to look at global patterns of gene expression is to take our normalized expression matrix for our significant genes and generate a heatmap. The rows correspond to significant genes, columns are samples, and each value is the normalized expression from the pseudobulk aggregation. Using the `pheatmap()` function, we can also cluster samples and genes together based upon their similarity. From the heatmap below we can clearly see that samples are clustering together based upon which experimental condition they belong to (`TN` and `cold7`). Similarly, the genes are being grouped together based upon their expression values, where we can see which genes are up- and down-regulated in each condition.```{r fig.height=8}# Extract normalized expression for significant genes from the samplesnormalized_counts <-counts(dds, normalized=TRUE) %>%as.data.frame()norm_sig <- normalized_counts %>% dplyr::filter(row.names(normalized_counts) %in% sig_res$gene)# Set a color paletteheat_colors <-brewer.pal(6, "YlOrRd")# Run pheatmap using the metadata data frame for the annotationanno <-colData(dds) %>%as.data.frame() %>%select(condition, celltype)pheatmap(norm_sig,color = heat_colors,cluster_rows =TRUE,show_rownames =FALSE,annotation = anno,border_color =NA,fontsize =10,scale ="row", fontsize_row =10, height =20)```### Top 6 genesIt is important to take a look at some of the top genes that show up. Alternatively, you could choose to use this plot to explore some of your candidate genes where you anticipated changes in gene expression between groups. This is a great way to evaluate why these genes showed up in the results.Rather than just taking the top genes based on adjusted p-value, we can apply a fold change threshold. In our example below, we are using `abs(log2FoldChange) > 0.6` which translates to a ~50% increase or decrease in gene expression as additional criteria for subsetting.```{r}genes <- sig_res %>%arrange(padj) %>%subset(abs(log2FoldChange) >0.6) %>%head(6)genes <- genes$genegenes```Now that we have identified the significant genes that we will look at, we can use a scatterplot to look at the expression values for each sample in both groups. This plot is also a good sanity check to make sure that we are interpreting our fold change values correctly, as well.Each point represents a sample with the y-axis representing the normalized expression. Ideally we should see a clear shift in expression between our two conditions. As this is a helpful metric for assessing the pseudobulk results, we will create a function to make repeated use of this type of visualization.```{r}# pseudobulk scatterpot of normalized expressionplot_pb_count <-function(dds, gene) {# returnData to get normalized counts for each sample for a gene d <-plotCounts(dds, gene=gene, intgroup="condition", returnData=TRUE)# Keep the order TN then cold7 d$condition <-factor(d$condition, levels =c("TN", "cold7"))# Plot the normalized counts for each sample p <-ggplot(d, aes(x = condition, y = count, color = condition)) +geom_point(position=position_jitter(w =0.1, h =0)) +theme_bw() +NoLegend() +ggtitle(gene)return(p)}``````{r}plot_list <-list()for (gene in genes) { plot_list[[gene]] <-plot_pb_count(dds, gene)}plot_grid(plotlist=plot_list)```Finally, it is also a good exercise to evaluate the gene expression at a single-cell level. The results we have been visualizing so far represent the sample averages of gene expression across all VSM cells in each sample. Therefore it is prudent to go back to the cellular level to see what these same results look like for each individual cell. To do so, we will make use of our original seurat object `seurat_vsm`.**We see that generally the trends are consistent with pseudobulk expression scatterplots, but in some cases the change is not as proncounced.**```{r}DefaultAssay(seurat_vsm) <-"RNA"Idents(seurat_vsm) <-"condition"VlnPlot(seurat_vsm, genes, idents=c("cold7", "TN"))```In the next lesson, we will continue with additional visualizations for the DESeq2 results while comparing and contrasting with results from the `FindMarkers()` analysis.