Differential Gene Expression

Spatial transcriptomics

Differential gene expression

Pathway analysis

In this lesson, we will perform differential expression analysis between cell types of the matched cancer and normal samples. Afterwards, we will process the significant genes into pathway and gene set enrichment analyses to observe trends.

Authors

Noor Sohail

Meeta Mistry

Published

March 25, 2026

Keywords

Wilcoxon test, FindMarkers, Functional analysis, Overrepresentation analysis, GSEA

Approximate time: 90 minutes

Learning objectives

In this lesson, we will:

Run differential gene expression analysis using a Wilcoxon test
Calculate over-represented pathways
Identify gene set enrichment scores for pathways

Overview of lesson

By now, we have meaningful annotations for most of our bins for both samples. So now we can start to ask the question of how gene expression changes between sample conditions (NAT and CRC). Running a differential gene expression analysis is a great way to evaluate the significance of change between populations. However, looking at a table of genes can be overwhelming! So we can take the output from the DGE analysis to identify pathways that are enriched in one condition compared to another. This provides a mechanistic insight into the biological processes that differ between samples.

Differential gene expression (DGE)

What if we were interested in a particular cell type and understanding how gene expression changes across the different conditions? To answer this, we will perform a differential gene expression analysis. Let’s start by creating a new R script called 07_DGE_and_functional_analysis.R and place it within our scripts directory.

# DGE and pathway analysis
# Visium HD spatial transcriptomics workshop
# Author: Harvard Chan Bioinformatics Core
# Created: December 2025

# Load libraries
library(Seurat)
library(tidyverse)
library(EnhancedVolcano)

# ORA libraries
library(tidyverse)
library(clusterProfiler)
library(org.Hs.eg.db)
library(msigdbr)

set.seed(12345)
# Load dataset
seurat_rctd <- readRDS("data/intermediate_seurat/seurat_RCTD.rds")

For instance, maybe we are interested in seeing how the Myeloid population changes in gene expression between the CRC and NAT samples. To start identifying these changes, we are going to first subset the dataset to Myeloid bins based upon the annotation generated by RCTD.

# Subset to Myeloid cells
seurat_myeloid <- subset(seurat_rctd,
                       first_type == "Myeloid")
Idents(seurat_myeloid) <- "orig.ident"

`FindMarkers()`

The FindMarkers() function in the Seurat package is used to perform differential expression analysis between groups of cells. We will provide arguments to specify ident.1 and ident.2, the two groups of cells we are interested in comparing. This function is commonly used at the cell type annotation step, but can also be used to calculate differentially expressed genes between experimental conditions.

We can use this same function to compare two groups of cells which represent different conditions by modifying the ident values provided. The FindMarkers() function has several important arguments we can provide when running it. To view what these parameters are, we can access the help page on this function:

?FindMarkers

Below we have described in more detail what each of these arguments do:

Parameter	Description	Default
`logfc.threshold`	The minimum log2 fold change for average expression of a gene in `ident.1` vs `ident.2`.	0.25
`min.diff.pct`	The minimum percent difference between the percent of cells expressing the gene in the group and the percent of cells expressing the gene in all other groups combined.	—
`min.pct`	Only test genes that are detected in a minimum fraction of cells in either of the two populations; speeds up the function by skipping very infrequently expressed genes.	0.1
`ident.1`	The group of interest; this function evaluates one group at a time, and `ident.1` must correspond to a value set in `Idents()`.	—
`ident.2`	The comparison group; specifies the group against which `ident.1` is compared.	—

# Calculate DGEs with wilcox test
dge <- FindMarkers(seurat_myeloid,
                   ident.1 = "P5CRC",
                   ident.2 = "P5NAT")
dge$gene <- rownames(dge)
dge %>% head()

Table 1: Wilcoxon test results from comparing P5CRC and P5NAT Myeloid bins.

	avg_log2FC	pct.1	pct.2	gene
SRGN	1.6450858	0.664	0.398	SRGN
SOD2	3.3359809	0.266	0.064	SOD2
CXCL8	8.6759345	0.167	0.007	CXCL8
S100A9	3.3131400	0.209	0.028	S100A9
B2M	0.7734271	0.711	0.494	B2M
S100A8	4.4094812	0.152	0.009	S100A8

The output from the FindMarkers() function is a dataframe containing a ranked list of differentially expressed genes listed by gene ID and associated statistics. We describe some of these columns below:

Column	Description
`gene`	The gene symbol.
`p_val`	The p-value not adjusted for multiple test corrections. This should not be used to identify significant (see `p_val_adj`)
`avg_logFC`	The average log-fold change between sample groups, calculated as: \[\frac{gene\_avg_{ident.1}}{gene\_avg_{ident.2}}\] Positive values indicate that the gene is more highly expressed in `ident.1`.
`pct.1`	The percentage of cells where the gene is detected in `ident.1` (P5CRC).
`pct.2`	The percentage of cells where the gene is detected on average in `ident.2` (P5NAT).
`p_val_adj`	The adjusted p-value, based on Bonferroni correction using all genes in the dataset, used to determine significance.

Inflated p-values

These tests treat each cell as an independent replicate and ignore inherent correlations between cells originating from the same sample. This results in highly inflated p-values for each gene. Studies have shown a large number of false positive associations with these results.

Visualizing results

Now that we have our DGE results, we can visualize the genes in a variety of different ways to better understand what is happening with our data.

Volcano plot

To get a first look at the differences in each condition, we can generate a volcano plot using the EnhancedVolcano() function. This is a visualization that allows us to quickly see trends in the significant genes. The x-axis represents the average log2-fold change value, showing the degree of difference between the two conditions. On the y-axis, we see our p_val_adj column represented after a negative log10 transformation is applied to better see the spread of the adjusted p-values.

Volcano plots provide a great overview of which genes are up-regulated (positive on the x-axis) or down-regulated (negative on the x-axis). These results are all relative to ident.1 (P5CRC).

# Volcano plot
EnhancedVolcano(dge,
                lab = dge$gene,
                x = "avg_log2FC",
                y = "p_val_adj",
                title = "FindMarkers Myeloid Results",
                subtitle = "CRC vs NAT")

Figure 2: Volcano plot of `FindMarkers()` differentially expressed genes for Myeloid bins.

Violin plots

While looking at the overall trends in the data is a great starting point, we can also look at genes that have large differences between CRC and NAT. To do this, we can pull the top 6 genes:

# Filter significant genes
dge_sig <- dge %>% subset(p_val_adj < 0.05)

# Get the gene names and get the first 6 values
genes <- dge_sig %>%
  pull(gene) %>%
  head(6)
genes

[1] "SRGN"   "SOD2"   "CXCL8"  "S100A9" "B2M"    "S100A8"

And then plot them using the VlnPlot() function:

# Violin plot expression of genes per condition
VlnPlot(seurat_myeloid, 
        genes,
        group.by = "orig.ident")

Figure 3: Violin plot of top 6 genes in the Myeloid population.

This can also be a helpful way in identifying the directionality of the log-fold change! SRGN had a positive avg_log2FC score and we can see there is more expression of this gene in condition P5CRC.

UMAP plot

We can look at the distribution of gene expression for each of the top 6 genes with a better understanding of where the cells for each condition exist in UMAP space.

# UMAP of gene expression
FeaturePlot(seurat_myeloid, 
            genes, 
            reduction = "full.umap.sketch",
            ncol = 3)

Figure 4: Expression of top DE genes on the UMAP coordinates.

These visuals would be most helpful if we had re-run PCA, UMAP, and clustering on this myeloid subpopulation. In doing so, we would be able to associate clusters to different subtypes of myeloid cells, which would inform the biological context of change in these genes.

Spatial plot

We also have the spatial coordinates for these bins, so we can plot the expression of these genes on our slides:

# Spatial plot of gene expression
SpatialFeaturePlot(seurat_myeloid, 
                   genes[1], 
                   image.alpha = 0.1, 
                   pt.size.factor = 20)

Figure 5: Expression of top DE gene on the tissue.

Exercise 1

Pick a different cell type and run FindMarkers(). Name the population you chose and print the top 6 genes in the results.

Functional Analysis

When it comes to functional analysis there are various analyses that can be done:

Determine whether there is enrichment of known biological functions, interactions or pathways
Identify genes’ involvement in novel pathways or networks by grouping genes together based on similar trends
Use global changes in gene expression by visualizing all genes being significantly up- or down-regulated in the context of external interaction data

Generally, for any differential expression analysis, it is useful to interpret the resulting gene lists using freely available web- and R-based tools. While tools for functional analysis span a wide variety of techniques, they can loosely be categorized into three main types: over-representation analysis, functional class scoring and pathway topology.

Figure 6: Different methods of assessing pathways.
*Image source: Khatri et al. (2012)*

In this lesson, we will walk you through both an over-representation analysis and gene set enrichment analysis (GSEA) using an R Bioconductor package called clusterProfiler.

Over-representation analysis

Over-representation analysis (ORA) is used to determine which a priori defined gene sets are more present (over-represented) in a subset of “interesting” genes than what would be expected by chance (Huang et al. 2009). Most genes in the genome have some pre-existing annotation associated with it, which has been compiled through a combination of manual curation and computational algorithms. There are a number of existing databases which define genes using a controlled vocabulary and then categorize genes into groups (gene sets) based on shared function, involvement in a pathway or presence in a specific cellular location, etc. A very commonly used gene annotation resource is the Gene Ontology (GO) database, and is what we will use in our workflow.

We then use those categorizations to assess whether or not the enrichment observed amongst our DE gene results when compared to the larger universe of genes is significant.

Figure 7: Showcasing the universe of all genes compared to the significant differentially expressed genes used as input for over-representation analysis.

Hypergeometric test

The statistical test that will determine whether something is actually over-represented is the Hypergeometric test.

Hypergeometric distribution is a probability distribution that describes the probability of some number of genes (k) being associated with “Functional category 1”, for all genes in our gene list (n), compared to the number of genes (K) associated with “Functional category 1” from a population of all of the genes in entire genome (N).

The calculation of probability of k successes follows this formula:

\[ p = 1 - \sum_{i=0}^{k-1} \frac{\binom{N}{k} \binom{N-K}{n-k}}{\binom{N}{n}} \]

This test will result in a p-value that will be adjusted to consider multiple testing correction for each category tested.

Running ORA with clusterProfiler

Now that we know more about what ORA is doing, let’s take our significant genes and see if there are any GO terms over-represented that align with what we expect to be happening in Myeloid cells in response to cancer.

We will take the results from the FindMarkers DE and run different functional analysis methods to obtain some biological insight. We will create vectors containing our gene symbols for the background and query set of genes. We will query the up- and down-regulated gene sets separately, but note that you can also use the entire significant list as a query input.

# Create background dataset for hypergeometric testing using all tested genes for significance in the results
all_genes <- as.character(dge$gene)

# Extract significant results for up-regulated genes
sigUp <- dplyr::filter(dge_sig, 
                       p_val_adj < 0.05, 
                       avg_log2FC > 0)

# Convert the genes to a character vector
sigUp_genes <- as.character(sigUp$gene)

# Inspect the significantly up-regulated genes
sigUp_genes %>% head()

[1] "SRGN"   "SOD2"   "CXCL8"  "S100A9" "B2M"    "S100A8"

Finally, we can perform the GO enrichment analysis with enrichGO(), focusing on the BP biological processes pathways:

# Run GO enrichment analysis 
egoUp <- enrichGO(gene = sigUp_genes, 
                  universe = all_genes,
                  keyType = "SYMBOL",
                  OrgDb = org.Hs.eg.db, 
                  ont = "BP", 
                  pAdjustMethod = "BH", 
                  qvalueCutoff = 0.05, 
                  readable = TRUE)

enrichGO() parameters

Note 1: The different organisms with annotation databases available to use with the OrgDb argument can be found here.

Note 2: The keyType argument may be coded as keytype in different versions of clusterProfiler.

Note 3: The ont argument can accept either "BP" (Biological Process), "MF" (Molecular Function), and "CC" (Cellular Component) subontologies, or "ALL" for all three.

Exploring results from an over-representation analysis

Let’s take a look at what terms are identified as over-represented in the up-regulated genes.

# Output results from GO analysis to a table
cluster_summaryUp <- data.frame(egoUp)

# View GO upregulated output
View(cluster_summaryUp)

Table 2: Upregulated GO BP enriched pathways in P5CRC.

	ID	Description	GeneRatio	BgRatio	RichFactor	FoldEnrichment	zScore	geneID	Count
GO:0006954	GO:0006954	inflammatory response	57/162	461/6577	0.1236443	5.019804	14.22203	CXCL8/S100A9/S100A8/IL1RN/NAMPT/IL1B/IL1R2/LCN2/PTGS2/SERPINA1/OSM/FPR1/SOCS3/TNFAIP3/PROK2/FPR2/LYZ/CXCL1/NFKBIA/FCER1G/ALOX5AP/C5AR1/SLC11A1/TFRC/MMP9/CCL4/IER3/NKG7/IGHG1/CCL24/TNFAIP6/TYROBP/TNFRSF1B/HSPA8/CCL5/NLRP3/ADM/CCL20/CXCL13/TIMP1/CCL3/ADAM8/GZMB/PLA2G2A/PLA2G7/CST7/IL1A/REG3A/OLR1/FCGR3A/CXCR2/THBS1/CYBA/CD44/AIF1/FCGR2A/CXCL2	57
GO:0006935	GO:0006935	chemotaxis	35/162	230/6577	0.1521739	6.178073	12.70246	CXCL8/S100A9/S100A8/IL1B/PLAUR/CD74/FPR1/TREM1/CSF3R/PROK2/FPR2/CXCL1/FCER1G/C5AR1/TYMP/PDE4B/CCL4/SPI1/CXCL16/CCL24/TNFAIP6/CCL5/CCL20/CXCL13/PLAU/CCL3/CALR/ADAM8/PLA2G7/EGR3/CXCR2/THBS1/CXCR1/AIF1/CXCL2	35
GO:0042330	GO:0042330	taxis	35/162	230/6577	0.1521739	6.178073	12.70246	CXCL8/S100A9/S100A8/IL1B/PLAUR/CD74/FPR1/TREM1/CSF3R/PROK2/FPR2/CXCL1/FCER1G/C5AR1/TYMP/PDE4B/CCL4/SPI1/CXCL16/CCL24/TNFAIP6/CCL5/CCL20/CXCL13/PLAU/CCL3/CALR/ADAM8/PLA2G7/EGR3/CXCR2/THBS1/CXCR1/AIF1/CXCL2	35
GO:0019730	GO:0019730	antimicrobial humoral response	18/162	44/6577	0.4090909	16.608586	16.50721	CXCL8/S100A9/B2M/IGHA1/LYZ/CXCL1/SLC11A1/CCL4/IGHG1/CCL24/PI3/CCL5/ADM/CCL20/CXCL13/CCL3/REG3A/CXCL2	18
GO:0060326	GO:0060326	cell chemotaxis	30/162	171/6577	0.1754386	7.122590	12.89080	CXCL8/S100A9/S100A8/IL1B/CD74/TREM1/CSF3R/FPR2/CXCL1/FCER1G/C5AR1/PDE4B/CCL4/SPI1/CXCL16/CCL24/TNFAIP6/CCL5/CCL20/CXCL13/CCL3/CALR/ADAM8/PLA2G7/EGR3/CXCR2/THBS1/CXCR1/AIF1/CXCL2	30
GO:0030595	GO:0030595	leukocyte chemotaxis	26/162	126/6577	0.2063492	8.377523	13.28686	CXCL8/S100A9/S100A8/IL1B/CD74/TREM1/CSF3R/FPR2/FCER1G/C5AR1/PDE4B/CCL4/SPI1/CXCL16/CCL24/TNFAIP6/CCL5/CXCL13/CCL3/CALR/ADAM8/PLA2G7/CXCR2/THBS1/CXCR1/AIF1	26

In the first few columns, we see the GO identifier and the descriptive term name. In the next two columns that follow, we observe GeneRatio and BgRatio. These values allows us to compare the overlaps to the background. Here we describe what each of the columns represent in the results:

Term	Math Expression	Description
BgRatio	\(\dfrac{K}{N}\)	The total number of genes in the GO term gene set (K), divided by the total number of genes in the universe (N).
GeneRatio	\(\dfrac{k}{n}\)	The total number of genes in our significantly up‑regulated DE gene set that are part of a given GO term (k), divided by the total number of genes in our significantly up‑regulated DE gene set (n).
Fold Enrichment	\(\dfrac{k/n}{K/N}\)	How many-fold enriched the GeneRatio is compared to the BgRatio.
p.adjust	–	Multiple-testing adjusted p-value (results are typically ordered by this column).
geneID	–	Gene symbols of the genes overlapping between the DE gene set and the GO term, separated by a “/”.

In the P5CRC sample, we can see that the top differential pathway is “inflammatory response” among the Myeloid bins. This makes sense as the Myeloid cells (Macrophages, Dendritic cells, etc) modulate the immune response in a system. In the case of cancer, it makes sense that there is an increased inflammatory signal in this population compared to the healthy, adjacent tissue.

Exercise 2

Run an ORA analysis on the cell type that you calculated DE results for in the previous exercise. What is the top pathway that appears?

Visualizing over-representation analysis results

clusterProfiler has a variety of options for viewing the over-represented GO terms. We will explore the dotplot in this lesson.

The dotplot shows statistics associated with a user-selected top number of significant terms. The color of the dots represents the adjusted p-values for these terms and size of the dots corresponds to the total count of significantly DE genes annotated with the GO term (Count). This plot displays the top 20 GO terms ordered by gene ratio, not adjusted p-value.

# Create a dotplot of GO results 
dotplot(egoUp,
        showCategory = 20)

Figure 8: Dotplot of the top 20 significant, upregulated biological processes pathways.

Gene Set Enrichment Analysis (GSEA)

While over-representation analysis is helpful and commonly used, it does require you to subset your gene list using an arbitrary threshold. There could very well be many genes that very narrowly miss this threshold and are therefore not considered in the functional analysis. To get around this, there are functional class scoring (FCS) methods that can be helpful. For these methods the hypothesis is that although large changes in individual genes can have significant effects on pathways (and will be detected via ORA methods), weaker but coordinated changes in sets of functionally related genes (i.e., pathways) can also have significant effects. Thus, rather than setting a threshold to identify ‘significant genes’, all genes are considered in the analysis. The gene-level statistics from the dataset are aggregated to generate a single pathway-level statistic and statistical significance of each pathway is reported. This type of analysis can be particularly helpful if the differential expression analysis only outputs a small list of significant DE genes.

A commonly used example of an FCS method is GSEA (Subramanian et al. 2005). Gene set enrichment analysis utilizes gene-level statistics or log2-fold changes for all genes to look to see whether gene sets for particular biological pathways (e.g., derived from KEGG pathways, Gene Ontology terms, MSigDB, etc.) are enriched among the large positive- or negative-fold changes.

Figure 9: Schematic of the GSEA method.
*Image source: Subramanium et al. (2005)*

This image describes the theory of GSEA, with the ‘gene set S’ showing the metric used (in our case, ranked log2-fold change) to determine enrichment of genes in the gene set. There are four main steps that are being performed:

Rank genes:
- Genes in a dataset are ranked based on the given statistic, which in our case is the log2-fold changes.
Calculate enrichment scores for each gene set
- This score reflects how often genes in the set appear at the top or bottom of the ranked list.
- The score is calculated by walking down the list of log2-fold changes and increasing the running-sum statistic every time a gene in the gene set is encountered and decreasing it when genes are not part of the gene set.
- Increase/decrease is determined by magnitude of fold change.
Estimate statistical significance
- A permutation test is used to calculate a null distribution for the enrichment score. This produces a p-value that represents the probability of observing a given enrichment score.
Normalize enrichment scores to account for gene set size
- The raw enrichment score (ES) depends on properties like the size of the gene set. To make scores comparable across gene sets, GSEA computes a normalized enrichment score (NES) by adjusting the ES relative to scores obtained from permutations for that specific gene set.
Adjust for multiple hypothesis testing
- Because many gene sets (pathways) are tested simultaneously, the resulting p‑values are adjusted to control for multiple testing by estimating a false discovery rate (FDR).

Running GSEA with MSigDB gene sets

The clusterProfiler package offers several functions to perform GSEA using different genes sets, including but not limited to GO, KEGG and MSigDb. We will use the MSigDb gene sets in our example below. The Molecular Signatures Database (also known as MSigDB) is a collection of annotated gene sets. It contains 8 major collections.

To run GSEA with the MSigDb gene sets, we will use the msigdbr R package which provides the MSigDB gene sets in tidy data format that can be used directly with clusterProfiler.

Selecting species and pathways for msigdbr

The msigdbr package supports several species:

# Show MSigDB species
msigdbr_species()

# A tibble: 20 × 2
   species_name                    species_common_name                          
   <chr>                           <chr>                                        
 1 Anolis carolinensis             Carolina anole, green anole                  
 2 Bos taurus                      bovine, cattle, cow, dairy cow, domestic cat…
 3 Caenorhabditis elegans          <NA>                                         
 4 Canis lupus familiaris          dog, dogs                                    
 5 Danio rerio                     leopard danio, zebra danio, zebra fish, zebr…
 6 Drosophila melanogaster         fruit fly                                    
 7 Equus caballus                  domestic horse, equine, horse                
 8 Felis catus                     cat, cats, domestic cat                      
 9 Gallus gallus                   bantam, chicken, chickens, Gallus domesticus 
10 Homo sapiens                    human                                        
11 Macaca mulatta                  rhesus macaque, rhesus macaques, Rhesus monk…
12 Monodelphis domestica           gray short-tailed opossum                    
13 Mus musculus                    house mouse, mouse                           
14 Ornithorhynchus anatinus        duck-billed platypus, duckbill platypus, pla…
15 Pan troglodytes                 chimpanzee                                   
16 Rattus norvegicus               brown rat, Norway rat, rat, rats             
17 Saccharomyces cerevisiae        baker's yeast, brewer's yeast, S. cerevisiae 
18 Schizosaccharomyces pombe 972h- <NA>                                         
19 Sus scrofa                      pig, pigs, swine, wild boar                  
20 Xenopus tropicalis              tropical clawed frog, western clawed frog

And you can see what gene sets are available:

# Show MSigDB collections
msigdbr_collections()

   db_version gs_collection gs_subcollection
1   2026.1.Hs            C1                 
2   2026.1.Hs            C2              CGP
3   2026.1.Hs            C2               CP
4   2026.1.Hs            C2      CP:BIOCARTA
5   2026.1.Hs            C2   CP:KEGG_LEGACY
6   2026.1.Hs            C2  CP:KEGG_MEDICUS
7   2026.1.Hs            C2           CP:PID
8   2026.1.Hs            C2      CP:REACTOME
9   2026.1.Hs            C2  CP:WIKIPATHWAYS
10  2026.1.Hs            C3        MIR:MIRDB
11  2026.1.Hs            C3   MIR:MIR_LEGACY
12  2026.1.Hs            C3         TFT:GTRD
13  2026.1.Hs            C3   TFT:TFT_LEGACY
14  2026.1.Hs            C4              3CA
15  2026.1.Hs            C4              CGN
16  2026.1.Hs            C4               CM
17  2026.1.Hs            C5            GO:BP
18  2026.1.Hs            C5            GO:CC
19  2026.1.Hs            C5            GO:MF
20  2026.1.Hs            C5              HPO
21  2026.1.Hs            C6                 
22  2026.1.Hs            C7      IMMUNESIGDB
23  2026.1.Hs            C7              VAX
24  2026.1.Hs            C8                 
25  2026.1.Hs            C9                 
26  2026.1.Hs             H                 
                     gs_collection_name num_genesets
1                            Positional          302
2    Chemical and Genetic Perturbations         3555
3                    Canonical Pathways           19
4                     BioCarta Pathways          292
5                  KEGG Legacy Pathways          186
6                 KEGG Medicus Pathways          658
7                          PID Pathways          196
8                     Reactome Pathways         1839
9                          WikiPathways          925
10                                miRDB         2377
11                           MIR_Legacy          221
12                                 GTRD          506
13                           TFT_Legacy          610
14 Curated Cancer Cell Atlas gene sets           148
15            Cancer Gene Neighborhoods          427
16                       Cancer Modules          431
17                GO Biological Process         7538
18                GO Cellular Component         1080
19                GO Molecular Function         1872
20             Human Phenotype Ontology         5793
21                  Oncogenic Signature          189
22                          ImmuneSigDB         4872
23                HIPC Vaccine Response          347
24                  Cell Type Signature          866
25 Computational Perturbation Signature           62
26                             Hallmark           50

For our analysis, we will use C5, which contains the Gene Ontology gene sets. This will allow us to see how the results align with output from ORA. From the retrieved table, we only need two columns: gene set name and gene symbol:

# Use a specific collection; C5 GO signatures
m_t2g <- msigdbr(species = "Homo sapiens",
                 collection = "C5") %>%
  dplyr::select(gs_name, gene_symbol)

Now that we have our gene sets, we need to prepare the fold changes. GSEA will use the log2-fold changes obtained from the differential expression analysis for every gene to perform the analysis. We need to create an ordered and named vector for input to clusterProfiler:

# Extract the fold changes
foldchanges <- dge_sig$avg_log2FC

# Name each fold change with the corresponding gene symbol
names(foldchanges) <- dge_sig$gene

# Sort fold changes in decreasing order
foldchanges <- sort(foldchanges, decreasing = TRUE)

# Inspect the fold changes ordered and named vector
foldchanges %>% head()

   ADGRG3     CXCR1     REG3A     CXCL8     IL1RN       OSM 
11.250232 10.291084  9.827910  8.675935  7.863775  7.448534

Now we are ready to run GSEA!

# Run GSEA
msig_GSEA <- GSEA(foldchanges,
                  TERM2GENE = m_t2g,
                  verbose = FALSE)

# Extract the GSEA results
msigGSEA_results <- msig_GSEA@result

Setting seed

The permutations are performed using random reordering, so every time we run the function we will get slightly different results. If we would like to use the same permutations every time we run a function, then we use the set.seed() function prior to running.

This is a good spot to save the results so that we can revisit them at any point later on:

# Write results to file
write.csv(msigGSEA_results,
          file = "results/gsea_msigdb_GO_genesets.csv",
          quote = FALSE)

Take a look at the results table and reorder by NES (normalized enrichment score). What terms do you see positively enriched? Does this overlap with what we observed from ORA analysis?

# Look at results ordered by NES
msigGSEA_results %>%
  arrange(desc(abs(NES))) %>%
  View()

Table 3: GSEA pathway analysis results, ordered by normalized enrichment score (NES).

	ID	Description	setSize	enrichmentScore	NES	pvalue	p.adjust	qvalue	rank	leading_edge	core_enrichment
HP_RESPIRATORY_INSUFFICIENCY	HP_RESPIRATORY_INSUFFICIENCY	HP_RESPIRATORY_INSUFFICIENCY	10	-0.7500000	-2.453227	0.0002521	0.0122580	0.0110249	64	tags=100%, list=29%, signal=75%	NPC2/MT-ND3/LTBP4/FHL1/SLC26A2/CRYAB/GATA2/LRP4/KIT
GOMF_CYTOKINE_ACTIVITY	GOMF_CYTOKINE_ACTIVITY	GOMF_CYTOKINE_ACTIVITY	19	0.8009752	2.031152	0.0000006	0.0001814	0.0001632	38	tags=68%, list=17%, signal=62%	CXCL8/IL1RN/OSM/IL1B/IL1A/CCL20/CCL4L2/CCL4/CXCL1/CXCL2/NAMPT/CXCL13/CCL3
GOMF_CYTOKINE_RECEPTOR_BINDING	GOMF_CYTOKINE_RECEPTOR_BINDING	GOMF_CYTOKINE_RECEPTOR_BINDING	21	0.7401314	1.909748	0.0000376	0.0043321	0.0038963	38	tags=62%, list=17%, signal=57%	CXCL8/IL1RN/OSM/IL1B/IL1A/CCL20/CCL4L2/CCL4/CXCL1/CXCL2/SOCS3/CXCL13/CCL3
GOBP_ANTIMICROBIAL_HUMORAL_IMMUNE_RESPONSE_MEDIATED_BY_ANTIMICROBIAL_PEPTIDE	GOBP_ANTIMICROBIAL_HUMORAL_IMMUNE_RESPONSE_MEDIATED_BY_ANTIMICROBIAL_PEPTIDE	GOBP_ANTIMICROBIAL_HUMORAL_IMMUNE_RESPONSE_MEDIATED_BY_ANTIMICROBIAL_PEPTIDE	15	0.7742921	1.885327	0.0000445	0.0043321	0.0038963	44	tags=73%, list=20%, signal=63%	REG3A/CXCL8/CCL20/CCL4L2/CCL4/CXCL1/CXCL2/CXCL13/CCL3/S100A9/ADM
GOMF_SIGNALING_RECEPTOR_REGULATOR_ACTIVITY	GOMF_SIGNALING_RECEPTOR_REGULATOR_ACTIVITY	GOMF_SIGNALING_RECEPTOR_REGULATOR_ACTIVITY	27	0.6991968	1.868216	0.0000323	0.0043321	0.0038963	44	tags=56%, list=20%, signal=51%	REG3A/CXCL8/IL1RN/OSM/IL1B/IL1A/CCL20/CCL4L2/CCL4/CXCL1/CXCL2/NAMPT/CXCL13/CCL3/ADM
GOBP_INFLAMMATORY_RESPONSE	GOBP_INFLAMMATORY_RESPONSE	GOBP_INFLAMMATORY_RESPONSE	69	0.6375055	1.863721	0.0000006	0.0001814	0.0001632	48	tags=43%, list=22%, signal=49%	REG3A/CXCL8/IL1RN/OSM/TNFAIP6/IL1B/FPR2/PTGS2/IL1A/PROK2/CCL20/CCL4L2/CCL4/CXCL1/S100A8/CXCR2/CXCL2/NAMPT/GZMB/SOCS3/OLR1/CXCL13/LCN2/CCL3/S100A9/NKG7/ADM/IL1R2/FPR1/IER3

The first few columns of the results table identify the gene set information.
The next columns include the associated statistics.
The last column will report which genes are part of the ‘core enrichment’. These are the genes associated with the pathway which contributed to the observed enrichment score (i.e., in the extremes of the ranking).

Exercise 3

Run GSEA on the cell type that you calculated DE results for in the first exercise. What is the top pathway that appears?

GSEA visualization

Let’s explore the GSEA plot of enrichment of one of the pathways in the ranked list using a built-in function from clusterProfiler. We can pick one of the top terms GOBP_INFLAMMATORY_RESPONSE:

# Plot the GSEA plot for a single enriched GO term
gseaplot(msig_GSEA,
         geneSetID = 'GOBP_INFLAMMATORY_RESPONSE')

In this plot, the lines in upper panel (or on the x-axis in the lower panel) represent the genes in the gene set GOBP_INFLAMMATORY_RESPONSE and where they occur among the log2-fold changes. The largest positive log2-fold changes are on the left-hand side of the plot, while the largest negative log2-fold changes are on the right. The top plot shows the magnitude of the log2-fold changes for each gene, while the bottom plot shows the running sum, with the enrichment score peaking at the red dotted line (which is among the positive log2-fold changes). This suggests the up-regulation of this function.

Next Lesson >>

Back to Schedule

Reuse

CC-BY-4.0