Differential Gene Expression

Spatial transcriptomics
Differential gene expression
Pathway analysis

Perform differential expression analysis between celltypes of matched cancer and normal. Feed the significant genes into pathway and gene set enrichment analyses to observe trends.

Authors

Noor Sohail

Meeta Mistry

Published

March 25, 2026

Keywords

Wilcoxon test, FindMarkers, Functional analysis, Overrepresentation analysis, GSEA

Approximate time: XX minutes

Learning objectives

In this lesson, we will:

  • Run differential gene expression analysis using a Wilcox test
  • Calculate over-representated pathways
  • Identify gene set enrichment scores for pathways

Overview of lesson

By now, we have meaningful annotations for msot of our bins for both samples. So now we can start to ask the question of how these gene expression changes between sample conditions (NAT and CRC). Running a differential gene expression analysis is a great way to evaluate the significance of change between populations. However, looking at a table of genes can be overwhelming! So we can take the output from the DGE analysis to identify pathways that are enriched in one condition compared to another. This provides a mechanistic insight in the biological processes that differ between samples.

Differential gene expression (DGE)

What if we were interested in a particular celltype and understanding how gene expression changes across the different conditions?

# DGE and pathway analysis
# Visium HD spatial transcriptomics workshop
# Author: Harvard Chan Bioinformatics Core
# Created: December 2025

# Load libraries
library(Seurat)
library(tidyverse)
library(EnhancedVolcano)

# ORA libraries
library(tidyverse)
library(clusterProfiler)
library(org.Hs.eg.db)
library(msigdbr)

set.seed(12345)
# Load dataset
seurat_rctd <- readRDS("data/intermediate_seurat/seurat_RCTD.rds")

For instance, maybe we are interested in seeing how the Myeloid population changes in gene expression between the CRC and NAT samples. To start identifying these changes, we are going to first subset the dataset to Myeloid bins based upon the annotation generated by RCTD.

# Subset to Myeloid cells
seurat_macro <- subset(seurat_rctd,
                       first_type == "Myeloid")
Idents(seurat_macro) <- "orig.ident"

FindMarkers()

The FindMarkers() function in the Seurat package is used to perform differential expression analysis between groups of cells. We provide arguments to specify ident.1 and ident.2 the two groups of cells we are interested in comparing. This function is commonly used at the celltype annotation step, but can also be used to calculate differentially expressed genes between experimental conditions.

Figure 1: Schematic showing how the FindMarkers() function calculates differential genes.
Image source: Introduction to scRNA-seq

We can use this same function to compare two groups of cells which represent different conditions by modifying the ident values provided. The FindMarkers() function has several important arguments we can modify when running it. To view what these parameters are, we can access the help page on this function:

?FindMarkers

Below we have described in more detail what each of these arguments mean :

Parameter Description Default
logfc.threshold The minimum log2 fold change for average expression of a gene in the group relative to the average expression in all other groups combined. 0.25
min.diff.pct The minimum percent difference between the percent of cells expressing the gene in the group and the percent of cells expressing the gene in all other groups combined.
min.pct Only test genes that are detected in a minimum fraction of cells in either of the two populations; speeds up the function by skipping very infrequently expressed genes. 0.1
ident.1 The group of interest; this function evaluates one group at a time, and ident.1 must correspond to a value set in Idents().
ident.2 The comparison group; specifies the group against which ident.1 is compared.
# Calculate DGEs
dge <- FindMarkers(seurat_macro,
                   ident.1 = "P5CRC",
                   ident.2 = "P5NAT")
dge$gene <- rownames(dge)
dge %>% head()
Table 1: Wilcox test results from comparing P5CRC and P5NAT Myeloid bins.
p_val avg_log2FC pct.1 pct.2 p_val_adj gene
SRGN 0 1.6450858 0.664 0.398 0 SRGN
SOD2 0 3.3359809 0.266 0.064 0 SOD2
CXCL8 0 8.6759345 0.167 0.007 0 CXCL8
S100A9 0 3.3131400 0.209 0.028 0 S100A9
B2M 0 0.7734271 0.711 0.494 0 B2M
S100A8 0 4.4094812 0.152 0.009 0 S100A8

The output from the FindMarkers() function is a dataframe containing a ranked list of differentially expressed genes listed by gene ID and associated statistics. We describe some of these columns below:

Column Description
gene The gene symbol.
p_val The p-value not adjusted for multiple test corrections.
avg_logFC The average log-fold change between sample groups. Positive values indicate that the gene is more highly expressed in ident.1.
pct.1 The percentage of cells where the gene is detected in ident.1 (P5CRC).
pct.2 The percentage of cells where the gene is detected on average in ident.2 (P5NAT).
p_val_adj The adjusted p-value, based on Bonferroni correction using all genes in the dataset, used to determine significance.
Inflated p-values

These tests treat each cell as an independent replicate and ignore inherent correlations between cells originating from the same sample. This results in highly inflated p-values for each gene. Studies have been shown to find a large number of false positive associations with these results.

Visualizing results

Now that we have our DGE results, we can visualize the genes in a variety of different ways to better understand what is happening with our data.

Volcano plot

To get a first look at the genes that are retained, we can generated a volcano plot using the EnhancedVolcano() function. This is a visualization that allows us to quickly see trends in the significant genes. The x-axis here represents the average log2-fold change value, showing the degree of difference between the two conditions. On the y-axis, we see our p_val_adj column represented after a negative log10 transformation is applied to better see the spread of the adjusted p-values.

Volcano plots provide a great overview of which genes are up-regulated (positive on the x-axis) or down-regulated (negative on the x-axis). These results are all relative to ident.1 (P5CRC).

# Volcano plot
EnhancedVolcano(dge,
                lab = dge$gene,
                x = "avg_log2FC",
                y = "p_val_adj",
                title = "FindMarkers Myeloid Results",
                subtitle = "CRC vs NAT")
Figure 2: Volcano plot of FindMarkers() differentially expressed genes for Myeloid bins.

Violin plots

While looking at the overall trends in the data is a great starting point, we can also start looking at genes that have large differences between CRC and NAT. To do this, we can take a look at the top 6 genes:

# Filter significant genes
dge_sig <- dge %>% subset(p_val_adj < 0.05)

# Get the gene names and get the first 6 values
genes <- dge_sig %>%
  pull(gene) %>%
  head(6)
genes
[1] "SRGN"   "SOD2"   "CXCL8"  "S100A9" "B2M"    "S100A8"

And then plot them using VlnPlot() function:

# Violin plot expression of genes per condition
VlnPlot(seurat_macro, 
        genes,
        group.by = "orig.ident")
Figure 3: Violin plot of top 6 genes in the Myeloid population.

UMAP plot

we can begin to look at the distribution of gene expression for each of the top 6 genes with a better understanding of where the cells for each condition exists.

# UMAP of gene expression
FeaturePlot(seurat_macro, 
            genes, 
            reduction = "full.umap.sketch",
            ncol = 3)
Figure 4: Expression of top DE genes on the UMAP coordinates.

Spatial plot

As we also have the spatial conditions for these bins, we can plot the expression o f

# Spatial plot of gene expression
SpatialFeaturePlot(seurat_macro, 
                   genes[1], 
                   image.alpha = 0.1, 
                   pt.size.factor = 18)
Figure 5: Expression of top DE gene on the tissue.
  1. Pick a different cell type and run FindMarkers(). Name the population you chose and print the top 6 genes in the results.

Functional Analysis

When it comes to functional analysis there are various analyses that can be done:

  • Determine whether there is enrichment of known biological functions, interactions, or pathways
  • Identify genes’ involvement in novel pathways or networks by grouping genes together based on similar trends
  • Use global changes in gene expression by visualizing all genes being significantly up- or down-regulated in the context of external interaction data

Generally, for any differential expression analysis, it is useful to interpret the resulting gene lists using freely available web- and R-based tools. While tools for functional analysis span a wide variety of techniques, they can loosely be categorized into three main types: over-representation analysis, functional class scoring, and pathway topology [1].

Figure 6: Different methods of assessing pathways.
Image source: Khatri et al. (2012)

In this lesson, we will walk you through both an over-representation analysis and gene set enrichment analysis (GSEA) using an R Bioconductor package called clusterProfiler.

Over-representation analysis

Over-representation analysis (ORA) is used to determine which a priori defined gene sets are more present (over-represented) in a subset of “interesting” genes than what would be expected by chance (Huang et al, 2009). Most genes in the genome have some pre-existing annotation associated with it, which has been compiled through a combination of manual curation and computational algorithms. There are a number of existing databases which define genes using a controlled vocabulary and then categorize genes into groups (gene sets) based on shared function, involvement in a pathway or presence in a specific cellular location, etc. A very commonly used gene annotataion resource is the Gene Ontology (GO) database, and is what we will use in our workflow.

We then use those categorizations to assess whether or not the enrichment observed amongst our DE gene results when compared to the larger universe of genes is significant.

Figure 7: Showcasing the universe of genes compared to the significant differential genes, used as input for over-representation analysis.

Hypergeometric test

The statistical test that will determine whether something is actually over-represented is the Hypergeometric test.

Hypergeometric distribution is a probability distribution that describes the probability of some number of genes (k) being associated with “Functional category 1”, for all genes in our gene list (n=1000), compared to the number of genes (K) associated with “Functional category 1” from a population of all of the genes in entire genome (N=13,000) [2].

The calculation of probability of k successes follows the formula:

\[ p = 1 - \sum_{i=0}^{k-1} \frac{\binom{N}{k} \binom{N-K}{n-k}}{\binom{N}{n}} \]

This test will result in a p-value that will be adjusted to consider multiple testing correction for each category tested.

Running ORA with clusterProfiler

Now that we know more about what ORA is doing, let’s take our significant genes and see if there are any GO terms over-represented that align with what we expect to be happening in macrophage cells with a change of temperature.

We will take the results from the FindMarkers DE and run different functional analysis methods to obtain some biological insight. We create vectors containing our gene symbols for the background and query set of genes. We will query the up- and down-regulated gene sets separately, but note that you can also use the entire significant list as a query input.

# Create background dataset for hypergeometric testing using all tested 
# genes for significance in the results
all_genes <- as.character(dge$gene)

# Extract significant results for up-regulated
sigUp <- dplyr::filter(dge_sig, 
                       p_val_adj < 0.05, 
                       avg_log2FC > 0)
sigUp_genes <- as.character(sigUp$gene)
sigUp_genes %>% head()
[1] "SRGN"   "SOD2"   "CXCL8"  "S100A9" "B2M"    "S100A8"

Finally, we can perform the GO enrichment analysis with enrichGO() and save the results:

Notes about enrichGO() parameters

Note 1: The different organisms with annotation databases available to use with for the OrgDb argument can be found here.

Note 2: The keyType argument may be coded as keytype in different versions of clusterProfiler.

Note 3: The ont argument can accept either “BP” (Biological Process), “MF” (Molecular Function), and “CC” (Cellular Component) subontologies, or “ALL” for all three.

# Run GO enrichment analysis 
egoUp <- enrichGO(gene = sigUp_genes, 
                  universe = all_genes,
                  keyType = "SYMBOL",
                  OrgDb = org.Hs.eg.db, 
                  ont = "BP", 
                  pAdjustMethod = "BH", 
                  qvalueCutoff = 0.05, 
                  readable = TRUE)

Exploring results from over-representation analysis

Let’s take a look at what terms are identified as over-represented in the genes up-regulated.

# Output results from GO analysis to a table
cluster_summaryUp <- data.frame(egoUp)

# View GO upregulated output
View(cluster_summaryUp)
Table 2: Upregulated GO BP enriched pathways in P5CRC.
ID Description GeneRatio BgRatio RichFactor FoldEnrichment zScore pvalue p.adjust qvalue geneID Count
GO:0006954 GO:0006954 inflammatory response 57/162 461/6577 0.1236443 5.019804 14.22203 0 0 0 CXCL8/S100A9/S100A8/IL1RN/NAMPT/IL1B/IL1R2/LCN2/PTGS2/SERPINA1/OSM/FPR1/SOCS3/TNFAIP3/PROK2/FPR2/LYZ/CXCL1/NFKBIA/FCER1G/ALOX5AP/C5AR1/SLC11A1/TFRC/MMP9/CCL4/IER3/NKG7/IGHG1/CCL24/TNFAIP6/TYROBP/TNFRSF1B/HSPA8/CCL5/NLRP3/ADM/CCL20/CXCL13/TIMP1/CCL3/ADAM8/GZMB/PLA2G2A/PLA2G7/CST7/IL1A/REG3A/OLR1/FCGR3A/CXCR2/THBS1/CYBA/CD44/AIF1/FCGR2A/CXCL2 57
GO:0006935 GO:0006935 chemotaxis 35/162 230/6577 0.1521739 6.178073 12.70246 0 0 0 CXCL8/S100A9/S100A8/IL1B/PLAUR/CD74/FPR1/TREM1/CSF3R/PROK2/FPR2/CXCL1/FCER1G/C5AR1/TYMP/PDE4B/CCL4/SPI1/CXCL16/CCL24/TNFAIP6/CCL5/CCL20/CXCL13/PLAU/CCL3/CALR/ADAM8/PLA2G7/EGR3/CXCR2/THBS1/CXCR1/AIF1/CXCL2 35
GO:0042330 GO:0042330 taxis 35/162 230/6577 0.1521739 6.178073 12.70246 0 0 0 CXCL8/S100A9/S100A8/IL1B/PLAUR/CD74/FPR1/TREM1/CSF3R/PROK2/FPR2/CXCL1/FCER1G/C5AR1/TYMP/PDE4B/CCL4/SPI1/CXCL16/CCL24/TNFAIP6/CCL5/CCL20/CXCL13/PLAU/CCL3/CALR/ADAM8/PLA2G7/EGR3/CXCR2/THBS1/CXCR1/AIF1/CXCL2 35
GO:0019730 GO:0019730 antimicrobial humoral response 18/162 44/6577 0.4090909 16.608586 16.50721 0 0 0 CXCL8/S100A9/B2M/IGHA1/LYZ/CXCL1/SLC11A1/CCL4/IGHG1/CCL24/PI3/CCL5/ADM/CCL20/CXCL13/CCL3/REG3A/CXCL2 18
GO:0060326 GO:0060326 cell chemotaxis 30/162 171/6577 0.1754386 7.122590 12.89080 0 0 0 CXCL8/S100A9/S100A8/IL1B/CD74/TREM1/CSF3R/FPR2/CXCL1/FCER1G/C5AR1/PDE4B/CCL4/SPI1/CXCL16/CCL24/TNFAIP6/CCL5/CCL20/CXCL13/CCL3/CALR/ADAM8/PLA2G7/EGR3/CXCR2/THBS1/CXCR1/AIF1/CXCL2 30
GO:0030595 GO:0030595 leukocyte chemotaxis 26/162 126/6577 0.2063492 8.377523 13.28686 0 0 0 CXCL8/S100A9/S100A8/IL1B/CD74/TREM1/CSF3R/FPR2/FCER1G/C5AR1/PDE4B/CCL4/SPI1/CXCL16/CCL24/TNFAIP6/CCL5/CXCL13/CCL3/CALR/ADAM8/PLA2G7/CXCR2/THBS1/CXCR1/AIF1 26

In the first few columns we see the GO identifier and the descriptive term name. In the next two columns that follow, we observe GeneRatio and BgRatio. These values allows us to compare the overlaps to the background. Here we describe what each of the columns represent in the results:

Term Math Expression Description
BgRatio \(\dfrac{K}{N}\) The total number of genes in the GO term gene set (K), divided by the total number of genes in the universe (N).
GeneRatio \(\dfrac{k}{n}\) The total number of genes in our significant DE gene set which overlap with the GO term gene set (k), divided by the total number of genes in our significant DE gene set that overlap with the universe gene set (n).
Fold Enrichment \(\dfrac{k/n}{K/N}\) How many-fold enriched the GeneRatio is compared to the BgRatio.
p.adjust Multiple-testing adjusted p-value (results are typically ordered by this column).
geneID Gene symbols of the overlapping genes, seperated by “/”.

In the P5CRC sample, we can see that the top differential pathway is “inflammatory response” among the Myeloid bins. This makes sense as the Myeloid cells (Macrophages, Dendritic cells, etc) modulate the immune response in a system. In the case of cancer, it makes sense that there is an increased inflammatory signal in this population compared to the healthy, adjacent tissue.

  1. Run ORA analysis on the cell type that you calculated DE results for in the previous exercise. What is the top pathway that appears?

Visualizing over-representation analysis results

clusterProfiler has a variety of options for viewing the over-represented GO terms. We will explore the dotplot and the enrichment plot in this lesson.

The dotplot shows statistics associated with a user-selected top number of significant terms. The color of the dots represents the adjusted p-values for these terms and size of the dots corresponds to the total count of sig DE genes annotated with the GO term (count). This plot displays the top 20 GO terms ordered by gene ratio, not adjusted p-value.

# Dotplot 
dotplot(egoUp, showCategory=20)
Figure 8: Dotplot of top 20 significant, upregulated biological processes pathways.

Gene Set Enrichment Analysis (GSEA)

While over-representation analysis is helpful and commonly used, it does require you to subset your gene list using an arbitrary threshold. There could very well be many genes that very narrowly miss this threshold and are therefore not considered in the functional analysis. To get around this there are there are functional class scoring (FCS) methods that can be helpful. For these methods the hypothesis is that although large changes in individual genes can have significant effects on pathways (and will be detected via ORA methods), weaker but coordinated changes in sets of functionally related genes (i.e., pathways) can also have significant effects. Thus, rather than setting a threshold to identify ‘significant genes’, all genes are considered in the analysis. The gene-level statistics from the dataset are aggregated to generate a single pathway-level statistic and statistical significance of each pathway is reported. This type of analysis can be particularly helpful if the differential expression analysis only outputs a small list of significant DE genes.

A commonly used example of an FCS method is GSEA (Subramanium A. et al, 2005). Gene set enrichment analysis utilizes the gene-level statistics or log2-fold changes for all genes to look to see whether gene sets for particular biological pathways (e.g., derived from KEGG pathways, Gene Ontology terms, MSigDB, etc) are enriched among the large positive or negative fold changes.

Figure 9: Schematic of the GSEA method.
Image source: Subramanium et al. (2005)

This image describes the theory of GSEA, with the ‘gene set S’ showing the metric used (in our case, ranked log2-fold changes) to determine enrichment of genes in the gene set. There are four main steps that are being performed:

  1. Rank genes:
    • Genes in a data set are ranked based on the given statistic, which in our case is the shrunken log2-fold changes.
  2. Calculate enrichment scores for each gene set
    • This score reflects how often genes in the set appear at the top or bottom of the ranked list.
    • The score is calculated by walking down the list of log2-fold changes and increasing the running-sum statistic every time a gene in the gene set is encountered and decreasing it when genes are not part of the gene set.
    • Increase/decrease is determined by magnitude of fold change.
  3. Estimate statistical significance
    • A permutation test is used to calculate a null distribution for the enrichment score. This produces a p-value that represents the probability of observing a given enrichment score.
  4. Adjust for multiple hypothesis testing
    • Enrichment scores are normalized for the size of each gene set and a false discovery rate is calculated to prevent false positives.

Running GSEA with MSigDB gene sets

The clusterProfiler package offers several functions to perform GSEA using different genes sets, including but not limited to GO, KEGG, and MSigDb. We will use the MSigDb gene sets in our example below. The Molecular Signatures Database (also known as MSigDB) is a collection of annotated gene sets. It contains 8 major collections.

To run GSEA with the MSigDb gene sets, we will use the msigdbr R package which provides the MSigDB gene sets in tidy data format that can be used directly with clusterProfiler.

The msigdbr package supports several species:

# Show MSigDB species
msigdbr_species()
# A tibble: 20 × 2
   species_name                    species_common_name                          
   <chr>                           <chr>                                        
 1 Anolis carolinensis             Carolina anole, green anole                  
 2 Bos taurus                      bovine, cattle, cow, dairy cow, domestic cat…
 3 Caenorhabditis elegans          <NA>                                         
 4 Canis lupus familiaris          dog, dogs                                    
 5 Danio rerio                     leopard danio, zebra danio, zebra fish, zebr…
 6 Drosophila melanogaster         fruit fly                                    
 7 Equus caballus                  domestic horse, equine, horse                
 8 Felis catus                     cat, cats, domestic cat                      
 9 Gallus gallus                   bantam, chicken, chickens, Gallus domesticus 
10 Homo sapiens                    human                                        
11 Macaca mulatta                  rhesus macaque, rhesus macaques, Rhesus monk…
12 Monodelphis domestica           gray short-tailed opossum                    
13 Mus musculus                    house mouse, mouse                           
14 Ornithorhynchus anatinus        duck-billed platypus, duckbill platypus, pla…
15 Pan troglodytes                 chimpanzee                                   
16 Rattus norvegicus               brown rat, Norway rat, rat, rats             
17 Saccharomyces cerevisiae        baker's yeast, brewer's yeast, S. cerevisiae 
18 Schizosaccharomyces pombe 972h- <NA>                                         
19 Sus scrofa                      pig, pigs, swine, wild boar                  
20 Xenopus tropicalis              tropical clawed frog, western clawed frog    

And you can see what gene sets are available:

# Show MSigDB collections
msigdbr_collections()
   db_version gs_collection gs_subcollection
1   2026.1.Hs            C1                 
2   2026.1.Hs            C2              CGP
3   2026.1.Hs            C2               CP
4   2026.1.Hs            C2      CP:BIOCARTA
5   2026.1.Hs            C2   CP:KEGG_LEGACY
6   2026.1.Hs            C2  CP:KEGG_MEDICUS
7   2026.1.Hs            C2           CP:PID
8   2026.1.Hs            C2      CP:REACTOME
9   2026.1.Hs            C2  CP:WIKIPATHWAYS
10  2026.1.Hs            C3        MIR:MIRDB
11  2026.1.Hs            C3   MIR:MIR_LEGACY
12  2026.1.Hs            C3         TFT:GTRD
13  2026.1.Hs            C3   TFT:TFT_LEGACY
14  2026.1.Hs            C4              3CA
15  2026.1.Hs            C4              CGN
16  2026.1.Hs            C4               CM
17  2026.1.Hs            C5            GO:BP
18  2026.1.Hs            C5            GO:CC
19  2026.1.Hs            C5            GO:MF
20  2026.1.Hs            C5              HPO
21  2026.1.Hs            C6                 
22  2026.1.Hs            C7      IMMUNESIGDB
23  2026.1.Hs            C7              VAX
24  2026.1.Hs            C8                 
25  2026.1.Hs            C9                 
26  2026.1.Hs             H                 
                     gs_collection_name num_genesets
1                            Positional          302
2    Chemical and Genetic Perturbations         3555
3                    Canonical Pathways           19
4                     BioCarta Pathways          292
5                  KEGG Legacy Pathways          186
6                 KEGG Medicus Pathways          658
7                          PID Pathways          196
8                     Reactome Pathways         1839
9                          WikiPathways          925
10                                miRDB         2377
11                           MIR_Legacy          221
12                                 GTRD          506
13                           TFT_Legacy          610
14 Curated Cancer Cell Atlas gene sets           148
15            Cancer Gene Neighborhoods          427
16                       Cancer Modules          431
17                GO Biological Process         7538
18                GO Cellular Component         1080
19                GO Molecular Function         1872
20             Human Phenotype Ontology         5793
21                  Oncogenic Signature          189
22                          ImmuneSigDB         4872
23                HIPC Vaccine Response          347
24                  Cell Type Signature          866
25 Computational Perturbation Signature           62
26                             Hallmark           50

For our analysis, we will use C5, which contains the Gene Ontology gene sets. This will allow us to see how the results align with output from ORA. From the table, we only need two columns: gene set name and gene symbol:

# Use a specific collection; C5 GO signatures
m_t2g <- msigdbr(species = "Homo sapiens", collection = "C5") %>%
  dplyr::select(gs_name, gene_symbol)

Now that we have our gene sets, we need to prepare the fold changes. GSEA will use the log2-fold changes obtained from the differential expression analysis for every gene to perform the analysis. We need to create an ordered and named vector for input to clusterProfiler:

# Extract the foldchanges
foldchanges <- dge_sig$avg_log2FC

# Name each fold change with the corresponding gene symbol
names(foldchanges) <- dge_sig$gene
# Sort fold changes in decreasing order
foldchanges <- sort(foldchanges, decreasing = TRUE)

foldchanges %>% head()
   ADGRG3     CXCR1     REG3A     CXCL8     IL1RN       OSM 
11.250232 10.291084  9.827910  8.675935  7.863775  7.448534 

Now we are ready to run GSEA!

# Run GSEA
msig_GSEA <- GSEA(foldchanges, TERM2GENE = m_t2g, verbose = FALSE)

# Extract the GSEA results
msigGSEA_results <- msig_GSEA@result
Setting seed

The permutations are performed using random reordering, so every time we run the function we will get slightly different results. If we would like to use the same permutations every time we run a function, then we use the set.seed() function prior to running. The input to set.seed() can be any number.

This is a good spot to save the results so that we can revisit them at any point later on:

# Write results to file
write.csv(msigGSEA_results, "results/gsea_msigdb_GO_genesets.csv", quote=F)

Take a look at the results table and reorder by NES (normalized enrichment score). What terms do you see positively enriched? Does this overlap with what we observed from ORA analysis?

# Look at results ordered by NES
msigGSEA_results %>% arrange(-NES) %>% View()
Table 3: GSEA pathway analysis results, ordered by normalized enrichment score (NES).
ID Description setSize enrichmentScore NES pvalue p.adjust qvalue rank leading_edge core_enrichment
GOMF_CYTOKINE_ACTIVITY GOMF_CYTOKINE_ACTIVITY GOMF_CYTOKINE_ACTIVITY 19 0.8009752 2.031152 6.00e-07 0.0001814 0.0001632 38 tags=68%, list=17%, signal=62% CXCL8/IL1RN/OSM/IL1B/IL1A/CCL20/CCL4L2/CCL4/CXCL1/CXCL2/NAMPT/CXCL13/CCL3
GOMF_CYTOKINE_RECEPTOR_BINDING GOMF_CYTOKINE_RECEPTOR_BINDING GOMF_CYTOKINE_RECEPTOR_BINDING 21 0.7401314 1.909748 3.76e-05 0.0043321 0.0038963 38 tags=62%, list=17%, signal=57% CXCL8/IL1RN/OSM/IL1B/IL1A/CCL20/CCL4L2/CCL4/CXCL1/CXCL2/SOCS3/CXCL13/CCL3
GOBP_ANTIMICROBIAL_HUMORAL_IMMUNE_RESPONSE_MEDIATED_BY_ANTIMICROBIAL_PEPTIDE GOBP_ANTIMICROBIAL_HUMORAL_IMMUNE_RESPONSE_MEDIATED_BY_ANTIMICROBIAL_PEPTIDE GOBP_ANTIMICROBIAL_HUMORAL_IMMUNE_RESPONSE_MEDIATED_BY_ANTIMICROBIAL_PEPTIDE 15 0.7742921 1.885327 4.45e-05 0.0043321 0.0038963 44 tags=73%, list=20%, signal=63% REG3A/CXCL8/CCL20/CCL4L2/CCL4/CXCL1/CXCL2/CXCL13/CCL3/S100A9/ADM
GOMF_SIGNALING_RECEPTOR_REGULATOR_ACTIVITY GOMF_SIGNALING_RECEPTOR_REGULATOR_ACTIVITY GOMF_SIGNALING_RECEPTOR_REGULATOR_ACTIVITY 27 0.6991968 1.868216 3.23e-05 0.0043321 0.0038963 44 tags=56%, list=20%, signal=51% REG3A/CXCL8/IL1RN/OSM/IL1B/IL1A/CCL20/CCL4L2/CCL4/CXCL1/CXCL2/NAMPT/CXCL13/CCL3/ADM
GOBP_INFLAMMATORY_RESPONSE GOBP_INFLAMMATORY_RESPONSE GOBP_INFLAMMATORY_RESPONSE 69 0.6375055 1.863721 6.00e-07 0.0001814 0.0001632 48 tags=43%, list=22%, signal=49% REG3A/CXCL8/IL1RN/OSM/TNFAIP6/IL1B/FPR2/PTGS2/IL1A/PROK2/CCL20/CCL4L2/CCL4/CXCL1/S100A8/CXCR2/CXCL2/NAMPT/GZMB/SOCS3/OLR1/CXCL13/LCN2/CCL3/S100A9/NKG7/ADM/IL1R2/FPR1/IER3
GOBP_CELL_CHEMOTAXIS GOBP_CELL_CHEMOTAXIS GOBP_CELL_CHEMOTAXIS 35 0.6675561 1.856027 4.80e-05 0.0043321 0.0038963 41 tags=49%, list=18%, signal=47% CXCR1/CXCL8/TNFAIP6/IL1B/FPR2/EGR3/CCL20/TREM1/CCL4L2/CCL4/CXCL1/S100A8/CXCR2/CXCL2/CXCL13/CCL3/S100A9
  • The first few columns of the results table identify the gene set information.
  • The following columns include the associated statistics.
  • The last column will report which genes are part of the ‘core enrichment’. These are the genes associated with the pathway which contributed to the observed enrichment score (i.e., in the extremes of the ranking).
  1. Run GSEA on the cell type that you calculated DE results for in the first exercise. What is the top pathway that appears?

GSEA visualization

Let’s explore the GSEA plot of enrichment of one of the pathways in the ranked list using a built-in function from clusterProfiler. We can pick the top term GOMF_CYTOKINE_ACTIVITY:

# Plot the GSEA plot for a single enriched GO term
gseaplot(msig_GSEA, geneSetID = 'GOMF_CYTOKINE_ACTIVITY')

In this plot, the lines in plot represent the genes in the gene set GOMF_CYTOKINE_ACTIVITY and where they occur among the log2-fold changes. The largest positive log2-fold changes are on the left-hand side of the plot, while the largest negative log2-fold changes are on the right. The top plot shows the magnitude of the log2-fold changes for each gene, while the bottom plot shows the running sum, with the enrichment score peaking at the red dotted line (which is among the positive log2-fold changes). This suggests the up-regulation of this function.


Next Lesson >>

Back to Schedule

Reuse

CC-BY-4.0