# DGE and pathway analysis
# Visium HD spatial transcriptomics workshop
# Author: Harvard Chan Bioinformatics Core
# Created: December 2025
# Load libraries
library(Seurat)
library(tidyverse)
library(EnhancedVolcano)
# ORA libraries
library(tidyverse)
library(clusterProfiler)
library(org.Hs.eg.db)
library(msigdbr)
set.seed(12345)
# Load dataset
seurat_rctd <- readRDS("data/intermediate_seurat/seurat_RCTD.rds")Differential Gene Expression
Perform differential expression analysis between celltypes of matched cancer and normal. Feed the significant genes into pathway and gene set enrichment analyses to observe trends.
Wilcoxon test, FindMarkers, Functional analysis, Overrepresentation analysis, GSEA
Approximate time: XX minutes
Learning objectives
In this lesson, we will:
- Run differential gene expression analysis using a Wilcox test
- Calculate over-representated pathways
- Identify gene set enrichment scores for pathways
Overview of lesson
By now, we have meaningful annotations for msot of our bins for both samples. So now we can start to ask the question of how these gene expression changes between sample conditions (NAT and CRC). Running a differential gene expression analysis is a great way to evaluate the significance of change between populations. However, looking at a table of genes can be overwhelming! So we can take the output from the DGE analysis to identify pathways that are enriched in one condition compared to another. This provides a mechanistic insight in the biological processes that differ between samples.
Differential gene expression (DGE)
What if we were interested in a particular celltype and understanding how gene expression changes across the different conditions?
For instance, maybe we are interested in seeing how the Myeloid population changes in gene expression between the CRC and NAT samples. To start identifying these changes, we are going to first subset the dataset to Myeloid bins based upon the annotation generated by RCTD.
# Subset to Myeloid cells
seurat_macro <- subset(seurat_rctd,
first_type == "Myeloid")
Idents(seurat_macro) <- "orig.ident"FindMarkers()
The FindMarkers() function in the Seurat package is used to perform differential expression analysis between groups of cells. We provide arguments to specify ident.1 and ident.2 the two groups of cells we are interested in comparing. This function is commonly used at the celltype annotation step, but can also be used to calculate differentially expressed genes between experimental conditions.
FindMarkers() function calculates differential genes.Image source: Introduction to scRNA-seq
We can use this same function to compare two groups of cells which represent different conditions by modifying the ident values provided. The FindMarkers() function has several important arguments we can modify when running it. To view what these parameters are, we can access the help page on this function:
?FindMarkersBelow we have described in more detail what each of these arguments mean :
| Parameter | Description | Default |
|---|---|---|
logfc.threshold |
The minimum log2 fold change for average expression of a gene in the group relative to the average expression in all other groups combined. | 0.25 |
min.diff.pct |
The minimum percent difference between the percent of cells expressing the gene in the group and the percent of cells expressing the gene in all other groups combined. | — |
min.pct |
Only test genes that are detected in a minimum fraction of cells in either of the two populations; speeds up the function by skipping very infrequently expressed genes. | 0.1 |
ident.1 |
The group of interest; this function evaluates one group at a time, and ident.1 must correspond to a value set in Idents(). |
— |
ident.2 |
The comparison group; specifies the group against which ident.1 is compared. |
— |
# Calculate DGEs
dge <- FindMarkers(seurat_macro,
ident.1 = "P5CRC",
ident.2 = "P5NAT")
dge$gene <- rownames(dge)
dge %>% head()| p_val | avg_log2FC | pct.1 | pct.2 | p_val_adj | gene | |
|---|---|---|---|---|---|---|
| SRGN | 0 | 1.6450858 | 0.664 | 0.398 | 0 | SRGN |
| SOD2 | 0 | 3.3359809 | 0.266 | 0.064 | 0 | SOD2 |
| CXCL8 | 0 | 8.6759345 | 0.167 | 0.007 | 0 | CXCL8 |
| S100A9 | 0 | 3.3131400 | 0.209 | 0.028 | 0 | S100A9 |
| B2M | 0 | 0.7734271 | 0.711 | 0.494 | 0 | B2M |
| S100A8 | 0 | 4.4094812 | 0.152 | 0.009 | 0 | S100A8 |
The output from the FindMarkers() function is a dataframe containing a ranked list of differentially expressed genes listed by gene ID and associated statistics. We describe some of these columns below:
| Column | Description |
|---|---|
gene |
The gene symbol. |
p_val |
The p-value not adjusted for multiple test corrections. |
avg_logFC |
The average log-fold change between sample groups. Positive values indicate that the gene is more highly expressed in ident.1. |
pct.1 |
The percentage of cells where the gene is detected in ident.1 (P5CRC). |
pct.2 |
The percentage of cells where the gene is detected on average in ident.2 (P5NAT). |
p_val_adj |
The adjusted p-value, based on Bonferroni correction using all genes in the dataset, used to determine significance. |
These tests treat each cell as an independent replicate and ignore inherent correlations between cells originating from the same sample. This results in highly inflated p-values for each gene. Studies have been shown to find a large number of false positive associations with these results.
Visualizing results
Now that we have our DGE results, we can visualize the genes in a variety of different ways to better understand what is happening with our data.
Volcano plot
To get a first look at the genes that are retained, we can generated a volcano plot using the EnhancedVolcano() function. This is a visualization that allows us to quickly see trends in the significant genes. The x-axis here represents the average log2-fold change value, showing the degree of difference between the two conditions. On the y-axis, we see our p_val_adj column represented after a negative log10 transformation is applied to better see the spread of the adjusted p-values.
Volcano plots provide a great overview of which genes are up-regulated (positive on the x-axis) or down-regulated (negative on the x-axis). These results are all relative to ident.1 (P5CRC).
# Volcano plot
EnhancedVolcano(dge,
lab = dge$gene,
x = "avg_log2FC",
y = "p_val_adj",
title = "FindMarkers Myeloid Results",
subtitle = "CRC vs NAT")Violin plots
While looking at the overall trends in the data is a great starting point, we can also start looking at genes that have large differences between CRC and NAT. To do this, we can take a look at the top 6 genes:
# Filter significant genes
dge_sig <- dge %>% subset(p_val_adj < 0.05)
# Get the gene names and get the first 6 values
genes <- dge_sig %>%
pull(gene) %>%
head(6)
genes[1] "SRGN" "SOD2" "CXCL8" "S100A9" "B2M" "S100A8"
And then plot them using VlnPlot() function:
# Violin plot expression of genes per condition
VlnPlot(seurat_macro,
genes,
group.by = "orig.ident")UMAP plot
we can begin to look at the distribution of gene expression for each of the top 6 genes with a better understanding of where the cells for each condition exists.
# UMAP of gene expression
FeaturePlot(seurat_macro,
genes,
reduction = "full.umap.sketch",
ncol = 3)Spatial plot
As we also have the spatial conditions for these bins, we can plot the expression o f
# Spatial plot of gene expression
SpatialFeaturePlot(seurat_macro,
genes[1],
image.alpha = 0.1,
pt.size.factor = 18)- Pick a different cell type and run
FindMarkers(). Name the population you chose and print the top 6 genes in the results.
Functional Analysis
When it comes to functional analysis there are various analyses that can be done:
- Determine whether there is enrichment of known biological functions, interactions, or pathways
- Identify genes’ involvement in novel pathways or networks by grouping genes together based on similar trends
- Use global changes in gene expression by visualizing all genes being significantly up- or down-regulated in the context of external interaction data
Generally, for any differential expression analysis, it is useful to interpret the resulting gene lists using freely available web- and R-based tools. While tools for functional analysis span a wide variety of techniques, they can loosely be categorized into three main types: over-representation analysis, functional class scoring, and pathway topology [1].
Image source: Khatri et al. (2012)
In this lesson, we will walk you through both an over-representation analysis and gene set enrichment analysis (GSEA) using an R Bioconductor package called clusterProfiler.
Over-representation analysis
Over-representation analysis (ORA) is used to determine which a priori defined gene sets are more present (over-represented) in a subset of “interesting” genes than what would be expected by chance (Huang et al, 2009). Most genes in the genome have some pre-existing annotation associated with it, which has been compiled through a combination of manual curation and computational algorithms. There are a number of existing databases which define genes using a controlled vocabulary and then categorize genes into groups (gene sets) based on shared function, involvement in a pathway or presence in a specific cellular location, etc. A very commonly used gene annotataion resource is the Gene Ontology (GO) database, and is what we will use in our workflow.
We then use those categorizations to assess whether or not the enrichment observed amongst our DE gene results when compared to the larger universe of genes is significant.
Hypergeometric test
The statistical test that will determine whether something is actually over-represented is the Hypergeometric test.
Hypergeometric distribution is a probability distribution that describes the probability of some number of genes (k) being associated with “Functional category 1”, for all genes in our gene list (n=1000), compared to the number of genes (K) associated with “Functional category 1” from a population of all of the genes in entire genome (N=13,000) [2].
The calculation of probability of k successes follows the formula:
\[ p = 1 - \sum_{i=0}^{k-1} \frac{\binom{N}{k} \binom{N-K}{n-k}}{\binom{N}{n}} \]
This test will result in a p-value that will be adjusted to consider multiple testing correction for each category tested.
Running ORA with clusterProfiler
Now that we know more about what ORA is doing, let’s take our significant genes and see if there are any GO terms over-represented that align with what we expect to be happening in macrophage cells with a change of temperature.
We will take the results from the FindMarkers DE and run different functional analysis methods to obtain some biological insight. We create vectors containing our gene symbols for the background and query set of genes. We will query the up- and down-regulated gene sets separately, but note that you can also use the entire significant list as a query input.
# Create background dataset for hypergeometric testing using all tested
# genes for significance in the results
all_genes <- as.character(dge$gene)
# Extract significant results for up-regulated
sigUp <- dplyr::filter(dge_sig,
p_val_adj < 0.05,
avg_log2FC > 0)
sigUp_genes <- as.character(sigUp$gene)
sigUp_genes %>% head()[1] "SRGN" "SOD2" "CXCL8" "S100A9" "B2M" "S100A8"
Finally, we can perform the GO enrichment analysis with enrichGO() and save the results:
enrichGO() parameters
Note 1: The different organisms with annotation databases available to use with for the OrgDb argument can be found here.
Note 2: The keyType argument may be coded as keytype in different versions of clusterProfiler.
Note 3: The ont argument can accept either “BP” (Biological Process), “MF” (Molecular Function), and “CC” (Cellular Component) subontologies, or “ALL” for all three.
# Run GO enrichment analysis
egoUp <- enrichGO(gene = sigUp_genes,
universe = all_genes,
keyType = "SYMBOL",
OrgDb = org.Hs.eg.db,
ont = "BP",
pAdjustMethod = "BH",
qvalueCutoff = 0.05,
readable = TRUE)Exploring results from over-representation analysis
Let’s take a look at what terms are identified as over-represented in the genes up-regulated.
# Output results from GO analysis to a table
cluster_summaryUp <- data.frame(egoUp)
# View GO upregulated output
View(cluster_summaryUp)| ID | Description | GeneRatio | BgRatio | RichFactor | FoldEnrichment | zScore | pvalue | p.adjust | qvalue | geneID | Count | |
|---|---|---|---|---|---|---|---|---|---|---|---|---|
| GO:0006954 | GO:0006954 | inflammatory response | 57/162 | 461/6577 | 0.1236443 | 5.019804 | 14.22203 | 0 | 0 | 0 | CXCL8/S100A9/S100A8/IL1RN/NAMPT/IL1B/IL1R2/LCN2/PTGS2/SERPINA1/OSM/FPR1/SOCS3/TNFAIP3/PROK2/FPR2/LYZ/CXCL1/NFKBIA/FCER1G/ALOX5AP/C5AR1/SLC11A1/TFRC/MMP9/CCL4/IER3/NKG7/IGHG1/CCL24/TNFAIP6/TYROBP/TNFRSF1B/HSPA8/CCL5/NLRP3/ADM/CCL20/CXCL13/TIMP1/CCL3/ADAM8/GZMB/PLA2G2A/PLA2G7/CST7/IL1A/REG3A/OLR1/FCGR3A/CXCR2/THBS1/CYBA/CD44/AIF1/FCGR2A/CXCL2 | 57 |
| GO:0006935 | GO:0006935 | chemotaxis | 35/162 | 230/6577 | 0.1521739 | 6.178073 | 12.70246 | 0 | 0 | 0 | CXCL8/S100A9/S100A8/IL1B/PLAUR/CD74/FPR1/TREM1/CSF3R/PROK2/FPR2/CXCL1/FCER1G/C5AR1/TYMP/PDE4B/CCL4/SPI1/CXCL16/CCL24/TNFAIP6/CCL5/CCL20/CXCL13/PLAU/CCL3/CALR/ADAM8/PLA2G7/EGR3/CXCR2/THBS1/CXCR1/AIF1/CXCL2 | 35 |
| GO:0042330 | GO:0042330 | taxis | 35/162 | 230/6577 | 0.1521739 | 6.178073 | 12.70246 | 0 | 0 | 0 | CXCL8/S100A9/S100A8/IL1B/PLAUR/CD74/FPR1/TREM1/CSF3R/PROK2/FPR2/CXCL1/FCER1G/C5AR1/TYMP/PDE4B/CCL4/SPI1/CXCL16/CCL24/TNFAIP6/CCL5/CCL20/CXCL13/PLAU/CCL3/CALR/ADAM8/PLA2G7/EGR3/CXCR2/THBS1/CXCR1/AIF1/CXCL2 | 35 |
| GO:0019730 | GO:0019730 | antimicrobial humoral response | 18/162 | 44/6577 | 0.4090909 | 16.608586 | 16.50721 | 0 | 0 | 0 | CXCL8/S100A9/B2M/IGHA1/LYZ/CXCL1/SLC11A1/CCL4/IGHG1/CCL24/PI3/CCL5/ADM/CCL20/CXCL13/CCL3/REG3A/CXCL2 | 18 |
| GO:0060326 | GO:0060326 | cell chemotaxis | 30/162 | 171/6577 | 0.1754386 | 7.122590 | 12.89080 | 0 | 0 | 0 | CXCL8/S100A9/S100A8/IL1B/CD74/TREM1/CSF3R/FPR2/CXCL1/FCER1G/C5AR1/PDE4B/CCL4/SPI1/CXCL16/CCL24/TNFAIP6/CCL5/CCL20/CXCL13/CCL3/CALR/ADAM8/PLA2G7/EGR3/CXCR2/THBS1/CXCR1/AIF1/CXCL2 | 30 |
| GO:0030595 | GO:0030595 | leukocyte chemotaxis | 26/162 | 126/6577 | 0.2063492 | 8.377523 | 13.28686 | 0 | 0 | 0 | CXCL8/S100A9/S100A8/IL1B/CD74/TREM1/CSF3R/FPR2/FCER1G/C5AR1/PDE4B/CCL4/SPI1/CXCL16/CCL24/TNFAIP6/CCL5/CXCL13/CCL3/CALR/ADAM8/PLA2G7/CXCR2/THBS1/CXCR1/AIF1 | 26 |
In the first few columns we see the GO identifier and the descriptive term name. In the next two columns that follow, we observe GeneRatio and BgRatio. These values allows us to compare the overlaps to the background. Here we describe what each of the columns represent in the results:
| Term | Math Expression | Description |
|---|---|---|
| BgRatio | \(\dfrac{K}{N}\) | The total number of genes in the GO term gene set (K), divided by the total number of genes in the universe (N). |
| GeneRatio | \(\dfrac{k}{n}\) | The total number of genes in our significant DE gene set which overlap with the GO term gene set (k), divided by the total number of genes in our significant DE gene set that overlap with the universe gene set (n). |
| Fold Enrichment | \(\dfrac{k/n}{K/N}\) | How many-fold enriched the GeneRatio is compared to the BgRatio. |
| p.adjust | – | Multiple-testing adjusted p-value (results are typically ordered by this column). |
| geneID | – | Gene symbols of the overlapping genes, seperated by “/”. |
In the P5CRC sample, we can see that the top differential pathway is “inflammatory response” among the Myeloid bins. This makes sense as the Myeloid cells (Macrophages, Dendritic cells, etc) modulate the immune response in a system. In the case of cancer, it makes sense that there is an increased inflammatory signal in this population compared to the healthy, adjacent tissue.
- Run ORA analysis on the cell type that you calculated DE results for in the previous exercise. What is the top pathway that appears?
Visualizing over-representation analysis results
clusterProfiler has a variety of options for viewing the over-represented GO terms. We will explore the dotplot and the enrichment plot in this lesson.
The dotplot shows statistics associated with a user-selected top number of significant terms. The color of the dots represents the adjusted p-values for these terms and size of the dots corresponds to the total count of sig DE genes annotated with the GO term (count). This plot displays the top 20 GO terms ordered by gene ratio, not adjusted p-value.
Gene Set Enrichment Analysis (GSEA)
While over-representation analysis is helpful and commonly used, it does require you to subset your gene list using an arbitrary threshold. There could very well be many genes that very narrowly miss this threshold and are therefore not considered in the functional analysis. To get around this there are there are functional class scoring (FCS) methods that can be helpful. For these methods the hypothesis is that although large changes in individual genes can have significant effects on pathways (and will be detected via ORA methods), weaker but coordinated changes in sets of functionally related genes (i.e., pathways) can also have significant effects. Thus, rather than setting a threshold to identify ‘significant genes’, all genes are considered in the analysis. The gene-level statistics from the dataset are aggregated to generate a single pathway-level statistic and statistical significance of each pathway is reported. This type of analysis can be particularly helpful if the differential expression analysis only outputs a small list of significant DE genes.
A commonly used example of an FCS method is GSEA (Subramanium A. et al, 2005). Gene set enrichment analysis utilizes the gene-level statistics or log2-fold changes for all genes to look to see whether gene sets for particular biological pathways (e.g., derived from KEGG pathways, Gene Ontology terms, MSigDB, etc) are enriched among the large positive or negative fold changes.
Image source: Subramanium et al. (2005)
This image describes the theory of GSEA, with the ‘gene set S’ showing the metric used (in our case, ranked log2-fold changes) to determine enrichment of genes in the gene set. There are four main steps that are being performed:
- Rank genes:
- Genes in a data set are ranked based on the given statistic, which in our case is the shrunken log2-fold changes.
- Calculate enrichment scores for each gene set
- This score reflects how often genes in the set appear at the top or bottom of the ranked list.
- The score is calculated by walking down the list of log2-fold changes and increasing the running-sum statistic every time a gene in the gene set is encountered and decreasing it when genes are not part of the gene set.
- Increase/decrease is determined by magnitude of fold change.
- Estimate statistical significance
- A permutation test is used to calculate a null distribution for the enrichment score. This produces a p-value that represents the probability of observing a given enrichment score.
- Adjust for multiple hypothesis testing
- Enrichment scores are normalized for the size of each gene set and a false discovery rate is calculated to prevent false positives.
Running GSEA with MSigDB gene sets
The clusterProfiler package offers several functions to perform GSEA using different genes sets, including but not limited to GO, KEGG, and MSigDb. We will use the MSigDb gene sets in our example below. The Molecular Signatures Database (also known as MSigDB) is a collection of annotated gene sets. It contains 8 major collections.
To run GSEA with the MSigDb gene sets, we will use the msigdbr R package which provides the MSigDB gene sets in tidy data format that can be used directly with clusterProfiler.
msigdbr
The msigdbr package supports several species:
# Show MSigDB species
msigdbr_species()# A tibble: 20 × 2
species_name species_common_name
<chr> <chr>
1 Anolis carolinensis Carolina anole, green anole
2 Bos taurus bovine, cattle, cow, dairy cow, domestic cat…
3 Caenorhabditis elegans <NA>
4 Canis lupus familiaris dog, dogs
5 Danio rerio leopard danio, zebra danio, zebra fish, zebr…
6 Drosophila melanogaster fruit fly
7 Equus caballus domestic horse, equine, horse
8 Felis catus cat, cats, domestic cat
9 Gallus gallus bantam, chicken, chickens, Gallus domesticus
10 Homo sapiens human
11 Macaca mulatta rhesus macaque, rhesus macaques, Rhesus monk…
12 Monodelphis domestica gray short-tailed opossum
13 Mus musculus house mouse, mouse
14 Ornithorhynchus anatinus duck-billed platypus, duckbill platypus, pla…
15 Pan troglodytes chimpanzee
16 Rattus norvegicus brown rat, Norway rat, rat, rats
17 Saccharomyces cerevisiae baker's yeast, brewer's yeast, S. cerevisiae
18 Schizosaccharomyces pombe 972h- <NA>
19 Sus scrofa pig, pigs, swine, wild boar
20 Xenopus tropicalis tropical clawed frog, western clawed frog
And you can see what gene sets are available:
# Show MSigDB collections
msigdbr_collections() db_version gs_collection gs_subcollection
1 2026.1.Hs C1
2 2026.1.Hs C2 CGP
3 2026.1.Hs C2 CP
4 2026.1.Hs C2 CP:BIOCARTA
5 2026.1.Hs C2 CP:KEGG_LEGACY
6 2026.1.Hs C2 CP:KEGG_MEDICUS
7 2026.1.Hs C2 CP:PID
8 2026.1.Hs C2 CP:REACTOME
9 2026.1.Hs C2 CP:WIKIPATHWAYS
10 2026.1.Hs C3 MIR:MIRDB
11 2026.1.Hs C3 MIR:MIR_LEGACY
12 2026.1.Hs C3 TFT:GTRD
13 2026.1.Hs C3 TFT:TFT_LEGACY
14 2026.1.Hs C4 3CA
15 2026.1.Hs C4 CGN
16 2026.1.Hs C4 CM
17 2026.1.Hs C5 GO:BP
18 2026.1.Hs C5 GO:CC
19 2026.1.Hs C5 GO:MF
20 2026.1.Hs C5 HPO
21 2026.1.Hs C6
22 2026.1.Hs C7 IMMUNESIGDB
23 2026.1.Hs C7 VAX
24 2026.1.Hs C8
25 2026.1.Hs C9
26 2026.1.Hs H
gs_collection_name num_genesets
1 Positional 302
2 Chemical and Genetic Perturbations 3555
3 Canonical Pathways 19
4 BioCarta Pathways 292
5 KEGG Legacy Pathways 186
6 KEGG Medicus Pathways 658
7 PID Pathways 196
8 Reactome Pathways 1839
9 WikiPathways 925
10 miRDB 2377
11 MIR_Legacy 221
12 GTRD 506
13 TFT_Legacy 610
14 Curated Cancer Cell Atlas gene sets 148
15 Cancer Gene Neighborhoods 427
16 Cancer Modules 431
17 GO Biological Process 7538
18 GO Cellular Component 1080
19 GO Molecular Function 1872
20 Human Phenotype Ontology 5793
21 Oncogenic Signature 189
22 ImmuneSigDB 4872
23 HIPC Vaccine Response 347
24 Cell Type Signature 866
25 Computational Perturbation Signature 62
26 Hallmark 50
For our analysis, we will use C5, which contains the Gene Ontology gene sets. This will allow us to see how the results align with output from ORA. From the table, we only need two columns: gene set name and gene symbol:
# Use a specific collection; C5 GO signatures
m_t2g <- msigdbr(species = "Homo sapiens", collection = "C5") %>%
dplyr::select(gs_name, gene_symbol)Now that we have our gene sets, we need to prepare the fold changes. GSEA will use the log2-fold changes obtained from the differential expression analysis for every gene to perform the analysis. We need to create an ordered and named vector for input to clusterProfiler:
# Extract the foldchanges
foldchanges <- dge_sig$avg_log2FC
# Name each fold change with the corresponding gene symbol
names(foldchanges) <- dge_sig$gene
# Sort fold changes in decreasing order
foldchanges <- sort(foldchanges, decreasing = TRUE)
foldchanges %>% head() ADGRG3 CXCR1 REG3A CXCL8 IL1RN OSM
11.250232 10.291084 9.827910 8.675935 7.863775 7.448534
Now we are ready to run GSEA!
# Run GSEA
msig_GSEA <- GSEA(foldchanges, TERM2GENE = m_t2g, verbose = FALSE)
# Extract the GSEA results
msigGSEA_results <- msig_GSEA@resultThe permutations are performed using random reordering, so every time we run the function we will get slightly different results. If we would like to use the same permutations every time we run a function, then we use the set.seed() function prior to running. The input to set.seed() can be any number.
This is a good spot to save the results so that we can revisit them at any point later on:
# Write results to file
write.csv(msigGSEA_results, "results/gsea_msigdb_GO_genesets.csv", quote=F)Take a look at the results table and reorder by NES (normalized enrichment score). What terms do you see positively enriched? Does this overlap with what we observed from ORA analysis?
# Look at results ordered by NES
msigGSEA_results %>% arrange(-NES) %>% View()| ID | Description | setSize | enrichmentScore | NES | pvalue | p.adjust | qvalue | rank | leading_edge | core_enrichment | |
|---|---|---|---|---|---|---|---|---|---|---|---|
| GOMF_CYTOKINE_ACTIVITY | GOMF_CYTOKINE_ACTIVITY | GOMF_CYTOKINE_ACTIVITY | 19 | 0.8009752 | 2.031152 | 6.00e-07 | 0.0001814 | 0.0001632 | 38 | tags=68%, list=17%, signal=62% | CXCL8/IL1RN/OSM/IL1B/IL1A/CCL20/CCL4L2/CCL4/CXCL1/CXCL2/NAMPT/CXCL13/CCL3 |
| GOMF_CYTOKINE_RECEPTOR_BINDING | GOMF_CYTOKINE_RECEPTOR_BINDING | GOMF_CYTOKINE_RECEPTOR_BINDING | 21 | 0.7401314 | 1.909748 | 3.76e-05 | 0.0043321 | 0.0038963 | 38 | tags=62%, list=17%, signal=57% | CXCL8/IL1RN/OSM/IL1B/IL1A/CCL20/CCL4L2/CCL4/CXCL1/CXCL2/SOCS3/CXCL13/CCL3 |
| GOBP_ANTIMICROBIAL_HUMORAL_IMMUNE_RESPONSE_MEDIATED_BY_ANTIMICROBIAL_PEPTIDE | GOBP_ANTIMICROBIAL_HUMORAL_IMMUNE_RESPONSE_MEDIATED_BY_ANTIMICROBIAL_PEPTIDE | GOBP_ANTIMICROBIAL_HUMORAL_IMMUNE_RESPONSE_MEDIATED_BY_ANTIMICROBIAL_PEPTIDE | 15 | 0.7742921 | 1.885327 | 4.45e-05 | 0.0043321 | 0.0038963 | 44 | tags=73%, list=20%, signal=63% | REG3A/CXCL8/CCL20/CCL4L2/CCL4/CXCL1/CXCL2/CXCL13/CCL3/S100A9/ADM |
| GOMF_SIGNALING_RECEPTOR_REGULATOR_ACTIVITY | GOMF_SIGNALING_RECEPTOR_REGULATOR_ACTIVITY | GOMF_SIGNALING_RECEPTOR_REGULATOR_ACTIVITY | 27 | 0.6991968 | 1.868216 | 3.23e-05 | 0.0043321 | 0.0038963 | 44 | tags=56%, list=20%, signal=51% | REG3A/CXCL8/IL1RN/OSM/IL1B/IL1A/CCL20/CCL4L2/CCL4/CXCL1/CXCL2/NAMPT/CXCL13/CCL3/ADM |
| GOBP_INFLAMMATORY_RESPONSE | GOBP_INFLAMMATORY_RESPONSE | GOBP_INFLAMMATORY_RESPONSE | 69 | 0.6375055 | 1.863721 | 6.00e-07 | 0.0001814 | 0.0001632 | 48 | tags=43%, list=22%, signal=49% | REG3A/CXCL8/IL1RN/OSM/TNFAIP6/IL1B/FPR2/PTGS2/IL1A/PROK2/CCL20/CCL4L2/CCL4/CXCL1/S100A8/CXCR2/CXCL2/NAMPT/GZMB/SOCS3/OLR1/CXCL13/LCN2/CCL3/S100A9/NKG7/ADM/IL1R2/FPR1/IER3 |
| GOBP_CELL_CHEMOTAXIS | GOBP_CELL_CHEMOTAXIS | GOBP_CELL_CHEMOTAXIS | 35 | 0.6675561 | 1.856027 | 4.80e-05 | 0.0043321 | 0.0038963 | 41 | tags=49%, list=18%, signal=47% | CXCR1/CXCL8/TNFAIP6/IL1B/FPR2/EGR3/CCL20/TREM1/CCL4L2/CCL4/CXCL1/S100A8/CXCR2/CXCL2/CXCL13/CCL3/S100A9 |
- The first few columns of the results table identify the gene set information.
- The following columns include the associated statistics.
- The last column will report which genes are part of the ‘core enrichment’. These are the genes associated with the pathway which contributed to the observed enrichment score (i.e., in the extremes of the ranking).
- Run GSEA on the cell type that you calculated DE results for in the first exercise. What is the top pathway that appears?
GSEA visualization
Let’s explore the GSEA plot of enrichment of one of the pathways in the ranked list using a built-in function from clusterProfiler. We can pick the top term GOMF_CYTOKINE_ACTIVITY:
# Plot the GSEA plot for a single enriched GO term
gseaplot(msig_GSEA, geneSetID = 'GOMF_CYTOKINE_ACTIVITY')In this plot, the lines in plot represent the genes in the gene set GOMF_CYTOKINE_ACTIVITY and where they occur among the log2-fold changes. The largest positive log2-fold changes are on the left-hand side of the plot, while the largest negative log2-fold changes are on the right. The top plot shows the magnitude of the log2-fold changes for each gene, while the bottom plot shows the running sum, with the enrichment score peaking at the red dotted line (which is among the positive log2-fold changes). This suggests the up-regulation of this function.