Differential Gene Expression

category_1
category_2
category_3
category_4

Write a description of the lesson here.

Authors

author_1

author_2

Published

March 25, 2026

Keywords

keyword_1, keyword_2, keyword_3, keyword_4, keyword_5, keyword_6

Approximate time: XX minutes

Learning objectives

In this lesson, we will:

  • Learning Objective 1
  • Learning Objective 2
  • Learning Objective 3

Overview of lesson

When doing XYZ…

Differential gene expression (DGE)

In our current UMAP, we have merged samples across the different conditions and used integration to align cells of the same celltype across samples. Now, what if we were interested in a particular celltype and understanding how gene expression changes across the different conditions?

From the publication, we know that the macrophage population is of interest.

# DGE and pathway analysis
# Visium HD spatial transcriptomics workshop
# Author: Harvard Chan Bioinformatics Core
# Created: December 2025
library(Seurat)
library(tidyverse)
library(qs)

# seurat_rctd <- qs::qread("intermediate/09_seurat_banksy_2.qs")
# seurat_rctd$cell <- rownames(seurat_rctd@meta.data)
# 
# full_meta <- qs::qread("../scripts/intermediates/07_seurat_rctd_unassigned.qs")
# full_meta <- full_meta@meta.data
# full_meta <- full_meta[c("cell",
#                          "spot_class",
#                          "first_type",
#                          "second_type")]
# 
# meta <- left_join(seurat_rctd@meta.data,
#           full_meta, by = "cell")
# rownames(meta) <- meta$cell
# seurat_rctd@meta.data <- meta
# 
# qs::qsave(seurat_rctd, "intermediate/13_rctd_tmp.qs")
seurat_rctd <- qs::qread("intermediate/13_rctd_tmp.qs")

# Subset to macrophages
seurat_macro <- subset(seurat_rctd,
                       first_type == "Macrophage")
Idents(seurat_macro) <- "orig.ident"

FindMarkers()

The FindMarkers() function in the Seurat package is used to perform differential expression analysis between groups of cells. We provide arguments to specify ident.1 and ident.2 the two groups of cells we are interested in comparing. This function is commonly used at the celltype annotation step, but can also be used to calculate differentially expressed genes between experimental conditions.

We can use this same function to compare two groups of cells which represent different conditions by modifying the ident values provided. The FindMarkers() function has several important arguments we can modify when running it. To view what these parameters are, we can access the help page on this function:

?FindMarkers

Below we have described in more detail what each of these arguments mean :

Parameter Description Default
logfc.threshold The minimum log2 fold change for average expression of a gene in the group relative to the average expression in all other groups combined. 0.25
min.diff.pct The minimum percent difference between the percent of cells expressing the gene in the group and the percent of cells expressing the gene in all other groups combined.
min.pct Only test genes that are detected in a minimum fraction of cells in either of the two populations; speeds up the function by skipping very infrequently expressed genes. 0.1
ident.1 The group of interest; this function evaluates one group at a time, and ident.1 must correspond to a value set in Idents().
ident.2 The comparison group; specifies the group against which ident.1 is compared.
# Calculate DGEs
dge <- FindMarkers(seurat_macro,
                   ident.1 = "P5CRC",
                   ident.2 = "P5NAT")
dge$gene <- rownames(dge)
dge %>% head()

The output from the FindMarkers() function is a matrix containing a ranked list of differentially expressed genes listed by gene ID and associated statistics. We describe some of these columns below:

  • gene: The gene symbol
  • p_val: The p-value not adjusted for multiple test corrections
  • avg_logFC: The average log-fold change between sample groups. Positive values indicate that the gene is more highly expressed ident.1.
  • pct.1: The percentage of cells where the gene is detected in ident.1 (cold7)
  • pct.2: The percentage of cells where the gene is detected on average in ident.2 (TN)
  • p_val_adj: The adjusted p-value, based on Bonferroni correction using all genes in the dataset, used to determine significance
Note

These tests treat each cell as an independent replicate and ignore inherent correlations between cells originating from the same sample. This results in highly inflated p-values for each gene. Studies have been shown to find a large number of false positive associations with these results.

Volcano plot

library(EnhancedVolcano)
# Volcano plot
EnhancedVolcano(dge,
                        lab=dge$gene,
                        x="avg_log2FC",
                        y="p_val_adj",
                        title="FindMarkers Macrophage bins",
                        subtitle="CRC vs NAT")

Top genes

# Filter significant genes
dge_sig <- dge %>% subset(p_val_adj < 0.05)

# Get the gene names and get the first 6 values
# Ignore ribosomal genes
genes <- dge_sig %>%
  filter(!str_detect(gene, "Rpl|Rps")) %>% 
  head(6)
genes <- genes$gene
genes
VlnPlot(seurat_macro, genes,
        group.by = "orig.ident")
FeaturePlot(seurat_macro, genes, reduction = "full.umap.sketch")
SpatialFeaturePlot(seurat_macro, genes[1], 
                   image.alpha = 0.1, 
                   pt.size.factor = 15)
  1. A question to evaluate Learning Objective 1
  2. A followup question to question #1

Functional Analysis

When it comes to functional analysis there are various analyses that can be done:

  • Determine whether there is enrichment of known biological functions, interactions, or pathways
  • Identify genes’ involvement in novel pathways or networks by grouping genes together based on similar trends
  • Use global changes in gene expression by visualizing all genes being significantly up- or down-regulated in the context of external interaction data

Generally, for any differential expression analysis, it is useful to interpret the resulting gene lists using freely available web- and R-based tools. While tools for functional analysis span a wide variety of techniques, they can loosely be categorized into three main types: over-representation analysis, functional class scoring, and pathway topology [1].

Image credit: Khatri et al, PloS Computational Biology

In this lesson, we will walk you through both an over-representation analysis and gene set enrichment analysis (GSEA) using an R Bioconductor package called clusterProfiler.

Over-representation analysis

Over-representation analysis (ORA) is used to determine which a priori defined gene sets are more present (over-represented) in a subset of “interesting” genes than what would be expected by chance (Huang et al, 2009). Most genes in the genome have some pre-existing annotation associated with it, which has been compiled through a combination of manual curation and computational algorithms. There are a number of existing databases which define genes using a controlled vocabulary and then categorize genes into groups (gene sets) based on shared function, involvement in a pathway or presence in a specific cellular location, etc. A very commonly used gene annotataion resource is the Gene Ontology (GO) database, and is what we will use in our workflow.

We then use those categorizations to assess whether or not the enrichment observed amongst our DE gene results when compared to the larger universe of genes is significant.

Hypergeometric test

The statistical test that will determine whether something is actually over-represented is the Hypergeometric test.

Hypergeometric distribution is a probability distribution that describes the probability of some number of genes (k) being associated with “Functional category 1”, for all genes in our gene list (n=1000), compared to the number of genes (K) associated with “Functional category 1” from a population of all of the genes in entire genome (N=13,000) [2].

The calculation of probability of k successes follows the formula:

This test will result in a p-value that will be adjusted to consider multiple testing correction for each category tested.

Running ORA with clusterProfiler

Now that we know more about what ORA is doing, let’s take our significant genes and see if there are any GO terms over-represented that align with what we expect to be happening in macrophage cells with a change of temperature.

We will take the results from the FindMarkers DE and run different functional analysis methods to obtain some biological insight.

The first thing we’ll do is load the required libraries:

# Load libraries
library(tidyverse)
library(clusterProfiler)
library(org.Hs.eg.db)
library(msigdbr)

Next, we will filter genes to remove any genes that have NA values in the padj column. These are genes that were not tested and so we do not want to consider them in our background set of genes. Once filtered, we create vectors containing our gene symbols for the background and query set of genes. We will query the up- and down-regulated gene sets separately, but note that you can also use the entire significant list as a query input.

# Create background dataset for hypergeometric testing using all tested 
# genes for significance in the results
all_genes <- as.character(dge$gene)

# Extract significant results for up-regulated
sigUp <- dplyr::filter(dge_sig, 
                       p_val_adj < 0.05, 
                       avg_log2FC > 0)
sigUp_genes <- as.character(sigUp$gene)
sigUp_genes %>% head()

Finally, we can perform the GO enrichment analysis and save the results:

# Run GO enrichment analysis 
egoUp <- enrichGO(gene = sigUp_genes, 
                  universe = all_genes,
                  keyType = "SYMBOL",
                  OrgDb = org.Hs.eg.db, 
                  ont = "BP", 
                  pAdjustMethod = "BH", 
                  qvalueCutoff = 0.05, 
                  readable = TRUE)
Note

Note 1: The different organisms with annotation databases available to use with for the OrgDb argument can be found here.

Note 2: The keyType argument may be coded as keytype in different versions of clusterProfiler.

Note 3: The ont argument can accept either “BP” (Biological Process), “MF” (Molecular Function), and “CC” (Cellular Component) subontologies, or “ALL” for all three.

Note

Instead of saving just the results summary from the egoUp object, it might also be beneficial to save the object itself. The save() function enables you to save it as a .rda file, e.g. save(egoUp, file="results/egoUp.rda"). The statistics stored in the object can be used for downstream visualization.

Exploring results from over-representation analysis

Let’s take a look at what terms are identified as over-represented in the genes up-regulated in cold conditions.

# Output results from GO analysis to a table
cluster_summaryUp <- data.frame(egoUp)

# View GO upregulated output
View(cluster_summaryUp)

In the first few columns we see the GO identifier and the descriptive term name. In the next two columns that follow, we observe GeneRatio and BgRatio. These values allows us to compare the overlaps to the background.

  • BgRatio: K/N
    • The total number of genes in the GO term gene set (K), divided by the total number of genes in universe (N)
  • GeneRatio: k/n
    • The total number of genes in our sig DE gene set which overlap with the GO term gene set (k), divided by the total number of genes in our sig DE gene set that overlap with the universe gene set (n)
  • Fold Enrichment: (k/n)/(K/N)
    • This represents how many fold enriched is the GeneRatio compared to the BgRatio

Other columns of interest are the p.adjust column (by which results are ordered by default), and the geneID column, which lists the gene symbols of the overlapping genes.

When cold induces a response in vascular smooth muscle cells (VSMCs), the primary transcriptional change observed is an up-regulation of genes related to vasoconstriction. Vasoconstriction is when the muscles around your blood vessels tighten to make the space inside smaller. In our results table we see significant terms such as extracellular matrix organization and cell proliferation, which makes sense because the cold temperatures will lead to a shift towards a more contractile phenotype. We also observe up-regulation of genes involved in cell adhesion and tight junction formation, which are processes related to maintaining vascular integrity.

Using the code above as a template, run the over-representation analysis on the significantly down-regulated genes from the pseudobulk analysis.

  • How many significant terms do you find?
  • What are some of the prominent biological processes that are observed?

Visualizing over-representation analysis results

clusterProfiler has a variety of options for viewing the over-represented GO terms. We will explore the dotplot and the enrichment plot in this lesson.

The dotplot shows statistics associated with a user-selected top number of significant terms. The color of the dots represents the adjusted p-values for these terms and size of the dots corresponds to the total count of sig DE genes annotated with the GO term (count). This plot displays the top 20 GO terms ordered by gene ratio, not adjusted p-value.

# Dotplot 
dotplot(egoUp, showCategory=20)

To save the figure, click on the Export button in the RStudio Plots tab and Save as PDF.... In the pop-up window, change: - Orientation: To Portrait - PDF size to 11 x 8 to give a figure of appropriate size for the text labels

The next plot is the enrichment GO plot, which shows the relationship between the top 50 most significantly enriched GO terms (padj) by grouping similar terms together. Before creating the plot, we will need to obtain the similarity between terms using the pairwise_termsim() function (instructions for emapplot). In the enrichment plot, the color represents the p-values relative to the other displayed terms (brighter red is more significant), and the size of the terms represents the number of genes that are significant from our list.

# Add similarity matrix to the termsim slot of enrichment result
egoUp <- enrichplot::pairwise_termsim(egoUp)

# Enrichmap clusters the 50 most significant (by padj) GO terms to visualize relationships between terms
emapplot(egoUp, showCategory = 50)

To save the figure, click on the Export button in the RStudio Plots tab and Save as PDF.... In the pop-up window, ensure that the Orientation is Portrait and change the PDF size to 11 x 8 to give the figure the appropriate size for the text labels.

Gene Set Enrichment Analysis (GSEA)

While over-representation analysis is helpful and commonly used, it does require you to subset your gene list using an arbitrary threshold. There could very well be many genes that very narrowly miss this threshold and are therefore not considered in the functional analysis. To get around this there are there are functional class scoring (FCS) methods that can be helpful. For these methods the hypothesis is that although large changes in individual genes can have significant effects on pathways (and will be detected via ORA methods), weaker but coordinated changes in sets of functionally related genes (i.e., pathways) can also have significant effects. Thus, rather than setting a threshold to identify ‘significant genes’, all genes are considered in the analysis. The gene-level statistics from the dataset are aggregated to generate a single pathway-level statistic and statistical significance of each pathway is reported. This type of analysis can be particularly helpful if the differential expression analysis only outputs a small list of significant DE genes.

A commonly used example of an FCS method is GSEA (Subramanium A. et al, 2005). Gene set enrichment analysis utilizes the gene-level statistics or log2-fold changes for all genes to look to see whether gene sets for particular biological pathways (e.g., derived from KEGG pathways, Gene Ontology terms, MSigDB, etc) are enriched among the large positive or negative fold changes.

Image source: (Subramanium A. et al, 2005)

This image describes the theory of GSEA, with the ‘gene set S’ showing the metric used (in our case, ranked log2-fold changes) to determine enrichment of genes in the gene set. There are four main steps that are being performed:

  1. Rank genes:
    • Genes in a data set are ranked based on the given statistic, which in our case is the shrunken log2-fold changes.
  2. Calculate enrichment scores for each gene set
    • This score reflects how often genes in the set appear at the top or bottom of the ranked list.
    • The score is calculated by walking down the list of log2-fold changes and increasing the running-sum statistic every time a gene in the gene set is encountered and decreasing it when genes are not part of the gene set.
    • Increase/decrease is determined by magnitude of fold change.
  3. Estimate statistical significance
    • A permutation test is used to calculate a null distribution for the enrichment score. This produces a p-value that represents the probability of observing a given enrichment score.
  4. Adjust for multiple hypothesis testing
    • Enrichment scores are normalized for the size of each gene set and a false discovery rate is calculated to prevent false positives.

Running GSEA with MSigDB gene sets

The clusterProfiler package offers several functions to perform GSEA using different genes sets, including but not limited to GO, KEGG, and MSigDb. We will use the MSigDb gene sets in our example below. The Molecular Signatures Database (also known as MSigDB) is a collection of annotated gene sets. It contains 8 major collections for mouse, and for our analysis we will use C5, which contains the Gene Ontology gene sets. We can see how this aligns with our ORA result.

To run GSEA with the MSigDb gene sets, we will use the msigdbr R package which provides the MSigDB gene sets in tidy data format that can be used directly with clusterProfiler. The msigdbr package supports several species:

# Show MSigDB species
msigdbr_species()

And you can see what gene sets are available:

# Show MSigDB collections
msigdbr_collections()

For our analysis, we will select the mouse C5 collection, which corresponds to GO gene sets. From the table, we only need two columns, Gene set name and the Gene symbol:

# Use a specific collection; C5 GO signatures
m_t2g <- msigdbr(species = "Homo sapiens", category = "C5") %>%
  dplyr::select(gs_name, gene_symbol)

Now that we have our gene sets, we need to prepare the fold changes. GSEA will use the log2-fold changes obtained from the differential expression analysis for every gene to perform the analysis. We need to create an ordered and named vector for input to clusterProfiler:

# Extract the foldchanges
foldchanges <- dge_sig$avg_log2FC

# Name each fold change with the corresponding gene symbol
names(foldchanges) <- dge_sig$gene
# Sort fold changes in decreasing order
foldchanges <- sort(foldchanges, decreasing = TRUE)

foldchanges %>% head()

Now we are ready to run GSEA!

# Run GSEA
msig_GSEA <- GSEA(foldchanges, TERM2GENE = m_t2g, verbose = FALSE)

# Extract the GSEA results
msigGSEA_results <- msig_GSEA@result

# Write results to file
# write.csv(msigGSEA_results, "results/gsea_msigdb_GO_genesets.csv", quote=F)
Note

The permutations are performed using random reordering, so every time we run the function we will get slightly different results. If we would like to use the same permutations every time we run a function, then we use the set.seed() function prior to running. The input to set.seed() can be any number.

Take a look at the results table and reorder by NES (normalized enrichment score). What terms do you see positively enriched? Does this overlap with what we observed from ORA analysis?

# Look at results ordered by NES
msigGSEA_results %>% arrange(-NES) %>% View()
  • The first few columns of the results table identify the gene set information.
  • The following columns include the associated statistics.
  • The last column will report which genes are part of the ‘core enrichment’. These are the genes associated with the pathway which contributed to the observed enrichment score (i.e., in the extremes of the ranking).

GSEA visualization

Let’s explore the GSEA plot of enrichment of one of the pathways in the ranked list using a built-in function from clusterProfiler. We can pick the top term GOMF_EXTRACELLULAR_MATRIX_STRUCTURAL_CONSTITUENT:

# Plot the GSEA plot for a single enriched GO term
gseaplot(msig_GSEA, geneSetID = 'GOBP_RESPONSE_TO_BACTERIUM')

In this plot, the lines in plot represent the genes in the gene set GOBP_RESPONSE_TO_BACTERIUM and where they occur among the log2-fold changes. The largest positive log2-fold changes are on the left-hand side of the plot, while the largest negative log2-fold changes are on the right. The top plot shows the magnitude of the log2-fold changes for each gene, while the bottom plot shows the running sum, with the enrichment score peaking at the red dotted line (which is among the positive log2-fold changes). This suggests the up-regulation of this function.


Next Lesson >>

Back to Schedule

Reuse

CC-BY-4.0