Summary of DGE workflow

Author

Mary Piper

Published

June 8, 2017

Approximate time: 15 minutes

Learning Objectives

  • Identify the R commands needed to run a complete differential expression analysis using DESeq2

Summary of differential expression analysis workflow

We have detailed the various steps in a differential expression analysis workflow, providing theory with example code. To provide a more succinct reference for the code needed to run a DGE analysis, we have summarized the steps in an analysis below:

  1. Obtaining gene-level counts from Salmon using tximport
# Run tximport
txi <- tximport(files, 
                type="salmon", 
                tx2gene=t2g, 
                countsFromAbundance = "lengthScaledTPM")

# "files" is a vector wherein each element is the path to the salmon quant.sf file, and each element is named with the name of the sample.
# "t2g" is a 2 column data frame which contains transcript IDs mapped to geneIDs (in that order)
  1. Creating the dds object:
# Check that the row names of the metadata equal the column names of the **raw counts** data
all(colnames(txi$counts) == rownames(metadata))

# Create DESeq2Dataset object
dds <- DESeqDataSetFromTximport(txi, 
                                colData = metadata, 
                                design = ~ condition)
  1. Exploratory data analysis (PCA & hierarchical clustering) - identifying outliers and sources of variation in the data:
# Transform counts for data visualization
rld <- rlog(dds, 
            blind=TRUE)

# Plot PCA 
plotPCA(rld, 
        intgroup="condition")

# Extract the rlog matrix from the object and compute pairwise correlation values
rld_mat <- assay(rld)
rld_cor <- cor(rld_mat)

# Plot heatmap
pheatmap(rld_cor, 
         annotation = metadata)
  1. Run DESeq2:
# **Optional step** - Re-create DESeq2 dataset if the design formula has changed after QC analysis in include other sources of variation
# dds <- DESeqDataSetFromTximport(txi, colData = metadata, design = ~ covariate + condition)

# Run DESeq2 differential expression analysis
dds <- DESeq(dds)

# **Optional step** - Output normalized counts to save as a file to access outside RStudio
# normalized_counts <- counts(dds, normalized=TRUE)
  1. Check the fit of the dispersion estimates:
# Plot dispersion estimates
plotDispEsts(dds)
  1. Create contrasts to perform Wald testing on the shrunken log2 foldchanges between specific conditions:
# Specify contrast for comparison of interest
contrast <- c("condition", "level_to_compare", "base_level")

# Output results of Wald test for contrast
res <- results(dds, 
               contrast = contrast, 
               alpha = 0.05)

# Shrink the log2 fold changes to be more accurate
res <- lfcShrink(dds, 
                 coef = "sampletype_group1_vs_group2", 
                 type = "apeglm")    
# The coef will be dependent on what your contrast was. and should be identical to what is stored in resultsNames()
  1. Output significant results:
# Set thresholds
padj.cutoff < - 0.05

# Turn the results object into a tibble for use with tidyverse functions
res_tbl <- res %>%
  data.frame() %>%
  rownames_to_column(var="gene") %>% 
  as_tibble()

# Subset the significant results
sig_res <- dplyr::filter(res_tbl, 
                         padj < padj.cutoff)
  1. Visualize results: volcano plots, heatmaps, normalized counts plots of top genes, etc.

  2. Perform analysis to extract functional significance of results: GO or KEGG enrichment, GSEA, etc.

  3. Make sure to output the versions of all tools used in the DE analysis:

sessionInfo()

For better reproducibility, it can help to create RMarkdown reports, which save all code, results, and visualizations as nicely formatted html reports. We have a very basic example of a report linked here. To create these reports we have additional materials available.