Summary of DGE workflow

Approximate time: 15 minutes

Learning Objectives

Identify the R commands needed to run a complete differential expression analysis using DESeq2

Summary of differential expression analysis workflow

We have detailed the various steps in a differential expression analysis workflow, providing theory with example code. To provide a more succinct reference for the code needed to run a DGE analysis, we have summarized the steps in an analysis below:

Obtaining gene-level counts from Salmon using tximport

 # Run tximport
 txi <- tximport(files, 
         type="salmon", 
         tx2gene=t2g, 
         countsFromAbundance = "lengthScaledTPM")
	
 # "files" is a vector wherein each element is the path to the salmon quant.sf file, and each element is named with the name of the sample.
 # "t2g" is a 2 column data frame which contains transcript IDs mapped to geneIDs (in that order)

Creating the dds object:

 # Check that the row names of the metadata equal the column names of the **raw counts** data
 all(colnames(txi$counts) == rownames(metadata))
	
 # Create DESeq2Dataset object
 dds <- DESeqDataSetFromTximport(txi, 
                 colData = metadata, 
                 design = ~ condition)

Exploratory data analysis (PCA & hierarchical clustering) - identifying outliers and sources of variation in the data:

 # Transform counts for data visualization
 rld <- rlog(dds, 
         blind=TRUE)
	
 # Plot PCA 
 plotPCA(rld, 
     intgroup="condition")
	
 # Extract the rlog matrix from the object and compute pairwise correlation values
 rld_mat <- assay(rld)
 rld_cor <- cor(rld_mat)
	
 # Plot heatmap
 pheatmap(rld_cor, 
      annotation = metadata)

Run DESeq2:

     # **Optional step** - Re-create DESeq2 dataset if the design formula has changed after QC analysis in include other sources of variation using "dds <- DESeqDataSetFromTximport(txi, colData = metadata, design = ~ covariate + condition)"

 # Run DESeq2 differential expression analysis
 dds <- DESeq(dds)

     # **Optional step** - Output normalized counts to save as a file to access outside RStudio using "normalized_counts <- counts(dds, normalized=TRUE)"

Check the fit of the dispersion estimates:

 # Plot dispersion estimates
 plotDispEsts(dds)

Create contrasts to perform Wald testing on the shrunken log2 foldchanges between specific conditions:

 # Specify contrast for comparison of interest
 contrast <- c("condition", "level_to_compare", "base_level")
	
 # Output results of Wald test for contrast
 res <- results(dds, 
            contrast = contrast, 
            alpha = 0.05)
	
 # Shrink the log2 fold changes to be more accurate
 res <- lfcShrink(dds, 
          coef = "sampletype_group1_vs_group2", 
          type = "apeglm")	 
      # The coef will be dependent on what your contrast was. and should be identical to what is stored in resultsNames()

Output significant results:

 # Set thresholds
 padj.cutoff < - 0.05
	
 # Turn the results object into a tibble for use with tidyverse functions
 res_tbl <- res %>%
               data.frame() %>%
               rownames_to_column(var="gene") %>% 
               as_tibble()
	
 # Subset the significant results
 sig_res <- dplyr::filter(res_tbl, 
           padj < padj.cutoff)

Visualize results: volcano plots, heatmaps, normalized counts plots of top genes, etc.
Perform analysis to extract functional significance of results: GO or KEGG enrichment, GSEA, etc.
Make sure to output the versions of all tools used in the DE analysis:
```
sessionInfo()
```
For better reproducibility, it can help to create RMarkdown reports, which save all code, results, and visualizations as nicely formatted html reports. We have a very basic example of a report linked here. To create these reports we have additional materials available.