Summary of DGE workflow

RNA-seq

Differential Expression

Analysis Workflow

This lesson provides a concise summary of the complete differential gene expression (DGE) analysis workflow using DESeq2. Participants will review the essential R code steps, from importing quantification data and creating the DESeq2 dataset object, through exploratory data analysis and quality control, to running differential expression testing, visualizing results and extracting significant genes for downstream functional analysis. The lesson acts as a reference and checklist for performing robust and reproducible RNA-seq DGE analysis.

Authors

Meeta Mistry

Radhika Khetani

Mary Piper

Jihe Liu

Will Gammerdinger

Published

June 8, 2017

Keywords

DESeq2, Gene expression analysis, Functional enrichment, Reproducibility

Approximate time: 15 minutes

Learning Objectives

In this lesson, you will:

Identify the R commands needed to run a complete differential expression analysis using DESeq2

Summary of differential expression analysis workflow

We have detailed the various steps in a differential expression analysis workflow, providing theory with example code. To provide a more succinct reference for the code needed to run a DGE analysis, we have summarized the steps in an analysis below:

Obtaining gene-level counts from Salmon using tximport

# Run tximport
txi <- tximport(files, 
                type = "salmon", 
                tx2gene = t2g, 
                countsFromAbundance = "lengthScaledTPM")

# "files" is a vector wherein each element is the path to the salmon quant.sf file, and each element is named with the name of the sample.
# "t2g" is a 2 column data frame which contains transcript IDs mapped to geneIDs (in that order)

Creating the dds object:

# Check that the row names of the metadata equal the column names of the **raw counts** data
all(colnames(txi$counts) == rownames(metadata))

# Create DESeq2Dataset object
dds <- DESeqDataSetFromTximport(txi, 
                                colData = metadata, 
                                design = ~ condition)

Exploratory data analysis (PCA & hierarchical clustering) - identifying outliers and sources of variation in the data:

# Transform counts for data visualization
rld <- rlog(dds, 
            blind = TRUE)

# Plot PCA 
plotPCA(rld, 
        intgroup = "condition")

# Extract the rlog matrix from the object and compute pairwise correlation values
rld_mat <- assay(rld)
rld_cor <- cor(rld_mat)

# Plot heatmap
pheatmap(rld_cor, 
         annotation = metadata)

Run DESeq2:

# **Optional step** - Re-create DESeq2 dataset if the design formula has changed after QC analysis in include other sources of variation
# dds <- DESeqDataSetFromTximport(txi, colData = metadata, design = ~ covariate + condition)

# Run DESeq2 differential expression analysis
dds <- DESeq(dds)

# **Optional step** - Output normalized counts to save as a file to access outside RStudio
# normalized_counts <- counts(dds, normalized = TRUE)

Check the fit of the dispersion estimates:

# Plot dispersion estimates
plotDispEsts(dds)

Create contrasts to perform Wald testing on the shrunken log2 foldchanges between specific conditions:

# Specify contrast for comparison of interest
contrast <- c("condition", "level_to_compare", "base_level")

# Output results of Wald test for contrast
res <- results(dds, 
               contrast = contrast, 
               alpha = 0.05)

# Shrink the log2 fold changes to be more accurate
res <- lfcShrink(dds, 
                 coef = "sampletype_group1_vs_group2", 
                 type = "apeglm")    
# The coef will be dependent on what your contrast was. and should be identical to what is stored in resultsNames()

Output significant results:

# Set thresholds
padj.cutoff < - 0.05

# Turn the results object into a tibble for use with tidyverse functions
res_tbl <- res %>%
  data.frame() %>%
  rownames_to_column(var="gene") %>% 
  as_tibble()

# Subset the significant results
sig_res <- dplyr::filter(res_tbl, 
                         padj < padj.cutoff)

Visualize results: volcano plots, heatmaps, normalized counts plots of top genes, etc.
Perform analysis to extract functional significance of results: GO or KEGG enrichment, GSEA, etc.
Make sure to output the versions of all tools used in the DE analysis:

# Print sessionInfo()
sessionInfo()

For better reproducibility, it can help to create RMarkdown reports, which save all code, results, and visualizations as nicely formatted html reports. To create these reports we have additional materials available.

Reuse

CC-BY-4.0

--- title: "Summary of DGE workflow" description: | This lesson provides a concise summary of the complete differential gene expression (DGE) analysis workflow using DESeq2. Participants will review the essential R code steps, from importing quantification data and creating the DESeq2 dataset object, through exploratory data analysis and quality control, to running differential expression testing, visualizing results and extracting significant genes for downstream functional analysis. The lesson acts as a reference and checklist for performing robust and reproducible RNA-seq DGE analysis. author: - Meeta Mistry - Radhika Khetani - Mary Piper - Jihe Liu - Will Gammerdinger date: "2017-06-08" categories: - RNA-seq - Differential Expression - Analysis Workflow keywords: - DESeq2 - Gene expression analysis - Functional enrichment - Reproducibility license: "CC-BY-4.0" editor_options: markdown: wrap: 72 --- Approximate time: 15 minutes ## Learning Objectives In this lesson, you will: - Identify the R commands needed to run a complete differential expression analysis using DESeq2 ## Summary of differential expression analysis workflow We have detailed the various steps in a differential expression analysis workflow, providing theory with example code. To provide a more succinct reference for the code needed to run a DGE analysis, we have summarized the steps in an analysis below: 1. Obtaining gene-level counts from Salmon using tximport ```{r} #| label: run_tximport #| eval: false # Run tximport txi <- tximport(files, type = "salmon", tx2gene = t2g, countsFromAbundance = "lengthScaledTPM") # "files" is a vector wherein each element is the path to the salmon quant.sf file, and each element is named with the name of the sample. # "t2g" is a 2 column data frame which contains transcript IDs mapped to geneIDs (in that order) ``` 2. Creating the dds object: ```{r} #| label: create_dds #| eval: false # Check that the row names of the metadata equal the column names of the **raw counts** data all(colnames(txi$counts) == rownames(metadata)) # Create DESeq2Dataset object dds <- DESeqDataSetFromTximport(txi, colData = metadata, design = ~ condition) ``` 3. Exploratory data analysis (PCA & hierarchical clustering) - identifying outliers and sources of variation in the data: ```{r} #| label: exploration #| eval: false # Transform counts for data visualization rld <- rlog(dds, blind = TRUE) # Plot PCA plotPCA(rld, intgroup = "condition") # Extract the rlog matrix from the object and compute pairwise correlation values rld_mat <- assay(rld) rld_cor <- cor(rld_mat) # Plot heatmap pheatmap(rld_cor, annotation = metadata) ``` 4. Run DESeq2: ```{r} #| label: deseq2 #| eval: false # **Optional step** - Re-create DESeq2 dataset if the design formula has changed after QC analysis in include other sources of variation # dds <- DESeqDataSetFromTximport(txi, colData = metadata, design = ~ covariate + condition) # Run DESeq2 differential expression analysis dds <- DESeq(dds) # **Optional step** - Output normalized counts to save as a file to access outside RStudio # normalized_counts <- counts(dds, normalized = TRUE) ``` 5. Check the fit of the dispersion estimates: ```{r} #| label: dispersion #| eval: false # Plot dispersion estimates plotDispEsts(dds) ``` 6. Create contrasts to perform Wald testing on the shrunken log2 foldchanges between specific conditions: ```{r} #| label: contrasts #| eval: false # Specify contrast for comparison of interest contrast <- c("condition", "level_to_compare", "base_level") # Output results of Wald test for contrast res <- results(dds, contrast = contrast, alpha = 0.05) # Shrink the log2 fold changes to be more accurate res <- lfcShrink(dds, coef = "sampletype_group1_vs_group2", type = "apeglm") # The coef will be dependent on what your contrast was. and should be identical to what is stored in resultsNames() ``` 7. Output significant results: ```{r} #| label: results #| eval: false # Set thresholds padj.cutoff < - 0.05 # Turn the results object into a tibble for use with tidyverse functions res_tbl <- res %>% data.frame() %>% rownames_to_column(var="gene") %>% as_tibble() # Subset the significant results sig_res <- dplyr::filter(res_tbl, padj < padj.cutoff) ``` 8. Visualize results: volcano plots, heatmaps, normalized counts plots of top genes, etc. 9. Perform analysis to extract functional significance of results: GO or KEGG enrichment, GSEA, etc. 10. Make sure to output the versions of all tools used in the DE analysis: ```{r} #| label: sessionInfo #| eval: false # Print sessionInfo() sessionInfo() ``` For better reproducibility, it can help to create **RMarkdown reports**, which save all code, results, and visualizations as nicely formatted html reports. To create these reports we have [additional materials](https://hbctraining.github.io/Training-modules/Rmarkdown/) available.