Using DESeq2 for gene-level differential expression analysis
-
The metadata below describes an RNA-seq analysis experiment, in which the metadata table below and associated count matrix have been loaded into R as
metaandcounts, respectively. Additionally, all of the appropriate libraries have been loaded for you. Use the information in the table to answer the following questions.meta
Genotype Celltype Batch sample1 Wt typeA second sample2 Wt typeA second sample3 Wt typeA first sample4 KO typeA first sample5 KO typeA first sample6 KO typeA second sample7 Wt typeB second sample8 Wt typeB first sample9 Wt typeB second sample10 KO typeB first sample11 KO typeB first sample12 KO typeB second NOTE: This is an exercise in thinking about running DESeq2. You do not need to run any code in R/RStudio. Refer to the materials/lessons from class to answer the following questions.
a. Reorder the columns of the
countsdataset such thatrownames(meta) == colnames(counts).b. Provide the line of code used to create a DESeqDataSet object called
ddsin whichGenotypeis the factor of interest andCelltypeandBatchare other contributing sources of variation in your data.c. Provide the line of code required to run DESeq2 on
dds.d. Provide the line of code to create a dispersion plot.
e. Provide the line of code to return the results of a Wald test comparison for
CelltypecategoriestypeAversustypeB(i.e the fold changes reported should reflect gene expression changes relative totypeB).f. Provide the line of code to return the results with log2 fold change shrinkage performed.
g. Provide the line of code to write the results of the Wald test with shrunken log2 fold changes to file.
h. Provide the line of code to subset the results to return those genes with adjusted p-value < 0.05 and logFold2Change > 1.
Working with the DESeq2 results table
-
Using the de_script.R that we created in class for the differential expression analysis, change the thresholds for adjusted p-value and log2 fold change to the following values:
padj.cutoff <- 0.01 lfc.cutoff <- 1.5Using these new cutoffs, perform the following steps:
a. Subset
res_tableOEto only return those rows that meet the criteria we specified above (adjusted p-values < 0.01 and log fold changes >1.5). Save the subsetted table to a data frame calledsig_table_hw_oe. Write the code below:b. There is a a DESeq2 function that summarizes how many genes are up- and down-regulated using our criteria for
alpha=0.01. Use this on thesig_table_hw_oe. Write the code you would use, and also, list how many genes are up- and down- regulated.c. Get the gene names from
sig_table_hw_oeand save them to a vector calledsigOE_hw. Write the code below:d. Write the
sigOE_hwvector of gene names to a file calledsigOE_hw.txtusing thewrite()function. Ensure the genes are listed in a single column. Write the code below.
Visualizing Results
-
For the genes that are differentially expressed in the knockdown versus control comparison (
res_tableKD), plot an expression heatmap using normalized counts andpheatmap()following the instructions below. Write the code you would use to create the heatmap.a. The heatmap should only include control and knockdown samples.
b. Set up a heat.colors vector using a palette of your choice from brewer.pal (make sure it is different from the one used in class).
c. Plot the heatmap without clustering the columns.
d. Scale expression values by row.
Use significant gene lists to find overlaps between the two comparisons
-
Using the original cutoff values, perform the following steps:
padj.cutoff < 0.05 lfc.cutoff > 0.58a. Create separate vectors with gene names for up-regulated genes and down-regulated genes from
res_tableOEand save asup_OEanddown_OE, respectively. Write the code below:b. Create separate vectors with gene names for up-regulated genes and down-regulated genes from
res_tableKDand save asup_KDanddown_KD, respectively. Write the code below:c. Test for overlaps between the lists:
-
How many, and which genes in
up_OEare also indown_KD? -
How many, and which genes in
up_KDare also indown_OE?
-
Using Salmon abundance estimates with DESeq2
-
We have materials for using counts from quasialignment tools, such as Salmon, to perform the DE analysis. The ‘tximport’ package in R is needed to summarize the transcript-level estimates generated from Salmon into pseudocounts.
Open up RStudio and create a project for using Salmon estimates with DESeq2 (i.e. ~/Desktop/salmon should be your working directory).
a. Follow the markdown for the Salmon section to create the DESeqDataSet object.
b. Run DESeq2 on the DESeqDataSet object (
dds) you created.c. Use the
results()function to extract a results table for the OE vs Ctl comparison. Be sure to provide the appropriate contrasts and to perform shrinkage.d. Use the
results()function to extract a results table for the KD vs Ctl comparison. Be sure to provide the appropriate contrasts and to perform shrinkage.e. Use the
summary()function usingalpha=0.05on the OE results table.-
Report the number of genes that are up and down-regulated:
-
Report how many genes are removed due to independent filtering (HINT: these are genes that are filtered due to low mean count):
-
Report how many genes were removed due to an extreme outlier count (HINT: these ‘outliers’ are identified based on Cook’s distance):
f. Use the
summary()function usingalpha=0.05on the KD results table.-
Report the number of genes that are up and down-regulated:
-
Report how many genes are removed due to independent filtering (HINT: these are genes that are filtered due to low mean count)
-
Report how many genes were removed due to an extreme outlier count (HINT: these ‘outliers’ are identified based on Cook’s distance)
g. Create a vector called
criteria_oe, which contains the indexes for which rows in the OE results table havepadj < 0.05ANDabs(logFC) > 0.58h. Subset the OE results into a new table using the
criteria_oevector:i. Create a vector called
criteria_kd, which contains the indexes for which rows in the KD results table havepadj < 0.05ANDabs(logFC) > 0.58j. Subset the KD results into a new table using the
criteria_kd vector:k. Use the following linked datasets as the significant genes using a different workflow (STAR alignment + featureCounts + DESeq2), but same samples. Download them via these links:
Load these gene lists in as
star_KDandstar_OE, using thescan()function in R (write your code below):l. Find overlapping OE genes between those that you have identified in your subsetted results table (using the Salmon abundance estimates) to the
star_OEgene list.- How many genes overlap?
m. Find overlapping KD genes between those that you have identified in your subsetted results table (using the Salmon abundance estimates) to the
star_KDgene list.- How many genes overlap?
-