DGE analysis using LRT in DESeq2

Author

Meeta Mistry and Mary Piper

Published

June 14, 2017

Approximate time: 60 minutes

Learning Objectives

Apply the Likelihood Ratio Test (LRT) for hypothesis testing
Compare results generated from the LRT to results obtained using the Wald test
Identify shared expression profiles from the LRT significant gene list

Exploring results from the Likelihood ratio test (LRT)

DESeq2 also offers the Likelihood Ratio Test as an alternative when evaluating expression change across more than two levels. Genes that are identified as significant are those that are changing in expression in any direction across the different factor levels.

Generally, this test identifies a larger number of genes than the individual pairwise comparisons. Although the LRT tests for a difference at any level(s) of the factor, the set of LRT-significant genes should not be expected to equal the union of the gene sets from the Wald tests exactly (although we do expect a high degree of overlap).
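To make the comparison of a full and a reduced model concrete, here is a minimal sketch in base R. It is an illustration under simplifying assumptions, not what DESeq2 runs internally: it fits Poisson GLMs with `glm()` to simulated counts for a single hypothetical gene, whereas DESeq2 fits negative binomial GLMs with estimated size factors and dispersions. The counts, group means, and seed are invented for illustration.

```r
# Simulated counts for one hypothetical gene across three sample groups
set.seed(1)
sampletype <- factor(rep(c("control", "knockdown", "overexpression"), each = 4))
counts <- rpois(12, lambda = c(50, 20, 80)[as.integer(sampletype)])

# Full model includes the factor of interest; reduced model is intercept only
full    <- glm(counts ~ sampletype, family = poisson)
reduced <- glm(counts ~ 1, family = poisson)

# LRT statistic: difference in deviance between reduced and full model
stat <- deviance(reduced) - deviance(full)

# Degrees of freedom: number of coefficients dropped from the full model
df_diff <- df.residual(reduced) - df.residual(full)

# Compare the statistic to a chi-squared distribution (upper tail)
pval <- pchisq(stat, df = df_diff, lower.tail = FALSE)
c(stat = stat, df = df_diff, pvalue = pval)
```

Because the reduced model drops both non-reference `sampletype` coefficients, the test has two degrees of freedom and responds to a difference in any of the three groups, which is why no single pairwise contrast is involved.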

The `results()` table

To extract the results from our dds_lrt object we can use the same results() function we had used with the Wald test. There is no need for contrasts since we are not making a pairwise comparison.

Note

In an earlier lesson on hypothesis testing, we had you create the object dds_lrt. If you are having trouble finding the object, please run the code:

dds_lrt <- DESeq(dds, test="LRT", reduced = ~ 1)

# Extract results for LRT
res_LRT <- results(dds_lrt)

Let’s take a look at the results table:

# View results for LRT
res_LRT

log2 fold change (MLE): sampletype MOV10 overexpression vs control 
LRT p-value: '~ sampletype' vs '~ 1' 
DataFrame with 57761 rows and 6 columns
                  baseMean log2FoldChange     lfcSE      stat      pvalue
                 <numeric>      <numeric> <numeric> <numeric>   <numeric>
ENSG00000000003  3525.8835      -0.438245 0.0774607  40.46117 1.63670e-09
ENSG00000000005    26.2489       0.029208 0.4411295   1.61898 4.45084e-01
ENSG00000000419  1478.2512       0.383635 0.1137609  11.34102 3.44611e-03
ENSG00000000457   518.4220       0.228971 0.1023313  14.63134 6.65035e-04
ENSG00000000460  1159.7761      -0.269138 0.0814993  25.03939 3.65398e-06
...                    ...            ...       ...       ...         ...
ENSG00000285889    1.82171       -4.68144 3.9266061   2.35649 0.307818323
ENSG00000285950    7.58089       -1.01978 1.0715583   1.21446 0.544857226
ENSG00000285976 4676.24904        0.19364 0.0656673  14.87805 0.000587859
ENSG00000285978    2.25697        4.13612 2.0706212   4.68720 0.095981569
ENSG00000285980    0.00000             NA        NA        NA          NA
                       padj
                  <numeric>
ENSG00000000003 3.14071e-08
ENSG00000000005 5.88670e-01
ENSG00000000419 1.22924e-02
ENSG00000000457 3.04551e-03
ENSG00000000460 3.23425e-05
...                     ...
ENSG00000285889          NA
ENSG00000285950          NA
ENSG00000285976  0.00273904
ENSG00000285978          NA
ENSG00000285980          NA

The results table output looks similar to the Wald test results, with identical columns to what we observed previously.

Why are fold changes reported for an LRT test?

For analyses using the likelihood ratio test, the p-values are determined solely by the difference in deviance between the full and reduced model formula. A single log2 fold change is printed in the results table for consistency with other results table outputs, but is not associated with the actual test.

Columns relevant to the LRT test:

baseMean: mean of normalized counts for all samples
stat: the difference in deviance between the reduced model and the full model
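pvalue: the p-value obtained by comparing stat to a chi-squared distribution, with degrees of freedom equal to the difference in the number of parameters between the full and reduced models
padj: Benjamini-Hochberg (BH) adjusted p-values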
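As a quick check of how `stat` maps to `pvalue`, the sketch below recomputes the p-values for the first three genes in the results table using base R's `pchisq()`. The degrees of freedom are an assumption based on this design: the full model `~ sampletype` has three levels (an intercept plus two coefficients) and the reduced model `~ 1` has only the intercept, which gives 2 degrees of freedom.

```r
# LRT statistics copied from the first three rows of the results table
stat <- c(ENSG00000000003 = 40.46117,
          ENSG00000000005 = 1.61898,
          ENSG00000000419 = 11.34102)

# Assumed degrees of freedom: 3 coefficients in the full model - 1 in the reduced
df_diff <- 2

# Upper-tail probability of the chi-squared distribution
pvalue <- pchisq(stat, df = df_diff, lower.tail = FALSE)
signif(pvalue, 6)
```

The recomputed values agree with the `pvalue` column of the table to the printed precision, which confirms that the fold change columns play no part in the test.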

Additional columns:

log2FoldChange: log2 fold change
lfcSE: standard error of the log2 fold change

Note

Printed at the top of the results table are the two sample groups used to generate the log2 fold change values that we observe in the table. This can be controlled using the name argument; the value provided to name must be an element of resultsNames(dds_lrt).

Identifying significant genes

When filtering significant genes from the LRT we threshold only the padj column. How many genes are significant at padj < 0.05?

# Create a tibble for LRT results
res_LRT_tb <- res_LRT %>%
  data.frame() %>%
  rownames_to_column(var="gene") %>% 
  as_tibble()

# Subset to return genes with padj < 0.05
sigLRT_genes <- res_LRT_tb %>% 
  dplyr::filter(padj < padj.cutoff)

# Get number of significant genes
nrow(sigLRT_genes)

[1] 7315

# Compare to numbers we had from Wald test
nrow(sigOE) # overexpression vs control

[1] 4774

nrow(sigKD) # knockdown vs control

[1] 2827

The number of significant genes observed from the LRT is quite high. This list includes genes that can be changing in any direction across the three factor levels (control, knockdown, overexpression). To reduce the number of significant genes, we can increase the stringency of our FDR threshold (padj.cutoff).
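As a small illustration of how the cutoff changes the gene count, the sketch below applies two thresholds to the ten `padj` values printed in the results table earlier. It uses base R's `subset()` in place of `dplyr::filter()` so that it runs without extra packages; both drop rows where `padj` is `NA`. The stricter cutoff of 0.001 is an arbitrary value chosen for illustration.

```r
# The ten genes and padj values shown in the printed results table
padj_demo <- data.frame(
  gene = c("ENSG00000000003", "ENSG00000000005", "ENSG00000000419",
           "ENSG00000000457", "ENSG00000000460", "ENSG00000285889",
           "ENSG00000285950", "ENSG00000285976", "ENSG00000285978",
           "ENSG00000285980"),
  padj = c(3.14071e-08, 5.88670e-01, 1.22924e-02, 3.04551e-03, 3.23425e-05,
           NA, NA, 0.00273904, NA, NA)
)

# Count genes passing the lesson's cutoff and a stricter, illustrative cutoff
n_default <- nrow(subset(padj_demo, padj < 0.05))
n_strict  <- nrow(subset(padj_demo, padj < 0.001))
c(default = n_default, strict = n_strict)
```

Genes with `NA` in `padj` never pass either threshold, so they do not need to be removed beforehand.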

Exercise

Compare the resulting gene list from the LRT test to the gene lists from the Wald test comparisons.
1. How many of the sigLRT_genes overlap with the significant genes in sigOE?
2. How many of the sigLRT_genes overlap with the significant genes in sigKD?
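One way to approach this kind of comparison is with base R set operations. The sketch below uses made-up gene identifiers rather than the lesson's results, so it shows the pattern without giving the answers. With the real objects, the vectors would come from the gene identifier column of each table (assumed here to be named `gene`, as in `sigLRT_genes`).

```r
# Hypothetical gene lists standing in for the LRT and Wald test results
lrt_genes <- c("geneA", "geneB", "geneC", "geneD", "geneE")
oe_genes  <- c("geneA", "geneB", "geneF")
kd_genes  <- c("geneB", "geneC")

# Number of LRT genes shared with each Wald test list
n_overlap_oe <- length(intersect(lrt_genes, oe_genes))
n_overlap_kd <- length(intersect(lrt_genes, kd_genes))

# Genes found by a Wald test but not by the LRT: the LRT list
# need not contain the full union of the pairwise lists
wald_only <- setdiff(union(oe_genes, kd_genes), lrt_genes)

c(OE = n_overlap_oe, KD = n_overlap_kd)
wald_only
```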

Identifying clusters of genes with shared expression profiles

We now have this list of ~7K significant genes that we know are changing in some way across the three different sample groups. What do we do next?

A good next step is to identify groups of genes that share a pattern of expression change across the sample groups (levels). To do this we will be using a clustering tool called degPatterns from the ‘DEGreport’ package. The degPatterns tool uses a hierarchical clustering approach based on pair-wise correlations between genes, then cuts the hierarchical tree to generate groups of genes with similar expression profiles. The tool cuts the tree in a way that optimizes the diversity of the clusters, so that the variability between clusters is greater than the variability within clusters.
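The general idea of correlation-based hierarchical clustering followed by a tree cut can be sketched with base R alone. This is a generic illustration using `hclust()` and `cutree()` on simulated data, not a reproduction of what `degPatterns` does internally; DEGreport may use a different clustering algorithm, correlation measure, and rule for choosing where to cut. The expression values, patterns, and the choice of `k = 2` are all invented for this example.

```r
set.seed(42)

# Nine hypothetical samples: three replicates per sample group
groups <- rep(c("control", "knockdown", "overexpression"), each = 3)

# Two opposite expression patterns across the groups
up_pattern   <- c(control = 0, knockdown = -1, overexpression = 1)
down_pattern <- -up_pattern

# Simulate n genes that follow a pattern, with a little noise
make_genes <- function(pattern, n) {
  t(replicate(n, 8 + pattern[groups] + rnorm(length(groups), sd = 0.1)))
}
expr <- rbind(make_genes(up_pattern, 10), make_genes(down_pattern, 10))
rownames(expr) <- paste0("gene", seq_len(nrow(expr)))

# Pairwise correlations between genes, converted to a distance
gene_cor <- cor(t(expr))
tree <- hclust(as.dist(1 - gene_cor), method = "average")

# Cut the tree into two groups of genes with similar profiles
cluster_id <- cutree(tree, k = 2)
table(cluster_id)
```

Using `1 - correlation` as the distance groups genes by the shape of their profile rather than by their absolute expression level, which is the property that matters when looking for shared patterns across sample groups.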

Before we begin clustering, we will first subset our rlog transformed normalized counts to retain only the differentially expressed genes (padj < 0.05). In our case, it may take some time to run the clustering on 7K genes, so for class demonstration purposes we will keep only the top 1000 genes sorted by adjusted p-value.

Where do I get rlog transformed counts?

This rlog transformation was applied in an earlier lesson when we performed QC analysis. If you do not see this in your environment, run the following code:

# Transform counts for data visualization
rld <- rlog(dds, blind=TRUE)
rld_mat <- assay(rld)

# Subset results for faster cluster finding (for classroom demo purposes)
clustering_sig_genes <- sigLRT_genes %>%
  arrange(padj) %>%
  head(n=1000)

# Obtain rlog values for those significant genes
cluster_rlog <- rld_mat[clustering_sig_genes$gene, ]

The rlog transformed counts for the significant genes are input to degPatterns along with a few additional arguments:

metadata: the metadata dataframe that corresponds to samples
time: name of the metadata column (as a character string) to use as the variable that changes
col: name of the metadata column (as a character string) used to separate samples

Once the clustering has finished running, you will get your command prompt back in the console and you should see a figure appear in your plot window. The genes have been clustered into four different groups. For each group of genes, we have a boxplot illustrating expression change across the different sample groups. A line graph is overlaid to illustrate the trend in expression change.

# Use the `degPatterns` function from the 'DEGreport' package to show gene clusters across sample groups
clusters <- degPatterns(cluster_rlog, metadata = meta, time = "sampletype", col = NULL)

Suppose we are interested in the genes that show decreased expression in the knockdown samples and increased expression in the overexpression samples. According to the plot there are 275 genes that share this expression profile. To find out what these genes are, let’s explore the output. What type of data structure is the clusters output?

# What type of data structure is the `clusters` output?
class(clusters)

[1] "list"

We can see what objects are stored in the list by using names(clusters). One of them is a data frame named df. This is the main result, so let’s take a look at it. The first column contains the genes, and the second column contains the number of the cluster to which each gene belongs.

# Let's see what is stored in the `df` component
head(clusters$df)

                          genes cluster
ENSG00000155363 ENSG00000155363       1
ENSG00000173110 ENSG00000173110       1
ENSG00000189060 ENSG00000189060       1
ENSG00000187621 ENSG00000187621       2
ENSG00000265972 ENSG00000265972       1
ENSG00000270882 ENSG00000270882       3

Since we are interested in Group 1, we can filter the dataframe to keep only those genes:

# Extract the Group 1 genes
group1 <- clusters$df %>%
  dplyr::filter(cluster == 1)
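To see how many genes fall into each cluster before extracting one, `table()` is handy. The sketch below rebuilds only the six rows shown in the `head()` output above as a stand-alone data frame (`clusters_df_demo`, a hypothetical name) and uses base R's `subset()` in place of `dplyr::filter()`; on the real object you would work with `clusters$df` directly.

```r
# Stand-in for clusters$df, built from the six rows printed by head()
clusters_df_demo <- data.frame(
  genes = c("ENSG00000155363", "ENSG00000173110", "ENSG00000189060",
            "ENSG00000187621", "ENSG00000265972", "ENSG00000270882"),
  cluster = c(1, 1, 1, 2, 1, 3)
)
rownames(clusters_df_demo) <- clusters_df_demo$genes

# Number of genes assigned to each cluster
cluster_sizes <- table(clusters_df_demo$cluster)
cluster_sizes

# Keep only the genes assigned to cluster 1
group1_demo <- subset(clusters_df_demo, cluster == 1)
group1_demo$genes
```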

After extracting a group of genes, we can use annotation packages to obtain additional information. We can also use these lists of genes as input to downstream functional analysis tools to obtain more biological insight and see whether the groups of genes share a specific function.

Materials and hands-on activities were adapted from RNA-seq workflow on the Bioconductor website

