Approximate time: 75 minutes
Data Wrangling with Tidyverse
The Tidyverse suite of integrated packages is designed to work together to make common data science operations more user friendly. The packages have functions for data wrangling, tidying, reading/writing, parsing, and visualizing, among others. There is a freely available book, R for Data Science, with detailed descriptions and practical examples of the tools available and how they work together. We will explore the basic syntax for working with these packages, as well as specific functions for data wrangling with the ‘dplyr’ package and data visualization with the ‘ggplot2’ package.
The Tidyverse suite of packages introduces users to a set of data structures, functions and operators that make working with data more intuitive, but that differ slightly from the way we do things in base R. Two important new concepts we will focus on are pipes and tibbles.
Before we get started with pipes or tibbles, let’s load the library:
library(tidyverse)
Stringing together commands in R can be quite daunting. Also, trying to understand code that has many nested functions can be confusing.
To make R code more human readable, the Tidyverse tools use the pipe, %>%, which was acquired from the magrittr package and is now part of the dplyr package that is installed automatically with Tidyverse. The pipe allows the output of a previous command to be used as input to another command instead of using nested functions.
NOTE: The keyboard shortcut to write the pipe in RStudio is shift + command + M (shift + ctrl + M on Windows).
An example of using the pipe to run multiple commands:
## A single command
sqrt(83)

## Base R method of running more than one command
round(sqrt(83), digits = 2)

## Running more than one command with piping
sqrt(83) %>% round(digits = 2)
The pipe offers a much easier way of writing and deciphering R code, so we will take advantage of it, when possible, as we work through the rest of the lesson.
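As a rough illustration of how the pipe scales to longer chains, the sketch below runs the same three steps in nested and piped form. The vector `values` is a made-up example, not part of the lesson's data, and only `dplyr` is loaded here (it makes `%>%` available).

```r
library(dplyr)  # attaching dplyr makes the %>% pipe available

# Hypothetical values, used only for illustration
values <- c(4, 9, 16, 25)

# Nested form: has to be read from the inside out
nested <- round(mean(sqrt(values)), digits = 1)

# Piped form: reads left to right, one step per line
piped <- values %>%
  sqrt() %>%
  mean() %>%
  round(digits = 1)

# Both forms compute the same value
identical(nested, piped)
```

Each piped step receives the result of the previous step as its first argument, which is why `round(digits = 2)` in the lesson's example needs no explicit input.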
Create a vector of random numbers using the code below:
random_numbers <- c(81, 90, 65, 43, 71, 29)
Use the pipe (%>%) to perform two steps in a single line:
- Take the mean of random_numbers using the mean() function.
- Round the output to three digits using the round() function.
A core component of the tidyverse is the tibble. Tibbles are a modern rework of the standard data.frame, with some internal improvements to make code more reliable. They are data frames, but do not follow all of the same rules. For example, tibbles can have column names that start with numbers or contain symbols, which is not normally allowed in base R.
Important: tidyverse is very opinionated about row names. These packages insist that all column data (e.g. data.frame) be treated equally, and that special designation of a column as rownames should be deprecated. Tibble provides simple utility functions to handle rownames: rownames_to_column() and column_to_rownames().
Tibbles can be created directly using the tibble() function or data frames can be converted into tibbles using as_tibble(name_of_df).
NOTE: The function as_tibble() will ignore row names, so if a column representing the row names is needed, then the function rownames_to_column(name_of_df) should be run prior to turning the data.frame into a tibble. Also, as_tibble() will not coerce character vectors to factors by default.
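The row name behaviour described in the note can be seen in a small sketch. The data frame `df`, its gene labels, and the column name `gene` are hypothetical and chosen only for illustration.

```r
library(tibble)

# Hypothetical data frame with row names (not from the lesson's dataset)
df <- data.frame(
  count = c(10, 25, 3),
  row.names = c("geneA", "geneB", "geneC")
)

# Converting directly drops the row names
tb_no_names <- as_tibble(df)

# Moving the row names into a regular column first preserves them
tb_with_names <- as_tibble(rownames_to_column(df, var = "gene"))

# A tibble can also be built directly with tibble()
tb_direct <- tibble(gene = c("geneA", "geneB", "geneC"),
                    count = c(10, 25, 3))

names(tb_no_names)
names(tb_with_names)
```

Here `var = "gene"` sets the name of the new column; if it is left out, `rownames_to_column()` falls back to its default column name.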
We’re going to explore the Tidyverse suite of tools to wrangle our data to prepare it for visualization. You should have downloaded the file called gprofiler_results_Mov10oe.tsv into your R project’s data folder earlier.
If you do not have the gprofiler_results_Mov10oe.tsv file in your data folder, you can right click and download it into the data folder using this link.
- Represents the functional analysis results, including the biological processes, functions, pathways, or conditions that are over-represented in a given list of genes.
- Our gene list was generated by differential gene expression analysis and the genes represent differences between control mice and mice over-expressing a gene involved in RNA splicing.
The functional analysis that we will focus on involves gene ontology (GO) terms, which:
- describe the roles of genes and gene products
- are organized into three controlled vocabularies/ontologies (domains):
- biological processes (BP)
- cellular components (CC)
- molecular functions (MF)
Analysis goal and workflow
Goal: Visually compare the most significant biological processes (BP) based on the number of associated differentially expressed genes (gene ratios) and significance values by creating the following plot:
To prepare our data for plotting, we are going to use the Tidyverse suite of tools to wrangle and visualize it through several steps:
- Read in the functional analysis results
- Extract only the GO biological processes (BP) of interest
- Select only the columns needed for visualization
- Order by significance (p-adjusted values)
- Rename columns to be more intuitive
- Create additional metrics for plotting (e.g. gene ratios)
- Plot results
While all of the tools in the Tidyverse suite deserve to be explored in more depth, we are going to look more closely at the reading (readr), wrangling (dplyr), and plotting (ggplot2) tools.
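As a preview of how the wrangling steps listed above might chain together, here is a minimal sketch on a small made-up table. The tibble `toy_results` and all of its values are hypothetical; only the column names mirror those used later in the lesson, and the real analysis starts from the file read in during step 1.

```r
library(dplyr)
library(tibble)

# Hypothetical stand-in for the functional analysis results (values are made up)
toy_results <- tibble(
  term.id      = c("id_1", "id_2", "id_3", "id_4"),
  term.name    = c("process a", "component b", "process c", "function d"),
  domain       = c("BP", "CC", "BP", "MF"),
  p.value      = c(0.03, 0.001, 0.002, 0.04),
  query.size   = c(200, 200, 200, 200),
  overlap.size = c(10, 40, 20, 5)
)

toy_bp <- toy_results %>%
  filter(domain == "BP") %>%                                  # keep BP rows only
  select(term.id, term.name, p.value,
         query.size, overlap.size) %>%                        # keep needed columns
  arrange(p.value) %>%                                        # most significant first
  dplyr::rename(GO_id = term.id, GO_term = term.name) %>%     # clearer names
  mutate(gene_ratio = overlap.size / query.size)              # new metric

toy_bp
```

The lesson performs these steps one at a time and saves the intermediate result after each; chaining them in one pipeline is an equivalent alternative once each step is understood.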
1. Read in the functional analysis results
While the base R packages have perfectly fine methods for reading in data, the readr and readxl Tidyverse packages offer additional methods for reading in data. Let’s read in our tab-delimited functional analysis results using read_delim():
# Read in the functional analysis results
functional_GO_results <- read_delim(file = "data/gprofiler_results_Mov10oe.tsv", delim = "\t")

# Take a look at the results
functional_GO_results
Notice that the results were automatically read in as a tibble and the output gives the number of rows, columns and the data type for each of the columns.
NOTE: A large number of tidyverse functions will work with both tibbles and data frames, and the data structure of the output will be identical to the input. However, there are some functions that will return a tibble (without row names), regardless of whether a tibble or a data frame is provided.
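To try the reading step without the lesson's file, the sketch below writes a tiny tab-delimited file to a temporary location and reads it back. The file contents are made up for illustration. `read_tsv()` is a `readr` shortcut for tab-delimited input, and `show_col_types = FALSE` (available in recent `readr` versions) only silences the column type message.

```r
library(readr)

# Hypothetical three-column table written to a temporary file for illustration
tmp <- tempfile(fileext = ".tsv")
writeLines(c("term.id\tdomain\tp.value",
             "id_1\tBP\t0.001",
             "id_2\tCC\t0.02"),
           tmp)

# Explicit delimiter, as in the lesson
from_delim <- read_delim(file = tmp, delim = "\t", show_col_types = FALSE)

# Shortcut for tab-delimited files
from_tsv <- read_tsv(file = tmp, show_col_types = FALSE)

# Both calls return a tibble with the column types guessed from the data
class(from_delim)
```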
2. Extract only the GO biological processes (BP) of interest
Now that we have our data, we will need to wrangle it into a format ready for plotting. For all of our data wrangling steps we will be using tools from the dplyr package, which is a Swiss Army knife for wrangling data frames.
To extract the biological processes of interest, we only want those rows where the domain is equal to BP, which we can do using the filter() function.
To filter rows of a data frame/tibble based on values in different columns, we give a logical expression as input to the filter() function to return those rows for which the expression is TRUE.
Now let’s return only those rows that have a domain of BP:
# Return only GO biological processes
bp_oe <- functional_GO_results %>%
  filter(domain == "BP")

View(bp_oe)
Now we have returned only those rows with a domain of BP. How have the dimensions of our results changed?
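A small sketch of how `filter()` evaluates a logical expression row by row, and how the dimensions change as a result. The tibble `toy_results` and its values are hypothetical; the 0.01 cutoff is an arbitrary threshold used only to show that several conditions can be combined.

```r
library(dplyr)
library(tibble)

# Hypothetical results table (values are made up)
toy_results <- tibble(
  term.name = c("term_a", "term_b", "term_c", "term_d"),
  domain    = c("BP", "CC", "BP", "MF"),
  p.value   = c(0.001, 0.02, 0.04, 0.0005)
)

# Keep rows where the expression is TRUE
toy_bp <- toy_results %>%
  filter(domain == "BP")

# Conditions separated by commas must all be TRUE for a row to be kept
toy_bp_strict <- toy_results %>%
  filter(domain == "BP", p.value < 0.01)

# Compare dimensions before and after filtering
dim(toy_results)
dim(toy_bp)
```

Filtering changes only the number of rows; the number of columns stays the same.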
We would like to perform an additional round of filtering to only keep the most specific GO terms.
- For bp_oe, use the filter() function to only keep those rows where the relative.depth is greater than 4.
- Save output to overwrite our bp_oe variable.
3. Select only the columns needed for visualization
For visualization purposes, we are only interested in the columns related to the GO terms, the significance of the terms, and information about the number of genes associated with the terms.
To extract columns from a data frame/tibble we can use the select() function. In contrast to base R, we do not need to put the column names in quotes for selection.
# Selecting columns to keep
bp_oe <- bp_oe %>%
  select(term.id, term.name, p.value, query.size, term.size, overlap.size, intersection)
The select() function also allows for negative selection, so we could alternatively have removed the unwanted columns. Note that we need to put the column names inside of the combine (c()) function with a - preceding it for this functionality.
# DO NOT RUN
# Selecting columns to remove
bp_oe <- bp_oe %>%
  select(-c(query.number, significant, recall, precision, subgraph.number, relative.depth, domain))
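To confirm that positive and negative selection can reach the same result, here is a minimal sketch on a made-up four-column tibble. `toy_results` and its values are hypothetical.

```r
library(dplyr)
library(tibble)

# Hypothetical table with four columns (values are made up)
toy_results <- tibble(
  term.id     = c("id_1", "id_2"),
  domain      = c("BP", "BP"),
  p.value     = c(0.001, 0.02),
  significant = c(TRUE, TRUE)
)

# Positive selection: name the columns to keep (no quotes needed)
kept <- toy_results %>%
  select(term.id, p.value)

# Negative selection: name the columns to drop inside -c()
dropped <- toy_results %>%
  select(-c(domain, significant))

# Both approaches leave the same columns in the same order
names(kept)
names(dropped)
```

Negative selection is usually the shorter option when only a few columns need to go; positive selection also lets you set the column order.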
4. Order GO processes by significance (adjusted p-values)
Now that we have only the rows and columns of interest, let’s arrange these by significance, which is denoted by the adjusted p-value.
Let’s sort the rows by adjusted p-value with the arrange() function:
# Order by adjusted p-value ascending
bp_oe <- bp_oe %>%
  arrange(p.value)
NOTE: If you wanted to arrange in descending order, then you could have run the following instead:
# DO NOT RUN
# Order by adjusted p-value descending
bp_oe <- bp_oe %>%
  arrange(desc(p.value))
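The two sort directions can be compared side by side in a short sketch. The tibble `toy_results` and its p-values are hypothetical.

```r
library(dplyr)
library(tibble)

# Hypothetical terms with made-up p-values
toy_results <- tibble(
  term.name = c("term_a", "term_b", "term_c"),
  p.value   = c(0.03, 0.001, 0.02)
)

# Ascending order (default): smallest p-value first
ascending <- toy_results %>%
  arrange(p.value)

# Descending order: wrap the column in desc()
descending <- toy_results %>%
  arrange(desc(p.value))

ascending$term.name
descending$term.name
```

Whole rows move together when sorting, so each term name stays paired with its own p-value.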
5. Rename columns to be more intuitive
While not necessary for our visualization, giving columns more intuitive names can help with our understanding of the data. We can do this using the rename() function. The syntax is new_name = old_name.
Let’s rename the term.id and term.name columns.
# Provide better names for columns
bp_oe <- bp_oe %>%
  dplyr::rename(GO_id = term.id, GO_term = term.name)
NOTE: In the case of two packages with identical function names, you can use :: with the package name before and the function name after (e.g. stats::filter()) to ensure that the correct function is implemented. The :: can also be used to bring in a function from a library without loading it first.
In the example above, we wanted to use the rename() function specifically from the dplyr package, and not any of the other packages (or base R) which may have the rename() function.
Rename the intersection column to genes to reflect the fact that these are the DE genes associated with the GO process.
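The `::` behaviour described in the note can be sketched without any `library()` call, assuming the `tibble` and `dplyr` packages are installed. The table `toy_terms` is hypothetical.

```r
# Deliberately no library() call: `::` reaches into an installed package directly

# Hypothetical table (values are made up)
toy_terms <- tibble::tibble(
  term.id   = c("id_1", "id_2"),
  term.name = c("process a", "process b")
)

# The syntax is new_name = old_name
renamed <- dplyr::rename(toy_terms, GO_id = term.id, GO_term = term.name)

names(toy_terms)  # the original object is unchanged
names(renamed)
```

Because `rename()` returns a new object, the result has to be assigned (as the lesson does by overwriting `bp_oe`) for the new names to persist.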
6. Create additional metrics for plotting (e.g. gene ratios)
Finally, before we plot our data, we need to create a couple of additional metrics. The mutate() function enables you to create a new column from an existing column.
Let’s generate gene ratios to reflect the number of DE genes associated with each GO process relative to the total number of DE genes.
# Create gene ratio column based on other columns in dataset
bp_oe <- bp_oe %>%
  mutate(gene_ratio = overlap.size / query.size)
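To check the arithmetic behind `gene_ratio`, the sketch below applies the same `mutate()` call to a made-up table. `toy_results` and its values are hypothetical, and the second column, `gene_ratio_pct`, is an illustrative name that is not part of the lesson.

```r
library(dplyr)
library(tibble)

# Hypothetical counts (values are made up)
toy_results <- tibble(
  GO_term      = c("process a", "process b"),
  query.size   = c(200, 200),
  overlap.size = c(10, 40)
)

toy_results <- toy_results %>%
  mutate(
    gene_ratio     = overlap.size / query.size,  # computed row by row
    gene_ratio_pct = gene_ratio * 100            # can reuse a column created just above
  )

toy_results$gene_ratio
```

The division is vectorized, so each row gets its own ratio, and existing columns are left untouched.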
Create a column in bp_oe called term_percent to determine the percent of DE genes associated with the GO term relative to the total number of genes associated with the GO term (overlap.size / term.size).
Our final data for plotting should look like the table below:
Now that we have our results ready for plotting, we can use the ggplot2 package to plot our results. If you are interested, you can follow this lesson and dive into how to use ggplot2 to create the plots with this dataset.
This lesson has been developed by members of the teaching team at the Harvard Chan Bioinformatics Core (HBC). These are open access materials distributed under the terms of the Creative Commons Attribution license (CC BY 4.0), which permits unrestricted use, distribution, and reproduction in any medium, provided the original author and source are credited.