Tidyverse data wrangling

R Programming

Data wrangling

This lesson introduces participants to data wrangling using the tidyverse. Participants learn to work with pipes, tibbles and key functions from dplyr to filter, select, arrange, rename and transform data. The lesson guides participants through preparing functional analysis results for visualization, building practical skills for organizing and analyzing biological data.

Authors

Mary Piper

Meeta Mistry

Radhika Khetani

Jihe Liu

Will Gammerdinger

Noor Sohail

Published

May 30, 2025

Keywords

R, tidyverse, pipes, select, filter, mutate, arrange

Approximate time: 55 minutes

Learning Objectives

Perform basic data wrangling with functions in the Tidyverse package.

Data Wrangling with Tidyverse

The Tidyverse suite of integrated packages are designed to work together to make common data science operations more user friendly. The packages have functions for data wrangling, tidying, reading/writing, parsing, and visualizing, among others. There is a freely available book, R for Data Science, with detailed descriptions and practical examples of the tools available and how they work together. We will explore the basic syntax for working with these packages, as well as, specific functions for data wrangling with the ‘dplyr’ package and data visualization with the ‘ggplot2’ package.

Tidyverse basics

The Tidyverse suite of packages introduces users to a set of data structures, functions and operators to make working with data more intuitive, but is slightly different from the way we do things in base R. Two important new concepts we will focus on are pipes and tibbles.

Before we get started with pipes or tibbles, let’s load the library:

# Load the tidyverse package
library(tidyverse)

Pipes

Stringing together commands in R can be quite daunting. Also, trying to understand code that has many nested functions can be confusing.

To make R code more human readable, the Tidyverse tools use the pipe, %>%, which was acquired from the magrittr package and is now part of the dplyr package that is installed automatically with Tidyverse. The pipe allows the output of a previous command to be used as input to another command instead of using nested functions.

Note

Shortcut to write the pipe is shift + command + M

An example of using the pipe to run multiple commands:

# A single command
sqrt(83)

[1] 9.110434

# Base R method of running more than one command
round(sqrt(83), digits = 2)

[1] 9.11

# Running more than one command with piping
sqrt(83) %>% round(digits = 2)

[1] 9.11

The pipe represents a much easier way of writing and deciphering R code, and so we will be taking advantage of it, when possible, as we work through the remaining lesson.

Exercise 1

Create a vector of random numbers using the code below:

# Create a vector of random numbers
random_numbers <- c(81, 90, 65, 43, 71, 29)

Use the pipe (%>%) to perform two steps:

Take the mean of random_numbers using the mean() function.
Round the output to three digits using the round() function.

Tibbles

A core component of the tidyverse is the tibble. Tibbles are a modern rework of the standard data.frame, with some internal improvements to make code more reliable. They are data frames, but do not follow all of the same rules. For example, tibbles can have numbers/symbols for column names, which is not normally allowed in base R.

Important: tidyverse is very opinionated about row names. These packages insist that all column data (e.g. data.frame) be treated equally, and that special designation of a column as rownames should be deprecated. Tibble provides simple utility functions to handle rownames: rownames_to_column() and column_to_rownames().

Tibbles can be created directly using the tibble() function or data frames can be converted into tibbles using as_tibble(name_of_df).

Note

The function as_tibble() will ignore row names, so if a column representing the row names is needed, then the function rownames_to_column(name_of_df) should be run prior to turning the data.frame into a tibble. Also, as_tibble() will not coerce character vectors to factors by default.

Experimental data

We’re going to explore the Tidyverse suite of tools to wrangle our data to prepare it for visualization. You should have downloaded the file called gprofiler_results_Mov10oe.tsv into your R project’s data folder earlier.

Note

If you do not have the gprofiler_results_Mov10oe.tsv file in your data folder, you can right click and download it into the data folder using this link.

The dataset:

Represents the functional analysis results, including the biological processes, functions, pathways, or conditions that are over-represented in a given list of genes.
Our gene list was generated by differential gene expression analysis and the genes represent differences between control mice and mice over-expressing a gene involved in RNA splicing.

The functional analysis that we will focus on involves gene ontology (GO) terms, which:

describe the roles of genes and gene products
organized into three controlled vocabularies/ontologies (domains):
- biological processes (BP)
- cellular components (CC)
- molecular functions (MF)

Analysis goal and workflow

Goal: Visually compare the most significant biological processes (BP) based on the number of associated differentially expressed genes (gene ratios) and significance values by creating the following plot:

To wrangle our data in preparation for the plotting, we are going to use the Tidyverse suite of tools to wrangle and visualize our data through several steps:

Read in the functional analysis results
Extract only the GO biological processes (BP) of interest
Select only the columns needed for visualization
Order by significance (p-adjusted values)
Rename columns to be more intuitive
Create additional metrics for plotting (e.g. gene ratios)
Plot results

Tidyverse tools

While all of the tools in the Tidyverse suite are deserving of being explored in more depth, we are going to investigate more deeply the reading (readr), wrangling (dplyr), and plotting (ggplot2) tools.

1. Read in the functional analysis results

While the base R packages have perfectly fine methods for reading in data, the readr and readxl Tidyverse packages offer additional methods for reading in data. Let’s read in our tab-delimited functional analysis results using read_delim():

# Read in the functional analysis results
functional_GO_results <- read_delim(file = "data/gprofiler_results_Mov10oe.txt", delim = "\t" )

# Take a look at the results
functional_GO_results

# A tibble: 3,644 × 14
   query.number significant p.value term.size query.size overlap.size recall
          <dbl> <lgl>         <dbl>     <dbl>      <dbl>        <dbl>  <dbl>
 1            1 TRUE        0.00434       111       5850           52  0.009
 2            1 TRUE        0.0033        110       5850           52  0.009
 3            1 TRUE        0.0297         39       5850           21  0.004
 4            1 TRUE        0.0193         70       5850           34  0.006
 5            1 TRUE        0.0148         26       5850           16  0.003
 6            1 TRUE        0.0187         22       5850           14  0.002
 7            1 TRUE        0.0226         48       5850           25  0.004
 8            1 TRUE        0.0491         17       5850           11  0.002
 9            1 TRUE        0.00798       164       5850           71  0.012
10            1 TRUE        0.0439         19       5850           12  0.002
# ℹ 3,634 more rows
# ℹ 7 more variables: precision <dbl>, term.id <chr>, domain <chr>,
#   subgraph.number <dbl>, term.name <chr>, relative.depth <dbl>,
#   intersection <chr>

Click here to see how to do this in base R

# Read in the functional analysis results
functional_GO_results <- read.delim(file = "data/gprofiler_results_Mov10oe.txt", sep = "\t" )

# Take a look at the results
functional_GO_results

Notice that the results were automatically read in as a tibble and the output gives the number of rows, columns and the data type for each of the columns.

Note

A large number of tidyverse functions will work with both tibbles and dataframes, and the data structure of the output will be identical to the input. However, there are some functions that will return a tibble (without row names), whether or not a tibble or dataframe is provided.

2. Extract only the GO biological processes (BP) of interest

Now that we have our data, we will need to wrangle it into a format ready for plotting. For all of our data wrangling steps we will be using tools from the dplyr package, which is a swiss-army knife for data wrangling of data frames.

To extract the biological processes of interest, we only want those rows where the domain is equal to BP, which we can do using the filter() function.

To filter rows of a data frame/tibble based on values in different columns, we give a logical expression as input to the filter() function to return those rows for which the expression is TRUE.

Now let’s return only those rows that have a domain of BP:

# Filter for only GO biological processes
bp_oe <- functional_GO_results %>%
  filter(domain == "BP")

# Print object after filtering
bp_oe

# A tibble: 1,102 × 14
   query.number significant p.value term.size query.size overlap.size recall
          <dbl> <lgl>         <dbl>     <dbl>      <dbl>        <dbl>  <dbl>
 1            1 TRUE        0.00434       111       5850           52  0.009
 2            1 TRUE        0.0033        110       5850           52  0.009
 3            1 TRUE        0.0297         39       5850           21  0.004
 4            1 TRUE        0.0193         70       5850           34  0.006
 5            1 TRUE        0.0148         26       5850           16  0.003
 6            1 TRUE        0.0187         22       5850           14  0.002
 7            1 TRUE        0.0226         48       5850           25  0.004
 8            1 TRUE        0.0491         17       5850           11  0.002
 9            1 TRUE        0.00798       164       5850           71  0.012
10            1 TRUE        0.0439         19       5850           12  0.002
# ℹ 1,092 more rows
# ℹ 7 more variables: precision <dbl>, term.id <chr>, domain <chr>,
#   subgraph.number <dbl>, term.name <chr>, relative.depth <dbl>,
#   intersection <chr>

Click here to see how to do this in base R

# Create an index for rows where the domain is biological processes 
idx <- functional_GO_results$domain == "BP"
# Subset object by the index
bp_oe <- functional_GO_results[idx,]

# Print object after subsetting
bp_oe

Now we have returned only those rows with a domain of BP. How have the dimensions of our results changed?

Exercise 2

We would like to perform an additional round of filtering to only keep the most specific GO terms.

For bp_oe, use the filter() function to only keep those rows where the relative.depth is greater than 4.
Save output to overwrite our bp_oe object

3. Select only the columns needed for visualization

For visualization purposes, we are only interested in the columns related to the GO terms, the significance of the terms, and information about the number of genes associated with the terms.

To extract columns from a data frame/tibble we can use the select() function. In contrast to base R, we do not need to put the column names in quotes for selection.

# Selecting columns to keep
bp_oe <- bp_oe %>%
  select(term.id, term.name, p.value, query.size, term.size, overlap.size, intersection)

# Print object after selecting the columns to keep
bp_oe

# A tibble: 668 × 7
   term.id    term.name   p.value query.size term.size overlap.size intersection
   <chr>      <chr>         <dbl>      <dbl>     <dbl>        <dbl> <chr>       
 1 GO:0090672 telomeras… 2.41e- 2       5850        16           11 dkc1,cct7,s…
 2 GO:0090670 RNA local… 2.41e- 2       5850        16           11 dkc1,cct7,s…
 3 GO:0090671 telomeras… 2.41e- 2       5850        16           11 dkc1,cct7,s…
 4 GO:0071702 organic s… 7.90e-11       5850      2629          973 bad,snx11,s…
 5 GO:0015931 nucleobas… 4.43e- 5       5850       200           93 upf1,mvp,zc…
 6 GO:0050657 nucleic a… 3.67e- 6       5850       166           83 upf1,mvp,zc…
 7 GO:0050658 RNA trans… 3.67e- 6       5850       166           83 upf1,mvp,zc…
 8 GO:0051031 tRNA tran… 4.88e- 2       5850        33           18 nup133,seh1…
 9 GO:0051028 mRNA tran… 2.48e- 5       5850       137           69 upf1,mvp,zc…
10 GO:0016192 vesicle-m… 1.39e- 4       5850      1492          540 snx11,arf5,…
# ℹ 658 more rows

Click here to see how to do this in base R

# Selecting columns to keep
bp_oe <- bp_oe[, c("term.id", "term.name", "p.value", "query.size", "term.size", "overlap.size", "intersection")]

# Print object after selecting the columns to keep
bp_oe

The select() function also allows for negative selection. So we could have alternately removed columns with negative selection. Note that we need to put the column names inside of the combine (c()) function with a - preceding it for this functionality.

# DO NOT RUN
# Selecting columns to remove
bp_oe <- bp_oe %>%
    select(-c(query.number, significant, recall, precision, 
              subgraph.number, relative.depth, domain))

Click here to see how to do this in base R

# DO NOT RUN
# Create an index of columns to keep by giving R a vector of columns to remove
idx <- !(colnames(bp_oe) %in% c("query.number", "significant", "recall", "precision", "subgraph.number", "relative.depth", "domain"))

# Subset object by the index
bp_oe <- bp_oe[, idx]

4. Order GO processes by significance (adjusted p-values)

Now that we have only the rows and columns of interest, let’s arrange these by significance, which is denoted by the adjusted p-value.

Let’s sort the rows by adjusted p-value with the arrange() function.

# Order by adjusted p-value ascending
bp_oe <- bp_oe %>%
  arrange(p.value)

Click here to see how to do this in base R

# Create an index ordering the rows by ascending adjusted p-value 
idx <- order(bp_oe$p.value)

# Subset object by the index
bp_oe <- bp_oe[idx,]

Note

If you wanted to arrange in descending order, then you could have run the following instead:

# DO NOT RUN
# Order by adjusted p-value descending
bp_oe <- bp_oe %>%
  arrange(desc(p.value))

Click here to see how to do this in base R

# DO NOT RUN
# Create an index ordering the rows by descending adjusted p-value
idx <- order(bp_oe$p.value, decreasing = TRUE)

# Subset object by the index
bp_oe <- bp_oe[idx,]

Note

Ordering variables in ggplot2 is a bit different. This post introduces a few ways of ordering variables in a plot.

5. Rename columns to be more intuitive

While not necessary for our visualization, renaming columns more intuitively can help with our understanding of the data using the rename() function. The syntax is new_name = old_name.

Let’s rename the term.id and term.name columns.

# Provide better names for columns
bp_oe <- bp_oe %>% 
  dplyr::rename(GO_id = term.id, 
                GO_term = term.name)

# Print object after renaming the columns
bp_oe

# A tibble: 668 × 7
   GO_id      GO_term     p.value query.size term.size overlap.size intersection
   <chr>      <chr>         <dbl>      <dbl>     <dbl>        <dbl> <chr>       
 1 GO:0010467 gene expr… 6.71e-66       5850      5257         2142 gclc,nfya,b…
 2 GO:0090304 nucleic a… 1.18e-61       5850      5103         2073 gclc,nfya,d…
 3 GO:0006139 nucleobas… 2.49e-58       5850      5731         2271 dpm1,gclc,n…
 4 GO:0016070 RNA metab… 7.28e-57       5850      4597         1881 gclc,nfya,d…
 5 GO:0009059 macromole… 3.12e-54       5850      5066         2030 dpm1,gclc,n…
 6 GO:0034645 cellular … 5.6 e-54       5850      4907         1975 dpm1,gclc,n…
 7 GO:0044271 cellular … 2.10e-47       5850      4882         1938 gclc,nfya,s…
 8 GO:0010468 regulatio… 4.25e-46       5850      4297         1733 gclc,nfya,d…
 9 GO:2000112 regulatio… 1.22e-40       5850      3960         1593 gclc,nfya,d…
10 GO:0010556 regulatio… 2.22e-39       5850      4073         1626 gclc,nfya,d…
# ℹ 658 more rows

Click here to see how to do this in base R

# Provide better names for term.id columns
colnames(bp_oe)[colnames(bp_oe) == "term.id"] <- "GO_id"

# Provide better names for term.name columns
colnames(bp_oe)[colnames(bp_oe) == "term.name"] <- "GO_term"

Note

In the case of two packages with identical function names, you can use :: with the package name before and the function name after (e.g stats::filter()) to ensure that the correct function is implemented. The :: can also be used to bring in a function from a library without loading it first.

In the example above, we wanted to use the rename() function specifically from the dplyr package, and not any of the other packages (or base R) which may have the rename() function.

Exercise 3

Rename the intersection column to genes to reflect the fact that these are the DE genes associated with the GO process.

6. Create additional metrics for plotting (e.g. gene ratios)

Finally, before we plot our data, we need to create a couple of additional metrics. The mutate() function enables you to create a new column from an existing column.

Let’s generate gene ratios to reflect the number of DE genes associated with each GO process relative to the total number of DE genes.

# Create gene_ratio column based on other columns in dataset
bp_oe <- bp_oe %>%
  mutate(gene_ratio = overlap.size / query.size)

# Print object after creating the new column
bp_oe

# A tibble: 668 × 8
   GO_id     GO_term  p.value query.size term.size overlap.size genes gene_ratio
   <chr>     <chr>      <dbl>      <dbl>     <dbl>        <dbl> <chr>      <dbl>
 1 GO:00104… gene e… 6.71e-66       5850      5257         2142 gclc…      0.366
 2 GO:00903… nuclei… 1.18e-61       5850      5103         2073 gclc…      0.354
 3 GO:00061… nucleo… 2.49e-58       5850      5731         2271 dpm1…      0.388
 4 GO:00160… RNA me… 7.28e-57       5850      4597         1881 gclc…      0.322
 5 GO:00090… macrom… 3.12e-54       5850      5066         2030 dpm1…      0.347
 6 GO:00346… cellul… 5.6 e-54       5850      4907         1975 dpm1…      0.338
 7 GO:00442… cellul… 2.10e-47       5850      4882         1938 gclc…      0.331
 8 GO:00104… regula… 4.25e-46       5850      4297         1733 gclc…      0.296
 9 GO:20001… regula… 1.22e-40       5850      3960         1593 gclc…      0.272
10 GO:00105… regula… 2.22e-39       5850      4073         1626 gclc…      0.278
# ℹ 658 more rows

Click here to see how to do this in base R

# Create gene ratio column based on other columns in dataset
bp_oe <- cbind(bp_oe, gene_ratio = bp_oe$overlap.size / bp_oe$query.size)

Exercise 4

Create a column in bp_oe called term_percent to determine the percent of DE genes associated with the GO term relative to the total number of genes associated with the GO term (overlap.size / term.size)

Our final data for plotting should look like the table below:

View(bp_oe)

Next steps

Now that we have our results ready for plotting, we can use the ggplot2 package to plot our results. If you are interested, you can follow this lesson and dive into how to use ggplot2 to create the plots with this dataset.

Additional resources

Reuse

CC-BY-4.0

--- title: "Tidyverse data wrangling" description: | This lesson introduces participants to data wrangling using the `tidyverse`. Participants learn to work with pipes, tibbles and key functions from `dplyr` to filter, select, arrange, rename and transform data. The lesson guides participants through preparing functional analysis results for visualization, building practical skills for organizing and analyzing biological data. author: - Mary Piper - Meeta Mistry - Radhika Khetani - Jihe Liu - Will Gammerdinger - Noor Sohail date: "2025-05-30" categories: - R Programming - Data wrangling keywords: - R - tidyverse - pipes - select - filter - mutate - arrange license: "CC-BY-4.0" editor_options: markdown: wrap: 72 --- Approximate time: 55 minutes ```{r} #| label: load_DT #| echo: false # Load the DT package library(DT) ``` ## Learning Objectives - Perform basic data wrangling with functions in the Tidyverse package. # Data Wrangling with Tidyverse The [Tidyverse suite of integrated packages](https://www.tidyverse.org/packages/) are designed to work together to make common data science operations more user friendly. The packages have functions for data wrangling, tidying, reading/writing, parsing, and visualizing, among others. There is a freely available book, [R for Data Science](http://r4ds.had.co.nz/index.html), with detailed descriptions and practical examples of the tools available and how they work together. We will explore the basic syntax for working with these packages, as well as, specific functions for data wrangling with the 'dplyr' package and data visualization with the 'ggplot2' package. ![](../img/tidyverse_website.png) ## Tidyverse basics The Tidyverse suite of packages introduces users to a set of data structures, functions and operators to make working with data more intuitive, but is slightly different from the way we do things in base R. **Two important new concepts we will focus on are pipes and tibbles**. Before we get started with pipes or tibbles, let's load the library: ```{r} #| label: load_libraries # Load the tidyverse package library(tidyverse) ``` ### Pipes Stringing together commands in R can be quite daunting. Also, trying to understand code that has many nested functions can be confusing. To make R code more human readable, the Tidyverse tools use the pipe, `%>%`, which was acquired from the `magrittr` package and is now part of the `dplyr` package that is installed automatically with Tidyverse. **The pipe allows the output of a previous command to be used as input to another command instead of using nested functions.** ::: {.callout-note} Shortcut to write the pipe is <kbd>shift</kbd> + <kbd>command</kbd> + <kbd>M</kbd> ::: An example of using the pipe to run multiple commands: ```{r} #| label: pipe_example # A single command sqrt(83) # Base R method of running more than one command round(sqrt(83), digits = 2) # Running more than one command with piping sqrt(83) %>% round(digits = 2) ``` The pipe represents a much easier way of writing and deciphering R code, and so we will be taking advantage of it, when possible, as we work through the remaining lesson. ::: {.callout-tip} # [**Exercise 1**](18_tidyverse-Answer_key.qmd#exercise-1) Create a vector of random numbers using the code below: ```{r} #| label: create_random_numbers_vector # Create a vector of random numbers random_numbers <- c(81, 90, 65, 43, 71, 29) ``` Use the pipe (`%>%`) to perform two steps: 1. Take the mean of `random_numbers` using the `mean()` function. 2. Round the output to three digits using the `round()` function. ::: ### Tibbles A core component of the [tidyverse](http://tidyverse.org/) is the [tibble](http://tibble.tidyverse.org/). **Tibbles are a modern rework of the standard `data.frame`, with some internal improvements** to make code more reliable. They are data frames, but do not follow all of the same rules. For example, tibbles can have numbers/symbols for column names, which is not normally allowed in base R. **Important: [tidyverse](http://tidyverse.org/) is very opinionated about row names**. These packages insist that all column data (e.g. `data.frame`) be treated equally, and that special designation of a column as `rownames` should be deprecated. [Tibble](http://tibble.tidyverse.org/) provides simple utility functions to handle rownames: `rownames_to_column()` and `column_to_rownames()`. Tibbles can be created directly using the `tibble()` function or data frames can be converted into tibbles using `as_tibble(name_of_df)`. ::: {.callout-note} The function `as_tibble()` will ignore row names, so if a column representing the row names is needed, then the function `rownames_to_column(name_of_df)` should be run prior to turning the data.frame into a tibble. Also, `as_tibble()` will not coerce character vectors to factors by default. ::: ## Experimental data We're going to explore the Tidyverse suite of tools to wrangle our data to prepare it for visualization. You should have downloaded the file called `gprofiler_results_Mov10oe.tsv` into your R project's `data` folder earlier. ::: {.callout-note} If you do not have the `gprofiler_results_Mov10oe.tsv` file in your `data` folder, you can right click and download it into the `data` folder [using this link](https://github.com/hbctraining/Training-modules/blob/master/Tidyverse_ggplot2/data/gprofiler_results_Mov10oe.tsv?raw=true). ::: **The dataset:** - Represents the **functional analysis results**, including the biological processes, functions, pathways, or conditions that are over-represented in a given list of genes. - Our gene list was generated by **differential gene expression analysis** and the genes represent differences between **control mice** and **mice over-expressing a gene involved in RNA splicing**. The functional analysis that we will focus on involves **gene ontology (GO) terms**, which: - describe the roles of genes and gene products - organized into three controlled vocabularies/ontologies (domains): - biological processes (BP) - cellular components (CC) - molecular functions (MF) <img src="../img/mov10_FA.png" width="800"/> ## Analysis goal and workflow **Goal:** *Visually compare the most significant biological processes (BP) based on the number of associated differentially expressed genes (gene ratios) and significance values by creating the following plot:* ![](../img/dotplot6.png) To wrangle our data in preparation for the plotting, we are going to use the Tidyverse suite of tools to wrangle and visualize our data through several steps: 1. Read in the functional analysis results 2. Extract only the GO biological processes (BP) of interest 3. Select only the columns needed for visualization 4. Order by significance (p-adjusted values) 5. Rename columns to be more intuitive 6. Create additional metrics for plotting (e.g. gene ratios) 7. Plot results ## Tidyverse tools While all of the tools in the Tidyverse suite are deserving of being explored in more depth, we are going to investigate more deeply the reading (`readr`), wrangling (`dplyr`), and plotting (`ggplot2`) tools. ## 1. Read in the functional analysis results While the base R packages have perfectly fine methods for reading in data, the `readr` and `readxl` Tidyverse packages offer additional methods for reading in data. Let's read in our tab-delimited functional analysis results using `read_delim()`: ```{r} #| label: load_data # Read in the functional analysis results functional_GO_results <- read_delim(file = "data/gprofiler_results_Mov10oe.txt", delim = "\t" ) # Take a look at the results functional_GO_results ``` <details> <summary>Click here to see how to do this in base R</summary> ```{r} #| label: load_data_base_r #| eval: false # Read in the functional analysis results functional_GO_results <- read.delim(file = "data/gprofiler_results_Mov10oe.txt", sep = "\t" ) # Take a look at the results functional_GO_results ``` </details> Notice that the results were automatically read in as a tibble and the output gives the number of rows, columns and the data type for each of the columns. ::: {.callout-note} A large number of tidyverse functions will work with both tibbles and dataframes, and the data structure of the output will be identical to the input. However, there are some functions that will return a tibble (without row names), whether or not a tibble or dataframe is provided. ::: ## 2. Extract only the GO biological processes (BP) of interest Now that we have our data, we will need to wrangle it into a format ready for plotting. For all of our data wrangling steps we will be using tools from the [dplyr](http://dplyr.tidyverse.org/) package, which is a swiss-army knife for data wrangling of data frames. To extract the biological processes of interest, we only want those rows where the `domain` is equal to `BP`, which we can do using the `filter()` function. To filter rows of a data frame/tibble based on values in different columns, we give a logical expression as input to the `filter()` function to return those rows for which the expression is TRUE. Now let's return only those rows that have a `domain` of `BP`: ```{r} #| label: filter_example # Filter for only GO biological processes bp_oe <- functional_GO_results %>% filter(domain == "BP") # Print object after filtering bp_oe ``` <details> <summary>Click here to see how to do this in base R</summary> ```{r} #| label: filter_base_r #| eval: false # Create an index for rows where the domain is biological processes idx <- functional_GO_results$domain == "BP" # Subset object by the index bp_oe <- functional_GO_results[idx,] # Print object after subsetting bp_oe ``` </details> Now we have returned only those rows with a `domain` of `BP`. **How have the dimensions of our results changed?** ::: {.callout-tip} # [**Exercise 2**](18_tidyverse-Answer_key.qmd#exercise-2) We would like to perform an additional round of filtering to only keep the most specific GO terms. 1. For `bp_oe`, use the `filter()` function to only keep those rows where the `relative.depth` is greater than 4. 2. Save output to overwrite our `bp_oe` object ::: ```{r} #| label: filter_depth #| echo: false bp_oe <- bp_oe %>% filter(relative.depth > 4) ``` ## 3. Select only the columns needed for visualization For visualization purposes, we are only interested in the columns related to the GO terms, the significance of the terms, and information about the number of genes associated with the terms. To extract columns from a data frame/tibble we can use the `select()` function. In contrast to base R, we do not need to put the column names in quotes for selection. ```{r} #| label: select_example # Selecting columns to keep bp_oe <- bp_oe %>% select(term.id, term.name, p.value, query.size, term.size, overlap.size, intersection) # Print object after selecting the columns to keep bp_oe ``` <details> <summary>Click here to see how to do this in base R</summary> ```{r} #| label: select_base_r #| eval: false # Selecting columns to keep bp_oe <- bp_oe[, c("term.id", "term.name", "p.value", "query.size", "term.size", "overlap.size", "intersection")] # Print object after selecting the columns to keep bp_oe ``` </details> The `select()` function also allows for negative selection. So we could have alternately removed columns with negative selection. Note that we need to put the column names inside of the combine (`c()`) function with a `-` preceding it for this functionality. ```{r} #| label: negative_select_example #| eval: false # DO NOT RUN # Selecting columns to remove bp_oe <- bp_oe %>% select(-c(query.number, significant, recall, precision, subgraph.number, relative.depth, domain)) ``` <details> <summary>Click here to see how to do this in base R</summary> ```{r} #| label: negative_select_base_r #| eval: false # DO NOT RUN # Create an index of columns to keep by giving R a vector of columns to remove idx <- !(colnames(bp_oe) %in% c("query.number", "significant", "recall", "precision", "subgraph.number", "relative.depth", "domain")) # Subset object by the index bp_oe <- bp_oe[, idx] ``` </details> ## 4. Order GO processes by significance (adjusted p-values) Now that we have only the rows and columns of interest, let's arrange these by significance, which is denoted by the adjusted p-value. Let's sort the rows by adjusted p-value with the `arrange()` function. ```{r} #| label: arrange_ascending_example # Order by adjusted p-value ascending bp_oe <- bp_oe %>% arrange(p.value) ``` <details> <summary>Click here to see how to do this in base R</summary> ```{r} #| label: arrange_descending_base_r #| eval: false # Create an index ordering the rows by ascending adjusted p-value idx <- order(bp_oe$p.value) # Subset object by the index bp_oe <- bp_oe[idx,] ``` </details> ::: {.callout-note} If you wanted to arrange in descending order, then you could have run the following instead: ```{r} #| label: arrange_descending_example #| eval: false # DO NOT RUN # Order by adjusted p-value descending bp_oe <- bp_oe %>% arrange(desc(p.value)) ``` <details> <summary>Click here to see how to do this in base R</summary> ```{r} #| label: arrange_ascending_example_base_r #| eval: false # DO NOT RUN # Create an index ordering the rows by descending adjusted p-value idx <- order(bp_oe$p.value, decreasing = TRUE) # Subset object by the index bp_oe <- bp_oe[idx,] ``` </details> ::: ::: {.callout-note} Ordering variables in `ggplot2` is a bit different. [This post](https://www.r-graph-gallery.com/267-reorder-a-variable-in-ggplot2.html) introduces a few ways of ordering variables in a plot. ::: ## 5. Rename columns to be more intuitive While not necessary for our visualization, renaming columns more intuitively can help with our understanding of the data using the `rename()` function. The syntax is `new_name` = `old_name`. Let's rename the `term.id` and `term.name` columns. ```{r} #| label: rename_example # Provide better names for columns bp_oe <- bp_oe %>% dplyr::rename(GO_id = term.id, GO_term = term.name) # Print object after renaming the columns bp_oe ``` <details> <summary>Click here to see how to do this in base R</summary> ```{r} #| label: rename_example_base_r #| eval: false # Provide better names for term.id columns colnames(bp_oe)[colnames(bp_oe) == "term.id"] <- "GO_id" # Provide better names for term.name columns colnames(bp_oe)[colnames(bp_oe) == "term.name"] <- "GO_term" ``` </details> ::: {.callout-note} In the case of two packages with identical function names, you can use `::` with the package name before and the function name after (e.g `stats::filter()`) to ensure that the correct function is implemented. The `::` can also be used to bring in a function from a library without loading it first. In the example above, we wanted to use the `rename()` function specifically from the `dplyr` package, and not any of the other packages (or base R) which may have the `rename()` function. ::: ::: {.callout-tip} # [**Exercise 3**](18_tidyverse-Answer_key.qmd#exercise-3) Rename the `intersection` column to `genes` to reflect the fact that these are the DE genes associated with the GO process. ::: ```{r} #| label: rename_intersection_to_genes #| echo: false bp_oe <- bp_oe %>% dplyr::rename(genes = intersection) ``` ## 6. Create additional metrics for plotting (e.g. gene ratios) Finally, before we plot our data, we need to create a couple of additional metrics. The `mutate()` function enables you to create a new column from an existing column. Let's generate gene ratios to reflect the number of DE genes associated with each GO process relative to the total number of DE genes. ```{r} #| label: mutate_example # Create gene_ratio column based on other columns in dataset bp_oe <- bp_oe %>% mutate(gene_ratio = overlap.size / query.size) # Print object after creating the new column bp_oe ``` <details> <summary>Click here to see how to do this in base R</summary> ```{r} #| label: mutate_example_base_r #| eval: false # Create gene ratio column based on other columns in dataset bp_oe <- cbind(bp_oe, gene_ratio = bp_oe$overlap.size / bp_oe$query.size) ``` </details> ::: {.callout-tip} # [**Exercise 4**](18_tidyverse-Answer_key.qmd#exercise-4) Create a column in `bp_oe` called `term_percent` to determine the percent of DE genes associated with the GO term relative to the total number of genes associated with the GO term (`overlap.size` / `term.size`) ::: ```{r} #| label: add_column_term_percent #| echo: false bp_oe <- bp_oe %>% mutate(term_percent = overlap.size / term.size) ``` Our final data for plotting should look like the table below: ```{r} #| label: print_dataframe #| eval: false View(bp_oe) ``` ```{r} #| label: render_dataframe #| echo: false datatable(bp_oe) ``` ## Next steps Now that we have our results ready for plotting, we can use the [ggplot2](https://ggplot2.tidyverse.org) package to plot our results. If you are interested, you can follow [this lesson](Aside_ggplot2.qmd) and dive into how to use `ggplot2` to create the plots with this dataset. ### Additional resources - [R for Data Science](http://r4ds.had.co.nz) - [teach the tidyverse](http://varianceexplained.org/r/teach-tidyverse/) - [tidy style guide](http://style.tidyverse.org/)