Plotting and data visualization in R

Author

Mary Piper, Meeta Mistry, Radhika Khetani

Published

December 4, 2019

Learning Objectives

Describe the map() function for iterative tasks on data structures.

Setting up a data frame for visualization

In this lesson we want to make plots to evaluate the average expression in each sample and its relationship with the age of the mouse. So, to this end, we will be adding a couple of additional columns of information to the metadata data frame that we can utilize for plotting.

Calculating average expression

Let’s take a closer look at our counts data. Each column represents a sample in our experiment, and each sample has ~38K values corresponding to the expression of different transcripts. We want to compute the average value of expression for each sample eventually.

Taking this one step at a time, what would we do if we just wanted the average expression for Sample 1 (across all transcripts)? We can use the R base package provided function called mean():

mean(rpkm_ordered$sample1)

[1] 10.2661

That is great, if we only wanted the average from one of the samples (one column in a data frame), but we need to get this information from all 12 samples, so all 12 columns. It would be ideal to get a vector of 12 values that we can add to the metadata data frame. What is the best way to do this?

Programming languages typically have a way to allow the execution of a single line of code or several lines of code multiple times, or in a “loop”. While “for loops” are available in R, there are other easier-to-use functions that can achieve this - for example, the apply() family of functions and the map() family of functions.

The map() family is a bit more intuitive to use than apply() and we will be using it today. However, if you are interested in learning more about theapply() family of functions we have materials available here.

To obtain mean values for all samples we can use the map_dbl() function which generates a numeric vector.

library(purrr)  # Load the purrr

samplemeans <- map_dbl(rpkm_ordered, mean)

The map family of functions

The map() family of functions is available from the purrr package, which is part of the tidyverse suite of packages. More detailed information is available in the R for Data Science book. This family includes several functions, each taking a vector as input and outputting a vector of a specified type. For example, we can use these functions to execute some task/function on every element in a vector, or every column in a dataframe, or every component of a list, and so on.

map() creates a list.
map_lgl() creates a logical vector.
map_int() creates an integer vector.
map_dbl() creates a “double” or numeric vector.
map_chr() creates a character vector.

The syntax for the map() family of functions is:

## DO NOT RUN
map(object, function_to_apply)

If you would like to practice with the map() family of functions, we have additional materials available.

Creating a new metadata object with additional information

Because the input was 12 columns of information the output of map_dbl() is a named vector of length 12.

# Named vectors have a name assigned to each element 
# instead of just referring to them as indices ([1], [2] and so on)
samplemeans

  sample1   sample2   sample3   sample4   sample5   sample6   sample7   sample8 
10.266102 10.849759  9.452517 15.833872 15.590184 15.551529 15.522219 13.808281 
  sample9  sample10  sample11  sample12 
14.108399 10.743292 10.778318  9.754733

# Check length of the vector before adding it to the data frame
length(samplemeans)

[1] 12

Let’s also create a vector with the ages of each of the mice in our data set.

# Create a numeric vector with ages. Note that there are 12 elements here
age_in_days <- c(40, 32, 38, 35, 41, 32, 34, 26, 28, 28, 30, 32)

Now, we are ready to combine the metadata data frame with the two new vectors to create a new data frame with 5 columns

# Add the new vector as the last column to the new_metadata dataframe
new_metadata <- data.frame(metadata, samplemeans, age_in_days)

# Take a look at the new_metadata object
View(new_metadata)

Note

Note that we could have also combined columns using the cbind() function as shown in the code below:

## DO NOT RUN
new_metadata <- cbind(metadata, samplemeans, age_in_days)

The two functions work identically with the exception of assigning row names. For example, if we were combining columns and wanted to add in a vector of row names, we could easily do so in data.frame() with the use of the row.names argument. This argument is not available for the cbind() function.

We are now ready for plotting and data visualization!

--- title: "Plotting and data visualization in R" author: "Mary Piper, Meeta Mistry, Radhika Khetani" date: "Wednesday, December 4, 2019" --- ```{r} #| eval: true #| echo: false rpkm_data <- read.csv("data/counts.rpkm.csv") metadata <- read.csv(file="data/mouse_exp_design.csv") genomic_idx <- match(rownames(metadata), colnames(rpkm_data)) rpkm_ordered <- rpkm_data[ , genomic_idx] ``` ## Learning Objectives * Describe the `map()` function for iterative tasks on data structures. ## Setting up a data frame for visualization In this lesson we want to make plots to evaluate the average expression in each sample and its relationship with the age of the mouse. So, to this end, we will be **adding a couple of additional columns of information to the `metadata`** data frame that we can utilize for plotting. <img src="../img/new_metadata.png" width="600"> ### Calculating average expression Let's take a closer look at our counts data. Each column represents a sample in our experiment, and each sample has ~38K values corresponding to the expression of different transcripts. We want to compute **the average value of expression** for each sample eventually. Taking this one step at a time, what would we do if we just wanted the average expression for Sample 1 (across all transcripts)? We can use the R base package provided function called `mean()`: ```{r} mean(rpkm_ordered$sample1) ``` That is great, if we only wanted the average from one of the samples (one column in a data frame), but we need to get this information from all 12 samples, so all 12 columns. It would be ideal to get a vector of 12 values that we can add to the metadata data frame. What is the best way to do this? Programming languages typically have a way to allow the execution of a single line of code or several lines of code multiple times, or in a "loop". While "for loops" are available in R, there are other easier-to-use functions that can achieve this - for example, the `apply()` family of functions and the `map()` family of functions. The `map()` family is a bit more intuitive to use than `apply()` and we will be using it today. However, if you are interested in learning more about the`apply()` family of functions we have [materials available here](https://hbctraining.github.io/Intro-to-R/lessons/apply_functions.html). To obtain **mean values for all samples** we can use the `map_dbl()` function which generates a numeric vector. ```{r} library(purrr) # Load the purrr samplemeans <- map_dbl(rpkm_ordered, mean) ``` ::: callout-note # The `map` family of functions The `map()` family of functions is available from the **`purrr`** package, which is part of the tidyverse suite of packages. More detailed information is available in the [R for Data Science](http://r4ds.had.co.nz/iteration.html#the-map-functions) book. This family includes several functions, each taking a vector as input and outputting a vector of a specified type. For example, we can use these functions to execute some task/function on every element in a vector, or every column in a dataframe, or every component of a list, and so on. - `map()` creates a list. - `map_lgl()` creates a logical vector. - `map_int()` creates an integer vector. - `map_dbl()` creates a "double" or numeric vector. - `map_chr()` creates a character vector. The syntax for the `map()` family of functions is: ```{r} #| eval: false ## DO NOT RUN map(object, function_to_apply) ``` If you would like to practice with the `map()` family of functions, we have [additional materials](https://hbctraining.github.io/Intro-to-R/lessons/map_purrr.html) available. ::: ### Creating a new metadata object with additional information Because the input was 12 columns of information the **output of `map_dbl()` is a *named* vector of length 12**. ```{r} # Named vectors have a name assigned to each element # instead of just referring to them as indices ([1], [2] and so on) samplemeans # Check length of the vector before adding it to the data frame length(samplemeans) ``` Let's also create a vector with the ages of each of the mice in our data set. ```{r} # Create a numeric vector with ages. Note that there are 12 elements here age_in_days <- c(40, 32, 38, 35, 41, 32, 34, 26, 28, 28, 30, 32) ``` Now, we are ready to **combine the `metadata` data frame with the two new vectors** to create a new data frame with 5 columns ```{r} # Add the new vector as the last column to the new_metadata dataframe new_metadata <- data.frame(metadata, samplemeans, age_in_days) ``` ```{r} #| eval: false # Take a look at the new_metadata object View(new_metadata) ``` ::: callout-note Note that we could have also combined columns using the `cbind()` function as shown in the code below: ```{r} #| eval: false ## DO NOT RUN new_metadata <- cbind(metadata, samplemeans, age_in_days) ``` The two functions work identically with the exception of assigning row names. For example, if we were combining columns and wanted to add in a vector of row names, we could easily do so in `data.frame()` with the use of the `row.names` argument. This argument is not available for the `cbind()` function. ::: We are now ready for plotting and data visualization!