Plotting and data visualization in R

Approximate time: 90 minutes

Learning Objectives

Explain the grammar of graphics syntax used by ggplot2
Determine how to plot different types of graphs with ggplot2 depending on the number and type of variables
Export plots for use outside of the R environment.

Data Visualization with `ggplot2`

When we are working with large sets of numbers it can be useful to display that information graphically to gain more insight. Visualization deserves an entire course of its own (there is that much to know!). In this lesson we will be plotting with the popular package ggplot2.

More recently, R users have moved away from base graphic options towards ggplot2 since it offers a lot more functionality as compared to the base R plotting functions. The ggplot2 syntax takes some getting used to, but once you get it, you will find it’s extremely powerful and flexible. We will start with drawing a simple X-Y scatterplot of gene_ratio versus GO_term from the bp_oe tibble. ggplot2 expects that the information being plotted is contained in a data frame or tibble (data frame-like).

We would typically start by loading the ggplot2 library, but it is a part of the tidyverse suite, so it was loaded in the last lesson.

Please click here to download the data output from the previous lesson if you do not have it.

ggplot2 syntax: To initialize the basic graph structure with this package we have to use the ggplot() function, then we add “layers” to it using the + operator. The idea is that you create a basic plot first, then additional functions are added on to build the final plot.

Let’s start by first subsetting the bp_oe dataframe to only contain the top 30 most significant GO terms:

## Visualizing data with ggplot2

# Subset data frame
bp_plot <- bp_oe[1:30, ]

ggplot(bp_plot) # what happens? 

You get a blank plot, because you need to specify layers using the + operator.

One type of layer is geometric objects. This is a mandatory layer and it specifies what type of plot you are interested in making. Examples include:

points (geom_point() and geom_jitter(), for scatter plots, dot plots, etc.)
lines (geom_line(), for time series, trend lines, etc.)
boxplots (geom_boxplot(), for, well, boxplots!)

For a more exhaustive list on all possible geometric objects and when to use them check out Hadley Wickham’s RPubs or the RStudio cheatsheet.

A plot must have at least one geom, and there can be multiple complementary geoms; there is no upper limit.

Let’s add a geom to make a scatter plot, i.e. the geom_point() function:

ggplot(bp_plot) +
  geom_point() 

You will find that even though we have added a layer by specifying geom_point(), we get an error. This is because each type of geom usually has a required set of aesthetics. Aesthetic mappings are set with the aes() function. When aes() is nested inside a geom function such as geom_point(), the mappings apply only to that layer; when aesthetics are supplied within ggplot(), they are used as defaults for every layer. Below are some examples of what is categorized as aesthetics in the ggplot2 context:

position (i.e. columns to be used for the x and y axes)
color (“outside” color of the data point/bar being plotted)
fill (“inside” color of the data point/bar being plotted)
shape (of data points)
linetype
size (of data points)
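
As a minimal sketch of the two places where aesthetics can be set, the plots below should draw the same points. Note that toy_bp is a made-up data frame that only imitates the columns of bp_plot; its values are invented for illustration.

```r
library(ggplot2)

# Hypothetical stand-in for bp_plot; the values are invented
toy_bp <- data.frame(
  GO_term      = c("term A", "term B", "term C"),
  gene_ratio   = c(0.10, 0.25, 0.40),
  p.value      = c(1e-8, 1e-5, 1e-3),
  overlap.size = c(12, 30, 55)
)

# Aesthetics set inside the geom apply to that layer only
p_layer <- ggplot(toy_bp) +
  geom_point(aes(x = gene_ratio, y = GO_term))

# Aesthetics set inside ggplot() become defaults for every layer
p_global <- ggplot(toy_bp, aes(x = gene_ratio, y = GO_term)) +
  geom_point()
```

Setting the aesthetics in ggplot() is mostly a convenience when several layers share the same x and y columns.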

To start, we will specify the columns for the x- and y-axis since geom_point() requires the most basic information about a scatterplot, i.e. what you want to plot on the x and y axes.

Typically, a scatterplot is used to illustrate the relationship between two numeric variables. The x-axis represents the independent variable and the y-axis represents the dependent variable. We can test this out using two numeric columns from our bp_plot tibble:

ggplot(bp_plot) +
  geom_point(aes(x = overlap.size, y = p.value))

However, instead of a scatterplot with numeric values on both axes, we would like to create a dotplot of the top 30 functional categories in our dataset that shows how prevalent they are. This dotplot will have the gene ratio values on the x-axis and the GO terms on the y-axis.

Let’s see what happens when we add a non-numeric value to the y-axis and change the x-axis to the “gene_ratio” column:

ggplot(bp_plot) +
  geom_point(aes(x = gene_ratio, y = GO_term))

Now that we have the required aesthetics, let’s add some extras like color to the plot. To quickly visualize the significance of the GO terms, we can color the points based on their p-values by specifying the column header.

ggplot(bp_plot) +
  geom_point(aes(x = gene_ratio, y = GO_term, color = p.value))

You will notice that there is a default set of colors that will be used, so we do not have to specify which colors to use. Also, the legend has been conveniently plotted for us!

Alternatively, we could color the points by the number of DE genes associated with each term (overlap.size).

ggplot(bp_plot) +
  geom_point(aes(x = gene_ratio, y = GO_term, color = overlap.size))

Moving forward, we are going to stick with coloring the dots based on the p.value column. Let’s explore some of the other arguments that can be specified in the geom layer.

To modify the size of the data points we can use the size argument.

If we add size inside aes() we could assign a numeric column to it and the size of the data points would change according to that column.
However, if we add size inside the geom_point() but outside aes() we can’t assign a column to it, instead we have to give it a numeric value. This use of size will uniformly change the size of all the data points.

Note: This is true for several arguments, including color, shape, etc. For example, we can change all shapes to squares by placing the shape argument outside the aes() function; if we put the argument inside aes(), the shape changes according to a (categorical) variable in our data frame or tibble.
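
The difference between mapping and setting can be sketched as follows. Here toy_bp is a made-up data frame standing in for bp_plot (the values are invented), and the fixed size of 3 is an arbitrary choice.

```r
library(ggplot2)

# Hypothetical stand-in for bp_plot; the values are invented
toy_bp <- data.frame(
  GO_term      = c("term A", "term B", "term C"),
  gene_ratio   = c(0.10, 0.25, 0.40),
  p.value      = c(1e-8, 1e-5, 1e-3),
  overlap.size = c(12, 30, 55)
)

# Inside aes(): size is mapped to a numeric column, so it varies by row
# and a size legend is drawn
p_mapped <- ggplot(toy_bp) +
  geom_point(aes(x = gene_ratio, y = GO_term, size = overlap.size))

# Outside aes(): a single fixed value, so every point gets the same
# size and shape and no legend is needed
p_fixed <- ggplot(toy_bp) +
  geom_point(aes(x = gene_ratio, y = GO_term),
             size = 3,
             shape = "square")
```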

We have decided that we want to change the size of all the data points to a uniform size instead of tying it to a numeric column in the input tibble. Add in the size argument by specifying a number for the size of the data points:

ggplot(bp_plot) +
  geom_point(aes(x = gene_ratio, y = GO_term, color = p.value), 
             size = 2)

Note: The size of the points is personal preference, and you may need to play around with the parameter to decide which size is best. That seems a bit too small, so we can try out a slightly larger size.

As we do that, let’s see how we can change the shape of the data point. Different shapes are available, as detailed in the RStudio ggplot2 cheatsheet. Let’s explore this parameter by changing all of the points to squares:

ggplot(bp_plot) +
  geom_point(aes(x = gene_ratio, y = GO_term, color = p.value), 
             size = 2, 
             shape = "square")

Now we can start updating the plot to suit our preferences for how we want the data displayed. The labels on the x- and y-axis are also quite small and not very descriptive. To change their size and labeling, we need to add additional theme layers. The ggplot2 theme() system handles modification of non-data plot elements such as:

Axis label aesthetics
Plot background
Facet label background
Legend appearance

There are built-in themes that we can use (e.g. theme_bw()) that mostly change the background/foreground colors; we apply one by adding it as an additional layer. Alternatively, we can adjust specific elements of the current default theme by adding a theme() layer and passing in arguments for the things we wish to change. Or we can use both: a built-in theme layer and a custom theme layer!

Let’s add a built-in theme layer theme_bw() first.

ggplot(bp_plot) +
  geom_point(aes(x = gene_ratio, y = GO_term, color = p.value), 
             size = 2) +
  theme_bw()

Do the axis labels or the tick labels get any larger by changing themes?

Not in this case. But we can add arguments using theme() to change the size ourselves. Since we are adding this layer on top (i.e. later in sequence), any features we change will override what is set in theme_bw(). Here we’ll increase the size of both the axis titles and the x-axis tick labels to 1.15 times the default size.

ggplot(bp_plot) +
  geom_point(aes(x = gene_ratio, y = GO_term, color = p.value), 
             size = 2) +
  theme_bw() +
  theme(axis.text.x = element_text(size=rel(1.15)),
        axis.title = element_text(size=rel(1.15)))

Note #1: When modifying the size of text we often use the rel() function to specify the size we want relative to the default. We can also provide a numeric value as we did with the data point size, but it can be cumbersome if you don’t know what the default font size is to begin with.

Note #2: You can use the example("geom_point") function here to explore a multitude of different aesthetics and layers that can be added to your plot. As you scroll through the different plots, take note of how the code is modified. You can use this with any of the different geom layers available in ggplot2 to learn how you can easily modify your plots!

Note #3: RStudio provides a very useful cheatsheet for plotting with ggplot2. It shows example plots together with the associated code (i.e. which geom or theme to use in each situation).
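
To illustrate Note #1, here is a small sketch comparing a relative and an absolute text size. The toy_bp data frame is a made-up stand-in for bp_plot, and the absolute size of 14 is an arbitrary value chosen for the comparison.

```r
library(ggplot2)

# Hypothetical stand-in for bp_plot; the values are invented
toy_bp <- data.frame(
  GO_term    = c("term A", "term B", "term C"),
  gene_ratio = c(0.10, 0.25, 0.40)
)

base_plot <- ggplot(toy_bp) +
  geom_point(aes(x = gene_ratio, y = GO_term)) +
  theme_bw()

# Relative: 1.15 times whatever size the axis titles would otherwise have
p_rel <- base_plot +
  theme(axis.title = element_text(size = rel(1.15)))

# Absolute: a fixed size in points, independent of the theme's defaults
p_abs <- base_plot +
  theme(axis.title = element_text(size = 14))
```

The relative version keeps its proportions if the theme's base font size is changed later, which is one reason rel() tends to be the more convenient choice.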

Exercises

The current axis label text defaults to what we gave as input to geom_point() (i.e. the column headers). We can change this by adding additional layers called xlab() and ylab() for the x- and y-axis, respectively. Add these layers to the current plot such that:
- x-axis label: “Gene ratios”
- y-axis label: “Top 30 significant GO terms”
Add a ggtitle() layer to add a title to your plot.

NOTE: Useful code to center your title over your plot can be done using theme(plot.title=element_text(hjust=0.5, face = "bold")).

Customizing data point colors

The plot is looking better, but it is hard to distinguish differences in significance based on the colors used. There are cheatsheets available for specifying the base R colors by name or hexadecimal code. We could specify other colors available or use pre-created color palettes from an external R package.

To make additional color palettes available for plotting, we can load the RColorBrewer library, which contains color palettes designed specifically for the different types of data being compared.

# Install the RColorBrewer library (if you don't have it installed already)
install.packages("RColorBrewer")

# Load the RColorBrewer library
library(RColorBrewer)

# Check the available color palettes
display.brewer.all()

The output is separated into three sections based on the suggested palettes for sequential, qualitative, and diverging data.

Sequential palettes (top): For sequential data, with lighter colors for low values and darker colors for high values.
Qualitative palettes (middle): For categorical data, where the color does not denote differences in magnitude or value.
Diverging palettes (bottom): For data with emphasis on mid-range values and extremes.
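
If the full display is hard to read, the palettes can also be inspected one category at a time. The sketch below relies on the brewer.pal.info table that ships with RColorBrewer; the variable name seq_palettes is our own.

```r
library(RColorBrewer)

# brewer.pal.info is a data frame with one row per palette; its `category`
# column marks each palette as "seq", "qual", or "div"
seq_palettes <- rownames(brewer.pal.info[brewer.pal.info$category == "seq", ])
seq_palettes

# Show only the sequential palettes instead of all three sections
display.brewer.all(type = "seq")
```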

Since our adjusted p-values are sequential, we will choose from the sequential palettes. Let’s go with the “Yellow, orange, red” palette. We can choose how many colors from the palette to include, which may take some trial and error. We can preview the colors included in a palette with the display.brewer.pal() function and change the number if desired:

# Testing the palette with six colors
display.brewer.pal(6, "YlOrRd")

The yellow might be a bit too light, and we might not need so many different colors. Let’s test with three different colors:

# Testing the palette with three colors
display.brewer.pal(3, "YlOrRd")

# Define a palette
mypalette <- brewer.pal(3, "YlOrRd")

# how are the colors represented in the mypalette vector?
mypalette

Those colors look okay, so let’s test them in our plot. We can add a color scale layer, and most often one of the following two scales will work:

scale_color_manual(): for categorical data or quantiles
scale_color_gradient() family: for continuous data.
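
For the categorical case, a minimal sketch of scale_color_manual() might look like this. The toy_bp data frame, the sig_group column, and the 1e-4 cutoff are all invented for illustration and are not part of the lesson's dataset.

```r
library(ggplot2)

# Hypothetical stand-in for bp_plot; the values are invented
toy_bp <- data.frame(
  GO_term    = c("term A", "term B", "term C"),
  gene_ratio = c(0.10, 0.25, 0.40),
  p.value    = c(1e-8, 1e-5, 1e-3)
)

# Hypothetical grouping: flag terms below an arbitrary cutoff of 1e-4
toy_bp$sig_group <- ifelse(toy_bp$p.value < 1e-4, "p < 1e-4", "p >= 1e-4")

# A named vector assigns one color to each category
p_manual <- ggplot(toy_bp) +
  geom_point(aes(x = gene_ratio, y = GO_term, color = sig_group),
             size = 3) +
  scale_color_manual(values = c("p < 1e-4"  = "firebrick",
                                "p >= 1e-4" = "grey50"))
```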

By default, scale_color_gradient() creates a two color gradient from low to high. Since we plan to use more colors, we will use the more flexible scale_color_gradientn() function. To make the legend a bit cleaner, we will also perform a -log10 transform on the p-values (higher values mean more significant).

ggplot(bp_plot) +
  geom_point(aes(x = gene_ratio, y = GO_term, color = -log10(p.value)), 
             size = 2) +
  theme_bw() +
  theme(axis.text.x = element_text(size=rel(1.15)),
        axis.title = element_text(size=rel(1.15))) +
  xlab("Gene ratios") +
  ylab("Top 30 significant GO terms") +
  ggtitle("Dotplot of top 30 significant GO terms") +
  theme(plot.title = element_text(hjust=0.5, 
                                  face = "bold")) +
  scale_color_gradientn(colors = mypalette)

This looks good, but we want to add a better name for the legend and make sure the legend title is centered and bold. To do this, we can add a name argument to scale_color_gradientn() and a new theme layer for the legend title.

ggplot(bp_plot) +
  geom_point(aes(x = gene_ratio, y = GO_term, color = -log10(p.value)), 
             size = 2) +
  theme_bw() +
  theme(axis.text.x = element_text(size=rel(1.15)),
        axis.title = element_text(size=rel(1.15))) +
  xlab("Gene ratios") +
  ylab("Top 30 significant GO terms") +
  ggtitle("Dotplot of top 30 significant GO terms") +
  theme(plot.title = element_text(hjust=0.5, 
                                  face = "bold")) +
  scale_color_gradientn(name = "Significance \n (-log10(padj))", colors = mypalette) +
  theme(legend.title = element_text(size=rel(1.15),
                                    hjust=0.5, 
                                    face="bold"))

Exercises

Arrange bp_oe by term_percent in descending order.
Create a dotplot with the top 30 GO terms with highest term_percent, with term_percent as x-axis and GO_term as the y-axis.
[Optional] Color the plot using the palette of your choice.

So far we have explored many layers that can be added to any plot with the ggplot2 package. However, we haven’t explored the different geoms available. The type of data you are plotting will determine the type of geom needed, but a nice summary of the main geoms is available on the RStudio ggplot2 cheatsheet.

Let’s explore different geoms by creating a couple of different plots. We’ll start with a bar plot of the number of genes per category. We can start with the most basic plot by specifying the dataframe, geom, and aesthetics.

ggplot(bp_plot) +
  geom_col(aes(x = GO_term, y = overlap.size))

This is a good base to start from. Now let’s start to customize. To add color to the bars, we can use the fill argument, and if we would like to add an outline color to the bars, we can use the color argument.

ggplot(bp_plot) +
  geom_col(aes(x = GO_term, y = overlap.size),
           fill = "royalblue",
           color = "black")

Then we can provide our theme preferences, give the plot a title, and label our axes:

ggplot(bp_plot) +
  geom_col(aes(x = GO_term, y = overlap.size),
           fill = "royalblue",
           color = "black") +
  theme(axis.text.x = element_text(size=rel(1.15)),
        axis.title = element_text(size=rel(1.15))) +
  theme(plot.title = element_text(hjust=0.5, 
                                  face = "bold")) +
  labs(title = "DE genes per GO process", x = NULL, y =  "# DE genes")

Note that instead of using the functions xlab(), ylab(), and ggtitle(), we can provide all as arguments to the labs() function.

Now we might be fairly happy with our plot, but the x-axis labelling needs some help. Within the theme() layer, we can change the orientation of the x-axis labels with the angle argument and align the labels to the x-axis with the hjust argument.

ggplot(bp_plot) +
  geom_col(aes(x = GO_term, y = overlap.size),
           fill = "royalblue",
           color = "black") +
  theme(axis.text.x = element_text(size=rel(1.15)),
        axis.title = element_text(size=rel(1.15))) +
  theme(plot.title = element_text(hjust=0.5, 
                                  face = "bold")) +
  labs(title = "DE genes per GO process", x = NULL, y =  "# DE genes") +
  theme(axis.text.x = element_text(angle = 45, hjust = 1))

This is almost what we were looking for, but the labels are getting cut off because the plotting area is too small. The plot.margin argument of the theme() function can be used to alter the plot margins to make room for our labels. The four values give the top, right, bottom, and left margins, in that order.

ggplot(bp_plot) +
  geom_col(aes(x = GO_term, y = overlap.size),
           fill = "royalblue",
           color = "black") +
  theme(axis.text.x = element_text(size=rel(1.15)),
        axis.title = element_text(size=rel(1.15))) +
  theme(plot.title = element_text(hjust=0.5, 
                                  face = "bold")) +
  labs(title = "DE genes per GO process", x = NULL, y =  "# DE genes") +
  theme(axis.text.x = element_text(angle = 45, 
                                   hjust = 1)) + 
  theme(plot.margin = unit(c(1,1,1,3), "cm"))

NOTE: If we wanted to remove the space between the x-axis and the bottom of the bars, we could add an additional layer for scale_y_continuous(expand = c(0, 0)), which would not expand the y-axis past the plotting limits.
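
A small sketch of this option is shown below, using a made-up toy_bp data frame in place of bp_plot. The second variant, which keeps a little padding above the bars through expansion(), is our own suggestion rather than part of the lesson.

```r
library(ggplot2)

# Hypothetical stand-in for bp_plot; the values are invented
toy_bp <- data.frame(
  GO_term      = c("term A", "term B", "term C"),
  overlap.size = c(12, 30, 55)
)

bars <- ggplot(toy_bp) +
  geom_col(aes(x = GO_term, y = overlap.size),
           fill = "royalblue",
           color = "black")

# No padding at either end of the y-axis: bars sit directly on the x-axis,
# and the tallest bar touches the top of the panel
p_flush <- bars +
  scale_y_continuous(expand = c(0, 0))

# Variant: no padding at the bottom, 5% padding at the top
p_top_pad <- bars +
  scale_y_continuous(expand = expansion(mult = c(0, 0.05)))
```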

Exporting figures to file

There are two ways in which figures and plots can be output to a file (rather than simply displaying on screen). The first (and easiest) is to export directly from the RStudio ‘Plots’ panel, by clicking on Export when the image is plotted. This will give you the option of png or pdf and let you select the directory in which to save the file. It will also give you options to set the size and resolution of the output image.

The second option is to use R functions and have the write-to-file step hard-coded into your script. This allows you to run the script from start to finish and automate the process (no human point-and-click actions required to save).
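
As a hedged sketch of the scripted route, two common approaches are ggsave() from ggplot2 and a base R graphics device such as pdf(). The toy_bp data frame, the file names, and the 6 x 4 inch dimensions are assumptions made for this example; files are written to a temporary directory so nothing in your project is overwritten.

```r
library(ggplot2)

# Hypothetical stand-in for bp_plot; the values are invented
toy_bp <- data.frame(
  GO_term    = c("term A", "term B", "term C"),
  gene_ratio = c(0.10, 0.25, 0.40)
)

p <- ggplot(toy_bp) +
  geom_point(aes(x = gene_ratio, y = GO_term))

# Hypothetical output paths in a temporary directory
out_ggsave <- file.path(tempdir(), "dotplot_ggsave.pdf")
out_device <- file.path(tempdir(), "dotplot_device.pdf")

# Option A: ggsave() chooses the file type from the extension
# (e.g. use ".png" instead of ".pdf" to write a PNG)
ggsave(filename = out_ggsave, plot = p, width = 6, height = 4, units = "in")

# Option B: open a graphics device, print the plot, then close the device
pdf(out_device, width = 6, height = 4)
print(p)
dev.off()
```

With Option B, the file is only complete once dev.off() has been called, so it is worth keeping the three lines together in a script.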

Consistent formatting using custom functions

When publishing, it is helpful to ensure all plots have similar formatting. To do this we can create a custom function with our preferences for the theme.

## DO NOT RUN ##
name_of_function <- function(arguments) {
    # statements or code that does something
}
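
To make the template above concrete, here is a tiny runnable function. The name neg_log10 is hypothetical; it simply wraps the -log10 transform used on the p-values earlier in this lesson.

```r
# Hypothetical helper: convert p-values to the -log10 scale
neg_log10 <- function(p) {
  -log10(p)
}

# Call the function like any built-in R function
neg_log10(c(0.001, 0.05))
```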

Now, let’s suppose we always wanted our theme to include the following:

## DO NOT RUN ##
theme_bw() +
  theme(axis.text.x = element_text(size=rel(1.15)),
        axis.title = element_text(size=rel(1.15)),
        legend.title = element_text(size=10, 
                                    face="bold"),
        plot.title = element_text(hjust=0.5, 
                                  face = "bold"))

If there is nothing that we want to change when we run this, then we do not need to specify any arguments. Creating the function is simple; we can just put the code inside the {}:

personal_theme <- function(){
  theme_bw() +
    theme(axis.text.x = element_text(size=rel(1.15)),
          axis.title = element_text(size=rel(1.15)),
          legend.title = element_text(size=10, 
                                      face="bold"),
          plot.title = element_text(hjust=0.5, 
                                    face = "bold"))
}

Now to run our personal theme with any plot, we can use this function in place of all of the theme() layers:

ggplot(bp_plot) +
  geom_point(aes(x = gene_ratio, y = GO_term, color = p.value), 
             size = 2) +
  personal_theme() +
  xlab("Gene ratios") +
  ylab("Top 30 significant GO terms") +
  ggtitle("Dotplot of top 30 significant GO terms")
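
If we did want something to be adjustable, the function could take an argument. The sketch below is a hypothetical variant, personal_theme_scaled(), in which the text_scale argument (our own invention, defaulting to the 1.15 used above) controls the relative text size. The toy_bp data frame is again a made-up stand-in for bp_plot.

```r
library(ggplot2)

# Hypothetical variant of personal_theme() with one adjustable argument
personal_theme_scaled <- function(text_scale = 1.15) {
  theme_bw() +
    theme(axis.text.x = element_text(size = rel(text_scale)),
          axis.title = element_text(size = rel(text_scale)),
          legend.title = element_text(size = 10, face = "bold"),
          plot.title = element_text(hjust = 0.5, face = "bold"))
}

# Hypothetical stand-in for bp_plot; the values are invented
toy_bp <- data.frame(
  GO_term    = c("term A", "term B", "term C"),
  gene_ratio = c(0.10, 0.25, 0.40)
)

base_plot <- ggplot(toy_bp) +
  geom_point(aes(x = gene_ratio, y = GO_term), size = 2)

# Use the default scaling, or override it for a particular figure
p_default <- base_plot + personal_theme_scaled()
p_larger  <- base_plot + personal_theme_scaled(text_scale = 1.5)
```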

Exercises

Based on the number of genes associated with each GO term (“term.size” column) we can categorize them into “small”, “large” or “medium” categories. Once we have done that, we want to determine what the spread of p-values is for each category; we can do this by drawing a boxplot.

Use the following code to create a new column in bp_oe tibble for the new categories.

 x <- bp_oe$term.size
 sizes <- rep(NA, length(x) )
	
 sizes[which(x > 3000)] <- "large"
 sizes[which(x <= 3000 & x > 500 )] <- "medium"
 sizes[which(x <= 500)] <- "small"
 bp_oe$term_cat <- factor(sizes, levels = c("small","medium","large"))

Create a boxplot with the new column (term_cat) on the x-axis and the -log10 of the p.value on the y-axis.
Fill color into each boxplot based on that new column
Add appropriate labels and theme() layers to your liking.

Resources

Helpful packages to add additional functionality to ggplot2:

This lesson has been developed by members of the teaching team at the Harvard Chan Bioinformatics Core (HBC). These are open access materials distributed under the terms of the Creative Commons Attribution license (CC BY 4.0), which permits unrestricted use, distribution, and reproduction in any medium, provided the original author and source are credited.
