Tidyverse data wrangling Answer Key

Author

Will Gammerdinger

Published

July 1, 2025

Exercise 1

Create a vector of random numbers using the code below:

# Create a vector of random numbers
random_numbers <- c(81, 90, 65, 43, 71, 29)

Use the pipe (%>%) to perform two steps:

Take the mean of random_numbers using the mean() function.

# Return the mean of the random_numbers vector
random_numbers %>% 
  mean()

[1] 63.16667

Round the output to three digits using the round() function.

# Return the mean of the random_numbers vector and round to three digits
random_numbers %>% 
  mean() %>% 
  round(digits = 3)

[1] 63.167
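As a quick check on what the pipe is doing, the sketch below compares the piped version with the equivalent nested function call. The object names piped and nested are made up for this illustration, and magrittr is loaded directly here only so the snippet runs on its own (loading the tidyverse also provides %>%).

```r
# Load magrittr, which provides the %>% pipe
library(magrittr)

# Same vector as in the exercise
random_numbers <- c(81, 90, 65, 43, 71, 29)

# Piped version: read left to right, top to bottom
piped <- random_numbers %>%
  mean() %>%
  round(digits = 3)

# Nested version: read from the inside out
nested <- round(mean(random_numbers), digits = 3)

# Both approaches return the same value
identical(piped, nested)
```

The pipe passes the left-hand side as the first argument of the next function, so the two forms differ only in readability.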

Exercise 2

We would like to perform an additional round of filtering to only keep the most specific GO terms.

For bp_oe, use the filter() function to only keep those rows where the relative.depth is greater than 4.

# Filter bp_oe to keep those rows where the relative.depth is greater than 4
bp_oe %>% 
  filter(relative.depth > 4)

# A tibble: 668 × 14
   query.number significant  p.value term.size query.size overlap.size recall
          <dbl> <lgl>          <dbl>     <dbl>      <dbl>        <dbl>  <dbl>
 1            1 TRUE        2.41e- 2        16       5850           11  0.002
 2            1 TRUE        2.41e- 2        16       5850           11  0.002
 3            1 TRUE        2.41e- 2        16       5850           11  0.002
 4            1 TRUE        7.90e-11      2629       5850          973  0.166
 5            1 TRUE        4.43e- 5       200       5850           93  0.016
 6            1 TRUE        3.67e- 6       166       5850           83  0.014
 7            1 TRUE        3.67e- 6       166       5850           83  0.014
 8            1 TRUE        4.88e- 2        33       5850           18  0.003
 9            1 TRUE        2.48e- 5       137       5850           69  0.012
10            1 TRUE        1.39e- 4      1492       5850          540  0.092
# ℹ 658 more rows
# ℹ 7 more variables: precision <dbl>, term.id <chr>, domain <chr>,
#   subgraph.number <dbl>, term.name <chr>, relative.depth <dbl>,
#   intersection <chr>

Save the output by assigning it back to bp_oe, which overwrites the original object.

# Filter bp_oe to keep those rows where the relative.depth is greater than 4 and overwrite the bp_oe object
bp_oe <- bp_oe %>% 
  filter(relative.depth > 4)

# Print object after filtering on the relative.depth column 
bp_oe

# A tibble: 668 × 14
   query.number significant  p.value term.size query.size overlap.size recall
          <dbl> <lgl>          <dbl>     <dbl>      <dbl>        <dbl>  <dbl>
 1            1 TRUE        2.41e- 2        16       5850           11  0.002
 2            1 TRUE        2.41e- 2        16       5850           11  0.002
 3            1 TRUE        2.41e- 2        16       5850           11  0.002
 4            1 TRUE        7.90e-11      2629       5850          973  0.166
 5            1 TRUE        4.43e- 5       200       5850           93  0.016
 6            1 TRUE        3.67e- 6       166       5850           83  0.014
 7            1 TRUE        3.67e- 6       166       5850           83  0.014
 8            1 TRUE        4.88e- 2        33       5850           18  0.003
 9            1 TRUE        2.48e- 5       137       5850           69  0.012
10            1 TRUE        1.39e- 4      1492       5850          540  0.092
# ℹ 658 more rows
# ℹ 7 more variables: precision <dbl>, term.id <chr>, domain <chr>,
#   subgraph.number <dbl>, term.name <chr>, relative.depth <dbl>,
#   intersection <chr>
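The behavior of the comparison can be sketched on a small stand-in table. The toy_bp_oe object and its values below are hypothetical and are not taken from the real bp_oe data; the point is only that > is a strict comparison, so rows where relative.depth is exactly 4 are dropped.

```r
# Load dplyr for filter(), tibble() and the %>% pipe
library(dplyr)

# Hypothetical toy stand-in for bp_oe (made-up values)
toy_bp_oe <- tibble(
  term.id = c("GO:A", "GO:B", "GO:C", "GO:D"),
  relative.depth = c(2, 4, 5, 7)
)

# Keep only rows with a relative.depth strictly greater than 4
toy_filtered <- toy_bp_oe %>%
  filter(relative.depth > 4)

# Compare the number of rows before and after filtering
nrow(toy_bp_oe)
nrow(toy_filtered)

# Confirm every remaining row passes the condition
all(toy_filtered$relative.depth > 4)
```

Comparing row counts like this before overwriting an object is a simple way to confirm the filter did what was intended.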

Exercise 3

Rename the intersection column to genes to reflect the fact that these are the DE genes associated with the GO process.

# Rename the intersection column of bp_oe to genes
bp_oe <- bp_oe %>% 
  dplyr::rename(genes = intersection)

# Print object after renaming the column
bp_oe

# A tibble: 668 × 7
   GO_id      GO_term            p.value query.size term.size overlap.size genes
   <chr>      <chr>                <dbl>      <dbl>     <dbl>        <dbl> <chr>
 1 GO:0010467 gene expression   6.71e-66       5850      5257         2142 gclc…
 2 GO:0090304 nucleic acid met… 1.18e-61       5850      5103         2073 gclc…
 3 GO:0006139 nucleobase-conta… 2.49e-58       5850      5731         2271 dpm1…
 4 GO:0016070 RNA metabolic pr… 7.28e-57       5850      4597         1881 gclc…
 5 GO:0009059 macromolecule bi… 3.12e-54       5850      5066         2030 dpm1…
 6 GO:0034645 cellular macromo… 5.6 e-54       5850      4907         1975 dpm1…
 7 GO:0044271 cellular nitroge… 2.10e-47       5850      4882         1938 gclc…
 8 GO:0010468 regulation of ge… 4.25e-46       5850      4297         1733 gclc…
 9 GO:2000112 regulation of ce… 1.22e-40       5850      3960         1593 gclc…
10 GO:0010556 regulation of ma… 2.22e-39       5850      4073         1626 gclc…
# ℹ 658 more rows
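The argument order in rename() is new_name = old_name, which is easy to reverse by accident. A minimal sketch on a hypothetical two-row table (toy_bp_oe and its values are made up for illustration) shows how to confirm the result with colnames():

```r
# Load dplyr for rename(), tibble() and the %>% pipe
library(dplyr)

# Hypothetical toy stand-in for bp_oe (made-up values)
toy_bp_oe <- tibble(
  term.id = c("GO:A", "GO:B"),
  intersection = c("gene1,gene2", "gene3")
)

# rename() takes new_name = old_name
toy_renamed <- toy_bp_oe %>%
  dplyr::rename(genes = intersection)

# Check the column names after renaming
colnames(toy_renamed)
```

The explicit dplyr:: prefix is presumably there because other packages also define a rename() function; the prefix makes sure the dplyr version is the one that gets called.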

Exercise 4

Create a column in bp_oe called term_percent that gives the number of DE genes associated with the GO term relative to the total number of genes associated with the GO term (overlap.size / term.size). Note that this division returns a fraction between 0 and 1, not a value scaled to 100.

# Create term_percent column based on other columns in dataset
bp_oe <- bp_oe %>% 
  mutate(term_percent = overlap.size / term.size)

# Print object after creating the new column
bp_oe

# A tibble: 668 × 9
   GO_id     GO_term  p.value query.size term.size overlap.size genes gene_ratio
   <chr>     <chr>      <dbl>      <dbl>     <dbl>        <dbl> <chr>      <dbl>
 1 GO:00104… gene e… 6.71e-66       5850      5257         2142 gclc…      0.366
 2 GO:00903… nuclei… 1.18e-61       5850      5103         2073 gclc…      0.354
 3 GO:00061… nucleo… 2.49e-58       5850      5731         2271 dpm1…      0.388
 4 GO:00160… RNA me… 7.28e-57       5850      4597         1881 gclc…      0.322
 5 GO:00090… macrom… 3.12e-54       5850      5066         2030 dpm1…      0.347
 6 GO:00346… cellul… 5.6 e-54       5850      4907         1975 dpm1…      0.338
 7 GO:00442… cellul… 2.10e-47       5850      4882         1938 gclc…      0.331
 8 GO:00104… regula… 4.25e-46       5850      4297         1733 gclc…      0.296
 9 GO:20001… regula… 1.22e-40       5850      3960         1593 gclc…      0.272
10 GO:00105… regula… 2.22e-39       5850      4073         1626 gclc…      0.278
# ℹ 658 more rows
# ℹ 1 more variable: term_percent <dbl>
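The calculation can be sketched on a small table built from the term.size and overlap.size values shown in the first three rows of the output above. The toy_bp_oe object and the extra term_percent_100 column are hypothetical additions for illustration and are not part of the exercise.

```r
# Load dplyr for mutate(), tibble() and the %>% pipe
library(dplyr)

# Toy table using term.size and overlap.size from the first three rows above
toy_bp_oe <- tibble(
  term.size = c(5257, 5103, 5731),
  overlap.size = c(2142, 2073, 2271)
)

toy_bp_oe <- toy_bp_oe %>%
  mutate(
    # Fraction of the term's genes that are DE (between 0 and 1)
    term_percent = overlap.size / term.size,
    # Hypothetical extra column: the same value on a 0 to 100 scale
    term_percent_100 = round(term_percent * 100, digits = 1)
  )

# Print the toy table with the new columns
toy_bp_oe
```

Within a single mutate() call, a later column can refer to one created earlier in the same call, which is why term_percent_100 can be built from term_percent.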