Day 3 Activities Answer Key

Author

Mary Piper, Meeta Mistry, Radhika Khetani

Published

December 4, 2019

Exercises

Reading in and inspecting data

  1. Download the animals.csv file by right-clicking the link and choosing “Save Link As…”, then place the file in the data directory.
  2. Read the .csv file into your environment and assign it to a variable called animals. Be sure to check that your row names are the different animals.
animals <- read.csv("../../data/animals.csv")
animals
          speed color
Elephant   40.0  Gray
Cheetah   120.0   Tan
Tortoise    0.1 Green
Hare       48.0  Grey
Lion       80.0   Tan
PolarBear  30.0 White
  3. Check to make sure that animals is a dataframe.
class(animals)
[1] "data.frame"
  4. How many rows are in the animals dataframe? How many columns?
nrow(animals)
[1] 6
ncol(animals)
[1] 2
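
The two counts above can also be obtained in a single call. The following is a minimal sketch, assuming the data frame is rebuilt inline from the values printed earlier instead of being read from animals.csv; the names animals_dim and animals_is_df are introduced here only for illustration.

```r
# Rebuild the animals data frame from the values shown in the printed output
# (assumption: this stands in for reading animals.csv from disk).
animals <- data.frame(
  speed = c(40, 120, 0.1, 48, 80, 30),
  color = c("Gray", "Tan", "Green", "Grey", "Tan", "White"),
  row.names = c("Elephant", "Cheetah", "Tortoise", "Hare", "Lion", "PolarBear"),
  stringsAsFactors = FALSE
)

# dim() returns the number of rows followed by the number of columns
animals_dim <- dim(animals)

# is.data.frame() gives a direct TRUE/FALSE answer, as an alternative to class()
animals_is_df <- is.data.frame(animals)
```

Calling str(animals) is another common way to see the dimensions together with the type of each column.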

Data wrangling

  1. Extract the speed value of 40 km/h from the animals dataframe.
animals[1,1]
[1] 40
animals[which(animals$speed == 40), 1]
[1] 40
animals[which(animals$speed == 40), "speed"]
[1] 40
animals$speed[which(animals$speed == 40)]
[1] 40
  2. Return the rows with animals that are the color Tan.
animals[c(2,5),]
        speed color
Cheetah   120   Tan
Lion       80   Tan
animals[which(animals$color == "Tan"),]
        speed color
Cheetah   120   Tan
Lion       80   Tan
  3. Return the rows with animals that have speed greater than 50 km/h and output only the color column. Keep the output as a data frame.
animals[which(animals$speed > 50), "color", drop = FALSE]
        color
Cheetah   Tan
Lion      Tan
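
To see why the drop argument matters in the answer above, the sketch below compares the same subset with and without it. It assumes the animals data frame is rebuilt inline from the printed values; the names fast, as_vector and as_frame are hypothetical and used only for this illustration.

```r
# Rebuild the animals data frame from the values shown in the printed output
# (assumption: this stands in for reading animals.csv from disk).
animals <- data.frame(
  speed = c(40, 120, 0.1, 48, 80, 30),
  color = c("Gray", "Tan", "Green", "Grey", "Tan", "White"),
  row.names = c("Elephant", "Cheetah", "Tortoise", "Hare", "Lion", "PolarBear"),
  stringsAsFactors = FALSE
)

# Row positions of animals faster than 50 km/h
fast <- which(animals$speed > 50)

# Selecting a single column simplifies the result to a plain vector by default
as_vector <- animals[fast, "color"]

# drop = FALSE keeps the data frame structure, including the row names
as_frame <- animals[fast, "color", drop = FALSE]
```

Comparing class(as_vector) with class(as_frame) shows the difference: only the second result is still a data frame.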
  4. Change the color of “Grey” to “Gray”.
animals$color[which(animals$color == "Grey")] <- "Gray"
animals[which(animals$color == "Grey"), "color"] <- "Gray"
  5. Create a list called animals_list in which the first element contains the speed column of the animals dataframe and the second element contains the color column of the animals dataframe.
animals_list <- list(animals$speed, animals$color)
  6. Give each element of your list the appropriate name (i.e., speed and color).
names(animals_list) <- colnames(animals)
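
Once the list elements are named, they can be retrieved by name. The sketch below also shows that the names can be supplied at creation time, which is an alternative to the two-step answer above and not part of the original exercise. It assumes the animals data frame is rebuilt inline from the printed values, and first_speed and all_colors are hypothetical names.

```r
# Rebuild the animals data frame from the values shown in the printed output
# (assumption: this stands in for reading animals.csv from disk).
animals <- data.frame(
  speed = c(40, 120, 0.1, 48, 80, 30),
  color = c("Gray", "Tan", "Green", "Grey", "Tan", "White"),
  row.names = c("Elephant", "Cheetah", "Tortoise", "Hare", "Lion", "PolarBear"),
  stringsAsFactors = FALSE
)

# Name the elements while creating the list
animals_list <- list(speed = animals$speed, color = animals$color)

# Access an element by name with $ or with [[ ]]
first_speed <- animals_list$speed[1]
all_colors <- animals_list[["color"]]
```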

The %in% operator, reordering and matching

  1. In your environment you should have a dataframe called proj_summary which contains quality metric information for an RNA-seq dataset. We have obtained batch information for the control samples in this dataset. Copy and paste the code below to create a dataframe of control samples with the associated batch information:
proj_summary <- read.table(file = "../../data/project-summary.txt", header = TRUE, row.names = 1)

ctrl_samples <- data.frame(row.names = c("sample3", "sample10", "sample8", "sample4", "sample15"), 
                            date = c("01/13/2018", "03/15/2018", "01/13/2018", "09/20/2018","03/15/2018"))
  2. How many of the ctrl_samples are also in the proj_summary dataframe? Use the %in% operator to compare sample names.
length(which(rownames(ctrl_samples) %in% rownames(proj_summary)))
[1] 3
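
The count above works because %in% returns one TRUE or FALSE per element of its left-hand side. A minimal sketch follows, assuming that proj_summary has rows named sample1 through sample9; this is inferred from the outputs shown in this answer key, and proj_names is a hypothetical stand-in for rownames(proj_summary).

```r
# Control sample names, as given in the exercise
ctrl_names <- c("sample3", "sample10", "sample8", "sample4", "sample15")

# Hypothetical stand-in for rownames(proj_summary)
# (assumption: the project summary holds sample1 through sample9)
proj_names <- paste0("sample", 1:9)

# One logical value per control sample: is it present in proj_names?
in_proj <- ctrl_names %in% proj_names

# TRUE counts as 1, so sum() gives the number of shared samples
n_shared <- sum(in_proj)

# The logical vector can also be used to pull out the shared names
shared <- ctrl_names[in_proj]
```

Using sum() on the logical vector is equivalent to length(which(...)) in the answer above.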
  3. Keep only the rows in proj_summary which correspond to those in ctrl_samples. Do this with the %in% operator. Save it to a variable called proj_summary_ctrl.
proj_summary_ctrl <- proj_summary[which(rownames(proj_summary) %in% rownames(ctrl_samples)),]
proj_summary_ctrl
        percent_GC Exonic_Rate Intronic_Rate Intergenic_Rate Mapping_Rate
sample3         50      0.8834        0.0663          0.0503    0.9877286
sample4         50      0.9027        0.0649          0.0325    0.9870764
sample8         49      0.9022        0.0656          0.0322    0.9877458
        Quality_format   rRNA_rate treatment
sample3       standard 0.026944958   control
sample4       standard 0.005081974   control
sample8       standard 0.004549047   control
  4. We would like to add in the batch information for the samples in proj_summary_ctrl. Find the rows that match in ctrl_samples.
m <- match(rownames(proj_summary_ctrl), rownames(ctrl_samples))
m
[1] 1 4 3
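
Each value returned by match() is the position in the second argument at which the corresponding element of the first argument is found. The sketch below reproduces the result above without needing project-summary.txt; ctrl_rows is a hypothetical stand-in for rownames(proj_summary_ctrl), and batch is a name introduced only for this illustration.

```r
# Control samples and their dates, as given in the exercise
ctrl_samples <- data.frame(
  row.names = c("sample3", "sample10", "sample8", "sample4", "sample15"),
  date = c("01/13/2018", "03/15/2018", "01/13/2018", "09/20/2018", "03/15/2018"),
  stringsAsFactors = FALSE
)

# Hypothetical stand-in for rownames(proj_summary_ctrl)
ctrl_rows <- c("sample3", "sample4", "sample8")

# For each name in ctrl_rows, find its row position in ctrl_samples
m <- match(ctrl_rows, rownames(ctrl_samples))

# Indexing with m puts the dates in the same order as ctrl_rows
batch <- ctrl_samples$date[m]
```

Because sample4 sits in the fourth row of ctrl_samples and sample8 in the third, the positions come back as 1, 4, 3 instead of in increasing order.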
  5. Use cbind() to add a column called batch to the proj_summary_ctrl dataframe. Assign this new dataframe back to proj_summary_ctrl.
proj_summary_ctrl <- cbind(proj_summary_ctrl, batch=ctrl_samples[m,])
proj_summary_ctrl
        percent_GC Exonic_Rate Intronic_Rate Intergenic_Rate Mapping_Rate
sample3         50      0.8834        0.0663          0.0503    0.9877286
sample4         50      0.9027        0.0649          0.0325    0.9870764
sample8         49      0.9022        0.0656          0.0322    0.9877458
        Quality_format   rRNA_rate treatment      batch
sample3       standard 0.026944958   control 01/13/2018
sample4       standard 0.005081974   control 09/20/2018
sample8       standard 0.004549047   control 01/13/2018

BONUS: Using map_lgl()

  1. Subset proj_summary to keep only the “high” and “low” samples based on the treatment column. Save the new dataframe to a variable called proj_summary_noctl.
library(purrr)

proj_summary_noctl <- proj_summary[which(proj_summary$treatment != "control"),]
proj_summary_noctl
        percent_GC Exonic_Rate Intronic_Rate Intergenic_Rate Mapping_Rate
sample1         49      0.8913        0.0709          0.0378    0.9787998
sample2         49      0.9055        0.0625          0.0321    0.9825069
sample5         49      0.8923        0.0714          0.0362    0.9781835
sample6         49      0.8999        0.0667          0.0334    0.9772096
sample7         49      0.8983        0.0665          0.0352    0.9757997
sample9         49      0.9111        0.0566          0.0323    0.9814494
        Quality_format   rRNA_rate treatment
sample1       standard 0.007264734      high
sample2       standard 0.005518317       low
sample5       standard 0.005023175      high
sample6       standard 0.005345113       low
sample7       standard 0.005240401      high
sample9       standard 0.005817519       low
  2. Further, subset the dataframe to remove the non-numeric columns “Quality_format” and “treatment”. Try to do this using the map_lgl() function in addition to is.numeric(). Save the new dataframe back to proj_summary_noctl.
keep <- map_lgl(proj_summary_noctl, is.numeric)
proj_summary_noctl <- proj_summary_noctl[,keep]
proj_summary_noctl
        percent_GC Exonic_Rate Intronic_Rate Intergenic_Rate Mapping_Rate
sample1         49      0.8913        0.0709          0.0378    0.9787998
sample2         49      0.9055        0.0625          0.0321    0.9825069
sample5         49      0.8923        0.0714          0.0362    0.9781835
sample6         49      0.8999        0.0667          0.0334    0.9772096
sample7         49      0.8983        0.0665          0.0352    0.9757997
sample9         49      0.9111        0.0566          0.0323    0.9814494
          rRNA_rate
sample1 0.007264734
sample2 0.005518317
sample5 0.005023175
sample6 0.005345113
sample7 0.005240401
sample9 0.005817519
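
The same column filter can be written in base R without purrr. The sketch below is an alternative to the map_lgl() answer above, not a replacement for it. It uses vapply() and assumes a small hypothetical data frame called toy, built from two of the rows printed above so that it runs without project-summary.txt.

```r
# Hypothetical two-row data frame using values from the printed output
toy <- data.frame(
  percent_GC = c(49, 49),
  Quality_format = c("standard", "standard"),
  rRNA_rate = c(0.007264734, 0.005518317),
  treatment = c("high", "low"),
  row.names = c("sample1", "sample2"),
  stringsAsFactors = FALSE
)

# Apply is.numeric() to every column; logical(1) declares that each call
# must return exactly one TRUE or FALSE
keep <- vapply(toy, is.numeric, logical(1))

# Keep only the numeric columns; drop = FALSE guards against the result
# collapsing to a vector if only one column is numeric
toy_numeric <- toy[, keep, drop = FALSE]
```

Like map_lgl(), vapply() with logical(1) raises an error if the function returns anything other than a single logical value per column.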