# Read in proj_summary if needed
proj_summary <- read.table(file = "data/project-summary.txt", header = TRUE, row.names = 1)
# Create ctrl_samples dataframe
ctrl_samples <- data.frame(row.names = c("sample3", "sample10", "sample8", "sample4", "sample15"),
                           date = c("01/13/2018", "03/15/2018", "01/13/2018", "09/20/2018", "03/15/2018"))

Fun with Data Wrangling
A practical, hands‑on data‑wrangling exercise in which participants work with datasets to read, inspect, subset, reorder and match data in R. Participants practice skills such as importing CSV files, filtering data frames, modifying values, building lists, using the %in% operator and applying functional programming tools like map_lgl() to manage and refine data structures.
R, Subsetting, %in% operator
Exercises
Reading in and inspecting data
- Using animals.csv, read the .csv file into your environment and assign it to a variable called animals. Be sure to check that your row names are the different animals.
- Check to make sure that animals is a dataframe.
- How many rows are in the animals dataframe? How many columns?
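As a minimal sketch of the reading and inspection steps, the following writes a small made-up CSV to a temporary file and reads it back. The file contents and the names tmp and toy are hypothetical stand-ins for illustration, not the course's animals.csv.

```r
# Hypothetical stand-in data; the real animals.csv is not reproduced here
tmp <- tempfile(fileext = ".csv")
writeLines(c("name,height,weight",
             "itemA,1.5,10",
             "itemB,2.0,20",
             "itemC,0.5,5"), tmp)

# row.names = 1 tells read.csv() to use the first column as row names
toy <- read.csv(tmp, row.names = 1)

rownames(toy)        # check that the row names are the item names
is.data.frame(toy)   # check that the object is a data frame
nrow(toy)            # number of rows
ncol(toy)            # number of columns
```

The same calls should apply to the animals data once the path to animals.csv is substituted for the temporary file.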
Data wrangling
- Extract the speed value of 40 km/h from the animals dataframe.
- Return the rows with animals that are the color Tan.
- Return the rows with animals that have speed greater than 50 km/h and output only the color column. Keep the output as a data frame.
- Change the color of “Grey” to “Gray”.
- Create a list called animals_list in which the first element contains the speed column of the animals dataframe and the second element contains the color column of the animals dataframe.
- Give each element of your list the appropriate name (i.e., speed and color).
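The subsetting patterns these exercises call for can be sketched on a toy data frame. Everything here (toy, size, shade and the values) is made up for illustration and is not the animals data.

```r
# Hypothetical toy data, not the course's animals data
toy <- data.frame(size = c(12, 40, 75, 40),
                  shade = c("Tan", "Grey", "Black", "Tan"),
                  row.names = c("a", "b", "c", "d"),
                  stringsAsFactors = FALSE)

# Rows where a column equals a given value
toy[toy$shade == "Tan", ]

# Rows meeting a numeric condition, returning one column;
# drop = FALSE keeps the result as a data frame instead of a vector
big <- toy[toy$size > 50, "shade", drop = FALSE]

# Replace one value with another using a logical index
toy$shade[toy$shade == "Grey"] <- "Gray"

# Build a list from two columns, then name its elements
toy_list <- list(toy$size, toy$shade)
names(toy_list) <- c("size", "shade")
```

Without drop = FALSE, selecting a single column with square brackets returns a plain vector, which is why the exercise asks explicitly for a data frame.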
The %in% operator, reordering and matching
- In your environment you should have a dataframe called proj_summary which contains quality metric information for an RNA-seq dataset. We have obtained batch information for the control samples in this dataset. Copy and paste the code at the top of this page to create a dataframe of control samples with the associated batch information:
- How many of the ctrl_samples are also in the proj_summary dataframe? Use the %in% operator to compare sample names.
- Keep only the rows in proj_summary which correspond to those in ctrl_samples. Do this with the %in% operator. Save it to a variable called proj_summary_ctrl.
- We would like to add in the batch information for the samples in proj_summary_ctrl. Find the rows that match in ctrl_samples.
- Use cbind() to add a column called batch to the proj_summary_ctrl dataframe. Assign this new dataframe back to proj_summary_ctrl.
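One way to combine %in%, match() and cbind() is sketched below on two small made-up data frames. The names metrics, info, metrics_sub and idx, and all values, are hypothetical and chosen only to show the pattern.

```r
# Hypothetical toy data; names and values are made up for illustration
metrics <- data.frame(reads = c(100, 200, 300, 400),
                      row.names = c("s1", "s2", "s3", "s4"))
info <- data.frame(date = c("d3", "d1", "d9"),
                   row.names = c("s3", "s1", "s9"),
                   stringsAsFactors = FALSE)

# For each row name of info, is it also a row name of metrics?
in_both <- rownames(info) %in% rownames(metrics)
sum(in_both)   # number of shared names

# Keep only the rows of metrics whose names appear in info
metrics_sub <- metrics[rownames(metrics) %in% rownames(info), , drop = FALSE]

# match() returns, for each row of metrics_sub, its position in info,
# so info can be reordered to line up with metrics_sub
idx <- match(rownames(metrics_sub), rownames(info))

# Add the reordered values as a new column called batch
metrics_sub <- cbind(metrics_sub, batch = info[idx, "date"])
```

The match() step matters because the two data frames list their samples in different orders; binding the column without reordering would attach values to the wrong rows.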
BONUS: Using map_lgl()
- Subset proj_summary to keep only the “high” and “low” samples based on the treatment column. Save the new dataframe to a variable called proj_summary_noctl.
- Further, subset the dataframe to remove the non-numeric columns “Quality_format” and “treatment”. Try to do this using the map_lgl() function in addition to is.numeric(). Save the new dataframe back to proj_summary_noctl.
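A rough sketch of the map_lgl() approach on a made-up data frame follows. It assumes the purrr package is installed; the names toy, toy_sub and keep, and the column values, are hypothetical.

```r
library(purrr)  # provides map_lgl(); assumed to be installed

# Hypothetical toy data, not the course's proj_summary
toy <- data.frame(count = c(1, 2, 3, 4),
                  label = c("x", "y", "z", "w"),
                  group = c("high", "low", "control", "high"),
                  stringsAsFactors = FALSE)

# Keep rows whose group is one of the listed values
toy_sub <- toy[toy$group %in% c("high", "low"), ]

# map_lgl() applies is.numeric() to each column and returns
# a named logical vector with one element per column
keep <- map_lgl(toy_sub, is.numeric)

# Use the logical vector to keep only the numeric columns
toy_sub <- toy_sub[, keep, drop = FALSE]
```

Because a data frame is a list of columns, map_lgl() iterates over columns rather than rows, which is what makes it suitable for this kind of column filter.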