Matching

Exercise 1 Solution

  1. Using the A and B vectors created above, evaluate each element in B to see if there is a match in A
B %in% A
[1] FALSE FALSE FALSE FALSE  TRUE  TRUE
  2. Subset the B vector to return only those values that are also in A.
intersectionBA <- B %in% A
B[intersectionBA]
[1] 1 5
# or you could do the two above steps in a single line of code:
B[B %in% A]
[1] 1 5
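
To make the two steps easy to rerun, here is a minimal self-contained sketch. The values of A and B below are assumptions, chosen only to be consistent with the output shown above, since the vectors themselves were created earlier in the lesson.

```r
# Assumed example vectors (not defined on this page)
A <- c(1, 3, 5, 7, 9, 11)
B <- c(2, 4, 6, 8, 1, 5)

# Logical vector with one entry per element of B
in_A <- B %in% A
in_A

# Keep only the elements of B that also occur in A
shared <- B[in_A]
shared
```

The logical vector always has the same length as the vector on the left of %in%, which is why it can be used directly to subset B.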

Exercise 2 Solution

We have a list of 6 marker genes that we are very interested in. Our goal is to extract the count data for these genes using the %in% operator, without having to scroll through the data frame of count data.

First, let's create a vector called important_genes with the Ensembl IDs of the 6 genes we are interested in:

important_genes <- c("ENSMUSG00000083700", "ENSMUSG00000080990", "ENSMUSG00000065619", "ENSMUSG00000047945", "ENSMUSG00000081010", "ENSMUSG00000030970")
  1. Use the %in% operator to determine if all of these genes are in the row names of the rpkm_data data frame.
important_genes %in% rownames(rpkm_data)
[1] TRUE TRUE TRUE TRUE TRUE TRUE
  2. Extract the rows from rpkm_data that correspond to these 6 genes, again using [] and the %in% operator. Double check the row names to ensure that you are extracting the correct rows.
intersection <- rownames(rpkm_data) %in% important_genes

rpkm_data[intersection, ]
                    sample2  sample5  sample7  sample8  sample9   sample4
ENSMUSG00000030970 2.221180 0.537852 2.243810 2.599400 3.593970 0.1753800
ENSMUSG00000047945 4.745070 0.323620 1.297810 3.896810 3.285470 0.2213430
ENSMUSG00000065619 0.000000 0.000000 0.000000 0.000000 0.000000 0.0000000
ENSMUSG00000080990 0.000000 0.000000 0.000000 0.000000 0.000000 0.0000000
ENSMUSG00000081010 0.222275 0.349415 0.190397 0.167166 0.221353 0.4196660
ENSMUSG00000083700 0.425214 0.337651 0.145973 0.142010 0.508757 0.0660419
                    sample6 sample12  sample3 sample11 sample10  sample1
ENSMUSG00000030970 0.435484 0.964169 2.151490 0.963523 1.014520 2.971420
ENSMUSG00000047945 0.478836 3.581640 4.501390 1.442970 0.982691 5.199470
ENSMUSG00000065619 0.000000 0.000000 0.000000 0.000000 0.000000 0.000000
ENSMUSG00000080990 0.000000 0.000000 0.000000 0.000000 0.000000 0.000000
ENSMUSG00000081010 0.248244 0.594672 0.214347 0.415823 0.452537 0.235848
ENSMUSG00000083700 0.308669 0.488064 0.136998 0.222865 0.205934 0.124225
  3. Bonus question: Extract the rows from rpkm_data that correspond to these 6 genes using [], but without using the %in% operator.
rpkm_data[important_genes, ]
                    sample2  sample5  sample7  sample8  sample9   sample4
ENSMUSG00000083700 0.425214 0.337651 0.145973 0.142010 0.508757 0.0660419
ENSMUSG00000080990 0.000000 0.000000 0.000000 0.000000 0.000000 0.0000000
ENSMUSG00000065619 0.000000 0.000000 0.000000 0.000000 0.000000 0.0000000
ENSMUSG00000047945 4.745070 0.323620 1.297810 3.896810 3.285470 0.2213430
ENSMUSG00000081010 0.222275 0.349415 0.190397 0.167166 0.221353 0.4196660
ENSMUSG00000030970 2.221180 0.537852 2.243810 2.599400 3.593970 0.1753800
                    sample6 sample12  sample3 sample11 sample10  sample1
ENSMUSG00000083700 0.308669 0.488064 0.136998 0.222865 0.205934 0.124225
ENSMUSG00000080990 0.000000 0.000000 0.000000 0.000000 0.000000 0.000000
ENSMUSG00000065619 0.000000 0.000000 0.000000 0.000000 0.000000 0.000000
ENSMUSG00000047945 0.478836 3.581640 4.501390 1.442970 0.982691 5.199470
ENSMUSG00000081010 0.248244 0.594672 0.214347 0.415823 0.452537 0.235848
ENSMUSG00000030970 0.435484 0.964169 2.151490 0.963523 1.014520 2.971420
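
Note that the two approaches return the same rows in different orders: the logical vector keeps the row order of rpkm_data, while indexing by name follows the order of important_genes. The sketch below illustrates this with a small made-up data frame; toy_counts and genes_of_interest are hypothetical names and values, not part of the lesson data.

```r
# Hypothetical toy data standing in for rpkm_data
toy_counts <- data.frame(
  sample1 = c(1.5, 0.0, 3.2, 0.7),
  sample2 = c(2.1, 0.0, 2.9, 0.4),
  row.names = c("geneA", "geneB", "geneC", "geneD")
)
genes_of_interest <- c("geneD", "geneA")

# Logical indexing: rows come back in the data frame's own order
by_logical <- toy_counts[rownames(toy_counts) %in% genes_of_interest, ]

# Name indexing: rows come back in the order of the query vector
by_name <- toy_counts[genes_of_interest, ]

rownames(by_logical)
rownames(by_name)
```

When indexing a data frame by name, a name that is not present should produce a row of NA values rather than an error, so it is worth running the %in% check from step 1 before relying on this shortcut.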

Exercise 3 Solution

For a research project, we asked healthy volunteers and cancer patients questions about their diet and exercise. We also collected blood work for each individual, and each person was given a unique ID. Create the dataframes behavior and blood by copying and pasting the code below:

# Creating behavior dataframe
ID <- c(546, 983, 042, 952, 853, 061)
diet <- c("veg", "pes", "omni", "omni", "omni", "omni")
exercise <- c("high", "low", "low", "low", "med", "high")
behavior <- data.frame(ID, diet, exercise)

# Creating blood dataframe

ID <- c(983, 952, 704, 555, 853, 061, 042, 237, 145, 581, 249, 467, 841, 546)
blood_levels <- c(43543, 465, 4634, 94568, 134, 347, 2345, 5439, 850, 6840, 5483, 66452, 54371, 1347)
blood <- data.frame(ID, blood_levels)
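
One detail worth knowing about the code above: R reads 042 and 061 as the numbers 42 and 61, so the leading zeros do not appear when the dataframes are printed. The sketch below shows this behavior. Storing IDs as character strings, or padding them with formatC(), are possible alternatives if the zeros matter; neither is part of this exercise, and the variable names are hypothetical.

```r
# Numeric literals lose their leading zeros
num_ids <- c(546, 983, 042, 061)
num_ids

# Character IDs keep them (alternative, not used in this exercise)
chr_ids <- c("546", "983", "042", "061")
chr_ids

# formatC() can pad numeric IDs back to a fixed width for display
padded <- formatC(num_ids, width = 3, flag = "0", format = "d")
padded
```

Matching with %in% and match() works the same way in either case, as long as both vectors being compared use the same representation.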
  1. We would like to see if we have diet and exercise information for all of our blood samples. Using the ID information, determine whether all individuals with blood samples have associated behavioral information. Which individuals do not have behavioral information?
# One way to subset the FALSE values
blood$ID[!(blood$ID %in% behavior$ID)]
[1] 704 555 237 145 581 249 467 841
# Equivalent way to subset the FALSE values (spelling out FALSE is safer than F, which can be reassigned)
blood$ID[which((blood$ID %in% behavior$ID) == FALSE)]
[1] 704 555 237 145 581 249 467 841
  2. The samples lacking behavioral information correspond to individuals who opted out of the study after having their blood drawn. Subset the blood data to keep only samples that have behavioral information and save the dataframe back to the blood variable.
blood <- blood[blood$ID %in% behavior$ID, ]
all(blood$ID %in% behavior$ID)
[1] TRUE
  3. We would like to combine the blood and behavior dataframes together, but first we need to make sure the data is in the same order.

a. Take a look at each of the dataframes and manually identify the correct order for the blood dataframe such that it matches the order of IDs in the behavior dataframe.

behavior
   ID diet exercise
1 546  veg     high
2 983  pes      low
3  42 omni      low
4 952 omni      low
5 853 omni      med
6  61 omni     high
blood
    ID blood_levels
1  983        43543
2  952          465
5  853          134
6   61          347
7   42         2345
14 546         1347
# order of `blood` IDs to match `behavior` IDs is: 6, 1, 5, 2, 3, 4

b. Reorder the blood data to match the order of the IDs in the behavior dataframe and save the reordered blood dataframe as blood_reordered. Hint: you will need to have a vector of index values from a. to reorder. Once you have created blood_reordered you can use the all() function as a sanity check to make sure it was done correctly.

blood_reordered <- blood[c(6, 1, 5, 2, 3, 4), ]
all(blood_reordered$ID == behavior$ID)
[1] TRUE

c. Combine the dataframes blood_reordered and behavior using the data.frame() function and save this to a new dataframe called blood_behavior. Note: you will find that there are now two “ID” columns. Comparing them helps verify that you have reordered correctly.

blood_behavior <- data.frame(blood_reordered, behavior)
blood_behavior
    ID blood_levels ID.1 diet exercise
14 546         1347  546  veg     high
1  983        43543  983  pes      low
7   42         2345   42 omni      low
2  952          465  952 omni      low
5  853          134  853 omni      med
6   61          347   61 omni     high
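
The first question in this exercise also asks a yes/no question: do all individuals with blood samples have behavioral information? One way to answer that directly, before any subsetting, is to wrap the %in% result in all(). The sketch below uses the original ID vectors from the start of the exercise; blood_ids, behavior_ids, and has_behavior are hypothetical names used so that the block runs on its own.

```r
# IDs as defined at the start of this exercise
behavior_ids <- c(546, 983, 042, 952, 853, 061)
blood_ids <- c(983, 952, 704, 555, 853, 061, 042, 237, 145, 581, 249, 467, 841, 546)

# TRUE where a blood sample has behavioral information
has_behavior <- blood_ids %in% behavior_ids

# Does every blood sample have behavioral information?
all(has_behavior)

# How many blood samples are missing it?
sum(!has_behavior)
```

Here all() returns FALSE, and sum() counts the 8 IDs listed in the answer to the first question.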

Exercise 4 Solution

Similar to the previous exercise, perform the reordering of the blood data to match the order of the IDs in the behavior dataframe, but this time use the match() function. Save the reordered blood dataframe as blood_reordered_match.

blood_reordered_idx <- match(behavior$ID, blood$ID)
blood_reordered_idx
[1] 6 1 5 2 3 4
blood_reordered_match <- blood[blood_reordered_idx, ]
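
The all() sanity check from Exercise 3 applies here as well. The sketch below rebuilds the two dataframes so that it runs on its own, then confirms that the match()-based reordering lines the IDs up with behavior. The blood dataframe is written out in its already-subset form, so its row names are 1 to 6 here rather than the original 1, 2, 5, 6, 7, 14.

```r
# Rebuild the exercise data so this block runs on its own
behavior <- data.frame(
  ID = c(546, 983, 42, 952, 853, 61),
  diet = c("veg", "pes", "omni", "omni", "omni", "omni"),
  exercise = c("high", "low", "low", "low", "med", "high")
)
# blood as it looks after the subsetting step in Exercise 3
blood <- data.frame(
  ID = c(983, 952, 853, 61, 42, 546),
  blood_levels = c(43543, 465, 134, 347, 2345, 1347)
)

# For each behavior ID, find its row position in blood
idx <- match(behavior$ID, blood$ID)
blood_reordered_match <- blood[idx, ]

# Sanity check: the IDs should now be in the same order
all(blood_reordered_match$ID == behavior$ID)
```

The argument order matters: the first argument to match() is the vector whose order you want to reproduce, and the second is the vector you are reordering.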