The %in% operator

R Programming

data wrangling

This lesson introduces the %in% operator in R for identifying matching elements between vectors and demonstrates how to use any() and all() to efficiently evaluate logical conditions. Participants will apply these tools to check dataset consistency among sample identifiers.

Authors

Mary Piper

Meeta Mistry

Radhika Khetani

Jihe Liu

Will Gammerdinger

Noor Sohail

Published

May 30, 2025

Keywords

R, in operator, vector matching, any, all

Approximate time: 45 min

Learning Objectives

Describe the use of the %in% operator.
Explain the use case for the any() and all() functions.

Logical operators for identifying matching elements

Oftentimes, we encounter different analysis tools that require multiple input datasets. It is not uncommon for these inputs to need to have the same row names, column names, or unique identifiers in the same order to perform the analysis. Therefore, knowing how to reorder datasets and determine whether the data matches is an important skill.

In our use case, we will be working with genomic data. We have gene expression data generated by RNA-seq, which we had downloaded previously; in addition, we have a metadata file corresponding to the RNA-seq samples. The metadata contains information about the samples present in the gene expression file, such as which sample group each sample belongs to and any batch or experimental variables present in the data.

Let’s read in our gene expression data (RPKM matrix) that we downloaded previously:

# Read in the expression data
rpkm_data <- read.csv("data/counts.rpkm.csv")

Take a look at the first few lines of the data matrix to see what’s in there.

# View the first six lines of rpkm_data
head(rpkm_data)

                     sample2    sample5  sample7   sample8   sample9   sample4
ENSMUSG00000000001 19.265000 23.7222000 2.611610 5.8495400 6.5126300 24.076700
ENSMUSG00000000003  0.000000  0.0000000 0.000000 0.0000000 0.0000000  0.000000
ENSMUSG00000000028  1.032290  0.8269540 1.134410 0.6987540 0.9251170  0.827891
ENSMUSG00000000031  0.000000  0.0000000 0.000000 0.0298449 0.0597726  0.000000
ENSMUSG00000000037  0.056033  0.0473238 0.000000 0.0685938 0.0494147  0.180883
ENSMUSG00000000049  0.258134  1.0730200 0.252342 0.2970320 0.2082800  2.191720
                      sample6   sample12   sample3   sample11  sample10
ENSMUSG00000000001 20.8198000 26.9158000 20.889500 24.0465000 24.198100
ENSMUSG00000000003  0.0000000  0.0000000  0.000000  0.0000000  0.000000
ENSMUSG00000000028  1.1686300  0.6735630  0.892183  0.9753270  1.045920
ENSMUSG00000000031  0.0511932  0.0204382  0.000000  0.0000000  0.000000
ENSMUSG00000000037  0.1438840  0.0662324  0.146196  0.0206405  0.017004
ENSMUSG00000000049  1.6853800  0.1161970  0.421286  0.0634322  0.369550
                      sample1
ENSMUSG00000000001 19.7848000
ENSMUSG00000000003  0.0000000
ENSMUSG00000000028  0.9377920
ENSMUSG00000000031  0.0359631
ENSMUSG00000000037  0.1514170
ENSMUSG00000000049  0.2567330
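The checks that follow rely on a metadata data frame whose row names are the sample identifiers; it needs to be loaded before continuing, for example with read.csv(file="data/mouse_exp_design.csv"). The sketch below uses a small hypothetical file (the column names group and batch are invented for illustration) to show why row names end up holding the sample identifiers: when the header row has one fewer field than the data rows, read.csv() uses the first column as row names.

```r
# Hypothetical toy metadata file: the header has one fewer field than the
# data rows, so read.csv() treats the first column as row names
toy_path <- tempfile(fileext = ".csv")
writeLines(c('"group","batch"',
             '"sample1","control",1',
             '"sample2","treated",1',
             '"sample3","treated",2'), toy_path)

# Read in the toy metadata
toy_metadata <- read.csv(toy_path)

# The sample identifiers are stored as row names
rownames(toy_metadata)

# Number of samples described by the metadata
nrow(toy_metadata)
```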

It looks as if the sample names (header) in our data matrix are similar to the row names of our metadata file, but it’s hard to tell since they are not in the same order. We can do a quick check of the number of columns in the count data and the rows in the metadata and at least see if the numbers match up.

# Return the number of columns in rpkm_data 
ncol(rpkm_data)

[1] 12

# Return the number of rows in metadata
nrow(metadata)

[1] 12

What we want to know is: do we have data for every sample for which we have metadata?

The `%in%` operator

Although its documentation is easy to overlook (it shares a help page with match(), which you can open with ?match), this operator is widely used and convenient once you get the hang of it. The operator is used with the following syntax:

# DO NOT RUN
# Example syntax
vector1 %in% vector2

It will take each element from vector1 as input, one at a time, and evaluate if the element is present in vector2. The two vectors do not have to be the same size.

This operation will return a vector containing logical values to indicate whether or not there is a match. The new vector will be of the same length as vector1. Take a look at the example below:

# Vector with odd numbers
A <- c(1,3,5,7,9,11)
# Vector with even numbers
B <- c(2,4,6,8,10,12)

# Test to see if each of the elements of A is in B  
A %in% B

[1] FALSE FALSE FALSE FALSE FALSE FALSE

Since vector A contains only odd numbers and vector B contains only even numbers, the operation returns a logical vector containing six FALSE values, indicating that no element in vector A is present in vector B. Let’s change a couple of numbers inside vector B to match vector A:

# Vector with odd numbers
A <- c(1,3,5,7,9,11)
# Vector with odd and even numbers
B <- c(2,4,6,8,1,5)

# Test to see if each of the elements of A is in B
A %in% B

[1]  TRUE FALSE  TRUE FALSE FALSE FALSE

The returned logical vector denotes which elements in A are also in B - the first and third elements, which are 1 and 5.
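As a small sketch of the two points made earlier (the vectors may differ in length, and the result always has the length of the left-hand vector), here is the same operator applied to character vectors. The sample identifiers are hypothetical and are not taken from the lesson data.

```r
# Hypothetical sample identifiers (for illustration only)
have_data <- c("sample1", "sample2", "sample3", "sample4")
wanted <- c("sample3", "sample9")

# One logical value per element of the left-hand vector (length 2)
wanted %in% have_data

# Swapping the vectors changes the length of the result (length 4)
have_data %in% wanted

# which() converts the logical vector into the matching positions
which(have_data %in% wanted)
```

Because the result follows the left-hand vector, the order in which the two vectors are written matters whenever the output is later used for subsetting.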

We saw previously that we could use the output from a logical expression to subset data by returning only the values corresponding to TRUE. Therefore, we can use the output logical vector to subset our data and return only those elements in A that are also in B:

# Test to see if each of the elements of A is in B and assign the logical vector output to intersection
intersection <- A %in% B

# Show the contents of intersection
intersection

[1]  TRUE FALSE  TRUE FALSE FALSE FALSE

# Subset the A vector by the values returning TRUE for being in both A and B
A[intersection]

[1] 1 5
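The intermediate variable is optional. A minimal sketch, reusing the A and B vectors from above, places the %in% expression directly inside the square brackets; negating it with ! returns the elements of A that have no match in B.

```r
# Vector with odd numbers
A <- c(1, 3, 5, 7, 9, 11)
# Vector with odd and even numbers
B <- c(2, 4, 6, 8, 1, 5)

# Subset A in one step, without storing the logical vector first
A[A %in% B]

# Negate the logical vector to keep the elements of A that are NOT in B
A[!(A %in% B)]
```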

In these previous examples, the vectors were so small that it’s easy to check every logical value by eye, but this is not practical when we work with large datasets (e.g. a vector with 1000 logical values). Instead, we can use the any() function. Given a logical vector, this function will tell you whether at least one value is TRUE. It gives us a quick way to assess if any of the values contained in vector A are also in vector B:

# Test to see if any values of A are in B
any(A %in% B)

[1] TRUE

The all function is also useful. Given a logical vector, it will tell you whether all values are TRUE. If there is at least one FALSE value, the all function will return a FALSE. We can use this function to assess whether all elements from vector A are contained in vector B.

# Test to see if all values of A are in B
all(A %in% B)

[1] FALSE
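For a larger vector, a count of the matches is often a useful complement to any() and all(). The sketch below uses made-up vectors (big and ref are hypothetical names) and relies on the fact that sum() treats TRUE as 1 and FALSE as 0.

```r
# Hypothetical vectors: the integers 1 to 1000, and the even numbers up to 2000
big <- 1:1000
ref <- seq(2, 2000, by = 2)

# Is at least one element of big present in ref?
any(big %in% ref)

# Are all elements of big present in ref?
all(big %in% ref)

# How many elements of big are present in ref?
sum(big %in% ref)

# What proportion of big is present in ref?
mean(big %in% ref)
```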

Exercise 1

Using the A and B vectors created above, evaluate each element in B to see if there is a match in A.
Subset the B vector to only return those values that are also in A.

Suppose we had two vectors containing the same values. How can we check if those values are in the same order in each vector? In this case, we can use the == operator to compare the elements at each position of the two vectors.

The operator returns a logical vector indicating TRUE/FALSE at each position. We can then use the all() function to check whether all values in the returned vector are TRUE. If they are, the two vectors contain the same values in the same order. Unlike the %in% operator, the == operator should only be used this way on vectors of equal length: if the lengths differ, R recycles the shorter vector (with a warning when the longer length is not a multiple of the shorter), so the result no longer reflects a position-by-position match.

# Create a vector of numbers
A <- c(10,20,30,40,50)
# Create another vector of the same numbers but backwards
B <- c(50,40,30,20,10)

# Test to see if each element of A is in B
A %in% B

[1] TRUE TRUE TRUE TRUE TRUE

# Test to see if each element of A is in the same position in B
A == B

[1] FALSE FALSE  TRUE FALSE FALSE

# Test if the vectors A and B are a perfect match
all(A == B)

[1] FALSE
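The following sketch illustrates why vector length matters for ==. The vector D is hypothetical and chosen so that recycling makes all(A == D) return TRUE even though the vectors are clearly different; checking the lengths first avoids that trap. setequal() is shown as one way to ask whether two vectors hold the same values regardless of order.

```r
# Vectors from the example above
A <- c(10, 20, 30, 40, 50)
B <- c(50, 40, 30, 20, 10)
# Hypothetical longer vector: A repeated twice
D <- c(10, 20, 30, 40, 50, 10, 20, 30, 40, 50)

# A is recycled to the length of D, so every position appears to match
all(A == D)

# Checking the lengths first gives the answer we actually want
length(A) == length(D) && all(A == D)

# Same values regardless of order?
setequal(A, B)

# After sorting B, the two vectors match position by position
all(A == sort(B))
```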

Let’s try this on our genomic data and see whether we have expression data for all samples in our metadata. We’ll start by creating two vectors: one is the rownames of the metadata, and one is the colnames of the RPKM data. These are base functions in R which allow you to extract the row and column names as a vector:

# Assign the rownames of the metadata data frame to x 
x <- rownames(metadata)
# Assign the column names of the rpkm_data data frame to y
y <- colnames(rpkm_data)

Now check to see that all of x are in y:

# Test if all of the rownames in the metadata data frame are also in the column names in the rpkm_data data frame
all(x %in% y)

[1] TRUE

Note

Note that we can use nested functions in place of x and y and still get the same result.

# Test if all of the rownames in the metadata data frame are also in the column names in the rpkm_data data frame
all(rownames(metadata) %in% colnames(rpkm_data))
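In the lesson data this check returns TRUE. If it had returned FALSE, the next question would be which samples are missing. A minimal sketch with hypothetical identifiers (meta_ids and expr_ids are invented stand-ins for the row names of metadata and the column names of rpkm_data):

```r
# Hypothetical identifiers: one sample in the metadata has no expression data
meta_ids <- c("sample1", "sample2", "sample3", "sample4")
expr_ids <- c("sample2", "sample4", "sample1")

# Not every metadata sample is found in the expression data
all(meta_ids %in% expr_ids)

# Negate the logical vector to list the samples without a match
meta_ids[!(meta_ids %in% expr_ids)]

# setdiff() returns the same information
setdiff(meta_ids, expr_ids)
```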

We know that all samples are present, but are they in the same order?

# Test if all of the rownames in the metadata data frame are also in the column names in the rpkm_data data frame and in the same order
all(x == y)

[1] FALSE

No, it looks like they need to be reordered. This can be accomplished in a few ways. We could use the match() function or we could provide a vector for how we would like our data to be reordered. We will do the latter:

# Reorder rpkm_data
rpkm_ordered  <- rpkm_data[, rownames(metadata)]

We can check to make sure that our reordering worked:

# Check that the reorder worked
all(rownames(metadata) == colnames(rpkm_ordered))

[1] TRUE
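For comparison, here is a sketch of the match() alternative mentioned above, using small hypothetical data frames (toy_expr and toy_meta are invented for illustration, not the lesson data). match() returns, for each element of its first argument, the position of that element in the second argument; those positions can then be used to reorder the columns.

```r
# Hypothetical expression data with columns in a scrambled order
toy_expr <- data.frame(sample2 = c(1, 2),
                       sample3 = c(3, 4),
                       sample1 = c(5, 6),
                       row.names = c("geneA", "geneB"))

# Hypothetical metadata with sample identifiers as row names
toy_meta <- data.frame(group = c("ctl", "ctl", "trt"),
                       row.names = c("sample1", "sample2", "sample3"))

# Position of each metadata sample among the expression data columns
idx <- match(rownames(toy_meta), colnames(toy_expr))
idx

# Reorder the columns using those positions
toy_ordered <- toy_expr[, idx]

# Check that the reorder worked
all(rownames(toy_meta) == colnames(toy_ordered))
```

Both approaches give the same column order here; indexing by name is shorter, while match() makes the positions explicit.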

Exercise 2

We have a list of 6 marker genes that we are very interested in. Our goal is to extract count data for these genes using the %in% operator from the rpkm_data data frame, instead of scrolling through rpkm_data and finding them manually.

First, let’s create a vector called important_genes with the Ensembl IDs of the 6 genes we are interested in:

# Create important genes vector
important_genes <- c("ENSMUSG00000083700", "ENSMUSG00000080990", "ENSMUSG00000065619", "ENSMUSG00000047945", "ENSMUSG00000081010", "ENSMUSG00000030970")

Use the %in% operator to determine if all of these genes are present in the row names of the rpkm_data data frame.
Extract the rows from rpkm_data that correspond to these 6 genes using [] and the %in% operator. Double check the row names to ensure that you are extracting the correct rows.
Bonus question: Extract the rows from rpkm_data that correspond to these 6 genes using [], but without using the %in% operator.

Reuse

CC-BY-4.0
