Data subsetting with base R: vectors and factors
Approximate time: 60 min
Learning Objectives¶
- Demonstrate how to subset vectors and factors
- Explain the use of logical operators when subsetting vectors and factors
- Demonstrate how to relevel factors in a desired order
Selecting data using indices and sequences¶
When analyzing data, we often want to partition the data so that we are only working with selected columns or rows. A data frame or data matrix is simply a collection of vectors combined together. So let's begin with vectors and how to access different elements, and then extend those concepts to dataframes.
Vectors¶
Selecting using indices¶
If we want to extract one or several values from a vector, we must provide one or several indices using square brackets [ ]
syntax. The index represents the element number within a vector (or the compartment number, if you think of the bucket analogy). R indices start at 1. Programming languages like Fortran, MATLAB, and R start counting at 1, because that's what human beings typically do. Languages in the C family (including C++, Java, Perl, and Python) count from 0 because that's simpler for computers to do.
Let's start by creating a vector called age:
Suppose we only wanted the fifth value of this vector, we would use the following syntax:
If we wanted all values except the fifth value of this vector, we would use the following:
If we wanted to select more than one element we would still use the square bracket syntax, but rather than using a single value we would pass in a vector of several index values:
age[c(3,5,6)] ## nested
# OR
## create a vector first then select
idx <- c(3,5,6) # create vector of the elements of interest
age[idx]
To select a sequence of continuous values from a vector, we would use :
which is a special function that creates numeric vectors of integer in increasing or decreasing order. Let's select the first four values from age:
Alternatively, if you wanted the reverse could try 4:1
for instance, and see what is returned.
Exercise
- Create a vector called alphabets with the following letters, C, D, X, L, F.
- Use the associated indices along with
[ ]
to do the following:- only display C, D and F
- display all except X
- display the letters in the opposite order (F, L, X, D, C)
Selecting using indices with logical operators¶
We can also use indices with logical operators. Logical operators include greater than (>), less than (<), and equal to (==). A full list of logical operators in R is displayed below:
Operator | Description |
---|---|
> |
greater than |
>= |
greater than or equal to |
< |
less than |
<= |
less than or equal to |
== |
equal to |
!= |
not equal to |
& |
and |
| |
or |
We can use logical expressions to determine whether a particular condition is true or false. For example, let's use our age vector:
If we wanted to know if each element in our age vector is greater than 50, we could write the following expression:
Returned is a vector of logical values the same length as age with TRUE and FALSE values indicating whether each element in the vector is greater than 50.
We can use these logical vectors to select only the elements in a vector with TRUE values at the same position or index as in the logical vector.
Select all values in the age
vector over 50 or age
less than 18:
age > 50 | age < 18
age
age[age > 50 | age < 18] ## nested
# OR
## create a vector first then select
idx <- age > 50 | age < 18
age[idx]
Indexing with logical operators using the which()
function¶
While logical expressions will return a vector of TRUE and FALSE values of the same length, we could use the which()
function to output the indices where the values are TRUE. Indexing with either method generates the same results, and personal preference determines which method you choose to use. For example:
which(age > 50 | age < 18)
age[which(age > 50 | age < 18)] ## nested
# OR
## create a vector first then select
idx_num <- which(age > 50 | age < 18)
age[idx_num]
Notice that we get the same results regardless of whether or not we use the which()
. Also note that while which()
works the same as the logical expressions for indexing, it can be used for multiple other operations, where it is not interchangeable with logical expressions.
Factors¶
Since factors are special vectors, the same rules for selecting values using indices apply. The elements of the expression factor created previously had the following categories or levels: low, medium, and high.
Let's extract the values of the factor with high expression, and let's using nesting here:
expression[expression == "high"] ## This will only return those elements in the factor equal to "high"
Nesting note
The piece of code above was more efficient with nesting; we used a single step instead of two steps as shown below:
Step1 (no nesting): idx <- expression == "high"
Step2 (no nesting): expression[idx]
Exercise
Extract only those elements in samplegroup
that are not KO (nesting the logical operation is optional).
Releveling factors¶
We have briefly talked about factors, but this data type only becomes more intuitive once you've had a chance to work with it. Let's take a slight detour and learn about how to relevel categories within a factor.
To view the integer assignments under the hood you can use str()
:
expression
factor were assigned integers alphabetically, with high=1
, low=2
, medium=3
. However, it makes more sense for us if low=1
, medium=2
and high=3
, i.e. it makes sense for us to relevel the categories in this factor.
To relevel the categories, you can add the levels
argument to the factor()
function, and give it a vector with the categories listed in the required order:
expression <- factor(expression, levels=c("low", "medium", "high")) # you can re-factor a factor
str(expression)
Factor w/ 3 levels "low","medium",..: 1 3 2 3 1 2 3
Now we have a releveled factor with low as the lowest or first category, medium as the second and high as the third. This is reflected in the way they are listed in the output of str()
, as well as in the numbering of which category is where in the factor.
Note
Releveling becomes necessary when you need a specific category in a factor to be the "base" category, i.e. category that is equal to 1. One example would be if you need the "control" to be the "base" in a given RNA-seq experiment.
Exercise
Use the samplegroup
factor we created in a previous lesson, and relevel it such that KO is the first level followed by CTL and OE.
Attribution notice
-
This lesson has been developed by members of the teaching team at the Harvard Chan Bioinformatics Core (HBC). These are open access materials distributed under the terms of the Creative Commons Attribution license (CC BY 4.0), which permits unrestricted use, distribution, and reproduction in any medium, provided the original author and source are credited.
-
The materials used in this lesson are adapted from work that is Copyright © Data Carpentry (http://datacarpentry.org/).
-
All Data Carpentry instructional material is made available under the Creative Commons Attribution license (CC BY 4.0).