Reading in and inspecting data
Learning Objectives
- Demonstrate how to read existing data into R
- Utilize base R functions to inspect data structures
Reading data into R
The basics
Regardless of the specific analysis we are performing in R, we usually need to bring data in first, so learning how to read in data is a crucial part of learning to use R.
Many functions exist to read data in, and the function you use will depend on the format of the file being read. The table below lists some functions that can be used to import common plain text data types.
| Data type | Extension | Function | Package |
|---|---|---|---|
| Comma separated values | csv | read.csv() | utils (default) |
| | | read_csv() | readr (tidyverse) |
| Tab separated values | tsv | read_tsv() | readr |
| Other delimited formats | txt | read.table() | utils |
| | | read_table() | readr |
| | | read_delim() | readr |
For example, if we have a text file where the columns are separated by commas (comma-separated values or comma-delimited), we could use the function read.csv. However, if the data in a text file are separated by a different delimiter (e.g. ":", ";", " ", "\t"), we could use the generic read.table function and specify the delimiter (e.g. sep = " ") as an argument in the function.
The "\t" delimiter is shorthand for tab.
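As a minimal sketch of the difference, the example below writes the same small table to two temporary files, one comma-delimited and one tab-delimited, and reads each back in. The table contents and the object names (toy, csv_path, tsv_path) are made up for illustration and are not part of the lesson data.

```r
# Hypothetical toy table, used only for illustration
toy <- data.frame(sample = c("s1", "s2", "s3"),
                  count  = c(10, 20, 30))

# Write the table as comma-delimited and as tab-delimited text
csv_path <- tempfile(fileext = ".csv")
tsv_path <- tempfile(fileext = ".txt")
write.table(toy, csv_path, sep = ",",  row.names = FALSE, quote = FALSE)
write.table(toy, tsv_path, sep = "\t", row.names = FALSE, quote = FALSE)

# read.csv() assumes a comma delimiter and a header row
from_csv <- read.csv(csv_path)

# read.table() needs the delimiter and the header stated explicitly
from_tsv <- read.table(tsv_path, sep = "\t", header = TRUE)

# Compare the two results; they should hold the same rows and columns
all.equal(from_csv, from_tsv)
```

The only change between the two calls is how the delimiter and header are supplied; the resulting data frames have the same columns.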
In the above table we refer to base R functions as being contained in the “utils” package. In addition to base R functions, we have also listed functions from some other packages that can be used to import data, specifically the “readr” package that installs when you install the “tidyverse” suite of packages.
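To illustrate the two families side by side, here is a small sketch that reads one temporary file with the base R function and, only if readr happens to be installed, with its readr counterpart. The file contents are hypothetical, and the readr step is skipped when the package is not available.

```r
# Hypothetical two-row file, written to a temporary location
csv_path <- tempfile(fileext = ".csv")
writeLines(c("sample,count", "s1,10", "s2,20"), csv_path)

# Base R (utils): returns a plain data frame
base_df <- read.csv(csv_path)
class(base_df)

# readr (tidyverse): only run if the package is installed
if (requireNamespace("readr", quietly = TRUE)) {
  # read_csv() reports the guessed column types as a message;
  # suppressMessages() keeps the output short
  readr_df <- suppressMessages(readr::read_csv(csv_path))
  class(readr_df)  # a tibble, which is also a data frame
}
```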
In addition to plain text files, you can also import data from other statistical analysis packages and Excel using functions from different packages.
| Data type | Extension | Function | Package |
|---|---|---|---|
| Stata version 13-14 | dta | read_dta() | haven |
| Stata version 7-12 | dta | read.dta() | foreign |
| SPSS | sav | read.spss() | foreign |
| SAS | sas7bdat | read.sas7bdat() | sas7bdat |
| Excel | xlsx, xls | read_excel() | readxl (tidyverse) |
Note that these lists are not comprehensive, and many other functions exist for importing data. Once you have been using R for a while, you will likely develop a preference for which functions to use for which data type.
Metadata
When working with large datasets, you will very likely be working with a “metadata” file, which contains the information about each sample in your dataset.
The metadata is very important information, and we encourage you to create a document with as much metadata as you can record before you bring the data into R. Here is some additional reading on metadata from the HMS Data Management Working Group.
The read.csv() function
Let’s bring in the metadata file we downloaded earlier (mouse_exp_design.csv or mouse_exp_design.txt) using the read.csv function.
First, check the arguments for the function using the ? operator (that is, run ?read.csv in the console) to ensure that you are entering all the information appropriately.
The first thing you will notice is that you’ve pulled up the documentation for read.table(). This is because it is the parent function, and all the other functions are in the same family.
The next item on the documentation page is the function Description, which specifies that the output of this set of functions is going to be a data frame - “Reads a file in table format and creates a data frame from it, with cases corresponding to lines and variables to fields in the file.”
In the “Usage” section, the arguments listed for read.table() show the default values for all of the family members unless otherwise specified for a given function. Let’s take a look at two examples:
- The separator:
    - in the case of read.table() it is sep = "" (space or tab)
    - whereas for read.csv() it is sep = "," (a comma).
- The header: This argument refers to the column headers that may (TRUE) or may not (FALSE) exist in the plain text file you are reading in.
    - in the case of read.table() it is header = FALSE (by default, it assumes you do not have column names)
    - whereas for read.csv() it is header = TRUE (by default, it assumes that all your columns have names listed).
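These defaults can also be checked from the console without opening the help page. The sketch below uses the base R function formals() to pull the default values of sep and header for both functions; the object name defaults is just an illustrative choice.

```r
# Collect the default values of `sep` and `header` for each function
defaults <- data.frame(
  fun    = c("read.table", "read.csv"),
  sep    = c(formals(read.table)$sep, formals(read.csv)$sep),
  header = c(formals(read.table)$header, formals(read.csv)$header)
)

# Print the comparison: read.table uses "" and FALSE, read.csv uses "," and TRUE
defaults
```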
The take-home from the “Usage” section for read.csv() is that it has one mandatory argument, the path to the file and filename in quotations; in our case that is data/mouse_exp_design.csv or data/mouse_exp_design.txt.
stringsAsFactors argument
Note that the read.table {utils} family of functions has an argument called stringsAsFactors, which by default is set to FALSE (you can double check this by searching the Help tab for read.table or running ?read.table in the console).
If stringsAsFactors = TRUE, any function in this family of functions will coerce character columns in the data you are reading in to factor columns (i.e., coerce from vector to factor) in the resulting data frame.
If you want to maintain the character vector data structure (e.g., for gene names), you will want to make sure that stringsAsFactors = FALSE.
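A short sketch of the effect, using a made-up two-gene table passed in through the text argument rather than a file (the gene names and values are hypothetical):

```r
# Hypothetical toy data supplied as a string instead of a file
csv_text <- "gene,expression\nActb,12.5\nGapdh,8.1"

# Same input, read with each setting of stringsAsFactors
as_char   <- read.csv(text = csv_text, stringsAsFactors = FALSE)
as_factor <- read.csv(text = csv_text, stringsAsFactors = TRUE)

class(as_char$gene)    # "character"
class(as_factor$gene)  # "factor"
```

Setting the argument explicitly, as done here, makes the result independent of the default in whichever version of R is running.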
Create a data frame by reading in the file
At this point, please check the extension for the mouse_exp_design file within your data folder. You will have to type it accordingly within the read.csv() function.
read.csv is not fussy about extensions for plain text files, so even though the file we are reading in is a comma-separated value file, it will be read in properly even with a .txt extension.
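To see this in isolation, the sketch below writes comma-separated text to a temporary file with a .txt extension and reads it with read.csv(). The column names and values are placeholders and are not taken from the mouse_exp_design file.

```r
# Comma-separated content saved under a .txt extension (placeholder values)
txt_path <- tempfile(fileext = ".txt")
writeLines(c("sample,genotype", "sample1,Wt", "sample2,KO"), txt_path)

# read.csv() parses by delimiter, not by file extension
from_txt <- read.csv(file = txt_path)
str(from_txt)
```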
Let’s read in the mouse_exp_design file and create a new data frame called metadata.
metadata <- read.csv(file="data/mouse_exp_design.csv")
# OR
# metadata <- read.csv(file="data/mouse_exp_design.txt")

RStudio supports the automatic completion of code using the Tab key. This is especially helpful when reading in files, to ensure the correct file path. The tab completion feature also provides a shortcut to listing objects, and inline help for functions. Tab completion is your friend! We encourage you to use it whenever possible.
Importing data using the Import Dataset button
You can also use the Import Dataset button in your Environment pane to import data. This option is not as appealing because it can lack reproducibility if not documented properly, but it can be helpful when getting started. To use the Import Dataset button:
- Left-click the Import Dataset button in the Environment pane
- Left-click From Text (base…)
- Navigate to the file you would like to import and select Open
- Type the name you would like the imported object to be called in the Name textbox
- Select the delimiter that your file is using in the Separator dropdown menu
- Left-click Import
These steps are summarized in the GIF below: