# Data Wrangling

This lesson focuses on wrangling linked datasets with pandas, including filtering, reordering, saving cleaned tables, and merging DataFrames to prepare real data for analysis.

Topics: data cleaning, merging DataFrames, saving CSV files

Approximate time: XX minutes

## Learning Objectives

In this lesson, we will:

- Reorder related datasets to ensure that they are in the same order
- Extract specific rows from a DataFrame using the `isin()` method
- Save a DataFrame as a new CSV file
- Use the `merge()` method to combine two DataFrames

## Overview of lesson

Data wrangling is the step where messy, real-world data is turned into something usable. In practice, this often means linking multiple tables (e.g., combining sample metadata with expression data), filtering out unwanted rows, and then saving clean versions of your data for later use. pandas enables you to do all of this and more. In this lesson, you will work with linked datasets and learn how to merge, filter, and reorder tables so that you can easily complete downstream analyses and create visualizations.

## Linked DataFrames

Oftentimes, we have multiple files related to the same dataset. In these cases, we want to make sure that the data in these files are linked together in a way that allows us to match identities correctly. Therefore, knowing how to reorder datasets and determine data matches is an important skill.

In our use case, we will be working with genomic data. We have gene expression data generated by RNA-seq and metadata that corresponds to each sample. In our data folder, we have these two related files:

- `mouse_exp_design.csv`
- `counts.rpkm.csv`

### Read in Expression Data and Metadata

Let's start by reading in our gene expression data (RPKM matrix) and then previewing the first few lines:

```python
# Import pandas using the alias, pd
import pandas as pd

# Read in the gene expression data, using the first column (gene IDs) as the index
rpkm_data = pd.read_csv("../data/counts.rpkm.csv", index_col=0)

# Inspect the gene expression data to make sure it was read in properly
rpkm_data.head()
```
```
                         sample2       sample5  ...   sample10        sample1
ENSMUSG00000000001  1.926500e+01  2.372220e+01  ...  24.198100   1.978480e+01
ENSMUSG00000000003  0.000000e+00  0.000000e+00  ...   0.000000   0.000000e+00
ENSMUSG00000000028  1.032290e+00  8.269540e-01  ...   1.045920   9.377920e-01
ENSMUSG00000000031  0.000000e+00  0.000000e+00  ...   0.000000   3.596310e-02
ENSMUSG00000000037  5.603300e-02  4.732380e-02  ...   0.017004   1.514170e-01
...                          ...           ...  ...        ...            ...
ENSMUSG00000099329  3.856260e-23  3.565390e-89  ...   0.285499  6.799790e-169
ENSMUSG00000099330  0.000000e+00  0.000000e+00  ...   0.000000   0.000000e+00
ENSMUSG00000099331  0.000000e+00  0.000000e+00  ...   0.000000   2.976280e+01
ENSMUSG00000099332  0.000000e+00  0.000000e+00  ...   0.000000   0.000000e+00
ENSMUSG00000099334  0.000000e+00  0.000000e+00  ...   0.000000   0.000000e+00

[38828 rows x 12 columns]
```
Let's read in and inspect the associated metadata file:

```python
# Read in the metadata, using the first column (sample names) as the index
metadata = pd.read_csv("../data/mouse_exp_design.csv", index_col=0)

# Inspect the metadata to make sure it was read in properly
metadata.head()
```

```
        genotype celltype  replicate
sample1       Wt    typeA          1
sample2       Wt    typeA          2
sample3       Wt    typeA          3
sample4       KO    typeA          1
sample5       KO    typeA          2
```
It looks as if the sample names (header) in our data matrix are similar to the row names of our metadata file, but it's hard to tell since they are not in the same order. As a quick check, we can compare the number of columns in the count data with the number of rows in the metadata and see whether the numbers at least match up.
```python
# Check the dimensions of the RPKM data
print("Dimensions of RPKM data:", rpkm_data.shape)
```

```
Dimensions of RPKM data: (38828, 12)
```

```python
# Check the dimensions of the metadata
print("Dimensions of metadata:", metadata.shape)
```

```
Dimensions of metadata: (12, 3)
```
What we want to know is, are all the samples in our metadata also in our expression data?
### The in Operator
Let’s try checking if all our metadata samples are also in our expression data. We will use the in operator to check for membership in a collection.
```python
# Assign the row names of the metadata DataFrame to x
x = metadata.index

# Assign the column names of the rpkm_data DataFrame to y
y = rpkm_data.columns
```

Now we can check whether the elements of x are in y:

```python
# Check if all elements of x are in y
x in y
```

```
TypeError: unhashable type: 'Index'
```
We get an error because this code does not check whether each element of x can be found in y; it tries to treat the entire Index x as a single element of y. The `isin()` method, on the other hand, checks each individual element of x against y:
```python
# Check if each element of x is in y using isin
print(x.isin(y))
```

```
[ True  True  True  True  True  True  True  True  True  True  True  True]
```
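The difference between `in` and `isin()` is easy to see on a small made-up example (the labels here are invented for illustration):

```python
import pandas as pd

# A toy set of row names and a toy set of column names
needles = pd.Index(["a", "b", "c"])
haystack = pd.Index(["c", "a", "b", "d"])

# `in` asks: is this ONE label an element of the collection?
print("a" in haystack)

# isin() asks the question element-by-element for the whole Index
print(needles.isin(haystack))

# all() summarizes the element-wise booleans into a single answer
print(all(needles.isin(haystack)))
```

Here `in` tests a single label, while `isin()` returns one boolean per element, which `all()` can then collapse into a single True or False.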
There are many other ways to check for membership. One alternative is a list comprehension, which iterates over the elements of x and tests each one for membership in y:
```python
# Check if each element of x is in y
[sample in y for sample in x]
```

```
[True, True, True, True, True, True, True, True, True, True, True, True]
```
If we have a large number of samples and cannot visually confirm that every element of x is in y, we can instead use the all() function to check whether all elements are True:
```python
# Check if all elements of x are in y
all(x.isin(y))
```

```
True
```
So now we know that all of the samples in our metadata are also in our expression data, but we don't know if they are in the same order. We can check the order of our data by comparing the order of x and y:
```python
# Check if the order of x and y is the same
x == y
```

```
array([False, False, False, False, False, False, False, False, False,
       False, False, False])
```
We can see that they are not in the same order. We can use the reindex method to reorder our expression data to match the order of our metadata:
```python
# Reorder the columns of the rpkm_data DataFrame to match the order of the metadata
rpkm_ordered = rpkm_data.reindex(columns=x)

# Inspect the reordered table to make sure it looks correct
rpkm_ordered.head()
```

```
                      sample1    sample2  ...   sample11   sample12
ENSMUSG00000000001  19.784800  19.265000  ...  24.046500  26.915800
ENSMUSG00000000003   0.000000   0.000000  ...   0.000000   0.000000
ENSMUSG00000000028   0.937792   1.032290  ...   0.975327   0.673563
ENSMUSG00000000031   0.035963   0.000000  ...   0.000000   0.020438
ENSMUSG00000000037   0.151417   0.056033  ...   0.020640   0.066232

[5 rows x 12 columns]
```
When we look at the columns of rpkm_ordered now, we see that they are now in the same order as the row names of our metadata (numerical order).
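The reordering above can be sketched on a tiny made-up table (names and values invented for illustration):

```python
import pandas as pd

# Toy expression-style table with columns out of order
df = pd.DataFrame({"s2": [1, 2], "s1": [3, 4], "s3": [5, 6]},
                  index=["geneA", "geneB"])

# reindex(columns=...) returns a new DataFrame with columns in the requested order
ordered = df.reindex(columns=["s1", "s2", "s3"])
print(list(ordered.columns))
```

Note that if you pass a label that does not exist, `reindex` inserts a new column filled with NaN rather than raising an error, so it is worth inspecting the result afterward.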
We have a list of 6 marker genes that we are interested in. Our goal is to extract the count data for these genes from the `rpkm_ordered` DataFrame using the `isin()` method instead of finding them manually.

First, we will create a list called `important_genes` with the Ensembl IDs of the 6 genes we are interested in:

```python
# Create important genes list
important_genes = ["ENSMUSG00000083700", "ENSMUSG00000080990",
                   "ENSMUSG00000065619", "ENSMUSG00000047945",
                   "ENSMUSG00000081010", "ENSMUSG00000030970"]
```

1. Use the `isin()` method to determine whether all of these genes are present in the row names of the `rpkm_ordered` DataFrame.
2. Extract the rows from `rpkm_ordered` that correspond to these 6 genes by using `[]` and the `isin()` method. Double-check the row names to ensure that you are extracting the correct rows.
3. Bonus question: Extract the rows from `rpkm_ordered` that correspond to these 6 genes using `[]`, but without using the `isin()` method.
## Saving CSV Files
Now that we have our properly ordered expression data, we will save it as a new CSV file by using the to_csv method:
```python
# Save the reordered expression data to a new CSV file
rpkm_ordered.to_csv("../data/ordered_counts_rpkm.csv")
```

If we want to read this new file back in, we can use the read_csv function:

```python
# Read in the new CSV file
rpkm_ordered_new = pd.read_csv("../data/ordered_counts_rpkm.csv")

# Inspect the read-in CSV file
rpkm_ordered_new.head()
```

```
           Unnamed: 0    sample1  ...   sample11   sample12
0  ENSMUSG00000000001  19.784800  ...  24.046500  26.915800
1  ENSMUSG00000000003   0.000000  ...   0.000000   0.000000
2  ENSMUSG00000000028   0.937792  ...   0.975327   0.673563
3  ENSMUSG00000000031   0.035963  ...   0.000000   0.020438
4  ENSMUSG00000000037   0.151417  ...   0.020640   0.066232

[5 rows x 13 columns]
```
If you look at the output, you'll see an extra column called Unnamed: 0. This is because when we loaded the CSV file, the index (row names) was not set; the gene IDs were read in as a regular column without a name. When we load our new CSV file, we want to set the first column as the index to ensure that the genes are stored as row names:
```python
# Read in the new CSV file, setting the first column as the index
rpkm_ordered_new = pd.read_csv("../data/ordered_counts_rpkm.csv", index_col=0)

# Inspect the read-in CSV file
rpkm_ordered_new.head()
```

```
                      sample1    sample2  ...   sample11   sample12
ENSMUSG00000000001  19.784800  19.265000  ...  24.046500  26.915800
ENSMUSG00000000003   0.000000   0.000000  ...   0.000000   0.000000
ENSMUSG00000000028   0.937792   1.032290  ...   0.975327   0.673563
ENSMUSG00000000031   0.035963   0.000000  ...   0.000000   0.020438
ENSMUSG00000000037   0.151417   0.056033  ...   0.020640   0.066232

[5 rows x 12 columns]
```
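A quick round-trip on a toy table (written to a temporary file here, so the lesson's data files are untouched) shows where the Unnamed: 0 column comes from:

```python
import os
import tempfile
import pandas as pd

# A toy table with gene-style row names
df = pd.DataFrame({"s1": [1.0, 2.0]}, index=["geneA", "geneB"])

# Write it to a temporary CSV; the index is saved as an unnamed first column
path = os.path.join(tempfile.mkdtemp(), "toy.csv")
df.to_csv(path)

# Reading without index_col leaves the old index as a column named "Unnamed: 0"
naive = pd.read_csv(path)
print(naive.columns.tolist())

# Setting index_col=0 restores the row names
fixed = pd.read_csv(path, index_col=0)
print(fixed.index.tolist())
```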
## Wrangling DataFrames
Now we want to update some information in our metadata object. We want to evaluate the average expression for each sample and its relationship with the age of the mouse. Our goal is to wrangle our data so we can create visualizations with these new columns in the next lesson.
### Calculating Average Expression
Let’s take a closer look at our counts data. Each column represents a sample in our experiment and each sample has ~38K values corresponding to the expression of different transcripts. We want to compute the average value of expression for each sample.
Taking this one step at a time, what would we do if we just wanted the average expression for sample1 (across all transcripts)? We can use the numpy function mean():
```python
# Import numpy using the alias, np
import numpy as np

# Take the mean of the expression values for sample1
print(np.mean(rpkm_ordered["sample1"]))
```

```
10.266102101254251
```
This gives us the average from one of the samples (one column in a DataFrame), but we need to get this information from all 12 samples (therefore all columns). Ideally, we want a list of 12 values that we can easily add to the metadata DataFrame. What is the best way to do this?
Luckily, the pandas library has a built-in method called mean() that can be applied to a DataFrame. By default, it will calculate the mean of each column and return a new series with the mean values for each column:
```python
# Calculate the mean expression for each sample (column) in the rpkm_ordered DataFrame
samplemeans = rpkm_ordered.mean()

# Inspect samplemeans
samplemeans
```

```
sample1     10.266102
sample2     10.849759
sample3      9.452517
sample4     15.833872
sample5     15.590184
sample6     15.551529
sample7     15.522219
sample8     13.808281
sample9     14.108399
sample10    10.743292
sample11    10.778318
sample12     9.754733
dtype: float64
```
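On a toy DataFrame you can see that `mean()` averages each column by default, while passing `axis=1` averages each row instead (the values here are made up for illustration):

```python
import pandas as pd

# A tiny expression-style table: two genes, two samples
df = pd.DataFrame({"s1": [1.0, 3.0], "s2": [2.0, 4.0]},
                  index=["geneA", "geneB"])

# Default: one mean per column (per sample)
col_means = df.mean()
print(col_means["s1"])

# axis=1: one mean per row (per gene)
row_means = df.mean(axis=1)
print(row_means["geneA"])
```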
We now have the values that we want, but notice that the Series does not have a name associated with it. We can use the name attribute to assign one:
```python
# Name the samplemeans series
samplemeans.name = "mean_expression"

# Inspect samplemeans
samplemeans
```

```
sample1     10.266102
sample2     10.849759
sample3      9.452517
sample4     15.833872
sample5     15.590184
sample6     15.551529
sample7     15.522219
sample8     13.808281
sample9     14.108399
sample10    10.743292
sample11    10.778318
sample12     9.754733
Name: mean_expression, dtype: float64
```
### Merging DataFrames
At this point, samplemeans is a series object with the sample names as the index and the mean expression values as the corresponding values. Next we will add this information to our metadata DataFrame. One of the most straightforward ways to do so is to use the merge() method from the pandas library, which allows us to merge two DataFrames by using a common column or index.
Since this is a new method, we will use the help() function to learn more about it:
```python
# Get help on the merge function
help(pd.merge)
```

When we look at the arguments for the merge() method, we can see that we can specify the left and right DataFrames to merge, as well as the columns or indices we want to merge on. We can specify which columns to merge on using the left_on and right_on arguments, or we can specify that we want to merge on the index using the left_index and right_index arguments.
In this case, the "left" DataFrame will be the metadata and the "right" DataFrame will be the samplemeans. Since both metadata and samplemeans have the sample names as indices, we can merge them on the indices using the left_index and right_index arguments:
```python
# Merge the samplemeans series with the metadata DataFrame on the index
new_metadata = metadata.merge(samplemeans,
                              left_index=True,
                              right_index=True)

# Inspect the merged metadata
new_metadata.head()
```

```
        genotype celltype  replicate  mean_expression
sample1       Wt    typeA          1        10.266102
sample2       Wt    typeA          2        10.849759
sample3       Wt    typeA          3         9.452517
sample4       KO    typeA          1        15.833872
sample5       KO    typeA          2        15.590184
```
Now new_metadata contains all the information from the original metadata as well as a new column with the mean expression values for each sample. Note that merge() performs an inner join by default, keeping only rows whose keys appear in both inputs; because the two indices here contain exactly the same sample names, every row is kept.
Merging can be done with any two DataFrames that share a common column or index, which makes it a very powerful tool for combining information from different sources.
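As a small sketch with invented tables, `merge()` can also join on a shared column (via the `on` argument) rather than on the index; the table and column names below are made up for illustration:

```python
import pandas as pd

# Two toy tables that share a "sample" column
info = pd.DataFrame({"sample": ["s1", "s2"],
                     "genotype": ["Wt", "KO"]})
means = pd.DataFrame({"sample": ["s2", "s1"],
                      "mean_expr": [4.5, 2.5]})

# Merge on the shared column; by default this is an inner join
combined = info.merge(means, on="sample")
print(combined)
```

Passing `how="left"` or `how="outer"` instead keeps rows whose keys appear in only one of the two tables.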
### Adding an age column
Let’s also create a list with the ages of each of the mice in our dataset. We also want to ensure that we have 12 values in this list to match the 12 samples in our data set. In this example we are supplying the ages manually, but in a real use case, we would likely have this information in a separate file that we would read in and merge with our metadata DataFrame using the merge() function as we just did with the samplemeans series.
```python
# Create a list with the age of each mouse, in days
age_in_days = [40, 32, 38, 35, 41, 32,
               34, 26, 28, 28, 30, 32]

# Double check that we have 12 values, one for each sample
len(age_in_days)
```

```
12
```
Now we will create a new column in our new_metadata DataFrame called age_in_days and assign the values from our age_in_days list to it:
```python
# Assign the age_in_days list to a new column in new_metadata
new_metadata["age_in_days"] = age_in_days

# Inspect new_metadata to ensure the age column was appended correctly
new_metadata.head()
```

```
        genotype celltype  replicate  mean_expression  age_in_days
sample1       Wt    typeA          1        10.266102           40
sample2       Wt    typeA          2        10.849759           32
sample3       Wt    typeA          3         9.452517           38
sample4       KO    typeA          1        15.833872           35
sample5       KO    typeA          2        15.590184           41
```
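Why double-check the length first? Assigning a list whose length does not match the number of rows raises an error, as this toy sketch (made-up table, for illustration) shows:

```python
import pandas as pd

df = pd.DataFrame({"genotype": ["Wt", "KO", "Wt"]},
                  index=["s1", "s2", "s3"])

# A matching-length list becomes a new column, aligned by position
df["age_in_days"] = [40, 32, 38]
print(df["age_in_days"].tolist())

# A wrong-length list fails loudly with a ValueError
try:
    df["bad"] = [1, 2]
except ValueError as err:
    print("ValueError:", err)
```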
Now that we have the data in the format that we want, we will save the new metadata DataFrame as a new CSV file using the to_csv() method:
```python
# Save the new metadata DataFrame to a new CSV file
new_metadata.to_csv("../data/new_metadata.csv")
```

We are now ready for plotting and data visualization!
## Fun With Data Wrangling
As you work more and more with different datasets, you will likely find yourself spending a lot of time and energy wrangling data to put it in the right format for analysis. We have a few fun and useful exercises to practice some of the data wrangling skills we have learned in this lesson.
### Reading in and inspecting data
1. Read the `animals.csv` file into your environment and assign it to a variable called `animals`. Be sure to check that your row names are the different animals.
2. Check to make sure that `animals` is a DataFrame.
3. How many rows are in the `animals` DataFrame? How many columns?
### Data wrangling
1. Extract the `speed` value of 40 km/h from the `animals` DataFrame.
2. Return the rows with animals that are the color Tan.
3. Return the rows with animals that have a `speed` greater than 50 km/h, and output only the `color` column. Keep the output as a DataFrame.
4. Change the color "Grey" to "Gray".
5. Create a list called `animals_list` in which the first element contains the speed column of the `animals` DataFrame and the second element contains the color column of the `animals` DataFrame.
6. Create a dictionary with the appropriate keys (i.e., speed and color).
### The in operator, reordering and matching
In your data folder you should have a DataFrame called `proj_summary` which contains quality metric information for an RNA-seq dataset. We have obtained batch information for the control samples in this dataset. Copy and paste the code below to create a DataFrame of control samples with the associated batch information:
```python
# Read in proj_summary if needed
proj_summary = pd.read_table("data/project-summary.txt",
                             header=0, index_col=0)

# Create ctrl_samples dataframe
ctrl_samples = pd.DataFrame(
    data={"date": ["01/13/2018", "03/15/2018", "01/13/2018",
                   "09/20/2018", "03/15/2018"]},
    index=["sample3", "sample10", "sample8",
           "sample4", "sample15"]
)
```

```python
proj_summary
```

```
         percent_GC  Exonic_Rate  ...  rRNA_rate  treatment
Name                              ...
sample1          49       0.8913  ...   0.007265       high
sample2          49       0.9055  ...   0.005518        low
sample3          50       0.8834  ...   0.026945    control
sample4          50       0.9027  ...   0.005082    control
sample5          49       0.8923  ...   0.005023       high
sample6          49       0.8999  ...   0.005345        low
sample7          49       0.8983  ...   0.005240       high
sample8          49       0.9022  ...   0.004549    control
sample9          49       0.9111  ...   0.005818        low

[9 rows x 8 columns]
```

```python
ctrl_samples
```

```
                date
sample3   01/13/2018
sample10  03/15/2018
sample8   01/13/2018
sample4   09/20/2018
sample15  03/15/2018
```
1. How many of the `ctrl_samples` are also in the `proj_summary` DataFrame? Use the `isin()` method to compare sample names.
2. Keep only the rows in `proj_summary` which correspond to those in `ctrl_samples`. Do this with the `isin()` method. Save the result to a variable called `proj_summary_ctrl`.
3. Use `merge()` to add a column called `batch` to the `proj_summary_ctrl` DataFrame. Assign this new DataFrame back to `proj_summary_ctrl`.