This lesson focuses on wrangling linked datasets with pandas, including filtering, reordering, saving cleaned tables, and merging DataFrames to prepare real data for analysis.
Authors
Noor Sohail
Will Gammerdinger
Published
March 16, 2026
Keywords
Data cleaning, Merge DataFrames, Save CSV
Approximate time: XX minutes
Learning Objectives
In this lesson, we will:
Reorder related datasets to ensure that they are in the same order
Extract specific rows from a data frame using the isin operator
Save a data frame as a new CSV file
Use the merge() function to combine two DataFrames
Overview of lesson
Data wrangling is the step where messy, real‑world data is turned into something usable. In practice, this often means linking multiple tables (e.g., combining sample metadata with expression data), filtering out unwanted rows, and saving clean versions of your data for later use. pandas enables you to do all this and more when it comes to data manipulation. In this lesson, you will work with datasets and learn how to merge, filter, and reorder tables so that you can easily complete downstream analyses and create visualizations.
Linked DataFrames
Oftentimes, we may have multiple files that relate to the same dataset. In these cases, we want to make sure that the data in these files are linked together in such a way that we match the identities correctly. Therefore, knowing how to reorder datasets and determine whether the data matches is an important skill.
In our use case, we will be working with genomic data. We have gene expression data generated by RNA-seq and metadata corresponding to each samples. Inside our data folder, we have these two related files:
mouse_exp_design.csv
counts.rpkm.csv
Read in Expression Data and Metadata
Let’s start by reading in our gene expression data (RPKM matrix) and previewing the first few lines:
import pandas as pdrpkm_data = pd.read_csv("data/counts.rpkm.csv")rpkm_data.head()
Table 1: RPKM normalized expression data for 12 samples, where rows are genes and columns are samples.
sample2
sample5
sample7
sample8
sample9
sample4
sample6
sample12
sample3
sample11
sample10
sample1
ENSMUSG00000000001
19.265000
23.722200
2.61161
5.849540
6.512630
24.076700
20.819800
26.915800
20.889500
24.046500
24.198100
19.784800
ENSMUSG00000000003
0.000000
0.000000
0.00000
0.000000
0.000000
0.000000
0.000000
0.000000
0.000000
0.000000
0.000000
0.000000
ENSMUSG00000000028
1.032290
0.826954
1.13441
0.698754
0.925117
0.827891
1.168630
0.673563
0.892183
0.975327
1.045920
0.937792
ENSMUSG00000000031
0.000000
0.000000
0.00000
0.029845
0.059773
0.000000
0.051193
0.020438
0.000000
0.000000
0.000000
0.035963
ENSMUSG00000000037
0.056033
0.047324
0.00000
0.068594
0.049415
0.180883
0.143884
0.066232
0.146196
0.020640
0.017004
0.151417
Similarly, let us read in the associated metadata file:
Table 2: Metadata for 12 samples, where rows are samples and columns are different attributes of the samples.
genotype
celltype
replicate
sample1
Wt
typeA
1
sample2
Wt
typeA
2
sample3
Wt
typeA
3
sample4
KO
typeA
1
sample5
KO
typeA
2
It looks as if the sample names (header) in our data matrix are similar to the row names of our metadata file, but it’s hard to tell since they are not in the same order. We can do a quick check of the number of columns in the count data and the rows in the metadata and at least see if the numbers match up.
# Check the dimensions of the dataprint("Dimensions of RPKM data:", rpkm_data.shape)
Dimensions of RPKM data: (38828, 12)
print("Dimensions of metadata:", metadata.shape)
Dimensions of metadata: (12, 3)
What we want to know is, are all the samples in our metadata also in our expression data?
The in Operator
Let’s try checking that all our metadata samples are in our expression data. We can do this by using the in operator, which allows us to check for membership in a collection.
# Assign the rownames of the metadata data frame to x x = metadata.index# Assign the column names of the rpkm_data data frame to yy = rpkm_data.columns
Now we can check to see if x are in y:
# Check if all elements of x are in yx in y
TypeError: unhashable type: 'Index'
We get an error because we are not checking if a single element, x can be found in y, but rather we are checking if the entire list of x is in y. Instead we use the isin operator, which checks if each element of x is in y:
# Check if each element of x is in y using isinx.isin(y)
There are many ways to check for membership. For example, we can use a list comprehension to check if each element of x is in y by iterating through the elements of each list and checking for membership:
# Check if each element of x is in y[sample in y for sample in x]
But what if we had a ton of samples and we were not able to count that each element of x is in y? We can use the all() function to check if all elements are True:
# Check if all elements of x are in yall(x.isin(y))
True
So now we know that all of our samples in the metadata are also in our expression data, but are they in the same order? We can check this by comparing the order of x and y:
We have a list of 6 marker genes that we are very interested in. Our goal is to extract count data for these genes using the isin operator from the rpkm_ordered data frame, instead of scrolling through rpkm_ordered and finding them manually.
First, let’s create a vector called important_genes with the Ensembl IDs of the 6 genes we are interested in:
Use the isin operator to determine if all of these genes are present in the row names of the rpkm_ordered data frame.
Extract the rows from rpkm_ordered that correspond to these 6 genes using [] and the isin operator. Double check the row names to ensure that you are extracting the correct rows.
Bonus question: Extract the rows from rpkm_ordered that correspond to these 6 genes using [], but without using the isin operator.
Saving CSV Files
Now that we have our ordered expression data, we can save it as a new CSV file using the to_csv method:
# Save the reordered expression data to a new CSV filerpkm_ordered.to_csv("data/ordered_counts_rpkm.csv")
So now if we wanted to read in this new file, we can do so using the read_csv method:
# Read in the new CSV filerpkm_ordered_new = pd.read_csv("data/ordered_counts_rpkm.csv")rpkm_ordered_new.head()
Table 4: Read in the new CSV file that of the ordered RPKM data that was just saved.
Unnamed: 0
sample1
sample2
sample3
sample4
sample5
sample6
sample7
sample8
sample9
sample10
sample11
sample12
0
ENSMUSG00000000001
19.784800
19.265000
20.889500
24.076700
23.722200
20.819800
2.61161
5.849540
6.512630
24.198100
24.046500
26.915800
1
ENSMUSG00000000003
0.000000
0.000000
0.000000
0.000000
0.000000
0.000000
0.00000
0.000000
0.000000
0.000000
0.000000
0.000000
2
ENSMUSG00000000028
0.937792
1.032290
0.892183
0.827891
0.826954
1.168630
1.13441
0.698754
0.925117
1.045920
0.975327
0.673563
3
ENSMUSG00000000031
0.035963
0.000000
0.000000
0.000000
0.000000
0.051193
0.00000
0.029845
0.059773
0.000000
0.000000
0.020438
4
ENSMUSG00000000037
0.151417
0.056033
0.146196
0.180883
0.047324
0.143884
0.00000
0.068594
0.049415
0.017004
0.020640
0.066232
But wait, why do have an extra column called Unnamed: 0? This is because when we loaded in the CSV file, the index (rownames) were not set. Instead they were read in as a regular column without a rowname. When we load in our new CSV file, we want to set the first column as the index to ensure that the genes are being stored as row names:
# Read in the new CSV filerpkm_ordered_new = pd.read_csv("data/ordered_counts_rpkm.csv", index_col=0)rpkm_ordered_new.head()
Table 5: Read in the new CSV file that of the ordered RPKM data that was just saved with rownames set appropriately.
sample1
sample2
sample3
sample4
sample5
sample6
sample7
sample8
sample9
sample10
sample11
sample12
ENSMUSG00000000001
19.784800
19.265000
20.889500
24.076700
23.722200
20.819800
2.61161
5.849540
6.512630
24.198100
24.046500
26.915800
ENSMUSG00000000003
0.000000
0.000000
0.000000
0.000000
0.000000
0.000000
0.00000
0.000000
0.000000
0.000000
0.000000
0.000000
ENSMUSG00000000028
0.937792
1.032290
0.892183
0.827891
0.826954
1.168630
1.13441
0.698754
0.925117
1.045920
0.975327
0.673563
ENSMUSG00000000031
0.035963
0.000000
0.000000
0.000000
0.000000
0.051193
0.00000
0.029845
0.059773
0.000000
0.000000
0.020438
ENSMUSG00000000037
0.151417
0.056033
0.146196
0.180883
0.047324
0.143884
0.00000
0.068594
0.049415
0.017004
0.020640
0.066232
Wrangling DataFrames
Now, we want to update some extra information in our metadata object. For one, we wnat to evaluate the average expression in each sample and its relationship with the age of the mouse. Our goal is to wrangle our data in such a way that we can create visualizations with these new columns in the next lesson.
Calculating Average Expression
Let’s take a closer look at our counts data. Each column represents a sample in our experiment, and each sample has ~38K values corresponding to the expression of different transcripts. We want to compute the average value of expression for each sample eventually.
Taking this one step at a time, what would we do if we just wanted the average expression for Sample 1 (across all transcripts)? We can use the base numpy function called mean():
import numpy as np# Take the mean of the expression values for sample1np.mean(rpkm_ordered["sample1"])
np.float64(10.266102101254251)
That is great, if we only wanted the average from one of the samples (one column in a data frame), but we need to get this information from all 12 samples, so all 12 columns. It would be ideal to get a vector of 12 values that we can add to the metadata data frame. What is the best way to do this?
Luckily, the Pandas library has a built in method called mean() that can be applied to a data frame. By default, it will calculate the mean of each column and return a new series with the mean values for each column:
# Calculate the mean expression for each sample (column) in the rpkm_ordered data framesamplemeans = rpkm_ordered.mean()samplemeans
We have the values that we want, except notice that we do not have a column name associated with these results. To assign a name to this series, we can use the name attribute:
# Name the samplemeans seriessamplemeans.name ="mean_expression"samplemeans
At this point, samplemeans is a series object with the sample names as the index and the mean expression values as the values. We want to add this information to our metadata data frame. One of the easiest ways to do this is to use the merge() function from the Pandas library, which allows us to merge two data frames based on a common column or index.
Since this is a new function, let us use the help() function to learn more about how to use it:
# Get help on the merge functionhelp(pd.merge)
When we look at the arguments for the merge() function, we see that we can specify the left and right data frames to merge, as well as the columns or indices to merge on. We can specify which columns to merge on using the left_on and right_on arguments, or we can specify that we want to merge on the index using the left_index and right_index arguments.
In our case, let us set the “left” DataFrame will be the metadata and the “right” DataFrame will be the samplemeans. Since both metadata and samplemeans have the sample names as the index, we can merge them on indeces using the left_index and right_index arguments:
# Merge the samplemeans series with the metadata data frame on the indexnew_metadata = metadata.merge(samplemeans, left_index=True, right_index=True)new_metadata.head()
genotype
celltype
replicate
mean_expression
sample1
Wt
typeA
1
10.266102
sample2
Wt
typeA
2
10.849759
sample3
Wt
typeA
3
9.452517
sample4
KO
typeA
1
15.833872
sample5
KO
typeA
2
15.590184
Now we have a left-joined new_metadata with all the information from the original metadata as well as a new column with the mean expression values for each sample.
This concept of merging can be done with any two data frames that have a common column or index, and it is a very powerful tool for combining information from different sources.
Adding age column
Let’s also create a list with the ages of each of the mice in our data set. We also want to double check that we have 12 values in this list, since we have 12 samples in our data set. Here we are supplying the ages manually, but in a real use case, we would likely have this information in a separate file that we would read in and merge with our metadata data frame using the merge() function as we just did with the samplemeans series.
# Add the age column to the metadata data frame# Create a numeric vector with ages. age_in_days = [40, 32, 38, 35, 41, 32, 34, 26, 28, 28, 30, 32] # Double check that we have 12 values, for each samplelen(age_in_days)
12
Now we can create a new column in our new_metadata data frame called age and assign the values from our age_in_days list to this new column:
# Assign the age_in_days list to a new column in new_metadatanew_metadata["age_in_days"] = age_in_daysnew_metadata.head()
genotype
celltype
replicate
mean_expression
age_in_days
sample1
Wt
typeA
1
10.266102
40
sample2
Wt
typeA
2
10.849759
32
sample3
Wt
typeA
3
9.452517
38
sample4
KO
typeA
1
15.833872
35
sample5
KO
typeA
2
15.590184
41
Now that we have the data in the format that we want, we can save this new metadata data frame as a new CSV file using the to_csv() method:
# Save the new metadata data frame to a new CSV filenew_metadata.to_csv("data/new_metadata.csv")
We are now ready for plotting and data visualization!
Fun With Data Wrangling
As you work more and more with different dataset, you will likely find yourself spending quite a bit of time wrangling data to get it into the right format for analysis. So here we have a few fun exercises to practice some of the data wrangling skills we have learned in this lesson.
Using the animals.csv, read the .csv file into your environment and assign it to a variable called animals. Be sure to check that your row names are the different animals.
Check to make sure that animals is a dataframe.
How many rows are in the animals dataframe? How many columns?
Data wrangling
Extract the speed value of 40 km/h from the animals dataframe.
Return the rows with animals that are the color Tan.
Return the rows with animals that have speed greater than 50 km/h and output only the color column. Keep the output as a data frame.
Change the color of “Grey” to “Gray”.
Create a list called animals_list in which the first element contains the speed column of the animals dataframe and the second element contains the color column of the animals dataframe.
Create a dictionary with the appropriate keys (i.e speed and color).
The in operator, reordering and matching
In your data folder you should have a dataframe called proj_summary which contains quality metric information for an RNA-seq dataset. We have obtained batch information for the control samples in this dataset. Copy and paste the code below to create a dataframe of control samples with the associated batch information:
# Read in proj_summary if neededproj_summary = pd.read_table("data/project-summary.txt", header=0, index_col=0)# Create ctrl_samples dataframectrl_samples = pd.DataFrame( data = {"date": ["01/13/2018", "03/15/2018", "01/13/2018","09/20/2018","03/15/2018"]}, index = ["sample3", "sample10", "sample8", "sample4", "sample15"] )
proj_summary
Table 6: DataFrame of quality metric information for an RNA-seq dataset.
percent_GC
Exonic_Rate
Intronic_Rate
Intergenic_Rate
Mapping_Rate
Quality_format
rRNA_rate
treatment
Name
sample1
49
0.8913
0.0709
0.0378
0.978800
standard
0.007265
high
sample2
49
0.9055
0.0625
0.0321
0.982507
standard
0.005518
low
sample3
50
0.8834
0.0663
0.0503
0.987729
standard
0.026945
control
sample4
50
0.9027
0.0649
0.0325
0.987076
standard
0.005082
control
sample5
49
0.8923
0.0714
0.0362
0.978184
standard
0.005023
high
sample6
49
0.8999
0.0667
0.0334
0.977210
standard
0.005345
low
sample7
49
0.8983
0.0665
0.0352
0.975800
standard
0.005240
high
sample8
49
0.9022
0.0656
0.0322
0.987746
standard
0.004549
control
sample9
49
0.9111
0.0566
0.0323
0.981449
standard
0.005818
low
ctrl_samples
Table 7: DataFrame of control samples with associated batch information.
date
sample3
01/13/2018
sample10
03/15/2018
sample8
01/13/2018
sample4
09/20/2018
sample15
03/15/2018
How many of the ctrl_samples are also in the proj_summary dataframe? Use the isin operator to compare sample names.
Keep only the rows in proj_summary which correspond to those in ctrl_samples. Do this with the isin operator. Save it to a variable called proj_summary_ctrl.
Use merge() to add a column called batch to the proj_summary_ctrl dataframe. Assign this new dataframe back to proj_summary_ctrl.
---title: "Data Wrangling"description: | This lesson focuses on wrangling linked datasets with pandas, including filtering, reordering, saving cleaned tables, and merging DataFrames to prepare real data for analysis.author: - Noor Sohail - Will Gammerdingerdate: "2026-03-16"categories: - Python programming - Data wrangling - Pandaskeywords: - Data cleaning - Merge DataFrames - Save CSVlicense: "CC-BY-4.0"editor_options: markdown: wrap: 72---```{r}#| label: load_libraries_data#| echo: false# Load libraries and data# Interfacing with R quarto and python futzinglibrary(reticulate)use_condaenv("/opt/anaconda3/envs/intro_python", required =TRUE)```Approximate time: XX minutes## Learning Objectives In this lesson, we will:- Reorder related datasets to ensure that they are in the same order- Extract specific rows from a data frame using the `isin` operator- Save a data frame as a new CSV file- Use the `merge()` function to combine two DataFrames## Overview of lessonData wrangling is the step where messy, real‑world data is turned into something usable. In practice, this often means linking multiple tables (e.g., combining sample metadata with expression data), filtering out unwanted rows, and saving clean versions of your data for later use. `pandas` enables you to do all this and more when it comes to data manipulation. In this lesson, you will work with datasets and learn how to merge, filter, and reorder tables so that you can easily complete downstream analyses and create visualizations.## Linked DataFramesOftentimes, we may have multiple files that relate to the same dataset. In these cases, we want to make sure that the data in these files are linked together in such a way that we match the identities correctly. Therefore, knowing how to reorder datasets and determine whether the data matches is an important skill.In our use case, we will be working with genomic data. We have gene **expression data generated by RNA-seq** and metadata corresponding to each samples. Inside our data folder, we have these two related files: - `mouse_exp_design.csv`- `counts.rpkm.csv`### Read in Expression Data and MetadataLet’s start by reading in our gene expression data (RPKM matrix) and previewing the first few lines:```{python}#| label: tbl-read_in_expression_data#| tbl-cap: RPKM normalized expression data for 12 samples, where rows are genes and columns are samples.import pandas as pdrpkm_data = pd.read_csv("data/counts.rpkm.csv")rpkm_data.head()```Similarly, let us read in the associated metadata file:```{python}#| label: tbl-read_in_metadata#| tbl-cap: Metadata for 12 samples, where rows are samples and columns are different attributes of the samples.metadata = pd.read_csv("data/mouse_exp_design.csv")metadata.head()```It looks as if the sample names (header) in our data matrix are similar to the row names of our metadata file, but it’s hard to tell since they are not in the same order. We can do a quick check of the number of columns in the count data and the rows in the metadata and at least see if the numbers match up.```{python}#| label: check_dimensions# Check the dimensions of the dataprint("Dimensions of RPKM data:", rpkm_data.shape)print("Dimensions of metadata:", metadata.shape) ```What we want to know is, **are all the samples in our metadata also in our expression data?**### The `in` OperatorLet’s try checking that all our metadata samples are in our expression data. We can do this by using the `in` operator, which allows us to check for membership in a collection. ```{python}#| label: get_sample_names# Assign the rownames of the metadata data frame to x x = metadata.index# Assign the column names of the rpkm_data data frame to yy = rpkm_data.columns```Now we can check to see if `x` are in `y`:```{python}#| label: check_membership#| error: true# Check if all elements of x are in yx in y```We get an error because we are not checking if a single element, `x` can be found in `y`, but rather we are checking if the entire list of `x` is in `y`. Instead we use the `isin` operator, which checks if each element of `x` is in `y`:```{python}#| label: check_membership_isin# Check if each element of x is in y using isinx.isin(y)```::: {.callout-note collapse="true"}# List comprehension alternative to isin operatorThere are many ways to check for membership. For example, we can use a list comprehension to check if each element of `x` is in `y` by iterating through the elements of each list and checking for membership:```{python}#| label: check_membership_list_comprehension# Check if each element of x is in y[sample in y for sample in x]```:::But what if we had a ton of samples and we were not able to count that each element of `x` is in `y`? We can use the `all()` function to check if all elements are `True`:```{python}#| label: check_all_membership# Check if all elements of x are in yall(x.isin(y))```So now we know that all of our samples in the metadata are also in our expression data, but are they in the same order? We can check this by comparing the order of `x` and `y`:```{python}#| label: check_order# Check if the order of x and y is the samex == y```We see that they are not in the same order. We can use the `reindex` method to reorder our expression data to match the order of our metadata:```{python}#| label: tbl-reorder_expression_data#| tbl-cap: Reordered RPKM data columns to match the order of the metadata rownames.# Reorder the columns of the rpkm_data data frame to match the order of the metadatarpkm_ordered = rpkm_data.reindex(columns=x)rpkm_ordered.head()```When we look at the columns of `rpkm_ordered`, we see that they are now in the same order as the row names of our metadata (numerical order).:::{.callout-tip}# [**Exercise 1**](10_data_wrangling-Answer_key.qmd#exercise-1)We have a list of 6 marker genes that we are very interested in. Our goal is to extract count data for these genes using the `isin` operator from the `rpkm_ordered` data frame, instead of scrolling through `rpkm_ordered` and finding them manually.First, let's create a vector called `important_genes` with the Ensembl IDs of the 6 genes we are interested in:```{python}#| label: create_important_genes# Create important genes vectorimportant_genes = ["ENSMUSG00000083700", "ENSMUSG00000080990", "ENSMUSG00000065619", "ENSMUSG00000047945", "ENSMUSG00000081010", "ENSMUSG00000030970"]```1. Use the `isin` operator to determine if all of these genes are present in the row names of the `rpkm_ordered` data frame.2. Extract the rows from `rpkm_ordered` that correspond to these 6 genes using `[]` and the `isin` operator. Double check the row names to ensure that you are extracting the correct rows.3. **Bonus question:** Extract the rows from `rpkm_ordered` that correspond to these 6 genes using `[]`, but without using the `isin` operator.:::## Saving CSV FilesNow that we have our ordered expression data, we can save it as a new CSV file using the `to_csv` method:```{python}#| label: save_csv# Save the reordered expression data to a new CSV filerpkm_ordered.to_csv("data/ordered_counts_rpkm.csv")```So now if we wanted to read in this new file, we can do so using the `read_csv` method:```{python}#| label: tbl-read_new_csv#| tbl-cap: Read in the new CSV file that of the ordered RPKM data that was just saved.# Read in the new CSV filerpkm_ordered_new = pd.read_csv("data/ordered_counts_rpkm.csv")rpkm_ordered_new.head()```But wait, why do have an extra column called `Unnamed: 0`? This is because when we loaded in the CSV file, the index (rownames) were not set. Instead they were read in as a regular column without a rowname. When we load in our new CSV file, we want to set the first column as the index to ensure that the genes are being stored as row names:```{python}#| label: tbl-read_new_csv_rownames#| tbl-cap: Read in the new CSV file that of the ordered RPKM data that was just saved with rownames set appropriately.# Read in the new CSV filerpkm_ordered_new = pd.read_csv("data/ordered_counts_rpkm.csv", index_col=0)rpkm_ordered_new.head()```## Wrangling DataFramesNow, we want to update some extra information in our `metadata` object. For one, we wnat to evaluate the average expression in each sample and its relationship with the age of the mouse. Our goal is to wrangle our data in such a way that we can create visualizations with these new columns in the next lesson.### Calculating Average ExpressionLet’s take a closer look at our counts data. Each column represents a sample in our experiment, and each sample has ~38K values corresponding to the expression of different transcripts. We want to compute the **average value of expression** for each sample eventually.Taking this one step at a time, what would we do if we just wanted the average expression for Sample 1 (across all transcripts)? We can use the base `numpy` function called `mean()`:```{python}#| label: calculate_mean_sample_1import numpy as np# Take the mean of the expression values for sample1np.mean(rpkm_ordered["sample1"])```That is great, if we only wanted the average from one of the samples (one column in a data frame), but we need to get this information from all 12 samples, so all 12 columns. It would be ideal to get a vector of 12 values that we can add to the metadata data frame. What is the best way to do this?Luckily, the `Pandas` library has a built in method called `mean()` that can be applied to a data frame. By default, it will calculate the mean of each column and return a new series with the mean values for each column:```{python}#| label: calculate_mean_all_samples# Calculate the mean expression for each sample (column) in the rpkm_ordered data framesamplemeans = rpkm_ordered.mean()samplemeans```We have the values that we want, except notice that we do not have a column name associated with these results. To assign a name to this series, we can use the `name` attribute:```{python}#| label: name_samplemeans# Name the samplemeans seriessamplemeans.name ="mean_expression"samplemeans```### Merging DataFramesAt this point, `samplemeans` is a series object with the sample names as the index and the mean expression values as the values. We want to add this information to our `metadata` data frame. One of the easiest ways to do this is to use the `merge()` function from the `Pandas` library, which allows us to merge two data frames based on a common column or index. Since this is a new function, let us use the `help()` function to learn more about how to use it:```{python}#| label: help_merge#| eval: false# Get help on the merge functionhelp(pd.merge)```When we look at the arguments for the `merge()` function, we see that we can specify the left and right data frames to merge, as well as the columns or indices to merge on. We can specify which columns to merge on using the `left_on` and `right_on` arguments, or we can specify that we want to merge on the index using the `left_index` and `right_index` arguments.In our case, let us set the "left" DataFrame will be the `metadata` and the "right" DataFrame will be the `samplemeans`. Since both `metadata` and `samplemeans` have the sample names as the index, we can merge them on indeces using the `left_index` and `right_index` arguments:```{python}#| label: merge_dataframes#| error: true# Merge the samplemeans series with the metadata data frame on the indexnew_metadata = metadata.merge(samplemeans, left_index=True, right_index=True)new_metadata.head()```Now we have a left-joined `new_metadata` with all the information from the original metadata as well as a new column with the mean expression values for each sample.This concept of merging can be done with any two data frames that have a common column or index, and it is a very powerful tool for combining information from different sources.### Adding `age` column Let’s also create a list with the ages of each of the mice in our data set. We also want to double check that we have 12 values in this list, since we have 12 samples in our data set. Here we are supplying the ages manually, but in a real use case, we would likely have this information in a separate file that we would read in and merge with our metadata data frame using the `merge()` function as we just did with the `samplemeans` series.```{python}#| label: add_age_column# Add the age column to the metadata data frame# Create a numeric vector with ages. age_in_days = [40, 32, 38, 35, 41, 32, 34, 26, 28, 28, 30, 32] # Double check that we have 12 values, for each samplelen(age_in_days)```Now we can create a new column in our `new_metadata` data frame called `age` and assign the values from our `age_in_days` list to this new column:```{python}#| label: assign_age_column# Assign the age_in_days list to a new column in new_metadatanew_metadata["age_in_days"] = age_in_daysnew_metadata.head()```Now that we have the data in the format that we want, we can save this new metadata data frame as a new CSV file using the `to_csv()` method:```{python}#| label: save_new_metadata# Save the new metadata data frame to a new CSV filenew_metadata.to_csv("data/new_metadata.csv")```**We are now ready for plotting and data visualization!**## Fun With Data WranglingAs you work more and more with different dataset, you will likely find yourself spending quite a bit of time wrangling data to get it into the right format for analysis. So here we have a few fun exercises to practice some of the data wrangling skills we have learned in this lesson.::: {.callout-tip}# [**Exercise 2**](10_data_wrangling-Answer_key.qmd#exercise-2)### Reading in and inspecting data4. Using the `animals.csv`, read the `.csv` file into your environment and assign it to a variable called `animals`. **Be sure to check that your row names are the different animals.**5. Check to make sure that `animals` is a dataframe.6. How many rows are in the `animals` dataframe? How many columns?### Data wrangling7. Extract the `speed` value of 40 km/h from the `animals` dataframe.8. Return the rows with animals that are the `color` Tan.9. Return the rows with animals that have `speed` greater than 50 km/h and output only the `color` column. Keep the output as a data frame. 10. Change the color of "Grey" to "Gray". 11. Create a list called `animals_list` in which the first element contains the speed column of the `animals` dataframe and the second element contains the color column of the `animals` dataframe. 12. Create a dictionary with the appropriate keys (i.e speed and color).### The `in` operator, reordering and matching13. In your data folder you should have a dataframe called `proj_summary` which contains quality metric information for an RNA-seq dataset. We have obtained batch information for the control samples in this dataset. **Copy and paste the code below to create a dataframe of control samples with the associated batch information**:```{python}#| label: create_ctrl_samples_df# Read in proj_summary if neededproj_summary = pd.read_table("data/project-summary.txt", header=0, index_col=0)# Create ctrl_samples dataframectrl_samples = pd.DataFrame( data = {"date": ["01/13/2018", "03/15/2018", "01/13/2018","09/20/2018","03/15/2018"]}, index = ["sample3", "sample10", "sample8", "sample4", "sample15"] )``````{python}#| label: tbl-proj_summary#| tbl-cap: DataFrame of quality metric information for an RNA-seq dataset.proj_summary ``````{python}#| label: tbl-ctrl_samples#| tbl-cap: DataFrame of control samples with associated batch information.ctrl_samples```14. How many of the `ctrl_samples` are also in the `proj_summary` dataframe? Use the `isin` operator to compare sample names.15. Keep only the rows in `proj_summary` which correspond to those in `ctrl_samples`. Do this with the `isin` operator. Save it to a variable called `proj_summary_ctrl`.16. Use `merge()` to add a column called `batch` to the `proj_summary_ctrl` dataframe. Assign this new dataframe back to `proj_summary_ctrl`.:::***[Next Lesson >>](11_plotting_basics.qmd)[Back to Schedule](../schedule/schedule.qmd)