Data Wrangling - Answer Key

Authors

Noor Sohail

Will Gammerdinger

Published

March 15, 2026

Exercise 1

We have a list of 6 marker genes that we are very interested in. Our goal is to extract count data for these genes using the in operator from the rpkm_ordered data frame, instead of scrolling through rpkm_ordered and finding them manually.

First, let’s create a vector called important_genes with the Ensembl IDs of the 6 genes we are interested in:

# Create important genes vector
important_genes = ["ENSMUSG00000083700", "ENSMUSG00000080990", 
                   "ENSMUSG00000065619", "ENSMUSG00000047945", 
                   "ENSMUSG00000081010", "ENSMUSG00000030970"]
  1. Use the in operator to determine if all of these genes are present in the row names of the rpkm_ordered data frame.
# Check if important genes are in rpkm_ordered
rpkm_ordered.index.isin(important_genes)
array([False, False, False, ..., False, False, False], shape=(38828,))
  1. Extract the rows from rpkm_ordered that correspond to these 6 genes using [] and the isin operator. Double check the row names to ensure that you are extracting the correct rows.
# Extract rows for important genes
rpkm_ordered[rpkm_ordered.index.isin(important_genes)]
sample1 sample2 sample3 sample4 sample5 sample6 sample7 sample8 sample9 sample10 sample11 sample12
ENSMUSG00000030970 2.971420 2.221180 2.151490 0.175380 0.537852 0.435484 2.243810 2.599400 3.593970 1.014520 0.963523 0.964169
ENSMUSG00000047945 5.199470 4.745070 4.501390 0.221343 0.323620 0.478836 1.297810 3.896810 3.285470 0.982691 1.442970 3.581640
ENSMUSG00000065619 0.000000 0.000000 0.000000 0.000000 0.000000 0.000000 0.000000 0.000000 0.000000 0.000000 0.000000 0.000000
ENSMUSG00000080990 0.000000 0.000000 0.000000 0.000000 0.000000 0.000000 0.000000 0.000000 0.000000 0.000000 0.000000 0.000000
ENSMUSG00000081010 0.235848 0.222275 0.214347 0.419666 0.349415 0.248244 0.190397 0.167166 0.221353 0.452537 0.415823 0.594672
ENSMUSG00000083700 0.124225 0.425214 0.136998 0.066042 0.337651 0.308669 0.145973 0.142010 0.508757 0.205934 0.222865 0.488064
  1. Bonus question: Extract the rows from rpkm_ordered that correspond to these 6 genes using [], but without using the isin operator.
# Extract rows for important genes without using isin operator
rpkm_ordered.loc[important_genes]
sample1 sample2 sample3 sample4 sample5 sample6 sample7 sample8 sample9 sample10 sample11 sample12
ENSMUSG00000083700 0.124225 0.425214 0.136998 0.066042 0.337651 0.308669 0.145973 0.142010 0.508757 0.205934 0.222865 0.488064
ENSMUSG00000080990 0.000000 0.000000 0.000000 0.000000 0.000000 0.000000 0.000000 0.000000 0.000000 0.000000 0.000000 0.000000
ENSMUSG00000065619 0.000000 0.000000 0.000000 0.000000 0.000000 0.000000 0.000000 0.000000 0.000000 0.000000 0.000000 0.000000
ENSMUSG00000047945 5.199470 4.745070 4.501390 0.221343 0.323620 0.478836 1.297810 3.896810 3.285470 0.982691 1.442970 3.581640
ENSMUSG00000081010 0.235848 0.222275 0.214347 0.419666 0.349415 0.248244 0.190397 0.167166 0.221353 0.452537 0.415823 0.594672
ENSMUSG00000030970 2.971420 2.221180 2.151490 0.175380 0.537852 0.435484 2.243810 2.599400 3.593970 1.014520 0.963523 0.964169

Exercise 2

Reading in and inspecting data

  1. Using the animals.csv, read the .csv file into your environment and assign it to a variable called animals. Be sure to check that your row names are the different animals.
# Read in animals.csv
animals = pd.read_csv("data/animals.csv")

# Print animals
animals
speed color
Elephant 40.0 Gray
Cheetah 120.0 Tan
Tortoise 0.1 Green
Hare 48.0 Grey
Lion 80.0 Tan
PolarBear 30.0 White
  1. Check to make sure that animals is a dataframe.
# Check the class of animals
type(animals)
<class 'pandas.DataFrame'>
  1. How many rows are in the animals dataframe? How many columns?
# Number of rows in animals
len(animals.index)
6
# Number of columns in animals
len(animals.columns)
2
# Shape of animals
animals.shape
(6, 2)

Data wrangling

  1. Extract the speed value of 40 km/h from the animals dataframe.
# Various methods to extract the speed value of 40 km/h
animals.iloc[0, 0]
np.float64(40.0)
animals[animals["speed"] == 40]
speed color
Elephant 40.0 Gray
animals[animals["speed"] == 40]["speed"]
Elephant    40.0
Name: speed, dtype: float64
animals["speed"].loc[animals["speed"] == 40]
Elephant    40.0
Name: speed, dtype: float64
  1. Return the rows with animals that are the color Tan.
# Two way to return the rows with the animal color as Tan
animals[animals["color"] == "Tan"]
speed color
Cheetah 120.0 Tan
Lion 80.0 Tan
animals.loc[animals["color"] == "Tan"]
speed color
Cheetah 120.0 Tan
Lion 80.0 Tan
  1. Return the speed column for the rows with animals that are the color Tan.
# Return the speed column for animals with color Tan
animals[animals["color"] == "Tan"]["speed"]
Cheetah    120.0
Lion        80.0
Name: speed, dtype: float64
  1. Change the color of “Grey” to “Gray”.
# Two ways to change Grey to Gray
animals["color"][animals["color"] == "Grey"] = "Gray"
animals.loc[animals["color"] == "Grey", "color"] = "Gray"
animals
speed color
Elephant 40.0 Gray
Cheetah 120.0 Tan
Tortoise 0.1 Green
Hare 48.0 Gray
Lion 80.0 Tan
PolarBear 30.0 White
  1. Create a list called animals_list in which the first element contains the speed column of the animals dataframe and the second element contains the color column of the animals dataframe.
# Create a list with the speed and color vectors
animals_list = [animals["speed"], animals["color"]]
animals_list
[Elephant      40.0
Cheetah      120.0
Tortoise       0.1
Hare          48.0
Lion          80.0
PolarBear     30.0
Name: speed, dtype: float64, Elephant      Gray
Cheetah        Tan
Tortoise     Green
Hare          Gray
Lion           Tan
PolarBear    White
Name: color, dtype: str]
  1. Create a dictionary with the appropriate keys (i.e speed and color).
# Create a dictionary with the speed and color vectors
animals_dict = {"speed": animals["speed"], 
                "color": animals["color"]}
animals_dict
{'speed': Elephant      40.0
Cheetah      120.0
Tortoise       0.1
Hare          48.0
Lion          80.0
PolarBear     30.0
Name: speed, dtype: float64, 'color': Elephant      Gray
Cheetah        Tan
Tortoise     Green
Hare          Gray
Lion           Tan
PolarBear    White
Name: color, dtype: str}

The in operator, reordering and matching

  1. In your data folder you should have a dataframe called proj_summary which contains quality metric information for an RNA-seq dataset. We have obtained batch information for the control samples in this dataset. Copy and paste the code below to create a dataframe of control samples with the associated batch information:
# Read in proj_summary if needed
proj_summary = pd.read_table("data/project-summary.txt", 
                             header=0, index_col=0)

# Create ctrl_samples dataframe
ctrl_samples = pd.DataFrame(
  data = {"date": ["01/13/2018", "03/15/2018",  "01/13/2018",
                   "09/20/2018","03/15/2018"]}, 
  index = ["sample3", "sample10", "sample8", 
           "sample4", "sample15"]  
                           )
proj_summary 
Table 1: DataFrame of quality metric information for an RNA-seq dataset.
percent_GC Exonic_Rate Intronic_Rate Intergenic_Rate Mapping_Rate Quality_format rRNA_rate treatment
Name
sample1 49 0.8913 0.0709 0.0378 0.978800 standard 0.007265 high
sample2 49 0.9055 0.0625 0.0321 0.982507 standard 0.005518 low
sample3 50 0.8834 0.0663 0.0503 0.987729 standard 0.026945 control
sample4 50 0.9027 0.0649 0.0325 0.987076 standard 0.005082 control
sample5 49 0.8923 0.0714 0.0362 0.978184 standard 0.005023 high
sample6 49 0.8999 0.0667 0.0334 0.977210 standard 0.005345 low
sample7 49 0.8983 0.0665 0.0352 0.975800 standard 0.005240 high
sample8 49 0.9022 0.0656 0.0322 0.987746 standard 0.004549 control
sample9 49 0.9111 0.0566 0.0323 0.981449 standard 0.005818 low
ctrl_samples
Table 2: DataFrame of control samples with associated batch information.
date
sample3 01/13/2018
sample10 03/15/2018
sample8 01/13/2018
sample4 09/20/2018
sample15 03/15/2018
  1. How many of the ctrl_samples are also in the proj_summary dataframe? Use the isin operator to compare sample names.
# Number of shared samples between ctrl_samples and proj_summary
len(ctrl_samples.index[ctrl_samples.index.isin(proj_summary.index)])
3
  1. Keep only the rows in proj_summary which correspond to those in ctrl_samples. Do this with the isin operator. Save it to a variable called proj_summary_ctrl.
# Subset proj_summary by the samples within ctrl_samples
proj_summary_ctrl = proj_summary[proj_summary.index.isin(ctrl_samples.index)]

# Print out proj_summary_ctrl
proj_summary_ctrl
percent_GC Exonic_Rate Intronic_Rate Intergenic_Rate Mapping_Rate Quality_format rRNA_rate treatment
Name
sample3 50 0.8834 0.0663 0.0503 0.987729 standard 0.026945 control
sample4 50 0.9027 0.0649 0.0325 0.987076 standard 0.005082 control
sample8 49 0.9022 0.0656 0.0322 0.987746 standard 0.004549 control
  1. Use merge() to add a column called batch to the proj_summary_ctrl dataframe. Assign this new dataframe back to proj_summary_ctrl.
# Using the match, add the batch information
proj_summary_ctrl = pd.merge(proj_summary_ctrl, 
                             ctrl_samples, 
                             left_index=True, 
                             right_index=True)

# Print proj_summary_ctrl
proj_summary_ctrl
percent_GC Exonic_Rate Intronic_Rate Intergenic_Rate Mapping_Rate Quality_format rRNA_rate treatment date
Name
sample3 50 0.8834 0.0663 0.0503 0.987729 standard 0.026945 control 01/13/2018
sample4 50 0.9027 0.0649 0.0325 0.987076 standard 0.005082 control 09/20/2018
sample8 49 0.9022 0.0656 0.0322 0.987746 standard 0.004549 control 01/13/2018

Reuse

CC-BY-4.0