Data Wrangling - Answer Key

Authors

Noor Sohail

Will Gammerdinger

Published

March 15, 2026

Exercise 1

We have a list of 6 marker genes that we are very interested in. Our goal is to extract count data for these genes using the in operator from the rpkm_ordered data frame, instead of scrolling through rpkm_ordered and finding them manually.

First, let’s create a vector called important_genes with the Ensembl IDs of the 6 genes we are interested in:

# Create important genes vector
important_genes = ["ENSMUSG00000083700", "ENSMUSG00000080990", 
                   "ENSMUSG00000065619", "ENSMUSG00000047945", 
                   "ENSMUSG00000081010", "ENSMUSG00000030970"]

Use the isin operator to determine which row names of the rpkm_ordered DataFrame match our genes of interest.

# Check if important genes are in rpkm_ordered
rpkm_ordered.index.isin(important_genes)

array([False, False, False, ..., False, False, False], shape=(38828,))

Extract the rows from rpkm_ordered that correspond to these 6 genes using [] and the isin operator. Double check the row names to ensure that you are extracting the correct rows.

# Extract rows for important genes
rpkm_ordered[rpkm_ordered.index.isin(important_genes)]

                     sample1   sample2   sample3  ...  sample10  sample11  sample12
ENSMUSG00000030970  2.971420  2.221180  2.151490  ...  1.014520  0.963523  0.964169
ENSMUSG00000047945  5.199470  4.745070  4.501390  ...  0.982691  1.442970  3.581640
ENSMUSG00000065619  0.000000  0.000000  0.000000  ...  0.000000  0.000000  0.000000
ENSMUSG00000080990  0.000000  0.000000  0.000000  ...  0.000000  0.000000  0.000000
ENSMUSG00000081010  0.235848  0.222275  0.214347  ...  0.452537  0.415823  0.594672
ENSMUSG00000083700  0.124225  0.425214  0.136998  ...  0.205934  0.222865  0.488064

[6 rows x 12 columns]

Bonus question: Extract the rows from rpkm_ordered that correspond to these 6 genes using [], but without using the isin operator. Do you notice anything different about the output?

# Extract rows for important genes without using isin operator
rpkm_ordered.loc[important_genes]

                     sample1   sample2   sample3  ...  sample10  sample11  sample12
ENSMUSG00000083700  0.124225  0.425214  0.136998  ...  0.205934  0.222865  0.488064
ENSMUSG00000080990  0.000000  0.000000  0.000000  ...  0.000000  0.000000  0.000000
ENSMUSG00000065619  0.000000  0.000000  0.000000  ...  0.000000  0.000000  0.000000
ENSMUSG00000047945  5.199470  4.745070  4.501390  ...  0.982691  1.442970  3.581640
ENSMUSG00000081010  0.235848  0.222275  0.214347  ...  0.452537  0.415823  0.594672
ENSMUSG00000030970  2.971420  2.221180  2.151490  ...  1.014520  0.963523  0.964169

[6 rows x 12 columns]

The rows are now in the order that we specified the genes of interest. When we used isin above, we got the rows in the order that they are present in the expression data.

Exercise 2

Reading in and inspecting data

Using the animals.csv within your data directory, read the .csv file into your environment and assign it to a variable called animals. Be sure to check that your row names are the different animals.

# Read in animals.csv
animals = pd.read_csv("data/animals.csv")

# Print animals
animals

           speed  color
Elephant    40.0   Gray
Cheetah    120.0    Tan
Tortoise     0.1  Green
Hare        48.0   Grey
Lion        80.0    Tan
PolarBear   30.0  White

Check to make sure that animals is a dataframe.

# Check the class of animals
type(animals)

<class 'pandas.core.frame.DataFrame'>

How many rows are in the animals dataframe? How many columns?

# Number of rows in animals
len(animals.index)

# Number of columns in animals
len(animals.columns)

# Shape of animals
animals.shape

(6, 2)

Data wrangling

Extract the speed value of 40 km/h from the animals dataframe.

# Various methods to extract the speed value of 40 km/h
print(animals.iloc[0, 0])

40.0

animals[animals["speed"] == 40]["speed"]

Elephant    40.0
Name: speed, dtype: float64

animals["speed"].loc[animals["speed"] == 40]

Elephant    40.0
Name: speed, dtype: float64

Return the rows with animals that are the color Tan.

# Two way to return the rows with the animal color as Tan
animals[animals["color"] == "Tan"]

         speed color
Cheetah  120.0   Tan
Lion      80.0   Tan

animals.loc[animals["color"] == "Tan"]

         speed color
Cheetah  120.0   Tan
Lion      80.0   Tan

Return the rows with animals that have speed greater than 50 km/h and output only the color column. Keep the output as a DataFrame.

# Return the color column for animals with speed > 50 as a DataFrame
animals[animals["speed"] > 50]["color"].to_frame()

        color
Cheetah   Tan
Lion      Tan

Change the color of “Grey” to “Gray”.

# Two ways to change Grey to Gray
animals["color"][animals["color"] == "Grey"] = "Gray"
animals.loc[animals["color"] == "Grey", "color"] = "Gray"
animals

           speed  color
Elephant    40.0   Gray
Cheetah    120.0    Tan
Tortoise     0.1  Green
Hare        48.0   Gray
Lion        80.0    Tan
PolarBear   30.0  White

Create a list called animals_list in which the first element contains the speed column of the animals dataframe and the second element contains the color column of the animals dataframe.

# Create a list with the speed and color vectors
animals_list = [animals["speed"], animals["color"]]
animals_list

[Elephant      40.0
Cheetah      120.0
Tortoise       0.1
Hare          48.0
Lion          80.0
PolarBear     30.0
Name: speed, dtype: float64, Elephant      Gray
Cheetah        Tan
Tortoise     Green
Hare          Gray
Lion           Tan
PolarBear    White
Name: color, dtype: object]

Create a dictionary with the appropriate keys (i.e., speed and color).

# Create a dictionary with the speed and color vectors
animals_dict = {"speed": animals["speed"], 
                "color": animals["color"]}
animals_dict

{'speed': Elephant      40.0
Cheetah      120.0
Tortoise       0.1
Hare          48.0
Lion          80.0
PolarBear     30.0
Name: speed, dtype: float64, 'color': Elephant      Gray
Cheetah        Tan
Tortoise     Green
Hare          Gray
Lion           Tan
PolarBear    White
Name: color, dtype: object}

The `isin` operator, reordering and matching

In the data directory, you should have a table called project-summary.txt that contains quality metric information for an RNA-seq dataset. We have obtained batch information for the control samples in this dataset. Copy and paste the code below to create DataFrames for the project summary and for the control samples with the associated batch information:

# Read in proj_summary
proj_summary = pd.read_table("data/project-summary.txt", 
                             header=0, index_col=0)

# Create ctrl_samples dataframe
ctrl_samples = pd.DataFrame(
  data = {"date": ["01/13/2018", "03/15/2018",  "01/13/2018",
                   "09/20/2018","03/15/2018"]}, 
  index = ["sample3", "sample10", "sample8", 
           "sample4", "sample15"])

Table 1: DataFrame of quality metric information for an RNA-seq dataset.

# View RNA-seq QC information
proj_summary

         percent_GC  Exonic_Rate  ...  rRNA_rate  treatment
Name                              ...                      
sample1          49       0.8913  ...   0.007265       high
sample2          49       0.9055  ...   0.005518        low
sample3          50       0.8834  ...   0.026945    control
sample4          50       0.9027  ...   0.005082    control
sample5          49       0.8923  ...   0.005023       high
sample6          49       0.8999  ...   0.005345        low
sample7          49       0.8983  ...   0.005240       high
sample8          49       0.9022  ...   0.004549    control
sample9          49       0.9111  ...   0.005818        low

[9 rows x 8 columns]

Table 2: DataFrame of control samples with associated batch information.

# View control batch information
ctrl_samples

                date
sample3   01/13/2018
sample10  03/15/2018
sample8   01/13/2018
sample4   09/20/2018
sample15  03/15/2018

How many of the ctrl_samples are also in the proj_summary dataframe? Use the isin operator to compare sample names.

# Number of shared samples between ctrl_samples and proj_summary
print(len(proj_summary.index[proj_summary.index.isin(ctrl_samples.index)]))

print(len(ctrl_samples.index[ctrl_samples.index.isin(proj_summary.index)]))

Keep only the rows in proj_summary which correspond to those in ctrl_samples. Do this with the isin operator. Save it to a variable called proj_summary_ctrl.

# Subset proj_summary by the samples within ctrl_samples
proj_summary_ctrl = proj_summary[proj_summary.index.isin(ctrl_samples.index)]

# Print out proj_summary_ctrl
proj_summary_ctrl

         percent_GC  Exonic_Rate  ...  rRNA_rate  treatment
Name                              ...                      
sample3          50       0.8834  ...   0.026945    control
sample4          50       0.9027  ...   0.005082    control
sample8          49       0.9022  ...   0.004549    control

[3 rows x 8 columns]

Use merge() to add a column called batch to the proj_summary_ctrl dataframe. Assign this new dataframe back to proj_summary_ctrl.

# Using the match, add the batch information
proj_summary_ctrl = pd.merge(proj_summary_ctrl, 
                             ctrl_samples, 
                             left_index=True, 
                             right_index=True)

# Print proj_summary_ctrl
proj_summary_ctrl

         percent_GC  Exonic_Rate  ...  treatment        date
sample3          50       0.8834  ...    control  01/13/2018
sample4          50       0.9027  ...    control  09/20/2018
sample8          49       0.9022  ...    control  01/13/2018

[3 rows x 9 columns]

Reuse

CC-BY-4.0