# Create important genes vector
important_genes = ["ENSMUSG00000083700", "ENSMUSG00000080990",
"ENSMUSG00000065619", "ENSMUSG00000047945",
"ENSMUSG00000081010", "ENSMUSG00000030970"]Data Wrangling - Answer Key
Exercise 1
We have a list of 6 marker genes that we are very interested in. Our goal is to extract count data for these genes using the in operator from the rpkm_ordered data frame, instead of scrolling through rpkm_ordered and finding them manually.
First, let’s create a vector called important_genes with the Ensembl IDs of the 6 genes we are interested in:
- Use the
isinoperator to determine which row names of therpkm_orderedDataFrame match our genes of interest.
# Check if important genes are in rpkm_ordered
rpkm_ordered.index.isin(important_genes)array([False, False, False, ..., False, False, False], shape=(38828,))
- Extract the rows from
rpkm_orderedthat correspond to these 6 genes using[]and theisinoperator. Double check the row names to ensure that you are extracting the correct rows.
# Extract rows for important genes
rpkm_ordered[rpkm_ordered.index.isin(important_genes)] sample1 sample2 sample3 ... sample10 sample11 sample12
ENSMUSG00000030970 2.971420 2.221180 2.151490 ... 1.014520 0.963523 0.964169
ENSMUSG00000047945 5.199470 4.745070 4.501390 ... 0.982691 1.442970 3.581640
ENSMUSG00000065619 0.000000 0.000000 0.000000 ... 0.000000 0.000000 0.000000
ENSMUSG00000080990 0.000000 0.000000 0.000000 ... 0.000000 0.000000 0.000000
ENSMUSG00000081010 0.235848 0.222275 0.214347 ... 0.452537 0.415823 0.594672
ENSMUSG00000083700 0.124225 0.425214 0.136998 ... 0.205934 0.222865 0.488064
[6 rows x 12 columns]
- Bonus question: Extract the rows from
rpkm_orderedthat correspond to these 6 genes using[], but without using theisinoperator. Do you notice anything different about the output?
# Extract rows for important genes without using isin operator
rpkm_ordered.loc[important_genes] sample1 sample2 sample3 ... sample10 sample11 sample12
ENSMUSG00000083700 0.124225 0.425214 0.136998 ... 0.205934 0.222865 0.488064
ENSMUSG00000080990 0.000000 0.000000 0.000000 ... 0.000000 0.000000 0.000000
ENSMUSG00000065619 0.000000 0.000000 0.000000 ... 0.000000 0.000000 0.000000
ENSMUSG00000047945 5.199470 4.745070 4.501390 ... 0.982691 1.442970 3.581640
ENSMUSG00000081010 0.235848 0.222275 0.214347 ... 0.452537 0.415823 0.594672
ENSMUSG00000030970 2.971420 2.221180 2.151490 ... 1.014520 0.963523 0.964169
[6 rows x 12 columns]
The rows are now in the order that we specified the genes of interest. When we used isin above, we got the rows in the order that they are present in the expression data.
Exercise 2
Reading in and inspecting data
- Using the
animals.csvwithin yourdatadirectory, read the.csvfile into your environment and assign it to a variable calledanimals. Be sure to check that your row names are the different animals.
# Read in animals.csv
animals = pd.read_csv("data/animals.csv")
# Print animals
animals speed color
Elephant 40.0 Gray
Cheetah 120.0 Tan
Tortoise 0.1 Green
Hare 48.0 Grey
Lion 80.0 Tan
PolarBear 30.0 White
- Check to make sure that
animalsis a dataframe.
# Check the class of animals
type(animals)<class 'pandas.core.frame.DataFrame'>
- How many rows are in the
animalsdataframe? How many columns?
# Number of rows in animals
len(animals.index)6
# Number of columns in animals
len(animals.columns)2
# Shape of animals
animals.shape(6, 2)
Data wrangling
- Extract the
speedvalue of 40 km/h from theanimalsdataframe.
# Various methods to extract the speed value of 40 km/h
print(animals.iloc[0, 0])40.0
animals[animals["speed"] == 40]["speed"]Elephant 40.0
Name: speed, dtype: float64
animals["speed"].loc[animals["speed"] == 40]Elephant 40.0
Name: speed, dtype: float64
- Return the rows with animals that are the
colorTan.
# Two way to return the rows with the animal color as Tan
animals[animals["color"] == "Tan"] speed color
Cheetah 120.0 Tan
Lion 80.0 Tan
animals.loc[animals["color"] == "Tan"] speed color
Cheetah 120.0 Tan
Lion 80.0 Tan
- Return the rows with animals that have speed greater than 50 km/h and output only the color column. Keep the output as a DataFrame.
# Return the color column for animals with speed > 50 as a DataFrame
animals[animals["speed"] > 50]["color"].to_frame() color
Cheetah Tan
Lion Tan
- Change the color of “Grey” to “Gray”.
# Two ways to change Grey to Gray
animals["color"][animals["color"] == "Grey"] = "Gray"
animals.loc[animals["color"] == "Grey", "color"] = "Gray"
animals speed color
Elephant 40.0 Gray
Cheetah 120.0 Tan
Tortoise 0.1 Green
Hare 48.0 Gray
Lion 80.0 Tan
PolarBear 30.0 White
- Create a list called
animals_listin which the first element contains the speed column of theanimalsdataframe and the second element contains the color column of theanimalsdataframe.
# Create a list with the speed and color vectors
animals_list = [animals["speed"], animals["color"]]
animals_list[Elephant 40.0
Cheetah 120.0
Tortoise 0.1
Hare 48.0
Lion 80.0
PolarBear 30.0
Name: speed, dtype: float64, Elephant Gray
Cheetah Tan
Tortoise Green
Hare Gray
Lion Tan
PolarBear White
Name: color, dtype: object]
- Create a dictionary with the appropriate keys (i.e., speed and color).
# Create a dictionary with the speed and color vectors
animals_dict = {"speed": animals["speed"],
"color": animals["color"]}
animals_dict{'speed': Elephant 40.0
Cheetah 120.0
Tortoise 0.1
Hare 48.0
Lion 80.0
PolarBear 30.0
Name: speed, dtype: float64, 'color': Elephant Gray
Cheetah Tan
Tortoise Green
Hare Gray
Lion Tan
PolarBear White
Name: color, dtype: object}
The isin operator, reordering and matching
- In the
datadirectory, you should have a table calledproject-summary.txtthat contains quality metric information for an RNA-seq dataset. We have obtained batch information for the control samples in this dataset. Copy and paste the code below to create DataFrames for the project summary and for the control samples with the associated batch information:
# Read in proj_summary
proj_summary = pd.read_table("data/project-summary.txt",
header=0, index_col=0)
# Create ctrl_samples dataframe
ctrl_samples = pd.DataFrame(
data = {"date": ["01/13/2018", "03/15/2018", "01/13/2018",
"09/20/2018","03/15/2018"]},
index = ["sample3", "sample10", "sample8",
"sample4", "sample15"])# View RNA-seq QC information
proj_summary percent_GC Exonic_Rate ... rRNA_rate treatment
Name ...
sample1 49 0.8913 ... 0.007265 high
sample2 49 0.9055 ... 0.005518 low
sample3 50 0.8834 ... 0.026945 control
sample4 50 0.9027 ... 0.005082 control
sample5 49 0.8923 ... 0.005023 high
sample6 49 0.8999 ... 0.005345 low
sample7 49 0.8983 ... 0.005240 high
sample8 49 0.9022 ... 0.004549 control
sample9 49 0.9111 ... 0.005818 low
[9 rows x 8 columns]
# View control batch information
ctrl_samples date
sample3 01/13/2018
sample10 03/15/2018
sample8 01/13/2018
sample4 09/20/2018
sample15 03/15/2018
- How many of the
ctrl_samplesare also in theproj_summarydataframe? Use theisinoperator to compare sample names.
# Number of shared samples between ctrl_samples and proj_summary
print(len(proj_summary.index[proj_summary.index.isin(ctrl_samples.index)]))3
print(len(ctrl_samples.index[ctrl_samples.index.isin(proj_summary.index)]))3
- Keep only the rows in
proj_summarywhich correspond to those inctrl_samples. Do this with theisinoperator. Save it to a variable calledproj_summary_ctrl.
# Subset proj_summary by the samples within ctrl_samples
proj_summary_ctrl = proj_summary[proj_summary.index.isin(ctrl_samples.index)]
# Print out proj_summary_ctrl
proj_summary_ctrl percent_GC Exonic_Rate ... rRNA_rate treatment
Name ...
sample3 50 0.8834 ... 0.026945 control
sample4 50 0.9027 ... 0.005082 control
sample8 49 0.9022 ... 0.004549 control
[3 rows x 8 columns]
- Use
merge()to add a column calledbatchto theproj_summary_ctrldataframe. Assign this new dataframe back toproj_summary_ctrl.
# Using the match, add the batch information
proj_summary_ctrl = pd.merge(proj_summary_ctrl,
ctrl_samples,
left_index=True,
right_index=True)
# Print proj_summary_ctrl
proj_summary_ctrl percent_GC Exonic_Rate ... treatment date
sample3 50 0.8834 ... control 01/13/2018
sample4 50 0.9027 ... control 09/20/2018
sample8 49 0.9022 ... control 01/13/2018
[3 rows x 9 columns]
Reuse
CC-BY-4.0