We have a list of 6 marker genes that we are very interested in. Our goal is to extract count data for these genes using the in operator from the rpkm_ordered data frame, instead of scrolling through rpkm_ordered and finding them manually.
First, let’s create a vector called important_genes with the Ensembl IDs of the 6 genes we are interested in:
Extract the rows from rpkm_ordered that correspond to these 6 genes using [] and the isin operator. Double check the row names to ensure that you are extracting the correct rows.
# Extract rows for important genesrpkm_ordered[rpkm_ordered.index.isin(important_genes)]
sample1
sample2
sample3
sample4
sample5
sample6
sample7
sample8
sample9
sample10
sample11
sample12
ENSMUSG00000030970
2.971420
2.221180
2.151490
0.175380
0.537852
0.435484
2.243810
2.599400
3.593970
1.014520
0.963523
0.964169
ENSMUSG00000047945
5.199470
4.745070
4.501390
0.221343
0.323620
0.478836
1.297810
3.896810
3.285470
0.982691
1.442970
3.581640
ENSMUSG00000065619
0.000000
0.000000
0.000000
0.000000
0.000000
0.000000
0.000000
0.000000
0.000000
0.000000
0.000000
0.000000
ENSMUSG00000080990
0.000000
0.000000
0.000000
0.000000
0.000000
0.000000
0.000000
0.000000
0.000000
0.000000
0.000000
0.000000
ENSMUSG00000081010
0.235848
0.222275
0.214347
0.419666
0.349415
0.248244
0.190397
0.167166
0.221353
0.452537
0.415823
0.594672
ENSMUSG00000083700
0.124225
0.425214
0.136998
0.066042
0.337651
0.308669
0.145973
0.142010
0.508757
0.205934
0.222865
0.488064
Bonus question: Extract the rows from rpkm_ordered that correspond to these 6 genes using [], but without using the isin operator.
# Extract rows for important genes without using isin operatorrpkm_ordered.loc[important_genes]
sample1
sample2
sample3
sample4
sample5
sample6
sample7
sample8
sample9
sample10
sample11
sample12
ENSMUSG00000083700
0.124225
0.425214
0.136998
0.066042
0.337651
0.308669
0.145973
0.142010
0.508757
0.205934
0.222865
0.488064
ENSMUSG00000080990
0.000000
0.000000
0.000000
0.000000
0.000000
0.000000
0.000000
0.000000
0.000000
0.000000
0.000000
0.000000
ENSMUSG00000065619
0.000000
0.000000
0.000000
0.000000
0.000000
0.000000
0.000000
0.000000
0.000000
0.000000
0.000000
0.000000
ENSMUSG00000047945
5.199470
4.745070
4.501390
0.221343
0.323620
0.478836
1.297810
3.896810
3.285470
0.982691
1.442970
3.581640
ENSMUSG00000081010
0.235848
0.222275
0.214347
0.419666
0.349415
0.248244
0.190397
0.167166
0.221353
0.452537
0.415823
0.594672
ENSMUSG00000030970
2.971420
2.221180
2.151490
0.175380
0.537852
0.435484
2.243810
2.599400
3.593970
1.014520
0.963523
0.964169
Exercise 2
Reading in and inspecting data
Using the animals.csv, read the .csv file into your environment and assign it to a variable called animals. Be sure to check that your row names are the different animals.
# Read in animals.csvanimals = pd.read_csv("data/animals.csv")# Print animalsanimals
speed
color
Elephant
40.0
Gray
Cheetah
120.0
Tan
Tortoise
0.1
Green
Hare
48.0
Grey
Lion
80.0
Tan
PolarBear
30.0
White
Check to make sure that animals is a dataframe.
# Check the class of animalstype(animals)
<class 'pandas.DataFrame'>
How many rows are in the animals dataframe? How many columns?
# Number of rows in animalslen(animals.index)
6
# Number of columns in animalslen(animals.columns)
2
# Shape of animalsanimals.shape
(6, 2)
Data wrangling
Extract the speed value of 40 km/h from the animals dataframe.
# Various methods to extract the speed value of 40 km/hanimals.iloc[0, 0]
np.float64(40.0)
animals[animals["speed"] ==40]
speed
color
Elephant
40.0
Gray
animals[animals["speed"] ==40]["speed"]
Elephant 40.0
Name: speed, dtype: float64
animals["speed"].loc[animals["speed"] ==40]
Elephant 40.0
Name: speed, dtype: float64
Return the rows with animals that are the color Tan.
# Two way to return the rows with the animal color as Tananimals[animals["color"] =="Tan"]
speed
color
Cheetah
120.0
Tan
Lion
80.0
Tan
animals.loc[animals["color"] =="Tan"]
speed
color
Cheetah
120.0
Tan
Lion
80.0
Tan
Return the speed column for the rows with animals that are the color Tan.
# Return the speed column for animals with color Tananimals[animals["color"] =="Tan"]["speed"]
# Two ways to change Grey to Grayanimals["color"][animals["color"] =="Grey"] ="Gray"animals.loc[animals["color"] =="Grey", "color"] ="Gray"animals
speed
color
Elephant
40.0
Gray
Cheetah
120.0
Tan
Tortoise
0.1
Green
Hare
48.0
Gray
Lion
80.0
Tan
PolarBear
30.0
White
Create a list called animals_list in which the first element contains the speed column of the animals dataframe and the second element contains the color column of the animals dataframe.
# Create a list with the speed and color vectorsanimals_list = [animals["speed"], animals["color"]]animals_list
[Elephant 40.0
Cheetah 120.0
Tortoise 0.1
Hare 48.0
Lion 80.0
PolarBear 30.0
Name: speed, dtype: float64, Elephant Gray
Cheetah Tan
Tortoise Green
Hare Gray
Lion Tan
PolarBear White
Name: color, dtype: str]
Create a dictionary with the appropriate keys (i.e speed and color).
# Create a dictionary with the speed and color vectorsanimals_dict = {"speed": animals["speed"], "color": animals["color"]}animals_dict
{'speed': Elephant 40.0
Cheetah 120.0
Tortoise 0.1
Hare 48.0
Lion 80.0
PolarBear 30.0
Name: speed, dtype: float64, 'color': Elephant Gray
Cheetah Tan
Tortoise Green
Hare Gray
Lion Tan
PolarBear White
Name: color, dtype: str}
The in operator, reordering and matching
In your data folder you should have a dataframe called proj_summary which contains quality metric information for an RNA-seq dataset. We have obtained batch information for the control samples in this dataset. Copy and paste the code below to create a dataframe of control samples with the associated batch information:
# Read in proj_summary if neededproj_summary = pd.read_table("data/project-summary.txt", header=0, index_col=0)# Create ctrl_samples dataframectrl_samples = pd.DataFrame( data = {"date": ["01/13/2018", "03/15/2018", "01/13/2018","09/20/2018","03/15/2018"]}, index = ["sample3", "sample10", "sample8", "sample4", "sample15"] )
proj_summary
Table 1: DataFrame of quality metric information for an RNA-seq dataset.
percent_GC
Exonic_Rate
Intronic_Rate
Intergenic_Rate
Mapping_Rate
Quality_format
rRNA_rate
treatment
Name
sample1
49
0.8913
0.0709
0.0378
0.978800
standard
0.007265
high
sample2
49
0.9055
0.0625
0.0321
0.982507
standard
0.005518
low
sample3
50
0.8834
0.0663
0.0503
0.987729
standard
0.026945
control
sample4
50
0.9027
0.0649
0.0325
0.987076
standard
0.005082
control
sample5
49
0.8923
0.0714
0.0362
0.978184
standard
0.005023
high
sample6
49
0.8999
0.0667
0.0334
0.977210
standard
0.005345
low
sample7
49
0.8983
0.0665
0.0352
0.975800
standard
0.005240
high
sample8
49
0.9022
0.0656
0.0322
0.987746
standard
0.004549
control
sample9
49
0.9111
0.0566
0.0323
0.981449
standard
0.005818
low
ctrl_samples
Table 2: DataFrame of control samples with associated batch information.
date
sample3
01/13/2018
sample10
03/15/2018
sample8
01/13/2018
sample4
09/20/2018
sample15
03/15/2018
How many of the ctrl_samples are also in the proj_summary dataframe? Use the isin operator to compare sample names.
# Number of shared samples between ctrl_samples and proj_summarylen(ctrl_samples.index[ctrl_samples.index.isin(proj_summary.index)])
3
Keep only the rows in proj_summary which correspond to those in ctrl_samples. Do this with the isin operator. Save it to a variable called proj_summary_ctrl.
# Subset proj_summary by the samples within ctrl_samplesproj_summary_ctrl = proj_summary[proj_summary.index.isin(ctrl_samples.index)]# Print out proj_summary_ctrlproj_summary_ctrl
percent_GC
Exonic_Rate
Intronic_Rate
Intergenic_Rate
Mapping_Rate
Quality_format
rRNA_rate
treatment
Name
sample3
50
0.8834
0.0663
0.0503
0.987729
standard
0.026945
control
sample4
50
0.9027
0.0649
0.0325
0.987076
standard
0.005082
control
sample8
49
0.9022
0.0656
0.0322
0.987746
standard
0.004549
control
Use merge() to add a column called batch to the proj_summary_ctrl dataframe. Assign this new dataframe back to proj_summary_ctrl.
# Using the match, add the batch informationproj_summary_ctrl = pd.merge(proj_summary_ctrl, ctrl_samples, left_index=True, right_index=True)# Print proj_summary_ctrlproj_summary_ctrl
percent_GC
Exonic_Rate
Intronic_Rate
Intergenic_Rate
Mapping_Rate
Quality_format
rRNA_rate
treatment
date
Name
sample3
50
0.8834
0.0663
0.0503
0.987729
standard
0.026945
control
01/13/2018
sample4
50
0.9027
0.0649
0.0325
0.987076
standard
0.005082
control
09/20/2018
sample8
49
0.9022
0.0656
0.0322
0.987746
standard
0.004549
control
01/13/2018
Reuse
CC-BY-4.0
Source Code
---title: "Data Wrangling - Answer Key"author: - Noor Sohail - Will Gammerdingerdate: "2026-03-15"license: "CC-BY-4.0"editor_options: markdown: wrap: 72---```{r}#| label: load_libraries_data#| echo: false# Load libraries and data# Interfacing with R quarto and python futzinglibrary(reticulate)use_condaenv("/opt/anaconda3/envs/intro_python", required =TRUE)``````{python}#| label: load_libraries_data_python#| echo: false# Load libraries and dataimport pandas as pdrpkm_data = pd.read_csv("data/counts.rpkm.csv")metadata = pd.read_csv("data/mouse_exp_design.csv")rpkm_ordered = rpkm_data.reindex(columns=metadata.index)```# Exercise 1We have a list of 6 marker genes that we are very interested in. Our goal is to extract count data for these genes using the `in` operator from the `rpkm_ordered` data frame, instead of scrolling through `rpkm_ordered` and finding them manually.First, let's create a vector called `important_genes` with the Ensembl IDs of the 6 genes we are interested in:```{python}#| label: create_important_genes# Create important genes vectorimportant_genes = ["ENSMUSG00000083700", "ENSMUSG00000080990", "ENSMUSG00000065619", "ENSMUSG00000047945", "ENSMUSG00000081010", "ENSMUSG00000030970"]```1. Use the `in` operator to determine if all of these genes are present in the row names of the `rpkm_ordered` data frame.```{python}#| label: exercise_1# Check if important genes are in rpkm_orderedrpkm_ordered.index.isin(important_genes)```2. Extract the rows from `rpkm_ordered` that correspond to these 6 genes using `[]` and the `isin` operator. Double check the row names to ensure that you are extracting the correct rows.```{python}#| label: exercise_2# Extract rows for important genesrpkm_ordered[rpkm_ordered.index.isin(important_genes)]```3. **Bonus question:** Extract the rows from `rpkm_ordered` that correspond to these 6 genes using `[]`, but without using the `isin` operator.```{python}#| label: exercise_3# Extract rows for important genes without using isin operatorrpkm_ordered.loc[important_genes]```# Exercise 2### Reading in and inspecting data4. Using the `animals.csv`, read the `.csv` file into your environment and assign it to a variable called `animals`. **Be sure to check that your row names are the different animals.**```{python}#| label: read_in_animals_csv# Read in animals.csvanimals = pd.read_csv("data/animals.csv")# Print animalsanimals```5. Check to make sure that `animals` is a dataframe.```{python}#| label: class_animals# Check the class of animalstype(animals)```6. How many rows are in the `animals` dataframe? How many columns?```{python}#| label: dim_animals# Number of rows in animalslen(animals.index)# Number of columns in animalslen(animals.columns)# Shape of animalsanimals.shape```### Data wrangling7. Extract the `speed` value of 40 km/h from the `animals` dataframe.```{python}#| label: extract_first_row_first_column# Various methods to extract the speed value of 40 km/hanimals.iloc[0, 0]animals[animals["speed"] ==40]animals[animals["speed"] ==40]["speed"]animals["speed"].loc[animals["speed"] ==40]```8. Return the rows with animals that are the `color` Tan.```{python}#| label: rows_with_color_tan# Two way to return the rows with the animal color as Tananimals[animals["color"] =="Tan"]animals.loc[animals["color"] =="Tan"]```9. Return the `speed` column for the rows with animals that are the `color` Tan.```{python}#| label: speed_column_tan# Return the speed column for animals with color Tananimals[animals["color"] =="Tan"]["speed"]```10. Change the color of "Grey" to "Gray". ```{python}#| label: change_grey_to_gray# Two ways to change Grey to Grayanimals["color"][animals["color"] =="Grey"] ="Gray"animals.loc[animals["color"] =="Grey", "color"] ="Gray"animals```11. Create a list called `animals_list` in which the first element contains the speed column of the `animals` dataframe and the second element contains the color column of the `animals` dataframe. ```{python}#| label: create_list# Create a list with the speed and color vectorsanimals_list = [animals["speed"], animals["color"]]animals_list```12. Create a dictionary with the appropriate keys (i.e speed and color).```{python}#| label: create_dict# Create a dictionary with the speed and color vectorsanimals_dict = {"speed": animals["speed"], "color": animals["color"]}animals_dict```### The `in` operator, reordering and matching13. In your data folder you should have a dataframe called `proj_summary` which contains quality metric information for an RNA-seq dataset. We have obtained batch information for the control samples in this dataset. **Copy and paste the code below to create a dataframe of control samples with the associated batch information**:```{python}#| label: create_ctrl_samples_df# Read in proj_summary if neededproj_summary = pd.read_table("data/project-summary.txt", header=0, index_col=0)# Create ctrl_samples dataframectrl_samples = pd.DataFrame( data = {"date": ["01/13/2018", "03/15/2018", "01/13/2018","09/20/2018","03/15/2018"]}, index = ["sample3", "sample10", "sample8", "sample4", "sample15"] )``````{python}#| label: tbl-proj_summary#| tbl-cap: DataFrame of quality metric information for an RNA-seq dataset.proj_summary ``````{python}#| label: tbl-ctrl_samples#| tbl-cap: DataFrame of control samples with associated batch information.ctrl_samples```14. How many of the `ctrl_samples` are also in the `proj_summary` dataframe? Use the `isin` operator to compare sample names.```{python}#| label: shared_ctrl_samples_proj_summary# Number of shared samples between ctrl_samples and proj_summarylen(ctrl_samples.index[ctrl_samples.index.isin(proj_summary.index)])```15. Keep only the rows in `proj_summary` which correspond to those in `ctrl_samples`. Do this with the `isin` operator. Save it to a variable called `proj_summary_ctrl`.```{python}#| label: subset_proj_summary_by_ctrl_samples# Subset proj_summary by the samples within ctrl_samplesproj_summary_ctrl = proj_summary[proj_summary.index.isin(ctrl_samples.index)]# Print out proj_summary_ctrlproj_summary_ctrl```16. Use `merge()` to add a column called `batch` to the `proj_summary_ctrl` dataframe. Assign this new dataframe back to `proj_summary_ctrl`.```{python}#| label: add_batch_information# Using the match, add the batch informationproj_summary_ctrl = pd.merge(proj_summary_ctrl, ctrl_samples, left_index=True, right_index=True)# Print proj_summary_ctrlproj_summary_ctrl```