This lesson teaches how to use pandas DataFrames in Python to load tabular data, inspect and subset rows and columns, create new variables, and summarize data.
Authors
Noor Sohail
Will Gammerdinger
Published
March 16, 2026
Keywords
Pandas tutorial, CSV, Excel
Approximate time: XX minutes
Learning Objectives
In this lesson, we will:
Load and inspect data in a Pandas DataFrame.
Subset and index DataFrames in several different ways.
Add new columns to a DataFrame based on existing data.
Perform useful operations on DataFrames to summarize and manipulate data.
Overview of lesson
Real-life data oftentimes comes in the form of tables of data, matrices of gene expression, or Excel sheets of patient data. pandas DataFrames are Python’s way of representing this tabular data so you can clean, filter, and summarize it efficiently. Being comfortable with DataFrames will let you do tasks like quickly finding all samples from a particular condition, computing summary statistics, or preparing a clean table for plotting. In this lesson, you will practice loading data into DataFrames and performing common operations that mirror how you would explore a dataset in a spreadsheet, but with far more control and reproducibility.
Downloading Data
First, we need to download the dataset we will be working with in this lesson: a metadata file containing information about a biological dataset. You can download the data folder here (https://www.dropbox.com/scl/fi/yszqm3ri6p4uf5ppuflp1/data.zip?rlkey=c496frt1ioud8rrbo9ea1614p&st=fw7fhif8&dl=1) by:
1. Right-clicking the link and selecting “Save Link As…” to download the file to your computer.
2. After downloading the file, placing it in the data folder within the project directory.
3. Unzipping the file to extract its contents.
Pandas Library
The Pandas library is a powerful tool for data manipulation and analysis in Python. It provides the data structures and functions needed to manipulate tables of data, like Excel sheets. The primary data structures in Pandas are Series and DataFrames.
Pandas and NumPy
Pandas is built on top of the NumPy library, which in practice means that most of the methods defined for NumPy Arrays apply to Pandas data structures.
This library is widely used in data science and machine learning for tasks such as data cleaning, transformation, and analysis. This is due to its comprehensive set of functions for handling and manipulating data, as well as its ability to handle large datasets efficiently.
Loading a Dataset
Within the Pandas library, there are built-in functions to load datasets from various file formats, such as CSV (comma-separated values) or Excel (.xlsx) files. The most commonly used function for loading data is pandas.read_csv(), which allows you to read a CSV file and create a DataFrame.
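To see how read_csv() behaves without needing the downloaded file, here is a minimal sketch that parses a small in-memory CSV string (the sample values are hypothetical stand-ins for the lesson's metadata file; io.StringIO makes a string readable like a file, and index_col=0 uses the first column as row labels):

```python
import io
import pandas as pd

# Hypothetical CSV content mimicking the lesson's metadata file
csv_text = """sample,genotype,celltype,replicate
sample1,Wt,typeA,1
sample2,KO,typeA,2
"""

# index_col=0 makes the first column (sample names) the row index
metadata_demo = pd.read_csv(io.StringIO(csv_text), index_col=0)
print(metadata_demo.shape)
```

Reading the real file works the same way, with the path in place of the StringIO object.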
When working with large datasets, you will very likely be working with a “metadata” file, which contains the information about each sample in your dataset. The metadata is very important information, and we encourage you to think about creating a document with as much metadata as you can record before you bring the data into Python. Here is some additional reading on metadata (https://datamanagement.hms.harvard.edu/metadata-overview) from the HMS Data Management Working Group.
We have a file that contains information about the data, i.e., metadata. Our metadata is also stored in a CSV file. In this file, each row corresponds to a sample and each column contains some information about each sample.
import pandas as pd
metadata = pd.read_csv("data/mouse_exp_design.csv")
metadata
Table 1: DataFrame of our experimental design, including sample names, genotypes, cell types, and replicate numbers for each sample.
          genotype  celltype  replicate
sample1   Wt        typeA     1
sample2   Wt        typeA     2
sample3   Wt        typeA     3
sample4   KO        typeA     1
sample5   KO        typeA     2
sample6   KO        typeA     3
sample7   Wt        typeB     1
sample8   Wt        typeB     2
sample9   Wt        typeB     3
sample10  KO        typeB     1
sample11  KO        typeB     2
sample12  KO        typeB     3
The first column contains the row names and the remaining columns contain information about our samples that allow us to categorize them. For example, the second column contains genotype information for each sample. Each sample is classified in one of two categories: Wt (wild type) or KO (knockout). What types of categories do you observe in the remaining columns?
This metadata appears to describe the samples in our study. Each row holds information for a single sample, and the columns contain categorical information about the sample genotype (WT or KO), celltype (typeA or typeB), and replicate number (1, 2, or 3).
Inspecting the DataFrame
There is a wide selection of functions in the Pandas library that are useful for inspecting your data and summarizing it. Let’s use the metadata DataFrame that we loaded to test out some data inspection functions.
For example, we can use the shape method to check the dimensions of our DataFrame, which will tell us how many rows and columns it contains:
metadata.shape
(12, 3)
.shape
The .shape attribute returns the number of rows and columns in the DataFrame. The first element is the number of rows, while the second element is the number of columns.
We do not have closing parentheses after .shape because it is an attribute of the DataFrame, not a method/function. In contrast, methods require parentheses to be called, even if they do not take any arguments (e.g., metadata.head()).
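Because .shape is a plain tuple, it can be unpacked directly into two variables. A small sketch with a hypothetical two-column DataFrame:

```python
import pandas as pd

# Hypothetical DataFrame to illustrate the .shape attribute
df = pd.DataFrame({"a": [1, 2, 3], "b": [4, 5, 6]})

# .shape is a (rows, columns) tuple, so it can be unpacked
n_rows, n_cols = df.shape
print(n_rows, n_cols)
```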
Suppose we had a larger file; we might not want to display all the contents in the console. Instead, we could check the top (the first five rows, by default) of this DataFrame using the head() method:
metadata.head()
Table 2: First five rows of the metadata DataFrame using the head() method
Print the last five rows of the metadata DataFrame using the tail() method.
A followup question to question #1
…
Indexing and Subsetting DataFrames
To access specific elements of a DataFrame, we can use indexing and subsetting techniques. DataFrames can be indexed using both numerical indices and labels (rownames and column names).
Both are useful for different purposes. Numerical indexing is often more concise and can be faster for certain operations, while label-based indexing can be more intuitive and easier to read, especially when working with large datasets with meaningful row and column names.
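To make the contrast concrete, here is a small sketch (with a hypothetical two-sample table) showing the same cell retrieved both ways, with iloc (covered next) for positions and loc for labels:

```python
import pandas as pd

# Hypothetical mini-metadata table to compare the two indexing styles
metadata = pd.DataFrame(
    {"genotype": ["Wt", "KO"], "replicate": [1, 2]},
    index=["sample1", "sample2"],
)

# Numerical indexing: row 0, column 0
by_position = metadata.iloc[0, 0]
# Label-based indexing: the same cell, addressed by name
by_label = metadata.loc["sample1", "genotype"]
print(by_position, by_label)
```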
Subsetting DataFrames with Indices
To access specific elements of a DataFrame using numerical indexing, we can use the iloc method, which stands for “integer location”. This method allows us to access rows and columns by their indices.
Let’s say we wanted to extract the wild type (Wt) value that is present in the first row and the first column.
To extract it, just like with arrays, we give the name of the DataFrame that we want to extract from, followed by square brackets (metadata.iloc[ ]).
Now inside the square brackets we give the coordinates or indices for the rows in which the value(s) are present, followed by a comma, then the coordinates or indices for the columns in which the value(s) are present (metadata.iloc[rows, columns]).
We know the wild type value is in the first row if we count from the top, and in the first column counting from left to right. Keep in mind that Python uses zero-based indexing, so the first row and the first column are both at index 0: we put a zero, then a comma, then another zero.
# Extract the value in the first row and first column
metadata.iloc[0, 0]
'Wt'
Now let’s extract the value 1 from the first row and third column.
# Extract the value in the first row and third column
metadata.iloc[0, 2]
np.int64(1)
Now if you only wanted to select based on rows, you would provide the index for the rows and leave the columns index blank. The key here is to include the comma, to let Python know that you are accessing a 2-dimensional data structure:
# Extract the first row
metadata.iloc[0, :]
What kind of data structure does the output appear to be? It looks slightly different from the original DataFrame, but it still has the column names from before. Let us use the type() function to check the data structure of this output:
type(metadata.iloc[0, :])
<class 'pandas.Series'>
This is a Series data structure, which is a one-dimensional array with an index (labels for each entry). It is similar to a vector in R. The reason we get a Series instead of a DataFrame is that we are selecting a single row from the DataFrame. Since a single row or column of a DataFrame is really just a one-dimensional sequence of values, Python returns the simplest data structure that can hold it.
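A quick sketch (on a hypothetical two-sample table) shows that when a row is pulled out as a Series, the former column names become the Series index:

```python
import pandas as pd

# Hypothetical mini-metadata table
metadata = pd.DataFrame(
    {"genotype": ["Wt", "KO"], "celltype": ["typeA", "typeB"]},
    index=["sample1", "sample2"],
)

# Selecting one row yields a Series whose index is the column names
row = metadata.iloc[0, :]
print(type(row).__name__)
print(list(row.index))
```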
If you were selecting specific columns from the DataFrame - the rows are left blank:
# Extract the first column
metadata.iloc[:, 0]
sample1 Wt
sample2 Wt
sample3 Wt
sample4 KO
sample5 KO
sample6 KO
sample7 Wt
sample8 Wt
sample9 Wt
sample10 KO
sample11 KO
sample12 KO
Name: genotype, dtype: str
Same as before, we get a Series data structure because we are selecting a single column from the DataFrame.
Oftentimes we would like to keep our single column as a DataFrame. To do this, we use the to_frame() method, which converts a Series to a DataFrame:
# Extract the first column and convert it to a DataFrame
metadata.iloc[:, 0].to_frame()
Table 3: Converting a Series to a DataFrame using the to_frame() method.
          genotype
sample1   Wt
sample2   Wt
sample3   Wt
sample4   KO
sample5   KO
sample6   KO
sample7   Wt
sample8   Wt
sample9   Wt
sample10  KO
sample11  KO
sample12  KO
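As an aside, an alternative to to_frame() is to pass a one-element list of positions to iloc: a list selection keeps the result as a DataFrame instead of collapsing it to a Series. A minimal sketch with a hypothetical table:

```python
import pandas as pd

# Hypothetical mini-metadata table
metadata = pd.DataFrame(
    {"genotype": ["Wt", "KO"], "celltype": ["typeA", "typeB"]},
    index=["sample1", "sample2"],
)

# A bare integer gives a Series...
as_series = metadata.iloc[:, 0]
# ...but a one-element list keeps the result as a DataFrame
as_frame = metadata.iloc[:, [0]]
print(type(as_series).__name__, type(as_frame).__name__)
```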
Slicing DataFrames
Just like with arrays, you can select multiple rows and columns at a time. Within the square brackets, you need to provide a list or slice of the desired indices.
We can extract consecutive rows or columns using the colon (:) to create a slice of indices to extract.
# Extract the first three rows
metadata.iloc[0:3, :]
Table 4: Extracting the first three rows from the metadata DataFrame using slices.
         genotype  celltype  replicate
sample1  Wt        typeA     1
sample2  Wt        typeA     2
sample3  Wt        typeA     3
Alternatively, we can use a list of indices to extract any number of rows or columns, consecutive or not. Let’s extract the first, third, and sixth rows.
# Extract the first, third, and sixth rows
metadata.iloc[[0, 2, 5], :]
Table 5: Extracting non-consecutive rows from the metadata DataFrame with a list of indices.
         genotype  celltype  replicate
sample1  Wt        typeA     1
sample3  Wt        typeA     3
sample6  KO        typeA     3
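Slices also accept a step, and negative positions count from the end, just as with Python lists. A short sketch on a hypothetical six-sample table:

```python
import pandas as pd

# Hypothetical table with six samples
metadata = pd.DataFrame(
    {"replicate": [1, 2, 3, 1, 2, 3]},
    index=[f"sample{i}" for i in range(1, 7)],
)

# Every other row, starting from the first
every_other = metadata.iloc[::2, :]
# The last two rows
last_two = metadata.iloc[-2:, :]
print(list(every_other.index))
print(list(last_two.index))
```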
Subsetting DataFrames with Labels
For larger datasets, it can be tricky to remember the column number that corresponds to a particular variable. (Is celltype the first or the second column? Oh, right… it is the second column, at index 1.) In some cases, the column/row number for values can change if the script you are using adds or removes columns/rows. It is therefore often better to use column/row names to extract particular values, and it makes your code easier to read and your intentions clearer.
So first let us take a look at what the row names (index) and column names (columns) are for our DataFrame:
# Get the row names
metadata.index
# Get the column names
metadata.columns
Now that we know the row and column names, we can use them to subset our data. To access specific elements of a DataFrame using label-based indexing, we can use the loc method, which stands for “location”. This method allows us to access rows and columns by their labels. For example, to extract the first three samples, we can use the following code:
# Extract the first three samples for the celltype column
metadata.loc[["sample1", "sample2", "sample3"], "celltype"]
We can mix and match label-based and numerical indexing. For example, we can use label-based indexing to select the column we want, then use numerical indexing to select the first three samples from that column:
# Extract the first three samples for the celltype column
metadata.iloc[0:3]["celltype"]
It’s important to type the names of the columns/rows in the exact way that they are typed in the DataFrame; for instance if I had spelled celltype with a capital C, it would not have worked.
# Extract the first three samples for the Celltype column
metadata.iloc[0:3]["Celltype"]  # Celltype column incorrect
KeyError: 'Celltype'
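If a column name might not be present (or might be capitalized differently), testing membership in .columns avoids the KeyError entirely. A small defensive sketch with a hypothetical table:

```python
import pandas as pd

# Hypothetical table with a lowercase column name
metadata = pd.DataFrame({"celltype": ["typeA", "typeB"]})

# Checking membership in .columns avoids the KeyError
name = "Celltype"  # wrong capitalization, as in the example above
if name in metadata.columns:
    print(metadata[name])
else:
    print(f"Column {name!r} not found; available: {list(metadata.columns)}")
```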
We can also directly access a column without the loc method by indexing the DataFrame with the column name in square brackets. For example, to access the celltype column, we can use the following code:
# Access the celltype column directly
metadata["celltype"]
Subsetting DataFrames with Logical Expressions
With DataFrames we can use logical expressions to extract the rows or columns in the DataFrame with specific values. First, we need to determine the indices of the rows or columns where a logical expression is True; then we can extract those rows or columns from the DataFrame.
For example, if we want to return only those rows of the DataFrame with the celltype column having a value of typeA, we would perform two steps:
Identify which rows in the celltype column have a value of typeA.
Use those True values to extract those rows from the DataFrame.
# Create a boolean mask for rows where celltype is "typeA"
metadata["celltype"] == "typeA"
This will output True and False values for the values in the vector. The first six values are True and the last six are False. This means the first six rows of our metadata have a value of typeA while the last six do not. We can save these values to a variable, which we can call whatever we would like; let’s call it logical_idx.
# Create a boolean mask for rows where celltype is "typeA"
logical_idx = metadata["celltype"] == "typeA"
# Subset the DataFrame to return only rows where celltype is "typeA"
metadata[logical_idx]
Table 6: Subsetting metadata by applying a boolean mask for rows where celltype is “typeA”.
         genotype  celltype  replicate
sample1  Wt        typeA     1
sample2  Wt        typeA     2
sample3  Wt        typeA     3
sample4  KO        typeA     1
sample5  KO        typeA     2
sample6  KO        typeA     3
We have used those True and False values to extract the rows that correspond to the True values from the metadata DataFrame. The result is a DataFrame that only contains rows where the celltype is typeA.
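Boolean masks can also be combined: & means "and", | means "or", and each condition must be wrapped in parentheses. A short sketch on a hypothetical four-sample table:

```python
import pandas as pd

# Hypothetical mini-metadata table
metadata = pd.DataFrame(
    {
        "genotype": ["Wt", "Wt", "KO", "KO"],
        "celltype": ["typeA", "typeB", "typeA", "typeB"],
    },
    index=["sample1", "sample2", "sample3", "sample4"],
)

# Combine masks with & (and) / | (or); each condition needs parentheses
wt_typeA = metadata[(metadata["celltype"] == "typeA") & (metadata["genotype"] == "Wt")]
print(list(wt_typeA.index))
```

Note that the Python keywords and/or do not work here; pandas requires the elementwise operators & and |.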
Adding New Columns
Now that we know how to access specific values in a DataFrame, we can also add new columns to our data. Oftentimes you will need to create new variables based on the information in your DataFrame or add new information to your DataFrame.
Adding a New Column with the Same Value
Perhaps we wanted to add a new column to our metadata, where we specify that the species for each of our samples is mus musculus. When the value is the same across all the rows, we can simply create a new column and assign whichever value we want to that column:
# Add a new column for species
metadata["species"] = "mus musculus"
metadata
Table 7: Adding a new column (species) to the metadata DataFrame with the same value for all rows.
          genotype  celltype  replicate  species
sample1   Wt        typeA     1          mus musculus
sample2   Wt        typeA     2          mus musculus
sample3   Wt        typeA     3          mus musculus
sample4   KO        typeA     1          mus musculus
sample5   KO        typeA     2          mus musculus
sample6   KO        typeA     3          mus musculus
sample7   Wt        typeB     1          mus musculus
sample8   Wt        typeB     2          mus musculus
sample9   Wt        typeB     3          mus musculus
sample10  KO        typeB     1          mus musculus
sample11  KO        typeB     2          mus musculus
sample12  KO        typeB     3          mus musculus
Adding a New Column with Conditional Values
We can even add new columns conditionally, based on information within the DataFrame. For example, perhaps all the mice in replicate 1 were female while the rest were male. We can create a new column for sex and make the corresponding assignments using the loc method.
We are going to do this in three steps:
Create the column sex and initialize it with a default value of None.
None
None is a special value in Python that represents the absence of a value or a null value. It is often used to indicate that a variable has no value or that a function does not return anything. In this case, we are initializing the sex column with None to indicate that we have not yet assigned any values to it.
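Once a column is initialized with None, pandas treats those entries as missing data, which .isna() can detect and count. A small sketch with a hypothetical partially-filled column:

```python
import pandas as pd

# A hypothetical column where some entries are still unassigned (None)
s = pd.Series([None, "female", None])

# .isna() flags the missing entries; .sum() counts them
print(s.isna().sum())
```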
# Add a new column for sex
metadata["sex"] = None
metadata
Table 8: Initializing a new column (sex) in the metadata DataFrame with a default value of None.
          genotype  celltype  replicate  species       sex
sample1   Wt        typeA     1          mus musculus  None
sample2   Wt        typeA     2          mus musculus  None
sample3   Wt        typeA     3          mus musculus  None
sample4   KO        typeA     1          mus musculus  None
sample5   KO        typeA     2          mus musculus  None
sample6   KO        typeA     3          mus musculus  None
sample7   Wt        typeB     1          mus musculus  None
sample8   Wt        typeB     2          mus musculus  None
sample9   Wt        typeB     3          mus musculus  None
sample10  KO        typeB     1          mus musculus  None
sample11  KO        typeB     2          mus musculus  None
sample12  KO        typeB     3          mus musculus  None
Assign the value female to rows where the replicate column has a value of 1. We do this with the loc method, which allows us to access rows that meet our condition; then we specify the column (sex) in which we want to assign the value female.
# Assign "female" to rows where replicate is 1
metadata.loc[metadata["replicate"] == 1, "sex"] = "female"
metadata
Table 9: Assigning the value “female” to rows where the replicate column has a value of 1.
          genotype  celltype  replicate  species       sex
sample1   Wt        typeA     1          mus musculus  female
sample2   Wt        typeA     2          mus musculus  None
sample3   Wt        typeA     3          mus musculus  None
sample4   KO        typeA     1          mus musculus  female
sample5   KO        typeA     2          mus musculus  None
sample6   KO        typeA     3          mus musculus  None
sample7   Wt        typeB     1          mus musculus  female
sample8   Wt        typeB     2          mus musculus  None
sample9   Wt        typeB     3          mus musculus  None
sample10  KO        typeB     1          mus musculus  female
sample11  KO        typeB     2          mus musculus  None
sample12  KO        typeB     3          mus musculus  None
Assign the value male to rows where the replicate column has a value of 2 or 3.
# Assign "male" to rows where replicate is not 1
metadata.loc[metadata["replicate"] != 1, "sex"] = "male"
metadata
Table 10: Assigning the value “male” to rows where the replicate column has a value not equal to 1.
          genotype  celltype  replicate  species       sex
sample1   Wt        typeA     1          mus musculus  female
sample2   Wt        typeA     2          mus musculus  male
sample3   Wt        typeA     3          mus musculus  male
sample4   KO        typeA     1          mus musculus  female
sample5   KO        typeA     2          mus musculus  male
sample6   KO        typeA     3          mus musculus  male
sample7   Wt        typeB     1          mus musculus  female
sample8   Wt        typeB     2          mus musculus  male
sample9   Wt        typeB     3          mus musculus  male
sample10  KO        typeB     1          mus musculus  female
sample11  KO        typeB     2          mus musculus  male
sample12  KO        typeB     3          mus musculus  male
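The same conditional assignment can also be done in a single vectorized step with numpy.where(condition, value_if_true, value_if_false), which avoids initializing the column with None first. A sketch on a hypothetical table:

```python
import numpy as np
import pandas as pd

# Hypothetical table with a replicate column
metadata = pd.DataFrame(
    {"replicate": [1, 2, 3, 1]},
    index=["sample1", "sample2", "sample3", "sample4"],
)

# numpy.where fills the whole column in one vectorized step
metadata["sex"] = np.where(metadata["replicate"] == 1, "female", "male")
print(metadata["sex"].tolist())
```

The three-step loc approach shown above is more flexible when there are more than two categories; np.where is convenient for a simple either/or.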
Calculating New Columns
We can also create new columns by performing calculations on existing columns. For example, we can create a new column called replicate_squared that contains the square of the values in the replicate column.
# Create a new column for the square of the replicate number
metadata["replicate_squared"] = metadata["replicate"] ** 2
metadata
Table 11: Creating a new column (replicate_squared) that contains the square of the values in the replicate column.
          genotype  celltype  replicate  species       sex     replicate_squared
sample1   Wt        typeA     1          mus musculus  female  1
sample2   Wt        typeA     2          mus musculus  male    4
sample3   Wt        typeA     3          mus musculus  male    9
sample4   KO        typeA     1          mus musculus  female  1
sample5   KO        typeA     2          mus musculus  male    4
sample6   KO        typeA     3          mus musculus  male    9
sample7   Wt        typeB     1          mus musculus  female  1
sample8   Wt        typeB     2          mus musculus  male    4
sample9   Wt        typeB     3          mus musculus  male    9
sample10  KO        typeB     1          mus musculus  female  1
sample11  KO        typeB     2          mus musculus  male    4
sample12  KO        typeB     3          mus musculus  male    9
Or we can even take the sum across columns to create a new column. For example, we can create a new column called replicate_replicate_sum that contains the sum of the replicate column with itself.
# Create a new column for the sum of replicate and replicate
metadata["replicate_replicate_sum"] = metadata["replicate"] + metadata["replicate"]
metadata
Table 12: Creating a new column (replicate_replicate_sum) that contains the sum of the values in the replicate and replicate columns.
          genotype  celltype  replicate  species       sex     replicate_squared  replicate_replicate_sum
sample1   Wt        typeA     1          mus musculus  female  1                  2
sample2   Wt        typeA     2          mus musculus  male    4                  4
sample3   Wt        typeA     3          mus musculus  male    9                  6
sample4   KO        typeA     1          mus musculus  female  1                  2
sample5   KO        typeA     2          mus musculus  male    4                  4
sample6   KO        typeA     3          mus musculus  male    9                  6
sample7   Wt        typeB     1          mus musculus  female  1                  2
sample8   Wt        typeB     2          mus musculus  male    4                  4
sample9   Wt        typeB     3          mus musculus  male    9                  6
sample10  KO        typeB     1          mus musculus  female  1                  2
sample11  KO        typeB     2          mus musculus  male    4                  4
sample12  KO        typeB     3          mus musculus  male    9                  6
We can apply any range of mathematical operations to create new columns based on data that already exists in the DataFrame.
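For instance, arithmetic operators can be chained in a single expression, applied elementwise to the whole column. A small sketch with a hypothetical replicate column:

```python
import pandas as pd

# Hypothetical table with a replicate column
metadata = pd.DataFrame({"replicate": [1, 2, 3]})

# Any elementwise arithmetic works column-wide in a single expression
metadata["replicate_scaled"] = (metadata["replicate"] - 1) / 2
print(metadata["replicate_scaled"].tolist())
```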
Useful DataFrame Operations
The Pandas library is filled with useful functions to manipulate and wrangle DataFrames. Here, we will cover some functions that are commonly used, but a more comprehensive cheatsheet of Pandas functions can be found on the official website.
value_counts()
Other ways to quickly summarize the contents of a DataFrame include value_counts(), which counts the number of times a particular value appears in a column. For example, we can use this function to count the number of samples that belong to each genotype category:
metadata["genotype"].value_counts()
genotype
Wt 6
KO 6
Name: count, dtype: int64
Now we know how many samples are classified as Wt and how many are classified as KO in our dataset.
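value_counts() also accepts normalize=True, which reports proportions instead of raw counts. A sketch with a hypothetical genotype column mirroring our 6/6 split:

```python
import pandas as pd

# Hypothetical genotype column with a 6/6 Wt-KO split
genotype = pd.Series(["Wt"] * 6 + ["KO"] * 6, name="genotype")

# normalize=True reports proportions instead of raw counts
proportions = genotype.value_counts(normalize=True)
print(proportions.to_dict())
```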
---title: "`pandas` DataFrames"description: | This lesson teaches how to use pandas DataFrames in Python to load tabular data, inspect and subset rows and columns, create new variables, and summarize data.author: - Noor Sohail - Will Gammerdingerdate: "2026-03-16"categories: - Python programming - Pandas - DataFrames - Tabular datakeywords: - Pandas tutorial - CSV - Excellicense: "CC-BY-4.0"editor_options: markdown: wrap: 72---```{r}#| label: load_libraries_data#| echo: false# Load libraries and data# Interfacing with R quarto and python futzinglibrary(reticulate)use_condaenv("/opt/anaconda3/envs/intro_python", required =TRUE)```Approximate time: XX minutes## Learning Objectives In this lesson, we will:- Load and inspect data in a `Pandas` DataFrame.- Subset and index DataFrames in several different ways.- Add new columns to a DataFrame based on existing data.- Perform useful operations on DataFrames to summarize and manipulate data.## Overview of lessonReal-life data oftentimes comes in the form of tables of data, matrices of gene expression, or excel sheets of patient data. `pandas` DataFrames are Python’s way of representing this tabular data so you can clean, filter, and summarize it efficiently. Being comfortable with DataFrames will let you do tasks like quickly finding all samples from a particular condition, computing summary statistics, or preparing a clean table for plotting. In this lesson, you will practice loading data into DataFrames and performing common operations that mirror how you would explore a dataset in a spreadsheet, but with far more control and reproducibility.## Downloading DataFirst, we need to download the datasets we will be working with in this lesson, which is a metadata file containing information about a biological dataset. You can [download the data folder here](https://www.dropbox.com/scl/fi/yszqm3ri6p4uf5ppuflp1/data.zip?rlkey=c496frt1ioud8rrbo9ea1614p&st=fw7fhif8&dl=1) by:1. 
Right-clicking the link and selecting "Save Link As..." to download the file to your computer.2. After downloading the file, place it in the `data` folder within the project directory3. Unzip the file to extract its contents.## `Pandas` LibraryThe `Pandas` library is a powerful tool for data manipulation and analysis in Python. It provides data structures and functions needed to manipulate tables of data, like excel sheets. The primary data structures in `Pandas` are Series and DataFrames.::: callout-note# `Pandas` and `NumPy`Pandas is built on top of the `NumPy` library, which in practice means that most of the methods defined for `NumPy` Arrays apply to `Pandas` data structures.:::This library is widely used in data science and machine learning for tasks such as data cleaning, transformation, and analysis. This is due to its comprehensive set of functions for handling and manipulating data, as well as its ability to handle large datasets efficiently.### Loading a DatasetsWithin the `Pandas` library, there are built-in functions to load datasets from various file formats, such as csv (comma separated values) or excel files (.xlsx). The most commonly used function for loading data is `pandas.read_csv()`, which allows you to read a CSV file and create a DataFrame.When working with large datasets, you will very likely be working with "metadata" file which contains the information about each sample in your dataset. The metadata is very important information and we encourage you to think about creating a document with as much metadata you can record before you bring the data into R. [Here is some additional reading on metadata](https://datamanagement.hms.harvard.edu/metadata-overview) from the [HMS Data Management Working Group](https://datamanagement.hms.harvard.edu/).We have a file in which we identify information about the data or metadata. Our metadata is also stored in a CSV file. 
In this file, each row corresponds to a sample and each column contains some information about each sample.```{python}#| label: tbl-load_metadata#| tbl-cap: DataFrame of our experimental design, including sample names, genotypes, cell types, and replicate numbers for each sample.import pandas as pdmetadata = pd.read_csv("data/mouse_exp_design.csv")metadata```The first column contains the row names and the remaining columns contain information about our samples that allow us to categorize them. For example, the second column contains genotype information for each sample. Each sample is classified in one of two categories: Wt (wild type) or KO (knockout). _What types of categories do you observe in the remaining columns?_This metadata appears to describe the samples in our study. Each row holds information for a single sample, and the columns contain categorical information about the sample `genotype` (WT or KO), `celltype` (typeA or typeB), and `replicate` number (1, 2, or 3).### Inspecting the DataFrameThere are a wide selection of base functions in R that are useful for inspecting your data and summarizing it. Let’s use the metadata file that we created to test out data inspection functions.For example, we can use the `shape` method to check the dimensions of our DataFrame, which will tell us how many rows and columns it contains:```{python}#| label: dim_metadatametadata.shape```::: callout-note# `.shape`The `.shape` method returns the number of rows and columns in the DataFrame. The first element is the number of rows, while the second element is the number of columns.We do not have closing parentheses after `.shape` because it is an **attribute** of the DataFrame, not a method/function. In contrast, methods require parentheses to be called, even if they do not take any arguments (e.g., `metadata.head()`).:::Suppose we had a larger file, we might not want to display all the contents in the console. 
Instead we could check the top (the first 6 lines) of this data.frame using the function `head()`:```{python}#| label: tbl-head_metadata#| tbl-cap: First six rows of the metadata DataFrame using the `head()` functionmetadata.head()```::: callout-warningHave not made exercises for the entire lesson::::::{.callout-tip}# [**Exercise 1**](09_pandas_dataframes-Answer_key.qmd#exercise-1)1. Print the last 6 lines of the metadata DataFrame using `tail()` function.2. A followup question to question #13. ...:::## Indexing and Subsetting DataFramesTo access specific elements of a DataFrame, we can use indexing and subsetting techniques. DataFrames can be indexed using both numerical indices and labels (rownames and column names).Both are useful for different purposes. Numerical indexing is often more concise and can be faster for certain operations, while label-based indexing can be more intuitive and easier to read, especially when working with large datasets with meaningful row and column names.### Subsetting DataFrames with IndicesTo access specific elements of a DataFrame using numerical indexing, we can use the `iloc` method, which stands for "integer location". This method allows us to access rows and columns by their indices.Let’s say we wanted to extract the wild type (Wt) value that is present in the **first row and the first column**.1. To extract it, just like with arrays, we give the name of the dataframe that we want to extract from, followed by the square brackets (`metadata.iloc[ ]`).2. Now inside the square brackets we give the coordinates or indices for the rows in which the value(s) are present, followed by a comma, then the coordinates or indices for the columns in which the value(s) are present (`metadata.iloc[rows, columns]`).We know the wild type value is in the first row if we count from the top, so we put a one, then a comma. 
The wild type value is also in the first column, counting from left to right, so we put a one in the columns space too.```{python}#| label: numerical_indexing# Extract the value in the first row and first columnmetadata.iloc[0, 0]```Now let’s extract the value `1` from the first row and third column.```{python}#| label: numerical_indexing_2# Extract the value in the first row and third columnmetadata.iloc[0, 2]```Now if you only wanted to select based on rows, you would provide the index for the rows and leave the columns index blank. The key here is to include the comma, to let Python know that you are accessing a 2-dimensional data structure:```{python}#| label: numerical_indexing_rows# Extract the first rowmetadata.iloc[0, :]```_What kind of data structure does the output appear to be?_ It looks slightly different from the original DataFrame, but it still has the column names from before. Let us use the `type()` function to check the data structure of this output:```{python}#| label: type_numerical_indexing_rowstype(metadata.iloc[0, :])```This is a `Series` data structure, which is a one-dimensional array with rownames. It is similar to a vector in R. The reason we get a Series instead of a DataFrame is because we are **selecting a single row** from the DataFrame. Since a single column in a DataFrame is really just a vector, Python will output a list-like object as the simplest data structure.If you were selecting specific columns from the DataFrame - the rows are left blank:```{python}#| label: numerical_indexing_columns# Extract the first columnmetadata.iloc[:, 0]```Same as before, we get a `Series` data structure because we are selecting a single column from the DataFrame.Oftentimes we would like to keep our single column as a DataFrame. 
To do this, we use the `to_frame()` method, which converts a Series to a DataFrame:```{python}#| label: tbl-to_frame#| tbl-cap: Converting a Series to a DataFrame using the `to_frame()` method.# Extract the first column and convert it to a DataFramemetadata.iloc[:, 0].to_frame()```### Slicing DataFramesJust like with vectors, you can select multiple rows and columns at a time. Within the square brackets, you need to provide a vector of the desired values.We can extract consecutive rows or columns using the colon (`:`) to create the vector of indices to extract.```{python}#| label: tbl-slicing_consecutive_rows#| tbl-cap: Extracting the first three rows from the metadata DataFrame using slices.# Extract the first three rowsmetadata.iloc[0:3, :]```Alternatively, we can use the list of indices `[]` to extract any number of rows or columns. Let’s extract the first, third, and sixth rows.```{python}#| label: tbl-slicing_non_consecutive_rows#| tbl-cap: Extracting non-consecutive rows from the metadata DataFrame with a list of indices.# Extract the first, third, and sixth rowsmetadata.iloc[[0, 2, 5], :]```### Subsetting DataFrames with LabelsFor larger datasets, it can be tricky to remember the column number that corresponds to a particular variable. (Is `celltype` in column 1 or 2? oh, right… they are in column 1). In some cases, the column/row number for values can change if the script you are using adds or removes columns/rows. It’s, therefore, often better to use column/row names to refer to extract particular values, and it makes your code easier to read and your intentions clearer.So first let us take a look at what the rownams (`index`) and column names (`columns`) are for our DataFrame:```{python}#| label: row_col_names# Get the row namesmetadata.index# Get the column namesmetadata.columns```Now that we know the row and column names, we can use them to subset our data. 
To access specific elements of a DataFrame using label-based indexing, we can use the `loc` method, which stands for "location". This method allows us to access rows and columns by their labels. For example, to extract the first three samples, we can use the following code:

```{python}
#| label: label_based_indexing_rows
#| error: true

# Extract the first three samples for the celltype column
metadata.loc[["sample1", "sample2", "sample3"], "celltype"]
```

We can mix and match label-based and numerical indexing. For example, we can use label-based indexing to select the column we want, then use numerical indexing to select the first three samples from that column:

```{python}
#| label: label_based_indexing

# Extract the first three samples for the celltype column
metadata.iloc[0:3]["celltype"]
```

It is important to type the names of the columns/rows exactly as they appear in the DataFrame; for instance, if I had spelled `celltype` with a capital `C`, it would not have worked.

```{python}
#| label: incorrect_column_name
#| error: true

# Extract the first three samples for the Celltype column
metadata.iloc[0:3]["Celltype"] # Celltype column incorrect
```

We can also access a column directly, without the `loc` method, by indexing the DataFrame with the column name. For example, to access the `celltype` column, we can use the following code:

```{python}
#| label: direct_column_access

# Access the celltype column directly
metadata["celltype"]
```

### Subsetting DataFrames with Logical Expressions

With DataFrames, we can use logical expressions to extract the rows or columns that contain specific values. First, we determine the positions in a row or column where a logical expression is `True`; then, we extract those rows or columns from the DataFrame.

For example, if we want to return only those rows of the DataFrame where the celltype column has a value of `typeA`, we would perform two steps:

1. Identify which rows in the celltype column have a value of `typeA`.
2. Use those `True` values to extract those rows from the DataFrame.

```{python}
#| label: conditional_subsetting_1

# Create a boolean mask for rows where celltype is "typeA"
metadata["celltype"] == "typeA"
```

This will output a `True` or `False` value for each element of the vector. The first six values are `True` and the last six are `False`, meaning the first six rows of our metadata have a value of `typeA` while the last six do not. We can save these values to a variable, which we can name whatever we would like; let's call it `logical_idx`.

```{python}
#| label: tbl-conditional_subsetting_2
#| tbl-cap: Subsetting metadata by applying a boolean mask for rows where celltype is "typeA".

# Create a boolean mask for rows where celltype is "typeA"
logical_idx = metadata["celltype"] == "typeA"

# Subset the DataFrame to return only rows where celltype is "typeA"
metadata[logical_idx]
```

We have used those `True` and `False` values to extract the rows that correspond to the `True` values in the metadata DataFrame. The result is a DataFrame that only contains rows where the celltype is `typeA`.

:::{.callout-tip}
# [**Exercise 2**](09_pandas_dataframes-Answer_key.qmd#exercise-2)

1. A question to evaluate Learning Objective 2
2. A followup question to question #1
3. ...
:::

## Adding New Columns

Now that we know how to access specific values in a DataFrame, we can also add new columns to our data. Oftentimes you will need to create new variables based on the information in your DataFrame or add new information to it.

### Adding a New Column with the Same Value

Perhaps we want to add a new column to our metadata specifying that the species for each of our samples is mus musculus.
When the value is the same across all the rows, we can simply create a new column and assign the desired value to it:

```{python}
#| label: tbl-add_column
#| tbl-cap: Adding a new column (`species`) to the metadata DataFrame with the same value for all rows.

# Add a new column for species
metadata["species"] = "mus musculus"
metadata
```

### Adding a New Column with Conditional Values

We can also add new columns whose values are conditional on other information within the DataFrame. For example, perhaps all the mice in `replicate 1` were `female` while the rest were `male`. We can create a new column for sex and make the corresponding assignments using the `loc` method.

We are going to do this in three steps:

1. Create the column `sex` and initialize it with a default value of `None`.

::: callout-note
# `None`

`None` is a special value in Python that represents the absence of a value or a null value. It is often used to indicate that a variable has no value or that a function does not return anything. In this case, we are initializing the `sex` column with `None` to indicate that we have not yet assigned any values to it.
:::

```{python}
#| label: tbl-add_column_conditional_1
#| tbl-cap: Initializing a new column (`sex`) in the metadata DataFrame with a default value of `None`.

# Add a new column for sex
metadata["sex"] = None
metadata
```

2. Assign the value `female` to rows where the replicate column has a value of 1. We do this with the `loc` method, which allows us to access the rows that meet our condition; we then specify the column (`sex`) in which we want to assign the value `female`.

```{python}
#| label: tbl-add_column_conditional_2
#| tbl-cap: Assigning the value "female" to rows where the replicate column has a value of 1.

# Assign "female" to rows where replicate is 1
metadata.loc[metadata["replicate"] == 1, "sex"] = "female"
metadata
```

3. Assign the value `male` to rows where the replicate column has a value of 2 or 3.

```{python}
#| label: tbl-add_column_conditional_3
#| tbl-cap: Assigning the value "male" to rows where the replicate column has a value not equal to 1.

# Assign "male" to rows where replicate is not 1
metadata.loc[metadata["replicate"] != 1, "sex"] = "male"
metadata
```

### Calculating New Columns

We can also create new columns by performing calculations on existing columns. For example, we can create a new column called `replicate_squared` that contains the square of the values in the `replicate` column.

```{python}
#| label: tbl-add_column_calculation
#| tbl-cap: Creating a new column (`replicate_squared`) that contains the square of the values in the `replicate` column.

# Create a new column for the square of the replicate number
metadata["replicate_squared"] = metadata["replicate"] ** 2
metadata
```

We can even add columns together to create a new column. For example, we can create a new column called `replicate_replicate_sum` that contains the sum of the `replicate` column with itself.

```{python}
#| label: tbl-add_column_calculation_sum
#| tbl-cap: Creating a new column (`replicate_replicate_sum`) that contains the sum of the `replicate` column with itself.

# Create a new column for the sum of replicate with itself
metadata["replicate_replicate_sum"] = metadata["replicate"] + metadata["replicate"]
metadata
```

We can apply a wide range of mathematical operations to create new columns based on data that already exists in the DataFrame.

## Useful DataFrame Operations

The `Pandas` library is filled with useful functions for manipulating and wrangling DataFrames.
Here, we will cover a few more commonly used functions; a more comprehensive cheatsheet of `Pandas` functions can be found [on the official website](https://pandas.pydata.org/Pandas_Cheat_Sheet.pdf).

### `value_counts()`

One way to quickly summarize the contents of a DataFrame is `value_counts()`, which counts the number of times each value appears in a column. For example, we can use this function to count the number of samples that belong to each genotype category:

```{python}
#| label: metadata_value_counts

metadata["genotype"].value_counts()
```

Now we know how many samples are classified as WT and how many are classified as KO in our dataset.

### Subsection 3B

:::{.callout-tip}
# [**Exercise 3**](09_pandas_dataframes-Answer_key.qmd#exercise-3)

1. A question to evaluate Learning Objective 3
2. A followup question to question #1
3. ...
:::

***

[Next Lesson >>](10_data_wrangling.qmd)

[Back to Schedule](../schedule/schedule.qmd)