`pandas` DataFrames

Python programming

Pandas

DataFrames

Tabular data

This lesson teaches how to use pandas DataFrames in Python to load tabular data, inspect and subset rows and columns, create new variables and summarize data.

Authors

Noor Sohail

Will Gammerdinger

Published

March 16, 2026

Keywords

Pandas tutorial, CSV, Excel

Approximate time: 75 minutes

Learning objectives

In this lesson, we will:

Load and inspect data in a pandas DataFrame
Subset and index DataFrames in different ways
Add new columns to a DataFrame based on existing data
Perform useful operations on DataFrames to summarize data

Overview of lesson

Real-life data are oftentimes represented in the forms of tables of data, matrices of gene expression or Excel spreadsheets of patient data. pandas DataFrames are Python’s way of processing and representing this tabular data so you can clean, filter and summarize it efficiently. DataFrames enable you to do tasks, such as quickly finding all samples from a particular condition, computing summary statistics or preparing a clean table for plotting. In this lesson, you will practice loading data into DataFrames and performing common operations that mirror how you would explore a dataset in a spreadsheet, but with far more control and reproducibility.

`pandas` library

The pandas library is a powerful tool for data wrangling and analysis in Python. It provides useful data structures and functions needed to handle tables of data, like Excel or Google Sheets spreadsheets. The primary data structures of note in pandas are Series and DataFrames.

pandas and NumPy

Pandas is built on top of the NumPy library, which in practice means that most of the methods defined for NumPy arrays also apply to Pandas data structures.

This library is widely used in data science and machine learning fields for tasks such as data cleaning, transformation and analysis. It provides a comprehensive set of functions for handling and manipulating data and can handle large datasets efficiently.

Loading a dataset

Within the pandas library, there are built-in functions to load datasets from various file formats, such as CSV (comma separated values) or Excel files (.xlsx). The most commonly used function for loading data is pandas.read_csv(), which allows you to read a CSV file and create a DataFrame with the data.

When working with a large dataset, you will very likely work with a “metadata” file that contains the information about each sample in your dataset. This is important information and we encourage you to always generate a document with as much metadata as you can record before you bring the data into Python. Metadata can include information about the experimental design, sample characteristics, data collection methods and any other relevant information that can help you understand and analyze your data. Here is some additional reading on metadata from the HMS Data Management Working Group.

We have a file in which we identify information about the data, called metadata. Our metadata is also stored in a CSV file. In this file, each row corresponds to a sample and each column contains some information about each sample.

# Import pandas using the alias pd
import pandas as pd

# Read in the metadata
metadata = pd.read_csv("data/mouse_exp_design.csv")

# Print out the metadata
metadata

Table 1: DataFrame of our experimental design, including sample names, genotypes, cell types and replicate numbers for each sample.

	genotype	celltype	replicate
sample1	Wt	typeA	1
sample2	Wt	typeA	2
sample3	Wt	typeA	3
sample4	KO	typeA	1
sample5	KO	typeA	2
sample6	KO	typeA	3
sample7	Wt	typeB	1
sample8	Wt	typeB	2
sample9	Wt	typeB	3
sample10	KO	typeB	1
sample11	KO	typeB	2
sample12	KO	typeB	3

The first column contains the row names and the remaining columns contain information about our samples that we can use to categorize them. For example, the second column contains genotype information for each sample. Each sample is classified in one of two categories: Wt (wild type) or KO (knockout).

What types of categories do you observe in the remaining columns?

This metadata describes the samples in our study. Each row holds information for a single sample and the columns contain categorical information about the sample genotype (WT or KO), celltype (typeA or typeB) and replicate number (1, 2 or 3).

Inspecting the DataFrame

There are a wide selection of base tools in Python that are useful for inspecting your data and summarizing it. Let’s use the metadata file that we created to test out data inspection tools

For example, we can use the shape attribute to check the dimensions of our DataFrame, which will tell us how many rows and columns it contains:

# Retrieve the dimensions of metadata
metadata.shape

(12, 3)

.shape

The .shape attribute returns the number of rows and columns in the DataFrame. The first element is the number of rows, while the second element is the number of columns.

We do not use parentheses after .shape because it is an attribute of the DataFrame, not a method/function. In contrast, methods require parentheses to be called, even when they do not take any arguments (e.g., metadata.head()).

If we had a larger file, we may not want to display all the contents in the console. Instead we could check the top (by default, the first 5 lines) of this data.frame using the method head():

# Inspect the first 5 rows of metadata
metadata.head()

Table 2: First 5 rows of the metadata DataFrame using the head() function

	genotype	celltype	replicate
sample1	Wt	typeA	1
sample2	Wt	typeA	2
sample3	Wt	typeA	3
sample4	KO	typeA	1
sample5	KO	typeA	2

Getting help for functions from libraries

When learning about basic functions, we saw that we can use the help() function to find the documentation for another function, including its arguments and usage. The same help() function can be used to learn about functions from libraries, but we will need to provide information about the package and the data structure like so: help(pd.DataFrame.head).

Exercise 1

Use the tail() method to inspect our metadata DataFrame.

Indexing and subsetting DataFrames

When we need to access specific elements of a DataFrame, we commonly use indexing and subsetting techniques. DataFrames can be indexed using both numerical indices and labels (row names and column names).

Both are useful for different purposes. Numerical indexing is often more concise and can be faster for certain operations, while label-based indexing can be more intuitive and easier to read, especially when working with large datasets with meaningful row and column names.

Subsetting DataFrames with indices

We can use the iloc method (which stands for “integer location”) to access specific elements of a DataFrame by using numerical indexing. This method allows us to access rows and columns by their indices.

If we wanted to extract the wild type (Wt) value that is present in the first row and the first column:

To extract it we first use the name of the dataframe that we want to extract from, followed by the iloc method with square brackets (metadata.iloc[ ]).
Inside the square brackets we add the coordinates or indices for the rows in which the value(s) are present, followed by a comma, and then the coordinates or indices for the columns in which the value(s) are present (metadata.iloc[rows, columns]).

We know the wild type value is in the first row if we count from the top, so we put a zero followed by a comma. The wild type value is also in the first column (counting from left to right as usual), so we put a zero in the columns space too.

# Extract the value in the first row and first column
metadata.iloc[0, 0]

'Wt'

Now we will extract the value 1 from the first row and third column.

# Extract the value in the first row and third column
metadata.iloc[0, 2]

np.int64(1)

If you only wanted to select values based on rows, you would provide the index for the rows and just put a colon (:) for the columns. The : with no numbers indicates to Python that we should slice from the beginning until the end, or in other words all columns. We can also use a : that is blank on either side for rows as well as columns. The key here is to include the comma, to let Python know that you are still accessing a 2-dimensional data structure:

# Extract the first row
metadata.iloc[0, :]

genotype        Wt
celltype     typeA
replicate        1
Name: sample1, dtype: object

What kind of data structure does the output appear to be? It looks slightly different from the original DataFrame, but it still has the column names from before. Let us use the type() function to check the data structure of this output:

type(metadata.iloc[0, :])

pandas.core.series.Series

This is a Series data structure, which is a one-dimensional array with row names (in this case, the original column names from our DataFrame). The reason we get a Series instead of a DataFrame is because we are selecting a single row from the DataFrame. Python will output a list-like object as the simplest data structure.

If you were selecting specific columns from the DataFrame, the rows are left blank:

# Extract the first column
metadata.iloc[:, 0]

sample1     Wt
sample2     Wt
sample3     Wt
sample4     KO
sample5     KO
sample6     KO
sample7     Wt
sample8     Wt
sample9     Wt
sample10    KO
sample11    KO
sample12    KO
Name: genotype, dtype: object

Same as before, we get a Series data structure because we are selecting a single column from the DataFrame.

Oftentimes we would like to keep our single column as a DataFrame. We use the to_frame() method to convert a Series to a DataFrame:

# Extract the first column and convert it to a DataFrame
metadata.iloc[:, 0].to_frame()

Table 3: Converting a Series to a DataFrame using the to_frame() method.

	genotype
sample1	Wt
sample2	Wt
sample3	Wt
sample4	KO
sample5	KO
sample6	KO
sample7	Wt
sample8	Wt
sample9	Wt
sample10	KO
sample11	KO
sample12	KO

Slicing DataFrames

Like with vectors, you can select multiple rows and columns at a time. Within the square brackets, you need to provide a vector of the desired values.

We can extract consecutive rows or columns using the colon (:) to create the vector of indices to extract.

# Extract the first three rows and every column
metadata.iloc[0:3, :]

Table 4: Extracting the first three rows from the metadata DataFrame using slices.

	genotype	celltype	replicate
sample1	Wt	typeA	1
sample2	Wt	typeA	2
sample3	Wt	typeA	3

Alternatively, we could use the list of indices [] to extract any number of rows or columns. Let’s extract the first, third and sixth rows.

# Extract the first, third and sixth rows
metadata.iloc[[0, 2, 5], :]

Table 5: Extracting non-consecutive rows from the metadata DataFrame with a list of indices.

	genotype	celltype	replicate
sample1	Wt	typeA	1
sample3	Wt	typeA	3
sample6	KO	typeA	3

Subsetting DataFrames with labels

When we work with larger datasets, it can be tricky to remember the column number that corresponds to a particular variable. Is celltype in column 1 or 2?.

The column/row number for values can also change if you use a script that adds or removes columns/rows. Therefore, it’s often better to use column/row names to refer to extract particular values; this choice makes your code easier to read and your intentions clearer.

First, we will look at the attributes to retrieve our row names (index) and column names (columns) from our DataFrame:

# Get the row names
metadata.index

Index(['sample1', 'sample2', 'sample3', 'sample4', 'sample5', 'sample6',
       'sample7', 'sample8', 'sample9', 'sample10', 'sample11', 'sample12'],
      dtype='object')

# Get the column names
metadata.columns

Index(['genotype', 'celltype', 'replicate'], dtype='object')

Now that we know the row and column names, we can use them to subset our data. We can use the loc method (which stands for “location”) to access specific elements of a DataFrame using label-based indexing. This method allows us to access rows and columns by their labels. For example, we can extract the celltype for the first three samples by using the following code:

# Extract the first three samples for the celltype column
metadata.loc[["sample1", "sample2", "sample3"], "celltype"]

sample1    typeA
sample2    typeA
sample3    typeA
Name: celltype, dtype: object

It is important to type the names of the columns/rows in the exact way that they are typed in the DataFrame; for instance, if I had spelled celltype with a capital C, the line of code would not have worked.

# Extract the first three samples for the Celltype column 
metadata.loc["Celltype"] # Celltype column incorrect

---------------------------------------------------------------------------
KeyError                                  Traceback (most recent call last)
File /opt/anaconda3/lib/python3.13/site-packages/pandas/core/indexes/base.py:3812, in Index.get_loc(self, key)
   3811 try:
-> 3812     return self._engine.get_loc(casted_key)
   3813 except KeyError as err:

File pandas/_libs/index.pyx:167, in pandas._libs.index.IndexEngine.get_loc()

File pandas/_libs/index.pyx:196, in pandas._libs.index.IndexEngine.get_loc()

File pandas/_libs/hashtable_class_helper.pxi:7088, in pandas._libs.hashtable.PyObjectHashTable.get_item()

File pandas/_libs/hashtable_class_helper.pxi:7096, in pandas._libs.hashtable.PyObjectHashTable.get_item()

KeyError: 'Celltype'

The above exception was the direct cause of the following exception:

KeyError                                  Traceback (most recent call last)
Cell In[15], line 2
      1 # Extract the first three samples for the Celltype column 
----> 2 metadata.loc["Celltype"] # Celltype column incorrect

File /opt/anaconda3/lib/python3.13/site-packages/pandas/core/indexing.py:1192, in _LocationIndexer.__getitem__(self, key)
   1190 maybe_callable = com.apply_if_callable(key, self.obj)
   1191 maybe_callable = self._check_deprecated_callable_usage(key, maybe_callable)
-> 1192 return self._getitem_axis(maybe_callable, axis=axis)

File /opt/anaconda3/lib/python3.13/site-packages/pandas/core/indexing.py:1432, in _LocIndexer._getitem_axis(self, key, axis)
   1430 # fall thru to straight lookup
   1431 self._validate_key(key, axis)
-> 1432 return self._get_label(key, axis=axis)

File /opt/anaconda3/lib/python3.13/site-packages/pandas/core/indexing.py:1382, in _LocIndexer._get_label(self, label, axis)
   1380 def _get_label(self, label, axis: AxisInt):
   1381     # GH#5567 this will fail if the label is not present in the axis.
-> 1382     return self.obj.xs(label, axis=axis)

File /opt/anaconda3/lib/python3.13/site-packages/pandas/core/generic.py:4323, in NDFrame.xs(self, key, axis, level, drop_level)
   4321             new_index = index[loc]
   4322 else:
-> 4323     loc = index.get_loc(key)
   4325     if isinstance(loc, np.ndarray):
   4326         if loc.dtype == np.bool_:

File /opt/anaconda3/lib/python3.13/site-packages/pandas/core/indexes/base.py:3819, in Index.get_loc(self, key)
   3814     if isinstance(casted_key, slice) or (
   3815         isinstance(casted_key, abc.Iterable)
   3816         and any(isinstance(x, slice) for x in casted_key)
   3817     ):
   3818         raise InvalidIndexError(key)
-> 3819     raise KeyError(key) from err
   3820 except TypeError:
   3821     # If we have a listlike key, _check_indexing_error will raise
   3822     #  InvalidIndexError. Otherwise we fall through and re-raise
   3823     #  the TypeError.
   3824     self._check_indexing_error(key)

KeyError: 'Celltype'

If we wanted to access only a few values within the column, we can index the column with the column name and then use numerical indexing to select the desired values.

# Extract the first three samples for the celltype column
metadata.loc[:, "celltype"][0:3]

sample1    typeA
sample2    typeA
sample3    typeA
Name: celltype, dtype: object

We can also directly access a column without the loc method by using the column name as an attribute of the DataFrame. For example, to access the celltype column, we can use the following code:

# Access the celltype column directly
metadata["celltype"]

sample1     typeA
sample2     typeA
sample3     typeA
sample4     typeA
sample5     typeA
sample6     typeA
sample7     typeB
sample8     typeB
sample9     typeB
sample10    typeB
sample11    typeB
sample12    typeB
Name: celltype, dtype: object

However, this method only works for accessing columns - not rows. If we wanted to access a row using this method, we would get an error:

# Access the sample1 row directly
metadata["sample1"] # Error because column "sample1" does not exist

---------------------------------------------------------------------------
KeyError                                  Traceback (most recent call last)
File /opt/anaconda3/lib/python3.13/site-packages/pandas/core/indexes/base.py:3812, in Index.get_loc(self, key)
   3811 try:
-> 3812     return self._engine.get_loc(casted_key)
   3813 except KeyError as err:

File pandas/_libs/index.pyx:167, in pandas._libs.index.IndexEngine.get_loc()

File pandas/_libs/index.pyx:196, in pandas._libs.index.IndexEngine.get_loc()

File pandas/_libs/hashtable_class_helper.pxi:7088, in pandas._libs.hashtable.PyObjectHashTable.get_item()

File pandas/_libs/hashtable_class_helper.pxi:7096, in pandas._libs.hashtable.PyObjectHashTable.get_item()

KeyError: 'sample1'

The above exception was the direct cause of the following exception:

KeyError                                  Traceback (most recent call last)
Cell In[18], line 2
      1 # Access the sample1 row directly
----> 2 metadata["sample1"] # Error because column "sample1" does not exist

File /opt/anaconda3/lib/python3.13/site-packages/pandas/core/frame.py:4113, in DataFrame.__getitem__(self, key)
   4111 if self.columns.nlevels > 1:
   4112     return self._getitem_multilevel(key)
-> 4113 indexer = self.columns.get_loc(key)
   4114 if is_integer(indexer):
   4115     indexer = [indexer]

File /opt/anaconda3/lib/python3.13/site-packages/pandas/core/indexes/base.py:3819, in Index.get_loc(self, key)
   3814     if isinstance(casted_key, slice) or (
   3815         isinstance(casted_key, abc.Iterable)
   3816         and any(isinstance(x, slice) for x in casted_key)
   3817     ):
   3818         raise InvalidIndexError(key)
-> 3819     raise KeyError(key) from err
   3820 except TypeError:
   3821     # If we have a listlike key, _check_indexing_error will raise
   3822     #  InvalidIndexError. Otherwise we fall through and re-raise
   3823     #  the TypeError.
   3824     self._check_indexing_error(key)

KeyError: 'sample1'

So to access all the information for the first sample, we would need to use the loc method with the row name.

# Access the first sample using the loc method
metadata.loc["sample1", :]

genotype        Wt
celltype     typeA
replicate        1
Name: sample1, dtype: object

We can mix and match label-based and numerical indexing. For example, we can start by using label-based indexing to select the column we want and then use numerical indexing to select the first three samples from that column:

# Extract the first three samples for the celltype column 
metadata.iloc[0:3]["celltype"]

sample1    typeA
sample2    typeA
sample3    typeA
Name: celltype, dtype: object

Subsetting DataFrames with logical expressions

We can use logical expressions with DataFrames to extract the rows or columns in the DataFrame by using specific values. First, we need to determine the indices in the rows or columns where a logical expression is True, then we can extract those rows or columns from the DataFrame.

For example, if we want to only return the rows of the DataFrame with the celltype column with a value of typeA, we would perform the following two steps:

Identify which rows in the celltype column have a value of typeA.
Use those True values to extract those rows from the DataFrame.

# Create a boolean mask for rows where celltype is "typeA"
metadata["celltype"] == "typeA"

sample1      True
sample2      True
sample3      True
sample4      True
sample5      True
sample6      True
sample7     False
sample8     False
sample9     False
sample10    False
sample11    False
sample12    False
Name: celltype, dtype: bool

This will output True and False values for the values in the vector. The first six values are True and the last six are False. This means the first six rows of our metadata have a value of typeA while the last six do not. We can save these values to a variable, which we can name whatever we would like; let’s call it logical_idx.

# Create a boolean mask for rows where celltype is "typeA"
logical_idx = metadata["celltype"] == "typeA"

# Subset the DataFrame to return only rows where celltype is "typeA"
metadata[logical_idx]

Table 6: Subsetting metadata by applying a boolean mask for rows where celltype is “typeA”.

	genotype	celltype	replicate
sample1	Wt	typeA	1
sample2	Wt	typeA	2
sample3	Wt	typeA	3
sample4	KO	typeA	1
sample5	KO	typeA	2
sample6	KO	typeA	3

We can use those True and False values to extract the rows that correspond to the True values from the metadata DataFrame. The result is a DataFrame that only contains rows where the celltype is typeA.

Alternatively, you can also use a nested approach and put the logical index within the subsetting:

# Subset the DataFrame to return only rows where celltype is "typeA"
metadata[metadata["celltype"] == "typeA"]

Table 7: Using a nested approach for subsetting metadata by applying a boolean mask for rows where celltype is “typeA”.

	genotype	celltype	replicate
sample1	Wt	typeA	1
sample2	Wt	typeA	2
sample3	Wt	typeA	3
sample4	KO	typeA	1
sample5	KO	typeA	2
sample6	KO	typeA	3

Exercise 2

Retrieve the values of metadata where the value in the replicate is column is two or greater.
Retrieve the values of metadata where the value in the genotype is not equal to Wt.

Adding new columns

Now that we know how to access specific values in a DataFrame, we can also add new columns to our data. You will often need to create new variables based on the information in your DataFrame or to add new information to your DataFrame.

Adding a new column with the same value

We could want to add a new column to our metadata that specifies that the species for each of our samples is Mus musculus. When the value is the same across all the rows, we can simply create a new column and assign whichever value we want to that column:

# Add a new column for species
metadata["species"] = "Mus musculus"
metadata

Table 8: Adding a new column (species) to the metadata DataFrame with the same value for all rows.

	genotype	celltype	replicate	species
sample1	Wt	typeA	1	Mus musculus
sample2	Wt	typeA	2	Mus musculus
sample3	Wt	typeA	3	Mus musculus
sample4	KO	typeA	1	Mus musculus
sample5	KO	typeA	2	Mus musculus
sample6	KO	typeA	3	Mus musculus
sample7	Wt	typeB	1	Mus musculus
sample8	Wt	typeB	2	Mus musculus
sample9	Wt	typeB	3	Mus musculus
sample10	KO	typeB	1	Mus musculus
sample11	KO	typeB	2	Mus musculus
sample12	KO	typeB	3	Mus musculus

Adding a new column with conditional values

We can also conditionally add data to new columns depending on other data within the dataframe. For example, if all the mice in replicate 1 were female and those that were not in replicate 1 were male, we can create a new column with sex and make corresponding assignments using the loc method.

We are going to do this in three steps:

Create the column sex and initialize it with a default value of None.

None

None is a special value in Python that represents the absence of a value or a null value. It is often used to indicate that a variable has no value or that a function does not return anything. In this case, we are initializing the sex column with None to indicate that there are no values in the column.

# Add a new column for sex
metadata["sex"] = None

# Print the metadata DataFrame
metadata

Table 9: Initializing a new column (sex) in the metadata DataFrame with a default value of None.

	genotype	celltype	replicate	species	sex
sample1	Wt	typeA	1	Mus musculus	None
sample2	Wt	typeA	2	Mus musculus	None
sample3	Wt	typeA	3	Mus musculus	None
sample4	KO	typeA	1	Mus musculus	None
sample5	KO	typeA	2	Mus musculus	None
sample6	KO	typeA	3	Mus musculus	None
sample7	Wt	typeB	1	Mus musculus	None
sample8	Wt	typeB	2	Mus musculus	None
sample9	Wt	typeB	3	Mus musculus	None
sample10	KO	typeB	1	Mus musculus	None
sample11	KO	typeB	2	Mus musculus	None
sample12	KO	typeB	3	Mus musculus	None

We assign the value female to rows where the replicate column has a value of 1. We do this with the loc method, which allows us to only access rows that meet our condition and then specify the column (sex) where we want to assign the value female.

# Assign "female" to rows where replicate is 1
metadata.loc[metadata["replicate"] == 1, "sex"] = "female"

# Print the metadata DataFrame
metadata

Table 10: Assigning the value “female” to rows where the replicate column has a value of 1.

	genotype	celltype	replicate	species	sex
sample1	Wt	typeA	1	Mus musculus	female
sample2	Wt	typeA	2	Mus musculus	None
sample3	Wt	typeA	3	Mus musculus	None
sample4	KO	typeA	1	Mus musculus	female
sample5	KO	typeA	2	Mus musculus	None
sample6	KO	typeA	3	Mus musculus	None
sample7	Wt	typeB	1	Mus musculus	female
sample8	Wt	typeB	2	Mus musculus	None
sample9	Wt	typeB	3	Mus musculus	None
sample10	KO	typeB	1	Mus musculus	female
sample11	KO	typeB	2	Mus musculus	None
sample12	KO	typeB	3	Mus musculus	None

Assign the value male to rows where the replicate column has a value other than 1.

# Assign "male" to rows where replicate is not 1
metadata.loc[metadata["replicate"] != 1, "sex"] = "male"

# Print the metadata DataFrame
metadata

Table 11: Assigning the value “male” to rows where the replicate column has a value not equal to 1.

	genotype	celltype	replicate	species	sex
sample1	Wt	typeA	1	Mus musculus	female
sample2	Wt	typeA	2	Mus musculus	male
sample3	Wt	typeA	3	Mus musculus	male
sample4	KO	typeA	1	Mus musculus	female
sample5	KO	typeA	2	Mus musculus	male
sample6	KO	typeA	3	Mus musculus	male
sample7	Wt	typeB	1	Mus musculus	female
sample8	Wt	typeB	2	Mus musculus	male
sample9	Wt	typeB	3	Mus musculus	male
sample10	KO	typeB	1	Mus musculus	female
sample11	KO	typeB	2	Mus musculus	male
sample12	KO	typeB	3	Mus musculus	male

Calculating new columns

We can also create new columns by performing calculations on existing columns. For example, we can create a new column called replicate_squared that contains the square of the values in the replicate column.

# Create a new column for the square of the replicate number
metadata["replicate_squared"] = metadata["replicate"] ** 2

# Print the metadata DataFrame
metadata

Table 12: Creating a new column (replicate_squared) that contains the square of the values in the replicate column.

	genotype	celltype	replicate	species	sex	replicate_squared
sample1	Wt	typeA	1	Mus musculus	female	1
sample2	Wt	typeA	2	Mus musculus	male	4
sample3	Wt	typeA	3	Mus musculus	male	9
sample4	KO	typeA	1	Mus musculus	female	1
sample5	KO	typeA	2	Mus musculus	male	4
sample6	KO	typeA	3	Mus musculus	male	9
sample7	Wt	typeB	1	Mus musculus	female	1
sample8	Wt	typeB	2	Mus musculus	male	4
sample9	Wt	typeB	3	Mus musculus	male	9
sample10	KO	typeB	1	Mus musculus	female	1
sample11	KO	typeB	2	Mus musculus	male	4
sample12	KO	typeB	3	Mus musculus	male	9

We can even take the sum across multiple columns to create a new column. For example, we can create a new column called replicate_sum that contains the sum of the values in the replicate and replicate_squared columns for each row.

# Create a new column for the sum of replicate and replicate_squared
metadata["replicate_sum"] = metadata["replicate"] + metadata["replicate_squared"]

# Print the metadata DataFrame
metadata

Table 13: Creating a new column (replicate_sum) that contains the sum of the values in the replicate and replicate_squared columns.

	genotype	celltype	replicate	species	sex	replicate_squared	replicate_sum
sample1	Wt	typeA	1	Mus musculus	female	1	2
sample2	Wt	typeA	2	Mus musculus	male	4	6
sample3	Wt	typeA	3	Mus musculus	male	9	12
sample4	KO	typeA	1	Mus musculus	female	1	2
sample5	KO	typeA	2	Mus musculus	male	4	6
sample6	KO	typeA	3	Mus musculus	male	9	12
sample7	Wt	typeB	1	Mus musculus	female	1	2
sample8	Wt	typeB	2	Mus musculus	male	4	6
sample9	Wt	typeB	3	Mus musculus	male	9	12
sample10	KO	typeB	1	Mus musculus	female	1	2
sample11	KO	typeB	2	Mus musculus	male	4	6
sample12	KO	typeB	3	Mus musculus	male	9	12

We can apply any range of mathematical operations to create new columns based on data that already exists in the DataFrame.

Useful DataFrame operations

The pandas library is filled with useful functions to wrangle dataframes. Here, we will continue to cover some useful functions that are commonly used, but a more comprehensive cheatsheet of pandas functions can be found on the official website.

The `value_counts()` function

One way to quickly summarize the contents of a DataFrame is by using value_counts(), which counts the number of times a particular value appears in a column. We can use this function to count the number of samples that belong to each genotype category:

# Retrieve the distribution of values in the genotype column
metadata["genotype"].value_counts()

genotype
Wt    6
KO    6
Name: count, dtype: int64

Now we know how many samples are classified as WT and how many are classified as KO in our dataset.

Exercise 3

Use the value_counts() method to summarize the number of times you observe each replicate number in the replicate column of metadata.

Next Lesson >>

Back to Schedule

Reuse

CC-BY-4.0

Learning objectives

Overview of lesson

pandas library

Loading a dataset

Inspecting the DataFrame

Indexing and subsetting DataFrames

Subsetting DataFrames with indices

Slicing DataFrames

Subsetting DataFrames with labels

Subsetting DataFrames with logical expressions

Adding new columns

Adding a new column with the same value

Adding a new column with conditional values

Calculating new columns

Useful DataFrame operations

The value_counts() function

Reuse

`pandas` library

The `value_counts()` function