Machine Learning - Random Forests

category_1
category_2
category_3
category_4

Write a description of the lesson here.

Authors

Noor Sohail

Will Gammerdinger

Published

March 14, 2026

Keywords

keyword_1, keyword_2, keyword_3, keyword_4, keyword_5, keyword_6

Approximate time: XX minutes

Learning Objectives

In this lesson, we will:

  • Learning Objective 1
  • Learning Objective 2
  • Learning Objective 3

Overview of lesson

When doing XYZ…

Cortical layer dataset

As an example of a real-world application of machine learning, we will be using a dataset that comes from spatial locations associated with different cortical layers in the human brain. These layers are broken into 6 cortical layers (L1, L2, L3, L4, L5, L6) and a white matter layer. Each of these layers has a unique spatial location.

Figure 1: Spatial locations of the cortical layers in the human brain.
Image source: Rai et al. (2026)

Based upon this dataset, we have generated a synthetic dataset that contains the x and y coordinates of cells in the cortex with cortical layer labels. Additionally, we have included the log-normalized expression values of known marker genes for each cortical layer.

Figure 2: Example of the spatial expression of known marker genes for each cortical layer.
Image source: Rai et al. (2026)

We will be using this synthetic dataset to train a random forest classifier to predict the cortical layer labels based on the spatial location and gene expression of each cell.

The dataset contains spatial coordinates of cells in the cortex, as well as the cortical layer that each cell belongs to.

# Load libraries
import pandas as pd
import seaborn as sns
import matplotlib.pyplot as plt

from sklearn.ensemble import RandomForestClassifier
from sklearn.model_selection import train_test_split
from sklearn.metrics import accuracy_score, confusion_matrix

df_cortical = pd.read_csv("data/synthetic_cortex_data.csv")
df_cortical.head()
Table 1: Cortical cells data, where the x and y coordinates represent the spatial locations of each cell as well as the cortical layer they belong to.
Unnamed: 0 cell_barcode x y depth cortical_layer AQP4 HPCAL1 FREM3 TRABD2A KRT17 MOBP
0 0 cell_000000 0.000000 0.0 0.000000 L1 3.574897 1.110201 0.241240 0.00000 0.014254 0.304772
1 1 cell_000001 0.006289 0.0 0.012579 L1 4.463502 2.276650 0.000000 0.00000 0.060432 0.000000
2 2 cell_000002 0.012579 0.0 0.025157 L2 6.194550 3.277692 0.000000 0.17412 0.046238 0.000000
3 3 cell_000003 0.018868 0.0 0.037736 L1 3.576662 4.434999 0.340184 0.00000 0.070848 0.000000
4 4 cell_000004 0.025157 0.0 0.050314 L1 2.029683 0.000000 0.127988 0.00000 0.000000 0.492927

We have the following columns in this dataset:

  • cell_barcode: A unique identifier for each cell
  • x: The x coordinate of the cell’s spatial location
  • y: The y coordinate of the cell’s spatial location
  • depth: The depth of the cell within the cortex
  • cortical_layer: The cortical layer that the cell belongs to (one of the cortical layers or white matter)

As this is a spatial dataset, we can visualize where on the tissue each cell is located by plotting the x and y coordinates of each cell and coloring the points by the cortical layer that they belong to:

# Plot the spatial locations of the cells colored by the cortical layer they belong to
sns.scatterplot(data=df_cortical, 
                x="x", y="y", 
                hue="cortical_layer", 
                edgecolor=None,
                palette="tab10")

# Add title and axis labels
plt.title("Brain Cells by Cortical Layer")
plt.xlabel("x coordinate")
plt.ylabel("y coordinate")
plt.legend(title="Cortical Layer")
plt.show()
Figure 3: Spatial plot of the cortical cells colored by the cortical layer they belong to.

In the dataframe, you may have noticed that we also have the columns AQP4, HPCAL1, FREM3, TRABD2A, KRT17, and MOBP. These are the log-normalized expression values for those genes in each cell. These genes are known to be highly expressed in specific cortical layers, so they can be used as markers to identify which layer a cell belongs to based on its gene expression profile. Once again, we can visualize the expression of these marker genes across the cortex to see how they are distributed across the different layers:

# List of marker genes to plot
genes = ["AQP4", "HPCAL1", "FREM3", 
         "TRABD2A", "KRT17", "MOBP"]

# Initialize a plot with rows and columns for each gene
fig, axes = plt.subplots(2, 3, figsize=(15, 8))

# Make axes a flat list so we can index easily
axes = axes.flatten()

for i, gene in enumerate(genes):
    ax = axes[i]
    sns.scatterplot(
        data=df_cortical,
        x="x", y="y",
        hue=gene,
        palette="viridis",
        edgecolor=None,
        ax=ax
    )
    ax.set_title(f"Expression of {gene} across the cortex")

plt.tight_layout()
plt.show()
Figure 4: Spatial plot of the gene expression of known marker genes for each cortical layer.

Random forest classifiers

Random forests allow you to predict a categorical variable (here, the cortical layer) based on one or more predictor variables (here, the spatial coordinates and gene expression values). To do so, the algorithm builds multiple decision trees, which are models that make predictions based on a series of binary (True or False) decisions.

Figure 5: Example of a decision tree where the variables are age, weight, and smoker to predict risk level of a heart attack.
Image source: DataCamp

These decision trees consist of decision nodes, where the data is split based on a predictor variable, and leaf nodes, which hold the final predictions made by the tree.

Random forests build multiple decision trees and combine their predictions to improve accuracy and reduce overfitting. These trees are built on random subsets of the data. Then, a majority vote is taken across the final decision of all the trees to make the final prediction.

Figure 6: Example of a random forest with 3 decision trees to generate a prediction based upon majority voting.
Image source: GeeksforGeeks
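The bootstrap-and-vote idea can be sketched by hand with a few individual decision trees. This is a minimal illustration on toy data (using make_classification as a stand-in), not how RandomForestClassifier is implemented internally:

```python
import numpy as np
from collections import Counter
from sklearn.datasets import make_classification
from sklearn.tree import DecisionTreeClassifier

# Toy two-class dataset standing in for real data
X_demo, y_demo = make_classification(n_samples=200, n_features=4,
                                     n_informative=3, n_redundant=0,
                                     random_state=42)

rng = np.random.default_rng(42)
trees = []
for i in range(3):
    # Each tree sees a different bootstrap sample (drawn with replacement)
    idx = rng.integers(0, len(X_demo), size=len(X_demo))
    trees.append(DecisionTreeClassifier(random_state=i).fit(X_demo[idx], y_demo[idx]))

# Collect each tree's prediction for the first 5 points (shape: 3 trees x 5 points),
# then take a majority vote across trees for each point
votes = np.stack([tree.predict(X_demo[:5]) for tree in trees])
majority = [Counter(votes[:, j]).most_common(1)[0][0] for j in range(votes.shape[1])]
print(majority)
```

A real random forest also randomizes the features considered at each split, but the core idea is the same: many slightly different trees, one combined vote.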

Preparing training dataset

The “learning” in machine learning comes from the fact that these algorithms must first learn patterns from data. This is accomplished by taking a subset of labelled data, the training set, to train the model. From this, the random forest classifier will learn how to predict the cortical layer of a cell based on its x and y coordinates and gene expression.

First, we are going to define the label we want to predict (cortical_layer) and the predictor variables we want to use to make that prediction (the x and y coordinates and the gene expression values). These are conventionally referred to as y and X, respectively.

# Feature and target columns
feature_cols = ["x", "y", "AQP4", "HPCAL1", "FREM3",
                "TRABD2A", "KRT17", "MOBP"]
target_col = "cortical_layer"

# Set X and y for future use in training and prediction
X = df_cortical[feature_cols]
y = df_cortical[target_col]
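Before training, it is worth sanity-checking the shapes of X and y and how many cells each layer contains. On the real data you would run these checks on X and y as defined above; the sketch below uses a tiny made-up stand-in table (df_demo, with only a couple of the gene columns) so it is self-contained:

```python
import pandas as pd

# Tiny made-up stand-in for df_cortical, just to illustrate the checks
df_demo = pd.DataFrame({
    "x": [0.0, 0.1, 0.2, 0.3],
    "y": [0.0, 0.0, 0.1, 0.1],
    "AQP4": [3.5, 0.1, 0.0, 0.2],
    "MOBP": [0.3, 0.0, 0.0, 2.1],
    "cortical_layer": ["L1", "L1", "L2", "WM"],
})

X_demo = df_demo[["x", "y", "AQP4", "MOBP"]]
y_demo = df_demo["cortical_layer"]

print(X_demo.shape)            # (number of cells, number of features)
print(y_demo.value_counts())   # number of cells per cortical layer
```

If the layer counts are very uneven, that is a hint that you will want stratified splitting and class weighting, both of which we use below.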

With this information, we can now prepare our training and test dataset with train_test_split() from the sklearn.model_selection module. This function will split our dataset into a training set and a test set.

We will train the model on the training set and then evaluate its performance (accuracy) on the test set. We supply the following parameters into the function:

  • test_size: the proportion of the dataset to use as the test set (here, 30% of the data for testing, leaving 70% for training).
  • stratify: ensures that the distribution of the target variable (cortical_layer) is the same in both the training and test sets. This reduces sampling bias and ensures that the model is trained on a representative sample of the data.
  • random_state: random seed for reproducibility.
X_train, X_test, y_train, y_test = train_test_split(
    X,
    y,
    test_size=0.3,
    random_state=42,
    stratify=y
)
Note

train_test_split() returns four objects in a fixed order: the training features, the test features, the training labels, and the test labels. Python lets us assign all four to separate variables in a single statement, a feature known as multiple assignment (or unpacking). The variable names on the left must be listed in the same order as the returned values.
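As a minimal illustration of assigning values to multiple variables at once:

```python
# Python "unpacks" a sequence into several variables in one statement,
# as long as the number of names matches the number of values
values = [10, 20, 30, 40]
a, b, c, d = values
print(a, d)  # 10 40
```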

Train random forest classifier

To create the model, first we are going to initialize an instance of the RandomForestClassifier class from the sklearn.ensemble module. We will specify the number of trees in the forest using the n_estimators parameter and set a random seed, random_state, for reproducibility. We also set class_weight="balanced" so that layers containing fewer cells are not under-weighted during training.

# Initialize the random forest classifier
rf = RandomForestClassifier(n_estimators=100,
                            random_state=42,
                            class_weight="balanced")

Next, we will train the model using the fit() method, which takes in the predictor variables (the spatial coordinates and gene expression values) and the target variable (the cortical layer) from the training data.

# Train the random forest classifier model
rf.fit(X_train, y_train)
RandomForestClassifier(class_weight='balanced', random_state=42)
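Once a forest is fitted, you can peek at which features it relied on through its feature_importances_ attribute. The sketch below uses a toy dataset so it is self-contained; on our model you would inspect rf.feature_importances_ alongside feature_cols:

```python
from sklearn.datasets import make_classification
from sklearn.ensemble import RandomForestClassifier

# Toy data: only 2 of the 4 features are actually informative
X_demo, y_demo = make_classification(n_samples=200, n_features=4,
                                     n_informative=2, n_redundant=0,
                                     random_state=42)
rf_demo = RandomForestClassifier(n_estimators=50, random_state=42)
rf_demo.fit(X_demo, y_demo)

# One score per feature; the scores sum to 1, and higher scores mean
# the feature was used more often (and more effectively) in splits
print(rf_demo.feature_importances_)
```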

Predict cortical layer labels

With this model, rf, we can now predict the cortical layer labels of the test dataset using the predict() method, this time supplying the features from the test data (X_test) instead of the training data. The model will use the patterns it learned from the training data to predict which cortical layer each cell in the test set belongs to.

# Predict cortical layers for test dataset
y_pred = rf.predict(X_test)

So now we have y_pred, but what is this output?

type(y_pred)
<class 'numpy.ndarray'>

It is a numpy array! So we can access the first few elements to see what the predicted labels look like:

# View the first few predicted labels
y_pred[0:5]
array(['WM', 'L6', 'L6', 'L3_4', 'L5'], dtype=object)
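A quick way to eyeball the results is to line up the predicted labels against the true ones. The sketch below uses hypothetical labels so it runs on its own; with our data you would build the same table from y_test.values and y_pred:

```python
import numpy as np
import pandas as pd

# Hypothetical true and predicted labels for five cells
y_true_demo = np.array(["L1", "L2", "L2", "WM", "L5"])
y_pred_demo = np.array(["L1", "L2", "L3_4", "WM", "L5"])

# Side-by-side table with a column flagging correct predictions
comparison = pd.DataFrame({"true": y_true_demo, "predicted": y_pred_demo})
comparison["correct"] = comparison["true"] == comparison["predicted"]
print(comparison)
```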

Assessing model performance

At this point, we have the predicted labels for the test dataset, but how do we know if these predictions are accurate? To evaluate the performance of our model, we can compare the predicted labels to the true labels of the test dataset.

Accuracy of model predictions

Now that we have the predicted and true labels of the test dataset, we can calculate the accuracy of our model’s predictions. Accuracy is calculated as the number of correct predictions divided by the total number of predictions.

# Calculate the accuracy of the model's predictions
acc = accuracy_score(y_test, y_pred)
accuracy_percentage = acc * 100
accuracy_percentage
85.54700568752091

Our accuracy is quite high! This tells us that our model is doing a good job at predicting the cortical layer labels based on the x, y coordinates and gene expression. However, accuracy alone does not always give us the full picture of how well our model is performing.
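One way to look beyond a single accuracy number is scikit-learn's classification_report(), which breaks performance down per class. The example below uses hypothetical labels so it runs on its own; on our data you would call it with y_test and y_pred:

```python
from sklearn.metrics import classification_report

# Hypothetical true and predicted labels
y_true_demo = ["L1", "L1", "L2", "L2", "WM", "WM"]
y_pred_demo = ["L1", "L2", "L2", "L2", "WM", "L1"]

# Precision: of the cells predicted as a class, how many truly belong to it?
# Recall: of the cells truly in a class, how many did the model find?
print(classification_report(y_true_demo, y_pred_demo))
```

A class can have high precision but low recall (or vice versa) even when overall accuracy looks good, which is exactly the kind of detail a single accuracy score hides.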

Confusion matrices are another way to evaluate the performance of a classifier. This table cross-tabulates the true labels against the predicted labels: each row corresponds to a true class and each column to a predicted class, so the diagonal entries count the correct predictions and the off-diagonal entries count the misclassifications.

class_names = sorted(y.unique())
cm = confusion_matrix(y_test, y_pred, labels=class_names)

plt.figure(figsize=(8, 6))
sns.heatmap(
    cm,
    annot=True,
    fmt="g",
    cmap="Blues",
    xticklabels=class_names,
    yticklabels=class_names
)
plt.title("Random Forest – Confusion Matrix (Test Set)")
plt.xlabel("Predicted label")
plt.ylabel("True label")
plt.tight_layout()
plt.show()

This can help you understand which classes the model is doing well on and which classes it is struggling with. If your accuracy is low, you can look at the confusion matrix to see which classes are being misclassified and potentially adjust your model or data accordingly.
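Per-class accuracy (recall) can be read directly off the confusion matrix by dividing each diagonal entry by its row sum. A small sketch with hypothetical labels:

```python
import numpy as np
from sklearn.metrics import confusion_matrix

# Hypothetical true and predicted labels
y_true_demo = ["L1", "L1", "L2", "L2", "WM", "WM"]
y_pred_demo = ["L1", "L2", "L2", "L2", "WM", "L1"]
labels = ["L1", "L2", "WM"]

cm_demo = confusion_matrix(y_true_demo, y_pred_demo, labels=labels)

# Diagonal = correct predictions per class; row sums = true cells per class
per_class_recall = cm_demo.diagonal() / cm_demo.sum(axis=1)
print(dict(zip(labels, per_class_recall)))
```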

Next steps

With this model, you could try to predict the cortical layer labels of other datasets. You could also try to use different predictor variables (e.g. only gene expression or only spatial coordinates) to see how that affects the accuracy of the model’s predictions.
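The feature-subset comparison suggested above can be sketched as follows. To keep the example self-contained, it uses a small made-up dataset (df_demo) in which the layer label is determined entirely by the y position while GENE_A is pure noise; with the real data you would reuse df_cortical and slices of feature_cols instead:

```python
import numpy as np
import pandas as pd
from sklearn.ensemble import RandomForestClassifier
from sklearn.metrics import accuracy_score
from sklearn.model_selection import train_test_split

# Made-up dataset: layer is a function of y only, GENE_A carries no signal
rng = np.random.default_rng(0)
n = 300
df_demo = pd.DataFrame({
    "x": rng.uniform(0, 1, n),
    "y": rng.uniform(0, 1, n),
    "GENE_A": rng.normal(0, 1, n),
})
df_demo["cortical_layer"] = pd.cut(df_demo["y"], bins=3,
                                   labels=["L1", "L2", "WM"]).astype(str)

def fit_and_score(feature_cols):
    """Train a random forest on the given features and return test accuracy."""
    X = df_demo[feature_cols]
    y = df_demo["cortical_layer"]
    X_tr, X_te, y_tr, y_te = train_test_split(X, y, test_size=0.3,
                                              random_state=42, stratify=y)
    rf = RandomForestClassifier(n_estimators=50, random_state=42)
    rf.fit(X_tr, y_tr)
    return accuracy_score(y_te, rf.predict(X_te))

print("spatial only:", fit_and_score(["x", "y"]))
print("gene only:   ", fit_and_score(["GENE_A"]))
```

In this contrived setup the spatial-only model should score far higher than the gene-only one; on real data, comparing such subsets tells you how much each kind of feature actually contributes.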


Back to Schedule

Reuse

CC-BY-4.0