## Cortical layer dataset

As an example of a real-world application of machine learning, we will be using a dataset of spatial locations associated with different cortical layers in the human brain. The cortex is divided into six cortical layers (L1, L2, L3, L4, L5, L6) and a white matter layer, each with a distinct spatial location.
Figure 1: Spatial locations of the cortical layers in the human brain. Image source: Rai et al. (2026)
Based upon this dataset, we have generated a synthetic dataset that contains the x and y coordinates of cells in the cortex with cortical layer labels. Additionally, we have included the log-normalized expression values of known marker genes for each cortical layer.
Figure 2: Example of the spatial expression of known marker genes for each cortical layer. Image source: Rai et al. (2026)
We will be using this synthetic dataset to train a random forest classifier to predict the cortical layer labels based on the spatial location and gene expression of each cell.
The dataset contains spatial coordinates of cells in the cortex, as well as the cortical layer that each cell belongs to.
```python
# Load libraries
import pandas as pd
import seaborn as sns
import matplotlib.pyplot as plt
from sklearn.ensemble import RandomForestClassifier
from sklearn.model_selection import train_test_split
from sklearn.metrics import accuracy_score, confusion_matrix

# Load the synthetic cortex dataset and preview the first rows
df_cortical = pd.read_csv("data/synthetic_cortex_data.csv")
df_cortical.head()
```
Table 1: Cortical cells data, where the x and y coordinates represent the spatial locations of each cell as well as the cortical layer they belong to.
|   | Unnamed: 0 | cell_barcode | x | y | depth | cortical_layer | AQP4 | HPCAL1 | FREM3 | TRABD2A | KRT17 | MOBP |
|---|---|---|---|---|---|---|---|---|---|---|---|---|
| 0 | 0 | cell_000000 | 0.000000 | 0.0 | 0.000000 | L1 | 3.574897 | 1.110201 | 0.241240 | 0.00000 | 0.014254 | 0.304772 |
| 1 | 1 | cell_000001 | 0.006289 | 0.0 | 0.012579 | L1 | 4.463502 | 2.276650 | 0.000000 | 0.00000 | 0.060432 | 0.000000 |
| 2 | 2 | cell_000002 | 0.012579 | 0.0 | 0.025157 | L2 | 6.194550 | 3.277692 | 0.000000 | 0.17412 | 0.046238 | 0.000000 |
| 3 | 3 | cell_000003 | 0.018868 | 0.0 | 0.037736 | L1 | 3.576662 | 4.434999 | 0.340184 | 0.00000 | 0.070848 | 0.000000 |
| 4 | 4 | cell_000004 | 0.025157 | 0.0 | 0.050314 | L1 | 2.029683 | 0.000000 | 0.127988 | 0.00000 | 0.000000 | 0.492927 |
We have the following columns in this dataset:
- `cell_barcode`: A unique identifier for each cell
- `x`: The x coordinate of the cell's spatial location
- `y`: The y coordinate of the cell's spatial location
- `depth`: The depth of the cell within the cortex
- `cortical_layer`: The cortical layer that the cell belongs to (L1, L2, L3, L4, L5, L6, WM1, WM2)
As this is a spatial dataset, we can visualize where on the tissue each cell is located by plotting the x and y coordinates of each cell and coloring the points by the cortical layer that they belong to:
```python
# Plot the spatial locations of the cells colored by the cortical layer they belong to
sns.scatterplot(data=df_cortical, x="x", y="y", hue="cortical_layer",
                edgecolor=None, palette="tab10")

# Add title and axis labels
plt.title("Brain Cells by Cortical Layer")
plt.xlabel("x coordinate")
plt.ylabel("y coordinate")
plt.legend(title="Cortical Layer")
plt.show()
```
Figure 3: Spatial plot of the cortical cells colored by the cortical layer they belong to.
In the dataframe, you may have noticed that we also have the columns `AQP4`, `HPCAL1`, `FREM3`, `TRABD2A`, `KRT17`, and `MOBP`. These are the log-normalized expression values for those genes in each cell. These genes are known to be highly expressed in specific cortical layers, so they can be used as markers to identify which layer a cell belongs to based on its gene expression profile. Once again, we can visualize the expression of these marker genes across the cortex to see how they are distributed across the different layers:
```python
# List of marker genes to plot
genes = ["AQP4", "HPCAL1", "FREM3", "TRABD2A", "KRT17", "MOBP"]

# Initialize a plot with rows and columns for each gene
fig, axes = plt.subplots(2, 3, figsize=(15, 8))

# Make axes a flat list so we can index easily
axes = axes.flatten()

for i, gene in enumerate(genes):
    ax = axes[i]
    sns.scatterplot(
        data=df_cortical, x="x", y="y",
        hue=gene, palette="viridis",
        edgecolor=None, ax=ax
    )
    ax.set_title(f"Expression of {gene} across the cortex")

plt.tight_layout()
plt.show()
```
Figure 4: Spatial plot of the gene expression of known marker genes for each cortical layer.
## Random forest classifiers
Random forests allow you to predict a categorical variable (here, the cortical layer) based on one or more predictor variables (here, the spatial coordinates and gene expression values). To do so, the algorithm builds multiple decision trees, which are models that make predictions based on a series of binary decisions (`True` or `False`).
Figure 5: Example of a decision tree where the variables are age, weight, and smoker to predict risk level of a heart attack. Image source: DataCamp
These decision trees consist of decision nodes, which are the points where the data is split based on a predictor variable, and leaf nodes, which are the final predictions made by the tree.
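To make this concrete, here is a minimal sketch of a single decision tree on a tiny hypothetical dataset (the data and labels below are invented purely for illustration; they are not from the cortex dataset):

```python
from sklearn.tree import DecisionTreeClassifier

# Hypothetical toy data: two well-separated groups of points
X_toy = [[0.1, 0.2], [0.2, 0.1], [0.8, 0.9], [0.9, 0.8]]
y_toy = ["L1", "L1", "L2", "L2"]

# The tree learns binary splits (decision nodes) on the features;
# its leaf nodes carry the predicted labels
tree = DecisionTreeClassifier(random_state=42)
tree.fit(X_toy, y_toy)

print(tree.predict([[0.15, 0.15], [0.85, 0.85]]))
```

Because the two groups are cleanly separable, a single split is enough for the tree to classify new points on either side of the boundary.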
Random forests build multiple decision trees and combine their predictions to improve accuracy and reduce overfitting. Each tree is built on a random subset of the data. Then, a majority vote is taken across the final decisions of all the trees to make the final prediction.
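Conceptually, the vote works like this (a toy sketch with three hypothetical tree outputs; note that scikit-learn's `RandomForestClassifier` actually averages the trees' predicted class probabilities, which behaves similarly to a majority vote):

```python
from collections import Counter

# Hypothetical votes from three decision trees for one cell
tree_votes = ["L2", "L3", "L2"]

# The forest's prediction is the most common vote
prediction = Counter(tree_votes).most_common(1)[0][0]
print(prediction)
```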
Figure 6: Example of a random forest with 3 decision trees to generate a prediction based upon majority voting. Image source: GeeksforGeeks
### Preparing training dataset
The learning of machine learning comes from the fact that these algorithms must first learn patterns. This is accomplished by taking a subset of labelled data, the training set, to train the model. From this, the random forest classifier would learn how to predict the cortical layer of a cell based on its x, y coordinates and gene expression.
First, we are going to define the label we want to predict (`cortical_layer`) and the predictor variables we want to use to make that prediction (the `x`, `y` coordinates and gene expression). Oftentimes these will be referred to as `X` and `y`, respectively.
```python
# Feature and target columns
feature_cols = ["x", "y", "AQP4", "HPCAL1", "FREM3", "TRABD2A", "KRT17", "MOBP"]
target_col = "cortical_layer"

# Set X and y for future use in training and prediction
X = df_cortical[feature_cols]
y = df_cortical[target_col]
```
With this information, we can now prepare our training and test datasets with `train_test_split()` from the `sklearn.model_selection` module. This function will split our dataset into a training set and a test set.
We will train the model on the training set and then evaluate its performance (accuracy) on the test set. We supply the following parameters into the function:
- `test_size`: the proportion of the dataset to use as the test set (here, 30% of the data for testing and 70% for training).
- `stratify`: ensures that the distribution of the target variable (`cortical_layer`) is the same in both the training and test sets. This reduces bias due to sampling and ensures that the model is trained on a representative sample of the data.
- `random_state`: a random seed for reproducibility.
Note that `train_test_split()` returns four objects (the training and test features, and the training and test labels), which we capture in a single statement by assigning them to multiple variables at once, a Python feature known as tuple unpacking.
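Putting this together, the split looks like this (shown with a small stand-in dataset so the snippet runs on its own; in the lesson, `X` and `y` come from `df_cortical` as defined above):

```python
import pandas as pd
from sklearn.model_selection import train_test_split

# Small stand-in for the cortex data: 10 cells, two layers
X = pd.DataFrame({"x": range(10), "y": range(10)})
y = pd.Series(["L1"] * 5 + ["L2"] * 5)

# 70/30 split, stratified so each layer keeps the same proportion
# in both the training and test sets
X_train, X_test, y_train, y_test = train_test_split(
    X, y, test_size=0.3, random_state=42, stratify=y
)

print(len(X_train), len(X_test))  # 7 3
```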
### Train random forest classifier
To create the model, first we are going to initialize an instance of the `RandomForestClassifier` class from the `sklearn.ensemble` module. We will specify the number of trees in the forest using the `n_estimators` parameter and set a random seed, `random_state`, for reproducibility. We also set `class_weight="balanced"`, which weights each class inversely to its frequency so that layers with fewer cells are not drowned out by larger ones during training.
```python
# Initialize the random forest classifier
rf = RandomForestClassifier(n_estimators=100, random_state=42,
                            class_weight="balanced")
```
Next, we will train the model using the `fit()` method, which takes in the predictor variables (the spatial coordinates and gene expression) and the target variable (cortical layer) from the training data.
```python
# Train the random forest classifier model
rf.fit(X_train, y_train)
```
## Predict cortical layer labels

With this model, `rf`, we can now predict the cortical layer labels of the test dataset using the `predict()` method, this time supplying the predictor variables for the test data instead of the training data. The model will use the patterns it learned from the training data to predict which cortical layer each cell belongs to based on its spatial location and gene expression.
```python
# Predict cortical layers for test dataset
y_pred = rf.predict(X_test)
```
So now we have y_pred, but what is this output?
```python
type(y_pred)
```

```
<class 'numpy.ndarray'>
```
It is a numpy array! So we can access the first few elements to see what the predicted labels look like:
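In the lesson itself you would simply run `y_pred[0:5]` on the array returned by `predict()`; the same slicing is shown below with a small stand-in array of labels so the snippet runs on its own:

```python
import numpy as np

# Stand-in for the 1-D array of predicted layer labels returned by
# rf.predict(); slicing returns the first five predicted labels
y_pred = np.array(["L1", "L1", "L2", "L3", "L1", "L6"])
y_pred[0:5]
```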
## Assessing model performance

At this point, we have the predicted labels for the test dataset, but how do we know if these predictions are accurate? To evaluate the performance of our model, we can compare the predicted labels to the true labels of the test dataset.
### Accuracy of model predictions
Now that we have the predicted and true labels of the test dataset, we can calculate the accuracy of our model’s predictions. Accuracy is calculated as the number of correct predictions divided by the total number of predictions.
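Before applying it to our model, here is that formula on a tiny hypothetical example (invented labels, purely for illustration):

```python
# Accuracy = correct predictions / total predictions
true_labels = ["L1", "L2", "L2", "L3", "L1"]
predicted   = ["L1", "L2", "L3", "L3", "L2"]

# Count positions where the prediction matches the true label
correct = sum(t == p for t, p in zip(true_labels, predicted))
accuracy = correct / len(true_labels)
print(accuracy)  # 0.6 (3 correct out of 5)
```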
```python
# Calculate the accuracy of the model's predictions
acc = accuracy_score(y_test, y_pred)
accuracy_percentage = acc * 100
accuracy_percentage
```

```
85.54700568752091
```
Our accuracy is quite high! This tells us that our model is doing a good job at predicting the cortical layer labels based on the x, y coordinates and gene expression. However, accuracy alone does not always give us the full picture of how well our model is performing.
Confusion matrices are another way to evaluate the performance of a classification model. This table shows, for each true class, how many observations were predicted as each class: correct predictions fall on the diagonal, while misclassifications (false positives and false negatives for each class) fall off the diagonal.
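The matrix can be computed and plotted like this (shown with small stand-in label lists so the snippet runs on its own; in the lesson, `y_test` and `y_pred` take their place and `class_names` comes from `sorted(y.unique())`):

```python
import matplotlib.pyplot as plt
import seaborn as sns
from sklearn.metrics import confusion_matrix

# Stand-in labels; in the lesson these would be y_test and y_pred
y_true_example = ["L1", "L1", "L2", "L2", "L3", "L3"]
y_pred_example = ["L1", "L2", "L2", "L2", "L3", "L1"]

# Fix the class order so rows/columns are labelled consistently
class_names = sorted(set(y_true_example))
cm = confusion_matrix(y_true_example, y_pred_example, labels=class_names)

# Plot the matrix as an annotated heatmap
plt.figure(figsize=(8, 6))
sns.heatmap(
    cm, annot=True, fmt="g", cmap="Blues",
    xticklabels=class_names, yticklabels=class_names
)
plt.title("Random Forest - Confusion Matrix (Test Set)")
plt.xlabel("Predicted label")
plt.ylabel("True label")
plt.tight_layout()
plt.show()
```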
This can help you understand which classes the model is doing well on and which classes it is struggling with. If your accuracy is low, you can look at the confusion matrix to see which classes are being misclassified and potentially adjust your model or data accordingly.
## Next steps
With this model, you could try to predict the cortical layer labels of other datasets. You could also try to use different predictor variables (e.g. only gene expression or only spatial coordinates) to see how that affects the accuracy of the model’s predictions.
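As a sketch of the second suggestion, here is a toy comparison of feature subsets (entirely synthetic data, not the lesson's dataset: the layer label below depends only on spatial position, so a model given only the noise "gene" column should underperform):

```python
import numpy as np
import pandas as pd
from sklearn.ensemble import RandomForestClassifier
from sklearn.metrics import accuracy_score
from sklearn.model_selection import train_test_split

# Synthetic stand-in: layer is determined by x position alone,
# and the AQP4 column is pure noise in this toy example
rng = np.random.default_rng(42)
n = 300
df = pd.DataFrame({
    "x": rng.uniform(0, 1, n),
    "y": rng.uniform(0, 1, n),
    "AQP4": rng.normal(0, 1, n),
})
df["cortical_layer"] = np.where(df["x"] < 0.5, "L1", "L2")

# Train the same model on two different feature subsets and compare
accuracies = {}
for name, cols in [("spatial", ["x", "y"]), ("gene", ["AQP4"])]:
    X_sub = df[cols]
    y_sub = df["cortical_layer"]
    X_tr, X_te, y_tr, y_te = train_test_split(
        X_sub, y_sub, test_size=0.3, random_state=42, stratify=y_sub
    )
    rf = RandomForestClassifier(n_estimators=50, random_state=42)
    rf.fit(X_tr, y_tr)
    accuracies[name] = accuracy_score(y_te, rf.predict(X_te))

print(accuracies)
```

Comparing the two accuracies makes it clear which set of predictors actually carries the signal, which is the same reasoning you would apply to the cortex dataset.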
***

*Lesson: "Machine Learning - Random Forests" · Authors: Noor Sohail, Will Gammerdinger · Date: 2026-03-14 · License: CC-BY-4.0*

[Back to Schedule](../schedule/schedule.qmd)