Integration

category_1
category_2
category_3
category_4

Write a description of the lesson here.

Author

Noor Sohail

Published

July 22, 2025

Keywords

keyword_1, keyword_2, keyword_3, keyword_4, keyword_5, keyword_6

Approximate time: XX minutes

Learning objectives

In this lesson, we will:

  • Learning Objective 1
  • Learning Objective 2
  • Learning Objective 3

Overview of lesson

When doing XYZ…

What is integration

Integration is the process by which uninteresting batch variables (date sequenced, kit used, platform) is accounted for to improve clustering of similar bins. The most common instance of this when multiple samples, sequenced separately, are used for a single analysis. The aim is to align the datasets such that similar bins (cell states) are clustered together.

Figure 1: Example of integrating multiple scRNA-seq datasets generated with different library preparation methods.

Why is it important that cells of the same cell type cluster together?

We want to identify cell types which are present in all samples/conditions/modalities within our dataset, and therefore would like to observe a representation of cells from both samples/conditions/modalities in every cluster. This will enable more interpretable results downstream (i.e. DE analysis, ligand-receptor analysis, differential abundance analysis, …).

To integrate or not to integrate

But how you know when you should integrate?

We recommend that you always first look at your clustering without integration before running any alignment methods. This will allow you to identify if there are any batch effects present in your data, and if so, what is driving the batch effect.

For example, if we have two samples that are sequenced separately, we may see a clear split in PCA and UMAP by sample. This is an example of a batch effect, where the technical batch variable (sample) is driving the clustering instead of the biological variable (cell type). Below we see such an example, where we can see that the monocytes (LYZ is a marker gene for monocytes) are clustering separately by sample instead of together as one cluster.

Figure 2: Example of how multiple samples can cause a batch effect, where the samples cluster separately instead of by cell type.
Image source: Intro to scRNA-seq

Do not just always perform integration because you think there might be differences - explore the data. Condition-specific clustering of the cells indicates that we need to integrate the cells across conditions to ensure that cells of the same cell type cluster together.

Exploring our data

At this point we have not run any integration on our dataset, so we can look at the PCA and UMAP to see if there are any batch effects present in our data.

# Integration
# Visium HD spatial transcriptomics workshop
# Author: Harvard Chan Bioinformatics Core
# Created: December 2025
library(Seurat)
library(tidyverse)

seurat_processed <- readRDS("data/seurat_processed.RDS")

In our case, we have only one batch condition, which is our samples P5CRC and P5NAT. So now we use the DimPlot() function to visualize the PCA and UMAP colored by sample.

p1 <- DimPlot(seurat_processed,
              group.by="orig.ident",
              reduction = "full.pca.sketch") +
  NoLegend() + ggtitle("PCA (Projected)")

p2 <- DimPlot(seurat_processed,
              group.by="orig.ident",
              reduction = "full.umap.sketch") +
  ggtitle("UMAP (Projected)")

p1 | p2
Figure 3: PCA and UMAP colored by sample.
  1. Do you see a clear split based on sample in the UMAP and PCA? What could this indicate about the presence of batch effects in our data?

Distribution of sample per cluster

When we look at the dimensionality reduction plots, we can see that there does appear to be a distinct population of bins that belong to P5CRC. Before making any assumptions about the presence of batch effects, we need to investigate this further.

Visualize with plots other than UMAP

When working with single-cell data, it is may be tempted to make justifications based upon the UMAP visualizations. However, we would encourage you to make use of other visualizations. Barplots, violin plots, boxplots, dotplots, and heatmaps are all useful visualizations to confirm your observations from a UMAP in a more quantitative manner.

So here we can make a barplot of the proportion of bins in each cluster by sample to see if there are any clusters that are dominated by one sample.

# Barplot of proportion of cells in each cluster by sample
ggplot(seurat_processed@meta.data) +
    geom_bar(aes(x = seurat_cluster.projected, 
                 fill = orig.ident), 
             position = position_fill())  +
    theme_classic()
Figure 4: Proportion of bins in each cluster by sample.

Notably, we see that that clusters 1 and 11 contain almost exclusively P5CRC bins. So let us zoom in on these bins to see if there is a biological explanation for this observation, or if it is a technical artifact that we need to correct for with integration.

First, we will identify which bins belong to clusters 1 and 11 with the WhichCells() function. We will supply that output to the cells.highlight argument in DimPlot() to highlight these bins on the UMAP.

Idents(seurat_processed) <- "seurat_cluster.projected"
p <- DimPlot(seurat_processed,
             cells.highlight = WhichCells(seurat_processed, 
                                          idents = c("1", "11"))) +
  ggtitle("Clusters 2 and 12")
LabelClusters(p, id = "ident",  fontface = "bold", size = 5, 
              bg.colour = "white", bg.r = .2, force = 0)
Figure 5: Highlight clusters 2 and 12 on the UMAP.

Now bins that have the label of cluster 1 or 11 are colored red while the rest are gray in the UMAP.

Possible explanations for batch

Recall that when we first loaded the dataset, we identified the broad cell types that we expected to see in our dataset. Here, we summarize this information once again to help us identify if there is a biological explanation for the presence of these clusters that are dominated by P5CRC bins.

Warning

Add marker genes for neural cells and smooth muscle cells

Table 1: Marker genes for broad cell types in our dataset.
Cell type Genes
B cells IGKC
Endothelial cells PECAM1
Fibroblasts COL1A1
Intestinal epithelial cells CLCA1, FCGBP, MUC2, PIGR
Myeloid cells C1QC, SPP1, SELENOP
Neural cells TODO
Smooth muscle cells TODO
T cells TRAC, CD3E
Tumor cells CEACAM6, CEACAM5
  1. Is there a biological reason we see clusters that are dominated by P5CRC bins? What cell type do you think these clusters may represent?

Assessing tumor marker genes

Our dataset contains two samples, P5CRC and P5NAT, which are colorectal cancer (CRC) and normal adjacent tissue (NAT) samples, respectively. Therefore, it makes sense that we may see a population of tumor cells in the P5CRC sample that is not present in the other sample. To confirm that the clusters that are dominated by P5CRC bins are indeed tumor cells, we can look at the expression of tumor marker genes in clusters 1 and 11.

We are going to use VlnPlot() and FeaturePlot() to visualize which clusters have higher expression of CEACAM5 and CEACAM6, which are known tumor marker genes in colorectal cancer.

FeaturePlot(seurat_processed, 
            features = "CEACAM5",
            label = TRUE)
Figure 6: CEACAM5 expression plotted on the UMAP.
VlnPlot(seurat_processed, 
        features = "CEACAM5", 
        pt.size = 0) +
  NoLegend()
Figure 7: CEACAM5 expression plotted as a violin plot for each cluster.
FeaturePlot(seurat_processed, 
            features = "CEACAM6",
            label = TRUE)
Figure 8: CEACAM6 expression plotted on the UMAP.
VlnPlot(seurat_processed, 
        features = "CEACAM6", 
        pt.size = 0) +
  NoLegend()
Figure 9: CEACAM6 expression plotted as a violin plot for each cluster.

We see that clusters 1 and 11 do indeed have higher expression of the tumor marker genes!

When annotating cell types, using multiple genes is recommended to give you more confidence in your annotations. We can also make use of the DotPlot() function to visualize the expression of both CEACAM5 and CEACAM6 at the same time across all clusters. Here, the size of a dot represents the percentage of cells in a cluster that express the gene, and the color of the dot represents the average expression of the gene in that cluster.

DotPlot(seurat_processed, 
        features = c("CEACAM5", "CEACAM6"))

From this, we have the additional knowledge that there are likely some tumor cells among cluster 7.

Final integration decision

Clusters 2 and 12 are dominated by P5CRC bins and have higher expression of tumor marker genes. Therefore, it is likely that these clusters represent tumor cells that are present in the P5CRC sample but not in the P5NAT sample.

So the lack of overlap in samples/batch has a biological answer! We would not want to integrate the dataset across samples because we would be trying to force the tumor cells to cluster together with other cell types. However, that is not the biological reality of our dataset, and therefore not in the best interest of our analysis.

Integration Algorithms

Anecdote

Integration typically is not the best way to go for spatial datasets

In this spatial transcriptomics workflow, integration would be performed on the sketch dataset, which we would then project back to the full dataset. While we may not be running integration on this dataset, here we will describe a few of the most commonly used algorithms for integration in single-cell data analysis.

Harmony was devleoped in 2019, and is an example of a tool that can work with complex integration tasks. It is available as an R package on GitHub, and it has functions for standalone and Seurat pipeline analyses. It has been shown to perform incredibly well from recent benchmarking studies [1].

In this section, we illustrate the use of Harmony as a possible alternative to the Seurat integration workflow. Compared to other algorithms, Harmony notably presents the following advantages (Korsunsky et al. 2019, Tran et al. 2020):

  1. Possibility to integrate data across several variables (for example, by experimental batch and by condition)
  2. Significant gain in speed and lower memory requirements for integration of large datasets
  3. Interoperability with the Seurat workflow

Instead of using CCA, Harmony applies a transformation to the principal component (PCs) values, using all available PCs, e.g. as pre-computed within the Seurat workflow. In this space of transformed PCs, Harmony uses k-means clustering to delineate clusters, seeking to define clusters with maximum “diversity”. The diversity of each cluster reflects whether it contains balanced amounts of cells from each of the batches (donor, condition, tissue, technolgy…) we seek to integrate on, as should be observed in a well-integrated dataset. After defining diverse clusters, Harmony determines how much a cell’s batch identity impacts on its PC coordinates, and applies a correction to “shift” the cell towards the centroid of the cluster it belongs to. Cells are projected again using these corrected PCs, and the process is repeated iteratively until convergence.

Figure 10: Overview of the Harmony algorithm for batch correction and integration of single-cell datasets.
Image Source: Korsunsky et al., 2019

For a more detailed breakdown of the Harmony algorithm, we recommend checking this advanced vignette from the package developers.

Integration is a powerful method that uses shared highly variable genes from each group to identify shared subpopulations across conditions or datasets [Stuart and Bulter et al. (2018)]. The goal of integration is to ensure that the cell types of one condition/dataset align with the same celltypes of the other conditions/datasets (e.g. control macrophages align with stimulated macrophages).

The integration method that is available in the Seurat package utilizes the canonical correlation analysis (CCA); a method that expects “correspondences” or shared biological states among at least a subset of single cells across the groups. The result of this integration approach is a corrected data matrix for all datasets, enabling them to be analyzed jointly in a single workflow. To transfer information from a reference to query dataset, Seurat does not modify the underlying expression data, but instead projects continuous data across experiments.

The steps in the Seurat integration workflow are outlined in the figure below:

Figure 11: Overview of the Seurat integration workflow using canonical correlation analysis (CCA) and anchors. Source: Stuart & Butler et al., 2018

1. Identify shared variable genes:

Integration aims to take the matrix for each dataset (Ctrl and Stim) and identify correlated structures across them and align them in a common space. The shared highly variable genes from each dataset are used to form the intersection set, because they are the most likely to represent those genes distinguishing the different cell types present.

Each dataset can have a different number of cells, but must have the same number of genes.

2. Perform canonical correlation analysis (CCA):

Next, Seurat will jointly reduce the dimensionality of both datasets using diagonalized canonical correlation analysis (CCA) which is a form of PCA. Similar to principal components in PCA, the CCA will result in canonical correlation vectors. An L2-normalization is applied to the canonical correlation vectors, to use as input for the next step (identifying MNNs).

3. Find mutual nearest neighbors (MNNs) or anchors:

In this new shared low-dimensional space, Seurat will identify anchors or mutual nearest neighbors (MNNs) across datasets. These MNNs are pairs of cells that can be thought of as ‘best buddies’.

For each cell in one condition:

  • The cell’s closest neighbor in the other condition is identified based on gene expression values - its ‘best buddy’.
  • The reciprocal analysis is performed, and if the two cells are ‘best buddies’ in both directions, then those cells will be marked as anchors to ‘anchor’ the two datasets together.

4. Filter anchors to remove incorrect anchors:

Assess the similarity between anchor pairs by the overlap in their local neighborhoods (incorrect anchors will have low scores) - do the adjacent cells have ‘best buddies’ that are adjacent to each other? If not, these are removed the anchor list.

5. Integrate the conditions/datasets:

Using the anchors and corresponding scores the cell expression values are transformed, allowing for the integration of the conditions/datasets (different samples, conditions, datasets, modalities). For each cell in the dataset we now have an integrated value, but only for the variable features used for this analysis.

If cell types are present in one dataset, but not the other, then the cells will still appear as a separate sample-specific cluster.

In the Introduction to scRNA-seq workshop, there are more detailed instructions on how to run a CCA analysis.


Next Lesson >>

Back to Schedule

Reuse

CC-BY-4.0