Integration

R Programming

Single-cell RNA-seq

This lesson introduces participants to the concepts involved in integrating single-cell RNA-seq datasets using canonical correlation analysis (CCA) within the Seurat framework. Participants will learn when integration is appropriate, how it aligns shared cell types across conditions and why evaluating data before integration is essential for accurate downstream analyses.

Authors

Mary Piper

Lorena Pantano

Meeta Mistry

Radhika Khetani

Jihe Liu

Amélie Julé

Will Gammerdinger

Noor Sohail

Published

May 30, 2025

Keywords

R, Integration, CCA, Seurat, Mutual nearest neighbors

Approximate time: 90 minutes

Learning Objectives

Describe the theory of integration with CCA

Single-cell RNA-seq clustering analysis: Integration theory

Figure 1: Overview of the single-cell RNA-seq workflow. *Source: HBC Training Intro-to-scRNAseq materials.*

To integrate or not to integrate?

We generally inspect our clustering without integration before deciding whether any alignment is needed. It can be helpful to first cluster samples from the different sample classes together to see whether cell types present in both conditions form condition-specific clusters. When cells from multiple conditions are clustered together, condition-specific clusters often appear, and integration can help ensure that the same cell types cluster together.

Do not perform integration by default just because you think there might be differences - explore the data first. If we had normalized both conditions together in a single Seurat object and visualized the similarity between cells, we would have seen condition-specific clustering in our dataset.

We will discuss the UMAP algorithm in more detail in a future section, but for now we can see that once we calculate our UMAP coordinates, there is a clear split based upon sample.

# Run UMAP
seurat_phase <- RunUMAP(seurat_phase,
                        dims = 1:40,
                        reduction = "pca")
# Plot UMAP
DimPlot(seurat_phase,
        group.by = "sample")

Figure 2: Sample-specific clustering in UMAP space, before integration.

Condition-specific clustering of the cells indicates that we need to integrate the cells across conditions to ensure that cells of the same cell type cluster together. If cells cluster by sample, condition, batch, dataset, or modality, performing integration can help align cells across the groups and greatly improve the clustering and the downstream analyses.

Why is it important that cells of the same cell type cluster together?

We want to identify cell types which are present in all samples/conditions/modalities within our dataset, and therefore would like to observe a representation of cells from all samples/conditions/modalities in every cluster. This will enable more interpretable results downstream (e.g. DE analysis, ligand-receptor analysis, differential abundance analysis).
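One way to make "representation from every sample in every cluster" concrete is to tabulate cluster membership by sample. The sketch below uses a small made-up metadata table; the column names `cluster` and `sample` and the 0.9 cutoff are assumptions chosen for illustration, not values from this lesson.

```r
# Toy metadata: cluster assignment and sample of origin for 12 hypothetical cells
meta <- data.frame(
  cluster = c("0", "0", "0", "0", "1", "1", "1", "1", "2", "2", "2", "2"),
  sample  = c("ctrl", "ctrl", "stim", "stim",
              "ctrl", "ctrl", "ctrl", "ctrl",
              "stim", "stim", "stim", "ctrl")
)

# Count cells per cluster and sample, then convert to within-cluster proportions
counts <- table(meta$cluster, meta$sample)
props  <- prop.table(counts, margin = 1)

# Flag clusters where a single sample contributes more than 90% of the cells
# (the 0.9 cutoff is an arbitrary illustration, not a recommended threshold)
dominated <- rownames(props)[apply(props, 1, max) > 0.9]

print(round(props, 2))
print(dominated)
```

Clusters flagged this way are candidates for a closer look: they may reflect a sample-specific cell type, or the same cell type split by a condition or batch effect. On a real Seurat object the same table could be built from its metadata columns once clustering has been run.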

So then how would you determine if integration is necessary? In this dataset, we know that the gene LYZ is a marker for monocytes. Even if the monocyte populations differ slightly between experimental conditions, we still expect monocytes from all batches to be biologically similar. Because of this, these cells should be considered comparable and should occupy the same region in the UMAP embedding.

# Plot LYZ expression on the unintegrated UMAP
FeaturePlot(seurat_phase,
            features = "LYZ")

Figure 3: Expression of gene LYZ on the unintegrated UMAP to show when integration is necessary.

However, we can clearly see that the monocytes are split into separate groups and that this split is driven by batch, as seen in the previous UMAP plot.

In this lesson, we will cover the integration of our samples across conditions, which is adapted from the Seurat Guided Integration Tutorial.

Vignette without integration

Seurat has a vignette for how to run through the workflow from normalization to clustering without integration. Other steps in the workflow remain fairly similar, but the samples would not necessarily be split in the beginning and integration would not be performed.

Example scenarios for integration

Different conditions (e.g. control and stimulated)

Figure 4: Example of condition-specific clustering and post-integration alignment using Seurat (control vs stimulated).

Different datasets (e.g. scRNA-seq from datasets generated using different library preparation methods on the same samples)

Figure 5: Example of integrating multiple scRNA-seq datasets generated with different library preparation methods.

Different batches (e.g. when experimental conditions make batch processing of samples necessary)

Integration using CCA

Integration is a powerful method that uses shared highly variable genes from each group to identify shared subpopulations across conditions or datasets [Stuart and Butler et al. (2018)]. The goal of integration is to ensure that the cell types of one condition/dataset align with the same cell types of the other conditions/datasets (e.g. control macrophages align with stimulated macrophages).

The integration method that is available in the Seurat package uses canonical correlation analysis (CCA), a method that expects "correspondences" or shared biological states among at least a subset of single cells across the groups. The result of this integration approach is a corrected data matrix for all datasets, enabling them to be analyzed jointly in a single workflow. When instead transferring information from a reference to a query dataset, Seurat does not modify the underlying expression data, but projects continuous data across experiments.

The steps in the Seurat integration workflow are outlined in the figure below:

Figure 6: Overview of the Seurat integration workflow using canonical correlation analysis (CCA) and anchors. *Source: Stuart & Butler et al., 2018*

1. Identify shared variable genes:

Integration takes the matrix for each dataset (Ctrl and Stim), identifies correlated structures across them, and aligns them in a common space. The shared highly variable genes from each dataset are used to form the intersection set, because they are the genes most likely to distinguish the different cell types present.

Each dataset can have a different number of cells, but the datasets must contain the same set of genes (here, the shared variable genes).
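As a rough illustration of forming the intersection set, the sketch below ranks genes by plain variance in two simulated matrices and keeps the genes that are highly variable in both. The matrices, the gene names and the `top_variable()` helper are hypothetical, and raw variance is a simplification; Seurat uses its own variance-stabilized ranking.

```r
set.seed(1)

# Hypothetical log-normalized matrices (genes x cells) for two conditions
genes <- paste0("gene", 1:50)
ctrl <- matrix(rnorm(50 * 30), nrow = 50,
               dimnames = list(genes, paste0("ctrl_", 1:30)))
stim <- matrix(rnorm(50 * 40), nrow = 50,
               dimnames = list(genes, paste0("stim_", 1:40)))

# Make the first 10 genes strongly variable in both datasets
ctrl[1:10, ] <- ctrl[1:10, ] * 5
stim[1:10, ] <- stim[1:10, ] * 5

# Rank genes by variance within a dataset and keep the top n
top_variable <- function(mat, n = 15) {
  v <- apply(mat, 1, var)
  names(sort(v, decreasing = TRUE))[seq_len(n)]
}

# Genes that are highly variable in both datasets
shared_genes <- intersect(top_variable(ctrl), top_variable(stim))

# Both matrices restricted to the same genes; the number of cells still differs
ctrl_shared <- ctrl[shared_genes, , drop = FALSE]
stim_shared <- stim[shared_genes, , drop = FALSE]
dim(ctrl_shared)
dim(stim_shared)
```

After subsetting, the two matrices have identical rows (genes) but 30 and 40 columns (cells), which is the shape the next step expects.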

2. Perform canonical correlation analysis (CCA):

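Next, Seurat will jointly reduce the dimensionality of both datasets using diagonalized canonical correlation analysis (CCA). Like PCA, CCA is a dimensionality-reduction technique, but it captures the sources of variation that are shared between the datasets rather than the largest sources of variation within a single dataset. Similar to principal components in PCA, CCA will result in canonical correlation vectors. An L2-normalization is applied to the canonical correlation vectors, which are then used as input for the next step (identifying MNNs).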
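To give a feel for this step, here is a minimal base-R sketch that approximates the idea with a singular value decomposition of the cell-by-cell cross-product of two standardized matrices, followed by L2-normalization. It is only loosely modelled on the approach described above; the simulated data, the choice of `k = 5` and the `l2_normalize()` helper are assumptions, and this is not Seurat's implementation.

```r
set.seed(42)

# Hypothetical matrices (shared genes x cells) for two conditions
n_genes <- 20
ctrl <- matrix(rnorm(n_genes * 30), nrow = n_genes)
stim <- matrix(rnorm(n_genes * 40), nrow = n_genes)

# Standardize each cell (column) across the shared genes
ctrl_std <- scale(ctrl)
stim_std <- scale(stim)

# Cell-by-cell cross-product: entry (i, j) relates ctrl cell i to stim cell j
cross <- crossprod(ctrl_std, stim_std)   # 30 x 40

# The leading singular vectors play the role of canonical correlation vectors
k <- 5
dec <- svd(cross, nu = k, nv = k)
ccv <- rbind(dec$u, dec$v)               # (30 + 40) cells x k

# L2-normalize each cell's vector so that every row has unit length
l2_normalize <- function(m) m / sqrt(rowSums(m^2))
ccv_l2 <- l2_normalize(ccv)

range(rowSums(ccv_l2^2))
```

After normalization every cell lies on the unit sphere, so distances between cells depend on the direction of their vectors rather than their magnitude, which puts cells from both datasets on a comparable scale for the neighbor search.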

3. Find mutual nearest neighbors (MNNs) or anchors:

In this new shared low-dimensional space, Seurat will identify anchors or mutual nearest neighbors (MNNs) across datasets. These MNNs are pairs of cells that can be thought of as ‘best buddies’.

For each cell in one condition:

The cell’s closest neighbor in the other condition is identified based on gene expression values - its ‘best buddy’.
The reciprocal analysis is performed, and if the two cells are ‘best buddies’ in both directions, then those cells will be marked as anchors to ‘anchor’ the two datasets together.
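The 'best buddies' search described above can be sketched in a few lines of base R. The `find_mnn()` helper and the toy coordinates are hypothetical, and the sketch uses `k = 1` (a single closest neighbor) for readability, whereas real workflows typically consider several neighbors per cell.

```r
# Find mutual nearest neighbors between the rows (cells) of two matrices
# that live in the same low-dimensional space
find_mnn <- function(a, b, k = 1) {
  # Squared Euclidean distance between every cell in a and every cell in b
  d <- outer(rowSums(a^2), rowSums(b^2), "+") - 2 * tcrossprod(a, b)

  # k nearest cells in b for each cell in a, and vice versa
  nn_ab <- lapply(seq_len(nrow(a)), function(i) order(d[i, ])[seq_len(k)])
  nn_ba <- lapply(seq_len(nrow(b)), function(j) order(d[, j])[seq_len(k)])

  # Keep a pair only if each cell appears in the other's neighbor list
  pairs <- list()
  for (i in seq_len(nrow(a))) {
    for (j in nn_ab[[i]]) {
      if (i %in% nn_ba[[j]]) {
        pairs[[length(pairs) + 1]] <- c(cell_a = i, cell_b = j)
      }
    }
  }
  do.call(rbind, pairs)
}

# Toy coordinates: three ctrl cells and three stim cells in two dimensions
ctrl <- rbind(c(0, 0), c(5, 5), c(10, 0))
stim <- rbind(c(0.2, 0.1), c(5.1, 4.8), c(5.6, 5.2))

anchors <- find_mnn(ctrl, stim, k = 1)
anchors
```

In this toy example the third ctrl cell picks the third stim cell as its closest neighbor, but that stim cell is closer to the second ctrl cell, so the pair is not reciprocal and is not kept as an anchor.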

4. Filter anchors to remove incorrect anchors:

Assess the similarity between anchor pairs by the overlap in their local neighborhoods (incorrect anchors will have low scores) - do the adjacent cells have 'best buddies' that are adjacent to each other? If not, these anchors are removed from the anchor list.
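The neighborhood-overlap idea can be illustrated with a simplified score: for each anchor pair, collect the nearest cells of both anchor cells in each dataset and measure how much the two neighborhoods overlap. The helper names, the toy coordinates and `k = 2` are assumptions for illustration; Seurat's actual scoring and filtering involve further steps not shown here.

```r
# Indices of the k rows of `ref` closest to a single query point
knn_idx <- function(query, ref, k) {
  d <- rowSums((ref - matrix(query, nrow(ref), length(query), byrow = TRUE))^2)
  order(d)[seq_len(k)]
}

# Neighborhood of a point: its k nearest cells in each dataset, labelled by dataset
neighborhood <- function(query, ds1, ds2, k) {
  c(paste0("ds1_", knn_idx(query, ds1, k)),
    paste0("ds2_", knn_idx(query, ds2, k)))
}

# Score = fraction of the two anchor cells' neighborhoods that is shared
score_anchor <- function(a, b, ds1, ds2, k = 2) {
  na <- neighborhood(ds1[a, ], ds1, ds2, k)
  nb <- neighborhood(ds2[b, ], ds1, ds2, k)
  length(intersect(na, nb)) / (2 * k)
}

# Toy shared space with two well-separated groups of cells in each dataset
ds1 <- rbind(c(0, 0), c(0.5, 0), c(10, 10), c(10.5, 10))
ds2 <- rbind(c(0.1, 0.1), c(0.6, 0.1), c(10.1, 10.1), c(10.6, 10.1))

good <- score_anchor(1, 1, ds1, ds2)  # anchor within the same group
bad  <- score_anchor(1, 3, ds1, ds2)  # anchor linking two different groups
c(good = good, bad = bad)
```

An anchor whose two cells sit in the same local neighborhood gets a high score, while an anchor joining cells from different groups shares no neighbors and would be a candidate for removal.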

5. Integrate the conditions/datasets:

Using the anchors and corresponding scores, the cell expression values are transformed, allowing for the integration of the conditions/datasets (different samples, conditions, datasets, modalities). For each cell in the dataset we now have an integrated value, but only for the variable features used for this analysis.

Neighborhoods and correction values

The transformation of each cell uses a weighted average over the anchors between the datasets, computed from the two cells of each anchor. Weights are determined by the cell similarity score (distance between the cell and its k nearest anchors) and the anchor scores, so cells in the same neighborhood should have similar correction values.
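A toy version of this weighting might look like the following. It assumes each anchor carries a correction vector (the shift toward its partner cell) and weights the k nearest anchors by a Gaussian kernel on distance multiplied by the anchor score. The anchor positions, shifts, scores, the kernel and the `correct_cell()` helper are all hypothetical simplifications rather than Seurat's exact formula.

```r
# Toy anchors in a shared 2-D space: position of each anchor cell and the
# correction vector pointing toward its partner in the other dataset
anchor_pos   <- rbind(c(0, 0), c(1, 0), c(10, 10))
anchor_shift <- rbind(c(2, 2), c(2, 2), c(-3, 0))
anchor_score <- c(1, 1, 0.5)

# Correct one cell using a weighted average of its k nearest anchors' shifts
correct_cell <- function(cell, k = 2, sd = 1) {
  # Euclidean distance from the cell to every anchor
  d <- sqrt(rowSums((anchor_pos -
                       matrix(cell, nrow(anchor_pos), 2, byrow = TRUE))^2))
  nearest <- order(d)[seq_len(k)]

  # Gaussian distance kernel multiplied by anchor score, normalized to sum to 1
  w <- exp(-d[nearest]^2 / (2 * sd^2)) * anchor_score[nearest]
  w <- w / sum(w)

  # Apply the weighted average of the selected correction vectors
  cell + colSums(anchor_shift[nearest, , drop = FALSE] * w)
}

# Two neighboring cells receive the same correction
correct_cell(c(0.2, 0.1))
correct_cell(c(0.4, -0.1))

# A cell in a different neighborhood is corrected by its own local anchor
correct_cell(c(9.8, 10.1))
```

Because the two neighboring cells share the same nearest anchors, they are shifted by the same amount, which is the behavior described above for cells in the same neighborhood.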

If cell types are present in one dataset, but not the other, then the cells will still appear as a separate sample-specific cluster.

Reciprocal PCA

If there are a substantial number of cells that do not have a match between groups or there are a large number of cells to integrate, an alternative approach recommended by the Seurat vignette is reciprocal PCA (RPCA).

Reuse

CC-BY-4.0
