Generating the Reference Dataset for RCTD

Spatial transcriptomics

Deconvolution

FLEX

scRNA-seq

In this lesson, we will build a high-quality single-cell RNA-seq reference object for RCTD deconvolution.

Author

Noor Sohail

Published

April 19, 2026

Keywords

Reference, CRC, Public dataset, Seurat

Approximate time: 10 minutes

Overview of lesson

Deconvolution requires a trustworthy scRNA-seq reference dataset, because the average expression profile of each annotated cell type in the reference is used to estimate the cell type composition of your query dataset. For the CRC dataset that we have been working with throughout this workshop, we will use the paired 10X FLEX dataset that was generated alongside it. Since filtering, normalization, clustering and manual cell type annotation have already been done by the authors and the results are stored in the metadata, we will follow the authors' original workflow.

Investing effort into creating a clean reference dataset will improve the accuracy and interpretability of deconvolution results in the main deconvolution lesson.

Download the dataset

The reference dataset is built from two files, which we download from within R using the curl package:

Count matrix (.h5 file)
Metadata CSV (gzip-compressed)

# Load libraries
library(curl)
library(R.utils)
library(Seurat)

# Create the output directory if it does not already exist
dir.create("data", showWarnings = FALSE)

# Download count matrix
curl_download(
  url = "https://cf.10xgenomics.com/samples/cell-exp/8.0.0/HumanColonCancer_Flex_Multiplex/HumanColonCancer_Flex_Multiplex_count_filtered_feature_bc_matrix.h5",
  destfile = "data/HumanColonCancer_Flex_Multiplex_count_filtered_feature_bc_matrix.h5"
)

# Download metadata
curl_download(
  url = "https://github.com/10XGenomics/HumanColonCancer_VisiumHD/raw/refs/heads/main/MetaData/SingleCell_MetaData.csv.gz",
  destfile = "data/SingleCell_MetaData.csv.gz"
)

# Uncompress the metadata file
gunzip("data/SingleCell_MetaData.csv.gz",
  remove = FALSE)
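
Before moving on, it can be worth confirming that both files actually arrived, since an interrupted download may leave a missing or empty file. The following is a minimal sketch in base R; `files_ready()` is a hypothetical helper (not part of curl or R.utils), and it is demonstrated on temporary files rather than the real downloads.

```r
# Hypothetical helper: TRUE for each path that exists and is not empty
files_ready <- function(paths) {
  sizes <- file.size(paths)  # NA for files that do not exist
  !is.na(sizes) & sizes > 0
}

# Demonstration on temporary files (stand-ins for the real downloads)
ok_file <- tempfile(fileext = ".csv")
writeLines("Barcode,Level1", ok_file)
missing_file <- tempfile(fileext = ".h5")  # path only, never created

status <- files_ready(c(ok_file, missing_file))
print(status)
```

In this lesson, the same helper could be pointed at the `.h5` file and at `data/SingleCell_MetaData.csv` in the `data` folder.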

Create Seurat object

Next, we load both the metadata and the counts matrix to generate a Seurat object.

# Load counts matrix (Read10X_h5() requires the hdf5r package)
counts <- Read10X_h5("data/HumanColonCancer_Flex_Multiplex_count_filtered_feature_bc_matrix.h5")

# Load metadata and set rownames
meta <- read.csv("data/SingleCell_MetaData.csv")
rownames(meta) <- meta$Barcode

# Create Seurat object
seurat <- CreateSeuratObject(counts = counts,
                             meta.data = meta,
                             project = "CRC FLEX")
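
Metadata can only be attached correctly when its row names match the cell barcodes in the counts matrix. The sketch below shows one way to check this before calling `CreateSeuratObject()`. It runs on toy data: `toy_counts` and `toy_meta` are hypothetical stand-ins for `counts` and `meta`, and the gene names, barcodes and labels are made up.

```r
# Toy stand-ins (hypothetical) for the real `counts` matrix and `meta` table
toy_counts <- matrix(
  1:12, nrow = 3,
  dimnames = list(c("GeneA", "GeneB", "GeneC"),
                  c("cell1", "cell2", "cell3", "cell4"))
)
toy_meta <- data.frame(
  Barcode = c("cell1", "cell2", "cell3", "cell5"),
  Level1  = c("T cells", "B cells", "Tumor", "Tumor")
)
rownames(toy_meta) <- toy_meta$Barcode

# Barcodes present in one input but not in the other
missing_meta   <- setdiff(colnames(toy_counts), rownames(toy_meta))
missing_counts <- setdiff(rownames(toy_meta), colnames(toy_counts))

# Barcodes shared by both inputs
shared <- intersect(colnames(toy_counts), rownames(toy_meta))

cat("Cells without metadata:", length(missing_meta), "\n")
cat("Metadata rows without counts:", length(missing_counts), "\n")
cat("Shared barcodes:", length(shared), "\n")
```

If many cells turn out to lack metadata in the real data, the barcode format of the two files (for example a differing suffix) is a likely place to look.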

The UMAP coordinates are also included in the metadata file. Here, we store them in the dimensionality reduction slot of the Seurat object so that we can call DimPlot() in later steps.

# Grab UMAP coordinates from metadata and put them in a reduction
umap_coords <- as.matrix(seurat@meta.data[, c("UMAP1", "UMAP2")])

# Create a DimReduc and store it in the object as "umap"
seurat[["umap"]] <- CreateDimReducObject(embeddings = umap_coords,
                                         key = "UMAP_",
                                         assay = DefaultAssay(seurat))

Up to this point, no QC filtering has been applied to the object. We are going to apply the same filter that the original authors used, which is recorded in the QCFilter column of the metadata. In doing so, we clean up the dataset so that it includes only high-quality cells.

# Remove cells that did not pass QC
seurat <- subset(seurat, 
                 subset = (QCFilter == "Keep"))
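
It can help to look at how many cells each QCFilter value covers before discarding anything. A minimal sketch on toy metadata follows; `toy_meta` is hypothetical, and the value "Remove" is made up for illustration (only "Keep" is named in this lesson).

```r
# Toy metadata (hypothetical); "Remove" is an assumed label for failed cells
toy_meta <- data.frame(
  Barcode  = paste0("cell", 1:6),
  QCFilter = c("Keep", "Keep", "Remove", "Keep", "Remove", "Keep")
)

# Count how many cells fall under each QC label
qc_counts <- table(toy_meta$QCFilter)
print(qc_counts)

# Keep only the rows that passed QC
kept <- toy_meta[toy_meta$QCFilter == "Keep", ]
cat("Cells kept:", nrow(kept), "of", nrow(toy_meta), "\n")
```

On the real object, the equivalent tabulation would be `table(seurat$QCFilter)`, run before the `subset()` call.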

Even after filtering, this is a very large dataset, so the last step is to downsample the reference. The metadata column Level1 contains the cell type annotations for the dataset, so we set it as the Idents() of the object. When we then run subset() with downsample = 500, we keep at most 500 randomly chosen cells for each cell type in Level1; cell types with fewer than 500 cells are kept in full.

# Set Idents to Level1 celltype annotation
Idents(seurat) <- "Level1"

# Randomly downsample dataset such that there are
# at most 500 cells per Level1 identity
seurat_down <- subset(seurat, 
                      downsample = 500)

# Save downsampled Seurat object
saveRDS(seurat_down, "crc_flex_ref_downsample.RDS")
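
To make the effect of the per-identity cap concrete, here is a base R sketch of the same idea on a toy annotation vector. It illustrates the behaviour only and is not Seurat's internal implementation; `toy_idents`, `max_cells` and the cell type labels are hypothetical.

```r
set.seed(1)  # make the random draw reproducible

# Toy annotation vector (hypothetical): one common and one rare cell type
toy_idents <- c(rep("Tumor", 1200), rep("Neuronal", 300))
names(toy_idents) <- paste0("cell", seq_along(toy_idents))

max_cells <- 500

# For each cell type, keep all cells if there are max_cells or fewer,
# otherwise draw max_cells of them at random
cells_by_type <- split(names(toy_idents), toy_idents)
kept_by_type <- lapply(cells_by_type, function(cells) {
  if (length(cells) > max_cells) sample(cells, max_cells) else cells
})
kept_cells <- unlist(kept_by_type, use.names = FALSE)

# Cells retained per cell type
print(table(toy_idents[kept_cells]))
```

The common cell type is capped at 500 cells while the rare one keeps all 300. On the real object, `table(Idents(seurat_down))` gives the corresponding per-cell-type counts.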

The seurat_down object saved here is the reference dataset that we will use in the Deconvolution lesson.

Reuse

CC-BY-4.0
