Loading Spatial Data

category_1
category_2
category_3
category_4

Write a description of the lesson here.

Author

Noor Sohail

Published

July 25, 2025

Keywords

keyword_1, keyword_2, keyword_3, keyword_4, keyword_5, keyword_6

Approximate time: XX minutes

Learning objectives

In this lesson, we will:

  • Establish a hypothesis or biological question to assess using the provided dataset
  • Quantify how bin size affects the count matrix
  • Access important information from a Seurat object
  • Create a merged Seurat object from Space Ranger outputs

Overview of lesson

When doing XYZ…

Exploring the example Visium HD dataset

Throughout this workshop, we will be working with a Visium HD dataset that came from a larger study on human colorectal cancer. The focus of this study was to understand the tumor microenvironment in colorectal cancer (CRC) by utilizing multiple sequencing modalities, including Visium HD, Xenium, and FLEX single-cell sequencing.

Figure 1: Celltype annotation overlaid over three colorectral samples sequenced with Visium HD.
Image source: Oliveira et al. (2025)

In particular, we are going to be working with the P5CRC sample as there is a matched, normal adjacent tissue P5NAT sample that we can use to compare tumor versus normal tissue.

Dataset availability

This dataset was generated by 10X Genomics and is publicly available on their website here.

Metadata

While it may be tempting to get started right away with loading your data, it is crucial that you first take a moment to document metadata that is associated with your dataset. This will be important for you to interpret your results correctly as you move further along in your analysis.

For this particular dataset, we have the following metadata available to us about the patient:

  • Stage IV-A of CRC
  • Female patient
  • 58 years old

Additionally, we want to keep track of some basic information that we would expect to see in our dataset, more specifically the cell types we anticipate seeing:

  • B cells
  • Endothelial cells
  • Fibroblasts
  • Intenstinal epithelial cells
  • Myeloid cells
  • Neural cells
  • Smooth musle cells
  • T cells
  • Tumor cells

It is also good to keep track of the sample preparation and sequencing protocols that were used to generate the data. For this dataset, we know:

  • 5 µm sections were taken from the FFPE tissue blocks with a microtome
  • Sectioning followed the Visium CytAssist v2 WT Panel Gene Expression protocol
  • FFPE tissue sections were place on plain glass slides for deparaffinization, H&E staining and imaging following the Visium HD FFPE Tissue Preparation Handboo
  • Sequencing was performed on an Illumina NovaSeq 6000 with paired-end reads
  • Samples processed with Space Ranger v3.0
  1. Given the information that we know from the metadata, what might be some questions that we want to answer using our data?
  2. What are some of the limitations of this dataset that we should keep in mind as we analyze it?

With this information in mind, we can now move forward with loading our data and performing our analysis.

Set up

We have assembled an R Project for you to download that includes the data along with a basic file structure for good data management. Whenever you start a new project, is is a good habit to set up a similar directory structure to clearly organize your data, scripts, and results. This will make it easier to keep track of your files and to share your work with others.

Download the data

If you haven’t done this already, the project can be accessed using this link. You will have to left-click the link and select Save Link As.. or Download Linked File As.., then select a place on your computer where you would like to place this R Project.

Opening R Studio

We can open the R Project up and see that the provided file structure should look like:

data
├── P5CRC_cropped
│   └── binned_outputs
├── P5NAT_cropped
│   └── binned_outputs
└── crc_flex_ref_downsample.RDS
Cropped images

You may notice that we are working with cropped folders for this workshop. This is due to the limitations of how much data can be loaded on a laptop. So here, we have cropped the image so that we have a smaller cross-section of the tissue to work with, ultimately reducing the number of cells for this example dataset.

The code used to create all the files for this workshop can be found here.

Warning

TODO: link for aside lesson on cropping dataset

If your R Project looks like below, then you are ready to start!

Warning

TODO: Image of R project file structure

Project organization

When you have large amounts of data (like with spatial transcriptomics), it is easy to lose track of your files and become overwhelmed. We tend to prioritize the analysis and in the excitement of getting a first look at our data, we often forget to consider how we are going to manage our data and files. This is a common mistake, as data management is often an afterthought when it should be a key part of the workflow from the very beginning. The HMS Data Management Working Group, discusses in-depth some things to consider beyond the data creation and analysis.

One important aspect of data management is organization. For each experiment you work on and analyze data for, it is considered best practice to get organized by creating a planned storage space. We will do that for our spatial transcriptomics analysis.

Look inside your project space and you will find that a directory structure has been setup for you:

Warning

TODO: is this the right directory structure? Do we need to add anything else?

spatial_transcriptomics/
├── data
├── figures
├── results
├── scripts
└── spatial_transcriptomics.Rproj
Note for Windows OS users

When you open the project folder after unzipping, please check if you have a data folder with a sub folder also called data. If this is the case, please move all the files from the subfolder into the parent data folder.

New script

Next, open a new Rscript file, and start with some comments to indicate what this file is going to contain. Ideally we will have one script per major step in our analysis. For this first script, we will be loading our data and performing quality control - as indicated in the header.

Loading your libraries at the top of the script also allows you to easily keep track of which libraries you are using and to load them all at once in the beginning.

# Creating Seurat object from Space Ranger output
# Visium HD spatial transcriptomics workshop
# Author: Harvard Chan Bioinformatics Core
# Created: December 2025

library(tidyverse)
library(Seurat)

Save the Rscript as quality_control.R into the scripts folder. Your working directory should look something like this:

Warning

TODO: Image of R project file structure with new script

Loading Data into Seurat

There are several different tools that can used for loading and analyzing spatial transcriptomics data. While each has their own nuances, they all follow the same fundamental theory and processes:

For this workshop, we will be using the Seurat workflow.

Input files

To load in our data, we are going to use the function Load10X_Spatial(), which is a function that is designed to read in the output from Space Ranger.

Therefore, the input files that we will be using are the feature matrix and low-resolution tissue image generated by Space Ranger.

To load our data, we are going to make use of the function Load10X_Spatial(). The input needed for this function comes from the output generated by Space Ranger, including both the feature matrix and low-resolution tissue image. Once supplied, a Seurat object that contains our counts matrix and image is created.

If you are ever unsure what parameters you can supply to a function, you make use of the ? call in R. This will bring up the manual page for the function while will provide more details on what each variable is used for.

?Load10X_Spatial

There are 3 main arguments that we will be utilizing for this step:

Table 1: Seurat’s Load10X_Spatial() arguments.
Argument Description
data.dir Directory containing the H5 file specified by filename and the image data in a subdirectory called spatial
bin.size Specifies the bin sizes to read in (defaults to c(16, 8))
slice Name for the stored image of the tissue slice

Before loading in the data, let us take a look at the files we will supplying to the function. In particular, let us take a look at sample data/P5CRC_cropped which is the path would supply as data.dir in the Load10X_Spatial() function.

data/P5CRC_cropped/
└── binned_outputs
    ├── square_008um
    └── square_016um

In the Visium HD assay, Space Ranger bins the image into 2µm x 2µm, 8µm x 8µm, and 16µm x 16µm bins. The output from each of these resolution is found in the binned_outputs folder.

Choosing your bin size

This is not an easy question to answer!

The typical recommendation for a Visium HD analysis is to use the 8µm x 8µm resolution of the dataset. However, it is important to consider the dataset you are working with. For example, neuron cells are know to be a larger celltype in size. Therefore it would be good to consider a large bin size in order to fully represent each cell within a bin.

Creating the Seurat object

Now that we understand the input needed for the Load10X_Spatial() function, let’s use it to create a Seurat object called crc. We will specify that we want to load in the 8µm and 16µm bin sizes at the same time.

Typically you would only load in one bin size, but for the purposes of better understanding bins and our Seurat data structure, we will load both here.

crc <- Load10X_Spatial(data.dir = "data/P5CRC_cropped/",
                       bin.size = c(8, 16),
                       slice = "P5CRC")

We can print out some basic information of our Seurat object to examine its data structure. We will do this frequently throughout the workshop to understand how our the Seurat data structure changes with each step of the workflow.

crc
An object of class Seurat 
36170 features across 97570 samples within 2 assays 
Active assay: Spatial.008um (18085 features, 0 variable features)
 1 layer present: counts
 1 other assay present: Spatial.016um
 2 spatial fields of view present: P5CRC.008um P5CRC.016um
Anatomy of Seurat object

The Seurat object contains many different parts with specific functions to access each slot. To better understand how to access each component, we have created a lesson going through the anatomy of a Seurat object for the crc object we have just created.

Seurat automatically creates some metadata for each of the cells when the object is created. This information is stored in the @meta.data slot within the Seurat object. The rownames are automatically set to be the cell names.

crc@meta.data %>% View()
Table 2: Seurat default @meta.data
orig.ident nCount_Spatial.008um nFeature_Spatial.008um nCount_Spatial.016um nFeature_Spatial.016um
s_008um_00078_00444-1 s 65 57 NA NA
s_008um_00128_00278-1 s 1300 906 NA NA
s_008um_00052_00559-1 s 128 121 NA NA
s_008um_00121_00413-1 s 538 326 NA NA
s_008um_00167_00326-1 s 44 39 NA NA

What does each column represent?

Table 3: Columns automatically populated in @meta.data
Column Description
orig.ident Sample identity if known; defaults to “s”
nCount_RNA Number of UMIs per cell
nFeature_RNA Number of genes detected per cell

While it may seem intimidating at first, the important thing to remember is that this is a dataframe. Therefore can modify and work with this dataframe just like we would any other in R!

Loading multiple samples into Seurat

Before we begin, we are going to remove the crc object we have been playing with in order to clear up extra space on our computers. Spatial transcriptomics take up a lot of memory! So we are being careful to remove any variables that might overload the RAM.

rm(crc)

Now that we have a better understanding of how to create a Seurat object from SpaceRanger outpus, let’s go through how we can load multiple samples at once. In this case we want to represent both samples P5CRC and P5NAT in a singular Seurat object.

Using a for loop

In practice, you will likely have several samples that you will need to read in data for, and that can get tedious and error-prone if you do it one at a time. So, to make the data import into R more efficient we can use a for loop, which will iterate over a series of commands for each of the inputs given and create Seurat objects for each of our samples.

In R, the for loop has the following structure/syntax:

## DO NOT RUN

for (variable in input){
    command1
    command2
    command3
}

Today we will use it to iterate over the two sample folders and execute commands for each sample as we did above for a single sample:

  1. Create the Seurat objects from the SpaceRanger data (Load10X_Spatial()) and specify slice so that images are labeled with their sample name
  2. Set orig.ident to be our sample

Once those steps run, we will store the newly generated Seurat object to a list called list_seurat so that we can eventually merge both samples together.

We will be using a bin size of 16um for the remainder of this workshop!

samples <- c("P5CRC", "P5NAT")
list_seurat <- list()

for (sample in samples) {
  
  # Path to data directory
  data_dir <- paste0("data/", sample, "_cropped")
  # Create seurat object and set orig ident to be sample
  seurat <- Load10X_Spatial(data.dir = data_dir,
                            bin.size = 8,
                            slice = sample)
  seurat$orig.ident <- sample
  
  # Store seurat object in our list
  list_seurat[[sample]] <- seurat
}

To confirm that we succesfully loaded both samples in, we can take a look at the contents of list_seurat. We should see that there are two Seurat objects in our list that correspond to each sample.

list_seurat
$P5CRC
An object of class Seurat 
18085 features across 77896 samples within 1 assay 
Active assay: Spatial.008um (18085 features, 0 variable features)
 1 layer present: counts
 1 spatial field of view present: P5CRC.008um

$P5NAT
An object of class Seurat 
18085 features across 91248 samples within 1 assay 
Active assay: Spatial.008um (18085 features, 0 variable features)
 1 layer present: counts
 1 spatial field of view present: P5NAT.008um

This is exactly what we had hoped to see! So now we can move on to the next step which is merging the sample together into a singular Seurat object.

Merge datasets together

We merge samples together because it make it easier to run the QC steps for both sample groups together. It also enables us to easily compare the data quality for all cells at one time

To create this combined Seurat object, we use the merge() function. Because the same cell IDs can be used for different samples, we add a sample-specific prefix to each of our cell IDs using the add.cell.id argument to ensure the cell names are unique.

seurat_merged <- merge(x = list_seurat[["P5CRC"]],
                       y = list_seurat[["P5NAT"]],
                       add.cell.id = c("P5CRC", "P5NAT"))
seurat_merged
An object of class Seurat 
18085 features across 169144 samples within 1 assay 
Active assay: Spatial.008um (18085 features, 0 variable features)
 2 layers present: counts.1, counts.2
 2 spatial fields of view present: P5CRC.008um P5NAT.008um

From the callout we can see that we now have: 2 layers present: counts.1, counts.2

Seurat has functionality to merge many samples together. You can do this quite easily by adding all sample objects to the y argument in a vector format. An example is provided below, assuming you use the list structured we used previously:

## DO NOT RUN
merged_seurat <- merge(x = seurat_list[[1]], 
                       y = seurat_list[[2:length(seurat_list)]],
                       add.cell.id = names(seurat_list))

However, we do not want to have two distinct raw counts matrices (counts.1 and counts.2). We want the counts for each sample to be concatenated together. This would be a single, large count matrix that represent each cell in the dataset. Seurat’s function JoinLayers() allows us to do just this.

seurat_merged <- JoinLayers(seurat_merged)
  1. What function can we use to double check that we have a singular counts layer?

A common point of confusion is the distinction between integration and merging. In the field, integration is considered to be modifying either your counts or latent space in a way to correct for a batch variable. Whereas what we are doing now is merging or concatenating multiple samples together. This process of merging does not transform the values in the count matrices.

Evaluating merged_seurat

Let’s also double check that we have the correct number of cells. First, let us see what the number of cells was for each sample:

ncol(list_seurat[["P5CRC"]]) + ncol(list_seurat[["P5NAT"]])
[1] 169144

Which is the same as the number of samples in our Seurat callout!

seurat_merged
An object of class Seurat 
18085 features across 169144 samples within 1 assay 
Active assay: Spatial.008um (18085 features, 0 variable features)
 1 layer present: counts
 2 spatial fields of view present: P5CRC.008um P5NAT.008um

If we look at the metadata of the merged object we should be able to see the prefixes in the rownames (Cells) as well as the updated orig.ident we set in the for loop earlier.

# Check that the merged object has the appropriate sample-specific prefixes
seurat_merged@meta.data %>% head()
Table 4: First 5 rows of @meta.data
orig.ident nCount_Spatial.008um nFeature_Spatial.008um
P5CRC_s_008um_00078_00444-1 P5CRC 65 57
P5CRC_s_008um_00128_00278-1 P5CRC 1300 906
P5CRC_s_008um_00052_00559-1 P5CRC 128 121
P5CRC_s_008um_00121_00413-1 P5CRC 538 326
P5CRC_s_008um_00167_00326-1 P5CRC 44 39
seurat_merged@meta.data %>% tail()
Table 5: Last 5 rows of @meta.data
orig.ident nCount_Spatial.008um nFeature_Spatial.008um
P5NAT_s_008um_00365_00197-1 P5NAT 748 488
P5NAT_s_008um_00357_00367-1 P5NAT 814 454
P5NAT_s_008um_00415_00143-1 P5NAT 100 25
P5NAT_s_008um_00148_00248-1 P5NAT 252 222
P5NAT_s_008um_00373_00222-1 P5NAT 514 436

Lastly we want to make sure that the Idents of our cells is a useful piece of information, for example like orig.ident, which contains our sample IDs.

Idents(seurat_merged) <- "orig.ident"

Save!

Now is a great spot to save our seurat_merged object as the next step is going to be filtering.

# Save integrated Seurat object
saveRDS(seurat_merged, "data/seurat_merged.RDS")

A good rule of thumb is to save your intermediate objects before doing any filtration or after running a computational step that takes a long time to run.


Next Lesson >>

Back to Schedule

Reuse

CC-BY-4.0