Subsetting a Slide

Authors

Will Gammerdinger

Noor Sohail

Published

August 25, 2025

Approximate time: 45 minutes

Learning Objectives

In this lesson, we will:

  • Describe the coordinate systems for Visium HD data
  • Subset a spatial scRNA-seq slide

Background

When organizing this workshop, spatial single-cell data can be quite large and the result is that analyzing it can be quite memory intensive. In order to focus on the techniques, methods and interpretation of results, we needed to reduce the size of our dataset to work better on our computers. We have already done these steps and provided you with the downstream output from this subsetting for our analysis, but we have created these materials should you want to know how to do this yourself.

Set-up

Open a new R Script by going FileNew File R Script. Save the file as subsetting_slide.R and place it within our scripts directory. Provide the R Script with some information about the script:

# subsetting_slide.R is for subsetting the spatial single-cell data utilized in the Introduction to Spatial Single-Cell RNA-seq workshop provided by the Harvard Chan Bioinformatics Core
# Written by: <Your name>
# Written on: <Date>
# Usage example:
#  Rscript subsetting_slide.R

In order to subset the spatial single-cell data, we will need to load some packages:

# Load libraries
library(tidyverse)
library(Seurat)
library(DropletUtils)
library(cowplot)
library(png)
library(grid)
library(magick)
library(arrow)
library(curl)

Downloading the data

The data for this workshop is hosted on the 10X Genomics website. We will need to download the binned output data and uncompress the directory within our data directory. Note that the binned output data is 16GB and may take several minutes to download.

# Set URL for where to download the data from
URL <- "https://cf.10xgenomics.com/samples/spatial-exp/3.0.0/Visium_HD_Human_Colon_Cancer_P5/Visium_HD_Human_Colon_Cancer_P5_binned_outputs.tar.gz"

# Create directory to hold full dataset
dir.create("data/P5CRC_full/")

# Set destination folder
dest <- file.path("data/P5CRC_full", basename(URL))

# Download data
curl_download(
  url = URL,
  destfile = dest,
  mode = "wb")

# Uncompress tar.gz
untar(tarfile = dest,
      exdir = "data/P5CRC_full")

Loading the Seurat Object

We can start the subsetting process by loading in our entire dataset into a Seurat object called seurat_obj using the Load10X_Spatial() function:

# Load spatial data into Seurat object
seurat_obj <- Load10X_Spatial(
  data.dir = "data/P5CRC_full/",
  bin.size = 16,
  slice = "P5")

The parameters we are using are:

  • data.dir - The directory containing a subdirectory called binned_outputs that holds the data
  • bin.size - The size in µm that we want to use for the binning
  • slice - The name of the image

We can inspect this Seurat object with:

# Inspect the Seurat object
seurat_obj
An object of class Seurat 
18085 features across 136035 samples within 1 assay 
Active assay: Spatial.016um (18085 features, 0 variable features)
 1 layer present: counts
 1 spatial field of view present: P5.016um

We can see that this object has 18085 genes 136035 bins.

Pixel Coordinate Space

Within this Seurat object, we also have coordinate information stored that we will need to use to subset our slide. In order to extract this coordinate information from the Seurat object, we will use the GetTissueCoordinates() function:

# Obtain coordinates for our spatial data
coordinates <- GetTissueCoordinates(seurat_obj)

Let’s get an intuition for what this data looks like:

# Inspect the coordinates table
View(coordinates)

We have subsetted the first five lines in the table below:

Table 1: First five lines of the coordinates data frame
x y cell
s_016um_00052_00082-1 49099.34 60930.62 s_016um_00052_00082-1
s_016um_00396_00063-1 48072.16 40823.45 s_016um_00396_00063-1
s_016um_00297_00147-1 52957.09 46628.74 s_016um_00297_00147-1
s_016um_00347_00254-1 59222.16 43732.54 s_016um_00347_00254-1
s_016um_00299_00088-1 49509.67 46497.72 s_016um_00299_00088-1

The x and y values correspond to the bin in pixel space that Seurat will be using.

We can compare the original image, the boundaries of the slide area and the coordinates obtained from the data using the code below:

# Read in and orient the low resolution image
lowres_image <- image_read("data/P5CRC_full/binned_outputs/square_016um/spatial/tissue_lowres_image.png") %>% 
  image_flip %>% 
  as.raster(interpolate = TRUE) %>% 
  rasterGrob()

# Read in and orient the detected tissue image
detected_image <- image_read("data/P5CRC_full/binned_outputs/square_016um/spatial/detected_tissue_image.jpg") %>% 
  as.raster(interpolate = TRUE) %>% 
  rasterGrob()

# Create plot with coordinates
coordinate_plot <- ggplot(coordinates) +
  geom_point(aes(x = x,
                 y = y),
             size = 0.0001,
             color = "cornflowerblue")

# Plot three image images next to each other
plot_grid(lowres_image, detected_image, coordinate_plot,
          nrow = 1,
          rel_widths = c(0.3, 0.3, 0.6))
Figure 1: Comparison of the original image, the boundaries of the slide area and the coordinates obtained from the data

Array Coordinate Space

Let’s dig a bit deeper into how the spatial data is stored. The storage of the data comes from a parquet file that is provided with the data. We can read this parquet file using the read_parquet() function from the arrow package:

# Read parquet file
parquet_df <- read_parquet("data/P5CRC_full/binned_outputs/square_016um/spatial/tissue_positions.parquet")

We can inspect what this parquet data frame looks like with:

# Inspect parquet file
View(parquet_df)

The first five lines will look like:

Table 2: First five lines of the Parquet file
barcode in_tissue array_row array_col pxl_row_in_fullres pxl_col_in_fullres
s_016um_00000_00000-1 0 0 0 63949.78 44294.63
s_016um_00000_00001-1 0 0 1 63950.03 44353.07
s_016um_00000_00002-1 0 0 2 63950.27 44411.51
s_016um_00000_00003-1 0 0 3 63950.51 44469.95
s_016um_00000_00004-1 0 0 4 63950.75 44528.40

In order to get a sense for this table, let’s investigate each column:

  • barcode - This is the barcode for the bin on the slide. The format is:
    • [ident]_[bin_size]_[array_row]_[array_col]-[suffix] where:
      • ident - The ident
      • bin_size - The size of the bin
      • array_row - The row number in the array for a given bin size
      • array_col - The column number in the array for a given bin size
      • suffix - An optional suffix reserved for technical reasons (usually is -1)
  • in_tissue - A binary descriptor for whether Space Ranger believes this is in the tissue or not
    • 0 - Not in the tissue
    • 1 - In the tissue
  • array_row - The row number in the array for a given bin size
  • array_col - The column number in the array for a given bin size
  • pxl_row_in_fullres - Location of row in pixel space
  • pxl_col_in_fullres - Location of column in pixel space

Scope of data

The first thing to note is that the parquet file holds information for all bins, regardless of whether they are in the tissue or not. The coordinate information that we retrieved the the Seurat object using the GetTissueCoordinates() function only contains bins that were scored for having tissue, or bins with the value of 1 in the parquet file.

We can confirm this by comparing the number of rows in the coordinate_df to the number of bins in the parquet file where the in_tissue value is equal to 1:

# Compare this value to the number of bins in the coordinate_df
nrow(coordinates)
[1] 136035
# Obtain the number of bins that were scored for being in and out of the tissue in the parquet file
table(parquet_df$in_tissue)

     0      1 
 39526 136035 

From this we can see that Seurat has subsampled our dataset to only include the bins that Space Ranger detected as being within the tissue.

Drawing Connections Between the Two Coordinate Spaces

Within the parquet file the array_row and array_col are directly related to pxl_row_in_fullres and pxl_col_in_fullres, respectively. However the data has undergone an affine transformation within Space Ranger that considers a scale factor (how many pixels per bin), an offset and a rotation. In short, the array_row and array_col are associated with pxl_row_in_fullres and pxl_col_in_fullres and the parquet file is the key that ties them together.

Let’s look at the first bin in our coordinates object:

# Obtain first bin
first_bin <- coordinates$cell[1]

# Extract first bin from coordinates
coordinates[first_bin, ]
                             x        y                  cell
s_016um_00052_00082-1 49099.34 60930.62 s_016um_00052_00082-1

Let’s compare this to the same bin in the parquet file:

# Change the significant figures for rounding
options(pillar.sigfig = 7)

# Extract the bin from the parquest file that is the first bin from coordinates
parquet_df[parquet_df$barcode == first_bin,]
# A tibble: 1 × 6
  barcode    in_tissue array_row array_col pxl_row_in_fullres pxl_col_in_fullres
  <chr>          <int>     <int>     <int>              <dbl>              <dbl>
1 s_016um_0…         1        52        82           60930.62           49099.34

We can observe that pxl_row_in_fullres in the parquet file is the y column from Seurat’s GetTissueCoordinates() function and pxl_col_in_fullres in the parquet file is the x column from Seurat’s GetTissueCoordinates() function.

Subsetting the Seurat Object

Now that we have an understanding of the coordinates, let’s go ahead and subset our coordinates based upon the pixel region we will like to focus on in this workshop:

# Set x and y bounds for subsetting image
x_min <- 52000
x_max <- 64319
y_min <- 57500
y_max <- 64002

# Extract bin barcodes based on the defined x and y bounds
subsetted_bins <- coordinates %>% 
  subset(x > x_min) %>%
  subset(x < x_max) %>%
  subset(y > y_min) %>%
  subset(y < y_max) %>% 
  row.names()

The output of this subsetting should be bin names that fit this criteria for x and y bounds. We can inspect the object with:

# Inspect subsetted bin names
head(subsetted_bins)
[1] "s_016um_00050_00315-1" "s_016um_00064_00214-1" "s_016um_00101_00317-1"
[4] "s_016um_00049_00195-1" "s_016um_00032_00133-1" "s_016um_00061_00268-1"

We can confirm that this subsetting was done appropriately, by subsetting the coordinates object by our subsetted_bins names and checking the range for x and y.

# Subsetted x range
range(coordinates[subsetted_bins,]$x)
[1] 52008.83 64304.74
# Subsetted y range
range(coordinates[subsetted_bins,]$y)
[1] 57500.22 63987.15

Thus, we can confirm our pixels are all within the pixel range that we subsetted by, so we feel confident that our subsetting was appropriate.

Note

Sanity checks like the above check may feel unnecessary but they are really important for good data analysis practices. It oftentimes takes less than a few moments to inspect and can save you lots of time downstream when you are debugging code.

Let’s go ahead and start subsetting our Seurat object. This will happen in three parts:

  1. Creating a new column, filter, in meta.data of the Seurat object
# Label a column called filter for TRUE
seurat_obj$filter <- TRUE
  1. Assigning the values of TRUE or FALSE for bins to be removed or kept, respectively
# Re-assign the value of FALSE to the filter column if the bin is within subsetted_bins
seurat_obj@meta.data[subsetted_bins, "filter"] <- FALSE
  1. Subsetting the Seurat object based about the data in the filter column of meta.data
# Retain the bins which have FALSE for filter in the meta data
filtered_seurat_obj <- subset(seurat_obj, filter == FALSE)

Just like we did previously when subsetting coordinates, let’s confirm that we are subsetting the correct bins.

# Inspect the range of x and y coordinates to ensure they match our expectation
GetTissueCoordinates(filtered_seurat_obj) %>%  
  reframe(
    stat = c("min", "max"),
    x = range(x),
    y = range(y)
  ) %>% 
  as.data.frame() %>% 
  column_to_rownames(var = "stat")
           x        y
min 52008.83 57500.22
max 64304.74 63987.15

Last, we can use the SpatialFeaturePlot() function from Seurat to compare the before and after of subsetting.

# Visual the original image and the subsetted image next to each other
SpatialFeaturePlot(seurat_obj,
                     features = "nFeature_Spatial.016um", 
                     pt.size.factor = 16) +
SpatialFeaturePlot(filtered_seurat_obj,
                     features = "nFeature_Spatial.016um", 
                     pt.size.factor = 16)
Figure 2: Comparison of the data before and after subsetting

Writing Subsetted Data Directory

It looks like we have correctly subsetted our data and now we are ready to create a directory to hold our data. This will make it easier in the future to work with our data, because we can use the Load10X_Spatial() function from Seurat to read in our subsetted data rather than performing the subsetting again.

We are going to start by creating a directory structure to hold our subsetted data:

dir.create(path = "data/subsetted_data/binned_outputs/square_016um/spatial/", recursive = TRUE)

The directory structure should look like:

data/subsetted_data/
└── binned_outputs
    └── square_016um
        └── spatial

Now we will populate this directory with the files needed to be able to run the Load10X_Spatial() function in the future.

# Set path for JSON file
path_json <- paste0("data/P5CRC_full/binned_outputs/square_016um/spatial/scalefactors_json.json")
# Set path for Parquest file
path_parq <- paste0("data/P5CRC_full/binned_outputs/square_016um/spatial/tissue_positions.parquet")
# Set path for PNG file
path_png  <- paste0("data/P5CRC_full/binned_outputs/square_016um/spatial/tissue_lowres_image.png")

# Combine JSON, Parquet and PNG file paths into a vector
files_spatial <- c(path_json, path_parq, path_png)

# Copy JSON, Parquet and PNG files into our spatial folder
file.copy(from = files_spatial,
          to = "data/subsetted_data/binned_outputs/square_016um/spatial")

# Write counts to h5 file
write10xCounts("data/subsetted_data/binned_outputs/square_016um/filtered_feature_bc_matrix.h5",
  x=LayerData(filtered_seurat_obj),
  barcodes=colnames(filtered_seurat_obj),
  gene.id=rownames(filtered_seurat_obj),
  version="3",
  type="HDF5",
  overwrite = TRUE)

After writing and copying the files the file directory should look like this:

data/subsetted_data/
└── binned_outputs
    └── square_016um
        ├── filtered_feature_bc_matrix.h5
        └── spatial
            ├── scalefactors_json.json
            ├── tissue_lowres_image.png
            └── tissue_positions.parquet

Automate subsetting by looping through bins

Now we have processed a single bin (16μm). However, we would like to process our data for two bin sizes (8μm and 16μm). This process can be automated by using the same process as above and looping through both bin sizes. We will first process the slide with the colorectal cancer, then we will leave the non-cancerous slide as an exercise

Setting variables

We are going to set-up the bin sizes that we would like to loop through and also set our boundaries for where to subset our sample.

# Set sample name
sample_name <- "P5CRC"

# Set minimums and maximums for x and y
x_min <- 52000
x_max <- 64319
y_min <- 57500
y_max <- 64002

# Selecting the bin sizes to use
bins <- c("008", "016")

Next, we are going to create a space to hold our subsetted data:

# Create directory to hold subsetted data
dir.create(paste0("data/", sample_name, "_subsetted/binned_outputs/"), 
           recursive = TRUE)

The code below will allow us to follow the steps we’ve done previously and loop through the different bin sizes, adding them to the directory called P5CRC_subsetted. Note that we are calling this P5CRC_subsetted rather than P5CRC_cropped like you downloaded, so that you can differentiate between the files you are subsetting and the ones provided.

# For each bin size
for (bin in bins) {
  # Set path to write data to
  path_bin <- paste0("data/", sample_name, "_subsetted/binned_outputs/square_", bin, "um")
  dir.create(path = path_bin,
             recursive = TRUE)
  
  # Load object
  seurat_obj <- Load10X_Spatial(
    data.dir = "data/P5CRC_full/",
    bin.size = c(as.integer(bin)),
    slice = sample_name)
  
  # Obtain coordinates for our spatial data
  coordinates <- GetTissueCoordinates(seurat_obj)
  
  # Extract bin barcodes based on the defined x and y bounds
  cells <- coordinates %>% 
    subset(x > x_min) %>%
    subset(x < x_max) %>%
    subset(y > y_min) %>%
    subset(y < y_max) %>% 
    row.names()
  
  # Label a column called filter for TRUE
  seurat_obj$filter <- TRUE
  
  # Re-assign the value of FALSE to the filter column if the bin is within subsetted_bins
  seurat_obj@meta.data[cells, "filter"] <- FALSE
  
  # Retain the bins which have FALSE for filter in the meta data
  filtered_seurat_obj <- subset(seurat_obj, filter == FALSE)
  
  # Copy spatial images from original folder
  dir.create(path = paste0(path_bin, "/spatial"),
             recursive = TRUE)
  
  # Set path for JSON file
  path_json <- paste0("data/P5CRC_full/binned_outputs/square_", bin, "um/spatial/scalefactors_json.json")
  # Set path for Parquet file
  path_parq <- paste0("data/P5CRC_full/binned_outputs/square_", bin, "um/spatial/tissue_positions.parquet")
  # Set path for PNG file
  path_png  <- paste0("data/P5CRC_full/binned_outputs/square_", bin, "um/spatial/tissue_lowres_image.png")
  
  # Combine JSON, Parquet and PNG file paths into a vector
  files_spatial <- c(path_json, path_parq, path_png)

  # Copy JSON, Parquet and PNG files into our spatial folder
  file.copy(from = files_spatial,
            to = paste0(path_bin, "/spatial"))
  
  # Write counts to h5 file
  write10xCounts(path = paste0(path_bin, "/filtered_feature_bc_matrix.h5"),
                 x=LayerData(filtered_seurat_obj),
                 barcodes=colnames(filtered_seurat_obj),
                 gene.id=rownames(filtered_seurat_obj),
                 version="3",
                 type="HDF5",
                 overwrite = TRUE)
  
  # Remove seurat_obj
  rm(seurat_obj)
  # Remove filtered_seurat_obj
  rm(filtered_seurat_obj)
}

Inspecting the loop output

The directory structure after the loop has finished should look like:

data/P5CRC_subsetted/binned_outputs/
├── square_008um
│   ├── filtered_feature_bc_matrix.h5
│   └── spatial
│       ├── scalefactors_json.json
│       ├── tissue_lowres_image.png
│       └── tissue_positions.parquet
└── square_016um
    ├── filtered_feature_bc_matrix.h5
    └── spatial
        ├── scalefactors_json.json
        ├── tissue_lowres_image.png
        └── tissue_positions.parquet

Let’s read in the subsetted data that was created from our loop and the provided cropped data, then compare them to confirm that we have recreated the same data:

# Load subsetted data
subsetted_seurat_obj <- Load10X_Spatial(
  data.dir = "data/P5CRC_subsetted/",
  bin.size = 16,
  slice = "P5CRC")

# Load provided data
provided_seurat_object <- Load10X_Spatial(
  data.dir = "data/P5CRC_cropped/",
  bin.size = 16,
  slice = "P5CRC")

# Visual the subsetted data next to the provided data
SpatialFeaturePlot(subsetted_seurat_obj,
                   features = "nFeature_Spatial.016um", 
                     pt.size.factor = 16) +
SpatialFeaturePlot(provided_seurat_object,
                   features = "nFeature_Spatial.016um", 
                     pt.size.factor = 16)
Figure 3: Comparison of the subsetted data we created with the provided cropped data to ensure they look the same.

It looks like we have the same data, but let’s confirm by checking that the bins that are present in the data we just subset are also in the data that were provided:

# Check to ensure that both data sets have the same number of bins
all(colnames(subsetted_seurat_obj) == colnames(provided_seurat_object))
[1] TRUE

At this point we have successfully subsetted our data for two different bin sizes on the slide while performing appropriate checks along the way to ensure that our subsetting has been done correctly.

  1. The binned output not containing colorectal cancer bins comes from here. Using a loop and the parameters below, subset the non-cancerous slide for 8μm and 16μm bin sizes.
# Set sample name
sample_name <- "P5NAT"

# Set minimums and maximums for x and y
x_min <- 54000
x_max <- 61991
y_min <- 29000
y_max <- 39987