# subsetting_slide.R is for subsetting the spatial single-cell data utilized in the Introduction to Spatial Single-Cell RNA-seq workshop provided by the Harvard Chan Bioinformatics Core
# Written by: <Your name>
# Written on: <Date>
# Usage example:
# Rscript subsetting_slide.RSubsetting a Slide
Approximate time: 45 minutes
Learning Objectives
In this lesson, we will:
- Describe the coordinate systems for Visium HD data
- Subset a spatial scRNA-seq slide
Background
When organizing this workshop, spatial single-cell data can be quite large and the result is that analyzing it can be quite memory intensive. In order to focus on the techniques, methods and interpretation of results, we needed to reduce the size of our dataset to work better on our computers. We have already done these steps and provided you with the downstream output from this subsetting for our analysis, but we have created these materials should you want to know how to do this yourself.
Set-up
Open a new R Script by going File → New File → R Script. Save the file as subsetting_slide.R and place it within our scripts directory. Provide the R Script with some information about the script:
In order to subset the spatial single-cell data, we will need to load some packages:
# Load libraries
library(tidyverse)
library(Seurat)
library(DropletUtils)
library(cowplot)
library(png)
library(grid)
library(magick)
library(arrow)
library(curl)Downloading the data
The data for this workshop is hosted on the 10X Genomics website. We will need to download the binned output data and uncompress the directory within our data directory. Note that the binned output data is 16GB and may take several minutes to download.
# Set URL for where to download the data from
URL <- "https://cf.10xgenomics.com/samples/spatial-exp/3.0.0/Visium_HD_Human_Colon_Cancer_P5/Visium_HD_Human_Colon_Cancer_P5_binned_outputs.tar.gz"
# Create directory to hold full dataset
dir.create("data/P5CRC_full/")
# Set destination folder
dest <- file.path("data/P5CRC_full", basename(URL))
# Download data
curl_download(
url = URL,
destfile = dest,
mode = "wb")
# Uncompress tar.gz
untar(tarfile = dest,
exdir = "data/P5CRC_full")Loading the Seurat Object
We can start the subsetting process by loading in our entire dataset into a Seurat object called seurat_obj using the Load10X_Spatial() function:
# Load spatial data into Seurat object
seurat_obj <- Load10X_Spatial(
data.dir = "data/P5CRC_full/",
bin.size = 16,
slice = "P5")The parameters we are using are:
data.dir- The directory containing a subdirectory calledbinned_outputsthat holds the databin.size- The size in µm that we want to use for the binningslice- The name of the image
We can inspect this Seurat object with:
# Inspect the Seurat object
seurat_objAn object of class Seurat
18085 features across 136035 samples within 1 assay
Active assay: Spatial.016um (18085 features, 0 variable features)
1 layer present: counts
1 spatial field of view present: P5.016um
We can see that this object has 18085 genes 136035 bins.
Pixel Coordinate Space
Within this Seurat object, we also have coordinate information stored that we will need to use to subset our slide. In order to extract this coordinate information from the Seurat object, we will use the GetTissueCoordinates() function:
# Obtain coordinates for our spatial data
coordinates <- GetTissueCoordinates(seurat_obj)Let’s get an intuition for what this data looks like:
# Inspect the coordinates table
View(coordinates)We have subsetted the first five lines in the table below:
| x | y | cell | |
|---|---|---|---|
| s_016um_00052_00082-1 | 49099.34 | 60930.62 | s_016um_00052_00082-1 |
| s_016um_00396_00063-1 | 48072.16 | 40823.45 | s_016um_00396_00063-1 |
| s_016um_00297_00147-1 | 52957.09 | 46628.74 | s_016um_00297_00147-1 |
| s_016um_00347_00254-1 | 59222.16 | 43732.54 | s_016um_00347_00254-1 |
| s_016um_00299_00088-1 | 49509.67 | 46497.72 | s_016um_00299_00088-1 |
The x and y values correspond to the bin in pixel space that Seurat will be using.
We can compare the original image, the boundaries of the slide area and the coordinates obtained from the data using the code below:
# Read in and orient the low resolution image
lowres_image <- image_read("data/P5CRC_full/binned_outputs/square_016um/spatial/tissue_lowres_image.png") %>%
image_flip %>%
as.raster(interpolate = TRUE) %>%
rasterGrob()
# Read in and orient the detected tissue image
detected_image <- image_read("data/P5CRC_full/binned_outputs/square_016um/spatial/detected_tissue_image.jpg") %>%
as.raster(interpolate = TRUE) %>%
rasterGrob()
# Create plot with coordinates
coordinate_plot <- ggplot(coordinates) +
geom_point(aes(x = x,
y = y),
size = 0.0001,
color = "cornflowerblue")
# Plot three image images next to each other
plot_grid(lowres_image, detected_image, coordinate_plot,
nrow = 1,
rel_widths = c(0.3, 0.3, 0.6))Array Coordinate Space
Let’s dig a bit deeper into how the spatial data is stored. The storage of the data comes from a parquet file that is provided with the data. We can read this parquet file using the read_parquet() function from the arrow package:
# Read parquet file
parquet_df <- read_parquet("data/P5CRC_full/binned_outputs/square_016um/spatial/tissue_positions.parquet")We can inspect what this parquet data frame looks like with:
# Inspect parquet file
View(parquet_df)The first five lines will look like:
| barcode | in_tissue | array_row | array_col | pxl_row_in_fullres | pxl_col_in_fullres |
|---|---|---|---|---|---|
| s_016um_00000_00000-1 | 0 | 0 | 0 | 63949.78 | 44294.63 |
| s_016um_00000_00001-1 | 0 | 0 | 1 | 63950.03 | 44353.07 |
| s_016um_00000_00002-1 | 0 | 0 | 2 | 63950.27 | 44411.51 |
| s_016um_00000_00003-1 | 0 | 0 | 3 | 63950.51 | 44469.95 |
| s_016um_00000_00004-1 | 0 | 0 | 4 | 63950.75 | 44528.40 |
In order to get a sense for this table, let’s investigate each column:
barcode- This is the barcode for the bin on the slide. The format is:- [ident]_[bin_size]_[array_row]_[array_col]-[suffix] where:
ident- The identbin_size- The size of the binarray_row- The row number in the array for a given bin sizearray_col- The column number in the array for a given bin sizesuffix- An optional suffix reserved for technical reasons (usually is-1)
- [ident]_[bin_size]_[array_row]_[array_col]-[suffix] where:
in_tissue- A binary descriptor for whether Space Ranger believes this is in the tissue or not0- Not in the tissue1- In the tissue
array_row- The row number in the array for a given bin sizearray_col- The column number in the array for a given bin sizepxl_row_in_fullres- Location of row in pixel spacepxl_col_in_fullres- Location of column in pixel space
Scope of data
The first thing to note is that the parquet file holds information for all bins, regardless of whether they are in the tissue or not. The coordinate information that we retrieved the the Seurat object using the GetTissueCoordinates() function only contains bins that were scored for having tissue, or bins with the value of 1 in the parquet file.
We can confirm this by comparing the number of rows in the coordinate_df to the number of bins in the parquet file where the in_tissue value is equal to 1:
# Compare this value to the number of bins in the coordinate_df
nrow(coordinates)[1] 136035
# Obtain the number of bins that were scored for being in and out of the tissue in the parquet file
table(parquet_df$in_tissue)
0 1
39526 136035
From this we can see that Seurat has subsampled our dataset to only include the bins that Space Ranger detected as being within the tissue.
Drawing Connections Between the Two Coordinate Spaces
Within the parquet file the array_row and array_col are directly related to pxl_row_in_fullres and pxl_col_in_fullres, respectively. However the data has undergone an affine transformation within Space Ranger that considers a scale factor (how many pixels per bin), an offset and a rotation. In short, the array_row and array_col are associated with pxl_row_in_fullres and pxl_col_in_fullres and the parquet file is the key that ties them together.
Let’s look at the first bin in our coordinates object:
# Obtain first bin
first_bin <- coordinates$cell[1]
# Extract first bin from coordinates
coordinates[first_bin, ] x y cell
s_016um_00052_00082-1 49099.34 60930.62 s_016um_00052_00082-1
Let’s compare this to the same bin in the parquet file:
# Change the significant figures for rounding
options(pillar.sigfig = 7)
# Extract the bin from the parquest file that is the first bin from coordinates
parquet_df[parquet_df$barcode == first_bin,]# A tibble: 1 × 6
barcode in_tissue array_row array_col pxl_row_in_fullres pxl_col_in_fullres
<chr> <int> <int> <int> <dbl> <dbl>
1 s_016um_0… 1 52 82 60930.62 49099.34
We can observe that pxl_row_in_fullres in the parquet file is the y column from Seurat’s GetTissueCoordinates() function and pxl_col_in_fullres in the parquet file is the x column from Seurat’s GetTissueCoordinates() function.
Subsetting the Seurat Object
Now that we have an understanding of the coordinates, let’s go ahead and subset our coordinates based upon the pixel region we will like to focus on in this workshop:
# Set x and y bounds for subsetting image
x_min <- 52000
x_max <- 64319
y_min <- 57500
y_max <- 64002
# Extract bin barcodes based on the defined x and y bounds
subsetted_bins <- coordinates %>%
subset(x > x_min) %>%
subset(x < x_max) %>%
subset(y > y_min) %>%
subset(y < y_max) %>%
row.names()The output of this subsetting should be bin names that fit this criteria for x and y bounds. We can inspect the object with:
# Inspect subsetted bin names
head(subsetted_bins)[1] "s_016um_00050_00315-1" "s_016um_00064_00214-1" "s_016um_00101_00317-1"
[4] "s_016um_00049_00195-1" "s_016um_00032_00133-1" "s_016um_00061_00268-1"
We can confirm that this subsetting was done appropriately, by subsetting the coordinates object by our subsetted_bins names and checking the range for x and y.
# Subsetted x range
range(coordinates[subsetted_bins,]$x)[1] 52008.83 64304.74
# Subsetted y range
range(coordinates[subsetted_bins,]$y)[1] 57500.22 63987.15
Thus, we can confirm our pixels are all within the pixel range that we subsetted by, so we feel confident that our subsetting was appropriate.
Sanity checks like the above check may feel unnecessary but they are really important for good data analysis practices. It oftentimes takes less than a few moments to inspect and can save you lots of time downstream when you are debugging code.
Let’s go ahead and start subsetting our Seurat object. This will happen in three parts:
- Creating a new column,
filter, inmeta.dataof the Seurat object
# Label a column called filter for TRUE
seurat_obj$filter <- TRUE- Assigning the values of
TRUEorFALSEfor bins to be removed or kept, respectively
# Re-assign the value of FALSE to the filter column if the bin is within subsetted_bins
seurat_obj@meta.data[subsetted_bins, "filter"] <- FALSE- Subsetting the Seurat object based about the data in the
filtercolumn ofmeta.data
# Retain the bins which have FALSE for filter in the meta data
filtered_seurat_obj <- subset(seurat_obj, filter == FALSE)Just like we did previously when subsetting coordinates, let’s confirm that we are subsetting the correct bins.
# Inspect the range of x and y coordinates to ensure they match our expectation
GetTissueCoordinates(filtered_seurat_obj) %>%
reframe(
stat = c("min", "max"),
x = range(x),
y = range(y)
) %>%
as.data.frame() %>%
column_to_rownames(var = "stat") x y
min 52008.83 57500.22
max 64304.74 63987.15
Last, we can use the SpatialFeaturePlot() function from Seurat to compare the before and after of subsetting.
# Visual the original image and the subsetted image next to each other
SpatialFeaturePlot(seurat_obj,
features = "nFeature_Spatial.016um",
pt.size.factor = 16) +
SpatialFeaturePlot(filtered_seurat_obj,
features = "nFeature_Spatial.016um",
pt.size.factor = 16)Writing Subsetted Data Directory
It looks like we have correctly subsetted our data and now we are ready to create a directory to hold our data. This will make it easier in the future to work with our data, because we can use the Load10X_Spatial() function from Seurat to read in our subsetted data rather than performing the subsetting again.
We are going to start by creating a directory structure to hold our subsetted data:
dir.create(path = "data/subsetted_data/binned_outputs/square_016um/spatial/", recursive = TRUE)The directory structure should look like:
data/subsetted_data/
└── binned_outputs
└── square_016um
└── spatial
Now we will populate this directory with the files needed to be able to run the Load10X_Spatial() function in the future.
# Set path for JSON file
path_json <- paste0("data/P5CRC_full/binned_outputs/square_016um/spatial/scalefactors_json.json")
# Set path for Parquest file
path_parq <- paste0("data/P5CRC_full/binned_outputs/square_016um/spatial/tissue_positions.parquet")
# Set path for PNG file
path_png <- paste0("data/P5CRC_full/binned_outputs/square_016um/spatial/tissue_lowres_image.png")
# Combine JSON, Parquet and PNG file paths into a vector
files_spatial <- c(path_json, path_parq, path_png)
# Copy JSON, Parquet and PNG files into our spatial folder
file.copy(from = files_spatial,
to = "data/subsetted_data/binned_outputs/square_016um/spatial")
# Write counts to h5 file
write10xCounts("data/subsetted_data/binned_outputs/square_016um/filtered_feature_bc_matrix.h5",
x=LayerData(filtered_seurat_obj),
barcodes=colnames(filtered_seurat_obj),
gene.id=rownames(filtered_seurat_obj),
version="3",
type="HDF5",
overwrite = TRUE)After writing and copying the files the file directory should look like this:
data/subsetted_data/
└── binned_outputs
└── square_016um
├── filtered_feature_bc_matrix.h5
└── spatial
├── scalefactors_json.json
├── tissue_lowres_image.png
└── tissue_positions.parquet
Automate subsetting by looping through bins
Now we have processed a single bin (16μm). However, we would like to process our data for two bin sizes (8μm and 16μm). This process can be automated by using the same process as above and looping through both bin sizes. We will first process the slide with the colorectal cancer, then we will leave the non-cancerous slide as an exercise
Setting variables
We are going to set-up the bin sizes that we would like to loop through and also set our boundaries for where to subset our sample.
# Set sample name
sample_name <- "P5CRC"
# Set minimums and maximums for x and y
x_min <- 52000
x_max <- 64319
y_min <- 57500
y_max <- 64002
# Selecting the bin sizes to use
bins <- c("008", "016")Next, we are going to create a space to hold our subsetted data:
# Create directory to hold subsetted data
dir.create(paste0("data/", sample_name, "_subsetted/binned_outputs/"),
recursive = TRUE)The code below will allow us to follow the steps we’ve done previously and loop through the different bin sizes, adding them to the directory called P5CRC_subsetted. Note that we are calling this P5CRC_subsetted rather than P5CRC_cropped like you downloaded, so that you can differentiate between the files you are subsetting and the ones provided.
# For each bin size
for (bin in bins) {
# Set path to write data to
path_bin <- paste0("data/", sample_name, "_subsetted/binned_outputs/square_", bin, "um")
dir.create(path = path_bin,
recursive = TRUE)
# Load object
seurat_obj <- Load10X_Spatial(
data.dir = "data/P5CRC_full/",
bin.size = c(as.integer(bin)),
slice = sample_name)
# Obtain coordinates for our spatial data
coordinates <- GetTissueCoordinates(seurat_obj)
# Extract bin barcodes based on the defined x and y bounds
cells <- coordinates %>%
subset(x > x_min) %>%
subset(x < x_max) %>%
subset(y > y_min) %>%
subset(y < y_max) %>%
row.names()
# Label a column called filter for TRUE
seurat_obj$filter <- TRUE
# Re-assign the value of FALSE to the filter column if the bin is within subsetted_bins
seurat_obj@meta.data[cells, "filter"] <- FALSE
# Retain the bins which have FALSE for filter in the meta data
filtered_seurat_obj <- subset(seurat_obj, filter == FALSE)
# Copy spatial images from original folder
dir.create(path = paste0(path_bin, "/spatial"),
recursive = TRUE)
# Set path for JSON file
path_json <- paste0("data/P5CRC_full/binned_outputs/square_", bin, "um/spatial/scalefactors_json.json")
# Set path for Parquet file
path_parq <- paste0("data/P5CRC_full/binned_outputs/square_", bin, "um/spatial/tissue_positions.parquet")
# Set path for PNG file
path_png <- paste0("data/P5CRC_full/binned_outputs/square_", bin, "um/spatial/tissue_lowres_image.png")
# Combine JSON, Parquet and PNG file paths into a vector
files_spatial <- c(path_json, path_parq, path_png)
# Copy JSON, Parquet and PNG files into our spatial folder
file.copy(from = files_spatial,
to = paste0(path_bin, "/spatial"))
# Write counts to h5 file
write10xCounts(path = paste0(path_bin, "/filtered_feature_bc_matrix.h5"),
x=LayerData(filtered_seurat_obj),
barcodes=colnames(filtered_seurat_obj),
gene.id=rownames(filtered_seurat_obj),
version="3",
type="HDF5",
overwrite = TRUE)
# Remove seurat_obj
rm(seurat_obj)
# Remove filtered_seurat_obj
rm(filtered_seurat_obj)
}Inspecting the loop output
The directory structure after the loop has finished should look like:
data/P5CRC_subsetted/binned_outputs/
├── square_008um
│ ├── filtered_feature_bc_matrix.h5
│ └── spatial
│ ├── scalefactors_json.json
│ ├── tissue_lowres_image.png
│ └── tissue_positions.parquet
└── square_016um
├── filtered_feature_bc_matrix.h5
└── spatial
├── scalefactors_json.json
├── tissue_lowres_image.png
└── tissue_positions.parquet
Let’s read in the subsetted data that was created from our loop and the provided cropped data, then compare them to confirm that we have recreated the same data:
# Load subsetted data
subsetted_seurat_obj <- Load10X_Spatial(
data.dir = "data/P5CRC_subsetted/",
bin.size = 16,
slice = "P5CRC")
# Load provided data
provided_seurat_object <- Load10X_Spatial(
data.dir = "data/P5CRC_cropped/",
bin.size = 16,
slice = "P5CRC")
# Visual the subsetted data next to the provided data
SpatialFeaturePlot(subsetted_seurat_obj,
features = "nFeature_Spatial.016um",
pt.size.factor = 16) +
SpatialFeaturePlot(provided_seurat_object,
features = "nFeature_Spatial.016um",
pt.size.factor = 16)It looks like we have the same data, but let’s confirm by checking that the bins that are present in the data we just subset are also in the data that were provided:
# Check to ensure that both data sets have the same number of bins
all(colnames(subsetted_seurat_obj) == colnames(provided_seurat_object))[1] TRUE
At this point we have successfully subsetted our data for two different bin sizes on the slide while performing appropriate checks along the way to ensure that our subsetting has been done correctly.
- The binned output not containing colorectal cancer bins comes from here. Using a loop and the parameters below, subset the non-cancerous slide for 8μm and 16μm bin sizes.
# Set sample name
sample_name <- "P5NAT"
# Set minimums and maximums for x and y
x_min <- 54000
x_max <- 61991
y_min <- 29000
y_max <- 39987