Seurat Object from Counts File

R Programming

Single-cell RNA-seq

This lesson introduces participants to creating a Seurat object starting from raw single‑cell RNA‑seq count files. Participants learn how gene, barcode and matrix files are structured, how to load them into R and how to generate a sparse count matrix to improve efficiency when working with large datasets.

Authors

Mary Piper

Lorena Pantano

Meeta Mistry

Radhika Khetani

Jihe Liu

Amélie Julé

Will Gammerdinger

Noor Sohail

Published

May 30, 2025

Keywords

R, Seurat, sparse matrix, readMM

Creating count data object

Generally, all single-cell RNA-seq datasets, regardless of technology or pipeline, will contain three files:

a file with the gene IDs, representing all genes quantified
a file with the cell IDs, representing all cells quantified
a matrix of counts per gene for every cell

We can explore these files by clicking on the data/ctrl_raw_feature_bc_matrix folder:

barcodes.tsv: cellular barcodes present in dataset

features.tsv: IDs of quantified genes

matrix.mtx: a matrix of count values, where rows are associated with the gene IDs above and columns correspond to the cellular barcodes. Note that there are many zero values in this matrix.

We can create a count matrix using these files. However, instead of creating a standard count matrix, we will create a sparse matrix to improve the amount of space, memory and CPU required to work with our huge count matrix.

library(Matrix)
library(readr)

We will use readMM() function from the Matrix package to turn our standard matrix into a sparse matrix. The genes.tsv file should correspond to the genes or row names of the matrix, while barcodes.tsv corresponds to the cells or columns.

# Read in `matrix.mtx`
counts <- readMM("data/ctrl_raw_feature_bc_matrix/matrix.mtx.gz")

# Read in `genes.tsv`
genes <- read_tsv("data/ctrl_raw_feature_bc_matrix/features.tsv.gz",
                  col_names = FALSE)
gene_ids <- genes$X1

# Read in `barcodes.tsv`
cell_ids <- read_tsv("data/ctrl_raw_feature_bc_matrix/barcodes.tsv.gz",
                     col_names = FALSE)
cell_ids <- cell_ids$X1

Then we can add row names to the count matrix to be the gene IDs and the column names of the count matrix to be the cell IDs.

# Make the column names as the cell IDs and the row names as the gene IDs
rownames(counts) <- gene_ids
colnames(counts) <- cell_ids

We could use this data for downstream QC analysis. However, this would take a long time if we had multiple samples. A quicker way to load multiple samples is to use the Seurat R package, which has a specific function for reading in 10X data, called read10X().

Note

If using other droplet-based methods for library preparation, the above method would be needed to perform the QC. We have additional materials available based on creation of the count matrix in this way.

Reuse

CC-BY-4.0

--- title: "Seurat Object from Counts File" description: | This lesson introduces participants to creating a Seurat object starting from raw single‑cell RNA‑seq count files. Participants learn how gene, barcode and matrix files are structured, how to load them into R and how to generate a sparse count matrix to improve efficiency when working with large datasets. author: - Mary Piper - Lorena Pantano - Meeta Mistry - Radhika Khetani - Jihe Liu - Amélie Julé - Will Gammerdinger - Noor Sohail date: "2025-05-30" categories: - R Programming - Single-cell RNA-seq keywords: - R - Seurat - sparse matrix - readMM license: "CC-BY-4.0" editor_options: markdown: wrap: 72 --- ### Creating count data object Generally, all single-cell RNA-seq datasets, regardless of technology or pipeline, will contain **three files**: 1. a file with the **gene IDs**, representing all genes quantified 2. a file with the **cell IDs**, representing all cells quantified 3. a **matrix of counts** per gene for every cell We can explore these files by clicking on the `data/ctrl_raw_feature_bc_matrix` folder: - **`barcodes.tsv`:** cellular barcodes present in dataset <img src="../img/cell_ids_new.png" width="180"> - **`features.tsv`:** IDs of quantified genes <img src="../img/genes.png" width="300"> - **`matrix.mtx`:** a matrix of count values, where rows are associated with the gene IDs above and columns correspond to the cellular barcodes. Note that there are many zero values in this matrix. <img src="../img/sparse_matrix.png"> We can create a count matrix using these files. However, instead of creating a standard count matrix, we will create a **sparse matrix** to improve the amount of space, memory and CPU required to work with our huge count matrix. ```{r} library(Matrix) library(readr) ``` We will use `readMM()` function from the **Matrix** package to turn our standard matrix into a sparse matrix. The `genes.tsv` file should correspond to the genes or row names of the matrix, while `barcodes.tsv` corresponds to the cells or columns. ```{r} # Read in `matrix.mtx` counts <- readMM("data/ctrl_raw_feature_bc_matrix/matrix.mtx.gz") # Read in `genes.tsv` genes <- read_tsv("data/ctrl_raw_feature_bc_matrix/features.tsv.gz", col_names = FALSE) gene_ids <- genes$X1 # Read in `barcodes.tsv` cell_ids <- read_tsv("data/ctrl_raw_feature_bc_matrix/barcodes.tsv.gz", col_names = FALSE) cell_ids <- cell_ids$X1 ``` Then we can add row names to the count matrix to be the gene IDs and the column names of the count matrix to be the cell IDs. ```{r} # Make the column names as the cell IDs and the row names as the gene IDs rownames(counts) <- gene_ids colnames(counts) <- cell_ids ``` We could use this data for downstream QC analysis. However, this would take a long time if we had multiple samples. A quicker way to load multiple samples is to use the Seurat R package, which has a specific function for reading in 10X data, called `read10X()`. ::: callout-note If using other droplet-based methods for library preparation, the above method would be needed to perform the QC. We have [additional materials](https://hbctraining.github.io/In-depth-NGS-Data-Analysis-Course/sessionIV/lessons/SC_quality_control_analysis.html) available based on creation of the count matrix in this way. :::