Approximate time: 30 minutes

Learning Objectives:

Describe how cellranger is run and what the ouputs are
Review the cellranger generated QC report (web summary HTML)
Create plots with cellranger metrics

Single-cell RNA-seq: Quality control of Cellranger output

Cellranger

Cellranger is a tool created by the company 10x to process single-cell sequencing experiments that were processed with their kits.

The algorithm for the single-cell RNA-seq (scRNA) version of cellranger is described by 10x as follows:

Image credit: 10x

The main elements of this pipeline are as follows:

Align FASTQ reads against a reference genome
Filter low quailty reads and correct cell barcodes/UMIs
Collapse on PCR duplicates using UMIs
Generate raw counts matrix
Identify low quality cells to generate a filtered counts matrix

While the focus of this workshop is scRNA, we also want to point out that there are other cellranger softwares and modes for different types of single-cell experiments.

Experiment	Experiment description	10x tool
RNA	RNA	cellranger count
ATAC	ATAC	cellranger-atac count
Multiome	RNA + ATAC	cellranger-arc count
V(D)J	Clonotyping of T and B cells	cellranger vdj
Hashtagging	Antibody/oligo tags to differentiate cells after pooling	cellranger multi

Running Cellranger on O2

Running cellranger requires a lot of time and computational resources in order to process a single sample. Therefore, having access to a High Performance Computing (HPC) cluster is necessary to run it. Some sequencing cores will automatically process samples with cellranger and provide the outputs to you.

Note that prior to this step, you must have a cellranger compatible reference genome generated. If you are working with mouse or human, 10x has pre-generated the reference which can be downloaded from their website for use. If you are using another organism, cellranger has a mode called mkref which will generate everything needed for a reference from the files you supply (GTF and fasta).

Here we are showing an example of how to run cellranger count on Harvard’s O2 HPC using SLURM. To run this script, you will have add additional information, such as:

The name of the project (the results will be placed in a folder of the same name)
Path to the FASTQ files from your experiment
Path to the reference genome

In the following example script, you would just have to change the variable specified in the “Inputs for cellranger” section on eth 10x support site. We have already provided some optimal parameters in terms of runtime and memory for running cellranger count.

You do not need to run this script.

#!/bin/bash

#SBATCH --partition=short               # Partition name
#SBATCH --time=0-06:00                  # Runtime in D-HH:MM format
#SBATCH --nodes=1                       # Number of nodes (keep at 1)
#SBATCH --ntasks=1                      # Number of tasks per node (keep at 1)
#SBATCH --cpus-per-task=16              # CPU cores requested per task (change for threaded jobs)
#SBATCH --mem=64G                       # Memory needed per node (total)
#SBATCH --error=jobid_%j.err            # File to which STDERR will be written, including job ID
#SBATCH --output=jobid_%j.out           # File to which STDOUT will be written, including job ID
#SBATCH --mail-type=ALL                 # Type of email notification (BEGIN, END, FAIL, ALL)

module load gcc
module load cellranger/7.1.0

local_cores=16
local_mem=64

# Inputs for cellranger
project_name=""                         # Name of output
path_fastq="/path/to/fastq/"             # Path to folder with FASTQ files for one sample
path_ref="/path/to/reference/"           # Path to cellranger compatible reference


cellranger count \
    --id=${project_name} \
    --fastqs=${path_fastq} \
    --transcriptome=${path_ref} \
    --localcores=${local_cores} \
    --localmem=${local_mem}

Cellranger outs

Once cellranger has finished running, there will be a folder titled outs/ in a directory named after the project_name variable set above. Generation of all the following files is expected from a succesful completion of the cellranger counts pipeline:

├── cloupe.cloupe
├── filtered_feature_bc_matrix
│   ├── barcodes.tsv.gz
│   ├── features.tsv.gz
│   └── matrix.mtx.gz
├── filtered_feature_bc_matrix.h5
├── metrics_summary.csv
├── molecule_info.h5
├── possorted_genome_bam.bam
├── possorted_genome_bam.bam.bai
├── raw_feature_bc_matrix
│   ├── barcodes.tsv.gz
│   ├── features.tsv.gz
│   └── matrix.mtx.gz
├── raw_feature_bc_matrix.h5
└── web_summary.html

Web summary HTML Report

The Web Summary HTML file is a great resource for looking at the basic quality of your sample before starting on an analysis. 10x has a page describing each metric in depth. There are two pages/tabs included in a scRNA report titled “Summary” and “Gene Expression”.

We have included these Web Summary files for the control and stimulated samples as links below. You can download each, and move the HTML to your project data folder:

Control sample report
Stimulated sample report

NOTE: Some of the values in these reports will be slightly different from current standards, as these samples were generated using the version 1 chemistry kit and optimization have been made since then.

Summary

At the top of the “Summary” tab, under the “Alerts” header, will be a list of warnings and messages on the quality/important information about the sample. These messages are very informative on what may have gone wrong with the sample or other flags that can be set in the cellranger count run to gain better results.

Underneath the “Alerts” header, in green text, are the estimated number of high quality cells in the sample, average reads per cells, and median genes per cell. The number of cells will vary depending on how many were loaded in sample preparation, but some general recommendations are provided below:

500 cells is the lower limit for a good quality sample.
10x also recommends a minimum of 20,000 reads per cell on average.
The median genes per cell varies widely across samples as it depends on sequencing depth and cell type, making it difficult to establish a good minimal value.

The remaining 4 sections include various metrics that describe the overall quality of the sample. Note that clicking on the grey question mark will show more detailed explanations.

Sequencing

Includes information such as the total number of reads and how many of those reads did not meet the length requirements. Additionally, since all barcodes and UMIs are known values (from the kit used to prep scRNA experiments), we can evaluate what percentage of the barcodes and UMIs belong to that whitelist and are valid.

Ideally, you would like to see >75% for almost all of these values since lower values are indicative of a low quality sequencing run or bad sample quality.

Mapping

Percentage of reads that map to different regions of the reference genome as reported by STAR.

The percent of reads mapped to the genome should be on the higher end, around 85% or higher. Values that are very low could indicate that the reference genome supplied was incorrect or that the sample was problematic. Otherwise, the expectation for a scRNA runs is that the majority of reads will belong to exonic regions. If nuclei were used instead of whole cells, the percentage of reads mapping to intronic regions will be higher (~45%).

Cells

Here we can see what an ideal representation of the Barcode Rank Plot looks like. The cells are sorted by the number of UMIs found in the cell to differentiate empty droplets/low quality cells (background) from actual cells.

Image credit: 10x

The shape of these plots can indicate a few different things about the sample:

Typical: Clear cliff and knee with separation between cells and background.
Heterogeneous: Bimodal plot with 2 cliffs and knees, with a clear divide between cells and background.
Compromised: Round curve with a steep drop-off at the end whih indicated low quality due to many factors.
Compromised: Defined cliff and knee, but with few barcodes detected could be due to inaccurate cell count or clogging.

This section additionally describes averages and medians for number of genes and reads in the sample.

Sample

The sample section contains important metadata used by cellranger, such as what the Sample ID and the path used for the reference. The chemistry version (which 10x kit was used) and intron flags are also stored here. This information is useful for reproducibility reasons, as the version of cellranger used is also kept.

Gene Expression

The “Gene Expression” table contains information downstream of the basic QC, such as:

t-SNE Projection

Dotplot showing the t-SNE projection of filtered cells colored by UMI counts and clusters. The report allows you select various values of K for the K-means clustering, showing different groupings that can be generated from the data.

Later in the workshop we will spend more time on the intricacies of clustering. The requirements for this QC report would be to see clear separation of cells into groups with defined clusters - representing different cell types.

Top Features by Cluster

This table shows the log2 fold-change and p-value for each gene and cluster after a differential expression analysis is run.

These top genes per cluster can give a brief peek into the cell type distribution of the sample. If no expected cell type marker genes appear or mitochondrial/ribosomal genes show up frequently, this can be indicative of something wrong with the sample.

Sequencing Saturation and Median Genes per Cell

The sequencing saturation plot is a measure of library complexity. In scRNA, more genes can be detected with higher sequencing depth. At a point, you reach sequencing saturation where you do not gain any more meaningful insights which is what the dotted line represents here.

Similar to the sequencing saturation plot, looking at the median gene per cells against mean reads per cell will indicate if your have over or under-sequenced. The slope near the endpoint can be used to determine how much benefit would be gained from sequencing more deeply.

Metrics evaluation

Many of the core pieces of information from the web summary are stored in the metrics_summary.csv. As this is a csv file, we can read it into R and generate plots to include in reports on the general quality of the samples.

We have included these csv files for the control and stimulated samples as links below. You can right-click on the link and “Save as…” into your project data folder:

Control sample metrics.csv file
Stimulated sample metrics.csv file

First, to read the files in:

# Names of samples (same name as folders stored in data)
samples <- c("ctrl", "stim")

# Loop over each sample and read the metrics summary in
metrics <- list()
for (sample in samples) {
    path_csv <- paste0("data/", sample, "_metrics_summary.csv")
    df <- read.csv(path_csv)
    rownames(df) <- sample
    metrics[[sample]] <- df
}
# Concatenate each sample metrics together
metrics <- ldply(metrics, rbind)
# Remove periods and percentags to make the values numeric
metrics <- metrics %>%
    column_to_rownames(".id") %>%
    mutate_all(funs(parse_number(str_replace(., ",", "")))) %>%
    mutate_all(funs(parse_number(str_replace(., "%", ""))))
metrics$sample <- rownames(metrics)

The information available in this file include:

 [1] "Estimated.Number.of.Cells"                     
 [2] "Mean.Reads.per.Cell"                           
 [3] "Median.Genes.per.Cell"                         
 [4] "Number.of.Reads"                               
 [5] "Valid.Barcodes"                                
 [6] "Sequencing.Saturation"                         
 [7] "Q30.Bases.in.Barcode"                          
 [8] "Q30.Bases.in.RNA.Read"                         
 [9] "Q30.Bases.in.UMI"                              
[10] "Reads.Mapped.to.Genome"                        
[11] "Reads.Mapped.Confidently.to.Genome"            
[12] "Reads.Mapped.Confidently.to.Intergenic.Regions"
[13] "Reads.Mapped.Confidently.to.Intronic.Regions"  
[14] "Reads.Mapped.Confidently.to.Exonic.Regions"    
[15] "Reads.Mapped.Confidently.to.Transcriptome"     
[16] "Reads.Mapped.Antisense.to.Gene"                
[17] "Fraction.Reads.in.Cells"                       
[18] "Total.Genes.Detected"                          
[19] "Median.UMI.Counts.per.Cell"                    
[20] "sample"

With all of this information available as a dataframe, we can use ggplot to visualize these values. As an example of how this information can be used, we can display what percentage of reads map to the various parts of the genome (Intergentic, Intronic, and Exonic).

# Columns of interest
cols <- c("Reads.Mapped.Confidently.to.Intergenic.Regions",
          "Reads.Mapped.Confidently.to.Intronic.Regions",
          "Reads.Mapped.Confidently.to.Exonic.Regions",
          "sample")

# Data wrangling to sculpt dataframe in a ggplot friendly manner
df <- metrics %>%
    select(cols) %>%
    melt() %>%
    mutate(variable = str_replace_all(variable, "Reads.Mapped.Confidently.to.", "")) %>%
    mutate(variable = str_replace_all(variable, ".Regions", ""))

# ggplot code to make a barplot
df %>% ggplot() +
    geom_bar(
        aes(x = sample, y = value, fill = variable),
        position = "stack",
        stat = "identity") +
    coord_flip() +
    labs(
        x = "Sample",
        y = "Percentage of Reads",
        title = "Percent of Reads Mapped to Each Region",
        fill = "Region")

Matrix folders

The most important files that are generated during this cellranger run are the two matrix folders, which contain the count matrices from the experiment:

raw_feature_bc_matrix
filtered_feature_bc_matrix

In the previous lesson, we used raw_feature_bc_matrix to load the counts into Seurat. You can similarly do the same with filtered_feature_bc_matrix, the difference being that the filtered matrix has removed cells that cellranger determined as low quality using a variety of different tools. We chose to start with the raw counts matrix in this lesson so that you can better see what metrics are used to determine which cells are considered high quality.

This lesson has been developed by members of the teaching team at the Harvard Chan Bioinformatics Core (HBC). These are open access materials distributed under the terms of the Creative Commons Attribution license (CC BY 4.0), which permits unrestricted use, distribution, and reproduction in any medium, provided the original author and source are credited.