Contributors: Mary Piper and Meeta Mistry

Approximate time: 1.25 hours

Learning Objectives

Lightweight alignment and quantification of gene expression

Now that we have explored the quality of our raw reads, we can move on to quantifying expression at the transcript level. The goal of this step is to identify from which transcript each of the reads originated from and the total number of reads associated with each transcript.

Tools that have been found to be most accurate for this step in the analysis are referred to as lightweight alignment tools, which include Kallisto, Sailfish and Salmon; each working slightly different from one another. We will focus on Salmon for this workshop, which is the successor of Sailfish. However, Kallisto is an equally good choice with similar performance metrics for speed and accuracy.

Common to all of these tools is that base-to-base alignment of the reads is avoided, which is the time-consuming step of older splice-aware alignment tools such as STAR and HISAT2. These lightweight alignment tools provide quantification estimates much faster than older tools (typically more than 20 times faster) with improvements in accuracy [1]. These transcript expression estimates, often referred to as ‘pseudocounts’ or ‘abundance estimates’, can be aggregated to the gene level for use with differential gene expression tools like DESeq2 or the estimates can be used directly for isoform-level differential expression using a tool like Sleuth.

Salmon

Salmon uses the reference transcriptome (in FASTA format) and raw sequencing reads (in FASTQ format) as input to perform both mapping and quantification of the reads.

The “quasi-mapping” approach utilized by Salmon requires a reference index to determine the position and orientation information for where the fragments best map prior to quantification [2]. The reference index essentially provides the transcriptome in a format that is easily and rapidly searchable. Therefore, it will allow us to quickly find the positions in the transcriptome where each of the reads originated.

NOTE: Since we are searching against the transcriptome, Salmon would not be the appropriate tool to use if trying to detect novel genes or isoforms, intron retention events, or other methods that require annotations not present in the transcriptome.

Image credit: RNA-Seq Blog

Creating the transcriptome index

This step involves creating an index to evaluate the sequences for all possible unique sequences of length k (k-mer) in the transcriptome, which includes all known transcripts/ splice isoforms for all genes.

The index helps creates a signature for each transcript in our reference transcriptome. The Salmon index has two components:

To create the transcriptome index with Salmon, let’s start an interactive session and create a new directory in our results folder for the Salmon output:

$ srun --pty -p interactive -t 0-12:00 --mem 8G --reservation=HBC /bin/bash

$ mkdir ~/rnaseq/results/salmon

$ cd ~/rnaseq/results/salmon

Salmon is not available as a module on O2, but it is installed as part of the bcbio pipeline. Therefore, we need to add the appropriate path (/n/app/bcbio/tools/bin/) in our $PATH variable so that we can use it by simply typing in salmon. This directory contains executables for many tools useful for NGS analysis. We can add this location by including an export command to do this at the end of the .bashrc file, this will make it so that when you start a new shell session the location will always be in your path.

Open the .bashrc file using vim and at the end of the file add the export command that adds a specific location to the list in $PATH.

$ vim ~/.bashrc

# at the end of the file type in the following:
export PATH=/n/app/bcbio/tools/bin:$PATH

# Don't forget the ":" between the full path and the "$PATH"!

Exit vim and re-read the .bashrc file using the source command:

$ source ~/.bashrc

$ echo $PATH

Now, we could create the index using the salmon index command as detailed below; however, we are not going to run this in class as it can take a few minutes to run. The parameters for the indexing step are as follows:

## DO NOT RUN THIS CODE
$ salmon index \
-t /n/groups/hbctraining/rna-seq_2019_02/reference_data/Homo_sapiens.GRCh38.cdna.all.fa \
-i salmon_index \
-k 31

NOTE: Default for salmon is -k 31, so we do not need to include these parameters in the index command. However, the kmer default of 31 is optimized for 75bp or longer reads, so if your reads are shorter, you may want a smaller kmer to use with shorter reads (kmer size needs to be an odd number).

Accessing transcriptome data: We generated the index from transcript sequences for human obtained from the Ensembl ftp site with the following commands:

# Download from the FTP server
$ wget ftp://ftp.ensembl.org/pub/release-95/fasta/homo_sapiens/cdna/Homo_sapiens.GRCh38.cdna.all.fa.gz

# Decompress the FASTA file
$ gzip -d Homo_sapiens.GRCh38.cdna.all.fa.gz

Exercise:

In your RNA-seq experiment, you expressed a GFP transgene in your mice, and you would like to quantify the expression of GFP. We know that the sequence is not present in the human transcriptome. What would you do?


Quasi-mapping and quantification

The quasi-mapping approach estimates where the reads best map to on the transcriptome through identifying where informative sequences within the read map to instead of performing base-by-base alignment. The quasi-mapping approach is described below, with details provided by the Rapmap tool [3], which provides the underlying algorithm for the quasi-mapping.

Salmon output

You should see a new directory has been created that is named by the string value you provided in the -o command. Take a look at what is contained in this directory:

$ ls -l Mov10_oe_1.subset.salmon/

There is a logs directory, which contains all of the text that was printed to screen as Sailfish was running. Additionally, there is a file called quant.sf.

This is the quantification file in which each row corresponds to a transcript, listed by Ensembl ID, and the columns correspond to metrics for each transcript:

Name    Length  EffectiveLength TPM     NumReads
ENST00000632684.1       12      3.00168 0       0
ENST00000434970.2       9       2.81792 0       0
ENST00000448914.1       13      3.04008 0       0
ENST00000415118.1       8       2.72193 0       0
ENST00000631435.1       12      3.00168 0       0
ENST00000390567.1       20      3.18453 0       0
ENST00000439842.1       11      2.95387 0       0

....

Running Salmon on multiple samples

We just ran Salmon on a single sample, but we would like to run this on all samples. To do so, we will need to create a job submission script.

NOTE: We are iterating over FASTQ files in the raw_data directory.

Create a job submission script to run Salmon in serial

Since Salmon is only able to take a single file as input, one way in which we can do this is to use a ‘for loop’ to run Salmon on all samples in serial. What this means is that Salmon will process the dataset one sample at a time.

Let’s start by opening up a script in vim:

$ vim salmon_all_samples.sbatch

Let’s start our script with a shebang line followed by SBATCH directives which describe the resources we are requesting from O2. We will ask for 6 cores and take advantage of Salmon’s multi-threading capabilities.

Next we can create a for loop to iterate over all FASTQ samples. Inside the loop we will create a variable that stores the prefix we will use for naming output files, then we run Salmon.

The final script is shown below:

#!/bin/bash

#SBATCH -p short 
#SBATCH -c 6 
#SBATCH -t 0-12:00 
#SBATCH --mem 8G 
#SBATCH --job-name salmon_in_serial 
#SBATCH -o %j.out 
#SBATCH -e %j.err
#SBATCH --reservation=HBC

cd ~/rnaseq/results/salmon

for fq in ~/rnaseq/raw_data/*.fq

do

# create a prefix
base=`basename $fq .fq`

# run salmon
salmon quant -i /n/groups/hbctraining/rna-seq_2019_02/reference_data/salmon.ensembl38.idx.09-06-2019 \
 -l A \
 -r $fq \
 -p 6 \
 -o $base.salmon \
 --seqBias \
 --useVBOpt \
 --numBootstraps 30 \
 --validateMappings

done

Note, that we are adding a couple of new parameters:

NOTE: --numBootstraps is necessary if performing isoform-level differential expression analysis with Sleuth, but not for gene-level differential expression analysis. Due to the statistical procedure required to assign reads to gene isoforms, in addition to the random processes underlying RNA-Seq, there will be technical variability in the abundance estimates output from the pseudo-alignment tool [2, 3] for the isoform level abundance estimates (not necessary for gene-level estimates). Therefore, we would need technical replicates to distinguish technical variability from the biological variability for gene isoforms.

The bootstraps estimate technical variation per gene by calculating the abundance estimates for all genes using a different sub-sample of reads during each round of bootstrapping. The variation in the abundance estimates output from each round of bootstrapping is used for the estimation of the technical variance for each gene.

Save and close the script. This is now ready to run.

$ sbatch salmon_all_samples.sbatch

This lesson has been developed by members of the teaching team at the Harvard Chan Bioinformatics Core (HBC). These are open access materials distributed under the terms of the Creative Commons Attribution license (CC BY 4.0), which permits unrestricted use, distribution, and reproduction in any medium, provided the original author and source are credited.