Skip to the content.

Approximate time: 50 minutes

Learning Objectives:

Quality Control

After running Salmon, we now have transcript-level abundance estimates for each of our samples, however we don’t have a good way to be able to determine the following:

In order for us to assess these metrics we need the genomic coordinates of where each read maps. This information can be generated by mapping (or aligning) reads to the whole genome using genome alignment tools, and can then be used as input to tools that can generate the desired metrics.

Qualimap

We will be using Qualimap, a Java application, which provides an overall view of the data quality in an HTML report. The report is comprehensive with mapping statistics summarized along with useful figures and tab-delimited files of metrics data. We will run Qualimap on a single sample, and then take a detailed look at the report and metrics in comparison to what we would expect for good data.

Before we can run Qualimap, we need to use a genome alignment tool to get the genomic coordinates of where the reads are aligning to the genome. Most aligners that work with NGS data generate a very specific file format wherein all the alignment results are organized systematically. This file format is called SAM/BAM (see below).

Alignment file format: SAM/BAM

The Sequence Alignment Map format (SAM) file is a tab-delimited text file that contains all information from the FASTQ file, with additional fields containing alignment information for each read. Specifically, we can obtain the genomic coordinates of where each read maps to in the genome and the quality of that mapping. A BAM file is the binary, compressed version of the SAM file. It is significantly smaller in size and is usually the file format requested for by downstream tools that require alignment data as input. The paper by Heng Li et al provides a lot more detail on the specification.

Creating a BAM file

To generate a BAM file we need to map our reads to the genome. The alignment process consists of choosing an appropriate reference genome to map our reads against and performing the read alignment using one of several splice-aware alignment tools such as STAR or HISAT2. The choice of aligner is often a personal preference and also dependent on the computational resources that are available to you.

STAR Aligner

To determine where on the human genome our reads originated from, we will align our reads to the reference genome using STAR (Spliced Transcripts Alignment to a Reference). STAR is an aligner designed to specifically address many of the challenges of RNA-seq data mapping using a strategy to account for spliced alignments. We will not go into detail about how STAR works, but if you are interested in undertanding the alignment strategy we have some materials linked here.

NOTE: Until recently, the standard approach for RNA-seq analysis has been to map our reads using a splice-aware aligner (i.e STAR) and then use the resulting BAM files as input to counting tools like featureCounts and htseq-count to obtain our final expression matrix. The field has now moved towards using lightweight alignment tools like Salmon as standard practice, and so we only use STAR for generating a BAM file. If you are interested in knowing more about the standard approach we have some materials linked here.

To get started with this lesson, start an interactive session with 6 cores and 8G of memory:

$ srun --pty -p interactive -t 0-12:00 -c 6 --mem 8G /bin/bash	

You should have a directory tree setup similar to that shown below. It is best practice to have all files you intend on using for your workflow present within the same directory.

rnaseq
	├── logs
	├── meta
	├── raw_data
	│   ├── Irrel_kd_1.subset.fq
	│   ├── Irrel_kd_2.subset.fq
	│   ├── Irrel_kd_3.subset.fq
	│   ├── Mov10_oe_1.subset.fq
	│   ├── Mov10_oe_2.subset.fq
	│   └── Mov10_oe_3.subset.fq
	├── results
	└── scripts

To use the STAR aligner, load the module:

$ module load gcc/6.2.0 star/2.5.4a

Similar to Salmon, aligning reads using STAR is a two step process:

  1. Create a genome index
  2. Map reads to the genome

A quick note on shared databases for human and other commonly used model organisms. The O2 cluster has a designated directory at /n/groups/shared_databases/ in which there are files that can be accessed by any user. These files contain, but are not limited to, genome indices for various tools, reference sequences, tool specific data, and data from public databases, such as NCBI and PDB. So when using a tool that requires a reference of sorts, it is worth taking a quick look here because chances are it’s already been taken care of for you.

$ ls -l /n/groups/shared_databases/igenome/

Creating a genome index

For this workshop we have generated the genome indices for you, so that we don’t get held up waiting on the generation of the indices (it takes a while and requires a lot of memory). The index can be found in the /n/groups/hbctraining/intro_rnaseq_hpc/reference_data_ensembl38/ensembl38_STAR_index/ directory.

The command to create an index can be found in the job submission script we have linked here.

NOTE: By default the latest human genome build, GRCh38, contains information about alternative alleles for various locations on the genome. If using this version of the GRCh38 genome then it is advisable to use the HISAT2 aligner as it is able to utilize this information during the alignment. There is a version of GRCh38 available that does not have these alleles represented, which is the appropriate version to use with STAR. This is because STAR does not have the functionality to appropriately deal with the presence of alternate alleles as yet.

Aligning reads

Since we already have the reference index ready, we can move on to aligning reads to the genome.

Create an output directory for our alignment files:

$ cd ~/rnaseq/raw_data

$ mkdir ../results/STAR

For now, we’re going to work on just one sample to set up our workflow. To start we will use the first replicate in the Mov10 over-expression group, Mov10_oe_1.subset.fq. Details on STAR and its functionality can be found in the user manual; we encourage you to peruse through to get familiar with all available options.

The basic options for aligning reads to the genome using STAR are:

Listed below are additional parameters that we will use in our command:

NOTE: Note that “STAR’s default parameters are optimized for mammalian genomes. Other species may require significant modifications of some alignment parameters; in particular, the maximum and minimum intron sizes have to be reduced for organisms with smaller introns” [1].

The full command is provided below for you to copy paste into your terminal. If you want to manually enter the command, it is advisable to first type out the full command in a text editor (i.e. Sublime Text or Notepad++) on your local machine and then copy paste into the terminal. This will make it easier to catch typos and make appropriate changes.

$ STAR --genomeDir /n/groups/hbctraining/intro_rnaseq_hpc/reference_data_ensembl38/ensembl38_STAR_index/ \
--runThreadN 6 \
--readFilesIn Mov10_oe_1.subset.fq \
--outFileNamePrefix ../results/STAR/Mov10_oe_1_ \
--outSAMtype BAM SortedByCoordinate \
--outSAMunmapped Within \
--outSAMattributes Standard 

STAR output

After running our single FASTQ file through the STAR aligner, you should have a number of output files in the ~/rnaseq/results/STAR directory. Let’s take a quick look at some of the files that were generated and explore their content.

$ cd ~/rnaseq/results/STAR
	
$ ls -lh

You should have 5 output files plus a single .tmp directory for the Mov10_oe_1 sample. The contents of the 5 files are described below:

We are most interested in the BAM file, which will allow us to proceed with Qualimap.

Running Qualimap

Alignment data files frequently contain biases that are introduced by sequencing technologies and/or during sample preparation. Therefore, one of the fundamental requirement during analysis of these data is to perform quality control. In this way, we get an idea of how well our reads align to the reference and how well data fit with the expected outcome.

We will take the BAM file we generated in the previous step and use it as input to Qualimap which computes various quality metrics such as DNA or rRNA contamination, 5’-3’ biases, and coverage biases.

To run Qualimap, change directories to the rnaseq folder and make a qualimap folder inside the results directory:

$ cd ~/rnaseq

$ mkdir -p results/qualimap

By default, Qualimap will try to open a GUI to run Qualimap, so we need to run the unset DISPLAY command:

$ unset DISPLAY

We also need to load the qualimap module:

$ module load java/jdk-1.8u112 qualimap/2.2.1

Now we are ready to run Qualimap on our BAM file! There are different tools or modules available through Qualimap, and the documentation website details the tools and options available. We are interested in the rnaseq tool. To see the arguments available for this tool we can search the help:

$ qualimap rnaseq 

We will be running Qualimap with the following specifications:

$ qualimap rnaseq \
-outdir results/qualimap/Mov10_oe_1 \
-a proportional \
-bam results/STAR/Mov10_oe_1_Aligned.sortedByCoord.out.bam \
-p strand-specific-reverse \
-gtf /n/groups/hbctraining/intro_rnaseq_hpc/reference_data_ensembl38/Homo_sapiens.GRCh38.92.1.gtf \
--java-mem-size=8G

The Qualimap report

The Qualimap report in HTML folrmat should be present in the results/qualimap directory. To view this report you need a web browser, so you would need to transfer it over to your laptop. However, you don’t need to do that. We generated this report on a subset of data, to get a better idea of the metrics let’s take a look at the report of the full dataset for Mov_oe_1, available in this zipped folder. Please download and unzip the folder; find the HTML report within and open it in your browser.

Read alignment summary

The first few numbers listed in the report are the mapping statistics. Qualimap also computes counts by assigning reads to genes and reports associated statistics. For example it computes the following:

Reads genomic origin

This section reports how many alignments fall into exonic, intronic and intergenic regions along with a number of intronic/intergenic alignments overlapping exons. Exonic region includes 5’UTR,protein coding region and 3’UTR region. This information is summarized in table in addition to a pie chart as shown below.

Transcript coverage profile

The profile provides ratios between mean coverage at the 5’ region, the 3’ region and the whole transcript. Coverage plots are generated for all genes total, and also for the 500 highest-expressed and 500 lowest-expressed genes separately.

Junction Analysis

Qualimap also reports the total number of reads mapping to splice junctions and the 10 most frequent junction rates. The pie chart shows analysis of junction positions in spliced alignments.

Other tools like RNASeQC will plot figures that can help evaluate GC content bias. This is also an important aspect of QC, as low/high GC content regions will tend to have low coverage.


Summary

Taken together, these metrics give us some insight into the quality of our samples and help us in identifying any biases present in our data. The conclusions derived from these QC results may indicate that we need to correct for these biases and so you may want to go back and modify the parameters for Salmon (mapping) accordingly.


This lesson has been developed by members of the teaching team at the Harvard Chan Bioinformatics Core (HBC). These are open access materials distributed under the terms of the Creative Commons Attribution license (CC BY 4.0), which permits unrestricted use, distribution, and reproduction in any medium, provided the original author and source are credited.