Alignment Quality Control

Approximate time: 30 minutes

Learning Objectives

Verify alignment rates using Picard
Merge Picard QC metrics with FastQC metrics using MultiQC

Collecting Alignment Statistics

The next step of QC is where we need to evaluate the quality of the alignments. We will start by running the code required to compute metrics, then we will discuss the metrics and evaluate our results against the general expecations.

We are going to use Picard once again in order to collect our alignment statistics. Picard has many packages for collecting different types of data, but the one we will be using is CollectAlignmentSummaryMetrics. This tool takes a SAM/BAM file input and produces metrics (in a tab delimited .txt file) detailing the quality of the read alignments. Note that these quality filters are specific to Illumina data.

Some examples of metrics reported include (but, are not limited to):

Total number of reads
Total number of PF (pass-filter) reads (reads that pass an internal qualty fitering on the Illumina sequencer)
Total aligned PF reads (any PF read that has a sequence and position)
High-quality aligned PF reads (high-quality means a mapping quality >= 20)
Reads aligned in pairs (vs. reads aligned with mate unaligned/not present)
Read length

NOTE: This lesson assumes you have successfuly completed the code for alignment processing. If for some reason you did not successfully run the Picard alignment processing steps scripts you will want to copy over the required files using the commands provided below:
 cp /n/groups/hbctraining/variant_calling/intermediate_files/alignments/syn3_*_GRCh38.p7.coordinate_sorted.ba? /n/scratch/users/${USER:0:1}/${USER}/variant_calling/alignments/

Normal sample script

Let’s start creating an sbatch script for collecting metrics:

cd ~/variant_calling/scripts/
vim picard_metrics_normal.sbatch

First, we need to add our shebang line, description and sbatch directives to the script:

#!/bin/bash
# This sbatch script is for collecting alignment metrics using Picard 

# Assign sbatch directives
#SBATCH -p priority
#SBATCH -t 0-00:30:00
#SBATCH -c 1
#SBATCH --mem 16G
#SBATCH -o picard_metrics_normal_%j.out
#SBATCH -e picard_metrics_normal_%j.err

Next, we need to load Picard:

# Load picard
module load picard/2.27.5

Next, let’s assign our files to variables:

# Assign variables
INPUT_BAM=/n/scratch/users/${USER:0:1}/${USER}/variant_calling/alignments/syn3_normal_GRCh38.p7.coordinate_sorted.bam
REFERENCE=/n/groups/hbctraining/variant_calling/reference/GRCh38.p7.fa
OUTPUT_METRICS_FILE=/home/${USER}/variant_calling/reports/picard/syn3_normal/syn3_normal_GRCh38.p7.CollectAlignmentSummaryMetrics.txt

Lastly, we can add the Picard command to gather the alignment metrics.

# Run Picard CollectAlignmentSummaryMetrics
java -jar $PICARD/picard.jar CollectAlignmentSummaryMetrics \
  --INPUT $INPUT_BAM \
  --REFERENCE_SEQUENCE $REFERENCE \
  --OUTPUT $OUTPUT_METRICS_FILE

We can breakdown this command into each of its components:

java -jar $PICARD/picard.jar CollectAlignmentSummaryMetrics Calls the CollectAlignmentSummaryMetrics package from within Picard
--INPUT $INPUT_BAM This is the output BAM file from our previous Picard alignment processing steps.
--REFERENCE_SEQUENCE $REFERENCE This isn’t a required parameter, but Picard can do a subset of mismatch-related metrics if this is provided.
--OUTPUT $OUTPUT_METRICS_FILE This is the file to write the output metrics to.

Now this script is all set to run! Go ahead and save and quit.

Click here to see what our final sbatchcode script for collecting the normal sample alignment metrics should look like

#!/bin/bash
# This sbatch script is for collecting alignment metrics using Picard

# Assign sbatch directives
#SBATCH -p priority
#SBATCH -t 0-00:30:00
#SBATCH -c 1
#SBATCH --mem 16G
#SBATCH -o picard_metrics_normal_%j.out
#SBATCH -e picard_metrics_normal_%j.err

# Load picard
module load picard/2.27.5

# Assign variables
INPUT_BAM=/n/scratch/users/${USER:0:1}/${USER}/variant_calling/alignments/syn3_normal_GRCh38.p7.coordinate_sorted.bam
REFERENCE=/n/groups/hbctraining/variant_calling/reference/GRCh38.p7.fa
OUTPUT_METRICS_FILE=/home/${USER}/variant_calling/reports/picard/syn3_normal/syn3_normal_GRCh38.p7.CollectAlignmentSummaryMetrics.txt

# Run Picard CollectAlignmentSummaryMetrics
java -jar $PICARD/picard.jar CollectAlignmentSummaryMetrics \
  --INPUT $INPUT_BAM \
  --REFERENCE_SEQUENCE $REFERENCE \
  --OUTPUT $OUTPUT_METRICS_FILE

Tumor sample script

Now we will want to create the tumor version of this submission script using sed (as we have done previously):

sed 's/normal/tumor/g' picard_metrics_normal.sbatch > picard_metrics_tumor.sbatch

Click here to see what our final sbatchcode script for collecting the tumor sample alignment metrics should look like

#!/bin/bash
# This sbatch script is for collecting alignment metrics using Picard

# Assign sbatch directives
#SBATCH -p priority
#SBATCH -t 0-00:30:00
#SBATCH -c 1
#SBATCH --mem 16G
#SBATCH -o picard_metrics_tumor_%j.out
#SBATCH -e picard_metrics_tumor_%j.err

# Load picard
module load picard/2.27.5

# Assign variables
INPUT_BAM=/n/scratch/users/${USER:0:1}/${USER}/variant_calling/alignments/syn3_tumor_GRCh38.p7.coordinate_sorted.bam
REFERENCE=/n/groups/hbctraining/variant_calling/reference/GRCh38.p7.fa
OUTPUT_METRICS_FILE=/home/${USER}/variant_calling/reports/picard/syn3_tumor/syn3_tumor_GRCh38.p7.CollectAlignmentSummaryMetrics.txt

# Run Picard CollectAlignmentSummaryMetrics
java -jar $PICARD/picard.jar CollectAlignmentSummaryMetrics \
  --INPUT $INPUT_BAM \
  --REFERENCE_SEQUENCE $REFERENCE \
  --OUTPUT $OUTPUT_METRICS_FILE

Submitting scripts for Picard alignment processing

Before we submit our jobs, let’s check the status of our previous Picard alignment processing steps:

squeue --me

If your Picard alignment processing steps are completed, and you have the required input files then you can submit these jobs to collect alignment metrics:

sbatch picard_metrics_normal.sbatch
sbatch picard_metrics_tumor.sbatch

NOTE: Each of these scripts should only take about 15 minutes to run.

Collecting Coverage Metrics

Coverage is the average level of alignment for any random locus in the genome. Picard also has a package called CollectWgsMetrics which is also very nice for collecting data about coverage for alignments. However, since our data set is whole exome sequencing rather than whole genome sequencing and thus only compromises about 1-2% of the human genome, average coverage across the whole genome is not a very useful metric. However, if one did have whole genome data, then running CollectWgsMetrics would be useful. As such, in the dropdown box below we provide the code that you could use to collect this information.

Image source: Coverage analysis from the command line

Click here to find out more on collecting coverage metrics for WGS datasets in Picard

The tool in Picard used for collecting coverage metrics for WGS datasets is called CollectWgsMetrics. The code used to run CollectWgsMetrics can be found below.

  # Assign paths to bash variables
  $COORDINATE_SORTED_BAM_FILE=/path/to/sample.coordinate_sorted.bam
  $OUTPUT=/home/$USER/variant_calling/reports/picard/sample.CollectWgsMetrics.txt
  $REFERENCE=/n/groups/hbctraining/variant_calling/reference/GRCh38.p7.fa

  # Run Picard CollectWgsMetrics \
  java -jar $PICARD/picard.jar CollectWgsMetrics \
    --INPUT $COORDINATE_SORTED_BAM_FILE \
    --OUTPUT $METRICS_OUTPUT_FILE \
    --REFERENCE_SEQUENCE $REFERENCE

java -jar $PICARD/picard.jar CollectWgsMetrics This calls the CollectWgsMetrics package within Picard
--INPUT $COORDINATE_SORTED_BAM_FILE This is the input coordinate-sorted BAM file
--OUTPUT $METRICS_OUTPUT_FILE This is the output report file
--REFERENCE_SEQUENCE $REFERENCE This is the path to the reference genome that was used for the alignment.

Factors Impacting Alignment

While we mentioned above the various metrics that are computed as part of the Picard command, one of the most important metrics for your alignment file is the alignment rate. When aligning high-quality reads to a high quality reference genome, one should expect to see alignment rates at 90% or better. If alignment rates dipped below 80-85%, then there could be reason for further inspection.

Alignment rates can vary based upon many factors, including:

Quality of reference assembly - A high-quality assembly like GRCh38.p7 will provide an excellent reference genome for alignment. However, if you were studying a organism with a poorly assembled genome, parts of the reference genome could be missing from the assembly. Therefore, high-quality reads might not align because they there is missing reference sequence to align to that corresponds to their sequence.
Quality of libraries - If the library generation was poor and there wasn’t enough input DNA, then your sequencing could be filled with low-quality reads
Quality of the reads - If the reads are poor quality, then it can make alignment more uncertain. If your FASTQC report shows any anomalous signs, contact your sequencing center for support.
Contamination - If your samples are contaminated, then it can also skew your alignment. For example, if your samples were heavily contaminated with some bacteria, then much of what you will sequence will be bacteria DNA and not your sample DNA. As a result, most of the sequence reads will not align to your target sequence. If you suspect contamination might be the source of a poor alignment, you could consider running Kraken to evaluate the levels of contamination in your samples.
Evolutionary distance between the sampled organism and reference genome If a reference genome doesn’t exist for your species of interest, you are able to align reads to a closely-related organism. However, it does come at the cost of lowering the alignment rate.
Aligner and alignment parameters Different aligners work differently and are specialized for different types of data. Additionally, many aligners have a variety of parameters that are able to be adjusted. As a result, different aligners or different parameters for the same aligner will give different alignment rates, but they usually should be within the same range of approximate alignment rates. Generally speaking, the default parameters for most alignment tools are usually fine and they shouldn’t need to be manually adjusted/optimized unless there is a specific reason to do so.

Next Lesson »

Back to Schedule

This lesson has been developed by members of the teaching team at the Harvard Chan Bioinformatics Core (HBC). These are open access materials distributed under the terms of the Creative Commons Attribution license (CC BY 4.0), which permits unrestricted use, distribution, and reproduction in any medium, provided the original author and source are credited.