Skip to the content.

Approximate time: 30 minutes

Learning Objectives

Collecting Alignment Statistics

The next step of QC is where we need to evaluate the quality of the alignments. We will start by running the code required to compute metrics, then we will discuss the metrics and evaluate our results against the general expecations.

We are going to use Picard once again in order to collect our alignment statistics. Picard has many packages for collecting different types of data, but the one we will be using is CollectAlignmentSummaryMetrics. This tool takes a SAM/BAM file input and produces metrics (in a tab delimited .txt file) detailing the quality of the read alignments. Note that these quality filters are specific to Illumina data.

Some examples of metrics reported include (but, are not limited to):

NOTE: This lesson assumes you have successfuly completed the code for alignment processing. If for some reason you did not successfully run the Picard alignment processing steps scripts you will want to copy over the required files using the commands provided below:

 cp /n/groups/hbctraining/variant_calling/intermediate_files/alignments/syn3_*_GRCh38.p7.coordinate_sorted.ba? /n/scratch/users/${USER:0:1}/${USER}/variant_calling/alignments/

Normal sample script

Let’s start creating an sbatch script for collecting metrics:

cd ~/variant_calling/scripts/
vim picard_metrics_normal.sbatch

First, we need to add our shebang line, description and sbatch directives to the script:

#!/bin/bash
# This sbatch script is for collecting alignment metrics using Picard 

# Assign sbatch directives
#SBATCH -p priority
#SBATCH -t 0-00:30:00
#SBATCH -c 1
#SBATCH --mem 16G
#SBATCH -o picard_metrics_normal_%j.out
#SBATCH -e picard_metrics_normal_%j.err

Next, we need to load Picard:

# Load picard
module load picard/2.27.5

Next, let’s assign our files to variables:

# Assign variables
INPUT_BAM=/n/scratch/users/${USER:0:1}/${USER}/variant_calling/alignments/syn3_normal_GRCh38.p7.coordinate_sorted.bam
REFERENCE=/n/groups/hbctraining/variant_calling/reference/GRCh38.p7.fa
OUTPUT_METRICS_FILE=/home/${USER}/variant_calling/reports/picard/syn3_normal/syn3_normal_GRCh38.p7.CollectAlignmentSummaryMetrics.txt

Lastly, we can add the Picard command to gather the alignment metrics.

# Run Picard CollectAlignmentSummaryMetrics
java -jar $PICARD/picard.jar CollectAlignmentSummaryMetrics \
  --INPUT $INPUT_BAM \
  --REFERENCE_SEQUENCE $REFERENCE \
  --OUTPUT $OUTPUT_METRICS_FILE

We can breakdown this command into each of its components:

Now this script is all set to run! Go ahead and save and quit.

Click here to see what our final sbatchcode script for collecting the normal sample alignment metrics should look like
#!/bin/bash
# This sbatch script is for collecting alignment metrics using Picard
# Assign sbatch directives #SBATCH -p priority #SBATCH -t 0-00:30:00 #SBATCH -c 1 #SBATCH --mem 16G #SBATCH -o picard_metrics_normal_%j.out #SBATCH -e picard_metrics_normal_%j.err
# Load picard module load picard/2.27.5
# Assign variables INPUT_BAM=/n/scratch/users/${USER:0:1}/${USER}/variant_calling/alignments/syn3_normal_GRCh38.p7.coordinate_sorted.bam REFERENCE=/n/groups/hbctraining/variant_calling/reference/GRCh38.p7.fa OUTPUT_METRICS_FILE=/home/${USER}/variant_calling/reports/picard/syn3_normal/syn3_normal_GRCh38.p7.CollectAlignmentSummaryMetrics.txt
# Run Picard CollectAlignmentSummaryMetrics java -jar $PICARD/picard.jar CollectAlignmentSummaryMetrics \ --INPUT $INPUT_BAM \ --REFERENCE_SEQUENCE $REFERENCE \ --OUTPUT $OUTPUT_METRICS_FILE

Tumor sample script

Now we will want to create the tumor version of this submission script using sed (as we have done previously):

sed 's/normal/tumor/g' picard_metrics_normal.sbatch > picard_metrics_tumor.sbatch
Click here to see what our final sbatchcode script for collecting the tumor sample alignment metrics should look like
#!/bin/bash
# This sbatch script is for collecting alignment metrics using Picard
# Assign sbatch directives #SBATCH -p priority #SBATCH -t 0-00:30:00 #SBATCH -c 1 #SBATCH --mem 16G #SBATCH -o picard_metrics_tumor_%j.out #SBATCH -e picard_metrics_tumor_%j.err
# Load picard module load picard/2.27.5
# Assign variables INPUT_BAM=/n/scratch/users/${USER:0:1}/${USER}/variant_calling/alignments/syn3_tumor_GRCh38.p7.coordinate_sorted.bam REFERENCE=/n/groups/hbctraining/variant_calling/reference/GRCh38.p7.fa OUTPUT_METRICS_FILE=/home/${USER}/variant_calling/reports/picard/syn3_tumor/syn3_tumor_GRCh38.p7.CollectAlignmentSummaryMetrics.txt
# Run Picard CollectAlignmentSummaryMetrics java -jar $PICARD/picard.jar CollectAlignmentSummaryMetrics \ --INPUT $INPUT_BAM \ --REFERENCE_SEQUENCE $REFERENCE \ --OUTPUT $OUTPUT_METRICS_FILE

Submitting scripts for Picard alignment processing

Before we submit our jobs, let’s check the status of our previous Picard alignment processing steps:

squeue --me
sbatch picard_metrics_normal.sbatch
sbatch picard_metrics_tumor.sbatch

NOTE: Each of these scripts should only take about 15 minutes to run.

Collecting Coverage Metrics

Coverage is the average level of alignment for any random locus in the genome. Picard also has a package called CollectWgsMetrics which is also very nice for collecting data about coverage for alignments. However, since our data set is whole exome sequencing rather than whole genome sequencing and thus only compromises about 1-2% of the human genome, average coverage across the whole genome is not a very useful metric. However, if one did have whole genome data, then running CollectWgsMetrics would be useful. As such, in the dropdown box below we provide the code that you could use to collect this information.

Image source: Coverage analysis from the command line

Click here to find out more on collecting coverage metrics for WGS datasets in Picard
The tool in Picard used for collecting coverage metrics for WGS datasets is called CollectWgsMetrics. The code used to run CollectWgsMetrics can be found below.

  # Assign paths to bash variables
  $COORDINATE_SORTED_BAM_FILE=/path/to/sample.coordinate_sorted.bam
  $OUTPUT=/home/$USER/variant_calling/reports/picard/sample.CollectWgsMetrics.txt
  $REFERENCE=/n/groups/hbctraining/variant_calling/reference/GRCh38.p7.fa
# Run Picard CollectWgsMetrics \ java -jar $PICARD/picard.jar CollectWgsMetrics \ --INPUT $COORDINATE_SORTED_BAM_FILE \ --OUTPUT $METRICS_OUTPUT_FILE \ --REFERENCE_SEQUENCE $REFERENCE

Factors Impacting Alignment

While we mentioned above the various metrics that are computed as part of the Picard command, one of the most important metrics for your alignment file is the alignment rate. When aligning high-quality reads to a high quality reference genome, one should expect to see alignment rates at 90% or better. If alignment rates dipped below 80-85%, then there could be reason for further inspection.

Alignment rates can vary based upon many factors, including:

Next Lesson »

Back to Schedule


This lesson has been developed by members of the teaching team at the Harvard Chan Bioinformatics Core (HBC). These are open access materials distributed under the terms of the Creative Commons Attribution license (CC BY 4.0), which permits unrestricted use, distribution, and reproduction in any medium, provided the original author and source are credited.