Skip to the content.

Approximate time: 45 minutes

Learning Objectives

The importance of alignment for variant calling

Read alignment is an essential first step in the characterization of DNA sequence variation. This step takes our reads and tries to find the place in the genome where they best fit.

Importantly, the accuracy of variant-calling results can be highly impacted by a low quality of read alignment. In theory, aligning reads sounds like a simple case of string matching to identify where in the genome each read originated from. However, this task is complicated by a multitude of factors including:

The largest and most difficult part of this task is creating an alignment algorithm that can account for these factors and still provide alignment for millions of reads within a reasonable time frame. There are a multitude of alignment tools availible today; each has strengths and weakness and some are more appropriate for particular types of analysis. In this course, we will be using BWA.

BWA

Many modern alignment tools rely on the Burrows-Wheeler Transform as part of their alignment algorithm; BWA is one of those tools. BWA is a software package for mapping low-divergent sequences against a large reference genome, such as the human genome. It consists of three algorithms: BWA-backtrack, BWA-SW and BWA-MEM. In this workshop we will demonstrate BWA-MEM.

No matter which algorithm you choose, there are two steps to running bwa:

  1. Create an index of the reference sequence (done once)
    1. Create a Suffix Array Index
    2. Generate a Burrows-Wheeler Transform from the suffix array
  2. Query reads against the index to get alignments

The Burrows-Wheeler Transform and Suffix Arrays are outside of the scope of this course, but a great place to get information on these methods are on Dr. Ben Langmead’s YouTube Channel. Dr. Langmead developed the algorithm for a similar alignment tool, Bowtie, and he has made many useful videos explaining how the components of these alignment tools work.

We have already created an index for you. So we will jump right into running the alignments.

Click here for details on creating your own bwa index While we will be using an index that has already been made for us, if you need to create an index for a reference sequence using bwa, the steps for this are laid out below.
  1. Navigate to your reference sequence directory
    cd ~/path/to/reference/sequence/directory/
    
  2. Create a bwa index:
    bwa index reference_sequence.fasta
    

This process may take up to 30+ minutes to run depending on the reference sequence size. The output of this command will produce five files:

SAM (Sequence Alignment Map) file format

Alignment information is stored within SAM files. Here, we briefly describe the SAM file, but encourage you to explore this lesson for more in depth detail.

There are two main components to a SAM file are:

Image source: https://www.samformat.info/sam-format-flag

The bit-wise flags are worth noting here, as they are very helpful for giving the user a rough understanding of the read. Details such as whether the read is paired, has an alignment to the provided reference sequence or is a PCR duplicate can all be encoded into the FLAG. This tool on the Broad’s Website can be very helpful for decoding the SAM FLAGs that you can encounter.

Aligning reads with bwa-mem

BWA-MEM is the latest algorithm of the three bwa algorithms available, and is generally recommended for high-quality queries as it is faster and more accurate. Other features include:

Let’s begin by looking at the command used to run the alignment, and describing each of the parameters. Do not run the code below, as this just example code.

# DO NOT RUN
# Align reads with bwa-mem
bwa mem \
-M \
-t 8 \
-R "@RG\tID:C6C0TANXX_2\tSM:ZW177\tPL:ILLUMINA\tPU:ZW177" \
/path/to/reference.fa \
LEFT_read_1.fq \
RIGHT_read_2.fq \
-o /path/to/output_file.sam

Let’s breakdown this bwa command.

A note on paired-end reads: In this case you can see that the reads are paired-end reads because we have provided two FASTQ files as input and they have _1/_2 right before the .fq/.fastq extension.

This is a very common annotation for paired-end reads. Alternatively, sometimes they might also be annotated as _R1/_R2, but once again, they will almost always be placed right before the .fq/.fastq extension.

Read groups

The term Read Group refers to a set of reads that were generated from a single run of a sequencing instrument.

Read groups are identified in the SAM/BAM file by a number of tags that are defined in the official SAM specification. These tags, when assigned appropriately, allow us to differentiate not only samples, but also various technical features that are associated with artifacts. Having this information allows to take steps towards mitigating the effects of artifacts in downstream analyses [1].

Let’s use the read group from our example bwa command above to demonstrate the use of tags:

More information about read groups and some fields we didn’t discuss can be found here.


Exercises

1. The read group field LB (Library) is a required field to when adding read groups using Picard’s AddOrReplaceReadGroups, but we don’t currently have this field in our read group information. How would we alter our bwa command to include LB as well?

2. If we wanted to increase the number of threads used by bwa for processing our alignment to 12, how would we need to modify our bwa command to accommodate this?


Creating a script for alignment

To align reads to the reference sequence, we will need to give ourselves an appropriate amount of resources and so we will need to create an sbatch job submission script. Move into the scripts directory and open up vim.

$ cd ~/variant_calling/scripts/
$ vim bwa_alignment_normal.sbatch

Now, we can copy and paste the following shebang line and sbatch directives into our script file:

#!/bin/bash
# This script is for aligning sequencing reads against a reference genome using bwa

# Assign sbatch directives
#SBATCH -p priority
#SBATCH -t 0-04:00:00
#SBATCH -c 8
#SBATCH --mem 16G
#SBATCH -o bwa_alignment_normal_%j.out
#SBATCH -e bwa_alignment_normal_%j.err

Next, we need to add the modules that we will be using for alignment:

# Load modules
module load gcc/6.2.0
module load bwa/0.7.17

NOTE: On O2, many of the common tools were compiled using GCC version 6.2.0, so to be able to access them, we first need to load the GCC module.

Finally, we need the bwa command we are going to run. Displayed below is the code we would use specifically for the syn3_normal_1. DO NOT paste this code in your script!

## DO NOT PASTE THIS CODE: EXAMPLE ONLY
bwa mem \
    -M \
    -t 8 \
    -R "@RG\tID:syn3_normal\tPL:illumina\tPU:$SAMPLE\tSM:syn3_normal" \
    /n/groups/hbctraining/variant_calling/reference/GRCh38.p7.fa \
    ~/variant_calling/raw_data/syn3_normal_1.fq.gz \
    ~/variant_calling/raw_data/syn3_normal_2.fq.gz \
    -o /n/scratch/users/${USER:0:1}/${USER}/variant_calling/alignments/syn3_normal_GRCh38.p7.sam

Instead we will introduce bash variables in our script which will allow us to quickly adapt this for use on all other samples we have in our dataset. The variables are described in more detail below:

Another advantage of using bash variables in this way is that it can reduce typos. It can also help make your code more reproducible since you will not be altering the underlying bwa command, which you could inadvertantly do while trying to swap out parameters within your code between scripts if you didn’t use bash variables. Lastly, it can also help make your code more readable to yourself and others.

# Assign files to bash variables
REFERENCE_SEQUENCE=/n/groups/hbctraining/variant_calling/reference/GRCh38.p7.fa
LEFT_READS=/home/$USER/variant_calling/raw_data/syn3_normal_1.fq.gz
RIGHT_READS=`echo ${LEFT_READS%1.fq.gz}2.fq.gz`
SAMPLE=`basename $LEFT_READS _1.fq.gz`
SAM_FILE=/n/scratch/users/${USER:0:1}/${USER}/variant_calling/alignments/${SAMPLE}_GRCh38.p7.sam

NOTE: $RIGHT_READS uses some bash string manipulation in order to swap the last parts of their filename. We also use basename to parse out the path from a file and when coupled with an argument after the filename, it will trim the end of the filename as well as we can see with the $SAMPLE variable.

We can now add our command for running bwa and utilize the variables we created above:

# Align reads with bwa
bwa mem \
    -M \
    -t 8 \
    -R "@RG\tID:$SAMPLE\tPL:illumina\tPU:$SAMPLE\tSM:$SAMPLE" \
    $REFERENCE_SEQUENCE \
    $LEFT_READS \
    $RIGHT_READS \
    -o $SAM_FILE
Click here to see what our final sbatchcode script for the normal sample alignment should look like
#!/bin/bash
# This script is for aligning sequencing reads against a reference genome using bwa
# Assign sbatch directives #SBATCH -p priority #SBATCH -t 0-04:00:00 #SBATCH -c 8 #SBATCH --mem 16G #SBATCH -o bwa_alignment_normal_%j.out #SBATCH -e bwa_alignment_normal_%j.err
# Load modules module load gcc/6.2.0 module load bwa/0.7.17
# Assign files to bash variables REFERENCE_SEQUENCE=/n/groups/hbctraining/variant_calling/reference/GRCh38.p7.fa LEFT_READS=/home/$USER/variant_calling/raw_data/syn3_normal_1.fq.gz RIGHT_READS=`echo ${LEFT_READS%1.fq.gz}2.fq.gz` SAMPLE=`basename $LEFT_READS _1.fq.gz` SAM_FILE=/n/scratch/users/${USER:0:1}/${USER}/variant_calling/alignments/${SAMPLE}_GRCh38.p7.sam
# Align reads with bwa bwa mem \ -M \ -t 8 \ -R "@RG\tID:$SAMPLE\tPL:illumina\tPU:$SAMPLE\tSM:$SAMPLE" \ $REFERENCE_SEQUENCE \ $LEFT_READS \ $RIGHT_READS \ -o $SAM_FILE

Submitting sbatch bwa script

Now your sbatch script for bwa is ready to submit:

$ sbatch bwa_alignment_normal.sbatch

Creating Tumor sbatch script

Because our files will stay mostly the same and we are just swapping the word “normal” for “tumor”, this is an excellent place where we can implement the sed command. The sed command can be used to carry out a variety of tasks, but one of the more popular uses is to use it as a “search-and-replace” tool. A more thorough discussion of sed can be found here. We are going to replace all of the instances of “normal” with “tumor” using a sed command, and redirect the output to a file called bwa_alignment_tumor.sbatch:

$ sed 's/normal/tumor/g' bwa_alignment_normal.sbatch >  bwa_alignment_tumor.sbatch
Click here to see what our final sbatchcode script for the tumor sample alignment should look like
#!/bin/bash
# This script is for aligning sequencing reads against a reference genome using bwa
# Assign sbatch directives #SBATCH -p priority #SBATCH -t 0-04:00:00 #SBATCH -c 8 #SBATCH --mem 16G #SBATCH -o bwa_alignment_tumor_%j.out #SBATCH -e bwa_alignment_tumor_%j.err
# Load modules module load gcc/6.2.0 module load bwa/0.7.17
# Assign files to bash variables REFERENCE_SEQUENCE=/n/groups/hbctraining/variant_calling/reference/GRCh38.p7.fa LEFT_READS=/home/$USER/variant_calling/raw_data/syn3_tumor_1.fq.gz RIGHT_READS=`echo ${LEFT_READS%1.fq.gz}2.fq.gz` SAMPLE=`basename $LEFT_READS _1.fq.gz` SAM_FILE=/n/scratch/users/${USER:0:1}/${USER}/variant_calling/alignments/${SAMPLE}_GRCh38.p7.sam
# Align reads with bwa bwa mem \ -M \ -t 8 \ -R "@RG\tID:$SAMPLE\tPL:illumina\tPU:$SAMPLE\tSM:$SAMPLE" \ $REFERENCE_SEQUENCE \ $LEFT_READS \ $RIGHT_READS \ -o $SAM_FILE

Once we have created this script we can go ahead and submit it for processing:

sbatch bwa_alignment_tumor.sbatch 

NOTE: These scripts may take awhile to run. Each script can take anywhere from 30-45 min to run to completion.

Next Lesson »

Back to Schedule


This lesson has been developed by members of the teaching team at the Harvard Chan Bioinformatics Core (HBC). These are open access materials distributed under the terms of the Creative Commons Attribution license (CC BY 4.0), which permits unrestricted use, distribution, and reproduction in any medium, provided the original author and source are credited.