Skip to the content.

Contributors: Mary Piper, Radhika Khetani, Meeta Mistry, Jihe Liu, Will Gammerdinger

Approximate time: 45 min

Learning Objectives

Alignment to Genome

Now that we have assessed the quality of our sequence data, we are ready to align the reads to the reference genome.

In theory, this sounds like a very simple case of string matching. We take the sequence read and figure out where it originated from in the reference genome. However, in practice, this is actually quite difficult! This is because:

There are many different tools that have been developed for alignment of next-generation sequencing data, and some that are more suitable to different technologies. A popular tool commonly used with ChIP-seq data, and the one that we will be using in this workshop is Bowtie2.

To trim or not to trim?

Trimming is the process of removing unwanted sequence prior to sequence alignment. In general, trimming is an optional step. The 3’ ends of the sequence may contain part of the Illumina sequencing adapter. This adapter contamination may prevent the reads from aligning to the reference genome correctly, thus adversely impacting the downstream analysis. You could evaluate potential adapter contamination either from the FastQC report (in “Overrepresented sequences” or “Adapter Contamination” sections), or from the size distribution of your sequencing libraries. If you suspect that your reads are contaminated with adapters, you should run an adapter removal tool. We list some additional considerations below whether trimming is needed:

Bowtie2

Bowtie2 is a fast and accurate alignment tool that supports gapped, local and paired-end alignment modes and works best for reads that are at least 50 bp (shorter read lengths should use Bowtie1).

By default, Bowtie2 will perform a global end-to-end read alignment, which aligns from the first to the last base of the read. This alignment is best for reads that have already been trimmed for quality and adapters (e.g. reads where nucleotide bases of poor quality or matching adapter sequences have been removed from the ends of the reads prior to alignment). However, Bowtie2 also has a local alignment mode, which, in contrast to end-to-end alignment, ignores portions at the ends of the reads that do match well to the reference genome. This is referred to as soft-clipping and allows for a more accurate alignment. The procedure can carry a small penalty for each soft-clipped base, but amounts to a significantly smaller penalty than mismatching bases. In contrast to trimming, which removes the unwanted sequence (hard-clipping), soft-clipping retains the soft-clipped base in the sequence and simply marks it. We will use this option since we did not trim our reads.

How do other aligners compare?

We use Bowtie2 to align our reads in this workshop, but there are a number of other options. For bwa, the mapping rates are higher, with an equally similar increase in the number of duplicate mappings. Consequently, there is a significantly higher number of mapped reads and a much larger number of peaks being called (30% increase compared to Bowtie2). When we compare the peak calls generated from different aligners, the bwa peak calls are a superset of those called from the Bowtie2 aligments. It is yet to be determined whether or not these additional peaks are true positives.

Bowtie2: Building an index

To perform the Bowtie2 alignment, a genome index is required. The index is analagous to the index in a book. By indexing the genome, we have organized it in a manner that now allows for efficient search and retrieval of matches of the query (sequence read) to the genome.

Bowtie2 indexes the genome with an FM Index based on the Burrows-Wheeler Transform method to keep memory requirements low for the alignment process. If interested, a thorough description of the Burrows-Wheeler Transform and the FM Index can be found in this resource paper.

To create an index for your analysis, you can use the bowtie2-build command. There are various arguments you can provide, but in its simplest form all you need as input is the path to the reference genome FASTA file, and a prefix to name your indices once its created.

#### DO NOT RUN THIS CODE ####

$ bowtie2-build <path_to_reference_genome.fa> <prefix_to_name_indexes>

We will not be creating a genome index. Genome indices tend to be very large files, which would take up a lot of storage in our O2 space. Instead of generating our own index, the O2 cluster has a designated directory (/n/groups/shared_databases/) with shared databases for human and other commonly used model organisms that all users can readily access. These files contain, but are not limited to, genome indices for various tools, reference sequences, tool-specific data, and data from public databases, such as NCBI and PDB. Therefore, when using a tool that requires a reference file, it is worth taking a quick look at this directory and checking whether your desired reference is already deposited here.

$ ls -l /n/groups/shared_databases/

NOTE: Disadvantages of using the shared databases include not knowing exactly what is the reference version (i.e. release) or the exact parameters used to generate it. When using shared databases, it is important that it is clear how the reference was generated and its version.

For our dataset, we will need the mm10 build of the reference genome. You can find those indices at /n/groups/shared_databases/bowtie2_indexes/mm10. Rather than copying the files over, we will just point to the directory when necessary using its full path.

Bowtie2: Alignment

Now we are ready to perform the read alignment. Let’s first create a bowtie2 directory for our output:

# Create bowtie2 directory
$ mkdir ~/chipseq_workshop/results/bowtie2

We then need to load the module. We could find out more about Bowtie2 on O2:

# Check modules for Bowtie2
$ module spider bowtie2

# Check for any dependencies
$ module spider bowtie2/2.2.9

Notice that before we load Bowtie2, we also need to load the gcc compiler (as is the case for many other NGS analysis tools on O2). As a tip, we recommend always run module spider first to check any dependent modules.

# Load the necessary compiler and Bowtie2
$ module load gcc/6.2.0 bowtie2/2.2.9

The command to run the alignment is simply bowtie2. Some additional arguments that we will need for aligning reads to the genome using Bowtie2 are described below:

For CUT&RUN and ATAC-seq, there are additional parameters that you want to explore, and we list them below:

How do the parameters change for CUT&RUN?
For CUT&RUN, there are additional parameters that can be used. Here, we list options that have been reported by other groups. It might not be neccessary to include any or all of these options. We encourage you to explore the literature that resemble your research and method, and decide what is best for your data.
How do the parameters change for ATAC-seq?
ATAC-seq usually uses the same parameters as ChIP-seq for alignment, but the following options can be added if needed:

Below is an example of the full command to run bowtie2 on a single FASTQ file wt_sample2_chip. Details on Bowtie2 and its functionality can be found in the user manual; we encourage you to peruse through to get familiar with all available options.

# DO NOT RUN
$ bowtie2 -p 2 -q --local \
-x /n/groups/shared_databases/bowtie2_indexes/mm10 \
-U ~/chipseq_workshop/raw_data/wt_sample2_chip.fastq.gz \
-S ~/chipseq_workshop/results/bowtie2/wt_sample2_chip.sam

Bowtie2 does not generate log summary files. Rather this information gets printed to screen. If we want to capture that and save it in a file we can access later we can use the 2> operator. To redirect the standard error from the bowtie2 command we could do the following:

# DO NOT RUN
bowtie2 -p 2 -q --local \
-x /n/groups/shared_databases/bowtie2_indexes/mm10 \
-U ~/chipseq_workshop/raw_data/wt_sample2_chip.fastq.gz \
-S ~/chipseq_workshop/results/bowtie2/wt_sample2_chip.sam 2> ~/chipseq_workshop/logs/wt_sample2_chip_bowtie2.log

NOTE: What is the 2> operator doing? Whenever we run a command, it generates two types of output: the output accomplished by the process, which is generally referred to as standard output, and the diagnostic output, which is generally referred to as standard error. By convention, both are output to the screen, but we can redirect either output to file using redirection operators. To redirect the standard output to file (if there is no option for specifying in the command), we can use the > operator and to redirect the standard error to file we can use the 2> operator. Tools generally follow the convention of which type of information is contained in each output, but some tools will send the non-diagnostic output to standard error, and vice versa. We often redirect the standard error from the screen to a log file so that we have this information for any downstream troubleshooting.

Alignment output: SAM/BAM file format

The output from the Bowtie2 aligner is an unsorted SAM file, also known as Sequence Alignment/Map format. The SAM file is a tab-delimited text file that contains information for each individual read and its alignment to the genome. While we will go into some features of the SAM format, the paper by Heng Li et al provides a lot more detail on the specification.

The file begins with a header, which is optional. The header is used to describe source of data, reference sequence, method of alignment, etc., this will change depending on the aligner being used. Each section begins with character ‘@’ followed by a two-letter record type code. These are followed by two-letter tags and values.

Following the header is the alignment section. Each line corresponds to the alignment information for a single read. Each alignment line has 11 mandatory fields for essential mapping information and a variable number of other fields for aligner-specific information.

SAM1

An example read mapping is displayed above. Note that the example above spans two lines, but in the actual file it is a single line. Let’s go through the fields one at a time.

Now to the remaining fields in our SAM file:

SAM1

The next three fields are more pertinent to paired-end data.

Finally, you have the raw sequence data from the original FASTQ file stored for each read:

Changing file format from SAM to BAM

While the SAM alignment file from Bowtie2 is human readable, we need a BAM alignment file for downstream analysis. A BAM file is a binary equivalent version of the SAM file, in other words, the same file in a compressed format. Therefore, BAM file is not human readable, and it is much smaller in size. BAM file is the typical format used in bioinformatics tools. We will use Samtools to convert the file format from SAM to BAM. Samtools is a program that consists of many utilities for working with the Sequence Alignment/Map (SAM) format. Here, we will use the samtools view command to convert our SAM file into its binary compressed version (BAM) and save it to file.

NOTE: Once we generate the BAM file, we don’t need to retain the SAM file anymore - we can delete it to save space.

Let’s start by loading the module samtools:

$ module load gcc/6.2.0 # you may not need to load this if you are working in the same session from Bowtie2
$ module load samtools/1.13

We outline below the parameters to use with the command samtools view, and what each does:

NOTE: You can find detailed instructions for different samtools functions and additional parameter options in this manual.

# DO NOT RUN
$ samtools view -h -S -b \
-o ~/chipseq_workshop/results/bowtie2/wt_sample2_chip.bam \
~/chipseq_workshop/results/bowtie2/wt_sample2_chip.sam

Running alignment on a single sample

Genome alignment can take a while to finish, so we won’t run it in an interactive session. Instead, we will create a SBATCH script, alignment.sbatch under the ~/chipseq_workshop/scripts directory, and submit this script as a job on the cluster.

# Create a SBATCH script
vim ~/chipseq_workshop/scripts/alignment.sbatch

NOTE: In the vim, press i to start the editing mode. Once done, press ESC to get back into command mode and type :wq to save and exit.

Let’s specify the job submission options as below (don’t forget the shebang line, #!/bin/bash at the begining):

#SBATCH -p short              # partition name
#SBATCH -c 2                  # number of cores
#SBATCH -t 0-2:00             # time limit
#SBATCH --mem 8G              # requested memory
#SBATCH --job-name alignment 	# job name
#SBATCH -o %j.out			          # file to which standard output will be written
#SBATCH -e %j.err 		          # file to which standard error will be written

In the body of the script, add the code required to:

Please refer to the corresponding code we discussed earlier in this lesson, to fill up the whole script. Once you are done, submit the script as a job, using sbatch ~/chipseq_workshop/scripts/alignment.sbatch command.

NOTE: for this script, do not use the 2> operator. Instead of creating a log file, we will examine what happens to the standard out/error during a job submission.

Click here for solution
#!/bin/bash
#SBATCH -p short # partition name #SBATCH -c 2 # number of cores #SBATCH -t 0-2:00 # time limit #SBATCH --mem 8G # requested memory #SBATCH --job-name alignment # job name #SBATCH -o %j.out # file to which standard output will be written #SBATCH -e %j.err # file to which standard error will be written
module load gcc/6.2.0 bowtie2/2.2.9 samtools/1.13
bowtie2 -p 2 -q --local \ -x /n/groups/shared_databases/bowtie2_indexes/mm10 \ -U ~/chipseq_workshop/raw_data/wt_sample2_chip.fastq.gz \ -S ~/chipseq_workshop/results/bowtie2/wt_sample2_chip.sam
samtools view -h -S -b \ -o ~/chipseq_workshop/results/bowtie2/wt_sample2_chip.bam \ ~/chipseq_workshop/results/bowtie2/wt_sample2_chip.sam
rm ~/chipseq_workshop/results/bowtie2/wt_sample2_chip.sam

NOTE:


Exercise:

  1. After your job has completed, check to see if you have a .out and .err file. If either of the files are present, what information do you obtain from it?
  2. Take a quick peek at a sample BAM file using samtools view. Does the information you see line up with the fields we described above?
  3. What is the alignment rate for the wt_sample2_chip? Do you think the alignment is good?

NOTE: After performing read alignment, it’s useful to evaluate the mapping rate for each sample by taking look at the log files. Additionally, it is common to aggregate QC metrics and visualize them with plots using tools such as MultiQC. This is important to do prior to moving on to the next steps of the analysis.


This lesson has been developed by members of the teaching team at the Harvard Chan Bioinformatics Core (HBC). These are open access materials distributed under the terms of the Creative Commons Attribution license (CC BY 4.0), which permits unrestricted use, distribution, and reproduction in any medium, provided the original author and source are credited.