Alignment and filtering

Contributors: Mary Piper, Radhika Khetani, Meeta Mistry

Approximate time: 45 minutes

Learning Objectives

Perform alignment of reads to the genome using Bowtie2
Examining a SAM file and understanding the information stored in it
Filtering aligned reads to keep only uniquely mapped ones

Alignment to Genome

Now that we have assessed the quality of our sequence data, we are ready to align the reads to the reference genome. Bowtie2 is a fast and accurate alignment tool that indexes the genome with an FM Index based on the Burrows-Wheeler Transform method to keep memory requirements low for the alignment process. Bowtie2 supports gapped, local and paired-end alignment modes and works best for reads that are at least 50 bp (shorter read lengths should use Bowtie1). By default, Bowtie2 will perform a global end-to-end read alignment, which is best for quality-trimmed reads. However, it also has a local alignment mode, which will perform soft-clipping for the removal of poor quality bases or adapters from untrimmed reads. We will use this option since we did not trim our reads.

NOTE: Our reads are only 36 bp, so technically we should explore alignment with bwa or Bowtie1 to see if it is better. However, since it is rare that you will have sequencing reads with less than 50 bp, we will show you how to perform alignment using Bowtie2.

How do other aligners compare?

In this workshop we are using Bowtie2 to align our reads, but there are a number of other options. We have explored the use of bwa for ChIP-seq analysis and found some differences. For bwa, the mapping rates are higher (~ 2%), with an equally similar increase in the number of duplicate mappings identified. Post-filtering this translates to a significantly higher number of mapped reads and results in a much larger number of peaks being called (30% increase). When we compare the peak calls generated from the different aligners, the bwa peak calls are a superset of those called from the Bowtie2 aligments. Whether or not these additional peaks are true positives, is something that is yet to be determined.

Creating a Bowtie2 index

To perform the Bowtie2 alignment, a genome index is required. The index is analagous to the index in the back of a book. By indexing the genome, we have organized it in a manner that now allows for efficient search and retrieval of matches of the query (sequence read) to the genome. We previously generated the genome indices for you, and they exist in the reference_data directory.

However, if you needed to create a genome index yourself, you would use the following command:

# DO NOT RUN

bowtie2-build <path_to_reference_genome.fa> <prefix_to_name_indexes>

A quick note on shared databases for human and other commonly used model organisms. The O2 cluster has a designated directory at /n/groups/shared_databases/ in which there are files that can be accessed by any user. These files contain, but are not limited to, genome indices for various tools, reference sequences, tool specific data, and data from public databases, such as NCBI and PDB. So when using a tool that requires a reference of sorts, it is worth taking a quick look here because chances are it’s already been taken care of for you.

$ ls -l /n/groups/shared_databases/igenome/

Aligning reads to the genome with Bowtie2

Since we have our indices already created, we can get started with read alignment. Change directories to the bowtie2 folder:

$ cd ~/chipseq/results/bowtie2

Now let’s load the module. We can find out more on the module on O2:

$ module spider bowtie2

You will notice that before we load this module we also need to load the gcc compiler (as will be the case for many of the NGS analysis tools on O2. Always check module spider first.)

$ module load gcc/6.2.0 bowtie2/2.2.9

We will perform alignment on our single raw FASTQ file, H1hesc_Input_Rep1_chr12.fastq. Details on Bowtie2 and its functionality can be found in the user manual; we encourage you to peruse through to get familiar with all available options.

The basic options for aligning reads to the genome using Bowtie2 are:

-p: number of processors / cores
-q: reads are in FASTQ format
--local: local alignment feature to perform soft-clipping
-x: /path/to/genome_indices_directory
-U: /path/to/FASTQ_file
-S: /path/to/output/SAM_file

$ bowtie2 -p 2 -q --local \
-x ~/chipseq/reference_data/chr12 \
-U ~/chipseq/raw_data/H1hesc_Input_Rep1_chr12.fastq \
-S ~/chipseq/results/bowtie2/H1hesc_Input_Rep1_chr12_aln_unsorted.sam

Alignment file format: SAM/BAM

The output we requested from the Bowtie2 aligner is an unsorted SAM file, also known as Sequence Alignment Map format. The SAM file, is a tab-delimited text file that contains information for each individual read and its alignment to the genome. While we will go into some features of the SAM format, the paper by Heng Li et al provides a lot more detail on the specification.

The file begins with a header, which is optional. The header is used to describe source of data, reference sequence, method of alignment, etc., this will change depending on the aligner being used. Each section begins with character ‘@’ followed by a two-letter record type code. These are followed by two-letter tags and values. Example of some common sections are provided below:

@HD  The header line
VN: format version
SO: Sorting order of alignments

@SQ  Reference sequence dictionary
SN: reference sequence name
LN: reference sequence length
SP: species

@PG  Program
PN: program name
VN: program version

Following the header is the alignment section. Each line that follows corresponds to alignment information for a single read. Each alignment line has 11 mandatory fields for essential mapping information and a variable number of other fields for aligner specific information.

SAM1

An example read mapping is displayed above. Note that the example above spans two lines, but in the file it is a single line. Let’s go through the fields one at a time.

QNAME: Query name or read name - this is the same read name present in the header of the FASTQ file

FLAG: numerical value providing information about read mapping and whether the read is part of a pair.

NOTE: The information stored inside the FLAG is additive based on the following information being TRUE or FALSE:

Flag Description

1 read is mapped

2 read is mapped as part of a pair

4 read is unmapped

8 mate is unmapped

16 read reverse strand

32 mate reverse strand

64 first in pair

128 second in pair

256 not primary alignment

512 read fails platform/vendor quality checks

1024 read is PCR or optical duplicate

For a given alignment, each of these flags are either on or off indicating the condition is true or false.

The FLAG is a combination of all of the individual flags (from the table above) that are true for the alignment

The beauty of the flag values is that any combination of flags can only result in one sum.

There are tools that help you translate the bitwise flag, for example this one from Picard

RNAME: is the reference sequence name, giving the chromosome to which the read mapped. The example read is from chromosome 1 which explains why we see ‘chr1’.
POS: refers to the 1-based leftmost position of the alignment.
MAPQ: is giving us the alignment quality, the scale of which will depend on the aligner being used.
CIGAR: is a sequence of letters and numbers that represent the edits or operations required to match the read to the reference. The letters are operations that are used to indicate which bases align to the reference (i.e. match, mismatch, deletion, insertion), and the numbers indicate the associated base lengths for each ‘operation’.

Flag	Description
1	read is mapped
2	read is mapped as part of a pair
4	read is unmapped
8	mate is unmapped
16	read reverse strand
32	mate reverse strand
64	first in pair
128	second in pair
256	not primary alignment
512	read fails platform/vendor quality checks
1024	read is PCR or optical duplicate

Now to the remaning fields in our SAM file:

SAM1

The next three fields are more pertinent to paired-end data.

MRNM: is the mate reference name.
MPOS: is the mate position (1-based, leftmost).
ISIZE: is the inferred insert size.

Finally, you have the raw sequence data from the original FASTQ file stored for each read:

SEQ: is the raw sequence
QUAL: is the associated quality values for each position in the read.

Let’s take a quick peek at our SAM file that we just generated. Since it is just a text file, we can browse through it using less:

$ less H1hesc_Input_Rep1_chr12_aln_unsorted.sam

Does the information you see line up with the fields we described above?

Filtering reads

An important issue with ChIP-seq data concerns the inclusion of multiple mapped reads (reads mapped to multiple loci on the reference genome). Allowing for multiple mapped reads increases the number of usable reads and the sensitivity of peak detection; however, the number of false positives may also increase [1]. Therefore we need to filter our alignment files to contain only uniquely mapping reads in order to increase confidence in site discovery and improve reproducibility. Since there is no parameter in Bowtie2 to keep only uniquely mapping reads, we will need to perform the following steps to generate alignment files containing only the uniquely mapping reads:

Change alignment file format from SAM to BAM
Sort BAM file by read coordinate locations
Filter to keep only uniquely mapping reads (this will also remove any unmapped reads)

1. Changing file format from SAM to BAM

While the SAM alignment file output by Bowtie2 is human readable, we need a BAM alignment file for downstream tools. Therefore, we will use Samtools to convert the file formats.

To use samtools we will need to load the module:

$ module load gcc/6.2.0 # you may not need to load this if you are working in the same session from Bowtie2
$ module load samtools/1.9

The command we will use is samtools view with the following parameters:

-h: include header in output
-S: input is in SAM format
-b: output BAM format
-o: /path/to/output/file

$ samtools view -h -S -b \
-o H1hesc_Input_Rep1_chr12_aln_unsorted.bam \
H1hesc_Input_Rep1_chr12_aln_unsorted.sam

You can find additional parameters for the samtools functions in the manual.

2. Sorting BAM files by genomic coordinates

Before we can filter to keep the uniquely mapping reads, we need to sort our BAM alignment files by genomic coordinates (instead of by name). To perform this sort, we will use Sambamba, which is a tool that quickly processes BAM and SAM files.

The command we will use is sambamba sort with the following parameters:

-t: number of threads / cores
-o: /path/to/output/file

$ sambamba sort -t 2 \
-o H1hesc_Input_Rep1_chr12_aln_sorted.bam \
H1hesc_Input_Rep1_chr12_aln_unsorted.bam 

NOTE: This tool is not available as a module on O2. You will only be able to use this as part of the tools available in the bcbio pipeline. In a previous lesson, you had added this to your $PATH by modifying your .bashrc file. If the command above does not work for you, run this line below:

export PATH=/n/app/bcbio/tools/bin:$PATH

We could have also used samtools to perform the above sort, however using sambamba gives us dual functionality. List the contents of the directory – what do you see? The advantage to using sambamba is that along with the newly sorted file, an index file is generated. If we used samtools this would have been a two-step process.

3. Filtering uniquely mapping reads

Finally, we can filter the BAM to keep only uniquely mapping reads. We will use the sambamba view command with the following parameters:

-t: number of threads / cores
-h: print SAM header before reads
-f: format of output file (default is SAM)
-F: set custom filter - we will be using the filter to remove duplicates, multimappers and unmapped reads.

$ sambamba view -h -t 2 -f bam \
-F "[XS] == null and not unmapped  and not duplicate" \
H1hesc_Input_Rep1_chr12_aln_sorted.bam > H1hesc_Input_Rep1_chr12_aln.bam

We filtered out unmapped reads by specifying in the filter not unmapped, and duplicates with not duplicate. Also, among the reads that were aligned, we filtered out multimappers by specifying [XS] == null. ‘XS’ is a tag generated by Bowtie2 that gives an alignment score for the second-best alignment, and it is only present if the read is aligned and more than one alignment was found for the read.

Now that the alignment files contain only uniquely mapping reads, we are ready to perform peak calling.

NOTE: After performing read alignment, it’s useful to generate QC metrics for the alignment using tools such as MultiQC prior to moving on to the next steps of the analysis.

Filtering out Blacklisted Regions

Although we will not perform this step, it is common practice to apply an additional level of filtering to our BAM files. That is, we remove alignments that occur with defined Blacklisted Regions.

Blacklisted regions represent artifact regions that tend to show artificially high signal (excessive unstructured anomalous reads mapping). These regions are often found at specific types of repeats such as centromeres, telomeres and satellite repeats and typically appear uniquely mappable so simple mappability filters applied above do not remove them. The ENCODE and modENCODE consortia have compiled blacklists for various species and genome versions including human, mouse, worm and fly. These blacklisted regions (coordinate files) can be filtered out from our alignment files before proceeding to peak calling.

We will revisit this in more detail when we discuss QC metrics in a later lesson.

This lesson has been developed by members of the teaching team at the Harvard Chan Bioinformatics Core (HBC). These are open access materials distributed under the terms of the Creative Commons Attribution license (CC BY 4.0), which permits unrestricted use, distribution, and reproduction in any medium, provided the original author and source are credited.