Contributors: Mary Piper, Radhika Khetani, Meeta Mistry
Approximate time: 45 minutes
- Perform alignment of reads to the genome using Bowtie2
- Examining a SAM file and understanding the information stored in it
- Filtering aligned reads to keep only uniquely mapped ones
Alignment to Genome
Now that we have assessed the quality of our sequence data, we are ready to align the reads to the reference genome. Bowtie2 is a fast and accurate alignment tool that indexes the genome with an FM Index based on the Burrows-Wheeler Transform method to keep memory requirements low for the alignment process. Bowtie2 supports gapped, local and paired-end alignment modes and works best for reads that are at least 50 bp (shorter read lengths should use Bowtie1). By default, Bowtie2 will perform a global end-to-end read alignment, which is best for quality-trimmed reads. However, it also has a local alignment mode, which will perform soft-clipping for the removal of poor quality bases or adapters from untrimmed reads. We will use this option since we did not trim our reads.
NOTE: Our reads are only 36 bp, so technically we should explore alignment with bwa or Bowtie1 to see if it is better. However, since it is rare that you will have sequencing reads with less than 50 bp, we will show you how to perform alignment using Bowtie2.
How do other aligners compare?
In this workshop we are using Bowtie2 to align our reads, but there are a number of other options. We have explored the use of bwa for ChIP-seq analysis and found some differences. For bwa, the mapping rates are higher (~ 2%), with an equally similar increase in the number of duplicate mappings identified. Post-filtering this translates to a significantly higher number of mapped reads and results in a much larger number of peaks being called (30% increase). When we compare the peak calls generated from the different aligners, the bwa peak calls are a superset of those called from the Bowtie2 aligments. Whether or not these additional peaks are true positives, is something that is yet to be determined.
Creating a Bowtie2 index
To perform the Bowtie2 alignment, a genome index is required. The index is analagous to the index in the back of a book. By indexing the genome, we have organized it in a manner that now allows for efficient search and retrieval of matches of the query (sequence read) to the genome. We previously generated the genome indices for you, and they exist in the
However, if you needed to create a genome index yourself, you would use the following command:
# DO NOT RUN bowtie2-build <path_to_reference_genome.fa> <prefix_to_name_indexes>
A quick note on shared databases for human and other commonly used model organisms. The O2 cluster has a designated directory at
/n/groups/shared_databases/in which there are files that can be accessed by any user. These files contain, but are not limited to, genome indices for various tools, reference sequences, tool specific data, and data from public databases, such as NCBI and PDB. So when using a tool that requires a reference of sorts, it is worth taking a quick look here because chances are it’s already been taken care of for you.
$ ls -l /n/groups/shared_databases/igenome/
Aligning reads to the genome with Bowtie2
Since we have our indices already created, we can get started with read alignment. Change directories to the
$ cd ~/chipseq/results/bowtie2
Now let’s load the module. We can find out more on the module on O2:
$ module spider bowtie2
You will notice that before we load this module we also need to load the gcc compiler (as will be the case for many of the NGS analysis tools on O2. Always check
module spider first.)
$ module load gcc/6.2.0 bowtie2/2.2.9
We will perform alignment on our single raw FASTQ file,
H1hesc_Input_Rep1_chr12.fastq. Details on Bowtie2 and its functionality can be found in the user manual; we encourage you to peruse through to get familiar with all available options.
The basic options for aligning reads to the genome using Bowtie2 are:
-p: number of processors / cores
-q: reads are in FASTQ format
--local: local alignment feature to perform soft-clipping
$ bowtie2 -p 2 -q --local \ -x ~/chipseq/reference_data/chr12 \ -U ~/chipseq/raw_data/H1hesc_Input_Rep1_chr12.fastq \ -S ~/chipseq/results/bowtie2/H1hesc_Input_Rep1_chr12_aln_unsorted.sam
Alignment file format: SAM/BAM
The output we requested from the Bowtie2 aligner is an unsorted SAM file, also known as Sequence Alignment Map format. The SAM file, is a tab-delimited text file that contains information for each individual read and its alignment to the genome. While we will go into some features of the SAM format, the paper by Heng Li et al provides a lot more detail on the specification.
The file begins with a header, which is optional. The header is used to describe source of data, reference sequence, method of alignment, etc., this will change depending on the aligner being used. Each section begins with character ‘@’ followed by a two-letter record type code. These are followed by two-letter tags and values. Example of some common sections are provided below:
@HD The header line VN: format version SO: Sorting order of alignments @SQ Reference sequence dictionary SN: reference sequence name LN: reference sequence length SP: species @PG Program PN: program name VN: program version
Following the header is the alignment section. Each line that follows corresponds to alignment information for a single read. Each alignment line has 11 mandatory fields for essential mapping information and a variable number of other fields for aligner specific information.
An example read mapping is displayed above. Note that the example above spans two lines, but in the file it is a single line. Let’s go through the fields one at a time.
QNAME: Query name or read name - this is the same read name present in the header of the FASTQ file
FLAG: numerical value providing information about read mapping and whether the read is part of a pair.
NOTE: The information stored inside the FLAG is additive based on the following information being TRUE or FALSE:
Flag Description 1 read is mapped 2 read is mapped as part of a pair 4 read is unmapped 8 mate is unmapped 16 read reverse strand 32 mate reverse strand 64 first in pair 128 second in pair 256 not primary alignment 512 read fails platform/vendor quality checks 1024 read is PCR or optical duplicate
- For a given alignment, each of these flags are either on or off indicating the condition is true or false.
FLAGis a combination of all of the individual flags (from the table above) that are true for the alignment
- The beauty of the flag values is that any combination of flags can only result in one sum.
There are tools that help you translate the bitwise flag, for example this one from Picard
RNAME: is the reference sequence name, giving the chromosome to which the read mapped. The example read is from chromosome 1 which explains why we see ‘chr1’.
POS: refers to the 1-based leftmost position of the alignment.
MAPQ: is giving us the alignment quality, the scale of which will depend on the aligner being used.
CIGAR: is a sequence of letters and numbers that represent the edits or operations required to match the read to the reference. The letters are operations that are used to indicate which bases align to the reference (i.e. match, mismatch, deletion, insertion), and the numbers indicate the associated base lengths for each ‘operation’.
Now to the remaning fields in our SAM file:
The next three fields are more pertinent to paired-end data.
MRNM: is the mate reference name.
MPOS: is the mate position (1-based, leftmost).
ISIZE: is the inferred insert size.
Finally, you have the raw sequence data from the original FASTQ file stored for each read:
SEQ: is the raw sequence
QUAL: is the associated quality values for each position in the read.
Let’s take a quick peek at our SAM file that we just generated. Since it is just a text file, we can browse through it using
$ less H1hesc_Input_Rep1_chr12_aln_unsorted.sam
Does the information you see line up with the fields we described above?
An important issue with ChIP-seq data concerns the inclusion of multiple mapped reads (reads mapped to multiple loci on the reference genome). Allowing for multiple mapped reads increases the number of usable reads and the sensitivity of peak detection; however, the number of false positives may also increase . Therefore we need to filter our alignment files to contain only uniquely mapping reads in order to increase confidence in site discovery and improve reproducibility. Since there is no parameter in Bowtie2 to keep only uniquely mapping reads, we will need to perform the following steps to generate alignment files containing only the uniquely mapping reads:
- Change alignment file format from SAM to BAM
- Sort BAM file by read coordinate locations
- Filter to keep only uniquely mapping reads (this will also remove any unmapped reads)
1. Changing file format from SAM to BAM
While the SAM alignment file output by Bowtie2 is human readable, we need a BAM alignment file for downstream tools. Therefore, we will use Samtools to convert the file formats.
samtools we will need to load the module:
$ module load gcc/6.2.0 # you may not need to load this if you are working in the same session from Bowtie2 $ module load samtools/1.9
The command we will use is
samtools view with the following parameters:
-h: include header in output
-S: input is in SAM format
-b: output BAM format
$ samtools view -h -S -b \ -o H1hesc_Input_Rep1_chr12_aln_unsorted.bam \ H1hesc_Input_Rep1_chr12_aln_unsorted.sam
You can find additional parameters for the samtools functions in the manual.
2. Sorting BAM files by genomic coordinates
Before we can filter to keep the uniquely mapping reads, we need to sort our BAM alignment files by genomic coordinates (instead of by name). To perform this sort, we will use Sambamba, which is a tool that quickly processes BAM and SAM files.
The command we will use is
sambamba sort with the following parameters:
-t: number of threads / cores
$ sambamba sort -t 2 \ -o H1hesc_Input_Rep1_chr12_aln_sorted.bam \ H1hesc_Input_Rep1_chr12_aln_unsorted.bam
NOTE: This tool is not available as a module on O2. You will only be able to use this as part of the tools available in the
bcbiopipeline. In a previous lesson, you had added this to your $PATH by modifying your
.bashrcfile. If the command above does not work for you, run this line below:
We could have also used
samtools to perform the above sort, however using
sambamba gives us dual functionality. List the contents of the directory – what do you see? The advantage to using
sambamba is that along with the newly sorted file, an index file is generated. If we used
samtools this would have been a two-step process.
3. Filtering uniquely mapping reads
Finally, we can filter the BAM to keep only uniquely mapping reads. We will use the
sambamba view command with the following parameters:
-t: number of threads / cores
-h: print SAM header before reads
-f: format of output file (default is SAM)
-F: set custom filter - we will be using the filter to remove duplicates, multimappers and unmapped reads.
$ sambamba view -h -t 2 -f bam \ -F "[XS] == null and not unmapped and not duplicate" \ H1hesc_Input_Rep1_chr12_aln_sorted.bam > H1hesc_Input_Rep1_chr12_aln.bam
We filtered out unmapped reads by specifying in the filter
not unmapped, and duplicates with
not duplicate. Also, among the reads that were aligned, we filtered out multimappers by specifying
[XS] == null. ‘XS’ is a tag generated by Bowtie2 that gives an alignment score for the second-best alignment, and it is only present if the read is aligned and more than one alignment was found for the read.
Now that the alignment files contain only uniquely mapping reads, we are ready to perform peak calling.
NOTE: After performing read alignment, it’s useful to generate QC metrics for the alignment using tools such as MultiQC prior to moving on to the next steps of the analysis.
Filtering out Blacklisted Regions
Although we will not perform this step, it is common practice to apply an additional level of filtering to our BAM files. That is, we remove alignments that occur with defined Blacklisted Regions.
Blacklisted regions represent artifact regions that tend to show artificially high signal (excessive unstructured anomalous reads mapping). These regions are often found at specific types of repeats such as centromeres, telomeres and satellite repeats and typically appear uniquely mappable so simple mappability filters applied above do not remove them. The ENCODE and modENCODE consortia have compiled blacklists for various species and genome versions including human, mouse, worm and fly. These blacklisted regions (coordinate files) can be filtered out from our alignment files before proceeding to peak calling.
We will revisit this in more detail when we discuss QC metrics in a later lesson.
This lesson has been developed by members of the teaching team at the Harvard Chan Bioinformatics Core (HBC). These are open access materials distributed under the terms of the Creative Commons Attribution license (CC BY 4.0), which permits unrestricted use, distribution, and reproduction in any medium, provided the original author and source are credited.