
Approximate time: 45 minutes

Learning Objectives

From Sequence reads to Peak calls

In this workshop we covered the steps in the first half of the ChIP-seq workflow where we go from raw sequence reads through to peak calls. We have discussed each of the steps in detail, outlining the tools involved and the file formats encountered. In this lesson, we revisit the quality checks associated with each step and summarize the main points to take away.

Quality control of sequence reads

The quality checks at this stage in the workflow include:

Examples of bad quality data

Good quality data masquerading as bad quality

Some modules of the FastQC report would be flagged as bad quality for other types of NGS data, yet actually indicate good quality ChIP-seq data. Quality checks include looking for these modules:

Alignment quality

The quality checks at this stage in the workflow include:

If my mapping rate is low, do I discard my sample? Do not discard your sample. Instead, you will want to:

  1. Flag the sample as low quality. Keep an eye out for QC metrics later in the workflow for that same sample.
  2. Troubleshoot the sample. Take the unmapped reads and BLAST the sequences; if the reads are not mapping to the genome, where are they mapping? It’s possible you might identify a high level of contamination from another organism.

Image source: Landt et al., 2012
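As one way to start the troubleshooting step above, a handful of unmapped reads can be converted to FASTA and submitted to BLAST. The sketch below is a minimal example, assuming the unmapped reads have already been written to a FASTQ file with four lines per record (for instance with Bowtie 2's `--un` option). The function name `fastq_to_fasta`, the `max_reads` parameter, and the example reads are hypothetical and only for illustration.

```python
def fastq_to_fasta(fastq_lines, max_reads=10):
    """Convert the first `max_reads` FASTQ records to FASTA text.

    Assumes the standard four-line FASTQ layout:
    header, sequence, '+' separator, quality string.
    """
    records = []
    lines = iter(fastq_lines)
    while len(records) < max_reads:
        header = next(lines, None)
        if header is None:
            break  # no more records
        sequence = next(lines, "")
        next(lines, None)  # skip the '+' separator line
        next(lines, None)  # skip the quality string
        if not header.startswith("@"):
            raise ValueError(f"unexpected FASTQ header: {header!r}")
        records.append(f">{header[1:].strip()}\n{sequence.strip()}")
    return "\n".join(records)


# Made-up reads, for illustration only.
example_fastq = [
    "@read1\n", "ACGT\n", "+\n", "IIII\n",
    "@read2\n", "TTGA\n", "+\n", "IIII\n",
]
print(fastq_to_fasta(example_fastq, max_reads=1))
```

With a real file, the same function could be given an open file handle instead of a list. Only a small subset of reads is usually needed to see whether another organism dominates the hits.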

NOTE: For paired-end reads you will also want to check the percentage of reads that are properly paired. By default, Bowtie 2 searches for both concordant and discordant alignments, though searching for discordant alignments can be disabled with the --no-discordant option.
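The percentage of properly paired reads can be derived from the SAM FLAG field, where bit 0x1 marks a paired read and bit 0x2 marks a read mapped in a proper pair. The following is a minimal sketch that works on uncompressed SAM text, not a replacement for dedicated tools (`samtools flagstat` reports a comparable figure). The function name `percent_properly_paired` and the example records are hypothetical.

```python
def percent_properly_paired(sam_lines):
    """Percentage of paired primary records whose 'proper pair' bit (0x2) is set."""
    paired = 0
    proper = 0
    for line in sam_lines:
        if line.startswith("@"):
            continue  # header line
        fields = line.rstrip("\n").split("\t")
        if len(fields) < 2:
            continue  # not a valid alignment record
        flag = int(fields[1])
        if flag & 0x900:
            continue  # skip secondary (0x100) and supplementary (0x800) records
        if not flag & 0x1:
            continue  # read is not part of a pair
        paired += 1
        if flag & 0x2:
            proper += 1
    return 100.0 * proper / paired if paired else 0.0


# Made-up SAM records, for illustration only.
example_sam = [
    "@HD\tVN:1.6\n",
    "r1\t99\tchr1\t100\t42\t4M\t=\t200\t104\tACGT\tIIII\n",
    "r1\t147\tchr1\t200\t42\t4M\t=\t100\t-104\tACGT\tIIII\n",
    "r2\t65\tchr1\t300\t42\t4M\tchr2\t500\t0\tACGT\tIIII\n",
    "r2\t129\tchr2\t500\t42\t4M\tchr1\t300\t0\tACGT\tIIII\n",
    "r1\t355\tchr3\t100\t1\t4M\t=\t200\t104\tACGT\tIIII\n",
]
print(percent_properly_paired(example_sam))
```

In this example two of the four paired primary records carry the proper-pair bit, and the secondary record (flag 355) is ignored.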

MultiQC: Aggregating QC metrics in report format

As you go through the ChIP-seq workflow (or most NGS workflows), it is important to track the metrics and results at every step. Careful evaluation of the metrics described above enables you to identify any issues with the data or the parameters you are using, and alerts you to the presence of contamination or systematic biases. An important QC step is to make sure that these metrics are consistent across the samples for a given experiment; any outliers should be investigated further.
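One simple way to check a metric for consistency across samples is to flag values that sit far from the median. The sketch below uses the median absolute deviation (MAD) for this. The function name `flag_outliers`, the threshold of 3 MADs, and the mapping rates are all assumptions chosen for illustration, not recommended cutoffs.

```python
from statistics import median


def flag_outliers(metric_by_sample, threshold=3.0):
    """Return sample names whose metric lies more than `threshold` MADs from the median.

    If the MAD is zero (most samples share the same value), any sample
    that differs from the median is flagged.
    """
    values = list(metric_by_sample.values())
    if not values:
        return []
    center = median(values)
    mad = median(abs(value - center) for value in values)
    if mad == 0:
        return [name for name, value in metric_by_sample.items() if value != center]
    return [
        name
        for name, value in metric_by_sample.items()
        if abs(value - center) / mad > threshold
    ]


# Hypothetical mapping rates (percent), for illustration only.
mapping_rates = {
    "rep1": 96.1,
    "rep2": 95.4,
    "rep3": 94.8,
    "input1": 95.9,
    "rep4": 61.2,
}
print(flag_outliers(mapping_rates))
```

A flagged sample is not necessarily a failed sample; it is a prompt to look more closely at the other QC metrics for that sample.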

Manually tracking these metrics is tedious and error-prone. MultiQC is an open source tool used to aggregate bioinformatics results. It can generate an HTML report from the output of 96 different bioinformatics tools, and includes helpful visualizations for comparing samples within a dataset. While we are not implementing it in this workshop, we encourage you to read through this lesson as an example of its use in RNA-seq data analysis.

Peak quality checks

The quality checks at this stage in the workflow include:

Total number of peaks

This number will vary depending on your protein of interest and the number of expected binding sites. It can range from thousands of regions to hundreds of thousands. If only a handful of regions are identified as significantly enriched, there is a high likelihood that your experiment failed.

Image source: Hendrix, DA, “Applied Bioinformatics” - Online textbook from Oregon State University
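The total number of peaks can be read directly from the peak file, since each data line describes one peak. The following is a minimal sketch, assuming a tab-delimited BED-style peak file (such as narrowPeak) that may contain comment, `track`, or `browser` lines. The function name `count_peaks` and the example records are hypothetical.

```python
def count_peaks(peak_lines):
    """Count data lines in a BED-style peak file, skipping blanks and header lines."""
    count = 0
    for line in peak_lines:
        stripped = line.strip()
        if not stripped or stripped.startswith(("#", "track", "browser")):
            continue
        count += 1
    return count


# Made-up narrowPeak records, for illustration only.
example_peaks = [
    'track type=narrowPeak name="example"\n',
    "chr1\t100\t300\tpeak_1\t250\t.\t8.5\t12.1\t9.8\t90\n",
    "chr2\t5000\t5400\tpeak_2\t310\t.\t10.2\t15.4\t12.7\t180\n",
    "\n",
]
print(count_peaks(example_peaks))
```

With a real file, an open file handle can be passed in place of the list.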

Possible reasons you are not seeing many peaks:

Read enrichment within known artifact regions

Filtering against exclusion regions, or “blacklists” (regions where genome assembly results in erroneous signal), is a critical part of the workflow, as it helps to remove signal-artifact regions in ChIP-seq experiments.

Image source: The ENCODE Blacklist: Identification of Problematic Regions of the Genome

As described in this workshop, filtering can happen after alignment or after peak calling. If a high percentage of our peaks are filtered out due to overlap with blacklist regions, this tells us that most of the peaks we identified were in fact background noise, and suggests that your experiment did not work. If the majority of peaks identified are attributed to background noise, there is effectively little to no true signal in your data. To troubleshoot why it didn’t work, see some of the points listed above.
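The percentage of peaks overlapping blacklist regions can be estimated with a simple interval comparison. The sketch below assumes that peaks and blacklist regions are given as `(chrom, start, end)` tuples using half-open BED coordinates. The function name `fraction_in_blacklist` and the example intervals are hypothetical, and in practice a dedicated tool such as bedtools would typically handle files of realistic size.

```python
def fraction_in_blacklist(peaks, blacklist):
    """Fraction of peaks that overlap at least one blacklist region.

    Both inputs are lists of (chrom, start, end) tuples in half-open
    BED coordinates, so intervals that merely touch do not overlap.
    """
    regions_by_chrom = {}
    for chrom, start, end in blacklist:
        regions_by_chrom.setdefault(chrom, []).append((start, end))

    hits = 0
    for chrom, start, end in peaks:
        regions = regions_by_chrom.get(chrom, [])
        if any(start < b_end and b_start < end for b_start, b_end in regions):
            hits += 1
    return hits / len(peaks) if peaks else 0.0


# Made-up intervals, for illustration only.
example_peak_set = [
    ("chr1", 100, 200),
    ("chr1", 500, 600),
    ("chr2", 100, 200),
    ("chr2", 1000, 1100),
]
example_blacklist = [("chr1", 150, 300), ("chr2", 900, 1000)]
print(fraction_in_blacklist(example_peak_set, example_blacklist))
```

Here only the first peak overlaps a blacklist region; the last peak starts exactly where a region ends, which does not count as an overlap in half-open coordinates.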

Replicate concordance

Unlike RNA-seq, increasing replicates in your ChIP-seq will not increase the number of binding sites identified. Rather, it gives you confidence that the sites you identified are true signal.

Representative browser snapshot of the four EGR1 ChIP-seq experiments, showing the much stronger peaks obtained with the second set of replicates

Image source: Landt et al., 2012
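A rough first look at replicate concordance is the share of peaks in one replicate that are also found in the other. The sketch below measures this in both directions, assuming peaks are `(chrom, start, end)` tuples in half-open BED coordinates and that any overlap of at least one base counts as shared. The function name `overlapping_fraction` and the example peaks are hypothetical, and this simple overlap is only one of several ways concordance can be measured.

```python
from collections import defaultdict


def overlapping_fraction(query, target):
    """Fraction of `query` peaks that overlap at least one `target` peak."""
    target_by_chrom = defaultdict(list)
    for chrom, start, end in target:
        target_by_chrom[chrom].append((start, end))

    shared = sum(
        1
        for chrom, start, end in query
        if any(start < t_end and t_start < end for t_start, t_end in target_by_chrom[chrom])
    )
    return shared / len(query) if query else 0.0


# Made-up peaks for two replicates, for illustration only.
rep1_peaks = [("chr1", 100, 200), ("chr1", 400, 500), ("chr2", 50, 150)]
rep2_peaks = [
    ("chr1", 120, 220),
    ("chr2", 60, 140),
    ("chr2", 300, 400),
    ("chr3", 10, 20),
]
print(overlapping_fraction(rep1_peaks, rep2_peaks))
print(overlapping_fraction(rep2_peaks, rep1_peaks))
```

The two directions can give different values when the replicates have different numbers of peaks, so it is worth reporting both.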

Qualitative assessment of enriched regions

At this point, if you have a reasonable number of peaks and you observe a good amount of concordance between replicates, the next step is evaluating the enriched regions. You can do this with a simple site-based inspection (i.e., use a genome viewer to look for enrichment profiles for specific target genes), or use profile plots for a genome-wide assessment.


This lesson has been developed by members of the teaching team at the Harvard Chan Bioinformatics Core (HBC). These are open access materials distributed under the terms of the Creative Commons Attribution license (CC BY 4.0), which permits unrestricted use, distribution, and reproduction in any medium, provided the original author and source are credited.