
Contributors: Meeta Mistry, Upendra Bhattarai, Will Gammerdinger

Approximate time: 20 minutes

Learning Objectives

From sequence reads to peak calls

In this lesson, we will highlight the important steps involved in a typical workflow for the analysis of ChIP-seq or related data. If you are looking for more in-depth information on the background and theory for each step, we suggest looking at the Understanding chromatin biology using high throughput sequencing workshop materials.

NOTE: When starting out with your experiment, there are numerous quality control considerations at the bench when preparing your samples. We have highlighted some of the important points within this lesson. Investing time early in the experiment to ensure good quality samples will pay off with meaningful and reproducible results down the line.

Sequence Data QC

The raw sequence (FASTQ) files you obtain from the sequencing facility will first need to be assessed for quality. Here, we use the tool FastQC to look at metrics like base call quality, sequence duplication levels and overrepresented sequences.

At this stage we are flagging samples with values that deviate from expected ranges. Data that does not look reasonably clean at this point can make downstream steps more difficult, and this may be grounds for removing a sample from downstream analysis. Be sure to consult with your sequencing facility if you suspect there has been an issue in the sequencing.

More specific details on what we assess from the FastQC report can be found here.
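To make the base call quality metric concrete, here is a minimal Python sketch that computes the mean Phred quality at each read position. It is not how FastQC is implemented; it assumes Phred+33 encoded qualities and strict four-line FASTQ records, and the function name `mean_quality_per_position` and the toy reads are made up for illustration.

```python
from io import StringIO


def mean_quality_per_position(fastq_handle, offset=33):
    """Return the mean Phred quality at each read position.

    Assumes four-line FASTQ records and Phred+33 encoding (both assumptions).
    """
    totals = []  # sum of quality scores seen at each position
    counts = []  # number of reads that cover each position
    for i, line in enumerate(fastq_handle):
        if i % 4 != 3:  # the quality string is the 4th line of each record
            continue
        quals = line.rstrip("\n")
        for pos, char in enumerate(quals):
            if pos == len(totals):  # first read long enough to reach this position
                totals.append(0)
                counts.append(0)
            totals[pos] += ord(char) - offset
            counts[pos] += 1
    return [t / c for t, c in zip(totals, counts)]


# Two toy reads (hypothetical data): 'I' is Q40, '5' is Q20, '!' is Q0
example = StringIO(
    "@read1\nACGT\n+\nIIII\n"
    "@read2\nACG\n+\nI5!\n"
)
means = mean_quality_per_position(example)
print(means)
```

A drop in mean quality toward the 3' end of the reads is the kind of pattern the per-base quality plot in a FastQC report makes visible.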

Alignment to genome

Next, we take our reads and map them to the genome. There are a variety of tools used to align reads to a reference genome. For our workflow we use Bowtie2 for this task. The output from the alignment step will be a SAM/BAM file. If you are aligning to a high-quality reference genome (human/mouse/Drosophila), you should expect to see an alignment rate above 90%. If your alignment rate falls well below this threshold, it could be the result of contamination.

More details on this alignment procedure for ChIP-seq can be found here.
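As a rough illustration of what the alignment rate means, the sketch below counts mapped reads in SAM-formatted text using the FLAG field: bit 0x4 marks an unmapped read, and bits 0x100 and 0x800 mark secondary and supplementary records. In practice the aligner's own summary output is the usual place to read this number. The function name `alignment_rate` and the toy records are hypothetical.

```python
def alignment_rate(sam_lines):
    """Fraction of primary SAM records that are mapped."""
    total = 0
    mapped = 0
    for line in sam_lines:
        if line.startswith("@"):  # skip header lines
            continue
        fields = line.rstrip("\n").split("\t")
        flag = int(fields[1])  # FLAG is the second SAM column
        if flag & 0x100 or flag & 0x800:  # skip secondary/supplementary records
            continue
        total += 1
        if not flag & 0x4:  # bit 0x4 set means the read is unmapped
            mapped += 1
    return mapped / total if total else 0.0


# Toy SAM records (hypothetical): three mapped reads and one unmapped read
toy_sam = [
    "@HD\tVN:1.6\tSO:unsorted",
    "r1\t0\tchr1\t100\t42\t4M\t*\t0\t0\tACGT\tIIII",
    "r2\t16\tchr1\t200\t30\t4M\t*\t0\t0\tACGT\tIIII",
    "r3\t4\t*\t0\t0\t*\t*\t0\t0\tACGT\tIIII",
    "r4\t0\tchr2\t50\t1\t4M\t*\t0\t0\tACGT\tIIII",
]
rate = alignment_rate(toy_sam)
print(f"Alignment rate: {rate:.0%}")
```

For real data, a rate well under the 90% mentioned above would be the cue to look for contamination or other problems.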

Filtering BAM files

The raw alignment output from Bowtie2 has a few issues that we will need to filter out for our analysis. These include:

More details on this alignment filtering procedure for ChIP-seq can be found here.

Peak calling

Peak calling, the next step in our workflow, is a computational method used to identify areas in the genome that have been enriched with aligned reads as a consequence of performing a ChIP-seq experiment.

For ChIP-seq experiments, what we observe in the alignment files is a strand asymmetry: read densities on the + and - strands are offset from each other and centered around the binding site. The 5’ ends of the selected fragments form one group on the positive strand and another on the negative strand. The distributions of these groups are then assessed using statistical measures and compared against background (input or IgG samples) to determine whether the site of enrichment is likely to be a real binding site.

Image source: Wilbanks and Facciotti, PLoS One 2010
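The strand asymmetry described above can be sketched in a few lines of Python. The example estimates the offset between the positive-strand and negative-strand groups of 5’ ends by trying every shift and keeping the one with the most overlap. This is a toy illustration only, not the model any particular peak caller uses; the function name `estimate_strand_shift`, the `max_shift` default and the coordinates are all made up.

```python
from collections import Counter


def estimate_strand_shift(plus_starts, minus_starts, max_shift=300):
    """Return the shift (in bp) that best lines up + strand 5' ends with - strand 5' ends."""
    plus = Counter(plus_starts)
    minus = Counter(minus_starts)
    best_shift, best_score = 0, -1
    for shift in range(max_shift + 1):
        # Score = number of (+ read, - read) pairs whose 5' ends are exactly `shift` apart
        score = sum(n * minus.get(pos + shift, 0) for pos, n in plus.items())
        if score > best_score:
            best_shift, best_score = shift, score
    return best_shift


# Hypothetical 5' end coordinates around one binding site.
# For reads on the - strand, the 5' end is the rightmost aligned base.
plus_starts = [900, 905, 910, 910, 915]
minus_starts = [1100, 1105, 1110, 1110, 1115]

shift = estimate_strand_shift(plus_starts, minus_starts)
print(f"Estimated offset between strand groups: {shift} bp")
```

Under this simplified picture, the binding site would sit roughly midway between the two groups, and the offset approximates the fragment length.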

As with alignment algorithms, there are several options for peak calling algorithms. Each has its own strengths, and the best choice can depend on the protein of interest. A more in-depth comparison of peak calling algorithms can be found here. We use the tool MACS2 to call peaks in our dataset.

More details on this peak calling procedure for ChIP-seq as well as a more in-depth look on how MACS2 operates can be found here.
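The comparison against background described in this section can be illustrated with a simple statistical test. The sketch below asks how surprising the ChIP read count in a window would be if reads arrived at the rate seen in the control sample, using a Poisson tail probability. It is a deliberate simplification and should not be read as the exact procedure of MACS2 or any other peak caller; the function name `poisson_sf` and the counts are hypothetical.

```python
import math


def poisson_sf(k, lam):
    """Return P(X >= k) for X ~ Poisson(lam)."""
    cdf = 0.0
    term = math.exp(-lam)  # P(X = 0)
    for i in range(k):
        cdf += term            # add P(X = i)
        term *= lam / (i + 1)  # step to P(X = i + 1)
    return max(0.0, 1.0 - cdf)


# Hypothetical window: 20 ChIP reads, where the control sample
# (already scaled to the same sequencing depth) suggests 5 are expected
chip_count = 20
expected_background = 5.0

p_value = poisson_sf(chip_count, expected_background)
print(f"P(count >= {chip_count} | background rate {expected_background}) = {p_value:.2e}")
```

A very small probability suggests the window is enriched beyond what the background alone would explain. Real peak callers add further steps, such as local background estimates and multiple-testing correction.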

Additional resources on troubleshooting QC issues associated with ChIP-seq can be found here.



This lesson has been developed by members of the teaching team at the Harvard Chan Bioinformatics Core (HBC). These are open access materials distributed under the terms of the Creative Commons Attribution license (CC BY 4.0), which permits unrestricted use, distribution, and reproduction in any medium, provided the original author and source are credited.