
Contributors: Mary Piper, Radhika Khetani, Meeta Mistry, Jihe Liu, Will Gammerdinger

Approximate time: 30 minutes

Learning Objectives

Introduction to the dataset

For this workshop we will be working with ChIP-seq data from a recent publication in Neuron by Baizabal et al. (2018) [1].

Please note that even though we are utilizing a ChIP-seq dataset for this workshop, we will be highlighting how the code/parameters will differ if you are analyzing either ATAC-seq or CUT&RUN data.

Baizabal et al. sought to understand how chromatin-modifying enzymes function in neural stem cells to establish the epigenetic landscape that determines cell type and stage-specific gene expression. Chromatin-modifying enzymes are transcriptional regulators that control gene expression through covalent modification of DNA or histones.

Image adapted from: American Society of Hematology

PRDM16

The transcriptional regulator PRDM16 is a chromatin-modifying enzyme that belongs to the larger PRDM (Positive Regulatory Domain) protein family, which is structurally defined by the presence of a conserved N-terminal histone methyltransferase PR domain (Hohenauer and Moore, 2012).

How PRDM16 functions to regulate transcriptional programs in the developing cerebral cortex remains largely unknown.

In this paper, the authors use various techniques to identify and validate the targets and activities of PRDM16, including ChIP-seq, bulk RNA-seq, FACS, in-situ hybridization and immunofluorescent microscopy on brain samples from embryonic mice, as well as the generation of PRDM16 conditional knockout mice.

From the RNA-seq data, they found that the absence of PRDM16 in cortical neurons resulted in the misregulation of over a thousand genes during neurogenesis. To identify the subset of genes that are transcriptional targets of PRDM16 and to understand how these genes are directly regulated, they performed chromatin immunoprecipitation followed by sequencing (ChIP-seq).

Hypothesis: How does the histone methyltransferase PRDM16 work with other chromatin machinery to either silence or activate expression of sets of genes that impact the organization of the cerebral cortex?

Raw data

For this study, we use the ChIP-seq data that is publicly available in the Sequence Read Archive (SRA).

NOTE: If you are interested in how to obtain publicly available sequence data from the SRA, we have training materials on this topic.

Metadata

In addition to the raw sequence data, we also need to collect information about the data, also known as metadata. We sometimes rush to begin the analysis of the sequence data (FASTQ files), but how useful is it if we know nothing about the samples that this sequence data originated from?

Some relevant metadata for our dataset is provided below:

All of the above pertains to both WT and Prdm16 conditional knock-out mouse (Emx1Ires-Cre; Prdm16flox/flox). For the rest of the workshop we will be referring to the conditional knockout samples as KO.

Our dataset consists of two WT samples and two KO samples. For each of the IP samples, we have a corresponding input sample as illustrated in the schematic below.
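One way to capture this design is a small sample sheet kept with the rest of the metadata. The sketch below is illustrative only: the sample names, the column layout and the file name samplesheet.csv are hypothetical and are not taken from the publication or from the workshop data.

```shell
# Hypothetical sample sheet for a 2 WT + 2 KO design, each IP paired with an input.
# Sample names, columns and the output file name are assumptions for illustration.
samplesheet="samplesheet.csv"

printf 'sample,genotype,replicate,type\n' > "$samplesheet"
for genotype in WT KO; do
    for rep in 1 2; do
        for type in IP input; do
            printf '%s_%s_rep%s,%s,%s,%s\n' \
                "$genotype" "$type" "$rep" "$genotype" "$rep" "$type" >> "$samplesheet"
        done
    done
done

cat "$samplesheet"
```

The loop writes one header line plus eight rows, one per IP or input sample. In a real project the rows would use the actual file names and any further details recorded about each sample.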

Connect to O2

Let’s get started with the hands-on component by typing in the following command to log in to O2:

ssh username@o2.hms.harvard.edu

You will be prompted for your password. Type it in and press Enter; note that the cursor will not move as you type.

A warning might pop up the first time you connect to a remote machine; if it does, type “Yes” or “Y”.

Once logged in, you should see the O2 icon, some news, and the command prompt, e.g. [rc_training10@login01 ~]$.

Note 1: ssh stands for secure shell. All of the information (like your password) going between your computer and the O2 login computer is encrypted when using ssh.

Next, you will need to start an interactive session. A login node’s only function is to enable users to log in to a cluster; it is not meant to be used for any actual work/computing. Since we will be doing some work, let’s get on to a compute node:

$ srun --pty -p interactive -t 0-3:00 --mem 1G /bin/bash

Make sure that your command prompt is now preceded by a character string that contains the word “compute”.
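If you prefer to check this with a command rather than by eye, the sketch below tests whether the machine name contains the word “compute”. The function name on_compute_node is our own, and the check assumes only what is stated above: that compute node names include that word.

```shell
# Return success if the given host name (default: this machine's) contains "compute".
# The "compute" substring is the same cue used for the prompt check above.
on_compute_node() {
    host="${1:-$(uname -n)}"
    case "$host" in
        *compute*) return 0 ;;
        *)         return 1 ;;
    esac
}

if on_compute_node; then
    echo "On a compute node: $(uname -n)"
else
    echo "Not on a compute node: $(uname -n)"
fi
```

A check like this can also go at the top of a script, so that heavy work does not start by accident on a login node.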

Implementing data management best practices

In a previous lesson, we described the data lifecycle and the different aspects to consider when working on your own projects. Here, we implement some of those strategies to get ourselves set up before we begin any analysis.

Image acquired from the Harvard Biomedical Data Management Website

Planning and organization

For each experiment you work on and analyze data for, it is considered best practice to get organized by creating a planned storage space (directory structure). We will start by creating a directory that we can use for the rest of the workshop. First, make sure that you are in your home directory.

$ cd
$ pwd

This should return the path to your home directory, e.g. /home/rc_training10. Create the directory chipseq_workshop and move into it.

$ mkdir chipseq_workshop
$ cd chipseq_workshop

Now that we have a project directory, we can set up the following structure within it to keep files organized.

chipseq_workshop/
├── logs/
├── meta/
├── raw_data/
├── reference_data/
├── results/
└── scripts/

$ mkdir raw_data reference_data scripts logs meta results

$ tree     # this will show you the directory structure you just created

This is a generic directory structure and can be tweaked based on personal preference and analysis workflow.
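The same layout can also be created in one repeatable step, which is handy when you start a new project. This is a minimal sketch: the variable names are ours, and a temporary directory stands in for your home directory so that the example is safe to run anywhere.

```shell
# Create the workshop layout under a base directory.
# mkdir -p creates missing parent directories and does not fail if one already exists.
base="$(mktemp -d)"                  # stand-in for your home directory
project="$base/chipseq_workshop"

for sub in raw_data reference_data scripts logs meta results; do
    mkdir -p "$project/$sub"
done

ls -1 "$project"
```

Because mkdir -p tolerates existing directories, the loop can be re-run without errors if you later add a sub-directory to the list.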

Now that we have the directory structure created, let’s copy over the data:

$ cp /n/groups/hbctraining/harwell-datasets/chipseq_workshop/data/*fastq.gz raw_data/
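After a copy like this, it can be worth confirming that every file arrived intact. The sketch below counts the compressed FASTQ files in a directory and tests each archive; the function name check_fastq_dir is hypothetical, and the check covers only gzip integrity, not the FASTQ content itself.

```shell
# Count the compressed FASTQ files in a directory and test each archive.
# gzip -t checks integrity without writing any decompressed output.
check_fastq_dir() {
    dir="$1"
    count=0
    for f in "$dir"/*fastq.gz; do
        [ -e "$f" ] || continue          # the glob matched nothing
        if gzip -t "$f" 2>/dev/null; then
            count=$((count + 1))
        else
            echo "Corrupt or incomplete: $f" >&2
            return 1
        fi
    done
    echo "$count"
}
```

Running check_fastq_dir raw_data from inside the project directory prints the number of intact files, which you can compare with the number of samples you expect from the experimental design.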

We’re all set up for our analysis!

File naming conventions

Another aspect of staying organized is keeping filenames consistent and descriptive throughout an analysis: prefer something like 20170823_kd_rep1_gmap-1.4.bam over alignment1.bam. This link and this slideshow have some good guidelines for file naming dos and don’ts.
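A consistent pattern is easier to keep if names are assembled from their parts rather than typed by hand. The sketch below reads the example name 20170823_kd_rep1_gmap-1.4.bam as date, condition, replicate, tool and version; that reading, and the function name make_filename, are our own assumptions.

```shell
# Build a file name as <date>_<condition>_<replicate>_<tool>-<version>.<extension>.
# The field order mirrors the example name in the text; the function name is our own.
make_filename() {
    printf '%s_%s_%s_%s-%s.%s\n' "$1" "$2" "$3" "$4" "$5" "$6"
}

make_filename 20170823 kd rep1 gmap 1.4 bam
# prints: 20170823_kd_rep1_gmap-1.4.bam
```

Keeping the fields in a fixed order, separated by underscores, also makes it straightforward to sort files or to split a name back into its parts later.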

Documentation

In your lab notebook, you likely keep track of the different reagents and kits used for a specific protocol. Similarly, recording information about the tools used in the workflow is important for documenting your computational experiments.
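One lightweight way to do this is to append each tool's reported version to a log file as you go. This is a sketch under stated assumptions: the function name record_version and the tab-separated layout are hypothetical, and the version is taken to be the first line of whatever the tool prints.

```shell
# Append a dated entry with a tool's reported version to a log file.
# Usage: record_version <logfile> <label> <command that prints the version...>
record_version() {
    logfile="$1"
    label="$2"
    shift 2
    # Keep only the first line of the version output, which is usually enough.
    version="$("$@" 2>&1 | head -n 1)"
    printf '%s\t%s\t%s\n' "$(date +%Y-%m-%d)" "$label" "$version" >> "$logfile"
}
```

For example, record_version logs/tool_versions.tsv gzip gzip --version would add one line recording the date, the label "gzip" and the first line of the version banner. The log file name here is only a suggestion.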

README files

After setting up the directory structure, it is useful to have a README file within your project directory. This is a plain text file containing a short summary about the project and a description of the files/directories found within it. An example README is shown below. It can also be helpful to include a README within each sub-directory with any information pertaining to the analysis.

## README ##
## This directory contains data generated during the Introduction to ChIP-seq workshop
## Date: 

There are six subdirectories in this directory:

raw_data: contains raw data
meta: contains...
logs:
results:
scripts:
reference_data:
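A skeleton like the one above can also be generated from the directories that already exist, leaving only the descriptions to write by hand. The sketch below is one possible approach; the function name write_readme_skeleton and the placeholder wording are our own.

```shell
# Write a README skeleton listing every sub-directory of a project directory.
# Descriptions are left as placeholders to be filled in by hand.
write_readme_skeleton() {
    project="$1"
    {
        echo "## README ##"
        echo "## Date: $(date +%Y-%m-%d)"
        echo
        for d in "$project"/*/; do
            [ -d "$d" ] || continue
            echo "$(basename "$d"): (describe contents here)"
        done
    } > "$project/README"
}
```

Note that this overwrites any existing README in the target directory, so it is best run once, right after the directories are created.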

Exercise

  1. Take a moment to create a README for the chipseq_workshop/ folder (hint: use vim to create the file). Give a short description of the project and brief descriptions of the types of files you will be storing within each of the sub-directories.

This lesson has been developed by members of the teaching team at the Harvard Chan Bioinformatics Core (HBC). These are open access materials distributed under the terms of the Creative Commons Attribution license (CC BY 4.0), which permits unrestricted use, distribution, and reproduction in any medium, provided the original author and source are credited.