Workshop Schedule

NOTE: The Basic Data Skills Introduction to the command-line interface workshop is a prerequisite.

Pre-reading:

Please study the contents within the following lessons:

Day 1

Time	Topic	Instructor
9:30 - 10:00	Workshop Introduction	Will
10:00 - 11:30	Introduction to Variant Calling	Elizabeth
11:30 - 11:50	Project Organization	Elizabeth
11:50 - 12:00	Overview of self-learning materials and homework submission	Will

Before the next class:

I. Please study the contents and work through all the code within the following lessons:

Evaluating Read Quality with FastQC

The first step in many NGS studies is to evaluate the quality of the reads that you received from the sequencing facility. A common tool for this analysis is FastQC.

This lesson will:
- Implement FastQC to evaluate read qualities
- Evaluate FastQC quality metrics

Sequence Read Alignment

Once we have completed QC on our sequence reads, we will align the reads to a reference sequence. This alignment step places each read in genomic space and lays the foundation for calling variants.

This lesson will:
- Enumerate difficulties with alignment
- Create an sbatch script to align reads

Alignment File Processing

Before we can call variants from our alignment files, we need to do some processing to clean up the alignment files. The two major concerns here are organizing (sorting) our alignment files for our analyses and removing duplicates.

This lesson will:
- Differentiate between query-sorted and coordinate-sorted alignment files
- Describe and remove duplicate reads
- Process a raw SAM file into a BAM file suitable as input for GATK
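
To make the query-sorted versus coordinate-sorted distinction concrete: the SAM format records the sort order in the `SO` tag of the `@HD` header line (`queryname` or `coordinate`). The following is a minimal sketch, not the lesson's own code. It assumes the header text arrives on standard input (for example from `samtools view -H`), and `sam_sort_order` is a hypothetical helper name chosen for illustration.

```shell
#!/bin/bash
# Hypothetical helper: print the sort order declared in a SAM header.
# Reads header lines on stdin and prints the value of the SO tag on the
# @HD line, or "unknown" if no @HD line with an SO tag is present.
sam_sort_order() {
  awk -F'\t' '
    $1 == "@HD" {
      for (i = 2; i <= NF; i++) {
        if ($i ~ /^SO:/) { print substr($i, 4); found = 1; exit }
      }
    }
    END { if (!found) print "unknown" }
  '
}

# Example with a made-up two-line header (illustrative values only).
printf '@HD\tVN:1.6\tSO:coordinate\n@SQ\tSN:chr1\tLN:1000\n' | sam_sort_order
```

On a real file, the same helper could be fed with `samtools view -H your_file.bam | sam_sort_order`, where `your_file.bam` is a placeholder.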

NOTE: To run through the code above, you will need to be logged into O2 and working on a compute node (i.e. your command prompt should have the word compute in it).

1. Log in using `ssh rc_trainingXX@o2.hms.harvard.edu` and enter your password (replace the “XX” in the username with the number you were assigned in class).

2. Once you are on the login node, use `srun --pty -p interactive -t 0-2:30 --mem 1G /bin/bash` to get on a compute node.

3. Proceed once your command prompt has the word compute in it.

If you log out between lessons (by running the `exit` command twice), follow steps 1 and 2 above to log back in and get on a compute node before resuming the self-learning lessons.
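
The login steps above can be summarized in the sketch below. The `ssh` and `srun` commands are copied from the steps and left as comments because they are run interactively. The `on_compute_node` helper is hypothetical: it assumes the command prompt displays the machine's hostname, so checking the hostname for the word compute stands in for checking the prompt.

```shell
#!/bin/bash
# Step 1 (on your own computer): log in to O2, replacing XX with your number.
#   ssh rc_trainingXX@o2.hms.harvard.edu
#
# Step 2 (on the login node): request an interactive session.
#   -t 0-2:30 is a time limit in days-hours:minutes (2 hours 30 minutes)
#   --mem 1G  requests 1 GB of memory
#   srun --pty -p interactive -t 0-2:30 --mem 1G /bin/bash

# Step 3: hypothetical check that we are on a compute node.
# Uses the given name, or the current hostname if no argument is passed.
on_compute_node() {
  local host="${1:-$(uname -n)}"
  case "$host" in
    *compute*) return 0 ;;
    *)         return 1 ;;
  esac
}

if on_compute_node; then
  echo "On a compute node: ready to run the lesson code."
else
  echo "Not on a compute node: run the srun command first."
fi
```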

II. Complete the exercises:

Each lesson above contains exercises; please go through each of them.
Copy over your solutions into the Google Form the day before the next class.

Questions?

If you get stuck due to an error while running code in the lesson, email us.

Day 2

Time	Topic	Instructor
9:30 - 10:00	Self-learning lessons review	All
10:00 - 10:30	Alignment File Quality Control	Elizabeth
10:30 - 10:40	Break
10:40 - 11:15	Aggregating QC metrics using MultiQC	Elizabeth
11:15 - 12:00	Variant Calling	Will

Before the next class:

I. Please study the contents and work through all the code within the following lessons:

Variant Filtering

Now that we have called our raw variants, we need to filter them down to only high-quality variant calls. Low-quality calls can arise for a variety of reasons, which we will explore, and we will implement steps to exclude them.

This lesson will:
- Filter raw variant calls using FilterMutectCalls to reduce errors
- Remove low-complexity regions from the called variants using SnpSift to further reduce errors

Variant Annotation with SnpEff

With our high-quality variant calls in hand, we would like to know more about these variants. For example, we might like to know which genes they are in or how they alter the protein-coding sequence of those genes. To do this, we will need to provide annotations for our genes.

This lesson will:
- Annotate a VCF file for functional impacts with `SnpEff`
- Differentiate between an unannotated and annotated VCF file
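
One way to tell the two kinds of file apart: SnpEff writes its annotations to an `ANN` field in the INFO column and declares that field in the VCF header. The following is a minimal sketch under stated assumptions: the VCF is uncompressed and supplied on standard input, `vcf_is_annotated` is a hypothetical helper name, and the header lines in the example are abbreviated, made-up stand-ins for a real file.

```shell
#!/bin/bash
# Hypothetical helper: succeed if the VCF on stdin declares an ANN INFO
# field in its header, which is how SnpEff-annotated files are marked.
vcf_is_annotated() {
  grep -q '^##INFO=<ID=ANN,'
}

# Illustrative header with an ANN declaration (Description shortened).
if printf '##fileformat=VCFv4.2\n##INFO=<ID=ANN,Number=.,Type=String,Description="Functional annotations">\n' | vcf_is_annotated; then
  echo "annotated"
else
  echo "unannotated"
fi
```

On a real file this could be run as `vcf_is_annotated < your_variants.vcf`, where `your_variants.vcf` is a placeholder.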

NOTE: To run through the code above, you will need to be logged into O2 and working on a compute node (i.e. your command prompt should have the word compute in it). For login instructions, please see above.

II. Complete the exercises:

Each lesson above contains exercises; please go through each of them.
Copy over your solutions into the Google Form the day before the next class.

Questions?

If you get stuck due to an error while running code in the lesson, email us.

Day 3

Time	Topic	Instructor
9:30 - 10:00	Self-learning lessons review	Elizabeth
10:00 - 10:30	Variant Prioritization with SnpSift	Elizabeth
10:30 - 11:00	Exercise (Key)	Will
11:00 - 11:30	Visualization in IGV	Will
11:30 - 12:00	Q & A (review of Automation)	All

Questions?

If you get stuck due to an error while running code in the lesson, email us.

Day 4

Time	Topic	Instructor
9:30 - 10:30	Introduction to cBioPortal	Dr. Tali Mazor
10:30 - 11:30	cBioPortal Practical	Dr. Tali Mazor
11:30 - 11:45	Oncoprint Integration	Will
11:45 - 12:00	Wrap up	Elizabeth

File Format Reference

File Formats

Automation Reference

Automation of Variant Calling Pipeline

Answer key

Exercises Answer Key

These materials have been developed by members of the teaching team at the Harvard Chan Bioinformatics Core (HBC). These are open access materials distributed under the terms of the Creative Commons Attribution license (CC BY 4.0), which permits unrestricted use, distribution, and reproduction in any medium, provided the original author and source are credited.

Some materials used in these lessons were derived from work that is Copyright © Data Carpentry (http://datacarpentry.org/). All Data Carpentry instructional material is made available under the Creative Commons Attribution license (CC BY 4.0).