Workshop Schedule
NOTE: The Basic Data Skills Introduction to the command-line interface workshop is a prerequisite.
Pre-reading:
- Please study the contents within the following lessons:
Day 1
Time | Topic | Instructor |
---|---|---|
9:30 - 10:00 | Workshop Introduction | Will |
10:00 - 11:30 | Introduction to Variant Calling | Elizabeth |
11:30 - 11:50 | Project Organization | Elizabeth |
11:50 - 12:00 | Overview of self-learning materials and homework submission | Will |
Before the next class:
I. Please study the contents and work through all the code within the following lessons:
- Evaluating Read Quality with `FastQC`
  The first step in many NGS studies is to evaluate the quality of the reads that you received from the sequencing facility. A common tool for this analysis is `FastQC`.
  This lesson will:
  - Implement `FastQC` to evaluate read qualities
  - Evaluate `FastQC` quality metrics
- Sequence Read Alignment
  Once we have completed QC on our sequence reads, we will align the reads to a reference sequence. This alignment step places each read in genomic space and creates the bedrock for calling variants.
  This lesson will:
  - Enumerate difficulties with alignment
  - Create an `sbatch` script to align reads
- Alignment File Processing
  Before we can call variants from our alignment files, we need to do some processing to clean them up. The two major concerns are organizing (sorting) the alignment files for our analyses and removing duplicates.
  This lesson will:
  - Differentiate between query-sorted and coordinate-sorted alignment files
  - Describe and remove duplicate reads
  - Process a raw SAM file into a BAM file that can be used as input for `GATK`
NOTE: To run through the code above, you will need to be logged into O2 and working on a compute node (i.e. your command prompt should have the word `compute` in it).
1. Log in using `ssh rc_trainingXX@o2.hms.harvard.edu` and enter your password (replace the “XX” in the username with the number you were assigned in class).
2. Once you are on the login node, use `srun --pty -p interactive -t 0-2:30 --mem 1G /bin/bash` to get on a compute node.
3. Proceed once your command prompt has the word `compute` in it.
4. If you log out between lessons (using the `exit` command twice), please follow points 1. and 2. above to log back in and get on a compute node when you restart with the self-learning.
II. Complete the exercises:
- Each lesson above contains exercises; please go through each of them.
- Copy over your solutions into the Google Form the day before the next class.
Questions?
- If you get stuck due to an error while running code in the lesson, email us
Day 2
Time | Topic | Instructor |
---|---|---|
9:30 - 10:00 | Self-learning lessons review | All |
10:00 - 10:30 | Alignment File Quality Control | Elizabeth |
10:30 - 10:40 | Break | |
10:40 - 11:15 | Aggregating QC metrics using MultiQC | Elizabeth |
11:15 - 12:00 | Variant Calling | Will |
Before the next class:
I. Please study the contents and work through all the code within the following lessons:
-
  Now that we have called our raw variants, we need to filter our data to keep only high-quality variant calls. Low-quality variant calls can occur for a variety of reasons, which we will explore, and we will implement steps to exclude them.
  This lesson will:
  - Filter raw variant calls using `FilterMutectCalls` to reduce errors
  - Remove low-complexity regions from the called variants using `SnpSift` to further reduce errors
- Variant Annotation with SnpEff
  With our high-quality variant calls, we would like to know more about these variants. For example, we might like to know which genes they are in or how they alter the protein-coding sequence of those genes. To do this, we will need to provide annotations for our genes.
  This lesson will:
  - Annotate a VCF file for functional impacts with `SnpEff`
  - Differentiate between an unannotated and annotated VCF file
NOTE: To run through the code above, you will need to be logged into O2 and working on a compute node (i.e. your command prompt should have the word `compute` in it). For login instructions, please see above.
II. Complete the exercises:
- Each lesson above contains exercises; please go through each of them.
- Copy over your solutions into the Google Form the day before the next class.
Questions?
- If you get stuck due to an error while running code in the lesson, email us
Day 3
Time | Topic | Instructor |
---|---|---|
9:30 - 10:00 | Self-learning lessons review | Elizabeth |
10:00 - 10:30 | Variant Prioritization with SnpSift | Elizabeth |
10:30 - 11:00 | Exercise (Key) | Will |
11:00 - 11:30 | Visualization in IGV | Will |
11:30 - 12:00 | Q & A (review of Automation) | All |
Questions?
- If you get stuck due to an error while running code in the lesson, email us
Day 4
Time | Topic | Instructor |
---|---|---|
9:30 - 10:30 | Introduction to cBioPortal | Dr. Tali Mazor |
10:30 - 11:30 | cBioPortal Practical | Dr. Tali Mazor |
11:30 - 11:45 | Oncoprint Integration | Will |
11:45 - 12:00 | Wrap up | Elizabeth |
File Format Reference
Automation Reference
Answer key
These materials have been developed by members of the teaching team at the Harvard Chan Bioinformatics Core (HBC). These are open access materials distributed under the terms of the Creative Commons Attribution license (CC BY 4.0), which permits unrestricted use, distribution, and reproduction in any medium, provided the original author and source are credited.
- Some materials used in these lessons were derived from work that is Copyright © Data Carpentry (http://datacarpentry.org/). All Data Carpentry instructional material is made available under the Creative Commons Attribution license (CC BY 4.0).