Skip to the content.

Approximate time: 45 minutes

Learning Objectives

Variant Calling

Best practices for variant calling

Now that we have inspected our reads and alignments for high-quality QC and also formatted our alignments, we are ready for variant calling. Dozens of variant calling tools for NGS data have been published, and choosing which one to use can be a daunting task. Accurate variant detection in the human genome is a key requirement for diagnostics in clinical sequencing, and so you want to make sure you are using the caller which is appropriate for your data and study design.


Image source: Zverinova S and Guryev V, 2021

Germline versus Somatic Variant Calling

Variant calling can be broadly broken up into two groups, germline and somatic. These two types of variant calling methods have different assumptions regarding the input data and thus are handled differently.

In the above image we can see an example of a germline variant on the left. Approximately half of the reads support each allele in both the tumor and normal sample reads. This is compared to the somatic variant on the right, where we observe no variants in the normal sample reads but there is a variant present in the tumor sample reads.

For this workshop, we have tumor/normal samples and so we will be calling somatic variants.

Panel of Normals

A panel of normals (PoN) is a resource derived from a set of healthy tissues with the goal of correcting for recurrent technical artifacts within a sequencing experiment. Ideally, a panel of normals should:

In the above image we can see that for the left “variant”, the panel of normals and the tumor all have a thymine “variant” in this position relative to the reference. This would likely be a technical artifact. However, for the right variant, we can see that while the tumor thymine allele is still present in all of the reads for this position like the previous case, the panel of normals shows variation and this signifies that this position is likely not a technical artifact.

An additional benefit of a good panel of normals is that it can act as an additional filter for germline variants. This is particularly true when you run MuTect2 in tumor-only mode (discussed briefly later). If you do not have access to a panel of normals, you can also use a panel of normals that the Broad Institute hosts which is derived from the 1000 Genomes Project. This data can be downloaded here. We will are currently testing the best way to implement this and will hopefully have a implementation next week. The Broad Institute also provides instructions on how to make your own panel of normals here. It is not required to use a panel of normals when running Mutect2, but it is recommended.

Note: When using a panel of normals VCF file in MuTect2, you will also need an index of the VCF file. We have already created the index for this workshop, but this can be accomplished in one of a few ways:

  1. If you are using the panel of normals provided by the Broad Institute, then they also provide a index that you can download from here.
  2. You can index your own VCF file with the IndexFeatureFile within GATK.
  3. You can also use other software packages like tabix or bcftools to create index files for VCF files

Germline Resource

A germline resource contains many variants within a population and can be used in conjunction with other tools to help variant calling algorithms determine if a variant is likely a germline or somatic mutation. The logic is: if a variant is a common variant in the population and it is present in the sample, then it is more likely to be a germline variant rather than a somatic variant. On the other hand, if a low-frequency variant in the population is present in an sample, then it is more likely to be somatic variant rather than a germline variant. This idea is illustrated below:

The Broad Institute provides a VCF file utilizing the variants from gnomAD, which can be used as a germline reference. Similarly to a panel of normals, this is not required for MuTect2 to call variants, however, it is also recommended. We will are currently testing the best way to implement this and will hopefully have a implementation next week.

Note: Similarly to the index file needed for the panel of normals VCF file, we will also need a index file for the germline resource VCF file. If you are using the Broad Institute’s gnomAD VCF file, then you can download the index from here or create it yourself using GATK, tabix or bcftools (discussed in the note above).

GATK Toolkit

GATK (Genome Analysis Toolkit) is a popular open-source software package developed by the Broad Institute for analysis of high throughput sequencing (HTS) data. It was initially developed around 2007 to help analyze sequencing data from the 1000 Genomes Project. It was created with a focus on best practices in variant discovery and genotyping from NGS data. New versions of GATK are regularly released by the Broad Institute as sequencing technologies and reference genomes are improved.

GATK has tools for preprocessing, variant discovery, genotyping, and filtering of HTS data. It takes aligned sequencing reads as input and outputs variant calls. There are several different GATK Best Practices workflows tailored to particular applications depending on the type of variation of interest and the technology employed.

In terms of variant discovery, GATK offers the following tools:

This course is going to focus on analyzing somatic SNPs, so we are going to use MuTect2.

MuTect2

Basic workflow

The algorithm used for somatic variant discovery with MuTect2 is broken up into three major components:

MuTect2 can be run in three different modes:

  1. Tumor-normal mode: where a tumor sample is matched with a normal samples. This is the mode we will be using.
  2. Tumor-only mode: where a single sample is used as input.
    • Note that the software’s ability to reliably call somatic variants is greatly diminished as it has difficulty distinguishing between high-frequency variants and germline variants. In order to help, a panel of normals and a germline resource are strongly recommended when running in this mode.
  3. Mitochondrial mode: where specific parameters are added for appropriately calling on mitochondria.

Running MuTect2

Let’s start our variant calling analysis by writing out a new sbatch submission script for MuTect2:

cd ~/variant_calling/scripts/
vim mutect2_normal_tumor.sbatch

Let’s add our shebang line, script description and sbatch directives:

#!/bin/bash
# This sbatch script is for variant calling with GATK's MuTect2

# Assign sbatch directives
#SBATCH -p priority
#SBATCH -t 1-00:00:00
#SBATCH -c 1
#SBATCH --mem 16G
#SBATCH -o mutect2_variant_calling_normal_tumor_%j.out
#SBATCH -e mutect2_variant_calling_normal_tumor_%j.err

Next to add the GATK4 module we are going to load:

# Load the GATK module
module load gatk/4.1.9.0

And now, we need to create our variables:

# Assign variables
REFERENCE_SEQUENCE=/n/groups/hbctraining/variant_calling/reference/GRCh38.p7.fa
REFERENCE_DICTIONARY=`echo ${REFERENCE_SEQUENCE%fa}dict`
NORMAL_SAMPLE_NAME=syn3_normal
NORMAL_BAM_FILE=/n/scratch/users/${USER:0:1}/${USER}/variant_calling/alignments/${NORMAL_SAMPLE_NAME}_GRCh38.p7.coordinate_sorted.bam
TUMOR_SAMPLE_NAME=syn3_tumor
TUMOR_BAM_FILE=/n/scratch/users/${USER:0:1}/${USER}/variant_calling/alignments/${TUMOR_SAMPLE_NAME}_GRCh38.p7.coordinate_sorted.bam
VCF_OUTPUT_FILE=/n/scratch/users/${USER:0:1}/${USER}/variant_calling/vcf_files/mutect2_${NORMAL_SAMPLE_NAME}_${TUMOR_SAMPLE_NAME}_GRCh38.p7-raw.vcf
Click here if you used samtools instead of Picard to process the alignment files Very little needs to be edited to these variables to adapt them for the samtools output. However, the end of the file that was created in samtools was .removed_duplicates.bam rather than .coordinate_sorted.bam. As a result we need to edit the variables a bit. Change:
  NORMAL_BAM_FILE=/n/scratch/users/${USER:0:1}/${USER}/variant_calling/alignments/${NORMAL_SAMPLE_NAME}_GRCh38.p7.coordinate_sorted.bam
To:
  NORMAL_BAM_FILE=/n/scratch/users/${USER:0:1}/${USER}/variant_calling/alignments/${NORMAL_SAMPLE_NAME}_GRCh38.p7.removed_duplicates.bam
And also change:
  TUMOR_BAM_FILE=/n/scratch/users/${USER:0:1}/${USER}/variant_calling/alignments/${TUMOR_SAMPLE_NAME}_GRCh38.p7.coordinate_sorted.bam
To:
  TUMOR_BAM_FILE=/n/scratch/users/${USER:0:1}/${USER}/variant_calling/alignments/${TUMOR_SAMPLE_NAME}_GRCh38.p7.removed_duplicates.bam
After those changeed have been made, the rest of the script should be the same.

NOTE: Sometimes when there are many input variables that re-use many of the same textual elements (i.e. paths, sample names and reference genome names), like we have above, it is sometimes cleaner, less typo-prone and more reproducible to assign those repeated items to variables and then use text manipulation tools and variable subsitution in bash to create the rest of the variables. In the above example, some of the lines of variable assignment (REFERENCE_SEQUENCE, NORMAL_SAMPLE_NAME and TUMOR_SAMPLE_NAME) are likely lines you might edit from run-to-run, but the other lines (REFERENCE_DICTIONARY, NORMAL_BAM_FILE, TUMOR_BAM_FILE and VCF_OUTPUT_FILE) will likely stay the same or similiar. Standardizing your paths and nomenclature will help you keep track of your files much easier.

Lastly, we need to add the MuTect2 command:

# Run MuTect2
gatk Mutect2 \
  --sequence-dictionary $REFERENCE_DICTIONARY \
  --reference $REFERENCE_SEQUENCE \
  --input $NORMAL_BAM_FILE \
  --normal-sample $NORMAL_SAMPLE_NAME \
  --input $TUMOR_BAM_FILE \
  --tumor-sample $TUMOR_SAMPLE_NAME \
  --annotation ClippingRankSumTest --annotation DepthPerSampleHC --annotation MappingQualityRankSumTest --annotation MappingQualityZero --annotation QualByDepth --annotation ReadPosRankSumTest --annotation RMSMappingQuality --annotation FisherStrand --annotation MappingQuality --annotation DepthPerAlleleBySample --annotation Coverage \
  --output $VCF_OUTPUT_FILE

Let’s breakdown this command:

NOTE: It is VERY IMPORTANT that the sample names (--normal-sample $NORMAL_SAMPLE_NAME and --tumor-sample $TUMOR_SAMPLE_NAME) are provided in the same order as the --input input BAM files!

In order to run MuTect2 we also need to have a FASTA index file of our reference sequence in addition to our sequence dictionary. Similarly to the sequence dictionary, VCF indices and bwa indicies, we have already created this index for you. However, we have two dropdowns below to walk you through how to do it, should you ever need to do it on your own.

Click here for details for creating a FASTA index file in samtools
FASTA index files for reference sequences are fairly common requirements for a variety of NGS software packages. Picard currently does not feature an ability to create a FASTA index file. However, samtools is a very popular tool that is used for a variety of processes for processing BAM/SAM files and it also includes functionality for the creation of FASTA index files. First, we will need to load the gcc and samtools modules:
 module load gcc/6.2.0
 module load samtools/1.15.1
The command for indexing a FASTA file is straightforward and should run pretty quickly:
   # YOU DON'T NEED TO RUN THIS
   samtools faidx \
   reference_sequence.fa
We can breakdown this code:
  • samtools faidx This calls the faidx software from samtools
  • reference_sequence.fa This is the reference sequence FASTA file that you would like to index
Once the indexing is complete, then you should have a index file (reference_sequence.fa.fai) in same directory as your reference sequence reference_sequence.fa.
Click here for the commands to create a sequence directory We can create the required sequence dictionary in Picard. But first, let's double check we have the Picard module loaded:
 module load picard/2.27.5
The command to do create the sequence dictionary is:
 # YOU DON'T NEED TO RUN THIS
 java -jar $PICARD/picard.jar CreateSequenceDictionary \
 --REFERENCE /n/groups/hbctraining/variant_calling/reference/GRCh38.p7_genomic.fa
 --OUTPUT /n/groups/hbctraining/variant_calling/reference/GRCh38.p7_genomic.dict
The components of this command are:
  • java -jar $PICARD/picard.jar CreateSequenceDictionary This calls the CreateSequenceDictionary command within Picard
  • --REFERENCE /n/groups/hbctraining/variant_calling/reference/GRCh38.p7_genomic.fa This is the reference sequence to create the sequence dictionary from.
  • --OUTPUT /n/groups/hbctraining/variant_calling/reference/GRCh38.p7_genomic.dict This is the output sequence dictionary.
Like indexing, once you have created the sequence dictionary for a reference genome, you won't need to do it again.
Click here to see what our final sbatchcode script for calling variants with MuTect2 should look like
#!/bin/bash
# This sbatch script is for variant calling with GATK's MuTect2
# Assign sbatch directives #SBATCH -p priority #SBATCH -t 1-00:00:00 #SBATCH -c 1 #SBATCH --mem 16G #SBATCH -o mutect2_variant_calling_normal_tumor_%j.out #SBATCH -e mutect2_variant_calling_normal_tumor_%j.err
# Load the GATK module module load gatk/4.1.9.0
# Assign variables REFERENCE_SEQUENCE=/n/groups/hbctraining/variant_calling/reference/GRCh38.p7.fa REFERENCE_DICTIONARY=`echo ${REFERENCE_SEQUENCE%fa}dict` NORMAL_SAMPLE_NAME=syn3_normal NORMAL_BAM_FILE=/n/scratch/users/${USER:0:1}/${USER}/variant_calling/alignments/${NORMAL_SAMPLE_NAME}_GRCh38.p7.coordinate_sorted.bam TUMOR_SAMPLE_NAME=syn3_tumor TUMOR_BAM_FILE=/n/scratch/users/${USER:0:1}/${USER}/variant_calling/alignments/${TUMOR_SAMPLE_NAME}_GRCh38.p7.coordinate_sorted.bam VCF_OUTPUT_FILE=/n/scratch/users/${USER:0:1}/${USER}/variant_calling/vcf_files/mutect2_${NORMAL_SAMPLE_NAME}_${TUMOR_SAMPLE_NAME}_GRCh38.p7-raw.vcf
# Run MuTect2 gatk Mutect2 \ --sequence-dictionary $REFERENCE_DICTIONARY \ --reference $REFERENCE_SEQUENCE \ --input $NORMAL_BAM_FILE \ --normal-sample $NORMAL_SAMPLE_NAME \ --input $TUMOR_BAM_FILE \ --tumor-sample $TUMOR_SAMPLE_NAME \ --annotation ClippingRankSumTest --annotation DepthPerSampleHC --annotation MappingQualityRankSumTest --annotation MappingQualityZero --annotation QualByDepth --annotation ReadPosRankSumTest --annotation RMSMappingQuality --annotation FisherStrand --annotation MappingQuality --annotation DepthPerAlleleBySample --annotation Coverage \ --output $VCF_OUTPUT_FILE

You can submit your variant calling script to the cluster:

sbatch mutect2_normal_tumor.sbatch

VCF (Variant Call Format)

The output from MuTect2 is a VCF file; the de facto file format for storing genetic variation. The Variant Call Format (VCF) is a standardized, text-file format for describing variants identifed from a sequencing experiment. This allows for downstream processes to be streamlined and also allows for researchers to easily collaborate and wrangle a shared set of variant calls.

Since it is a text file, we could easily take a quick peek at our VCF file using the less command. However, our script might still be runnning!

Instead, we will use the figure below taken from the TCGA VCF 1.1 Specification pages to summarize the components of a VCF.

A VCF file is composed of three main parts:

1. Header

This contains meta-information lines that provide supplemental information and they always start with ##

2a. Fixed Fields

A line starting with a single # that contains headers for all columns in the Body of the file. The following eight mandatory fields will be found in every VCF file:

Abbreviation Data Type
AF Allele Frequency for the ALT allele
DP Combined Depth across all samples
NS Number of samples with data

2b. Optional Fields

Abbreviation Data Type
GT Genotype
DP Depth of the sample
GQ Genotype Quality
HQ Commma-separated list of Haplotype Qualities

3. Body

These are the data lines where the variant calls will be found with each field corresponding to its column in the header line.

NOTE: For more detailed information on the VCF specification, please see our File Formats lesson where the VCF components are thoroughly described.


Exercise

First let’s move into our variant_calling directory and copy over a sample VCF file:

cd ~/variant_calling/
cp /n/groups/hbctraining/variant_calling/sample_data/sample.vcf .

1. Using grep, extract only the meta-information lines from the VCF file.

2. For the variant at position 806262 on chromosome 19, what is the reference allele?


Additional Resources


Next Lesson »

Back to Schedule


This lesson has been developed by members of the teaching team at the Harvard Chan Bioinformatics Core (HBC). These are open access materials distributed under the terms of the Creative Commons Attribution license (CC BY 4.0), which permits unrestricted use, distribution, and reproduction in any medium, provided the original author and source are credited.