Skip to the content.

Learning Objectives

FilterMutectCalls

The output from MuTect2 is a raw variant calling output. Many potential variants that are called are actually errors and so filtering needs to be performed.

To perform the filtering, we will be using FilterMutectCalls, which utilizes both hard filtering thresholds and sophisticated machine learning models to comprehensively filter variants and reduce false positives.

NOTE: While we are not concerned with cross-sample contamination for this dataset, if you were concerned about cross-sample contamination, then you would need to run CalculateContamination program within GATK to obtain a contamination table. You can use this contamination table as input into FilterMutectCells with the --contamination-table option.

Once filtering is complete, FilterMutectCalls will annotate the FILTER field in the VCF file with whether the variant is passing with PASS or FAIL.

Running FilterMutectCalls

Let’s begin by navigating to our scripts directory and creating the sbatch submission script that we will be using for filtering our VCF file:

cd ~/variant_calling/scripts/
vim variant_filtering_normal_tumor.sbatch

The first step is to add our shebang line, description and sbatch directives:

#!/bin/bash
# This sbatch script is for variant filtering

# Assign sbatch directives
#SBATCH -p priority
#SBATCH -t 0-00:10:00
#SBATCH -c 1
#SBATCH --mem 8G
#SBATCH -o variant_filtering_normal_tumor_%j.out
#SBATCH -e variant_filtering_normal_tumor_%j.err

Next, we need to add the modules that we will be loading:

# Load modules
module load gatk/4.1.9.0
module load snpEff/4.3g

Next, we will add our variables:

# Assign variables
REFERENCE_SEQUENCE=/n/groups/hbctraining/variant_calling/reference/GRCh38.p7.fa
RAW_VCF_FILE=/n/scratch/users/${USER:0:1}/${USER}/variant_calling/vcf_files/mutect2_syn3_normal_syn3_tumor_GRCh38.p7-raw.vcf
LCR_FILE=/n/groups/hbctraining/variant_calling/reference/LCR-hs38.bed
MUTECT_FILTERED_VCF=${RAW_VCF_FILE%raw.vcf}filt.vcf
PASSING_FILTER_VCF=${RAW_VCF_FILE%raw.vcf}pass-filt.vcf
LCR_FILTERED_VCF=${RAW_VCF_FILE%raw.vcf}pass-filt-LCR.vcf

Next, we can add the FilterMutectCells command:

# Filter Mutect Calls
gatk FilterMutectCalls \
  --reference $REFERENCE_SEQUENCE \
  --variant $RAW_VCF_FILE \
  --output $MUTECT_FILTERED_VCF

Let’s break down this command:

More information on FilterMutectCalls can be found here and a more technical guide to the filtering can be found here in Section II.

Filter using SnpSift

Now, we are going to filter for only variants that had a FILTER result of “PASS”. We are going to use SnpSift, which is part of the SnpEff and SnpSift suite of tools. We will later be using SnpEff to annotate our variants and SnpSift to prioritize our variants, but for now we are just going to use SnpSift to filter out our variants. If some of the syntax for this command is unclear, that is fine. We are going to spend time covering the syntax later during the Variant Prioritization section.

Let’s add our SnpSift command after the FilterMutectCalls command:

# Filter for only SNPs with PASS in the FILTER field
java -jar $SNPEFF/SnpSift.jar filter \
  -noLog \
  "( FILTER = 'PASS' )" \
  $MUTECT_FILTERED_VCF > $PASSING_FILTER_VCF

Low-Complexity Regions

Low-complexity regions of the genome represent regions that have simple sequence repeats and variant callers are prone to make errors within these regions (see Li, 2014). Some insertions and deletions (Indels) are erroreously called within these low-complexity regions by various variant callers. They found:

“low-complexity regions (LCRs), 2% of the human genome, harbor 80–90% of heterozygous INDEL calls and up to 60% of heterozygous SNPs” with false positive rates ranging from “10% to as high as 40%”.

As a result of the high error rates in these low-complexity regions, it is recommended to remove of these regions until better methods for variant calling in low-complexity regions can become established. The file that holds our LCR regions is in BED format. For more detailed information on the BED file format, please see our File Formats lesson

Using SnpSift to remove LCRs

The first step that one would need to do to remove LCRs would be to download them. We have already done this for you, but you can see the steps in the dropdown menu below:

Click here to see how to download and format the LCR BED file

Downloading and Unpacking the BED files with LCRs

The BED file (explained below) containing the LCRs for GRCh38 and can be directly downloaded. However, we have already done this and formatted the BED file appropriately. If you are interested in this process, the steps are outlined below:
# YOU DON'T NEED TO DO THIS
cd ~/variant_calling/
mkdir reference
cd reference
curl -o LCR-hs38_with_chr.bed.gz -L https://github.com/lh3/varcmp/blob/master/scripts/LCR-hs38.bed.gz?raw=true
curl is a tool in bash that can download linked files from the Internet via the command line. curl is very similiar to wget and the differences between them are relatively minor and beyond the scope of this course. If you have experience or prefer wget feel free to use it instead, but for this example we will be using curl. Let's breifly discuss the curl command. We can then unpack the gzipped files using the following command:
 # YOU DON'T NEED TO DO THIS
 gunzip -c LCR-hs38_with_chr.bed.gz > LCR-hs38_with_chr.bed
When we inspect our BED file we can see that it simply has the required 3 columns denoting the positioning of low-complexitiy regions in GRCh38.
# YOU DON'T NEED TO DO THIS
less LCR-hs38_with_chr.bed
Unfortunately, this file is formatted with the chromosome names starting with chr and we would like to strip that off so that it is consistent with our chromosome numbering scheme. That can be easily done in sed:
# YOU DON'T NEED TO DO THIS
sed 's/^chr//g' LCR-hs38_with_chr.bed > LCR-hs38.bed
We can break this command down:

In order to remove the LCRs from the VCF file, we will once again be using SnpSift. As we mentioned earlier, we will be discussing SnpSift at length in the Variant Prioritization lesson, but for now were are going to focus on using the intervals command built into SnpSift. Let’s go back to our scripts directory and edit our variant filtering script.

Add the filter command from SnpSift to our sbatch script in order remove all sites that overlap with the BED file:

# Filter LCR
java -jar $SNPEFF/SnpSift.jar intervals \
  -noLog \
  -x \
  -i $PASSING_FILTER_VCF \
  $LCR_FILE > $LCR_FILTERED_VCF

NOTE: bedtools is an useful suite of tools to use when handling BED files. It also has functionality for handling VCF files. A similar approach for filtering out low-complexity regions can also be done within the tool bedtools and is shown in a dropdown box below.

Click here to see how to use bedtools to exclude sites First, we will need to load the bedtools module:
module load gcc/9.2.0
module load bedtools/2.30.0
The command for running bedtools to filter out low-complexity regions is:
bedtools intersect \
  -header \
  -v \
  -a $PASSING_FILTER_VCF \
  -b $LCR_FILE > $LCR_FILTERED_VCF
We can break down this command: Both bedtools and SnpSift should result in the same variants passing filtering and the only difference is in the amount of metadata provided in the header lines where SnpSift provides the command that was used to produce that file, while bedtools does not. Because of this, one could argue that SnpSift is slightly better for this purpose than bedtools. However, much like the samtools versus Picard discussion we had in the BAM alignment procressing lesson, it is mostly up to personal preference.
Click here to see what our final sbatchcode script for filtering our called variants should look like
#!/bin/bash
# This sbatch script is for variant filtering
# Assign sbatch directives #SBATCH -p priority #SBATCH -t 0-00:10:00 #SBATCH -c 1 #SBATCH --mem 8G #SBATCH -o variant_filtering_normal_tumor_%j.out #SBATCH -e variant_filtering_normal_tumor_%j.err
# Load modules module load gatk/4.1.9.0 module load snpEff/4.3g
# Assign variables REFERENCE_SEQUENCE=/n/groups/hbctraining/variant_calling/reference/GRCh38.p7.fa RAW_VCF_FILE=/n/scratch/users/${USER:0:1}/${USER}/variant_calling/vcf_files/mutect2_syn3_normal_syn3_tumor_GRCh38.p7-raw.vcf LCR_FILE=/n/groups/hbctraining/variant_calling/reference/LCR-hs38.bed MUTECT_FILTERED_VCF=${RAW_VCF_FILE%raw.vcf}filt.vcf PASSING_FILTER_VCF=${RAW_VCF_FILE%raw.vcf}pass-filt.vcf LCR_FILTERED_VCF=${RAW_VCF_FILE%raw.vcf}pass-filt-LCR.vcf
# Filter Mutect Calls gatk FilterMutectCalls \ --reference $REFERENCE_SEQUENCE \ --variant $RAW_VCF_FILE \ --output $MUTECT_FILTERED_VCF
# Filter for only SNPs with PASS in the FILTER field java -jar $SNPEFF/SnpSift.jar filter \ -noLog \ "( FILTER = 'PASS' )" \ $MUTECT_FILTERED_VCF > $PASSING_FILTER_VCF
# Filter LCR java -jar $SNPEFF/SnpSift.jar intervals \ -noLog \ -x \ -i $PASSING_FILTER_VCF \ $LCR_FILE > $LCR_FILTERED_VCF

We can now submit our sbatch script:

sbatch variant_filtering_normal_tumor.sbatch

Inspecting Filtered VCF File

Once it has completed, which should be quick, we can look at the output VCF file and note a few items that have been added to the meta-information lines:

less /n/scratch/users/${USER:0:1}/${USER}/variant_calling/vcf_files/mutect2_syn3_normal_syn3_tumor_GRCh38.p7-pass-filt-LCR.vcf

Scroll down the VCF file past all of the contigs and you should see a line starting with:

##filtering_status=These calls have been filtered by FilterMutectCalls to label false positives with a list of failed filters and true positives with PASS.

Let’s inspect these lines a little:

##filtering_status=These calls have been filtered by FilterMutectCalls to label false positives with a list of failed filters and true positives with PASS.
##normal_sample=syn3-normal
##source=FilterMutectCalls
##source=Mutect2
##tumor_sample=syn3-tumor
##SnpSiftVersion="SnpSift 4.3g (build 2016-11-28 08:32), by Pablo Cingolani"
##SnpSiftCmd="SnpSift filter '( FILTER = 'PASS' )' /n/scratch3/users/${USER:0:1}/$USER/variant_calling/vcf_files/mutect2_syn3_normal_syn3_tumor_GRCh38.p7-filt.vcf"
##SnpSiftVersion="SnpSift 4.3g (build 2016-11-28 08:32), by Pablo Cingolani"
##SnpSiftCmd="SnpSift int -x -i /n/scratch/users/${USER:0:1}/$USER/variant_calling/vcf_files/mutect2_syn3_normal_syn3_tumor_GRCh38.p7-pass-filt.vcf /n/groups/hbctraining/variant_calling/reference/LCR-hs38.bed"

The first five lines have been added to our VCF file by GATK. They give information on the programs that have been run on the data, which is listed on the ##source= lines. These lines also define the column header in the VCF file that corresponds to the normal (##normal_sample=) and tumor sample (##tumor_sample=).

The next four lines tell us about our SnpSift commands:

We can also check how many variants are contained in our filtered file:

grep -v "^#" /n/scratch/users/${USER:0:1}/${USER}/variant_calling/vcf_files/mutect2_syn3_normal_syn3_tumor_GRCh38.p7-pass-filt-LCR.vcf | wc -l

What this command does is use grep -v to search for lines in our filtered VCF that do not start with a # (and are therefore not metadata, like we were looking at above, or the header row of the VCF), and then use wc -l to count the number of lines in that output.

We should see 1362 variants in our filtered VCF.


Now, we have successfuly filtered our raw VCF file to only include high-quality variant calls, we are ready to begin annotating our variants.

Next lesson »

Back to Schedule


This lesson has been developed by members of the teaching team at the Harvard Chan Bioinformatics Core (HBC). These are open access materials distributed under the terms of the Creative Commons Attribution license (CC BY 4.0), which permits unrestricted use, distribution, and reproduction in any medium, provided the original author and source are credited.