Learning Objectives
- Filter raw variant calls using
FilterMutectCellsto reduce errors - Remove Low-Complexity Regions from the called variants using
SnpSiftto further reduce errors
FilterMutectCalls
The output from MuTect2 is a raw variant calling output. Many potential variants that are called are actually errors and so filtering needs to be performed.
To perform the filtering, we will be using FilterMutectCalls, which utilizes both hard filtering thresholds and sophisticated machine learning models to comprehensively filter variants and reduce false positives.
- Hard cut-offs: MuTect2 applies hard filtering thresholds and variants must pass these thresholds. Some examples include:
- Depth of coverage
- Base quality: Low median base quality scores
- Mapping quality: Low median mapping quality of alt reads
- Artifact loci: Known problematic loci are filtered based on high false positive rates
- Clustered substitutions: Substitutions clustered together are more likely artifacts and are filtered
- Machine learning: MuTect2 was trained on thousands of real and fake variants to learn patterns. It uses this knowledge to score each variant based on how likely it is to be real.
- Employs probabilistic error models to characterize the probability distributions of various sequencing and alignment artifacts. This includes things like base quality scores, mapping quality, strand bias, etc.
- For each raw variant call, MuTect2 calculates likelihoods of it being a real somatic mutation versus different error types based on the trained error models.
- The quality score (QUAL) reported for each variant is based on the log-likelihood ratio of being a true mutation versus being an error. Higher scores indicate a variant is more likely real.
NOTE: While we are not concerned with cross-sample contamination for this dataset, if you were concerned about cross-sample contamination, then you would need to run
CalculateContaminationprogram withinGATKto obtain a contamination table. You can use this contamination table as input intoFilterMutectCellswith the--contamination-tableoption.
Once filtering is complete, FilterMutectCalls will annotate the FILTER field in the VCF file with whether the variant is passing with PASS or FAIL.
- Variants with “PASS” in FILTER passed all filters and are the ones MuTect2 deems most likely to be real somatic variants.
- Variants with “FAIL” in the FILTER means that it did not pass MuTect2’s internal filtering steps and quality checks.
Running FilterMutectCalls
Let’s begin by navigating to our scripts directory and creating the sbatch submission script that we will be using for filtering our VCF file:
cd ~/variant_calling/scripts/
vim variant_filtering_normal_tumor.sbatch
The first step is to add our shebang line, description and sbatch directives:
#!/bin/bash
# This sbatch script is for variant filtering
# Assign sbatch directives
#SBATCH -p priority
#SBATCH -t 0-00:10:00
#SBATCH -c 1
#SBATCH --mem 8G
#SBATCH -o variant_filtering_normal_tumor_%j.out
#SBATCH -e variant_filtering_normal_tumor_%j.err
Next, we need to add the modules that we will be loading:
# Load modules
module load gatk/4.1.9.0
module load snpEff/4.3g
Next, we will add our variables:
# Assign variables
REFERENCE_SEQUENCE=/n/groups/hbctraining/variant_calling/reference/GRCh38.p7.fa
RAW_VCF_FILE=/n/scratch/users/${USER:0:1}/${USER}/variant_calling/vcf_files/mutect2_syn3_normal_syn3_tumor_GRCh38.p7-raw.vcf
LCR_FILE=/n/groups/hbctraining/variant_calling/reference/LCR-hs38.bed
MUTECT_FILTERED_VCF=${RAW_VCF_FILE%raw.vcf}filt.vcf
PASSING_FILTER_VCF=${RAW_VCF_FILE%raw.vcf}pass-filt.vcf
LCR_FILTERED_VCF=${RAW_VCF_FILE%raw.vcf}pass-filt-LCR.vcf
Next, we can add the FilterMutectCells command:
# Filter Mutect Calls
gatk FilterMutectCalls \
--reference $REFERENCE_SEQUENCE \
--variant $RAW_VCF_FILE \
--output $MUTECT_FILTERED_VCF
Let’s break down this command:
gatk FilterMutectCallsCalls theFilterMutectCallspackage withinGATK--reference $REFERENCE_SEQUENCEThis is our reference genome--variant $RAW_VCF_FILEThis is our raw VCF file fromMuTect2--output $MUTECT_FILTERED_VCFThis is our filtered output file fromFilterMutectCalls
More information on FilterMutectCalls can be found here and a more technical guide to the filtering can be found here in Section II.
Filter using SnpSift
Now, we are going to filter for only variants that had a FILTER result of “PASS”. We are going to use SnpSift, which is part of the SnpEff and SnpSift suite of tools. We will later be using SnpEff to annotate our variants and SnpSift to prioritize our variants, but for now we are just going to use SnpSift to filter out our variants. If some of the syntax for this command is unclear, that is fine. We are going to spend time covering the syntax later during the Variant Prioritization section.
Let’s add our SnpSift command after the FilterMutectCalls command:
# Filter for only SNPs with PASS in the FILTER field
java -jar $SNPEFF/SnpSift.jar filter \
-noLog \
"( FILTER = 'PASS' )" \
$MUTECT_FILTERED_VCF > $PASSING_FILTER_VCF
-
java -jar $SNPEFF/SnpSift.jar filteris ajavapackaged program, so it needs to be called withjava -jarfollowed by the path where the JAR file is located on the cluster.$SNPEFFis just a bash variable that contains the path to JAR file. This calls thefilterfunction within theSnpSiftpackage -noLogDoes not report usage statistics toSnpEff’s servers. According to their documentation, it is so that they can monitor which features people are and aren’t using."( FILTER = 'PASS' )"This is the syntax thatSnpSiftuses to only retain variant calls withPASSin theFILTERfield of the VCF file$MUTECT_FILTERED_VCFThis is the input VCF file> $PASSING_FILTER_VCFThis is the output VCF file
Low-Complexity Regions
Low-complexity regions of the genome represent regions that have simple sequence repeats and variant callers are prone to make errors within these regions (see Li, 2014). Some insertions and deletions (Indels) are erroreously called within these low-complexity regions by various variant callers. They found:
“low-complexity regions (LCRs), 2% of the human genome, harbor 80–90% of heterozygous INDEL calls and up to 60% of heterozygous SNPs” with false positive rates ranging from “10% to as high as 40%”.
As a result of the high error rates in these low-complexity regions, it is recommended to remove of these regions until better methods for variant calling in low-complexity regions can become established. The file that holds our LCR regions is in BED format. For more detailed information on the BED file format, please see our File Formats lesson
Using SnpSift to remove LCRs
The first step that one would need to do to remove LCRs would be to download them. We have already done this for you, but you can see the steps in the dropdown menu below:
Click here to see how to download and format the LCR BED file
Downloading and Unpacking the BED files with LCRs
The BED file (explained below) containing the LCRs for GRCh38 and can be directly downloaded. However, we have already done this and formatted the BED file appropriately. If you are interested in this process, the steps are outlined below:# YOU DON'T NEED TO DO THIS cd ~/variant_calling/ mkdir reference cd reference curl -o LCR-hs38_with_chr.bed.gz -L https://github.com/lh3/varcmp/blob/master/scripts/LCR-hs38.bed.gz?raw=true
curl is a tool in bash that can download linked files from the Internet via the command line. curl is very similiar to wget and the differences between them are relatively minor and beyond the scope of this course. If you have experience or prefer wget feel free to use it instead, but for this example we will be using curl. Let's breifly discuss the curl command.
curlThis calls the curl command.-o LCR-hs38_with_chr.bed.gzBy using the-o(lowercase O), you are tellingcurlto download it and give it the following filename. You could also use a path here if you didn't want to download it to your current directly like-o /home/${USER}/variant_calling/reference/LCR-hs38_with_chr.bed.gz. Importantly, The-O(uppercase O) option without a string after it tellscurlto use the filename of the file as it is online, which happens to beLCR-hs38.bed.gz. However, this file has a query parameter?raw=truein the URL and if you were to download this with-Owithout?raw=trueon the end of the link you would get an HTML document. If you did include it, then your file would be namedLCR-hs38.bed.gz?raw=trueand you would need to rename it. Thus, for this case you probably should use-oover-Oto make the download be more straightforward.-L https://github.com/lh3/varcmp/blob/master/scripts/LCR-hs38.bed.gz?raw=trueThis is the link you would like to download from
# YOU DON'T NEED TO DO THIS gunzip -c LCR-hs38_with_chr.bed.gz > LCR-hs38_with_chr.bedWhen we inspect our BED file we can see that it simply has the required 3 columns denoting the positioning of low-complexitiy regions in GRCh38.
# YOU DON'T NEED TO DO THIS less LCR-hs38_with_chr.bedUnfortunately, this file is formatted with the chromosome names starting with
chr and we would like to strip that off so that it is consistent with our chromosome numbering scheme. That can be easily done in sed:
# YOU DON'T NEED TO DO THIS sed 's/^chr//g' LCR-hs38_with_chr.bed > LCR-hs38.bedWe can break this command down:
sedCalls thesedcommand's/^chr//g' LCR-hs38_with_chr.bedThis tellssedto replace all instances of a "chr" beginning a line (that's what the^tellssed) in the fileLCR-hs38_with_chr.bedwith nothing> LCR-hs38.bedWrite the output to this file
In order to remove the LCRs from the VCF file, we will once again be using SnpSift. As we mentioned earlier, we will be discussing SnpSift at length in the Variant Prioritization lesson, but for now were are going to focus on using the intervals command built into SnpSift. Let’s go back to our scripts directory and edit our variant filtering script.
Add the filter command from SnpSift to our sbatch script in order remove all sites that overlap with the BED file:
# Filter LCR
java -jar $SNPEFF/SnpSift.jar intervals \
-noLog \
-x \
-i $PASSING_FILTER_VCF \
$LCR_FILE > $LCR_FILTERED_VCF
java -jar $SNPEFF/SnpSift.jar intervalsThis calls theintervalspackage withinSnpSift-noLogThis does not report command usage toSnpEff’s server-xThis option tellsSnpSiftto exclude sites found in the BED file. The default behavior ofSnpSift filteris to only include sites found in the BED file.-i $PASSING_FILTER_VCFThis is the VCF file that we would like to be filtered. It can either be.gzcompressed or not.$LCR_FILEThis represents the BED file you want to use to filter your VCF file with. While in this case we only have one BED file, you can use multiple BED files if you have several filters that you wanted to apply.> $LCR_FILTERED_VCFLastly, this is just redirecting the output into a new, filtered VCF file.
NOTE:
bedtoolsis an useful suite of tools to use when handling BED files. It also has functionality for handling VCF files. A similar approach for filtering out low-complexity regions can also be done within the toolbedtoolsand is shown in a dropdown box below.
Click here to see how to use bedtools to exclude sites
First, we will need to load the bedtools module:
module load gcc/9.2.0 module load bedtools/2.30.0The command for running
bedtools to filter out low-complexity regions is:
bedtools intersect \ -header \ -v \ -a $PASSING_FILTER_VCF \ -b $LCR_FILE > $LCR_FILTERED_VCFWe can break down this command:
-headerThis will maintain the header information from the-afile. However, it will not add thebedtoolscommand to the header line likeSnpSiftdoes.-vTraditionally,bedtools intersectwill report the intersection of the file following-aand the file following-b. However, the-voption alters this behavior to find positions in the-afile not in-bfile.-a $PASSING_FILTER_VCFThis is the VCF file that we want filtered-b $LCR_FILEThis is the BED file containing genomic coordinates for sites in the VCF file to exclude> $LCR_FILTERED_VCFRedirecting the output of this filtering command to a new file.
bedtools and SnpSift should result in the same variants passing filtering and the only difference is in the amount of metadata provided in the header lines where SnpSift provides the command that was used to produce that file, while bedtools does not. Because of this, one could argue that SnpSift is slightly better for this purpose than bedtools. However, much like the samtools versus Picard discussion we had in the BAM alignment procressing lesson, it is mostly up to personal preference.
Click here to see what our final sbatchcode script for filtering our called variants should look like
#!/bin/bash # This sbatch script is for variant filtering
# Assign sbatch directives #SBATCH -p priority #SBATCH -t 0-00:10:00 #SBATCH -c 1 #SBATCH --mem 8G #SBATCH -o variant_filtering_normal_tumor_%j.out #SBATCH -e variant_filtering_normal_tumor_%j.err
# Load modules module load gatk/4.1.9.0 module load snpEff/4.3g
# Assign variables REFERENCE_SEQUENCE=/n/groups/hbctraining/variant_calling/reference/GRCh38.p7.fa RAW_VCF_FILE=/n/scratch/users/${USER:0:1}/${USER}/variant_calling/vcf_files/mutect2_syn3_normal_syn3_tumor_GRCh38.p7-raw.vcf LCR_FILE=/n/groups/hbctraining/variant_calling/reference/LCR-hs38.bed MUTECT_FILTERED_VCF=${RAW_VCF_FILE%raw.vcf}filt.vcf PASSING_FILTER_VCF=${RAW_VCF_FILE%raw.vcf}pass-filt.vcf LCR_FILTERED_VCF=${RAW_VCF_FILE%raw.vcf}pass-filt-LCR.vcf
# Filter Mutect Calls gatk FilterMutectCalls \ --reference $REFERENCE_SEQUENCE \ --variant $RAW_VCF_FILE \ --output $MUTECT_FILTERED_VCF
# Filter for only SNPs with PASS in the FILTER field java -jar $SNPEFF/SnpSift.jar filter \ -noLog \ "( FILTER = 'PASS' )" \ $MUTECT_FILTERED_VCF > $PASSING_FILTER_VCF
# Filter LCR java -jar $SNPEFF/SnpSift.jar intervals \ -noLog \ -x \ -i $PASSING_FILTER_VCF \ $LCR_FILE > $LCR_FILTERED_VCF
We can now submit our sbatch script:
sbatch variant_filtering_normal_tumor.sbatch
Inspecting Filtered VCF File
Once it has completed, which should be quick, we can look at the output VCF file and note a few items that have been added to the meta-information lines:
less /n/scratch/users/${USER:0:1}/${USER}/variant_calling/vcf_files/mutect2_syn3_normal_syn3_tumor_GRCh38.p7-pass-filt-LCR.vcf
Scroll down the VCF file past all of the contigs and you should see a line starting with:
##filtering_status=These calls have been filtered by FilterMutectCalls to label false positives with a list of failed filters and true positives with PASS.
Let’s inspect these lines a little:
##filtering_status=These calls have been filtered by FilterMutectCalls to label false positives with a list of failed filters and true positives with PASS.
##normal_sample=syn3-normal
##source=FilterMutectCalls
##source=Mutect2
##tumor_sample=syn3-tumor
##SnpSiftVersion="SnpSift 4.3g (build 2016-11-28 08:32), by Pablo Cingolani"
##SnpSiftCmd="SnpSift filter '( FILTER = 'PASS' )' /n/scratch3/users/${USER:0:1}/$USER/variant_calling/vcf_files/mutect2_syn3_normal_syn3_tumor_GRCh38.p7-filt.vcf"
##SnpSiftVersion="SnpSift 4.3g (build 2016-11-28 08:32), by Pablo Cingolani"
##SnpSiftCmd="SnpSift int -x -i /n/scratch/users/${USER:0:1}/$USER/variant_calling/vcf_files/mutect2_syn3_normal_syn3_tumor_GRCh38.p7-pass-filt.vcf /n/groups/hbctraining/variant_calling/reference/LCR-hs38.bed"
The first five lines have been added to our VCF file by GATK. They give information on the programs that have been run on the data, which is listed on the ##source= lines. These lines also define the column header in the VCF file that corresponds to the normal (##normal_sample=) and tumor sample (##tumor_sample=).
The next four lines tell us about our SnpSift commands:
##SnpSiftVersion=states the version of SnpSift that was used to produce this file.##SnpSiftCmd=provides thefilterandintervalscommands that were used bySnpSiftto carry out the filtering.
We can also check how many variants are contained in our filtered file:
grep -v "^#" /n/scratch/users/${USER:0:1}/${USER}/variant_calling/vcf_files/mutect2_syn3_normal_syn3_tumor_GRCh38.p7-pass-filt-LCR.vcf | wc -l
What this command does is use grep -v to search for lines in our filtered VCF that do not start with a # (and are therefore not metadata, like we were looking at above, or the header row of the VCF), and then use wc -l to count the number of lines in that output.
We should see 1362 variants in our filtered VCF.
Now, we have successfuly filtered our raw VCF file to only include high-quality variant calls, we are ready to begin annotating our variants.
This lesson has been developed by members of the teaching team at the Harvard Chan Bioinformatics Core (HBC). These are open access materials distributed under the terms of the Creative Commons Attribution license (CC BY 4.0), which permits unrestricted use, distribution, and reproduction in any medium, provided the original author and source are credited.