Learning Objectives
- Filter records in a VCF file various effects and impacts
- Extract fields of interest from VCF file
Prioritizing Variants
Variant prioritization is the process of assessing the potential significance and pathogenicity of genetic variants identified through DNA sequencing. As we have seen from our example, thousands of genetic variants can be identified and the next task is to identify what subset of those we would like to focus on. Some factors to consider when ranking varaints include:
- Location in the genome and/or in specific genes of interest
- Predicted effect on the gene or protein (i.e missense mutation or silent mutations)
- Frequency in population databases
- Previous reports of the variant being associated with disease
Through systematically analyzing these various attributes, variant prioritization helps geneticists and clinicians focus on the subset of variants most likely to explain an individual’s condition for further review and validation.
Snpsift: filter
SnpSift is part of the SnpEff suite and it is built for variant prioritization. Let’s start by discussing some of the ways you can filter your data with SnpSift.
Before we do anything, let’s move to the directory with our VCF files and load the SnpEff module:
cd /n/scratch/users/${USER:0:1}/$USER/variant_calling/vcf_files/
module load snpEff/4.3g
SnpSift filter is one of the most useful SnpSift commands. Using SnpSift filter you can filter VCF files using arbitrary expressions. In the most simple case, you can filter your SnpEff annotated VCF file based upon any of the first seven fields of the VCF file:
- CHROM
- POS
- ID
- REF
- ALT
- QUAL
- FILTER
Let’s start with a command to filter and keep all of the variants on Chromosome 1. We will pipe the output into less so we can easily scroll through the result:
java -jar $SNPEFF/SnpSift.jar filter \
-noLog \
"( CHROM = '1' )" \
mutect2_syn3_normal_syn3_tumor_GRCh38.p7-pass-filt-LCR.pedigree_header.snpeff.dbSNP.vcf | less
Let’s break down the syntax a bit:
java -jar $SNPEFF/SnpSift.jar filterThis calls thefilterpackage withinSnpSift.-noLogThis does not report command usage toSnpEff’s server"( CHROM = '1' )"This is the syntax needed to extract variants on Chromosome1.- The left side of the equals sign corresponds to the VCF field you wish to filter by.
- The right side is the string you would like to match.
mutect2_syn3_normal_syn3_tumor_GRCh38.p7-pass-filt-LCR.pedigree_header.snpeff.dbSNP.vcfThis is the input VCF file. Importantly, this needs to go at the end.
Revisiting “( FILTER = ‘PASS’ )”
Recall that we had used
SnpSift, in the filtering command that we used earlier in the workshop. In that lesson we usedSnpSiftto look at theFILTERfield and retain varaints with the value ofPASS.# YOU DO NOT NEED TO RUN THIS # Filter for only SNPs with PASS in the FILTER field java -jar $SNPEFF/SnpSift.jar filter \ -noLog \ "( FILTER = 'PASS' )" \ $MUTECT_FILTERED_VCF > $PASSING_FILTER_VCF
Multiple Filters
It is likely that you could want to filter on multiple criteria. You can do that by separating the filter criteria with either and & (and) or | (or).
For example, let’s consider a case where you want to filter for any variants on Chromosome 1 OR Chromosome 2. That command would look like:
java -jar $SNPEFF/SnpSift.jar filter \
"( CHROM = '1' ) | ( CHROM = '2' )" \
mutect2_syn3_normal_syn3_tumor_GRCh38.p7-pass-filt-LCR.pedigree_header.snpeff.dbSNP.vcf | less
NOTE: The
"( CHROM = '1' ) | ( CHROM = '2' )"syntax allows us to filter for Chromosome1or Chromosome2by using the|to separate our criteria within the double quotes. While you might be most familiar with the|symbol as the pipe command inbash, it is not uncommon in other instances or languages likeRfor it to stand for “or”. In fact, in bash, “or” is||, so it is closely related. The important point here is that the|within the double quotes stands for “or” when usingSnpSiftand it is not a pipe.
Alternatively, we could be interested in variants on Chromosome 1 between positions 1000000 and 2000000. This command would look like:
java -jar $SNPEFF/SnpSift.jar filter \
-noLog \
"( CHROM = '1' ) & ( POS > 1000000 ) & ( POS < 2000000 )" \
mutect2_syn3_normal_syn3_tumor_GRCh38.p7-pass-filt-LCR.pedigree_header.snpeff.dbSNP.vcf | less
INFO Field: Gene
SnpSift also allows the user to filter based upon the INFO field. This is particularly helpful since SnpEff’s annotations are placed into the INFO field. There are many INFO field filters that one can apply but we will discuss some of the more popular ones. Filtering the INFO will importantly always begin with an ANN in the filtering criteria.
If you are interested in all of the variants corresponding to a single gene of interest, you can filter by the gene name in this case CPSF3L:
java -jar $SNPEFF/SnpSift.jar filter \
-noLog \
"( ANN[*].GENE = 'CPSF3L' )" mutect2_syn3_normal_syn3_tumor_GRCh38.p7-pass-filt-LCR.pedigree_header.snpeff.dbSNP.vcf | less
To filter by the gene name you will need "( ANN[*].GENE = 'INSERT_GENE_NAME' )".
NOTE: When handling multiple valued fields (i.e. fields with commas),
SnpSiftuses a 0-based index to describing those elements. In the example below, we can see that the first fourANNfields are for the geneCPTPbefore we get to the annotations forCPSF3L. So if you used"( ANN[0].GENE = 'CPSF3L' )"instead of"( ANN[*].GENE = 'CPSF3L' )", then it will only return the entries which have"CPSF3L"in the firstGENEannotation field and exclude the later entries. The*tellsSnpSiftto extract the record if “any” annotations corresponds toCPSF3L. For most cases, you will want to use the*, but you should understand why it is there.1 1324300 . G A . PASS AS_FilterStatus=SITE;AS_SB_TABLE=47,6|11,0;ClippingRankSum=0.39;DP=68;ECNT=1;FS=2.373;GERMQ=93;MBQ=27,27;MFRL=337,338;MMQ=60,60;MPOS=27;MQ=60;MQ0=0;MQRankSum=0;NALOD=1.54;NLOD=9.88;POPAF=6;ReadPosRankSum=-0.125;TLOD=24.65;ANN=A|upstream_gene_variant|MODIFIER|CPTP|CPTP|transcript|NM_001029885.1|protein_coding||c.-2611G>A|||||463|,A|upstream_gene_variant|MODIFIER|CPTP|CPTP|transcript|XM_005244802.1|protein_coding||c.-3008G>A|||||456|,A|upstream_gene_variant|MODIFIER|CPTP|CPTP|transcript|XM_005244801.3|protein_coding||c.-2611G>A|||||843|,A|upstream_gene_variant|MODIFIER|CPTP|CPTP|transcript|XM_011542200.2|protein_coding||c.-2611G>A|||||1315|,A|intron_variant|MODIFIER|CPSF3L|CPSF3L|transcript|XM_011541647.1|protein_coding|1/18|c.28+281C>T||||||,A|intron_variant|MODIFIER|CPSF3L|CPSF3L|transcript|NM_001256456.1|protein_coding|1/18|c.-428+281C>T||||||,A|intron_variant|MODIFIER|CPSF3L|CPSF3L|transcript|NM_001256460.1|protein_coding|1/17|c.-167+281C>T||||||,A|intron_variant|MODIFIER|CPSF3L|CPSF3L|transcript|NM_001256462.1|protein_coding|1/14|c.28+281C>T||||||,A|intron_variant|MODIFIER|CPSF3L|CPSF3L|transcript|NM_001256463.1|protein_coding|1/14|c.28+281C>T||||||,A|intron_variant|MODIFIER|CPSF3L|CPSF3L|transcript|NM_017871.5|protein_coding|1/16|c.28+281C>T||||||,A|intron_variant|MODIFIER|CPSF3L|CPSF3L|transcript|XM_017001558.1|protein_coding|1/18|c.-438+281C>T||||||,A|intron_variant|MODIFIER|CPSF3L|CPSF3L|transcript|XM_017001557.1|protein_coding|1/17|c.-361+281C>T||||||,A|intron_variant|MODIFIER|CPSF3L|CPSF3L|transcript|XM_011541648.1|protein_coding|1/18|c.-91+281C>T||||||,A|intron_variant|MODIFIER|CPSF3L|CPSF3L|transcript|XM_011541650.1|protein_coding|1/16|c.-254+281C>T|||||| GT:AD:AF:DP:F1R2:F2R1:SB 0/0:33,0:0.028:33:12,0:20,0:32,1,0,0 0/1:20,11:0.367:31:6,6:12,5:15,5,11,0
You can also filter on the transcript ID which in this case is the NCBI accession number.
java -jar $SNPEFF/SnpSift.jar filter \
-noLog \
"( ANN[*].TRID = 'XM_017001557.1' )" mutect2_syn3_normal_syn3_tumor_GRCh38.p7-pass-filt-LCR.pedigree_header.snpeff.dbSNP.vcf | less
INFO Field: Effects
If you want to filter your output by the effects the variants have on the annotated gene models, the syntax for this is quite similar to the example for genes:
java -jar $SNPEFF/SnpSift.jar filter \
-noLog \
"( ANN[*].EFFECT has 'missense_variant' )" \
mutect2_syn3_normal_syn3_tumor_GRCh38.p7-pass-filt-LCR.pedigree_header.snpeff.dbSNP.vcf | less
To filter by a variant effect, the filter syntax is "( ANN[*].EFFECT has 'VARIANT_EFFECT' )"
NOTE: Importantly, notice the use of
hasinstead of=here. Sometimes effects field will contain mutliple effects such asmissense_variant&splice_donor_variant. UsingANN[*].EFFECT = missense_varianthere WILL NOT return this line, because the line is not equal tomissense_variant, howeverANN[*].EFFECT has missense_variantWILL return this line. Oftentimes for effects, one would be interested in thehasquery as opposed to the=one.
There are many different variant effects and some of the more common ones are listed below:
| SnpEff Annotation | Type of variant |
|---|---|
missense_variant |
missense/non-synonymous variants |
frameshift_variant |
frameshift variants |
stop_gain |
nonsense variants |
stop_lost |
variants that lose a stop codon |
start_gain |
variants that gain a start codon |
start_lost |
variants that lose a start codon |
synonymous_variant |
synonymous/silent variants |
splice_donor_variant |
variants in the splice donor site |
splice_acceptor_variant |
variants in the splice acceptor site |
5_prime_UTR_variant |
variants in the 5’ untranslated region |
3_prime_UTR_variant |
variants in the 3’ untranslated region |
Many more effects can be found here.
INFO Field: Impacts
SnpEff also predicts the deleterious nature of a variant by binning it into one of several categories:
HIGHThese are variants that will almost certainly have a deleterious impact on the transcript. Examples of this would be the loss or gain of a stop codon or a frameshift mutation.MODERATEThese are variants where the impact may have a deleterious impact on the transcript. Examples of this would be missense/non-synonymous variants and in-frame deletions/insertions.LOWThese are variants that are unlikely to have a deleterious impact on the transcript. Examples of this would be silent/synonymous variants and alterations between different stop codons.MODIFERThese variants are typically in non-coding regions and their impacts are difficult to ascertain.
More information on these categories can be found here and a complete listing of the categories for each effect can be found here.
Let’s go ahead and select out all of our HIGH impact muations:
java -jar $SNPEFF/SnpSift.jar filter \
-noLog \
"( ANN[*].IMPACT has 'HIGH' )" \
mutect2_syn3_normal_syn3_tumor_GRCh38.p7-pass-filt-LCR.pedigree_header.snpeff.dbSNP.vcf | less
NOTE: Similarly to
EFFECT, oftentimes you will want to usehasrather than=.
Let’s go ahead and redirect the output of these “high-impact” mutations to a new VCF file:
java -jar $SNPEFF/SnpSift.jar filter \
-noLog \
"( ANN[*].IMPACT has 'HIGH' )" \
mutect2_syn3_normal_syn3_tumor_GRCh38.p7-pass-filt-LCR.pedigree_header.snpeff.dbSNP.vcf > mutect2_syn3_normal_syn3_tumor_GRCh38.p7-pass-filt-LCR.pedigree_header.snpeff.dbSNP.high_impact.vcf
INFO: Other ANN fields
In addition to GENE, EFFECT, IMPACT AND TRID, there are a whole host of other ANN fields. Some of the other ANN fields that we will come across later are:
HGVS_P- The alteration in protein notationHGVS_C- The alteration in DNA notation
A full list of ANN fields can be found here.
Snpsift: vcfEffOnePerLine
A useful tool within the SnpSift toolkit is the perl script named vcfEffOnePerLine.pl. This script allows the user to separate each effect onto its own line instead of having them lumped into a single line. In order to utilize this script we need to pipe the output of our filter command into $SNPEFF/scripts/vcfEffOnePerLine.pl. We can use it on our previous example to demonstrate:
java -jar $SNPEFF/SnpSift.jar filter \
-noLog \
"( ANN[*].IMPACT has 'HIGH' )" \
mutect2_syn3_normal_syn3_tumor_GRCh38.p7-pass-filt-LCR.pedigree_header.snpeff.dbSNP.vcf | \
$SNPEFF/scripts/vcfEffOnePerLine.pl | less
Now, we can see that each variant has a separate entry depending on its effect.
1 6471577 . A ACTCACGTGCAAGCATCACACCGGCACGC . PASS AS_FilterStatus=SITE;AS_SB_TABLE=49,52|6,4;ClippingRankSum=-1.498;DP=119;ECNT=1;FS=2.779;GERMQ=93;MBQ=32,32;MFRL=341,338;MMQ=60,60;MPOS=20;MQ=60;MQ0=0;MQRankSum=0;NALOD=1.82;NLOD=19.17;POPAF=6;ReadPosRankSum=-1.659;TLOD=33.37;LOF=(PLEKHG5|PLEKHG5|8|1.00);NMD=(PLEKHG5|PLEKHG5|8|1.00);ANN=ACTCACGTGCAAGCATCACACCGGCACGC|frameshift_variant&stop_gained|HIGH|PLEKHG5|PLEKHG5|transcript|NM_001265592.1|protein_coding|13/22|c.1428_1429insGCGTGCCGGTGTGATGCTTGCACGTGAG|p.Trp477fs|1493/4794|1428/3258|476/1085|| GT:AD:AF:DP:F1R2:F2R1:SB 0/0:63,0:0.015:63:37,0:25,0:33,30,0,0 0/1:38,10:0.216:48:19,5:17,4:16,22,6,4
1 6471577 . A ACTCACGTGCAAGCATCACACCGGCACGC . PASS AS_FilterStatus=SITE;AS_SB_TABLE=49,52|6,4;ClippingRankSum=-1.498;DP=119;ECNT=1;FS=2.779;GERMQ=93;MBQ=32,32;MFRL=341,338;MMQ=60,60;MPOS=20;MQ=60;MQ0=0;MQRankSum=0;NALOD=1.82;NLOD=19.17;POPAF=6;ReadPosRankSum=-1.659;TLOD=33.37;LOF=(PLEKHG5|PLEKHG5|8|1.00);NMD=(PLEKHG5|PLEKHG5|8|1.00);ANN=ACTCACGTGCAAGCATCACACCGGCACGC|frameshift_variant&stop_gained|HIGH|PLEKHG5|PLEKHG5|transcript|NM_001265593.1|protein_coding|12/21|c.1398_1399insGCGTGCCGGTGTGATGCTTGCACGTGAG|p.Trp467fs|1424/4725|1398/3228|466/1075|| GT:AD:AF:DP:F1R2:F2R1:SB 0/0:63,0:0.015:63:37,0:25,0:33,30,0,0 0/1:38,10:0.216:48:19,5:17,4:16,22,6,4
1 6471577 . A ACTCACGTGCAAGCATCACACCGGCACGC . PASS AS_FilterStatus=SITE;AS_SB_TABLE=49,52|6,4;ClippingRankSum=-1.498;DP=119;ECNT=1;FS=2.779;GERMQ=93;MBQ=32,32;MFRL=341,338;MMQ=60,60;MPOS=20;MQ=60;MQ0=0;MQRankSum=0;NALOD=1.82;NLOD=19.17;POPAF=6;ReadPosRankSum=-1.659;TLOD=33.37;LOF=(PLEKHG5|PLEKHG5|8|1.00);NMD=(PLEKHG5|PLEKHG5|8|1.00);ANN=ACTCACGTGCAAGCATCACACCGGCACGC|frameshift_variant&stop_gained|HIGH|PLEKHG5|PLEKHG5|transcript|NM_001042665.1|protein_coding|12/21|c.1191_1192insGCGTGCCGGTGTGATGCTTGCACGTGAG|p.Trp398fs|1397/4698|1191/3021|397/1006|| GT:AD:AF:DP:F1R2:F2R1:SB 0/0:63,0:0.015:63:37,0:25,0:33,30,0,0 0/1:38,10:0.216:48:19,5:17,4:16,22,6,4
1 6471577 . A ACTCACGTGCAAGCATCACACCGGCACGC . PASS AS_FilterStatus=SITE;AS_SB_TABLE=49,52|6,4;ClippingRankSum=-1.498;DP=119;ECNT=1;FS=2.779;GERMQ=93;MBQ=32,32;MFRL=341,338;MMQ=60,60;MPOS=20;MQ=60;MQ0=0;MQRankSum=0;NALOD=1.82;NLOD=19.17;POPAF=6;ReadPosRankSum=-1.659;TLOD=33.37;LOF=(PLEKHG5|PLEKHG5|8|1.00);NMD=(PLEKHG5|PLEKHG5|8|1.00);ANN=ACTCACGTGCAAGCATCACACCGGCACGC|frameshift_variant&stop_gained|HIGH|PLEKHG5|PLEKHG5|transcript|NM_001042664.1|protein_coding|12/21|c.1191_1192insGCGTGCCGGTGTGATGCTTGCACGTGAG|p.Trp398fs|1414/4715|1191/3021|397/1006|| GT:AD:AF:DP:F1R2:F2R1:SB 0/0:63,0:0.015:63:37,0:25,0:33,30,0,0 0/1:38,10:0.216:48:19,5:17,4:16,22,6,4
1 6471577 . A ACTCACGTGCAAGCATCACACCGGCACGC . PASS AS_FilterStatus=SITE;AS_SB_TABLE=49,52|6,4;ClippingRankSum=-1.498;DP=119;ECNT=1;FS=2.779;GERMQ=93;MBQ=32,32;MFRL=341,338;MMQ=60,60;MPOS=20;MQ=60;MQ0=0;MQRankSum=0;NALOD=1.82;NLOD=19.17;POPAF=6;ReadPosRankSum=-1.659;TLOD=33.37;LOF=(PLEKHG5|PLEKHG5|8|1.00);NMD=(PLEKHG5|PLEKHG5|8|1.00);ANN=ACTCACGTGCAAGCATCACACCGGCACGC|frameshift_variant&stop_gained|HIGH|PLEKHG5|PLEKHG5|transcript|NM_001265594.1|protein_coding|12/22|c.1191_1192insGCGTGCCGGTGTGATGCTTGCACGTGAG|p.Trp398fs|1428/4529|1191/2793|397/930|| GT:AD:AF:DP:F1R2:F2R1:SB 0/0:63,0:0.015:63:37,0:25,0:33,30,0,0 0/1:38,10:0.216:48:19,5:17,4:16,22,6,4
1 6471577 . A ACTCACGTGCAAGCATCACACCGGCACGC . PASS AS_FilterStatus=SITE;AS_SB_TABLE=49,52|6,4;ClippingRankSum=-1.498;DP=119;ECNT=1;FS=2.779;GERMQ=93;MBQ=32,32;MFRL=341,338;MMQ=60,60;MPOS=20;MQ=60;MQ0=0;MQRankSum=0;NALOD=1.82;NLOD=19.17;POPAF=6;ReadPosRankSum=-1.659;TLOD=33.37;LOF=(PLEKHG5|PLEKHG5|8|1.00);NMD=(PLEKHG5|PLEKHG5|8|1.00);ANN=ACTCACGTGCAAGCATCACACCGGCACGC|frameshift_variant&stop_gained|HIGH|PLEKHG5|PLEKHG5|transcript|NM_020631.4|protein_coding|12/21|c.1191_1192insGCGTGCCGGTGTGATGCTTGCACGTGAG|p.Trp398fs|1343/4644|1191/3021|397/1006|| GT:AD:AF:DP:F1R2:F2R1:SB 0/0:63,0:0.015:63:37,0:25,0:33,30,0,0 0/1:38,10:0.216:48:19,5:17,4:16,22,6,4
1 6471577 . A ACTCACGTGCAAGCATCACACCGGCACGC . PASS AS_FilterStatus=SITE;AS_SB_TABLE=49,52|6,4;ClippingRankSum=-1.498;DP=119;ECNT=1;FS=2.779;GERMQ=93;MBQ=32,32;MFRL=341,338;MMQ=60,60;MPOS=20;MQ=60;MQ0=0;MQRankSum=0;NALOD=1.82;NLOD=19.17;POPAF=6;ReadPosRankSum=-1.659;TLOD=33.37;LOF=(PLEKHG5|PLEKHG5|8|1.00);NMD=(PLEKHG5|PLEKHG5|8|1.00);ANN=ACTCACGTGCAAGCATCACACCGGCACGC|frameshift_variant&stop_gained|HIGH|PLEKHG5|PLEKHG5|transcript|NM_001042663.1|protein_coding|13/22|c.1359_1360insGCGTGCCGGTGTGATGCTTGCACGTGAG|p.Trp454fs|1460/4761|1359/3189|453/1062|| GT:AD:AF:DP:F1R2:F2R1:SB 0/0:63,0:0.015:63:37,0:25,0:33,30,0,0 0/1:38,10:0.216:48:19,5:17,4:16,22,6,4
1 6471577 . A ACTCACGTGCAAGCATCACACCGGCACGC . PASS AS_FilterStatus=SITE;AS_SB_TABLE=49,52|6,4;ClippingRankSum=-1.498;DP=119;ECNT=1;FS=2.779;GERMQ=93;MBQ=32,32;MFRL=341,338;MMQ=60,60;MPOS=20;MQ=60;MQ0=0;MQRankSum=0;NALOD=1.82;NLOD=19.17;POPAF=6;ReadPosRankSum=-1.659;TLOD=33.37;LOF=(PLEKHG5|PLEKHG5|8|1.00);NMD=(PLEKHG5|PLEKHG5|8|1.00);ANN=ACTCACGTGCAAGCATCACACCGGCACGC|frameshift_variant&stop_gained|HIGH|PLEKHG5|PLEKHG5|transcript|NM_198681.3|protein_coding|13/22|c.1422_1423insGCGTGCCGGTGTGATGCTTGCACGTGAG|p.Trp475fs|1972/5273|1422/3252|474/1083|| GT:AD:AF:DP:F1R2:F2R1:SB 0/0:63,0:0.015:63:37,0:25,0:33,30,0,0 0/1:38,10:0.216:48:19,5:17,4:16,22,6,4
Which was previously:
1 6471577 . A ACTCACGTGCAAGCATCACACCGGCACGC . PASS AS_FilterStatus=SITE;AS_SB_TABLE=49,52|6,4;ClippingRankSum=-1.498;DP=119;ECNT=1;FS=2.779;GERMQ=93;MBQ=32,32;MFRL=341,338;MMQ=60,60;MPOS=20;MQ=60;MQ0=0;MQRankSum=0;NALOD=1.82;NLOD=19.17;POPAF=6;ReadPosRankSum=-1.659;TLOD=33.37;ANN=ACTCACGTGCAAGCATCACACCGGCACGC|frameshift_variant&stop_gained|HIGH|PLEKHG5|PLEKHG5|transcript|NM_001265592.1|protein_coding|13/22|c.1428_1429insGCGTGCCGGTGTGATGCTTGCACGTGAG|p.Trp477fs|1493/4794|1428/3258|476/1085||,ACTCACGTGCAAGCATCACACCGGCACGC|frameshift_variant&stop_gained|HIGH|PLEKHG5|PLEKHG5|transcript|NM_001265593.1|protein_coding|12/21|c.1398_1399insGCGTGCCGGTGTGATGCTTGCACGTGAG|p.Trp467fs|1424/4725|1398/3228|466/1075||,ACTCACGTGCAAGCATCACACCGGCACGC|frameshift_variant&stop_gained|HIGH|PLEKHG5|PLEKHG5|transcript|NM_001042665.1|protein_coding|12/21|c.1191_1192insGCGTGCCGGTGTGATGCTTGCACGTGAG|p.Trp398fs|1397/4698|1191/3021|397/1006||,ACTCACGTGCAAGCATCACACCGGCACGC|frameshift_variant&stop_gained|HIGH|PLEKHG5|PLEKHG5|transcript|NM_001042664.1|protein_coding|12/21|c.1191_1192insGCGTGCCGGTGTGATGCTTGCACGTGAG|p.Trp398fs|1414/4715|1191/3021|397/1006||,ACTCACGTGCAAGCATCACACCGGCACGC|frameshift_variant&stop_gained|HIGH|PLEKHG5|PLEKHG5|transcript|NM_001265594.1|protein_coding|12/22|c.1191_1192insGCGTGCCGGTGTGATGCTTGCACGTGAG|p.Trp398fs|1428/4529|1191/2793|397/930||,ACTCACGTGCAAGCATCACACCGGCACGC|frameshift_variant&stop_gained|HIGH|PLEKHG5|PLEKHG5|transcript|NM_020631.4|protein_coding|12/21|c.1191_1192insGCGTGCCGGTGTGATGCTTGCACGTGAG|p.Trp398fs|1343/4644|1191/3021|397/1006||,ACTCACGTGCAAGCATCACACCGGCACGC|frameshift_variant&stop_gained|HIGH|PLEKHG5|PLEKHG5|transcript|NM_001042663.1|protein_coding|13/22|c.1359_1360insGCGTGCCGGTGTGATGCTTGCACGTGAG|p.Trp454fs|1460/4761|1359/3189|453/1062||,ACTCACGTGCAAGCATCACACCGGCACGC|frameshift_variant&stop_gained|HIGH|PLEKHG5|PLEKHG5|transcript|NM_198681.3|protein_coding|13/22|c.1422_1423insGCGTGCCGGTGTGATGCTTGCACGTGAG|p.Trp475fs|1972/5273|1422/3252|474/1083||;LOF=(PLEKHG5|PLEKHG5|8|1.00);NMD=(PLEKHG5|PLEKHG5|8|1.00) GT:AD:AF:DP:F1R2:F2R1:SB 0/0:63,0:0.015:63:37,0:25,0:33,30,0,0 0/1:38,10:0.216:48:19,5:17,4:16,22,6,4
This step is particularly helpful for cleaning up the files for use in the next step, extractFields.
Snpsift: extractFields
Lastly, we have another extremely useful feature of SnpSift and that is the extractFields command. This allows us to parse the VCF file and print only the fields we are interested in.
If we wanted to parse out the missense mutations, create a single line for each effect, then extract the fields for chromosome, position, gene ID as well as the alteration in terms of protein and DNA space and also the predicted effect, then we could create an output like that using:
java -jar $SNPEFF/SnpSift.jar filter \
-noLog \
"( ANN[*].EFFECT has 'missense_variant' )" \
mutect2_syn3_normal_syn3_tumor_GRCh38.p7-pass-filt-LCR.pedigree_header.snpeff.dbSNP.vcf | \
$SNPEFF/scripts/vcfEffOnePerLine.pl | \
java -jar $SNPEFF/SnpSift.jar extractFields \
- \
"CHROM" "POS" "ANN[*].GENE" "ANN[*].TRID" "EFF[*].HGVS_P" "ANN[*].HGVS_C" "ANN[*].EFFECT" | less
Let’s breakdown this command:
java -jar $SNPEFF/SnpSift.jar filterThis calls thefilterpackage withinSnpSift-noLogThis does not report command usage toSnpEff’s server"( ANN[*].EFFECT has 'missense_variant' )"Filter out lines wheremissense_variantis annotation in ANY annotation.mutect2_syn3_normal_syn3_tumor_GRCh38.p7-pass-filt-LCR.pedigree_header.snpeff.dbSNP.vcf |This is our input file and then pipe this output.$SNPEFF/scripts/vcfEffOnePerLine.pl |Place each effect on it’s own line and pipe this output.java -jar $SNPEFF/SnpSift.jar extractFieldsThis calls theextractFieldspackage withinSnpSift-The use of-is very commonly used to define the input as coming from standard input, or in other words, the input is being piped into the command."CHROM" "POS" "ANN[*].GENE" "ANN[*].TRID" "EFF[*].HGVS_P" "ANN[*].HGVS_C" "ANN[*].EFFECT"This is defining the fields that we would like to filter.
Notice, however, that some of the fields don’t correspond to a missense_variant. This is because when we initially extracted sites we filtered sites where at least one effect was a missense_variant, then we separated the variants into separate lines before extracting our fields of interest. At this point if we wanted to only keep the missense variant lines, so we can pipe the output into a simple grep command:
java -jar $SNPEFF/SnpSift.jar filter \
-noLog \
"( ANN[*].EFFECT has 'missense_variant' )" \
mutect2_syn3_normal_syn3_tumor_GRCh38.p7-pass-filt-LCR.pedigree_header.snpeff.dbSNP.vcf | \
$SNPEFF/scripts/vcfEffOnePerLine.pl | \
java -jar $SNPEFF/SnpSift.jar extractFields \
- \
"CHROM" "POS" "ANN[*].GENE" "ANN[*].TRID" "EFF[*].HGVS_P" "ANN[*].HGVS_C" "ANN[*].EFFECT" | \
grep 'missense_variant' | less
This provides us with an clean, organized, tab-delimited table of our output.
Exercises
1) Extract from mutect2_syn3_normal_syn3_tumor_GRCh38.p7-pass-filt-LCR.pedigree_header.snpeff.dbSNP.vcf all of the MODERATE-impact mutations on Chromosome 12.
2) Pipe the output from Exercise 1) into a command to only display one line for each effect.
3) Pipe the output from Exercise 2) into a command to extract the chromosome, position, gene and effect.
4) Redirect the output from Exercise 3) into a new file called mutect2_syn3_normal_syn3_tumor_GRCh38.p7-pass-filt-LCR.pedigree_header.snpeff.dbSNP.chr12_moderate-impact.txt
This lesson has been developed by members of the teaching team at the Harvard Chan Bioinformatics Core (HBC). These are open access materials distributed under the terms of the Creative Commons Attribution license (CC BY 4.0), which permits unrestricted use, distribution, and reproduction in any medium, provided the original author and source are credited.