Approximate time: 45 minutes

Learning Objectives

What is genomic variation?

Genomic variation refers to differences in DNA sequences among individuals or species. Genomic variants hold significant information that can benefit many fields.

Types of variants

There are several different types of variants, each of which requires its own consideration. These include:

The Importance of Coverage

One of the most important considerations of experimental design when carrying out a study to identify variants is to sequence your samples to an adequate level of coverage. High coverage is helpful for two reasons:

1) It helps distinguish sequencing errors and artifacts from true low-frequency alleles in tumor samples.

2) It helps distinguish germline variants from somatic variants.

Coverage is the average number of sequencing reads that span (or “cover”) a given position. It is abbreviated as an integer value followed by “X”. For example, if the average position in the genome is covered by 22 reads, the sample is said to have 22X coverage. The coverage guidelines depend on which sequencing strategy you implement, as described below.
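As a minimal sketch of this definition, the snippet below averages per-position read depths and formats the result with the “X” suffix. The function names and the toy depth values are illustrative assumptions, not part of any particular tool; in practice the per-position depths would come from your aligned reads.

```python
def mean_coverage(depths):
    """Return the average number of reads covering each position."""
    if not depths:
        raise ValueError("depths must contain at least one position")
    return sum(depths) / len(depths)


def format_coverage(depths):
    """Format the average coverage as a rounded integer followed by 'X'."""
    return f"{round(mean_coverage(depths))}X"


# Hypothetical read depths at five adjacent positions (made-up values).
toy_depths = [20, 22, 25, 21, 22]

print(format_coverage(toy_depths))  # prints "22X"
```

Note that an average hides unevenness: a sample reported as 22X can still contain positions covered by very few reads.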

Sequencing strategies for variant calling

The choice of sequencing strategy for a clinical sample has an important impact on variant calling, and your choice will tend to be guided by your study design.

Whole Genome Sequencing (WGS)

Whole-genome sequencing offers the most comprehensive approach and typically yields ~30–60× average sequence depth across the entire genome. This is the most expensive approach.

Exome sequencing

Exome sequencing, which targets virtually all ~20,000 protein-coding genes, typically achieves >100× average depth across the target regions.
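To get a feel for why whole-genome sequencing costs more despite its lower depth, the rough calculation below estimates how many reads are needed to reach a target average depth, using depth ≈ (number of reads × read length) / region size. The genome size (3.1 Gb), exome target size (40 Mb), and read length (150 bp) are assumed round numbers for illustration, not values from this lesson. The estimate also ignores off-target reads, duplicates, and uneven capture, so real experiments need more reads than this.

```python
import math


def reads_needed(target_depth, region_size_bp, read_length_bp):
    """Estimate reads required so that reads * read_length / region_size ~= target_depth."""
    return math.ceil(target_depth * region_size_bp / read_length_bp)


# Assumed values for illustration only.
GENOME_SIZE_BP = 3_100_000_000   # approximate human genome size
EXOME_TARGET_BP = 40_000_000     # hypothetical exome capture target size
READ_LENGTH_BP = 150             # hypothetical read length

wgs_reads = reads_needed(30, GENOME_SIZE_BP, READ_LENGTH_BP)
exome_reads = reads_needed(100, EXOME_TARGET_BP, READ_LENGTH_BP)

print(f"WGS at 30x:    {wgs_reads:,} reads")
print(f"Exome at 100x: {exome_reads:,} reads")
```

Under these assumptions, the whole genome at 30× needs roughly 23 times as many reads as the exome at 100×.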

Gene panels

Single- or multi-gene panels are increasingly cost-effective means of testing for subsets of genes associated with specific clinical phenotypes.

Germline versus Somatic Variant Calling

Variant calling can be broadly divided into two groups: germline and somatic.

Germline variant calling refers to the process of calling variants that are ubiquitous across the organism (i.e. almost all cells carry these variants). These are the types of variants that can be passed through the germline. Studies that evaluate population genetics are often concerned with germline variant calling.

Somatic variant calling refers to the process of calling variants that differ between cells within a single organism. These variants are not passed through the germline. Somatic variant calling is often used when studying the progression of various cancers.

Example of a germline variant on the left, and somatic variant on the right.

These two types of variant calling methods make different assumptions about the input data and thus are handled differently. For example, germline variant calling for the most part expects at most two alleles at roughly equal frequencies, while a single tumor sample could contain several cancer lineages with a range of allele frequencies. This makes somatic variant calling more difficult than germline variant calling, because true low-frequency variants are difficult to distinguish from sequencing errors and artifacts. Additionally, in somatic variant calling you are often also trying to avoid calling the germline variants.
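The difference in assumptions can be illustrated with a toy binomial calculation. The sketch below is not how any real variant caller works; it simply asks two questions about a site: how likely is it to see at least this many alternate reads from sequencing error alone, and how likely is it to see this few alternate reads if the site were a germline heterozygote (expected allele fraction 0.5)? The function names and the 1% per-base error rate are assumptions made for illustration.

```python
from math import comb


def binom_pmf(k, n, p):
    """Probability of exactly k successes in n trials with success rate p."""
    return comb(n, k) * p**k * (1 - p) ** (n - k)


def prob_at_least(k, n, p):
    """P(X >= k) for X ~ Binomial(n, p)."""
    return sum(binom_pmf(i, n, p) for i in range(k, n + 1))


def prob_at_most(k, n, p):
    """P(X <= k) for X ~ Binomial(n, p)."""
    return sum(binom_pmf(i, n, p) for i in range(0, k + 1))


def summarize_site(alt_reads, depth, error_rate=0.01):
    """Summarize a site; error_rate is an assumed per-read error probability."""
    return {
        "vaf": alt_reads / depth,  # variant allele fraction
        "p_error_only": prob_at_least(alt_reads, depth, error_rate),
        "p_het_this_low": prob_at_most(alt_reads, depth, 0.5),
    }


# Hypothetical site: 3 of 30 reads carry the alternate allele.
site = summarize_site(alt_reads=3, depth=30)
for key, value in site.items():
    print(f"{key}: {value:.3g}")
```

For this hypothetical site, both probabilities are small: three alternate reads are more than the assumed error rate readily explains, yet far fewer than a heterozygous germline variant would produce. That is the kind of low-frequency signal a somatic caller has to evaluate, and at lower depth the two explanations become much harder to separate.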

Variant Calling Workflow

Depending on the sequencing strategy and whether we are looking at germline or somatic variants, the workflow will need to be modified. Typically, there will be some similarities in the processing of the raw FASTQ data through to aligned BAM files. The variant calling itself will vary based on study design, and we can address that later in this workshop.

Analyzing variants takes the form of three main steps:

  1. Data preparation
  2. Variant calling
  3. Variant evaluation

The detailed workflow for variant analysis that many researchers follow looks like:

Sequence Samples - This is the work of your sequencing facility and they should provide you with FASTQ files.

Read Quality Control - It is important to assess the quality of your reads before you further analyze them. In this step, we are looking to ensure that there aren’t major errors that occurred in the sequencing and that the quality of the reads is high.

Align Reads - We need to match our raw FASTQ reads to the place in the genome where they most likely originated from.

Processing Alignment Files - While our reads have now been assigned to a place in the genome, the files holding this information are not yet in the format that we need in order to call variants. We will reformat the alignment files so that they are usable for variant calling.

Alignment Quality Control - This is a second opportunity to evaluate the quality of our data. Here we are ensuring that the data aligns well with our genome.

Call Variants - Now we are able to call the raw variants from our data.

Filter Variants - Our raw variant calls may contain artifacts, which need to be filtered out.

Annotate Variants - We would like to know where our variants fall within the context of current gene models. We will use existing gene models to predict the impact of each variant, asking questions such as: does this variant create a premature stop codon, or is it in an intergenic region?

Prioritize Variants - Assess the annotated variants based upon their predicted functional impacts and narrow the search for important genes that may be causing a disease or trait.
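The filtering and annotation steps above can be sketched on toy data as follows. Everything here is hypothetical: the `Variant` record, the quality and depth thresholds, and the two gene intervals are made up for illustration, and the annotation only reports which gene interval (if any) a variant falls in. Real annotation tools use full gene models to predict consequences such as premature stop codons.

```python
from dataclasses import dataclass


@dataclass
class Variant:
    chrom: str
    pos: int      # 1-based position
    ref: str
    alt: str
    qual: float   # call quality score
    depth: int    # number of reads covering the position


# Hypothetical gene intervals: (chrom, start, end, name), 1-based inclusive.
GENES = [
    ("chr1", 1000, 2000, "GENE_A"),
    ("chr1", 5000, 6000, "GENE_B"),
]


def passes_filters(variant, min_qual=30.0, min_depth=10):
    """Keep calls with adequate quality and depth (thresholds are assumptions)."""
    return variant.qual >= min_qual and variant.depth >= min_depth


def annotate(variant, genes=GENES):
    """Return the name of the overlapping gene, or 'intergenic' if none overlaps."""
    for chrom, start, end, name in genes:
        if variant.chrom == chrom and start <= variant.pos <= end:
            return name
    return "intergenic"


# Made-up raw calls: the third fails on quality, the fourth on depth.
raw_calls = [
    Variant("chr1", 1500, "A", "G", qual=50.0, depth=35),
    Variant("chr1", 3000, "C", "T", qual=45.0, depth=28),
    Variant("chr1", 5500, "G", "A", qual=12.0, depth=40),
    Variant("chr1", 1800, "T", "C", qual=60.0, depth=4),
]

filtered = [v for v in raw_calls if passes_filters(v)]
annotated = [(v.chrom, v.pos, annotate(v)) for v in filtered]

for row in annotated:
    print(row)
```

Of the four raw calls, two survive filtering: one inside the hypothetical GENE_A and one in an intergenic region. Prioritization would then focus on the variants whose annotations suggest a functional impact.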


Use the figure above to try to answer the following questions:

  1. If we assume there are no sequencing errors, are you more inclined to speculate that Locus 1 is a germline or somatic variant? Why?
  2. Given the existence of sequencing errors, how confident are you that Locus 1 represents a heterozygous locus in the germline?
  3. Given the existence of sequencing errors, how confident are you that Locus 1 represents a polymorphic locus in a somatic tissue?
  4. How confident are you that Locus 2 is homozygous?
  5. What additional information might you want in order to better assess these loci?

This lesson has been developed by members of the teaching team at the Harvard Chan Bioinformatics Core (HBC). These are open access materials distributed under the terms of the Creative Commons Attribution license (CC BY 4.0), which permits unrestricted use, distribution, and reproduction in any medium, provided the original author and source are credited.