
Accessing genomic reference data

During an NGS experiment, the nucleotide sequences stored inside the raw FASTQ files, or “sequence reads”, need to be mapped or aligned to the reference genome to determine where these sequences originated. Therefore, we need a reference genome (in FASTA format) against which to align our sequences.

In addition, many NGS methods require knowing where known genes or exons are located on the genome in order to quantify the number of reads aligning to different genome features, such as exons, introns, transcription start sites, etc. These analyses require reference data containing specific information about genomic coordinates of various genomic “features”, such as gene annotation files (in GTF, GFF, etc.).

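To download reference data, there are a few different sources available: general biological databases (such as Ensembl, NCBI, and UCSC), organism-specific databases (such as WormBase), reference data collections (Illumina’s iGenomes), and shared copies stored locally on a cluster.

Note: these reference data sources are relevant to most types of genomic analyses, not just NGS analyses.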

General biological databases

Biological databases for gene expression data store genome assemblies and provide annotations regarding where the genes, transcripts, and other genomic features are located on the genome.

Genome assemblies give us the nucleotide sequence of the reference genome. Although the Human Genome Project was “completed” in 2003, small gaps in the sequence remained (estimated 1% of gene-containing portions). As technology improves and more genomes are sequenced, these gaps are filled, mistakes are corrected and alternate alleles are provided. Therefore, every several years a new genome build is released that contains these improvements.

The current human genome build is GRCh38/hg38, which was released in 2013 and is maintained by the Genome Reference Consortium (GRC). The biological databases usually include updated versions as soon as they are stably released, and also provide access to archived versions.

Genome databases incorporate these genomes and generate the gene annotations with the following similarities/differences:

Ensembl

Ensembl provides a website that acts as a single point of access to annotated genomes for vertebrate species. For all other organisms there are additional Ensembl databases available through Ensembl Genomes; however, they do not include viruses (NCBI does).

Using the Ensembl genomic database and genome browser

Navigate to the Ensembl website to view the interface. The Ensembl homepage has a lot to offer, with plenty of information and access to a range of functionality and tools.

While we are not going to explore the Ensembl database in this workshop, we have materials available if you wish to explore on your own.

Ensembl identifiers

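When using Ensembl, note that it uses the following format for biological identifiers:

ENSG###########: Ensembl gene ID
ENST###########: Ensembl transcript ID
ENSP###########: Ensembl protein (peptide) ID
ENSE###########: Ensembl exon ID

For non-human species a species code is added after the ENS prefix:

MUS (Mus musculus) for mouse: ENSMUSG###########
DAR (Danio rerio) for zebrafish: ENSDARG###########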
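As a rough illustration of how the identifier format can be read, the sketch below classifies an Ensembl stable ID by its prefix. The function name `ensembl_id_type` is hypothetical, and the pattern assumes the common layout of `ENS`, an optional species code, one letter for the feature type (G, T, P or E), and 11 digits; other feature types are not handled.

```shell
#!/bin/bash
# Classify an Ensembl stable ID by its prefix.
# Assumption: IDs look like ENS[species code][G|T|P|E] followed by 11 digits,
# optionally with a ".version" suffix. Other feature types are not handled.
ensembl_id_type() {
    local id="${1%%.*}"    # drop an optional version suffix such as ".4"
    local re='^ENS([A-Z]*)([GTPE])[0-9]{11}$'
    if [[ "$id" =~ $re ]]; then
        # An empty species code means a human identifier
        local species="${BASH_REMATCH[1]:-human}"
        case "${BASH_REMATCH[2]}" in
            G) echo "gene $species" ;;
            T) echo "transcript $species" ;;
            P) echo "protein $species" ;;
            E) echo "exon $species" ;;
        esac
    else
        echo "unrecognized"
        return 1
    fi
}

ensembl_id_type ENSG00000139618      # a human gene ID
ensembl_id_type ENSMUSG00000017167   # a mouse gene ID (species code MUS)
```

The function prints the feature type followed by either `human` or the species code found in the identifier.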

Finding and accessing reference data on Ensembl

The interface for downloading reference data from Ensembl is straightforward. On the home page, you can click on Downloads.

Then click on the section to Download a sequence or region.

In the ‘Export Data’ window, click on the link for the FTP site.

Finally, right-click on the link to the reference genome (DNA FASTA), reference transcriptome (cDNA FASTA), gene annotation file (Gene sets, GTF or GFF), or other required reference data to download. Copy the link address.

Now, on an HPC environment (O2 or FAS-RC cluster) you could use the wget command to download the reference data:

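## DO NOT RUN
## Example file: the human primary assembly; substitute the link address you copied
$ wget ftp://ftp.ensembl.org/pub/release-92/fasta/homo_sapiens/dna/Homo_sapiens.GRCh38.dna.primary_assembly.fa.gz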

This would take a really long time for large genomes, so instead, you would probably want to submit a batch job using a script similar to the one below:

#!/bin/bash

#SBATCH -p short 	# partition name (small partition on O2)
#SBATCH -t 0-6:00 	# days-hours:minutes runlimit after which job will be killed
#SBATCH -c 1 		# number of cores requested 
#SBATCH -o %J.out	# File to which standard out will be written
#SBATCH -e %J.err 	# File to which standard err will be written

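# Example file: the human primary assembly; substitute the link address you copied
wget ftp://ftp.ensembl.org/pub/release-92/fasta/homo_sapiens/dna/Homo_sapiens.GRCh38.dna.primary_assembly.fa.gz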

NOTE: If you need an archived version of the genome, go to the Ensembl home page for the organism of interest and click on the View in archive site link in the lower right-hand corner of the page. Then navigate as described above.

To run the script, you would use the following sbatch command:

## DO NOT RUN
$ sbatch name_of_script.slurm
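Once the job has finished, it can be worth checking that the download is complete before using it. The sketch below is one possible check, not part of the original workflow: `check_fasta_gz` is a hypothetical helper that tests the gzip file for corruption and counts the FASTA records in it. The file name in the usage comment is only an example.

```shell
#!/bin/bash
# Sanity-check a downloaded gzip-compressed FASTA file.
# check_fasta_gz is a hypothetical helper name, not part of any tool.
check_fasta_gz() {
    local file="$1"
    # gzip -t tests the integrity of the compressed file without unpacking it
    if ! gzip -t "$file" 2>/dev/null; then
        echo "corrupt or incomplete: $file" >&2
        return 1
    fi
    # Each FASTA record starts with a '>' header line, so count those lines
    local n
    n="$(gzip -dc "$file" | grep -c '^>')"
    echo "$file: $n sequences"
}

# Usage (the file name is an example):
# check_fasta_gz Homo_sapiens.GRCh38.dna.primary_assembly.fa.gz
```

A truncated download usually fails the `gzip -t` test, in which case the file should be downloaded again.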

Genomic reference data could be downloaded similarly by FTP from the NCBI FTP (or through Aspera) or the UCSC FTP.

Reference data collections: iGenomes

If you are working with a commonly analyzed organism, Illumina’s iGenomes simplifies the process of downloading reference data. The iGenomes website offers reference data from Ensembl, UCSC and NCBI for various genome builds. Each download is a compressed archive containing the matching reference genome (FASTA) and gene annotation (GTF/GFF) files.

To download from iGenomes, we can right-click and copy the link to the file and use our wget command.

#!/bin/bash

#SBATCH -p shared 	# partition name (adjust to your cluster, e.g. short on O2)
#SBATCH -t 0-6:00 	# days-hours:minutes runlimit after which job will be killed
#SBATCH -c 1 		# number of cores requested
#SBATCH -o %J.out	# File to which standard out will be written
#SBATCH -e %J.err 	# File to which standard err will be written

wget ftp://igenome:G3nom3s4u@ussd-ftp.illumina.com/Homo_sapiens/NCBI/GRCh38/Homo_sapiens_NCBI_GRCh38.tar.gz

After running the script with the sbatch command, you could use the tar command to unpack the downloaded archive.

$ tar -xzf Homo_sapiens_NCBI_GRCh38.tar.gz

If desired you could just run the tar command after the wget command in the above script to automatically unpack the reference data.
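A combined script might look like the sketch below. It is one possible arrangement rather than the lesson's own script: the helper name `download_and_unpack` is hypothetical, the SBATCH settings are copied from the earlier examples and may need adjusting, and the URL is passed as an argument instead of being written into the script.

```shell
#!/bin/bash

#SBATCH -p short        # partition name (from the earlier examples; adjust as needed)
#SBATCH -t 0-6:00       # days-hours:minutes run limit
#SBATCH -c 1            # number of cores requested
#SBATCH -o %J.out       # file for standard out
#SBATCH -e %J.err       # file for standard err

# Download an archive and unpack it only if the download succeeded.
download_and_unpack() {
    local url="$1"
    local archive
    archive="$(basename "$url")"    # e.g. Homo_sapiens_NCBI_GRCh38.tar.gz
    wget "$url" && tar -xzf "$archive"
}

# Run only when a URL is passed to the script
if [[ $# -gt 0 ]]; then
    download_and_unpack "$1"
fi
```

Assuming the script is saved as download_reference.slurm (a made-up name), it could be submitted with `sbatch download_reference.slurm link_to_FTP`, since sbatch passes any arguments after the script name on to the script. Using `&&` means tar only runs when wget exits successfully.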

Local access via FAS-RC cluster or O2

Downloading the reference data from a biological database or iGenomes might not be necessary, since the Harvard FAS-RC and O2 clusters make shared reference data downloaded from iGenomes available to their users.

On O2, the shared data is located at /n/groups/shared_databases/igenome/. Rather than using storage space inside your own folder, give the path to the reference data in these shared databases.

Let’s explore what’s available within the igenome folder and how to find the reference sequence and gene annotation files.

$ cd /n/groups/shared_databases/igenome/
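One way to locate the sequence and annotation files from the command line is sketched below. It assumes the usual iGenomes naming, where the reference sequence is called genome.fa and the gene annotation genes.gtf, and that organisms sit in folders such as Homo_sapiens. Both are assumptions about the local copy, and `find_reference_files` is a hypothetical helper name.

```shell
#!/bin/bash
# List reference FASTA and GTF files below a shared iGenomes directory.
# Assumption: files follow the usual iGenomes naming (genome.fa, genes.gtf);
# adjust the names if the local copy is organized differently.
find_reference_files() {
    local root="$1"
    find "$root" -type f \( -name 'genome.fa' -o -name 'genes.gtf' \) | sort
}

IGENOME_ROOT=/n/groups/shared_databases/igenome    # O2 path given in the text
if [[ -d "$IGENOME_ROOT" ]]; then
    ls "$IGENOME_ROOT"                                   # organisms available
    find_reference_files "$IGENOME_ROOT/Homo_sapiens"    # assumed folder name
fi
```

The printed paths can then be given directly to alignment or counting tools instead of copying the files into your own folder.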

Organism-specific databases

Although the general genomic databases update the genome builds and annotations for all species, organism-specific databases often update the genome patches and gene annotations more frequently. In addition, these databases also offer genomes for other species that may not be present in the general databases. Additional tools and information regarding these organisms are also accessible.

WormBase

As an example, we will explore WormBase ParaSite, which is devoted to the study of C. elegans and other nematodes, as well as other helminths. This site is closely linked with WormBase and incorporates the information for C. elegans and other nematodes from that repository.

On the homepage, there is direct access to WormBase, links to all species genomes available, tools, and news.

Finding and accessing reference data on WormBase

Downloading reference data from WormBase ParaSite is intuitive and simple. All that is needed is to click on the Downloads tab.

This will take you to the FTP site, where you can right-click to copy the link address of the reference data of interest.

Then, similar to the other methods, the wget command can be used to download to the cluster.

#!/bin/bash

#SBATCH -p short 	# partition name (small partition on O2)
#SBATCH -t 0-6:00 	# days-hours:minutes runlimit after which job will be killed
#SBATCH -c 1 		# number of cores requested 
#SBATCH -o %J.out	# File to which standard out will be written
#SBATCH -e %J.err 	# File to which standard err will be written

wget ftp://ftp.ebi.ac.uk/pub/databases/wormbase/parasite/releases/WBPS10/species/acanthocheilonema_viteae/PRJEB4306/acanthocheilonema_viteae.PRJEB4306.WBPS10.genomic.fa.gz

In general, whether downloading data from a general database, an organism-specific database, or an iGenomes collection, the download should be straightforward using the wget command and the link to the corresponding file on the FTP site.

$ wget link_to_FTP

This lesson has been developed by members of the teaching team at the Harvard Chan Bioinformatics Core (HBC). These are open access materials distributed under the terms of the Creative Commons Attribution license (CC BY 4.0), which permits unrestricted use, distribution, and reproduction in any medium, provided the original author and source are credited.