# Load libraries
library(AnnotationHub)
library(ensembldb)
# Connect to AnnotationHub
<- AnnotationHub() ah
Genomic annotations
Approximate time: 30 minutes
Learning Objectives:
- Discuss the available genomic annotation databases and the different types if information stored
- Compare and contrast the tools available for accessing genomic annotation databases
- Apply various R packages for retrieval of genomic annotations
Genomic annotations
The analysis of next-generation sequencing results requires associating genes, transcripts, proteins, etc. with functional or regulatory information. To perform functional analysis on gene lists, we often need to obtain gene identifiers that are compatible with the tools we wish to use and this is not always trivial. Here, we discuss ways in which you can obtain gene annotation information and some of the advantages and disadvantages of each method.
Databases
We retrieve information on the processes, pathways, etc. (for which a gene is involved in) from the necessary database where the information is stored. The database you choose will be dependent on what type of information you are trying to obtain. Examples of databases that are often queried include:
General databases
Offer comprehensive information on genome features, feature coordinates, homology, variant information, phenotypes, protein domain/family information, associated biological processes/pathways, associated microRNAs, etc.:
- Ensembl (use Ensembl gene IDs)
- NCBI (use Entrez gene IDs)
- UCSC
- EMBL-EBI
Annotation-specific databases
Provide annotations related to a specific topic:
- Gene Ontology (GO): database of gene ontology biological processes, cellular components and molecular functions - based on Ensembl or Entrez gene IDs or official gene symbols
- KEGG: database of biological pathways - based on Entrez gene IDs
- MSigDB: database of gene sets
- Reactome: database of biological pathways
- Human Phenotype Ontology: database of genes associated with human disease
- CORUM: database of protein complexes for human, mouse, rat
- …
This is by no means an exhaustive list, there are many other databases available that are not listed here.
Genome builds
Before you begin your search through any of these databases, you should know which build of the genome was used to generate your gene list and make sure you use the same build for the annotations during functional analysis. When a new genome build is acquired, the names and/or coordinate location of genomic features (gene, transcript, exon, etc.) may change. Therefore, the annotations regarding genome features (gene, transcript, exon, etc.) is genome build-specific and we need to make sure that our annotations are obtained from the appropriate resource.
For example, if we used the GRCh38 build of the human genome to quantify gene expression used for differential expression analysis, then we should use the same GRCh38 build of the genome to convert between gene IDs and to identify annotations for each of the genes.
Tools for accessing databases
Within R, there are many popular packages used for gene/transcript-level annotation. These packages provide tools that take the list of genes you provide and retrieve information for each gene using one or more of the databases listed above.
Annotation tools: for accessing/querying annotations from a specific databases
Tool | Description | Pros | Cons |
---|---|---|---|
org.Xx.eg.db | Query gene feature information for the organism of interest | gene ID conversion, biotype and coordinate information | only latest genome build available |
EnsDb.Xx.vxx | Transcript and gene-level information directly fetched from Ensembl API (similar to TxDb, but with filtering ability and versioned by Ensembl release) | easy functions to extract features, direct filtering | Not the most up-to-date annotations, more difficult to use than some packages |
TxDb.Xx.UCSC.hgxx.knownGene | UCSC database for transcript and gene-level information or can create own TxDb from an SQLite database file using the GenomicFeatures package | feature information, easy functions to extract features | only available current and recent genome builds - can create your own, less up-to-date with annotations than Ensembl |
annotables | Gene-level feature information immediately available for the human and model organisms | super quick and easy gene ID conversion, biotype and coordinate information | static resource, not updated regularly |
biomaRt | An R package version of the Ensembl BioMart online tool | all Ensembl database information available, all organisms on Ensembl, wealth of information |
Interface tools: for accessing/querying annotations from multiple different annotation sources
- AnnotationDbi: queries the OrgDb, TxDb, Go.db, EnsDb, and BioMart annotations.
- AnnotationHub: queries large collection of whole genome resources, including ENSEMBL, UCSC, ENCODE, Broad Institute, KEGG, NIH Pathway Interaction Database, etc.
These are both packages that can be used to create the tx2gene
files we had you download at the beginning of this workshop.
AnnotationDbi
AnnotationDbi is an R package that provides an interface for connecting and querying various annotation databases using SQLite data storage. The AnnotationDbi packages can query the OrgDb, TxDb, EnsDb, Go.db, and BioMart annotations. There is helpful documentation available to reference when extracting data from any of these databases.
While AnnotationDbi is a popular tool, we will not be walking through code to use this package. However, if you are interested in more detail, we have materials linked here with examples using our current dataset.
AnnotationHub
AnnotationHub is a wonderful resource for accessing genomic data or querying large collection of whole genome resources, including ENSEMBL, UCSC, ENCODE, Broad Institute, KEGG, NIH Pathway Interaction Database, etc. All of this information is stored and easily accessible by directly connecting to the database.
To get started with AnnotationHub, we first load the library and connect to the database:
A cache is used in R to store data or a copy of the data so that future requests can be served faster without having to re-run a lengthy computation.
The AnnotationHub()
command creates a client that manages a local cache of the database, helping with quick and reproducible access. When encountering question AnnotationHub does not exist, create directory?
, you can answer either yes
(create a permanent location to store cache) or no
(create a temporary location to store cache). hubCache(ah)
gets the file system location of the local AnnotationHub cache. hubUrl(ah)
gets the URL for the online hub.
To see the types of information stored inside our database, we can just type the name of the object. Using the output, you can get an idea of the information that you can query within the AnnotationHub object:
# Explore the AnnotationHub object
ah
AnnotationHub with 72100 records
# snapshotDate(): 2024-10-28
# $dataprovider: Ensembl, BroadInstitute, UCSC, ftp://ftp.ncbi.nlm.nih.gov/g...
# $species: Homo sapiens, Mus musculus, Drosophila melanogaster, Rattus norv...
# $rdataclass: GRanges, TwoBitFile, BigWigFile, EnsDb, Rle, OrgDb, SQLiteFil...
# additional mcols(): taxonomyid, genome, description,
# coordinate_1_based, maintainer, rdatadateadded, preparerclass, tags,
# rdatapath, sourceurl, sourcetype
# retrieve records with, e.g., 'object[["AH5012"]]'
title
AH5012 | Chromosome Band
AH5013 | STS Markers
AH5014 | FISH Clones
AH5015 | Recomb Rate
AH5016 | ENCODE Pilot
... ...
AH119506 | Ensembl 113 EnsDb for Zonotrichia albicollis
AH119507 | Ensembl 113 EnsDb for Zalophus californianus
AH119508 | Ensembl 113 EnsDb for Zosterops lateralis melanops
AH119518 | MassBank CompDb for release 2024.06
AH119519 | MassBank CompDb for release 2024.11
Notice the note on retrieving records with object[[ID]]
- this will be how we can extract a single record from the AnnotationHub object.
If you would like to see more information about any of the classes of data, you can extract that information as well. For example, if you wanted to determine all species information available, you could explore that within the AnnotationHub object:
# Explore all species information available
unique(ah$species) %>% head()
[1] "Homo sapiens" "Vicugna pacos" "Dasypus novemcinctus"
[4] "Otolemur garnettii" "Papio hamadryas" "Papio anubis"
In addition to species information, there is also additional information about the type of Data Objects and the Data Providers:
# Explore the types of Data Objects available
unique(ah$rdataclass) %>% head()
[1] "GRanges" "data.frame" "Inparanoid8Db" "TwoBitFile"
[5] "ChainFile" "SQLiteConnection"
# Explore the Data Providers
unique(ah$dataprovider) %>% head()
[1] "UCSC" "Ensembl" "RefNet" "Inparanoid8" "NHLBI"
[6] "ChEA"
Now that we know the types of information available from AnnotationHub, we can query it for the information we want using the query()
function. Let’s say we would like to return the Ensembl EnsDb
information for Human. To return the records available, we need to use the terms as they are output from the ah
object to extract the desired data.
# Query AnnotationHub for Human references
<- query(ah, c("Homo sapiens", "EnsDb"))
human_ens
# See what annotations are available
human_ens
AnnotationHub with 28 records
# snapshotDate(): 2024-10-28
# $dataprovider: Ensembl
# $species: Homo sapiens
# $rdataclass: EnsDb
# additional mcols(): taxonomyid, genome, description,
# coordinate_1_based, maintainer, rdatadateadded, preparerclass, tags,
# rdatapath, sourceurl, sourcetype
# retrieve records with, e.g., 'object[["AH53211"]]'
title
AH53211 | Ensembl 87 EnsDb for Homo Sapiens
AH53715 | Ensembl 88 EnsDb for Homo Sapiens
AH56681 | Ensembl 89 EnsDb for Homo Sapiens
AH57757 | Ensembl 90 EnsDb for Homo Sapiens
AH60773 | Ensembl 91 EnsDb for Homo Sapiens
... ...
AH109606 | Ensembl 109 EnsDb for Homo sapiens
AH113665 | Ensembl 110 EnsDb for Homo sapiens
AH116291 | Ensembl 111 EnsDb for Homo sapiens
AH116860 | Ensembl 112 EnsDb for Homo sapiens
AH119325 | Ensembl 113 EnsDb for Homo sapiens
The query retrieves all hits for the EnsDb
objects, and you will see that they are listed by the release number. The most current release for GRCh38 is Ensembl 113 and AnnotationHub offers that as an option to use. However, if you look at options for older releases, for Homo sapiens it only go back as far as Ensembl 87. This is fine if you are using GRCh38; however, if you were using an older genome build like hg19/GRCh37, you would need to load the EnsDb
package if available for that release or you might need to build your own with ensembldb
.
In our case, we are looking for the latest Ensembl release so that the annotations are the most up-to-date. To extract this information from AnnotationHub, we can use the AnnotationHub ID to subset the object:
# Extract annotations of interest
<- human_ens[["AH119325"]] human_ens
Now we can use ensembldb
functions to extract the information at the gene, transcript, or exon levels. We are interested in the gene-level annotations, so we can extract that information as follows:
# Extract gene-level information
genes(human_ens, return.type = "data.frame") %>% head()
gene_id gene_name gene_biotype gene_seq_start
1 ENSG00000290825 DDX11L16 lncRNA 11121
3 ENSG00000223972 DDX11L1 transcribed_unprocessed_pseudogene 12010
4 ENSG00000310526 WASH7P lncRNA 14356
5 ENSG00000227232 WASH7P transcribed_unprocessed_pseudogene 14696
6 ENSG00000278267 MIR6859-1 miRNA 17369
7 ENSG00000243485 MIR1302-2HG lncRNA 28589
gene_seq_end seq_name seq_strand seq_coord_system
1 24894 1 1 chromosome
3 13670 1 1 chromosome
4 30744 1 -1 chromosome
5 24886 1 -1 chromosome
6 17436 1 -1 chromosome
7 31109 1 1 chromosome
description
1 DEAD/H-box helicase 11 like 16 (pseudogene) [Source:NCBI gene (formerly Entrezgene);Acc:727856]
3 DEAD/H-box helicase 11 like 1 (pseudogene) [Source:HGNC Symbol;Acc:HGNC:37102]
4 WASP family homolog 7, pseudogene [Source:HGNC Symbol;Acc:HGNC:38034]
5 WASP family homolog 7, pseudogene [Source:HGNC Symbol;Acc:HGNC:38034]
6 microRNA 6859-1 [Source:HGNC Symbol;Acc:HGNC:50039]
7 MIR1302-2 host gene [Source:HGNC Symbol;Acc:HGNC:52482]
gene_id_version canonical_transcript symbol entrezid
1 ENSG00000290825.2 ENST00000832823 DDX11L16 727856, ....
3 ENSG00000223972.6 ENST00000450305 DDX11L1 NA
4 ENSG00000310526.1 ENST00000831140 WASH7P 653635
5 ENSG00000227232.6 ENST00000488147 WASH7P NA
6 ENSG00000278267.1 ENST00000619216 MIR6859-1 102466751
7 ENSG00000243485.6 ENST00000834618 MIR1302-2HG NA
But note that it is just as easy to get the transcript- or exon-level information:
# Extract transcript-level information
transcripts(human_ens, return.type = "data.frame") %>% head()
tx_id tx_biotype tx_seq_start tx_seq_end tx_cds_seq_start
1 ENST00000620701 snRNA 1 107 NA
2 ENST00000634102 protein_coding 1 58864 5632
3 ENST00000630811 lncRNA 1 7125 NA
4 ENST00000622103 snRNA 12 118 NA
5 ENST00000612589 snRNA 12 118 NA
6 ENST00000610987 snRNA 12 118 NA
tx_cds_seq_end gene_id tx_support_level tx_id_version gc_content
1 NA ENSG00000275782 NA ENST00000620701.1 42.99065
2 57902 ENSG00000278550 1 ENST00000634102.1 59.15209
3 NA ENSG00000280592 2 ENST00000630811.1 55.94758
4 NA ENSG00000275021 NA ENST00000622103.1 42.99065
5 NA ENSG00000275168 NA ENST00000612589.1 42.99065
6 NA ENSG00000276552 NA ENST00000610987.1 42.99065
tx_external_name tx_is_canonical tx_name
1 U6.55-201 1 ENST00000620701
2 SLC43A2-224 0 ENST00000634102
3 LINC01624-216 1 ENST00000630811
4 U6.47-201 1 ENST00000622103
5 U6.49-201 1 ENST00000612589
6 U6.61-201 1 ENST00000610987
# Extract exon-level information
exons(human_ens, return.type = "data.frame") %>% head()
exon_id exon_seq_start exon_seq_end
1 ENSE00003754250 1 107
2 ENSE00003766135 1 2668
3 ENSE00003783572 1 5793
4 ENSE00003716605 12 118
5 ENSE00003723881 12 118
6 ENSE00003725167 12 118
To obtain an annotation data frame using AnnotationHub, we’ll use the genes()
function, but only keep selected columns and filter out rows to keep those corresponding to our gene identifiers in our results file:
# Create a gene-level dataframe
<- genes(human_ens, return.type = "data.frame") %>%
annotations_ahb # Choose which columns to keep
::select(gene_id, gene_name, entrezid, gene_biotype) %>%
dplyr# Choose which rows to keep (only genes that are in our overexpression results)
::filter(gene_id %in% res_tableOE_tb$gene) dplyr
This dataframe looks like it should be fine as it is, but we look a little closer we will notice that the column containing Entrez identifiers is a list, and in fact there are many Ensembl identifiers that map to more than one Entrez identifier!
# Wait a second, we don't have one-to-one mappings!
class(annotations_ahb$entrezid)
[1] "list"
which(map(annotations_ahb$entrezid, length) > 1) %>% head()
ENSG00000186092 ENSG00000239149 ENSG00000229571 ENSG00000206652 ENSG00000158747
9 407 432 529 604
ENSG00000235200
1835
So what do we do here? And why do we have this problem? An answer from the Ensembl Help Desk is that this occurs when we cannot choose a perfect match; i.e., when we have two good matches, but one does not appear to match with a better percentage than the other. In that case, we assign both matches. What we will do is choose to keep the first identifier for these multiple mapping cases.
# Only keep the first identifier in each list
$entrezid <- map(annotations_ahb$entrezid, 1) %>% unlist() annotations_ahb
Not all databases handle multiple mappings in the same way. For example, if we used the OrgDb instead of the EnsDb:
# Pull annotations from OrgDb
<- query(ah, c("org.Hs.eg.db.sqlite"))
human_orgdb <- human_orgdb[["AH116710"]]
human_orgdb <- AnnotationDbi::select(human_orgdb, res_tableOE_tb$gene, c("SYMBOL", "GENENAME", "ENTREZID"), "ENSEMBL") annotations_orgdb
We would find that multiple mapping entries would be automatically reduced to one-to-one. We would also find that more than half of the input genes do not return any annotations. This is because the OrgDb family of database are primarily based on mapping using Entrez Gene identifiers. Since our data is based on Ensembl mappings, using the OrgDb would result in a loss of information.
Let’s take a look and see how many of our Ensembl identifiers have an associated gene symbol, and how many of them are unique:
# Total Ensembl IDs with associated gene symbols
which(!is.na(annotations_ahb$gene_name)) %>% length()
[1] 56427
# Duplicated gene symbols
which(duplicated(annotations_ahb$gene_name)) %>% length()
[1] 15814
Let’s identify the non-duplicated genes and only keep the ones that are not duplicated:
# Determine the indices for the non-duplicated genes
<- which(duplicated(annotations_ahb$gene_name) == FALSE)
non_duplicates_idx
# How many rows does annotations_ahb have?
%>% nrow() annotations_ahb
[1] 56427
# Return only the non-duplicated genes using indices
<- annotations_ahb[non_duplicates_idx, ]
annotations_ahb
# How many rows are we left with after removing?
%>% nrow() annotations_ahb
[1] 40613
Finally, it would be good to know what proportion of the Ensembl identifiers map to an Entrez identifier:
# Determine how many of the Entrez column entries are NA
which(is.na(annotations_ahb$entrezid)) %>% length()
[1] 15969
That’s almost half of our genes! If we plan on using Entrez ID results for downstream analysis, we should definitely keep this in mind. If you look at some of the Ensembl IDs from our query that returned NA
, these map to pseudogenes (e.g., ENSG00000265439) or non-coding RNAs (e.g., ENSG00000265425). The discrepancy (which we can expect to observe) between databases is due to the fact that each implements its own different computational approaches for generating the gene builds.
Using AnnotationHub to create our tx2gene file
To create our tx2gene
file, we would need to use a combination of the methods above and merge two dataframes together. For example:
## DO NOT RUN THIS CODE
# Create a transcript dataframe
<- transcripts(human_ens, return.type = "data.frame") %>%
txdb ::select(tx_id, gene_id)
dplyr<- txdb[grep("ENST", txdb$tx_id),]
txdb
# Create a gene-level dataframe
<- genes(human_ens, return.type = "data.frame") %>%
genedb ::select(gene_id, gene_name)
dplyr
# Merge the two dataframes together
<- inner_join(txdb, genedb) annotations
In this lesson our focus has been using annotation packages to extract information mainly just for gene ID conversion for the different tools that we use downstream. Many of the annotation packages we have presented have much more information than what we need for functional analysis, and we have only just scratched the surface here. It’s good to know the capabilities of the tools we use, so we encourage you to spend some time exploring these packages to become more familiar with them.
The annotables package is a super easy annotation package to use. It is not updated frequently, so it’s not great for getting the most up-to-date information for the current builds and does not have information for other organisms than human and mouse, but is a quick way to get annotation information.
# Install package
::install("annotables")
BiocManager
# Load library
library(annotables)
# Access previous build of annotations
grch38