AnnotationDbi is an R package that provides an interface for connecting and querying various annotation databases using SQLite data storage. The AnnotationDbi packages can query the OrgDb, TxDb, EnsDb, Go.db, and BioMart annotations. There is helpful documentation available to reference when extracting data from any of these databases.
org.Hs.eg.db
There are a plethora of organism-specific orgDb packages, such as org.Hs.eg.db for human and org.Mm.eg.db for mouse, and a list of organism databases can be found here. These databases are best for converting gene IDs or obtaining GO information for current genome builds, but not for older genome builds. These packages provide the current builds corresponding to the release date of the package, and update every 6 months. If a package is not available for your organism of interest, you can create your own using AnnotationHub.
We can see the metadata for the database by just typing the name of the database, including the species, last updates for the different source information, and the source urls. Note the KEGG data from this database was last updated in 2011, so may not be the best site for KEGG pathway information.
We can easily extract information from this database using AnnotationDbi with the methods: columns, keys, keytypes, and select. For example, we will use our org.Hs.eg.db database to acquire information, but know that the same methods work for the TxDb, Go.db, EnsDb, and BioMart annotations.
# Return the Ensembl IDs for a set of genesannotations_orgDb <- AnnotationDbi::select( org.Hs.eg.db, # databasekeys = res_tableOE_tb$gene, # data to use for retrievalcolumns =c("SYMBOL", "ENTREZID","GENENAME"), # information to retrieve for given datakeytype ="ENSEMBL") # type of data given in 'keys' argument
We started from at about 57K genes in our results table, and the dimensions of our resulting annotation data frame also look quite similar. Let’s take a peek to see if we actually returned annotations for each individual Ensembl gene ID that went in to the query:
# How many Ensembl IDs have NOT been linked to gene names (symbols)?length(which(is.na(annotations_orgDb$SYMBOL)))
[1] 22125
Looks like about half of the input genes did not return any annotations. This is because the OrgDb family of database are primarily based on mapping using Entrez Gene identifiers. If you look at some of the Ensembl IDs from our query that returned NA, these map to pseudogenes (e.g., ENSG00000265439) or non-coding RNAs (e.g., ENSG00000265425). The difference is due to the fact that each database implements different computational approaches for generating the gene builds. Let’s get rid of those NA entries:
# Determine the indices for the non-NA genesnon_na_idx <-which(is.na(annotations_orgDb$SYMBOL) ==FALSE)# Return only the genes with annotations using indicesannotations_orgDb <- annotations_orgDb[non_na_idx, ]
You may have also noted the warning returned: ‘select()’ returned 1:many mapping between keys and columns. This is always going to happen with converting between different gene IDs (i.e., one geneID can map to more than one identifier in another databse) . Unless we would like to keep multiple mappings for a single gene, then we probably want to de-duplicate our data before using it.
# Determine the indices for the non-duplicated genesnon_duplicates_idx <-which(duplicated(annotations_orgDb$SYMBOL) ==FALSE)# Return only the non-duplicated genes using indicesannotations_orgDb <- annotations_orgDb[non_duplicates_idx, ]
EnsDb.Hsapiens.v86
To generate the Ensembl annotations, the EnsDb database can also be easily queried using AnnotationDbi. You will need to decide the release of Ensembl you would like to query. We know that our data is for GRCh38, and the most current EnsDb release for GRCh38 in Bioconductor is release 86, so we can install this database. All Ensembl releases are listed here. NOTE: this is not the most current release of GRCh38 in the Ensembl database, but it’s as current as we can obtain through AnnotationDbi.
Since we are using AnnotationDbi to query the database, we can use the same functions that we used previously:
# Install the library# BiocManager::install("EnsDb.Hsapiens.v86")# Load the librarylibrary(EnsDb.Hsapiens.v86)# Check object metadataEnsDb.Hsapiens.v86
EnsDb for Ensembl:
|Backend: SQLite
|Db type: EnsDb
|Type of Gene ID: Ensembl Gene ID
|Supporting package: ensembldb
|Db created by: ensembldb package from Bioconductor
|script_version: 0.3.0
|Creation time: Thu May 18 16:32:27 2017
|ensembl_version: 86
|ensembl_host: localhost
|Organism: homo_sapiens
|taxonomy_id: 9606
|genome_build: GRCh38
|DBSCHEMAVERSION: 2.0
| No. of genes: 63970.
| No. of transcripts: 216741.
|Protein data available.
# Explore the fields that can be used as keyskeytypes(EnsDb.Hsapiens.v86)
# Return the Ensembl IDs for a set of genesannotations_edb <- AnnotationDbi::select( EnsDb.Hsapiens.v86, # databasekeys = res_tableOE_tb$gene, # data to use for retrievalcolumns =c("SYMBOL", "ENTREZID","GENEBIOTYPE"), # information to retrieve for given datakeytype ="GENEID") # type of data given in 'keys' argument
We can check for NA entries, and find that there are none:
length(which(is.na(annotations_edb$SYMBOL)))
[1] 0
Then we can again deduplicate, to remove the gene symbols which appear more than once:
# Determine the indices for the non-duplicated genesnon_duplicates_idx <-which(duplicated(annotations_edb$SYMBOL) ==FALSE)# Return only the non-duplicated genes using indicesannotations_edb <- annotations_edb[non_duplicates_idx, ]
Note
In this case we used the same build but a slightly older release, and we found little discrepancy. If your analysis was conducted using an older genome build (e.g., hg19), but used a newer build for annotation, some genes may be found to be not annotated (NA). Some of the genes have changed names in between versions (due to updates and patches), so may not be present in the newer version of the database.
Source Code
## AnnotationDbi```{r data_setup}#| echo: false# load libraries needed to render this lessonlibrary(tidyverse)# load objects needed to render this lessonres_tableOE_tb <- readRDS("../data/intermediate_res_tableOE_tb.RDS")```AnnotationDbi is an R package that provides an interface for connecting and querying various annotation databases using SQLite data storage. The AnnotationDbi packages can query the *OrgDb*, *TxDb*, *EnsDb*, *Go.db*, and *BioMart* annotations. There is helpful [documentation](https://bioconductor.org/packages/release/bioc/vignettes/AnnotationDbi/inst/doc/IntroToAnnotationPackages.pdf) available to reference when extracting data from any of these databases.### org.Hs.eg.dbThere are a plethora of organism-specific *orgDb* packages, such as `org.Hs.eg.db` for human and `org.Mm.eg.db` for mouse, and a list of organism databases can be found [here](https://www.bioconductor.org/packages/release/BiocViews.html#___OrgDb). These databases are best for converting gene IDs or obtaining GO information for current genome builds, but not for older genome builds. These packages provide the current builds corresponding to the release date of the package, and update every 6 months. If a package is not available for your organism of interest, you can create your own using *AnnotationHub*.```{r orgHs_view}# Load librarieslibrary(org.Hs.eg.db)library(AnnotationDbi)# Check object metadataorg.Hs.eg.db```We can see the metadata for the database by just typing the name of the database, including the species, last updates for the different source information, and the source urls. Note the KEGG data from this database was last updated in 2011, so may not be the best site for KEGG pathway information.We can easily extract information from this database using *AnnotationDbi* with the methods: `columns`, `keys`, `keytypes`, and `select`. For example, we will use our `org.Hs.eg.db` database to acquire information, but know that the same methods work for the *TxDb*, *Go.db*, *EnsDb*, and *BioMart* annotations.```{r orgHs_query}# Return the Ensembl IDs for a set of genesannotations_orgDb <- AnnotationDbi::select( org.Hs.eg.db, # database keys = res_tableOE_tb$gene, # data to use for retrieval columns = c("SYMBOL", "ENTREZID","GENENAME"), # information to retrieve for given data keytype = "ENSEMBL") # type of data given in 'keys' argument```We started from at about 57K genes in our results table, and the dimensions of our resulting annotation data frame also look quite similar. Let's take a peek to see if we actually returned annotations for each individual Ensembl gene ID that went in to the query:```{r orgHs_symbols_count}# How many Ensembl IDs have NOT been linked to gene names (symbols)?length(which(is.na(annotations_orgDb$SYMBOL)))```Looks like about half of the input genes did not return any annotations. This is because the OrgDb family of database are primarily based on mapping using Entrez Gene identifiers. If you look at some of the Ensembl IDs from our query that returned NA, these map to pseudogenes (e.g., [ENSG00000265439](https://useast.ensembl.org/Homo_sapiens/Gene/Summary?g=ENSG00000265439;r=6:44209766-44210063;t=ENST00000580735)) or non-coding RNAs (e.g., [ENSG00000265425](http://useast.ensembl.org/Homo_sapiens/Gene/Summary?g=ENSG00000265425;r=18:68427030-68436918;t=ENST00000577835)). The difference is due to the fact that each database implements different computational approaches for generating the gene builds. Let's get rid of those `NA` entries:```{r orgHs_symbols_remove_NA}# Determine the indices for the non-NA genesnon_na_idx <- which(is.na(annotations_orgDb$SYMBOL) == FALSE)# Return only the genes with annotations using indicesannotations_orgDb <- annotations_orgDb[non_na_idx, ]```You may have also noted the *warning* returned: *'select()' returned 1:many mapping between keys and columns*. This is always going to happen with converting between different gene IDs (i.e., one geneID can map to more than one identifier in another databse) . Unless we would like to keep multiple mappings for a single gene, then we probably want to de-duplicate our data before using it.```{r orgHs_symbols_remove_duplicates}# Determine the indices for the non-duplicated genesnon_duplicates_idx <- which(duplicated(annotations_orgDb$SYMBOL) == FALSE)# Return only the non-duplicated genes using indicesannotations_orgDb <- annotations_orgDb[non_duplicates_idx, ]```### EnsDb.Hsapiens.v86To generate the Ensembl annotations, the *EnsDb* database can also be easily queried using AnnotationDbi. You will need to decide the release of Ensembl you would like to query. We know that our data is for GRCh38, and the most current *EnsDb* release for GRCh38 in Bioconductor is release 86, so we can install this database. All Ensembl releases are listed [here](http://useast.ensembl.org/info/website/archives/index.html). **NOTE: this is not the most current release of GRCh38 in the Ensembl database, but it's as current as we can obtain through AnnotationDbi.**Since we are using *AnnotationDbi* to query the database, we can use the same functions that we used previously:```{r EnsDb_view}# Install the library# BiocManager::install("EnsDb.Hsapiens.v86")# Load the librarylibrary(EnsDb.Hsapiens.v86)# Check object metadataEnsDb.Hsapiens.v86# Explore the fields that can be used as keyskeytypes(EnsDb.Hsapiens.v86)```Now we can return all gene IDs for our gene list:```{r EnsDb_query}# Return the Ensembl IDs for a set of genesannotations_edb <- AnnotationDbi::select( EnsDb.Hsapiens.v86, # database keys = res_tableOE_tb$gene, # data to use for retrieval columns = c("SYMBOL", "ENTREZID","GENEBIOTYPE"), # information to retrieve for given data keytype = "GENEID") # type of data given in 'keys' argument```We can check for `NA` entries, and find that there are none:```{r EnsDb_symbols}length(which(is.na(annotations_edb$SYMBOL)))```Then we can again deduplicate, to remove the gene symbols which appear more than once:```{r EnsDb_symbols_remove_duplicates}# Determine the indices for the non-duplicated genesnon_duplicates_idx <- which(duplicated(annotations_edb$SYMBOL) == FALSE)# Return only the non-duplicated genes using indicesannotations_edb <- annotations_edb[non_duplicates_idx, ]```::: callout-noteIn this case we used the same build but a slightly older release, and we found little discrepancy. If your analysis was conducted using an older genome build (e.g., hg19), but used a newer build for annotation, some genes may be found to be not annotated (`NA`). Some of the genes have changed names in between versions (due to updates and patches), so may not be present in the newer version of the database.:::