Here we investigate the functional diversity of target genes of a nuclear hormone receptor transcription factor, the unc-55 in Caenorhabditis elegans (C. elegans) and in human. We perform Gene Ontology (GO) overrepresentation analyses of the target genes in the two species in order to identify shared functional roles that likely represent the ancestral function of unc-55. Furthermore, this comparison will yield insights into the potentially divergent roles unc-55 play in these two distant animal groups.
Let’s download and read in to R
the target gene interaction table of the unc-55 transcription factor in Caenorhabditis elegans and the target gene interaction table of the unc-55 ortholog NR2F1 transcription factor in human from the TFLink website using the read_tsv
command of the tidyverse
package.
library(tidyverse)
# C.elegans unc-55
<- read_tsv("https://tflink.net/protein/g5ecr9/")
Interaction_tab_Ce # human NR2F1
<- read_tsv("https://tflink.net/protein/p10589/") Interaction_tab_Hs
For the overrepresentation analyses we will need background genes sets. Therefore, we download all C. elegans and human target genes available at the Download part of the TFLink website, and read it in with the fread
command of the data.table
package that reads huge tables effectively.
library(data.table)
# Creating a list for the Uniprot IDs of the background genes
<- list()
Background_Uniprot
# Background genes for C. elegans
$Ce <- fread(
Background_Uniprot"TFLink_Caenorhabditis_elegans_interactions_All_simpleFormat_v1.0.tsv") %>%
select(UniprotID.Target) %>%
unique() %>%
pull()
# Background genes for human
$Hs <- fread(
Background_Uniprot"TFLink_Homo_sapiens_interactions_All_simpleFormat_v1.0.tsv") %>%
select(UniprotID.Target) %>%
unique() %>%
pull()
To have an idea about the type of Biological processes the unc-55 and the NR2F1 transcription factors regulate we perform a Gene ontology (GO) overrepresentation analysis of their target genes using the clusterProfiler
R
Bioconductor
package.
First, we need to translate the Uniprot IDs of the target genes to Entrez IDs using the bitr
function of the clusterProfiler
package. For this we need to install another two R
Bioconductor
packages containing various biological identifiers: the org.Ce.eg.db
and org.Hs.eg.db
.
library(clusterProfiler)
# Creating a list for the target genes
<- list()
Genes_list
# Translating the target genes of C. elegans unc-55 to Entrez IDs
$Ce_unc55 <- bitr(geneID = Interaction_tab_Ce$UniprotID.Target,
Genes_listfromType = "UNIPROT",
toType = c("ENTREZID"),
OrgDb = "org.Ce.eg.db") %>%
::select(ENTREZID) %>%
dplyrunique() %>%
pull()
# Translating the background genes of C. elegans to Entrez IDs
$Ce_BG <- bitr(geneID = Background_Uniprot$Ce,
Genes_listfromType = "UNIPROT",
toType = c("ENTREZID"),
OrgDb = "org.Ce.eg.db") %>%
::select(ENTREZID) %>%
dplyrunique() %>%
pull()
# Translating the target genes of human NR2F1 to Entrez IDs
$Hs_NR2F1 <- bitr(geneID = Interaction_tab_Hs$UniprotID.Target,
Genes_listfromType = "UNIPROT",
toType = c("ENTREZID"),
OrgDb = "org.Hs.eg.db") %>%
::select(ENTREZID) %>%
dplyrunique() %>%
pull()
# Translating the background genes of human to Entrez IDs
$Hs_BG <- bitr(geneID = Background_Uniprot$Hs,
Genes_listfromType = "UNIPROT",
toType = c("ENTREZID"),
OrgDb = "org.Hs.eg.db") %>%
::select(ENTREZID) %>%
dplyrunique() %>%
pull()
Let’s use the enrichGO
clusterProfiler
function to see in which Gene ontology terms the target genes are overrepresented. We are not interested overrepresentation in too general or too specific GO terms, therefore we discard terms annotating more than 500 or less than 10 genes.
Be avare that it may take a few seconds to calculate the overrepresentation analyses.
# Creating a list for GO overrepresentation results
<- list()
GO_results
# Overrepresentation analysis of the targets of C. elegans unc-55
$Ce_unc55 <- enrichGO(gene = Genes_list$Ce_unc55,
GO_resultsuniverse = Genes_list$Ce_BG,
OrgDb = "org.Ce.eg.db",
ont = "ALL",
minGSSize = 10,
maxGSSize = 500,
pAdjustMethod = "bonferroni",
pvalueCutoff = 0.01,
qvalueCutoff = 0.01,
readable = TRUE)
# Overrepresentation analysis of the targets of human NR2F1
$Hs_NR2F1 <- enrichGO(gene = Genes_list$Hs_NR2F1,
GO_resultsuniverse = Genes_list$Hs_BG,
OrgDb = "org.Hs.eg.db",
ont = "ALL",
minGSSize = 10,
maxGSSize = 500,
pAdjustMethod = "bonferroni",
pvalueCutoff = 0.01,
qvalueCutoff = 0.01,
readable = TRUE)
Let’s see the results of the GO overrepresentation analyses by checking the number of GO terms each set of target genes were enriched in.
# Nr. of GO terms for target genes of C. elegans unc-55
$Ce_unc55@result %>%
GO_resultsnrow()
## [1] 318
# Nr. of GO terms for target genes of human NR2F1
$Hs_NR2F1@result %>%
GO_resultsnrow()
## [1] 201
As we see, there are too many GO terms each set of target genes were enriched in. Therefore, we perform a following step and summarize the results by removing redundant GO terms. For this we use the rrvgo
R
Biocondustor
package. Here we will summarise the Biological process GO terms only.
Be avare that it may take a few seconds to calculate these summaries.
library(rrvgo)
# Creating a list for GO summarisation results
<- list()
GO_sum
# Summarising the GO results of the targets of the C. elegans unc-55
$Ce_simMatrix <- calculateSimMatrix(GO_results$Ce_unc55@result$ID,
GO_sumorgdb = "org.Ce.eg.db",
ont = "BP",
method = "Rel")
$Ce_scores <- -log10(GO_results$Ce_unc55@result$qvalue) %>%
GO_sumsetNames(., GO_results$Ce_unc55@result$ID)
$Ce_reducedTerms <- reduceSimMatrix(simMatrix = GO_sum$Ce_simMatrix,
GO_sumscores = GO_sum$Ce_scores,
threshold = 0.5,
orgdb="org.Ce.eg.db")
# Summarising the GO results of the targets of the human NR2F1
$Hs_simMatrix <- calculateSimMatrix(GO_results$Hs_NR2F1@result$ID,
GO_sumorgdb = "org.Hs.eg.db",
ont = "BP",
method = "Rel")
$Hs_scores <- -log10(GO_results$Hs_NR2F1@result$qvalue) %>%
GO_sumsetNames(., GO_results$Hs_NR2F1@result$ID)
$Hs_reducedTerms <- reduceSimMatrix(simMatrix = GO_sum$Hs_simMatrix,
GO_sumscores = GO_sum$Hs_scores,
threshold = 0.5,
orgdb="org.Hs.eg.db")
Let’s plot the main GO results and compare them.
# C. elegans unc-55
scatterPlot(GO_sum$Ce_simMatrix, GO_sum$Ce_reducedTerms)
# human NR2F1
scatterPlot(GO_sum$Hs_simMatrix, GO_sum$Hs_reducedTerms, labelSize = 4)
Let’s plot the main GO results that appeared to be common in the overrepresentation an summarisation analyses of the target genes of the C. elegans unc-55 and the human NR2F1 transcription factors.
# Filtering and plotting the common GO results
$Ce_reducedTerms %>%
GO_sumfilter(go %in% GO_sum$Hs_reducedTerms$go) %>%
scatterPlot(GO_sum$Ce_simMatrix, .)
As we can see there are several conserved functions, and even more that diverged during the evolution the this transcription factor.
In this use case the following software and package versions were applied:
R
version 4.1.3tidyverse
version 1.3.1data.table
version 1.14.2clusterProfiler
version 4.2.2org.Ce.eg.db
version 3.14.0org.Hs.eg.db
version 3.14.0rrvgo
version 1.6.0