1 The aim of this use case

Here we investigate the functional diversity of target genes of a nuclear hormone receptor transcription factor, the unc-55 in Caenorhabditis elegans (C. elegans) and in human. We perform Gene Ontology (GO) overrepresentation analyses of the target genes in the two species in order to identify shared functional roles that likely represent the ancestral function of unc-55. Furthermore, this comparison will yield insights into the potentially divergent roles unc-55 play in these two distant animal groups.

2 Input data

Let’s download and read in to R the target gene interaction table of the unc-55 transcription factor in Caenorhabditis elegans and the target gene interaction table of the unc-55 ortholog NR2F1 transcription factor in human from the TFLink website using the read_tsv command of the tidyverse package.

library(tidyverse)
# C.elegans unc-55
Interaction_tab_Ce <- read_tsv("https://tflink.net/protein/g5ecr9/")
# human NR2F1
Interaction_tab_Hs <- read_tsv("https://tflink.net/protein/p10589/")

For the overrepresentation analyses we will need background genes sets. Therefore, we download all C. elegans and human target genes available at the Download part of the TFLink website, and read it in with the fread command of the data.table package that reads huge tables effectively.

library(data.table)

# Creating a list for the Uniprot IDs of the background genes
Background_Uniprot <- list()

# Background genes for C. elegans
Background_Uniprot$Ce <- fread(
  "TFLink_Caenorhabditis_elegans_interactions_All_simpleFormat_v1.0.tsv") %>% 
  select(UniprotID.Target) %>% 
  unique() %>% 
  pull()

# Background genes for human
Background_Uniprot$Hs <- fread(
  "TFLink_Homo_sapiens_interactions_All_simpleFormat_v1.0.tsv") %>% 
  select(UniprotID.Target) %>% 
  unique() %>% 
  pull()

3 Gene ontology overrepresentation analysis of target genes

To have an idea about the type of Biological processes the unc-55 and the NR2F1 transcription factors regulate we perform a Gene ontology (GO) overrepresentation analysis of their target genes using the clusterProfiler R Bioconductor package.

3.1 Translating the target genes and the background genes to Entrez IDs

First, we need to translate the Uniprot IDs of the target genes to Entrez IDs using the bitr function of the clusterProfiler package. For this we need to install another two R Bioconductor packages containing various biological identifiers: the org.Ce.eg.db and org.Hs.eg.db.

library(clusterProfiler)

# Creating a list for the target genes
Genes_list <- list()

# Translating the target genes of C. elegans unc-55 to Entrez IDs
Genes_list$Ce_unc55 <- bitr(geneID = Interaction_tab_Ce$UniprotID.Target, 
                            fromType = "UNIPROT",
                            toType = c("ENTREZID"),
                            OrgDb = "org.Ce.eg.db") %>% 
  dplyr::select(ENTREZID) %>% 
  unique() %>% 
  pull()

# Translating the background genes of C. elegans to Entrez IDs
Genes_list$Ce_BG <- bitr(geneID = Background_Uniprot$Ce, 
                            fromType = "UNIPROT",
                            toType = c("ENTREZID"),
                            OrgDb = "org.Ce.eg.db") %>% 
  dplyr::select(ENTREZID) %>% 
  unique() %>% 
  pull()

# Translating the target genes of human NR2F1 to Entrez IDs
Genes_list$Hs_NR2F1 <- bitr(geneID = Interaction_tab_Hs$UniprotID.Target, 
                            fromType = "UNIPROT",
                            toType = c("ENTREZID"),
                            OrgDb = "org.Hs.eg.db") %>% 
  dplyr::select(ENTREZID) %>% 
  unique() %>% 
  pull()

# Translating the background genes of human to Entrez IDs
Genes_list$Hs_BG <- bitr(geneID = Background_Uniprot$Hs, 
                            fromType = "UNIPROT",
                            toType = c("ENTREZID"),
                            OrgDb = "org.Hs.eg.db") %>% 
  dplyr::select(ENTREZID) %>% 
  unique() %>% 
  pull()

3.2 The overrepresentation analyses

Let’s use the enrichGO clusterProfiler function to see in which Gene ontology terms the target genes are overrepresented. We are not interested overrepresentation in too general or too specific GO terms, therefore we discard terms annotating more than 500 or less than 10 genes.

Be avare that it may take a few seconds to calculate the overrepresentation analyses.

# Creating a list for GO overrepresentation results
GO_results <- list()

# Overrepresentation analysis of the targets of C. elegans unc-55
GO_results$Ce_unc55 <- enrichGO(gene = Genes_list$Ce_unc55,
                                universe = Genes_list$Ce_BG,
                                OrgDb = "org.Ce.eg.db",
                                ont = "ALL",
                                minGSSize = 10,
                                maxGSSize = 500,
                                pAdjustMethod = "bonferroni",
                                pvalueCutoff = 0.01,
                                qvalueCutoff = 0.01,
                                readable = TRUE)

# Overrepresentation analysis of the targets of human NR2F1
GO_results$Hs_NR2F1 <- enrichGO(gene = Genes_list$Hs_NR2F1,
                                universe = Genes_list$Hs_BG,
                                OrgDb = "org.Hs.eg.db",
                                ont = "ALL",
                                minGSSize = 10,
                                maxGSSize = 500,
                                pAdjustMethod = "bonferroni",
                                pvalueCutoff = 0.01,
                                qvalueCutoff = 0.01,
                                readable = TRUE)

Let’s see the results of the GO overrepresentation analyses by checking the number of GO terms each set of target genes were enriched in.

# Nr. of GO terms for target genes of C. elegans unc-55
GO_results$Ce_unc55@result %>% 
  nrow()
## [1] 318
# Nr. of GO terms for target genes of human NR2F1
GO_results$Hs_NR2F1@result %>% 
  nrow()
## [1] 201

3.3 Summarization of the Gene ontology overrepresentation results

As we see, there are too many GO terms each set of target genes were enriched in. Therefore, we perform a following step and summarize the results by removing redundant GO terms. For this we use the rrvgo R Biocondustor package. Here we will summarise the Biological process GO terms only.

Be avare that it may take a few seconds to calculate these summaries.

library(rrvgo)

# Creating a list for GO summarisation results
GO_sum <- list()

# Summarising the GO results of the targets of the C. elegans unc-55
GO_sum$Ce_simMatrix <- calculateSimMatrix(GO_results$Ce_unc55@result$ID,
                                          orgdb = "org.Ce.eg.db",
                                          ont = "BP",
                                          method = "Rel")
GO_sum$Ce_scores <- -log10(GO_results$Ce_unc55@result$qvalue) %>% 
  setNames(., GO_results$Ce_unc55@result$ID)
GO_sum$Ce_reducedTerms <- reduceSimMatrix(simMatrix = GO_sum$Ce_simMatrix,
                                          scores = GO_sum$Ce_scores,
                                          threshold = 0.5,
                                          orgdb="org.Ce.eg.db")

# Summarising the GO results of the targets of the human NR2F1
GO_sum$Hs_simMatrix <- calculateSimMatrix(GO_results$Hs_NR2F1@result$ID,
                                          orgdb = "org.Hs.eg.db",
                                          ont = "BP",
                                          method = "Rel")
GO_sum$Hs_scores <- -log10(GO_results$Hs_NR2F1@result$qvalue) %>% 
  setNames(., GO_results$Hs_NR2F1@result$ID)
GO_sum$Hs_reducedTerms <- reduceSimMatrix(simMatrix = GO_sum$Hs_simMatrix,
                                          scores = GO_sum$Hs_scores,
                                          threshold = 0.5,
                                          orgdb="org.Hs.eg.db")

Let’s plot the main GO results and compare them.

# C. elegans unc-55
scatterPlot(GO_sum$Ce_simMatrix, GO_sum$Ce_reducedTerms)

# human NR2F1
scatterPlot(GO_sum$Hs_simMatrix, GO_sum$Hs_reducedTerms, labelSize = 4)

Let’s plot the main GO results that appeared to be common in the overrepresentation an summarisation analyses of the target genes of the C. elegans unc-55 and the human NR2F1 transcription factors.

# Filtering and plotting the common GO results
GO_sum$Ce_reducedTerms %>% 
  filter(go %in% GO_sum$Hs_reducedTerms$go) %>% 
  scatterPlot(GO_sum$Ce_simMatrix, .)

As we can see there are several conserved functions, and even more that diverged during the evolution the this transcription factor.

4 Environment

In this use case the following software and package versions were applied:

  • R version 4.1.3
  • tidyverse version 1.3.1
  • data.table version 1.14.2
  • clusterProfiler version 4.2.2
  • org.Ce.eg.db version 3.14.0
  • org.Hs.eg.db version 3.14.0
  • rrvgo version 1.6.0