After selecting the organism, the user can browse and search within the dataset. The results can be filtered by gene name, UniProt ID, NCBI Gene ID, function (e.g. 'transcription factor', 'target gene', or 'transcription factor and target gene'), and evidence type (small- or large-scale experiments). TFLink distinguishes the 'transcription factor' and 'transcription factor and target gene' functions according to whether a transcription factor regulating the gene that encodes the given transcription factor is known (i.e. present in the TFLink database). When looking for transcription factors, we suggest filtering for both 'transcription factor' and 'transcription factor and target gene', and then focusing on the "Targets of …" table on the entry page. Similarly, when searching for target genes, we suggest filtering for both 'target gene' and 'transcription factor and target gene', and then focusing on the "Transcription factors of …" table on the entry page. The number of interactions in which a particular gene or protein is involved is also provided. Selecting an entry (gene or protein) from the Browsing table opens its 'entry page'. (Links to example entry pages are provided in the Supplementary Notes and in the FAQ part of the TFLink gateway.)
Each entry page contains basic information about the transcription factor protein or target gene: gene name, UniProt ID (linked to the corresponding UniProt protein page), NCBI Gene ID (linked to the corresponding NCBI Gene site), organism (the scientific name of the species), its function (transcription factor, target gene or both), the number of its interactions, and its orthologs (species name and UniProt ID) – when there are any. In case the ortholog is also available in the TFLink database, a link is provided to the related entry page. Binding site nucleotide composition frequency matrices and sequence logos of the transcription factors are also available through the JASPAR website to facilitate the prediction of more binding sites.
Below the basic information section, the user may visualise three layers of information (if available) about the selected transcription factor: (1) target genes of the transcription factor and/or (2) transcription factors for the target gene and (3) binding sites of the transcription factor. In the target gene and transcription factor tables the user finds details on gene names (linked to corresponding TFLink entries), UniProt IDs (linked to corresponding UniProt protein pages), NCBI Gene IDs (linked to corresponding NCBI Gene sites), name of the source database(s), method(s) of detection, cross-links to the original publications at NCBI PubMed, and indications of the evidence type (small- or large-scale experiments). Along with these tables, interactive network visualisations are presented, demonstrating the interactions between the transcription factor(s) and target gene(s) (indicated by green and red colours, respectively) to facilitate the visual inspection of the interactions.
Besides the TFLink ID, the name of the source database(s), the method(s) of detection, the link to the original publications, and the indication to clarify whether the evidence is based on a small- or a large-scale experiment, the binding site table also presents information about the genomic location: genome assembly version, chromosome, the coordinates of the start and end points of transcription factor binding sites, and the number of overlapping binding sites for the particular transcription factor. To make the visual exploration of the genomic context easier, each binding site is linked to its particular genomic location at the UCSC genome browser website.
In case there are more than 100 interactions or binding sites available for a particular entry in the TFLink gateway, we only show the first 100 targets / transcription factors / binding sites in the tables on the website, and make the full information available in the form of downloadable table (and in case of binding sites: GFF3 annotation) files.
The sequences of binding sites based on small-scale evidence are shown below the tables in FASTA format. The header of the sequences contains the TFLink and the UniProt IDs, gene name, genome assembly version, chromosome name and the start and end point coordinates of the binding sites. Some data downloaded from the JASPAR database refer to binding sequences without exact localization, for example in cases when random sequences were investigated with SELEX. The binding sequences revealed by large-scale experiments are available from the entry pages as downloadable FASTA files.
Entry pages example links
Interaction table files are tab-separated (TSV) tables of transcription factor - target gene interactions. They contain interactions validated by small-scale experiments, by large-scale experiments, or both combined. Interaction tables contain the following data:
Interaction MITABs contain transcription factor - target gene interactions in HUPO-PSI MITAB 2.8 format. MITAB 2.8 (as defined by the Human Proteome Organization - Proteomics Standards Initiative, HUPO-PSI) is a standardised format, with a standardised vocabulary, used to describe molecular interactions. While other databases may refer to a detection method by several different names, databases that use the MITAB format (e.g. TFLink, MINT or IntAct) use the same code for a given technique. For example, the electrophoretic mobility shift assay could be identified either by its full name or by the abbreviation EMSA, but in databases using the MITAB format it is always referred to by the psi-mi:"MI:0413" code. This makes the identification of interaction properties more efficient and helps avoid potential misunderstandings. The MITAB files are tab-delimited tables containing 46 columns and no header.
A header for the MITAB tables is available here.
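As a minimal sketch of how the headerless, 46-column MITAB files described above might be read, the Python snippet below checks the column count of each row and pulls out the PSI-MI codes of the detection methods. The function names are hypothetical, the demo row is synthetic, and the only layout assumption is that the detection method sits in the seventh column, as in the PSI-MI TAB column order.

```python
import csv
import io
import re

# Matches controlled-vocabulary terms such as psi-mi:"MI:0413"
MI_CODE = re.compile(r'psi-mi:"(MI:\d{4})"')


def read_mitab(handle, n_columns=46):
    """Yield every non-empty MITAB row as a list of n_columns strings."""
    # QUOTE_NONE keeps the double quotes inside MITAB fields untouched.
    reader = csv.reader(handle, delimiter="\t", quoting=csv.QUOTE_NONE)
    for line_no, row in enumerate(reader, start=1):
        if not row:
            continue
        if len(row) != n_columns:
            raise ValueError(
                f"line {line_no}: expected {n_columns} columns, got {len(row)}"
            )
        yield row


def detection_codes(row, column=6):
    """Return the PSI-MI codes found in the detection-method column."""
    # Assumption: detection methods are in the 7th column (index 6).
    return MI_CODE.findall(row[column])


# Synthetic one-row example; "-" marks an empty MITAB field.
fields = ["-"] * 46
fields[6] = 'psi-mi:"MI:0413"(electrophoretic mobility shift assay)'
demo = io.StringIO("\t".join(fields) + "\n")

rows = list(read_mitab(demo))
codes = detection_codes(rows[0])
print(codes)
```

For a downloaded file, the `io.StringIO` object would be replaced by an open file handle; the same row lists could then be labelled with the separately provided header.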
The interaction tables and MITAB files can be used as input data for the Cytoscape software to perform systems and network biology studies.
For more information on the HUPO-PSI's molecular interaction format see: Sivade (Dumousseau) M et al. (2018) Encompassing new use cases - level 3.0 of the HUPO-PSI format for molecular interactions. BMC Bioinformatics 19(1):134. doi: 10.1186/s12859-018-2118-1.
Interaction GMT (Gene Matrix Transposed) is a tab-delimited file format in which each row describes one gene set, here the target genes of one transcription factor. The first and second columns contain information about the transcription factor (various IDs and gene names). The first cell in each row is always unique. The target genes of the transcription factor are listed from the third column to the last. Because the number of target genes varies between transcription factors, the number of cells can differ from row to row. The user can choose between GMT files with UniProt IDs, NCBI Gene IDs, or gene names. The GMT files are useful for enrichment and gene overrepresentation analyses and can serve as input for the mulea R package and the GSEA software.
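Because GMT rows have a variable number of cells, they do not load cleanly as a rectangular table. The following Python sketch reads such a file line by line into a dictionary keyed by the (unique) first cell. The function name and the example identifiers are placeholders made up for illustration, not TFLink content.

```python
import io


def read_gmt(handle):
    """Map the first cell of each row to (second cell, set of target genes)."""
    gene_sets = {}
    for line in handle:
        cells = line.rstrip("\r\n").split("\t")
        if len(cells) < 3:
            # Skip blank lines and rows without any target gene.
            continue
        key, description, *targets = cells
        gene_sets[key] = (description, {t for t in targets if t})
    return gene_sets


# Placeholder rows: two transcription factors with 3 and 1 targets.
example = "TF_A\tdesc A\tG1\tG2\tG3\nTF_B\tdesc B\tG2\n"
sets = read_gmt(io.StringIO(example))

for tf, (description, targets) in sets.items():
    print(tf, description, len(targets))
```

Storing the targets as a set makes later operations such as intersections between transcription factors straightforward.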
Binding site table files are tab separated tables (TSV) of binding site annotations that contain:
Binding site annotation files contain:
Binding site sequence files are FASTA files containing the DNA sequences of the transcription factor binding sites. The header of each sequence contains
Nr. of data downloaded

| Source database | Version | Downloading date | Type of data | Nr. of integrated data | Species |
|---|---|---|---|---|---|
| DoRothEA | 2 | 19/06/2020 | SS interactions | 3,453 | Hs |
| GTRD | 20.06 | 02/07/2020 | LS interactions | 10,685,122 | Hs, Mm, Rn, Dr, Dm, Ce, Sc |
| HTRIdb | 1 | 29/04/2017 | SS interactions | 2,020 | Hs |
| | | | LS interactions | 47,140 | |
| JASPAR | 2020 | 22/07/2020 | SS binding sites | 3,048 | Hs, Mm, Rn, Dm, Ce |
| | | | LS binding sites | 8,567,469 | |
| ORegAnno | 3.0 | 24/05/2017 | SS interactions | 1,979 | Hs, Mm, Rn, Dm, Ce, Sc |
| | | | LS interactions | 160,096 | |
| | | | SS binding sites | 47,304 | |
| | | | LS binding sites | 705,121 | |
| REDfly | 6.0.2 | 16/06/2020 | SS interactions | 683 | Dm |
| | | | LS interactions | 90 | |
| | | | SS binding sites | 2,240 | |
| | | | LS binding sites | 27 | |
| ReMap | 1.2 | 16/07/2018 | LS interactions | 2,933,177 | Hs |
| TRED | - | 08/06/2018 | SS interactions | 8,693 | Hs, Mm |
| TRRUST | 2 | 30/07/2018 | SS interactions | 16,570 | Hs, Mm |
| Yeastract | 2020 | 20/07/2020 | SS interactions | 5,349 | Sc |
| | | | LS interactions | 188,072 | |

SS: small-scale; LS: large-scale. Species: Hs, Homo sapiens; Mm, Mus musculus; Rn, Rattus norvegicus; Dr, Danio rerio; Dm, Drosophila melanogaster; Ce, Caenorhabditis elegans; Sc, Saccharomyces cerevisiae.
TFLink is a useful resource for wet-lab researchers, since it provides easy access to high-quality transcription factor - target gene interaction and transcription factor binding site data, with cross-links to several other databases. TFLink is also a long-awaited resource for bioinformaticians, as it contains large quantities of standardised, downloadable regulatory data in multiple formats. The interaction tables can be used as input for the Cytoscape software or the igraph package to perform systems and network biology studies. The GMT files are useful for gene set enrichment and overrepresentation analyses. Binding site tables allow the user to investigate the genomic location of binding sites. Users can apply the GFF3 binding site annotation files in various NGS analyses, for example when inspecting mapped RNA-seq reads with the IGV genome viewer. The binding site sequences can be used in binding site predictions, binding site matrix calculations, and investigations of the rate of evolution of transcription factor binding sites. TFLink will therefore facilitate benchmarking experiments in several fields of gene regulation research.
To facilitate the application of TFLink, we provide some examples of how to use and process the data available at the gateway, in the form of descriptions, R scripts and unix shell commands:
Here we want to check and visualise the common target genes of two transcription factors. We describe how to find transcription factors which share common target genes. We cluster transcription factors based on their common target genes. We create a transcription factor - target gene interaction graph of the STAT5A and STAT5B transcription factors using the igraph R package. We also show how to create the same transcription factor - target gene interaction graph by using the Cytoscape software.
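The core of this example, finding the target genes shared by two transcription factors and scoring how similar their target sets are, can be sketched in a few lines of Python. This is an illustration rather than the R workflow provided at the gateway: the column names `Name.TF` and `Name.Target` are assumptions about the interaction table header, the gene names `GENE1` to `GENE3` are placeholders, and the Jaccard index is one simple similarity measure chosen here as a possible input for clustering.

```python
import csv
import io
from collections import defaultdict
from itertools import combinations


def targets_by_tf(handle, tf_col="Name.TF", target_col="Name.Target"):
    """Collect the set of target genes of every transcription factor."""
    # Column names are assumptions; adjust them to the downloaded table.
    targets = defaultdict(set)
    for row in csv.DictReader(handle, delimiter="\t"):
        targets[row[tf_col]].add(row[target_col])
    return dict(targets)


def jaccard(a, b):
    """Shared targets divided by all targets of the two factors."""
    union = a | b
    return len(a & b) / len(union) if union else 0.0


# Placeholder interaction table with made-up target gene names.
table = (
    "Name.TF\tName.Target\n"
    "STAT5A\tGENE1\n"
    "STAT5A\tGENE2\n"
    "STAT5B\tGENE2\n"
    "STAT5B\tGENE3\n"
)
tf_targets = targets_by_tf(io.StringIO(table))

shared = tf_targets["STAT5A"] & tf_targets["STAT5B"]
pairwise = {
    (a, b): jaccard(tf_targets[a], tf_targets[b])
    for a, b in combinations(sorted(tf_targets), 2)
}
print(sorted(shared), pairwise)
```

The `pairwise` dictionary holds one similarity value per pair of transcription factors, which could be converted to a distance (1 minus similarity) for a clustering routine.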
Here we investigate the functional diversity of the target genes of a nuclear hormone receptor transcription factor, unc-55, in human and a nematode species. We perform Gene Ontology overrepresentation analyses of the target genes in the two species in order to identify shared functional roles that likely represent the ancestral function of unc-55. Furthermore, this comparison will yield insights into the potentially divergent roles unc-55 plays in these two distant animal groups.
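Overrepresentation analyses of this kind are commonly based on the hypergeometric test. As a hedged illustration of the underlying arithmetic (not the implementation used by the mulea package or by the scripts at the gateway), the sketch below computes the probability of seeing at least `k` annotated genes among `n` target genes, when `K` of the `N` background genes carry the annotation. The function name and the numbers are made up for the example.

```python
from math import comb


def hypergeom_upper_tail(k, n, K, N):
    """P(X >= k) for a hypergeometric draw.

    N: background genes, K: background genes annotated with the term,
    n: target genes, k: target genes annotated with the term.
    """
    total = comb(N, n)
    favourable = sum(
        comb(K, i) * comb(N - K, n - i)
        for i in range(k, min(n, K) + 1)
    )
    return favourable / total


# Toy numbers: 10 background genes, 5 annotated, 5 targets, all 5 annotated.
p_value = hypergeom_upper_tail(k=5, n=5, K=5, N=10)
print(p_value)
```

In a real analysis one such p-value is computed per Gene Ontology term and then corrected for multiple testing.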
Here we investigate the binding sites of the EGR1 transcription factor. After converting the TFLink binding site table to BED and BAM files, we calculate the "coverage" to reveal the strength of evidence (number of supporting experiments) for each binding site. Then we plot the binding sites on the human chromosomes, indicating how many pieces of supporting evidence each binding site has. Finally, we investigate specific binding sites using the IGV genome viewer tool.
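The idea behind the coverage step can be illustrated without dedicated genomics tools. The Python sketch below is a simplified stand-in for the BED/BAM-based workflow: it converts binding site coordinates to BED-style intervals and reports, for each stretch of a chromosome, how many binding sites overlap it. It assumes that the coordinates in the binding site table are 1-based and inclusive (as in GFF3), which should be verified against the downloaded file; the coordinates and function names are invented for the example.

```python
from collections import defaultdict


def to_bed(chrom, start, end):
    """Convert a 1-based inclusive interval to 0-based half-open (BED)."""
    # Assumption: the input coordinates are 1-based and inclusive.
    return (chrom, start - 1, end)


def coverage(bed_intervals):
    """Return (chrom, start, end, depth) segments with depth > 0."""
    events = defaultdict(list)
    for chrom, start, end in bed_intervals:
        events[chrom].append((start, 1))   # an interval opens
        events[chrom].append((end, -1))    # an interval closes
    segments = []
    for chrom in sorted(events):
        depth = 0
        prev = None
        for pos, delta in sorted(events[chrom]):
            if prev is not None and pos > prev and depth > 0:
                segments.append((chrom, prev, pos, depth))
            depth += delta
            prev = pos
    return segments


# Invented binding sites: two overlapping, one separate.
sites = [("chr1", 101, 200), ("chr1", 151, 250), ("chr1", 301, 400)]
bed = [to_bed(*site) for site in sites]
segments = coverage(bed)

for segment in segments:
    print(*segment, sep="\t")
```

Each printed line follows the bedGraph-like layout of chromosome, start, end and depth, so regions supported by several overlapping binding sites stand out by their higher depth.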
TFLink is freely available for non-commercial use.