FAQ

Who is working on the TFLink project?
How to browse the TFLink database?
What does an entry page contain?
What are the downloadable data formats?

What do interaction tables contain?
What do MITAB tables contain?
What do interaction GMT files contain?
What do binding site tables contain?
What do GFF3 binding site annotation files contain?
What do binding site sequence files contain?

What were the source databases that provided the data of TFLink?
What kind of biological questions can be answered by using data on TFLink?
Is TFLink freely available?

Who is working on the TFLink project?

Orsolya Liska
liska.orsolya@brc.hu

Balázs Bohár
bbazsi41@gmail.com

András Hidas
hidas.andras@ecolres.hu
Affiliations:

Institute of Aquatic Ecology, Centre for Ecological Research, Budapest, Hungary

Tamás Korcsmáros
Tamas.Korcsmaros@earlham.ac.uk

Balázs Papp
papp.balazs@brc.hu

Dávid Fazekas
fazekas@netbiol.elte.hu

Eszter Ari
arieszter@gmail.com

How to browse the TFLink database?

After selecting the organism, the user can browse and search within the dataset. The results can be filtered by gene name, UniProt ID, NCBI Gene ID, function ('transcription factor', 'target gene', or 'transcription factor and target gene'), and evidence type (small- or large-scale experiments). TFLink distinguishes the 'transcription factor' and 'transcription factor and target gene' functions based on whether a transcription factor regulating the gene that encodes the particular transcription factor is known (that is, present in the TFLink database). When looking for transcription factors, we suggest filtering for both 'transcription factor' and 'transcription factor and target gene', and then focusing on the "Targets of …" table on the entry page. Similarly, when searching for target genes, we suggest filtering for both 'target gene' and 'transcription factor and target gene', and then focusing on the "Transcription factors of …" table on the entry page. The number of interactions in which a particular gene or protein is involved is also provided. Selecting an entry (gene or protein) from the Browsing table opens its entry page. (Links to example entry pages are provided in the Supplementary Notes and in the FAQ part of the TFLink gateway.)

What does an entry page contain?

Each entry page contains basic information about the transcription factor protein or target gene: gene name, UniProt ID (linked to the corresponding UniProt protein page), NCBI Gene ID (linked to the corresponding NCBI Gene site), organism (the scientific name of the species), function (transcription factor, target gene or both), number of interactions, and orthologs (species name and UniProt ID), when there are any. If an ortholog is also available in the TFLink database, a link to its entry page is provided. Binding site nucleotide composition frequency matrices and sequence logos of the transcription factors are also available through the JASPAR website to facilitate the prediction of more binding sites.

Below the basic information section, the user may visualise three layers of information (if available) about the selected transcription factor: (1) target genes of the transcription factor and/or (2) transcription factors for the target gene and (3) binding sites of the transcription factor. In the target gene and transcription factor tables the user finds details on gene names (linked to corresponding TFLink entries), UniProt IDs (linked to corresponding UniProt protein pages), NCBI Gene IDs (linked to corresponding NCBI Gene sites), name of the source database(s), method(s) of detection, cross-links to the original publications at NCBI PubMed, and indications of the evidence type (small- or large-scale experiments). Along with these tables, interactive network visualisations are presented, demonstrating the interactions between the transcription factor(s) and target gene(s) (indicated by green and red colours, respectively) to facilitate the visual inspection of the interactions.

Besides the TFLink ID, the name of the source database(s), the method(s) of detection, the link to the original publications, and the indication to clarify whether the evidence is based on a small- or a large-scale experiment, the binding site table also presents information about the genomic location: genome assembly version, chromosome, the coordinates of the start and end points of transcription factor binding sites, and the number of overlapping binding sites for the particular transcription factor. To make the visual exploration of the genomic context easier, each binding site is linked to its particular genomic location at the UCSC genome browser website.

If more than 100 interactions or binding sites are available for a particular entry in the TFLink gateway, the tables on the website show only the first 100 targets / transcription factors / binding sites. The full information is available as downloadable table files and, for binding sites, also as GFF3 annotation files.

The sequences of binding sites based on small-scale evidence are shown below the tables in FASTA format. The header of the sequences contains the TFLink and the UniProt IDs, gene name, genome assembly version, chromosome name and the start and end point coordinates of the binding sites. Some data downloaded from the JASPAR database refer to binding sequences without exact localization, for example in cases when random sequences were investigated with SELEX. The binding sequences revealed by large-scale experiments are available from the entry pages as downloadable FASTA files.

Entry pages example links

Transcription factor entry page: https://tflink.net/protein/q9vhm6/
Target gene protein page: https://tflink.net/protein/p52429/
Transcription factor and target gene protein page: https://tflink.net/protein/p10242/

What are the downloadable data formats?

What do interaction tables contain?

Interaction table files are tab-separated tables (TSV) of transcription factor - target gene interactions. They contain interactions validated by small-scale experiments, by large-scale experiments, or both sets combined. Interaction tables contain the following data:

UniprotID.TF and/or* UniprotID.Target: UniProt IDs of transcription factors and/or target genes,
NCBI.GeneID.TF and/or NCBI.GeneID.Target: NCBI Gene IDs of transcription factors and/or target genes,
Name.TF and/or Name.Target: gene names of transcription factors and/or target genes,
Detection.method: names of the detection methods,
PubmedID: PubMed IDs of the original publications (when available) and the publications of the databases,
Organism: scientific name of the organism,
Source.database: names of the original source databases,
Small-scale.evidence: indication of whether the data were confirmed by small-scale evidence ("Yes" or "No"),
TF.TFLink.ortho: UniProt IDs of ortholog transcription factors that are available at the TFLink gateway. Each entry consists of a shortened name of the organism (Hs: Homo sapiens, Mm: Mus musculus, Rn: Rattus norvegicus, Dr: Danio rerio, Dm: Drosophila melanogaster, Ce: Caenorhabditis elegans, Sc: Saccharomyces cerevisiae) and the UniProt ID separated by a colon (e.g. Mm:Q3UPW2),
TF.nonTFLink.ortho: UniProt IDs of ortholog transcription factors that are not available at the TFLink gateway,
Target.TFLink.ortho: UniProt IDs of ortholog target genes that are available at the TFLink gateway, and
Target.nonTFLink.ortho: UniProt IDs of ortholog target genes that are not available at the TFLink gateway.

* If the interaction table was downloaded from the `Download` section, the file contains both TF and Target IDs and names. If it was downloaded from an entry page, it contains the names and IDs of the TFs (when downloading "Transcription factors of …") or the names and IDs of the targets (when downloading "Targets of …").
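
As an illustration of how the columns above can be used, the following minimal Python sketch reads an interaction table and keeps only the interactions confirmed by small-scale evidence. The two data rows are made-up placeholders, not real TFLink records, and only a subset of the columns is shown; a real file would be opened with `open(path, newline="")` instead of `io.StringIO`.

```python
import csv
import io

# Hypothetical excerpt of an interaction table: the column names follow the
# list above, but the IDs and gene names are placeholders for illustration.
SAMPLE_TSV = (
    "UniprotID.TF\tUniprotID.Target\tName.TF\tName.Target\tSmall-scale.evidence\n"
    "TF_ID_1\tTARGET_ID_1\tTF1\tGENE1\tYes\n"
    "TF_ID_1\tTARGET_ID_2\tTF1\tGENE2\tNo\n"
)


def read_interactions(handle):
    """Return the rows of a tab-separated interaction table as dicts."""
    return list(csv.DictReader(handle, delimiter="\t"))


def small_scale_only(rows):
    """Keep the interactions flagged "Yes" in the Small-scale.evidence column."""
    return [row for row in rows if row["Small-scale.evidence"] == "Yes"]


rows = read_interactions(io.StringIO(SAMPLE_TSV))
curated = small_scale_only(rows)
print([(row["Name.TF"], row["Name.Target"]) for row in curated])
```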

What do MITAB tables contain?

Interaction MITABs contain transcription factor - target gene interactions in HUPO-PSI MITAB 2.8 format. MITAB 2.8 (as defined by the Human Proteome Organization - Proteomics Standards Initiative, HUPO-PSI) is a standardised format, with a standardised vocabulary, used to describe molecular interactions. While other databases may refer to a detection method by several different names, databases that use the MITAB format (e.g. TFLink, MINT or IntAct) use the same code for a given technique. For example, the electrophoretic mobility shift assay could be identified by its full name or by the abbreviation EMSA, but databases using the MITAB format always refer to it by the code psi-mi:"MI:0413". This makes the identification of interaction properties more efficient and helps avoid potential misunderstandings. The MITAB files are tab-delimited tables containing 46 columns and no header.
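
The controlled-vocabulary codes can be extracted programmatically. The sketch below is a minimal example, assuming the TFLink files follow the standard PSI-MITAB column order, in which the seventh column holds the interaction detection method(s). The example row is a placeholder built for illustration, not a real TFLink record.

```python
import re

# Matches a PSI-MI term such as psi-mi:"MI:0413"(electrophoretic mobility shift assay)
# and captures the code and, when present, the name in parentheses.
MI_TERM = re.compile(r'psi-mi:"(MI:\d{4})"(?:\((.*?)\))?')

# 0-based index of the seventh MITAB column (interaction detection method);
# assumes the standard PSI-MITAB column order.
DETECTION_METHOD_COLUMN = 6


def detection_methods(mitab_line):
    """Return (code, name) pairs found in the detection method column."""
    fields = mitab_line.rstrip("\n").split("\t")
    return MI_TERM.findall(fields[DETECTION_METHOD_COLUMN])


# Placeholder row with 46 columns; "-" marks an empty MITAB cell.
fields = ["-"] * 46
fields[DETECTION_METHOD_COLUMN] = 'psi-mi:"MI:0413"(electrophoretic mobility shift assay)'
line = "\t".join(fields)

print(detection_methods(line))
```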

A header for the MITAB tables is available here.

The interaction tables and MITAB files can be used as input data for the Cytoscape software to perform systems and network biology studies.

For more information on the HUPO-PSI's molecular interaction format see: Sivade Dumousseau M et al. (2018) Encompassing new use cases - level 3.0 of the HUPO-PSI format for molecular interactions. BMC Bioinformatics 19(1):134. doi: 10.1186/s12859-018-2118-1.

What do interaction GMT files contain?

Interaction GMT (Gene Matrix Transposed) is a tab-delimited file format in which each row describes one gene set, here the target genes of one transcription factor. The first and second columns contain information about the transcription factor (various IDs and gene names). The first cell in each row is always unique. The target genes of the transcription factor are listed from the third column to the last. Because the number of target genes varies between transcription factors, the number of cells can differ from row to row. The user can choose between GMT files with UniProt IDs, NCBI Gene IDs, and gene names. The GMT files are useful for enrichment and gene overrepresentation analyses and can be used as input files for the mulea R package and the GSEA software.
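
Because the rows have a variable number of cells, GMT files are easier to read line by line than with a fixed-width table reader. The following is a minimal Python sketch under that assumption; the transcription factor and gene names in the example are placeholders.

```python
def parse_gmt(lines):
    """Map the first cell of each GMT row to its description and target genes."""
    gene_sets = {}
    for line in lines:
        cells = line.rstrip("\n").split("\t")
        if len(cells) < 3:
            continue  # skip blank lines and rows without any target gene
        set_id, description, *targets = cells
        gene_sets[set_id] = {
            "description": description,
            "targets": [target for target in targets if target],
        }
    return gene_sets


# Placeholder rows: two transcription factors with different numbers of targets.
sample_lines = [
    "TF1\tTF1 description\tGENE1\tGENE2\tGENE3\n",
    "TF2\tTF2 description\tGENE2\n",
]

gene_sets = parse_gmt(sample_lines)
for set_id, entry in gene_sets.items():
    print(set_id, len(entry["targets"]))
```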

What do binding site tables contain?

Binding site table files are tab-separated tables (TSV) of binding site annotations that contain:

TFLinkID: unique TFLink IDs of the binding sites,
UniprotID.TF: UniProt IDs of the transcription factors,
Name.TF: gene names of the transcription factors,
Organism: scientific name of the organism,
Assembly: version of the genome assembly,
Chromosome: name of the chromosome,
Start: start coordinates of the binding sites,
End: end coordinates of the binding sites,
Strand: coding strand ("+" indicates the forward strand, and "-" the reverse strand),
Genome.browser: a hyperlink to the particular genomic location at the UCSC genome browser website,
Detection.method: names of the detection methods,
PubmedID: Pubmed IDs of the original publications (when available) and the publications of the databases,
Source.database: names of the original source databases,
Small-scale.evidence: indication about if the data were confirmed by small-scale evidence (with "Yes" or "No"),
Number.of.TFBS.overlaps: the number of overlapping binding sites of the same transcription factor, and
TFBS.overlaps: list of TFLink IDs of overlapping binding sites of the same transcription factor.
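
To make the two overlap columns concrete, the sketch below shows one way such values could be derived from the coordinate columns. It is an illustration only, not the procedure TFLink uses: it assumes that start and end coordinates are both inclusive and that two sites overlap when they share at least one position on the same chromosome of the same assembly. The site records are placeholders.

```python
def find_overlaps(sites):
    """Map each TFLinkID to the IDs of overlapping sites of the same factor.

    Assumes inclusive Start/End coordinates. The pairwise comparison is
    quadratic, which is fine for a sketch but slow for large tables.
    """
    overlaps = {site["TFLinkID"]: [] for site in sites}
    for index, first in enumerate(sites):
        for second in sites[index + 1:]:
            same_context = (
                first["UniprotID.TF"] == second["UniprotID.TF"]
                and first["Assembly"] == second["Assembly"]
                and first["Chromosome"] == second["Chromosome"]
            )
            if (same_context
                    and first["Start"] <= second["End"]
                    and second["Start"] <= first["End"]):
                overlaps[first["TFLinkID"]].append(second["TFLinkID"])
                overlaps[second["TFLinkID"]].append(first["TFLinkID"])
    return overlaps


def make_site(site_id, start, end):
    """Build a placeholder binding site record (hypothetical values)."""
    return {"TFLinkID": site_id, "UniprotID.TF": "TF_ID_1", "Assembly": "asm1",
            "Chromosome": "chr1", "Start": start, "End": end}


sites = [make_site("site1", 100, 120), make_site("site2", 115, 130),
         make_site("site3", 200, 210)]
overlaps = find_overlaps(sites)
for site_id, partners in overlaps.items():
    print(site_id, len(partners), partners)
```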

What do GFF3 binding site annotation files contain?

Binding site annotation files are in GFF3 format and contain:

##sequence-region …: sequence regions with the name, start and end site of chromosomes,
seqid: names of the chromosomes,
source: starting with "TFLink_from_" and then the names of the source databases,
type: "TF_binding_site",
start: the start coordinates of the binding sites,
end: the end coordinates of the binding sites,
score: ".",
strand: the coding strand ("+" indicates the forward strand, and "-" the reverse strand),
phase: ".", and
attributes:

ID: unique TFLink IDs of the binding sites,
Name: names of the transcription factors, and
Note: UniProt IDs of the transcription factors.

For a detailed description of the format, please visit this site.
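
The nine GFF3 columns and the attribute keys can be read with a few lines of Python. This is a minimal sketch that handles only simple `key=value` attributes separated by semicolons; the example feature line uses placeholder values, not a real TFLink record.

```python
from urllib.parse import unquote

GFF3_COLUMNS = ("seqid", "source", "type", "start", "end",
                "score", "strand", "phase", "attributes")


def parse_gff3_line(line):
    """Parse one GFF3 feature line; return None for directives and blank lines."""
    if line.startswith("#") or not line.strip():
        return None
    record = dict(zip(GFF3_COLUMNS, line.rstrip("\n").split("\t")))
    record["start"] = int(record["start"])
    record["end"] = int(record["end"])
    attributes = {}
    for pair in record["attributes"].split(";"):
        if pair:
            key, _, value = pair.partition("=")
            attributes[key] = unquote(value)  # GFF3 percent-encodes reserved characters
    record["attributes"] = attributes
    return record


# Placeholder feature line for illustration.
sample_line = ("chr1\tTFLink_from_EXAMPLE\tTF_binding_site\t100\t120\t.\t+\t.\t"
               "ID=EXAMPLE_SITE_ID;Name=TF1;Note=EXAMPLE_UNIPROT_ID\n")

record = parse_gff3_line(sample_line)
print(record["seqid"], record["start"], record["end"], record["attributes"]["Name"])
```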

What do binding site sequence files contain?

Binding site sequence files are FASTA files containing the DNA sequences of the transcription factor binding sites. The header of each sequence contains:

the unique internal TFLink ID of the binding site,
the UniProt ID and gene name of the transcription factor,
the version of the genome assembly,
the name of the chromosome, and
the start and end coordinates of the site.
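
A binding site sequence file can be read with a small generic FASTA parser such as the sketch below. The header is returned as one unparsed string, because the delimiter between the header fields is not specified here; the headers and sequences in the example are placeholders.

```python
def read_fasta(lines):
    """Yield (header, sequence) pairs from the lines of a FASTA file."""
    header, chunks = None, []
    for line in lines:
        line = line.strip()
        if not line:
            continue
        if line.startswith(">"):
            if header is not None:
                yield header, "".join(chunks)
            header, chunks = line[1:], []
        else:
            chunks.append(line)  # sequences may span several lines
    if header is not None:
        yield header, "".join(chunks)


# Placeholder records; real headers carry the fields listed above.
sample_lines = [
    ">EXAMPLE_HEADER_1\n",
    "ACGTAC\n",
    "GT\n",
    ">EXAMPLE_HEADER_2\n",
    "TTGACA\n",
]

records = list(read_fasta(sample_lines))
for header, sequence in records:
    print(header, len(sequence))
```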

What were the source databases that provided the data of TFLink?

We use the following abbreviations in the table:
Data based on small-scale experiments: SS
Data based on large-scale experiments: LS
Homo sapiens: Hs
Mus musculus: Mm
Rattus norvegicus: Rn
Danio rerio: Dr
Drosophila melanogaster: Dm
Caenorhabditis elegans: Ce
Saccharomyces cerevisiae: Sc

			Nr. of data downloaded
Database	Version	Downloading date	Type of data	Nr. of integrated data	Species
DoRothEA	2	19/06/2020	SS interactions	3,453	Hs
GTRD	20.06	02/07/2020	LS interactions	10,685,122	Hs, Mm, Rn, Dr, Dm, Ce, Sc
HTRIdb	1	29/04/2017	SS interactions	2,020	Hs
HTRIdb	1	29/04/2017	LS interactions	47,140	Hs
JASPAR	2020	22/07/2020	SS binding sites	3,048	Hs, Mm, Rn, Dm, Ce
JASPAR	2020	22/07/2020	LS binding sites	8,567,469	Hs, Mm, Rn, Dm, Ce
ORegAnno	3.0	24/05/2017	SS interactions	1,979	Hs, Mm, Rn, Dm, Ce, Sc
ORegAnno	3.0	24/05/2017	LS interactions	160,096	Hs, Mm, Rn, Dm, Ce, Sc
ORegAnno	3.0	24/05/2017	SS binding sites	47,304	Hs, Mm, Rn, Dm, Ce, Sc
ORegAnno	3.0	24/05/2017	LS binding sites	705,121	Hs, Mm, Rn, Dm, Ce, Sc
REDfly	6.0.2	16/06/2020	SS interactions	683	Dm
REDfly	6.0.2	16/06/2020	LS interactions	90	Dm
REDfly	6.0.2	16/06/2020	SS binding sites	2,240	Dm
REDfly	6.0.2	16/06/2020	LS binding sites	27	Dm
ReMap	1.2	16/07/2018	LS interactions	2,933,177	Hs
TRED	-	08/06/2018	SS interactions	8,693	Hs, Mm
TRRUST	2	30/07/2018	SS interactions	16,570	Hs, Mm
Yeastract	2020	20/07/2020	SS interactions	5,349	Sc
Yeastract	2020	20/07/2020	LS interactions	188,072	Sc

What kind of biological questions can be answered by using data on TFLink?

General applications

TFLink is a useful resource for wet-lab researchers, since it provides easy access to high-quality transcription factor - target gene interaction and transcription factor binding site data, with cross-links to several other databases. TFLink is also a long-awaited resource for bioinformaticians, as it contains large quantities of standardised, downloadable regulatory data in multiple formats. The provided interaction tables can be used as input data for the Cytoscape software or the igraph package to perform systems and network biology studies. The GMT files are useful for gene set enrichment and overrepresentation analyses. Binding site tables allow the user to investigate the genomic location of binding sites. Users can apply the GFF3 binding site annotation files in various NGS analyses, for example when inspecting mapped RNA-seq reads with the IGV genome viewer. The binding site sequences can be used for binding site predictions, binding site matrix calculations, and investigations of the rate of evolution of transcription factor binding sites. TFLink will therefore facilitate benchmarking experiments in several fields of gene regulation research.

Use cases

To facilitate the application of TFLink, we provide some examples of how to use and process the data available at the gateway, in the form of descriptions, R scripts and Unix shell commands:

Use case 1

Here we want to check and visualise the common target genes of two transcription factors. We describe how to find transcription factors which share common target genes. We cluster transcription factors based on their common target genes. We create a transcription factor - target gene interaction graph of the STAT5A and STAT5B transcription factors using the igraph R package. We also show how to create the same transcription factor - target gene interaction graph by using the Cytoscape software.
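
The core step of this use case, finding the target genes shared by two transcription factors, reduces to a set intersection. The sketch below illustrates the idea in Python rather than in R with igraph; the target gene names are placeholders, not the real targets of STAT5A and STAT5B.

```python
from itertools import combinations


def targets_by_tf(pairs):
    """Group (transcription factor, target gene) pairs into one set per factor."""
    targets = {}
    for tf, target in pairs:
        targets.setdefault(tf, set()).add(target)
    return targets


def shared_targets(targets):
    """Return the common target genes for every pair of transcription factors."""
    return {(first, second): targets[first] & targets[second]
            for first, second in combinations(sorted(targets), 2)}


# Placeholder interactions; GENE_A to GENE_C are made-up target names.
pairs = [
    ("STAT5A", "GENE_A"),
    ("STAT5A", "GENE_B"),
    ("STAT5B", "GENE_B"),
    ("STAT5B", "GENE_C"),
]

targets = targets_by_tf(pairs)
shared = shared_targets(targets)
print(shared[("STAT5A", "STAT5B")])
```

The size of each intersection, or a similarity measure derived from it, could then serve as the basis for clustering transcription factors.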

Use case 2

Here we investigate the functional diversity of the target genes of unc-55, a nuclear hormone receptor transcription factor, in human and in a nematode species. We perform Gene Ontology overrepresentation analyses of the target genes in the two species in order to identify shared functional roles that likely represent the ancestral function of unc-55. Furthermore, this comparison will yield insights into the potentially divergent roles unc-55 plays in these two distant animal groups.
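
The statistical test behind the overrepresentation analysis is not specified here. A common choice is the one-sided hypergeometric test, sketched below in Python as an assumption rather than as the method used in the use case. The counts in the example are invented for illustration.

```python
from math import comb


def overrepresentation_p(universe, annotated, sample, hits):
    """One-sided hypergeometric p-value P(X >= hits).

    universe:  number of genes in the background set
    annotated: background genes annotated with the Gene Ontology term
    sample:    number of target genes of the transcription factor
    hits:      target genes annotated with the term
    """
    upper = min(annotated, sample)
    total = comb(universe, sample)
    tail = sum(comb(annotated, k) * comb(universe - annotated, sample - k)
               for k in range(hits, upper + 1))
    return tail / total


# Invented counts: 40 of 200 target genes carry a term that annotates
# 500 of the 10,000 background genes.
p_value = overrepresentation_p(universe=10_000, annotated=500, sample=200, hits=40)
print(f"{p_value:.3g}")
```

When many Gene Ontology terms are tested, the resulting p-values would also need a multiple testing correction.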

Use case 3

Here we investigate the binding sites of the EGR1 transcription factor. After converting the TFLink binding site table to BED and BAM files, we calculate the "coverage" to reveal the strength of evidence (number of supporting experiments) for each binding site. Then we plot the binding sites on the human chromosomes, indicating how many pieces of supporting evidence each binding site has. Finally, we investigate specific binding sites using the IGV genome viewer tool.

Is TFLink freely available?

TFLink is freely available for non-commercial use.