1 The aim of this use case

Here we investigate the binding sites of the EGR1 transcription factor. After converting the TFLink binding site table to BED and BAM files, we calculate the “coverage” to reveal the strength of evidence (number of supporting experiments) for each binding site. Then we plot the binding sites on the human chromosomes, indicating the number of supporting evidences each binding site has. Finally, we investigate specific binding sites using the IGV genome viewer tool.

2 The input data

Let’s download the binding site table of human EGR1 transcription factor from the TFLink website, and read it in to R with the read_tsv command of the tidyverse package.

library(tidyverse)
EGR1_BS_tab <- read_tsv("TFLink_P18146_TFBS_annot_v1.0.tsv")

Let’s see the first few rows of the binding site table.

EGR1_BS_tab %>% 
  slice(1:5)
Head of the binding site table (continued below)
TFLinkID UniprotID.TF Name.TF Organism Assembly Chromosome Start End
TFLinkSS09532447 P18146 EGR1 Homo sapiens hg38 chr14 59361534 59361673
TFLinkSS09543853 P18146 EGR1 Homo sapiens hg38 chr14 37596110 37596200
TFLinkSS09543856 P18146 EGR1 Homo sapiens hg38 chr14 96264096 96264186
TFLinkSS09543857 P18146 EGR1 Homo sapiens hg38 chr14 37594940 37595000
TFLinkSS09543858 P18146 EGR1 Homo sapiens hg38 chr14 37595930 37595990
Table continues below
Strand Genome.browser Detection.method PubmedID Source.database
+ https://genome.ucsc.edu/cgi-bin/hgTracks?db=hg38&position=chr14:59361534-59361673 inferred by curator 18971253 ORegAnno
+ https://genome.ucsc.edu/cgi-bin/hgTracks?db=hg38&position=chr14:37596110-37596200 inferred by curator 18971253 ORegAnno
+ https://genome.ucsc.edu/cgi-bin/hgTracks?db=hg38&position=chr14:96264096-96264186 inferred by curator 18971253 ORegAnno
+ https://genome.ucsc.edu/cgi-bin/hgTracks?db=hg38&position=chr14:37594940-37595000 inferred by curator 18971253 ORegAnno
+ https://genome.ucsc.edu/cgi-bin/hgTracks?db=hg38&position=chr14:37595930-37595990 inferred by curator 18971253 ORegAnno
Small-scale.evidence Number.of.TFBS.overlaps TFBS.overlaps
Yes 0 -
Yes 0 -
Yes 0 -
Yes 0 -
Yes 0 -

Let’s check the number of binding sites of the EGR1 transcription factor.

EGR1_BS_tab %>% 
  nrow()

[1] 156418

3 Converting the binding site table to BED and BAM files

Let’s check the names of the chromosomes used in the binding site table.

EGR1_BS_tab %>% 
  group_by(Chromosome) %>% 
  group_keys() %>% 
  pull()

chr1, chr10, chr11, chr12, chr13, chr14, chr14_GL000009v2_random, chr14_GL000225v1_random, chr15, chr16, chr17, chr18, chr19, chr2, chr2_KI270894v1_alt, chr20, chr21, chr22, chr22_KI270735v1_random, chr22_KI270879v1_alt, chr3, chr4, chr4_GL000008v2_random, chr5, chr6, chr7, chr7_KI270803v1_alt, chr8, chr8_KI270821v1_alt, chr9, chrM, chrUn_GL000219v1, chrUn_KI270442v1, chrUn_KI270742v1, chrX, chrY

3.1 Excluding non-canonical chromosomes and saving the BED files

If we want to check the “coverage” for the canonical human chromosomes only, we need to exclude all other contigs from the EGR1_BS_tab table.

# Keeping rows where the name of the chromosome doesn't include "_", and
# excluding the mitochondrial chromosome (chrM) as well.

# Creating a singe BED file containing the data about the FORWARD and REVERSE 
# strands.
EGR1_BS_tab %>% 
  filter(!grepl(x = Chromosome, pattern = "_|M")) %>%
  mutate(Name = "EGR1", Score = 1000) %>% 
  select(Chromosome, Start, End, Name, Score, Strand) %>% 
  arrange(Chromosome, Start, End, Strand) %>% 
  write_tsv("EGR1_BS.bed", col_names = FALSE)

# Creating a BED file containing the data about the FORWARD strand only.
EGR1_BS_tab %>% 
  filter(!grepl(x = Chromosome, pattern = "_|M") & Strand == "+") %>%
  mutate(Name = "EGR1", Score = 1000) %>% 
  select(Chromosome, Start, End, Name, Score, Strand) %>% 
  arrange(Chromosome, Start, End, Strand) %>% 
  write_tsv("EGR1_BS_forward.bed", col_names = FALSE)

# Creating a BED file containing the data about the REVERSE strand only.
EGR1_BS_tab %>% 
  filter(!grepl(x = Chromosome, pattern = "_|M") & Strand == "-") %>%
  mutate(Name = "EGR1", Score = 1000) %>% 
  select(Chromosome, Start, End, Name, Score, Strand) %>% 
  arrange(Chromosome, Start, End, Strand) %>% 
  write_tsv("EGR1_BS_reverse.bed", col_names = FALSE)

Let’s convert the BED file to BAM using the bedtobam command of the bedtools software package.

For this we need a text file, called: chrom.sizes containing the name and the size (in base pairs) of the chromosomes. We can download this for the human genome version hg38 from here.

Let’s read the chrom.sizes file into R, exclude the non-canonical chromosomes, and write out a new chrom.sizes file.

chrom_sizes_tab <- read_tsv(
  "https://hgdownload.soe.ucsc.edu/goldenPath/hg38/bigZips/hg38.chrom.sizes",
  col_names = c("Chromosome", "Size"))
chrom_sizes_tab %>% 
  write_tsv("chrom.sizes", col_names = FALSE)

Now we can run the bedToBam command in the terminal to create the BAM files.

# Forward strand BAM
$ bedToBam -i EGR1_BS_forward.bed -g chrom.sizes > EGR1_BS_forward.bam
# Reverse strand BAM
$ bedToBam -i EGR1_BS_reverse.bed -g chrom.sizes > EGR1_BS_reverse.bam

4 Calculating and investigating the “coverage”

Let’s calculate the number of evidences each binding site has using the samtools software.

# Forward strand BAM
$ samtools depth EGR1_BS_forward.bam > EGR1_BS_forward.coverage
# Reverse strand BAM
$ samtools depth EGR1_BS_reverse.bam > EGR1_BS_reverse.coverage

4.1 Reading in the “coverage” files to R.

Now we use the fread function of the data.table R package that reads huge tables effectively.

library(data.table)

# Reading the coverage file of the FORWARD strand
EGR1_BS_cover <- fread("EGR1_BS_forward.coverage")

# Reading the coverage file of the REVERSE strand
EGR1_BS_cover <- fread("EGR1_BS_reverse.coverage") %>% 
  # Indicating that it is the coverage of the reverse strand by multiplying 
  # the last column (containing information about the depth) with -1
  mutate(V3 = V3*-1) %>% 
  # Binding the rows of the reverse strand variable to the previous variable 
  # containing the coverage of the forward strand
  bind_rows(., EGR1_BS_cover) %>% 
  # Renaming the columns according to their content
  rename(Chromosome = V1, Locus = V2, Depth = V3) %>% 
  # Arranging the rows by chromosome and locus (position)
  arrange("Chromosome", "Locus")

Let’s check the “coverage” (the number of evidences each binding sites chromosomal position has).

EGR1_BS_cover %>% 
  group_by(Depth) %>% 
  summarise(n = n()) %>% 
  # indicating the strand
  mutate(Strand = case_when(Depth < 0 ~ "-",
                            Depth > 0 ~ "+"),
         Depth = abs(Depth)) %>% 
  arrange(desc(Depth))
Number of evidences
Depth n Strand
12 1 -
11 2 -
10 4 -
9 9 +
8 18 -
8 297 +
7 1,648 -
7 1,980 +
6 3,126 -
6 3,370 +
5 3,164 -
5 4,949 +
4 6,346 -
4 20,330 +
3 18,586 -
3 132,572 +
2 81,920 -
2 669,525 +
1 366,387 -
1 41,819,023 +

4.2 Plotting the number of evidences of binding sites on chromosome 1

On the figure, the binding sites on the forward and reverse strands appear as positive and negative values respectively.

Be avare that it may take a few seconds to render this plot.

EGR1_BS_cover %>% 
  filter(Chromosome == "chr1") %>% 
  ggplot(aes(x = Locus, y = Depth, group = Chromosome)) +
  geom_line(size = 0.3, color = "#488990") +
  xlab("Chromosome 1") +
  ylab("Nr. of evidences") +
  theme_minimal()

As you can see, there are no binding sites at the middle of the chromosome, where the centromere is located.

4.3 Plotting the number of evidences of binding sites on all chromosomes

Be avare that it may take a several minutes to render this plot and your computer even can run out of memory.

p <- EGR1_BS_cover %>% 
  ggplot(aes(x = Locus, y = Depth)) +
  geom_line(size = 0.3, color = "#488990") +
  xlab("Location") +
  ylab("Nr. of evidences")
p +
  facet_wrap(. ~ Chromosome, ncol = 4) +
  theme_minimal() +
  theme(axis.text.x = element_text(angle = 45, hjust=1))

5 How many EGR1 binding sites can be found on each chromosome?

Let’s use the downloaded binding site table again. We exclude non-canonical chromosomes.

# New variable with canonical chromosomes (as factor)
EGR1_BS_tab_canChro <- EGR1_BS_tab %>% 
  filter(!grepl(x = Chromosome, pattern = "_|M")) %>%
  mutate(Chromosome = factor(Chromosome, levels = paste0("chr", c(1:22, "X", "Y"))))

# Barplot
EGR1_BS_tab_canChro %>% 
  group_by(Chromosome) %>% 
  summarise(n = n()) %>% 
  ggplot(aes(x = Chromosome, y = n)) +
  geom_bar(stat="identity", fill = "#488990") +
  ylab("Nr. of binding sites") +
  theme_minimal() +
  theme(axis.text.x = element_text(angle = 90, vjust = 0.5, hjust=1))

Chromosome 1 has the most binding site but what happens if we normalize the counts by the length of the chromosome (using the chrom_sizes_tab variable).

# Normalization and barplotting
EGR1_BS_tab_canChro %>% 
  group_by(Chromosome) %>% 
  summarise(n = n()) %>% 
  left_join(., chrom_sizes_tab) %>% 
  mutate(Chromosome = factor(Chromosome, 
                             levels = paste0("chr", c(1:22, "X", "Y")))) %>% 
  mutate(`Nr. of binding sites\nper million bps` = n / (Size / 1000000)) %>% 
  ggplot(aes(x = Chromosome, y = `Nr. of binding sites\nper million bps`)) +
  geom_bar(stat="identity", fill = "#488990") +
  theme_minimal() +
  theme(axis.text.x = element_text(angle = 90, vjust = 0.5, hjust=1))

Now we see that actually chromosome 19 has the most binding sites of ERG1.

In the next chapter we check around which genes these binding sites are located.

6 Visually investigating the binding sites with the IGV genome viewer software

Now we have two choices to investigate the binding sites with the IGV genome viewer:

  1. Reading in the transcription factor binding site BAM file (created above) as a “read” track.
  2. Reading in the transcription factor binding site GFF3 file (downloaded from the corresponding TFLink entry page) as an “annotation” track.

Let’s do both!

First we need to index the BAM file using the samtools software.

$ samtools index EGR1_BS.bam

With the above command a new file, named EGR1_BS.bam.bai was created.

After opening the IGV genome viewer:

  1. let’s set the human genome to hg38 (open the drop down menu at the top left panel → More),
  2. read the BAM file as “read” track (FileLoad from file...),
  3. color the “reads” by read strand (right click on the track in the left panel, Color alignments byread strand),
  4. group the “reads” by read strand (right click on the track in the left panel, Group alignments byread strand),
  5. read the downloaded GFF3 file as annotation track (FileLoad from file...),
  6. expand the annotation track (right click on the track in the left panel, Expand),
  7. change the color of the annotation track to red (right click on the track in the left panel, Change track color),
  8. zoom in to the region 137,188,319-137,189,579 on chromosome 9 (inserting chr9:137,188,319-137,189,579 to the searching bar at top center, Go), and

we see the figure below.

What we see at the “read” track area?

  1. There is a longer (371bp) binding site on the forward strand (colored with light red) overlapping with the putative promoter region, the first exon and first intron, of gene SSNA1.
  2. Also there is a shorter (13bp) binding site on the forward strand in the first intron of gene SSNA1.
  3. There is a shorter (13bp) binding site on the reverse strand (colored with light blue) in the putative promoter region of gene ANAPC2.
  4. At the “annotation” track area we see the similar picture but the orientation is showed by white arrows on the binding sites.

Just to double check these findings by using an independent source of information: we can download the target gene interaction table of the EGR1 transcription factor from the TFLink website, and we can look up these two genes among the target genes. Then indeed, we will find both the SSNA1 and the ANAPC2 genes among the target genes. Both evidences were inferred by chromatin immunoprecipitation assays:

Quitting from lines 375-378 (TFLink_Use_Case_3_Binding_Sites_of_EGR1.Rmd) Error: ‘2_target_genes.tsv’ does not exist in current working directory (‘/home/barizona/Eszter/Kutatas/TF/use_cases_for_paper/UC3_binding_sites_EGR1’).

7 Environment

In this use case the following software and package versions were applied:

  • R version 4.1.3
  • tidyverse version 1.3.1
  • igraph version 1.2.11
  • Cytoscape version 3.8.2