Here we investigate the binding sites of the EGR1 transcription factor. After converting the TFLink binding site table to BED and BAM files, we calculate the “coverage” to reveal the strength of evidence (number of supporting experiments) for each binding site. Then we plot the binding sites on the human chromosomes, indicating the number of supporting evidences each binding site has. Finally, we investigate specific binding sites using the IGV genome viewer tool.
Let’s download the binding site table of human EGR1 transcription factor from the TFLink website, and read it in to R with the read_tsv command of the tidyverse package.
library(tidyverse)
EGR1_BS_tab <- read_tsv("TFLink_P18146_TFBS_annot_v1.0.tsv")Let’s see the first few rows of the binding site table.
EGR1_BS_tab %>%
slice(1:5)| TFLinkID | UniprotID.TF | Name.TF | Organism | Assembly | Chromosome | Start | End |
|---|---|---|---|---|---|---|---|
| TFLinkSS09532447 | P18146 | EGR1 | Homo sapiens | hg38 | chr14 | 59361534 | 59361673 |
| TFLinkSS09543853 | P18146 | EGR1 | Homo sapiens | hg38 | chr14 | 37596110 | 37596200 |
| TFLinkSS09543856 | P18146 | EGR1 | Homo sapiens | hg38 | chr14 | 96264096 | 96264186 |
| TFLinkSS09543857 | P18146 | EGR1 | Homo sapiens | hg38 | chr14 | 37594940 | 37595000 |
| TFLinkSS09543858 | P18146 | EGR1 | Homo sapiens | hg38 | chr14 | 37595930 | 37595990 |
| Strand | Genome.browser | Detection.method | PubmedID | Source.database |
|---|---|---|---|---|
| + | https://genome.ucsc.edu/cgi-bin/hgTracks?db=hg38&position=chr14:59361534-59361673 | inferred by curator | 18971253 | ORegAnno |
| + | https://genome.ucsc.edu/cgi-bin/hgTracks?db=hg38&position=chr14:37596110-37596200 | inferred by curator | 18971253 | ORegAnno |
| + | https://genome.ucsc.edu/cgi-bin/hgTracks?db=hg38&position=chr14:96264096-96264186 | inferred by curator | 18971253 | ORegAnno |
| + | https://genome.ucsc.edu/cgi-bin/hgTracks?db=hg38&position=chr14:37594940-37595000 | inferred by curator | 18971253 | ORegAnno |
| + | https://genome.ucsc.edu/cgi-bin/hgTracks?db=hg38&position=chr14:37595930-37595990 | inferred by curator | 18971253 | ORegAnno |
| Small-scale.evidence | Number.of.TFBS.overlaps | TFBS.overlaps |
|---|---|---|
| Yes | 0 | - |
| Yes | 0 | - |
| Yes | 0 | - |
| Yes | 0 | - |
| Yes | 0 | - |
Let’s check the number of binding sites of the EGR1 transcription factor.
EGR1_BS_tab %>%
nrow()[1] 156418
Let’s check the names of the chromosomes used in the binding site table.
EGR1_BS_tab %>%
group_by(Chromosome) %>%
group_keys() %>%
pull()chr1, chr10, chr11, chr12, chr13, chr14, chr14_GL000009v2_random, chr14_GL000225v1_random, chr15, chr16, chr17, chr18, chr19, chr2, chr2_KI270894v1_alt, chr20, chr21, chr22, chr22_KI270735v1_random, chr22_KI270879v1_alt, chr3, chr4, chr4_GL000008v2_random, chr5, chr6, chr7, chr7_KI270803v1_alt, chr8, chr8_KI270821v1_alt, chr9, chrM, chrUn_GL000219v1, chrUn_KI270442v1, chrUn_KI270742v1, chrX, chrY
If we want to check the “coverage” for the canonical human chromosomes only, we need to exclude all other contigs from the EGR1_BS_tab table.
# Keeping rows where the name of the chromosome doesn't include "_", and
# excluding the mitochondrial chromosome (chrM) as well.
# Creating a singe BED file containing the data about the FORWARD and REVERSE
# strands.
EGR1_BS_tab %>%
filter(!grepl(x = Chromosome, pattern = "_|M")) %>%
mutate(Name = "EGR1", Score = 1000) %>%
select(Chromosome, Start, End, Name, Score, Strand) %>%
arrange(Chromosome, Start, End, Strand) %>%
write_tsv("EGR1_BS.bed", col_names = FALSE)
# Creating a BED file containing the data about the FORWARD strand only.
EGR1_BS_tab %>%
filter(!grepl(x = Chromosome, pattern = "_|M") & Strand == "+") %>%
mutate(Name = "EGR1", Score = 1000) %>%
select(Chromosome, Start, End, Name, Score, Strand) %>%
arrange(Chromosome, Start, End, Strand) %>%
write_tsv("EGR1_BS_forward.bed", col_names = FALSE)
# Creating a BED file containing the data about the REVERSE strand only.
EGR1_BS_tab %>%
filter(!grepl(x = Chromosome, pattern = "_|M") & Strand == "-") %>%
mutate(Name = "EGR1", Score = 1000) %>%
select(Chromosome, Start, End, Name, Score, Strand) %>%
arrange(Chromosome, Start, End, Strand) %>%
write_tsv("EGR1_BS_reverse.bed", col_names = FALSE)Let’s convert the BED file to BAM using the bedtobam command of the bedtools software package.
For this we need a text file, called: chrom.sizes containing the name and the size (in base pairs) of the chromosomes. We can download this for the human genome version hg38 from here.
Let’s read the chrom.sizes file into R, exclude the non-canonical chromosomes, and write out a new chrom.sizes file.
chrom_sizes_tab <- read_tsv(
"https://hgdownload.soe.ucsc.edu/goldenPath/hg38/bigZips/hg38.chrom.sizes",
col_names = c("Chromosome", "Size"))
chrom_sizes_tab %>%
write_tsv("chrom.sizes", col_names = FALSE)Now we can run the bedToBam command in the terminal to create the BAM files.
# Forward strand BAM
$ bedToBam -i EGR1_BS_forward.bed -g chrom.sizes > EGR1_BS_forward.bam
# Reverse strand BAM
$ bedToBam -i EGR1_BS_reverse.bed -g chrom.sizes > EGR1_BS_reverse.bamLet’s calculate the number of evidences each binding site has using the samtools software.
# Forward strand BAM
$ samtools depth EGR1_BS_forward.bam > EGR1_BS_forward.coverage
# Reverse strand BAM
$ samtools depth EGR1_BS_reverse.bam > EGR1_BS_reverse.coverageR.Now we use the fread function of the data.table R package that reads huge tables effectively.
library(data.table)
# Reading the coverage file of the FORWARD strand
EGR1_BS_cover <- fread("EGR1_BS_forward.coverage")
# Reading the coverage file of the REVERSE strand
EGR1_BS_cover <- fread("EGR1_BS_reverse.coverage") %>%
# Indicating that it is the coverage of the reverse strand by multiplying
# the last column (containing information about the depth) with -1
mutate(V3 = V3*-1) %>%
# Binding the rows of the reverse strand variable to the previous variable
# containing the coverage of the forward strand
bind_rows(., EGR1_BS_cover) %>%
# Renaming the columns according to their content
rename(Chromosome = V1, Locus = V2, Depth = V3) %>%
# Arranging the rows by chromosome and locus (position)
arrange("Chromosome", "Locus")Let’s check the “coverage” (the number of evidences each binding sites chromosomal position has).
EGR1_BS_cover %>%
group_by(Depth) %>%
summarise(n = n()) %>%
# indicating the strand
mutate(Strand = case_when(Depth < 0 ~ "-",
Depth > 0 ~ "+"),
Depth = abs(Depth)) %>%
arrange(desc(Depth))| Depth | n | Strand |
|---|---|---|
| 12 | 1 | - |
| 11 | 2 | - |
| 10 | 4 | - |
| 9 | 9 | + |
| 8 | 18 | - |
| 8 | 297 | + |
| 7 | 1,648 | - |
| 7 | 1,980 | + |
| 6 | 3,126 | - |
| 6 | 3,370 | + |
| 5 | 3,164 | - |
| 5 | 4,949 | + |
| 4 | 6,346 | - |
| 4 | 20,330 | + |
| 3 | 18,586 | - |
| 3 | 132,572 | + |
| 2 | 81,920 | - |
| 2 | 669,525 | + |
| 1 | 366,387 | - |
| 1 | 41,819,023 | + |
On the figure, the binding sites on the forward and reverse strands appear as positive and negative values respectively.
Be avare that it may take a few seconds to render this plot.
EGR1_BS_cover %>%
filter(Chromosome == "chr1") %>%
ggplot(aes(x = Locus, y = Depth, group = Chromosome)) +
geom_line(size = 0.3, color = "#488990") +
xlab("Chromosome 1") +
ylab("Nr. of evidences") +
theme_minimal()As you can see, there are no binding sites at the middle of the chromosome, where the centromere is located.
Be avare that it may take a several minutes to render this plot and your computer even can run out of memory.
p <- EGR1_BS_cover %>%
ggplot(aes(x = Locus, y = Depth)) +
geom_line(size = 0.3, color = "#488990") +
xlab("Location") +
ylab("Nr. of evidences")
p +
facet_wrap(. ~ Chromosome, ncol = 4) +
theme_minimal() +
theme(axis.text.x = element_text(angle = 45, hjust=1))Let’s use the downloaded binding site table again. We exclude non-canonical chromosomes.
# New variable with canonical chromosomes (as factor)
EGR1_BS_tab_canChro <- EGR1_BS_tab %>%
filter(!grepl(x = Chromosome, pattern = "_|M")) %>%
mutate(Chromosome = factor(Chromosome, levels = paste0("chr", c(1:22, "X", "Y"))))
# Barplot
EGR1_BS_tab_canChro %>%
group_by(Chromosome) %>%
summarise(n = n()) %>%
ggplot(aes(x = Chromosome, y = n)) +
geom_bar(stat="identity", fill = "#488990") +
ylab("Nr. of binding sites") +
theme_minimal() +
theme(axis.text.x = element_text(angle = 90, vjust = 0.5, hjust=1))Chromosome 1 has the most binding site but what happens if we normalize the counts by the length of the chromosome (using the chrom_sizes_tab variable).
# Normalization and barplotting
EGR1_BS_tab_canChro %>%
group_by(Chromosome) %>%
summarise(n = n()) %>%
left_join(., chrom_sizes_tab) %>%
mutate(Chromosome = factor(Chromosome,
levels = paste0("chr", c(1:22, "X", "Y")))) %>%
mutate(`Nr. of binding sites\nper million bps` = n / (Size / 1000000)) %>%
ggplot(aes(x = Chromosome, y = `Nr. of binding sites\nper million bps`)) +
geom_bar(stat="identity", fill = "#488990") +
theme_minimal() +
theme(axis.text.x = element_text(angle = 90, vjust = 0.5, hjust=1))Now we see that actually chromosome 19 has the most binding sites of ERG1.
In the next chapter we check around which genes these binding sites are located.
Now we have two choices to investigate the binding sites with the IGV genome viewer:
Let’s do both!
First we need to index the BAM file using the samtools software.
$ samtools index EGR1_BS.bamWith the above command a new file, named EGR1_BS.bam.bai was created.
After opening the IGV genome viewer:
More),File → Load from file...),Color alignments by → read strand),Group alignments by → read strand),File → Load from file...),Expand),Change track color),chr9:137,188,319-137,189,579 to the searching bar at top center, Go), andwe see the figure below.
What we see at the “read” track area?
Just to double check these findings by using an independent source of information: we can download the target gene interaction table of the EGR1 transcription factor from the TFLink website, and we can look up these two genes among the target genes. Then indeed, we will find both the SSNA1 and the ANAPC2 genes among the target genes. Both evidences were inferred by chromatin immunoprecipitation assays:
Quitting from lines 375-378 (TFLink_Use_Case_3_Binding_Sites_of_EGR1.Rmd) Error: ‘2_target_genes.tsv’ does not exist in current working directory (‘/home/barizona/Eszter/Kutatas/TF/use_cases_for_paper/UC3_binding_sites_EGR1’).
In this use case the following software and package versions were applied:
R version 4.1.3tidyverse version 1.3.1igraph version 1.2.11Cytoscape version 3.8.2