Here we investigate the binding sites of the EGR1 transcription factor. After converting the TFLink binding site table to BED and BAM files, we calculate the “coverage” to reveal the strength of evidence (number of supporting experiments) for each binding site. Then we plot the binding sites on the human chromosomes, indicating the number of supporting evidences each binding site has. Finally, we investigate specific binding sites using the IGV genome viewer tool.
Let’s download the binding site table of human EGR1 transcription factor from the TFLink website, and read it in to R
with the read_tsv
command of the tidyverse
package.
library(tidyverse)
<- read_tsv("TFLink_P18146_TFBS_annot_v1.0.tsv") EGR1_BS_tab
Let’s see the first few rows of the binding site table.
%>%
EGR1_BS_tab slice(1:5)
TFLinkID | UniprotID.TF | Name.TF | Organism | Assembly | Chromosome | Start | End |
---|---|---|---|---|---|---|---|
TFLinkSS09532447 | P18146 | EGR1 | Homo sapiens | hg38 | chr14 | 59361534 | 59361673 |
TFLinkSS09543853 | P18146 | EGR1 | Homo sapiens | hg38 | chr14 | 37596110 | 37596200 |
TFLinkSS09543856 | P18146 | EGR1 | Homo sapiens | hg38 | chr14 | 96264096 | 96264186 |
TFLinkSS09543857 | P18146 | EGR1 | Homo sapiens | hg38 | chr14 | 37594940 | 37595000 |
TFLinkSS09543858 | P18146 | EGR1 | Homo sapiens | hg38 | chr14 | 37595930 | 37595990 |
Strand | Genome.browser | Detection.method | PubmedID | Source.database |
---|---|---|---|---|
+ | https://genome.ucsc.edu/cgi-bin/hgTracks?db=hg38&position=chr14:59361534-59361673 | inferred by curator | 18971253 | ORegAnno |
+ | https://genome.ucsc.edu/cgi-bin/hgTracks?db=hg38&position=chr14:37596110-37596200 | inferred by curator | 18971253 | ORegAnno |
+ | https://genome.ucsc.edu/cgi-bin/hgTracks?db=hg38&position=chr14:96264096-96264186 | inferred by curator | 18971253 | ORegAnno |
+ | https://genome.ucsc.edu/cgi-bin/hgTracks?db=hg38&position=chr14:37594940-37595000 | inferred by curator | 18971253 | ORegAnno |
+ | https://genome.ucsc.edu/cgi-bin/hgTracks?db=hg38&position=chr14:37595930-37595990 | inferred by curator | 18971253 | ORegAnno |
Small-scale.evidence | Number.of.TFBS.overlaps | TFBS.overlaps |
---|---|---|
Yes | 0 | - |
Yes | 0 | - |
Yes | 0 | - |
Yes | 0 | - |
Yes | 0 | - |
Let’s check the number of binding sites of the EGR1 transcription factor.
%>%
EGR1_BS_tab nrow()
[1] 156418
Let’s check the names of the chromosomes used in the binding site table.
%>%
EGR1_BS_tab group_by(Chromosome) %>%
group_keys() %>%
pull()
chr1, chr10, chr11, chr12, chr13, chr14, chr14_GL000009v2_random, chr14_GL000225v1_random, chr15, chr16, chr17, chr18, chr19, chr2, chr2_KI270894v1_alt, chr20, chr21, chr22, chr22_KI270735v1_random, chr22_KI270879v1_alt, chr3, chr4, chr4_GL000008v2_random, chr5, chr6, chr7, chr7_KI270803v1_alt, chr8, chr8_KI270821v1_alt, chr9, chrM, chrUn_GL000219v1, chrUn_KI270442v1, chrUn_KI270742v1, chrX, chrY
If we want to check the “coverage” for the canonical human chromosomes only, we need to exclude all other contigs from the EGR1_BS_tab table.
# Keeping rows where the name of the chromosome doesn't include "_", and
# excluding the mitochondrial chromosome (chrM) as well.
# Creating a singe BED file containing the data about the FORWARD and REVERSE
# strands.
%>%
EGR1_BS_tab filter(!grepl(x = Chromosome, pattern = "_|M")) %>%
mutate(Name = "EGR1", Score = 1000) %>%
select(Chromosome, Start, End, Name, Score, Strand) %>%
arrange(Chromosome, Start, End, Strand) %>%
write_tsv("EGR1_BS.bed", col_names = FALSE)
# Creating a BED file containing the data about the FORWARD strand only.
%>%
EGR1_BS_tab filter(!grepl(x = Chromosome, pattern = "_|M") & Strand == "+") %>%
mutate(Name = "EGR1", Score = 1000) %>%
select(Chromosome, Start, End, Name, Score, Strand) %>%
arrange(Chromosome, Start, End, Strand) %>%
write_tsv("EGR1_BS_forward.bed", col_names = FALSE)
# Creating a BED file containing the data about the REVERSE strand only.
%>%
EGR1_BS_tab filter(!grepl(x = Chromosome, pattern = "_|M") & Strand == "-") %>%
mutate(Name = "EGR1", Score = 1000) %>%
select(Chromosome, Start, End, Name, Score, Strand) %>%
arrange(Chromosome, Start, End, Strand) %>%
write_tsv("EGR1_BS_reverse.bed", col_names = FALSE)
Let’s convert the BED file to BAM using the bedtobam
command of the bedtools software package.
For this we need a text file, called: chrom.sizes containing the name and the size (in base pairs) of the chromosomes. We can download this for the human genome version hg38 from here.
Let’s read the chrom.sizes file into R
, exclude the non-canonical chromosomes, and write out a new chrom.sizes file.
<- read_tsv(
chrom_sizes_tab "https://hgdownload.soe.ucsc.edu/goldenPath/hg38/bigZips/hg38.chrom.sizes",
col_names = c("Chromosome", "Size"))
%>%
chrom_sizes_tab write_tsv("chrom.sizes", col_names = FALSE)
Now we can run the bedToBam
command in the terminal to create the BAM files.
# Forward strand BAM
$ bedToBam -i EGR1_BS_forward.bed -g chrom.sizes > EGR1_BS_forward.bam
# Reverse strand BAM
$ bedToBam -i EGR1_BS_reverse.bed -g chrom.sizes > EGR1_BS_reverse.bam
Let’s calculate the number of evidences each binding site has using the samtools software.
# Forward strand BAM
$ samtools depth EGR1_BS_forward.bam > EGR1_BS_forward.coverage
# Reverse strand BAM
$ samtools depth EGR1_BS_reverse.bam > EGR1_BS_reverse.coverage
R
.Now we use the fread
function of the data.table
R
package that reads huge tables effectively.
library(data.table)
# Reading the coverage file of the FORWARD strand
<- fread("EGR1_BS_forward.coverage")
EGR1_BS_cover
# Reading the coverage file of the REVERSE strand
<- fread("EGR1_BS_reverse.coverage") %>%
EGR1_BS_cover # Indicating that it is the coverage of the reverse strand by multiplying
# the last column (containing information about the depth) with -1
mutate(V3 = V3*-1) %>%
# Binding the rows of the reverse strand variable to the previous variable
# containing the coverage of the forward strand
bind_rows(., EGR1_BS_cover) %>%
# Renaming the columns according to their content
rename(Chromosome = V1, Locus = V2, Depth = V3) %>%
# Arranging the rows by chromosome and locus (position)
arrange("Chromosome", "Locus")
Let’s check the “coverage” (the number of evidences each binding sites chromosomal position has).
%>%
EGR1_BS_cover group_by(Depth) %>%
summarise(n = n()) %>%
# indicating the strand
mutate(Strand = case_when(Depth < 0 ~ "-",
> 0 ~ "+"),
Depth Depth = abs(Depth)) %>%
arrange(desc(Depth))
Depth | n | Strand |
---|---|---|
12 | 1 | - |
11 | 2 | - |
10 | 4 | - |
9 | 9 | + |
8 | 18 | - |
8 | 297 | + |
7 | 1,648 | - |
7 | 1,980 | + |
6 | 3,126 | - |
6 | 3,370 | + |
5 | 3,164 | - |
5 | 4,949 | + |
4 | 6,346 | - |
4 | 20,330 | + |
3 | 18,586 | - |
3 | 132,572 | + |
2 | 81,920 | - |
2 | 669,525 | + |
1 | 366,387 | - |
1 | 41,819,023 | + |
On the figure, the binding sites on the forward and reverse strands appear as positive and negative values respectively.
Be avare that it may take a few seconds to render this plot.
%>%
EGR1_BS_cover filter(Chromosome == "chr1") %>%
ggplot(aes(x = Locus, y = Depth, group = Chromosome)) +
geom_line(size = 0.3, color = "#488990") +
xlab("Chromosome 1") +
ylab("Nr. of evidences") +
theme_minimal()
As you can see, there are no binding sites at the middle of the chromosome, where the centromere is located.
Be avare that it may take a several minutes to render this plot and your computer even can run out of memory.
<- EGR1_BS_cover %>%
p ggplot(aes(x = Locus, y = Depth)) +
geom_line(size = 0.3, color = "#488990") +
xlab("Location") +
ylab("Nr. of evidences")
+
p facet_wrap(. ~ Chromosome, ncol = 4) +
theme_minimal() +
theme(axis.text.x = element_text(angle = 45, hjust=1))
Let’s use the downloaded binding site table again. We exclude non-canonical chromosomes.
# New variable with canonical chromosomes (as factor)
<- EGR1_BS_tab %>%
EGR1_BS_tab_canChro filter(!grepl(x = Chromosome, pattern = "_|M")) %>%
mutate(Chromosome = factor(Chromosome, levels = paste0("chr", c(1:22, "X", "Y"))))
# Barplot
%>%
EGR1_BS_tab_canChro group_by(Chromosome) %>%
summarise(n = n()) %>%
ggplot(aes(x = Chromosome, y = n)) +
geom_bar(stat="identity", fill = "#488990") +
ylab("Nr. of binding sites") +
theme_minimal() +
theme(axis.text.x = element_text(angle = 90, vjust = 0.5, hjust=1))
Chromosome 1 has the most binding site but what happens if we normalize the counts by the length of the chromosome (using the chrom_sizes_tab
variable).
# Normalization and barplotting
%>%
EGR1_BS_tab_canChro group_by(Chromosome) %>%
summarise(n = n()) %>%
left_join(., chrom_sizes_tab) %>%
mutate(Chromosome = factor(Chromosome,
levels = paste0("chr", c(1:22, "X", "Y")))) %>%
mutate(`Nr. of binding sites\nper million bps` = n / (Size / 1000000)) %>%
ggplot(aes(x = Chromosome, y = `Nr. of binding sites\nper million bps`)) +
geom_bar(stat="identity", fill = "#488990") +
theme_minimal() +
theme(axis.text.x = element_text(angle = 90, vjust = 0.5, hjust=1))
Now we see that actually chromosome 19 has the most binding sites of ERG1.
In the next chapter we check around which genes these binding sites are located.
Now we have two choices to investigate the binding sites with the IGV genome viewer:
Let’s do both!
First we need to index the BAM file using the samtools software.
$ samtools index EGR1_BS.bam
With the above command a new file, named EGR1_BS.bam.bai
was created.
After opening the IGV genome viewer:
More
),File
→ Load from file...
),Color alignments by
→ read strand
),Group alignments by
→ read strand
),File
→ Load from file...
),Expand
),Change track color
),chr9:137,188,319-137,189,579
to the searching bar at top center, Go
), andwe see the figure below.
What we see at the “read” track area?
Just to double check these findings by using an independent source of information: we can download the target gene interaction table of the EGR1 transcription factor from the TFLink website, and we can look up these two genes among the target genes. Then indeed, we will find both the SSNA1 and the ANAPC2 genes among the target genes. Both evidences were inferred by chromatin immunoprecipitation assays:
Quitting from lines 375-378 (TFLink_Use_Case_3_Binding_Sites_of_EGR1.Rmd) Error: ‘2_target_genes.tsv’ does not exist in current working directory (‘/home/barizona/Eszter/Kutatas/TF/use_cases_for_paper/UC3_binding_sites_EGR1’).
In this use case the following software and package versions were applied:
R
version 4.1.3tidyverse
version 1.3.1igraph
version 1.2.11Cytoscape
version 3.8.2