University of Central Florida
STARS
Electronic Theses and Dissertations, 2020-
2021
Computational Study of Target Gene Interactions - Enhancers and microRNAs
Amlan Talukder University of Central Florida
Part of the Computer Sciences Commons, and the Genetics and Genomics Commons
Find similar works at: https://stars.library.ucf.edu/etd2020
University of Central Florida Libraries http://library.ucf.edu
This Doctoral Dissertation (Open Access) is brought to you for free and open access by STARS. It has been accepted
for inclusion in Electronic Theses and Dissertations, 2020- by an authorized administrator of STARS. For more
information, please contact [email protected].
STARS Citation
Talukder, Amlan, "Computational Study of Target Gene Interactions - Enhancers and microRNAs" (2021). Electronic Theses and Dissertations, 2020-. 570. https://stars.library.ucf.edu/etd2020/570
COMPUTATIONAL STUDY ON TARGET GENE INTERACTIONS -
ENHANCERS AND MICRORNAS
by
AMLAN TALUKDER
B.Sc. Bangladesh University of Engineering and Technology, 2011
A dissertation submitted in partial fulfillment of the requirements
for the degree of Doctor of Philosophy
in the Department of Computer Science
in the College of Engineering and Computer Science
at the University of Central Florida
Orlando, Florida
Spring Term
2021
Major Professor: Haiyan Hu
© 2021 Amlan Talukder
ABSTRACT
Gene expression is an essential mechanism for the physical and mental development of humans.
Aberrant regulation of gene expression creates abnormalities in the human body that can lead to
complicated diseases. Gene expression can be regulated at any stage, from chromatin unfolding
to the post-translational stage of a protein. In this study, we focused on two important factors of
gene expression regulation that participate in the gene expression process at the transcriptional and
post-transcriptional stages: enhancer-promoter interactions and miRNA-mRNA interactions.
The enhancer-promoter interactions are difficult to detect due to the large distance between the
enhancer and promoter region and cell-specific activity of the interactions. The cell-specific
interactions have not been well studied due to inconsistent feature availability in different cells.
We designed a tool that considers a large variety of enhancer-promoter interaction features in
different cell lines, can deal with missing features, and can predict cell-specific interactions with
better accuracy than the available tools. By analyzing the cell-specific interactions from different
sources, we also found that enhancer-promoter interactions are shared in groups.
MiRNA-mRNA interactions are more complicated in humans than in other organisms because of the
imperfect nature of the interactions, the small size of miRNAs and their complex target selection
strategy. Available miRNA target prediction tools, designed on canonical features, often suffer
from low accuracy on new experimentally supported datasets. These tools do not consider
the position-wise binding preference and the relationships between adjacent positions and regions of
the miRNA sequence. Here, we designed a Markov model-based feature to capture this
position-wise information from experimental datasets, which can be incorporated into any prediction
tool to improve its performance.
ACKNOWLEDGEMENTS
I would like to express my gratitude to my advisor Dr. Haiyan Hu for continuously appreciating
my effort and encouraging me in every step of my Ph.D. journey.
I would also like to thank my co-advisor Dr. Xiaoman Li for his earnest efforts and sincere
guidance throughout my Ph.D. research. Setting formality aside, he taught me how
to do research and always pushed me to try my best.
Finally, I would like to thank my family for their continuous support. This dissertation is a
dedication to my grandfather who has been aspiring to see me achieve a Ph.D. degree even in his
dotage. It makes me so happy to finally be able to fulfill his long yearning dream. I can never thank
my parents enough for their tireless efforts to help me achieve my life goals and my brother who
is one of my inspirations to keep working in bioinformatics. Last but not least, I would like to
thank my loving wife, who has been a constant support every single day of my life since we met.
TABLE OF CONTENTS
LIST OF FIGURES
LIST OF TABLES
CHAPTER 1 : INTRODUCTION
1.1 Study of Enhancer-Promoter interactions
1.2 Study of miRNA-mRNA interactions
CHAPTER 2 : STUDY OF ENHANCER-PROMOTER INTERACTIONS
2.1 EPIP: A Novel Approach for Cell-Specific Enhancer-Promoter Interaction Prediction
2.1.1 Background
2.1.2 Materials and Method
2.1.3 Results
2.1.4 Discussion
2.2 An Intriguing Characteristic of Enhancer-Promoter Interactions
2.2.1 Background
2.2.2 Materials and Method
2.2.3 Results
2.2.4 Discussion
CHAPTER 3 : STUDY OF MIRNA-MRNA INTERACTIONS
3.1 MDPS: Position-Wise Binding Preference is Important for miRNA Target Site Prediction
3.1.1 Background
3.1.2 Materials and Methods
3.1.3 Results
3.1.4 Discussion
CHAPTER 4 : CONCLUSION AND FUTURE WORK
4.1 Conclusion
4.2 Future Work
4.2.1 Enhancer-promoter interactions
4.2.2 miRNA-mRNA interactions
LIST OF REFERENCES
LIST OF FIGURES
Figure 2-1: Training and Test data
Figure 2-2: EPIP model and partitions
Figure 2-3: The overall performance of EPIP on external datasets.
Figure 2-4: Generation of IEPs and calculation of BCC
Figure 2-5: Clusters of enhancers with Hi-C reads
Figure 2-6: The distance distribution between consecutive enhancers in the same cluster
Figure 2-7: The overlap of the enhancer clusters with the super-enhancers
Figure 3-1: Five states in an miRNA-target interaction
Figure 3-2: Non-seed regions may be important for miRNA-target interactions
Figure 3-3: Correlated pairs of miRNA positions
Figure 3-4: Clusters of miRNAs with similar "Match" patterns in specific regions
LIST OF TABLES
Table 2-1: EPIP on balanced test data, unbalanced test data, and all pairs
Table 2-2: Performance of cell-specific EPIP model on predicting of condition-specific EPIs.
Table 2-3: Comparison with TargetFinder and Ripple on TargetFinder and EPIP data.
Table 2-4: The BCC of enhancers and that of promoters are likely to be 1 in a cell line.
Table 2-5: BCC statistics for enhancers
Table 2-6: BCC statistics for promoters
Table 3-1: Training and test datasets.
Table 3-2: Performance comparison of the combined tools with the original tools.
CHAPTER 1 : INTRODUCTION
Aberrant regulation of gene expression is one of the major causes of diseases and anomalies in the
human body. The entire process of gene expression can be divided into multiple steps: chromatin
uncoiling, gene transcription to form RNA molecules, RNA splicing, translation into protein and the
post-translational stage. Numerous factors interact with a gene and its products in different
stages of the gene expression process. Any dysregulation or disruption of these factors or their
interactions in any of the stages can cause aberrations in the whole gene expression process.
Proper computational analyses are needed to find out the relationships among the diverse features
that help the factors take part in the interactions in different cell types and cell lines. Here, we
study two of the factors and their respective interactions with the proximal genetic regions and the
RNA transcripts of the genes in the transcriptional and post-transcriptional stages of gene
expression.
1.1 Study of Enhancer-Promoter interactions
Enhancer-promoter interaction (EPI) is a phenomenon that takes place in the transcription step of
the gene expression process. A promoter covers a region of 1-2 kilobases (kb) upstream of a gene
transcription start site (TSS). An enhancer is a distal region of DNA that comes in contact with the
promoter region through chromatin looping and initiates the gene transcription process along with
other factors such as RNA polymerase II and several transcription factor proteins. The size of the
enhancers, their distances from the promoters, which enhancers or promoters are active and when an
EPI occurs are all open problems. Also, EPIs are specific to different cell lines and cell types.
Only 40% of enhancers in a cell take part in EPIs [1]. Understanding the functional behaviors of
enhancers and capturing the independent features of EPIs in different cell lines are important to
correctly predict the cell-specific EPIs. Accurate prediction of EPIs can help pinpoint the ones that
play vital roles in critical development processes. The experimental data for the underlying
features of cell line-specific EPIs are still not consistently available. The popular EPI prediction
tools suffer from low precision due to missing features in individual cell lines. Dealing with the
missing features and efficiently predicting cell-specific EPIs is still a major challenge in this
area.
1.2 Study of miRNA-mRNA interactions
Recent developments in gene transcription studies have unraveled the vast and complicated
world of the transcriptome in the human body. Some of these transcripts are translated into protein
while others are not. The transcripts that code for proteins are called protein-coding RNAs or
messenger RNAs (mRNAs), and those that do not are called non-coding RNAs (ncRNAs). Although
there is still a lot to discover about the specific functions of these ncRNAs, their functions
often start with direct interaction with other transcripts. These interactions allow them to affect the
respective pathways of those transcripts. MicroRNA (miRNA) is one of the most prominent
ncRNAs found to date, and it has a shorter sequence length (16 to 28 nucleotides) than other
known RNAs. The formation of an active mature miRNA molecule from an initial miRNA
transcript is a multistep process, part of which occurs in the nucleus and the rest in the
cytoplasm. In this process, the primary miRNA transcript is cut twice by two enzymes to create
the mature miRNA molecule. After transcription, miRNAs often target other RNA transcripts,
interact with them by forming perfect or imperfect bonds and eventually either regulate their
pathways or degrade their structures. In this way, numerous miRNAs have been found to play vital
roles in a variety of complex disease pathways in the human body.
CHAPTER 2 : STUDY OF ENHANCER-PROMOTER INTERACTIONS
2.1 EPIP: A Novel Approach for Cell-Specific Enhancer-Promoter Interaction Prediction
2.1.1 Background
Enhancers are distal regions of DNA that play an important role in gene transcription. Enhancer
regions are typically located from 1 kilobase (kb) to several megabases (Mb) away from the genes of
interest. Despite being located far from the gene promoters, they come in direct contact with the
promoter regions because of chromatin looping and trigger gene expression with the help of other
factors [2-5]. To date, the majority of EPIs within a cell remain undiscovered [6]. Because of the
long range of possible distances between enhancers and their interacting gene promoters, it is also
challenging to predict EPIs [3].
Current experimental approaches to identify EPIs are mainly based on chromosome conformation
capture (3C) and its variants such as chromatin interaction analysis with paired-end tag sequencing
(ChIA-PET) and high throughput genome-wide 3C (Hi-C) [4, 7, 8]. These experimental techniques
determine the relative frequency of the direct physical contacts between genomic regions and have
successfully identified EPIs and other long-range interactions [9]. However, the ChIA-PET
method still has a low signal-to-noise ratio and most available Hi-C data have a low resolution
[7, 8]. In addition, since certain EPIs are cell-specific, the experimental EPI data in one cell sample
cannot always be directly applied to infer EPIs in other samples.
Since most of the experimental procedures are expensive in terms of time and cost, computational
methods have been indispensable alternatives to identify EPIs. These methods employ available
experimentally extracted genomic and/or epigenomic data to predict EPIs in an inexpensive way.
Early methods considered the closest promoter as the only target of an enhancer. However, a study
demonstrated that in almost 60% of the cases, the enhancers are located far from the interacting
gene promoters and one enhancer may interact with multiple gene promoters [1]. Later, several
computational approaches were developed to predict EPIs based on the correlation of epigenomic
signals in the enhancer and promoter regions [1, 6, 10, 11]. One challenge of using these methods
is that they require proper correlation thresholds to reduce false EPI predictions [12, 13]. Recently,
various supervised machine learning-based methods have been developed, such as IM-PET [9],
PETModule [14], Ripple [12] and TargetFinder [13]. These methods commonly use genomic and
epigenomic data such as those from DNase I hypersensitive sites sequencing (DNase-seq) and
histone modification-based chromatin immunoprecipitation followed by massively parallel
sequencing (ChIP-seq) to extract features for EPI predictions. IM-PET, Ripple and PETModule
use random forests as their classifiers, while TargetFinder is based on boosted tree classifiers.
These methods either do not consider or have low performance on the cell-specific EPI predictions
[12].
In this study, a computational method called "Enhancer-Promoter Interaction Prediction" (EPIP)
was proposed to predict condition-specific EPIs [15]. EPIP is a supervised machine learning-based
approach that uses functional genomic and epigenomic data to build a robust model to
predict shared and condition-specific EPIs. EPIP can work with missing data, different types of
datasets and even a dataset with a partial list of features. Tested on experimental data from more
than eight samples, EPIP reliably predicted the cell-specific EPIs and shared EPIs in different
samples with an average area under the receiver operating characteristic curve (AUROC) of about
0.95 and an average area under the precision-recall curve (AUPR) of about 0.73. When compared
with two state-of-the-art computational methods for predicting EPIs, EPIP outperformed both of
them.
2.1.2 Materials and Method
2.1.2.1 Enhancers and Promoters
We used 32,693 enhancers annotated by FANTOM [1]. In order to get a more reliable set of
enhancers, we overlapped this set of enhancers with the computationally predicted ChromHMM
enhancers [16] for cell lines that had the ChromHMM data available (GM12878, HeLa, HMEC,
HUVEC, IMR90 and NHEK). Since KBM7 does not have any annotated ChromHMM enhancers,
all FANTOM enhancers were used for KBM7. An enhancer was considered "active" in a sample
if it overlapped with the H3K27ac ChIP-seq peaks in this sample. The H3K27ac peaks were
downloaded from ENCODE [17]. Since KBM7 did not have any H3K27ac data available, all the
obtained enhancers were considered "active" for this cell line. In this way, we obtained 7,023 to
32,693 enhancers and 4,888 to 32,693 active enhancers in a sample (Table 2-1).
To define promoters, we used all annotated transcription start sites (TSSs) from GENCODE V19
[18] and considered the regions between 1 kb upstream and 100 base pairs downstream of the TSSs
as "promoters". This process resulted in 57,783 promoters. In order to determine if a promoter is
"active", we used RNA-Seq data available in six cell lines (GM12878, HeLa, HUVEC, IMR90,
K562 and NHEK) [17]. A promoter was labeled as "active" if the corresponding gene had at least
0.30 reads per kb of transcript per million mapped reads with an irreproducible discovery rate of
0.1, as in a previous study [13]. For cell lines without available RNA-Seq data (HMEC and
KBM7), all promoters were considered active (Table 2-2).
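The definitions above can be sketched in a few lines of Python. This is only an illustration under stated assumptions: the interval convention (0-based, half-open), the strand-aware orientation of "upstream" and all function names are ours, and the irreproducible discovery rate filter is not modeled.

```python
from typing import List, Tuple

# (chromosome, start, end); 0-based half-open coordinates are an assumption of this sketch
Interval = Tuple[str, int, int]

def promoter_window(chrom: str, tss: int, strand: str,
                    upstream: int = 1000, downstream: int = 100) -> Interval:
    """Promoter = 1 kb upstream to 100 bp downstream of the TSS, oriented by strand."""
    if strand == "+":
        start, end = tss - upstream, tss + downstream
    elif strand == "-":
        # On the minus strand, "upstream" lies at larger coordinates.
        start, end = tss - downstream, tss + upstream
    else:
        raise ValueError(f"unknown strand: {strand!r}")
    return (chrom, max(0, start), end)

def overlaps(a: Interval, b: Interval) -> bool:
    """True when two intervals on the same chromosome share at least one base."""
    return a[0] == b[0] and a[1] < b[2] and b[1] < a[2]

def active_enhancers(enhancers: List[Interval],
                     h3k27ac_peaks: List[Interval]) -> List[Interval]:
    """Keep the enhancers that overlap at least one H3K27ac peak."""
    return [e for e in enhancers if any(overlaps(e, p) for p in h3k27ac_peaks)]

def is_active_promoter(rpkm: float, cutoff: float = 0.30) -> bool:
    """Expression cutoff from the text; the IDR filter is not modeled here."""
    return rpkm >= cutoff
```

For a cell line without H3K27ac or RNA-Seq data, the text treats every enhancer or promoter as active, so these filters would simply be skipped.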
2.1.2.2 Training Data
In order to define positive enhancer-promoter pairs (EP-pairs), or EPIs, and negative, or
non-interacting, EP-pairs, we used normalized Hi-C data from Gene Expression Omnibus (GEO) with
accession number GSE63525 [8]. The data were available for eight cell lines: GM12878, HeLa, HMEC,
HUVEC, IMR90, K562, KBM7 and NHEK. The dataset consisted of two sets of data in the above
cell lines: looplists and contact matrices. The looplists contain significant intra-chromosomal
chromatin interactions. The number of EP-pairs from the looplists defined at the highest resolution
(5 kb) for these cell lines was too small to train the EPIP model well. Hence, we defined the positive
and negative EP-pairs from their normalized Hi-C contact matrices, following previous studies
[14, 19].
An EP-pair was defined as "positive" for a cell line when (1) the corresponding enhancer and
promoter were active in the corresponding cell line, (2) they overlapped a pair of regions that were
supported by at least 30 normalized Hi-C reads and (3) the distance between the enhancer and
promoter regions was between 2.5 kb and 2 Mb. Similarly, an EP-pair was considered negative if it
did not overlap with any pair of regions that were supported by 5 or more normalized Hi-C reads
(Figure 2-1A). The cutoffs, 30 and 5, were chosen based on our test results with different cutoffs
(Table 2-3). The HeLa cell line was excluded from the training samples because contact matrix
data were unavailable for this cell line.
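The labeling rule can be written as a small function. The following is a minimal sketch, not the EPIP source code: the argument `max_reads` (the largest normalized read count among the Hi-C region pairs that an EP-pair overlaps, 0 if none) and all names are hypothetical. The text does not spell out whether the activity and distance conditions also apply to negatives, so the sketch applies only the read-count condition to them.

```python
from typing import Optional

POS_READS = 30          # minimum normalized Hi-C reads supporting a positive pair
NEG_READS = 5           # overlapping any contact with >= 5 reads rules out "negative"
MIN_DIST = 2_500        # 2.5 kb
MAX_DIST = 2_000_000    # 2 Mb

def label_ep_pair(enh_active: bool, prom_active: bool,
                  distance: int, max_reads: float) -> Optional[str]:
    """Return "positive", "negative", or None for pairs that are neither."""
    if max_reads < NEG_READS:
        return "negative"
    if (enh_active and prom_active
            and max_reads >= POS_READS
            and MIN_DIST <= distance <= MAX_DIST):
        return "positive"
    # Pairs with intermediate support (5 to <30 reads) or failing the
    # activity or distance conditions are left unlabeled.
    return None
```

Leaving the intermediate pairs unlabeled keeps a gap between the two classes, which is one reason the two cutoffs differ.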
We trained EPIP with both balanced and unbalanced data. The balanced data consist of a randomly
chosen 30% of the positive EP-pairs and the same number of negative EP-pairs in each of the above
seven cell lines. The positives and negatives from different cell lines were then combined to train
a balanced prediction model. The unbalanced model was generated in the same way, except that the
number of randomly chosen negative EP-pairs was 10 times the number of positive EP-pairs in each
sample. Finally, we combined the two models into a combined EPIP model, which predicts an
EP-pair as "negative" only when both models predict this pair as negative and predicts an
EP-pair as positive otherwise. This strategy was based on the observation that the balanced
model had a high sensitivity and the unbalanced model had a high specificity when tested on the
training data by cross-validation.
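The combination rule amounts to a logical OR over the positive calls of the two models. A minimal sketch, assuming predictions are encoded as 1 (positive) and 0 (negative); the function names are ours:

```python
from typing import List

def combined_prediction(balanced_pred: int, unbalanced_pred: int) -> int:
    """Negative (0) only when both models say negative; positive (1) otherwise."""
    return 1 if (balanced_pred == 1 or unbalanced_pred == 1) else 0

def combine_all(balanced: List[int], unbalanced: List[int]) -> List[int]:
    """Apply the rule pair by pair to two equally long prediction lists."""
    if len(balanced) != len(unbalanced):
        raise ValueError("prediction lists must have the same length")
    return [combined_prediction(b, u) for b, u in zip(balanced, unbalanced)]
```

Because a pair is rejected only when both models reject it, the combined model can only gain recall relative to either model alone, at the possible cost of extra false positives.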
Figure 2-1: Training and Test data. (A) The flowchart of training data creation. Here all the read
numbers are normalized. An EP-pair with the enhancer overlapping with one of the two interacting
regions and the promoter overlapping with the other of the two interacting regions will be
considered as an interacting EP-pair candidate. (B) The five test data sets on which we tested EPIP.
2.1.2.3 Test Data
We considered five different test datasets to evaluate EPIP (Figure 2-1B). (1) The remaining 70% of
positive EP-pairs, together with the same number of randomly selected negative pairs that were
not used for training (balanced test data). (2) The remaining 70% of positive EP-pairs together
with 10 times as many randomly selected negative pairs that were not used for training (unbalanced
test data). (3) All EP-pairs within 2 Mb that were not used for training. (4) Positive EP-pairs
defined with normalized Hi-C contact matrices under the cutoffs 10, 20, 30, 50 and 100. (5) EP-pairs
collected in other studies [8, 20, 21], which were obtained from the strictly defined interacting
regions.
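The 30%/70% split behind the balanced and unbalanced training and test data (items 1 and 2) can be sketched as follows. The function name, the fixed random seed and the rounding of the 30% share are assumptions made for illustration.

```python
import random
from typing import List, Sequence, Tuple

Split = Tuple[List, List]  # (positives, negatives)

def split_pairs(positives: Sequence, negatives: Sequence,
                train_frac: float = 0.3, neg_ratio: int = 1,
                seed: int = 0) -> Tuple[Split, Split]:
    """Split EP-pairs of one cell line into disjoint training and test sets.

    neg_ratio=1 gives the balanced setting, neg_ratio=10 the unbalanced one.
    """
    rng = random.Random(seed)
    pos, neg = list(positives), list(negatives)
    rng.shuffle(pos)
    rng.shuffle(neg)
    n_train_pos = round(len(pos) * train_frac)
    n_train_neg = n_train_pos * neg_ratio
    train_pos, test_pos = pos[:n_train_pos], pos[n_train_pos:]
    train_neg = neg[:n_train_neg]
    # Test negatives are drawn only from pairs not used for training.
    n_test_neg = len(test_pos) * neg_ratio
    test_neg = neg[n_train_neg:n_train_neg + n_test_neg]
    return (train_pos, train_neg), (test_pos, test_neg)
```

With 10 positives and `neg_ratio=10`, this yields 3 positives and 30 negatives for training and 7 positives and 70 negatives for testing, provided enough negatives are available.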
2.1.2.4 Features of EP-Pairs
EPIP considers a total of 31 features. Three of these features are common to every sample. The
first is the distance between the enhancer and the promoter in an EP-pair. The second is the
conserved synteny score, which measures the co-conservation of an EP-pair in five other vertebrate
genomes (chicken galGal3, chimpanzee panTro4, frog xenTro3, mouse mm10 and zebrafish zv9). The
third is the correlation between the epigenomic signals in the enhancer region and those in the
promoter region of an EP-pair across ENCODE Tier 1 and Tier 2 samples [14]. For simplicity, these
features are called "distance", "css" and "corr", respectively, from here on.
Depending on the types of epigenomic data available in a sample, EPIP considers 28 additional
features for an EP-pair: 14 for the enhancer region and 14 for the promoter region. These
14 features include DNase-seq data and ChIP-seq data for nine types of histone modifications
(H3K4me1, H3K4me2, H3K4me3, H3K27ac, H3K27me3, H3K36me3, H3K79me2, H3K9ac and
H4K20me1) and four types of chromatin factors (CTCF, POL2, RAD21 and SMC3). All these
epigenomic data were found to act as significant biomarkers for EPIs [12]. The value of each of
these features was computed in both the enhancer and the promoter region of an EP-pair. The
feature value corresponds to the "strength" of the signal peak that overlapped with this region. If
a region overlaps with multiple peaks of a feature signal, the average of the peak strength
values was considered as the feature value for this region. The feature signal peaks and their signal
strength values were downloaded from ENCODE [17]. The distance, css and corr features were
considered for all seven cell lines, but some of the 28 epigenomic features were unavailable for
some cell lines. The total numbers of features for GM12878, HMEC, HUVEC, IMR90, K562,
KBM7 and NHEK were 31, 25, 27, 31, 31, 3 and 27, respectively.
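The per-region feature value described above (the strength of the overlapping peak, averaged when several peaks overlap) could be computed as in the sketch below. The peak tuple layout and the value 0.0 for a region without any overlapping peak are assumptions; the text does not state how such regions are scored.

```python
from typing import List, Tuple

Region = Tuple[str, int, int]        # (chromosome, start, end)
Peak = Tuple[str, int, int, float]   # (chromosome, start, end, signal strength)

def region_signal(region: Region, peaks: List[Peak]) -> float:
    """Mean strength of the peaks overlapping a region.

    Returns 0.0 when no peak overlaps (an assumption of this sketch).
    """
    chrom, start, end = region
    strengths = [s for (c, ps, pe, s) in peaks
                 if c == chrom and ps < end and start < pe]
    return sum(strengths) / len(strengths) if strengths else 0.0
```

Calling this once per mark for the enhancer and once for the promoter of an EP-pair would give the 2 x 14 epigenomic feature values.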
The features were grouped into 11 overlapping feature subsets or "partitions". The first partition
consisted of only the three common features: distance, css and corr. The other 10 partitions
consisted of the following subsets of features, including the three common features: DNase-seq;
H3K4me1; H3K4me1 and H3K27ac; DNase-seq and H3K27ac; H3K4me1, H3K4me3 and
H3K27ac; DNase-seq, H3K4me3 and H3K27ac; DNase-seq, H3K4me1-3 and H3K27ac; DNase-
seq, H3K4me1-3, H3K27ac, H3K27me3, H3K36me3, H3K79me2, H4K20me1, CTCF and
DNaseI; DNase-seq, H3K4me1-3, H3K27ac, H3K27me3, H3K36me3, H3K79me2, H4K20me1,
CTCF, DNaseI, POL2; DNase-seq, H3K4me1-3, H3K27ac, H3K27me3, H3K36me3, H3K79me2,
H4K20me1, CTCF, DNaseI, POL2, RAD21 and SMC3. The benefits of the feature partitions are
threefold. First, they allow the model to function for a sample with missing features. Second,
different predictors can be trained with different feature partitions and thus can avoid overfitting.
Third, the prediction decision based on different feature partitions can have alternative
perspectives (Figure 2-2B).
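How partitions let the model cope with missing features can be illustrated with set inclusion: a partition is usable for a sample only if every feature it requires is available there. The sketch below lists only the first five of the 11 partitions and treats each epigenomic mark as a single name, whereas EPIP computes each mark separately for the enhancer and the promoter; the names and layout are ours.

```python
from typing import FrozenSet, Iterable, List

COMMON: FrozenSet[str] = frozenset({"distance", "css", "corr"})

# First five of the 11 partitions described in the text (illustrative subset).
PARTITIONS: List[FrozenSet[str]] = [
    COMMON,
    COMMON | {"DNase-seq"},
    COMMON | {"H3K4me1"},
    COMMON | {"H3K4me1", "H3K27ac"},
    COMMON | {"DNase-seq", "H3K27ac"},
]

def usable_partitions(available: Iterable[str]) -> List[int]:
    """Indices of the partitions whose required features are all available."""
    have = set(available)
    return [i for i, required in enumerate(PARTITIONS) if required <= have]
```

A sample such as KBM7, which has only the three common features, would qualify for the first partition alone, while a sample with more marks contributes to more incremental learners.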
2.1.2.5 Model
The EPIP model uses a supervised sequential ensemble learning approach (Figure 2-2A). The model
consists of 11 incremental learners, one for each feature partition. Each incremental learner is an
AdaBoost classifier [22], which consists of 200 weak learners for each cell line. The weak learners
are decision tree classifiers [23] of depth 10. EPIP makes a prediction by "hard" voting among all
the incremental learners.
Figure 2-2: EPIP model and partitions. (A) The training process of EPIP. There are three types of
partitions and in total 11 partitions used. Samples with the features required by a partition are used
to train the corresponding incremental learner (IL) for this partition. Each incremental learner
trains a maximum of 200 weak learners (W) for a sample. The weak learners trained from all available
samples then vote to make the predictions for the corresponding incremental learner. The
prediction of all incremental learners determines the final prediction with another voting process.
(B) An example of the third type of partitions from three samples. The 25 features in HMEC are
included in the 27 features in HUVEC, which are included in the 31 features in GM12878.
EPIP uses the 11 feature partitions to choose the suitable cell lines for each incremental learner.
When all the features in a partition are available for a cell line, EPIP trains 200 weak learners under
the corresponding incremental learner. Here, the number 200 was selected as it is the smallest number
that gives EPIP the best AUROC and AUPR scores. Each weak learner is a decision tree classifier.
The depth of the decision trees was set to 10 after testing several options. The weak learners are
trained iteratively. The first weak learner is trained to classify all the training samples of the cell
line. The misclassified samples are weighted higher and the correctly classified samples are weighted
lower. All the training samples with modified weights are then used to train the next weak learner.
This process is repeated until the 200th weak learner is trained. The overall prediction of an
incremental learner is the summation of the predictions from its weak learners. The overall
prediction of EPIP is the majority vote of the 11 incremental learners.
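The iterative reweighting can be illustrated with a compact, pure-Python version of discrete AdaBoost. This is a sketch under several assumptions and not the EPIP implementation: the weak learners here are one-feature threshold stumps rather than depth-10 decision trees, the weight update and the alpha-weighted sum follow the standard AdaBoost formulation (the text describes them only qualitatively), labels are encoded as +1 and -1, and ties in the final vote are resolved as negative.

```python
import math
from typing import List, Sequence, Tuple

Stump = Tuple[float, float, int]  # (alpha, threshold, polarity)

def stump_predict(x: float, thr: float, polarity: int) -> int:
    """Predict `polarity` at or above the threshold, the opposite label below it."""
    return polarity if x >= thr else -polarity

def train_stump(xs: Sequence[float], ys: Sequence[int],
                weights: Sequence[float]) -> Tuple[float, float, int]:
    """Return (weighted error, threshold, polarity) of the best stump."""
    best = None
    for thr in sorted(set(xs)):
        for polarity in (1, -1):
            err = sum(w for x, y, w in zip(xs, ys, weights)
                      if stump_predict(x, thr, polarity) != y)
            if best is None or err < best[0]:
                best = (err, thr, polarity)
    return best

def reweight(weights: Sequence[float], ys: Sequence[int],
             preds: Sequence[int], alpha: float) -> List[float]:
    """Raise the weights of misclassified samples, lower the rest, renormalize."""
    new = [w * math.exp(-alpha * y * p) for w, y, p in zip(weights, ys, preds)]
    total = sum(new)
    return [w / total for w in new]

def adaboost(xs: Sequence[float], ys: Sequence[int], n_rounds: int = 200) -> List[Stump]:
    """Train n_rounds stumps sequentially on reweighted samples."""
    weights = [1.0 / len(xs)] * len(xs)
    learners: List[Stump] = []
    for _ in range(n_rounds):
        err, thr, pol = train_stump(xs, ys, weights)
        err = min(max(err, 1e-12), 1 - 1e-12)  # avoid log(0) and division by zero
        alpha = 0.5 * math.log((1 - err) / err)
        learners.append((alpha, thr, pol))
        preds = [stump_predict(x, thr, pol) for x in xs]
        weights = reweight(weights, ys, preds, alpha)
    return learners

def predict(learners: Sequence[Stump], x: float) -> int:
    """Sign of the alpha-weighted sum of the weak learners' predictions."""
    score = sum(a * stump_predict(x, t, p) for a, t, p in learners)
    return 1 if score >= 0 else -1

def majority_vote(votes: Sequence[int]) -> int:
    """Hard vote over incremental learners; ties count as negative (assumption)."""
    return 1 if sum(votes) > 0 else -1
```

In this setting, `majority_vote` would be applied to the outputs of the incremental learners whose partitions are available for the sample being predicted.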
2.1.2.6 Comparison with State-of-the-Art Tools
The performance of the EPIP model was compared with that of two state-of-the-art tools,
TargetFinder [13] and Ripple [12]. TargetFinder predictions were published for six cell lines:
GM12878, HeLa, HUVEC, IMR90, K562 and NHEK. Also, the positive and negative EP-pairs used for
TargetFinder prediction were provided for four classifiers (https://github.com/shwhalen/targetfinder),
among which the gradient boosting classifier (GBM) showed the best precision and recall. Thus, EPIP
was compared with TargetFinder by executing EPIP and the TargetFinder GBM classifier on both
the TargetFinder and EPIP data.
Ripple predicts EPIs using a combination of random forest classifiers and group LASSO in a
multi-task learning framework [12]. It used DNase-Seq, ChIP-Seq and RNA-Seq peaks in 5C
(GSE39510) and Hi-C (GSE63525) datasets to design the training and test data in the GM12878,
H1-hESC, HeLa and K562 cell lines. We compared EPIP with Ripple by executing them on the EPIP data
and the TargetFinder data in three shared cell lines: GM12878, HeLa and K562. We did not use
the Ripple data directly for the comparison because (i) the Ripple data were balanced, which does
not reflect reality well, where there tend to be many more negative EP-pairs than positives; (ii) the
data had a poor overlap with the FANTOM enhancers; and (iii) the data labeled very closely located
enhancers and promoters as positive EP-pairs, whereas EPIP requires a distance of at least 2.5 kb
between the enhancer and promoter of a positive EP-pair.
The comparison between EPIP and the two tools was done using 10-fold cross-validation,
following the same strategy used by TargetFinder and Ripple. To generate TargetFinder features
for an EP-pair, we used the generate_training.py script provided in TargetFinder source code. We
followed the steps mentioned in the TargetFinder readme file to apply the 10-fold cross-validation
on the training data using TargetFinder GBM model. We used the genFeatures tool in Ripple to
generate Ripple features for an EP-pair. We executed the runAllfeatures_crosscellline.m
MATLAB file provided in Ripple source code to apply 10-fold cross-validation on the training
data using Ripple model.
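For readers unfamiliar with the protocol, 10-fold cross-validation partitions the labeled EP-pairs into ten folds and uses each fold once as the test set while training on the other nine. The sketch below only generates the index splits; it is a generic illustration with hypothetical names and does not reproduce the TargetFinder or Ripple scripts mentioned above.

```python
import random
from typing import List, Tuple

def k_fold_indices(n: int, k: int = 10, seed: int = 0) -> List[Tuple[List[int], List[int]]]:
    """Return k (train_indices, test_indices) splits over n shuffled samples."""
    if not 2 <= k <= n:
        raise ValueError("k must be between 2 and the number of samples")
    rng = random.Random(seed)
    idx = list(range(n))
    rng.shuffle(idx)
    folds = [idx[i::k] for i in range(k)]  # fold sizes differ by at most one
    splits = []
    for i, test in enumerate(folds):
        train = [j for f, fold in enumerate(folds) if f != i for j in fold]
        splits.append((train, test))
    return splits
```

Every sample appears in exactly one test fold, so the ten test-set scores together cover the whole dataset once.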
2.1.3 Results
2.1.3.1 Reliable Prediction of EPIs
EPIP showed high performance in predicting EPIs in five types of test datasets: balanced test data,
unbalanced test data, all EP-pairs within 2.5 kb to 2 Mb, EP-pairs defined with varied Hi-C read
cutoffs and EP-pairs from other studies.
First, EPIP was tested on balanced and unbalanced test data and on all EP-pairs within 2.5 kb to
2 Mb (Chapter 2.1.2.3). We made sure that none of the EP-pairs in the test data were used in the
training data. On average, EPIP showed an AUROC of 0.96, 0.96 and 0.95; an AUPR of 0.96, 0.92 and
0.73; and an F1 score of 0.99, 0.95 and 0.51 for the balanced, unbalanced and all EP-pairs within
2.5 kb to 2 Mb test data, respectively (Table 2-1). The low F1 score on the third dataset was due
to the imbalance between the positives and negatives in this dataset (the number of negatives was
around 13 times the number of positives). On this test dataset, the recall was higher than 0.92 in
all the cell lines, including KBM7, even though KBM7 did not have any epigenomic features. EPIP
showed a much higher precision and F1 score in GM12878 than in the other cell lines, which might be
due to the much higher sequencing depth of GM12878. EPIP was also tested on cell-specific EP-pairs
within 2.5 kb to 2 Mb, where it predicted 12,455 (99.26%) of the 12,548 cell-specific EP-pairs in
the seven samples.
Table 2-1: EPIP on balanced test data, unbalanced test data, and all pairs within 2.5 kb to 2 Mb
test data.
Cell line  AUROC                    AUPR                     F1                       Precision                Sensitivity/Recall       Predicted condition-specific EPIs (fraction)
GM12878    0.7322 (0.7661, 0.7657)  0.5761 (0.7669, 0.5686)  0.8993 (0.9967, 0.9827)  0.818 (0.9967, 0.9691)   0.9985 (0.9967, 0.9967)  0.9984 (0.9964, 0.9954)
HMEC       0.9768 (0.9933, 0.9908)  0.6714 (0.9931, 0.967)   0.2914 (0.9837, 0.9084)  0.1707 (0.977, 0.84)     0.9924 (0.9904, 0.989)   0.9351 (0.8397, 0.8444)
HUVEC      0.9925 (0.9962, 0.9965)  0.6575 (0.9957, 0.9793)  0.4233 (0.99, 0.9576)    0.2688 (0.9915, 0.9242)  0.9958 (0.9886, 0.9934)  0.9167 (0.6, 0.5588)
IMR90      0.9875 (0.9977, 0.9967)  0.9248 (0.9976, 0.9854)  0.7416 (0.9971, 0.9695)  0.6205 (0.9961, 0.9442)  0.9216 (0.998, 0.9961)   0.8937 (0.9953, 0.988)
K562       0.9974 (0.9987, 0.9987)  0.9664 (0.9987, 0.9959)  0.6412 (0.9931, 0.9581)  0.4746 (0.9931, 0.9258)  0.9882 (0.9931, 0.9927)  0.9736 (0.9739, 0.982)
KBM7       0.9722 (0.9818, 0.98)    0.6455 (0.9802, 0.9344)  0.2155 (0.9795, 0.8888)  0.1209 (0.9804, 0.8162)  0.9905 (0.9787, 0.9756)  0.9853 (0.9658, 0.9592)
NHEK       0.9851 (0.9959, 0.996)   0.6473 (0.9952, 0.9791)  0.388 (0.9892, 0.9314)   0.2408 (0.988, 0.8772)   0.9974 (0.9904, 0.9928)  0.9524 (0.7333, 0.8095)
Overall    0.95 (0.96, 0.96)        0.73 (0.96, 0.92)        0.51 (0.99, 0.95)        0.34 (0.99, 0.90)        0.99 (0.99, 0.99)        0.99 (0.99, 0.98)
In each entry with three numbers, the three numbers in order are for all pairs within 2.5 kb to 2 Mb test data, balanced
test data, and unbalanced test data, respectively. In the last row, the AUROC and AUPR were the averages of the
AUROC and AUPR for the seven samples, and the F1, Precision and Sensitivity/Recall were calculated using the total
number of true positives, true negatives, false positives and false negatives for the seven samples.
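The overall F1, precision and recall in the last row of Table 2-1 are pooled over the seven samples
rather than averaged. A minimal sketch of such a pooled calculation is given below. The confusion
counts are made-up placeholders (the per-sample counts are not listed in the table), and the
function name pooled_metrics is an assumption made for illustration.

```python
def pooled_metrics(counts):
    """Compute precision, recall and F1 from confusion counts summed over samples.

    `counts` is a list of (tp, fp, fn) tuples, one tuple per sample.
    """
    tp = sum(c[0] for c in counts)
    fp = sum(c[1] for c in counts)
    fn = sum(c[2] for c in counts)
    precision = tp / (tp + fp) if tp + fp else 0.0
    recall = tp / (tp + fn) if tp + fn else 0.0
    f1 = 2 * precision * recall / (precision + recall) if precision + recall else 0.0
    return precision, recall, f1


# Hypothetical counts for two samples: one precise sample, one with many false positives.
example = [(90, 10, 10), (10, 90, 0)]
precision, recall, f1 = pooled_metrics(example)
```

With pooled counts, a sample with many false positives lowers the overall precision and F1 even
when recall stays high, which is the pattern seen for the all-pairs test data.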
The performance of EPIP stated above was on the test datasets, where the positive and negative
chromatin contacts were decided based on the cutoffs 30 and 5. Since these cutoffs were not
rigorously determined, we tested EPIP on the more strictly defined Hi-C looplists at 5 kb resolution
[8], the Hi-C data for IMR90 [20] and the ChIA-PET data for K562 and MCF7 [21]. To generate
the EP-pairs, the strictly defined interacting regions in these studies were overlapped with "active"
enhancers and "active" promoters. On the three datasets, EPIP showed average precision scores of
0.90, 0.89 and 0.93, respectively; average recall scores of 0.83, 0.81 and 0.89, respectively; and
average F1 scores of 0.86, 0.85 and 0.91, respectively (Figure 2-3). Interestingly, although EPIP
was not trained on the MCF7 cell line, it could still correctly predict 89.70% of EPIs in this cell
line.
EPIP was also tested on all the EP-pairs within 2.5 kb to 2 Mb with positives defined by four more
normalized read cutoffs: 10, 20, 50 and 100. The same negative dataset (cutoff 5) was used for
the four positive datasets. Overall, as the positive cutoff increased, the AUROC scores showed an
increasing trend while the AUPR and the F1 scores showed a decreasing trend. This might be due to
the higher imbalance created by the decreasing number of positive EP-pairs against a constant
number of negative EP-pairs at the larger cutoffs. The recall (sensitivity) was larger than 0.92
for all cutoffs and showed an increasing trend as the cutoff got higher. The results suggest that
the trained EPIP model was robust and reliable in predicting true positive EP-pairs. Because the
higher cutoff data are more likely to contain real enhancer-promoter interactions, the higher
recall of EPIP on the higher cutoff data supports the efficiency of EPIP. Since the negative
EP-pairs were the same under different cutoffs, the specificity was constant (0.80). The average
precision decreased from 0.76 at the cutoff 10 to 0.09 at the cutoff 100. This dramatic decrease in
the precision scores from lower to higher cutoffs suggested that the larger cutoffs 50 and 100
might be too stringent and the lower cutoffs 10 and 20 might be too slack. In that case, the cutoff
30 might be the proper one to define positives, especially since EPIP had good precision and recall
with this cutoff on the more strictly defined EP-pairs from the above three previous studies
(Figure 2-3).
Figure 2-3: The overall performance of EPIP on external datasets.
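A drop in precision at constant specificity can follow from the class ratio alone. Writing r for
recall, s for specificity and k for the number of negatives per positive, precision equals
r / (r + (1 - s) k). The sketch below evaluates this identity. The recall value and the ratios k
are illustrative assumptions, not the actual values in the EPIP test sets; only the specificity of
0.80 is taken from the text.

```python
def precision_from_rates(recall, specificity, neg_per_pos):
    """Precision implied by recall, specificity and the negative:positive ratio."""
    true_pos = recall                              # expected true positives per positive
    false_pos = (1.0 - specificity) * neg_per_pos  # expected false positives per positive
    return true_pos / (true_pos + false_pos)


# Specificity fixed at 0.80 as in the text; recall and ratios below are hypothetical.
for ratio in (1, 10, 50):
    print(ratio, round(precision_from_rates(0.95, 0.80, ratio), 3))
```

As the positives become rarer relative to a fixed negative set, the same number of false positives
is divided among fewer true positives, so precision falls even if recall rises.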
In summary, EPIP predicted EPIs with high precision, recall and F1 scores on varied datasets,
including the published datasets from previous studies. To test EPIP on a more reliable unified
dataset, the dataset of all EP-pairs within 2.5 kb to 2 Mb defined by the cutoffs 30 and 5 was
overlapped with the published datasets and different cutoffs. On this dataset, EPIP showed an
AUROC of 0.95 and an AUPR of 0.73 on average.
2.1.3.2 Reliable Prediction of Cell-Specific EPIs
The performance of EPIP was studied on the prediction of cell-specific EPIs in different cell lines.
To evaluate the performance of EPIP on each cell line, a fresh model was trained on the samples
from the other six cell lines and then tested on the samples from the remaining cell line. Seven
separate EPIP models were generated in this way. The positive and negative EP-pairs for the
training data were generated in the same way as before with the cutoffs 30 and 5 (Chapter 2.1.2.2,
second paragraph), respectively. Each EPIP model was evaluated by the combination of the balanced
and unbalanced models (Chapter 2.1.2.2, last paragraph), trained on the samples from six cell lines.
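The leave-one-cell-line-out scheme described above can be sketched as follows. The cell line names
are those listed in Table 2-2; the function name is an assumption made for illustration, and the
actual training and testing of each model is omitted.

```python
CELL_LINES = ["GM12878", "HMEC", "HUVEC", "IMR90", "K562", "KBM7", "NHEK"]


def leave_one_cell_line_out(cell_lines):
    """Yield (training cell lines, held-out cell line) for each cell line in turn."""
    for held_out in cell_lines:
        training = [c for c in cell_lines if c != held_out]
        yield training, held_out


splits = list(leave_one_cell_line_out(CELL_LINES))
```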
The separate EPIP models for the seven cell lines showed an average AUROC of 0.96 and an average
AUPR of 0.89 on the held-out seventh sample, when tested on all EP-pairs within 2.5 kb to 2 Mb
based on the cutoffs 30 and 5 (Table 2-2). When evaluated on cell-specific EP-pairs, the EPIP
models predicted 5498 (97.66%) of the total 5630 cell-specific EP-pairs in all the cell lines
except GM12878. EPIP predicted only 31.77% of cell-specific EP-pairs in GM12878 (Table 2-2).
One likely reason behind the poor performance of EPIP on cell-specific EP-pairs in GM12878 could be
the much higher Hi-C sequencing depth in GM12878 than in the other cell lines. In other words, the
quality of the EP-pairs in other samples was different from that in GM12878. To test this
hypothesis, the same EPIP model trained on the other six samples based on the cutoffs 30 and 5 was
evaluated on cell-specific EP-pairs defined with the cutoff 100 in GM12878. EPIP correctly
predicted 2396 (78.69%) of the 3045 cell-specific EP-pairs in GM12878 defined by the cutoff 100.
So, overall, EPIP reliably predicted the cell-specific EP-pairs in a new cell line, with a recall of
91.00% (7894 out of 8675 cell-specific EPIs) in all the seven cell lines.
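The overall recall pools the six cell lines evaluated at the cutoff 30 with GM12878 evaluated at
the cutoff 100. The arithmetic can be checked directly from the counts quoted above (the variable
names are chosen here for illustration):

```python
# Counts quoted in the text
predicted_other_six, total_other_six = 5498, 5630  # six cell lines, positives at cutoff 30
predicted_gm12878, total_gm12878 = 2396, 3045      # GM12878, positives at cutoff 100

recall_other_six = predicted_other_six / total_other_six
recall_gm12878 = predicted_gm12878 / total_gm12878
recall_overall = (predicted_other_six + predicted_gm12878) / (total_other_six + total_gm12878)

print(f"{recall_other_six:.2%} {recall_gm12878:.2%} {recall_overall:.2%}")
# prints: 97.66% 78.69% 91.00%
```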
Table 2-2: Performance of the cell-specific EPIP models in predicting condition-specific EPIs.
Test cell line          AUROC        AUPR         F1           Precision    Sensitivity/Recall  # condition-specific EPIs  Predicted condition-specific EPIs (fraction)
GM12878                 0.7379       0.7015       0.5347       0.9785       0.3678              20004                      0.3177
GM12878 (cutoff 100+5)  0.9816       0.9657       0.9002       0.9578       0.8491              3045                       0.7869
HMEC                    0.987        0.9119       0.4762       0.3129       0.9957              147                        0.9592
HUVEC                   0.9938       0.9174       0.5203       0.3529       0.9896              30                         0.8333
IMR90                   0.9966       0.988        0.8744       0.7806       0.9938              605                        0.9868
K562                    0.998        0.9934       0.8219       0.7029       0.9894              655                        0.9802
KBM7                    0.9711       0.6777       0.3951       0.2471       0.9849              4152                       0.9769
NHEK                    0.9974       0.9812       0.7043       0.5451       0.995               41                         0.9024
Overall                 0.96 (0.99)  0.89 (0.92)  0.55 (0.70)  0.49 (0.55)  0.62 (1.00)         28679 (8675)               0.50 (0.91)
Except in the last row, the numbers in a row are based on the EPIP model trained on the remaining six samples and
then tested on the sample specified in this row. In the last row, the first number shows the average statistics based on
the 30+5 cutoff EP-pairs in seven samples, while the number in the parentheses shows the average statistics with
the 100+5 cutoff EP-pairs in GM12878 together with the 30+5 cutoff EP-pairs in the other six samples.
2.1.3.3 Better Performance in EPI Prediction than the State-of-the-Art Methods
The performance of EPIP was compared with that of two recently published methods, TargetFinder and
Ripple, on the TargetFinder data and on the EPIP test data of all EP-pairs within 2.5 kb to 2 Mb.
On both datasets, EPIP showed a better performance than TargetFinder and Ripple (Table 2-3).
First, EPIP was compared with TargetFinder and Ripple on the dataset used in the
TargetFinder study (Table 2-3). This dataset contained six cell lines: GM12878, HeLa, HUVEC,
IMR90, K562 and NHEK. On the six cell lines, EPIP showed an average AUROC, AUPR, F1,
precision, recall and specificity of 0.95, 0.84, 0.64, 0.98, 0.48 and 1.00, respectively, compared to
0.92, 0.59, 0.50, 0.72, 0.39 and 0.99, respectively, by TargetFinder and 0.75, 0.19, 0.02, 0.75, 0.01
and 1.00, respectively, by Ripple (Table 2-3). Ripple's poor performance indicates that Ripple
could not deal well with unbalanced data, which are more representative of real-world
data.
Table 2-3: Comparison with TargetFinder and Ripple on TargetFinder and EPIP data.
Data               Comparison            Method        Pos    Neg     AUROC   AUPR    F-score  Precision  Sensitivity/Recall
TargetFinder data  EPIP vs TargetFinder  TargetFinder  9899   197500  0.924   0.5864  0.5021   0.7225     0.3848
TargetFinder data  EPIP vs TargetFinder  EPIP          9899   197500  0.95    0.8386  0.6422   0.9763     0.4784
TargetFinder data  EPIP vs Ripple        Ripple        5830   116500  0.7478  0.1922  0.0146   0.7544     0.0074
TargetFinder data  EPIP vs Ripple        EPIP          5830   116500  0.9519  0.8514  0.6759   0.9748     0.5173
EPIP data          EPIP vs TargetFinder  TargetFinder  25865  73463   0.959   0.8695  0.8618   0.9436     0.7932
EPIP data          EPIP vs TargetFinder  EPIP          26381  77179   1       0.982   0.9935   0.9879     0.9992
EPIP data          EPIP vs Ripple        Ripple        23808  52313   0.6637  0.3924  0.3565   0.6066     0.2524
EPIP data          EPIP vs Ripple        EPIP          23808  52313   1       0.995   0.9955   0.992      0.9992
The comparison between TargetFinder and EPIP on TargetFinder data was done for six common samples (GM12878,
HeLa, HUVEC, IMR90, K562 and NHEK). The comparison between Ripple and EPIP on TargetFinder data was done
for the three common samples (GM12878, HeLa and K562). When tested on EPIP data, the comparison between
TargetFinder and EPIP was done for the common five samples (except HeLa, as HeLa did not have EPIP data).
Similarly, the EPIP and Ripple comparison on the EPIP data was on two common samples (except HeLa).
Next, EPIP was compared with TargetFinder and Ripple on the all EP-pairs test data within 2.5 kb
to 2 Mb (Table 2-3). Among the seven cell lines used for EPIP design, five (GM12878, HUVEC,
IMR90, K562, NHEK) were in common with the TargetFinder study and only two (GM12878 and K562)
were in common with the Ripple study. In comparison with TargetFinder on the five common cell lines,
EPIP showed an average AUROC, AUPR, F1, precision, recall and specificity of 1.00, 0.98, 0.99,
0.99, 1.00 and 1.00, respectively, while the best model of TargetFinder, GBM, showed 0.96, 0.87,
0.86, 0.94, 0.79 and 0.98, respectively. On the two common cell lines, Ripple showed an average
AUROC, AUPR, F1, precision, recall and specificity of 0.66, 0.39, 0.36, 0.61, 0.25 and 0.93,
respectively, while EPIP showed much better scores of 1.00, 1.00, 1.00, 0.99, 1.00 and 1.00,
respectively, on the same dataset.
When compared on the cell-specific EPIs in the TargetFinder data, EPIP predicted 51.36% of the 8471
cell-specific EP-pairs in the six samples, while TargetFinder predicted 38.85% of them. On the
three common cell lines (GM12878, HeLa and K562) between the TargetFinder and Ripple
studies, Ripple predicted only 0.53% of the 5787 cell-specific EP-pairs, while EPIP predicted
54.42% of them. The lower accuracy of EPIP on cell-specific EP-pairs of the TargetFinder data,
compared to that on the EPIP test data, may be because the overall quality of the TargetFinder data
was not good. For instance, the enhancers and promoters used by TargetFinder were from
computational predictions [16, 24], which were prone to errors. Moreover, as we investigated,
almost 50% of the enhancer and promoter regions overlapped with each other. Also, the negative
EP-pairs used in TargetFinder might be loosely defined. TargetFinder labeled an EP-pair
"negative" if it did not overlap the contacts of any resolution in the Rao et al. looplists. Note
that the looplists defined in Rao et al. were finely selected Hi-C contacts with a false discovery
rate of 0.1. Due to the stringency of the looplists, although they are likely to represent positive
EPIs, the EP-pairs not identified by the looplists are not necessarily negative pairs.
For the cell-specific EPIs among all pairs within 2.5 kb to 2 Mb, EPIP clearly outperformed
TargetFinder and Ripple. On the five cell lines in common with the TargetFinder study, EPIP predicted
99.99% of the cell-specific EP-pairs, while TargetFinder predicted only 83.91% of them. On the
two cell lines in common with the Ripple study, EPIP predicted 99.99% of the cell-specific EP-pairs,
while Ripple could predict only 27.07% of them.
2.1.4 Discussion
EPIs are one of the major factors that initiate gene transcription. Proper identification of EPIs can
help to understand gene transcription regulation. The active EPIs can be different for different cell
types. At this moment, the performance of the available EPI prediction tools is not satisfactory,
especially in terms of cell-specific EPIs. Here a computational method, EPIP, was developed to
learn the patterns of EPIs and to predict cell-specific EPIs. On average, EPIP correctly predicts
99.26% of cell-specific EPIs in different cell lines. EPIP also performed better than two state-of-
the-art EPI prediction tools.
The design of EPIP incorporates a robust framework to integrate useful features for EPI
predictions. Using a feature partitioning strategy, EPIP can work as efficiently for the cell lines
with partially available features as for those with abundant features. As a result, EPIP can be
trained on different types of samples, which makes the training model more accurate and broadly
representative. Moreover, the learning approach of EPIP facilitates incremental training of the
model as new data become available.
While training EPIP with different cell lines, the order of the cell lines does not matter. This means
that data from different samples can be fed to the training model in any order. To investigate
whether the order of the cell lines in training has an impact on the performance of EPIP, we
considered HUVEC as the test cell line and trained the EPIP model on the remaining six samples
in all possible 720 orders. The standard deviation of the AUROC and the F1 score was 0.001 and
0.002, respectively, for all 720 different orders of training in these experiments. This shows that
the order of the cell lines used in training EPIP does not significantly impact the final performance.
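The order experiment above enumerates all 6! = 720 permutations of the six training cell lines. A
minimal sketch of that enumeration follows. The function score_for_order is a hypothetical
placeholder for training EPIP in the given order and scoring it on HUVEC, and the use of the
population standard deviation is an assumption, since the text does not say which estimator was used.

```python
from itertools import permutations
from statistics import pstdev

TRAINING_CELL_LINES = ["GM12878", "HMEC", "IMR90", "K562", "KBM7", "NHEK"]


def score_for_order(order):
    """Placeholder for 'train on `order`, test on HUVEC, return AUROC'.

    Hypothetical stand-in: returns a constant so the sketch runs end to end.
    """
    return 0.99


orders = list(permutations(TRAINING_CELL_LINES))  # 6! = 720 training orders
scores = [score_for_order(order) for order in orders]
spread = pstdev(scores)  # population standard deviation across all orders
```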
EPIP was trained with a reliable set of available enhancers. So far, FANTOM enhancers arguably
represent the largest set of enhancers that are defined with a consistent criterion and supported by
experiments. But the number of FANTOM enhancers is small compared with the known and
predicted enhancers in various studies [16]. So, to generate abundant reliable enhancers, the
FANTOM enhancers were overlapped with computationally predicted ChromHMM enhancers and
H3K27ac ChIP-seq peaks. However, the efficiency of EPIP on the EP-pairs generated from a new
enhancer source remains to be evaluated, due to the lack of such data thus far. When a
larger and more reliable set of experimentally determined enhancers becomes available in the
future, it will be necessary to test EPIP on the EP-pairs based on the new set of enhancers to make
sure that it performs similarly.
The EPIP models trained on the EP-pairs using the looplists defined by Rao et al. generated
suboptimal results due to the smaller size of the training data. Hence, the cutoffs 30 and 5 were
used to define positive and negative samples, respectively. This combination of cutoffs was selected
based on the test results with different cutoff combinations and our previous studies [14, 19]. The
EP-pairs defined with this approach may not yet be perfect and may suffer from the following
drawbacks or dilemmas. First, the available methods to analyze Hi-C contact matrices are still
suboptimal [25], which prevents accurate interacting regions from being defined. Second, the cutoff
combination chosen was a tradeoff between too strict (such as the Rao et al. looplists) and too loose
(such as those from the cutoff 10) chromatin contacts, which might still affect the quality of the
obtained EP-pairs. Third, as mentioned above, the FANTOM enhancers only represent a portion
of existing enhancers, while the ChromHMM enhancers are not so reliable. Although these two
enhancer sets were used together with the H3K27ac peaks to define active enhancers, the data may
still miss some positive EP-pairs. Finally, a fixed cutoff of 30 does not consider the exponential
decay of the number of supporting Hi-C reads with the increasing distance between enhancers and
promoters, which may cause true positive EP-pairs to be missed as well.
Despite the limitations in the quality of enhancers and the criterion to extract EP-pairs, the good
performance of EPIP on the EP-pairs based on the interacting regions defined by other studies
makes us believe that the majority of the positives and negatives in the training data represent the
true data. Moreover, EPIP showed a consistently high recall/sensitivity when different cutoffs were
used to define positive EP-pairs. EPIP also performed well when tested on the remaining 70% of
untrained EP-pairs. The performance of EPIP on a variety of test datasets again allows us
to believe that EPIP did a good job in learning to classify the interacting EP-pairs from the
non-interacting ones.
Even though EPIP showed a better performance compared to the state-of-the-art methods, there is
still room for improvement. For instance, the training data used in this study are not perfect. With
the availability of more accurate and broadly representative training data in the future, the
performance of EPIP can be improved further. Here only Hi-C was used to extract training data.
It is worth studying how the performance of EPIP improves using EPIs from other sources of
chromatin interaction, such as ChIA-PET and 5C, together with Hi-C. Also, the Hi-C
chromatin contacts used here were preprocessed by Rao et al. considering various algorithmic
tradeoffs. Extraction of chromatin interactions from raw Hi-C data may help to improve the
performance of EPIP. Finally, as shown in a previous study [14], multiple EPIs can be
interconnected due to complicated chromatin structures. Like almost every other existing method,
EPIP considers each EP-pair independently to predict EPIs, while considering multiple EP-pairs
together may add a different perspective.
2.2 An Intriguing Characteristic of Enhancer-Promoter Interactions
2.2.1 Background
Enhancer-promoter interaction is one of the major factors of gene transcription. Enhancers are
short genomic regions that interact with gene promoters to initiate gene transcription. Despite
being located far from their target genes, enhancers come in direct contact with the gene promoters
via chromatin looping to control the temporal and spatial expression of the target genes [21, 26-31].
The distance between enhancers and their targets validated by low-throughput experiments
can be about one mega bps (Mbps) [26, 27]. Recent high-throughput experiments showed that the
distance can be even larger than two Mbps in many cases [8, 32]. Because of such long and variable
distances, it is still challenging to identify interacting enhancer-promoter pairs (IEPs). In this
study, an IEP refers to an enhancer-promoter pair that physically interacts, although such an
interaction may or may not have any functional effect observed yet.
Identification of the active enhancers is a part of the problem of finding the IEPs. Early
experimental studies identified enhancers by "enhancer trap", which established our rudimentary
understanding of enhancers in spite of its low-throughput and time-consuming nature [33, 34].
Early computational methods predicted enhancers through comparative genomics, which is
cost-effective but may produce many false positives. With the availability of next-generation
sequencing (NGS) technologies, enhancers are now identified through a variety of experimental
methods such as chromatin immunoprecipitation followed by massively parallel sequencing
(ChIP-seq), DNase I hypersensitive sites sequencing (DNase-seq), global run-on sequencing (GRO-seq),
cap analysis gene expression (CAGE), etc. [1, 35-39]. In the ChIP-seq experiments, genomic
regions enriched with H3K4me1 and H3K27ac modifications are widely considered as active
enhancers, and those with H3K4me1 and H3K27me3 modifications are regarded as repressed
enhancers [36]. In the DNase-seq experiments, distal open chromatin regions are considered as
potential enhancers for gene regulation studies [5, 11, 40, 41]. In the GRO-seq and CAGE
experiments, bidirectional transcripts are employed to identify active enhancers [1, 42, 43].
Numerous computational methods were developed based on the NGS data to predict enhancers on
the genome-wide scale [16, 24, 36, 44]. These methods range from the early ones that are based
solely on H3K4me3 and H3K4me1 ChIP-seq experiments to the later ones that are based on
various types of epigenomic and genomic signals.
A large number of enhancers have been discovered so far by different experimental and
computational methods. The VISTA database includes about 2,900 enhancers from comparative
genomics that were tested with a mouse transgenic reporter assay [45]. The functional annotation of
the mouse/mammalian genome (FANTOM) project
(http://FANTOM.gsc.riken.jp/5/datafiles/latest/extra/Enhancers/) identified 32,693 enhancers
from balanced bidirectional capped transcripts [1]. This set of enhancers is arguably the largest set
of mammalian enhancers with supporting experimental evidence [46]. Computational methods
such as ChromHMM and Segway also contributed to predicting thousands of human enhancers
[16, 24]. This set of enhancers is regarded as the most comprehensive set of computationally
predicted human enhancers available so far. In addition to the individual enhancers, groups of
enhancers in a genomic region, called "super-enhancers", were identified that can collectively
control the expression of genes involved in cell identities [47, 48].
Although the discovery of enhancers has been relatively straightforward, the identification of IEPs
is still nontrivial. Early experimental procedures to identify IEPs are expensive and
time-consuming [4, 49]. Recent Hi-C experiments hold great promise to identify IEPs on the
genome scale, but are still not cost-effective in terms of capturing high-resolution Hi-C
interactions [8, 20, 32]. To date, these experiments have only been carried out on a few cell lines
or cell types. Computational methods have also evolved a lot, from the early ones that regarded the
closest genes as target genes, to the later ones which considered the correlation of epigenomic
signals in enhancers and those in promoters, to the current ones that are based on more
sophisticated approaches [1, 6, 9-14, 50]. Although these methods have shown some success in
predicting enhancer target genes, they either do not consider or have a low performance on
cell-specific IEP prediction [12]. From the results of these experimental and computational
studies, self-interacting genomic regions of several mega bases were discovered in mammalian
genomes, called topologically associated domains (TADs). IEPs usually fall within the TADs instead
of crossing different TADs [51].
Almost all the existing computational methods consider one enhancer-promoter pair at a
time to determine whether they interact. We hypothesized that when two enhancers interact with
a common target gene, these two enhancers may be spatially close to each other and may thus
interact with all target genes of both enhancers. In other words, if two enhancers share a target
gene, they may share all of their target genes as well. If this hypothesis is true, we should consider
the interactions of multiple enhancers and multiple target genes simultaneously to predict IEPs,
which may improve the accuracy of the computational prediction of the IEPs, especially that of
cell-specific IEPs.
To find out how different enhancers may share their target genes, the experimentally supported
IEPs from five previous studies [8, 20, 21, 32, 52] were collected and investigated in different cell
lines and cell types. The enhancers used in this study include both the experimentally annotated
enhancers from FANTOM and the computationally predicted enhancers by ChromHMM in
different samples [1, 16]. We observed that two enhancers are likely to either share almost all of
their target genes or interact with two completely disjoint sets of target genes, in a cell line or a
cell type. This observation implies an interesting characteristic of IEPs, which has not been
considered by the existing studies to predict IEPs. This study may also shed new light on the
underlying principles of chromatin interactions and facilitate the more accurate identification of
IEPs.
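The "share almost all or share none" observation can be illustrated on a toy set of IEPs. The
sketch below uses the Jaccard index of two enhancers' target-gene sets as a simple stand-in for
target sharing; it is not the BCC measure used in this study, and the enhancer and gene identifiers
are hypothetical.

```python
from itertools import combinations


def target_overlap(targets_a, targets_b):
    """Jaccard index of two enhancers' target-gene sets (0 = disjoint, 1 = identical)."""
    union = targets_a | targets_b
    return len(targets_a & targets_b) / len(union) if union else 0.0


# Toy enhancer -> target genes map (hypothetical identifiers)
targets = {
    "e1": {"g1", "g2"},
    "e2": {"g1", "g2"},
    "e3": {"g3"},
}

overlaps = {
    (a, b): target_overlap(targets[a], targets[b])
    for a, b in combinations(sorted(targets), 2)
}
```

In this toy example every enhancer pair scores either 1 (identical target sets) or 0 (disjoint
target sets), which is the bimodal pattern the hypothesis predicts.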
Figure 2-4: Generation of IEPs and calculation of BCC. (A) The process of generating IEPs using
the chromatin interaction data from five studies, enhancer regions from FANTOM and
ChromHMM, and promoters defined around the GENCODE annotated gene TSSs. (B) A toy
interaction network between three enhancers (e1, e2 and e3) and three promoters (p1, p2 and p3).
[Expression for the average BCC of the three enhancers in this example not recoverable.]
2.2.2 Materials and Method
2.2.2.1 Enhancers and Promoters
Two sets of enhancers were used in this study (Figure 2-4A). The first set contained the 32,693
enhancers annotated by FANTOM, which had been obtained from the balanced bidirectional
capped transcripts [1]. The FANTOM enhancers were downloaded from the FANTOM5 Human
Enhancer Selector (http://slidebase.binf.ku.dk/human_enhancers/results). The second set
contained the enhancers computationally predicted by ChromHMM [16] in the following seven
cell lines: GM12878, HMEC, HUVEC, K562, NHEK, IMR90 and HeLa. ChromHMM is widely
used to partition genomes into different functional units including enhancers. The ChromHMM
enhancers for the GM12878, HMEC, HUVEC, K562 and NHEK cell lines were downloaded from the
ENCODE composite track
(http://hgdownload.cse.ucsc.edu/goldenpath/hg19/encodeDCC/wgEncodeBroadHmm) of the UCSC
Genome Browser. The ChromHMM enhancers for the HeLa and IMR90 cell lines were downloaded,
respectively, from the ENCODE Genome Segmentation track of the UCSC Genome Browser
(http://hgdownload.cse.ucsc.edu/goldenpath/hg19/encodeDCC/wgEncodeAwgSegmentation/)
and from the chromatin state model based on imputed data (25 states, 12 marks, 127 epigenomes)
(https://egg2.wustl.edu/roadmap/data/byFileType/chromhmmSegmentations/ChmmModels/imputed12marks/jointModel/final/E017_25_imputed12marks_mnemonics.bed.gz).
The FANTOM enhancers are not cell-specific, while the ChromHMM predicted enhancers are
specific for the seven different cell lines mentioned. Thus "active" FANTOM enhancers were
defined by overlapping the enhancers with the H3K27ac ChIP-seq peaks in the corresponding cell
lines obtained from the Encyclopedia of DNA Elements (ENCODE) project [17]. For cell lines
without available H3K27ac ChIP-seq data such as KBM7, the enhancers that overlapped with the
chromatin interacting anchors in this cell line were considered as "active" enhancers [8].
To define promoters, the gene transcriptional start site annotations were downloaded from
GENCODE V19 [18]. The region from 1 kbps upstream to 100 bps downstream of each
transcriptional start site was considered as a promoter. In total, 57,820 promoters were obtained in
this way in the human genome. To define cell-specific active promoters, the available RNA-Seq
data in different cell lines (GM12878, HeLa, HUVEC, IMR90, K562 and NHEK) were used, as in a
previous study [13]. In a cell line, a promoter was considered "active" if the corresponding gene
had at least 0.30 reads per kbps of transcript per million mapped reads with the irreproducible
discovery rate of 0.1, similarly as used in the previous studies [13, 15]. For cell lines without
RNA-Seq data (HMEC and KBM7), all promoters were considered as active promoters [15].
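The promoter definition above can be sketched as a strand-aware window around the TSS. This is a
minimal illustration, not the code used in the study: the function names, the coordinate
convention (plain integer positions, clipped at zero) and the handling of the minus strand are
assumptions, and the irreproducible discovery rate criterion is not modeled.

```python
def promoter_interval(tss, strand, upstream=1000, downstream=100):
    """Return (start, end) of the promoter around a TSS, strand-aware.

    Upstream means lower coordinates on '+' and higher coordinates on '-'.
    """
    if strand == "+":
        return max(0, tss - upstream), tss + downstream
    if strand == "-":
        return max(0, tss - downstream), tss + upstream
    raise ValueError(f"unknown strand: {strand!r}")


def is_active(rpkm, threshold=0.30):
    """Expression criterion from the text: at least 0.30 RPKM."""
    return rpkm >= threshold
```

For example, promoter_interval(5000, "+") gives (4000, 5100), while the same TSS on the minus
strand gives (4900, 6000).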
2.2.2.2 IEPs from Five Studies
The experimentally supported chromatin contact information from five previous studies was
collected to define IEPs [8, 20, 21, 32, 52] (Figure 2-4A). These data arguably represent the
intra-chromosomal chromatin interactions defined with the highest resolutions by the corresponding
techniques. The first set of the data was downloaded from the Hi-C dataset GSE63525 in the Gene
Expression Omnibus (GEO) database [8]. This dataset contains significant intra-chromosomal
chromatin interactions with 5 kbps resolution, named "looplist", extracted for the following eight
cell lines: GM12878, HeLa, HMEC, HUVEC, IMR90, K562, KBM7 and NHEK [8]. The looplists
were defined with stringent criteria and were most likely to be true pairs of interacting genomic
regions, each of which was about 5 kbps long. In every cell line, each chromatin interaction in
the corresponding looplist was overlapped with the aforementioned two sets of active enhancers
and with the annotated active promoters to obtain IEPs. In other words, an obtained IEP consisted
of an enhancer and a promoter, where the enhancer overlapped with one of the interacting regions
of a chromatin interaction and the promoter overlapped with the other region. Since we had two
sets of enhancers, we obtained two sets of IEPs for each of the eight cell lines (Figure 2-4A). Note
that, since only the intra-chromosomal interactions were used in this study, the enhancer and
promoter in an IEP are always from the same chromosome.
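The overlap step described above can be sketched as follows. This is an illustrative
implementation under stated assumptions, not the code used in the study: intervals are half-open
(start, end) tuples on a single chromosome, a brute-force scan replaces any interval index, and all
names and coordinates are hypothetical.

```python
def overlaps(a, b):
    """True if half-open intervals a=(start, end) and b=(start, end) share a base."""
    return a[0] < b[1] and b[0] < a[1]


def ieps_from_loop(anchor1, anchor2, enhancers, promoters):
    """Pair enhancers on one loop anchor with promoters on the other anchor.

    `enhancers` and `promoters` map names to (start, end) on the same chromosome.
    """
    pairs = set()
    # Try both orientations: the enhancer may sit on either anchor.
    for enh_anchor, pro_anchor in ((anchor1, anchor2), (anchor2, anchor1)):
        for e_name, e_iv in enhancers.items():
            if not overlaps(e_iv, enh_anchor):
                continue
            for p_name, p_iv in promoters.items():
                if overlaps(p_iv, pro_anchor):
                    pairs.add((e_name, p_name))
    return pairs


# Toy example with hypothetical coordinates
enhancers = {"e1": (10_200, 10_600), "e2": (90_000, 90_400)}
promoters = {"p1": (251_000, 252_100)}
pairs = ieps_from_loop((10_000, 15_000), (250_000, 255_000), enhancers, promoters)
```

Here only e1 lies on a loop anchor, so the single IEP returned is (e1, p1).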
The number of IEPs obtained from the above looplists was small, especially when the FANTOM
enhancers were considered. The reason might be that the criteria Rao et al. used to define
looplists were quite stringent, so many true interacting genomic regions might have been missed
[15]. To capture more IEPs in these cell lines, the cell line specific contact matrix datasets from
the same study were used [8]. The contact matrix for a cell line contains the 5 kbps resolution
intra-chromosomal chromatin interactions supported by at least one Hi-C read. The numbers of reads
in the contact matrices were normalized using the KR normalization vector. The alternative sets of
IEPs were generated from the contact matrices with three normalized read cutoffs: 30, 50, and 100.
Given a normalized read cutoff, say x, if an enhancer-promoter pair overlapped with a pair of
interacting genomic regions that were supported by at least x normalized Hi-C reads, the
enhancer-promoter pair was considered as an IEP. The cutoff 30 was used as this cutoff was likely
to include almost all known IEPs in K562 and IMR90 from other studies [20, 21] without allowing too
many false positives [15]. The two other cutoffs (50 and 100) were used to see how the observed
enhancer characteristics may change with more stringent cutoffs. The IEPs from contact matrices
were not considered for HeLa because Rao et al. did not provide a Hi-C contact matrix for HeLa.
Since the sequencing depth was much higher in GM12878 than in the other seven cell
lines, the IEPs defined by the cutoff 400 were considered as highly reliable for GM12878 after
testing different cutoffs.
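The cutoff step can be sketched as below. The normalization shown, dividing a raw count by the
product of the KR factors of its two bins, is an assumption about how the KR vector is applied and
should be checked against the documentation of the downloaded dataset; the raw counts, factors and
function names are hypothetical.

```python
def normalized_count(raw, norm_i, norm_j):
    """Normalize a raw contact count by the product of the two bins' KR factors."""
    return raw / (norm_i * norm_j)


def contacts_above_cutoff(contacts, kr, cutoff):
    """Keep (bin_i, bin_j) pairs whose normalized read count is >= cutoff.

    `contacts` maps (bin_i, bin_j) to raw counts; `kr` maps bin index to KR factor.
    """
    return {
        pair
        for pair, raw in contacts.items()
        if normalized_count(raw, kr[pair[0]], kr[pair[1]]) >= cutoff
    }


# Hypothetical raw counts and KR factors for three 5 kbps bins
contacts = {(0, 1): 60, (0, 2): 20, (1, 2): 45}
kr = {0: 1.0, 1: 1.5, 2: 0.5}
kept = {c: contacts_above_cutoff(contacts, kr, c) for c in (30, 50, 100)}
```

Raising the cutoff can only shrink the set of retained contacts, so the IEP sets at the cutoffs 50
and 100 are subsets of the set at the cutoff 30.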
From another Hi-C study, 57,578 IEPs were downloaded for IMR90 cell line [20]. To our
knowledge, this was the only Hi-C dataset for human samples with a comparable sequencing depth
as that in Rao et al. In this study, Jin et al. defined active enhancers with H3K4me1 and H3K27ac
ChIP-seq peaks and active promoters with H3K4me3 ChIP-seq peaks together with the known
genes from the University of California, Santa Cruz genome browser. In addition to using the
original IEP dataset which was provided in the hg18 version [20], the IEPs were also converted
into the hg19 version and overlapped with the aforementioned enhancers and promoters used in
this study to define a new set of IEPs for the IMR90 cell line.
The IEPs defined by the ChIA-PET experiments in K562 and MCF7 were used as well for this
study [21]. Using the interacting regions in these datasets, a total of 2,923 and 2,190 IEPs were extracted
with the FANTOM enhancers for K562 and MCF7, respectively. For the ChromHMM enhancers,
the number of IEPs was 33,598 in K562. There were no ChromHMM enhancers available in
MCF7.
Additional IEPs were used in this study that are based on the active enhancer and promoter links
defined by Javierre et al. from promoter capture Hi-C experiments in nine cell types (Table 2-5 in
[32]). Javierre et al. performed the experiments on seventeen primary cell types, while the active enhancer
and promoter links were provided for nine cell types. Each link defined a pair of interacting
regions, with average lengths of 5,709 and 8,599 bps, respectively. Since Javierre et al. did not
explicitly specify the enhancers and promoters, these links were overlapped with the two sets of
enhancers and the GENCODE promoters to define two sets of IEPs. In total, 20,764 and 607,274
IEPs were obtained with FANTOM and ChromHMM enhancers, respectively.
The final chromatin interaction dataset for this study consisted of the interactions detected using a newly
developed method named “SPRITE” by the Guttman lab [52]. This dataset was downloaded from
the GEO database of NCBI with the accession number GSE114242. Among the available SPRITE
datasets, the only human dataset was in the GM12878 cell line with the lowest resolution of 25 kbps.
This dataset was filtered with three different read cutoffs (30, 50 and 100) to obtain IEPs.
A distance filter was applied to all the IEP sets found above. For every IEP, if the distance between
the corresponding enhancer and promoter was less than 2.5 kbps, that IEP was filtered out from the
analysis.
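A minimal sketch of such a distance filter is given below. The text does not state how the enhancer-promoter distance was measured, so this example assumes the gap between the nearest edges of the two intervals; the function names and the tuple layout are illustrative only.

```python
MIN_DISTANCE = 2_500  # 2.5 kbp threshold from the text


def gap(a, b):
    """Gap in bp between two half-open (start, end) intervals on the same
    chromosome; 0 if they overlap or touch. Edge-to-edge distance is an
    assumption, since the text does not specify the distance definition."""
    return max(0, max(a[0], b[0]) - min(a[1], b[1]))


def filter_ieps(ieps, min_distance=MIN_DISTANCE):
    """Keep only the (enhancer, promoter) pairs at least min_distance apart."""
    return [(e, p) for e, p in ieps if gap(e, p) >= min_distance]


# Toy IEPs: the first pair is 500 bp apart, the second 8,500 bp apart.
toy_ieps = [((1_000, 1_500), (2_000, 3_000)),
            ((1_000, 1_500), (10_000, 11_000))]
kept = filter_ieps(toy_ieps)
print(kept)
```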
2.2.2.3 Other Data Used
Rao et al. annotated chromatin contact domains in each of the eight cell lines [8]. These domains
were downloaded from GSE63525 and considered as the topologically associating domains
(TADs) in this study. The annotated TADs in IMR90 by Dixon et al. were also used, which were
generated by the same lab that generated the Jin et al. data [51].
The super-enhancers in GM12878, HeLa, HMEC, HUVEC, K562 and NHEK were downloaded
from http://asntech.org/dbsuper/download.php. No known super-enhancers were available in
KBM7. The super-enhancers in a cell line were compared with the clusters of enhancers that
interact with the same set of target genes in the same cell line identified in this study.
2.2.2.4 BCC (Bipartite Clustering Coefficient)
The defined IEPs in a cell line can be represented as a bipartite graph, where the enhancers on one
side connect with the target genes on the other side. Bipartite clustering coefficient (BCC) is used
to measure the degree to which the nodes in a graph tend to cluster together [53]. Here BCC was
used to characterize how enhancers share their target genes and how genes share their enhancers
(Figure 2-4B).
For a pair of enhancers (or a pair of genes), say $u$ and $v$, their BCC is defined as
\[ BCC(u, v) = \frac{|N(u) \cap N(v)|}{|N(u) \cup N(v)|}, \]
where $N(u)$ and $N(v)$ are the sets of genes (or enhancers) interacting with $u$ and $v$,
respectively. Intuitively, if $u$ and $v$ are a pair of enhancers, $BCC(u, v)$ measures the percentage of
target genes both $u$ and $v$ interact with among all of their target genes. Similarly, if $u$ and $v$ are a
pair of genes, $BCC(u, v)$ measures the percentage of enhancers both $u$ and $v$ interact with among
all enhancers they interact with. Correspondingly, the BCC of an individual enhancer (or gene),
say $u$, is defined as
\[ BCC(u) = \frac{\sum_{v \in N(N(u)),\, v \neq u} BCC(u, v)}{|N(N(u))| - 1}, \]
where $N(N(u))$ is the set of enhancers (or genes) that share at least one target gene (or enhancer)
with $u$. Under a given condition, for all
enhancers (or target genes) sharing at least one target gene (or an enhancer) with other enhancers
(or target genes), we averaged their individual BCCs to obtain the BCC of enhancers (or target
genes) under this condition.
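The definitions above can be illustrated with a short sketch. It assumes that IEPs are given as (enhancer, gene) pairs and that $N(N(u))$ contains $u$ itself, so that dividing by $|N(N(u))| - 1$ is the same as averaging over the other nodes sharing a neighbor with $u$. All function names are illustrative, not the original implementation.

```python
from collections import defaultdict


def build_neighbors(ieps):
    """Return (enhancer -> set of genes, gene -> set of enhancers)."""
    targets, regulators = defaultdict(set), defaultdict(set)
    for enhancer, gene in ieps:
        targets[enhancer].add(gene)
        regulators[gene].add(enhancer)
    return dict(targets), dict(regulators)


def pair_bcc(u, v, nbrs):
    """BCC(u, v): shared neighbors divided by the union of neighbors."""
    return len(nbrs[u] & nbrs[v]) / len(nbrs[u] | nbrs[v])


def node_bcc(u, nbrs, rev):
    """BCC(u): mean pairwise BCC between u and every other node sharing at
    least one neighbor with u; None if u shares neighbors with no other node."""
    second = {v for n in nbrs[u] for v in rev[n]} - {u}
    if not second:
        return None
    return sum(pair_bcc(u, v, nbrs) for v in second) / len(second)


def mean_bcc(nbrs, rev):
    """Average of the individual BCCs over all nodes where it is defined."""
    values = [b for b in (node_bcc(u, nbrs, rev) for u in nbrs) if b is not None]
    return sum(values) / len(values) if values else None


# Toy bipartite graph: e1 and e2 share both targets, e3 shares only g2.
toy_ieps = [("e1", "g1"), ("e1", "g2"), ("e2", "g1"), ("e2", "g2"),
            ("e3", "g2"), ("e4", "g3")]
targets, regulators = build_neighbors(toy_ieps)

print(node_bcc("e1", targets, regulators))   # BCC of enhancer e1
print(mean_bcc(targets, regulators))         # BCC of enhancers
print(mean_bcc(regulators, targets))         # BCC of genes (roles swapped)
```

Swapping the two neighbor maps, as in the last line, gives the gene-side (promoter-side) BCC with the same code.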
2.2.2.5 Generation of Enhancer Clusters
Using the enhancers with BCC > 0, an enhancer graph was created for each IEP dataset in each
cell line. In this graph, the nodes represent enhancers and edges represent pairs of enhancers
interacting with at least one common target gene. Then the Bron-Kerbosch algorithm was
applied to this graph to find all maximal cliques [54]. The enhancers in a clique represented a
cluster of enhancers that interact with the same set of genes. Here, different clusters may share the
same enhancers.
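The clustering step can be sketched as follows. This is a plain-Python illustration of the Bron-Kerbosch algorithm with pivoting, assuming IEPs are given as (enhancer, gene) pairs; whether the original analysis used pivoting or a library implementation is not stated, and the function names are hypothetical.

```python
from collections import defaultdict
from itertools import combinations


def enhancer_graph(ieps):
    """Connect two enhancers if they share at least one target gene.
    Enhancers sharing no target with any other enhancer do not appear."""
    by_gene = defaultdict(set)
    for enhancer, gene in ieps:
        by_gene[gene].add(enhancer)
    adj = defaultdict(set)
    for enhancers in by_gene.values():
        for a, b in combinations(sorted(enhancers), 2):
            adj[a].add(b)
            adj[b].add(a)
    return dict(adj)


def maximal_cliques(adj):
    """Bron-Kerbosch with pivoting; yields each maximal clique as a frozenset."""
    def expand(r, p, x):
        if not p and not x:
            yield frozenset(r)
            return
        # Pivot on the node with the most neighbors among the candidates.
        pivot = max(p | x, key=lambda n: len(adj[n] & p))
        for v in list(p - adj[pivot]):
            yield from expand(r | {v}, p & adj[v], x & adj[v])
            p = p - {v}
            x = x | {v}

    if adj:
        yield from expand(set(), set(adj), set())


# Toy IEPs: e1, e2, e3 share g1; e3 and e4 share g2.
toy_ieps = [("e1", "g1"), ("e2", "g1"), ("e3", "g1"),
            ("e3", "g2"), ("e4", "g2"), ("e5", "g3")]
adj = enhancer_graph(toy_ieps)
clusters = set(maximal_cliques(adj))
print(sorted(sorted(c) for c in clusters))
```

In the toy graph, enhancer e3 belongs to two clusters, which illustrates why different clusters may share the same enhancers.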
2.2.2.6 Statistical Tests
To assess the statistical significance of the observed BCC values in a given set of IEPs, a random
set of IEPs was generated using the same enhancers and promoters from the original set of IEPs.
The observed BCC values of the enhancers (promoters) in the original set of IEPs were then
compared with those in the random IEPs. In every comparison, the BCC values of the enhancers
(promoters) that interacted with multiple promoters (enhancers) were pooled together from the
original IEPs and compared with those from the random IEPs. In the statistical significance
analysis, the probability of the enhancers (promoters) with BCC > 0.9 in the random IEPs was
calculated as the Binomial probability parameter (p). Now, if there are n enhancers in the original
IEPs and k of them have their BCC > 0.9, the p-value is calculated using the following formula.
\[ p\text{-value} = 1 - \sum_{i=0}^{k-1} \binom{n}{i} p^{i} (1 - p)^{n-i} \]
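As a worked illustration, the binomial tail probability in this formula can be computed directly. The sketch below is not the original code; it evaluates the formula term by term using Python's standard `math.comb` (Python 3.8 or later), and the example numbers are invented.

```python
from math import comb


def binomial_tail_pvalue(n, k, p):
    """P(X >= k) for X ~ Binomial(n, p), computed as
    1 - sum_{i=0}^{k-1} C(n, i) * p^i * (1 - p)^(n - i)."""
    lower = sum(comb(n, i) * p**i * (1 - p) ** (n - i) for i in range(k))
    return 1.0 - lower


# Invented example: 10 enhancers, 8 with BCC > 0.9, random-IEP probability 0.2.
print(binomial_tail_pvalue(n=10, k=8, p=0.2))
```

Because the result is obtained by subtracting from 1 in floating point, tail probabilities far below machine precision come out as 0 with this direct form; for very large n a dedicated survival-function routine from a statistics library would be the safer choice.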
2.2.2.7 Additional Analyses
To assess the sequence similarity between the enhancers within a cluster, the enhancer sequences
within a cluster were multiple aligned by ClustalW programs using MUSCLE version 3.8.31 [55].
The similarity score between a pair of enhancers was then defined as the percentage of identities
in the corresponding alignment [55]. Similarly, the similarity scores were measured between every
pair of enhancers from a randomly selected enhancer set in the same cell line. The two sets of
similarity scores were then compared by the Mann-Whitney U test [56].
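The similarity score can be sketched as below. The text does not say how gap columns were counted, so this example assumes that columns where both sequences have a gap are ignored and that a gap against a residue counts as a mismatch; the function name is illustrative.

```python
def percent_identity(aligned_a, aligned_b, gap="-"):
    """Percentage of identical columns between two rows of an alignment.

    Assumption: columns where both rows carry a gap are skipped, and a gap
    opposite a residue counts as a mismatch.
    """
    if len(aligned_a) != len(aligned_b):
        raise ValueError("aligned sequences must have the same length")
    columns = [(x, y) for x, y in zip(aligned_a.upper(), aligned_b.upper())
               if not (x == gap and y == gap)]
    if not columns:
        return 0.0
    matches = sum(1 for x, y in columns if x == y)
    return 100.0 * matches / len(columns)


# Two rows taken from a toy multiple alignment.
print(percent_identity("AC--GT", "AG--GT"))
print(percent_identity("ACG-TA", "ACGTTA"))
```

The two resulting score lists (within clusters and within the random enhancer set) could then be passed to a Mann-Whitney U implementation such as `scipy.stats.mannwhitneyu`, which is not imported here to keep the sketch dependency-free.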
To assess whether the enhancers in a cluster tend to be located close to each other in a cell line,
the relative distances between every pair of enhancers within clusters in a cell line were compared
with the same for the randomly chosen enhancer set in the same cell line using the Mann-Whitney
U test.
Finally, the functional similarity scores between the target genes of every pair of enhancers in a
cluster were measured for each cell line by the GREAT tool [57]. The tool generated the
significant functional terms (p-value < 1e-05) associated with the target genes of the enhancer
clusters.
2.2.3 Results
2.2.3.1 Two Enhancers are Likely to Interact with Either Exactly the Same Set or Two
Completely Different Sets of Genes
In order to study IEPs, the BCC values of the enhancers were calculated for the five sets of
experimentally supported IEPs with the two sets of enhancers in each cell line or cell type (Chapter
2.2.2.1 and 2.2.2.2, Figure 2-4A). BCC is commonly used to measure the degree to which two
unconnected nodes in a bipartite graph share their connected or neighboring nodes. Note that every
set of IEPs can be represented as a bipartite graph, where the enhancer set and the gene promoter
set correspond to the two disjoint sets of nodes, and their interactions correspond to the edges
Table 2-4: The BCC of enhancers and that of promoters are likely to be 1 in a cell line.
E1: % of total enhancers with multiple promoters and BCC > 0. P1: % of total promoters with multiple enhancers and BCC > 0.
"Enh." columns give the BCC of enhancers and "Prom." columns give the BCC of promoters.
Study Cell line  IEPs           Enh. all     Enh. mult.   E1 (%)         E1>=0.9 (%)    Prom. all    Prom. mult.  P1 (%)         P1>=0.9 (%)
Rao   GM12878    294 (2384)     0.97 (0.99)  0.96 (0.96)  17.47 (18.44)  87.5 (88.95)   0.97 (0.95)  0.95 (0.93)  19.35 (27.15)  91.67 (87.98)
Rao   HELA       11 (37)        1 (1)        0 (1)        0 (12.5)       0 (100)        1 (1)        0 (1)        0 (11.76)      0 (100)
Rao   HMEC       260 (2558)     0.97 (0.99)  0.96 (0.98)  15.42 (26.9)   90.32 (95.91)  0.96 (0.97)  0.91 (0.96)  13.97 (37.41)  88 (93.82)
Rao   HUVEC      9 (95)         1 (1)        0 (1)        0 (10.47)      0 (100)        0 (1)        0 (1)        0 (19.35)      0 (100)
Rao   IMR90      144 (554)      1 (1)        1 (0.99)     4.8 (8.98)     100 (100)      1 (0.99)     1 (0.98)     6.25 (9.35)    100 (93.1)
Rao   K562       47 (638)       1 (1)        1 (1)        10.81 (17.35)  100 (100)      1 (1)        1 (1)        12.82 (27.92)  100 (100)
Rao   KBM7       8              NA (NA)      NA (NA)      NA (NA)        NA (NA)        1            NA (NA)      NA (NA)        NA (NA)
Rao   NHEK       0 (0)          NA (NA)      NA (NA)      NA (NA)        NA (NA)        NA (NA)      NA (NA)      NA (NA)        NA (NA)
Jin   IMR90      1167 (5303)    0.9 (0.93)   0.84 (0.87)  34.86 (34.97)  62.16 (70.75)  0.77 (0.68)  0.73 (0.66)  37.66 (49.11)  52.98 (40.92)
Li    K562       2916 (33449)   0.8 (0.86)   0.75 (0.78)  30.98 (41.9)   50.92 (53.62)  0.86 (0.67)  0.75 (0.65)  26.43 (57.26)  44.13 (38.73)
Li    MCF7       2190           0.89         0.83         25.15          66.76          0.86         0.75         22.59          57.41
In the header row, “mult.” (multiple) means the enhancers (or promoters) with multiple interacting promoters (enhancers). “All”
means all enhancers (or promoters). When two numbers are in an entry, the number in parentheses is from the
ChromHMM enhancers.
(Figure 2-4B). The neighboring promoter nodes of an enhancer are the target genes of this
enhancer. With the goal of investigating how different enhancers share their target genes, BCC is a
suitable measure, as it shows the percentage of shared target genes of an enhancer in a
given set of IEPs (Figure 2-4B). The average BCC values of the enhancers were larger than 0.90
for all the data sets. The high BCC values indicate that enhancers are unlikely to share only part
of their target genes. When a pair of enhancers interacts with a common target gene, both enhancers
are likely to interact with all target genes of these two enhancers.
First, the IEPs were studied based on the looplists from Rao et al. [8], with the annotated FANTOM
enhancers [1] and the GENCODE promoters [18] (Figure 2-4A). The BCC of enhancers was no
smaller than 0.97 in all cell lines with enough IEPs (Table 2-4). The average BCC was then
calculated for only the enhancers interacting with more than one gene. In this case too, the average
BCC was no smaller than 0.96 in all the cell lines. The high BCC values suggest that two enhancers
are likely to interact with either the same set or two disjoint sets of target genes. In other words,
the target genes of any pair of enhancers usually are either the same or completely different.
To assess the statistical significance of the above observation, the BCC values of the enhancers
were studied in randomly generated IEPs (Table 2-5). These random IEPs were constructed using
the same set of enhancers and promoters but randomized interactions. Given an enhancer and its
number of interacting promoters from the original IEP set of a cell line, the same number of
promoters were randomly chosen from the active promoters in the cell line, so that the number of
interactions of every enhancer remains the same in both the original and the random IEP sets. Five
different sets of random IEPs were generated in this way with five different random
Table 2-5: BCC statistics for enhancers. The BCC of the enhancers in the real IEPs is shown
for different samples. The BCC of the enhancers in random IEPs is also shown along with the
p-values of the nonparametric statistical test supporting the difference between the BCC values
in real and random IEPs. All the statistics are shown for both “all” enhancers and the enhancers
interacting with “multiple” promoters.
Columns: Experiments, Cell lines, IEPs, Enhancers, BCC of enhancers (All, Multiple),
BCC of enhancers in random IEPs with p-values in parentheses (All, Multiple)
FANTOM Gencode Rao looplist
GM12878 294 229 0.97 0.96 0.51 (0) 0.34 (0)
HELA 11 10 1 0 0 (0) 0 (NA)
HMEC 260 201 0.97 0.96 0.37 (0) 0.17 (0)
HUVEC 9 9 1 0 0 (0) 0 (NA)
IMR90 144 125 1 1 0.33 (0) 0.2 (0)
K562 47 37 1 1 0 (0) 0 (0)
KBM7 8 5 0 0 0 (NA) 0 (NA)
NHEK 0 0 NA NA NA (NA) NA (NA)
FANTOM Gencode Rao cutoff 400 GM12878 902 783 0.97 0.85 0.78 (0) 0.38 (0)
FANTOM Gencode Rao cutoff 300 GM12878 1138 974 0.95 0.82 0.76 (0) 0.39 (0)
FANTOM Gencode Rao cutoff 200 GM12878 2695 2091 0.9 0.74 0.62 (0) 0.36 (0)
FANTOM Gencode Rao cutoff 150 GM12878 4184 3002 0.88 0.74 0.56 (0) 0.34 (0)
FANTOM Gencode Rao cutoff 100
GM12878 7527 4488 0.81 0.7 0.43 (0) 0.28 (0)
HMEC 313 277 0.93 0.67 0.53 (0) 0.07 (0)
HUVEC 43 41 0.92 0.5 0 (0) 0 (NA)
IMR90 525 468 0.96 0.72 0.83 (0) 0.42 (0)
K562 506 440 0.96 0.83 0.8 (0) 0.39 (0)
KBM7 1465 1308 0.94 0.7 0.84 (0) 0.43 (0)
NHEK 211 200 0.95 0.5 0.49 (0) 0.3 (NA)
FANTOM Gencode Rao cutoff 50
GM12878 19623 7599 0.73 0.66 0.25 (0) 0.19 (0)
HMEC 854 702 0.94 0.85 0.68 (0) 0.4 (0)
HUVEC 254 237 0.95 0.81 0.58 (0) 0.1 (0)
IMR90 1643 1319 0.91 0.75 0.66 (0) 0.39 (0)
K562 1734 1368 0.89 0.73 0.64 (0) 0.39 (0)
KBM7 4033 3274 0.9 0.74 0.7 (0) 0.37 (0)
NHEK 462 407 0.92 0.69 0.78 (0) 0.4 (0)
FANTOM Gencode Rao cutoff 30
GM12878 29348 8670 0.71 0.65 0.48 (0) 0.47 (0)
HMEC 1786 1451 0.92 0.78 0.83 (0) 0.53 (0)
HUVEC 582 518 0.95 0.81 0.9 (0) 0.52 (0)
IMR90 3077 2235 0.87 0.73 0.76 (0) 0.54 (0)
K562 2872 2021 0.85 0.71 0.74 (0) 0.52 (0)
KBM7 7047 5564 0.88 0.72 0.81 (0) 0.52 (0)
NHEK 1011 885 0.93 0.76 0.88 (0) 0.52 (0)
ChromHMM Gencode Rao looplist
GM12878 2384 1914 0.99 0.96 0.67 (0) 0.39 (0)
HELA 37 32 1 1 0.1 (0) 0.1 (0)
HMEC 2558 1907 0.99 0.98 0.59 (0) 0.36 (0)
HUVEC 95 86 1 1 0.22 (0) 0.17 (0)
IMR90 554 490 1 0.99 0.77 (0) 0.45 (0)
K562 638 536 1 1 0.74 (0) 0.44 (0)
NHEK 0 0 NA NA NA (NA) NA (NA)
ChromHMM Gencode Rao cutoff 400 GM12878 11097 9343 0.93 0.78 0.75 (6.72E-12) 0.42 (0)
ChromHMM Gencode Rao cutoff 300 GM12878 14846 12347 0.92 0.78 0.74 (1.52E-11) 0.42 (0)
ChromHMM Gencode Rao cutoff 200 GM12878 33072 24664 0.81 0.67 0.64 (1.17E-11) 0.37 (0)
ChromHMM Gencode Rao cutoff 150 GM12878 51174 34925 0.8 0.67 0.57 (0) 0.34 (0)
ChromHMM Gencode Rao cutoff 100
GM12878 89712 51676 0.74 0.64 0.46 (0) 0.29 (0)
HMEC 4081 3635 0.94 0.76 0.81 (0) 0.41 (0)
HUVEC 499 458 0.98 0.86 0.85 (0) 0.48 (0)
IMR90 2415 2118 0.97 0.88 0.78 (0) 0.41 (0)
K562 8062 6835 0.93 0.76 0.75 (0) 0.42 (0)
NHEK 3291 3028 0.96 0.75 0.86 (0) 0.44 (0)
ChromHMM Gencode Rao cutoff 50
GM12878 231522 88850 0.64 0.6 0.27 (0) 0.19 (0)
HMEC 11191 9131 0.92 0.78 0.69 (0) 0.39 (0)
HUVEC 3396 3073 0.96 0.8 0.83 (0) 0.44 (0)
IMR90 7270 5765 0.93 0.79 0.67 (1.73E-12) 0.39 (0)
K562 28590 21084 0.86 0.7 0.63 (0) 0.37 (0)
NHEK 7017 6103 0.94 0.77 0.78 (0) 0.43 (0)
Jin IMR90 50800 44239 0.94 0.79 0.81 (0) 0.44 (0)
FANTOM Gencode Jin IMR90 1167 743 0.9 0.84 0.51 (0) 0.33 (0)
ChromHMM Gencode Jin IMR90 5303 3383 0.93 0.87 0.53 (0) 0.32 (0)
FANTOM Gencode Chiapet K562 2916 1585 0.8 0.75 0.41 (0) 0.28 (0)
MCF7 2190 1471 0.89 0.83 0.55 (0) 0.35 (0)
ChromHMM Gencode Chiapet K562 33449 19550 0.86 0.78 0.46 (1.74E-11) 0.3 (0)
FANTOM Gencode Javierre
Ery 74 44 1 1 0.41 (0) 0.33 (0)
Mac0 88 59 0.98 0.94 0.51 (0) 0.37 (0)
Mac1 215 144 1 1 0.54 (0) 0.37 (0)
Mac2 112 75 0.99 0.96 0.53 (0) 0.34 (0)
MK 100 65 0.96 0.9 0.52 (0) 0.34 (0)
Mon 139 82 1 1 0.43 (0) 0.32 (0)
nCD4 86 58 1 1 0.52 (0) 0.35 (0)
nCD8 84 55 1 1 0.5 (0) 0.36 (0)
Neu 178 109 1 1 0.45 (0) 0.32 (0)
ChromHMM Gencode Javierre
Ery 4484 2471 0.98 0.98 0.42 (0) 0.3 (0)
Mac0 2003 1097 0.99 0.99 0.41 (0) 0.29 (0)
Mac1 4867 2996 0.97 0.96 0.49 (0) 0.33 (0)
Mac2 3733 2298 0.99 0.99 0.49 (0) 0.33 (0)
MK 2629 1744 0.99 0.98 0.55 (0) 0.35 (0)
Mon 2483 1547 0.96 0.94 0.49 (0) 0.34 (0)
nCD4 2975 1546 0.99 0.99 0.39 (0) 0.28 (0)
nCD8 2774 1623 0.98 0.97 0.46 (0) 0.31 (0)
Neu 4661 2739 0.99 0.98 0.46 (0) 0.32 (0)
FANTOM Gencode SPRITE cutoff 100 GM12878 38 28 1 1 0.2 (0) 0 (0)
FANTOM Gencode SPRITE cutoff 50 GM12878 497 317 0.92 0.8 0.46 (0) 0.35 (0)
FANTOM Gencode SPRITE cutoff 30 GM12878 3381 2151 0.92 0.84 0.45 (0) 0.3 (0)
ChromHMM Gencode SPRITE cutoff 100
GM12878 622 453 0.99 0.97 0.56 (0) 0.3 (0)
ChromHMM Gencode SPRITE cutoff 50 GM12878 4794 3213 0.95 0.89 0.5 (0) 0.32 (0)
ChromHMM Gencode SPRITE cutoff 30 GM12878 36027 21870 0.9 0.81 0.48 (1.22E-11) 0.3 (0)
seeds. These random IEPs barely had a handful of enhancers that shared promoters with other
enhancers in any of the eight cell lines, suggesting that it is not by chance that multiple enhancers
interact with a common set of target genes in the looplists of Rao et al. The number of IEPs was
too small to calculate BCC for four of the eight cell lines. For the other four cell lines, where
the BCC could be calculated, the BCC values of enhancers were 0.51, 0.37, 0.33 and 0,
respectively, which were much smaller than the BCC of enhancers in the above sets of real IEPs
(p-value=0, Table 2-5). When only the enhancers interacting with multiple genes were
considered, the BCC values were no larger than 0.34 for random IEPs, while they were no smaller
than 0.96 for the real IEPs. The observations suggest that the BCC of enhancers being close to 1
was not by chance (Table 2-5).
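The randomization described above, in which every enhancer keeps its number of interacting promoters, can be sketched as follows. This is an illustration only: it assumes uniform sampling without replacement from the pool of active promoters for each enhancer, and the function name, the toy data and the seed values are hypothetical.

```python
import random
from collections import Counter


def randomize_ieps(ieps, promoters, seed):
    """Build a random IEP set in which each enhancer keeps its original
    number of interacting promoters, drawn from the given promoter pool."""
    rng = random.Random(seed)
    degree = Counter(enhancer for enhancer, _ in ieps)  # promoters per enhancer
    pool = sorted(promoters)  # fixed order so that a seed is reproducible
    random_ieps = []
    for enhancer in sorted(degree):
        # Sampling without replacement keeps the promoters of one enhancer distinct.
        for promoter in rng.sample(pool, degree[enhancer]):
            random_ieps.append((enhancer, promoter))
    return random_ieps


# Toy data: e1 has two promoters, e2 has one; four active promoters in total.
toy_ieps = [("e1", "p1"), ("e1", "p2"), ("e2", "p3")]
active_promoters = {"p1", "p2", "p3", "p4"}

# Five random sets from five different seeds, as in the text.
random_sets = [randomize_ieps(toy_ieps, active_promoters, seed) for seed in range(5)]
for rs in random_sets:
    print(rs)
```

The promoter-side randomization follows by swapping the roles of enhancers and promoters in the input pairs.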
Second, the IEPs defined by the contact matrices from Rao et al. were studied with different cutoffs
in the seven cell lines (Chapter 2.2.2.2). Compared with the IEPs from the looplists, these IEPs
were likely to include many more bona fide interactions and more false positives as well. Under
the cutoffs 30, 50 and 100, the BCC of the enhancers in all the seven cell lines except GM12878
was no smaller than 0.85, 0.89 and 0.92, respectively (Table 2-5). Since GM12878 had a much
higher sequencing depth than the other cell lines, it was understandable that a cutoff that is
stringent for other cell lines could still be loose for GM12878. Thus, the cutoffs 150, 200, 300, and
400 were also tried for GM12878. Among these four cutoffs, 400 was the most reasonable, since
the number of IEPs in GM12878 defined at this cutoff was similar to that in other cell lines defined
at the cutoff 100 (Table 2-5). So, the cutoff 400 was chosen for GM12878 and the cutoff 100 was
chosen for the other cell lines. With cutoff 400, the BCC of enhancers was 0.97 in GM12878. Note
that in HMEC, HUVEC, KBM7 and NHEK, the BCC of enhancers was no smaller than 0.92 even
under the cutoff 100. Moreover, the BCC of enhancers increased as the IEPs were defined more
stringently, suggesting that the BCC of enhancers is close to 1 if it is not 1 (Table 2-5).
In order to assess the statistical significance of the observed BCC of enhancers in IEPs from
different cutoffs, the above BCC values of enhancers were similarly compared with those from
randomly generated IEPs (Table 2-5). Again, for every cutoff in every cell line, the BCC of
enhancers for random IEPs was much smaller than the BCC of enhancers for real IEPs (p-value=0).
For instance, under the cutoff 50, the BCC of enhancers was no larger than 0.78 for random IEPs,
while the corresponding number was no smaller than 0.89 for real IEPs. When only the enhancers
interacting with multiple target genes were considered, the BCC of the enhancers for random IEPs
was smaller than that for real IEPs by about a factor of two. For instance, under the cutoff 50, the
largest BCC value was 0.40 for random IEPs, while the smallest BCC value for real IEPs was 0.69.
Third, to see how this observation might change if the data from other labs or other experimental
protocols were used, the IEPs from four additional studies were analyzed (Chapter 2.2.2.2, Figure
2-4A) [20, 21, 32, 52]. When the BCC values of the enhancers were calculated using the IEPs
defined by Jin et al. [20], it was 0.94 on average. When considering the processed IEPs from Jin
et al. based on the FANTOM enhancers and the annotated promoters by GENCODE, it was 0.90.
In terms of the ChIA-PET datasets [21], it was 0.80 in K562 and 0.89 in MCF7 (Table 2-4). For
the nine cell types from Javierre et al. [32], it was no smaller than 0.96 in all cell types. For the
SPRITE data from Quinodoz et al. [52], it was 0.92, 0.92 and 1 for the cutoffs 30, 50 and 100,
respectively (Table 2-5). Although the IEPs were from different labs and from different
experimental procedures, in all cases, the BCC of enhancers was at least 0.80 and the majority
of enhancers interacting with multiple promoters had their individual BCCs larger than 0.90,
suggesting that the BCC of enhancers is likely to be 1 in these samples. Again, for the
corresponding randomly generated IEPs for these datasets, on average, the BCC value was 0.48,
much smaller than the corresponding ones from original IEPs, which was 0.96 (p-value=0, Table
2-5).
Finally, the above analyses were repeated with the ChromHMM enhancers instead of the FANTOM
enhancers, because the number of FANTOM enhancers was smaller than that of the
ChromHMM enhancers [16]. The observations were similar in all cases, showing the BCC of
enhancers for the IEPs in a cell line was close to 1 (Table 2-4, Table 2-5). For instance, for IEPs
based on the looplists, it was almost a perfect 1 in all cell lines. For the Hi-C data from Rao et al.
under the cutoff 400 for GM12878 and 100 for the other cell lines, it was no smaller than 0.93. For
the Hi-C data from Jin et al. [20], it was 0.93. For the ChIA-PET data from Li et al. [21], it was
0.86. For the nine cell types from Javierre et al. [32], it was no smaller than 0.97. For the SPRITE
data on GM12878 cell line [52], the BCC values were 0.9, 0.95 and 0.99 for the cutoffs 30, 50 and
100, respectively. In almost all cases, the majority of enhancers with multiple promoters had their
individual BCCs larger than 0.90.
In summary, the BCC values of the enhancers were likely to be close to 1 for different sets of IEPs,
data from different labs, different experimental protocols, different cell lines and cell types, and
different enhancer sets. The analyses based on IEPs from different cutoffs suggest that the BCC of
enhancers is quite robust, although it is smaller when more loosely defined IEPs are used. It is
close to 1 or becomes 1 when the IEPs are defined with more stringent criteria (with fewer false
positive IEPs). These analyses suggest that the observation may be an intrinsic property of
enhancers. That is, if two enhancers interact with one common gene, they are likely to interact
with all of their target genes.
2.2.3.2 Two Target Genes Tend to Interact with Exactly the Same Set or Two Completely
Different Sets of Enhancers
The BCC of promoters in each set of the aforementioned IEPs was also studied to see whether a similar
observation could be made for the promoters. The results showed that the BCC of
promoters was likely to be 1 as well, although this was not as strongly evident as for the BCC of
enhancers in certain cases.
First, the BCC values of the promoters were studied with the IEPs based on the looplists [8]. The
BCC values were close to 1 on average, for both the FANTOM and ChromHMM enhancers (Table
2-4). The BCC values of the promoters were then studied in randomly simulated IEP datasets. The
random IEP set consisted of the same sets of enhancers and promoters, but the enhancers were
randomly selected to interact with the promoters so that every promoter had the same number of
interacting enhancers as it had in the original set of IEPs. The BCC of promoters was at most 0.52
in any cell line in these random datasets, suggesting that it was not by chance that the BCC of
promoters was close to 1 in all cell lines (Table 2-6).
Second, the BCC values of the promoters were studied for the IEPs defined with different cutoffs
[8] (Table 2-6). When the FANTOM enhancers were used, the BCC of promoters was often close
to 1. For instance, with the cutoff 400 for GM12878 and the cutoff 100 for other cell lines, the
BCC of promoters was no smaller than 0.91 in all the cell lines. For different cutoffs, it was usually
no smaller than the BCC of enhancers, which was close to 1 in most cases. When the ChromHMM
enhancers were used, however, the values were not as high as those from the FANTOM enhancers.
For instance, with the cutoff 400 for GM12878 and the cutoff 100 for other cell lines, the BCC of
promoters varied from 0.64 to 0.91 in different cell lines. The BCC values got smaller with smaller
cutoffs, which might be due to the much lower quality of the enhancers predicted by ChromHMM
compared with the experimentally defined FANTOM ones.
Although the BCC of the promoters was not as large as the BCC of enhancers when the
ChromHMM enhancers were used, the actual BCC of promoters could also be close to 1. This was
Table 2-6: BCC statistics for promoters. The BCC of the promoters in real IEPs is shown for
different samples. The BCC of the promoters in random IEPs is also shown along with the
p-values of the nonparametric statistical test supporting the difference between the BCC values
in real and random IEPs. All the statistics are shown for both “all” promoters and the
promoters interacting with “multiple” enhancers.
Columns: Experiments, Cell lines, IEPs, Promoters, BCC of promoters (All, Multiple),
BCC of promoters in random IEPs with p-values in parentheses (All, Multiple)
FANTOM Gencode Rao looplist
GM12878 294 186 0.97 0.95 0.52 (0) 0.28 (0)
HELA 11 8 1 0 0 (0) 0 (NA)
HMEC 260 179 0.96 0.91 0.52 (0) 0.27 (0)
HUVEC 9 6 0 0 0 (NA) 0 (NA)
IMR90 144 112 1 1 0.48 (0) 0.08 (0)
K562 47 39 1 1 0.1 (0) 0.1 (0)
KBM7 8 8 1 0 0 (0) 0 (NA)
NHEK 0 0 NA NA NA (NA) NA (NA)
FANTOM Gencode Rao cutoff 400 GM12878 902 683 0.95 0.81 0.62 (0) 0.37 (0)
FANTOM Gencode Rao cutoff 300 GM12878 1138 848 0.92 0.78 0.57 (0) 0.33 (0)
FANTOM Gencode Rao cutoff 200 GM12878 2695 1663 0.83 0.7 0.43 (0) 0.29 (0)
FANTOM Gencode Rao cutoff 150 GM12878 4184 2292 0.81 0.7 0.38 (0) 0.25 (0)
FANTOM Gencode Rao cutoff 100
GM12878 7527 3475 0.76 0.66 0.32 (0) 0.21 (0)
HMEC 313 288 0.95 0.7 0.94 (0) 0.17 (0)
HUVEC 43 36 0.75 0.5 0 (0) 0 (NA)
IMR90 525 438 0.93 0.71 0.7 (0) 0.39 (0)
K562 506 404 0.92 0.81 0.68 (0) 0.39 (0)
KBM7 1465 1285 0.92 0.71 0.79 (0) 0.42 (0)
NHEK 211 190 0.91 0.5 0.72 (0) 0.2 (NA)
FANTOM Gencode Rao cutoff 50
GM12878 19623 6631 0.69 0.62 0.23 (0) 0.16 (0)
HMEC 854 719 0.95 0.84 0.75 (0) 0.38 (0)
HUVEC 254 211 0.84 0.63 0.53 (0) 0.21 (0)
IMR90 1643 1232 0.88 0.75 0.62 (0) 0.35 (0)
K562 1734 1218 0.85 0.72 0.54 (0) 0.32 (0)
KBM7 4033 3209 0.89 0.73 0.65 (0) 0.38 (0)
NHEK 462 386 0.89 0.67 0.73 (0) 0.41 (0)
FANTOM Gencode Rao cutoff 30
GM12878 29348 8320 0.66 0.61 0.48 (0) 0.45 (0)
HMEC 1786 1441 0.92 0.76 0.83 (0) 0.49 (0)
HUVEC 582 457 0.91 0.76 0.81 (0) 0.49 (0)
IMR90 3077 2050 0.85 0.72 0.73 (0) 0.52 (0)
K562 2872 1815 0.82 0.69 0.7 (0) 0.49 (0)
KBM7 7047 5304 0.86 0.69 0.79 (0) 0.49 (0)
NHEK 1011 802 0.88 0.75 0.81 (0) 0.51 (0)
ChromHMM Gencode Rao looplist
GM12878 2384 674 0.95 0.93 0.13 (0) 0.12 (0)
HELA 37 17 1 1 0 (0) 0 (0)
HMEC 2558 735 0.97 0.96 0.15 (0) 0.14 (0)
HUVEC 95 31 1 1 0 (0) 0 (0)
IMR90 554 310 0.99 0.98 0.51 (0) 0.24 (0)
K562 638 197 1 1 0.12 (0) 0.12 (0)
NHEK 0 0 NA NA NA (NA) 0 (NA)
ChromHMM Gencode Rao cutoff 400 GM12878 11097 3899 0.66 0.62 0.18 (0) 0.15 (0)
ChromHMM Gencode Rao cutoff 300 GM12878 14846 4777 0.65 0.61 0.16 (0) 0.14 (0)
ChromHMM Gencode Rao cutoff 200 GM12878 33072 7412 0.57 0.53 0.11 (0) 0.1 (0)
ChromHMM Gencode Rao cutoff 150 GM12878 51174 8688 0.56 0.53 0.09 (0) 0.08 (0)
ChromHMM Gencode Rao cutoff 100
GM12878 89712 10080 0.54 0.52 0.06 (0) 0.05 (0)
HMEC 4081 2410 0.74 0.66 0.4 (0) 0.28 (0)
HUVEC 499 283 0.84 0.79 0.26 (0) 0.18 (0)
IMR90 2415 1418 0.91 0.84 0.41 (0) 0.29 (0)
K562 8062 3005 0.64 0.59 0.19 (0) 0.16 (0)
NHEK 3291 1784 0.71 0.64 0.35 (0) 0.26 (0)
ChromHMM Gencode Rao cutoff 50
GM12878 231522 12998 0.49 0.48 0.03 (0) 0.03 (0)
HMEC 11191 5169 0.71 0.65 0.26 (0) 0.21 (0)
HUVEC 3396 1660 0.7 0.65 0.27 (0) 0.21 (0)
IMR90 7270 3540 0.81 0.73 0.29 (0) 0.22 (0)
K562 28590 6604 0.55 0.52 0.12 (0) 0.11 (0)
NHEK 7017 2851 0.69 0.64 0.22 (0) 0.18 (0)
Jin IMR90 50800 8117 0.11 0.11 0.09 (0) 0.08 (0)
FANTOM Gencode Jin IMR90 1167 401 0.77 0.73 0.23 (0) 0.17 (0)
ChromHMM Gencode Jin IMR90 5303 617 0.68 0.66 0.07 (0) 0.06 (0)
FANTOM Gencode Chiapet K562 2916 1869 0.86 0.75 0.52 (0) 0.31 (0)
MCF7 2190 1195 0.86 0.75 0.43 (0) 0.25 (0)
ChromHMM Gencode Chiapet K562 33449 6439 0.67 0.65 0.11 (0) 0.1 (0)
FANTOM Gencode Javierre
Ery 74 64 1 1 0.79 (0) 0.44 (0)
Mac0 88 64 0.98 0.95 0.59 (0) 0.36 (0)
Mac1 215 153 1 1 0.6 (0) 0.35 (0)
Mac2 112 85 0.98 0.96 0.64 (0) 0.38 (0)
MK 100 81 0.98 0.89 0.73 (0) 0.38 (0)
Mon 139 94 1 1 0.57 (0) 0.31 (0)
nCD4 86 64 1 1 0.63 (0) 0.39 (0)
nCD8 84 67 1 1 0.68 (0) 0.42 (0)
Neu 178 137 1 1 0.66 (0) 0.39 (0)
ChromHMM Gencode Javierre
Ery 4484 539 0.93 0.92 0.07 (0) 0.06 (0)
Mac0 2003 268 0.97 0.97 0.07 (0) 0.07 (0)
Mac1 4867 658 0.91 0.9 0.07 (0) 0.07 (0)
Mac2 3733 474 0.95 0.94 0.07 (0) 0.06 (0)
MK 2629 402 0.92 0.92 0.09 (0) 0.07 (0)
Mon 2483 330 0.91 0.9 0.08 (0) 0.07 (0)
nCD4 2975 359 0.97 0.97 0.07 (0) 0.06 (0)
nCD8 2774 339 0.93 0.93 0.07 (0) 0.06 (0)
Neu 4661 596 0.96 0.96 0.07 (0) 0.06 (0)
FANTOM Gencode SPRITE cutoff 100 GM12878 38 25 1 1 0 (0) 0 (0)
FANTOM Gencode SPRITE cutoff 50 GM12878 497 239 0.92 0.84 0.33 (0) 0.2 (0)
FANTOM Gencode SPRITE cutoff 30 GM12878 3381 1523 0.89 0.82 0.29 (0) 0.2 (0)
ChromHMM Gencode SPRITE cutoff 100 GM12878 622 94 0.96 0.95 0.02 (0) 0.02 (0)
ChromHMM Gencode SPRITE cutoff 50 GM12878 4794 663 0.85 0.84 0.06 (0) 0.06 (0)
ChromHMM Gencode SPRITE cutoff 30 GM12878 36027 4210 0.71 0.7 0.06 (0) 0.05 (0)
because the computationally predicted ChromHMM enhancers might result in predicting false
interactions and thus a low BCC of the promoters. Moreover, the BCC of the promoters consistently
increased as the IEPs were defined more stringently. For example, although the BCC of the
promoters was not close to 1 at the cutoff 100, it got closer to 1 when the looplists defined by Rao
et al. were considered. In addition, the BCC of promoters for random IEPs in every cell line and
under every cutoff was much smaller than that for the real IEPs, indicating that the observed much
larger BCC of promoters was not by chance (Table 2-6).
Third, the BCC values of the promoters were analyzed for IEPs from other studies (Figure 2-4A,
Table 2-4 and Table 2-6) [20, 21, 32, 52]. For the original IEPs from Jin et al., it was 0.11.
However, when the IEPs were defined from the overlap of these original IEPs with the GENCODE
promoters and the two types of enhancers, it was 0.77 and 0.68, respectively (Table 2-4). The low
BCC of the promoters for the original IEPs may be partially due to the promoter set used by Jin et al.,
which contained 11,313 inferred promoters, compared to the 57,820 promoters annotated by
GENCODE. In terms of the ChIA-PET data, when the FANTOM enhancers were used, the BCC
of the promoters was 0.86 in K562 and 0.86 in MCF7; when the ChromHMM enhancers were
used, it was 0.67 in K562. ChromHMM did not have annotated enhancers in MCF7. For the nine
cell types from Javierre et al., it was no smaller than 0.98 and 0.91 when the FANTOM enhancers
and the ChromHMM enhancers were used, respectively. For the SPRITE data on the GM12878
cell line, the BCC values of the promoters were no smaller than 0.89 and 0.71 in the IEPs defined
with the FANTOM and ChromHMM enhancers, respectively. Overall, the BCC of the promoters was not
as large as the BCC of the enhancers. However, given the imperfection of all these collected IEPs, the fact
that the majority of the promoters interacting with multiple enhancers had individual BCCs
larger than 0.90, and the fact that these values were much larger than the corresponding BCC of the promoters for
random IEPs (Table 2-6), the BCC of the promoters was likely to be close to 1 as well. In other
words, a gene usually interacts with all the enhancers of another gene or interacts with a completely
different set of enhancers from this second gene.
Figure 2-5: Clusters of enhancers with Hi-C reads. Here all ChromHMM active enhancer clusters
in GM12878 are shown within the region Chr1:161060000-161175000. A total of five clusters belong
to this region. The bottom half of the figure shows the five enhancer clusters (grey, yellow, green,
purple and brown on the two sides) interacting with the common gene promoter regions (in the
middle), arranged from left to right according to their relative genomic locations. The top half of
the figure shows the same interactions of the five clusters (same color codes) with Hi-C reads. For
example, the yellow cluster of enhancers interact with NIT1 and PFDN2 gene promoters with 687
Hi-C reads. The unmarked enhancer (blue) and gene promoter (UFC1) did not belong to any
cluster. The location of the enhancers relative to each other and to the target genes are shown in
the middle.
2.2.3.3 Enhancers Form Clusters that Have Special Characteristics
Since the BCC of the enhancers is close to 1, the enhancers can be organized into clusters, where
every enhancer in the same cluster is likely to interact with the same set of target genes. Thus, in
each IEP set, an enhancer graph was built by connecting the enhancers that share at least one
common target (Chapter 2.2.2.5, Figure 2-5). Here, only the looplists and the IEPs obtained from
the most stringent cutoff (400 in GM12878 and 100 in other cell lines) were considered to obtain
enhancer clusters, as they were more reliable than other sets of IEPs.
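The graph construction described above can be sketched in a few lines of Python. This is a minimal illustration only: it assumes that IEPs are given as (enhancer, gene) pairs and that a cluster is a connected component of the enhancer graph containing at least two enhancers. The function name `enhancer_clusters` and the toy identifiers are hypothetical and are not taken from the study.

```python
from collections import defaultdict


def enhancer_clusters(ieps):
    """Group enhancers into clusters from (enhancer, gene) pairs.

    Two enhancers are linked when they share at least one target gene.
    Assumption of this sketch: a cluster is a connected component of the
    resulting graph that contains at least two enhancers.
    """
    parent = {}

    def find(x):
        parent.setdefault(x, x)
        while parent[x] != x:
            parent[x] = parent[parent[x]]  # path halving
            x = parent[x]
        return x

    def union(a, b):
        root_a, root_b = find(a), find(b)
        if root_a != root_b:
            parent[root_b] = root_a

    by_gene = defaultdict(list)
    for enhancer, gene in ieps:
        find(enhancer)  # register the enhancer even if it stays alone
        by_gene[gene].append(enhancer)

    # Enhancers sharing a gene are joined; linking each to the first suffices.
    for enhancers in by_gene.values():
        for other in enhancers[1:]:
            union(enhancers[0], other)

    groups = defaultdict(set)
    for enhancer in parent:
        groups[find(enhancer)].add(enhancer)
    return sorted(sorted(g) for g in groups.values() if len(g) > 1)


# Toy input: e1/e2 share g1, e2/e3 share g2, e4 has a private target.
toy_ieps = [("e1", "g1"), ("e2", "g1"), ("e2", "g2"), ("e3", "g2"), ("e4", "g3")]
clusters = enhancer_clusters(toy_ieps)
```

In this toy input the enhancer with a private target is left out of every cluster, which mirrors the unmarked enhancer in Figure 2-5.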
In total, 1 to 2,134 clusters were generated in different cell lines. The number of clusters varied
dramatically within a cell line and across different cell lines, depending on the IEPs and the
enhancers used. When the ChromHMM enhancers were used, there were many more clusters and 67% to
96% of all enhancers in a cell line were included in the clusters. When the FANTOM enhancers were
used, fewer clusters were identified and about 16% to 67% of the total enhancers in a cell line were
found in the clusters. The average number of enhancers in a cluster varied from 2 to 5 in different
cell lines. Enhancers in the majority of clusters interacted with only one gene, while on average,
the enhancers in 18.36% of the clusters interacted with at least two different genes.
Next, the distance between the consecutive enhancers in a cluster, the distance between their
consecutive targets and the distance between enhancers and their target genes were studied (Figure
2-6). On average, about 84% of the enhancers in a cluster were within 10 kbps. However, there
was a small fraction of enhancers in a cluster that were more than 50 kbps away from each other.
For instance, when the looplists and the FANTOM enhancers were considered, more
than 8% of the enhancers in a cluster were more than 50 kbps away from each other in GM12878,
HMEC and IMR90. Although the enhancers in a cluster were often close to each other, their
distances to each other were not significantly smaller than the distances of random enhancer pairs
(almost all p-values>0.2). In terms of the target genes, the majority of them were within 10 kbps,
Figure 2-6: The distance distribution between consecutive enhancers in the same cluster for each
cell line. The X-axis represents the distance and the Y-axis represents the average percentage of
consecutive enhancer pairs in an enhancer cluster.
with a small fraction far from each other. For instance, in GM12878, HMEC and IMR90, when
the looplists and the FANTOM enhancers were considered, 25.93%, 21.43% and 33.33% of the
target genes of an enhancer cluster were more than 50 kbps away from each other, respectively.
It was also worth pointing out that the enhancers in a cluster were normally consecutive and active
enhancers while their target genes were normally not consecutive. In all cell lines, on average,
more than 90% of the enhancers in a cluster were consecutive active enhancers while fewer than
17% of the target genes of an enhancer cluster were consecutive.
Since the enhancers in a cluster were consecutive in the genome and the majority of enhancers in
a cluster were close to each other, they resembled super-enhancers. Hence, the enhancer
clusters were compared with known super-enhancers in terms of their locations. On average,
29.77% of enhancer clusters overlapped with the corresponding super-enhancers in a cell line,
while the majority of enhancer clusters did not overlap with the known super-enhancers (Figure
2-7A); these non-overlapping clusters may represent new super-enhancers. On the other hand, a
large proportion of known
Figure 2-7: The overlap of the enhancer clusters with the super-enhancers. (A) The percentage of
the enhancer clusters overlapping with the super-enhancers. (B) The percentage of the super-
enhancers overlapping with the enhancer clusters.
super-enhancers did not overlap with the enhancer clusters in the corresponding cell lines (Figure
2-7B). Interestingly, when a super-enhancer overlapped an enhancer cluster, more than 80% of the
genomic regions that contain all the enhancers in this enhancer cluster were within this super-
enhancer.
The locations of the enhancers in a cluster were also compared with TADs. The enhancers in a
cluster were usually within the same TAD, with at least 98.08% of the enhancers in a cluster
within a TAD in every cell line, independent of the IEPs and enhancers used. In most of the cell
lines, all the enhancers in every cluster were within a TAD. The slight deviation from
100% was mostly for the ChromHMM enhancers, which may be due to the imperfectness of
the computationally predicted enhancers, the IEPs, or the TADs. The percentage was 100% in
almost all the cell lines when the FANTOM enhancers were used.
The enhancer clusters were compared between different cell lines as well. On average, no more
than 12% of the enhancer clusters were shared between two cell lines. Moreover, the percentage was
smaller for the IEPs using looplists than for the IEPs using the contact matrices with different
cutoffs, implying that the looplists were too strict to include many bona fide IEPs. The small
percentage of the shared
enhancer clusters suggested that most enhancer clusters were cell-specific, which is consistent with
the properties of super-enhancers [47, 48].
2.2.4 Discussion
We observed that two enhancers either do not share any target gene or share almost all of their
target genes. This observation was true when different sets of IEPs, two sets of enhancers, and a
variety of cell lines and cell types were considered. Moreover, the BCC of enhancers became closer
to 1 as the criteria to define IEPs became more stringent. In addition, the
same observation did not hold for randomly generated IEPs. These analyses suggested
that the BCC of enhancers in a cell line or a cell type was likely to be close to 1, if not exactly 1.
Similarly, we observed that two promoters were likely to interact with either the same set of
enhancers or two disjoint sets of enhancers. This observation about promoters was not as evident
as that about enhancers. However, it was pervasive in all cases when the FANTOM enhancers
were used. It was also evident when the looplists and the IEPs defined by the most stringent cutoffs
were used. Although it seemed not compelling when the ChromHMM enhancers and the sets of
IEPs that were defined with loose criteria were used, this might be due to the imperfectness of
enhancers and IEPs we had. More importantly, the fact that the BCC of enhancers was close to 1
implied that the BCC of the promoters should be close to 1 as well based on the definition of the
BCC.
The BCC of enhancers being close to 1 suggested that enhancers form clusters to interact with the
target genes. As shown above, these clusters are different from the known enhancer clusters such
as super-enhancers, although they do overlap in certain regions. Enhancers in the clusters here
were likely to interact with the same set of genes, while enhancers in a super-enhancer do not
necessarily interact with multiple target genes. Moreover, the enhancers in a cluster here could be
far from each other while the enhancers in a super-enhancer are quite close to each other.
The BCC of enhancers was not 1 sometimes, which implied that when a group of enhancers
interacts with a set of target genes, the majority of target genes interact with each enhancer in this
group while the rest interact with only a subset of enhancers in this group. We called the former
the fully shared target genes and the latter the partially shared target genes. The percentage of the
partially shared target genes by a group of enhancers varied from 0% to 6.57%. We compared
these two types of target genes in terms of TAD, tissue specificity, and correlations with the
enhancers, with the IEPs from the looplists and the IEPs from the most stringent cutoff (400 in
GM12878 and 100 in other cell lines) (Methods). We did not observe any difference between the
two types of target genes.
In practice, several aspects may prevent the BCC of enhancers and the BCC of promoters from
being 1. First, the resolution of the interaction data prevents accurate IEPs from being obtained.
The two interacting regions in the interaction data are often long, around 5 kbps in most of
the cases we studied. We defined IEPs by overlapping enhancers and promoters with pairs of
interacting regions, which might be prone to errors, given the fact that many known enhancers
were much shorter [2, 58]. Second, the imperfectly defined IEPs might have produced “false”
interactions and thus decreased the BCCs. Third, the enhancers were not perfectly defined either.
The FANTOM enhancers are still far from complete, while the computationally predicted
ChromHMM enhancers may contain many “false” enhancers.
We also studied the functional similarities between the targets of enhancers in the same clusters.
With the GREAT tool [57], we found that the cluster targets were associated with DNA packaging
complex, DNA binding, nucleosome, immune response, etc. (p-value<1e-5). We measured the sequence
similarity of enhancers within clusters in a cell line as well (Methods). We found that the pairs of
enhancers in the same clusters did not share more sequence similarity compared with enhancer
pairs randomly chosen in the same cell lines (p-value>0.5).
There are other measurements to study bipartite graphs. We chose BCC because we intended to
investigate how enhancers (promoters) shared their target genes (enhancers). In this sense, the
BCC value perfectly reflected what we hoped to measure. In the future, we may explore other
measurements to study other characteristics of IEPs. Moreover, we focused on enhancers
interacting with multiple targets. There is no doubt that a proportion of enhancers interact only
with individual target genes. These enhancers and their target genes were not considered here, as
they did not share target genes with each other. In the future, the characteristics of these enhancers
may be worth studying as well.
In a cell line or cell type, both active enhancers and active promoters form their own clusters.
When an enhancer interacts with a promoter, consistent with the transcriptional factories proposed
previously [59, 60], almost all enhancers in the same enhancer cluster interact with almost all
promoters in the corresponding promoter cluster. It is thus important to consider the relationship
among enhancers and among promoters when studying their interactions, which may help improve
our understanding of the distal gene regulation and the chromatin structures.
CHAPTER 3 : STUDY OF MIRNA-MRNA INTERACTIONS
3.1 MDPS: Position-Wise Binding Preference is Important for miRNA Target Site Prediction
3.1.1 Background
MicroRNAs (miRNAs) are small (16 to 28 nucleotides) non-coding RNAs that play an important
regulatory role in the gene expression pathway. In humans, miRNAs engage in
imperfect interactions with their target sequences from messenger RNAs (mRNAs) or other
non-coding RNAs, such as long non-coding RNAs, transfer RNAs, circular RNAs, etc. [61]. The
interactions with mRNA lead to regulation of the corresponding gene expression through reduced
protein translation or complete degradation of the mRNA [62, 63]. The regulatory
involvement of miRNAs in critical gene expression pathways is associated with complex diseases
[64].
In humans, miRNA-target interactions are mostly imperfect, consisting of both complementary
matches and gaps [62]. Because the miRNA sequence is much shorter than the
mRNA transcript sequence and its interactions with target sequences are imperfect, multiple
potential miRNA target sites may exist within the mRNA transcript sequence. Many of these sites
have not yet been found to be functional and thus are normally treated as negative sites. Because
these non-functional negative sites co-exist with the positive sites in the same mRNA transcript,
the computational methods designed for miRNA-target prediction often suffer from a large number
of false positive predictions. To handle this issue, computational tools abide by certain canonical
rules of miRNA-target interactions. The canonical rules require that
a positive interaction involve a special area (positions 2 to 8) of the miRNA sequence called
the “seed” region and a target sequence from the 3′ untranslated region of the mRNA transcript
with extensive bonds. Later this canonical rule was relaxed, allowing
non-canonical seeds (one mismatch or wobble in the seed region) and binding in the miRNA 3′
region centered on positions 13-16, along with other features such as target accessibility [65],
local AU content [66], folding energy [66, 67], conservation [68], etc. Dozens of target prediction
tools, including the most popular ones, focus primarily on these features [67, 69-72].
The advancement of next-generation sequencing (NGS) based technologies has enabled the study
of miRNA targets with extensive experimental support. NGS techniques combined with cross-linking
and immunoprecipitation (CLIP) allowed direct identification of miRNA targets [73, 74]. The
resolution of the CLIP-seq method was increased by the photoactivatable-ribonucleoside-enhanced
cross-linking and immunoprecipitation (PAR-CLIP) method [75]. Later, crosslinking,
ligation, and sequencing of hybrids (CLASH) experiments were introduced to detect miRNA-target
pairs as chimeric reads in NGS data [68]. Moore et al. improved the CLASH experiments with the
covalent ligation of endogenous Argonaute-bound RNAs-CLIP (CLEAR-CLIP) experiments [76].
The CLASH and CLEAR-CLIP experiments ultimately presented transcriptome-wide datasets
containing more than 18,000 and 30,000 high-confidence miRNA-target
interactions, respectively. Most of the interactions do not follow the established canonical rules
of miRNA-target interactions, revealing the prevalence of both seed and non-seed interactions and
the diversity of in vivo miRNA targets in mRNA 3′ UTR, 5′ UTR and coding DNA sequence (CDS) regions.
The interactions are of different stability and have different free folding energy (ranging from 1.5
kcal/mol to 32 kcal/mol). With the raw sequence reads from these studies, a number of new tools
have been developed for miRNA target prediction based on the aforementioned features together
with new features learned from NGS data [71, 77-79]. Despite the existence of numerous tools to
predict miRNA targets, almost all the tools still suffer from low precision because of the complex
way in which miRNAs choose their targets in different cells. Since high-throughput experimental
approaches are still costly and time-consuming and may not be feasible under certain conditions,
computational methods remain the only practical way to address this problem. The low precision of
available computational methods may be partially due to our limited knowledge of the characteristics
of miRNA target sites. Several studies thus concentrated on the features of miRNA binding sites.
Among them, a Markov chain based method modeled the base pairings between the entire
mature miRNAs and their targets [80]. Although only two states, the existence and absence of a
matching base pair, were considered in this Markov model, this study demonstrated the value of
considering flexible matching patterns instead of the canonical seed matching when identifying
miRNA target sites.
In this study, Markov models were designed to represent the position-wise pairing information
(match, mismatch, bulge, and wobble) of a miRNA from the miRNA-target interactions. Using
the models, the importance of the pairing patterns of a miRNA beyond its seed region was
evaluated for target prediction. From the model learning, the position-wise pairing patterns of a
mature miRNA were identified as a valuable feature for miRNA target site prediction. Also,
region-specific correlations between miRNAs were detected in terms of target binding. Finally, a
feature named MDPS (Markov model-scored Dynamic Programming algorithm for miRNA target site
Selection) was designed that focuses on the miRNA position-wise information from miRNA-target
interactions based on the experimental data. Combining MDPS as an additional feature with
three existing tools demonstrated the potential contribution of the position-wise pairing
information to the precise identification of miRNA target sites.
3.1.2 Materials and Methods
3.1.2.1 Training and Test Data
The miRNA–mRNA interactions reported in the CLASH study were used to design MDPS, as
these interactions provide the miRNA-mRNA sequence pairs with the highest resolution [68].
Using the interactions from this study, two datasets were generated. The first set contained the
interactions of 77 miRNAs, where each of these miRNAs interacted with at least 50 targets in the
CLASH experiments. This set of interactions was named the “target-enriched dataset”. The other set
included the interactions of 122 miRNAs, where each miRNA interacted with at least 20 targets
with a minimum folding energy of −15 kcal/mol. This set was termed the “energy-filtered dataset”.
For each of the two CLASH interaction sets, 80% of the interactions were randomly chosen as the
training data and the remaining 20% were kept for test purpose. The hyper-parameters for the
scoring model were chosen using 10-fold cross-validation on the training data. The best hyper-
Table 3-1: Training and test datasets.
              Total               Target-enriched dataset   Energy-filtered dataset
Dataset       miRNAs   Targets    miRNAs   Targets          miRNAs   Targets
CLASH         399      18041      77       15390            122      16209
CLEAR-CLIP    451      20094      -        -                -        -
We randomly selected 80% of the CLASH interactions to train a model using 10-fold cross-validation. We then tested
the model on the remaining 20% of the CLASH interactions. We also tested the model on the independent CLEAR-CLIP
interactions.
parameters were later used to make the prediction on the 20% test data of the corresponding
interaction dataset. The scoring model was also applied on an independent experimentally
validated miRNA target dataset generated by a CLEAR-CLIP study [76]. This dataset was chosen
because like the CLASH interaction data, this dataset also provides interacting miRNA-target
sequence pairs with the highest specificity. The interactions in CLASH and CLEAR-CLIP that did
not map to any mRNA transcript from ENSEMBL version 75 were filtered out. The reported
miRNA and target sequences were aligned using the RNAhybrid tool [81], as in the CLASH study
[68], to obtain the position-wise alignment information of each miRNA sequence. The number of
miRNAs and their corresponding targets for these datasets are documented in Table 3-1.
3.1.2.2 Position-Wise Information with Different States of miRNA–Target Interactions
A Markov model was used to learn the position-wise binding patterns for a given miRNA and its
targets. Given a miRNA and one of its target sequences, a position of the miRNA sequence and a
position of the target sequence can form the following five possible states in the miRNA-target
alignment: match ($M$), mismatch ($X$), G-U wobble match ($W$), bulge in target ($B_x$) and bulge in
miRNA ($B_y$) (Figure 3-1).
For every miRNA, a weight matrix $w$ and a transition matrix $t$ were designed with the five possible
states mentioned above. The weight matrix describes the probability of a state that a miRNA
position prefers. For a miRNA sequence of length $n$, its weight matrix $w$ is a $4 \times n$ matrix, where
each row corresponds to one of the following four states: $M$, $X$, $W$, $B_y$, each column corresponds to
a position in the miRNA sequence, and each number in the matrix represents the
probability that the corresponding miRNA position prefers the corresponding state. The state $B_x$
does not correspond to any miRNA position and thus was not considered in the weight matrix $w$.
The transition matrix $t$ is a $5 \times 5$ matrix that represents the transition probabilities among the
five states in a miRNA-target sequence alignment. The miRNA-specific transition and weight
matrices were calculated separately for the two training datasets. To create the weight matrix, the
number of occurrences of the four states at each miRNA position was counted in all miRNA-target
interactions in a dataset. To create the transition matrix, the frequency of occurrence of each
transition in the interactions of a miRNA was calculated. A small pseudo count of 0.0001 was added
to every entry of the two matrices. The matrices were normalized column-wise so that the
numbers in each column sum to 1. The start to end positions of a miRNA were
considered in the 5′ to 3′ direction of the miRNA sequence.
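The counting, pseudo count and column-wise normalization described above can be sketched as follows. This is a minimal illustration under several assumptions that are not stated in the text: each alignment is given as a list of state labels read from the miRNA 5′ end, the mismatch state is labeled "X", and the transition matrix is stored with rows for the next state and columns for the previous state (so each column is a distribution over next states). The function names are hypothetical.

```python
STATES_W = ["M", "X", "W", "By"]        # rows of the weight matrix
STATES_T = ["M", "X", "W", "Bx", "By"]  # rows and columns of the transition matrix
PSEUDO_COUNT = 0.0001


def normalize_columns(matrix):
    """Scale every column of a list-of-rows matrix so that it sums to 1."""
    n_cols = len(matrix[0])
    sums = [sum(row[c] for row in matrix) for c in range(n_cols)]
    return [[row[c] / sums[c] for c in range(n_cols)] for row in matrix]


def build_matrices(alignments, mirna_len):
    """Return (w, t) estimated from alignments given as lists of state labels."""
    w = [[PSEUDO_COUNT] * mirna_len for _ in STATES_W]
    t = [[PSEUDO_COUNT] * len(STATES_T) for _ in STATES_T]
    for states in alignments:
        pos = 0      # 0-based miRNA position, counted from the 5' end
        prev = None
        for state in states:
            if state != "Bx":  # Bx does not consume a miRNA position
                w[STATES_W.index(state)][pos] += 1
                pos += 1
            if prev is not None:
                # Assumed orientation: row = next state, column = previous state.
                t[STATES_T.index(state)][STATES_T.index(prev)] += 1
            prev = state
    return normalize_columns(w), normalize_columns(t)


# Two toy alignments of a 3-nt "miRNA".
w, t = build_matrices([["M", "M", "Bx", "W"], ["M", "X", "W"]], mirna_len=3)
```

Because of the pseudo count, no entry is exactly zero, so the logarithm of every entry is defined when the matrices are later used for scoring.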
Two types of scoring models were designed: miRNA-specific and miRNA-general. In the miRNA-
specific model, the weight and transition matrices were calculated for each miRNA and its targets.
In the miRNA-general model, only one weight matrix and one transition matrix were calculated
using the pairing information of all the miRNAs and their targets within a dataset. In the latter
Figure 3-1: Five states in an miRNA-target interaction
case, the transition and weight matrices were the unweighted average of the respective miRNA-
specific matrices within a dataset.
3.1.2.3 MDPS Scoring Strategy
Given a miRNA and a target sequence, MDPS uses a dynamic programming sequence-alignment
strategy to score the alignments. The weights used to score the alignment of the two
sequences are taken from the weight and the transition matrices. The score of the alignment is used
to determine if the given miRNA and target sequences may interact with each other.
To understand the scoring strategy of MDPS, it is important to be familiar with the following two
notations, $S[i, j, k]$ and $\mathrm{state}(i, j)$. $S[i, j, k]$ is defined as the best score of the alignment
between the $\mathrm{miRNA}(1 \ldots i)$ and target $\mathrm{RNA}(1 \ldots j)$ sequences, with the last alignment position at
the $k$-th posture. Here $\mathrm{miRNA}(1 \ldots i)$ represents the miRNA sequence from position 1 to position
$i$. Similarly, target $\mathrm{RNA}(1 \ldots j)$ represents the target sequence from position 1 to position
$j$. There are three different possibilities for the last alignment position. When $k = 0$, the
last alignment position is at one of the states $M$, $X$, or $W$, which we call posture 0. When $k = 1$,
the last alignment position is at posture 1 and the state is $B_y$. When $k = 2$, the last
alignment position is at posture 2 and the state is $B_x$. $\mathrm{state}(i, j)$ is defined as the state of
the pairing of the $i$-th miRNA position and the $j$-th mRNA position. Since two actual bases
are involved, $\mathrm{state}(i, j)$ can only be one of the states $M$, $X$, or $W$.
With the above definitions, since both miRNA and target sequence positions start from 1,
\[ S[i, j, 0] = -\infty \quad \text{if } i = 0 \text{ or } j = 0. \]
Also, for the first position of the miRNA, no transition is considered. Therefore,
\[ S[1, j, 0] = \log\bigl(w(\mathrm{state}(1, j), 1)\bigr) \quad \text{for any } j > 0, \]
where $w(\mathrm{state}(1, j), 1)$ means the $(\mathrm{state}(1, j), 1)$-th entry of the weight matrix of this miRNA. In
addition, when the first position of the target sequence is aligned with any position of the miRNA
after its first position, a transition from $B_y$ to the current state is considered. So,
\[ S[i, 1, 0] = \log\bigl(w(\mathrm{state}(i, 1), i)\bigr) + S[i-1, 0, 1] + \log\bigl(t(B_y, \mathrm{state}(i, 1))\bigr) \quad \text{for any } i > 1. \]
With these initial cases, we have the following recurrence to calculate $S[i, j, 0]$ for any $i > 1$ and $j > 1$:
\[
S[i, j, 0] = \log\bigl(w(\mathrm{state}(i, j), i)\bigr) + \max
\begin{cases}
S[i-1, j-1, 0] + \log\bigl(t(\mathrm{state}(i-1, j-1), \mathrm{state}(i, j))\bigr) \\
S[i-1, j-1, 1] + \log\bigl(t(B_y, \mathrm{state}(i, j))\bigr) \\
S[i-1, j-1, 2] + \log\bigl(t(B_x, \mathrm{state}(i, j))\bigr)
\end{cases}
\]
Similarly, $S[i, j, 1]$ was calculated by the following formula, with the initial cases $S[0, j, 1] = -\infty$
and $S[1, j, 1] = \log\bigl(w(B_y, 1)\bigr)$, for $i > 1$ and any $j$:
\[
S[i, j, 1] = \log\bigl(w(B_y, i)\bigr) + \max
\begin{cases}
S[i-1, j, 0] + \log\bigl(t(\mathrm{state}(i-1, j), B_y)\bigr) \\
S[i-1, j, 1] + \log\bigl(t(B_y, B_y)\bigr)
\end{cases}
\]
Similarly, with the initial cases $S[i, 1, 2] = S[i, 0, 2] = -\infty$ for any $i$, $S[i, j, 2]$ was calculated for
any $i$ and $j > 1$:
\[
S[i, j, 2] = \max
\begin{cases}
S[i, j-1, 0] + \log\bigl(t(\mathrm{state}(i, j-1), B_x)\bigr) \\
S[i, j-1, 2] + \log\bigl(t(B_x, B_x)\bigr)
\end{cases}
\]
With the above cases, the maximum of $S[n, j, k]$ over all $j$ and $k$ is considered as the final score
of MDPS, where $n$ is the last position of the miRNA sequence. This score is the alignment score
of the miRNA and target RNA sequences under consideration, based on the MDPS strategy. The
actual alignment that results in this score, which represents the pairing between the miRNA and
target sequences, can be recovered by backtracking.
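The three-posture recurrence can be written as a short dynamic program. The sketch below is illustrative rather than the MDPS implementation itself, and it rests on assumptions that go beyond the text: the target string is supplied already oriented so that target position j can pair with miRNA position i, the pairing state is derived from Watson-Crick and G-U rules by a hypothetical helper `pair_state`, the mismatch state is labeled "X", the weights are passed as `w[state][i - 1]`, and the transitions as `t[previous][current]`. Backtracking is omitted.

```python
import math

NEG_INF = float("-inf")
WATSON_CRICK = {("A", "U"), ("U", "A"), ("G", "C"), ("C", "G")}
WOBBLE = {("G", "U"), ("U", "G")}


def pair_state(mirna_base, target_base):
    """Return M (match), W (G-U wobble) or X (mismatch) for two bases."""
    if (mirna_base, target_base) in WATSON_CRICK:
        return "M"
    if (mirna_base, target_base) in WOBBLE:
        return "W"
    return "X"


def mdps_score(mirna, target, w, t):
    """Best alignment score; w and t hold strictly positive probabilities."""
    n, m = len(mirna), len(target)

    def lw(state, i):
        return math.log(w[state][i - 1])

    def lt(prev, cur):
        return math.log(t[prev][cur])

    def st(i, j):
        return pair_state(mirna[i - 1], target[j - 1])

    # S[i][j][k]: k = 0 paired (M/X/W), k = 1 bulge in miRNA (By),
    # k = 2 bulge in target (Bx). Unset cells stay at -infinity.
    S = [[[NEG_INF] * 3 for _ in range(m + 1)] for _ in range(n + 1)]

    for i in range(1, n + 1):
        for j in range(m + 1):
            # Posture 1: miRNA position i is unpaired.
            if i == 1:
                S[i][j][1] = lw("By", 1)
            else:
                best = S[i - 1][j][1] + lt("By", "By")
                if j >= 1:
                    best = max(best, S[i - 1][j][0] + lt(st(i - 1, j), "By"))
                S[i][j][1] = lw("By", i) + best
            if j == 0:
                continue
            # Posture 0: miRNA position i pairs with target position j.
            cur = st(i, j)
            if i == 1:
                S[i][j][0] = lw(cur, 1)
            elif j == 1:
                S[i][j][0] = lw(cur, i) + S[i - 1][0][1] + lt("By", cur)
            else:
                S[i][j][0] = lw(cur, i) + max(
                    S[i - 1][j - 1][0] + lt(st(i - 1, j - 1), cur),
                    S[i - 1][j - 1][1] + lt("By", cur),
                    S[i - 1][j - 1][2] + lt("Bx", cur),
                )
            # Posture 2: target position j is unpaired (defined for j > 1).
            if j > 1:
                S[i][j][2] = max(
                    S[i][j - 1][0] + lt(st(i, j - 1), "Bx"),
                    S[i][j - 1][2] + lt("Bx", "Bx"),
                )

    # Final score: best value in the last miRNA row over all j and postures.
    return max(S[n][j][k] for j in range(m + 1) for k in range(3))


# Toy parameters: "M" is favored at every position, transitions are uniform.
ALL_STATES = ["M", "X", "W", "Bx", "By"]
toy_w = {s: [0.7 if s == "M" else 0.1] * 4 for s in ["M", "X", "W", "By"]}
toy_t = {a: {b: 0.2 for b in ALL_STATES} for a in ALL_STATES}
perfect = mdps_score("UGAG", "ACUC", toy_w, toy_t)
poor = mdps_score("UGAG", "GGGG", toy_w, toy_t)
```

With these toy parameters the fully complementary target scores higher than the target that offers only wobble and mismatch pairings. Storing scores in log space keeps the products of small probabilities numerically stable.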
The MDPS model hyperparameters consisted of the $w$ matrices, the $t$ matrices and the
corresponding score cutoffs that gave the best predictions on the CLASH training dataset for
different miRNAs. The hyperparameters were generated from the target-enriched dataset and the
energy-filtered dataset separately. For the miRNA-specific models, miRNA-specific
hyperparameters were generated, which contained separate $w$, $t$ and score cutoffs for every miRNA
in the dataset. For the miRNA-general model, only one set of $w$, $t$ and score cutoff was generated
for all the miRNAs in the dataset. The $w$ and $t$ in this model were generated by taking the average of
the miRNA-specific models from the target-enriched dataset and the energy-filtered dataset
separately. Since the number of columns of the $w$ matrix was the length of the corresponding miRNA
in the miRNA-specific models, and the lengths differ between miRNAs, the number of columns of
the $w$ matrix in the general models was set to the length of the longest miRNA sequence in
the training datasets. The score cutoffs are required to filter out false positive predictions. After
considering five different criteria, the score cutoff was chosen as Average score + 2 × Standard
deviation, where the Average score and the Standard deviation are the mean and the standard
deviation of the alignment scores of the miRNA-target duplexes in the training datasets.
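The cutoff rule can be illustrated with a minimal sketch. It assumes the sample standard deviation (the text does not say whether the sample or population form was used), and the function name `score_cutoff` and the toy scores are hypothetical. How a candidate score is then compared with the cutoff is not spelled out above, so the sketch stops at computing the threshold.

```python
from statistics import mean, stdev


def score_cutoff(training_scores, k=2.0):
    """Return mean + k * standard deviation of the training alignment scores.

    Assumption of this sketch: the sample standard deviation is used.
    """
    return mean(training_scores) + k * stdev(training_scores)


# Toy alignment scores standing in for training duplex scores.
cutoff = score_cutoff([1.0, 2.0, 3.0, 4.0, 5.0])
```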
3.1.2.4 Combining MDPS Scores with Existing Tools
All the popular target prediction algorithms emphasize the miRNA-target pairing in the seed
regions [66, 82-84], and/or do not consider the dependence of the neighboring pairings [67]. The
miRNA-target alignment score measured by MDPS is representative of the position-wise
preference and dependencies between adjacent positions throughout the whole miRNA sequence.
By incorporating the MDPS scores with the predictions of the existing tools, the efficiency of the
overall miRNA-target prediction may be improved. To test this hypothesis, the MDPS scores were
combined with three popular methods, miRanda, RNA22, and TargetScan [67, 82-85]. First, the
predictions of the three tools on the given miRNA and target sequences were generated by running
miRanda 3.3a and TargetScanHuman 7.0 and using the existing predictions of RNA22
(ENSEMBL 65, miRbase 18). Then the MDPS scores were calculated and the score cutoffs were
applied on the predicted positive miRNA and target sequences to generate the combined
predictions. The original prediction of the three tools and the combined predictions were compared
on the two test datasets (Table 3-1).
3.1.3 Results
3.1.3.1 Importance of Non-Seed Regions in miRNA–Target Interactions
Canonical rules of miRNA-target interactions emphasize extensive bonds in the seed region as one
of the primary criteria. But the CLASH study [68] reported numerous interactions with poor
interactions in the seed region. To evaluate the importance of the miRNA positions outside seed
region in target binding, the 18,041 CLASH interactions were analyzed. MiRNA positions 1-8 were
considered as the seed region in this section. The analysis showed that more than 12% of miRNAs had
at least eight matches/wobbles after the eighth position in the interactions in which they were
involved (Figure 3-2A). Out of the 399 miRNAs listed in the CLASH study, 386 (97%) had interactions
with at least one match/wobble pairing outside the seed regions. Figure 3-2B shows the distribution
of the number of match/wobble pairings outside the seed regions among the 18,041 CLASH
interactions. Only 14 interactions had no match/wobble pairing outside the seed region.
The miRNA–target interactions with extensive seed matching also showed a good number of
match/wobble pairs outside the seed regions. Similar to the CLASH study [68], the 6mer, 7mer,
8mer and 9mer interactions were considered as the interactions with seed matching, which had 6,
7, 8 and 9 continuous matches from miRNA position 1, respectively. Although the non-seed
interactions on average had more match/wobble pairings after the seed regions, more
than 50% of the seed interactions also tended to have extensive bonds at
positions 10-20 (Figure 3-2C). This analysis suggests that it may be valuable to
consider miRNA-target pairings after the seed region of the miRNA sequence.
Figure 3-2: Non-seed regions may be important for miRNA-target interactions. (A) Percentage of
miRNAs with the different lowest number of match/wobble pairings after the position 8 in the 18041
CLASH interactions. (B) Percentage of the 18041 CLASH interactions having different number of
match/wobble pairing after the position 8. (C) The frequency of match/wobble pairing at different
miRNA positions for different types of CLASH interactions.
The dependency between contiguous positions of a miRNA sequence in terms of target interactions
was also studied. The hypothesis was that when two miRNA-target bonds (match/wobble) occur side
by side, the strength of one pairing might help to stabilize the pairing next to it. To study the
dependency between neighboring positions, each “Match” or “Wobble” state was labeled with a “1”
and each “Mismatch” or “Bulge” with a “0”, for each position of a miRNA. In this way, for each
miRNA position, a binary binding vector was generated which represented the binding states of
that miRNA position in the interactions. The size of this binding vector reflected the number of
miRNA interactions (Figure 3-3A). To find the correlations between two positions of the same
miRNA, the Matthews correlation coefficient (MCC) formula was applied to the two binary vectors
for the two positions of the miRNA (Figure 3-3A). Only the neighboring positions tended to have
positive correlation (MCC ≥ 0.75). Also, the adjacent positions within regions 2–9, 11–14 and
16–21 of a large number of miRNAs tended to show the higher correlation values (Figure 3-3B). This
suggested the potential dependency or cooperation between adjacent binding positions of a
miRNA in terms of target binding. All these analyses indicated that all the miRNA positions
and their dependencies are worth considering for miRNA–target interactions. The MDPS scores
should be able to capture this information.
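The binary encoding and the MCC computation can be sketched as below. The state labels, the helper names `binding_vector` and `mcc`, and the convention of returning 0 when the denominator vanishes are assumptions made for illustration; the MCC formula itself is the standard one for two binary vectors.

```python
import math

BOUND_STATES = {"Match", "Wobble"}  # labeled 1; "Mismatch" and "Bulge" are labeled 0


def binding_vector(states_at_position):
    """Encode the states of one miRNA position across its interactions as 0/1."""
    return [1 if state in BOUND_STATES else 0 for state in states_at_position]


def mcc(u, v):
    """Matthews correlation coefficient between two equal-length binary vectors."""
    tp = sum(1 for a, b in zip(u, v) if a == 1 and b == 1)
    tn = sum(1 for a, b in zip(u, v) if a == 0 and b == 0)
    fp = sum(1 for a, b in zip(u, v) if a == 0 and b == 1)
    fn = sum(1 for a, b in zip(u, v) if a == 1 and b == 0)
    denom = math.sqrt((tp + fp) * (tp + fn) * (tn + fp) * (tn + fn))
    # Convention assumed here: report 0 when a vector is constant.
    return (tp * tn - fp * fn) / denom if denom else 0.0


# States of two neighboring positions over four toy interactions.
pos_a = binding_vector(["Match", "Wobble", "Mismatch", "Bulge"])
pos_b = binding_vector(["Match", "Match", "Bulge", "Mismatch"])
correlation = mcc(pos_a, pos_b)
```

In this toy case the two positions are bound and unbound in exactly the same interactions, so the coefficient reaches its maximum of 1.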
3.1.3.2 Clusters of miRNAs Share Correlated Target Binding Patterns
Since many miRNAβtarget interactions involve both seed and non-seed regions and the pairing at
different miRNA positions are dependent, we hypothesized that many miRNAs may have similar
or correlated target binding patterns. This hypothesis was tested with the obtained weight and
transition matrices and found that many miRNAs indeed share correlated binding patterns.
Figure 3-3: Correlated pairs of miRNA positions. (A) An illustration of how MCC is calculated for
miR-484. (B) The percentage of miRNAs having correlated position pairs (MCC ≥ 0.75). The heatmap
has miRNA positions on the axes, and the percentage of correlated miRNAs is shown for every pair
of positions.
To investigate whether different miRNAs have similar or correlated binding patterns, a miRNA
sequence was divided into two equal-size regions, positions 1–8 and 9–16. Here, the results are
shown for the energy-filtered dataset, although the conclusions were similar for the target-enriched
dataset. For each of the 122 miRNAs in the energy-filtered dataset, its position-wise "Match" and
"Mismatch" probabilities were obtained from the learned weight matrix. The Spearman's
correlation coefficient between each pair of miRNAs was calculated based on their position-wise
"Match" and "Mismatch" probabilities. This was done in both regions separately. The G-U
wobble state was considered as the "Match" state and bulge states were ignored in this analysis.
The miRNAs that belonged to the same family were ignored here, as these miRNAs had high
sequence similarities. A clique-finding-based clustering process was applied based on the
correlation (correlation cutoff = 0.75), and 17 distinct clusters of miRNAs were identified that were
correlated in terms of "Match" state probabilities at positions 1–8. The largest 8 clusters contained
50.88% of the total 122 miRNAs (Figure 3-4A shows four different exclusive clusters). When
considering positions 9–16 of a miRNA, 29 distinct miRNA clusters were identified in which the
miRNAs in each cluster were correlated on "Match" state probabilities within that region (Figure
3-4B). The largest 10 clusters contained only 29.82% of the total 122 miRNAs considering "Match"
probabilities. These statistics suggested that the seed regions (positions 1–8) of miRNAs were
more correlated than the non-seed regions (positions 9–16), which supported the current practice of
considering seed matching for miRNA targeting but at the same time indicated that the
non-seed regions also contribute substantially to this process.
Figure 3-4: Clusters of miRNAs with similar "Match" patterns in specific regions. The X-axis of
a cluster plot shows the positions of the miRNAs in that cluster and the Y-axis of the plot shows
the percentage of interactions having "Match" in the corresponding miRNA positions. (A) Clusters of
miRNAs correlated in the "Match" state probability from position 1 to 8. (B) Clusters of
miRNAs correlated in the "Match" state probability from position 9 to 16.
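The correlation-and-clique clustering of miRNAs described above can be sketched as follows. This is a minimal illustration under stated assumptions: the miRNA names and "Match" probability profiles are invented, Spearman's coefficient is computed by ranking and then applying Pearson's formula, and maximal cliques are enumerated with the basic Bron-Kerbosch procedure. The text does not specify which clique-finding procedure was used for this step or how overlapping cliques were resolved into distinct clusters, so those choices are assumptions of the sketch.

```python
from itertools import combinations


def ranks(values):
    """1-based ranks; tied values share their average rank."""
    order = sorted(range(len(values)), key=lambda i: values[i])
    result = [0.0] * len(values)
    i = 0
    while i < len(order):
        j = i
        while j + 1 < len(order) and values[order[j + 1]] == values[order[i]]:
            j += 1
        for k in range(i, j + 1):
            result[order[k]] = (i + j) / 2 + 1
        i = j + 1
    return result


def pearson(x, y):
    n = len(x)
    mx, my = sum(x) / n, sum(y) / n
    cov = sum((a - mx) * (b - my) for a, b in zip(x, y))
    vx = sum((a - mx) ** 2 for a in x)
    vy = sum((b - my) ** 2 for b in y)
    return cov / (vx * vy) ** 0.5 if vx and vy else 0.0


def spearman(x, y):
    """Spearman's rho = Pearson correlation of the rank vectors."""
    return pearson(ranks(x), ranks(y))


def maximal_cliques(adj):
    """Basic Bron-Kerbosch enumeration over a dict of neighbor sets."""
    cliques = []

    def expand(r, p, x):
        if not p and not x:
            cliques.append(sorted(r))
            return
        for v in list(p):
            expand(r | {v}, p & adj[v], x & adj[v])
            p = p - {v}
            x = x | {v}

    expand(set(), set(adj), set())
    return cliques


def cluster_mirnas(profiles, cutoff=0.75):
    """Link miRNAs whose profiles correlate at or above the cutoff; return cliques of size > 1."""
    adj = {name: set() for name in profiles}
    for a, b in combinations(profiles, 2):
        if spearman(profiles[a], profiles[b]) >= cutoff:
            adj[a].add(b)
            adj[b].add(a)
    return sorted(c for c in maximal_cliques(adj) if len(c) > 1)


# Hypothetical "Match" probabilities at positions 1-8 for four made-up miRNAs.
profiles = {
    "mirA": [0.90, 0.80, 0.70, 0.60, 0.50, 0.40, 0.30, 0.20],
    "mirB": [0.95, 0.85, 0.75, 0.65, 0.55, 0.45, 0.35, 0.25],
    "mirC": [0.80, 0.90, 0.60, 0.70, 0.40, 0.50, 0.20, 0.30],
    "mirD": [0.20, 0.30, 0.40, 0.50, 0.60, 0.70, 0.80, 0.90],
}
print(cluster_mirnas(profiles))  # [['mirA', 'mirB', 'mirC']]
```

Here mirA, mirB and mirC have similarly ordered profiles and form one clique, while mirD has the reverse trend and stays unclustered.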
3.1.3.3 miRNA-general models showed better performance on target site prediction than miRNA-
specific models
Many miRNAs have similar or correlated target binding patterns, as demonstrated in Section 3.1.3.2.
This leads to the idea that a miRNA-general model learned from the position-wise information of
all the miRNAs and their corresponding targets should work better than the miRNA-specific
models learned for individual miRNAs. In the miRNA-general model, a common weight matrix
and a common transition matrix were learned for all miRNAs together. In the miRNA-specific
models, a unique weight matrix and a unique transition matrix were learned for each individual
miRNA with a sufficient number of targets (≥ 20). The models were learned with 10-fold
cross-validation on the two training datasets.
In the comparative performance analysis, the miRNA-general model was found to work better than
the miRNA-specific models. In the target-enriched datasets, the miRNA-general model identified
93.49% of the CLASH interactions correctly, while the miRNA-specific models identified 87.56%
of the CLASH interactions. Similarly, in the energy-filtered datasets, the miRNA-general model
identified 91.59% of the CLASH interactions, while the miRNA-specific models identified only
85.91% of the interactions.
The following could be the reasons for the better performance of the miRNA-general model. First,
as demonstrated in the last section, miRNAs do share similar or correlated patterns in terms of
target binding, which enabled the miRNA-general model to capture the "key" or "conserved"
characteristics of miRNA–target interactions. Second, much more training data were available to
train the miRNA-general model than to train a miRNA-specific model. Possibly, the number of targets
of an individual miRNA in the training datasets was not large enough for the miRNA-specific
model to avoid "overfitting". However, this second reason can be ruled out, as the 10-fold
cross-validation accuracy of the miRNA-specific models on the 10 held-out groups was similar.
Therefore, it is highly likely that the general models worked better because of the
similarity of the binding patterns of different miRNAs.
Despite the overall better performance of the miRNA-general model, for certain miRNAs the
miRNA-specific models did work better than the miRNA-general model. For instance, for miR-10a,
the miRNA-specific model predicted 100% of its target sites correctly, whereas the miRNA-general
model predicted 86% of its target sites correctly. This miRNA had 51 targets in the
energy-filtered training dataset. Also, the number of target sites in the training dataset was not a decisive
factor in the performance. For example, in the case of miR-186, the miRNA-specific model did not
perform better, even though it had 81 training target sites. On the other hand, the miRNA-specific
model performed better for miR-1301, although it had only 26 training target sites. So, it can be
said that the individual binding pattern, rather than the number of training target sites, was the
reason that the miRNA-specific model worked better in such cases.
3.1.3.4 Combining the MDPS scores with existing tools improved their accuracy
Almost all the existing miRNA target prediction tools suffer from a huge number of false positive
predictions. Since these tools do not consider the entire miRNA region for miRNA–target
interaction prediction, and/or do not consider the dependency among different pairing positions in
miRNA–target interactions, we hypothesized that combining the MDPS scores with the existing
tools might improve the precision of these tools. To test this hypothesis, the
MDPS scoring process was applied to the miRNA–target prediction results from three tools.
To combine MDPS scores with miRanda, RNA22 and TargetScan, these tools were first applied to
predict miRNA–target interactions. Then the MDPS scores were calculated for the predicted
target sites, and each predicted target site was labeled true or false based on the MDPS score cutoff
from the trained general models. This two-step process was applied to the held-out 20% of the CLASH
dataset (not used for training) and to the independent CLEAR-CLIP dataset. Also, the MDPS
hyperparameters trained on the target-enriched and the energy-filtered datasets were applied
separately to calculate the combined predictions. After combining with MDPS, the precision of the
combined predictions was substantially increased while the recall was slightly decreased, compared with the original
prediction of the tools (Table 3-2). Overall, the F1 score of the combined tool was improved. For
instance, on the CLEAR-CLIP data, the recall of RNA22 decreased by 9.35% while its precision and
F1 score increased by 22.71% and 22.46%, respectively, when combined with the MDPS model trained on
the target-enriched dataset (Table 3-2). This analysis demonstrated that the MDPS score, as an
additional feature for miRNA target site prediction, was able to decrease the false positive
predictions of the existing tools.
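The two-step combination described above (predict with an existing tool, then keep only the sites whose score passes a cutoff) can be sketched as below. The site identifiers, scores and cutoff are invented for illustration, and the sketch assumes that a higher score indicates a more plausible site; it shows the general filtering-and-evaluation pattern rather than the actual MDPS pipeline.

```python
def precision_recall_f1(predicted, true_sites):
    """Precision, recall and F1 of a set of predicted sites against a set of true sites."""
    tp = len(predicted & true_sites)
    precision = tp / len(predicted) if predicted else 0.0
    recall = tp / len(true_sites) if true_sites else 0.0
    f1 = 2 * precision * recall / (precision + recall) if tp else 0.0
    return precision, recall, f1


def filter_by_score(predicted, scores, cutoff):
    """Second step: keep only the predicted sites whose score reaches the cutoff."""
    return {site for site in predicted if scores[site] >= cutoff}


# Toy data (hypothetical): five true sites; the tool finds four of them plus six false positives.
true_sites = {"s1", "s2", "s3", "s4", "s5"}
tool_predictions = {"s1", "s2", "s3", "s4", "f1", "f2", "f3", "f4", "f5", "f6"}
scores = {"s1": 0.9, "s2": 0.8, "s3": 0.7, "s4": 0.3,
          "f1": 0.6, "f2": 0.2, "f3": 0.1, "f4": 0.4, "f5": 0.3, "f6": 0.2}

before = precision_recall_f1(tool_predictions, true_sites)
combined = filter_by_score(tool_predictions, scores, cutoff=0.5)
after = precision_recall_f1(combined, true_sites)

print("tool only:     P=%.2f R=%.2f F1=%.2f" % before)  # P=0.40 R=0.80 F1=0.53
print("tool + filter: P=%.2f R=%.2f F1=%.2f" % after)   # P=0.75 R=0.60 F1=0.67
```

The toy numbers reproduce the qualitative pattern reported in the text: the filter removes most false positives at the cost of one true site, so precision and F1 rise while recall drops.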
3.1.4 Discussion
Existing miRNA-target prediction tools are heavily dependent on the canonical rules of
miRNA-target interactions. The canonical rules include extensive binding in the seed region, a
target site located in the 3′ UTR of the mRNA and high stability of the interaction between the sites.
Although these tools can identify a good number of experimentally validated interactions, they all
suffer from a huge number of false positive predictions. Recent experimental data provide
numerous miRNA–target interactions that do not follow any of these canonical rules. Studies on
these newly generated datasets have shown potential involvement of non-seed regions of miRNAs
in the binding activities. However, the importance of non-seed regions for miRNA target binding
has not been thoroughly studied; neither has the dependency among the consecutive positions and
regions in the miRNA. The MDPS algorithm was developed to learn miRNA-target pairing
patterns, both in the seed and non-seed regions of miRNA binding, by utilizing the genome-wide
CLASH datasets. MDPS takes into account the dependency of neighboring positions of the
miRNA sequence using a Markov model. Utilizing the weight and transition matrices of the trained
Markov model, MDPS is then able to score each potential miRNA binding site to pre-select/predict
putative candidate miRNA–target interactions. By combining the MDPS scores with the existing
tools, the precision scores of the combined tools were greatly improved.
Table 3-2: Performance comparison of the combined tools with the original tools.
                                      miRanda                    RNA22                      TargetScan
Model and test data                   F1     Precision  Recall   F1     Precision  Recall   F1     Precision  Recall
Target-enriched model on CLASH        18.88  23.64      -5.76    25.24  26.62      -5.22    20.78  23.37      -4.19
Target-enriched model on CLEAR-CLIP   15.36  15.67      -7.12    22.46  22.71      -9.35    18.11  18.28      -7.16
Energy-filtered model on CLASH        17.82  20.85      -7.62    24.52  25.66      -7.10    23.21  24.97      -4.81
Energy-filtered model on CLEAR-CLIP   15.52  15.89      -10.68   21.15  21.40      -10.57   15.81  15.96      -7.64
Each number is the percentage increase in performance of the combined tool compared with the
original tool; negative values indicate a decrease.
The DP used in MDPS is different from the one used in miRanda [67], which uses a standard DP
algorithm to perform pairwise alignment between a miRNA and a potential target. The alignment
score is then used as a criterion, together with site conservation and binding energy scores, to predict
miRNA target sites. There are at least two important differences between the miRanda DP
algorithm and the MDPS one. The first is the scoring scheme for miRNA-target alignments:
miRanda uses a fixed scoring scheme, such as a score of +5 for G:C and A:T pairs, +2 for G:U
wobble pairs, etc. [69], whereas MDPS uses a probabilistic scoring scheme based on the CLASH
training data. The second is that MDPS considers the dependency between neighboring pairing
positions in the alignments, whereas miRanda assumes that neighboring pairing positions are independent.
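To make the contrast between a fixed and a probabilistic, neighbor-aware scoring scheme concrete, the sketch below estimates position-wise state probabilities (a "weight" matrix) and state-to-state transition probabilities from toy training paths and then scores a path of binding states as a sum of log-probabilities. It is only an illustration under assumptions: the pseudocount, the log-sum score and the toy paths are not taken from the text, and the sketch scores a given state path, whereas MDPS itself searches for an alignment with DP.

```python
from math import log

STATES = ("Match", "Wobble", "Mismatch", "Bulge")


def train(paths, pseudocount=1.0):
    """Estimate position-wise state probabilities and transition probabilities.

    Each path holds one binding state per miRNA position. The pseudocount
    (an assumption of this sketch) keeps unseen states from getting probability 0.
    """
    length = len(paths[0])
    weight = [{s: pseudocount for s in STATES} for _ in range(length)]
    trans = {a: {b: pseudocount for b in STATES} for a in STATES}
    for path in paths:
        for pos, state in enumerate(path):
            weight[pos][state] += 1
            if pos > 0:
                trans[path[pos - 1]][state] += 1
    for row in weight + list(trans.values()):
        total = sum(row.values())
        for state in row:
            row[state] /= total
    return weight, trans


def score(path, weight, trans):
    """Log-probability-style score: position-wise term plus neighbor-transition term."""
    total = log(weight[0][path[0]])
    for pos in range(1, len(path)):
        total += log(weight[pos][path[pos]]) + log(trans[path[pos - 1]][path[pos]])
    return total


# Toy training paths over four positions (hypothetical data).
M, W, X = "Match", "Wobble", "Mismatch"
training = [(M, M, M, X)] * 3 + [(M, M, W, X), (X, M, M, M)]
weight, trans = train(training)

contiguous = (M, M, M, X)   # pairings sit side by side, as in the training data
alternating = (M, X, M, X)  # same length, but pairings are interrupted
print(score(contiguous, weight, trans) > score(alternating, weight, trans))  # True
```

Because the transition term rewards state changes that were frequent in training, the contiguous path scores higher than the alternating one, which a position-independent fixed scheme would not distinguish in the same way.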
Through the investigation of the Markov models learned from both target-enriched datasets and
energy-filtered datasets, we were able to make interesting findings on position-wise binding
patterns of miRNA–target interactions. We found that subsets of miRNAs had correlated binding
patterns in specific sub-regions. We also found that both seed and non-seed regions contribute to
the binding patterns of specific miRNAs. Besides seed region binding, the length of the continuous
pairings outside the seed region, the gap between two continuous pairings, and the number and
position of G-C pairings in an interaction are also important features that can play a part in
miRNA target prediction. The position-wise knowledge of miRNA target binding, the continuous
pairing patterns, and the number and position of the G-C bonds, along with the canonical seed
preference rule, can help us to design a target prediction algorithm with less bias and better
sensitivity and specificity.
Although the MDPS scores can help to improve the miRNA target site prediction, we are unsure
whether these selected target sites are functional. In other words, although the miRNAs may indeed
bind to the corresponding selected target sites, the miRNAs may not suppress the expression level
of the target RNAs. These selected sites can only be considered as potential target sites and their
functional effects need to be further investigated by experiments.
The current version of MDPS was not developed to be a standalone tool for miRNA target
prediction. Along with this score, many other features such as sequence conservation, binding
energy and target site abundance need to be considered to confidently predict miRNA
target sites. However, the focus of this study was to find out whether the dependencies between the
neighboring positions of the miRNA sequence and the global pairing information of miRNA–target
interactions are important for target site selection. The incorporation of MDPS either as a feature
or as an additional step in existing miRNA target prediction pipelines has the potential to enhance
their overall performance.
CHAPTER 4 : CONCLUSION AND FUTURE WORK
4.1 Conclusion
This dissertation focuses on two of the major factors of gene expression regulation that act in the
transcription and post-transcription stages of gene expression. EPIs act at the transcription stage:
these interactions, along with several transcription factors and RNA polymerase II, initiate the
transcription of a gene. Here, the properties of the interactions were discussed and analyzed, and the
important features were collected to design a prediction tool for cell-specific interactions. From
the analysis of the interaction patterns, an important characteristic of enhancers was identified,
which provides a new way of dealing with the interactions.
EPIP: We designed an EPI prediction tool named EPIP that can efficiently predict
cell-specific EPIs by handling the missing features in cell lines. We used two sets of enhancers, a
properly curated set of promoters, experimentally validated Hi-C chromatin contacts and a
comprehensive set of cell-specific features for eight different cell lines. The inactive
enhancers and promoters were filtered out using active histone marks and RNA-seq gene
expression information in the respective cell lines. The EPIP model was designed to handle the
missing features using 11 feature partitions and a set of robust ensemble classifiers. Each
partition represents a set of cell lines that have that partition among their available features.
The model decides the prediction output based on the voting of weak learners trained on the
respective feature partitions. EPIP was compared with two popular EPI prediction tools on
both the EPIP test dataset and the test data of the two tools. EPIP outperformed the two tools on
both sets of data, especially in terms of cell-specific EPI prediction. EPIP was also tested on
five different test datasets, including three sets of data from other labs. In all cases, EPIP
showed high performance, particularly for the cell-specific EPIs.
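The idea of voting only among learners whose feature partition is available for a sample can be sketched as follows. Everything concrete in this sketch is hypothetical: the feature names, the threshold rules standing in for trained learners, and the strict-majority rule are illustrative assumptions, not the 11 partitions or the ensemble classifiers of EPIP.

```python
def predict_with_partitions(sample, learners):
    """Vote among the learners whose required features are all present in the sample.

    sample   : dict mapping feature name -> value (missing features are simply absent)
    learners : list of (required_feature_set, predict_function) pairs
    Returns True/False by strict majority, or None if no learner is applicable.
    """
    votes = [predict(sample)
             for required, predict in learners
             if all(feature in sample for feature in required)]
    if not votes:
        return None
    return 2 * sum(votes) > len(votes)


# Hypothetical learners, each tied to one feature partition.
learners = [
    ({"distance"}, lambda s: s["distance"] < 100_000),
    ({"h3k27ac"}, lambda s: s["h3k27ac"] > 1.0),
    ({"distance", "ctcf"}, lambda s: s["ctcf"] > 0.5 and s["distance"] < 500_000),
]

# A cell line without CTCF data: only the first two learners vote.
print(predict_with_partitions({"distance": 50_000, "h3k27ac": 2.0}, learners))  # True
# A cell line without H3K27ac data: the first and third learners disagree.
print(predict_with_partitions({"distance": 50_000, "ctcf": 0.1}, learners))     # False
```

The design point is that a sample from a cell line with missing features is never imputed; it is simply judged by the subset of learners that can see all the features they were trained on.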
Analyzing chromatin interaction datasets from five different labs, we found an interaction
pattern of enhancers with their target gene promoters. When interacting with a shared set of
promoters, multiple enhancers tend not to share them partially; enhancers either share all of
their interacting promoters or share none. Based on this property, we extracted clusters
of enhancers in different cell lines that have very little overlap with the known
super-enhancers. The enhancers in a cluster are mostly consecutive in terms of their genomic
positions and belong to the same TAD. The clusters of enhancers differ between cell lines.
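The all-or-none sharing property suggests a simple way to illustrate enhancer clusters: group enhancers whose sets of interacting promoters are identical, and list the pairs that overlap only partially. The sketch below uses made-up enhancer and promoter identifiers and is an assumption about how such a grouping could be computed, not the extraction procedure used in this study.

```python
from collections import defaultdict
from itertools import combinations


def promoter_sets(interactions):
    """Map each enhancer to the set of promoters it interacts with."""
    targets = defaultdict(set)
    for enhancer, promoter in interactions:
        targets[enhancer].add(promoter)
    return targets


def clusters_by_shared_promoters(interactions):
    """Group enhancers that interact with exactly the same set of promoters."""
    groups = defaultdict(list)
    for enhancer, promoters in promoter_sets(interactions).items():
        groups[frozenset(promoters)].append(enhancer)
    return {proms: sorted(ens) for proms, ens in groups.items() if len(ens) > 1}


def partial_sharing_pairs(interactions):
    """Enhancer pairs whose promoter sets overlap without being identical."""
    targets = promoter_sets(interactions)
    return [(a, b) for a, b in combinations(sorted(targets), 2)
            if targets[a] & targets[b] and targets[a] != targets[b]]


# Toy enhancer-promoter interactions (hypothetical identifiers).
toy_epis = [("e1", "p1"), ("e1", "p2"), ("e2", "p1"), ("e2", "p2"),
            ("e3", "p3"), ("e4", "p2"), ("e4", "p3")]

print(clusters_by_shared_promoters(toy_epis))  # {frozenset({'p1', 'p2'}): ['e1', 'e2']}
print(partial_sharing_pairs(toy_epis))         # [('e1', 'e4'), ('e2', 'e4'), ('e3', 'e4')]
```

Under the property reported above, real data would be expected to yield many pairs of the first kind and comparatively few of the second.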
The interaction between a miRNA and a protein-coding mRNA is regarded as a major gene
regulation factor. By interacting with an mRNA, a miRNA disrupts the translation pathway of the
mRNA or degrades the mRNA, which eventually blocks the production of the protein that
the mRNA encodes. Thus, miRNAs play a vital part in modulating the regular gene expression
pathway and can contribute to complications such as severe diseases in the human body.
The miRNA-mRNA interactions are imperfect in human and contain
diverse patterns that are difficult to understand. Also, the small size of a miRNA allows it to bind
to multiple strong interaction sites in an mRNA transcript, but not all of these sites are active in a
given cell type. Hence, to achieve high precision, miRNA target prediction tools follow certain
canonical rules. Recent experimental data show a huge number of interactions that do not
follow these canonical rules, resulting in low sensitivity for the prediction tools. The non-canonical
interactions found in the data challenge the traditional features used for prediction and suggest
the possible contribution of position-specific features of miRNA and target sequences.
MDPS: With the hypothesis that every position of a miRNA may carry a certain importance
in forming a miRNA-target interaction, we designed a feature named MDPS. Given a
miRNA and target sequence pair, the sequences are aligned using scores based on the miRNA
position-wise frequencies of the binding states and the transition frequencies from one binding
state to another. The overall score of the alignment is then taken as the MDPS feature
score for the sequence pair. The position-wise frequencies and transition weights of the states
were learned from the interactions extracted from the CLASH experimental data. In addition,
a score cutoff is set to remove false positive interactions. MDPS was applied to the
predicted positives of three popular miRNA target prediction tools and was shown to increase
their precision. Based on the position-wise binding frequencies of individual miRNAs, we
also showed the significance of the non-seed regions and found clusters of miRNAs having
region-wise similar binding patterns.
4.2 Future Work
4.2.1 Enhancer-promoter interactions
This study focused on the properties of enhancers and the interactions between enhancers and
promoters. The EPIP tool that we designed was trained with the best datasets available to date. But
with the annotation of new enhancers and promoters and the availability of more accurate and
broadly representative training data in the future, the performance of EPIP can be improved further.
We used Hi-C chromatin interactions to extract training data. It is worth studying how the
performance of EPIP changes when EPIs from other sources of chromatin interaction data, such as
ChIA-PET and 5C, are used together with Hi-C. Like the other EPI prediction tools, EPIP considers
one EP-pair at a time to decide whether it is an active EPI. But the EPIs may be interconnected due
to the complicated chromatin structure, as found in a recent study [14]. So, changing the design of
EPIP to consider multiple EPIs together as inputs may improve its performance further.
A primary finding of the study on enhancer-promoter interactions was that a group of enhancers
tends to interact with a common set of target genes. This property was tested on a variety of
chromatin interaction datasets with two sets of enhancers. Here it was ensured that both the
enhancer and the promoter were active regions, but there was no way to confirm whether the
chromatin interactions were functional in different cell lines. With the availability of more
cell-specific chromatin interaction data, the property should be rigorously verified.
4.2.2 miRNA-mRNA interactions
From the work done here on the analysis of miRNA-mRNA interactions, the results indicate that every
position of a miRNA can contribute to forming a successful interaction with its target. We
discovered clusters of miRNAs that show a similar binding pattern along a certain region of the
miRNA sequence. These clusters of miRNAs should, however, be further analyzed for pathway
similarity. Recently, numerous cell-specific miRNA isoforms (isomiRs) were discovered by RNA-seq
and miRNA-seq experiments. IsomiRs are produced in a cell due to imprecise slicing of the
primary miRNA transcript or RNA editing of the initial miRNA transcripts,
among many other reasons [86]. IsomiRs differ slightly in sequence from the canonical
miRNAs. Depending on the location of the differences, isomiRs can interact with the same targets
as the canonical miRNAs or with different ones. Incorporating isomiRs into the miRNA target
prediction problem can give the target prediction model a more complete picture of the problem.
With this hypothesis and our understanding of the importance of non-canonical position- or
region-specific information, we are working on developing a miRNA-mRNA or isomiR-mRNA target
prediction tool that uses a deep learning model to learn the hidden features from just the sequences
of the corresponding miRNAs or isomiRs and mRNA transcripts. Since deep learning models
are well known to capture deep interconnected features, we are interested in the sequence feature
patterns this model can capture to explain a miRNA-mRNA interaction.
LIST OF REFERENCES
1. Andersson R, Gebhard C, Miguel-Escalada I, Hoof I, Bornholdt J, Boyd M, Chen Y,
Zhao X, Schmidl C, Suzuki T et al: An atlas of active enhancers across human cell
types and tissues. Nature 2014, 507(7493):455-461.
2. Cai X, Hou L, Su N, Hu H, Deng M, Li X: Systematic identification of conserved
motif modules in the human genome. BMC Genomics 2010, 11(1):567.
3. De Laat W, Duboule D: Topology of mammalian developmental enhancers and their
regulatory landscapes. Nature 2013, 502(7472):499-506.
4. Dekker J, Rippe K, Dekker M, Kleckner N: Capturing chromosome conformation.
Science 2002, 295(5558):1306-1311.
5. Zheng Y, Li X, Hu H: Comprehensive discovery of DNA motifs in 349 human cells
and tissues reveals new features of motifs. Nucleic Acids Res 2015, 43(1):74-83.
6. Corradin O, Saiakhova A, Akhtar-Zaidi B, Myeroff L, Willis J, Cowper-Sal lari R,
Lupien M, Markowitz S, Scacheri PC: Combinatorial effects of multiple enhancer
variants in linkage disequilibrium dictate levels of gene expression to confer
susceptibility to common traits. Genome Res 2014, 24(1):1-13.
7. Fullwood MJ, Liu MH, Pan YF, Liu J, Xu H, Mohamed YB, Orlov YL, Velkov S, Ho A,
Mei PH: An oestrogen-receptor-α-bound human chromatin interactome. Nature
2009, 462(7269):58-64.
8. Rao SS, Huntley MH, Durand NC, Stamenova EK, Bochkov ID, Robinson JT, Sanborn
AL, Machol I, Omer AD, Lander ES et al: A 3D map of the human genome at kilobase
resolution reveals principles of chromatin looping. Cell 2014, 159(7):1665-1680.
9. He B, Chen C, Teng L, Tan K: Global view of enhancer-promoter interactome in
human cells. Proceedings of the National Academy of Sciences of the United States of
America 2014, 111(21):E2191-2199.
10. Ernst J, Kheradpour P, Mikkelsen TS, Shoresh N, Ward LD, Epstein CB, Zhang X, Wang
L, Issner R, Coyne M et al: Mapping and analysis of chromatin state dynamics in
nine human cell types. Nature 2011, 473(7345):43-49.
11. Thurman RE, Rynes E, Humbert R, Vierstra J, Maurano MT, Haugen E, Sheffield NC,
Stergachis AB, Wang H, Vernot B et al: The accessible chromatin landscape of the
human genome. Nature 2012, 489(7414):75-82.
12. Roy S, Siahpirani AF, Chasman D, Knaack S, Ay F, Stewart R, Wilson M, Sridharan R:
A predictive modeling approach for cell line-specific long-range regulatory
interactions. Nucleic Acids Res 2015, 43(18):8694-8712.
13. Whalen S, Truty RM, Pollard KS: Enhancer-promoter interactions are encoded by
complex genomic signatures on looping chromatin. Nat Genet 2016, 48(5):488-496.
14. Zhao C, Li X, Hu H: PETModule: a motif module based approach for enhancer
target gene prediction. Scientific reports 2016, 6:30043.
15. Talukder A, Saadat S, Li X, Hu H: EPIP: A novel approach for condition-specific
enhancer-promoter interaction prediction. Bioinformatics 2019, 35(20):3877-3883.
16. Ernst J, Kellis M: ChromHMM: automating chromatin-state discovery and
characterization. Nature methods 2012, 9(3):215-216.
17. Dunham I, Consortium EP: An integrated encyclopedia of DNA elements in the
human genome. Nature 2012, 489(7414):57-74.
18. Harrow J, Frankish A, Gonzalez JM, Tapanari E, Diekhans M, Kokocinski F, Aken BL,
Barrell D, Zadissa A, Searle S et al: GENCODE: the reference human genome
annotation for The ENCODE Project. Genome Res 2012, 22(9):1760-1774.
19. Li X, Zheng Y, Hu H, Li X: Integrative analyses shed new light on human ribosomal
protein gene regulation. Scientific reports 2016, 6:28619.
20. Jin F, Li Y, Dixon JR, Selvaraj S, Ye Z, Lee AY, Yen CA, Schmitt AD, Espinoza CA,
Ren B: A high-resolution map of the three-dimensional chromatin interactome in
human cells. Nature 2013, 503(7475):290-294.
21. Li G, Ruan X, Auerbach RK, Sandhu KS, Zheng M, Wang P, Poh HM, Goh Y, Lim J,
Zhang J et al: Extensive promoter-centered chromatin interactions provide a
topological basis for transcription regulation. Cell 2012, 148(1-2):84-98.
22. Freund Y, Schapire RE: A decision-theoretic generalization of on-line learning and an
application to boosting. Journal of computer system sciences 1997, 55(1):119-139.
23. Breiman L, Friedman J, Stone CJ, Olshen RA: Classification and Regression Trees:
Taylor & Francis; 1984.
24. Hoffman MM, Buske OJ, Wang J, Weng Z, Bilmes JA, Noble WS: Unsupervised
pattern discovery in human chromatin structure through genomic segmentation.
Nature methods 2012, 9(5):473-476.
25. Forcato M, Nicoletti C, Pal K, Livi CM, Ferrari F, Bicciato S: Comparison of
computational methods for Hi-C data analysis. Nature methods 2017, 14(7):679-685.
26. Furlong EEM, Levine M: Developmental enhancers and chromosome topology.
Science 2018, 361(6409):1341-1345.
27. Lettice LA, Horikoshi T, Heaney SJ, van Baren MJ, van der Linde HC, Breedveld GJ,
Joosse M, Akarsu N, Oostra BA, Endo N et al: Disruption of a long-range cis-acting
regulator for Shh causes preaxial polydactyly. Proceedings of the National Academy
of Sciences of the United States of America 2002, 99(11):7548-7553.
28. Lieberman-Aiden E, van Berkum NL, Williams L, Imakaev M, Ragoczy T, Telling A,
Amit I, Lajoie BR, Sabo PJ, Dorschner MO et al: Comprehensive mapping of long-
range interactions reveals folding principles of the human genome. Science 2009,
326(5950):289-293.
29. Mossing MC, Record MT, Jr.: Upstream operators enhance repression of the lac
promoter. Science 1986, 233(4766):889-892.
30. Pennacchio LA, Bickmore W, Dean A, Nobrega MA, Bejerano G: Enhancers: five
essential questions. Nature reviews Genetics 2013, 14(4):288-295.
31. Wang S, Hu H, Li X: Shared distal regulatory regions may contribute to the
coordinated expression of human ribosomal protein genes. Genomics 2020,
112(4):2886-2893.
32. Javierre BM, Burren OS, Wilder SP, Kreuzhuber R, Hill SM, Sewitz S, Cairns J, Wingett
SW, Varnai C, Thiecke MJ et al: Lineage-Specific Genome Architecture Links
Enhancers and Non-coding Disease Variants to Target Gene Promoters. Cell 2016,
167(5):1369-1384 e1319.
33. Bellen HJ, O'Kane CJ, Wilson C, Grossniklaus U, Pearson RK, Gehring WJ: P-element-
mediated enhancer detection: a versatile method to study development in
Drosophila. Genes & development 1989, 3(9):1288-1300.
34. Weber F, de Villiers J, Schaffner W: An SV40 "enhancer trap" incorporates
exogenous enhancers or generates enhancers from its own sequences. Cell 1984,
36(4):983-992.
35. Crawford GE, Holt IE, Whittle J, Webb BD, Tai D, Davis S, Margulies EH, Chen Y,
Bernat JA, Ginsburg D et al: Genome-wide mapping of DNase hypersensitive sites
using massively parallel signature sequencing (MPSS). Genome Res 2006, 16(1):123-
131.
36. Heintzman ND, Stuart RK, Hon G, Fu Y, Ching CW, Hawkins RD, Barrera LO, Van
Calcar S, Qu C, Ching KA et al: Distinct and predictive chromatin signatures of
transcriptional promoters and enhancers in the human genome. Nat Genet 2007,
39(3):311-318.
37. Johnson DS, Mortazavi A, Myers RM, Wold B: Genome-wide mapping of in vivo
protein-DNA interactions. Science 2007, 316(5830):1497-1502.
38. Robertson G, Hirst M, Bainbridge M, Bilenky M, Zhao Y, Zeng T, Euskirchen G,
Bernier B, Varhol R, Delaney A et al: Genome-wide profiles of STAT1 DNA
association using chromatin immunoprecipitation and massively parallel
sequencing. Nature methods 2007, 4(8):651-657.
39. Wang Y, Li X, Hu H: H3K4me2 reliably defines transcription factor binding regions
in different cells. Genomics 2014, 103(2-3):222-228.
40. Malin J, Aniba MR, Hannenhalli S: Enhancer networks revealed by correlated DNAse
hypersensitivity states of enhancers. Nucleic Acids Res 2013, 41(14):6828-6838.
41. Zheng Y, Li X, Hu H: PreDREM: a database of predicted DNA regulatory motifs
from 349 human cell and tissue samples. Database : the journal of biological
databases and curation 2015, 2015.
42. Daniel B, Nagy G, Hah N, Horvath A, Czimmerer Z, Poliska S, Gyuris T, Keirsse J,
Gysemans C, Van Ginderachter JA et al: The active enhancer network operated by
liganded RXR supports angiogenic activity in macrophages. Genes & development
2014, 28(14):1562-1577.
43. Danko CG, Hyland SL, Core LJ, Martins AL, Waters CT, Lee HW, Cheung VG, Kraus
WL, Lis JT, Siepel A: Identification of active transcriptional regulatory elements
from GRO-seq data. Nature methods 2015, 12(5):433-438.
44. Won KJ, Ren B, Wang W: Genome-wide prediction of transcription factor binding
sites using an integrated model. Genome Biol 2010, 11(1):R7.
45. Visel A, Minovitsky S, Dubchak I, Pennacchio LA: VISTA Enhancer Browser--a
database of tissue-specific human enhancers. Nucleic Acids Res 2007, 35(Database
issue):D88-92.
46. Chen H, Li C, Peng X, Zhou Z, Weinstein JN, Cancer Genome Atlas Research N, Liang
H: A Pan-Cancer Analysis of Enhancer Expression in Nearly 9000 Patient Samples.
Cell 2018, 173(2):386-399 e312.
47. Pott S, Lieb JD: What are super-enhancers? Nat Genet 2015, 47(1):8-12.
48. Whyte WA, Orlando DA, Hnisz D, Abraham BJ, Lin CY, Kagey MH, Rahl PB, Lee TI,
Young RA: Master transcription factors and mediator establish super-enhancers at
key cell identity genes. Cell 2013, 153(2):307-319.
49. Dostie J, Richmond TA, Arnaout RA, Selzer RR, Lee WL, Honan TA, Rubio ED,
Krumm A, Lamb J, Nusbaum C et al: Chromosome Conformation Capture Carbon
Copy (5C): a massively parallel solution for mapping interactions between genomic
elements. Genome Res 2006, 16(10):1299-1309.
50. Rodelsperger C, Guo G, Kolanczyk M, Pletschacher A, Kohler S, Bauer S, Schulz MH,
Robinson PN: Integrative analysis of genomic, functional and protein interaction
data predicts long-range enhancer-target gene interactions. Nucleic Acids Res 2011,
39(7):2492-2502.
51. Dixon JR, Selvaraj S, Yue F, Kim A, Li Y, Shen Y, Hu M, Liu JS, Ren B: Topological
domains in mammalian genomes identified by analysis of chromatin interactions.
Nature 2012, 485(7398):376-380.
52. Quinodoz SA, Ollikainen N, Tabak B, Palla A et al: Higher-Order Inter-chromosomal
Hubs Shape 3D Genome Organization in the Nucleus. Cell 2018, 174(3):744-757.
53. Latapy M, Magnien C, Del Vecchio N: Basic Notions for the Analysis of Large Two-
mode Networks. Social Networks 2008, 30:31-48.
54. Bron C, Kerbosch J: Finding All Cliques of an Undirected Graph. Commun Acm 1973,
16(9):575-577.
55. Edgar RC: MUSCLE: multiple sequence alignment with high accuracy and high
throughput. Nucleic Acids Res 2004, 32(5):1792-1797.
56. Mann HB, Whitney DR: On a Test of Whether one of Two Random Variables is
Stochastically Larger than the Other. Annals of Mathematical Statistics 1947,
18(1):50-60.
57. McLean YC, Bristor D, Hiller M, Clarke SL, Schaar BT, Lowe CB, Wenger AM,
Bejerano G: GREAT improves functional interpretation of cis-regulatory regions.
Nature Biotechnology 2010, 28(5):495-501.
58. Blanchette M, Bataille AR, Chen X, Poitras C, Laganière J, Lefèbvre C, Deblois G,
Giguère V, Ferretti V, Bergeron D et al: Genome-wide computational prediction of
transcriptional regulatory modules reveals new insights into human gene
expression. Genome Res 2006, 16(5):656-668.
59. Edelman LB, Fraser P: Transcription factories: genetic programming in three
dimensions. Curr Opin Genet Dev 2012, 22(2):110-114.
60. Papantonis A, Cook PR: Transcription factories: genome organization and gene
regulation. Chemical Review 2013, 113(11):8683-8705.
61. Burroughs AM, Ando Y, de Hoon ML, Tomaru Y, Suzuki H, Hayashizaki Y, Daub CO:
Deep-sequencing of human Argonaute-associated small RNAs provides insight into
miRNA sorting and reveals Argonaute association with RNA fragments of diverse
origin. RNA Biology 2011, 8(1):158-177.
62. Bartel DP: MicroRNAs: genomics, biogenesis, mechanism, and function. Cell 2004,
116(2):281-297.
63. Wang Y, Li X, Hu H: Transcriptional regulation of co-expressed microRNA target
genes. Genomics 2011, 98(6):445-452.
64. Li Y, Kowdley KV: MicroRNAs in common human diseases. Genomics Proteomics
Bioinformatics 2012, 10(5):246-253.
65. Kertesz M, Iovino N, Unnerstall U, Gaul U, Segal E: The role of site accessibility in
microRNA target recognition. Nature Genetics 2007, 39(10):1278-1284.
66. Grimson A, Farh KK-H, Johnston WK, Garrett-Engele P, Lim LP, Bartel DP:
MicroRNA targeting specificity in mammals: determinants beyond seed pairing.
Molecular Cell 2007, 27(1):91-105.
67. Enright A, John B, Gaul U, Tuschl T, Sander C, Marks D: MicroRNA targets in
Drosophila. Genome Biology 2003, 5(1):R1.
68. Helwak A, Kudla G, Dudnakova T, Tollervey D: Mapping the human miRNA
interactome by CLASH reveals frequent noncanonical binding. Cell 2013, 153:654-665.
69. Betel D, Koppal A, Agius P, Sander C, Leslie C: Comprehensive modeling of
microRNA targets predicts functional non-conserved and non-canonical sites.
Genome Biology 2010, 11(8):R90.
70. Ding J, Li X, Hu H: MicroRNA modules prefer to bind weak and unconventional
target sites. Bioinformatics 2015, 31(9):1366-1374.
71. Ding J, Li X, Hu H: TarPmiR: a new approach for microRNA target site prediction.
Bioinformatics 2016, 32(18):2768-2775.
72. Ding J, Li X, Hu H: CCmiR: a computational approach for competitive and
cooperative microRNA binding prediction. Bioinformatics 2018, 34(2):198-206.
73. Chi SW, Zang JB, Mele A, Darnell RB: Argonaute HITS-CLIP decodes microRNA-
mRNA interaction maps. Nature 2009, 460(7254):479-486.
74. Chou C-H, Chang N-W, Shrestha S, Hsu S-D, Lin Y-L, Lee W-H, Yang C-D, Hong H-C,
Wei T-Y, Tu S-J: miRTarBase 2016: updates to the experimentally validated
miRNA-target interactions database. Nucleic Acids Res 2016, 44(D1):D239-D247.
75. Hafner M, Landthaler M, Burger L, Khorshid M, Hausser J, Berninger P, Rothballer A,
Ascano Jr M, Jungkamp A-C, Munschauer M: Transcriptome-wide identification of
RNA-binding protein and microRNA target sites by PAR-CLIP. Cell 2010,
141(1):129-141.
76. Moore MJ, Scheel TKH, Luna JM, Park CY, Fak JJ, Nishiuchi E, Rice CM, Darnell RB:
MiRNA-target chimeras reveal miRNA 3′-end pairing as a major determinant of
Argonaute target specificity. Nature Communications 2015, 6:1-17.
77. Li X, Hu H: Improving miRNA target prediction using CLASH data. In: MicroRNA
Target Identification. Springer; 2019: 75-83.
78. Lu Y, Leslie CS: Learning to predict miRNA-mRNA interactions from AGO CLIP
sequencing and CLASH data. PLOS Computational Biology 2016, 12(7):e1005026.
79. Wang X: Improving microRNA target prediction by modeling with unambiguously
identified microRNA-target pairs from CLIP-ligation studies. Bioinformatics 2016,
32(9):1316-1322.
80. Fu H-Y, Xue D-Y, Zhang X, Yang P-Y: Assessing potential miRNA targets based on a
Markov model. Genetics and Molecular Research 2009, 8(3):848-860.
81. Krüger J, Rehmsmeier M: RNAhybrid: microRNA target prediction easy, fast and
flexible. Nucleic Acids Res 2006, 34(suppl_2):W451-W454.
82. Agarwal V, Bell G, Nam J, Bartel D: Predicting effective microRNA target sites in
mammalian mRNAs. eLife 2015, 4:e05005.
83. Friedman RC, Farh KK-H, Burge CB, Bartel DP: Most mammalian mRNAs are
conserved targets of microRNAs. Genome Res 2009, 19(1):92-105.
84. Lewis BP, Shih I-h, Jones-Rhoades MW, Bartel DP, Burge CB: Prediction of
mammalian microRNA targets. Cell 2003, 115(7):787-798.
85. Miranda K, Huynh T, Tay Y, Ang Y, Tam W, Thomson A, Lim B, Rigoutsos I: A
pattern-based method for the identification of microRNA binding sites and their
corresponding heteroduplexes. Cell 2006, 126(6):1203-1217.
86. Neilsen CT, Goodall GJ, Bracken CP: IsomiRs - the overlooked repertoire in the
dynamic microRNAome. Trends in Genetics 2012, 28:544-549.