Computational Study of Target Gene Interactions


University of Central Florida

STARS

Electronic Theses and Dissertations, 2020-

2021

Computational Study of Target Gene Interactions - Enhancers and microRNAs

Amlan Talukder, University of Central Florida

Part of the Computer Sciences Commons, and the Genetics and Genomics Commons

Find similar works at: https://stars.library.ucf.edu/etd2020

University of Central Florida Libraries http://library.ucf.edu

This Doctoral Dissertation (Open Access) is brought to you for free and open access by STARS. It has been accepted for inclusion in Electronic Theses and Dissertations, 2020- by an authorized administrator of STARS. For more information, please contact [email protected].

STARS Citation

Talukder, Amlan, "Computational Study of Target Gene Interactions - Enhancers and microRNAs" (2021). Electronic Theses and Dissertations, 2020-. 570. https://stars.library.ucf.edu/etd2020/570


COMPUTATIONAL STUDY ON TARGET GENE INTERACTIONS –

ENHANCERS AND MICRORNAS

by

AMLAN TALUKDER

B.Sc. Bangladesh University of Engineering and Technology, 2011

A dissertation submitted in partial fulfillment of the requirements

for the degree of Doctor of Philosophy

in the Department of Computer Science

in the College of Engineering and Computer Science

at the University of Central Florida

Orlando, Florida

Spring Term

2021

Major Professor: Haiyan Hu

© 2021 Amlan Talukder

ABSTRACT

Gene expression is essential to human physical and mental development. Aberrant regulation of gene expression creates abnormalities in the human body that can lead to complicated diseases. Gene expression can be regulated at any stage, from chromatin unfolding to the post-translational stage of the protein. In this study, we focused on two important factors of gene expression regulation that act at the transcriptional and post-transcriptional stages: enhancer-promoter interactions and miRNA-mRNA interactions.

Enhancer-promoter interactions are difficult to detect because of the large distance between the enhancer and promoter regions and the cell-specific activity of the interactions. Cell-specific interactions have not been well studied because feature availability is inconsistent across cells. We designed a tool that considers a large variety of enhancer-promoter interaction features in different cell lines, can deal with missing features, and can predict cell-specific interactions with better accuracy than the available tools. By analyzing cell-specific interactions from different sources, we also found that enhancer-promoter interactions are shared in groups.

MiRNA-mRNA interactions are more complicated in humans than in other organisms because of the imperfect nature of the interactions, the small size of miRNAs, and their complex target selection strategy. Available miRNA target prediction tools, designed on canonical features, often suffer from low accuracy on new experimentally supported datasets. These tools do not consider the position-wise binding preference or the relationships between adjacent positions and regions of the miRNA sequence. Here, we designed a Markov-model-based feature to capture this position-wise information from experimental datasets; it can be incorporated into any prediction tool to improve that tool's performance.


ACKNOWLEDGEMENTS

I would like to express my gratitude to my advisor, Dr. Haiyan Hu, for continuously appreciating my effort and encouraging me at every step of my Ph.D. journey.

I would also like to thank my co-advisor, Dr. Xiaoman Li, for his earnest efforts and sincere guidance throughout my Ph.D. research. Setting formality aside, he taught me how to do research and always pushed me to try my best.

Finally, I would like to thank my family for their continuous support. This dissertation is dedicated to my grandfather, who has been aspiring to see me achieve a Ph.D. degree even in his old age. It makes me so happy to finally be able to fulfill his long-held dream. I can never thank my parents enough for their tireless efforts to help me achieve my life goals, nor my brother, who is one of my inspirations to keep working in bioinformatics. Last but not least, I would like to thank my loving wife, who has been a constant support every single day of my life since we met.


TABLE OF CONTENTS

LIST OF FIGURES

LIST OF TABLES

CHAPTER 1 : INTRODUCTION

1.1 Study of Enhancer-Promoter interactions

1.2 Study of miRNA-mRNA interactions

CHAPTER 2 : STUDY OF ENHANCER-PROMOTER INTERACTIONS

2.1 EPIP: A Novel Approach for Cell-Specific Enhancer–Promoter Interaction Prediction

2.1.1 Background

2.1.2 Materials and Method

2.1.3 Results

2.1.4 Discussion

2.2 An Intriguing Characteristic of Enhancer-Promoter Interactions

2.2.1 Background

2.2.2 Materials and Method

2.2.3 Results

2.2.4 Discussion

CHAPTER 3 : STUDY OF MIRNA-MRNA INTERACTIONS

3.1 MDPS: Position-Wise Binding Preference is Important for miRNA Target Site Prediction

3.1.1 Background

3.1.2 Materials and Methods

3.1.3 Results

3.1.4 Discussion

CHAPTER 4 : CONCLUSION AND FUTURE WORK

4.1 Conclusion

4.2 Future Work

4.2.1 Enhancer-promoter interactions

4.2.2 miRNA-mRNA interactions

LIST OF REFERENCES


LIST OF FIGURES

Figure 2-1: Training and Test data

Figure 2-2: EPIP model and partitions

Figure 2-3: The overall performance of EPIP on external datasets.

Figure 2-4: Generation of IEPs and calculation of BCC

Figure 2-5: Clusters of enhancers with Hi-C reads

Figure 2-6: The distance distribution between consecutive enhancers in the same cluster

Figure 2-7: The overlap of the enhancer clusters with the super-enhancers

Figure 3-1: Five states in an miRNA-target interaction

Figure 3-2: Non-seed regions may be important for miRNA-target interactions

Figure 3-3: Correlated pairs of miRNA positions

Figure 3-4: Clusters of miRNAs with similar “Match” patterns in specific regions


LIST OF TABLES

Table 2-1: EPIP on balanced test data, unbalanced test data, and all pairs

Table 2-2: Performance of cell-specific EPIP model on predicting of condition-specific EPIs.

Table 2-3: Comparison with TargetFinder and Ripple on TargetFinder and EPIP data.

Table 2-4: The BCC of enhancers and that of promoters are likely to be 1 in a cell line.

Table 2-5: BCC statistics for enhancers

Table 2-6: BCC statistics for promoters

Table 3-1: Training and test datasets.

Table 3-2: Performance comparison of the combined tools with the original tools.


CHAPTER 1 : INTRODUCTION

Dysregulation of gene expression is one of the major causes of diseases and anomalies in the human body. The entire process of gene expression can be divided into multiple steps: chromatin uncoiling, gene transcription to form RNA molecules, RNA splicing, translation into protein, and the post-translational stage. Numerous factors interact with a gene and its products at different stages of the gene expression process. Any regulation or disruption of these factors or their interactions at any of these stages can cause aberrations in the whole gene expression process. Proper computational analyses are needed to uncover the relationships among the diverse features that help these factors take part in the interactions in different cell types and cell lines. Here, we study two of these factors and their respective interactions with the proximal genetic regions and the RNA transcripts of genes in the transcriptional and post-transcriptional stages of gene expression.

1.1 Study of Enhancer-Promoter interactions

Enhancer-promoter interaction (EPI) is a phenomenon that takes place in the transcription step of the gene expression process. A promoter covers a region of 1-2 kilobases (kb) upstream of a gene's transcription start site (TSS). An enhancer is a distal region of DNA that comes into contact with the promoter region through chromatin looping and initiates gene transcription along with other factors such as RNA polymerase II and several transcription factor proteins. The size of enhancers, their distances from promoters, which enhancers or promoters are active, and when an EPI occurs are all open problems. Also, EPIs are specific to different cell lines and cell types. Only 40% of the enhancers in a cell take part in EPIs [1]. Understanding the functional behaviors of enhancers and capturing the independent features of EPIs in different cell lines are important for correctly predicting cell-specific EPIs. Accurate prediction of EPIs can help pinpoint the ones that play vital roles in critical developmental processes. Experimental data for the underlying features of cell-line-specific EPIs are still not consistently available. Popular EPI prediction tools suffer from low precision because of missing features in individual cell lines. Dealing with missing features and efficiently predicting cell-specific EPIs remain major challenges in this area.

1.2 Study of miRNA-mRNA interactions

Recent developments in gene transcription studies have unraveled a vast and complicated transcriptome in the human body. Some of these transcripts are translated into protein while others are not. Transcripts that code for proteins are called protein-coding RNAs or messenger RNAs (mRNAs), and those that do not are called non-coding RNAs (ncRNAs). Although there is still a lot to discover about the specific functions of these ncRNAs, their functions often start with direct interaction with other transcripts. These interactions allow them to affect the respective pathways of those transcripts. MicroRNA (miRNA) is one of the most prominent classes of ncRNAs found to date and has a shorter sequence length (16 to 28 nucleotides) than other known RNAs. The formation of an active mature miRNA molecule from an initial miRNA transcript is a multistep process, part of which occurs in the nucleus and the rest in the cytoplasm. In this process, the primary miRNA transcript is cut twice by two enzymes to create the mature miRNA molecule. After transcription, miRNAs often target other RNA transcripts, interact with them by forming perfect or imperfect bonds, and eventually either regulate their pathways or degrade their structures. In this way, numerous miRNAs have been found to play vital roles in a variety of complex disease pathways in the human body.


CHAPTER 2 : STUDY OF ENHANCER-PROMOTER INTERACTIONS

2.1 EPIP: A Novel Approach for Cell-Specific Enhancer–Promoter Interaction Prediction

2.1.1 Background

Enhancers are distal regions of DNA that play an important role in gene transcription. Enhancer regions are typically located from 1 kilobase (kb) to several megabases (Mb) from the genes of interest. Despite being located far from the gene promoters, they come into direct contact with the promoter regions through chromatin looping and trigger gene expression with the help of other factors [2-5]. To date, the majority of EPIs within a cell remain undiscovered [6]. Because of the long range of possible distances between enhancers and their interacting gene promoters, EPIs are also challenging to predict [3].

Current experimental approaches to identify EPIs are mainly based on chromosome conformation capture (3C) and its variants, such as chromatin interaction analysis with paired-end tag sequencing (ChIA-PET) and high-throughput genome-wide 3C (Hi-C) [4, 7, 8]. These experimental techniques determine the relative frequency of direct physical contacts between genomic regions and have successfully identified EPIs and other long-range interactions [9]. However, the ChIA-PET method still has a low signal-to-noise ratio, and most available Hi-C data have a low resolution [7, 8]. In addition, since certain EPIs are cell-specific, the experimental EPI data from one cell sample cannot always be directly applied to infer EPIs in other samples.

Since most experimental procedures are expensive in terms of time and cost, computational methods have become indispensable alternatives for identifying EPIs. These methods employ available experimentally derived genomic and/or epigenomic data to predict EPIs inexpensively. Early methods considered the closest promoter as the only target of an enhancer. However, a study demonstrated that in almost 60% of cases the enhancers are located far from the interacting gene promoters, and that one enhancer may interact with multiple gene promoters [1]. Later, several computational approaches were developed to predict EPIs based on the correlation of epigenomic signals in the enhancer and promoter regions [1, 6, 10, 11]. One challenge of using these methods is that they require proper correlation thresholds to reduce false EPI predictions [12, 13]. Recently, various supervised machine-learning-based methods have been developed, such as IM-PET [9], PETModule [14], Ripple [12] and TargetFinder [13]. These methods commonly use genomic and epigenomic data, such as those from DNase I hypersensitive sites sequencing (DNase-seq) and histone-modification-based chromatin immunoprecipitation followed by massively parallel sequencing (ChIP-seq), to extract features for EPI prediction. IM-PET, Ripple and PETModule use random forests as their classifiers, while TargetFinder is based on boosted tree classifiers. These methods either do not consider cell-specific EPI prediction or perform poorly on it [12].

In this study, we proposed a computational method called ‘Enhancer–Promoter Interaction Prediction’ (EPIP) to predict condition-specific EPIs [15]. EPIP is a supervised machine-learning-based approach that uses functional genomic and epigenomic data to build a robust model for predicting shared and condition-specific EPIs. EPIP can work with missing data, different types of datasets, and even a dataset with a partial list of features. Tested on experimental data from more than eight samples, EPIP reliably predicted the cell-specific and shared EPIs in different samples, with an average area under the receiver operating characteristic curve (AUROC) of about 0.95 and an average area under the precision–recall curve (AUPR) of about 0.73. When compared with two state-of-the-art computational methods for predicting EPIs, EPIP outperformed both of them.

2.1.2 Materials and Method

2.1.2.1 Enhancers and Promoters

We used 32,693 enhancers annotated by FANTOM [1]. To obtain a more reliable set of enhancers, we overlapped this set with the computationally predicted ChromHMM enhancers [16] for the cell lines that had ChromHMM data available (GM12878, HeLa, HMEC, HUVEC, IMR90 and NHEK). Since KBM7 does not have any annotated ChromHMM enhancers, all FANTOM enhancers were used for KBM7. An enhancer was considered ‘active’ in a sample if it overlapped with the H3K27ac ChIP-seq peaks in that sample. The H3K27ac peaks were downloaded from ENCODE [17]. Since KBM7 did not have any H3K27ac data available, all the obtained enhancers were considered ‘active’ for this cell line. In this way, we obtained 7,023–32,693 enhancers and 4,888–32,693 active enhancers per sample (Table 2-1).
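The overlap rule above can be sketched in a few lines of Python. This is a minimal illustration, not the code used in the study: the function names, the (chromosome, start, end) tuple layout and the half-open coordinate convention are assumptions, and passing None for the peaks stands in for a cell line such as KBM7 that has no H3K27ac data.

```python
def overlaps_any(region, peaks_by_chrom):
    """Return True if region (chrom, start, end) overlaps at least one peak.

    Intervals are treated as half-open [start, end), an assumption of this sketch.
    """
    chrom, start, end = region
    for p_start, p_end in peaks_by_chrom.get(chrom, []):
        if p_start < end and start < p_end:
            return True
    return False


def select_active(enhancers, h3k27ac_peaks):
    """Keep the enhancers that overlap an H3K27ac peak.

    If no peak data exist for the sample (h3k27ac_peaks is None),
    every enhancer is kept, mirroring the KBM7 case in the text.
    """
    if h3k27ac_peaks is None:
        return list(enhancers)
    # Group peaks by chromosome so each enhancer is only compared
    # with peaks on its own chromosome.
    peaks_by_chrom = {}
    for chrom, start, end in h3k27ac_peaks:
        peaks_by_chrom.setdefault(chrom, []).append((start, end))
    return [enh for enh in enhancers if overlaps_any(enh, peaks_by_chrom)]
```

A linear scan per chromosome is enough to show the rule; for genome-scale peak sets a sorted-interval or interval-tree lookup would be the usual choice.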

To define promoters, we used all annotated transcription start sites (TSSs) from GENCODE V19 [18] and considered the regions between 1 kb upstream and 100 base pairs downstream of the TSSs as “promoters”. This process resulted in 57,783 promoters. To determine whether a promoter was “active”, we used RNA-Seq data available in six cell lines (GM12878, HeLa, HUVEC, IMR90, K562 and NHEK) [17]. A promoter was labeled as “active” if the corresponding gene had at least 0.30 reads per kb of transcript per million mapped reads with an irreproducible discovery rate of 0.1, as in a previous study [13]. For the cell lines without available RNA-Seq data (HMEC and KBM7), all promoters were considered active (Table 2-2).
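As a rough sketch of the promoter definition above, the following Python computes the window around a TSS and applies the expression cutoff. Interpreting "upstream" relative to the gene strand, clipping the window at position 0 and the function names are assumptions made for illustration; the irreproducible discovery rate filter is not modeled here.

```python
def promoter_window(tss, strand, upstream=1000, downstream=100):
    """Return (start, end) of the promoter window around a TSS.

    Upstream is taken relative to the strand of the gene, which is an
    assumption of this sketch. The start is clipped at 0.
    """
    if strand == "+":
        start, end = tss - upstream, tss + downstream
    elif strand == "-":
        start, end = tss - downstream, tss + upstream
    else:
        raise ValueError("strand must be '+' or '-'")
    return max(0, start), end


def promoter_is_active(expression, has_rnaseq=True, cutoff=0.30):
    """Apply the expression cutoff from the text.

    When a cell line has no RNA-Seq data (has_rnaseq=False),
    every promoter is treated as active.
    """
    if not has_rnaseq:
        return True
    return expression >= cutoff
```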

2.1.2.2 Training Data

To define positive (interacting) enhancer-promoter pairs (EP-pairs), or EPIs, and negative (non-interacting) EP-pairs, we used normalized Hi-C data from the Gene Expression Omnibus (GEO) with accession number GSE63525 [8]. The data were available for eight cell lines: GM12878, HeLa, HMEC, HUVEC, IMR90, K562, KBM7 and NHEK. The dataset consisted of two kinds of data for these cell lines: looplists and contact matrices. The looplists contain significant intra-chromosomal chromatin interactions. The number of EP-pairs from the looplists defined at the highest resolution (5 kb) for these cell lines was too small to train the EPIP model well. Hence, we defined the positive and negative EP-pairs from the normalized Hi-C contact matrices, following previous studies [14, 19].

An EP-pair was defined as “positive” for a cell line when (1) the corresponding enhancer and promoter were active in that cell line, (2) they overlapped a pair of regions that were supported by at least 30 normalized Hi-C reads, and (3) the distance between the enhancer and promoter regions was between 2.5 kb and 2 Mb. Similarly, an EP-pair was considered negative if it did not overlap with any pair of regions that were supported by 5 or more normalized Hi-C reads (Figure 2-1A). The cutoffs, 30 and 5, were chosen based on our test results with different cutoffs (Table 2-3). The HeLa cell line was excluded from the training samples because contact matrix data were unavailable for this cell line.
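The labeling rule can be written down compactly. The sketch below is illustrative only: the function and constant names are hypothetical, `max_reads` is assumed to be the largest normalized Hi-C read count over all region pairs that the EP-pair overlaps (0 if it overlaps none), and the negative class applies only the read criterion because the text does not say whether the activity and distance conditions also apply to negatives.

```python
POSITIVE_MIN_READS = 30      # cutoff for positives, from the text
NEGATIVE_MAX_READS = 5       # pairs must stay below this to be negatives
MIN_DISTANCE = 2_500         # 2.5 kb
MAX_DISTANCE = 2_000_000     # 2 Mb


def label_ep_pair(enhancer_active, promoter_active, distance, max_reads):
    """Return 'positive', 'negative', or None for an ambiguous EP-pair.

    max_reads: largest normalized Hi-C read count among the region pairs
    overlapped by this EP-pair (assumed to be 0 when there is none).
    """
    if (enhancer_active and promoter_active
            and max_reads >= POSITIVE_MIN_READS
            and MIN_DISTANCE <= distance <= MAX_DISTANCE):
        return "positive"
    if max_reads < NEGATIVE_MAX_READS:
        return "negative"
    # Pairs that fall between the two cutoffs, or that fail the activity or
    # distance conditions while still having 5 or more reads, are left unlabeled.
    return None
```

Leaving a gap between the two cutoffs keeps weakly supported pairs out of both classes, which is one plausible reason for using two thresholds rather than one.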


We trained EPIP with both balanced and unbalanced data. The balanced data consist of a randomly chosen 30% of the positive EP-pairs and the same number of negative EP-pairs in each of the above seven cell lines. The positives and negatives from the different cell lines were then combined to train a balanced prediction model. The unbalanced model was generated in the same way, except that the number of randomly chosen negative EP-pairs was 10 times the number of positive EP-pairs in each sample. Finally, we combined the two models into a combined EPIP model, which predicts an EP-pair as ‘negative’ only when both models predict it as negative, and as positive otherwise. This strategy was based on the observation that the balanced model had a high sensitivity and the unbalanced model had a high specificity when tested on the training data by cross-validation.
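The sampling scheme and the rule for combining the two models can be illustrated as follows. This is a sketch under stated assumptions: the helper names, the fixed random seed and the 0/1 encoding of predictions are hypothetical, and the number of negatives is capped at the number available.

```python
import random


def sample_training_pairs(positives, negatives, neg_ratio, train_frac=0.30, seed=0):
    """Draw a training set from one cell line.

    neg_ratio=1 gives the balanced setting and neg_ratio=10 the
    unbalanced one described in the text.
    """
    rng = random.Random(seed)
    n_pos = round(train_frac * len(positives))
    n_neg = min(len(negatives), neg_ratio * n_pos)
    return rng.sample(positives, n_pos), rng.sample(negatives, n_neg)


def combined_prediction(balanced_pred, unbalanced_pred):
    """Combine two 0/1 predictions (1 = positive).

    The pair is negative only when both models say negative,
    which is a logical OR over the positive class.
    """
    return int(balanced_pred == 1 or unbalanced_pred == 1)
```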

Figure 2-1: Training and Test data. (A) The flowchart of training data creation. Here all the read

numbers are normalized. An EP-pair with the enhancer overlapping with one of the two interacting

regions and the promoter overlapping with the other of the two interacting regions will be

considered as an interacting EP-pair candidate. (B) The five test data sets on which we tested EPIP.


2.1.2.3 Test Data

We considered five different test datasets to evaluate EPIP (Figure 2-1B): (1) the remaining 70% of positive EP-pairs, together with the same number of randomly selected negative pairs that were not used for training (balanced test data); (2) the remaining 70% of positive EP-pairs, together with 10 times as many randomly selected negative pairs that were not used for training (unbalanced test data); (3) all EP-pairs within 2 Mb that were not used for training; (4) positive EP-pairs defined with normalized Hi-C contact matrices under the cutoffs 10, 20, 30, 50 and 100; and (5) EP-pairs collected in other studies [8, 20, 21], which were obtained from strictly defined interacting regions.

2.1.2.4 Features of EP-Pairs

EPIP considers a total of 31 features. Three of these features are common to every sample. The first is the distance between the enhancer and the promoter in an EP-pair. The second is the conserved synteny score, which measures the co-conservation of an EP-pair in five other vertebrate genomes (chicken galGal3, chimpanzee panTro4, frog xenTro3, mouse mm10 and zebrafish zv9). The third is the correlation between the epigenomic signals in the enhancer region and those in the promoter region of an EP-pair across ENCODE Tier 1 and Tier 2 samples [14]. For simplicity, these features are hereafter called ‘distance’, ‘css’ and ‘corr’, respectively.
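To make the ‘corr’ feature concrete, the sketch below computes a correlation between an enhancer signal and a promoter signal measured across several samples. The text does not state which correlation coefficient is used; Pearson correlation is assumed here purely for illustration, and returning 0.0 for a constant signal is likewise an assumption of this sketch.

```python
import math


def pearson(x, y):
    """Pearson correlation between two equally long signal vectors.

    x[i] and y[i] are the enhancer and promoter signal values in sample i.
    Returns 0.0 when either vector is constant (an assumption of this sketch).
    """
    n = len(x)
    if n != len(y) or n < 2:
        raise ValueError("need two vectors of equal length >= 2")
    mean_x = sum(x) / n
    mean_y = sum(y) / n
    sxy = sum((a - mean_x) * (b - mean_y) for a, b in zip(x, y))
    sxx = sum((a - mean_x) ** 2 for a in x)
    syy = sum((b - mean_y) ** 2 for b in y)
    if sxx == 0 or syy == 0:
        return 0.0
    return sxy / math.sqrt(sxx * syy)
```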

Depending on the types of epigenomic data available in a sample, EPIP considers 28 additional features for an EP-pair: 14 for the enhancer region and 14 for the promoter region. These 14 features comprise DNase-seq data and ChIP-seq data for nine types of histone modifications (H3K4me1, H3K4me2, H3K4me3, H3K27ac, H3K27me3, H3K36me3, H3K79me2, H3K9ac and H4K20me1) and four types of chromatin factors (CTCF, POL2, RAD21 and SMC3). All these epigenomic data were found to act as significant biomarkers for EPIs [12]. The value of each of these features was computed in both the enhancer and the promoter region of an EP-pair. The feature value corresponds to the ‘strength’ of the signal peak that overlaps the region. When a region overlaps multiple peaks of a feature signal, the average of the peak strength values is taken as the feature value for the region. The feature signal peaks and their signal strength values were downloaded from ENCODE [17]. The distance, css and corr features were considered for all seven cell lines, but some of the 28 additional features were unavailable for some cell lines. The total numbers of features for GM12878, HMEC, HUVEC, IMR90, K562, KBM7 and NHEK were 31, 25, 27, 31, 31, 3 and 27, respectively.
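The peak-averaging rule for a single feature can be sketched as below. The peak tuple layout (chromosome, start, end, strength), the half-open coordinates and the value 0.0 for a region that overlaps no peak are assumptions of this sketch; the text does not specify what value is used when there is no overlapping peak.

```python
def feature_value(region, peaks):
    """Average strength of all peaks overlapping region (chrom, start, end).

    peaks: iterable of (chrom, start, end, strength) tuples.
    Returns 0.0 if no peak overlaps the region (assumed default).
    """
    chrom, start, end = region
    strengths = [
        strength
        for p_chrom, p_start, p_end, strength in peaks
        if p_chrom == chrom and p_start < end and start < p_end
    ]
    if not strengths:
        return 0.0
    return sum(strengths) / len(strengths)
```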

The features were grouped into 11 overlapping feature subsets, or ‘partitions’. The first partition consisted of only the three common features: distance, css and corr. Each of the other 10 partitions consisted of the three common features plus one of the following subsets of features: (1) DNase-seq; (2) H3K4me1; (3) H3K4me1 and H3K27ac; (4) DNase-seq and H3K27ac; (5) H3K4me1, H3K4me3 and H3K27ac; (6) DNase-seq, H3K4me3 and H3K27ac; (7) DNase-seq, H3K4me1-3 and H3K27ac; (8) DNase-seq, H3K4me1-3, H3K27ac, H3K27me3, H3K36me3, H3K79me2, H4K20me1, CTCF and DNaseI; (9) DNase-seq, H3K4me1-3, H3K27ac, H3K27me3, H3K36me3, H3K79me2, H4K20me1, CTCF, DNaseI and POL2; (10) DNase-seq, H3K4me1-3, H3K27ac, H3K27me3, H3K36me3, H3K79me2, H4K20me1, CTCF, DNaseI, POL2, RAD21 and SMC3. The benefits of the feature partitions are threefold. First, they allow the model to function for a sample with missing features. Second, different predictors can be trained with different feature partitions, which helps avoid overfitting. Third, prediction decisions based on different feature partitions can offer alternative perspectives (Figure 2-2B).
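One way to picture how partitions cope with missing features is to check, per cell line, which partitions have all of their data types available. The sketch below uses only the first five partitions from the list above as an example; the function name and the set-based representation are hypothetical, and the three common features are left implicit because they are available in every cell line.

```python
# Data types required by each example partition, beyond the three
# common features (distance, css, corr). Index 0 is the common-only partition.
EXAMPLE_PARTITIONS = [
    set(),
    {"DNase-seq"},
    {"H3K4me1"},
    {"H3K4me1", "H3K27ac"},
    {"DNase-seq", "H3K27ac"},
]


def usable_partitions(partitions, available):
    """Return the indices of partitions whose data types are all available."""
    available = set(available)
    return [i for i, required in enumerate(partitions) if required <= available]
```

For a cell line with only the three common features, such as KBM7 in the text, only the first partition would be usable under this scheme.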

2.1.2.5 Model

The EPIP model uses a supervised sequential ensemble learning approach (Figure 2-2A). The model consists of 11 incremental learners, one for each feature partition. Each incremental learner is an AdaBoost classifier [22] that consists of 200 weak learners for each cell line. The weak learners are decision tree classifiers [23] of depth 10. EPIP makes a prediction by “hard” voting among all the incremental learners.

Figure 2-2: EPIP model and partitions. (A) The training process of EPIP. There are three types of partitions, and 11 partitions are used in total. Samples with the features required by a partition are used to train the corresponding incremental learner (IL) for this partition. Each incremental learner trains a maximum of 200 weak learners (W) for a sample. The weak learners trained from all available samples then vote to make the predictions for the corresponding incremental learner. The predictions of all incremental learners determine the final prediction through another voting process. (B) An example of the third type of partitions from three samples. The 25 features in HMEC are included in the 27 features in HUVEC, which are included in the 31 features in GM12878.


EPIP uses the 11 feature partitions to choose the suitable cell lines for each incremental learner. When all the features in a partition are available for a cell line, EPIP trains 200 weak learners under the corresponding incremental learner. The number 200 was selected because it is the smallest number that gives EPIP the best AUROC and AUPR scores. Each weak learner is a decision tree classifier. The depth of the decision trees was set to 10 after testing several options. The weak learners are trained iteratively. The first weak learner is trained to classify all the training samples of the cell line. The misclassified samples are then weighted higher and the correctly classified samples are weighted lower. All the training samples, with their modified weights, are then used to train the next weak learner. This process is repeated until the 200th weak learner is trained. The overall prediction of an incremental learner is the summation of the predictions from its weak learners. The overall prediction of EPIP is the majority vote of the 11 incremental learners.
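The reweighting step and the final vote can be sketched without any machine learning library. The code below shows one round of the classic discrete AdaBoost weight update and a simple majority vote; the text does not specify which AdaBoost variant or tie-breaking rule EPIP uses, so the discrete update and the choice to resolve ties as negative are assumptions made for illustration.

```python
import math


def adaboost_reweight(weights, y_true, y_pred):
    """One discrete AdaBoost round.

    Returns (alpha, new_weights): alpha is the weight of the weak learner,
    and new_weights are normalized so that misclassified samples gain
    weight and correctly classified samples lose weight.
    """
    total = sum(weights)
    err = sum(w for w, t, p in zip(weights, y_true, y_pred) if t != p) / total
    err = min(max(err, 1e-10), 1 - 1e-10)  # guard against log(0)
    alpha = 0.5 * math.log((1 - err) / err)
    updated = [
        w * math.exp(alpha if t != p else -alpha)
        for w, t, p in zip(weights, y_true, y_pred)
    ]
    norm = sum(updated)
    return alpha, [w / norm for w in updated]


def hard_vote(votes):
    """Majority vote over 0/1 predictions; ties resolve to 0 (assumed)."""
    return int(2 * sum(votes) > len(votes))
```

With four equally weighted samples and one misclassification, the misclassified sample ends up carrying half of the total weight after the update, which is what forces the next weak learner to focus on it.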

2.1.2.6 Comparison with State-of-the-Art Tools

The performance of the EPIP model was compared with that of two state-of-the-art tools, TargetFinder [13] and Ripple [12]. TargetFinder predictions were published for six cell lines: GM12878, HeLa, HUVEC, IMR90, K562 and NHEK. In addition, the positive and negative EP-pairs used for TargetFinder prediction were provided for four classifiers (https://github.com/shwhalen/targetfinder), among which the gradient boosting classifier (GBM) showed the best precision and recall. Thus, EPIP was compared with TargetFinder by running EPIP and the TargetFinder GBM classifier on both the TargetFinder and the EPIP data.

Ripple predicts EPIs using a combination of random forest classifiers and group LASSO in a multi-

task learning framework [12]. It used DNase-Seq, ChIP-Seq and RNA-Seq peaks in 5C

(GSE39510) and Hi-C (GSE63525) datasets to design the training and test data in the GM12878, H1-hESC,

HeLa and K562 cell lines. We compared EPIP with Ripple by executing both on the EPIP data

and the TargetFinder data in the three shared cell lines: GM12878, HeLa and K562. We did not use

the Ripple data directly for the comparison because (i) the Ripple data was balanced, which does not

reflect reality, where negative EP-pairs usually far outnumber positive ones; (ii) the data

had a poor overlap with the FANTOM enhancers; and (iii) the data labeled very closely located

enhancers and promoters as positive EP-pairs, whereas EPIP requires a distance of at least 2.5 kb between

the enhancer and the promoter of a positive EP-pair.

The comparison between EPIP and the two tools was done using 10-fold cross-validation,

following the same strategy used by TargetFinder and Ripple. To generate TargetFinder features

for an EP-pair, we used the generate_training.py script provided in the TargetFinder source code. We

followed the steps described in the TargetFinder readme file to apply 10-fold cross-validation

to the training data using the TargetFinder GBM model. We used the genFeatures tool in Ripple to

generate Ripple features for an EP-pair. We executed the runAllfeatures_crosscellline.m

MATLAB file provided in the Ripple source code to apply 10-fold cross-validation to the training

data using the Ripple model.
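The 10-fold cross-validation protocol mentioned above can be sketched generically as follows. This is not the code shipped with TargetFinder or Ripple; the function name `k_fold_indices` and the fixed random seed are assumptions made for this illustration.

```python
import random

def k_fold_indices(n_samples, k=10, seed=0):
    """Yield (train_idx, test_idx) pairs so that every sample is tested exactly once."""
    idx = list(range(n_samples))
    random.Random(seed).shuffle(idx)           # shuffle once, reproducibly
    folds = [idx[i::k] for i in range(k)]      # k folds of nearly equal size
    for i in range(k):
        test = folds[i]
        train = [j for f in folds[:i] + folds[i + 1:] for j in f]
        yield train, test
```

Each of the ten iterations trains on nine folds and evaluates on the remaining one, and the reported scores are then summarized over the ten held-out folds.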

2.1.3 Results

2.1.3.1 Reliable Prediction of EPIs

EPIP showed high performance in predicting EPIs in five types of test datasets: balanced test data,

unbalanced test data, all EP-pairs within 2.5 kb to 2 Mb, EP-pairs defined with varied Hi-C read

cutoffs, and EP-pairs from other studies.

First, EPIP was tested on balanced and unbalanced test data and all EP-pairs within 2.5kb to 2Mb

(Chapter 2.1.2.3). We made sure that none of the EP-pairs in the test data were used in the training data.

On average, EPIP showed an AUROC of 0.96, 0.96 and 0.95; an AUPR of 0.96, 0.92 and 0.73 and

an F1 score of 0.99, 0.95 and 0.51 for the balanced, unbalanced and all EP-pairs within 2.5 kb to

2 Mb test data, respectively (Table 2-1). The low F1 score of the third dataset was due to the lack

of balance between the positives and negatives in this data set (the number of negatives was around

13 times the number of positives). In this test dataset, the recall was higher than 0.92 in all the cell

lines including KBM7, even though it did not have any epigenomic features. EPIP showed a much

higher precision and F1 score in GM12878 than the other cell lines. The much higher sequencing

depth of GM12878 compared with the other cell lines might be the reason behind this. EPIP was also tested

on cell-specific EP-pairs within 2.5 kb to 2 Mb, where it predicted 12 455 (99.26%) of the 12 548

cell-specific EP-pairs in the seven samples.

Table 2-1: EPIP on balanced test data, unbalanced test data, and all pairs within 2.5 kb to 2 Mb test data.

Cell line | AUROC | AUPR | F1 | Precision | Sensitivity/Recall | % of predicted condition-specific EPIs
GM12878 | 0.7322 (0.7661, 0.7657) | 0.5761 (0.7669, 0.5686) | 0.8993 (0.9967, 0.9827) | 0.818 (0.9967, 0.9691) | 0.9985 (0.9967, 0.9967) | 0.9984 (0.9964, 0.9954)
HMEC | 0.9768 (0.9933, 0.9908) | 0.6714 (0.9931, 0.967) | 0.2914 (0.9837, 0.9084) | 0.1707 (0.977, 0.84) | 0.9924 (0.9904, 0.989) | 0.9351 (0.8397, 0.8444)
HUVEC | 0.9925 (0.9962, 0.9965) | 0.6575 (0.9957, 0.9793) | 0.4233 (0.99, 0.9576) | 0.2688 (0.9915, 0.9242) | 0.9958 (0.9886, 0.9934) | 0.9167 (0.6, 0.5588)
IMR90 | 0.9875 (0.9977, 0.9967) | 0.9248 (0.9976, 0.9854) | 0.7416 (0.9971, 0.9695) | 0.6205 (0.9961, 0.9442) | 0.9216 (0.998, 0.9961) | 0.8937 (0.9953, 0.988)
K562 | 0.9974 (0.9987, 0.9987) | 0.9664 (0.9987, 0.9959) | 0.6412 (0.9931, 0.9581) | 0.4746 (0.9931, 0.9258) | 0.9882 (0.9931, 0.9927) | 0.9736 (0.9739, 0.982)
KBM7 | 0.9722 (0.9818, 0.98) | 0.6455 (0.9802, 0.9344) | 0.2155 (0.9795, 0.8888) | 0.1209 (0.9804, 0.8162) | 0.9905 (0.9787, 0.9756) | 0.9853 (0.9658, 0.9592)
NHEK | 0.9851 (0.9959, 0.996) | 0.6473 (0.9952, 0.9791) | 0.388 (0.9892, 0.9314) | 0.2408 (0.988, 0.8772) | 0.9974 (0.9904, 0.9928) | 0.9524 (0.7333, 0.8095)
Overall | 0.95 (0.96, 0.96) | 0.73 (0.96, 0.92) | 0.51 (0.99, 0.95) | 0.34 (0.99, 0.90) | 0.99 (0.99, 0.99) | 0.99 (0.99, 0.98)

In each entry with three numbers, the three numbers in order are for the all pairs within 2.5 kb to 2 Mb test data, the balanced

test data, and the unbalanced test data, respectively. In the last row, the AUROC and AUPR are the averages of the

AUROC and AUPR for the seven samples, and the F1, Precision and Sensitivity/Recall were calculated using the total

number of true positives, true negatives, false positives and false negatives for the seven samples.
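The relation between the reported precision, recall and F1 scores can be checked with a short calculation. The sketch below uses the standard definition of F1 as the harmonic mean of precision and recall; the function name `f1_score` is chosen for this illustration only.

```python
def f1_score(precision, recall):
    """F1 is the harmonic mean of precision and recall."""
    return 2 * precision * recall / (precision + recall)

# Overall values reported for the all pairs within 2.5 kb to 2 Mb test data.
overall_f1 = f1_score(precision=0.34, recall=0.99)
print(round(overall_f1, 2))
```

With the overall precision of 0.34 and recall of 0.99 reported for the all pairs test data, the result rounds to 0.51, the F1 score given in Table 2-1. This shows that the low F1 score is driven by the low precision on the highly unbalanced data, not by the recall.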

The performance of EPIP stated above was on the test datasets, where the positive and negative

chromatin contacts were decided based on the cutoffs 30 and 5. Since these cutoffs were not

rigorously determined, we tested EPIP on the more strictly defined Hi-C looplists at 5 kb resolution

[8], the Hi-C data for IMR90 [20] and the ChIA-PET data for K562 and MCF7 [21]. To generate

the EP-pairs, the strictly defined interacting regions in these studies were overlapped with ‘active’

enhancers and ‘active’ promoters. On the three datasets, EPIP showed average precision scores of

0.90, 0.89 and 0.93, respectively; average recall scores of 0.83, 0.81 and 0.89, respectively; and

average F1 scores of 0.86, 0.85 and 0.91, respectively (Figure 2-3). Interestingly, although EPIP

was not trained on the MCF7 cell line, it could still correctly predict 89.70% of EPIs in this cell

line.

EPIP was also tested on all the EP-pairs within 2.5 kb to 2 Mb with positives defined by four more

normalized read cutoffs: 10, 20, 50 and 100. The same negative dataset (cutoff 5) was used for

the four positive datasets. Overall, with the increase of the positive cutoffs, the AUROC scores

showed an increasing trend while the AUPR and the F1 scores showed a decreasing trend. This

might be due to the higher imbalance created by the decreasing number of positive but constant

number of negative EP-pairs with the larger cutoffs. The recall (sensitivity) was larger than 0.92

for all cutoffs and showed an increasing trend as the cutoff got higher. The results suggest that the

trained EPIP model was robust and reliable in predicting true positive EP-pairs. Because the higher-cutoff

data are more likely to contain real enhancer-promoter interactions, the higher recall of EPIP

on these data supports the effectiveness of EPIP. Since the negative EP-pairs were the

same under different cutoffs, the specificity was constant (0.80). The average precision

decreased from 0.76 at the cutoff 10 to 0.09 at the cutoff 100. This dramatic decrease in the

precision scores from lower to higher cutoffs suggested that the larger cutoffs 50 and 100 might

be too stringent and the lower cutoffs 10 and 20 might be too lenient. In that case, the cutoff 30 might

be the proper one to define positives, especially since EPIP had good precision and recall with this

cutoff on the more strictly defined EP-pairs from the above three previous studies (Figure 2-3).

Figure 2-3: The overall performance of EPIP on external datasets.
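The effect of a shrinking positive set on precision, with the negative set and the specificity held constant, can be illustrated with a small calculation. The class sizes and the recall value below are hypothetical and chosen only for illustration; only the specificity of 0.80 is taken from the text.

```python
def precision_at(n_pos, n_neg, recall, specificity):
    """Precision implied by the class sizes, the recall and the specificity."""
    tp = recall * n_pos                  # true positives
    fp = (1.0 - specificity) * n_neg     # false positives, fixed if n_neg is fixed
    return tp / (tp + fp)

# Hypothetical class sizes: the negatives stay fixed while a stricter positive
# cutoff leaves fewer and fewer positives.
n_neg = 10_000
for n_pos in (5_000, 1_000, 200):
    print(n_pos, round(precision_at(n_pos, n_neg, recall=0.95, specificity=0.80), 3))
```

Because the number of false positives does not change while the number of true positives shrinks with the positive set, the precision falls even when the recall stays high. This is consistent with the trend described for the cutoffs 10 to 100.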

In summary, EPIP predicted EPIs with high precision, recall and F1 scores on varied datasets,

including the published datasets from previous studies. To test EPIP on a more reliable unified

dataset, the dataset of all EP-pairs within 2.5 kb to 2 Mb defined by the cutoffs 30 and 5 was

overlapped with the published datasets and different cutoffs. On this dataset, EPIP showed an

AUROC of 0.95 and an AUPR of 0.73 on average.

2.1.3.2 Reliable Prediction of Cell-Specific EPIs

The performance of EPIP in predicting cell-specific EPIs was studied in different cell lines. To

evaluate the performance of EPIP on each cell line, a fresh model was trained on the samples from

the other six cell lines and then tested on the samples from the remaining cell line. Separate EPIP

models were generated in this way. The positive and negative EP-pairs for the training data were

generated in the same way as before with the cutoffs 30 and 5 (Chapter 2.1.2.2, second paragraph),

respectively. Each EPIP model was evaluated by the combination of the balanced and unbalanced

models (Chapter 2.1.2.2, last paragraph), trained on the samples from six cell lines.
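The leave-one-cell-line-out evaluation described above can be sketched as a simple loop. The list of cell lines follows Table 2-1; the function name `leave_one_out_splits` is an assumption made for this illustration, and model training and scoring are left out.

```python
CELL_LINES = ["GM12878", "HMEC", "HUVEC", "IMR90", "K562", "KBM7", "NHEK"]

def leave_one_out_splits(cell_lines):
    """Yield (training_lines, held_out_line) so that each cell line is tested once."""
    for held_out in cell_lines:
        training = [c for c in cell_lines if c != held_out]
        yield training, held_out

for training, held_out in leave_one_out_splits(CELL_LINES):
    # A fresh model would be trained on `training` and evaluated on `held_out`.
    print(held_out, "<-", ", ".join(training))
```

This yields seven train/test configurations, one per held-out cell line, each trained on the other six.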

The separate EPIP models for the seven cell lines showed an average AUROC of 0.96 and an average

AUPR of 0.89 on the held-out (seventh) sample when tested on all EP-pairs within 2.5 kb to 2 Mb based on

the cutoffs 30 and 5 (Table 2-2). When evaluated on cell-specific EP-pairs in the seven cell lines,

the EPIP models predicted 5498 (97.66%) of the total 5630 cell-specific EP-pairs in all the cell

lines except GM12878. EPIP predicted only 31.77% of cell-specific EP-pairs in GM12878

(Table 2-2).

One likely reason behind the poor performance of EPIP on cell-specific EP-pairs in GM12878 could be the

much higher Hi-C sequencing depth in GM12878 than in the other cell lines. In other words, the

quality of the EP-pairs in other samples was different from that in GM12878. To test this

hypothesis, the same EPIP model trained on the other six samples based on the cutoffs 30 and 5 was

used to predict cell-specific EP-pairs defined with the cutoff 100 in GM12878. EPIP correctly

predicted 2396 (78.69%) of the 3045 cell-specific EP-pairs in GM12878 defined by the cutoff 100.

So, overall, EPIP reliably predicted the cell-specific EP-pairs in a new cell line, with a recall of

91.00% (7894 out of 8675 cell-specific EPIs) in all the seven cell lines.
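The overall recall quoted above follows directly from the counts given in the text, as the short check below shows. The variable names are chosen for this illustration.

```python
# Counts reported in the text.
predicted_other_six, total_other_six = 5498, 5630  # six cell lines other than GM12878
predicted_gm12878, total_gm12878 = 2396, 3045      # GM12878, positives defined at cutoff 100

predicted = predicted_other_six + predicted_gm12878
total = total_other_six + total_gm12878
recall = predicted / total
print(predicted, total, f"{recall:.2%}")
```

The sums are 7894 predicted out of 8675 cell-specific EP-pairs, which gives the stated recall of 91.00%.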

Table 2-2: Performance of the cell-specific EPIP models in predicting condition-specific EPIs.

Test cell line | AUROC | AUPR | F1 | Precision | Sensitivity/Recall | # condition-specific EPIs | % of predicted condition-specific EPIs
GM12878 | 0.7379 | 0.7015 | 0.5347 | 0.9785 | 0.3678 | 20004 | 0.3177
GM12878 (cutoff 100+5) | 0.9816 | 0.9657 | 0.9002 | 0.9578 | 0.8491 | 3045 | 0.7869
HMEC | 0.987 | 0.9119 | 0.4762 | 0.3129 | 0.9957 | 147 | 0.9592
HUVEC | 0.9938 | 0.9174 | 0.5203 | 0.3529 | 0.9896 | 30 | 0.8333
IMR90 | 0.9966 | 0.988 | 0.8744 | 0.7806 | 0.9938 | 605 | 0.9868
K562 | 0.998 | 0.9934 | 0.8219 | 0.7029 | 0.9894 | 655 | 0.9802
KBM7 | 0.9711 | 0.6777 | 0.3951 | 0.2471 | 0.9849 | 4152 | 0.9769
NHEK | 0.9974 | 0.9812 | 0.7043 | 0.5451 | 0.995 | 41 | 0.9024
Overall | 0.96 (0.99) | 0.89 (0.92) | 0.55 (0.70) | 0.49 (0.55) | 0.62 (1.00) | 28679 (8675) | 0.50 (0.91)

Except in the last row, the numbers in a row are based on the EPIP model trained on the remaining six samples and

then tested on the sample specified in this row. In the last row, the first number shows the average statistics based on

the 30+5 cutoff EP-pairs in seven samples, while the number in parentheses shows the average statistics with

the 100+5 cutoff EP-pairs in GM12878 together with the 30+5 cutoff EP-pairs in the other six samples.

2.1.3.3 Better Performance in EPI Prediction than the State-of-the-Art Methods

The performance of EPIP was compared with that of two recently published methods, TargetFinder and

Ripple, on the TargetFinder data and the EPIP all EP-pair test data within 2.5 kb to 2 Mb. On both

datasets, EPIP showed a better performance than TargetFinder and Ripple (Table 2-3).

First, EPIP was compared with TargetFinder and Ripple on the dataset used in the

TargetFinder study (Table 2-3). This dataset contained six cell lines: GM12878, HeLa, HUVEC,

IMR90, K562 and NHEK. On the six cell lines, EPIP showed an average AUROC, AUPR, F1,

precision, recall and specificity of 0.95, 0.84, 0.64, 0.98, 0.48 and 1.00, respectively, compared to

0.92, 0.59, 0.50, 0.72, 0.39 and 0.99, respectively, by TargetFinder and 0.75, 0.19, 0.02, 0.75, 0.01

and 1.00, respectively, by Ripple (Table 2-3). Ripple's poor performance indicates that

Ripple could not deal well with unbalanced data, which are more representative of real-world

data.

Table 2-3: Comparison with TargetFinder and Ripple on TargetFinder and EPIP data.

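Data | Comparison | Method | Pos | Neg | AUROC | AUPR | F-score | Precision | Sensitivity/Recall
TargetFinder data | EPIP vs TargetFinder | TargetFinder | 9899 | 197500 | 0.924 | 0.5864 | 0.5021 | 0.7225 | 0.3848
TargetFinder data | EPIP vs TargetFinder | EPIP | 9899 | 197500 | 0.95 | 0.8386 | 0.6422 | 0.9763 | 0.4784
TargetFinder data | EPIP vs Ripple | Ripple | 5830 | 116500 | 0.7478 | 0.1922 | 0.0146 | 0.7544 | 0.0074
TargetFinder data | EPIP vs Ripple | EPIP | 5830 | 116500 | 0.9519 | 0.8514 | 0.6759 | 0.9748 | 0.5173
EPIP data | EPIP vs TargetFinder | TargetFinder | 25865 | 73463 | 0.959 | 0.8695 | 0.8618 | 0.9436 | 0.7932
EPIP data | EPIP vs TargetFinder | EPIP | 26381 | 77179 | 1 | 0.982 | 0.9935 | 0.9879 | 0.9992
EPIP data | EPIP vs Ripple | Ripple | 23808 | 52313 | 0.6637 | 0.3924 | 0.3565 | 0.6066 | 0.2524
EPIP data | EPIP vs Ripple | EPIP | 23808 | 52313 | 1 | 0.995 | 0.9955 | 0.992 | 0.9992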

The comparison between TargetFinder and EPIP on TargetFinder data was done for six common samples (GM12878,

HeLa, HUVEC, IMR90, K562 and NHEK). The comparison between Ripple and EPIP on TargetFinder data was done

for the three common samples (GM12878, HeLa and K562). When tested on EPIP data, the comparison between

TargetFinder and EPIP was done for the common five samples (except HeLa, as HeLa did not have EPIP data).

Similarly, the EPIP and Ripple comparison on the EPIP data was on two common samples (except HeLa).

Next, EPIP was compared with TargetFinder and Ripple on the all EP-pairs test data within 2.5 kb

to 2 Mb (Table 2-3). Among the seven cell lines used for EPIP design, five (GM12878, HUVEC,

IMR90, K562, NHEK) were common with the TargetFinder study and only two (GM12878 and K562)

were common with the Ripple study. In comparison with TargetFinder on the five common cell lines,

EPIP showed an average AUROC, AUPR, F1, precision, recall and specificity of 1.00, 0.98, 0.99,

0.99, 1.00 and 1.00, respectively, while the best model of TargetFinder, GBM, showed 0.96, 0.87,

0.86, 0.94, 0.79 and 0.98, respectively. On the two common cell lines, Ripple showed an average

AUROC, AUPR, F1, precision, recall and specificity of 0.66, 0.39, 0.36, 0.61, 0.25 and 0.93,

respectively, while EPIP showed much better scores: 1.00, 1.00, 1.00, 0.99, 1.00 and 1.00,

respectively, on the same dataset.

When compared on the cell-specific EPIs in TargetFinder data, EPIP predicted 51.36% of the 8471

cell-specific EP-pairs in the six samples, while TargetFinder predicted 38.85% of them. On the

three common cell lines (GM12878, HeLa and K562) between the TargetFinder and Ripple

studies, Ripple predicted only 0.53% of the 5787 cell-specific EP-pairs, while EPIP predicted

54.42% of them. The lower accuracy of EPIP on cell-specific EP-pairs of the TargetFinder data

compared to that of the EPIP test data may be because the overall quality of the TargetFinder data was

not good. For instance, the enhancers and promoters used by TargetFinder were from

computational predictions [16, 24], which were prone to errors. Moreover, as we investigated,

almost 50% of the enhancer and promoter regions overlapped with each other. Also, the negative

EP-pairs used in TargetFinder might be loosely defined. TargetFinder labeled an EP-pair

‘negative’ if it did not overlap the contacts of any resolution in the Rao et al. looplists. Note that

the looplists defined in Rao et al. were finely selected Hi-C contacts with a false discovery rate of 0.1.

Due to the stringency of the looplists, although they are likely to represent the positive EPIs, the

EPIs not identified by the looplists are not necessarily negative pairs.

In the case of the cell-specific EPIs among all pairs within 2.5 kb to 2 Mb, EPIP clearly outperformed

TargetFinder and Ripple. On the five common cell lines with the TargetFinder study, EPIP predicted

99.99% of the cell-specific EP-pairs, while TargetFinder predicted only 83.91% of them. On the

two common cell lines with the Ripple study, EPIP predicted 99.99% of the cell-specific EP-pairs,

while Ripple could predict only 27.07% of them.

2.1.4 Discussion

EPIs are one of the major factors that initiate gene transcription. Proper identification of EPIs can

help to understand gene transcription regulation. The active EPIs can be different for different cell

types. At this moment, the performance of the available EPI prediction tools is not satisfactory,

especially in terms of cell-specific EPIs. Here a computational method, EPIP, was developed to

learn the patterns of EPIs and to predict cell-specific EPIs. On average, EPIP correctly predicts

99.26% of cell-specific EPIs in different cell lines. EPIP also performed better than two state-of-

the-art EPI prediction tools.

The design of EPIP incorporates a robust framework to integrate useful features for EPI

predictions. Using a feature partitioning strategy, EPIP can work as efficiently for the cell lines

with partially available features, as for those with abundant features. As a result, EPIP can be

trained on different types of samples, which makes the training model more accurate and broadly

representative. Moreover, the learning approach of EPIP facilitates incremental training of the

model with the availability of new data.

While training EPIP with different cell lines, the order of the cell lines does not matter. This means

that data from different samples can be fed to the training model in any order. To investigate

whether the order of the cell lines in training has an impact on the performance of EPIP, we

considered HUVEC as the test cell line and trained the EPIP model on the remaining six samples

in all possible 720 orders. The standard deviations of the AUROC and the F1 score were 0.001 and

0.002, respectively, for all 720 different orders of training in these experiments. This shows that

the order of the cell lines used in training EPIP does not significantly impact the final performance.
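The 720 training orders correspond to the 6! permutations of the six training cell lines. The sketch below enumerates them and shows how the spread of a score across orders could be summarized; `score_for_order` is a hypothetical callable standing in for training EPIP on one order and scoring it on HUVEC, and the use of the population standard deviation is an assumption.

```python
import itertools
import statistics

# The six training cell lines when HUVEC is held out for testing.
training_lines = ["GM12878", "HMEC", "IMR90", "K562", "KBM7", "NHEK"]
orders = list(itertools.permutations(training_lines))  # 6! = 720 orders

def order_sensitivity(score_for_order, orders):
    """Standard deviation of a score across training orders.
    `score_for_order` is a hypothetical callable that trains on one order
    and returns a score (for example the AUROC) on the held-out cell line."""
    return statistics.pstdev([score_for_order(o) for o in orders])
```

A standard deviation close to zero, such as the reported 0.001 for the AUROC, indicates that the training order has little effect on the final model.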

EPIP was trained with a reliable set of available enhancers. So far, FANTOM enhancers arguably

represent the largest set of enhancers that are defined with a consistent criterion and supported by

experiments. But the number of FANTOM enhancers is small compared with the known and

predicted enhancers in various studies [16]. So, to generate abundant reliable enhancers, the

FANTOM enhancers were overlapped with computationally predicted ChromHMM enhancers and

H3K27ac ChIP-seq peaks. However, the efficiency of EPIP on the EP-pairs generated from a new

enhancer source remains to be evaluated, due to the lack of such data thus far. When there is a

larger and more reliable set of experimentally determined enhancers available in the future, it will be

necessary to test EPIP on the EP-pairs based on the new set of enhancers to make sure that it

performs similarly.

The EPIP models trained on the EP-pairs using the looplists defined by Rao et al. generated

suboptimal results due to the smaller size of training data. Hence, the cutoffs 30 and 5 were used

to define positive and negative samples, respectively. This combination of cutoffs was selected

based on the test results with different cutoff combinations and our previous studies [14, 19]. The

EP-pairs designed with this approach may not yet be perfect and may suffer from the following

drawbacks. First, the available methods to analyze Hi-C contact matrices are still

suboptimal [25], which prevents the accurate definition of interacting regions. Second, the cutoff

combination chosen was a tradeoff between too strict (such as the Rao et al. looplists) and too loose

(such as those from the cutoff 10) chromatin contacts, which might still affect the quality of the

obtained EP-pairs. Third, as mentioned above, the FANTOM enhancers only represent a portion

of existing enhancers while the ChromHMM enhancers are not so reliable. Although these two

enhancer sets were used together with the H3K27ac peaks to define active enhancers, the data may

still miss some positive EP-pairs. Finally, a fixed cutoff of 30 does not consider the exponential

decay of the number of supporting Hi-C reads with the increasing distance between enhancers and

promoters, which may miss true positive EP-pairs as well.

Despite the limitations in the quality of enhancers and the criterion to extract EP-pairs, the good

performance of EPIP on the EP-pairs based on the interacting regions defined by other studies

makes us believe that the majority of the positives and negatives in the training data represent the

true data. Moreover, EPIP showed a consistently high recall/sensitivity when different cutoffs were

used to define positive EP-pairs. EPIP also performed well when tested on the remaining 70% of

untrained EP-pairs. The performance of EPIP on a variety of test datasets again allows us

to believe that EPIP did a good job in learning to distinguish the interacting EP-pairs from the

non-interacting ones.

Even though EPIP showed a better performance compared to the state-of-the-art methods, there is

still room for improvement. For instance, the training data used in this study is not perfect. With

the availability of more accurate and broadly representative training data in the future, the

performance of EPIP can be improved further. Here only Hi-C was used to extract training data.

It is worth studying how the performance of EPIP improves using EPIs from other sources of

chromatin interaction, such as ChIA-PET and 5C, together with Hi-C. Also, the Hi-C

chromatin contacts used here were preprocessed by Rao et al. considering various algorithmic

tradeoffs. Extraction of chromatin interactions from raw Hi-C data may help to improve the

performance of EPIP. Finally, as shown in a previous study [14], multiple EPIs can be

interconnected due to complicated chromatin structures. Like almost every other existing method,

EPIP considers each EP-pair independently to predict EPIs, whereas considering multiple EP-pairs

jointly may add a different perspective.

2.2 An Intriguing Characteristic of Enhancer-Promoter Interactions

2.2.1 Background

Enhancer-promoter interaction is one of the major factors in gene transcription. Enhancers are

short genomic regions that interact with gene promoters to initiate gene transcription. Despite

being located far from their target genes, enhancers come into direct contact with the gene promoters

via chromatin looping to control the temporal and spatial expression of the target genes [21, 26-31].

The distance between enhancers and their targets validated by low-throughput experiments

can be about one mega bps (Mbps) [26, 27]. Recent high-throughput experiments showed that the

distance can be even larger than two Mbps in many cases [8, 32]. Because of such long and variable

distances, it is still challenging to identify interacting enhancer-promoter pairs (IEPs). In this study,

an IEP refers to an enhancer-promoter pair that physically interacts, although such an interaction

may or may not have any functional effect observed yet.

Identification of the active enhancers is a part of the problem of finding the IEPs. Early

experimental studies identified enhancers by “enhancer trap”, which established our rudimentary

understanding of enhancers in spite of its low-throughput and time-consuming nature [33, 34].

Early computational methods predicted enhancers through comparative genomics, which is cost-effective

but may produce many false positives. With the availability of next-generation

sequencing (NGS) technologies, enhancers are now identified through a variety of experimental

methods such as chromatin immunoprecipitation followed by massive parallel sequencing (ChIP-

seq), DNase I hypersensitive sites sequencing (DNase-seq), global run-on sequencing (GRO-seq),

cap analysis of gene expression (CAGE), etc. [1, 35-39]. In the ChIP-seq experiments, genomic

regions enriched with H3K4me1 and H3K27ac modifications are widely considered as active

enhancers, and those with H3K4me1 and H3K27me3 modifications are regarded as repressed

enhancers [36]. In the DNase-seq experiments, distal open chromatin regions are considered as

potential enhancers for gene regulation studies [5, 11, 40, 41]. In the GRO-seq and CAGE

experiments, bidirectional transcripts are employed to identify active enhancers [1, 42, 43].

Numerous computational methods were developed based on the NGS data to predict enhancers on

the genome-wide scale [16, 24, 36, 44]. These methods range from the early ones that are based

solely on H3K4me3 and H3K4me1 ChIP-seq experiments to the later ones that are based on

various types of epigenomic and genomic signals.

A large number of enhancers have been discovered so far by different experimental and

computational methods. The VISTA database includes about 2,900 enhancers from comparative

genomics that were tested with mouse transgenic reporter assays [45]. The functional annotation of the

mouse/mammalian genome (FANTOM) project

(http://FANTOM.gsc.riken.jp/5/datafiles/latest/extra/Enhancers/) identified 32,693 enhancers

from balanced bidirectional capped transcripts [1]. This set of enhancers is arguably the largest set

of mammalian enhancers with supporting experimental evidence [46]. The computational methods

such as ChromHMM and Segway also contributed to predicting thousands of human enhancers

[16, 24]. This set of enhancers is regarded as the most comprehensive set of computationally

predicted human enhancers available so far. In addition to the individual enhancers, groups of

enhancers in a genomic region, called ‘super-enhancers’, were identified that can collectively

control the expression of genes involved in cell identity [47, 48].

Although the discovery of enhancers has been relatively straightforward, the identification of IEPs

is still nontrivial. Early experimental procedures to identify IEPs are expensive and time-consuming

[4, 49]. Recent Hi-C experiments hold great promise for identifying IEPs on the genome scale,

but are still not cost-effective for capturing high-resolution Hi-C interactions [8,

20, 32]. To date, these experiments have only been carried out on a few cell lines or cell types.

Computational methods have also evolved considerably, from the early ones that regarded the closest genes

as target genes, to the later ones that considered the correlation of epigenomic signals in

enhancers and those in promoters, to the current ones that are based on more sophisticated

approaches [1, 6, 9-14, 50]. Although these methods have shown some success in predicting

enhancer target genes, they either do not consider or perform poorly on cell-specific IEP

prediction [12]. From the results of these experimental and computational studies, self-interacting

genomic regions of several mega bases were discovered in mammalian genomes, called

topologically associated domains (TADs). IEPs usually fall within the TADs instead of crossing

different TADs [51].

Almost all existing computational methods consider one enhancer-promoter pair at a

time to determine whether they interact. We hypothesized that when two enhancers interact with

a common target gene, these two enhancers may be spatially close to each other and may thus

interact with all target genes of both enhancers. In other words, if two enhancers share a target

gene, they may share all of their target genes as well. If this hypothesis is true, we should consider

the interactions of multiple enhancers and multiple target genes simultaneously to predict IEPs,

which may improve the accuracy of the computational prediction of the IEPs, especially that of

cell-specific IEPs.
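One simple way to quantify how strongly two enhancers share their target genes is the Jaccard index of their target sets. The sketch below is only an illustration of the hypothesis and is not necessarily the measure used in this study; the function name and the gene labels in the usage note are hypothetical.

```python
def shared_target_fraction(targets_a, targets_b):
    """Jaccard index of the target-gene sets of two enhancers.
    1.0 means identical target sets, 0.0 means completely disjoint sets."""
    a, b = set(targets_a), set(targets_b)
    union = a | b
    return len(a & b) / len(union) if union else 0.0
```

For example, two enhancers with the hypothetical targets {"G1", "G2", "G3"} and {"G2", "G3", "G4"} share two of four distinct targets, giving 0.5. Under the hypothesis, pairs of enhancers that share at least one target would be expected to score close to 1.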

To find out how different enhancers may share their target genes, the experimentally supported

IEPs from five previous studies [8, 20, 21, 32, 52] were collected and investigated in different cell

lines and cell types. The enhancers used in this study include both the experimentally annotated

enhancers from FANTOM and the computationally predicted enhancers by ChromHMM in

different samples [1, 16]. We observed that two enhancers are likely to either share almost all of

their target genes or interact with two completely disjoint sets of target genes, in a cell line or a

cell type. This observation implies an interesting characteristic of IEPs, which has not been

considered by the existing studies to predict IEPs. This study may also shed new light on the

underlying principles of chromatin interactions and facilitate the more accurate identification of

IEPs.

Figure 2-4: Generation of IEPs and calculation of BCC. (A) The process of generating IEPs using

the chromatin interaction data from five studies, enhancer regions from FANTOM and

ChromHMM, and promoters defined around the GENCODE annotated gene TSSs. (B) A toy

interaction network between three enhancers ($e_1$, $e_2$ and $e_3$) and three promoters ($p_1$, $p_2$ and $p_3$).

The average BCC of the enhancers in this example is $\left(\frac{1}{2} + \frac{7}{12} + \frac{5}{12}\right)/3 = 0.5$.

2.2.2 Materials and Methods

2.2.2.1 Enhancers and Promoters

Two sets of enhancers were used in this study (Figure 2-4A). The first set contained the 32,693

enhancers annotated by FANTOM, which had been obtained from the balanced bidirectional

capped transcripts [1]. The FANTOM enhancers were downloaded from FANTOM5 Human

Enhancer Selector (http://slidebase.binf.ku.dk/human_enhancers/results). The second set

contained the computationally predicted enhancers by ChromHMM [16] in the following seven

cell lines: GM12878, HMEC, HUVEC, K562, NHEK, IMR90 and HeLa. ChromHMM is widely

used to partition genomes into different functional units including enhancers. The ChromHMM

enhancers for GM12878, HMEC, HUVEC, K562 and NHEK cell lines were downloaded from the

ENCODE composite track

(http://hgdownload.cse.ucsc.edu/goldenpath/hg19/encodeDCC/wgEncodeBroadHmm) of UCSC

Genome Browser. The ChromHMM enhancers for the HeLa and IMR90 cell lines were downloaded,

respectively, from the ENCODE Genome Segmentation track of the UCSC Genome Browser

(http://hgdownload.cse.ucsc.edu/goldenpath/hg19/encodeDCC/wgEncodeAwgSegmentation/)

and the chromatin state model based on imputed data (25 states, 12 marks, 127 epigenomes)

(https://egg2.wustl.edu/roadmap/data/byFileType/chromhmmSegmentations/ChmmModels/imputed12marks/jointModel/final/E017_25_imputed12marks_mnemonics.bed.gz).

The FANTOM enhancers are not cell-specific, while the ChromHMM predicted enhancers are

specific for the seven different cell lines mentioned. Thus “active” FANTOM enhancers were

defined by overlapping the enhancers with the H3K27ac ChIP-seq peaks in the corresponding cell

lines obtained from the Encyclopedia of DNA Elements (ENCODE) project [17]. For cell lines

without available H3K27ac ChIP-seq data such as KBM7, the enhancers that overlapped with the

chromatin interacting anchors in this cell line were considered as “active” enhancers [8].
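Selecting the active enhancers by overlap with H3K27ac peaks, as described above, amounts to an interval-overlap test. The sketch below is a minimal illustration that assumes half-open (chrom, start, end) intervals; the function names are assumptions and the coordinates in the usage note are hypothetical. For large inputs, a sorted sweep or an interval tree would be more efficient than this quadratic scan.

```python
def overlaps(a, b):
    """True if two (chrom, start, end) half-open intervals share at least one base."""
    return a[0] == b[0] and a[1] < b[2] and b[1] < a[2]

def active_enhancers(enhancers, peaks):
    """Keep the enhancers that overlap at least one peak in the cell line."""
    return [e for e in enhancers if any(overlaps(e, p) for p in peaks)]
```

For example, with the hypothetical enhancers ("chr1", 100, 200) and ("chr1", 500, 600) and a single peak ("chr1", 150, 300), only the first enhancer is kept.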

To define promoters, the gene transcriptional start site annotations were downloaded from GENCODE V19 [18]. The region from 1 kbp upstream to 100 bps downstream of each transcriptional start site was considered a promoter. In total, 57,820 promoters were obtained in this way in the human genome. To define cell-specific active promoters, the available RNA-Seq data in different cell lines (GM12878, HeLa, HUVEC, IMR90, K562 and NHEK) were used, as in a previous study [13]. In a cell line, a promoter was considered “active” if the corresponding gene had at least 0.30 reads per kbp of transcript per million mapped reads with an irreproducible discovery rate of 0.1, as in the previous studies [13, 15]. For cell lines without RNA-Seq data (HMEC and KBM7), all promoters were considered active promoters [15].
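As a rough illustration of the promoter definition, the window around a transcriptional start site (TSS) can be computed as below. The strand handling is an assumption (the text only says "upstream" and "downstream"), the irreproducible discovery rate filter is not modeled, and the function names are hypothetical.

```python
def promoter_window(tss: int, strand: str, upstream: int = 1000, downstream: int = 100):
    """Return (start, end) of the promoter around a TSS, respecting the strand.

    On the minus strand, "upstream" lies at larger genomic coordinates.
    """
    if strand == "+":
        start, end = tss - upstream, tss + downstream
    elif strand == "-":
        start, end = tss - downstream, tss + upstream
    else:
        raise ValueError("strand must be '+' or '-'")
    return max(start, 0), end  # clip at the chromosome start

def is_active(rpkm: float, threshold: float = 0.30) -> bool:
    """Expression rule only: the gene needs at least `threshold` RPKM."""
    return rpkm >= threshold

print(promoter_window(5000, "+"), promoter_window(5000, "-"))
```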

2.2.2.2 IEPs from Five Studies

The experimentally supported chromatin contact information from five previous studies was collected to define IEPs [8, 20, 21, 32, 52] (Figure 2-4A). These data arguably represent the intra-chromosomal chromatin interactions defined at the highest resolutions by the corresponding techniques. The first set of data was downloaded from the Hi-C dataset GSE63525 in the Gene Expression Omnibus (GEO) database [8]. This dataset contains significant intra-chromosomal chromatin interactions at 5 kbp resolution, named “looplists”, for the following eight cell lines: GM12878, HeLa, HMEC, HUVEC, IMR90, K562, KBM7 and NHEK [8]. The looplists were defined with stringent criteria and are most likely to be true pairs of interacting genomic regions, each of which is about 5 kbps long. In every cell line, each chromatin interaction in the corresponding looplist was overlapped with the aforementioned two sets of active enhancers and with the annotated active promoters to obtain IEPs. In other words, an IEP consists of an enhancer and a promoter, where the enhancer overlaps with one of the interacting regions of a chromatin interaction and the promoter overlaps with the other region. Since we had two sets of enhancers, we obtained two sets of IEPs for each of the eight cell lines (Figure 2-4A). Note that, since only the intra-chromosomal interactions were used in this study, the enhancer and promoter in an IEP are always from the same chromosome.
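The way an IEP is read off a chromatin interaction can be sketched as follows. This is a simplified illustration under stated assumptions: coordinates are 0-based half-open, each loop is a pair of anchors, and the names `Region`, `overlaps` and `ieps_from_loops` are hypothetical rather than taken from the study.

```python
from typing import NamedTuple

class Region(NamedTuple):
    chrom: str
    start: int
    end: int
    name: str

def overlaps(a: Region, b: Region) -> bool:
    """True if both regions lie on the same chromosome and share a base."""
    return a.chrom == b.chrom and a.start < b.end and b.start < a.end

def ieps_from_loops(loops, enhancers, promoters):
    """Return the set of (enhancer, promoter) name pairs supported by a loop.

    An enhancer must overlap one anchor of the loop and the promoter must
    overlap the other anchor; both anchor orders are tried.
    """
    ieps = set()
    for left, right in loops:
        for first, second in ((left, right), (right, left)):
            for enhancer in enhancers:
                if not overlaps(enhancer, first):
                    continue
                for promoter in promoters:
                    if overlaps(promoter, second):
                        ieps.add((enhancer.name, promoter.name))
    return ieps

# Toy data (invented for illustration only).
loops = [(Region("chr1", 10000, 15000, "a1"), Region("chr1", 200000, 205000, "a2"))]
enhancers = [Region("chr1", 12000, 12500, "E1"), Region("chr1", 50000, 50500, "E2")]
promoters = [Region("chr1", 204000, 205100, "P1")]
print(ieps_from_loops(loops, enhancers, promoters))
```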

The number of IEPs obtained from the above looplists was small, especially when the FANTOM enhancers were considered. The reason might be that the criteria Rao et al. used to define looplists were quite stringent, so that many true interacting genomic regions might be missed [15]. To capture more IEPs in these cell lines, the cell line specific contact matrix datasets from the same study were used [8]. The contact matrix for a cell line contains the 5 kbp resolution intra-chromosomal chromatin interactions supported by at least one Hi-C read. The read counts in the contact matrices were normalized using the KR normalization vector. The alternative sets of IEPs were generated from the contact matrices with three normalized read cutoffs: 30, 50, and 100. Given a normalized read cutoff, say x, if an enhancer-promoter pair overlapped with a pair of interacting genomic regions that were supported by at least x normalized Hi-C reads, the enhancer-promoter pair was considered an IEP. The cutoff 30 was used because it was likely to include almost all known IEPs in K562 and IMR90 from other studies [20, 21] without allowing too many false positives [15]. The two other cutoffs (50 and 100) were used to see how the observed enhancer characteristics may change with more stringent cutoffs. The IEPs from contact matrices were not considered for HeLa because Rao et al. did not provide a Hi-C contact matrix for HeLa. Since the sequencing depth was much higher for GM12878 than for the other seven cell lines, the IEPs defined by the cutoff 400 were considered highly reliable for GM12878 after testing different cutoffs.
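The cutoff step can be illustrated with the sketch below. It assumes that a raw contact between two bins is normalized by dividing it by the product of the two corresponding entries of the KR vector, and that bins are keyed by their start coordinate; both points, as well as the function names, are assumptions made for illustration and should be checked against the documentation of the downloaded matrices.

```python
import math

def kr_normalize(raw_contacts, kr_vector, resolution=5000):
    """Normalize raw contact counts keyed by (bin_start_i, bin_start_j)."""
    normalized = {}
    for (i, j), count in raw_contacts.items():
        ni = kr_vector[i // resolution]
        nj = kr_vector[j // resolution]
        # Skip bins without a usable normalization factor.
        if math.isnan(ni) or math.isnan(nj) or ni == 0 or nj == 0:
            continue
        normalized[(i, j)] = count / (ni * nj)
    return normalized

def pairs_above_cutoff(normalized, cutoff):
    """Bin pairs supported by at least `cutoff` normalized reads."""
    return {pair for pair, value in normalized.items() if value >= cutoff}

# Toy data (invented for illustration only).
raw = {(0, 5000): 60.0, (0, 10000): 20.0}
kr = [1.0, 2.0, 0.5]
norm = kr_normalize(raw, kr)
print(pairs_above_cutoff(norm, 30))
```

The retained bin pairs would then be overlapped with enhancers and promoters in the same way as the looplist anchors.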

From another Hi-C study, 57,578 IEPs were downloaded for the IMR90 cell line [20]. To our knowledge, this was the only Hi-C dataset for human samples with a sequencing depth comparable to that in Rao et al. In this study, Jin et al. defined active enhancers with H3K4me1 and H3K27ac ChIP-seq peaks and active promoters with H3K4me3 ChIP-seq peaks together with the known genes from the University of California, Santa Cruz genome browser. In addition to using the original IEP dataset, which was provided in the hg18 version [20], the IEPs were also converted into the hg19 version and overlapped with the aforementioned enhancers and promoters used in this study to define a new set of IEPs for the IMR90 cell line.

The IEPs defined by the ChIA-PET experiments in K562 and MCF7 were used as well for this study [21]. Using the interacting regions in these datasets, a total of 2,923 and 2,190 IEPs were extracted with the FANTOM enhancers for K562 and MCF7, respectively. For the ChromHMM enhancers, the number of IEPs was 33,598 in K562. There were no ChromHMM enhancers available in MCF7.

Additional IEPs used in this study were based on the active enhancer and promoter links defined by Javierre et al. from promoter capture Hi-C experiments in nine cell types (Table 2-5 in [32]). Javierre et al. performed the experiments on seventeen primary cell types, while the active enhancer and promoter links were provided for nine cell types. Each link defined a pair of interacting regions, with average lengths of 5,709 and 8,599 bps, respectively. Since Javierre et al. did not explicitly specify the enhancers and promoters, these links were overlapped with the two sets of enhancers and the GENCODE promoters to define two sets of IEPs. In total, 20,764 and 607,274 IEPs were obtained with FANTOM and ChromHMM enhancers, respectively.

The final chromatin interaction dataset for this study consisted of the interactions detected using a newly developed method named “SPRITE” by the Guttman lab [52]. This dataset was downloaded from the GEO database of NCBI with the accession number GSE114242. Among the available SPRITE datasets, the only human dataset was for the GM12878 cell line, with the lowest resolution of 25 kbps. This dataset was filtered with three different read cutoffs (30, 50 and 100) to obtain IEPs.

A distance filter was applied to all the IEP sets obtained above. Every IEP in which the distance between the enhancer and the promoter was less than 2.5 kbps was filtered out from the analysis.
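A minimal sketch of the distance filter is given below. The text does not say how the enhancer-promoter distance is measured, so the gap between the nearest edges of the two intervals is used here as an assumption; the function names are hypothetical.

```python
def gap(a_start, a_end, b_start, b_end):
    """Distance between the nearest edges of two intervals; 0 if they overlap."""
    return max(0, max(a_start, b_start) - min(a_end, b_end))

def distance_filter(ieps, min_distance=2500):
    """Drop IEPs whose enhancer and promoter are closer than `min_distance`.

    `ieps` is a list of ((e_start, e_end), (p_start, p_end)) pairs that are
    already known to lie on the same chromosome.
    """
    return [(e, p) for e, p in ieps if gap(*e, *p) >= min_distance]

# Toy data (invented for illustration only).
pairs = [((0, 500), (2000, 3100)), ((0, 500), (3000, 4100))]
print(distance_filter(pairs))
```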

2.2.2.3 Other Data Used

Rao et al. annotated chromatin contact domains in each of the eight cell lines [8]. These domains were downloaded from GSE63525 and considered as the topologically associating domains (TADs) in this study. The TADs annotated in IMR90 by Dixon et al. were also used; they were generated by the same lab that generated the Jin et al. data [51].

The super-enhancers in GM12878, HeLa, HMEC, HUVEC, K562 and NHEK were downloaded from http://asntech.org/dbsuper/download.php. No known super-enhancers were available for KBM7. The super-enhancers in a cell line were compared with the clusters of enhancers, identified in this study, that interact with the same set of target genes in the same cell line.

2.2.2.4 BCC (Bipartite Clustering Coefficient)

The defined IEPs in a cell line can be represented as a bipartite graph, where the enhancers on one

side connect with the target genes on the other side. Bipartite clustering coefficient (BCC) is used

to measure the degree to which the nodes in a graph tend to cluster together [53]. Here BCC was

used to characterize how enhancers share their target genes and how genes share their enhancers

(Figure 2-4B).

For a pair of enhancers (or a pair of genes), say $u$ and $v$, their BCC is defined as

$$BCC(u, v) = \frac{|n(u) \cap n(v)|}{|n(u) \cup n(v)|},$$

where $n(u)$ and $n(v)$ are the sets of genes (or enhancers) interacting with $u$ and $v$, respectively. Intuitively, if $u$ and $v$ are a pair of enhancers, $BCC(u, v)$ measures the fraction of target genes that both $u$ and $v$ interact with among all of their target genes. Similarly, if $u$ and $v$ are a pair of genes, $BCC(u, v)$ measures the fraction of enhancers that both $u$ and $v$ interact with among all enhancers they interact with. Correspondingly, the BCC of an individual enhancer (or gene), say $u$, is defined as

$$BCC(u) = \frac{\sum_{v \in n(n(u)),\, v \neq u} BCC(u, v)}{|n(n(u))| - 1},$$

where $n(n(u))$ is the set of enhancers (or genes) that share at least one target gene (or enhancer) with $u$. As the neighbors of the neighbors of $u$, this set contains $u$ itself, hence the $-1$ in the denominator. Under a given condition, for all enhancers (or target genes) sharing at least one target gene (or enhancer) with other enhancers (or target genes), we averaged their individual BCCs to obtain the BCC of enhancers (or target genes) under this condition.
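The definitions above translate directly into a few lines of code. The sketch below is an illustration on invented toy data, not the implementation used in the study; the function names are hypothetical. Swapping the two elements of every pair gives the BCC of genes instead of enhancers.

```python
from collections import defaultdict

def neighbor_sets(ieps):
    """Map each enhancer u to n(u), the set of genes it interacts with."""
    targets = defaultdict(set)
    for enhancer, gene in ieps:
        targets[enhancer].add(gene)
    return dict(targets)

def pairwise_bcc(n_u, n_v):
    """BCC(u, v) = |n(u) & n(v)| / |n(u) | n(v)|."""
    return len(n_u & n_v) / len(n_u | n_v)

def node_bcc(u, targets):
    """BCC(u): mean of BCC(u, v) over the other enhancers sharing a gene with u.

    Returns None when u shares no gene with any other enhancer.
    """
    partners = [v for v in targets if v != u and targets[u] & targets[v]]
    if not partners:
        return None
    return sum(pairwise_bcc(targets[u], targets[v]) for v in partners) / len(partners)

def average_bcc(targets):
    """Average of the individual BCCs over enhancers that have a defined BCC."""
    values = [node_bcc(u, targets) for u in targets]
    values = [b for b in values if b is not None]
    return sum(values) / len(values) if values else None

# Toy IEPs (invented for illustration only).
ieps = [("E1", "G1"), ("E1", "G2"), ("E2", "G1"), ("E2", "G2"),
        ("E3", "G2"), ("E3", "G3"), ("E4", "G4")]
targets = neighbor_sets(ieps)
print(node_bcc("E1", targets), average_bcc(targets))
```

In this toy example E1 and E2 have identical targets (pairwise BCC of 1), while E3 shares only one of three genes with each of them, which pulls the individual BCCs below 1.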

2.2.2.5 Generation of Enhancer Clusters

Using the enhancers with BCC > 0, an enhancer graph was created for each IEP dataset in each cell line. In this graph, the nodes represent enhancers and the edges represent pairs of enhancers interacting with at least one common target gene. The Bron-Kerbosch algorithm was then applied to this graph to find all maximal cliques [54]. The enhancers in a clique represented a cluster of enhancers that interact with the same set of genes. Here, different clusters may share the same enhancers.
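The clustering step can be sketched as follows. The code builds the enhancer graph from the enhancer-to-gene sets and enumerates maximal cliques with the basic Bron-Kerbosch recursion (no pivoting); whether the study used a pivoting variant is not stated, and the function names and toy data are hypothetical.

```python
def enhancer_graph(targets):
    """Connect two enhancers when they share at least one target gene.

    Enhancers without any such partner (no defined BCC) are left out.
    """
    nodes = list(targets)
    adjacency = {u: set() for u in nodes}
    for index, u in enumerate(nodes):
        for v in nodes[index + 1:]:
            if targets[u] & targets[v]:
                adjacency[u].add(v)
                adjacency[v].add(u)
    return {u: nb for u, nb in adjacency.items() if nb}

def bron_kerbosch(adjacency):
    """Enumerate all maximal cliques (basic Bron-Kerbosch, without pivoting)."""
    cliques = []

    def expand(clique, candidates, excluded):
        if not candidates and not excluded:
            cliques.append(frozenset(clique))
            return
        for v in list(candidates):
            expand(clique | {v}, candidates & adjacency[v], excluded & adjacency[v])
            candidates = candidates - {v}
            excluded = excluded | {v}

    expand(set(), set(adjacency), set())
    return cliques

# Toy enhancer-to-gene sets (invented for illustration only).
targets = {"E1": {"G1", "G2"}, "E2": {"G1", "G2"}, "E3": {"G2", "G3"},
           "E4": {"G4"}, "E5": {"G3"}}
clusters = bron_kerbosch(enhancer_graph(targets))
print(sorted(sorted(c) for c in clusters))
```

E3 appears in both cliques of the toy example, which mirrors the remark that different clusters may share the same enhancers.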

2.2.2.6 Statistical Tests

To assess the statistical significance of the observed BCC values in a given set of IEPs, a random set of IEPs was generated using the same enhancers and promoters as in the original set of IEPs. The observed BCC values of the enhancers (promoters) in the original set of IEPs were then compared with those in the random IEPs. In every comparison, the BCC values of the enhancers (promoters) that interacted with multiple promoters (enhancers) were pooled together from the original IEPs and compared with those from the random IEPs. In the statistical significance analysis, the fraction of the enhancers (promoters) with BCC > 0.9 in the random IEPs was used as the binomial probability parameter $p$. If there are $n$ enhancers in the original IEPs and $k$ of them have BCC > 0.9, the p-value is calculated using the following formula:

$$p\text{-value} = 1 - \sum_{i=0}^{k-1} \binom{n}{i} p^{i} (1 - p)^{n-i}$$

2.2.2.7 Additional Analyses

To assess the sequence similarity between the enhancers within a cluster, the enhancer sequences

within a cluster were multiple aligned by ClustalW programs using MUSCLE version 3.8.31 [55].

The similarity score between a pair of enhancers was then defined as the percentage of identities

in the corresponding alignment [55]. Similarly, the similarity scores were measured between every

pair of enhancers from a randomly selected enhancer set in the same cell line. The two sets of

similarity scores were then compared by the Mann-Whitney U test [56].
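For a pair of rows taken from a multiple sequence alignment, a percentage of identities can be computed as sketched below. The denominator is an assumption: columns where both rows carry a gap are skipped and all remaining columns are counted, which may differ from the exact convention used in the study. The function name is hypothetical.

```python
def percent_identity(row_a: str, row_b: str, gap: str = "-") -> float:
    """Percentage of identical columns between two aligned sequences.

    Columns in which both rows have a gap are ignored; a gap facing a
    residue counts as a compared, non-identical column.
    """
    if len(row_a) != len(row_b):
        raise ValueError("aligned rows must have equal length")
    compared = identical = 0
    for a, b in zip(row_a.upper(), row_b.upper()):
        if a == gap and b == gap:
            continue
        compared += 1
        if a == b:
            identical += 1
    return 100.0 * identical / compared if compared else 0.0

print(percent_identity("ACGT-A", "ACCTTA"))
```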

To assess whether the enhancers in a cluster tend to be located close to each other in a cell line,

the relative distances between every pair of enhancers within clusters in a cell line were compared

with the same for the randomly chosen enhancer set in the same cell line using the Mann-Whitney

U test.

Finally, the functional similarity scores between the target genes of every pair of enhancers in a cluster were measured for each cell line by the GREAT tool [57]. The tool generated the significant functional terms (p-value < 1e-05) associated with the target genes of the enhancer clusters.

2.2.3 Results

2.2.3.1 Two Enhancers are Likely to Interact with Either Exactly the Same Set or Two Completely Different Sets of Genes

In order to study IEPs, the BCC values of the enhancers were calculated for the five sets of experimentally supported IEPs with the two sets of enhancers in each cell line or cell type (Chapter 2.2.2.1 and 2.2.2.2, Figure 2-4A). BCC is commonly used to measure the degree to which two unconnected nodes in a bipartite graph share their connected or neighboring nodes. Note that every set of IEPs can be represented as a bipartite graph, where the enhancer set and the gene promoter set correspond to the two disjoint sets of nodes, and their interactions correspond to the edges (Figure 2-4B). The neighboring promoter nodes of an enhancer are the target genes of this enhancer. With the goal of investigating how different enhancers share their target genes, BCC is a suitable measure, as it shows the fraction of shared target genes of an enhancer in a given set of IEPs (Figure 2-4B). The average BCC values of the enhancers were 0.80 or larger for all the IEP sets in Table 2-4 and 0.90 or larger for most of them. The high BCC values indicate that enhancers are not likely to share their target genes partially. When a pair of enhancers interact with a common target gene, both enhancers are likely to interact with all target genes of these two enhancers.

Table 2-4: The BCC of enhancers and that of promoters are likely to be 1 in a cell line.

| Study | Cell line | IEPs | BCC of enhancers (all) | BCC of enhancers (multiple) | % of total enhancers with multiple promoters and BCC > 0 (E1) | % of E1 with BCC >= 0.9 | BCC of promoters (all) | BCC of promoters (multiple) | % of total promoters with multiple enhancers and BCC > 0 (P1) | % of P1 with BCC >= 0.9 |
|---|---|---|---|---|---|---|---|---|---|---|
| Rao | GM12878 | 294 (2384) | 0.97 (0.99) | 0.96 (0.96) | 17.47 (18.44) | 87.5 (88.95) | 0.97 (0.95) | 0.95 (0.93) | 19.35 (27.15) | 91.67 (87.98) |
| Rao | HELA | 11 (37) | 1 (1) | 0 (1) | 0 (12.5) | 0 (100) | 1 (1) | 0 (1) | 0 (11.76) | 0 (100) |
| Rao | HMEC | 260 (2558) | 0.97 (0.99) | 0.96 (0.98) | 15.42 (26.9) | 90.32 (95.91) | 0.96 (0.97) | 0.91 (0.96) | 13.97 (37.41) | 88 (93.82) |
| Rao | HUVEC | 9 (95) | 1 (1) | 0 (1) | 0 (10.47) | 0 (100) | 0 (1) | 0 (1) | 0 (19.35) | 0 (100) |
| Rao | IMR90 | 144 (554) | 1 (1) | 1 (0.99) | 4.8 (8.98) | 100 (100) | 1 (0.99) | 1 (0.98) | 6.25 (9.35) | 100 (93.1) |
| Rao | K562 | 47 (638) | 1 (1) | 1 (1) | 10.81 (17.35) | 100 (100) | 1 (1) | 1 (1) | 12.82 (27.92) | 100 (100) |
| Rao | KBM7 | 8 | NA (NA) | NA (NA) | NA (NA) | NA (NA) | 1 | NA (NA) | NA (NA) | NA (NA) |
| Rao | NHEK | 0 (0) | NA (NA) | NA (NA) | NA (NA) | NA (NA) | NA (NA) | NA (NA) | NA (NA) | NA (NA) |
| Jin | IMR90 | 1167 (5303) | 0.9 (0.93) | 0.84 (0.87) | 34.86 (34.97) | 62.16 (70.75) | 0.77 (0.68) | 0.73 (0.66) | 37.66 (49.11) | 52.98 (40.92) |
| Li | K562 | 2916 (33449) | 0.8 (0.86) | 0.75 (0.78) | 30.98 (41.9) | 50.92 (53.62) | 0.86 (0.67) | 0.75 (0.65) | 26.43 (57.26) | 44.13 (38.73) |
| Li | MCF7 | 2190 | 0.89 | 0.83 | 25.15 | 66.76 | 0.86 | 0.75 | 22.59 | 57.41 |

In the header row, “multiple” means the enhancers (or promoters) with multiple interacting promoters (or enhancers). “All” means all enhancers (or promoters). When two numbers are in an entry, the number in parentheses is from the ChromHMM enhancers.

First, the IEPs were studied based on the looplists from Rao et al. [8], with the annotated FANTOM

enhancers [1] and the GENCODE promoters [18] (Figure 2-4A). The BCC of enhancers was no

smaller than 0.97 in all cell lines with enough IEPs (Table 2-4). The average BCC was then

calculated for only the enhancers interacting with more than one gene. In this case too, the average

BCC was no smaller than 0.96 in all the cell lines. The high BCC values suggest that two enhancers

are likely to interact with either the same set or two disjoint sets of target genes. In other words,

the target genes of any pair of enhancers usually are either the same or completely different.

To assess the statistical significance of the above observation, the BCC values of the enhancers

were studied in randomly generated IEPs (Table 2-5). These random IEPs were constructed using

the same set of enhancers and promoters but randomized interactions. Given an enhancer and its

number of interacting promoters from the original IEP set of a cell line, the same number of

promoters were randomly chosen from the active promoters in the cell line, so that the number of

interactions of every enhancer remains the same in both the original and the random IEP sets. Five

different sets of random IEPs were generated in this way with five different random

Table 2-5: BCC statistics for enhancers. The BCC of the enhancers in the real IEPs is shown for different samples. The BCC of the enhancers in random IEPs is also shown, along with the p-values (in parentheses) of the nonparametric statistical test supporting the difference between the BCC values in real and random IEPs. All the statistics are shown for both “all” enhancers and the enhancers interacting with “multiple” promoters.

Columns: Experiments; Cell line; IEPs; Enhancers; BCC of enhancers (all); BCC of enhancers (multiple); BCC of enhancers in random IEPs with p-value (all); BCC of enhancers in random IEPs with p-value (multiple)

FANTOM Gencode Rao looplist

GM12878 294 229 0.97 0.96 0.51 (0) 0.34 (0)

HELA 11 10 1 0 0 (0) 0 (NA)

HMEC 260 201 0.97 0.96 0.37 (0) 0.17 (0)

HUVEC 9 9 1 0 0 (0) 0 (NA)

IMR90 144 125 1 1 0.33 (0) 0.2 (0)

K562 47 37 1 1 0 (0) 0 (0)

KBM7 8 5 0 0 0 (NA) 0 (NA)

NHEK 0 0 NA NA NA (NA) NA (NA)

FANTOM Gencode Rao cutoff 400 GM12878 902 783 0.97 0.85 0.78 (0) 0.38 (0)

FANTOM Gencode Rao cutoff 300 GM12878 1138 974 0.95 0.82 0.76 (0) 0.39 (0)

FANTOM Gencode Rao cutoff 200 GM12878 2695 2091 0.9 0.74 0.62 (0) 0.36 (0)

FANTOM Gencode Rao cutoff 150 GM12878 4184 3002 0.88 0.74 0.56 (0) 0.34 (0)

FANTOM Gencode Rao cutoff 100

GM12878 7527 4488 0.81 0.7 0.43 (0) 0.28 (0)

HMEC 313 277 0.93 0.67 0.53 (0) 0.07 (0)

HUVEC 43 41 0.92 0.5 0 (0) 0 (NA)

IMR90 525 468 0.96 0.72 0.83 (0) 0.42 (0)

K562 506 440 0.96 0.83 0.8 (0) 0.39 (0)

KBM7 1465 1308 0.94 0.7 0.84 (0) 0.43 (0)

NHEK 211 200 0.95 0.5 0.49 (0) 0.3 (NA)

FANTOM Gencode Rao cutoff 50

GM12878 19623 7599 0.73 0.66 0.25 (0) 0.19 (0)

HMEC 854 702 0.94 0.85 0.68 (0) 0.4 (0)

HUVEC 254 237 0.95 0.81 0.58 (0) 0.1 (0)

IMR90 1643 1319 0.91 0.75 0.66 (0) 0.39 (0)

K562 1734 1368 0.89 0.73 0.64 (0) 0.39 (0)

KBM7 4033 3274 0.9 0.74 0.7 (0) 0.37 (0)

NHEK 462 407 0.92 0.69 0.78 (0) 0.4 (0)

FANTOM Gencode Rao cutoff 30

GM12878 29348 8670 0.71 0.65 0.48 (0) 0.47 (0)

HMEC 1786 1451 0.92 0.78 0.83 (0) 0.53 (0)

HUVEC 582 518 0.95 0.81 0.9 (0) 0.52 (0)

IMR90 3077 2235 0.87 0.73 0.76 (0) 0.54 (0)

K562 2872 2021 0.85 0.71 0.74 (0) 0.52 (0)

KBM7 7047 5564 0.88 0.72 0.81 (0) 0.52 (0)

NHEK 1011 885 0.93 0.76 0.88 (0) 0.52 (0)

ChromHMM Gencode Rao looplist

GM12878 2384 1914 0.99 0.96 0.67 (0) 0.39 (0)

HELA 37 32 1 1 0.1 (0) 0.1 (0)

HMEC 2558 1907 0.99 0.98 0.59 (0) 0.36 (0)

HUVEC 95 86 1 1 0.22 (0) 0.17 (0)

IMR90 554 490 1 0.99 0.77 (0) 0.45 (0)

K562 638 536 1 1 0.74 (0) 0.44 (0)

NHEK 0 0 NA NA NA (NA) NA (NA)

ChromHMM Gencode Rao cutoff 400 GM12878 11097 9343 0.93 0.78 0.75 (6.72E-12) 0.42 (0)

ChromHMM Gencode Rao cutoff 300 GM12878 14846 12347 0.92 0.78 0.74 (1.52E-11) 0.42 (0)

ChromHMM Gencode Rao cutoff 200 GM12878 33072 24664 0.81 0.67 0.64 (1.17E-11) 0.37 (0)

ChromHMM Gencode Rao cutoff 150 GM12878 51174 34925 0.8 0.67 0.57 (0) 0.34 (0)

ChromHMM Gencode Rao cutoff 100

GM12878 89712 51676 0.74 0.64 0.46 (0) 0.29 (0)

HMEC 4081 3635 0.94 0.76 0.81 (0) 0.41 (0)

HUVEC 499 458 0.98 0.86 0.85 (0) 0.48 (0)

IMR90 2415 2118 0.97 0.88 0.78 (0) 0.41 (0)

K562 8062 6835 0.93 0.76 0.75 (0) 0.42 (0)

NHEK 3291 3028 0.96 0.75 0.86 (0) 0.44 (0)

ChromHMM Gencode Rao cutoff 50

GM12878 231522 88850 0.64 0.6 0.27 (0) 0.19 (0)

HMEC 11191 9131 0.92 0.78 0.69 (0) 0.39 (0)

HUVEC 3396 3073 0.96 0.8 0.83 (0) 0.44 (0)

IMR90 7270 5765 0.93 0.79 0.67 (1.73E-12) 0.39 (0)

K562 28590 21084 0.86 0.7 0.63 (0) 0.37 (0)

NHEK 7017 6103 0.94 0.77 0.78 (0) 0.43 (0)

Jin IMR90 50800 44239 0.94 0.79 0.81 (0) 0.44 (0)

FANTOM Gencode Jin IMR90 1167 743 0.9 0.84 0.51 (0) 0.33 (0)

ChromHMM Gencode Jin IMR90 5303 3383 0.93 0.87 0.53 (0) 0.32 (0)

FANTOM Gencode Chiapet K562 2916 1585 0.8 0.75 0.41 (0) 0.28 (0)

MCF7 2190 1471 0.89 0.83 0.55 (0) 0.35 (0)

ChromHMM Gencode Chiapet K562 33449 19550 0.86 0.78 0.46 (1.74E-11) 0.3 (0)

FANTOM Gencode Javierre

Ery 74 44 1 1 0.41 (0) 0.33 (0)

Mac0 88 59 0.98 0.94 0.51 (0) 0.37 (0)

Mac1 215 144 1 1 0.54 (0) 0.37 (0)

Mac2 112 75 0.99 0.96 0.53 (0) 0.34 (0)

MK 100 65 0.96 0.9 0.52 (0) 0.34 (0)

Mon 139 82 1 1 0.43 (0) 0.32 (0)

nCD4 86 58 1 1 0.52 (0) 0.35 (0)

nCD8 84 55 1 1 0.5 (0) 0.36 (0)

Neu 178 109 1 1 0.45 (0) 0.32 (0)

ChromHMM Gencode Javierre

Ery 4484 2471 0.98 0.98 0.42 (0) 0.3 (0)

Mac0 2003 1097 0.99 0.99 0.41 (0) 0.29 (0)

Mac1 4867 2996 0.97 0.96 0.49 (0) 0.33 (0)

Mac2 3733 2298 0.99 0.99 0.49 (0) 0.33 (0)

MK 2629 1744 0.99 0.98 0.55 (0) 0.35 (0)

Mon 2483 1547 0.96 0.94 0.49 (0) 0.34 (0)

nCD4 2975 1546 0.99 0.99 0.39 (0) 0.28 (0)

nCD8 2774 1623 0.98 0.97 0.46 (0) 0.31 (0)

Neu 4661 2739 0.99 0.98 0.46 (0) 0.32 (0)

FANTOM Gencode SPRITE cutoff 100 GM12878 38 28 1 1 0.2 (0) 0 (0)

FANTOM Gencode SPRITE cutoff 50 GM12878 497 317 0.92 0.8 0.46 (0) 0.35 (0)

FANTOM Gencode SPRITE cutoff 30 GM12878 3381 2151 0.92 0.84 0.45 (0) 0.3 (0)

ChromHMM Gencode SPRITE cutoff 100

GM12878 622 453 0.99 0.97 0.56 (0) 0.3 (0)

ChromHMM Gencode SPRITE cutoff 50 GM12878 4794 3213 0.95 0.89 0.5 (0) 0.32 (0)

ChromHMM Gencode SPRITE cutoff 30 GM12878 36027 21870 0.9 0.81 0.48 (1.22E-11) 0.3 (0)

seeds. These random IEPs had barely a handful of enhancers that shared promoters with other enhancers in any of the eight cell lines, suggesting that it is not by chance that multiple enhancers interact with a common set of target genes in the looplists of Rao et al. The number of IEPs was too small to calculate BCC for four of the eight cell lines. For the other four cell lines, where the BCC could be calculated, the BCC values of enhancers were 0.51, 0.37, 0.33 and 0, respectively, which were much smaller than the BCC of enhancers in the above sets of real IEPs (p-value=0, Table 2-5). When only the enhancers interacting with multiple genes were considered, the BCC values were no larger than 0.34 for random IEPs, while they were no smaller than 0.96 for the real IEPs. These observations suggest that the BCC of enhancers being close to 1 was not by chance (Table 2-5).
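The randomization described above, in which every enhancer keeps its number of interactions but receives randomly chosen active promoters, can be sketched as follows. The function name, the seed handling and the toy data are hypothetical; the sketch assumes that the input IEPs contain no duplicate pairs.

```python
import random

def random_ieps(ieps, active_promoters, seed):
    """Rewire IEPs so that every enhancer keeps its number of promoters.

    For an enhancer with d interactions, d distinct promoters are drawn
    at random from the active promoters of the cell line.
    """
    rng = random.Random(seed)
    degree = {}
    for enhancer, _ in ieps:
        degree[enhancer] = degree.get(enhancer, 0) + 1
    pool = sorted(active_promoters)  # fixed order keeps runs reproducible
    randomized = []
    for enhancer in sorted(degree):
        for promoter in rng.sample(pool, degree[enhancer]):
            randomized.append((enhancer, promoter))
    return randomized

# Toy data (invented for illustration only).
real = [("E1", "P1"), ("E1", "P2"), ("E2", "P1")]
pool = {"P1", "P2", "P3", "P4"}
shuffled = random_ieps(real, pool, seed=1)
print(shuffled)
```

Calling the function with several different seeds yields independent random IEP sets, and the promoter-side randomization follows by swapping the roles of enhancers and promoters.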

Second, the IEPs defined by the contact matrices from Rao et al. were studied with different cutoffs in the seven cell lines (Chapter 2.2.2.2). Compared with the IEPs from the looplists, these IEPs were likely to include many more bona fide interactions and more false positives as well. Under the cutoffs 30, 50 and 100, the BCC of the enhancers in all the seven cell lines except GM12878 was no smaller than 0.85, 0.89 and 0.92, respectively (Table 2-5). Since GM12878 had a much higher sequencing depth than the other cell lines, it was understandable that a cutoff that is stringent for other cell lines could still be loose for GM12878. Thus, the cutoffs 150, 200, 300, and 400 were also tried for GM12878. Among these four cutoffs, 400 was the most reasonable, since the number of IEPs in GM12878 defined at this cutoff was similar to that in other cell lines defined at the cutoff 100 (Table 2-5). So, the cutoff 400 was chosen for GM12878 and the cutoff 100 was chosen for the other cell lines. With the cutoff 400, the BCC of enhancers was 0.97 in GM12878. Note that in HMEC, HUVEC, KBM7 and NHEK, the BCC of enhancers was no smaller than 0.92 even under the cutoff 100. Moreover, the BCC of enhancers increased with more stringently defined IEPs, suggesting that the BCC of enhancers is close to 1 if it is not 1 (Table 2-5).

In order to assess the statistical significance of the observed BCC of enhancers in IEPs from different cutoffs, the above BCC values of enhancers were similarly compared with those from randomly generated IEPs (Table 2-5). Again, for every cutoff in every cell line, the BCC of enhancers for random IEPs was much smaller than the BCC of enhancers for real IEPs (p-value=0). For instance, under the cutoff 50, the BCC of enhancers was no larger than 0.78 for random IEPs, while the corresponding number was no smaller than 0.89 for real IEPs in the cell lines other than GM12878. When only the enhancers interacting with multiple target genes were considered, the BCC of the enhancers for random IEPs was smaller than that for real IEPs by about a factor of two. For instance, under the cutoff 50, the largest BCC value was 0.40 for random IEPs, while the smallest BCC value for real IEPs was 0.69 in the cell lines other than GM12878.

Third, to see how this observation might change if the data from other labs or other experimental protocols were used, the IEPs from four additional studies were analyzed (Chapter 2.2.2.2, Figure 2-4A) [20, 21, 32, 52]. When the BCC of the enhancers was calculated using the IEPs defined by Jin et al. [20], it was 0.94 on average. For the processed IEPs from Jin et al. based on the FANTOM enhancers and the promoters annotated by GENCODE, it was 0.90. For the ChIA-PET datasets [21], it was 0.80 in K562 and 0.89 in MCF7 (Table 2-4). For the nine cell types from Javierre et al. [32], it was no smaller than 0.96 in all cell types. For the SPRITE data from Quinodoz et al. [52], it was 0.92, 0.92 and 1 for the cutoffs 30, 50 and 100, respectively (Table 2-5). Although the IEPs were from different labs and from different experimental procedures, in all cases, the BCC of enhancers was no smaller than 0.80 and the majority of enhancers interacting with multiple promoters had their individual BCCs larger than 0.90, suggesting that the BCC of enhancers is likely to be 1 in these samples. Again, for the corresponding randomly generated IEPs for these datasets, the BCC value was 0.48 on average, much smaller than the corresponding value of 0.96 from the original IEPs (p-value=0, Table 2-5).

Finally, the above analyses were repeated with the ChromHMM enhancers instead of the FANTOM enhancers, because the number of the FANTOM enhancers was smaller than that of the ChromHMM enhancers [16]. The observations were similar in all cases, showing that the BCC of enhancers for the IEPs in a cell line was close to 1 (Table 2-4, Table 2-5). For instance, for IEPs based on the looplists, it was almost a perfect 1 in all cell lines. For the Hi-C data from Rao et al. under the cutoff 400 for GM12878 and 100 for the other cell lines, it was no smaller than 0.93. For the Hi-C data from Jin et al. [20], it was 0.93. For the ChIA-PET data from Li et al. [21], it was 0.86. For the nine cell types from Javierre et al. [32], it was no smaller than 0.96. For the SPRITE data on the GM12878 cell line [52], the BCC values were 0.9, 0.95 and 0.99 for the cutoffs 30, 50 and 100, respectively. In almost all cases, the majority of enhancers with multiple promoters had their individual BCCs larger than 0.90.

In summary, the BCC values of the enhancers were likely to be close to 1 for different sets of IEPs,

data from different labs, different experimental protocols, different cell lines and cell types, and

different enhancer sets. The analyses based on IEPs from different cutoffs suggest that the BCC of

enhancers is quite robust, although it is smaller when more loosely defined IEPs are used. It is

close to 1 or becomes 1 when the IEPs are defined with more stringent criteria (with fewer false

positive IEPs). These analyses suggest that the observation may be an intrinsic property of

enhancers. That is, if two enhancers interact with one common gene, they are likely to interact

with all of their target genes.

2.2.3.2 Two Target Genes Tend to Interact with Exactly the Same Set or Two Completely Different Sets of Enhancers

The BCC of promoters in each set of the aforementioned IEPs was also studied to see if a similar observation can be made for the promoters. The results showed that the BCC of promoters was likely to be 1 as well, although this was not as strongly evident as for the BCC of enhancers in certain cases.

First, the BCC values of the promoters were studied with the IEPs based on the looplists [8]. The

BCC values were close to 1 on average, for both the FANTOM and ChromHMM enhancers (Table

2-4). The BCC values of the promoters were then studied in randomly simulated IEP datasets. The

random IEP set consisted of the same sets of enhancers and promoters, but the enhancers were

randomly selected to interact with the promoters so that every promoter had the same number of

interacting enhancers as it had in the original set of IEPs. The BCC of promoters was 0.52 at best

in any cell line in these random datasets, suggesting that it was not by chance that the BCC of

promoters was close to 1 in all cell lines (Table 2-6).

Second, the BCC values of the promoters were studied for the IEPs defined with different cutoffs

[8] (Table 2-6). When the FANTOM enhancers were used, the BCC of promoters was often close

to 1. For instance, with the cutoff 400 for GM12878 and the cutoff 100 for other cell lines, the BCC of promoters was no smaller than 0.91 in all the cell lines except HUVEC, where it was 0.75 (Table 2-6). For different cutoffs, it was usually

no smaller than the BCC of enhancers, which was close to 1 in most cases. When the ChromHMM

enhancers were used, however, the values were not as high as those from the FANTOM enhancers.

For instance, with the cutoff 400 for GM12878 and the cutoff 100 for other cell lines, the BCC of

promoters varied from 0.64 to 0.91 in different cell lines. The BCC values got smaller with smaller

cutoffs, which might be due to the much lower quality of the enhancers predicted by ChromHMM

compared with the experimentally defined FANTOM ones.

Although the BCC of the promoters was not as large as the BCC of enhancers when the

ChromHMM enhancers were used, the actual BCC of promoters could also be close to 1. This was

Table 2-6: BCC statistics for promoters. The BCC of the promoters in real IEPs is shown for different samples. The BCC of the promoters in random IEPs is also shown, along with the p-values (in parentheses) of the nonparametric statistical test supporting the difference between the BCC values in real and random IEPs. All the statistics are shown for both “all” promoters and the promoters interacting with “multiple” enhancers.

Columns: Experiments; Cell line; IEPs; Promoters; BCC of promoters (all); BCC of promoters (multiple); BCC of promoters in random IEPs with p-value (all); BCC of promoters in random IEPs with p-value (multiple)

FANTOM Gencode Rao looplist

GM12878 294 186 0.97 0.95 0.52 (0) 0.28 (0)

HELA 11 8 1 0 0 (0) 0 (NA)

HMEC 260 179 0.96 0.91 0.52 (0) 0.27 (0)

HUVEC 9 6 0 0 0 (NA) 0 (NA)

IMR90 144 112 1 1 0.48 (0) 0.08 (0)

K562 47 39 1 1 0.1 (0) 0.1 (0)

KBM7 8 8 1 0 0 (0) 0 (NA)

NHEK 0 0 NA NA NA (NA) NA (NA)

FANTOM Gencode Rao cutoff 400 GM12878 902 683 0.95 0.81 0.62 (0) 0.37 (0)

FANTOM Gencode Rao cutoff 300 GM12878 1138 848 0.92 0.78 0.57 (0) 0.33 (0)

FANTOM Gencode Rao cutoff 200 GM12878 2695 1663 0.83 0.7 0.43 (0) 0.29 (0)

FANTOM Gencode Rao cutoff 150 GM12878 4184 2292 0.81 0.7 0.38 (0) 0.25 (0)

FANTOM Gencode Rao cutoff 100

GM12878 7527 3475 0.76 0.66 0.32 (0) 0.21 (0)

HMEC 313 288 0.95 0.7 0.94 (0) 0.17 (0)

HUVEC 43 36 0.75 0.5 0 (0) 0 (NA)

IMR90 525 438 0.93 0.71 0.7 (0) 0.39 (0)

K562 506 404 0.92 0.81 0.68 (0) 0.39 (0)

KBM7 1465 1285 0.92 0.71 0.79 (0) 0.42 (0)

NHEK 211 190 0.91 0.5 0.72 (0) 0.2 (NA)

FANTOM Gencode Rao cutoff 50

GM12878 19623 6631 0.69 0.62 0.23 (0) 0.16 (0)

HMEC 854 719 0.95 0.84 0.75 (0) 0.38 (0)

HUVEC 254 211 0.84 0.63 0.53 (0) 0.21 (0)

IMR90 1643 1232 0.88 0.75 0.62 (0) 0.35 (0)

K562 1734 1218 0.85 0.72 0.54 (0) 0.32 (0)

KBM7 4033 3209 0.89 0.73 0.65 (0) 0.38 (0)

NHEK 462 386 0.89 0.67 0.73 (0) 0.41 (0)

FANTOM Gencode Rao cutoff 30

GM12878 29348 8320 0.66 0.61 0.48 (0) 0.45 (0)

HMEC 1786 1441 0.92 0.76 0.83 (0) 0.49 (0)

HUVEC 582 457 0.91 0.76 0.81 (0) 0.49 (0)

IMR90 3077 2050 0.85 0.72 0.73 (0) 0.52 (0)

K562 2872 1815 0.82 0.69 0.7 (0) 0.49 (0)

KBM7 7047 5304 0.86 0.69 0.79 (0) 0.49 (0)

NHEK 1011 802 0.88 0.75 0.81 (0) 0.51 (0)

ChromHMM Gencode Rao looplist

GM12878 2384 674 0.95 0.93 0.13 (0) 0.12 (0)

HELA 37 17 1 1 0 (0) 0 (0)

HMEC 2558 735 0.97 0.96 0.15 (0) 0.14 (0)

HUVEC 95 31 1 1 0 (0) 0 (0)

IMR90 554 310 0.99 0.98 0.51 (0) 0.24 (0)

K562 638 197 1 1 0.12 (0) 0.12 (0)

NHEK 0 0 NA NA NA (NA) 0 (NA)

ChromHMM Gencode Rao cutoff 400 GM12878 11097 3899 0.66 0.62 0.18 (0) 0.15 (0)

ChromHMM Gencode Rao cutoff 300 GM12878 14846 4777 0.65 0.61 0.16 (0) 0.14 (0)

ChromHMM Gencode Rao cutoff 200 GM12878 33072 7412 0.57 0.53 0.11 (0) 0.1 (0)

ChromHMM Gencode Rao cutoff 150 GM12878 51174 8688 0.56 0.53 0.09 (0) 0.08 (0)

ChromHMM Gencode Rao cutoff 100

GM12878 89712 10080 0.54 0.52 0.06 (0) 0.05 (0)

HMEC 4081 2410 0.74 0.66 0.4 (0) 0.28 (0)

HUVEC 499 283 0.84 0.79 0.26 (0) 0.18 (0)

IMR90 2415 1418 0.91 0.84 0.41 (0) 0.29 (0)

K562 8062 3005 0.64 0.59 0.19 (0) 0.16 (0)

NHEK 3291 1784 0.71 0.64 0.35 (0) 0.26 (0)

ChromHMM Gencode Rao cutoff 50

GM12878 231522 12998 0.49 0.48 0.03 (0) 0.03 (0)

HMEC 11191 5169 0.71 0.65 0.26 (0) 0.21 (0)

HUVEC 3396 1660 0.7 0.65 0.27 (0) 0.21 (0)

IMR90 7270 3540 0.81 0.73 0.29 (0) 0.22 (0)

K562 28590 6604 0.55 0.52 0.12 (0) 0.11 (0)

Page 54: Computational Study of Target Gene Interactions

45

NHEK 7017 2851 0.69 0.64 0.22 (0) 0.18 (0)

Jin IMR90 50800 8117 0.11 0.11 0.09 (0) 0.08 (0)

FANTOM Gencode Jin IMR90 1167 401 0.77 0.73 0.23 (0) 0.17 (0)

ChromHMM Gencode Jin IMR90 5303 617 0.68 0.66 0.07 (0) 0.06 (0)

FANTOM Gencode Chiapet K562 2916 1869 0.86 0.75 0.52 (0) 0.31 (0)

MCF7 2190 1195 0.86 0.75 0.43 (0) 0.25 (0)

ChromHMM Gencode Chiapet K562 33449 6439 0.67 0.65 0.11 (0) 0.1 (0)

FANTOM Gencode Javierre

Ery 74 64 1 1 0.79 (0) 0.44 (0)

Mac0 88 64 0.98 0.95 0.59 (0) 0.36 (0)

Mac1 215 153 1 1 0.6 (0) 0.35 (0)

Mac2 112 85 0.98 0.96 0.64 (0) 0.38 (0)

MK 100 81 0.98 0.89 0.73 (0) 0.38 (0)

Mon 139 94 1 1 0.57 (0) 0.31 (0)

nCD4 86 64 1 1 0.63 (0) 0.39 (0)

nCD8 84 67 1 1 0.68 (0) 0.42 (0)

Neu 178 137 1 1 0.66 (0) 0.39 (0)

ChromHMM Gencode Javierre

Ery 4484 539 0.93 0.92 0.07 (0) 0.06 (0)

Mac0 2003 268 0.97 0.97 0.07 (0) 0.07 (0)

Mac1 4867 658 0.91 0.9 0.07 (0) 0.07 (0)

Mac2 3733 474 0.95 0.94 0.07 (0) 0.06 (0)

MK 2629 402 0.92 0.92 0.09 (0) 0.07 (0)

Mon 2483 330 0.91 0.9 0.08 (0) 0.07 (0)

nCD4 2975 359 0.97 0.97 0.07 (0) 0.06 (0)

nCD8 2774 339 0.93 0.93 0.07 (0) 0.06 (0)

Neu 4661 596 0.96 0.96 0.07 (0) 0.06 (0)

FANTOM Gencode SPRITE cutoff 100 GM12878 38 25 1 1 0 (0) 0 (0)

FANTOM Gencode SPRITE cutoff 50 GM12878 497 239 0.92 0.84 0.33 (0) 0.2 (0)

FANTOM Gencode SPRITE cutoff 30 GM12878 3381 1523 0.89 0.82 0.29 (0) 0.2 (0)

ChromHMM Gencode SPRITE cutoff 100 GM12878 622 94 0.96 0.95 0.02 (0) 0.02 (0)

ChromHMM Gencode SPRITE cutoff 50 GM12878 4794 663 0.85 0.84 0.06 (0) 0.06 (0)

ChromHMM Gencode SPRITE cutoff 30 GM12878 36027 4210 0.71 0.7 0.06 (0) 0.05 (0)

because the computationally predicted ChromHMM enhancers might result in predicting false

interactions and thus a low BCC of the promoters. Moreover, the BCC of the promoters was always

increasing with more and more stringently defined IEPs. For example, although the BCC of the

promoters was not close to 1 at the cutoff 100, it got closer to 1 when the looplists defined by Rao

et al. were considered. In addition, the BCC of promoters for random IEPs in every cell line and

under every cutoff was much smaller than that for the real IEPs, indicating that the observed much

larger BCC of promoters was not by chance (Table 2-6).

Third, the BCC values of the promoters were analyzed for IEPs from other studies (Figure 2-4A,

Table 2-4 and Table 2-6) [20, 21, 32, 52]. For the original IEPs from Jin et al., it was 0.11.

However, when the IEPs were defined from the overlap of these original IEPs with the GENCODE

promoters and the two types of enhancers, it was 0.77 and 0.68, respectively (Table 2-4). The low


BCC of the promoters for the original IEPs may be partially due to the promoters Jin et al. used,

which had 11,313 promoters inferred by Jin et al., compared to the 57,820 promoters annotated by

GENCODE. In terms of the ChIA-PET data, when the FANTOM enhancers were used, the BCC

of the promoters was 0.86 in K562 and 0.86 in MCF7; when the ChromHMM enhancers were

used, it was 0.67 in K562. ChromHMM did not have annotated enhancers in MCF7. For the nine

cell types from Javierre et al., it was no smaller than 0.98 and 0.91 when the FANTOM enhancers

and the ChromHMM enhancers were used, respectively. For the SPRITE data on the GM12878

cell line, the BCC values of the promoters were no smaller than 0.89 and 0.71 in the IEPs defined

with the FANTOM and ChromHMM enhancers, respectively. Overall, the BCC of the promoters was not as large as the BCC of the enhancers. However, given the imperfectness of all these collected IEPs, the fact that the majority of the promoters interacting with multiple enhancers had an individual BCC larger than 0.90, and the fact that these values were much larger than the corresponding BCC of the promoters for random IEPs (Table 2-6), the BCC of the promoters was likely to be close to 1 as well. In other words, a gene usually interacts either with all the enhancers of another gene or with a set of enhancers completely disjoint from those of that second gene.


Figure 2-5: Clusters of enhancers with Hi-C reads. All ChromHMM active enhancer clusters in GM12878 within the region Chr1:161060000-161175000 are shown. A total of five clusters belong to this region. The bottom half of the figure shows the five enhancer clusters (grey, yellow, green, purple and brown on the two sides) interacting with the common gene promoter regions (in the middle), arranged from left to right according to their relative genomic locations. The top half of the figure shows the same interactions of the five clusters (same color codes) with Hi-C reads. For example, the yellow cluster of enhancers interacts with the NIT1 and PFDN2 gene promoters with 687 Hi-C reads. The unmarked enhancer (blue) and gene promoter (UFC1) did not belong to any cluster. The locations of the enhancers relative to each other and to the target genes are shown in the middle.


2.2.3.3 Enhancers Form Clusters that Have Special Characteristics

Since the BCC of the enhancers is close to 1, the enhancers can be organized into clusters, where

every enhancer in the same cluster is likely to interact with the same set of target genes. Thus, in

each IEP set, an enhancer graph was built by connecting the enhancers that share at least one

common target (Chapter 2.2.2.5, Figure 2-5). Here, only the looplists and the IEPs obtained from

the most stringent cutoff (400 in GM12878 and 100 in other cell lines) were considered to obtain

enhancer clusters, as they were more reliable than other sets of IEPs.
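The graph construction described above can be sketched in a few lines of Python. This is a minimal illustration, not the dissertation's implementation: it assumes that a cluster is a connected component of the enhancer graph and that enhancers sharing no target with any other enhancer are left out. The function name `enhancer_clusters` and the `(enhancer_id, promoter_id)` input format are hypothetical.

```python
from collections import defaultdict


def enhancer_clusters(ieps):
    """Group enhancers that share at least one target promoter.

    `ieps` is an iterable of (enhancer_id, promoter_id) pairs (assumed format).
    Returns a sorted list of clusters, each a sorted list of enhancer ids.
    Assumption: a cluster is a connected component of the enhancer graph.
    """
    parent = {}

    def find(x):
        # Union-find lookup with path halving.
        parent.setdefault(x, x)
        while parent[x] != x:
            parent[x] = parent[parent[x]]
            x = parent[x]
        return x

    def union(a, b):
        root_a, root_b = find(a), find(b)
        if root_a != root_b:
            parent[root_b] = root_a

    # Collect the enhancers that interact with each promoter.
    by_promoter = defaultdict(list)
    for enhancer, promoter in ieps:
        find(enhancer)
        by_promoter[promoter].append(enhancer)

    # Enhancers sharing a promoter are connected in the enhancer graph.
    for enhancers in by_promoter.values():
        for other in enhancers[1:]:
            union(enhancers[0], other)

    groups = defaultdict(set)
    for enhancer in parent:
        groups[find(enhancer)].add(enhancer)

    # Assumption: singletons (no shared target) are not reported as clusters.
    return sorted(sorted(group) for group in groups.values() if len(group) > 1)
```

For example, with the pairs `e1-p1, e2-p1, e2-p2, e3-p2`, the enhancers `e1`, `e2` and `e3` end up in one cluster because `e2` links the other two through its two targets.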

In total, 1 to 2,134 clusters were generated in different cell lines. The number of clusters in a cell line

and across different cell lines varied dramatically, depending on the IEPs and the enhancers used.

When the ChromHMM enhancers were used, there were many more clusters and 67% to 96% of

all enhancers in a cell line were included in the clusters. When the FANTOM enhancers were used,

fewer clusters were identified and about 16% to 67% of the total enhancers in a cell line were

found in the clusters. The average number of enhancers in a cluster varied from 2 to 5 in different

cell lines. Enhancers in the majority of clusters interacted with only one gene, while on average,

the enhancers in 18.36% of clusters interacted with at least two different genes.


Next, the distance between the consecutive enhancers in a cluster, the distance between their

consecutive targets and the distance between enhancers and their target genes were studied (Figure

2-6). On average, about 84% of the enhancers in a cluster were within 10 kbps. However, there

was a small fraction of enhancers in a cluster that were more than 50 kbps away from each other.

For instance, when the looplists and the FANTOM enhancers were considered, more than 8% of the enhancers in a cluster were more than 50 kbps away from each other in GM12878,

HMEC and IMR90. Although the enhancers in a cluster were often close to each other, their

distances to each other were not significantly smaller than the distances of random enhancer pairs

(almost all p-values>0.2). In terms of the target genes, the majority of them were within 10 kbps of each other, with a small fraction far from each other. For instance, in GM12878, HMEC and IMR90, when the looplists and the FANTOM enhancers were considered, 25.93%, 21.43% and 33.33% of the target genes of an enhancer cluster were more than 50 kbps away from each other, respectively. It is also worth pointing out that the enhancers in a cluster were normally consecutive and active enhancers while their target genes were normally not consecutive. In all cell lines, on average, more than 90% of the enhancers in a cluster were consecutive active enhancers while fewer than 17% of the target genes of an enhancer cluster were consecutive.

Figure 2-6: The distance distribution between consecutive enhancers in the same cluster for each cell line. The X-axis represents the distance and the Y-axis represents the average percentage of consecutive enhancer pairs in an enhancer cluster.

Since the enhancers in a cluster were consecutive in the genome and the majority of enhancers in

a cluster were close to each other, they seemed like the super-enhancers. Hence, the enhancer

clusters were compared with known super-enhancers in terms of their locations. On average,

29.77% of enhancer clusters overlapped with the corresponding super-enhancers in a cell line

while the majority of enhancer clusters did not overlap with the known super-enhancers (Figure 2-7A); these non-overlapping clusters may represent new super-enhancers. On the other hand, a large proportion of known super-enhancers did not overlap with the enhancer clusters in the corresponding cell lines (Figure 2-7B). Interestingly, when a super-enhancer overlapped an enhancer cluster, more than 80% of the genomic regions that contain all the enhancers in this enhancer cluster were within this super-enhancer.

Figure 2-7: The overlap of the enhancer clusters with the super-enhancers. (A) The percentage of the enhancer clusters overlapping with the super-enhancers. (B) The percentage of the super-enhancers overlapping with the enhancer clusters.
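The two percentages in Figure 2-7 could be computed along the following lines. This is only a sketch under stated assumptions: each enhancer cluster is summarized by one `(chrom, start, end)` span covering all of its enhancers, intervals are half-open, and any shared base counts as an overlap. The exact overlap criterion used in the study is not given in this section, and the function names are hypothetical.

```python
def overlaps(a, b):
    """True if two (chrom, start, end) half-open intervals share at least one base."""
    return a[0] == b[0] and a[1] < b[2] and b[1] < a[2]


def percent_overlapping(regions, others):
    """Percentage of `regions` that overlap at least one interval in `others`."""
    if not regions:
        return 0.0
    hits = sum(any(overlaps(region, other) for other in others) for region in regions)
    return 100.0 * hits / len(regions)


# Hypothetical toy data: cluster spans and super-enhancers in one cell line.
cluster_spans = [("chr1", 100, 200), ("chr1", 500, 600), ("chr2", 100, 200)]
super_enhancers = [("chr1", 150, 300), ("chr2", 300, 400)]

clusters_hit = percent_overlapping(cluster_spans, super_enhancers)  # Figure 2-7A style
supers_hit = percent_overlapping(super_enhancers, cluster_spans)    # Figure 2-7B style
```

Calling the same function with the arguments swapped gives the two directions of the comparison, which is why the two panels can report very different percentages.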

The locations of the enhancers in a cluster were also compared with TADs. The enhancers in a

cluster were usually within the same TAD, with no smaller than 98.08% of enhancers in a cluster

within a TAD in every cell line, independent of IEPs and enhancers used. In most of the cell lines,

for all clusters, all the enhancers in a cluster were within a TAD. The slight deviation from the

100% was mostly for the ChromHMM enhancers, which may be due to the imperfectness of either

the computationally predicted enhancers, IEPs, or TADs. The percentage was always 100% in

almost all the cell lines when the FANTOM enhancers were used.
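One way to check the TAD containment of a cluster is sketched below. It assumes that an enhancer belongs to a TAD only when it lies entirely inside it, and it reports the share of a cluster's enhancers that fall in the single most common TAD. Both choices are assumptions made for illustration, and the names `tad_of` and `share_in_one_tad` are hypothetical.

```python
from collections import Counter


def tad_of(region, tads):
    """Index of the first TAD that fully contains `region`, or None."""
    chrom, start, end = region
    for index, (tad_chrom, tad_start, tad_end) in enumerate(tads):
        if chrom == tad_chrom and tad_start <= start and end <= tad_end:
            return index
    return None


def share_in_one_tad(cluster, tads):
    """Percentage of enhancers in `cluster` lying in its most common TAD."""
    assigned = [tad_of(enhancer, tads) for enhancer in cluster]
    counts = Counter(index for index in assigned if index is not None)
    if not counts:
        return 0.0
    return 100.0 * counts.most_common(1)[0][1] / len(cluster)


# Hypothetical toy data: two TADs and a cluster of four enhancers.
toy_tads = [("chr1", 0, 1000), ("chr1", 1000, 2000)]
toy_cluster = [("chr1", 100, 200), ("chr1", 300, 400), ("chr1", 1100, 1200), ("chr1", 500, 600)]
toy_share = share_in_one_tad(toy_cluster, toy_tads)
```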

The enhancer clusters were compared between different cell lines as well. On average, no more

than 12% of enhancer clusters were identified in two cell lines. Moreover, the percentage was smaller

for IEPs using looplists than the IEPs using the contact matrices with different cutoffs, implying

that the looplists were too strict to include many bona fide IEPs. The small percentage of the shared

enhancer clusters suggested that most enhancer clusters were cell-specific, which is consistent with

the properties of super-enhancers [47, 48].

2.2.4 Discussion

We observed that two enhancers either do not share any target gene or share almost all of their

target genes. This observation was true when different sets of IEPs, two sets of enhancers, and a


variety of cell lines and cell types were considered. Moreover, the BCC of enhancers became closer

and closer to 1 when the criteria to define IEPs became more and more stringent. In addition, the

same observation did not hold for randomly generated IEPs. These analyses suggested

that the BCC of enhancers in a cell line or a cell type was likely to be close to 1, if not exactly 1.

Similarly, we observed that two promoters were likely to interact with either the same set of

enhancers or two disjoint sets of enhancers. This observation about promoters was not as evident

as that about enhancers. However, it was pervasive in all cases when the FANTOM enhancers

were used. It was also evident when the looplists and the IEPs defined by the most stringent cutoffs

were used. Although it seemed not compelling when the ChromHMM enhancers and the sets of

IEPs that were defined with loose criteria were used, this might be due to the imperfectness of

enhancers and IEPs we had. More importantly, the fact that the BCC of enhancers was close to 1

implied that the BCC of the promoters should be close to 1 as well based on the definition of the

BCC.

The BCC of enhancers being close to 1 suggested that enhancers form clusters to interact with the

target genes. As shown above, these clusters are different from the known enhancer clusters such

as super-enhancers, although they do overlap in certain regions. Enhancers in the clusters here

were likely to interact with the same set of genes, while enhancers in a super-enhancer do not

necessarily interact with multiple target genes. Moreover, the enhancers in a cluster here could be

far from each other while the enhancers in a super-enhancer are quite close to each other.


The BCC of enhancers was not 1 sometimes, which implied that when a group of enhancers

interacts with a set of target genes, the majority of target genes interact with each enhancer in this

group while the rest interact with only a subset of enhancers in this group. We called the former

the fully shared target genes and the latter the partially shared target genes. The percentage of the

partially shared target genes by a group of enhancers varied from 0% to 6.57%. We compared

these two types of target genes in terms of TAD, tissue specificity, and correlations with the

enhancers, with the IEPs from the looplists and the IEPs from the most stringent cutoff (400 in

GM12878 and 100 in other cell lines) (Methods). We did not observe any difference between the

two types of target genes.

In practice, several aspects may prevent the BCC of enhancers and the BCC of promoters from

being 1. First, the resolution of the interaction data limits how accurately IEPs can be obtained. The two interacting regions in the interaction data are often long, around 5 kbps in most of the cases we studied. We defined IEPs by overlapping enhancers and promoters with pairs of interacting regions, which might be prone to errors, given that many known enhancers are much shorter [2, 58]. Second, the imperfectly defined IEPs might have included “false” interactions and thus decreased the BCCs. Third, the enhancers were not perfectly defined either. The FANTOM enhancers are still far from complete while the computationally predicted ChromHMM enhancers may contain many “false” enhancers.

We also studied the functional similarities between the targets of enhancers in the same clusters.

With the GREAT tool [57], we found the cluster targets associated with DNA packaging complex,

DNA binding, nucleosome, immune response etc. (p-value<1e-5). We measured the sequence


similarity of enhancers within clusters in a cell line as well (Methods). We found that the pairs of

enhancers in the same clusters did not share more sequence similarity compared with enhancer

pairs randomly chosen in the same cell lines (p-value>0.5).

There are other measurements to study bipartite graphs. We chose BCC because we intended to

investigate how enhancers (promoters) shared their target genes (enhancers). In this sense, the

BCC value perfectly reflected what we hoped to measure. In the future, we may explore other

measurements to study other characteristics of IEPs. Moreover, we focused on enhancers

interacting with multiple targets. There is no doubt that a proportion of enhancers interact

with only a single target gene. These enhancers and their target genes were not considered here, as

they did not share target genes with each other. In the future, the characteristics of these enhancers

may be worth studying as well.

In a cell line or cell type, both active enhancers and active promoters form their own clusters.

When an enhancer interacts with a promoter, consistent with the transcriptional factories proposed

previously [59, 60], almost all enhancers in the same enhancer cluster interact with almost all

promoters in the corresponding promoter cluster. It is thus important to consider the relationship

among enhancers and among promoters when studying their interactions, which may help improve

our understanding of the distal gene regulation and the chromatin structures.


CHAPTER 3 : STUDY OF MIRNA-MRNA INTERACTIONS

3.1 MDPS: Position-Wise Binding Preference is Important for miRNA Target Site Prediction

3.1.1 Background

MicroRNAs (miRNAs) are small (16 to 28 nucleotides) non-coding RNAs that play an important regulatory role in gene expression pathways. In humans, miRNAs engage in imperfect interactions with target sequences in messenger RNAs (mRNAs) or other non-coding RNAs, such as long non-coding RNAs, transfer RNAs, circular RNAs, etc. [61]. The interactions with mRNAs regulate the expression of the corresponding genes through reduced protein translation or complete degradation of the mRNA [62, 63]. The regulatory involvement of miRNAs in critical gene expression pathways is associated with complex diseases [64].

In humans, miRNA-target interactions are mostly imperfect, consisting of both complementary matches and gaps [62]. Because the miRNA sequence is much shorter than the mRNA transcript sequence and its interactions with target sequences are imperfect, multiple potential miRNA target sites may exist within an mRNA transcript sequence. Many of these sites have not yet been found to be functional and thus are normally treated as negative sites. Because these non-functional negative sites co-exist with the positive sites in the same mRNA transcript, computational methods designed for miRNA-target prediction often suffer from a large number of false positive predictions. To handle this issue, computational tools abide by certain canonical rules of miRNA-target interactions. The canonical rules require that a positive interaction involve extensive bonds between a special area (positions 2 to 8) of the miRNA sequence, called the ‘seed’ region, and a target sequence from the 3’ untranslated region of the mRNA transcript. Later this canonical rule was given a bit of leeway, allowing non-canonical seeds (one mismatch or wobble in the seed region) and binding in the miRNA 3’ regions centered on positions 13-16, along with other features such as target accessibility [65], local AU content [66], folding energy [66, 67], conservation [68], etc. Dozens of target prediction tools, including the most popular ones, focus primarily on these features [67, 69-72].

The advancement of next-generation sequencing (NGS) based technologies has enabled the study of miRNA targets with extensive experimental support. NGS techniques combined with cross-linking and immunoprecipitation (CLIP) allowed direct identification of miRNA targets [73, 74]. The resolution of the CLIP-seq method was increased by the photoactivatable-ribonucleoside-enhanced cross-linking and immunoprecipitation (PAR-CLIP) method [75]. Later, crosslinking, ligation, and sequencing of hybrids (CLASH) experiments were introduced to detect miRNA-target pairs as chimeric reads in NGS data [68]. Moore et al. improved on the CLASH experiments with the covalent ligation of endogenous Argonaute-bound RNAs-CLIP (CLEAR-CLIP) experiments [76]. The CLASH and CLEAR-CLIP experiments ultimately presented transcriptome-wide datasets containing more than 18,000 and 30,000 high-confidence miRNA-target interactions, respectively. Most of the interactions do not follow the established canonical rules of miRNA-target interactions, revealing the prevalence of both seed and non-seed interactions and the diversity of in vivo miRNA targets in mRNA 3’ UTR, 5’ UTR and coding DNA sequence (CDS) regions.

The interactions are of different stability and have different free folding energy (ranging from 1.5

kcal/mol to 32 kcal/mol). With the raw sequence reads from these studies, a number of new tools

have been developed for miRNA target prediction based on the aforementioned features together

with new features learned from NGS data [71, 77-79]. Despite the existence of numerous tools to


predict miRNA targets, almost all the tools still suffer from low precision because of the complex way miRNAs choose their targets in different cells. Since high-throughput experimental approaches are still costly and time-consuming and may not be feasible under certain conditions, computational methods remain the only practical way to address this problem. The low precision of available

computational methods may be partially due to our limited knowledge of the characteristics of

miRNA target sites. Several studies, thus, concentrated on the features of miRNA binding sites.

Among them, a Markov chain based method started to model the base pairings between the entire

mature miRNAs and their targets [80]. Although only two states, the existence and absence of a

matching base pair, were considered in this Markov model, this study demonstrated the value of

considering flexible matching patterns instead of the canonical seed matching when identifying

miRNA target sites.

In this study, Markov models were designed to represent the position-wise pairing information (match, mismatch, bulge, and wobble) of a miRNA from the miRNA-target interactions. Using the models, the importance of the pairing patterns of a miRNA beyond its seed region was evaluated for target prediction. From the model learning, the position-wise pairing patterns of a mature miRNA were identified as a valuable feature for miRNA target site prediction. Also, region-specific correlations between miRNAs were detected in terms of target binding. Finally, a feature named MDPS (Markov model-scored Dynamic Programming algorithm for miRNA target site Selection) was designed that focuses on the miRNA position-wise information from miRNA-target interactions based on the experimental data. Combining MDPS as an additional feature with three existing tools demonstrated the potential contribution of the position-wise pairing information to the precise identification of miRNA target sites.

3.1.2 Materials and Methods

3.1.2.1 Training and Test Data

The miRNA–mRNA interactions reported in the CLASH study were used to design MDPS, as

these interactions provide the miRNA-mRNA sequence pairs with the highest resolution [68].

Using the interactions from this study, two datasets were generated. The first set contained the interactions of 77 miRNAs, each of which interacted with at least 50 targets in the CLASH experiments. This set of interactions was named the ‘target-enriched dataset’. The other set included the interactions of 122 miRNAs, each of which interacted with at least 20 targets with minimum folding energy −15 kcal/mol. This set was termed the ‘energy-filtered dataset’.

For each of the two CLASH interaction sets, 80% of the interactions were randomly chosen as the

training data and the remaining 20% were kept for test purpose. The hyper-parameters for the

scoring model were chosen using 10-fold cross-validation on the training data. The best hyper-parameters were later used to make the prediction on the 20% test data of the corresponding interaction dataset. The scoring model was also applied on an independent experimentally validated miRNA target dataset generated by a CLEAR-CLIP study [76]. This dataset was chosen because, like the CLASH interaction data, it also provides interacting miRNA-target sequence pairs with the highest specificity. The interactions in CLASH and CLEAR-CLIP that did not map to any mRNA transcript from ENSEMBL version 75 were filtered out. The reported miRNA and target sequences were aligned using the RNAhybrid tool [81], as in the CLASH study [68], to obtain the position-wise alignment information of each miRNA sequence. The number of miRNAs and their corresponding targets for these datasets are documented in Table 3-1.

Table 3-1: Training and test datasets.

| Dataset | Total miRNAs | Total targets | Target-enriched miRNAs | Target-enriched targets | Energy-filtered miRNAs | Energy-filtered targets |
|---|---|---|---|---|---|---|
| CLASH | 399 | 18041 | 77 | 15390 | 122 | 16209 |
| CLEAR-CLIP | 451 | 20094 | - | - | - | - |

We randomly selected 80% of the CLASH interactions to train a model using 10-fold cross-validation. We then tested the model on the 20% of the remaining CLASH interactions. We also tested the model on the independent CLEAR-CLIP interactions.
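The 80/20 split followed by 10-fold cross-validation on the training part can be sketched as follows. This is an illustrative outline only: whether the split was stratified per miRNA is not stated here, so the sketch splits the pooled interactions, and the function names and the fixed seed are hypothetical choices.

```python
import random


def split_train_test(interactions, test_fraction=0.2, seed=0):
    """Randomly hold out `test_fraction` of the interactions as test data."""
    rng = random.Random(seed)  # fixed seed only to make the sketch reproducible
    shuffled = list(interactions)
    rng.shuffle(shuffled)
    n_test = round(len(shuffled) * test_fraction)
    return shuffled[n_test:], shuffled[:n_test]


def k_folds(items, k=10):
    """Yield (training, validation) lists for k-fold cross-validation."""
    for fold in range(k):
        validation = items[fold::k]
        training = [item for index, item in enumerate(items) if index % k != fold]
        yield training, validation


# Hypothetical stand-in for a list of miRNA-target interactions.
toy_interactions = list(range(100))
toy_train, toy_test = split_train_test(toy_interactions)
toy_folds = list(k_folds(toy_train))
```

Hyper-parameters would then be chosen on the validation folds only, so that the held-out 20% is touched once, for the final evaluation.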

3.1.2.2 Position-Wise Information with Different States of miRNA–Target Interactions

A Markov model was used to learn the position-wise binding patterns for a given miRNA and its targets. Given a miRNA and one of its target sequences, a position of the miRNA sequence and a position of the target sequence can form the following five possible states in the miRNA-target alignment: match ($M$), mismatch ($N$), G-U wobble match ($W$), bulge in target ($B_x$) and bulge in miRNA ($B_y$) (Figure 3-1).

For every miRNA, a weight matrix $w$ and a transition matrix $t$ were designed with the five possible states mentioned above. The weight matrix describes the probability of a state that a miRNA position prefers. For a miRNA sequence of length $n$, its weight matrix $w$ is a $4 \times n$ matrix, where the rows correspond to the four states $M$, $N$, $W$ and $B_y$, the columns correspond to the positions in the miRNA sequence, and each number in the matrix represents the probability that the corresponding miRNA position prefers the corresponding state. The state $B_x$ does not correspond to any miRNA position and thus was not considered in the weight matrix $w$. The transition matrix $t$ is a $5 \times 5$ matrix that represents the transition probabilities among the five states in a miRNA-target sequence alignment. The miRNA-specific transition and weight matrices were calculated separately for the two training datasets. To create the weight matrix, the number of occurrences of the four states at each miRNA position was counted in all miRNA-target interactions in a dataset. To create the transition matrix, the frequency of occurrence of each transition in the interactions of the miRNA was calculated. A small pseudo-count of 0.0001 was added to every entry of the two matrices. The matrices were normalized column-wise so that the numbers in each column sum to 1. The positions of a miRNA were numbered from start to end in the 5′ to 3′ direction of the miRNA sequence.
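The counting, pseudo-count and normalization steps can be illustrated with a short Python sketch. It is not the original implementation. It assumes each alignment is already given as a list of state labels in miRNA 5′ to 3′ order, with every label except `Bx` consuming one miRNA position. It also assumes that the transition probabilities are normalized over the destination state for each source state, since the orientation of $t$ is not specified in this section. All names are hypothetical.

```python
STATES = ["M", "N", "W", "By", "Bx"]
MIRNA_STATES = ["M", "N", "W", "By"]  # Bx does not occupy a miRNA position
PSEUDO_COUNT = 0.0001


def build_matrices(alignments, mirna_length):
    """Estimate the weight matrix w and transition matrix t for one miRNA.

    `alignments` is a list of state-label lists; each must cover exactly
    `mirna_length` miRNA positions (labels other than "Bx").
    Returns (w, t) where w[state][position] and t[(source, destination)]
    are probabilities.
    """
    w = {state: [PSEUDO_COUNT] * mirna_length for state in MIRNA_STATES}
    t = {(a, b): PSEUDO_COUNT for a in STATES for b in STATES}

    for states in alignments:
        position = 0
        for index, state in enumerate(states):
            if state != "Bx":
                w[state][position] += 1
                position += 1
            if index > 0:
                t[(states[index - 1], state)] += 1

    # Each miRNA position (a column of w) sums to 1 over the four states.
    for position in range(mirna_length):
        total = sum(w[state][position] for state in MIRNA_STATES)
        for state in MIRNA_STATES:
            w[state][position] /= total

    # Assumption: outgoing transition probabilities of each state sum to 1.
    for source in STATES:
        total = sum(t[(source, destination)] for destination in STATES)
        for destination in STATES:
            t[(source, destination)] /= total
    return w, t


# Two hypothetical alignments of a 3-nt toy miRNA.
toy_w, toy_t = build_matrices([["M", "M", "W"], ["M", "Bx", "N", "By"]], 3)
```

The pseudo-count keeps every entry strictly positive, which matters later because the scores are sums of logarithms of these entries.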

Two types of scoring models were designed: miRNA-specific and miRNA-general. In the miRNA-

specific model, the weight and transition matrices were calculated for each miRNA and its targets.

In the miRNA-general model, only one weight matrix and one transition matrix were calculated

using the pairing information of all the miRNAs and their targets within a dataset. In the latter case, the transition and weight matrices were the unweighted average of the respective miRNA-specific matrices within a dataset.

Figure 3-1: Five states in an miRNA-target interaction

3.1.2.3 MDPS Scoring Strategy

Given a miRNA and a target sequence, MDPS scores their alignment with a dynamic programming sequence-alignment algorithm. The weights used to score the alignment of the two sequences are taken from the weight and the transition matrices. The score of the alignment is used to determine whether the given miRNA and target sequences may interact with each other.

To understand the scoring strategy of MDPS, it is important to get familiar with the following two notations, $S[i, j, k]$ and $\mathrm{state}(i, j)$. $S[i, j, k]$ is defined as the best score of the alignment between the $miRNA(1 \ldots i)$ and target $RNA(1 \ldots j)$ sequences, with the last alignment position at the $k$-th posture. Here $miRNA(1 \ldots i)$ represents the miRNA sequence from position 1 to position $i$. Similarly, target $RNA(1 \ldots j)$ represents the target sequence from position 1 to position $j$. There are three different possibilities for the last alignment position. When $k = 0$, the last alignment position is at one of the states $M$, $N$, or $W$, which we call posture 0. When $k = 1$, the last alignment position is at posture 1 and the state is $B_y$. When $k = 2$, the last alignment position is at posture 2 and the state is $B_x$. The $\mathrm{state}(i, j)$ is defined as the state of the pairing of the $i$-th miRNA position and the $j$-th mRNA position. Since two actual bases are involved, $\mathrm{state}(i, j)$ can only be one of the states $M$, $N$, or $W$.
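A minimal sketch of the $\mathrm{state}(i, j)$ lookup is shown below. It assumes the two bases are already oriented so that they face each other in the duplex, treats A-U and G-C as matches and G-U as a wobble, and maps T to U so that DNA-style input also works. The function name `pair_state` is hypothetical.

```python
WATSON_CRICK = {("A", "U"), ("U", "A"), ("G", "C"), ("C", "G")}
WOBBLE = {("G", "U"), ("U", "G")}


def pair_state(mirna_base, target_base):
    """Return "M" (match), "W" (G-U wobble) or "N" (mismatch) for two facing bases."""
    pair = (
        mirna_base.upper().replace("T", "U"),
        target_base.upper().replace("T", "U"),
    )
    if pair in WATSON_CRICK:
        return "M"
    if pair in WOBBLE:
        return "W"
    return "N"
```

The bulge states $B_x$ and $B_y$ never come out of this lookup, because a bulge involves only one base; they arise from the alignment itself.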

With the above definition of the two notations, since both miRNA and target sequence positions

start from 1, 𝑆[𝑖, 𝑗, 0] = βˆ’βˆž, 𝑖𝑓 𝑖 = 0 π‘œπ‘Ÿ 𝑗 = 0. Also, for the first position of the miRNA, no

Page 71: Computational Study of Target Gene Interactions

62

transition is considered. Therefore, 𝑆[1, 𝑗, 0] = π‘™π‘œπ‘”(𝑀(π‘ π‘‘π‘Žπ‘‘π‘’(1, 𝑗), 1)) for any 𝑗 > 0, where

𝑀(π‘ π‘‘π‘Žπ‘‘π‘’(1, 𝑗), 1) means the (π‘ π‘‘π‘Žπ‘‘π‘’(1, 𝑗), 1)-th entry of the weight matrix of this miRNA. In

addition, when the first position of the target sequence is aligned with any position of the miRNA

after its first position, a transition from is By to the current state is considered. So, 𝑆[𝑖, 1, 0] =

log(𝑀(π‘ π‘‘π‘Žπ‘‘π‘’(𝑖, 1), 𝑖)) + 𝑆[𝑖 βˆ’ 1,0,1] + log (𝑑(𝐡𝑦, π‘ π‘‘π‘Žπ‘‘π‘’(𝑖, 1))) for any 𝑖 > 1. With these initial

cases, we have the following iteration formula to calculate 𝑆[𝑖, 𝑗, 0] for any 𝑖 > 1 π‘Žπ‘›π‘‘ 𝑗 > 1:

𝑆[𝑖, 𝑗, 0] = log(𝑀(π‘ π‘‘π‘Žπ‘‘π‘’(𝑖, 𝑗), 𝑖)) + π‘šπ‘Žπ‘₯

{

𝑆[𝑖 βˆ’ 1, 𝑗 βˆ’ 1,0] + log (𝑑(π‘ π‘‘π‘Žπ‘‘π‘’(𝑖 βˆ’ 1, 𝑗 βˆ’ 1), π‘ π‘‘π‘Žπ‘‘π‘’(𝑖, 𝑗)))

𝑆[𝑖 βˆ’ 1, 𝑗 βˆ’ 1,1] + log (𝑑 (𝐡𝑦 , π‘ π‘‘π‘Žπ‘‘π‘’(𝑖, 𝑗)))

𝑆[𝑖 βˆ’ 1, 𝑗 βˆ’ 1,2] + log (𝑑(𝐡π‘₯ , π‘ π‘‘π‘Žπ‘‘π‘’(𝑖, 𝑗)))

Similarly, with the initial cases $S[0, j, 1] = -\infty$ and $S[1, j, 1] = \log(w(B_y, 1))$, $S[i, j, 1]$ was

calculated for $i > 1$ and any $j$ by the following formula:

$$
S[i, j, 1] = \log\big(w(B_y, i)\big) + \max
\begin{cases}
S[i-1, j, 0] + \log\big(t(\mathrm{state}(i-1, j), B_y)\big) \\
S[i-1, j, 1] + \log\big(t(B_y, B_y)\big)
\end{cases}
$$

Similarly, with the initial cases $S[i, 1, 2] = S[i, 0, 2] = -\infty$ for any $i$, $S[i, j, 2]$ was calculated for

any $i$ and $j > 1$ as

$$
S[i, j, 2] = \max
\begin{cases}
S[i, j-1, 0] + \log\big(t(\mathrm{state}(i, j-1), B_x)\big) \\
S[i, j-1, 2] + \log\big(t(B_x, B_x)\big)
\end{cases}
$$

With the above cases, the maximum of $S[n, j, k]$ over all $j$ and $k$ is considered as the final score

of MDPS, where $n$ is the last position of the miRNA sequence. This score is the alignment score

of the miRNA and target RNA sequences under consideration, based on the MDPS strategy. The

actual alignment that results in this score can be recovered by backtracking, which yields the pairing

between the miRNA and target sequences.
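
The recurrences above can be sketched in Python as follows. This is a minimal illustration under stated assumptions, not the implementation used in this work: the function names, the dictionary layout of `w` and `t`, the base-pairing rule in `pair_state`, and the requirement that the target is already oriented so that `target[j-1]` faces `mirna[i-1]` are all hypothetical choices made for the example. Backtracking is omitted, so only the final score is returned.

```python
import math

NEG_INF = float("-inf")
WATSON_CRICK = {("A", "U"), ("U", "A"), ("G", "C"), ("C", "G")}
WOBBLE = {("G", "U"), ("U", "G")}


def pair_state(a, b):
    """State of two facing bases: 'M' match, 'W' G-U wobble, 'N' mismatch."""
    if (a, b) in WATSON_CRICK:
        return "M"
    if (a, b) in WOBBLE:
        return "W"
    return "N"


def safe_log(p):
    """log(p), with log(0) mapped to -infinity."""
    return math.log(p) if p > 0 else NEG_INF


def mdps_score(mirna, target, w, t):
    """Assumed layout: w[state][i-1] is the weight of `state` at miRNA
    position i (1-based); t[(a, b)] is the transition weight from a to b."""
    n, m = len(mirna), len(target)
    # S[i][j][k]: k = 0 (paired: M/N/W), k = 1 (By), k = 2 (Bx)
    S = [[[NEG_INF] * 3 for _ in range(m + 1)] for _ in range(n + 1)]

    def state(i, j):
        return pair_state(mirna[i - 1], target[j - 1])

    def lw(s, i):
        return safe_log(w[s][i - 1])

    def lt(a, b):
        return safe_log(t[(a, b)])

    for i in range(1, n + 1):
        for j in range(m + 1):
            # Posture 1 (By): miRNA position i advances, target position j stays.
            if i == 1:
                S[i][j][1] = lw("By", 1)
            else:
                from_pair = NEG_INF
                if j >= 1:
                    from_pair = S[i - 1][j][0] + lt(state(i - 1, j), "By")
                from_by = S[i - 1][j][1] + lt("By", "By")
                S[i][j][1] = lw("By", i) + max(from_pair, from_by)
            if j == 0:
                continue  # postures 0 and 2 stay at -infinity for j = 0
            # Posture 2 (Bx): target position j advances, miRNA position i stays.
            if j > 1:
                S[i][j][2] = max(
                    S[i][j - 1][0] + lt(state(i, j - 1), "Bx"),
                    S[i][j - 1][2] + lt("Bx", "Bx"),
                )
            # Posture 0: positions i and j are paired.
            s = state(i, j)
            if i == 1:
                S[i][j][0] = lw(s, 1)
            elif j == 1:
                S[i][j][0] = lw(s, i) + S[i - 1][0][1] + lt("By", s)
            else:
                S[i][j][0] = lw(s, i) + max(
                    S[i - 1][j - 1][0] + lt(state(i - 1, j - 1), s),
                    S[i - 1][j - 1][1] + lt("By", s),
                    S[i - 1][j - 1][2] + lt("Bx", s),
                )
    # Final score: best value at the last miRNA position over all j and k.
    return max(S[n][j][k] for j in range(m + 1) for k in range(3))
```

Working in log space keeps the products of many small weights numerically stable, and storing the three postures in one table mirrors the three recurrences directly.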

The MDPS model hyperparameters consisted of the $w$ matrices, the $t$ matrices and the

corresponding score cutoffs that gave the best predictions on the CLASH training dataset for

different miRNAs. The hyperparameters were generated from the target-enriched dataset and the

energy-filtered dataset separately. For the miRNA-specific models, miRNA-specific

hyperparameters were generated, which contained separate $w$, $t$ and score cutoffs for every miRNA

in the dataset. For the miRNA-general model, only one set of $w$, $t$ and score cutoff was generated

for all the miRNAs in the dataset. The $w$ and $t$ in this model were generated by averaging those of

the miRNA-specific models from the target-enriched dataset and the energy-filtered dataset

separately. Since the column size of the $w$ matrices was the length of the corresponding miRNAs

in miRNA-specific models and the lengths are different for different miRNAs, the column size of

the $w$ matrix in the general models was set to the length of the longest miRNA sequence in

the training datasets. The score cutoffs are required to filter out the false positive predictions. After

considering five different criteria, the score cutoff was chosen as the Average score + 2 × Standard

Deviation, where the Average score and the Standard Deviation are the mean and the standard

deviation of the alignment scores of the miRNA-target duplexes in the training datasets.
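
As a small illustration of the cutoff rule, the sketch below computes the mean plus two standard deviations of a list of training alignment scores. The function name `score_cutoff` is hypothetical, and the use of the sample standard deviation is an assumption, since the text does not state whether the sample or the population form was used.

```python
import statistics


def score_cutoff(training_scores, n_sd=2.0):
    """Average score + n_sd * standard deviation of the training scores."""
    mean = statistics.fmean(training_scores)
    # Assumption: sample standard deviation; statistics.pstdev would give
    # the population form instead.
    sd = statistics.stdev(training_scores)
    return mean + n_sd * sd
```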

3.1.2.4 Combining MDPS Scores with Existing Tools

All the popular target prediction algorithms emphasize the miRNA-target pairing in the seed

regions [66, 82-84], and/or do not consider the dependence of the neighboring pairings [67]. The

miRNA-target alignment score measured by MDPS is representative of the position-wise

preference and dependencies between adjacent positions throughout the whole miRNA sequence.

By incorporating the MDPS scores with the predictions of the existing tools, the accuracy of the

overall miRNA-target prediction may be improved. To test this hypothesis, the MDPS scores were

combined with three popular methods, miRanda, RNA22, and TargetScan [67, 82-85]. First, the

predictions of the three tools on the given miRNA and target sequences were generated by running

miRanda 3.3a and TargetScanHuman 7.0 and by using the existing predictions of RNA22

(ENSEMBL 65, miRBase 18). Then the MDPS scores were calculated and the score cutoffs were

applied to the predicted positive miRNA and target sequences to generate the combined

predictions. The original predictions of the three tools and the combined predictions were compared

on the two test datasets (Table 3-1).

3.1.3 Results

3.1.3.1 Importance of Non-Seed Regions in miRNA–Target Interactions

Canonical rules of miRNA-target interactions emphasize extensive bonds in the seed region as one

of the primary criteria. However, the CLASH study [68] reported numerous interactions with poor

pairing in the seed region. To evaluate the importance of the miRNA positions outside the seed

region in target binding, the 18,041 CLASH interactions were analyzed. miRNA positions 1-8 were

considered as the seed region in this section. The analysis showed that more than 12% of miRNAs had

at least eight matches/wobbles after the eighth position in the interactions in which they were involved

(Figure 3-2A). Out of the 399 miRNAs listed in the CLASH study, 386 (97%) had interactions

with at least one match/wobble pairing outside the seed region. Figure 3-2B shows the distribution

of the number of match/wobble pairings outside the seed region among the 18,041 CLASH

interactions. Only 14 interactions had no match/wobble pairing outside the seed region.

The miRNA–target interactions with extensive seed matching also showed a good number of

match/wobble pairs outside the seed region. Similar to the CLASH study [68], the 6mer, 7mer,

8mer and 9mer interactions, which had 6, 7, 8 and 9 continuous matches from miRNA position 1,

respectively, were considered as the interactions with seed matching. Although the non-seed

interactions on average had more match/wobble pairs after the seed region, more than 50% of the

seed interactions also tended to have extensive bonds at positions 10-20 (Figure 3-2C). The

analysis thus suggests that it may be valuable to consider miRNA-target pairings after the seed

region of the miRNA sequence.

Figure 3-2: Non-seed regions may be important for miRNA-target interactions. (A) Percentage of

miRNAs by the lowest number of match/wobble pairings after position 8 in the 18,041

CLASH interactions. (B) Percentage of the 18,041 CLASH interactions having different numbers of

match/wobble pairings after position 8. (C) The frequency of match/wobble pairing at different

miRNA positions for different types of CLASH interactions.
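
The two counts used in this analysis can be sketched as follows. The encoding is an assumption made for the example: each interaction is a string with one letter per miRNA position (`M` match, `W` wobble, `N` mismatch, `B` bulge), and the function names are hypothetical. Treating a run of more than nine leading matches as a 9mer is also an assumption, as the text does not cover that case.

```python
def leading_matches(states):
    """Number of continuous 'M' states starting at miRNA position 1."""
    run = 0
    for s in states:
        if s != "M":
            break
        run += 1
    return run


def seed_class(states):
    """'6mer' to '9mer' for 6 to 9 leading matches, otherwise 'non-seed'."""
    run = min(leading_matches(states), 9)  # assumption: longer runs count as 9mer
    return f"{run}mer" if run >= 6 else "non-seed"


def nonseed_pairs(states, seed_end=8):
    """Number of match/wobble states after miRNA position `seed_end`."""
    return sum(1 for s in states[seed_end:] if s in ("M", "W"))
```

For example, an interaction encoded as `MMMMMMMNWMMNNMWN` has seven leading matches and five match/wobble pairings after position 8.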

The dependency between contiguous positions of a miRNA sequence in terms of target interactions

was also studied. The hypothesis was that when two miRNA-target bonds (match/wobble) occur

side-by-side, the strength of one pairing might help to stabilize the pairing by its side. To study the

dependency between neighboring positions, each ‘Match’ or ‘Wobble’ state was labeled with a ‘1’

and each ‘Mismatch’ or ‘Bulge’ with a ‘0’, for each position of a miRNA. In this way, for each

miRNA position, a binary binding vector was generated which represented the binding states of

that miRNA position in the interactions. The size of this binding vector equaled the number of

interactions of the miRNA (Figure 3-3A). To find the correlation between two positions of the same

miRNA, the Matthews correlation coefficient (MCC) was calculated on the two binary vectors

for the two positions of the miRNA (Figure 3-3A). Only neighboring positions tended to have a

high positive correlation (MCC ≥ 0.75). Also, the adjacent positions within regions 2–9, 11–14 and

16–21 of a large number of miRNAs tended to show higher correlation values (Figure 3-3B). This

suggested a potential dependency or cooperation between adjacent binding positions of a

miRNA in terms of target binding. Together, these analyses indicated that all the miRNA positions

and their dependencies are worth considering for miRNA–target interactions. The MDPS scores

should be able to capture this information.
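
The binding vectors and the MCC between two positions can be sketched as below. The per-position state strings (`M`, `W`, `N`, `B`) and the function names are assumptions for illustration, and returning 0 when the denominator is zero is a common convention rather than something stated in the text.

```python
import math


def binding_vector(state_strings, position):
    """Binary vector for a 1-based miRNA position across its interactions:
    1 for Match/Wobble, 0 for Mismatch/Bulge."""
    return [1 if s[position - 1] in ("M", "W") else 0 for s in state_strings]


def mcc(x, y):
    """Matthews correlation coefficient between two binary vectors."""
    n11 = sum(1 for a, b in zip(x, y) if a == 1 and b == 1)
    n00 = sum(1 for a, b in zip(x, y) if a == 0 and b == 0)
    n10 = sum(1 for a, b in zip(x, y) if a == 1 and b == 0)
    n01 = sum(1 for a, b in zip(x, y) if a == 0 and b == 1)
    denom = math.sqrt((n11 + n10) * (n11 + n01) * (n00 + n10) * (n00 + n01))
    # Convention (assumption): MCC is 0 when a vector is constant.
    return (n11 * n00 - n10 * n01) / denom if denom else 0.0
```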

3.1.3.2 Clusters of miRNAs Share Correlated Target Binding Patterns

Since many miRNA–target interactions involve both seed and non-seed regions and the pairings at

different miRNA positions are dependent, we hypothesized that many miRNAs may have similar

or correlated target binding patterns. This hypothesis was tested with the obtained weight and

transition matrices, and many miRNAs were indeed found to share correlated binding patterns.

Figure 3-3: Correlated pairs of miRNA positions. (A) An illustration of how MCC is calculated for

miR-484. (B) The percentage of miRNAs having correlated position pairs (MCC ≥ 0.75). The heatmap

has miRNA positions on the axes and the percentage of correlated miRNAs is shown for every pair

of positions.

To investigate whether different miRNAs have similar or correlated binding patterns, a miRNA

sequence was divided into two equal-size regions, positions 1–8 and 9–16. Here, the results are

shown for the energy-filtered dataset, although the conclusions were similar for the target-enriched

dataset. For each of the 122 miRNAs in the energy-filtered dataset, its position-wise ‘Match’ and

‘Mismatch’ probabilities were obtained from the learned weight matrix. The Spearman’s

correlation coefficient between each pair of miRNAs was calculated based on their position-wise

‘Match’ and ‘Mismatch’ probabilities. This was done in both regions separately. The G-U

wobble state was considered as the ‘Match’ state and bulge states were ignored in this analysis.

The miRNAs that belonged to the same family were ignored here, as these miRNAs had high

sequence similarities. A clique-finding-based clustering process was applied based on the

correlation (correlation cutoff = 0.75) and 17 distinct clusters of miRNAs were identified that were

correlated in terms of ‘Match’ state probabilities at positions 1–8. The largest 8 clusters had

50.88% of the total 122 miRNAs (Figure 3-4A shows four different exclusive clusters). When

considering positions 9–16 of a miRNA, 29 distinct miRNA clusters were identified where the

miRNAs in each cluster were correlated on ‘Match’ state probabilities within that region (Figure

3-4B). The largest 10 clusters had only 29.82% of the total 122 miRNAs considering ‘Match’

probabilities. These statistics suggested that the seed regions (positions 1–8) of miRNAs were

more correlated than the non-seed regions (positions 9–16), which supported the current practice of

considering seed matching for miRNA targeting but at the same time indicated that the

non-seed regions also contribute substantially to this process.

Figure 3-4: Clusters of miRNAs with similar ‘Match’ patterns in specific regions. The X-axis of

a cluster plot shows the positions of the miRNAs in that cluster and the Y-axis of the plot shows

the percentage of interactions having ‘Match’ in the corresponding miRNA positions. (A) Clusters of

miRNAs correlated with the ‘Match’ state probability from position 1 to 8. (B) Clusters of

miRNAs correlated with the ‘Match’ state probability from position 9 to 16.
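
One way to realize a correlation-based, clique-finding clustering of this kind is sketched below. It is only an illustration: the exact procedure used in this work is not described here, so the greedy strategy (repeatedly extract the largest maximal clique and remove its members), the `min_size` parameter and all function names are assumptions. Each profile is taken to be the list of position-wise ‘Match’ probabilities of one miRNA in one region.

```python
from itertools import combinations


def ranks(values):
    """1-based ranks; tied values share their average rank."""
    order = sorted(range(len(values)), key=lambda k: values[k])
    result = [0.0] * len(values)
    pos = 0
    while pos < len(order):
        end = pos
        while end + 1 < len(order) and values[order[end + 1]] == values[order[pos]]:
            end += 1
        for k in order[pos:end + 1]:
            result[k] = (pos + end) / 2 + 1
        pos = end + 1
    return result


def pearson(x, y):
    n = len(x)
    mx, my = sum(x) / n, sum(y) / n
    cov = sum((a - mx) * (b - my) for a, b in zip(x, y))
    vx = sum((a - mx) ** 2 for a in x)
    vy = sum((b - my) ** 2 for b in y)
    return cov / (vx * vy) ** 0.5 if vx > 0 and vy > 0 else 0.0


def spearman(x, y):
    """Spearman's correlation: Pearson correlation of the ranks."""
    return pearson(ranks(x), ranks(y))


def maximal_cliques(adj, r=frozenset(), p=None, x=frozenset()):
    """Bron-Kerbosch enumeration of maximal cliques (without pivoting)."""
    if p is None:
        p = frozenset(adj)
    if not p and not x:
        yield r
    for v in list(p):
        yield from maximal_cliques(adj, r | {v}, p & adj[v], x & adj[v])
        p = p - {v}
        x = x | {v}


def clique_clusters(profiles, cutoff=0.75, min_size=2):
    """Greedy exclusive clusters: link miRNAs whose profiles correlate at or
    above `cutoff`, then repeatedly remove the largest maximal clique."""
    adj = {name: set() for name in profiles}
    for a, b in combinations(profiles, 2):
        if spearman(profiles[a], profiles[b]) >= cutoff:
            adj[a].add(b)
            adj[b].add(a)
    clusters = []
    while True:
        best = max(maximal_cliques(adj), key=len, default=frozenset())
        if len(best) < min_size:
            break
        clusters.append(sorted(best))
        for v in best:
            for u in adj.pop(v):
                if u in adj:
                    adj[u].discard(v)
    return clusters
```

Requiring a clique means that every pair of miRNAs in a cluster is correlated, which is stricter than connected-component clustering, where two members may be linked only through intermediates.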

3.1.3.3 miRNA-General Models Showed Better Performance on Target Site Prediction than miRNA-Specific Models

Many miRNAs have similar or correlated target binding patterns, as demonstrated in Section 3.1.3.2.

This leads to the idea that the miRNA-general model learned from position-wise information of

all the miRNAs and their corresponding targets may work better than the miRNA-specific

models learned for individual miRNAs. In the miRNA-general model, a common weight matrix

and a common transition matrix were learned for all miRNAs together. In the miRNA-specific

model, a unique weight matrix and a unique transition matrix were learned for each individual

miRNA with a decent number of targets (≥20). The models were learned with 10-fold

cross-validation on the two training datasets.

From the comparative performance analysis of the miRNA-general and miRNA-specific models,

the miRNA-general model was found to work better. In the target-enriched datasets, the

miRNA-general model identified 93.49% of the CLASH interactions correctly while the

miRNA-specific models identified 87.56% of the CLASH interactions. Similarly, in the energy-filtered

datasets, the miRNA-general model identified 91.59% of the CLASH interactions while the

miRNA-specific models identified only 85.91% of the interactions.

The following could be the reasons for the better performance of the miRNA-general model. First,

as demonstrated in the last section, miRNAs do share similar or correlated patterns in terms of

target binding, which enabled the miRNA-general model to capture the ‘key’ or ‘conserved’

characteristics of miRNA–target interactions. Second, there were much more training data to train

a miRNA-general model than to train a miRNA-specific model. The number of targets

of an individual miRNA in the training datasets may not have been large enough for the

miRNA-specific model to avoid ‘overfitting’. However, the second reason is unlikely, as the 10-fold

cross-validation accuracy of the miRNA-specific models on the 10 groups of held-out data was similar.

Therefore, it is likely that the main reason the general models worked better was the

similarity of the binding patterns of different miRNAs.

Despite the overall better performance of the miRNA-general model, for certain miRNAs, the

miRNA-specific models did work better than the miRNA-general model. For instance, for

miR-10a, the miRNA-specific model predicted 100% of its target sites correctly, whereas the

miRNA-general model predicted 86% of its target sites correctly. This miRNA had 51 targets in the

energy-filtered training dataset. Also, the number of target sites in the training dataset was not a

decisive factor in the performance. For example, in the case of miR-186, the miRNA-specific model

did not perform better, even though it had 81 training target sites. On the other hand, the miRNA-specific

model performed better for miR-1301, although it only had 26 training target sites. This

suggests that the individual binding pattern was the reason that the miRNA-specific model worked

better in this case.

3.1.3.4 Combining the MDPS Scores with Existing Tools Improved Their Accuracy

Almost all the existing miRNA target prediction tools suffer from a huge number of false positive

predictions. Since these tools do not consider the entire miRNA regions for miRNA–target

interaction prediction, and/or do not consider the dependency among different pairing positions in

miRNA–target interactions, we hypothesized that by combining the MDPS scores with the existing

tools, it might be possible to improve the precision of the existing tools. To test this hypothesis,

the MDPS scoring process was applied to the miRNA-target prediction results from the three tools.

To combine MDPS scores with miRanda, RNA22 and TargetScan, these tools were first applied to

predict miRNA–target interactions. Then the MDPS scores were calculated on the predicted

targets, and the predicted target sites were labeled true or false based on the MDPS score cutoff from

the trained general models. This two-step process was applied to the held-out 20% of the CLASH

dataset and the independent CLEAR-CLIP dataset. Also, the MDPS hyperparameters trained on

the target-enriched and the energy-filtered datasets were applied separately to calculate the

combined predictions. After combining with MDPS, the precision of the combined predictions

increased substantially while the recall decreased slightly, compared with the original

predictions of the tools (Table 3-2). Overall, the F1 score of the combined tool was improved. For

instance, the recall, precision and F1 score of RNA22 on the CLEAR-CLIP data changed

by −9.35%, +22.71% and +22.46%, respectively, when combined with the MDPS model trained on

the energy-filtered dataset. This analysis demonstrated that the MDPS score, as an additional

feature for miRNA target site prediction, was able to decrease the false positive predictions of the

existing tools.
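
The two-step process and its evaluation can be sketched as follows. The function names and the representation of predictions as sets of site identifiers are hypothetical, and the direction of the filter (a site is kept when its MDPS score is at or above the cutoff) is an assumption, since the text only states that sites are labeled true or false based on the cutoff.

```python
def combine_with_mdps(tool_positives, mdps_scores, cutoff):
    """Step 2: keep a tool's predicted sites whose MDPS score passes the cutoff.
    Assumption: passing means score >= cutoff."""
    return {site for site in tool_positives if mdps_scores[site] >= cutoff}


def precision_recall_f1(predicted, truth):
    """Precision, recall and F1 of a predicted site set against the true sites."""
    tp = len(predicted & truth)
    precision = tp / len(predicted) if predicted else 0.0
    recall = tp / len(truth) if truth else 0.0
    total = precision + recall
    f1 = 2 * precision * recall / total if total else 0.0
    return precision, recall, f1
```

Because the second step can only remove predictions, recall can never increase, while precision rises whenever the removed sites are mostly false positives; this matches the direction of the changes reported above.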

3.1.4 Discussion

Existing miRNA-target prediction tools are heavily dependent on the canonical rules of

miRNA-target interactions. These rules include extensive binding in the seed region, a target

site located in the mRNA 3’ UTR and high stability between the interaction sites.

Although these tools can identify a good number of experimentally validated interactions, they all

suffer from a huge number of false positive predictions. Recent experimental data provide

numerous miRNA–target interactions that do not follow any of these canonical rules. Studies on

these newly generated datasets have shown potential involvement of non-seed regions of miRNAs

in the binding activities. However, the importance of non-seed regions for miRNA target binding

has not been thoroughly studied; neither has the dependency among the consecutive positions and

regions in the miRNA. The MDPS algorithm was developed to learn miRNA-target pairing

patterns, both in the seed and non-seed regions of miRNA binding, by utilizing the genome-wide

CLASH datasets. MDPS takes into account the dependency of neighboring positions of the

miRNA sequence using a Markov model. Utilizing the weight and transition matrices of the trained

Markov model, MDPS is then able to score each potential miRNA binding site to pre-select/predict

putative candidate miRNA–target interactions. By combining the MDPS scores with the existing

tools, the precision scores of the combined tools were greatly improved.

Table 3-2: Performance comparison of the combined tools with the original tools. Each number is the increased percentage when comparing the performance of the combined tool with the performance of the original tool.

| MDPS model and test data | miRanda F1 | miRanda Precision | miRanda Recall | RNA22 F1 | RNA22 Precision | RNA22 Recall | TargetScan F1 | TargetScan Precision | TargetScan Recall |
|---|---|---|---|---|---|---|---|---|---|
| Target-enriched model on CLASH | 18.88 | 23.64 | -5.76 | 25.24 | 26.62 | -5.22 | 20.78 | 23.37 | -4.19 |
| Target-enriched model on CLEAR-CLIP | 15.36 | 15.67 | -7.12 | 22.46 | 22.71 | -9.35 | 18.11 | 18.28 | -7.16 |
| Energy-filtered model on CLASH | 17.82 | 20.85 | -7.62 | 24.52 | 25.66 | -7.10 | 23.21 | 24.97 | -4.81 |
| Energy-filtered model on CLEAR-CLIP | 15.52 | 15.89 | -10.68 | 21.15 | 21.40 | -10.57 | 15.81 | 15.96 | -7.64 |

The DP used in MDPS is different from the one used in miRanda [67], which uses a standard DP

algorithm to perform pair-wise alignment between a miRNA and a potential target. The alignment

score is then used as a criterion together with site conservation and binding energy scores to predict

miRNA target sites. There are at least two important differences between the miRanda DP

algorithm and the MDPS one. One is the scoring schema for miRNA-target alignments:

miRanda uses a fixed scoring schema, such as a score of +5 for G:C and A:U pairs, +2 for G:U

wobble pairs etc. [69], whereas MDPS uses a probabilistic scoring schema learned from the CLASH

training data. The other is that MDPS considers the dependency between neighboring pairing

positions in the alignments, whereas miRanda assumes the independence of neighboring pairing positions.

Through the investigation of the Markov models learned from both target-enriched datasets and

energy-filtered datasets, we were able to make interesting findings on position-wise binding

patterns of miRNA–target interactions. We found that subsets of miRNAs had correlated binding

patterns in specific sub-regions. We also found that both seed and non-seed regions contribute to the

specific miRNAs’ binding patterns. Besides seed region binding, the length of the continuous

pairings outside the seed region, the gap between two continuous pairings, and the number and position

of G-C pairings in an interaction are also important features that can play a part in

miRNA target prediction. The position-wise knowledge of miRNA target binding, the continuous

pairing patterns, and the number and position of the G-C bonds, along with the canonical seed preference

rule, can help in designing a target prediction algorithm with less bias and better sensitivity and specificity.

Although the MDPS scores can help to improve the miRNA target site prediction, we are unsure

whether these selected target sites are functional. In other words, although the miRNAs may indeed

bind to the corresponding selected target sites, the miRNAs may not suppress the expression level

of the target RNAs. These selected sites can only be considered as potential target sites and their

functional effects need to be further investigated by experiments.

The current version of MDPS was not developed to be a standalone tool for miRNA target

prediction. Along with this score, many other features, such as sequence conservation, binding

energy and target site abundance, need to be considered to confidently predict miRNA

target sites. However, the focus of this study was to find out whether the dependencies between the

neighboring positions of the miRNA sequence and global pairing information of miRNA–target

interactions are important for target site selection. The incorporation of MDPS either as a feature

or as an additional step in the existing miRNA target prediction pipelines has the potential to enhance

the overall performance.

CHAPTER 4: CONCLUSION AND FUTURE WORK

4.1 Conclusion

This dissertation focuses on two of the major factors of gene expression regulation that act in the

transcription and post-transcription stages of gene expression. Enhancer-promoter interactions (EPIs)

act as regulatory factors at the transcription stage. These interactions, along with several

transcription factors and RNA polymerase II, initiate the transcription of a gene. Here, the properties

of the interactions were discussed and analyzed, and the important features were collected to design

a prediction tool for cell-specific interactions. From the analysis of the interaction patterns, an

important characteristic of enhancers was identified, which provides a new way of dealing with the

interactions.

EPIP: We designed an EPI prediction tool named EPIP that can efficiently predict

cell-specific EPIs by handling the missing features in cell lines. We used two sets of enhancers, a

properly curated set of promoters, experimentally validated Hi-C chromatin contacts and a

comprehensive set of cell-specific features for eight different types of cell lines. The inactive

enhancers and promoters were filtered out using active histone marks and RNA-seq gene

expression information in the respective cell lines. The EPIP model was designed to handle the

missing features using 11 feature partitions and a set of robust ensemble classifiers. Each

partition represents a set of cell lines that have that partition among their available features.

The model decides the prediction output based on the voting of weak learners trained on the

respective feature partitions. EPIP was compared with two popular EPI prediction tools on

both the EPIP test dataset and the test data of the two tools. EPIP outperformed the two tools on

both sets of data, especially in terms of cell-specific EPI prediction. EPIP was also tested on

five different test datasets including three sets of data from other labs. In all cases, EPIP

showed a high performance, particularly for the cell specific EPIs.

Analyzing chromatin interaction datasets from five different labs, we found an interaction

pattern of enhancers with their target gene promoters. When interacting with a shared set of

promoters, multiple enhancers tend not to share them partially: enhancers either share all of

their interacting promoters or share none. Based on this property, we extracted clusters

of enhancers in different cell lines that have very little overlap with the known

super-enhancers. The enhancers in a cluster are mostly consecutive in terms of their genomic positions

and belong to the same TAD. The clusters of enhancers are different in different cell lines.

The interaction between a miRNA and a protein-coding mRNA is regarded as a major gene

regulation factor. By interacting with an mRNA, a miRNA disrupts the pathway of the mRNA or

degrades the mRNA, which eventually blocks the translation of the specific protein that

the mRNA encodes. So, by interacting with mRNAs, miRNAs play a vital

part in modulating the regular gene expression pathway and can cause complications such as severe

diseases in the human body. The miRNA-mRNA interactions are imperfect in humans and contain

diverse patterns that are difficult to understand. Also, the small size of a miRNA allows it to bind to

multiple strong interaction sites in an mRNA transcript. But not all the sites are active in a cell type.

Hence, to achieve high prediction precision, miRNA target prediction tools follow certain

canonical rules. The recent experimental data show a huge number of interactions that do not

follow these canonical rules, resulting in low sensitivity for the prediction tools. The non-canonical

interactions found in the data challenge the traditional features used for the prediction and suggest

the possible contribution of position-specific features of miRNA and target sequences.

MDPS: With the hypothesis that every position of a miRNA may carry a certain importance

in forming a miRNA-target interaction, we designed a feature named MDPS. Given a

miRNA and target sequence pair, the sequences are aligned using scores derived from the miRNA

position-wise frequencies of the binding states and the transition frequencies from one binding

state to another. Finally, the overall score of the alignment is considered as the MDPS feature

score for the sequence pair. The position-wise frequencies and transition weights of the states

were learned from the interactions extracted from the CLASH experimental data. Along with

this, a score cutoff is set to remove the false positive interactions. MDPS was applied to the

predicted positives of the three popular miRNA target prediction tools and was shown to increase

their precision. Based on the position-wise binding frequencies of individual miRNAs, we

also showed the significance of the non-seed regions and found clusters of miRNAs having

region-wise similar binding patterns.

4.2 Future Work

4.2.1 Enhancer-promoter interactions

This study focused on the properties of enhancers and the interactions between enhancers and

promoters. The EPIP tool that we designed was trained with the best available datasets to date. But

with the annotations of new enhancers, promoters and the availability of more accurate and broadly

representative training data in the future, the performance of EPIP can be improved further. We

used Hi-C chromatin interactions to extract training data. But it is worth studying how the

performance of EPIP improves using EPIs from other sources of chromatin interaction, such as

ChIA-PET and 5C, together with Hi-C. Like the other EPI prediction tools, EPIP considers one

enhancer-promoter pair at a time to decide if it is an active EPI. But the EPIs may be interconnected

due to the complicated chromatin structure, as found in a recent study [14]. So, changing the design

of EPIP to consider multiple EPIs together as inputs may improve its performance further.

A primary finding of the study on the enhancer-promoter interactions was that a group of enhancers

tends to interact with a common set of target genes. This property was tested on a variety of

chromatin interaction datasets with two sets of enhancers. Both the enhancer and the promoter were

ensured to be active regions, but there was no way to confirm whether the chromatin

interactions were functional in different cell lines. With the availability of more cell-specific

chromatin interaction data, the property should be rigorously verified.

4.2.2 miRNA-mRNA interactions

From the work done here on the analysis of miRNA-mRNA interactions, it appears that every

position of a miRNA can contribute to forming a successful interaction with its target. We

discovered clusters of miRNAs that show a similar binding pattern along a certain region of the

miRNA sequence. The clusters of miRNAs, however, should be further analyzed for pathway

similarity. Recently, numerous cell-specific miRNA isoforms (isomiRs) were discovered by

RNA-seq and miRNA-seq experiments. IsomiRs are produced in a cell due to imprecise slicing of the

primary miRNA transcript or RNA editing of the initial miRNA transcripts,

among many other reasons [86]. IsomiRs have small sequence differences from the canonical

miRNAs. Based on the location of the differences, isomiRs can interact with the same targets as

the canonical miRNAs or with different ones. The incorporation of isomiRs in the miRNA target

prediction problem can give the target prediction model a more complete picture of the problem.

With this hypothesis and our understanding of the importance of non-canonical position- or

region-specific information, we are working on developing a miRNA-mRNA or isomiR-mRNA target

prediction tool that uses a deep learning model to learn the hidden features from just the sequences

of the corresponding miRNAs or isomiRs and mRNA transcripts. Since deep learning models

are well known to capture deep interconnected features, we are interested in the sequence feature

patterns this model can capture to explain a miRNA-mRNA interaction.

LIST OF REFERENCES

1. Andersson R, Gebhard C, Miguel-Escalada I, Hoof I, Bornholdt J, Boyd M, Chen Y,

Zhao X, Schmidl C, Suzuki T et al: An atlas of active enhancers across human cell

types and tissues. Nature 2014, 507(7493):455-461.

2. Cai X, Hou L, Su N, Hu H, Deng M, Li X: Systematic identification of conserved

motif modules in the human genome. BMC Genomics 2010, 11(1):567.

3. De Laat W, Duboule D: Topology of mammalian developmental enhancers and their

regulatory landscapes. Nature 2013, 502(7472):499-506.

4. Dekker J, Rippe K, Dekker M, Kleckner N: Capturing chromosome conformation.

Science 2002, 295(5558):1306-1311.

5. Zheng Y, Li X, Hu H: Comprehensive discovery of DNA motifs in 349 human cells

and tissues reveals new features of motifs. Nucleic Acids Res 2015, 43(1):74-83.

6. Corradin O, Saiakhova A, Akhtar-Zaidi B, Myeroff L, Willis J, Cowper-Sal lari R,

Lupien M, Markowitz S, Scacheri PC: Combinatorial effects of multiple enhancer

variants in linkage disequilibrium dictate levels of gene expression to confer

susceptibility to common traits. Genome Res 2014, 24(1):1-13.

7. Fullwood MJ, Liu MH, Pan YF, Liu J, Xu H, Mohamed YB, Orlov YL, Velkov S, Ho A,

Mei PH: An oestrogen-receptor-α-bound human chromatin interactome. Nature

2009, 462(7269):58-64.

8. Rao SS, Huntley MH, Durand NC, Stamenova EK, Bochkov ID, Robinson JT, Sanborn

AL, Machol I, Omer AD, Lander ES et al: A 3D map of the human genome at kilobase

resolution reveals principles of chromatin looping. Cell 2014, 159(7):1665-1680.

9. He B, Chen C, Teng L, Tan K: Global view of enhancer-promoter interactome in

human cells. Proceedings of the National Academy of Sciences of the United States of

America 2014, 111(21):E2191-2199.

10. Ernst J, Kheradpour P, Mikkelsen TS, Shoresh N, Ward LD, Epstein CB, Zhang X, Wang

L, Issner R, Coyne M et al: Mapping and analysis of chromatin state dynamics in

nine human cell types. Nature 2011, 473(7345):43-49.

11. Thurman RE, Rynes E, Humbert R, Vierstra J, Maurano MT, Haugen E, Sheffield NC,

Stergachis AB, Wang H, Vernot B et al: The accessible chromatin landscape of the

human genome. Nature 2012, 489(7414):75-82.

12. Roy S, Siahpirani AF, Chasman D, Knaack S, Ay F, Stewart R, Wilson M, Sridharan R:

A predictive modeling approach for cell line-specific long-range regulatory

interactions. Nucleic Acids Res 2015, 43(18):8694-8712.

13. Whalen S, Truty RM, Pollard KS: Enhancer-promoter interactions are encoded by

complex genomic signatures on looping chromatin. Nat Genet 2016, 48(5):488-496.

14. Zhao C, Li X, Hu H: PETModule: a motif module based approach for enhancer

target gene prediction. Scientific reports 2016, 6:30043.

15. Talukder A, Saadat S, Li X, Hu H: EPIP: A novel approach for condition-specific

enhancer-promoter interaction prediction. Bioinformatics 2019, 35(20):3877-3883.

16. Ernst J, Kellis M: ChromHMM: automating chromatin-state discovery and

characterization. Nature methods 2012, 9(3):215-216.

17. Dunham I, Consortium EP: An integrated encyclopedia of DNA elements in the

human genome. Nature 2012, 489(7414):57-74.

18. Harrow J, Frankish A, Gonzalez JM, Tapanari E, Diekhans M, Kokocinski F, Aken BL,

Barrell D, Zadissa A, Searle S et al: GENCODE: the reference human genome

annotation for The ENCODE Project. Genome Res 2012, 22(9):1760-1774.

19. Li X, Zheng Y, Hu H, Li X: Integrative analyses shed new light on human ribosomal

protein gene regulation. Scientific reports 2016, 6:28619.

20. Jin F, Li Y, Dixon JR, Selvaraj S, Ye Z, Lee AY, Yen CA, Schmitt AD, Espinoza CA,

Ren B: A high-resolution map of the three-dimensional chromatin interactome in

human cells. Nature 2013, 503(7475):290-294.

21. Li G, Ruan X, Auerbach RK, Sandhu KS, Zheng M, Wang P, Poh HM, Goh Y, Lim J,

Zhang J et al: Extensive promoter-centered chromatin interactions provide a

topological basis for transcription regulation. Cell 2012, 148(1-2):84-98.

22. Freund Y, Schapire RE: A decision-theoretic generalization of on-line learning and an

application to boosting. Journal of Computer and System Sciences 1997, 55(1):119-139.

23. Breiman L, Friedman J, Stone CJ, Olshen RA: Classification and Regression Trees:

Taylor & Francis; 1984.

24. Hoffman MM, Buske OJ, Wang J, Weng Z, Bilmes JA, Noble WS: Unsupervised

pattern discovery in human chromatin structure through genomic segmentation.

Nature methods 2012, 9(5):473-476.

25. Forcato M, Nicoletti C, Pal K, Livi CM, Ferrari F, Bicciato S: Comparison of

computational methods for Hi-C data analysis. Nature methods 2017, 14(7):679-685.

26. Furlong EEM, Levine M: Developmental enhancers and chromosome topology.

Science 2018, 361(6409):1341-1345.

27. Lettice LA, Horikoshi T, Heaney SJ, van Baren MJ, van der Linde HC, Breedveld GJ,

Joosse M, Akarsu N, Oostra BA, Endo N et al: Disruption of a long-range cis-acting

regulator for Shh causes preaxial polydactyly. Proceedings of the National Academy

of Sciences of the United States of America 2002, 99(11):7548-7553.

28. Lieberman-Aiden E, van Berkum NL, Williams L, Imakaev M, Ragoczy T, Telling A,

Amit I, Lajoie BR, Sabo PJ, Dorschner MO et al: Comprehensive mapping of long-

range interactions reveals folding principles of the human genome. Science 2009,

326(5950):289-293.

29. Mossing MC, Record MT, Jr.: Upstream operators enhance repression of the lac

promoter. Science 1986, 233(4766):889-892.

30. Pennacchio LA, Bickmore W, Dean A, Nobrega MA, Bejerano G: Enhancers: five

essential questions. Nature reviews Genetics 2013, 14(4):288-295.

31. Wang S, Hu H, Li X: Shared distal regulatory regions may contribute to the

coordinated expression of human ribosomal protein genes. Genomics 2020,

112(4):2886-2893.

32. Javierre BM, Burren OS, Wilder SP, Kreuzhuber R, Hill SM, Sewitz S, Cairns J, Wingett

SW, Varnai C, Thiecke MJ et al: Lineage-Specific Genome Architecture Links

Enhancers and Non-coding Disease Variants to Target Gene Promoters. Cell 2016,

167(5):1369-1384.e19.

33. Bellen HJ, O'Kane CJ, Wilson C, Grossniklaus U, Pearson RK, Gehring WJ: P-element-

mediated enhancer detection: a versatile method to study development in

Drosophila. Genes & development 1989, 3(9):1288-1300.

34. Weber F, de Villiers J, Schaffner W: An SV40 "enhancer trap" incorporates

exogenous enhancers or generates enhancers from its own sequences. Cell 1984,

36(4):983-992.

35. Crawford GE, Holt IE, Whittle J, Webb BD, Tai D, Davis S, Margulies EH, Chen Y,

Bernat JA, Ginsburg D et al: Genome-wide mapping of DNase hypersensitive sites

using massively parallel signature sequencing (MPSS). Genome Res 2006, 16(1):123-

131.

36. Heintzman ND, Stuart RK, Hon G, Fu Y, Ching CW, Hawkins RD, Barrera LO, Van

Calcar S, Qu C, Ching KA et al: Distinct and predictive chromatin signatures of

transcriptional promoters and enhancers in the human genome. Nat Genet 2007,

39(3):311-318.

37. Johnson DS, Mortazavi A, Myers RM, Wold B: Genome-wide mapping of in vivo

protein-DNA interactions. Science 2007, 316(5830):1497-1502.

38. Robertson G, Hirst M, Bainbridge M, Bilenky M, Zhao Y, Zeng T, Euskirchen G,

Bernier B, Varhol R, Delaney A et al: Genome-wide profiles of STAT1 DNA

association using chromatin immunoprecipitation and massively parallel

sequencing. Nature methods 2007, 4(8):651-657.

39. Wang Y, Li X, Hu H: H3K4me2 reliably defines transcription factor binding regions

in different cells. Genomics 2014, 103(2-3):222-228.

40. Malin J, Aniba MR, Hannenhalli S: Enhancer networks revealed by correlated DNAse

hypersensitivity states of enhancers. Nucleic Acids Res 2013, 41(14):6828-6838.

41. Zheng Y, Li X, Hu H: PreDREM: a database of predicted DNA regulatory motifs

from 349 human cell and tissue samples. Database : the journal of biological

databases and curation 2015, 2015.

42. Daniel B, Nagy G, Hah N, Horvath A, Czimmerer Z, Poliska S, Gyuris T, Keirsse J,

Gysemans C, Van Ginderachter JA et al: The active enhancer network operated by

liganded RXR supports angiogenic activity in macrophages. Genes & development

2014, 28(14):1562-1577.

43. Danko CG, Hyland SL, Core LJ, Martins AL, Waters CT, Lee HW, Cheung VG, Kraus

WL, Lis JT, Siepel A: Identification of active transcriptional regulatory elements

from GRO-seq data. Nature methods 2015, 12(5):433-438.

44. Won KJ, Ren B, Wang W: Genome-wide prediction of transcription factor binding

sites using an integrated model. Genome Biol 2010, 11(1):R7.

45. Visel A, Minovitsky S, Dubchak I, Pennacchio LA: VISTA Enhancer Browser--a

database of tissue-specific human enhancers. Nucleic Acids Res 2007, 35(Database

issue):D88-92.

46. Chen H, Li C, Peng X, Zhou Z, Weinstein JN, Cancer Genome Atlas Research N, Liang

H: A Pan-Cancer Analysis of Enhancer Expression in Nearly 9000 Patient Samples.

Cell 2018, 173(2):386-399.e12.

47. Pott S, Lieb JD: What are super-enhancers? Nat Genet 2015, 47(1):8-12.

48. Whyte WA, Orlando DA, Hnisz D, Abraham BJ, Lin CY, Kagey MH, Rahl PB, Lee TI,

Young RA: Master transcription factors and mediator establish super-enhancers at

key cell identity genes. Cell 2013, 153(2):307-319.

49. Dostie J, Richmond TA, Arnaout RA, Selzer RR, Lee WL, Honan TA, Rubio ED,

Krumm A, Lamb J, Nusbaum C et al: Chromosome Conformation Capture Carbon

Copy (5C): a massively parallel solution for mapping interactions between genomic

elements. Genome Res 2006, 16(10):1299-1309.

50. Rodelsperger C, Guo G, Kolanczyk M, Pletschacher A, Kohler S, Bauer S, Schulz MH,

Robinson PN: Integrative analysis of genomic, functional and protein interaction

data predicts long-range enhancer-target gene interactions. Nucleic Acids Res 2011,

39(7):2492-2502.

51. Dixon JR, Selvaraj S, Yue F, Kim A, Li Y, Shen Y, Hu M, Liu JS, Ren B: Topological

domains in mammalian genomes identified by analysis of chromatin interactions.

Nature 2012, 485(7398):376-380.

52. Quinodoz SA, Ollikainen N, Tabak B, Palla A et al: Higher-Order Inter-chromosomal

Hubs Shape 3D Genome Organization in the Nucleus. Cell 2018, 174(3):744-757.

53. Latapy M, Magnien C, Del Vecchio N: Basic Notions for the Analysis of Large Two-

mode Networks. Social Networks 2008, 30:31-48.

54. Bron C, Kerbosch J: Finding All Cliques of an Undirected Graph. Commun Acm 1973,

16(9):575-577.

55. Edgar RC: MUSCLE: multiple sequence alignment with high accuracy and high

throughput. Nucleic Acids Res 2004, 32(5):1792–1797.

56. Mann HB, Whitney DR: On a Test of Whether one of Two Random Variables is

Stochastically Larger than the Other. Annals of Mathematical Statistics 1947,

18(1):50-60.

57. McLean CY, Bristor D, Hiller M, Clarke SL, Schaar BT, Lowe CB, Wenger AM,

Bejerano G: GREAT improves functional interpretation of cis-regulatory regions.

Nature Biotechnology 2010, 28(5):495-501.

58. Blanchette M, Bataille AR, Chen X, Poitras C, Laganière J, Lefèbvre C, Deblois G,

Giguère V, Ferretti V, Bergeron D et al: Genome-wide computational prediction of

transcriptional regulatory modules reveals new insights into human gene

expression. Genome Res 2006, 16(5):656-668.

59. Edelman LB, Fraser P: Transcription factories: genetic programming in three

dimensions. Curr Opin Genet Dev 2012, 22(2):110-114.

60. Papantonis A, Cook PR: Transcription factories: genome organization and gene

regulation. Chemical Reviews 2013, 113(11):8683-8705.

61. Burroughs AM, Ando Y, de Hoon ML, Tomaru Y, Suzuki H, Hayashizaki Y, Daub CO:

Deep-sequencing of human Argonaute-associated small RNAs provides insight into

miRNA sorting and reveals Argonaute association with RNA fragments of diverse

origin. RNA Biology 2011, 8(1):158-177.

62. Bartel DP: MicroRNAs: genomics, biogenesis, mechanism, and function. Cell 2004,

116(2):281-297.

63. Wang Y, Li X, Hu H: Transcriptional regulation of co-expressed microRNA target

genes. Genomics 2011, 98(6):445-452.

64. Li Y, Kowdley KV: MicroRNAs in common human diseases. Genomics Proteomics

Bioinformatics 2012, 10(5):246-253.

65. Kertesz M, Iovino N, Unnerstall U, Gaul U, Segal E: The role of site accessibility in

microRNA target recognition. Nature Genetics 2007, 39(10):1278-1284.

66. Grimson A, Farh KK-H, Johnston WK, Garrett-Engele P, Lim LP, Bartel DP:

MicroRNA targeting specificity in mammals: determinants beyond seed pairing.

Molecular Cell 2007, 27(1):91-105.

67. Enright A, John B, Gaul U, Tuschl T, Sander C, Marks D: MicroRNA targets in

Drosophila. Genome Biol 2003, 5(1):R1.

68. Helwak A, Kudla G, Dudnakova T, Tollervey D: Mapping the human miRNA

interactome by CLASH reveals frequent noncanonical binding. Cell 2013, 153:654-

665.

69. Betel D, Koppal A, Agius P, Sander C, Leslie C: Comprehensive modeling of

microRNA targets predicts functional non-conserved and non-canonical sites.

Genome Biology 2010, 11(8):R90.

70. Ding J, Li X, Hu H: MicroRNA modules prefer to bind weak and unconventional

target sites. Bioinformatics 2015, 31(9):1366-1374.

71. Ding J, Li X, Hu H: TarPmiR: a new approach for microRNA target site prediction.

Bioinformatics 2016, 32(18):2768-2775.

72. Ding J, Li X, Hu H: CCmiR: a computational approach for competitive and

cooperative microRNA binding prediction. Bioinformatics 2018, 34(2):198-206.

73. Chi SW, Zang JB, Mele A, Darnell RB: Argonaute HITS-CLIP decodes microRNA–

mRNA interaction maps. Nature 2009, 460(7254):479-486.

74. Chou C-H, Chang N-W, Shrestha S, Hsu S-D, Lin Y-L, Lee W-H, Yang C-D, Hong H-C,

Wei T-Y, Tu S-J: miRTarBase 2016: updates to the experimentally validated

miRNA-target interactions database. Nucleic Acids Res 2016, 44(D1):D239-D247.

75. Hafner M, Landthaler M, Burger L, Khorshid M, Hausser J, Berninger P, Rothballer A,

Ascano Jr M, Jungkamp A-C, Munschauer M: Transcriptome-wide identification of

RNA-binding protein and microRNA target sites by PAR-CLIP. Cell 2010,

141(1):129-141.

76. Moore MJ, Scheel TKH, Luna JM, Park CY, Fak JJ, Nishiuchi E, Rice CM, Darnell RB:

MiRNA-target chimeras reveal miRNA 3′-end pairing as a major determinant of

Argonaute target specificity. Nature Communications 2015, 6:1-17.

77. Li X, Hu H: Improving miRNA target prediction using CLASH data. In: MicroRNA

Target Identification. Springer; 2019: 75-83.

78. Lu Y, Leslie CS: Learning to predict miRNA-mRNA interactions from AGO CLIP

sequencing and CLASH data. PLOS Computational Biology 2016, 12(7):e1005026.

79. Wang X: Improving microRNA target prediction by modeling with unambiguously

identified microRNA-target pairs from CLIP-ligation studies. Bioinformatics 2016,

32(9):1316-1322.

80. Fu H-Y, Xue D-Y, Zhang X, Yang P-Y: Assessing potential miRNA targets based on a

Markov model. Genetics and Molecular Research 2009, 8(3):848-860.

81. Krüger J, Rehmsmeier M: RNAhybrid: microRNA target prediction easy, fast and

flexible. Nucleic Acids Res 2006, 34(suppl_2):W451-W454.

82. Agarwal V, Bell G, Nam J, Bartel D: Predicting effective microRNA target sites in

mammalian mRNAs. eLife 2015, 4:e05005.

83. Friedman RC, Farh KK-H, Burge CB, Bartel DP: Most mammalian mRNAs are

conserved targets of microRNAs. Genome research 2009, 19(1):92-105.

84. Lewis BP, Shih I-h, Jones-Rhoades MW, Bartel DP, Burge CB: Prediction of

mammalian microRNA targets. Cell 2003, 115(7):787-798.

85. Miranda K, Huynh T, Tay Y, Ang Y, Tam W, Thomson A, Lim B, Rigoutsos I: A

pattern-based method for the identification of microRNA binding sites and their

corresponding heteroduplexes. Cell 2006, 126(6):1203-1217.

86. Neilsen CT, Goodall GJ, Bracken CP: IsomiRs – the overlooked repertoire in the

dynamic microRNAome. Trends in Genetics 2012, 28:544-549.