

Integration of Multi-Omics Data for Gene Regulatory Network Inference and Application to Breast Cancer

Lin Yuan, Le-Hang Guo, Chang-An Yuan, Youhua Zhang, Kyungsook Han, Asoke K. Nandi, Barry Honig, and De-Shuang Huang

Abstract—Underlying a cancer phenotype is a specific gene regulatory network that represents the complex regulatory relationships between genes. It remains, however, a challenge to find cancer-related gene regulatory networks because of insufficient sample sizes and complex regulatory mechanisms in which a gene is influenced not only by other genes but also by other biological factors. The development of high-throughput technologies and the unprecedented wealth of multi-omics data give us a new opportunity to design machine learning methods to investigate the underlying gene regulatory network. In this paper, we propose an approach that uses Biweight Midcorrelation to measure the correlation between factors and Nonconvex Penalty based sparse regression for Gene Regulatory Network inference (BMNPGRN). BMNPGRN incorporates multi-omics data (including DNA methylation and copy number variation) and their interactions in the gene regulatory network model. Experimental results on synthetic datasets show that BMNPGRN outperforms popular and state-of-the-art methods (including DCGRN, ARACNE, and CLR) under false positive control. Furthermore, we applied BMNPGRN to breast cancer (BRCA) data from The Cancer Genome Atlas database and obtained a gene regulatory network.

Index Terms—Biweight midcorrelation, differential correlation, nonconvex penalty, gene regulatory network, stability selection

1 INTRODUCTION

A gene regulatory network (GRN) is a biological process that represents the complex regulatory relationships between genes. Research on the structures and dynamics of GRNs provides important insights into the mechanisms of complex diseases (e.g., breast cancer or brain tumors). It remains, however, a challenge to understand gene regulatory networks because of complex regulatory mechanisms in which a gene is influenced not only by other genes but also by other biological factors such as DNA methylation (DM) and copy number variation (CNV) of genes. It is therefore essential to study GRNs using multi-omics data. Meanwhile, the rapid development of high-throughput sequencing (HTS) has led to detailed clinical records and heterogeneous data for more than 1000 breast cancer samples.

DNA methylation, one of the best-known epigenetic markers, plays an important role in modifying gene expression levels; it is one of the heritable changes in gene expression that are not due to any alteration in the DNA sequence. DNA methylation controls gene expression through changes in DNA stability, chromatin structure, and DNA-protein interactions. In addition, DNA methylation generally inhibits gene expression [1], [2]. However, the mechanism by which DNA methylation regulates genes is still an unsolved problem. A CNV is a DNA segment of more than 1 kilobase in size. Specifically, CNV is a kind of structural variation, namely a duplication or deletion event that affects a considerable number of base pairs. Many studies have shown that CNV is an important biological factor affecting gene expression. Recently, biological experiments have shown that GRN methods that integrate DNA methylation data and CNV data perform better than methods that use gene expression data only [3], [4], [5], [6].

• L. Yuan and D. Huang are with the Institute of Machine Learning and Systems Biology, School of Electronics and Information Engineering, Tongji University, Caoan Road 4800, Shanghai 201804, China. E-mail: yuanlindc@126.com, dshuang@tongji.edu.cn.

• L. Guo is with the Department of Medical Ultrasound, Shanghai Tenth People’s Hospital, Ultrasound Research, and Education Institute, Tongji University School of Medicine, Yanchang Middle Road 301, Shanghai 200072, China. E-mail: [email protected].

• C. A. Yuan is with the Science Computing and Intelligent Information Processing of GuangXi Higher Education Key Laboratory, Guangxi Teachers Education University, Nanning, Guangxi 530001, China. E-mail: [email protected].

• Y. Zhang is with the School of Information and Computer, Anhui Agricultural University, Changjiang West Road 130, Hefei, Anhui, China. E-mail: [email protected].

• K. Han is with the School of Computer Science, Engineering Inha University, Incheon 22212, South Korea. E-mail: [email protected].

• A. K. Nandi is with the Department of Electronic and Computer Engineering, Brunel University London, Uxbridge, UB8 3PH, United Kingdom. E-mail: [email protected].

• B. Honig is with the Center for Computational Biology and Bioinformatics, 1130 St. Nicholas Avenue, Room 815, New York, NY 10032. E-mail: [email protected].

Manuscript received 24 Apr. 2018; revised 2 Aug. 2018; accepted 16 Aug. 2018. Date of publication 23 Aug. 2018; date of current version 31 May 2019. (Corresponding author: Lin Yuan.) Recommended for acceptance by D.-S Huang. For information on obtaining reprints of this article, please send e-mail to: [email protected], and reference the Digital Object Identifier below. Digital Object Identifier no. 10.1109/TCBB.2018.2866836

IEEE/ACM TRANSACTIONS ON COMPUTATIONAL BIOLOGY AND BIOINFORMATICS, VOL. 16, NO. 3, MAY/JUNE 2019

This work is licensed under a Creative Commons Attribution 3.0 License. For more information, see http://creativecommons.org/licenses/by/3.0/

Early experiments showed that the expression levels of functionally related genes are correlated [7]. Correlation measures between genes were therefore applied to gene regulatory network analysis [8]. One of the best-known correlation measures is the Pearson correlation [9], [10]. Gene co-expression network analysis is one of the most important forms of gene regulatory network analysis, and weighted gene co-expression network analysis (WGCNA) is a representative correlation-based method [11]. To suit different data characteristics, methods based on different correlation coefficients have been proposed, for example gene network analyses based on mutual information, maximum information correlation, cosine similarity, Spearman rank correlation, and conditional mutual information. Compared with gene co-expression network analysis, gene differential co-expression analysis based on correlation measures can discover small but important biomolecular changes by identifying subtle changes in gene expression levels between case and control groups [12]. For example, gene regulatory network analysis based on GO terms and on identifying differentially co-expressed genes and links from gene expression microarray data (DCGL) outperforms WGCNA in terms of sensitivity and specificity and in results on real data sets. However, the relationship between genes remains unclear because gene interactions are mediated by other biological factors. Moreover, correlation-based methods use an undirected graph, which cannot explain the regulatory relationships between genes [7], [13], [14].

Bayesian-based methods make it possible to infer the regulatory relationships between genes based on a directed acyclic graph (DAG) [15]. Many studies have reported directed gene regulation graphs [16]. With the development of Bayesian networks, dynamic Bayesian networks were used to model time-varying gene regulatory networks. Moreover, Bayesian-based methods can deal with missing data and noise [17], and they can efficiently incorporate prior biological knowledge, such as structural information, into the model [16]. However, Bayesian-based methods are not suitable for large-scale biological data sets because their runtime increases exponentially with the number of genes [18], [19], [20].

Regression-based methods decompose the complex gene regulatory network problem into P regression problems, where P is the number of genes. In practice, regression-based methods must cope with small sample sizes and high dimensionality, so regression-based gene regulatory network inference models include regularization and are sparse. Regularized regression models have often adopted LASSO for selecting significant network-related biological factors (e.g., genes, DNA methylation sites, CNVs) [21]. They remain efficient even on large-scale biological data sets, and they were among the high-ranking methods in the HPN-DREAM network inference challenge. Recently, several regression-based studies have focused on inferring gene regulatory networks by combining heterogeneous data [22]. Experimental results on synthetic and real data show that combining information from a variety of biological factors can improve the accuracy of gene regulatory network inference [23], [24].

DCGRN (DNA methylation and CNV Gene Regulatory Network) is a representative regularized regression model [22]. It is an integrative gene regulatory network inference method that uses gene expression, DNA methylation, and CNV data [22]. This method assumes that the number of DNA methylation sites related to a gene is one [25]. However, biological experiments have shown that multiple methylation sites affect one gene [26]: a functional module formed by the interaction of multiple methylation sites affects gene expression. The authors of [22] applied LASSO to select the biological factors of the gene regulatory network. LASSO may not be directly applicable to multi-omics data because it ignores the prior information present in the wealth of multi-omics data. In addition, although LASSO has achieved good results [21], [22], L1-regularization is a convex relaxation of L0-regularization and often leads to suboptimal solutions because it is not a good approximation to L0-regularization. Nonconvex penalty functions such as SCAD (Smoothly Clipped Absolute Deviation) [27] and MCP (Minimax Concave Penalty) [28] have been applied to sparse problems and achieved better performance than traditional methods.

In this paper, we propose an approach that uses Biweight Midcorrelation to measure the correlation between factors and Nonconvex Penalty based sparse regression for Gene Regulatory Network inference (BMNPGRN). BMNPGRN integrates heterogeneous multi-omics data, namely gene expression data, DNA methylation data, and copy number variation data, to infer the gene regulatory network. First, we combine the biweight midcorrelation coefficient, which is efficient to compute, with a ‘differential correlation strategy’ to learn associations among DNA methylation sites. Then, nonconvex penalty based sparse regression is used to find gene-related biological factors, with the parameters of the method determined by cross-validation. The nonconvex penalty based sparse regression is run under stability selection, which can control false positives effectively [29]. Finally, BMNPGRN identifies the gene regulatory network based on the probabilities of biological factors (i.e., genes, DNA methylation sites, and CNVs).

Our proposed approach BMNPGRN has several advantages over existing gene regulatory network inference methods. First, BMNPGRN can find more DNA methylation sites that are associated with the gene regulatory network, which provides deeper insight into gene regulation mechanisms. Second, BMNPGRN can effectively control false positives using the stability selection strategy. Third, BMNPGRN can find biological factors more accurately using nonconvex penalty based sparse regression. Finally, to the best of our knowledge, BMNPGRN is the first method applied to breast cancer data obtained by high-throughput sequencing technology.

In our experiments, we first compared the receiver operating characteristic (ROC) performance of BMNPGRN with that of well-known gene regulatory network inference methods (DCGRN, ARACNE, and CLR) [22], [30], [31] on two kinds of synthetic data sets. The results show that BMNPGRN significantly improves the detection of gene regulatory networks under false positive control. The mean and variance of the AUC values obtained over multiple experiments show that BMNPGRN is more stable than state-of-the-art biological network inference methods. We then applied BMNPGRN to breast cancer data from The Cancer Genome Atlas (TCGA) database and identified several cancer-related gene regulatory networks.

2 METHODS

Before introducing our method, we summarize the notation used in this article. Matrices are denoted by boldface uppercase letters, vectors by boldface lowercase letters, and scalars by lowercase letters. We denote the gene expression matrix by $\mathbf{X} \in \mathbb{R}^{N \times P}$, where $N$ is the number of samples and $P$ is the number of genes; $\mathbf{x}_j$ is the $j$th column of the gene expression matrix, $\mathbf{x}^i$ is its $i$th row, and $x^i_j$ is its $(i, j)$ entry. The DNA methylation matrix is denoted by $\mathbf{D} \in \mathbb{R}^{N \times Q}$, with $N$ samples and $Q$ DNA methylation sites, and the copy number variation matrix is denoted by $\mathbf{C} \in \mathbb{R}^{N \times P}$, with $P$ CNVs.

Next, we show how to discover the gene regulatory network. In brief, we first present a new scheme to find gene-related DNA methylation sites and then select significant gene regulation-related biological factors (including genes and DNA methylation sites) using nonconvex penalty based sparse regression. In addition, we run the nonconvex penalty based sparse regression under stability selection, which can control false positives effectively.

2.1 A New Scheme to Find Gene-Related DNA Methylation Sites

Given datasets containing gene expression data, copy number variation data, and DNA methylation data, we use the heterogeneous multi-omics data to infer the gene regulatory network. In general, a dataset includes both patient and normal samples. The correlation coefficient between functionally related DNA methylation sites varies greatly between normal and patient samples. Based on this characteristic of biological data, we propose a new scheme to find gene-related DNA methylation sites. First, we calculate the correlation coefficient between the expression vectors of DNA methylation sites. Many methods have been proposed to measure the correlation between two variables, and the Pearson correlation coefficient is a representative one [9], [10]. However, the Pearson correlation coefficient is sensitive to outliers. Biweight midcorrelation is considered a good alternative because it is more robust to outliers. We therefore use biweight midcorrelation, described in the next section, to calculate the correlation coefficient between DNA methylation sites. On the basis of the correlation coefficients from patient samples and normal samples, we then adopt a differential correlation strategy. This strategy is similar to case-control correlation measurement followed by threshold-based variable selection [32], [33], [34].

2.1.1 Biweight Midcorrelation Coefficient

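To introduce the biweight midcorrelation coefficient (BIMC) of two numeric vectors $\mathbf{x} = (x_1, \ldots, x_n)$ and $\mathbf{y} = (y_1, \ldots, y_n)$, which can be two column vectors of the DNA methylation matrix (see Fig. 1), we define $u_i$ and $v_i$ for $i = 1, \ldots, n$ as follows:

$$u_i = \frac{x_i - \mathrm{med}(\mathbf{x})}{T \times \mathrm{mad}(\mathbf{x})} \tag{1}$$

$$v_i = \frac{y_i - \mathrm{med}(\mathbf{y})}{T \times \mathrm{mad}(\mathbf{y})} \tag{2}$$

$$\mathrm{mad}(\mathbf{x}) = \mathrm{med}\left(\left|x_i - \mathrm{med}(\mathbf{x})\right|\right), \tag{3}$$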

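where $\mathrm{med}(\mathbf{x})$ and $\mathrm{med}(\mathbf{y})$ are the medians of vectors $\mathbf{x}$ and $\mathbf{y}$, respectively, and $\mathrm{mad}(\cdot)$ denotes the median absolute deviation of a numeric vector. Based on $u_i$ and $v_i$, the weights $w_i^{(x)}$ for $x_i$ and $w_i^{(y)}$ for $y_i$ are defined as follows:

$$w_i^{(x)} = \left(1 - u_i^2\right)^2 I\left(1 - |u_i|\right) \tag{4}$$

$$w_i^{(y)} = \left(1 - v_i^2\right)^2 I\left(1 - |v_i|\right), \tag{5}$$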

where $I$ is an indicator function. In equation (5), $I(1 - |v_i|)$ equals 1 if $1 - |v_i| > 0$ and 0 otherwise; the same holds for equation (4). From equations (2) and (5), as the difference between $y_i$ and $\mathrm{med}(\mathbf{y})$ becomes smaller, $w_i^{(y)}$ gets closer to 1. If the absolute difference between $y_i$ and $\mathrm{med}(\mathbf{y})$ is larger than $T \times \mathrm{mad}(\mathbf{y})$, $w_i^{(y)}$ equals 0. The same holds for equations (1) and (4). $T$ is a pre-defined parameter. In practice, $T$ is chosen between 5 and 9; the larger $T$, the fewer values are filtered out. In this article, $T$ is set to 9, the highest of these values, so as to include all potentially interesting values. Users can also choose $T$ based on the characteristics of the data and the expected proportion of outliers. The weights of all outliers are guaranteed to be 0. Based on $w_i^{(x)}$ and $w_i^{(y)}$, we define the BIMC of vectors $\mathbf{x}$ and $\mathbf{y}$ as follows:

$$\mathrm{pre}(\mathbf{x}, \mathbf{y}) = \left( \sqrt{\sum_{i=1}^{n} \left[ \left(x_i - \mathrm{med}(\mathbf{x})\right) \times w_i^{(x)} \right]^2} \times \sqrt{\sum_{i=1}^{n} \left[ \left(y_i - \mathrm{med}(\mathbf{y})\right) \times w_i^{(y)} \right]^2} \right)^{-1} \tag{6}$$

$$\mathrm{BIMC}(\mathbf{x}, \mathbf{y}) = \sum_{i=1}^{n} \left(x_i - \mathrm{med}(\mathbf{x})\right) \times w_i^{(x)} \times \left(y_i - \mathrm{med}(\mathbf{y})\right) \times w_i^{(y)} \times \mathrm{pre}(\mathbf{x}, \mathbf{y}), \tag{7}$$

where $\mathrm{BIMC}(\mathbf{x}, \mathbf{y})$ denotes the biweight midcorrelation of the two vectors. Note that the BIMC ranges from -1 to 1. If there is a strong positive linear relationship between DNA methylation vectors, the BIMC is close to 1. If there is a strong negative linear relationship between methylation vectors, the BIMC is close to -1. If there is no linear relationship, or only a weak one, between methylation vectors, the BIMC is 0 or close to 0.
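Equations (1) to (7) can be illustrated with a short plain-Python sketch. This is a minimal sketch under stated assumptions rather than a reference implementation: the function names `mad`, `biweight_weights`, and `bimc` are illustrative, and raising an error when the median absolute deviation is zero is an assumption, since the text does not cover that case.

```python
import math
from statistics import median


def mad(values):
    """Median absolute deviation of a numeric vector, equation (3)."""
    m = median(values)
    return median(abs(v - m) for v in values)


def biweight_weights(values, T=9.0):
    """Return the median and the weights of equations (4)/(5)."""
    m = median(values)
    scale = T * mad(values)
    if scale == 0:
        # Assumption: the text does not say how to handle mad = 0.
        raise ValueError("median absolute deviation is zero")
    weights = []
    for v in values:
        u = (v - m) / scale  # equations (1)/(2)
        # Indicator I(1 - |u|): weight is zero once |u| >= 1.
        weights.append((1.0 - u * u) ** 2 if abs(u) < 1.0 else 0.0)
    return m, weights


def bimc(x, y, T=9.0):
    """Biweight midcorrelation of x and y, equations (6) and (7)."""
    if len(x) != len(y):
        raise ValueError("x and y must have the same length")
    mx, wx = biweight_weights(x, T)
    my, wy = biweight_weights(y, T)
    a = [(xi - mx) * w for xi, w in zip(x, wx)]
    b = [(yi - my) * w for yi, w in zip(y, wy)]
    # pre(x, y) is the inverse of the product of the two norms.
    norm = math.sqrt(sum(v * v for v in a)) * math.sqrt(sum(v * v for v in b))
    return sum(p * q for p, q in zip(a, b)) / norm
```

With $T = 9$, a value lying farther than nine median absolute deviations from the median receives weight zero, so a single extreme measurement does not enter the sums in equations (6) and (7).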

Fig. 1. An example of DNA methylation matrix.


To explain how biweight midcorrelation is used to calculate the correlation coefficient of DNA methylation sites, we take the DNA methylation matrix as an example. An example of a DNA methylation matrix is shown in Fig. 1.

For each sample $S_i$, we measure the expression level at each DNA methylation site; $d_{ij}$ denotes the expression level of the $j$th DNA methylation site for the $i$th sample, where $j = 1, \ldots, q$. The $i$th column vector of the DNA methylation matrix contains the expression levels of the $i$th DNA methylation site ($DMs_i$). We can then use, for example, the first and second column vectors of the matrix to calculate the correlation coefficient between $DMs_1$ and $DMs_2$.

In this study, the absolute value of the BIMC of DNA methylation sites in the same functional module approaches 1. The stronger the association, the larger the absolute BIMC value of the DNA methylation sites. To find DNA methylation sites that are strongly associated with a specific DNA methylation site, we use BIMC with the ‘differential correlation strategy’, which is described in the next section.

2.1.2 Differential Correlation Strategy

Reasonable use of data characteristics can provide effective prior information. According to existing biological research, the value of a methylation site in patient samples differs from that in control samples; likewise, the association between methylation sites in patient samples differs from that in control samples. In addition, DNA methylation sites are often located in the core sequences of gene promoters and transcription start sites (TSS) [35], and they often affect the nearest gene [36]. This prior information helps us find important DNA methylation sites for specific genes. In this section, we introduce the ‘differential correlation strategy’, which finds gene-related DNA methylation sites using samples with two different labels (disease and normal). The strategy is similar to gene differential co-expression analysis, which finds biological pathways or gene functional modules by measuring changes in gene correlation between patient samples and control samples [32], [33], [34].

First, for a specific gene $G$, we determine the corresponding DNA methylation site $DMs_G$ at the promoter or transcription start site of $G$ by searching a relevant public database (e.g., MethDB: http://www.methdb.de/). Second, we calculate the BIMCs between $DMs_G$ and all other DNA methylation sites in the patient samples and in the control samples, respectively. The vector $\mathrm{BIMC}_{dmps} = [a_1, a_2, \ldots, a_{q-1}]$ contains the BIMCs between $DMs_G$ and all other DNA methylation sites in the patient samples, and the vector $\mathrm{BIMC}_{dmcs} = [b_1, b_2, \ldots, b_{q-1}]$ contains the BIMCs between $DMs_G$ and all other DNA methylation sites in the control samples. Elements at the same position in the two vectors (e.g., $a_1$ and $b_1$) are calculated from the same pair of DNA methylation sites. Third, to find differentially correlated methylation sites, we set two thresholds $TS_1$ and $TS_2$ and build a vector $\mathrm{BIMC}_{dmpc} = [e_1, e_2, \ldots, e_{q-1}]$ of zeros and ones (0 for no correlation and 1 for correlation) that indicates whether each methylation site is related to $DMs_G$. If $|\mathrm{BIMC}_{dmps}(i)| \le TS_1$ and $|\mathrm{BIMC}_{dmcs}(i)| \ge TS_2$, $\mathrm{BIMC}_{dmpc}(i)$ is set to 1; if $|\mathrm{BIMC}_{dmps}(i)| \ge TS_2$ and $|\mathrm{BIMC}_{dmcs}(i)| \le TS_1$, $\mathrm{BIMC}_{dmpc}(i)$ is also set to 1; otherwise $\mathrm{BIMC}_{dmpc}(i)$ is set to 0. $\mathrm{BIMC}_{dmpc}(i) = 1$ means that the $i$th DNA methylation site and $DMs_G$ are strongly correlated in patient samples but weakly correlated in control samples, or weakly correlated in patient samples but strongly correlated in control samples, which indicates that the $i$th methylation site and $DMs_G$ function together to affect the expression of gene $G$. $TS_1$ is set to 0.3 and $TS_2$ is set to 0.7. Meanwhile, for each methylation site that interacts with $DMs_G$ to affect gene $G$, we calculate the absolute difference between the absolute values of its two BIMC values, i.e., $\mathrm{BIMC}_{abs} = \left|\, |\mathrm{BIMC}_{dmps}(i)| - |\mathrm{BIMC}_{dmcs}(i)| \,\right|$. Finally, we select DNA methylation sites based on $\mathrm{BIMC}_{dmpc}$, sort them by $\mathrm{BIMC}_{abs}$, and keep the high-ranking methylation sites.
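The thresholding and ranking steps of the differential correlation strategy can be sketched as follows. This is only an illustrative sketch: the function name `differential_sites`, the `top_k` cut-off, and the tie-breaking by site index are assumptions, since the text does not state how many high-ranking sites are kept. The two inputs are the vectors of BIMC values computed in patient and control samples.

```python
def differential_sites(bimc_patient, bimc_control, ts1=0.3, ts2=0.7, top_k=None):
    """Return indices of differentially correlated sites, best first.

    bimc_patient[i] and bimc_control[i] are the BIMC values between
    DMs_G and the same i-th site in patient and control samples.
    """
    candidates = []
    for i, (a, b) in enumerate(zip(bimc_patient, bimc_control)):
        a_abs, b_abs = abs(a), abs(b)
        weak_then_strong = a_abs <= ts1 and b_abs >= ts2
        strong_then_weak = a_abs >= ts2 and b_abs <= ts1
        if weak_then_strong or strong_then_weak:  # BIMC_dmpc(i) = 1
            candidates.append((abs(a_abs - b_abs), i))  # BIMC_abs
    # Sort by decreasing BIMC_abs; ties are broken by site index (assumption).
    candidates.sort(key=lambda item: (-item[0], item[1]))
    ranked = [i for _, i in candidates]
    return ranked if top_k is None else ranked[:top_k]
```

For example, a site with BIMC 0.9 in patients and 0.1 in controls is kept, whereas a site with 0.8 in both groups is discarded because its correlation with $DMs_G$ does not change between the groups.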

2.2 BMNPGRN Model

Our proposed method BMNPGRN is used to discover a directed network that encodes the regulatory relationships over a set of genes and selected methylation sites; the selected methylation sites are obtained by applying the biweight midcorrelation coefficient and the differential correlation strategy to the raw data. We also focus on the impact of a CNV on the gene located in the CNV region. Let $\mathbf{X} \in \mathbb{R}^{N \times P}$ denote the gene expression matrix of $N$ samples and $P$ selected genes, $\mathbf{D} \in \mathbb{R}^{N \times Q_S}$ the DNA methylation matrix of $N$ samples and $Q_S$ DNA methylation sites, and $\mathbf{C} \in \mathbb{R}^{N \times P}$ the copy number variation matrix of $N$ samples and $P$ CNVs. The three data matrices are defined as $\mathbf{X} = [\mathbf{x}_1, \mathbf{x}_2, \ldots, \mathbf{x}_p]$, $\mathbf{D} = [\mathbf{d}_1, \mathbf{d}_2, \ldots, \mathbf{d}_{q_s}]$, and $\mathbf{C} = [\mathbf{c}_1, \mathbf{c}_2, \ldots, \mathbf{c}_p]$, where $\mathbf{x}_i$, $\mathbf{d}_i$, and $\mathbf{c}_i$ are the $i$th column vectors of $\mathbf{X}$, $\mathbf{D}$, and $\mathbf{C}$, respectively. The expression vector of a gene affected by biological factors is defined as follows:

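$$\mathbf{x}_i = \mathbf{X}\mathbf{f}_i + \mathbf{D}\mathbf{g}_i + \mathbf{C}\mathbf{w}_i + u_i + \boldsymbol{\varepsilon}_i, \tag{8}$$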

where $\mathbf{f}_i$ and $\mathbf{w}_i$ are the $i$th column vectors of the adjacency matrices $\mathbf{F} \in \mathbb{R}^{P \times P}$ and $\mathbf{W} \in \mathbb{R}^{P \times P}$, respectively; $\mathbf{g}_i$ is a sub-vector, with $Q_S$ elements, of the $i$th column vector of the adjacency matrix $\mathbf{G} \in \mathbb{R}^{Q \times Q}$, and the other elements of that column are 0. $u_i$ is a model bias and $\boldsymbol{\varepsilon}_i$ is a residual. $f_{ij}$ represents the mode of regulation from the $i$th gene to the $j$th gene: positive for activation, negative for repression, and 0 for no regulation. In this study, we focus on the regulatory relationships among genes, so there is no self-regulation and every diagonal element $f_{ii} = 0$. There are also no two-node cycles (i.e., $f_{ij}$ and $f_{ji}$ both non-zero) in BMNPGRN. In addition, we assume that a gene can be directly affected by the methylation site belonging to the gene and by other DNA methylation sites that are highly correlated with that site, and that a gene can be directly affected by the CNV that belongs to the gene but by no other CNVs. $g_{ij}$ and $w_{ii}$ represent the regulation weights of the DNA methylation sites and the CNV of the $i$th gene. In equation (8), we want to find the $\mathbf{F}$, $\mathbf{G}$, and $\mathbf{W}$ that best represent the weights of genes, DNA methylation sites, and CNVs. Our goal is to estimate the $\mathbf{f}_i$, $\mathbf{g}_i$, and $\mathbf{w}_i$ that minimize $\boldsymbol{\varepsilon}_i$. After the bias is removed by mean centering, equation (8) can be restated as follows:

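$$\min_{\mathbf{f}_i, \mathbf{g}_i, \mathbf{w}_i} \left\| \mathbf{x}_i - \mathbf{X}\mathbf{f}_i - \mathbf{D}\mathbf{g}_i - \mathbf{C}\mathbf{w}_i \right\|_2^2, \tag{9}$$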


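where $\|\cdot\|_2$ denotes the L2-norm. In general, an L1-norm penalty is often used to avoid overfitting and to accurately find the genes, DNA methylation sites, and CNVs that influence the $i$th gene. Applying the L1-norm penalty to all columns of $\mathbf{F}$, $\mathbf{G}$, and $\mathbf{W}$, equation (9) with L1-regularization can be expressed as follows:

$$\min_{\mathbf{f}_i, \mathbf{g}_i, \mathbf{w}_i} \left\| \mathbf{x}_i - \mathbf{X}\mathbf{f}_i - \mathbf{D}\mathbf{g}_i - \mathbf{C}\mathbf{w}_i \right\|_2^2 + \lambda_1 \left\| \mathbf{f}_i \right\|_1 + \lambda_2 \left\| \mathbf{g}_i \right\|_1 + \lambda_3 \left\| \mathbf{w}_i \right\|_1, \tag{10}$$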

where $\lambda_1$, $\lambda_2$ and $\lambda_3$ are hyper-parameters for sparsity regularization. Equation (10) can be rewritten as follows:

$$L(\beta_i) = \min_{\beta_i} \left\| x_i - Y \beta_i \right\|_2^2 + \lambda \| \beta_i \|_1, \tag{11}$$

where

$$Y = \left[ x_1, x_2, \ldots, x_{i-1}, x_{i+1}, \ldots, x_p, d_1, \ldots, d_i, \ldots, d_{q_s}, c_i \right] \tag{12}$$

$$\beta_i = \left[ f_{i1}, f_{i2}, \ldots, f_{i,i-1}, f_{i,i+1}, \ldots, f_{ip}, g_1, \ldots, g_i, \ldots, g_{q_s}, w_{ii} \right] \tag{13}$$
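The column layout of equations (12) and (13) can be illustrated with a minimal Python sketch. The helper names `build_design` and `split_weights`, the list-of-rows data layout and the toy numbers are assumptions made for illustration; they do not come from the paper.

```python
def build_design(X, D_sel, c_i, i):
    """Assemble Y of equation (12) for target gene i (0-based index).

    X     : N x P list of rows (gene expression)
    D_sel : N x Q_S list of rows (selected DNA methylation sites)
    c_i   : length-N list (CNV values of gene i)
    """
    Y = []
    for row_x, row_d, c in zip(X, D_sel, c_i):
        genes = list(row_x[:i]) + list(row_x[i + 1:])  # drop the target gene itself
        Y.append(genes + list(row_d) + [c])
    return Y


def split_weights(beta, P, Q_S):
    """Split beta_i of equation (13) into gene, methylation and CNV weights."""
    assert len(beta) == (P - 1) + Q_S + 1
    f = beta[:P - 1]
    g = beta[P - 1:P - 1 + Q_S]
    w = beta[-1]
    return f, g, w


# Toy data: N = 2 samples, P = 3 genes, Q_S = 2 selected methylation sites.
X = [[1.0, 2.0, 3.0], [4.0, 5.0, 6.0]]
D_sel = [[0.1, 0.2], [0.3, 0.4]]
c_i = [1, -1]
Y = build_design(X, D_sel, c_i, i=1)
f, g, w = split_weights([0.5, 0.0, 0.2, 0.0, 1.0], P=3, Q_S=2)
```

Each row of `Y` has $(P - 1) + Q_S + 1 = P + Q_S$ entries, which matches the number of penalized weights per gene.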

where $[d_1, \ldots, d_i, \ldots, d_{q_s}]$ represents the expression vectors of the $Q_S$ selected DNA methylation sites and $[g_1, \ldots, g_i, \ldots, g_{q_s}]$ represents the corresponding weights. Equation (11) is a Least Absolute Shrinkage and Selection Operator (LASSO) problem [21]. Nonconvex penalties such as the smoothly clipped absolute deviation (SCAD) and the minimax concave penalty (MCP) are considered worthwhile alternatives to the LASSO. Next, we show how to use nonconvex penalties and a coordinate descent algorithm to find the biological factors related to each gene.
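As a point of reference for the coordinate descent idea, the LASSO problem of equation (11) can be solved with a short cyclic coordinate descent loop. The following is a minimal pure-Python sketch, not the authors' implementation; the function names, the fixed iteration count and the toy data are assumptions. For the objective $\|x - Y\beta\|_2^2 + \lambda\|\beta\|_1$, the update of one coordinate is a soft-thresholding step with threshold $\lambda/2$.

```python
def soft_threshold(z, t):
    """Soft-thresholding operator S(z, t) = sign(z) * max(|z| - t, 0)."""
    if z > t:
        return z - t
    if z < -t:
        return z + t
    return 0.0


def lasso_cd(Y, x, lam, n_iter=200):
    """Minimize ||x - Y b||_2^2 + lam * ||b||_1 by cyclic coordinate descent."""
    n, m = len(Y), len(Y[0])
    b = [0.0] * m
    r = list(x)  # residual x - Y b, with b = 0 at the start
    col_sq = [sum(Y[k][j] ** 2 for k in range(n)) for j in range(m)]
    for _ in range(n_iter):
        for j in range(m):
            if col_sq[j] == 0.0:
                continue
            # Inner product of column j with the partial residual (b_j removed).
            rho = sum(Y[k][j] * r[k] for k in range(n)) + col_sq[j] * b[j]
            new_bj = soft_threshold(rho, lam / 2.0) / col_sq[j]
            delta = new_bj - b[j]
            if delta != 0.0:
                for k in range(n):
                    r[k] -= Y[k][j] * delta
                b[j] = new_bj
    return b


# Toy orthogonal design: the small second coefficient is shrunk to exactly zero.
Y_toy = [[1.0, 0.0], [0.0, 1.0], [0.0, 0.0]]
x_toy = [3.0, 0.4, 0.0]
b_hat = lasso_cd(Y_toy, x_toy, lam=1.0)
```

With a nonconvex penalty, only the thresholding step inside the loop changes; the residual bookkeeping stays the same.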

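The MCP penalty is defined on $[0, \infty)$ by:

$$k_{\lambda,\gamma}(u) = \begin{cases} \lambda u - \dfrac{u^2}{2\gamma}, & \text{if } u \le \gamma\lambda, \\ \dfrac{1}{2}\gamma\lambda^2, & \text{if } u > \gamma\lambda, \end{cases} \qquad k'_{\lambda,\gamma}(u) = \begin{cases} \lambda - \dfrac{u}{\gamma}, & \text{if } u \le \gamma\lambda, \\ 0, & \text{if } u > \gamma\lambda, \end{cases} \tag{14}$$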

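for $\lambda \ge 0$ and $\gamma > 1$. SCAD is defined on $[0, \infty)$ by:

$$k_{\lambda,\gamma}(u) = \begin{cases} \lambda u, & \text{if } u \le \lambda, \\ \dfrac{-u^2 + 2\gamma\lambda u - \lambda^2}{2(\gamma - 1)}, & \text{if } \lambda < u \le \gamma\lambda, \\ \dfrac{\lambda^2(\gamma + 1)}{2}, & \text{if } u > \gamma\lambda, \end{cases} \qquad k'_{\lambda,\gamma}(u) = \begin{cases} \lambda, & \text{if } u \le \lambda, \\ \dfrac{\gamma\lambda - u}{\gamma - 1}, & \text{if } \lambda < u \le \gamma\lambda, \\ 0, & \text{if } u > \gamma\lambda, \end{cases} \tag{15}$$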

for $\lambda \ge 0$ and $\gamma > 2$. The above LASSO problem (i.e., equation (11)) can be restated as follows:

$$L(\beta_i) = \min_{\beta_i} \left\| x_i - Y \beta_i \right\|_2^2 + \sum_{j=1}^{P+Q_S} k_{\lambda,\gamma}\left( |\beta_{ij}| \right), \tag{16}$$

where $P + Q_S$ represents the total number of genes, selected DNA methylation sites and CNV.
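The two penalties of equations (14) and (15) translate directly into code. The sketch below is a plain-Python transcription for scalar $u \ge 0$; the function names are chosen here for illustration and are not taken from any particular library.

```python
def mcp(u, lam, gamma):
    """MCP penalty of equation (14), for u >= 0, lam >= 0, gamma > 1."""
    if u <= gamma * lam:
        return lam * u - u * u / (2.0 * gamma)
    return 0.5 * gamma * lam * lam


def mcp_deriv(u, lam, gamma):
    """Derivative of the MCP penalty with respect to u."""
    return lam - u / gamma if u <= gamma * lam else 0.0


def scad(u, lam, gamma):
    """SCAD penalty of equation (15), for u >= 0, lam >= 0, gamma > 2."""
    if u <= lam:
        return lam * u
    if u <= gamma * lam:
        return (-u * u + 2.0 * gamma * lam * u - lam * lam) / (2.0 * (gamma - 1.0))
    return lam * lam * (gamma + 1.0) / 2.0


def scad_deriv(u, lam, gamma):
    """Derivative of the SCAD penalty with respect to u."""
    if u <= lam:
        return lam
    if u <= gamma * lam:
        return (gamma * lam - u) / (gamma - 1.0)
    return 0.0
```

Both penalties start with the LASSO slope $\lambda$ near zero and become flat for $u > \gamma\lambda$, so large weights are not shrunk further. This is the property that motivates them as alternatives to the L1 penalty.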

Solving equation (16) depends on the choice of the tuning parameters $\lambda$ and $\gamma$, which are usually chosen by cross-validation. However, cross-validation alone often yields many false positives (i.e., weights that are truly zero are non-zero in the estimated weight vector $\beta_i$). To control false positives effectively and to find more truly related genes and DNA methylation sites, we augment equation (16) with stability selection [29] to determine the edges in the gene regulatory network. Stability selection is a bootstrapping-type algorithm that can effectively control false positives. Briefly, it works as follows. First, we randomly select half (i.e., $\lfloor N/2 \rfloor$) of the total samples, $T_{ss}$ times. Second, equation (16) is solved on each selected subsample. Finally, stability selection selects the factors (i.e., genes, DNA methylation sites and copy number variation) whose weights are non-zero in at least $T_{ss} \times z_{ts}$ of the runs, where $z_{ts}$ is a user-defined parameter.

Next, we discuss how to determine the user-defined parameters $T_{ss}$ and $z_{ts}$. In practice, a large number of experimental results show that $T_{ss} \ge 100$ is sufficient to achieve false positive control [29]. Meanwhile, $z_{ts}$ is often chosen between 0.5 and 1; in theory, under certain conditions, [29] gives the relationship between the number of false positives and $z_{ts}$. When finding edges between the genes, DNA methylation sites and copy number variation and the $i$th gene,

$$E(V_i) \le \frac{1}{2 z_{ts} - 1} \cdot \frac{l_{\lambda,\gamma}^2}{P + Q_S} \tag{17}$$

where $E(V_i)$ is the expected number of falsely detected factors (i.e., genes, DNA methylation sites and copy number variation) for the $i$th gene, and $l_{\lambda,\gamma}$ is the number of non-zero weights found by BMNPGRN with $\lambda$ and $\gamma$. Equation (17) shows that the upper bound on the number of false positives decreases as $z_{ts}$ increases; more precisely, it is inversely proportional to $2 z_{ts} - 1$.
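A small worked example makes the effect of $z_{ts}$ in equation (17) concrete. The numbers used below (5 selected factors, $P + Q_S = 40$) are hypothetical and serve only to show the arithmetic; the function name is likewise an assumption.

```python
def false_positive_bound(z_ts, n_selected, n_factors):
    """Upper bound of equation (17) on the expected number of false positives.

    n_selected : number of non-zero weights found for the gene
    n_factors  : P + Q_S, the number of candidate factors
    """
    if not 0.5 < z_ts <= 1.0:
        raise ValueError("the bound requires 0.5 < z_ts <= 1")
    return (n_selected ** 2) / ((2.0 * z_ts - 1.0) * n_factors)


# Hypothetical setting: 5 selected factors among P + Q_S = 40 candidates.
bound_06 = false_positive_bound(0.6, n_selected=5, n_factors=40)
bound_08 = false_positive_bound(0.8, n_selected=5, n_factors=40)
```

In this setting the bound is about 3.1 for $z_{ts} = 0.6$ and about 1.0 for $z_{ts} = 0.8$, so raising the threshold from 0.6 to 0.8 tightens the bound by a factor of three.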

For the $i$th gene, stability selection assigns each factor (i.e., genes, DNA methylation sites and copy number variation) a weight score that reflects its degree of significance. For the association between the $i$th gene and the $j$th factor, the association score $S_i^j$ is defined as the proportion of the randomly selected subsamples in which the $j$th factor is selected.
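The subsampling procedure and the association score described above can be sketched as follows. This is a minimal illustration under stated assumptions: `fit` stands for any routine that solves equation (16) on the given sample indices and returns one weight per factor, and `toy_fit` is a hypothetical stand-in used only to make the example runnable.

```python
import random


def stability_selection(n_samples, fit, T_ss=100, z_ts=0.6, seed=0):
    """Return (scores, selected) from T_ss fits on random half-samples.

    fit(indices) must return a list of weights, one per factor, estimated
    from the samples whose row indices are given.
    """
    rng = random.Random(seed)
    half = n_samples // 2  # floor(N / 2)
    counts = None
    for _ in range(T_ss):
        idx = rng.sample(range(n_samples), half)  # subsample without replacement
        weights = fit(idx)
        if counts is None:
            counts = [0] * len(weights)
        for j, w in enumerate(weights):
            if w != 0.0:
                counts[j] += 1
    scores = [c / T_ss for c in counts]  # association score of each factor
    selected = [j for j, s in enumerate(scores) if s >= z_ts]
    return scores, selected


def toy_fit(idx):
    # Hypothetical stand-in for solving equation (16) on a subsample:
    # factor 0 is always picked, factor 1 never, factor 2 only when sample 0 is drawn.
    return [1.0, 0.0, 1.0 if 0 in idx else 0.0]


scores, selected = stability_selection(n_samples=20, fit=toy_fit, T_ss=100, z_ts=0.6)
```

Selecting factors with a score of at least `z_ts` is the same as requiring a non-zero weight in at least $T_{ss} \times z_{ts}$ of the runs.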

2.3 Evaluation Criteria

We compare our method BMNPGRN with three methods: DCGRN, ARACNE and CLR. To evaluate each method strictly on its balance between true positive rate (TPR) and false positive rate (FPR), we use the receiver operating characteristic (ROC) curve, the area under the receiver operating characteristic curve (AUROC) and the area under the precision-recall curve (AUPRC), which are calculated from the confusion matrix (Fig. 2).

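$$\mathrm{TPR} = \frac{TP}{TP + FN} \tag{18}$$

$$\mathrm{FPR} = \frac{FP}{FP + TN} \tag{19}$$

$$\mathrm{Precision} = \frac{TP}{TP + FP} \tag{20}$$

$$\mathrm{Recall} = \frac{TP}{TP + FN}, \tag{21}$$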

where true positive (TP), false positive (FP), true negative (TN), and false negative (FN) are defined in Fig. 2. We calculate TP, FP, TN, and FN to measure the accuracy criteria, TPR and FPR. The performance of BMNPGRN is compared with that of the other methods using ROC curves, AUROC values and AUPRC values.

786 IEEE/ACM TRANSACTIONS ON COMPUTATIONAL BIOLOGY AND BIOINFORMATICS, VOL. 16, NO. 3, MAY/JUNE 2019
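The quantities in equations (18) to (21) can be computed from a true and an inferred weight vector as in the sketch below. The flat list of candidate edges, the function names and the pairwise AUROC computation are choices made here for illustration; they are not taken from the paper's code.

```python
def confusion_counts(true_w, pred_w):
    """Count TP, FP, TN, FN; an edge is present when its weight is non-zero."""
    tp = fp = tn = fn = 0
    for t, p in zip(true_w, pred_w):
        if p != 0 and t != 0:
            tp += 1
        elif p != 0:
            fp += 1
        elif t != 0:
            fn += 1
        else:
            tn += 1
    return tp, fp, tn, fn


def rates(tp, fp, tn, fn):
    """TPR, FPR, precision and recall as in equations (18) to (21)."""
    def ratio(a, b):
        return a / b if b else 0.0
    return {
        "TPR": ratio(tp, tp + fn),
        "FPR": ratio(fp, fp + tn),
        "Precision": ratio(tp, tp + fp),
        "Recall": ratio(tp, tp + fn),
    }


def auroc(labels, scores):
    """AUROC as the fraction of (positive, negative) pairs ranked correctly."""
    pos = [s for l, s in zip(labels, scores) if l]
    neg = [s for l, s in zip(labels, scores) if not l]
    wins = sum(1.0 if p > n else 0.5 if p == n else 0.0 for p in pos for n in neg)
    return wins / (len(pos) * len(neg))


true_w = [1.0, 0.0, 0.0, -0.7, 0.0, 0.0]   # toy ground-truth edge weights
pred_w = [0.9, 0.2, 0.0, 0.0, 0.0, 0.0]    # toy inferred edge weights
metrics = rates(*confusion_counts(true_w, pred_w))
area = auroc([1, 0, 0, 1], [0.9, 0.8, 0.1, 0.7])
```

Using stability selection scores as the `scores` argument gives one ROC curve per method without fixing a single threshold.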

3 RESULTS

In this section, we first compared the performance of BMNPGRN with three state-of-the-art methods (DCGRN, ARACNE and CLR) on two kinds of synthetic datasets. DCGRN is a method that uses LASSO with DNA methylation sites and CNVs [22]. ARACNE is an algorithm for reverse engineering of gene regulatory networks [30]. CLR is a context likelihood of relatedness-based gene regulatory network inference algorithm [31]. We then applied BMNPGRN to breast cancer data from TCGA and identified several significant gene regulatory networks.

3.1 Performance Comparison on Synthetic Datasets

In this section, the BMNPGRN algorithm was tested from two aspects, and the section is organized as follows.

First, to demonstrate the superiority of BMNPGRN over three state-of-the-art methods (DCGRN, ARACNE and CLR) on the naive synthetic dataset used by those methods, a naive experiment is designed to compare BMNPGRN with them.

Second, the performance of the BMNPGRN algorithm is compared with the same three methods on a more practical synthetic dataset that contains relevant factors (i.e., associated genes and associated DNA methylation sites).

3.1.1 Naive Synthetic Datasets

In this section, we compare the performance of BMNPGRN with three methods (DCGRN, ARACNE and CLR) on the naive synthetic dataset used by those state-of-the-art methods. In this kind of dataset, it is assumed that the $i$th gene is affected only by the DNA methylation site that belongs to the $i$th gene, and that biological factors are independent of each other. We first introduce these naive synthetic random networks, following the synthetic dataset generation method of [22]. $P$ represents the number of genes and is set to 30, 40, and 50. The $P \times P$ matrix $F$ is initialized to the zero matrix, and then elements of $F$ are randomly selected while avoiding any cycle. The parameter $E_G$ represents the average number of inbound edges per gene; the larger $E_G$ is, the more complex the network. The weight value $f_{ij}$ is uniformly distributed over $[0.5, 1]$ or $[-0.5, 1]$. Meanwhile, the adjacency matrices $G$ and $W$ are initialized to zero matrices, and then diagonal elements ($g_{ii}$ and $w_{ii}$) are randomly selected. The parameter $R_{DC}$ represents the proportion of nodes that are regulated by DNA methylation sites and copy number variation. For example, if $P$ and $R_{DC}$ are 30 and 0.2 respectively, then six ($30 \times 0.2$) DNA methylation sites and copy number variations regulate the corresponding genes (i.e., six diagonal elements of $G$ and $W$ are non-zero). The selected weight value $g_{ii}$ is uniformly distributed over $[0.5, 1]$ or $[-0.5, 1]$, and the selected $w_{ii}$ is set to 1. $d_{ij}$ is drawn from the uniform distribution $U[0, 1]$. $c_{ij}$ is randomly set to -2, -1, 0, 1, or 2 with probabilities 0.01, 0.22, 0.55, 0.2 and 0.02, respectively. The gene expression matrix $X$ can then be generated by calculating $X = (DG + CW + E)(F - I)^{-1}$, where each element of $E$ is generated from a zero-mean Gaussian distribution.
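The generation procedure above can be sketched in pure Python. Several details below are assumptions rather than facts from the paper: edge and methylation weights are drawn with a magnitude in $[0.5, 1]$ and a random sign, the same genes are regulated by methylation and by CNV, the noise standard deviation is 0.1, and the expression matrix follows the sign convention of equation (9), i.e. $X(I - F) = DG + CW + E$. Because $F$ is acyclic, $X$ is filled gene by gene in topological order instead of inverting a matrix.

```python
import random


def sample_weight(rng):
    # Assumption: magnitude in [0.5, 1] with a random sign.
    return rng.choice((-1.0, 1.0)) * rng.uniform(0.5, 1.0)


def make_naive_dataset(N, P, E_G, R_DC, noise_sd=0.1, seed=0):
    rng = random.Random(seed)
    order = list(range(P))
    rng.shuffle(order)  # random topological order of the genes
    rank = {gene: k for k, gene in enumerate(order)}
    # F[i][j] != 0 means gene i regulates gene j. Edges only point forward
    # in `order`, so the network cannot contain a cycle.
    F = [[0.0] * P for _ in range(P)]
    pairs = [(i, j) for i in range(P) for j in range(P) if rank[i] < rank[j]]
    for i, j in rng.sample(pairs, min(len(pairs), E_G * P)):
        F[i][j] = sample_weight(rng)
    # Diagonals of G and W: a proportion R_DC of the genes is regulated.
    g = [0.0] * P
    w = [0.0] * P
    for j in rng.sample(range(P), round(R_DC * P)):
        g[j] = sample_weight(rng)
        w[j] = 1.0
    D = [[rng.random() for _ in range(P)] for _ in range(N)]
    cnv_values = (-2, -1, 0, 1, 2)
    cnv_probs = (0.01, 0.22, 0.55, 0.2, 0.02)
    C = [rng.choices(cnv_values, weights=cnv_probs, k=P) for _ in range(N)]
    # Fill X so that x_j = sum_i x_i f_ij + d_j g_jj + c_j w_jj + noise.
    X = [[0.0] * P for _ in range(N)]
    for j in order:
        for n in range(N):
            upstream = sum(X[n][i] * F[i][j] for i in range(P) if F[i][j] != 0.0)
            X[n][j] = upstream + D[n][j] * g[j] + C[n][j] * w[j] + rng.gauss(0.0, noise_sd)
    return X, D, C, F, g, w


X, D, C, F, g, w = make_naive_dataset(N=20, P=10, E_G=1, R_DC=0.3)
```

Setting `noise_sd=0.0` makes every column of `X` an exact linear combination of its regulators, which is a convenient sanity check for an inference method.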

3.1.2 Performance Evaluation on Naive Synthetic Datasets

We compare our method BMNPGRN with three methods: DCGRN, ARACNE and CLR. Given the data $X$, $D$ and $C$, the matrices $F$, $G$ and $W$ are inferred and then compared with the true edges of $F$, $G$ and $W$ to calculate TPR and FPR. For the naive synthetic dataset, we generated two sets of datasets based on different parameter combinations. The parameters of the first set of naive synthetic datasets (FNSD) are: $N \in \{100, 200, 300, 400\}$, $E_G = 1$, $R_{DC} = 0.3$ and $P = 30$. The parameters of the second set of naive synthetic datasets (SNSD) are: $N \in \{100, 200, 300, 400\}$, $E_G = 3$, $R_{DC} = 0.5$ and $P = 50$. For BMNPGRN, we used two different penalty functions, MCP and SCAD, and set $T_{ss} = 100$ and $z_{ts} = 0.6$ or $z_{ts} = 0.8$. The ROC comparison results for the two sets of datasets are shown in Figs. 3 and 4, respectively. The AUROC (Area Under Receiver Operating Characteristic) and AUPRC (Area Under Precision Recall Curve) results for the two sets of datasets are shown in Table S1 and Table S2, respectively, which can be found on the Computer Society Digital Library at http://doi.ieeecomputersociety.org/10.1109/TCBB.2018.2866836. Experimental results show that BMNPGRN outperforms the other methods on the naive synthetic datasets.

3.1.3 Complex Synthetic Datasets

In this section, we compare our method BMNPGRN with three methods (DCGRN, ARACNE and CLR) on complex synthetic datasets. $C$, $W$, $E$, $F$ and $I$ are initialized in the same way as in the naive synthetic dataset. Next, we describe how to generate $D$ and $G$. First, we randomly selected $K$ DNA methylation sites located in the core sequences of gene promoters and transcription start sites (TSS). Second, we generated $K$ groups based on these $K$ DNA methylation sites; the number of DNA methylation sites in each group is randomly selected from 2, 3, and 4. Third, we used the popular value-based co-correlation network method to generate co-correlation expression vectors of $D$ [12]. The correlation coefficient is calculated with the Pearson correlation formula. It should be noted that the samples contain both patient and normal samples. Based on this information, co-correlation networks were constructed. Finally, $d_{ij}$ is drawn from the uniform distribution $U[0, 1]$. The diagonal elements of $G$ are 0. The selected weight value $g_{ij}$ is uniformly distributed over $[0.5, 1]$ or $[-0.5, 1]$.

Fig. 2. Confusion matrix used to evaluate the GRN inference method.

Fig. 3. ROCs of BMNPGRN with DCGRN, ARACNE, and CLR: $N = 100$, $E_G = 1$, $R_{DC} = 0.3$ (top left), $N = 200$, $E_G = 1$, $R_{DC} = 0.3$ (top right), $N = 300$, $E_G = 1$, $R_{DC} = 0.3$ (bottom left), and $N = 400$, $E_G = 1$, $R_{DC} = 0.3$ (bottom right). For BMNPGRN, we show the results with two settings for $z_{ts}$ (0.6 and 0.8) and two penalty functions.
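The grouping step can be illustrated with a simple stand-in. The sketch below is not the value-based co-correlation method of [12]; it assumes, purely for illustration, that each group consists of a seed site drawn from $U[0, 1]$ plus noisy copies of it, and it uses the Pearson formula to confirm that the members of a group are highly correlated. The function names and the `jitter` parameter are hypothetical.

```python
import math
import random


def pearson(a, b):
    """Pearson correlation coefficient of two equal-length sequences."""
    n = len(a)
    ma, mb = sum(a) / n, sum(b) / n
    cov = sum((x - ma) * (y - mb) for x, y in zip(a, b))
    va = sum((x - ma) ** 2 for x in a)
    vb = sum((y - mb) ** 2 for y in b)
    return cov / math.sqrt(va * vb)


def make_group(N, size, rng, jitter=0.05):
    """One group: a seed site drawn from U[0, 1] plus size - 1 noisy copies."""
    seed_site = [rng.random() for _ in range(N)]
    group = [seed_site]
    for _ in range(size - 1):
        # Clip to [0, 1] so the values stay in the range of methylation levels.
        member = [min(1.0, max(0.0, v + rng.gauss(0.0, jitter))) for v in seed_site]
        group.append(member)
    return group


rng = random.Random(0)
K = 3  # hypothetical number of seed sites
groups = [make_group(N=100, size=rng.choice((2, 3, 4)), rng=rng) for _ in range(K)]
within = [pearson(grp[0], grp[1]) for grp in groups]
```

Sites from different groups are generated independently, so their correlation stays near zero, while sites within a group are strongly correlated with their seed.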

For the complex synthetic dataset, we generated two sets of datasets based on different parameter combinations. The parameters of the first set of complex synthetic datasets (FCSD) are: $N \in \{100, 200, 300, 400\}$, $E_G = 2$, $R_{DC} = 0.3$ and $P = 30$. The parameters of the second set of complex synthetic datasets (SCSD) are: $N \in \{100, 200, 300, 400\}$, $E_G = 3$, $R_{DC} = 0.5$ and $P = 50$. For BMNPGRN, we used two different penalty functions, MCP and SCAD, and set $T_{ss} = 100$ and $z_{ts} = 0.6$ or $z_{ts} = 0.8$. The ROC comparison results for the two sets of datasets are shown in Figs. 5 and 6, respectively. The AUROC and AUPRC results for the two sets of datasets are shown in Tables S3 and S4, respectively, available online. Experimental results show that BMNPGRN outperforms the other methods on the complex synthetic datasets.

3.2 Performance Evaluation of BMNPGRN and BMNPGRN without Stability Selection

To verify the effect of the stability selection strategy [29], we generated one more set of datasets with a different parameter combination. The parameters of this third set of complex synthetic datasets (TCSD) are: $N \in \{100, 200, 300, 400\}$, $E_G = 3$, $R_{DC} = 0.4$, and $P = 30$. This dataset is used to compare the performance of BMNPGRN with that of BMNPGRN without stability selection. The ROC comparison results for these datasets are shown in Fig. 7.

3.3 Application to Breast Cancer Data

Breast cancer is the most common malignancy in United States women, accounting for >40,000 deaths each year. Discovering cancer-related biological pathway information is one of the key challenges of breast cancer research. Identifying a comprehensive gene regulatory network can provide an important resource for studying the underlying mechanisms of breast cancer.

We applied BMNPGRN to a dataset containing measurement profiles of gene expression, DNA methylation, and copy number variation. Gene expression profiles were measured experimentally using the Illumina HiSeq 2000 RNA Sequencing platform and downloaded from TCGA. DNA methylation profiles were obtained from the ratios of background-corrected methylated and un-methylated probe intensities measured by Illumina Infinium HumanMethylation450k BeadArrays, also downloaded from TCGA. The gene-level CNVs were estimated using the GISTIC2 method. There are 760 case and 80 control samples measured from breast tissue. There are 24776 genes and corresponding CNVs in the gene expression data and copy number variation data, and 485578 DNA methylation sites in the methylation data.

Fig. 4. ROCs of BMNPGRN with DCGRN, ARACNE, and CLR: $N = 100$, $E_G = 3$, $R_{DC} = 0.5$ (top left), $N = 200$, $E_G = 3$, $R_{DC} = 0.5$ (top right), $N = 300$, $E_G = 3$, $R_{DC} = 0.5$ (bottom left), and $N = 400$, $E_G = 3$, $R_{DC} = 0.5$ (bottom right). For BMNPGRN, we show the results with two settings for $z_{ts}$ (0.6 and 0.8) and two penalty functions.

Fig. 5. ROCs of BMNPGRN with DCGRN, ARACNE, and CLR: $N = 100$, $E_G = 2$, $R_{DC} = 0.3$ (top left), $N = 200$, $E_G = 2$, $R_{DC} = 0.3$ (top right), $N = 300$, $E_G = 2$, $R_{DC} = 0.3$ (bottom left), and $N = 400$, $E_G = 2$, $R_{DC} = 0.3$ (bottom right). For BMNPGRN, we show the results with two settings for $z_{ts}$ (0.6 and 0.8) and two penalty functions.

Fig. 6. ROCs of BMNPGRN with DCGRN, ARACNE, and CLR: $N = 100$, $E_G = 2$, $R_{DC} = 0.5$ (top left), $N = 200$, $E_G = 2$, $R_{DC} = 0.5$ (top right), $N = 300$, $E_G = 2$, $R_{DC} = 0.5$ (bottom left), and $N = 400$, $E_G = 2$, $R_{DC} = 0.5$ (bottom right). For BMNPGRN, we show the results with two settings for $z_{ts}$ (0.6 and 0.8) and two penalty functions.

Fig. 7. ROCs of BMNPGRN with DCGRN, ARACNE, and CLR: $N = 100$, $E_G = 3$, $R_{DC} = 0.4$ (top left), $N = 200$, $E_G = 3$, $R_{DC} = 0.4$ (top right), $N = 300$, $E_G = 3$, $R_{DC} = 0.4$ (bottom left), and $N = 400$, $E_G = 3$, $R_{DC} = 0.4$ (bottom right). For BMNPGRN, we show the results with two settings for $z_{ts}$ (0.6 and 0.8) and two penalty functions.

Fig. 8 shows the integrated network of 14 genes that have high absolute coefficient values. Two DNA methylation sites are connected to the gene KCNK12, and one CNV is connected to the gene SLC2A3. Several studies have reported that KCNK12 is associated with breast cancer, but there is no report of associations with the DNA methylation sites of KCNK12 [37], [38], [39]. Several studies have reported that IPO8 is associated with breast cancer, but there is no report of associations with the DNA methylation sites of IPO8 [40], [41]. It should be noted that this is the first time CNP11949 has been found to be related to breast cancer [42], [43]. Several studies have reported that BRCA1 and TP53 are associated with breast cancer [44]. The experimental results point to genes that need attention in future work. TNF and CTNNB1 play an important role in the gene regulatory network.

4 CONCLUSIONS

In this paper, we propose an approach that uses biweight midcorrelation to measure the correlation between factors and Nonconvex Penalty based sparse regression for Gene Regulatory Network inference (BMNPGRN). BMNPGRN incorporates multi-omics data (including DNA methylation and copy number variation) and their interactions in the gene regulatory network model. The experimental results on synthetic datasets show that BMNPGRN outperforms popular and state-of-the-art methods (including DCGRN, ARACNE and CLR) under false positive control. Furthermore, we applied BMNPGRN to breast cancer (BRCA) data from The Cancer Genome Atlas database and provided a gene regulatory network.

ACKNOWLEDGMENTS

This work is partly supported by the National Natural Science Foundation of China (Grant nos. 61732012, 61520106006, 31571364, 61532008, U1611265, 61672382, 61772370, 61702371, 61772357, and 61672203) and the China Postdoctoral Science Foundation (Grant nos. 2017M611619 and 2016M601646), and supported by the “BAGUI Scholar” Program of Guangxi Zhuang Autonomous Region of China.

REFERENCES

[1] P. A. Jones, J. P. Issa, and S. Baylin, “Targeting the cancer epigenome for therapy,” Nature Rev. Genetics, vol. 17, no. 10, 2016, Art. no. 630.

[2] D. S. Huang and H. J. Yu, “Normalized feature vectors: A novel alignment-free sequence comparison method based on the numbers of adjacent amino acids,” IEEE/ACM Trans. Comput. Biol. Bioinf., vol. 10, no. 2, pp. 457–467, Mar./Apr. 2013.

[3] M. R. Aure, S. K. Leivonen, T. Fleischer, Q. Zhu, J. Overgaard, J. Alsner, T. Tramm, R. Louhimo, G. I. G. Alnæs, and M. Perälä, “Individual and combined effects of DNA methylation and copy number alterations on miRNA expression in breast tumors,” Genome Biol., vol. 14, no. 11, 2013, Art. no. R126.

[4] D. S. Huang, “Systematic theory of neural networks for pattern recognition,” Publishing House of Electronic Industry of China, Beijing, vol. 201, 1996.

[5] D. S. Huang, “Radial basis probabilistic neural networks: Model and application,” Int. J. Pattern Recognit. Artif. Intell., vol. 13, no. 07, pp. 1083–1101, 1999.

[6] D. S. Huang and J.-X. Du, “A constructive hybrid structure optimization methodology for radial basis probabilistic neural networks,” IEEE Trans. Neural Netw., vol. 19, no. 12, pp. 2099–2115, Dec. 2008.

[7] C. H. Zheng, L. Zhang, T. Y. Ng, K. S. Chi, and D. S. Huang, “Molecular pattern discovery based on penalized matrix decomposition,” IEEE/ACM Trans. Comput. Biol. Bioinf., vol. 8, no. 6, pp. 1592–1603, Nov./Dec. 2011.

[8] D. S. Huang, L. Zhang, K. Han, S. Deng, K. Yang, and H. Zhang, “Prediction of protein-protein interactions based on protein-protein correlation using least squares regression,” Current Protein Peptide Sci., vol. 15, no. 6, pp. 553–560, 2014.

[9] S. D. Bolboacă and L. Jäntschi, “Pearson versus Spearman, Kendall’s Tau correlation analysis on structure-activity relationships of biologic active compounds,” Leonardo J. Sci., vol. 5, no. 9, pp. 179–200, 2006.

[10] K. Pearson, “Note on regression and inheritance in the case of two parents,” Proc. Royal Soc. London, vol. 58, pp. 240–242, 1895.

[11] P. Langfelder and S. Horvath, “WGCNA: An R package for weighted correlation network analysis,” BMC Bioinf., vol. 9, no. 1, 2008, Art. no. 559.

[12] J. K. Choi, U. Yu, O. J. Yoo, and S. Kim, “Differential coexpression analysis using microarray data and its application to human cancer,” Bioinf., vol. 21, no. 24, pp. 4348–4355, 2005.

[13] C. H. Zheng, D. S. Huang, L. Zhang, and X. Z. Kong, “Tumor clustering using nonnegative matrix factorization with gene selection,” IEEE Trans. Inform. Technol. Biomed., vol. 13, no. 4, pp. 599–607, 2009.

[14] S. P. Deng, L. Zhu, and D. S. Huang, “Predicting hub genes associated with cervical cancer through gene co-expression networks,” IEEE/ACM Trans. Comput. Biol. Bioinf., vol. 13, no. 6, pp. 27–35, Jan./Feb. 2016.

[15] D. S. Huang, X. M. Zhao, G. B. Huang, and Y. M. Cheung, “Classifying protein sequences using hydropathy blocks,” Pattern Recognit., vol. 39, no. 12, pp. 2293–2300, 2006.

[16] A. V. Werhli and D. Husmeier, “Reconstructing gene regulatory networks with bayesian networks by combining expression data with multiple sources of prior knowledge,” Statist. Appl. Genetics Mol. Biol., vol. 6, no. 1, pp. 15–15, 2007.

[17] D. S. Huang and C. H. Zheng, “Independent component analysis based penalized discriminate method for tumor classification using gene expression data,” Bioinf., vol. 22, no. 15, pp. 1855–1862, 2006.

[18] S. P. Deng, L. Zhu, and D. S. Huang, “Mining the bladder cancer-associated genes by an integrated strategy for the construction and analysis of differential co-expression networks,” BMC Genomics, vol. 16, no. S3, 2015, Art. no. S4.

[19] L. Zhu, W. L. Guo, S. P. Deng, and D. S. Huang, “ChIP-PIT: Enhancing the analysis of ChIP-Seq data using convex-relaxed pair-wise interaction tensor decomposition,” IEEE/ACM Trans. Comput. Biol. Bioinf., vol. 13, no. 1, pp. 55–63, Jan./Feb. 2016.

[20] L. Zhu, S. P. Deng, and D. S. Huang, “A two-stage geometric method for pruning unreliable links in protein-protein networks,” IEEE Trans. Nanobiosci., vol. 14, no. 5, pp. 528–534, Jul. 2015.

Fig. 8. Gene regulatory network from breast cancer data. DNA methylation sites (blue dot), gene (yellow dot), and copy number polymorphism (red dot).

[21] R. Tibshirani, “Regression shrinkage and selection via the lasso: A retrospective,” J. Royal Statist. Soc., vol. 73, no. 3, pp. 273–282, 2011.

[22] D. C. Kim, M. Kang, B. Zhang, X. Wu, C. Liu, and J. Gao, “Integration of DNA methylation, copy number variation, and gene expression for gene regulatory network inference and application to psychiatric disorders,” IEEE Int. Conf. Bioinf. Bioeng., pp. 669–674, 2015.

[23] L. Zhu, Z. H. You, D. S. Huang, and B. Wang, “t-LSE: A novel robust geometric approach for modeling protein-protein interaction networks,” Plos One, vol. 8, no. 4, 2013, Art. no. e58368.

[24] D. S. Huang and W. Jiang, “A general CPL-AdS methodology for fixing dynamic parameters in dual environments,” IEEE Trans. Syst. Man Cybern. Part B Cybern., vol. 42, no. 5, pp. 1489–1500, Oct. 2012.

[25] S. P. Deng and D. S. Huang, “SFAPS: An R package for structure/function analysis of protein sequences based on informational spectrum method,” Methods, vol. 69, no. 3, pp. 207–212, 2014.

[26] X. Ma, Z. Liu, Z. Zhang, X. Huang, and W. Tang, “Multiple network algorithm for epigenetic modules via the integration of genome-wide DNA methylation and gene expression data,” BMC Bioinf., vol. 18, no. 1, 2017, Art. no. 72.

[27] J. Fan and R. Li, “Variable selection via nonconcave penalized likelihood and its oracle properties,” J. Amer. Statist. Assoc., vol. 96, no. 456, pp. 1348–1360, 2001.

[28] C. H. Zhang, “Nearly unbiased variable selection under minimax concave penalty,” Ann. Statist., vol. 38, no. 2, pp. 894–942, 2010.

[29] N. Meinshausen and P. Bühlmann, “Stability selection,” J. Royal Statist. Society: Series B (Statist. Methodology), vol. 72, no. 4, pp. 417–473, 2010.

[30] A. A. Margolin, I. Nemenman, K. Basso, C. Wiggins, G. Stolovitzky, F. R. Dalla, and A. Califano, “ARACNE: An algorithm for the reconstruction of gene regulatory networks in a mammalian cellular context,” BMC Bioinf., vol. 7, no. Suppl 1, 2006, Art. no. S7.

[31] J. J. Faith, B. Hayete, J. T. Thaden, I. Mogno, J. Wierzbowski, G. Cottarel, S. Kasif, J. J. Collins, and T. S. Gardner, “Large-scale mapping and validation of Escherichia coli transcriptional regulation from a compendium of expression profiles,” Plos Biol., vol. 5, no. 1, 2007, Art. no. e8.

[32] O. K. Kathleen, O. Karim, L. Marie-Hélène, B. Sahir, P. N. Tonin, and C. M. T. Greenwood, “Gene coexpression analyses differentiate networks associated with diverse cancers harboring TP53 missense or null mutations,” Frontiers Genetics, vol. 7, 2016, Art. no. 137.

[33] D. Amar, H. Safer, and R. Shamir, “Dissection of regulatory networks that are altered in disease via differential co-expression,” PLoS Comput. Biol., vol. 9, no. 3, 2013, Art. no. e1002955.

[34] L. Yuan, C. H. Zheng, J. F. Xia, and D. S. Huang, “Module based differential coexpression analysis method for type 2 diabetes,” Biomed. Res. Int., vol. 2015, 2015, Art. no. 836929.

[35] P. A. Jones, “Functions of DNA methylation: Islands, start sites, gene bodies and beyond,” Nature Rev. Genetics, vol. 13, no. 7, pp. 484–492, 2012.

[36] T. Dayeh, P. Volkov, S. Salö, E. Hall, E. Nilsson, A. H. Olsson, C. L. Kirkpatrick, C. B. Wollheim, L. Eliasson, and T. Rönn, “Genome-wide DNA methylation analysis of human pancreatic islets from Type 2 diabetic and non-diabetic donors identifies candidate genes that influence insulin secretion,” PLoS Genetics, vol. 10, no. 3, 2014, Art. no. e1004160.

[37] K. C. Johnson, D. C. Koestler, C. Cheng, and B. C. Christensen, “Age-related DNA methylation in normal breast tissue and its relationship with invasive breast tumor methylation,” Epigenetics Official J. Dna Methylation Soc., vol. 9, no. 2, pp. 268–275, 2014.

[38] M. O. Riener, E. Nikolopoulos, A. Herr, P. J. Wild, M. Hausmann, T. Wiech, M. Orlowskavolk, S. Lassmann, A. Walch, and M. Werner, “Microarray comparative genomic hybridization analysis of tubular breast carcinoma shows recurrent loss of the CDH13 locus on 16q,” Hum. Pathology, vol. 39, no. 11, 2008, Art. no. 1621.

[39] K. A. Dookeran, W. Zhang, L. Stayner, and M. Argos, “Associations of two-pore domain potassium channels and triple negative breast cancer subtype in the cancer genome atlas: Systematic evaluation of gene expression and methylation,” BMC Res. Notes, vol. 10, no. 1, 2017, Art. no. 475.

[40] C. Michaloglou, C. Crafter, R. Siersbæk, O. Delpuech, J. O. Curwen, L. S. Carnevalli, A. D. Staniszewska, U. M. Polanska, A. Cheraghchibashi, and M. Lawson, “Combined inhibition of mTOR and CDK4/6 is required for optimal blockade of E2F function and long term growth inhibition in estrogen receptor positive breast cancer,” Mol. Cancer Ther., vol. 17, pp. 908–920, 2018.

[41] S. M. Guichard, Z. Howard, H. Dan, M. Roth, G. Hughes, J. Curwen, J. Yates, A. Logie, S. Holt, and C. M. Chresta, “Abstract 917: AZD2014, a dual mTORC1 and mTORC2 inhibitor is differentiated from allosteric inhibitors of mTORC1 in ER+ breast cancer,” Cancer Res., vol. 72, no. 8 Supplement, pp. 917–917, 2012.

[42] X. P. Jiang, R. L. Elliott, and J. F. Head, “Exogenous normal mammary epithelial mitochondria suppress glycolytic metabolism and glucose uptake of human breast cancer cells,” Breast Cancer Res. Treatment, vol. 153, no. 3, 2015, Art. no. 519.

[43] J. Lester, K. Crosthwaite, R. Stout, R. N. Jones, C. Holloman, C. Shapiro, and B. L. Andersen, “Women with breast cancer: self-reported distress in early survivorship,” Oncology Nursing Forum, vol. 42, no. 1, pp. 17–23, 2015.

[44] M. Yamamoto, M. Hosoda, K. Nakano, S. Jia, K. C. Hatanaka, E. Takakuwa, Y. Hatanaka, Y. Matsuno, and H. Yamashita, “p53 accumulation is a strong predictor of recurrence in estrogen receptor-positive breast cancer patients treated with aromatase inhibitors,” Cancer Sci., vol. 105, no. 1, pp. 81–88, 2014.

Lin Yuan received the MA degree in communication and information system from Qufu Normal University, Rizhao, China, in 2014. He is currently working toward the PhD degree in computer application from Tongji University, Shanghai, China. His current research interests include machine learning and bioinformatics.

Le-Hang Guo received the MD degree from Tongji University, Shanghai, China. He is currently working toward the PhD degree in medical imaging from Nanjing Medical University, Nanjing, China. His current research interests include machine learning, artificial intelligence, and tumor imaging.

Chang-An Yuan received the PhD degree in computer application technology from Sichuan University, China, in 2006. He is a professor at Guangxi Teachers Education University. His research interests include computational intelligence and data mining.

Youhua Zhang received the PhD degree from the University of Science and Technology of China. He is dean of the School of Information and Computer of Anhui Agricultural University, executive council member of the Computer Applications Branch of the China Agricultural Society, and executive council member of the Anhui Association of Agricultural Information. He is a member of the National Bee modern industrial technology and product quality and safety standards system, a member of the Expert Advisory Committee of Anhui Provincial Human Resources and Social Security in the Department of Information Technology, and a member of the Executive Committee of the Hefei branch of the CCF. He is mainly engaged in teaching and research in computer applications and logistics engineering. He leads or participates in provincial science and technology research, the National Science and Technology Support Program, the 863 project, and National Natural Science Foundation projects. He won the First Prize of Scientific and Technological Progress in Anhui Province.


Kyungsook Han received the BS (cum laude)degree fromSeoul National University, in 1983, theMS (cum laude) degree in computer science fromKAIST, in 1985, and the another MS degree incomputer science from the University of Minne-sota, Minneapolis, USA, in 1989, and the PhDdegree in computer science from Rutgers Univer-sity, USA, in 1994. She is a professor at theDepart-ment of Computer Science and Engineering, InhaUniversity, Korea. Her research areas include bio-informatics, visualization, and datamining.

Asoke K. Nandi received the PhD degree inphysics from Trinity College, the University ofCambridge, Cambridge, United Kingdom. Heheld academic positions at several universities,including Oxford (UK), Imperial College London(UK), Strathclyde (UK), and Liverpool (UK) aswell as a Finland Distinguished Professorship inJyvaskyla (Finland). In 2013, he moved to BrunelUniversity (UK) to become the chair and head ofelectronic and computer engineering. He is a dis-tinguished visiting professor at Tongji University

(China) and an adjunct professor at the University of Calgary (Canada).His current research interests lie in the areas of signal processing andmachine learning, with applications to communications, gene expressiondata, functional magnetic resonance data, and biomedical data. He hasauthored more than 500 technical publications, including 200 journalpapers as well as four books, entitled Automatic Modulation Classifica-tion: Principles, Algorithms and Applications (Wiley, 2015), IntegrativeCluster Analysis in Bioinformatics (Wiley, 2015), Automatic ModulationRecognition of Communications Signals (Springer, 1996), and Blind Esti-mation Using Higher-Order Statistics (Springer, 1999),. Recently, he pub-lished in Blood, BMC Bioinformatics, IEEE TWC, NeuroImage, PLOSONE, Royal Society Interface, and Signal Processing. The h-index of hispublications is 63 (Google Scholar). He is a fellow of the Royal Academyof Engineering and also a fellow of seven other institutions including theIEEE and the IET. Among the many awards he received are the Instituteof Electrical and Electronics Engineers (USA) Heinrich Hertz Award in2012, the Glory of Bengal Award for his outstanding achievements in sci-entific research in 2010, the Water Arbitration Prize of the Institution ofMechanical Engineers (UK) in 1999, and the Mountbatten Premium, Divi-sion Award of the Electronics andCommunications Division, of the Institu-tion of Electrical Engineers (UK) in 1998.

Barry Honig is a tenured professor at Columbia University and a member of the US National Academy of Sciences. He is director of the Center for Computational Biology and Bioinformatics at Columbia University and serves on the editorial boards of PNAS, the Journal of Molecular Biology, Structure, Biochemistry, and Current Opinion in Structural Biology. His research interests include computational biology and bioinformatics. As of 2013, he has published more than 300 papers in high-level SCI journals. In 2004, he was elected to the National Academy of Sciences. In 2007, he received the US National Academy of Sciences Alexander Hollaender Award.

De-Shuang Huang received the BSc degree from the Institute of Electronic Engineering, Hefei, China, the MSc degree from the National Defense University of Science and Technology, Changsha, China, and the PhD degree from Xidian University, Xian, China, all in electronic engineering, in 1986, 1989, and 1993, respectively. From 1993 to 1997, he was a postdoctoral research fellow at the Beijing Institute of Technology and at the National Key Laboratory of Pattern Recognition, Chinese Academy of Sciences, Beijing, China, respectively. In Sept. 2000, he joined the Institute of Intelligent Machines, Chinese Academy of Sciences, as a recipient of the “Hundred Talents Program of CAS”. In Sept. 2011, he joined Tongji University as a chaired professor. From Sept. 2000 to Mar. 2001, he worked as a research associate with Hong Kong Polytechnic University. From Aug. to Sept. 2003, he was a visiting professor at the George Washington University, Washington DC. From July to Dec. 2004, he worked as a University fellow with Hong Kong Baptist University. From Mar. 2005 to Mar. 2006, he worked as a research fellow with the Chinese University of Hong Kong. From Mar. to July 2006, he worked as a visiting professor at the Queen’s University of Belfast, United Kingdom. From 2007 to 2009, he worked as a visiting professor with Inha University, Korea. At present, he is the director of the Institute of Machine Learning and Systems Biology, Tongji University. He is currently a fellow of the International Association of Pattern Recognition (IAPR fellow), and a senior member of the IEEE and the International Neural Networks Society. He has published more than 180 journal papers. In 1996, he published a book entitled Systematic Theory of Neural Networks for Pattern Recognition (in Chinese), which won the Second-Class Prize of the 8th Excellent High Technology Books of China, and in 2001 and 2009 he published two further books entitled Intelligent Signal Processing Technique for High Resolution Radars (in Chinese) and The Study of Data Mining Methods for Gene Expression Profiles (in Chinese), respectively. His current research interests include bioinformatics, pattern recognition, and machine learning.
