

Research Article
Link Prediction via Sparse Gaussian Graphical Model

Liangliang Zhang, Longqi Yang, Guyu Hu, Zhisong Pan, and Zhen Li

College of Command Information System, PLA University of Science and Technology, Nanjing 210007, China

Correspondence should be addressed to Zhisong Pan; hotpzs@hotmail.com

Received 10 November 2015; Revised 27 January 2016; Accepted 27 January 2016

Academic Editor: David Bigaud

Copyright © 2016 Liangliang Zhang et al. This is an open access article distributed under the Creative Commons Attribution License, which permits unrestricted use, distribution, and reproduction in any medium, provided the original work is properly cited.

Link prediction is an important task in complex network analysis. Traditional link prediction methods are limited by network topology and by the lack of node property information, which makes predicting links challenging. In this study, we address link prediction using a sparse Gaussian graphical model and demonstrate its theoretical and practical effectiveness. In theory, link prediction is executed by estimating the inverse covariance matrix of samples to overcome information limits. The proposed method was evaluated with four small and four large real-world datasets. The experimental results show that the area under the curve (AUC) value obtained by the proposed method improved by an average of 3% and 12.5%, respectively, compared to 13 mainstream similarity methods. This method outperforms the baseline methods, and its prediction accuracy is superior to the mainstream methods when using only 80% of the training set. The method also provides significantly higher AUC values when using only 60% of the training set on the Dolphin and Taro datasets. Furthermore, the error rate of the proposed method demonstrates superior performance on all datasets compared to mainstream methods.

1. Introduction

Since 2010, link prediction has become an increasingly distinctive and important part of complex network analysis. Link prediction refers to the prediction of a possible link between two nodes when links are unknown [1]; it involves predicting both existing yet unknown links and future links. Link prediction underlies many data mining problems and lays a foundation for complex network research, providing a mechanism for studying both the structure and the evolution of networks. Studying this problem is important from both theoretical and practical perspectives [2]. Existing community detection research is primarily based on an adjacency matrix, and community detection typically depends on the adequacy and completeness of that matrix. Link prediction is instrumental for accurately analysing social-network structures, thereby helping community detection and improving its accuracy [2, 3]. Link prediction can be used to predict missing data and can help analyse network evolution [4]. For example, we can use the current network structure to predict users who have not been recognized as friends or who may develop into friends.

Link prediction methods have made remarkable achievements in various fields, including biology, social science, security, and medicine. Ermiş et al. [2] address the link prediction problem by data fusion, formulated as simultaneous factorization of several observation tensors where latent factors are shared among the observations; some studies turn to multirelational link prediction [3]; Yang et al. also studied the evaluation of link prediction [5].

Gong et al. in 2014 extended the Social-Attribute Network framework with several supervised and unsupervised link-prediction algorithms and demonstrated the performance improvement of their method [6]. However, such methods have limitations when processing social datasets. First, social datasets are low-quality datasets that include faulty links and noise; such datasets must be preprocessed before similarity measurement, set partitioning, and common neighbour (CN) counting. Moreover, node properties are cumbersome to obtain, and most social network data can only be used to obtain a raw adjacency matrix that does not include specific attribute information, because user information is private in most online systems. Consequently, many prediction methods cannot use the features of such properties, and we cannot calculate feature properties.



Therefore, using only an adjacency matrix avoids interference from node properties, which is convenient and feasible.

Many community detection methods are based on an adjacency matrix; thus, the adjacency matrix's integrity directly affects the results. Through link prediction using an adjacency matrix, we can determine the relationships between unconnected nodes, and the entire community structure can be obtained by analysing these relationships. Thus, an effective link prediction method is required. This issue presents a series of challenges, such as the following: (1) link prediction must function with an adjacency matrix that does not contain properties; (2) a graph structure model for estimating the network is required to determine the role of different types of connections; and (3) verification and evaluation of link prediction are required.

Existing link prediction methods use similarities, node properties, edge properties, and so forth. However, property-based approaches require a large amount of test data and rely heavily on network connectivity and structure; thus, link prediction without properties is less robust. When the network structure changes, it is difficult to mine the relationships between nodes. Thus, determining how to use limited test data to predict network edges is the motivation of this study.

To solve the above problems, this paper presents a link prediction method based on the application of a Gaussian graphical model (GGM) to an adjacency matrix. This concept references Friedman et al.'s [7] sparse inverse covariance estimation theory. The study uses the original adjacency matrix for sampling, thereby obtaining a sample matrix; thus, we use a sparse GGM (SGGM) inverse covariance matrix for link prediction. The main contributions of this study are as follows.

(1) Sampling a network to build a feature matrix, seeking maximum likelihood estimation using an SGGM, and estimating an inverse covariance matrix (precision matrix) of the adjacency matrix.

(2) Establishing conditional independence between nodes to complete link prediction using the Markov random field independence principle.

(3) Proving that the proposed method is more effective than previous methods by testing the methods on eight real-world datasets.

The remainder of this paper is organised as follows. We introduce related work in Section 2, including many previous link prediction methods. In Section 3, we present our SGGM-based link prediction method. In Section 4, we introduce eight real-world datasets and test the methods using these datasets to prove that the proposed method is more effective than previous methods. Finally, we conclude the paper and present suggestions for future work in Section 5.

2. Related Work

2.1. Problem Description. Existing link prediction methods can be divided into three categories.

2.1.1. Similarity Link Prediction Employs Different Methods. One is based on node properties, such as sex, age, occupation, and preferences, to compute node similarity; it is more probable that edges will exist between high-similarity nodes. Another method is based on the similarity of network structures, for example, the use of CN nodes. However, this method is only applicable to a network with a high clustering coefficient.

2.1.2. Estimates Based on the Maximum Likelihood Estimation of a Link Can Be Divided into Two Categories. One method is based on network hierarchy, but it has high complexity because it generates many network samples. The other method is based on stochastic block model prediction, wherein nodes are divided into sets and the probability of an edge depends on the corresponding sets.

2.1.3. A Link Prediction Model Based on Probability Builds a Model by Adjusting the Parameters. Such a model can fit the structure of the relationships in real networks. A pair of nodes will generate an edge determined by probability using the optimum parameters. A probabilistic model considers the probability of existing edges as a property; it transforms edge prediction into a property issue. This method takes advantage of the network structure and node properties with high precision but offers poor universality.

Due to the poor universality of maximum likelihood estimation and the probability model, which depend highly on node properties, these methods cannot be applied to many networks. Herein, we consider a link prediction method that is based only on similarity and discuss experiments performed to compare the proposed and previous methods.

2.2. Similarity-Based Link Prediction. Here we compare the prediction accuracies of 13 similarity measures. All of these measures are based on the local structural information contained in the training set. We first introduce each measure briefly; the formulas are shown in Table 1. Here, $G(V, E)$ is an undirected network, where $V$ is the set of nodes and $E$ is the set of edges. The total number of nodes in the network is $N$, and the number of edges is $M$. For a node $x$ with neighbours $\Lambda(x)$, the degree of $x$ is $d(x) = |\Lambda(x)|$. The network has $N(N-1)/2$ node pairs, that is, a universal set $U$. Given a link prediction method, each pair of nodes $(x, y) \in U - E$ without an edge is assigned a score $S_{xy}$. All unconnected pairs of nodes are then ordered by score in descending order; the pairs at the top are the most likely to form an edge.

CN is a similarity index based on local information and is one of the simplest methods [8]. In other words, it is more likely that a link will exist between two high-similarity nodes that share many neighbours. For a node $x$, $\Lambda(x)$ represents its neighbours; if nodes $x$ and $y$ have many CNs, they are more likely to have a link. Visibly, structural equivalence pays more attention to whether two nodes are in the same circumstance. For example, in a social network context, if two people share many friends, they are more likely to be friends themselves.


Table 1: Similarity indexes of local node information.

Name            | Definition
CN              | $S_{xy} = |\Lambda(x) \cap \Lambda(y)|$
Salton index    | $S_{xy} = |\Lambda(x) \cap \Lambda(y)| / \sqrt{d(x) \times d(y)}$
Jaccard index   | $S_{xy} = |\Lambda(x) \cap \Lambda(y)| / |\Lambda(x) \cup \Lambda(y)|$
Sørensen index  | $S_{xy} = 2|\Lambda(x) \cap \Lambda(y)| / (d(x) + d(y))$
HPI             | $S_{xy} = |\Lambda(x) \cap \Lambda(y)| / \min\{d(x), d(y)\}$
HDI             | $S_{xy} = |\Lambda(x) \cap \Lambda(y)| / \max\{d(x), d(y)\}$
LHN-I           | $S_{xy} = |\Lambda(x) \cap \Lambda(y)| / (d(x) \times d(y))$
PA              | $S_{xy} = d(x) \times d(y)$
AA              | $S_{xy} = \sum_{z \in \Lambda(x) \cap \Lambda(y)} 1 / \lg d(z)$
RA              | $S_{xy} = \sum_{z \in \Lambda(x) \cap \Lambda(y)} 1 / d(z)$

We consider the impact of the degrees of both nodes and generate six similarity indices, namely, the Salton index (cosine-based similarity) [9], the Jaccard index [10], the Sørensen index [11], the hub promoted index (HPI), the hub depressed index (HDI), and the Leicht-Holme-Newman (LHN-I) index [12]. These indices are based on CN similarity. Another similarity-degree-based method is preferential attachment (PA) [13], which can generate a scale-free network. Note that the complexity of this algorithm is lower than that of the others because it requires less information.

If we consider the degree information of two nodes' CNs, we have the Adamic-Adar (AA) index [14], which considers that the contribution of a small-degree CN node is greater than that of a large-degree one. Liu et al. [15] proposed the RA index, which takes the view of network resource allocation. The RA and AA indices weight CN nodes in different ways: RA decreases as $1/k$, whereas AA decreases as $1/\lg k$. The RA index performs better than the AA index in weighted networks and community mining. Traditional CN methods do not distinguish the different roles of CNs. Liu et al. [15] proposed a local naive Bayes (LNB) model, which creates a role parameter for marking the different roles of CNs. However, it applies only to certain types of networks.
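For concreteness, the following sketch scores unconnected node pairs with three of the Table 1 indices using NetworkX's built-in implementations; the toy graph and variable names are illustrative assumptions, not part of the paper's experiments. (Note that NetworkX's AA index uses the natural logarithm rather than the $\lg$ of Table 1.)

```python
import networkx as nx

# Toy graph standing in for a training network.
G = nx.karate_club_graph()

# Unconnected node pairs (the set U - E from the text).
non_edges = list(nx.non_edges(G))

# NetworkX implements several of the Table 1 indices directly.
jaccard = nx.jaccard_coefficient(G, non_edges)        # Jaccard index
aa = nx.adamic_adar_index(G, non_edges)               # AA index
ra = nx.resource_allocation_index(G, non_edges)       # RA index

# Rank pairs by score in descending order; the top pairs are the
# most likely missing links.
top = sorted(aa, key=lambda t: t[2], reverse=True)[:10]
for u, v, score in top:
    print(u, v, round(score, 3))
```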

3. Link Prediction Based on Sparse Gaussian Graphical Model

Most similarity algorithms perform well with ideal datasets and large training datasets. Such algorithms do not consider missing, incomplete, and polluted data in an adjacency matrix. As previously mentioned, the accuracy of similarity-based methods depends on whether the method can capture the characteristics of the network structure. For example, CN-based algorithms utilise a node's CNs, and a pair of nodes with many CNs is more likely to connect. Such algorithms perform well, sometimes even better than complex algorithms, in a network with a high clustering coefficient. However, in networks with a low clustering coefficient, such as router or power networks, accuracy is significantly worse.

This study attempts to determine links in an adjacency matrix to reveal actual links between nodes. The proposed SGGM method is based on an undirected graphical model and transforms the adjacency matrix without using node properties. The SGGM method estimates the precision matrix of a network to predict unknown edges. To verify link prediction accuracy, the area under the curve (AUC) and an error metric are used to prove the proposed method's effectiveness.

3.1. Sparse Graphical Model. Biologists interested in genetic connections use the GGM to estimate genetic interaction. Edge relationships in an undirected graph are represented by the joint distribution of random variables. For example, genome work is based on biological functions, and some supervising relationships exist between genes; corresponding to the graph, genes represent nodes and edges represent this supervising relationship. We assume that the variables have a Gaussian distribution; therefore, the GGM is most frequently used. The problem is then equivalent to estimating the inverse covariance matrix $\Sigma^{-1}$ (the precision matrix $\Theta = \Sigma^{-1}$), whose nonzero off-diagonal elements represent the edges in the graph [16]. The GGM denotes the statistical dependencies between variables: if there is no link between two nodes, the corresponding variables are conditionally independent. In the GGM, the precision matrix parameterises each edge.

A popular modern method for estimating a GGM is the graphical lasso; Friedman et al. [7] added an $\ell_1$ norm to punish each off-diagonal element of the inverse covariance matrix.

The GGM encodes the conditional dependence relationships among a set of $p$ variables given $n$ observations drawn independently from an identical Gaussian distribution. Motivated by network terminology, we refer to the $p$ variables in a graphical model as nodes. If a pair of variables (or features) is conditionally dependent, then there is an edge between the corresponding pair of nodes; otherwise, no edge is present. The basic model for continuous data assumes that the observations have a multivariate Gaussian distribution with mean $\mu$ and covariance matrix $\Sigma$; here $\mu = 0$, so $X[1], X[2], \ldots, X[N] \sim N(0, \Sigma)$. If the $ij$th component of $\Sigma^{-1}$ is zero, then the variables $i$ and $j$ are conditionally independent given the other variables $X_k$, $k = 1, \ldots, p$, $k \neq i, j$. The nonzero elements of $\Sigma^{-1}$ thus represent the graph's structure: if $(\Sigma^{-1})_{ij} = 0$, nodes $i$ and $j$ have no connecting edge in the graph. The precision matrix is sparse due to the conditional independence of the variables. To estimate a sparse GGM, many methods are based on maximum likelihood estimation (MLE); one such method is the graphical lasso, which is expressed as

$$\max_{\Theta \succ 0} \; \log\det\Theta - \operatorname{tr}(S\Theta) - \lambda\|\Theta\|_{1}, \tag{1}$$

$$S = \frac{1}{N}\sum_{i=1}^{N}\left(x_{i}-\bar{x}\right)\left(x_{i}-\bar{x}\right)^{T}, \tag{2}$$

$$\|\Theta\|_{1} = \sum_{i \neq j}\left|\Theta_{ij}\right|. \tag{3}$$

Here, $S$ is the empirical covariance matrix, and $\lambda$ is a positive tuning parameter. $\Theta \succ 0$ means that $\Theta$ is a $p \times p$ positive definite matrix. $\|\Theta\|_{1}$ is the penalty term, and $\Theta$ becomes increasingly sparse as $\lambda$ increases. The $\ell_1$-norm is utilised so that $\Theta$ is treated as a variable rather than a fixed parameter.
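As a concrete illustration of estimating (1), the following sketch uses scikit-learn's GraphicalLasso, a coordinate-descent solver for the same penalised objective; the data dimensions and the value of alpha (the $\lambda$ above) are illustrative assumptions, not settings from the paper.

```python
import numpy as np
from sklearn.covariance import GraphicalLasso

rng = np.random.default_rng(0)

# Toy data: n = 200 observations of p = 30 jointly Gaussian variables
# (illustrative dimensions only).
X = rng.standard_normal((200, 30))

# alpha plays the role of lambda in (1): larger alpha -> sparser Theta.
model = GraphicalLasso(alpha=0.05, max_iter=200).fit(X)
Theta = model.precision_  # estimated precision matrix

# Nonzero off-diagonal entries of Theta correspond to edges of the
# estimated conditional-independence graph.
edges = np.argwhere((np.abs(Theta) > 1e-8) & ~np.eye(30, dtype=bool))
print(len(edges) // 2, "estimated edges")
```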

3.2. Graphical Lasso Algorithm. In 2008, Banerjee et al. [17] proved that formula (1) is convex. They estimated $\Sigma$ via its current estimate $W$ and solved the problem by optimising over each row and the corresponding column of $W$ in a block coordinate descent fashion. Partition $W$ and $S$ as

$$W = \begin{pmatrix} W_{11} & w_{12} \\ w_{12}^{T} & w_{22} \end{pmatrix}, \qquad S = \begin{pmatrix} S_{11} & s_{12} \\ s_{12}^{T} & s_{22} \end{pmatrix}. \tag{4}$$

With this partitioning of $W$ and $S$, the solution for $w_{12}$ satisfies

$$w_{12} = \operatorname*{argmin}_{y} \left\{ y^{T} W_{11}^{-1} y : \left\|y - s_{12}\right\|_{\infty} \le \rho \right\}. \tag{5}$$

Banerjee et al. [17] proved that solving (5) is equivalent to solving its dual problem

$$\min_{\beta} \; \frac{1}{2}\left\|W_{11}^{1/2}\beta - b\right\|^{2} + \rho\|\beta\|_{1}, \tag{6}$$

where $b = W_{11}^{-1/2} s_{12}$. If $\beta$ solves (6), then $w_{12} = W_{11}\beta$ solves (5). Therefore, (6) is similar to lasso regression and is the basis for the graphical lasso algorithm. First, we must prove that (6) and (1) are equivalent. Given $W\Theta = I$, that is,

$$\begin{pmatrix} W_{11} & w_{12} \\ w_{12}^{T} & w_{22} \end{pmatrix} \begin{pmatrix} \Theta_{11} & \theta_{12} \\ \theta_{12}^{T} & \theta_{22} \end{pmatrix} = \begin{pmatrix} I & 0 \\ 0^{T} & 1 \end{pmatrix}, \tag{7}$$

the stationarity condition of the MLE problem (1) is rewritten as

$$W - S - \rho H = 0. \tag{8}$$

Note that the derivative of $\log\det\Theta$ equals $\Theta^{-1} = W$. Here $H_{ij} \in \operatorname{sign}(\Theta_{ij})$; that is, $H_{ij} = \operatorname{sign}(\Theta_{ij})$ if $\Theta_{ij} \neq 0$, else $H_{ij} \in [-1, 1]$ if $\Theta_{ij} = 0$. Thus, the upper-right block of (8) is expressed as

$$w_{12} - s_{12} - \rho\gamma_{12} = 0. \tag{9}$$

The subgradient equation of (6) is expressed as follows:

$$W_{11}\beta - s_{12} + \rho v = 0, \tag{10}$$

where $v \in \operatorname{sign}(\beta)$. Suppose that $(W, H)$ solves (8); consequently, $(w_{12}, \gamma_{12})$ solves (9). Then $\beta = W_{11}^{-1}w_{12}$ and $v = -\gamma_{12}$ solve (10). For the sign terms, $W_{11}\theta_{12} + w_{12}\theta_{22} = 0$ is derived from (7); therefore, we obtain $\theta_{12} = -\theta_{22}W_{11}^{-1}w_{12}$. Since $\theta_{22} > 0$, it follows that $\operatorname{sign}(\theta_{12}) = -\operatorname{sign}(W_{11}^{-1}w_{12}) = -\operatorname{sign}(\beta)$, which proves the equivalence. The solution $\beta$ to the lasso problem (6) gives the relevant part of $\Theta$: $\theta_{12} = -\theta_{22}\beta$.

Problem (6) appears similar to a lasso ($\ell_1$-regularised) least-squares problem. In fact, if $W_{11} = S_{11}$, the solutions $\beta$ are equal to the lasso estimates for the $p$th variable. In 2007, Friedman et al. used fast coordinate descent algorithms to solve the lasso problem. Hsieh et al. [18, 19] proposed a novel block-coordinate descent method via clustering to solve the optimisation problem; the method can reach quadratic convergence rates and is designed for large networks.

3.3. Link Prediction Scheme Based on Sparse Inverse Covariance Estimation. Here we consider an undirected simple network $G(V, E)$, where $E$ is the set of links and $V$ is the set of nodes. Moreover, $A$ is the adjacency matrix, $\lambda$ is the graphical lasso parameter, $p$ denotes the number of nodes, and the universal set $U$ consists of all $p(p-1)/2$ possible edges. The set of links is randomly divided into two parts: a training set $E_T$ and a testing set $E_P$. Note that only the information in the training set can be used for link prediction. $\rho$ denotes the ratio of the training set, that is, the fraction of edges being used. Clearly, $E = E_T \cup E_P$ and $E_T \cap E_P = \emptyset$. In the training-set sampling process, we generate $n$ independent observations $X = [x_1, \ldots, x_n]$, each from the normal distribution $N(0, \Sigma)$, where $\Sigma = (E_T)^{-1}$. Here, $S$ denotes the sample covariance matrix of $X = [x_1, \ldots, x_n]$. We then apply the SGGM for link prediction. The pseudocode of the link prediction scheme is given in Algorithm 1.

4. Experiment and Analysis

4.1. Evaluation Metrics

4.1.1. AUC. In link prediction, we focus on accurately recovering missing links. Here, we have chosen the area under the receiver operating characteristic (ROC) curve as a metric. An ROC curve compares two operating characteristics, the true positive rate (TPR) and the false positive rate (FPR), as the decision criterion changes. The AUC ranges from 0.5 to 1, and a higher value indicates a better model.

4.1.2. Error Rate. We also define an error metric to evaluate the difference between the original and estimated networks, denoted by $\theta$ and $\theta'$, respectively. The error metric is defined as $\text{Error} = \|\theta - \theta'\|_{F} / \|\theta\|_{F}$, where $\|\cdot\|_{F}$ is the Frobenius norm.
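Both metrics are straightforward to compute; here is a minimal sketch assuming NumPy arrays and scikit-learn's roc_auc_score, with illustrative argument names.

```python
import numpy as np
from sklearn.metrics import roc_auc_score

def auc_metric(y_true, scores):
    """AUC over held-out node pairs: y_true marks test-set links (1)
    versus nonexistent links (0); scores are the predicted values."""
    return roc_auc_score(y_true, scores)

def error_rate(theta, theta_hat):
    """Relative Frobenius error ||theta - theta'||_F / ||theta||_F
    between the original and estimated networks."""
    return np.linalg.norm(theta - theta_hat, "fro") / np.linalg.norm(theta, "fro")
```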

4.2. Real-World Datasets. In this paper, we first consider four representative networks drawn from disparate fields; the data sources are (1) http://www-personal.umich.edu/~mejn/netdata/ (Football and Dolphin) and (2) http://vlado.fmf.uni-lj.si/pub/networks/data/ucinet/ucidata.htm (Tailor Shop and Taro).


Require: $A$, $\rho > 0$, $\lambda > 0$, $n > 0$
Ensure: $\hat{A}$ (estimated adjacency matrix)
  $E_T \Leftarrow$ randomly select $\rho|E|$ edges from $E$
  $E_P \Leftarrow E - E_T$
  $\Sigma \Leftarrow (E_T)^{-1}$
  $X \Leftarrow [x_1, x_2, \ldots, x_n]$, $n$ independent observations from $N(0, \Sigma)$
  $S \Leftarrow \operatorname{cov}(X)$
  $\Theta \Leftarrow$ SparseInverseCovarianceEstimation($A$, $\lambda$, $S$)
  if $\theta_{ij} \le 0$ and $i \neq j$ then
    $\hat{A}_{ij} \Leftarrow 1$
  else
    $\hat{A}_{ij} \Leftarrow 0$
  end if

Algorithm 1: Link prediction based on sparse inverse covariance estimation with graphical lasso.
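The following is a hedged Python sketch of Algorithm 1 built on scikit-learn's GraphicalLasso. The eigenvalue shift that makes the training adjacency matrix invertible is an assumption added here, since the pseudocode does not specify how $(E_T)^{-1}$ is to be taken; all names and defaults are illustrative.

```python
import numpy as np
from sklearn.covariance import GraphicalLasso

def sggm_link_prediction(A, train_ratio=0.9, lam=0.01, n_samples=None, seed=0):
    """Sketch of Algorithm 1. A is a symmetric 0/1 adjacency matrix."""
    rng = np.random.default_rng(seed)
    p = A.shape[0]
    n = n_samples or 2 * p  # sample scale, e.g., n = 2p as in the experiments

    # E_T <- randomly select train_ratio * |E| edges; E_P <- E - E_T.
    edges = np.argwhere(np.triu(A, 1) > 0)
    mask = rng.random(len(edges)) < train_ratio
    A_train = np.zeros_like(A, dtype=float)
    for i, j in edges[mask]:
        A_train[i, j] = A_train[j, i] = 1.0

    # Sigma <- (E_T)^{-1}; the diagonal shift below (an assumption added
    # here, not in the paper) makes the matrix positive definite.
    shift = max(0.0, -np.linalg.eigvalsh(A_train).min()) + 1e-3
    Sigma = np.linalg.inv(A_train + shift * np.eye(p))

    # X <- n independent observations from N(0, Sigma); S <- cov(X).
    X = rng.multivariate_normal(np.zeros(p), Sigma, size=n)

    # Theta <- sparse inverse covariance estimate via graphical lasso.
    Theta = GraphicalLasso(alpha=lam, max_iter=200).fit(X).precision_

    # A_ij <- 1 if theta_ij <= 0 and i != j, else 0.
    return ((Theta <= 0) & ~np.eye(p, dtype=bool)).astype(int)
```

With $A$ the full adjacency matrix of, say, the Dolphin network, the returned matrix can be compared against the held-out edges to compute the AUC and error rate defined above.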

Table 2: Basic topological features of four real-world networks.

Network     | N   | M   | E     | C     | Avg. D | r
Taro        | 22  | 39  | 0.488 | 0.339 | 3.546  | -0.375
Tailor Shop | 39  | 158 | 0.567 | 0.458 | 8.103  | -0.183
Dolphin     | 62  | 159 | 0.379 | 0.259 | 5.129  | -0.044
Football    | 115 | 613 | 0.450 | 0.403 | 10.661 | 0.162

We selected these four real-world datasets to compare the proposed method with the 13 similarity measures. Table 2 shows the basic topological features of the four real-world networks, wherein $N$ and $M$ are the total numbers of nodes and links, respectively. $E$ is the network efficiency [20], defined as $E = \frac{2}{N(N-1)}\sum_{x,y \in V,\, x \neq y} d_{xy}^{-1}$, where $d_{xy}$ denotes the shortest distance between nodes $x$ and $y$, and $d_{xy} = +\infty$ if $x$ and $y$ are in two different components. $C$ and Avg. $D$ denote the clustering coefficient [21] and the average degree, respectively, and $r$ is the network assortativity coefficient [22]. The network is assortative if a node tends to connect with similar nodes: $r > 0$ means that the entire network has an assortative structure and a node with large degree tends to connect to other nodes with large degree; $r < 0$ means that the entire network has a disassortative structure; and $r = 0$ implies that there is no degree correlation in the network structure. Figure 1 shows the degree distributions of the four datasets; it is clear that they are considerably different.

4.3. Experimental Settings. To evaluate the performance of the algorithms with different-sized training sets, we randomly selected 60%, 70%, 80%, and 90% of the links from the original network as training sets. Only the information in the training set was used for estimation. Each training set was executed 10 times; we chose 10-fold cross-validation because it is widely accepted in machine learning and data mining research. The thirteen similarity measures were implemented following Zhou et al. [14], and the SGGM was implemented using the SLEP toolbox [23]. We evaluated the performance of the SGGM algorithm with different sample scales. Here, $p$ denotes the number of nodes; we set the sample scale to $0.5p$, $p$, $1.5p$, and $2p$. The parameter $\lambda$ in the SGGM controls the sparsity of the estimated model: a larger $\lambda$ gives a sparser estimated network. Here, $\lambda$ ranged from 0.01 to 0.2 with a step of 0.01.

Table 3: Football dataset results (AUC by training set ratio).

Method        | 60%   | 70%   | 80%   | 90%
SGGM n = 0.5p | 0.702 | 0.725 | 0.757 | 0.783
SGGM n = p    | 0.742 | 0.777 | 0.818 | 0.857
SGGM n = 1.5p | 0.753 | 0.795 | 0.845 | 0.889
SGGM n = 2p   | 0.756 | 0.800 | 0.852 | 0.905
CN            | 0.774 | 0.798 | 0.824 | 0.835
Salton        | 0.779 | 0.804 | 0.828 | 0.838
Jaccard       | 0.502 | 0.508 | 0.514 | 0.520
Sørensen      | 0.590 | 0.597 | 0.604 | 0.606
HPI           | 0.778 | 0.802 | 0.827 | 0.838
HDI           | 0.779 | 0.803 | 0.827 | 0.838
LHN           | 0.777 | 0.802 | 0.827 | 0.838
AA            | 0.774 | 0.798 | 0.825 | 0.836
RA            | 0.774 | 0.798 | 0.824 | 0.836
PA            | 0.515 | 0.516 | 0.520 | 0.528
LNBCN         | 0.777 | 0.800 | 0.825 | 0.836
LNBAA         | 0.777 | 0.800 | 0.825 | 0.836
LNBRA         | 0.776 | 0.799 | 0.825 | 0.837

4.4. Results

4.4.1. Tuning Parameter for the SGGM. Using the training set scaled to 90%, we tested the proposed SGGM with different $\lambda$ values and sample scales. As shown in Figure 2, the AUC of the SGGM increased with increasing sample scale for the four datasets. One advantage of the SGGM is that it is not sensitive to the parameter $\lambda$. We then set the sample scale to $0.5N$. Figure 3 shows that the AUC increased with the training set scale. For the Tailor Shop dataset, the proposed algorithm performs optimally with the 70% scaled training set.

4.4.2. Comparison on AUC. Table 3 lists the results for 14 methods with the Football dataset. The SGGM performs optimally when 80% or 90% of the training set is used and the number of samples is greater than $N$. The prediction accuracy of the SGGM increased by 5% compared to the other 13 methods. However, with lower proportions of the training set, the SGGM's AUC is very close to that of the other methods. Note that PA yielded the poorest result with this dataset.

As shown in Table 4, the SGGM method performs optimally with the Dolphin dataset. The prediction accuracy of the SGGM method increased by at least 3% compared to the other thirteen methods. PA is optimal among the similarity measures, and Jaccard yielded the poorest result.

Figure 1: Degree distribution of four real-world datasets. (a) Dolphin; (b) Football; (c) Taro; (d) Tailor Shop. Each panel plots the probability of the node degree against degree.

As shown in Table 5, the SGGM method outperforms the other 13 methods with the Taro dataset; the AUC improved by at least 10%. For the Taro dataset, the degree of 70% of the nodes is 3, which indicates a low degree of heterogeneity; thus, the performance of the other thirteen methods is very close. One might expect PA to show good performance on assortative networks and poor performance on disassortative networks. However, the experimental results show no obvious correlation between PA performance and the assortativity coefficient.

For the Tailor Shop dataset, PA performs optimally when the 60% and 70% training sets are used, while the SGGM method performs optimally with the 80% and 90% training sets (Table 6). The SGGM's prediction accuracy gradually increased with the number of samples. Thus, when sampling conditions permit, multiple samples can improve prediction accuracy.

4.4.3. Comparison on ROC. The ratio of the training set was set to 90%. The sample scale of the SGGM varied within $0.5p$, $p$, $1.5p$, and $2p$, with $\lambda = 0.01$. CN was chosen as the representative neighbourhood measure because six other measures (Salton, Jaccard, Sørensen, HPI, HDI, and LHN) are variants of CN.

The ROC of the RA index is similar to that of the AA index. As observed in Figure 4, in most cases the SGGM shows an improvement over existing similarity measures when sufficient samples are used. Note that more samples lead to higher accuracy.

Figure 2: AUC of the SGGM (four datasets, training set ratio = 90%). (a) Dolphin; (b) Football; (c) Taro; (d) Tailor Shop.

Figure 3: AUC of the SGGM (four datasets, sample scale = 0.5N). (a) Dolphin; (b) Football; (c) Taro; (d) Tailor Shop.

Figure 4: Comparison of the ROC metric of four methods (SGGM, CN, AA, and PA) on four datasets. (a) Dolphin; (b) Football; (c) Taro; (d) Tailor Shop. The SGGM curves are shown for n = 0.5p, p, 1.5p, and 2p.

4.4.4. Comparison on Error Rate. The error rate is a metric to evaluate the difference between two matrices; a lower error rate indicates that the estimated matrix better approximates the original matrix. From Table 7, we observe that the SGGM has the lowest error rate among all methods shown. To show the effectiveness of the proposed SGGM, both the original adjacency matrix and the estimated adjacency matrices are presented as coloured graphs in Figure 5. We used the Taro dataset with a 90% training set to recover the original network. Graphs with fewer red points indicate good recovery of the original matrix. As can be seen from Figure 5, the SGGM method can restore the original image, whereas the other similarity measures return greater error rates. Note that the SGGM error rate decreased with increasing sample scale.


Figure 5: Recovery performances of 14 methods on the Taro dataset. The upper left corner is the original matrix; blue points denote links between two corresponding nodes. Red points denote links in the original matrix that were not accurately estimated. Green points denote links in the estimated matrix that do not exist in the original matrix. From left to right, top to bottom, the figures are the matrices recovered by SGGM (n = 0.5p, p, 1.5p, and 2p), CN, Salton, Jaccard, Sørensen, HPI, HDI, LHN, AA, RA, PA, LNBCN, LNBAA, and LNBRA, respectively.

Table 4: Dolphin dataset results (AUC by training set ratio).

Method        | 60%   | 70%   | 80%   | 90%
SGGM n = 0.5p | 0.699 | 0.731 | 0.761 | 0.801
SGGM n = p    | 0.742 | 0.775 | 0.823 | 0.867
SGGM n = 1.5p | 0.758 | 0.803 | 0.845 | 0.895
SGGM n = 2p   | 0.765 | 0.807 | 0.857 | 0.909
CN            | 0.631 | 0.687 | 0.728 | 0.762
Salton        | 0.621 | 0.673 | 0.709 | 0.740
Jaccard       | 0.485 | 0.490 | 0.495 | 0.500
Sørensen      | 0.538 | 0.558 | 0.571 | 0.583
HPI           | 0.620 | 0.669 | 0.702 | 0.728
HDI           | 0.626 | 0.680 | 0.720 | 0.753
LHN           | 0.619 | 0.666 | 0.698 | 0.721
AA            | 0.633 | 0.690 | 0.732 | 0.767
RA            | 0.632 | 0.690 | 0.731 | 0.766
PA            | 0.712 | 0.727 | 0.737 | 0.746
LNBCN         | 0.636 | 0.696 | 0.739 | 0.775
LNBAA         | 0.635 | 0.694 | 0.738 | 0.773
LNBRA         | 0.635 | 0.693 | 0.736 | 0.771

4.5. Large Datasets. The main challenge of link prediction is dealing with large networks. We carefully chose four large datasets from four individual domains. For the SGGM method, $\lambda$ was set to 0.01, and 90% of the data was used as the training set.

Table 5: Taro dataset results (AUC by training set ratio).

Method        | 60%   | 70%   | 80%   | 90%
SGGM n = 0.5p | 0.672 | 0.711 | 0.736 | 0.750
SGGM n = p    | 0.713 | 0.747 | 0.779 | 0.815
SGGM n = 1.5p | 0.721 | 0.763 | 0.816 | 0.857
SGGM n = 2p   | 0.723 | 0.788 | 0.825 | 0.865
CN            | 0.529 | 0.545 | 0.573 | 0.582
Salton        | 0.530 | 0.542 | 0.571 | 0.580
Jaccard       | 0.468 | 0.476 | 0.476 | 0.478
Sørensen      | 0.523 | 0.533 | 0.544 | 0.552
HPI           | 0.533 | 0.545 | 0.574 | 0.585
HDI           | 0.528 | 0.541 | 0.568 | 0.576
LHN           | 0.532 | 0.543 | 0.570 | 0.579
AA            | 0.533 | 0.554 | 0.589 | 0.609
RA            | 0.533 | 0.554 | 0.589 | 0.610
PA            | 0.559 | 0.568 | 0.582 | 0.607
LNBCN         | 0.536 | 0.562 | 0.616 | 0.638
LNBAA         | 0.537 | 0.561 | 0.614 | 0.634
LNBRA         | 0.536 | 0.560 | 0.613 | 0.634

Table 6: Tailor Shop dataset results (AUC by training set ratio).

Method        | 60%   | 70%   | 80%   | 90%
SGGM n = 0.5p | 0.652 | 0.679 | 0.677 | 0.706
SGGM n = p    | 0.703 | 0.723 | 0.750 | 0.769
SGGM n = 1.5p | 0.715 | 0.755 | 0.786 | 0.817
SGGM n = 2p   | 0.734 | 0.770 | 0.803 | 0.844
CN            | 0.673 | 0.699 | 0.720 | 0.746
Salton        | 0.616 | 0.628 | 0.646 | 0.667
Jaccard       | 0.494 | 0.498 | 0.504 | 0.514
Sørensen      | 0.583 | 0.589 | 0.592 | 0.597
HPI           | 0.615 | 0.618 | 0.631 | 0.642
HDI           | 0.628 | 0.643 | 0.660 | 0.680
LHN           | 0.588 | 0.574 | 0.571 | 0.566
AA            | 0.682 | 0.709 | 0.729 | 0.753
RA            | 0.681 | 0.707 | 0.729 | 0.752
PA            | 0.761 | 0.775 | 0.779 | 0.786
LNBCN         | 0.692 | 0.715 | 0.746 | 0.765
LNBAA         | 0.693 | 0.716 | 0.746 | 0.764
LNBRA         | 0.693 | 0.715 | 0.745 | 0.762

To demonstrate the performance of the SGGM, we compare the proposed method with the other thirteen methods on these large datasets (for details, see Table 8) in terms of AUC and error rate. The data sources are (1) http://konect.uni-koblenz.de/networks (Email), (2) http://www3.nd.edu/~networks/resources.htm (Protein), (3) http://www-personal.umich.edu/~mejn/netdata/power.zip (Grid), and (4) http://konect.uni-koblenz.de/networks (Astro-ph).

To estimate the large networks efficiently, we used QUIC [19] to solve the sparse inverse covariance estimation problem (formula (1)) instead of glasso [7]. Overall, the proposed method can efficiently recover the original network and predict the missing links accurately.


Table 7: Error rate of 14 methods on four datasets.

Method        | Football | Dolphin | Taro  | Tailor Shop
SGGM n = 0.5p | 1.467    | 1.251   | 1.098 | 1.121
SGGM n = p    | 1.364    | 1.109   | 1.062 | 1.188
SGGM n = 1.5p | 1.362    | 1.198   | 1.098 | 1.025
SGGM n = 2p   | 1.200    | 1.121   | 1.013 | 0.987
CN            | 1.641    | 1.291   | 1.320 | 1.485
Salton        | 1.641    | 1.291   | 1.320 | 1.485
Jaccard       | 1.641    | 1.291   | 1.320 | 1.485
Sørensen      | 1.863    | 1.672   | 1.695 | 1.826
HPI           | 1.641    | 1.291   | 1.320 | 1.485
HDI           | 1.641    | 1.291   | 1.320 | 1.485
LHN           | 1.641    | 1.291   | 1.320 | 1.485
AA            | 1.641    | 1.291   | 1.320 | 1.485
RA            | 1.641    | 1.291   | 1.320 | 1.485
PA            | 2.512    | 1.251   | 1.340 | 1.446
LNBCN         | 1.641    | 1.291   | 1.261 | 1.382
LNBAA         | 1.641    | 1.291   | 1.261 | 1.382
LNBRA         | 1.641    | 1.291   | 1.261 | 1.382

Table 8: Basic topological features of four large networks.

Network  | N     | M      | E     | C     | Avg. D | r
Email    | 1133  | 5451   | 0.299 | 0.220 | 9.622  | 0.078
Protein  | 1870  | 2240   | 0.099 | 0.099 | 2.396  | -0.156
Grid     | 4941  | 6594   | 0.063 | 0.107 | 2.669  | 0.003
Astro-ph | 18771 | 198050 | 0.091 | 0.486 | 41.40  | 0.294

Table 9: Comparison of the methods on four large datasets on AUC.

Method   | Email | Protein | Grid  | Astro-ph
SGGM     | 0.916 | 0.943   | 0.944 | 0.929
CN       | 0.817 | 0.566   | 0.570 | 0.864
Salton   | 0.812 | 0.566   | 0.569 | 0.863
Jaccard  | 0.475 | 0.479   | 0.487 | 0.479
Sørensen | 0.547 | 0.493   | 0.529 | 0.584
HPI      | 0.809 | 0.566   | 0.569 | 0.863
HDI      | 0.813 | 0.566   | 0.569 | 0.863
LHN-I    | 0.805 | 0.565   | 0.570 | 0.864
AA       | 0.819 | 0.565   | 0.570 | 0.876
RA       | 0.818 | 0.566   | 0.568 | 0.838
PA       | 0.824 | 0.820   | 0.713 | 0.785

As observed from Table 9, the AUC of the SGGM is the highest on all four datasets. For the Email, Protein, Grid, and Astro-ph datasets, the SGGM improves the AUC by 9.2%, 12.3%, 23.1%, and 5.3%, respectively. For the error rate metric, as shown in Table 10, the SGGM method performs optimally with all four large datasets, indicating that the SGGM method can accurately recover the original network.

Table 10: Comparison of the methods on four large datasets on error rate.

Method   | Email | Protein | Grid   | Astro-ph
SGGM     | 0.467 | 0.411   | 0.367  | 0.359
CN       | 2.348 | 2.189   | 1.670  | 1.568
Salton   | 2.978 | 2.189   | 1.767  | 1.549
Jaccard  | 2.982 | 2.189   | 1.450  | 1.548
Sørensen | 3.047 | 2.555   | 2.071  | 2.771
HPI      | 2.578 | 2.189   | 1.670  | 1.610
HDI      | 2.978 | 2.189   | 1.237  | 1.854
LHN-I    | 2.428 | 2.189   | 1.692  | 1.612
AA       | 2.922 | 2.204   | 1.872  | 1.953
RA       | 2.348 | 2.189   | 1.670  | 1.410
PA       | 9.219 | 8.032   | 15.081 | 12.451

4.6. Analysis. In the comparative analysis of the performance of the 14 methods on the eight real-world datasets, the SGGM method was outstanding in terms of AUC and error rate. This method can accurately recover the original adjacency matrix and, in most cases, requires few samples, that is, far fewer than the actual number of nodes. The SGGM method's prediction precision gradually increased with the number of samples; moreover, the recovery error rate decreased with increasing matrix sparsity. Therefore, the SGGM method can be applied to small samples; however, it performs better with more samples. These results demonstrate that the SGGM method can be implemented in a scalable fashion.

5. Conclusions

Link prediction is a basic problem in complex network analysis. In real-world networks, obtaining node and edge properties is cumbersome. Link prediction methods based on network structure similarity do not depend on node properties; however, these methods are limited by network structure and are difficult to adapt to different network structures.

Our work does not depend on node properties and extends link prediction to diverse network topologies that previous methods could not handle well. We sampled the original network adjacency matrix and used the SGGM method to depict the network structure.

Most nodes are independent in an actual network; thus, we used the SGGM method to estimate the precision matrix of the adjacency matrix to predict links. Our experimental results show that using the SGGM method to recover network structure is better than 13 mainstream similarity methods. We tested the proposed method with eight real-world datasets and used the AUC and error rate as evaluation indexes, finding that the method obtains higher prediction precision. We also analysed the influence of different parameters on the SGGM method and found no significant influence; the SGGM method returns a high AUC value within a certain parameter range. Furthermore, the proposed method retains high prediction precision with fewer training samples, and it can be applied to large networks.

Conflict of Interests

The authors declare that there is no conflict of interests regarding the publication of this paper.

Acknowledgments

The authors thank Longqi Yang, Zhisong Pan, and Guyu Hu for their contributions to work on link prediction and Zhen Li and Junyang Qiu for their valuable discussions, comments, and suggestions. This work was supported by the National Technology Research and Development Program of China (863 Program) (Grant no. 2012AA01A510), the National Natural Science Foundation of China (Grant no. 61473149), and the Natural Science Foundation of Jiangsu Province, China (Grant no. BK20140073).

References

[1] L. Lü and T. Zhou, "Link prediction in complex networks: a survey," Physica A: Statistical Mechanics and Its Applications, vol. 390, no. 6, pp. 1150–1170, 2011.

[2] B. Ermiş, E. Acar, and A. T. Cemgil, "Link prediction in heterogeneous data via generalized coupled tensor factorization," Data Mining & Knowledge Discovery, vol. 29, no. 1, pp. 203–236, 2013.

[3] Y. Yang, N. Chawla, Y. Sun, and J. Hani, "Predicting links in multi-relational and heterogeneous networks," in Proceedings of the 12th IEEE International Conference on Data Mining (ICDM '12), pp. 755–764, Brussels, Belgium, December 2012.

[4] D. Liben-Nowell and J. Kleinberg, "The link-prediction problem for social networks," Journal of the American Society for Information Science and Technology, vol. 58, no. 7, pp. 1019–1031, 2007.

[5] Y. Yang, R. N. Lichtenwalter, and N. V. Chawla, "Evaluating link prediction methods," Knowledge and Information Systems, vol. 45, no. 3, pp. 751–782, 2015.

[6] N. Z. Gong, A. Talwalkar, L. Mackey et al., "Joint link prediction and attribute inference using a social-attribute network," ACM Transactions on Intelligent Systems and Technology, vol. 5, no. 2, article 27, 2014.

[7] J. Friedman, T. Hastie, and R. Tibshirani, "Sparse inverse covariance estimation with the graphical lasso," Biostatistics, vol. 9, no. 3, pp. 432–441, 2008.

[8] M. E. Newman, "Clustering and preferential attachment in growing networks," Physical Review E, vol. 64, no. 2, Article ID 025102, 2001.

[9] G. Salton and M. J. McGill, Introduction to Modern Information Retrieval, McGraw-Hill, New York, NY, USA, 1986.

[10] P. Jaccard, "Étude comparative de la distribution florale dans une portion des Alpes et du Jura," Bulletin de la Société Vaudoise des Sciences Naturelles, vol. 37, no. 142, pp. 547–579, 1901.

[11] T. Sørensen, "A method of establishing groups of equal amplitude in plant sociology based on similarity of species and its application to analyses of the vegetation on Danish commons," Biologiske Skrifter, vol. 5, pp. 1–34, 1948.

[12] E. A. Leicht, P. Holme, and M. E. J. Newman, "Vertex similarity in networks," Physical Review E, vol. 73, no. 2, Article ID 026120, 2006.

[13] Y.-B. Xie, T. Zhou, and B.-H. Wang, "Scale-free networks without growth," Physica A: Statistical Mechanics and Its Applications, vol. 387, no. 7, pp. 1683–1688, 2008.

[14] T. Zhou, L. Lü, and Y.-C. Zhang, "Predicting missing links via local information," The European Physical Journal B, vol. 71, no. 4, pp. 623–630, 2009.

[15] Z. Liu, Q. M. Zhang, L. Lü, and T. Zhou, "Link prediction in complex networks: a local naïve Bayes model," EPL (Europhysics Letters), vol. 96, no. 4, Article ID 48007, 2011.

[16] J. Dahl, L. Vandenberghe, and V. Roychowdhury, "Covariance selection for nonchordal graphs via chordal embedding," Optimization Methods & Software, vol. 23, no. 4, pp. 501–520, 2008.

[17] O. Banerjee, L. El Ghaoui, and A. d'Aspremont, "Model selection through sparse maximum likelihood estimation for multivariate Gaussian or binary data," Journal of Machine Learning Research, vol. 9, no. 3, pp. 485–516, 2008.

[18] C.-J. Hsieh, M. A. Sustik, I. S. Dhillon, P. K. Ravikumar, and R. Poldrack, "BIG & QUIC: sparse inverse covariance estimation for a million variables," in Advances in Neural Information Processing Systems, pp. 3165–3173, MIT Press, 2013.

[19] C.-J. Hsieh, M. A. Sustik, I. S. Dhillon, and P. Ravikumar, "QUIC: quadratic approximation for sparse inverse covariance estimation," Journal of Machine Learning Research, vol. 15, pp. 2911–2947, 2014.

[20] V. Latora and M. Marchiori, "Efficient behavior of small-world networks," Physical Review Letters, vol. 87, Article ID 198701, 2001.

[21] D. J. Watts and S. H. Strogatz, "Collective dynamics of 'small-world' networks," Nature, vol. 393, no. 6684, pp. 440–442, 1998.

[22] M. E. J. Newman, "Assortative mixing in networks," Physical Review Letters, vol. 89, no. 20, Article ID 208701, 2002.

[23] J. Liu, S. Ji, and J. Ye, SLEP: Sparse Learning with Efficient Projections, Arizona State University, Tempe, Ariz, USA, 2009.


Page 2: Research Article Link Prediction via Sparse Gaussian Graphical … · 2019. 5. 14. · Research Article Link Prediction via Sparse Gaussian Graphical Model LiangliangZhang,LongqiYang,GuyuHu,ZhisongPan,andZhenLi

2 Mathematical Problems in Engineering

calculate feature propertyTherefore using only an adjacencymatrix can avoid interference of node properties which isconvenient and feasible

Many community detection methods are based on anadjacency matrix Thus the adjacency matrixes integritydirectly affects the results Through link prediction usingan adjacency matrix we can determine the relationshipsbetween unconnected nodes and the entire communitystructure can be obtained by analysing these relationshipsThus an effective link prediction method is required Thisissue presents a series of challenges such as the following(1) link prediction must function with an adjacency matrixthat does not contain properties (2) a graph structure modelfor estimating the network is required to determine the roleof different types of connections and (3) verification andevaluation of link prediction are required

Existing link prediction methods use similarities nodeproperties edge properties and so forth However propertiesrequire a large amount of test data and heavily rely onnetwork connectivity and structure thus link predictionwithout properties is less robustWhen the network structurechanges it is difficult to mine the relationships betweennodes Thus determining how to use limited test data topredict a network edge is the motivation of this study

To solve the above problems this paper presents a linkprediction method based on the application of a Gaussiangraphicalmodel (GGM) to an adjacencymatrixThis conceptreferences Friedman et alrsquos [7] sparse inverse covarianceestimation theory The study uses the original adjacencymatrix for sampling thereby obtaining a samplematrixThuswe use a sparse GGM (SGGM) inverse covariance matrix forlink prediction The main contributions of this study are asfollows

(1) Sampling of a network to build a feature matrixseeking maximum likelihood estimation using anSGGM and estimating an inverse covariance matrix(precision matrix) of the adjacency matrix

(2) Establishing conditional independence betweennodes to complete link prediction using the Markovrandom field independence principle

(3) Proving that the proposed method is more effectivethan previous methods by testing the methods usingfour real-world datasets

The remainder of this paper is organised as follows Weintroduce related work in Section 2 includingmany previouslink predictionmethods In Section 3 we present our SGGM-based link prediction method In Section 4 we introduceeight real-world datasets and test the methods using thesedatasets to prove that the proposed method is more effectivethan previous methods Finally we conclude the paper andpresent suggestions for future work in Section 5

2 Related Work

21 Problem Description Existing link prediction methodscan be divided into three categories

211 Similarity Link Prediction Employs Different MethodsOne is based on node properties such as sex age occupationpreferences and other properties to compute node similarityIt is more probable that edges will exist between high-similarity nodes Another method is based on the networkstructures similarity for example the use of CN nodesHowever this method is only applicable to a network witha high network clustering coefficient

212 Estimates Based on the Maximum Likelihood Estima-tion of a Link Can Be Divided into Two Categories Onemethod is based on network hierarchy but it has highcomplexity because it generates many network samples Theother method is based on stochastic block model predictionwherein nodes are divided into some sets and the probabilityof an edge depends on corresponding sets

213 A Link Prediction Model Based on Probability Builds aModel by Adjusting the Parameters This can fit the structureof the relationships in real networks A pair of nodes will gen-erate an edge determined by probability using the optimumparameter A probabilistic model considers the probability ofexisting edges as a property It transforms edge predictioninto property issues This method takes advantage of thenetwork structure and node properties with high precisionbut offers poor universality

Due to the poor universality of maximum likelihoodestimation and the probability model which depend highlyon node properties thesemethods cannot be applied tomanynetworks Herein we consider a link prediction methodwhich is only based on similarity and discuss experimentsperformed to compare the proposed and previous methods

22 Similarity-Based Link Prediction Here we compare theprediction accuracies of 13 similarity measures All of thesemeasures are based on the local structural informationcontained in a test set We first introduce each measurebriefly The formulas are shown in Table 1 Here 119866(119881 119864) isan undirected network 119881 is a set of nodes and 119864 is a set ofedges The total number of nodes for the network is 119873 andthe number of edges is 119872 For a node 119883 and its neighboursΛ(119909) the degree of 119883 is 119889(119909) = |Λ(119909)| The network has119873(119873minus1)2 node pairs that is a universal set119880 When givena link prediction method each pair of nodes (119909 119910) isin (119880 119864)

without an edge will have a score 119878119909119910 Then all unconnected

pairs of nodes are ordered by the score value in descendingorder and the probability of an edge appearing is the largeston the top

CN nodes are based on a local information similarityindex and this is one of the simplest methods [8] In otherwords it is more likely that a link will exist between twohigh-similarity nodes that have many neighbours For node119883 Λ(119909) represent its neighbours and if nodes 119883 and 119884

have many CNs they are more likely to have a link Visiblythe structural equivalence pays more attention to whethertwo nodes are in the same circumstance For example ina social network context if two people share many friendsthey are more likely to be friends themselves We consider

Mathematical Problems in Engineering 3

Table 1: Similarity indexes of local node information.

Name              Definition
CN                $S_{xy} = |\Lambda(x) \cap \Lambda(y)|$
Salton index      $S_{xy} = |\Lambda(x) \cap \Lambda(y)| \,/\, \sqrt{d(x) \times d(y)}$
Jaccard index     $S_{xy} = |\Lambda(x) \cap \Lambda(y)| \,/\, |\Lambda(x) \cup \Lambda(y)|$
Sørensen index    $S_{xy} = 2\,|\Lambda(x) \cap \Lambda(y)| \,/\, (d(x) + d(y))$
HPI               $S_{xy} = |\Lambda(x) \cap \Lambda(y)| \,/\, \min\{d(x), d(y)\}$
HDI               $S_{xy} = |\Lambda(x) \cap \Lambda(y)| \,/\, \max\{d(x), d(y)\}$
LHN-I             $S_{xy} = |\Lambda(x) \cap \Lambda(y)| \,/\, (d(x) \times d(y))$
PA                $S_{xy} = d(x) \times d(y)$
AA                $S_{xy} = \sum_{z \in \Lambda(x) \cap \Lambda(y)} 1/\lg d(z)$
RA                $S_{xy} = \sum_{z \in \Lambda(x) \cap \Lambda(y)} 1/d(z)$

We consider the impact of the degree of both nodes and obtain six further similarity indices, namely, the Salton index (cosine-based similarity) [9], Jaccard index [10], Sørensen index [11], hub promoted index (HPI), hub depressed index (HDI), and Leicht-Holme-Newman (LHN-I) index [12]. These indices are all based on CN similarity. Another degree-based method is preferential attachment (PA) [13], which can generate a scale-free network; note that the complexity of this algorithm is lower than that of the others because it requires less information.

If we additionally consider the degree information of the two nodes' CNs, we obtain the Adamic-Adar (AA) index [14], which assumes that the contribution of a small-degree CN is greater than that of a large-degree one. Liu et al. [15] proposed the RA index, which takes the view of network resource allocation. The RA and AA indices weight CN nodes in different ways: RA decreases as $1/k$, whereas AA decreases as $1/\lg k$. The RA index performs better than the AA index in weighted networks and community mining. Traditional CN methods do not distinguish the different roles of CNs; Liu et al. [15] therefore proposed a local naive Bayes (LNB) model, which introduces a role parameter to mark the different roles of CNs. However, it applies only to certain types of networks.
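As a concrete illustration of these local indices, the following sketch computes the CN, AA, RA, and PA scores for one node pair from an adjacency matrix. The code and the toy graph are our own; following Table 1, AA is taken with the base-10 logarithm, and the degree-one guard is a safeguard we add.

```python
import numpy as np

def similarity_scores(A, x, y):
    """Local similarity indices for the node pair (x, y).

    A is a symmetric 0/1 adjacency matrix (numpy array).
    Returns the CN, AA, RA, and PA scores defined in Table 1.
    """
    nbr_x = set(np.flatnonzero(A[x]))
    nbr_y = set(np.flatnonzero(A[y]))
    common = nbr_x & nbr_y
    deg = A.sum(axis=1)                       # node degrees d(.)

    cn = len(common)                          # |Λ(x) ∩ Λ(y)|
    aa = sum(1.0 / np.log10(deg[z])           # 1 / lg d(z); skip d(z) = 1
             for z in common if deg[z] > 1)   # to avoid dividing by lg 1 = 0
    ra = sum(1.0 / deg[z] for z in common)    # 1 / d(z)
    pa = deg[x] * deg[y]                      # d(x) × d(y)
    return cn, aa, ra, pa

# Toy example (hypothetical data): a 4-node graph, scoring the
# unconnected pair (0, 3).
A = np.array([[0, 1, 1, 0],
              [1, 0, 1, 1],
              [1, 1, 0, 0],
              [0, 1, 0, 0]])
print(similarity_scores(A, 0, 3))
```

Computing such scores for every pair in $U - E$ and ranking them in descending order yields the prediction list described above.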

3. Link Prediction Based on Sparse Gaussian Graphical Model

Most similarity algorithms perform well with ideal datasets and large training sets. Such algorithms do not consider missing, incomplete, and polluted data in an adjacency matrix. As previously mentioned, the accuracy of similarity-based methods depends on whether the method can capture the characteristics of the network structure. For example, CN-based algorithms utilise a node's CNs, and a pair of nodes with many CNs is more likely to connect. Such algorithms perform well, sometimes even better than complex algorithms, in a network with a high clustering coefficient. However, in networks with a low clustering coefficient, such as router or power networks, their accuracy is significantly worse.

This study attempts to determine links in an adjacency matrix to reveal actual links between nodes. The proposed SGGM method is based on an undirected graphical model and transforms the adjacency matrix without using node properties. The SGGM method estimates the precision matrix of a network to predict unknown edges. To verify link prediction accuracy, the area under the curve (AUC) and an error metric are used to demonstrate the proposed method's effectiveness.

3.1. Sparse Graphical Model. Biologists interested in genetic connections use the GGM to estimate genetic interactions. Edge relationships in an undirected graph are represented through the joint distribution of random variables. For example, genome work is based on biological functions, and regulatory relationships exist between genes; corresponding to the graph, genes are nodes and edges represent these regulatory relationships. We assume that the variables have a Gaussian distribution; therefore, the GGM is most frequently used, and the problem is equivalent to estimating the inverse covariance matrix $\Sigma^{-1}$ (the precision matrix $\Theta = \Sigma^{-1}$), whose nonzero off-diagonal elements represent the edges in the graph [16]. The GGM thus encodes the statistical dependencies between variables: if there is no link between two nodes, the corresponding variables are conditionally independent. In the GGM, the precision matrix parameterises each edge.

A popular modern method for estimating a GGM is the graphical lasso: Friedman et al. [7] added an $\ell_1$ norm to penalise each off-diagonal element of the inverse covariance matrix.

The GGM encodes the conditional dependence relationships among a set of $p$ variables with $n$ independent, identically distributed Gaussian observations. Motivated by network terminology, we refer to the $p$ variables in a graphical model as nodes. If a pair of variables (or features) is conditionally dependent, then there is an edge between the corresponding pair of nodes; otherwise, no edge is present. The basic model for continuous data assumes that the observations have a multivariate Gaussian distribution with mean $\mu$ and covariance matrix $\Sigma$; here, $X^{[1]}, X^{[2]}, \ldots, X^{[N]} \sim N(0, \Sigma)$. If the $ij$th component of $\Sigma^{-1}$ is zero, then the variables $i$ and $j$ are conditionally independent given the other variables $X_k$, $k = 1, \ldots, p$, $k \neq i, j$. The nonzero elements of $\Sigma^{-1}$ thus represent the graph's structure; in other words, if $(\Sigma^{-1})_{ij} = 0$, nodes $i$ and $j$ have no connecting edge in the graph. The precision matrix is sparse due to the conditional independence of the variables.


To estimate a sparse GGM, many methods are based on maximum likelihood estimation (MLE); one such method is the graphical lasso, which is expressed as

$$\max_{\Theta \succ 0} \;\; \log\det\Theta - \operatorname{tr}(S\Theta) - \lambda\,\|\Theta\|_1, \qquad (1)$$

$$S = \frac{1}{N}\sum_{i=1}^{N}(x_i - \bar{x})(x_i - \bar{x})^{T}, \qquad (2)$$

$$\|\Theta\|_1 = \sum_{i \neq j}\left|\Theta_{ij}\right|. \qquad (3)$$

Here, $S$ is the empirical covariance matrix and $\lambda$ is a positive tuning parameter; $\Theta \succ 0$ means that $\Theta$ is a $p \times p$ positive definite matrix. $\|\Theta\|_1$ is the penalty term, and $\Theta$ becomes increasingly sparse as $\lambda$ increases. The $\ell_1$-norm is utilised so that $\Theta$ is treated as a variable rather than a fixed parameter.
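For exposition, here is a minimal sketch of solving (1) numerically. The experiments in this paper use the SLEP and QUIC solvers; scikit-learn's GraphicalLasso is a stand-in we use for illustration, and the synthetic data are hypothetical.

```python
import numpy as np
from sklearn.covariance import GraphicalLasso

# Hypothetical data: n observations of p variables, assumed ~ N(0, Sigma).
rng = np.random.default_rng(0)
X = rng.standard_normal((200, 10))

# alpha plays the role of the tuning parameter lambda in (1):
# a larger alpha gives a sparser estimated precision matrix Theta.
model = GraphicalLasso(alpha=0.1).fit(X)

Theta = model.precision_           # estimated Theta = Sigma^{-1}
edges = np.abs(Theta) > 1e-8       # nonzero off-diagonals correspond to edges
np.fill_diagonal(edges, False)
print(np.argwhere(edges))
```

Increasing alpha here mirrors increasing $\lambda$ in (1): the estimated $\Theta$ becomes sparser and the induced graph loses edges.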

3.2. Graphical Lasso Algorithm. In 2008, Banerjee et al. [17] proved that formula (1) is convex. They estimated $\Sigma$ via its estimate $W$ only and solved the problem by optimising over each row and the corresponding column of $W$ in a block coordinate descent fashion:

$$W = \begin{pmatrix} W_{11} & w_{12} \\ w_{12}^{T} & w_{22} \end{pmatrix}, \qquad S = \begin{pmatrix} S_{11} & s_{12} \\ s_{12}^{T} & s_{22} \end{pmatrix}. \qquad (4)$$

By partitioning $W$ and $S$ as in (4), the solution for $w_{12}$ satisfies

$$w_{12} = \arg\min_{y}\left\{\, y^{T} W_{11}^{-1} y \;:\; \|y - s_{12}\|_{\infty} \le \rho \,\right\}. \qquad (5)$$

Banerjee et al. [17] proved that solving (5) is equivalent to solving its dual problem

$$\min_{\beta} \;\; \frac{1}{2}\left\| W_{11}^{1/2}\beta - b \right\|^{2} + \rho\,\|\beta\|_1. \qquad (6)$$

Here, assume that $\beta$ solves (6) with $b = W_{11}^{-1/2} s_{12}$; then $w_{12} = W_{11}\beta$ is the solution of (5). Therefore, (6) is similar to a lasso regression and is the basis of the graphical lasso algorithm. First, we must prove that (6) and (1) are equivalent. Given $W\Theta = I$, that is,

$$\begin{pmatrix} W_{11} & w_{12} \\ w_{12}^{T} & w_{22} \end{pmatrix}\begin{pmatrix} \Theta_{11} & \theta_{12} \\ \theta_{12}^{T} & \theta_{22} \end{pmatrix} = \begin{pmatrix} I & 0 \\ 0^{T} & 1 \end{pmatrix}, \qquad (7)$$

the MLE stationarity condition of (1) is rewritten as

$$W - S - \rho H = 0. \qquad (8)$$

Note that the derivative of $\log\det\Theta$ equals $\Theta^{-1} = W$. Here, $H_{ij} \in \operatorname{sign}(\Theta_{ij})$; that is, $H_{ij} = \operatorname{sign}(\Theta_{ij})$ if $\Theta_{ij} \neq 0$, and $H_{ij} \in [-1, 1]$ if $\Theta_{ij} = 0$. Thus, the upper-right block of (8) is expressed as

$$w_{12} - s_{12} - \rho\gamma_{12} = 0. \qquad (9)$$

The subgradient equation of (6) is expressed as follows:

$$W_{11}\beta - s_{12} + \rho v = 0, \qquad (10)$$

where $v \in \operatorname{sign}(\beta)$. Now suppose that $(W, H)$ solves (8); consequently, $(w_{12}, \gamma_{12})$ solves (9). Then $\beta = W_{11}^{-1} w_{12}$ and $v = -\gamma_{12}$ solve (10). For the sign terms, $W_{11}\theta_{12} + w_{12}\theta_{22} = 0$ is derived from (7); therefore, we obtain $\theta_{12} = -\theta_{22} W_{11}^{-1} w_{12}$. Since $\theta_{22} > 0$, it follows that $\operatorname{sign}(\theta_{12}) = -\operatorname{sign}(W_{11}^{-1} w_{12}) = -\operatorname{sign}(\beta)$, which proves the equivalence. The solution $\beta$ of the lasso problem (6) thus gives the relevant part of $\Theta$: $\theta_{12} = -\theta_{22}\beta$.

Problem (6) looks like a lasso ($\ell_1$-regularised) least-squares problem. In fact, if $W_{11} = S_{11}$, the solution $\beta$ equals the lasso estimate for the $p$th variable. In 2007, Friedman et al. used fast coordinate descent algorithms to solve the lasso problem. Hsieh et al. [18, 19] proposed a novel block coordinate descent method via clustering to solve the optimisation problem; it achieves quadratic convergence rates and is designed for large networks.
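To make the lasso connection concrete, the following sketch (our own illustration, not the implementation used in the experiments) performs one block update: it solves (6) for a single column with scikit-learn's coordinate descent lasso solver and recovers $w_{12}$ as in (5).

```python
import numpy as np
from scipy.linalg import sqrtm
from sklearn.linear_model import Lasso

def solve_column(W11, s12, rho):
    """One graphical-lasso block update.

    Solves min_beta 0.5*||W11^{1/2} beta - b||^2 + rho*||beta||_1
    with b = W11^{-1/2} s12, then returns w12 = W11 beta.
    """
    q = W11.shape[0]
    X = np.real(sqrtm(W11))            # W11^{1/2} (W11 assumed positive definite)
    b = np.linalg.solve(X, s12)        # b = W11^{-1/2} s12
    # scikit-learn minimises (1/(2q))*||y - X beta||^2 + alpha*||beta||_1,
    # so alpha = rho / q matches the objective in (6).
    lasso = Lasso(alpha=rho / q, fit_intercept=False, max_iter=10000)
    beta = lasso.fit(X, b).coef_
    return beta, W11 @ beta            # (beta, w12), with w12 solving (5)
```

Sweeping this update over all columns of $W$ until convergence is exactly the graphical lasso iteration described above.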

3.3. Link Prediction Scheme Based on Sparse Inverse Covariance Estimation. Here we consider an undirected simple network $G(V, E)$, where $E$ is the set of links and $V$ is the set of nodes. Moreover, $A$ is the adjacency matrix, $\lambda$ is the graphical lasso parameter, $p$ denotes the number of nodes, and the universal set $U$ consists of $p(p-1)/2$ edges. The set of links is randomly divided into two parts, a training set $E^T$ and a testing set $E^P$; note that only the information in the training set can be used for link prediction. $\rho$ denotes the ratio of the training set, that is, the fraction of edges being used. Clearly, $E = E^T \cup E^P$ and $E^T \cap E^P = \emptyset$. In the training set sampling process, we generate $n$ independent observations $X = [x_1, \ldots, x_n]$, each from the normal distribution $N(0, \Sigma)$, where $\Sigma = (E^T)^{-1}$ (treating $E^T$ as the training adjacency matrix). Here, $S$ denotes the sample covariance matrix of $X = [x_1, \ldots, x_n]$. We then apply the SGGM for link prediction. The pseudocode of the link prediction scheme is given in Algorithm 1.

4. Experiment and Analysis

4.1. Evaluation Metrics

4.1.1. AUC. In link prediction, we focus on accurately recovering missing links. Here we have chosen the area under the receiver operating characteristic (ROC) curve as a metric. An ROC curve compares two operating characteristics, the true positive rate (TPR) and the false positive rate (FPR), as the decision criterion changes. The AUC ranges from 0.5 to 1, and a higher value means a better model.

4.1.2. Error Rate. We also define an error metric to evaluate the difference between the original and estimated networks, denoted by $\theta$ and $\theta'$, respectively. The error metric is defined as $\mathrm{Error} = \|\theta - \theta'\|_F \,/\, \|\theta\|_F$, where $\|\cdot\|_F$ is the Frobenius norm.
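Both metrics are straightforward to compute; a sketch follows (the function and variable names are ours, and roc_auc_score is scikit-learn's standard AUC routine):

```python
import numpy as np
from sklearn.metrics import roc_auc_score

def auc_score(y_true, y_score):
    # y_true: 1 for held-out test-set links, 0 for non-links among
    # the candidate pairs; y_score: the predicted score S_xy per pair.
    return roc_auc_score(y_true, y_score)

def error_rate(theta, theta_hat):
    # Relative Frobenius-norm difference between the original and
    # estimated matrices, as defined in Section 4.1.2.
    return np.linalg.norm(theta - theta_hat, "fro") / np.linalg.norm(theta, "fro")
```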

4.2. Real-World Datasets. In this paper, we consider four representative networks drawn from disparate fields.


Require: $A$, $\rho > 0$, $\lambda > 0$, $n > 0$
Ensure: $\hat{A}$ (the estimated adjacency matrix)
  $E^T \Leftarrow$ randomly select $\rho|E|$ edges from $E$
  $E^P \Leftarrow E - E^T$
  $\Sigma \Leftarrow (E^T)^{-1}$
  $X \Leftarrow [x_1, x_2, \ldots, x_n]$: $n$ independent observations from $N(0, \Sigma)$
  $S \Leftarrow \operatorname{cov}(X)$
  $\Theta \Leftarrow$ SparseInverseCovarianceEstimation$(A, \lambda, S)$
  if $\Theta_{ij} < 0$ and $i \neq j$ then
    $\hat{A}_{ij} \Leftarrow 1$
  else
    $\hat{A}_{ij} \Leftarrow 0$
  end if

Algorithm 1: Link prediction based on sparse inverse covariance estimation with the graphical lasso.
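A compact, runnable sketch of Algorithm 1 follows. The code is our own: scikit-learn's GraphicalLasso stands in for the SLEP/QUIC solvers used in the experiments, and the diagonal shift added before inversion is our safeguard against a singular training adjacency matrix, which the pseudocode leaves implicit.

```python
import numpy as np
from sklearn.covariance import GraphicalLasso

def sggm_link_prediction(A, ratio=0.9, lam=0.01, n=None, seed=0):
    """Sketch of Algorithm 1 for a symmetric 0/1 adjacency matrix A."""
    rng = np.random.default_rng(seed)
    p = A.shape[0]
    n = n or 2 * p                                   # sample scale, e.g. n = 2p

    # E^T <= randomly select ratio*|E| edges from E; E^P is the remainder.
    links = np.argwhere(np.triu(A, 1))
    rng.shuffle(links)
    A_train = np.zeros_like(A, dtype=float)
    for i, j in links[: int(ratio * len(links))]:
        A_train[i, j] = A_train[j, i] = 1.0

    # Sigma <= (E^T)^{-1}; the diagonal shift (our assumption) makes the
    # matrix positive definite, hence invertible and a valid covariance.
    shift = abs(np.linalg.eigvalsh(A_train).min()) + 0.1
    Sigma = np.linalg.inv(A_train + shift * np.eye(p))

    # n independent observations from N(0, Sigma), then S <= cov(X)
    # (computed internally by the estimator below).
    X = rng.multivariate_normal(np.zeros(p), Sigma, size=n)

    # Theta <= sparse inverse covariance estimate via the graphical lasso.
    Theta = GraphicalLasso(alpha=lam, max_iter=200).fit(X).precision_

    # Negative off-diagonal entries of Theta are predicted as links.
    return ((Theta < 0) & ~np.eye(p, dtype=bool)).astype(int)
```

The held-out links $E^P$ can then be compared against the returned matrix using the AUC and error metrics of Section 4.1.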

Table 2: Basic topological features of four real-world networks.

Network       N     M     E      C      Avg. D   r
Taro          22    39    0.488  0.339  3.546   -0.375
Tailor Shop   39    158   0.567  0.458  8.103   -0.183
Dolphin       62    159   0.379  0.259  5.129   -0.044
Football      115   613   0.450  0.403  10.661   0.162

(Data sources: (1) http://www-personal.umich.edu/~mejn/netdata/ (Football and Dolphin); (2) http://vlado.fmf.uni-lj.si/pub/networks/data/ucinet/ucidata.htm (Tailor Shop and Taro).)

We selected four real-world datasets to compare the proposed method with the 13 similarity measures. Table 2 shows the basic topological features of the four real-world networks, wherein $N$ and $M$ are the total numbers of nodes and links, respectively; $E$ is the network efficiency [20], defined as $E = \frac{2}{N(N-1)}\sum_{x, y \in V,\, x \neq y} d_{xy}^{-1}$, where $d_{xy}$ denotes the shortest distance between nodes $x$ and $y$, and $d_{xy} = +\infty$ if $x$ and $y$ are in two different components; $C$ and Avg. $D$ denote the clustering coefficient [21] and the average degree, respectively; and $r$ is the network assortativity coefficient [22]. A network is assortative if nodes tend to connect to nodes of similar degree: $r > 0$ means that the network has an assortative structure, in which a node with large degree tends to connect to other nodes with large degree; $r < 0$ means that the network has a disassortative structure; and $r = 0$ implies that there is no degree correlation in the network structure. Figure 1 shows the degree distributions of the four datasets; it is clear that they are considerably different.
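The Table 2 columns can be reproduced with standard graph routines; a sketch follows (reading a local edge-list file is our own assumption, while networkx's global_efficiency, average_clustering, and degree_assortativity_coefficient match the definitions of $E$, $C$, and $r$ above):

```python
import networkx as nx

# Hypothetical reproduction of the Table 2 columns for one dataset.
G = nx.read_edgelist("dolphins.txt")

N, M = G.number_of_nodes(), G.number_of_edges()
E = nx.global_efficiency(G)                    # network efficiency [20]
C = nx.average_clustering(G)                   # clustering coefficient [21]
avg_d = 2 * M / N                              # average degree
r = nx.degree_assortativity_coefficient(G)     # assortativity [22]
print(N, M, round(E, 3), round(C, 3), round(avg_d, 3), round(r, 3))
```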

4.3. Experimental Settings. To evaluate the performance of the algorithms with different-sized training sets, we randomly selected 60%, 70%, 80%, and 90% of the links from the original network as training sets. Only the information in the training set was used for estimation. Each training set was executed 10 times; we chose 10-fold cross-validation because it is widely accepted in machine learning and data mining research. The thirteen similarity measures were implemented following Zhou et al. [14], and the SGGM was implemented using the SLEP toolbox [23]. We evaluated the performance of the SGGM algorithm with different sample scales; with $p$ denoting the number of nodes, we set the sample scale to $0.5p$, $p$, $1.5p$, and $2p$. The parameter $\lambda$ in the SGGM controls the sparsity of the estimated model: a larger $\lambda$ gives a sparser estimated network. Here, $\lambda$ ranged from 0.01 to 0.2 with a step of 0.01.

Table 3: Football dataset results (AUC).

Method             Training set ratio
                   60%     70%     80%     90%
SGGM (n = 0.5p)    0.702   0.725   0.757   0.783
SGGM (n = p)       0.742   0.777   0.818   0.857
SGGM (n = 1.5p)    0.753   0.795   0.845   0.889
SGGM (n = 2p)      0.756   0.800   0.852   0.905
CN                 0.774   0.798   0.824   0.835
Salton             0.779   0.804   0.828   0.838
Jaccard            0.502   0.508   0.514   0.520
Sørensen           0.590   0.597   0.604   0.606
HPI                0.778   0.802   0.827   0.838
HDI                0.779   0.803   0.827   0.838
LHN                0.777   0.802   0.827   0.838
AA                 0.774   0.798   0.825   0.836
RA                 0.774   0.798   0.824   0.836
PA                 0.515   0.516   0.520   0.528
LNB-CN             0.777   0.800   0.825   0.836
LNB-AA             0.777   0.800   0.825   0.836
LNB-RA             0.776   0.799   0.825   0.837

4.4. Results

4.4.1. Tuning Parameter for the SGGM. Using the training set scaled to 90%, we tested the proposed SGGM with different $\lambda$ values and sample scales. As shown in Figure 2, the AUC of the SGGM increased with increasing sample scale on all four datasets. One advantage of the SGGM is that it is not sensitive to the parameter $\lambda$. We then set the sample scale to $0.5N$; Figure 3 shows that the AUC increased with the training set scale. For the Tailor Shop dataset, the proposed algorithm performs optimally with the 70% training set.

4.4.2. Comparison on AUC. Table 3 lists the results of the 14 methods on the Football dataset. The SGGM performs optimally when 80% or 90% of the training set is used and the number of samples is greater than $N$; its prediction accuracy increased by 5% compared to the other 13 methods. However, with lower proportions of the training set, the SGGM's AUC is very close to that of the other methods. Note that PA yielded the poorest result on this dataset.

As shown in Table 4, the SGGM method performs optimally on the Dolphin dataset; its prediction accuracy increased by at least 3% compared to the other thirteen methods.


Figure 1: Degree distribution of the four real-world datasets: (a) Dolphin, (b) Football, (c) Taro, (d) Tailor Shop. (Each panel plots the probability of the node degree against degree.)

PA is optimal among the similarity measures, and Jaccard yielded the poorest result.

As shown in Table 5, the SGGM method outperforms the other 13 methods on the Taro dataset; the AUC improved by at least 10%. For the Taro dataset, the degree of 70% of the nodes is 3, which indicates a low degree of heterogeneity; thus, the performance of the other thirteen methods is very close. One might expect PA to show good performance on assortative networks and poor performance on disassortative networks; however, the experimental results show no obvious correlation between PA performance and the assortativity coefficient.

For the Tailor Shop dataset, PA performs optimally when the 60% and 70% training sets are used, while the SGGM method performs optimally with the 80% and 90% training sets (Table 6). The SGGM's prediction accuracy gradually increased with the number of samples; thus, when sampling conditions permit, multiple samples can improve prediction accuracy.

4.4.3. Comparison on ROC. The ratio of the training set was set to 90%. The sample scale of the SGGM varied within $0.5p$, $p$, $1.5p$, and $2p$, with $\lambda = 0.01$. CN was chosen as the representative neighbourhood-based measure because six other measures (Salton, Jaccard, Sørensen, HPI, HDI, and LHN) are variants of CN.

The ROC of the RA index is similar to that of the AA index.


Figure 2: AUC of the SGGM (four datasets, training set ratio = 90%): (a) Dolphin, (b) Football, (c) Taro, (d) Tailor Shop. (Each panel plots AUC against $\lambda$ and the ratio of training.)

Figure 3: AUC of the SGGM (four datasets, sample scale = $0.5N$): (a) Dolphin, (b) Football, (c) Taro, (d) Tailor Shop. (Each panel plots AUC against $\lambda$ and the number of samples, from $0.5p$ to $3p$.)


Figure 4: Comparison of the ROC metric of four methods (SGGM with $n = 0.5p$, $p$, $1.5p$, and $2p$; CN; AA; and PA) on the four datasets: (a) Dolphin, (b) Football, (c) Taro, (d) Tailor Shop. (Each panel plots the true positive rate against the false positive rate.)

As observed in Figure 4, in most cases the SGGM shows an improvement over the existing similarity measures when sufficient samples are used; note that more samples lead to higher accuracy.

4.4.4. Comparison on Error Rate. The error rate is a metric that evaluates the difference between two matrices; a lower error rate indicates that the estimated matrix better approximates the original matrix. From Table 7, we observe that the SGGM has the lowest error rate among all methods shown. To show the effectiveness of the proposed SGGM, both the original adjacency matrix and the estimated adjacency matrix are presented as coloured graphs in Figure 5. We used the Taro dataset with a 90% training set to recover the original network; graphs with fewer red points indicate good recovery of the original matrix.


Figure 5: Recovery performance of the 14 methods on the Taro dataset. The upper left corner is the original matrix; blue points denote links between two corresponding nodes. Red points denote links in the original matrix that were not accurately estimated. Green points denote links in the estimated matrix that do not exist in the original matrix. From left to right and top to bottom, the panels show the matrices recovered by SGGM ($n = 0.5p$, $p$, $1.5p$, and $2p$), CN, Salton, Jaccard, Sørensen, HPI, HDI, LHN, AA, RA, PA, LNB-CN, LNB-AA, and LNB-RA, respectively.

Table 4: Dolphin dataset results (AUC).

Method             Training set ratio
                   60%     70%     80%     90%
SGGM (n = 0.5p)    0.699   0.731   0.761   0.801
SGGM (n = p)       0.742   0.775   0.823   0.867
SGGM (n = 1.5p)    0.758   0.803   0.845   0.895
SGGM (n = 2p)      0.765   0.807   0.857   0.909
CN                 0.631   0.687   0.728   0.762
Salton             0.621   0.673   0.709   0.740
Jaccard            0.485   0.490   0.495   0.500
Sørensen           0.538   0.558   0.571   0.583
HPI                0.620   0.669   0.702   0.728
HDI                0.626   0.680   0.720   0.753
LHN                0.619   0.666   0.698   0.721
AA                 0.633   0.690   0.732   0.767
RA                 0.632   0.690   0.731   0.766
PA                 0.712   0.727   0.737   0.746
LNB-CN             0.636   0.696   0.739   0.775
LNB-AA             0.635   0.694   0.738   0.773
LNB-RA             0.635   0.693   0.736   0.771

As can be seen from Figure 5, the SGGM method can restore the original matrix, whereas the other similarity measures return greater error rates. Note that the SGGM error rate decreased with increasing sample scale.

4.5. Large Datasets. The main challenge in link prediction is dealing with large networks. We carefully chose four large datasets from four individual domains. For the SGGM method, $\lambda$ was set to 0.01, and 90% of the data was used as the training set.

Table 5: Taro dataset results (AUC).

Method             Training set ratio
                   60%     70%     80%     90%
SGGM (n = 0.5p)    0.672   0.711   0.736   0.750
SGGM (n = p)       0.713   0.747   0.779   0.815
SGGM (n = 1.5p)    0.721   0.763   0.816   0.857
SGGM (n = 2p)      0.723   0.788   0.825   0.865
CN                 0.529   0.545   0.573   0.582
Salton             0.530   0.542   0.571   0.580
Jaccard            0.468   0.476   0.476   0.478
Sørensen           0.523   0.533   0.544   0.552
HPI                0.533   0.545   0.574   0.585
HDI                0.528   0.541   0.568   0.576
LHN                0.532   0.543   0.570   0.579
AA                 0.533   0.554   0.589   0.609
RA                 0.533   0.554   0.589   0.610
PA                 0.559   0.568   0.582   0.607
LNB-CN             0.536   0.562   0.616   0.638
LNB-AA             0.537   0.561   0.614   0.634
LNB-RA             0.536   0.560   0.613   0.634

Table 6: Tailor Shop dataset results (AUC).

Method             Training set ratio
                   60%     70%     80%     90%
SGGM (n = 0.5p)    0.652   0.679   0.677   0.706
SGGM (n = p)       0.703   0.723   0.750   0.769
SGGM (n = 1.5p)    0.715   0.755   0.786   0.817
SGGM (n = 2p)      0.734   0.770   0.803   0.844
CN                 0.673   0.699   0.720   0.746
Salton             0.616   0.628   0.646   0.667
Jaccard            0.494   0.498   0.504   0.514
Sørensen           0.583   0.589   0.592   0.597
HPI                0.615   0.618   0.631   0.642
HDI                0.628   0.643   0.660   0.680
LHN                0.588   0.574   0.571   0.566
AA                 0.682   0.709   0.729   0.753
RA                 0.681   0.707   0.729   0.752
PA                 0.761   0.775   0.779   0.786
LNB-CN             0.692   0.715   0.746   0.765
LNB-AA             0.693   0.716   0.746   0.764
LNB-RA             0.693   0.715   0.745   0.762

In order to demonstrate the performance of the SGGM, we compare the proposed method with the other thirteen methods on these large datasets (for details, see Table 8) in terms of AUC and error rate (data sources: (1) http://konect.uni-koblenz.de/networks/ (Email); (2) http://www3.nd.edu/~networks/resources.htm (Protein); (3) http://www-personal.umich.edu/~mejn/netdata/power.zip (Grid); (4) http://konect.uni-koblenz.de/networks/ (Astro-ph)).

To estimate the large networks efficiently, we used QUIC [19] to solve the sparse inverse covariance estimation problem (formula (1)) instead of glasso [7].


Table 7: Error rates of the 14 methods on the four datasets.

Method             Football   Dolphin   Taro    Tailor Shop
SGGM (n = 0.5p)    1.467      1.251     1.098   1.121
SGGM (n = p)       1.364      1.109     1.062   1.188
SGGM (n = 1.5p)    1.362      1.198     1.098   1.025
SGGM (n = 2p)      1.200      1.121     1.013   0.987
CN                 1.641      1.291     1.320   1.485
Salton             1.641      1.291     1.320   1.485
Jaccard            1.641      1.291     1.320   1.485
Sørensen           1.863      1.672     1.695   1.826
HPI                1.641      1.291     1.320   1.485
HDI                1.641      1.291     1.320   1.485
LHN                1.641      1.291     1.320   1.485
AA                 1.641      1.291     1.320   1.485
RA                 1.641      1.291     1.320   1.485
PA                 2.512      1.251     1.340   1.446
LNB-CN             1.641      1.291     1.261   1.382
LNB-AA             1.641      1.291     1.261   1.382
LNB-RA             1.641      1.291     1.261   1.382

Table 8: Basic topological features of four large networks.

Network    N       M        E      C      Avg. D   r
Email      1133    5451     0.299  0.220  9.622    0.078
Protein    1870    2240     0.099  0.099  2.396   -0.156
Grid       4941    6594     0.063  0.107  2.669    0.003
Astro-ph   18771   198050   0.091  0.486  4.140    0.294

Table 9: Comparison of the methods on the four large datasets in terms of AUC.

Method     Email   Protein   Grid    Astro-ph
SGGM       0.916   0.943     0.944   0.929
CN         0.817   0.566     0.570   0.864
Salton     0.812   0.566     0.569   0.863
Jaccard    0.475   0.479     0.487   0.479
Sørensen   0.547   0.493     0.529   0.584
HPI        0.809   0.566     0.569   0.863
HDI        0.813   0.566     0.569   0.863
LHN-I      0.805   0.565     0.570   0.864
AA         0.819   0.565     0.570   0.876
RA         0.818   0.566     0.568   0.838
PA         0.824   0.820     0.713   0.785

Overall, the proposed method can efficiently recover the original network and predict the missing links accurately. As observed from Table 9, the AUC of the SGGM is the highest on all four datasets: for the Email, Protein, Grid, and Astro-ph datasets, the SGGM improves the AUC by 9.2%, 12.3%, 23.1%, and 5.3%, respectively. For the error rate metric, as shown in Table 10, the SGGM method performs optimally on all four large datasets, indicating that it can accurately recover the original network.

Table 10: Comparison of the methods on the four large datasets in terms of error rate.

Method     Email   Protein   Grid     Astro-ph
SGGM       0.467   0.411     0.367    0.359
CN         2.348   2.189     1.670    1.568
Salton     2.978   2.189     1.767    1.549
Jaccard    2.982   2.189     1.450    1.548
Sørensen   3.047   2.555     2.071    2.771
HPI        2.578   2.189     1.670    1.610
HDI        2.978   2.189     1.237    1.854
LHN-I      2.428   2.189     1.692    1.612
AA         2.922   2.204     1.872    1.953
RA         2.348   2.189     1.670    1.410
PA         9.219   8.032     15.081   12.451

4.6. Analysis. In the comparative analysis of the performance of the 14 methods on the eight real-world datasets, the SGGM method was outstanding in terms of both AUC and error rate. This method can accurately recover the original adjacency matrix and, in most cases, requires few samples, far fewer than the actual number of nodes. The SGGM method's prediction precision gradually increased with the number of samples; moreover, the recovery error rate decreased with increasing matrix sparsity. Therefore, the SGGM method can be applied to small samples, although it performs better with more samples. These results demonstrate that the SGGM method can be implemented in a scalable fashion.

5. Conclusions

Link prediction is a basic problem in complex network analysis. In real-world networks, obtaining node and edge properties is cumbersome. Link prediction methods based on network structure similarity do not depend on node properties; however, these methods are limited by network structure and are difficult to adapt to different network structures.

Our work does not depend on node properties and extends link prediction beyond methods that cannot deal with diverse network topologies. We sampled the original network adjacency matrix and used the SGGM method to depict the network structure.

Most nodes are independent in an actual network; thus, we used the SGGM method to estimate a precision matrix from the adjacency matrix to predict links. Our experimental results show that using the SGGM method to recover network structure is better than 13 mainstream similarity methods. We tested the proposed method on eight real-world datasets, using the AUC and error rate as evaluation indexes, and found that this method obtains higher prediction precision. We also analysed the influence of different parameters on the SGGM method and found no significant influence; the SGGM method returns a high AUC value within a certain parameter range. Furthermore, the proposed method retains high prediction precision with fewer training samples, and it can be applied to large networks.

Conflict of Interests

The authors declare that there is no conflict of interests regarding the publication of this paper.

Acknowledgments

The authors thank Longqi Yang, Zhisong Pan, and Guyu Hu for their contributions to work on link prediction, and Zhen Li and Junyang Qiu for their valuable discussions, comments, and suggestions. This work was supported by the National Technology Research and Development Program of China (863 Program) (Grant no. 2012AA01A510), the National Natural Science Foundation of China (Grant no. 61473149), and the Natural Science Foundation of Jiangsu Province, China (Grant no. BK20140073).

References

[1] L. Lü and T. Zhou, "Link prediction in complex networks: a survey," Physica A: Statistical Mechanics and Its Applications, vol. 390, no. 6, pp. 1150–1170, 2011.
[2] B. Ermiş, E. Acar, and A. T. Cemgil, "Link prediction in heterogeneous data via generalized coupled tensor factorization," Data Mining and Knowledge Discovery, vol. 29, no. 1, pp. 203–236, 2013.
[3] Y. Yang, N. Chawla, Y. Sun, and J. Hani, "Predicting links in multi-relational and heterogeneous networks," in Proceedings of the 12th IEEE International Conference on Data Mining (ICDM '12), pp. 755–764, Brussels, Belgium, December 2012.
[4] D. Liben-Nowell and J. Kleinberg, "The link-prediction problem for social networks," Journal of the American Society for Information Science and Technology, vol. 58, no. 7, pp. 1019–1031, 2007.
[5] Y. Yang, R. N. Lichtenwalter, and N. V. Chawla, "Evaluating link prediction methods," Knowledge and Information Systems, vol. 45, no. 3, pp. 751–782, 2015.
[6] N. Z. Gong, A. Talwalkar, L. MacKey et al., "Joint link prediction and attribute inference using a social-attribute network," ACM Transactions on Intelligent Systems and Technology, vol. 5, no. 2, article 27, 2014.
[7] J. Friedman, T. Hastie, and R. Tibshirani, "Sparse inverse covariance estimation with the graphical lasso," Biostatistics, vol. 9, no. 3, pp. 432–441, 2008.
[8] M. E. Newman, "Clustering and preferential attachment in growing networks," Physical Review E, vol. 64, no. 2, Article ID 025102, 2001.
[9] G. Salton and M. J. McGill, Introduction to Modern Information Retrieval, McGraw-Hill, New York, NY, USA, 1986.
[10] P. Jaccard, "Étude comparative de la distribution florale dans une portion des Alpes et du Jura," Bulletin de la Société Vaudoise des Sciences Naturelles, vol. 37, no. 142, pp. 547–579, 1901.
[11] T. Sørensen, "A method of establishing groups of equal amplitude in plant sociology based on similarity of species and its application to analyses of the vegetation on Danish commons," Biologiske Skrifter, vol. 5, pp. 1–34, 1948.
[12] E. A. Leicht, P. Holme, and M. E. J. Newman, "Vertex similarity in networks," Physical Review E, vol. 73, no. 2, Article ID 026120, 2006.
[13] Y.-B. Xie, T. Zhou, and B.-H. Wang, "Scale-free networks without growth," Physica A: Statistical Mechanics and Its Applications, vol. 387, no. 7, pp. 1683–1688, 2008.
[14] T. Zhou, L. Lü, and Y.-C. Zhang, "Predicting missing links via local information," The European Physical Journal B, vol. 71, no. 4, pp. 623–630, 2009.
[15] Z. Liu, Q.-M. Zhang, L. Lü, and T. Zhou, "Link prediction in complex networks: a local naïve Bayes model," EPL (Europhysics Letters), vol. 96, no. 4, Article ID 48007, 2011.
[16] J. Dahl, L. Vandenberghe, and V. Roychowdhury, "Covariance selection for nonchordal graphs via chordal embedding," Optimization Methods & Software, vol. 23, no. 4, pp. 501–520, 2008.
[17] O. Banerjee, L. El Ghaoui, and A. d'Aspremont, "Model selection through sparse maximum likelihood estimation for multivariate Gaussian or binary data," Journal of Machine Learning Research, vol. 9, pp. 485–516, 2008.
[18] C.-J. Hsieh, M. A. Sustik, I. S. Dhillon, P. K. Ravikumar, and R. Poldrack, "BIG & QUIC: sparse inverse covariance estimation for a million variables," in Advances in Neural Information Processing Systems, pp. 3165–3173, MIT Press, 2013.
[19] C.-J. Hsieh, M. A. Sustik, I. S. Dhillon, and P. Ravikumar, "QUIC: quadratic approximation for sparse inverse covariance estimation," Journal of Machine Learning Research, vol. 15, pp. 2911–2947, 2014.
[20] V. Latora and M. Marchiori, "Efficient behavior of small-world networks," Physical Review Letters, vol. 87, Article ID 198701, 2001.
[21] D. J. Watts and S. H. Strogatz, "Collective dynamics of 'small-world' networks," Nature, vol. 393, no. 6684, pp. 440–442, 1998.
[22] M. E. J. Newman, "Assortative mixing in networks," Physical Review Letters, vol. 89, no. 20, Article ID 208701, 2002.
[23] J. Liu, S. Ji, and J. Ye, SLEP: Sparse Learning with Efficient Projections, Arizona State University, Tempe, Ariz, USA, 2009.


Page 3: Research Article Link Prediction via Sparse Gaussian Graphical … · 2019. 5. 14. · Research Article Link Prediction via Sparse Gaussian Graphical Model LiangliangZhang,LongqiYang,GuyuHu,ZhisongPan,andZhenLi

Mathematical Problems in Engineering 3

Table 1 Similarity indexes of local node information

Name Definition

CN 119878119909119910=1003816100381610038161003816Λ (119909) cap Λ (119910)

1003816100381610038161003816

Salton index 119878119909119910=

1003816100381610038161003816Λ (119909) cap Λ (119910)1003816100381610038161003816

radic119889 (119909) times 119889 (119910)

Jaccard index 119878119909119910=

1003816100381610038161003816Λ (119909) cap Λ (119910)1003816100381610038161003816

1003816100381610038161003816Λ (119909) cup Λ (119910)1003816100381610038161003816

Soslashrensen index 119878119909119910=21003816100381610038161003816Λ (119909) cap Λ (119910)

1003816100381610038161003816

119889 (119909) + 119889 (119910)

HPI 119878119909119910=

1003816100381610038161003816Λ (119909) cap Λ (119910)1003816100381610038161003816

min 119889 (119909) 119889 (119910)

HDI 119878119909119910=

1003816100381610038161003816Λ (119909) cap Λ (119910)1003816100381610038161003816

max 119889 (119909) 119889 (119910)

LHN-I 119878119909119910=

1003816100381610038161003816Λ (119909) cap Λ (119910)1003816100381610038161003816

119889 (119909) times 119889 (119910)

PA 119878119909119910= 119889 (119909) times 119889 (119910)

AA 119878119909119910= sum

119911isinΛ(119909)capΛ(119910)

1

lg 119889 (119911)

RA 119878119909119910= sum

119911isinΛ(119909)capΛ(119910)

1

119889 (119911)

the impact of degree of both nodes and generate six simi-larity indices namely Salton index (cosine-based similarity)[9] Jaccard index [10] Soslashrensen index [11] hub promotedindex (HPI) hub depressed index (HDI) and Leicht-Holme-Newman (LHN) index [12] These indices are based onCN similarity Another similarity-degree-based method ispreferential attachment (PA) [13] which can generate a scale-free network Note that the complexity of this algorithm islower than that of others because it requires less information

If we consider degree information of two nodes CNs wehave the Adamic-Adar (AA) index [14] which considers thatthe contribution of a small degree of CN nodes is greater thanthat of a larger one Liu et al [15] proposed the RA indexwhich forms a view of network resource allocation The RAandAA indices determine theweight of CNnodes in differentways RA decreases by 1119896 and AA is 1 lg 119896 The RA indexperforms better than the AA index in a weighted networkand community mining Traditional CN methods do notdistinguish the different roles of CNs Liu et al [15] proposeda local naive Bayes (LNB) model This model creates a roleparameter for marking the different roles of CNs However itis only applied to certain types of networks

3 Link Prediction Based on Sparse GaussianGraphical Model

Most similarity algorithms perform well with ideal datasetsand large training datasets Such algorithms do not con-sider missing incomplete and polluted data in an adja-cency matrix As previously mentioned the accuracy ofsimilarity-based methods depends on whether the method

can determine the characteristics of the network structureFor example CN-based algorithms utilise a nodes CNs anda pair of nodes with many CNs are more likely to connectSuch algorithms perform well sometimes even better thancomplex algorithms in a network with a high clusteringcoefficient However in networks with a low clusteringcoefficient such as router or power networks accuracy issignificantly worse

This study attempts to determine links in an adjacencymatrix to reveal actual links between nodes The proposedSGGM method is based on an undirected graphical modeland transforms the adjacency matrix without using nodeproperties The SGGMmethod estimates the precision of thematrix of a network to predict unknown edges To verify linkprediction accuracy area under the curve (AUC) and an errormetric are used to prove the proposed methodsrsquo effective-ness

31 Sparse Graphical Model Biologists interested in geneticconnections use the GGM to estimate genetic interactionEdge relationships in an undirected graph are representedby the joint distribution of random variables For examplegenome work is based on biological functions and somesupervising relationships exist between genes Correspond-ing to the graph genes represent nodes and edges repre-sent this supervising relationship The relationship betweengenes provides a method to model such relationships Weassume that variables have Gaussian distribution thereforethe GGM is most frequently used Therefore the problem isequivalent to estimating the inverse covariance matrix Σ

minus1

(precision matrix Θ = Σminus1) and the diagonal elements of

precision matrixΘ represent the edges in the graph [16] TheGGM denotes the statistical dependencies between variablesIf there is no link between two nodes such nodes haveconditional independence In the GGM a precision matrixcan parameterise each edge

A popular modern method for evaluating a GGM is thegraphical lasso and Friedman et al [7] added an ℓ

1norm to

punish each off-diagonal element of the inverse covariancematrix

The GGM encodes the conditional dependence relation-ships among a set of 119901 variables and 119899 observations that haveidentical and independent Gaussian distribution Motivatedby network terminology we can refer to the 119901 variables in agraphical model as nodes If a pair of variables (or features)is conditionally dependent then there is an edge betweenthe corresponding pair of nodes otherwise no edge ispresentThebasicmodel for continuous data assumes that theobservations have a multivariate Gaussian distribution withmean120583 and covariancematrixΣ and119883[1] 119883[2] 119883[119873] sim

119873(0 Σ) If the 119894119895th component of Σminus1 is zero then thevariables 119894 and 119895 are conditionally independent given theother variables Specifically given119883

119896and 119896 = 1 119901119894 119895

the nonzero elements inΣminus1 represent the graphs structure Inother words if (Σminus1)

119894119895= 0 nodes 119894 and 119895 have no connecting

edge in the graph The precision matrix is sparse due to

4 Mathematical Problems in Engineering

the conditional independence of the variables To estimatea sparse GGM many methods are based on maximumlikelihood estimation (MLE) and one such method is thegraphical lasso which is expressed as

maxΘ≻0

log detΘ minus tr (119878Θ) minus 120582 Θ1 (1)

119878 =1

119873

119873

sum

119894=1

(119909119894minus 119909) (119909

119894minus 119909)119879

(2)

Θ1= sum

119894 =119895

10038161003816100381610038161003816Θ119894119895

10038161003816100381610038161003816 (3)

Here 119878 is the experiential covariance matrix and 120582 ispositive tuning parameter If Θ gt 0 Θ is a 119901 times 119901 positivedefinite matrix Θ

1is the punishment element and Θ

becomes increasingly sparse with increasing 120582 An ℓ1-norm is

utilised to considerΘ a variable rather than a fixed parameter

32 Graphical Lasso Algorithm In 2008 Banerjee et al [17]proved that formula (1) is convex They estimated Σ via 119882only and subsequently solved the problem by optimisingover each row Moreover they optimised the correspondingcolumn of119882 in a block coordinate descent fashion

119882 = (

11988211

11990812

119908119879

1211990822

)

119878 = (

11987811

11990412

119904119879

1211990422

)

(4)

By partitioning119882 and 119878 the solution for 11990812satisfies

11990812= argmin

119910

119910119879

119882minus1

11119910

1003817100381710038171003817119910 minus 119904121003817100381710038171003817infin

le 119901 (5)

Banerjee et al [17] proved that solving (5) is equivalent tosolving its dual problem

min120573

= 1

2

1003817100381710038171003817100381711988212

11120573 minus 119887

10038171003817100381710038171003817

2

+ 120588100381710038171003817100381712057310038171003817100381710038171 (6)

Here assume that 120573 is the solution to (6) that is 119887 =

11988212

1111990412 thus119908

12= 11988211120573 is the solution to (5)Therefore (6)

is similar to lasso regression and is the basis for the graphicallasso algorithm First we must prove that (6) and (1) areequivalent Given119882Θ = 119868 that is

(

11988211

11990812

119908119879

1211990822

)(

Θ11

12057912

120579119879

1212057922

) = (

119868 0

0119879

1

) (7)

the MLE of (1) is rewritten as

119882minus 119878 minus 120588119867 = 0 (8)

Note that the derivative of log detΘ equals Θminus1 = 119882Here 119867

119894119895isin sign(Θ

119894119895) that is 119867

119894119895= sign(Θ

119894119895) if Θ

119894119895= 0

else119867119894119895isin [minus1 1] ifΘ

119894119895= 0 Thus the upper-right block of (8)

is expressed as

11990812minus 11990412minus 12058812057412= 0 (9)

The subgradient equation of (6) is expressed as follows

11988211120573 minus 11990412+ 120588V = 0 (10)

where V isin sign(120573) Here suppose that (119882119867) solves (8)consequently (119908

12 12057412) solves (9) Then 120573 = 119882

minus1

1111990812

andV = minus120574

12solve (10) For the sign terms119882

11Θ12+ 11990812Θ22= 0

is derived from (7) therefore we obtainΘ12= minusΘ22119882minus1

1111990812

Since Θ22gt 0 it follows that sign(Θ

12) = minus sign(119882minus1

1111990812) =

minus sign(120573) which proves the equivalenceThe solution120573 to thelasso problem (6) gives the relevant part of Θ

12= minusΘ22120573

Problem (6) appears similar to a lasso (ℓ1-regularised)

least-squares problem In fact if 11988211

= 11987811 the solutions

of 120573 are equal to the lasso estimates for the 119901th variable In2007 Friedman et al used fast coordinate descent algorithmsto solve the lasso problem Hsieh et al [18 19] proposed anovel block-coordinate descentmethod via clustering to solvethe optimization problem The method can reach quadraticconvergence rates which is designed for large network

33 Link Prediction Scheme Based on Sparse Inverse Covari-ance Estimation Here we consider an undirected simplenetwork119866(119881 119864) where 119864 is the set of links and119881 is the set ofnodes Moreover 119860 is an adjacent matrix Λ is the graphicallasso parameter 119901 denotes the number of nodes and theuniversal set 119880 consists of 119901(119901 minus 1)2 edges The set of linksare randomly divided into two parts training (119864

119879) and testing

(119864119875) sets Note that only the information in the training set

can be used for link prediction 120588 denotes the ratio of thetraining set which means the amount of edges being usedClearly 119864 = 119864

119879cup 119864119875 and 119864

119879cap 119864119875= 0 In the training set

sampling process we generated 119899 independent observations119883 = [119909

1 119909

119899] each from normal distribution 119873(0 Σ)

where Σ = (119864119879)minus1 Here 119878 denotes the sample covariance

matrices of 119883 = [1199091 119909

119899] We then apply SGGM for link

prediction The pseudocode of the link prediction scheme isgiven in Algorithm 1

4 Experiment and Analysis

41 Evaluation Metrics

411 AUC In link prediction we focus on accurately recov-ering missing links Here we have chosen the area undera relative operating characteristic (ROC) curve as a metricAn ROC curve represents a comparison of two operatingcharacteristics TPR and FPR as the criterion changes TheAUC ranges from 05 to 1 and a higher value means a bettermodel

412 Error Rate We also defined an error metric to evaluatethe difference between the original and estimated networksand are denoted by 120579 and 1205791015840 respectively The error metric isdefined as Error = 120579minus120579

1015840

119865120579119865 where sdot

119865is the Frobenius

norm

42 Real-World Datasets In this paper we consider fourrepresentative networks drawn from disparate fields (data

Mathematical Problems in Engineering 5

Require 119860 120588 gt 0 120582 gt 0 119899 gt 0

Ensure 119860119864119879 randomly select 120588|119864| edges from 119864

119864119875

lArr 119864 minus 119864119879

Σ lArr (119864119879

)minus1

119883 lArr [1199091 1199092 119909

119899] 119899 independent observations from

119873(0 Σ)

119878 lArr cov(119883)Θ lArr Sparse Inverse Covariance Estimation (119860 120582 119878)if 120579119894119895le 0 and 119894 = 119895 then

119860119894119895lArr 1

else119860119894119895lArr 0

end if

Algorithm 1 Link prediction based on sparse inverse covarianceestimation with graphical lasso

Table 2 Basic topological features of four real-world networks

Network 119873 119872 119864 119862 Avg 119863 119903

Taro 22 39 0488 0339 3546 minus0375Tailor Shop 39 158 0567 0458 8103 minus0183Dolphin 62 159 0379 0259 5129 minus0044Football 115 613 0450 0403 10661 0162

sources (1) httpwww-personalumichedusimmejnnetdata(Football andDolphin) and (2) httpvladofmfuni-ljsipubnetworksdataucinetucidatahtm (Tailor Shop and Taro))

We selected four real-world datasets to compare theproposed method with the 13 similarity measures Table 2shows the basic topological features of the four real-worldnetworks wherein 119873 and119872 are the total numbers of nodesand links respectively 119864 is the network efficiency [20] whichis defined as 119864 = (2119873(119873 minus 1))sum

119909119910isin119881119909 =119910119889minus1

119909119910 where 119889

119909119910

denotes the shortest distance between nodes 119909 and 119910 and119889119909119910

= +infin if 119909 and 119910 are in two different components119862 and Avg 119863 denote the clustering coefficient [21] andaverage degree respectively and 119903 is the network assortativecoefficient [22] The network is assortative if a node tends toconnect with the approximate node 119903 gt 0 means that theentire network has assortative structure and a nodewith largedegree tends to connect to other nodes with large degree 119903 lt0 means that the entire network has disassortative structureand 119903 = 0 implies that there is no correlation in the networkstructure Figure 1 shows the distribution of the four datasetsIt is clear that the degree distributions of the four data sets areconsiderably different

43 Experimental Settings To evaluate the performance ofthe algorithmswith different sized training sets we randomlyselected 60 70 80 and 90 of the links from theoriginal network as training sets Only the information inthe training set was used for estimation Each training setwas executed 10 times We choose 10-fold cross validationbecause it has been widely accepted in machine learning

Table 3 Football dataset results

Method Training set ratio60 70 80 90

SGGM 119899 = 05119901 0702 0725 0757 0783SGGM 119899 = 119901 0742 0777 0818 0857SGGM 119899 = 15119901 0753 0795 0845 0889SGGM 119899 = 2119901 0756 0800 0852 0905CN 0774 0798 0824 0835Salton 0779 0804 0828 0838Jaccard 0502 0508 0514 0520Soslashrensen 0590 0597 0604 0606HPI 0778 0802 0827 0838HDI 0779 0803 0827 0838LHN 0777 0802 0827 0838AA 0774 0798 0825 0836RA 0774 0798 0824 0836PA 0515 0516 0520 0528LNBCN 0777 0800 0825 0836LNBAA 0777 0800 0825 0836LNBRA 0776 0799 0825 0837

and data mining research Thirteen similarity measures wereimplemented following Zhou et al [14] and the SGGM wasimplemented using the SLEP toolbox [23] We evaluated theperformance of the GGM algorithm with different samplescales Here 119901 denotes the number of nodes We set thesample scale to 05119901 119901 15119901 and 2119901 The parameter 120582 in theSGGM controlled the sparsity of the estimated model Larger120582 gives a sparser estimated network Here 120582 ranged from 001

to 02 with a step of 001

44 Results

441 Tuning Parameter for the SGGM Using the training setthat was scaled to 90 we tested the proposed SGGM withdifferent 120582 and sample scales As shown in Figure 2 the AUCof the SGGM increased with increasing sample scale for thefour datasets One advantage of the SGGM is that it is notsensitive to the parameter 120582 Then we set the sample scale to05119873 Figure 3 shows that the AUC increased with increase intraining set scale For the Tailor Shop dataset the proposedalgorithm performs optimally with the 70 scaled trainingset

442 Comparison on AUC Table 3 lists the results for 14methods with the Football dataset The SGGM performsoptimally when 80 or 90 of the training set is used and thenumber of samples is greater than119873The prediction accuracyof the SGGM increased by 5 compared to the other 13methodsHowever with lower proportions of the training setthe SGGMs AUC is very close to that of the other methodsNote that PA yielded the poorest result with this dataset

As shown in Table 4 the SGGM method performs opti-mally with the Dolphin dataset The prediction accuracy ofthe SGGMmethod increased by at least 3 compared to the

6 Mathematical Problems in Engineering

0 1 2 3 4 5 6 7 8 9 10 11 120

002

004

006

008

01

012

014

016

Degree

Prob

abili

ty o

f the

nod

e deg

ree

(a) Dolphin

0 1 2 3 4 5 6 7 8 9 10 11 120

01

02

03

04

05

06

07

Degree

Prob

abili

ty o

f the

nod

e deg

ree

(b) Football

0 1 2 3 4 5 60

01

02

03

04

05

06

07

08

Degree

Prob

abili

ty o

f the

nod

e deg

ree

(c) Taro

minus5 0 5 10 15 20 250

002

004

006

008

01

012

014

Degree

Prob

abili

ty o

f the

nod

e deg

ree

(d) Tailor Shop

Figure 1 Degree distribution of four real-world datasets

other thirteen methods PA is optimal among the similaritymeasures and Jaccard yielded the poorest result

As shown in Table 5 the SGGMmethod outperforms theother 13 methods with the Taro dataset The AUC improvedby at least 10 For the Taro dataset the degree of 70 nodesis 3 which indicates a low degree of heterogeneity thus theperformance of the other thirteen methods is very close Onemight expect PA to show good performance on assortativenetworks and poor performance on disassortative networksHowever the experimental results show no obvious correla-tion between PA performance and the assortative coefficient

For the Tailor Shop dataset PA performs optimally whenthe 60 and 70 training sets were used while the SGGM

method performs optimally with the 80 and 90 trainingsets (Table 6) The SGGMs prediction accuracy graduallyincreased with increasing the number of samplesThus whensampling conditions permit multiple samples could improveprediction accuracy

443 Comparison on ROC The ratio of the training set wasset to 90The sample scale of the SGGM varied within 05119901119901 15119901 and 2119901120582 = 001 CNwas chosen as the representativeneighbourhood because six other measures (Salton JaccardSoslashrensen HPI HDI and LHN) are variants of CN

The ROC of the RA index is similar to that of the AAindex As observed in Figure 4 in most cases the SGGM

Mathematical Problems in Engineering 7

0005

010015

020

0506

0708

09

06507

07508

08509

Ratio of training

AUC

120582

(a) Dolphin

0005

010015

020

0506

0708

09

05506

06507

07508

Ratio of training

AUC

120582

(b) Football

0005

010015

020

06

07

08

064066068

07072074076

09

Ratio of training

AUC

120582

(c) Taro

0005

010015

020

06

07

08

09

06062064066068

07072

Ratio of training

AUC

120582

(d) Tailor Shop

Figure 2 AUC of the SGGM (four datasets training set ratio = 90)

05pp

152p

25p3p

0005

010015

020

07808

082084086088

09092

Samples

AUC

120582

(a) Dolphin

05pp

152p

25p3p

0005

010015

020

0506070809

1

Samples

AUC

120582

(b) Football

05pp

152p

25p3p

0005

010015

020

07075

08085

09095

Samples

AUC

120582

(c) Taro

05pp

152p

25p

3p

0005

010015

020

06507

07508

08509

095

Samples

AUC

120582

(d) Tailor Shop

Figure 3 AUC of the SGGM (four datasets sample scale = 05119873)

8 Mathematical Problems in Engineering

0 02 04 06 08 10

01

02

03

04

05

06

07

08

09

1

False positive rate

True

pos

itive

rate

CNAAPA

GGM n = 2p

GGM n = 15p

GGM n = p

GGM n = 05p

(a) Dolphin

CNAAPA

GGM n = 2p

GGM n = 15p

GGM n = p

GGM n = 05p

0 02 04 06 08 10

01

02

03

04

05

06

07

08

09

1

False positive rate

True

pos

itive

rate

(b) Football

0 02 04 06 08 10

01

02

03

04

05

06

07

08

09

1

False positive rate

True

pos

itive

rate

CNAAPA

GGM n = 2p

GGM n = 15p

GGM n = p

GGM n = 05p

(c) Taro

0 02 04 06 08 10

01

02

03

04

05

06

07

08

09

1

False positive rate

True

pos

itive

rate

CNAAPA

GGM n = 2p

GGM n = 15p

GGM n = p

GGM n = 05p

(d) Tailor Shop

Figure 4 Comparison of ROC metric of four methods (SGGM CN AA and PA) on four datasets

shows an improvement over existing similarity measureswhen sufficient samples were used Note that more sampleslead to higher accuracy

444 Comparison on Error Rate The error rate is a metricto evaluate the difference between two matrices A lowererror rate indicates that the estimated matrix approximates

the original matrix From Table 7 we observe that the SGGMhas the lowest error rate among all methods shown Toshow the effectiveness of the proposed SGGM both theoriginal adjacency matrix and estimated adjacency matrixare presented as a coloured graph in Figure 5 We usedthe Taro dataset with a 90 training set to recover theoriginal network Graphs with fewer red points indicate

Mathematical Problems in Engineering 9

Figure 5 Recovery performances of 14methods on the Taro datasetThe upper left corner is the original matrix blue points denote linksbetween two corresponding nodes The red points denote links inthe original matrix that were not accurately estimated The greenpoints denote links in the estimated matrix that do not exist in theoriginal matrix From left to right top to bottom the figures are therecovered matrix by SGGM (119899 = 05119901 119901 15119901 and 2119901) CN SaltonJaccard Soslashrensen HPI HDI LHN AA RA PA LNBCN LNBAAand LNBRA respectively

Table 4 Dolphin dataset results

Method Training set ratio60 70 80 90

SGGM 119899 = 05119901 0699 0731 0761 0801SGGM 119899 = 119901 0742 0775 0823 0867SGGM 119899 = 15119901 0758 0803 0845 0895SGGM 119899 = 2119901 0765 0807 0857 0909CN 0631 0687 0728 0762Salton 0621 0673 0709 0740Jaccard 0485 0490 0495 0500Soslashrensen 0538 0558 0571 0583HPI 0620 0669 0702 0728HDI 0626 0680 0720 0753LHN 0619 0666 0698 0721AA 0633 0690 0732 0767RA 0632 0690 0731 0766PA 0712 0727 0737 0746LNBCN 0636 0696 0739 0775LNBAA 0635 0694 0738 0773LNBRA 0635 0693 0736 0771

good recovery of the original matrix As can be seen fromFigure 5 using the SGGM method can restore the originalimage whereas the other similarity measures return greatererror rates Note that the SGGM error rate decreased withincreasing sample scale

4.5. Large Datasets. The main challenge of link prediction is dealing with large networks. We carefully chose four large datasets from four distinct domains. For the SGGM method, λ was set to 0.01, and 90% of the data was used as the training set.

Table 5: Taro dataset results (AUC; columns give the training-set ratio).

Method        | 60%   | 70%   | 80%   | 90%
SGGM n = 0.5p | 0.672 | 0.711 | 0.736 | 0.750
SGGM n = p    | 0.713 | 0.747 | 0.779 | 0.815
SGGM n = 1.5p | 0.721 | 0.763 | 0.816 | 0.857
SGGM n = 2p   | 0.723 | 0.788 | 0.825 | 0.865
CN            | 0.529 | 0.545 | 0.573 | 0.582
Salton        | 0.530 | 0.542 | 0.571 | 0.580
Jaccard       | 0.468 | 0.476 | 0.476 | 0.478
Sørensen      | 0.523 | 0.533 | 0.544 | 0.552
HPI           | 0.533 | 0.545 | 0.574 | 0.585
HDI           | 0.528 | 0.541 | 0.568 | 0.576
LHN           | 0.532 | 0.543 | 0.570 | 0.579
AA            | 0.533 | 0.554 | 0.589 | 0.609
RA            | 0.533 | 0.554 | 0.589 | 0.610
PA            | 0.559 | 0.568 | 0.582 | 0.607
LNBCN         | 0.536 | 0.562 | 0.616 | 0.638
LNBAA         | 0.537 | 0.561 | 0.614 | 0.634
LNBRA         | 0.536 | 0.560 | 0.613 | 0.634

Table 6: Tailor Shop dataset results (AUC; columns give the training-set ratio).

Method        | 60%   | 70%   | 80%   | 90%
SGGM n = 0.5p | 0.652 | 0.679 | 0.677 | 0.706
SGGM n = p    | 0.703 | 0.723 | 0.750 | 0.769
SGGM n = 1.5p | 0.715 | 0.755 | 0.786 | 0.817
SGGM n = 2p   | 0.734 | 0.770 | 0.803 | 0.844
CN            | 0.673 | 0.699 | 0.720 | 0.746
Salton        | 0.616 | 0.628 | 0.646 | 0.667
Jaccard       | 0.494 | 0.498 | 0.504 | 0.514
Sørensen      | 0.583 | 0.589 | 0.592 | 0.597
HPI           | 0.615 | 0.618 | 0.631 | 0.642
HDI           | 0.628 | 0.643 | 0.660 | 0.680
LHN           | 0.588 | 0.574 | 0.571 | 0.566
AA            | 0.682 | 0.709 | 0.729 | 0.753
RA            | 0.681 | 0.707 | 0.729 | 0.752
PA            | 0.761 | 0.775 | 0.779 | 0.786
LNBCN         | 0.692 | 0.715 | 0.746 | 0.765
LNBAA         | 0.693 | 0.716 | 0.746 | 0.764
LNBRA         | 0.693 | 0.715 | 0.745 | 0.762

To demonstrate the performance of the SGGM, we compare the proposed method with the other thirteen methods on these large datasets (for details, see Table 8) in terms of AUC and error rate (data sources: (1) http://konect.uni-koblenz.de/networks (Email); (2) http://www3.nd.edu/~networks/resources.htm (Protein); (3) http://www-personal.umich.edu/~mejn/netdata/power.zip (Grid); (4) http://konect.uni-koblenz.de/networks (Astro-ph)).

To estimate the large networks efficiently, we used QUIC [19] to solve the sparse inverse covariance estimation problem (formula (1)) instead of glasso [7].
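For illustration, a minimal sketch of the estimation-and-thresholding pipeline is given below. It uses scikit-learn's GraphicalLasso as a stand-in for the QUIC and SLEP solvers used in the paper; the positive-definite construction of Σ from the training adjacency matrix, the strict θ_ij < 0 decision rule, and the function name sggm_predict are assumptions made to keep the sketch self-contained.

```python
import numpy as np
from sklearn.covariance import GraphicalLasso  # stand-in for the QUIC/SLEP solvers

def sggm_predict(A_train, n, lam=0.01, seed=0):
    """Sketch of the SGGM scheme: draw samples from a Gaussian whose precision
    matrix encodes the training edges, re-estimate a sparse precision matrix,
    and read predicted links off its off-diagonal entries."""
    p = A_train.shape[0]
    rng = np.random.default_rng(seed)

    # Regularise the 0/1 training matrix into a positive-definite precision
    # matrix before inverting it (an assumption; the paper writes
    # Sigma = (E_T)^(-1) without spelling out the regularisation).
    shift = abs(np.linalg.eigvalsh(A_train).min()) + 1e-2
    sigma = np.linalg.inv(A_train + shift * np.eye(p))
    sigma = (sigma + sigma.T) / 2  # guard against numerical asymmetry

    # n independent observations x_i ~ N(0, Sigma); the paper uses n = 0.5p..2p.
    X = rng.multivariate_normal(np.zeros(p), sigma, size=n)

    # Sparse inverse covariance estimation, i.e. formula (1) with penalty lam.
    theta = GraphicalLasso(alpha=lam, max_iter=300).fit(X).precision_

    # Decision rule in the spirit of Algorithm 1: a negative off-diagonal
    # precision entry (a positive partial correlation) is predicted as a link.
    off_diag = ~np.eye(p, dtype=bool)
    return ((theta < 0) & off_diag).astype(int)
```

On networks of the scale in Table 8, a quadratic-approximation solver such as QUIC [19] would replace the GraphicalLasso call; the rest of the pipeline is unchanged.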


Table 7: Error rate of 14 methods on the 4 datasets.

Method        | Football | Dolphin | Taro  | Tailor Shop
SGGM n = 0.5p | 1.467    | 1.251   | 1.098 | 1.121
SGGM n = p    | 1.364    | 1.109   | 1.062 | 1.188
SGGM n = 1.5p | 1.362    | 1.198   | 1.098 | 1.025
SGGM n = 2p   | 1.200    | 1.121   | 1.013 | 0.987
CN            | 1.641    | 1.291   | 1.320 | 1.485
Salton        | 1.641    | 1.291   | 1.320 | 1.485
Jaccard       | 1.641    | 1.291   | 1.320 | 1.485
Sørensen      | 1.863    | 1.672   | 1.695 | 1.826
HPI           | 1.641    | 1.291   | 1.320 | 1.485
HDI           | 1.641    | 1.291   | 1.320 | 1.485
LHN           | 1.641    | 1.291   | 1.320 | 1.485
AA            | 1.641    | 1.291   | 1.320 | 1.485
RA            | 1.641    | 1.291   | 1.320 | 1.485
PA            | 2.512    | 1.251   | 1.340 | 1.446
LNBCN         | 1.641    | 1.291   | 1.261 | 1.382
LNBAA         | 1.641    | 1.291   | 1.261 | 1.382
LNBRA         | 1.641    | 1.291   | 1.261 | 1.382

Table 8: Basic topological features of the four large networks. N and M are the numbers of nodes and links, E is the network efficiency, C the clustering coefficient, Avg. D the average degree, and r the assortativity coefficient.

Network  | N     | M      | E     | C     | Avg. D | r
Email    | 1133  | 5451   | 0.299 | 0.220 | 9.622  | 0.078
Protein  | 1870  | 2240   | 0.099 | 0.099 | 2.396  | −0.156
Grid     | 4941  | 6594   | 0.063 | 0.107 | 2.669  | 0.003
Astro-ph | 18771 | 198050 | 0.091 | 0.486 | 4.140  | 0.294

Table 9: Comparison of 14 methods on the 4 large datasets (AUC).

Method   | Email | Protein | Grid  | Astro-ph
SGGM     | 0.916 | 0.943   | 0.944 | 0.929
CN       | 0.817 | 0.566   | 0.570 | 0.864
Salton   | 0.812 | 0.566   | 0.569 | 0.863
Jaccard  | 0.475 | 0.479   | 0.487 | 0.479
Sørensen | 0.547 | 0.493   | 0.529 | 0.584
HPI      | 0.809 | 0.566   | 0.569 | 0.863
HDI      | 0.813 | 0.566   | 0.569 | 0.863
LHN-I    | 0.805 | 0.565   | 0.570 | 0.864
AA       | 0.819 | 0.565   | 0.570 | 0.876
RA       | 0.818 | 0.566   | 0.568 | 0.838
PA       | 0.824 | 0.820   | 0.713 | 0.785

Overall, the proposed method can efficiently recover the original network and predict the missing links accurately. As observed from Table 9, the AUC of the SGGM is the highest on all four datasets: for the Email, Protein, Grid, and Astro-ph datasets, the SGGM improves the AUC by 9.2%, 12.3%, 23.1%, and 5.3% (relative to the strongest competing method in Table 9), respectively. For the error-rate metric, as shown in Table 10, the SGGM method performs optimally on all four large datasets, indicating that the SGGM method can accurately recover the original network.

Table 10: Comparison of 14 methods on the 4 large datasets (error rate).

Method   | Email | Protein | Grid   | Astro-ph
SGGM     | 0.467 | 0.411   | 0.367  | 0.359
CN       | 2.348 | 2.189   | 1.670  | 1.568
Salton   | 2.978 | 2.189   | 1.767  | 1.549
Jaccard  | 2.982 | 2.189   | 1.450  | 1.548
Sørensen | 3.047 | 2.555   | 2.071  | 2.771
HPI      | 2.578 | 2.189   | 1.670  | 1.610
HDI      | 2.978 | 2.189   | 1.237  | 1.854
LHN-I    | 2.428 | 2.189   | 1.692  | 1.612
AA       | 2.922 | 2.204   | 1.872  | 1.953
RA       | 2.348 | 2.189   | 1.670  | 1.410
PA       | 9.219 | 8.032   | 15.081 | 12.451

4.6. Analysis. In the comparative analysis of the performance of the 14 methods on the eight real-world datasets, the SGGM method was outstanding in terms of AUC and error rate. The method can accurately recover the original adjacency matrix and in most cases requires few samples, far fewer than the number of nodes. The SGGM method's prediction precision gradually increased with the number of samples; moreover, the recovery error rate decreased with increasing matrix sparsity. Therefore, the SGGM method can be applied to small samples, although it performs better with more samples. These results demonstrate that the SGGM method can be implemented in a scalable fashion.

5. Conclusions

Link prediction is a basic problem in complex network analysis. In real-world networks, obtaining node and edge properties is cumbersome. Link prediction methods based on network-structure similarity do not depend on node properties; however, these methods are limited by network structure and are difficult to adapt to different network structures.

Our work does not depend on node properties and extends link prediction to network topologies that existing methods cannot handle. We sampled the original network adjacency matrix and used the SGGM method to depict the network structure.

Most nodes are independent in an actual network; thus, we used the SGGM method to estimate a precision matrix of the adjacency matrix to predict links. Our experimental results show that using the SGGM method to recover network structure is better than 13 mainstream similarity methods. We tested the proposed method with eight real-world datasets and used the AUC and error rate as evaluation indexes, finding that the method obtains higher prediction precision. We also analysed the influence of different parameters on the SGGM method and found no significant influence: the SGGM method returns a high AUC value within a certain parameter range. Furthermore, the proposed method retains high prediction precision with fewer training samples, so it can be applied to large networks.

Conflict of Interests

The authors declare that there is no conflict of interests regarding the publication of this paper.

Acknowledgments

The authors thank Longqi Yang, Zhisong Pan, and Guyu Hu for their contributions to work on link prediction and Zhen Li and Junyang Qiu for their valuable discussions, comments, and suggestions. This work was supported by the National Technology Research and Development Program of China (863 Program) (Grant no. 2012AA01A510), the National Natural Science Foundation of China (Grant no. 61473149), and the Natural Science Foundation of Jiangsu Province, China (Grant no. BK20140073).

References

[1] L. Lü and T. Zhou, "Link prediction in complex networks: a survey," Physica A: Statistical Mechanics and Its Applications, vol. 390, no. 6, pp. 1150–1170, 2011.

[2] B. Ermiş, E. Acar, and A. T. Cemgil, "Link prediction in heterogeneous data via generalized coupled tensor factorization," Data Mining & Knowledge Discovery, vol. 29, no. 1, pp. 203–236, 2013.

[3] Y. Yang, N. Chawla, Y. Sun, and J. Han, "Predicting links in multi-relational and heterogeneous networks," in Proceedings of the 12th IEEE International Conference on Data Mining (ICDM '12), pp. 755–764, Brussels, Belgium, December 2012.

[4] D. Liben-Nowell and J. Kleinberg, "The link-prediction problem for social networks," Journal of the American Society for Information Science and Technology, vol. 58, no. 7, pp. 1019–1031, 2007.

[5] Y. Yang, R. N. Lichtenwalter, and N. V. Chawla, "Evaluating link prediction methods," Knowledge and Information Systems, vol. 45, no. 3, pp. 751–782, 2015.

[6] N. Z. Gong, A. Talwalkar, L. MacKey et al., "Joint link prediction and attribute inference using a social-attribute network," ACM Transactions on Intelligent Systems and Technology, vol. 5, no. 2, article 27, 2014.

[7] J. Friedman, T. Hastie, and R. Tibshirani, "Sparse inverse covariance estimation with the graphical lasso," Biostatistics, vol. 9, no. 3, pp. 432–441, 2008.

[8] M. E. Newman, "Clustering and preferential attachment in growing networks," Physical Review E, vol. 64, no. 2, Article ID 025102, 2001.

[9] G. Salton and M. J. McGill, Introduction to Modern Information Retrieval, McGraw-Hill, New York, NY, USA, 1986.

[10] P. Jaccard, "Étude comparative de la distribution florale dans une portion des Alpes et du Jura," Bulletin de la Société Vaudoise des Sciences Naturelles, vol. 37, no. 142, pp. 547–579, 1901.

[11] T. Sørensen, "A method of establishing groups of equal amplitude in plant sociology based on similarity of species and its application to analyses of the vegetation on Danish commons," Biologiske Skrifter, vol. 5, pp. 1–34, 1948.

[12] E. A. Leicht, P. Holme, and M. E. J. Newman, "Vertex similarity in networks," Physical Review E, vol. 73, no. 2, Article ID 026120, 2006.

[13] Y.-B. Xie, T. Zhou, and B.-H. Wang, "Scale-free networks without growth," Physica A: Statistical Mechanics and Its Applications, vol. 387, no. 7, pp. 1683–1688, 2008.

[14] T. Zhou, L. Lü, and Y.-C. Zhang, "Predicting missing links via local information," The European Physical Journal B, vol. 71, no. 4, pp. 623–630, 2009.

[15] Z. Liu, Q. M. Zhang, L. Lü, and T. Zhou, "Link prediction in complex networks: a local naïve Bayes model," EPL (Europhysics Letters), vol. 96, no. 4, Article ID 48007, 2011.

[16] J. Dahl, L. Vandenberghe, and V. Roychowdhury, "Covariance selection for nonchordal graphs via chordal embedding," Optimization Methods & Software, vol. 23, no. 4, pp. 501–520, 2008.

[17] O. Banerjee, L. E. Ghaoui, and A. d'Aspremont, "Model selection through sparse maximum likelihood estimation for multivariate Gaussian or binary data," Journal of Machine Learning Research, vol. 9, no. 3, pp. 485–516, 2008.

[18] C.-J. Hsieh, M. A. Sustik, I. S. Dhillon, P. K. Ravikumar, and R. Poldrack, "BIG & QUIC: sparse inverse covariance estimation for a million variables," in Advances in Neural Information Processing Systems, pp. 3165–3173, MIT Press, 2013.

[19] C.-J. Hsieh, M. A. Sustik, I. S. Dhillon, and P. Ravikumar, "QUIC: quadratic approximation for sparse inverse covariance estimation," Journal of Machine Learning Research, vol. 15, pp. 2911–2947, 2014.

[20] V. Latora and M. Marchiori, "Efficient behavior of small-world networks," Physical Review Letters, vol. 87, Article ID 198701, 2001.

[21] D. J. Watts and S. H. Strogatz, "Collective dynamics of 'small-world' networks," Nature, vol. 393, no. 6684, pp. 440–442, 1998.

[22] M. E. J. Newman, "Assortative mixing in networks," Physical Review Letters, vol. 89, no. 20, Article ID 208701, 2002.

[23] J. Liu, S. Ji, and J. Ye, SLEP: Sparse Learning with Efficient Projections, vol. 6, Arizona State University, Tempe, Ariz, USA, 2009.


Page 4: Research Article Link Prediction via Sparse Gaussian Graphical … · 2019. 5. 14. · Research Article Link Prediction via Sparse Gaussian Graphical Model LiangliangZhang,LongqiYang,GuyuHu,ZhisongPan,andZhenLi

4 Mathematical Problems in Engineering

the conditional independence of the variables To estimatea sparse GGM many methods are based on maximumlikelihood estimation (MLE) and one such method is thegraphical lasso which is expressed as

maxΘ≻0

log detΘ minus tr (119878Θ) minus 120582 Θ1 (1)

119878 =1

119873

119873

sum

119894=1

(119909119894minus 119909) (119909

119894minus 119909)119879

(2)

Θ1= sum

119894 =119895

10038161003816100381610038161003816Θ119894119895

10038161003816100381610038161003816 (3)

Here 119878 is the experiential covariance matrix and 120582 ispositive tuning parameter If Θ gt 0 Θ is a 119901 times 119901 positivedefinite matrix Θ

1is the punishment element and Θ

becomes increasingly sparse with increasing 120582 An ℓ1-norm is

utilised to considerΘ a variable rather than a fixed parameter

32 Graphical Lasso Algorithm In 2008 Banerjee et al [17]proved that formula (1) is convex They estimated Σ via 119882only and subsequently solved the problem by optimisingover each row Moreover they optimised the correspondingcolumn of119882 in a block coordinate descent fashion

119882 = (

11988211

11990812

119908119879

1211990822

)

119878 = (

11987811

11990412

119904119879

1211990422

)

(4)

By partitioning119882 and 119878 the solution for 11990812satisfies

11990812= argmin

119910

119910119879

119882minus1

11119910

1003817100381710038171003817119910 minus 119904121003817100381710038171003817infin

le 119901 (5)

Banerjee et al [17] proved that solving (5) is equivalent tosolving its dual problem

min120573

= 1

2

1003817100381710038171003817100381711988212

11120573 minus 119887

10038171003817100381710038171003817

2

+ 120588100381710038171003817100381712057310038171003817100381710038171 (6)

Here assume that 120573 is the solution to (6) that is 119887 =

11988212

1111990412 thus119908

12= 11988211120573 is the solution to (5)Therefore (6)

is similar to lasso regression and is the basis for the graphicallasso algorithm First we must prove that (6) and (1) areequivalent Given119882Θ = 119868 that is

(

11988211

11990812

119908119879

1211990822

)(

Θ11

12057912

120579119879

1212057922

) = (

119868 0

0119879

1

) (7)

the MLE of (1) is rewritten as

119882minus 119878 minus 120588119867 = 0 (8)

Note that the derivative of log detΘ equals Θminus1 = 119882Here 119867

119894119895isin sign(Θ

119894119895) that is 119867

119894119895= sign(Θ

119894119895) if Θ

119894119895= 0

else119867119894119895isin [minus1 1] ifΘ

119894119895= 0 Thus the upper-right block of (8)

is expressed as

11990812minus 11990412minus 12058812057412= 0 (9)

The subgradient equation of (6) is expressed as follows

11988211120573 minus 11990412+ 120588V = 0 (10)

where V isin sign(120573) Here suppose that (119882119867) solves (8)consequently (119908

12 12057412) solves (9) Then 120573 = 119882

minus1

1111990812

andV = minus120574

12solve (10) For the sign terms119882

11Θ12+ 11990812Θ22= 0

is derived from (7) therefore we obtainΘ12= minusΘ22119882minus1

1111990812

Since Θ22gt 0 it follows that sign(Θ

12) = minus sign(119882minus1

1111990812) =

minus sign(120573) which proves the equivalenceThe solution120573 to thelasso problem (6) gives the relevant part of Θ

12= minusΘ22120573

Problem (6) appears similar to a lasso (ℓ1-regularised)

least-squares problem In fact if 11988211

= 11987811 the solutions

of 120573 are equal to the lasso estimates for the 119901th variable In2007 Friedman et al used fast coordinate descent algorithmsto solve the lasso problem Hsieh et al [18 19] proposed anovel block-coordinate descentmethod via clustering to solvethe optimization problem The method can reach quadraticconvergence rates which is designed for large network

33 Link Prediction Scheme Based on Sparse Inverse Covari-ance Estimation Here we consider an undirected simplenetwork119866(119881 119864) where 119864 is the set of links and119881 is the set ofnodes Moreover 119860 is an adjacent matrix Λ is the graphicallasso parameter 119901 denotes the number of nodes and theuniversal set 119880 consists of 119901(119901 minus 1)2 edges The set of linksare randomly divided into two parts training (119864

119879) and testing

(119864119875) sets Note that only the information in the training set

can be used for link prediction 120588 denotes the ratio of thetraining set which means the amount of edges being usedClearly 119864 = 119864

119879cup 119864119875 and 119864

119879cap 119864119875= 0 In the training set

sampling process we generated 119899 independent observations119883 = [119909

1 119909

119899] each from normal distribution 119873(0 Σ)

where Σ = (119864119879)minus1 Here 119878 denotes the sample covariance

matrices of 119883 = [1199091 119909

119899] We then apply SGGM for link

prediction The pseudocode of the link prediction scheme isgiven in Algorithm 1

4 Experiment and Analysis

41 Evaluation Metrics

411 AUC In link prediction we focus on accurately recov-ering missing links Here we have chosen the area undera relative operating characteristic (ROC) curve as a metricAn ROC curve represents a comparison of two operatingcharacteristics TPR and FPR as the criterion changes TheAUC ranges from 05 to 1 and a higher value means a bettermodel

412 Error Rate We also defined an error metric to evaluatethe difference between the original and estimated networksand are denoted by 120579 and 1205791015840 respectively The error metric isdefined as Error = 120579minus120579

1015840

119865120579119865 where sdot

119865is the Frobenius

norm

42 Real-World Datasets In this paper we consider fourrepresentative networks drawn from disparate fields (data

Mathematical Problems in Engineering 5

Require 119860 120588 gt 0 120582 gt 0 119899 gt 0

Ensure 119860119864119879 randomly select 120588|119864| edges from 119864

119864119875

lArr 119864 minus 119864119879

Σ lArr (119864119879

)minus1

119883 lArr [1199091 1199092 119909

119899] 119899 independent observations from

119873(0 Σ)

119878 lArr cov(119883)Θ lArr Sparse Inverse Covariance Estimation (119860 120582 119878)if 120579119894119895le 0 and 119894 = 119895 then

119860119894119895lArr 1

else119860119894119895lArr 0

end if

Algorithm 1 Link prediction based on sparse inverse covarianceestimation with graphical lasso

Table 2 Basic topological features of four real-world networks

Network 119873 119872 119864 119862 Avg 119863 119903

Taro 22 39 0488 0339 3546 minus0375Tailor Shop 39 158 0567 0458 8103 minus0183Dolphin 62 159 0379 0259 5129 minus0044Football 115 613 0450 0403 10661 0162

sources (1) httpwww-personalumichedusimmejnnetdata(Football andDolphin) and (2) httpvladofmfuni-ljsipubnetworksdataucinetucidatahtm (Tailor Shop and Taro))

We selected four real-world datasets to compare theproposed method with the 13 similarity measures Table 2shows the basic topological features of the four real-worldnetworks wherein 119873 and119872 are the total numbers of nodesand links respectively 119864 is the network efficiency [20] whichis defined as 119864 = (2119873(119873 minus 1))sum

119909119910isin119881119909 =119910119889minus1

119909119910 where 119889

119909119910

denotes the shortest distance between nodes 119909 and 119910 and119889119909119910

= +infin if 119909 and 119910 are in two different components119862 and Avg 119863 denote the clustering coefficient [21] andaverage degree respectively and 119903 is the network assortativecoefficient [22] The network is assortative if a node tends toconnect with the approximate node 119903 gt 0 means that theentire network has assortative structure and a nodewith largedegree tends to connect to other nodes with large degree 119903 lt0 means that the entire network has disassortative structureand 119903 = 0 implies that there is no correlation in the networkstructure Figure 1 shows the distribution of the four datasetsIt is clear that the degree distributions of the four data sets areconsiderably different

43 Experimental Settings To evaluate the performance ofthe algorithmswith different sized training sets we randomlyselected 60 70 80 and 90 of the links from theoriginal network as training sets Only the information inthe training set was used for estimation Each training setwas executed 10 times We choose 10-fold cross validationbecause it has been widely accepted in machine learning

Table 3 Football dataset results

Method Training set ratio60 70 80 90

SGGM 119899 = 05119901 0702 0725 0757 0783SGGM 119899 = 119901 0742 0777 0818 0857SGGM 119899 = 15119901 0753 0795 0845 0889SGGM 119899 = 2119901 0756 0800 0852 0905CN 0774 0798 0824 0835Salton 0779 0804 0828 0838Jaccard 0502 0508 0514 0520Soslashrensen 0590 0597 0604 0606HPI 0778 0802 0827 0838HDI 0779 0803 0827 0838LHN 0777 0802 0827 0838AA 0774 0798 0825 0836RA 0774 0798 0824 0836PA 0515 0516 0520 0528LNBCN 0777 0800 0825 0836LNBAA 0777 0800 0825 0836LNBRA 0776 0799 0825 0837

and data mining research Thirteen similarity measures wereimplemented following Zhou et al [14] and the SGGM wasimplemented using the SLEP toolbox [23] We evaluated theperformance of the GGM algorithm with different samplescales Here 119901 denotes the number of nodes We set thesample scale to 05119901 119901 15119901 and 2119901 The parameter 120582 in theSGGM controlled the sparsity of the estimated model Larger120582 gives a sparser estimated network Here 120582 ranged from 001

to 02 with a step of 001

44 Results

441 Tuning Parameter for the SGGM Using the training setthat was scaled to 90 we tested the proposed SGGM withdifferent 120582 and sample scales As shown in Figure 2 the AUCof the SGGM increased with increasing sample scale for thefour datasets One advantage of the SGGM is that it is notsensitive to the parameter 120582 Then we set the sample scale to05119873 Figure 3 shows that the AUC increased with increase intraining set scale For the Tailor Shop dataset the proposedalgorithm performs optimally with the 70 scaled trainingset

442 Comparison on AUC Table 3 lists the results for 14methods with the Football dataset The SGGM performsoptimally when 80 or 90 of the training set is used and thenumber of samples is greater than119873The prediction accuracyof the SGGM increased by 5 compared to the other 13methodsHowever with lower proportions of the training setthe SGGMs AUC is very close to that of the other methodsNote that PA yielded the poorest result with this dataset

As shown in Table 4 the SGGM method performs opti-mally with the Dolphin dataset The prediction accuracy ofthe SGGMmethod increased by at least 3 compared to the

6 Mathematical Problems in Engineering

0 1 2 3 4 5 6 7 8 9 10 11 120

002

004

006

008

01

012

014

016

Degree

Prob

abili

ty o

f the

nod

e deg

ree

(a) Dolphin

0 1 2 3 4 5 6 7 8 9 10 11 120

01

02

03

04

05

06

07

Degree

Prob

abili

ty o

f the

nod

e deg

ree

(b) Football

0 1 2 3 4 5 60

01

02

03

04

05

06

07

08

Degree

Prob

abili

ty o

f the

nod

e deg

ree

(c) Taro

minus5 0 5 10 15 20 250

002

004

006

008

01

012

014

Degree

Prob

abili

ty o

f the

nod

e deg

ree

(d) Tailor Shop

Figure 1 Degree distribution of four real-world datasets

other thirteen methods PA is optimal among the similaritymeasures and Jaccard yielded the poorest result

As shown in Table 5 the SGGMmethod outperforms theother 13 methods with the Taro dataset The AUC improvedby at least 10 For the Taro dataset the degree of 70 nodesis 3 which indicates a low degree of heterogeneity thus theperformance of the other thirteen methods is very close Onemight expect PA to show good performance on assortativenetworks and poor performance on disassortative networksHowever the experimental results show no obvious correla-tion between PA performance and the assortative coefficient

For the Tailor Shop dataset PA performs optimally whenthe 60 and 70 training sets were used while the SGGM

method performs optimally with the 80 and 90 trainingsets (Table 6) The SGGMs prediction accuracy graduallyincreased with increasing the number of samplesThus whensampling conditions permit multiple samples could improveprediction accuracy

443 Comparison on ROC The ratio of the training set wasset to 90The sample scale of the SGGM varied within 05119901119901 15119901 and 2119901120582 = 001 CNwas chosen as the representativeneighbourhood because six other measures (Salton JaccardSoslashrensen HPI HDI and LHN) are variants of CN

The ROC of the RA index is similar to that of the AAindex As observed in Figure 4 in most cases the SGGM

Mathematical Problems in Engineering 7

0005

010015

020

0506

0708

09

06507

07508

08509

Ratio of training

AUC

120582

(a) Dolphin

0005

010015

020

0506

0708

09

05506

06507

07508

Ratio of training

AUC

120582

(b) Football

0005

010015

020

06

07

08

064066068

07072074076

09

Ratio of training

AUC

120582

(c) Taro

0005

010015

020

06

07

08

09

06062064066068

07072

Ratio of training

AUC

120582

(d) Tailor Shop

Figure 2 AUC of the SGGM (four datasets training set ratio = 90)

05pp

152p

25p3p

0005

010015

020

07808

082084086088

09092

Samples

AUC

120582

(a) Dolphin

05pp

152p

25p3p

0005

010015

020

0506070809

1

Samples

AUC

120582

(b) Football

05pp

152p

25p3p

0005

010015

020

07075

08085

09095

Samples

AUC

120582

(c) Taro

05pp

152p

25p

3p

0005

010015

020

06507

07508

08509

095

Samples

AUC

120582

(d) Tailor Shop

Figure 3 AUC of the SGGM (four datasets sample scale = 05119873)

8 Mathematical Problems in Engineering

0 02 04 06 08 10

01

02

03

04

05

06

07

08

09

1

False positive rate

True

pos

itive

rate

CNAAPA

GGM n = 2p

GGM n = 15p

GGM n = p

GGM n = 05p

(a) Dolphin

CNAAPA

GGM n = 2p

GGM n = 15p

GGM n = p

GGM n = 05p

0 02 04 06 08 10

01

02

03

04

05

06

07

08

09

1

False positive rate

True

pos

itive

rate

(b) Football

0 02 04 06 08 10

01

02

03

04

05

06

07

08

09

1

False positive rate

True

pos

itive

rate

CNAAPA

GGM n = 2p

GGM n = 15p

GGM n = p

GGM n = 05p

(c) Taro

0 02 04 06 08 10

01

02

03

04

05

06

07

08

09

1

False positive rate

True

pos

itive

rate

CNAAPA

GGM n = 2p

GGM n = 15p

GGM n = p

GGM n = 05p

(d) Tailor Shop

Figure 4 Comparison of ROC metric of four methods (SGGM CN AA and PA) on four datasets

shows an improvement over existing similarity measureswhen sufficient samples were used Note that more sampleslead to higher accuracy

444 Comparison on Error Rate The error rate is a metricto evaluate the difference between two matrices A lowererror rate indicates that the estimated matrix approximates

the original matrix From Table 7 we observe that the SGGMhas the lowest error rate among all methods shown Toshow the effectiveness of the proposed SGGM both theoriginal adjacency matrix and estimated adjacency matrixare presented as a coloured graph in Figure 5 We usedthe Taro dataset with a 90 training set to recover theoriginal network Graphs with fewer red points indicate

Mathematical Problems in Engineering 9

Figure 5 Recovery performances of 14methods on the Taro datasetThe upper left corner is the original matrix blue points denote linksbetween two corresponding nodes The red points denote links inthe original matrix that were not accurately estimated The greenpoints denote links in the estimated matrix that do not exist in theoriginal matrix From left to right top to bottom the figures are therecovered matrix by SGGM (119899 = 05119901 119901 15119901 and 2119901) CN SaltonJaccard Soslashrensen HPI HDI LHN AA RA PA LNBCN LNBAAand LNBRA respectively

Table 4 Dolphin dataset results

Method Training set ratio60 70 80 90

SGGM 119899 = 05119901 0699 0731 0761 0801SGGM 119899 = 119901 0742 0775 0823 0867SGGM 119899 = 15119901 0758 0803 0845 0895SGGM 119899 = 2119901 0765 0807 0857 0909CN 0631 0687 0728 0762Salton 0621 0673 0709 0740Jaccard 0485 0490 0495 0500Soslashrensen 0538 0558 0571 0583HPI 0620 0669 0702 0728HDI 0626 0680 0720 0753LHN 0619 0666 0698 0721AA 0633 0690 0732 0767RA 0632 0690 0731 0766PA 0712 0727 0737 0746LNBCN 0636 0696 0739 0775LNBAA 0635 0694 0738 0773LNBRA 0635 0693 0736 0771

good recovery of the original matrix As can be seen fromFigure 5 using the SGGM method can restore the originalimage whereas the other similarity measures return greatererror rates Note that the SGGM error rate decreased withincreasing sample scale

45 Large Datasets The main challenge of link prediction isdealing with large network We carefully choose four largedatasets from four individual domain For SGGM method120582 is set to 001 90 of data was used as training set

Table 5 Taro dataset results

Method Training set ratio60 70 80 90

SGGM 119899 = 05119901 0672 0711 0736 0750SGGM 119899 = 119901 0713 0747 0779 0815SGGM 119899 = 15119901 0721 0763 0816 0857SGGM 119899 = 2119901 0723 0788 0825 0865CN 0529 0545 0573 0582Salton 0530 0542 0571 0580Jaccard 0468 0476 0476 0478Soslashrensen 0523 0533 0544 0552HPI 0533 0545 0574 0585HDI 0528 0541 0568 0576LHN 0532 0543 0570 0579AA 0533 0554 0589 0609RA 0533 0554 0589 0610PA 0559 0568 0582 0607LNBCN 0536 0562 0616 0638LNBAA 0537 0561 0614 0634LNBRA 0536 0560 0613 0634

Table 6 Tailor Shop dataset results

Method Training set ratio60 70 80 90

SGGM 119899 = 05119901 0652 0679 0677 0706SGGM 119899 = 119901 0703 0723 0750 0769SGGM 119899 = 15119901 0715 0755 0786 0817SGGM 119899 = 2119901 0734 0770 0803 0844CN 0673 0699 0720 0746Salton 0616 0628 0646 0667Jaccard 0494 0498 0504 0514Soslashrensen 0583 0589 0592 0597HPI 0615 0618 0631 0642HDI 0628 0643 0660 0680LHN 0588 0574 0571 0566AA 0682 0709 0729 0753RA 0681 0707 0729 0752PA 0761 0775 0779 0786LNBCN 0692 0715 0746 0765LNBAA 0693 0716 0746 0764LNBRA 0693 0715 0745 0762

In order to prove the performance of SGGM we comparethe proposed method with the other thirteen methods onthese large datasets (for details see Table 8) in terms of AUCand error rate (data sources (1) httpkonectuni-koblenzdenetworks (Email) (2) httpwww3ndedusimnetworksresourceshtm (Protein) (3) httpwww-personalumichedusimmejnnetdatapowerzip (Grid) (4) httpkonectuni-koblenzdenetworks (Astro-ph))

To estimate the large network efficiently we used QUIC[19] to solve sparse inverse covariance estimation (formula(1)) instead of glasso [7] Overall the proposed method

10 Mathematical Problems in Engineering

Table 7 Error rate of 13 methods on 4 datasets

Method DatasetsFootball Dolphin Taro Tailor Shop

SGGM 119899 = 05119901 1467 1251 1098 1121SGGM 119899 = 119901 1364 1109 1062 1188SGGM 119899 = 15119901 1362 1198 1098 1025SGGM 119899 = 2119901 1200 1121 1013 0987CN 1641 1291 1320 1485Salton 1641 1291 1320 1485Jaccard 1641 1291 1320 1485Soslashrensen 1863 1672 1695 1826HPI 1641 1291 1320 1485HDI 1641 1291 1320 1485LHN 1641 1291 1320 1485AA 1641 1291 1320 1485RA 1641 1291 1320 1485PA 2512 1251 1340 1446LNBCN 1641 1291 1261 1382LNBAA 1641 1291 1261 1382LNBRA 1641 1291 1261 1382

Table 8 Basic topological features of four large networks

Network 119873 119872 119864 119862 Avg 119863 119903

Email 1133 5451 0299 0220 9622 0078Protein 1870 2240 0099 0099 2396 minus0156Grid 4941 6594 0063 0107 2669 0003Astro-ph 18771 198050 0091 0486 4140 0294

Table 9 Comparison of 14 methods on 4 large datasets on AUC

Method Email Protein Grid Astro-phSGGM 0916 0943 0944 0929CN 0817 0566 0570 0864Salton 0812 0566 0569 0863Jaccard 0475 0479 0487 0479Soslashrensen 0547 0493 0529 0584HPI 0809 0566 0569 0863HDI 0813 0566 0569 0863LHN-I 0805 0565 0570 0864AA 0819 0565 0570 0876RA 0818 0566 0568 0838PA 0824 0820 0713 0785

can efficiently recover the original network and predict themissing links accurately As observed from Table 9 AUC ofSGGM is highest on all 4 datasets For Email Protein Gridand Astro-ph datasets SGGM improves 92 123 231and 53 on AUC respectively For error rate metric asshown in Table 10 the SGGM method performs optimallywith all 4 large datasets It indicates that SGGMmethod couldaccurately recover the original network

Table 10 Comparison of 14 methods on 4 large datasets on errorrate

Method Email Protein Grid Astro-phSGGM 0467 0411 0367 0359CN 2348 2189 1670 1568Salton 2978 2189 1767 1549Jaccard 2982 2189 1450 1548Soslashrensen 3047 2555 2071 2771HPI 2578 2189 1670 1610HDI 2978 2189 1237 1854LHN-I 2428 2189 1692 1612AA 2922 2204 1872 1953RA 2348 2189 1670 1410PA 9219 8032 15081 12451

46 Analysis In the comparative analysis of the performanceof the 14 methods using the eight real-world datasets theSGGM method was outstanding in terms of AUC and errorrate This method can accurately recover the original adja-cency matrix and in most of cases requires fewer samplesthat is far fewer than the actual number of nodes TheSGGM methodrsquos prediction precision gradually increasedwith the increase in number of samples moreover therecovery error rate decreased with the increase in matrixsparsityTherefore the SGGMmethod can be applied to smallsamples however it performs better with more samplesThese results demonstrate that the SGGM method can beimplemented in a scalable fashion

5 Conclusions

Link prediction is a basic problem in complex networkanalysis In real-world networks obtaining node and edgeproperties is cumbersome Link prediction methods basedon network structure similarity do not depend on nodeproperties however these methods are limited by networkstructure and are difficult to adapt in different networkstructures

Our work is not dependent on node properties andexpands link prediction methods that cannot deal withdiverse network topologiesWe sampled the original networkadjacency matrix and used the SGGM method to depict thenetwork structure

Most nodes are independent in the actual network thuswe used the SGGM method to estimate a precision matrixof the adjacency matrix to predict links Our experimentalresults show that using the SGGMmethod to recover networkstructure is better than 13mainstream similaritymethodsWetested the proposed method with eight real-world datasetsand used the AUC and error rate as evaluation indexes Wefound that this method obtains higher prediction precisionWe also analysed the influence of different parameters onthe SGGM method and found no significant influenceThe SGGM method returns a high AUC value within acertain range Furthermore the proposed method retains

Mathematical Problems in Engineering 11

high prediction precision with fewer training samples it canbe applied in large network

Conflict of Interests

The authors declare that there is no conflict of interestsregarding the publication of this paper

Acknowledgments

The authors thank Longqi Yang Zhisong Pan and GuyuHu for their contributions to work on link prediction andZhen Li and Junyang Qiu for their valuable discussionscomments and suggestions This work was supported by theNational Technology Research and Development Programof China (863 Program) (Grant no 2012AA01A510) theNational Natural Science Foundation of China (Grant no61473149) and the Natural Science Foundation of JiangsuProvince China (Grant no BK20140073)

References

[1] L L Lu and T Zhou ldquoLink prediction in complex networks asurveyrdquo Physica A Statistical Mechanics amp Its Applications vol390 no 6 pp 1150ndash1170 2011

[2] B Ermis E Acar and A T Cemgil ldquoLink prediction in het-erogeneous data via generalized coupled tensor factorizationrdquoDataMining amp Knowledge Discovery vol 29 no 1 pp 203ndash2362013

[3] Y Yang N Chawla Y Sun and J Hani ldquoPredicting links inmulti-relational and heterogeneous networksrdquo in Proceedings ofthe 12th IEEE International Conference on Data Mining (ICDMrsquo12) pp 755ndash764 Brussels Belgium December 2012

[4] D Liben-Nowell and J Kleinberg ldquoThe link-prediction prob-lem for social networksrdquo Journal of the American Society forInformation Science and Technology vol 58 no 7 pp 1019ndash10312007

[5] Y Yang R N Lichtenwalter and N V Chawla ldquoEvaluating linkprediction methodsrdquo Knowledge and Information Systems vol45 no 3 pp 751ndash782 2015

[6] N Z Gong A Talwalkar LMacKey et al ldquoJoint link predictionand attribute inference using a social-attribute networkrdquo ACMTransactions on Intelligent Systems and Technology vol 5 no 2article 27 2014

[7] J Friedman T Hastie and R Tibshirani ldquoSparse inversecovariance estimationwith the graphical lassordquoBiostatistics vol9 no 3 pp 432ndash441 2008

[8] M E Newman ldquoClustering and preferential attachment ingrowing networksrdquo Physical Review E vol 64 no 2 pp 25ndash1022001

[9] G Salton and M J McGill Introduction to Modern InformationRetrieval McGraw-Hill New York NY USA 1986

[10] P Jaccard ldquoEtude comparative de la distribution florale dansune portion des Alpes et du Jurardquo Bulletin de la Societe Vaudoisedes Sciences Naturelles vol 37 no 142 pp 547ndash579 1901

[11] T Soslashrensen ldquoA method of establishing groups of equal ampli-tude in plant sociology based on similarity of species and itsapplication to analyses of the vegetation on danish commonsrdquoBiologiske Skrifter vol 5 pp 1ndash34 1948

[12] E A Leicht P Holme and M E J Newman ldquoVertex similarityin networksrdquo Physical Review E vol 73 no 2 Article ID 0261202006

[13] Y-B Xie T Zhou and B-H Wang ldquoScale-free networkswithout growthrdquo Physica A Statistical Mechanics and Its Appli-cations vol 387 no 7 pp 1683ndash1688 2008

[14] T Zhou L Lu and Y-C Zhang ldquoPredicting missing links vialocal informationrdquoThe European Physical Journal B vol 71 no4 pp 623ndash630 2009

[15] Z Liu Q M Zhang L Lu and T Zhou ldquoLink prediction incomplex networks a local naıve Bayesmodelrdquo EPL (EurophysicsLetters) vol 96 no 4 Article ID 48007 2011

[16] J Dahl L Vandenberghe and V Roychowdhury ldquoCovarianceselection for nonchordal graphs via chordal embeddingrdquo Opti-mization Methods amp Software vol 23 no 4 pp 501ndash520 2008

[17] O Banerjee L E Ghaoui andADaspremont ldquoModel selectionthrough sparse maximum likelihood estimation for multivari-ate Gaussian or binary datardquoMachine Learning Research vol 9no 3 pp 485ndash516 2008

[18] C J Hsieh M A Sustik I S Dhillon P K Ravikumar and RPoldrack ldquoBigamp quic sparse inverse covariance estimation for amillion variablesrdquo in Advances in Neural Information ProcessingSystems pp 3165ndash3173 MIT Press 2013

[19] C-J Hsieh M A Sustik I S Dhillon and P RavikumarldquoQUIC quadratic approximation for sparse inverse covarianceestimationrdquo Journal of Machine Learning Research vol 15 pp2911ndash2947 2014

[20] V Latora and M Marchiori ldquoEfficient behavior of small-worldnetworksrdquo Physical Review Letters vol 87 Article ID 1987012001

[21] D J Watts and S H Strogatz ldquoCollective dynamics of lsquosmall-worldrsquo networksrdquoNature vol 393 no 6684 pp 440ndash442 1998

[22] M E J Newman ldquoAssortative mixing in networksrdquo PhysicalReview Letters vol 89 no 20 Article ID 208701 2002

[23] J Liu S Ji and J Ye SLEP Sparse Learning with EfficientProjections vol 6 Arizona State University Tempe Ariz USA2009

Submit your manuscripts athttpwwwhindawicom

Hindawi Publishing Corporationhttpwwwhindawicom Volume 2014

MathematicsJournal of

Hindawi Publishing Corporationhttpwwwhindawicom Volume 2014

Mathematical Problems in Engineering

Hindawi Publishing Corporationhttpwwwhindawicom

Differential EquationsInternational Journal of

Volume 2014

Applied MathematicsJournal of

Hindawi Publishing Corporationhttpwwwhindawicom Volume 2014

Probability and StatisticsHindawi Publishing Corporationhttpwwwhindawicom Volume 2014

Journal of

Hindawi Publishing Corporationhttpwwwhindawicom Volume 2014

Mathematical PhysicsAdvances in

Complex AnalysisJournal of

Hindawi Publishing Corporationhttpwwwhindawicom Volume 2014

OptimizationJournal of

Hindawi Publishing Corporationhttpwwwhindawicom Volume 2014

CombinatoricsHindawi Publishing Corporationhttpwwwhindawicom Volume 2014

International Journal of

Hindawi Publishing Corporationhttpwwwhindawicom Volume 2014

Operations ResearchAdvances in

Journal of

Hindawi Publishing Corporationhttpwwwhindawicom Volume 2014

Function Spaces

Abstract and Applied AnalysisHindawi Publishing Corporationhttpwwwhindawicom Volume 2014

International Journal of Mathematics and Mathematical Sciences

Hindawi Publishing Corporationhttpwwwhindawicom Volume 2014

The Scientific World JournalHindawi Publishing Corporation httpwwwhindawicom Volume 2014

Hindawi Publishing Corporationhttpwwwhindawicom Volume 2014

Algebra

Discrete Dynamics in Nature and Society

Hindawi Publishing Corporationhttpwwwhindawicom Volume 2014

Hindawi Publishing Corporationhttpwwwhindawicom Volume 2014

Decision SciencesAdvances in

Discrete MathematicsJournal of

Hindawi Publishing Corporationhttpwwwhindawicom

Volume 2014 Hindawi Publishing Corporationhttpwwwhindawicom Volume 2014

Stochastic AnalysisInternational Journal of

Page 5: Research Article Link Prediction via Sparse Gaussian Graphical … · 2019. 5. 14. · Research Article Link Prediction via Sparse Gaussian Graphical Model LiangliangZhang,LongqiYang,GuyuHu,ZhisongPan,andZhenLi

Mathematical Problems in Engineering 5

Require 119860 120588 gt 0 120582 gt 0 119899 gt 0

Ensure 119860119864119879 randomly select 120588|119864| edges from 119864

119864119875

lArr 119864 minus 119864119879

Σ lArr (119864119879

)minus1

119883 lArr [1199091 1199092 119909

119899] 119899 independent observations from

119873(0 Σ)

119878 lArr cov(119883)Θ lArr Sparse Inverse Covariance Estimation (119860 120582 119878)if 120579119894119895le 0 and 119894 = 119895 then

119860119894119895lArr 1

else119860119894119895lArr 0

end if

Algorithm 1 Link prediction based on sparse inverse covarianceestimation with graphical lasso

Table 2 Basic topological features of four real-world networks

Network 119873 119872 119864 119862 Avg 119863 119903

Taro 22 39 0488 0339 3546 minus0375Tailor Shop 39 158 0567 0458 8103 minus0183Dolphin 62 159 0379 0259 5129 minus0044Football 115 613 0450 0403 10661 0162

sources (1) httpwww-personalumichedusimmejnnetdata(Football andDolphin) and (2) httpvladofmfuni-ljsipubnetworksdataucinetucidatahtm (Tailor Shop and Taro))

We selected four real-world datasets to compare theproposed method with the 13 similarity measures Table 2shows the basic topological features of the four real-worldnetworks wherein 119873 and119872 are the total numbers of nodesand links respectively 119864 is the network efficiency [20] whichis defined as 119864 = (2119873(119873 minus 1))sum

119909119910isin119881119909 =119910119889minus1

119909119910 where 119889

119909119910

denotes the shortest distance between nodes 119909 and 119910 and119889119909119910

= +infin if 119909 and 119910 are in two different components119862 and Avg 119863 denote the clustering coefficient [21] andaverage degree respectively and 119903 is the network assortativecoefficient [22] The network is assortative if a node tends toconnect with the approximate node 119903 gt 0 means that theentire network has assortative structure and a nodewith largedegree tends to connect to other nodes with large degree 119903 lt0 means that the entire network has disassortative structureand 119903 = 0 implies that there is no correlation in the networkstructure Figure 1 shows the distribution of the four datasetsIt is clear that the degree distributions of the four data sets areconsiderably different

43 Experimental Settings To evaluate the performance ofthe algorithmswith different sized training sets we randomlyselected 60 70 80 and 90 of the links from theoriginal network as training sets Only the information inthe training set was used for estimation Each training setwas executed 10 times We choose 10-fold cross validationbecause it has been widely accepted in machine learning

Table 3 Football dataset results

Method Training set ratio60 70 80 90

SGGM 119899 = 05119901 0702 0725 0757 0783SGGM 119899 = 119901 0742 0777 0818 0857SGGM 119899 = 15119901 0753 0795 0845 0889SGGM 119899 = 2119901 0756 0800 0852 0905CN 0774 0798 0824 0835Salton 0779 0804 0828 0838Jaccard 0502 0508 0514 0520Soslashrensen 0590 0597 0604 0606HPI 0778 0802 0827 0838HDI 0779 0803 0827 0838LHN 0777 0802 0827 0838AA 0774 0798 0825 0836RA 0774 0798 0824 0836PA 0515 0516 0520 0528LNBCN 0777 0800 0825 0836LNBAA 0777 0800 0825 0836LNBRA 0776 0799 0825 0837

and data mining research Thirteen similarity measures wereimplemented following Zhou et al [14] and the SGGM wasimplemented using the SLEP toolbox [23] We evaluated theperformance of the GGM algorithm with different samplescales Here 119901 denotes the number of nodes We set thesample scale to 05119901 119901 15119901 and 2119901 The parameter 120582 in theSGGM controlled the sparsity of the estimated model Larger120582 gives a sparser estimated network Here 120582 ranged from 001

to 02 with a step of 001

44 Results

441 Tuning Parameter for the SGGM Using the training setthat was scaled to 90 we tested the proposed SGGM withdifferent 120582 and sample scales As shown in Figure 2 the AUCof the SGGM increased with increasing sample scale for thefour datasets One advantage of the SGGM is that it is notsensitive to the parameter 120582 Then we set the sample scale to05119873 Figure 3 shows that the AUC increased with increase intraining set scale For the Tailor Shop dataset the proposedalgorithm performs optimally with the 70 scaled trainingset

442 Comparison on AUC Table 3 lists the results for 14methods with the Football dataset The SGGM performsoptimally when 80 or 90 of the training set is used and thenumber of samples is greater than119873The prediction accuracyof the SGGM increased by 5 compared to the other 13methodsHowever with lower proportions of the training setthe SGGMs AUC is very close to that of the other methodsNote that PA yielded the poorest result with this dataset

As shown in Table 4 the SGGM method performs opti-mally with the Dolphin dataset The prediction accuracy ofthe SGGMmethod increased by at least 3 compared to the

6 Mathematical Problems in Engineering

0 1 2 3 4 5 6 7 8 9 10 11 120

002

004

006

008

01

012

014

016

Degree

Prob

abili

ty o

f the

nod

e deg

ree

(a) Dolphin

0 1 2 3 4 5 6 7 8 9 10 11 120

01

02

03

04

05

06

07

Degree

Prob

abili

ty o

f the

nod

e deg

ree

(b) Football

0 1 2 3 4 5 60

01

02

03

04

05

06

07

08

Degree

Prob

abili

ty o

f the

nod

e deg

ree

(c) Taro

minus5 0 5 10 15 20 250

002

004

006

008

01

012

014

Degree

Prob

abili

ty o

f the

nod

e deg

ree

(d) Tailor Shop

Figure 1 Degree distribution of four real-world datasets

other thirteen methods PA is optimal among the similaritymeasures and Jaccard yielded the poorest result

As shown in Table 5 the SGGMmethod outperforms theother 13 methods with the Taro dataset The AUC improvedby at least 10 For the Taro dataset the degree of 70 nodesis 3 which indicates a low degree of heterogeneity thus theperformance of the other thirteen methods is very close Onemight expect PA to show good performance on assortativenetworks and poor performance on disassortative networksHowever the experimental results show no obvious correla-tion between PA performance and the assortative coefficient

For the Tailor Shop dataset PA performs optimally whenthe 60 and 70 training sets were used while the SGGM

method performs optimally with the 80 and 90 trainingsets (Table 6) The SGGMs prediction accuracy graduallyincreased with increasing the number of samplesThus whensampling conditions permit multiple samples could improveprediction accuracy

443 Comparison on ROC The ratio of the training set wasset to 90The sample scale of the SGGM varied within 05119901119901 15119901 and 2119901120582 = 001 CNwas chosen as the representativeneighbourhood because six other measures (Salton JaccardSoslashrensen HPI HDI and LHN) are variants of CN

The ROC of the RA index is similar to that of the AAindex As observed in Figure 4 in most cases the SGGM

Mathematical Problems in Engineering 7

0005

010015

020

0506

0708

09

06507

07508

08509

Ratio of training

AUC

120582

(a) Dolphin

0005

010015

020

0506

0708

09

05506

06507

07508

Ratio of training

AUC

120582

(b) Football

0005

010015

020

06

07

08

064066068

07072074076

09

Ratio of training

AUC

120582

(c) Taro

0005

010015

020

06

07

08

09

06062064066068

07072

Ratio of training

AUC

120582

(d) Tailor Shop

Figure 2 AUC of the SGGM (four datasets training set ratio = 90)

05pp

152p

25p3p

0005

010015

020

07808

082084086088

09092

Samples

AUC

120582

(a) Dolphin

05pp

152p

25p3p

0005

010015

020

0506070809

1

Samples

AUC

120582

(b) Football

05pp

152p

25p3p

0005

010015

020

07075

08085

09095

Samples

AUC

120582

(c) Taro

05pp

152p

25p

3p

0005

010015

020

06507

07508

08509

095

Samples

AUC

120582

(d) Tailor Shop

Figure 3 AUC of the SGGM (four datasets sample scale = 05119873)

8 Mathematical Problems in Engineering

0 02 04 06 08 10

01

02

03

04

05

06

07

08

09

1

False positive rate

True

pos

itive

rate

CNAAPA

GGM n = 2p

GGM n = 15p

GGM n = p

GGM n = 05p

(a) Dolphin

CNAAPA

GGM n = 2p

GGM n = 15p

GGM n = p

GGM n = 05p

0 02 04 06 08 10

01

02

03

04

05

06

07

08

09

1

False positive rate

True

pos

itive

rate

(b) Football

0 02 04 06 08 10

01

02

03

04

05

06

07

08

09

1

False positive rate

True

pos

itive

rate

CNAAPA

GGM n = 2p

GGM n = 15p

GGM n = p

GGM n = 05p

(c) Taro

0 02 04 06 08 10

01

02

03

04

05

06

07

08

09

1

False positive rate

True

pos

itive

rate

CNAAPA

GGM n = 2p

GGM n = 15p

GGM n = p

GGM n = 05p

(d) Tailor Shop

Figure 4 Comparison of ROC metric of four methods (SGGM CN AA and PA) on four datasets

shows an improvement over existing similarity measureswhen sufficient samples were used Note that more sampleslead to higher accuracy

444 Comparison on Error Rate The error rate is a metricto evaluate the difference between two matrices A lowererror rate indicates that the estimated matrix approximates

the original matrix From Table 7 we observe that the SGGMhas the lowest error rate among all methods shown Toshow the effectiveness of the proposed SGGM both theoriginal adjacency matrix and estimated adjacency matrixare presented as a coloured graph in Figure 5 We usedthe Taro dataset with a 90 training set to recover theoriginal network Graphs with fewer red points indicate

Mathematical Problems in Engineering 9

Figure 5 Recovery performances of 14methods on the Taro datasetThe upper left corner is the original matrix blue points denote linksbetween two corresponding nodes The red points denote links inthe original matrix that were not accurately estimated The greenpoints denote links in the estimated matrix that do not exist in theoriginal matrix From left to right top to bottom the figures are therecovered matrix by SGGM (119899 = 05119901 119901 15119901 and 2119901) CN SaltonJaccard Soslashrensen HPI HDI LHN AA RA PA LNBCN LNBAAand LNBRA respectively

Table 4 Dolphin dataset results

Method Training set ratio60 70 80 90

SGGM 119899 = 05119901 0699 0731 0761 0801SGGM 119899 = 119901 0742 0775 0823 0867SGGM 119899 = 15119901 0758 0803 0845 0895SGGM 119899 = 2119901 0765 0807 0857 0909CN 0631 0687 0728 0762Salton 0621 0673 0709 0740Jaccard 0485 0490 0495 0500Soslashrensen 0538 0558 0571 0583HPI 0620 0669 0702 0728HDI 0626 0680 0720 0753LHN 0619 0666 0698 0721AA 0633 0690 0732 0767RA 0632 0690 0731 0766PA 0712 0727 0737 0746LNBCN 0636 0696 0739 0775LNBAA 0635 0694 0738 0773LNBRA 0635 0693 0736 0771

good recovery of the original matrix As can be seen fromFigure 5 using the SGGM method can restore the originalimage whereas the other similarity measures return greatererror rates Note that the SGGM error rate decreased withincreasing sample scale

45 Large Datasets The main challenge of link prediction isdealing with large network We carefully choose four largedatasets from four individual domain For SGGM method120582 is set to 001 90 of data was used as training set

Table 5 Taro dataset results

Method Training set ratio60 70 80 90

SGGM 119899 = 05119901 0672 0711 0736 0750SGGM 119899 = 119901 0713 0747 0779 0815SGGM 119899 = 15119901 0721 0763 0816 0857SGGM 119899 = 2119901 0723 0788 0825 0865CN 0529 0545 0573 0582Salton 0530 0542 0571 0580Jaccard 0468 0476 0476 0478Soslashrensen 0523 0533 0544 0552HPI 0533 0545 0574 0585HDI 0528 0541 0568 0576LHN 0532 0543 0570 0579AA 0533 0554 0589 0609RA 0533 0554 0589 0610PA 0559 0568 0582 0607LNBCN 0536 0562 0616 0638LNBAA 0537 0561 0614 0634LNBRA 0536 0560 0613 0634

Table 6 Tailor Shop dataset results

Method Training set ratio60 70 80 90

SGGM 119899 = 05119901 0652 0679 0677 0706SGGM 119899 = 119901 0703 0723 0750 0769SGGM 119899 = 15119901 0715 0755 0786 0817SGGM 119899 = 2119901 0734 0770 0803 0844CN 0673 0699 0720 0746Salton 0616 0628 0646 0667Jaccard 0494 0498 0504 0514Soslashrensen 0583 0589 0592 0597HPI 0615 0618 0631 0642HDI 0628 0643 0660 0680LHN 0588 0574 0571 0566AA 0682 0709 0729 0753RA 0681 0707 0729 0752PA 0761 0775 0779 0786LNBCN 0692 0715 0746 0765LNBAA 0693 0716 0746 0764LNBRA 0693 0715 0745 0762

In order to prove the performance of SGGM we comparethe proposed method with the other thirteen methods onthese large datasets (for details see Table 8) in terms of AUCand error rate (data sources (1) httpkonectuni-koblenzdenetworks (Email) (2) httpwww3ndedusimnetworksresourceshtm (Protein) (3) httpwww-personalumichedusimmejnnetdatapowerzip (Grid) (4) httpkonectuni-koblenzdenetworks (Astro-ph))

To estimate the large network efficiently we used QUIC[19] to solve sparse inverse covariance estimation (formula(1)) instead of glasso [7] Overall the proposed method

10 Mathematical Problems in Engineering

Table 7 Error rate of 13 methods on 4 datasets

Method DatasetsFootball Dolphin Taro Tailor Shop

SGGM 119899 = 05119901 1467 1251 1098 1121SGGM 119899 = 119901 1364 1109 1062 1188SGGM 119899 = 15119901 1362 1198 1098 1025SGGM 119899 = 2119901 1200 1121 1013 0987CN 1641 1291 1320 1485Salton 1641 1291 1320 1485Jaccard 1641 1291 1320 1485Soslashrensen 1863 1672 1695 1826HPI 1641 1291 1320 1485HDI 1641 1291 1320 1485LHN 1641 1291 1320 1485AA 1641 1291 1320 1485RA 1641 1291 1320 1485PA 2512 1251 1340 1446LNBCN 1641 1291 1261 1382LNBAA 1641 1291 1261 1382LNBRA 1641 1291 1261 1382

Table 8 Basic topological features of four large networks

Network 119873 119872 119864 119862 Avg 119863 119903

Email 1133 5451 0299 0220 9622 0078Protein 1870 2240 0099 0099 2396 minus0156Grid 4941 6594 0063 0107 2669 0003Astro-ph 18771 198050 0091 0486 4140 0294

Table 9 Comparison of 14 methods on 4 large datasets on AUC

Method Email Protein Grid Astro-phSGGM 0916 0943 0944 0929CN 0817 0566 0570 0864Salton 0812 0566 0569 0863Jaccard 0475 0479 0487 0479Soslashrensen 0547 0493 0529 0584HPI 0809 0566 0569 0863HDI 0813 0566 0569 0863LHN-I 0805 0565 0570 0864AA 0819 0565 0570 0876RA 0818 0566 0568 0838PA 0824 0820 0713 0785

can efficiently recover the original network and predict themissing links accurately As observed from Table 9 AUC ofSGGM is highest on all 4 datasets For Email Protein Gridand Astro-ph datasets SGGM improves 92 123 231and 53 on AUC respectively For error rate metric asshown in Table 10 the SGGM method performs optimallywith all 4 large datasets It indicates that SGGMmethod couldaccurately recover the original network

Table 10 Comparison of 14 methods on 4 large datasets on errorrate

Method Email Protein Grid Astro-phSGGM 0467 0411 0367 0359CN 2348 2189 1670 1568Salton 2978 2189 1767 1549Jaccard 2982 2189 1450 1548Soslashrensen 3047 2555 2071 2771HPI 2578 2189 1670 1610HDI 2978 2189 1237 1854LHN-I 2428 2189 1692 1612AA 2922 2204 1872 1953RA 2348 2189 1670 1410PA 9219 8032 15081 12451

46 Analysis In the comparative analysis of the performanceof the 14 methods using the eight real-world datasets theSGGM method was outstanding in terms of AUC and errorrate This method can accurately recover the original adja-cency matrix and in most of cases requires fewer samplesthat is far fewer than the actual number of nodes TheSGGM methodrsquos prediction precision gradually increasedwith the increase in number of samples moreover therecovery error rate decreased with the increase in matrixsparsityTherefore the SGGMmethod can be applied to smallsamples however it performs better with more samplesThese results demonstrate that the SGGM method can beimplemented in a scalable fashion

5. Conclusions

Link prediction is a basic problem in complex network analysis. In real-world networks, obtaining node and edge properties is cumbersome. Link prediction methods based on network structure similarity do not depend on node properties; however, they are constrained by network structure and are difficult to adapt to different network topologies.

Our work does not depend on node properties and extends link prediction to settings where existing methods cannot deal with diverse network topologies. We sampled the original network adjacency matrix and used the SGGM method to characterise the network structure.

Most nodes are independent in real networks; thus we used the SGGM method to estimate a precision matrix of the adjacency matrix and predict links from it. Our experimental results show that the SGGM method recovers network structure better than 13 mainstream similarity methods. We tested the proposed method on eight real-world datasets, using AUC and error rate as evaluation indexes, and found that it achieves higher prediction precision. We also analysed the influence of different parameters on the SGGM method and found no significant effect: the SGGM method returns a high AUC value within a certain range of the regularisation parameter λ. Furthermore, the proposed method retains high prediction precision with fewer training samples, so it can be applied to large networks.

Conflict of Interests

The authors declare that there is no conflict of interests regarding the publication of this paper.

Acknowledgments

The authors thank Longqi Yang, Zhisong Pan, and Guyu Hu for their contributions to the work on link prediction, and Zhen Li and Junyang Qiu for their valuable discussions, comments, and suggestions. This work was supported by the National Technology Research and Development Program of China (863 Program) (Grant no. 2012AA01A510), the National Natural Science Foundation of China (Grant no. 61473149), and the Natural Science Foundation of Jiangsu Province, China (Grant no. BK20140073).

References

[1] L. Lü and T. Zhou, "Link prediction in complex networks: a survey," Physica A: Statistical Mechanics and Its Applications, vol. 390, no. 6, pp. 1150–1170, 2011.

[2] B. Ermiş, E. Acar, and A. T. Cemgil, "Link prediction in heterogeneous data via generalized coupled tensor factorization," Data Mining and Knowledge Discovery, vol. 29, no. 1, pp. 203–236, 2013.

[3] Y. Yang, N. Chawla, Y. Sun, and J. Han, "Predicting links in multi-relational and heterogeneous networks," in Proceedings of the 12th IEEE International Conference on Data Mining (ICDM '12), pp. 755–764, Brussels, Belgium, December 2012.

[4] D. Liben-Nowell and J. Kleinberg, "The link-prediction problem for social networks," Journal of the American Society for Information Science and Technology, vol. 58, no. 7, pp. 1019–1031, 2007.

[5] Y. Yang, R. N. Lichtenwalter, and N. V. Chawla, "Evaluating link prediction methods," Knowledge and Information Systems, vol. 45, no. 3, pp. 751–782, 2015.

[6] N. Z. Gong, A. Talwalkar, L. Mackey et al., "Joint link prediction and attribute inference using a social-attribute network," ACM Transactions on Intelligent Systems and Technology, vol. 5, no. 2, article 27, 2014.

[7] J. Friedman, T. Hastie, and R. Tibshirani, "Sparse inverse covariance estimation with the graphical lasso," Biostatistics, vol. 9, no. 3, pp. 432–441, 2008.

[8] M. E. Newman, "Clustering and preferential attachment in growing networks," Physical Review E, vol. 64, no. 2, Article ID 025102, 2001.

[9] G. Salton and M. J. McGill, Introduction to Modern Information Retrieval, McGraw-Hill, New York, NY, USA, 1986.

[10] P. Jaccard, "Étude comparative de la distribution florale dans une portion des Alpes et du Jura," Bulletin de la Société Vaudoise des Sciences Naturelles, vol. 37, no. 142, pp. 547–579, 1901.

[11] T. Sørensen, "A method of establishing groups of equal amplitude in plant sociology based on similarity of species and its application to analyses of the vegetation on Danish commons," Biologiske Skrifter, vol. 5, pp. 1–34, 1948.

[12] E. A. Leicht, P. Holme, and M. E. J. Newman, "Vertex similarity in networks," Physical Review E, vol. 73, no. 2, Article ID 026120, 2006.

[13] Y.-B. Xie, T. Zhou, and B.-H. Wang, "Scale-free networks without growth," Physica A: Statistical Mechanics and Its Applications, vol. 387, no. 7, pp. 1683–1688, 2008.

[14] T. Zhou, L. Lü, and Y.-C. Zhang, "Predicting missing links via local information," The European Physical Journal B, vol. 71, no. 4, pp. 623–630, 2009.

[15] Z. Liu, Q.-M. Zhang, L. Lü, and T. Zhou, "Link prediction in complex networks: a local naïve Bayes model," EPL (Europhysics Letters), vol. 96, no. 4, Article ID 48007, 2011.

[16] J. Dahl, L. Vandenberghe, and V. Roychowdhury, "Covariance selection for nonchordal graphs via chordal embedding," Optimization Methods & Software, vol. 23, no. 4, pp. 501–520, 2008.

[17] O. Banerjee, L. El Ghaoui, and A. d'Aspremont, "Model selection through sparse maximum likelihood estimation for multivariate Gaussian or binary data," Journal of Machine Learning Research, vol. 9, no. 3, pp. 485–516, 2008.

[18] C.-J. Hsieh, M. A. Sustik, I. S. Dhillon, P. K. Ravikumar, and R. Poldrack, "BIG & QUIC: sparse inverse covariance estimation for a million variables," in Advances in Neural Information Processing Systems, pp. 3165–3173, MIT Press, 2013.

[19] C.-J. Hsieh, M. A. Sustik, I. S. Dhillon, and P. Ravikumar, "QUIC: quadratic approximation for sparse inverse covariance estimation," Journal of Machine Learning Research, vol. 15, pp. 2911–2947, 2014.

[20] V. Latora and M. Marchiori, "Efficient behavior of small-world networks," Physical Review Letters, vol. 87, Article ID 198701, 2001.

[21] D. J. Watts and S. H. Strogatz, "Collective dynamics of 'small-world' networks," Nature, vol. 393, no. 6684, pp. 440–442, 1998.

[22] M. E. J. Newman, "Assortative mixing in networks," Physical Review Letters, vol. 89, no. 20, Article ID 208701, 2002.

[23] J. Liu, S. Ji, and J. Ye, SLEP: Sparse Learning with Efficient Projections, vol. 6, Arizona State University, Tempe, Ariz, USA, 2009.

Submit your manuscripts athttpwwwhindawicom

Hindawi Publishing Corporationhttpwwwhindawicom Volume 2014

MathematicsJournal of

Hindawi Publishing Corporationhttpwwwhindawicom Volume 2014

Mathematical Problems in Engineering

Hindawi Publishing Corporationhttpwwwhindawicom

Differential EquationsInternational Journal of

Volume 2014

Applied MathematicsJournal of

Hindawi Publishing Corporationhttpwwwhindawicom Volume 2014

Probability and StatisticsHindawi Publishing Corporationhttpwwwhindawicom Volume 2014

Journal of

Hindawi Publishing Corporationhttpwwwhindawicom Volume 2014

Mathematical PhysicsAdvances in

Complex AnalysisJournal of

Hindawi Publishing Corporationhttpwwwhindawicom Volume 2014

OptimizationJournal of

Hindawi Publishing Corporationhttpwwwhindawicom Volume 2014

CombinatoricsHindawi Publishing Corporationhttpwwwhindawicom Volume 2014

International Journal of

Hindawi Publishing Corporationhttpwwwhindawicom Volume 2014

Operations ResearchAdvances in

Journal of

Hindawi Publishing Corporationhttpwwwhindawicom Volume 2014

Function Spaces

Abstract and Applied AnalysisHindawi Publishing Corporationhttpwwwhindawicom Volume 2014

International Journal of Mathematics and Mathematical Sciences

Hindawi Publishing Corporationhttpwwwhindawicom Volume 2014

The Scientific World JournalHindawi Publishing Corporation httpwwwhindawicom Volume 2014

Hindawi Publishing Corporationhttpwwwhindawicom Volume 2014

Algebra

Discrete Dynamics in Nature and Society

Hindawi Publishing Corporationhttpwwwhindawicom Volume 2014

Hindawi Publishing Corporationhttpwwwhindawicom Volume 2014

Decision SciencesAdvances in

Discrete MathematicsJournal of

Hindawi Publishing Corporationhttpwwwhindawicom

Volume 2014 Hindawi Publishing Corporationhttpwwwhindawicom Volume 2014

Stochastic AnalysisInternational Journal of

Page 6: Research Article Link Prediction via Sparse Gaussian Graphical … · 2019. 5. 14. · Research Article Link Prediction via Sparse Gaussian Graphical Model LiangliangZhang,LongqiYang,GuyuHu,ZhisongPan,andZhenLi

6 Mathematical Problems in Engineering

0 1 2 3 4 5 6 7 8 9 10 11 120

002

004

006

008

01

012

014

016

Degree

Prob

abili

ty o

f the

nod

e deg

ree

(a) Dolphin

0 1 2 3 4 5 6 7 8 9 10 11 120

01

02

03

04

05

06

07

Degree

Prob

abili

ty o

f the

nod

e deg

ree

(b) Football

0 1 2 3 4 5 60

01

02

03

04

05

06

07

08

Degree

Prob

abili

ty o

f the

nod

e deg

ree

(c) Taro

minus5 0 5 10 15 20 250

002

004

006

008

01

012

014

Degree

Prob

abili

ty o

f the

nod

e deg

ree

(d) Tailor Shop

Figure 1 Degree distribution of four real-world datasets

other thirteen methods PA is optimal among the similaritymeasures and Jaccard yielded the poorest result

As shown in Table 5 the SGGMmethod outperforms theother 13 methods with the Taro dataset The AUC improvedby at least 10 For the Taro dataset the degree of 70 nodesis 3 which indicates a low degree of heterogeneity thus theperformance of the other thirteen methods is very close Onemight expect PA to show good performance on assortativenetworks and poor performance on disassortative networksHowever the experimental results show no obvious correla-tion between PA performance and the assortative coefficient

For the Tailor Shop dataset PA performs optimally whenthe 60 and 70 training sets were used while the SGGM

method performs optimally with the 80 and 90 trainingsets (Table 6) The SGGMs prediction accuracy graduallyincreased with increasing the number of samplesThus whensampling conditions permit multiple samples could improveprediction accuracy

443 Comparison on ROC The ratio of the training set wasset to 90The sample scale of the SGGM varied within 05119901119901 15119901 and 2119901120582 = 001 CNwas chosen as the representativeneighbourhood because six other measures (Salton JaccardSoslashrensen HPI HDI and LHN) are variants of CN

The ROC of the RA index is similar to that of the AAindex As observed in Figure 4 in most cases the SGGM

Mathematical Problems in Engineering 7

0005

010015

020

0506

0708

09

06507

07508

08509

Ratio of training

AUC

120582

(a) Dolphin

0005

010015

020

0506

0708

09

05506

06507

07508

Ratio of training

AUC

120582

(b) Football

0005

010015

020

06

07

08

064066068

07072074076

09

Ratio of training

AUC

120582

(c) Taro

0005

010015

020

06

07

08

09

06062064066068

07072

Ratio of training

AUC

120582

(d) Tailor Shop

Figure 2 AUC of the SGGM (four datasets training set ratio = 90)

05pp

152p

25p3p

0005

010015

020

07808

082084086088

09092

Samples

AUC

120582

(a) Dolphin

05pp

152p

25p3p

0005

010015

020

0506070809

1

Samples

AUC

120582

(b) Football

05pp

152p

25p3p

0005

010015

020

07075

08085

09095

Samples

AUC

120582

(c) Taro

05pp

152p

25p

3p

0005

010015

020

06507

07508

08509

095

Samples

AUC

120582

(d) Tailor Shop

Figure 3 AUC of the SGGM (four datasets sample scale = 05119873)

8 Mathematical Problems in Engineering

0 02 04 06 08 10

01

02

03

04

05

06

07

08

09

1

False positive rate

True

pos

itive

rate

CNAAPA

GGM n = 2p

GGM n = 15p

GGM n = p

GGM n = 05p

(a) Dolphin

CNAAPA

GGM n = 2p

GGM n = 15p

GGM n = p

GGM n = 05p

0 02 04 06 08 10

01

02

03

04

05

06

07

08

09

1

False positive rate

True

pos

itive

rate

(b) Football

0 02 04 06 08 10

01

02

03

04

05

06

07

08

09

1

False positive rate

True

pos

itive

rate

CNAAPA

GGM n = 2p

GGM n = 15p

GGM n = p

GGM n = 05p

(c) Taro

0 02 04 06 08 10

01

02

03

04

05

06

07

08

09

1

False positive rate

True

pos

itive

rate

CNAAPA

GGM n = 2p

GGM n = 15p

GGM n = p

GGM n = 05p

(d) Tailor Shop

Figure 4 Comparison of ROC metric of four methods (SGGM CN AA and PA) on four datasets

shows an improvement over existing similarity measureswhen sufficient samples were used Note that more sampleslead to higher accuracy

444 Comparison on Error Rate The error rate is a metricto evaluate the difference between two matrices A lowererror rate indicates that the estimated matrix approximates

the original matrix From Table 7 we observe that the SGGMhas the lowest error rate among all methods shown Toshow the effectiveness of the proposed SGGM both theoriginal adjacency matrix and estimated adjacency matrixare presented as a coloured graph in Figure 5 We usedthe Taro dataset with a 90 training set to recover theoriginal network Graphs with fewer red points indicate

Mathematical Problems in Engineering 9

Figure 5 Recovery performances of 14methods on the Taro datasetThe upper left corner is the original matrix blue points denote linksbetween two corresponding nodes The red points denote links inthe original matrix that were not accurately estimated The greenpoints denote links in the estimated matrix that do not exist in theoriginal matrix From left to right top to bottom the figures are therecovered matrix by SGGM (119899 = 05119901 119901 15119901 and 2119901) CN SaltonJaccard Soslashrensen HPI HDI LHN AA RA PA LNBCN LNBAAand LNBRA respectively

Table 4 Dolphin dataset results

Method Training set ratio60 70 80 90

SGGM 119899 = 05119901 0699 0731 0761 0801SGGM 119899 = 119901 0742 0775 0823 0867SGGM 119899 = 15119901 0758 0803 0845 0895SGGM 119899 = 2119901 0765 0807 0857 0909CN 0631 0687 0728 0762Salton 0621 0673 0709 0740Jaccard 0485 0490 0495 0500Soslashrensen 0538 0558 0571 0583HPI 0620 0669 0702 0728HDI 0626 0680 0720 0753LHN 0619 0666 0698 0721AA 0633 0690 0732 0767RA 0632 0690 0731 0766PA 0712 0727 0737 0746LNBCN 0636 0696 0739 0775LNBAA 0635 0694 0738 0773LNBRA 0635 0693 0736 0771

good recovery of the original matrix As can be seen fromFigure 5 using the SGGM method can restore the originalimage whereas the other similarity measures return greatererror rates Note that the SGGM error rate decreased withincreasing sample scale

45 Large Datasets The main challenge of link prediction isdealing with large network We carefully choose four largedatasets from four individual domain For SGGM method120582 is set to 001 90 of data was used as training set

Table 5 Taro dataset results

Method Training set ratio60 70 80 90

SGGM 119899 = 05119901 0672 0711 0736 0750SGGM 119899 = 119901 0713 0747 0779 0815SGGM 119899 = 15119901 0721 0763 0816 0857SGGM 119899 = 2119901 0723 0788 0825 0865CN 0529 0545 0573 0582Salton 0530 0542 0571 0580Jaccard 0468 0476 0476 0478Soslashrensen 0523 0533 0544 0552HPI 0533 0545 0574 0585HDI 0528 0541 0568 0576LHN 0532 0543 0570 0579AA 0533 0554 0589 0609RA 0533 0554 0589 0610PA 0559 0568 0582 0607LNBCN 0536 0562 0616 0638LNBAA 0537 0561 0614 0634LNBRA 0536 0560 0613 0634

Table 6 Tailor Shop dataset results

Method Training set ratio60 70 80 90

SGGM 119899 = 05119901 0652 0679 0677 0706SGGM 119899 = 119901 0703 0723 0750 0769SGGM 119899 = 15119901 0715 0755 0786 0817SGGM 119899 = 2119901 0734 0770 0803 0844CN 0673 0699 0720 0746Salton 0616 0628 0646 0667Jaccard 0494 0498 0504 0514Soslashrensen 0583 0589 0592 0597HPI 0615 0618 0631 0642HDI 0628 0643 0660 0680LHN 0588 0574 0571 0566AA 0682 0709 0729 0753RA 0681 0707 0729 0752PA 0761 0775 0779 0786LNBCN 0692 0715 0746 0765LNBAA 0693 0716 0746 0764LNBRA 0693 0715 0745 0762

In order to prove the performance of SGGM we comparethe proposed method with the other thirteen methods onthese large datasets (for details see Table 8) in terms of AUCand error rate (data sources (1) httpkonectuni-koblenzdenetworks (Email) (2) httpwww3ndedusimnetworksresourceshtm (Protein) (3) httpwww-personalumichedusimmejnnetdatapowerzip (Grid) (4) httpkonectuni-koblenzdenetworks (Astro-ph))

To estimate the large network efficiently we used QUIC[19] to solve sparse inverse covariance estimation (formula(1)) instead of glasso [7] Overall the proposed method

10 Mathematical Problems in Engineering

Table 7 Error rate of 13 methods on 4 datasets

Method DatasetsFootball Dolphin Taro Tailor Shop

SGGM 119899 = 05119901 1467 1251 1098 1121SGGM 119899 = 119901 1364 1109 1062 1188SGGM 119899 = 15119901 1362 1198 1098 1025SGGM 119899 = 2119901 1200 1121 1013 0987CN 1641 1291 1320 1485Salton 1641 1291 1320 1485Jaccard 1641 1291 1320 1485Soslashrensen 1863 1672 1695 1826HPI 1641 1291 1320 1485HDI 1641 1291 1320 1485LHN 1641 1291 1320 1485AA 1641 1291 1320 1485RA 1641 1291 1320 1485PA 2512 1251 1340 1446LNBCN 1641 1291 1261 1382LNBAA 1641 1291 1261 1382LNBRA 1641 1291 1261 1382

Table 8 Basic topological features of four large networks

Network 119873 119872 119864 119862 Avg 119863 119903

Email 1133 5451 0299 0220 9622 0078Protein 1870 2240 0099 0099 2396 minus0156Grid 4941 6594 0063 0107 2669 0003Astro-ph 18771 198050 0091 0486 4140 0294

Table 9 Comparison of 14 methods on 4 large datasets on AUC

Method Email Protein Grid Astro-phSGGM 0916 0943 0944 0929CN 0817 0566 0570 0864Salton 0812 0566 0569 0863Jaccard 0475 0479 0487 0479Soslashrensen 0547 0493 0529 0584HPI 0809 0566 0569 0863HDI 0813 0566 0569 0863LHN-I 0805 0565 0570 0864AA 0819 0565 0570 0876RA 0818 0566 0568 0838PA 0824 0820 0713 0785

can efficiently recover the original network and predict themissing links accurately As observed from Table 9 AUC ofSGGM is highest on all 4 datasets For Email Protein Gridand Astro-ph datasets SGGM improves 92 123 231and 53 on AUC respectively For error rate metric asshown in Table 10 the SGGM method performs optimallywith all 4 large datasets It indicates that SGGMmethod couldaccurately recover the original network

Table 10 Comparison of 14 methods on 4 large datasets on errorrate

Method Email Protein Grid Astro-phSGGM 0467 0411 0367 0359CN 2348 2189 1670 1568Salton 2978 2189 1767 1549Jaccard 2982 2189 1450 1548Soslashrensen 3047 2555 2071 2771HPI 2578 2189 1670 1610HDI 2978 2189 1237 1854LHN-I 2428 2189 1692 1612AA 2922 2204 1872 1953RA 2348 2189 1670 1410PA 9219 8032 15081 12451

46 Analysis In the comparative analysis of the performanceof the 14 methods using the eight real-world datasets theSGGM method was outstanding in terms of AUC and errorrate This method can accurately recover the original adja-cency matrix and in most of cases requires fewer samplesthat is far fewer than the actual number of nodes TheSGGM methodrsquos prediction precision gradually increasedwith the increase in number of samples moreover therecovery error rate decreased with the increase in matrixsparsityTherefore the SGGMmethod can be applied to smallsamples however it performs better with more samplesThese results demonstrate that the SGGM method can beimplemented in a scalable fashion

5 Conclusions

Link prediction is a basic problem in complex networkanalysis In real-world networks obtaining node and edgeproperties is cumbersome Link prediction methods basedon network structure similarity do not depend on nodeproperties however these methods are limited by networkstructure and are difficult to adapt in different networkstructures

Our work is not dependent on node properties andexpands link prediction methods that cannot deal withdiverse network topologiesWe sampled the original networkadjacency matrix and used the SGGM method to depict thenetwork structure

Most nodes are independent in the actual network thuswe used the SGGM method to estimate a precision matrixof the adjacency matrix to predict links Our experimentalresults show that using the SGGMmethod to recover networkstructure is better than 13mainstream similaritymethodsWetested the proposed method with eight real-world datasetsand used the AUC and error rate as evaluation indexes Wefound that this method obtains higher prediction precisionWe also analysed the influence of different parameters onthe SGGM method and found no significant influenceThe SGGM method returns a high AUC value within acertain range Furthermore the proposed method retains

Mathematical Problems in Engineering 11

high prediction precision with fewer training samples it canbe applied in large network

Conflict of Interests

The authors declare that there is no conflict of interestsregarding the publication of this paper

Acknowledgments

The authors thank Longqi Yang Zhisong Pan and GuyuHu for their contributions to work on link prediction andZhen Li and Junyang Qiu for their valuable discussionscomments and suggestions This work was supported by theNational Technology Research and Development Programof China (863 Program) (Grant no 2012AA01A510) theNational Natural Science Foundation of China (Grant no61473149) and the Natural Science Foundation of JiangsuProvince China (Grant no BK20140073)

References

[1] L L Lu and T Zhou ldquoLink prediction in complex networks asurveyrdquo Physica A Statistical Mechanics amp Its Applications vol390 no 6 pp 1150ndash1170 2011

[2] B Ermis E Acar and A T Cemgil ldquoLink prediction in het-erogeneous data via generalized coupled tensor factorizationrdquoDataMining amp Knowledge Discovery vol 29 no 1 pp 203ndash2362013

[3] Y Yang N Chawla Y Sun and J Hani ldquoPredicting links inmulti-relational and heterogeneous networksrdquo in Proceedings ofthe 12th IEEE International Conference on Data Mining (ICDMrsquo12) pp 755ndash764 Brussels Belgium December 2012

[4] D Liben-Nowell and J Kleinberg ldquoThe link-prediction prob-lem for social networksrdquo Journal of the American Society forInformation Science and Technology vol 58 no 7 pp 1019ndash10312007

[5] Y Yang R N Lichtenwalter and N V Chawla ldquoEvaluating linkprediction methodsrdquo Knowledge and Information Systems vol45 no 3 pp 751ndash782 2015

[6] N Z Gong A Talwalkar LMacKey et al ldquoJoint link predictionand attribute inference using a social-attribute networkrdquo ACMTransactions on Intelligent Systems and Technology vol 5 no 2article 27 2014

[7] J Friedman T Hastie and R Tibshirani ldquoSparse inversecovariance estimationwith the graphical lassordquoBiostatistics vol9 no 3 pp 432ndash441 2008

[8] M E Newman ldquoClustering and preferential attachment ingrowing networksrdquo Physical Review E vol 64 no 2 pp 25ndash1022001

[9] G Salton and M J McGill Introduction to Modern InformationRetrieval McGraw-Hill New York NY USA 1986

[10] P Jaccard ldquoEtude comparative de la distribution florale dansune portion des Alpes et du Jurardquo Bulletin de la Societe Vaudoisedes Sciences Naturelles vol 37 no 142 pp 547ndash579 1901

[11] T Soslashrensen ldquoA method of establishing groups of equal ampli-tude in plant sociology based on similarity of species and itsapplication to analyses of the vegetation on danish commonsrdquoBiologiske Skrifter vol 5 pp 1ndash34 1948

[12] E A Leicht P Holme and M E J Newman ldquoVertex similarityin networksrdquo Physical Review E vol 73 no 2 Article ID 0261202006

[13] Y-B Xie T Zhou and B-H Wang ldquoScale-free networkswithout growthrdquo Physica A Statistical Mechanics and Its Appli-cations vol 387 no 7 pp 1683ndash1688 2008

[14] T Zhou L Lu and Y-C Zhang ldquoPredicting missing links vialocal informationrdquoThe European Physical Journal B vol 71 no4 pp 623ndash630 2009

[15] Z Liu Q M Zhang L Lu and T Zhou ldquoLink prediction incomplex networks a local naıve Bayesmodelrdquo EPL (EurophysicsLetters) vol 96 no 4 Article ID 48007 2011

[16] J Dahl L Vandenberghe and V Roychowdhury ldquoCovarianceselection for nonchordal graphs via chordal embeddingrdquo Opti-mization Methods amp Software vol 23 no 4 pp 501ndash520 2008

[17] O Banerjee L E Ghaoui andADaspremont ldquoModel selectionthrough sparse maximum likelihood estimation for multivari-ate Gaussian or binary datardquoMachine Learning Research vol 9no 3 pp 485ndash516 2008

[18] C J Hsieh M A Sustik I S Dhillon P K Ravikumar and RPoldrack ldquoBigamp quic sparse inverse covariance estimation for amillion variablesrdquo in Advances in Neural Information ProcessingSystems pp 3165ndash3173 MIT Press 2013

[19] C-J Hsieh M A Sustik I S Dhillon and P RavikumarldquoQUIC quadratic approximation for sparse inverse covarianceestimationrdquo Journal of Machine Learning Research vol 15 pp2911ndash2947 2014

[20] V Latora and M Marchiori ldquoEfficient behavior of small-worldnetworksrdquo Physical Review Letters vol 87 Article ID 1987012001

[21] D J Watts and S H Strogatz ldquoCollective dynamics of lsquosmall-worldrsquo networksrdquoNature vol 393 no 6684 pp 440ndash442 1998

[22] M E J Newman ldquoAssortative mixing in networksrdquo PhysicalReview Letters vol 89 no 20 Article ID 208701 2002

[23] J Liu S Ji and J Ye SLEP Sparse Learning with EfficientProjections vol 6 Arizona State University Tempe Ariz USA2009

Submit your manuscripts athttpwwwhindawicom

Hindawi Publishing Corporationhttpwwwhindawicom Volume 2014

MathematicsJournal of

Hindawi Publishing Corporationhttpwwwhindawicom Volume 2014

Mathematical Problems in Engineering

Hindawi Publishing Corporationhttpwwwhindawicom

Differential EquationsInternational Journal of

Volume 2014

Applied MathematicsJournal of

Hindawi Publishing Corporationhttpwwwhindawicom Volume 2014

Probability and StatisticsHindawi Publishing Corporationhttpwwwhindawicom Volume 2014

Journal of

Hindawi Publishing Corporationhttpwwwhindawicom Volume 2014

Mathematical PhysicsAdvances in

Complex AnalysisJournal of

Hindawi Publishing Corporationhttpwwwhindawicom Volume 2014

OptimizationJournal of

Hindawi Publishing Corporationhttpwwwhindawicom Volume 2014

CombinatoricsHindawi Publishing Corporationhttpwwwhindawicom Volume 2014

International Journal of

Hindawi Publishing Corporationhttpwwwhindawicom Volume 2014

Operations ResearchAdvances in

Journal of

Hindawi Publishing Corporationhttpwwwhindawicom Volume 2014

Function Spaces

Abstract and Applied AnalysisHindawi Publishing Corporationhttpwwwhindawicom Volume 2014

International Journal of Mathematics and Mathematical Sciences

Hindawi Publishing Corporationhttpwwwhindawicom Volume 2014

The Scientific World JournalHindawi Publishing Corporation httpwwwhindawicom Volume 2014

Hindawi Publishing Corporationhttpwwwhindawicom Volume 2014

Algebra

Discrete Dynamics in Nature and Society

Hindawi Publishing Corporationhttpwwwhindawicom Volume 2014

Hindawi Publishing Corporationhttpwwwhindawicom Volume 2014

Decision SciencesAdvances in

Discrete MathematicsJournal of

Hindawi Publishing Corporationhttpwwwhindawicom

Volume 2014 Hindawi Publishing Corporationhttpwwwhindawicom Volume 2014

Stochastic AnalysisInternational Journal of

Page 7: Research Article Link Prediction via Sparse Gaussian Graphical … · 2019. 5. 14. · Research Article Link Prediction via Sparse Gaussian Graphical Model LiangliangZhang,LongqiYang,GuyuHu,ZhisongPan,andZhenLi

Mathematical Problems in Engineering 7

0005

010015

020

0506

0708

09

06507

07508

08509

Ratio of training

AUC

120582

(a) Dolphin

0005

010015

020

0506

0708

09

05506

06507

07508

Ratio of training

AUC

120582

(b) Football

0005

010015

020

06

07

08

064066068

07072074076

09

Ratio of training

AUC

120582

(c) Taro

0005

010015

020

06

07

08

09

06062064066068

07072

Ratio of training

AUC

120582

(d) Tailor Shop

Figure 2 AUC of the SGGM (four datasets training set ratio = 90)

05pp

152p

25p3p

0005

010015

020

07808

082084086088

09092

Samples

AUC

120582

(a) Dolphin

05pp

152p

25p3p

0005

010015

020

0506070809

1

Samples

AUC

120582

(b) Football

05pp

152p

25p3p

0005

010015

020

07075

08085

09095

Samples

AUC

120582

(c) Taro

05pp

152p

25p

3p

0005

010015

020

06507

07508

08509

095

Samples

AUC

120582

(d) Tailor Shop

Figure 3 AUC of the SGGM (four datasets sample scale = 05119873)

8 Mathematical Problems in Engineering

0 02 04 06 08 10

01

02

03

04

05

06

07

08

09

1

False positive rate

True

pos

itive

rate

CNAAPA

GGM n = 2p

GGM n = 15p

GGM n = p

GGM n = 05p

(a) Dolphin

CNAAPA

GGM n = 2p

GGM n = 15p

GGM n = p

GGM n = 05p

0 02 04 06 08 10

01

02

03

04

05

06

07

08

09

1

False positive rate

True

pos

itive

rate

(b) Football

0 02 04 06 08 10

01

02

03

04

05

06

07

08

09

1

False positive rate

True

pos

itive

rate

CNAAPA

GGM n = 2p

GGM n = 15p

GGM n = p

GGM n = 05p

(c) Taro

0 02 04 06 08 10

01

02

03

04

05

06

07

08

09

1

False positive rate

True

pos

itive

rate

CNAAPA

GGM n = 2p

GGM n = 15p

GGM n = p

GGM n = 05p

(d) Tailor Shop

Figure 4 Comparison of ROC metric of four methods (SGGM CN AA and PA) on four datasets

shows an improvement over existing similarity measureswhen sufficient samples were used Note that more sampleslead to higher accuracy

444 Comparison on Error Rate The error rate is a metricto evaluate the difference between two matrices A lowererror rate indicates that the estimated matrix approximates

the original matrix From Table 7 we observe that the SGGMhas the lowest error rate among all methods shown Toshow the effectiveness of the proposed SGGM both theoriginal adjacency matrix and estimated adjacency matrixare presented as a coloured graph in Figure 5 We usedthe Taro dataset with a 90 training set to recover theoriginal network Graphs with fewer red points indicate

Mathematical Problems in Engineering 9

Figure 5 Recovery performances of 14methods on the Taro datasetThe upper left corner is the original matrix blue points denote linksbetween two corresponding nodes The red points denote links inthe original matrix that were not accurately estimated The greenpoints denote links in the estimated matrix that do not exist in theoriginal matrix From left to right top to bottom the figures are therecovered matrix by SGGM (119899 = 05119901 119901 15119901 and 2119901) CN SaltonJaccard Soslashrensen HPI HDI LHN AA RA PA LNBCN LNBAAand LNBRA respectively

Table 4 Dolphin dataset results

Method Training set ratio60 70 80 90

SGGM 119899 = 05119901 0699 0731 0761 0801SGGM 119899 = 119901 0742 0775 0823 0867SGGM 119899 = 15119901 0758 0803 0845 0895SGGM 119899 = 2119901 0765 0807 0857 0909CN 0631 0687 0728 0762Salton 0621 0673 0709 0740Jaccard 0485 0490 0495 0500Soslashrensen 0538 0558 0571 0583HPI 0620 0669 0702 0728HDI 0626 0680 0720 0753LHN 0619 0666 0698 0721AA 0633 0690 0732 0767RA 0632 0690 0731 0766PA 0712 0727 0737 0746LNBCN 0636 0696 0739 0775LNBAA 0635 0694 0738 0773LNBRA 0635 0693 0736 0771

good recovery of the original matrix As can be seen fromFigure 5 using the SGGM method can restore the originalimage whereas the other similarity measures return greatererror rates Note that the SGGM error rate decreased withincreasing sample scale

45 Large Datasets The main challenge of link prediction isdealing with large network We carefully choose four largedatasets from four individual domain For SGGM method120582 is set to 001 90 of data was used as training set

Table 5 Taro dataset results

Method Training set ratio60 70 80 90

SGGM 119899 = 05119901 0672 0711 0736 0750SGGM 119899 = 119901 0713 0747 0779 0815SGGM 119899 = 15119901 0721 0763 0816 0857SGGM 119899 = 2119901 0723 0788 0825 0865CN 0529 0545 0573 0582Salton 0530 0542 0571 0580Jaccard 0468 0476 0476 0478Soslashrensen 0523 0533 0544 0552HPI 0533 0545 0574 0585HDI 0528 0541 0568 0576LHN 0532 0543 0570 0579AA 0533 0554 0589 0609RA 0533 0554 0589 0610PA 0559 0568 0582 0607LNBCN 0536 0562 0616 0638LNBAA 0537 0561 0614 0634LNBRA 0536 0560 0613 0634

Table 6 Tailor Shop dataset results

Method Training set ratio60 70 80 90

SGGM 119899 = 05119901 0652 0679 0677 0706SGGM 119899 = 119901 0703 0723 0750 0769SGGM 119899 = 15119901 0715 0755 0786 0817SGGM 119899 = 2119901 0734 0770 0803 0844CN 0673 0699 0720 0746Salton 0616 0628 0646 0667Jaccard 0494 0498 0504 0514Soslashrensen 0583 0589 0592 0597HPI 0615 0618 0631 0642HDI 0628 0643 0660 0680LHN 0588 0574 0571 0566AA 0682 0709 0729 0753RA 0681 0707 0729 0752PA 0761 0775 0779 0786LNBCN 0692 0715 0746 0765LNBAA 0693 0716 0746 0764LNBRA 0693 0715 0745 0762

In order to prove the performance of SGGM we comparethe proposed method with the other thirteen methods onthese large datasets (for details see Table 8) in terms of AUCand error rate (data sources (1) httpkonectuni-koblenzdenetworks (Email) (2) httpwww3ndedusimnetworksresourceshtm (Protein) (3) httpwww-personalumichedusimmejnnetdatapowerzip (Grid) (4) httpkonectuni-koblenzdenetworks (Astro-ph))

To estimate the large network efficiently we used QUIC[19] to solve sparse inverse covariance estimation (formula(1)) instead of glasso [7] Overall the proposed method

10 Mathematical Problems in Engineering

Table 7 Error rate of 13 methods on 4 datasets

Method DatasetsFootball Dolphin Taro Tailor Shop

SGGM 119899 = 05119901 1467 1251 1098 1121SGGM 119899 = 119901 1364 1109 1062 1188SGGM 119899 = 15119901 1362 1198 1098 1025SGGM 119899 = 2119901 1200 1121 1013 0987CN 1641 1291 1320 1485Salton 1641 1291 1320 1485Jaccard 1641 1291 1320 1485Soslashrensen 1863 1672 1695 1826HPI 1641 1291 1320 1485HDI 1641 1291 1320 1485LHN 1641 1291 1320 1485AA 1641 1291 1320 1485RA 1641 1291 1320 1485PA 2512 1251 1340 1446LNBCN 1641 1291 1261 1382LNBAA 1641 1291 1261 1382LNBRA 1641 1291 1261 1382

Table 8 Basic topological features of four large networks

Network 119873 119872 119864 119862 Avg 119863 119903

Email 1133 5451 0299 0220 9622 0078Protein 1870 2240 0099 0099 2396 minus0156Grid 4941 6594 0063 0107 2669 0003Astro-ph 18771 198050 0091 0486 4140 0294

Table 9 Comparison of 14 methods on 4 large datasets on AUC

Method Email Protein Grid Astro-phSGGM 0916 0943 0944 0929CN 0817 0566 0570 0864Salton 0812 0566 0569 0863Jaccard 0475 0479 0487 0479Soslashrensen 0547 0493 0529 0584HPI 0809 0566 0569 0863HDI 0813 0566 0569 0863LHN-I 0805 0565 0570 0864AA 0819 0565 0570 0876RA 0818 0566 0568 0838PA 0824 0820 0713 0785

can efficiently recover the original network and predict themissing links accurately As observed from Table 9 AUC ofSGGM is highest on all 4 datasets For Email Protein Gridand Astro-ph datasets SGGM improves 92 123 231and 53 on AUC respectively For error rate metric asshown in Table 10 the SGGM method performs optimallywith all 4 large datasets It indicates that SGGMmethod couldaccurately recover the original network

Table 10 Comparison of 14 methods on 4 large datasets on errorrate

Method Email Protein Grid Astro-phSGGM 0467 0411 0367 0359CN 2348 2189 1670 1568Salton 2978 2189 1767 1549Jaccard 2982 2189 1450 1548Soslashrensen 3047 2555 2071 2771HPI 2578 2189 1670 1610HDI 2978 2189 1237 1854LHN-I 2428 2189 1692 1612AA 2922 2204 1872 1953RA 2348 2189 1670 1410PA 9219 8032 15081 12451

46 Analysis In the comparative analysis of the performanceof the 14 methods using the eight real-world datasets theSGGM method was outstanding in terms of AUC and errorrate This method can accurately recover the original adja-cency matrix and in most of cases requires fewer samplesthat is far fewer than the actual number of nodes TheSGGM methodrsquos prediction precision gradually increasedwith the increase in number of samples moreover therecovery error rate decreased with the increase in matrixsparsityTherefore the SGGMmethod can be applied to smallsamples however it performs better with more samplesThese results demonstrate that the SGGM method can beimplemented in a scalable fashion

5 Conclusions

Link prediction is a basic problem in complex networkanalysis In real-world networks obtaining node and edgeproperties is cumbersome Link prediction methods basedon network structure similarity do not depend on nodeproperties however these methods are limited by networkstructure and are difficult to adapt in different networkstructures

Our work is not dependent on node properties andexpands link prediction methods that cannot deal withdiverse network topologiesWe sampled the original networkadjacency matrix and used the SGGM method to depict thenetwork structure

Most nodes are independent in the actual network thuswe used the SGGM method to estimate a precision matrixof the adjacency matrix to predict links Our experimentalresults show that using the SGGMmethod to recover networkstructure is better than 13mainstream similaritymethodsWetested the proposed method with eight real-world datasetsand used the AUC and error rate as evaluation indexes Wefound that this method obtains higher prediction precisionWe also analysed the influence of different parameters onthe SGGM method and found no significant influenceThe SGGM method returns a high AUC value within acertain range Furthermore the proposed method retains

Mathematical Problems in Engineering 11

high prediction precision with fewer training samples it canbe applied in large network

Conflict of Interests

The authors declare that there is no conflict of interestsregarding the publication of this paper

Acknowledgments

The authors thank Longqi Yang Zhisong Pan and GuyuHu for their contributions to work on link prediction andZhen Li and Junyang Qiu for their valuable discussionscomments and suggestions This work was supported by theNational Technology Research and Development Programof China (863 Program) (Grant no 2012AA01A510) theNational Natural Science Foundation of China (Grant no61473149) and the Natural Science Foundation of JiangsuProvince China (Grant no BK20140073)

References

[1] L L Lu and T Zhou ldquoLink prediction in complex networks asurveyrdquo Physica A Statistical Mechanics amp Its Applications vol390 no 6 pp 1150ndash1170 2011

[2] B Ermis E Acar and A T Cemgil ldquoLink prediction in het-erogeneous data via generalized coupled tensor factorizationrdquoDataMining amp Knowledge Discovery vol 29 no 1 pp 203ndash2362013

[3] Y Yang N Chawla Y Sun and J Hani ldquoPredicting links inmulti-relational and heterogeneous networksrdquo in Proceedings ofthe 12th IEEE International Conference on Data Mining (ICDMrsquo12) pp 755ndash764 Brussels Belgium December 2012

[4] D Liben-Nowell and J Kleinberg ldquoThe link-prediction prob-lem for social networksrdquo Journal of the American Society forInformation Science and Technology vol 58 no 7 pp 1019ndash10312007

[5] Y Yang R N Lichtenwalter and N V Chawla ldquoEvaluating linkprediction methodsrdquo Knowledge and Information Systems vol45 no 3 pp 751ndash782 2015

[6] N Z Gong A Talwalkar LMacKey et al ldquoJoint link predictionand attribute inference using a social-attribute networkrdquo ACMTransactions on Intelligent Systems and Technology vol 5 no 2article 27 2014

[7] J Friedman T Hastie and R Tibshirani ldquoSparse inversecovariance estimationwith the graphical lassordquoBiostatistics vol9 no 3 pp 432ndash441 2008

[8] M E Newman ldquoClustering and preferential attachment ingrowing networksrdquo Physical Review E vol 64 no 2 pp 25ndash1022001

[9] G Salton and M J McGill Introduction to Modern InformationRetrieval McGraw-Hill New York NY USA 1986

[10] P Jaccard ldquoEtude comparative de la distribution florale dansune portion des Alpes et du Jurardquo Bulletin de la Societe Vaudoisedes Sciences Naturelles vol 37 no 142 pp 547ndash579 1901

[11] T Soslashrensen ldquoA method of establishing groups of equal ampli-tude in plant sociology based on similarity of species and itsapplication to analyses of the vegetation on danish commonsrdquoBiologiske Skrifter vol 5 pp 1ndash34 1948

[12] E A Leicht P Holme and M E J Newman ldquoVertex similarityin networksrdquo Physical Review E vol 73 no 2 Article ID 0261202006

[13] Y-B Xie T Zhou and B-H Wang ldquoScale-free networkswithout growthrdquo Physica A Statistical Mechanics and Its Appli-cations vol 387 no 7 pp 1683ndash1688 2008

[14] T Zhou L Lu and Y-C Zhang ldquoPredicting missing links vialocal informationrdquoThe European Physical Journal B vol 71 no4 pp 623ndash630 2009

[15] Z Liu Q M Zhang L Lu and T Zhou ldquoLink prediction incomplex networks a local naıve Bayesmodelrdquo EPL (EurophysicsLetters) vol 96 no 4 Article ID 48007 2011

[16] J Dahl L Vandenberghe and V Roychowdhury ldquoCovarianceselection for nonchordal graphs via chordal embeddingrdquo Opti-mization Methods amp Software vol 23 no 4 pp 501ndash520 2008

[17] O Banerjee L E Ghaoui andADaspremont ldquoModel selectionthrough sparse maximum likelihood estimation for multivari-ate Gaussian or binary datardquoMachine Learning Research vol 9no 3 pp 485ndash516 2008

[18] C J Hsieh M A Sustik I S Dhillon P K Ravikumar and RPoldrack ldquoBigamp quic sparse inverse covariance estimation for amillion variablesrdquo in Advances in Neural Information ProcessingSystems pp 3165ndash3173 MIT Press 2013

[19] C-J Hsieh M A Sustik I S Dhillon and P RavikumarldquoQUIC quadratic approximation for sparse inverse covarianceestimationrdquo Journal of Machine Learning Research vol 15 pp2911ndash2947 2014

[20] V Latora and M Marchiori ldquoEfficient behavior of small-worldnetworksrdquo Physical Review Letters vol 87 Article ID 1987012001

[21] D J Watts and S H Strogatz ldquoCollective dynamics of lsquosmall-worldrsquo networksrdquoNature vol 393 no 6684 pp 440ndash442 1998

[22] M E J Newman ldquoAssortative mixing in networksrdquo PhysicalReview Letters vol 89 no 20 Article ID 208701 2002

[23] J Liu S Ji and J Ye SLEP Sparse Learning with EfficientProjections vol 6 Arizona State University Tempe Ariz USA2009

Submit your manuscripts athttpwwwhindawicom

Hindawi Publishing Corporationhttpwwwhindawicom Volume 2014

MathematicsJournal of

Hindawi Publishing Corporationhttpwwwhindawicom Volume 2014

Mathematical Problems in Engineering

Hindawi Publishing Corporationhttpwwwhindawicom

Differential EquationsInternational Journal of

Volume 2014

Applied MathematicsJournal of

Hindawi Publishing Corporationhttpwwwhindawicom Volume 2014

Probability and StatisticsHindawi Publishing Corporationhttpwwwhindawicom Volume 2014

Journal of

Hindawi Publishing Corporationhttpwwwhindawicom Volume 2014

Mathematical PhysicsAdvances in

Complex AnalysisJournal of

Hindawi Publishing Corporationhttpwwwhindawicom Volume 2014

OptimizationJournal of

Hindawi Publishing Corporationhttpwwwhindawicom Volume 2014

CombinatoricsHindawi Publishing Corporationhttpwwwhindawicom Volume 2014

International Journal of

Hindawi Publishing Corporationhttpwwwhindawicom Volume 2014

Operations ResearchAdvances in

Journal of

Hindawi Publishing Corporationhttpwwwhindawicom Volume 2014

Function Spaces

Abstract and Applied AnalysisHindawi Publishing Corporationhttpwwwhindawicom Volume 2014

International Journal of Mathematics and Mathematical Sciences

Hindawi Publishing Corporationhttpwwwhindawicom Volume 2014

The Scientific World JournalHindawi Publishing Corporation httpwwwhindawicom Volume 2014

Hindawi Publishing Corporationhttpwwwhindawicom Volume 2014

Algebra

Discrete Dynamics in Nature and Society

Hindawi Publishing Corporationhttpwwwhindawicom Volume 2014

Hindawi Publishing Corporationhttpwwwhindawicom Volume 2014

Decision SciencesAdvances in

Discrete MathematicsJournal of

Hindawi Publishing Corporationhttpwwwhindawicom

Volume 2014 Hindawi Publishing Corporationhttpwwwhindawicom Volume 2014

Stochastic AnalysisInternational Journal of

Page 8: Research Article Link Prediction via Sparse Gaussian Graphical … · 2019. 5. 14. · Research Article Link Prediction via Sparse Gaussian Graphical Model LiangliangZhang,LongqiYang,GuyuHu,ZhisongPan,andZhenLi

8 Mathematical Problems in Engineering

0 02 04 06 08 10

01

02

03

04

05

06

07

08

09

1

False positive rate

True

pos

itive

rate

CNAAPA

GGM n = 2p

GGM n = 15p

GGM n = p

GGM n = 05p

(a) Dolphin

CNAAPA

GGM n = 2p

GGM n = 15p

GGM n = p

GGM n = 05p

0 02 04 06 08 10

01

02

03

04

05

06

07

08

09

1

False positive rate

True

pos

itive

rate

(b) Football

0 02 04 06 08 10

01

02

03

04

05

06

07

08

09

1

False positive rate

True

pos

itive

rate

CNAAPA

GGM n = 2p

GGM n = 15p

GGM n = p

GGM n = 05p

(c) Taro

0 02 04 06 08 10

01

02

03

04

05

06

07

08

09

1

False positive rate

True

pos

itive

rate

CNAAPA

GGM n = 2p

GGM n = 15p

GGM n = p

GGM n = 05p

(d) Tailor Shop

Figure 4 Comparison of ROC metric of four methods (SGGM CN AA and PA) on four datasets

shows an improvement over existing similarity measureswhen sufficient samples were used Note that more sampleslead to higher accuracy

444 Comparison on Error Rate The error rate is a metricto evaluate the difference between two matrices A lowererror rate indicates that the estimated matrix approximates

the original matrix From Table 7 we observe that the SGGMhas the lowest error rate among all methods shown Toshow the effectiveness of the proposed SGGM both theoriginal adjacency matrix and estimated adjacency matrixare presented as a coloured graph in Figure 5 We usedthe Taro dataset with a 90 training set to recover theoriginal network Graphs with fewer red points indicate

Mathematical Problems in Engineering 9

Figure 5 Recovery performances of 14methods on the Taro datasetThe upper left corner is the original matrix blue points denote linksbetween two corresponding nodes The red points denote links inthe original matrix that were not accurately estimated The greenpoints denote links in the estimated matrix that do not exist in theoriginal matrix From left to right top to bottom the figures are therecovered matrix by SGGM (119899 = 05119901 119901 15119901 and 2119901) CN SaltonJaccard Soslashrensen HPI HDI LHN AA RA PA LNBCN LNBAAand LNBRA respectively

Table 4 Dolphin dataset results

Method Training set ratio60 70 80 90

SGGM 119899 = 05119901 0699 0731 0761 0801SGGM 119899 = 119901 0742 0775 0823 0867SGGM 119899 = 15119901 0758 0803 0845 0895SGGM 119899 = 2119901 0765 0807 0857 0909CN 0631 0687 0728 0762Salton 0621 0673 0709 0740Jaccard 0485 0490 0495 0500Soslashrensen 0538 0558 0571 0583HPI 0620 0669 0702 0728HDI 0626 0680 0720 0753LHN 0619 0666 0698 0721AA 0633 0690 0732 0767RA 0632 0690 0731 0766PA 0712 0727 0737 0746LNBCN 0636 0696 0739 0775LNBAA 0635 0694 0738 0773LNBRA 0635 0693 0736 0771

good recovery of the original matrix As can be seen fromFigure 5 using the SGGM method can restore the originalimage whereas the other similarity measures return greatererror rates Note that the SGGM error rate decreased withincreasing sample scale

45 Large Datasets The main challenge of link prediction isdealing with large network We carefully choose four largedatasets from four individual domain For SGGM method120582 is set to 001 90 of data was used as training set

Table 5 Taro dataset results

Method Training set ratio60 70 80 90

SGGM 119899 = 05119901 0672 0711 0736 0750SGGM 119899 = 119901 0713 0747 0779 0815SGGM 119899 = 15119901 0721 0763 0816 0857SGGM 119899 = 2119901 0723 0788 0825 0865CN 0529 0545 0573 0582Salton 0530 0542 0571 0580Jaccard 0468 0476 0476 0478Soslashrensen 0523 0533 0544 0552HPI 0533 0545 0574 0585HDI 0528 0541 0568 0576LHN 0532 0543 0570 0579AA 0533 0554 0589 0609RA 0533 0554 0589 0610PA 0559 0568 0582 0607LNBCN 0536 0562 0616 0638LNBAA 0537 0561 0614 0634LNBRA 0536 0560 0613 0634

Table 6 Tailor Shop dataset results

Method Training set ratio60 70 80 90

SGGM 119899 = 05119901 0652 0679 0677 0706SGGM 119899 = 119901 0703 0723 0750 0769SGGM 119899 = 15119901 0715 0755 0786 0817SGGM 119899 = 2119901 0734 0770 0803 0844CN 0673 0699 0720 0746Salton 0616 0628 0646 0667Jaccard 0494 0498 0504 0514Soslashrensen 0583 0589 0592 0597HPI 0615 0618 0631 0642HDI 0628 0643 0660 0680LHN 0588 0574 0571 0566AA 0682 0709 0729 0753RA 0681 0707 0729 0752PA 0761 0775 0779 0786LNBCN 0692 0715 0746 0765LNBAA 0693 0716 0746 0764LNBRA 0693 0715 0745 0762

In order to prove the performance of SGGM we comparethe proposed method with the other thirteen methods onthese large datasets (for details see Table 8) in terms of AUCand error rate (data sources (1) httpkonectuni-koblenzdenetworks (Email) (2) httpwww3ndedusimnetworksresourceshtm (Protein) (3) httpwww-personalumichedusimmejnnetdatapowerzip (Grid) (4) httpkonectuni-koblenzdenetworks (Astro-ph))

To estimate the large network efficiently we used QUIC[19] to solve sparse inverse covariance estimation (formula(1)) instead of glasso [7] Overall the proposed method

10 Mathematical Problems in Engineering

Table 7 Error rate of 13 methods on 4 datasets

Method DatasetsFootball Dolphin Taro Tailor Shop

SGGM 119899 = 05119901 1467 1251 1098 1121SGGM 119899 = 119901 1364 1109 1062 1188SGGM 119899 = 15119901 1362 1198 1098 1025SGGM 119899 = 2119901 1200 1121 1013 0987CN 1641 1291 1320 1485Salton 1641 1291 1320 1485Jaccard 1641 1291 1320 1485Soslashrensen 1863 1672 1695 1826HPI 1641 1291 1320 1485HDI 1641 1291 1320 1485LHN 1641 1291 1320 1485AA 1641 1291 1320 1485RA 1641 1291 1320 1485PA 2512 1251 1340 1446LNBCN 1641 1291 1261 1382LNBAA 1641 1291 1261 1382LNBRA 1641 1291 1261 1382

Table 8 Basic topological features of four large networks

Network 119873 119872 119864 119862 Avg 119863 119903

Email 1133 5451 0299 0220 9622 0078Protein 1870 2240 0099 0099 2396 minus0156Grid 4941 6594 0063 0107 2669 0003Astro-ph 18771 198050 0091 0486 4140 0294

Table 9 Comparison of 14 methods on 4 large datasets on AUC

Method Email Protein Grid Astro-phSGGM 0916 0943 0944 0929CN 0817 0566 0570 0864Salton 0812 0566 0569 0863Jaccard 0475 0479 0487 0479Soslashrensen 0547 0493 0529 0584HPI 0809 0566 0569 0863HDI 0813 0566 0569 0863LHN-I 0805 0565 0570 0864AA 0819 0565 0570 0876RA 0818 0566 0568 0838PA 0824 0820 0713 0785

can efficiently recover the original network and predict themissing links accurately As observed from Table 9 AUC ofSGGM is highest on all 4 datasets For Email Protein Gridand Astro-ph datasets SGGM improves 92 123 231and 53 on AUC respectively For error rate metric asshown in Table 10 the SGGM method performs optimallywith all 4 large datasets It indicates that SGGMmethod couldaccurately recover the original network

Table 10 Comparison of 14 methods on 4 large datasets on errorrate

Method Email Protein Grid Astro-phSGGM 0467 0411 0367 0359CN 2348 2189 1670 1568Salton 2978 2189 1767 1549Jaccard 2982 2189 1450 1548Soslashrensen 3047 2555 2071 2771HPI 2578 2189 1670 1610HDI 2978 2189 1237 1854LHN-I 2428 2189 1692 1612AA 2922 2204 1872 1953RA 2348 2189 1670 1410PA 9219 8032 15081 12451

46 Analysis In the comparative analysis of the performanceof the 14 methods using the eight real-world datasets theSGGM method was outstanding in terms of AUC and errorrate This method can accurately recover the original adja-cency matrix and in most of cases requires fewer samplesthat is far fewer than the actual number of nodes TheSGGM methodrsquos prediction precision gradually increasedwith the increase in number of samples moreover therecovery error rate decreased with the increase in matrixsparsityTherefore the SGGMmethod can be applied to smallsamples however it performs better with more samplesThese results demonstrate that the SGGM method can beimplemented in a scalable fashion

5 Conclusions

Link prediction is a basic problem in complex networkanalysis In real-world networks obtaining node and edgeproperties is cumbersome Link prediction methods basedon network structure similarity do not depend on nodeproperties however these methods are limited by networkstructure and are difficult to adapt in different networkstructures

Our work is not dependent on node properties andexpands link prediction methods that cannot deal withdiverse network topologiesWe sampled the original networkadjacency matrix and used the SGGM method to depict thenetwork structure

Most nodes are independent in the actual network thuswe used the SGGM method to estimate a precision matrixof the adjacency matrix to predict links Our experimentalresults show that using the SGGMmethod to recover networkstructure is better than 13mainstream similaritymethodsWetested the proposed method with eight real-world datasetsand used the AUC and error rate as evaluation indexes Wefound that this method obtains higher prediction precisionWe also analysed the influence of different parameters onthe SGGM method and found no significant influenceThe SGGM method returns a high AUC value within acertain range Furthermore the proposed method retains

Mathematical Problems in Engineering 11

high prediction precision with fewer training samples it canbe applied in large network

Conflict of Interests

The authors declare that there is no conflict of interestsregarding the publication of this paper

Acknowledgments

The authors thank Longqi Yang Zhisong Pan and GuyuHu for their contributions to work on link prediction andZhen Li and Junyang Qiu for their valuable discussionscomments and suggestions This work was supported by theNational Technology Research and Development Programof China (863 Program) (Grant no 2012AA01A510) theNational Natural Science Foundation of China (Grant no61473149) and the Natural Science Foundation of JiangsuProvince China (Grant no BK20140073)

References

[1] L L Lu and T Zhou ldquoLink prediction in complex networks asurveyrdquo Physica A Statistical Mechanics amp Its Applications vol390 no 6 pp 1150ndash1170 2011

[2] B Ermis E Acar and A T Cemgil ldquoLink prediction in het-erogeneous data via generalized coupled tensor factorizationrdquoDataMining amp Knowledge Discovery vol 29 no 1 pp 203ndash2362013

[3] Y Yang N Chawla Y Sun and J Hani ldquoPredicting links inmulti-relational and heterogeneous networksrdquo in Proceedings ofthe 12th IEEE International Conference on Data Mining (ICDMrsquo12) pp 755ndash764 Brussels Belgium December 2012


Figure 5: Recovery performances of 14 methods on the Taro dataset. The upper left corner is the original matrix; blue points denote links between two corresponding nodes. The red points denote links in the original matrix that were not accurately estimated. The green points denote links in the estimated matrix that do not exist in the original matrix. From left to right, top to bottom, the figures are the matrices recovered by SGGM (n = 0.5p, p, 1.5p, and 2p), CN, Salton, Jaccard, Sørensen, HPI, HDI, LHN, AA, RA, PA, LNBCN, LNBAA, and LNBRA, respectively.

Table 4: Dolphin dataset results.

Method           Training set ratio
                 60%    70%    80%    90%
SGGM n = 0.5p    0.699  0.731  0.761  0.801
SGGM n = p       0.742  0.775  0.823  0.867
SGGM n = 1.5p    0.758  0.803  0.845  0.895
SGGM n = 2p      0.765  0.807  0.857  0.909
CN               0.631  0.687  0.728  0.762
Salton           0.621  0.673  0.709  0.740
Jaccard          0.485  0.490  0.495  0.500
Sørensen         0.538  0.558  0.571  0.583
HPI              0.620  0.669  0.702  0.728
HDI              0.626  0.680  0.720  0.753
LHN              0.619  0.666  0.698  0.721
AA               0.633  0.690  0.732  0.767
RA               0.632  0.690  0.731  0.766
PA               0.712  0.727  0.737  0.746
LNBCN            0.636  0.696  0.739  0.775
LNBAA            0.635  0.694  0.738  0.773
LNBRA            0.635  0.693  0.736  0.771

good recovery of the original matrix. As can be seen from Figure 5, the SGGM method restores the original image, whereas the other similarity measures return greater error rates. Note that the SGGM error rate decreased with increasing sample scale.
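The discrepancy visualization in Figure 5 can be reproduced directly from the original and estimated adjacency matrices. Below is a minimal sketch; the function name is illustrative, and the simple misclassification fraction it returns is one plausible discrepancy measure, not necessarily the error-rate definition behind Tables 7 and 10.

import numpy as np

def recovery_masks(A_true, A_est):
    # Red points in Figure 5: links in the original matrix missed by the estimate.
    missed = (A_true == 1) & (A_est == 0)
    # Green points in Figure 5: links in the estimate absent from the original.
    spurious = (A_true == 0) & (A_est == 1)
    # Fraction of entries that disagree (illustrative measure only).
    frac_wrong = (missed | spurious).mean()
    return missed, spurious, frac_wrong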

4.5. Large Datasets. The main challenge of link prediction is dealing with large networks. We carefully chose four large datasets from four distinct domains. For the SGGM method, λ is set to 0.01, and 90% of the data was used as the training set.
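To make this setup concrete, the following sketch, written under stated assumptions rather than taken from the authors' code, treats sampled rows of the training adjacency matrix as observations, fits a sparse Gaussian graphical model, and scores candidate links by the magnitude of the off-diagonal precision entries. scikit-learn's GraphicalLasso stands in for the glasso/QUIC solvers used in the paper; the jitter, the |Theta| scoring rule, and the toy graph are illustrative choices.

import numpy as np
from sklearn.covariance import GraphicalLasso
from sklearn.metrics import roc_auc_score

def sggm_link_scores(A_train, n_samples, lam=0.01, seed=0):
    # Treat n sampled rows of the p x p adjacency matrix as observations of p variables.
    rng = np.random.default_rng(seed)
    p = A_train.shape[0]
    rows = rng.choice(p, size=n_samples, replace=True)   # n = 0.5p, p, 1.5p, or 2p
    X = A_train[rows].astype(float)
    X += 1e-2 * rng.standard_normal(X.shape)             # jitter keeps every column variance positive
    theta = GraphicalLasso(alpha=lam, max_iter=200).fit(X).precision_
    scores = np.abs(theta)                               # |Theta_ij| as the score for link (i, j)
    np.fill_diagonal(scores, 0.0)
    return scores

# Toy usage: hide 10% of the node pairs of a random graph, then rank the held-out pairs.
rng = np.random.default_rng(42)
p = 60
A = np.triu((rng.random((p, p)) < 0.1).astype(int), 1)
A = A + A.T                                              # symmetric adjacency, no self-loops
iu = np.triu_indices(p, 1)
probe = rng.random(len(iu[0])) < 0.1                     # held-out pair mask
A_train = A.copy()
A_train[iu[0][probe], iu[1][probe]] = 0
A_train[iu[1][probe], iu[0][probe]] = 0
scores = sggm_link_scores(A_train, n_samples=2 * p)      # the n = 2p setting
print("AUC on held-out pairs:", roc_auc_score(A[iu][probe], scores[iu][probe]))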

Table 5: Taro dataset results.

Method           Training set ratio
                 60%    70%    80%    90%
SGGM n = 0.5p    0.672  0.711  0.736  0.750
SGGM n = p       0.713  0.747  0.779  0.815
SGGM n = 1.5p    0.721  0.763  0.816  0.857
SGGM n = 2p      0.723  0.788  0.825  0.865
CN               0.529  0.545  0.573  0.582
Salton           0.530  0.542  0.571  0.580
Jaccard          0.468  0.476  0.476  0.478
Sørensen         0.523  0.533  0.544  0.552
HPI              0.533  0.545  0.574  0.585
HDI              0.528  0.541  0.568  0.576
LHN              0.532  0.543  0.570  0.579
AA               0.533  0.554  0.589  0.609
RA               0.533  0.554  0.589  0.610
PA               0.559  0.568  0.582  0.607
LNBCN            0.536  0.562  0.616  0.638
LNBAA            0.537  0.561  0.614  0.634
LNBRA            0.536  0.560  0.613  0.634

Table 6: Tailor Shop dataset results.

Method           Training set ratio
                 60%    70%    80%    90%
SGGM n = 0.5p    0.652  0.679  0.677  0.706
SGGM n = p       0.703  0.723  0.750  0.769
SGGM n = 1.5p    0.715  0.755  0.786  0.817
SGGM n = 2p      0.734  0.770  0.803  0.844
CN               0.673  0.699  0.720  0.746
Salton           0.616  0.628  0.646  0.667
Jaccard          0.494  0.498  0.504  0.514
Sørensen         0.583  0.589  0.592  0.597
HPI              0.615  0.618  0.631  0.642
HDI              0.628  0.643  0.660  0.680
LHN              0.588  0.574  0.571  0.566
AA               0.682  0.709  0.729  0.753
RA               0.681  0.707  0.729  0.752
PA               0.761  0.775  0.779  0.786
LNBCN            0.692  0.715  0.746  0.765
LNBAA            0.693  0.716  0.746  0.764
LNBRA            0.693  0.715  0.745  0.762

In order to prove the performance of SGGM, we compared the proposed method with the other thirteen methods on these large datasets (for details, see Table 8) in terms of AUC and error rate. Data sources: (1) http://konect.uni-koblenz.de/networks (Email); (2) http://www3.nd.edu/~networks/resources.htm (Protein); (3) http://www-personal.umich.edu/~mejn/netdata/power.zip (Grid); (4) http://konect.uni-koblenz.de/networks (Astro-ph).

To estimate the large networks efficiently, we used QUIC [19] instead of glasso [7] to solve the sparse inverse covariance estimation problem (formula (1)).
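Formula (1) is not reproduced on this page, but glasso [7] and QUIC [19] are both solvers for the standard l1-penalized Gaussian maximum-likelihood problem, so it presumably takes the form

\hat{\Theta} = \arg\min_{\Theta \succ 0} \; \operatorname{tr}(S\Theta) - \log\det\Theta + \lambda \|\Theta\|_1,

where S is the empirical covariance of the sampled rows, \Theta is the precision (inverse covariance) matrix whose nonzero off-diagonal entries encode conditional dependencies between nodes (the predicted links), and \lambda is the sparsity weight (0.01 in these experiments). QUIC minimizes this objective with a Newton-style quadratic approximation, which is the source of the speed advantage on large p reported in [19].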


Table 7: Error rates of SGGM and the 13 baseline methods on four datasets.

Method           Football  Dolphin  Taro   Tailor Shop
SGGM n = 0.5p    1.467     1.251    1.098  1.121
SGGM n = p       1.364     1.109    1.062  1.188
SGGM n = 1.5p    1.362     1.198    1.098  1.025
SGGM n = 2p      1.200     1.121    1.013  0.987
CN               1.641     1.291    1.320  1.485
Salton           1.641     1.291    1.320  1.485
Jaccard          1.641     1.291    1.320  1.485
Sørensen         1.863     1.672    1.695  1.826
HPI              1.641     1.291    1.320  1.485
HDI              1.641     1.291    1.320  1.485
LHN              1.641     1.291    1.320  1.485
AA               1.641     1.291    1.320  1.485
RA               1.641     1.291    1.320  1.485
PA               2.512     1.251    1.340  1.446
LNBCN            1.641     1.291    1.261  1.382
LNBAA            1.641     1.291    1.261  1.382
LNBRA            1.641     1.291    1.261  1.382

Table 8: Basic topological features of the four large networks. N and M are the numbers of nodes and edges; E, C, Avg. D, and r denote the efficiency [20], clustering coefficient [21], average degree, and assortativity coefficient [22].

Network    N      M       E      C      Avg. D  r
Email      1133   5451    0.299  0.220  9.622   0.078
Protein    1870   2240    0.099  0.099  2.396   −0.156
Grid       4941   6594    0.063  0.107  2.669   0.003
Astro-ph   18771  198050  0.091  0.486  4.140   0.294

Table 9: Comparison of 14 methods on the four large datasets in terms of AUC.

Method     Email  Protein  Grid   Astro-ph
SGGM       0.916  0.943    0.944  0.929
CN         0.817  0.566    0.570  0.864
Salton     0.812  0.566    0.569  0.863
Jaccard    0.475  0.479    0.487  0.479
Sørensen   0.547  0.493    0.529  0.584
HPI        0.809  0.566    0.569  0.863
HDI        0.813  0.566    0.569  0.863
LHN-I      0.805  0.565    0.570  0.864
AA         0.819  0.565    0.570  0.876
RA         0.818  0.566    0.568  0.838
PA         0.824  0.820    0.713  0.785

Overall, the proposed method can efficiently recover the original network and predict the missing links accurately. As observed from Table 9, the AUC of SGGM is the highest on all four datasets. For the Email, Protein, Grid, and Astro-ph datasets, SGGM improves AUC by 9.2%, 12.3%, 23.1%, and 5.3%, respectively. For the error rate metric, as shown in Table 10, the SGGM method performs optimally with all four large datasets. This indicates that the SGGM method can accurately recover the original network.
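For reference, the AUC in link-prediction studies is usually obtained with the sampling estimator of the survey [1]: the probability that a randomly chosen missing (test) link receives a higher score than a randomly chosen nonexistent link, with ties counted as 0.5. A minimal sketch, assuming this is the estimator behind Tables 4-6 and 9:

import numpy as np

def link_prediction_auc(scores_missing, scores_nonexistent, n_trials=10000, seed=0):
    # AUC = (n' + 0.5 n'') / n, where n' counts trials in which a missing link
    # outscores a nonexistent link and n'' counts ties [1].
    rng = np.random.default_rng(seed)
    a = rng.choice(scores_missing, size=n_trials)
    b = rng.choice(scores_nonexistent, size=n_trials)
    return ((a > b).sum() + 0.5 * (a == b).sum()) / n_trials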

Table 10: Comparison of 14 methods on the four large datasets in terms of error rate.

Method     Email  Protein  Grid    Astro-ph
SGGM       0.467  0.411    0.367   0.359
CN         2.348  2.189    1.670   1.568
Salton     2.978  2.189    1.767   1.549
Jaccard    2.982  2.189    1.450   1.548
Sørensen   3.047  2.555    2.071   2.771
HPI        2.578  2.189    1.670   1.610
HDI        2.978  2.189    1.237   1.854
LHN-I      2.428  2.189    1.692   1.612
AA         2.922  2.204    1.872   1.953
RA         2.348  2.189    1.670   1.410
PA         9.219  8.032    15.081  12.451

4.6. Analysis. In the comparative analysis of the performance of the 14 methods on the eight real-world datasets, the SGGM method was outstanding in terms of AUC and error rate. This method can accurately recover the original adjacency matrix and in most cases requires few samples, that is, far fewer than the actual number of nodes. The SGGM method's prediction precision gradually increased with the number of samples; moreover, the recovery error rate decreased with increasing matrix sparsity. Therefore, the SGGM method can be applied to small samples, although it performs better with more samples. These results demonstrate that the SGGM method can be implemented in a scalable fashion.

5. Conclusions

Link prediction is a basic problem in complex network analysis. In real-world networks, obtaining node and edge properties is cumbersome. Link prediction methods based on network structure similarity do not depend on node properties; however, these methods are limited by network structure and are difficult to adapt to different network structures.

Our work does not depend on node properties and extends link prediction to diverse network topologies that existing methods cannot handle. We sampled the original network adjacency matrix and used the SGGM method to depict the network structure.

Most nodes are independent in an actual network; thus, we used the SGGM method to estimate a precision matrix of the adjacency matrix to predict links. Our experimental results show that using the SGGM method to recover network structure outperforms 13 mainstream similarity methods. We tested the proposed method with eight real-world datasets and used the AUC and error rate as evaluation indexes, and found that this method obtains higher prediction precision. We also analysed the influence of different parameters on the SGGM method and found no significant influence; the SGGM method returns a high AUC value within a certain range. Furthermore, the proposed method retains high prediction precision with fewer training samples, and it can be applied to large networks.

Conflict of Interests

The authors declare that there is no conflict of interests regarding the publication of this paper.

Acknowledgments

The authors thank Longqi Yang, Zhisong Pan, and Guyu Hu for their contributions to work on link prediction and Zhen Li and Junyang Qiu for their valuable discussions, comments, and suggestions. This work was supported by the National Technology Research and Development Program of China (863 Program) (Grant no. 2012AA01A510), the National Natural Science Foundation of China (Grant no. 61473149), and the Natural Science Foundation of Jiangsu Province, China (Grant no. BK20140073).

References

[1] L. Lü and T. Zhou, "Link prediction in complex networks: a survey," Physica A: Statistical Mechanics and Its Applications, vol. 390, no. 6, pp. 1150–1170, 2011.

[2] B. Ermiş, E. Acar, and A. T. Cemgil, "Link prediction in heterogeneous data via generalized coupled tensor factorization," Data Mining & Knowledge Discovery, vol. 29, no. 1, pp. 203–236, 2013.

[3] Y. Yang, N. Chawla, Y. Sun, and J. Hani, "Predicting links in multi-relational and heterogeneous networks," in Proceedings of the 12th IEEE International Conference on Data Mining (ICDM '12), pp. 755–764, Brussels, Belgium, December 2012.

[4] D. Liben-Nowell and J. Kleinberg, "The link-prediction problem for social networks," Journal of the American Society for Information Science and Technology, vol. 58, no. 7, pp. 1019–1031, 2007.

[5] Y. Yang, R. N. Lichtenwalter, and N. V. Chawla, "Evaluating link prediction methods," Knowledge and Information Systems, vol. 45, no. 3, pp. 751–782, 2015.

[6] N. Z. Gong, A. Talwalkar, L. MacKey et al., "Joint link prediction and attribute inference using a social-attribute network," ACM Transactions on Intelligent Systems and Technology, vol. 5, no. 2, article 27, 2014.

[7] J. Friedman, T. Hastie, and R. Tibshirani, "Sparse inverse covariance estimation with the graphical lasso," Biostatistics, vol. 9, no. 3, pp. 432–441, 2008.

[8] M. E. Newman, "Clustering and preferential attachment in growing networks," Physical Review E, vol. 64, no. 2, Article ID 025102, 2001.

[9] G. Salton and M. J. McGill, Introduction to Modern Information Retrieval, McGraw-Hill, New York, NY, USA, 1986.

[10] P. Jaccard, "Étude comparative de la distribution florale dans une portion des Alpes et du Jura," Bulletin de la Société Vaudoise des Sciences Naturelles, vol. 37, no. 142, pp. 547–579, 1901.

[11] T. Sørensen, "A method of establishing groups of equal amplitude in plant sociology based on similarity of species and its application to analyses of the vegetation on Danish commons," Biologiske Skrifter, vol. 5, pp. 1–34, 1948.

[12] E. A. Leicht, P. Holme, and M. E. J. Newman, "Vertex similarity in networks," Physical Review E, vol. 73, no. 2, Article ID 026120, 2006.

[13] Y.-B. Xie, T. Zhou, and B.-H. Wang, "Scale-free networks without growth," Physica A: Statistical Mechanics and Its Applications, vol. 387, no. 7, pp. 1683–1688, 2008.

[14] T. Zhou, L. Lü, and Y.-C. Zhang, "Predicting missing links via local information," The European Physical Journal B, vol. 71, no. 4, pp. 623–630, 2009.

[15] Z. Liu, Q. M. Zhang, L. Lü, and T. Zhou, "Link prediction in complex networks: a local naïve Bayes model," EPL (Europhysics Letters), vol. 96, no. 4, Article ID 48007, 2011.

[16] J. Dahl, L. Vandenberghe, and V. Roychowdhury, "Covariance selection for nonchordal graphs via chordal embedding," Optimization Methods & Software, vol. 23, no. 4, pp. 501–520, 2008.

[17] O. Banerjee, L. E. Ghaoui, and A. d'Aspremont, "Model selection through sparse maximum likelihood estimation for multivariate Gaussian or binary data," Journal of Machine Learning Research, vol. 9, no. 3, pp. 485–516, 2008.

[18] C. J. Hsieh, M. A. Sustik, I. S. Dhillon, P. K. Ravikumar, and R. Poldrack, "BIG & QUIC: sparse inverse covariance estimation for a million variables," in Advances in Neural Information Processing Systems, pp. 3165–3173, MIT Press, 2013.

[19] C.-J. Hsieh, M. A. Sustik, I. S. Dhillon, and P. Ravikumar, "QUIC: quadratic approximation for sparse inverse covariance estimation," Journal of Machine Learning Research, vol. 15, pp. 2911–2947, 2014.

[20] V. Latora and M. Marchiori, "Efficient behavior of small-world networks," Physical Review Letters, vol. 87, Article ID 198701, 2001.

[21] D. J. Watts and S. H. Strogatz, "Collective dynamics of 'small-world' networks," Nature, vol. 393, no. 6684, pp. 440–442, 1998.

[22] M. E. J. Newman, "Assortative mixing in networks," Physical Review Letters, vol. 89, no. 20, Article ID 208701, 2002.

[23] J. Liu, S. Ji, and J. Ye, SLEP: Sparse Learning with Efficient Projections, vol. 6, Arizona State University, Tempe, Ariz, USA, 2009.
