Research Article
Link-Based Similarity Measures Using Reachability Vectors

Seok-Ho Yoon,1 Ji-Soo Kim,1 Jiwoon Ha,2 Sang-Wook Kim,1 Minsoo Ryu,1 and Ho-Jin Choi3

1 Department of Electronics and Computer Engineering, Hanyang University, Seoul 133-791, Republic of Korea
2 Department of Computer and Software, Hanyang University, Seoul 133-791, Republic of Korea
3 Department of Computer Science, KAIST, Daejeon 305-701, Republic of Korea

Correspondence should be addressed to Sang-Wook Kim; wook@hanyang.ac.kr

Received 31 August 2013; Accepted 28 November 2013; Published 18 February 2014

Academic Editors: S. Amat, L. Martínez, and J. Zhang

Hindawi Publishing Corporation, The Scientific World Journal, Volume 2014, Article ID 741608, 13 pages, http://dx.doi.org/10.1155/2014/741608

Copyright © 2014 Seok-Ho Yoon et al. This is an open access article distributed under the Creative Commons Attribution License, which permits unrestricted use, distribution, and reproduction in any medium, provided the original work is properly cited.

We present a novel approach for computing link-based similarities among objects accurately by utilizing the link information pertaining to the objects involved. We discuss the problems with previous link-based similarity measures and propose a novel approach for computing link-based similarities that does not suffer from these problems. In the proposed approach, each target object is represented by a vector. Each element of the vector corresponds to all the objects in the given data, and the value of each element denotes the weight for the corresponding object. As for this weight value, we propose to utilize the probability of reaching from the target object to the specific object, computed using the “Random Walk with Restart” strategy. Then, we define the similarity between two objects as the cosine similarity of the two vectors. In this paper, we provide examples to show that our approach does not suffer from the aforementioned problems. We also evaluate the performance of the proposed methods in comparison with existing link-based measures, qualitatively and quantitatively, with respect to two kinds of data sets, scientific papers and Web documents. Our experimental results indicate that the proposed methods significantly outperform the existing measures.

1. Introduction

Similarities among objects provide useful information to wide application areas such as ranking Web documents [1], detecting duplicate documents [2], comparing user profiles in e-commerce recommendation systems [3], searching for similar papers in literature databases [4, 5], and the like. Accurate computation of similarities among objects is crucial to the success of these applications [6, 7]. For example, the collaborative filtering technique used in an e-commerce system makes recommendations of goods or products to a user by choosing from the purchase list of those users deemed similar to that user. In order to search for the “similar” users, the system needs to compute the similarities among users [3]. If the similarities are not accurate, the user would get recommendations with unwanted items.

Existing similarity measures can be classified into either content-based or link-based ones [8]. Content-based measures compute similarities among objects by comparing the contents of the objects involved, such as texts and multimedia. In the various types of contents, these measures mainly utilize the textual information, which is easier to analyze than other types. Measures for computing the similarities among objects using textual information are referred to as text-based similarity measures [9, 10]. Cosine similarity, SVD, LDA, LSI-based similarity measures, and χ-Sim belong to this category [11]. In a text-based similarity measure, the similarity between two objects becomes higher in general when the two objects have more words in common.

On the other hand, link-based measures represent the relationships among objects as links and compute the similarities using the link information. The more neighbors two objects have in common, the higher the similarity between the two becomes. Typical link-based measures include Cocitation [12], Bibliographic coupling [13], Amsler [14], SimRank [15], rvs-SimRank [6], and P-Rank [6]. As compared to text-based measures, link-based measures have recently been paid attention to for the merits of language-independency, good performance, and being able to produce results appealing to human intuition [6, 16]. For these reasons, the work in this paper will also focus on the link-based similarity measures.




Among the existing link-based measures, SimRank is well known and has driven a large number of subsequent studies, which proposed variations of SimRank [6, 17] or investigated performance speed-ups of it [18–25]. The basic principle of SimRank states that “two objects are similar if they are related to similar objects.” SimRank computes the similarity between two objects, say $x$ and $y$, by recursively computing the average of all the similarities between every object pointing to $x$ and every object pointing to $y$ [15]. In other words, SimRank can be better explained in terms of what we call the “pair-wise” and the “level-wise” computation models. The pair-wise model computes the similarity between $x$ and $y$ by averaging all the similarities computed between every neighbor of $x$ and every neighbor of $y$, whereas the level-wise model computes the similarity by utilizing only those other objects located at the same distance from $x$ and $y$, level by level.

We believe that any similarity measure, such as SimRank, which uses both the pair-wise and level-wise models cannot accurately compute the similarity between two objects. For instance, the pair-wise model would still compute the similarities between each neighbor of $x$ and each neighbor of $y$ even if the entire set of neighbors of $x$ and $y$ is exactly the same. Consequently, the pair-wise model in general yields biased results in that the similarity between two objects having a large number of links tends to become lower than the similarity between two objects having a small number of links [18, 26]. On the other hand, the level-wise model focuses only on those neighbor objects “at the same level,” that is, those objects located at the same distance from the two objects in the graph. Consequently, this model cannot consider all of the objects linked directly or indirectly to the two target objects.

This paper proposes a new link-based similarity measure that does not suffer from the aforementioned problems with the pair-wise and level-wise models. In the proposed measure, we represent each object, say $x$, as a vector. The elements of the vector (for $x$) correspond to all the objects (including $x$ itself) in the given universe, and the value of each element denotes the weight (with respect to $x$) for the particular object corresponding to the element. To obtain this weight value, we propose to utilize the probability of reaching from the object $x$ to the particular object, computed using the “Random Walk with Restart” strategy. Then, we define the similarity between two objects as the cosine similarity [9] between the two vectors representing the two objects. This approach resembles the text-based similarity measures using the cosine similarity for computing similarity between documents, where a document is represented by a vector, each element of the vector corresponds to each word in the universe of all documents, and the value of each element denotes the frequency of the corresponding word in the target document.

Our approach does not suffer from the problem with the pair-wise model because it computes the similarity between two vectors by multiplying only the values of the corresponding elements in the two vectors, not trying every possible element pair between the two vectors. Moreover, the approach can also consider all the objects linked directly or indirectly to the two target objects, reflecting the degree of “closeness” between objects in the form of reachability between objects. Thus, it does not suffer from the problem with the level-wise model either. In this paper, we will develop two methods to implement our approach. The first method will generate the vectors representing objects using inlinks and outlinks separately and then merge the two vectors to compute the similarity. The second method will not distinguish inlinks from outlinks but convert them together into undirected links to generate the vectors. The effectiveness of both methods will be demonstrated by examples showing that they do not suffer from the problems of the pair-wise and level-wise models.

The paper also evaluates the performance of the proposed methods in comparison with existing link-based measures, qualitatively and quantitatively, using the data sets of scientific papers and Web documents as two exemplary types of data having link information. Our experimental results indicate that the proposed methods generally outperform the existing measures for both types of data.

The rest of this paper is organized as follows. Section 2 summarizes existing link-based similarity measures. Section 3 explains our research motivations, and Section 4 describes the proposed approach and methods in detail. Section 5 presents the experimental results to validate the performance of the proposed methods. Section 6 concludes the paper.

2. Related Work

Existing link-based similarity measures include Cocitation, Bibliographic coupling, Amsler, SimRank, rvs-SimRank, and P-Rank. While Cocitation, Bibliographic coupling, and Amsler were originally devised to deal with scientific papers [6], they have also been applied to other types of data, such as Web documents, which have link information [27, 28]. On the other hand, SimRank, rvs-SimRank, and P-Rank were originally proposed to deal with objects of any kind having link information [6, 15].

Cocitation computes the similarity between two objects based on the number of objects which commonly point to the two. As a result, the similarity between two objects becomes higher as the number of “commonly pointing” objects gets larger [12]. This concept is described as follows, where $x$ and $y$ denote objects, $S(x, y)$ the similarity between $x$ and $y$, and $I(x)$ and $I(y)$ the sets of objects pointing to $x$ and $y$, respectively:

$$S(x, y) = |I(x) \cap I(y)| \tag{1}$$

Bibliographic coupling computes the similarity between two objects based on the number of objects which are commonly pointed to by the two [13]. This is described as follows, where $O(x)$ and $O(y)$ represent the sets of objects pointed to by $x$ and $y$, respectively:

$$S(x, y) = |O(x) \cap O(y)| \tag{2}$$

The above two measures, namely, Cocitation and Bibliographic coupling, are combined together by Amsler, which defines the similarity between two objects as a weighted sum of the two similarities computed by Cocitation and Bibliographic coupling, as described in (3), where $\lambda$ represents the factor to balance the weights between the two similarities involved. In general, $\lambda$ is set to 0.5 to assign equal weights to the two [6, 14]:

$$S(x, y) = \lambda \times |I(x) \cap I(y)| + (1 - \lambda) \times |O(x) \cap O(y)| \tag{3}$$

Figure 1: An example graph.

The concepts of these three measures can be illustrated using the graph in Figure 1, which represents objects as nodes and reference relationships among objects as links. The similarity between objects $g$ and $h$ in the graph is 2 when computed by Cocitation, since there exist two objects (i.e., $d$ and $e$) commonly pointing to $g$ and $h$. On the other hand, the similarity is 1 when computed by Bibliographic coupling, since there exists only one object (i.e., $k$) commonly pointed to by $g$ and $h$. Finally, Amsler would compute the similarity between $g$ and $h$ to be 1.5, assuming the same weights ($\lambda = 0.5$) for Cocitation and Bibliographic coupling (i.e., $1.5 = 0.5 \times 2 + 0.5 \times 1$ using (3)).
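The arithmetic of this example can be checked with a minimal sketch of equations (1) to (3). The edge list below is an assumption: it contains only the links that the text mentions for $g$ and $h$, not the full graph of Figure 1, and the function names are illustrative.

```python
# Links that the text mentions for g and h in Figure 1.
# This is a partial, assumed edge list, not the full figure.
edges = [("d", "g"), ("d", "h"), ("e", "g"), ("e", "h"), ("g", "k"), ("h", "k")]

def in_neighbors(node):
    """I(node): the set of objects pointing to node."""
    return {src for src, dst in edges if dst == node}

def out_neighbors(node):
    """O(node): the set of objects pointed to by node."""
    return {dst for src, dst in edges if src == node}

def cocitation(x, y):
    """Equation (1): number of objects pointing to both x and y."""
    return len(in_neighbors(x) & in_neighbors(y))

def bibliographic_coupling(x, y):
    """Equation (2): number of objects pointed to by both x and y."""
    return len(out_neighbors(x) & out_neighbors(y))

def amsler(x, y, lam=0.5):
    """Equation (3): weighted sum of the two counts above."""
    return lam * cocitation(x, y) + (1 - lam) * bibliographic_coupling(x, y)

print(cocitation("g", "h"), bibliographic_coupling("g", "h"), amsler("g", "h"))
# prints: 2 1 1.5
```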

Note that the similarities computed by Cocitation, Bibliographic coupling, and Amsler in general tend to become larger as the number of links among the objects becomes larger. This effect can be normalized by dividing the similarities by the size of the union of the sets of objects pointing to (or pointed to by) the two objects, which yields a similarity between 0 and 1 [29].
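One way to write this normalization for Cocitation, assuming it takes the usual Jaccard form (the text does not spell out the formula), is the following. The Bibliographic coupling case replaces $I$ with $O$, and $S_{\mathrm{norm}}$ is a label introduced only for this illustration:

$$S_{\mathrm{norm}}(x, y) = \frac{|I(x) \cap I(y)|}{|I(x) \cup I(y)|}$$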

SimRank uses the concept that “two objects are similar if they are related to similar objects.” In the graph of Figure 1, for example, the similarity between objects $h$ and $i$ would be computed as zero, that is, interpreted as “not similar at all,” using Cocitation, since no object exists which commonly points to both $h$ and $i$. Nevertheless, these two objects could be seen as similar to some degree in that $h$ and $i$ are separately pointed to by two “similar” objects, $e$ and $f$, respectively, because $e$ and $f$ are both pointed to by $b$. SimRank exploits such a concept by recursively computing the similarity between two objects, say $x$ and $y$, as the average of all the similarities between every object pointing to $x$ and every object pointing to $y$. This concept of SimRank is described by (4), where $x$ and $y$ denote objects, $S(x, y)$ the similarity between $x$ and $y$, and $I(x)$ and $I(y)$ the sets of objects pointing to $x$ and $y$, respectively; $I_i(x)$ is the $i$th object in the list of objects pointing to $x$, and $C$ is the decay factor having a value between 0 and 1. The decay factor reduces the weights of the computed similarity as the iterations get deeper. As shown by the following, SimRank yields the similarity as a value between 0 and 1 by normalizing the summation of the similarities of all pairs of objects in the Cartesian product of the two sets $I(x)$ and $I(y)$ by its cardinality [15]:

$$S(x, y) = \frac{C}{|I(x)||I(y)|} \sum_{i=1}^{|I(x)|} \sum_{j=1}^{|I(y)|} S(I_i(x), I_j(y)) \tag{4}$$

In a sense, SimRank expands Cocitation to a broader scope of neighbor objects in the similarity definition, so as to count not just the adjacent objects directly linked to the two target objects (as in Cocitation) but also to consider the effects of all other objects indirectly linked (through the recursive computation). In a similar manner, [6] expands Bibliographic coupling to yield rvs-SimRank and Amsler to yield P-Rank. The rvs-SimRank is expressed by (5), which differs from (4) of SimRank only in that it uses outlinks instead of inlinks. The P-Rank is expressed by (6), which computes the similarity as a weighted sum of the two similarities obtained by SimRank and rvs-SimRank, respectively, in each iteration step. Consider the following:

$$S(x, y) = \frac{C}{|O(x)||O(y)|} \sum_{i=1}^{|O(x)|} \sum_{j=1}^{|O(y)|} S(O_i(x), O_j(y)) \tag{5}$$

$$S(x, y) = \lambda \times \frac{C}{|I(x)||I(y)|} \sum_{i=1}^{|I(x)|} \sum_{j=1}^{|I(y)|} S(I_i(x), I_j(y)) + (1 - \lambda) \times \frac{C}{|O(x)||O(y)|} \sum_{i=1}^{|O(x)|} \sum_{j=1}^{|O(y)|} S(O_i(x), O_j(y)) \tag{6}$$

In another direction, various approaches have sought to improve the accuracy of existing measures [4, 18, 26]. Reference [18] proposes to apply the Jaccard coefficient to SimRank in order to remedy the phenomenon that the similarities tend to become lower among the objects having a larger number of links. Reference [26] proposes to improve the accuracy of SimRank by taking the average of the similarities only between the maximally matching neighbor objects across the two groups associated with the two target objects, in order to resolve the problem indicated in [18].

Many approaches have also been investigated to improve the speed of existing measures [18, 20, 21]. Reference [18] suggests improving the performance of SimRank by first constructing a fingerprint tree for each object and then using such trees to approximate the similarity to be obtained by SimRank. Reference [21] proposes to reduce the time and space complexity of SimRank by utilizing a tree structure called SimTree, which allows storing directly the similarities among similar objects but computing the similarities among dissimilar objects using the path information of the tree. Reference [20] aims to compute the similarity between two objects using SimRank online in real time by considering only those objects related directly to the two target objects, rather than computing the similarities involving all objects. Reference [24] investigates a method to run SimRank in parallel using GPGPU (general-purpose computation on graphics processors) and a method to approximately compute the similarity in a dynamic graph using uncoupling Markov chains.

3. Motivation

In this section, we discuss the problems with existing link-based similarity measures. First, we cast existing measures into a method combining two computation models, which we will call “pair-wise” and “level-wise,” and explain the inherent difficulties in these models. We then analyze and illustrate the limitations of three representative existing methods, namely, rvs-SimRank, SimRank, and P-Rank, showing that each of these methods actually combines the pair-wise and level-wise models.

3.1. Pair-Wise and Level-Wise Computation Models. In the graph of Figure 1, again, SimRank would compute the similarity between $k$ and $l$ by taking the average of the four similarities obtained from the four pairs $(g, h)$, $(g, i)$, $(h, h)$, and $(h, i)$, namely, the Cartesian product of the set $\{g, h\}$, the objects directly “in-linked” to $k$, with the set $\{h, i\}$, the objects directly “in-linked” to $l$. We call such a way of pairing and averaging out the similarities for all such pairs the “pair-wise” computation model. In this example, the similarity between $k$ and $l$, $S(k, l)$, becomes the average of $S(g, h)$, $S(g, i)$, $S(h, h)$, and $S(h, i)$. Here, $S(h, h) = 1$ by definition, and $S(g, h)$, $S(g, i)$, and $S(h, i)$ should also be computed in turn by SimRank in a recursive manner using the same pair-wise model. That is, $S(g, h)$ is computed from the Cartesian product of $g$’s neighbors $\{d, e\}$ and $h$’s neighbors $\{d, e\}$, $S(g, i)$ from $g$’s neighbors $\{d, e\}$ and $i$’s neighbors $\{f\}$, and $S(h, i)$ from $h$’s neighbors $\{d, e\}$ and $i$’s neighbors $\{f\}$, using the pair-wise model. In this way, the recursive process continues and gets deeper, using all the objects linked directly or indirectly to the two target objects for which the similarity is being computed.
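As a worked instance of the pair-wise model, substituting $I(k) = \{g, h\}$ and $I(l) = \{h, i\}$ into (4) gives the expansion below. The decay factor $C$ comes from (4); the description above speaks of the plain average of the four terms.

$$S(k, l) = \frac{C}{2 \times 2} \left[ S(g, h) + S(g, i) + S(h, h) + S(h, i) \right], \qquad S(h, h) = 1$$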

On the other hand, such a model can be seen as computing the similarity between two objects by utilizing only those other objects located at the same distance from the two targets, level by level. In Figure 1, for example, the model computes the similarity between $k$ and $l$ by utilizing the objects with distance 1 (namely, nodes $g$, $h$, and $i$) in the first round, then the objects with distance 2 (namely, $d$, $e$, and $f$) in the second round, and then the objects with distance 3 (namely, $a$, $b$, and $c$) in the third round. We call such a process the “level-wise” computation model.

3.2. Problem of Pair-Wise Computation Model. Figure 2 illustrates the problem of the pair-wise model, which induces the phenomenon that the similarities among objects having more links tend to become lower than those among objects having fewer links. In the graphs of Figure 2, again, nodes represent objects and links represent the reference relationships among objects. For example, objects $a$ and $b$ in Figure 2(a) both refer to the same three objects $c$, $d$, and $e$, all of which in turn refer to the same two objects $f$ and $g$. Similarly, objects $a'$ and $b'$ in Figure 2(b) both refer to the same objects, all of which in turn refer to the same objects. The only difference between these two graphs is that the numbers of links from objects $a'$ and $b'$ are larger (i.e., six links each) than those from objects $a$ and $b$ (i.e., three links each). Intuitively, the similarity between $a$ and $b$ should be the same as that between $a'$ and $b'$, by observing that $a$ and $b$ both refer to the same objects and also $a'$ and $b'$ both refer to the same objects. Existing methods, however, produce different results.

Table 1: Similarities between $a$ and $b$ and between $a'$ and $b'$ in Figure 2.

| | rvs-SimRank | SimRank | P-Rank |
| $s(a, b)$ | 0.396 | 0 | 0.423 |
| $s(a', b')$ | 0.321 | 0 | 0.408 |

Figure 2: Example graphs to illustrate the problem with the pair-wise model: (a) graph with objects $a$ to $g$; (b) graph with objects $a'$ to $j'$.

Table 1 shows the resulting similarities between $a$ and $b$ and between $a'$ and $b'$, as computed by three existing methods, rvs-SimRank, SimRank, and P-Rank, assuming $C$ and $\lambda$ to be 0.7 and 0.5 in formulas (4), (5), and (6). Two entries of the table indicate that the computed similarities between $a$ and $b$ are higher than those between $a'$ and $b'$, which is counterintuitive. Considering that the difference in the numbers of links between the two graphs in this example is only as small as four, one can imagine that this phenomenon would become clearer as the numbers of links get larger, for example, with such data sets as scientific papers or Web documents. In the papers data set, for example, well-known papers tend to have many reference links because they will in general get substantially larger numbers of citations than ordinary papers, and similarly, in the Web documents data set, portal sites tend to have many links because they are referenced more frequently than ordinary sites. We envisage that, for these domains, those methods which use the pair-wise computation model cannot compute the similarity between objects accurately.

Figure 3: An example graph to illustrate the problem with the level-wise model.

3.3. Problem of Level-Wise Computation Model. The level-wise computation model computes the similarity between two objects by utilizing only those objects in the graph at the same distance from the two. The problem with this model can be illustrated using Figure 3, where, again, nodes represent objects and directed links represent the reference relationships between objects. Intuitively, we can say that two objects $a$ and $b$ are similar to some degree, as they share, though indirectly, an object $c$ in common. All of the aforementioned three methods, however, compute the similarity between these two objects $a$ and $b$ to be 0. For this case, one would expect to compute the similarity by considering all objects directly and indirectly linked to $a$ (namely, $c$) and all objects directly and indirectly linked to $b$ (namely, $c$, $d$, and $e$). In fact, however, any method which utilizes the level-wise model will compute $s(a, b)$ by only using $c$ (as $a$’s neighbor) and $d$ (as $b$’s neighbor). Here, $s(c, d)$ should also be computed by taking the average of the similarities for all possible pairs across $c$’s neighbor objects and $d$’s neighbor objects. In this case, however, $c$’s neighbor does not exist, yielding $s(c, d)$ to be 0, even though there still remain $d$’s neighbors, $c$ and $e$, which should also be counted by some means. After all, $s(a, b)$ will also become 0, indicating that they are not similar at all. From this example, we conclude that the level-wise model cannot compare the objects not located at the same distance and consequently cannot compute the similarity properly.
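Both problems discussed in Sections 3.2 and 3.3 can be checked numerically with a small sketch of rvs-SimRank, equation (5), using $C = 0.7$. The edge lists are assumptions read off the descriptions of Figures 2 and 3 (in particular, that $a'$ and $b'$ each point to six objects which all point to $i'$ and $j'$, and that Figure 3 has the links $a \to c$, $b \to d$, $d \to c$, and $d \to e$); the function name and the fixed iteration count are illustrative choices.

```python
from itertools import product

def rvs_simrank(out_links, C=0.7, iterations=10):
    """Iterate equation (5): pair-wise average over out-neighbors, scaled by C."""
    nodes = sorted(out_links)
    sim = {(x, y): 1.0 if x == y else 0.0 for x in nodes for y in nodes}
    for _ in range(iterations):
        new = {}
        for x, y in product(nodes, repeat=2):
            if x == y:
                new[(x, y)] = 1.0
            elif not out_links[x] or not out_links[y]:
                # No out-neighbors on one side: nothing to pair, similarity is 0.
                new[(x, y)] = 0.0
            else:
                total = sum(sim[(p, q)] for p in out_links[x] for q in out_links[y])
                new[(x, y)] = C * total / (len(out_links[x]) * len(out_links[y]))
        sim = new
    return sim

# Figure 2(a): a and b refer to c, d, e, which all refer to f and g.
fig2a = {"a": ["c", "d", "e"], "b": ["c", "d", "e"],
         "c": ["f", "g"], "d": ["f", "g"], "e": ["f", "g"], "f": [], "g": []}

# Figure 2(b), assumed layout: a' and b' refer to six objects that refer to i' and j'.
mids = ["c'", "d'", "e'", "f'", "g'", "h'"]
fig2b = {"a'": mids, "b'": mids, "i'": [], "j'": []}
fig2b.update({m: ["i'", "j'"] for m in mids})

# Figure 3, assumed layout: a -> c, b -> d, d -> c, d -> e.
fig3 = {"a": ["c"], "b": ["d"], "c": [], "d": ["c", "e"], "e": []}

s_fig2a = rvs_simrank(fig2a)[("a", "b")]
s_fig2b = rvs_simrank(fig2b)[("a'", "b'")]
s_fig3 = rvs_simrank(fig3)[("a", "b")]
print(f"{s_fig2a:.4f} {s_fig2b:.4f} {s_fig3:.4f}")
```

Under these assumptions the first two values agree with the rvs-SimRank column of Table 1 to within 0.001, and the Figure 3 similarity is exactly 0 because $c$ has no out-neighbors to pair with those of $d$.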

4. Proposed Methods

4.1. Main Concepts. In this section, we propose a novel link-based similarity measure. Our approach differs from the existing link-based methods in that we will not define the measure by combining the pair-wise and level-wise models but use the concept of the cosine similarity. As expressed by (7), the cosine similarity is a measure for computing a similarity between a pair of vectors. Thus, in computing a similarity between objects using the cosine similarity, the features of an object are represented by the elements of a vector, and the weight of each feature is captured by the value of each element. For example, in the case of computing a similarity between documents using cosine similarity, each document is represented by a vector, the words in the given universe of documents are denoted by the elements of the vector, and the frequency of each word in the document is indicated by the value of the corresponding element. Then, the similarity between two documents is computed by the similarity between the two corresponding vectors [9]:

$$s(x, y) = \frac{x \cdot y}{\|x\| \|y\|} = \frac{\sum_{i=1}^{n} (x_i \times y_i)}{\sqrt{\sum_{i=1}^{n} (x_i)^2} \times \sqrt{\sum_{i=1}^{n} (y_i)^2}} \tag{7}$$
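Equation (7) can be illustrated with a short sketch on two word-frequency vectors. The vectors are made-up values for a hypothetical four-word vocabulary, and returning 0 for an all-zero vector is a convention chosen here, not something the text specifies.

```python
import math

def cosine_similarity(x, y):
    """Equation (7): dot product divided by the product of the vector norms."""
    dot = sum(xi * yi for xi, yi in zip(x, y))
    norm_x = math.sqrt(sum(xi * xi for xi in x))
    norm_y = math.sqrt(sum(yi * yi for yi in y))
    if norm_x == 0 or norm_y == 0:
        return 0.0  # convention for an all-zero vector (an assumption)
    return dot / (norm_x * norm_y)

# Hypothetical word counts of two documents over the same four-word vocabulary.
doc1 = [2, 1, 0, 1]
doc2 = [1, 1, 1, 0]
similarity = cosine_similarity(doc1, doc2)
print(round(similarity, 4))  # 3 / (sqrt(6) * sqrt(3)), about 0.7071
```

Only elements at the same position are multiplied, which is the property the proposed approach relies on to avoid the pair-wise pairing of every neighbor with every other neighbor.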

In our approach, each object is represented as a vector, all other objects in the given universe as the elements of the vector, and some “weight value” for each object as the value of the corresponding element in the vector. As for the cosine similarity used by the text-based measures discussed above, where words of higher frequency get larger weight values by being treated as better characterizing features for a document, we need to define a measure to quantify how well the target object is characterized by each object in the universe. For this measure, we propose to use the degree of how close the two objects are located from the topological point of view.

Sun et al. [30] proposed a method for computing the probability of reaching one object from another, as the relevance between the two objects, using the “Random Walk with Restart (RWR)” strategy. We adopt this strategy and use the reachability to an object from the target object as the weight value of the corresponding element of the vector representing the target object. Reachability becomes higher as the distance between the two objects becomes shorter and also as more paths exist between the two. Computing reachability using RWR is expressed by the following, where $P_A$ represents the column-normalized adjacency matrix describing the connectedness among objects, $u_a$ a vector holding the reachability to each node starting from $a$, $q_a$ a restart vector having the value 1 only for the starting node $a$ and 0 for the rest, and $c$ the restart probability:

$$
u_a = (1 - c) P_A u_a + c \, q_a \tag{8}
$$
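The fixed point of (8) can be approximated by simple iteration. The sketch below is a minimal illustration under stated assumptions rather than the authors' implementation: the helper names `column_normalize` and `rwr`, the stopping tolerance, and the three-node example graph (an undirected path a, b, c) are all hypothetical, and the restart probability 0.15 is the value the paper reports using in its experiments.

```python
def column_normalize(adj):
    """Scale each column of a square matrix to sum to 1 (all-zero columns stay zero)."""
    n = len(adj)
    col_sums = [sum(adj[i][j] for i in range(n)) for j in range(n)]
    return [[adj[i][j] / col_sums[j] if col_sums[j] else 0.0 for j in range(n)]
            for i in range(n)]


def rwr(P, start, c=0.15, tol=1e-12, max_iter=1000):
    """Iterate u <- (1 - c) * P u + c * q until u changes by less than tol."""
    n = len(P)
    q = [1.0 if i == start else 0.0 for i in range(n)]  # restart vector q_a
    u = q[:]
    for _ in range(max_iter):
        new_u = [(1 - c) * sum(P[i][j] * u[j] for j in range(n)) + c * q[i]
                 for i in range(n)]
        if max(abs(x - y) for x, y in zip(new_u, u)) < tol:
            return new_u
        u = new_u
    return u


# Hypothetical undirected path a - b - c, written as a symmetric adjacency matrix.
ADJ = [[0, 1, 0],
       [1, 0, 1],
       [0, 1, 0]]
P = column_normalize(ADJ)
u = rwr(P, start=0)
print(u)
```

In this toy graph every column of `P` sums to 1, so the reachability values form a probability distribution, and node b (one hop from a) receives a larger value than node c (two hops away).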

Our approach does not suffer from the problem with the pair-wise model because it computes the similarity between two vectors by multiplying only the values of the corresponding elements in the two vectors, rather than trying every possible element pair between the two vectors. Moreover, the approach generates the vectors by considering all the objects linked directly or indirectly to the two target objects. This mechanism is different from the pair-wise and level-wise models and does not suffer from the problems discussed in the previous section.

In this paper, we develop two methods to implement our approach. The two methods differ only in the way they compute the weights in the vector. One method computes the weights by first computing two values of reachability separately, one using only “inlinks” and the other using only “outlinks,” and then combining them as a weighted sum. This method provides flexibility to a problem domain by allowing different weights for reachability using inlinks versus outlinks. The method, however, cannot compute reachability properly for objects located close to the target object, because links in both directions are not used together. In Figure 1, for example, nodes $a$ and $b$ are located close to each other, but the reachability would become 0 when computed using only outlinks (or using only inlinks) from $a$ or $b$. Another problem is the difficulty of determining the appropriate weights for a vector generated purely using the inlinks or purely using the outlinks. Thus, we develop the second method, which computes reachability by ignoring the directions of inlinks and outlinks and converting them to undirected links before computing reachability. This method is advantageous in that it can compute appropriate reachability for every object. In the rest of this paper, we will call these two methods the “weightedSum” method and the “undirected” method, respectively. In the weightedSum method in particular, the proper balance between the weights for inlinks and outlinks needs to be found domain by domain through experimentation.

Figure 4: The procedure of generating a vector using the weightedSum method. (Panels: the Inlink and Outlink adjacency matrices of a four-node example graph with nodes a, b, c, and d, their normalized forms, and the two resulting vectors combined by a weighted sum.)

4.2. Procedure of the Proposed Methods. Basically, both of the proposed methods (1) construct vectors by computing reachability from the target object to all other objects and (2) compute the similarity between vectors using the cosine similarity. The two methods differ only in the process of generating vectors, as illustrated by Figure 4 (for the weightedSum method) and Figure 5 (for the undirected method). The weightedSum method generates vectors in the following manner. First, two adjacency matrices are built, one with inlinks and one with outlinks, and each is column-normalized such that the sum of all values in a column becomes 1. Second, a vector is generated from each normalized matrix, with the weights assigned to its elements by computing the reachability from the target object to every object using the RWR strategy; the two vectors are then combined as a weighted sum. Similarly, the undirected method generates a vector by constructing a normalized adjacency matrix, this time ignoring the link directions, and assigning the reachability values to the elements of the vector in the same manner.
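The two procedures can be sketched end to end on the four-node example graph read from the adjacency matrices of Figures 4 and 5 (a→b, a→c, b→d, c→d). This is a hedged illustration, not the authors' code: all function names are hypothetical, the inlink matrix is assumed to be the transpose of the outlink matrix as the figure residue suggests, a fixed iteration count replaces a convergence test, and the sketch does not attempt to reproduce the vector values printed in the figures.

```python
import math


def column_normalize(adj):
    """Scale each column to sum to 1 (all-zero columns stay zero)."""
    n = len(adj)
    sums = [sum(adj[i][j] for i in range(n)) for j in range(n)]
    return [[adj[i][j] / sums[j] if sums[j] else 0.0 for j in range(n)]
            for i in range(n)]


def rwr(P, start, c=0.15, iterations=200):
    """Fixed number of iterations of u <- (1 - c) * P u + c * q."""
    n = len(P)
    q = [1.0 if i == start else 0.0 for i in range(n)]
    u = q[:]
    for _ in range(iterations):
        u = [(1 - c) * sum(P[i][j] * u[j] for j in range(n)) + c * q[i]
             for i in range(n)]
    return u


def transpose(m):
    return [list(col) for col in zip(*m)]


def weighted_sum_vector(out_adj, start, w_in=0.5, c=0.15):
    """weightedSum method: RWR on each directed matrix, then a weighted sum."""
    u_in = rwr(column_normalize(transpose(out_adj)), start, c)  # matrix labelled Inlink
    u_out = rwr(column_normalize(out_adj), start, c)            # matrix labelled Outlink
    return [w_in * a + (1 - w_in) * b for a, b in zip(u_in, u_out)]


def undirected_vector(out_adj, start, c=0.15):
    """undirected method: drop link directions, then a single RWR."""
    n = len(out_adj)
    sym = [[1 if out_adj[i][j] or out_adj[j][i] else 0 for j in range(n)]
           for i in range(n)]
    return rwr(column_normalize(sym), start, c)


def cosine(x, y):
    dot = sum(a * b for a, b in zip(x, y))
    nx = math.sqrt(sum(a * a for a in x))
    ny = math.sqrt(sum(b * b for b in y))
    return dot / (nx * ny) if nx and ny else 0.0


# Row i, column j is 1 when node i links to node j; node order is a, b, c, d.
OUT = [[0, 1, 1, 0],
       [0, 0, 0, 1],
       [0, 0, 0, 1],
       [0, 0, 0, 0]]
B, C_NODE = 1, 2
sim_ws = cosine(weighted_sum_vector(OUT, B), weighted_sum_vector(OUT, C_NODE))
sim_un = cosine(undirected_vector(OUT, B), undirected_vector(OUT, C_NODE))
print(sim_ws, sim_un)
```

Nodes b and c have no direct link, yet both sketches assign them a nonzero similarity because their vectors share weight on the common neighbors a and d.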

4.3. Complexity Analysis. The complexity of the proposed methods for computing similarities for all pairs of given objects can be analyzed as follows. First, the time complexity of generating a vector for one object using the RWR strategy is $O(ke)$ [31], where $e$ represents the number of links and $k$ the number of iterations of the matrix calculation needed to obtain converged values of reachability. Thus, the time complexity of generating the vectors for all $n$ objects is $O(kne)$, where $n$ represents the number of objects. It is generally known that a constant number of such iterations is sufficient to obtain converged values of reachability [6, 15]. Consequently, the time complexity is reduced to $O(ne)$. On the other hand, the time complexity of the similarity calculation process is $O(n^3)$, since each of the $O(n^2)$ object pairs requires a cosine similarity over vectors of length $n$. Combining them, the overall time complexity of both methods becomes $O(n^3) + O(ne) = O(n^3)$. In practice, the weightedSum method will require about twice the time of the undirected method because the former computes reachability separately for each direction of the links.

As for space complexity, $O(e)$ space is required for storing the matrix that represents the relationships between objects, $O(n^2)$ for storing the vectors, and $O(n^2)$ for storing the similarity measures between objects. Combining them, the overall space complexity becomes $O(n^2)$. Again, in practice, the weightedSum method will require about twice the space of the undirected method.

Figure 5: The procedure of generating a vector using the undirected method. (Panels: the Inlink and Outlink adjacency matrices of the four-node example graph with nodes a, b, c, and d, the combined Undirected-link matrix, its normalized form in which every nonzero entry is 0.5, and the resulting vector.)

In comparison, the time complexity of the three existing methods, namely, rvs-SimRank, SimRank, and P-Rank, is known to be $O(n^4)$ and their space complexity $O(n^2)$ [6]. We believe the existing methods require more computation than our proposed methods because they adopt the pair-wise computation model in principle.

Moreover, the existing methods cannot compute the similarity between two given objects independently, because they need to know the similarities among all objects (whether connected directly or indirectly to the target object) in order to compute the similarity between two target objects [20]. Our approach, however, does not require the similarities among all objects: vectors for individual objects can be generated independently, and the similarity between any pair of objects can be computed separately, that is, in parallel. The time complexity of this parallelized version would become $O(n^3/m)$ if $m$ processors are utilized in parallel.

When we need to compute the similarity between only a particular pair of objects on-line, rather than computing the similarities among all objects [20], our methods can be made even more efficient by performing the computation of reachability through the inverse matrix, as suggested by (9). That is, the vector for a given object can be generated by multiplying $q_a$ by the inverse matrix of $Q$ [30, 31]. The time complexity of the off-line computation of the inverse matrix is $O(n^{2.376})$ [32], and the complexity of the on-line computation is $O(n)$. Such an off-line approach tends to require a relatively long time and more space to handle the inverse matrix. Reference [31] suggests an approach using low-rank approximation in order to keep a balance between the off-line and on-line computation. This approach has advantages in both time and space complexity, suggesting an improvement for on-line execution of our proposed approach. In conclusion, the time complexity of our methods is lower than that of the existing methods, and the space complexity is equal to that of the existing methods in any case.

$$
\begin{aligned}
u_a &= (1 - c) P_A u_a + c \, q_a \\
    &= c \left( I - (1 - c) P_A \right)^{-1} q_a \\
    &= c \, Q^{-1} q_a, \qquad \text{where } Q = I - (1 - c) P_A
\end{aligned} \tag{9}
$$
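To make the relation between the iterative form and the closed form in (9) concrete, the sketch below evaluates both on the normalized undirected matrix of the four-node example and compares the results. It is an illustration under assumptions: the function names are hypothetical, and instead of forming $Q^{-1}$ explicitly the code solves the equivalent linear system $Q u_a = c \, q_a$ with a small Gauss-Jordan routine written for this example.

```python
def solve(A, b):
    """Solve A x = b by Gauss-Jordan elimination with partial pivoting."""
    n = len(A)
    M = [row[:] + [b[i]] for i, row in enumerate(A)]  # augmented matrix
    for col in range(n):
        pivot = max(range(col, n), key=lambda r: abs(M[r][col]))
        M[col], M[pivot] = M[pivot], M[col]
        for r in range(n):
            if r != col:
                factor = M[r][col] / M[col][col]
                for k in range(col, n + 1):
                    M[r][k] -= factor * M[col][k]
    return [M[i][n] / M[i][i] for i in range(n)]


def rwr_closed_form(P, start, c=0.15):
    """Evaluate u = c * Q^{-1} q with Q = I - (1 - c) P by solving Q u = c q."""
    n = len(P)
    Q = [[(1.0 if i == j else 0.0) - (1 - c) * P[i][j] for j in range(n)]
         for i in range(n)]
    cq = [c if i == start else 0.0 for i in range(n)]
    return solve(Q, cq)


def rwr_iterative(P, start, c=0.15, iterations=300):
    """Iterate u <- (1 - c) * P u + c * q a fixed number of times."""
    n = len(P)
    q = [1.0 if i == start else 0.0 for i in range(n)]
    u = q[:]
    for _ in range(iterations):
        u = [(1 - c) * sum(P[i][j] * u[j] for j in range(n)) + c * q[i]
             for i in range(n)]
    return u


# Column-normalized undirected matrix of the cycle a - b - d - c - a (order a, b, c, d).
P = [[0.0, 0.5, 0.5, 0.0],
     [0.5, 0.0, 0.0, 0.5],
     [0.5, 0.0, 0.0, 0.5],
     [0.0, 0.5, 0.5, 0.0]]
closed = rwr_closed_form(P, start=0)
iterative = rwr_iterative(P, start=0)
print(closed)
print(iterative)
```

Both routes yield the same vector up to numerical error; the closed form pays its cost once, whereas the iteration is repeated for every start node.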

4.4. Discussions. The proposed methods do not suffer from the problems of the pair-wise and level-wise models discussed in Section 3. Let us compute the similarities between $a$ and $b$ and between $a'$ and $b'$ in Figure 2 using our approach and check whether the results appeal to our intuition. We assume $\lambda$ and $C$ to be 0.5 and 0.7, respectively. Table 2 shows the similarity results computed by the two proposed methods, together with those obtained by the three existing methods (shown already in Table 1) for comparison. Both of our methods produce the same result: the similarity between $a$ and $b$ and that between $a'$ and $b'$ are both 1, meaning that the two objects in each pair are regarded as the same. These results appeal to our intuition. In comparison, the results obtained by the existing methods indicate that the two objects are not the same. Since our approach does not use the pair-wise model but performs the computations among identical features, the proposed methods produce results more appealing to our intuition than the existing methods.

Table 2: Similarities between $a$ and $b$ and between $a'$ and $b'$ using the proposed methods and existing methods in Figure 2.

| | rvs-SimRank | SimRank | P-Rank | WeightedSum | Undirected |
|---|---|---|---|---|---|
| $s(a, b)$ | 0.396 | 0 | 0.423 | 1 | 1 |
| $s(a', b')$ | 0.321 | 0 | 0.408 | 1 | 1 |

The proposed methods do not suffer from the problem of the level-wise model either. Let us compute the similarity between $a$ and $b$ in Figure 3 using our approach and check whether the results appeal to our intuition. Assuming $\lambda$ and $C$ to be 0.5 and 0.7, respectively, our approach computes the similarity between $a$ and $b$ to be 0.28 using the weightedSum method and 0.6 using the undirected method. For comparison, the existing methods would compute the similarity to be 0, as discussed in Section 3. We cannot tell whether the results obtained by our approach are appropriate, because similarity is by nature a subjective value. Still, we can at least argue that our approach produces results that are more appealing to our intuition than those of the existing methods; that is, the similarity between $a$ and $b$ must not be 0 because they are related indirectly by some common objects. This difference comes from our strategy of considering all the objects indirectly connected to the target objects at once, rather than using the level-wise model.

5. Experiments

In this section, we verify the effectiveness of our approach through experimentation. We show and analyze quantitatively the experimental results obtained from applying the two proposed methods to practical application domains.

5.1. Experimentation Setup. We carried out experiments to verify the performance of the proposed methods with respect to scientific papers and Web documents, two exemplary types of data sets having link information. For the experiments with the scientific papers, we used a data set consisting of the papers downloaded from DBLP (http://www.informatik.uni-trier.de/~ley/db) and the reference information among the papers crawled from Libra (http://academic.research.microsoft.com). In total, 44,800 papers and 126,281 references were used. For the experiments with the Web documents, we used a data set consisting of 1,227,038 Web pages with 11,164,829 hyperlinks in total, taken from the TREC (http://trec.nist.gov/data/t11.web.html) 2002 data. The experiments were carried out on a platform with a Quad Core 2.67 GHz CPU and the Windows 2008 Server OS.

In the experiments, we aimed to evaluate the performance of the two proposed methods (weightedSum and undirected) in comparison with the three existing methods (rvs-SimRank, SimRank, and P-Rank), with the input values of 0.8, 0.5, and 0.15 for $C$, $\lambda$, and the restart probability, respectively, as used frequently in the existing methods [6, 15, 30].

The experiments proceeded as follows. For the weightedSum method, we first determined appropriate weights for the vectors generated with inlinks and outlinks by varying the weights and finding the ones that achieve the highest accuracy. In the experiments, we tried the weight values of 0, 0.1, 0.3, 0.5, 0.7, 0.9, and 1 for inlinks (note that the sum of the inlink and outlink weights equals 1). Second, we evaluated the accuracy of each method qualitatively by examining the 10 objects that the method computes as the most similar to a target object chosen arbitrarily. Finally, we measured the accuracy of each method quantitatively by comparing the obtained results with the true answers for each data set.

The measurement of the accuracy proceeded as follows. First, we chose one object in turn as the target object from the answer set. Then, we computed the recall [29] by extracting the $m$ objects (where $m$ can be 10, 20, 30, 40, and 50) most similar to the target object according to each method. This process was repeated until every object in the set had been chosen as the target object. The average of all recall values obtained in this way was taken as the final accuracy value.
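The accuracy measurement described above can be sketched as follows. This is a minimal illustration under assumptions, not the evaluation code used in the paper: the function names, the toy similarity function, and the choice to exclude the target itself from both the candidates and the relevant set are hypothetical.

```python
def recall_at_m(ranked, relevant, m):
    """Fraction of the relevant objects found among the first m ranked objects."""
    if not relevant:
        return 0.0
    hits = sum(1 for obj in ranked[:m] if obj in relevant)
    return hits / len(relevant)


def average_recall(similarity, answer_set, all_objects, m):
    """Use each member of the answer set as the target in turn and average recall."""
    recalls = []
    for target in answer_set:
        candidates = [o for o in all_objects if o != target]
        ranked = sorted(candidates, key=lambda o: similarity(target, o), reverse=True)
        relevant = set(answer_set) - {target}  # assumed: the target does not count
        recalls.append(recall_at_m(ranked, relevant, m))
    return sum(recalls) / len(recalls)


def toy_similarity(a, b):
    """Hypothetical similarity: objects with closer integer ids are more similar."""
    return 1.0 / (1 + abs(a - b))


objects = list(range(6))
answers = [0, 1, 2]
accuracy_m2 = average_recall(toy_similarity, answers, objects, m=2)
accuracy_m3 = average_recall(toy_similarity, answers, objects, m=3)
print(accuracy_m2, accuracy_m3)
```

Replacing `toy_similarity` with a function that compares two reachability vectors would give the accuracy value of one method for one answer set and one value of $m$.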

The answers for each data set were constructed in the following manner. For the experiments with scientific papers, we selected five well-known areas (i.e., clustering, sequential pattern mining, spatial databases, link mining, and graph pattern mining) in a data mining textbook [29] and obtained the papers referenced in each section of these areas. We supposed that the reference papers in the same section are similar to one another. Thus, the papers in a section formed an answer set. Each answer set had 3 to 14 papers, and the total number of papers in all such answer sets was 106. For the experiments with Web documents, we used TREC 2002, which provides Web document sets related to specific keywords. We chose 9 Web document sets randomly from TREC 2002 and used them as the answer sets.

5.2. Domain of Scientific Papers. Figure 6 depicts the accuracy of the similarities among scientific papers obtained by the weightedSum method using various weight values for the vectors generated with inlinks and outlinks. The $x$-axis represents the number of “the most similar” papers selected by the method and the $y$-axis the accuracy. Annotations such as “0.0 : 1.0” represent the weight balance between inlinks and outlinks. According to Figure 6, the accuracy of the weightedSum method tends to become higher as the weight for inlinks gets higher than the outlink weight. This is because the papers in the answer set are relatively famous ones frequently referenced by other papers. For this reason, we conducted the experiments using the “0.9 : 0.1” weight balance in order to obtain the best result.

Tables 3, 4, 5, 6, and 7 present the lists of the top 10 papers found to be the most similar to a target paper, which is [33], using the three existing methods and the two proposed methods, respectively. (Note that the paper [33] is concerned with clustering in data mining.) In the tables, those papers which are not similar to [33] are italicized. In these papers, the authors mainly deal with issues of outlier detection or mining frequent patterns, indicating that the existing methods have made a wrong conclusion for these papers. In comparison, the results by the weightedSum method (shown in Table 6) include only one wrong entry (on frequent pattern mining) in the top 10 list, and the results by the undirected method include no wrong papers (i.e., all are related to clustering). These results imply that the undirected method performs better than the three existing methods and even better than the weightedSum method. We repeated this experiment many times with target papers other than [33] and obtained results similar to Tables 3–7.

Figure 6: Accuracy of the weightedSum method with varying weights. ($x$-axis: $m$ from 10 to 50; $y$-axis: accuracy from 0 to 1; one curve per inlink : outlink weight balance, 0.0 : 1.0, 0.1 : 0.9, 0.3 : 0.7, 0.5 : 0.5, 0.7 : 0.3, 0.9 : 0.1, and 1.0 : 0.0.)

Figure 7 compares the accuracy of the similarities computed by the three existing methods and the two proposed methods. As in Figure 6, the $x$-axis represents the number of “the most similar” papers selected by each method and the $y$-axis the accuracy. The weightedSum method improved accuracy by 9% on average and up to 13% compared with SimRank. Also, the undirected method improved accuracy by 16% on average and up to 20% compared with SimRank. In conclusion, the two proposed methods turn out to compute the similarity more accurately than the existing methods. In particular, the undirected method performs the best.

5.3. Domain of Web Documents. Figure 8 shows the accuracy of the similarities between Web pages obtained by the weightedSum method using various weight values between inlinks and outlinks. The $x$-axis represents the number of “the most similar” Web pages selected by the method and the $y$-axis the accuracy. Annotations such as “0.0 : 1.0” represent the weight balance between inlinks and outlinks. Again, the accuracy of the weightedSum method tends to become higher as the weight for inlinks gets higher, as in the case of scientific papers. This is because the answer set contains many Web pages with high authority.

Figure 7: Accuracy of the similarity measures in scientific papers data. ($x$-axis: $m$ from 10 to 50; $y$-axis: accuracy from 0 to 1; curves: SimRank, rvs-SimRank, P-Rank, WeightedSum, and Undirected.)

Figure 8: Accuracy of the weightedSum method with varying the weights. ($x$-axis: $m$ from 10 to 50; $y$-axis: accuracy from 0 to 0.5; one curve per inlink : outlink weight balance, 0.0 : 1.0, 0.1 : 0.9, 0.3 : 0.7, 0.5 : 0.5, 0.7 : 0.3, 0.9 : 0.1, and 1.0 : 0.0.)

For this reason, we conducted the experiments using the “0.9 : 0.1” weight balance in order to obtain the best result.

Figure 9 compares the accuracy of the similarities computed by the three existing methods and the two proposed methods. Again, the $x$-axis represents the number of “the most similar” documents selected by each method and the $y$-axis the accuracy. The weightedSum method improved accuracy by 20% on average and up to 24% compared with SimRank. Also, the undirected method improved accuracy by 34% on average and up to 43% compared with SimRank. In conclusion, the two proposed methods turn out to compute the similarity more accurately than the existing methods. In particular, the undirected method outperforms the weightedSum method and the three existing methods by a large margin.

Table 3: Top 10 papers similar to [33] using SimRank.

| First author | Title | Conference/journal | Year |
|---|---|---|---|
| Guha | CURE: An Efficient Clustering Algorithm for Large Databases | ACM SIGMOD | 1998 |
| Sheikholeslami | WaveCluster: A multi-Resolution Clustering Approach for Very Large Spatial Databases | VLDB | 1998 |
| Ester | Knowledge Discovery in Large Spatial Databases: Focusing Techniques for Efficient Class Identification | SSD | 1995 |
| Hinneburg | An Efficient Approach to Clustering in Large Multimedia Databases with Noise | AAAI | 1998 |
| Ng | Efficient and Effective Clustering Methods for Spatial Data Mining | VLDB | 1994 |
| Bradley | Scaling Clustering Algorithms to Large Databases | AAAI | 1998 |
| Wang | STING: A Statistical Information Grid Approach to Spatial Data Mining | VLDB | 1997 |
| L. O’Callaghan | Streaming-Data Algorithms for High-Quality Clustering | IEEE ICDE | 2002 |
| Sander | A Density-Based Algorithm for Discovering Clusters in Large Spatial Databases with Noise | DMKD | 1998 |
| Arning | A Linear Method for Deviation Detection in Large Databases | ACM KDD | 1996 |

Table 4: Top 10 papers similar to [33] using rvs-SimRank.

| First author | Title | Conference/journal | Year |
|---|---|---|---|
| Knorr | A Unified Notion of Outliers: Properties and Computation | ACM KDD | 1997 |
| Guha | CURE: An Efficient Clustering Algorithm for Large Databases | ACM SIGMOD | 1998 |
| Sander | A Density-Based Algorithm for Discovering Clusters in Large Spatial Databases with Noise | DMKD | 1998 |
| Bradley | Scaling Clustering Algorithms to Large Databases | AAAI | 1998 |
| Guha | ROCK: A Robust Clustering Algorithm for Categorical Attributes | IEEE ICDE | 1999 |
| Sheikholeslami | WaveCluster: A multi-Resolution Clustering Approach for Very Large Spatial Databases | VLDB | 1998 |
| Arning | A Linear Method for Deviation Detection in Large Databases | ACM KDD | 1996 |
| Burdick | MAFIA: A Maximal Frequent Itemset Algorithm for Transactional Databases | IEEE TKDE | 2005 |
| Mannila | Efficient Algorithms for Discovering Association Rules | AAAI | 1994 |
| Kamber | Metarule-Guided Mining of Multi-Dimensional Association Rules Using Data Cubes | ACM KDD | 1997 |

Table 5: Top 10 papers similar to [33] using P-Rank.

| First author | Title | Conference/journal | Year |
|---|---|---|---|
| Knorr | A Unified Notion of Outliers: Properties and Computation | ACM KDD | 1997 |
| Guha | CURE: An Efficient Clustering Algorithm for Large Databases | ACM SIGMOD | 1998 |
| Sander | A Density-Based Algorithm for Discovering Clusters in Large Spatial Databases with Noise | DMKD | 1998 |
| Bradley | Scaling Clustering Algorithms to Large Databases | AAAI | 1998 |
| Sheikholeslami | WaveCluster: A multi-Resolution Clustering Approach for Very Large Spatial Databases | VLDB | 1998 |
| Guha | ROCK: A Robust Clustering Algorithm for Categorical Attributes | IEEE ICDE | 1999 |
| Mannila | Efficient Algorithms for Discovering Association Rules | AAAI | 1994 |
| Arning | A Linear Method for Deviation Detection in Large Databases | ACM KDD | 1996 |
| Silberschatz | What Makes Patterns Interesting in Knowledge Discovery Systems | IEEE TKDE | 1996 |
| Agrawal | Mining Association Rules between Sets of Items in Large Databases | ACM SIGMOD | 1993 |

Table 6: Top 10 papers similar to [33] using weightedSum method.

| First author | Title | Conference/journal | Year |
|---|---|---|---|
| Ester | Knowledge Discovery in Large Spatial Databases: Focusing Techniques for Efficient Class Identification | SSD | 1995 |
| Ng | Efficient and Effective Clustering Methods for Spatial Data Mining | VLDB | 1994 |
| Sheikholeslami | WaveCluster: A multi-Resolution Clustering Approach for Very Large Spatial Databases | VLDB | 1998 |
| Guha | ROCK: A Robust Clustering Algorithm for Categorical Attributes | IEEE ICDE | 1999 |
| Guha | CURE: An Efficient Clustering Algorithm for Large Databases | ACM SIGMOD | 1998 |
| Wang | STING: A Statistical Information Grid Approach to Spatial Data Mining | VLDB | 1997 |
| Hinneburg | An Efficient Approach to Clustering in Large Multimedia Databases with Noise | AAAI | 1998 |
| Aggarwal | Fast Algorithms for Projected Clustering | ACM SIGMOD | 1999 |
| Agrawal | Automatic Subspace Clustering of High Dimensional Data for Data Mining Applications | ACM SIGMOD | 1998 |
| Sander | A Density-Based Algorithm for Discovering Clusters in Large Spatial Databases with Noise | DMKD | 1998 |

Table 7: Top 10 papers similar to [33] using undirected method.

| First author | Title | Conference/journal | Year |
|---|---|---|---|
| Guha | CURE: An Efficient Clustering Algorithm for Large Databases | ACM SIGMOD | 1998 |
| Ng | Efficient and Effective Clustering Methods for Spatial Data Mining | VLDB | 1994 |
| Sander | A Density-Based Algorithm for Discovering Clusters in Large Spatial Databases with Noise | DMKD | 1998 |
| Bradley | Scaling Clustering Algorithms to Large Databases | AAAI | 1998 |
| Sheikholeslami | WaveCluster: A multi-Resolution Clustering Approach for Very Large Spatial Databases | VLDB | 1998 |
| Hinneburg | An Efficient Approach to Clustering in Large Multimedia Databases with Noise | AAAI | 1998 |
| Wang | STING: A Statistical Information Grid Approach to Spatial Data Mining | VLDB | 1997 |
| Agrawal | Automatic Subspace Clustering of High Dimensional Data for Data Mining Applications | ACM SIGMOD | 1998 |
| Aggarwal | Fast Algorithms for Projected Clustering | ACM SIGMOD | 1999 |
| Ankerst | OPTICS: Ordering Points To Identify the Clustering Structure | ACM SIGMOD | 1999 |

Figure 9: Accuracy of the similarity measures in Web documents data. ($x$-axis: $m$ from 10 to 50; $y$-axis: accuracy from 0 to 0.5; curves: SimRank, rvs-SimRank, P-Rank, WeightedSum, and Undirected.)

6. Conclusions

This paper presented new link-based similarity methods that can compute the similarity between objects more accurately by using the link information pertaining to the objects. Noticing that most existing link-based similarity methods use the pair-wise and level-wise models, we analyzed the problems with these models and proposed a new approach that does not suffer from these problems. In our proposed approach, each object is represented by a vector, with all objects in the given universe as the elements of the vector and a weight value for each object as the value of the corresponding element. As for this weight value, we proposed to utilize the notion of reachability between objects, computed using the “Random Walk with Restart” strategy. Then, we defined the similarity between two objects as the cosine similarity between the two vectors representing the two objects. The proposed approach was then refined into two methods, the weightedSum and the undirected methods, differentiated by the strategy for handling the information on link directions. Examples showed that the two methods do not suffer from the problems of the pair-wise and level-wise models. In our experiments with the scientific papers and Web documents data sets, the results indicated that both of the proposed methods generally outperform the existing methods significantly.

Conflict of Interests

The authors declare that there is no conflict of interests regarding the publication of this paper.

Acknowledgments

This work was supported by (1) Business for Cooperative R&D between Industry, Academy, and Research Institute funded by the Korea Small and Medium Business Administration in 2013 (Grant no. C0006278), (2) the MSIP (Ministry of Science, ICT and Future Planning), Korea, under the ITRC (Information Technology Research Center) Support Program (NIPA-2013-H0301-13-4009) supervised by the NIPA (National IT Industry Promotion Agency), (3) the Ministry of Culture, Sports and Tourism (MCST) and the Korea Copyright Commission in 2013, and (4) the National Research Foundation of Korea (NRF) Grant funded by the Korea government (MEST) (no. 2011-0029181).

References

[1] J. Shen, Y. Zhu, H. Zhang, C. Chen, R. Sun, and F. Xu, “A content-based algorithm for blog ranking,” in Proceedings of the International Conference on Internet Computing in Science and Engineering (ICICSE ’08), pp. 19–22, January 2008.

[2] A. Broder, S. Glassman, M. Manasse, and G. Zweig, “Syntactic clustering of the web,” in Proceedings of the 6th International World Wide Web Conference (WWW ’97), pp. 391–404, 1997.

[3] X. Su and T. Khoshgoftaar, “A survey of collaborative filtering techniques,” Advances in Artificial Intelligence, vol. 2009, Article ID 421425, 19 pages, 2009.

[4] S.-H. Yoon, S.-W. Kim, and S. Park, “A link-based similarity measure for scientific literature,” in Proceedings of the 19th International World Wide Web Conference (WWW ’10), pp. 1213–1214, April 2010.

[5] M. R. Hamedani, S. Lee, and S. Kim, “On exploiting content and citations together to compute similarity of scientific papers,” in Proceedings of the ACM International Conference on Information and Knowledge Management (CIKM ’13), San Francisco, Calif, USA, 2013.

[6] P. Zhao, J. Han, and Y. Sun, “P-Rank: a comprehensive structural similarity measure over information networks,” in Proceedings of the ACM 18th International Conference on Information and Knowledge Management (CIKM ’09), pp. 553–562, November 2009.

[7] D.-H. Bae, S.-M. Hwang, S.-W. Kim, and C. Faloutsos, “Constructing seminal paper genealogy,” in Proceedings of the 20th ACM Conference on Information and Knowledge Management (CIKM ’11), pp. 2101–2104, October 2011.

[8] F. Menczer, “Combining link and content analysis to estimate semantic similarity,” in Proceedings of the 13th International World Wide Web Conference Proceedings (WWW ’04), pp. 452–453, May 2004.

[9] R. Baeza-Yates and B. Ribeiro-Neto, Modern Information Retrieval, Addison Wesley, 1999.

[10] S.-H. Yoon, S.-W. Kim, J.-S. Kim, and W.-S. Hwang, “On computing text-based similarity in scientific literature,” in Proceedings of the 20th International Conference Companion on World Wide Web (WWW ’11), pp. 169–170, April 2011.

[11] G. Bisson and F. Hussain, “χ-Sim: a new similarity measure for the co-clustering task,” in Proceedings of the 7th International Conference on Machine Learning and Applications (ICMLA ’08), pp. 211–217, December 2008.

[12] H. Small, “CoCitation in the scientific literature: a new measure of the relationship between two documents,” Journal of the American Society for Information Science, vol. 24, no. 4, pp. 265–269, 1973.

[13] M. Kessler, “Bibliographic coupling between scientific papers,” Journal of the American Documentation, vol. 14, no. 1, pp. 10–25, 1963.

[14] R. Amsler, “Application of citation-based automatic classification,” Technical Report, The University of Texas at Austin, Linguistics Research Center, 1972.

[15] G. Jeh and J. Widom, “SimRank: a measure of structural-context similarity,” in Proceedings of the 8th ACM SIGKDD International Conference on Knowledge Discovery and Data Mining (KDD ’02), pp. 538–543, July 2002.

[16] U. Shardanand and P. Maes, “Social information filtering: algorithms for automating ‘word of mouth’,” in Proceedings of the SIGCHI Conference on Human Factors in Computing Systems (CHI ’95), pp. 210–217, ACM Press, Denver, Colorado, 1995.

[17] I. Antonellis, H. Garcia-Molina, and C. Chang, “Simrank++: query rewriting through link analysis of the click graph,” in Proceedings of the 34th International Conference on Very Large Data Bases (VLDB ’08), pp. 408–421, Auckland, New Zealand, August 2008.


[18] D. Fogaras and B. Racz, “Scaling link-based similarity search,” in Proceedings of the 14th International Conference on World Wide Web (WWW ’05), pp. 641–650, 2005.

[19] C. Li, J. Han, G. He et al., “Fast computation of SimRank for static and dynamic information networks,” in Proceedings of the 13th International Conference on Extending Database Technology: Advances in Database Technology (EDBT ’10), pp. 465–476, March 2010.

[20] P. Li, H. Liu, J. Xu Yu, J. He, and X. Du, “Fast single-pair SimRank computation,” in Proceedings of the SIAM International Conference on Data Mining (SDM ’10), pp. 571–582, SIAM, Columbus, Ohio, USA, April 2010.

[21] X. Yin, J. Han, and P. S. Yu, “LinkClus: efficient clustering via heterogeneous semantic links,” in Proceedings of the 32nd International Conference on Very Large Data Bases (VLDB ’06), pp. 427–438, 2006.

[22] D. Lizorkin, P. Velikhov, M. Grinev, and D. Turdakov, “Accuracy estimate and optimization techniques for SimRank computation,” in Proceedings of the Very Large Data Bases Endowment, vol. 1, pp. 422–433, 2008.

[23] Y. Cai, G. Cong, X. Jia et al., “Efficient algorithm for computing link-based similarity in real world networks,” in Proceedings of the 9th IEEE International Conference on Data Mining (ICDM ’09), pp. 734–739, December 2009.

[24] G. He, H. Feng, C. Li, and H. Chen, “Parallel SimRank computation on large graphs with iterative aggregation,” in Proceedings of the 16th ACM SIGKDD International Conference on Knowledge Discovery and Data Mining (KDD ’10), pp. 543–552, July 2010.

[25] D. Lizorkin, P. Velikhov, M. Grinev, and D. Turdakov, “Accuracy estimate and optimization techniques for SimRank computation,” The Int’l Journal on Very Large Data Bases, vol. 19, no. 1, pp. 45–66, 2010.

[26] Z. Lin, M. R. Lyu, and I. King, “MatchSim: a novel neighbor-based similarity measure with maximum neighborhood matching,” in Proceedings of the ACM 18th International Conference on Information and Knowledge Management (CIKM ’09), pp. 1613–1616, November 2009.

[27] R. R. Larson, “Bibliometrics of the world wide web: an exploratory analysis of the intellectual structure of cyberspace,” Proceedings of the Annual Meeting of the American Society for Information Science, vol. 33, pp. 71–78, 1996.

[28] J. Pitkow and P. Pirolli, “Life, death, and lawfulness on the electronic frontier,” in Proceedings of the Conference on Human Factors in Computing Systems (CHI ’97), pp. 383–390, March 1997.

[29] J. Han and M. Kamber, Data Mining: Concepts and Techniques, Morgan Kaufmann, 2006.

[30] J. Sun, H. Qu, D. Chakrabarti, and C. Faloutsos, “Neighborhood formation and anomaly detection in bipartite graphs,” in Proceedings of the 5th IEEE International Conference on Data Mining (ICDM ’05), pp. 418–425, November 2005.

[31] H. Tong, C. Faloutsos, and J.-Y. Pan, “Fast random walk with restart and its applications,” in Proceedings of the 6th International Conference on Data Mining (ICDM ’06), pp. 613–622, December 2006.

[32] S. Robinson, “Toward an optimal algorithm for matrix multiplication,” News Journal of the Society for Industrial and Applied Mathematics, vol. 38, no. 9, 2005.

[33] T. Zhang, R. Ramakrishnan, and M. Livny, “BIRCH: an efficient data clustering method for very large databases,” in Proceedings of the ACM SIGMOD International Conference on Management of Data (SIGMOD ’96), pp. 103–114, 1996.


2 The Scientific World Journal

Among the existing link-based measures, SimRank is well known and has driven a large number of subsequent studies, which proposed variations of SimRank [6, 17] or investigated ways to speed it up [18–25]. The basic principle of SimRank states that “two objects are similar if they are related to similar objects.” SimRank computes the similarity between two objects, say $x$ and $y$, by recursively computing the average of all the similarities between every object pointing to $x$ and every object pointing to $y$ [15]. In other words, SimRank can be better explained in terms of what we call the “pair-wise” and the “level-wise” computation models. The pair-wise model computes the similarity between $x$ and $y$ by averaging all the similarities computed between every neighbor of $x$ and every neighbor of $y$, whereas the level-wise model computes the similarity by utilizing only those other objects located at the same distance from $x$ and $y$, level by level.

We believe that any similarity measure, such as SimRank, which uses both the pair-wise and level-wise models cannot accurately compute the similarity between two objects. For instance, the pair-wise model would still compute the similarities between each neighbor of $x$ and each neighbor of $y$, even if the entire set of neighbors of $x$ and $y$ is exactly the same. Consequently, the pair-wise model in general yields biased results, in that the similarity between two objects having a large number of links tends to become lower than the similarity between two objects having a small number of links [18, 26]. On the other hand, the level-wise model focuses only on those neighbor objects linked directly to the two objects “at the same level,” that is, it considers only those objects in the graph located at the same distance from the two objects. Consequently, this model cannot consider all of the objects linked directly or indirectly to the two target objects.

This paper proposes a new link-based similarity measure that does not suffer from the aforementioned problems with the pair-wise and level-wise models. In the proposed measure, we represent each object, say $x$, as a vector. The elements of the vector (for $x$) correspond to all the objects (including $x$ itself) in the given universe, and the value of each element denotes the weight (with respect to $x$) for the particular object corresponding to the element. To obtain this weight value, we propose to utilize the probability of reaching the particular object from the object $x$, computed using the “Random Walk with Restart” strategy. Then, we define the similarity between two objects as the cosine similarity [9] between the two vectors representing the two objects. This approach resembles the text-based similarity measures that use the cosine similarity for computing similarity between documents, where a document is represented by a vector, each element of the vector corresponds to a word in the universe of all documents, and the value of each element denotes the frequency of the corresponding word in the target document.

Our approach does not suffer from the problem with the pair-wise model because it computes the similarity between two vectors by multiplying only the values of the corresponding elements in the two vectors, not trying every possible element pair between the two vectors. Moreover, the approach can also consider all the objects linked directly or indirectly to the two target objects, reflecting the degree of “closeness” between objects in the form of reachability between objects. Thus, it does not suffer from the problem with the level-wise model either. In this paper, we develop two methods to implement our approach. The first method generates the vectors representing objects using inlinks and outlinks separately and then merges the two vectors to compute the similarity. The second method does not distinguish inlinks from outlinks but converts them together into undirected links to generate the vectors. The effectiveness of both methods will be demonstrated by examples showing that they do not suffer from the problems of the pair-wise and level-wise models.

The paper also evaluates the performance of the proposed methods in comparison with existing link-based measures, qualitatively and quantitatively, using the data sets of scientific papers and Web documents as two exemplary types of data having link information. Our experimental results indicate that the proposed methods generally outperform the existing measures for both types of data.

The rest of this paper is organized as follows. Section 2 summarizes existing link-based similarity measures. Section 3 explains our research motivations, and Section 4 describes the proposed approach and methods in detail. Section 5 presents the experimental results to validate the performance of the proposed methods. Section 6 concludes the paper.

2. Related Work

Existing link-based similarity measures include Cocitation, Bibliographic coupling, Amsler, SimRank, rvs-SimRank, and P-Rank. While Cocitation, Bibliographic coupling, and Amsler were originally devised to deal with scientific papers [6], they have also been applied to other types of data, such as Web documents, which have link information [27, 28]. On the other hand, SimRank, rvs-SimRank, and P-Rank were originally proposed to deal with objects of any kind having link information [6, 15].

Cocitation computes the similarity between two objects based on the number of objects which commonly point to the two. As a result, the similarity between two objects becomes higher as the number of “commonly pointing” objects gets larger [12]. This concept is described as follows, where $x$ and $y$ denote objects, $S(x, y)$ the similarity between $x$ and $y$, and $I(x)$ and $I(y)$ the sets of objects pointing to $x$ and $y$, respectively:

$$S(x, y) = |I(x) \cap I(y)| \tag{1}$$

Bibliographic coupling computes the similarity between two objects based on the number of objects which are commonly pointed to by the two [13]. This is described as follows, where $O(x)$ and $O(y)$ represent the sets of objects pointed to by $x$ and $y$, respectively:

$$S(x, y) = |O(x) \cap O(y)| \tag{2}$$

The above two measures, namely, Cocitation and Bibliographic coupling, are combined by Amsler, which defines the similarity between two objects as a weighted sum of the two similarities computed by Cocitation and Bibliographic coupling, as described in (3), where $\lambda$ represents the factor to balance the weights between the two similarities involved. In general, $\lambda$ is set to 0.5 to assign equal weights to the two [6, 14].

$$S(x, y) = \lambda \times |I(x) \cap I(y)| + (1 - \lambda) \times |O(x) \cap O(y)| \tag{3}$$

Figure 1: An example graph (nodes $a$ through $m$).

The concepts of these three measures can be illustrated using the graph in Figure 1, which represents objects as nodes and reference relationships among objects as links. The similarity between objects $g$ and $h$ in the graph becomes 2 when computed by Cocitation, since there exist two objects (i.e., $d$ and $e$) commonly pointing to $g$ and $h$. On the other hand, the similarity becomes 1 when computed by Bibliographic coupling, since there exists only one object (i.e., $k$) commonly pointed to by $g$ and $h$. Finally, Amsler computes the similarity between $g$ and $h$ to be 1.5, assuming equal weights ($\lambda = 0.5$) for Cocitation and Bibliographic coupling (i.e., $1.5 = 0.5 \times 2 + 0.5 \times 1$, using (3)).
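The three counts above can be checked with a short sketch. The edge list is an assumption: it contains only the links that the text states for Figure 1 (for example, $d$ and $e$ pointing to $g$ and $h$), so the full figure may hold more links, and the function names are illustrative rather than taken from the paper.

```python
from collections import defaultdict

# Partial edge list (source, target), reconstructed only from links that the
# text mentions for Figure 1; the actual figure may contain more links.
EDGES = [
    ("b", "e"), ("b", "f"),
    ("d", "g"), ("d", "h"), ("e", "g"), ("e", "h"), ("f", "i"),
    ("g", "k"), ("h", "k"), ("h", "l"), ("i", "l"),
]


def build_link_sets(edges):
    """Return (inlinks, outlinks): who points to each object, and whom it points to."""
    inlinks, outlinks = defaultdict(set), defaultdict(set)
    for source, target in edges:
        outlinks[source].add(target)
        inlinks[target].add(source)
    return inlinks, outlinks


def cocitation(inlinks, x, y):
    """Equation (1): number of objects pointing to both x and y."""
    return len(inlinks[x] & inlinks[y])


def bibliographic_coupling(outlinks, x, y):
    """Equation (2): number of objects pointed to by both x and y."""
    return len(outlinks[x] & outlinks[y])


def amsler(inlinks, outlinks, x, y, lam=0.5):
    """Equation (3): weighted sum of the two counts above."""
    return (lam * cocitation(inlinks, x, y)
            + (1 - lam) * bibliographic_coupling(outlinks, x, y))


inlinks, outlinks = build_link_sets(EDGES)
print(cocitation(inlinks, "g", "h"))               # common in-neighbors d and e
print(bibliographic_coupling(outlinks, "g", "h"))  # common out-neighbor k
print(amsler(inlinks, outlinks, "g", "h"))         # 0.5 * 2 + 0.5 * 1
```

On this partial graph the three calls agree with the values 2, 1, and 1.5 given in the text for $g$ and $h$.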

Note that the similarities computed by Cocitation, Bibliographic coupling, and Amsler in general tend to become larger as the number of links among the objects becomes larger. This effect can be normalized by dividing the similarities by the size of the union of the sets of objects pointing to (or pointed to by) the two objects, so that the resulting similarity lies between 0 and 1 [29].
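As an illustration of this normalization, one plausible reading (an assumption, since the text gives no explicit formula) is the Jaccard form, shown here for Cocitation; the Bibliographic coupling variant would replace $I$ by $O$:

$$S_{\mathrm{norm}}(x, y) = \frac{|I(x) \cap I(y)|}{|I(x) \cup I(y)|}$$

For the objects $g$ and $h$ of Figure 1, whose in-neighbors are both $\{d, e\}$, this would give $S_{\mathrm{norm}}(g, h) = 2/2 = 1$.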

SimRank uses the concept that “two objects are similar if they are related to similar objects.” In the graph of Figure 1, for example, the similarity between objects $h$ and $i$ would be computed as zero, that is, interpreted as “not similar at all,” using Cocitation, since no object exists which commonly points to both $h$ and $i$. Nevertheless, these two objects could be seen as similar to some degree, in that $h$ and $i$ are separately pointed to by two “similar” objects $e$ and $f$, respectively, because $e$ and $f$ are commonly pointed to by $b$. SimRank exploits such a concept by recursively computing the similarity between two objects, say $x$ and $y$, as the average of all the similarities between every object pointing to $x$ and every object pointing to $y$. This concept of SimRank is described by (4), where $x$ and $y$ denote objects, $S(x, y)$ the similarity between $x$ and $y$, and $I(x)$ and $I(y)$ the sets of objects pointing to $x$ and $y$, respectively. $I_i(x)$ is the $i$th object in the list of objects pointing to $x$, and $C$ is the decay factor, having a value between 0 and 1. The decay factor reduces the weights of the computed similarity as the iterations get deeper. As shown by the following, SimRank yields the similarity as a value between 0 and 1 by normalizing the summation of the similarities of all pairs of objects in the Cartesian product of the two sets $I(x)$ and $I(y)$ by its cardinality [15].

$$S(x, y) = \frac{C}{|I(x)|\,|I(y)|} \sum_{i=1}^{|I(x)|} \sum_{j=1}^{|I(y)|} S(I_i(x), I_j(y)) \tag{4}$$
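The recursion in (4) can be sketched as a fixed-point iteration. This is a minimal illustration, not the authors' implementation: the fixed iteration count, the default $C = 0.7$, and the small chain graph are assumptions chosen to mirror the $h$ and $i$ discussion above.

```python
from itertools import product


def simrank(nodes, inlinks, C=0.7, iterations=10):
    """Iterate equation (4) from S(x, x) = 1 and S(x, y) = 0 otherwise."""
    sim = {(x, y): 1.0 if x == y else 0.0 for x, y in product(nodes, repeat=2)}
    for _ in range(iterations):
        updated = {}
        for x, y in product(nodes, repeat=2):
            in_x, in_y = inlinks.get(x, []), inlinks.get(y, [])
            if x == y:
                updated[(x, y)] = 1.0
            elif not in_x or not in_y:
                updated[(x, y)] = 0.0  # no in-neighbors: nothing to average
            else:
                total = sum(sim[(p, q)] for p in in_x for q in in_y)
                updated[(x, y)] = C * total / (len(in_x) * len(in_y))
        sim = updated
    return sim


# Hypothetical chain: b points to e and f, e points to h, f points to i.
nodes = ["b", "e", "f", "h", "i"]
inlinks = {"e": ["b"], "f": ["b"], "h": ["e"], "i": ["f"]}
scores = simrank(nodes, inlinks)
print(scores[("e", "f")], scores[("h", "i")])
```

In this toy graph $h$ and $i$ share no in-neighbor, so Cocitation would give 0, while the recursion assigns them a positive score through $e$ and $f$.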

In a sense, SimRank expands Cocitation to a broader scope of neighbor objects in the similarity definition, so as to count not just the adjacent objects directly linked to the two target objects (as in Cocitation) but also the effects of all other objects indirectly linked (through the recursive computation). In a similar manner, [6] expands Bibliographic coupling to yield rvs-SimRank and Amsler to yield P-Rank. The rvs-SimRank is expressed by (5), which differs from (4) of SimRank only in that it uses outlinks instead of inlinks. The P-Rank is expressed by (6), which computes the similarity as a weighted sum of the two similarities obtained by SimRank and rvs-SimRank, respectively, in each iteration step. Consider the following:

$$S(x, y) = \frac{C}{|O(x)|\,|O(y)|} \sum_{i=1}^{|O(x)|} \sum_{j=1}^{|O(y)|} S(O_i(x), O_j(y)) \tag{5}$$

$$S(x, y) = \lambda \times \frac{C}{|I(x)|\,|I(y)|} \sum_{i=1}^{|I(x)|} \sum_{j=1}^{|I(y)|} S(I_i(x), I_j(y)) + (1 - \lambda) \times \frac{C}{|O(x)|\,|O(y)|} \sum_{i=1}^{|O(x)|} \sum_{j=1}^{|O(y)|} S(O_i(x), O_j(y)) \tag{6}$$

In another direction, various approaches have sought to improve the accuracy of existing measures [4, 18, 26]. Reference [18] proposes to apply the Jaccard coefficient to SimRank in order to remedy the phenomenon that the similarities tend to become lower among the objects having a larger number of links. Reference [26] proposes to improve the accuracy of SimRank by taking the average of the similarities only between the maximally matching neighbor objects across the two groups associated with the two target objects, in order to resolve the problem indicated in [18].

Many approaches have also been investigated to improve the speed of existing measures [18, 20, 21]. Reference [18] suggests improving the performance of SimRank by first constructing a fingerprint tree for each object and then using such trees to approximate the similarity that SimRank would obtain. Reference [21] proposes to reduce the time and space complexity of SimRank by utilizing a tree structure called SimTree, which allows storing directly the similarities among similar objects but computing the similarities among dissimilar objects using the path information of the tree. Reference [20] aims to compute the similarity between two objects using SimRank on-line in real time by considering only those objects related directly to the two target objects, rather than computing the similarities involving all objects. Reference [24] investigates a method to run SimRank in parallel using GPGPU (general-purpose computation on graphics processors) and a method to approximately compute the similarity in a dynamic graph using uncoupling Markov chains.

3. Motivation

In this section, we discuss the problems with existing link-based similarity measures. First, we cast existing measures into a method combining two computation models, which we will call “pair-wise” and “level-wise,” and explain the inherent difficulties in these models. We then analyze and illustrate the limitations of three representative existing methods, namely, rvs-SimRank, SimRank, and P-Rank, showing that each of these methods actually combines the pair-wise and level-wise models.

3.1. Pair-Wise and Level-Wise Computation Models. In the graph of Figure 1, again, SimRank would compute the similarity between $k$ and $l$ by taking the average of the four similarities obtained from the four pairs $(g, h)$, $(g, i)$, $(h, h)$, and $(h, i)$, namely, the Cartesian product of the set $\{g, h\}$, the objects directly “in-linked” to $k$, with the set $\{h, i\}$, the objects directly “in-linked” to $l$. We call such a way of pairing and averaging out the similarities for all such pairs the “pair-wise” computation model. In this example, the similarity between $k$ and $l$, $S(k, l)$, becomes the average of $S(g, h)$, $S(g, i)$, $S(h, h)$, and $S(h, i)$. Here, $S(h, h) = 1$ by definition, and $S(g, h)$, $S(g, i)$, and $S(h, i)$ should also be computed in turn by SimRank in a recursive manner using the same pair-wise model. That is, $S(g, h)$ is computed from the Cartesian product of $g$’s neighbors $\{d, e\}$ and $h$’s neighbors $\{d, e\}$, $S(g, i)$ from $g$’s neighbors $\{d, e\}$ and $i$’s neighbors $\{f\}$, and $S(h, i)$ from $h$’s neighbors $\{d, e\}$ and $i$’s neighbors $\{f\}$, using the pair-wise model. In this way, the recursive process continues and gets deeper, using all the objects linked directly or indirectly to the two target objects for which the similarity is being computed.
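Written out with (4), including the decay factor $C$ that the word “average” leaves implicit, the first two levels of this recursion for the neighbor sets named above read:

$$S(k, l) = \frac{C}{2 \cdot 2}\,\bigl[S(g, h) + S(g, i) + S(h, h) + S(h, i)\bigr], \qquad S(h, h) = 1$$

$$S(g, h) = \frac{C}{2 \cdot 2}\,\bigl[S(d, d) + S(d, e) + S(e, d) + S(e, e)\bigr]$$

$$S(g, i) = S(h, i) = \frac{C}{2 \cdot 1}\,\bigl[S(d, f) + S(e, f)\bigr]$$

Every term on the right-hand side pairs one in-neighbor of the first object with one in-neighbor of the second, which is exactly the pairing that the pair-wise model refers to.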

On the other hand, such a model can be seen as computing the similarity between two objects by utilizing only those other objects located at the same distance from the two targets, level by level. In Figure 1, for example, the model computes the similarity between $k$ and $l$ by utilizing the objects with distance 1 (namely, nodes $g$, $h$, and $i$) in the first round, then the objects with distance 2 (namely, $d$, $e$, and $f$) in the second round, and then the objects with distance 3 (namely, $a$, $b$, and $c$) in the third round. We call such a process the “level-wise” computation model.

3.2. Problem of Pair-Wise Computation Model. Figure 2 illustrates the problem of the pair-wise model, which induces the phenomenon that the similarities among objects having more links tend to become lower than those among objects having fewer links. In the graphs of Figure 2, again, nodes represent objects and links represent the reference relationships among objects. For example, objects $a$ and $b$ in Figure 2(a) both refer to the same three objects $c$, $d$, and $e$, all of which in turn refer to the same two objects $f$ and $g$. Similarly, objects $a'$ and $b'$ in Figure 2(b) both refer to the same objects, all of which in turn refer to the same objects. The only difference between these two graphs is that the numbers of links from objects $a'$ and $b'$ are larger (i.e., six links each) than those from objects $a$ and $b$ (i.e., three links each). Intuitively, the similarity between $a$ and $b$ should be the same as that between $a'$ and $b'$, since $a$ and $b$ both refer to the same objects and $a'$ and $b'$ also both refer to the same objects. Existing methods, however, produce different results.

Table 1 shows the resulting similarities between $a$ and $b$ and between $a'$ and $b'$ as computed by three existing methods, rvs-SimRank, SimRank, and P-Rank, assuming $C$ and $\lambda$ to be 0.7 and 0.5 in formulas (4), (5), and (6). Two columns of the table (rvs-SimRank and P-Rank) indicate that the computed similarities between $a$ and $b$ are higher than those between $a'$ and $b'$, which is counterintuitive. Considering that the difference in the numbers of links between the two graphs in this example is only as small as four, one can imagine that this phenomenon would become clearer as the numbers of links get larger, for example, with such data sets as scientific papers or Web documents. In the papers data set, for example, well-known papers tend to have many reference links because they in general get substantially larger numbers of citations than ordinary papers; similarly, in the Web documents data set, portal sites tend to have many links because they are referenced more frequently than ordinary sites. We envisage that, for these domains, those methods which use the pair-wise computation model cannot compute the similarity between objects accurately.

Table 1: Similarities between a and b and between a′ and b′ in Figure 2.

              rvs-SimRank   SimRank   P-Rank
s(a, b)       0.396         0         0.423
s(a′, b′)     0.321         0         0.408

Figure 2: Example graphs (a) and (b) to illustrate the problem with the pair-wise model.

Figure 3: An example graph to illustrate the problem with the level-wise model.

3.3. Problem of Level-Wise Computation Model. The level-wise computation model computes the similarity between two objects by utilizing only those objects in the graph at the same distance from the two. The problem with this model can be illustrated using Figure 3, where, again, nodes represent objects and directed links represent the reference relationships between objects. Intuitively, we can say that two objects $a$ and $b$ are similar to some degree, as they share, though indirectly, an object $c$ in common. All of the aforementioned three methods, however, compute the similarity between these two objects $a$ and $b$ to be 0. For this case, one would expect to compute the similarity by considering all objects directly and indirectly linked to $a$ (namely, $c$) and all objects directly and indirectly linked to $b$ (namely, $c$, $d$, and $e$). In fact, however, any method which utilizes the level-wise model will compute $s(a, b)$ by using only $c$ (as $a$’s neighbor) and $d$ (as $b$’s neighbor). Here, $s(c, d)$ should also be computed by taking the average of the similarities for all possible pairs across $c$’s neighbor objects and $d$’s neighbor objects. In this case, however, $c$ has no neighbor, yielding $s(c, d)$ to be 0, even though there still remain $d$’s neighbors $c$ and $e$, which should also be counted by some means. After all, $s(a, b)$ will also become 0, indicating that they are not similar at all. From this example, we conclude that the level-wise model cannot compare objects that are not located at the same distance and consequently cannot compute the similarity properly.

4. Proposed Methods
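The zero result can be reproduced with a small sketch of the rvs-SimRank recursion (5). The edge list is inferred from the description of Figure 3 in the text ($a \to c$, $b \to d$, $d \to c$, $d \to e$) and is therefore an assumption, as are the function name, the fixed iteration count, and $C = 0.7$.

```python
from itertools import product


def pairwise_levelwise_similarity(nodes, neighbors, C=0.7, iterations=10):
    """SimRank-style recursion over the given neighbor lists (outlinks give rvs-SimRank)."""
    sim = {(x, y): float(x == y) for x, y in product(nodes, repeat=2)}
    for _ in range(iterations):
        updated = {}
        for x, y in product(nodes, repeat=2):
            nx, ny = neighbors.get(x, []), neighbors.get(y, [])
            if x == y:
                updated[(x, y)] = 1.0
            elif nx and ny:
                total = sum(sim[(p, q)] for p in nx for q in ny)
                updated[(x, y)] = C * total / (len(nx) * len(ny))
            else:
                updated[(x, y)] = 0.0  # one side has no neighbors to pair with
        sim = updated
    return sim


# Links as described in the text for Figure 3 (an inferred edge list):
# a -> c, b -> d, d -> c, d -> e.
nodes = ["a", "b", "c", "d", "e"]
outlinks = {"a": ["c"], "b": ["d"], "d": ["c", "e"]}
rvs = pairwise_levelwise_similarity(nodes, outlinks)
print(rvs[("a", "b")])  # depends only on s(c, d), and c has no neighbors
print(rvs[("a", "d")])  # a and d both reach c at distance 1
```

The pair $(a, d)$ receives a positive score because $c$ lies at distance 1 from both, whereas $(a, b)$ stays at 0 because $c$ lies at distance 1 from $a$ but distance 2 from $b$.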

4.1. Main Concepts. In this section, we propose a novel link-based similarity measure. Our approach differs from the existing link-based methods in that we do not define the measure by combining the pair-wise and level-wise models but use the concept of the cosine similarity. As expressed by (7), the cosine similarity is a measure for computing a similarity between a pair of vectors. Thus, in computing a similarity between objects using the cosine similarity, the features of an object are represented by the elements of a vector, and the weight of each feature is captured by the value of each element. For example, in the case of computing a similarity between documents using cosine similarity, each document is represented by a vector, the words in the given universe of documents are denoted by the elements of the vector, and the frequency of each word in the document is indicated by the value of the corresponding element. Then, the similarity between two documents is computed as the similarity between the two corresponding vectors [9].

$$s(x, y) = \frac{x \cdot y}{\|x\|\,\|y\|} = \frac{\sum_{i=1}^{n} (x_i \times y_i)}{\sqrt{\sum_{i=1}^{n} (x_i)^2} \times \sqrt{\sum_{i=1}^{n} (y_i)^2}} \tag{7}$$
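Equation (7) translates directly into a few lines of code. The sketch below is illustrative; returning 0 for an all-zero vector is a convention assumed here, since (7) is undefined in that case, and the word-frequency vectors are made up.

```python
import math


def cosine_similarity(x, y):
    """Equation (7): dot product divided by the product of the Euclidean norms."""
    if len(x) != len(y):
        raise ValueError("vectors must have the same length")
    dot = sum(xi * yi for xi, yi in zip(x, y))
    norm_x = math.sqrt(sum(xi * xi for xi in x))
    norm_y = math.sqrt(sum(yi * yi for yi in y))
    if norm_x == 0.0 or norm_y == 0.0:
        return 0.0  # assumed convention for an all-zero vector
    return dot / (norm_x * norm_y)


# Hypothetical word-frequency vectors over a three-word vocabulary.
doc_1 = [1, 2, 0]
doc_2 = [2, 4, 0]   # same word proportions as doc_1
doc_3 = [0, 0, 5]   # shares no word with doc_1
print(cosine_similarity(doc_1, doc_2))
print(cosine_similarity(doc_1, doc_3))
```

Only elements at the same index are multiplied, which is the property the proposed approach relies on to avoid pairing every element with every other.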

In our approach, each object is represented as a vector, all other objects in the given universe as the elements of the vector, and some “weight value” for each object as the value of the corresponding element in the vector. In the cosine similarity used by the text-based measures discussed above, words of higher frequency get larger weight values because they are treated as better characterizing features for a document. Analogously, we need to define a measure that quantifies how well the target object is characterized by each object in the universe. For this measure, we propose to use the degree of how close the two objects are located from the topological point of view.

Sun et al. [30] proposed a method for computing the probability of reaching from an object to another, as the relevance between two objects, using the “Random Walk with Restart (RWR)” strategy. We adopt this strategy and use the reachability to an object from the target object as the weight value for its corresponding element of the vector representing the target object. Reachability becomes higher as the distance between two objects becomes shorter and also when more paths exist between the two. Computing reachability using RWR is expressed by the following, where $P_A$ represents an adjacency matrix column-normalizing the connectedness among objects, $\vec{u}_a$ a vector having the reachability to each node starting from $a$, $\vec{q}_a$ a restart vector having the value 1 only for the starting node $a$ and 0 for the rest, and $c$ the restart probability:

$$\vec{u}_a = (1 - c)\, P_A\, \vec{u}_a + c\, \vec{q}_a \tag{8}$$
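One straightforward way to solve (8) is to apply it repeatedly, starting from the restart vector. The sketch below does this on a small undirected four-node graph; the restart probability $c = 0.15$, the fixed iteration count, and the helper names are assumptions for illustration, not values taken from the paper.

```python
def column_normalize(matrix):
    """Scale every column to sum to 1; all-zero columns are left as zeros."""
    n = len(matrix)
    sums = [sum(matrix[i][j] for i in range(n)) for j in range(n)]
    return [[matrix[i][j] / sums[j] if sums[j] else 0.0 for j in range(n)]
            for i in range(n)]


def random_walk_with_restart(P, start, c=0.15, iterations=100):
    """Iterate u <- (1 - c) * P u + c * q from u = q, as in equation (8)."""
    n = len(P)
    q = [1.0 if i == start else 0.0 for i in range(n)]
    u = list(q)
    for _ in range(iterations):
        u = [(1 - c) * sum(P[i][j] * u[j] for j in range(n)) + c * q[i]
             for i in range(n)]
    return u


# Undirected four-node graph a-b, a-c, b-d, c-d (node order a, b, c, d).
ADJ = [[0, 1, 1, 0],
       [1, 0, 0, 1],
       [1, 0, 0, 1],
       [0, 1, 1, 0]]
u_a = random_walk_with_restart(column_normalize(ADJ), start=0)
print([round(v, 3) for v in u_a])
```

Starting from $a$, the direct neighbors $b$ and $c$ receive equal reachability and the two-hop node $d$ receives less, in line with the remark that reachability grows as the distance shrinks.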

Our approach does not suffer from the problem with the pair-wise model because it computes the similarity between two vectors by multiplying only the values of the corresponding elements in the two vectors, not trying every possible element pair between the two vectors. Moreover, the approach also generates the vectors by considering all the objects linked directly or indirectly to the two target objects. This mechanism is different from the pair-wise and level-wise models and does not suffer from the problems discussed in the previous section.

In this paper, we develop two methods to implement our approach. The two methods differ only slightly, in the way they compute the weights in the vector. One method computes the weights by first computing two values of reachability, one using only “inlinks” and the other using only “outlinks,” separately, and then combining them as a weighted sum. This method provides flexibility to a problem domain by allowing different weights for reachability using inlinks versus outlinks. The method, however, cannot compute reachability properly for those objects located close to the target object, because links in both directions are not used together. In Figure 1, for example, nodes $a$ and $b$ are located closely, but the reachability would become 0 when computed using only outlinks (or using only inlinks) from $a$ or $b$. Another problem is the difficulty of determining the appropriate weights for a vector generated purely using the inlinks or purely using the outlinks. Thus, we develop the second method, which computes reachability by ignoring the directions of inlinks and outlinks and converting them to undirected links before computing reachability. This method is advantageous in that it can compute appropriate reachability for every object. In the rest of this paper, we will call these two methods the “weightedSum” method and the “undirected” method, respectively. In the weightedSum method, in particular, the proper balance between the weights for inlinks and outlinks needs to be found domain by domain through experimentation.

Figure 4: The procedure of generating a vector using the weightedSum method (panels: Outlink, Inlink, Outlink (normalized), Inlink (normalized)).

4.2. Procedure of the Proposed Methods. Basically, both of the proposed methods (1) construct vectors by computing reachability from the target object to all other objects and (2) compute the similarity between vectors by using the cosine similarity. The two methods differ only in the process of generating vectors, as illustrated by Figure 4 (for the weightedSum method) and Figure 5 (for the undirected method). The weightedSum method generates vectors in the following manner. First, two adjacency matrices are built with inlinks and outlinks and then column-normalized such that the sum of all values in a column becomes 1. Second, a vector is generated from each normalized matrix, and the weights are assigned to the elements in the vector by computing the reachability from the target object to every object using the RWR strategy. Similarly, the undirected method generates a vector by constructing a normalized adjacency matrix, ignoring the link directions this time, and assigning the reachability values to the elements of the vector in the same manner.
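The two procedures can be sketched end to end on the four-node graph used in Figures 4 and 5 ($a \to b$, $a \to c$, $b \to d$, $c \to d$). Several details are assumptions made for this illustration: the restart probability $c = 0.15$, the mixing weight `lam = 0.5`, the convention that entry $[i][j]$ of a step matrix lets the walk move from $j$ to $i$, and the handling of nodes without onward links (their column stays zero). The sketch does not attempt to reproduce the numbers shown in the figures.

```python
import math

NODES = ["a", "b", "c", "d"]
# OUT[i][j] = 1 when NODES[i] links to NODES[j]: a->b, a->c, b->d, c->d.
OUT = [[0, 1, 1, 0],
       [0, 0, 0, 1],
       [0, 0, 0, 1],
       [0, 0, 0, 0]]


def transpose(m):
    return [list(row) for row in zip(*m)]


def column_normalize(m):
    n = len(m)
    sums = [sum(m[i][j] for i in range(n)) for j in range(n)]
    return [[m[i][j] / sums[j] if sums[j] else 0.0 for j in range(n)]
            for i in range(n)]


def rwr(P, start, c=0.15, iterations=100):
    """Reachability vector: iterate u <- (1 - c) P u + c q starting from u = q."""
    n = len(P)
    q = [1.0 if i == start else 0.0 for i in range(n)]
    u = list(q)
    for _ in range(iterations):
        u = [(1 - c) * sum(P[i][j] * u[j] for j in range(n)) + c * q[i]
             for i in range(n)]
    return u


def cosine(x, y):
    dot = sum(a * b for a, b in zip(x, y))
    norm = math.sqrt(sum(a * a for a in x)) * math.sqrt(sum(b * b for b in y))
    return dot / norm if norm else 0.0


def undirected_vector(out, start, c=0.15):
    """Undirected method: drop link directions, then run RWR once."""
    n = len(out)
    und = [[1 if out[i][j] or out[j][i] else 0 for j in range(n)]
           for i in range(n)]
    return rwr(column_normalize(und), start, c)


def weighted_sum_vector(out, start, lam=0.5, c=0.15):
    """weightedSum method: RWR along inlinks and outlinks separately, then mix."""
    # Entry [i][j] != 0 in a step matrix means the walk may move from j to i.
    along_in = column_normalize(out)               # j back to an object i linking to j
    along_out = column_normalize(transpose(out))   # j forward to an object i it links to
    u_in, u_out = rwr(along_in, start, c), rwr(along_out, start, c)
    return [lam * vi + (1 - lam) * vo for vi, vo in zip(u_in, u_out)]


idx_b, idx_c = NODES.index("b"), NODES.index("c")
sim_undirected = cosine(undirected_vector(OUT, idx_b), undirected_vector(OUT, idx_c))
sim_weighted = cosine(weighted_sum_vector(OUT, idx_b), weighted_sum_vector(OUT, idx_c))
print(round(sim_undirected, 3), round(sim_weighted, 3))
```

Because the vector of one object never depends on the similarity of any other pair, each call to `undirected_vector` or `weighted_sum_vector` could run independently of the others.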

4.3. Complexity Analysis. The complexity of the proposed methods for computing similarities for all pairs of given objects can be analyzed as follows. First, the time complexity of generating a vector for an object using the RWR strategy is $O(ke)$ [31], where $e$ represents the number of links and $k$ the number of iterations of the matrix calculation needed to obtain converged values of reachability. Thus, the time complexity of generating the vectors for all objects is $O(kne)$, where $n$ represents the number of objects. It is generally known that a constant number of such iterations is sufficient to obtain converged values of reachability [6, 15]. Consequently, the time complexity is reduced to $O(ne)$. On the other hand, the time complexity of the similarity calculation process is $O(n^3)$. Combining them, the overall time complexity of both methods becomes $O(n^3) + O(ne) = O(n^3)$, because $e \le n^2$ for a graph without parallel links. In practice, the weightedSum method will require about twice the time of the undirected method, because the former computes reachability separately for each direction of the links.

As for space complexity, $O(e)$ space is required for storing the matrix that represents the relationships between objects, $O(n^2)$ for storing the vectors, and $O(n^2)$ for storing the similarity measures between objects. Combining them, the overall space complexity becomes $O(n^2)$. Again, in practice, the weightedSum method will require about twice the space of the undirected method.

Figure 5: The procedure of generating a vector using the undirected method (panels: Outlink, Inlink, Undirected-link, Undirected-link (normalized)).

In comparison, the time complexity of the three existing methods, namely, rvs-SimRank, SimRank, and P-Rank, is known to be $O(n^4)$ and the space complexity $O(n^2)$ [6]. We believe the existing methods require more computation than our proposed methods because they adopt the pair-wise computation model in principle.

Moreover, the existing methods cannot compute the similarity between two given objects independently, because they need to know the similarities among all objects (whether connected directly or indirectly to the target objects) in order to compute the similarity between the two target objects [20]. Our approach, however, does not require the similarities among all objects and hence can compute the similarity between any pair of objects independently, making it possible to parallelize the algorithm. Since our approach need not refer to the similarities among other objects, the vectors for individual objects can be generated independently, and the similarity between any pair of objects can be computed separately, that is, in parallel. The time complexity of this parallelized version would become $O(n^3/m)$ if $m$ processors are utilized in parallel.

When we need to compute the similarity between only a particular pair of objects on-line, rather than computing the similarities among all objects [20], our methods can be made even more efficient by computing the inverse matrix off-line and performing the remaining computation of reachability on-line, as suggested by (9). That is, the vector for a given object can be generated by multiplying the inverse matrix of $Q$ by $q$ [30, 31]. The time complexity of the off-line computation of the inverse matrix is $O(n^{2.376})$ [32], and the complexity of the on-line computation is $O(n)$. Such an off-line approach tends to require a relatively long time and more space to handle the inverse matrix.

Reference [31] suggests an approach using low-rank approximation in order to balance the off-line and on-line computation. This approach has advantages in both time and space complexity, suggesting an improvement for the on-line execution of our proposed approach. In conclusion, the time complexity of our methods is lower than that of the existing methods, and the space complexity is equal to that of the existing methods in any case.

$$
\begin{aligned}
u_a &= (1 - c)\, P_A\, u_a + c\, q_a \\
    &= c\, \bigl(I - (1 - c)\, P_A\bigr)^{-1} q_a \\
    &= c\, Q^{-1} q_a
\end{aligned}
\qquad (9)
$$
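The relationship in (9) between the recursive definition and the closed form through the inverse matrix can be checked numerically. The sketch below is illustrative only: the function names are hypothetical, the transition matrix is assumed to be column-normalized, the restart probability $c = 0.15$ is taken from the experimental setup, and a linear solve is used in place of an explicit matrix inverse.

```python
import numpy as np


def rwr_closed_form(P, start, c=0.15):
    """Solve (I - (1 - c) P) u = c q, i.e. the closed form with Q = I - (1 - c) P."""
    n = P.shape[0]
    q = np.zeros(n)
    q[start] = 1.0
    Q = np.eye(n) - (1.0 - c) * P
    return c * np.linalg.solve(Q, q)


def rwr_power_iteration(P, start, c=0.15, tol=1e-12, max_iter=10_000):
    """Iterate u <- (1 - c) P u + c q until the update becomes negligible."""
    n = P.shape[0]
    q = np.zeros(n)
    q[start] = 1.0
    u = q.copy()
    for _ in range(max_iter):
        nxt = (1.0 - c) * (P @ u) + c * q
        if np.abs(nxt - u).sum() < tol:
            return nxt
        u = nxt
    return u


# Assumed example: normalized undirected matrix of a 4-cycle a-b-d-c-a.
P = np.array([[0.0, 0.5, 0.5, 0.0],
              [0.5, 0.0, 0.0, 0.5],
              [0.5, 0.0, 0.0, 0.5],
              [0.0, 0.5, 0.5, 0.0]])

u_closed = rwr_closed_form(P, start=0)
u_iter = rwr_power_iteration(P, start=0)
print(u_closed)
print(u_iter)
```

Both routines should agree up to the iteration tolerance. The closed form pays the cost of handling $Q$ once, which is the off-line part discussed in the text, while each on-line query then only selects or combines columns.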

4.4. Discussions. The proposed methods do not suffer from the problems of the pair-wise and level-wise models discussed in Section 3. Let us compute the similarities between $a$ and $b$ and between $a'$ and $b'$ in Figure 2 using our approach and check whether the results appeal to our intuition. We assume $\lambda$ and $C$ to be 0.5 and 0.7, respectively. Table 2 shows the similarity results computed by the two proposed methods, together with those obtained by the three existing methods (shown already in Table 1) for comparison. Both of our methods produce the same result: the similarity between $a$ and $b$, and likewise between $a'$ and $b'$, is 1, meaning that the two objects in each pair are regarded as the same. These results appeal to our intuition. In comparison, the results obtained by the existing methods indicate that the two objects are not the same. Since our approach does not use the pair-wise model but performs the computations among identical features, the proposed methods produce results more appealing to our intuition than the existing methods.

Table 2: Similarities between $a$ and $b$ and between $a'$ and $b'$ in Figure 2, using the proposed methods and the existing methods.

| | rvs-SimRank | SimRank | P-Rank | WeightedSum | Undirected |
|---|---|---|---|---|---|
| $s(a, b)$ | 0.396 | 0 | 0.423 | 1 | 1 |
| $s(a', b')$ | 0.321 | 0 | 0.408 | 1 | 1 |

The proposed methods do not suffer from the problem of the level-wise model either. Let us compute the similarity between $a$ and $b$ in Figure 3 using our approach and check whether the results appeal to our intuition. Assuming $\lambda$ and $C$ to be 0.5 and 0.7, respectively, our approach computes the similarity between $a$ and $b$ to be 0.28 using the weightedSum method and 0.6 using the undirected method. For comparison, the existing methods would compute the similarity to be 0, as discussed in Section 3. We cannot tell whether the results obtained by our approach are appropriate, because similarity is by nature a subjective value. Still, we can at least argue that our approach produces results that are more appealing to our intuition than those of the existing methods: the similarity between $a$ and $b$ must not be 0, because they are related indirectly through some common objects. This difference comes from our strategy of considering all the objects indirectly connected to the target objects at once, rather than using the level-wise model.
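To make the last point concrete, the sketch below computes the cosine similarity of two reachability vectors for objects that have no direct link to each other. It is a hedged illustration only: the four-node path graph is hypothetical (it is not the graph of Figure 3), the function names are invented for this example, and the column normalization and restart probability $c = 0.15$ are assumptions.

```python
import numpy as np


def reachability_vector(P, start, c=0.15):
    """Random Walk with Restart scores from `start`, via a linear solve."""
    n = P.shape[0]
    q = np.zeros(n)
    q[start] = 1.0
    return c * np.linalg.solve(np.eye(n) - (1.0 - c) * P, q)


def cosine_similarity(u, v):
    """Cosine of the angle between two vectors; 0.0 if either is all zero."""
    norm_u, norm_v = np.linalg.norm(u), np.linalg.norm(v)
    if norm_u == 0.0 or norm_v == 0.0:
        return 0.0
    return float(np.dot(u, v) / (norm_u * norm_v))


# Hypothetical undirected path graph: node 0 - node 1 - node 2 - node 3.
A = np.array([[0.0, 1.0, 0.0, 0.0],
              [1.0, 0.0, 1.0, 0.0],
              [0.0, 1.0, 0.0, 1.0],
              [0.0, 0.0, 1.0, 0.0]])
P = A / A.sum(axis=0, keepdims=True)  # column-normalized transition matrix

u_first = reachability_vector(P, start=0)
u_last = reachability_vector(P, start=3)

s_indirect = cosine_similarity(u_first, u_last)   # endpoints, only indirectly linked
s_identical = cosine_similarity(u_first, u_first)  # identical vectors
print(s_indirect, s_identical)
```

Because every node is reachable from every other node, both vectors have only positive entries, so the similarity of the two endpoints is strictly positive even though they share no neighbor, while identical vectors yield a similarity of 1.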

5. Experiments

In this section, we verify the effectiveness of our approach through experimentation. We show and analyze quantitatively the experimental results obtained from applying the two proposed methods to practical application domains.

5.1. Experimentation Setup. We carried out experiments to verify the performance of the proposed methods with respect to scientific papers and Web documents, two exemplary types of data sets having link information. For the experiments with the scientific papers, we used a data set consisting of the papers downloaded from DBLP (http://www.informatik.uni-trier.de/~ley/db) and the reference information among the papers crawled from Libra (http://academic.research.microsoft.com). In total, 44,800 papers and 126,281 references were used. For the experiments with the Web documents, we used a data set consisting of 1,227,038 Web pages with 11,164,829 hyperlinks in total, taken from the TREC (http://trec.nist.gov/data/t11.web.html) 2002 data. The experiments were carried out on a platform with a quad-core 2.67 GHz CPU and the Windows Server 2008 OS.

In the experiments, we aimed to evaluate the performance of the two proposed methods (weightedSum and undirected) in comparison with the three existing methods (rvs-SimRank, SimRank, and P-Rank), with the input values of 0.8, 0.5, and 0.15 for $C$, $\lambda$, and the restart probability, respectively, as used frequently in the existing methods [6, 15, 30].

The experiments proceeded as follows. For the weightedSum method, we first assigned appropriate weights to the vectors generated with inlinks and outlinks by varying the weights and finding the ones that achieve the highest accuracy in the similarity computed by the weightedSum method. In the experiments, we tried the weight values of 0, 0.1, 0.3, 0.5, 0.7, 0.9, and 1 for inlinks (note that the sum of the inlink and outlink weights equals 1). Second, we evaluated the accuracy of each method qualitatively by examining the 10 objects that the method computes as the most similar to a target object chosen arbitrarily. Finally, we measured the accuracy of each method quantitatively by comparing the obtained results with the true answers for each data set.

The measurement of the accuracy proceeded as follows. First, we chose one object in turn as the target object from the answer set. Then, we computed the recall [29] by extracting the $m$ objects (where $m$ is 10, 20, 30, 40, or 50) most similar to the target object according to each method. This process was repeated until every object in the set had been chosen as the target object. The average of all recall values obtained in this way was taken as the final accuracy value.
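The evaluation loop described above can be sketched as follows. This is an illustrative reading, not the authors' evaluation code: the function names and the toy rankings are hypothetical, and recall is assumed here to be the fraction of the other answer-set members that appear among the top $m$ results for each target.

```python
def recall_at_m(ranked, relevant, m):
    """Fraction of `relevant` objects that appear in the first m ranked results."""
    if not relevant:
        return 0.0
    hits = sum(1 for obj in ranked[:m] if obj in relevant)
    return hits / len(relevant)


def average_recall(rankings, answer_set, m):
    """Average recall@m, using each member of the answer set as the target in turn.

    rankings maps a target to its candidate objects sorted by decreasing
    similarity (the target itself excluded).
    """
    scores = []
    for target in answer_set:
        relevant = set(answer_set) - {target}
        scores.append(recall_at_m(rankings[target], relevant, m))
    return sum(scores) / len(scores)


# Hypothetical toy data: one answer set of three papers plus unrelated ones (x, y).
answer_set = ["p1", "p2", "p3"]
rankings = {
    "p1": ["p2", "x", "p3"],
    "p2": ["x", "y", "p1"],
    "p3": ["p1", "p2", "x"],
}

accuracy = average_recall(rankings, answer_set, m=2)
print(accuracy)  # 0.5 for this toy data: recalls of 0.5, 0.0, and 1.0 averaged
```

With several answer sets, the same loop would run over each set in turn and the per-target recalls would be pooled before averaging; how the original experiments pooled them is not stated in the text.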

The answers for each data set were constructed in the following manner. For the experiments with scientific papers, we selected five well-known areas (i.e., clustering, sequential pattern mining, spatial databases, link mining, and graph pattern mining) in a data mining textbook [29] and obtained the papers referenced in each section of these areas. We supposed that the reference papers in the same section are similar to one another. Thus, the papers in a section formed an answer set. Each answer set had 3 to 14 papers, and the total number of papers in all such answer sets was 106. For the experiments with Web documents, we used TREC 2002, which provides Web document sets related to specific keywords. We chose 9 Web document sets randomly from TREC 2002 and used them as the answer sets.

5.2. Domain of Scientific Papers. Figure 6 depicts the accuracy of the similarities among scientific papers obtained by the weightedSum method using various weight values for the vectors generated with inlinks and outlinks. The $x$-axis represents the number of “the most similar” papers selected by the method and the $y$-axis the accuracy. Annotations such as “0.0 : 1.0” represent the weight balance between inlinks and outlinks. According to Figure 6, the accuracy of the weightedSum method tends to become higher as the weight for inlinks becomes higher than the weight for outlinks. This is because the papers in the answer set are relatively famous ones that are frequently referenced by other papers. For this reason, we conducted the experiments using the “0.9 : 0.1” weight balance in order to obtain the best result.

Figure 6: Accuracy of the weightedSum method with varying weights. ($x$-axis: $m$, from 10 to 50; $y$-axis: accuracy, from 0 to 1; curves for the inlink : outlink weight balances 0.0 : 1.0, 0.1 : 0.9, 0.3 : 0.7, 0.5 : 0.5, 0.7 : 0.3, 0.9 : 0.1, and 1.0 : 0.0.)

Tables 3, 4, 5, 6, and 7 present the lists of the top 10 papers found to be the most similar to a target paper, [33], using the three existing methods and the two proposed methods, respectively. (Note that [33] is concerned with clustering in data mining.) In the tables, the papers that are not similar to [33] are set in italics. These papers mainly deal with outlier detection or frequent pattern mining, indicating that the existing methods reached a wrong conclusion for them. In comparison, the results of the weightedSum method (shown in Table 6) include only one wrong entry (on frequent pattern mining) in the top 10 list, and the results of the undirected method include no wrong papers (i.e., all are related to clustering). These results imply that the undirected method performs better than the three existing methods and even better than the weightedSum method. We repeated this experiment many times with target papers other than [33] and obtained results similar to those in Tables 3–7.

Figure 7 compares the accuracy of the similarities computed by the three existing methods and the two proposed methods. As in Figure 6, the $x$-axis represents the number of “the most similar” papers selected by each method and the $y$-axis the accuracy. The weightedSum method improved accuracy by 9% on average and by up to 13% compared with SimRank. The undirected method improved accuracy by 16% on average and by up to 20% compared with SimRank. In conclusion, the two proposed methods compute the similarity more accurately than the existing methods. In particular, the undirected method performs the best.

Figure 7: Accuracy of the similarity measures in scientific papers data. ($x$-axis: $m$, from 10 to 50; $y$-axis: accuracy, from 0 to 1; curves for SimRank, rvs-SimRank, P-Rank, WeightedSum, and Undirected.)

5.3. Domain of Web Documents. Figure 8 shows the accuracy of the similarities between Web pages obtained by the weightedSum method using various weight values for inlinks and outlinks. The $x$-axis represents the number of “the most similar” Web pages selected by the method and the $y$-axis the accuracy. Annotations such as “0.0 : 1.0” represent the weight balance between inlinks and outlinks. Again, the accuracy of the weightedSum method tends to become higher as the weight for inlinks becomes higher, as in the case of scientific papers. This is because the answer set contains many Web pages with high authority. For this reason, we conducted the experiments using the “0.9 : 0.1” weight balance in order to obtain the best result.

Figure 8: Accuracy of the weightedSum method with varying weights. ($x$-axis: $m$, from 10 to 50; $y$-axis: accuracy, from 0 to 0.5; curves for the inlink : outlink weight balances 0.0 : 1.0, 0.1 : 0.9, 0.3 : 0.7, 0.5 : 0.5, 0.7 : 0.3, 0.9 : 0.1, and 1.0 : 0.0.)

Figure 9 compares the accuracy of the similarities computed by the three existing methods and the two proposed methods. Again, the $x$-axis represents the number of “the most similar” documents selected by each method and the $y$-axis the accuracy. The weightedSum method improved accuracy by 20% on average and by up to 24% compared with SimRank. The undirected method improved accuracy by 34% on average and by up to 43% compared with SimRank. In conclusion, the two proposed methods compute the similarity more accurately than the existing methods. In particular, the undirected method outperforms the weightedSum method and the three existing methods by a large margin.

Table 3: Top 10 papers similar to [33] using SimRank.

| First author | Title | Conference/journal | Year |
|---|---|---|---|
| Guha | CURE: An Efficient Clustering Algorithm for Large Databases | ACM SIGMOD | 1998 |
| Sheikholeslami | WaveCluster: A Multi-Resolution Clustering Approach for Very Large Spatial Databases | VLDB | 1998 |
| Ester | Knowledge Discovery in Large Spatial Databases: Focusing Techniques for Efficient Class Identification | SSD | 1995 |
| Hinneburg | An Efficient Approach to Clustering in Large Multimedia Databases with Noise | AAAI | 1998 |
| Ng | Efficient and Effective Clustering Methods for Spatial Data Mining | VLDB | 1994 |
| Bradley | Scaling Clustering Algorithms to Large Databases | AAAI | 1998 |
| Wang | STING: A Statistical Information Grid Approach to Spatial Data Mining | VLDB | 1997 |
| L. O’Callaghan | Streaming-Data Algorithms for High-Quality Clustering | IEEE ICDE | 2002 |
| Sander | A Density-Based Algorithm for Discovering Clusters in Large Spatial Databases with Noise | DMKD | 1998 |
| Arning | A Linear Method for Deviation Detection in Large Databases | ACM KDD | 1996 |

Table 4: Top 10 papers similar to [33] using rvs-SimRank.

| First author | Title | Conference/journal | Year |
|---|---|---|---|
| Knorr | A Unified Notion of Outliers: Properties and Computation | ACM KDD | 1997 |
| Guha | CURE: An Efficient Clustering Algorithm for Large Databases | ACM SIGMOD | 1998 |
| Sander | A Density-Based Algorithm for Discovering Clusters in Large Spatial Databases with Noise | DMKD | 1998 |
| Bradley | Scaling Clustering Algorithms to Large Databases | AAAI | 1998 |
| Guha | ROCK: A Robust Clustering Algorithm for Categorical Attributes | IEEE ICDE | 1999 |
| Sheikholeslami | WaveCluster: A Multi-Resolution Clustering Approach for Very Large Spatial Databases | VLDB | 1998 |
| Arning | A Linear Method for Deviation Detection in Large Databases | ACM KDD | 1996 |
| Burdick | MAFIA: A Maximal Frequent Itemset Algorithm for Transactional Databases | IEEE TKDE | 2005 |
| Mannila | Efficient Algorithms for Discovering Association Rules | AAAI | 1994 |
| Kamber | Metarule-Guided Mining of Multi-Dimensional Association Rules Using Data Cubes | ACM KDD | 1997 |

Table 5: Top 10 papers similar to [33] using P-Rank.

| First author | Title | Conference/journal | Year |
|---|---|---|---|
| Knorr | A Unified Notion of Outliers: Properties and Computation | ACM KDD | 1997 |
| Guha | CURE: An Efficient Clustering Algorithm for Large Databases | ACM SIGMOD | 1998 |
| Sander | A Density-Based Algorithm for Discovering Clusters in Large Spatial Databases with Noise | DMKD | 1998 |
| Bradley | Scaling Clustering Algorithms to Large Databases | AAAI | 1998 |
| Sheikholeslami | WaveCluster: A Multi-Resolution Clustering Approach for Very Large Spatial Databases | VLDB | 1998 |
| Guha | ROCK: A Robust Clustering Algorithm for Categorical Attributes | IEEE ICDE | 1999 |
| Mannila | Efficient Algorithms for Discovering Association Rules | AAAI | 1994 |
| Arning | A Linear Method for Deviation Detection in Large Databases | ACM KDD | 1996 |
| Silberschatz | What Makes Patterns Interesting in Knowledge Discovery Systems | IEEE TKDE | 1996 |
| Agrawal | Mining Association Rules between Sets of Items in Large Databases | ACM SIGMOD | 1993 |

Table 6: Top 10 papers similar to [33] using the weightedSum method.

| First author | Title | Conference/journal | Year |
|---|---|---|---|
| Ester | Knowledge Discovery in Large Spatial Databases: Focusing Techniques for Efficient Class Identification | SSD | 1995 |
| Ng | Efficient and Effective Clustering Methods for Spatial Data Mining | VLDB | 1994 |
| Sheikholeslami | WaveCluster: A Multi-Resolution Clustering Approach for Very Large Spatial Databases | VLDB | 1998 |
| Guha | ROCK: A Robust Clustering Algorithm for Categorical Attributes | IEEE ICDE | 1999 |
| Guha | CURE: An Efficient Clustering Algorithm for Large Databases | ACM SIGMOD | 1998 |
| Wang | STING: A Statistical Information Grid Approach to Spatial Data Mining | VLDB | 1997 |
| Hinneburg | An Efficient Approach to Clustering in Large Multimedia Databases with Noise | AAAI | 1998 |
| Aggarwal | Fast Algorithms for Projected Clustering | ACM SIGMOD | 1999 |
| Agrawal | Automatic Subspace Clustering of High Dimensional Data for Data Mining Applications | ACM SIGMOD | 1998 |
| Sander | A Density-Based Algorithm for Discovering Clusters in Large Spatial Databases with Noise | DMKD | 1998 |

Table 7: Top 10 papers similar to [33] using the undirected method.

| First author | Title | Conference/journal | Year |
|---|---|---|---|
| Guha | CURE: An Efficient Clustering Algorithm for Large Databases | ACM SIGMOD | 1998 |
| Ng | Efficient and Effective Clustering Methods for Spatial Data Mining | VLDB | 1994 |
| Sander | A Density-Based Algorithm for Discovering Clusters in Large Spatial Databases with Noise | DMKD | 1998 |
| Bradley | Scaling Clustering Algorithms to Large Databases | AAAI | 1998 |
| Sheikholeslami | WaveCluster: A Multi-Resolution Clustering Approach for Very Large Spatial Databases | VLDB | 1998 |
| Hinneburg | An Efficient Approach to Clustering in Large Multimedia Databases with Noise | AAAI | 1998 |
| Wang | STING: A Statistical Information Grid Approach to Spatial Data Mining | VLDB | 1997 |
| Agrawal | Automatic Subspace Clustering of High Dimensional Data for Data Mining Applications | ACM SIGMOD | 1998 |
| Aggarwal | Fast Algorithms for Projected Clustering | ACM SIGMOD | 1999 |
| Ankerst | OPTICS: Ordering Points To Identify the Clustering Structure | ACM SIGMOD | 1999 |

Figure 9: Accuracy of the similarity measures in Web documents data. ($x$-axis: $m$, from 10 to 50; $y$-axis: accuracy, from 0 to 0.5; curves for SimRank, rvs-SimRank, P-Rank, WeightedSum, and Undirected.)

6. Conclusions

This paper presented new link-based similarity methods that compute the similarity between objects more accurately by using the link information pertaining to the objects. Noticing that most existing link-based similarity methods use the pair-wise and level-wise models, we analyzed the problems with these models and proposed a new approach that does not suffer from them. In our proposed approach, each object is represented by a vector whose elements correspond to all objects in the given universe and whose values are the weights of the corresponding objects. As for this weight value, we proposed to utilize the notion of reachability between objects, computed using the “Random Walk with Restart” strategy. Then, we defined the similarity between two objects as the cosine similarity between the two vectors representing the two objects. The proposed approach was then refined into two methods, the weightedSum and the undirected methods, differentiated by their strategy for handling the information on link directions. Examples showed that the two methods do not suffer from the problems of the pair-wise and level-wise models. In our experimentation with the scientific papers and Web documents data sets, the results indicated that both of the proposed methods generally outperform the existing methods significantly.

Conflict of Interests

The authors declare that there is no conflict of interests regarding the publication of this paper.

Acknowledgments

This work was supported by (1) Business for Cooperative R&D between Industry, Academy, and Research Institute funded by Korea Small and Medium Business Administration in 2013 (Grant no. C0006278), (2) the MSIP (Ministry of Science, ICT and Future Planning), Korea, under the ITRC (Information Technology Research Center) Support Program (NIPA-2013-H0301-13-4009) supervised by the NIPA (National IT Industry Promotion Agency), (3) Ministry of Culture, Sports and Tourism (MCST) and from Korea Copyright Commission in 2013, and (4) the National Research Foundation of Korea (NRF) Grant funded by the Korea government (MEST) (no. 2011-0029181).

References

[1] J. Shen, Y. Zhu, H. Zhang, C. Chen, R. Sun, and F. Xu, “A content-based algorithm for blog ranking,” in Proceedings of the International Conference on Internet Computing in Science and Engineering (ICICSE ’08), pp. 19–22, January 2008.

[2] A. Broder, S. Glassman, M. Manasse, and G. Zweig, “Syntactic clustering of the web,” in Proceedings of the 6th International World Wide Web Conference (WWW ’97), pp. 391–404, 1997.

[3] X. Su and T. Khoshgoftaar, “A survey of collaborative filtering techniques,” Advances in Artificial Intelligence, vol. 2009, Article ID 421425, 19 pages, 2009.

[4] S.-H. Yoon, S.-W. Kim, and S. Park, “A link-based similarity measure for scientific literature,” in Proceedings of the 19th International World Wide Web Conference (WWW ’10), pp. 1213–1214, April 2010.

[5] M. R. Hamedani, S. Lee, and S. Kim, “On exploiting content and citations together to compute similarity of scientific papers,” in Proceedings of the ACM International Conference on Information and Knowledge Management (CIKM ’13), San Francisco, Calif, USA, 2013.

[6] P. Zhao, J. Han, and Y. Sun, “P-Rank: a comprehensive structural similarity measure over information networks,” in Proceedings of the ACM 18th International Conference on Information and Knowledge Management (CIKM ’09), pp. 553–562, November 2009.

[7] D.-H. Bae, S.-M. Hwang, S.-W. Kim, and C. Faloutsos, “Constructing seminal paper genealogy,” in Proceedings of the 20th ACM Conference on Information and Knowledge Management (CIKM ’11), pp. 2101–2104, October 2011.

[8] F. Menczer, “Combining link and content analysis to estimate semantic similarity,” in Proceedings of the 13th International World Wide Web Conference (WWW ’04), pp. 452–453, May 2004.

[9] R. Baeza-Yates and B. Ribeiro-Neto, Modern Information Retrieval, Addison Wesley, 1999.

[10] S.-H. Yoon, S.-W. Kim, J.-S. Kim, and W.-S. Hwang, “On computing text-based similarity in scientific literature,” in Proceedings of the 20th International Conference Companion on World Wide Web (WWW ’11), pp. 169–170, April 2011.

[11] G. Bisson and F. Hussain, “χ-Sim: a new similarity measure for the co-clustering task,” in Proceedings of the 7th International Conference on Machine Learning and Applications (ICMLA ’08), pp. 211–217, December 2008.

[12] H. Small, “CoCitation in the scientific literature: a new measure of the relationship between two documents,” Journal of the American Society for Information Science, vol. 24, no. 4, pp. 265–269, 1973.

[13] M. Kessler, “Bibliographic coupling between scientific papers,” Journal of the American Documentation, vol. 14, no. 1, pp. 10–25, 1963.

[14] R. Amsler, “Application of citation-based automatic classification,” Technical Report, The University of Texas at Austin, Linguistics Research Center, 1972.

[15] G. Jeh and J. Widom, “SimRank: a measure of structural-context similarity,” in Proceedings of the 8th ACM SIGKDD International Conference on Knowledge Discovery and Data Mining (KDD ’02), pp. 538–543, July 2002.

[16] U. Shardanand and P. Maes, “Social information filtering: algorithms for automating ‘word of mouth’,” in Proceedings of the SIGCHI Conference on Human Factors in Computing Systems (CHI ’95), pp. 210–217, ACM Press, Denver, Colorado, 1995.

[17] I. Antonellis, H. Garcia-Molina, and C. Chang, “Simrank++: query rewriting through link analysis of the click graph,” in Proceedings of the 34th International Conference on Very Large Data Bases (VLDB ’08), pp. 408–421, Auckland, New Zealand, August 2008.


[18] D. Fogaras and B. Racz, “Scaling link-based similarity search,” in Proceedings of the 14th International Conference on World Wide Web (WWW ’05), pp. 641–650, 2005.

[19] C. Li, J. Han, G. He et al., “Fast computation of SimRank for static and dynamic information networks,” in Proceedings of the 13th International Conference on Extending Database Technology: Advances in Database Technology (EDBT ’10), pp. 465–476, March 2010.

[20] P. Li, H. Liu, J. Xu Yu, J. He, and X. Du, “Fast single-pair SimRank computation,” in Proceedings of the SIAM International Conference on Data Mining (SDM ’10), pp. 571–582, SIAM, Columbus, Ohio, USA, April 2010.

[21] X. Yin, J. Han, and P. S. Yu, “LinkClus: efficient clustering via heterogeneous semantic links,” in Proceedings of the 32nd International Conference on Very Large Data Bases (VLDB ’06), pp. 427–438, 2006.

[22] D. Lizorkin, P. Velikhov, M. Grinev, and D. Turdakov, “Accuracy estimate and optimization techniques for SimRank computation,” in Proceedings of the Very Large Data Bases Endowment, vol. 1, pp. 422–433, 2008.

[23] Y. Cai, G. Cong, X. Jia et al., “Efficient algorithm for computing link-based similarity in real world networks,” in Proceedings of the 9th IEEE International Conference on Data Mining (ICDM ’09), pp. 734–739, December 2009.

[24] G. He, H. Feng, C. Li, and H. Chen, “Parallel SimRank computation on large graphs with iterative aggregation,” in Proceedings of the 16th ACM SIGKDD International Conference on Knowledge Discovery and Data Mining (KDD ’10), pp. 543–552, July 2010.

[25] D. Lizorkin, P. Velikhov, M. Grinev, and D. Turdakov, “Accuracy estimate and optimization techniques for SimRank computation,” The Int’l Journal on Very Large Data Bases, vol. 19, no. 1, pp. 45–66, 2010.

[26] Z. Lin, M. R. Lyu, and I. King, “MatchSim: a novel neighbor-based similarity measure with maximum neighborhood matching,” in Proceedings of the ACM 18th International Conference on Information and Knowledge Management (CIKM ’09), pp. 1613–1616, November 2009.

[27] R. R. Larson, “Bibliometrics of the world wide web: an exploratory analysis of the intellectual structure of cyberspace,” Proceedings of the Annual Meeting of the American Society for Information Science, vol. 33, pp. 71–78, 1996.

[28] J. Pitkow and P. Pirolli, “Life, death, and lawfulness on the electronic frontier,” in Proceedings of the Conference on Human Factors in Computing Systems (CHI ’97), pp. 383–390, March 1997.

[29] J. Han and M. Kamber, Data Mining: Concepts and Techniques, Morgan Kaufmann, 2006.

[30] J. Sun, H. Qu, D. Chakrabarti, and C. Faloutsos, “Neighborhood formation and anomaly detection in bipartite graphs,” in Proceedings of the 5th IEEE International Conference on Data Mining (ICDM ’05), pp. 418–425, November 2005.

[31] H. Tong, C. Faloutsos, and J.-Y. Pan, “Fast random walk with restart and its applications,” in Proceedings of the 6th International Conference on Data Mining (ICDM ’06), pp. 613–622, December 2006.

[32] S. Robinson, “Toward an optimal algorithm for matrix multiplication,” News Journal of the Society for Industrial and Applied Mathematics, vol. 38, no. 9, 2005.

[33] T. Zhang, R. Ramakrishnan, and M. Livny, “BIRCH: an efficient data clustering method for very large databases,” in Proceedings of the ACM SIGMOD International Conference on Management of Data (SIGMOD ’96), pp. 103–114, 1996.


Figure 1: An example graph (nodes a, b, c, d, e, f, g, h, i, j, k, l, and m).

the factor to balance the weights between the two similarities involved. In general, $\lambda$ is set to 0.5 to assign equal weights to the two [6, 14].

$$
S(x, y) = \lambda \times |I(x) \cap I(y)| + (1 - \lambda) \times |O(x) \cap O(y)| \qquad (3)
$$

The concepts of these three measures can be illustrated using the graph in Figure 1, which represents objects as nodes and reference relationships among objects as links. The similarity between objects $g$ and $h$ in the graph is 2 when computed by Cocitation, since there exist two objects (i.e., $d$ and $e$) commonly pointing to $g$ and $h$. On the other hand, the similarity is 1 when computed by Bibliographic coupling, since there exists only one object (i.e., $k$) commonly pointed to by $g$ and $h$. Finally, Amsler would compute the similarity between $g$ and $h$ to be 1.5, assuming the same weight ($\lambda = 0.5$) for both the Cocitation and Bibliographic coupling measures (i.e., $1.5 = 0.5 \times 2 + 0.5 \times 1$ using (3)).

Note that the similarities computed in this way by Cocitation, Bibliographic coupling, and Amsler tend, in general, to become larger as the number of links among the objects becomes larger. This effect can be normalized by dividing the similarities by the size of the union of the sets of objects pointing to (or pointed to by) the two objects, so that the resulting similarity lies between 0 and 1 [29].
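The worked example for $g$ and $h$, together with the normalization just described, can be reproduced with a few set operations. The sketch below is illustrative: the edge list contains only the links of Figure 1 that the text states explicitly (the full figure has more), and the function names are chosen here for the example.

```python
from collections import defaultdict

# Only the links of Figure 1 that the text spells out; the full graph has more.
EDGES = [("d", "g"), ("d", "h"), ("e", "g"), ("e", "h"),
         ("g", "k"), ("h", "k"),
         ("b", "e"), ("b", "f"), ("e", "h"), ("f", "i")]


def build_link_sets(edges):
    """Return (inlinks, outlinks): for each node, who points to it / whom it points to."""
    inlinks, outlinks = defaultdict(set), defaultdict(set)
    for src, dst in edges:
        outlinks[src].add(dst)
        inlinks[dst].add(src)
    return inlinks, outlinks


def cocitation(x, y, inlinks):
    """Number of objects pointing to both x and y."""
    return len(inlinks[x] & inlinks[y])


def bibliographic_coupling(x, y, outlinks):
    """Number of objects pointed to by both x and y."""
    return len(outlinks[x] & outlinks[y])


def amsler(x, y, inlinks, outlinks, lam=0.5):
    """Weighted sum of the two counts, following (3)."""
    return lam * cocitation(x, y, inlinks) + (1 - lam) * bibliographic_coupling(x, y, outlinks)


def normalized_cocitation(x, y, inlinks):
    """Cocitation divided by the size of the union of the two inlink sets."""
    union = inlinks[x] | inlinks[y]
    return len(inlinks[x] & inlinks[y]) / len(union) if union else 0.0


inlinks, outlinks = build_link_sets(EDGES)
print(cocitation("g", "h", inlinks))                # 2 (d and e)
print(bibliographic_coupling("g", "h", outlinks))   # 1 (k)
print(amsler("g", "h", inlinks, outlinks))          # 1.5
print(normalized_cocitation("g", "h", inlinks))     # 1.0 on this partial edge list
```

On the same partial edge list, `cocitation("h", "i", inlinks)` is 0, which is the case that motivates the recursive measures discussed next.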

SimRank uses the concept that “two objects are similar if they are related to similar objects.” In the graph of Figure 1, for example, the similarity between objects $h$ and $i$ would be computed as zero, that is, interpreted as “not similar at all,” using Cocitation, since no object exists that commonly points to both $h$ and $i$. Nevertheless, these two objects could be seen as similar to some degree in that $h$ and $i$ are pointed to by two “similar” objects, $e$ and $f$, respectively; $e$ and $f$ are similar because both are pointed to by $b$. SimRank exploits this concept by recursively computing the similarity between two objects, say $x$ and $y$, as the average of all the similarities between every object pointing to $x$ and every object pointing to $y$. This concept is described by (4), where $x$ and $y$ denote objects, $S(x, y)$ the similarity between $x$ and $y$, and $I(x)$ and $I(y)$ the sets of objects pointing to $x$ and $y$, respectively. $I_i(x)$ is the $i$th object in the list of objects pointing to $x$, and $C$ is the decay factor, which has a value between 0 and 1. The decay factor reduces the weights of the computed similarity as the iterations get deeper. As shown by the following equation, SimRank yields the similarity as a value between 0 and 1 by normalizing the summation of the similarities of all pairs of objects in the Cartesian product of the two sets $I(x)$ and $I(y)$ by its cardinality [15].

$$
S(x, y) = \frac{C}{|I(x)|\,|I(y)|} \sum_{i=1}^{|I(x)|} \sum_{j=1}^{|I(y)|} S\bigl(I_i(x), I_j(y)\bigr) \qquad (4)
$$
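A naive iteration of (4) can be sketched as follows. This is a minimal illustration under stated assumptions, not an optimized or authoritative implementation: it uses the usual base cases from [15], which the text does not restate ($S(x, x) = 1$, and $S(x, y) = 0$ when either object has no inlinks), a fixed number of iterations, and only the links $b \to e$, $b \to f$, $e \to h$, and $f \to i$ mentioned in the text.

```python
def simrank(nodes, inlinks, C=0.8, iterations=10):
    """Iterate the SimRank recurrence a fixed number of times.

    inlinks maps a node to the set of nodes pointing to it.
    Assumed base cases: S(x, x) = 1; S(x, y) = 0 if x or y has no inlinks.
    """
    sim = {x: {y: 1.0 if x == y else 0.0 for y in nodes} for x in nodes}
    for _ in range(iterations):
        new = {x: {} for x in nodes}
        for x in nodes:
            for y in nodes:
                if x == y:
                    new[x][y] = 1.0
                    continue
                in_x, in_y = inlinks.get(x, ()), inlinks.get(y, ())
                if not in_x or not in_y:
                    new[x][y] = 0.0
                    continue
                # Sum over the Cartesian product I(x) x I(y), then normalize.
                total = sum(sim[i][j] for i in in_x for j in in_y)
                new[x][y] = C * total / (len(in_x) * len(in_y))
        sim = new
    return sim


# Links stated in the text: b -> e, b -> f, e -> h, f -> i.
NODES = ["b", "e", "f", "h", "i"]
INLINKS = {"e": {"b"}, "f": {"b"}, "h": {"e"}, "i": {"f"}}

scores = simrank(NODES, INLINKS, C=0.8)
print(scores["e"]["f"])  # C * S(b, b) = 0.8
print(scores["h"]["i"])  # C * S(e, f), about 0.64
```

The example shows the behavior described above: $h$ and $i$ share no common inlink, so Cocitation gives 0, yet SimRank assigns them a positive similarity because their inlinks $e$ and $f$ are themselves similar.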

In a sense, SimRank expands Cocitation to a broader scope of neighbor objects in the similarity definition, so as to count not just the adjacent objects directly linked to the two target objects (as in Cocitation) but also the effects of all other objects indirectly linked (through the recursive computation). In a similar manner, [6] expands Bibliographic coupling to yield rvs-SimRank and Amsler to yield P-Rank. The rvs-SimRank measure is expressed by (5), which differs from (4) of SimRank only in that it uses outlinks instead of inlinks. P-Rank is expressed by (6), which computes the similarity as a weighted sum of the two similarities obtained by SimRank and rvs-SimRank, respectively, in each iteration step. Consider the following:

$$
S(x, y) = \frac{C}{|O(x)|\,|O(y)|} \sum_{i=1}^{|O(x)|} \sum_{j=1}^{|O(y)|} S\bigl(O_i(x), O_j(y)\bigr) \tag{5}
$$

$$
\begin{aligned}
S(x, y) ={}& \lambda \times \frac{C}{|I(x)|\,|I(y)|} \sum_{i=1}^{|I(x)|} \sum_{j=1}^{|I(y)|} S\bigl(I_i(x), I_j(y)\bigr) \\
&+ (1 - \lambda) \times \frac{C}{|O(x)|\,|O(y)|} \sum_{i=1}^{|O(x)|} \sum_{j=1}^{|O(y)|} S\bigl(O_i(x), O_j(y)\bigr)
\end{aligned} \tag{6}
$$

In another direction, various approaches have sought to improve the accuracy of existing measures [4, 18, 26]. Reference [18] proposes applying the Jaccard coefficient to SimRank in order to remedy the phenomenon that the similarities tend to become lower among objects having a larger number of links. Reference [26] proposes improving the accuracy of SimRank by taking the average of the similarities only between the maximally matching neighbor objects across the two groups associated with the two target objects, in order to resolve the problem indicated in [18].

Many approaches have also been investigated to improve the speed of existing measures [18, 20, 21]. Reference [18] suggests improving the performance of SimRank by first constructing a fingerprint tree for each object and then using such trees to approximate the similarity to be obtained by SimRank. Reference [21] proposes reducing the time and space complexity of SimRank by utilizing a tree structure called SimTree, which allows storing directly the similarities among similar objects but computing the similarities among dissimilar objects using the path information of the tree. Reference [20] aims to compute the similarity between two objects using SimRank on-line in real time, by considering only those objects related directly to the two target objects rather than computing the similarities involving all objects. Reference [24] investigates a method to run SimRank in parallel using GPGPU (general-purpose computation on graphics processors) and a method to approximately compute the similarity in a dynamic graph using uncoupling Markov chains.

3. Motivation

In this section, we discuss the problems with existing link-based similarity measures. First, we cast existing measures into a method combining two computation models, which we will call “pair-wise” and “level-wise,” and explain the inherent difficulties in these models. We then analyze and illustrate the limitations of three representative existing methods, namely, rvs-SimRank, SimRank, and P-Rank, showing that each of these methods actually combines the pair-wise and level-wise models.

3.1. Pair-Wise and Level-Wise Computation Models. In the graph of Figure 1, again, SimRank would compute the similarity between $k$ and $l$ by taking the average of the four similarities obtained from the four pairs $(g, h)$, $(g, i)$, $(h, h)$, and $(h, i)$, namely, the Cartesian product of the set $\{g, h\}$, the objects directly “in-linked” to $k$, with the set $\{h, i\}$, the objects directly “in-linked” to $l$. We call such a way of pairing, and averaging out the similarities for all such pairs, the “pair-wise” computation model. In this example, the similarity between $k$ and $l$, $S(k, l)$, becomes the average of $S(g, h)$, $S(g, i)$, $S(h, h)$, and $S(h, i)$, scaled by the decay factor $C$ as in (4). Here, $S(h, h) = 1$ by definition, and $S(g, h)$, $S(g, i)$, and $S(h, i)$ should also be computed in turn by SimRank in a recursive manner using the same pair-wise model. That is, $S(g, h)$ is computed from the Cartesian product of $g$'s neighbors $\{d, e\}$ and $h$'s neighbors $\{d, e\}$; $S(g, i)$ from $g$'s neighbors $\{d, e\}$ and $i$'s neighbors $\{f\}$; and $S(h, i)$ from $h$'s neighbors $\{d, e\}$ and $i$'s neighbors $\{f\}$, all using the pair-wise model. In this way, the recursive process continues and gets deeper, using all the objects linked directly or indirectly to the two target objects for which the similarity is being computed.

On the other hand, such a model can be seen as computing the similarity between two objects by utilizing only those other objects located at the same distance from the two targets, level by level. In Figure 1, for example, the model computes the similarity between $k$ and $l$ by utilizing the objects at distance 1 (namely, nodes $g$, $h$, and $i$) in the first round, then the objects at distance 2 (namely, $d$, $e$, and $f$) in the second round, and then the objects at distance 3 (namely, $a$, $b$, and $c$) in the third round. We call such a process the “level-wise” computation model.

3.2. Problem of Pair-Wise Computation Model. Figure 2 illustrates the problem of the pair-wise model, which induces the phenomenon that the similarities among objects having more links tend to become lower than those among objects having fewer links. In the graphs of Figure 2, again, nodes represent objects and links represent the reference relationships among objects. For example, objects $a$ and $b$ in Figure 2(a) both refer to the same three objects, $c$, $d$, and $e$, all of which in turn refer to the same two objects, $f$ and $g$. Similarly, objects $a'$ and $b'$ in Figure 2(b) both refer to the same objects, all of which in turn refer to the same objects. The only difference between these two graphs is that the numbers of links from objects $a'$ and $b'$ are larger (i.e., six links each) than those from objects $a$ and $b$ (i.e., three links each). Intuitively, the similarity between $a$ and $b$ should be the same as that between $a'$ and $b'$, by observing that $a$ and $b$ both refer to the same objects, and $a'$ and $b'$ also both refer to the same objects. Existing methods, however, produce different results.

Figure 2: Example graphs to illustrate the problem with the pair-wise model: (a) a graph over objects $a$ to $g$; (b) a graph over objects $a'$ to $j'$.

Table 1: Similarities between $a$ and $b$ and between $a'$ and $b'$ in Figure 2.

| | rvs-SimRank | SimRank | P-Rank |
|---|---|---|---|
| $s(a, b)$ | 0.396 | 0 | 0.423 |
| $s(a', b')$ | 0.321 | 0 | 0.408 |

Table 1 shows the resulting similarities between $a$ and $b$ and between $a'$ and $b'$, as computed by three existing methods, rvs-SimRank, SimRank, and P-Rank, assuming $C$ and $\lambda$ to be 0.7 and 0.5 in formulas (4), (5), and (6). Two entries of the table indicate that the computed similarities between $a$ and $b$ are higher than those between $a'$ and $b'$, which is counterintuitive. Considering that the difference in the numbers of links between the two graphs in this example is only as small as four, one can imagine that this phenomenon would become clearer as the numbers of links get larger, for example, with such data sets as scientific papers or Web documents. In the papers data set, for example, well-known papers tend to have many reference links, because they will, in general, get substantially larger numbers of citations than ordinary papers; similarly, in the Web documents data set, portal sites tend to have many links, because they are referenced more frequently than ordinary sites. We envisage that, for these domains, those methods which use the pair-wise computation model cannot compute the similarity between objects accurately.

Figure 3: An example graph to illustrate the problem with the level-wise model.
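As a worked check of the rvs-SimRank column of Table 1, formula (5) can be evaluated by hand with $C = 0.7$. The derivation below is a sketch under two assumptions: objects without outlinks (such as $f$ and $g$) have similarity 0 to each other, and in Figure 2(b) the objects $a'$ and $b'$ each point to the six objects $c'$ to $h'$, each of which points to $i'$ and $j'$. The second assumption is inferred from the node labels and the "six links each" description, not read off the figure itself.

$$
\begin{aligned}
s(c, d) &= \frac{C}{2 \cdot 2}\bigl[s(f, f) + s(f, g) + s(g, f) + s(g, g)\bigr] = \frac{0.7}{4}(1 + 0 + 0 + 1) = 0.35, \\
s(a, b) &= \frac{C}{3 \cdot 3}\bigl[3 \cdot 1 + 6 \cdot 0.35\bigr] = \frac{0.7 \times 5.1}{9} \approx 0.3967, \\
s(a', b') &= \frac{C}{6 \cdot 6}\bigl[6 \cdot 1 + 30 \cdot 0.35\bigr] = \frac{0.7 \times 16.5}{36} \approx 0.3208.
\end{aligned}
$$

These values agree with the 0.396 and 0.321 reported in Table 1 up to rounding. The drop comes entirely from the averaging: the identical pairs, which contribute similarity 1, make up 3 of 9 pairs in Figure 2(a) but only 6 of 36 pairs in Figure 2(b).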

3.3. Problem of Level-Wise Computation Model. The level-wise computation model computes the similarity between two objects by utilizing only those objects in the graph at the same distance from the two. The problem with this model can be illustrated using Figure 3, where, again, nodes represent objects and directed links represent the reference relationships between objects. Intuitively, we can say that the two objects $a$ and $b$ are similar to some degree, as they share, though indirectly, an object $c$ in common. All of the aforementioned three methods, however, compute the similarity between these two objects to be 0. For this case, one would expect to compute the similarity by considering all objects directly and indirectly linked to $a$ (namely, $c$) and all objects directly and indirectly linked to $b$ (namely, $c$, $d$, and $e$). In fact, however, any method which utilizes the level-wise model will compute $s(a, b)$ by using only $c$ (as $a$'s neighbor) and $d$ (as $b$'s neighbor). Here, $s(c, d)$ should also be computed by taking the average of the similarities for all possible pairs across $c$'s neighbor objects and $d$'s neighbor objects. In this case, however, $c$ has no neighbor, yielding $s(c, d) = 0$, even though there still remain $d$'s neighbors, $c$ and $e$, which should also be counted by some means. As a result, $s(a, b)$ also becomes 0, indicating that the two objects are not similar at all. From this example, we conclude that the level-wise model cannot compare objects that are not located at the same distance and consequently cannot compute the similarity properly.
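The level-wise failure can be reproduced with a direct reading of formulas (4) to (6). The sketch below is illustrative only. The function name `p_rank`, the edge list for Figure 3 ($a \to c$, $b \to d$, $d \to c$, $d \to e$, inferred from the neighbor relationships described above), and the convention that an empty neighbor set contributes 0 are all assumptions. Setting $\lambda = 1$ gives SimRank and $\lambda = 0$ gives rvs-SimRank.

```python
from itertools import product


def p_rank(edges, C=0.7, lam=0.5, iterations=20):
    """Iterate equation (6) on a directed graph given as (source, target) pairs."""
    nodes = sorted({n for edge in edges for n in edge})
    inn = {n: [s for s, t in edges if t == n] for n in nodes}
    out = {n: [t for s, t in edges if s == n] for n in nodes}

    def averaged(sim, xs, ys):
        # Average similarity over the Cartesian product of two neighbor lists.
        if not xs or not ys:
            return 0.0  # assumed convention for an empty neighbor set
        return sum(sim[(u, v)] for u, v in product(xs, ys)) / (len(xs) * len(ys))

    sim = {(x, y): float(x == y) for x, y in product(nodes, nodes)}
    for _ in range(iterations):
        # The comprehension reads the previous `sim` before the name is rebound.
        sim = {
            (x, y): 1.0 if x == y else
            lam * C * averaged(sim, inn[x], inn[y])
            + (1 - lam) * C * averaged(sim, out[x], out[y])
            for x, y in product(nodes, nodes)
        }
    return sim


# Edges of Figure 3 as inferred from the text (an assumption).
figure3 = [("a", "c"), ("b", "d"), ("d", "c"), ("d", "e")]
results = {
    name: p_rank(figure3, C=0.7, lam=lam)[("a", "b")]
    for name, lam in [("SimRank", 1.0), ("rvs-SimRank", 0.0), ("P-Rank", 0.5)]
}
print(results)
```

All three settings return 0 for $s(a, b)$ on this graph: the only pairs that feed the recursion for $(a, b)$ are $(c, d)$, $(d, b)$, and $(e, d)$, none of which ever reaches an identical pair.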

4. Proposed Methods

4.1. Main Concepts. In this section, we propose a novel link-based similarity measure. Our approach differs from the existing link-based methods in that we do not define the measure by combining the pair-wise and level-wise models but use the concept of the cosine similarity. As expressed by (7), the cosine similarity is a measure for computing a similarity between a pair of vectors. Thus, in computing a similarity between objects using the cosine similarity, the features of an object are represented by the elements of a vector, and the weight of each feature is captured by the value of each element. For example, in the case of computing a similarity between documents using the cosine similarity, each document is represented by a vector, the words in the given universe of documents are denoted by the elements of the vector, and the frequency of each word in the document is indicated by the value of the corresponding element. Then, the similarity between two documents is computed as the similarity between the two corresponding vectors [9]:

$$
s(x, y) = \frac{x \cdot y}{\|x\|\,\|y\|} = \frac{\sum_{i=1}^{n} (x_i \times y_i)}{\sqrt{\sum_{i=1}^{n} (x_i)^2} \times \sqrt{\sum_{i=1}^{n} (y_i)^2}} \tag{7}
$$

In our approach, each object is represented as a vector, all other objects in the given universe as the elements of the vector, and some “weight value” for each object as the value of the corresponding element in the vector. Text-based measures using the cosine similarity, discussed above, give larger weight values to higher-frequency words by treating them as features that better characterize a document. Likewise, we need to define a measure that quantifies how well the target object is characterized by each object in the universe. For this measure, we propose to use how closely the two objects are located from the topological point of view.

Sun et al. [30] proposed a method for computing the probability of reaching from one object to another as the relevance between two objects, using the “Random Walk with Restart (RWR)” strategy. We adopt this strategy and use the reachability to an object from the target object as the weight value for its corresponding element of the vector representing the target object. Reachability becomes higher as the distance between two objects becomes shorter and also when more paths exist between the two. Computing reachability using RWR is expressed by the following, where $P_A$ represents an adjacency matrix column-normalizing the connectedness among objects, $u_a$ a vector having reachability to each node starting from $a$, $q_a$ a restart vector having the value 1 only for the starting node $a$ and 0 for the rest, and $c$ the restart probability:

$$
u_a = (1 - c)\, P_A\, u_a + c\, q_a \tag{8}
$$

Our approach does not suffer from the problem with the pair-wise model, because it computes the similarity between two vectors by multiplying only the values of the corresponding elements in the two vectors, not trying every possible element pair between the two vectors. Moreover, the approach generates the vectors by considering all the objects linked directly or indirectly to the two target objects. This mechanism is different from the pair-wise and level-wise models and does not suffer from the problems discussed in the previous section.

In this paper, we develop two methods to implement our approach. The two methods differ only slightly, in the way they compute the weights in the vector. One method computes the weights by first computing two values of reachability, one using only “inlinks” and the other using only “outlinks,” separately, and then combining them together as a weighted sum. This method provides flexibility to a problem domain by allowing different weights for reachability using inlinks versus outlinks. The method, however, cannot compute reachability properly for those objects located close to the target object, because links in both directions are not used together. In Figure 1, for example, nodes $a$ and $b$ are located close to each other, but the reachability would become 0 when computed using only outlinks (or using only inlinks) from $a$ or $b$. Another problem would be the difficulty in determining the appropriate weights for a vector generated purely using the inlinks or purely using the outlinks. Thus, we develop the second method, which computes reachability by ignoring the directions of inlinks and outlinks and converting them to undirected links before computing reachability. This method is advantageous in that it can compute appropriate reachability for every object. In the rest of this paper, we will call these two methods the “weightedSum” method and the “undirected” method, respectively. In the weightedSum method, in particular, the proper balance between the weights for inlinks and outlinks needs to be found domain by domain through experimentation.

Figure 4: The procedure of generating a vector using the weightedSum method. (The figure shows an example graph over objects $a$, $b$, $c$, and $d$, its inlink and outlink adjacency matrices, their column-normalized forms, and the resulting vectors for $a$: $(1, 0, 0, 0)$ with inlinks, $(1, 0.06, 0.06, 0.102)$ with outlinks, and $(1, 0.03, 0.03, 0.051)$ as their weighted sum.)

4.2. Procedure of the Proposed Methods. Basically, both of the proposed methods (1) construct vectors by computing reachability from the target object to all other objects and (2) compute the similarity between vectors by using the cosine similarity. The two methods differ only in the process of generating vectors, as illustrated by Figure 4 (for the weightedSum method) and Figure 5 (for the undirected method). The weightedSum method generates vectors in the following manner. First, two adjacency matrices are built, one with inlinks and one with outlinks, and then column-normalized such that the sum of all values in a column becomes 1. Second, a vector is generated from each normalized matrix, and the weights are assigned to the elements in the vector by computing the reachability from the target object to every object using the RWR strategy. The two resulting vectors are then combined as a weighted sum, as described in Section 4.1. Similarly, the undirected method generates a vector by constructing a normalized adjacency matrix, ignoring the link directions this time, and assigning the reachability values to the elements of the vector in the same manner.
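The procedure above can be sketched in a few lines of plain Python. This is a minimal illustration under stated assumptions, not the authors' implementation. All function names are hypothetical, the iteration count is arbitrary, the target's own RWR value is kept as an element of its vector, and nodes without outgoing steps are left as dangling columns, so their mass is simply lost. The example graph ($a \to b$, $a \to c$, $b \to d$, $c \to d$) is the one suggested by the matrices of Figure 4; no attempt is made to reproduce the numbers shown in the figures.

```python
import math


def column_normalize(adj, nodes):
    """adj[s] is the set of nodes a walker at s may step to; returns P[t][s]."""
    P = {t: {} for t in nodes}
    for s in nodes:
        targets = adj.get(s, ())
        for t in targets:
            P[t][s] = 1.0 / len(targets)  # column s sums to 1
    return P


def rwr(P, nodes, start, c=0.15, iterations=100):
    """Iterate u = (1 - c) P u + c q, as in equation (8), starting from u = q."""
    u = {n: float(n == start) for n in nodes}
    for _ in range(iterations):
        u = {t: (1 - c) * sum(w * u[s] for s, w in P[t].items())
                + c * float(t == start) for t in nodes}
    return u


def weighted_sum_vector(edges, nodes, start, w_in=0.5, c=0.15):
    """Reachability via outlinks and via inlinks, combined as a weighted sum."""
    out_adj, in_adj = {}, {}
    for s, t in edges:
        out_adj.setdefault(s, set()).add(t)
        in_adj.setdefault(t, set()).add(s)
    u_out = rwr(column_normalize(out_adj, nodes), nodes, start, c)
    u_in = rwr(column_normalize(in_adj, nodes), nodes, start, c)
    return {n: w_in * u_in[n] + (1 - w_in) * u_out[n] for n in nodes}


def undirected_vector(edges, nodes, start, c=0.15):
    """Reachability after converting every link to an undirected link."""
    adj = {}
    for s, t in edges:
        adj.setdefault(s, set()).add(t)
        adj.setdefault(t, set()).add(s)
    return rwr(column_normalize(adj, nodes), nodes, start, c)


def cosine(x, y):
    """Cosine similarity of two vectors stored as dicts over the same keys."""
    dot = sum(x[n] * y[n] for n in x)
    norm = math.sqrt(sum(v * v for v in x.values())) * math.sqrt(sum(v * v for v in y.values()))
    return dot / norm if norm else 0.0


nodes = ["a", "b", "c", "d"]
edges = [("a", "b"), ("a", "c"), ("b", "d"), ("c", "d")]
sim_ws = cosine(weighted_sum_vector(edges, nodes, "b"), weighted_sum_vector(edges, nodes, "c"))
sim_un = cosine(undirected_vector(edges, nodes, "b"), undirected_vector(edges, nodes, "c"))
print(sim_ws, sim_un)
```

Because each vector depends only on its own start node, the vectors for different objects can be generated independently of one another, and any pair can then be compared with a single cosine computation.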

4.3. Complexity Analysis. The complexity of the proposed methods for computing similarities for all pairs of given objects can be analyzed as follows. First, the time complexity of generating a vector for an object using the RWR strategy is $O(ke)$ [31], where $e$ represents the number of links and $k$ the number of iterations for the matrix calculation to obtain converged values of reachability. Thus, the time complexity of generating the vectors for all objects is $O(kne)$, where $n$ represents the number of objects. It is generally known that a constant number of such iterations is sufficient to obtain converged values of reachability [6, 15]. Consequently, the time complexity is reduced to $O(ne)$. On the other hand, the time complexity of the similarity calculation process is $O(n^3)$, since each of the $n^2$ pairs requires one cosine computation over vectors of $n$ elements. Combining them together, the overall time complexity of both methods becomes $O(n^3) + O(ne) = O(n^3)$, because $e \le n^2$. In practice, the weightedSum method will require twice as much time as the undirected method, because the former computes reachability separately for each direction of the links.

As for space complexity, $O(e)$ space is required for storing the matrix that represents the relationships between objects, $O(n^2)$ for storing vectors, and $O(n^2)$ for storing the similarity measures between objects. Combining them together, the overall space complexity becomes $O(n^2)$. Again, in practice, the weightedSum method will require twice as much space as the other.

Figure 5: The procedure of generating a vector using the undirected method. (The figure shows the same example graph over objects $a$, $b$, $c$, and $d$, its inlink and outlink adjacency matrices, the undirected-link matrix obtained by merging them, its column-normalized form, and the resulting vector for $a$: $(1, 0.462, 0.462, 0.441)$.)

In comparison, the time complexity of the three existing methods, namely, rvs-SimRank, SimRank, and P-Rank, is known to be $O(n^4)$ and the space complexity $O(n^2)$ [6]. We believe the existing methods require more computation than our proposed methods because they adopt the pair-wise computation model in principle.

Moreover, the existing methods cannot compute the similarity between two given objects independently, because they need to know the similarities among all objects (whether connected directly or indirectly to the target objects) in order to compute the similarity between two target objects [20]. However, our approach does not require the similarities among all objects and hence can compute the similarity between any pair of objects independently, making it possible to parallelize the algorithm. Since our approach need not refer to the similarities among other objects, vectors for individual objects can be generated independently, and the similarity between any pair of objects can be computed separately, that is, in parallel. The time complexity of this parallelized version would become $O(n^3/m)$ if $m$ processors are utilized in parallel.

When we need to compute the similarity between only a particular pair of objects on-line, rather than computing the similarities among all objects [20], our methods can be made even more efficient by computing reachability on-line through an inverse matrix prepared off-line, as suggested by (9). That is, the vector for a given object $a$ can be generated by multiplying the inverse matrix of $Q = I - (1 - c) P_A$ by $q_a$ [30, 31]. The time complexity of the off-line computation of the inverse matrix is $O(n^{2.376})$ [32], and the complexity of the on-line computation is $O(n)$. Such an off-line approach tends to require relatively longer time and more space to handle the inverse matrix. Reference [31] suggests an approach using low-rank approximation in order to keep a balance between the off-line and on-line computation. This approach has advantages in both time and space complexity, suggesting an improvement for on-line execution of our proposed approach. In conclusion, the time complexity of our methods is lower than that of existing methods, and the space complexity is equal to that of existing methods in any case.

$$
\begin{aligned}
u_a &= (1 - c)\, P_A\, u_a + c\, q_a \\
    &= c\, \bigl(I - (1 - c)\, P_A\bigr)^{-1} q_a \\
    &= c\, Q^{-1} q_a
\end{aligned} \tag{9}
$$
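The closed form in (9) can be checked numerically on a small graph. The sketch below is an illustration only: instead of forming $Q^{-1}$ explicitly, as an off-line precomputation would, it solves the linear system $Q u_a = c\, q_a$ by Gaussian elimination, which yields the same vector. The function names and the example matrix (the column-normalized undirected 4-cycle $a$–$b$, $a$–$c$, $b$–$d$, $c$–$d$) are assumptions.

```python
def solve(A, b):
    """Solve A x = b by Gaussian elimination with partial pivoting."""
    n = len(b)
    M = [row[:] + [rhs] for row, rhs in zip(A, b)]  # augmented matrix
    for col in range(n):
        pivot = max(range(col, n), key=lambda r: abs(M[r][col]))
        M[col], M[pivot] = M[pivot], M[col]
        for r in range(col + 1, n):
            factor = M[r][col] / M[col][col]
            for k in range(col, n + 1):
                M[r][k] -= factor * M[col][k]
    x = [0.0] * n
    for r in reversed(range(n)):
        x[r] = (M[r][n] - sum(M[r][k] * x[k] for k in range(r + 1, n))) / M[r][r]
    return x


def rwr_closed_form(P, start, c=0.15):
    """Return u = c Q^{-1} q with Q = I - (1 - c) P, for a dense matrix P."""
    n = len(P)
    Q = [[(1.0 if i == j else 0.0) - (1 - c) * P[i][j] for j in range(n)]
         for i in range(n)]
    rhs = [c if i == start else 0.0 for i in range(n)]
    return solve(Q, rhs)


# Column-normalized undirected 4-cycle over (a, b, c, d); every column sums to 1.
P = [
    [0.0, 0.5, 0.5, 0.0],
    [0.5, 0.0, 0.0, 0.5],
    [0.5, 0.0, 0.0, 0.5],
    [0.0, 0.5, 0.5, 0.0],
]
c = 0.15
u = rwr_closed_form(P, start=0, c=c)

# The solution should satisfy the fixed-point form of equation (8).
residual = max(
    abs(u[i] - ((1 - c) * sum(P[i][j] * u[j] for j in range(4)) + c * (i == 0)))
    for i in range(4)
)
print(u, residual)
```

With an explicitly stored $Q^{-1}$, the vector for a start node $a$ is just $c$ times column $a$ of the inverse, which is why the on-line step costs only $O(n)$.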

4.4. Discussions. The proposed methods do not suffer from the problems of the pair-wise and level-wise models discussed in Section 3. Let us compute the similarities between $a$ and $b$ and between $a'$ and $b'$ in Figure 2 using our approach and check whether the results appeal to our intuition. We assume $\lambda$ and $C$ to be 0.5 and 0.7, respectively. Table 2 shows the similarity results computed by the two proposed methods, together with those obtained by the three existing methods (shown already in Table 1) for comparison. Both of our methods produce the same result: the similarity between $a$ and $b$, and that between $a'$ and $b'$, is 1, meaning that the two objects in each pair are regarded as the same. These results appeal to our intuition. In comparison, the results obtained by the existing methods indicate that the two objects are not the same. Since our approach does not use the pair-wise model but performs the computations among identical features, the proposed methods will produce results more appealing to our intuition than existing methods.

Table 2: Similarities between $a$ and $b$ and between $a'$ and $b'$ in Figure 2, using the proposed methods and existing methods.

| | rvs-SimRank | SimRank | P-Rank | WeightedSum | Undirected |
|---|---|---|---|---|---|
| $s(a, b)$ | 0.396 | 0 | 0.423 | 1 | 1 |
| $s(a', b')$ | 0.321 | 0 | 0.408 | 1 | 1 |

The proposed methods do not suffer from the problem of the level-wise model either. Let us compute the similarity between $a$ and $b$ in Figure 3 using our approach and check whether the results appeal to our intuition. Assuming $\lambda$ and $C$ to be 0.5 and 0.7, respectively, our approach computes the similarity between $a$ and $b$ to be 0.28 using the weightedSum method and 0.6 using the undirected method. For comparison, the existing methods would compute the similarity to be 0, as discussed in Section 3. We cannot tell whether the results obtained by our approach are appropriate or not, because similarity is by nature a subjective value. Still, we can at least argue that our approach produces results which are more appealing to our intuition than the existing methods; that is, the similarity between $a$ and $b$ must not be 0, because they are related indirectly by some common objects. This difference comes from our strategy of considering all the objects indirectly connected to the target objects at once, rather than using the level-wise model.

5. Experiments

In this section, we verify the effectiveness of our approach through experimentation. We show and analyze quantitatively the experimental results obtained from applying the two proposed methods to practical application domains.

5.1. Experimentation Setup. We carried out experiments to verify the performance of the proposed methods with respect to scientific papers and Web documents, two types of exemplary data sets having link information. For the experiments with the scientific papers, we used the data set consisting of the papers downloaded from DBLP (http://www.informatik.uni-trier.de/~ley/db) and reference information among the papers crawled from Libra (http://academic.research.microsoft.com). In total, 44,800 papers and 126,281 references were used. For the experiments with the Web documents, we used the data set consisting of 1,227,038 Web pages with 11,164,829 hyperlinks in total, taken from the TREC (http://trec.nist.gov/data/t11.web.html) 2002 data. The experiments were carried out on a platform with a quad-core 2.67 GHz CPU running the Windows 2008 Server OS.

In the experiments, we aimed to evaluate the performance of the two proposed methods (weightedSum and undirected) in comparison with the three existing methods (rvs-SimRank, SimRank, and P-Rank), with the input values of 0.8, 0.5, and 0.15 for $C$, $\lambda$, and the restart probability, respectively, as used frequently in the existing methods [6, 15, 30].

The experiments proceeded as follows. For the weightedSum method, we first assigned appropriate weights to the vectors generated with inlinks and outlinks, by varying the weights and finding the ones that achieve the highest accuracy in the similarity computed by the weightedSum method. In the experiments, we tried the weight values of 0, 0.1, 0.3, 0.5, 0.7, 0.9, and 1 for inlinks (note that the sum of the inlink and outlink weights equals 1). Second, we evaluated the accuracy of each method qualitatively by examining the 10 objects that the method computes as the most similar to a target object chosen arbitrarily. Finally, we measured the accuracy of each method quantitatively by comparing the obtained results with the true answers for each data set.

The measurement of the accuracy proceeded as follows. First, we chose one object in turn as the target object from the answer set. Then, we computed the recall [29] by extracting the $m$ objects (where $m$ can be 10, 20, 30, 40, and 50) most similar to the target object according to each method. This process was repeated until every object in the set had been chosen as the target object. The average of all recall values obtained in this way was taken as the final accuracy value.
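The accuracy measurement described above can be sketched as follows. This is an illustrative reading, not the evaluation code used in the experiments. The function names, the toy similarity function, and the choice to count only the other members of the answer set as relevant (excluding the target itself) are assumptions, since the exact recall definition is not spelled out here.

```python
def recall_at_m(ranked, relevant, m):
    """Fraction of the relevant objects that appear among the top-m results."""
    if not relevant:
        return 0.0
    hits = sum(1 for obj in ranked[:m] if obj in relevant)
    return hits / len(relevant)


def average_recall(similarity, answer_set, candidates, m):
    """Average recall@m, taking each object of the answer set as the target in turn."""
    recalls = []
    for target in answer_set:
        others = [o for o in candidates if o != target]
        # Rank all other objects by decreasing similarity to the target.
        ranked = sorted(others, key=lambda o: similarity(target, o), reverse=True)
        relevant = set(answer_set) - {target}
        recalls.append(recall_at_m(ranked, relevant, m))
    return sum(recalls) / len(recalls)


# Toy data: a hypothetical similarity that is 1 within a group and 0 across groups.
group = {"p1": 0, "p2": 0, "p3": 1, "p4": 0, "p5": 1}


def toy_similarity(x, y):
    return 1.0 if group[x] == group[y] else 0.0


avg = average_recall(toy_similarity, ["p1", "p2", "p3"], list(group), m=2)
print(avg)
```

In an actual run, `similarity` would be one of the five measures under comparison and `m` would range over 10, 20, 30, 40, and 50.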

The answers for each data set were constructed in the following manner. For the experiments with scientific papers, we selected five well-known areas (i.e., clustering, sequential pattern mining, spatial databases, link mining, and graph pattern mining) in a data mining textbook [29] and obtained the papers referenced in each section of these areas. We supposed that the reference papers in the same section are similar to one another. Thus, those papers in a section formed an answer set. Each answer set had 3 to 14 papers, and the total number of papers in all such answer sets was 106. For the experiments with Web documents, we used TREC 2002. TREC 2002 provides Web document sets related to specific keywords. We chose 9 Web document sets randomly from TREC 2002 and used the sets as the answer sets.

5.2. Domain of Scientific Papers. Figure 6 depicts the results showing the accuracy on the similarities among scientific papers obtained by the weightedSum method using various weight values for the vectors generated with inlinks and outlinks. The $x$-axis represents the number of “the most similar” papers selected by the method and the $y$-axis the accuracy. Annotations such as “0.0 : 1.0” represent the weight balances between inlinks and outlinks. According to Figure 6, the accuracy of the weightedSum method tends to become higher when the weight for inlinks gets higher than the outlink weight. This is because the papers in the answer set are relatively famous ones, frequently referenced by other papers. For this reason, we conducted the experiments using the “0.9 : 0.1” weight balance in order to obtain the best result.

Tables 3, 4, 5, 6, and 7 present the lists of the top 10 papers found to be the most similar to a target paper, which is [33], using the three existing methods and the two proposed methods, respectively. (Note that the paper [33] is concerned with clustering in data mining.) In the tables, those papers which are not similar to [33] are set in italics. In these papers, the authors mainly deal with issues of outlier detection or mining frequent patterns, indicating that the existing methods have made a wrong conclusion for these papers. In comparison, the results by the weightedSum method (shown in Table 6) include only one wrong entry (of frequent pattern mining) in the top 10 list, and the results by the undirected method do not include wrong papers (i.e., all are related to clustering). These results imply that the undirected method performs better than the three existing methods and even better than the weightedSum method. We have repeated this experiment many times with target papers other than [33] and obtained results similar to Tables 3–7.

Figure 6: Accuracy of the weightedSum method with varying weights. ($x$-axis: $m$, from 10 to 50; $y$-axis: accuracy, from 0 to 1; one curve per inlink : outlink weight balance: 0.0 : 1.0, 0.1 : 0.9, 0.3 : 0.7, 0.5 : 0.5, 0.7 : 0.3, 0.9 : 0.1, and 1.0 : 0.0.)

Figure 7 compares the accuracy of the similarities computed by the three existing methods and the two proposed methods. As in Figure 6, the $x$-axis represents the number of “the most similar” papers selected by each method and the $y$-axis the accuracy. The weightedSum method improved accuracy by 9% on average and up to 13% compared with SimRank. Also, the undirected method improved accuracy by 16% on average and up to 20% compared with SimRank. In conclusion, the two proposed methods turn out to compute the similarity more accurately than the existing methods. In particular, the undirected method performs the best.

5.3. Domain of Web Documents. Figure 8 shows the accuracy on the similarities between Web pages obtained by the weightedSum method using various weight values between inlinks and outlinks. The $x$-axis represents the number of “the most similar” Web pages selected by the method and the $y$-axis the accuracy. Annotations such as “0.0 : 1.0” represent the weight balances between inlinks and outlinks. Again, the accuracy of the weightedSum method tends to become higher when the weight for inlinks gets higher, as in the case of scientific papers. This is because the answer set contains many Web pages with high authority. For this reason, we conducted the experiments using the “0.9 : 0.1” weight balance in order to obtain the best result.

Figure 7: Accuracy of the similarity measures in scientific papers data. ($x$-axis: $m$, from 10 to 50; $y$-axis: accuracy, from 0 to 1; curves: SimRank, rvs-SimRank, P-Rank, WeightedSum, and Undirected.)

Figure 8: Accuracy of the weightedSum method with varying weights. ($x$-axis: $m$, from 10 to 50; $y$-axis: accuracy, from 0 to 0.5; one curve per inlink : outlink weight balance: 0.0 : 1.0, 0.1 : 0.9, 0.3 : 0.7, 0.5 : 0.5, 0.7 : 0.3, 0.9 : 0.1, and 1.0 : 0.0.)

Figure 9 compares the accuracy of the similarities computed by the three existing methods and the two proposed methods. Again, the $x$-axis represents the number of “the most similar” documents selected by each method and the $y$-axis the accuracy. The weightedSum method improved accuracy by 20% on average and up to 24% compared with SimRank. Also, the undirected method improved accuracy by 34% on average and up to 43% compared with SimRank. In


Table 3: Top 10 papers similar to [33] using SimRank.

| First author | Title | Conference/journal | Year |
|---|---|---|---|
| Guha | CURE: An Efficient Clustering Algorithm for Large Databases | ACM SIGMOD | 1998 |
| Sheikholeslami | WaveCluster: A Multi-Resolution Clustering Approach for Very Large Spatial Databases | VLDB | 1998 |
| Ester | Knowledge Discovery in Large Spatial Databases: Focusing Techniques for Efficient Class Identification | SSD | 1995 |
| Hinneburg | An Efficient Approach to Clustering in Large Multimedia Databases with Noise | AAAI | 1998 |
| Ng | Efficient and Effective Clustering Methods for Spatial Data Mining | VLDB | 1994 |
| Bradley | Scaling Clustering Algorithms to Large Databases | AAAI | 1998 |
| Wang | STING: A Statistical Information Grid Approach to Spatial Data Mining | VLDB | 1997 |
| L. O'Callaghan | Streaming-Data Algorithms for High-Quality Clustering | IEEE ICDE | 2002 |
| Sander | A Density-Based Algorithm for Discovering Clusters in Large Spatial Databases with Noise | DMKD | 1998 |
| Arning | A Linear Method for Deviation Detection in Large Databases | ACM KDD | 1996 |

Table 4: Top 10 papers similar to [33] using rvs-SimRank.

| First author | Title | Conference/journal | Year |
|---|---|---|---|
| Knorr | A Unified Notion of Outliers: Properties and Computation | ACM KDD | 1997 |
| Guha | CURE: An Efficient Clustering Algorithm for Large Databases | ACM SIGMOD | 1998 |
| Sander | A Density-Based Algorithm for Discovering Clusters in Large Spatial Databases with Noise | DMKD | 1998 |
| Bradley | Scaling Clustering Algorithms to Large Databases | AAAI | 1998 |
| Guha | ROCK: A Robust Clustering Algorithm for Categorical Attributes | IEEE ICDE | 1999 |
| Sheikholeslami | WaveCluster: A Multi-Resolution Clustering Approach for Very Large Spatial Databases | VLDB | 1998 |
| Arning | A Linear Method for Deviation Detection in Large Databases | ACM KDD | 1996 |
| Burdick | MAFIA: A Maximal Frequent Itemset Algorithm for Transactional Databases | IEEE TKDE | 2005 |
| Mannila | Efficient Algorithms for Discovering Association Rules | AAAI | 1994 |
| Kamber | Metarule-Guided Mining of Multi-Dimensional Association Rules Using Data Cubes | ACM KDD | 1997 |

Table 5: Top 10 papers similar to [33] using P-Rank.

First author | Title | Conference/journal | Year
Knorr | A Unified Notion of Outliers: Properties and Computation | ACM KDD | 1997
Guha | CURE: An Efficient Clustering Algorithm for Large Databases | ACM SIGMOD | 1998
Sander | A Density-Based Algorithm for Discovering Clusters in Large Spatial Databases with Noise | DMKD | 1998
Bradley | Scaling Clustering Algorithms to Large Databases | AAAI | 1998
Sheikholeslami | WaveCluster: A Multi-Resolution Clustering Approach for Very Large Spatial Databases | VLDB | 1998
Guha | ROCK: A Robust Clustering Algorithm for Categorical Attributes | IEEE ICDE | 1999
Mannila | Efficient Algorithms for Discovering Association Rules | AAAI | 1994
Arning | A Linear Method for Deviation Detection in Large Databases | ACM KDD | 1996
Silberschatz | What Makes Patterns Interesting in Knowledge Discovery Systems | IEEE TKDE | 1996
Agrawal | Mining Association Rules between Sets of Items in Large Databases | ACM SIGMOD | 1993


Table 6: Top 10 papers similar to [33] using the weightedSum method.

First author | Title | Conference/journal | Year
Ester | Knowledge Discovery in Large Spatial Databases: Focusing Techniques for Efficient Class Identification | SSD | 1995
Ng | Efficient and Effective Clustering Methods for Spatial Data Mining | VLDB | 1994
Sheikholeslami | WaveCluster: A Multi-Resolution Clustering Approach for Very Large Spatial Databases | VLDB | 1998
Guha | ROCK: A Robust Clustering Algorithm for Categorical Attributes | IEEE ICDE | 1999
Guha | CURE: An Efficient Clustering Algorithm for Large Databases | ACM SIGMOD | 1998
Wang | STING: A Statistical Information Grid Approach to Spatial Data Mining | VLDB | 1997
Hinneburg | An Efficient Approach to Clustering in Large Multimedia Databases with Noise | AAAI | 1998
Aggarwal | Fast Algorithms for Projected Clustering | ACM SIGMOD | 1999
Agrawal | Automatic Subspace Clustering of High Dimensional Data for Data Mining Applications | ACM SIGMOD | 1998
Sander | A Density-Based Algorithm for Discovering Clusters in Large Spatial Databases with Noise | DMKD | 1998

Table 7: Top 10 papers similar to [33] using the undirected method.

First author | Title | Conference/journal | Year
Guha | CURE: An Efficient Clustering Algorithm for Large Databases | ACM SIGMOD | 1998
Ng | Efficient and Effective Clustering Methods for Spatial Data Mining | VLDB | 1994
Sander | A Density-Based Algorithm for Discovering Clusters in Large Spatial Databases with Noise | DMKD | 1998
Bradley | Scaling Clustering Algorithms to Large Databases | AAAI | 1998
Sheikholeslami | WaveCluster: A Multi-Resolution Clustering Approach for Very Large Spatial Databases | VLDB | 1998
Hinneburg | An Efficient Approach to Clustering in Large Multimedia Databases with Noise | AAAI | 1998
Wang | STING: A Statistical Information Grid Approach to Spatial Data Mining | VLDB | 1997
Agrawal | Automatic Subspace Clustering of High Dimensional Data for Data Mining Applications | ACM SIGMOD | 1998
Aggarwal | Fast Algorithms for Projected Clustering | ACM SIGMOD | 1999
Ankerst | OPTICS: Ordering Points To Identify the Clustering Structure | ACM SIGMOD | 1999

conclusion, the two proposed methods compute the similarity more accurately than the existing methods. In particular, the undirected method outperforms the weightedSum method and the three existing methods by a large margin.

6. Conclusions

This paper presented new link-based similarity methods that compute the similarity between objects more accurately by using the link information pertaining to the objects. Noticing that most existing link-based similarity methods use the pair-wise and level-wise models, we analyzed the problems with these models and proposed a new approach that does not suffer from them. In our approach, each object is represented by a vector whose elements correspond to all objects in the given universe, and the value of each element is a weight for the corresponding object. For this weight value, we proposed to use the notion of reachability between objects, computed using the “Random Walk with Restart” strategy. We then defined the similarity between two objects as the cosine similarity between the two vectors representing them. The proposed approach was then refined into two methods,

Figure 9: Accuracy of the similarity measures in Web documents data. [Plot: $x$-axis $m$ from 10 to 50; $y$-axis accuracy from 0 to 0.5; series SimRank, rvs-SimRank, P-Rank, WeightedSum, and Undirected.]

the weightedSum and the undirected methods, which differ in how they handle the information on link directions. Examples showed that the two methods do not suffer from the problems of the pair-wise and level-wise models. In our experiments with the scientific papers and Web documents data sets, both proposed methods generally outperformed the existing methods significantly.

Conflict of Interests

The authors declare that there is no conflict of interests regarding the publication of this paper.

Acknowledgments

This work was supported by (1) Business for Cooperative R&D between Industry, Academy, and Research Institute funded by the Korea Small and Medium Business Administration in 2013 (Grant no. C0006278), (2) the MSIP (Ministry of Science, ICT and Future Planning), Korea, under the ITRC (Information Technology Research Center) Support Program (NIPA-2013-H0301-13-4009) supervised by the NIPA (National IT Industry Promotion Agency), (3) the Ministry of Culture, Sports and Tourism (MCST) and the Korea Copyright Commission in 2013, and (4) the National Research Foundation of Korea (NRF) Grant funded by the Korea government (MEST) (no. 2011-0029181).

References

[1] J. Shen, Y. Zhu, H. Zhang, C. Chen, R. Sun, and F. Xu, “A content-based algorithm for blog ranking,” in Proceedings of the International Conference on Internet Computing in Science and Engineering (ICICSE ’08), pp. 19–22, January 2008.

[2] A. Broder, S. Glassman, M. Manasse, and G. Zweig, “Syntactic clustering of the web,” in Proceedings of the 6th International World Wide Web Conference (WWW ’97), pp. 391–404, 1997.

[3] X. Su and T. Khoshgoftaar, “A survey of collaborative filtering techniques,” Advances in Artificial Intelligence, vol. 2009, Article ID 421425, 19 pages, 2009.

[4] S.-H. Yoon, S.-W. Kim, and S. Park, “A link-based similarity measure for scientific literature,” in Proceedings of the 19th International World Wide Web Conference (WWW ’10), pp. 1213–1214, April 2010.

[5] M. R. Hamedani, S. Lee, and S. Kim, “On exploiting content and citations together to compute similarity of scientific papers,” in Proceedings of the ACM International Conference on Information and Knowledge Management (CIKM ’13), San Francisco, Calif, USA, 2013.

[6] P. Zhao, J. Han, and Y. Sun, “P-Rank: a comprehensive structural similarity measure over information networks,” in Proceedings of the ACM 18th International Conference on Information and Knowledge Management (CIKM ’09), pp. 553–562, November 2009.

[7] D.-H. Bae, S.-M. Hwang, S.-W. Kim, and C. Faloutsos, “Constructing seminal paper genealogy,” in Proceedings of the 20th ACM Conference on Information and Knowledge Management (CIKM ’11), pp. 2101–2104, October 2011.

[8] F. Menczer, “Combining link and content analysis to estimate semantic similarity,” in Proceedings of the 13th International World Wide Web Conference (WWW ’04), pp. 452–453, May 2004.

[9] R. Baeza-Yates and B. Ribeiro-Neto, Modern Information Retrieval, Addison Wesley, 1999.

[10] S.-H. Yoon, S.-W. Kim, J.-S. Kim, and W.-S. Hwang, “On computing text-based similarity in scientific literature,” in Proceedings of the 20th International Conference Companion on World Wide Web (WWW ’11), pp. 169–170, April 2011.

[11] G. Bisson and F. Hussain, “χ-Sim: a new similarity measure for the co-clustering task,” in Proceedings of the 7th International Conference on Machine Learning and Applications (ICMLA ’08), pp. 211–217, December 2008.

[12] H. Small, “Co-citation in the scientific literature: a new measure of the relationship between two documents,” Journal of the American Society for Information Science, vol. 24, no. 4, pp. 265–269, 1973.

[13] M. Kessler, “Bibliographic coupling between scientific papers,” Journal of the American Documentation, vol. 14, no. 1, pp. 10–25, 1963.

[14] R. Amsler, “Application of citation-based automatic classification,” Technical Report, The University of Texas at Austin, Linguistics Research Center, 1972.

[15] G. Jeh and J. Widom, “SimRank: a measure of structural-context similarity,” in Proceedings of the 8th ACM SIGKDD International Conference on Knowledge Discovery and Data Mining (KDD ’02), pp. 538–543, July 2002.

[16] U. Shardanand and P. Maes, “Social information filtering: algorithms for automating ‘word of mouth’,” in Proceedings of the SIGCHI Conference on Human Factors in Computing Systems (CHI ’95), pp. 210–217, ACM Press, Denver, Colorado, 1995.

[17] I. Antonellis, H. Garcia-Molina, and C. Chang, “Simrank++: query rewriting through link analysis of the click graph,” in Proceedings of the 34th International Conference on Very Large Data Bases (VLDB ’08), pp. 408–421, Auckland, New Zealand, August 2008.


[18] D. Fogaras and B. Racz, “Scaling link-based similarity search,” in Proceedings of the 14th International Conference on World Wide Web (WWW ’05), pp. 641–650, 2005.

[19] C. Li, J. Han, G. He et al., “Fast computation of SimRank for static and dynamic information networks,” in Proceedings of the 13th International Conference on Extending Database Technology: Advances in Database Technology (EDBT ’10), pp. 465–476, March 2010.

[20] P. Li, H. Liu, J. Xu Yu, J. He, and X. Du, “Fast single-pair SimRank computation,” in Proceedings of the SIAM International Conference on Data Mining (SDM ’10), pp. 571–582, SIAM, Columbus, Ohio, USA, April 2010.

[21] X. Yin, J. Han, and P. S. Yu, “LinkClus: efficient clustering via heterogeneous semantic links,” in Proceedings of the 32nd International Conference on Very Large Data Bases (VLDB ’06), pp. 427–438, 2006.

[22] D. Lizorkin, P. Velikhov, M. Grinev, and D. Turdakov, “Accuracy estimate and optimization techniques for SimRank computation,” in Proceedings of the Very Large Data Bases Endowment, vol. 1, pp. 422–433, 2008.

[23] Y. Cai, G. Cong, X. Jia et al., “Efficient algorithm for computing link-based similarity in real world networks,” in Proceedings of the 9th IEEE International Conference on Data Mining (ICDM ’09), pp. 734–739, December 2009.

[24] G. He, H. Feng, C. Li, and H. Chen, “Parallel SimRank computation on large graphs with iterative aggregation,” in Proceedings of the 16th ACM SIGKDD International Conference on Knowledge Discovery and Data Mining (KDD ’10), pp. 543–552, July 2010.

[25] D. Lizorkin, P. Velikhov, M. Grinev, and D. Turdakov, “Accuracy estimate and optimization techniques for SimRank computation,” The Int’l Journal on Very Large Data Bases, vol. 19, no. 1, pp. 45–66, 2010.

[26] Z. Lin, M. R. Lyu, and I. King, “MatchSim: a novel neighbor-based similarity measure with maximum neighborhood matching,” in Proceedings of the ACM 18th International Conference on Information and Knowledge Management (CIKM ’09), pp. 1613–1616, November 2009.

[27] R. R. Larson, “Bibliometrics of the world wide web: an exploratory analysis of the intellectual structure of cyberspace,” Proceedings of the Annual Meeting of the American Society for Information Science, vol. 33, pp. 71–78, 1996.

[28] J. Pitkow and P. Pirolli, “Life, death, and lawfulness on the electronic frontier,” in Proceedings of the Conference on Human Factors in Computing Systems (CHI ’97), pp. 383–390, March 1997.

[29] J. Han and M. Kamber, Data Mining: Concepts and Techniques, Morgan Kaufmann, 2006.

[30] J. Sun, H. Qu, D. Chakrabarti, and C. Faloutsos, “Neighborhood formation and anomaly detection in bipartite graphs,” in Proceedings of the 5th IEEE International Conference on Data Mining (ICDM ’05), pp. 418–425, November 2005.

[31] H. Tong, C. Faloutsos, and J.-Y. Pan, “Fast random walk with restart and its applications,” in Proceedings of the 6th International Conference on Data Mining (ICDM ’06), pp. 613–622, December 2006.

[32] S. Robinson, “Toward an optimal algorithm for matrix multiplication,” News Journal of the Society for Industrial and Applied Mathematics, vol. 38, no. 9, 2005.

[33] T. Zhang, R. Ramakrishnan, and M. Livny, “BIRCH: an efficient data clustering method for very large databases,” in Proceedings of the ACM SIGMOD International Conference on Management of Data (SIGMOD ’96), pp. 103–114, 1996.


similarities among dissimilar objects using the path information of the tree. Reference [20] aims to compute the similarity between two objects using SimRank in online real time by considering only those objects related directly to the two target objects, rather than computing the similarities involving all objects. Reference [24] investigates a method to run SimRank in parallel using GPGPU (general-purpose computation on graphics processors) and a method to approximately compute the similarity in a dynamic graph using uncoupling Markov chains.

3. Motivation

In this section, we discuss the problems with existing link-based similarity measures. First, we cast existing measures into a method combining two computation models, which we will call “pair-wise” and “level-wise,” and explain the inherent difficulties in these models. We then analyze and illustrate the limitations of three representative existing methods, namely, rvs-SimRank, SimRank, and P-Rank, showing that each of these methods actually combines the pair-wise and level-wise models.

3.1. Pair-Wise and Level-Wise Computation Models. In the graph of Figure 1, again, SimRank would compute the similarity between $k$ and $l$ by taking the average of the four similarities obtained from the four pairs $(g, h)$, $(g, i)$, $(h, h)$, and $(h, i)$, namely, the Cartesian product of the set $\{g, h\}$, the objects directly “in-linked” to $k$, with the set $\{h, i\}$, the objects directly “in-linked” to $l$. We call such a way of pairing and averaging the similarities over all such pairs the “pair-wise” computation model. In this example, the similarity between $k$ and $l$, $S(k, l)$, becomes the average of $S(g, h)$, $S(g, i)$, $S(h, h)$, and $S(h, i)$. Here, $S(h, h) = 1$ by definition, and $S(g, h)$, $S(g, i)$, and $S(h, i)$ should also be computed in turn by SimRank in a recursive manner using the same pair-wise model. That is, $S(g, h)$ is computed from the Cartesian product of $g$'s neighbors $\{d, e\}$ and $h$'s neighbors $\{d, e\}$, $S(g, i)$ from $g$'s neighbors $\{d, e\}$ and $i$'s neighbors $\{f\}$, and $S(h, i)$ from $h$'s neighbors $\{d, e\}$ and $i$'s neighbors $\{f\}$. In this way, the recursive process continues and gets deeper, using all the objects linked directly or indirectly to the two target objects for which the similarity is being computed.

On the other hand, such a model can be seen as computing the similarity between two objects by utilizing only those other objects located at the same distance from the two targets, level by level. In Figure 1, for example, the model computes the similarity between $k$ and $l$ by utilizing the objects with distance 1 (namely, nodes $g$, $h$, and $i$) in the first round, then the objects with distance 2 (namely, $d$, $e$, and $f$) in the second round, and then the objects with distance 3 (namely, $a$, $b$, and $c$) in the third round. We call such a process the “level-wise” computation model.

3.2. Problem of Pair-Wise Computation Model. Figure 2 illustrates the problem of the pair-wise model, which induces the phenomenon that the similarities among objects having more links tend to become lower than those among objects having

Table 1: Similarities between $a$ and $b$ and between $a'$ and $b'$ in Figure 2.

Pair | rvs-SimRank | SimRank | P-Rank
$s(a, b)$ | 0.396 | 0 | 0.423
$s(a', b')$ | 0.321 | 0 | 0.408

Figure 2: Example graphs to illustrate the problem with the pair-wise model. [Panel (a): graph over objects $a$ to $g$. Panel (b): graph over objects $a'$ to $j'$.]

fewer links. In the graphs of Figure 2, again, nodes represent objects and links represent the reference relationships among objects. For example, objects $a$ and $b$ in Figure 2(a) both refer to the same three objects $c$, $d$, and $e$, all of which in turn refer to the same two objects $f$ and $g$. Similarly, objects $a'$ and $b'$ in Figure 2(b) both refer to the same objects, all of which in turn refer to the same objects. The only difference between these two graphs is that the numbers of links from objects $a'$ and $b'$ are larger (i.e., six links each) than those from objects $a$ and $b$ (i.e., three links each). Intuitively, the similarity between $a$ and $b$ should be the same as that between $a'$ and $b'$, because $a$ and $b$ both refer to the same objects and $a'$ and $b'$ also both refer to the same objects. Existing methods, however, produce different results.

Table 1 shows the resulting similarities between $a$ and $b$ and between $a'$ and $b'$ as computed by three existing methods, rvs-SimRank, SimRank, and P-Rank, assuming $C$ and $\lambda$ to be 0.7 and 0.5 in formulas (4), (5), and (6). Two of the methods, rvs-SimRank and P-Rank, yield a higher similarity between $a$ and $b$ than between $a'$ and $b'$, which is counterintuitive. Considering that the difference in the numbers of links between the two graphs in this example is only as small as four, one can imagine that this phenomenon would become clearer as the numbers of links get larger, for example, with data sets such as scientific papers or Web documents. In the papers data set, for example, well-known papers tend to have many reference links because they in general receive substantially more citations than ordinary papers; similarly, in the Web documents data set, portal sites tend to have many links because they are referenced more frequently than ordinary

Figure 3: An example graph to illustrate the problem with the level-wise model. [Graph over objects $a$, $b$, $c$, $d$, and $e$.]

sites. We envisage that, for these domains, methods that use the pair-wise computation model cannot compute the similarity between objects accurately.
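The effect described above can be checked with a small sketch. It assumes that rvs-SimRank is SimRank evaluated over out-links with decay factor $C = 0.7$, and that Figure 2(b) consists of two top objects, six middle objects, and two bottom objects; both the function names and the layout of graph (b) are assumptions made here for illustration, inferred from the text rather than taken from the authors' code.

```python
from itertools import product


def rvs_simrank(out_links, decay=0.7, iterations=10):
    """Iterative SimRank over out-links (our reading of rvs-SimRank)."""
    nodes = sorted(out_links)
    sim = {(x, y): float(x == y) for x, y in product(nodes, repeat=2)}
    for _ in range(iterations):
        updated = {}
        for x, y in product(nodes, repeat=2):
            targets_x, targets_y = out_links[x], out_links[y]
            if x == y:
                updated[(x, y)] = 1.0
            elif not targets_x or not targets_y:
                updated[(x, y)] = 0.0
            else:
                # Pair-wise model: average over the full Cartesian product.
                total = sum(sim[(p, q)] for p in targets_x for q in targets_y)
                updated[(x, y)] = decay * total / (len(targets_x) * len(targets_y))
        sim = updated
    return sim


def layered_graph(top, middle, bottom):
    """Top nodes link to all middle nodes; middle nodes link to all bottom nodes."""
    graph = {node: list(middle) for node in top}
    graph.update({node: list(bottom) for node in middle})
    graph.update({node: [] for node in bottom})
    return graph


# Figure 2(a) as described in the text; Figure 2(b) is an assumed layout.
graph_a = layered_graph(["a", "b"], ["c", "d", "e"], ["f", "g"])
graph_b = layered_graph(["a'", "b'"],
                        ["c'", "d'", "e'", "f'", "g'", "h'"],
                        ["i'", "j'"])

s_ab = rvs_simrank(graph_a)[("a", "b")]
s_ab_prime = rvs_simrank(graph_b)[("a'", "b'")]
print(round(s_ab, 4), round(s_ab_prime, 4))  # roughly 0.3967 and 0.3208
```

Under these assumptions the two values agree with the rvs-SimRank column of Table 1 up to rounding. The drop comes from the averaging step: the number of cross pairs grows quadratically with the number of shared references, while the number of identical pairs grows only linearly.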

3.3. Problem of Level-Wise Computation Model. The level-wise computation model computes the similarity between two objects by utilizing only those objects in the graph at the same distance from the two. The problem with this model can be illustrated using Figure 3, where, again, nodes represent objects and directed links represent the reference relationships between objects. Intuitively, we can say that the two objects $a$ and $b$ are similar to some degree, as they share, though indirectly, an object $c$ in common. All of the aforementioned three methods, however, compute the similarity between these two objects $a$ and $b$ to be 0. For this case, one would expect to compute the similarity by considering all objects directly and indirectly linked to $a$ (namely, $c$) and all objects directly and indirectly linked to $b$ (namely, $c$, $d$, and $e$). In fact, however, any method that utilizes the level-wise model will compute $s(a, b)$ by using only $c$ (as $a$'s neighbor) and $d$ (as $b$'s neighbor). Here, $s(c, d)$ should in turn be computed by taking the average of the similarities for all possible pairs across $c$'s neighbor objects and $d$'s neighbor objects. In this case, however, $c$ has no neighbor, which yields $s(c, d) = 0$ even though $d$ still has the neighbors $c$ and $e$, which should also be counted by some means. As a result, $s(a, b)$ also becomes 0, indicating that $a$ and $b$ are not similar at all. From this example, we conclude that the level-wise model cannot compare objects that are not located at the same distance and consequently cannot compute the similarity properly.
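The argument above can be traced with a short depth-limited recursion. The edge list used for Figure 3 ($a \to c$, $b \to d$, $d \to c$, $d \to e$) is inferred from the description in the text and is an assumption, as are the function names; the similarity routine is a plain out-link SimRank recursion, not the authors' implementation.

```python
from collections import deque


def rvs_simrank_pair(out_links, x, y, decay=0.7, depth=10):
    """Depth-limited recursive SimRank over out-links for a single pair."""
    if x == y:
        return 1.0
    targets_x, targets_y = out_links[x], out_links[y]
    if depth == 0 or not targets_x or not targets_y:
        return 0.0
    total = sum(rvs_simrank_pair(out_links, p, q, decay, depth - 1)
                for p in targets_x for q in targets_y)
    return decay * total / (len(targets_x) * len(targets_y))


def distances_from(out_links, start):
    """Breadth-first distances from start, following out-links."""
    dist = {start: 0}
    queue = deque([start])
    while queue:
        node = queue.popleft()
        for nxt in out_links[node]:
            if nxt not in dist:
                dist[nxt] = dist[node] + 1
                queue.append(nxt)
    return dist


# Assumed edges for Figure 3, reconstructed from the prose description.
figure3 = {"a": ["c"], "b": ["d"], "c": [], "d": ["c", "e"], "e": []}

dist_a = distances_from(figure3, "a")  # c lies 1 step from a
dist_b = distances_from(figure3, "b")  # c lies 2 steps from b
s_ab = rvs_simrank_pair(figure3, "a", "b")
print(dist_a["c"], dist_b["c"], s_ab)
```

Because the shared object $c$ sits at distance 1 from $a$ but distance 2 from $b$, the recursion only ever compares $c$ with $d$, and the result for $s(a, b)$ is 0 in this sketch.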

4. Proposed Methods

4.1. Main Concepts. In this section, we propose a novel link-based similarity measure. Our approach differs from the existing link-based methods in that we do not define the measure by combining the pair-wise and level-wise models but use the concept of the cosine similarity. As expressed by (7), the cosine similarity is a measure for computing a similarity between a pair of vectors. Thus, in computing a similarity between objects using the cosine similarity, the features of an object are represented by the elements of a vector, and the weight of each feature is captured by the value of each element. For example, when computing a similarity between documents using the cosine similarity, each document is represented by a vector, the words in the given universe of documents are denoted by the elements of the vector, and the frequency of each word in the document is indicated by the value of the corresponding element. Then, the similarity between two documents is computed as the similarity between the two corresponding vectors [9]:

$$
s(x, y) = \frac{x \cdot y}{\|x\| \, \|y\|}
= \frac{\sum_{i=1}^{n} (x_i \times y_i)}{\sqrt{\sum_{i=1}^{n} (x_i)^2} \times \sqrt{\sum_{i=1}^{n} (y_i)^2}}
\tag{7}
$$
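As a minimal illustration of (7), the following sketch computes the cosine similarity of word-frequency vectors. The example vectors and the convention of returning 0 for an all-zero vector are assumptions made here, not part of the paper.

```python
import math


def cosine_similarity(x, y):
    """Cosine similarity of two equal-length vectors, as in formula (7)."""
    dot = sum(xi * yi for xi, yi in zip(x, y))
    norm_x = math.sqrt(sum(xi * xi for xi in x))
    norm_y = math.sqrt(sum(yi * yi for yi in y))
    if norm_x == 0.0 or norm_y == 0.0:
        return 0.0  # assumed convention for all-zero vectors
    return dot / (norm_x * norm_y)


# Hypothetical word-frequency vectors over a four-word vocabulary.
doc_1 = [2, 1, 0, 0]
doc_2 = [4, 2, 0, 0]   # same word proportions as doc_1
doc_3 = [0, 0, 3, 1]   # no words in common with doc_1

same_direction = cosine_similarity(doc_1, doc_2)
no_overlap = cosine_similarity(doc_1, doc_3)
print(same_direction, no_overlap)
```

Only the values at matching positions are multiplied, so two vectors with the same proportions score close to 1 regardless of their lengths, and vectors with no shared nonzero positions score 0.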

In our approach, each object is represented as a vector, all other objects in the given universe as the elements of the vector, and some “weight value” for each object as the value of the corresponding element in the vector. In the cosine similarity used by the text-based measures discussed above, words of higher frequency get larger weight values because they are treated as better characterizing features of a document. Analogously, we need to define a measure that quantifies how well the target object is characterized by each object in the universe. For this measure, we propose to use the degree of how closely the two objects are located from the topological point of view.

Sun et al. [30] proposed a method for computing the probability of reaching from an object to another as the relevance between two objects using the “Random Walk with Restart (RWR)” strategy. We adopt this strategy and use the reachability to an object from the target object as the weight value for its corresponding element of the vector representing the target object. Reachability becomes higher as the distance between two objects becomes shorter and also when more paths exist between the two. Computing reachability using RWR is expressed by the following, where $P_A$ represents an adjacency matrix column-normalizing the connectedness among objects, $u_a$ a vector having reachability to each node starting from $a$, $q_a$ a restart vector having the value 1 only for the starting node $a$ and 0 for the rest, and $c$ the restart probability:

$$
u_a = (1 - c) P_A u_a + c \, q_a
\tag{8}
$$
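Formula (8) can be evaluated by repeated substitution. The sketch below is one way to do so in plain Python; the restart probability of 0.15, the iteration count, the function name, and the treatment of nodes without outgoing links (their score is simply not propagated) are assumptions made for illustration.

```python
def random_walk_with_restart(links, start, restart=0.15, iterations=100):
    """Iterate u <- (1 - c) * P_A * u + c * q_a for the walk defined by links.

    links maps each node to the nodes a walker may move to next. Splitting a
    node's score equally among its successors plays the role of multiplying
    by the column-normalized matrix P_A.
    """
    nodes = sorted(links)
    scores = {node: 0.0 for node in nodes}
    scores[start] = 1.0
    for _ in range(iterations):
        updated = {node: 0.0 for node in nodes}
        for node, successors in links.items():
            if not successors:
                continue  # assumed: score at a dead end is not propagated
            share = (1.0 - restart) * scores[node] / len(successors)
            for nxt in successors:
                updated[nxt] += share
        updated[start] += restart  # the c * q_a term
        scores = updated
    return scores


# Four-node example graph: a -> b, a -> c, b -> d, c -> d.
example = {"a": ["b", "c"], "b": ["d"], "c": ["d"], "d": []}
reach_from_a = random_walk_with_restart(example, "a")
print(reach_from_a)
```

In this example $d$ is farther from $a$ than $b$ is, yet it receives a higher score because two paths lead to it, which matches the remark that reachability also grows with the number of paths.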

Our approach does not suffer from the problem with the pair-wise model because it computes the similarity between two vectors by multiplying only the values of the corresponding elements in the two vectors, not trying every possible element pair between the two vectors. Moreover, the approach generates the vectors by considering all the objects linked directly or indirectly to the two target objects. This mechanism is different from the pair-wise and level-wise models and does not suffer from the problems discussed in the previous section.

In this paper, we develop two methods to implement our approach. The two methods differ only in the way they compute the weights in the vector. One method computes the weights by first computing two values of

Figure 4: The procedure of generating a vector using the weightedSum method. [Panels for an example graph with nodes $a$, $b$, $c$, and $d$: inlink and outlink adjacency matrices, their column-normalized versions, and the resulting vectors $(1, 0, 0, 0)$ and $(1, 0.06, 0.06, 0.102)$, which are combined (+) into $(1, 0.03, 0.03, 0.051)$.]

reachability, one using only “inlinks” and the other using only “outlinks,” separately and then combining them as a weighted sum. This method provides flexibility to a problem domain by allowing different weights for reachability using inlinks versus outlinks. The method, however, cannot compute reachability properly for some objects located close to the target object, because links in both directions are not used together. In Figure 1, for example, nodes $a$ and $b$ are located closely, but the reachability between them becomes 0 when computed using only outlinks (or using only inlinks) from $a$ or $b$. Another problem is the difficulty of determining the appropriate weights for the vectors generated purely using the inlinks and purely using the outlinks. Thus, we develop the second method, which computes reachability by ignoring the directions of inlinks and outlinks, converting them to undirected links before computing reachability. This method is advantageous in that it can compute appropriate reachability for every object. In the rest of this paper, we will call these two methods the “weightedSum” method and the “undirected” method, respectively. In the weightedSum method in particular, the proper balance between the weights for inlinks and outlinks needs to be found domain by domain through experimentation.
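The combination step of the weightedSum method can be sketched as follows. The convex form `weight * x + (1 - weight) * y` and the parameter name `weight` are assumptions, since the exact weighting formula is not stated in this passage; the input vectors are the two direction-specific vectors that appear in Figure 4.

```python
def weighted_sum(vec_x, vec_y, weight=0.5):
    """Element-wise weight * vec_x + (1 - weight) * vec_y (assumed form)."""
    return [weight * x + (1.0 - weight) * y for x, y in zip(vec_x, vec_y)]


# The two direction-specific vectors read from Figure 4 (node order a, b, c, d).
vec_one = [1.0, 0.0, 0.0, 0.0]
vec_two = [1.0, 0.06, 0.06, 0.102]

combined = weighted_sum(vec_one, vec_two, weight=0.5)
print([round(value, 3) for value in combined])
```

With equal weights this reproduces the combined vector $(1, 0.03, 0.03, 0.051)$ shown in Figure 4, which suggests that the figure uses a weight of 0.5 for each direction; other domains would tune `weight` experimentally.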

4.2. Procedure of the Proposed Methods. Basically, both of the proposed methods (1) construct vectors by computing reachability from the target object to all other objects and (2) compute the similarity between vectors by using the cosine similarity. The two methods differ only in the process of generating vectors, as illustrated by Figure 4 (for the weightedSum method) and Figure 5 (for the undirected method). The weightedSum method generates vectors in the following manner. First, two adjacency matrices are built with inlinks and outlinks and then column-normalized such that the sum of all values in a column becomes 1. Second, a vector is generated from each normalized matrix, and the weights are assigned to the elements in the vector by computing the reachability from the target object to every object using the RWR strategy. Similarly, the undirected method generates a vector by constructing a normalized adjacency matrix, ignoring the link directions this time, and assigning the reachability values to the elements of the vector in the same manner.
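The two steps of the undirected method can be put together in a short sketch. It uses raw RWR scores as vector elements, a restart probability of 0.15, and an edge list for Figure 3 inferred from the text; all of these, along with the function names, are assumptions, so the numbers it produces need not match those reported in the paper.

```python
import math


def undirected_transition(nodes, edges):
    """Column-normalized adjacency matrix with link directions ignored."""
    index = {node: i for i, node in enumerate(nodes)}
    n = len(nodes)
    matrix = [[0.0] * n for _ in range(n)]
    for source, target in edges:
        matrix[index[source]][index[target]] = 1.0
        matrix[index[target]][index[source]] = 1.0
    for j in range(n):
        column_sum = sum(matrix[i][j] for i in range(n))
        if column_sum:
            for i in range(n):
                matrix[i][j] /= column_sum
    return matrix


def reachability_vector(P, start, restart=0.15, iterations=200):
    """Fixed-point iteration of u = (1 - c) P u + c q for one start node."""
    n = len(P)
    q = [1.0 if i == start else 0.0 for i in range(n)]
    u = q[:]
    for _ in range(iterations):
        u = [(1.0 - restart) * sum(P[i][j] * u[j] for j in range(n))
             + restart * q[i] for i in range(n)]
    return u


def cosine(x, y):
    dot = sum(a * b for a, b in zip(x, y))
    return dot / (math.sqrt(sum(a * a for a in x)) * math.sqrt(sum(b * b for b in y)))


# Assumed edges for Figure 3, reconstructed from the prose in Section 3.3.
nodes = ["a", "b", "c", "d", "e"]
edges = [("a", "c"), ("b", "d"), ("d", "c"), ("d", "e")]

P = undirected_transition(nodes, edges)
vec_a = reachability_vector(P, nodes.index("a"))
vec_b = reachability_vector(P, nodes.index("b"))
similarity_ab = cosine(vec_a, vec_b)
print(similarity_ab)
```

Because every object is reachable from both $a$ and $b$ once directions are ignored, both vectors have only positive entries and the similarity is strictly greater than 0, in contrast to the level-wise result for the same graph.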

4.3. Complexity Analysis. The complexity of the proposed methods for computing similarities for all pairs of given objects can be analyzed as follows. First, the time complexity of generating a vector for one object using the RWR strategy is $O(ke)$ [31], where $e$ represents the number of links and $k$ the number of iterations of the matrix calculation needed to obtain converged values of reachability. Thus, the time complexity of generating the vectors for all $n$ objects is $O(kne)$, where $n$ represents the number of objects. It is generally known that a constant number of such iterations is sufficient to obtain converged values of reachability [6, 15]. Consequently, the time complexity is reduced to $O(ne)$. On the other hand, the time complexity of the similarity calculation process is $O(n^3)$, since a cosine over $n$-element vectors is computed for each of the $O(n^2)$ pairs. Combining them, the overall time complexity of both methods becomes $O(n^3) + O(ne) = O(n^3)$, because $e$ is at most $n^2$. In practice, the weightedSum method will require twice as much time as the undirected method because the former computes reachability separately for each direction of the links.

As for space complexity, $O(e)$ space is required for storing the matrix that represents the relationships between objects, $O(n^2)$ for storing vectors, and $O(n^2)$ for storing the similarity measures between objects. Combining them, the overall space complexity becomes $O(n^2)$. Again, in practice, the

Figure 5: The procedure of generating a vector using the undirected method. [Panels for an example graph with nodes $a$, $b$, $c$, and $d$: inlink and outlink adjacency matrices, the undirected-link adjacency matrix obtained by merging them, its column-normalized version with entries of 0.5, and the resulting vector $(1, 0.462, 0.462, 0.441)$.]

weightedSum method will require twice as much space as the other.

In comparison, the time complexity of the three existing methods, namely, rvs-SimRank, SimRank, and P-Rank, is known to be $O(n^4)$ and the space complexity $O(n^2)$ [6]. We believe the existing methods require more computation than our proposed methods because they adopt the pair-wise computation model in principle.

Moreover, the existing methods cannot compute the similarity between two given objects independently, because they need to know the similarities among all objects (whether connected directly or indirectly to the target objects) in order to compute the similarity between two target objects [20]. Our approach, however, does not require the similarities among all objects and hence can compute the similarity between any pair of objects independently, making it possible to parallelize the algorithm. Since our approach need not refer to the similarities among other objects, vectors for individual objects can be generated independently, and the similarity between any pair of objects can be computed separately, that is, in parallel. The time complexity of this parallelized version would become $O(n^3/m)$ if $m$ processors are utilized in parallel.

When we need to compute the similarity between only a particular pair of objects on-line, rather than computing the similarities among all objects [20], our methods can be made even more efficient by performing the computation of reachability on-line through the inverse matrix, as suggested by (9). That is, the vector for a given object can be generated by multiplying $q_a$ by the inverse of the matrix $Q = I - (1 - c) P_A$ [30, 31]. The time complexity of the off-line computation of the inverse matrix is $O(n^{2.376})$ [32], and the complexity of the on-line computation is $O(n)$. Such an off-line approach tends to require a relatively long time and more space to handle the inverse matrix. Reference [31] suggests a low-rank approximation in order to balance the off-line and on-line computation. This approach has advantages in both time and space complexity, suggesting an improvement for the on-line execution of our proposed approach. In conclusion, the time complexity of our methods is lower than that of existing methods, and the space complexity is equal to that of existing methods in any case.

$$
\begin{aligned}
u_a &= (1 - c)\, P_A\, u_a + c\, q_a \\
    &= c\, \bigl(I - (1 - c)\, P_A\bigr)^{-1} q_a \\
    &= c\, Q^{-1} q_a, \qquad Q = I - (1 - c)\, P_A.
\end{aligned}
\tag{9}
$$
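As a minimal sketch of the closed form in (9), the following solves the linear system $Q u_a = c\, q_a$ for a small example instead of forming $Q^{-1}$ explicitly; that substitution is our choice for brevity, not the precomputed-inverse scheme described in the text. The helper names `solve` and `rwr_closed_form`, the example graph, and the convention that `P[i][j]` is the probability of stepping from node j to node i are assumptions made for illustration.

```python
def solve(matrix, rhs):
    """Solve matrix * x = rhs by Gauss-Jordan elimination with partial pivoting."""
    n = len(rhs)
    aug = [row[:] + [b] for row, b in zip(matrix, rhs)]
    for col in range(n):
        # Pick the row with the largest entry in this column as the pivot row.
        pivot = max(range(col, n), key=lambda r: abs(aug[r][col]))
        aug[col], aug[pivot] = aug[pivot], aug[col]
        for r in range(n):
            if r != col:
                factor = aug[r][col] / aug[col][col]
                aug[r] = [x - factor * y for x, y in zip(aug[r], aug[col])]
    return [aug[i][n] / aug[i][i] for i in range(n)]


def rwr_closed_form(P, start, c=0.15):
    """Return u = c * Q^{-1} q with Q = I - (1 - c) P, as in (9)."""
    n = len(P)
    Q = [[(1.0 if i == j else 0.0) - (1 - c) * P[i][j] for j in range(n)]
         for i in range(n)]
    rhs = [c if i == start else 0.0 for i in range(n)]
    return solve(Q, rhs)


# Assumed example: nodes a, b, c, d with links a->b, a->c, b->d, c->d.
# P[i][j] is the probability of stepping from node j to node i.
P = [
    [0.0, 0.0, 0.0, 0.0],
    [0.5, 0.0, 0.0, 0.0],
    [0.5, 0.0, 0.0, 0.0],
    [0.0, 1.0, 1.0, 0.0],
]
u_a = rwr_closed_form(P, start=0)
print(u_a)
```

Because node d has no outgoing link in this example, the entries of `u_a` need not sum to 1.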

4.4. Discussions. The proposed methods do not suffer from the problems of the pair-wise and level-wise models discussed in Section 3. Let us compute the similarities between $a$ and $b$ and between $a'$ and $b'$ in Figure 2 using our approach and check whether the results appeal to our intuition. We assume $\lambda$ and $C$ to be 0.5 and 0.7, respectively. Table 2 shows the similarity results computed by the two proposed methods, together with those obtained by the three existing methods (shown already in Table 1) for comparison. Both of our methods produce the same result: the similarity between $a$ and $b$ and that between $a'$ and $b'$ are both 1, meaning that the two objects in each pair are regarded as the same. These results appeal to our intuition. In comparison, the results obtained by the existing methods indicate that the two objects are not the same. Since our approach does not use the pair-wise model but performs the computations among identical features, the proposed methods produce results more appealing to our intuition than the existing methods.

Table 2: Similarities between $a$ and $b$ and between $a'$ and $b'$ in Figure 2 using the proposed methods and existing methods.

| | rvs-SimRank | SimRank | P-Rank | WeightedSum | Undirected |
|---|---|---|---|---|---|
| $s(a, b)$ | 0.396 | 0 | 0.423 | 1 | 1 |
| $s(a', b')$ | 0.321 | 0 | 0.408 | 1 | 1 |

The proposed methods do not suffer from the problem of the level-wise model either. Let us compute the similarity between $a$ and $b$ in Figure 3 using our approach and check whether the results appeal to our intuition. Assuming $\lambda$ and $C$ to be 0.5 and 0.7, respectively, our approach computes the similarity between $a$ and $b$ to be 0.28 using the weightedSum method and 0.6 using the undirected method. For comparison, the existing methods would compute the similarity to be 0, as discussed in Section 3. We cannot tell whether the results obtained by our approach are appropriate, because similarity is by nature a subjective value. Still, we can at least argue that our approach produces results that are more appealing to our intuition than those of the existing methods; that is, the similarity between $a$ and $b$ must not be 0 because they are related indirectly by some common objects. This difference comes from our strategy of considering all the objects indirectly connected to the target objects at once, rather than using the level-wise model.
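The nonzero similarity argued for above can be illustrated with a small sketch of the undirected method. The edge list below is our reading of Figure 3 from the description in Section 3 and is an assumption, as are the function names, the restart probability of 0.15, and the fixed iteration count. The printed value depends on these choices and on normalization details not reproduced here, so it need not equal the 0.6 reported in the text; the point is only that it is greater than 0.

```python
import math


def rwr_vector(neighbors, nodes, start, c=0.15, iterations=200):
    """Approximate RWR reachability from start by repeated propagation."""
    u = {n: (1.0 if n == start else 0.0) for n in nodes}
    for _ in range(iterations):
        nxt = {n: (c if n == start else 0.0) for n in nodes}
        for src in nodes:
            for dst in neighbors[src]:
                # The walker at src moves to each neighbor with equal probability.
                nxt[dst] += (1 - c) * u[src] / len(neighbors[src])
        u = nxt
    return [u[n] for n in nodes]


def cosine(x, y):
    """Cosine similarity of two equal-length, nonzero vectors."""
    dot = sum(a * b for a, b in zip(x, y))
    norm_x = math.sqrt(sum(a * a for a in x))
    norm_y = math.sqrt(sum(b * b for b in y))
    return dot / (norm_x * norm_y)


nodes = ["a", "b", "c", "d", "e"]
links = [("a", "c"), ("b", "d"), ("d", "c"), ("d", "e")]  # assumed edges

# Undirected method: ignore link direction when collecting neighbors.
neighbors = {n: [] for n in nodes}
for s, t in links:
    neighbors[s].append(t)
    neighbors[t].append(s)

similarity = cosine(rwr_vector(neighbors, nodes, "a"),
                    rwr_vector(neighbors, nodes, "b"))
print(f"s(a, b) = {similarity:.3f}")
```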

5. Experiments

In this section, we verify the effectiveness of our approach through experimentation. We show and analyze quantitatively the experimental results obtained from applying the two proposed methods to practical application domains.

5.1. Experimentation Setup. We carried out experiments to verify the performance of the proposed methods with respect to scientific papers and Web documents, two exemplary types of data sets having link information. For the experiments with the scientific papers, we used a data set consisting of the papers downloaded from DBLP (http://www.informatik.uni-trier.de/~ley/db) and reference information among the papers crawled from Libra (http://academic.research.microsoft.com). In total, 44,800 papers and 126,281 references were used. For the experiments with the Web documents, we used a data set consisting of 1,227,038 Web pages with 11,164,829 hyperlinks in total, taken from the TREC (http://trec.nist.gov/data/t11.web.html) 2002 data. The experiments were carried out on a platform with a quad-core 2.67 GHz CPU and the Windows 2008 Server OS.

In the experiments, we aimed to evaluate the performance of the two proposed methods (weightedSum and undirected) in comparison with the three existing methods (rvs-SimRank, SimRank, and P-Rank), with the input values of 0.8, 0.5, and 0.15 for $C$, $\lambda$, and the restart probability, respectively, as used frequently in the existing methods [6, 15, 30].

The experiments proceeded as follows. For the weightedSum method, we first assigned appropriate weights to the vectors generated with inlinks and outlinks by varying the weights and finding the ones that achieve the highest accuracy in the similarity through the weightedSum method. In the experiments, we tried the weight values of 0, 0.1, 0.3, 0.5, 0.7, 0.9, and 1 for inlinks (note that the sum of the inlink and outlink weights equals 1). Second, we evaluated the accuracy of each method qualitatively by examining the 10 objects that the method computes as the most similar to a target object chosen arbitrarily. Finally, we measured the accuracy of each method quantitatively by comparing the obtained results with the true answers for each data set.

The measurement of the accuracy proceeded as follows. First, we chose one object in turn as the target object from the answer set. Then, we computed the recall [29] by extracting the $m$ objects (where $m$ can be 10, 20, 30, 40, and 50) most similar to the target object according to each method. This process was repeated until every object in the set had been chosen as the target object. The average of all recall values obtained in this way was taken as the final accuracy value.
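The accuracy measurement described above can be sketched as follows. This is an illustration under assumptions rather than the evaluation code used in the experiments: the function names, the toy identifiers, and the choice to exclude the target itself from the set of relevant objects are ours.

```python
def recall_at_m(ranked, relevant, m):
    """Fraction of the relevant objects that appear among the top m of ranked."""
    if not relevant:
        raise ValueError("relevant set must not be empty")
    hits = sum(1 for obj in ranked[:m] if obj in relevant)
    return hits / len(relevant)


def average_recall(answer_set, rank_fn, m):
    """Average recall when every object of the answer set serves as the target once."""
    scores = []
    for target in answer_set:
        # Assumption: the target itself does not count as a relevant result.
        relevant = set(answer_set) - {target}
        scores.append(recall_at_m(rank_fn(target), relevant, m))
    return sum(scores) / len(scores)


# Toy data: a hypothetical answer set and a fixed ranking for each target.
answer_set = ["p1", "p2", "p3"]
rankings = {
    "p1": ["p2", "x1", "p3", "x2"],
    "p2": ["x1", "p1", "x2", "p3"],
    "p3": ["x1", "x2", "p1", "p2"],
}
score = average_recall(answer_set, rankings.__getitem__, m=2)
print(score)
```

In practice, `rank_fn` would return the objects ordered by decreasing similarity to the target under the measure being evaluated.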

The answers of each data set were constructed in the following manner. For the experiments with scientific papers, we selected five well-known areas (i.e., clustering, sequential pattern mining, spatial databases, link mining, and graph pattern mining) in a data mining textbook [29] and obtained the papers referenced in each section of these areas. We supposed that the reference papers in the same section are similar to one another. Thus, the papers in a section formed an answer set. Each answer set had 3 to 14 papers, and the total number of papers in all such answer sets was 106. For the experiments with Web documents, we used TREC 2002, which provides Web document sets related to specific keywords. We chose 9 Web document sets randomly from TREC 2002 and used them as the answer sets.

5.2. Domain of Scientific Papers. Figure 6 depicts the accuracy of the similarities among scientific papers obtained by the weightedSum method using various weight values for the vectors generated with inlinks and outlinks. The $x$-axis represents the number of "the most similar" papers selected by the method and the $y$-axis the accuracy. Annotations such as "0.0 : 1.0" represent the weight balances between inlinks and outlinks. According to Figure 6, the accuracy of the weightedSum method tends to become higher when the weight for inlinks gets higher than the outlink weight. This is because the papers in the answer set are relatively famous ones frequently referenced by other papers. For this reason, we conducted the experiments using the "0.9 : 0.1" weight balance in order to obtain the best result.

Tables 3, 4, 5, 6, and 7 present the lists of the top 10 papers found to be the most similar to a target paper, [33], using the three existing methods and the two proposed methods, respectively. (Note that [33] is concerned with clustering in data mining.) In the tables, those papers which are not similar to [33] are shown in italics. These papers mainly deal with issues of outlier detection or mining frequent patterns, indicating that the existing methods have made a wrong conclusion for these papers. In comparison, the results by the weightedSum method (shown in Table 6) include only one wrong entry (on frequent pattern mining) in the top 10 list, and the results by the undirected method do not include wrong papers (i.e., all are related to clustering). These results imply that the undirected method performs better than the three existing methods and even better than the weightedSum method. We have repeated this experiment many times with target papers other than [33] and obtained results similar to Tables 3–7.

Figure 6: Accuracy of the weightedSum method with varying weights. (Axes: $x$, number of selected objects $m$ = 10 to 50; $y$, accuracy from 0 to 1. Series: weight balances 0.0 : 1.0, 0.1 : 0.9, 0.3 : 0.7, 0.5 : 0.5, 0.7 : 0.3, 0.9 : 0.1, and 1.0 : 0.0.)

Figure 7 compares the accuracy of the similarities computed by the three existing methods and the two proposed methods. As in Figure 6, the $x$-axis represents the number of "the most similar" papers selected by each method and the $y$-axis the accuracy. The weightedSum method improved accuracy by 9% on average and up to 13% compared with SimRank. Also, the undirected method improved accuracy by 16% on average and up to 20% compared with SimRank. In conclusion, the two proposed methods turn out to compute the similarity more accurately than the existing methods. In particular, the undirected method performs the best.

Figure 7: Accuracy of the similarity measures in scientific papers data. (Axes: $x$, number of selected objects $m$ = 10 to 50; $y$, accuracy from 0 to 1. Series: SimRank, rvs-SimRank, P-Rank, WeightedSum, and Undirected.)

5.3. Domain of Web Documents. Figure 8 shows the accuracy of the similarities between Web pages obtained by the weightedSum method using various weight values between inlinks and outlinks. The $x$-axis represents the number of "the most similar" Web pages selected by the method and the $y$-axis the accuracy. Annotations such as "0.0 : 1.0" represent the weight balances between inlinks and outlinks. Again, the accuracy of the weightedSum method tends to become higher when the weight for inlinks gets higher, as in the case of scientific papers. This is because the answer set contains many Web pages with high authority. For this reason, we conducted the experiments using the "0.9 : 0.1" weight balance in order to obtain the best result.

Figure 8: Accuracy of the weightedSum method with varying the weights. (Axes: $x$, number of selected objects $m$ = 10 to 50; $y$, accuracy from 0 to 0.5. Series: weight balances 0.0 : 1.0, 0.1 : 0.9, 0.3 : 0.7, 0.5 : 0.5, 0.7 : 0.3, 0.9 : 0.1, and 1.0 : 0.0.)

Figure 9 compares the accuracy of the similarities computed by the three existing methods and the two proposed methods. Again, the $x$-axis represents the number of "the most similar" documents selected by each method and the $y$-axis the accuracy. The weightedSum method improved accuracy by 20% on average and up to 24% compared with SimRank. Also, the undirected method improved accuracy by 34% on average and up to 43% compared with SimRank. In conclusion, the two proposed methods turn out to compute the similarity more accurately than the existing methods. In particular, the undirected method outperforms the weightedSum method and the three existing methods by a large margin.

Table 3: Top 10 papers similar to [33] using SimRank.

| First author | Title | Conference/journal | Year |
|---|---|---|---|
| Guha | CURE: An Efficient Clustering Algorithm for Large Databases | ACM SIGMOD | 1998 |
| Sheikholeslami | WaveCluster: A Multi-Resolution Clustering Approach for Very Large Spatial Databases | VLDB | 1998 |
| Ester | Knowledge Discovery in Large Spatial Databases: Focusing Techniques for Efficient Class Identification | SSD | 1995 |
| Hinneburg | An Efficient Approach to Clustering in Large Multimedia Databases with Noise | AAAI | 1998 |
| Ng | Efficient and Effective Clustering Methods for Spatial Data Mining | VLDB | 1994 |
| Bradley | Scaling Clustering Algorithms to Large Databases | AAAI | 1998 |
| Wang | STING: A Statistical Information Grid Approach to Spatial Data Mining | VLDB | 1997 |
| L. O'Callaghan | Streaming-Data Algorithms for High-Quality Clustering | IEEE ICDE | 2002 |
| Sander | A Density-Based Algorithm for Discovering Clusters in Large Spatial Databases with Noise | DMKD | 1998 |
| Arning | A Linear Method for Deviation Detection in Large Databases | ACM KDD | 1996 |

Table 4: Top 10 papers similar to [33] using rvs-SimRank.

| First author | Title | Conference/journal | Year |
|---|---|---|---|
| Knorr | A Unified Notion of Outliers: Properties and Computation | ACM KDD | 1997 |
| Guha | CURE: An Efficient Clustering Algorithm for Large Databases | ACM SIGMOD | 1998 |
| Sander | A Density-Based Algorithm for Discovering Clusters in Large Spatial Databases with Noise | DMKD | 1998 |
| Bradley | Scaling Clustering Algorithms to Large Databases | AAAI | 1998 |
| Guha | ROCK: A Robust Clustering Algorithm for Categorical Attributes | IEEE ICDE | 1999 |
| Sheikholeslami | WaveCluster: A Multi-Resolution Clustering Approach for Very Large Spatial Databases | VLDB | 1998 |
| Arning | A Linear Method for Deviation Detection in Large Databases | ACM KDD | 1996 |
| Burdick | MAFIA: A Maximal Frequent Itemset Algorithm for Transactional Databases | IEEE TKDE | 2005 |
| Mannila | Efficient Algorithms for Discovering Association Rules | AAAI | 1994 |
| Kamber | Metarule-Guided Mining of Multi-Dimensional Association Rules Using Data Cubes | ACM KDD | 1997 |

Table 5: Top 10 papers similar to [33] using P-Rank.

| First author | Title | Conference/journal | Year |
|---|---|---|---|
| Knorr | A Unified Notion of Outliers: Properties and Computation | ACM KDD | 1997 |
| Guha | CURE: An Efficient Clustering Algorithm for Large Databases | ACM SIGMOD | 1998 |
| Sander | A Density-Based Algorithm for Discovering Clusters in Large Spatial Databases with Noise | DMKD | 1998 |
| Bradley | Scaling Clustering Algorithms to Large Databases | AAAI | 1998 |
| Sheikholeslami | WaveCluster: A Multi-Resolution Clustering Approach for Very Large Spatial Databases | VLDB | 1998 |
| Guha | ROCK: A Robust Clustering Algorithm for Categorical Attributes | IEEE ICDE | 1999 |
| Mannila | Efficient Algorithms for Discovering Association Rules | AAAI | 1994 |
| Arning | A Linear Method for Deviation Detection in Large Databases | ACM KDD | 1996 |
| Silberschatz | What Makes Patterns Interesting in Knowledge Discovery Systems | IEEE TKDE | 1996 |
| Agrawal | Mining Association Rules between Sets of Items in Large Databases | ACM SIGMOD | 1993 |

Table 6: Top 10 papers similar to [33] using the weightedSum method.

| First author | Title | Conference/journal | Year |
|---|---|---|---|
| Ester | Knowledge Discovery in Large Spatial Databases: Focusing Techniques for Efficient Class Identification | SSD | 1995 |
| Ng | Efficient and Effective Clustering Methods for Spatial Data Mining | VLDB | 1994 |
| Sheikholeslami | WaveCluster: A Multi-Resolution Clustering Approach for Very Large Spatial Databases | VLDB | 1998 |
| Guha | ROCK: A Robust Clustering Algorithm for Categorical Attributes | IEEE ICDE | 1999 |
| Guha | CURE: An Efficient Clustering Algorithm for Large Databases | ACM SIGMOD | 1998 |
| Wang | STING: A Statistical Information Grid Approach to Spatial Data Mining | VLDB | 1997 |
| Hinneburg | An Efficient Approach to Clustering in Large Multimedia Databases with Noise | AAAI | 1998 |
| Aggarwal | Fast Algorithms for Projected Clustering | ACM SIGMOD | 1999 |
| Agrawal | Automatic Subspace Clustering of High Dimensional Data for Data Mining Applications | ACM SIGMOD | 1998 |
| Sander | A Density-Based Algorithm for Discovering Clusters in Large Spatial Databases with Noise | DMKD | 1998 |

Table 7: Top 10 papers similar to [33] using the undirected method.

| First author | Title | Conference/journal | Year |
|---|---|---|---|
| Guha | CURE: An Efficient Clustering Algorithm for Large Databases | ACM SIGMOD | 1998 |
| Ng | Efficient and Effective Clustering Methods for Spatial Data Mining | VLDB | 1994 |
| Sander | A Density-Based Algorithm for Discovering Clusters in Large Spatial Databases with Noise | DMKD | 1998 |
| Bradley | Scaling Clustering Algorithms to Large Databases | AAAI | 1998 |
| Sheikholeslami | WaveCluster: A Multi-Resolution Clustering Approach for Very Large Spatial Databases | VLDB | 1998 |
| Hinneburg | An Efficient Approach to Clustering in Large Multimedia Databases with Noise | AAAI | 1998 |
| Wang | STING: A Statistical Information Grid Approach to Spatial Data Mining | VLDB | 1997 |
| Agrawal | Automatic Subspace Clustering of High Dimensional Data for Data Mining Applications | ACM SIGMOD | 1998 |
| Aggarwal | Fast Algorithms for Projected Clustering | ACM SIGMOD | 1999 |
| Ankerst | OPTICS: Ordering Points To Identify the Clustering Structure | ACM SIGMOD | 1999 |

Figure 9: Accuracy of the similarity measures in Web documents data. (Axes: $x$, number of selected objects $m$ = 10 to 50; $y$, accuracy from 0 to 0.5. Series: SimRank, rvs-SimRank, P-Rank, WeightedSum, and Undirected.)

6. Conclusions

This paper presented new link-based similarity methods that can compute the similarity between objects more accurately by using the link information pertaining to the objects. Noticing that most existing link-based similarity methods use the pair-wise and level-wise models, we analyzed the problems with these models and proposed a new approach that does not suffer from these problems. In our proposed approach, each object is represented by a vector, all objects in the given universe are the elements of the vector, and a weight value for each object is the value of the corresponding element in the vector. As for this weight value, we proposed to utilize the notion of reachability between objects, computed using the "Random Walk with Restart" strategy. Then, we defined the similarity between two objects as the cosine similarity between the two vectors representing the two objects. The proposed approach was then refined into two methods, the weightedSum and the undirected methods, differentiated by their strategy for handling the information on link directions. Examples showed that the two methods do not suffer from the problems of the pair-wise and level-wise models. In our experimentation of the proposed methods with the scientific papers and Web documents data sets, the results indicated that both of the proposed methods generally outperform the existing methods significantly.

Conflict of Interests

The authors declare that there is no conflict of interests regarding the publication of this paper.

Acknowledgments

This work was supported by (1) Business for Cooperative R&D between Industry, Academy, and Research Institute funded by the Korea Small and Medium Business Administration in 2013 (Grant no. C0006278), (2) the MSIP (Ministry of Science, ICT and Future Planning), Korea, under the ITRC (Information Technology Research Center) Support Program (NIPA-2013-H0301-13-4009) supervised by the NIPA (National IT Industry Promotion Agency), (3) the Ministry of Culture, Sports and Tourism (MCST) and the Korea Copyright Commission in 2013, and (4) the National Research Foundation of Korea (NRF) Grant funded by the Korea government (MEST) (no. 2011-0029181).

References

[1] J. Shen, Y. Zhu, H. Zhang, C. Chen, R. Sun, and F. Xu, "A content-based algorithm for blog ranking," in Proceedings of the International Conference on Internet Computing in Science and Engineering (ICICSE '08), pp. 19–22, January 2008.

[2] A. Broder, S. Glassman, M. Manasse, and G. Zweig, "Syntactic clustering of the web," in Proceedings of the 6th International World Wide Web Conference (WWW '97), pp. 391–404, 1997.

[3] X. Su and T. Khoshgoftaar, "A survey of collaborative filtering techniques," Advances in Artificial Intelligence, vol. 2009, Article ID 421425, 19 pages, 2009.

[4] S.-H. Yoon, S.-W. Kim, and S. Park, "A link-based similarity measure for scientific literature," in Proceedings of the 19th International World Wide Web Conference (WWW '10), pp. 1213–1214, April 2010.

[5] M. R. Hamedani, S. Lee, and S. Kim, "On exploiting content and citations together to compute similarity of scientific papers," in Proceedings of the ACM International Conference on Information and Knowledge Management (CIKM '13), San Francisco, Calif, USA, 2013.

[6] P. Zhao, J. Han, and Y. Sun, "P-Rank: a comprehensive structural similarity measure over information networks," in Proceedings of the ACM 18th International Conference on Information and Knowledge Management (CIKM '09), pp. 553–562, November 2009.

[7] D.-H. Bae, S.-M. Hwang, S.-W. Kim, and C. Faloutsos, "Constructing seminal paper genealogy," in Proceedings of the 20th ACM Conference on Information and Knowledge Management (CIKM '11), pp. 2101–2104, October 2011.

[8] F. Menczer, "Combining link and content analysis to estimate semantic similarity," in Proceedings of the 13th International World Wide Web Conference Proceedings (WWW '04), pp. 452–453, May 2004.

[9] R. Baeza-Yates and B. Ribeiro-Neto, Modern Information Retrieval, Addison Wesley, 1999.

[10] S.-H. Yoon, S.-W. Kim, J.-S. Kim, and W.-S. Hwang, "On computing text-based similarity in scientific literature," in Proceedings of the 20th International Conference Companion on World Wide Web (WWW '11), pp. 169–170, April 2011.

[11] G. Bisson and F. Hussain, "χ-Sim: a new similarity measure for the co-clustering task," in Proceedings of the 7th International Conference on Machine Learning and Applications (ICMLA '08), pp. 211–217, December 2008.

[12] H. Small, "CoCitation in the scientific literature: a new measure of the relationship between two documents," Journal of the American Society for Information Science, vol. 24, no. 4, pp. 265–269, 1973.

[13] M. Kessler, "Bibliographic coupling between scientific papers," Journal of the American Documentation, vol. 14, no. 1, pp. 10–25, 1963.

[14] R. Amsler, "Application of citation-based automatic classification," Technical Report, The University of Texas at Austin, Linguistics Research Center, 1972.

[15] G. Jeh and J. Widom, "SimRank: a measure of structural-context similarity," in Proceedings of the 8th ACM SIGKDD International Conference on Knowledge Discovery and Data Mining (KDD '02), pp. 538–543, July 2002.

[16] U. Shardanand and P. Maes, "Social information filtering: algorithms for automating 'word of mouth'," in Proceedings of the SIGCHI Conference on Human Factors in Computing Systems (CHI '95), pp. 210–217, ACM Press, Denver, Colorado, 1995.

[17] I. Antonellis, H. Garcia-Molina, and C. Chang, "Simrank++: query rewriting through link analysis of the click graph," in Proceedings of the 34th International Conference on Very Large Data Bases (VLDB '08), pp. 408–421, Auckland, New Zealand, August 2008.

[18] D. Fogaras and B. Racz, "Scaling link-based similarity search," in Proceedings of the 14th International Conference on World Wide Web (WWW '05), pp. 641–650, 2005.

[19] C. Li, J. Han, G. He et al., "Fast computation of SimRank for static and dynamic information networks," in Proceedings of the 13th International Conference on Extending Database Technology: Advances in Database Technology (EDBT '10), pp. 465–476, March 2010.

[20] P. Li, H. Liu, J. Xu Yu, J. He, and X. Du, "Fast single-pair SimRank computation," in Proceedings of the SIAM International Conference on Data Mining (SDM '10), pp. 571–582, SIAM, Columbus, Ohio, USA, April 2010.

[21] X. Yin, J. Han, and P. S. Yu, "LinkClus: efficient clustering via heterogeneous semantic links," in Proceedings of the 32nd International Conference on Very Large Data Bases (VLDB '06), pp. 427–438, 2006.

[22] D. Lizorkin, P. Velikhov, M. Grinev, and D. Turdakov, "Accuracy estimate and optimization techniques for SimRank computation," in Proceedings of the Very Large Data Bases Endowment, vol. 1, pp. 422–433, 2008.

[23] Y. Cai, G. Cong, X. Jia et al., "Efficient algorithm for computing link-based similarity in real world networks," in Proceedings of the 9th IEEE International Conference on Data Mining (ICDM '09), pp. 734–739, December 2009.

[24] G. He, H. Feng, C. Li, and H. Chen, "Parallel SimRank computation on large graphs with iterative aggregation," in Proceedings of the 16th ACM SIGKDD International Conference on Knowledge Discovery and Data Mining (KDD '10), pp. 543–552, July 2010.

[25] D. Lizorkin, P. Velikhov, M. Grinev, and D. Turdakov, "Accuracy estimate and optimization techniques for SimRank computation," The Int'l Journal on Very Large Data Bases, vol. 19, no. 1, pp. 45–66, 2010.

[26] Z. Lin, M. R. Lyu, and I. King, "MatchSim: a novel neighbor-based similarity measure with maximum neighborhood matching," in Proceedings of the ACM 18th International Conference on Information and Knowledge Management (CIKM '09), pp. 1613–1616, November 2009.

[27] R. R. Larson, "Bibliometrics of the world wide web: an exploratory analysis of the intellectual structure of cyberspace," Proceedings of the Annual Meeting of the American Society for Information Science, vol. 33, pp. 71–78, 1996.

[28] J. Pitkow and P. Pirolli, "Life, death, and lawfulness on the electronic frontier," in Proceedings of the Conference on Human Factors in Computing Systems (CHI '97), pp. 383–390, March 1997.

[29] J. Han and M. Kamber, Data Mining: Concepts and Techniques, Morgan Kaufmann, 2006.

[30] J. Sun, H. Qu, D. Chakrabarti, and C. Faloutsos, "Neighborhood formation and anomaly detection in bipartite graphs," in Proceedings of the 5th IEEE International Conference on Data Mining (ICDM '05), pp. 418–425, November 2005.

[31] H. Tong, C. Faloutsos, and J.-Y. Pan, "Fast random walk with restart and its applications," in Proceedings of the 6th International Conference on Data Mining (ICDM '06), pp. 613–622, December 2006.

[32] S. Robinson, "Toward an optimal algorithm for matrix multiplication," News Journal of the Society for Industrial and Applied Mathematics, vol. 38, no. 9, 2005.

[33] T. Zhang, R. Ramakrishnan, and M. Livny, "BIRCH: an efficient data clustering method for very large databases," in Proceedings of the ACM SIGMOD International Conference on Management of Data (SIGMOD '96), pp. 103–114, 1996.


Figure 3: An example graph to illustrate the problem with the level-wise model. (Nodes: a, b, c, d, and e.)

sites. We envisage that, for these domains, methods that use the pair-wise computation model cannot compute the similarity between objects accurately.

3.3. Problem of Level-Wise Computation Model. The level-wise computation model computes the similarity between two objects by utilizing only those objects in the graph at the same distance from the two. The problem with this model can be illustrated using Figure 3, where again nodes represent objects and directed links represent the reference relationships between objects. Intuitively, we can say that two objects $a$ and $b$ are similar to some degree, as they share, though indirectly, an object $c$ in common. All of the aforementioned three methods, however, compute the similarity between these two objects $a$ and $b$ to be 0. For this case, one would expect to compute the similarity by considering all objects directly and indirectly linked to $a$ (namely, $c$) and all objects directly and indirectly linked to $b$ (namely, $c$, $d$, and $e$). In fact, however, any method which utilizes the level-wise model will compute $s(a, b)$ by only using $c$ (as $a$'s neighbor) and $d$ (as $b$'s neighbor). Here, $s(c, d)$ should also be computed by taking the average of the similarities for all possible pairs across $c$'s neighbor objects and $d$'s neighbor objects. In this case, however, $c$ has no neighbor, yielding $s(c, d)$ to be 0, even though there still remain $d$'s neighbors, $c$ and $e$, which should also be counted by some means. As a result, $s(a, b)$ will also become 0, indicating that they are not similar at all. From this example, we conclude that the level-wise model cannot compare objects not located at the same distance and consequently cannot compute the similarity properly.
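The zero similarity described above can be reproduced with a naive iterative implementation of the SimRank recurrence. This is a sketch under assumptions: the edge list is our reading of Figure 3 from the text (a to c, b to d, d to c, d to e), the decay factor of 0.8 and the iteration count are illustrative, and the function name `simrank` is ours. Passing out-neighbors mimics an rvs-SimRank-style comparison, and passing in-neighbors mimics SimRank; both give 0 for the pair (a, b) on this graph.

```python
from itertools import product


def simrank(neighbors, nodes, decay=0.8, iterations=10):
    """Naive iterative SimRank over the given neighbor lists."""
    sim = {(u, v): (1.0 if u == v else 0.0) for u, v in product(nodes, nodes)}
    for _ in range(iterations):
        new = {}
        for u, v in product(nodes, nodes):
            if u == v:
                new[(u, v)] = 1.0
            elif not neighbors[u] or not neighbors[v]:
                # A node without neighbors contributes nothing to the average.
                new[(u, v)] = 0.0
            else:
                total = sum(sim[(x, y)]
                            for x in neighbors[u] for y in neighbors[v])
                new[(u, v)] = decay * total / (len(neighbors[u]) * len(neighbors[v]))
        sim = new
    return sim


nodes = ["a", "b", "c", "d", "e"]
edges = [("a", "c"), ("b", "d"), ("d", "c"), ("d", "e")]  # assumed edges
out_nb = {n: [t for s, t in edges if s == n] for n in nodes}
in_nb = {n: [s for s, t in edges if t == n] for n in nodes}

s_out = simrank(out_nb, nodes)  # compares objects through their out-neighbors
s_in = simrank(in_nb, nodes)    # compares objects through their in-neighbors
print(s_out[("a", "b")], s_in[("a", "b")])
```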

4. Proposed Methods

4.1. Main Concepts. In this section, we propose a novel link-based similarity measure. Our approach differs from the existing link-based methods in that we do not define the measure by combining the pair-wise and level-wise models but use the concept of the cosine similarity. As expressed by (7), the cosine similarity is a measure for computing a similarity between a pair of vectors. Thus, in computing a similarity between objects using the cosine similarity, the features of an object are represented by the elements of a vector, and the weight of each feature is captured by the value of each element. For example, in the case of computing a similarity between documents using cosine similarity, each document is represented by a vector, the words in the given universe of documents are denoted by the elements of the vector, and the frequency of each word in the document is indicated by the value of the corresponding element. Then, the similarity between two documents is computed by the similarity between the two corresponding vectors [9].

$$
s(x, y) = \frac{x \cdot y}{\|x\| \, \|y\|} = \frac{\sum_{i=1}^{n} (x_i \times y_i)}{\sqrt{\sum_{i=1}^{n} (x_i)^2} \times \sqrt{\sum_{i=1}^{n} (y_i)^2}} \tag{7}
$$
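As a small illustration of (7), the following sketch computes the cosine similarity of two vectors in plain Python. The function name, the word-count vectors, and the convention of returning 0 for an all-zero vector are assumptions made for this example.

```python
import math


def cosine_similarity(x, y):
    """Cosine similarity of two equal-length numeric vectors, as in (7)."""
    if len(x) != len(y):
        raise ValueError("vectors must have the same length")
    dot = sum(xi * yi for xi, yi in zip(x, y))
    norm_x = math.sqrt(sum(xi * xi for xi in x))
    norm_y = math.sqrt(sum(yi * yi for yi in y))
    if norm_x == 0.0 or norm_y == 0.0:
        return 0.0  # convention chosen here for all-zero vectors
    return dot / (norm_x * norm_y)


# Hypothetical word-count vectors over a four-word vocabulary.
doc1 = [2, 1, 0, 1]
doc2 = [4, 2, 0, 2]  # same direction as doc1
doc3 = [0, 0, 3, 0]  # shares no word with doc1

print(cosine_similarity(doc1, doc2))
print(cosine_similarity(doc1, doc3))
```

Only the values at the same index are multiplied, so the comparison is made feature by feature rather than over every pair of elements.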

In our approach, each object is represented as a vector, all other objects in the given universe as the elements of the vector, and some "weight value" for each object as the value of the corresponding element in the vector. In the cosine similarity used by the text-based measures discussed above, words of higher frequency get larger weight values because they are treated as features that better characterize a document. Analogously, we need to define a measure that quantifies how well the target object is characterized by each object in the universe. For this measure, we propose to use the degree of how close the two objects are located from the topological point of view.

Sun et al. [30] proposed a method for computing the probability of reaching one object from another, as the relevance between the two objects, using the “Random Walk with Restart (RWR)” strategy. We adopt this strategy and use the reachability of an object from the target object as the weight value for its corresponding element of the vector representing the target object. Reachability becomes higher as the distance between two objects becomes shorter and also when more paths exist between the two. Computing reachability using RWR is expressed by (8), where $P_A$ represents the column-normalized adjacency matrix capturing the connectedness among objects, $u_a$ a vector holding the reachability of each node starting from $a$, $q_a$ a restart vector having the value 1 only for the starting node $a$ and 0 for the rest, and $c$ the restart probability:

$$
u_a = (1 - c) P_A u_a + c q_a
\tag{8}
$$
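The fixed-point equation (8) can be solved by simple iteration. The sketch below is one possible reading, not the authors' implementation: it assumes a four-node example graph with links a→b, a→c, b→d, and c→d, assumes that a walker at node j moves to each node it links to with equal probability (so column j of the transition matrix sums to 1), and uses hypothetical names (`column_normalize`, `rwr`, `tol`, `max_iter`). Because normalization and scaling conventions may differ, the values need not match those printed in the paper's figures.

```python
def column_normalize(adj):
    """Build P with P[i][j] = adj[j][i] / (number of links leaving j)."""
    n = len(adj)
    P = [[0.0] * n for _ in range(n)]
    for j in range(n):
        out_degree = sum(adj[j])
        if out_degree == 0:
            continue  # node without outgoing links: its column stays zero
        for i in range(n):
            P[i][j] = adj[j][i] / out_degree
    return P


def rwr(P, start, c=0.15, tol=1e-12, max_iter=1000):
    """Iterate u <- (1 - c) * P u + c * q until the largest change is below tol."""
    n = len(P)
    q = [1.0 if i == start else 0.0 for i in range(n)]
    u = q[:]
    for _ in range(max_iter):
        new_u = [(1.0 - c) * sum(P[i][j] * u[j] for j in range(n)) + c * q[i]
                 for i in range(n)]
        if max(abs(a - b) for a, b in zip(new_u, u)) < tol:
            return new_u
        u = new_u
    return u


# adj[j][i] = 1 if node j links to node i; node order is a, b, c, d.
adj = [[0, 1, 1, 0],
       [0, 0, 0, 1],
       [0, 0, 0, 1],
       [0, 0, 0, 0]]

u_a = rwr(column_normalize(adj), start=0)
# u_a is approximately [0.15, 0.06375, 0.06375, 0.108375]
```

Node d receives reachability from a through two paths (via b and via c), which is why its value exceeds that of b or c even though it is farther from a.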

Our approach does not suffer from the problem with the pair-wise model because it computes the similarity between two vectors by multiplying only the values of the corresponding elements in the two vectors, not by trying every possible element pair between the two vectors. Moreover, the approach generates the vectors by considering all the objects linked directly or indirectly to the two target objects. This mechanism differs from the pair-wise and level-wise models and does not suffer from the problems discussed in the previous section.

In this paper, we develop two methods to implement our approach. The two methods differ only slightly, in the way they compute the weights in the vector. One method computes the weights by first computing two values of


Figure 4: The procedure of generating a vector using the weightedSum method (panels: Outlink, Inlink, Outlink (normalized), and Inlink (normalized) adjacency matrices of an example graph with nodes $a$, $b$, $c$, and $d$, followed by the weighted sum of the two resulting vectors).

reachability, one using only “inlinks” and the other using only “outlinks,” separately, and then combining them as a weighted sum. This method provides flexibility for a problem domain by allowing different weights for reachability using inlinks versus outlinks. The method, however, cannot compute reachability properly for objects located close to the target object, because links in both directions are not used together. In Figure 1, for example, nodes $a$ and $b$ are located close to each other, but the reachability would become 0 when computed using only outlinks (or using only inlinks) from $a$ or $b$. Another problem is the difficulty of determining the appropriate weights for a vector generated purely using the inlinks or purely using the outlinks. Thus, we develop the second method, which computes reachability by ignoring the directions of inlinks and outlinks and converting them to undirected links before computing reachability. This method is advantageous in that it can compute appropriate reachability for every object. In the rest of this paper, we call these two methods the “weightedSum” method and the “undirected” method, respectively. In the weightedSum method in particular, the proper balance between the weights for inlinks and outlinks needs to be found domain by domain through experimentation.

4.2. Procedure of the Proposed Methods. Basically, both of the proposed methods (1) construct vectors by computing reachability from the target object to all other objects and (2) compute the similarity between vectors by using the cosine similarity. The two methods differ only in the process of generating vectors, as illustrated by Figure 4 (for the weightedSum method) and Figure 5 (for the undirected method). The weightedSum method generates vectors in the following manner. First, two adjacency matrices are built, one with inlinks and one with outlinks, and then column-normalized such that the sum of all values in a column becomes 1. Second, a vector is generated from each normalized matrix, and the weights are assigned to the elements in the vector by computing the reachability from the target object to every object using the RWR strategy. Finally, the two vectors are combined into one as a weighted sum. Similarly, the undirected method generates a vector by constructing a normalized adjacency matrix, this time ignoring the link directions, and assigning the reachability values to the elements of the vector in the same manner.
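The two vector-generation procedures can be sketched as follows. This is an illustration under stated assumptions rather than the authors' code: the example graph (links a→b, a→c, b→d, c→d), the equal-probability transition rule, the fixed iteration count, and the names `rwr`, `weighted_sum_vector`, `undirected_vector`, and `w_in` are all hypothetical.

```python
def rwr(adj, start, c=0.15, iters=200):
    """Reachability from `start` via u <- (1 - c) * P u + c * q (fixed iteration count)."""
    n = len(adj)
    degree = [sum(row) for row in adj]
    u = [1.0 if i == start else 0.0 for i in range(n)]
    for _ in range(iters):
        u = [(1.0 - c) * sum(adj[j][i] / degree[j] * u[j]
                             for j in range(n) if degree[j] > 0)
             + (c if i == start else 0.0)
             for i in range(n)]
    return u


def transpose(adj):
    """Reverse every link, so that a walk follows links backward."""
    return [list(column) for column in zip(*adj)]


def to_undirected(adj):
    """Keep a link between i and j if one exists in either direction."""
    n = len(adj)
    return [[1 if adj[i][j] or adj[j][i] else 0 for j in range(n)]
            for i in range(n)]


def weighted_sum_vector(adj, start, w_in=0.5, c=0.15):
    """weightedSum: combine backward (inlink) and forward (outlink) reachability."""
    u_out = rwr(adj, start, c)
    u_in = rwr(transpose(adj), start, c)
    return [w_in * x_in + (1.0 - w_in) * x_out
            for x_in, x_out in zip(u_in, u_out)]


def undirected_vector(adj, start, c=0.15):
    """undirected: a single walk that ignores link directions."""
    return rwr(to_undirected(adj), start, c)


# adj[j][i] = 1 if node j links to node i; node order is a, b, c, d.
adj = [[0, 1, 1, 0],
       [0, 0, 0, 1],
       [0, 0, 0, 1],
       [0, 0, 0, 0]]
b, c_node = 1, 2
ws_b = weighted_sum_vector(adj, b)
un_b = undirected_vector(adj, b)
```

In this toy graph, node c can be reached from b only by mixing directions (for example, forward to d and then backward to c), so the weightedSum vector of b assigns c a weight of 0 while the undirected vector assigns it a positive weight. The similarity between two objects is then the cosine similarity (7) of their vectors.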

4.3. Complexity Analysis. The complexity of the proposed methods for computing similarities for all pairs of given objects can be analyzed as follows. First, the time complexity of generating a vector for one object using the RWR strategy is $O(ke)$ [31], where $e$ represents the number of links and $k$ the number of iterations of the matrix calculation needed to obtain converged values of reachability. Thus, the time complexity of generating vectors for all objects is $O(kne)$, where $n$ represents the number of objects. It is generally known that a constant number of such iterations is sufficient to obtain converged values of reachability [6, 15]. Consequently, the time complexity is reduced to $O(ne)$. On the other hand, the time complexity of the similarity calculation process is $O(n^3)$, since each of the $O(n^2)$ object pairs requires a cosine computation over vectors of $n$ elements. Combining them, the overall time complexity of both methods becomes $O(n^3) + O(ne) = O(n^3)$. In practice, the weightedSum method requires about twice the time of the undirected method, because the former computes reachability separately for each direction of the links.
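The cubic cost of the similarity step is visible in a direct implementation: two nested loops over objects and an inner sum over the $n$ vector elements. The sketch below is illustrative only; the function name `all_pairs_cosine` and the toy vectors are assumptions.

```python
import math


def all_pairs_cosine(vectors):
    """Return the n x n matrix of cosine similarities between the given vectors."""
    n = len(vectors)
    norms = [math.sqrt(sum(v * v for v in row)) for row in vectors]
    sims = [[0.0] * n for _ in range(n)]
    for i in range(n):                 # every object ...
        for j in range(i, n):          # ... paired with every other object ...
            dot = sum(a * b for a, b in zip(vectors[i], vectors[j]))  # ... over all elements
            denom = norms[i] * norms[j]
            sims[i][j] = sims[j][i] = dot / denom if denom else 0.0
    return sims


vectors = [[1.0, 0.0, 0.0],
           [0.0, 1.0, 0.0],
           [1.0, 1.0, 0.0]]
sims = all_pairs_cosine(vectors)
```

Exploiting symmetry halves the work but does not change the order of growth when each vector has as many elements as there are objects.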

As for space complexity, $O(e)$ space is required for storing the matrix that represents the relationships between objects, $O(n^2)$ for storing the vectors, and $O(n^2)$ for storing the similarity measures between objects. Combining them, the overall space complexity becomes $O(n^2)$. Again, in practice, the


Figure 5: The procedure of generating a vector using the undirected method (panels: Outlink, Inlink, Undirected-link, and Undirected-link (normalized) adjacency matrices of an example graph with nodes $a$, $b$, $c$, and $d$).

weightedSum method requires about twice as much space as the other.

In comparison, the time complexity of the three existing methods, namely, rvs-SimRank, SimRank, and P-Rank, is known to be $O(n^4)$ and their space complexity $O(n^2)$ [6]. We believe the existing methods require more computation than our proposed methods because they adopt the pair-wise computation model in principle.

Moreover, the existing methods cannot compute the similarity between two given objects independently, because they need to know the similarities among all objects (whether connected directly or indirectly to the target objects) in order to compute the similarity between two target objects [20]. Our approach, however, does not require the similarities among all objects and hence can compute the similarity between any pair of objects independently, making it possible to parallelize the algorithm. Since our approach need not refer to the similarities among other objects, vectors for individual objects can be generated independently, and the similarity between any pair of objects can be computed separately, that is, in parallel. The time complexity of this parallelized version would become $O(n^3/m)$ if $m$ processors are utilized in parallel.

When we need to compute the similarity between only a particular pair of objects on-line, rather than computing the similarities among all objects [20], our methods can be made even more efficient by computing reachability on-line through a precomputed inverse matrix, as suggested by (9). That is, the vector for a given object can be generated by multiplying the inverse matrix of $Q$ by $q_a$ [30, 31]. The time complexity of the off-line computation of the inverse matrix is $O(n^{2.376})$ [32], and the complexity of the on-line computation is $O(n)$. Such an off-line approach tends to require a relatively long time and more space to handle the inverse matrix. Reference [31] suggests an approach using low-rank approximation in order to keep a balance between the off-line and on-line computation. This approach has advantages in both time and space complexity, suggesting an improvement for on-line execution of our proposed approach. In conclusion, the time complexity of our methods is lower than that of existing methods, and the space complexity is equal to that of existing methods in any case.

$$
\begin{aligned}
u_a &= (1 - c) P_A u_a + c q_a \\
    &= c \left( I - (1 - c) P_A \right)^{-1} q_a \\
    &= c Q^{-1} q_a
\end{aligned}
\tag{9}
$$
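The split between off-line and on-line work in (9) can be sketched as follows. This is a plain-Python illustration under stated assumptions, not the method of [31]: it takes $Q = I - (1 - c) P_A$, inverts it with a textbook Gauss-Jordan routine (cubic time, not the $O(n^{2.376})$ algorithm cited above), and uses the hypothetical names `invert`, `precompute_q_inverse`, and `reachability_online`. The transition matrix is the column-normalized undirected version of an assumed four-node graph with links a-b, a-c, b-d, and c-d.

```python
def invert(matrix):
    """Gauss-Jordan inversion with partial pivoting (plain Python, cubic time)."""
    n = len(matrix)
    aug = [[float(v) for v in row] + [1.0 if i == j else 0.0 for j in range(n)]
           for i, row in enumerate(matrix)]
    for col in range(n):
        pivot = max(range(col, n), key=lambda r: abs(aug[r][col]))
        if abs(aug[pivot][col]) < 1e-12:
            raise ValueError("matrix is singular to working precision")
        aug[col], aug[pivot] = aug[pivot], aug[col]
        scale = aug[col][col]
        aug[col] = [v / scale for v in aug[col]]
        for r in range(n):
            if r != col:
                factor = aug[r][col]
                aug[r] = [vr - factor * vc for vr, vc in zip(aug[r], aug[col])]
    return [row[n:] for row in aug]


def precompute_q_inverse(P, c):
    """Off-line step: build Q = I - (1 - c) * P and invert it once."""
    n = len(P)
    Q = [[(1.0 if i == j else 0.0) - (1.0 - c) * P[i][j] for j in range(n)]
         for i in range(n)]
    return invert(Q)


def reachability_online(Q_inv, start, c):
    """On-line step: q_a is one-hot, so c * Q^{-1} q_a is c times column `start`."""
    return [c * row[start] for row in Q_inv]


# Column-normalized undirected adjacency matrix; node order is a, b, c, d.
P = [[0.0, 0.5, 0.5, 0.0],
     [0.5, 0.0, 0.0, 0.5],
     [0.5, 0.0, 0.0, 0.5],
     [0.0, 0.5, 0.5, 0.0]]
c = 0.15
Q_inv = precompute_q_inverse(P, c)
u_a = reachability_online(Q_inv, 0, c)

# Largest violation of the fixed-point equation u_a = (1 - c) P u_a + c q_a.
residual = max(
    abs(u_a[i] - ((1 - c) * sum(P[i][j] * u_a[j] for j in range(4))
                  + (c if i == 0 else 0.0)))
    for i in range(4)
)
```

Because $q_a$ has a single nonzero entry, the on-line step only reads one column of the stored inverse, which is the source of the $O(n)$ on-line cost.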

4.4. Discussions. The proposed methods do not suffer from the problems of the pair-wise and level-wise models discussed in Section 3. Let us compute the similarities between $a$ and $b$ and between $a'$ and $b'$ in Figure 2 using our approach and check whether the results appeal to our intuition. We assume $\lambda$ and $C$ to be 0.5 and 0.7, respectively. Table 2 shows the similarity results computed by the two proposed methods, together with those obtained by the three existing methods (already shown in Table 1), for comparison. Both of our methods produce the same result: the similarity between $a$ and $b$ and that between $a'$ and $b'$ are both 1, meaning that the two objects in each pair are regarded as the same. These results appeal to our intuition. In comparison, the results obtained by the existing methods indicate that the two objects are not the same. Since our approach does not use the pair-wise model but performs the computations among identical features, the proposed methods produce results more appealing to our intuition than the existing methods.

The proposed methods do not suffer from the problem of the level-wise model either. Let us compute the similarity between $a$ and $b$ in Figure 3 using our approach and check


Table 2: Similarities between $a$ and $b$ and between $a'$ and $b'$ in Figure 2, using the proposed methods and the existing methods.

| | rvs-SimRank | SimRank | P-Rank | WeightedSum | Undirected |
|---|---|---|---|---|---|
| $s(a, b)$ | 0.396 | 0 | 0.423 | 1 | 1 |
| $s(a', b')$ | 0.321 | 0 | 0.408 | 1 | 1 |

whether the results appeal to our intuition. Assuming $\lambda$ and $C$ to be 0.5 and 0.7, respectively, our approach computes the similarity between $a$ and $b$ to be 0.28 using the weightedSum method and 0.6 using the undirected method. For comparison, the existing methods would compute the similarity to be 0, as discussed in Section 3. We cannot tell whether the results obtained by our approach are appropriate, because similarity is by nature a subjective value. Still, we can at least argue that our approach produces results that are more appealing to our intuition than those of the existing methods; that is, the similarity between $a$ and $b$ must not be 0, because they are related indirectly through some common objects. This difference comes from our strategy of considering all the objects indirectly connected to the target objects at once, rather than using the level-wise model.

5. Experiments

In this section, we verify the effectiveness of our approach through experimentation. We show and analyze quantitatively the experimental results obtained by applying the two proposed methods to practical application domains.

5.1. Experimentation Setup. We carried out experiments to verify the performance of the proposed methods with respect to scientific papers and Web documents, two exemplary types of data sets having link information. For the experiments with the scientific papers, we used a data set consisting of the papers downloaded from DBLP (http://www.informatik.uni-trier.de/~ley/db) and the reference information among the papers crawled from Libra (http://academic.research.microsoft.com). In total, 44,800 papers and 126,281 references were used. For the experiments with the Web documents, we used a data set consisting of 1,227,038 Web pages with 11,164,829 hyperlinks in total, taken from the TREC (http://trec.nist.gov/data/t11.web.html) 2002 data. The experiments were carried out on a platform with a quad-core 2.67 GHz CPU running the Windows 2008 Server OS.

In the experiments, we aimed to evaluate the performance of the two proposed methods (weightedSum and undirected) in comparison with the three existing methods (rvs-SimRank, SimRank, and P-Rank), with the input values of 0.8, 0.5, and 0.15 for $C$, $\lambda$, and the restart probability, respectively, as used frequently in the existing methods [6, 15, 30].

The experiments proceeded as follows. For the weightedSum method, we first assigned appropriate weights to the vectors generated with inlinks and outlinks by varying the weights and finding the ones that achieve the highest accuracy of the similarity computed by the weightedSum method. In the experiments, we tried the weight values of 0, 0.1, 0.3, 0.5, 0.7, 0.9, and 1 for inlinks (note that the in-link and out-link weights sum to 1). Second, we evaluated the accuracy of each method qualitatively by examining the 10 objects that the method computes as the most similar to an arbitrarily chosen target object. Finally, we measured the accuracy of each method quantitatively by comparing the obtained results with the true answers for each data set.

The measurement of the accuracy proceeded as follows. First, we chose one object in turn as the target object from the answer set. Then, we computed the recall [29] by extracting the $m$ objects (where $m$ is 10, 20, 30, 40, or 50) most similar to the target object according to each method. This process was repeated until every object in the set had been chosen as the target object. The average of all recall values obtained in this way was taken as the final accuracy value.
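The accuracy measurement described above can be sketched as an average recall at $m$. The sketch makes assumptions the text does not spell out: the target object is excluded from both the candidate list and the set of relevant objects, and ties in similarity keep their original order. The names `recall_at_m`, `average_recall`, and `toy_similarity`, as well as the toy data, are hypothetical.

```python
def recall_at_m(ranked, relevant, m):
    """Fraction of the relevant objects found among the first m ranked objects."""
    if not relevant:
        raise ValueError("relevant set must not be empty")
    return len(set(ranked[:m]) & set(relevant)) / len(relevant)


def average_recall(objects, answer_set, similarity, m):
    """Average recall at m, taking each member of `answer_set` as the target in turn."""
    recalls = []
    for target in answer_set:
        candidates = [o for o in objects if o != target]
        ranked = sorted(candidates,
                        key=lambda o: similarity(target, o),
                        reverse=True)
        relevant = set(answer_set) - {target}
        recalls.append(recall_at_m(ranked, relevant, m))
    return sum(recalls) / len(recalls)


def toy_similarity(x, y):
    """Stand-in for a link-based measure: closer integers are more similar."""
    return -abs(x - y)


objects = list(range(6))
answer_set = [0, 1, 2]
accuracy = average_recall(objects, answer_set, toy_similarity, m=2)
```

With this toy data, targets 0 and 1 each retrieve both remaining members of the answer set, while target 2 retrieves only one of two, so the average recall is 2.5/3.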

The answers for each data set were constructed in the following manner. For the experiments with scientific papers, we selected five well-known areas (i.e., clustering, sequential pattern mining, spatial databases, link mining, and graph pattern mining) in a data mining textbook [29] and obtained the papers referenced in each section of these areas. We supposed that the papers referenced in the same section are similar to one another. Thus, the papers in a section formed an answer set. Each answer set had 3 to 14 papers, and the total number of papers in all such answer sets was 106. For the experiments with Web documents, we used TREC 2002, which provides Web document sets related to specific keywords. We chose 9 Web document sets randomly from TREC 2002 and used them as the answer sets.

5.2. Domain of Scientific Papers. Figure 6 depicts the accuracy of the similarities among scientific papers obtained by the weightedSum method using various weight values for the vectors generated with inlinks and outlinks. The $x$-axis represents the number of “the most similar” papers selected by the method, and the $y$-axis the accuracy. Annotations such as “0.0 : 1.0” represent the weight balance between inlinks and outlinks. According to Figure 6, the accuracy of the weightedSum method tends to become higher as the weight for inlinks gets higher relative to the out-link weight. This is because the papers in the answer set are relatively famous ones frequently referenced by other papers. For this reason, we conducted the experiments using the “0.9 : 0.1” weight balance in order to obtain the best result.

Tables 3, 4, 5, 6, and 7 present the lists of the top 10 papers found to be the most similar to a target paper, which is [33], using the three existing methods and the two proposed methods, respectively. (Note that the paper [33] is concerned with clustering in data mining.) In the tables, the papers that are not similar to [33] are shown in italics. In these papers, the authors mainly deal with issues of outlier detection or mining


Figure 6: Accuracy of the weightedSum method with varying weights ($x$-axis: $m$; $y$-axis: accuracy; one curve per inlink : outlink weight balance: 0.0 : 1.0, 0.1 : 0.9, 0.3 : 0.7, 0.5 : 0.5, 0.7 : 0.3, 0.9 : 0.1, and 1.0 : 0.0).

frequent patterns, indicating that the existing methods reached a wrong conclusion for these papers. In comparison, the results of the weightedSum method (shown in Table 6) include only one wrong entry (on frequent pattern mining) in the top 10 list, and the results of the undirected method include no wrong papers (i.e., all are related to clustering). These results imply that the undirected method performs better than the three existing methods and even better than the weightedSum method. We repeated this experiment many times with target papers other than [33] and obtained results similar to Tables 3–7.

Figure 7 compares the accuracy of the similarities computed by the three existing methods and the two proposed methods. As in Figure 6, the $x$-axis represents the number of “the most similar” papers selected by each method, and the $y$-axis the accuracy. The weightedSum method improved accuracy by 9% on average and up to 13% compared with SimRank. Also, the undirected method improved accuracy by 16% on average and up to 20% compared with SimRank. In conclusion, the two proposed methods compute the similarity more accurately than the existing methods. In particular, the undirected method performs the best.

5.3. Domain of Web Documents. Figure 8 shows the accuracy of the similarities between Web pages obtained by the weightedSum method using various weight values between inlinks and outlinks. The $x$-axis represents the number of “the most similar” Web pages selected by the method, and the $y$-axis the accuracy. Annotations such as “0.0 : 1.0” represent the weight balance between inlinks and outlinks. Again, the accuracy of the weightedSum method tends to become higher as the weight for inlinks gets higher, as in the case of scientific papers. This is because the answer set contains many Web pages with high authority.

Figure 7: Accuracy of the similarity measures on the scientific papers data ($x$-axis: $m$; $y$-axis: accuracy; curves: SimRank, rvs-SimRank, P-Rank, WeightedSum, and Undirected).

Figure 8: Accuracy of the weightedSum method with varying weights ($x$-axis: $m$; $y$-axis: accuracy; one curve per inlink : outlink weight balance: 0.0 : 1.0, 0.1 : 0.9, 0.3 : 0.7, 0.5 : 0.5, 0.7 : 0.3, 0.9 : 0.1, and 1.0 : 0.0).

For this reason, we conducted the experiments using the “0.9 : 0.1” weight balance in order to obtain the best result.

Figure 9 compares the accuracy of the similarities computed by the three existing methods and the two proposed methods. Again, the $x$-axis represents the number of “the most similar” documents selected by each method, and the $y$-axis the accuracy. The weightedSum method improved accuracy by 20% on average and up to 24% compared with SimRank. Also, the undirected method improved accuracy by 34% on average and up to 43% compared with SimRank. In


Table 3: Top 10 papers similar to [33] using SimRank.

| First author | Title | Conference/journal | Year |
|---|---|---|---|
| Guha | CURE: An Efficient Clustering Algorithm for Large Databases | ACM SIGMOD | 1998 |
| Sheikholeslami | WaveCluster: A Multi-Resolution Clustering Approach for Very Large Spatial Databases | VLDB | 1998 |
| Ester | Knowledge Discovery in Large Spatial Databases: Focusing Techniques for Efficient Class Identification | SSD | 1995 |
| Hinneburg | An Efficient Approach to Clustering in Large Multimedia Databases with Noise | AAAI | 1998 |
| Ng | Efficient and Effective Clustering Methods for Spatial Data Mining | VLDB | 1994 |
| Bradley | Scaling Clustering Algorithms to Large Databases | AAAI | 1998 |
| Wang | STING: A Statistical Information Grid Approach to Spatial Data Mining | VLDB | 1997 |
| L. O’Callaghan | Streaming-Data Algorithms for High-Quality Clustering | IEEE ICDE | 2002 |
| Sander | A Density-Based Algorithm for Discovering Clusters in Large Spatial Databases with Noise | DMKD | 1998 |
| Arning | A Linear Method for Deviation Detection in Large Databases | ACM KDD | 1996 |

Table 4: Top 10 papers similar to [33] using rvs-SimRank.

| First author | Title | Conference/journal | Year |
|---|---|---|---|
| Knorr | A Unified Notion of Outliers: Properties and Computation | ACM KDD | 1997 |
| Guha | CURE: An Efficient Clustering Algorithm for Large Databases | ACM SIGMOD | 1998 |
| Sander | A Density-Based Algorithm for Discovering Clusters in Large Spatial Databases with Noise | DMKD | 1998 |
| Bradley | Scaling Clustering Algorithms to Large Databases | AAAI | 1998 |
| Guha | ROCK: A Robust Clustering Algorithm for Categorical Attributes | IEEE ICDE | 1999 |
| Sheikholeslami | WaveCluster: A Multi-Resolution Clustering Approach for Very Large Spatial Databases | VLDB | 1998 |
| Arning | A Linear Method for Deviation Detection in Large Databases | ACM KDD | 1996 |
| Burdick | MAFIA: A Maximal Frequent Itemset Algorithm for Transactional Databases | IEEE TKDE | 2005 |
| Mannila | Efficient Algorithms for Discovering Association Rules | AAAI | 1994 |
| Kamber | Metarule-Guided Mining of Multi-Dimensional Association Rules Using Data Cubes | ACM KDD | 1997 |

Table 5: Top 10 papers similar to [33] using P-Rank.

| First author | Title | Conference/journal | Year |
|---|---|---|---|
| Knorr | A Unified Notion of Outliers: Properties and Computation | ACM KDD | 1997 |
| Guha | CURE: An Efficient Clustering Algorithm for Large Databases | ACM SIGMOD | 1998 |
| Sander | A Density-Based Algorithm for Discovering Clusters in Large Spatial Databases with Noise | DMKD | 1998 |
| Bradley | Scaling Clustering Algorithms to Large Databases | AAAI | 1998 |
| Sheikholeslami | WaveCluster: A Multi-Resolution Clustering Approach for Very Large Spatial Databases | VLDB | 1998 |
| Guha | ROCK: A Robust Clustering Algorithm for Categorical Attributes | IEEE ICDE | 1999 |
| Mannila | Efficient Algorithms for Discovering Association Rules | AAAI | 1994 |
| Arning | A Linear Method for Deviation Detection in Large Databases | ACM KDD | 1996 |
| Silberschatz | What Makes Patterns Interesting in Knowledge Discovery Systems | IEEE TKDE | 1996 |
| Agrawal | Mining Association Rules between Sets of Items in Large Databases | ACM SIGMOD | 1993 |


Table 6: Top 10 papers similar to [33] using the weightedSum method.

| First author | Title | Conference/journal | Year |
|---|---|---|---|
| Ester | Knowledge Discovery in Large Spatial Databases: Focusing Techniques for Efficient Class Identification | SSD | 1995 |
| Ng | Efficient and Effective Clustering Methods for Spatial Data Mining | VLDB | 1994 |
| Sheikholeslami | WaveCluster: A Multi-Resolution Clustering Approach for Very Large Spatial Databases | VLDB | 1998 |
| Guha | ROCK: A Robust Clustering Algorithm for Categorical Attributes | IEEE ICDE | 1999 |
| Guha | CURE: An Efficient Clustering Algorithm for Large Databases | ACM SIGMOD | 1998 |
| Wang | STING: A Statistical Information Grid Approach to Spatial Data Mining | VLDB | 1997 |
| Hinneburg | An Efficient Approach to Clustering in Large Multimedia Databases with Noise | AAAI | 1998 |
| Aggarwal | Fast Algorithms for Projected Clustering | ACM SIGMOD | 1999 |
| Agrawal | Automatic Subspace Clustering of High Dimensional Data for Data Mining Applications | ACM SIGMOD | 1998 |
| Sander | A Density-Based Algorithm for Discovering Clusters in Large Spatial Databases with Noise | DMKD | 1998 |

Table 7: Top 10 papers similar to [33] using the undirected method.

| First author | Title | Conference/journal | Year |
|---|---|---|---|
| Guha | CURE: An Efficient Clustering Algorithm for Large Databases | ACM SIGMOD | 1998 |
| Ng | Efficient and Effective Clustering Methods for Spatial Data Mining | VLDB | 1994 |
| Sander | A Density-Based Algorithm for Discovering Clusters in Large Spatial Databases with Noise | DMKD | 1998 |
| Bradley | Scaling Clustering Algorithms to Large Databases | AAAI | 1998 |
| Sheikholeslami | WaveCluster: A Multi-Resolution Clustering Approach for Very Large Spatial Databases | VLDB | 1998 |
| Hinneburg | An Efficient Approach to Clustering in Large Multimedia Databases with Noise | AAAI | 1998 |
| Wang | STING: A Statistical Information Grid Approach to Spatial Data Mining | VLDB | 1997 |
| Agrawal | Automatic Subspace Clustering of High Dimensional Data for Data Mining Applications | ACM SIGMOD | 1998 |
| Aggarwal | Fast Algorithms for Projected Clustering | ACM SIGMOD | 1999 |
| Ankerst | OPTICS: Ordering Points To Identify the Clustering Structure | ACM SIGMOD | 1999 |

conclusion, the two proposed methods compute the similarity more accurately than the existing methods. In particular, the undirected method outperforms the weightedSum method and the three existing methods by a large degree.

6. Conclusions

This paper presented new link-based similarity methods that can compute the similarity between objects more accurately by using the link information pertaining to the objects. Noticing that most existing link-based similarity methods use the pair-wise and level-wise models, we analyzed the problems with these models and proposed a new approach that does not suffer from these problems. In our proposed approach, each object is represented by a vector, all objects in the given universe as the elements of the vector, and a weight value for each object as the value of the corresponding element in the vector. As for this weight value, we proposed to utilize the notion of reachability between objects, computed using the “Random Walk with Restart” strategy. Then, we defined the similarity between two objects as the cosine similarity between the two vectors representing the two objects. The proposed approach was then refined into two methods,


Figure 9: Accuracy of the similarity measures on the Web documents data ($x$-axis: $m$; $y$-axis: accuracy; curves: SimRank, rvs-SimRank, P-Rank, WeightedSum, and Undirected).

the weightedSum and the undirected methods, differentiated by their strategy for handling the information on link directions. Examples showed that the two methods do not suffer from the problems of the pair-wise and level-wise models. In our experiments with the scientific papers and Web documents data sets, the results indicated that both of the proposed methods generally outperform the existing methods significantly.

Conflict of Interests

The authors declare that there is no conflict of interests regarding the publication of this paper.

Acknowledgments

This work was supported by (1) Business for Cooperative R&D between Industry, Academy, and Research Institute funded by the Korea Small and Medium Business Administration in 2013 (Grant no. C0006278), (2) the MSIP (Ministry of Science, ICT and Future Planning), Korea, under the ITRC (Information Technology Research Center) Support Program (NIPA-2013-H0301-13-4009) supervised by the NIPA (National IT Industry Promotion Agency), (3) the Ministry of Culture, Sports and Tourism (MCST) and the Korea Copyright Commission in 2013, and (4) the National Research Foundation of Korea (NRF) Grant funded by the Korea government (MEST) (no. 2011-0029181).

References

[1] J. Shen, Y. Zhu, H. Zhang, C. Chen, R. Sun, and F. Xu, “A content-based algorithm for blog ranking,” in Proceedings of the International Conference on Internet Computing in Science and Engineering (ICICSE ’08), pp. 19–22, January 2008.

[2] A. Broder, S. Glassman, M. Manasse, and G. Zweig, “Syntactic clustering of the web,” in Proceedings of the 6th International World Wide Web Conference (WWW ’97), pp. 391–404, 1997.

[3] X. Su and T. Khoshgoftaar, “A survey of collaborative filtering techniques,” Advances in Artificial Intelligence, vol. 2009, Article ID 421425, 19 pages, 2009.

[4] S.-H. Yoon, S.-W. Kim, and S. Park, “A link-based similarity measure for scientific literature,” in Proceedings of the 19th International World Wide Web Conference (WWW ’10), pp. 1213–1214, April 2010.

[5] M. R. Hamedani, S. Lee, and S. Kim, “On exploiting content and citations together to compute similarity of scientific papers,” in Proceedings of the ACM International Conference on Information and Knowledge Management (CIKM ’13), San Francisco, Calif, USA, 2013.

[6] P. Zhao, J. Han, and Y. Sun, “P-Rank: a comprehensive structural similarity measure over information networks,” in Proceedings of the ACM 18th International Conference on Information and Knowledge Management (CIKM ’09), pp. 553–562, November 2009.

[7] D.-H. Bae, S.-M. Hwang, S.-W. Kim, and C. Faloutsos, “Constructing seminal paper genealogy,” in Proceedings of the 20th ACM Conference on Information and Knowledge Management (CIKM ’11), pp. 2101–2104, October 2011.

[8] F. Menczer, “Combining link and content analysis to estimate semantic similarity,” in Proceedings of the 13th International World Wide Web Conference (WWW ’04), pp. 452–453, May 2004.

[9] R. Baeza-Yates and B. Ribeiro-Neto, Modern Information Retrieval, Addison Wesley, 1999.

[10] S.-H. Yoon, S.-W. Kim, J.-S. Kim, and W.-S. Hwang, “On computing text-based similarity in scientific literature,” in Proceedings of the 20th International Conference Companion on World Wide Web (WWW ’11), pp. 169–170, April 2011.

[11] G. Bisson and F. Hussain, “χ-Sim: a new similarity measure for the co-clustering task,” in Proceedings of the 7th International Conference on Machine Learning and Applications (ICMLA ’08), pp. 211–217, December 2008.

[12] H. Small, “Co-citation in the scientific literature: a new measure of the relationship between two documents,” Journal of the American Society for Information Science, vol. 24, no. 4, pp. 265–269, 1973.

[13] M. Kessler, “Bibliographic coupling between scientific papers,” Journal of the American Documentation, vol. 14, no. 1, pp. 10–25, 1963.

[14] R. Amsler, “Application of citation-based automatic classification,” Technical Report, The University of Texas at Austin, Linguistics Research Center, 1972.

[15] G. Jeh and J. Widom, “SimRank: a measure of structural-context similarity,” in Proceedings of the 8th ACM SIGKDD International Conference on Knowledge Discovery and Data Mining (KDD ’02), pp. 538–543, July 2002.

[16] U. Shardanand and P. Maes, “Social information filtering: algorithms for automating ‘word of mouth’,” in Proceedings of the SIGCHI Conference on Human Factors in Computing Systems (CHI ’95), pp. 210–217, ACM Press, Denver, Colorado, 1995.

[17] I. Antonellis, H. Garcia-Molina, and C. Chang, “Simrank++: query rewriting through link analysis of the click graph,” in Proceedings of the 34th International Conference on Very Large Data Bases (VLDB ’08), pp. 408–421, Auckland, New Zealand, August 2008.


[18] D Fogaras andB Racz ldquoScaling link-based similarity searchrdquo inProceedings of the 14th International Conference on World WideWeb (WWWrsquo05) pp 641ndash650 2005

[19] C Li J Han G He et al ldquoFast computation of SimRankfor static and dynamic information networksrdquo in Proceedingsof the 13th International Conference on Extending DatabaseTechnology Advances in Database Technology (EDBT rsquo10) pp465ndash476 March 2010

[20] P Li H Liu J Xu Yu J He and X Du ldquoFast single-pairSimRank computationrdquo in Proceedings of the SIAM Interna-tional Conference onDataMining (SDM rsquo10) pp 571ndash582 SIAMColumbus Ohio USA April 2010

[21] X Yin J Han and P S Yu ldquoLinkClus efficient clusteringvia heterogeneous semantic linksrdquo in Proceedings of the 32ndInternational Conference on Very Large Data Bases (VLDB rsquo06)pp 427ndash438 2006

[22] D Lizorkin P VelikhovM Grinev andD Turdakov ldquoAccuracyestimate and optimization techniques for SimRank computa-tionrdquo in Proceedings of the Vary Large Data Bases Endowmentvol 1 pp 422ndash433 2008

[23] Y Cai G Cong X Jia et al ldquoEfficient algorithm for computinglink-based similarity in real world networksrdquo in Proceedings ofthe 9th IEEE International Conference on Data Mining (ICDMrsquo09) pp 734ndash739 December 2009

[24] G He H Feng C Li and H Chen ldquoParallel SimRankcomputation on large graphs with iterative aggregationrdquo inProceedings of the 16th ACM SIGKDD International Conferenceon Knowledge Discovery and Data Mining (KDD rsquo10) pp 543ndash552 July 2010

[25] D Lizorkin P VelikhovM Grinev andD Turdakov ldquoAccuracyestimate and optimization techniques for SimRank computa-tionrdquo The Intrsquol Journal on Very Large Data Bases vol 19 no 1pp 45ndash66 2010

[26] Z Lin M R Lyu and I King ldquoMatchSim a novel neighbor-based similaritymeasure withmaximumneighborhoodmatch-ingrdquo in Proceedings of the ACM 18th International Conference onInformation and Knowledge Management (CIKM rsquo09) pp 1613ndash1616 November 2009

[27] R R Larson ldquoBibliometrics of the world wide web anexploratory analysis of the intellectual structure of cyberspacerdquoProceedings of the Annual Meeting of the American Society forInformation Science vol 33 pp 71ndash78 1996

[28] J Pitkow and P Pirolli ldquoLife death and lawfulness on the elec-tronic frontierrdquo in Proceedings of the Conference on Human Fac-tors in Computing Systems (CHI rsquo97) pp 383ndash390 March 1997

[29] J Han and M Kamber Data Mining Concepts and TechniquesMorgan Kaufmann 2006

[30] J Sun H Qu D Chakrabarti and C Faloutsos ldquoNeighbor-hood formation and anomaly detection in bipartite graphsrdquo inProceedings of the 5th IEEE International Conference on DataMining (ICDM rsquo05) pp 418ndash425 November 2005

[31] H Tong C Faloutsos and J-Y Pan ldquoFast random walkwith restart and its applicationsrdquo in Proceedings of the 6thInternational Conference on Data Mining (ICDM rsquo06) pp 613ndash622 December 2006

[32] S Robinson ldquoToward an optimal algorithm for matrix multi-plicationrdquo News Journal of the Society for Industrial and AppliedMathematics vol 38 no 9 2005

[33] T. Zhang, R. Ramakrishnan, and M. Livny, “BIRCH: an efficient data clustering method for very large databases,” in Proceedings of the ACM SIGMOD International Conference on Management of Data (SIGMOD ’96), pp. 103–114, 1996.



6 The Scientific World Journal

Figure 4: The procedure of generating a vector using the weightedSum method. (Panels: example graph with nodes a, b, c, and d; Inlink and Outlink adjacency matrices; Inlink (normalized) and Outlink (normalized) matrices; the two resulting vectors combined by a weighted sum.)

reachability, one using only “inlinks” and the other using only “outlinks” separately, and then combining them together as a weighted sum. This method provides flexibility to a problem domain by allowing different weights for reachability using inlinks versus outlinks. The method, however, cannot compute reachability properly for objects located close to the target object, because links in both directions are not used together. In Figure 1, for example, nodes $a$ and $b$ are located close to each other, but the reachability would become 0 when computed using only outlinks (or using only inlinks) from $a$ or $b$. Another problem is the difficulty of determining the appropriate weights for the vector generated purely using inlinks and the one generated purely using outlinks. Thus, we develop the second method, which computes reachability by ignoring the directions of inlinks and outlinks and converting them to undirected links before computing reachability. This method is advantageous in that it can compute appropriate reachability for every object. In the rest of this paper, we will call these two methods the “weightedSum” method and the “undirected” method, respectively. In the weightedSum method in particular, the proper balance between the weights for inlinks and outlinks needs to be found domain by domain through experimentation.

4.2. Procedure of the Proposed Methods. Basically, both of the proposed methods (1) construct vectors by computing reachability from the target object to all other objects and (2) compute the similarity between vectors by using the cosine similarity. The two methods differ only in the process of generating vectors, as illustrated by Figure 4 (for the weightedSum method) and Figure 5 (for the undirected method). The weightedSum method generates vectors in the following manner. First, two adjacency matrices are built with inlinks and outlinks and then column-normalized such that the sum of all values in a column becomes 1. Second, a vector is generated from each normalized matrix, and the weights are assigned to the elements in the vector by computing the reachability from the target object to every object using the RWR strategy. Similarly, the undirected method generates a vector by constructing a normalized adjacency matrix, ignoring the link directions this time, and assigning the reachability values to the elements of the vector in the same manner.
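The procedure above can be sketched in a few lines of Python. This is a minimal illustration under stated assumptions, not the authors' implementation: the function names (`follow_out`, `follow_in`, `undirected`, `rwr_vector`, `weighted_sum_vector`, `cosine`), the fixed iteration count, the treatment of empty columns, and the mapping of the inlink-based and outlink-based vectors to walk directions are all hypothetical. The toy graph is the four-node example of Figures 4 and 5 (a links to b and c; b and c link to d), and the update rule is the first line of (9) with restart probability $c$.

```python
import math

# Toy graph: LINKS[i][j] = 1 when object i links to object j.
# Node order: a, b, c, d (a links to b and c; b and c link to d).
LINKS = [[0, 1, 1, 0],
         [0, 0, 0, 1],
         [0, 0, 0, 1],
         [0, 0, 0, 0]]

def follow_out(links):
    """step[i][j] = 1 when a walker at j may move to i along an outlink of j."""
    n = len(links)
    return [[links[j][i] for j in range(n)] for i in range(n)]

def follow_in(links):
    """step[i][j] = 1 when a walker at j may move to i against an inlink of j."""
    return [row[:] for row in links]

def undirected(links):
    """Ignore link directions: the walker may move both ways along each link."""
    n = len(links)
    return [[1 if links[i][j] or links[j][i] else 0 for j in range(n)]
            for i in range(n)]

def column_normalize(step):
    """Scale each non-empty column to sum to 1 (empty columns stay all zero)."""
    n = len(step)
    sums = [sum(step[i][j] for i in range(n)) for j in range(n)]
    return [[step[i][j] / sums[j] if sums[j] else 0.0 for j in range(n)]
            for i in range(n)]

def rwr_vector(step, target, c=0.15, iters=100):
    """Iterate u = (1 - c) * P * u + c * q, where q is the indicator of target."""
    n = len(step)
    p = column_normalize(step)
    q = [1.0 if i == target else 0.0 for i in range(n)]
    u = q[:]
    for _ in range(iters):
        u = [(1 - c) * sum(p[i][j] * u[j] for j in range(n)) + c * q[i]
             for i in range(n)]
    return u

def weighted_sum_vector(links, target, w_in=0.5, c=0.15):
    """Combine the inlink-based and outlink-based vectors (weights w_in, 1 - w_in)."""
    v_in = rwr_vector(follow_in(links), target, c)
    v_out = rwr_vector(follow_out(links), target, c)
    return [w_in * x + (1 - w_in) * y for x, y in zip(v_in, v_out)]

def cosine(x, y):
    """Cosine similarity of two equal-length vectors (0.0 if either is all zero)."""
    dot = sum(a * b for a, b in zip(x, y))
    norm = math.sqrt(sum(a * a for a in x)) * math.sqrt(sum(b * b for b in y))
    return dot / norm if norm else 0.0

if __name__ == "__main__":
    node_b, node_c = 1, 2
    sim_ws = cosine(weighted_sum_vector(LINKS, node_b),
                    weighted_sum_vector(LINKS, node_c))
    sim_un = cosine(rwr_vector(undirected(LINKS), node_b),
                    rwr_vector(undirected(LINKS), node_c))
    print(f"weightedSum s(b, c) = {sim_ws:.3f}")
    print(f"undirected  s(b, c) = {sim_un:.3f}")
```

The two methods share `rwr_vector` and `cosine`; only the matrix handed to the random walk differs, which mirrors the statement that the methods differ only in how vectors are generated.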

4.3. Complexity Analysis. The complexity of the proposed methods for computing similarities for all pairs of given objects can be analyzed as follows. First, the time complexity of generating a vector for an object using the RWR strategy is $O(ke)$ [31], where $e$ represents the number of links and $k$ the number of iterations of the matrix calculation needed to obtain converged values of reachability. Thus, the time complexity of generating the vectors for all $n$ objects is $O(kne)$, where $n$ represents the number of objects. It is generally known that a constant number of such iterations is sufficient to obtain converged values of reachability [6, 15]. Consequently, the time complexity is reduced to $O(ne)$. On the other hand, the time complexity of the similarity calculation process is $O(n^3)$, because there are $O(n^2)$ pairs of objects and each cosine similarity over two $n$-dimensional vectors takes $O(n)$ time. Combining them together, the overall time complexity of both methods becomes $O(n^3) + O(ne) = O(n^3)$, since $e \le n^2$. In practice, the weightedSum method will require twice as much time as the undirected method, because the former computes reachability separately for each direction of the links.
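To make the $O(n^3)$ term concrete, the following Python sketch computes all pairwise cosine similarities from $n$ reachability vectors of length $n$. The function name `all_pairs_cosine` and the dense list-of-lists layout are assumptions made for illustration; the paper does not prescribe a data layout.

```python
import math

def all_pairs_cosine(vectors):
    """Return the symmetric matrix of cosine similarities between all vectors."""
    n = len(vectors)
    norms = [math.sqrt(sum(x * x for x in v)) for v in vectors]
    sims = [[0.0] * n for _ in range(n)]
    for i in range(n):
        for j in range(i, n):  # about n^2 / 2 pairs
            # One dot product costs time linear in the vector length.
            dot = sum(x * y for x, y in zip(vectors[i], vectors[j]))
            denom = norms[i] * norms[j]
            sims[i][j] = sims[j][i] = dot / denom if denom else 0.0
    return sims
```

Each entry depends only on the two vectors involved, so the pairs can be processed in any order or distributed over several workers.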

As for space complexity, $O(e)$ space is required for storing the matrix that represents the relationships between objects, $O(n^2)$ for storing the vectors, and $O(n^2)$ for storing the similarity measures between objects. Combining them together, the overall space complexity becomes $O(n^2)$. Again, in practice, the weightedSum method will require twice as much space as the undirected method.

Figure 5: The procedure of generating a vector using the undirected method. (Panels: example graph with nodes a, b, c, and d; Inlink and Outlink adjacency matrices; Undirected-link adjacency matrix; Undirected-link (normalized) matrix; the resulting vector.)

In comparison, the time complexity of the three existing methods, namely, rvs-SimRank, SimRank, and P-Rank, is known to be $O(n^4)$ and the space complexity $O(n^2)$ [6]. We believe the existing methods require more computation than our proposed methods because they adopt the pair-wise computation model in principle.

Moreover, the existing methods cannot compute the similarity between two given objects independently, because they need to know the similarities among all objects (whether connected directly or indirectly to the target objects) in order to compute the similarity between two target objects [20]. However, our approach does not require the similarities among all objects and hence can compute the similarity between any pair of objects independently, making it possible to parallelize the algorithm. Since our approach need not refer to the similarities among other objects, vectors for individual objects can be generated independently, and the similarity between any pair of objects can be computed separately, that is, in parallel. The time complexity of this parallelized version would become $O(n^3/m)$ if $m$ processors are utilized in parallel.

When we need to compute the similarity between only a particular pair of objects on-line, rather than computing the similarities among all objects [20], our methods can be made even more efficient by computing reachability on-line through a precomputed inverse matrix, as suggested by (9). That is, the vector for a given object can be generated by multiplying the inverse matrix of $Q$ by $q$ [30, 31]. The time complexity of the off-line computation of the inverse matrix is $O(n^{2.376})$ [32], and the complexity of the on-line computation is $O(n)$. Such an off-line approach tends to require a relatively long time and more space to handle the inverse matrix. Reference [31] suggests an approach using low-rank approximation in order to keep a balance between the off-line and on-line computation. This approach has advantages in both time and space complexity, suggesting an improvement for the on-line execution of our proposed approach. In conclusion, the time complexity of our methods is lower than that of the existing methods, and the space complexity is equal to that of the existing methods in any case.

$$
\begin{aligned}
u_a &= (1 - c)\,P_A\,u_a + c\,q_a \\
    &= c\,\bigl(I - (1 - c)\,P_A\bigr)^{-1} q_a \\
    &= c\,Q^{-1} q_a.
\end{aligned}
\tag{9}
$$
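Equation (9) can be illustrated with a small Python sketch that separates the off-line and on-line steps. It assumes $Q = I - (1 - c) P_A$, which is what the last two lines of (9) imply, and the scalar factor $c$ follows from solving the first line of (9) for $u_a$. The function names are hypothetical, and plain Gauss-Jordan elimination is used for the inverse purely for readability; the $O(n^{2.376})$ bound cited in the text refers to asymptotically faster matrix algorithms [32], not to this routine.

```python
def invert(matrix):
    """Invert a square matrix by Gauss-Jordan elimination with partial pivoting."""
    n = len(matrix)
    aug = [[float(x) for x in row] + [1.0 if i == j else 0.0 for j in range(n)]
           for i, row in enumerate(matrix)]
    for col in range(n):
        pivot = max(range(col, n), key=lambda r: abs(aug[r][col]))
        if aug[pivot][col] == 0.0:
            raise ValueError("matrix is singular")
        aug[col], aug[pivot] = aug[pivot], aug[col]
        scale = aug[col][col]
        aug[col] = [x / scale for x in aug[col]]
        for r in range(n):
            if r != col and aug[r][col] != 0.0:
                factor = aug[r][col]
                aug[r] = [x - factor * y for x, y in zip(aug[r], aug[col])]
    return [row[n:] for row in aug]

def precompute_q_inverse(p, c):
    """Off-line step: build Q = I - (1 - c) * P (assumed definition) and invert it."""
    n = len(p)
    q_matrix = [[(1.0 if i == j else 0.0) - (1 - c) * p[i][j] for j in range(n)]
                for i in range(n)]
    return invert(q_matrix)

def online_vector(q_inverse, target, c):
    """On-line step: q_a is an indicator vector, so Q^{-1} q_a is one column."""
    return [c * row[target] for row in q_inverse]
```

Because $q_a$ has a single nonzero entry, the on-line step only reads one column of the stored inverse, which is consistent with the $O(n)$ on-line cost stated in the text.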

4.4. Discussions. The proposed methods do not suffer from the problems of the pair-wise and level-wise models discussed in Section 3. Let us compute the similarities between $a$ and $b$ and between $a'$ and $b'$ in Figure 2 using our approach and check whether the results appeal to our intuition. We assume $\lambda$ and $C$ to be 0.5 and 0.7, respectively. Table 2 shows the similarity results computed by the two proposed methods together with those obtained by the three existing methods (shown already in Table 1) for comparison. Both of our methods produce the same result: the similarity between $a$ and $b$, and likewise between $a'$ and $b'$, is 1, meaning that the two objects in each pair are regarded as the same. These results appeal to our intuition. In comparison, the results obtained by the existing methods indicate that the two objects are not the same. Since our approach does not use the pair-wise model but performs the computations among identical features, the proposed methods will produce results more appealing to our intuition than the existing methods.

Table 2: Similarities between $a$ and $b$ and between $a'$ and $b'$ using the proposed methods and existing methods in Figure 2.

| | rvs-SimRank | SimRank | P-Rank | WeightedSum | Undirected |
|---|---|---|---|---|---|
| $s(a, b)$ | 0.396 | 0 | 0.423 | 1 | 1 |
| $s(a', b')$ | 0.321 | 0 | 0.408 | 1 | 1 |

The proposed methods do not suffer from the problem of the level-wise model either. Let us compute the similarity between $a$ and $b$ in Figure 3 using our approach and check whether the results appeal to our intuition. Assuming $\lambda$ and $C$ to be 0.5 and 0.7, respectively, our approach computes the similarity between $a$ and $b$ to be 0.28 using the weightedSum method and 0.6 using the undirected method. For comparison, the existing methods would compute the similarity to be 0, as discussed in Section 3. We cannot tell whether the results obtained by our approach are appropriate, because similarity is by nature a subjective value. Still, we can at least argue that our approach produces results that are more appealing to our intuition than those of the existing methods; that is, the similarity between $a$ and $b$ must not be 0, because they are related indirectly through some common objects. This difference comes from our strategy of considering all the objects indirectly connected to the target objects at once rather than using the level-wise model.

5. Experiments

In this section, we verify the effectiveness of our approach through experimentation. We will show and analyze quantitatively the experimental results obtained from applying the two proposed methods to practical application domains.

5.1. Experimentation Setup. We carried out experiments to verify the performance of the proposed methods with respect to scientific papers and Web documents, the two types of exemplary data sets having link information. For the experiments with the scientific papers, we used the data set consisting of the papers downloaded from DBLP (http://www.informatik.uni-trier.de/~ley/db) and reference information among the papers crawled from Libra (http://academic.research.microsoft.com). In total, 44,800 papers and 126,281 references were used. For the experiments with the Web documents, we used the data set consisting of 1,227,038 Web pages with 11,164,829 hyperlinks in total, taken from the TREC (http://trec.nist.gov/data/t11.web.html) 2002 data. The experiments were carried out on a platform with a Quad Core 2.67 GHz CPU and the Windows 2008 Server OS.

In the experiments, we aimed to evaluate the performance of the two proposed methods (weightedSum and undirected) in comparison with the three existing methods (rvs-SimRank, SimRank, and P-Rank), with the input values of 0.8, 0.5, and 0.15 for $C$, $\lambda$, and the restart probability, respectively, as used frequently in the existing methods [6, 15, 30].

The experiments proceeded as follows. For the weightedSum method, we first assigned appropriate weights to the vectors generated with inlinks and outlinks by varying the weights and finding the ones that achieve the highest accuracy in the similarity computed by the weightedSum method. In the experiments, we tried the weight values of 0, 0.1, 0.3, 0.5, 0.7, 0.9, and 1 for inlinks (note that the sum of the inlink and outlink weights equals 1). Second, we evaluated the accuracy of each method qualitatively by examining the 10 objects that the method computes as the most similar to an arbitrarily chosen target object. Finally, we measured the accuracy of each method quantitatively by comparing the obtained results with the true answers for each data set.

The measurement of the accuracy proceeded as follows. First, we chose one object in turn as the target object from the answer set. Then, we computed the recall [29] by extracting the $m$ objects (where $m$ can be 10, 20, 30, 40, and 50) most similar to the target object according to each method. This process was repeated until every object in the set had been chosen as the target object. The average of all recall values obtained in this way was taken as the final accuracy value.
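The accuracy measurement described above can be sketched as follows. This is an illustrative Python version, not the authors' evaluation code: the names `recall_at_m` and `average_recall`, the `sim(x, y)` callback, and the choice to exclude the target itself from both the candidates and the relevant set are assumptions.

```python
def recall_at_m(ranked, relevant, m):
    """Fraction of the relevant objects found among the first m ranked objects."""
    relevant = set(relevant)
    if not relevant:
        return 0.0
    return len(set(ranked[:m]) & relevant) / len(relevant)

def average_recall(sim, objects, answer_set, m):
    """Use every member of answer_set as the target once and average the recall.

    sim(x, y) is any similarity function; objects is the full collection.
    """
    recalls = []
    for target in answer_set:
        candidates = [o for o in objects if o != target]
        # Rank the candidates from most to least similar to the target.
        ranked = sorted(candidates, key=lambda o: sim(target, o), reverse=True)
        relevant = set(answer_set) - {target}
        recalls.append(recall_at_m(ranked, relevant, m))
    return sum(recalls) / len(recalls) if recalls else 0.0
```

For example, with a hypothetical similarity that scores two papers 1.0 when they belong to the same topic and 0.0 otherwise, every member of a three-paper answer set retrieves the other two within its top 2, so `average_recall(..., m=2)` is 1.0.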

The answers of each data set were constructed in the following manner. For the experiments with scientific papers, we selected five well-known areas (i.e., clustering, sequential pattern mining, spatial databases, link mining, and graph pattern mining) in a data mining textbook [29] and obtained the papers referenced in each section of these areas. We supposed that the papers referenced in the same section are similar to one another. Thus, the papers in a section formed an answer set. Each answer set had 3 to 14 papers, and the total number of papers in all such answer sets was 106. For the experiments with Web documents, we used TREC 2002, which provides Web document sets related to specific keywords. We chose 9 Web document sets randomly from TREC 2002 and used them as the answer sets.

5.2. Domain of Scientific Papers. Figure 6 depicts the accuracy of the similarities among scientific papers obtained by the weightedSum method using various weight values for the vectors generated with inlinks and outlinks. The $x$-axis represents the number of “the most similar” papers selected by the method and the $y$-axis the accuracy. Annotations such as “0.0 : 1.0” represent the weight balance between inlinks and outlinks. According to Figure 6, the accuracy of the weightedSum method tends to become higher as the weight for inlinks grows relative to the weight for outlinks. This is because the papers in the answer set are relatively famous ones frequently referenced by other papers. For this reason, we conducted the experiments using the “0.9 : 0.1” weight balance in order to obtain the best result.

Tables 3, 4, 5, 6, and 7 present the lists of the top 10 papers found to be the most similar to a target paper, which is [33], using the three existing methods and the two proposed methods, respectively. (Note that [33] is concerned with clustering in data mining.) In the tables, those papers which are not similar to [33] are shown in italics. In these papers, the authors mainly deal with issues of outlier detection or mining frequent patterns, indicating that the existing methods have made a wrong conclusion for these papers. In comparison, the results by the weightedSum method (shown in Table 6) include only one wrong entry (of frequent pattern mining) in the top 10 list, and the results by the undirected method do not include wrong papers (i.e., all are related to clustering). These results imply that the undirected method performs better than the three existing methods and even than the weightedSum method. We have repeated this experiment many times with different target papers other than [33] and obtained results similar to Tables 3–7.

Figure 6: Accuracy of the weightedSum method with varying weights. ($x$-axis: $m$, from 10 to 50; $y$-axis: accuracy, from 0 to 1; one curve per inlink : outlink weight balance: 0.0 : 1.0, 0.1 : 0.9, 0.3 : 0.7, 0.5 : 0.5, 0.7 : 0.3, 0.9 : 0.1, and 1.0 : 0.0.)

Figure 7 compares the accuracy of the similarities computed by the three existing methods and the two proposed methods. As in Figure 6, the $x$-axis represents the number of “the most similar” papers selected by each method and the $y$-axis the accuracy. The weightedSum method improved accuracy by 9% on average and up to 13% compared with SimRank. Also, the undirected method improved accuracy by 16% on average and up to 20% compared with SimRank. In conclusion, the two proposed methods turn out to compute the similarity more accurately than the existing methods. In particular, the undirected method performs the best.

5.3. Domain of Web Documents. Figure 8 shows the accuracy of the similarities between Web pages obtained by the weightedSum method using various weight values between inlinks and outlinks. The $x$-axis represents the number of “the most similar” Web pages selected by the method and the $y$-axis the accuracy. Annotations such as “0.0 : 1.0” represent the weight balance between inlinks and outlinks. Again, the accuracy of the weightedSum method tends to become higher when the weight for inlinks gets higher, as in the case of scientific papers. This is because the answer set contains many Web pages with high authority.

Figure 7: Accuracy of the similarity measures in scientific papers data. ($x$-axis: $m$, from 10 to 50; $y$-axis: accuracy, from 0 to 1; curves: SimRank, rvs-SimRank, P-Rank, WeightedSum, and Undirected.)

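Figure 8: Accuracy of the weightedSum method with varying weights. ($x$-axis: $m$, from 10 to 50; $y$-axis: accuracy, from 0 to 0.5; one curve per inlink : outlink weight balance: 0.0 : 1.0, 0.1 : 0.9, 0.3 : 0.7, 0.5 : 0.5, 0.7 : 0.3, 0.9 : 0.1, and 1.0 : 0.0.)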

For this reason, we conducted the experiments using the “0.9 : 0.1” weight balance in order to obtain the best result.

Figure 9 compares the accuracy of the similarities computed by the three existing methods and the two proposed methods. Again, the $x$-axis represents the number of “the most similar” documents selected by each method and the $y$-axis the accuracy. The weightedSum method improved accuracy by 20% on average and up to 24% compared with SimRank. Also, the undirected method improved accuracy by 34% on average and up to 43% compared with SimRank. In conclusion, the two proposed methods turn out to compute the similarity more accurately than the existing methods. In particular, the undirected method outperforms the weightedSum method and the three existing methods by a large margin.

Table 3: Top 10 papers similar to [33] using SimRank.

| First author | Title | Conference/journal | Year |
|---|---|---|---|
| Guha | CURE: An Efficient Clustering Algorithm for Large Databases | ACM SIGMOD | 1998 |
| Sheikholeslami | WaveCluster: A Multi-Resolution Clustering Approach for Very Large Spatial Databases | VLDB | 1998 |
| Ester | Knowledge Discovery in Large Spatial Databases: Focusing Techniques for Efficient Class Identification | SSD | 1995 |
| Hinneburg | An Efficient Approach to Clustering in Large Multimedia Databases with Noise | AAAI | 1998 |
| Ng | Efficient and Effective Clustering Methods for Spatial Data Mining | VLDB | 1994 |
| Bradley | Scaling Clustering Algorithms to Large Databases | AAAI | 1998 |
| Wang | STING: A Statistical Information Grid Approach to Spatial Data Mining | VLDB | 1997 |
| O’Callaghan | Streaming-Data Algorithms for High-Quality Clustering | IEEE ICDE | 2002 |
| Sander | A Density-Based Algorithm for Discovering Clusters in Large Spatial Databases with Noise | DMKD | 1998 |
| Arning | A Linear Method for Deviation Detection in Large Databases | ACM KDD | 1996 |

Table 4: Top 10 papers similar to [33] using rvs-SimRank.

| First author | Title | Conference/journal | Year |
|---|---|---|---|
| Knorr | A Unified Notion of Outliers: Properties and Computation | ACM KDD | 1997 |
| Guha | CURE: An Efficient Clustering Algorithm for Large Databases | ACM SIGMOD | 1998 |
| Sander | A Density-Based Algorithm for Discovering Clusters in Large Spatial Databases with Noise | DMKD | 1998 |
| Bradley | Scaling Clustering Algorithms to Large Databases | AAAI | 1998 |
| Guha | ROCK: A Robust Clustering Algorithm for Categorical Attributes | IEEE ICDE | 1999 |
| Sheikholeslami | WaveCluster: A Multi-Resolution Clustering Approach for Very Large Spatial Databases | VLDB | 1998 |
| Arning | A Linear Method for Deviation Detection in Large Databases | ACM KDD | 1996 |
| Burdick | MAFIA: A Maximal Frequent Itemset Algorithm for Transactional Databases | IEEE TKDE | 2005 |
| Mannila | Efficient Algorithms for Discovering Association Rules | AAAI | 1994 |
| Kamber | Metarule-Guided Mining of Multi-Dimensional Association Rules Using Data Cubes | ACM KDD | 1997 |

Table 5: Top 10 papers similar to [33] using P-Rank.

| First author | Title | Conference/journal | Year |
|---|---|---|---|
| Knorr | A Unified Notion of Outliers: Properties and Computation | ACM KDD | 1997 |
| Guha | CURE: An Efficient Clustering Algorithm for Large Databases | ACM SIGMOD | 1998 |
| Sander | A Density-Based Algorithm for Discovering Clusters in Large Spatial Databases with Noise | DMKD | 1998 |
| Bradley | Scaling Clustering Algorithms to Large Databases | AAAI | 1998 |
| Sheikholeslami | WaveCluster: A Multi-Resolution Clustering Approach for Very Large Spatial Databases | VLDB | 1998 |
| Guha | ROCK: A Robust Clustering Algorithm for Categorical Attributes | IEEE ICDE | 1999 |
| Mannila | Efficient Algorithms for Discovering Association Rules | AAAI | 1994 |
| Arning | A Linear Method for Deviation Detection in Large Databases | ACM KDD | 1996 |
| Silberschatz | What Makes Patterns Interesting in Knowledge Discovery Systems | IEEE TKDE | 1996 |
| Agrawal | Mining Association Rules between Sets of Items in Large Databases | ACM SIGMOD | 1993 |

Table 6: Top 10 papers similar to [33] using the weightedSum method.

| First author | Title | Conference/journal | Year |
|---|---|---|---|
| Ester | Knowledge Discovery in Large Spatial Databases: Focusing Techniques for Efficient Class Identification | SSD | 1995 |
| Ng | Efficient and Effective Clustering Methods for Spatial Data Mining | VLDB | 1994 |
| Sheikholeslami | WaveCluster: A Multi-Resolution Clustering Approach for Very Large Spatial Databases | VLDB | 1998 |
| Guha | ROCK: A Robust Clustering Algorithm for Categorical Attributes | IEEE ICDE | 1999 |
| Guha | CURE: An Efficient Clustering Algorithm for Large Databases | ACM SIGMOD | 1998 |
| Wang | STING: A Statistical Information Grid Approach to Spatial Data Mining | VLDB | 1997 |
| Hinneburg | An Efficient Approach to Clustering in Large Multimedia Databases with Noise | AAAI | 1998 |
| Aggarwal | Fast Algorithms for Projected Clustering | ACM SIGMOD | 1999 |
| Agrawal | Automatic Subspace Clustering of High Dimensional Data for Data Mining Applications | ACM SIGMOD | 1998 |
| Sander | A Density-Based Algorithm for Discovering Clusters in Large Spatial Databases with Noise | DMKD | 1998 |

Table 7: Top 10 papers similar to [33] using the undirected method.

| First author | Title | Conference/journal | Year |
|---|---|---|---|
| Guha | CURE: An Efficient Clustering Algorithm for Large Databases | ACM SIGMOD | 1998 |
| Ng | Efficient and Effective Clustering Methods for Spatial Data Mining | VLDB | 1994 |
| Sander | A Density-Based Algorithm for Discovering Clusters in Large Spatial Databases with Noise | DMKD | 1998 |
| Bradley | Scaling Clustering Algorithms to Large Databases | AAAI | 1998 |
| Sheikholeslami | WaveCluster: A Multi-Resolution Clustering Approach for Very Large Spatial Databases | VLDB | 1998 |
| Hinneburg | An Efficient Approach to Clustering in Large Multimedia Databases with Noise | AAAI | 1998 |
| Wang | STING: A Statistical Information Grid Approach to Spatial Data Mining | VLDB | 1997 |
| Agrawal | Automatic Subspace Clustering of High Dimensional Data for Data Mining Applications | ACM SIGMOD | 1998 |
| Aggarwal | Fast Algorithms for Projected Clustering | ACM SIGMOD | 1999 |
| Ankerst | OPTICS: Ordering Points To Identify the Clustering Structure | ACM SIGMOD | 1999 |

Figure 9: Accuracy of the similarity measures in Web documents data. ($x$-axis: $m$, from 10 to 50; $y$-axis: accuracy, from 0 to 0.5; curves: SimRank, rvs-SimRank, P-Rank, WeightedSum, and Undirected.)

6. Conclusions

This paper presented new link-based similarity methods that can compute the similarity between objects more accurately by using the link information pertaining to the objects. Noticing that most existing link-based similarity methods use the pair-wise and level-wise models, we analyzed the problems with these models and proposed a new approach that does not suffer from these problems. In our proposed approach, each object is represented by a vector, with all objects in the given universe as the elements of the vector and a weight value for each object as the value of the corresponding element in the vector. As for this weight value, we proposed to utilize the notion of reachability between objects, computed using the “Random Walk with Restart” strategy. Then, we defined the similarity between two objects as the cosine similarity between the two vectors representing the two objects. The proposed approach was then refined into two methods, the weightedSum and the undirected methods, differentiated by the strategy used to handle the information on link directions. Examples showed that the two methods do not suffer from the problems of the pair-wise and level-wise models. In our experimentation of the proposed methods with the scientific papers and Web documents data sets, the results indicated that both of the proposed methods generally outperform the existing methods significantly.

Conflict of Interests

The authors declare that there is no conflict of interests regarding the publication of this paper.

Acknowledgments

This work was supported by (1) Business for Cooperative R&D between Industry, Academy, and Research Institute funded by the Korea Small and Medium Business Administration in 2013 (Grants no. C0006278), (2) the MSIP (Ministry of Science, ICT and Future Planning), Korea, under the ITRC (Information Technology Research Center) Support Program (NIPA-2013-H0301-13-4009) supervised by the NIPA (National IT Industry Promotion Agency), (3) the Ministry of Culture, Sports and Tourism (MCST) and the Korea Copyright Commission in 2013, and (4) the National Research Foundation of Korea (NRF) Grant funded by the Korea government (MEST) (no. 2011-0029181).

References

[1] J. Shen, Y. Zhu, H. Zhang, C. Chen, R. Sun, and F. Xu, “A content-based algorithm for blog ranking,” in Proceedings of the International Conference on Internet Computing in Science and Engineering (ICICSE ’08), pp. 19–22, January 2008.

[2] A. Broder, S. Glassman, M. Manasse, and G. Zweig, “Syntactic clustering of the web,” in Proceedings of the 6th International World Wide Web Conference (WWW ’97), pp. 391–404, 1997.

[3] X. Su and T. Khoshgoftaar, “A survey of collaborative filtering techniques,” Advances in Artificial Intelligence, vol. 2009, Article ID 421425, 19 pages, 2009.

[4] S.-H. Yoon, S.-W. Kim, and S. Park, “A link-based similarity measure for scientific literature,” in Proceedings of the 19th International World Wide Web Conference (WWW ’10), pp. 1213–1214, April 2010.

[5] M. R. Hamedani, S. Lee, and S. Kim, “On exploiting content and citations together to compute similarity of scientific papers,” in Proceedings of the ACM International Conference on Information and Knowledge Management (CIKM ’13), San Francisco, Calif, USA, 2013.

[6] P. Zhao, J. Han, and Y. Sun, “P-Rank: a comprehensive structural similarity measure over information networks,” in Proceedings of the ACM 18th International Conference on Information and Knowledge Management (CIKM ’09), pp. 553–562, November 2009.

[7] D.-H. Bae, S.-M. Hwang, S.-W. Kim, and C. Faloutsos, “Constructing seminal paper genealogy,” in Proceedings of the 20th ACM Conference on Information and Knowledge Management (CIKM ’11), pp. 2101–2104, October 2011.

[8] F. Menczer, “Combining link and content analysis to estimate semantic similarity,” in Proceedings of the 13th International World Wide Web Conference (WWW ’04), pp. 452–453, May 2004.

[9] R. Baeza-Yates and B. Ribeiro-Neto, Modern Information Retrieval, Addison Wesley, 1999.

[10] S.-H. Yoon, S.-W. Kim, J.-S. Kim, and W.-S. Hwang, “On computing text-based similarity in scientific literature,” in Proceedings of the 20th International Conference Companion on World Wide Web (WWW ’11), pp. 169–170, April 2011.

[11] G. Bisson and F. Hussain, “χ-Sim: a new similarity measure for the co-clustering task,” in Proceedings of the 7th International Conference on Machine Learning and Applications (ICMLA ’08), pp. 211–217, December 2008.

[12] H. Small, “Co-citation in the scientific literature: a new measure of the relationship between two documents,” Journal of the American Society for Information Science, vol. 24, no. 4, pp. 265–269, 1973.

[13] M. Kessler, “Bibliographic coupling between scientific papers,” Journal of the American Documentation, vol. 14, no. 1, pp. 10–25, 1963.

[14] R. Amsler, “Application of citation-based automatic classification,” Technical Report, The University of Texas at Austin, Linguistics Research Center, 1972.

[15] G. Jeh and J. Widom, “SimRank: a measure of structural-context similarity,” in Proceedings of the 8th ACM SIGKDD International Conference on Knowledge Discovery and Data Mining (KDD ’02), pp. 538–543, July 2002.

[16] U. Shardanand and P. Maes, “Social information filtering: algorithms for automating ‘word of mouth’,” in Proceedings of the SIGCHI Conference on Human Factors in Computing Systems (CHI ’95), pp. 210–217, ACM Press, Denver, Colorado, 1995.

[17] I. Antonellis, H. Garcia-Molina, and C. Chang, “Simrank++: query rewriting through link analysis of the click graph,” in Proceedings of the 34th International Conference on Very Large Data Bases (VLDB ’08), pp. 408–421, Auckland, New Zealand, August 2008.

[18] D. Fogaras and B. Rácz, “Scaling link-based similarity search,” in Proceedings of the 14th International Conference on World Wide Web (WWW ’05), pp. 641–650, 2005.

[19] C. Li, J. Han, G. He et al., “Fast computation of SimRank for static and dynamic information networks,” in Proceedings of the 13th International Conference on Extending Database Technology: Advances in Database Technology (EDBT ’10), pp. 465–476, March 2010.

[20] P. Li, H. Liu, J. Xu Yu, J. He, and X. Du, “Fast single-pair SimRank computation,” in Proceedings of the SIAM International Conference on Data Mining (SDM ’10), pp. 571–582, SIAM, Columbus, Ohio, USA, April 2010.

[21] X. Yin, J. Han, and P. S. Yu, “LinkClus: efficient clustering via heterogeneous semantic links,” in Proceedings of the 32nd International Conference on Very Large Data Bases (VLDB ’06), pp. 427–438, 2006.

[22] D. Lizorkin, P. Velikhov, M. Grinev, and D. Turdakov, “Accuracy estimate and optimization techniques for SimRank computation,” in Proceedings of the Very Large Data Bases Endowment, vol. 1, pp. 422–433, 2008.

[23] Y. Cai, G. Cong, X. Jia et al., “Efficient algorithm for computing link-based similarity in real world networks,” in Proceedings of the 9th IEEE International Conference on Data Mining (ICDM ’09), pp. 734–739, December 2009.

[24] G. He, H. Feng, C. Li, and H. Chen, “Parallel SimRank computation on large graphs with iterative aggregation,” in Proceedings of the 16th ACM SIGKDD International Conference on Knowledge Discovery and Data Mining (KDD ’10), pp. 543–552, July 2010.

[25] D. Lizorkin, P. Velikhov, M. Grinev, and D. Turdakov, “Accuracy estimate and optimization techniques for SimRank computation,” The Int’l Journal on Very Large Data Bases, vol. 19, no. 1, pp. 45–66, 2010.

[26] Z. Lin, M. R. Lyu, and I. King, “MatchSim: a novel neighbor-based similarity measure with maximum neighborhood matching,” in Proceedings of the ACM 18th International Conference on Information and Knowledge Management (CIKM ’09), pp. 1613–1616, November 2009.

[27] R. R. Larson, “Bibliometrics of the world wide web: an exploratory analysis of the intellectual structure of cyberspace,” Proceedings of the Annual Meeting of the American Society for Information Science, vol. 33, pp. 71–78, 1996.

[28] J. Pitkow and P. Pirolli, “Life, death, and lawfulness on the electronic frontier,” in Proceedings of the Conference on Human Factors in Computing Systems (CHI ’97), pp. 383–390, March 1997.

[29] J. Han and M. Kamber, Data Mining: Concepts and Techniques, Morgan Kaufmann, 2006.

[30] J. Sun, H. Qu, D. Chakrabarti, and C. Faloutsos, “Neighborhood formation and anomaly detection in bipartite graphs,” in Proceedings of the 5th IEEE International Conference on Data Mining (ICDM ’05), pp. 418–425, November 2005.

[31] H. Tong, C. Faloutsos, and J.-Y. Pan, “Fast random walk with restart and its applications,” in Proceedings of the 6th International Conference on Data Mining (ICDM ’06), pp. 613–622, December 2006.

[32] S. Robinson, “Toward an optimal algorithm for matrix multiplication,” News Journal of the Society for Industrial and Applied Mathematics, vol. 38, no. 9, 2005.

[33] T Zhang R Ramakrishnam andM Livny ldquoBIRCH an efficientdata clustering method for very large databasesrdquo in Proceedingsof the ACM SIGMOD International Conference on Managementof Data (SIGMOD rsquo96) pp 103ndash114 1996



Figure 5: The procedure of generating a vector using the undirected method. (Panels: Outlink, Inlink, Undirected-link, and Undirected-link (normalized) adjacency matrices over the objects $a$, $b$, $c$, and $d$, followed by the resulting vector.)
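The matrix steps that Figure 5 depicts can be sketched as follows. This is a minimal illustration, assuming a four-object graph with the links a→b, a→c, b→d, and c→d, and assuming that the outlink matrix has a 1 in row $i$, column $j$ when object $i$ links to object $j$. The helper names (`transpose`, `add`, `row_normalize`) are hypothetical and not taken from the paper.

```python
def transpose(m):
    """Return the transpose of a square matrix given as a list of rows."""
    return [list(col) for col in zip(*m)]


def add(m1, m2):
    """Element-wise sum of two matrices of the same shape."""
    return [[x + y for x, y in zip(r1, r2)] for r1, r2 in zip(m1, m2)]


def row_normalize(m):
    """Divide every row by its sum; rows without links stay all zero."""
    normalized = []
    for row in m:
        total = sum(row)
        normalized.append([x / total if total else 0.0 for x in row])
    return normalized


# Objects in the order a, b, c, d (assumed example graph).
outlink = [
    [0, 1, 1, 0],  # a links to b and c
    [0, 0, 0, 1],  # b links to d
    [0, 0, 0, 1],  # c links to d
    [0, 0, 0, 0],  # d links to nothing
]
inlink = transpose(outlink)        # reverse every link
undirected = add(outlink, inlink)  # ignore link direction
normalized = row_normalize(undirected)
```

In this sketch a pair of objects that link to each other in both directions would get the value 2 in the undirected matrix; whether such entries should be capped at 1 is a design choice that the sketch leaves open.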

weightedSum method will require twice as much space as the other.

In comparison, the time complexity of the three existing methods, namely, rvs-SimRank, SimRank, and P-Rank, is known to be $O(n^4)$ and their space complexity $O(n^2)$ [6]. We believe the existing methods require more computation than our proposed methods because they adopt the pair-wise computation model in principle.

Moreover, the existing methods cannot compute the similarity between two given objects independently, because they need to know the similarities among all objects (whether connected directly or indirectly to the target objects) in order to compute the similarity between two target objects [20]. However, our approach does not require the similarities among all objects and hence can compute the similarity between any pair of objects independently, making it possible to parallelize the algorithm. Since our approach need not refer to the similarities among other objects, vectors for individual objects can be generated independently, and the similarity between any pair of objects can be computed separately, that is, in parallel. The time complexity of this parallelized version would become $O(n^3/m)$ if $m$ processors are utilized in parallel.

When we need to compute the similarity between only a particular pair of objects on-line, rather than computing the similarities among all objects [20], our methods can be made even more efficient by precomputing the inverse matrix off-line and computing the reachability on-line through it, as suggested by (9). That is, the vector for a given object can be generated by multiplying the inverse matrix of $Q$ by $q$ [30, 31]. The time complexity of this off-line computation of the inverse matrix is $O(n^{2.376})$ [32], and the complexity of the on-line computation is $O(n)$. Such an off-line approach tends to require a relatively longer time and more space to handle the inverse matrix.

Reference [31] suggests an approach using low-rank approximation in order to keep a balance between the off-line and on-line computation. This approach has advantages in both time and space complexity, suggesting an improvement for the on-line execution of our proposed approach. In conclusion, the time complexity of our methods is lower than that of the existing methods, and their space complexity is equal to that of the existing methods in any case.

$$
\begin{aligned}
u_a &= (1 - c)\, P_A u_a + c\, q_a \\
    &= c\, \left(I - (1 - c)\, P_A\right)^{-1} q_a \\
    &= c\, Q^{-1} q_a.
\end{aligned}
\tag{9}
$$
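As a rough illustration of how the first line of (9) can be evaluated without forming an inverse, the sketch below iterates $u \leftarrow (1 - c) P u + c q$ until it settles and then compares two reachability vectors by cosine similarity. It is a sketch under stated assumptions, not the authors' implementation: the transition matrix `P` is assumed to be column-normalized, the example graph is a hypothetical four-object undirected graph, the restart probability defaults to 0.15, and the function names `rwr_vector` and `cosine` are made up for this example.

```python
import math


def rwr_vector(P, start, c=0.15, iters=200):
    """Approximate the solution of u = (1 - c) * P u + c * q by iteration.

    P is assumed column-normalized (P[i][j] = probability of moving j -> i),
    and q is the indicator vector of the start object.
    """
    n = len(P)
    q = [1.0 if i == start else 0.0 for i in range(n)]
    u = q[:]
    for _ in range(iters):
        u = [(1 - c) * sum(P[i][j] * u[j] for j in range(n)) + c * q[i]
             for i in range(n)]
    return u


def cosine(x, y):
    """Cosine similarity of two equally long vectors (0.0 if one is all zero)."""
    dot = sum(a * b for a, b in zip(x, y))
    norm = math.sqrt(sum(a * a for a in x)) * math.sqrt(sum(b * b for b in y))
    return dot / norm if norm else 0.0


# Hypothetical undirected graph on a, b, c, d with edges a-b, a-c, b-d, c-d.
P = [
    [0.0, 0.5, 0.5, 0.0],
    [0.5, 0.0, 0.0, 0.5],
    [0.5, 0.0, 0.0, 0.5],
    [0.0, 0.5, 0.5, 0.0],
]
# One reachability vector per object; each is computed independently.
vectors = [rwr_vector(P, i) for i in range(len(P))]
sim_bc = cosine(vectors[1], vectors[2])
```

Because each call to `rwr_vector` depends only on `P` and its own start object, the calls could be distributed over several processors, which is the independence property discussed in the text.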

4.4. Discussions. The proposed methods do not suffer from the problems of the pair-wise and level-wise models discussed in Section 3. Let us compute the similarities between $a$ and $b$ and between $a'$ and $b'$ in Figure 2 using our approach and check whether the results appeal to our intuition. We assume $\lambda$ and $C$ to be 0.5 and 0.7, respectively. Table 2 shows the similarity results computed by the two proposed methods, together with those obtained by the three existing methods (shown already in Table 1) for comparison. Both of our methods produce the same result: the similarity between $a$ and $b$, and likewise between $a'$ and $b'$, is 1, meaning that the two objects in each pair are regarded as the same. These results appeal to our intuition. In comparison, the results obtained by the existing methods indicate that the two objects are not the same. Since our approach does not use the pair-wise model but performs the computations among identical features, the proposed methods produce results more appealing to our intuition than the existing methods.

Table 2: Similarities between $a$ and $b$ and between $a'$ and $b'$ using the proposed methods and existing methods in Figure 2.

| | rvs-SimRank | SimRank | P-Rank | WeightedSum | Undirected |
|---|---|---|---|---|---|
| $s(a, b)$ | 0.396 | 0 | 0.423 | 1 | 1 |
| $s(a', b')$ | 0.321 | 0 | 0.408 | 1 | 1 |

The proposed methods do not suffer from the problem of the level-wise model either. Let us compute the similarity between $a$ and $b$ in Figure 3 using our approach and check whether the results appeal to our intuition. Assuming $\lambda$ and $C$ to be 0.5 and 0.7, respectively, our approach computes the similarity between $a$ and $b$ to be 0.28 using the weightedSum method and 0.6 using the undirected method. For comparison, the existing methods would compute the similarity to be 0, as discussed in Section 3. We cannot tell whether the results obtained by our approach are appropriate or not, because similarity is by nature a subjective value. Still, we can at least argue that our approach produces results that are more appealing to our intuition than those of the existing methods; that is, the similarity between $a$ and $b$ must not be 0 because they are related indirectly by some common objects. This difference comes from our strategy of considering all the objects indirectly connected to the target objects at once, rather than using the level-wise model.

5. Experiments

In this section, we verify the effectiveness of our approach through experimentation. We show and analyze quantitatively the experimental results obtained from applying the two proposed methods to practical application domains.

5.1. Experimentation Setup. We carried out experiments to verify the performance of the proposed methods with respect to scientific papers and Web documents, two exemplary types of data sets having link information. For the experiments with the scientific papers, we used a data set consisting of the papers downloaded from DBLP (http://www.informatik.uni-trier.de/~ley/db) and the reference information among the papers crawled from Libra (http://academic.research.microsoft.com). In total, 44,800 papers and 126,281 references were used. For the experiments with the Web documents, we used a data set consisting of 1,227,038 Web pages with 11,164,829 hyperlinks in total, taken from the TREC (http://trec.nist.gov/data/t11.web.html) 2002 data. The experiments were carried out on a platform with a quad-core 2.67 GHz CPU and the Windows Server 2008 OS.

In the experiments, we aimed to evaluate the performance of the two proposed methods (weightedSum and undirected) in comparison with the three existing methods (rvs-SimRank, SimRank, and P-Rank), with the input values of 0.8, 0.5, and 0.15 for $C$, $\lambda$, and the restart probability, respectively, as used frequently in the existing methods [6, 15, 30].

The experiments proceeded as follows. For the weightedSum method, we first assigned appropriate weights to the vectors generated with inlinks and outlinks by varying the weights and finding the ones that achieve the highest accuracy in the similarity computed by the weightedSum method. In the experiments, we tried the weight values of 0, 0.1, 0.3, 0.5, 0.7, 0.9, and 1 for inlinks (note that the sum of the inlink and outlink weights equals 1). Second, we evaluated the accuracy of each method qualitatively by examining the 10 objects that the method computes as the most similar to a target object chosen arbitrarily. Finally, we measured the accuracy of each method quantitatively by comparing the obtained results with the true answers for each data set.

The measurement of the accuracy proceeded as follows. First, we chose one object in turn as the target object from the answer set. Then, we computed the recall [29] by extracting the $m$ objects (where $m$ can be 10, 20, 30, 40, and 50) most similar to the target object according to each method. This process was repeated until every object in the set had been chosen as the target object. The average of all recall values obtained in this way was taken as the final accuracy value.
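The accuracy measurement described above can be sketched roughly as follows. This is an illustration under assumptions rather than the evaluation code used in the paper: the relevant objects for a target are assumed to be the other members of its answer set, the ranking function is assumed to return objects ordered from most to least similar with the target itself excluded, and the names `recall_at_m`, `average_recall`, and the toy rankings are hypothetical.

```python
def recall_at_m(ranked, relevant, m):
    """Fraction of the relevant objects found among the top-m ranked objects."""
    if not relevant:
        return 0.0
    return len(set(ranked[:m]) & relevant) / len(relevant)


def average_recall(answer_set, rank_fn, m):
    """Use every object of the answer set as the target once and average the recall."""
    recalls = []
    for target in answer_set:
        relevant = set(answer_set) - {target}  # assumed ground truth for this target
        recalls.append(recall_at_m(rank_fn(target), relevant, m))
    return sum(recalls) / len(recalls)


# Toy data: three papers forming one answer set, plus unrelated papers x and y.
answer_set = ["p1", "p2", "p3"]
toy_rankings = {
    "p1": ["p2", "x", "p3", "y"],
    "p2": ["x", "y", "p1", "p3"],
    "p3": ["p1", "p2", "x", "y"],
}
accuracy_at_2 = average_recall(answer_set, toy_rankings.__getitem__, 2)
accuracy_at_4 = average_recall(answer_set, toy_rankings.__getitem__, 4)
```

With several answer sets, one would presumably average over all targets of all sets; the sketch handles a single set to keep the idea visible.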

The answers of each data set were constructed in the following manner. For the experiments with scientific papers, we selected five well-known areas (i.e., clustering, sequential pattern mining, spatial databases, link mining, and graph pattern mining) in a data mining textbook [29] and obtained the papers referenced in each section of these areas. We supposed that the reference papers in the same section are similar to one another. Thus, the papers in a section formed an answer set. Each answer set had 3 to 14 papers, and the total number of papers in all such answer sets was 106. For the experiments with Web documents, we used TREC 2002, which provides Web document sets related to specific keywords. We chose 9 Web document sets randomly from TREC 2002 and used them as the answer sets.

5.2. Domain of Scientific Papers. Figure 6 depicts the accuracy of the similarities among scientific papers obtained by the weightedSum method using various weight values for the vectors generated with inlinks and outlinks. The $x$-axis represents the number of “the most similar” papers selected by the method and the $y$-axis the accuracy. Annotations such as “0.0:1.0” represent the weight balances between inlinks and outlinks. According to Figure 6, the accuracy of the weightedSum method tends to become higher when the weight for inlinks gets higher than the outlink weight. This is because the papers in the answer set are relatively famous ones frequently referenced by other papers. For this reason, we conducted the experiments using the “0.9:0.1” weight balance in order to obtain the best result.

Figure 6: Accuracy of the weightedSum method with varying weights. ($x$-axis: $m$, from 10 to 50; $y$-axis: accuracy, from 0 to 1; one curve per inlink:outlink weight balance: 0.0:1.0, 0.1:0.9, 0.3:0.7, 0.5:0.5, 0.7:0.3, 0.9:0.1, and 1.0:0.0.)

Tables 3, 4, 5, 6, and 7 present the lists of the top 10 papers found to be the most similar to a target paper, [33], using the three existing methods and the two proposed methods, respectively. (Note that [33] is concerned with clustering in data mining.) In the tables, the papers that are not similar to [33] are italicized. In these papers, the authors mainly deal with issues of outlier detection or mining frequent patterns, indicating that the existing methods have drawn a wrong conclusion for these papers. In comparison, the results by the weightedSum method (shown in Table 6) include only one wrong entry (on frequent pattern mining) in the top 10 list, and the results by the undirected method do not include any wrong papers (i.e., all are related to clustering). These results imply that the undirected method performs better than the three existing methods and even than the weightedSum method. We repeated this experiment many times with target papers other than [33] and obtained results similar to Tables 3–7.

Figure 7 compares the accuracy of similarities computed by the three existing methods and the two proposed methods. As in Figure 6, the $x$-axis represents the number of “the most similar” papers selected by each method and the $y$-axis the accuracy. The weightedSum method improved accuracy by 9% on average and up to 13% compared with SimRank. Also, the undirected method improved accuracy by 16% on average and up to 20% compared with SimRank. In conclusion, the two proposed methods turn out to compute the similarity more accurately than the existing methods. In particular, the undirected method performs the best.

Figure 7: Accuracy of the similarity measures in scientific papers data. ($x$-axis: $m$, from 10 to 50; $y$-axis: accuracy, from 0 to 1; curves: SimRank, rvs-SimRank, P-Rank, WeightedSum, and Undirected.)

5.3. Domain of Web Documents. Figure 8 shows the accuracy of the similarities between Web pages obtained by the weightedSum method using various weight values between inlinks and outlinks. The $x$-axis represents the number of “the most similar” Web pages selected by the method and the $y$-axis the accuracy. Annotations such as “0.0:1.0” represent the weight balances between inlinks and outlinks. Again, the accuracy of the weightedSum method tends to become higher when the weight for inlinks gets higher, as in the case of scientific papers. This is because the answer set contains many Web pages with high authority. For this reason, we conducted the experiments using the “0.9:0.1” weight balance in order to obtain the best result.

Figure 8: Accuracy of the weightedSum method with varying weights. ($x$-axis: $m$, from 10 to 50; $y$-axis: accuracy, from 0 to 0.5; one curve per inlink:outlink weight balance: 0.0:1.0, 0.1:0.9, 0.3:0.7, 0.5:0.5, 0.7:0.3, 0.9:0.1, and 1.0:0.0.)

Figure 9 compares the accuracy of similarities computed by the three existing methods and the two proposed methods. Again, the $x$-axis represents the number of “the most similar” documents selected by each method and the $y$-axis the accuracy. The weightedSum method improved accuracy by 20% on average and up to 24% compared with SimRank. Also, the undirected method improved accuracy by 34% on average and up to 43% compared with SimRank. In conclusion, the two proposed methods turn out to compute the similarity more accurately than the existing methods. In particular, the undirected method outperforms the weightedSum method and the three existing methods by a large degree.

Table 3: Top 10 papers similar to [33] using SimRank.

| First author | Title | Conference/journal | Year |
|---|---|---|---|
| Guha | CURE: An Efficient Clustering Algorithm for Large Databases | ACM SIGMOD | 1998 |
| Sheikholeslami | WaveCluster: A multi-Resolution Clustering Approach for Very Large Spatial Databases | VLDB | 1998 |
| Ester | Knowledge Discovery in Large Spatial Databases: Focusing Techniques for Efficient Class Identification | SSD | 1995 |
| Hinneburg | An Efficient Approach to Clustering in Large Multimedia Databases with Noise | AAAI | 1998 |
| Ng | Efficient and Effective Clustering Methods for Spatial Data Mining | VLDB | 1994 |
| Bradley | Scaling Clustering Algorithms to Large Databases | AAAI | 1998 |
| Wang | STING: A Statistical Information Grid Approach to Spatial Data Mining | VLDB | 1997 |
| L. O’Callaghan | Streaming-Data Algorithms for High-Quality Clustering | IEEE ICDE | 2002 |
| Sander | A Density-Based Algorithm for Discovering Clusters in Large Spatial Databases with Noise | DMKD | 1998 |
| Arning | A Linear Method for Deviation Detection in Large Databases | ACM KDD | 1996 |

Table 4: Top 10 papers similar to [33] using rvs-SimRank.

| First author | Title | Conference/journal | Year |
|---|---|---|---|
| Knorr | A Unified Notion of Outliers: Properties and Computation | ACM KDD | 1997 |
| Guha | CURE: An Efficient Clustering Algorithm for Large Databases | ACM SIGMOD | 1998 |
| Sander | A Density-Based Algorithm for Discovering Clusters in Large Spatial Databases with Noise | DMKD | 1998 |
| Bradley | Scaling Clustering Algorithms to Large Databases | AAAI | 1998 |
| Guha | ROCK: A Robust Clustering Algorithm for Categorical Attributes | IEEE ICDE | 1999 |
| Sheikholeslami | WaveCluster: A multi-Resolution Clustering Approach for Very Large Spatial Databases | VLDB | 1998 |
| Arning | A Linear Method for Deviation Detection in Large Databases | ACM KDD | 1996 |
| Burdick | MAFIA: A Maximal Frequent Itemset Algorithm for Transactional Databases | IEEE TKDE | 2005 |
| Mannila | Efficient Algorithms for Discovering Association Rules | AAAI | 1994 |
| Kamber | Metarule-Guided Mining of Multi-Dimensional Association Rules Using Data Cubes | ACM KDD | 1997 |

Table 5: Top 10 papers similar to [33] using P-Rank.

| First author | Title | Conference/journal | Year |
|---|---|---|---|
| Knorr | A Unified Notion of Outliers: Properties and Computation | ACM KDD | 1997 |
| Guha | CURE: An Efficient Clustering Algorithm for Large Databases | ACM SIGMOD | 1998 |
| Sander | A Density-Based Algorithm for Discovering Clusters in Large Spatial Databases with Noise | DMKD | 1998 |
| Bradley | Scaling Clustering Algorithms to Large Databases | AAAI | 1998 |
| Sheikholeslami | WaveCluster: A multi-Resolution Clustering Approach for Very Large Spatial Databases | VLDB | 1998 |
| Guha | ROCK: A Robust Clustering Algorithm for Categorical Attributes | IEEE ICDE | 1999 |
| Mannila | Efficient Algorithms for Discovering Association Rules | AAAI | 1994 |
| Arning | A Linear Method for Deviation Detection in Large Databases | ACM KDD | 1996 |
| Silberschatz | What Makes Patterns Interesting in Knowledge Discovery Systems | IEEE TKDE | 1996 |
| Agrawal | Mining Association Rules between Sets of Items in Large Databases | ACM SIGMOD | 1993 |

Table 6: Top 10 papers similar to [33] using the weightedSum method.

| First author | Title | Conference/journal | Year |
|---|---|---|---|
| Ester | Knowledge Discovery in Large Spatial Databases: Focusing Techniques for Efficient Class Identification | SSD | 1995 |
| Ng | Efficient and Effective Clustering Methods for Spatial Data Mining | VLDB | 1994 |
| Sheikholeslami | WaveCluster: A multi-Resolution Clustering Approach for Very Large Spatial Databases | VLDB | 1998 |
| Guha | ROCK: A Robust Clustering Algorithm for Categorical Attributes | IEEE ICDE | 1999 |
| Guha | CURE: An Efficient Clustering Algorithm for Large Databases | ACM SIGMOD | 1998 |
| Wang | STING: A Statistical Information Grid Approach to Spatial Data Mining | VLDB | 1997 |
| Hinneburg | An Efficient Approach to Clustering in Large Multimedia Databases with Noise | AAAI | 1998 |
| Aggarwal | Fast Algorithms for Projected Clustering | ACM SIGMOD | 1999 |
| Agrawal | Automatic Subspace Clustering of High Dimensional Data for Data Mining Applications | ACM SIGMOD | 1998 |
| Sander | A Density-Based Algorithm for Discovering Clusters in Large Spatial Databases with Noise | DMKD | 1998 |

Table 7: Top 10 papers similar to [33] using the undirected method.

| First author | Title | Conference/journal | Year |
|---|---|---|---|
| Guha | CURE: An Efficient Clustering Algorithm for Large Databases | ACM SIGMOD | 1998 |
| Ng | Efficient and Effective Clustering Methods for Spatial Data Mining | VLDB | 1994 |
| Sander | A Density-Based Algorithm for Discovering Clusters in Large Spatial Databases with Noise | DMKD | 1998 |
| Bradley | Scaling Clustering Algorithms to Large Databases | AAAI | 1998 |
| Sheikholeslami | WaveCluster: A multi-Resolution Clustering Approach for Very Large Spatial Databases | VLDB | 1998 |
| Hinneburg | An Efficient Approach to Clustering in Large Multimedia Databases with Noise | AAAI | 1998 |
| Wang | STING: A Statistical Information Grid Approach to Spatial Data Mining | VLDB | 1997 |
| Agrawal | Automatic Subspace Clustering of High Dimensional Data for Data Mining Applications | ACM SIGMOD | 1998 |
| Aggarwal | Fast Algorithms for Projected Clustering | ACM SIGMOD | 1999 |
| Ankerst | OPTICS: Ordering Points To Identify the Clustering Structure | ACM SIGMOD | 1999 |

Figure 9: Accuracy of the similarity measures in Web documents data. ($x$-axis: $m$, from 10 to 50; $y$-axis: accuracy, from 0 to 0.5; curves: SimRank, rvs-SimRank, P-Rank, WeightedSum, and Undirected.)

6. Conclusions

This paper presented new link-based similarity methods that can compute the similarity between objects more accurately by using the link information pertaining to the objects. Noticing that most existing link-based similarity methods use the pair-wise and level-wise models, we analyzed the problems with these models and proposed a new approach that does not suffer from these problems. In our proposed approach, each object is represented by a vector whose elements correspond to all objects in the given universe, and the value of each element is the weight of the corresponding object. As for this weight value, we proposed to utilize the notion of reachability between objects, computed using the “Random Walk with Restart” strategy. Then, we defined the similarity between two objects as the cosine similarity between the two vectors representing the two objects. The proposed approach was then refined into two methods, the weightedSum and the undirected methods, differentiated by the strategy used to handle the information on link directions. Examples showed that the two methods do not suffer from the problems of the pair-wise and level-wise models. In our experimentation with the scientific papers and Web documents data sets, the results indicated that both of the proposed methods generally outperform the existing methods significantly.

Conflict of Interests

The authors declare that there is no conflict of interests regarding the publication of this paper.

Acknowledgments

This work was supported by (1) the Business for Cooperative R&D between Industry, Academy, and Research Institute funded by the Korea Small and Medium Business Administration in 2013 (Grant no. C0006278), (2) the MSIP (Ministry of Science, ICT and Future Planning), Korea, under the ITRC (Information Technology Research Center) Support Program (NIPA-2013-H0301-13-4009) supervised by the NIPA (National IT Industry Promotion Agency), (3) the Ministry of Culture, Sports and Tourism (MCST) and the Korea Copyright Commission in 2013, and (4) the National Research Foundation of Korea (NRF) Grant funded by the Korea government (MEST) (no. 2011-0029181).

References

[1] J. Shen, Y. Zhu, H. Zhang, C. Chen, R. Sun, and F. Xu, “A content-based algorithm for blog ranking,” in Proceedings of the International Conference on Internet Computing in Science and Engineering (ICICSE ’08), pp. 19–22, January 2008.

[2] A. Broder, S. Glassman, M. Manasse, and G. Zweig, “Syntactic clustering of the web,” in Proceedings of the 6th International World Wide Web Conference (WWW ’97), pp. 391–404, 1997.

[3] X. Su and T. Khoshgoftaar, “A survey of collaborative filtering techniques,” Advances in Artificial Intelligence, vol. 2009, Article ID 421425, 19 pages, 2009.

[4] S.-H. Yoon, S.-W. Kim, and S. Park, “A link-based similarity measure for scientific literature,” in Proceedings of the 19th International World Wide Web Conference (WWW ’10), pp. 1213–1214, April 2010.

[5] M. R. Hamedani, S. Lee, and S. Kim, “On exploiting content and citations together to compute similarity of scientific papers,” in Proceedings of the ACM International Conference on Information and Knowledge Management (CIKM ’13), San Francisco, Calif, USA, 2013.

[6] P. Zhao, J. Han, and Y. Sun, “P-Rank: a comprehensive structural similarity measure over information networks,” in Proceedings of the ACM 18th International Conference on Information and Knowledge Management (CIKM ’09), pp. 553–562, November 2009.

[7] D.-H. Bae, S.-M. Hwang, S.-W. Kim, and C. Faloutsos, “Constructing seminal paper genealogy,” in Proceedings of the 20th ACM Conference on Information and Knowledge Management (CIKM ’11), pp. 2101–2104, October 2011.

[8] F. Menczer, “Combining link and content analysis to estimate semantic similarity,” in Proceedings of the 13th International World Wide Web Conference (WWW ’04), pp. 452–453, May 2004.

[9] R. Baeza-Yates and B. Ribeiro-Neto, Modern Information Retrieval, Addison Wesley, 1999.

[10] S.-H. Yoon, S.-W. Kim, J.-S. Kim, and W.-S. Hwang, “On computing text-based similarity in scientific literature,” in Proceedings of the 20th International Conference Companion on World Wide Web (WWW ’11), pp. 169–170, April 2011.

[11] G. Bisson and F. Hussain, “χ-Sim: a new similarity measure for the co-clustering task,” in Proceedings of the 7th International Conference on Machine Learning and Applications (ICMLA ’08), pp. 211–217, December 2008.

[12] H. Small, “Co-citation in the scientific literature: a new measure of the relationship between two documents,” Journal of the American Society for Information Science, vol. 24, no. 4, pp. 265–269, 1973.

[13] M. Kessler, “Bibliographic coupling between scientific papers,” Journal of the American Documentation, vol. 14, no. 1, pp. 10–25, 1963.

[14] R. Amsler, “Application of citation-based automatic classification,” Technical Report, The University of Texas at Austin, Linguistics Research Center, 1972.

[15] G. Jeh and J. Widom, “SimRank: a measure of structural-context similarity,” in Proceedings of the 8th ACM SIGKDD International Conference on Knowledge Discovery and Data Mining (KDD ’02), pp. 538–543, July 2002.

[16] U. Shardanand and P. Maes, “Social information filtering: algorithms for automating ‘word of mouth’,” in Proceedings of the SIGCHI Conference on Human Factors in Computing Systems (CHI ’95), pp. 210–217, ACM Press, Denver, Colorado, 1995.

[17] I. Antonellis, H. Garcia-Molina, and C. Chang, “Simrank++: query rewriting through link analysis of the click graph,” in Proceedings of the 34th International Conference on Very Large Data Bases (VLDB ’08), pp. 408–421, Auckland, New Zealand, August 2008.

[18] D. Fogaras and B. Racz, “Scaling link-based similarity search,” in Proceedings of the 14th International Conference on World Wide Web (WWW ’05), pp. 641–650, 2005.

[19] C. Li, J. Han, G. He et al., “Fast computation of SimRank for static and dynamic information networks,” in Proceedings of the 13th International Conference on Extending Database Technology: Advances in Database Technology (EDBT ’10), pp. 465–476, March 2010.

[20] P. Li, H. Liu, J. Xu Yu, J. He, and X. Du, “Fast single-pair SimRank computation,” in Proceedings of the SIAM International Conference on Data Mining (SDM ’10), pp. 571–582, SIAM, Columbus, Ohio, USA, April 2010.

[21] X. Yin, J. Han, and P. S. Yu, “LinkClus: efficient clustering via heterogeneous semantic links,” in Proceedings of the 32nd International Conference on Very Large Data Bases (VLDB ’06), pp. 427–438, 2006.

[22] D. Lizorkin, P. Velikhov, M. Grinev, and D. Turdakov, “Accuracy estimate and optimization techniques for SimRank computation,” in Proceedings of the Very Large Data Bases Endowment, vol. 1, pp. 422–433, 2008.

[23] Y. Cai, G. Cong, X. Jia et al., “Efficient algorithm for computing link-based similarity in real world networks,” in Proceedings of the 9th IEEE International Conference on Data Mining (ICDM ’09), pp. 734–739, December 2009.

[24] G. He, H. Feng, C. Li, and H. Chen, “Parallel SimRank computation on large graphs with iterative aggregation,” in Proceedings of the 16th ACM SIGKDD International Conference on Knowledge Discovery and Data Mining (KDD ’10), pp. 543–552, July 2010.

[25] D. Lizorkin, P. Velikhov, M. Grinev, and D. Turdakov, “Accuracy estimate and optimization techniques for SimRank computation,” The Int’l Journal on Very Large Data Bases, vol. 19, no. 1, pp. 45–66, 2010.

[26] Z. Lin, M. R. Lyu, and I. King, “MatchSim: a novel neighbor-based similarity measure with maximum neighborhood matching,” in Proceedings of the ACM 18th International Conference on Information and Knowledge Management (CIKM ’09), pp. 1613–1616, November 2009.

[27] R. R. Larson, “Bibliometrics of the world wide web: an exploratory analysis of the intellectual structure of cyberspace,” Proceedings of the Annual Meeting of the American Society for Information Science, vol. 33, pp. 71–78, 1996.

[28] J. Pitkow and P. Pirolli, “Life, death, and lawfulness on the electronic frontier,” in Proceedings of the Conference on Human Factors in Computing Systems (CHI ’97), pp. 383–390, March 1997.

[29] J. Han and M. Kamber, Data Mining: Concepts and Techniques, Morgan Kaufmann, 2006.

[30] J. Sun, H. Qu, D. Chakrabarti, and C. Faloutsos, “Neighborhood formation and anomaly detection in bipartite graphs,” in Proceedings of the 5th IEEE International Conference on Data Mining (ICDM ’05), pp. 418–425, November 2005.

[31] H. Tong, C. Faloutsos, and J.-Y. Pan, “Fast random walk with restart and its applications,” in Proceedings of the 6th International Conference on Data Mining (ICDM ’06), pp. 613–622, December 2006.

[32] S. Robinson, “Toward an optimal algorithm for matrix multiplication,” News Journal of the Society for Industrial and Applied Mathematics, vol. 38, no. 9, 2005.

[33] T. Zhang, R. Ramakrishnan, and M. Livny, “BIRCH: an efficient data clustering method for very large databases,” in Proceedings of the ACM SIGMOD International Conference on Management of Data (SIGMOD ’96), pp. 103–114, 1996.



8 The Scientific World Journal

Table 2: Similarities between $a$ and $b$ and between $a'$ and $b'$ using the proposed methods and existing methods in Figure 2.

| | rvs-SimRank | SimRank | P-Rank | WeightedSum | Undirected |
|---|---|---|---|---|---|
| $s(a, b)$ | 0.396 | 0 | 0.423 | 1 | 1 |
| $s(a', b')$ | 0.321 | 0 | 0.408 | 1 | 1 |

whether the results appeal to our intuition. Assuming $\lambda$ and $C$ to be 0.5 and 0.7, respectively, our approach computes the similarity between $a$ and $b$ to be 0.28 using the weightedSum method and 0.6 using the undirected method. For comparison, the existing methods would compute the similarity to be 0, as discussed in Section 3. We cannot tell whether the results obtained by our approach are appropriate, because similarity is by nature a subjective value. Still, we can at least argue that our approach produces results that are more appealing to our intuition than those of the existing methods; that is, the similarity between $a$ and $b$ must not be 0, because they are related indirectly through some common objects. This difference comes from our strategy of considering all the objects indirectly connected to the target objects at once, rather than using the level-wise model.

5. Experiments

In this section, we verify the effectiveness of our approach through experimentation. We present and quantitatively analyze the experimental results obtained by applying the two proposed methods to practical application domains.

5.1. Experimentation Setup. We carried out experiments to verify the performance of the proposed methods with respect to scientific papers and Web documents, two exemplary types of data sets having link information. For the experiments with scientific papers, we used a data set consisting of papers downloaded from DBLP (http://www.informatik.uni-trier.de/~ley/db) and reference information among the papers crawled from Libra (http://academic.research.microsoft.com). In total, 44,800 papers and 126,281 references were used. For the experiments with Web documents, we used a data set consisting of 1,227,038 Web pages with 11,164,829 hyperlinks in total, taken from the TREC (http://trec.nist.gov/data/t11.web.html) 2002 data. The experiments were carried out on a platform with a quad-core 2.67 GHz CPU and the Windows 2008 Server OS.

In the experiments, we aimed to evaluate the performance of the two proposed methods (weightedSum and undirected) in comparison with the three existing methods (rvs-SimRank, SimRank, and P-Rank), with input values of 0.8, 0.5, and 0.15 for $C$, $\lambda$, and the restart probability, respectively, as frequently used in the existing methods [6, 15, 30].

The experiments proceeded as follows. For the weightedSum method, we first assigned appropriate weights to the vectors generated with inlinks and outlinks by varying the weights and finding the ones that achieve the highest accuracy in the similarity computed by the weightedSum method. In the experiments, we tried the weight values of 0, 0.1, 0.3, 0.5, 0.7, 0.9, and 1 for inlinks (note that the in-link and out-link weights sum to 1). Second, we evaluated the accuracy of each method qualitatively by examining the 10 objects that the method computes as the most similar to an arbitrarily chosen target object. Finally, we measured the accuracy of each method quantitatively by comparing the obtained results with the true answers for each data set.
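The weight search described above can be sketched as a simple grid search. This is only an illustration under stated assumptions: the function name `best_inlink_weight` and the callback `accuracy_of` are hypothetical and stand in for whatever routine evaluates the weightedSum method for a given weight balance; they do not come from the paper.

```python
def best_inlink_weight(accuracy_of,
                       inlink_weights=(0.0, 0.1, 0.3, 0.5, 0.7, 0.9, 1.0)):
    """Try each in-link weight and return (w_in, w_out, accuracy) of the best.

    `accuracy_of(w_in, w_out)` is a caller-supplied (hypothetical) function
    that returns the accuracy obtained with that weight balance.
    """
    best = None
    for w_in in inlink_weights:
        # The in-link and out-link weights sum to 1; rounding removes
        # floating-point noise such as 0.09999999999999998.
        w_out = round(1.0 - w_in, 10)
        acc = accuracy_of(w_in, w_out)
        if best is None or acc > best[2]:
            best = (w_in, w_out, acc)
    return best
```

The candidate weights default to the seven values tried in the experiments; ties are resolved in favor of the first weight balance encountered, which is a choice of this sketch.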

The accuracy was measured as follows. First, we chose one object in turn as the target object from the answer set. Then, we computed the recall [29] by extracting the $m$ objects (where $m$ is 10, 20, 30, 40, or 50) most similar to the target object according to each method. This process was repeated until every object in the set had been chosen as the target object. The average of all recall values obtained in this way was taken as the final accuracy value.
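The measurement procedure above can be sketched as follows. This is a minimal illustration, not the authors' code: the names `recall_at_m`, `average_recall`, and `rank_fn` are hypothetical, and the sketch assumes that the target itself is excluded from the set of objects to be retrieved and that `rank_fn(target)` returns candidate objects ordered from most to least similar.

```python
def recall_at_m(ranked, relevant, m):
    """Fraction of the relevant objects found among the top-m ranked objects."""
    if not relevant:
        return 0.0
    hits = sum(1 for obj in ranked[:m] if obj in relevant)
    return hits / len(relevant)


def average_recall(answer_sets, rank_fn, m):
    """Average recall@m, taking every object of every answer set as the target once."""
    scores = []
    for answer_set in answer_sets:
        for target in answer_set:
            # Assumption: the other members of the answer set are the
            # objects that should be retrieved for this target.
            relevant = set(answer_set) - {target}
            ranked = rank_fn(target)  # most similar first
            scores.append(recall_at_m(ranked, relevant, m))
    return sum(scores) / len(scores) if scores else 0.0
```

Calling `average_recall` once for each value of $m$ (10, 20, 30, 40, 50) and each similarity measure would yield one accuracy curve per measure.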

The answers for each data set were constructed in the following manner. For the experiments with scientific papers, we selected five well-known areas (i.e., clustering, sequential pattern mining, spatial databases, link mining, and graph pattern mining) in a data mining textbook [29] and obtained the papers referenced in each section of these areas. We supposed that the papers referenced in the same section are similar to one another. Thus, the papers in a section formed an answer set. Each answer set had 3 to 14 papers, and the total number of papers in all such answer sets was 106. For the experiments with Web documents, we used TREC 2002, which provides Web document sets related to specific keywords. We chose 9 Web document sets randomly from TREC 2002 and used them as the answer sets.

5.2. Domain of Scientific Papers. Figure 6 depicts the accuracy of the similarities among scientific papers obtained by the weightedSum method using various weight values for the vectors generated with inlinks and outlinks. The $x$-axis represents the number of “the most similar” papers selected by the method and the $y$-axis the accuracy. Annotations such as “0.0 : 1.0” represent the weight balances between inlinks and outlinks. According to Figure 6, the accuracy of the weightedSum method tends to become higher when the weight for inlinks gets higher than the out-link weight. This is because the papers in the answer set are relatively famous ones frequently referenced by other papers. For this reason, we conducted the experiments using the “0.9 : 0.1” weight balance in order to obtain the best result.

Figure 6: Accuracy of the weightedSum method with varying weights. ($x$-axis: $m$, from 10 to 50; $y$-axis: accuracy, from 0 to 1; one curve per inlink : outlink weight balance: 0.0 : 1.0, 0.1 : 0.9, 0.3 : 0.7, 0.5 : 0.5, 0.7 : 0.3, 0.9 : 0.1, and 1.0 : 0.0.)

Tables 3, 4, 5, 6, and 7 present the lists of the top 10 papers found to be the most similar to a target paper, [33], using the three existing methods and the two proposed methods, respectively. (Note that [33] is concerned with clustering in data mining.) In the tables, the papers that are not similar to [33] are italicized. In these papers, the authors mainly deal with outlier detection or frequent pattern mining, indicating that the existing methods have reached a wrong conclusion for these papers. In comparison, the results of the weightedSum method (shown in Table 6) include only one wrong entry (on frequent pattern mining) in the top 10 list, and the results of the undirected method include no wrong papers (i.e., all are related to clustering). These results imply that the undirected method performs better than the three existing methods and even the weightedSum method. We repeated this experiment many times with target papers other than [33] and obtained results similar to Tables 3–7.

Figure 7 compares the accuracy of the similarities computed by the three existing methods and the two proposed methods. As in Figure 6, the $x$-axis represents the number of “the most similar” papers selected by each method and the $y$-axis the accuracy. The weightedSum method improved accuracy by 9% on average and by up to 13% compared with SimRank. Also, the undirected method improved accuracy by 16% on average and by up to 20% compared with SimRank. In conclusion, the two proposed methods compute the similarity more accurately than the existing methods. In particular, the undirected method performs the best.

Figure 7: Accuracy of the similarity measures in scientific papers data. ($x$-axis: $m$, from 10 to 50; $y$-axis: accuracy, from 0 to 1; curves: SimRank, rvs-SimRank, P-Rank, WeightedSum, and Undirected.)

5.3. Domain of Web Documents. Figure 8 shows the accuracy of the similarities between Web pages obtained by the weightedSum method using various weight balances between inlinks and outlinks. The $x$-axis represents the number of “the most similar” Web pages selected by the method and the $y$-axis the accuracy. Annotations such as “0.0 : 1.0” represent the weight balances between inlinks and outlinks. Again, the accuracy of the weightedSum method tends to become higher when the weight for inlinks gets higher, as in the case of scientific papers. This is because the answer set contains many Web pages with high authority. For this reason, we conducted the experiments using the “0.9 : 0.1” weight balance in order to obtain the best result.

Figure 8: Accuracy of the weightedSum method with varying weights. ($x$-axis: $m$, from 10 to 50; $y$-axis: accuracy, from 0 to 0.5; one curve per inlink : outlink weight balance: 0.0 : 1.0, 0.1 : 0.9, 0.3 : 0.7, 0.5 : 0.5, 0.7 : 0.3, 0.9 : 0.1, and 1.0 : 0.0.)

Figure 9 compares the accuracy of the similarities computed by the three existing methods and the two proposed methods. Again, the $x$-axis represents the number of “the most similar” documents selected by each method and the $y$-axis the accuracy. The weightedSum method improved accuracy by 20% on average and by up to 24% compared with SimRank. Also, the undirected method improved accuracy by 34% on average and by up to 43% compared with SimRank. In

Table 3: Top 10 papers similar to [33] using SimRank.

| First author | Title | Conference/journal | Year |
|---|---|---|---|
| Guha | CURE: An Efficient Clustering Algorithm for Large Databases | ACM SIGMOD | 1998 |
| Sheikholeslami | WaveCluster: A Multi-Resolution Clustering Approach for Very Large Spatial Databases | VLDB | 1998 |
| Ester | Knowledge Discovery in Large Spatial Databases: Focusing Techniques for Efficient Class Identification | SSD | 1995 |
| Hinneburg | An Efficient Approach to Clustering in Large Multimedia Databases with Noise | AAAI | 1998 |
| Ng | Efficient and Effective Clustering Methods for Spatial Data Mining | VLDB | 1994 |
| Bradley | Scaling Clustering Algorithms to Large Databases | AAAI | 1998 |
| Wang | STING: A Statistical Information Grid Approach to Spatial Data Mining | VLDB | 1997 |
| L. O’Callaghan | Streaming-Data Algorithms for High-Quality Clustering | IEEE ICDE | 2002 |
| Sander | A Density-Based Algorithm for Discovering Clusters in Large Spatial Databases with Noise | DMKD | 1998 |
| Arning | A Linear Method for Deviation Detection in Large Databases | ACM KDD | 1996 |

Table 4: Top 10 papers similar to [33] using rvs-SimRank.

| First author | Title | Conference/journal | Year |
|---|---|---|---|
| Knorr | A Unified Notion of Outliers: Properties and Computation | ACM KDD | 1997 |
| Guha | CURE: An Efficient Clustering Algorithm for Large Databases | ACM SIGMOD | 1998 |
| Sander | A Density-Based Algorithm for Discovering Clusters in Large Spatial Databases with Noise | DMKD | 1998 |
| Bradley | Scaling Clustering Algorithms to Large Databases | AAAI | 1998 |
| Guha | ROCK: A Robust Clustering Algorithm for Categorical Attributes | IEEE ICDE | 1999 |
| Sheikholeslami | WaveCluster: A Multi-Resolution Clustering Approach for Very Large Spatial Databases | VLDB | 1998 |
| Arning | A Linear Method for Deviation Detection in Large Databases | ACM KDD | 1996 |
| Burdick | MAFIA: A Maximal Frequent Itemset Algorithm for Transactional Databases | IEEE TKDE | 2005 |
| Mannila | Efficient Algorithms for Discovering Association Rules | AAAI | 1994 |
| Kamber | Metarule-Guided Mining of Multi-Dimensional Association Rules Using Data Cubes | ACM KDD | 1997 |

Table 5: Top 10 papers similar to [33] using P-Rank.

| First author | Title | Conference/journal | Year |
|---|---|---|---|
| Knorr | A Unified Notion of Outliers: Properties and Computation | ACM KDD | 1997 |
| Guha | CURE: An Efficient Clustering Algorithm for Large Databases | ACM SIGMOD | 1998 |
| Sander | A Density-Based Algorithm for Discovering Clusters in Large Spatial Databases with Noise | DMKD | 1998 |
| Bradley | Scaling Clustering Algorithms to Large Databases | AAAI | 1998 |
| Sheikholeslami | WaveCluster: A Multi-Resolution Clustering Approach for Very Large Spatial Databases | VLDB | 1998 |
| Guha | ROCK: A Robust Clustering Algorithm for Categorical Attributes | IEEE ICDE | 1999 |
| Mannila | Efficient Algorithms for Discovering Association Rules | AAAI | 1994 |
| Arning | A Linear Method for Deviation Detection in Large Databases | ACM KDD | 1996 |
| Silberschatz | What Makes Patterns Interesting in Knowledge Discovery Systems | IEEE TKDE | 1996 |
| Agrawal | Mining Association Rules between Sets of Items in Large Databases | ACM SIGMOD | 1993 |

Table 6: Top 10 papers similar to [33] using weightedSum method.

| First author | Title | Conference/journal | Year |
|---|---|---|---|
| Ester | Knowledge Discovery in Large Spatial Databases: Focusing Techniques for Efficient Class Identification | SSD | 1995 |
| Ng | Efficient and Effective Clustering Methods for Spatial Data Mining | VLDB | 1994 |
| Sheikholeslami | WaveCluster: A Multi-Resolution Clustering Approach for Very Large Spatial Databases | VLDB | 1998 |
| Guha | ROCK: A Robust Clustering Algorithm for Categorical Attributes | IEEE ICDE | 1999 |
| Guha | CURE: An Efficient Clustering Algorithm for Large Databases | ACM SIGMOD | 1998 |
| Wang | STING: A Statistical Information Grid Approach to Spatial Data Mining | VLDB | 1997 |
| Hinneburg | An Efficient Approach to Clustering in Large Multimedia Databases with Noise | AAAI | 1998 |
| Aggarwal | Fast Algorithms for Projected Clustering | ACM SIGMOD | 1999 |
| Agrawal | Automatic Subspace Clustering of High Dimensional Data for Data Mining Applications | ACM SIGMOD | 1998 |
| Sander | A Density-Based Algorithm for Discovering Clusters in Large Spatial Databases with Noise | DMKD | 1998 |

Table 7: Top 10 papers similar to [33] using undirected method.

| First author | Title | Conference/journal | Year |
|---|---|---|---|
| Guha | CURE: An Efficient Clustering Algorithm for Large Databases | ACM SIGMOD | 1998 |
| Ng | Efficient and Effective Clustering Methods for Spatial Data Mining | VLDB | 1994 |
| Sander | A Density-Based Algorithm for Discovering Clusters in Large Spatial Databases with Noise | DMKD | 1998 |
| Bradley | Scaling Clustering Algorithms to Large Databases | AAAI | 1998 |
| Sheikholeslami | WaveCluster: A Multi-Resolution Clustering Approach for Very Large Spatial Databases | VLDB | 1998 |
| Hinneburg | An Efficient Approach to Clustering in Large Multimedia Databases with Noise | AAAI | 1998 |
| Wang | STING: A Statistical Information Grid Approach to Spatial Data Mining | VLDB | 1997 |
| Agrawal | Automatic Subspace Clustering of High Dimensional Data for Data Mining Applications | ACM SIGMOD | 1998 |
| Aggarwal | Fast Algorithms for Projected Clustering | ACM SIGMOD | 1999 |
| Ankerst | OPTICS: Ordering Points To Identify the Clustering Structure | ACM SIGMOD | 1999 |

conclusion, the two proposed methods compute the similarity more accurately than the existing methods. In particular, the undirected method outperforms the weightedSum method and the three existing methods by a large margin.

Figure 9: Accuracy of the similarity measures in Web documents data. ($x$-axis: $m$, from 10 to 50; $y$-axis: accuracy, from 0 to 0.5; curves: SimRank, rvs-SimRank, P-Rank, WeightedSum, and Undirected.)

6. Conclusions

This paper presented new link-based similarity methods that compute the similarity between objects more accurately by using the link information pertaining to the objects. Noticing that most existing link-based similarity methods use the pair-wise and level-wise models, we analyzed the problems with these models and proposed a new approach that does not suffer from them. In our proposed approach, each object is represented by a vector whose elements correspond to all objects in the given universe, and the value of each element is the weight of the corresponding object. As this weight value, we proposed to use the notion of reachability between objects, computed using the “Random Walk with Restart” strategy. Then, we defined the similarity between two objects as the cosine similarity between the two vectors representing them. The proposed approach was then refined into two methods, the weightedSum and the undirected methods, differentiated by their strategy for handling the information on link directions. Examples showed that the two methods do not suffer from the problems of the pair-wise and level-wise models. In our experiments with the scientific papers and Web documents data sets, the results indicated that both proposed methods generally outperform the existing methods significantly.
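The approach summarized above (a reachability vector per object obtained by Random Walk with Restart, followed by cosine similarity) can be illustrated with a small sketch. It is a simplified reading, not the authors' implementation: the function names are hypothetical, link directions are simply ignored in the spirit of the undirected method, the iteration count and the handling of objects without links are assumptions of this sketch, and the restart probability of 0.15 is the value quoted in Section 5.1.

```python
import math


def undirected(links):
    """Turn directed (source, target) links into symmetric neighbour lists."""
    adj = {}
    for s, t in links:
        adj.setdefault(s, set()).add(t)
        adj.setdefault(t, set()).add(s)
    return {v: sorted(nbrs) for v, nbrs in adj.items()}


def rwr_vector(adj, start, restart=0.15, iters=100):
    """Visit probabilities of a random walk that returns to `start`
    with probability `restart` at every step (power iteration)."""
    nodes = sorted(adj)
    p = {v: 0.0 for v in nodes}
    p[start] = 1.0
    for _ in range(iters):
        nxt = {v: 0.0 for v in nodes}
        for v in nodes:
            nbrs = adj[v]
            if not nbrs:
                # Assumption: a walker stuck on an isolated object restarts.
                nxt[start] += (1.0 - restart) * p[v]
                continue
            share = (1.0 - restart) * p[v] / len(nbrs)
            for u in nbrs:
                nxt[u] += share
        nxt[start] += restart  # total probability mass stays 1
        p = nxt
    return [p[v] for v in nodes]


def cosine(x, y):
    """Cosine similarity of two equal-length vectors."""
    dot = sum(a * b for a, b in zip(x, y))
    nx = math.sqrt(sum(a * a for a in x))
    ny = math.sqrt(sum(b * b for b in y))
    return dot / (nx * ny) if nx and ny else 0.0


def similarity(adj, a, b, restart=0.15):
    """Similarity of a and b as the cosine of their reachability vectors."""
    return cosine(rwr_vector(adj, a, restart), rwr_vector(adj, b, restart))
```

For example, with `adj = undirected([("a", "c"), ("b", "c"), ("c", "d")])`, the call `similarity(adj, "a", "b")` is greater than 0 even though `a` and `b` are not linked directly, because both reach the common object `c`.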

Conflict of Interests

The authors declare that there is no conflict of interests regarding the publication of this paper.

Acknowledgments

This work was supported by (1) Business for Cooperative R&D between Industry, Academy, and Research Institute funded by the Korea Small and Medium Business Administration in 2013 (Grants no. C0006278), (2) the MSIP (Ministry of Science, ICT and Future Planning), Korea, under the ITRC (Information Technology Research Center) Support Program (NIPA-2013-H0301-13-4009) supervised by the NIPA (National IT Industry Promotion Agency), (3) the Ministry of Culture, Sports and Tourism (MCST) and the Korea Copyright Commission in 2013, and (4) the National Research Foundation of Korea (NRF) Grant funded by the Korea government (MEST) (no. 2011-0029181).

References

[1] J. Shen, Y. Zhu, H. Zhang, C. Chen, R. Sun, and F. Xu, “A content-based algorithm for blog ranking,” in Proceedings of the International Conference on Internet Computing in Science and Engineering (ICICSE ’08), pp. 19–22, January 2008.

[2] A. Broder, S. Glassman, M. Manasse, and G. Zweig, “Syntactic clustering of the web,” in Proceedings of the 6th International World Wide Web Conference (WWW ’97), pp. 391–404, 1997.

[3] X. Su and T. Khoshgoftaar, “A survey of collaborative filtering techniques,” Advances in Artificial Intelligence, vol. 2009, Article ID 421425, 19 pages, 2009.

[4] S.-H. Yoon, S.-W. Kim, and S. Park, “A link-based similarity measure for scientific literature,” in Proceedings of the 19th International World Wide Web Conference (WWW ’10), pp. 1213–1214, April 2010.

[5] M. R. Hamedani, S. Lee, and S. Kim, “On exploiting content and citations together to compute similarity of scientific papers,” in Proceedings of the ACM International Conference on Information and Knowledge Management (CIKM ’13), San Francisco, Calif, USA, 2013.

[6] P. Zhao, J. Han, and Y. Sun, “P-Rank: a comprehensive structural similarity measure over information networks,” in Proceedings of the ACM 18th International Conference on Information and Knowledge Management (CIKM ’09), pp. 553–562, November 2009.

[7] D.-H. Bae, S.-M. Hwang, S.-W. Kim, and C. Faloutsos, “Constructing seminal paper genealogy,” in Proceedings of the 20th ACM Conference on Information and Knowledge Management (CIKM ’11), pp. 2101–2104, October 2011.

[8] F. Menczer, “Combining link and content analysis to estimate semantic similarity,” in Proceedings of the 13th International World Wide Web Conference (WWW ’04), pp. 452–453, May 2004.

[9] R. Baeza-Yates and B. Ribeiro-Neto, Modern Information Retrieval, Addison Wesley, 1999.

[10] S.-H. Yoon, S.-W. Kim, J.-S. Kim, and W.-S. Hwang, “On computing text-based similarity in scientific literature,” in Proceedings of the 20th International Conference Companion on World Wide Web (WWW ’11), pp. 169–170, April 2011.

[11] G. Bisson and F. Hussain, “χ-Sim: a new similarity measure for the co-clustering task,” in Proceedings of the 7th International Conference on Machine Learning and Applications (ICMLA ’08), pp. 211–217, December 2008.

[12] H. Small, “Co-citation in the scientific literature: a new measure of the relationship between two documents,” Journal of the American Society for Information Science, vol. 24, no. 4, pp. 265–269, 1973.

[13] M. Kessler, “Bibliographic coupling between scientific papers,” Journal of the American Documentation, vol. 14, no. 1, pp. 10–25, 1963.

[14] R. Amsler, “Application of citation-based automatic classification,” Technical Report, The University of Texas at Austin, Linguistics Research Center, 1972.

[15] G. Jeh and J. Widom, “SimRank: a measure of structural-context similarity,” in Proceedings of the 8th ACM SIGKDD International Conference on Knowledge Discovery and Data Mining (KDD ’02), pp. 538–543, July 2002.

[16] U. Shardanand and P. Maes, “Social information filtering: algorithms for automating ‘word of mouth’,” in Proceedings of the SIGCHI Conference on Human Factors in Computing Systems (CHI ’95), pp. 210–217, ACM Press, Denver, Colorado, 1995.

[17] I. Antonellis, H. Garcia-Molina, and C. Chang, “Simrank++: query rewriting through link analysis of the click graph,” in Proceedings of the 34th International Conference on Very Large Data Bases (VLDB ’08), pp. 408–421, Auckland, New Zealand, August 2008.

[18] D. Fogaras and B. Racz, “Scaling link-based similarity search,” in Proceedings of the 14th International Conference on World Wide Web (WWW ’05), pp. 641–650, 2005.

[19] C. Li, J. Han, G. He, et al., “Fast computation of SimRank for static and dynamic information networks,” in Proceedings of the 13th International Conference on Extending Database Technology: Advances in Database Technology (EDBT ’10), pp. 465–476, March 2010.

[20] P. Li, H. Liu, J. Xu Yu, J. He, and X. Du, “Fast single-pair SimRank computation,” in Proceedings of the SIAM International Conference on Data Mining (SDM ’10), pp. 571–582, SIAM, Columbus, Ohio, USA, April 2010.

[21] X. Yin, J. Han, and P. S. Yu, “LinkClus: efficient clustering via heterogeneous semantic links,” in Proceedings of the 32nd International Conference on Very Large Data Bases (VLDB ’06), pp. 427–438, 2006.

[22] D. Lizorkin, P. Velikhov, M. Grinev, and D. Turdakov, “Accuracy estimate and optimization techniques for SimRank computation,” in Proceedings of the Very Large Data Bases Endowment, vol. 1, pp. 422–433, 2008.

[23] Y. Cai, G. Cong, X. Jia, et al., “Efficient algorithm for computing link-based similarity in real world networks,” in Proceedings of the 9th IEEE International Conference on Data Mining (ICDM ’09), pp. 734–739, December 2009.

[24] G. He, H. Feng, C. Li, and H. Chen, “Parallel SimRank computation on large graphs with iterative aggregation,” in Proceedings of the 16th ACM SIGKDD International Conference on Knowledge Discovery and Data Mining (KDD ’10), pp. 543–552, July 2010.

[25] D. Lizorkin, P. Velikhov, M. Grinev, and D. Turdakov, “Accuracy estimate and optimization techniques for SimRank computation,” The Int’l Journal on Very Large Data Bases, vol. 19, no. 1, pp. 45–66, 2010.

[26] Z. Lin, M. R. Lyu, and I. King, “MatchSim: a novel neighbor-based similarity measure with maximum neighborhood matching,” in Proceedings of the ACM 18th International Conference on Information and Knowledge Management (CIKM ’09), pp. 1613–1616, November 2009.

[27] R. R. Larson, “Bibliometrics of the world wide web: an exploratory analysis of the intellectual structure of cyberspace,” Proceedings of the Annual Meeting of the American Society for Information Science, vol. 33, pp. 71–78, 1996.

[28] J. Pitkow and P. Pirolli, “Life, death, and lawfulness on the electronic frontier,” in Proceedings of the Conference on Human Factors in Computing Systems (CHI ’97), pp. 383–390, March 1997.

[29] J. Han and M. Kamber, Data Mining: Concepts and Techniques, Morgan Kaufmann, 2006.

[30] J. Sun, H. Qu, D. Chakrabarti, and C. Faloutsos, “Neighborhood formation and anomaly detection in bipartite graphs,” in Proceedings of the 5th IEEE International Conference on Data Mining (ICDM ’05), pp. 418–425, November 2005.

[31] H. Tong, C. Faloutsos, and J.-Y. Pan, “Fast random walk with restart and its applications,” in Proceedings of the 6th International Conference on Data Mining (ICDM ’06), pp. 613–622, December 2006.

[32] S. Robinson, “Toward an optimal algorithm for matrix multiplication,” News Journal of the Society for Industrial and Applied Mathematics, vol. 38, no. 9, 2005.

[33] T. Zhang, R. Ramakrishnan, and M. Livny, “BIRCH: an efficient data clustering method for very large databases,” in Proceedings of the ACM SIGMOD International Conference on Management of Data (SIGMOD ’96), pp. 103–114, 1996.


Page 9: Research Article Link-Based Similarity Measures Using ...downloads.hindawi.com/journals/tswj/2014/741608.pdf · Research Article Link-Based Similarity Measures Using Reachability

The Scientific World Journal 9

0

01

02

03

04

05

06

07

08

09

1

10 20 30 40 50(m)

(Acc

urac

y)

00 1001 0903 0705 05

07 0309 0110 00

Figure 6 Accuracy of the weightedSum method with varyingweights

frequent patterns indicating that the existing methods havemade a wrong conclusion for these papers In comparisonthe results by the weightedSum method (shown in Table 6)include only one wrong entry (of frequent pattern mining)in the top 10 list and the results by the undirected methoddo not include wrong papers (ie all related to clustering)These results imply that the undirected method performsbetter than the three existing methods and even than theweightedSum method We have repeated this experimentmany times with different target papers other than [33] andobtained results similar to Tables 3ndash7

Figure 7 compares the accuracy of similarities computedby the three existingmethods and the two proposedmethodsAs in Figure 6 the 119909-axis represents the number of ldquothe mostsimilarrdquo papers selected by each method and the 119910-axis theaccuracy The weightedSum method improved accuracy by9 on average and up to 13 compared with SimRank Alsothe undirectedmethod improved accuracy by 16 on averageand up to 20 compared with SimRank In conclusion thetwo proposed methods turn out to compute the similaritymore accurately than the existing methods Especially theundirected method performs the best

53 Domain ofWeb Documents Figure 8 shows the results ofthe accuracy on the similarities betweenWeb pages obtainedby the weightedSum method using various weight valuesbetween inlinks and outlinks The 119909-axis represents thenumber of ldquothe most similarrdquo Web pages selected by themethod and the 119910-axis the accuracy Annotations such asldquo00 10rdquo represent the weight balances between inlinks andoutlinks Again the accuracy of the weightedSum methodtends to become higher when the weight for inlinks getshigher as in the case of scientific papers This is because theanswer set contains many Web pages with high authority

0

01

02

03

04

05

06

07

08

09

1

10 20 30 40 50

SimRankrvs-SimRankP-Rank

WeightedSumUndirected

(m)

(Acc

urac

y)Figure 7 Accuracy of the similarity measures in scientific papersdata

0

01

02

03

04

05

10 20 30 40 50(m)

(Acc

urac

y)

00 1001 0903 0705 05

07 0309 0110 00

Figure 8 Accuracy of the weightedSum method with varying theweights

For this reason we conducted the experiments using theldquo09 01rdquo weight balance in order to obtain the best result

Figure 9 compares the accuracy of similarities computedby the three existingmethods and the two proposedmethodsAgain the 119909-axis represents the number of ldquothemost similarrdquodocuments selected by each method and the 119910-axis theaccuracy The weightedSum method improved accuracy by20 on average and up to 24 compared with SimRankAlso the undirected method improved accuracy by 34on average and up to 43 compared with SimRank In

10 The Scientific World Journal

Table 3 Top 10 papers similar to [33] using SimRank

First author Title Conferencejournal Year

Guha CURE An Efficient Clustering Algorithm for LargeDatabases ACM SIGMOD 1998

Sheikholeslami WaveCluster A multi-Resolution Clustering Approach forVery Large Spatial Databases VLDB 1998

Ester Knowledge Discovery in Large Spatial Databases FocusingTechniques for Efficient Class Identification SSD 1995

Hinneburg An Efficient Approach to Clustering in Large MultimediaDatabases with Noise AAAI 1998

Ng Efficient and Effective Clustering Methods for Spatial DataMining VLDB 1994

Bradley Scaling Clustering Algorithms to Large Databases AAAI 1998

Wang STING A Statistical Information Grid Approach to SpatialData Mining VLDB 1997

L OrsquoCallaghan Streaming-Data Algorithms for High-Quality Clustering IEEE ICDE 2002

Sander A Density-Based Algorithm for Discovering Clusters inLarge Spatial Databases with Noise DMKD 1998

Arning A Linear Method for Deviation Detection in Large Databases ACM KDD 1996

Table 4: Top 10 papers similar to [33] using rvs-SimRank.

| First author | Title | Conference/journal | Year |
|---|---|---|---|
| Knorr | A Unified Notion of Outliers: Properties and Computation | ACM KDD | 1997 |
| Guha | CURE: An Efficient Clustering Algorithm for Large Databases | ACM SIGMOD | 1998 |
| Sander | A Density-Based Algorithm for Discovering Clusters in Large Spatial Databases with Noise | DMKD | 1998 |
| Bradley | Scaling Clustering Algorithms to Large Databases | AAAI | 1998 |
| Guha | ROCK: A Robust Clustering Algorithm for Categorical Attributes | IEEE ICDE | 1999 |
| Sheikholeslami | WaveCluster: A Multi-Resolution Clustering Approach for Very Large Spatial Databases | VLDB | 1998 |
| Arning | A Linear Method for Deviation Detection in Large Databases | ACM KDD | 1996 |
| Burdick | MAFIA: A Maximal Frequent Itemset Algorithm for Transactional Databases | IEEE TKDE | 2005 |
| Mannila | Efficient Algorithms for Discovering Association Rules | AAAI | 1994 |
| Kamber | Metarule-Guided Mining of Multi-Dimensional Association Rules Using Data Cubes | ACM KDD | 1997 |

Table 5: Top 10 papers similar to [33] using P-Rank.

| First author | Title | Conference/journal | Year |
|---|---|---|---|
| Knorr | A Unified Notion of Outliers: Properties and Computation | ACM KDD | 1997 |
| Guha | CURE: An Efficient Clustering Algorithm for Large Databases | ACM SIGMOD | 1998 |
| Sander | A Density-Based Algorithm for Discovering Clusters in Large Spatial Databases with Noise | DMKD | 1998 |
| Bradley | Scaling Clustering Algorithms to Large Databases | AAAI | 1998 |
| Sheikholeslami | WaveCluster: A Multi-Resolution Clustering Approach for Very Large Spatial Databases | VLDB | 1998 |
| Guha | ROCK: A Robust Clustering Algorithm for Categorical Attributes | IEEE ICDE | 1999 |
| Mannila | Efficient Algorithms for Discovering Association Rules | AAAI | 1994 |
| Arning | A Linear Method for Deviation Detection in Large Databases | ACM KDD | 1996 |
| Silberschatz | What Makes Patterns Interesting in Knowledge Discovery Systems | IEEE TKDE | 1996 |
| Agrawal | Mining Association Rules between Sets of Items in Large Databases | ACM SIGMOD | 1993 |


Table 6: Top 10 papers similar to [33] using the weightedSum method.

| First author | Title | Conference/journal | Year |
|---|---|---|---|
| Ester | Knowledge Discovery in Large Spatial Databases: Focusing Techniques for Efficient Class Identification | SSD | 1995 |
| Ng | Efficient and Effective Clustering Methods for Spatial Data Mining | VLDB | 1994 |
| Sheikholeslami | WaveCluster: A Multi-Resolution Clustering Approach for Very Large Spatial Databases | VLDB | 1998 |
| Guha | ROCK: A Robust Clustering Algorithm for Categorical Attributes | IEEE ICDE | 1999 |
| Guha | CURE: An Efficient Clustering Algorithm for Large Databases | ACM SIGMOD | 1998 |
| Wang | STING: A Statistical Information Grid Approach to Spatial Data Mining | VLDB | 1997 |
| Hinneburg | An Efficient Approach to Clustering in Large Multimedia Databases with Noise | AAAI | 1998 |
| Aggarwal | Fast Algorithms for Projected Clustering | ACM SIGMOD | 1999 |
| Agrawal | Automatic Subspace Clustering of High Dimensional Data for Data Mining Applications | ACM SIGMOD | 1998 |
| Sander | A Density-Based Algorithm for Discovering Clusters in Large Spatial Databases with Noise | DMKD | 1998 |

Table 7: Top 10 papers similar to [33] using the undirected method.

| First author | Title | Conference/journal | Year |
|---|---|---|---|
| Guha | CURE: An Efficient Clustering Algorithm for Large Databases | ACM SIGMOD | 1998 |
| Ng | Efficient and Effective Clustering Methods for Spatial Data Mining | VLDB | 1994 |
| Sander | A Density-Based Algorithm for Discovering Clusters in Large Spatial Databases with Noise | DMKD | 1998 |
| Bradley | Scaling Clustering Algorithms to Large Databases | AAAI | 1998 |
| Sheikholeslami | WaveCluster: A Multi-Resolution Clustering Approach for Very Large Spatial Databases | VLDB | 1998 |
| Hinneburg | An Efficient Approach to Clustering in Large Multimedia Databases with Noise | AAAI | 1998 |
| Wang | STING: A Statistical Information Grid Approach to Spatial Data Mining | VLDB | 1997 |
| Agrawal | Automatic Subspace Clustering of High Dimensional Data for Data Mining Applications | ACM SIGMOD | 1998 |
| Aggarwal | Fast Algorithms for Projected Clustering | ACM SIGMOD | 1999 |
| Ankerst | OPTICS: Ordering Points To Identify the Clustering Structure | ACM SIGMOD | 1999 |

conclusion, the two proposed methods compute the similarity more accurately than the existing methods. In particular, the undirected method outperforms the weightedSum method and the three existing methods by a large margin.

6. Conclusions

This paper presented new link-based similarity methods that compute the similarity between objects more accurately by using the link information pertaining to the objects. Noticing that most existing link-based similarity methods use the pair-wise and level-wise models, we analyzed the problems with these models and proposed a new approach that does not suffer from them. In the proposed approach, each object is represented by a vector whose elements correspond to all the objects in the given universe, and the value of each element is the weight of the corresponding object. For this weight, we proposed to use the notion of reachability between objects, computed using the “Random Walk with Restart” strategy. We then defined the similarity between two objects as the cosine similarity between the two vectors representing them. The proposed approach was then refined into two methods,

Figure 9: Accuracy of the similarity measures on the Web documents data (x-axis: m, from 10 to 50; y-axis: accuracy, from 0 to 0.5; series: SimRank, rvs-SimRank, P-Rank, weightedSum, undirected).

the weightedSum and the undirected methods, which differ in their strategy for handling the information on link directions. Examples showed that the two methods do not suffer from the problems of the pair-wise and level-wise models. In our experiments with the scientific papers and Web documents data sets, both of the proposed methods generally outperformed the existing methods significantly.
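The approach summarized above can be illustrated with a minimal Python sketch. It is not the authors' implementation: the function names, the toy citation graph, the restart probability of 0.15, the fixed iteration count, and the handling of nodes without links are all assumptions made for illustration. The sketch covers only an undirected-style variant, read here as ignoring link direction before running Random Walk with Restart; the weightedSum method is not shown.

```python
import math


def undirected(adj):
    """Return a copy of the link graph in which every link can be walked both ways."""
    sym = {node: set(targets) for node, targets in adj.items()}
    for node, targets in adj.items():
        for target in targets:
            sym.setdefault(target, set()).add(node)
    return {node: sorted(neighbors) for node, neighbors in sym.items()}


def reachability_vector(adj, start, restart=0.15, iterations=100):
    """Approximate Random Walk with Restart scores from `start` by power iteration.

    `adj` maps every node to the list of its neighbors; every neighbor must
    itself be a key of `adj`. `restart` and `iterations` are assumed values.
    """
    nodes = sorted(adj)
    score = {node: 0.0 for node in nodes}
    score[start] = 1.0
    for _ in range(iterations):
        nxt = {node: 0.0 for node in nodes}
        for node in nodes:
            walking = (1.0 - restart) * score[node]
            targets = adj[node]
            if targets:
                for target in targets:
                    nxt[target] += walking / len(targets)
            else:
                # Assumption: a node without links sends the walker back to the start.
                nxt[start] += walking
        # The scores always sum to 1, so the restart mass equals `restart`.
        nxt[start] += restart
        score = nxt
    return [score[node] for node in nodes]


def cosine(u, v):
    """Cosine similarity of two equal-length vectors (0.0 if either is all zeros)."""
    dot = sum(a * b for a, b in zip(u, v))
    norm_u = math.sqrt(sum(a * a for a in u))
    norm_v = math.sqrt(sum(b * b for b in v))
    return dot / (norm_u * norm_v) if norm_u and norm_v else 0.0


def similarity(adj, a, b):
    """Similarity of objects a and b as the cosine of their reachability vectors."""
    graph = undirected(adj)
    return cosine(reachability_vector(graph, a), reachability_vector(graph, b))


# Hypothetical toy citation graph: a and b both cite c; d cites e.
citations = {"a": ["c"], "b": ["c"], "c": [], "d": ["e"], "e": []}
print(similarity(citations, "a", "b"))  # positive: a and b reach the same objects
print(similarity(citations, "a", "d"))  # 0.0: a and d reach disjoint sets of objects
```

Because each vector has one element per object in the universe, two objects can obtain a nonzero similarity without being directly linked, as long as their random walks reach common objects.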

Conflict of Interests

The authors declare that there is no conflict of interests regarding the publication of this paper.

Acknowledgments

This work was supported by (1) Business for Cooperative R&D between Industry, Academy, and Research Institute funded by the Korea Small and Medium Business Administration in 2013 (Grants no. C0006278), (2) the MSIP (Ministry of Science, ICT and Future Planning), Korea, under the ITRC (Information Technology Research Center) Support Program (NIPA-2013-H0301-13-4009) supervised by the NIPA (National IT Industry Promotion Agency), (3) the Ministry of Culture, Sports and Tourism (MCST) and the Korea Copyright Commission in 2013, and (4) the National Research Foundation of Korea (NRF) Grant funded by the Korea government (MEST) (no. 2011-0029181).

References

[1] J. Shen, Y. Zhu, H. Zhang, C. Chen, R. Sun, and F. Xu, “A content-based algorithm for blog ranking,” in Proceedings of the International Conference on Internet Computing in Science and Engineering (ICICSE ’08), pp. 19–22, January 2008.

[2] A. Broder, S. Glassman, M. Manasse, and G. Zweig, “Syntactic clustering of the web,” in Proceedings of the 6th International World Wide Web Conference (WWW ’97), pp. 391–404, 1997.

[3] X. Su and T. Khoshgoftaar, “A survey of collaborative filtering techniques,” Advances in Artificial Intelligence, vol. 2009, Article ID 421425, 19 pages, 2009.

[4] S.-H. Yoon, S.-W. Kim, and S. Park, “A link-based similarity measure for scientific literature,” in Proceedings of the 19th International World Wide Web Conference (WWW ’10), pp. 1213–1214, April 2010.

[5] M. R. Hamedani, S. Lee, and S. Kim, “On exploiting content and citations together to compute similarity of scientific papers,” in Proceedings of the ACM International Conference on Information and Knowledge Management (CIKM ’13), San Francisco, Calif, USA, 2013.

[6] P. Zhao, J. Han, and Y. Sun, “P-Rank: a comprehensive structural similarity measure over information networks,” in Proceedings of the ACM 18th International Conference on Information and Knowledge Management (CIKM ’09), pp. 553–562, November 2009.

[7] D.-H. Bae, S.-M. Hwang, S.-W. Kim, and C. Faloutsos, “Constructing seminal paper genealogy,” in Proceedings of the 20th ACM Conference on Information and Knowledge Management (CIKM ’11), pp. 2101–2104, October 2011.

[8] F. Menczer, “Combining link and content analysis to estimate semantic similarity,” in Proceedings of the 13th International World Wide Web Conference (WWW ’04), pp. 452–453, May 2004.

[9] R. Baeza-Yates and B. Ribeiro-Neto, Modern Information Retrieval, Addison Wesley, 1999.

[10] S.-H. Yoon, S.-W. Kim, J.-S. Kim, and W.-S. Hwang, “On computing text-based similarity in scientific literature,” in Proceedings of the 20th International Conference Companion on World Wide Web (WWW ’11), pp. 169–170, April 2011.

[11] G. Bisson and F. Hussain, “χ-Sim: a new similarity measure for the co-clustering task,” in Proceedings of the 7th International Conference on Machine Learning and Applications (ICMLA ’08), pp. 211–217, December 2008.

[12] H. Small, “Co-citation in the scientific literature: a new measure of the relationship between two documents,” Journal of the American Society for Information Science, vol. 24, no. 4, pp. 265–269, 1973.

[13] M. Kessler, “Bibliographic coupling between scientific papers,” Journal of the American Documentation, vol. 14, no. 1, pp. 10–25, 1963.

[14] R. Amsler, “Application of citation-based automatic classification,” Technical Report, The University of Texas at Austin, Linguistics Research Center, 1972.

[15] G. Jeh and J. Widom, “SimRank: a measure of structural-context similarity,” in Proceedings of the 8th ACM SIGKDD International Conference on Knowledge Discovery and Data Mining (KDD ’02), pp. 538–543, July 2002.

[16] U. Shardanand and P. Maes, “Social information filtering: algorithms for automating ‘word of mouth’,” in Proceedings of the SIGCHI Conference on Human Factors in Computing Systems (CHI ’95), pp. 210–217, ACM Press, Denver, Colorado, 1995.

[17] I. Antonellis, H. Garcia-Molina, and C. Chang, “Simrank++: query rewriting through link analysis of the click graph,” in Proceedings of the 34th International Conference on Very Large Data Bases (VLDB ’08), pp. 408–421, Auckland, New Zealand, August 2008.


[18] D. Fogaras and B. Rácz, “Scaling link-based similarity search,” in Proceedings of the 14th International Conference on World Wide Web (WWW ’05), pp. 641–650, 2005.

[19] C. Li, J. Han, G. He et al., “Fast computation of SimRank for static and dynamic information networks,” in Proceedings of the 13th International Conference on Extending Database Technology: Advances in Database Technology (EDBT ’10), pp. 465–476, March 2010.

[20] P. Li, H. Liu, J. Xu Yu, J. He, and X. Du, “Fast single-pair SimRank computation,” in Proceedings of the SIAM International Conference on Data Mining (SDM ’10), pp. 571–582, SIAM, Columbus, Ohio, USA, April 2010.

[21] X. Yin, J. Han, and P. S. Yu, “LinkClus: efficient clustering via heterogeneous semantic links,” in Proceedings of the 32nd International Conference on Very Large Data Bases (VLDB ’06), pp. 427–438, 2006.

[22] D. Lizorkin, P. Velikhov, M. Grinev, and D. Turdakov, “Accuracy estimate and optimization techniques for SimRank computation,” in Proceedings of the Very Large Data Bases Endowment, vol. 1, pp. 422–433, 2008.

[23] Y. Cai, G. Cong, X. Jia et al., “Efficient algorithm for computing link-based similarity in real world networks,” in Proceedings of the 9th IEEE International Conference on Data Mining (ICDM ’09), pp. 734–739, December 2009.

[24] G. He, H. Feng, C. Li, and H. Chen, “Parallel SimRank computation on large graphs with iterative aggregation,” in Proceedings of the 16th ACM SIGKDD International Conference on Knowledge Discovery and Data Mining (KDD ’10), pp. 543–552, July 2010.

[25] D. Lizorkin, P. Velikhov, M. Grinev, and D. Turdakov, “Accuracy estimate and optimization techniques for SimRank computation,” The Int’l Journal on Very Large Data Bases, vol. 19, no. 1, pp. 45–66, 2010.

[26] Z. Lin, M. R. Lyu, and I. King, “MatchSim: a novel neighbor-based similarity measure with maximum neighborhood matching,” in Proceedings of the ACM 18th International Conference on Information and Knowledge Management (CIKM ’09), pp. 1613–1616, November 2009.

[27] R. R. Larson, “Bibliometrics of the world wide web: an exploratory analysis of the intellectual structure of cyberspace,” Proceedings of the Annual Meeting of the American Society for Information Science, vol. 33, pp. 71–78, 1996.

[28] J. Pitkow and P. Pirolli, “Life, death, and lawfulness on the electronic frontier,” in Proceedings of the Conference on Human Factors in Computing Systems (CHI ’97), pp. 383–390, March 1997.

[29] J. Han and M. Kamber, Data Mining: Concepts and Techniques, Morgan Kaufmann, 2006.

[30] J. Sun, H. Qu, D. Chakrabarti, and C. Faloutsos, “Neighborhood formation and anomaly detection in bipartite graphs,” in Proceedings of the 5th IEEE International Conference on Data Mining (ICDM ’05), pp. 418–425, November 2005.

[31] H. Tong, C. Faloutsos, and J.-Y. Pan, “Fast random walk with restart and its applications,” in Proceedings of the 6th International Conference on Data Mining (ICDM ’06), pp. 613–622, December 2006.

[32] S. Robinson, “Toward an optimal algorithm for matrix multiplication,” News Journal of the Society for Industrial and Applied Mathematics, vol. 38, no. 9, 2005.

[33] T. Zhang, R. Ramakrishnan, and M. Livny, “BIRCH: an efficient data clustering method for very large databases,” in Proceedings of the ACM SIGMOD International Conference on Management of Data (SIGMOD ’96), pp. 103–114, 1996.


Page 10: Research Article Link-Based Similarity Measures Using ...downloads.hindawi.com/journals/tswj/2014/741608.pdf · Research Article Link-Based Similarity Measures Using Reachability

10 The Scientific World Journal

Table 3 Top 10 papers similar to [33] using SimRank

First author Title Conferencejournal Year

Guha CURE An Efficient Clustering Algorithm for LargeDatabases ACM SIGMOD 1998

Sheikholeslami WaveCluster A multi-Resolution Clustering Approach forVery Large Spatial Databases VLDB 1998

Ester Knowledge Discovery in Large Spatial Databases FocusingTechniques for Efficient Class Identification SSD 1995

Hinneburg An Efficient Approach to Clustering in Large MultimediaDatabases with Noise AAAI 1998

Ng Efficient and Effective Clustering Methods for Spatial DataMining VLDB 1994

Bradley Scaling Clustering Algorithms to Large Databases AAAI 1998

Wang STING A Statistical Information Grid Approach to SpatialData Mining VLDB 1997

L OrsquoCallaghan Streaming-Data Algorithms for High-Quality Clustering IEEE ICDE 2002

Sander A Density-Based Algorithm for Discovering Clusters inLarge Spatial Databases with Noise DMKD 1998

Arning A Linear Method for Deviation Detection in Large Databases ACM KDD 1996

Table 4 Top 10 papers similar to [33] using rvs-SimRank

First author Title Conferencejournal YearKnorr A Unified Notion of Outliers Properties and Computation ACM KDD 1997Guha CURE An Efficient Clustering Algorithm for Large Databases ACM SIGMOD 1998

Sander A Density-Based Algorithm for Discovering Clusters inLarge Spatial Databases with Noise DMKD 1998

Bradley Scaling Clustering Algorithms to Large Databases AAAI 1998

Guha ROCK A Robust Clustering Algorithm for CategoricalAttributes IEEE ICDE 1999

Sheikholeslami WaveCluster A multi-Resolution Clustering Approach forVery Large Spatial Databases VLDB 1998

Arning A Linear Method for Deviation Detection in Large Databases ACM KDD 1996

Burdick MAFIA A Maximal Frequent Itemset Algorithmfor Transactional Databases IEEE TKDE 2005

Mannila Efficient Algorithms for Discovering Association Rules AAAI 1994

Kamber Metarule-Guided Mining of Multi-DimensionalAssociation Rules Using Data Cubes ACM KDD 1997

Table 5 Top 10 papers similar to [33] using P-Rank

First author Title Conferencejournal YearKnorr A Unified Notion of Outliers Properties and Computation ACM KDD 1997Guha CURE An Efficient Clustering Algorithm for Large Databases ACM SIGMOD 1998

Sander A Density-Based Algorithm for Discovering Clusters inLarge Spatial Databases with Noise DMKD 1998

Bradley Scaling Clustering Algorithms to Large Databases AAAI 1998

Sheikholeslami WaveCluster A multi-Resolution Clustering Approachfor Very Large Spatial Databases VLDB 1998

Guha ROCK A Robust Clustering Algorithm for Categorical Attributes IEEE ICDE 1999Mannila Efficient Algorithms for Discovering Association Rules AAAI 1994Arning A Linear Method for Deviation Detection in Large Databases ACM KDD 1996Silberschatz What Makes Patterns Interesting in Knowledge Discovery Systems IEEE TKDE 1996Agrawal Mining Association Rules between Sets of Items in Large Databases ACM SIGMOD 1993

The Scientific World Journal 11

Table 6 Top 10 papers similar to [33] using weightedSum method

First author Title Conferencejournal Year

Ester Knowledge Discovery in Large Spatial DatabasesFocusing Techniques for Efficient Class Identification SSD 1995

Ng Efficient and Effective Clustering Methods for Spatial DataMining VLDB 1994

Sheikholeslami WaveCluster A multi-Resolution Clustering Approachfor Very Large Spatial Databases VLDB 1998

Guha ROCK A Robust Clustering Algorithm for CategoricalAttributes IEEE ICDE 1999

Guha CURE An Efficient Clustering Algorithm for LargeDatabases ACM SIGMOD 1998

Wang STING A Statistical Information Grid Approach toSpatial Data Mining VLDB 1997

Hinneburg An Efficient Approach to Clusteringin Large Multimedia Databases with Noise AAAI 1998

Aggarwal Fast Algorithms for Projected Clustering ACM SIGMOD 1999

Agrawal Automatic Subspace Clustering of High Dimensional Datafor Data Mining Applications ACM SIGMOD 1998

Sander A Density-Based Algorithm for Discovering Clusters inLarge Spatial Databases with Noise DMKD 1998

Table 7 Top 10 papers similar to [33] using undirected method

First author Title Conferencejournal Year

Guha CURE An Efficient Clustering Algorithm for LargeDatabases ACM SIGMOD 1998

Ng Efficient and Effective Clustering Methods for SpatialData Mining VLDB 1994

Sander A Density-Based Algorithm for Discovering Clusters inLarge Spatial Databases with Noise DMKD 1998

Bradley Scaling Clustering Algorithms to Large Databases AAAI 1998

Sheikholeslami WaveCluster A multi-Resolution Clustering Approachfor Very Large Spatial Databases VLDB 1998

Hinneburg An Efficient Approach to Clusteringin Large Multimedia Databases with Noise AAAI 1998

Wang STING A Statistical Information Grid Approach toSpatial Data Mining VLDB 1997

Agrawal Automatic Subspace Clustering of High Dimensional Datafor Data Mining Applications ACM SIGMOD 1998

Aggarwal Fast Algorithms for Projected Clustering ACM SIGMOD 1999

Ankerst OPTICS Ordering Points To Identify the ClusteringStructure ACM SIGMOD 1999

conclusion the two proposed methods turn out to computethe similarity more accurately than the existing methodsEspecially the undirected method outperforms by a largedegree the weightedSum and the three existing methods

6 Conclusions

This paper presented new link-based similarity methods thatcan compute more accurately the similarity between objectsby using the link information pertaining to the objectsNoticing that most existing link-based similarity methods

use the pair-wise and level-wise models we analyzed theproblems with these models and proposed a new approachthat does not suffer from these problems In our proposedapproach each object is represented by a vector all objectsin the given universe as the elements of the vector and aweight value to each object as the value of the correspondingelement in the vector As for this weight value we proposed toutilize the notion of reachability between objects computedusing the ldquoRandom Walk with Restartrdquo strategy Then wedefined the similarity between two objects as the cosinesimilarity between two vectors representing the two objectsThe proposed approach was then refined into two methods

12 The Scientific World Journal

0

01

02

03

04

05

10 20 30 40 50(m)

(Acc

urac

y)

SimRankrvs-SimRankP-Rank

WeightedSumUndirected

Figure 9 Accuracy of the similarity measures in Web documentsdata

the weightedSum and the undirected methods differentiatedby the strategy to handle the information on link directionsExamples showed that the two methods do not suffer fromthe problems of the pair-wise and level-wise models In ourexperimentation of the proposed methods with the scientificpapers and Web documents data sets the results indicatedthat both of the proposed methods generally outperform theexisting methods significantly

Conflict of Interests

The authors declare that there is no conflict of interestsregarding the publication of this paper

Acknowledgments

This work was supported by (1) Business for CooperativeRampD between Industry Academy and Research Institutefunded Korea Small and Medium Business Administrationin 2013 (Grants no C0006278) (2) the MSIP (Ministry ofScience ICT and Future Planning) Korea under the ITRC(Information Technology Research Center) Support Pro-gram (NIPA-2013-H0301-13-4009) supervised by the NIPA(National IT Industry Promotion Agency) (3) Ministry ofCulture Sports and Tourism (MCST) and from Korea Copy-right Commission in 2013 and (4) the National ResearchFoundation of Korea (NRF) Grant funded by the Koreagovernment (MEST) (no 2011-0029181)

References

[1] J Shen Y Zhu H Zhang C Chen R Sun and F Xu ldquoAcontent-based algorithm for blog rankingrdquo in Proceedings of theInternational Conference on Internet Computing in Science andEngineering (ICICSE rsquo08) pp 19ndash22 January 2008

[2] A Broder S Glassman M Manasse and G Zweig ldquoSyntacticclustering of the webrdquo in Proceedings of the 6th InternationalWorld Wide Web Conference (WWW rsquo97) pp 391ndash404 1997

[3] X Su and T Khoshgoftaar ldquoA survey of collaborative filteringtechniquesrdquoAdvances in Artificial Intelligence vol 2009 ArticleID 421425 19 pages 2009

[4] S-H Yoon S-W Kim and S Park ldquoA link-based similaritymeasure for scientific literaturerdquo in Proceedings of the 19thInternational World Wide Web Conference (WWW rsquo10) pp1213ndash1214 April 2010

[5] M R Hamedani S Lee and S Kim ldquoOn exploiting content andcitations together to compute similarity of scientific papersrdquo inProceedings of the ACM International Conference on Informationand Knowledge Management (CIKM rsquo13) San Francisco CalifUSA 2013

[6] P Zhao J Han andY Sun ldquoP-Rank a comprehensive structuralsimilarity measure over information networksrdquo in Proceedingsof the ACM 18th International Conference on Information andKnowledge Management (CIKM rsquo09) pp 553ndash562 November2009

[7] D-H Bae S-M Hwang S-W Kim and C Faloutsos ldquoCon-structing seminal paper genealogyrdquo in Proceedings of the 20thACM Conference on Information and Knowledge Management(CIKM rsquo11) pp 2101ndash2104 October 2011

[8] F Menczer ldquoCombining link and content analysis to estimatesemantic similarityrdquo in Proceedings of the 13th InternationalWorld Wide Web Conference Proceedings (WWW rsquo04) pp 452ndash453 May 2004

[9] R Baeza-Yates and B Ribeiro-Neto Modern InformationRetrieval Addison Wesley 1999

[10] S-H Yoon S-W Kim J-S Kim and W-S Hwang ldquoOncomputing text-based similarity in scientific literaturerdquo inProceedings of the 20th International Conference Companion onWorld Wide Web (WWW rsquo11) pp 169ndash170 April 2011

[11] G Bisson and F Hussain ldquo120594-Sim a new similarity measure forthe co-clustering taskrdquo in Proceedings of the 7th InternationalConference onMachine Learning and Applications (ICMLA rsquo08)pp 211ndash217 December 2008

[12] H Small ldquoCoCitation in the scientific literature a newmeasureof the relationship between two documentsrdquo Journal of theAmerican Society for Information Science vol 24 no 4 pp 265ndash269 1973

[13] M Kessler ldquoBibliographic coupling between scientific papersrdquoJournal of the American Documentation vol 14 no 1 pp 10ndash251963

[14] R Amsler ldquoApplication of citation-based automatic classifi-cationrdquo Technical Report The University of Texas at AustinLinguistics Research Center 1972

[15] G Jeh and JWidom ldquoSimRank ameasure of structural-contextsimilarityrdquo in Proceedings of the 8th ACMSIGKDD InternationalConference on Knowledge Discovery and Data Mining (KDDrsquo02) pp 538ndash543 July 2002

[16] U Shardanband and P Maes ldquoSocial information filteringalgorithms for automating rdquoword of mouthrdquordquo in Proceedings ofthe SIGCHIConference onHumanFactors in Computing Systems(CHI rsquo95) pp 210ndash217 ACM Press Denver Colorado 1995

[17] I Antonellis H Garcia-Molina and C Chang ldquoSimrank++query rewriting through link analysis of the click graphrdquo inProceedings of the 34th International Conference on Vary LargeData Bases ( VLDB rsquo8) pp 408ndash421 Auckland New ZealandAugust 2008

The Scientific World Journal 13

[18] D Fogaras andB Racz ldquoScaling link-based similarity searchrdquo inProceedings of the 14th International Conference on World WideWeb (WWWrsquo05) pp 641ndash650 2005

[19] C Li J Han G He et al ldquoFast computation of SimRankfor static and dynamic information networksrdquo in Proceedingsof the 13th International Conference on Extending DatabaseTechnology Advances in Database Technology (EDBT rsquo10) pp465ndash476 March 2010

[20] P Li H Liu J Xu Yu J He and X Du ldquoFast single-pairSimRank computationrdquo in Proceedings of the SIAM Interna-tional Conference onDataMining (SDM rsquo10) pp 571ndash582 SIAMColumbus Ohio USA April 2010

[21] X Yin J Han and P S Yu ldquoLinkClus efficient clusteringvia heterogeneous semantic linksrdquo in Proceedings of the 32ndInternational Conference on Very Large Data Bases (VLDB rsquo06)pp 427ndash438 2006

[22] D Lizorkin P VelikhovM Grinev andD Turdakov ldquoAccuracyestimate and optimization techniques for SimRank computa-tionrdquo in Proceedings of the Vary Large Data Bases Endowmentvol 1 pp 422ndash433 2008

[23] Y Cai G Cong X Jia et al ldquoEfficient algorithm for computinglink-based similarity in real world networksrdquo in Proceedings ofthe 9th IEEE International Conference on Data Mining (ICDMrsquo09) pp 734ndash739 December 2009

[24] G He H Feng C Li and H Chen ldquoParallel SimRankcomputation on large graphs with iterative aggregationrdquo inProceedings of the 16th ACM SIGKDD International Conferenceon Knowledge Discovery and Data Mining (KDD rsquo10) pp 543ndash552 July 2010

[25] D Lizorkin P VelikhovM Grinev andD Turdakov ldquoAccuracyestimate and optimization techniques for SimRank computa-tionrdquo The Intrsquol Journal on Very Large Data Bases vol 19 no 1pp 45ndash66 2010

[26] Z Lin M R Lyu and I King ldquoMatchSim a novel neighbor-based similaritymeasure withmaximumneighborhoodmatch-ingrdquo in Proceedings of the ACM 18th International Conference onInformation and Knowledge Management (CIKM rsquo09) pp 1613ndash1616 November 2009

[27] R R Larson ldquoBibliometrics of the world wide web anexploratory analysis of the intellectual structure of cyberspacerdquoProceedings of the Annual Meeting of the American Society forInformation Science vol 33 pp 71ndash78 1996

[28] J Pitkow and P Pirolli ldquoLife death and lawfulness on the elec-tronic frontierrdquo in Proceedings of the Conference on Human Fac-tors in Computing Systems (CHI rsquo97) pp 383ndash390 March 1997

[29] J Han and M Kamber Data Mining Concepts and TechniquesMorgan Kaufmann 2006

[30] J Sun H Qu D Chakrabarti and C Faloutsos ldquoNeighbor-hood formation and anomaly detection in bipartite graphsrdquo inProceedings of the 5th IEEE International Conference on DataMining (ICDM rsquo05) pp 418ndash425 November 2005

[31] H Tong C Faloutsos and J-Y Pan ldquoFast random walkwith restart and its applicationsrdquo in Proceedings of the 6thInternational Conference on Data Mining (ICDM rsquo06) pp 613ndash622 December 2006

[32] S Robinson ldquoToward an optimal algorithm for matrix multi-plicationrdquo News Journal of the Society for Industrial and AppliedMathematics vol 38 no 9 2005

[33] T Zhang R Ramakrishnam andM Livny ldquoBIRCH an efficientdata clustering method for very large databasesrdquo in Proceedingsof the ACM SIGMOD International Conference on Managementof Data (SIGMOD rsquo96) pp 103ndash114 1996

Submit your manuscripts athttpwwwhindawicom

Computer Games Technology

International Journal of

Hindawi Publishing Corporationhttpwwwhindawicom Volume 2014

Hindawi Publishing Corporationhttpwwwhindawicom Volume 2014

Table 6: Top 10 papers similar to [33] using the weightedSum method.

| First author | Title | Conference/journal | Year |
|---|---|---|---|
| Ester | Knowledge Discovery in Large Spatial Databases: Focusing Techniques for Efficient Class Identification | SSD | 1995 |
| Ng | Efficient and Effective Clustering Methods for Spatial Data Mining | VLDB | 1994 |
| Sheikholeslami | WaveCluster: A Multi-Resolution Clustering Approach for Very Large Spatial Databases | VLDB | 1998 |
| Guha | ROCK: A Robust Clustering Algorithm for Categorical Attributes | IEEE ICDE | 1999 |
| Guha | CURE: An Efficient Clustering Algorithm for Large Databases | ACM SIGMOD | 1998 |
| Wang | STING: A Statistical Information Grid Approach to Spatial Data Mining | VLDB | 1997 |
| Hinneburg | An Efficient Approach to Clustering in Large Multimedia Databases with Noise | AAAI | 1998 |
| Aggarwal | Fast Algorithms for Projected Clustering | ACM SIGMOD | 1999 |
| Agrawal | Automatic Subspace Clustering of High Dimensional Data for Data Mining Applications | ACM SIGMOD | 1998 |
| Sander | A Density-Based Algorithm for Discovering Clusters in Large Spatial Databases with Noise | DMKD | 1998 |

Table 7: Top 10 papers similar to [33] using the undirected method.

| First author | Title | Conference/journal | Year |
|---|---|---|---|
| Guha | CURE: An Efficient Clustering Algorithm for Large Databases | ACM SIGMOD | 1998 |
| Ng | Efficient and Effective Clustering Methods for Spatial Data Mining | VLDB | 1994 |
| Sander | A Density-Based Algorithm for Discovering Clusters in Large Spatial Databases with Noise | DMKD | 1998 |
| Bradley | Scaling Clustering Algorithms to Large Databases | AAAI | 1998 |
| Sheikholeslami | WaveCluster: A Multi-Resolution Clustering Approach for Very Large Spatial Databases | VLDB | 1998 |
| Hinneburg | An Efficient Approach to Clustering in Large Multimedia Databases with Noise | AAAI | 1998 |
| Wang | STING: A Statistical Information Grid Approach to Spatial Data Mining | VLDB | 1997 |
| Agrawal | Automatic Subspace Clustering of High Dimensional Data for Data Mining Applications | ACM SIGMOD | 1998 |
| Aggarwal | Fast Algorithms for Projected Clustering | ACM SIGMOD | 1999 |
| Ankerst | OPTICS: Ordering Points To Identify the Clustering Structure | ACM SIGMOD | 1999 |

conclusion, the two proposed methods compute the similarity more accurately than the existing methods. In particular, the undirected method outperforms the weightedSum method and the three existing methods by a large margin.

Figure 9: Accuracy of the similarity measures on the Web documents data (x-axis: m, from 10 to 50; y-axis: accuracy, from 0 to 0.5; series: SimRank, rvs-SimRank, P-Rank, weightedSum, and undirected).

6. Conclusions

This paper presented new link-based similarity methods that compute the similarity between objects more accurately by using the link information pertaining to the objects. Noticing that most existing link-based similarity methods use the pair-wise and level-wise models, we analyzed the problems with these models and proposed a new approach that does not suffer from them. In the proposed approach, each object is represented by a vector whose elements correspond to all objects in the given universe, and the value of each element is a weight for the corresponding object. For this weight, we proposed to use the notion of reachability between objects, computed using the “Random Walk with Restart” strategy. We then defined the similarity between two objects as the cosine similarity between the two vectors representing them. The proposed approach was refined into two methods, the weightedSum and the undirected methods, which differ in how they handle the information on link directions. Examples showed that the two methods do not suffer from the problems of the pair-wise and level-wise models. In our experiments with the scientific papers and Web documents data sets, both proposed methods generally outperformed the existing methods significantly.
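The vector-based approach summarized in these conclusions can be sketched roughly as follows. This is a minimal illustration under stated assumptions, not the authors' implementation: the function names, the restart probability of 0.15, the fixed iteration count, the handling of nodes without links, and the toy graph are all hypothetical choices made for the example. It follows one plausible reading of the undirected method, in which every link is treated as bidirectional before the walk.

```python
import math


def undirected(adj):
    """Return a copy of the graph in which every link is bidirectional."""
    sym = [set(neighbours) for neighbours in adj]
    for i, neighbours in enumerate(adj):
        for j in neighbours:
            sym[j].add(i)
    return [sorted(s) for s in sym]


def rwr_vector(adj, start, restart=0.15, iters=100):
    """Approximate Random Walk with Restart scores from `start`.

    adj[i] lists the neighbours of node i. `restart` is the probability of
    jumping back to `start` at each step (0.15 is an assumed value).
    """
    n = len(adj)
    scores = [0.0] * n
    scores[start] = 1.0
    for _ in range(iters):
        nxt = [0.0] * n
        for i, neighbours in enumerate(adj):
            if not neighbours:
                # Node without links: return its mass to the start node
                # (an assumption made to keep the scores summing to 1).
                nxt[start] += (1.0 - restart) * scores[i]
                continue
            share = (1.0 - restart) * scores[i] / len(neighbours)
            for j in neighbours:
                nxt[j] += share
        nxt[start] += restart
        scores = nxt
    return scores


def cosine(u, v):
    """Cosine similarity of two equal-length vectors."""
    dot = sum(a * b for a, b in zip(u, v))
    norm = math.sqrt(sum(a * a for a in u)) * math.sqrt(sum(b * b for b in v))
    return dot / norm if norm else 0.0


def similarity(adj, a, b, restart=0.15):
    """Cosine similarity of the reachability vectors of nodes a and b."""
    graph = undirected(adj)
    return cosine(rwr_vector(graph, a, restart), rwr_vector(graph, b, restart))


# Toy citation graph: papers 0 and 1 both cite papers 2 and 3; paper 4 cites 5.
toy = [[2, 3], [2, 3], [], [], [5], []]
print(similarity(toy, 0, 1), similarity(toy, 0, 4))
```

In the toy graph, papers 0 and 1 share all of their citations, so their reachability vectors overlap and the similarity is positive, whereas papers 0 and 4 lie in disconnected parts of the graph and their vectors share no nonzero element.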

Conflict of Interests

The authors declare that there is no conflict of interests regarding the publication of this paper.

Acknowledgments

This work was supported by (1) Business for Cooperative R&D between Industry, Academy, and Research Institute funded by the Korea Small and Medium Business Administration in 2013 (Grant no. C0006278), (2) the MSIP (Ministry of Science, ICT and Future Planning), Korea, under the ITRC (Information Technology Research Center) Support Program (NIPA-2013-H0301-13-4009) supervised by the NIPA (National IT Industry Promotion Agency), (3) the Ministry of Culture, Sports and Tourism (MCST) and the Korea Copyright Commission in 2013, and (4) the National Research Foundation of Korea (NRF) Grant funded by the Korea government (MEST) (no. 2011-0029181).

References

[1] J. Shen, Y. Zhu, H. Zhang, C. Chen, R. Sun, and F. Xu, “A content-based algorithm for blog ranking,” in Proceedings of the International Conference on Internet Computing in Science and Engineering (ICICSE ’08), pp. 19–22, January 2008.

[2] A. Broder, S. Glassman, M. Manasse, and G. Zweig, “Syntactic clustering of the web,” in Proceedings of the 6th International World Wide Web Conference (WWW ’97), pp. 391–404, 1997.

[3] X. Su and T. Khoshgoftaar, “A survey of collaborative filtering techniques,” Advances in Artificial Intelligence, vol. 2009, Article ID 421425, 19 pages, 2009.

[4] S.-H. Yoon, S.-W. Kim, and S. Park, “A link-based similarity measure for scientific literature,” in Proceedings of the 19th International World Wide Web Conference (WWW ’10), pp. 1213–1214, April 2010.

[5] M. R. Hamedani, S. Lee, and S. Kim, “On exploiting content and citations together to compute similarity of scientific papers,” in Proceedings of the ACM International Conference on Information and Knowledge Management (CIKM ’13), San Francisco, Calif, USA, 2013.

[6] P. Zhao, J. Han, and Y. Sun, “P-Rank: a comprehensive structural similarity measure over information networks,” in Proceedings of the ACM 18th International Conference on Information and Knowledge Management (CIKM ’09), pp. 553–562, November 2009.

[7] D.-H. Bae, S.-M. Hwang, S.-W. Kim, and C. Faloutsos, “Constructing seminal paper genealogy,” in Proceedings of the 20th ACM Conference on Information and Knowledge Management (CIKM ’11), pp. 2101–2104, October 2011.

[8] F. Menczer, “Combining link and content analysis to estimate semantic similarity,” in Proceedings of the 13th International World Wide Web Conference (WWW ’04), pp. 452–453, May 2004.

[9] R. Baeza-Yates and B. Ribeiro-Neto, Modern Information Retrieval, Addison Wesley, 1999.

[10] S.-H. Yoon, S.-W. Kim, J.-S. Kim, and W.-S. Hwang, “On computing text-based similarity in scientific literature,” in Proceedings of the 20th International Conference Companion on World Wide Web (WWW ’11), pp. 169–170, April 2011.

[11] G. Bisson and F. Hussain, “χ-Sim: a new similarity measure for the co-clustering task,” in Proceedings of the 7th International Conference on Machine Learning and Applications (ICMLA ’08), pp. 211–217, December 2008.

[12] H. Small, “Co-citation in the scientific literature: a new measure of the relationship between two documents,” Journal of the American Society for Information Science, vol. 24, no. 4, pp. 265–269, 1973.

[13] M. Kessler, “Bibliographic coupling between scientific papers,” Journal of the American Documentation, vol. 14, no. 1, pp. 10–25, 1963.

[14] R. Amsler, “Application of citation-based automatic classification,” Technical Report, The University of Texas at Austin, Linguistics Research Center, 1972.

[15] G. Jeh and J. Widom, “SimRank: a measure of structural-context similarity,” in Proceedings of the 8th ACM SIGKDD International Conference on Knowledge Discovery and Data Mining (KDD ’02), pp. 538–543, July 2002.

[16] U. Shardanand and P. Maes, “Social information filtering: algorithms for automating ‘word of mouth’,” in Proceedings of the SIGCHI Conference on Human Factors in Computing Systems (CHI ’95), pp. 210–217, ACM Press, Denver, Colorado, 1995.

[17] I. Antonellis, H. Garcia-Molina, and C. Chang, “Simrank++: query rewriting through link analysis of the click graph,” in Proceedings of the 34th International Conference on Very Large Data Bases (VLDB ’08), pp. 408–421, Auckland, New Zealand, August 2008.

[18] D. Fogaras and B. Racz, “Scaling link-based similarity search,” in Proceedings of the 14th International Conference on World Wide Web (WWW ’05), pp. 641–650, 2005.

[19] C. Li, J. Han, G. He, et al., “Fast computation of SimRank for static and dynamic information networks,” in Proceedings of the 13th International Conference on Extending Database Technology: Advances in Database Technology (EDBT ’10), pp. 465–476, March 2010.

[20] P. Li, H. Liu, J. Xu Yu, J. He, and X. Du, “Fast single-pair SimRank computation,” in Proceedings of the SIAM International Conference on Data Mining (SDM ’10), pp. 571–582, SIAM, Columbus, Ohio, USA, April 2010.

[21] X. Yin, J. Han, and P. S. Yu, “LinkClus: efficient clustering via heterogeneous semantic links,” in Proceedings of the 32nd International Conference on Very Large Data Bases (VLDB ’06), pp. 427–438, 2006.

[22] D. Lizorkin, P. Velikhov, M. Grinev, and D. Turdakov, “Accuracy estimate and optimization techniques for SimRank computation,” in Proceedings of the Very Large Data Bases Endowment, vol. 1, pp. 422–433, 2008.

[23] Y. Cai, G. Cong, X. Jia, et al., “Efficient algorithm for computing link-based similarity in real world networks,” in Proceedings of the 9th IEEE International Conference on Data Mining (ICDM ’09), pp. 734–739, December 2009.

[24] G. He, H. Feng, C. Li, and H. Chen, “Parallel SimRank computation on large graphs with iterative aggregation,” in Proceedings of the 16th ACM SIGKDD International Conference on Knowledge Discovery and Data Mining (KDD ’10), pp. 543–552, July 2010.

[25] D. Lizorkin, P. Velikhov, M. Grinev, and D. Turdakov, “Accuracy estimate and optimization techniques for SimRank computation,” The Int’l Journal on Very Large Data Bases, vol. 19, no. 1, pp. 45–66, 2010.

[26] Z. Lin, M. R. Lyu, and I. King, “MatchSim: a novel neighbor-based similarity measure with maximum neighborhood matching,” in Proceedings of the ACM 18th International Conference on Information and Knowledge Management (CIKM ’09), pp. 1613–1616, November 2009.

[27] R. R. Larson, “Bibliometrics of the world wide web: an exploratory analysis of the intellectual structure of cyberspace,” Proceedings of the Annual Meeting of the American Society for Information Science, vol. 33, pp. 71–78, 1996.

[28] J. Pitkow and P. Pirolli, “Life, death, and lawfulness on the electronic frontier,” in Proceedings of the Conference on Human Factors in Computing Systems (CHI ’97), pp. 383–390, March 1997.

[29] J. Han and M. Kamber, Data Mining: Concepts and Techniques, Morgan Kaufmann, 2006.

[30] J. Sun, H. Qu, D. Chakrabarti, and C. Faloutsos, “Neighborhood formation and anomaly detection in bipartite graphs,” in Proceedings of the 5th IEEE International Conference on Data Mining (ICDM ’05), pp. 418–425, November 2005.

[31] H. Tong, C. Faloutsos, and J.-Y. Pan, “Fast random walk with restart and its applications,” in Proceedings of the 6th International Conference on Data Mining (ICDM ’06), pp. 613–622, December 2006.

[32] S. Robinson, “Toward an optimal algorithm for matrix multiplication,” News Journal of the Society for Industrial and Applied Mathematics, vol. 38, no. 9, 2005.

[33] T. Zhang, R. Ramakrishnan, and M. Livny, “BIRCH: an efficient data clustering method for very large databases,” in Proceedings of the ACM SIGMOD International Conference on Management of Data (SIGMOD ’96), pp. 103–114, 1996.

12 The Scientific World Journal

0

01

02

03

04

05

10 20 30 40 50(m)

(Acc

urac

y)

SimRankrvs-SimRankP-Rank

WeightedSumUndirected

Figure 9 Accuracy of the similarity measures in Web documentsdata

the weightedSum and the undirected methods differentiatedby the strategy to handle the information on link directionsExamples showed that the two methods do not suffer fromthe problems of the pair-wise and level-wise models In ourexperimentation of the proposed methods with the scientificpapers and Web documents data sets the results indicatedthat both of the proposed methods generally outperform theexisting methods significantly

Conflict of Interests

The authors declare that there is no conflict of interestsregarding the publication of this paper

Acknowledgments

This work was supported by (1) Business for CooperativeRampD between Industry Academy and Research Institutefunded Korea Small and Medium Business Administrationin 2013 (Grants no C0006278) (2) the MSIP (Ministry ofScience ICT and Future Planning) Korea under the ITRC(Information Technology Research Center) Support Pro-gram (NIPA-2013-H0301-13-4009) supervised by the NIPA(National IT Industry Promotion Agency) (3) Ministry ofCulture Sports and Tourism (MCST) and from Korea Copy-right Commission in 2013 and (4) the National ResearchFoundation of Korea (NRF) Grant funded by the Koreagovernment (MEST) (no 2011-0029181)

References

[1] J Shen Y Zhu H Zhang C Chen R Sun and F Xu ldquoAcontent-based algorithm for blog rankingrdquo in Proceedings of theInternational Conference on Internet Computing in Science andEngineering (ICICSE rsquo08) pp 19ndash22 January 2008

[2] A Broder S Glassman M Manasse and G Zweig ldquoSyntacticclustering of the webrdquo in Proceedings of the 6th InternationalWorld Wide Web Conference (WWW rsquo97) pp 391ndash404 1997

[3] X Su and T Khoshgoftaar ldquoA survey of collaborative filteringtechniquesrdquoAdvances in Artificial Intelligence vol 2009 ArticleID 421425 19 pages 2009

[4] S-H Yoon S-W Kim and S Park ldquoA link-based similaritymeasure for scientific literaturerdquo in Proceedings of the 19thInternational World Wide Web Conference (WWW rsquo10) pp1213ndash1214 April 2010

[5] M R Hamedani S Lee and S Kim ldquoOn exploiting content andcitations together to compute similarity of scientific papersrdquo inProceedings of the ACM International Conference on Informationand Knowledge Management (CIKM rsquo13) San Francisco CalifUSA 2013

[6] P Zhao J Han andY Sun ldquoP-Rank a comprehensive structuralsimilarity measure over information networksrdquo in Proceedingsof the ACM 18th International Conference on Information andKnowledge Management (CIKM rsquo09) pp 553ndash562 November2009

[7] D-H Bae S-M Hwang S-W Kim and C Faloutsos ldquoCon-structing seminal paper genealogyrdquo in Proceedings of the 20thACM Conference on Information and Knowledge Management(CIKM rsquo11) pp 2101ndash2104 October 2011

[8] F Menczer ldquoCombining link and content analysis to estimatesemantic similarityrdquo in Proceedings of the 13th InternationalWorld Wide Web Conference Proceedings (WWW rsquo04) pp 452ndash453 May 2004

[9] R Baeza-Yates and B Ribeiro-Neto Modern InformationRetrieval Addison Wesley 1999

[10] S-H Yoon S-W Kim J-S Kim and W-S Hwang ldquoOncomputing text-based similarity in scientific literaturerdquo inProceedings of the 20th International Conference Companion onWorld Wide Web (WWW rsquo11) pp 169ndash170 April 2011

[11] G Bisson and F Hussain ldquo120594-Sim a new similarity measure forthe co-clustering taskrdquo in Proceedings of the 7th InternationalConference onMachine Learning and Applications (ICMLA rsquo08)pp 211ndash217 December 2008

[12] H Small ldquoCoCitation in the scientific literature a newmeasureof the relationship between two documentsrdquo Journal of theAmerican Society for Information Science vol 24 no 4 pp 265ndash269 1973

[13] M Kessler ldquoBibliographic coupling between scientific papersrdquoJournal of the American Documentation vol 14 no 1 pp 10ndash251963

[14] R Amsler ldquoApplication of citation-based automatic classifi-cationrdquo Technical Report The University of Texas at AustinLinguistics Research Center 1972

[15] G Jeh and JWidom ldquoSimRank ameasure of structural-contextsimilarityrdquo in Proceedings of the 8th ACMSIGKDD InternationalConference on Knowledge Discovery and Data Mining (KDDrsquo02) pp 538ndash543 July 2002

[16] U Shardanband and P Maes ldquoSocial information filteringalgorithms for automating rdquoword of mouthrdquordquo in Proceedings ofthe SIGCHIConference onHumanFactors in Computing Systems(CHI rsquo95) pp 210ndash217 ACM Press Denver Colorado 1995

[17] I Antonellis H Garcia-Molina and C Chang ldquoSimrank++query rewriting through link analysis of the click graphrdquo inProceedings of the 34th International Conference on Vary LargeData Bases ( VLDB rsquo8) pp 408ndash421 Auckland New ZealandAugust 2008

The Scientific World Journal 13

[18] D Fogaras andB Racz ldquoScaling link-based similarity searchrdquo inProceedings of the 14th International Conference on World WideWeb (WWWrsquo05) pp 641ndash650 2005

[19] C Li J Han G He et al ldquoFast computation of SimRankfor static and dynamic information networksrdquo in Proceedingsof the 13th International Conference on Extending DatabaseTechnology Advances in Database Technology (EDBT rsquo10) pp465ndash476 March 2010

[20] P Li H Liu J Xu Yu J He and X Du ldquoFast single-pairSimRank computationrdquo in Proceedings of the SIAM Interna-tional Conference onDataMining (SDM rsquo10) pp 571ndash582 SIAMColumbus Ohio USA April 2010

[21] X Yin J Han and P S Yu ldquoLinkClus efficient clusteringvia heterogeneous semantic linksrdquo in Proceedings of the 32ndInternational Conference on Very Large Data Bases (VLDB rsquo06)pp 427ndash438 2006

[22] D Lizorkin P VelikhovM Grinev andD Turdakov ldquoAccuracyestimate and optimization techniques for SimRank computa-tionrdquo in Proceedings of the Vary Large Data Bases Endowmentvol 1 pp 422ndash433 2008

[23] Y Cai G Cong X Jia et al ldquoEfficient algorithm for computinglink-based similarity in real world networksrdquo in Proceedings ofthe 9th IEEE International Conference on Data Mining (ICDMrsquo09) pp 734ndash739 December 2009

[24] G He H Feng C Li and H Chen ldquoParallel SimRankcomputation on large graphs with iterative aggregationrdquo inProceedings of the 16th ACM SIGKDD International Conferenceon Knowledge Discovery and Data Mining (KDD rsquo10) pp 543ndash552 July 2010

[25] D Lizorkin P VelikhovM Grinev andD Turdakov ldquoAccuracyestimate and optimization techniques for SimRank computa-tionrdquo The Intrsquol Journal on Very Large Data Bases vol 19 no 1pp 45ndash66 2010

[26] Z Lin M R Lyu and I King ldquoMatchSim a novel neighbor-based similaritymeasure withmaximumneighborhoodmatch-ingrdquo in Proceedings of the ACM 18th International Conference onInformation and Knowledge Management (CIKM rsquo09) pp 1613ndash1616 November 2009

[27] R R Larson ldquoBibliometrics of the world wide web anexploratory analysis of the intellectual structure of cyberspacerdquoProceedings of the Annual Meeting of the American Society forInformation Science vol 33 pp 71ndash78 1996

[28] J Pitkow and P Pirolli ldquoLife death and lawfulness on the elec-tronic frontierrdquo in Proceedings of the Conference on Human Fac-tors in Computing Systems (CHI rsquo97) pp 383ndash390 March 1997

[29] J Han and M Kamber Data Mining Concepts and TechniquesMorgan Kaufmann 2006

[30] J Sun H Qu D Chakrabarti and C Faloutsos ldquoNeighbor-hood formation and anomaly detection in bipartite graphsrdquo inProceedings of the 5th IEEE International Conference on DataMining (ICDM rsquo05) pp 418ndash425 November 2005

[31] H Tong C Faloutsos and J-Y Pan ldquoFast random walkwith restart and its applicationsrdquo in Proceedings of the 6thInternational Conference on Data Mining (ICDM rsquo06) pp 613ndash622 December 2006

[32] S Robinson ldquoToward an optimal algorithm for matrix multi-plicationrdquo News Journal of the Society for Industrial and AppliedMathematics vol 38 no 9 2005

[33] T Zhang R Ramakrishnam andM Livny ldquoBIRCH an efficientdata clustering method for very large databasesrdquo in Proceedingsof the ACM SIGMOD International Conference on Managementof Data (SIGMOD rsquo96) pp 103ndash114 1996

The Scientific World Journal 13

[18] D Fogaras andB Racz ldquoScaling link-based similarity searchrdquo inProceedings of the 14th International Conference on World WideWeb (WWWrsquo05) pp 641ndash650 2005

[19] C Li J Han G He et al ldquoFast computation of SimRankfor static and dynamic information networksrdquo in Proceedingsof the 13th International Conference on Extending DatabaseTechnology Advances in Database Technology (EDBT rsquo10) pp465ndash476 March 2010

[20] P Li H Liu J Xu Yu J He and X Du ldquoFast single-pairSimRank computationrdquo in Proceedings of the SIAM Interna-tional Conference onDataMining (SDM rsquo10) pp 571ndash582 SIAMColumbus Ohio USA April 2010

[21] X Yin J Han and P S Yu ldquoLinkClus efficient clusteringvia heterogeneous semantic linksrdquo in Proceedings of the 32ndInternational Conference on Very Large Data Bases (VLDB rsquo06)pp 427ndash438 2006

[22] D Lizorkin P VelikhovM Grinev andD Turdakov ldquoAccuracyestimate and optimization techniques for SimRank computa-tionrdquo in Proceedings of the Vary Large Data Bases Endowmentvol 1 pp 422ndash433 2008

[23] Y Cai G Cong X Jia et al ldquoEfficient algorithm for computinglink-based similarity in real world networksrdquo in Proceedings ofthe 9th IEEE International Conference on Data Mining (ICDMrsquo09) pp 734ndash739 December 2009

[24] G He H Feng C Li and H Chen ldquoParallel SimRankcomputation on large graphs with iterative aggregationrdquo inProceedings of the 16th ACM SIGKDD International Conferenceon Knowledge Discovery and Data Mining (KDD rsquo10) pp 543ndash552 July 2010

[25] D Lizorkin P VelikhovM Grinev andD Turdakov ldquoAccuracyestimate and optimization techniques for SimRank computa-tionrdquo The Intrsquol Journal on Very Large Data Bases vol 19 no 1pp 45ndash66 2010

[26] Z Lin M R Lyu and I King ldquoMatchSim a novel neighbor-based similaritymeasure withmaximumneighborhoodmatch-ingrdquo in Proceedings of the ACM 18th International Conference onInformation and Knowledge Management (CIKM rsquo09) pp 1613ndash1616 November 2009

[27] R R Larson ldquoBibliometrics of the world wide web anexploratory analysis of the intellectual structure of cyberspacerdquoProceedings of the Annual Meeting of the American Society forInformation Science vol 33 pp 71ndash78 1996

[28] J Pitkow and P Pirolli ldquoLife death and lawfulness on the elec-tronic frontierrdquo in Proceedings of the Conference on Human Fac-tors in Computing Systems (CHI rsquo97) pp 383ndash390 March 1997

[29] J Han and M Kamber Data Mining Concepts and TechniquesMorgan Kaufmann 2006

[30] J Sun H Qu D Chakrabarti and C Faloutsos ldquoNeighbor-hood formation and anomaly detection in bipartite graphsrdquo inProceedings of the 5th IEEE International Conference on DataMining (ICDM rsquo05) pp 418ndash425 November 2005

[31] H Tong C Faloutsos and J-Y Pan ldquoFast random walkwith restart and its applicationsrdquo in Proceedings of the 6thInternational Conference on Data Mining (ICDM rsquo06) pp 613ndash622 December 2006

[32] S Robinson ldquoToward an optimal algorithm for matrix multi-plicationrdquo News Journal of the Society for Industrial and AppliedMathematics vol 38 no 9 2005

[33] T Zhang R Ramakrishnam andM Livny ldquoBIRCH an efficientdata clustering method for very large databasesrdquo in Proceedingsof the ACM SIGMOD International Conference on Managementof Data (SIGMOD rsquo96) pp 103ndash114 1996
