author's personal copy - xidiansee.xidian.edu.cn/iiip/mggong/down/physa2012gong.pdf ·...

This article appeared in a journal published by Elsevier. The attached copy is furnished to the author for internal non-commercial research and education use, including for instruction at the authors institution and sharing with colleagues. Other uses, including reproduction and distribution, or selling or licensing copies, or posting to personal, institutional or third party websites are prohibited. In most cases authors are permitted to post their version of the article (e.g. in Word or Tex form) to their personal website or institutional repository. Authors requiring further information regarding Elsevier’s archiving and manuscript policies are encouraged to visit: http://www.elsevier.com/copyright



Physica A 391 (2012) 4050–4060

Contents lists available at SciVerse ScienceDirect

Physica A

journal homepage: www.elsevier.com/locate/physa

Community detection in networks by using multiobjective evolutionary algorithm with decomposition

Maoguo Gong a,∗, Lijia Ma a, Qingfu Zhang a,b, Licheng Jiao a

a Key Laboratory of Intelligent Perception and Image Understanding of Ministry of Education, Xidian University, Xi’an, Shaanxi Province 710071, China
b School of Computer Science & Electronic Engineering, University of Essex, Wivenhoe Park, Colchester, CO4 3SQ, UK

Article info

Article history: Received 29 August 2011; Received in revised form 14 January 2012; Available online 28 March 2012

Keywords: Community detection; Complex network; Multiobjective optimization; Evolutionary algorithm; Decomposition

Abstract

Community structure is an important property of complex networks. Most optimization-based community detection algorithms employ a single optimization criterion. In this study, community detection is solved as a multiobjective optimization problem by using the multiobjective evolutionary algorithm based on decomposition. The proposed algorithm simultaneously maximizes the density of internal degrees and minimizes the density of external degrees. It can produce a set of solutions that represent various divisions of the network at different hierarchical levels. The number of communities is automatically determined by the non-dominated individuals resulting from our algorithm. Experiments on both synthetic and real-world network datasets verify that our algorithm is highly efficient at discovering quality community structure.

© 2012 Elsevier B.V. All rights reserved.

1. Introduction

Many real-world complex systems can be represented as complex networks. Collaboration networks, the World Wide Web, power grids, biological networks and social networks are some examples. Networks can be modeled as graphs, where nodes (or vertices) represent the objects and edges represent the interactions among these objects. The study of complex networks has attracted many researchers from different fields such as physics, mathematics, biology, and sociology. Besides a number of distinctive properties such as the ‘‘small world effect’’ and right-skewed degree distributions, community structure is another important property of a complex network [1]. Qualitatively, a community is defined as a subset of nodes within the graph such that connections between the nodes are denser than connections with the rest of the network [2,3]. Community detection in complex networks is potentially very useful. Nodes belonging to the same community are more likely to have properties in common. For instance, in the World Wide Web, community analysis has uncovered thematic clusters [4,5].

A large number of community detection algorithms have been proposed in the last decade [6–10]. Optimization-based methods form a main branch of existing community detection algorithms. In these methods, community detection is modeled as an optimization problem in which objective functions that measure the quality of network partitions are optimized. Optimization-based community detection methods thus connect two important fields, optimization and community detection. Their efficiency depends on the search ability of the optimization algorithms and on the objective functions they employ.

In the field of optimization, Evolutionary Algorithms (EAs) have become increasingly popular. They are parallel in nature, do not require differentiability of the objective functions and constraints, and deal with a set of possible solutions in a single run. In Ref. [11], a multiobjective genetic algorithm for community detection, named MOGA-Net, was proposed. The author indicated that community structure detection is a problem that can naturally be formulated with two different objectives, one for the maximization of internal links and the other for the minimization of external links. The two objectives in MOGA-Net, named Community Score and Community Fitness, were introduced in Refs. [12,13], respectively. Both objective functions have a positive real-valued parameter controlling the size of the communities. The higher the value of the parameter, the smaller the size of the communities found. The algorithm employed the Non-dominated Sorting Genetic Algorithm (NSGA-II) [14] to optimize the two objectives.

∗ Corresponding author. Tel.: +86 29 88202661. E-mail addresses: [email protected], [email protected] (M. Gong).

0378-4371/$ – see front matter © 2012 Elsevier B.V. All rights reserved. doi:10.1016/j.physa.2012.03.021

In this paper, we propose a novel community detection algorithm based on a multiobjective evolutionary algorithm with decomposition, termed MOEA/D-Net. We use the Multiobjective Evolutionary Algorithm based on Decomposition (MOEA/D) proposed by Zhang and Li in [15] to simultaneously optimize two new conflicting objectives, negative ratio association [16] and ratio cut [17]. The negative ratio association and the ratio cut have the potential to balance each other’s tendency to increase or decrease the number of communities, and both objectives are related to the density of subgraphs, which helps to overcome the resolution limit. These features make the two objectives suitable for revealing community structure in networks. A large body of literature has shown that MOEA/D is a very effective method for multiobjective optimization problems (MOPs) [15,18–20]. It has lower computational complexity than NSGA-II, and can generate a uniform distribution of representative nondominated solutions on the Pareto-optimal front (PF, see the following section for details). These features make MOEA/D suitable for community detection when a multiobjective evolutionary algorithm is needed. MOEA/D-Net selectively explores the search space without the need to know the exact number of groups in advance, and returns not just a single partitioning of the network, but a set of solutions. Each of these solutions corresponds to a different trade-off between the two objectives and thus to a different partitioning of the network with a different number of clusters. Experiments on computer-generated and real-world networks show the effectiveness of our algorithm. They also show that the proposed method is able to uncover meaningful hierarchical community structure in the networks.

The remainder of this paper is organized as follows. In the next section, a description of the problem, the concept of multiobjective optimization, and existing community detection algorithms are given. In Section 3, we describe the proposed MOEA/D-Net in detail. In Section 4, experimental studies are presented. Finally, concluding remarks are given.

2. Related background

2.1. Definition of community in networks

Let us consider a network $N$ modeled as a graph $G = (V, E)$, where $V$ is a set of vertices (or nodes) and $E$ is a set of edges (or links), each connecting two vertices. A community in a network is a group of vertices with dense connections among them and relatively sparse connections to other groups. As mentioned above, this definition of community is vague, and there is no general agreement on the concept of density. However, a more formal definition was introduced in Ref. [2] by considering the degree $k_i$ of a node $i$. The graph $G$ is represented by an adjacency matrix $A$, where the entry $A_{ij}$ is 1 if there is an edge from node $i$ to node $j$, and 0 otherwise. The degree of node $i$ is defined as $k_i = \sum_j A_{ij}$. Suppose that node $i$ belongs to a sub-graph $S \subset G$. The degree of $i$ with respect to $S$ can be split as

$$k_i(S) = k_i^{\mathrm{in}}(S) + k_i^{\mathrm{out}}(S),$$

where $k_i^{\mathrm{in}}(S) = \sum_{j \in S} A_{ij}$ is the number of edges connecting $i$ to the other nodes in $S$, and $k_i^{\mathrm{out}}(S) = \sum_{j \notin S} A_{ij}$ is the number of edges connecting $i$ to the rest of the network. The sub-graph $S$ is a community in a strong sense if $k_i^{\mathrm{in}}(S) > k_i^{\mathrm{out}}(S)$, $\forall i \in S$, and a community in a weak sense if $\sum_{i \in S} k_i^{\mathrm{in}}(S) > \sum_{i \in S} k_i^{\mathrm{out}}(S)$. This means that, in a strong community, each node has more connections within the community than with the rest of the graph, and in a weak community, the sum of the degrees within the sub-graph is larger than the sum of the degrees towards the rest of the network.
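The strong and weak definitions can be checked directly from the adjacency matrix. The following is a minimal sketch, assuming an undirected 0/1 NumPy adjacency matrix with 0-based node indices; the function name `community_sense` and the toy graph are illustrative and do not come from the paper.

```python
import numpy as np

def community_sense(A, S):
    """Classify the node subset S as a 'strong' community, a 'weak' one, or 'none'."""
    A = np.asarray(A)
    members = np.zeros(A.shape[0], dtype=bool)
    members[list(S)] = True
    k_in = A[members][:, members].sum(axis=1)    # k_i^in(S) for each i in S
    k_out = A[members][:, ~members].sum(axis=1)  # k_i^out(S) for each i in S
    if np.all(k_in > k_out):
        return "strong"
    if k_in.sum() > k_out.sum():
        return "weak"
    return "none"

# Hypothetical toy graph: two triangles joined by the single edge 2-3.
A = np.zeros((6, 6), dtype=int)
for i, j in [(0, 1), (0, 2), (1, 2), (3, 4), (3, 5), (4, 5), (2, 3)]:
    A[i, j] = A[j, i] = 1

print(community_sense(A, {0, 1, 2}))     # strong
print(community_sense(A, {0, 1, 2, 3}))  # weak
print(community_sense(A, {2, 3}))        # none
```

In the second call, node 3 has one link inside the subset and two outside, so the strong condition fails while the summed condition still holds.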

2.2. Multiobjective optimization

Multiobjective optimization seeks to optimize a vector of functions [21]

$$\min F(x) = (f_1(x), f_2(x), \ldots, f_k(x))^T, \tag{1}$$

subject to $x = (x_1, x_2, \ldots, x_m) \in \Omega$, where $x$ is called the decision vector and $\Omega$ is the feasible region in decision space. Considering a minimization problem for each objective, a decision vector $x_A \in \Omega$ is said to dominate another vector $x_B \in \Omega$ (written as $x_A \succ x_B$) if and only if

$$\forall i = 1, 2, \ldots, k:\ f_i(x_A) \le f_i(x_B) \ \wedge\ \exists j = 1, 2, \ldots, k:\ f_j(x_A) < f_j(x_B). \tag{2}$$

We say that a vector of decision variables $x^* \in \Omega$ is a Pareto-optimal solution or nondominated solution if there does not exist another $x \in \Omega$ such that $x \succ x^*$.

Then the Pareto-optimal set is defined as

$$PS^* \triangleq \{x^* \in \Omega \mid \neg\exists x \in \Omega,\ x \succ x^*\}. \tag{3}$$

So the Pareto-optimal set is the set of all nondominated solutions. The image of the Pareto-optimal set in the objective function space,

$$PF^* = \{F(x^*) = (f_1(x^*), f_2(x^*), \ldots, f_k(x^*))^T \mid x^* \in PS^*\}, \tag{4}$$

is called the Pareto-optimal front. The aim of a multiobjective optimization algorithm is to find a set of nondominated solutions approximating the true Pareto-optimal front.
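The dominance relation of Eq. (2) and the nondominated filtering behind Eq. (3) can be illustrated on a finite list of objective vectors. This is a minimal sketch under minimization; the function names and the sample points are illustrative only.

```python
def dominates(fa, fb):
    """True if objective vector fa dominates fb under minimization, as in Eq. (2)."""
    no_worse = all(a <= b for a, b in zip(fa, fb))
    strictly_better = any(a < b for a, b in zip(fa, fb))
    return no_worse and strictly_better

def nondominated(points):
    """Keep the points that no other point in the list dominates."""
    return [p for p in points if not any(dominates(q, p) for q in points)]

# Hypothetical objective vectors (f1, f2), both to be minimized.
points = [(1.0, 4.0), (2.0, 2.0), (3.0, 3.0), (4.0, 1.0)]
front = nondominated(points)
print(front)  # [(1.0, 4.0), (2.0, 2.0), (4.0, 1.0)]
```

The point (3.0, 3.0) is dropped because (2.0, 2.0) is better in both objectives; the remaining three points are mutually incomparable.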

In the last few years, many efforts have been devoted to the application of evolutionary computation to the development of multiobjective optimization algorithms. Many multiobjective evolutionary algorithms have been proposed [14,15,21–25]. Among them, the Multiobjective Evolutionary Algorithm based on Decomposition [15] proposed by Zhang and Li has been shown in the literature to be a very effective method for multiobjective optimization problems [15,18–20]. Because of its good performance, the proposed multiobjective community detection algorithm is based on MOEA/D. In Section 3, our multiobjective algorithm MOEA/D-Net for community detection is described.

2.3. Related works

In recent years, many approaches to revealing community structure in networks have been proposed. In particular, modularity optimization is the best-known community detection method, which was proposed by Girvan and Newman [1]. They used the concept of modularity as the criterion to stop the division of a network into sub-networks in their divisive hierarchical clustering algorithm.

Fast greedy modularity optimization was introduced by Clauset et al. [26]. This method is essentially a fast implementation of a previous technique proposed in Ref. [1]. Starting from a set of isolated nodes, the links of the original graph are iteratively added so as to produce the largest possible increase of the modularity of Newman and Girvan at each step [22]. In the following we refer to this method as FM.

In Ref. [27], Rosvall and Bergstrom turned the problem of finding the best cluster structure of a graph into the problem of optimally compressing the information on the structure of the graph, so that one can recover the original structure as closely as possible when the compressed information is decoded. In the following we refer to this method as InfoMap.

In Ref. [28], we proposed a memetic algorithm for optimizing the modularity density (D) [29], which we named Meme-Net, to reveal the community structure of a network. Meme-Net also shows an ability to explore the network at different resolutions and to reveal the hierarchical structure of the network.

GA-Net [12], a genetic algorithm for community detection in social networks proposed by Pizzuti, introduced the concept of community score to measure the quality of a partitioning of a network into communities, and tried to optimize this quantity with a genetic algorithm. In GA-Net, only one objective function, the community score, is optimized, so only one solution is obtained per run. Unlike most existing methods, the algorithm does not require the number of communities in advance. This number is automatically determined by the optimal value of the community score.

Pizzuti proposed MOGA-Net in Ref. [11], which employs a multiobjective genetic algorithm to uncover community structure in complex networks. This algorithm introduces two objective functions. The first employs the concept of community score to measure the quality of the division of a network into communities. The higher the community score, the denser the clustering obtained. The second defines the fitness of the nodes belonging to a module and iteratively finds the modules with the highest sum of node fitness, referred to in the following as community fitness. When this sum reaches its maximum value, the number of external links is minimized. Both objective functions have a positive real-valued parameter controlling the size of the communities. The higher the value of the parameter, the smaller the size of the communities found. MOGA-Net exploits the benefits of these two functions and obtains the communities present in the network by selectively exploring the search space, without the need to know the exact number of groups in advance. This number is automatically determined by the optimal compromise values of the objectives. An interesting result of the multiobjective approach is that it returns not a single partitioning of the network, but a set of solutions [11]. Each of these solutions corresponds to a different trade-off between the two objectives and thus to a different partitioning of the network with a different number of clusters. This gives the reader a chance to analyze several partitions at different hierarchical levels.

Modularity Q has been widely used recently. For example, Ref. [13] optimized network modularity with a genetic algorithm to detect communities. That approach is scalable to very large networks and does not need any a priori knowledge about the number of communities or any threshold value.

However, Fortunato and Barthélemy [30] showed mathematically that the optimization of modularity has a resolution limit, raising important concerns about the reliability of the modules detected so far using this technique, and possibly using some other quality functions.

3. The proposed MOEA/D-Net for community detection

In this section, we describe our MOEA/D-Net algorithm for community detection. First, MOEA/D is introduced, which represents the state-of-the-art approach in the field of multiobjective evolutionary algorithms. Then, the objective functions, the genetic encoding that suitably represents a partitioning of a network, and the modified variation operators used to work with this encoding are described.

3.1. An introduction to MOEA/D

Zhang and Li proposed MOEA/D in Ref. [15]. The idea behind MOEA/D rests on a basic strategy in traditional multiobjective optimization: a nondominated solution to a MOP, under mild conditions, can be an optimal solution of a scalar optimization problem in which the objective is an aggregation of all the $f_i$'s. Therefore, approximation of the PF can be decomposed into a number of scalar objective optimization subproblems.

MOEA/D decomposes an MOP into a number of scalar optimization subproblems and optimizes them simultaneously by evolving a population of solutions. At each generation, the population is composed of the best solution found so far (i.e., since the start of the run of the algorithm) for each subproblem. The neighborhood relations among these subproblems are defined based on the distances between their aggregation coefficient vectors. The optimal solutions to two neighboring subproblems should be very similar. Each subproblem (i.e., scalar aggregation function) is optimized in MOEA/D by using information from its neighboring subproblems.
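The neighborhood structure described above can be sketched for the two-objective case. The code below assumes evenly spaced weight vectors of the form (λ, 1 − λ) and Euclidean distance between them; the even spacing is a common choice for MOEA/D but is an assumption here, and the names `weight_vectors` and `neighborhoods` are illustrative.

```python
import numpy as np

def weight_vectors(n_sub):
    """Evenly spaced two-objective weight vectors (lambda_1, 1 - lambda_1). Assumed layout."""
    w1 = np.linspace(0.0, 1.0, n_sub)
    return np.column_stack([w1, 1.0 - w1])

def neighborhoods(weights, T):
    """For each subproblem, the indices of its T closest weight vectors (itself included)."""
    diff = weights[:, None, :] - weights[None, :, :]
    dist = np.linalg.norm(diff, axis=2)
    return np.argsort(dist, axis=1, kind="stable")[:, :T]

W = weight_vectors(5)     # 5 subproblems
B = neighborhoods(W, 3)   # neighborhood size T = 3
print(W)
print(B)
```

Each row of `B` lists the subproblems whose current solutions may be used as mating partners, or be replaced, when the corresponding subproblem is processed.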

There are several methods for constructing aggregation functions. The most popular ones include the weighted sum approach and the Tchebycheff approach. In our algorithm, the Tchebycheff approach is used. Both objectives used in this study are not continuous, so we cannot simply conclude whether the PF is concave or not, and if the PF is nonconcave, the weighted sum approach would not work well. This is why we prefer the Tchebycheff approach. Following Ref. [15], the Tchebycheff approach defines the scalar optimization problem in the form

$$\min\ g^{te}(x \mid \lambda, z^*) = \max_{1 \le i \le m} \{\lambda_i\, |f_i(x) - z_i^*|\} \quad \text{subject to } x \in \Omega, \tag{5}$$

where $m$ here denotes the number of objectives (written $k$ in Eq. (1)) and $z^* = (z_1^*, z_2^*, \ldots, z_m^*)$ is the reference point, i.e., $z_i^* = \min\{f_i(x) \mid x \in \Omega\}$ for each $i = 1, 2, \ldots, m$. For each nondominated point $x^*$, there exists a weight vector $\lambda$ such that $x^*$ is the optimal solution of (5), and each optimal solution of (5) is a nondominated solution of the MOP (1). Therefore, one is able to obtain different nondominated solutions by altering the weight vector.
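As a small numerical illustration of Eq. (5), the sketch below evaluates the Tchebycheff aggregation for one objective vector under two different weight vectors. The function name and the numbers are made up for illustration; in practice the reference point is usually estimated from the best objective values seen so far.

```python
import numpy as np

def tchebycheff(f, lam, z_star):
    """g^te(x | lambda, z*) = max_i lambda_i * |f_i(x) - z*_i| for the objective vector f = F(x)."""
    f, lam, z_star = (np.asarray(v, dtype=float) for v in (f, lam, z_star))
    return float(np.max(lam * np.abs(f - z_star)))

z_star = [-10.0, 0.0]   # hypothetical reference point
f = [-8.0, 3.0]         # hypothetical objective vector

print(tchebycheff(f, [0.5, 0.5], z_star))  # 1.5
print(tchebycheff(f, [0.9, 0.1], z_star))  # 1.8
```

Changing the weight vector changes which objective's gap to the reference point dominates the maximum, which is how different subproblems steer the search towards different parts of the front.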

As mentioned in Ref. [15], MOEA/D has lower computational complexity at each generation than NSGA-II, and with a small population it is able to produce a small number of very evenly distributed solutions. In this paper, our goal is to show the effectiveness of the proposed method in detecting the community structure of a network, and to illustrate that our method is able to reveal the hierarchical community structure of the network. We prefer a small number of evenly distributed solutions to a large number of solutions containing too much unhelpful information. More details on the decomposition method and the general framework of MOEA/D can be found in Refs. [15,18].

In the following, the objective functions, genetic encoding and the modified genetic operators are described.

3.2. Objective functions

For the evaluation objectives, we are interested in selecting those that reflect fundamentally different aspects of a good community partition. Modularity density is a foundational quality index for community detection [29].

Consider an undirected graph $G = (V, E)$ with $|V| = n$ vertices and $|E| = e$ edges, and let $A$ be its adjacency matrix. If $V_1$ and $V_2$ are two subsets of $V$, we define $L(V_1, V_2) = \sum_{i \in V_1, j \in V_2} A_{ij}$ and $L(V_1, \bar{V}_1) = \sum_{i \in V_1, j \in \bar{V}_1} A_{ij}$, where $\bar{V}_1 = V \setminus V_1$. Given a partition $S = (V_1, V_2, \ldots, V_m)$ of the graph, where $V_i$ is the vertex set of subgraph $G_i$ for $i = 1, 2, \ldots, m$, the modularity density is defined as

$$D = \sum_{i=1}^{m} \frac{L(V_i, V_i) - L(V_i, \bar{V}_i)}{|V_i|}. \tag{6}$$

In this equation, each summand is the ratio between the difference of the internal and external degrees of the subgraph $G_i$ and the size of the subgraph. The first term of $D$ is equivalent to the ratio association [16] and the second term is equivalent to the ratio cut [17]. The larger the value of $D$, the more accurate the partition. To maximize the modularity density $D$, we should maximize the first term and minimize the second term. Generally, maximizing the ratio association tends to divide a network into small, densely interconnected communities [16], while minimizing the ratio cut tends to divide a network into large communities that are sparsely connected to the rest. Therefore, these two complementary terms reflect two fundamental aspects of a good partition, and the modularity density is an intrinsic trade-off between these two objectives.

In this paper, we select these two terms as the objective functions. In order to formulate the problem as a minimization problem, we negate the first term. The first objective, called the Negative Ratio Association (NRA), is therefore defined as

$$NRA = -\sum_{i=1}^{m} \frac{L(V_i, V_i)}{|V_i|}. \tag{7}$$

The other objective, called the Ratio Cut (RC), is defined as

$$RC = \sum_{i=1}^{m} \frac{L(V_i, \bar{V}_i)}{|V_i|}, \tag{8}$$

with $\bar{V}_i = V \setminus V_i$.

Fig. 1. Illustration of the locus-based adjacency scheme. Left: one possible genotype. Middle: translate the genotype into the graph structure (the graph is shown as directed only to aid in understanding how it originates from the genotype). Right: the final clusters (every connected component is interpreted as an individual cluster).

The motivations for adopting these two criteria rather than others as the objective functions are as follows. Firstly, Angelini et al. [16] pointed out that the Ratio Association is a decreasing function of the number of communities. The opposite trend holds for the ratio cut metric, because as the number of communities increases, more edges fall between communities (i.e., $L(V_i, \bar{V}_i)$ becomes larger) and the number of nodes in a community becomes smaller. So these two functions have the potential to balance each other’s tendency to increase or decrease the number of communities. Secondly, Fortunato and Barthélemy [30] pointed out that the main reason the resolution limit appears in the modularity measure is that modularity does not contain information on the number of nodes in a community, and the choice of partition is highly sensitive to the total number of links in the network. Both of our criteria are related to the density of subgraphs and are not overly sensitive to the total number of links in the network. The Ratio Association can be considered as the sum of the link densities within communities, and the Ratio Cut as the sum of the link densities between communities. Thirdly, both objectives reflect a fundamental aspect of a good partition. Finally, after many experiments, we find that these two functions are empirically more suitable.

Therefore, the two criteria NRA and RC can be treated as two conflicting objective functions. The corresponding MOP can be described as

$$\begin{cases} \min f_1 = NRA = -\sum_{i=1}^{m} \dfrac{L(V_i, V_i)}{|V_i|} \\[2mm] \min f_2 = RC = \sum_{i=1}^{m} \dfrac{L(V_i, \bar{V}_i)}{|V_i|} \end{cases} \tag{9}$$
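The two objectives of Eq. (9) can be computed directly from an adjacency matrix and a community label per node. The sketch below assumes an undirected 0/1 NumPy adjacency matrix; the function name `nra_rc` and the toy graph (two triangles joined by one edge) are illustrative and are not taken from the paper's experiments.

```python
import numpy as np

def nra_rc(A, labels):
    """Return (NRA, RC) of Eqs. (7)-(8) for adjacency matrix A and one community label per node."""
    A = np.asarray(A, dtype=float)
    labels = np.asarray(labels)
    nra, rc = 0.0, 0.0
    for c in np.unique(labels):
        inside = labels == c
        internal = A[inside][:, inside].sum()    # L(V_i, V_i): each internal edge counted twice
        external = A[inside][:, ~inside].sum()   # L(V_i, complement of V_i)
        nra -= internal / inside.sum()
        rc += external / inside.sum()
    return nra, rc

# Hypothetical toy graph: two triangles joined by the single edge 2-3.
A = np.zeros((6, 6), dtype=int)
for i, j in [(0, 1), (0, 2), (1, 2), (3, 4), (3, 5), (4, 5), (2, 3)]:
    A[i, j] = A[j, i] = 1

split = nra_rc(A, [0, 0, 0, 1, 1, 1])   # NRA = -4,    RC = 2/3
merged = nra_rc(A, [0, 0, 0, 0, 0, 0])  # NRA = -14/6, RC = 0
print(split, merged)
```

On this graph the two-community split has the better (lower) NRA, while the single merged community has the better (lower) RC, so neither partition dominates the other. This is the kind of trade-off the two objectives are meant to expose.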

3.3. Representation

MOEA/D-Net adopts the locus-based adjacency representation [31]. In this graph-based representation, each individual $g$ of the population consists of $N$ genes $g_1, g_2, \ldots, g_N$, and each $g_i$ can take allele values $j$ in the range $1, 2, \ldots, N$. Genes and alleles represent nodes of the graph $G = (V, E)$ modeling a network. A value of $j$ assigned to the $i$th gene is interpreted as a link between nodes $i$ and $j$, which means that in the resulting clustering solution they will be in the same cluster. Decoding this representation requires the identification of all connected components. All nodes belonging to the same connected component are then assigned to one cluster. This decoding step can be performed in linear time, as observed in Ref. [31]. A main advantage of this representation is that there is no need to fix the number of clusters in advance, as it is automatically determined in the decoding step. Fig. 1 illustrates the locus-based adjacency scheme for a network of 7 nodes.
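The decoding step can be sketched with a union-find structure, which finds the connected components in near-linear time. This is one possible implementation, not necessarily the one used by the authors. The sketch uses 0-based node indices (the text above uses 1-based), and the genotype shown is a made-up example rather than the one in Fig. 1.

```python
def decode(genotype):
    """Map a locus-based genotype (gene i holds a node index j) to one cluster label per node."""
    n = len(genotype)
    parent = list(range(n))

    def find(x):
        # Follow parent pointers to the component root, shortening the path on the way.
        while parent[x] != x:
            parent[x] = parent[parent[x]]
            x = parent[x]
        return x

    for i, j in enumerate(genotype):
        parent[find(i)] = find(j)   # the link i - j places both nodes in one component

    roots = {}
    # Relabel component roots as 0, 1, 2, ... in order of first appearance.
    return [roots.setdefault(find(i), len(roots)) for i in range(n)]

# Hypothetical genotype for a 7-node network (0-based indices).
genotype = [1, 2, 0, 4, 3, 6, 5]
print(decode(genotype))  # [0, 0, 0, 1, 1, 2, 2]
```

The number of clusters (three here) is not specified anywhere in the genotype; it emerges from the decoding.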

3.4. Initialization

In order to avoid uninteresting divisions containing unconnected nodes, our initialization process takes into account the effective connections of the network. For each individual, the allele value $j$ assigned to the $i$th gene is randomly selected from the neighbors of node $i$. This initialization process improves the convergence of the algorithm because the space of possible solutions is restricted.
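A minimal sketch of this initialization, assuming the network is given as a list of neighbor lists with 0-based indices and that every node has at least one neighbor. The uniform choice among neighbors and the name `init_individual` are assumptions made for illustration.

```python
import random

def init_individual(neighbors, rng=random):
    """Gene i receives a uniformly chosen neighbor of node i (assumes no isolated nodes)."""
    return [rng.choice(neighbors[i]) for i in range(len(neighbors))]

# Hypothetical neighbor lists: two triangles joined by the edge 2-3.
neighbors = [[1, 2], [0, 2], [0, 1, 3], [2, 4, 5], [3, 5], [3, 4]]

# One individual per seed, so the example population is reproducible.
population = [init_individual(neighbors, random.Random(seed)) for seed in range(100)]
print(population[0])
```

Every gene of every individual then encodes an edge that actually exists in the network.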

3.5. Crossover

We choose two-point crossover in favor of uniform crossover because two-point crossover can better maintain the effective connections of the nodes in the network. Given two parents $A$ and $B$, we first randomly select two points $i$ and $j$ (i.e., $1 \le i \le j \le N$), and then everything between the two points is swapped between the parents (i.e., $A_k \leftrightarrow B_k, \forall k \in \{k \mid i \le k \le j\}$). An example of the operation of two-point crossover on the encoding employed is shown in Fig. 2.
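The two-point crossover can be sketched as follows, with 0-based indices and genotypes stored as Python lists. The function name and the parent genotypes are illustrative.

```python
import random

def two_point_crossover(a, b, rng=random):
    """Swap genes i..j (inclusive) between parents a and b and return both children."""
    n = len(a)
    i, j = sorted(rng.randrange(n) for _ in range(2))
    child_a = a[:i] + b[i:j + 1] + a[j + 1:]
    child_b = b[:i] + a[i:j + 1] + b[j + 1:]
    return child_a, child_b

# Hypothetical parents for a 6-node network (0-based indices).
parent_a = [1, 2, 0, 2, 5, 3]
parent_b = [2, 0, 3, 4, 3, 4]
child_a, child_b = two_point_crossover(parent_a, parent_b, random.Random(1))
print(child_a, child_b)
```

Because each child gene is copied from one of the parents at the same position, a child of two parents that only encode existing edges also only encodes existing edges.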

3.6. Neighbor-based mutation

Fig. 2. A and B are two parent genotypes and their corresponding graph structures. A random two-point crossover of the genotypes yields the child C, which has inherited much of its structure from its parents, but differs from both of them.

In this process, we randomly pick a chromosome $C$ to be mutated. Then we apply one-point neighbor-based mutation to this chromosome: a gene $i$ is picked at random on the chromosome, and the possible values of its allele are restricted to the neighbors of node $i$ (i.e., $C_i \leftarrow j$, $j \in \{j \mid A_{ij} = 1\}$). The neighbor-based mutation guarantees that, in a mutated child, each node is linked only with one of its neighbors. This avoids useless exploration of the search space, for the same reason as noted for the initialization process.
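The one-point neighbor-based mutation can be sketched as below, again assuming neighbor lists with 0-based indices. Choosing the new allele uniformly among the neighbors is an assumption, and the name `neighbor_mutation` is illustrative.

```python
import random

def neighbor_mutation(chrom, neighbors, rng=random):
    """Return a copy of chrom in which one random gene is reassigned to a neighbor of that node."""
    child = list(chrom)
    i = rng.randrange(len(child))
    child[i] = rng.choice(neighbors[i])
    return child

# Hypothetical neighbor lists and chromosome (two triangles joined by the edge 2-3).
neighbors = [[1, 2], [0, 2], [0, 1, 3], [2, 4, 5], [3, 5], [3, 4]]
chrom = [1, 2, 0, 4, 5, 3]
mutant = neighbor_mutation(chrom, neighbors, random.Random(0))
print(mutant)
```

The mutant differs from the original in at most one gene, and that gene still points to a neighbor of its node.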

3.7. The main loop of the MOEA/D-Net algorithm

After introducing MOEA/D, the genetic representation and the modified variation operators, we give a summary description of our MOEA/D-Net algorithm for community detection.

Given a network $N$ and the graph $G$ modeling it, MOEA/D-Net optimizes the two objectives (7) and (8) presented in Section 3.2. It decomposes the two-objective optimization problem into a number of scalar optimization subproblems and optimizes them simultaneously by evolving a population of solutions. The population is initialized as described above. Every individual generates a graph structure in which each component is a connected subgraph of $G$. Each subproblem is optimized by using information from its neighboring subproblems. At the end of the procedure, MOEA/D-Net returns a set of solutions. Each of these solutions corresponds to a different trade-off between the two objectives and thus to a different partitioning of the network with a different number of clusters. This gives a chance to reveal the hierarchical structure of the network. After that, we use the concept of modularity, introduced by Girvan and Newman [3] to assess the quality of a partitioning, to select, among the solutions found, the one with the highest modularity.
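To make the flow of the main loop concrete, the following is a simplified sketch that combines the pieces described in this section. It is not the authors' implementation. In particular, the evenly spaced weight vectors, the random choice of two parents from the neighborhood, applying mutation to a child with probability `pm`, and capping replacements at `n_update` per child are all assumptions. All names are illustrative, and the network is assumed to have no isolated nodes.

```python
import random
import numpy as np

def decode(g):
    """Connected components of the links i - g[i], returned as one label per node."""
    parent = list(range(len(g)))
    def find(x):
        while parent[x] != x:
            parent[x] = parent[parent[x]]
            x = parent[x]
        return x
    for i, j in enumerate(g):
        parent[find(i)] = find(j)
    return np.array([find(i) for i in range(len(g))])

def objectives(A, g):
    """(NRA, RC) of Eq. (9) for genotype g."""
    labels = decode(g)
    nra = rc = 0.0
    for c in np.unique(labels):
        s = labels == c
        nra -= A[s][:, s].sum() / s.sum()
        rc += A[s][:, ~s].sum() / s.sum()
    return np.array([nra, rc])

def moead_net(A, n_sub=20, T=5, gens=50, pm=0.06, n_update=2, seed=0):
    rng = random.Random(seed)
    n = len(A)
    nbrs = [np.flatnonzero(A[i]).tolist() for i in range(n)]
    w1 = np.linspace(0.0, 1.0, n_sub)
    W = np.column_stack([w1, 1.0 - w1])                        # assumed weight layout
    B = np.argsort(np.abs(w1[:, None] - w1[None, :]), axis=1)[:, :T]
    pop = [[rng.choice(nbrs[i]) for i in range(n)] for _ in range(n_sub)]
    F = np.array([objectives(A, g) for g in pop])
    z = F.min(axis=0)                                          # running reference point
    g_te = lambda f, k: np.max(W[k] * np.abs(f - z))           # Tchebycheff value, Eq. (5)
    for _ in range(gens):
        for k in range(n_sub):
            pa, pb = (pop[int(q)] for q in rng.sample(list(B[k]), 2))
            i, j = sorted(rng.randrange(n) for _ in range(2))
            child = pa[:i] + pb[i:j + 1] + pa[j + 1:]          # two-point crossover
            if rng.random() < pm:                              # neighbor-based mutation
                m = rng.randrange(n)
                child[m] = rng.choice(nbrs[m])
            fc = objectives(A, child)
            z = np.minimum(z, fc)
            updated = 0
            for q in rng.sample(list(B[k]), T):                # update neighboring solutions
                if updated < n_update and g_te(fc, q) <= g_te(F[q], q):
                    pop[q], F[q] = child, fc
                    updated += 1
    return pop, F

# Hypothetical toy graph: two triangles joined by the single edge 2-3.
A = np.zeros((6, 6), dtype=int)
for a, b in [(0, 1), (0, 2), (1, 2), (3, 4), (3, 5), (4, 5), (2, 3)]:
    A[a, b] = A[b, a] = 1

pop, F = moead_net(A, n_sub=6, T=3, gens=10)
print(F)
```

The returned population holds one solution per subproblem. A final step, not shown here, would decode each genotype and keep the partition with the highest modularity, as described in the text.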

4. Experimental results

In this section, we compare our algorithm MOEA/D-Net with the Fast Modularity algorithm (FM) on the extension of the GN benchmark network and on four real-world networks to show the effectiveness of the proposed method. We also compare our algorithm with Meme-Net [28], which optimizes the single objective modularity density $D$ with a memetic algorithm, and with Rosvall and Bergstrom’s algorithm (InfoMap) [27,32] on the extension of the GN benchmark network. The parameters of MOEA/D-Net are as follows: the number of subproblems (i.e., the population size) is 100, the neighborhood parameter is 10, the mutation rate is 0.06, the number of generations is 400, and the update size is 2.

4.1. Evaluation metrics

In order to evaluate the quality of the partitions obtained, we first introduce two commonly used evaluation metrics, namely the Normalized Mutual Information (NMI) and the modularity ($Q$).

The NMI, as an external measure, is adopted to estimate the similarity between the true partitions and the detected ones. It is a similarity measure shown to be reliable by Danon et al. [33]. Given two partitions $p_1$ and $p_2$ of a network into communities, let $C$ be the confusion matrix whose element $C_{ij}$ is the number of nodes of community $i$ of partition $p_1$ that are also in community $j$ of partition $p_2$. The normalized mutual information $I(p_1, p_2)$ is defined as

$$I(p_1, p_2) = \frac{-2 \sum_{i=1}^{c_{p_1}} \sum_{j=1}^{c_{p_2}} C_{ij} \log\left(C_{ij} N / (C_{i\cdot} C_{\cdot j})\right)}{\sum_{i=1}^{c_{p_1}} C_{i\cdot} \log(C_{i\cdot}/N) + \sum_{j=1}^{c_{p_2}} C_{\cdot j} \log(C_{\cdot j}/N)},$$

where $c_{p_1}$ ($c_{p_2}$) is the number of groups in partition $p_1$ ($p_2$), $C_{i\cdot}$ ($C_{\cdot j}$) is the sum of the elements of $C$ in row $i$ (column $j$), and $N$ is the number of nodes (note that some of the notation here differs from that of previous sections, for convenience). If $p_1 = p_2$, then $I(p_1, p_2) = 1$; if $p_1$ and $p_2$ are completely different, then $I(p_1, p_2) = 0$. A larger value of NMI represents a greater similarity between $p_1$ and $p_2$.
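The NMI formula can be evaluated from two label vectors as sketched below. The convention that terms with $C_{ij} = 0$ contribute zero, and the return value of 1 when both partitions consist of a single community (where the formula gives 0/0), are assumptions made for this sketch. The function name `nmi` is illustrative.

```python
import numpy as np

def nmi(labels_a, labels_b):
    """Normalized mutual information between two partitions given as label sequences."""
    a_ids, a = np.unique(labels_a, return_inverse=True)
    b_ids, b = np.unique(labels_b, return_inverse=True)
    C = np.zeros((len(a_ids), len(b_ids)))
    np.add.at(C, (a, b), 1)                  # confusion matrix C_ij
    N = C.sum()
    row, col = C.sum(axis=1), C.sum(axis=0)  # C_i. and C_.j
    nz = C > 0                               # terms with C_ij = 0 are taken as 0
    num = -2.0 * np.sum(C[nz] * np.log(C[nz] * N / np.outer(row, col)[nz]))
    den = np.sum(row * np.log(row / N)) + np.sum(col * np.log(col / N))
    return 1.0 if den == 0 else float(num / den)   # assumed convention for the 0/0 case

same = nmi([0, 0, 1, 1], [1, 1, 0, 0])       # identical partitions, labels renamed
unrelated = nmi([0, 0, 1, 1], [0, 1, 0, 1])  # statistically independent partitions
print(same, unrelated)
```

The first pair describes the same partition under different label names and gives an NMI of 1. The second pair shares no information and gives an NMI of 0.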

The modularity of Newman and Girvan [3] is a well-known quality function for evaluating the goodness of a partition. Let $k$ be the number of modules found inside a network. The modularity is defined as

$$Q = \sum_{s=1}^{k} \left[ \frac{l_s}{m} - \left( \frac{d_s}{2m} \right)^2 \right],$$

where $l_s$ is the total number of edges joining vertices inside module $s$, $m$ is the total number of edges in the network, and $d_s$ is the sum of the degrees of the nodes of $s$.

Fig. 3. NMI obtained by MOEA/D-Net, Meme-Net, FM and InfoMap on the extension of the classical GN Benchmark.

This quantity measures the fraction of the edges in the network that connect vertices of the same type (i.e., within-community edges) minus the expected value of the same quantity in a network with the same community divisions but random connections between the vertices. If the number of within-community edges is no better than random, we will get $Q = 0$. A value approaching $Q = 1$, which is the maximum, indicates networks with a strong community structure. In practice, values for such networks typically fall in the range from about 0.3 to 0.7. Higher values are rare [3].
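The modularity formula can be computed from an adjacency matrix and a label vector as in the sketch below, which assumes an undirected 0/1 NumPy adjacency matrix without self-loops. The function name and the toy graph are illustrative.

```python
import numpy as np

def modularity(A, labels):
    """Q = sum over modules s of l_s/m - (d_s/(2m))^2 for an undirected 0/1 adjacency matrix."""
    A = np.asarray(A, dtype=float)
    labels = np.asarray(labels)
    m = A.sum() / 2.0                              # total number of edges
    q = 0.0
    for s in np.unique(labels):
        inside = labels == s
        l_s = A[inside][:, inside].sum() / 2.0     # edges with both ends inside module s
        d_s = A[inside].sum()                      # sum of the degrees of the nodes of s
        q += l_s / m - (d_s / (2.0 * m)) ** 2
    return q

# Hypothetical toy graph: two triangles joined by the single edge 2-3 (m = 7).
A = np.zeros((6, 6), dtype=int)
for i, j in [(0, 1), (0, 2), (1, 2), (3, 4), (3, 5), (4, 5), (2, 3)]:
    A[i, j] = A[j, i] = 1

q_split = modularity(A, [0, 0, 0, 1, 1, 1])   # 2 * (3/7 - (7/14)^2) = 5/14
q_merged = modularity(A, [0, 0, 0, 0, 0, 0])  # 7/7 - (14/14)^2 = 0
print(q_split, q_merged)
```

Placing all nodes in one module always yields $Q = 0$, while the two-triangle split yields $5/14 \approx 0.357$, which falls in the typical range quoted above.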

4.2. Experimental results on the extension of GN Benchmark

In this section, we test our method on computer-generated networks, which have a known community structure, to illustrate that the proposed algorithm can recognize and discover this structure. We then compare the results obtained by our algorithm with those obtained by Meme-Net, which shows that the multi-objective algorithm based on the two components of modularity density performs better than the single-objective algorithm based on modularity density optimization. We also compare our method with the FM and InfoMap algorithms to verify its effectiveness.

The network we used here is the benchmark network proposed by Lancichinetti et al. [34], which is an extension of the classic benchmark network proposed by Girvan and Newman in [1]. The network consists of 128 nodes divided into four communities of 32 nodes each. Each node has an average degree of 16 and shares a fraction 1 − µ of its links with the other nodes of its community and a fraction µ with the other nodes of the network; µ is the mixing parameter. When µ < 0.5, a node has more neighbors inside its own group than in the other three groups, so a good algorithm should discover the groups. We use this computer-generated data set to test whether our algorithm MOEA/D-Net effectively detects the community structure inside the network.

We generated 11 different networks for values of the mixing parameter µ ranging from 0 to 0.5 and used the Normalized Mutual Information (NMI) to measure the similarity between the true partitions and the detected ones. For each network, we computed the average NMI over 30 independent runs. Fig. 3 shows the average NMI obtained by the MOEA/D-Net, Meme-Net, FM and InfoMap algorithms. As shown in Fig. 3, when the value of the mixing parameter is small (µ ≤ 0.35), which means that the fuzziness of the communities in the network is low, our algorithm and the InfoMap algorithm find the true partition correctly (NMI equals 1). As the mixing parameter increases, it becomes more difficult for the algorithms to detect the true partition, but the partition detected by our method is the closest to the true one (NMI is 0.9919, 0.7911 and 0.4005 when µ = 0.40, µ = 0.45 and µ = 0.50, respectively). Therefore, MOEA/D-Net obtains higher NMI values than Meme-Net, FM and InfoMap, which indicates that the proposed method performs better than Meme-Net, FM and InfoMap on the extension of the GN Benchmark network.

This experiment shows that MOEA/D-Net is a highly effective method for revealing community structure on the extension of the GN Benchmark.
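To make the benchmark setup concrete, the sketch below generates a network with four planted communities of 32 nodes and an expected degree of 16. It is a simplified planted-partition stand-in, not the generator of Lancichinetti et al. [34]: here each pair of nodes is linked independently, with probabilities chosen (as an assumption of this sketch) so that a node has on average 16(1 − µ) neighbors inside its community and 16µ outside. The function name `planted_partition`, its parameters, and the step of 0.05 between the 11 values of µ are likewise assumptions.

```python
import random


def planted_partition(mu, n_groups=4, group_size=32, avg_degree=16, seed=None):
    """Return (edges, labels) for a simplified GN-style benchmark network."""
    rng = random.Random(seed)
    n = n_groups * group_size
    labels = [node // group_size for node in range(n)]  # true community of each node
    # Link probabilities giving the desired expected numbers of neighbors.
    p_in = avg_degree * (1 - mu) / (group_size - 1)
    p_out = avg_degree * mu / (n - group_size)
    edges = []
    for u in range(n):
        for v in range(u + 1, n):
            p = p_in if labels[u] == labels[v] else p_out
            if rng.random() < p:
                edges.append((u, v))
    return edges, labels


# Assumed grid of 11 mixing-parameter values: 0, 0.05, ..., 0.5.
mixing_values = [round(0.05 * k, 2) for k in range(11)]
```

With µ = 0 the inter-community probability is zero, so the four groups are disconnected from each other; as µ approaches 0.5 the two kinds of neighbors become equally numerous on average and the planted structure becomes hard to detect.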

4.3. Experimental results on real-world networks

In this section, we show the application of MOEA/D-Net to four real-world networks: Zachary’s Karate Club network, the Bottlenose Dolphins network, the American College Football network and Krebs’ network of books on American politics. The results obtained by the Fast Modularity algorithm (http://www.cs.unm.edu/~aaron/research/fastmodularity.htm) are given for comparison.


Fig. 4. The results on Zachary’s Karate Club network. (a) Pareto front of one run. (b) Network corresponding to solution (24). (c) Network corresponding to the exact solution (node number (22) on the Pareto front). (d) Network corresponding to solution (20).

Zachary’s Karate Club network was constructed by Zachary, who observed 34 members of a karate club over a period of two years [35]. During the course of the study, a disagreement developed between the administrator of the club and the club’s instructor, which ultimately resulted in the instructor leaving and starting a new club, taking about half of the original club’s members with him. The network splits naturally into two clusters. Here we use a simple unweighted version of his network.

The Bottlenose Dolphins network of 62 bottlenose dolphins living in Doubtful Sound, New Zealand, was compiled by Lusseau from seven years of observations of the dolphins’ behavior. A tie between two dolphins was established by their statistically significant frequent association. The network splits naturally into two large groups and contains 159 ties.

The American College Football network [1] comes from United States college football. The network represents the schedule of Division I games during the 2000 season. Nodes in the graph represent teams, and edges represent the regular-season games between the two teams they connect. The teams are divided into conferences. On average, the teams played 4 inter-conference matches and 7 intra-conference matches, so teams tend to play against members of the same conference. The network consists of 115 nodes and 616 edges, grouped into 12 conferences.

The last example is the network of political books compiled by Krebs. The nodes represent 105 books on American politics bought from Amazon.com, and edges join pairs of books frequently purchased by the same buyer. The books were divided by Newman [36] according to their political alignment (conservative or liberal), except for a small number of books (13) having no clear affiliation.

Fig. 4(a) displays the Pareto front of one of the 30 runs on Zachary’s Karate Club network. The maximum number of generations of MOEA/D-Net is 50. The network corresponding to the best value of NMI = 1 with modularity = 0.3715 (solution (24)), the one with NMI = 0.8255 and modularity = 0.3391 (solution (22)) and the one with NMI = 0.7071 and modularity = 0.4151 (solution (20)) are shown in Fig. 4. We can clearly observe that the solutions on the Pareto front have a hierarchical structure. Each of these solutions corresponds to a different partitioning of the network consisting of various clusters. The true partitioning, which is displayed in Fig. 4(b), consists of two modules obtained by the split of the two main groups. Fig. 4(c) shows that the left sub-graph is divided into two smaller ones, and Fig. 4(d) shows that both sub-graphs are


Fig. 5. Box plots of the NMI and Q values over the 30 runs on the four real-world networks. (a) NMI. (b) Q. Box plots are used here to illustrate the distribution of the values obtained by MOEA/D-Net. On each box, the red line is the median, the edges of the box are the first and third quartiles, the whiskers extend to the most extreme data points not considered outliers, and the outliers are plotted individually. The symbol + denotes outliers. (For interpretation of the references to colour in this figure legend, the reader is referred to the web version of this article.)

Table 1. The results of 30 runs of the best NMI obtained by our method and the fast modularity algorithm for the real-world datasets.

| Network | Iavg | Mod(Q) | FMIavg | FM(Q) |
|---|---|---|---|---|
| Zachary’s Karate Club | 1 | 0.371 | 0.693 | 0.380 |
| Bottlenose Dolphins | 1 | 0.373 | 0.573 | 0.495 |
| American College football | 0.925 | 0.599 | 0.762 | 0.577 |
| Books about US politics | 0.596 | 0.481 | 0.530 | 0.502 |

divided into two smaller ones, respectively. Fig. 4 shows that our algorithm MOEA/D-Net can produce a set of solutions which represent different divisions of Zachary’s Karate Club network at different hierarchical levels. The number of subdivisions is automatically determined by the non-dominated individuals resulting from our algorithm.
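The notion of non-dominated individuals used above can be illustrated with a short sketch. It assumes that both objectives are to be minimized and that each candidate partition has already been scored as a pair of objective values; the function names `dominates` and `pareto_front` are hypothetical, and the filter shown is a plain quadratic-time check rather than the selection mechanism of MOEA/D-Net.

```python
def dominates(a, b):
    """True if a is no worse than b in every objective and strictly better in one
    (all objectives are minimized)."""
    return all(x <= y for x, y in zip(a, b)) and any(x < y for x, y in zip(a, b))


def pareto_front(points):
    """Keep the points that no other point dominates."""
    return [p for p in points if not any(dominates(q, p) for q in points)]
```

Each point kept by `pareto_front` represents a different trade-off between the two objectives, which is why a single run yields several partitions with different numbers of communities.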

In the following, for each network, we run our algorithm 30 times and record in Table 1 the average value of the best NMI (Iavg) and the corresponding value of modularity (Mod(Q)) over the 30 runs; we also record in Table 2 the average value of the best modularity (Qavg) and the corresponding value of NMI (I(Q)). The maximum number of generations was 50. At each run, the solutions with the maximum value of NMI and with the maximum value of modularity are selected. The average results over the 30 runs obtained by MOEA/D-Net and FM on the four real-world networks are shown in Table 1. We also show the distributions of the best NMI and of the best modularity Q over the 30 runs on the four real-world networks as box plots in Fig. 5, which illustrate the stability of our algorithm. As can be seen from Fig. 5, on each of the four networks the variability of the NMI and Q values obtained over the 30 runs is relatively small.

Table 1 reports the average of the best NMI (Iavg), the average modularity value (Mod(Q)) corresponding to the solutions having the best NMI, the Normalized Mutual Information (NMI) value of the solution found by FM (FMIavg) and the modularity value of the solution found by FM (FM(Q)).
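The per-run selection and averaging described above can be sketched as follows. The data layout (each run given as a list of (NMI, Q) pairs, one per Pareto solution) and the function name `summarize_runs` are assumptions made for illustration; the dictionary keys simply reuse the column names of Tables 1 and 2.

```python
from statistics import mean


def summarize_runs(runs):
    """runs: list of runs, each a list of (nmi, q) pairs for its Pareto solutions."""
    # In every run, pick the solution with the best NMI and the one with the best Q.
    best_nmi = [max(run, key=lambda s: s[0]) for run in runs]
    best_q = [max(run, key=lambda s: s[1]) for run in runs]
    return {
        "Iavg": mean(s[0] for s in best_nmi),    # average of the best NMI
        "Mod(Q)": mean(s[1] for s in best_nmi),  # modularity of those solutions
        "Qavg": mean(s[1] for s in best_q),      # average of the best modularity
        "I(Q)": mean(s[0] for s in best_q),      # NMI of those solutions
    }
```

Applied to a single run containing only the three Karate Club solutions quoted for Fig. 4, (1, 0.3715), (0.8255, 0.3391) and (0.7071, 0.4151), the best-NMI solution is the first and the best-modularity solution is the third.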

As shown in Table 1, the average of the best NMI obtained by our algorithm MOEA/D-Net on the first two real-world networks is 1. This means that the true partitions of both networks are obtained in every run of MOEA/D-Net. The modularity value of the true partitioning of Zachary’s Karate Club network is 0.371, and the modularity value of the true partitioning of the Bottlenose Dolphins network is 0.373. The true partitioning of the Bottlenose Dolphins network is shown in Fig. 6. In contrast, on Zachary’s Karate Club and the Bottlenose Dolphins networks, the fast modularity algorithm found solutions with NMI values of 0.693 and 0.573, respectively.

The American College Football network and Krebs’ Books network are more difficult, especially the latter. Neither MOEA/D-Net nor FM can find the true partitioning of these networks. However, the highest average best NMI is obtained by our algorithm MOEA/D-Net, as shown in Table 1. On the American College Football network, the fast modularity algorithm found a solution with an NMI value of 0.762, while MOEA/D-Net found solutions with an average best NMI value of 0.925 over 30 runs.

The last real-world network is the most complex one. Its structure is not as clear as that of the first two networks, so it is difficult to detect the communities within it. However, the highest average best NMI is again obtained by MOEA/D-Net, as shown in Table 1. The fast modularity algorithm found a solution with an NMI value of 0.530, while MOEA/D-Net found solutions with an average best NMI value of 0.596 over 30 runs.

Tables 1 and 2 clearly show that the average best values of modularity obtained by our method are larger than those obtained by the fast modularity algorithm on the four real-world networks. They also show that, except for Zachary’s Karate Club network, the values of NMI corresponding to the best value of modularity are larger than those obtained by the fast modularity algorithm. Therefore, MOEA/D-Net performs better than FM on the four real-world networks. For Zachary’s Karate Club network, the value of NMI corresponding to the best value of modularity obtained by MOEA/D-Net is smaller than the one obtained by the fast modularity algorithm because there exists a resolution limit in optimizing modularity to reveal the community structure of networks. As shown in Fig. 4, Zachary’s Karate Club network has several kinds of


Fig. 6. The true partitioning of the Bottlenose Dolphins network.

Table 2. The results of 30 runs of the best modularity obtained by our method.

| Network | I(Q) | Qavg |
|---|---|---|
| Zachary’s Karate Club | 0.687 | 0.420 |
| Bottlenose Dolphins | 0.623 | 0.520 |
| American College football | 0.891 | 0.604 |
| Books about US politics | 0.574 | 0.527 |

hierarchical structures. Therefore, although the best value of Q obtained by MOEA/D-Net is larger than that obtained by the fast modularity algorithm, it is also true that the value of NMI obtained by MOEA/D-Net is smaller than that obtained by the fast modularity algorithm when the modularity is chosen as the criterion to reveal community structure in networks.

This experiment clearly shows that MOEA/D-Net is a more effective algorithm than the fast modularity algorithm.

5. Concluding remarks

In this paper, we propose a new community detection algorithm, MOEA/D-Net, to simultaneously optimize two contradictory objective functions, Negative Ratio Association and Ratio Cut. Optimization of Negative Ratio Association tends to divide a network into small communities, while optimization of Ratio Cut tends to divide a network into large communities. The simultaneous optimization of these two contradictory objectives returns a set of tradeoff solutions between the two objectives. Each of these solutions corresponds to a network partition. The experimental results show that MOEA/D-Net performs better than Meme-Net, FM and InfoMap on the extension of the GN Benchmark network and performs better than FM on the four real-world networks in revealing community structure. They also show that the proposed method can reveal community structure at different hierarchical levels.

Acknowledgments

The authors thank the editors and anonymous reviewers for their valuable comments and helpful suggestions, which greatly improved the quality of the paper. This work was supported by the Program for New Century Excellent Talents in University (Grant No. NCET-08-0811), the Program for New Scientific and Technological Star of Shaanxi Province (Grant No. 2010KJXX-03), and the Fundamental Research Funds for the Central Universities (Grant No. K50510020001).

References

[1] M. Girvan, M.E.J. Newman, Community structure in social and biological networks, Proc. Natl. Acad. Sci. USA 99 (12) (2002) 7821–7826.

[2] F. Radicchi, C. Castellano, F. Cecconi, V. Loreto, D. Parisi, Defining and identifying communities in networks, Proc. Natl. Acad. Sci. USA 101 (9) (2004) 2658–2663.

[3] M.E.J. Newman, M. Girvan, Finding and evaluating community structure in networks, Phys. Rev. E 69 (2) (2004) 026113.


[4] J.-P. Eckmann, E. Moses, Curvature of co-links uncovers hidden thematic layers in the world wide web, Proc. Natl. Acad. Sci. USA 99 (9) (2002) 5825–5829.

[5] G. Flake, S. Lawrence, C. Giles, F. Coetzee, Self-organization and identification of web communities, Computer 35 (3) (2002) 66–70.

[6] M.E.J. Newman, Fast algorithm for detecting community structure in networks, Phys. Rev. E 69 (6) (2004) 066133.

[7] J. Duch, A. Arenas, Community detection in complex networks using extremal optimization, Phys. Rev. E 72 (2) (2005) 027104.

[8] J. Liu, T. Liu, Detecting community structure in complex networks using simulated annealing with k-means algorithms, Physica A 389 (11) (2010) 2300–2309.

[9] Y. Pan, D.-H. Li, J.-G. Liu, J.-Z. Liang, Detecting community structure in complex networks via node similarity, Physica A 389 (14) (2010) 2849–2857.

[10] D. Chen, Y. Fu, M. Shang, A fast and efficient heuristic algorithm for detecting community structures in complex networks, Physica A 388 (13) (2009) 2741–2749.

[11] C. Pizzuti, A multi-objective genetic algorithm for community detection in networks, in: Proceedings of the 21st IEEE International Conference on Tools with Artificial Intelligence, Newark, New Jersey, USA, 2009, pp. 379–386.

[12] C. Pizzuti, GA-Net: a genetic algorithm for community detection in social networks, in: Parallel Problem Solving from Nature – PPSN X, in: Lect. Notes Comput. Sc., vol. 5199, Springer, Berlin, Heidelberg, 2008, pp. 1081–1090.

[13] A. Lancichinetti, S. Fortunato, J. Kertész, Detecting the overlapping and hierarchical community structure of complex networks, New J. Phys. 11 (2009) 033015.

[14] K. Deb, A. Pratap, S. Agarwal, T. Meyarivan, A fast and elitist multiobjective genetic algorithm: NSGA-II, IEEE Trans. Evol. Comput. 6 (2) (2002) 182–197.

[15] Q. Zhang, H. Li, MOEA/D: a multiobjective evolutionary algorithm based on decomposition, IEEE Trans. Evol. Comput. 11 (6) (2007) 712–731.

[16] L. Angelini, S. Boccaletti, D. Marinazzo, M. Pellicoro, S. Stramaglia, Identification of network modules by optimization of ratio association, Chaos 17 (2) (2007) 023114.

[17] Y.-C. Wei, C.-K. Cheng, Ratio cut partitioning for hierarchical designs, IEEE Trans. Comput. Aid. D 10 (7) (1991) 911–921.

[18] H. Li, Q. Zhang, Multiobjective optimization problems with complicated Pareto sets, MOEA/D and NSGA-II, IEEE Trans. Evol. Comput. 13 (2) (2009) 284–302.

[19] Q. Zhang, W. Liu, E. Tsang, B. Virginas, Expensive multiobjective optimization by MOEA/D with Gaussian process model, IEEE Trans. Evol. Comput. 14 (3) (2010) 456–474.

[20] K. Tang, Y. Mei, X. Yao, Memetic algorithm with extended neighborhood search for capacitated arc routing problems, IEEE Trans. Evol. Comput. 13 (5) (2009) 1151–1166.

[21] M. Gong, L. Jiao, H. Du, L. Bo, Multiobjective immune algorithm with nondominated neighbor-based selection, Evol. Comput. 16 (2) (2008) 225–255.

[22] E. Zitzler, L. Thiele, Multiobjective evolutionary algorithms: a comparative case study and the strength Pareto approach, IEEE Trans. Evol. Comput. 3 (4) (1999) 257–271.

[23] J.D. Knowles, D.W. Corne, Approximating the nondominated front using the Pareto archived evolution strategy, Evol. Comput. 8 (2) (2000) 149–172.

[24] C. Coello, G. Pulido, M. Lechuga, Handling multiple objectives with particle swarm optimization, IEEE Trans. Evol. Comput. 8 (3) (2004) 256–279.

[25] Q. Zhang, A. Zhou, Y. Jin, RM-MEDA: a regularity model-based multiobjective estimation of distribution algorithm, IEEE Trans. Evol. Comput. 12 (1) (2008) 41–63.

[26] A. Clauset, M.E.J. Newman, C. Moore, Finding community structure in very large networks, Phys. Rev. E 70 (6) (2004) 066111.

[27] M. Rosvall, C.T. Bergstrom, Maps of random walks on complex networks reveal community structure, Proc. Natl. Acad. Sci. USA 105 (4) (2008) 1118–1123.

[28] M. Gong, B. Fu, L. Jiao, H. Du, A memetic algorithm for community detection in networks, Phys. Rev. E 84 (5) (2011) 056101.

[29] Z. Li, S. Zhang, R.-S. Wang, X.-S. Zhang, L.N. Chen, Quantitative function for community detection, Phys. Rev. E 77 (3) (2008) 036109.

[30] S. Fortunato, M. Barthélemy, Resolution limit in community detection, Proc. Natl. Acad. Sci. USA 104 (1) (2007) 36–41.

[31] J. Handl, J. Knowles, An evolutionary approach to multiobjective clustering, IEEE Trans. Evol. Comput. 11 (1) (2007) 56–76.

[32] A. Lancichinetti, S. Fortunato, Community detection algorithms: a comparative analysis, Phys. Rev. E 80 (5) (2009) 056117.

[33] L. Danon, A. Díaz-Guilera, J. Duch, A. Arenas, Comparing community structure identification, J. Stat. Mech. 78 (2005) P09008.

[34] A. Lancichinetti, S. Fortunato, F. Radicchi, Benchmark graphs for testing community detection algorithms, Phys. Rev. E 78 (4) (2008) 046110.

[35] W.W. Zachary, An information-flow model for conflict and fission in small groups, J. Anthropol. Res. 33 (4) (1977) 452–473.

[36] M.E.J. Newman, Modularity and community structure in networks, Proc. Natl. Acad. Sci. USA 103 (23) (2006) 8577–8582.