Community detection algorithms: a comparative analysis Santo Fortunato

Upload: malcolm-white

Post on 12-Jan-2016


Page 1: Community detection algorithms: a comparative analysis Santo Fortunato

Community detection algorithms:

a comparative analysis

Santo Fortunato

Page 2: Community detection algorithms: a comparative analysis Santo Fortunato

More links “inside” than “outside”

Graphs are “sparse”

“Communities”

Page 3: Community detection algorithms: a comparative analysis Santo Fortunato

Metabolic Protein-protein

Social Economical

Page 4: Community detection algorithms: a comparative analysis Santo Fortunato

Problems

• Confusion about the main concepts: community, partition, null models

• (Too) Many algorithms around

• How shall we test them?

Page 5: Community detection algorithms: a comparative analysis Santo Fortunato

Testing a method means applying it to graphs with known community structure (benchmarks)

Benchmarks are then based on an implicit definition of community

Ideally algorithms have to be based on the same definition/principle, otherwise there is inconsistency

Page 6: Community detection algorithms: a comparative analysis Santo Fortunato

The planted l-partition model (Condon & Karp, 1999)

n nodes, l equal-sized groups with g=n/l nodes

p = probability that two nodes in the same group are connected

q = probability that two nodes in different groups are connected

If p>q, communities are there!
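The model above can be sketched in a few lines of Python. This is a minimal illustration, not the generator used in the talk: the function name `planted_partition`, the contiguous assignment of nodes to groups, and the values p = 0.3 and q = 0.05 are all assumptions chosen for the example.

```python
import itertools
import random


def planted_partition(n, l, p, q, seed=None):
    """Return (edges, membership) for a planted l-partition graph.

    Nodes 0..n-1 are split into l contiguous groups of g = n / l nodes
    (an assumption of this sketch; any equal-sized split would do).
    """
    if n % l != 0:
        raise ValueError("n must be divisible by l for equal-sized groups")
    rng = random.Random(seed)
    g = n // l
    membership = [node // g for node in range(n)]
    edges = []
    for u, v in itertools.combinations(range(n), 2):
        # Same group: link with probability p; different groups: probability q.
        prob = p if membership[u] == membership[v] else q
        if rng.random() < prob:
            edges.append((u, v))
    return edges, membership


edges, membership = planted_partition(n=128, l=4, p=0.3, q=0.05, seed=1)
internal = sum(membership[u] == membership[v] for u, v in edges)
print(f"{internal} internal links, {len(edges) - internal} external links")
```

Every pair of nodes is visited once, so the sketch costs O(n^2) time, which is fine for benchmark-sized graphs of a few hundred nodes.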

Page 7: Community detection algorithms: a comparative analysis Santo Fortunato
Page 8: Community detection algorithms: a comparative analysis Santo Fortunato

Benchmark of Girvan & Newman

128 nodes, 4 groups, average degree 16

All nodes have the same degree

Special case of planted l-partition model, with n=128, l=4, g=32
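For these parameters the average degree ties p and q together. The short derivation below is a worked sketch that treats 16 as the expected degree of each node in the planted l-partition model; the symbols k_in and k_out for the expected internal and external degree are introduced here for illustration.

```latex
\begin{align*}
\langle k \rangle &= \underbrace{p\,(g-1)}_{k_{\mathrm{in}}}
                   + \underbrace{q\,(n-g)}_{k_{\mathrm{out}}}
                   = 31\,p + 96\,q = 16 \\
p = q \;&\Rightarrow\; 127\,p = 16 \;\Rightarrow\; p = \tfrac{16}{127} \approx 0.126 \\
p > q \;&\Longleftrightarrow\; k_{\mathrm{out}} = 96\,q < \tfrac{96 \cdot 16}{127} \approx 12.1
\end{align*}
```

Under this reading, communities are defined as long as each node has, on average, fewer than about 12 of its 16 links pointing outside its own group.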

Page 9: Community detection algorithms: a comparative analysis Santo Fortunato
Page 10: Community detection algorithms: a comparative analysis Santo Fortunato

Problems with GN benchmark

• All nodes have the same degree

• All communities have equal size

In real networks the distributions of degree and community size are highly heterogeneous!

Page 11: Community detection algorithms: a comparative analysis Santo Fortunato

New benchmark (A. Lancichinetti, S. F., F. Radicchi, Phys. Rev. E 78, 046110, 2008)

• Power law distribution of degree

• Power law distribution of community size

• A mixing parameter μt sets the ratio between the external and the total degree of each node

The software to produce all new benchmarks is here: http://santo.fortunato.googlepages.com/inthepress2

The benchmark can be extended to directed and weighted networks with overlapping communities (A. Lancichinetti, S. F., Phys. Rev. E 80, 016118, 2009)
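The mixing parameter can also be measured on any graph whose communities are known, which is a handy check on a generated benchmark. The sketch below computes, for each node, the ratio between its external and its total degree in an undirected, unweighted graph; the function name `mixing_parameters` and the toy graph are assumptions made for illustration and are not part of the benchmark software.

```python
def mixing_parameters(edges, membership):
    """Return {node: external degree / total degree} for nodes with degree > 0."""
    total = {}
    external = {}
    for u, v in edges:
        is_external = membership[u] != membership[v]
        for node in (u, v):
            total[node] = total.get(node, 0) + 1
            external[node] = external.get(node, 0) + int(is_external)
    return {node: external[node] / total[node] for node in total}


# Toy graph: two triangles joined by the single link (2, 3).
edges = [(0, 1), (0, 2), (1, 2), (3, 4), (3, 5), (4, 5), (2, 3)]
membership = {0: 0, 1: 0, 2: 0, 3: 1, 4: 1, 5: 1}
mu = mixing_parameters(edges, membership)
print(mu[0], mu[2])
```

Node 0 has no external links, while node 2 has one external link out of three, so the two printed ratios differ.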

Page 12: Community detection algorithms: a comparative analysis Santo Fortunato

Computer time

Page 13: Community detection algorithms: a comparative analysis Santo Fortunato

Comparing partitions: normalized mutual information

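x_i, y_i : community assignments of node i in partitions X and Y

P(X=x) = n_x/n, P(Y=y) = n_y/n

Joint distribution: P(X=x, Y=y) = n_xy/n

Shannon entropy of X: H(X) = -\sum_x P(x) \log P(x)

Shannon conditional entropy of X given Y: H(X|Y) = -\sum_{x,y} P(x,y) \log P(x|y)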

Page 14: Community detection algorithms: a comparative analysis Santo Fortunato

Mutual information: I(X,Y) = H(X) - H(X|Y)

Problem: the mutual information is identical for all Y which are subpartitions of X (then H(X|Y) = 0, so I(X,Y) = H(X))

To avoid that: normalized mutual information I_norm(X,Y) = 2 I(X,Y) / [H(X) + H(Y)]
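These quantities can be computed directly from two lists of community labels. The sketch below is a minimal illustration that assumes the common normalization 2 I(X,Y) / [H(X) + H(Y)]; the function names are hypothetical. The example shows the problem stated above: a partition and one of its subpartitions have the same mutual information with X, but different normalized values.

```python
from collections import Counter
from math import log


def entropy(labels):
    """Shannon entropy H(X), with P(X=x) = n_x / n."""
    n = len(labels)
    return -sum((c / n) * log(c / n) for c in Counter(labels).values())


def mutual_information(x, y):
    """I(X,Y) = sum_xy P(x,y) log[P(x,y) / (P(x) P(y))], with P(x,y) = n_xy / n."""
    n = len(x)
    n_x, n_y, n_xy = Counter(x), Counter(y), Counter(zip(x, y))
    return sum(
        (c / n) * log(c * n / (n_x[a] * n_y[b]))
        for (a, b), c in n_xy.items()
    )


def normalized_mutual_information(x, y):
    """2 I(X,Y) / [H(X) + H(Y)] (assumed normalization)."""
    h_x, h_y = entropy(x), entropy(y)
    if h_x + h_y == 0:
        return 1.0  # both partitions consist of a single group, hence identical
    return 2 * mutual_information(x, y) / (h_x + h_y)


x = [0, 0, 0, 0, 1, 1, 1, 1]
y_sub = [0, 0, 1, 1, 2, 2, 3, 3]  # a subpartition of x

print(mutual_information(x, x), mutual_information(x, y_sub))
print(normalized_mutual_information(x, x), normalized_mutual_information(x, y_sub))
```

The mutual information equals H(X) = log 2 in both cases, whereas the normalized value drops from 1 to 2/3 for the subpartition.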

Page 15: Community detection algorithms: a comparative analysis Santo Fortunato

What is the best algorithm? A comparative analysis (A. Lancichinetti, S.F., Phys. Rev. E 80, 056117, 2009)

Page 16: Community detection algorithms: a comparative analysis Santo Fortunato

Divisive algorithms

Principle: one removes the links that connect the clusters, until the latter are isolated

How to identify intercommunity links?

1) Edge-betweenness (M. Girvan & M. E. J. Newman, PNAS 99, 7821-7826, 2002)

2) Edge clustering coefficient (F. Radicchi, C. Castellano, F. Cecconi, V. Loreto, D. Parisi, PNAS 101, 2658, 2004)
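As an illustration of the second criterion, the sketch below scores each link with the triangle-based edge clustering coefficient, assuming the definition C_ij = (z_ij + 1) / min(k_i - 1, k_j - 1), where z_ij is the number of triangles containing the link and k_i, k_j are the degrees of its endpoints. Treating links with a degree-1 endpoint as having an infinite coefficient is a convention assumed here, and the function name is hypothetical. Intercommunity links sit in few triangles, so they get low scores and are removed first.

```python
from math import inf


def edge_clustering_coefficients(edges):
    """Return {(u, v): (z_uv + 1) / min(k_u - 1, k_v - 1)} for each link."""
    neighbors = {}
    for u, v in edges:
        neighbors.setdefault(u, set()).add(v)
        neighbors.setdefault(v, set()).add(u)
    coefficients = {}
    for u, v in edges:
        # Each common neighbor of u and v closes one triangle on the link.
        triangles = len(neighbors[u] & neighbors[v])
        max_triangles = min(len(neighbors[u]) - 1, len(neighbors[v]) - 1)
        coefficients[(u, v)] = (
            (triangles + 1) / max_triangles if max_triangles > 0 else inf
        )
    return coefficients


# Toy graph: two triangles joined by the single link (2, 3).
edges = [(0, 1), (0, 2), (1, 2), (3, 4), (3, 5), (4, 5), (2, 3)]
scores = edge_clustering_coefficients(edges)
weakest = min(scores, key=scores.get)
print(weakest, scores[weakest])
```

A divisive algorithm would remove the lowest-scoring link, recompute the scores, and repeat until the clusters are isolated.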

Page 17: Community detection algorithms: a comparative analysis Santo Fortunato

Modularity

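Q = \frac{1}{m} \sum_i \left( l_i - \frac{d_i^2}{4m} \right)

l_i = # links in module i

d_i^2/(4m) = expected # of links in module i (d_i = total degree of the nodes in module i, m = total # of links)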

Newman & Girvan, Phys. Rev. E 69, 026113, 2004
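A minimal sketch of how modularity can be evaluated for a given partition, assuming the standard Newman-Girvan form Q = sum_i [ l_i/m - (d_i/2m)^2 ] for an undirected, unweighted graph. Here m (total number of links) and d_i (total degree of the nodes in module i) are symbols introduced for this example, and the function name is hypothetical.

```python
def modularity(edges, membership):
    """Q = sum_i [ l_i / m - (d_i / (2 m))**2 ] over all modules i."""
    m = len(edges)
    links_inside = {}  # l_i: number of links with both endpoints in module i
    degree_sum = {}    # d_i: sum of the degrees of the nodes in module i
    for u, v in edges:
        cu, cv = membership[u], membership[v]
        degree_sum[cu] = degree_sum.get(cu, 0) + 1
        degree_sum[cv] = degree_sum.get(cv, 0) + 1
        if cu == cv:
            links_inside[cu] = links_inside.get(cu, 0) + 1
    return sum(
        links_inside.get(c, 0) / m - (d / (2 * m)) ** 2
        for c, d in degree_sum.items()
    )


# Toy graph: two triangles joined by the single link (2, 3).
edges = [(0, 1), (0, 2), (1, 2), (3, 4), (3, 5), (4, 5), (2, 3)]
two_modules = {0: 0, 1: 0, 2: 0, 3: 1, 4: 1, 5: 1}
one_module = {node: 0 for node in range(6)}
print(modularity(edges, two_modules), modularity(edges, one_module))
```

Putting all nodes in one module gives Q = 0, while splitting the graph into its two triangles gives a positive value, as expected for a partition with more internal links than the null model predicts.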

Page 18: Community detection algorithms: a comparative analysis Santo Fortunato

Infomap (Rosvall & Bergstrom, PNAS 105, 1118, 2008)

Best partition: the one giving the minimum description length (of a random walk on the graph). The optimization can be carried out with simulated annealing, greedy methods, etc.

Page 19: Community detection algorithms: a comparative analysis Santo Fortunato

Clique Percolation Method

Palla, Derényi, Farkas & Vicsek, Nature 435, 814, (2005)

Principle: in a graph with community structure there are many cliques within the clusters

Cliques can be used as probes to explore the graph:

1) Two k-cliques are neighbors if they share a (k-1)-clique

2) One can travel along paths of neighboring cliques

Cliques may be trapped within clusters, which can then be identified
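The procedure can be sketched as follows. This is a toy illustration, not the implementation of Palla et al.: k-cliques are enumerated by brute force (only viable for small graphs), neighboring cliques are merged with a union-find structure, and the function name `clique_percolation` is hypothetical.

```python
from itertools import combinations


def clique_percolation(edges, k):
    """Return communities as sets of nodes: unions of neighboring k-cliques."""
    neighbors = {}
    for u, v in edges:
        neighbors.setdefault(u, set()).add(v)
        neighbors.setdefault(v, set()).add(u)

    # Brute-force enumeration of all k-cliques (toy graphs only).
    cliques = [
        frozenset(nodes)
        for nodes in combinations(sorted(neighbors), k)
        if all(b in neighbors[a] for a, b in combinations(nodes, 2))
    ]

    # Union-find over cliques: two k-cliques are neighbors if they share k-1 nodes.
    parent = list(range(len(cliques)))

    def find(i):
        while parent[i] != i:
            parent[i] = parent[parent[i]]
            i = parent[i]
        return i

    for i, j in combinations(range(len(cliques)), 2):
        if len(cliques[i] & cliques[j]) == k - 1:
            parent[find(i)] = find(j)

    # A community is the union of all cliques reachable from one another.
    communities = {}
    for i, clique in enumerate(cliques):
        communities.setdefault(find(i), set()).update(clique)
    return sorted(communities.values(), key=min)


# Triangles {0,1,2} and {1,2,3} share a link; triangle {4,5,6} hangs off node 3.
edges = [(0, 1), (0, 2), (1, 2), (1, 3), (2, 3), (3, 4), (4, 5), (4, 6), (5, 6)]
print(clique_percolation(edges, k=3))

# Two triangles that share only node 2 yield overlapping communities.
overlapping = [(0, 1), (0, 2), (1, 2), (2, 3), (2, 4), (3, 4)]
print(clique_percolation(overlapping, k=3))
```

In the second example node 2 belongs to both communities, which is how the method produces overlapping clusters.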

Page 20: Community detection algorithms: a comparative analysis Santo Fortunato

Clique percolation method

Page 21: Community detection algorithms: a comparative analysis Santo Fortunato

What is the best algorithm? A comparative analysis (A. Lancichinetti, S.F., Phys. Rev. E 80, 056117, 2009)

Page 22: Community detection algorithms: a comparative analysis Santo Fortunato

Tests on GN benchmark

Page 23: Community detection algorithms: a comparative analysis Santo Fortunato

Tests on LFR benchmark (undirected, unweighted)

Page 24: Community detection algorithms: a comparative analysis Santo Fortunato

Tests on random graphs

Page 25: Community detection algorithms: a comparative analysis Santo Fortunato

Outlook

• New benchmark graphs based on planted l-partition model (true community definition?): weighted/unweighted, directed/undirected and with overlapping communities

• Comparative analysis of existing methods on the new benchmarks: the method by Rosvall and Bergstrom (PNAS 105, 1118, 2008) is the best. It performs very well on the new benchmarks, it recognizes random graphs (i.e. it finds no community structure in them) if the average degree is not too small, and it is fast as well!

• Warning: benchmarks are characterized by “flat” clustering, there is no hierarchy! They also have a low clustering coefficient (work in progress)

• Crucial issue for the future: proper definition of hierarchical community structure and relative testing!

Agreement on how to test algorithms is more crucial than designing algorithms!

Page 26: Community detection algorithms: a comparative analysis Santo Fortunato

S. F., arXiv:0906.0612, Physics Reports 486, 75-174 (2010)