complex networks - eliassi

Complex Networks
Tina Eliassi-Rad
Lawrence Livermore National Laboratory & Rutgers University
7/22/2010, Mathcamp'10
http://eliassi.org

[Figure panels: Internet, Terrorism, Reality Mining, Food Web, Enron Emails, Map of Science, HP Emails, Contagion of TB, NY State Power Grid, Protein Interactions, Friendship Network]


Common Patterns
• Scale-free
  ▫ Eigen exponent
• Small-world
• Triangle law
  ▫ k friends → ~k^1.6 triangles
• Small diameter
• Social influence
  ▫ Pr(A.class == B.class | A → B) > Pr(A.class == B.class)
• Social selection
  ▫ Pr(A → B | A.class == B.class) > Pr(A → B)
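The two inequalities for social influence and social selection can be checked empirically on a labeled graph. The following is a minimal sketch, assuming a directed edge list and one class label per node; the function name `homophily_probabilities` and the toy data are hypothetical and not from the slides. Probabilities are estimated as frequencies over all ordered node pairs.

```python
from itertools import permutations


def homophily_probabilities(nodes, edges, node_class):
    """Estimate the four probabilities in the two inequalities.

    Frequencies are taken over all ordered pairs (A, B) with A != B.
    Assumes at least one edge and at least one same-class pair.
    """
    edge_set = set(edges)
    pairs = same = linked = both = 0
    for a, b in permutations(nodes, 2):
        pairs += 1
        is_same = node_class[a] == node_class[b]
        is_link = (a, b) in edge_set
        same += is_same
        linked += is_link
        both += is_same and is_link
    return {
        "p_same": same / pairs,               # Pr(A.class == B.class)
        "p_same_given_link": both / linked,   # Pr(A.class == B.class | A -> B)
        "p_link": linked / pairs,             # Pr(A -> B)
        "p_link_given_same": both / same,     # Pr(A -> B | A.class == B.class)
    }


# Toy example (hypothetical data): two classes, most links stay within a class.
toy_nodes = [0, 1, 2, 3]
toy_class = {0: "x", 1: "x", 2: "y", 3: "y"}
toy_edges = [(0, 1), (1, 0), (2, 3), (0, 2)]
toy_probs = homophily_probabilities(toy_nodes, toy_edges, toy_class)
```

On this toy graph both inequalities hold, since three of the four links join nodes of the same class while only a third of all ordered pairs share a class.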


Problem #1
• Triangles are expensive to compute
  ▫ Friends of friends are friends
  ▫ 3-way join
• Q: Can we do this quickly?
• A: Yes!
  ▫ #triangles = (1/6) Σ_i λ_i^3
  ▫ because of skewness, we only need the top few eigenvalues
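The identity behind the answer can be checked on a small graph. The sketch below assumes that λ_i denotes the eigenvalues of the adjacency matrix A of an undirected simple graph (the slide does not define them). Under that assumption Σ_i λ_i^3 equals the trace of A^3, so the code compares trace(A^3)/6 with a brute-force count. It does not implement the top-few-eigenvalues approximation, which would need an eigensolver; the function names are hypothetical.

```python
def count_triangles_brute_force(adj):
    """Count triangles by testing every node triple (the expensive way)."""
    n = len(adj)
    return sum(
        1
        for i in range(n)
        for j in range(i + 1, n)
        for k in range(j + 1, n)
        if adj[i][j] and adj[j][k] and adj[i][k]
    )


def matmul(x, y):
    """Multiply two square matrices given as lists of lists."""
    n = len(x)
    return [[sum(x[i][k] * y[k][j] for k in range(n)) for j in range(n)]
            for i in range(n)]


def count_triangles_trace(adj):
    """Count triangles as trace(A^3) / 6, which equals (1/6) * sum of cubed eigenvalues."""
    cube = matmul(matmul(adj, adj), adj)
    return sum(cube[i][i] for i in range(len(adj))) // 6


# Complete graph on 4 nodes: every triple of nodes forms a triangle.
k4 = [[0 if i == j else 1 for j in range(4)] for i in range(4)]
same_count = count_triangles_brute_force(k4) == count_triangles_trace(k4)
```

For the complete graph on 4 nodes the eigenvalues are 3, -1, -1, -1, so the sum of cubes is 24 and the formula gives 4 triangles, matching the brute-force count.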


Communities
• Clusters, groups, modules
• Need to
  ▫ Formalize the notion of a community
    Hard vs. soft
  ▫ Design an algorithm that will find sets of nodes that are “good” communities
  ▫ Formalize the evaluation of community structure


Clustering Objective Function: Conductance
• A good cluster S has
  ▫ Many edges internally
  ▫ Few edges pointing outside
• Simplest objective function: Conductance
  Φ(S) = #edges outside S / #edges inside S
  ▫ Small conductance corresponds to good clusters
[Figure: node sets S and S’]
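The ratio above can be computed directly from an edge list. This is a minimal sketch, assuming an undirected graph, "edges outside S" meaning edges with exactly one endpoint in S, and "edges inside S" meaning edges with both endpoints in S; the function name and the toy graph are hypothetical.

```python
def conductance(edges, cluster):
    """Phi(S) = (#edges leaving S) / (#edges inside S), as defined on the slide.

    Returns infinity when S has no internal edges.
    """
    cluster = set(cluster)
    inside = outside = 0
    for u, v in edges:
        if u in cluster and v in cluster:
            inside += 1
        elif u in cluster or v in cluster:
            outside += 1
    return float("inf") if inside == 0 else outside / inside


# Toy graph (hypothetical): two triangles joined by a single bridge edge 2-3.
toy_edges = [(0, 1), (0, 2), (1, 2), (2, 3), (3, 4), (3, 5), (4, 5)]
good_cluster_score = conductance(toy_edges, {0, 1, 2})   # one triangle
poor_cluster_score = conductance(toy_edges, {0, 1})      # half a triangle
```

A whole triangle has three internal edges and one outgoing edge, so it scores lower (better) than a pair of nodes cut out of that triangle.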


Network Community Profile (NCP) [Leskovec et al. WWW‘08]
• The score of best cluster of size k
[Figure: NCP plot of log Φ(k) against community size, log k, with example clusters marked at k=5, k=7, and k=10]
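For a very small graph, the profile can be computed by exhaustive search. The sketch below assumes the conductance definition from the previous slide (edges leaving S divided by edges inside S) and tries every node subset of each size, which is only feasible for toy inputs; it is not the method used in the cited paper. The function names are hypothetical.

```python
from itertools import combinations


def conductance(edges, cluster):
    """Edges leaving the cluster divided by edges inside it (infinity if none inside)."""
    cluster = set(cluster)
    inside = outside = 0
    for u, v in edges:
        if u in cluster and v in cluster:
            inside += 1
        elif u in cluster or v in cluster:
            outside += 1
    return float("inf") if inside == 0 else outside / inside


def network_community_profile(nodes, edges, max_size):
    """Map each size k to the best (smallest) conductance over all clusters of k nodes."""
    return {
        k: min(conductance(edges, subset) for subset in combinations(nodes, k))
        for k in range(1, max_size + 1)
    }


# Toy graph (hypothetical): two triangles joined by a bridge edge 2-3.
toy_edges = [(0, 1), (0, 2), (1, 2), (2, 3), (3, 4), (3, 5), (4, 5)]
toy_profile = network_community_profile(range(6), toy_edges, max_size=5)
best_size = min(toy_profile, key=toy_profile.get)
```

On this toy graph the profile reaches its minimum at k = 3, the size of one triangle.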


Clustering Objective Function: Modularity
• m = number of edges in the graph
• A_vw = 1 if v→w; 0 otherwise
• k_v = degree of vertex v
• δ(i, j) = 1 if i ≡ j; 0 otherwise
• Maximizes modularity, Q
  ▫ Fraction of all edges within communities minus the expected value of the same quantity in a network where the vertices have the same degrees but edges are placed at random
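The formula for Q itself did not survive in this transcript. A common form that uses exactly the symbols defined above, and appears in the Clauset, Newman, and Moore paper cited in the summary, is Q = (1/2m) Σ_vw [A_vw - k_v k_w / (2m)] δ(c_v, c_w), where c_v is the community of vertex v. The sketch below evaluates that form, assuming an undirected graph without self-loops; the function name and toy data are hypothetical.

```python
def modularity(nodes, edges, community):
    """Q = (1/2m) * sum over v, w of [A_vw - k_v*k_w/(2m)] * delta(c_v, c_w)."""
    m = len(edges)
    adjacent = {(v, w) for v, w in edges} | {(w, v) for v, w in edges}
    degree = {v: 0 for v in nodes}
    for v, w in edges:
        degree[v] += 1
        degree[w] += 1
    total = 0.0
    for v in nodes:
        for w in nodes:
            if community[v] == community[w]:      # delta(c_v, c_w) = 1
                a_vw = 1 if (v, w) in adjacent else 0
                total += a_vw - degree[v] * degree[w] / (2 * m)
    return total / (2 * m)


# Toy graph (hypothetical): two triangles joined by a bridge edge 2-3.
toy_nodes = list(range(6))
toy_edges = [(0, 1), (0, 2), (1, 2), (2, 3), (3, 4), (3, 5), (4, 5)]
two_communities = {0: "a", 1: "a", 2: "a", 3: "b", 4: "b", 5: "b"}
one_community = {v: "a" for v in toy_nodes}
q_split = modularity(toy_nodes, toy_edges, two_communities)
q_single = modularity(toy_nodes, toy_edges, one_community)
```

Splitting the toy graph into its two triangles gives Q = 5/14, while placing every vertex in one community gives Q = 0, so the split is preferred.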


Problem #2
• Q: Is maximizing modularity similar to local spectral partitioning?
• Notes:
  ▫ In spectral methods, eigenvectors are based on the unnormalized Laplacian of the graph: L = D - A
    D = degree matrix of the graph
    A = adjacency matrix of the graph
  ▫ Look into the Fiedler vector
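As a starting point for exploring the notes above, the sketch below builds L = D - A and approximates the Fiedler vector (the eigenvector of the second-smallest eigenvalue of L). It assumes a connected, undirected graph with nodes numbered 0..n-1 and uses shifted power iteration with the constant vector projected out, a technique chosen here for simplicity rather than taken from the slides. Function names and the toy graph are hypothetical.

```python
import math


def laplacian(n, edges):
    """Unnormalized Laplacian L = D - A as a dense list of lists."""
    lap = [[0.0] * n for _ in range(n)]
    for u, v in edges:
        lap[u][u] += 1.0
        lap[v][v] += 1.0
        lap[u][v] -= 1.0
        lap[v][u] -= 1.0
    return lap


def fiedler_vector(n, edges, iterations=500):
    """Approximate the Fiedler vector by power iteration on (shift*I - L).

    shift = 2 * max degree bounds the largest eigenvalue of L, so the
    iteration converges to the smallest eigenvalue's vector once the
    constant vector (eigenvalue 0) is removed at every step.
    """
    lap = laplacian(n, edges)
    shift = 2.0 * max(lap[i][i] for i in range(n))
    x = [float(i) for i in range(n)]              # deterministic start vector
    for _ in range(iterations):
        x = [shift * x[i] - sum(lap[i][j] * x[j] for j in range(n))
             for i in range(n)]
        mean = sum(x) / n
        x = [xi - mean for xi in x]               # project out the constant vector
        norm = math.sqrt(sum(xi * xi for xi in x))
        x = [xi / norm for xi in x]
    return x


# Toy graph (hypothetical): two triangles joined by a bridge edge 2-3.
toy_edges = [(0, 1), (0, 2), (1, 2), (2, 3), (3, 4), (3, 5), (4, 5)]
toy_vector = fiedler_vector(6, toy_edges)
negative_side = {i for i in range(6) if toy_vector[i] < 0}
```

Splitting the nodes by the sign of their Fiedler-vector entry separates the two triangles in this toy graph, which is the same split that scores well under modularity.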


Clustering Objective Function: Compression
• Organize into few, homogeneous communities
[Figure: two arrangements of the adjacency matrix into row groups and column groups, shown side by side ("versus")]

Good Clustering
1. Similar nodes are grouped together
2. As few groups as necessary

A few, homogeneous blocks

Good Clustering implies Good Compression


Total Encoding Cost Objective Function
n × m adj. matrix, k = 3 row groups, ℓ = 3 col. groups

        m_1     m_2     m_3
n_1    p_1,1   p_1,2   p_1,3
n_2    p_2,1   p_2,2   p_2,3
n_3    p_3,1   p_3,2   p_3,3

density of ones (edges): p_{i,j} = e_{i,j} / (n_i m_j)

code cost: Σ_{i,j} n_i m_j H(p_{i,j}) bits total (block size × entropy)
+
description cost: Σ_i transmit(row-partition_i description) + Σ_j transmit(col-partition_j description) + Σ_{i,j} transmit(#edges e_{i,j}) + transmit(#partitions)


Total Encoding Cost Objective Function
n × m adj. matrix, k = 3 row groups, ℓ = 3 col. groups

        m_1     m_2     m_3
n_1    p_1,1   p_1,2   p_1,3
n_2    p_2,1   p_2,2   p_2,3
n_3    p_3,1   p_3,2   p_3,3

density of ones (edges): p_{i,j} = e_{i,j} / (n_i m_j)

total cost in bits =
  Σ_{i,j} n_i m_j H(p_{i,j})                        (code cost: block size × entropy)
  + n H(n_1/n, …, n_k/n) + m H(m_1/m, …, m_ℓ/m)
  + log k + log ℓ + Σ_{i,j} ⌈log n_i m_j⌉           (description cost, together with the line above)
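The cost terms on this slide can be evaluated for a given grouping of rows and columns. The sketch below is a rough illustration, assuming a 0/1 matrix, group labels numbered from 0 with no empty groups, base-2 logarithms, and the description-cost terms as they can be read from this transcript; the exact encoding in the cited cross-associations paper may differ. The function names are hypothetical.

```python
import math


def entropy(probabilities):
    """Shannon entropy in bits; zero-probability terms contribute nothing."""
    return -sum(p * math.log2(p) for p in probabilities if p > 0)


def total_encoding_cost(matrix, row_group, col_group):
    """Return (code_cost, description_cost) in bits for one grouping."""
    n, m = len(matrix), len(matrix[0])
    k, num_col_groups = max(row_group) + 1, max(col_group) + 1
    row_sizes = [row_group.count(i) for i in range(k)]
    col_sizes = [col_group.count(j) for j in range(num_col_groups)]

    # e_{i,j}: number of ones in block (i, j)
    ones = [[0] * num_col_groups for _ in range(k)]
    for r in range(n):
        for c in range(m):
            ones[row_group[r]][col_group[c]] += matrix[r][c]

    code_cost = 0.0
    edge_count_cost = 0
    for i in range(k):
        for j in range(num_col_groups):
            size = row_sizes[i] * col_sizes[j]          # n_i * m_j
            p = ones[i][j] / size                       # density p_{i,j}
            code_cost += size * entropy([p, 1 - p])
            edge_count_cost += math.ceil(math.log2(size))

    description_cost = (
        n * entropy([s / n for s in row_sizes])
        + m * entropy([s / m for s in col_sizes])
        + math.log2(k) + math.log2(num_col_groups)
        + edge_count_cost
    )
    return code_cost, description_cost


# Toy matrix (hypothetical): two dense diagonal blocks, empty off-diagonal blocks.
toy_matrix = [[1, 1, 0, 0], [1, 1, 0, 0], [0, 0, 1, 1], [0, 0, 1, 1]]
grouped = total_encoding_cost(toy_matrix, [0, 0, 1, 1], [0, 0, 1, 1])
ungrouped = total_encoding_cost(toy_matrix, [0, 0, 0, 0], [0, 0, 0, 0])
```

With two row groups and two column groups every block is homogeneous, so the code cost drops to zero and the total (18 bits) is below the single-group total (20 bits), even though the description cost is higher.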


Total Encoding Cost: cost vs. # of clusters
[Figure: bit cost compared across three groupings: one row group and one col group; n row groups and m col groups; k = 3 row groups and ℓ = 3 col groups]


Problem #3
• Q: How do you find the best partitioning of a graph based on compression?
• Notes:
  ▫ Requires a couple of lemmas that rely on
    Concavity of entropy
    Non-negativity of the KL-divergence
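The two properties named in the notes can be checked numerically before attempting a proof. This is a small sketch with hypothetical function names and example distributions; it illustrates the properties on one example and is not a proof. It assumes discrete distributions of equal length, base-2 logarithms, and q_i > 0 wherever p_i > 0.

```python
import math


def entropy(p):
    """Shannon entropy in bits."""
    return -sum(pi * math.log2(pi) for pi in p if pi > 0)


def kl_divergence(p, q):
    """KL(p || q) in bits; requires q_i > 0 wherever p_i > 0."""
    return sum(pi * math.log2(pi / qi) for pi, qi in zip(p, q) if pi > 0)


def mixture(p, q, weight):
    """Convex combination weight*p + (1 - weight)*q."""
    return [weight * pi + (1 - weight) * qi for pi, qi in zip(p, q)]


# Example distributions (hypothetical).
p_example = [0.9, 0.1]
q_example = [0.2, 0.8]
w = 0.5

# Concavity: the entropy of a mixture is at least the mixture of the entropies.
concavity_gap = entropy(mixture(p_example, q_example, w)) - (
    w * entropy(p_example) + (1 - w) * entropy(q_example)
)

# Non-negativity: KL(p || q) >= 0, with equality when p == q.
kl_value = kl_divergence(p_example, q_example)
```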


Evaluation based on Link Prediction
• A good factorization of a graph's connectivity structure should accurately predict links between nodes based on their respective communities
• P(s→t | s, t, c_s, c_t)
• Evaluate effectiveness by
  1. randomly holding out a number of links
  2. building a model
  3. using learnt model to predict held-out links
  4. measuring performance with area under ROC curve (AUC)
[Figure: 0/1 adjacency matrices illustrating held-out links]
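The four evaluation steps can be sketched end to end on a toy graph. The code below assumes an undirected graph, a fixed community assignment, and a deliberately simple model in which the link probability depends only on the pair of communities (estimated as the observed link density between them). This model is a stand-in for illustration, not the model from the slides, and the held-out pairs are passed in explicitly instead of being sampled. All names are hypothetical.

```python
from itertools import combinations


def fit_block_densities(nodes, edges, community, held_out):
    """Estimate P(link | c_s, c_t) from all node pairs that are not held out."""
    edge_set = {frozenset(e) for e in edges}
    hidden = {frozenset(p) for p in held_out}
    links, pairs = {}, {}
    for s, t in combinations(nodes, 2):
        pair = frozenset((s, t))
        if pair in hidden:
            continue
        block = frozenset((community[s], community[t]))
        pairs[block] = pairs.get(block, 0) + 1
        links[block] = links.get(block, 0) + (pair in edge_set)
    return {block: links[block] / pairs[block] for block in pairs}


def link_score(densities, community, s, t):
    """Predicted link probability for the pair (s, t)."""
    return densities.get(frozenset((community[s], community[t])), 0.0)


def auc(positive_scores, negative_scores):
    """Probability that a held-out link outscores a held-out non-link (ties count half)."""
    wins = 0.0
    for p in positive_scores:
        for q in negative_scores:
            wins += 1.0 if p > q else 0.5 if p == q else 0.0
    return wins / (len(positive_scores) * len(negative_scores))


# Toy graph (hypothetical): two triangles joined by a bridge edge 2-3.
toy_nodes = list(range(6))
toy_edges = [(0, 1), (0, 2), (1, 2), (2, 3), (3, 4), (3, 5), (4, 5)]
toy_community = {0: "a", 1: "a", 2: "a", 3: "b", 4: "b", 5: "b"}
held_out_links = [(0, 1)]        # true links hidden from the model
held_out_non_links = [(0, 4)]    # true non-links hidden from the model

densities = fit_block_densities(
    toy_nodes, toy_edges, toy_community, held_out_links + held_out_non_links
)
toy_auc = auc(
    [link_score(densities, toy_community, s, t) for s, t in held_out_links],
    [link_score(densities, toy_community, s, t) for s, t in held_out_non_links],
)
```

An AUC of 0.5 corresponds to random guessing and 1.0 to a perfect ranking of held-out links above held-out non-links.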

Evaluation based on Variation of Information [Karrer, Levina, and Newman, Phys. Rev. E. 2008]
• Perturb graph by randomly reassigning a number of its links
  ▫ Rewiring parameter c ∈ [0,1] determines fraction of links rewired
  ▫ Links are rewired in a way that preserves the expected degree of each node in graph
• C = communities discovered on original graph, where c = 0
• C’ = communities discovered on perturbed graphs, where c ≠ 0
• Variation of Information, Δ(C, C’) = H(C|C’) + H(C’|C)
  ▫ H(C’|C) measures the information needed to describe C’ given C
  ▫ Δ(C, C’) ∈ [0, log(N)] treats each assignment as a message
  ▫ Is a symmetric entropy-based measure of the distance between these messages
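The quantity Δ(C, C’) can be computed from two label lists. The sketch below assumes each partition is given as a list with one community label per node, in the same node order, and uses base-2 logarithms so the result is in bits (the slide does not fix the base). The function name is hypothetical.

```python
import math
from collections import Counter


def variation_of_information(labels_a, labels_b):
    """Delta(C, C') = H(C | C') + H(C' | C), in bits."""
    n = len(labels_a)
    count_a = Counter(labels_a)
    count_b = Counter(labels_b)
    joint = Counter(zip(labels_a, labels_b))
    vi = 0.0
    for (a, b), n_ab in joint.items():
        p_ab = n_ab / n
        # log p(b | a) contributes to H(C' | C); log p(a | b) to H(C | C')
        vi -= p_ab * (math.log2(n_ab / count_a[a]) + math.log2(n_ab / count_b[b]))
    return vi


# Identical partitions are at distance 0.
same = variation_of_information([0, 0, 1, 1], [0, 0, 1, 1])

# Two "crossed" partitions of N = 4 nodes reach the upper bound log2(4) = 2 bits.
crossed = variation_of_information([0, 0, 1, 1], [0, 1, 0, 1])
```

Relabeling the communities does not change the result, since only the co-occurrence counts of labels matter.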


Summary: Clustering Objective Functions for Complex Networks
• Conductance
  ▫ J. Leskovec, K. J. Lang, M. W. Mahoney: Empirical comparison of algorithms for network community detection. WWW 2010: 631-640
• Modularity
  ▫ A. Clauset, M. E. J. Newman, C. Moore: Finding community structure in very large networks. Phys. Rev. E 70: 066111 (2004)
• Total encoding cost
  ▫ D. Chakrabarti, S. Papadimitriou, D. S. Modha, C. Faloutsos: Fully automatic cross-associations. KDD 2004: 79-88
• Maximum likelihood
  ▫ D. M. Blei, A. Y. Ng, M. I. Jordan: Latent Dirichlet Allocation. JMLR 3: 993-1022 (2003)
  ▫ E. M. Airoldi, D. M. Blei, S. E. Fienberg, E. P. Xing: Mixed Membership Stochastic Blockmodels. JMLR 9: 1981-2014 (2008)
  ▫ K. Henderson, T. Eliassi-Rad, S. Papadimitriou, C. Faloutsos: HCDF: A Hybrid Community Discovery Framework. SDM 2010: 754-765
• Clique percolation
  ▫ G. Palla, I. Derenyi, I. Farkas, T. Vicsek: Uncovering the overlapping community structure of complex networks in nature and society. Nature 435:814 (2005)

• Many more…