presented by september 21, 2007 warsaw, poland 2007... · 2007. 8. 10. · cmu scs mining large...

131
THE 18 TH EUROPEAN CONFERENCE ON MACHINE LEARNING AND THE 11 TH EUROPEAN CONFERENCE ON PRINCIPLES AND PRACTICE OF KNOWLEDGE DISCOVERY IN DATABASES M INING L ARGE G RAPHS L AWS AND TOOLS T UTORIAL N OTES presented by Christos Faloutsos and Jure Leskovec September 21, 2007 Warsaw, Poland

Upload: others

Post on 21-Feb-2021

4 views

Category:

Documents


0 download

TRANSCRIPT

Page 1: presented by September 21, 2007 Warsaw, Poland 2007... · 2007. 8. 10. · CMU SCS Mining Large Graphs:Mining Large Graphs: Laws and toolsLaws and tools Christos Faloutsos and Jure

THE 18TH EUROPEAN CONFERENCE ON MACHINE LEARNINGAND

THE 11TH EUROPEAN CONFERENCE ON PRINCIPLES AND PRACTICEOF KNOWLEDGE DISCOVERY IN DATABASES

MINING LARGE GRAPHS

LAWS AND TOOLS

TUTORIAL NOTES

presented by

Christos Faloutsos and Jure Leskovec

September 21, 2007

Warsaw, Poland

Page 2: presented by September 21, 2007 Warsaw, Poland 2007... · 2007. 8. 10. · CMU SCS Mining Large Graphs:Mining Large Graphs: Laws and toolsLaws and tools Christos Faloutsos and Jure

Prepared and presented by:Christos Faloutsos and Jure LeskovecMachine Learning Department, Carnegie Mellon University, USA

Page 3: presented by September 21, 2007 Warsaw, Poland 2007... · 2007. 8. 10. · CMU SCS Mining Large Graphs:Mining Large Graphs: Laws and toolsLaws and tools Christos Faloutsos and Jure

CMU SCS

Mining Large Graphs:Mining Large Graphs: Laws and toolsLaws and tools

Christos Faloutsos and Jure LeskovecM hi L i D t tMachine Learning Department

CMU SCS

Tutorial outlineTutorial outline

Part 1: Laws and generators:What are properties of large graphs?How do we model them?

Part 2: Tools for static and dynamics graphsy g pMethods based on Singular Value Decomposition

Part 3: Case studiesPart 3: Case studiesPropagation and detection of viruses in networksGraph projectionsGraph projections

Faloutsos&LeskovecECML/PKDD 2007

Part 1-2

1

Page 4: presented by September 21, 2007 Warsaw, Poland 2007... · 2007. 8. 10. · CMU SCS Mining Large Graphs:Mining Large Graphs: Laws and toolsLaws and tools Christos Faloutsos and Jure

CMU SCS

Part 1: OutlinePart 1: Outline

1 1 Mi i1.1: MiningWhat are the statistical properties of static and time evolving networks?

1.2: Models:How do we build models of network generations of evolution?g

1.3: Fitting:How do we fit models?How do we fit models? How do we generate realistic looking graphs?

Faloutsos&LeskovecECML/PKDD 2007

Part 1-3

CMU SCS

Part 1.1: Mining

What are statistical properties ofWhat are statistical properties of networks across various

domains?domains?

2

Page 5: presented by September 21, 2007 Warsaw, Poland 2007... · 2007. 8. 10. · CMU SCS Mining Large Graphs:Mining Large Graphs: Laws and toolsLaws and tools Christos Faloutsos and Jure

CMU SCS

Examples of networksExamples of networks(b) (c)

(a)(a)

(d)(e)

Internet (a)

( )

sexual network (d)Internet (a)citation network (b)World Wide Web (c)

sexual network (d)food web (e)

Faloutsos&LeskovecECML/PKDD 2007

Part 1-5

World Wide Web (c)

CMU SCS

Networks of the real world (1)Networks of the real-world (1)Information networks:

World Wide Web: hyperlinksWorld Wide Web: hyperlinksCitation networksBlog networks

Social networks: people +Social networks: people + interactios

Organizational networksCommunication networks

Florence families Karate club networkCommunication networks Collaboration networksSexual networks Collaboration networks

Technological networks:Power gridAirline, road, river networks, ,Telephone networksInternetAutonomous systems Collaboration networkF i d hi t k

Faloutsos&LeskovecECML/PKDD 2007

Part 1-6

y Collaboration networkFriendship network

3

Page 6: presented by September 21, 2007 Warsaw, Poland 2007... · 2007. 8. 10. · CMU SCS Mining Large Graphs:Mining Large Graphs: Laws and toolsLaws and tools Christos Faloutsos and Jure

CMU SCS

Networks of the real world (2)Networks of the real-world (2)

Biological networksmetabolic networksfood webneural networks Yeast protein Semantic network

gene regulatory networks

pinteractions

Language networksSemantic networks

Software networks

Faloutsos&LeskovecECML/PKDD 2007

Part 1-7

…Language network XFree86 network

CMU SCS

Traditional approachTraditional approachSociologists were first to study networks:Sociologists were first to study networks:

Study of patterns of connections between people to understand functioning of thepeople to understand functioning of the societyPeople are nodes interactions are edgesPeople are nodes, interactions are edges Questionares are used to collect link data (hard to obtain inaccurate subjective)(hard to obtain, inaccurate, subjective)Typical questions: Centrality and connectivity

Li it d t ll h ( 10 d ) dLimited to small graphs (~10 nodes) and properties of individual nodes and edges

Faloutsos&LeskovecECML/PKDD 2007

Part 1-8

4

Page 7: presented by September 21, 2007 Warsaw, Poland 2007... · 2007. 8. 10. · CMU SCS Mining Large Graphs:Mining Large Graphs: Laws and toolsLaws and tools Christos Faloutsos and Jure

CMU SCS

Motivation: New approach (1)Motivation: New approach (1)

Large networks (e.g., web, internet, on-line social networks) with millions of nodesMany traditional questions not useful anymore:

Traditional: What happens if a node u is removed? ppNow: What percentage of nodes needs to be removed to affect network connectivity?

Focus moves from a single node to study of statistical properties of the network as a wholep p

Faloutsos&LeskovecECML/PKDD 2007

Part 1-9

CMU SCS

Motivation: New approach (2)Motivation: New approach (2)

How the network “looks like” even if I can’t look at it?Need statistical methods and tools to quantify large networksg3 parts/goals:

Statistical properties of large networksStatistical properties of large networksModels that help understand these propertiesPredict behavior of networked systems based onPredict behavior of networked systems based on measured structural properties and local rules governing individual nodes

Faloutsos&LeskovecECML/PKDD 2007

Part 1-10

5

Page 8: presented by September 21, 2007 Warsaw, Poland 2007... · 2007. 8. 10. · CMU SCS Mining Large Graphs:Mining Large Graphs: Laws and toolsLaws and tools Christos Faloutsos and Jure

CMU SCS

How to generate a graph?How to generate a graph?

What is the simplest way to generate a graph?g pRandom graph model (Erdos-Renyi model Poisson random graph model):model, Poisson random graph model):

Given n vertices connect each pair i.i.d. with probability p

How good (“realistic”) is this graph g ( ) g pgenerator?

Faloutsos&LeskovecECML/PKDD 2007

Part 1-11

CMU SCS

Small world effect (1)Small-world effect (1)Si d f ti [Mil 60 ]Six degrees of separation [Milgram 60s]

Random people in Nebraska were asked to send letters to stockbrokes in BostonLetters can only be passed to first-name acquanticesOnly 25% letters reached the goalBut they reached it in about 6 stepsBut they reached it in about 6 steps

Measuring path lengths: Diameter (longest shortest path): max dij( g p ) ijEffective diameter: distance at which 90% of all connected pairs of nodes can be reachedMean geodesic (shortest) distance lMean geodesic (shortest) distance l

or

Faloutsos&LeskovecECML/PKDD 2007

Part 1-12

6

Page 9: presented by September 21, 2007 Warsaw, Poland 2007... · 2007. 8. 10. · CMU SCS Mining Large Graphs:Mining Large Graphs: Laws and toolsLaws and tools Christos Faloutsos and Jure

CMU SCS

Small world effect (2)Small-world effect (2)[Leskovec&Horvitz,07]

Distribution of shortest path lengths

107

108

Pick a random

[Leskovec&Horvitz,07]

shortest path lengths Microsoft Messenger network

105

106

odes

node, count how many

nodes are at di tnetwork

180 million people1 3 billi d

103

104

mbe

r of n

o distance 1,2,3... hops

71.3 billion edgesEdge if two people exchanged at least

101

102Num

exchanged at least one message in one month period

0 5 10 15 20 25 30100

Distance (Hops)

Faloutsos&LeskovecECML/PKDD 2007

Part 1-13

month period

CMU SCS

Small world effect (3)Small-world effect (3)If number of vertices within distance growsIf number of vertices within distance r grows exponentially with r, then mean shortest path length lincreases as log nImplications:

Information (viruses) spread quicklyErdos numbers are smallErdos numbers are smallPeer to peer networks (for navigation purposes)

Shortest paths exists H bl t fi d th thHumans are able to find the paths:

People only know their friendsPeople do not have the global knowledge of the network

This suggests something special about the structure of the network

On a random graph short paths exists but no one would be ableFaloutsos&LeskovecECML/PKDD 2007

Part 1-14

On a random graph short paths exists but no one would be able to find them

7

Page 10: presented by September 21, 2007 Warsaw, Poland 2007... · 2007. 8. 10. · CMU SCS Mining Large Graphs:Mining Large Graphs: Laws and toolsLaws and tools Christos Faloutsos and Jure

CMU SCS

Degree distributions (1)Degree distributions (1)Let denote a fraction of nodes with degree kLet pk denote a fraction of nodes with degree kWe can plot a histogram of pk vs. kIn a (Erdos-Renyi) random graph degreeIn a (Erdos Renyi) random graph degree distribution follows Poisson distributionDegrees in real networks are heavily skewed to the rightthe rightDistribution has a long tail of values that are far above the meanPower-law [Faloutsos et al], Zipf’s law, Pareto’s law, Long tailMany things follow Power law:Many things follow Power-law:

Amazon sales,word length distribution, W lth E th k

Faloutsos&LeskovecECML/PKDD 2007

Part 1-15

Wealth, Earthquakes, …

CMU SCS

Degree distributions (2)Degree distributions (2)Degree distribution in a blog network

( l t th d t i diff t l )

10-3

10-2

2.5

3

3.5 x 10-3Many real world networks contain hubs: highly

(plot the same data using different scales)

lin-lin log-lin

p k p k

6

10-5

10-4

0.5

1

1.5

2hubs: highly connected nodesWe can easily

p

log

p

105

0 200 400 600 800 100010-6

0 200 400 600 800 10000

We can easily distinguish between exponential and power law tail by

kk

Power-law:

103

104power-law tail by plotting on log-lin and log-log axis g

p k

Power law:

101

102

a d og og a sPower-law is a lineon log-log plot log-log

log

Faloutsos&LeskovecECML/PKDD 2007

Part 1-16100 101 102 103 104

100

g g

log k

8

Page 11: presented by September 21, 2007 Warsaw, Poland 2007... · 2007. 8. 10. · CMU SCS Mining Large Graphs:Mining Large Graphs: Laws and toolsLaws and tools Christos Faloutsos and Jure

CMU SCS

Poisson vs Scale free networkPoisson vs. Scale-free network

Poisson network S l f ( l ) t kPoisson network Scale-free (power-law) network

Function is

(Erdos-Renyi random graph)Degree distribution is

Faloutsos&LeskovecECML/PKDD 2007

Part 1-17

Function is scale free if:f(ax) = c f(x)Degree distribution is Poisson

distribution is Power-law

CMU SCS

Network resilience (1)Network resilience (1)W b h thWe observe how the connectivity (length of the paths) of the network p )changes as the vertices get removed [Albert et al, 00; Palmer et al 01]Palmer et al, 01]Vertices can be removed:

Uniformly at randomUniformly at randomIn order of decreasing degree

It is important for id i lepidemiologyRemoval of vertices corresponds to vaccination

Faloutsos&LeskovecECML/PKDD 2007

Part 1-18

p

9

Page 12: presented by September 21, 2007 Warsaw, Poland 2007... · 2007. 8. 10. · CMU SCS Mining Large Graphs:Mining Large Graphs: Laws and toolsLaws and tools Christos Faloutsos and Jure

CMU SCS

Network resilience (2)Network resilience (2)Real-world networks are resilient to random attacks

One has to remove all web-pages of degree > 5 to disconnect the webBut this is a very small percentage of web pages

Random network has better resilience to targeted attacks

Random networkInternet (Autonomous systems)

Preferential

engt

h

Preferentialremoval

an p

ath

le

Randomremoval

Me

Faloutsos&LeskovecECML/PKDD 2007

Part 1-19Fraction of removed nodes Fraction of removed nodes

CMU SCS

Community structureCommunity structureMost social networks showMost social networks show community structure

groups have higher density of edges ithi thwithin than accross groups

People naturally divide into groups based on interests, age, occupation, …

How to find communities:Spectral clustering (embedding into a l di )low-dim space)Hierarchical clustering based on connection strengthC bi t i l l ith ( i tCombinatorial algorithms (min cut style formulations)Block modelsDiff i th d

Faloutsos&LeskovecECML/PKDD 2007

Part 1-20

Diffusion methods Friendship network of children in a school

10

Page 13: presented by September 21, 2007 Warsaw, Poland 2007... · 2007. 8. 10. · CMU SCS Mining Large Graphs:Mining Large Graphs: Laws and toolsLaws and tools Christos Faloutsos and Jure

CMU SCS

Spectral propertiesSpectral properties

Eigenvalues of graph adjacency

Eigenvalue distribution in online social networkgraph adjacency

matrix follow a power law

valu

e

online social network

Network values (components of principal Ei

genv

principal eigenvector) also follow a power-law

log

p[Chakrabarti et al]

log Rank

Faloutsos&LeskovecECML/PKDD 2007

Part 1-21

CMU SCS

What about evolving graphs?What about evolving graphs?

Conventional wisdom/intuition:Constant average degree: the number of g gedges grows linearly with the number of nodesSlowly growing diameter: as the network grows the distances between nodes growgrows the distances between nodes grow

Faloutsos&LeskovecECML/PKDD 2007

Part 1-22

11

Page 14: presented by September 21, 2007 Warsaw, Poland 2007... · 2007. 8. 10. · CMU SCS Mining Large Graphs:Mining Large Graphs: Laws and toolsLaws and tools Christos Faloutsos and Jure

CMU SCS

Networks over time: DensificationNetworks over time: DensificationA i l ti Wh t i th l tiA simple question: What is the relation between the number of nodes and the number of edges in a network over time?number of edges in a network over time?Let:

N(t) … nodes at time t( )E(t) … edges at time t

Suppose that:N(t+1) = 2 * N(t)

Q: what is your guess for E( 1) ? 2 * E( )E(t+1) =? 2 * E(t)

A: over-doubled!But obeying the Densification Power Law [KDD05]

Faloutsos&LeskovecECML/PKDD 2007

Part 1-23

But obeying the Densification Power Law [KDD05]

CMU SCS

Networks over time: DensificationNetworks over time: DensificationInternet

Networks are denser over time The number of edges grows faster than the number of nodes

Internet

E(t)

1 2faster than the number of nodes – average degree is increasing lo

g E a=1.2

a … densification exponent log N(t)1 ≤ a ≤ 2:

a=1: linear growth – constant out-degree (assumed in the literature so f )

Citations

E(t)

far)a=2: quadratic growth – clique

a=1.7

log

E

Faloutsos&LeskovecECML/PKDD 2007

Part 1-24log N(t)

12

Page 15: presented by September 21, 2007 Warsaw, Poland 2007... · 2007. 8. 10. · CMU SCS Mining Large Graphs:Mining Large Graphs: Laws and toolsLaws and tools Christos Faloutsos and Jure

CMU SCS

Densification & degree distributionDensification & degree distributionH d d ifi ti ff t

Case (a): Degree exponent γ is constant over time TheHow does densification affect

degree distribution?Densification:

γ is constant over time. The network densifies, a=1.2

Densification:Degree distribution: pk=kγGiven densification exponent a

γ(t)

Given densification exponent a, the degree exponent is [TKDD07]:

(a) For γ=const over time, we obtain Case (b): Degree exponent time t

( ) γ ,densification only for 1<γ<2, and then it holds: γ=a/2(b) For γ<2 degree distribution

γ evolves over time. The network densifies, a=1.6

(b) For γ<2 degree distribution evolves according to:

γ(t)

Faloutsos&LeskovecECML/PKDD 2007

Part 1-25time t

Given: densification a, number of nodes n

CMU SCS

Shrinking diametersShrinking diametersInternet

Intuition and prior work say that distances between the

Internet

met

er

nodes slowly grow as the network grows (like log n):

diam

d ~ O(log N)d ~ O(log log N)

Citations

size of the graph

Diameter Shrinks/Stabilizes over time

Citations

er

as the network grows the distances between nodes diam

ete

Faloutsos&LeskovecECML/PKDD 2007

Part 1-26slowly decrease [KDD 05]

time

13

Page 16: presented by September 21, 2007 Warsaw, Poland 2007... · 2007. 8. 10. · CMU SCS Mining Large Graphs:Mining Large Graphs: Laws and toolsLaws and tools Christos Faloutsos and Jure

CMU SCS

Properties hold in many graphsProperties hold in many graphsTh tt b b d i lThese patterns can be observed in many real world networks:

World wide web [Barabasi]World wide web [Barabasi]On-line communities [Holme, Edling, Liljeros]Who call whom telephone networks [Cortes]Internet backbone – routers [Faloutsos, Faloutsos, Faloutsos]Movies to actors network [Barabasi]Science citations [Leskovec Kleinberg Faloutsos]Science citations [Leskovec, Kleinberg, Faloutsos]Click-streams [Chakrabarti]Autonomous systems [Faloutsos, Faloutsos, Faloutsos]Co-authorship [Leskovec, Kleinberg, Faloutsos]Sexual relationships [Liljeros]

Faloutsos&LeskovecECML/PKDD 2007

Part 1-27

CMU SCS

Part 1.2: ModelsWe saw properties

H d fi d d l ?How do we find models?

14

Page 17: presented by September 21, 2007 Warsaw, Poland 2007... · 2007. 8. 10. · CMU SCS Mining Large Graphs:Mining Large Graphs: Laws and toolsLaws and tools Christos Faloutsos and Jure

CMU SCS

1 2 Models: Outline1.2 Models: Outline

W t th f ll ti liWe present the follow timeline(Erdos-Renyi) Random graphsExponential random graphsSmall-world modelPreferential attachmentEdge copying modelEdge copying modelCommunity guided attachmentForest fireForest fireKronecker graphs

Faloutsos&LeskovecECML/PKDD 2007

Part 1-29

CMU SCS

(Erdos Renyi) Random graph(Erdos-Renyi) Random graphAl k P i d hAlso known as Poisson random graphs or Bernoulli graphs [Erdos&Renyi, 60s]

Gi ti t h i i i d ithGiven n vertices connect each pair i.i.d. with probability p

Two variants:Two variants:Gn,p: graph with m edges appears with probability pm(1-p)M-m, where M=0.5n(n-1) is the max number p ( p) , ( )of edgesGn,m: graphs with n nodes, m edges

Does not mimic reality Very rich mathematical theory: many

Faloutsos&LeskovecECML/PKDD 2007

Part 1-30

y y yproperties are exactly solvable

15

Page 18: presented by September 21, 2007 Warsaw, Poland 2007... · 2007. 8. 10. · CMU SCS Mining Large Graphs:Mining Large Graphs: Laws and toolsLaws and tools Christos Faloutsos and Jure

CMU SCS

Properties of random graphsProperties of random graphs

Degree distribution is Poisson since the presence and absence of edges is independent

!)1(

kezpp

kn

pzk

knkk

−− ≈−⎟⎟

⎞⎜⎜⎝

⎛=

Giant component: average degree k=2m/n:k=1-ε: all components are of size Ω(log n)

⎠⎝

k 1 ε: all components are of size Ω(log n)k=1+ε: there is 1 component of size Ω(n)

All others are of size Ω(log n)( g )They are a tree plus an edge, i.e., cycles

Diameter: log n / log kFaloutsos&LeskovecECML/PKDD 2007

Part 1-31

g g

CMU SCS

Evolution of a random graphEvolution of a random graph

verti

ces

on-G

CC

vfo

r no

Faloutsos&LeskovecECML/PKDD 2007

Part 1-32k

16

Page 19: presented by September 21, 2007 Warsaw, Poland 2007... · 2007. 8. 10. · CMU SCS Mining Large Graphs:Mining Large Graphs: Laws and toolsLaws and tools Christos Faloutsos and Jure

CMU SCS

Subgraphs in random graphsSubgraphs in random graphsExpected number ofExpected number of

subgraphs H(v,e) in Gn,p is

n ev⎞⎛ !apnp

av

vn

XEev

e ≈⎟⎟⎠

⎞⎜⎜⎝

⎛=

!)(

a... # of isomorphic graphs

Faloutsos&LeskovecECML/PKDD 2007

Part 1-33

CMU SCS

Random graphs: conclusionRandom graphs: conclusionP

Configuration modelPros:

Simple and tractable modelPhase transitionsPhase transitionsGiant component

Cons:Degree distributionNo community structureNo degree correlationsNo degree correlations

Extensions:Configuration model

Random graphs with arbitrary degree sequenceExcess degree: Degree of a vertex of the end of random edge: qk = k pk

Faloutsos&LeskovecECML/PKDD 2007

Part 1-34

17

Page 20: presented by September 21, 2007 Warsaw, Poland 2007... · 2007. 8. 10. · CMU SCS Mining Large Graphs:Mining Large Graphs: Laws and toolsLaws and tools Christos Faloutsos and Jure

CMU SCS

Exponential random graphs(p* models)

Social sciences to analyze the small networks

Examples of εi

Let εi set of measurable properties of a graph (number of edges, g p ( g ,number of nodes of a given degree, number of triangles, …)Exponential random graph model defines a probability distribution over graphs:

Faloutsos&LeskovecECML/PKDD 2007

Part 1-35

CMU SCS

Exponential random graphsExponential random graphsIncludes Erdos-Renyi as a specialIncludes Erdos Renyi as a special caseAssume parameters βi are specifiedspecified

No analytical solutions for the modelBut can use simulation to sample the graphs:graphs:

Define local moves on a graph:Addition/removal of edgesMovement of edges

Example of parameter estimates:Movement of edgesEdge swaps

Parameter estimation: maximum likelihoodmaximum likelihood

Problem:Can’t solve for transitivity (produces cliques)

Faloutsos&LeskovecECML/PKDD 2007

Part 1-36

cliques)Used to analyze small networks

18

Page 21: presented by September 21, 2007 Warsaw, Poland 2007... · 2007. 8. 10. · CMU SCS Mining Large Graphs:Mining Large Graphs: Laws and toolsLaws and tools Christos Faloutsos and Jure

CMU SCS

Small world modelSmall-world modelUsed for modeling network transitivityM k ki d f hi lMany networks assume some kind of geographical proximitySmall-world model:Small world model:

Start with a low-dimensional regular latticeRewire:

Add/remove edges to create shortcuts to join remote parts of theAdd/remove edges to create shortcuts to join remote parts of the latticeFor each edge with prob p move the other end to a random vertex

Rewiring allows to interpolate between regular latticeRewiring allows to interpolate between regular lattice and random graph

Faloutsos&LeskovecECML/PKDD 2007

Part 1-37Watts&Strogatz,1998

CMU SCS

Small-world modelSmall world model

Regular lattice (p=0):Clustering coefficient g

C=(3k-3)/(4k-2)=3/4Mean distance L/4k

Almost random graphAlmost random graph (p=1):

Clustering coefficient C=2k/LRewiring probability p

Clustering coefficient C 2k/LMean distance log L / log k

But, real graphs have g ppower-law degree distribution

Faloutsos&LeskovecECML/PKDD 2007

Part 1-38

Degree distribution

19

Page 22: presented by September 21, 2007 Warsaw, Poland 2007... · 2007. 8. 10. · CMU SCS Mining Large Graphs:Mining Large Graphs: Laws and toolsLaws and tools Christos Faloutsos and Jure

CMU SCS

Preferential attachmentPreferential attachmentB t d h h P i d di t ib tiBut, random graphs have Poisson degree distributionLet’s find a better model

Preferential attachment [Price 1965, Albert & Barabasi 1999]:

Add a new node, create m out-linksProbability of linking a node ki is

proportional to its degreep p gBased on Herbert Simon’s result

Power-laws arise from “Rich get richer” (cumulative advantage)Examples (Price 1965 for modeling citations):

Citations: new citations of a paper are proportional to the number it already has

Faloutsos&LeskovecECML/PKDD 2007

Part 1-39

it already has

CMU SCS

Preferential attachmentPreferential attachmentLeads to power-law degree distributions

3kBut:

3−∝ kpk

all nodes have equal (constant) out-degree

d l t k l d f thone needs a complete knowledge of the network

There are many generalizations andThere are many generalizations and variants, but the preferential selection is the key ingredient that

Faloutsos&LeskovecECML/PKDD 2007

Part 1-40leads to power-laws

20

Page 23: presented by September 21, 2007 Warsaw, Poland 2007... · 2007. 8. 10. · CMU SCS Mining Large Graphs:Mining Large Graphs: Laws and toolsLaws and tools Christos Faloutsos and Jure

CMU SCS

Edge copying modelEdge copying modelB t f ti l tt h t d t hBut, preferential attachment does not have communitiesC i d l [Kl i b t l 99]Copying model [Kleinberg et al, 99]:

Add a node and choose k the number of edges to addWith b β l t k d ti d li k t thWith prob. β select k random vertices and link to themWith prob. 1-β edges are copied from a randomly chosen nodechosen node

Generates power-law degree distributions with exponent 1/(1-β)exponent 1/(1 β)Generates communities

Faloutsos&LeskovecECML/PKDD 2007

Part 1-41

CMU SCS

Community guided attachmentCommunity guided attachmentBut we want toBut, we want to model/explain densification in networks

University

Assume community CS

Science Arts

Assume community structureOne expects many

CS Math Drama Music

One expects many within-group friendships and fewer cross-group onesonesCommunity guided attachment [KDD05]

Self-similar university community structure

Faloutsos&LeskovecECML/PKDD 2007

Part 1-42

attachment [KDD05]

21

Page 24: presented by September 21, 2007 Warsaw, Poland 2007... · 2007. 8. 10. · CMU SCS Mining Large Graphs:Mining Large Graphs: Laws and toolsLaws and tools Christos Faloutsos and Jure

CMU SCS

Community guided attachmentCommunity guided attachmentAssuming cross-community linking probabilityAssuming cross community linking probability

The Community Guided Attachment leads toThe Community Guided Attachment leads to Densification Power Law with exponent

a … densification exponentb … community tree branching factory gc … difficulty constant, 1 ≤ c ≤ b

If c = 1: easy to cross communitiesTh 2 d ti th f d liThen: a=2, quadratic growth of edges – near clique

If c = b: hard to cross communitiesThen: a=1 linear growth of edges – constant out-degree

Faloutsos&LeskovecECML/PKDD 2007

Part 1-43

Then: a 1, linear growth of edges constant out degree

CMU SCS

Forest Fire ModelForest Fire ModelBut we do not want to have explicitBut, we do not want to have explicit communitiesWant to model graphs that density and haveWant to model graphs that density and have shrinking diametersIntuition:

How do we meet friends at a party?How do we identify references when writing papers?papers?

Faloutsos&LeskovecECML/PKDD 2007

Part 1-44

22

Page 25: presented by September 21, 2007 Warsaw, Poland 2007... · 2007. 8. 10. · CMU SCS Mining Large Graphs:Mining Large Graphs: Laws and toolsLaws and tools Christos Faloutsos and Jure

CMU SCS

Forest Fire ModelForest Fire ModelTh F t Fi d l [KDD05] h 2The Forest Fire model [KDD05] has 2 parameters:

f d b i b bilitp … forward burning probabilityr … backward burning probability

The model:The model:Each turn a new node v arrivesUniformly at random chooses an “ambassador”Uniformly at random chooses an ambassador wFlip two geometric coins to determine the number in-and out-links of w to follow (burn)and out links of w to follow (burn)Fire spreads recursively until it diesNode v links to all burned nodes

Faloutsos&LeskovecECML/PKDD 2007

Part 1-45

CMU SCS

Forest Fire ModelForest Fire ModelForest Fire generates graphs thatForest Fire generates graphs that densify and have shrinking diameter

densification diameterE(t)

densification diameter

r

1.32

amet

erdi

a

Faloutsos&LeskovecECML/PKDD 2007

Part 1-46N(t) N(t)

23

Page 26: presented by September 21, 2007 Warsaw, Poland 2007... · 2007. 8. 10. · CMU SCS Mining Large Graphs:Mining Large Graphs: Laws and toolsLaws and tools Christos Faloutsos and Jure

CMU SCS

Forest Fire ModelForest Fire Model

Forest Fire also generates graphs with Power-Law degree distributiong

in-degree out-degreeg g

Faloutsos&LeskovecECML/PKDD 2007

Part 1-47log count vs. log in-degree log count vs. log out-degree

CMU SCS

Forest Fire: Phase transitionsForest Fire: Phase transitionsFix backwardFix backward probability r and vary forward burning

b bilit Cli likprobability pWe observe a sharp transition between

Clique-likegraphIncreasing

diametertransition between sparse and clique-like graphs Sparse

h

diameterConstantdiameter

Sweet spot is very narrow

graphDecreasing

diameter

Faloutsos&LeskovecECML/PKDD 2007

Part 1-48

24

Page 27: presented by September 21, 2007 Warsaw, Poland 2007... · 2007. 8. 10. · CMU SCS Mining Large Graphs:Mining Large Graphs: Laws and toolsLaws and tools Christos Faloutsos and Jure

CMU SCS

Kronecker graphsKronecker graphsB t t t h d l th t tBut, want to have a model that can generate a realistic graph with realistic growth:

St ti P ttStatic PatternsPower Law Degree DistributionSmall DiameterSmall DiameterPower Law Eigenvalue and Eigenvector Distribution

Temporal PatternsDensification Power LawShrinking/Constant Diameter

For Kronecker graphs [PKDD05] all theseFor Kronecker graphs [PKDD05] all these properties can actually be proven

Faloutsos&LeskovecECML/PKDD 2007

Part 1-49

CMU SCS

Idea: Recursive graph generationStarting with our intuitions from densification

Idea: Recursive graph generationStarting with our intuitions from densificationTry to mimic recursive graph/community growth because self similarity leads to power-lawsThere are many obvious (but wrong) ways:

Initial graph Recursive expansion

Does not densify, has increasing diameterKronecker Product is a way of generating self-similar

Faloutsos&LeskovecECML/PKDD 2007

Part 1-50

y g gmatrices

25

Page 28: presented by September 21, 2007 Warsaw, Poland 2007... · 2007. 8. 10. · CMU SCS Mining Large Graphs:Mining Large Graphs: Laws and toolsLaws and tools Christos Faloutsos and Jure

CMU SCS

Kronecker product: GraphKronecker product: Graph

Intermediate stage

(9x9)(3x3)

Faloutsos&LeskovecECML/PKDD 2007

Part 1-51Adjacency matrixAdjacency matrix

CMU SCS

Kronecker product: GraphKronecker product: Graph

Continuing multypling with G1 we obtain G4 and so on …4

Faloutsos&LeskovecECML/PKDD 2007

Part 1-52G4 adjacency matrix

26

Page 29: presented by September 21, 2007 Warsaw, Poland 2007... · 2007. 8. 10. · CMU SCS Mining Large Graphs:Mining Large Graphs: Laws and toolsLaws and tools Christos Faloutsos and Jure

CMU SCS

Kronecker product: DefinitionKronecker product: DefinitionTh K k d t f t i A d B iThe Kronecker product of matrices A and B is given by

N x M K x L

N*K x M*L

We define a Kronecker product of two graphs as a Kronecker product of their adjacency matrices

Faloutsos&LeskovecECML/PKDD 2007

Part 1-53

CMU SCS

Kronecker graphsKronecker graphsW i fWe propose a growing sequence of graphs by iterating the Kronecker productproduct

Each Kronecker multiplicationEach Kronecker multiplication exponentially increases the size of the graphgraphGk has N1

k nodes and E1k edges, so we

get densificationFaloutsos&LeskovecECML/PKDD 2007

Part 1-54

get densification

27

Page 30: presented by September 21, 2007 Warsaw, Poland 2007... · 2007. 8. 10. · CMU SCS Mining Large Graphs:Mining Large Graphs: Laws and toolsLaws and tools Christos Faloutsos and Jure

CMU SCS

Kronecker graphs: Intuition (1)Kronecker graphs: Intuition (1)

Intuition:Recursive growth of graph communitiesNodes get expanded to micro communitiesNodes in sub-community link among themselves and to nodes from different communities

Faloutsos&LeskovecECML/PKDD 2007

Part 1-55

CMU SCS

Stochastic Kronecker graphsStochastic Kronecker graphs

But, want a randomized version of Kronecker graphsg pPossible strategies:

R d l dd/d l t dRandomly add/delete some edgesThreshold the matrix, e.g. use only the strongest edges

Wrong, will destroy the structure of theWrong, will destroy the structure of the graph, e.g. diameter, clustering

Faloutsos&LeskovecECML/PKDD 2007

Part 1-56

28

Page 31: presented by September 21, 2007 Warsaw, Poland 2007... · 2007. 8. 10. · CMU SCS Mining Large Graphs:Mining Large Graphs: Laws and toolsLaws and tools Christos Faloutsos and Jure

CMU SCS

Stochastic Kronecker graphsStochastic Kronecker graphs

Create N1×N1 probability matrix P1

Compute the kth Kronecker power Pkp p k

For each entry puv of Pk include an edge (u,v)with probability puvwith probability puv

0 25 0 10 0 10 0 04Kronecker

Probability of edge pij

0.5 0.2 Instance0.25 0.10 0.10 0.040.05 0.15 0.02 0.06

Kroneckermultiplication

0.1 0.3P1

matrix K20.05 0.02 0.15 0.060.01 0.03 0.03 0.09

flip biasedFaloutsos&LeskovecECML/PKDD 2007

Part 1-57

1

P2=P1⊗P1

flip biased coins

CMU SCS

Properties of Kronecker graphsProperties of Kronecker graphs

We can show [PKDD05] that Kronecker multiplication generates graphs that have:

Properties of static networksPower Law Degree DistributionPower Law eigenvalue and eigenvector distributionSmall Diameter

Properties of dynamic networksProperties of dynamic networksDensification Power LawShrinking/Stabilizing DiameterShrinking/Stabilizing Diameter

Faloutsos&LeskovecECML/PKDD 2007

Part 1-58

29

Page 32: presented by September 21, 2007 Warsaw, Poland 2007... · 2007. 8. 10. · CMU SCS Mining Large Graphs:Mining Large Graphs: Laws and toolsLaws and tools Christos Faloutsos and Jure

CMU SCS

1 3: Fitting models to real1.3: Fitting models to real graphsg p

We saw the modelsWe saw the models.Want to fit a model to a large real

h?graph?

CMU SCS

The problemThe problem

W t t t li ti t kWe want to generate realistic networks:

Some statistical property, e.g., degree distribution

Given a real network

Generate a synthetic network

P1) What are the relevant properties?P2) What is a good analytically tractable model?P2) What is a good analytically tractable model?P3) How can we fit the model (find parameters)?

Faloutsos&LeskovecECML/PKDD 2007

Part 1-60

parameters)?

30

Page 33: presented by September 21, 2007 Warsaw, Poland 2007... · 2007. 8. 10. · CMU SCS Mining Large Graphs:Mining Large Graphs: Laws and toolsLaws and tools Christos Faloutsos and Jure

CMU SCS

Model estimation: approachModel estimation: approach

M i lik lih d ti tiMaximum likelihood estimationGiven real graph GE i K k i i i h ( )Estimate Kronecker initiator graph Θ (e.g., ) which

)|(maxarg ΘGPWe need to (efficiently) calculate

)|(maxarg ΘΘ

GP( y)

)|( ΘGPAnd maximize over Θ (e.g., using gradient descent)

Faloutsos&LeskovecECML/PKDD 2007

Part 1-61

)

CMU SCS

Fitting Kronecker graphsFitting Kronecker graphsGiven a graph G and Kronecker matrix Θ we

GG e a g ap G a d o ec e at Θ ecalculate probability that Θ generated G P(G|Θ)

0.25 0.10 0.10 0.040 05 0 15 0 02 0 060 5 0 2

1 0 1 1

0 1 0 10.05 0.15 0.02 0.060.05 0.02 0.15 0.060 01 0 03 0 03 0 09

0.5 0.20.1 0.3

0 1 0 1

1 0 1 1

1 1 1 10.01 0.03 0.03 0.09Θ

Θk

1 1 1 1

GP(G|Θ)P(G|Θ)

]),[1(],[)|( vuvuGP kk Θ−ΠΘΠ=ΘFaloutsos&LeskovecECML/PKDD 2007

Part 1-62

]),[(],[)|(),(),( kGvukGvu ∉∈

31

Page 34: presented by September 21, 2007 Warsaw, Poland 2007... · 2007. 8. 10. · CMU SCS Mining Large Graphs:Mining Large Graphs: Laws and toolsLaws and tools Christos Faloutsos and Jure

CMU SCS

ChallengesChallenges

Challenge 1: Mode correspondence problemp

How the map the nodes of the real graph to the nodes of the synthetic graph?the nodes of the synthetic graph?

Challenge 2: ScalabilityFor large graphs O(N2) is too slowScaling to large graphs – performing the g g g p p gcalculations quickly

Faloutsos&LeskovecECML/PKDD 2007

Part 1-63

CMU SCS

Challenge 1: Node correspondenceNodes are unlabeled

Challenge 1: Node correspondence0 25 0 10 0 10 0 04

ΘΘk

Nodes are unlabeledGraphs G’ and G” should have the same probability

0.25 0.10 0.10 0.04

0.05 0.15 0.02 0.06

0.05 0.02 0.15 0.06

0.5 0.20.1 0.3

have the same probabilityP(G’|Θ) = P(G”|Θ)One needs to consider all

0.01 0.03 0.03 0.09

G’ σOne needs to consider all node correspondences σ

1 0 1 0

0 1 1 1

1 1 1 1

1

23

)()|()|( PGPGP ∑ ΘΘ

All correspondences are a

0 0 1 14

2

)(),|()|( σσσ

PGPGP ∑ Θ=Θ

1 0 1 1G”

All correspondences are a priori equally likelyThere are O(N!)

2

14

3

0 1 0 1

1 0 1 1

1 1 1 1

Faloutsos&LeskovecECML/PKDD 2007

Part 1-64

There are O(N!)correspondences

1 1 1 1

P(G’|Θ) = P(G”|Θ)

32

Page 35: presented by September 21, 2007 Warsaw, Poland 2007... · 2007. 8. 10. · CMU SCS Mining Large Graphs:Mining Large Graphs: Laws and toolsLaws and tools Christos Faloutsos and Jure

CMU SCS

Challenge 2: calculating P(G|Θ σ)Challenge 2: calculating P(G|Θ,σ)Assume we solved the correspondence problemCalculating

σ… node labeling

]),[1(],[)|(),(),( vukGvuvukGvu

GP σσσσ Θ−ΠΘΠ=Θ∉∈

Takes O(N2) timeInfeasible for large graphs (N ~ 105)

σ… node labeling

0.25 0.10 0.10 0.040.05 0.15 0.02 0.06

1 0 1 10 1 0 1

σ0.05 0.02 0.15 0.060.01 0.03 0.03 0.09

1 0 1 10 0 1 1

σ

Faloutsos&LeskovecECML/PKDD 2007

Part 1-65G

P(G|Θ, σ)Θkc

CMU SCS

Model estimation: solutionModel estimation: solution

N ï l ti ti th K k i iti tNaïvely estimating the Kronecker initiatortakes O(N!N2) time:

N! for graph isomorphism Metropolis sampling: N! (big) const

N2 for traversing the graph adjacency matrixProperties of Kronecker product and sparsity(E << N2): N2 E

We can estimate the parameters of Kronecker graph in linear time O(E)For details see [ICML07]

Faloutsos&LeskovecECML/PKDD 2007

Part 1-66

For details see [ICML07]

33

Page 36: presented by September 21, 2007 Warsaw, Poland 2007... · 2007. 8. 10. · CMU SCS Mining Large Graphs:Mining Large Graphs: Laws and toolsLaws and tools Christos Faloutsos and Jure

CMU SCS

Solution 1: Node correspondenceSolution 1: Node correspondence Log-likelihoodLog likelihood

Gradient of log-likelihoodg

Sample the permutations from P(σ|G,Θ) and average the gradients

Faloutsos&LeskovecECML/PKDD 2007

Part 1-67

( | , ) g g

CMU SCS

Sampling node correspondencesSampling node correspondencesMetropolis sampling:p p g

Start with a random permutationDo local moves on the permutationAccept the new permutation

If new permutation is better (gives higher likelihood)If new is worse accept with probability proportional toIf new is worse accept with probability proportional to the ratio of likelihoods

1 4Swap node1

23

4

23

14

Swap node labels 1 and 4

1 0 1 0

0 1 1 11 1 1 0

1 1 1 0

123

12

Can compute efficiently:Only need to account for changes in 2 rows /

Faloutsos&LeskovecECML/PKDD 2007

Part 1-68

1 1 1 1

0 1 1 11 1 1 1

0 0 1 1

34

34

gcolumns

34

Page 37: presented by September 21, 2007 Warsaw, Poland 2007... · 2007. 8. 10. · CMU SCS Mining Large Graphs:Mining Large Graphs: Laws and toolsLaws and tools Christos Faloutsos and Jure

CMU SCS

Solution 2: Calculating P(G|Θ σ)Solution 2: Calculating P(G|Θ,σ)

Calculating naively P(G|Θ,σ) takes O(N2)Idea:Idea:

First calculate likelihood of empty graph, a graph with 0 edgesgraph with 0 edgesCorrect the likelihood for edges that we observe i th hin the graph

By exploiting the structure of Kronecker y p gproduct we obtain closed form for likelihood of an empty graph

Faloutsos&LeskovecECML/PKDD 2007

Part 1-69

of an empty graph

CMU SCS

Solution 2: Calculating P(G|Θ σ)Solution 2: Calculating P(G|Θ,σ)

W i t th lik lih dWe approximate the likelihood:

No edge likelihood Edge likelihoodEmpty graph

The sum goes only over the edges

No-edge likelihood Edge likelihoodEmpty graph

g y gEvaluating P(G|Θ,σ) takes O(E) timeReal graphs are sparse E << N2

Faloutsos&LeskovecECML/PKDD 2007

Part 1-70

Real graphs are sparse, E << N2

35

Page 38: presented by September 21, 2007 Warsaw, Poland 2007... · 2007. 8. 10. · CMU SCS Mining Large Graphs:Mining Large Graphs: Laws and toolsLaws and tools Christos Faloutsos and Jure

CMU SCS

Experiments: synthetic dataExperiments: synthetic data

C di t d t tCan gradient descent recover true parameters?Optimization problem is not convexHow nice (without local minima) isHow nice (without local minima) is optimization space?

Generate a graph from random parametersGenerate a graph from random parametersStart at random point and use gradient descentdescentWe recover true parameters 98% of the times

Faloutsos&LeskovecECML/PKDD 2007

Part 1-71

CMU SCS

Convergence of propertiesConvergence of propertiesHow does algorithm converge to trueHow does algorithm converge to true parameters with gradient descent iterations?

kelih

ood

bs e

rror

Log-

lik

Avg

ab

Gradient descent iterations

ue

Gradient descent iterations

Dia

met

er

eige

nval

u

Faloutsos&LeskovecECML/PKDD 2007

Part 1-72

D

1ste

36

Page 39: presented by September 21, 2007 Warsaw, Poland 2007... · 2007. 8. 10. · CMU SCS Mining Large Graphs:Mining Large Graphs: Laws and toolsLaws and tools Christos Faloutsos and Jure

CMU SCS

Experiments: real networksExperiments: real networks

E i t l tExperimental setup:Given real graphStochastic gradient descent from random initial pointObtain estimated parametersGenerate synthetic graphs y g pCompare properties of both graphs

We do not fit the properties themselvesWe do not fit the properties themselves We fit the likelihood and then compare the

h tiFaloutsos&LeskovecECML/PKDD 2007

Part 1-73graph properties

CMU SCS

AS graph (N=6500 E=26500)AS graph (N=6500, E=26500)

Autonomous systems (internet)We search the space of ~1050,000 permutationsp pFitting takes 20 minutesAS graph is undirected and estimatedAS graph is undirected and estimated parameter matrix is symmetric:

0.98 0.580.58 0.06

Faloutsos&LeskovecECML/PKDD 2007

Part 1-74

37

Page 40: presented by September 21, 2007 Warsaw, Poland 2007... · 2007. 8. 10. · CMU SCS Mining Large Graphs:Mining Large Graphs: Laws and toolsLaws and tools Christos Faloutsos and Jure

CMU SCS

AS: comparing graph propertiesAS: comparing graph properties

Generate synthetic graph using estimated parameterspCompare the properties of two graphs

Degree distribution Hop plot

pair

s

coun

t

acha

ble

pdiameter=4

log

og #

of r

e

Faloutsos&LeskovecECML/PKDD 2007

Part 1-75log degree number of hops

lo

CMU SCS

AS i h tiAS: comparing graph propertiesSpectral properties of graph adjacency matrices

Network valueScree plot

nval

ue

alue

log

eige

log

v a

log rank log rank

Faloutsos&LeskovecECML/PKDD 2007

Part 1-76

og ank

38

Page 41: presented by September 21, 2007 Warsaw, Poland 2007... · 2007. 8. 10. · CMU SCS Mining Large Graphs:Mining Large Graphs: Laws and toolsLaws and tools Christos Faloutsos and Jure

CMU SCS

Epinions graph (N=76k E=510k)Epinions graph (N=76k, E=510k)1 000 000We search the space of ~101,000,000 permutations

Fitting takes 2 hoursThe structure of the estimated parameter gives insight into the structure of the graph

0.99 0.540.49 0.13

Degree distribution Hop plot

irs

ount

chab

le p

ai

log

co

g #

of re

ac

Faloutsos&LeskovecECML/PKDD 2007

Part 1-77log degree number of hops

log

CMU SCS

E i i h (N 76k E 510k)Epinions graph (N=76k, E=510k)

N t k lS l t Network valueScree plot

ueei

genv

alu

log

log rank log rank

Faloutsos&LeskovecECML/PKDD 2007

Part 1-78

39

Page 42: presented by September 21, 2007 Warsaw, Poland 2007... · 2007. 8. 10. · CMU SCS Mining Large Graphs:Mining Large Graphs: Laws and toolsLaws and tools Christos Faloutsos and Jure

CMU SCS

ScalabilityScalability

Fitting scales linearly with the number of edgesg

Faloutsos&LeskovecECML/PKDD 2007

Part 1-79

CMU SCS

ConclusionConclusionK k G h d l hKronecker Graph model has

provable propertiesll b f tsmall number of parameters

We developed scalable algorithms for fitting Kronecker GraphsKronecker GraphsWe can efficiently search large space ( 101 000 000) of permutations(~101,000,000) of permutationsKronecker graphs fit well real networks using few parametersfew parametersWe match graph properties without a priori deciding on which ones to fit

Faloutsos&LeskovecECML/PKDD 2007

Part 1-80

deciding on which ones to fit

40

Page 43: presented by September 21, 2007 Warsaw, Poland 2007... · 2007. 8. 10. · CMU SCS Mining Large Graphs:Mining Large Graphs: Laws and toolsLaws and tools Christos Faloutsos and Jure

CMU SCS

Why should we care?Why should we care?

Gives insight into the graph formation process:Anomaly detection – abnormal behavior, evolutionPredictions – predicting future from the pastSimulations of new algorithms where real graphs are hard/impossible to collectGraph sampling – many real world graphs are too l t d l ithlarge to deal with“What if” scenarios

Faloutsos&LeskovecECML/PKDD 2007

Part 1-81

CMU SCS

ReferencesReferencesGraphs over Time: Densification Laws Shrinking Diameters and PossibleGraphs over Time: Densification Laws, Shrinking Diameters and Possible Explanations, by Jure Leskovec, Jon Kleinberg, Christos Faloutsos, ACM KDD 2005Graph Evolution: Densification and Shrinking Diameters, by Jure Leskovec, Jon Kleinberg and Christos Faloutsos ACM TKDD 2007Kleinberg and Christos Faloutsos, ACM TKDD 2007Realistic, Mathematically Tractable Graph Generation and Evolution, Using Kronecker Multiplication, by Jure Leskovec, Deepay Chakrabarti, Jon Kleinberg and Christos Faloutsos, PKDD 2005Scalable Modeling of Real Graphs using Kronecker Multiplication by JureScalable Modeling of Real Graphs using Kronecker Multiplication, by Jure Leskovec and Christos Faloutsos, ICML 2007The Dynamics of Viral Marketing, by Jure Leskovec, Lada Adamic, Bernardo Huberman, ACM EC 2006Collective dynamics of 'small world' networks by Duncan J Watts and Steven HCollective dynamics of 'small-world' networks, by Duncan J. Watts and Steven H. Strogatz, Nature 1998Emergence of scaling in random networks, by R. Albert and A.-L. Barabasi, Science 1999O th l ti f d h b P E d d A R i P bli ti f thOn the evolution of random graphs, by P. Erdos and A. Renyi, Publication of the Mathematical Institute of the Hungarian Acadamy of Science, 1960

Faloutsos&LeskovecECML/PKDD 2007

Part 1-82

41

Page 44: presented by September 21, 2007 Warsaw, Poland 2007... · 2007. 8. 10. · CMU SCS Mining Large Graphs:Mining Large Graphs: Laws and toolsLaws and tools Christos Faloutsos and Jure

CMU SCS

ReferencesReferencesThe structure and function of complex networks M Newman SIAM ReviewThe structure and function of complex networks, M. Newman, SIAM Review 2003Hierarchical Organization in Complex Networks, Ravasz and Barabasi, Physical Review E 2003A random graph model for massive graphs W Aiello F Chung and L Lu STOCA random graph model for massive graphs, W. Aiello, F. Chung and L. Lu, STOC 2000Community structure in social and biological networks, by Girvan and Newman, PNAS 2002O P l R l ti hi f th I t t T l b F l t F l tOn Power-law Relationships of the Internet Topology by Faloutsos, Faloutsos, and Faloutsos, SIGCOM 1999Power laws, Pareto distributions and Zipf's law by M. Newman, Contemporary Physics 2005S i l N k A l i M h d d A li i W C b idSocial Network Analysis : Methods and Applications, Wasserman, Cambridge University Press 1994The web as a graph: Measurements, models and methods, J. Kleinberg and S. R. Kumar, P. Raghavan, S. Rajagopalan and A. Tomkins, COCOON 1998

Some plots borrowed from Lada Adamic, Mark Newman, Mark Joseph, Albert Barabasi, Jon Kleinberg, David Lieben-Nowell, Sergi Valverde, and Ricard Sole

Faloutsos&LeskovecECML/PKDD 2007

Part 1-83

42

Page 45: presented by September 21, 2007 Warsaw, Poland 2007... · 2007. 8. 10. · CMU SCS Mining Large Graphs:Mining Large Graphs: Laws and toolsLaws and tools Christos Faloutsos and Jure

CMU SCS

Tutorial outlineTutorial outline

Part 1: Laws and generators:What are properties of large graphs?How do we model them?

Part 2: Tools for static and dynamics graphsy g pMethods based on Singular Value Decomposition

Part 3: Case studiesPart 3: Case studiesPropagation and detection of viruses in networksGraph projectionsGraph projections

Faloutsos&LeskovecECML/PKDD 2007

Part 2-1

CMU SCS

Part 2: OutlinePart 2: Outline

MotivationMatrix tools

SVD, PCAHITS, PageRank

Tensor toolsCase studies

gCURCo-clusteringCase studies Co-clustering

Faloutsos&LeskovecECML/PKDD 2007

Part 2-2

43

Page 46: presented by September 21, 2007 Warsaw, Poland 2007... · 2007. 8. 10. · CMU SCS Mining Large Graphs:Mining Large Graphs: Laws and toolsLaws and tools Christos Faloutsos and Jure

CMU SCS

Examples of MatricesExamples of Matrices

Example/Intuition: Documents and termsFind patterns groups conceptsFind patterns, groups, concepts

data mining classif. tree ...13 11 22 55 ...5 4 6 7 ...

Paper#1Paper#2

... ... ... ... ...

... ... ... ... ...

... ... ... ... ...

Paper#3Paper#4

...Faloutsos&LeskovecECML/PKDD 2007

Part 2-3

...

CMU SCS

Singular Value Decomposition (SVD)Singular Value Decomposition (SVD)X = UΣVTX UΣV

X U

v1σ1

Σ VT

u1 u2 ukx(1) x(2) x(M) = . v2.σ2

vkσk

i l l right singular vectors

input data left singular vectors

singular values

Faloutsos&LeskovecECML/PKDD 2007

Part 2-4

44

Page 47: presented by September 21, 2007 Warsaw, Poland 2007... · 2007. 8. 10. · CMU SCS Mining Large Graphs:Mining Large Graphs: Laws and toolsLaws and tools Christos Faloutsos and Jure

CMU SCS

SVD as spectral decompositionSVD as spectral decomposition

n nσ1u1°v1 σ2u2°v2

AmΣ

mVT

≈ +Am ≈ +

Best rank-k approximation in L2 and U

ppFrobenius SVD only works for static matrices (a single

Faloutsos&LeskovecECML/PKDD 2007

Part 2-5

y ( g2nd order tensor)

See also PARAFAC

CMU SCS

Vector outer product intuition:Vector outer product – intuition:owner

20; 30; 40

car typeage

20; 30; 40

VW

20; 30; 40

A

VWVolvoBMW

VWVolvoBMWABMW

2-d histogram1-d histograms +

independence assumption

Faloutsos&LeskovecECML/PKDD 2007

Part 2-6

45

Page 48: presented by September 21, 2007 Warsaw, Poland 2007... · 2007. 8. 10. · CMU SCS Mining Large Graphs:Mining Large Graphs: Laws and toolsLaws and tools Christos Faloutsos and Jure

CMU SCS

SVD ExampleSVD - Example

A = U Σ VT - example:retrieval

datainf.

retrievalbrain lung

0 18 01 1 1 0 02 2 2 0 01 1 1 0 0

0.18 00.36 00 18 0CS 9 64 01 1 1 0 0

5 5 5 0 00 0 0 2 2

0.18 00.90 00 0.53

=9.64 00 5.29x x

0 0 0 3 30 0 0 1 1

0 0.530 0.800 0.27

MD 0.58 0.58 0.58 0 00 0 0 0.71 0.71

Faloutsos&LeskovecECML/PKDD 2007

Part 2-7

CMU SCS

SVD ExampleSVD - Example

A = U Σ VT - example:retrieval CS concept

datainf.

retrievalbrain lung

0 18 0

CS-conceptMD-concept

1 1 1 0 02 2 2 0 01 1 1 0 0

0.18 00.36 00 18 0CS 9 64 01 1 1 0 0

5 5 5 0 00 0 0 2 2

0.18 00.90 00 0.53

=9.64 00 5.29x x

0 0 0 3 30 0 0 1 1

0 0.530 0.800 0.27

MD 0.58 0.58 0.58 0 00 0 0 0.71 0.71

Faloutsos&LeskovecECML/PKDD 2007

Part 2-8

46

Page 49: presented by September 21, 2007 Warsaw, Poland 2007... · 2007. 8. 10. · CMU SCS Mining Large Graphs:Mining Large Graphs: Laws and toolsLaws and tools Christos Faloutsos and Jure

CMU SCS

SVD ExampleSVD - Example

A = U Σ VT - example:retrieval CS concept

doc-to-concept similarity matrix

datainf.

retrievalbrain lung

0 18 0

CS-conceptMD-concept

1 1 1 0 02 2 2 0 01 1 1 0 0

0.18 00.36 00 18 0CS 9 64 01 1 1 0 0

5 5 5 0 00 0 0 2 2

0.18 00.90 00 0.53

=9.64 00 5.29x x

0 0 0 3 30 0 0 1 1

0 0.530 0.800 0.27

MD 0.58 0.58 0.58 0 00 0 0 0.71 0.71

Faloutsos&LeskovecECML/PKDD 2007

Part 2-9

CMU SCS

SVD ExampleSVD - Example

A = U Σ VT - example:retrieval

datainf.

retrievalbrain lung

0 18 0

‘strength’ of CS-concept

1 1 1 0 02 2 2 0 01 1 1 0 0

0.18 00.36 00 18 0CS 9 64 01 1 1 0 0

5 5 5 0 00 0 0 2 2

0.18 00.90 00 0.53

=9.64 00 5.29x x

0 0 0 3 30 0 0 1 1

0 0.530 0.800 0.27

MD 0.58 0.58 0.58 0 00 0 0 0.71 0.71

Faloutsos&LeskovecECML/PKDD 2007

Part 2-10

47

Page 50: presented by September 21, 2007 Warsaw, Poland 2007... · 2007. 8. 10. · CMU SCS Mining Large Graphs:Mining Large Graphs: Laws and toolsLaws and tools Christos Faloutsos and Jure

CMU SCS

SVD ExampleSVD - Example

A = U Σ VT - example:retrieval term-to-concept

datainf.

retrievalbrain lung

0 18 0

similarity matrix

CS1 1 1 0 02 2 2 0 01 1 1 0 0

0.18 00.36 00 18 0CS 9 64 0

CS-concept

1 1 1 0 05 5 5 0 00 0 0 2 2

0.18 00.90 00 0.53

=9.64 00 5.29x x

0 0 0 3 30 0 0 1 1

0 0.530 0.800 0.27

MD 0.58 0.58 0.58 0 00 0 0 0.71 0.71

Faloutsos&LeskovecECML/PKDD 2007

Part 2-11

CMU SCS

SVD ExampleSVD - Example

A = U Σ VT - example:retrieval term-to-concept

datainf.

retrievalbrain lung

0 18 0

similarity matrix

CS1 1 1 0 02 2 2 0 01 1 1 0 0

0.18 00.36 00 18 0CS 9 64 0

CS-concept

1 1 1 0 05 5 5 0 00 0 0 2 2

0.18 00.90 00 0.53

=9.64 00 5.29x x

0 0 0 3 30 0 0 1 1

0 0.530 0.800 0.27

MD 0.58 0.58 0.58 0 00 0 0 0.71 0.71

Faloutsos&LeskovecECML/PKDD 2007

Part 2-12

48

Page 51: presented by September 21, 2007 Warsaw, Poland 2007... · 2007. 8. 10. · CMU SCS Mining Large Graphs:Mining Large Graphs: Laws and toolsLaws and tools Christos Faloutsos and Jure

CMU SCS

SVD InterpretationSVD - Interpretation

‘documents’, ‘terms’ and ‘concepts’:Q if A i th d t t t t i h tQ: if A is the document-to-term matrix, what

is AT A?A: term-to-term ([m x m]) similarity matrixQ: A AT ?Q: A A ?A: document-to-document ([n x n]) similarity

matrixmatrix

Faloutsos&LeskovecECML/PKDD 2007

Part 2-13

CMU SCS

SVD propertiesSVD properties

V are the eigenvectors of the covariance matrix ATA

U are the eigenvectors of the Gram (inner-U a e t e e ge ecto s o t e G a ( eproduct) matrix AAT

Further reading:

Faloutsos&LeskovecECML/PKDD 2007

Part 2-14

Further reading:1. Ian T. Jolliffe, Principal Component Analysis (2nd ed), Springer, 2002.2. Gilbert Strang, Linear Algebra and Its Applications (4th ed), Brooks Cole, 2005.

49

Page 52: presented by September 21, 2007 Warsaw, Poland 2007... · 2007. 8. 10. · CMU SCS Mining Large Graphs:Mining Large Graphs: Laws and toolsLaws and tools Christos Faloutsos and Jure

CMU SCS

Principal Component Analysis (PCA)

SVDn n

RRk k

Am Σm

R

UVT k

Loading

PCs

Am m U Loading

PCA is an important application of SVDNote that U and V are dense and may have negative

Faloutsos&LeskovecECML/PKDD 2007

Part 2-15

y gentries

CMU SCS

PCA interpretationPCA interpretationbest axis to project on: (‘best’ = minbest axis to project on: ( best min sum of squares of projection errors)

Term2 (‘lung’)

Faloutsos&LeskovecECML/PKDD 2007

Part 2-16Term1 (‘data’)

50

Page 53: presented by September 21, 2007 Warsaw, Poland 2007... · 2007. 8. 10. · CMU SCS Mining Large Graphs:Mining Large Graphs: Laws and toolsLaws and tools Christos Faloutsos and Jure

CMU SCS

PCA interpretationPCA - interpretation ΣU

VT

Term2 (‘retrieval’)

PCA projects points first singular vectorp j pOnto the “best” axis

v1

minimum RMS error Term1 (‘data’)

Faloutsos&LeskovecECML/PKDD 2007

Part 2-17

CMU SCS

Part 2: OutlinePart 2: Outline

MotivationMatrix tools

SVD, PCAHITS, PageRank

Tensor toolsCase studies

gCURCo-clusteringCase studies Co-clustering

Faloutsos&LeskovecECML/PKDD 2007

Part 2-18

51

Page 54: presented by September 21, 2007 Warsaw, Poland 2007... · 2007. 8. 10. · CMU SCS Mining Large Graphs:Mining Large Graphs: Laws and toolsLaws and tools Christos Faloutsos and Jure

CMU SCS

Kleinberg’s algorithm HITSKleinberg s algorithm HITS

Problem dfn: given the web and a queryfind the most ‘authoritative’ web pages forfind the most authoritative web pages for this query

Step 0: find all pages containing the query termsp p g g q yStep 1: expand by one move forward and

backwardbackward

Faloutsos&LeskovecECML/PKDD 2007

Part 2-19Further reading:1. J. Kleinberg. Authoritative sources in a hyperlinked environment. SODA 1998

CMU SCS

Kleinberg’s algorithm HITSKleinberg s algorithm HITS

Step 1: expand by one move forward and backward

Faloutsos&LeskovecECML/PKDD 2007

Part 2-20

52

Page 55: presented by September 21, 2007 Warsaw, Poland 2007... · 2007. 8. 10. · CMU SCS Mining Large Graphs:Mining Large Graphs: Laws and toolsLaws and tools Christos Faloutsos and Jure

CMU SCS

Kleinberg’s algorithm HITSKleinberg s algorithm HITS

on the resulting graph, give high score (= ‘authorities’) to nodes that many important ) y pnodes point togive high importance score (‘hubs’) togive high importance score ( hubs ) to nodes that point to good ‘authorities’

hubs authorities

Faloutsos&LeskovecECML/PKDD 2007

Part 2-21

CMU SCS

Kleinberg’s algorithm HITSKleinberg s algorithm HITS

Observationsrecursive definition!recursive definition!each node (say, ‘i’-th node) has both an

th it ti d h bauthoritativeness score ai and a hubness score hi

Faloutsos&LeskovecECML/PKDD 2007

Part 2-22

53

Page 56: presented by September 21, 2007 Warsaw, Poland 2007... · 2007. 8. 10. · CMU SCS Mining Large Graphs:Mining Large Graphs: Laws and toolsLaws and tools Christos Faloutsos and Jure

CMU SCS

Kleinberg’s algorithm: HITSKleinberg s algorithm: HITS

Let A be the adjacency matrix: the (i,j) entry is 1 if the edge from i to j exists( ,j) y g j

Let h and a be [n x 1] vectors with the ‘hubness’ and ‘authoritativiness’ scoreshubness and authoritativiness scores.

Then:

Faloutsos&LeskovecECML/PKDD 2007

Part 2-23

CMU SCS

Kleinberg’s algorithm: HITSKleinberg s algorithm: HITS

Then:ai = hk + hl + hai hk + hl + hm

that iskl i

ai = Sum (hj) over all j that (j,i) edge exists

lm (j, ) edge e sts

orTa = AT h

Faloutsos&LeskovecECML/PKDD 2007

Part 2-24

54

Page 57: presented by September 21, 2007 Warsaw, Poland 2007... · 2007. 8. 10. · CMU SCS Mining Large Graphs:Mining Large Graphs: Laws and toolsLaws and tools Christos Faloutsos and Jure

CMU SCS

Kleinberg’s algorithm: HITSKleinberg s algorithm: HITS

symmetrically, for the ‘hubness’:hi = a + a + ani hi an + ap + aq

that isp

hi = Sum (qj) over all j that (i,j) edge existsq ( ,j) edge e sts

orh = A a

Faloutsos&LeskovecECML/PKDD 2007

Part 2-25

CMU SCS

Kleinberg’s algorithm: HITSKleinberg s algorithm: HITS

In conclusion, we want vectors h and asuch that:

h = A aAT ha = AT h

That is:at sa = ATA a

Faloutsos&LeskovecECML/PKDD 2007

Part 2-26

55

Page 58: presented by September 21, 2007 Warsaw, Poland 2007... · 2007. 8. 10. · CMU SCS Mining Large Graphs:Mining Large Graphs: Laws and toolsLaws and tools Christos Faloutsos and Jure

CMU SCS

Kleinberg’s algorithm: HITSKleinberg s algorithm: HITS

i i ht i l t f th dja is a right singular vector of the adjacency matrix A (by dfn!), a.k.a the eigenvector

Tof ATA

Starting from random a’ and iterating, we’ll eventually convergeeventually converge

Q: to which of all the eigenvectors? why?A: to the one of the strongest eigenvalue,

(AT A ) k a = λ1ka

Faloutsos&LeskovecECML/PKDD 2007

Part 2-27

( ) 1

CMU SCS

Kleinberg’s algorithm -discussion

‘authority’ score can be used to find ‘similar pages’ (how?)p g ( )closely related to ‘citation analysis’, social networks / ‘small world’ phenomenanetworks / small world phenomena

Faloutsos&LeskovecECML/PKDD 2007

Part 2-28See also TOPHITS

56

Page 59: presented by September 21, 2007 Warsaw, Poland 2007... · 2007. 8. 10. · CMU SCS Mining Large Graphs:Mining Large Graphs: Laws and toolsLaws and tools Christos Faloutsos and Jure

CMU SCS

Part 2: OutlinePart 2: Outline

MotivationMatrix tools

SVD, PCAHITS, PageRank

Tensor toolsCase studies

gCURCo-clusteringCase studies Co-clustering

Faloutsos&LeskovecECML/PKDD 2007

Part 2-29

CMU SCS

Motivating problem: PageRankMotivating problem: PageRank

Gi di t d h fi d it tGiven a directed graph, find its most interesting/central node

A node is important,if it is connectedif it is connected with important nodes( i b t OK!)(recursive, but OK!)

Faloutsos&LeskovecECML/PKDD 2007

Part 2-30

57

Page 60: presented by September 21, 2007 Warsaw, Poland 2007... · 2007. 8. 10. · CMU SCS Mining Large Graphs:Mining Large Graphs: Laws and toolsLaws and tools Christos Faloutsos and Jure

CMU SCS

Motivating problem – PageRank g p gsolution

Given a directed graph, find its most interesting/central nodeinteresting/central node

Proposed solution: Random walk; spot most ‘ l ’ d ( t d t t b ( ))‘popular’ node (-> steady state prob. (ssp))

A d h hi hA node has high ssp,if it is connected with high ssp nodes(recursive, but OK!)

Faloutsos&LeskovecECML/PKDD 2007

Part 2-31

( , )

CMU SCS

(Simplified) PageRank algorithm(Simplified) PageRank algorithm

Let A be the transition matrix (= adjacency t i ) thmatrix); let B be the transpose, column-normalized - then

From B

1 2 3

p1

p2

p1

p2

To B1

1 11 2 3 p2

p3

p4

p2

p3

p4

=1 1

1/2 1/2

1/2

45

p4

p5

p4

p5

1/2

1/2

Faloutsos&LeskovecECML/PKDD 2007

Part 2-32

58

Page 61: presented by September 21, 2007 Warsaw, Poland 2007... · 2007. 8. 10. · CMU SCS Mining Large Graphs:Mining Large Graphs: Laws and toolsLaws and tools Christos Faloutsos and Jure

CMU SCS

(Simplified) PageRank algorithm(Simplified) PageRank algorithm

B p = p

B p = p

1 2 3p1

p2

p1

p2

1

1 11 3 p2

p3

p4

p2

p3

p4

=1 1

1/2 1/2

1/245

p4

p5

p4

p5

1/2

1/2

Faloutsos&LeskovecECML/PKDD 2007

Part 2-33

CMU SCS

(Simplified) PageRank algorithm(Simplified) PageRank algorithm

B p = 1 * pthus p is the eigenvector thatthus, p is the eigenvector that corresponds to the highest eigenvalue (=1, i th t i i l li d)since the matrix is column-normalized)

Why does such a p exist? p exists if B is nxn, nonnegative, irreducible [Perron–Frobenius theorem][ e o obe us eo e ]

Faloutsos&LeskovecECML/PKDD 2007

Part 2-34

59

Page 62: presented by September 21, 2007 Warsaw, Poland 2007... · 2007. 8. 10. · CMU SCS Mining Large Graphs:Mining Large Graphs: Laws and toolsLaws and tools Christos Faloutsos and Jure

CMU SCS

(Simplified) PageRank algorithm(Simplified) PageRank algorithm

In short: imagine a particle randomly moving along the edgesg g gcompute its steady-state probabilities (ssp)

Full version of algo: with occasional random u e s o o a go t occas o a a dojumps

Wh ? T k th t i i d iblWhy? To make the matrix irreducible

Faloutsos&LeskovecECML/PKDD 2007

Part 2-35

CMU SCS

Full AlgorithmFull Algorithm

With probability 1-c, fly-out to a random nodeThen, we have

B (1 )/ 1 >p = c B p + (1-c)/n 1 =>p = (1-c)/n [I - c B] -1 1

Faloutsos&LeskovecECML/PKDD 2007

Part 2-36

60

Page 63: presented by September 21, 2007 Warsaw, Poland 2007... · 2007. 8. 10. · CMU SCS Mining Large Graphs:Mining Large Graphs: Laws and toolsLaws and tools Christos Faloutsos and Jure

CMU SCS

Part 2: OutlinePart 2: Outline

MotivationMatrix tools

SVD, PCAHITS, PageRank

Tensor toolsCase studies

gCURCo-clusteringCase studies Co-clustering

Faloutsos&LeskovecECML/PKDD 2007

Part 2-37

CMU SCS

Motivation of CUR or CMDMotivation of CUR or CMD

SVD, PCA all transform data into some abstract space (specified by a set basis)p ( p y )

Interpretability problemLoss of sparsityLoss of sparsity

Faloutsos&LeskovecECML/PKDD 2007

Part 2-38

61

Page 64: presented by September 21, 2007 Warsaw, Poland 2007... · 2007. 8. 10. · CMU SCS Mining Large Graphs:Mining Large Graphs: Laws and toolsLaws and tools Christos Faloutsos and Jure

CMU SCS

PCA interpretationPCA - interpretation

Term2 (‘retrieval’)

PCA projects points first singular vectorp j pOnto the “best” axis

v1

minimum RMS error Term1 (‘data’)

Faloutsos&LeskovecECML/PKDD 2007

Part 2-39

CMU SCS

CURExample-based projection: use actual rows and

l t if th bcolumns to specify the subspaceGiven a matrix A∈Rm×n, find three matrices C∈R U R R R h th t ||A CUR|| iRm×c, U∈ Rc×r, R∈ Rr× n , such that ||A-CUR|| is small nn

C

RXm

r

AmC

c OrthogonalU is the pseudo-inverse of X

Orthogonal projection

Faloutsos&LeskovecECML/PKDD 2007

Part 2-40

62

Page 65: presented by September 21, 2007 Warsaw, Poland 2007... · 2007. 8. 10. · CMU SCS Mining Large Graphs:Mining Large Graphs: Laws and toolsLaws and tools Christos Faloutsos and Jure

CMU SCS

CURExample-based projection: use actual rows and

l t if th bcolumns to specify the subspaceGiven a matrix A∈Rm×n, find three matrices C∈R U R R R h th t ||A CUR|| iRm×c, U∈ Rc×r, R∈ Rr× n , such that ||A-CUR|| is small nn

C

RXm

r

AmC

c Example-based

U is the pseudo-inverse of X:U = X† = (UT U )-1 UT

Faloutsos&LeskovecECML/PKDD 2007

Part 2-41

CMU SCS

CUR (cont )CUR (cont.)

Key question:How to select/sample the columns and rows?p

Uniform samplingBiased sampling

CUR w/ absolute error boundCUR w/ relative error bound

Reference:1. Tutorial: Randomized Algorithms for Matrices and Massive Datasets, SDM’062. Drineas et al. Subspace Sampling and Relative-error Matrix Approximation: Column-

Faloutsos&LeskovecECML/PKDD 2007

Part 2-42Row-Based Methods, ESA2006

3. Drineas et al., Fast Monte Carlo Algorithms for Matrices III: Computing a Compressed Approximate Matrix Decomposition, SIAM Journal on Computing, 2006.

63

Page 66: presented by September 21, 2007 Warsaw, Poland 2007... · 2007. 8. 10. · CMU SCS Mining Large Graphs:Mining Large Graphs: Laws and toolsLaws and tools Christos Faloutsos and Jure

CMU SCS

The sparsity property pictorially:The sparsity property – pictorially:

=SVD/PCA:Destroys sparsity

U Σ VT

CUR i t i it

=

CUR: maintains sparsity

=

Faloutsos&LeskovecECML/PKDD 2007

Part 2-43C U R

CMU SCS

The sparsity propertyThe sparsity propertysparse and small

SVD: A = U Σ VT

p

Big but sparse Big and denseg p Big and dense

dense but small

CUR: A = C U RCUR: A C U RBig but sparse Bi b

Faloutsos&LeskovecECML/PKDD 2007

Part 2-44

Big but sparse Big but sparse

64

Page 67: presented by September 21, 2007 Warsaw, Poland 2007... · 2007. 8. 10. · CMU SCS Mining Large Graphs:Mining Large Graphs: Laws and toolsLaws and tools Christos Faloutsos and Jure

CMU SCS

Th it t ( t )The sparsity property (cont.)SVDCUR

102 SVD

CUR

102

e ra

tio

CURCMD

1e ra

tio

CURCMD

101

spac

e r

101

spac

e

C

0 0.2 0.4 0.6 0.8 1

10

accuracy0 0.2 0.4 0.6 0.8 1

accuracy

Network DBLPCMD uses much smaller space to achieve the same accuracyCUR li i i d li l dCUR limitation: duplicate columns and rowsSVD limitation: orthogonal projection densifies h d

Faloutsos&LeskovecECML/PKDD 2007

Part 2-45the data

Reference:Sun et al. Less is More: Compact Matrix Decomposition for Large Sparse Graphs, SDM’07

CMU SCS

Tensors for time evolving graphsTensors for time evolving graphs

[Jimeng Sun+ KDD’06][Jimeng Sun+ SMD’07][ g ][CF, Kolda, Sun, SDM’07 tutorial]tutorial]

Faloutsos&LeskovecECML/PKDD 2007

Part 2-46

65

Page 68: presented by September 21, 2007 Warsaw, Poland 2007... · 2007. 8. 10. · CMU SCS Mining Large Graphs:Mining Large Graphs: Laws and toolsLaws and tools Christos Faloutsos and Jure

CMU SCS

Social network analysisSocial network analysis

Static: find community structures

Keywords1990

thor

s

DB

Aut

Faloutsos&LeskovecECML/PKDD 2007

Part 2-47

CMU SCS

Social network analysisSocial network analysis

Static: find community structures Dynamic: monitor community structure evolution;Dynamic: monitor community structure evolution; spot abnormal individuals; abnormal time-stamps Keywordsstamps Keywords

DM

2004

DB

1990

DButho

rs

Faloutsos&LeskovecECML/PKDD 2007

Part 2-48

Au

66

Page 69: presented by September 21, 2007 Warsaw, Poland 2007... · 2007. 8. 10. · CMU SCS Mining Large Graphs:Mining Large Graphs: Laws and toolsLaws and tools Christos Faloutsos and Jure

CMU SCS

Multiway latent semantic indexing (LSI)(LSI)

2004 Philip Yu

DM2004

1990Michael

Stonebreakerutho

rs

p

DB1990

hors

Uau

DBUkeyword

auth

QueryPatternkeyword

P j ti t i if th l tProjection matrices specify the clustersCore tensors give cluster activation level

Faloutsos&LeskovecECML/PKDD 2007

Part 2-49

g

CMU SCS

Bibliographic data (DBLP)g p ( )

P f VLDB d KDD fPapers from VLDB and KDD conferencesConstruct 2nd order tensors with yearly y ywindows with <author, keywords> Each tensor: 4584×3741Each tensor: 4584×374111 timestamps (years)

Faloutsos&LeskovecECML/PKDD 2007

Part 2-50

67

Page 70: presented by September 21, 2007 Warsaw, Poland 2007... · 2007. 8. 10. · CMU SCS Mining Large Graphs:Mining Large Graphs: Laws and toolsLaws and tools Christos Faloutsos and Jure

CMU SCS

Multiway LSIMultiway LSIAuthors Keywords Year

i h l i h l i ll l i i i 199michael carey, michaelstonebreaker, h. jagadish,hector garcia-molina

queri,parallel,optimization,concurr,objectorient

1995

DBsurajit chaudhuri,mitch cherniack,michaelstonebreaker,ugur etintemel

distribut,systems,view,storage,servic,process,cache

2004DB

jiawei han,jian pei,philip s. yu,jianyong wang,charu c. aggarwal

streams,pattern,support, cluster, index,gener,queri

2004

DM

Two groups are correctly identified: Databases and Data mininggPeople and concepts are drifting over time

Faloutsos&LeskovecECML/PKDD 2007

Part 2-51

CMU SCS

ConclusionsConclusions

Tensor-based methods (WTA/DTA/STA):spot patterns and anomalies on timespot patterns and anomalies on time evolving graphs, and

t ( it i )on streams (monitoring)

Faloutsos&LeskovecECML/PKDD 2007

Part 2-52

68

Page 71: presented by September 21, 2007 Warsaw, Poland 2007... · 2007. 8. 10. · CMU SCS Mining Large Graphs:Mining Large Graphs: Laws and toolsLaws and tools Christos Faloutsos and Jure

CMU SCS

RoadmapRoadmap

MotivationMatrix tools

SVD, PCAHITS, PageRank

Tensor toolsCase studies

gCURCo-clusteringCase studies Co-clustering

Faloutsos&LeskovecECML/PKDD 2007

Part 2-53

CMU SCS

Co clusteringCo-clustering

Given data matrix and the number of row and column groups k and lg pSimultaneously

Cl t f (X Y) i t k di j i tCluster rows of p(X, Y) into k disjoint groups Cluster columns of p(X, Y) into l disjoint groups

Faloutsos&LeskovecECML/PKDD 2007

Part 2-54

69

Page 72: presented by September 21, 2007 Warsaw, Poland 2007... · 2007. 8. 10. · CMU SCS Mining Large Graphs:Mining Large Graphs: Laws and toolsLaws and tools Christos Faloutsos and Jure

CMU SCS

Co clusteringdetails

Co-clustering

Let X and Y be discrete random variables X and Y take values in 1, 2, …, m and 1, 2, …, np(X, Y) denotes the joint probability distribution—if not known, it is often estimated based on co-

d toccurrence dataApplication areas: text mining, market-basket analysis analysis of browsing behavior etcanalysis, analysis of browsing behavior, etc.

Key Obstacles in Clustering Contingency Tables High Dimensionality, Sparsity, NoiseNeed for robust and scalable algorithms

Faloutsos&LeskovecECML/PKDD 2007

Part 2-55Reference:1. Dhillon et al. Information-Theoretic Co-clustering, KDD’03

CMU SCS

⎥⎤

⎢⎡ 00005.05.05.

n

eg terms x documents

⎥⎥⎥⎤

⎢⎢⎢⎡

05.05.05.00005.05.05.00000005.05.05.

meg, terms x documents

⎥⎥⎦⎢

⎢⎣ 04.04.004.04.04.

04.04.04.004.04.

nlk

⎥⎥⎤

⎢⎢⎡

054054042000000042.054.054.000042.054.054.

⎥⎥⎤

⎢⎢⎡

050005.005.

⎥⎦⎤

⎢⎣⎡

223.003. [ ]

36.36.28.00000028.36.36. =

nl

k

kl

⎥⎥⎥

⎦⎢⎢⎢

⎣036.036.028.028036.036.054.054.042.000054.054.042.000

⎥⎥⎥

⎦⎢⎢⎢

⎣5.0005.005.0 ⎦⎣ 2.2.

m

⎥⎦⎢⎣ 036.036.028.028.036.036.⎥⎦⎢⎣ 5.00

Faloutsos&LeskovecECML/PKDD 2007

Part 2-56

70

Page 73: presented by September 21, 2007 Warsaw, Poland 2007... · 2007. 8. 10. · CMU SCS Mining Large Graphs:Mining Large Graphs: Laws and toolsLaws and tools Christos Faloutsos and Jure

CMU SCS

med. doccs doc

⎤⎡ 00005.05.05. med. terms

⎥⎥⎥⎤

⎢⎢⎢⎡

05.05.05.00005.05.05.00000005.05.05.

term group xcs terms

⎥⎥⎦⎢

⎢⎣ 04.04.004.04.04.

04.04.04.004.04.term group xdoc. group common terms

⎥⎥⎤

⎢⎢⎡

000042.054.054.000042.054.054.

⎥⎥⎤

⎢⎢⎡

005.005.

⎥⎦⎤

⎢⎣⎡

223.003. [ ]

36.36.28.00000028.36.36. =

⎥⎥⎥

⎦⎢⎢⎢

⎣036.036.028.028036.036.054.054.042.000054.054.042.000

⎥⎥⎥

⎦⎢⎢⎢

⎣5.0005.005.0 ⎦⎣ 2.2.

doc xd ⎥⎦⎢⎣ 036.036.028.028.036.036.⎥⎦⎢⎣ 5.00

t

doc group

Faloutsos&LeskovecECML/PKDD 2007

Part 2-57term x

term-group

CMU SCS

Co clusteringCo-clustering

Observationsuses KL divergence instead of L2uses KL divergence, instead of L2the middle matrix is not diagonal

we’ll see that again in the Tucker tensor decomposition

Faloutsos&LeskovecECML/PKDD 2007

Part 2-58

71

Page 74: presented by September 21, 2007 Warsaw, Poland 2007... · 2007. 8. 10. · CMU SCS Mining Large Graphs:Mining Large Graphs: Laws and toolsLaws and tools Christos Faloutsos and Jure

CMU SCS

Problem with Information Theoretic Co-clustering

Number of row and column groups must be specifiedp

Desiderata:

Simultaneously discover row and column groups

Fully Automatic: No “magic numbers”Fully Automatic: No magic numbers

Scalable to large graphs

Faloutsos&LeskovecECML/PKDD 2007

Part 2-59

CMU SCS

Cross associationsCross-associations

Desiderata:

Simultaneously discover row and column groups

Fully Automatic: No “magic numbers”Fully Automatic: No magic numbers

Scalable to large matrices

Faloutsos&LeskovecECML/PKDD 2007

Part 2-60Reference:1. Chakrabarti et al. Fully Automatic Cross-Associations, KDD’04

72

Page 75: presented by September 21, 2007 Warsaw, Poland 2007... · 2007. 8. 10. · CMU SCS Mining Large Graphs:Mining Large Graphs: Laws and toolsLaws and tools Christos Faloutsos and Jure

CMU SCS

What makes a cross-association “good”?

ups

ups

Why is this better?

versus

Row

gro

u

Row

gro

u better?

Column groups

Column gro ps

R Rgroups groups

Faloutsos&LeskovecECML/PKDD 2007

Part 2-61

CMU SCS

What makes a cross-association “good”?

ups

ups

Why is this better?

versus

Row

gro

u

Row

gro

u better?

Column groups

Column gro ps

R R

groups groups

simpler; easier to describesimpler; easier to describeeasier to compress!

Faloutsos&LeskovecECML/PKDD 2007

Part 2-62

73

Page 76: presented by September 21, 2007 Warsaw, Poland 2007... · 2007. 8. 10. · CMU SCS Mining Large Graphs:Mining Large Graphs: Laws and toolsLaws and tools Christos Faloutsos and Jure

CMU SCS

What makes a cross-association “good”?

Problem definition: given an encoding scheme• decide on the # of col and row groups k and ldecide on the # of col. and row groups k and l• and reorder rows and columns,• to achieve best compressionto achieve best compression

Faloutsos&LeskovecECML/PKDD 2007

Part 2-63

CMU SCS

Main Ideadetails

Main Idea

Good Compression

Better Clustering

sizei * H(xi) + Cost of describing cross-associationsΣiTotal Encoding Cost =

Code Cost Description C tCost

Minimize the total cost (# bits)Minimize the total cost (# bits)

for lossless compressionFaloutsos&LeskovecECML/PKDD 2007

Part 2-64

74

Page 77: presented by September 21, 2007 Warsaw, Poland 2007... · 2007. 8. 10. · CMU SCS Mining Large Graphs:Mining Large Graphs: Laws and toolsLaws and tools Christos Faloutsos and Jure

CMU SCS

AlgorithmAlgorithml 5 l

k = 5

l = 5 col groups

row grouups

k=1, l=2

k=2, l=2

k=2, l=3

k=3, l=3

k=3, l=4

k=4, l=4

k=4, l=5

Faloutsos&LeskovecECML/PKDD 2007

Part 2-65

CMU SCS

AlgorithmAlgorithm

Code for cross-associations (matlab):

www.cs.cmu.edu/~deepay/mywww/software/CrossAssociations-01-27-2005.tgzions 01 27 2005.tgz

Variations and extensions:Variations and extensions:‘Autopart’ [Chakrabarti, PKDD’04]www.cs.cmu.edu/~deepay

Faloutsos&LeskovecECML/PKDD 2007

Part 2-66

75

Page 78: presented by September 21, 2007 Warsaw, Poland 2007... · 2007. 8. 10. · CMU SCS Mining Large Graphs:Mining Large Graphs: Laws and toolsLaws and tools Christos Faloutsos and Jure

CMU SCS

Matrix tools summaryMatrix tools - summary

SVD: optimal for L2 – VERY popular (HITS, p p p ( ,PageRank, Karhunen-Loeve, Latent Semantic Indexing, PCA, etc etc)g, , )

C-U-R (CMD etc)ti l it i t t bilitnear-optimal; sparsity; interpretability

Tensors

Faloutsos&LeskovecECML/PKDD 2007

Part 2-67

76

Page 79: presented by September 21, 2007 Warsaw, Poland 2007... · 2007. 8. 10. · CMU SCS Mining Large Graphs:Mining Large Graphs: Laws and toolsLaws and tools Christos Faloutsos and Jure

CMU SCS

Part 3: Case studiesPart 3: Case studies(Virus) Propagation in networks:(Virus) Propagation in networks:

properties, models, detectionWeb projections:Web projections:

How to predict user intentions?Center Piece Subgraphs:Center Piece Subgraphs:

Find best subgraph connecting a set of nodes

CMU SCS

Part 3 1: Virus and influencePart 3.1: Virus and influence propagationp p g

77

Page 80: presented by September 21, 2007 Warsaw, Poland 2007... · 2007. 8. 10. · CMU SCS Mining Large Graphs:Mining Large Graphs: Laws and toolsLaws and tools Christos Faloutsos and Jure

CMU SCS

Cascading Behavior in gLarge Blog Graphs

Blogs Posts

Information

H d i f ti t

LinksInformation

cascade

How does information propagate over the blogosphere?

H d bl it d i fl h th ?How do blogs cite and influence each other?How do such links evolve?

Faloutsos, LeskovecECML/PKDD 2007

Part 3-3

CMU SCS

Motivation: Why blogs?Motivation: Why blogs?

Blogs are a widely used medium of information for many topics and have y pbecome an important mode of communicationcommunicationBlogs cite one another, creating a recordf fof how information and ideas spread

through the social networkgThis record is publicly available

Faloutsos, LeskovecECML/PKDD 2007

Part 3-4

78

Page 81: presented by September 21, 2007 Warsaw, Poland 2007... · 2007. 8. 10. · CMU SCS Mining Large Graphs:Mining Large Graphs: Laws and toolsLaws and tools Christos Faloutsos and Jure

CMU SCS

Immediate GoalsImmediate Goals

Temporal questions: Does popularity have half-life? Is there periodicity?p yTopological questions: What topological patterns do posts and blogs follow? Whatpatterns do posts and blogs follow? What shapes to cascades take on? Stars? C ? S ?Chains? Something else?Models: Can we build a generative modelModels: Can we build a generative model that mimics properties of cascades?

Faloutsos, LeskovecECML/PKDD 2007

Part 3-5

CMU SCS

Cascades on the BlogosphereCascades on the BlogosphereB1 a1B1 B2

a

b cB1 B2

11

21

B4B3

deB4

B3

2

1 3

Blogosphereblogs + posts

Blog networklinks among blogs

Post networklinks among posts

Cascade is graph induced by d b ca

a time ordered propagation of information (edges)

Cascadese e

Faloutsos, LeskovecECML/PKDD 2007

Part 3-6

Cascades

79

Page 82: presented by September 21, 2007 Warsaw, Poland 2007... · 2007. 8. 10. · CMU SCS Mining Large Graphs:Mining Large Graphs: Laws and toolsLaws and tools Christos Faloutsos and Jure

CMU SCS

Blog dataBlog data45 000 blogs participating in cascades45,000 blogs participating in cascadesAll their posts for 3 months (Aug-Sept ‘05)2 4 illi t2.4 million posts ~5 million links (245,404 inside the dataset)

sr o

f pos

tsN

umbe

r

Faloutsos, LeskovecECML/PKDD 2007

Part 3-7Time [1 day]

CMU SCS

Temporal ObservationsTemporal Observations

Posts have weekend effect, less traffic on po

sts

# lin

ks

Saturday/Sunday.Day of the week Day of the week

#

#

Post popularity drop-off follows a power law which can be explained by a simple queuing model

nks

(log)

p y p q g

The probability that a post written at time t acquires a

mbe

r in-

linwritten at time tp acquires a link at time tp + Δ is:

p(t +Δ) ~ Δ-1.5

Faloutsos, LeskovecECML/PKDD 2007

Part 3-8Days after post

Nup(tp+Δ) ~ Δ

80

Page 83: presented by September 21, 2007 Warsaw, Poland 2007... · 2007. 8. 10. · CMU SCS Mining Large Graphs:Mining Large Graphs: Laws and toolsLaws and tools Christos Faloutsos and Jure

CMU SCS

Blog Networkg44,356 nodes, 122,153 edges. Half of blogs b l t l t t d tbelong to largest connected component.

e) ks

log

scal

e

og in

-link

Cou

nt (

ber o

f blo

Number of blog in-links (log scale) Number of blog out-linksN

umIn- and out-degree follow power law distribution. In-degree exponent 1.7, out-degree exponent -3.

But, in-links and out-links are not correlated. ρ=0.16

Faloutsos, LeskovecECML/PKDD 2007

Part 3-9

out degree exponent 3.

Strong rich-get-richer phenomena.

CMU SCS

Post NetworkPost Network2.4M nodes, 250K edges, gBoth in- and out-degree follow power laws. In-degree exponent - un

t

2.1, out-degree exponent -3.

Power laws also in total degree d b f t bl

Cou

and number of posts per blog.Number of blog-blog links

ount

unt

Co

Cou

Faloutsos, LeskovecECML/PKDD 2007

Part 3-10Post in-degree Posts per blog

81

Page 84: presented by September 21, 2007 Warsaw, Poland 2007... · 2007. 8. 10. · CMU SCS Mining Large Graphs:Mining Large Graphs: Laws and toolsLaws and tools Christos Faloutsos and Jure

CMU SCS

Topological patterns: CascadesTopological patterns: CascadesProcedure for gathering cascades:Procedure for gathering cascades:

Find all initiators (nodes with out-degree 0)Follow in-linksProduces directed acyclic graphProduces directed acyclic graphCount cascade shapes (use our multi-level graph isomorphism testing algorithm)graph isomorphism testing algorithm)

a a

d b c ab c

d

b c

d

d

e

b c

e

Faloutsos, LeskovecECML/PKDD 2007

Part 3-11e e

CMU SCS

Cascade shapesCascade shapesC dCascade shapes

(ranked by(ranked by frequency)

The probability of observing a cascade on n nodesa cascade on n nodes follows a Zipf distribution:

p(n) n-2

Also power laws in in/out-degree, type of cascades (chains stars)

p(n) ~ n 2

Faloutsos, LeskovecECML/PKDD 2007

Part 3-12

type of cascades (chains, stars), and degree per level.

82

Page 85: presented by September 21, 2007 Warsaw, Poland 2007... · 2007. 8. 10. · CMU SCS Mining Large Graphs:Mining Large Graphs: Laws and toolsLaws and tools Christos Faloutsos and Jure

CMU SCS

Properties of CascadesProperties of CascadesMost of cascades are trees

edge

s

met

er

umbe

r of

ctiv

e di

am

Cascade size (number of nodes)

Nu

Cascade size

Effe

c

t

Number of cascades per

Cou

nt

Cou

ntcascades per node also

follows power-l di t ib ti

Faloutsos, LeskovecECML/PKDD 2007

Part 3-13

Number of joined cascades Cascades per node

law distribution.

CMU SCS

Cascade Generation ModelCascade Generation Model1) Randomly pick blog to 2) Infect each in-linked1) Randomly pick blog to infect, add to cascade.

1

2) Infect each in linked neighbor with probability β.

B B1

B1 B2

11

21

B1B1

B1 B21

21

B4B31 3

3) Add infected neighbors 4) Set node infected in (i) to

B4B31 3

B1B B

1

3) Add infected neighbors to cascade.

4) Set node infected in (i) to uninfected.

B1 B21

21

B1 B21

21B1

B

B1

Faloutsos, LeskovecECML/PKDD 2007

Part 3-14B4B31 3B4

B31 3B4 B4

83

Page 86: presented by September 21, 2007 Warsaw, Poland 2007... · 2007. 8. 10. · CMU SCS Mining Large Graphs:Mining Large Graphs: Laws and toolsLaws and tools Christos Faloutsos and Jure

CMU SCS

Experimental ResultsExperimental ResultsG ti d l

ount

ount

Generative model produces realistic cascades C Ccascades

β=0.025Cascade size Cascade node in-degree

Cou

nt

Cou

ntC

Faloutsos, LeskovecECML/PKDD 2007

Part 3-15Most frequent cascades Size of star cascade Size of chain cascade

CMU SCS

ConclusionsConclusionsTemporal PropertiesTemporal PropertiesPopularity drop-off follows power-law distribution exactly as found in response times in other work

Topological Properties

exactly as found in response times in other work. Posts follow weekly periodicity.

Topological PropertiesPower law distributions in almost every topological property Star cascades are more common than chainsproperty. Star cascades are more common than chains, and size of cascades follow a power law.

Generative ModelGenerative ModelWe developed a generative model based on SIS model in epidemiology that matched properties of

Faloutsos, LeskovecECML/PKDD 2007

Part 3-16

model in epidemiology that matched properties of cascades.

84

Page 87: presented by September 21, 2007 Warsaw, Poland 2007... · 2007. 8. 10. · CMU SCS Mining Large Graphs:Mining Large Graphs: Laws and toolsLaws and tools Christos Faloutsos and Jure

CMU SCS

Part 3 2: How to detectPart 3.2: How to detect outbreaks in networks?

CMU SCS

Scenario 1: Water networkScenario 1: Water network

Given a real city water distributionwater distribution networkAnd data on how S

Where should we place sensors to efficientlycontaminants spread

in the network (for any possible

sensors to efficientlydetect the all possible

contaminations?any possible contaminant starting point) S

contaminations?p )

Faloutsos, LeskovecECML/PKDD 2007

Part 3-18

85

Page 88: presented by September 21, 2007 Warsaw, Poland 2007... · 2007. 8. 10. · CMU SCS Mining Large Graphs:Mining Large Graphs: Laws and toolsLaws and tools Christos Faloutsos and Jure

CMU SCS

Scenario 2: Cascades in blogsScenario 2: Cascades in blogs

Which blogs sho ld oneBlogs Posts

Which blogs should one read to detect stories as

Information effectively as possible?

Links cascade

Faloutsos, LeskovecECML/PKDD 2007

Part 3-19

CMU SCS

What means effectively?What means effectively?

What do we mean by effectively:Want to capture stories/contaminations soon?

Time to detection

Want to detect as many contaminations/stories as ibl ?possible?

Detection likelihood

Want to be the first to know? Want few people to getWant to be the first to know? Want few people to get infected?

Population affectedPopulation affected

What is the cost of reading a blog (placing a sensor)?

Faloutsos, LeskovecECML/PKDD 2007

Part 3-20

sensor)?

86

Page 89: presented by September 21, 2007 Warsaw, Poland 2007... · 2007. 8. 10. · CMU SCS Mining Large Graphs:Mining Large Graphs: Laws and toolsLaws and tools Christos Faloutsos and Jure

CMU SCS

General problemGeneral problemW t t k d bl hWater networks and blogs have common structure:

Given a dynamic process spreading over the networkGiven a dynamic process spreading over the network (e.g., a virus, information, influence)We want to select a set of nodes to detect the process effectivelyprocess effectively

Many other examples:Epidemics – which people to monitor to detectEpidemics – which people to monitor to detect disease outbreaks?Viral marketing – who are easily influenced people?Water distribution network – where to place sensors?Information propagation networks

Faloutsos, LeskovecECML/PKDD 2007

Part 3-21

CMU SCS

Problem settingProblem settingGiven a graph G(V E)Given a graph G(V,E)and a budget of B sensors and data on how contaminations spread over thepnetwork:

for each contamination i we know the time T(i, u) when itcontaminated node ucontaminated node u

We want to select a subset of nodes A that minimizes the expected penalty typ p y

Detection time

Pen

alt

where T(i, A) is the time when placement A first detects contamination i

Detection time

Faloutsos, LeskovecECML/PKDD 2007

Part 3-22

detects co ta at o iThe later we detect higher the penalty

87

Page 90: presented by September 21, 2007 Warsaw, Poland 2007... · 2007. 8. 10. · CMU SCS Mining Large Graphs:Mining Large Graphs: Laws and toolsLaws and tools Christos Faloutsos and Jure

CMU SCS

Objective (penalty) functionsObjective (penalty) functionsObj ti f ti d fi d b B ttl fObjective functions were defined by Battle of Water Sensor Networks competitionW h th t th bj ti f ti ( ltWe show that the objective functions (penalty reductions):

Ti t d t ti (DT)Time to detection (DT)How long does it take to detect a contamination?

Detection likelihood (DL)Detection likelihood (DL)How many contaminations do we detect?

Population affected (PA)How many people drank contaminated water?

are submodular

Faloutsos, LeskovecECML/PKDD 2007

Part 3-23

CMU SCS

Objective function propertiesObjective function propertiesSubmodularity a diminishing returnsSubmodularity, a diminishing returns property of functions:

For all placements it holds

The marginal benefit (penalty reduction) of ddi l di i i hadding a sensor only diminishes as more

sensors are addedO ti i i b d l f ti iOptimizing submodular functions is NP-hard

Faloutsos, LeskovecECML/PKDD 2007

Part 3-24

88

Page 91: presented by September 21, 2007 Warsaw, Poland 2007... · 2007. 8. 10. · CMU SCS Mining Large Graphs:Mining Large Graphs: Laws and toolsLaws and tools Christos Faloutsos and Jure

CMU SCS

Unit cost: Greedy algorithmUnit cost: Greedy algorithmOptimization is NP hard soOptimization is NP-hard, so lets use simple greedy algorithmbenefit gHow far is greedy solution from the optimal solution?

a

b b

dbenefit

Unit cost (each sensor costs the same):

A greedy algorithm is near

b

c

a

ce

A greedy algorithm is near optimal – it is at most 1-1/e(~63%) from optimal [N h t l ‘78]

d

e [Nemhauser et al ‘78] Greedy algorithm is slow

scales as O(|V|B)

e

Faloutsos, LeskovecECML/PKDD 2007

Part 3-25

scales as O(|V|B)

CMU SCS

Variable cost: Greedy algorithmVariable cost: Greedy algorithm

B t l i diff t ( diBut placing different sensors (reading different blogs) may have different costsVariable sensor cost:

Greedy algorithm can fail arbitrarily badlyGreedy algorithm can fail arbitrarily badlyWe develop an new algorithm that achieves ½(1-1/e) factor approximation½(1 1/e) factor approximation

We also develop a new algorithm independent bound that is in practiceindependent bound that is in practice much tighter

Faloutsos, LeskovecECML/PKDD 2007

Part 3-26

89

Page 92: presented by September 21, 2007 Warsaw, Poland 2007... · 2007. 8. 10. · CMU SCS Mining Large Graphs:Mining Large Graphs: Laws and toolsLaws and tools Christos Faloutsos and Jure

CMU SCS

Scaling up greedy algorithmScaling up greedy algorithm

Submodularity guarantees that marginal improvements decrease with the solution psizeImprovement for including sensor u nowImprovement for including sensor u nowsmaller (or equal) than it was in previous

fstep of the greedy algorithm

Can do lazy evaluations! [Robertazzi et al, ‘89]

Faloutsos, LeskovecECML/PKDD 2007

Part 3-27

al, 89]

CMU SCS

Scaling up greedy algorithmScaling up greedy algorithm

Our improved algorithm:Keep an ordered list of d

benefitpmarginal benefits bi from previous iteration

a

b ab

d

pRe-evaluate bi for top-sensor

cc

d

e

sensorIf that removes it from the top pick next sensor

d

e

top, pick next sensorContinue until top remains

h dFaloutsos, LeskovecECML/PKDD 2007

Part 3-28unchanged

90

Page 93: presented by September 21, 2007 Warsaw, Poland 2007... · 2007. 8. 10. · CMU SCS Mining Large Graphs:Mining Large Graphs: Laws and toolsLaws and tools Christos Faloutsos and Jure

CMU SCS

Scaling up greedy algorithmScaling up greedy algorithm

dbenefit

Our improved algorithm:Keep an ordered list of

a

ab

d

b

pmarginal benefits bi from previous iteration

cd

c e

pRe-evaluate bi for top-sensor d

e

sensorIf that removes it from the top pick next sensortop, pick next sensorContinue until top remains

h dFaloutsos, LeskovecECML/PKDD 2007

Part 3-29unchanged

CMU SCS

Scaling up greedy algorithmScaling up greedy algorithm

Our improved algorithm:Keep an ordered list of d

benefitpmarginal benefits bi from previous iteration

a

ab

d

d

pRe-evaluate bi for top-sensor

cb

e

e

sensorIf that removes it from the top pick next sensor

c

e

top, pick next sensorContinue until top remains

h dFaloutsos, LeskovecECML/PKDD 2007

Part 3-30unchanged

91

Page 94: presented by September 21, 2007 Warsaw, Poland 2007... · 2007. 8. 10. · CMU SCS Mining Large Graphs:Mining Large Graphs: Laws and toolsLaws and tools Christos Faloutsos and Jure

CMU SCS

Experiments: 2 case studiesExperiments: 2 case studies

W h l ti d tWe have real propagation dataBlog network:

We crawled blogs for 1 yearWe identified cascades – temporal propagation of i f tiinformation

Water distribution network:B ttl f W t S N t k titiBattle of Water Sensor Networks competitionReal city water distribution networksRealistic simulator of water consumption and waterRealistic simulator of water consumption and water flow provided by US Environmental Protection Agency

Faloutsos, LeskovecECML/PKDD 2007

Part 3-31

g y

CMU SCS

Case study 1: Cascades in blogsCase study 1: Cascades in blogs

We crawled 45,000 blogs for 1 yearWe obtained 10 million postsWe obtained 10 million postsAnd identified 350,000 cascades

Faloutsos, LeskovecECML/PKDD 2007

Part 3-32

92

Page 95: presented by September 21, 2007 Warsaw, Poland 2007... · 2007. 8. 10. · CMU SCS Mining Large Graphs:Mining Large Graphs: Laws and toolsLaws and tools Christos Faloutsos and Jure

CMU SCS

Blogs: Solution qualityBlogs: Solution quality

Our bound is much tighter13% instead of 66%% %

Our algorithm

Faloutsos, LeskovecECML/PKDD 2007

Part 3-33

CMU SCS

Blogs: Cost of a blog (1)Blogs: Cost of a blog (1)If ll bl t thIf all blogs cost the same algorithm picks large popular blogs:large popular blogs: instapundit.com, michellemalkin.com

But reading largeBut reading large blogs is time consumingSo consider the case where cost of a blog is proportional to theis proportional to the number of posts on the blog

Faloutsos, LeskovecECML/PKDD 2007

Part 3-34

g

93

Page 96: presented by September 21, 2007 Warsaw, Poland 2007... · 2007. 8. 10. · CMU SCS Mining Large Graphs:Mining Large Graphs: Laws and toolsLaws and tools Christos Faloutsos and Jure

CMU SCS

Blogs: Cost of a blog (2)Blogs: Cost of a blog (2)B t th l ithBut then algorithm picks lots of small blogs that participateblogs that participate in few cascadesWe pick best

Score π=0.4π=0.3We pick best

solution that interpolates betweeninterpolates between the costsWe can get good

π=0.3

We can get good solutions with few blogs and few posts Each curve represents solutions

Faloutsos, LeskovecECML/PKDD 2007

Part 3-35

g p Each curve represents solutions with the same score

CMU SCS

Blogs: Comparison to heuristicsBlogs: Comparison to heuristics

Use heuristics to select blogsgNumber of in-links and out links of aand out-links of a blog perform bestBut much worse than our algorithmthan our algorithm

Faloutsos, LeskovecECML/PKDD 2007

Part 3-36

94

Page 97: presented by September 21, 2007 Warsaw, Poland 2007... · 2007. 8. 10. · CMU SCS Mining Large Graphs:Mining Large Graphs: Laws and toolsLaws and tools Christos Faloutsos and Jure

CMU SCS

Blogs: ScalabilityBlogs: Scalability

Our algorithm that uses lazy yevaluations runs 700 times faster700 times faster than simple greedy algorithmgreedy algorithm

Faloutsos, LeskovecECML/PKDD 2007

Part 3-37

CMU SCS

Case study 2: Water networkCase study 2: Water network

B ttl f W t S N t kBattle for Water Sensor Networks competitionReal metropolitan area water distribution network (largest network optimized):( g p )

V = 21,000 nodesE = 25 000 pipesE = 25,000 pipes

3.6 million different epidemic scenarios (152 GB of simulation data)(152 GB of simulation data)By exploiting sparsity we can fit it into

Faloutsos, LeskovecECML/PKDD 2007

Part 3-38main memory (16GB)

95

Page 98: presented by September 21, 2007 Warsaw, Poland 2007... · 2007. 8. 10. · CMU SCS Mining Large Graphs:Mining Large Graphs: Laws and toolsLaws and tools Christos Faloutsos and Jure

CMU SCS

Water network: Solution qualityWater network: Solution quality

Our bound is much tighter

Faloutsos, LeskovecECML/PKDD 2007

Part 3-39

CMU SCS

Water network: Heuristic placementWater network: Heuristic placement

Our algorithms outperforms heuristic node selection

Faloutsos, LeskovecECML/PKDD 2007

Part 3-40selection

96

Page 99: presented by September 21, 2007 Warsaw, Poland 2007... · 2007. 8. 10. · CMU SCS Mining Large Graphs:Mining Large Graphs: Laws and toolsLaws and tools Christos Faloutsos and Jure

CMU SCS

Water network: Sensor placementWater network: Sensor placement

Different objective functions give different sensor placementsp

Population affected Detection likelihood

Faloutsos, LeskovecECML/PKDD 2007

Part 3-41

Population affected

CMU SCS

Water network: ScalabilityWater network: Scalability

Our algorithm is 10 times faster than greedy

Faloutsos, LeskovecECML/PKDD 2007

Part 3-42

g y

97

Page 100: presented by September 21, 2007 Warsaw, Poland 2007... · 2007. 8. 10. · CMU SCS Mining Large Graphs:Mining Large Graphs: Laws and toolsLaws and tools Christos Faloutsos and Jure

CMU SCS

Results of BWSN [Ostfeld et al]Results of BWSN [Ostfeld et al]Author #non- dominated

(out of 30)(out of 30)Our approach 26Berry et. al. 21

Battle of Water Sensor Networks y

Dorini et. al. 20Wu and Walski 19

competitionMulti criterion Ostfeld et al 14

Propato and Piller 12Eli d l 11

Multi-criterion optimization

Eliades et. al. 11Huang et. al. 7Guan et al 4

[Ostfeld et al ‘07]: count number of Guan et. al. 4

Ghimire and Barkdoll 3Trachtman 2

count number of non-dominated solutions

Faloutsos, LeskovecECML/PKDD 2007

Part 3-43Gueli 2Preis and Ostfeld 1

solutions

CMU SCS

Other resultsOther results

We also consider Fractional selection of the blogsgGeneralization to future unseen cascadesMulti criterion optimization when one caresMulti-criterion optimization when one cares about more than one optimization function

And show that triggering model of Kempe et al is a special case of out network poutbreak detection problem

Faloutsos, LeskovecECML/PKDD 2007

Part 3-44

98

Page 101: presented by September 21, 2007 Warsaw, Poland 2007... · 2007. 8. 10. · CMU SCS Mining Large Graphs:Mining Large Graphs: Laws and toolsLaws and tools Christos Faloutsos and Jure

CMU SCS

Conclusion (1)Conclusion (1)

We presented a general methodology for selecting nodes to detect outbreaks in gnetworksFor non unit cost case we developed anFor non-unit cost case we developed an algorithm that achieves ½(1-1/e) of optimal

l isolutionWe developed a new online bound that inWe developed a new online bound that in practice is very tight

Faloutsos, LeskovecECML/PKDD 2007

Part 3-45

CMU SCS

Conclusion (2)Conclusion (2)

We demonstrate how lazy evaluation drastically speeds up our algorithmy p p gWe show that our methodology can easily be extended to specific problem domainsbe extended to specific problem domainsWe evaluated our approach on two large real world application domains with many domain specific questionsdomain specific questions

Faloutsos, LeskovecECML/PKDD 2007

Part 3-46

99

Page 102: presented by September 21, 2007 Warsaw, Poland 2007... · 2007. 8. 10. · CMU SCS Mining Large Graphs:Mining Large Graphs: Laws and toolsLaws and tools Christos Faloutsos and Jure

CMU SCS

Part 3.3: Web projections

CMU SCS

MotivationMotivationI f ti t i l t diti ll id dInformation retrieval traditionally considered documents as independentWeb retrieval incorporates global hyperlinkWeb retrieval incorporates global hyperlink relationships to enhance ranking (e.g., PageRank, HITS)g )

Operates on the entire graphUses just one feature (principal eigenvector) of the graphthe graph

Our work on Web projections focuses oncontextual subsets of the web graph; in-betweencontextual subsets of the web graph; in between the independent and global consideration of the documentsa rich set of graph theoretic properties

Faloutsos, LeskovecECML/PKDD 2007

Part 3-48

a rich set of graph theoretic properties

100

Page 103: presented by September 21, 2007 Warsaw, Poland 2007... · 2007. 8. 10. · CMU SCS Mining Large Graphs:Mining Large Graphs: Laws and toolsLaws and tools Christos Faloutsos and Jure

CMU SCS

Web projectionsWeb projections

Web projections: How they work?Project a set of web pages of interest onto the web graphThis creates a subgraph of the web called

j ti hprojection graphUse the graph-theoretic properties of the subgraph for tasks of interestfor tasks of interest

Query projectionsQuery results give the context (set of web pages)Use characteristics of the resulting graphs for

di ti b t h lit d b h iFaloutsos, LeskovecECML/PKDD 2007

Part 3-49predictions about search quality and user behavior

CMU SCS

Query projectionsQuery Results Projection on the web graph

Query projections

Q• -- -- ----• --- --- ----• ------ ---• ----- --- --• ------ -----• ------ -----

Query projection graph Query connection graph

Generate graphical

Faloutsos, LeskovecECML/PKDD 2007

Part 3-50features Construct

case libraryPredictions

101

Page 104: presented by September 21, 2007 Warsaw, Poland 2007... · 2007. 8. 10. · CMU SCS Mining Large Graphs:Mining Large Graphs: Laws and toolsLaws and tools Christos Faloutsos and Jure

CMU SCS

Questions we exploreQuestions we explore

H d h lt j tHow do query search results project onto the underlying web graph?Can we predict the quality of search results from the projection on the web p jgraph?Can we predict users’ behaviors withCan we predict users behaviors with issuing and reformulating queries?

Faloutsos, LeskovecECML/PKDD 2007

Part 3-51

CMU SCS

Is this a good set of search results?Is this a good set of search results?

Faloutsos, LeskovecECML/PKDD 2007

Part 3-52

102

Page 105: presented by September 21, 2007 Warsaw, Poland 2007... · 2007. 8. 10. · CMU SCS Mining Large Graphs:Mining Large Graphs: Laws and toolsLaws and tools Christos Faloutsos and Jure

CMU SCS

Will the user reformulate the query?Will the user reformulate the query?

Faloutsos, LeskovecECML/PKDD 2007

Part 3-53

CMU SCS

Resources and conceptsResources and conceptsWeb as a graphWeb as a graph

URL graph:Nodes are web pages, edges are hyper-linksD f M h 2006Data from March 2006 Graph: 22 million nodes, 355 million edges

Domain graph:g pNodes are domains (cmu.edu, bbc.co.uk). Directed edge (u,v) if there exists a webpage at domain u pointing to vData from February 2006 yGraph: 40 million nodes, 720 million edges

Contextual subgraphs for queriesProjection graphConnection graph

Compute graph theoretic featuresFaloutsos, LeskovecECML/PKDD 2007

Part 3-54

Compute graph-theoretic features

103

Page 106: presented by September 21, 2007 Warsaw, Poland 2007... · 2007. 8. 10. · CMU SCS Mining Large Graphs:Mining Large Graphs: Laws and toolsLaws and tools Christos Faloutsos and Jure

CMU SCS

“Projection” graphProjection graph

Example query: Subaru

Project top 20 results byProject top 20 results by the search engineNumber in the node denotes the searchdenotes the search engine rankColor indicates relevancy yas assigned by human:

PerfectExcellentExcellentGoodFairPoor

Faloutsos, LeskovecECML/PKDD 2007

Part 3-55

PoorIrrelevant

CMU SCS

“Connection” graphConnection graphProjection graph is generally disconnectedFi d t dFind connector nodesConnector nodes are existing nodes that areexisting nodes that are not part of the original result setesu t setIdeally, we would like to introduce fewest possible pnodes to make projection graph connected Connector

nodesFaloutsos, LeskovecECML/PKDD 2007

Part 3-56

nodesProjection nodes

104

Page 107: presented by September 21, 2007 Warsaw, Poland 2007... · 2007. 8. 10. · CMU SCS Mining Large Graphs:Mining Large Graphs: Laws and toolsLaws and tools Christos Faloutsos and Jure

CMU SCS

Finding connector nodesFinding connector nodesFind connector nodes is a Steiner tree problem whichFind connector nodes is a Steiner tree problem which is NP hardOur heuristic:

Connect 2nd largest connected component via shortest path toConnect 2nd largest connected component via shortest path to the largestThis makes a new largest componentRepeat until the graph is connectedRepeat until the graph is connected

2nd largest component

Largest component

Faloutsos, LeskovecECML/PKDD 2007

Part 3-572nd largest component

CMU SCS

Extracting graph featuresExtracting graph featuresTh idThe idea

Find features that describe the str ct re of the graphstructure of the graphThen use the features for machine learningmachine learning

Want features that describeC ti it f th h

vs.Connectivity of the graphCentrality of projection and connector nodesconnector nodesClustering and density of the core of the graph

Faloutsos, LeskovecECML/PKDD 2007

Part 3-58

core of the graph

105

Page 108: presented by September 21, 2007 Warsaw, Poland 2007... · 2007. 8. 10. · CMU SCS Mining Large Graphs:Mining Large Graphs: Laws and toolsLaws and tools Christos Faloutsos and Jure

CMU SCS

Examples of graph featuresExamples of graph featuresProjection graphj g p

Number of nodes/edgesNumber of connected componentsSize and density of the largest connected componentN b f t i d i th hNumber of triads in the graph

Connection graphN b f t d

vs.Number of connector nodesMaximal connector node degreeMean path length betweenMean path length between projection/connector nodesTriads on connector nodes

Faloutsos, LeskovecECML/PKDD 2007

Part 3-59

Triads on connector nodesWe consider 55 features total

CMU SCS

Experimental setupQuery Results Projection on the web graph

Experimental setup

Q• -- -- ----• --- --- ----• ------ ---• ----- --- --• ------ -----• ------ -----

Query projection graph Query connection graph

Generate graphical

Faloutsos, LeskovecECML/PKDD 2007

Part 3-60features Construct

case libraryPredictions

106

Page 109: presented by September 21, 2007 Warsaw, Poland 2007... · 2007. 8. 10. · CMU SCS Mining Large Graphs:Mining Large Graphs: Laws and toolsLaws and tools Christos Faloutsos and Jure

CMU SCS

Constructing case library forConstructing case library for machine learningg

Given a task of interestGenerate contextual subgraph and extract featuresextract featuresEach graph is labeled by target outcomeLearn statistical model that relates the features with the outcomeMake prediction on unseen graphs

Faloutsos, LeskovecECML/PKDD 2007

Part 3-61

CMU SCS

Experiments overviewExperiments overviewGi t f h lt tGiven a set of search results generate projection and connection graphs and their featuresfeatures

Predict quality of a search result set Discriminate top20 vs. top40to60 resultsPredict rating of highest rated document in the set

Predict user behaviorPredict user behaviorPredict queries with high vs. low reformulation probabilityp yPredict query transition (generalization vs. specialization)Predict direction of the transition

Faloutsos, LeskovecECML/PKDD 2007

Part 3-62

Predict direction of the transition

107

Page 110: presented by September 21, 2007 Warsaw, Poland 2007... · 2007. 8. 10. · CMU SCS Mining Large Graphs:Mining Large Graphs: Laws and toolsLaws and tools Christos Faloutsos and Jure

CMU SCS

Experimental detailsExperimental detailsF tFeatures

55 graphical featuresNote we use only graph features no contentNote we use only graph features, no content

LearningWe use probabilistic decision trees (“DNet”)We use probabilistic decision trees ( DNet )

Report classification accuracy using 10-fold cross validation Compare against 2 baselines

Marginals: Predict most common classRankNet: use 350 traditional features (document, anchor text, and basic hyperlink features)

Faloutsos, LeskovecECML/PKDD 2007

Part 3-63

CMU SCS

Search results qualitySearch results quality

D t tDataset:30,000 queriesTop 20 results for each

Each result is labeled by a human judge using a 6-point scale from "Perfect" to "Bad"

Task:Predict the highest rating in the set of results

6-class problem2-class problem: “Good” (top 3 ratings) vs. “P ” (b tt 3 ti )

Faloutsos, LeskovecECML/PKDD 2007

Part 3-64“Poor” (bottom 3 ratings)

108

Page 111: presented by September 21, 2007 Warsaw, Poland 2007... · 2007. 8. 10. · CMU SCS Mining Large Graphs:Mining Large Graphs: Laws and toolsLaws and tools Christos Faloutsos and Jure

CMU SCS

Search quality: the taskSearch quality: the task

Predict the rating of the top result in the set

Predict “Good” Predict “Poor”Faloutsos, LeskovecECML/PKDD 2007

Part 3-65

Predict Good Predict Poor

CMU SCS

Search quality: resultsSearch quality: resultsP di t t h ti iPredict top human rating in the set

Binary classification: GoodAttributes URL

GraphDomain Graph

Binary classification: Good vs. Poor

10-fold cross validation classification accuracy

Marginals 0.55 0.55

RankNet 0.63 0.60classification accuracy

Observations:

RankNet 0.63 0.60

Projection 0.80 0.64

Web Projections outperform both baseline methodsJust projection graph already

Connection 0.79 0.66

Projection + 0 82 0 69Just projection graph already performs quite wellProjections on the URL graph perform better

Connection 0.82 0.69

All 0.83 0.71

Faloutsos, LeskovecECML/PKDD 2007

Part 3-66

perform better

109

Page 112: presented by September 21, 2007 Warsaw, Poland 2007... · 2007. 8. 10. · CMU SCS Mining Large Graphs:Mining Large Graphs: Laws and toolsLaws and tools Christos Faloutsos and Jure

CMU SCS

Search quality: the modelSearch quality: the modelThe learned model shows graph properties of good result setsresult setsGood result sets have:

Search result nodes areSearch result nodes are hub nodes in the graph (have large degrees)S ll dSmall connector node degreesBig connected componentg pFew isolated nodes in projection graphF t d

Faloutsos, LeskovecECML/PKDD 2007

Part 3-67Few connector nodes

CMU SCS

Predict user behaviorPredict user behavior

DatasetQuery logs for 6 weeks35 million unique queries, 80 million total query reformulationsWe only take queries that occur at least 10We only take queries that occur at least 10 timesThis gives us 50,000 queries and 120,000 g , q ,query reformulations

TaskPredict whether the query is going to be reformulated

Faloutsos, LeskovecECML/PKDD 2007

Part 3-68

110

Page 113: presented by September 21, 2007 Warsaw, Poland 2007... · 2007. 8. 10. · CMU SCS Mining Large Graphs:Mining Large Graphs: Laws and toolsLaws and tools Christos Faloutsos and Jure

CMU SCS

Query reformulation: the taskQuery reformulation: the taskGi d di j ti dGiven a query and corresponding projection and connection graphsPredict whether query is likely to be reformulated

Q t lik l t b f l t dFaloutsos, LeskovecECML/PKDD 2007

Part 3-69

Query not likely to be reformulated Query likely to be reformulated

CMU SCS

Query reformulation: resultsQuery reformulation: resultsOb tiObservations:

Gradual improvement as using more features

Attributes URL Graph

Domain Graph

as using more featuresUsing Connection graph features helps

Marginals 0.54 0.54g p pURL graph gives better performance

Projection 0.59 0.58

Connection 0.63 0.59We can also predict type of reformulation ( i li ti

Projection + Connection 0.63 0.60

(specialization vs. generalization) with 0 80 accuracy

Connection

All 0.71 0.67

Faloutsos, LeskovecECML/PKDD 2007

Part 3-70

0.80 accuracy

111

Page 114: presented by September 21, 2007 Warsaw, Poland 2007... · 2007. 8. 10. · CMU SCS Mining Large Graphs:Mining Large Graphs: Laws and toolsLaws and tools Christos Faloutsos and Jure

CMU SCS

Query reformulation: the modelQuery reformulation: the model

Queries likely to be reformulated have:

Search result nodes have low degreeConnector nodes are hubsM t dMany connector nodesResults came from many different domainsmany different domainsResults are sparsely knit

Faloutsos, LeskovecECML/PKDD 2007

Part 3-71

knit

CMU SCS

Query transitionsQuery transitions

Predict if and how will user transform the queryq y

transition

Q: Strawberry Q: Strawberry shortcake

Faloutsos, LeskovecECML/PKDD 2007

Part 3-72

Q: Strawberry shortcake

Q: Strawberry shortcake pictures

112

Page 115: presented by September 21, 2007 Warsaw, Poland 2007... · 2007. 8. 10. · CMU SCS Mining Large Graphs:Mining Large Graphs: Laws and toolsLaws and tools Christos Faloutsos and Jure

CMU SCS

Query transitionQuery transition

With 75% accuracy we can say whether a query is likely to be reformulated: q y y

Def: Likely reformulated p(reformulated) > 0.6With 87% accuracy we can predictWith 87% accuracy we can predict whether observed transition is specialization or generalizationWith 76% we can predict whether the userWith 76% we can predict whether the user will specialize or generalize

Faloutsos, LeskovecECML/PKDD 2007

Part 3-73

CMU SCS

ConclusionConclusionW i t d d W b j tiWe introduced Web projections

A general approach of using context-sensitive sets of web pages to focus attention on relevant subsetof web pages to focus attention on relevant subsetof the web graphAnd then using rich graph-theoretic features of the g g psubgraph as input to statistical models to learn predictive models

We demonstrated Web projections using search result graphs for

P di ti lt t litPredicting result set qualityPredicting user behavior when reformulating queries

Faloutsos, LeskovecECML/PKDD 2007

Part 3-74

queries

113

Page 116: presented by September 21, 2007 Warsaw, Poland 2007... · 2007. 8. 10. · CMU SCS Mining Large Graphs:Mining Large Graphs: Laws and toolsLaws and tools Christos Faloutsos and Jure

CMU SCS

Part 3 4: Center piecePart 3.4: Center piece subgraphsg p

CMU SCS

MasterMind ‘CePS’MasterMind – CePS

w/ Hanghang Tong, KDD 2006htong <at> cs.cmu.edu

Faloutsos, LeskovecECML/PKDD 2007

Part 3-76

114

Page 117: presented by September 21, 2007 Warsaw, Poland 2007... · 2007. 8. 10. · CMU SCS Mining Large Graphs:Mining Large Graphs: Laws and toolsLaws and tools Christos Faloutsos and Jure

CMU SCS

Center Piece Subgraph(Ceps)Center-Piece Subgraph(Ceps)Gi Q dGiven Q query nodesFind Center-piece ( )b≤p ( )

App.App.Social NetworksLaw Inforcement BLaw Inforcement, …

Id

B

Idea:Proximity -> random walk A C

Faloutsos, LeskovecECML/PKDD 2007

Part 3-77with restarts

CMU SCS

C St d ANDCase Study: AND query

R A l Ji i HR. Agrawal Jiawei Han

V. Vapnik M. Jordan

Faloutsos, LeskovecECML/PKDD 2007

Part 3-78

115

Page 118: presented by September 21, 2007 Warsaw, Poland 2007... · 2007. 8. 10. · CMU SCS Mining Large Graphs:Mining Large Graphs: Laws and toolsLaws and tools Christos Faloutsos and Jure

CMU SCS

C St d ANDCase Study: AND query

Faloutsos, LeskovecECML/PKDD 2007

Part 3-79

CMU SCS databases

ML/StatisticML/Statistics

2_SoftAndFaloutsos, LeskovecECML/PKDD 2007

Part 3-80

_query

116

Page 119: presented by September 21, 2007 Warsaw, Poland 2007... · 2007. 8. 10. · CMU SCS Mining Large Graphs:Mining Large Graphs: Laws and toolsLaws and tools Christos Faloutsos and Jure

CMU SCS

DetailsDetails

Main idea: use random walk with restarts, to measure ‘proximity’ p(i,j) of node j to p y p( ,j) jnode i

Faloutsos, LeskovecECML/PKDD 2007

Part 3-81

CMU SCS

ExampleExample

5

P b (RW ill fi ll t t j)

411

12•Starting from 1

Prob (RW will finally stay at j)

310 13

g

•Randomly to neighbor

2 6

•Some p to return to 1

1 789

Faloutsos, LeskovecECML/PKDD 2007

Part 3-82

89

117

Page 120: presented by September 21, 2007 Warsaw, Poland 2007... · 2007. 8. 10. · CMU SCS Mining Large Graphs:Mining Large Graphs: Laws and toolsLaws and tools Christos Faloutsos and Jure

CMU SCS

Individual Score CalculationIndividual Score Calculation

Q1 Q2 Q3Node 1 0.5767 0.0088

0.0088

5

Node 2Node 3Node 4Node 5

0.00880.1235 0.0076

0.00760 0283 0 0283

11 124

0.00760.00240.0333

Node 5Node 6Node 7Node 8

0.0283 0.0283 0.0283

0.0076 0.12350.0076

10 133

0.1260

0.1235

0.02830.0024

0.0076

Node 9Node 10Node 11Node 12

0.0088 0.5767 0.0088

0.0076 0.0076 0 1235

19 8

62

0.1260 0.0333

0 0088

7

Node 12Node 13

0.12350.0088 0.0088

0.57670.0333 0.0024

9 80.5767 0.0088

Faloutsos, LeskovecECML/PKDD 2007

Part 3-830.1260

0.1260 0.0024 0.0333

0 1260 0 0333

CMU SCS

Individual Score CalculationIndividual Score Calculation

Q1 Q2 Q3Node 1 0.5767 0.0088

0.0088

5

Node 2Node 3Node 4Node 5

0.00880.1235 0.0076

0.00760 0283 0 0283

11 124

0.00760.00240.0333

Node 5Node 6Node 7Node 8

0.0283 0.0283 0.0283

0.0076 0.12350.0076

10 133

0.1260

0.1235

0.02830.0024

0.0076

Node 9Node 10Node 11Node 12

0.0088 0.5767 0.0088

0.0076 0.0076 0 1235

19 8

62

0.1260 0.0333

0 0088

7

Node 12Node 13

0.12350.0088 0.0088

0.57670.0333 0.0024 Individual Score matrix

9 80.5767 0.0088

Faloutsos, LeskovecECML/PKDD 2007

Part 3-840.1260

0.1260 0.0024 0.0333

0 1260 0 0333

Individual Score matrix

118

Page 121: presented by September 21, 2007 Warsaw, Poland 2007... · 2007. 8. 10. · CMU SCS Mining Large Graphs:Mining Large Graphs: Laws and toolsLaws and tools Christos Faloutsos and Jure

CMU SCS

AND: Combining ScoresAND: Combining Scores

Q: How to combine scores?A: Multiply…= prob. 3 random… prob. 3 random particles coincide on node jnode j

Faloutsos, LeskovecECML/PKDD 2007

Part 3-85

CMU SCS

K SoftAnd: Combining ScoresdetailsK_SoftAnd: Combining Scores

Generalization –SoftAND:

We want nodes close to k of Q (k<Q) query ( ) ynodes.

Q: How to do that?Q: How to do that?

Faloutsos, LeskovecECML/PKDD 2007

Part 3-86

119

Page 122: presented by September 21, 2007 Warsaw, Poland 2007... · 2007. 8. 10. · CMU SCS Mining Large Graphs:Mining Large Graphs: Laws and toolsLaws and tools Christos Faloutsos and Jure

CMU SCS

K SoftAnd: Combining ScoresdetailsK_SoftAnd: Combining Scores

Generalization –softAND:

We want nodes close to k of Q (k<Q) query ( ) ynodes.

Q: How to do that?Q: How to do that? A: Prob(at least k-out-

of-Q will meet eachof-Q will meet each other at j)

Faloutsos, LeskovecECML/PKDD 2007

Part 3-87

CMU SCS

AND query vs K SoftAnd queryAND query vs. K_SoftAnd query

x 1e-4 5

0.4505

11 124

0.07100.10100.1010

10 133

0.1010

0.0710

0.22670.1010

0.0710

1 7

62

0.1010 0.1010

And Query 2_SoftAnd Query9 8

0.4505 0.4505

Faloutsos, LeskovecECML/PKDD 2007

Part 3-88

120

Page 123: presented by September 21, 2007 Warsaw, Poland 2007... · 2007. 8. 10. · CMU SCS Mining Large Graphs:Mining Large Graphs: Laws and toolsLaws and tools Christos Faloutsos and Jure

CMU SCS

1 SoftAnd query = OR query0 0103

1_SoftAnd query = OR query

5

0.0103

11 124

0.13870.16170.1617

10 133

0.1617

0.08490.1617

3

62

0.1387 0.1387

1 7

9 80 0103

0.1617 0.1617

0.0103

Faloutsos, LeskovecECML/PKDD 2007

Part 3-89

0.0103 0.0103

CMU SCS

Challenges in CepsChallenges in Ceps

Q1: How to measure the importance?A: RWRA: RWR

Q2: How to do it efficiently?y

Faloutsos, LeskovecECML/PKDD 2007

Part 3-90

121

Page 124: presented by September 21, 2007 Warsaw, Poland 2007... · 2007. 8. 10. · CMU SCS Mining Large Graphs:Mining Large Graphs: Laws and toolsLaws and tools Christos Faloutsos and Jure

CMU SCS

Graph Partition: Efficiency IssueGraph Partition: Efficiency Issue

St i htf dStraightforward waysolve a linear system: ti li t # ftime: linear to # of edges

ObservationSkewed distSkewed dist.communities

How to exploit them?Graph partition

Faloutsos, LeskovecECML/PKDD 2007

Part 3-91

p p

CMU SCS

Even better:Even better:

We can correct for the deleted edges (Tong+, ICDM’06, best paper award)( g , , p p )

Faloutsos, LeskovecECML/PKDD 2007

Part 3-92

122

Page 125: presented by September 21, 2007 Warsaw, Poland 2007... · 2007. 8. 10. · CMU SCS Mining Large Graphs:Mining Large Graphs: Laws and toolsLaws and tools Christos Faloutsos and Jure

CMU SCS

Experimental SetupExperimental Setup

DatasetDBLP/authorshipDBLP/authorshipAuthor-Paper315k nodes1 8M edges1.8M edges

Faloutsos, LeskovecECML/PKDD 2007

Part 3-93

CMU SCS

Query Time vs. Pre-Computation Time

Log Query Time

•Quality: 90%+•Quality: 90%+ •On-line:

•Up to 150x speedupPre computation:

L P t ti Ti

•Pre-computation: •Two orders saving

Log Pre-computation Time

Faloutsos, LeskovecECML/PKDD 2007

Part 3-94

123

Page 126: presented by September 21, 2007 Warsaw, Poland 2007... · 2007. 8. 10. · CMU SCS Mining Large Graphs:Mining Large Graphs: Laws and toolsLaws and tools Christos Faloutsos and Jure

CMU SCS

Query Time vs StorageQuery Time vs. Storage

Log Query Time

•Quality: 90%+•Quality: 90%+ •On-line:

•Up to 150x speedupPre storage:

L St

•Pre-storage: •Three orders saving

Log Storage

Faloutsos, LeskovecECML/PKDD 2007

Part 3-95

CMU SCS

ConclusionsB

Conclusions

Q1 H t th i t ?A C

Q1:How to measure the importance?A1: RWR+K SoftAnd_(Q2: How to find connection subgraph?A2:”Extract” Alg )A2: Extract Alg.)Q3:How to do it efficiently?A3:Graph Partition and Sherman-Morrison

~90% quality90% quality6:1 speedup; 150x speedup (ICDM’06, b p award)

Faloutsos, LeskovecECML/PKDD 2007

Part 3-96

b.p. award)

124

Page 127: presented by September 21, 2007 Warsaw, Poland 2007... · 2007. 8. 10. · CMU SCS Mining Large Graphs:Mining Large Graphs: Laws and toolsLaws and tools Christos Faloutsos and Jure

CMU SCS

Part 3 5: FraudPart 3.5: Fraud detection on e-bayy

CMU SCS

E bay Fraud detectionE-bay Fraud detection

/ P l Ch &w/ Polo Chau &Shashank Pandit, CMU

‘non-delivery’ fraud:yseller takes $$ and disappears

Faloutsos, LeskovecECML/PKDD 2007

Part 3-98

and disappears

125

Page 128: presented by September 21, 2007 Warsaw, Poland 2007... · 2007. 8. 10. · CMU SCS Mining Large Graphs:Mining Large Graphs: Laws and toolsLaws and tools Christos Faloutsos and Jure

CMU SCS

Idea: ‘Accomplices’, and Belief Propagation

3 t f d h t f d3 types of nodes: honest, fraud, accomplices‘accomplices’ never do fraud – they just give high ratings to fraudsters-to-beg g g

B.P.:If I am honest my neighbors are eitherIf I am honest, my neighbors are either honest or ‘accomplices’If I’m an accomplice, my neighbors are either honest or fraud

Faloutsos, LeskovecECML/PKDD 2007

Part 3-99...

CMU SCS

E bay Fraud detection NetProbeE-bay Fraud detection - NetProbe

Faloutsos, LeskovecECML/PKDD 2007

Part 3-100

126

Page 129: presented by September 21, 2007 Warsaw, Poland 2007... · 2007. 8. 10. · CMU SCS Mining Large Graphs:Mining Large Graphs: Laws and toolsLaws and tools Christos Faloutsos and Jure

CMU SCS

ConclusionsConclusions

MasterMind -> CePS and random walkscommunities over time -> tensorscommunities over time > tensorsvirus propagation -> eigenvaluefraud detection -> belief propagation

Faloutsos, LeskovecECML/PKDD 2007

Part 3-101

CMU SCS

ReferencesReferences

Shashank Pandit, Duen Horng Chau, Samuel Wang, and Christos Faloutsosg,NetProbe: A Fast and Scalable System for Fraud Detection in Online AuctionFraud Detection in Online Auction Networks, WWW 2007.

Faloutsos, LeskovecECML/PKDD 2007

Part 3-102

127

Page 130: presented by September 21, 2007 Warsaw, Poland 2007... · 2007. 8. 10. · CMU SCS Mining Large Graphs:Mining Large Graphs: Laws and toolsLaws and tools Christos Faloutsos and Jure

CMU SCS

ReferencesReferences

Hanghang Tong, Christos Faloutsos, and Jia-Yu Pan Fast Random Walk with Restart and Its Applications ICDM 2006, Hong Kong. Hanghang Tong, Christos Faloutsos Center-g g gPiece Subgraphs: Problem Definition and Fast Solutions, KDD 2006, Philadelphia, PA

Faloutsos, LeskovecECML/PKDD 2007

Part 3-103

CMU SCS

ReferencesReferencesJ L k J Kl i b d Ch i tJure Leskovec, Jon Kleinberg and Christos Faloutsos Graphs over Time: Densification Laws Shrinking Diameters and PossibleLaws, Shrinking Diameters and Possible Explanations KDD 2005, Chicago, IL. ("Best Research Paper" award)Research Paper award). Jure Leskovec, Deepayan Chakrabarti, Jon Kleinberg Christos Faloutsos RealisticKleinberg, Christos Faloutsos Realistic, Mathematically Tractable Graph Generation and Evolution, Using Kronecker Multiplication, g p(ECML/PKDD 2005)

Faloutsos, LeskovecECML/PKDD 2007

Part 3-104

128

Page 131: presented by September 21, 2007 Warsaw, Poland 2007... · 2007. 8. 10. · CMU SCS Mining Large Graphs:Mining Large Graphs: Laws and toolsLaws and tools Christos Faloutsos and Jure

CMU SCS

ReferencesReferencesJi S D h T Ch i t F l tJimeng Sun, Dacheng Tao, Christos Faloutsos Beyond Streams and Graphs: Dynamic Tensor Analysis KDD 2006 Philadelphia PAAnalysis, KDD 2006, Philadelphia, PA Jimeng Sun, Yinglian Xie, Hui Zhang, Christos Faloutsos Less is More: Compact MatrixFaloutsos. Less is More: Compact Matrix Decomposition for Large Sparse Graphs, SDM, Minneapolis Minnesota Apr 2007Minneapolis, Minnesota, Apr 2007. Web Projections: Learning from Contextual Subgraphs of the Web by Jure LeskovecSubgraphs of the Web, by Jure Leskovec, Susan Dumais, Eric Horvitz, WWW 2007

Faloutsos, LeskovecECML/PKDD 2007

Part 3-105

129