dynamics of networks

Dynamics of networks

Jure LeskovecMachine Learning DepartmentCarnegie Mellon University

Networks: Rich data Today: Large on-line systems have detailed records of

human activity On-line communities:

▪ Facebook (64 million users, billion dollar business)▪ MySpace (300 million users)

Communication:▪ Instant Messenger (~1 billion users)

News and Social media:▪ Blogging (250 million blogs world-wide, presidential candidates run

blogs) On-line worlds:

▪ World of Warcraft (internal economy 1 billion USD)▪ Second Life (GDP of 700 million USD in ‘07)

Opportunities for impact in science and industry

The networks

b) Internet (AS)c) Social networksa) World wide web

d) Communication e) Citationsf) Protein interactions

Networks: What do we know? We know lots about the network structure:

Properties: Scale free [Barabasi ’99], 6-degrees of separation [Milgram ’67], Navigation [Adamic-Adar ’03, LibenNowell ’05], Bipartite cores [Kumar et al. ’99], Network motifs [Milo et al. ‘02], Communities [Newman ‘99], Conductance [Mihail-Papadimitriou-Saberi ‘06], Hubs and authorities [Page et al. ’98, Kleinberg ‘99]

Models: Preferential attachment [Barabasi ’99], Small-world [Watts-Strogatz ‘98], Copying model [Kleinberg el al. ’99], Heuristically optimized tradeoffs [Fabrikant et al. ‘02], Congestion [Mihail et al. ‘03], Searchability [Kleinberg ‘00], Bowtie [Broder et al. ‘00], Transit-stub [Zegura ‘97], Jellyfish [Tauro et al. ‘01]

We know much less about processes and

dynamics of networks

My research: Network dynamics

Network Dynamics: Network evolution

▪ How network structure changes as the network grows and evolves?

Diffusion and cascading behavior▪ How do rumors and diseases spread over networks?

My research: Scale matters We need massive network data for the

patterns to emerge: MSN Messenger network [WWW ’08, Nature ‘08]

(the largest social network ever analyzed)▪ 240M people, 255B messages, 4.5 TB data

Product recommendations [EC ‘06]

▪ 4M people, 16M recommendations Blogosphere [work in progress]

▪ 60M posts, 120M links

My research: The structure

Diffusion & Cascades

Network Evolution

Patterns & Observatio

Q1: What do cascades look like? Trees? Stars? Chains?

Q4: How does network structure change as the network grows/evolves?

Models & Algorithms

Q2: How do we find influential nodes?Q3: How do we quickly detect epidemics?

Q5: How do we generate realistic looking and evolving networks?

Network Evolution

Q1: What do cascades look like? Trees? Stars? Chains?

Models & Algorithms

Diffusion and Cascades Behavior that cascades from node to node like an

epidemic News, opinions, rumors Word-of-mouth in marketing Infectious diseases

As activations spread through the network they leave a trace – a cascade

Cascade (propagation graph)Network

Setting 1: Viral marketing

People send and receive product recommendations, purchase products

Data: Large online retailer: 4 million people, 16 million recommendations, 500k products

10% credit 10% off

[w/ Adamic-Huberman, EC ’06]

Setting 2: Blogosphere

Bloggers write posts and refer (link) to other posts and the information propagates

Data: 10.5 million posts, 16 million links

[w/ Glance-Hurst et al., SDM ’07]

Viral marketing cascades are more social: Collisions (no summarizers) Richer non-tree structures

Q1) What do cascades look like? Are they stars? Chains? Trees?

Information cascades (blogosphere):

Viral marketing (DVD recommendations):

(ordered by frequency)

[w/ Kleinberg-Singh, PAKDD ’06]

Network Evolution

A1: Cascade shapes. Viral marketing cascades more social.

Models & Algorithms

Q5:How do we generate realistic looking and evolving networks?

Cascade & outbreak detection Blogs – information epidemics

Which are the influential/infectious blogs?

Viral marketing Who are the trendsetters? Influential people?

Disease spreading Where to place monitoring stations to detect

epidemics?

The problem: Detecting cascades

How to quickly detect epidemics as they spread?

[w/ Krause-Guestrin et al., KDD ’07](best student paper)

Two parts to the problem Cost:

Cost of monitoring is node dependent

Reward: Minimize the number of affected

nodes:▪ If A are the monitored nodes, let R(A)

denote the number of nodes we save

We also consider other rewards: ▪ Minimize time to detection▪ Maximize number of detected outbreaks

Reward for detecting cascade i

Given: Graph G(V,E), budget M Data on how cascades C1, …, Ci, …,CK spread over time

Select a set of nodes A maximizing the reward

subject to cost(A) ≤ M

Solving the problem exactly is NP-hard Max-cover [Khuller et al. ’99]

Optimization problem

Detection: Solution outline

Problem structure:Submodularity of the reward functions

CELF:algorithm with approximate guarantee

Speed up:Lazy evaluation to speed-up CELF

We extend the result of [Kempe-Kleinberg-Tardos ’03]

Reward function are submodular

Theorem: Reward function R is submodular (diminishing returns, think of it as “convexity”)

(best student paper)[w/ Krause-Guestrin et al., KDD ’07]

Gain of adding a node to a small set

Gain of adding a node to a large set

R(A {u}) – R(A) ≥ R(B {u}) – R(B)

Placement A={S1, S2}

New monitored

Adding S’ helps a lotS2

Placement B={S1, S2, S3, S4}

Adding S’ helps very

little

Solution: CELF Algorithm

We develop CELF (cost-effective lazy forward-selection) algorithm: Two independent runs of a modified greedy

▪ Solution set A’: ignore cost, greedily optimize reward▪ Solution set A’’: greedily optimize reward/cost ratio

Pick best of the two: arg max(R(A’), R(A’’))

Theorem: If R is submodular then CELF is near optimal CELF achieves ½(1-1/e) factor approximation

For the size of our problems naïve CELF is too slow

(best student paper)[w/ Krause-Guestrin et al., KDD ’07]

Scaling up: Lazy evaluation Observation:

Submodularity guarantees that marginal rewards decrease with the solution size

Idea: Use marginal reward from previous step as a upper

bound on current marginal reward

Marginal reward

CELF: Lazy evaluation

CELF algorithm: Keep an ordered list of

marginal rewards ri from previous step

Re-evaluate ri only for the top node

Marginal reward

Blogs: Information epidemics

For more info see our website: www.blogcascade.org

Which blogs should one read to catch big stories?

In-links

Random

# postsOut-links

Number of selected blogs (sensors)

Reward

(higher is

better)

(used by Technorati)

CELF: Scalability

GreedyExhaustive search

CELF runs 700x faster than simple greedy algorithm

Number of selected blogs (sensors)

Run time

(seconds)

(lower is better)

Same problem: Water Network Given:

a real city water distribution network

data on how contaminants spread over time

Place sensors (to save lives)

Problem posed by the US Environmental Protection Agency

[w/ Krause et al., J. of Water Resource Planning]

Water network: Results

Our approach performed best at the Battle of Water Sensor Networks competition

PopulationRandom

Degree

Author Score

CMU (CELF) 26

Sandia 21U Exter 20Bentley systems 19Technion (1) 14Bordeaux 12U Cyprus 11U Guelph 7U Michigan 4Michigan Tech U 3Malcolm 2Proteo 2Technion (2) 1

Number of placed sensors

Population

saved(higher is better)

[w/ Ostfeld et al., J. of Water Resource Planning]

Network Evolution

Models & Algorithms

A2, A3: CELF algorithm for detecting cascades and outbreaks

Background: Network models

Empirical findings on real graphs led to new network models

Such models make assumptions/predictions about other network properties

What about network evolution?

log degree

b. Model

Power-law degree distribution

Preferential attachment

Explains

Networks are denser over time Densification Power Law:

a … densification exponent (1 ≤ a ≤ 2)

Q4) Network evolution What is the relation between the

number of nodes and the edges over time?

Prior work assumes: constant average degree over time

Internet

Citations

[w/ Kleinberg-Faloutsos, KDD ’05](best research paper)

Q4) Network evolution Prior models and intuition say

that the network diameter slowly grows (like log N, log log N)

size of the graph

Internet

Citations

Diameter shrinks over time as the network grows the

distances between the nodes slowly decrease

[w/ Kleinberg-Faloutsos, KDD ’05](best research paper)

Q5) Generating realistic graphs Want to generate realistic networks:

Why synthetic graphs? Anomaly detection, Simulations, Predictions, Null-model, Sharing privacy

sensitive graphs, …

Q: What is a good model that we can fit to the data? A: Next slide. Q: Which network properties do we care about? A: Don’t commit, let’s match adjacency matrices

Compare graphs properties, e.g., degree

distribution

Given a real network

Generate a synthetic network

The model: Kronecker graphs

We prove Kronecker graphs mimic real graphs: Power-law degree distribution, Densification,

Shrinking/stabilizing diameter, Spectral properties

Initiator

(9x9)(3x3)

(27x27)

Kronecker product of graph adjacency matrices

[w/ Chakrabarti-Kleinberg-Faloutsos, PKDD ’05]

Edge probability

Maximum likelihood estimation

Kronecker graphs: Estimation

Naïve estimation takes O(N!N2): N! for different node labelings:

▪ Our solution: Metropolis sampling: N! (big) const N2 for traversing graph adjacency matrix

▪ Our solution: Kronecker product (E << N2): N2 E Do stochastic gradient descent

a bc d

P( | ) Kroneckerarg max

We estimate the model in O(E)

[w/ Faloutsos, ICML ’07]

Estimation: Epinions (N=76k, E=510k)

We search the space of ~101,000,000

permutations Fitting takes 2 hours Real and Kronecker are very close

Degree distribution

node degree

abilit

0.99 0.54

0.49 0.13

Path lengths

number of hops

“Network” values

rankne

[w/ Faloutsos, ICML ’07]

Diffusion &

CascadesNetwork Evolution

Patterns &

Observations

A4: Densification,

Shrinking diameters

Models & Algorithm

algorithm for

detecting cascades and

outbreaks

A5: Kronecker graphs,

Estimating Kronecker initiator

dynamics of networks

network growsevolves

network picture3networks

largest social network

massive network data

network motifs milo

msn messenger network

evolving networks

links66my research

Documents

uncovering the temporal dynamics of diffusion networks

temporal dynamics of generalization in neural networks

dynamics of networks 2 synchrony & balanced colourings

opinion dynamics on social networks · opinion dynamics on...

structure and dynamics of multiplex networks · eu-fp7...

epidemic dynamics on networks

regulatory dynamics of synthetic gene networks with...

the dynamics of faculty hiring networks

the dynamics of personal networks

dynamics of electric power supply chain networks under...

dynamics of gene networks with time...

collective dynamics of ‘small world’ networks

dynamics on multiplex networks · multiplex dynamics...

dynamics of complex networks i: networks ii: percolation

dynamics of large networks

social networks and the dynamics of labor market outcomes...

nonlinear dynamics and symbolic dynamics of neural...

structure and dynamics of core/periphery networks ·...

dynamics of analog neural networks

dynamics of wood in stream networks of the western cascades...