![Page 1: Manuel Gomez Rodriguez 1,2 Jure Leskovec 1 Andreas Krause 3](https://reader036.vdocuments.us/reader036/viewer/2022062309/56814c0f550346895db90cff/html5/thumbnails/1.jpg)
Inferring Networks of Diffusion and Influence
Manuel Gomez Rodriguez (1, 2), Jure Leskovec (1), Andreas Krause (3)
1 Stanford University
2 MPI for Biological Cybernetics
3 California Institute of Technology
![Page 2: Manuel Gomez Rodriguez 1,2 Jure Leskovec 1 Andreas Krause 3](https://reader036.vdocuments.us/reader036/viewer/2022062309/56814c0f550346895db90cff/html5/thumbnails/2.jpg)
Networks and Processes
It is often hard to observe a social or information network directly:
  Hidden or hard-to-reach populations: injection drug users
  Implicit connections: the network of information sharing in online media
It is often easier to observe the results of the processes taking place on such (invisible) networks:
  Virus propagation: people get sick and see the doctor
  Information networks: blogs mention information
![Page 3: Manuel Gomez Rodriguez 1,2 Jure Leskovec 1 Andreas Krause 3](https://reader036.vdocuments.us/reader036/viewer/2022062309/56814c0f550346895db90cff/html5/thumbnails/3.jpg)
Information Diffusion Network
Information diffuses through the network
We only see who mentions but not where they got the information from
Can we reconstruct the (hidden) diffusion network?
![Page 4: Manuel Gomez Rodriguez 1,2 Jure Leskovec 1 Andreas Krause 3](https://reader036.vdocuments.us/reader036/viewer/2022062309/56814c0f550346895db90cff/html5/thumbnails/4.jpg)
More Examples
Virus propagation
  Process: viruses propagate through the network
  We observe: when people get sick
  It's hidden: who infected them
Word of mouth & viral marketing
  Process: recommendations and influence propagate
  We observe: when people buy products
  It's hidden: who influenced them
Can we infer the underlying social or information network?
![Page 5: Manuel Gomez Rodriguez 1,2 Jure Leskovec 1 Andreas Krause 3](https://reader036.vdocuments.us/reader036/viewer/2022062309/56814c0f550346895db90cff/html5/thumbnails/5.jpg)
Inferring the Network
There is a hidden directed network.
We only see the times when nodes get infected:
  Contagion c1: (a, 1), (c, 2), (b, 3), (e, 4)
  Contagion c2: (c, 1), (a, 4), (b, 5), (d, 6)
We want to infer the who-infects-whom network
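One way to hold this input in code is a mapping from node to infection time, one per contagion. The sketch below is only an illustration: the names `cascades` and `candidate_edges` are made up here, and it assumes that an edge (u, v) can explain an infection only if u was infected strictly before v.

```python
from itertools import combinations

# Infection times from the slide: node -> time, one dict per contagion.
cascades = {
    "c1": {"a": 1, "c": 2, "b": 3, "e": 4},
    "c2": {"c": 1, "a": 4, "b": 5, "d": 6},
}

def candidate_edges(times):
    """Return every ordered pair (u, v) with t_u < t_v in one cascade."""
    edges = set()
    for u, v in combinations(times, 2):
        if times[u] < times[v]:
            edges.add((u, v))
        elif times[v] < times[u]:
            edges.add((v, u))
    return edges

# Union over all contagions: the edges the data could possibly support.
all_candidates = set()
for times in cascades.values():
    all_candidates |= candidate_edges(times)
```

Note that the two contagions disagree on the order of a and c, so both (a, c) and (c, a) remain candidates.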
![Page 6: Manuel Gomez Rodriguez 1,2 Jure Leskovec 1 Andreas Krause 3](https://reader036.vdocuments.us/reader036/viewer/2022062309/56814c0f550346895db90cff/html5/thumbnails/6.jpg)
Our Problem Formulation
Plan for the talk:
1. Define a continuous-time model of diffusion
2. Define the likelihood of the observed propagation data given a graph
3. Show how to efficiently compute the likelihood
4. Show how to efficiently optimize the likelihood
   Find a graph G that maximizes the likelihood
   There is a super-exponential number of graphs G
   Our method finds a near-optimal solution in O(N^2)!
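To get a feel for "super-exponential": N labeled nodes admit N(N-1) possible directed edges (ignoring self-loops), and each can be present or absent. A quick sketch, with an illustrative function name:

```python
def num_directed_graphs(n):
    """Number of directed graphs without self-loops on n labeled nodes."""
    return 2 ** (n * (n - 1))

for n in (3, 5, 10):
    print(n, num_directed_graphs(n))
```

Already at 10 nodes the count is 2^90, which is why exhaustive search over graphs is not an option.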
![Page 7: Manuel Gomez Rodriguez 1,2 Jure Leskovec 1 Andreas Krause 3](https://reader036.vdocuments.us/reader036/viewer/2022062309/56814c0f550346895db90cff/html5/thumbnails/7.jpg)
Cascade Generation Model
The cascade reaches node u at time t_u and spreads to u's neighbors v:
  with probability β the cascade propagates along edge (u, v), and
  t_v = t_u + Δ, with Δ ~ f(·)
We assume each node v has only one parent!
[Figure: example cascade over nodes a to f, with infection times t_a, t_b, t_c, t_e, t_f and delays Δ1 to Δ4]
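The generation model can be simulated in a few lines. This is a minimal sketch under stated assumptions, not the authors' code: the delay distribution f is taken to be exponential with a hypothetical `mean_delay` parameter, each edge is tried once with probability β, and a node keeps the earliest infection that reaches it (so it has exactly one parent).

```python
import heapq
import random

def simulate_cascade(graph, source, beta, mean_delay, seed=0):
    """graph: dict node -> list of out-neighbors.
    Returns (times, parent): infection time and infecting parent per reached node."""
    rng = random.Random(seed)
    # Decide once per edge whether it transmits, and with which delay.
    delay = {}
    for u, neighbors in graph.items():
        for v in neighbors:
            if rng.random() < beta:
                delay[(u, v)] = rng.expovariate(1.0 / mean_delay)
    times, parent = {}, {}
    heap = [(0.0, source, None)]
    while heap:
        t, v, p = heapq.heappop(heap)
        if v in times:          # already infected earlier by someone else
            continue
        times[v], parent[v] = t, p
        for x in graph.get(v, []):
            if (v, x) in delay and x not in times:
                heapq.heappush(heap, (t + delay[(v, x)], x, v))
    return times, parent

graph = {"a": ["b", "c"], "b": ["c"], "c": []}
times, parent = simulate_cascade(graph, "a", beta=0.5, mean_delay=1.0, seed=1)
```

Only `times` would be visible to the inference method; `parent` is the hidden information it tries to recover.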
![Page 8: Manuel Gomez Rodriguez 1,2 Jure Leskovec 1 Andreas Krause 3](https://reader036.vdocuments.us/reader036/viewer/2022062309/56814c0f550346895db90cff/html5/thumbnails/8.jpg)
Likelihood of a Cascade
If u infected v in a cascade c, its transmission probability is:
  P_c(u, v) ∝ f(t_v − t_u), with t_v > t_u and (u, v) neighbors
Probability that cascade c propagates in a tree T:
  P(c | T) = ∏_{(u, v) ∈ T} P_c(u, v)
To model that, in reality, any node v in a cascade can have been infected by an external influence m:
  P_c(m, v) = ε
[Figure: tree pattern T on cascade c: (a, 1), (b, 2), (c, 4), (e, 8), with an external node m linked to the cascade nodes with probability ε]
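As a worked example, the sketch below scores two candidate trees on the cascade (a, 1), (b, 2), (c, 4), (e, 8). It assumes that the probability of a tree is the product of its edge transmission probabilities and that f is an exponential with a hypothetical scale `alpha`; the two trees (`chain`, `star`) are illustrative and are not taken from the slide's figure.

```python
import math

def edge_prob(t_u, t_v, alpha=1.0):
    """Unnormalized P_c(u, v) under an assumed exponential delay model."""
    if t_v <= t_u:
        return 0.0              # v cannot be infected by a later node
    return math.exp(-(t_v - t_u) / alpha)

def tree_likelihood(times, tree_edges, alpha=1.0):
    """Product of edge transmission probabilities over the tree."""
    p = 1.0
    for u, v in tree_edges:
        p *= edge_prob(times[u], times[v], alpha)
    return p

times = {"a": 1, "b": 2, "c": 4, "e": 8}
chain = [("a", "b"), ("b", "c"), ("c", "e")]   # each node infects the next
star = [("a", "b"), ("a", "c"), ("a", "e")]    # a infects everyone
print(tree_likelihood(times, chain), tree_likelihood(times, star))
```

Under this delay model the chain is the more likely explanation, because its edges span shorter time gaps.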
![Page 9: Manuel Gomez Rodriguez 1,2 Jure Leskovec 1 Andreas Krause 3](https://reader036.vdocuments.us/reader036/viewer/2022062309/56814c0f550346895db90cff/html5/thumbnails/9.jpg)
Finding the Diffusion Network
There are many possible propagation trees for a cascade, e.g. c: (a, 1), (c, 2), (b, 3), (e, 4)
Need to consider all possible propagation trees T supported by G:
  P(c | G) = Σ_T P(c | T)
Likelihood of a set of cascades C on G:
  P(C | G) = ∏_{c ∈ C} P(c | G)
Good news
  Computing P(c | G) is tractable: for each c, consider all O(n^n) possible transmission trees of G.
  The Matrix Tree Theorem can compute this sum in O(n^3)!
Bad news
  We actually want to find:
    Ĝ = argmax_G P(C | G)
  We have a super-exponential number of graphs!
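The sum over trees can be checked on a small case. The sketch below is an illustration, not the authors' implementation: it assumes edges only point forward in time (so every choice of one earlier parent per node is a valid tree), uses made-up integer weights in place of transmission probabilities, and compares brute-force enumeration with the determinant from the directed Matrix Tree Theorem (Laplacian of in-weights with the root's row and column removed).

```python
from fractions import Fraction
from itertools import product

def det(matrix):
    """Determinant by Gaussian elimination over exact fractions."""
    m = [[Fraction(x) for x in row] for row in matrix]
    n = len(m)
    result = Fraction(1)
    for i in range(n):
        pivot = next((r for r in range(i, n) if m[r][i] != 0), None)
        if pivot is None:
            return Fraction(0)
        if pivot != i:
            m[i], m[pivot] = m[pivot], m[i]
            result = -result
        result *= m[i][i]
        for r in range(i + 1, n):
            factor = m[r][i] / m[i][i]
            for c in range(i, n):
                m[r][c] -= factor * m[i][c]
    return result

def tree_sum_by_determinant(nodes, root, weight):
    """Sum over spanning trees rooted at `root` of the product of edge weights."""
    others = [v for v in nodes if v != root]
    lap = []
    for v in others:
        row = []
        for x in others:
            if v == x:   # diagonal: total weight of edges entering x
                row.append(sum(weight.get((u, x), 0) for u in nodes if u != x))
            else:        # off-diagonal: minus the weight of edge v -> x
                row.append(-weight.get((v, x), 0))
        lap.append(row)
    return det(lap)

def tree_sum_by_enumeration(nodes, root, weight):
    """Same sum by trying every parent assignment (valid for forward-in-time edges)."""
    others = [v for v in nodes if v != root]
    total = 0
    for parents in product(nodes, repeat=len(others)):
        term = 1
        for v, u in zip(others, parents):
            term *= weight.get((u, v), 0)
        total += term
    return total

nodes = ["a", "c", "b", "e"]   # ordered by infection time, root "a"
weight = {("a", "c"): 2, ("a", "b"): 1, ("c", "b"): 3,
          ("a", "e"): 1, ("c", "e"): 2, ("b", "e"): 4}
print(tree_sum_by_determinant(nodes, "a", weight),
      tree_sum_by_enumeration(nodes, "a", weight))
```

Because the edges form a DAG, the reduced Laplacian is triangular here and the determinant is just the product of each node's total in-weight, 2 · 4 · 7.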
![Page 10: Manuel Gomez Rodriguez 1,2 Jure Leskovec 1 Andreas Krause 3](https://reader036.vdocuments.us/reader036/viewer/2022062309/56814c0f550346895db90cff/html5/thumbnails/10.jpg)
An Alternative Formulation
We consider only the most likely tree.
Maximum log-likelihood for a cascade c under a graph G:
  F_c(G) = max_T log P(c | T)
Log-likelihood of G given a set of cascades C:
  F_C(G) = Σ_{c ∈ C} F_c(G)
The problem is NP-hard (reduction from MAX-k-COVER)
Our algorithm can do it near-optimally in O(N^2)
![Page 11: Manuel Gomez Rodriguez 1,2 Jure Leskovec 1 Andreas Krause 3](https://reader036.vdocuments.us/reader036/viewer/2022062309/56814c0f550346895db90cff/html5/thumbnails/11.jpg)
Max Directed Spanning Tree
Given a cascade c, what is the most likely propagation tree?
  T̂ = argmax_T Σ_{(u, v) ∈ T} w(u, v), where w(u, v) = log P_c(u, v)
A maximum directed spanning tree (MDST):
  Just need to compute the MDST of the subgraph of G induced by c (i.e., a DAG)
  For each node, just pick an in-edge of maximum weight
The subgraph of G induced by c doesn't have loops (it is a DAG), so local greedy selection gives the optimal tree!
[Figure: DAG on nodes a, b, c, d with edge weights 1 to 6]
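The per-node selection can be sketched as follows. The edge weights in the example are hypothetical (the assignment of the figure's weights to edges is not recoverable from the transcript), and `most_likely_tree` is an illustrative name.

```python
def most_likely_tree(times, weight):
    """For each node, pick the max-weight in-edge from an earlier-infected node.
    times: node -> infection time; weight: (u, v) -> edge weight.
    Returns node -> chosen parent (None for nodes with no usable in-edge)."""
    parent = {}
    for v in times:
        best = None
        for u in times:
            if times[u] < times[v] and (u, v) in weight:
                if best is None or weight[(u, v)] > weight[(best, v)]:
                    best = u
        parent[v] = best
    return parent

times = {"a": 1, "b": 2, "c": 3, "d": 4}
weight = {("a", "b"): 3, ("a", "c"): 5, ("b", "c"): 4,
          ("a", "d"): 2, ("b", "d"): 6, ("c", "d"): 1}
parent = most_likely_tree(times, weight)
total = sum(weight[(u, v)] for v, u in parent.items() if u is not None)
```

Since every chosen parent is strictly earlier in time, the selected edges cannot form a cycle, which is why independent per-node choices already yield a tree.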
![Page 12: Manuel Gomez Rodriguez 1,2 Jure Leskovec 1 Andreas Krause 3](https://reader036.vdocuments.us/reader036/viewer/2022062309/56814c0f550346895db90cff/html5/thumbnails/12.jpg)
Great News: Submodularity
Theorem: The log-likelihood F_c(G) is monotonic and submodular in the edges of the graph G.
For A ⊆ B ⊆ V × V and an edge e:
  F_c(A ∪ {e}) − F_c(A) ≥ F_c(B ∪ {e}) − F_c(B)
  (gain of adding an edge to a "small" graph ≥ gain of adding it to a "large" graph)
Proof:
  Consider a single cascade c and an edge e into node s with weight x
  Let w be the max weight in-edge of s in A
  Let w' be the max weight in-edge of s in B
  We know: w ≤ w'
  Now: F_c(A ∪ {e}) − F_c(A) = max(w, x) − w ≥ max(w', x) − w' = F_c(B ∪ {e}) − F_c(B)
Then the log-likelihood F_C(G), being a sum of the F_c over cascades, is monotonic and submodular too
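The key inequality in the proof is easy to check numerically. The sketch below tests max(w, x) − w ≥ max(w', x) − w' for every combination with w ≤ w' over a small, arbitrary set of weights; `gain` is an illustrative helper.

```python
from itertools import product

def gain(best_in_weight, x):
    """Improvement for a node whose best in-edge so far has weight
    `best_in_weight` when a new in-edge of weight x becomes available."""
    return max(best_in_weight, x) - best_in_weight

values = [0, 1, 2, 5, 8]
holds = all(gain(w, x) >= gain(w_prime, x)
            for w, w_prime, x in product(values, repeat=3)
            if w <= w_prime)
print(holds)
```

The gain equals max(0, x − w), which can only shrink as w grows; that is the diminishing-returns property in one line.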
![Page 13: Manuel Gomez Rodriguez 1,2 Jure Leskovec 1 Andreas Krause 3](https://reader036.vdocuments.us/reader036/viewer/2022062309/56814c0f550346895db90cff/html5/thumbnails/13.jpg)
Finding the Diffusion Graph
Use greedy hill-climbing to maximize F_C(G):
  At every step, pick the edge that maximizes the marginal improvement
  After each pick, apply a localized update to the marginal gains
Initial marginal gains in the example (nodes a to e):
  b→a: 12, c→a: 3, d→a: 6
  a→b: 20, c→b: 18, d→b: 4, e→b: 5
  a→c: 15, b→c: 8
  b→d: 16, c→d: 8, e→d: 10
  b→e: 7, d→e: 13
Benefits
1. Approximation guarantee (1 − 1/e ≈ 0.63 of OPT)
2. Tight on-line bounds on the solution quality
3. Speed-ups:
   Lazy evaluation (by submodularity)
   Localized update (by the structure of the problem)
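A minimal sketch of greedy selection with lazy evaluation follows. It is not the authors' implementation: it assumes each cascade supplies nonnegative edge weights (gains over the external-influence baseline), that a node's contribution in a cascade is the weight of its best selected in-edge, and it does not implement the localized-update speed-up. The name `greedy_edge_selection` and the example weights are hypothetical.

```python
import heapq

def greedy_edge_selection(cascade_weights, k):
    """cascade_weights: list of dicts (u, v) -> nonnegative weight, one per cascade.
    Greedily picks up to k edges maximizing the summed best in-edge weights."""
    # best[c][v]: weight of the best selected in-edge of v in cascade c
    best = [dict() for _ in cascade_weights]

    def marginal_gain(edge):
        v = edge[1]
        return sum(max(0, w[edge] - best[c].get(v, 0))
                   for c, w in enumerate(cascade_weights) if edge in w)

    candidates = set().union(*cascade_weights)
    # Max-heap via negated gains; stored gains may become stale (too high).
    heap = [(-marginal_gain(e), e) for e in candidates]
    heapq.heapify(heap)
    chosen = []
    while heap and len(chosen) < k:
        _, edge = heapq.heappop(heap)
        fresh = marginal_gain(edge)
        # By submodularity stale gains are upper bounds, so if the fresh gain
        # still beats the next stored gain, no other edge can do better.
        if heap and fresh < -heap[0][0]:
            heapq.heappush(heap, (-fresh, edge))
            continue
        if fresh <= 0:
            break
        chosen.append(edge)
        for c, w in enumerate(cascade_weights):
            if edge in w:
                v = edge[1]
                best[c][v] = max(best[c].get(v, 0), w[edge])
    return chosen

cascade_weights = [
    {("a", "b"): 5, ("a", "c"): 2, ("b", "c"): 4},
    {("a", "b"): 3, ("c", "b"): 1, ("a", "c"): 2},
]
print(greedy_edge_selection(cascade_weights, 2))
```

Lazy evaluation pays off because most stored gains never need recomputing: only the edge at the top of the heap is re-evaluated at each step.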
![Page 14: Manuel Gomez Rodriguez 1,2 Jure Leskovec 1 Andreas Krause 3](https://reader036.vdocuments.us/reader036/viewer/2022062309/56814c0f550346895db90cff/html5/thumbnails/14.jpg)
Experimental Setup
We validate our method on:
  Synthetic data: generate a graph G on k edges, generate cascades, record node infection times, reconstruct G
  Real data: MemeTracker, 172m news articles, Aug '08 to Sept '09, 343m textual phrases (quotes)
How many edges of G can we find? (precision-recall, break-even point)
How many cascades do we need?
How fast are we?
How well do we optimize F_C(G)?
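The edge-recovery metrics can be sketched as below, assuming the method returns edges ranked by the order in which they were selected. The break-even point is taken here as the precision (equal to recall) when as many edges are inferred as the true graph contains; function names and the toy data are illustrative.

```python
def precision_recall(true_edges, inferred_edges):
    """Precision and recall of an inferred edge set against the true edges."""
    true_edges, inferred_edges = set(true_edges), set(inferred_edges)
    hits = len(true_edges & inferred_edges)
    precision = hits / len(inferred_edges) if inferred_edges else 0.0
    recall = hits / len(true_edges) if true_edges else 0.0
    return precision, recall

def break_even_point(true_edges, ranked_edges):
    """Precision when the number of inferred edges equals the number of true edges."""
    k = len(set(true_edges))
    return precision_recall(true_edges, ranked_edges[:k])[0]

true_edges = {("a", "b"), ("b", "c"), ("c", "d")}
ranked = [("a", "b"), ("a", "c"), ("b", "c"), ("c", "d")]
print(precision_recall(true_edges, ranked), break_even_point(true_edges, ranked))
```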
![Page 15: Manuel Gomez Rodriguez 1,2 Jure Leskovec 1 Andreas Krause 3](https://reader036.vdocuments.us/reader036/viewer/2022062309/56814c0f550346895db90cff/html5/thumbnails/15.jpg)
Small Synthetic Example
Small synthetic network:
[Figure panels: True network, Baseline network, Our method]
Pick k strongest edges:
![Page 16: Manuel Gomez Rodriguez 1,2 Jure Leskovec 1 Andreas Krause 3](https://reader036.vdocuments.us/reader036/viewer/2022062309/56814c0f550346895db90cff/html5/thumbnails/16.jpg)
Synthetic Networks
Our performance does not depend on the network structure:
  Synthetic networks: Forest Fire, Kronecker, etc.
  Probability of transmission: exponential, power law
Break-even points of > 90% when the baseline gets 30-50%!
[Figure panels: 1024-node hierarchical Kronecker, exponential transmission model; 1000-node Forest Fire (α = 1.1) power-law transmission model]
![Page 17: Manuel Gomez Rodriguez 1,2 Jure Leskovec 1 Andreas Krause 3](https://reader036.vdocuments.us/reader036/viewer/2022062309/56814c0f550346895db90cff/html5/thumbnails/17.jpg)
How good is our graph?
We achieve ≈ 90 % of the best possible network!
![Page 18: Manuel Gomez Rodriguez 1,2 Jure Leskovec 1 Andreas Krause 3](https://reader036.vdocuments.us/reader036/viewer/2022062309/56814c0f550346895db90cff/html5/thumbnails/18.jpg)
How many cascades do we need?
With 2x as many infections as edges, the break-even point is already 0.8 - 0.9!
![Page 19: Manuel Gomez Rodriguez 1,2 Jure Leskovec 1 Andreas Krause 3](https://reader036.vdocuments.us/reader036/viewer/2022062309/56814c0f550346895db90cff/html5/thumbnails/19.jpg)
Running Time
Lazy evaluation and localized updates give a speed-up of 2 orders of magnitude!
![Page 20: Manuel Gomez Rodriguez 1,2 Jure Leskovec 1 Andreas Krause 3](https://reader036.vdocuments.us/reader036/viewer/2022062309/56814c0f550346895db90cff/html5/thumbnails/20.jpg)
Real Data
MemeTracker dataset (http://memetracker.org):
  172m news articles, Aug '08 to Sept '09
  343m textual phrases (quotes)
  Times t_c(w) when site w mentions phrase (quote) c
Given the times when sites mention phrases, we infer the network of information diffusion:
  Who tends to copy (repeat after) whom
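Turning mention records into cascades could look like the sketch below. The record layout (site, phrase, time) and the choice to keep only each site's first mention of a phrase are assumptions made for illustration; they are not a description of the MemeTracker file format.

```python
from collections import defaultdict

def build_cascades(mentions):
    """mentions: iterable of (site, phrase, time) records.
    Returns phrase -> {site: time of first mention}, i.e. the times t_c(w)."""
    cascades = defaultdict(dict)
    for site, phrase, t in mentions:
        first = cascades[phrase].get(site)
        if first is None or t < first:
            cascades[phrase][site] = t
    return dict(cascades)

# Hypothetical records, not real MemeTracker data.
mentions = [
    ("siteA", "quote-1", 10),
    ("siteB", "quote-1", 12),
    ("siteA", "quote-1", 15),   # repeat mention, ignored
    ("siteB", "quote-2", 3),
    ("siteC", "quote-2", 7),
]
cascades = build_cascades(mentions)
```

Each resulting dictionary has the same node-to-time shape as the contagions used in the synthetic setting.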
![Page 21: Manuel Gomez Rodriguez 1,2 Jure Leskovec 1 Andreas Krause 3](https://reader036.vdocuments.us/reader036/viewer/2022062309/56814c0f550346895db90cff/html5/thumbnails/21.jpg)
Real Network
We use the hyperlinks in the MemeTracker dataset to generate the edges of a ground-truth G
From the MemeTracker dataset, we have the timestamps of:
1. cascades of hyperlinks: sites link to other sites
2. cascades of (MemeTracker) quotes: sites copy quotes from other sites
Can we infer the hyperlink network from...
  ...cascades of hyperlinks?
  ...cascades of MemeTracker quotes?
Are they correlated?
![Page 22: Manuel Gomez Rodriguez 1,2 Jure Leskovec 1 Andreas Krause 3](https://reader036.vdocuments.us/reader036/viewer/2022062309/56814c0f550346895db90cff/html5/thumbnails/22.jpg)
Real Network
[Figure panels: 500-node hyperlink network using hyperlink cascades; 500-node hyperlink network using MemeTracker cascades]
Break-even points of 50% for hyperlink cascades and 30% for MemeTracker cascades!
![Page 23: Manuel Gomez Rodriguez 1,2 Jure Leskovec 1 Andreas Krause 3](https://reader036.vdocuments.us/reader036/viewer/2022062309/56814c0f550346895db90cff/html5/thumbnails/23.jpg)
Diffusion Network
5,000 news sites:
[Figure legend: blogs, mainstream media]
![Page 24: Manuel Gomez Rodriguez 1,2 Jure Leskovec 1 Andreas Krause 3](https://reader036.vdocuments.us/reader036/viewer/2022062309/56814c0f550346895db90cff/html5/thumbnails/24.jpg)
Diffusion Network (small part)
[Figure legend: blogs, mainstream media]
![Page 25: Manuel Gomez Rodriguez 1,2 Jure Leskovec 1 Andreas Krause 3](https://reader036.vdocuments.us/reader036/viewer/2022062309/56814c0f550346895db90cff/html5/thumbnails/25.jpg)
Networks and Processes
We infer hidden networks based on diffusion data (timestamps)
Problem formulation in a maximum likelihood framework:
  NP-hard problem to solve exactly
  We develop an approximation algorithm that:
    is efficient: it runs in O(N^2)
    is invariant to the structure of the underlying network
    gives a sub-optimal network with a tight bound on the solution quality
Future work:
  Learn both the network and the diffusion model
  Extensions to other processes taking place on networks
  Applications to other domains: biology, neuroscience, etc.
![Page 26: Manuel Gomez Rodriguez 1,2 Jure Leskovec 1 Andreas Krause 3](https://reader036.vdocuments.us/reader036/viewer/2022062309/56814c0f550346895db90cff/html5/thumbnails/26.jpg)
Thanks!
For more (code & data): http://snap.stanford.edu/netinf