Influence at Scale: Distributed Computation of Complex Contagion in Networks
Brendan Lucier, Joel Oren, Yaron Singer Microsoft Research, University of Toronto, Harvard University
Motivation
• Information requirements: how much of the graph do we need to examine? (query complexity)
• Efficient parallel computation: how to distribute the computation and maintain high-volume intermediate data.
The Framework
The influence model: the Independent Cascade Model [KKT’03]
• Input: edge-weighted graph $G = (V, E, p)$, initial seeds $S_0 \subseteq V$.
• At every synchronous step $t > 0$, every newly infected node $u \in S_t - S_{t-1}$ successfully infects each uninfected neighbor $v$ w.p. $p_{uv}$. $S_{t+1}$ is the set of infected nodes at step $t + 1$.
• The influence function: $f(S_0) = E[|S_t|]$, the expected number of infected nodes.
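A minimal single-machine sketch of the cascade process described above, assuming the graph is stored as a dict mapping each node to its out-neighbors and their probabilities. The names `simulate_cascade` and `estimate_influence` and the toy graph are illustrative, not from the talk, and plain Monte Carlo averaging stands in here for the estimation algorithm presented later.

```python
import random
from collections import deque


def simulate_cascade(graph, seeds, rng):
    """Run one independent-cascade sample and return the set of infected nodes.

    graph maps each node u to a dict {v: p_uv} of out-neighbors (assumed layout).
    """
    infected = set(seeds)
    frontier = deque(seeds)  # FIFO order processes nodes layer by layer
    while frontier:
        u = frontier.popleft()
        for v, p_uv in graph.get(u, {}).items():
            # Each newly infected node gets exactly one attempt per neighbor.
            if v not in infected and rng.random() < p_uv:
                infected.add(v)
                frontier.append(v)
    return infected


def estimate_influence(graph, seeds, num_samples=1000, seed=0):
    """Average the final cascade size over independent samples."""
    rng = random.Random(seed)
    total = sum(len(simulate_cascade(graph, seeds, rng)) for _ in range(num_samples))
    return total / num_samples


# Hypothetical toy graph; all probabilities set to 1.0 so the result is deterministic.
toy_graph = {"x": {"a": 1.0, "b": 1.0}, "b": {"c": 1.0, "d": 1.0}, "d": {"a": 1.0}}
print(estimate_influence(toy_graph, {"x"}, num_samples=10))  # 5.0
```

With probabilities strictly between 0 and 1 the estimate becomes noisy, which is what motivates asking how many samples and how many graph queries are really needed.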
The MRC parallel computation model [KSV’10]
• Inspired by the MapReduce paradigm.
• Synchronous round computation on $N$ tuples of the form $\langle k; v \rangle$. On every round:
1. Map: apply a local transformation to tuples in a streaming fashion.
2. Reduce: polytime computation on aggregates of tuples with the same key.
• Constraints: $N^{1-\epsilon}$ machines and $N^{1-\epsilon}$ space per machine; $\log^c N$ rounds.
• Graph access model: Link server.
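The round structure above can be sketched in a single process as follows. This is only an illustration of the map, group-by-key, reduce pattern; it does not model the machine or space constraints, and `run_round`, `mapper`, and `reducer` are hypothetical names rather than part of any framework.

```python
from collections import defaultdict


def run_round(tuples, mapper, reducer):
    """Simulate one synchronous round over (key, value) tuples."""
    groups = defaultdict(list)
    # Map: transform each tuple independently, in streaming fashion.
    for key, value in tuples:
        for out_key, out_value in mapper(key, value):
            groups[out_key].append(out_value)
    # Reduce: process all values that share a key together.
    output = []
    for key, values in groups.items():
        output.extend(reducer(key, values))
    return output


# Example: compute out-degrees from edge tuples (u, v).
def emit_one(u, v):
    yield (u, 1)


def sum_values(u, ones):
    yield (u, sum(ones))


edges = [("x", "a"), ("x", "b"), ("b", "c")]
print(dict(run_round(edges, emit_one, sum_values)))  # {'x': 2, 'b': 1}
```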
The Information Requirements of Influence Estimation
• Question: how much of the graph do we need to know in order to estimate the influence?
• Concretely: what is the query complexity of approximating $f(S_0)$?
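Theorem: Let $\epsilon \in (0, \frac{1}{3})$, $\delta \in [0, \frac{1}{2}]$. For large enough $n = |V|$, $\exists\, G = (V, E, p)$ for which getting an estimate $\hat{f}(S_0)$ of $f(S_0)$ s.t. $(1 - \epsilon) f(S_0) \le \hat{f}(S_0) \le f(S_0)$ with confidence $\ge 1 - \delta$ requires $\Omega((1 - 2\delta)\, n)$ link server queries.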
Main Algorithm: Intermediate Layer; Estimating the Expectation
Main intuitions:
1. High/low influence: few/many samples needed.
2. Can compute the Riemann integral for the CDF: $f(S_0) = \int_0^\infty (1 - F(I))\, dI$.
Lemma (informal): w.h.p., VerifyGuess returns True if $\tau >$ the influence.
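The integral identity in intuition 2 has a discrete analogue for integer-valued cascade sizes: $E[I] = \sum_{t \ge 1} \Pr[I \ge t]$. The sketch below checks that identity on a handful of made-up sample sizes; `mean_from_tail_probabilities` is a hypothetical helper, and this is not the VerifyGuess procedure itself.

```python
def mean_from_tail_probabilities(samples, max_value):
    """Recover the sample mean by summing empirical tail probabilities Pr[I >= t]."""
    n = len(samples)
    total = 0.0
    for t in range(1, max_value + 1):
        reached = sum(1 for s in samples if s >= t)  # samples that reached bound t
        total += reached / n
    return total


# Made-up cascade sizes, for illustration only.
sizes = [1, 3, 3, 5]
print(mean_from_tail_probabilities(sizes, max_value=5))  # 3.0
print(sum(sizes) / len(sizes))                           # 3.0
```

Each term of the sum is a "fraction of samples that reached a bound", so the expectation can be assembled from threshold queries alone.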
Main Algorithm: Top Layer; Iterating on Candidate Values
Theorem: For any $\epsilon \in (0, \frac{1}{4})$, InfEst provides a $(1 + 8\epsilon)$-approx. w.h.p.
Algorithm: Bottom Layer; Bounded Samples
Highlights:
Input: seeds $S_0 \subseteq V$, link-server oracle $Q(\cdot)$, number of samples $L$, bound $t$.
1. Simultaneous prob-BFS in MapReduce, one layer per round.
2. Finalize a sample when at least $t$ nodes are infected.
3. Return the fraction of samples that reached the bound.
Theorem (informal):
1) If $L$ is suff. small, # of tuples is $\Omega(m^{1.5} \log^4 m)$.
2) If $\mathrm{diam}(G) = \log^c n$, the algorithm takes polylog rounds.
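The three highlighted steps can be sketched sequentially as below. This is a single-process illustration under stated assumptions: samples run one after another instead of simultaneously in MapReduce, and the link server is modeled as a plain function returning `(neighbor, probability)` pairs. The names `make_link_server` and `bounded_samples` are hypothetical.

```python
import random


def make_link_server(graph):
    """Wrap a dict-of-dicts graph as an oracle Q(u) -> list of (v, p_uv)."""
    return lambda u: list(graph.get(u, {}).items())


def bounded_samples(seeds, query, num_samples, bound, seed=0):
    """Return the fraction of cascade samples that infect at least `bound` nodes."""
    rng = random.Random(seed)
    reached = 0
    for _ in range(num_samples):
        infected = set(seeds)
        layer = list(seeds)
        # Probabilistic BFS, one layer at a time; stop early once the bound is hit.
        while layer and len(infected) < bound:
            next_layer = []
            for u in layer:
                for v, p_uv in query(u):
                    if v not in infected and rng.random() < p_uv:
                        infected.add(v)
                        next_layer.append(v)
            layer = next_layer
        if len(infected) >= bound:
            reached += 1
    return reached / num_samples


# Hypothetical path graph a -> b -> c with certain transmission.
Q = make_link_server({"a": {"b": 1.0}, "b": {"c": 1.0}})
print(bounded_samples({"a"}, Q, num_samples=20, bound=3))  # 1.0
print(bounded_samples({"a"}, Q, num_samples=20, bound=4))  # 0.0
```

Stopping at the bound is what keeps the number of link-server queries small for high-influence seed sets: a sample never explores more than roughly one layer past `bound` infected nodes.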
Experimental Results
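[Figure labels: example graph $G$ with nodes $x, a, b, c, d$, seed set $S_0$, and edge probabilities $p_{xa}, p_{xb}, p_{bc}, p_{bd}, p_{da}$; link server (L-S) answering a query $Q(u)$ with $N^+(u)$ and $p$; labels 1 to 7.]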
[Plots: Running time; Approx. ratios (Epinions); Heuristics (Epinions)]
Scalable estimation of the influence of individuals in massive social networks