influence at scale: distributed computation of complex

Influence at Scale: Distributed Computation of Complex Contagion in Networks Brendan Lucier, Joel Oren, Yaron Singer Microsoft Research, University of Toronto, Harvard University Motivation • Information requirements: how much of the graph do we need to examine? (query complexity). • Efficient parallel computation: how to distribute computation; maintain high-volume intermediate data. The Framework The influence model: The Independent Cascade Model [KKT’03] • Input: edge-weighted graph = (, ,p), initial seeds 0 ⊆ . • At every synchronous step >0, every node previously infected node ∈ − −1 successfully infects an uninfected neighbor w.p. . +1 -- set of infected nodes at step +1. • The influence function: 0 = [ ] The MRC parallel computation model [KSV’10] • Inspired by the MapReduce paradigm. • Synchronous round computation on ; on tuples. On every round: 1. Map: apply local transformation on tuples in a streaming fashion. 2. Reduce: polytime computation on aggregates of tuples with same key. • Constraintaints: 1− machines and space per machine; log rounds. • Graph access model: Link server. The Information Requirements of Influence Estimation • Question: how much of the graph to we need to know in order to estimate the influence? • Concretely: what is the query complexity of approximating ( 0 )? Theorem: Let ∈ 0, 1 3 , ∈ [0, 1 2 ]. For large enough = ||, ∃ = (, , ) for which getting an estimate ( 0 ) of ( 0 ) s.t. 1− 0 ≤ 0 ≤ ( 0 ) with confidence ≥1− requires Ω((1 − 2) link server queries. Main algorithm: Intermediate Layer; Estimating the Expectation Main intuitions: 1. High/low influence few/many samples needed, 2. Can compute the Riemann integral for CDF ( 0 = 0 ∞ 1 − () ) Lemma (informal):w.h.p., VerifyGuess returns True if > . Main Algorithm: Top Layer; Iterating on Candidate Values Theorem: For any ∈ (0, 1 4 ), InfEst provides a (1 + 8)-approx. w.h.p. Algorithm: Bottom Layer; Bounded Samples Highlights: Input: Seeds 0 ⊆ , link-server oracle (⋅), - # samples, bound . 1. Simultaneous prob-BFS in MapReduce, one layer per round. 2. Finalize a sample when at least nodes infected. 3. Return fraction of samples that reached bound. Theorem (informal): 1) If is suff. small, # of tuples is Ω( 1.5 log 4 ). 2) If = log , algorithm takes polylog rounds. Experimental Results 1 2 5 3 6 7 4 0 L-S () + () Running time Approx. ratios (Epinions) Heuristics (Epinions) Scalable estimation of the influence of individuals in massive social networks

Upload: others

Post on 05-Jan-2022

6 views

Category:

Documents

0 download

Report

Download

Embed Size (px):

TRANSCRIPT

Page 1: Influence at Scale: Distributed Computation of Complex

Influence at Scale: Distributed Computation of Complex Contagion in Networks

Brendan Lucier, Joel Oren, Yaron Singer Microsoft Research, University of Toronto, Harvard University

Motivation

• Information requirements: how much of the graph do we need to examine? (query complexity).• Efficient parallel computation: how to distribute computation; maintain high-volume intermediate data.

The FrameworkThe influence model: The Independent Cascade Model [KKT’03]• Input: edge-weighted graph 𝐺 = (𝑉, 𝐸,p), initial seeds 𝑆0 ⊆ 𝑉.• At every synchronous step 𝑡 > 0, every node previously infected node

𝑢 ∈ 𝑆𝑡 − 𝑆𝑡−1 successfully infects an uninfected neighbor 𝑣 w.p. 𝑝𝑢𝑣.𝑆𝑡+1 -- set of infected nodes at step 𝑡 + 1.

• The influence function: 𝑓 𝑆0 = 𝐸[𝑆𝑡]

The MRC parallel computation model [KSV’10]• Inspired by the MapReduce paradigm.• Synchronous round computation on 𝑁 𝑘; 𝑣 on tuples. On every round:

1. Map: apply local transformation on tuples in a streaming fashion.2. Reduce: polytime computation on aggregates of tuples with same

key.• Constraintaints: 𝑁1−𝜖 machines and space per machine; log𝑐𝑁 rounds.

• Graph access model: Link server.

The Information Requirements of Influence Estimation

• Question: how much of the graph to we need to know in order to estimate the influence?

• Concretely: what is the query complexity of approximating 𝑓(𝑆0)?

Theorem: Let 𝜖 ∈ 0,1

3, 𝛿 ∈ [0,

2]. For large enough 𝑛 = |𝑉|, ∃𝐺 = (𝑉, 𝐸, 𝒑)

for which getting an estimate 𝑓(𝑆0) of 𝑓(𝑆0) s.t. 1 − 𝜖 𝑓 𝑆0 ≤ 𝑓 𝑆0 ≤ 𝑓(𝑆0) with confidence ≥ 1 − 𝛿 requires Ω((1 − 2𝛿) 𝑛 link server queries.

Main algorithm: Intermediate Layer; Estimating the Expectation

Main intuitions:1. High/low influence few/many samples needed,

2. Can compute the Riemann integral for CDF (𝑓 𝑆0 = 0

∞1 − 𝐹(𝐼) 𝑑𝐼)

Lemma (informal):w.h.p., VerifyGuess returns True if 𝜏 > 𝑖𝑛𝑓𝑙𝑢𝑒𝑛𝑐𝑒.

Main Algorithm: Top Layer; Iterating on Candidate Values

Theorem: For any 𝜖 ∈ (0,1

4), InfEst provides a (1 + 8𝜖)-approx. w.h.p.

Algorithm: Bottom Layer; Bounded SamplesHighlights: Input: Seeds 𝑆0 ⊆ 𝑉, link-server oracle 𝑄(⋅), 𝐿- # samples, bound 𝑡.1. Simultaneous prob-BFS in MapReduce, one layer per round.2. Finalize a sample when at least 𝑡 nodes infected.3. Return fraction of samples that reached bound.

Theorem (informal): 1) If 𝐿 is suff. small, # of tuples is Ω(𝑚1.5 log4 𝑚).2) If 𝑑𝑖𝑎𝑚 𝐺 = log𝑐 𝑛, algorithm takes polylog rounds.

Experimental Results

𝐺

𝑏

𝑎

𝑥

𝑐 𝑑

𝑆0

𝑝𝑥𝑏

𝑝𝑥𝑎𝑝𝑏𝑑𝑝𝑏𝑐

𝑝𝑑𝑎

L-S𝑄(𝑢)