debunking the myths of influence maximizationmitdbg.github.io/nedbday/2017/talks/galhotr.pdf ·...

Debunking the Myths of InfluenceMaximization

Akhil Arora1, Sainyam Galhotra1, Sayan Ranu

[email protected] of Massachusetts, Amherst

January 27, 2017NEDB, 2017

1The first two authors have contributed equally to this work.Influence Maximization January 27, 2017 NEDB, 2017 1 / 19

[email protected]

Information Propagation2: Need for Modelling??

Many real-world processes can be interpreted using conceptsfrom information propagationFor example: Spread of Diseases

2Propagation/Flow/Spread/Diffusion, would be used interchangeablyInfluence Maximization January 27, 2017 NEDB, 2017 2 / 19

Need for Modelling??

Traffic Congestion and its propagation

Influence Maximization January 27, 2017 NEDB, 2017 3 / 19

Other Applications

Using the word-of-mouth effect for:

Viral Marketing: Product/Topic/Event promotionManaging Celebrity/Political campaigns

Detect and Prevent Outbreaks/Epidemics/RumoursMany more . . .


Other Applications

Using the word-of-mouth effect for:Viral Marketing: Product/Topic/Event promotionManaging Celebrity/Political campaigns

Detect and Prevent Outbreaks/Epidemics/RumoursMany more . . .


Existing Information Propagation models

Independent Cascade (IC) and Weighted Cascade (WC) ModelsLinear Threshold (LT) ModelOther models – Heat Diffusion etc.

0.5 0.2

0.7

0.1

0.1

0.3

0.6

0.4

0.2

0.4

A

B C

D

E

F

G

H



0.5 0.2

0.7

0.1

0.1

0.3

0.6

0.4

0.2

0.4

A

B

D

C

E

F

G

H



0.5 0.2

0.7

0.1

0.1

0.3

0.6

0.4

0.2

0.4

A

B C

D

F

E

G

H



0.5 0.2

0.7

0.1

0.1

0.3

0.6

0.4

0.2

0.4

A

B C

D

E

F

G

H


The Influence Maximization (IM) Problem

Input: A graph G, an information-diffusion model IConstraints: The budget (k = |S|) defining the size of the seed-set

Task: Identify the set of most-influential nodes in a networkMaximize σ(S) = E[F(S)]: Expected number of nodes active at theend, if set S is targeted for initial activation

Tractability: The IM problem is NP-hard. Need for ApproximateSolutions!The spread function σ is Monotone and Submodular, thus, asimple GREEDY algorithm provides the best possible (1− 1/e)approximation


The Influence Maximization (IM) Problem

Input: A graph G, an information-diffusion model IConstraints: The budget (k = |S|) defining the size of the seed-setTask: Identify the set of most-influential nodes in a network

Maximize σ(S) = E[F(S)]: Expected number of nodes active at theend, if set S is targeted for initial activation

Tractability: The IM problem is NP-hard. Need for ApproximateSolutions!The spread function σ is Monotone and Submodular, thus, asimple GREEDY algorithm provides the best possible (1− 1/e)approximation


Need for benchmarking? Wide variety of techniques

MC Simulation1. Run MC Simulation from each node to estimate its spread.2. Exploit submodularity to prune out nodes with low spread

SamplingStore a DAG for a sample of nodes and use it to estimate influ-ence

Approximate ScoringEstimate the influence of the nodes using heuristics as exactcomputation is #P - hard


Need for benchmarking? : Ambiguities

Existing Literature: Use IC, WC interchangeablyActual scenario: Varied behaviour in terms of the spread of seednodes, efficiency and scalability aspects of different techniques.

0 100 200

102

104

106

Ru

nn

ing

tim

e (

se

cs)

Seeds (k)

ICWC

Figure: IMM (ε = 0.5) for Orkut dataset


Need for benchmarking? : Ambiguities

State-of-the-art technique in one aspect behaves the worst inanother aspect of the problem.

0 100 200Seeds (k)

103

104

105

Mem

ory

(MB

)

EaSyIMIMM

(a) Memory

0 100 200Seeds (k)

101

102

103

104

Run

ning

tim

e (s

ecs)

EaSyIMIMM

(b) Running Time


Important Questions

How to choose the most appropriate IM technique in a givenspecific scenario?

What does it really mean to claim to be the state-of-the-art?Are the claims made by the recent papers true?


Important Questions

How to choose the most appropriate IM technique in a givenspecific scenario?What does it really mean to claim to be the state-of-the-art?

Are the claims made by the recent papers true?


Important Questions

How to choose the most appropriate IM technique in a givenspecific scenario?What does it really mean to claim to be the state-of-the-art?Are the claims made by the recent papers true?


Our Framework

Generic framework applicable on all techniques.Unified approach to tune the external parameters.

Setu

p

Algorithms Propagation Models Datasets Configurations

IM F

ram

ew

orkSeed Selection Spread Computation

Monte-Carlo (MC)

Simulations

Insig

hts

Eva

luati

on

Quality

Efficiency Scalability

Node

Ord

ering

Influen

ce

Estim

atio

n

Convergence

Robustness

Upda

te

Data

Str

uctu

res

#Parameters

Convergence calculationSelect the parameters which provide the best quality withouthampering the efficiency and scalability of the technique.


Myths

IMM is always faster than TIM+?

Model ε (TIM+) ε (IMM) Time (TIM+) Time (IMM) GainIC 0.05 0.05 8582.23 829.6 10.3xLT 0.35 0.1 0.79 1.2 0.65x

Table: Comparison of convergence parameter and running time (secs) for IMM and TIM+

over HepPH dataset for 200 seeds


Myths

CELF++ is the fastest IM technique in the MC estimationparadigm?

50

60

70

80

1 2 3 4 5 6 7 8 9 101112

Ru

nn

ing

Tim

e (

in m

in)

Independent Runs (WC)

CELFCELF++

(c) Nethept (WC)

120

130

140

150

160

170

180

1 2 3 4 5 6 7 8 9 101112

Ru

nn

ing

Tim

e (

in m

in)

Independent Runs (LT)

CELFCELF++

(d) Nethept (LT)


Myths

CELF++ is the fastest IM technique in the MC estimationparadigm?

50

60

70

80

1 2 3 4 5 6 7 8 9 101112

Ru

nn

ing

Tim

e (

in m

in)

Independent Runs (WC)

CELFCELF++

(e) Nethept (WC)

120

130

140

150

160

170

180

1 2 3 4 5 6 7 8 9 101112

Ru

nn

ing

Tim

e (

in m

in)

Independent Runs (LT)

CELFCELF++

(f) Nethept (LT)


Myths

SIMPATH is faster the LDAG?

0

0.2

0.4

0.6

0.8

1

1.2

40 80 120 160 200

Tim

e (

in m

in)

#Seeds (k)

LDAGSIMPATH

(g) Nethept

5

10

15

20

25

30

35

40 80 120 160 200

Tim

e (

in m

in)

#Seeds (k)

LDAGSIMPATH

(h) DBLP


Myths

SIMPATH is faster the LDAG?

0

0.2

0.4

0.6

0.8

1

1.2

40 80 120 160 200

Tim

e (

in m

in)

#Seeds (k)

LDAGSIMPATH

(i) Nethept

5

10

15

20

25

30

35

40 80 120 160 200

Tim

e (

in m

in)

#Seeds (k)

LDAGSIMPATH

(j) DBLP


Conclusions

No technique is the best on all aspects of IM.

Quality

EfficiencyMemory Footprint

TIM/IMMPMC

CELF/CELF++EaSyIM

ME

IRIEIMRank

StaticGreedy

LDAGSIMPATH

(k) Qualitative catego-rization of IM techniques

(l) Which technique to choose & when?


Thanks!

For more details, please refer :A. Arora, S. Galhotra, S. Ranu. Debunking the Myths of InfluenceMaximization : An In-Depth Benchmarking Study. SIGMOD 2017


debunking the myths of influence maximizationmitdbg.github.io/nedbday/2017/talks/galhotr.pdf ·...

Documents