Debunking the Myths of Influence Maximization
Akhil Arora¹, Sainyam Galhotra¹, Sayan Ranu
[email protected]
University of Massachusetts, Amherst
January 27, 2017
NEDB, 2017
¹ The first two authors have contributed equally to this work.
Influence Maximization January 27, 2017 NEDB, 2017 1 / 19
Information Propagation²: Need for Modelling?
Many real-world processes can be interpreted using concepts from information propagation.
For example: Spread of Diseases
² Propagation/Flow/Spread/Diffusion are used interchangeably.
Need for Modelling??
Traffic Congestion and its propagation
Other Applications
Using the word-of-mouth effect for:
Viral Marketing: Product/Topic/Event promotion
Managing Celebrity/Political campaigns
Detect and Prevent Outbreaks/Epidemics/Rumours
Many more . . .
Existing Information Propagation models
Independent Cascade (IC) and Weighted Cascade (WC) Models
Linear Threshold (LT) Model
Other models – Heat Diffusion etc.
Figure: example graph with nodes A to H and edge weights 0.5, 0.2, 0.7, 0.1, 0.1, 0.3, 0.6, 0.4, 0.2, 0.4
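The models named on this slide can be illustrated with a minimal Python sketch. It assumes the usual textbook definitions: under IC, each newly activated node gets one chance to activate each inactive out-neighbour with the edge's probability; under WC, the probability on edge (u, v) is 1 / in-degree(v). The function names and the toy edge list are hypothetical and do not reproduce the graph from the slide.

```python
import random

def weighted_cascade_probs(edges):
    """WC model (assumed definition): edge (u, v) gets probability 1 / in-degree(v)."""
    indeg = {}
    for _, v in edges:
        indeg[v] = indeg.get(v, 0) + 1
    return {(u, v): 1.0 / indeg[v] for u, v in edges}

def simulate_ic(probs, seeds, rng):
    """One Independent Cascade run; returns the set of activated nodes."""
    out = {}
    for (u, v), p in probs.items():
        out.setdefault(u, []).append((v, p))
    active = set(seeds)
    frontier = list(seeds)
    while frontier:
        nxt = []
        for u in frontier:
            for v, p in out.get(u, []):
                # Each newly active node tries each out-edge exactly once.
                if v not in active and rng.random() < p:
                    active.add(v)
                    nxt.append(v)
        frontier = nxt
    return active

# Hypothetical toy graph, not the one drawn on the slide.
toy_edges = [("A", "B"), ("A", "C"), ("B", "D"), ("C", "D"), ("D", "E")]
wc = weighted_cascade_probs(toy_edges)
activated = simulate_ic(wc, ["A"], random.Random(0))
```

Passing hand-picked probabilities instead of `wc` gives plain IC; the simulation loop is the same in both cases, only the edge weights differ.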
The Influence Maximization (IM) Problem
Input: A graph G, an information-diffusion model I
Constraints: The budget (k = |S|) defining the size of the seed-set
Task: Identify the set of most-influential nodes in a network
Maximize σ(S) = E[F(S)]: Expected number of nodes active at the end, if set S is targeted for initial activation
Tractability: The IM problem is NP-hard. Need for Approximate Solutions!
The spread function σ is Monotone and Submodular, thus, a simple GREEDY algorithm provides the best possible (1 − 1/e) approximation
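The GREEDY idea can be sketched as follows, assuming the IC model and assuming σ(S) is estimated by averaging Monte-Carlo cascade sizes (the estimate is noisy, so the (1 − 1/e) bound holds only up to estimation error). All names, the adjacency format and the toy graph are hypothetical.

```python
import random

def ic_spread_once(out_edges, seeds, rng):
    """Size of one Independent Cascade run started from `seeds`."""
    active, frontier = set(seeds), list(seeds)
    while frontier:
        nxt = []
        for u in frontier:
            for v, p in out_edges.get(u, []):
                if v not in active and rng.random() < p:
                    active.add(v)
                    nxt.append(v)
        frontier = nxt
    return len(active)

def estimate_sigma(out_edges, seeds, runs, rng):
    """Monte-Carlo estimate of sigma(S) = E[F(S)]."""
    return sum(ic_spread_once(out_edges, seeds, rng) for _ in range(runs)) / runs

def greedy_im(out_edges, nodes, k, runs=200, seed=0):
    """Add, k times, the node with the largest estimated spread when joined to S."""
    rng = random.Random(seed)
    chosen = []
    for _ in range(k):
        best_node, best_val = None, -1.0
        for v in nodes:
            if v in chosen:
                continue
            val = estimate_sigma(out_edges, chosen + [v], runs, rng)
            if val > best_val:
                best_node, best_val = v, val
        chosen.append(best_node)
    return chosen

# Hypothetical toy graph: node -> list of (neighbour, activation probability).
toy = {"A": [("B", 0.5), ("C", 0.5), ("D", 0.5)], "E": [("F", 0.5)]}
seeds = greedy_im(toy, ["A", "B", "C", "D", "E", "F"], k=2)
```

Each round re-estimates the spread of every remaining candidate, which is why plain GREEDY becomes expensive on large graphs.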
Need for benchmarking? Wide variety of techniques
MC Simulation
1. Run MC Simulation from each node to estimate its spread.
2. Exploit submodularity to prune out nodes with low spread
Sampling
Store a DAG for a sample of nodes and use it to estimate influence
Approximate Scoring
Estimate the influence of the nodes using heuristics as exact computation is #P-hard
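The two MC Simulation steps above can be sketched as a lazy greedy loop in the spirit of CELF. This is a simplified illustration under the IC model, not any paper's reference implementation. Stale marginal gains sit in a max-heap and are recomputed only when they reach the top; submodularity is what justifies skipping the rest. All names and the adjacency format are hypothetical.

```python
import heapq
import random

def mc_spread(out_edges, seeds, runs, rng):
    """Average Independent Cascade size over `runs` simulations."""
    total = 0
    for _ in range(runs):
        active, frontier = set(seeds), list(seeds)
        while frontier:
            nxt = []
            for u in frontier:
                for v, p in out_edges.get(u, []):
                    if v not in active and rng.random() < p:
                        active.add(v)
                        nxt.append(v)
            frontier = nxt
        total += len(active)
    return total / runs

def lazy_greedy(out_edges, nodes, k, runs=100, seed=0):
    """Returns (seed list, number of spread evaluations performed)."""
    rng = random.Random(seed)
    # Heap entries: (-marginal_gain, node, size of seed set when gain was computed).
    heap = [(-mc_spread(out_edges, [v], runs, rng), v, 0) for v in nodes]
    heapq.heapify(heap)
    chosen, current, evaluations = [], 0.0, len(nodes)
    while len(chosen) < k and heap:
        neg_gain, v, stamp = heapq.heappop(heap)
        if stamp == len(chosen):
            # Gain is up to date; by submodularity no stale entry can exceed it.
            chosen.append(v)
            current += -neg_gain
        else:
            # Stale gain: recompute against the current seed set and reinsert.
            gain = mc_spread(out_edges, chosen + [v], runs, rng) - current
            evaluations += 1
            heapq.heappush(heap, (-gain, v, len(chosen)))
    return chosen, evaluations
```

With noisy estimates the pruning argument holds only approximately, so the number of simulations per estimate matters for both quality and running time.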
Need for benchmarking? : Ambiguities
Existing Literature: Use IC, WC interchangeably
Actual scenario: Varied behaviour in terms of the spread of seed nodes, efficiency and scalability aspects of different techniques.
Figure: IMM (ε = 0.5) for Orkut dataset. Running time (secs, log scale) vs. Seeds (k) for IC and WC.
Need for benchmarking? : Ambiguities
State-of-the-art technique in one aspect behaves the worst in another aspect of the problem.
Figure: EaSyIM vs. IMM against Seeds (k): (a) Memory (MB), (b) Running Time (secs).
Important Questions
How to choose the most appropriate IM technique in a given specific scenario?
What does it really mean to claim to be the state-of-the-art?
Are the claims made by the recent papers true?
Our Framework
Generic framework applicable on all techniques.
Unified approach to tune the external parameters.
Figure: framework overview. Setup: Algorithms, Propagation Models, Datasets, Configurations. IM Framework: Seed Selection, Spread Computation, Monte-Carlo (MC) Simulations. Other labels: Insights, Evaluation, Quality, Efficiency, Scalability, Node Ordering, Influence Estimation, Convergence, Robustness, Update Data Structures, #Parameters.
Convergence calculation
Select the parameters which provide the best quality without hampering the efficiency and scalability of the technique.
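The slide does not spell out how convergence is calculated. As one possible reading, and not the authors' procedure, the sketch below tunes a single external parameter, the number of MC simulations: it keeps doubling the run count until the spread estimate changes by less than a relative tolerance. The function names, the tolerance, the doubling schedule and the IC model are all assumptions made for illustration.

```python
import random

def spread_once(out_edges, seeds, rng):
    """Size of one Independent Cascade run started from `seeds`."""
    active, frontier = set(seeds), list(seeds)
    while frontier:
        nxt = []
        for u in frontier:
            for v, p in out_edges.get(u, []):
                if v not in active and rng.random() < p:
                    active.add(v)
                    nxt.append(v)
        frontier = nxt
    return len(active)

def runs_until_stable(out_edges, seeds, tol=0.01, start=100, max_runs=20000, seed=0):
    """Double the number of simulations until the estimate moves by <= tol (relative).

    Returns (simulations used, spread estimate). Stops at max_runs regardless.
    """
    rng = random.Random(seed)
    total, done, target, prev = 0, 0, start, None
    while True:
        while done < target:
            total += spread_once(out_edges, seeds, rng)
            done += 1
        estimate = total / done
        if prev is not None and abs(estimate - prev) <= tol * estimate:
            return done, estimate
        if target >= max_runs:
            return done, estimate
        prev, target = estimate, min(2 * target, max_runs)
```

The smallest run count that passes the check is the cheapest setting that does not visibly change the reported quality, which matches the stated goal of keeping quality without hurting efficiency.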
Myths
IMM is always faster than TIM+?
Model   ε (TIM+)   ε (IMM)   Time (TIM+)   Time (IMM)   Gain
IC      0.05       0.05      8582.23       829.6        10.3x
LT      0.35       0.1       0.79          1.2          0.65x
Table: Comparison of convergence parameter and running time (secs) for IMM and TIM+ over HepPH dataset for 200 seeds
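The Gain column appears to be Time (TIM+) divided by Time (IMM); this is an inference from the numbers, not something the slide states. Under that assumption, the short check below recomputes it from the table's running times. The variable names are hypothetical.

```python
# Running times in seconds, copied from the table: (TIM+, IMM).
times = {"IC": (8582.23, 829.6), "LT": (0.79, 1.2)}

# Assumed definition: gain = Time(TIM+) / Time(IMM).
gains = {model: t_tim / t_imm for model, (t_tim, t_imm) in times.items()}

for model, gain in gains.items():
    faster = "IMM" if gain > 1 else "TIM+"
    print(f"{model}: gain {gain:.3f}x, faster technique: {faster}")
```

A gain below 1 means TIM+ finished first, which is the LT row. Note that the two techniques run with different ε there (0.35 vs 0.1).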
Myths
CELF++ is the fastest IM technique in the MC estimation paradigm?
Figure: Running Time (in min) of CELF vs. CELF++ over 12 independent runs: (c) Nethept (WC), (d) Nethept (LT).
Myths
SIMPATH is faster than LDAG?
Figure: Time (in min) vs. #Seeds (k) for LDAG and SIMPATH: (g) Nethept, (h) DBLP.
Conclusions
No technique is the best on all aspects of IM.
Figure (k): Qualitative categorization of IM techniques along Quality, Efficiency and Memory Footprint. Labels in the figure: TIM/IMM, PMC, CELF/CELF++, EaSyIM, ME, IRIE, IMRank, StaticGreedy, LDAG, SIMPATH.
Figure (l): Which technique to choose & when?
Thanks!
For more details, please refer to:
A. Arora, S. Galhotra, S. Ranu. Debunking the Myths of Influence Maximization: An In-Depth Benchmarking Study. SIGMOD 2017