Y. Matsubara et al. 1
FUNNEL: Automatic Mining of Spatially Coevolving Epidemics
Yasuko Matsubara, Yasushi Sakurai (Kumamoto University)
Willem G. van Panhuis (University of Pittsburgh)
Christos Faloutsos (CMU)
SIGKDD 2014
Y. Matsubara et al. 2
Motivation
Given: Large set of epidemiological data
e.g., Measles cases in the U.S.
[Figure: weekly measles case counts, linear scale; annotations: yearly periodicity; vaccine effect; shocks, e.g., 1941]
Goal: summarize all the epidemic time-series, “fully-automatically”
Y. Matsubara et al. 7
Data description
Project Tycho: infectious diseases in the U.S.
Tensor X: 56 diseases x 50 states x time (weekly, from 1888; > 125 years)
Element x: # of cases, e.g., ‘measles’, ‘NY’, ‘April 1-7, 1931’, ‘4000’
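The element description above (disease, state, week, # of cases) maps directly onto a 3-way array. A minimal sketch, assuming records arrive as plain tuples; the helper name `build_tensor`, the week labels, and every record except the measles/NY/4000 example are made up for illustration:

```python
import numpy as np

# Illustrative records in (disease, state, week, cases) form.
# Only the first one mirrors the example on the slide; the rest are invented.
records = [
    ("measles", "NY", "1931-04-01", 4000),
    ("measles", "PA", "1931-04-01", 2500),
    ("influenza", "NY", "1931-04-08", 120),
]

def build_tensor(records):
    """Return (X, diseases, states, weeks) with X[i, j, t] = # of cases."""
    diseases = sorted({r[0] for r in records})
    states = sorted({r[1] for r in records})
    weeks = sorted({r[2] for r in records})
    d_idx = {name: i for i, name in enumerate(diseases)}
    s_idx = {name: j for j, name in enumerate(states)}
    w_idx = {name: t for t, name in enumerate(weeks)}
    # Unreported entries stay NaN so that missing values are not read as zero cases.
    X = np.full((len(diseases), len(states), len(weeks)), np.nan)
    for disease, state, week, cases in records:
        X[d_idx[disease], s_idx[state], w_idx[week]] = cases
    return X, diseases, states, weeks

X, diseases, states, weeks = build_tensor(records)
```

Keeping unreported cells as NaN rather than 0 is a design choice of this sketch, not something the slides specify.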
Y. Matsubara et al. 9
Problem definition
Given: Tensor X (disease x state x time)
Find: Compact description of X, “automatically”
[Figure: X -> FUNNEL -> P1 P2 P3 P4 P5, plus E and M; discontinuities highlighted]
NO magic numbers! Parameter-free!
Y. Matsubara et al. 13
Roadmap
- Motivation ✔
- Modeling power of FUNNEL
- Overview – main ideas
- Proposed model – idea #1
- Algorithm – idea #2
- Experiments
- Discussion
- Conclusions
Y. Matsubara et al. 14
Modeling power of FUNNEL
Questions about epidemics: Q1 Q2 Q3 Q4 Q5
Y. Matsubara et al. 15
Questions
Q1: Are there any periodicities? If yes, when is the peak season?
Y. Matsubara et al. 16
Answers
P1 Seasonality
FUNNEL: polar plot
Angle: peak season
Radius: seasonality strength
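To make the polar-plot reading concrete (angle = peak season, radius = seasonality strength), here is a rough sketch that derives both from a weekly series using the Fourier component with a one-year period. This is a stand-in for illustration, not FUNNEL's fitting procedure; the function name, the 52-weeks-per-year simplification, and the definition of radius as amplitude relative to the mean are all assumptions.

```python
import numpy as np

WEEKS_PER_YEAR = 52  # simplifying assumption: exactly 52 weekly samples per year

def seasonality_polar(x):
    """Return (angle, radius) of the yearly component of a weekly series.

    angle  : peak position within the year, in radians (0 = first week)
    radius : amplitude of the yearly component relative to the mean level
    """
    x = np.asarray(x, dtype=float)
    t = np.arange(len(x))
    # Fourier coefficient at a period of one year (mean removed first).
    c = np.sum((x - x.mean()) * np.exp(-2j * np.pi * t / WEEKS_PER_YEAR))
    amplitude = 2.0 * np.abs(c) / len(x)
    angle = (-np.angle(c)) % (2.0 * np.pi)
    radius = amplitude / x.mean() if x.mean() > 0 else 0.0
    return angle, radius

# Synthetic 10-year series whose yearly peak falls at week index 6.
t = np.arange(10 * WEEKS_PER_YEAR)
series = 100.0 + 80.0 * np.cos(2.0 * np.pi * (t - 6) / WEEKS_PER_YEAR)
angle, radius = seasonality_polar(series)
peak_week = angle * WEEKS_PER_YEAR / (2.0 * np.pi)
```

A series with no yearly component would land near the origin of the plot, while a strongly seasonal one sits far from it at the angle of its peak week.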
Y. Matsubara et al. 17
Answers
P1 Seasonality
Q: Does influenza have seasonality? If yes, when?
Influenza in Feb. (strong seasonality): detected by FUNNEL
Detected!
Y. Matsubara et al. 21
Answers
P1 Seasonality
Q: How about measles?
Measles (children’s) in spring
Detected!
Y. Matsubara et al. 25
Answers
P1 Seasonality
Q: Which disease peaks in summer?
Lyme disease (tick-borne) in summer
Detected!
Y. Matsubara et al. 29
Answers
P1 Seasonality
Q: Which disease has no periodicity?
Gonorrhea (STD): no periodicity
Detected!
Y. Matsubara et al. 33
Questions
Q2: Can we see any discontinuities?
Y. Matsubara et al. 34
Answers
P2 Disease reduction effect
Measles
1963: Vaccine licensure
1965: Detected by FUNNEL
Detected!
Y. Matsubara et al. 35
Questions
Q3: What’s the difference between measles in NY and in FL?
Y. Matsubara et al. 36
Answers
P3 Area sensitivity
FUNNEL’s guess of susceptibles (measles)
Map labels: CA; TX; NY, PA (more children); FL (fewer children)
Detected!
Y. Matsubara et al. 37
Questions
Q4: Are there any external shock events, like wars?
Y. Matsubara et al. 38
Answers
P4 External shock events
FUNNEL can detect external shocks “fully-automatically”!
Scarlet fever, World War II: detected by FUNNEL
Detected!
Y. Matsubara et al. 39
Questions
Q5: How can we remove mistakes and incorrect values?
[Figure annotation: TYPO!]
Y. Matsubara et al. 40
Answers
P5 Mistakes
It can also detect typos, “automatically”!
Typhoid fever cases (figure annotations: missing values; mistake)
Detected!
Y. Matsubara et al. 41
Modeling power of FUNNEL
Our model can capture 5 properties:
P1 Seasonality
P2 Disease reductions
P3 Area sensitivity
P4 External events
P5 Mistakes
Y. Matsubara et al. 42
Roadmap
- Motivation ✔
- Modeling power of FUNNEL ✔
- Overview – main ideas
- Proposed model – idea #1
- Algorithm – idea #2
- Experiments
- Discussion
- Conclusions
Y. Matsubara et al. 43
Problem definition
Given: Tensor X (disease x state x time)
Find: Compact description of X, “automatically”
[Figure: X -> FUNNEL -> P1 P2 P3 P4 P5, plus E and M]
Y. Matsubara et al. 44
Two main ideas
Idea #1: Grey-box model
Idea #2: MDL for fitting
NO magic numbers! (parameter-free)
Y. Matsubara et al. 45
Two main ideas
Idea #1: Grey-box model - domain knowledge
(SIRS+): 6 parameters
Figure labels: Shocks, Vaccine
Y. Matsubara et al. 46
Two main ideas
Idea #2: Fitting with MDL -> parameter-free!
NO magic numbers
Y. Matsubara et al. 47
Roadmap
- Motivation ✔
- Modeling power of FUNNEL ✔
- Overview – main ideas ✔
- Proposed model – idea #1
- Algorithm – idea #2
- Experiments
- Discussion
- Conclusions
Y. Matsubara et al. 48
Proposed model: FUNNEL
(a) FUNNEL-single: single epidemic
(b) FUNNEL-full: multi-evolving epidemics, tensor X (diseases x states x time)
Y. Matsubara et al. 50
FUNNEL – with a single epidemic
Given: “single” epidemic sequence, e.g., measles in NY
Find: nonlinear equation, model parameters
Y. Matsubara et al. 51
FUNNEL – with a single epidemic
With a single epidemic: Funnel-RE
People of 3 classes:
• S(t): Susceptible
• I(t): Infected
• V(t): Vigilant/vaccinated
[Figure: S(t), I(t), V(t) over time; I(t) shown on linear and log scales]
Details: parameters governing the transitions among S, I, and V
• strength of infection (yearly periodic function)
• healing rate
• disease reduction effect
• temporal susceptible rate
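The transcript keeps the parameter descriptions but not the Funnel-RE equations themselves. The sketch below shows one generic discrete-time S/I/V (SIRS-style) simulation that uses the same four ingredients: a yearly periodic infection strength, a healing rate, a temporal susceptible rate, and a disease reduction effect that switches on at a given week. The update rules, the parameter names (`beta0`, `season_amp`, `delta`, `gamma`, `theta`, `vaccine_start`), and all default values are assumptions for illustration, not the authors' formulation.

```python
import math

def simulate_sirs(n_weeks, population=1.0, beta0=0.6, season_amp=0.3,
                  peak_week=10, delta=0.4, gamma=0.02,
                  vaccine_start=None, theta=0.0, i0=0.001):
    """Discrete-time S/I/V simulation; returns the list of I(t) values."""
    s, i, v = population - i0, i0, 0.0
    infected = []
    for t in range(n_weeks):
        infected.append(i)
        # Strength of infection: a yearly periodic function of t.
        beta = beta0 * (1.0 + season_amp *
                        math.cos(2.0 * math.pi * (t - peak_week) / 52.0))
        # Disease reduction effect (e.g. vaccination), active from vaccine_start on.
        active = vaccine_start is not None and t >= vaccine_start
        reduction = theta if active else 0.0
        new_infections = beta * s * i   # S -> I
        recoveries = delta * i          # healing rate: I -> V
        relapses = gamma * v            # temporal susceptible rate: V -> S
        vaccinated = reduction * s      # reduction effect: S -> V
        s += relapses - new_infections - vaccinated
        i += new_infections - recoveries
        v += recoveries + vaccinated - relapses
    return infected

without_vaccine = simulate_sirs(520)
with_vaccine = simulate_sirs(520, vaccine_start=260, theta=0.05)
```

With these made-up settings, the two runs coincide until week 260; afterwards the vaccinated run's infected fraction decays while the other keeps cycling yearly.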
Y. Matsubara et al. 56
Proposed model: FUNNEL
(a) FUNNEL-single: single epidemic
(b) FUNNEL-full: multi-evolving epidemics
Y. Matsubara et al. 57
Proposed model: FUNNEL-full
Tensor X: diseases x states x time (n)
P1 P2 P3 - (P1, P2): global/country; (P3): local/state
P4 P5 - (P4, P5): extra - E: shocks & M: mistakes
Y. Matsubara et al. 58
Proposed model: FUNNEL-full
Details: Global - (P1, P2): global/country
P1 Base matrix (d x 6)
P2 Disease reduction matrix (d x 2)
Y. Matsubara et al. 59
Proposed model: FUNNEL-full
Details: Local - (P3): local/state
P3 Geo-disease matrix (d x l): potential population of disease i in state j
Y. Matsubara et al. 60
Proposed model: FUNNEL-full
Details: Extra - (P4, P5): E: shocks & M: mistakes
P4 External shock tensor E
P5 Mistake tensor M
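The matrix shapes given on the FUNNEL-full slides (base matrix d x 6, disease reduction matrix d x 2, geo-disease matrix d x l) can be collected in a small container to see how compact the description is. The class name `FunnelParams`, its field names, and the list/dict layout chosen for E and M are hypothetical; only the three matrix shapes come from the slides.

```python
from dataclasses import dataclass, field
import numpy as np

@dataclass
class FunnelParams:
    """Hypothetical container; only the matrix shapes come from the slides."""
    base: np.ndarray        # (d, 6): base matrix, global
    reduction: np.ndarray   # (d, 2): disease reduction matrix, global
    geo: np.ndarray         # (d, l): geo-disease matrix, local
    shocks: list = field(default_factory=list)     # external shocks E (layout assumed)
    mistakes: dict = field(default_factory=dict)   # mistakes M as {(i, j, t): value} (assumed)

    def dense_size(self):
        """Number of values stored in the three dense matrices."""
        return self.base.size + self.reduction.size + self.geo.size

d, l = 56, 50  # diseases and states in the Project Tycho tensor
params = FunnelParams(np.zeros((d, 6)), np.zeros((d, 2)), np.zeros((d, l)))
n_dense = params.dense_size()
```

For d = 56 and l = 50 the dense part holds 3,248 values regardless of the number of weeks n, whereas X itself has 56 x 50 x n entries.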
Y. Matsubara et al. 61
Roadmap
- Motivation ✔
- Modeling power of FUNNEL ✔
- Overview – main ideas ✔
- Proposed model – idea #1 ✔
- Algorithm – idea #2
- Experiments
- Discussion
- Conclusions
Y. Matsubara et al. 62
Challenges
Q1. How to automatically
- find “external shocks”?
- ignore “mistakes” (i.e., typos)?
Idea (1): Model description cost
Q2. How to efficiently estimate model parameters?
Idea (2): Multi-layer optimization (linear)
Y. Matsubara et al. 64
Roadmap
- Motivation ✔
- Modeling power of FUNNEL ✔
- Overview – main ideas ✔
- Proposed model – idea #1 ✔
- Algorithm – idea #2 ✔
- Experiments
- Discussion
- Conclusions
Y. Matsubara et al. 65
Roadmap
- Motivation ✔
- Modeling power of FUNNEL ✔
- Overview – main ideas ✔
- Proposed model – idea #1 ✔
- Algorithm – idea #2 ✔
- Experiments ✔
- Discussion
- Conclusions
Y. Matsubara et al. 67
(a) FUNNEL at work - forecasting
Forecasting future epidemics
Train: 2/3 sequences
Forecast: 1/3 following years
FUNNEL can capture future epidemics (AR: fail)
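The evaluation protocol on this slide (train on the first 2/3 of each sequence, forecast the remaining 1/3) and the AR baseline it mentions can be sketched as follows. This is only an illustration of the protocol: the function names, the least-squares fit with an intercept, the recursive multi-step forecast, and the RMSE score are assumptions, and the clean synthetic series used here does not reproduce the slide's comparison on real epidemic data.

```python
import numpy as np

def fit_ar(train, order):
    """Least-squares AR(order) fit with intercept; returns the coefficients."""
    rows = [np.concatenate(([1.0], train[t - order:t][::-1]))
            for t in range(order, len(train))]
    A = np.array(rows)
    y = train[order:]
    coef, *_ = np.linalg.lstsq(A, y, rcond=None)
    return coef

def forecast_ar(history, coef, steps):
    """Recursive multi-step forecast from the end of history (order >= 1)."""
    order = len(coef) - 1
    window = list(history[-order:])
    out = []
    for _ in range(steps):
        lags = window[::-1]  # most recent value first, matching fit_ar
        nxt = coef[0] + float(np.dot(coef[1:], lags))
        out.append(nxt)
        window = window[1:] + [nxt]
    return np.array(out)

def split_and_score(series, order=2):
    """Train on the first 2/3, forecast the last 1/3, return the RMSE."""
    series = np.asarray(series, dtype=float)
    cut = (2 * len(series)) // 3
    train, test = series[:cut], series[cut:]
    pred = forecast_ar(train, fit_ar(train, order), len(test))
    return float(np.sqrt(np.mean((pred - test) ** 2)))

# Synthetic 12-year weekly series with a pure yearly cycle.
t = np.arange(12 * 52)
series = 100.0 + 80.0 * np.cos(2.0 * np.pi * t / 52.0)
rmse = split_and_score(series, order=2)
```

A pure sinusoid satisfies an exact second-order linear recurrence, so AR(2) tracks it almost perfectly; any difficulty for AR would come from effects such as shocks or vaccination that a fixed linear recurrence does not capture.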
Y. Matsubara et al. 70
(b) Generality of FUNNEL
Epidemics on computer networks
FUNNEL is general: it fits computer virus data very well!
[Figure: computer virus time series (10 years); labels: spread via email attachment; spread through corporate networks]
Y. Matsubara et al. 71
Roadmap
- Motivation ✔
- Modeling power of FUNNEL ✔
- Overview – main ideas ✔
- Proposed model – idea #1 ✔
- Algorithm – idea #2 ✔
- Experiments ✔
- Discussion ✔
- Conclusions
Y. Matsubara et al. 72
Conclusions
FUNNEL has the following advantages:
✔ Sense-making: captures all essential aspects (P1 P2 P3 P4 P5)
✔ Fully-automatic: no training set
✔ Scalable: it scales linearly
✔ General: real epidemics (+ computer virus)
Y. Matsubara et al. 73
Thank you!
Data: http://www.tycho.pitt.edu/
Code: http://www.cs.kumamoto-u.ac.jp/~yasuko/software.html
Y. Matsubara et al. 75
Proposed model: FUNNEL-full
Details (P4, P5): E vs. M
What’s the difference? External shock (P4) vs. Mistake (P5)
[Figure: two giardiasis examples; labels: independent, influenced]
Y. Matsubara et al. 77
Proposed model: FUNNEL-full
Details (P4): External shock tensor E
E = {Disease matrix, Time matrix, State matrix}, covering external shocks (e1), (e2), ..., (ek)
Y. Matsubara et al. 78
Idea (1): Model description cost
Q1. How should we
- find “external shocks”?
- ignore “mistakes” (i.e., typos)?
Idea (1): Model description cost
• Minimize coding cost
• Find “optimal” # of externals/mistakes
• “automatically”
Y. Matsubara et al. 79
Idea (1): Model description cost
Idea: Minimize encoding cost!
Good compression
Good description
[Figure: CostM, CostC, and CostT plotted against # of |E|+|M| (1 to 10)]
min ( CostM(F) + CostC(X|F) )
CostM(F): Model cost; CostC(X|F): Coding cost
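The trade-off on this slide (model cost grows with each extra shock or mistake, coding cost shrinks) can be illustrated with a toy MDL selection. In the sketch below, the "extras" are simply the largest residuals stored explicitly. The cost formulas (32 bits per stored value, log2(n) bits per position, Gaussian coding cost for what remains) are assumptions chosen for illustration and are not the cost functions from the paper.

```python
import math

FLOAT_BITS = 32  # assumed cost of storing one real value

def coding_cost(residuals):
    """Bits to encode residuals under a zero-mean Gaussian with fitted variance."""
    m = len(residuals)
    if m == 0:
        return 0.0
    var = max(sum(r * r for r in residuals) / m, 1e-12)
    return 0.5 * m * math.log2(2.0 * math.pi * math.e * var)

def best_num_extras(residuals, max_extras=10):
    """Pick how many extreme residuals to store explicitly as 'extras'."""
    n = len(residuals)
    ranked = sorted(residuals, key=abs, reverse=True)
    costs = []
    for k in range(max_extras + 1):
        model_cost = k * (math.log2(n) + FLOAT_BITS)  # position + value per extra
        costs.append(model_cost + coding_cost(ranked[k:]))
    best_k = min(range(len(costs)), key=costs.__getitem__)
    return best_k, costs

# Toy residuals: unit-size noise plus two large spikes.
residuals = [(-1.0) ** t for t in range(200)]
residuals[50] = 500.0
residuals[120] = -400.0
best_k, costs = best_num_extras(residuals)
```

Here the total cost drops sharply while the two spikes are being absorbed and rises again afterwards, so the minimum picks the number of extras without a user-set threshold.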
Y. Matsubara et al. 80
Idea (1): Model description cost
Total cost of tensor X, given F
Details: the total cost has three labeled terms
• Dimensions of X
• Model description cost of F
• Coding cost of X given F
Y. Matsubara et al. 82
Idea (2): Multi-layer optimization
Q2. How to efficiently estimate model parameters?
Idea (2): Multi-layer optimization
• Find “optimal” solution w.r.t.
– Global-level parameters
– Local-level parameters
Y. Matsubara et al. 83
Idea (2): Multi-layer optimization
Find “optimal” solution w.r.t.
Global: P1 P2 P4 P5
Local: P3 P4 P5
Y. Matsubara et al. 85
Idea (2): Multi-layer optimization
Multi-layer fitting algorithm
Global fitting: P1 P2 P4 P5
Local fitting: P3 P4 P5
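The transcript names the two layers (global fitting and local fitting) but not the optimizer. As a toy stand-in, the sketch below alternates between a shared curve for one disease (the global layer) and one scale per state (the local layer) using closed-form least-squares updates. The rank-one model, the function name, and the fixed iteration count are assumptions and do not reproduce the FunnelFit algorithm.

```python
import numpy as np

def fit_global_local(X, n_iters=20):
    """Alternate between a shared (global) curve and per-state (local) scales.

    X has shape (states, time) for one disease; the model is
    X[j, t] ~ scale[j] * curve[t].
    """
    X = np.asarray(X, dtype=float)
    scale = np.ones(X.shape[0])
    curve = X.mean(axis=0)
    for _ in range(n_iters):
        # Global step: best shared curve given the current local scales.
        curve = scale @ X / (scale @ scale)
        # Local step: best scale for each state given the shared curve.
        scale = X @ curve / (curve @ curve)
    return curve, scale

# Synthetic check: three states sharing one seasonal curve at different scales.
t = np.arange(104)
true_curve = 2.0 + np.cos(2.0 * np.pi * t / 52.0)
true_scale = np.array([1.0, 2.0, 0.5])
X_toy = np.outer(true_scale, true_curve)
curve, scale = fit_global_local(X_toy)
```

Each step solves a small least-squares problem while the other layer is held fixed, which is the general pattern a multi-layer scheme relies on.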
Y. Matsubara et al. 86
Experiments
We answer the following questions…
Q1. Sense-making: Can it help us understand the given epidemics?
Q2. Accuracy: How well does it match the data?
Q3. Scalability: How does it scale in terms of computational time?
Y. Matsubara et al. 87
Q1. Sense-making
Our preliminary observations:
P1 yearly periodicity
P2 disease reduction effects
P3 area specificity and sensitivity
P4 external shock events
P5 mistakes, incorrect values
Y. Matsubara et al. 88
Q1. Sense-making
P1 Disease seasonality
Radius: seasonality strength; Angle: peak season
Influenza: in Feb.
Children’s diseases: in spring
Tick-borne diseases: in summer
Gonorrhea: no periodicity
Y. Matsubara et al. 90
Q1. Sense-making
P2 Disease reduction effect
Measles (vaccine licensure: 1963)
[Figure: linear and log scales; marked years: 1967, 1969]
Y. Matsubara et al. 91
Q1. Sense-making
P3 Area specificity and sensitivity
Potential population of susceptibles (measles)
Map labels: CA; TX; NY, PA; FL (fewer children)
Y. Matsubara et al. 92
Q1. Sense-making
P3 Area specificity and sensitivity
Measles in NY and PA
[Figure: linear and log scales]
Y. Matsubara et al. 93
Q1. Sense-making
P4 External shock events
We can detect external shocks “automatically”!
[Figure: linear and log scales]
Y. Matsubara et al. 94
Q1. Sense-making
P5 Mistakes, incorrect values
We can also detect typos “automatically”!
[Figure: linear and log scales; annotations: missing values; mistake]
Y. Matsubara et al. 95
Q2. Accuracy
Fitting accuracy for sequences (lower is better)
[Figure: panels for Global and Local fitting]
Y. Matsubara et al. 96
Q3. Scalability
Wall clock time vs. # of diseases (d), # of states (l), and duration (n)
FunnelFit is linear w.r.t. data size: O(dln)