funnel: automatic mining of spatially coevolving epidemicsyasuko/publications/kdd14-funnel...1930...

1
1930 1940 1950 1960 1970 1980 0 5 x 10 4 Year Count 0.1 0.2 0.3 0.4 0.5 February (2) August (8) March (3) September (9) April (4) October (10) May (5) November (11) June (6) December (12) July (7) January (1) Rubella Measles Mumps Gonorrhea Streptococcal sore throat Chickenpox Smallpox Lymedisease Typhoidfever Cryptosporidiosis Rocky mountain spotted fever Typhus fever Influenza 1930 1950 1970 0 5000 10000 Year 1930 1950 1970 10 0 10 5 Year Original I(t) Original S(t) I(t) V(t) 2000 4000 6000 0 1000 2000 3000 Number of timeticks (n) Wall clock time (s) 10 20 30 40 50 2000 2500 3000 Number of states (l) Wall clock time (s) 10 20 30 40 50 0 1000 2000 3000 Number of diseases (d) Wall clock time (s) States Diseases Time 1930 1940 1950 1960 1970 1980 0 2 4 6 x 10 4 Year Count 1930 1940 1950 1960 1970 1980 10 5 Year Count (log) Original I(t) Original S(t) I(t) V(t) Vaccination 1935 1940 1945 1950 10 0 10 5 Year Count (log) Funnel AR(52) AR(26) AR(8) 2001 2002 2003 2004 2005 2006 2007 2008 2009 2010 0 1000 2000 3000 Year (per month) Count Sircam Klez Badtrans Netsky Mytob FUNNEL: Automatic Mining of Spatially Coevolving Epidemics Motivation - Given: large set of epidemiological data e.g., Measles cases from 1928 to 1982 (50 states) Goal: statistically summarize all the epidemic time-series Conclusions – FUNNEL has following advantages: General & Sense-making: it captures all essential aspects (P1-P5) • Fully-automatic: it needs no training set • Scalable: it scales linearly with the input size Yasuko Matsubara Kumamoto University [email protected] Yasushi Sakurai Kumamoto University [email protected] Willem G. van Panhuis University of Pittsburgh [email protected] Christos Faloutsos Carnegie Mellon University [email protected] Data description – Project Tycho 56 contagious diseases for U.S. states from 1888 to the present (>125 years) 3 rd order tensor (diseases x location x Time ) Data: http://www.tycho.pitt.edu/ Code: http://www.cs.kumamoto-u.ac.jp/~yasuko/software.html Experiments - (a) Sense-making (5 properties) (b) Fitting accuracy (c) Scalability – (linear with data size) ACM SIGKDD 2014 20 th ACM SIGKDD Conference August 24-27, 2014, New York City Vaccination effect Outbreaks e.g., 1941 Yearly periodicity Time disease loc cases 04-01 measles PA 4740 04-01 measles NY 5310 04-01 rubella CA 1923 diseases Time x l d n # of cases in 1931, Element x: # of cases (weekly) e.g., ‘measles’, ‘PA’, ‘April 1-7, 1931’, ‘4740’ Observations - Properties of real epidemic data P1 P2 P3 P4 P5 yearly periodicity (e.g., flu peaks in the winter) disease reduction effects (e.g., vaccination) area specificity and sensitivity (e.g., correlation) external shock events (e.g., outbreaks) mistakes, incorrect values (e.g., typos) Proposed model: FUNNEL (a) With a single epidemic: FUNNEL-RE (b) With multi-evolving epidemics: FUNNEL-full People of 3 classes S(t) : susceptible I (t) : Infected V(t) : Vigilant S(t) I(t) V(t) I(t) e.g., measles in NY SIRS SKIPS FunnelR Funnel 25 30 35 40 RMSE Discussion - Application & Generality (a) Forecasting (Influenza) (b) Epidemics on computer networks N d l P1 P2 P3 P4 P5 (t ) (t ) (t ) P1 P2 P3 Measles 1930 1940 1950 1960 1970 1980 0 500 1000 1500 Year Count 5 Original I(t) Mistake 1930 1935 1940 1945 0 500 1000 1500 2000 Year Count Original I(t) Smallpox epidemics in 193739 P4 P5 Optimization algorithm: FUNNEL-Fit Idea (1) Model description cost Q. How can we find externals and mistakes ?? A. Minimize coding cost! Idea (2) Multi-layer optimization = X (step1) Global fitting (step2) Local fitting X X Global P1 P2 B R P4 P5 M E Local N P4 P5 M E P3 P1 P2 P3 P4 P5 M E FUNNEL F = = Model description cost of F Coding cost of X given F Dimensions of X M E P4 P5 Linear Log : strength of infection (yearly cycle) : healing rate : forgetting rate : disease reduction effect : temporal susceptible rate β (t ) δ θ (t ) ε (t ) Extra Local Global Base matrix Geo- disease matrix External shock tensor Disease reduction matrix Mistake tensor E= {E ( D ) , E ( T ) , E ( S ) } Disease matrix Time matrix State matrix E = E ( S ) E ( D ) E ( T ) X l d n X γ

Upload: hoangdung

Post on 27-Aug-2019

217 views

Category:

Documents


0 download

TRANSCRIPT

Page 1: FUNNEL: Automatic Mining of Spatially Coevolving Epidemicsyasuko/PUBLICATIONS/kdd14-funnel...1930 1940 1950 1960 1970 1980 0 5 x 104 Year Count periodicity 1930 1940 1950 1960 1970

1930 1940 1950 1960 1970 19800

5x 104

Year

Cou

nt

1930 1940 1950 1960 1970 1980

105

Year

Cou

nt (l

og)

0.1

0.2

0.3

0.4

0.5

February (2)

August (8)

March (3)

September (9)

April (4)

October (10)

May (5)

November (11)

June (6)

December (12)

July (7) January (1)

RubellaMeasles

Mumps

GonorrheaStreptococcal sore throat

ChickenpoxSmallpox

Lymedisease

TyphoidfeverCryptosporidiosis

Rocky mountain spotted fever

Typhus fever

Influenza

1930 1950 19700

5000

10000

Year

1930 1950 1970100

105

Year

OriginalI(t)

OriginalS(t)I(t)V(t)

2000 4000 60000

1000

2000

3000

Number of time−ticks (n)

Wal

l clo

ck ti

me

(s)

10 20 30 40 50

2000

2500

3000

Number of states (l)

Wal

l clo

ck ti

me

(s)

10 20 30 40 500

100020003000

Number of diseases (d)

Wal

l clo

ck ti

me

(s)

States Diseases Time

1930 1940 1950 1960 1970 19800

2

4

6 x 104

Year

Cou

nt

1930 1940 1950 1960 1970 1980

105

Year

Cou

nt (l

og)

OriginalI(t)

OriginalS(t)I(t)V(t)

Vaccination

1930 1935 1940 1945 1950100

105

Year

Coun

t (lo

g)

proposed=8.69e+03, AR(52,26,8)=9.62e+03, 9.76e+03, 9.77e+03

FunnelAR(52)AR(26)AR(8) 2001 2002 2003 2004 2005 2006 2007 2008 2009 2010

0100020003000

Year (per month)

Cou

nt

SircamKlez

Badtrans NetskyMytob

FUNNEL: Automatic Mining of Spatially Coevolving Epidemics

Motivation - Given: large set of epidemiological data e.g., Measles cases from 1928 to 1982 (50 states)

Goal: statistically summarize all the epidemic time-series

Conclusions – FUNNEL has following advantages: •  General & Sense-making: it captures all essential aspects (P1-P5) •  Fully-automatic: it needs no training set •  Scalable: it scales linearly with the input size

Yasuko Matsubara

Kumamoto University [email protected]

Yasushi Sakurai

Kumamoto University [email protected]

Willem G. van Panhuis University of Pittsburgh

[email protected]

Christos Faloutsos

Carnegie Mellon University [email protected]

Data description – Project Tycho •  56 contagious diseases for U.S. states •  from 1888 to the present (>125 years) 3rd order tensor (diseases x location x Time)

Data: http://www.tycho.pitt.edu/

Code: http://www.cs.kumamoto-u.ac.jp/~yasuko/software.html

Experiments - (a) Sense-making (5 properties) (b) Fitting accuracy (c) Scalability – (linear with data size)

ACM SIGKDD 2014

20th ACM SIGKDD Conference August 24-27, 2014, New York City

Vaccinationeffect

Outbreakse.g.,  1941Yearly  

periodicity

Time disease loc cases 04-01 measles PA 4740 04-01 measles NY 5310 04-01 rubella CA 1923 … … … …

dise

ases

Time

x

l

d

n

# of cases in 1931, …

Element x: # of cases (weekly) e.g., ‘measles’, ‘PA’, ‘April 1-7, 1931’, ‘4740’

Observations - Properties of real epidemic data

P1 P2 P3 P4 P5

yearly periodicity (e.g., flu peaks in the winter) disease reduction effects (e.g., vaccination) area specificity and sensitivity (e.g., correlation) external shock events (e.g., outbreaks) mistakes, incorrect values (e.g., typos)

Proposed model: FUNNEL (a) With a single epidemic: FUNNEL-RE

(b) With multi-evolving epidemics: FUNNEL-full

People of 3 classes •  S(t) : susceptible •  I (t) : Infected •  V(t) : Vigilant

S(t)

I(t)

V(t)

I(t)

e.g., measles in NY

SIRS SKIPS Funnel−R Funnel25

30

35

40Local

RMSE

Discussion - Application & Generality

(a) Forecasting (Influenza) (b) Epidemics on computer networks N

d

l

P1 P2 P3 P4 P5

��

��

���

�(t)

�(t)

�(t)

P1 P2

P3

Measles

1930 1940 1950 1960 1970 19800

500

1000

1500

Year

Cou

nt

Er= 4.26e+01 (1.2e+04, 0.59, 0.48, 0.11) , P=(0.2, 24), (2081, 0.02), k=16

1930 1940 1950 1960 1970 1980100

105

Year

Cou

nt

OriginalI(t)

OriginalS(t)I(t)V(t)

Mistake

1930 1935 1940 19450

500

1000

1500

2000

Year

Coun

t

Er= 1.07e+02 (1.9e+04, 0.58, 0.44, 0.09) , P=(0.1, 47), (1041, 0.03), k=8

1930 1935 1940 1945100

105

Year

Coun

t

OriginalI(t)

OriginalS(t)I(t)V(t)

Smallpox epidemicsin 1937−39

P4 P5

Optimization algorithm: FUNNEL-Fit Idea (1) Model description cost Q. How can we find externals and mistakes?? A. Minimize coding cost! Idea (2) Multi-layer optimization

=X

(step1) Global fitting

(step2) Local fitting X X

Global

P1 P2

B R

P4 P5

MELocal

N

P4 P5

ME

P3

P1 P2 P3 P4 P5

ME

FUNNEL F = =

Model description cost of F

Coding cost of X given F

Dimensions of X

M

E

P4

P5

Linear

Log

: strength of infection (yearly cycle) : healing rate : forgetting rate : disease reduction effect : temporal susceptible rate

β(t)δθ(t)ε(t)

Extra Local Global

Base matrix

Geo-disease matrix

External shock tensor Disease reduction

matrix

Mistake tensor

E= {E(D),E(T ),E(S )}

Disease matrix

Time matrix

State matrix

E =E(S )

E(D)E(T )

X

l

d

n

X

γ