‘Tracking Epidemics with Natural Language Processing and
Crowdsourcing’
Robert Munro, Lucky Gunasekara, Stephanie Nevins, Lalith Polepeddi and Evan Rosen
Stanford University and EpidemicIQ
2012 AAAI Spring Symposium
March 2012
http://www.robertmunro.com/research/munro12epidemics.pdf
Global Viral Forecasting
[Figure: pathogens by degree of human adaptation. Labels: not human adapted, transmissible, weakly human adapted, human adapted, human exclusive. Examples shown: Influenza, HIV-1, Yellow Fever, Rabies, SARS/Ebola.]
90% of the world’s ecological diversity
90% of the world’s linguistic diversity
Reported locally before identification:
H1N1 (Swine Flu) – months (10% of world infected)
HIV – decades (35 million infected)
H5N1 (Bird Flu) – weeks (>50% fatal)
Diseases eradicated in the last 75 years: smallpox
Increase in air travel in the last 75 years: [figure]
No one is tracking all the world’s outbreaks
• NASA is tracking thousands of potentially dangerous near-Earth objects (NASA 2011).
• National security agencies are tracking tens of thousands of suspected terrorists daily (Chertoff 2008).
• A deadly microbe is far more likely to sneak onto a plane undetected.
CDC vs Google Flu Trends?
“I'm Jacqui Jeras with today's cold and flu report ... across the mid-Atlantic states, a little bit of an increase here” Jan 4th
The first signal is plain language
“today's cold and flu report ... across the mid-Atlantic states, a little bit of an increase” CNN, Jan 4, 2008
Google Flu Trends: + 3 weeks
CDC: + 5 weeks
… but buried in plain view
“today's cold and flu report ... across the mid-Atlantic states, a little bit of an increase”
“We're worried about the markets.”
“We're going to take you to Kenya where the U.S. has dispatched some diplomatic help to try to get the country back on political balance.”
“Is individualism an endangered concept in Saudi Arabia?”
“Well, in St. John's County, one man lost his home trying to keep his pig warm.”
“The pig did not make it.”
“He had everything but the cape. A good samaritan in Ohio saved a family from this ferocious house fire.”
“A spunky boy reels in a 550-pound shark.”
… in 1000s of languages
в предстоящий осенне-зимний период в Украине ожидаются две эпидемии гриппа
(2 flu outbreaks predicted for the Ukraine)
مزيد من انفلونزا الطيور في مصر
(more flu in Egypt)
香港现1例H5N1禽流感病例曾游上海南京等地
(Hong Kong had a case of avian influenza that traveled to Shanghai and Nanjing)
Machine-learning: Relevant?
Reports (millions):
в предстоящий осенне-зимний период в Украине ожидаются две эпидемии гриппа
مزيد من انفلونزا الطيور في مصر
香港现1例H5N1禽流感病例曾游上海南京等地
Targeted machine-processing
Broad machine-processing
Human-processing
Low-volume processing
High-volume processing
Data input
“there is a new flu-like illness here”
Discovered by crawler
Relevance evaluated by machine learning
Relevance evaluated by microtasker
Information stored from the reports
Relevance evaluated by in-house analyst
Source monitoring frequency updated
Maximally relevant phrases used to search more data
Direct report from field staff / partner organization
Reports for each outbreak aggregated
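The flow above can be read as a cascade in which a cheap, high-volume machine stage filters crawled reports before the more expensive human stages see them. The following Python sketch illustrates that reading only. The keyword scorer, the threshold, the stage order, and every name in it (`Report`, `machine_relevant`, `process`) are assumptions made for illustration, not the system's actual components.

```python
from dataclasses import dataclass
from typing import Callable, List, Tuple

# Hypothetical keyword weights standing in for a trained relevance model.
KEYWORD_WEIGHTS = {"flu": 0.5, "illness": 0.3, "outbreak": 0.6}


@dataclass
class Report:
    text: str


def machine_relevant(report: Report, threshold: float = 0.5) -> bool:
    """Toy machine stage: sum the weights of keywords found in the text."""
    lowered = report.text.lower()
    score = sum(weight for kw, weight in KEYWORD_WEIGHTS.items() if kw in lowered)
    return score >= threshold


Stage = Tuple[str, Callable[[Report], bool]]


def process(report: Report, stages: List[Stage]) -> str:
    """Pass the report through each stage in order; stop at the first rejection."""
    for name, is_relevant in stages:
        if not is_relevant(report):
            return f"rejected by {name}"
    return "stored"


# Human stages are stubbed with functions that always accept (an assumption).
stages = [
    ("machine learning", machine_relevant),
    ("microtasker", lambda r: True),
    ("in-house analyst", lambda r: True),
]

print(process(Report("there is a new flu-like illness here"), stages))  # stored
print(process(Report("We're worried about the markets."), stages))  # rejected by machine learning
```

Ordering the stages from cheapest to most expensive means most irrelevant text never reaches a human reviewer, which matches the split between high-volume and low-volume processing shown earlier.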
Data structuring
• Disease (if known)
• Case counts / demographics
• Location
• Responding organizations
• Transport used
• Quotes from officials
• Changing conditions (spreading / ending)
• Public reaction
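One possible way to hold these fields is a single record per report, sketched below in Python. The class name, field names, and types are assumptions chosen to mirror the list above. The slides do not specify a schema.

```python
from dataclasses import dataclass, field, fields
from typing import List, Optional


@dataclass
class OutbreakReport:
    disease: Optional[str] = None          # None when not yet identified
    case_count: Optional[int] = None
    demographics: Optional[str] = None
    location: Optional[str] = None
    responding_organizations: List[str] = field(default_factory=list)
    transport_used: List[str] = field(default_factory=list)
    official_quotes: List[str] = field(default_factory=list)
    trend: Optional[str] = None            # e.g. "spreading" or "ending"
    public_reaction: Optional[str] = None

    def missing_fields(self) -> List[str]:
        """Names of fields that are still empty and could be sent for annotation."""
        missing = []
        for f in fields(self):
            value = getattr(self, f.name)
            if value is None or value == []:
                missing.append(f.name)
        return missing


# Example built from the Hong Kong report quoted earlier in the deck.
report = OutbreakReport(
    disease="H5N1 avian influenza",
    case_count=1,
    location="Hong Kong",
)
print(report.missing_fields())
```

Keeping every field optional reflects the "if known" qualifier above: an early report may name only a location and a symptom, with the rest filled in as later reports arrive.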
Motivations
For 600 new seeds, please answer this question:
Does this sentence refer to a disease outbreak:
“E Coli spreads to Spain, sprouts suspected”
Yes/no: __
What disease: _______
What location: _______
Virtual protecting the real
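The microtask above collects three answers per sentence: a yes/no judgment, a disease, and a location. The slides do not say how answers from several workers are combined, so the Python sketch below makes its own assumptions: a strict majority vote, ties counted as "no", and illustrative names throughout.

```python
from collections import Counter
from dataclasses import dataclass
from typing import Iterable, List, Optional


@dataclass
class Answer:
    is_outbreak: bool
    disease: Optional[str] = None
    location: Optional[str] = None


def most_common_value(values: Iterable[Optional[str]]) -> Optional[str]:
    """Most frequent non-empty value after lowercasing, or None if there is none."""
    cleaned = [v.strip().lower() for v in values if v and v.strip()]
    if not cleaned:
        return None
    return Counter(cleaned).most_common(1)[0][0]


def aggregate(answers: List[Answer]) -> Answer:
    """Combine several workers' answers for one sentence."""
    yes_votes = sum(a.is_outbreak for a in answers)
    if yes_votes * 2 <= len(answers):  # no strict majority for "yes"
        return Answer(False)
    positives = [a for a in answers if a.is_outbreak]
    return Answer(
        True,
        most_common_value(a.disease for a in positives),
        most_common_value(a.location for a in positives),
    )


# Three hypothetical answers to "E Coli spreads to Spain, sprouts suspected".
answers = [
    Answer(True, "E Coli", "Spain"),
    Answer(True, "e coli", "Spain"),
    Answer(False),
]
print(aggregate(answers))
```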
Crowdsourcing applicability
• Success:
– Language coverage
– Outbreak relatedness
– Case-counts
– Location names
– Quotes from officials
• Falling short:
– Estimating citizen unrest
– Growth predictions
Native speaker expertise
Data structuring
Data analysis
Crowdsourcing and machine-learning
• The German problem
• Bias-free seeding
• Evaluation for needle-in-haystack scenarios
– Machine and human
• Language representation
Appendix: Abstract
The first indication of a new outbreak is often in unstructured data (natural language) and reported openly in traditional or social media as a new ‘flu-like’ or ‘malaria-like’ illness weeks or months before the new pathogen is eventually isolated. We present a system for tracking these early signals globally, using natural language processing and crowdsourcing. By comparison, search-log-based approaches, while innovative and inexpensive, are often a trailing signal that follows open reports in plain language. Concentrating on discovering outbreak-related reports in big open data, we show how crowdsourced workers can create near-real-time training data for adaptive active-learning models, addressing the lack of broad-coverage training data for tracking epidemics. This is well-suited to an outbreak information-flow context, where sudden bursts of information about new diseases/locations need to be manually processed quickly at short notice.
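The abstract's idea of crowdsourced workers feeding an active-learning model can be illustrated with one round of uncertainty sampling. The Python sketch below makes several assumptions: the word-frequency model is a toy stand-in for a real classifier, `ask_crowd` is a hypothetical callback representing the microtask workers, and none of the names come from the paper.

```python
from collections import Counter
from typing import Callable, Dict, List, Tuple

Labeled = List[Tuple[str, bool]]


def train(labeled: Labeled) -> Dict[str, float]:
    """Toy model: smoothed fraction of relevant reports among those containing each word."""
    pos, total = Counter(), Counter()
    for text, relevant in labeled:
        for word in set(text.lower().split()):
            total[word] += 1
            if relevant:
                pos[word] += 1
    return {w: (pos[w] + 1) / (total[w] + 2) for w in total}


def predict(model: Dict[str, float], text: str) -> float:
    """Average score of known words; 0.5 (maximally uncertain) if no word is known."""
    scores = [model[w] for w in set(text.lower().split()) if w in model]
    return sum(scores) / len(scores) if scores else 0.5


def select_uncertain(model: Dict[str, float], pool: List[str], k: int) -> List[str]:
    """Pick the k reports whose predicted relevance is closest to 0.5."""
    return sorted(pool, key=lambda t: abs(predict(model, t) - 0.5))[:k]


def active_learning_round(
    labeled: Labeled, pool: List[str], ask_crowd: Callable[[str], bool], k: int = 1
) -> Tuple[Labeled, List[str]]:
    """Train, send the most uncertain reports to the crowd, and fold the labels back in."""
    model = train(labeled)
    batch = select_uncertain(model, pool, k)
    new_labeled = labeled + [(t, ask_crowd(t)) for t in batch]
    remaining = [t for t in pool if t not in batch]
    return new_labeled, remaining


labeled = [("new flu outbreak reported", True), ("markets fall again", False)]
pool = ["flu cases rising", "markets rally", "strange illness in village"]
# Stand-in for a human worker: here a simple rule, purely for the demo.
labeled, pool = active_learning_round(labeled, pool, lambda t: "illness" in t)
print(labeled[-1])
```

In this run the report with no known words is selected first, because the model has no evidence about it. This is the situation the abstract describes, where information about a new disease or location arrives suddenly and needs human labels quickly.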