why is it difficult to detect outbreaks in twitter?
TRANSCRIPT
![Page 1: Why Is It Difficult to Detect Outbreaks in Twitter?](https://reader037.vdocuments.us/reader037/viewer/2022103118/55c1dd03bb61eb86518b4606/html5/thumbnails/1.jpg)
Why Is It Difficult to Detect Outbreaksin Twitter?
Avaré Stewart, Nattiya Kanhabua, Sara Romano
Ernesto Diaz-Aviles, Wolf Siberski, and Wolfgang Nejdl
L3S Research Center / Leibniz Universität Hannover, Germany
SIGIR 2013 Workshop on Health Search and Discovery
1 August 2013, Dublin, Ireland
![Page 2: Why Is It Difficult to Detect Outbreaks in Twitter?](https://reader037.vdocuments.us/reader037/viewer/2022103118/55c1dd03bb61eb86518b4606/html5/thumbnails/2.jpg)
Motivation
• Numerous works use Twitter to infer the existence and magnitude of real-world events in real-time– Earthquake [Sakaki et al., 2010]– Predicting financial time series [Ruiz et al., 2012]– Influenza epidemics [Culotta, 2010; Lampos et al.,
2011; Paul et al., 2011]
![Page 3: Why Is It Difficult to Detect Outbreaks in Twitter?](https://reader037.vdocuments.us/reader037/viewer/2022103118/55c1dd03bb61eb86518b4606/html5/thumbnails/3.jpg)
Early Warnings
![Page 4: Why Is It Difficult to Detect Outbreaks in Twitter?](https://reader037.vdocuments.us/reader037/viewer/2022103118/55c1dd03bb61eb86518b4606/html5/thumbnails/4.jpg)
Health related tweets
• User status updates or news related to public health are common in Twitter– I have the mumps...am I alone?
– my baby girl has a Gastroenteritis so great!! Please do not give it to meee
– #Cholera breaks out in #Dadaab refugee camp in #Kenya http://t.co/....
– As many as 16 people have been found infected with Anthrax in Shahjadpur upazila of the Sirajganj district in Bangladesh.
![Page 5: Why Is It Difficult to Detect Outbreaks in Twitter?](https://reader037.vdocuments.us/reader037/viewer/2022103118/55c1dd03bb61eb86518b4606/html5/thumbnails/5.jpg)
Matching Tweets
[Kanhabua et al., CIKM’12]
![Page 6: Why Is It Difficult to Detect Outbreaks in Twitter?](https://reader037.vdocuments.us/reader037/viewer/2022103118/55c1dd03bb61eb86518b4606/html5/thumbnails/6.jpg)
Matching Tweets
[Kanhabua et al., CIKM’12]
![Page 7: Why Is It Difficult to Detect Outbreaks in Twitter?](https://reader037.vdocuments.us/reader037/viewer/2022103118/55c1dd03bb61eb86518b4606/html5/thumbnails/7.jpg)
Twitter vs. Official Source
![Page 8: Why Is It Difficult to Detect Outbreaks in Twitter?](https://reader037.vdocuments.us/reader037/viewer/2022103118/55c1dd03bb61eb86518b4606/html5/thumbnails/8.jpg)
M-Eco System
Medical Ecosystem: Personalized Event-based Surveillance
http://www.meco-project.eu/
![Page 9: Why Is It Difficult to Detect Outbreaks in Twitter?](https://reader037.vdocuments.us/reader037/viewer/2022103118/55c1dd03bb61eb86518b4606/html5/thumbnails/9.jpg)
Data Collection• Official outbreak reports
– ~3,000 ProMED-mail reports from 2011– WHO reports have very small coverage
• Twitter data– ~1,200 health-related terms (i.e., infectious diseases,
their synonyms, pathogens and symptoms)– Over 112 millions of tweets from 2011
• Series of NLP tools including– OpenNLP (tokenization, sentence splitting, POS tagging)– OpenCalais (named entity recognition) – HeidelTime (temporal expression extraction)
![Page 10: Why Is It Difficult to Detect Outbreaks in Twitter?](https://reader037.vdocuments.us/reader037/viewer/2022103118/55c1dd03bb61eb86518b4606/html5/thumbnails/10.jpg)
Ground Truths
[Kanhabua et al., TAIA’ 12]
![Page 11: Why Is It Difficult to Detect Outbreaks in Twitter?](https://reader037.vdocuments.us/reader037/viewer/2022103118/55c1dd03bb61eb86518b4606/html5/thumbnails/11.jpg)
Event Extraction
• An event is a sentence containing two entities– (1) medical condition and (2) geographic expression– A minimum requirement by domain experts
• A victim and the time of an event can be identified from the sentence itself, or its surrounding context
• Output: a set of event candidates
Reported by World Health Organization (WHO) on 29 July 2012 about an ongoing Ebola outbreak
in Uganda since the beginning of July 2012
[Kanhabua et al., TAIA’ 12]
![Page 12: Why Is It Difficult to Detect Outbreaks in Twitter?](https://reader037.vdocuments.us/reader037/viewer/2022103118/55c1dd03bb61eb86518b4606/html5/thumbnails/12.jpg)
Message Filtering: Challenges
• Ambiguity– having several meanings– used in different contexts
• Incompleteness– missing or under-reported events– data processing errors
![Page 13: Why Is It Difficult to Detect Outbreaks in Twitter?](https://reader037.vdocuments.us/reader037/viewer/2022103118/55c1dd03bb61eb86518b4606/html5/thumbnails/13.jpg)
Message Filtering: Challenges• Ambiguity
– having several meanings– used in different contexts
• Incompleteness– missing or under-reported events– data processing errors
Category Example tweet
Literature A two hour train journey, Love In the Time of Cholera ...
Music Dengue Fever’s “Uku,” Mixed by Paul Dreux Smith Universal Audio...
Marketing Exclusive distributor of high quality #HIV/AIDS Blood & Urine and #Hepatitis #Self -testers.
General Identification of genotype 4 Hepatitis E virus binding proteins on swine liver cells: Hepatitis E virus...
Negative i dont have sniffles and no real coughing..well its coughing but not like an influenza cough.
Joke Thought I had Bieber Fever. Ends up I just had a combo of the mumps, mono, measles & the hershey squ...
![Page 14: Why Is It Difficult to Detect Outbreaks in Twitter?](https://reader037.vdocuments.us/reader037/viewer/2022103118/55c1dd03bb61eb86518b4606/html5/thumbnails/14.jpg)
Challenge I. Noisy/evolving
• Evolving data– Relevant features changes over time
![Page 15: Why Is It Difficult to Detect Outbreaks in Twitter?](https://reader037.vdocuments.us/reader037/viewer/2022103118/55c1dd03bb61eb86518b4606/html5/thumbnails/15.jpg)
Challenge I. Noisy/evolving
![Page 16: Why Is It Difficult to Detect Outbreaks in Twitter?](https://reader037.vdocuments.us/reader037/viewer/2022103118/55c1dd03bb61eb86518b4606/html5/thumbnails/16.jpg)
Approach for Noisy Data
• MedISys1
– providing a list of negative keywords created by medical experts
• Urban Dictionary2
– a Web-based dictionary of slang, ethnic culture words or phrases
1http://medusa.jrc.it/medisys/homeedition/en/home.html2http://www.urbandictionary.com/
![Page 17: Why Is It Difficult to Detect Outbreaks in Twitter?](https://reader037.vdocuments.us/reader037/viewer/2022103118/55c1dd03bb61eb86518b4606/html5/thumbnails/17.jpg)
Approach for Noisy Data
1http://medusa.jrc.it/medisys/homeedition/en/home.html2http://www.urbandictionary.com/
![Page 18: Why Is It Difficult to Detect Outbreaks in Twitter?](https://reader037.vdocuments.us/reader037/viewer/2022103118/55c1dd03bb61eb86518b4606/html5/thumbnails/18.jpg)
[Kanhabua and Nejdl, WOW’ 13]
![Page 19: Why Is It Difficult to Detect Outbreaks in Twitter?](https://reader037.vdocuments.us/reader037/viewer/2022103118/55c1dd03bb61eb86518b4606/html5/thumbnails/19.jpg)
[Kanhabua and Nejdl, WOW’ 13]
![Page 20: Why Is It Difficult to Detect Outbreaks in Twitter?](https://reader037.vdocuments.us/reader037/viewer/2022103118/55c1dd03bb61eb86518b4606/html5/thumbnails/20.jpg)
Approach for Feature Changes
![Page 21: Why Is It Difficult to Detect Outbreaks in Twitter?](https://reader037.vdocuments.us/reader037/viewer/2022103118/55c1dd03bb61eb86518b4606/html5/thumbnails/21.jpg)
Signal Generation: Challenges
• Temporal Dynamics– seasonal infectious diseases– rare and spontaneous outbreaks
• Location Dynamics– frequency and duration– levels of prevalence or severity
![Page 22: Why Is It Difficult to Detect Outbreaks in Twitter?](https://reader037.vdocuments.us/reader037/viewer/2022103118/55c1dd03bb61eb86518b4606/html5/thumbnails/22.jpg)
Signal Generation: Challenges• Temporal Dynamics
– seasonal infectious diseases– rare and spontaneous outbreaks
• Location Dynamics– frequency and duration– levels of prevalence or severity
[Rortais et al., 2010 in Journal of Food Research International]
![Page 23: Why Is It Difficult to Detect Outbreaks in Twitter?](https://reader037.vdocuments.us/reader037/viewer/2022103118/55c1dd03bb61eb86518b4606/html5/thumbnails/23.jpg)
Signal Generation: Challenges
• Temporal Dynamics– seasonal infectious diseases– rare and spontaneous outbreaks
• Location Dynamics– frequency and duration– levels of prevalence or severity
![Page 24: Why Is It Difficult to Detect Outbreaks in Twitter?](https://reader037.vdocuments.us/reader037/viewer/2022103118/55c1dd03bb61eb86518b4606/html5/thumbnails/24.jpg)
Signal Generation: Challenges
[Emch et al., 2008 in International Journal of Health Geographics]
![Page 25: Why Is It Difficult to Detect Outbreaks in Twitter?](https://reader037.vdocuments.us/reader037/viewer/2022103118/55c1dd03bb61eb86518b4606/html5/thumbnails/25.jpg)
Outbreak Categorization
![Page 26: Why Is It Difficult to Detect Outbreaks in Twitter?](https://reader037.vdocuments.us/reader037/viewer/2022103118/55c1dd03bb61eb86518b4606/html5/thumbnails/26.jpg)
Outbreak Categorization
How to generate a reliable signal for low aggregate counts?
![Page 27: Why Is It Difficult to Detect Outbreaks in Twitter?](https://reader037.vdocuments.us/reader037/viewer/2022103118/55c1dd03bb61eb86518b4606/html5/thumbnails/27.jpg)
Approach
[Kanhabua and Nejdl, WOW’ 13]
![Page 28: Why Is It Difficult to Detect Outbreaks in Twitter?](https://reader037.vdocuments.us/reader037/viewer/2022103118/55c1dd03bb61eb86518b4606/html5/thumbnails/28.jpg)
Temporal Diversity
• Refined Jaccard Index (RDJ-index)– average Jaccard similarity of all object pairs
• Note: lower RDJ corresponds to higher diversity• Problem: “All-Pair comparison”• Solution: Estimation algorithms with probabilistic
error bound guarantees[Deng et al., CIKM’
12]
ji
ji OOJSnn
RDJ ),()1(
2
nji 1
∩ UU
Jaccard similarity
![Page 29: Why Is It Difficult to Detect Outbreaks in Twitter?](https://reader037.vdocuments.us/reader037/viewer/2022103118/55c1dd03bb61eb86518b4606/html5/thumbnails/29.jpg)
Temporal Diversity
• Refined Jaccard Index (RDJ-index)– average Jaccard similarity of all object pairs
• Note: lower RDJ corresponds to higher diversity• Problem: “All-Pair comparison”• Solution: Estimation algorithms with probabilistic
error bound guarantees[Deng et al., CIKM’
12]
ji
ji OOJSnn
RDJ ),()1(
2
nji 1
∩ UU
Jaccard similarity
(1) Top-k terms(2) Entities
![Page 30: Why Is It Difficult to Detect Outbreaks in Twitter?](https://reader037.vdocuments.us/reader037/viewer/2022103118/55c1dd03bb61eb86518b4606/html5/thumbnails/30.jpg)
Threat Assessment: Challenge• Overwhelming with the large number of tweets
![Page 31: Why Is It Difficult to Detect Outbreaks in Twitter?](https://reader037.vdocuments.us/reader037/viewer/2022103118/55c1dd03bb61eb86518b4606/html5/thumbnails/31.jpg)
Approach• Personalized Tweet Ranking for Epidemic
Intelligence– Learning to rank and recommender systems– User's context as implicit criteria for
recommendation
[Diaz-Aviles et al., WWW’ 12,Diaz-Aviles et al., ICWSM’ 12]
![Page 32: Why Is It Difficult to Detect Outbreaks in Twitter?](https://reader037.vdocuments.us/reader037/viewer/2022103118/55c1dd03bb61eb86518b4606/html5/thumbnails/32.jpg)
Approach
![Page 33: Why Is It Difficult to Detect Outbreaks in Twitter?](https://reader037.vdocuments.us/reader037/viewer/2022103118/55c1dd03bb61eb86518b4606/html5/thumbnails/33.jpg)
Signal Search Prototype
![Page 34: Why Is It Difficult to Detect Outbreaks in Twitter?](https://reader037.vdocuments.us/reader037/viewer/2022103118/55c1dd03bb61eb86518b4606/html5/thumbnails/34.jpg)
Future Work
• Real-Time Analysis of Big and Fast Social Web Streams– Scalable, efficient methods for filtering and
generating signals in real-time
– Effective methods for aggregating and visualizing information in a meaningful way
![Page 35: Why Is It Difficult to Detect Outbreaks in Twitter?](https://reader037.vdocuments.us/reader037/viewer/2022103118/55c1dd03bb61eb86518b4606/html5/thumbnails/35.jpg)
Thank [email protected]
![Page 36: Why Is It Difficult to Detect Outbreaks in Twitter?](https://reader037.vdocuments.us/reader037/viewer/2022103118/55c1dd03bb61eb86518b4606/html5/thumbnails/36.jpg)
References• [Culotta, 2010] A. Culotta. Towards detecting influenza epidemics by analyzing twitter messages. In
Proceedings of the First Workshop on Social Media Analytics (SOMA’2010), 2010.• [Diaz-Aviles et al., 2012a] E. Diaz-Aviles, A. Stewart, E. Velasco, K. Denecke, and W. Nejdl. Towards
personalized learning to rank for epidemic intelligence based on social media streams. In Proceedings of the 21st World Wide Web Conference (WWW ‘2012), 2012.
• [Diaz-Aviles et al., 2012b] E. Diaz-Aviles, A. Stewart, E. Velasco, K. Denecke, and W. Nejdl. Epidemic intelligence for the crowd, by the crowd. In Proceedings of International AAAI Conference on Weblogs and Social Media (ICWSM’2012), 2012.
• [Kanhabua et al., 2012a] N. Kanhabua, Sara Romano, and A. Stewart, Identifying Relevant Temporal Expressions for Real-world Events, In SIGIR 2012 Workshop on Time-aware Information Access (TAIA'2012), 2012.
• [Kanhabua et al., 2012b] N. Kanhabua, Sara Romano, and A. Stewart and W. Nejdl. Supporting Temporal Analytics for Health Related Events in Microblogs. In Proceedings of CIKM'2012, 2012.
• [Kanhabua and Nejdl 2013] N. Kanhabua and W. Nejdl. Understanding the Diversity of Tweets in the Time of Outbreaks. In Proceedings of the First International Web Observatory Workshop (WOW'2013) at WWW'2013, 2013.
• [Lampos et al., 2011] V. Lampos and N. Cristianini. Nowcasting events from the social web with statistical learning. ACM TIST, 3, 2011.
• [Paul et al., 2011] M. J. Paul and M. Dredze. You are what you tweet: Analyzing twitter for public health. In Proceedings of International AAAI Conference on Weblogs and Social Media (ICWSM’2011), 2011.
• [Ruiz et al., 2012] E. J. Ruiz, V. Hristidis, C. Castillo, A. Gionis, and A. Jaimes. Correlating financial time series with micro-blogging activity. In Proceedings of WSDM’2012, 2012.
• [Sakaki et al., 2010] T. Sakaki, M. Okazaki, and Y. Matsuo. Earthquake shakes twitter users: real-time event detection by social sensors. In Proceedings of WWW’2010, 2010.