Bringing Together the Social and Technical in Big Data Analytics: Why You Can't
Predict the Flu from Twitter, and Here's How
David A. Broniatowski, Asst. Prof., EMSE
http://www.seas.gwu.edu/~broniatowski
PUBLIC HEALTH CYCLE
Population Doctors
Surveillance
Intervention
• Traditional mechanisms
• Surveys
• Clinical visits
REQUIRES: DATA ON THE POPULATION
This data requirement has limited research
TWITTER
• Short messages (140 chars) posted to the public internet
• Content: news, conversation, pointless babble
• Huge volume
• 500 million tweets a day
WHY TWITTER?
• Huge volumes of data
• A constant stream of small updates
• Nothing like waiting in line to buy cigarettes behind a guy in a business suit buying gasoline with ten dollars in dimes
• I eat pizza too much
• I'm at Cvs Pharmacy (117th and kendall, Miami)
INFLUENZA SURVEILLANCE
• CDC has nationwide surveillance network with 2700 outpatient centers reporting
• ILI: influenza-like illness
• Cons:
• Slow (2 weeks)
• Varying levels of geographic granularity
TWITTER SURVEILLANCE
• Twitter influenza surveillance must be
• 1) Accurate in tracking ground truth
• Identify infection tweets
• 2) Effective at both municipal and national level
• Expand tweet geolocation and evaluate municipal accuracy
• 3) Predictive in real time
• Deploy previously trained system on this flu season
PIPELINE CLASSIFIERS
• Three steps using supervised machine learning+NLP
• Step 1: Identify health tweets
• Step 2: Identify flu related
• Step 3: Awareness vs. infection
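The three steps above can be read as a cascade in which each classifier only sees tweets that passed the previous one. The following is a minimal sketch of that idea, not the system described in the talk: the tiny `NaiveBayes` class, the `classify_tweet` helper, the output label names, and the toy training examples are all assumptions made for illustration, and a real pipeline would use far more data and richer NLP features.

```python
import math
from collections import Counter, defaultdict


class NaiveBayes:
    """Tiny multinomial naive Bayes over lowercase word tokens (illustrative only)."""

    def fit(self, texts, labels):
        self.label_counts = Counter(labels)
        self.word_counts = defaultdict(Counter)  # label -> word -> count
        for text, label in zip(texts, labels):
            self.word_counts[label].update(text.lower().split())
        self.vocab = {w for counts in self.word_counts.values() for w in counts}
        return self

    def predict(self, text):
        total_docs = sum(self.label_counts.values())
        best_label, best_score = None, -math.inf
        for label, n_docs in self.label_counts.items():
            counts = self.word_counts[label]
            denom = sum(counts.values()) + len(self.vocab)
            score = math.log(n_docs / total_docs)
            for word in text.lower().split():
                if word in self.vocab:  # ignore words never seen in training
                    score += math.log((counts[word] + 1) / denom)  # add-one smoothing
            if score > best_score:
                best_label, best_score = label, score
        return best_label


def classify_tweet(text, health_clf, flu_clf, infection_clf):
    """Run the three-step cascade; label names are hypothetical."""
    if not health_clf.predict(text):
        return "not_health"
    if not flu_clf.predict(text):
        return "health_other"
    return "infection" if infection_clf.predict(text) else "awareness"


# Toy training data, invented for this sketch.
# Step 1: health vs. not health (all tweets).
health_clf = NaiveBayes().fit(
    ["flu fever sick", "flu shot news", "headache doctor", "pizza dinner", "gasoline line"],
    [True, True, True, False, False],
)
# Step 2: flu related vs. other health (health tweets only).
flu_clf = NaiveBayes().fit(
    ["flu fever sick", "flu shot news", "headache doctor"],
    [True, True, False],
)
# Step 3: infection vs. awareness (flu tweets only).
infection_clf = NaiveBayes().fit(
    ["flu fever sick", "flu shot news"],
    [True, False],
)

if __name__ == "__main__":
    for tweet in ["sick with flu fever", "flu shot news today", "pizza dinner"]:
        print(tweet, "->", classify_tweet(tweet, health_clf, flu_clf, infection_clf))
```

Training each stage only on the tweets that would reach it keeps the later, harder distinctions (awareness vs. infection) from being swamped by unrelated content.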
LOCAL EFFECTIVENESS
• Current work focuses on US national flu rates
• Useful surveillance needed by region/state/city
• How can Twitter track local trends?
• Is it accurate?
• Is there enough data?
• Only about 1% of Twitter is geocoded
CARMEN (Dredze et al., 2013)
• Over 4000 known locations (countries, states, counties, cities)
• Geocoordinates only: ~1%
• Expanded locations: ~22%
• Available in Python and Java
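One way to picture how coverage could grow from ~1% to ~22% is a fallback chain: use exact geocoordinates when present, otherwise try the tweet's tagged place, otherwise try the free-text location in the user's profile. The sketch below illustrates that idea only. It does not use or reproduce Carmen's actual API; `KNOWN_LOCATIONS`, `resolve_location`, and the two sample entries are hypothetical, and the tweet is assumed to be a dict with the `coordinates`, `place`, and `user` fields of a Twitter API v1.1 tweet object.

```python
# Hypothetical lookup table standing in for a database of known locations.
KNOWN_LOCATIONS = {
    "baltimore, md": ("United States", "Maryland", "Baltimore"),
    "miami, fl": ("United States", "Florida", "Miami"),
}


def _lookup(name):
    """Normalize a free-text location name and look it up; None if unknown."""
    return KNOWN_LOCATIONS.get((name or "").strip().lower())


def resolve_location(tweet):
    """Return a dict describing the best available location, or None.

    Tries, in order: exact coordinates, the tagged place, the profile location.
    """
    coords = tweet.get("coordinates")
    if coords:
        lon, lat = coords["coordinates"]  # GeoJSON order is [longitude, latitude]
        return {"source": "coordinates", "lat": lat, "lon": lon}

    place = tweet.get("place") or {}
    match = _lookup(place.get("full_name"))
    if match:
        return {"source": "place", "location": match}

    user = tweet.get("user") or {}
    match = _lookup(user.get("location"))
    if match:
        return {"source": "profile", "location": match}

    return None


if __name__ == "__main__":
    print(resolve_location({"user": {"location": "Baltimore, MD"}}))
```

The later fallbacks trade precision for coverage: a profile location says where a user claims to live, not where a given tweet was written, which is one reason municipal accuracy has to be evaluated separately.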
SURVEILLANCE RESULTS
Pearson Correlation    2009     2011
Keywords               0.97     0.646
Flu Classifier         0.97     0.519
Google Flu Trends      0.97     0.897
Infection              0.972    0.7832
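The figures above are Pearson correlations between a Twitter-derived weekly signal and the reference surveillance series. As a reminder of what that number measures, here is a minimal sketch of the computation; the function name `pearson` and the two short weekly series are invented for illustration and are not data from the study.

```python
import math


def pearson(xs, ys):
    """Pearson correlation coefficient of two equal-length numeric sequences."""
    if len(xs) != len(ys) or len(xs) < 2:
        raise ValueError("need two sequences of equal length >= 2")
    n = len(xs)
    mean_x = sum(xs) / n
    mean_y = sum(ys) / n
    cov = sum((x - mean_x) * (y - mean_y) for x, y in zip(xs, ys))
    var_x = sum((x - mean_x) ** 2 for x in xs)
    var_y = sum((y - mean_y) ** 2 for y in ys)
    if var_x == 0 or var_y == 0:
        raise ValueError("correlation is undefined for a constant sequence")
    return cov / math.sqrt(var_x * var_y)


# Made-up weekly values, for illustration only.
weekly_tweet_signal = [1, 2, 3, 4, 5]
weekly_reference_ili = [1, 3, 2, 5, 4]

if __name__ == "__main__":
    print(round(pearson(weekly_tweet_signal, weekly_reference_ili), 3))
```

Because the coefficient only captures how well the two series move together linearly, a high value says the signal tracks the shape of the season, not that it matches the reference in absolute level.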
GOOGLE FLU TRENDS GETS IT WRONG?
Lohr, S. (2014). Google flu trends: the limits of big data. New York Times.
Pearson Correlation:
Keywords: 0.75
Infection: 0.93
• ILI counts:
• Infection: 0.88
• Keywords: 0.72
BLIND EVALUATION
2013-2014: 0.95 correlation
MOST RECENT DATA
Broniatowski, D. A., Dredze, M., Paul, M. J., & Dugas, A. (2015). Using Social Media to Perform Local Influenza Surveillance in an Inner-City Hospital: A Retrospective Observational Study. JMIR Public Health and Surveillance, 1(1), e5.
PREDICTING ACTUAL FLU IN BALTIMORE
Broniatowski, D. A., Dredze, M., Paul, M. J., & Dugas, A. (2015). Using Social Media to Perform Local Influenza Surveillance in an Inner-City Hospital: A Retrospective Observational Study. JMIR Public Health and Surveillance, 1(1), e5.
HEALTHTWEETS.ORG
HEALTHTWEETS WORLDWIDE
Some Other Projects
BIG DATA FOR GROUP DECISION MAKING: EXTRACTING SOCIAL NETWORKS FROM FDA ADVISORY PANEL
MEETING TRANSCRIPTS
(Broniatowski & Magee, 2013 American Journal of Therapeutics; Broniatowski & Magee, 2012 IEEE Signal Processing Magazine; Broniatowski & Magee, in preparation)
“GERMS ARE GERMS” AND “WHY NOT TAKE A RISK?”
MODELS AND DATA FOR RISKY DECISION MAKING IN THE ED
(Broniatowski, Klein, & Reyna, in press, Medical Decision Making; Broniatowski & Reyna, in preparation)
HOW DO WE DESIGN SYSTEMS TO USE INFORMATION FLOW TO OUR ADVANTAGE?
We would like to deepen our intuition regarding system architectures
(Broniatowski & Moses, in preparation)
QUESTIONS?
• Big data
• Influenza tracking and coupled contagion
• Group decision-making
• Individual decision-making
• Formal models
• Medical and engineering applications
• Formal and mathematical models
• Systems architecture
• Design for flexibility