using internet data to learn in the health domaineol/ssiim/1617/seminars/l12... · 2016-10-27 ·...
Using Internet data to learn in the health domain
Carla Teixeira Lopes - [email protected], MIEIC, 2016/17
Based on slides from Yom-Tov et al. (2015)
Agenda
Internet data for health research
Data sources
Research works
Internet data
When should we use it for health research?
Why is it useful?
Any advantages over the data collected in the physical world?
Advantages of Internet data
Easier to collect than in the physical world
Larger sample
More trustworthy than surveys
Easier to collect
http://blog.okcupid.com/index.php/the-biggest-lies-in-online-dating/
Easier to collect
(Pelleg et al., 2012)
Larger sample
Survey problem
In a survey, we depend on the quality of the answers.
Associations are hard to predict
Data sources
Web search
General social media: Twitter, Facebook, Flickr
Medical social media: eHealthMe, PatientsLikeMe
Medical Internet aggregators: HealthMap
Actively collecting data: crowdsourcing, online advertisements, online surveys
Other data: Smartphone interaction, Fitness monitors
Web search
http://www.internetlivestats.com/google-search-statistics/
Health web search
Searching for health information is the third most popular online activity (Fox, 2011); 72% of American Internet users do it (Fox and Duggan, 2013)
(Fox and Duggan, 2013)
Health web search
http://healthdecide.orcahealth.com/2012/12/10/how-health-consumers-use-the-web-infographic/
Obtaining a search log
Company
Crowdsourcing
Use other datasets
General social media
http://healthdecide.orcahealth.com/2012/12/10/how-health-consumers-use-the-web-infographic/
General social media
Small-scale data is generally available (e.g., in collated datasets or through crawling)
http://anadouglas.com/which-social-media-platform-are-you-on/
Medical social media
People gathering to discuss their specific predicament
Examples: eHealthMe, PatientsLikeMe
Truthfulness is usually high
Data availability can be a (legal) problem
Medical Internet aggregators: HealthMap
Crowdsourcing
Online advertisements
Online surveys
To validate findings
Other data
Smartphone interaction
Fitness monitors
Internet of Things (IoT)
http://healthdecide.orcahealth.com/2012/12/10/how-health-consumers-use-the-web-infographic/
Characteristics of data sources
Truthfulness: Are people providing real information?
Anonymity and usefulness: What do people say on each? What do they feel comfortable discussing? Personal interest (news, gossip) versus personal medical need. Real or imagined?
Metadata: Demographics, medical diagnosis, etc.
Explicit vs. implicit creation: Patient groups versus location data
Accessibility for research
Summary
Source | Truthfulness | Anonymity and usefulness | Metadata | Creation | Accessibility for research
Web search | High | High | Rare | Implicit | Within companies or via toolbars
General social media | Low | Low-medium | Available | Explicit | Through hoses or scraping
Medical social media | Medium-High | High | Common | Explicit | Usually via scraping
Medical internet aggregators | High | Medium | -- | Explicit | ?
Smartphone interaction | High | Medium | None | Implicit | Very difficult
Actively collecting data | Variable | Medium | Available | Explicit | Easy (make your own!)
(Yom-Tov et al., 2015)
Research works
Postmarket drug safety surveillance via search queries
Why?
Current postmarket drug surveillance mechanisms depend on patient reports
Hard to identify if an adverse reaction happens after the drug is taken for a long period
Hard to identify if several medications are taken at the same time
Therefore, could we complement this process by looking at search queries?
(Yom-Tov and Gabrilovich, 2013)
Postmarket drug safety surveillance via search queries
Data
Queries submitted to the Yahoo search engine during 6 months in 2010
176 million unique users (search logs anonymized)
Drugs under investigation: 20 top-selling drugs (in the US)
Symptoms lexicon
195 symptoms from the International Statistical Classification of Diseases (ICD) and related health problems (WHO)
filtered by Wikipedia (http://en.wikipedia.org/wiki/List_of_medical_symptoms)
expanded with synonyms acquired through an analysis of the most frequently returned web page when a symptom was used as the query
Aim: quantify the prevalence of adverse drug reactions (ADRs) for a given drug
(Yom-Tov and Gabrilovich, 2013)
Postmarket drug safety surveillance via search queries
‘Ground truth’: reports to repositories for safety surveillance of approved drugs, mapped to the same list of symptoms
Score of drug-symptom pair
nij: how many times a symptom was searched
Day 0: first day the user searched for a drug D
If the user has not searched for a drug, then day 0 is the midpoint of his history
(Yom-Tov and Gabrilovich, 2013)
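The scoring formula from this slide did not survive in the transcript. As a rough illustration only, the sketch below assumes a chi-square style statistic over a 2x2 table of counts nij. The table layout, the function name chi_square_score, and the numbers are all assumptions made for this example and are not necessarily the score used by Yom-Tov and Gabrilovich (2013).

```python
def chi_square_score(n):
    """Chi-square statistic for a 2x2 table of counts n[i][j].

    Assumed layout (hypothetical, not taken from the slides):
    row 0 = users who searched for the drug, row 1 = users who did not;
    column 0 = symptom searched before day 0,
    column 1 = symptom searched on or after day 0.
    """
    total = sum(sum(row) for row in n)
    row_sums = [sum(row) for row in n]
    col_sums = [n[0][j] + n[1][j] for j in range(2)]
    score = 0.0
    for i in range(2):
        for j in range(2):
            # Count expected if drug use and timing were independent
            expected = row_sums[i] * col_sums[j] / total
            if expected > 0:
                score += (n[i][j] - expected) ** 2 / expected
    return score


# Made-up counts: drug searchers look for the symptom mostly after day 0
counts = [[10, 40], [30, 30]]
print(round(chi_square_score(counts), 2))
```

A table in which drug searchers and other users show the same before/after split gives a score of 0; the larger the score, the stronger the association between having searched for the drug and searching for the symptom afterwards.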
Postmarket drug safety surveillance via search queries
Comparison of drug-symptom scores based on query logs and ‘ground truth’
Which symptoms reduce this correlation the most? (most discordant ADRs)
discover previously unknown ADRs that patients do not tend to report
(Yom-Tov and Gabrilovich, 2013)
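One way to picture the search for the most discordant ADRs is a leave-one-out comparison: drop each symptom in turn and see how much the correlation between query-log scores and reported scores improves. The sketch below assumes Pearson correlation and made-up scores for a single drug; the function names and the choice of correlation measure are assumptions for this example, not details taken from the paper.

```python
from math import sqrt


def pearson(xs, ys):
    """Pearson correlation of two equally long lists of numbers."""
    n = len(xs)
    mean_x, mean_y = sum(xs) / n, sum(ys) / n
    cov = sum((x - mean_x) * (y - mean_y) for x, y in zip(xs, ys))
    var_x = sum((x - mean_x) ** 2 for x in xs)
    var_y = sum((y - mean_y) ** 2 for y in ys)
    return cov / sqrt(var_x * var_y)


def most_discordant(query_scores, report_scores, top=1):
    """Rank symptoms by how much the correlation rises when each is left out."""
    symptoms = sorted(query_scores)
    base = pearson([query_scores[s] for s in symptoms],
                   [report_scores[s] for s in symptoms])
    gains = {}
    for left_out in symptoms:
        rest = [s for s in symptoms if s != left_out]
        without = pearson([query_scores[s] for s in rest],
                          [report_scores[s] for s in rest])
        gains[left_out] = without - base
    return sorted(gains, key=gains.get, reverse=True)[:top]


# Made-up scores for one drug: "weight gain" is searched far more than reported
query = {"nausea": 1.0, "headache": 2.0, "rash": 3.0, "weight gain": 10.0}
reports = {"nausea": 1.0, "headache": 2.0, "rash": 3.0, "weight gain": 0.5}
print(most_discordant(query, reports))
```

In this toy data the symptom that is searched often but rarely reported comes out on top, which is the kind of candidate the slide describes as a previously unknown ADR.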
Predicting depression via social media
Mental illness: leading cause of disability worldwide
300 million people suffer from depression (WHO, 2001)
Services for identifying and treating mental illnesses: NOT adequate
Can content from social media (Twitter) assist?
Focus on Major Depressive Disorder (MDD)
low mood
low self-esteem
loss of interest or pleasure in normally enjoyable activities
(De Choudhury et al., 2013)
Predicting depression via social media
Data set formation
crowdsourcing a depression survey, share Twitter username
determine a depression score via a formalized questionnaire (Center for Epidemiologic Studies Depression Scale; CES-D):
from 0 (no symptoms) to 60
476 people
diagnosed with depression with onset between September 2011 and June 2012
agreed to have their public Twitter profile monitored
36% with CES-D > 22 (definite depression)
Twitter feed collection: ~2.1 million tweets
depression-positive users (from onset and one year back)
depression-negative users (from survey date and one year back)
(De Choudhury et al., 2013)
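The 0 to 60 range comes from the structure of the standard CES-D questionnaire: 20 items, each answered 0 to 3, with the four positively worded items (4, 8, 12 and 16) scored in reverse. The sketch below computes a score under that standard scheme and applies the cutoff quoted on the slide; the function names are chosen for this example only.

```python
REVERSED_ITEMS = {4, 8, 12, 16}  # positively worded CES-D items (1-based)


def cesd_score(answers):
    """Total CES-D score (0-60) from 20 answers, each between 0 and 3."""
    if len(answers) != 20 or any(a not in (0, 1, 2, 3) for a in answers):
        raise ValueError("expected 20 answers, each 0, 1, 2 or 3")
    return sum(3 - a if item in REVERSED_ITEMS else a
               for item, a in enumerate(answers, start=1))


def is_depression_positive(score, cutoff=22):
    """Apply the slide's threshold: CES-D > 22 means definite depression."""
    return score > cutoff


# Made-up respondent: answers 2 on every item
answers = [2] * 20
score = cesd_score(answers)
print(score, is_depression_positive(score))
```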
Predicting depression via social media
Examples of feature categories (overall 47)
Engagement
daily volume of tweets, proportion of @reply posts, retweets, links, question-centric posts, normalized difference between night and day posts (insomnia index)
Social network properties (ego-centric)
followers, followees, reciprocity (average number of replies of U to V divided by number of replies from V to U), graph density (edges / nodes in a user’s ego-centric graph)
Linguistic Inquiry and Word Count (LIWC - http://www.liwc.net)
features for emotion: positive/negative affect, activation, dominance
features for linguistic style: functional words, negation, adverbs, certainty
Depression lexicon
Mental health in Yahoo! Answers
Pointwise-Mutual-Information + Likelihood-ratio between ‘depress*’ and all other tokens (top 1%)
TF-IDF of these terms in Wikipedia to remove very frequent terms: 1,000 depression words
Anti-depression language
lexicon of antidepressant drug names
(De Choudhury et al., 2013)
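Three of the features above are simple enough to sketch directly. The code below is a minimal illustration, assuming that "night" means 21:00 to 06:00 and that the normalized difference is (night - day) / (night + day); both are assumptions for this example, as are the function names and the toy data.

```python
def insomnia_index(post_hours, night_start=21, night_end=6):
    """Normalized difference between night and day posts, in [-1, 1].

    post_hours holds the hour of day (0-23) of each post. The night window
    is an assumption made for this sketch.
    """
    if not post_hours:
        return 0.0
    night = sum(1 for h in post_hours if h >= night_start or h < night_end)
    day = len(post_hours) - night
    return (night - day) / (night + day)


def reciprocity(replies_sent, replies_received):
    """Average over contacts V of (replies U to V) / (replies V to U)."""
    ratios = [replies_sent[v] / replies_received[v]
              for v in replies_sent if replies_received.get(v, 0) > 0]
    return sum(ratios) / len(ratios) if ratios else 0.0


def graph_density(edges, nodes):
    """Edges divided by nodes in the ego-centric graph, as on the slide."""
    return len(edges) / len(nodes) if nodes else 0.0


print(insomnia_index([23, 2, 3, 14]))
print(reciprocity({"a": 2, "b": 3}, {"a": 4, "b": 3}))
print(graph_density({("u", "a"), ("u", "b"), ("a", "b")}, {"u", "a", "b"}))
```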
Predicting depression via social media
Depressive patterns
decrease in user engagement (volume and replies)
higher Negative Affect (NA)
low activation (loneliness, exhaustion, lack of energy, sleep deprivation)
[Figure legend: depression class vs. non-depression class]
(De Choudhury et al., 2013)
Predicting depression via social media
Depressive patterns
increased presence of 1st person pronouns
decreased presence of 3rd person pronouns
higher use of depression terms (examples: anxiety, withdrawal, fun, play, helped, medication, side-effects, home, woman)
[Figure legend: depression class vs. non-depression class]
(De Choudhury et al., 2013)
Other works using social media
Twitter: HIV detection, modeling influenza rates, modeling health topics, modeling disease spread
Flickr: pro-anorexia and pro-recovery
Google Flu Trends: forecasting influenza
Wikipedia: nowcasting and forecasting diseases
Does Sustained Participation in an Online Health Community Affect Sentiment?
Large breast cancer community
Impact of different factors on post sentiment: time since joining the community, posting activity, age, cancer stage
(Zhang et al., 2014)
Does Sustained Participation in an Online Health Community Affect Sentiment?
Dataset
breastcancer.org
291,528 posts in 31,034 threads published by 12,819 community members between May 2004 and September 2010
Metadata including user profiles were also extracted
(Zhang et al., 2014)
Automated Sentiment Analysis
Built a classifier
1,000 posts were manually annotated (positive or negative)
Does Sustained Participation in an Online Health Community Affect Sentiment?
For each post, a sentiment score (probability of post being positive) was calculated.
Significant increase in sentiment of posts through time
Different patterns for initial posts and reply posts
Factors play a role
(Zhang et al., 2014)
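The slides do not say which classifier was built, so the sketch below swaps in a tiny multinomial naive Bayes model purely to show how a handful of annotated posts can yield a "probability of being positive" score for a new post. The training posts, labels and function names are all made up for this example.

```python
from collections import Counter
from math import exp, log


def train(posts):
    """Count words per class from (text, label) pairs; label is 'pos' or 'neg'."""
    word_counts = {"pos": Counter(), "neg": Counter()}
    doc_counts = Counter()
    for text, label in posts:
        doc_counts[label] += 1
        word_counts[label].update(text.lower().split())
    vocab = set(word_counts["pos"]) | set(word_counts["neg"])
    return word_counts, doc_counts, vocab


def prob_positive(text, model):
    """Sentiment score: estimated probability that the post is positive."""
    word_counts, doc_counts, vocab = model
    totals = {c: sum(word_counts[c].values()) for c in ("pos", "neg")}
    log_odds = log(doc_counts["pos"] / doc_counts["neg"])
    for word in text.lower().split():
        if word not in vocab:
            continue  # words never seen in training carry no evidence
        # Laplace (add-one) smoothing avoids zero probabilities
        p_pos = (word_counts["pos"][word] + 1) / (totals["pos"] + len(vocab))
        p_neg = (word_counts["neg"][word] + 1) / (totals["neg"] + len(vocab))
        log_odds += log(p_pos / p_neg)
    return 1 / (1 + exp(-log_odds))


# Made-up stand-ins for manually annotated posts
annotated = [
    ("feeling hopeful and grateful today", "pos"),
    ("great news from my doctor", "pos"),
    ("so scared and tired", "neg"),
    ("bad news and more pain", "neg"),
]
model = train(annotated)
print(prob_positive("feeling hopeful", model) > 0.5)
print(prob_positive("so scared", model) < 0.5)
```

Averaging such scores over a member's posts at different times since joining gives the kind of sentiment-over-time curve the study analyzes.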
A global compendium of human dengue virus occurrence
Database comprising occurrence data linked to point or polygon locations.
Goal: generate a global risk map and associated burden estimates.
Data collection
Search by ‘dengue’ in PubMed, ISI Web of Science and ProMED
Publications between 1960 and 2012
Data from HealthMap
(Messina et al., 2014)
A global compendium of human dengue virus occurrence
Geo-positioning of the data
Location extracted from the articles
Latitude and longitude coordinates determined using Google Maps
(Messina et al., 2014)
A global compendium of human dengue virus occurrence
(Messina et al., 2014)
Tracking Flu-Related Searches on the Web for Syndromic Surveillance
“Campaign” using a keyword-triggered “sponsored link” in Google Adsense, for Canadian searchers
Keywords: “flu” or “flu symptoms”
Number of impressions roughly proportional to the number of searches containing the keywords
Daily statistics on impressions and clicks aggregated to match the time periods of the FluWatch reports.
(Eysenbach, 2006)
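Matching daily ad statistics to report periods is a simple bucketing step. The sketch below assumes seven-day reporting periods starting on a known date; the period length, the dates, the counts and the function name are assumptions made for this example.

```python
from collections import defaultdict
from datetime import date, timedelta


def aggregate_by_period(daily, period_start, period_days=7):
    """Sum daily (impressions, clicks) into consecutive reporting periods.

    daily maps a date to an (impressions, clicks) pair. Returns a dict
    keyed by the first day of each period, in chronological order.
    """
    totals = defaultdict(lambda: [0, 0])
    for day, (impressions, clicks) in daily.items():
        index = (day - period_start).days // period_days
        start = period_start + timedelta(days=index * period_days)
        totals[start][0] += impressions
        totals[start][1] += clicks
    return {start: tuple(pair) for start, pair in sorted(totals.items())}


# Made-up daily statistics covering eight days
first_day = date(2005, 1, 2)
daily_stats = {first_day + timedelta(days=i): (100 + i, 10) for i in range(8)}
for start, (impressions, clicks) in aggregate_by_period(daily_stats, first_day).items():
    print(start, impressions, clicks)
```

The per-period totals can then be compared with the case counts published for the same periods.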
Tracking Flu-Related Searches on the Web for Syndromic Surveillance
(Eysenbach, 2006)
Measuring the impact of epidemic alerts on human mobility using cell-phone network data
Measure the impact that the alerts issued by the Mexican government had during the H1N1 flu outbreak in 2009
Mobility characterized using anonymized Call Detail Records (CDRs) traces
(Frias-Martinez et al., 2012)
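The slides do not state which mobility measure was derived from the CDRs. As one simple possibility, the sketch below counts the distinct cell towers each user connects to per day and compares the average before and during an alert. The record layout (user, day, tower), the measure itself and the toy data are assumptions for this example, not the method of Frias-Martinez et al. (2012).

```python
from collections import defaultdict


def distinct_towers(records):
    """Number of distinct towers per (user, day).

    records is an iterable of (user_id, day, tower_id) tuples, a
    hypothetical simplification of anonymized CDR traces.
    """
    seen = defaultdict(set)
    for user, day, tower in records:
        seen[(user, day)].add(tower)
    return {key: len(towers) for key, towers in seen.items()}


def mean_mobility(records, days):
    """Average distinct-tower count over all users on the given days."""
    counts = [n for (user, day), n in distinct_towers(records).items()
              if day in days]
    return sum(counts) / len(counts) if counts else 0.0


# Made-up traces: d1 is before the alert, d2 is during it
cdrs = [
    ("u1", "d1", "A"), ("u1", "d1", "B"), ("u1", "d1", "A"),
    ("u2", "d1", "A"),
    ("u1", "d2", "A"), ("u2", "d2", "A"),
]
print(mean_mobility(cdrs, {"d1"}), mean_mobility(cdrs, {"d2"}))
```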
Measuring the impact of epidemic alerts on human mobility using cell-phone network data
(Frias-Martinez et al., 2012)
How the Napa earthquake affected Bay Area sleepers
https://jawbone.com/blog/napa-earthquake-effect-on-sleep/
Topics for SSIM
The use of Wikipedia for automatic translation in the health domain
Using a set of Portuguese health queries, the goal of this work is to evaluate whether and how well Wikipedia can be used to automatically translate Portuguese medical expressions into English. A further goal is to compare the Wikipedia approaches with other well-established approaches.
Topics for SSIM
Assessing and comparing the readability of online topics
Using a set of search queries previously classified into topics, the goal of this work is to analyze and compare the readability of the initial documents retrieved with those queries.
Evaluation of query expansion approaches using the CLEF eHealth 2016 test collection
The goal of this work is to evaluate the query expansion approaches that were proposed in a previous work using a newly-formed test collection. The evaluation should focus on the relevance, understandability and credibility of the obtained results.
Topics for SSIM
The use of Data Mining to understand behaviour dynamics in online health forums: state of the art
Do a survey and write a scientific article on the use of Data Mining to understand behaviour dynamics in online health forums.
Automatic text simplification in the health domain: state of the art
Do a survey and write a scientific article on current techniques for automatic text simplification in the health domain.
References
Dan Pelleg, Elad Yom-Tov, Yoelle Maarek (2012). Can you believe an anonymous contributor? On truthfulness in Yahoo! Answers.
Elad Yom-Tov, Evgeniy Gabrilovich (2013). Postmarket Drug Surveillance Without Trial Costs: Discovery of Adverse Drug Reactions Through Large-Scale Analysis of Web Search Queries.
Elad Yom-Tov, Ingemar Cox, Vasileios Lampos (2015). Learning about health and medicine from Internet data.
Gunther Eysenbach (2006). Tracking flu-related searches on the Web for syndromic surveillance.
Jane P Messina, Oliver J Brady, David M Pigott, John S Brownstein, Anne G Hoen, Simon I Hay (2014). A global compendium of human dengue virus occurrence.
Munmun De Choudhury, Michael Gamon, Scott Counts, Eric Horvitz (2013). Predicting depression via social media.
Shaodian Zhang, Erin Bantum, Jason Owen, Noémie Elhadad (2014). Does Sustained Participation in an Online Health Community Affect Sentiment?
Susannah Fox (2011). Health Topics. Pew Internet Project.
Susannah Fox, Maeve Duggan (2013). Health Online 2013. Pew Internet Project.
Vanessa Frias-Martinez, Alberto Rubio, Enrique Frias-Martinez (2012). Measuring the impact of epidemic alerts on human mobility using cell-phone network data.