Page 1: Using Internet data to learn in the health domaineol/SSIIM/1617/seminars/L12... · 2016-10-27 · Using Internet data to learn in the health domain Carla Teixeira Lopes - ctl@fe.up.pt

Using Internet data to learn in the health domain

Carla Teixeira Lopes - ctl@fe.up.pt

SSIM, MIEIC, 2016/17

Based on slides from Yom-Tov et al. (2015)

Agenda

Internet data for health research

Data sources

Research works

Internet data

When should we use it for health research?

Why is it useful?

Any advantages over the data collected in the physical world?

Advantages of Internet data

Easier to collect than in the physical world

Larger sample

More trustworthy than surveys

Easier to collect

http://blog.okcupid.com/index.php/the-biggest-lies-in-online-dating/

Easier to collect

(Pelleg et al., 2012)

Larger sample

Survey problem

On a survey we depend on the quality of the answers.

Associations are hard to predict

Data sources

Web search

General social media: Twitter, Facebook, Flickr

Medical social media: eHealthMe, PatientsLikeMe

Medical Internet aggregators: HealthMap

Actively collecting data: crowdsourcing, online advertisements, online surveys

Other data: Smartphone interaction, Fitness monitors

Web search

http://www.internetlivestats.com/google-search-statistics/

Health web search

Searching for health information is the third most popular online activity (Fox, 2011); 72% of American Internet users do it (Fox and Duggan, 2013)

(Fox and Duggan, 2013)

Health web search

http://healthdecide.orcahealth.com/2012/12/10/how-health-consumers-use-the-web-infographic/

Obtaining a search log

Company

Crowdsourcing

Use other datasets

General social media

http://healthdecide.orcahealth.com/2012/12/10/how-health-consumers-use-the-web-infographic/

General social media

Small-scale data is generally available (e.g., in collated datasets or through crawling)

http://anadouglas.com/which-social-media-platform-are-you-on/

Medical social media

People gathering to discuss their specific predicament

Examples: eHealthMe, PatientsLikeMe

Truthfulness is usually high

Data availability can be a (legal) problem

Medical Internet aggregators: HealthMap

Crowdsourcing

Online advertisements

Online surveys

To validate findings

Other data

Smartphone interaction

Fitness monitors

Internet of Things (IoT)

http://healthdecide.orcahealth.com/2012/12/10/how-health-consumers-use-the-web-infographic/

Characteristics of data sources

Truthfulness
Are people providing real information?

Anonymity and usefulness
What do people say on each?
What do they feel comfortable discussing?
Personal interest (news, gossip) versus personal medical need
Real or imagined?

Metadata
Demographics, medical diagnosis, etc.

Explicit vs. implicit creation
Patient groups versus location data

Accessibility for research

Summary

| Source | Truthfulness | Anonymity and usefulness | Metadata | Creation | Accessibility for research |
|---|---|---|---|---|---|
| Web search | High | High | Rare | Implicit | Within companies or via toolbars |
| General social media | Low | Low-medium | Available | Explicit | Through hoses or scraping |
| Medical social media | Medium-High | High | Common | Explicit | Usually via scraping |
| Medical internet aggregators | High | Medium | -- | Explicit | ? |
| Smartphone interaction | High | Medium | None | Implicit | Very difficult |
| Actively collecting data | Variable | Medium | Available | Explicit | Easy – Make your own! |

(Yom-Tov et al., 2015)

Research works

Postmarket drug safety surveillance via search queries

Why?

Current postmarket drug surveillance mechanisms depend on patient reports

Hard to identify an adverse reaction that appears only after the drug has been taken for a long period

Hard to identify an adverse reaction when several medications are taken at the same time

Therefore: could we complement this process by looking at search queries?

(Yom-Tov and Gabrilovich, 2013)

Postmarket drug safety surveillance via search queries

Data
Queries submitted to the Yahoo search engine during 6 months in 2010
176 million unique users (search logs anonymized)

Drugs under investigation: 20 top-selling drugs (in the US)

Symptoms lexicon
195 symptoms from the International Statistical Classification of Diseases and Related Health Problems (ICD; WHO)
filtered by Wikipedia (http://en.wikipedia.org/wiki/List_of_medical_symptoms)
expanded with synonyms acquired through an analysis of the most frequently returned web page when a symptom formed the query

Aim
Quantify the prevalence of adverse drug reactions (ADRs) for a given drug

(Yom-Tov and Gabrilovich, 2013)

Postmarket drug safety surveillance via search queries

‘Ground truth’: reports to repositories for safety surveillance of approved drugs, mapped to the same list of symptoms

Score of drug-symptom pair

nij: how many times a symptom was searched

Day 0: the first day the user searched for a drug D
If the user has not searched for the drug, day 0 is the midpoint of their search history

(Yom-Tov and Gabrilovich, 2013)
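
The day-0 rule on this slide can be sketched in a few lines of Python. This is only an illustration under stated assumptions, not the authors' implementation: the function names, the toy query log, the substring matching, the reading of "midpoint" as halfway between a user's first and last query day, and the choice to count a symptom search on day 0 itself as "after" are all hypothetical. The sketch stops at the before/after counts and does not attempt the paper's score.

```python
from collections import Counter

def day_zero(history, drug):
    """history: list of (day, query) pairs for one user.
    Day 0 is the first day the user searched for the drug; for users who
    never searched for it, we assume the midpoint between their first and
    last query day (one possible reading of "midpoint of history")."""
    drug_days = [day for day, query in history if drug.lower() in query.lower()]
    if drug_days:
        return min(drug_days)
    days = [day for day, _ in history]
    return (min(days) + max(days)) / 2

def symptom_counts(histories, drug, symptom):
    """Count symptom searches before and after day 0, split by whether the
    user ever searched for the drug. Keys are (searched_drug, 'before'|'after')."""
    counts = Counter()
    for history in histories.values():
        d0 = day_zero(history, drug)
        searched_drug = any(drug.lower() in q.lower() for _, q in history)
        for day, query in history:
            if symptom.lower() in query.lower():
                # Assumption: a symptom search on day 0 itself counts as "after".
                when = "after" if day >= d0 else "before"
                counts[(searched_drug, when)] += 1
    return counts

# Hypothetical toy log: user id -> list of (day index, query text).
histories = {
    "u1": [(1, "headache"), (5, "DrugX dosage"), (7, "headache remedies"), (9, "headache")],
    "u2": [(0, "headache"), (10, "weather"), (20, "news")],
}
counts = symptom_counts(histories, "drugx", "headache")
```

Comparing the before and after counts for users who searched for the drug against the same counts for users who did not is what makes a drug-symptom score possible; how the counts are combined into a score is left open here.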

Postmarket drug safety surveillance via search queries

Comparison of drug-symptom scores based on query logs and ‘ground truth’

Which symptoms reduce this correlation the most? (most discordant ADRs)

Discover previously unknown ADRs that patients do not tend to report

(Yom-Tov and Gabrilovich, 2013)
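
One way to picture "which symptoms reduce this correlation the most" is a leave-one-out check: drop each symptom in turn and see how much the correlation between the two score lists improves. The sketch below assumes a Spearman rank correlation and made-up scores; the slides do not say which correlation measure or procedure the authors used, so treat the names and numbers as hypothetical.

```python
def ranks(values):
    """Average ranks (1-based), with ties sharing the mean rank."""
    order = sorted(range(len(values)), key=lambda i: values[i])
    result = [0.0] * len(values)
    i = 0
    while i < len(order):
        j = i
        while j + 1 < len(order) and values[order[j + 1]] == values[order[i]]:
            j += 1
        average = (i + j) / 2 + 1
        for k in range(i, j + 1):
            result[order[k]] = average
        i = j + 1
    return result

def pearson(x, y):
    n = len(x)
    mean_x, mean_y = sum(x) / n, sum(y) / n
    cov = sum((a - mean_x) * (b - mean_y) for a, b in zip(x, y))
    sd_x = sum((a - mean_x) ** 2 for a in x) ** 0.5
    sd_y = sum((b - mean_y) ** 2 for b in y) ** 0.5
    return cov / (sd_x * sd_y)

def spearman(x, y):
    return pearson(ranks(x), ranks(y))

def most_discordant(query_scores, report_scores):
    """Return the baseline correlation and the symptoms ordered by how much
    the correlation rises when each one is left out."""
    symptoms = sorted(query_scores)
    base = spearman([query_scores[s] for s in symptoms],
                    [report_scores[s] for s in symptoms])
    gain = {}
    for s in symptoms:
        rest = [t for t in symptoms if t != s]
        gain[s] = spearman([query_scores[t] for t in rest],
                           [report_scores[t] for t in rest]) - base
    return base, sorted(symptoms, key=gain.get, reverse=True)

# Made-up scores for one drug: query-log score vs. report-based score.
query_scores = {"nausea": 5, "headache": 4, "rash": 3, "weight gain": 6}
report_scores = {"nausea": 5, "headache": 4, "rash": 3, "weight gain": 1}
base, ranking = most_discordant(query_scores, report_scores)
```

In this toy data "weight gain" is searched often but rarely reported, so it comes out as the most discordant symptom, the kind of candidate the slide describes as a possibly unreported ADR.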

Predicting depression via social media

Mental illness is a leading cause of disability worldwide
300 million people suffer from depression (WHO, 2001)
Services for identifying and treating mental illnesses: NOT adequate
Can content from social media (Twitter) assist?

Focus on Major Depressive Disorder (MDD)
low mood
low self-esteem
loss of interest or pleasure in normally enjoyable activities

(De Choudhury et al., 2013)

Predicting depression via social media

Data set formation
Crowdsourcing a depression survey; participants share their Twitter username
Depression score determined via a formalized questionnaire (Center for Epidemiologic Studies Depression Scale; CES-D): from 0 (no symptoms) to 60

476 people
diagnosed with depression with onset between September 2011 and June 2012
agreed to monitoring of their public Twitter profile
36% with CES-D > 22 (definite depression)

Twitter feed collection: ~2.1 million tweets
depression-positive users (from onset and one year back)
depression-negative users (from survey date and one year back)

(De Choudhury et al., 2013)
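
As a small worked example of the labelling step, the sketch below sums a CES-D questionnaire and applies the cut-off quoted on the slide (CES-D > 22). It assumes the standard 20-item form with each item scored 0 to 3, which gives the 0 to 60 range mentioned above, and it assumes any reverse-scored items have already been converted. The function names are hypothetical.

```python
def cesd_score(item_scores):
    """Total CES-D score from 20 item scores, each already in 0..3
    (reverse-scored items are assumed to be converted beforehand)."""
    if len(item_scores) != 20:
        raise ValueError("expected 20 item scores")
    if any(score not in (0, 1, 2, 3) for score in item_scores):
        raise ValueError("each item score must be 0, 1, 2 or 3")
    return sum(item_scores)

def is_depression_positive(score, threshold=22):
    """Label used on the slide: a score above 22 counts as depression-positive."""
    return score > threshold

# Hypothetical respondent: ten items answered 2, ten items answered 1.
example = [2] * 10 + [1] * 10
total = cesd_score(example)
label = is_depression_positive(total)
```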

Predicting depression via social media

Examples of feature categories (overall 47)

Engagement
daily volume of tweets, proportion of @reply posts, retweets, links, question-centric posts, normalized difference between night and day posts (insomnia index)

Social network properties (ego-centric)
followers, followees, reciprocity (average number of replies of U to V divided by number of replies from V to U), graph density (edges / nodes in a user’s ego-centric graph)

Linguistic Inquiry and Word Count (LIWC - http://www.liwc.net)
features for emotion: positive/negative affect, activation, dominance
features for linguistic style: functional words, negation, adverbs, certainty

Depression lexicon
Mental health in Yahoo! Answers
Pointwise-Mutual-Information + Likelihood-ratio between ‘depress*’ and all other tokens (top 1%)
TF-IDF of these terms in Wikipedia to remove very frequent terms: 1,000 depression words

Anti-depression language
lexicon of antidepressant drug names

(De Choudhury et al., 2013)
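
Three of the features named on this slide are simple enough to sketch directly. The code below is an illustration, not the paper's feature extractor: the night window (21:00 to 06:00), the normalization by the total number of posts, the skipping of users who never replied back, and all function names are assumptions.

```python
def insomnia_index(post_hours, night_start=21, night_end=6):
    """Normalized difference between night and day posts.
    post_hours: hour of day (0..23) of each post.
    Assumption: night runs from 21:00 to 06:00 and the difference is
    divided by the total number of posts."""
    night = sum(1 for h in post_hours if h >= night_start or h < night_end)
    day = len(post_hours) - night
    total = night + day
    return (night - day) / total if total else 0.0

def reciprocity(replies, user):
    """Average, over contacts V, of (replies from user to V) divided by
    (replies from V to user). replies[(a, b)] is the number of @replies
    from a to b. Contacts who never replied are skipped (an assumption,
    to avoid dividing by zero)."""
    ratios = []
    for (sender, receiver), sent in replies.items():
        if sender == user:
            received = replies.get((receiver, user), 0)
            if received:
                ratios.append(sent / received)
    return sum(ratios) / len(ratios) if ratios else 0.0

def graph_density(edges):
    """Edges divided by nodes in the ego-centric graph, as on the slide."""
    nodes = {node for edge in edges for node in edge}
    return len(edges) / len(nodes) if nodes else 0.0

# Hypothetical data for one user "u".
hours = [23, 1, 3, 14]
replies = {("u", "v"): 4, ("v", "u"): 2, ("u", "w"): 1, ("w", "u"): 1, ("u", "x"): 3}
edges = [("u", "v"), ("u", "w"), ("v", "w")]
```

With this toy data three of four posts fall at night, so the insomnia index is positive; a user who posts mostly during the day would get a negative value.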

Predicting depression via social media

Depressive patterns
decrease in user engagement (volume and replies)
higher Negative Affect (NA)
low activation (loneliness, exhaustion, lack of energy, sleep deprivation)

Figure legend: depression class / non-depression class

(De Choudhury et al., 2013)

Predicting depression via social media

Depressive patterns
more 1st-person pronouns
fewer 3rd-person pronouns
higher use of depression terms (examples: anxiety, withdrawal, fun, play, helped, medication, side-effects, home, woman)

Figure legend: depression class / non-depression class

(De Choudhury et al., 2013)

Other works using social media

Twitter
HIV detection
Modeling influenza rates
Modeling health topics
Modeling disease spread

Flickr
Pro-anorexia and pro-recovery

Google Flu Trends
Forecasting influenza

Wikipedia
Nowcasting and forecasting diseases

Does Sustained Participation in an Online Health Community Affect Sentiment?

Large breast cancer community

Impact of different factors on post sentiment
Time since joining the community, posting activity, age, cancer stage

(Zhang et al., 2014)

Does Sustained Participation in an Online Health Community Affect Sentiment?

Dataset
breastcancer.org
291,528 posts in 31,034 threads published by 12,819 community members between May 2004 and September 2010
Metadata including user profiles were also extracted

(Zhang et al., 2014)

Automated Sentiment Analysis

Built a classifier

1,000 posts were manually annotated (positive or negative)

Does Sustained Participation in an Online Health Community Affect Sentiment?

For each post, a sentiment score (probability of post being positive) was calculated.

Significant increase in sentiment of posts through time

Different patterns for initial posts and reply posts

The factors studied (time since joining the community, posting activity, age, cancer stage) play a role

(Zhang et al., 2014)
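
To make "probability of post being positive" concrete, here is a minimal sketch of a classifier that is trained on manually labelled posts and then returns such a probability. It uses a bag-of-words Naive Bayes model with add-one smoothing, which is a stand-in chosen for brevity: the slides do not say which classifier or features the authors used, and the four training posts are invented.

```python
import math
from collections import Counter

def train(posts):
    """posts: list of (text, label) pairs, label being 'pos' or 'neg'."""
    word_counts = {"pos": Counter(), "neg": Counter()}
    doc_counts = Counter()
    for text, label in posts:
        doc_counts[label] += 1
        word_counts[label].update(text.lower().split())
    vocab = set(word_counts["pos"]) | set(word_counts["neg"])
    return word_counts, doc_counts, vocab

def sentiment_score(text, model):
    """Probability that the post is positive under the Naive Bayes model."""
    word_counts, doc_counts, vocab = model
    total_docs = sum(doc_counts.values())
    log_p = {}
    for label in ("pos", "neg"):
        total_words = sum(word_counts[label].values())
        lp = math.log(doc_counts[label] / total_docs)
        for word in text.lower().split():
            if word in vocab:  # ignore words never seen in training
                lp += math.log((word_counts[label][word] + 1)
                               / (total_words + len(vocab)))
        log_p[label] = lp
    # Normalize in a numerically stable way.
    top = max(log_p.values())
    pos = math.exp(log_p["pos"] - top)
    neg = math.exp(log_p["neg"] - top)
    return pos / (pos + neg)

# Invented annotated posts standing in for the manually labelled sample.
annotated = [
    ("feeling hopeful and grateful today", "pos"),
    ("so thankful for your support", "pos"),
    ("scared and tired of the pain", "neg"),
    ("worried about bad results", "neg"),
]
model = train(annotated)
hopeful = sentiment_score("grateful and hopeful", model)
fearful = sentiment_score("tired and scared", model)
```

Averaging such scores over a member's posts per time period is one way the trend "sentiment increases through time" could then be examined.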

A global compendium of human dengue virus occurrence

Database comprising occurrence data linked to point or polygon locations.

Goal
Generate a global risk map and associated burden estimates.

Data collection
Search for ‘dengue’ in PubMed, ISI Web of Science and ProMED
Publications between 1960 and 2012
Data from HealthMap

(Messina et al., 2014)

A global compendium of human dengue virus occurrence

Geo-positioning of the data
Location extracted from the articles
Latitudinal and longitudinal coordinates determined using Google Maps

(Messina et al., 2014)

A global compendium of human dengue virus occurrence

(Messina et al., 2014)

Tracking Flu-Related Searches on the Web for Syndromic Surveillance

“Campaign” using a keyword-triggered “sponsored link” in Google Adsense, for Canadian searchers

Keywords: “flu” or “flu symptoms”

Number of impressions roughly proportional to the number of searches containing the keywords

Daily statistics on impressions and clicks aggregated to match the time periods of the FluWatch reports.

(Eysenbach, 2006)
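
The aggregation step can be sketched as follows: daily click (or impression) counts are summed into reporting periods and then correlated with the case counts for the same periods. The sketch assumes 7-day periods and a Pearson correlation, and all numbers are invented; the function names are hypothetical and this is not the study's actual analysis code.

```python
from datetime import date, timedelta

def aggregate_weekly(daily, first_day, n_weeks):
    """Sum daily counts into consecutive 7-day periods starting at first_day.
    daily: dict mapping datetime.date -> count."""
    totals = [0] * n_weeks
    for day, count in daily.items():
        week = (day - first_day).days // 7
        if 0 <= week < n_weeks:
            totals[week] += count
    return totals

def pearson(x, y):
    n = len(x)
    mean_x, mean_y = sum(x) / n, sum(y) / n
    cov = sum((a - mean_x) * (b - mean_y) for a, b in zip(x, y))
    sd_x = sum((a - mean_x) ** 2 for a in x) ** 0.5
    sd_y = sum((b - mean_y) ** 2 for b in y) ** 0.5
    return cov / (sd_x * sd_y)

# Invented example: 21 days of clicks, the count growing by one each day.
first_day = date(2005, 1, 2)
daily_clicks = {first_day + timedelta(days=i): i for i in range(21)}
weekly_clicks = aggregate_weekly(daily_clicks, first_day, 3)

# Invented weekly case counts for the same three periods.
weekly_cases = [10, 30, 55]
correlation = pearson(weekly_clicks, weekly_cases)
```

Shifting one series by a week before correlating is a simple way to ask whether searches lead the official reports, which is the kind of question a syndromic surveillance study would care about.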

Tracking Flu-Related Searches on the Web for Syndromic Surveillance

(Eysenbach, 2006)

Measuring the impact of epidemic alerts on human mobility using cell-phone network data

Measure the impact that the alerts issued by the Mexican government had during the H1N1 flu outbreak in 2009

Mobility characterized using anonymized Call Detail Records (CDRs) traces

(Frias-Martinez et al., 2012)
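
As a rough illustration of how mobility might be characterized from CDRs, the sketch below counts the distinct cell towers each user touches per day and compares the average before and during an alert. This proxy, the record layout `(user_id, day, tower_id)`, the day indexes, and the function names are all assumptions made for the example; the slides do not describe the mobility measures the authors actually used.

```python
from collections import defaultdict

def towers_per_day(cdrs):
    """cdrs: iterable of (user_id, day, tower_id) records.
    Returns {(user_id, day): number of distinct towers used that day}."""
    visited = defaultdict(set)
    for user, day, tower in cdrs:
        visited[(user, day)].add(tower)
    return {key: len(towers) for key, towers in visited.items()}

def mean_mobility(cdrs, start, end):
    """Average number of distinct towers per user-day for start <= day <= end."""
    per_day = towers_per_day(cdrs)
    values = [n for (_, day), n in per_day.items() if start <= day <= end]
    return sum(values) / len(values) if values else 0.0

# Invented records: days 1-2 are the baseline, days 8-9 fall under the alert.
records = [
    ("a", 1, "t1"), ("a", 1, "t2"), ("a", 1, "t3"),
    ("b", 2, "t4"), ("b", 2, "t5"), ("b", 2, "t6"),
    ("a", 8, "t1"), ("a", 8, "t1"),
    ("b", 9, "t4"),
]
baseline = mean_mobility(records, 1, 2)
during_alert = mean_mobility(records, 8, 9)
```

A drop from the baseline value to the alert-period value would suggest that people moved around less while the alert was in force.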

Measuring the impact of epidemic alerts on human mobility using cell-phone network data

(Frias-Martinez et al., 2012)

How the Napa earthquake affected Bay Area sleepers

https://jawbone.com/blog/napa-earthquake-effect-on-sleep/

Topics for SSIM

The use of Wikipedia for automatic translation in the health domain

Using a set of Portuguese health queries, the goal of this work is to evaluate whether, and how well, Wikipedia can be used to automatically translate Portuguese medical expressions into English. A further goal is to compare the Wikipedia-based approaches with other well-established approaches.

Topics for SSIM

Assessing and comparing the readability of online topics

Using a set of search queries previously classified into topics, the goal of this work is to analyze and compare the readability of the initial documents retrieved with those queries.

Evaluation of query expansion approaches using the CLEF eHealth 2016 test collection

The goal of this work is to evaluate the query expansion approaches that were proposed in a previous work using a newly-formed test collection. The evaluation should focus on the relevance, understandability and credibility of the obtained results.

Topics for SSIM

The use of Data Mining to understand behaviour dynamics in online health forums: state of the art

Do a survey and write a scientific article on the use of Data Mining to understand behaviour dynamics in online health forums.

Automatic text simplification in the health domain: state of the art

Do a survey and write a scientific article on current techniques for automatic text simplification in the health domain.

References

Dan Pelleg, Elad Yom-Tov, Yoelle Maarek (2012). Can you believe an anonymous contributor? On truthfulness in Yahoo! Answers.

Elad Yom-Tov, Evgeniy Gabrilovich (2013). Postmarket Drug Surveillance Without Trial Costs: Discovery of Adverse Drug Reactions Through Large-Scale Analysis of Web Search Queries.

Elad Yom-Tov, Ingemar Cox, Vasileios Lampos (2015). Learning about health and medicine from Internet data.

Gunther Eysenbach (2006). Tracking flu-related searches on the Web for syndromic surveillance.

Jane P Messina, Oliver J Brady, David M Pigott, John S Brownstein, Anne G Hoen, Simon I Hay (2014). A global compendium of human dengue virus occurrence.

Munmun De Choudhury, Michael Gamon, Scott Counts, Eric Horvitz (2013). Predicting depression via social media.

Shaodian Zhang, Erin Bantum, Jason Owen, Noémie Elhadad (2014). Does Sustained Participation in an Online Health Community Affect Sentiment?

Susannah Fox (2011). Health Topics. Pew Internet Project.

Susannah Fox, Maeve Duggan (2013). Health Online 2013. Pew Internet Project.

Vanessa Frias-Martinez, Alberto Rubio, Enrique Frias-Martinez (2012). Measuring the impact of epidemic alerts on human mobility using cell-phone network data.