After the Boom No One Tweets: Microblog-based Influenza Detection Incorporating Indirect Information

Shoko Wakamiya (1), Yukiko Kawai (2), Eiji Aramaki (1)

(1) Nara Institute of Science and Technology, Japan
(2) Kyoto Sangyo University, Japan

Oct. 18, 2016

Twitter

Exploiting Tweeting Users as Social Sensors [Sakaki 2010, Lee 2011, Aramaki 2011]

• Various real-world phenomena can be observed
  Ex) Disasters, local events, infectious diseases, etc.

• Social sensing is expected to outperform traditional means of medical reporting

Sakaki et al.: Earthquake Shakes Twitter Users. WWW (2010)
Lee, Wakamiya, Sumiya: Discovery of Unusual Regional Social Activities using Geo-tagged Microblogs. World Wide Web, Special Issue on Mobile Services on the Web (2011)
Aramaki et al.: Twitter Catches The Flu: Detecting Influenza Epidemics using Twitter. EMNLP (2011)

[Figure: Sensing a target event. Labels: physical sensor-based vs. social sensor-based; previous (direct information) vs. proposed (direct information and indirect information).]

Related Work on Twitter-based Influenza Surveillance

Study            Target (# of areas)   Data size (million tweets)
Aramaki [16]     Japan (1 area)        300
Achrekar [27]    US (10 areas)         1.9*
Culotta [28]     US (1 area)           0.5
Kanouch [29]     Japan (1 area)        300
De Quincy [30]   Europe (1 area)       0.14
Doan [31]        US (1 area)           24*
Szomszor [32]    Europe (1 area)       3

• Many Twitter-based disease detection/prediction systems have been developed
• Most of these systems performed low-resolution geographic analysis (country level)

Problem (1): Imbalance of Social Sensor Distribution

• Most of the social sensors are in urban cities (Tokyo, Osaka, etc.)
• Other cities are affected by a shortage of data

[Figure: Geographic distribution of influenza-related tweets in Japan. Labeled locations: Sapporo (Hokkaido), Tokyo.]

Problem (2): Gap between Social Sensors and Patients

Relation between the numbers of influenza-related tweets and patients in each prefecture
• Except for a few high-population cities, most areas have few tweets
• Some such areas have numerous influenza patients

[Figure: # of patients and # of tweets for each prefecture (area). Axis ticks run from 100000 to 300000; the prefecture axis starts at AREA13, AREA14, AREA12, AREA11.]


Exploiting Indirect Info.

Pro) Covering wider areas
Con)
• Unreliability (too noisy or too old)
  (1) My grandma in Kyoto is in bed with flu
  (2) NEWS: classes in Osaka have been closed because of the flu
• Complex pattern

[Figure label: Already spread]

[Figure: Existing vs. proposed sensing of a target event. Labels: the amount of tweets containing direct info., the amount of tweets containing indirect info., the amount of patients.]

Our Goal & Approach

To estimate the number of patients in each area based on the relation between human motivation to tweet and information propagation

• h1) People prefer reporting new info. and are insensitive to already-propagated info.

• h2) The degree of propagation (popularity) is correlated with the amount of indirect info.

[Figure: (a) Before epidemics, (b) After epidemics. Legend: positive, negative, and trapped sensors. Labels: direct information, indirect information, the amount of tweets containing direct info., the amount of tweets containing indirect info., the amount of patients.]

Outline

• Background
• Goal and approach
• Construction of Twitter-based Influenza Surveillance System
• Experimental evaluation
• Discussion
• Conclusions

Twitter-based Influenza Surveillance

[Figure: System overview. Tweets pass through the NLP module (P/N classifier: positive or negative), the location detection module (GPS info. and profile info. as direct information; indirect info. where available), and the aggregation module (LINEAR model, TRAP model), which outputs the # of flu patients.]

1. NLP-based Classification
   Patient (positive) or not (negative)

2. Location Detection
   Direct info. or indirect info.

3. Data Aggregation
   LINEAR model or TRAP model

1. NLP-based Classification

To judge whether a given tweet is written by a patient or not

• Building the training set
  A human annotator assigned one of two labels (positive/negative) to 1,000 influenza-related tweets
  Ex) “My mother got flu today”: positive
      “I got influenza shot today”: negative

• Classifying the test set
  • SVM-based classifier
  • Bag-of-words representation
  • Polynomial kernel (d=2)
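As a minimal sketch of the representation and kernel named above, the following builds bag-of-words vectors for the two example tweets and evaluates a degree-2 polynomial kernel between them. The whitespace tokenizer, the kernel offset `coef0 = 1`, and the function names are assumptions for illustration; the slides do not specify the tokenizer (the original tweets are Japanese) or the remaining kernel parameters, and the SVM training step itself is omitted.

```python
from collections import Counter


def bag_of_words(text):
    """Map a tweet to word counts (assumed: lowercase whitespace tokens)."""
    return Counter(text.lower().split())


def poly_kernel(x, y, degree=2, coef0=1.0):
    """Polynomial kernel (x . y + coef0) ** degree on sparse count vectors."""
    # Counter returns 0 for words missing from y, so only shared words contribute.
    dot = sum(count * y[word] for word, count in x.items())
    return (dot + coef0) ** degree


positive = bag_of_words("My mother got flu today")
negative = bag_of_words("I got influenza shot today")

# The tweets share "got" and "today", so the dot product is 2.
print(poly_kernel(positive, negative))  # 9.0
```

With d=2 the kernel implicitly scores pairs of co-occurring words, which may help separate "got flu" from "got influenza shot" even though both tweets contain "got".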

2. Location Detection

To cover wider areas by extracting indirect info. as well as direct info.

• Direct info.
  • GPS info. (GPS)
  • Profile info. (PROF)

• Indirect info. (IND)
  Location names in tweets' contents, extracted using a list of prefecture names and famous landmarks
  Ex) “My friend in Osaka caught flu”

Percentage of tweets with direct/indirect info. (7,666,201 tweets):
GPS 0.5%, PROF 26.2%, IND 4.7%; the remainder have no location info.
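The location step can be sketched as a lookup that tries GPS info. first, then profile info., then place names in the tweet content. The precedence order, the tiny gazetteer, and the tweet field names (`gps_prefecture`, `profile_location`, `text`) are all assumptions made for this illustration; the slides only say that a list of prefecture names and famous landmarks is used.

```python
# Hypothetical gazetteer: prefecture names plus one landmark, mapped to prefectures.
GAZETTEER = {
    "tokyo": "Tokyo",
    "osaka": "Osaka",
    "kyoto": "Kyoto",
    "tokyo tower": "Tokyo",
}


def find_place(text):
    """Return the prefecture of the longest gazetteer name occurring in text, else None."""
    lowered = text.lower()
    # Try longer names first so "tokyo tower" is matched before "tokyo".
    for name in sorted(GAZETTEER, key=len, reverse=True):
        if name in lowered:
            return GAZETTEER[name]
    return None


def detect_location(tweet):
    """Return (source, prefecture), where source is "GPS", "PROF", "IND", or None."""
    if tweet.get("gps_prefecture"):          # direct info.: geotag, assumed already resolved
        return "GPS", tweet["gps_prefecture"]
    place = find_place(tweet.get("profile_location", ""))
    if place:                                # direct info.: profile location
        return "PROF", place
    place = find_place(tweet.get("text", ""))
    if place:                                # indirect info.: place named in the content
        return "IND", place
    return None, None


print(detect_location({"text": "My friend in Osaka caught flu"}))  # ('IND', 'Osaka')
```

Plain substring matching is used only to keep the sketch short; a real gazetteer for Japanese text would need proper tokenization and disambiguation.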

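3. Data Aggregation

To estimate the amount of patients using different types of info.: direct info. and indirect info.

i) LINEAR Model
A simple model to sum up direct info. and indirect info.

The number of patients $I_{\mathrm{linear}}(a, t)$ in area $a$ at day $t$:

$$I_{\mathrm{linear}}(a, t) = w_{\mathrm{gps}} \cdot GPS(a, t) + w_{\mathrm{prof}} \cdot PROF(a, t) + w_{\mathrm{ind}} \sum_{b} IND(a, b, t)$$

ii) TRAP Model
A model based on the LINEAR model and the human nature to tweet

The number of patients $I_{\mathrm{trap}}(a, t)$ in area $a$ at day $t$:

$$I_{\mathrm{trap}}(a, t) = \frac{I_{\mathrm{linear}}(a, t)}{w_{\mathrm{users}} \cdot N_a - w_{\mathrm{trap}} \cdot \log(pop(a, t) + 1)}, \qquad pop(a, t) = \sum_{c} IND(a, c)$$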
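The two aggregation models above can be sketched in a few lines. This is an illustration under stated assumptions, not the authors' implementation: the function and argument names are hypothetical, the default weights (1 for the LINEAR terms, 0.05 and 0.2 for the TRAP terms) are the values that appear on the experiments slide, and the guard against a non-positive denominator is an addition of this sketch.

```python
import math


def linear_estimate(gps, prof, ind_by_source, w_gps=1.0, w_prof=1.0, w_ind=1.0):
    """I_linear(a, t): weighted sum of direct counts and indirect counts over source areas b."""
    return w_gps * gps + w_prof * prof + w_ind * sum(ind_by_source.values())


def trap_estimate(i_linear, n_users, pop, w_users=0.05, w_trap=0.2):
    """I_trap(a, t): I_linear divided by the social sensors that are not trapped."""
    active = w_users * n_users - w_trap * math.log(pop + 1)
    if active <= 0:
        # Assumption of this sketch: treat a non-positive denominator as an error.
        raise ValueError("no active sensors left; the estimate is undefined")
    return i_linear / active


# Made-up counts for one area a on one day t.
i_lin = linear_estimate(gps=2, prof=30, ind_by_source={"Osaka": 5, "Kyoto": 3})
print(i_lin)  # 40.0
print(trap_estimate(i_lin, n_users=1000, pop=0))
print(trap_estimate(i_lin, n_users=1000, pop=5000))
```

As pop(a, t) grows, the denominator shrinks and the same tweet count maps to a larger patient estimate, which is how the model compensates for people who stop tweeting once the flu is already widely discussed.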

Concept of TRAP Model

1. People prefer a new event, and are insensitive to an already propagated event
2. The degree of propagation (popularity) is correlated with the amount of indirect info.

[Figure: (a) Before epidemics, (b) After epidemics.]

(a) Before epidemics: People actively report the flu
(b) After epidemics: Most people lose interest in sharing direct info.

Terms of the TRAP model:
• $pop(a, t)$: the degree of information propagation in area $a$ during $t$ days
• $w_{\mathrm{trap}} \cdot \log(pop(a, t) + 1)$: the amount of trapped sensors
• $w_{\mathrm{users}} \cdot N_a$: the amount of social sensors in area $a$

Experimental Datasets

• Tweet data
  • A collection of tweets containing the keyword “I-N-FU-RU”

• Gold standard data
  • The number of patients per week for every prefecture (47 areas)
  • The data is available from the Infectious Disease Surveillance Center (IDSC)

Dataset      Duration                 # of tweets   Size
ALL          2012/08/02-2016/01/03    7,666,201     2.275 GB
SEASON2012   2012/11/01-2013/05/31    1,959,610     729.4 MB
SEASON2013   2013/11/01-2014/05/31    501,542       143.7 MB*
SEASON2014   2014/11/01-2015/05/31    2,736,685     808.2 MB

A sample of the weekly report from IDSC: http://www.nih.go.jp/niid/ja/diseases/a/flu.html

Experiments

• Methods: BASELINE, BASELINE+PROF, LINEAR, TRAP

• Evaluation metric: Pearson correlation coefficient (high: |r| > 0.7, medium: 0.4 < |r| ≤ 0.7, low: |r| ≤ 0.4)

Group           Method                          NLP  GPS  PROF  IND
TRAP            TRAP+NLP                        ✓    ✓    ✓     ✓
                TRAP                                 ✓    ✓     ✓
LINEAR          LINEAR+NLP                      ✓    ✓    ✓     ✓
                LINEAR                               ✓    ✓     ✓
BASELINE+PROF   BASELINE+PROF+NLP (EMNLP2011)   ✓    ✓    ✓
                BASELINE+PROF                        ✓    ✓
BASELINE        BASELINE+NLP                    ✓    ✓
                BASELINE                             ✓

$$I_{\mathrm{base}}(a, t) = GPS(a, t)$$

$$I_{\mathrm{base+prof}}(a, t) = GPS(a, t) + PROF(a, t)$$

$$I_{\mathrm{linear}}(a, t) = GPS(a, t) + PROF(a, t) + \sum_{b} IND(a, b, t)$$

$$I_{\mathrm{trap}}(a, t) = \frac{I_{\mathrm{linear}}(a, t)}{0.05 \cdot N_a - 0.2 \cdot \log(pop(a, t) + 1)}$$
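The evaluation metric can be sketched as follows: a plain Pearson correlation between estimated and reported patient counts, plus the high/medium/low banding quoted above. The function names and the sample numbers are made up for illustration, and how the estimates are aligned with the weekly gold standard is not specified here.

```python
import math


def pearson_r(xs, ys):
    """Pearson correlation coefficient of two equal-length sequences."""
    if len(xs) != len(ys) or len(xs) < 2:
        raise ValueError("need two sequences of equal length >= 2")
    mean_x = sum(xs) / len(xs)
    mean_y = sum(ys) / len(ys)
    cov = sum((x - mean_x) * (y - mean_y) for x, y in zip(xs, ys))
    var_x = sum((x - mean_x) ** 2 for x in xs)
    var_y = sum((y - mean_y) ** 2 for y in ys)
    if var_x == 0 or var_y == 0:
        raise ValueError("correlation is undefined for a constant sequence")
    return cov / math.sqrt(var_x * var_y)


def correlation_level(r):
    """Band |r| with the thresholds given for the evaluation metric."""
    strength = abs(r)
    if strength > 0.7:
        return "high"
    if strength > 0.4:
        return "medium"
    return "low"


estimated = [1.0, 2.0, 3.0, 4.0]  # made-up weekly estimates
reported = [2.0, 4.0, 6.0, 8.0]   # made-up weekly patient counts
r = pearson_r(estimated, reported)
print(r, correlation_level(r))
```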

Results (1/3)

Contribution of NLP-based Classification
• TRAP+NLP (r=0.70) is higher than TRAP (r=0.64)
• NLP classification in this domain (flu) is not hard

(a) With NLP-based classification (target: all areas)

Method          SEASON2012   SEASON2013   SEASON2014   SEASON TOTAL
TRAP+NLP        0.76         0.70         0.69         0.70
LINEAR+NLP      0.70         0.55         0.53         0.50
EMNLP2011       0.74         0.68         0.67         0.69
BASELINE+NLP    0.33         0.37         0.48         0.36

(b) Without NLP-based classification (target: all areas)

Method          SEASON2012   SEASON2013   SEASON2014   SEASON TOTAL
TRAP            0.72         0.63         0.64         0.64
LINEAR          0.65         0.48         0.53         0.48
BASELINE+PROF   0.69         0.59         0.66         0.64
BASELINE        0.29         0.34         0.48         0.35

[Table rows for high population areas (Top 10) and low population areas (Top 10): values not recoverable.]

Results (2/3)

Contribution of Indirect Info. in LINEAR Model
• LINEAR+NLP (r=0.50) is lower than BASELINE+PROF+NLP (r=0.69)
• It is difficult to detect influenza epidemics by adding indirect info. in a naïve manner

[Table: (a) With NLP-based classification, all areas.]

Results (3/3)

Contribution of Indirect Info. in TRAP Model
• TRAP+NLP achieved the best performance (r=0.70)
• The TRAP model effectively contributes to the exploitation of both direct and indirect info.

[Table: (a) With NLP-based classification, all areas.]

Discussion: Relation between Volume of Tweets and Performance (1/2)

High population areas
• TRAP+NLP was higher than EMNLP2011
• Top 17 high population areas exhibited high correlation (r > 0.7)

[Figure: # of tweets (axis 0 to 4000) and correlation of TRAP+NLP and EMNLP2011 (axis 0.6 to 0.8) for each prefecture (AREAs). Highlighted: TOKYO (AREA13), OSAKA (AREA27).]

Discussion: Relation between Volume of Tweets and Performance (2/2)

Other areas
• There is large variance in performance
• TRAP+NLP mostly outperforms EMNLP2011

[Figure: # of tweets (axis 0 to 4000) and correlation of TRAP+NLP and EMNLP2011 (axis 0.6 to 0.8) for each prefecture (AREAs). Highlighted: FUKUI (AREA20), AOMORI (AREA2).]

Discussion: After the Boom No One Tweets

• The TRAP model outperformed the LINEAR model
  If influenza becomes a hot topic, people do not talk about it

• Similar phenomena have been proposed from a psychological viewpoint
  Most studies showed that rumors (especially bad news) propagate rapidly and are short-lived

• This study attempts to handle human nature using a statistical model
  This model has sufficient room for application to additional studies

Conclusions

• Twitter-based influenza surveillance
  • Utilized indirect info. that mentions other places to cover wider areas
  • Developed the TRAP model based on information propagation and people’s motivation to tweet

• Future work
  • To examine worldwide influenza surveillance
  • To establish a novel method by integrating various models for accurate prediction
  • To consider various effects related to geographic relations among areas
