After the Boom No One Tweets: Microblog-based Influenza Detection Incorporating Indirect Information
Post on 16-Jan-2017
Shoko Wakamiya (1), Yukiko Kawai (2), Eiji Aramaki (1)
(1) Nara Institute of Science and Technology, Japan
(2) Kyoto Sangyo University, Japan
Oct. 18, 2016
Twitter
Exploiting Tweeting Users as Social Sensors [Sakaki 2010, Lee 2011, Aramaki 2011]
• Various real-world phenomena can be observed
  EX) Disasters, local events, infectious diseases, etc.
• Social sensing is expected to outperform traditional means of medical reporting
Sakaki et al.: Earthquake Shakes Twitter Users. WWW (2010)
Lee, Wakamiya, Sumiya: Discovery of Unusual Regional Social Activities using Geo-tagged Microblogs, World Wide Web Special Issue on Mobile Services on the Web (2011)
Aramaki et al.: Twitter Catches The Flu: Detecting Influenza Epidemics using Twitter, EMNLP (2011)
[Figure: sensing a target event with physical sensors and social sensors. Previous: physical sensor-based and social sensor-based approaches using direct information. Proposed: social sensor-based approach using direct information and indirect information.]
Related work on Twitter-based Influenza Surveillance
Study            Target (# of areas)   Data size (million tweets)
Aramaki [16]     Japan (1 area)        300
Achrekar [27]    US (10 areas)         1.9*
Culotta [28]     US (1 area)           0.5
Kanouch [29]     Japan (1 area)        300
DeQuincy [30]    Europe (1 area)       0.14
Doan [31]        US (1 area)           24*
Szomszor [32]    Europe (1 area)       3
• Many Twitter-based disease detection/prediction systems have been developed
• Most of the systems performed low-resolution geographic analysis (country-level)
Problem (1): Imbalance of Social Sensor Distribution
• Most of the social sensors are in urban cities (Tokyo, Osaka, etc.)
• Other cities are affected by a shortage of data
[Figure: geographic distribution of influenza-related tweets in Japan. Labeled locations: Sapporo (Hokkaido), Tokyo.]
Problem (2): Gap between Social Sensors and Patients
Relation between the numbers of influenza-related tweets and patients in each prefecture
• Except for a few high-population cities, most areas have few tweets
• Some such areas have numerous influenza patients
[Figure: # of patients (axis 0 to 300,000) and # of tweets (axis 0 to 4,000) for each prefecture (area), listed from TOKYO (AREA13) to SHIMANE (AREA32).]
Exploiting Indirect Info.
Pro) Covering wider areas
Con)
• Unreliability (too noisy or too old)
  (1) My grandma in Kyoto is in bed with flu
  (2) NEWS: classes in Osaka have been closed because of the flu
  Annotations on the examples: "When?", "Already spread"
• Complex pattern
[Figure: existing vs. proposed estimation. Existing: the amount of patients is estimated from the amount of tweets containing direct info. Proposed: the amount of tweets containing indirect info. is used as well.]
Our Goal & Approach
To estimate the number of patients in each area based on the relation between human motivation to tweet and information propagation
• h1) People prefer reporting new info. and are insensitive to already-propagated info.
• h2) The degree of propagation (popularity) is correlated with the amount of indirect info.
[Figure: the amount of tweets containing direct info. and indirect info. in relation to the amount of patients. Panels: (a) Before Epidemics, (b) After Epidemics. Legend: Positive, Negative, Trapped Sensor. Labels: Direct Information, Indirect Information, Trapped sensors.]
Outline
• Background
• Goal and approach
• Construction of Twitter-based Influenza Surveillance System
• Experimental evaluation
• Discussion
• Conclusions
Twitter-based Influenza Surveillance
[Figure: system overview. Tweets pass through the NLP MODULE (P/N classifier: Positive or Negative), the LOCATION DETECTION MODULE (checks whether GPS info., profile info., or indirect info. is available, yielding direct information or indirect information), and the AGGREGATION MODULE (LINEAR MODEL, TRAP MODEL), which outputs the # of flu patients. Discarded tweets go to Trash.]
1. NLP-based Classification: patient (positive) or not (negative)
2. Location Detection: direct info. or indirect info.
3. Data Aggregation: LINEAR model or TRAP model
1. NLP-based Classification
To judge whether a given tweet is written by a patient or not
• Building the training set
  A human annotator assigned one of two labels (positive/negative) to 1,000 influenza-related tweets
  Ex) “My mother got flu today”: positive
      “I got influenza shot today”: negative
• Classifying the test set
  • SVM-based classifier
  • Bag-of-words representation
  • Polynomial kernel (d=2)
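The bag-of-words representation and the degree-2 polynomial kernel named on this slide can be illustrated with a minimal sketch. This is not the authors' classifier: it only computes the kernel value that an SVM would consume. Whitespace tokenization and the constant term `coef0 = 1` are assumptions made for illustration (Japanese tweets would need a morphological analyzer instead), and the function names are hypothetical.

```python
from collections import Counter


def bag_of_words(text):
    """Map a tweet to word counts (lowercased, whitespace tokenization)."""
    return Counter(text.lower().split())


def poly_kernel(x, y, degree=2, coef0=1.0):
    """Polynomial kernel (x . y + coef0) ** degree on sparse count vectors.

    coef0 = 1 is an assumed constant; the slide only states d = 2.
    """
    # Counter returns 0 for words missing from y, so only shared words add up.
    dot = sum(count * y[word] for word, count in x.items())
    return (dot + coef0) ** degree


# The two labeled examples from the slide.
positive = bag_of_words("My mother got flu today")
negative = bag_of_words("I got influenza shot today")

similarity = poly_kernel(positive, negative)
```

The two example tweets share only "got" and "today", which is why surface word overlap alone cannot separate them and a trained classifier is needed on top of this kernel.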
2. Location Detection
To cover wider areas by extracting indirect info. as well as direct info.
• Direct info.
  • GPS info. (GPS)
  • Profile info. (PROF)
• Indirect info. (IND)
  Location names in tweets' contents, extracted using a list of prefecture names and famous landmarks
  Ex) “My friend in Osaka caught flu”
Percentage of tweets with direct/indirect info. (7,666,201 tweets)
• GPS: 0.5%
• PROF: 26.2%
• IND: 4.7%
• No location info.
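The location detection step can be sketched as a simple cascade. The order GPS, then profile, then tweet content is an assumption read off the system diagram, and the field names (`gps_area`, `profile_area`, `text`) and the short prefecture list are hypothetical placeholders, not the authors' data schema or gazetteer.

```python
# Hypothetical excerpt of a prefecture/landmark list.
PREFECTURES = ("Tokyo", "Osaka", "Kyoto", "Hokkaido")


def detect_location(tweet):
    """Return (source, area) where source is "GPS", "PROF", "IND" or None."""
    # Direct info.: where the user is, from the geotag or the profile.
    if tweet.get("gps_area"):
        return "GPS", tweet["gps_area"]
    if tweet.get("profile_area"):
        return "PROF", tweet["profile_area"]
    # Indirect info.: a place name mentioned in the tweet content.
    text = tweet.get("text", "").lower()
    for name in PREFECTURES:
        if name.lower() in text:
            return "IND", name
    return None, None


example = detect_location({"text": "My friend in Osaka caught flu"})
```

A tweet can carry both a direct location and a mentioned place; this sketch returns only the first match, which is a simplification.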
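3. Data Aggregation
To estimate the amount of patients using different types of info.: direct info. and indirect info.
i) LINEAR model
A simple model that sums up direct info. and indirect info.
The number of patients $I_{\mathrm{linear}}(a,t)$ in area $a$ at day $t$:
$$ I_{\mathrm{linear}}(a,t) = w_{\mathrm{gps}} \cdot \mathrm{GPS}(a,t) + w_{\mathrm{prof}} \cdot \mathrm{PROF}(a,t) + w_{\mathrm{ind}} \sum_{b \in A} \mathrm{IND}(a,b,t) $$
ii) TRAP model
A model based on the LINEAR model and the human nature to tweet
The number of patients $I_{\mathrm{trap}}(a,t)$ in area $a$ at day $t$:
$$ I_{\mathrm{trap}}(a,t) = \frac{I_{\mathrm{linear}}(a,t)}{w_{\mathrm{users}} \cdot N_a - w_{\mathrm{trap}} \cdot \log(\mathrm{pop}(a,t) + 1)}, \qquad \mathrm{pop}(a,t) = \sum_{c=1}^{t} \mathrm{IND}(a,c) $$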
Concept of TRAP Model
1. People prefer a new event, and are insensitive to an already propagated event
2. The degree of propagation (popularity) is correlated with the amount of indirect info.
[Figure: (a) Before epidemics: people actively report the flu. (b) After epidemics: most people lose interest in sharing direct info. Legend: Positive, Negative, Trapped Sensor. Labels: Direct Information, Indirect Information.]
3. Data Aggregation: terms of the TRAP model
$$ I_{\mathrm{trap}}(a,t) = \frac{I_{\mathrm{linear}}(a,t)}{w_{\mathrm{users}} \cdot N_a - w_{\mathrm{trap}} \cdot \log(\mathrm{pop}(a,t) + 1)}, \qquad \mathrm{pop}(a,t) = \sum_{c=1}^{t} \mathrm{IND}(a,c) $$
• $\mathrm{pop}(a,t)$: the degree of information propagation in area $a$ during $t$ days
• $w_{\mathrm{trap}} \cdot \log(\mathrm{pop}(a,t) + 1)$: the amount of trapped sensors
• $N_a$: the amount of social sensors in area $a$
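The two aggregation models can be written down as a short sketch. The default weights (all LINEAR weights 1, 0.05 for the sensor term, 0.2 for the trap term) are the values shown on the Experiments slide; the natural logarithm, the function names, and the guard against a non-positive denominator are assumptions added for illustration.

```python
import math


def linear_estimate(gps, prof, ind_counts, w_gps=1.0, w_prof=1.0, w_ind=1.0):
    """LINEAR model: weighted sum of GPS, PROF and summed IND tweet counts."""
    return w_gps * gps + w_prof * prof + w_ind * sum(ind_counts)


def popularity(ind_daily):
    """pop(a, t): indirect mentions of area a accumulated over days 1..t."""
    return sum(ind_daily)


def trap_estimate(i_linear, n_sensors, pop, w_users=0.05, w_trap=0.2):
    """TRAP model: LINEAR estimate divided by the sensors that are not trapped."""
    # Natural log is assumed; the slides do not state the base.
    active = w_users * n_sensors - w_trap * math.log(pop + 1)
    if active <= 0:
        # Assumed guard: the formula is undefined once all sensors are trapped.
        raise ValueError("no active sensors left for this area")
    return i_linear / active


i_lin = linear_estimate(gps=2, prof=10, ind_counts=[3, 1])
before_boom = trap_estimate(i_lin, n_sensors=1000, pop=popularity([0, 0]))
after_boom = trap_estimate(i_lin, n_sensors=1000, pop=popularity([40, 60]))
```

With the same tweet counts, the estimate after the topic has propagated is larger than before, because the model assumes fewer users are still willing to report.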
Experimental Datasets
• Tweet data
  • A collection of tweets containing the keyword “I-N-FU-RU”
• Gold standard data
  • The number of patients per week for every prefecture (47 areas)
  • The data is available from the Infectious Disease Surveillance Center (IDSC)
Dataset      Duration                 # of tweets (Size)
ALL          2012/08/02-2016/01/03    7,666,201 (2.275 GB)
SEASON2012   2012/11/01-2013/05/31    1,959,610 (729.4 MB)
SEASON2013   2013/11/01-2014/05/31    501,542 (143.7 MB)*
SEASON2014   2014/11/01-2015/05/31    2,736,685 (808.2 MB)
A sample of the weekly report from IDSC: http://www.nih.go.jp/niid/ja/diseases/a/flu.html
Experiments
• Methods: BASELINE, BASELINE+PROF, LINEAR, TRAP
• Evaluation metric: Pearson correlation coefficient (high: |r| > 0.7, medium: 0.4 < |r| ≤ 0.7, low: |r| ≤ 0.4)
Group           Method                          NLP  GPS  PROF  IND
TRAP            TRAP+NLP                        ✓    ✓    ✓     ✓
                TRAP                                 ✓    ✓     ✓
LINEAR          LINEAR+NLP                      ✓    ✓    ✓     ✓
                LINEAR                               ✓    ✓     ✓
BASELINE+PROF   BASELINE+PROF+NLP (EMNLP2011)   ✓    ✓    ✓
                BASELINE+PROF                        ✓    ✓
BASELINE        BASELINE+NLP                    ✓    ✓
                BASELINE                             ✓
$$ I_{\mathrm{base}}(a,t) = \mathrm{GPS}(a,t) $$
$$ I_{\mathrm{base+prof}}(a,t) = \mathrm{GPS}(a,t) + \mathrm{PROF}(a,t) $$
$$ I_{\mathrm{linear}}(a,t) = \mathrm{GPS}(a,t) + \mathrm{PROF}(a,t) + \sum_{b \in A} \mathrm{IND}(a,b,t) $$
$$ I_{\mathrm{trap}}(a,t) = \frac{I_{\mathrm{linear}}(a,t)}{0.05 \cdot N_a - 0.2 \cdot \log(\mathrm{pop}(a,t) + 1)} $$
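The evaluation metric can be sketched in a few lines: the Pearson correlation coefficient between an estimated series and the reported patient counts, followed by the high/medium/low bands quoted on this slide. The function names are hypothetical, and how the daily estimates are aligned with the weekly IDSC counts is not covered here.

```python
import math


def pearson_r(xs, ys):
    """Pearson correlation coefficient between two equal-length series."""
    n = len(xs)
    if n != len(ys) or n < 2:
        raise ValueError("need two series of equal length >= 2")
    mean_x = sum(xs) / n
    mean_y = sum(ys) / n
    cov = sum((x - mean_x) * (y - mean_y) for x, y in zip(xs, ys))
    var_x = sum((x - mean_x) ** 2 for x in xs)
    var_y = sum((y - mean_y) ** 2 for y in ys)
    if var_x == 0 or var_y == 0:
        raise ValueError("correlation is undefined for a constant series")
    return cov / math.sqrt(var_x * var_y)


def correlation_band(r):
    """Bands from the slide: high |r| > 0.7, medium 0.4 < |r| <= 0.7, low otherwise."""
    magnitude = abs(r)
    if magnitude > 0.7:
        return "high"
    if magnitude > 0.4:
        return "medium"
    return "low"


# Illustrative numbers only, not data from the study.
estimated = [1.0, 2.0, 3.0, 4.0]
reported = [10.0, 20.0, 30.0, 40.0]
band = correlation_band(pearson_r(estimated, reported))
```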
Results (1/3)
Contribution of NLP-based Classification
• TRAP+NLP (r=0.70) is higher than TRAP (r=0.64)
• NLP classification in this domain (flu) is not hard

(a) With NLP-based classification
Target                           Method         SEASON2012  SEASON2013  SEASON2014  SEASON TOTAL
All areas                        TRAP+NLP       0.76        0.70        0.69        0.70
                                 LINEAR+NLP     0.70        0.55        0.53        0.50
                                 EMNLP2011      0.74        0.68        0.67        0.69
                                 BASELINE+NLP   0.33        0.37        0.48        0.36
High population areas (Top 10)   TRAP+NLP       0.80        0.77        0.72        0.75
                                 LINEAR+NLP     0.78        0.65        0.64        0.64
                                 EMNLP2011      0.80        0.77        0.71        0.75
                                 BASELINE+NLP   0.55        0.60        0.63        0.53
Low population areas (Top 10)    TRAP+NLP       0.75        0.66        0.71        0.69
                                 LINEAR+NLP     0.62        0.46        0.48        0.43
                                 EMNLP2011      0.70        0.61        0.65        0.64
                                 BASELINE+NLP   0.21        0.26        0.35        0.25

(b) Without NLP-based classification
Target                           Method          SEASON2012  SEASON2013  SEASON2014  SEASON TOTAL
All areas                        TRAP            0.72        0.63        0.64        0.64
                                 LINEAR          0.65        0.48        0.53        0.48
                                 BASELINE+PROF   0.69        0.59        0.66        0.64
                                 BASELINE        0.29        0.34        0.48        0.35
High population areas (Top 10)   TRAP            0.75        0.69        0.70        0.70
                                 LINEAR          0.72        0.60        0.63        0.61
                                 BASELINE+PROF   0.75        0.69        0.70        0.70
                                 BASELINE        0.44        0.56        0.63        0.50
Low population areas (Top 10)    TRAP            0.71        0.61        0.53        0.57
                                 LINEAR          0.58        0.41        0.46        0.40
                                 BASELINE+PROF   0.65        0.52        0.65        0.59
                                 BASELINE        0.20        0.23        0.35        0.25
Results (2/3)
Contribution of Indirect Info. in LINEAR Model
• LINEAR+NLP (r=0.50) is lower than BASELINE+PROF+NLP (r=0.69)
• It is difficult to detect influenza epidemics by adding indirect info. in a naïve manner
Results (3/3)
Contribution of Indirect Info. in TRAP Model
• TRAP+NLP achieved the best performance (r=0.70)
• TRAP model effectively contributes to exploitation of both direct and indirect info.
Discussion: Relation between Volume of Tweets and Performance (1/2)
High population areas
• TRAP+NLP was higher than EMNLP2011
• Top 17 high population areas exhibited high correlation (r > 0.7)
[Figure: # of tweets (axis 0 to 4,000) and correlation coefficient (axis 0.6 to 0.8) of TRAP+NLP and EMNLP2011 for each prefecture (AREA). Highlighted: TOKYO (AREA13), OSAKA (AREA27).]
Discussion: Relation between Volume of Tweets and Performance (2/2)
Other areas
• There is large variance in performance
• TRAP+NLP mostly outperforms EMNLP2011
[Figure: # of tweets (axis 0 to 4,000) and correlation coefficient (axis 0.6 to 0.8) of TRAP+NLP and EMNLP2011 for each prefecture (AREA). Highlighted: FUKUI (AREA20), AOMORI (AREA2).]
Discussion: After the Boom No One Tweets
• TRAP model outperformed the LINEAR model
  If influenza becomes a hot topic, people do not talk about it
• Similar phenomena have so far been proposed from a psychological viewpoint
  Most studies showed rapid propagation of rumors (especially bad news) and their short life
• This study attempts to handle human nature using a statistical model
  This model has sufficient room for application to additional studies
Conclusions
• Twitter-based influenza surveillance
  • Utilized indirect info. that mentions other places to cover a wider area
  • Developed the TRAP model based on information propagation and people's motivation to tweet
• Future work
  • To examine worldwide influenza surveillance
  • To establish a novel method by integrating various models for accurate prediction
  • To consider various effects related to geographic relations among areas